## Some inequalities for faster integer division by a constant

*** This is currently a work in progress, and TODO shows me and any readers where more work is needed.

??? is a reminder to me to check that something I believe to be true really is true.

TODO put page links to quickly get to the conditions TODO maybe a short contents table with those links TODO maybe add a brief history of developments of the theory at the end

Modern compilers – for example LLVM’s Clang and GNU’s gcc – and some libraries – for example https://libdivide.com by http://ridiculousfish.com – make integer division by a constant divisor faster by replacing a relatively slow division operation floor(n/d) by floor(n*m/2**s), where:
* floor(X) is the greatest integer less than or equal to X
* 2**s is 2 exponentiated to the power s
(ui) Given integers 1<=d<=nmax, we have integers m & s so that m/2**s is a close enough approximation to 1/d that:
(ui’) for integers 0<=n<=nmax we have:
(ui”) floor(n/d) = q = floor(n*m/2**s)

The point is that dividing by 2**s can be done with bits shifts, which are fast, so instead of a slow division operation we can use a faster multiplication and bits shifts. (This only applies if we know d in advance, and we may be dividing by d several times: while the calculation floor(n*m/2**s) is fast, it does take some time to find m & s.)

But how to find suitable m/2**s? Here we:

* Give an unusual unified approach to finding and proving sufficient conditions for m/2**s. We believe this gives short simple proofs which easily explain why the approximations work, and why they eventually fail for “too large” n.

* Derive and prove a condition which seems to be novel, and which:
** gives a simpler and faster algorithm for finding the best possible m/2**s than the algorithm currently used??? by LLVM;
** may be nearly as fast as the algorithm currently used by GCC, which usually, but not quite always, finds the best possible m/2**s.

* Suggest a modification to the algorithm currently used by GCC which makes that faster.

* List several practical and/or theoretical conditions due to others for finding m/2**s, and use the unified approach to give simple, very short, proofs of these. This list is also useful for showing the similarities and differences of these conditions. (A few of these conditions are in our opinion not useful for either actually finding m/2**s or explaining how and why the approximation works – we think none of the conditions is good at both – but we have included them because they are in the literature.)

The unified approach – an example:
(ui.n) if n>0 a sufficient condition for (ui”) is: 1/d <= m/2**s < (1 + 1/n) * 1/d
Proof:
* instead of floor(n/d) use floor(n*m/2**s)
* which is equivalent to instead using floor(n * d*m/2**s / d)
* let K = d*m/2**s so equivalently instead use floor(n*K/d)
* then it is obvious that a sufficient condition for (ui”) is n<=n*K<n+1
* so if n>0 then a sufficient condition for (ui”) is 1 <= K = d*m/2**s < 1 + 1/n
* dividing all parts of that inequality by d gives (ui.n)
We believe this is an elegant sufficient condition which clearly shows how good the approximation is to 1/d. Moreover, it can be easily modified to give a similar necessary & sufficient condition for (ui), which we do below.

This uses (with hindsight: it wasn’t found by following Jacobi’s advice!) a maxim attributed to Carl Jacobi that “man muss immer umkehren” (one must always invert), albeit he made that about finding new research topics. https://en.wikipedia.org/wiki/Carl_Gustav_Jacob_Jacobi#Scientific_contributions
For dividing n by d, instead of:
* finding conditions for changing to dividing n by (an approximation to d)
* we find conditions for changing to (dividing an approximation to n) by d
but doing it in a way in which the latter is equivalent to the former.

We believe this approach gives simpler and more obvious proofs of conditions for (ui) than those in the literature that we are aware of, and below we give those proofs.

Preliminaries:

* HD10 means chapter 10 of the second edition of “Hacker’s Delight” (published in 2012), which discusses this, and proves necessary & sufficient conditions. (References to inequalities and conditions in that chapter are like this: (27), that is the identifiers are numbers inside round brackets, which is how the late Henry S Warren wrote them in his book.)

* Here we are mostly concerned with the mathematics underlying this, not with practical implementations, although we don’t completely ignore those. So this omits some important details on how to use m/2**s once found: for these you need to read other articles and/or books like HD10. At the end of this post are some links. TODO

* The somewhat clumsy system used to label the conditions is to allow additions to be made without having to renumber, and to at least hint at a condition’s originators.

Conventions:
* all “variables” are integers except those in upper-case like E, E’, E”, H, K, etc, which can be rational or real;
* in the mathematics “/” is rational or real division, so 17/3 is not evaluated as 5 or 6; we always explicitly use floor(expression) to convert an expression to an integer. (But in any computer code examples “/” is used to mean rounding to the integer towards zero, so in code 17/3 is evaluated as 5 and -17/3 as -5.)

Definitions:
* qc = floor((nmax+1)/d)
* nc = qc*d – 1; then nmax – (d-1) <= nc <= nmax;
* or we can use the other HD10 formula for nc, and then qc = (nc+1)/d.

Measure how good the m/2**s approximation is to 1/d, and 2**s/m is to d:
* delta = m*d – 2**s, so m = (2**s + delta) / d and 2**s = m*d – delta;
* d – 2**s/m = (m*d – 2**s) / m = delta/m
* m/2**s – 1/d = (m*d – 2**s) / (d*2**s) = delta/(d*2**s)

Remarks:

* We mostly, but not always, use nmax = 2**w – 1 where w >= 1 is an integer, often 32 or 64 for bits sizes of integers. But always nmax>=d. (We could also develope the theory for nmax<d, but then it gets messy, and the algorithms for finding m/2**s are made more complicated, both of which are unnecessary because if nmax<d then floor(n/d)=0 when 0<=n<=nmax.)

* In practice m = ceiling(2**s/d), where ceiling(X) is the least integer greater than or equal to X, but that is not essential for the mathematics.

* It is also not essential for the mathematics that we use m/2**s as an approximation to 1/d, although that is what we always do in practice. For the theory we can almost always use integers m/b instead of m/2**s, where b does not have to be a power of 2. All the measures of approximation just above work just as well if we use a general b instead of 2**s.

* For the mathematics it is not even essential that m & b are integers: in almost all of it we can instead approximate 1/d by 1/(d-E), where E is zero or a smallish positive real number. Then we have:
* m/2**s = m/b = 1/(d-E) and 2**s/m = b/m = d-E
* E = delta/m

* A reason for the apparently pointless generalisation (and idiosyncrasy – we admit that using 1/(d-E) or m/b instead of m/2**s looks more than somewhat peculiar compared with probably everything else written on this and similar topics, especially when in practice we only use m/2**s) is a curiousity of this subject seems to be that ideas & theorems & proofs for seeing what’s really going on and why it works aren’t directly good for actually finding m/2**s, and that different but equivalent ideas & theorems & proofs for actually finding m/2**s are less good at easily showing what’s happening. We believe generalising makes clearer what’s really happening. (Also using m/b saves a few “**” or superscripts and makes the formulas somewhat less cluttered.)  And if a practical computer is ever built using a non-binary base with an operation equivalent to bits shifting, then we’ll be ready to implement faster division by a constant divisor!

(The Bourbaki method for mathematically capturing a lion in a desert:
The capture of a lion in the desert is a special case of a far more general problem. Formulate this problem and find necessary and sufficient conditions for its solution. The capture of a lion is now a trivial corollary of the general theory, which on no account should be written down explicitly.
(Online here, but originally #27 in “15 New Ways to Catch a Lion” on page 11 of Manifold (18, Spring 1976). Manifold was a magazine mostly, but not exclusively, for mathematics undergraduates, and was based at, but was not officially part of, the Mathematics Institute of Warwick University.) Here we follow the Bourbaki method, but with a crucial difference: we may stray from the path of allegedly true “Bourbakianism”, and explicitly apply the general theory to the actual problem!)

* For practical algorithms to find m/2**s use necessary & sufficient conditions (ui.qc), (ui.HD.nc), and sufficient conditions (ui.GM), (ui.HD.delta), (ui.Alv).

These conditions are for unsigned integers, but can be adapted for signed integers. TODO: maybe do that at the end

*** Six necessary & sufficient – and hence equivalent – conditions for (ui)

(ui.nc) For (ui) it is necessary & sufficient that: 1/d <= m/b < (1 + 1/nc) * 1/d.
In fact we prove this for 0<=n<=nc+d-1, where nc<=nmax<=nc+d-1.
Proof: as for (ui.n) above, let K = d*m/b
Necessity:
* at n=d it is necessary that d<n*K=d*K, or 1<=K=d*m/b;
* at n=nc it is necessary that nc*K<nc+1=qc*d, or K=d*m/b < 1 + 1/nc;
Sufficiency:
* nc<=K*nc<nc+1; if 0<=n<=nc multiply by n/nc, so n<=n*K<n+1, which is (ui.n);
* if d=1 then nmax=nc and we are done
* if d>=2 then for 1<=i<=d-1<=nc we have i<=i*K<i+1
so for qc*d=nc+1<=n=nc+i<=nc+d-1
we have qc*d<=n<=n*K=(nc+i)*K<nc+1+d-1+1=nc+1+d=(qc+1)*d,
so qc<=n*K/d<qc+1 and floor(n*K/d)=qc, and we are done.

This condition is not practical, but it has that simple proof, and it and similar conditions are very useful for showing how close m/b = 1/(d-E) must be to 1/d, and for comparing sufficient conditions to necessary & sufficient conditions. All other conditions used (whether equivalent, or only sufficient) are easily derived from it.

For all the remaining necessary & sufficient conditions,
from (ui.nc) 1/d<=m/b is equivalent to b<=m*d, and so to 0<=m*d-b=delta,
and also to 1/d<=1/(d-E) and d-E<=d, and so to 0<=E.

(ui.E) For (ui) it is necessary & sufficient that: 0<=delta, or 0<=E;
and: E=delta/m<1/qc, or qc*delta/m = qc*E < 1;
where instead of floor(n/d) we use floor(n*m/b)=floor(n/(d-E))=floor(n*K/d),
where K = d*m/b = d/(d-E).

Proof: from (ui.nc) m/b = 1/(d-E) < (1 + 1/nc) * 1/d
is equivalent to nc/(d-E) < (nc+1)/d = qc
and nc = qc*d-1 < qc*(d-E) = qc*d – qc*E
and so to qc*E<1. The rest follow easily.

These aren’t useful practically, but are useful as intermediate conditions, and for explaining how and why the approximations work. They may be novel: they are not in HD10 (published in 2012), or on a related website archived here (I suggest using snapshots *before* 2019): https://web.archive.org/web/*/hackersdelight.org

Condition (ui.E) is a key to seeing why suitable m/2**s work up to nc+d-1,
and why they may go wrong at nc+d = (qc+1)*d – 1.
Consider two equivalent variants for calculating q, using as an example:
nmax=23; d=10; qc=2; nc=19; E=0.4; D=d-E=9.6; K=d/(d-E)=1.041…;

* 0…8_9_9.6=1*D_10…18_19=nc_19.2=2*D_20…28_28.8=3*D_29
q=floor(n / (d-E)): this is approximating d with d-E, and is like using a measuring rod of length d-E instead of the correct length d: as we lay rods of length d units alongside rods of length d-E, the end of the latter rods drifts back from the end of the former rods, and eventually the end of the latter rods is 1 or more units (including parts of a unit) behind the end of the former, and the integer that is 1 unit behind the end of the d units rods is the least integer for which: floor(n/d)<floor(n/(d-E))

* 0…9_9.4=9*K_10…19=nc_19.8=19*K_20…28_29_29.2=28*K_30_30.2=29*K_31
q=floor((n*d/(d-E)) / d): this is approximating n with n*d/(d-E)=n*K, and is like stretching the line with integer points by a factor d/(d-E) before dividing by d, making it clear why we need nc*d/(d-E)<nc+1.

The next three conditions for (ui) use increasingly large integers.
The first two give practical algorithms for finding the best possible m/2**s: the second is in HD10 and is used??? in LLVM??? and libdivide???; the first may be novel, and is simpler and faster than the second.

(ui.qc) For (ui) it is necessary & sufficient that 0<=delta and qc*delta<m.
Proof: from (ui.E) use that E = delta/m
Or from (ui.HD.nc) use the direct proof that these two are equivalent: this isn’t a circular argument because we also prove (ui.HD.nc) directly from (ui.nc).
This gives a simple algorithm to find m/2**s which easily fits within w bits: this is because if m = ceiling(2**s / d) then qc*delta <= nmax. The code (see Annex.ui.qc) is shorter, simpler, and faster than for (ui.HD.nc). Neither the condition nor the algorithm are in HD10, and both may be novel.

(ui.HD.nc) For (ui) it is necessary & sufficient that 0<=delta and nc*delta<b.
Proof: from (ui.qc): qc*delta < m
is equivalent to (qc*d – 1) * delta = nc*delta < m*d – delta = b.
Or from (ui.nc) m/b < (1 + 1/nc) * 1/d is equivalent to nc*m/b < (nc + 1) / d
and nc*m*d < (nc + 1) * b = nc*b + b
and nc*m*d – nc*b = nc * (m*d – b) = nc*delta < b
This is HD10 (27), we believe due to Henry S Warren, but with 2**p generalised to b. This gives code which can be made to fit within w bits at the expense of some complexity because mostly nc*delta>nmax: for example see the code in HD10, which is used as the basis for code in LLVM’s Clang compiler???.

(ui.GMW) For (ui) it is necessary & sufficient that 0<=delta and nc*m<qc*b.
Proof: from (ui.qc) qc*d*m – m = nc*m < qc*d*m – qc*delta = qc*b
Basically part of step 6 of “Algorithm 1 Granlund-Montgomery-Warren” in the LKK paper???  We believe this condition doesn’t have any practical uses, because the values used in it are so large that code fitting in w bits would need to be more complicated than for (ui.HD10.nc).

(ui.HD.25) For (ui) it is necessary & sufficient that b/d <= m < (b/d)*(nc+1)/nc.
Proof: from (ui.nc) multiply all expressions by b, and rewrite (1 + 1/nc) as (nc+1)/nc.
This is HD10 (25), we believe due to Henry S Warren, with 2**p generalised to b. We believe this is only useful as an intermediate result in the HD10 proof of (ui.HD.nc).

*** Three sufficient conditions for (ui) with b = 2**s and nmax = 2**w – 1

These conditions are sufficient, but not necessary, and (ui.GM) & (ui.HD.delta) are equivalent. All give algorithms to find m/2**s.

Pedantic correction – that’s not quite correct: as implemented in GCC, XXXXX TODO this is because GCC wants, entirely reasonably, to use the same function to find m/2**s for unsigned and signed integers, and to do that it needs to exclude delta=0, which is not done by (ui.HD.delta). However, that only makes a difference if abs(d) is an integer power of 2, for which the smallest values of delta making the condition correct are 0 or d (for all other d we have the smallest values of delta are 0<delta<d), and for these GCC just uses shifts, and doesn’t need to find a multiplier m, so for all practical purposes (ui.GM) and (ui.HD.delta) *are* equivalent.

(ui.GM) For (ui) it is sufficient that 2**s <= m*d <= 2**s + 2**(s-w)
or (ui.GM’) 1/d <= m/2**s <= (1 + 1/2**w) * 1/d
Proof: from (ui.nc) use nc <= nmax < nmax+1.
This is Theorem 4.2, condition (4.3) in Granlund & Montgomery’s 1994 paper.

Note that (ui.nc) & (ui.GM’) are similar, the differences being (ui.nc) uses nc, but (ui.GM’) uses nmax+1, and in (ui.nc) the right side comparison is “<” but in (ui.GM’) it is “<=”.

HD10 says that (ui.HD.delta) – and so also (ui.GM) – gives code which is simpler than code from (ui.HD.nc), but which nearly always finds the minimal m/2**s. Condition (ui.GM’) helps explain this: nc < nmax+1, but by at most d, so for small d by not much. HD10 has an example: for w = 32 bits the smallest divisor with (ui.GM) & (ui.HD.delta) not giving the minimal m/2**s is 102,807. (TODO: show my proof of a lower bound – 2**((w+1)/2) Z??? – for d below which (ui.GM) & (ui.HD.delta) always find the same m/2**s as (ui.qc) & (ui.HD.nc).)

But: (ui.qc) gives code for the minimal m/2**s which is almost as simple as that using (ui.HD.delta), and which is at worst only a little slower than using (ui.HD.delta), and might even be slightly faster. It is, for now, an open problem how code from (ui.qc) compares in performance with code from (ui.HD.delta).

(ui.HD.delta) For (ui) it is sufficient that 0 <= delta <= 2**(s-w)
Proof: from (ui.HD.nc) use that nc <= nmax < nmax+1.
This is HD10 (30), I believe due to Henry S Warren.

Since (ui.GM) and (ui.HD.delta) are derived from equivalent conditions by using nmax+1 instead of nc, it’s arguable that in itself shows that (ui.GM) & (ui.HD.delta ) are equivalent, but we will prove it directly. From (ui.GM’) we have:
m*d*(nmax+1) <= b * ((nmax+1) + 1) so (m*d – b) * (nmax+1) <=b

The algorithm from Granlund & Montgomery (used in GCC), from their (ui.GM), is elegant, but we believe it can be improved. As currently used in GNU gcc it starts with s = w + ceiling(log2(d)), which guarantees that the inequality is true, and iterates downwards until it finds the least s for which the inequality is true. Instead we suggest start s = w + ceiling(log2(d)) – 1, and iterating downwards from there; with this it is possible that the inequality is false at the start s, but in that case using (start s) + 1 makes the inequality true, and it is easy to get the m for s+1 from the m for s; this suggested change means that in the GNU GCC compiler we can avoid using wide_int integers, which seem quite slow compared with 32 or 64 (etc) bits fixed width integers.

And if we use the (ui.qc) improvement of the HD10 code used in LLVM, then it’s possible that iterating upwards to find m/2**s may be faster than iterating downwards as is done in GCC, albeit (to repeat myself) the GCC method is an elegant algorithm. This may be especially true for relatively small divisors.

TODO give my lower bound for divisors for which (vii) & (viii) always give the minimal m/2**s

(ui.Alv) s = w + ceiling(log2(d)) and m = ceiling(2**s/d)
Proof: delta < d, so delta <= 2**ceiling(log2(d)) = 2**(s-w), which is (ui.HD.delta);
and TODO prove always 2**w <= m < 2**(w+1)
HD10 says this is due to Alverson, and that it is simpler than the HD10 methods because it doesn’t need trial and error to find s, so is more suitable for building in hardware, but that always 2**w <= m < 2**(w+1), so the code using m/2**s for division is often more complicated than that using (ui.qc) or (ui.HD.nc) or (ui.GM) or (ui.HD.delta).

(ui.RF) TODO apply idea of (ui.i) & (ui.qc) to ridiculousfish’s proof for his libdivide n+1 algorìthm && does rf prefer to be cited as rf or r_f? TODO ??? Mayve mention this may be somewhere in HD (but I now can’t find it) but not in HD10 and so not in context of faster int div, so …..)

TODO  give an account of finding qc & nc for finding m/2**s as approximations to c/d so that floor((n*c + h) / d) = floor((n*m + H) / 2**s)   TODO  thought: might we want the remainder for this without the quotient? If so could the ideas in LKK 2018 paper be used?

TODO  other similar ideas, eg 64 bits nanoseconds converted to complete seconds or minutes or hours or days, and how for one of those by doing a pre-shift for dividing by a power of 2, and then doing a multiply and add, then another shift, we can get a quotient which is either exact or is only -1 out.

TODO Algorithm 1 Granlund-Montgomery-Warren” in the LKK paper?  maybe do a bit more on the conditions in the different steps of this, for each first as d-E &/or b/m, then as m/2**s and 0 <= n < 2**W

TODO maybe add more on maximum & minimum & miscellaneous stuff, etc, in HD10