***Update – Preliminary Draft on 2014-02-11***
Note that as of 2014-02-11 none of what follows has yet been considered by anyone other than me, so anyone wishing to make any use of any of what follows should form their own judgment on the accuracy and effectiveness, etc, etc, etc, of what follows.
I apologise in advance for any bad formatting – I will fix that as soon as possible.
As a temporary fix for bad formatting, a text only file can be found here:
3. Application to Signed Integers
I have been looking into ways of making calendar calculations faster,
specifically calculations with Chronological Julian Day Numbers.
One part of that is trying to get faster division by 365 + 1/4.
That led to looking at articles and books on the subject of
faster division by integers, for example approximating 1/d by m/2**p,
and with appropriate choice of m and p we have
floor(n * m/2**p) == floor(n/d)
for a wide range of positive integers, and the better the approximation
the wider the range for which the approximation works. In these articles
the focus is effectively on the approximation error: m/2**p – 1/d.
More details on division by integers can be found in, for example:
* “Hacker’s Delight” by Henry S Warren http://www.hackersdelight.org
* online articles by ridiculous_fish
Dividing by rational numbers generalises integer division, and looking into
this it seems easier to use a different approximation error: 2**p/m – d,
which has led to an inequality which I have not found in the articles and
book (Hacker’s Delight) that I have read on this subject. This inequality
is exactly equivalent to the main inequality in “Hacker’s Delight”,
and seems to enable practical calculations of m and p in a simpler and
faster way than the other calculation methods that I am aware of.
The term “magic numbers” has multiple uses in computing. Here we use it
for numbers which enable us to transform division by an integer into
multiplication by the magic number followed by division by a power of 2,
which can be done by a right shift of bits, a calculation
which is likely to be faster than the original division.
This give details of possibly faster calculations of “magic numbers”:
my experiments on a low specification netbook show the algorithms here
calculating magic numbers about 60% faster than the equivalent algorithms
in Henry S Warren’s book “Hacker’s Delight”.
In some cases the magic numbers and pre-shifts or post-shifts are smaller
than those calculated by divide_by_constants_codegen_reference.c
found in https://raw.github.com/ridiculousfish/libdivide/master/divide_by_constants_codegen_reference.c
As an introduction we consider basic ideas from “Hacker’s Delight”,
and show that there is another way of looking at them.
As defined in “Hacker’s Delight” if W is the bit size of unsigned integers:
nc = floor(((2**W – 1) + 1) / d) * d – 1
m = ceiling(2**(W+s) / d)
delta = m*d – 2**(W+s)
m and delta can be calculated recursively.
Define: qc = floor(((2**W – 1) + 1) / d), so: nc = qc * d – 1,
and pulling a rabbit out of a hat (or quoting an Australian mathematician
on the mathematics of juggling, we make a “good guess” which means
a guess that (we think) is correct) we state the basic “qc” inequality:
qc * (2**(W+s) / m – d) > -1
This inequality is explained just below, but for now rearrange it to:
qc * (m*d – 2**(W+s)) < m, and set delta = m*d – 2**(W+s).
We have 0 <= delta <= d – 1 < d, and can prove 0 <= qc * delta < 2**W,
so we can recursively calculate delta and m,
stopping when qc * delta < m or m overflows.
Now do some more rearranging:
qc * (m*d – 2**(W+s)) < m,
qc * d * (m*d – 2**(W+s)) < m*d – 2**(W+s) + 2**(W+s),
(qc * d – 1) * (m*d – 2**(W+s)) < 2**(W+s),
nc * (m*d – 2**(W+s)) < 2**(W+s).
which is the "nc" inequality given in "Hacker's Delight".
This derivation works equally well in the reverse direction,
so the "qc" and "nc" inequalities are exactly equivalent.
Essentially "Hacker's Delight" (magicu) uses the approximation error:
m/2**(W+s) – 1/d == delta / (d * 2**(W+s))
and asks for which nc with mod(nc, d) == d-1
does floor(nc * m/2**(W+s)) == floor(n / d),
which leads to this "nc" inequality: nc * (m/2**(W+s) – 1/d) < 1/d,
which can be rearranged to: nc * (m*d – 2**(W+s)) = d-1,
then for 0 <= n <= nc + d-1:
floor(n * m/2**(W+s)) == floor(n/d)
A faster alternative (magicu2), which sometimes gives sub-optimal m, p,
is rearrange: nc * delta < 2**(W+s) to: delta < 2**(W+s) / nc,
and now because nc -1,
which leads to this “qc” inequality: qc * (2**(W+s)/m – d) > -1,
which can be rearranged to: qc * (m*d – 2**(W+s)) 0,
then for 0 <= n <= (qc + 1) * d – 2:
floor(n * m/2**(W+s)) == floor(n/d)
The "qc" inequality is not, as far as I can see, in the Second Edition
of Henry S Warren's "Hacker's Delight", so I am assuming that
one of the following is true of the "qc" inequality:
* it is not known, or at least it is not well known;
* it is not known, but is not useful;
* it is known, and is not useful.
I would like to have found where magic numbers are calculated in compilers
(gcc, etc) to see how it is done there, but my knowledge of those is very
limited, and my admittedly cursory search in the gcc code drew a blank.
3. Application to Signed Integers
Here we applied the ideas to unsigned integer arithmetic, but they also
apply to signed integer arithmetic. In particular, applying these ideas
we can calculate signed integer magic numbers using only signed integers
in a straightforward way: we can *easily* avoid using unsigned integers
to calculate signed integer magic numbers. (I stress easily, because
we can of course mimic unsigned integers with signed integers, provided
we use some reasonably complicated code.) So for languages which have
fixed size signed integers, but not fixed size unsigned integers
(for example Java before JDK8?), I believe there is a straightforward way
to calculate magic numbers using only those fixed size signed integers.
Timings for calculating 32_bit unsigned integer magic numbers
for divisors in a for loop from 3 upto 10**7, excluding powers of 2.
The calculations were made on a low specification netbook
running Microsoft Windows 7 Starter Edition,
with the programs compiled using MinGW version gcc (GCC) 4.8.1.
Calculation time taken
loop only 0.3 seconds, for loop only, no magic number calculations;
magicu 8.3 seconds, Hacker's Delight using: nc*delta < 2**(W+s);
magicu2 6.0 seconds, Hacker's Delight using: delta < 2**s;
magicu2v 3.4 seconds, rewritten magicu2 using: delta < 2**s;
magicqu 3.6 seconds, Suliram using: qc*delta < m;
So magicqu is about 6% slower than magicu2v,
but calculates the same magic numbers and shifts as magicu,
that is it calculates the smallest possible magic numbers.
The code for magicu2v and magicqu is almost identical: the only
before loop: twopows = 2 qc = m – 1
in loop: delta <= twopows qc * delta < m
in loop: twopows = 2*twopows nothing
or twopows = twopows << 1
which explains why magicqu seems to be only a little slower than magicu2v?
magicu2v and magicqu have been tested by me against magicu2 and magicu:
* for all 16-bit divisors;
* for a large-ish sample of 32-bit divisors.
I have not yet asked anyone else to look at either the theory or this
implementation, so it is possible that there are mistakes in my proofs
of the theory and/or in my implementation. That said, I have gone over
the proofs (and implementation) several times, so I am very confident
of at least the theory behind magicqu,
in particular that the "qc" inequality used in magicqu
is exactly equivalent to the "nc" inequality used in magicu.
***But experience shows that proofs need to be checked by other people.***
I have made versions of divide_by_constants_codegen_reference.c using
the ideas in magicu2v and magicqu. I have not yet run timings on these,
because I believe that the reasons why magicu2v and magicqu seem to be
faster than magicu2 and magicu will also apply to these.
That may not be quite true: it should be true for when the magic number
does not overflow or is for the ridiculous_fish round-down "n+1" method.
But where we use a pre-shift, instead of taking all powers of 2 out of d,
and calculating the magic number for the reduced d, my version starts with
the overflowed m (or to be precise, the m just before it would overflow),
and than recurses downwards, reducing d by a factor of 2 at each recursion
and then seeing if we can further reduce m before we again reduce 2,
stopping when d becomes odd or the post-shift has become 0.
My version sometimes gives smaller m and shifts than the RF version.
For example, for 32_bits and d = 1440 both versions have post-shift 0,
but the RF version gives pre-shift 5 and magic number 95443718,
whereas my version gives pre-shift 4 and magic number 47721859.
1440 = 24*60 is a distinctly possible divisor: it is the minutes in a day.
But here my version is working in a different way, so it may or may not
be faster, and that is something that I will be investigating shortly.
***** original post on 2014-01-23 *****
Hacker';s Delight by Henry S Warren, Chapter 10, second edition
nc = floor((nmax + 1) / d) * d – 1
(5) m = (2**p + d – rem(2**p, d)) / d
(6) nc * (d – rem(2**p, d)) < 2**p
Using (5) we can rearrange (6) as follows
(nc + 1) * (d – rem(2**p, d)) < 2**p + d – rem(2**p, d) == m * d
and writing qc = floor((nmax + 1) / d)
we have nc + 1 == qc * d, so
qc * d * (d – rem(2**p, d)) < m * d
qc * (d – rem(2**p, d)) < m
We know that either m is a valid W-bit integer or m is a (W+1)-bit integer, and we can prove that for d not a power of 2
qc * (d – rem(2**p, d)) <= nmax
so we can write code for magic number calculations which:
* for W-bit signed integers uses uncomplicated code which uses W-bit signed integer variables;
* for W-bit unsigned integers uses uncomplicated code which uses W-bit unsigned integer variables.
(In fact, because we can mimic W-bit unsigned integers using W-bit signed integers, we could write code for W-bit unsigned integer magic numbers using code written with W-bit signed integer variables, but if you want unsigned integer magic numbers, then you probably have access to unsigned integer variables, and so don't need the more complicated signed integer version of the code.)