***Update – Preliminary Draft on 2014-02-11***

Note that as of 2014-02-11 none of what follows has yet been considered by anyone other than me, so anyone wishing to make any use of any of what follows should form their own judgment on the accuracy and effectiveness, etc, etc, etc, of what follows.

I apologise in advance for any bad formatting – I will fix that as soon as possible.

As a temporary fix for bad formatting, a text only file can be found here:

http://pastie.org/8724156

1. Introduction

2. Details

3. Application to Signed Integers

4. Timings

5. Testing

6. divide_by_constants_codegen_reference.c

1. Introduction

===============

I have been looking into ways of making calendar calculations faster,

specifically calculations with Chronological Julian Day Numbers.

One part of that is trying to get faster division by 365 + 1/4.

That led to looking at articles and books on the subject of

faster division by integers, for example approximating 1/d by m/2**p,

and with appropriate choice of m and p we have

floor(n * m/2**p) == floor(n/d)

for a wide range of positive integers, and the better the approximation

the wider the range for which the approximation works. In these articles

the focus is effectively on the approximation error: m/2**p – 1/d.

More details on division by integers can be found in, for example:

* “Hacker’s Delight” by Henry S Warren http://www.hackersdelight.org

* online articles by ridiculous_fish

http://www.ridiculousfish.com/blog/posts/labor-of-division-episode-i.html

http://www.ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html

Dividing by rational numbers generalises integer division, and looking into

this it seems easier to use a different approximation error: 2**p/m – d,

which has led to an inequality which I have not found in the articles and

book (Hacker’s Delight) that I have read on this subject. This inequality

is exactly equivalent to the main inequality in “Hacker’s Delight”,

and seems to enable practical calculations of m and p in a simpler and

faster way than the other calculation methods that I am aware of.

2. Details

==========

The term “magic numbers” has multiple uses in computing. Here we use it

for numbers which enable us to transform division by an integer into

multiplication by the magic number followed by division by a power of 2,

which can be done by a right shift of bits, a calculation

which is likely to be faster than the original division.

This give details of possibly faster calculations of “magic numbers”:

my experiments on a low specification netbook show the algorithms here

calculating magic numbers about 60% faster than the equivalent algorithms

in Henry S Warren’s book “Hacker’s Delight”.

In some cases the magic numbers and pre-shifts or post-shifts are smaller

than those calculated by divide_by_constants_codegen_reference.c

found in https://raw.github.com/ridiculousfish/libdivide/master/divide_by_constants_codegen_reference.c

As an introduction we consider basic ideas from “Hacker’s Delight”,

and show that there is another way of looking at them.

As defined in “Hacker’s Delight” if W is the bit size of unsigned integers:

nc = floor(((2**W – 1) + 1) / d) * d – 1

m = ceiling(2**(W+s) / d)

delta = m*d – 2**(W+s)

m and delta can be calculated recursively.

Define: qc = floor(((2**W – 1) + 1) / d), so: nc = qc * d – 1,

and pulling a rabbit out of a hat (or quoting an Australian mathematician

on the mathematics of juggling, we make a “good guess” which means

a guess that (we think) is correct) we state the basic “qc” inequality:

qc * (2**(W+s) / m – d) > -1

This inequality is explained just below, but for now rearrange it to:

qc * (m*d – 2**(W+s)) < m, and set delta = m*d – 2**(W+s).

We have 0 <= delta <= d – 1 < d, and can prove 0 <= qc * delta < 2**W,

so we can recursively calculate delta and m,

stopping when qc * delta < m or m overflows.

Now do some more rearranging:

qc * (m*d – 2**(W+s)) < m,

qc * d * (m*d – 2**(W+s)) < m*d – 2**(W+s) + 2**(W+s),

(qc * d – 1) * (m*d – 2**(W+s)) < 2**(W+s),

nc * (m*d – 2**(W+s)) < 2**(W+s).

which is the "nc" inequality given in "Hacker's Delight".

This derivation works equally well in the reverse direction,

so the "qc" and "nc" inequalities are exactly equivalent.

Essentially "Hacker's Delight" (magicu) uses the approximation error:

m/2**(W+s) – 1/d == delta / (d * 2**(W+s))

and asks for which nc with mod(nc, d) == d-1

does floor(nc * m/2**(W+s)) == floor(n / d),

which leads to this "nc" inequality: nc * (m/2**(W+s) – 1/d) < 1/d,

which can be rearranged to: nc * (m*d – 2**(W+s)) = d-1,

then for 0 <= n <= nc + d-1:

floor(n * m/2**(W+s)) == floor(n/d)

A faster alternative (magicu2), which sometimes gives sub-optimal m, p,

is rearrange: nc * delta < 2**(W+s) to: delta < 2**(W+s) / nc,

and now because nc -1,

which leads to this “qc” inequality: qc * (2**(W+s)/m – d) > -1,

which can be rearranged to: qc * (m*d – 2**(W+s)) 0,

then for 0 <= n <= (qc + 1) * d – 2:

floor(n * m/2**(W+s)) == floor(n/d)

The "qc" inequality is not, as far as I can see, in the Second Edition

of Henry S Warren's "Hacker's Delight", so I am assuming that

one of the following is true of the "qc" inequality:

* it is not known, or at least it is not well known;

* it is not known, but is not useful;

* it is known, and is not useful.

I would like to have found where magic numbers are calculated in compilers

(gcc, etc) to see how it is done there, but my knowledge of those is very

limited, and my admittedly cursory search in the gcc code drew a blank.

3. Application to Signed Integers

=================================

Here we applied the ideas to unsigned integer arithmetic, but they also

apply to signed integer arithmetic. In particular, applying these ideas

we can calculate signed integer magic numbers using only signed integers

in a straightforward way: we can *easily* avoid using unsigned integers

to calculate signed integer magic numbers. (I stress easily, because

we can of course mimic unsigned integers with signed integers, provided

we use some reasonably complicated code.) So for languages which have

fixed size signed integers, but not fixed size unsigned integers

(for example Java before JDK8?), I believe there is a straightforward way

to calculate magic numbers using only those fixed size signed integers.

4. Timings

==========

Timings for calculating 32_bit unsigned integer magic numbers

for divisors in a for loop from 3 upto 10**7, excluding powers of 2.

The calculations were made on a low specification netbook

running Microsoft Windows 7 Starter Edition,

with the programs compiled using MinGW version gcc (GCC) 4.8.1.

Calculation time taken

———– ———–

loop only 0.3 seconds, for loop only, no magic number calculations;

magicu 8.3 seconds, Hacker's Delight using: nc*delta < 2**(W+s);

magicu2 6.0 seconds, Hacker's Delight using: delta < 2**s;

magicu2v 3.4 seconds, rewritten magicu2 using: delta < 2**s;

magicqu 3.6 seconds, Suliram using: qc*delta < m;

So magicqu is about 6% slower than magicu2v,

but calculates the same magic numbers and shifts as magicu,

that is it calculates the smallest possible magic numbers.

The code for magicu2v and magicqu is almost identical: the only

differences are:

magicu2v magicqu

———————- ————–

before loop: twopows = 2 qc = m – 1

in loop: delta <= twopows qc * delta < m

in loop: twopows = 2*twopows nothing

or twopows = twopows << 1

which explains why magicqu seems to be only a little slower than magicu2v?

5. Testing

==========

magicu2v and magicqu have been tested by me against magicu2 and magicu:

* for all 16-bit divisors;

* for a large-ish sample of 32-bit divisors.

I have not yet asked anyone else to look at either the theory or this

implementation, so it is possible that there are mistakes in my proofs

of the theory and/or in my implementation. That said, I have gone over

the proofs (and implementation) several times, so I am very confident

of at least the theory behind magicqu,

in particular that the "qc" inequality used in magicqu

is exactly equivalent to the "nc" inequality used in magicu.

***But experience shows that proofs need to be checked by other people.***

6. divide_by_constants_codegen_reference.c

==========================================

I have made versions of divide_by_constants_codegen_reference.c using

the ideas in magicu2v and magicqu. I have not yet run timings on these,

because I believe that the reasons why magicu2v and magicqu seem to be

faster than magicu2 and magicu will also apply to these.

That may not be quite true: it should be true for when the magic number

does not overflow or is for the ridiculous_fish round-down "n+1" method.

But where we use a pre-shift, instead of taking all powers of 2 out of d,

and calculating the magic number for the reduced d, my version starts with

the overflowed m (or to be precise, the m just before it would overflow),

and than recurses downwards, reducing d by a factor of 2 at each recursion

and then seeing if we can further reduce m before we again reduce 2,

stopping when d becomes odd or the post-shift has become 0.

My version sometimes gives smaller m and shifts than the RF version.

For example, for 32_bits and d = 1440 both versions have post-shift 0,

but the RF version gives pre-shift 5 and magic number 95443718,

whereas my version gives pre-shift 4 and magic number 47721859.

1440 = 24*60 is a distinctly possible divisor: it is the minutes in a day.

But here my version is working in a different way, so it may or may not

be faster, and that is something that I will be investigating shortly.

***** original post on 2014-01-23 *****

Hacker';s Delight by Henry S Warren, Chapter 10, second edition

nc = floor((nmax + 1) / d) * d – 1

(5) m = (2**p + d – rem(2**p, d)) / d

(6) nc * (d – rem(2**p, d)) < 2**p

Using (5) we can rearrange (6) as follows

(nc + 1) * (d – rem(2**p, d)) < 2**p + d – rem(2**p, d) == m * d

and writing qc = floor((nmax + 1) / d)

we have nc + 1 == qc * d, so

qc * d * (d – rem(2**p, d)) < m * d

qc * (d – rem(2**p, d)) < m

We know that either m is a valid W-bit integer or m is a (W+1)-bit integer, and we can prove that for d not a power of 2

qc * (d – rem(2**p, d)) <= nmax

so we can write code for magic number calculations which:

* for W-bit signed integers uses uncomplicated code which uses W-bit signed integer variables;

* for W-bit unsigned integers uses uncomplicated code which uses W-bit unsigned integer variables.

(In fact, because we can mimic W-bit unsigned integers using W-bit signed integers, we could write code for W-bit unsigned integer magic numbers using code written with W-bit signed integer variables, but if you want unsigned integer magic numbers, then you probably have access to unsigned integer variables, and so don't need the more complicated signed integer version of the code.)

;

;