As far as timing goes, it would be necessary to come up with some benchmark
which typifies usage -- unless the data is so overwhelmingly one-sided
(e.g., for all cases, large and small, there is a clear winner) that the
usage doesn't matter...
I suspect that most cases of polynomial GCD using a modular algorithm
use just one prime. If that is the case, no CRA is used at all. It is
possible that even a smaller prime would suffice. But if the prime is too
small, the whole polynomial GCD algorithm has to be run another time;
although the coefficient arithmetic would be faster with smaller primes,
that rerun would be a big cost.
One comparison might be to run a polynomial GCD using 2 (smaller) primes
and very fast coefficient arithmetic vs. 1 larger prime, adding in the
cost of the CRA to piece the two smaller images together. The CRA
requires bignum arithmetic -- not a great deal of it, but it might
nevertheless vastly dominate the total cost.
I think that, at a minimum, the smaller-prime multiply/remainder would
have to be 4X faster than in the larger-prime case.
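To make the CRA cost concrete, here is a minimal sketch of the pairwise
Chinese Remainder step that glues two modular images together (a Python
illustration only -- the actual Maxima code is Lisp). Once p1*p2 exceeds a
machine word, the multiplication below is bignum arithmetic, which is exactly
the overhead being weighed:

```python
def crt_pair(r1, p1, r2, p2):
    """Combine r1 mod p1 and r2 mod p2 into the unique
    residue mod p1*p2 (p1, p2 assumed coprime)."""
    # inverse of p1 modulo p2 (Python 3.8+: pow with exponent -1)
    inv = pow(p1, -1, p2)
    t = ((r2 - r1) * inv) % p2
    return r1 + p1 * t           # result in [0, p1*p2)
```

For instance, combining x = 2 (mod 5) with x = 3 (mod 7) yields 17, the
unique residue mod 35.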
There could be a good technical paper in this: how large a prime should
you use.
Unfortunately, the answer would depend on how clever your Lisp compiler
is with respect to fixnum arithmetic, esp. 64-bit divided by 32-bit.
On 5/24/2013 10:15 AM, Camm Maguire wrote:
<snip>
> These days on 64bit machines, and even most 32bit machines, two 32bit
> numbers can be multiplied in machine registers. Is 2147483648 large
> enough for the algorithm?
Any list of distinct primes could be used. If all arithmetic of size
<S is of constant cost, then choosing primes
such that size S bounds all [or most] of the arithmetic is one useful goal.
So it would not lead to wrong answers to use that limit. One hack would
be to start with a single smallish prime; if that allows you to conclude
that the GCD of the 2 polynomials is 1, then you are done.
Then try a LARGE prime; if that allows you to conclude .... you are done.
Then try as many additional 32-bit primes as you need (in case the GCD
is not 1...).
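The staged strategy above relies on a standard fact: if the images of the two
polynomials modulo a single "lucky" prime are coprime, the integer GCD is 1
and no CRA is needed at all. A hypothetical univariate sketch in Python
(polynomials as coefficient lists, low degree first -- not Maxima's actual
representation):

```python
def _trim(a):
    """Strip high-order zero coefficients (stored at the end of the list)."""
    while a and a[-1] == 0:
        a.pop()
    return a

def _poly_rem(a, b, p):
    """Remainder of a divided by b, coefficients mod p."""
    a = [c % p for c in a]
    inv = pow(b[-1], -1, p)          # inverse of b's leading coefficient
    while len(a) >= len(b):
        coef = a[-1] * inv % p
        shift = len(a) - len(b)
        for i, c in enumerate(b):
            a[shift + i] = (a[shift + i] - coef * c) % p
        _trim(a)                     # leading term is now 0 mod p
    return a

def poly_gcd_mod(f, g, p):
    """Monic GCD of f and g over Z/pZ, by the Euclidean algorithm."""
    a, b = _trim([c % p for c in f]), _trim([c % p for c in g])
    while b:
        a, b = b, _poly_rem(a, b, p)
    inv = pow(a[-1], -1, p)          # normalize the result to be monic
    return [c * inv % p for c in a]
```

For example, gcd(x^2+1, x^2-1) mod 13 comes out as the constant 1, so a
single prime certifies that the integer GCD is 1 and the computation stops.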
>> 1. *bigprimes* is a LIST, initially of length 20, of the largest prime
>> numbers x such that
>> x+x is still a fixnum. At least that appears to be the intention in Maxima.
>>
>> 3. Your number is 32768. It seems that what you want is a number such
>> that x*x (NOT x+x)
>> is a fixnum.
> Yes, currently, but what if this number was 2147483648? Actually, I
> think it is best to take the maximum of (ash 1 (ash (integer-length
> most-positive-fixnum) -1)) and (ash 1 31).
I suppose this depends on the actual code generated for fiddling with
fixnums.
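For what it's worth, the candidate bounds can be computed directly; the
largest prime below 2^31 happens to be 2^31 - 1 itself (a Mersenne prime).
A Python sketch, using Miller-Rabin with a base set known to be
deterministic for numbers far beyond this size:

```python
def is_prime(n):
    """Miller-Rabin; deterministic with these bases for n < ~3.3 * 10^24."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in bases:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def largest_prime_below(limit):
    """Largest prime strictly less than limit."""
    n = limit - 1
    while not is_prime(n):
        n -= 1
    return n
```

With a 30-bit cap the same search would start from 2^30 - 1 instead; either
way the list of candidate primes is cheap to generate at load time.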
>
>> 4. I don't know what you mean by circular modular multiplication in
>> 64bits in registers. Is this
>> some common lisp feature I don't know about? Or are you using
>> something for bignum arithmetic?
>> (In which case I suggest you consider using GMP, already used by some
>> other lisps. Advantage of GMP
>> is you don't have to debug it, and it is likely to be fast on large
>> numbers in case you have
>> an audience including number theorists).
>>
> GMP is used already when fixnums overflow. This is still heavy compared
> to a register multiply with no function call.
OK, I didn't know you used GMP. I have found GMP speed to be highly
sensitive to specifying exactly the right CPU version: using generic C
code can be 5x or 10x slower than if the right assembler (I guess) is
spliced in.
>
>> ...
>> Why x+x rather than x^2 ?
>> If you are doing modular arithmetic as part of (say) modular polynomial GCD
>> you will be doing r := a*b mod p. It would be fast if a,b, r, p and
>> a*b are all fixnum.
> Yes, exactly.
>
>> But you will have to do more of them to build up a result using the
>> Chinese Remainder Algorithm, if p is small.
>>
>> It might be more efficient ultimately if a*b is not necessarily a
>> fixnum, but a,b,p and r are fixnums.
>> No storage of bignums. Fewer modular images needed. Indeed, Half.
>>
> Thank you so much, here is the tradeoff. My point is that if the
> somewhat common bound in current use is (ash 1 30), and this is deemed
> acceptable performance wise given the remainder algorithm, can't we cap
> the numbers in *bigprimes* at this level even when most-positive-fixnum
> gets larger, so that maybe we can do the product in fixnums as well? Do
> you expect big gains from the reduction in iterations needed in the
> algorithm as the prime goes from 30 bits to 62?
It probably makes no difference in speed when running random programs.
The only testing regimen that seems to be in use is the collection of
regression tests, and that could, I suppose, be timed with
different-size primes.
In a test that pounded on polynomial GCD -- many variables, dense, with
huge coefficients -- with the GCD algorithm set to MOD and large common
factors present, there might be something to see.
>
>> That is, I think, the rationale. If any timing trials were done, they
>> were probably done in ancient
>> times on PDP-6 or PDP-10 computers in Maclisp.
>>
> This was my guess too. While I haven't studied the algorithm, from your
> comments I'm gathering that the number of iterations needed is of the
> order of the length of *bigprimes*, which has currently been lowered to
> ~20.
The number 20 is a high overestimate of the number of terms expected to
be used. This list is augmented by the computation of more primes at
runtime, should they be needed in some rare eventuality of humongous
computing. As mentioned earlier, I suspect that this list is either not
used, or one term is used; rarely more. Not to say I could not construct
an example that used them all and then some; it would require high
degree, many variables, and/or coefficients of 170+ decimal digits.
> Cutting it in half again by using larger primes doesn't seem worth
> it in comparison to getting the product done in fixnums.
If you want to make the modular GCD fast by low-level hacking,
the key computation, I think, is ctimes, which used to be (and maybe
still is) in assembler for KCL/GCL: (ctimes a b) is in rat3a.lisp.
If Maxima is doing modular arithmetic, i.e., the modulus is a fixnum p
with a and b represented in the range -p/2 to p/2, then the result
c = a*b mod p could be computed in something like one or two assembler
instructions.
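As a model of the arithmetic involved (a Python sketch of what a
ctimes-style operation computes, not the actual rat3a.lisp code), the
product is reduced mod p and then folded back into the balanced range:

```python
def ctimes(a, b, p):
    """Multiply a and b modulo p, keeping the result in the
    balanced (symmetric) range (-p/2, p/2]."""
    c = (a * b) % p      # in [0, p)
    if c > p // 2:       # fold the upper half down to negative residues
        c -= p
    return c
```

E.g. with p = 7 the residues run from -3 to 3, so ctimes(5, 5, 7) gives
-3 rather than 4. On hardware this is a multiply, a remainder, and a
conditional subtract -- hence the "one or two assembler instructions"
estimate when everything fits in registers.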
RJF