Considering 64bit fixnums on 64bit GCL



Greetings, and thanks for the discussion.

Poking around a bit, it appears that (nearly) all the calls to ctimes
need to return generic exported objects, i.e. are not found in loops
where fixnum values can be reused unboxed by other functions expecting
the same.  In such a case the cost is overwhelmingly dominated by
boxing, and attempting to avoid a function call seems pointless.
Avoiding bignums in ctimes should still be worth it as the boxing cost
should decrease.

Below shows the modest gain to be had doing the multiply in ctimes in
register.  With a 31bit modulus or less, this is done, then with the
modulus less than 63 bits, a minimal gmp solution is done, then for
others generic code (gmp bignum) is used.

Once we have a box and a function call, the gain is no where near 4x.
So I guess big primes are the best.  Still its curious that this slowed
the testsuite slightly.

Take care,

=============================================================================

>(in-package :si)

#<"SYSTEM" package>

SYSTEM>(defun tr (mod s l i)
  (declare (fixnum mod i))
  (let ((modulus mod))
    (dolist (x l s)
      (dotimes (j i) (declare (fixnum j))(setf s (ctimes s x))))))

TR

SYSTEM>(compile 'tr)

Compiling /tmp/gazonk_28191_0.lsp.
End of Pass 1.  
End of Pass 2.  
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling /tmp/gazonk_28191_0.lsp.
Loading /tmp/gazonk_28191_0.o
start address -T 0x1a45e60 Finished loading /tmp/gazonk_28191_0.o
#<compiled-function TR>
NIL
NIL

SYSTEM>(defun ml (mod n &aux r)
  (declare (fixnum mod n))
  (dotimes (i n r)
    (push (random mod) r)))

ML

SYSTEM>(compile 'ml)

Compiling /tmp/gazonk_28191_0.lsp.
End of Pass 1.  
End of Pass 2.  
OPTIMIZE levels: Safety=0 (No runtime error checking), Space=0, Speed=3
Finished compiling /tmp/gazonk_28191_0.lsp.
Loading /tmp/gazonk_28191_0.o
start address -T 0x1a45950 Finished loading /tmp/gazonk_28191_0.o
#<compiled-function ML>
NIL
NIL

SYSTEM>(setq m 21267647932558653966460912964485513157)

21267647932558653966460912964485513157

SYSTEM>(gbc t)(time (setq mod m l (ml mod 1000000) res (tr mod 1 l 100)))

T

SYSTEM>
real time       :      7.079 secs
run-gbc time    :      6.500 secs
child run time  :      0.000 secs
gbc time        :      0.560 secs
1106513430

SYSTEM>(gbc t)(time (setq mod m l (ml mod 1000000) res (tr mod 1 l 100)))

T

SYSTEM>
real time       :      7.079 secs
run-gbc time    :      6.510 secs
child run time  :      0.000 secs
gbc time        :      0.550 secs
1844854408

SYSTEM>(setq m 4611686018427387847)

4611686018427387847

SYSTEM>(gbc t)(time (setq mod m l (ml mod 1000000) res (tr mod 1 l 100)))

T

SYSTEM>
real time       :      7.079 secs
run-gbc time    :      6.500 secs
child run time  :      0.000 secs
gbc time        :      0.569 secs
-1874585652381379215

SYSTEM>(gbc t)(time (setq mod m l (ml mod 1000000) res (tr mod 1 l 100)))

T

SYSTEM>
real time       :      7.079 secs
run-gbc time    :      6.530 secs
child run time  :      0.000 secs
gbc time        :      0.529 secs
2187180236878594834

SYSTEM>(setq m 1073741789)

1073741789

SYSTEM>(gbc t)(time (setq mod m l (ml mod 1000000) res (tr mod 1 l 100)))

T

SYSTEM>
real time       :      5.780 secs
run-gbc time    :      5.219 secs
child run time  :      0.000 secs
gbc time        :      0.550 secs
98972026

SYSTEM>(gbc t)(time (setq mod m l (ml mod 1000000) res (tr mod 1 l 100)))

T

SYSTEM>
real time       :      5.760 secs
run-gbc time    :      5.199 secs
child run time  :      0.000 secs
gbc time        :      0.560 secs
-130265634

SYSTEM>

Richard Fateman <fateman at eecs.berkeley.edu> writes:

> As far as timing goes, it would be necessary to come up with some benchmark
> which typifies usage.  Unless the data is so overwhelmingly one-sided -- 
> e.g.
> for all cases large and small there is a clear winner, in which case
> the usage doesn't matter...
>
> I suspect that most cases of polynomial GCD using a modular algorithm
> use just one prime.
> If that is the case, no CRA is used at all.  It is possible that this
> would be the case for even a
> smaller prime though.  Yet if the prime is too small, then the whole
> polynomial GCD algorithm
> has to be run another time.  Though the coefficient arithmetic would
> be faster with smaller primes,
> this would be a big cost.
>
> One comparison might be to run a polynomial GCD using 2 (smaller)
> primes and very fast
> coefficient arithmetic vs 1 larger prime, and then add the
> cost of the CRA to piece together the two smaller pieces.  The CRA
> requires bignum arithmetic,
> though not a great deal of it,  but nevertheless might vastly dominate
> the total cost.
>
> I think that, at a minimum,  the smaller prime multiply/ remainder
> would have to be 4X faster than the larger primes case.
>
> There could be a good technical paper in this:  how large a prime
> should you use.
> Unfortunately, the answer would depend on how clever your Lisp
> compiler was with respect
> to fixnum arithmetic,  esp. 64bit divided by 32 bit.
>
>
> On 5/24/2013 10:15 AM, Camm Maguire wrote:
> <snip>
>
>> These days on 64bit machines, and even most 32bit machines, two
>> 32bit numbers can be multiplied in machine registers. Is 2147483648
>> large enough for the algorithm? 
> Any list of distinct primes could be used.   If all arithmetic of size
> <S is of constant cost, then choosing primes
> such that size S bounds all [or most] of the arithmetic is one useful goal.
>
> So it would not lead to wrong answers to use that limit.  One hack
> would be to start with a single
> smallish prime;  if that allows you to conclude that the GCD of 2
> polynomials is 1, then you are done.
> then try a LARGE prime; if that allows you to conclude .... you are done.
> then try as many additional  32 primes as you need... (in case the GCD
> is not 1...)
>
>
>
>
>>> 1. *bigprimes* is a LIST, initially of length 20, of the largest prime
>>> numbers x such that
>>> x+x is still a fixnum. At least that appears to be the intention in Maxima.
>>>
>>> 3. Your number is 32768.  It seems that what you want is a number such
>>> that x*x   (NOT x+x)
>>> is a fixnum.
>> Yes, currently, but what if this number was 2147483648?  Actually, I
>> think it is best to take the maximum of (ash 1 (ash (integer-length
>> most-positive-fixnum) -1)) and (ash 1 31).
> I suppose this depends on the actual code generated for fiddling with
> fixnums.
>>
>>> 4. I don't know what you mean by circular modular multiplication in
>>> 64bits in registers. Is this
>>> some common lisp feature I don't know about?  Or are you using
>>> something for bignum arithmetic?
>>> (In which case I suggest you consider using GMP, already used by some
>>> other lisps. Advantage of GMP
>>> is you don't have to debug it, and it is likely to be fast on large
>>> numbers in case you have
>>> an audience including number theorists).
>>>
>> GMP is used already when fixnums overflow.  This is still heavy compared
>> to a register multiply with no function call.
> OK, I didn't know you used GMP.  I have found GMP speed to be highly
> sensitive to specifying
> exactly the right CPU version. That is, using generic C code can be 5x
> or 10x slower than
> if the right assembler (I guess) is spliced in.
>
>>
>>> ...
>>> Why x+x rather than x^2 ?
>>> If you are doing modular arithmetic as part of (say) modular polynomial GCD
>>> you will be doing  r := a*b mod p.  It would be fast if a,b, r, p and
>>> a*b are all fixnum.
>> Yes, exactly.
>>
>>> But you will have to do more of them to build up a result using the
>>> Chinese Remainder Algorithm, if p is small.
>>>
>>> It might be more efficient ultimately if a*b is not necessarily a
>>> fixnum, but a,b,p and r are fixnums.
>>> No storage of bignums.  Fewer modular images needed. Indeed, Half.
>>>
>> Thank you so much, here is the tradeoff.  My point is that if the
>> somewhat common bound in current use is (ash 1 30), and this is deemed
>> acceptable performance wise given the remainder algorithm, can't we cap
>> the numbers in *bigprimes* at this level even when most-positive-fixnum
>> gets larger, so that maybe we can do the product in fixnums as well?  Do
>> you expect big gains from the reduction in iterations needed in the
>> algorithm as the prime goes from 30 bits to 62?
> It  probably makes no difference in speed in running random
> programs. The only
> testing regimen that seems to be in use is the collection of
> regression tests, and that
> could, I suppose, be timed with different size primes.
> In a test that pounded on polynomial GCD with many variables, dense,
> with huge
> coefficients in which the GCD algorithm was set to MOD and there were large
> common factors,  then there might be something to see.
>
>
>
>>
>>> That is, I think, the rationale.  If any timing trials were done, they
>>> were probably done in ancient
>>> times on PDP-6 or PDP-10 computers in Maclisp.
>>>
>> This was my guess too.  While I haven't studied the algorithm, from your
>> comments I'm gathering that the number of iterations needed is of the
>> order of the length of *bigprimes*, which has currently been lowered to
>> ~20.
> The number 20 is a high overestimate of the number of terms expect to
> be used.  This list is
> augmented by the computation of more primes at runtime should they be
> needed in some
> rare eventuality of humongous computing.  As mentioned earlier, I
> suspect that this list
> is either not used, or one term is used; rarely more. Not to say I
> could not construct an
> example that used them all and then some.  It would require  high
> degree many variables
> and/or coefficients of 170+ decimal digits.
>
>>    Cutting it in half again by using larger primes doesn't seem worth
>> it in comparison to getting the product done in fixnums.
> If you want to make the modular GCD fast by low-level hacking,
> the key computation, I think, is ctimes,  which used to be (maybe
> still is) in assembler for KCL/GCL:
> (ctimes a b)  is in rat3a.lisp .
>
> if maxima is doing modular arithmetic, i.e. modulus is a fixnum p with
> representation of a and b in  -p/2 to p/2
> then the result is  c = a*b mod p
> which could  be something like one or two assembler instructions.
>
> RJF
>
>
>
>

-- 
Camm Maguire			     		    camm at maguirefamily.org
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah