Subject: ECL? was ..Re: Runtime determination of share directories?
From: Michael Abshoff
Date: Fri, 23 Jan 2009 14:44:22 -0800
Richard Fateman wrote:
> Michael Abshoff wrote:...
Hi,
> ...
>> Mhhh, maybe I did not name the right package? I am referring to the
>> recent work of Dieter Kaiser implementing more special functions and I
>> do recall him increasing the number of bits used internally for some
>> computations to ensure identical results on various lisps.
>>
> I noticed some mail about (double precision?) logarithm / gamma.
> Common Lisp allows for different floats, up to 4 in the standard, I
> think: short, single, double, and long, and possibly more. It is
> certainly possible to support double-extended, double-double (which is,
> I think, in CMU-CL), and others.
>> ......
Ok.
>
>> MPFR itself is pure C, but MPFR relies on GMP for the underlying
>> arithmetic, which is partially written in assembler. But I don't see
>> how requiring MPFR and GMP to be present on the system would be an
>> issue, since building a lisp from sources is often harder.
>
> The fact that Sage requires building whatever from sources is
> pretty much irrelevant to Maxima. The vast majority of Maxima users
> just download and install a working executable (+ support files)
> suitable for their machine.
Yes, I know that.
> If MPFR were to be made part of Maxima (which might be nice) it would
> involve some technical tricks. For example, to get the best out of MPFR
> and GMP, it would be necessary, in the install process, to distinguish
> among many x86 implementations to choose the optimal binary. I don't
> know the details of the current system, but the best code I have for a
> Pentium 4 does not run on an earlier Pentium.
> The code for the earlier Pentium runs on the Pentium 4, but is slower.
> A pure-C implementation of GMP is considerably slower on the Pentium 4.
> So a Maxima using MPFR+GMP could use the pure-C version and be 10X
> slower, or at some stage figure out which CPU is running.
Could you tell me which examples were 10 times slower in pure C than in
assembly? You don't have to target generic C code, i.e. I doubt too many
people are running Maxima on pre-Pentium CPUs. Given the sleek size of
Maxima, this would certainly be possible.
> Testing such a system build for all x86 implementations would be a
> hassle. I, of course, only care about the computer I'm usually
> using.
GMP is probably one of the most extensively tested scientific libraries
out there. If you want a 32 bit build optimized for any CPU, use
--enable-fat; at runtime the optimal implementation is picked by
adjusting a function pointer table. This does not work in GMP itself on
anything but Linux, but MPIR, a fork of GMP, has fixed that for 64 bit
systems as well as for MinGW.
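To make the mechanism concrete, here is a hand-written C++ sketch of the
idea (not GMP's actual code; the routine names and the CPU flag are
invented for illustration):

#include <cstddef>

// One low-level primitive, two implementations. A real fat build ships
// assembler variants for many CPU families and selects among them.
typedef unsigned long limb_t;
typedef void (*add_n_fn)(limb_t *rp, const limb_t *ap,
                         const limb_t *bp, size_t n);

static void add_n_generic(limb_t *rp, const limb_t *ap,
                          const limb_t *bp, size_t n) {
  limb_t carry = 0;
  for (size_t i = 0; i < n; ++i) {
    limb_t s = ap[i] + carry;
    carry = (s < carry);          // carry out of ap[i] + carry
    rp[i] = s + bp[i];
    carry += (rp[i] < s);         // carry out of s + bp[i]
  }
}

static void add_n_sse2(limb_t *rp, const limb_t *ap,
                       const limb_t *bp, size_t n) {
  add_n_generic(rp, ap, bp, n);   // stand-in for a tuned assembler loop
}

// The "function pointer table", adjusted once after CPU detection.
static add_n_fn add_n = add_n_generic;

void init_dispatch(bool cpu_has_sse2) {  // flag would come from CPUID
  if (cpu_has_sse2) add_n = add_n_sse2;
}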
>> Making it optional and falling back to a pure lisp implementation
>> would obviously be a good idea for small devices like PDAs since I
>> guess you want to make Maxima as sleek as possible on those devices.
> If I want to make my life more difficult, I can try riding a unicycle or
> maybe constructing ships in bottles. Doing mathematics by typing with
> my thumbs and seeing the answers on a 2x2 inch screen is not so
> interesting to me. But what do I know, I don't own an iPod.
I do not see how this is relevant. There clearly seems to be demand for
hooking external libraries up to Maxima. I am pretty sure that if I
benchmarked some LAPACK routines using ATLAS against the f2cl-translated
LAPACK in Maxima you would see a large performance difference. And I can
trivially build ATLAS to use 4 cores at the same time, and BLAS level 3
scales linearly in the number of cores. Given a sufficiently large
matrix we have seen linear scaling of the BLAS level 3 matrix-matrix
multiply with up to 32 cores. We have also been running BLAS on GPUs,
and the speed improvements there are significant for certain problem
classes, i.e. recent NVidia as well as ATI GPUs offer over 200 GFlops in
double precision and in excess of 1.2 TFlops in single precision. I have
seen people achieve 60% of that peak, so it is nothing to sneeze at.
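For concreteness, this is roughly the kind of call such a benchmark
would exercise; a minimal sketch against ATLAS's CBLAS interface (link
with e.g. -lcblas -latlas; a threaded ATLAS build runs the same call
across cores):

#include <cblas.h>
#include <vector>

int main() {
  const int n = 1024;
  std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
  // Level-3 BLAS: C = 1.0*A*B + 0.0*C. This is the O(n^3) kernel that
  // dominates dense linear algebra and scales with the number of cores.
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
  return 0;
}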
>>
>> Well, take the latest official release, build it on Solaris running on
>> a x86 cpu and run make check. It failed its test suite on every x86
>> based Solaris box I tried and that is a bad, bad thing.
> If you haven't reported it, maybe they don't know. I don't know of
> anyone (else?) using Solaris on x86, and maybe they don't either.
>
>> If you use quaddouble to do numerical work this is less of an issue IMHO,
>
> Huh? You are using quaddouble but not numerical? Or do you mean you
> are using quaddouble for integers only?
The number-of-partitions algorithm in Sage uses doubles, qd or MPFR
depending on the input size to get optimal speed, and it beats the
second-fastest implementation, the one in MMA 7, by about a 20% margin.
Everything else out there needs orders of magnitude more time to compute
the number of partitions of anything non-trivial.
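The dispatch idea itself is simple; a hedged sketch (the thresholds and
names here are illustrative, not Sage's actual code):

enum Backend { USE_DOUBLE, USE_QD, USE_MPFR };

// Pick the cheapest arithmetic that still carries enough bits for the
// error bound of the computation at hand.
Backend choose_backend(unsigned needed_bits) {
  if (needed_bits <= 53)  return USE_DOUBLE;  // hardware doubles
  if (needed_bits <= 212) return USE_QD;      // quad-double's nominal width
  return USE_MPFR;                            // arbitrary precision
}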
>> but I see little benefit from getting potentially wrong results
>> anywhere from 10 to 50% more quickly than MPFR if you want identical
>> results on any platform which MPFR does deliver.
> Well, consider those people who are hacking away in dungeons trying to
> get 2 processors to work together to get a program to work 50% faster.
> You are saying you could do the job, but with one processor. But
> debugging the program is not worth it.
> Imagine those dungeon-hackers... if they could get 1.5X for one
> processor, and another 50% from the second, they would have
> (1.5)^2 or 2.25X speedup.
Ok, I do not understand how the above is relevant to my point. qd's
results on Solaris x86 and even PPC OSX cannot be trusted to a
significant number of digits. MPFR is about 30% slower for precisions
around 200 bits, and I am happy to trade 30% of arithmetic speed for
correctness. Maybe qd can be fixed on Solaris 10, but then what about
the other BSDs? MPFR passes its extensive test suite on numerous
platforms (assuming GMP does), and they will quickly fix any issue you
report. I could not even find a mailing list for qd, which seems to
indicate that its user community is significantly smaller than MPFR's.
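As a concrete example of the guarantee: every MPFR operation is
correctly rounded at the requested precision, so the output of something
like the following is bit-for-bit identical on every platform (compile
with -lmpfr -lgmp):

#include <mpfr.h>

int main() {
  mpfr_t x, y;
  mpfr_init2(x, 200);              // 200-bit significand
  mpfr_init2(y, 200);
  mpfr_set_ui(x, 2, GMP_RNDN);     // round to nearest
  mpfr_sqrt(y, x, GMP_RNDN);       // sqrt(2), correctly rounded
  mpfr_printf("%.60Rf\n", y);
  mpfr_clear(x);
  mpfr_clear(y);
  return 0;
}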
>> And by the way: quaddouble is released under the BSD license by
>> researchers working for LBNL and U.C. Berkeley, but according to
>>
>> "http://crd.lbl.gov/~dhbailey/mpdist/"
>>
>> one can read that
>>
>>
>> "Incorporating this software in any commercial product requires a
>> license agreement"
>>
>>
>> Maybe someone ought to clue these people in on what it means to release
>> software under the BSD license. And I am sure someone should point
>> them to the wikipedia page about BSD to make 100% sure that they will
>> appreciate the irony.
> I have forwarded your note to one of the people :)
Thanks. Feel free to send me the response :)
>>
>>>
>>>> On the same platform using quaddouble the number of partitions for
>>>> the first five hundred integers is incorrect in about half the
>>>> cases, so any time I am using some extended precision library
>>>> that is not proven to be correct and rigorously tested I get very
>>>> nervous.
>>>>
>>> Maybe the algorithm requires more precision?
>>
>> No, the problem with quaddouble is that, at least on x86, it requires
>> precisely setting the FPU control word, i.e. the rounding mode and so
>> on. On PowerPC or Sparc this is not possible, to my recollection, but
>> in our experience arithmetic operations there also deliver less than
>> the 212 bits promised. I have even seen cases where a single
>> multiplication of two numbers (and we did not attempt to hit a corner
>> case) produced results that differed in the last three or four bits.
> In my own use of quaddouble from Lisp on x86, I have to be careful that
> the fpu control word isn't messed up by other processes which may also
> be taking time from the same Lisp, and this might be the same kind of
> situation for Sparc. x86 can do more arithmetic (fused multiply-add) in
> registers with more precision than Sparc2. I don't know about later
> Sparc or PowerPC. In any case, quad-double does not, as I recall,
> guarantee the last few bits in the fraction, and thus if you are relying
> on them either to be correct or consistent across different
> architectures, then you should not be using this software.
>
> I changed the code to set the rounding modes, e.g.
qd automatically sets the rounding mode via inline assembly if you
define -DX86, and we do so in Sage.
> void c_qd_add(const double *a, const double *b, double *c) {
> fpu_fix_start(NULL); //// ADDED
>
> qd_real cc;
> cc = qd_real(a) + qd_real(b);
> TO_DOUBLE_PTR(cc, c);
> }
>
> I frankly don't know if this guards the fpu control word adequately for
> all possible operating systems and architectures.
It is supposed to do so on any x86 CPU I have ever had access to.
Obviously, hardware and software deviate from that ideal.
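For reference, what such a fix has to do on x86 is roughly the following
(a sketch of the idea, not qd's actual source): save the x87 control
word, force the precision-control field to 53-bit double rounding, and
restore it afterwards.

// GCC-style inline assembly; PC field is bits 8-9 of the control word,
// 10b = round results to 53-bit double precision.
static inline unsigned short fpu_set_double(void) {
  unsigned short old_cw, new_cw;
  __asm__ volatile ("fnstcw %0" : "=m" (old_cw));  // save control word
  new_cw = (old_cw & ~0x0300) | 0x0200;            // PC = double
  __asm__ volatile ("fldcw %0" : : "m" (new_cw));  // load new word
  return old_cw;
}

static inline void fpu_restore(unsigned short cw) {
  __asm__ volatile ("fldcw %0" : : "m" (cw));
}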
>>
>> If the documentation tells me that I get 212 bits of precision, then it
>> should not matter which IEEE-conformant CPU I am running the code on
>> (modulo compiler bugs), but for the purposes the Sage project uses it
>> for, quaddouble does not live up to that standard of reproducibility.
> Reproducibility down to the last few bits is, according to some people,
> not worth slowing down for. Maxima's bigfloats are slower than they
> could be because all basic arithmetic is done with
> round-to-nearest-even; they could probably be made 2 to 4 times faster.
> Elementary functions like exp and cos are probably good to 1 unit in the
> last place, but I never proved that. Since you can always up the
> precision by a bunch of bits, the advantages of correct rounding can
> probably be obtained with less work.
>
Sure, but a lot of people will tell you that correctness is more
important than speed. Numerical computation has long dealt with the
problem that operations on doubles and floats are not associative or
distributive, among other things, but most of the people I work with do
pure math research. From past conversations on and off list I know that
your take on this is different from mine, so I don't think either one of
us will convince the other that they are right (if there even is such a
thing).
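A two-line illustration of the non-associativity point:

#include <cstdio>

int main() {
  double a = 1e16, b = -1e16, c = 1.0;
  // Addition is commutative but not associative in IEEE doubles:
  std::printf("%g\n", (a + b) + c);  // 1: b cancels a, then c is added
  std::printf("%g\n", a + (b + c));  // 0: c is absorbed into b + c
  return 0;
}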
>
>> That does not mean it is not useful for other projects, but at least
>> Sage is not in the business of delivering potentially wrong results
>> 50% faster.
> Well, wrong is in the eye of the beholder. Have you heard the phrase
> "close enough for government work"?
:)
>> AFAIK the issue is known to the quaddouble developers, and Carl Witty
>> and I discussed the possibility of attempting to fix it by working
>> around potential miscompilations last night, but this is a waste of
>> time since even if we get it to work for some examples it will still
>> not come close to the assurance that MPFR gives me. And correctness
>> should always come before speed in any software project.
> Certainly it was known in 2007 when I wrote the generic Lisp interface
> to qd.
>
> RJF
>
Anyway, it has been fun arguing with you, but this is getting OT and I
have things to do, so this will probably be my last email about this to
the Maxima list.
Cheers,
Michael