There have been various discussions of efficiency of 32 vs. 64-bit mode in
Maxima, so I thought the group would be interested in this note by a very
experienced compiler writer.
My summary:
* 64-bit takes more memory (pointers are 2x as big and some instructions are
longer), so caches are less effective.
* 64-bit will be slower in pointer-heavy computations (like Lisp/Maxima).
* 64-bit is generally faster for floating-point calculations when using
precompiled binaries.
* But in the end it all depends on the particular application, the
particular compiler, and the particular processor.
-s
Michael Meissner <meissner at the-meissners.org>
Date: Fri, Oct 1, 2010 at 21:59
There are a couple of things at the instruction level (i.e. where I tend to
live as a compiler developer) that always make these types of questions
answerable only by "it depends".
Programs that mostly do pointer chasing tend to do better in 32 bits than in
64 bits. This is because if pointers (and possibly int/long) are 64 bits, your
data cache is less effective: fewer unique values fit in the cache, no matter
what your cache size is.
In x86 land, because most systems worry about being backwards compatible with
the most ancient 386s, by default only the instructions that existed in the
386 tend to get generated. This means, for instance, using the 80387 floating-point
stack (bletch) instead of the SSE registers for floating point, not
having conditional move instructions, etc. Some of the newer Linux distros
have raised their minimum requirements and changed the compiler to generate
more modern code by default. In addition, some compilers (Intel, PGI) will
generate fat binaries that contain the code compiled for several different
instruction set levels. Because the 64-bit code started at the SSE2 level, you
have more instructions that the compiler can use by default, without
specifying cpu/arch type options.
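As a concrete illustration of the defaults described above (assuming gcc on
x86; the file name is hypothetical): SSE floating point must be requested
explicitly for a 32-bit build, while the 64-bit baseline already includes it.

```shell
# 32-bit build defaults to 386-era code, x87 stack floating point:
gcc -m32 -O2 prog.c -o prog32
# SSE scalar floating point must be asked for explicitly in 32-bit mode:
gcc -m32 -O2 -msse2 -mfpmath=sse prog.c -o prog32-sse
# 64-bit builds start at the SSE2 baseline, so SSE FP is the default:
gcc -m64 -O2 prog.c -o prog64
```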
Due to the rather contorted way that both Intel and AMD have grown the
architecture over the years, 64-bit code tends to be bigger than 32-bit code.
You do get more registers in 64-bit mode (16 GPRs + 16 FPRs vs. 8/8, if memory
serves). When I was at AMD, we discovered some code in a hot loop where
compiling it in 64-bit mode made the loop just big enough that it no longer
completely fit in the instruction buffers, and it had to read the values from
the instruction cache. Speaking of the i-cache: just as bigger pointers make
the d-cache less effective, bigger 64-bit instructions make the i-cache less
effective, though the effect is smaller than for the d-cache.
In terms of scheduling instructions, each of the major generations of chips
wants the code scheduled differently, and it isn't consistent across chip
makers; the newer AMD chips, for instance, are scheduled differently than the
older K8s, and are more Intel-like. For example, on some chips it is better
to use an INC instruction to increment a value, while on others an ADD $1.
Another question is whether it is better to load a value into a register and
do the operation on the register, or to operate on the value directly
from memory. On each of the compilers, there is always the issue of what to
choose for the defaults. When I was at AMD, we were trying to change some of
the defaults so that the code would run faster on new machines that hadn't
been announced, and of course Intel was doing the same thing.
The ABI also plays a role. The x86 32-bit ABI tends to push values on the
stack, while the 64-bit ABIs (and Windows and Linux, needless to say, have
different 64-bit ABIs) tend to pass the first few values in registers, and there
are various tradeoffs as to which is better. It is amusing that both recent
Intel and AMD chips added optimizations to speed up pushing values on the
stack, just as a lot of people were switching over from 32-bit (where it is
done all of the time, and push is a single-byte instruction for some
registers) to 64-bit, where push is rarely used and stores are used instead.
Language ABIs are another issue. Fortran, for instance, tends to favour
default integers being the same size as default reals. This penalizes 32-bit
mode, which then has to use multiple instructions to do the 64-bit arithmetic.
Of course it is not just x86; I've seen the same issues on MIPS and Power. I
find, for instance, that on Power, SPECint tends to be about 10% faster for
32-bit apps than for 64-bit apps, but nearly even for SPECfp. It isn't that
all 29 benchmarks are uniformly slower; the results go all over the board. For
instance, libquantum is much faster in 64-bit, because the code is almost all
long long shift and logical instructions, but there are definitely 32-bit apps
that are faster due to the size of pointers and the d-cache.
--
Michael Meissner
email: meissner at the-meissners.org
http://www.the-meissners.org