| version 1.1.1.1, 2000/09/09 14:13:19 |
version 1.1.1.2, 2003/08/25 16:06:37 |
|
|
| |
Copyright 2000, 2001, 2002 Free Software Foundation, Inc. |
| |
|
| |
This file is part of the GNU MP Library. |
| |
|
| |
The GNU MP Library is free software; you can redistribute it and/or modify |
| |
it under the terms of the GNU Lesser General Public License as published by |
| |
the Free Software Foundation; either version 2.1 of the License, or (at your |
| |
option) any later version. |
| |
|
| |
The GNU MP Library is distributed in the hope that it will be useful, but |
| |
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY |
| |
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public |
| |
License for more details. |
| |
|
| |
You should have received a copy of the GNU Lesser General Public License |
| |
along with the GNU MP Library; see the file COPYING.LIB. If not, write to |
| |
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA |
| |
02111-1307, USA. |
| |
|
| |
|
| |
|
| |
|
| |
|
| GMP SPEED MEASURING AND PARAMETER TUNING |
GMP SPEED MEASURING AND PARAMETER TUNING |
| |
|
| |
|
| The programs in this directory are for knowledgeable users who want to make |
The programs in this directory are for knowledgeable users who want to |
| measurements of the speed of GMP routines on their machine, and perhaps |
measure GMP routines on their machine, and perhaps tweak some settings or |
| tweak some settings or identify things that can be improved. |
identify things that can be improved. |
| |
|
| The programs here are tools, not ready to run solutions. Nothing is built |
The programs here are tools, not ready to run solutions. Nothing is built |
| in a normal "make all", but various Makefile targets described below exist. |
in a normal "make all", but various Makefile targets described below exist. |
| |
|
| Relatively few systems and CPUs have been tested, so be sure to verify that |
Relatively few systems and CPUs have been tested, so be sure to verify that |
| you're getting sensible results before relying on them. |
results are sensible before relying on them. |
| |
|
| |
|
| |
|
| |
|
| MISCELLANEOUS NOTES |
MISCELLANEOUS NOTES |
| |
|
| Don't configure with --enable-assert when using the things here, since the |
--enable-assert |
| extra code added by assertion checking may influence measurements. |
|
| |
|
| Some effort has been made to accommodate CPUs with direct mapped caches, but |
Don't configure with --enable-assert, since the extra code added by |
| it will depend on TMP_ALLOC using a proper alloca, and even then it may or |
assertion checking may influence measurements. |
| may not be enough. |
|
| |
|
| The sparc32/v9 addmul_1 code runs at noticeably different speeds on |
Direct mapped caches |
| successive sizes, and this has a bad effect on the tune program's |
|
| determinations of the multiply and square thresholds. |
|
| |
|
| |
Some effort has been made to accommodate CPUs with direct mapped caches, |
| |
by putting data blocks more or less contiguously on the stack. But this |
| |
will depend on TMP_ALLOC using alloca, and even then it may or may not |
| |
be enough. |
| |
|
| |
FreeBSD 4.2 i486 getrusage |
| |
|
| |
This getrusage seems to be a bit doubtful. It looks like it's |
| |
microsecond accurate, but sometimes ru_utime remains unchanged after a |
| |
time of many microseconds has elapsed. It'd be good to detect this in |
| |
the time.c initializations, but for now the suggestion is to pretend it |
| |
doesn't exist. |
| |
|
| |
./configure ac_cv_func_getrusage=no |
| |
|
| |
NetBSD 1.4.1 m68k macintosh time base |
| |
|
| |
On this system it's been found getrusage often goes backwards, making it |
| |
unusable (configure is set up to ignore it). gettimeofday sometimes |
| |
doesn't update atomically when it crosses a 1 second boundary. Not sure |
| |
what to do about this. Expect intermittent failures. |
| |
|
| |
SCO OpenUNIX 8 /etc/hw |
| |
|
| |
/etc/hw takes about a second to return the cpu frequency, which suggests |
| |
perhaps it's measuring each time it runs. If this is annoying when |
| |
running the speed program repeatedly then set a GMP_CPU_FREQUENCY |
| |
environment variable (see TIME BASE section below). |
| |
|
| |
Low resolution timebase |
| |
|
| |
Parameter tuning can be very time consuming if the only timebase |
| |
available is a 10 millisecond clock tick, to the point of being |
| |
unusable. This is currently the case on VAX and ARM systems. |
| |
|
| |
|
| |
|
| |
|
| PARAMETER TUNING |
PARAMETER TUNING |
| |
|
| The "tuneup" program runs some tests designed to find the best settings for |
The "tuneup" program runs some tests designed to find the best settings for |
| various thresholds, like KARATSUBA_MUL_THRESHOLD. Its output can be put |
various thresholds, like MUL_KARATSUBA_THRESHOLD. Its output can be put |
| into gmp-mparam.h. The program can be built and run with |
into gmp-mparam.h. The program is built and run with |
| |
|
| make tune |
make tune |
| |
|
| If the thresholds indicated are grossly different from the values in the |
If the thresholds indicated are grossly different from the values in the |
| selected gmp-mparam.h then you may get a performance boost in relevant size |
selected gmp-mparam.h then there may be a performance boost in applicable |
| ranges by changing gmp-mparam.h accordingly. |
size ranges by changing gmp-mparam.h accordingly. |
| |
|
| If your CPU has specific tuned parameters coming from a gmp-mparam.h in one |
Be sure to do a full reconfigure and rebuild to get any newly set thresholds |
| of the mpn subdirectories then the values from "make tune" should be |
to take effect. A partial rebuild is enough sometimes, but a fresh |
| similar. You can submit new values if it looks like the current ones are |
configure and make is certain to be correct. |
| out of date or wildly wrong. But check you're on the right CPU target and |
|
| there aren't any machine-specific effects causing a difference. |
|
| |
|
| |
If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of |
| |
the mpn subdirectories then the values from "make tune" should be similar. |
| |
But check that the configured CPU is right and there are no machine specific |
| |
effects causing a difference. |
| |
|
| It's hoped the compiler and options used won't have too much effect on |
It's hoped the compiler and options used won't have too much effect on |
| thresholds, since for most CPUs they ultimately come down to comparisons |
thresholds, since for most CPUs they ultimately come down to comparisons |
| between assembler subroutines. Missing out on the longlong.h macros by not |
between assembler subroutines. Missing out on the longlong.h macros by not |
| using gcc will probably have an effect. |
using gcc will probably have an effect. |
| |
|
| Some thresholds produced by the tune program are merely single values chosen |
Some thresholds produced by the tune program are merely single values chosen |
| from what's actually a range of sizes where two algorithms are pretty much |
from what's a range of sizes where two algorithms are pretty much the same |
| the same speed. When this happens the program is likely to give slightly |
speed. When this happens the program is likely to give somewhat different |
| different values on successive runs. This is noticeable on the toom3 |
values on successive runs. This is noticeable on the toom3 thresholds for |
| thresholds for instance. |
instance. |
| |
|
| |
|
| |
|
| Line 71 routines, and producing tables of data or gnuplot grap |
|
| Line 126 routines, and producing tables of data or gnuplot grap |
|
| |
|
| make speed |
make speed |
| |
|
| |
(Or on DOS systems "make speed.exe".) |
| |
|
| Here are some examples of how to use it. Check the code for all the |
Here are some examples of how to use it. Check the code for all the |
| options. |
options. |
| |
|
| Line 80 Draw a graph of mpn_mul_n, stepping through sizes by 1 |
|
| Line 137 Draw a graph of mpn_mul_n, stepping through sizes by 1 |
|
| ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n |
./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n |
| gnuplot foo.gnuplot |
gnuplot foo.gnuplot |
| |
|
| Compare mpn_add_n and mpn_lshift by 1, showing times in cycles and showing |
Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and |
| under mpn_lshift the difference between it and mpn_add_n. |
showing under mpn_lshift the difference between it and mpn_add_n. |
| |
|
| ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1 |
./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1 |
| |
|
| Line 101 don't get this since it would upset gnuplot or other d |
|
| Line 158 don't get this since it would upset gnuplot or other d |
|
| TIME BASE |
TIME BASE |
| |
|
| The time measuring method is determined in time.c, based on what the |
The time measuring method is determined in time.c, based on what the |
| configured target has available. A microsecond accurate gettimeofday() will |
configured host has available. A cycle counter is preferred, possibly |
| work well, but there's code to use better methods, such as the cycle |
supplemented by another method if the counter has a limited range. A |
| counters on various CPUs. |
microsecond accurate getrusage() or gettimeofday() will work quite well too. |
| |
|
| Currently, all methods except possibly the alpha cycle counter depend on the |
The cycle counters (except possibly on alpha) and gettimeofday() will depend |
| machine being otherwise idle, or rather on other jobs not stealing CPU time |
on the machine being otherwise idle, or rather on other jobs not stealing |
| from the measuring program. Short routines (that complete within a |
CPU time from the measuring program. Short routines (those that complete |
| timeslice) should work even on a busy machine. Some trouble is taken by |
within a timeslice) should work even on a busy machine. |
| speed_measure() in common.c to avoid the ill effects of sporadic interrupts, |
|
| or other intermittent things (like cron waking up every minute). But |
|
| generally you'll want an idle machine to be sure of consistent results. |
|
| |
|
| The CPU frequency is needed if times in cycles are to be displayed, and it's |
Some trouble is taken by speed_measure() in common.c to avoid ill effects |
| always needed when using a cycle counter time base. time.c knows how to get |
from sporadic interrupts, or other intermittent things (like cron waking up |
| the frequency on some systems, but when that fails, or needs to be |
every minute). But generally an idle machine will be necessary to be |
| overridden, an environment variable GMP_CPU_FREQUENCY can be used (in |
certain of consistent results. |
| Hertz). For example in "bash" on a 650 MHz machine, |
|
| |
|
| |
The CPU frequency is needed to convert between cycles and seconds, or for |
| |
when a cycle counter is supplemented by getrusage() etc. The speed program |
| |
will convert as necessary according to the output format requested. The |
| |
tune program will work with either cycles or seconds. |
| |
|
| |
freq.c knows how to get the frequency on some systems, or can measure a |
| |
cycle counter against gettimeofday() or getrusage(), but when that fails, or |
| |
needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be |
| |
used (in Hertz). For example in "bash" on a 650 MHz machine, |
| |
|
| export GMP_CPU_FREQUENCY=650e6 |
export GMP_CPU_FREQUENCY=650e6 |
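The conversion between cycles and seconds is plain arithmetic once the frequency is known. The following Python is only an illustrative sketch, not code from time.c or freq.c; the helper names are hypothetical and the environment is passed in as a dictionary so the example is self-contained.

```python
def cpu_frequency(environ, default=None):
    """Read a frequency in Hertz from a GMP_CPU_FREQUENCY style setting.

    Hypothetical helper: it only shows that a value such as "650e6"
    parses as 650 million Hertz.
    """
    value = environ.get("GMP_CPU_FREQUENCY")
    return float(value) if value is not None else default


def cycles_to_seconds(cycles, hz):
    """Convert a cycle count to seconds at the given frequency."""
    return cycles / hz


def seconds_to_cycles(seconds, hz):
    """Convert a time in seconds to cycles at the given frequency."""
    return seconds * hz


hz = cpu_frequency({"GMP_CPU_FREQUENCY": "650e6"})
print(hz)
print(cycles_to_seconds(1300, hz))
```

On a 650 MHz machine 1300 cycles therefore corresponds to about 2 microseconds.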
| |
|
| A high precision time base makes it possible to get accurate measurements in |
A high precision time base makes it possible to get accurate measurements in |
| a shorter time. Support for systems and CPUs not already covered is wanted. |
a shorter time. |
| |
|
| When setting up a method, be sure not to claim a higher accuracy than is |
|
| really available. For example the default gettimeofday() code is set for |
|
| microsecond accuracy, but if only 10ms or 55ms is available then |
|
| inconsistent results can be expected. |
|
| |
|
| |
|
| |
|
| |
EXAMPLE COMPARISONS - VARIOUS |
| |
|
| |
Here are some ideas for things that can be done with the speed program. |
| |
|
| EXAMPLE COMPARISONS |
|
| |
|
| Here are some ideas for things you can do with the speed program. |
|
| |
|
| There's always going to be a certain amount of overhead in the time |
There's always going to be a certain amount of overhead in the time |
| measurements, due to reading the time base, and in the loop that runs a |
measurements, due to reading the time base, and in the loop that runs a |
| routine enough times to get a reading of the desired precision. Noop |
routine enough times to get a reading of the desired precision. Noop |
| Line 147 the times printed or anything. |
|
| Line 204 the times printed or anything. |
|
| |
|
| ./speed -s 1 noop noop_wxs noop_wxys |
./speed -s 1 noop noop_wxs noop_wxys |
| |
|
| If you want to know how many cycles per limb a routine is taking, look at |
To see how many cycles per limb a routine is taking, look at the time |
| the time increase when the size increments, using option -D. This avoids |
increase when the size increments, using option -D. This avoids fixed |
| fixed overheads in the measuring. Also, remember many of the assembler |
overheads in the measuring. Also, remember many of the assembler routines |
| routines have unrolled loops, so it might be necessary to compare times at, |
have unrolled loops, so it might be necessary to compare times at, say, 16, |
| say, 16, 32, 48, 64 etc to see what the unrolled part is taking, as opposed |
32, 48, 64 etc to see what the unrolled part is taking, as opposed to any |
| to any finishing off. |
finishing off. |
| |
|
| ./speed -s 16-64 -t 16 -C -D mpn_add_n |
./speed -s 16-64 -t 16 -C -D mpn_add_n |
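The differencing idea can be sketched outside the speed program. The Python below is an illustration only: the timings are made-up numbers rather than measurements, and cycles_per_limb is a hypothetical helper, not part of GMP.

```python
def cycles_per_limb(sizes, cycles):
    """Return the per-limb cost between successive measurements.

    Taking differences cancels any fixed overhead common to all sizes.
    """
    points = list(zip(sizes, cycles))
    return [
        (c1 - c0) / (s1 - s0)
        for (s0, c0), (s1, c1) in zip(points, points[1:])
    ]


# Made-up timings: 30 cycles of fixed overhead plus 2.5 cycles per limb.
sizes = [16, 32, 48, 64]
cycles = [30 + 2.5 * n for n in sizes]

print(cycles_per_limb(sizes, cycles))
```

Dividing a single total time by the size would include the 30 cycle overhead and overstate the per-limb cost; the differences do not.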
| |
|
|
|
| |
|
| When a routine has an unrolled loop for, say, multiples of 8 limbs and then |
When a routine has an unrolled loop for, say, multiples of 8 limbs and then |
| an ordinary loop for the remainder, it can happen that it's actually faster |
an ordinary loop for the remainder, it can happen that it's actually faster |
| to do an operation on, say, 8 limbs than it is on 7 limbs. Here's an |
to do an operation on, say, 8 limbs than it is on 7 limbs. The following |
| example drawing a graph of mpn_sub_n, which you can look at to see if times |
draws a graph of mpn_sub_n, to see whether times smoothly increase with |
| smoothly increase with size. |
size. |
| |
|
| ./speed -s 1-100 -c -P foo mpn_sub_n |
./speed -s 1-100 -c -P foo mpn_sub_n |
| gnuplot foo.gnuplot |
gnuplot foo.gnuplot |
| |
|
| If mpn_lshift and mpn_rshift for your CPU have special case code for shifts |
If mpn_lshift and mpn_rshift have special case code for shifts by 1, it |
| by 1, it ought to be faster (or at least not slower) than shifting by, say, |
ought to be faster (or at least not slower) than shifting by, say, 2 bits. |
| 2 bits. |
|
| |
|
| ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2 |
./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2 |
| |
|
| Line 195 if the lshift isn't faster there's an obvious improvem |
|
| Line 251 if the lshift isn't faster there's an obvious improvem |
|
| |
|
| On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the |
On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the |
| destination is one of the sources is faster than a separate destination. |
destination is one of the sources is faster than a separate destination. |
| Here's an example to see this. (mpn_add_n_inplace is a special measuring |
Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and |
| routine, not available for other operations.) |
mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL. |
| |
|
| ./speed -s 1-200 -c mpn_add_n mpn_add_n_inplace |
./speed -s 1-200 -c mpn_add_n mpn_add_n.1 |
| |
|
| The gmp manual recommends divisions by powers of two should be done using a |
The gmp manual points out that divisions by powers of two should be done |
| right shift because it'll be significantly faster. Here's how you can see |
using a right shift because it'll be significantly faster than an actual |
| by what factor mpn_rshift is faster, using division by 32 as an example. |
division. The following shows by what factor mpn_rshift is faster than |
| |
mpn_divrem_1, using division by 32 as an example. |
| |
|
| ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32 |
./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32 |
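The two routines in that comparison compute the same quotient, since 32 is 2^5. A small Python check of the equivalence, on non-negative values only (the function name is hypothetical and this says nothing about the relative speeds, which is what the speed program measures):

```python
def divide_by_power_of_two(x, shift):
    """Divide a non-negative integer by 2**shift using a right shift."""
    return x >> shift


values = [0, 1, 31, 32, 33, 12345678901234567890]
for x in values:
    # Shifting right by 5 bits gives the same quotient as dividing by 32.
    assert divide_by_power_of_two(x, 5) == x // 32

print("x >> 5 matches x // 32 for all test values")
```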
| |
|
| mul_basecase takes an "r" parameter that's the first (larger) size |
|
| |
|
| |
|
| |
EXAMPLE COMPARISONS - MULTIPLICATION |
| |
|
| |
mul_basecase takes a ".<r>" parameter which is the first (larger) size |
| parameter. For example to show speeds for 20x1 up to 20x15 in cycles, |
parameter. For example to show speeds for 20x1 up to 20x15 in cycles, |
| |
|
| ./speed -s 1-15 -c mpn_mul_basecase.20 |
./speed -s 1-15 -c mpn_mul_basecase.20 |
| Line 221 up to twice as fast as mul_basecase. In practice loop |
|
| Line 283 up to twice as fast as mul_basecase. In practice loop |
|
| products on the diagonal mean it falls short of this. Here's an example |
products on the diagonal mean it falls short of this. Here's an example |
| running the two and showing by what factor an NxN mul_basecase is slower |
running the two and showing by what factor an NxN mul_basecase is slower |
| than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes |
than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes |
| below KARATSUBA_SQR_THRESHOLD, so if it crashes at that point don't worry.) |
below SQR_KARATSUBA_THRESHOLD, so if it crashes at that point don't worry.) |
| |
|
| ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase |
./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase |
| |
|
|
|
| ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase |
./speed -s 10-20 -t 10 -CDE mpn_mul_basecase |
| ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase |
./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase |
| |
|
| |
Two versions of toom3 interpolation and evaluation are available in |
| |
mpn/generic/mul_n.c, using either a one-pass open-coded style or simple mpn |
| |
subroutine calls. The former is used on RISCs with lots of registers, the |
| |
latter on other CPUs. The two can be compared directly to check which is |
| |
best. Naturally it's sizes where toom3 is faster than karatsuba that are of |
| |
interest. |
| |
|
| |
./speed -s 80-120 -c mpn_toom3_mul_n_mpn mpn_toom3_mul_n_open |
| |
./speed -s 80-120 -c mpn_toom3_sqr_n_mpn mpn_toom3_sqr_n_open |
| |
|
| |
|
| |
|
| |
|
| |
EXAMPLE COMPARISONS - MALLOC |
| |
|
| The gmp manual recommends application programs avoid excessive initializing |
The gmp manual recommends application programs avoid excessive initializing |
| and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new |
and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new |
| variable will at a minimum go through an init, a realloc for its first |
variable will at a minimum go through an init, a realloc for its first |
| store, and finally a clear. Quite how long that takes depends on the C |
store, and finally a clear. Quite how long that takes depends on the C |
| library. The following compares an mpz_init/realloc/clear to a 10 limb |
library. The following compares an mpz_init/realloc/clear to a 10 limb |
| mpz_add. |
mpz_add. Don't be surprised if the mallocing is quite slow. |
| |
|
| ./speed -s 10 -c mpz_init_realloc_clear mpz_add |
./speed -s 10 -c mpz_init_realloc_clear mpz_add |
| |
|
| The normal libtool link of the speed program does a static link to libgmp.la |
On some systems malloc and free are much slower when dynamic linked. The |
| and libspeed.la, but will end up dynamic linked to libc. Depending on the |
speed-dynamic program can be used to see this. For example the following |
| system, a dynamic linked malloc may be noticeably slower than static linked, |
measures malloc/free, first static then dynamic. |
| and you may want to re-run the libtool link invocation to static link libc |
|
| for comparison. The example below does a 10 limb malloc/free or |
|
| malloc/realloc/free to test the C library. Of course a real world program |
|
| has big problems if it's doing so many mallocs and frees that it gets slowed |
|
| down by a dynamic linked malloc. |
|
| |
|
| ./speed -s 10 -c malloc_free malloc_realloc_free |
./speed -s 10 -c malloc_free |
| |
./speed-dynamic -s 10 -c malloc_free |
| |
|
| |
Of course a real world program has big problems if it's doing so many |
| |
mallocs and frees that it gets slowed down by a dynamic linked malloc. |
| |
|
| |
|
| |
|
| |
|
| |
|
| |
EXAMPLE COMPARISONS - STRING CONVERSIONS |
| |
|
| |
mpn_get_str does a binary to string conversion. The base is specified with |
| |
a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster |
| |
than general bases. The following compares decimal and hex for instance. |
| |
|
| |
./speed -s 1-20 -c mpn_get_str mpn_get_str.16 |
| |
|
| |
Smaller bases need more divisions to split a given size number, and so are |
| |
slower. The following compares base 3 and base 9. On small operands 9 will |
| |
be nearly twice as fast, though at bigger sizes this reduces since in the |
| |
current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit |
| |
limbs) and those divisions come to dominate. |
| |
|
| |
./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9 |
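The 3^20 and 3^40 figures can be reproduced by assuming the divisor is the largest power of the base that fits in a limb. That assumption matches the numbers quoted above but is not taken from the mpn_get_str source; big_base is a hypothetical helper.

```python
def big_base(base, limb_bits):
    """Return (power, digits): the largest power of base below 2**limb_bits
    and the number of digits of that base it represents."""
    power, digits = 1, 0
    while power * base < 2 ** limb_bits:
        power *= base
        digits += 1
    return power, digits


for bits in (32, 64):
    for base in (3, 9):
        power, digits = big_base(base, bits)
        print(bits, base, power, digits)
```

Under this assumption base 3 and base 9 end up with the same divisor (3^20 is 9^10, and 3^40 is 9^20), but each division yields half as many digits in base 9.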
| |
|
| |
mpn_set_str does a string to binary conversion. The base is specified with |
| |
a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than |
| |
general bases on large conversions. |
| |
|
| |
./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10 |
| |
|
| |
mpn_set_str also has some special case code for decimal which is a bit |
| |
faster than the general case, basically by giving the compiler a chance to |
| |
optimize some multiplications by 10. |
| |
|
| |
./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11 |
| |
|
| |
|
| |
|
| |
|
| |
EXAMPLE COMPARISONS - GCDs |
| |
|
| |
mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both |
| |
x and y are single limbs. This isn't tuned currently, but a value can be |
| |
established by a measurement like |
| |
|
| |
./speed -s 10-32 mpn_gcd_1.10 |
| |
|
| |
This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div |
| |
threshold is high, say 31 so it's effectively disabled, then a 32x10 bit gcd |
| |
is done by nibbling away at the 32-bit operands bit-by-bit. When the |
| |
threshold is small, say 1 bit, then an initial x%y is done to reduce it to a |
| |
10x10 bit operation. |
| |
|
| |
The threshold in mpn/generic/gcd_1.c or the various assembler |
| |
implementations can be tweaked up or down until there are no more speedups on |
| |
interesting combinations of sizes. Note that this affects only a 1x1 limb |
| |
operation and so isn't very important. (An Nx1 limb operation always does |
| |
an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.) |
| |
|
| |
|
| |
|
| |
|
| SPEED PROGRAM EXTENSIONS |
SPEED PROGRAM EXTENSIONS |
| |
|
| Potentially lots of things could be made available in the program, but it's |
Potentially lots of things could be made available in the program, but it's |
| Line 284 Extensions should be fairly easy to make though. spee |
|
| Line 415 Extensions should be fairly easy to make though. spee |
|
| in a style that should suit one-off tests, or new code fragments under |
in a style that should suit one-off tests, or new code fragments under |
| development. |
development. |
| |
|
| |
many.pl is a script for generating a new speed program supplemented with |
| |
alternate versions of the standard routines. It can be used for measuring |
| |
experimental code, or for comparing different implementations that exist |
| |
within a CPU family. |
| |
|
| |
|
| |
|
| |
|
| THRESHOLD EXAMINING |
THRESHOLD EXAMINING |
| |
|
| The speed program can be used to examine the speeds of different algorithms |
The speed program can be used to examine the speeds of different algorithms |
| Line 297 the karatsuba multiply threshold, |
|
| Line 433 the karatsuba multiply threshold, |
|
| |
|
| When examining the toom3 threshold, remember it depends on the karatsuba |
When examining the toom3 threshold, remember it depends on the karatsuba |
| threshold, so the right karatsuba threshold needs to be compiled into the |
threshold, so the right karatsuba threshold needs to be compiled into the |
| library first. The tune program uses special recompiled versions of |
library first. The tune program uses specially recompiled versions of |
| mpn/mul_n.c etc for this reason, but the speed program simply uses the |
mpn/mul_n.c etc for this reason, but the speed program simply uses the |
| normal libgmp.la. |
normal libgmp.la. |
| |
|
| Note further that the various routines may recurse into themselves on sizes |
Note further that the various routines may recurse into themselves on sizes |
| far enough above applicable thresholds. For example, mpn_kara_mul_n will |
far enough above applicable thresholds. For example, mpn_kara_mul_n will |
| recurse into itself on sizes greater than twice the compiled-in |
recurse into itself on sizes greater than twice the compiled-in |
| KARATSUBA_MUL_THRESHOLD. |
MUL_KARATSUBA_THRESHOLD. |
| |
|
| When doing the above comparison between mul_basecase and kara_mul_n what's |
When doing the above comparison between mul_basecase and kara_mul_n what's |
| probably of interest is mul_basecase versus a kara_mul_n that does one level |
probably of interest is mul_basecase versus a kara_mul_n that does one level |
| of Karatsuba then calls to mul_basecase, but this only happens on sizes less |
of Karatsuba then calls to mul_basecase, but this only happens on sizes less |
| than twice the compiled KARATSUBA_MUL_THRESHOLD. A larger value for that |
than twice the compiled MUL_KARATSUBA_THRESHOLD. A larger value for that |
| setting can be compiled-in to avoid the problem if necessary. The same |
setting can be compiled-in to avoid the problem if necessary. The same |
| applies to toom3 and BZ, though in a trickier fashion. |
applies to toom3 and DC, though in a trickier fashion. |
| |
|
| There are some upper limits on some of the thresholds, arising from arrays |
There are some upper limits on some of the thresholds, arising from arrays |
| dimensioned according to a threshold (mpn_mul_n), or asm code with certain |
dimensioned according to a threshold (mpn_mul_n), or asm code with certain |
| Line 321 values for the thresholds, even just for testing, may |
|
| Line 457 values for the thresholds, even just for testing, may |
|
| |
|
| |
|
| |
|
| THINGS AFFECTING THRESHOLDS |
|
| |
|
| The following are some general notes on some things that can affect the |
|
| various algorithm thresholds. |
|
| |
|
| KARATSUBA_MUL_THRESHOLD |
|
| |
|
| At size 2N, karatsuba does three NxN multiplies and some adds and |
|
| shifts, compared to a 2Nx2N basecase multiply which will be roughly |
|
| equivalent to four NxN multiplies. |
|
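The three-multiply idea can be shown on ordinary integers. This is the textbook form of one Karatsuba level, given as a sketch; it is not the mpn_kara_mul_n code, and the function name and the n_bits parameter are made up for the example.

```python
def karatsuba_one_level(x, y, n_bits):
    """Multiply x by y, each below 2**(2*n_bits), with three half-size products."""
    mask = (1 << n_bits) - 1
    x1, x0 = x >> n_bits, x & mask   # high and low halves of x
    y1, y0 = y >> n_bits, y & mask   # high and low halves of y

    high = x1 * y1                   # first NxN multiply
    low = x0 * y0                    # second NxN multiply
    # Third multiply; subtracting high and low leaves x1*y0 + x0*y1.
    middle = (x1 + x0) * (y1 + y0) - high - low

    return (high << (2 * n_bits)) + (middle << n_bits) + low


x = 0xFFFFFFFFFFFFFFFF
y = 0x123456789ABCDEF0
assert karatsuba_one_level(x, y, 32) == x * y
print("one Karatsuba level agrees with the direct product")
```

A schoolbook split would compute x1*y0 and x0*y1 separately, giving four half-size products instead of three.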
| |
|
| Fast mul - increases threshold |
|
| |
|
| If the CPU has a fast multiply, the basecase multiplies are going |
|
| to stay faster than the karatsuba overheads for longer. Conversely |
|
| if the CPU has a slow multiply the karatsuba method trading some |
|
| multiplies for adds will become worthwhile sooner. |
|
| |
|
| Remember it's "addmul" performance that's of interest here. This |
|
| may differ from a simple "mul" instruction in the CPU. For example |
|
| K6 has a 3 cycle mul but takes nearly 8 cycles/limb for an addmul, |
|
| and K7 has a 6 cycle mul latency but has a 4 cycle/limb addmul due |
|
| to pipelining. |
|
| |
|
| Unrolled addmul - increases threshold |
|
| |
|
| If the CPU addmul routine (or the addmul part of the mul_basecase |
|
| routine) is unrolled it can mean that a 2Nx2N multiply is a bit |
|
| faster than four NxN multiplies, due to proportionally less looping |
|
| overheads. This can be thought of as the addmul warming to its |
|
| task on bigger sizes, and keeping the basecase better than |
|
| karatsuba for longer. |
|
| |
|
| Karatsuba overheads - increases threshold |
|
| |
|
| Fairly obviously anything gained or lost in the karatsuba extra |
|
| calculations will translate directly to the threshold. But |
|
| remember the extra calculations are likely to always be a |
|
| relatively small fraction of the total multiply time and in that |
|
| sense the basecase code is the best place to be looking for |
|
| optimizations. |
|
| |
|
| KARATSUBA_SQR_THRESHOLD |
|
| |
|
| Squaring is essentially the same as multiplying, so the above applies |
|
| to squaring too. Fixed overheads will, proportionally, be bigger when |
|
| squaring, leading to a higher threshold usually. |
|
| |
|
| mpn/generic/sqr_basecase.c |
|
| |
|
| This relies on a reasonable umul_ppmm, and if the generic C code is |
|
| being used it may badly affect the speed. Don't bother paying |
|
| attention to the square thresholds until you have either a good |
|
| umul_ppmm or an assembler sqr_basecase. |
|
| |
|
| TOOM3_MUL_THRESHOLD |
|
| |
|
| At size N, toom3 does five (N/3)x(N/3) multiplies and some extra |
|
| calculations, compared to karatsuba doing three (N/2)x(N/2) |
|
| multiplies and some extra calculations (fewer). Toom3 will become |
|
| better before long, being O(n^1.465) versus karatsuba at O(n^1.585), |
|
| but exactly where depends a great deal on the implementations of all |
|
| the relevant bits of extra calculation. |
|
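The two exponents follow from the multiply counts: five multiplies on thirds gives log 5 / log 3, and three multiplies on halves gives log 3 / log 2. A quick check in Python:

```python
import math

# Toom3: five multiplies, each on operands one third the size.
toom3_exponent = math.log(5) / math.log(3)
# Karatsuba: three multiplies, each on operands half the size.
karatsuba_exponent = math.log(3) / math.log(2)

print(round(toom3_exponent, 3), round(karatsuba_exponent, 3))
```

This prints 1.465 and 1.585. The exponents say nothing about the extra calculations, which is why the actual crossover has to be measured.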
| |
|
| In practice the curves for time versus size on toom3 and karatsuba |
|
| have similar slopes near their crossover, leading to a range of sizes |
|
| where there's very little difference between the two. Choosing a |
|
| single value from the range is a bit arbitrary and will lead to |
|
| slightly different values on successive runs of the tune program. |
|
| |
|
| divexact_by3 - used by toom3 |
|
| |
|
| Toom3 does a divexact_by3 which at size N is roughly equivalent to |
|
| N successively dependent multiplies with a further couple of extra |
|
| instructions in between. CPUs with a low latency multiply and good |
|
| divexact_by3 implementation should see the toom3 threshold lowered. |
|
| But note this is unlikely to have much effect on total multiply |
|
| times. |
|
| |
|
| Asymptotic behaviour |
|
| |
|
| At the fairly small sizes where the thresholds occur it's worth |
|
| remembering that the asymptotic behaviour for karatsuba and toom3 |
|
| can't be expected to make accurate predictions, due of course to |
|
| the big influence of all sorts of overheads, and the fact that only |
|
| a few recursions of each are being performed. |
|
| |
|
| Even at large sizes there's a good chance machine dependent effects |
|
| like cache architecture will mean actual performance deviates from |
|
| what might be predicted. This is why the rather positivist |
|
| approach of just measuring things has been adopted, in general. |
|
| |
|
| TOOM3_SQR_THRESHOLD |
|
| |
|
| The same factors apply to squaring as to multiplying, though with |
|
| overheads being proportionally a bit bigger. |
|
| |
|
| FFT_MUL_THRESHOLD, etc |
|
| |
|
| When configured with --enable-fft, a Fermat style FFT is used for |
|
| multiplication above FFT_MUL_THRESHOLD, and a further threshold |
|
| FFT_MODF_MUL_THRESHOLD exists for where FFT is used for a modulo 2^N+1 |
|
| multiply. FFT_MUL_TABLE is the thresholds at which each split size |
|
| "k" is used in the FFT. |
|
| |
|
| step effect - coarse grained thresholds |
|
| |
|
| The FFT has size restrictions that mean it rounds up sizes to |
|
| certain multiples and therefore does the same amount of work for a |
|
| range of different sized operands. For example at k=8 the size is |
|
| internally rounded to a multiple of 1024 limbs. The current single |
|
| values for the various thresholds are set to give good average |
|
| performance, but in the future multiple values might be wanted to |
|
| take into account the different step sizes for different "k"s. |
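
The step effect can be pictured with a small sketch. The function name
fft_padded_size is hypothetical, and the default step of 1024 limbs is just
the k=8 figure quoted above; other values of k would use other steps.

```python
def fft_padded_size(n_limbs, step=1024):
    """Round n_limbs up to the next multiple of step.

    Hypothetical helper: 1024 is the k=8 rounding mentioned in the text.
    """
    return -(-n_limbs // step) * step


# Operands of 1025 limbs and of 2048 limbs are both handled as 2048 limbs,
# so the whole range costs about the same.
sizes = [fft_padded_size(n) for n in (1025, 1500, 2048, 2049)]
```

A single threshold value cannot follow this staircase exactly, which is why
the current values aim only for good average performance.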
|
| |
|
| FFT_SQR_THRESHOLD, etc |
|
| |
|
| The same considerations apply as for multiplications, plus the |
|
| following. |
|
| |
|
| similarity to mul thresholds |
|
| |
|
| On some CPUs the squaring thresholds are nearly the same as those |
|
| for multiplying. It's not quite clear why this is; it might be that |

| the mul and sqrs recursed into have similarly shaped size/time graphs. |
|
| |
|
| BZ_THRESHOLD |
|
| |
|
| The B-Z division algorithm rearranges a traditional multi-precision |
|
| long division so that NxN multiplies can be done rather than repeated |
|
| Nx1 multiplies, thereby exploiting the algorithmic advantages of |
|
| karatsuba and toom3, and leading to significant speedups. |
|
| |
|
| fast mul_basecase - decreases threshold |
|
| |
|
| CPUs with an optimized mul_basecase can expect a lower B-Z |
|
| threshold due to the helping hand such a mul_basecase will give to |
|
| B-Z as compared to submul_1 used in the schoolbook method. |
|
| |
|
| GCD_ACCEL_THRESHOLD |
|
| |
|
| Below this threshold a simple binary subtract and shift is used; above |
|
| it Ken Weber's accelerated algorithm is used. The accelerated GCD |
|
| performs far fewer steps than the binary GCD and will normally kick in |
|
| at quite small sizes. |
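
For reference, a minimal sketch of the subtract and shift idea follows. It is
restricted to positive odd operands as a simplifying assumption, and the name
binary_gcd is made up for the example; it says nothing about how GMP itself
handles even operands or multi-limb values.

```python
def binary_gcd(a, b):
    """GCD of two positive odd integers using only subtract and shift."""
    assert a > 0 and b > 0 and a % 2 == 1 and b % 2 == 1
    while a != b:
        if a < b:
            a, b = b, a
        a -= b                  # odd minus odd is even, and nonzero here
        while a % 2 == 0:
            a >>= 1             # strip factors of 2, which b does not have
    return a
```

Each pass removes only a bit or so from the larger operand, which is why an
algorithm taking far fewer steps wins at quite small sizes.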
|
| |
|
| modlimb_invert and find_a - affect threshold |
|
| |
|
| At small sizes the performance of modlimb_invert and find_a will |
|
| affect the accelerated algorithm and CPUs where those routines are |
|
| not well optimized may see a higher threshold. (At large sizes |
|
| mpn_addmul_1 and mpn_submul_1 come to dominate the accelerated |
|
| algorithm.) |
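
As an idea of the work a limb inverse involves, the sketch below inverts an
odd value modulo 2^64 by Newton iteration. This is an assumption about the
general technique, not a copy of GMP's modlimb_invert; the 64-bit limb size
and the name limb_inverse are chosen for illustration only.

```python
LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1


def limb_inverse(n):
    """Inverse of odd n modulo 2^64 by Newton iteration (illustrative)."""
    assert n % 2 == 1
    inv = n & LIMB_MASK          # n*n == 1 mod 8, so 3 bits are correct
    for _ in range(5):           # correct bits: 3 -> 6 -> 12 -> 24 -> 48 -> 96
        inv = (inv * (2 - n * inv)) & LIMB_MASK
    return inv
```

Every iteration is a pair of dependent multiplies, so a slow multiply shows
up directly in the cost of the inverse.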
|
| |
|
| GCDEXT_THRESHOLD |
|
| |
|
| mpn/generic/gcdext.c is based on Lehmer's multi-step improvement of |
|
| Euclid's algorithm. The multipliers are found using single limb |
|
| calculations below GCDEXT_THRESHOLD, or double limb calculations |
|
| above. The single limb code is fast but doesn't produce full-limb |
|
| multipliers. |
|
| |
|
| data-dependent multiplier - big threshold |
|
| |
|
| If multiplications done by mpn_mul_1, addmul_1 and submul_1 run |
|
| slower when there are more bits in the multiplier, then producing |
|
| bigger multipliers with the double limb calculation doesn't save |
|
| much more than some looping and function call overheads. A large |
|
| threshold can then be expected. |
|
| |
|
| slow division - low threshold |
|
| |
|
| The single limb calculation does some plain "/" divisions, whereas |
|
| the double limb calculation has a divide routine optimized for the |
|
| small quotients that often occur. Until the single limb code does |
|
| something similar a slow hardware divide will count against it. |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| FUTURE |
FUTURE |
| |
|
| Make a program to check the time base is working properly, for small and |
Make a program to check the time base is working properly, for small and |
| large measurements. Make it able to test each available method, including |
large measurements. Make it able to test each available method, including |
| perhaps the apparent resolution of each. |
perhaps the apparent resolution of each. |
| |
|
| Add versions of the toom3 multiplication using either the mpn calls or the |
Make a general mechanism for specifying operand overlap, and a syntax like |
| open-coded style, so the two can be compared. |
maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this |
| |
sort of thing with the "r" parameter currently. |
| Add versions of the generic C mpn_divrem_1 using straight division versus a |
|
| multiply by inverse, so the two can be compared. Include the branch-free |
|
| version of multiply by inverse too. |
|
| |
|
| Make an option in struct speed_parameters to specify operand overlap, |
|
| perhaps 0 for none, 1 for dst=src1, 2 for dst=src2, 3 for dst1=src1 |
|
| dst2=src2, 4 for dst1=src2 dst2=src1. This is done for addsub_n with the r |
|
| parameter (though addsub_n isn't yet enabled), and could be done for add_n, |
|
| xor_n, etc too. |
|
| |
|
| When speed_measure() divides the total time measured by repetitions |
|
| performed, it also divides the fixed overheads imposed by speed_starttime() and |
|
| speed_endtime(). When different routines are run with different repetitions |
|
| the overhead will then be differently counted. It would improve precision |
|
| to try to avoid this. Currently the idea is just to set speed_precision big |
|
| enough that the effect is insignificant compared to the routines being |
|
| measured. |
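
The size of the effect can be worked through with made-up numbers. The
function per_call_estimate and its arguments are hypothetical and model only
the arithmetic described above, not the actual speed_measure() code.

```python
def per_call_estimate(true_time, overhead, reps):
    """Time per call as a measuring loop would report it.

    The loop times reps calls plus one fixed start/stop overhead, then
    divides the total by reps.
    """
    return (reps * true_time + overhead) / reps


# With a true cost of 1.0 and a fixed overhead of 10.0 (arbitrary units):
few = per_call_estimate(1.0, 10.0, 10)      # overhead adds 1.0 per call
many = per_call_estimate(1.0, 10.0, 1000)   # overhead adds 0.01 per call
```

Two routines with the same true cost but run with different repetition
counts would therefore be reported differently unless the total measured
time is large compared to the overhead.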
|
| |
|
| |
|
| |
|
| |
|