   Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
   
   This file is part of the GNU MP Library.
   
   The GNU MP Library is free software; you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as published by
   the Free Software Foundation; either version 2.1 of the License, or (at your
   option) any later version.
   
   The GNU MP Library is distributed in the hope that it will be useful, but
   WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
   License for more details.
   
   You should have received a copy of the GNU Lesser General Public License
   along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
   the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
   02111-1307, USA.
   
   
   
   
   
                GMP SPEED MEASURING AND PARAMETER TUNING
   
   
The programs in this directory are for knowledgeable users who want to
measure GMP routines on their machine, and perhaps tweak some settings or
identify things that can be improved.
   
The programs here are tools, not ready to run solutions.  Nothing is built
in a normal "make all", but various Makefile targets described below exist.
   
Relatively few systems and CPUs have been tested, so be sure to verify that
results are sensible before relying on them.
   
   
   
   
MISCELLANEOUS NOTES
   
  --enable-assert

      Don't configure with --enable-assert, since the extra code added by
      assertion checking may influence measurements.

  Direct mapped caches

      Some effort has been made to accommodate CPUs with direct mapped caches,
      by putting data blocks more or less contiguously on the stack.  But this
      will depend on TMP_ALLOC using alloca, and even then it may or may not
      be enough.

  sparc32/v9 addmul_1

      The sparc32/v9 addmul_1 code runs at noticeably different speeds on
      successive sizes, and this has a bad effect on the tune program's
      determinations of the multiply and square thresholds.
   
   FreeBSD 4.2 i486 getrusage
   
      This getrusage seems a bit doubtful: it looks microsecond accurate, but
      sometimes ru_utime remains unchanged after many microseconds have
      elapsed.  It'd be good to detect this in the time.c initializations,
      but for now the suggestion is to pretend it doesn't exist.
   
           ./configure ac_cv_func_getrusage=no
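
      As a rough check on a particular system, something along these lines
      can be used (a standalone probe written for this README, not part of
      the GMP sources): it watches ru_utime and reports the smallest
      non-zero step seen.  A genuinely microsecond accurate getrusage should
      show steps of just a few microseconds; a clock-tick based one will
      show something like 10000.

        /* getrusage-probe.c -- rough check of ru_utime granularity.
           Hypothetical helper, not part of GMP.  */
        #include <stdio.h>
        #include <sys/time.h>
        #include <sys/resource.h>

        static long usec (const struct timeval *tv)
        {
          return tv->tv_sec * 1000000L + tv->tv_usec;
        }

        int main (void)
        {
          struct rusage  prev, now;
          long           step, smallest = -1;
          long           i;

          getrusage (RUSAGE_SELF, &prev);
          for (i = 0; i < 1000000; i++)
            {
              getrusage (RUSAGE_SELF, &now);
              step = usec (&now.ru_utime) - usec (&prev.ru_utime);
              if (step > 0 && (smallest < 0 || step < smallest))
                smallest = step;
              prev = now;
            }
          printf ("smallest ru_utime step seen: %ld us\n", smallest);
          return 0;
        }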
   
   NetBSD 1.4.1 m68k macintosh time base
   
      On this system it's been found getrusage often goes backwards, making it
      unusable (configure is set up to ignore it).  gettimeofday sometimes
       doesn't update atomically when it crosses a 1 second boundary.  Not sure
       what to do about this.  Expect intermittent failures.
   
   SCO OpenUNIX 8 /etc/hw
   
       /etc/hw takes about a second to return the cpu frequency, which suggests
       perhaps it's measuring each time it runs.  If this is annoying when
       running the speed program repeatedly then set a GMP_CPU_FREQUENCY
       environment variable (see TIME BASE section below).
   
   Low resolution timebase
   
       Parameter tuning can be very time consuming if the only timebase
       available is a 10 millisecond clock tick, to the point of being
       unusable.  This is currently the case on VAX and ARM systems.
   
   
   
   
PARAMETER TUNING
   
 The "tuneup" program runs some tests designed to find the best settings for  The "tuneup" program runs some tests designed to find the best settings for
 various thresholds, like KARATSUBA_MUL_THRESHOLD.  Its output can be put  various thresholds, like MUL_KARATSUBA_THRESHOLD.  Its output can be put
 into gmp-mparam.h.  The program can be built and run with  into gmp-mparam.h.  The program is built and run with
   
        make tune
   
If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then there may be a performance boost in applicable
size ranges by changing gmp-mparam.h accordingly.
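
For reference, the tuneup output is a block of #define lines intended for
the chosen gmp-mparam.h, something like the following.  The exact set of
names varies between GMP versions, and the numbers here are made up for
illustration, not recommendations.

        /* illustrative values only -- real ones come from "make tune"
           on the machine in question */
        #define MUL_KARATSUBA_THRESHOLD          26
        #define MUL_TOOM3_THRESHOLD             177

        #define SQR_KARATSUBA_THRESHOLD          50
        #define SQR_TOOM3_THRESHOLD             173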
   
Be sure to do a full reconfigure and rebuild to get any newly set thresholds
to take effect.  A partial rebuild is enough sometimes, but a fresh
configure and make is certain to be correct.
   
If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
the mpn subdirectories then the values from "make tune" should be similar.
New values can be submitted if the current ones look out of date or wildly
wrong, but check that the configured CPU is right and that there are no
machine-specific effects causing a difference.
   
It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
between assembler subroutines.  Missing out on the longlong.h macros by not
using gcc will probably have an effect.
   
Some thresholds produced by the tune program are merely single values chosen
from what's a range of sizes where two algorithms are pretty much the same
speed.  When this happens the program is likely to give somewhat different
values on successive runs.  This is noticeable on the toom3 thresholds for
instance.
   
   
   
SPEED PROGRAM

The "speed" program can measure the speed of GMP routines, and produce
tables of data or gnuplot graphs.  It's built with
   
        make speed
   
   (Or on DOS systems "make speed.exe".)
   
Here are some examples of how to use it.  Check the code for all the
options.
   
Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05,
whichever is greater.
        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
        gnuplot foo.gnuplot
   
Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
showing under mpn_lshift the difference between it and mpn_add_n.

        ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
   
TIME BASE
   
The time measuring method is determined in time.c, based on what the
configured host has available.  A cycle counter is preferred, possibly
supplemented by another method if the counter has a limited range.  A
microsecond accurate getrusage() or gettimeofday() will work quite well too.
   
The cycle counters (except possibly on alpha) and gettimeofday() will depend
on the machine being otherwise idle, or rather on other jobs not stealing
CPU time from the measuring program.  Short routines (those that complete
within a timeslice) should work even on a busy machine.
   
Some trouble is taken by speed_measure() in common.c to avoid ill effects
from sporadic interrupts, or other intermittent things (like cron waking up
every minute).  But generally an idle machine will be necessary to be
certain of consistent results.
   
   The CPU frequency is needed to convert between cycles and seconds, or for
   when a cycle counter is supplemented by getrusage() etc.  The speed program
   will convert as necessary according to the output format requested.  The
   tune program will work with either cycles or seconds.
   
   freq.c knows how to get the frequency on some systems, or can measure a
   cycle counter against gettimeofday() or getrusage(), but when that fails, or
   needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
   used (in Hertz).  For example in "bash" on a 650 MHz machine,
   
        export GMP_CPU_FREQUENCY=650e6
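
As an illustration of measuring a cycle counter against gettimeofday() (a
standalone sketch, not the actual freq.c code; the __rdtsc intrinsic is an
x86 and gcc/clang assumption), something like the following can suggest a
value to use.

        /* Rough x86 estimate: count rdtsc cycles across a gettimeofday()
           interval.  Illustration only, not GMP's freq.c.  */
        #include <stdio.h>
        #include <sys/time.h>
        #include <x86intrin.h>                    /* __rdtsc, gcc/clang */

        int main (void)
        {
          struct timeval      t0, t1;
          unsigned long long  c0, c1;
          double              elapsed;

          gettimeofday (&t0, NULL);
          c0 = __rdtsc ();
          do
            gettimeofday (&t1, NULL);
          while (t1.tv_sec - t0.tv_sec + (t1.tv_usec - t0.tv_usec) * 1e-6 < 0.1);
          c1 = __rdtsc ();

          elapsed = t1.tv_sec - t0.tv_sec + (t1.tv_usec - t0.tv_usec) * 1e-6;
          printf ("export GMP_CPU_FREQUENCY=%.0f\n", (c1 - c0) / elapsed);
          return 0;
        }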
   
A high precision time base makes it possible to get accurate measurements in
a shorter time.  Support for systems and CPUs not already covered is wanted.
   
 When setting up a method, be sure not to claim a higher accuracy than is  
 really available.  For example the default gettimeofday() code is set for  
 microsecond accuracy, but if only 10ms or 55ms is available then  
 inconsistent results can be expected.  
   
   
   
   EXAMPLE COMPARISONS - VARIOUS
   
   Here are some ideas for things that can be done with the speed program.
   
   
There's always going to be a certain amount of overhead in the time
measurements, due to reading the time base, and in the loop that runs a
routine enough times to get a reading of the desired precision.  Noop
routines taking various argument combinations are provided to measure this;
the figure is for information only and isn't deducted from the times printed
or anything.
   
         ./speed -s 1 noop noop_wxs noop_wxys          ./speed -s 1 noop noop_wxs noop_wxys
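
To make the overhead and the repetition loop concrete, here's a simplified
sketch of the sort of measuring loop involved.  This is an illustration
only, not the actual speed_measure() code: the routine is repeated until the
elapsed time is comfortably above the timer resolution, and the best of a
few such runs is kept to dodge interruptions.

        /* Sketch of a measure-by-repetition loop.  GMP's real code is
           speed_measure() in tune/common.c and differs in detail.  */
        #include <sys/time.h>

        static double now_secs (void)
        {
          struct timeval  tv;
          gettimeofday (&tv, NULL);
          return tv.tv_sec + tv.tv_usec * 1e-6;
        }

        static double time_routine (void (*fun) (void))
        {
          double  best = -1.0;
          int     attempt;

          for (attempt = 0; attempt < 5; attempt++)
            {
              unsigned long  reps, i;
              double         t0, t;

              for (reps = 1; ; reps *= 2)
                {
                  t0 = now_secs ();
                  for (i = 0; i < reps; i++)
                    fun ();
                  t = now_secs () - t0;

                  if (t >= 0.01)        /* well above the timer resolution */
                    {
                      t /= reps;        /* per call, overhead included */
                      if (best < 0.0 || t < best)
                        best = t;       /* quickest run = least disturbed */
                      break;
                    }
                }
            }
          return best;
        }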
   
To see how many cycles per limb a routine is taking, look at the time
increase when the size increments, using option -D.  This avoids fixed
overheads in the measuring.  Also, remember many of the assembler routines
have unrolled loops, so it might be necessary to compare times at, say, 16,
32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
finishing off.
   
        ./speed -s 16-64 -t 16 -C -D mpn_add_n
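
For example, with made-up numbers: if 48 limbs measures as 234 cycles and 64
limbs as 310 cycles, the unrolled part is costing about (310-234)/16, or
roughly 4.75 cycles/limb, which is the sort of figure the -D differences
give directly.

        /* made-up figures: the per-limb cost of the unrolled part is the
           time increase divided by the size increase */
        #include <stdio.h>
        int main (void)
        {
          double  t48 = 234.0, t64 = 310.0;   /* cycles, hypothetical */
          printf ("%.2f cycles/limb\n", (t64 - t48) / (64 - 48));  /* 4.75 */
          return 0;
        }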
   
   
When a routine has an unrolled loop for, say, multiples of 8 limbs and then
an ordinary loop for the remainder, it can happen that it's actually faster
to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
draws a graph of mpn_sub_n, to see whether times smoothly increase with
size.
   
        ./speed -s 1-100 -c -P foo mpn_sub_n
        gnuplot foo.gnuplot
   
If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
ought to be faster (or at least not slower) than shifting by, say, 2 bits.
   
        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
   
An mpn_lshift by 1 can also be compared against an mpn_add_n adding a number
to itself, since both double the operand; if the lshift isn't faster there's
an obvious improvement possible.
   
On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
destination is one of the sources is faster than a separate destination.
Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
mpn_sub_n); for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.

        ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
   
The gmp manual points out that divisions by powers of two should be done
using a right shift because it'll be significantly faster than an actual
division.  The following shows by what factor mpn_rshift is faster than
mpn_divrem_1, using division by 32 as an example.
   
        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
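
The underlying equivalence is simply that dividing by 32 is a right shift by
5 bits.  A small sketch using the documented mpn calls (array sizes chosen
arbitrarily here):

        /* Divide {src,10} by 32 two ways; the quotients are identical,
           only the speed differs.  */
        #include <assert.h>
        #include <gmp.h>

        int main (void)
        {
          mp_limb_t  src[10], q_div[10], q_shift[10], rem, out;
          mp_size_t  n = 10;

          mpn_random (src, n);

          rem = mpn_divrem_1 (q_div, 0, src, n, 32);  /* quotient + remainder */
          out = mpn_rshift (q_shift, src, n, 5);      /* shifted-out bits */

          assert (mpn_cmp (q_div, q_shift, n) == 0);
          (void) rem;  (void) out;
          return 0;
        }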
   
   
   
   EXAMPLE COMPARISONS - MULTIPLICATION
   
   mul_basecase takes a ".<r>" parameter which is the first (larger) size
  parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,
   
        ./speed -s 1-15 -c mpn_mul_basecase.20
In theory sqr_basecase can be up to twice as fast as mul_basecase.  In
practice looping overheads and the products on the diagonal mean it falls
short of this.  Here's an example running the two and showing by what factor
an NxN mul_basecase is slower than an NxN sqr_basecase.  (Some versions of
sqr_basecase only allow sizes below SQR_KARATSUBA_THRESHOLD, so if it
crashes at that point don't worry.)
   
        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
   
        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
   
   Two versions of toom3 interpolation and evaluation are available in
   mpn/generic/mul_n.c, using either a one-pass open-coded style or simple mpn
   subroutine calls.  The former is used on RISCs with lots of registers, the
   latter on other CPUs.  The two can be compared directly to check which is
   best.  Naturally it's sizes where toom3 is faster than karatsuba that are of
   interest.
   
           ./speed -s 80-120 -c mpn_toom3_mul_n_mpn mpn_toom3_mul_n_open
           ./speed -s 80-120 -c mpn_toom3_sqr_n_mpn mpn_toom3_sqr_n_open
   
   
   
   
   EXAMPLE COMPARISONS - MALLOC
   
The gmp manual recommends application programs avoid excessive initializing
and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
variable will at a minimum go through an init, a realloc for its first
store, and finally a clear.  Quite how long that takes depends on the C
library.  The following compares an mpz_init/realloc/clear to a 10 limb
mpz_add.  Don't be surprised if the mallocing is quite slow.
   
        ./speed -s 10 -c mpz_init_realloc_clear mpz_add
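
The recommendation amounts to hoisting the init and clear out of loops,
along these lines (a schematic illustration; the function names are made
up):

        #include <gmp.h>

        /* wasteful: an init, a realloc on the first store, and a clear,
           every time around the loop */
        void sum_slow (mpz_t total, const mpz_t *values, int n)
        {
          int  i;
          for (i = 0; i < n; i++)
            {
              mpz_t  t;
              mpz_init (t);
              mpz_add (t, total, values[i]);
              mpz_set (total, t);
              mpz_clear (t);
            }
        }

        /* better: work in the existing variable, no init/clear per iteration */
        void sum_fast (mpz_t total, const mpz_t *values, int n)
        {
          int  i;
          for (i = 0; i < n; i++)
            mpz_add (total, total, values[i]);
        }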
   
On some systems malloc and free are much slower when dynamic linked.  The
speed-dynamic program can be used to see this.  For example the following
measures malloc/free, first static then dynamic.

        ./speed -s 10 -c malloc_free
        ./speed-dynamic -s 10 -c malloc_free
   
   Of course a real world program has big problems if it's doing so many
   mallocs and frees that it gets slowed down by a dynamic linked malloc.
   
   
   
   
   
   EXAMPLE COMPARISONS - STRING CONVERSIONS
   
   mpn_get_str does a binary to string conversion.  The base is specified with
   a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
   than general bases.  The following compares decimal and hex for instance.
   
           ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
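
For reference, mpn_get_str produces raw digit values rather than ASCII, and
for bases that aren't a power of 2 it clobbers its input.  A minimal usage
sketch (operand value made up):

        #include <stdio.h>
        #include <gmp.h>

        int main (void)
        {
          mp_limb_t      num[2] = { 12345, 6789 };  /* arbitrary, high limb != 0 */
          unsigned char  digits[64];                /* ample for 2 limbs in base 10 */
          size_t         len, i;

          /* digits[] gets values 0-9, most significant first; num[] is
             clobbered since 10 isn't a power of 2 */
          len = mpn_get_str (digits, 10, num, 2);
          for (i = 0; i < len; i++)
            putchar ('0' + digits[i]);
          putchar ('\n');
          return 0;
        }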
   
   Smaller bases need more divisions to split a given size number, and so are
   slower.  The following compares base 3 and base 9.  On small operands 9 will
   be nearly twice as fast, though at bigger sizes this reduces since in the
   current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
   limbs) and those divisions come to dominate.
   
           ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
   
   mpn_set_str does a string to binary conversion.  The base is specified with
   a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
   general bases on large conversions.
   
           ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
   
   mpn_set_str also has some special case code for decimal which is a bit
   faster than the general case, basically by giving the compiler a chance to
   optimize some multiplications by 10.
   
           ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
   
   
   
   
   EXAMPLE COMPARISONS - GCDs
   
   mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
   x and y are single limbs.  This isn't tuned currently, but a value can be
   established by a measurement like
   
           ./speed -s 10-32 mpn_gcd_1.10
   
  This runs src[0] from 10 to 32 bits, with y fixed at 10 bits.  If the div
  threshold is high, say 31 so that it's effectively disabled, then a 32x10
  bit gcd is done by nibbling away at the 32-bit operand bit-by-bit.  When
  the threshold is small, say 1 bit, an initial x%y is done to reduce it to
  a 10x10 bit operation.
   
   The threshold in mpn/generic/gcd_1.c or the various assembler
   implementations can be tweaked up or down until there's no more speedups on
   interesting combinations of sizes.  Note that this affects only a 1x1 limb
   operation and so isn't very important.  (An Nx1 limb operation always does
   an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
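
In outline, the decision being talked about looks something like the
following simplified sketch.  This is an illustration written for this
README, not the code in gcd_1.c, and the bit-difference threshold of 16 is
just a placeholder for whatever value the real code uses.

        /* 1x1 limb gcd: an initial x%y when x has many more bits than y,
           then a plain binary gcd.  */
        typedef unsigned long limb_t;

        static int nbits (limb_t n)
        { int b = 0;  while (n != 0) { b++; n >>= 1; }  return b; }

        limb_t gcd_1_sketch (limb_t x, limb_t y)
        {
          int  twos = 0;

          if (x == 0) return y;
          if (y == 0) return x;

          /* the tunable part: one division up front, worthwhile once x is
             sufficiently bigger than y */
          if (nbits (x) - nbits (y) > 16)
            {
              x %= y;
              if (x == 0) return y;
            }

          /* binary gcd: strip common twos, then subtract and shift */
          while (((x | y) & 1) == 0)  { x >>= 1;  y >>= 1;  twos++; }
          while ((x & 1) == 0)  x >>= 1;
          while ((y & 1) == 0)  y >>= 1;
          while (x != y)
            {
              if (x > y)  { x -= y;  while ((x & 1) == 0) x >>= 1; }
              else        { y -= x;  while ((y & 1) == 0) y >>= 1; }
            }
          return x << twos;
        }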
   
   
   
   
SPEED PROGRAM EXTENSIONS
   
Potentially lots of things could be made available in the program, but it's
been kept to a fairly basic set so far.  Extensions should be fairly easy to
make though.  speed-ext.c is an example, written in a style that should suit
one-off tests, or new code fragments under development.
   
   many.pl is a script for generating a new speed program supplemented with
   alternate versions of the standard routines.  It can be used for measuring
   experimental code, or for comparing different implementations that exist
   within a CPU family.
   
   
   
   
THRESHOLD EXAMINING
   
The speed program can be used to examine the speeds of different algorithms
around their crossover points, for example comparing mpn_mul_basecase and
mpn_kara_mul_n near the karatsuba multiply threshold.
   
When examining the toom3 threshold, remember it depends on the karatsuba
threshold, so the right karatsuba threshold needs to be compiled into the
library first.  The tune program uses specially recompiled versions of
mpn/mul_n.c etc for this reason, but the speed program simply uses the
normal libgmp.la.
   
Note further that the various routines may recurse into themselves on sizes
far enough above applicable thresholds.  For example, mpn_kara_mul_n will
recurse into itself on sizes greater than twice the compiled-in
MUL_KARATSUBA_THRESHOLD.
   
When doing the above comparison between mul_basecase and kara_mul_n what's
probably of interest is mul_basecase versus a kara_mul_n that does one level
of Karatsuba then calls to mul_basecase, but this only happens on sizes less
than twice the compiled MUL_KARATSUBA_THRESHOLD.  A larger value for that
setting can be compiled-in to avoid the problem if necessary.  The same
applies to toom3 and DC, though in a trickier fashion.
   
There are some upper limits on some of the thresholds, arising from arrays
dimensioned according to a threshold (mpn_mul_n), or asm code with certain
built-in limits, so setting huge values for the thresholds, even just for
testing, may not work.
   
   
   
 THINGS AFFECTING THRESHOLDS  
   
The following are some general notes on things that can affect the various
algorithm thresholds.
   
    KARATSUBA_MUL_THRESHOLD  
   
       At size 2N, karatsuba does three NxN multiplies and some adds and  
       shifts, compared to a 2Nx2N basecase multiply which will be roughly  
       equivalent to four NxN multiplies.  
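
      A toy cost model makes the trade-off concrete.  Everything here is
      made up for illustration: one basecase limb product costs 1, and the
      karatsuba adds and shifts cost 6 per limb; real thresholds come only
      from measurement.

          /* Toy model of the karatsuba crossover -- not a tuning method.  */
          #include <stdio.h>

          static double mul_cost (int n, int threshold)
          {
            if (n <= threshold)
              return (double) n * n;              /* basecase: n^2 products */
            return 3 * mul_cost ((n + 1) / 2, threshold)
                   + 6.0 * n;                     /* made-up karatsuba extras */
          }

          int main (void)
          {
            int  n;
            for (n = 8; n <= 64; n++)
              if (mul_cost (n, n - 1) < (double) n * n)  /* one level */
                {
                  printf ("crossover around n = %d in this model\n", n);
                  break;
                }
            return 0;
          }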
   
       Fast mul - increases threshold  
   
          If the CPU has a fast multiply, the basecase multiplies are going  
          to stay faster than the karatsuba overheads for longer.  Conversely  
          if the CPU has a slow multiply the karatsuba method trading some  
          multiplies for adds will become worthwhile sooner.  
   
          Remember it's "addmul" performance that's of interest here.  This  
          may differ from a simple "mul" instruction in the CPU.  For example  
          K6 has a 3 cycle mul but takes nearly 8 cycles/limb for an addmul,  
          and K7 has a 6 cycle mul latency but has a 4 cycle/limb addmul due  
          to pipelining.  
   
       Unrolled addmul - increases threshold  
   
          If the CPU addmul routine (or the addmul part of the mul_basecase  
          routine) is unrolled it can mean that a 2Nx2N multiply is a bit  
          faster than four NxN multiplies, due to proportionally less looping  
          overheads.  This can be thought of as the addmul warming to its  
          task on bigger sizes, and keeping the basecase better than  
          karatsuba for longer.  
   
       Karatsuba overheads - increases threshold  
   
          Fairly obviously anything gained or lost in the karatsuba extra  
          calculations will translate directly to the threshold.  But  
          remember the extra calculations are likely to always be a  
          relatively small fraction of the total multiply time and in that  
          sense the basecase code is the best place to be looking for  
          optimizations.  
   
    KARATSUBA_SQR_THRESHOLD  
   
       Squaring is essentially the same as multiplying, so the above applies  
       to squaring too.  Fixed overheads will, proportionally, be bigger when  
       squaring, leading to a higher threshold usually.  
   
       mpn/generic/sqr_basecase.c  
   
          This relies on a reasonable umul_ppmm, and if the generic C code is  
          being used it may badly affect the speed.  Don't bother paying  
          attention to the square thresholds until you have either a good  
          umul_ppmm or an assembler sqr_basecase.  
   
    TOOM3_MUL_THRESHOLD  
   
       At size N, toom3 does five (N/3)x(N/3) multiplies and some extra  
       calculations, compared to karatsuba doing three (N/2)x(N/2)  
       multiplies and some extra calculations (fewer).  Toom3 will become  
       better before long, being O(n^1.465) versus karatsuba at O(n^1.585),  
       but exactly where depends a great deal on the implementations of all  
       the relevant bits of extra calculation.  
   
       In practice the curves for time versus size on toom3 and karatsuba  
       have similar slopes near their crossover, leading to a range of sizes  
       where there's very little difference between the two.  Choosing a  
       single value from the range is a bit arbitrary and will lead to  
       slightly different values on successive runs of the tune program.  
   
       divexact_by3 - used by toom3  
   
          Toom3 does a divexact_by3 which at size N is roughly equivalent to  
          N successively dependent multiplies with a further couple of extra  
          instructions in between.  CPUs with a low latency multiply and good  
          divexact_by3 implementation should see the toom3 threshold lowered.  
          But note this is unlikely to have much effect on total multiply  
          times.  
   
       Asymptotic behaviour  
   
          At the fairly small sizes where the thresholds occur it's worth  
          remembering that the asymptotic behaviour for karatsuba and toom3  
          can't be expected to make accurate predictions, due of course to  
          the big influence of all sorts of overheads, and the fact that only  
          a few recursions of each are being performed.  
   
          Even at large sizes there's a good chance machine dependent effects  
          like cache architecture will mean actual performance deviates from  
          what might be predicted.  This is why the rather positivist  
          approach of just measuring things has been adopted, in general.  
   
    TOOM3_SQR_THRESHOLD  
   
       The same factors apply to squaring as to multiplying, though with  
       overheads being proportionally a bit bigger.  
   
    FFT_MUL_THRESHOLD, etc  
   
       When configured with --enable-fft, a Fermat style FFT is used for  
       multiplication above FFT_MUL_THRESHOLD, and a further threshold  
       FFT_MODF_MUL_THRESHOLD exists for where FFT is used for a modulo 2^N+1  
        multiply.  FFT_MUL_TABLE lists the thresholds at which each split size
        "k" is used in the FFT.
   
       step effect - coarse grained thresholds  
   
          The FFT has size restrictions that mean it rounds up sizes to  
          certain multiples and therefore does the same amount of work for a  
          range of different sized operands.  For example at k=8 the size is  
          internally rounded to a multiple of 1024 limbs.  The current single  
          values for the various thresholds are set to give good average  
          performance, but in the future multiple values might be wanted to  
          take into account the different step sizes for different "k"s.  
   
    FFT_SQR_THRESHOLD, etc  
   
       The same considerations apply as for multiplications, plus the  
       following.  
   
       similarity to mul thresholds  
   
           On some CPUs the squaring thresholds are nearly the same as those
           for multiplying.  It's not quite clear why this is; it might be
           due to similarly shaped size/time graphs for the muls and sqrs
           recursed into.
   
    BZ_THRESHOLD  
   
       The B-Z division algorithm rearranges a traditional multi-precision  
       long division so that NxN multiplies can be done rather than repeated  
       Nx1 multiplies, thereby exploiting the algorithmic advantages of  
       karatsuba and toom3, and leading to significant speedups.  
   
       fast mul_basecase - decreases threshold  
   
          CPUs with an optimized mul_basecase can expect a lower B-Z  
          threshold due to the helping hand such a mul_basecase will give to  
          B-Z as compared to submul_1 used in the schoolbook method.  
   
    GCD_ACCEL_THRESHOLD  
   
       Below this threshold a simple binary subtract and shift is used, above  
       it Ken Weber's accelerated algorithm is used.  The accelerated GCD  
       performs far fewer steps than the binary GCD and will normally kick in  
       at quite small sizes.  
   
       modlimb_invert and find_a - affect threshold  
   
          At small sizes the performance of modlimb_invert and find_a will  
          affect the accelerated algorithm and CPUs where those routines are  
          not well optimized may see a higher threshold.  (At large sizes  
          mpn_addmul_1 and mpn_submul_1 come to dominate the accelerated  
          algorithm.)  
   
    GCDEXT_THRESHOLD  
   
       mpn/generic/gcdext.c is based on Lehmer's multi-step improvement of  
       Euclid's algorithm.  The multipliers are found using single limb  
       calculations below GCDEXT_THRESHOLD, or double limb calculations  
       above.  The single limb code is fast but doesn't produce full-limb  
       multipliers.  
   
       data-dependent multiplier - big threshold  
   
          If multiplications done by mpn_mul_1, addmul_1 and submul_1 run  
          slower when there's more bits in the multiplier, then producing  
          bigger multipliers with the double limb calculation doesn't save  
          much more than some looping and function call overheads.  A large  
          threshold can then be expected.  
   
       slow division - low threshold  
   
          The single limb calculation does some plain "/" divisions, whereas  
          the double limb calculation has a divide routine optimized for the  
          small quotients that often occur.  Until the single limb code does  
          something similar a slow hardware divide will count against it.  
   
   
   
   
   
FUTURE
   
Make a program to check the time base is working properly, for small and
large measurements.  Make it able to test each available method, including
perhaps the apparent resolution of each.
   
Make a general mechanism for specifying operand overlap, and a syntax like
maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
sort of thing with the "r" parameter currently.

Add versions of the generic C mpn_divrem_1 using straight division versus a
multiply by inverse, so the two can be compared.  Include the branch-free
version of multiply by inverse too.

When speed_measure() divides the total time measured by the repetitions
performed, it also divides the fixed overheads imposed by speed_starttime()
and speed_endtime().  When different routines are run with different numbers
of repetitions the overhead is then counted differently.  It would improve
precision to avoid this.  Currently the idea is just to set speed_precision
big enough that the effect is insignificant compared to the routines being
measured.
   
   
   
   
