
Diff for /OpenXM_contrib/gmp/doc/Attic/tasks.html between version 1.1 and 1.1.1.2

version 1.1, 2000/09/09 14:12:20 version 1.1.1.2, 2003/08/25 16:06:11
Line 13 
Line 13 
   </h1>    </h1>
 </center>  </center>
   
   <font size=-1>
   Copyright 2000, 2001, 2002 Free Software Foundation, Inc. <br><br>
   This file is part of the GNU MP Library. <br><br>
   The GNU MP Library is free software; you can redistribute it and/or modify
   it under the terms of the GNU Lesser General Public License as published
   by the Free Software Foundation; either version 2.1 of the License, or (at
   your option) any later version. <br><br>
   The GNU MP Library is distributed in the hope that it will be useful, but
   WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
   License for more details. <br><br>
   You should have received a copy of the GNU Lesser General Public License
   along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
   the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
   MA 02111-1307, USA.
   </font>
   
   <hr>
   <!-- NB. timestamp updated automatically by emacs -->
 <comment>  <comment>
   An up-to-date html version of this file is available at    This file current as of 20 May 2002.  An up-to-date version is available at
   <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.    <a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.
     Please send comments about this page to
     <a href="mailto:bug-gmp@gnu.org">bug-gmp@gnu.org</a>.
 </comment>  </comment>
   
 <p> This file lists itemized GMP development tasks.  Not all the tasks  <p> These are itemized GMP development tasks.  Not all the tasks
     listed here are suitable for volunteers, but many of them are.      listed here are suitable for volunteers, but many of them are.
     Please see the <a href="projects.html">projects file</a> for more      Please see the <a href="projects.html">projects file</a> for more
     sizeable projects.      sizeable projects.
   
 <h4>Correctness and Completeness</h4>  <h4>Correctness and Completeness</h4>
 <ul>  <ul>
 <li> HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the  
      new instructions.  Unfortunately, the HPUX 9 assembler as well as earlier  
      assemblers reject that directive.  How very clever of HP!  We will have to  
      pass assembler options, and make sure it works with new and old systems  
      and GNU assembler.  
 <li> The various reuse.c tests need to force reallocation by calling  <li> The various reuse.c tests need to force reallocation by calling
      <code>_mpz_realloc</code> with a small (1 limb) size.       <code>_mpz_realloc</code> with a small (1 limb) size.
 <li> One reuse case is missing from mpX/tests/reuse.c: <code>mpz_XXX(a,a,a)</code>.  <li> One reuse case is missing from mpX/tests/reuse.c:
 <li> When printing mpf_t numbers with exponents > 2^53 on machines with 64-bit       <code>mpz_XXX(a,a,a)</code>.
      <code>mp_exp_t</code>, the precision of  <li> When printing <code>mpf_t</code> numbers with exponents &gt;2^53 on
        machines with 64-bit <code>mp_exp_t</code>, the precision of
      <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and       <code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
      <code>mpf_get_str</code> aborts.  Detect and compensate.       <code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
 <li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other       think seriously about using some sort of fixed-point integer value.
      machines that use <code>long long</code> for storing limbs.)       Avoiding unnecessary floating point is probably a good thing in general,
        and it might be faster on some CPUs.
 <li> Make the string reading functions allow the `0x' prefix when the base is  <li> Make the string reading functions allow the `0x' prefix when the base is
      explicitly 16.  They currently only allow that prefix when the base is       explicitly 16.  They currently only allow that prefix when the base is
      unspecified.       unspecified (zero).
 <li> In the development sources, we return abs(a%b) in the  
      <code>mpz_*_ui</code> division routines.  Perhaps make them return the  
      real remainder instead?  Changes return type to <code>signed long int</code>.  
 <li> <code>mpf_eq</code> is not always correct, when one operand is  <li> <code>mpf_eq</code> is not always correct, when one operand is
      1000000000... and the other operand is 0111111111..., i.e., extremely       1000000000... and the other operand is 0111111111..., i.e., extremely
      close.  There is a special case in <code>mpf_sub</code> for this       close.  There is a special case in <code>mpf_sub</code> for this
      situation; put similar code in <code>mpf_eq</code>.       situation; put similar code in <code>mpf_eq</code>.
 <li> mpf_eq doesn't implement what gmp.texi specifies.  It should not use just  <li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
      whole limbs, but partial limbs.       not use just whole limbs, but partial limbs.
 <li> Install Alpha assembly changes (prec/gmp-alpha-patches).  <li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
 <li> NeXT has problems with newlines in asm strings in longlong.h.  Also,       garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
      <code>__builtin_constant_p</code> is unavailable?  Same problem with MacOS       of a <code>long</code> is not detected.
      X.  <li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
 <li> Shut up SGI's compiler by declaring <code>dump_abort</code> in       the inputs, and in that respect doesn't implement the "infinite precision
      mp?/tests/*.c.       followed by truncate" specified in the manual.
 <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.  <li> <code>mpf_div</code> of x/x doesn't always give 1, reported by Peter
        Moulder.  Perhaps it suffices to put +1 on the effective divisor prec, so
        that data bits rather than zeros are shifted in when normalizing.  Would
        prefer to switch to <code>mpn_tdiv_qr</code>, where all shifting should
        disappear.
   <li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
        variables with pointers to <code>mpz_add</code> etc, which doesn't work
        when those routines are coming from a DLL (because they're effectively
        function pointer global variables themselves).  Need to rearrange perhaps
        to a set of calls to a test function rather than iterating over an array.
   <li> demos/pexpr.c: The local variables in <code>main</code> might be
        clobbered by the <code>longjmp</code>.
 </ul>  </ul>
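
The <code>mpf_set_str</code> item above mentions two gaps: garbage in the exponent is accepted, and overflow of a <code>long</code> goes undetected. A minimal sketch of a stricter check follows, assuming a decimal exponent held in a NUL-terminated string. The helper name <code>parse_exponent</code> is hypothetical and not part of GMP. It relies only on standard <code>strtol</code> behaviour, so it also accepts whatever leading whitespace and sign <code>strtol</code> accepts.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of GMP: parse the text following the
   exponent marker.  Returns 1 and stores the value in *out on success,
   or 0 if there are no digits, trailing garbage, or the value does not
   fit in a long.  */
int
parse_exponent (const char *s, long *out)
{
  char *end;
  long val;

  assert (s != NULL && out != NULL);

  errno = 0;
  val = strtol (s, &end, 10);
  if (end == s)           /* no digits at all, e.g. "X789X" */
    return 0;
  if (*end != '\0')       /* trailing garbage, e.g. "789X" */
    return 0;
  if (errno == ERANGE)    /* overflow or underflow of long */
    return 0;

  *out = val;
  return 1;
}
```

With this shape a caller could reject a string like 123.456eX789X outright rather than silently using an exponent of 0.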
   
   
   
 <h4>Machine Independent Optimization</h4>  <h4>Machine Independent Optimization</h4>
 <ul>  <ul>
 <li> In hundreds of places in the code, we invoke count_leading_zeros and then  <li> <code>mpn_gcdext</code>, <code>mpz_get_d</code>,
      check if the returned count is zero.  Instead check the most significant       <code>mpf_get_str</code>: Don't test <code>count_leading_zeros</code> for
      bit of the operand, and avoid invoking <code>count_leading_zeros</code> if       zero, instead check the high bit of the operand and avoid invoking
      the bit is set.  This is an optimization on all machines, and significant       <code>count_leading_zeros</code>.  This is an optimization on all
      on machines with slow <code>count_leading_zeros</code>.       machines, and significant on machines with slow
 <li> In a couple of places <code>count_trailing_zeros</code> is used       <code>count_leading_zeros</code>, though it's possible an already
      on more or less uniformly distributed numbers.  For some CPUs       normalized operand might not be encountered very often.
      <code>count_trailing_zeros</code> is slow and it's probably worth  
      handling the frequently occurring 0 to 2 trailing zeros cases specially.  
 <li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to  
      instead use <code>invert_limb</code>.  
 <li> Reorganize longlong.h so that we can inline the operations even for the  
      system compiler.  When there is no such compiler feature, make calls to  
      stub functions.  Write such stub functions for as many machines as  
      possible.  
 <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the  <li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
      most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt;= 52 bits).       most significant limb (if <code>BITS_PER_MP_LIMB</code> &lt;= 52 bits).
      (Peter Montgomery has some ideas on this subject.)       (Peter Montgomery has some ideas on this subject.)
 <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial  <li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
      products with fewer operations.       products with fewer operations.
 <li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in  <li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
      the sub O(n^2) range, using some divide-and-conquer approach, preferably       fast, especially for compile-time constants, but would make application
      without using division.       binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
 <li> Copy tricky code for converting a limb from development version of       preventing the "lazy" allocation scheme below.
      <code>mpn_get_str</code> to mpf/get_str.  (Talk to Torbjörn about this.)  <li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
 <li> Consider inlining these functions: <code>mpz_size</code>,       <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
      <code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,       would be needed for the divide by zero test, unless that could be left to
      <code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,       <code>mpn_mod_1</code> (not sure currently whether all the risc chips
      <code>mpz_scan1</code>, <code>mpz_getlimbn</code>,       provoke the right exception there if using mul-by-inverse).
      <code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,  <li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
      <code>mpz_popcount</code>, <code>mpf_size</code>,       <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
      <code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,       might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
      <code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,       since there's no apparent way to get <code>SHRT_MAX</code> with an
      <code>mpf_clear</code>, <code>mpf_set_si</code>.       expression (since <code>short</code> and <code>unsigned short</code> can
        be different sizes).
 <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very  <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
      fast on one or two limb moduli, due to a lot of function call       fast on one or two limb moduli, due to a lot of function call
      overheads.  These could perhaps be handled as special cases.       overheads.  These could perhaps be handled as special cases.
 <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better  <li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
      algorithm selection, and the latter should use REDC.  Both could       algorithm selection, and the latter should use REDC.  Both could
      change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.       change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
   <li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
        using the division method when they're small, since the REDC form of a
        small multiplier is normally a full size product.  Probably would need a
        new tuned parameter to say what size multiplier is "small", as a function
        of the size of the modulus.
   <li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
        this would mean for m=n*2^k doing mod n using REDC and an auxiliary
        calculation mod 2^k, then putting them together at the end.
 <li> <code>mpn_gcd</code> might be able to be sped up on small to  <li> <code>mpn_gcd</code> might be able to be sped up on small to
      moderate sizes by improving <code>find_a</code>, possibly just by       moderate sizes by improving <code>find_a</code>, possibly just by
      providing an alternate implementation for CPUs with slowish       providing an alternate implementation for CPUs with slowish
      <code>count_leading_zeros</code>.       <code>count_leading_zeros</code>.
 <li> Implement a cache localized evaluate and interpolate for the  <li> Toom3 <code>USE_MORE_MPN</code> could use a low to high cache localized
      toom3 <code>USE_MORE_MPN</code> code.  The necessary       evaluate and interpolate.  The necessary <code>mpn_divexact_by3c</code>
      right-to-left <code>mpn_divexact_by3c</code> exists.       exists.
 <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for  <li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
      better cache locality by taking N piece by piece.  The current code could       better cache locality by taking N piece by piece.  The current code could
      be left available for CPUs without caching.  Depending how karatsuba etc       be left available for CPUs without caching.  Depending how karatsuba etc
      is applied to unequal size operands it might be possible to assume M is       is applied to unequal size operands it might be possible to assume M is
      always smallish.       always smallish.
   <li> <code>mpn_perfect_square_p</code> on small operands might be better off
        skipping the residue tests and just taking a square root.
   <li> <code>mpz_perfect_power_p</code> could be improved in a number of ways.
        Test for Nth power residues modulo small primes like
        <code>mpn_perfect_square_p</code> does.  Use p-adic arithmetic to find
        possible roots.  Divisibility by other primes should be tested by
        grouping into a limb like <code>PP</code>.
   <li> <code>mpz_perfect_power_p</code> might like to use <code>mpn_gcd_1</code>
        instead of a private GCD routine.  The use it's put to isn't
        time-critical, and it might help ensure correctness to use the main GCD
        routine.
   <li> <code>mpz_perfect_power_p</code> could use
        <code>mpz_divisible_ui_p</code> instead of <code>mpz_tdiv_ui</code> for
        divisibility testing; the former is faster on a number of systems.  (But
        all that prime test stuff is going to be rewritten some time.)
   <li> Change <code>PP</code>/<code>PP_INVERTED</code> into an array of such
        pairs, listing several hundred primes.  Perhaps actually make the
        products larger than one limb each.
   <li> <code>PP</code> can have factors of 2 introduced in order to get the high
        bit set and therefore a <code>PP_INVERTED</code> existing.  The factors
        of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
        further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
        The advantage of this is that <code>mpn_preinv_mod_1</code> can then be
        used if it's faster than plain <code>mpn_mod_1</code>.  This would be a
        change only for 16-bit limbs, all the rest already have <code>PP</code>
        in the right form.
   <li> <code>PP</code> could have extra factors of 3 or 5 or whatever introduced
        if they fit, and final remainders mod 9 or 25 or whatever used, thereby
        making more efficient use of the <code>mpn_mod_1</code> done.  On a
        16-bit limb it looks like <code>PP</code> could take an extra factor of
        3.
   <li> <code>mpz_probab_prime_p</code>, <code>mpn_perfect_square_p</code> and
        <code>mpz_perfect_power_p</code> could use <code>mpn_mod_34lsub1</code>
        to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
        3, 5, 7, 13 and 17 (factors of 2^24-1).  This could either replace the
        <code>PP</code> division currently done, or allow <code>PP</code> to do
        larger primes, depending how many residue tests seem worthwhile before
        launching into full root extractions or Miller-Rabin etc.
   <li> <code>mpz_probab_prime_p</code> (and maybe others) could code the
        divisibility tests like <code>n%7 == 0</code> in the form
   <pre>
   #define MP_LIMB_DIVISIBLE_7_P(n) \
     ((n) * MODLIMB_INVERSE_7 &lt;= MP_LIMB_T_MAX/7)
   </pre>
        This would help compilers which don't know how to optimize divisions by
        constants, and would help current gcc (3.0) too since gcc forms a whole
        remainder rather than using a modular inverse and comparing.  This
        technique works for any odd modulus, and with some tweaks for even moduli
        too.  See Granlund and Montgomery "Division By Invariant Integers"
        section 9.
   <li> <code>mpz_probab_prime_p</code> and <code>mpz_nextprime</code> could
        offer certainty for primes up to 2^32 by using a one limb Miller-Rabin
        test to base 2, combined with a table of actual strong pseudoprimes in
        that range (2314 of them).  If that table is too big then both base 2 and
        base 3 tests could be done, leaving a table of 104.  The test could use
        REDC and therefore be a <code>modlimb_invert</code>, a remainder (maybe),
        then two multiplies per bit (successively dependent).  Processors with
        pipelined multipliers could do base 2 and 3 in parallel.  Vector systems
        could do a whole bunch of bases in parallel, and perhaps offer near
        certainty up to 64-bits (certainty might depend on an exhaustive search
        of pseudoprimes up to that limit).  Obviously 2^32 is not a big number,
        but an efficient and certain calculation is attractive.  It might find
        other uses internally, and could even be offered as a one limb prime test
        <code>mpn_probab_prime_1_p</code> or <code>gmp_probab_prime_ui_p</code>
        perhaps.
   <li> <code>mpz_probab_prime_p</code> doesn't need to make a copy of
        <code>n</code> when the input is negative, it can set up an
        <code>mpz_t</code> alias, same data pointer but a positive size.  With no
        need to clear before returning, the recursive function call could be
        dispensed with too.
   <li> <code>mpf_set_str</code> produces low zero limbs when a string has a
        fraction but is exactly representable, eg. 0.5 in decimal.  These could be
        stripped to save work in later operations.
   <li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
        use <code>mpn_and_n</code> etc for the benefit of the small number of
        targets with native versions of those routines.  Need to be careful not to
        pass size==0.  Is some code sharing possible between the <code>mpz</code>
        routines?
   <li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
        unless it's really necessary (currently only sizes are tested, not
        whether r really is u or v).
   <li> <code>mpf_add</code>: Under the check for v having no effect on the
        result, perhaps test for r==u and do nothing in that case, rather than
        doing the <code>MPN_COPY_INCR</code> which it looks like is currently
        used to reduce prec+1 limbs to prec.
   <li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
        shift the dividend on-the-fly, since this should cost nothing on
        superscalar processors and avoid the need for temporary copying in
        <code>mpn_tdiv_qr</code>.
   <li> <code>mpf_sqrt_ui</code> calculates prec+1 limbs, whereas just prec would
        satisfy the application requested precision.  It should suffice to simply
        reduce the rsize temporary to 2*prec-1 limbs.  <code>mpf_sqrt</code>
        might be similar.
   <li> <code>invert_limb</code> generic C: The division could use dividend
        b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
        (b-d):0, where b=2^<code>BITS_PER_MP_LIMB</code> and d=divisor.  The
        former is per the original paper and is used in the x86 code, the
        advantage is that the current special case for 0x80..00 could be dropped.
        The two should be equivalent, but a little check of that would be wanted.
   <li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
        <code>num2*den1</code> products limb-by-limb from high to low and look at
        each step for values differing by more than the possible carry bit from
        the uncalculated portion.
   <li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
        and compare.  The benefits of karatsuba and higher multiplication
        algorithms are lost, but if it's assumed only a few high limbs will be
        needed to determine an order then that's fine.
   <li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
        <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
        instead of the functions, so they get inlined on all compilers, not just
        gcc and others with <code>inline</code> recognised in gmp.h.
        <code>__GMPN_ADD_1</code> etc are meant mostly to support application
        inline <code>mpn_add_1</code> etc and if they don't come out good for
        internal uses then special forms can be introduced, for instance many
        internal uses are in-place.  Sometimes a block of code is executed based
        on the carry-out, rather than using it arithmetically, and those places
        might want to do their own loops entirely.
   <li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
        bitfield for the mantissa extraction, not two, when endianness permits.
        Might depend on the compiler allowing <code>long long</code> bit fields
        when that's the only actual 64-bit type.
   <li> <code>mpf_get_d</code> could be more like <code>mpz_get_d</code> and do
        more in integers and give the float conversion as such a chance to round
        in its preferred direction.  Some code sharing ought to be possible.  Or
        if nothing else then for consistency the two ought to give identical
        results on integer operands (not clear if this is so right now).
   <li> <code>usqr_ppm</code> or some such could do a widening square in the
        style of <code>umul_ppmm</code>.  This would help 68000, and be a small
        improvement for the generic C (which is used on UltraSPARC/64 for
        instance).  GCC recognises the generic C ul*vh and vl*uh are identical,
        but does two separate additions to the rest of the result.
   <li> tal-notreent.c could keep a block of memory permanently allocated.
        Currently the last nested <code>TMP_FREE</code> releases all memory, so
        there's an allocate and free every time a top-level function using
        <code>TMP</code> is called.  Would need
        <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
        any cached memory when changing allocation functions though.
   <li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
        inlined.  If the current chunk has enough room then a couple of pointers
        can be updated.  Only if more space is required would a call to some sort
        of <code>__gmp_tmp_increase</code> be needed.  The requirement that
        <code>TMP_ALLOC</code> is an expression might make the implementation a
        bit ugly and/or a bit sub-optimal.
   <pre>
   #define TMP_ALLOC(n) \
     ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ? \
        __gmp_tmp_increase (ROUND_UP (n)) : 0), \
        current-&gt;point += ROUND_UP (n), \
        current-&gt;point - ROUND_UP (n))
   </pre>
   <li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
        never used.  Perhaps the table should just go up to base 16, and have
        code to generate data above that, if and when required.  Naturally this
        assumes the code would be smaller than the data saved.
   <li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
        if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
        otherwise, to save space.
   <li> Make <code>mpf_get_str</code> and <code>mpf_set_str</code> call the
        corresponding, much faster, mpn functions.
   <li> <code>mpn_mod_1</code> could pre-calculate values of R mod N, R^2 mod N,
        R^3 mod N, etc, with R=2^<code>BITS_PER_MP_LIMB</code>, and use them to
        process multiple limbs at each step by multiplying.  Suggested by Peter
        L. Montgomery.
   <li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
        are of course fast, it seems a little silly to make a second pass over
        the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
        that with the bit extractions.
   <li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
        A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
        deriving S from that.  Perhaps it'd be possible to arrange to get S in
        the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
        might only be an advantage if A and B are about the same size.
   <li> <code>mpn_toom3_mul_n</code>, <code>mpn_toom3_sqr_n</code>: Temporaries
        <code>B</code> and <code>D</code> are adjacent in memory and at the final
        coefficient additions look like they could use a single
        <code>mpn_add_n</code> of <code>l4</code> limbs rather than two of
        <code>l2</code> limbs.
 </ul>  </ul>
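
The <code>MP_LIMB_DIVISIBLE_7_P</code> item above tests divisibility by multiplying by a modular inverse and comparing against the largest possible quotient. The following is a sketch of that technique on 32-bit values rather than on <code>mp_limb_t</code>. The names <code>modlimb_inverse_32</code> and <code>divisible_odd_p</code> are illustrative assumptions, not GMP functions, and the inverse is computed by a Newton iteration instead of being a precomputed constant.

```c
#include <assert.h>
#include <stdint.h>

/* Inverse of odd d modulo 2^32, by Newton iteration.  Since d*d == 1
   mod 8, d itself is correct to 3 bits, and each step doubles the
   number of correct low bits.  */
uint32_t
modlimb_inverse_32 (uint32_t d)
{
  uint32_t inv = d;
  int i;

  assert (d & 1);               /* only odd d has an inverse mod 2^32 */
  for (i = 0; i < 4; i++)       /* 3 -> 6 -> 12 -> 24 -> 48 bits */
    inv *= 2 - d * inv;
  return inv;
}

/* Nonzero if n is divisible by odd d.  If n = q*d then n*inv == q
   mod 2^32 and q <= UINT32_MAX/d; any non-multiple lands above that
   bound.  */
int
divisible_odd_p (uint32_t n, uint32_t d)
{
  return (uint32_t) (n * modlimb_inverse_32 (d)) <= UINT32_MAX / d;
}
```

In real use the inverse would be a compile-time constant per modulus, as in the macro above, so the test costs one multiply and one compare.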
   
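
The <code>mpn_mod_34lsub1</code> item above relies on 2^24 being congruent to 1 mod 2^24-1, so a number can be reduced by summing its 24-bit pieces, after which remainders mod 3, 5, 7, 13 and 17 come from a single small value. A sketch of the idea on 32-bit limbs follows. The function name <code>mod_34lsub1_sketch</code> and the limb layout (least significant limb first) are assumptions for illustration, not the GMP implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns r in [0, 2^24-1] with r congruent to the n-limb number
   {p,n} modulo 2^24-1.  The result is not fully reduced: 2^24-1
   itself can be returned for a multiple of 2^24-1.  */
uint32_t
mod_34lsub1_sketch (const uint32_t *p, size_t n)
{
  uint64_t acc = 0;
  size_t i;

  assert (p != NULL || n == 0);
  for (i = 0; i < n; i++)
    {
      /* 2^(32*i) == 2^(8*(i mod 3)) mod 2^24-1, so shift by 0, 8, 16 */
      uint64_t term = (uint64_t) p[i] << (8 * (i % 3));
      /* term < 2^48, so it splits into two 24-bit pieces */
      acc += (term & 0xFFFFFF) + (term >> 24);
    }

  /* fold the accumulator down to 24 bits */
  while (acc >> 24)
    acc = (acc & 0xFFFFFF) + (acc >> 24);
  return (uint32_t) acc;
}
```

A caller would then take <code>r % 3</code>, <code>r % 5</code> and so on from the returned value, each of which matches the remainder of the full number because those primes divide 2^24-1.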
   
 <h4>Machine Dependent Optimization</h4>  <h4>Machine Dependent Optimization</h4>
 <ul>  <ul>
   <li> <code>udiv_qrnnd_preinv2norm</code>, the branch-free version of
        <code>udiv_qrnnd_preinv</code>, might be faster on various pipelined
        chips.  In particular the first <code>if (_xh != 0)</code> in
        <code>udiv_qrnnd_preinv</code> might be roughly a 50/50 chance and might
        branch predict poorly.  (The second test is probably almost always
        false.)  Measuring with the tuneup program would be possible, but perhaps
        a bit messy.  In any case maybe the default should be the branch-free
        version.
        <br>
        Note that the current <code>udiv_qrnnd_preinv2norm</code> implementation
        assumes a right shift will sign extend, which is not guaranteed by the C
        standards, and doesn't happen on Cray vector systems.
 <li> Run the `tune' utility for more compiler/CPU combinations.  We would like  <li> Run the `tune' utility for more compiler/CPU combinations.  We would like
      to have gmp-mparam.h files in practically every implementation specific       to have gmp-mparam.h files in practically every implementation specific
      mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system       mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
      compiler.  See the `tune' top-level directory for more information.       compiler.  See the `tune' top-level directory for more information.
 <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and          <pre>
      <code>mpn_mul_1</code> for the 21264.  On 21264, they should run at 4, 3,          #ifdef __GNUC__
      and 3 cycles/limb respectively, if the code is unrolled properly.  (Ask          #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
      Torbjörn for his xm.s and xam.s skeleton files.)          ...
 <li> Alpha: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and          #endif
      <code>mpn_mul_1</code> for the 21164.  This should use both integer          #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
           ...
           #endif
           #ifndef MUL_KARATSUBA_THRESHOLD
           /* Default GNUC values */
           ...
           #endif
           #else /* system compiler */
           ...
           #endif  </pre>
   <li> <code>invert_limb</code> on various processors might benefit from the
        little Newton iteration done for alpha and ia64.
   <li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
        <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
   <li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        and <code>mpn_submul_1</code> for the 21164.  This should use both integer
      multiplies and floating-point multiplies.  For the floating-point       multiplies and floating-point multiplies.  For the floating-point
      operations, the single-limb multiplier should be split into three 21-bit       operations, the single-limb multiplier should be split into three 21-bit
      chunks.       chunks, or perhaps even better in four 16-bit chunks.  Probably possible
 <li> UltraSPARC: Rewrite 64-bit <code>mpn_addmul_1</code>,       to reach 9 cycles/limb.
      <code>mpn_submul_1</code>, and <code>mpn_mul_1</code>.  Should use  <li> Alpha 21264 ev67: Use <code>ctlz</code> and <code>cttz</code> for
      floating-point operations, and split the invariant single-limb multiplier       <code>count_leading_zeros</code> and <code>count_trailing_zeros</code>.
      into 21-bit chunks.  Should give about 18 cycles/limb, but the pipeline       Use inline for gcc, probably want asm files for elsewhere.
      will become very deep.  (Torbjörn has C code that is useful as a starting  <li> ARC: gcc longlong.h sets up <code>umul_ppmm</code> to call
      point.)       <code>__umulsidi3</code> in libgcc.  Could be copied straight across, but
 <li> UltraSPARC: Rewrite <code>mpn_lshift</code> and <code>mpn_rshift</code>.       perhaps ought to be tested.
      Should give 2 cycles/limb.  (Torbjörn has code that just needs to be  <li> ARM: On v5 cpus see if the <code>clz</code> instruction can be used for
      finished.)       <code>count_leading_zeros</code>.
 <li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>  <li> Itanium: <code>mpn_divexact_by3</code> isn't particularly important, but
      and the other multiplies varies so much on successive sizes.       the generic C runs at about 27 c/l, whereas with the multiplies off the
        dependent chain about 3 c/l ought to be possible.
   <li> Itanium: <code>mpn_hamdist</code> could be put together based on the
        current <code>mpn_popcount</code>.
   <li> Itanium: <code>popc_limb</code> in gmp-impl.h could use the
        <code>popcnt</code> insn.
   <li> Itanium: <code>mpn_submul_1</code> is not implemented directly, only via
        a combination of <code>mpn_mul_1</code> and <code>mpn_sub_n</code>.
   <li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
        much this can improve the speed, though, since the symmetry that we rely
        on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
        accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
   <li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
        <code>mpn_addmul_1</code>.
   <li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
        "<code>mulx</code>"s either with an asm block or via the generic C code is
        about 90 cycles.  Try using fp operations, and also try using karatsuba
        for just three "<code>mulx</code>"s.
   <li> UltraSPARC/64: <code>mpn_divrem_1</code>, <code>mpn_mod_1</code>,
        <code>mpn_divexact_1</code> and <code>mpn_modexact_1_odd</code> could
        process 32 bits at a time when the divisor fits in 32 bits.  This will need
        only 4 <code>mulx</code>'s per limb instead of 8 in the general case.
   <li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
        Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
   <li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
   <li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
        possible (see commented out code in longlong.h).  This is unlikely to
        save more than a couple of cycles, so perhaps isn't worth bothering with.
   <li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
        or anything to indicate V9 support when -mcpu=v9 is selected.  See
        gcc/config/sol2-sld-64.h.  Will need to pass something through from
        ./configure to select the right code in longlong.h.  (Currently nothing
        is lost because <code>mulx</code> for multiplying is commented out.)
   <li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from
        masking down to just the useful bits at each point in the calculation,
        since <code>mulx</code> speed depends on the highest bit set.  Either
        explicit masks or small types like <code>short</code> and
        <code>int</code> ought to work.
   <li> Sparc64 HAL R1: <code>mpn_popcount</code> and <code>mpn_hamdist</code>
        could use <code>popc</code> currently commented out in gmp-impl.h.  This
        chip reputedly implements <code>popc</code> properly (see gcc sparc.md),
        would need to recognise the chip as <code>sparchalr1</code> or something
        in configure / config.sub / config.guess.
 <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and  <li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
      <code>mpn_mul_1</code>.  The current development code runs at 11       <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
      cycles/limb, which is already very good.  But it should be possible to       should be possible to saturate the cache, which will happen at 8
      saturate the cache, which will happen at 7.5 cycles/limb.       cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
 <li> Sparc & SparcV8: Enable umul.asm for native cc.  The generic       it should be possible to make them run at about 5 cycles/limb.
      longlong.h umul_ppmm is suspected to be causing sqr_basecase to  <li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
      be slower than mul_basecase.       <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
 <li> UltraSPARC: Write <code>umul_ppmm</code>.  Important in particular for       possibly two floating-point and one integer limb per loop.  Split operands
      <code>mpn_sqr_basecase</code>.  Using four "<code>mulx</code>"s either       into four 16-bit chunks for fast fp operations.  Should easily reach 9
      with an asm block or via the generic C code is about 90 cycles.       cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
        (using one int + two fp).
   <li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
        as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
        share source code, or with a register to control the loop direction
        perhaps even share object code.
   <li> PowerPC-32: <code>mpn_rshift</code> should do the same sort of unrolled
        loop as <code>mpn_lshift</code>.
 <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>  <li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
      for important machines.  Helping the generic sqr_basecase.c with an       for important machines.  Helping the generic sqr_basecase.c with an
      <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.       <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
 <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.  <li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
      Will bring time from 1.75 to 1.25 cycles/limb.       Will bring time from 1.75 to 1.25 cycles/limb.
 <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See Pentium code.)  <li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
 <li> Alpha: Optimize <code>count_leading_zeros</code>.       Pentium code.)
 <li> Alpha: Optimize <code>udiv_qrnnd</code>.  (Ask Torbjörn for the file  <li> X86: Good authority has it that in the past an inline <code>rep
      test-udiv-preinv.c as a starting point.)       movs</code> would upset GCC register allocation for the whole function.
 <li> R10000/R12000: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.       Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
      It should just require 3 cycles/limb, but the current code propagates       <code>__builtin_memcpy</code>.  Examine the code for some simple and
      carry poorly.  The trick is to add carry-in later than we do now,       complex functions to find out.  Inlining <code>rep movs</code> would be
      decreasing the number of operations used to generate carry-out from 4 to       desirable, it'd be both smaller and faster.
      3.  <li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
        down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
        <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
   <li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
        might benefit from some destination prefetching.
   <li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
        mul-by-inverse, hoping for maybe 30 c/l.
   <li> P6: <code>mpn_add_n</code> and <code>mpn_sub_n</code> should be able to go
        faster than the generic x86 code at 3.5 c/l.  The athlon code for instance
        runs at about 2.7.
   <li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
        do something branch-free for unaligned startups, and shaving one insn
        from the loop with alternative indexing might save a cycle.
 <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.  <li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
      The pipeline is now extremely deep, perhaps unnecessarily deep.  Also, r5       The pipeline is now extremely deep, perhaps unnecessarily deep.
      is unused.  (Ask Torbjörn for a copy of the current code.)  
 <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.  <li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
 <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should  <li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
      run at just 3.25 cycles/limb.  (Ask for xxx-add_n.s as a starting point.)       run at just 3.25 cycles/limb.
 <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.  <li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
 <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and  <li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
      <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication       <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
      method", to avoid carry propagation, splitting one of the operands in       method", to avoid carry propagation, splitting one of the operands in
      11-bit chunks.       11-bit chunks.
 <li> Cray: Vectorize main functions, perhaps in assembly language.  <li> 68k, Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
 <li> Cray: Write <code>mpn_mul_basecase</code> and       by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
      <code>mpn_sqr_basecase</code>.  Same comment applies to this as to the       special lshift by 1.  This would be best as a jump across to the other
      same functions for Fujitsu VPP.       routine, could let both live in lshift.asm and omit rshift.asm on finding
        <code>mpn_rshift</code> already provided.
   <li> Cray T3E: Experiment with optimization options.  In particular,
        -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
   <li> Cray: <code>mpn_com_n</code> and <code>mpn_and_n</code> etc very probably
        want a pragma like <code>MPN_COPY_INCR</code>.
   <li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
        <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
        and could be inlined to avoid function calls.
   <li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
        scheme.  Not sure why, maybe they merely give the compiler more
        information about aliasing (or the lack thereof).  Would like to modify
        <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
        blocks wanted unconditionally are easy enough, those wanted only
        sometimes are a problem.  Perhaps a special size calculation to ask for a
        dummy length 1 when unwanted, or perhaps an inlined subroutine
        duplicating code under each conditional.  Don't really want to turn
        everything into a dog's dinner just because Cray don't offer an
        <code>alloca</code>.
   <li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
        Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
        limb might prevent it.  Perhaps special cases for binary, octal and hex
        would be worthwhile (very possibly for all processors too).
   <li> Cray: <code>popc_limb</code> could use the Cray <code>_popc</code>
        intrinsic.  That would help <code>mpz_hamdist</code> and might make the
        generic C versions of <code>mpn_popcount</code> and
        <code>mpn_hamdist</code> suffice for Cray (if it vectorizes, or can be
        given a hint to do so).
   <li> 68000: <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
        <code>mpn_submul_1</code>: Check for a 16-bit multiplier and use two
        multiplies per limb, not four.
   <li> 68000: <code>mpn_lshift</code> and <code>mpn_rshift</code> could use a
        <code>roll</code> and mask instead of <code>lsrl</code> and
        <code>lsll</code>.  This promises to be a speedup, effectively trading a
        6+2*n shift for one or two 4 cycle masks.  Suggested by Jean-Charles
        Meyrignac.
 <li> Improve <code>count_leading_zeros</code> for 64-bit machines:  <li> Improve <code>count_leading_zeros</code> for 64-bit machines:
   
   <pre>    <pre>
   if ((x &gt;&gt; W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x &lt;&lt;= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2}             if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
   if ((x &gt;&gt; W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x &lt;&lt;= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4}             if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
   ... </pre>             ... </pre>
   <li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
        be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
        to identify suitable versions of that compiler?
   <li> VAX D and G format <code>double</code> floats are straightforward and
        could perhaps be handled directly in <code>__gmp_extract_double</code>
        and maybe in <code>mpz_get_d</code>, rather than falling back on the
        generic code.  (Both formats are detected by <code>configure</code>.)
   <li> <code>mpn_get_str</code> final divisions by the base with
        <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
        on suitable machines.  This ends up happening for decimal by presenting
        the compiler with a run-time constant, but the same for other bases would
        be good.  Perhaps use could be made of the fact base&lt;256.
   <li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
        structure like <code>div_t</code> to avoid going through memory, in
        particular helping RISCs that don't do store-to-load forwarding.  Clearly
        this is only possible if the ABI returns a structure of two
        <code>mp_limb_t</code>s in registers.
 </ul>  </ul>
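
The <code>count_leading_zeros</code> item above shows only the first two steps of the binary search.  A minimal sketch of the full sequence for a 64-bit limb might look like the following.  The name <code>clz64_sketch</code> and the use of <code>uint64_t</code> are assumptions for illustration, not the longlong.h macro, and the result for a zero argument is left unspecified here.

```c
#include <stdint.h>

/* Hypothetical helper, not GMP's count_leading_zeros macro.
   Counts the leading zero bits of a nonzero 64-bit value by testing
   the top half, quarter, eighth, ... of the word in turn and shifting
   the value up whenever the tested part is empty. */
static int clz64_sketch(uint64_t x)
{
    int cnt = 0;
    if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
    if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
    if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
    if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
    if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
    if ((x >> 63) == 0) { cnt += 1; }
    return cnt;
}
```

Each test halves the remaining window, so six comparisons cover all 64 bit positions.  A real replacement would presumably be written in terms of <code>W_TYPE_SIZE</code> rather than fixed constants.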
   
 <h4>New Functionality</h4>  <h4>New Functionality</h4>
 <ul>  <ul>
 <li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the nth limb).  <li> Add in-memory versions of <code>mp?_out_raw</code> and
        <code>mp?_inp_raw</code>.
   <li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the
        nth limb).
 <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).  <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
 <li> Let `0b' and `0B' mean binary input everywhere.  <li> Let `0b' and `0B' mean binary input everywhere.
 <li> Add <code>mpq_set_f</code> for assignment from <code>mpf_t</code>  <li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
      (cf. <code>mpq_set_d</code>).       Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
 <li> Maybe make <code>mpz_init</code> (and <code>mpq_init</code>) do lazy       <code>_mpz_realloc</code> do the initial alloc.  Set
      allocation.  Set <code>ALLOC(var)</code> to 0, and have       <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
      <code>mpz_realloc</code> special-handle that case.  Update functions that       similar can unconditionally fetch from.  Niels Möller has had a go at
      rely on a single limb (like <code>mpz_set_ui</code>,       this.
      <code>mpz_[tfc]div_r_ui</code>, and others).       <br>
        The advantages of the lazy scheme would be:
        <ul>
        <li> Initial allocate would be the size required for the first value
             stored, rather than getting 1 limb in <code>mpz_init</code> and then
             more or less immediately reallocating.
        <li> <code>mpz_init</code> would only store magic values in the
             <code>mpz_t</code> fields, and could be inlined.
        <li> A fixed initializer could even be used by applications, like
             <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
             for globals.
        </ul>
        The advantages of the current scheme are:
        <ul>
        <li> <code>mpz_set_ui</code> and other similar routines needn't check the
             size allocated and can just store unconditionally.
        <li> <code>mpz_set_ui</code> and perhaps others like
             <code>mpz_tdiv_r_ui</code> and a prospective
             <code>mpz_set_ull</code> could be inlined.
        </ul>
 <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure  <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
      format is portable between 32-bit and 64-bit machines, and between       format is portable between 32-bit and 64-bit machines, and between
      little-endian and big-endian machines.       little-endian and big-endian machines.
 <li> Handle numeric exceptions: Call an error handler, and/or set  <li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
      <code>gmp_errno</code>.       logops and copys available in gmp.h, either as library functions or
 <li> Implement <code>gmp_fprintf</code>, <code>gmp_sprintf</code>, and       inlines, with the availability of library functions instantiated in the
      <code>gmp_snprintf</code>.  Think about some sort of wrapper       generated gmp.h at build time.
      around <code>printf</code> so it and its several variants don't  <li> <code>mpz_set_str</code> etc variants taking string lengths rather than
      have to be completely reimplemented.       null-terminators.
 <li> Implement some <code>mpq</code> input and output functions.  
 <li> Implement a full precision <code>mpz_kronecker</code>, leave  
      <code>mpz_jacobi</code> for compatibility.  
 <li> Make the mpn logops and copys available in gmp.h.  Since they can  
      be either library functions or inlines, gmp.h would need to be  
      generated from a gmp.in based on what's in the library.  gmp.h  
      would still be compiler-independent though.  
 <li> Make versions of <code>mpz_set_str</code> etc taking string  
      lengths rather than null-terminators.  
 <li> Consider changing the thresholds to apply the simpler algorithm when  <li> Consider changing the thresholds to apply the simpler algorithm when
      "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can       "<code>&lt;=</code>" rather than "<code>&lt;</code>", so a threshold can
      be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the       be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
      compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).       compiler will know <code>size &lt;= MP_SIZE_T_MAX</code> is always true).
 <li> <code>mpz_cdiv_q_2exp</code> and <code>mpz_cdiv_r_2exp</code>       Alternately it looks like the <code>ABOVE_THRESHOLD</code> and
      could be implemented to match the corresponding tdiv and fdiv.       <code>BELOW_THRESHOLD</code> macros can do this adequately, and also pick
      Maybe some code sharing is possible.       up cases where a threshold of zero should mean only the second algorithm.
   <li> <code>mpz_nthprime</code>.
   <li> Perhaps <code>mpz_init2</code>, initializing and making initial room for
        N bits.  The actual size would be rounded up to a limb, and perhaps an
        extra limb added since so many <code>mpz</code> routines need that on
        their destination.
   <li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
        <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
        if they could share code with the current such functions (which should be
        possible).
   <li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
        Niels Möller.
   <li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
        accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
        decimal in this case, with a further 0x, 0b etc allowed there.
        Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
        probably something like "0.123" ought not mean octal.
   <li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
        feature of gmp.h, so applications could know whether to
        <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
   <li> <code>PRIdMP_LIMB</code> and similar defines following C99
        &lt;inttypes.h&gt; might be of use to applications printing limbs.
        Perhaps they should be defined only if specifically requested, the way
        &lt;inttypes.h&gt; does.  But if <code>GMP_LONG_LONG_LIMB</code> or
        whatever is added then perhaps this can easily enough be left to
        applications.
   <li> <code>mpf_get_ld</code> and <code>mpf_set_ld</code> converting
        <code>mpf_t</code> to and from <code>long double</code>.  Other
        <code>long double</code> routines would be desirable too, but these would
        be a start.  Often <code>long double</code> is the same as
        <code>double</code>, which is easy but pretty pointless.  Should
        recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
        sparc, hppa and power.  Might like an ABI sub-option or something when
        it's a compiler option for 64-bit or 128-bit <code>long double</code>.
   <li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
        It'd be nice if it worked for plain <code>int</code> etc too, not just
        <code>mpz_t</code> etc.
   <li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
        for both integer and float conversions.  A base either in the format
        string or as a parameter with <code>*</code> should be allowed.  Maybe
        <code>&amp;13b</code> (b for base) or something like that.
   <li> <code>gmp_printf</code> could perhaps have a type code for an
        <code>mp_limb_t</code>.  That would save an application from having to
        worry whether it's a <code>long</code> or a <code>long long</code>.
   <li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
        conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
        convenience, but still might be useful.  Rounding would be the same as
        for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
        actually documented).  Alternately, perhaps a separate
        <code>mpq_get_str_point</code> or some such might be more use.  Suggested
        by Pedro Gimeno.
   <li> <code>gmp_printf</code> could usefully accept a flag to control the
        rounding of float conversions.  This wouldn't do much for
        <code>mpf_t</code>, but would be good if <code>mpfr_t</code> was
        supported in the future, or perhaps for <code>mpq_t</code>.  Something
        like <code>&amp;*r</code> (r for rounding, and mpfr style
        <code>GMP_RND</code> parameter).
   <li> <code>mpz_combit</code> to toggle a bit would be a good companion for
        <code>mpz_setbit</code> and <code>mpz_clrbit</code>.  Suggested by Niels
        Möller (and has done some work towards it).
   <li> <code>mpz_scan0_reverse</code> or <code>mpz_scan0low</code> or some such
        searching towards the low end of an integer might match
        <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
        Suggested by Roberto Bagnara.
   <li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
        bitwise subset of another might be of use.  Some sort of return value
        indicating whether it's a proper or non-proper subset would be good and
        wouldn't cost anything in the implementation.  Suggested by Roberto
        Bagnara.
   <li> <code>gmp_randinit_r</code> and maybe <code>gmp_randstate_set</code> to
        init-and-copy or to just copy a <code>gmp_randstate_t</code>.  Suggested
        by Pedro Gimeno.
   <li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
        <code>mpf_t</code> and <code>long double</code>, suggested by Dan
        Christensen.  There'd be some work to be done by <code>configure</code>
        to recognise the format in use.  xlc on aix for instance apparently has
        an option for either plain double 64-bit or quad 128-bit precision.  This
        might mean library contents vary with the compiler used to build, which
        is undesirable.  It might be possible to detect the mode the application
        is compiling with, and try to avoid mismatch problems.
   <li> <code>mpz_sqrt_if_perfect_square</code>: When
        <code>mpz_perfect_square_p</code> does its tests it calculates a square
        root and then discards it.  For some applications it might be useful to
        return that root.  Suggested by Jason Moxham.
   <li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
        <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
        <code>long long</code>.  These would aid interoperability, though a
        mixture of GMP and <code>long long</code> would probably not be too
        common.  Disadvantages of using <code>long long</code> in libgmp.a would
        be
        <ul>
        <li> Library contents vary according to the build compiler.
        <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
             application compiler could take the <code>long long</code>
             prototypes.
        <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> would be wanted to
             indicate whether the functions are available.  (Applications using
             autoconf could probe the library too.)
        </ul>
        It'd be possible to defer the need for <code>long long</code> to
        application compile time, by having something like
        <code>mpz_set_2ui</code> called with two halves of a <code>long
        long</code>.  Disadvantages of this would be,
        <ul>
        <li> Bigger code in the application, though perhaps not if a <code>long
             long</code> is normally passed as two halves anyway.
        <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
             to be two function calls.
        <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
             treatment of <code>-0x10..00</code> into applications (see
             <code>mpz_get_si</code> correctness above).
        <li> Although having libgmp.a independent of the build compiler is nice,
             it sort of sacrifices the capabilities of a good compiler to
             uniformity with inferior ones.
        </ul>
        Plain use of <code>long long</code> is probably the lesser evil, if only
        because it makes best use of gcc.
   <li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
        Suggested by Alexander Kruppa.
 </ul>  </ul>
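
The <code>mpz_bit_subset</code> suggestion in the list above notes that reporting proper versus non-proper subsets need not cost anything extra.  A rough sketch of why, on plain arrays of equal length, is given below.  The function name, the return codes and the use of <code>uint64_t</code> limbs are all hypothetical, and sign handling and operands of different sizes are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes, chosen only for this sketch. */
enum { SUBSET_NONE = 0, SUBSET_PROPER = 1, SUBSET_EQUAL = 2 };

/* Test whether every bit set in a[0..n-1] is also set in b[0..n-1].
   The same pass over the limbs also notes whether the operands differ,
   which distinguishes a proper subset from equality. */
static int bit_subset_sketch(const uint64_t *a, const uint64_t *b, size_t n)
{
    int differ = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] & ~b[i])
            return SUBSET_NONE;      /* a has a bit that b lacks */
        if (a[i] != b[i])
            differ = 1;              /* b has a bit that a lacks */
    }
    return differ ? SUBSET_PROPER : SUBSET_EQUAL;
}
```

The loop already reads both limbs, so the extra comparison for the proper/non-proper distinction is one instruction per limb at most.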
   
   
 <h4>Configuration</h4>  <h4>Configuration</h4>
   
 <ul>  <ul>
 <li> Improve config.guess.  We want to recognize the processor very  <li> Floating-point format: <code>GMP_C_DOUBLE_FORMAT</code> seems to work
      accurately, more accurately than other GNU packages.       well.  Get rid of the <code>#ifdef</code> mess in gmp-impl.h and use the
      config.guess does not currently make the distinctions we would       results of the test instead.
      like it to do and a --target often needs to be set explicitly.  <li> a29k: umul.s and udiv.s exist but don't get used.
   <li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
        but is that available only for M series chips or some such?  Perhaps it
        should be configured in some way.
   <li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
   <li> HPPA 2.0w: gcc is rumoured to support 2.0w as of version 3, though
        perhaps just as a build-time choice.  In any case, figure out how to
        identify a suitable gcc or put it in the right mode, for the GMP compiler
        choices.
   <li> IA64: Latest libtool has some nonsense to detect ELF-32 or ELF-64 on
        <code>ia64-*-hpux*</code>.  Does GMP need to know anything about that?
   <li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
        "hinv -c processor" gives lots of information on Irix.  Standard
        config.guess appends "el" to indicate endianness, but
        <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
   <li> PowerPC: The function descriptor nonsense for AIX is currently driven by
        <code>*-*-aix*</code>.  It might be more reliable to do some sort of
        feature test, examining the compiler output perhaps.  It might also be
        nice to merge the aix.m4 files into powerpc-defs.m4.
   <li> Sparc: <code>config.guess</code> recognises various exact sparcs, make
        use of that information in <code>configure</code> (work on this is in
        progress).
   <li> Sparc32: floating point or integer <code>udiv</code> should be selected
        according to the CPU target.  Currently floating point ends up being
        used on all sparcs, which is probably not right for generic V7 and V8.
   <li> Sparc: The use of <code>-xtarget=native</code> with <code>cc</code> is
        incorrect when cross-compiling, the target should be set according to the
        configured <code>$host</code> CPU.
   <li> m68k: config.guess can detect 68000, 68010, CPU32 and 68020, but relies
        on system information for 030, 040 and 060.  Can they be identified by
        running some code?
   <li> m68k: gas 2.11.90.0.1 pads with zero bytes in text segments, which is not
        valid code.  Probably need <code>.balignw &lt;n&gt;,0x4e7f</code> to get
        nops, if <code>ALIGN</code> is going to be used for anything that's
        executed across.
   <li> Some CPUs have <code>umul</code> and <code>udiv</code> code not being
        used.  Check all such for bit rot and then put umul and udiv in
        <code>$gmp_mpn_functions_optional</code> as "standard optional" objects.
        <br> In particular Sparc and SparcV8 on non-gcc should benefit from
        umul.asm enabled; the generic umul is suspected to be making sqr_basecase
        slower than mul_basecase.
   <li> HPPA <code>mpn_umul_ppmm</code> and <code>mpn_udiv_qrnnd</code> have a
        different parameter order than those functions on other CPUs.  It might
        avoid confusion to have them under different names, maybe
        <code>mpn_umul_ppmm_r</code> or some such.  Prototypes then wouldn't
        be conditionalized, and the appropriate form could be selected with the
        <code>HAVE_NATIVE</code> scheme if/when the code switches to use a
        <code>PROLOGUE</code> style.
   <li> <code>DItype</code>: The setup in gmp-impl.h for non-GCC could use an
        autoconf test to determine whether <code>long long</code> is available.
   <li> m88k: Make the assembler code work on non-underscore systems.  Conversion
        to .asm would be desirable.  Ought to be easy, but would want to be
        tested.
   <li> z8k: The use of a 32-bit limb in mpn/z8000x as opposed to 16-bits in
        mpn/z8000 could be an ABI choice.  But this chip is obsolete and nothing
        is likely to be done unless someone is actively using it.
   <li> config.m4 is generated only by the configure script, it won't be
        regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
        would work, but it might upset "make" to have things like <code>L$</code>
        get into the Makefiles through <code>AC_SUBST</code>.
        <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
        careful m4 quoting the <code>changequote</code> calls might not be
        needed, which might free up the order in which things had to be output.
   <li> <code>make distclean</code>: Only the mpn directory links which were
        created are removed, but perhaps all possible links should be removed, in
        case someone runs configure a second time without a
        <code>distclean</code> in between.  The only tricky part would be making
        sure all possible <code>extra_functions</code> are covered.
   <li> MinGW: Apparently a Cygwin version of gcc can be used by passing
        <code>-mno-cygwin</code>.  For <code>--host=*-*-mingw32*</code> it might
        be convenient to automatically use that option, if it works.  Needs
        someone with a dual cygwin/mingw setup to test.
   <li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
        scheme.  Though we probably wouldn't be using its assembler support we
        could try to use those variables in compatible ways.
   </ul>
   
      For example, "sparc" is not very useful as a machine architecture  
      denotation.  We want to distinguish old 32-bit SPARC without  
      multiply support from newer 32-bit SPARC with such support.  We  
      want to recognize a SuperSPARC, since its implementation of the  
      UDIV instruction is not complete, and will trap to the OS kernel  
      for certain operands.  And we want to recognize 64-bit capable  
      SPARC processors as such.  While the assembly routines can use  
      64-bit operations on all 64-bit SPARC processors, one can not use  
      64-bit limbs under all operating systems.  E.g., Solaris 2.5 and  
      2.6 don't preserve the upper 32 bits of most processor  
      registers.  For SPARC we therefore sometimes need to choose GMP  
      configuration depending both on processor and operating system.  
   
 <li> Remember to make sure config.sub accepts any output from config.guess.  <h4>Random Numbers</h4>
   <ul>
 <li> Find out whether there's an alloca available and how to use it.  <li> <code>_gmp_rand</code> is not particularly fast on the linear
      AC_FUNC_ALLOCA has various system dependencies covered, but we       congruential algorithm and could stand various improvements.
      don't want its alloca.c replacement.  (One thing current cpp       <ul>
      tests don't cover: HPUX 10 C compiler supports alloca, but       <li> Make a second seed area within <code>gmp_randstate_t</code> (or
      cannot find any symbol to test in order to know if we're on            <code>_mp_algdata</code> rather) to save some copying.
      HPUX 10.  Damn.)       <li> Make a special case for a single limb <code>2exp</code> modulus, to
 <li> Identify Mips processor under Irix: `hinv -c processor'.            avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
      config.guess should say mips2, mips3, and mips4.       <li> Inline the <code>lc</code> code, to avoid a function call and
 <li> Identify Alpha processor under OSF: "/usr/sbin/sizer -c".            <code>TMP_ALLOC</code> for every chunk.
      Unfortunately, sizer is not available before some revision of       <li> The special case for <code>seedn==0</code> will be very rarely used,
      Dec Unix 4.0, and it also returns some rather cryptic names for            and on that basis seems unnecessary.
      processors.  Perhaps the <code>implver</code> and       <li> Perhaps the <code>2exp</code> and general LC cases should be split,
      <code>amask</code> assembly instructions are better, but that            for clarity (if the general case is retained).
      doesn't differentiate between ev5 and ev56.       </ul>
 <li> Identify Sparc processors.  config.guess should say supersparc,  <li> <code>gmp_randinit_mers</code> for a Mersenne Twister generator.  It's
      microsparc, ultrasparc1, ultrasparc2, etc.       likely to be more random and about the same speed as Knuth's 55-element
 <li> Identify HPPA processors similarly.       Fibonacci generator, and can probably become the default.  Pedro Gimeno
 <li> Get lots of information about a Solaris system: prtconf -vp       has started on this.
 <li> For some target machines and some compilers, specific options  <li> <code>gmp_randinit_lc</code>: Finish or remove.  Doing a division for
      are needed (sparcv8/gcc needs -mv8, sparcv8/cc needs -cg92,       every step won't be very fast, so check whether the usefulness of
      Irix64/cc needs -64, Irix32/cc might need -n32, etc).  Some are       this algorithm can be justified.  (Consensus is that it's not useful and
      set already, add more, see configure.in.       can be removed.)
 <li> Options to be passed to the assembler (via the compiler, using  <li> Blum-Blum-Shub: Finish or remove.  A separate
      whatever syntax the compiler uses for passing options to the       <code>gmp_randinit_bbs</code> would be wanted, not the currently
      assembler).       commented out case in <code>gmp_randinit</code>.
 <li> On Solaris 7, check if gcc supports native v9 64-bit  <li> <code>_gmp_rand</code> could be done as a function pointer within
      arithmetic.  If not compile using "cc -fast -xarch=v9".       <code>gmp_randstate_t</code> (or rather in the <code>_mp_algdata</code>
      (Problem: -fast requires that we link with -fast too, which       part), instead of switching on a <code>gmp_randalg_t</code>.  Likewise
      might not be very good.  Pass "-xO4 -xtarget=native" instead?)       <code>gmp_randclear</code>, and perhaps <code>gmp_randseed</code> if it
 <li> Extend the "optional" compiler arguments to choose the first       became algorithm-specific.  This would be more modular, and would ensure
      that works from a set, so when gcc gets athlon support it       only code for the desired algorithms is dragged into the link.
      can try -mcpu=athlon, -mcpu=pentiumpro, or -mcpu=i486,  <li> <code>mpz_urandomm</code> should do something for n&lt;=0, but what?
      whichever works.  <li> <code>mpz_urandomm</code> implementation looks like it could be improved.
 <li> Detect gcc >=2.96 and enable -march=pentiumpro for relevant       Perhaps it's enough to calculate <code>nbits</code> as ceil(log2(n)) and
      x86s.  (A bug in gcc 2.95.2 prevents it being used       call <code>_gmp_rand</code> until a value <code>&lt;n</code> is obtained.
      unconditionally.)  <li> <code>gmp_randstate_t</code> used for parameters perhaps should become
 <li> Build multiple variants of the library under certain systems.       <code>gmp_randstate_ptr</code> the same as other types.
      An example is -n32, -o32, and -64 on Irix.  <li> Some of the empirical randomness tests could be included in a "make
 <li> There's a few filenames that don't fit in 14 chars, if this       check".  They ought to work everywhere, for a given seed at least.
      matters.  
 <li> Enable support for FORTRAN versions of mpn files (eg. for  
      mpn/cray/mulww.f).  Add "f" to the mpn path searching, run AC_PROG_F77 if  
      such a file is found.  Automake will generate some of what's needed in the  
      makefiles, but libtool doesn't know fortran and so rules like the current  
      ".asm.lo" will be needed.  
 <li> Only run GMP_PROG_M4 if it's needed, ie. if there's .asm files  
      selected from the mpn path.  This might help say a generic C  
      build on weird systems.  
 </ul>  </ul>
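The <code>mpz_urandomm</code> suggestion above (take <code>nbits</code> as ceil(log2(n)) and draw until a value &lt;n is obtained) can be sketched on single machine words. This is only an illustration under stated assumptions: <code>toy_rand_t</code>, <code>toy_rand_bits</code>, <code>ceil_log2</code> and <code>toy_urandomm</code> are hypothetical names, and a 64-bit linear congruential generator stands in for <code>_gmp_rand</code>. None of this is GMP code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a generator state; not GMP's gmp_randstate_t. */
typedef struct { uint64_t s; } toy_rand_t;

/* Return nbits (1..32) pseudo-random bits from a 64-bit LCG.
   The multiplier and increment are the ones Knuth gives for MMIX. */
uint32_t toy_rand_bits(toy_rand_t *r, unsigned nbits)
{
    assert(nbits >= 1 && nbits <= 32);
    r->s = r->s * 6364136223846793005ULL + 1442695040888963407ULL;
    /* Take the high bits, which have the longest period in an LCG. */
    return (uint32_t)(r->s >> (64 - nbits));
}

/* Smallest nbits with 2^nbits >= n, i.e. ceil(log2(n)), for n >= 2. */
unsigned ceil_log2(uint32_t n)
{
    unsigned bits = 0;
    assert(n >= 2);
    for (uint32_t v = n - 1; v != 0; v >>= 1)
        bits++;
    return bits;
}

/* Uniform value in 0..n-1 by rejection: draw nbits bits and retry
   until the draw is below n.  Since 2^nbits < 2n, each draw is
   accepted with probability greater than 1/2. */
uint32_t toy_urandomm(toy_rand_t *r, uint32_t n)
{
    uint32_t v;
    unsigned nbits;
    assert(n >= 1);            /* n <= 0 is the open question in the list */
    if (n == 1)
        return 0;
    nbits = ceil_log2(n);
    do {
        v = toy_rand_bits(r, nbits);
    } while (v >= n);
    return v;
}
```

A caller would seed a <code>toy_rand_t</code> and call <code>toy_urandomm(&amp;state, n)</code>. The expected number of draws is below two, which is the reason for using exactly ceil(log2(n)) bits rather than a whole limb.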
   
 <p> In general, getting the exact right configuration, passing the  
 exact right options to the compiler, etc, might mean that the GMP  
 performance more than doubles.  
   
 <p> When testing, make sure to test at least the following for all our  
 target machines: (1) Both gcc and cc (and c89).  (2) Both 32-bit mode  
 and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system  
 `make' and GNU `make'. (4) With and without GNU binutils.  
   
   
 <h4>Miscellaneous</h4>  <h4>Miscellaneous</h4>
 <ul>  <ul>
   
 <li> Work on the way we build the library.  We now do it building  
      convenience libraries but then listing all the object files a  
      second time in the top level Makefile.am.  
 <li> Get rid of mp[zq]/sub.c, and instead define a compile parameter to  
      mp[zq]/add.c to decide whether it will add or subtract.  Will decrease  
      redundancy.  Similarly in other places.  
 <li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding  <li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
      analogous to <code>mpz_mod</code>.  Document, and list as an       analogous to <code>mpz_mod</code>.  Document, and list as an
      incompatibility.       incompatibility.
 <li> Maybe make mpz_pow_ui.c more like mpz/ui_pow_ui.c, or write new  <li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
      mpn/generic/pow_ui.       what range of values the generated cofactors can take, and preferably
 <li> Make mpz_invert call mpn_gcdext directly.       ensure the definition uniquely specifies the cofactors for given inputs.
 <li> Make a build option to enable execution profiling with gprof.  In       A basic extended Euclidean algorithm or multi-step variant leads to
      particular look at getting the right <code>mcount</code> call at       |x|&lt;|b| and |y|&lt;|a| or something like that, but there's probably
      the start of each assembler subroutine (for important targets at       two solutions under just those restrictions.
      least).  <li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
   <li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
        <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
        time would be better still.)
   <li> The various test programs use quite a bit of the main
        <code>libgmp</code>.  This establishes good cross-checks, but it might be
        better to use simple reference routines where possible.  Where it's not
        possible some attention could be paid to the order of the tests, so a
        <code>libgmp</code> routine is only used for tests once it seems to be
        good.
   <li> <code>mpf_set_q</code> is very similar to <code>mpf_div</code>, it'd be
        good for the two to share code.  Perhaps <code>mpf_set_q</code> should
        make some <code>mpf_t</code> aliases for its numerator and denominator
        and just call <code>mpf_div</code>.  Both would be simplified a good deal
        by switching to <code>mpn_tdiv_qr</code> perhaps making them small enough
        not to bother with sharing (especially since <code>mpf_set_q</code>
        wouldn't need to watch out for overlaps).
   <li> PowerPC: The cpu time base registers (per <code>mftb</code> and
        <code>mftbu</code>) could be used for the speed and tune programs.  Would
        need to know its frequency of course.  Usually it's 1/4 of bus speed
        (eg. 25 MHz) but some chips drive it from an external input.  Probably
        have to measure to be sure.
   <li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
        return to a previous k at certain sizes.  This arises basically due to
        the step effect caused by size multiples effectively used for each k.
        Looking at a graph makes it fairly clear.
   <li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
        on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
        of <code>mpf_get_str</code> could be made which would better suit.
 </ul>  </ul>
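The remark above about <code>mpz_gcdext</code> cofactors can be checked on small numbers. The following is a minimal sketch of a basic extended Euclidean algorithm on <code>long</code> values; <code>ext_gcd</code> is a hypothetical name and this is not how GMP implements <code>mpn_gcdext</code>. It assumes positive inputs small enough that the intermediate products don't overflow.

```c
#include <assert.h>
#include <stdlib.h>

/* Basic extended Euclid for a, b > 0.  Returns g = gcd(a,b) and stores
   cofactors with a*(*x) + b*(*y) == g.  Illustration only. */
long ext_gcd(long a, long b, long *x, long *y)
{
    long r0 = a, r1 = b;   /* remainder sequence */
    long x0 = 1, x1 = 0;   /* invariant: r0 == a*x0 + b*y0 */
    long y0 = 0, y1 = 1;   /* invariant: r1 == a*x1 + b*y1 */
    assert(a > 0 && b > 0);
    while (r1 != 0) {
        long q = r0 / r1, t;
        t = r0 - q * r1;  r0 = r1;  r1 = t;
        t = x0 - q * x1;  x0 = x1;  x1 = t;
        t = y0 - q * y1;  y0 = y1;  y1 = t;
    }
    *x = x0;
    *y = y0;
    return r0;
}
```

For a=240, b=46 this yields g=2 with x=-9, y=47. Adding b/g to x and subtracting a/g from y gives x=14, y=-73, which also satisfies a*x + b*y = g with |x|&lt;|b| and |y|&lt;|a|. So those bounds alone do not pin down the cofactors, and a documented definition would need a tighter condition.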
   
   
 <h4>Aids to Debugging</h4>  <h4>Aids to Development</h4>
 <ul>  <ul>
 <li> Make an option for stack-alloc.c to call <code>malloc</code>  <li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
      separately for each <code>TMP_ALLOC</code> block, so a redzoning       function to check the validity of each <code>mp?_t</code> parameter, in
      malloc debugger could be used during development.       particular to check they've been <code>mp?_init</code>ed.  This might
 <li> Add <code>ASSERT</code>s at the start of each user-visible       catch elementary mistakes in user programs.  Care would need to be taken
      mpz/mpq/mpf function to check the validity of each       over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
      <code>mp?_t</code> parameter, in particular to check they've been       else then consistency checks like size&lt;=alloc, ptr not
      <code>mp?_init</code>ed.  This might catch elementary mistakes in       <code>NULL</code> and ptr+size not wrapping around the address space,
      user programs.  Care would need to be taken over       would be possible.  A more sophisticated scheme could track
      <code>MPZ_TMP_INIT</code>ed variables used internally.       <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
        scheme probably wouldn't be reentrant, not without some help from the
        system.
   <li> tune/time.c could try to determine at runtime whether
        <code>getrusage</code> and <code>gettimeofday</code> are reliable.
        Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
        <code>getrusage</code> doesn't exist.  If a test might take a long time
        to run then perhaps cache the result in a file somewhere.
 </ul>  </ul>
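The consistency checks mentioned above (size&lt;=alloc, ptr not <code>NULL</code>, ptr+size not wrapping around the address space) might look roughly like the following. This is a sketch on a hypothetical <code>toy_mpz_t</code> whose three fields merely resemble an <code>mpz_t</code>; the names <code>toy_mpz_valid</code> and <code>TOY_MPZ_CHECK</code> are assumptions, and GMP's real <code>ASSERT</code> machinery is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned long toy_limb_t;

/* Hypothetical integer object, laid out loosely like an mpz_t. */
typedef struct {
    int alloc;        /* number of limbs allocated at d */
    int size;         /* limbs in use; the sign carries the sign of the value */
    toy_limb_t *d;    /* limb data, least significant first */
} toy_mpz_t;

/* Return 1 if z passes the basic consistency checks, 0 otherwise. */
int toy_mpz_valid(const toy_mpz_t *z)
{
    unsigned used;
    uintptr_t base;

    if (z == NULL || z->d == NULL)
        return 0;                       /* never initialized, or cleared */
    if (z->alloc < 1)
        return 0;
    used = z->size < 0 ? 0u - (unsigned) z->size : (unsigned) z->size;
    if (used > (unsigned) z->alloc)
        return 0;                       /* size <= alloc violated */
    base = (uintptr_t) z->d;
    if (used > (UINTPTR_MAX - base) / sizeof(toy_limb_t))
        return 0;                       /* ptr+size would wrap around */
    return 1;
}

/* How such a check might be placed at the start of a user-visible function. */
#define TOY_MPZ_CHECK(z)  assert(toy_mpz_valid(z))
```

A zero-filled object fails the check because its pointer is <code>NULL</code>. That catches the common mistake of using a static or <code>calloc</code>ed variable without initializing it, but not an uninitialized stack variable holding garbage that happens to look plausible.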
   
   
Line 359  and 64-bit mode (such as -n32 vs -64 under Irix). (3) 
Line 905  and 64-bit mode (such as -n32 vs -64 under Irix). (3) 
 <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.  <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
 </ul>  </ul>
   
 <hr>  
   
 <table width="100%">  <h4>Bright Ideas</h4>
   <tr>  
     <td>  
       <font size=2>  
       Please send comments about this page to  
       <a href="mailto:tege@swox.com">tege@swox.com</a>.<br>  
       Copyright (C) 1999, 2000 Torbjörn Granlund.  
       </font>  
     </td>  
     <td align=right>  
     </td>  
   </tr>  
 </table>  
   
   The following may or may not be feasible, and aren't likely to get done in the
   near future, but are at least worth thinking about.
   
   <ul>
   <li> Reorganize longlong.h so that we can inline the operations even for the
        system compiler.  When there is no such compiler feature, make calls to
        stub functions.  Write such stub functions for as many machines as
        possible.
   <li> longlong.h could declare when it's using, or would like to use,
        <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
        included in libgmp only in that case, the same as is effectively done for
        <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
        would only be a very small space saving, so perhaps not worth the
        complexity.
   <li> longlong.h could be built at configure time by concatenating or
        #including fragments from each directory in the mpn path.  This would
        select CPU specific macros the same way as CPU specific assembler code.
        Code used would no longer depend on cpp predefines, and the current
        nested conditionals could be flattened out.
   <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
        sort of supposed to return the low 31 (or 63) bits.  But this is
        undocumented, and perhaps not too important.
   <li> <code>mpz_*_ui</code> division routines currently return abs(a%b).
        Perhaps make them return the real remainder instead?  Return type would
        be <code>signed long int</code>.  But this would be an incompatible
        change, so it might have to be under newly named functions.
   <li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
        say an extra 16 limbs over what's needed, so as to reduce the chance of
        having to do a reallocate if the <code>mpz_t</code> grows a bit more.
        This could only be an option, since it'd badly bloat memory usage in
        applications using many small values.
   <li> <code>mpq</code> functions could perhaps check for numerator or
        denominator equal to 1, on the assumption that integers or
        denominator-only values might be expected to occur reasonably often.
   <li> <code>count_trailing_zeros</code> is used on more or less uniformly
        distributed numbers in a couple of places.  For some CPUs
        <code>count_trailing_zeros</code> is slow and it's probably worth handling
        the frequently occurring 0 to 2 trailing zeros cases specially.
   <li> <code>mpf_t</code> might like to let the exponent be undefined when
        size==0, instead of requiring it 0 as now.  It should be possible to do
        size==0 tests before paying attention to the exponent.  The advantage is
        not needing to set exp in the various places a zero result can arise,
        which avoids some tedium but is otherwise perhaps not too important.
        Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
        exp==0, maybe elsewhere too.
   <li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
        ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
        allows that attribute on functions, but not function pointers (see info
        node "Attribute Syntax"), so would need a new autoconf test.  This can
        wait until there's a GCC that supports it.
   <li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
        <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
        routines were opened up a bit maybe that code could be shared.  When a
        copy needs to be done there's no carry to append for the add, and if the
        copy is non-empty no high zero for the sub. <br> An alternative would be
        to do a copy at the start and then an in-place add or sub.  Obviously
        that duplicates the fetches and stores for carry propagation, but that's
        normally only one or two limbs.  The same applies to <code>mpz_add</code>
        when one operand is longer than the other, and to <code>mpz_com</code>
        since it's just -(x+1).
   <li> <code>restrict</code>'ed pointers: Does the C99 definition of restrict
        (one writer many readers, or whatever it is) suit the GMP style "same or
        separate" function parameters?  If so, judicious use might improve the
        code generated a bit.  Do any compilers have their own flavour of
        restrict as "completely unaliased", and is that still usable?
   <li> 68000: A 16-bit limb might suit 68000 better than 32-bits, since the
        native multiply is only 16x16.  Could have this as an <code>ABI</code>
        option, selecting <code>_SHORT_LIMB</code> in gmp.h.  Naturally a new set
        of asm subroutines would be necessary.  Would need new
        <code>mpz_set_ui</code> etc since the current code assumes limb&gt;=long,
        but 2-limb operand forms would find a use for <code>long long</code> on
        other processors too.
   <li> Nx1 remainders can be taken at multiplier throughput speed by
        pre-calculating an array "p[i] = 2^(i*<code>BITS_PER_MP_LIMB</code>) mod
        m", then for the input limbs x calculating an inner product "sum
        p[i]*x[i]", and a final 3x1 limb remainder mod m.  If those powers take
        roughly N divide steps to calculate then there'd be an advantage any time
        the same m is used three or more times.  Suggested by Victor Shoup in
        connection with chinese-remainder style decompositions, but perhaps with
        other uses.
   </ul>
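The Nx1 remainder idea above (a precalculated table p[i] = 2^(i*<code>BITS_PER_MP_LIMB</code>) mod m, an inner product, then a small final remainder) can be sketched with 16-bit limbs so that everything fits in standard C integer types. All function names here are hypothetical, and the 16-bit limb size is an assumption made for the illustration. With full-size limbs the accumulated sum would span about three limbs, which is where the final 3x1 remainder in the text comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 16

/* Fill p[i] = 2^(i*LIMB_BITS) mod m for 0 <= i < n.  This is the part
   that costs roughly n divide steps and is reused for the same m. */
void precompute_powers(uint16_t *p, size_t n, uint16_t m)
{
    uint32_t cur;
    assert(m != 0);
    cur = 1u % m;
    for (size_t i = 0; i < n; i++) {
        p[i] = (uint16_t) cur;
        cur = (uint32_t) (((uint64_t) cur << LIMB_BITS) % m);
    }
}

/* Remainder of the n-limb number x (x[0] least significant) mod m,
   using only multiplies in the loop and one division at the end.
   Each product is below 2^32, so the 64-bit sum cannot overflow
   for any n below 2^32. */
uint16_t rem_inner_product(const uint16_t *x, const uint16_t *p,
                           size_t n, uint16_t m)
{
    uint64_t sum = 0;
    assert(m != 0);
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t) x[i] * p[i];
    return (uint16_t) (sum % m);
}

/* Conventional limb-by-limb remainder, one division per limb,
   for comparison. */
uint16_t rem_reference(const uint16_t *x, size_t n, uint16_t m)
{
    uint32_t r = 0;
    assert(m != 0);
    for (size_t i = n; i-- > 0; )
        r = ((r << LIMB_BITS) | x[i]) % m;
    return (uint16_t) r;
}
```

Calling <code>precompute_powers</code> once per modulus and <code>rem_inner_product</code> per operand gives the same result as <code>rem_reference</code>. The table only pays for itself when the same m is reused, as the text notes.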
   <hr>
   
 </body>  </body>
 </html>  </html>
   
   <!--
   Local variables:
   eval: (add-hook 'write-file-hooks 'time-stamp)
   time-stamp-start: "This file current as of "
   time-stamp-format: "%:d %3b %:y"
   time-stamp-end: "\\."
   time-stamp-line-limit: 50
   End:
   -->
