version 1.1.1.2, 2003/08/25 16:06:11
</h1>
</center>

<font size=-1>
Copyright 2000, 2001, 2002 Free Software Foundation, Inc. <br><br>

This file is part of the GNU MP Library. <br><br>

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation; either version 2.1 of the License, or (at
your option) any later version. <br><br>

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details. <br><br>

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA.

</font>

<hr>

<!-- NB. timestamp updated automatically by emacs -->
<comment>
This file current as of 20 May 2002.  An up-to-date version is available at
<a href="http://www.swox.com/gmp/tasks.html">http://www.swox.com/gmp/tasks.html</a>.

Please send comments about this page to
<a href="mailto:bug-gmp@gnu.org">bug-gmp@gnu.org</a>.
</comment>


<p> These are itemized GMP development tasks.  Not all the tasks
listed here are suitable for volunteers, but many of them are.
Please see the <a href="projects.html">projects file</a> for more
sizeable projects.


<h4>Correctness and Completeness</h4>
<ul>
<li> HPUX 10.20 assembler requires a `.LEVEL 1.1' directive for accepting the
new instructions.  Unfortunately, the HPUX 9 assembler as well as earlier
assemblers reject that directive.  How very clever of HP!  We will have to
pass assembler options, and make sure it works with new and old systems
and GNU assembler.
<li> The various reuse.c tests need to force reallocation by calling
<code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
<code>mpz_XXX(a,a,a)</code>.
<li> When printing <code>mpf_t</code> numbers with exponents >2^53 on
machines with 64-bit <code>mp_exp_t</code>, the precision of
<code>__mp_bases[base].chars_per_bit_exactly</code> is insufficient and
<code>mpf_get_str</code> aborts.  Detect and compensate.  Alternately,
think seriously about using some sort of fixed-point integer value.
Avoiding unnecessary floating point is probably a good thing in general,
and it might be faster on some CPUs.
<li> Fix <code>mpz_get_si</code> to work properly for MIPS N32 ABI (and other
machines that use <code>long long</code> for storing limbs.)
<li> Make the string reading functions allow the `0x' prefix when the base is
explicitly 16.  They currently only allow that prefix when the base is
unspecified (zero).
<li> In the development sources, we return abs(a%b) in the
<code>mpz_*_ui</code> division routines.  Perhaps make them return the
real remainder instead?  Changes return type to <code>signed long int</code>.
<li> <code>mpf_eq</code> is not always correct, when one operand is
1000000000... and the other operand is 0111111111..., i.e., extremely
close.  There is a special case in <code>mpf_sub</code> for this
situation; put similar code in <code>mpf_eq</code>.
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
not use just whole limbs, but partial limbs.
<li> Install Alpha assembly changes (prec/gmp-alpha-patches).
<li> <code>mpf_set_str</code> doesn't validate its exponent, for instance
garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow
of a <code>long</code> is not detected.
<li> NeXT has problems with newlines in asm strings in longlong.h.  Also,
<code>__builtin_constant_p</code> is unavailable?  Same problem with MacOS X.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
the inputs, and in that respect doesn't implement the "infinite precision
followed by truncate" specified in the manual.
<li> Shut up SGI's compiler by declaring <code>dump_abort</code> in
mp?/tests/*.c.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000.
<li> <code>mpf_div</code> of x/x doesn't always give 1, reported by Peter
Moulder.  Perhaps it suffices to put +1 on the effective divisor prec, so
that data bits rather than zeros are shifted in when normalizing.  Would
prefer to switch to <code>mpn_tdiv_qr</code>, where all shifting should
disappear.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
variables with pointers to <code>mpz_add</code> etc, which doesn't work
when those routines are coming from a DLL (because they're effectively
function pointer global variables themselves).  Need to rearrange perhaps
to a set of calls to a test function rather than iterating over an array.
<li> demos/pexpr.c: The local variables in <code>main</code> might be
clobbered by the <code>longjmp</code>.
</ul>


<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpn_gcdext</code>, <code>mpz_get_d</code>,
<code>mpf_get_str</code>: Don't test <code>count_leading_zeros</code> for
zero, instead check the high bit of the operand and avoid invoking
<code>count_leading_zeros</code>.  This is an optimization on all
machines, and significant on machines with slow
<code>count_leading_zeros</code>, though it's possible an already
normalized operand might not be encountered very often.
<li> In a couple of places <code>count_trailing_zeros</code> is used
on more or less uniformly distributed numbers.  For some CPUs
<code>count_trailing_zeros</code> is slow and it's probably worth
handling the frequently occurring 0 to 2 trailing zeros cases specially.
<li> Change all places that use <code>udiv_qrnnd</code> for inverting limbs to
instead use <code>invert_limb</code>.
<li> Reorganize longlong.h so that we can inline the operations even for the
system compiler.  When there is no such compiler feature, make calls to
stub functions.  Write such stub functions for as many machines as
possible.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
most significant limb (if <code>BITS_PER_MP_LIMB</code> <= 52 bits).
(Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
products with fewer operations.
<li> Write new <code>mpn_get_str</code> and <code>mpn_set_str</code> running in
the sub O(n^2) range, using some divide-and-conquer approach, preferably
without using division.
<li> Copy tricky code for converting a limb from development version of
<code>mpn_get_str</code> to mpf/get_str.  (Talk to Torbjörn about this.)
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
fast, especially for compile-time constants, but would make application
binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
<code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
would be needed for the divide by zero test, unless that could be left to
<code>mpn_mod_1</code> (not sure currently whether all the risc chips
provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
<code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
might, unfortunately, be necessary to forcibly include <limits.h>
since there's no apparent way to get <code>SHRT_MAX</code> with an
expression (since <code>short</code> and <code>unsigned short</code> can
be different sizes).
<li> Consider inlining these functions: <code>mpz_size</code>,
<code>mpz_set_ui</code>, <code>mpz_set_q</code>, <code>mpz_clear</code>,
<code>mpz_init</code>, <code>mpz_get_ui</code>, <code>mpz_scan0</code>,
<code>mpz_scan1</code>, <code>mpz_getlimbn</code>,
<code>mpz_init_set_ui</code>, <code>mpz_perfect_square_p</code>,
<code>mpz_popcount</code>, <code>mpf_size</code>,
<code>mpf_get_prec</code>, <code>mpf_set_prec_raw</code>,
<code>mpf_set_ui</code>, <code>mpf_init</code>, <code>mpf_init2</code>,
<code>mpf_clear</code>, <code>mpf_set_si</code>.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very
fast on one or two limb moduli, due to a lot of function call
overheads.  These could perhaps be handled as special cases.
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> want better
algorithm selection, and the latter should use REDC.  Both could
change to use an <code>mpn_powm</code> and <code>mpn_redc</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
using the division method when they're small, since the REDC form of a
small multiplier is normally a full size product.  Probably would need a
new tuned parameter to say what size multiplier is "small", as a function
of the size of the modulus.
<li> <code>mpz_powm</code> REDC should handle even moduli if possible.  Maybe
this would mean for m=n*2^k doing mod n using REDC and an auxiliary
calculation mod 2^k, then putting them together at the end.
<li> <code>mpn_gcd</code> might be able to be sped up on small to
moderate sizes by improving <code>find_a</code>, possibly just by
providing an alternate implementation for CPUs with slowish
<code>count_leading_zeros</code>.
<li> Toom3 <code>USE_MORE_MPN</code> could use a low to high cache localized
evaluate and interpolate.  The necessary <code>mpn_divexact_by3c</code>
exists.
<li> <code>mpn_mul_basecase</code> on NxM with big N but small M could try for
better cache locality by taking N piece by piece.  The current code could
be left available for CPUs without caching.  Depending how karatsuba etc
is applied to unequal size operands it might be possible to assume M is
always smallish.
<li> <code>mpn_perfect_square_p</code> on small operands might be better off
skipping the residue tests and just taking a square root.
<li> <code>mpz_perfect_power_p</code> could be improved in a number of ways.
Test for Nth power residues modulo small primes like
<code>mpn_perfect_square_p</code> does.  Use p-adic arithmetic to find
possible roots.  Divisibility by other primes should be tested by
grouping into a limb like <code>PP</code>.
<li> <code>mpz_perfect_power_p</code> might like to use <code>mpn_gcd_1</code>
instead of a private GCD routine.  The use it's put to isn't
time-critical, and it might help ensure correctness to use the main GCD
routine.
<li> <code>mpz_perfect_power_p</code> could use
<code>mpz_divisible_ui_p</code> instead of <code>mpz_tdiv_ui</code> for
divisibility testing, the former is faster on a number of systems.  (But
all that prime test stuff is going to be rewritten some time.)
<li> Change <code>PP</code>/<code>PP_INVERTED</code> into an array of such
pairs, listing several hundred primes.  Perhaps actually make the
products larger than one limb each.
<li> <code>PP</code> can have factors of 2 introduced in order to get the high
bit set and therefore a <code>PP_INVERTED</code> existing.  The factors
of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
The advantage of this is that <code>mpn_preinv_mod_1</code> can then be
used if it's faster than plain <code>mpn_mod_1</code>.  This would be a
change only for 16-bit limbs, all the rest already have <code>PP</code>
in the right form.
<li> <code>PP</code> could have extra factors of 3 or 5 or whatever introduced
if they fit, and final remainders mod 9 or 25 or whatever used, thereby
making more efficient use of the <code>mpn_mod_1</code> done.  On a
16-bit limb it looks like <code>PP</code> could take an extra factor of 3.
<li> <code>mpz_probab_prime_p</code>, <code>mpn_perfect_square_p</code> and
<code>mpz_perfect_power_p</code> could use <code>mpn_mod_34lsub1</code>
to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
3, 5, 7, 13 and 17 (factors of 2^24-1).  This could either replace the
<code>PP</code> division currently done, or allow <code>PP</code> to do
larger primes, depending how many residue tests seem worthwhile before
launching into full root extractions or Miller-Rabin etc.
<li> <code>mpz_probab_prime_p</code> (and maybe others) could code the
divisibility tests like <code>n%7 == 0</code> in the form
<pre>
#define MP_LIMB_DIVISIBLE_7_P(n) \
  ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
</pre>
This would help compilers which don't know how to optimize divisions by
constants, and would help current gcc (3.0) too since gcc forms a whole
remainder rather than using a modular inverse and comparing.  This
technique works for any odd modulus, and with some tweaks for even moduli
too.  See Granlund and Montgomery "Division By Invariant Integers"
section 9.
<li> <code>mpz_probab_prime_p</code> and <code>mpz_nextprime</code> could
offer certainty for primes up to 2^32 by using a one limb miller-rabin
test to base 2, combined with a table of actual strong pseudoprimes in
that range (2314 of them).  If that table is too big then both base 2 and
base 3 tests could be done, leaving a table of 104.  The test could use
REDC and therefore be a <code>modlimb_invert</code>, a remainder (maybe),
then two multiplies per bit (successively dependent).  Processors with
pipelined multipliers could do base 2 and 3 in parallel.  Vector systems
could do a whole bunch of bases in parallel, and perhaps offer near
certainty up to 64-bits (certainty might depend on an exhaustive search
of pseudoprimes up to that limit).  Obviously 2^32 is not a big number,
but an efficient and certain calculation is attractive.  It might find
other uses internally, and could even be offered as a one limb prime test
<code>mpn_probab_prime_1_p</code> or <code>gmp_probab_prime_ui_p</code>
perhaps.
<li> <code>mpz_probab_prime_p</code> doesn't need to make a copy of
<code>n</code> when the input is negative, it can setup an
<code>mpz_t</code> alias, same data pointer but a positive size.  With no
need to clear before returning, the recursive function call could be
dispensed with too.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
fraction but is exactly representable, eg. 0.5 in decimal.  These could be
stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
use <code>mpn_and_n</code> etc for the benefit of the small number of
targets with native versions of those routines.  Need to be careful not to
pass size==0.  Is some code sharing possible between the <code>mpz</code>
routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
unless it's really necessary (currently only sizes are tested, not
whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
result, perhaps test for r==u and do nothing in that case, rather than
currently it looks like an <code>MPN_COPY_INCR</code> will be done to
reduce prec+1 limbs to prec.
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
shift the dividend on-the-fly, since this should cost nothing on
superscalar processors and avoid the need for temporary copying in
<code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt_ui</code> calculates prec+1 limbs, whereas just prec would
satisfy the application requested precision.  It should suffice to simply
reduce the rsize temporary to 2*prec-1 limbs.  <code>mpf_sqrt</code>
might be similar.
<li> <code>invert_limb</code> generic C: The division could use dividend
b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
(b-d):0, where b=2^<code>BITS_PER_MP_LIMB</code> and d=divisor.  The
former is per the original paper and is used in the x86 code, the
advantage is that the current special case for 0x80..00 could be dropped.
The two should be equivalent, but a little check of that would be wanted.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
<code>num2*den1</code> products limb-by-limb from high to low and look at
each step for values differing by more than the possible carry bit from
the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
and compare.  The benefits of karatsuba and higher multiplication
algorithms are lost, but if it's assumed only a few high limbs will be
needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
<code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
instead of the functions, so they get inlined on all compilers, not just
gcc and others with <code>inline</code> recognised in gmp.h.
<code>__GMPN_ADD_1</code> etc are meant mostly to support application
inline <code>mpn_add_1</code> etc and if they don't come out good for
internal uses then special forms can be introduced, for instance many
internal uses are in-place.  Sometimes a block of code is executed based
on the carry-out, rather than using it arithmetically, and those places
might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
bitfield for the mantissa extraction, not two, when endianness permits.
Might depend on the compiler allowing <code>long long</code> bit fields
when that's the only actual 64-bit type.
<li> <code>mpf_get_d</code> could be more like <code>mpz_get_d</code> and do
more in integers and give the float conversion as such a chance to round
in its preferred direction.  Some code sharing ought to be possible.  Or
if nothing else then for consistency the two ought to give identical
results on integer operands (not clear if this is so right now).
<li> <code>usqr_ppm</code> or some such could do a widening square in the
style of <code>umul_ppmm</code>.  This would help 68000, and be a small
improvement for the generic C (which is used on UltraSPARC/64 for
instance).  GCC recognises the generic C ul*vh and vl*uh are identical,
but does two separate additions to the rest of the result.
<li> tal-notreent.c could keep a block of memory permanently allocated.
Currently the last nested <code>TMP_FREE</code> releases all memory, so
there's an allocate and free every time a top-level function using
<code>TMP</code> is called.  Would need
<code>mp_set_memory_functions</code> to tell tal-notreent.c to release
any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
inlined.  If the current chunk has enough room then a couple of pointers
can be updated.  Only if more space is required then a call to some sort
of <code>__gmp_tmp_increase</code> would be needed.  The requirement that
<code>TMP_ALLOC</code> is an expression might make the implementation a
bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)                                \
  ((ROUND_UP(n) > current->end - current->point ?   \
    __gmp_tmp_increase (ROUND_UP (n)) : 0),         \
   current->point += ROUND_UP (n),                  \
   current->point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
never used.  Perhaps the table should just go up to base 16, and have
code to generate data above that, if and when required.  Naturally this
assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
otherwise, to save space.
<li> Make <code>mpf_get_str</code> and <code>mpf_set_str</code> call the
corresponding, much faster, mpn functions.
<li> <code>mpn_mod_1</code> could pre-calculate values of R mod N, R^2 mod N,
R^3 mod N, etc, with R=2^<code>BITS_PER_MP_LIMB</code>, and use them to
process multiple limbs at each step by multiplying.  Suggested by Peter
L. Montgomery.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
are of course fast, it seems a little silly to make a second pass over
the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
A), and A<B, then the code ends up generating the cofactor T (of B) and
deriving S from that.  Perhaps it'd be possible to arrange to get S in
the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
might only be an advantage if A and B are about the same size.
<li> <code>mpn_toom3_mul_n</code>, <code>mpn_toom3_sqr_n</code>: Temporaries
<code>B</code> and <code>D</code> are adjacent in memory and at the final
coefficient additions look like they could use a single
<code>mpn_add_n</code> of <code>l4</code> limbs rather than two of
<code>l2</code> limbs.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>udiv_qrnnd_preinv2norm</code>, the branch-free version of
<code>udiv_qrnnd_preinv</code>, might be faster on various pipelined
chips.  In particular the first <code>if (_xh != 0)</code> in
<code>udiv_qrnnd_preinv</code> might be roughly a 50/50 chance and might
branch predict poorly.  (The second test is probably almost always
false.)  Measuring with the tuneup program would be possible, but perhaps
a bit messy.  In any case maybe the default should be the branch-free
version.
<br>
Note that the current <code>udiv_qrnnd_preinv2norm</code> implementation
assumes a right shift will sign extend, which is not guaranteed by the C
standards, and doesn't happen on Cray vector systems.
<li> Run the `tune' utility for more compiler/CPU combinations.  We would like
to have gmp-mparam.h files in practically every implementation specific
mpn subdirectory, and repeat each *_THRESHOLD for gcc and the system
compiler.  See the `tune' top-level directory for more information.
<pre>
#ifdef __GNUC__
#if __GNUC__ == 2 && __GNUC_MINOR__ == 7
...
#endif
#if __GNUC__ == 2 && __GNUC_MINOR__ == 8
...
#endif
#ifndef MUL_KARATSUBA_THRESHOLD
/* Default GNUC values */
...
#endif
#else /* system compiler */
...
#endif
</pre>
<li> <code>invert_limb</code> on various processors might benefit from the
little Newton iteration done for alpha and ia64.
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
<code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
and <code>mpn_submul_1</code> for the 21164.  This should use both integer
multiplies and floating-point multiplies.  For the floating-point
operations, the single-limb multiplier should be split into three 21-bit
chunks, or perhaps even better in four 16-bit chunks.  Probably possible
to reach 9 cycles/limb.
<li> Alpha 21264 ev67: Use <code>ctlz</code> and <code>cttz</code> for
<code>count_leading_zeros</code> and <code>count_trailing_zeros</code>.
Use inline for gcc, probably want asm files for elsewhere.
<li> ARC: gcc longlong.h sets up <code>umul_ppmm</code> to call
<code>__umulsidi3</code> in libgcc.  Could be copied straight across, but
perhaps ought to be tested.
<li> ARM: On v5 cpus see if the <code>clz</code> instruction can be used for
<code>count_leading_zeros</code>.
<li> SPARC32/V9: Find out why the speed of <code>mpn_addmul_1</code>
and the other multiplies varies so much on successive sizes.
<li> Itanium: <code>mpn_divexact_by3</code> isn't particularly important, but
the generic C runs at about 27 c/l, whereas with the multiplies off the
dependent chain about 3 c/l ought to be possible.
<li> Itanium: <code>mpn_hamdist</code> could be put together based on the
current <code>mpn_popcount</code>.
<li> Itanium: <code>popc_limb</code> in gmp-impl.h could use the
<code>popcnt</code> insn.
<li> Itanium: <code>mpn_submul_1</code> is not implemented directly, only via
a combination of <code>mpn_mul_1</code> and <code>mpn_sub_n</code>.
|
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>, |
|
for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk). Not sure how |
|
much this can improve the speed, though, since the symmetry that we rely |
|
on is lost. Perhaps we can just gain cycles when s2 < 2^16, or more |
|
accurately, when two 16-bit s2 chunks which are 16 bits apart are zero. |
|
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to |
|
<code>mpn_addmul_1</code>. |
|
<li> UltraSPARC/64: Write <code>umul_ppmm</code>. Using four |
|
"<code>mulx</code>"s either with an asm block or via the generic C code is |
|
about 90 cycles. Try using fp operations, and also try using karatsuba |
|
for just three "<code>mulx</code>"s. |
|
<li> UltraSPARC/64: <code>mpn_divrem_1</code>, <code>mpn_mod_1</code>, |
|
<code>mpn_divexact_1</code> and <code>mpn_modexact_1_odd</code> could |
|
process 32 bits at a time when the divisor fits 32-bits. This will need |
|
only 4 <code>mulx</code>'s per limb instead of 8 in the general case. |
|
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>. |
|
Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do. |
|
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16. |
|
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if |
|
possible (see commented out code in longlong.h). This is unlikely to |
|
save more than a couple of cycles, so perhaps isn't worth bothering with. |
|
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code> |
|
or anything to indicate V9 support when -mcpu=v9 is selected. See |
|
gcc/config/sol2-sld-64.h. Will need to pass something through from |
|
./configure to select the right code in longlong.h. (Currently nothing |
|
is lost because <code>mulx</code> for multiplying is commented out.) |
|
<li> UltraSPARC: <code>modlimb_invert</code> might save a few cycles from |
|
masking down to just the useful bits at each point in the calculation, |
|
since <code>mulx</code> speed depends on the highest bit set. Either |
|
explicit masks or small types like <code>short</code> and |
|
<code>int</code> ought to work. |
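The masking idea can be sketched in C for a 32-bit limb. This is a generic Newton-iteration inverse (a hypothetical helper, not the gmp-impl.h macro); the correct low bits double each step, so an intermediate result can be masked to a small type where that suffices:

```c
#include <stdint.h>

/* Inverse of an odd a modulo 2^32 by Newton iteration.  Correct bits
   go 5 -> 10 -> 20 -> 40, so the second step can be masked down to 16
   bits, keeping the values small as the item above suggests. */
static uint32_t modlimb_invert32 (uint32_t a)
{
  uint32_t x = (3 * a) ^ 2;               /* 5 correct low bits */
  x = (uint16_t) (x * (2 - a * x));       /* 10 bits, masked to 16 */
  x *= 2 - a * x;                         /* 20 bits */
  x *= 2 - a * x;                         /* all 32 bits */
  return x;
}
```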
|
<li> Sparc64 HAL R1: <code>mpn_popcount</code> and <code>mpn_hamdist</code> |
|
could use <code>popc</code> currently commented out in gmp-impl.h. This |
|
chip reputedly implements <code>popc</code> properly (see gcc sparc.md), |
|
would need to recognise the chip as <code>sparchalr1</code> or something |
|
in configure / config.sub / config.guess. |
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
should be possible to saturate the cache, which will happen at 8
cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 < 2^32;
it should be possible to make them run at about 5 cycles/limb.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
<code>mpn_mul_1</code>.  Use both integer and floating-point operations,
possibly two floating-point and one integer limb per loop.  Split operands
into four 16-bit chunks for fast fp operations.  Should easily reach 9
cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
(using one int + two fp).
|
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop |
|
as <code>mpn_lshift</code>. Some judicious use of m4 might let the two |
|
share source code, or with a register to control the loop direction |
|
perhaps even share object code. |
|
<li> PowerPC-32: <code>mpn_rshift</code> should do the same sort of unrolled |
|
loop as <code>mpn_lshift</code>. |
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
for important machines.  Helping the generic sqr_basecase.c with an
<code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
movs</code> would upset GCC register allocation for the whole function.
Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
<code>__builtin_memcpy</code>.  Examine the code for some simple and
complex functions to find out.  Inlining <code>rep movs</code> would be
desirable, it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come |
|
down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after |
|
<code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README. |
|
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code> |
|
might benefit from some destination prefetching. |
|
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a |
|
mul-by-inverse, hoping for maybe 30 c/l. |
|
<li> P6: <code>mpn_add_n</code> and <code>mpn_sub_n</code> should be able to go |
|
faster than the generic x86 code at 3.5 c/l. The athlon code for instance |
|
runs at about 2.7. |
|
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to |
|
do something branch-free for unaligned startups, and shaving one insn |
|
from the loop with alternative indexing might save a cycle. |
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> PPC32: Write <code>mpn_rshift</code> based on new <code>mpn_lshift</code>.
<li> PPC32: Rewrite <code>mpn_add_n</code> and <code>mpn_sub_n</code>.  Should
run at just 3.25 cycles/limb.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
<code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
method", to avoid carry propagation, splitting one of the operands into
11-bit chunks.
<li> 68k, Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
special lshift by 1.  This would be best as a jump across to the other
routine, could let both live in lshift.asm and omit rshift.asm on finding
<code>mpn_rshift</code> already provided.
|
<li> Cray T3E: Experiment with optimization options. In particular, |
|
-hpipeline3 seems promising. We should at least up -O to -O2 or -O3. |
|
<li> Cray: <code>mpn_com_n</code>, <code>mpn_and_n</code>, etc. very probably
want a pragma like <code>MPN_COPY_INCR</code>.
|
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>, |
|
<code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small |
|
and could be inlined to avoid function calls. |
|
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c |
|
scheme. Not sure why, maybe they merely give the compiler more |
|
information about aliasing (or the lack thereof). Would like to modify |
|
<code>TMP_ALLOC</code> to use them, or introduce a new scheme. Memory |
|
blocks wanted unconditionally are easy enough, those wanted only |
|
sometimes are a problem. Perhaps a special size calculation to ask for a |
|
dummy length 1 when unwanted, or perhaps an inlined subroutine |
|
duplicating code under each conditional. Don't really want to turn |
|
everything into a dog's dinner just because Cray don't offer an |
|
<code>alloca</code>. |
|
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize. |
|
Does it? <code>bits_per_digit</code> and the inner loop over bits in a |
|
limb might prevent it. Perhaps special cases for binary, octal and hex |
|
would be worthwhile (very possibly for all processors too). |
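For illustration, a power-of-2 base needs no division at all; the digits come straight off the limbs with shifts and masks, which is why such special cases ought to vectorize well. A minimal hex sketch, assuming 32-bit limbs stored least significant first (names hypothetical):

```c
#include <stdint.h>

/* Write the hex digits of {limbs,n}, most significant digit first,
   skipping leading zeros.  No divisions at all, just shifts and
   masks.  Returns the string length written. */
static int hex_get_str (char *buf, const uint32_t *limbs, int n)
{
  static const char digits[] = "0123456789abcdef";
  int started = 0, len = 0;
  for (int i = n - 1; i >= 0; i--)
    for (int shift = 28; shift >= 0; shift -= 4)
      {
        int d = (limbs[i] >> shift) & 0xF;
        if (d != 0)
          started = 1;
        if (started)
          buf[len++] = digits[d];
      }
  if (!started)
    buf[len++] = '0';            /* the value zero still prints "0" */
  buf[len] = '\0';
  return len;
}
```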
|
<li> Cray: <code>popc_limb</code> could use the Cray <code>_popc</code> |
|
intrinsic. That would help <code>mpz_hamdist</code> and might make the |
|
generic C versions of <code>mpn_popcount</code> and |
|
<code>mpn_hamdist</code> suffice for Cray (if it vectorizes, or can be |
|
given a hint to do so). |
|
<li> 68000: <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>, |
|
<code>mpn_submul_1</code>: Check for a 16-bit multiplier and use two |
|
multiplies per limb, not four. |
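In portable terms the idea is the following, with 32-bit limbs standing in for the 68000's registers (a hypothetical helper, not actual mpn code):

```c
#include <stdint.h>

/* One limb of mul_1 when the multiplier v fits in 16 bits: two
   16x16->32 multiplies instead of the four a full 32-bit multiplier
   needs.  Returns the low result limb and stores the carry-out. */
static uint32_t mul1_limb_16 (uint32_t u, uint16_t v, uint32_t *carry)
{
  uint32_t lo = (u & 0xFFFF) * v;            /* 16x16 -> 32 */
  uint32_t hi = (u >> 16) * v;               /* 16x16 -> 32 */
  uint32_t r = lo + ((hi & 0xFFFF) << 16);
  *carry = (hi >> 16) + (r < lo);            /* high limb out */
  return r;
}
```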
|
<li> 68000: <code>mpn_lshift</code> and <code>mpn_rshift</code> could use a |
|
<code>roll</code> and mask instead of <code>lsrl</code> and |
|
<code>lsll</code>. This promises to be a speedup, effectively trading a |
|
6+2*n shift for one or two 4 cycle masks. Suggested by Jean-Charles |
|
Meyrignac. |
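A C rendering of the rotate trick (a hypothetical helper; the win is on chips where a rotate is cheaper than two shifts): each limb is rotated once, and a mask splits the rotated word into the part kept in this limb and the part carried into the next.

```c
#include <stdint.h>

/* mpn-style left shift by cnt (0 < cnt < 32) using one rotate and
   two masks per limb instead of two shifts per limb.  Returns the
   bits shifted out the top.  Sketch only: r and u must not overlap. */
static uint32_t lshift_rotate (uint32_t *r, const uint32_t *u,
                               int n, unsigned cnt)
{
  uint32_t mask = ((uint32_t) 1 << cnt) - 1;
  uint32_t prev = 0;
  for (int i = 0; i < n; i++)
    {
      uint32_t rot = (u[i] << cnt) | (u[i] >> (32 - cnt));  /* rotate */
      r[i] = (rot & ~mask) | prev;   /* high part plus incoming bits */
      prev = rot & mask;             /* bits destined for next limb */
    }
  return prev;
}
```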
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
<pre>
	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   ... </pre>
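Continuing the halving steps, a complete binary-search version for a 64-bit limb might look like this (a generic sketch, not the longlong.h macro):

```c
#include <stdint.h>

/* count_leading_zeros by binary search: six compare-and-shift steps
   cover 64 bits.  x == 0 is left undefined, as in longlong.h. */
static int clz64 (uint64_t x)
{
  int cnt = 0;
  if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
  if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
  if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
  if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
  if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
  if ((x >> 63) == 0) { cnt += 1; }
  return cnt;
}
```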
|
<li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps |
|
be used in <code>__GMP_EXTERN_INLINE</code>. What would be the right way |
|
to identify suitable versions of that compiler? |
|
<li> VAX D and G format <code>double</code> floats are straightforward and |
|
could perhaps be handled directly in <code>__gmp_extract_double</code> |
|
and maybe in <code>mpz_get_d</code>, rather than falling back on the |
|
generic code. (Both formats are detected by <code>configure</code>.) |
|
<li> <code>mpn_get_str</code> final divisions by the base with |
|
<code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse |
|
on suitable machines. This ends up happening for decimal by presenting |
|
the compiler with a run-time constant, but the same for other bases would |
|
be good.  Perhaps use could be made of the fact that base < 256.
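For instance, division by the constant base 10 on a 32-bit limb can be done exactly with one widening multiply and a shift, the magic constant being ceil(2^35/10). This is a generic sketch of the multiply-by-inverse idea, not GMP's actual code:

```c
#include <stdint.h>

/* Exact n/10 for any 32-bit n: multiply by ceil(2^35/10) and shift.
   The same construction works for any base < 256, with the constant
   and shift count precomputed once per base. */
static uint32_t div10 (uint32_t n)
{
  return (uint32_t) (((uint64_t) n * 0xCCCCCCCDULL) >> 35);
}
```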
|
<li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a |
|
structure like <code>div_t</code> to avoid going through memory, in |
|
particular helping RISCs that don't do store-to-load forwarding. Clearly |
|
this is only possible if the ABI returns a structure of two |
|
<code>mp_limb_t</code>s in registers. |
</ul>


<h4>New Functionality</h4>
<ul>
<li> Add in-memory versions of <code>mp?_out_raw</code> and
<code>mp?_inp_raw</code>.
<li> <code>mpz_get_nth_ui</code>.  Return the nth word (not necessarily the
nth limb).
<li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
<li> Let `0b' and `0B' mean binary input everywhere.
<li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
<code>_mpz_realloc</code> do the initial alloc.  Set
<code>z->_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
similar can unconditionally fetch from.  Niels Möller has had a go at
this.
<br> |
|
The advantages of the lazy scheme would be: |
|
<ul> |
|
<li> Initial allocate would be the size required for the first value |
|
stored, rather than getting 1 limb in <code>mpz_init</code> and then |
|
more or less immediately reallocating. |
|
<li> <code>mpz_init</code> would only store magic values in the |
|
<code>mpz_t</code> fields, and could be inlined. |
|
<li> A fixed initializer could even be used by applications, like |
|
<code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient |
|
for globals. |
|
</ul> |
|
The advantages of the current scheme are: |
|
<ul> |
|
<li> <code>mpz_set_ui</code> and other similar routines needn't check the |
|
size allocated and can just store unconditionally. |
|
<li> <code>mpz_set_ui</code> and perhaps others like |
|
<code>mpz_tdiv_r_ui</code> and a prospective |
|
<code>mpz_set_ull</code> could be inlined. |
|
</ul> |
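A toy model of the lazy scheme, using a stand-in struct rather than GMP's real mpz internals (all field and function names here are hypothetical):

```c
#include <stdlib.h>

/* Stand-in for an mpz-like type: alloc == 0 means "nothing allocated
   yet", and d points at a shared dummy limb so that reads need no
   conditional.  This mirrors the suggestion, not actual GMP code. */
typedef struct { int alloc; int size; unsigned long *d; } lazy_mpz;

static unsigned long dummy_limb = 0;

static void lazy_init (lazy_mpz *z)
{
  z->alloc = 0;                 /* magic value: alloc on first store */
  z->size = 0;
  z->d = &dummy_limb;           /* safe to fetch unconditionally */
}

static unsigned long lazy_get_ui (const lazy_mpz *z)
{
  return z->d[0];               /* no branch, thanks to the dummy */
}

static void lazy_set_ui (lazy_mpz *z, unsigned long v)
{
  if (z->alloc == 0)            /* first store does the real alloc */
    {
      z->d = malloc (sizeof (unsigned long));
      z->alloc = 1;
    }
  z->d[0] = v;
  z->size = (v != 0);
}
```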
<li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
format is portable between 32-bit and 64-bit machines, and between
little-endian and big-endian machines.
<li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
logops and copys available in gmp.h, either as library functions or
inlines, with the availability of library functions instantiated in the
generated gmp.h at build time.
<li> <code>mpz_set_str</code> etc variants taking string lengths rather than
null-terminators.
|
<li> Consider changing the thresholds to apply the simpler algorithm when
"<code><=</code>" rather than "<code><</code>", so a threshold can
be set to <code>MP_SIZE_T_MAX</code> to get only the simpler code (the
compiler will know <code>size <= MP_SIZE_T_MAX</code> is always true).
Alternately it looks like the <code>ABOVE_THRESHOLD</code> and
<code>BELOW_THRESHOLD</code> macros can do this adequately, and also pick
up cases where a threshold of zero should mean only the second algorithm.
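Such macros might look like the following sketch, written directly from the description above (gmp-impl.h's actual definitions may differ):

```c
#include <limits.h>

/* Threshold tests where 0 forces the second (fancy) algorithm and
   MP_SIZE_T_MAX forces the first (simple) one.  MP_SIZE_T_MAX is a
   stand-in value here. */
#define MP_SIZE_T_MAX  LONG_MAX

#define ABOVE_THRESHOLD(size, thresh)                    \
  ((thresh) == 0                                         \
   || ((thresh) != MP_SIZE_T_MAX && (size) >= (thresh)))
#define BELOW_THRESHOLD(size, thresh)  (! ABOVE_THRESHOLD (size, thresh))
```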
|
<li> <code>mpz_nthprime</code>. |
|
<li> Perhaps <code>mpz_init2</code>, initializing and making initial room for |
|
N bits. The actual size would be rounded up to a limb, and perhaps an |
|
extra limb added since so many <code>mpz</code> routines need that on |
|
their destination. |
|
<li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>, |
|
<code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions, |
|
if they could share code with the current such functions (which should be |
|
possible). |
|
<li> <code>mpz_and_ui</code> etc might be of use sometimes. Suggested by |
|
Niels Möller. |
|
<li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully |
|
accept 0x, 0b etc when base==0. Perhaps the exponent could default to |
|
decimal in this case, with a further 0x, 0b etc allowed there. |
|
Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but |
|
probably something like "0.123" ought not mean octal. |
|
<li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented |
|
feature of gmp.h, so applications could know whether to |
|
<code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>. |
|
<li> <code>PRIdMP_LIMB</code> and similar defines following C99 |
|
<inttypes.h> might be of use to applications printing limbs. |
|
Perhaps they should be defined only if specifically requested, the way |
|
<inttypes.h> does. But if <code>GMP_LONG_LONG_LIMB</code> or |
|
whatever is added then perhaps this can easily enough be left to |
|
applications. |
|
<li> <code>mpf_get_ld</code> and <code>mpf_set_ld</code> converting |
|
<code>mpf_t</code> to and from <code>long double</code>. Other |
|
<code>long double</code> routines would be desirable too, but these would |
|
be a start. Often <code>long double</code> is the same as |
|
<code>double</code>, which is easy but pretty pointless. Should |
|
recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on |
|
sparc, hppa and power. Might like an ABI sub-option or something when |
|
it's a compiler option for 64-bit or 128-bit <code>long double</code>. |
|
<li> <code>gmp_printf</code> could accept <code>%b</code> for binary output. |
|
It'd be nice if it worked for plain <code>int</code> etc too, not just |
|
<code>mpz_t</code> etc. |
|
<li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base, |
|
for both integer and float conversions. A base either in the format |
|
string or as a parameter with <code>*</code> should be allowed. Maybe |
|
<code>%13b</code> (b for base) or something like that.
|
<li> <code>gmp_printf</code> could perhaps have a type code for an |
|
<code>mp_limb_t</code>. That would save an application from having to |
|
worry whether it's a <code>long</code> or a <code>long long</code>. |
|
<li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float |
|
conversions, eg. <code>"%.4Qf"</code>. This would be merely for |
|
convenience, but still might be useful. Rounding would be the same as |
|
for an <code>mpf_t</code> (ie. currently round-to-nearest, but not |
|
actually documented). Alternately, perhaps a separate |
|
<code>mpq_get_str_point</code> or some such might be more use. Suggested |
|
by Pedro Gimeno. |
|
<li> <code>gmp_printf</code> could usefully accept a flag to control the |
|
rounding of float conversions.  This wouldn't do much for
|
<code>mpf_t</code>, but would be good if <code>mpfr_t</code> was |
|
supported in the future, or perhaps for <code>mpq_t</code>. Something |
|
like <code>%*r</code> (r for rounding, and mpfr style
|
<code>GMP_RND</code> parameter). |
|
<li> <code>mpz_combit</code> to toggle a bit would be a good companion for |
|
<code>mpz_setbit</code> and <code>mpz_clrbit</code>. Suggested by Niels |
|
Möller (and has done some work towards it). |
|
<li> <code>mpz_scan0_reverse</code> or <code>mpz_scan0low</code> or some such |
|
searching towards the low end of an integer might match |
|
<code>mpz_scan0</code> nicely. Likewise for <code>scan1</code>. |
|
Suggested by Roberto Bagnara. |
|
<li> <code>mpz_bit_subset</code> or some such to test whether one integer is a |
|
bitwise subset of another might be of use. Some sort of return value |
|
indicating whether it's a proper or non-proper subset would be good and |
|
wouldn't cost anything in the implementation. Suggested by Roberto |
|
Bagnara. |
|
<li> <code>gmp_randinit_r</code> and maybe <code>gmp_randstate_set</code> to |
|
init-and-copy or to just copy a <code>gmp_randstate_t</code>. Suggested |
|
by Pedro Gimeno. |
|
<li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between |
|
<code>mpf_t</code> and <code>long double</code>, suggested by Dan |
|
Christensen. There'd be some work to be done by <code>configure</code> |
|
to recognise the format in use. xlc on aix for instance apparently has |
|
an option for either plain double 64-bit or quad 128-bit precision. This |
|
might mean library contents vary with the compiler used to build, which |
|
is undesirable. It might be possible to detect the mode the application |
|
is compiling with, and try to avoid mismatch problems. |
|
<li> <code>mpz_sqrt_if_perfect_square</code>: When |
|
<code>mpz_perfect_square_p</code> does its tests it calculates a square |
|
root and then discards it. For some applications it might be useful to |
|
return that root. Suggested by Jason Moxham. |
|
<li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>, |
|
<code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
|
<code>long long</code>. These would aid interoperability, though a |
|
mixture of GMP and <code>long long</code> would probably not be too |
|
common. Disadvantages of using <code>long long</code> in libgmp.a would |
|
be |
|
<ul> |
|
<li> Library contents vary according to the build compiler. |
|
<li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the |
|
application compiler could take the <code>long long</code> |
|
prototypes. |
|
<li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> would be wanted to |
|
indicate whether the functions are available. (Applications using |
|
autoconf could probe the library too.) |
|
</ul> |
|
It'd be possible to defer the need for <code>long long</code> to |
|
application compile time, by having something like |
|
<code>mpz_set_2ui</code> called with two halves of a <code>long |
|
long</code>. Disadvantages of this would be, |
|
<ul> |
|
<li> Bigger code in the application, though perhaps not if a <code>long |
|
long</code> is normally passed as two halves anyway. |
|
<li> <code>mpz_get_ull</code> would be a rather big inline, or would have |
|
to be two function calls. |
|
<li> <code>mpz_get_sll</code> would be a worse inline, and would put the |
|
treatment of <code>-0x10..00</code> into applications (see |
|
<code>mpz_get_si</code> correctness above). |
|
<li> Although having libgmp.a independent of the build compiler is nice, |
|
it sort of sacrifices the capabilities of a good compiler to |
|
uniformity with inferior ones. |
|
</ul> |
|
Plain use of <code>long long</code> is probably the lesser evil, if only |
|
because it makes best use of gcc. |
|
<li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>. |
|
Suggested by Alexander Kruppa. |
</ul>


<h4>Configuration</h4>
<ul>
<li> Floating-point format: <code>GMP_C_DOUBLE_FORMAT</code> seems to work
well.  Get rid of the <code>#ifdef</code> mess in gmp-impl.h and use the
results of the test instead.
<li> a29k: umul.s and udiv.s exist but don't get used.
|
<li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>, |
|
but is that available only for M series chips or some such? Perhaps it |
|
should be configured in some way. |
|
<li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00. |
|
<li> HPPA 2.0w: gcc is rumoured to support 2.0w as of version 3, though |
|
perhaps just as a build-time choice. In any case, figure out how to |
|
identify a suitable gcc or put it in the right mode, for the GMP compiler |
|
choices. |
|
<li> IA64: Latest libtool has some nonsense to detect ELF-32 or ELF-64 on |
|
<code>ia64-*-hpux*</code>. Does GMP need to know anything about that? |
|
<li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc. |
|
"hinv -c processor" gives lots of information on Irix. Standard |
|
config.guess appends "el" to indicate endianness, but |
|
<code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP. |
|
<li> PowerPC: The function descriptor nonsense for AIX is currently driven by |
|
<code>*-*-aix*</code>. It might be more reliable to do some sort of |
|
feature test, examining the compiler output perhaps. It might also be |
|
nice to merge the aix.m4 files into powerpc-defs.m4. |
|
<li> Sparc: <code>config.guess</code> recognises various exact sparcs, make |
|
use of that information in <code>configure</code> (work on this is in |
|
progress). |
|
<li> Sparc32: floating point or integer <code>udiv</code> should be selected |
|
according to the CPU target. Currently floating point ends up being |
|
used on all sparcs, which is probably not right for generic V7 and V8. |
|
<li> Sparc: The use of <code>-xtarget=native</code> with <code>cc</code> is |
|
incorrect when cross-compiling, the target should be set according to the |
|
configured <code>$host</code> CPU. |
|
<li> m68k: config.guess can detect 68000, 68010, CPU32 and 68020, but relies |
|
on system information for 030, 040 and 060. Can they be identified by |
|
running some code? |
|
<li> m68k: gas 2.11.90.0.1 pads with zero bytes in text segments, which is not |
|
valid code. Probably need <code>.balignw <n>,0x4e7f</code> to get |
|
nops, if <code>ALIGN</code> is going to be used for anything that's |
|
executed across. |
|
<li> Some CPUs have <code>umul</code> and <code>udiv</code> code not being |
|
used. Check all such for bit rot and then put umul and udiv in |
|
<code>$gmp_mpn_functions_optional</code> as "standard optional" objects. |
|
<br> In particular Sparc and SparcV8 on non-gcc should benefit from |
|
umul.asm enabled; the generic umul is suspected to be making sqr_basecase |
|
slower than mul_basecase. |
|
<li> HPPA <code>mpn_umul_ppmm</code> and <code>mpn_udiv_qrnnd</code> have a |
|
different parameter order than those functions on other CPUs. It might |
|
avoid confusion to have them under different names, maybe |
|
<code>mpn_umul_ppmm_r</code> or some such. Prototypes then wouldn't |
|
be conditionalized, and the appropriate form could be selected with the |
|
<code>HAVE_NATIVE</code> scheme if/when the code switches to use a |
|
<code>PROLOGUE</code> style. |
|
<li> <code>DItype</code>: The setup in gmp-impl.h for non-GCC could use an |
|
autoconf test to determine whether <code>long long</code> is available. |
|
<li> m88k: Make the assembler code work on non-underscore systems. Conversion |
|
to .asm would be desirable. Ought to be easy, but would want to be |
|
tested. |
|
<li> z8k: The use of a 32-bit limb in mpn/z8000x as opposed to 16-bits in |
|
mpn/z8000 could be an ABI choice. But this chip is obsolete and nothing |
|
is likely to be done unless someone is actively using it. |
|
<li> config.m4 is generated only by the configure script, it won't be |
|
regenerated by config.status. Creating it as an <code>AC_OUTPUT</code> |
|
would work, but it might upset "make" to have things like <code>L$</code> |
|
get into the Makefiles through <code>AC_SUBST</code>. |
|
<code>AC_CONFIG_COMMANDS</code> would be the alternative. With some |
|
careful m4 quoting the <code>changequote</code> calls might not be |
|
needed, which might free up the order in which things had to be output. |
|
<li> <code>make distclean</code>: Only the mpn directory links which were |
|
created are removed, but perhaps all possible links should be removed, in |
|
case someone runs configure a second time without a |
|
<code>distclean</code> in between. The only tricky part would be making |
|
sure all possible <code>extra_functions</code> are covered. |
|
<li> MinGW: Apparently a Cygwin version of gcc can be used by passing |
|
<code>-mno-cygwin</code>. For <code>--host=*-*-mingw32*</code> it might |
|
be convenient to automatically use that option, if it works. Needs |
|
someone with a dual cygwin/mingw setup to test. |
|
<li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code> |
|
scheme. Though we probably wouldn't be using its assembler support we |
|
could try to use those variables in compatible ways. |
|
</ul> |
|
|
<h4>Random Numbers</h4>
<ul>
<li> <code>_gmp_rand</code> is not particularly fast on the linear
congruential algorithm and could stand various improvements.
  <ul>
  <li> Make a second seed area within <code>gmp_randstate_t</code> (or
       <code>_mp_algdata</code> rather) to save some copying.
  <li> Make a special case for a single limb <code>2exp</code> modulus, to
       avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
  <li> Inline the <code>lc</code> code, to avoid a function call and
       <code>TMP_ALLOC</code> for every chunk.
  <li> The special case for <code>seedn==0</code> will be very rarely used,
       and on that basis seems unnecessary.
  <li> Perhaps the <code>2exp</code> and general LC cases should be split,
       for clarity (if the general case is retained).
  </ul>
<li> <code>gmp_randinit_mers</code> for a Mersenne Twister generator.  It's
likely to be more random and about the same speed as Knuth's 55-element
Fibonacci generator, and can probably become the default.  Pedro Gimeno
has started on this.
<li> <code>gmp_randinit_lc</code>: Finish or remove.  Doing a division for
every step won't be very fast, so check whether the usefulness of this
algorithm can be justified.  (Consensus is that it's not useful and can
be removed.)
<li> Blum-Blum-Shub: Finish or remove.  A separate
<code>gmp_randinit_bbs</code> would be wanted, not the currently
commented out case in <code>gmp_randinit</code>.
<li> <code>_gmp_rand</code> could be done as a function pointer within
<code>gmp_randstate_t</code> (or rather in the <code>_mp_algdata</code>
part), instead of switching on a <code>gmp_randalg_t</code>.  Likewise
<code>gmp_randclear</code>, and perhaps <code>gmp_randseed</code> if it
became algorithm-specific.  This would be more modular, and would ensure
only code for the desired algorithms is dragged into the link.
<li> <code>mpz_urandomm</code> should do something for n<=0, but what?
<li> <code>mpz_urandomm</code> implementation looks like it could be improved.
Perhaps it's enough to calculate <code>nbits</code> as ceil(log2(n)) and
call <code>_gmp_rand</code> until a value <code><n</code> is obtained.
<li> <code>gmp_randstate_t</code> used for parameters perhaps should become
<code>gmp_randstate_ptr</code> the same as other types.
<li> Some of the empirical randomness tests could be included in a "make
<li> There's a few filenames that don't fit in 14 chars, if this |
check". They ought to work everywhere, for a given seed at least. |
</ul>

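<p> The rejection scheme suggested above for <code>mpz_urandomm</code> can be
sketched in plain C.  This is an illustration only, not GMP code:
<code>toy_urandomm</code> and the xorshift generator standing in for
<code>_gmp_rand</code> are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the rejection idea: take nbits = ceil(log2(n)) random bits
   per attempt and retry until the value is below n.  Since 2^nbits is
   less than 2*n, each attempt succeeds with probability over 1/2.  A
   fixed-seed 32-bit xorshift stands in for _gmp_rand here. */

static uint32_t rng_state = 2463534242u;

static uint32_t next_bits (int nbits)
{
  rng_state ^= rng_state << 13;
  rng_state ^= rng_state >> 17;
  rng_state ^= rng_state << 5;
  return nbits >= 32 ? rng_state : (rng_state & ((1u << nbits) - 1));
}

/* Return a uniformly distributed value in [0, n), for n >= 1. */
static uint32_t toy_urandomm (uint32_t n)
{
  int nbits = 0;
  uint32_t r;
  while (nbits < 32 && (1u << nbits) < n)
    nbits++;                    /* nbits = ceil(log2(n)) */
  do
    r = next_bits (nbits);
  while (r >= n);               /* reject out-of-range draws */
  return r;
}
```

Because <code>nbits</code> is minimal, fewer than half of all draws are
rejected, so the expected number of calls per result is below two.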
<h4>Miscellaneous</h4>
<ul>
<li> Make <code>mpz_div</code> and <code>mpz_divmod</code> use rounding
analogous to <code>mpz_mod</code>.  Document, and list as an
incompatibility.
<li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
what range of values the generated cofactors can take, and preferably
ensure the definition uniquely specifies the cofactors for given inputs.
A basic extended Euclidean algorithm or multi-step variant leads to
|x|&lt;|b| and |y|&lt;|a| or something like that, but there are probably
two solutions under just those restrictions.
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
<li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
<code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
time would be better still.)
<li> The various test programs use quite a bit of the main
<code>libgmp</code>.  This establishes good cross-checks, but it might be
better to use simple reference routines where possible.  Where it's not
possible some attention could be paid to the order of the tests, so a
<code>libgmp</code> routine is only used for tests once it seems to be
good.
<li> <code>mpf_set_q</code> is very similar to <code>mpf_div</code>, it'd be
good for the two to share code.  Perhaps <code>mpf_set_q</code> should
make some <code>mpf_t</code> aliases for its numerator and denominator
and just call <code>mpf_div</code>.  Both would be simplified a good deal
by switching to <code>mpn_tdiv_qr</code>, perhaps making them small enough
not to bother with sharing (especially since <code>mpf_set_q</code>
wouldn't need to watch out for overlaps).
<li> PowerPC: The cpu time base registers (per <code>mftb</code> and
<code>mftbu</code>) could be used for the speed and tune programs.  Would
need to know its frequency of course.  Usually it's 1/4 of bus speed
(eg. 25 MHz) but some chips drive it from an external input.  Probably
have to measure to be sure.
<li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
return to a previous k at certain sizes.  This arises basically due to
the step effect caused by size multiples effectively used for each k.
Looking at a graph makes it fairly clear.
<li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
of <code>mpf_get_str</code> could be made which would better suit.
</ul>

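<p> To illustrate the cofactor ranges discussed under
<code>mpz_gcdext</code> above, a basic word-size extended Euclidean
algorithm (a sketch, not the mpn code; <code>toy_gcdext</code> is a
hypothetical name) yields g = a*x + b*y with |x| &lt; |b| and
|y| &lt; |a| for positive inputs like those below:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy recursive extended Euclidean algorithm on single words, showing
   the cofactor ranges discussed for mpz_gcdext.  Not GMP's algorithm.
   Returns gcd(a,b) and sets *x, *y so that a*x + b*y == gcd(a,b). */
static long toy_gcdext (long a, long b, long *x, long *y)
{
  if (b == 0)
    {
      *x = 1;                   /* gcd(a,0) = a = a*1 + 0*0 */
      *y = 0;
      return a;
    }
  long x1, y1;
  long g = toy_gcdext (b, a % b, &x1, &y1);
  *x = y1;                      /* back-substitute one division step */
  *y = x1 - (a / b) * y1;
  return g;
}
```

For example toy_gcdext(240, 46, &x, &y) gives g=2 with x=-9, y=47,
and indeed |x| &lt; 46 and |y| &lt; 240; as the list item notes, the
bounds alone still admit more than one (x, y) pair.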
<h4>Aids to Development</h4>
<ul>
<li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
function to check the validity of each <code>mp?_t</code> parameter, in
particular to check they've been <code>mp?_init</code>ed.  This might
catch elementary mistakes in user programs.  Care would need to be taken
over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
else then consistency checks like size&lt;=alloc, ptr not
<code>NULL</code> and ptr+size not wrapping around the address space,
would be possible.  A more sophisticated scheme could track
<code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
scheme probably wouldn't be reentrant, not without some help from the
system.
<li> tune/time.c could try to determine at runtime whether
<code>getrusage</code> and <code>gettimeofday</code> are reliable.
Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
<code>getrusage</code> doesn't exist.  If a test might take a long time
to run then perhaps cache the result in a file somewhere.
<li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
</ul>

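<p> The consistency checks suggested under "Aids to Development" might
look like the following sketch on a struct shaped like an
<code>mpz_t</code>; the names are hypothetical, not the real
<code>gmp-impl.h</code> macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validity check in the spirit of the ASSERT idea above:
   size within alloc, non-NULL data pointer, and ptr+size not wrapping
   around the address space.  The struct mimics mpz_t's fields. */
typedef struct
{
  int alloc;                    /* limbs allocated */
  int size;                     /* limbs used; its sign is the value's sign */
  unsigned long *d;             /* limb data, least significant first */
} toy_mpz;

static int toy_mpz_valid (const toy_mpz *z)
{
  int abs_size = z->size >= 0 ? z->size : -z->size;
  if (abs_size > z->alloc)
    return 0;                   /* more limbs used than allocated */
  if (z->d == NULL)
    return 0;                   /* never mp?_init'ed, most likely */
  if ((uintptr_t) (z->d + abs_size) < (uintptr_t) z->d)
    return 0;                   /* ptr + size wrapped around */
  return 1;
}
```

A real macro would run only when assertion checking is configured in, so
release builds pay nothing for it.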
<h4>Bright Ideas</h4>

<p> The following may or may not be feasible, and aren't likely to get done
in the near future, but are at least worth thinking about.

<ul>
<li> Reorganize longlong.h so that we can inline the operations even for the
system compiler.  When there is no such compiler feature, make calls to
stub functions.  Write such stub functions for as many machines as
possible.
<li> longlong.h could declare when it's using, or would like to use,
<code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
included in libgmp only in that case, the same as is effectively done for
<code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
would only be a very small space saving, so perhaps not worth the
complexity.
<li> longlong.h could be built at configure time by concatenating or
#including fragments from each directory in the mpn path.  This would
select CPU specific macros the same way as CPU specific assembler code.
Code used would no longer depend on cpp predefines, and the current
nested conditionals could be flattened out.
<li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
sort of supposed to return the low 31 (or 63) bits.  But this is
undocumented, and perhaps not too important.
<li> <code>mpz_*_ui</code> division routines currently return abs(a%b).
Perhaps make them return the real remainder instead?  Return type would
be <code>signed long int</code>.  But this would be an incompatible
change, so it might have to be under newly named functions.
<li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
say an extra 16 limbs over what's needed, so as to reduce the chance of
having to do a reallocate if the <code>mpz_t</code> grows a bit more.
This could only be an option, since it'd badly bloat memory usage in
applications using many small values.
<li> <code>mpq</code> functions could perhaps check for numerator or
denominator equal to 1, on the assumption that integers or
denominator-only values might be expected to occur reasonably often.
<li> <code>count_trailing_zeros</code> is used on more or less uniformly
distributed numbers in a couple of places.  For some CPUs
<code>count_trailing_zeros</code> is slow and it's probably worth handling
the frequently occurring 0 to 2 trailing zeros cases specially.
<li> <code>mpf_t</code> might like to let the exponent be undefined when
size==0, instead of requiring it 0 as now.  It should be possible to do
size==0 tests before paying attention to the exponent.  The advantage is
not needing to set exp in the various places a zero result can arise,
which avoids some tedium but is otherwise perhaps not too important.
Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
exp==0, maybe elsewhere too.
<li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
allows that attribute on functions, but not function pointers (see info
node "Attribute Syntax"), so would need a new autoconf test.  This can
wait until there's a GCC that supports it.
<li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
<code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
routines were opened up a bit maybe that code could be shared.  When a
copy needs to be done there's no carry to append for the add, and if the
copy is non-empty no high zero for the sub. <br> An alternative would be
to do a copy at the start and then an in-place add or sub.  Obviously
that duplicates the fetches and stores for carry propagation, but that's
normally only one or two limbs.  The same applies to <code>mpz_add</code>
when one operand is longer than the other, and to <code>mpz_com</code>
since it's just -(x+1).
<li> <code>restrict</code>'ed pointers: Does the C99 definition of restrict
(one writer many readers, or whatever it is) suit the GMP style "same or
separate" function parameters?  If so, judicious use might improve the
code generated a bit.  Do any compilers have their own flavour of
restrict as "completely unaliased", and is that still usable?
<li> 68000: A 16-bit limb might suit 68000 better than 32-bits, since the
native multiply is only 16x16.  Could have this as an <code>ABI</code>
option, selecting <code>_SHORT_LIMB</code> in gmp.h.  Naturally a new set
of asm subroutines would be necessary.  Would need new
<code>mpz_set_ui</code> etc since the current code assumes limb&gt;=long,
but 2-limb operand forms would find a use for <code>long long</code> on
other processors too.
<li> Nx1 remainders can be taken at multiplier throughput speed by
pre-calculating an array "p[i] = 2^(i*<code>BITS_PER_MP_LIMB</code>) mod
m", then for the input limbs x calculating an inner product "sum
p[i]*x[i]", and a final 3x1 limb remainder mod m.  If those powers take
roughly N divide steps to calculate then there'd be an advantage any time
the same m is used three or more times.  Suggested by Victor Shoup in
connection with chinese-remainder style decompositions, but perhaps with
other uses.
</ul>
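<p> The trailing-zeros special-casing in the list above can be sketched as
follows (illustrative C, not the <code>longlong.h</code> macro): on
uniformly distributed nonzero inputs, cheap low-bit tests resolve 7/8 of
all cases before the slow generic count ever runs.

```c
#include <assert.h>

/* Slow portable bit-by-bit count, standing in for a CPU where
   count_trailing_zeros is expensive.  n must be nonzero. */
static int ctz_generic (unsigned long n)
{
  int c = 0;
  while ((n & 1) == 0)
    {
      n >>= 1;
      c++;
    }
  return c;
}

/* Fast path for the common small counts. */
static int ctz (unsigned long n)
{
  if (n & 1)
    return 0;                   /* half of all uniform inputs */
  if (n & 2)
    return 1;                   /* a further quarter */
  if (n & 4)
    return 2;                   /* a further eighth */
  return ctz_generic (n);       /* remaining 1/8 of cases */
}
```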

<hr>

</body>
</html>
<!--
Local variables:
eval: (add-hook 'write-file-hooks 'time-stamp)
time-stamp-start: "This file current as of "
time-stamp-format: "%:d %3b %:y"
time-stamp-end: "\\."
time-stamp-line-limit: 50
End:
-->