===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/doc/Attic/tasks.html,v
retrieving revision 1.1.1.1
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.1 -r1.1.1.2
--- OpenXM_contrib/gmp/doc/Attic/tasks.html 2000/09/09 14:12:20 1.1.1.1
+++ OpenXM_contrib/gmp/doc/Attic/tasks.html 2003/08/25 16:06:11 1.1.1.2
@@ -13,342 +13,888 @@
+
+Copyright 2000, 2001, 2002 Free Software Foundation, Inc.
+This file is part of the GNU MP Library.
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of the GNU Lesser General Public License as published
+by the Free Software Foundation; either version 2.1 of the License, or (at
+your option) any later version.
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
+License for more details.
+You should have received a copy of the GNU Lesser General Public License
+along with the GNU MP Library; see the file COPYING.LIB. If not, write to
+the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
+MA 02111-1307, USA.
+
+
+
-This file lists itemized GMP development tasks.
+These are itemized GMP development tasks. Not all the tasks listed here
+are suitable for volunteers, but many of them are. Please see the
+projects file for more sizeable projects.
_mpz_realloc with a small (1 limb) size.
-mpz_XXX(a,a,a).
-mp_exp_t, the precision of
+mpz_XXX(a,a,a).
+When printing mpf_t numbers with exponents >2^53 on
+ machines with 64-bit mp_exp_t, the precision of
__mp_bases[base].chars_per_bit_exactly is insufficient and
- mpf_get_str aborts. Detect and compensate.
-mpz_get_si to work properly for MIPS N32 ABI (and other
- machines that use long long for storing limbs.)
+ mpf_get_str aborts. Detect and compensate. Alternately,
+ think seriously about using some sort of fixed-point integer value.
+ Avoiding unnecessary floating point is probably a good thing in general,
+ and it might be faster on some CPUs.
mpz_*_ui division routines. Perhaps make them return the
- real remainder instead? Changes return type to signed long int.
+ unspecified (zero).
mpf_eq is not always correct, when one operand is
1000000000... and the other operand is 0111111111..., i.e., extremely
close. There is a special case in mpf_sub for this
situation; put similar code in mpf_eq.
-__builtin_constant_p is unavailable? Same problem with MacOS
- X.
-dump_abort in
- mp?/tests/*.c.
-mpz_get_si returns 0x80000000 for -0x100000000.
+mpf_eq doesn't implement what gmp.texi specifies. It should
+ not use just whole limbs, but partial limbs.
+mpf_set_str doesn't validate its exponent, for instance
+  garbage like 123.456eX789X is accepted (and an exponent of 0 used), and
+  overflow of a long is not detected.
+mpf_add doesn't check for a carry from truncated portions of
+ the inputs, and in that respect doesn't implement the "infinite precision
+ followed by truncate" specified in the manual.
+mpf_div of x/x doesn't always give 1, reported by Peter
+ Moulder. Perhaps it suffices to put +1 on the effective divisor prec, so
+ that data bits rather than zeros are shifted in when normalizing. Would
+ prefer to switch to mpn_tdiv_qr, where all shifting should
+ disappear.
+mpz_add etc, which doesn't work
+ when those routines are coming from a DLL (because they're effectively
+ function pointer global variables themselves). Need to rearrange perhaps
+ to a set of calls to a test function rather than iterating over an array.
+main might be
+ clobbered by the longjmp.
count_leading_zeros if
- the bit is set. This is an optimization on all machines, and significant
- on machines with slow count_leading_zeros.
-count_trailing_zeros is used
- on more or less uniformly distributed numbers. For some CPUs
- count_trailing_zeros is slow and it's probably worth
- handling the frequently occurring 0 to 2 trailing zeros cases specially.
-udiv_qrnnd for inverting limbs to
- instead use invert_limb.
-mpn_gcdext, mpz_get_d,
+ mpf_get_str: Don't test count_leading_zeros for
+ zero, instead check the high bit of the operand and avoid invoking
+ count_leading_zeros. This is an optimization on all
+ machines, and significant on machines with slow
+ count_leading_zeros, though it's possible an already
+ normalized operand might not be encountered very often.
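  A rough illustration of that shortcut, with GCC's __builtin_clzl standing
  in for count_leading_zeros and a made-up helper name:

      #include <limits.h>

      /* Shift needed to normalize a nonzero limb; skip the count when the
         high bit is already set.  */
      static int
      norm_shift (unsigned long high_limb)
      {
        if (high_limb >> (sizeof (unsigned long) * CHAR_BIT - 1))
          return 0;                          /* already normalized */
        return __builtin_clzl (high_limb);   /* stand-in for the real macro */
      }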
umul_ppmm to use floating-point for generating the
most significant limb (if BITS_PER_MP_LIMB <= 52 bits).
(Peter Montgomery has some ideas on this subject.)
umul_ppmm code in longlong.h: Add partial
products with fewer operations.
-mpn_get_str and mpn_set_str running in
- the sub O(n^2) range, using some divide-and-conquer approach, preferably
- without using division.
-mpn_get_str to mpf/get_str. (Talk to Torbjörn about this.)
-mpz_size,
- mpz_set_ui, mpz_set_q, mpz_clear,
- mpz_init, mpz_get_ui, mpz_scan0,
- mpz_scan1, mpz_getlimbn,
- mpz_init_set_ui, mpz_perfect_square_p,
- mpz_popcount, mpf_size,
- mpf_get_prec, mpf_set_prec_raw,
- mpf_set_ui, mpf_init, mpf_init2,
- mpf_clear, mpf_set_si.
+mpz_set_ui. This would be both small and
+ fast, especially for compile-time constants, but would make application
+ binaries depend on having 1 limb allocated to an mpz_t,
+ preventing the "lazy" allocation scheme below.
+mpz_[cft]div_ui and maybe
+ mpz_[cft]div_r_ui. A __gmp_divide_by_zero
+ would be needed for the divide by zero test, unless that could be left to
+ mpn_mod_1 (not sure currently whether all the risc chips
+ provoke the right exception there if using mul-by-inverse).
+mpz_fits_s*_p. The setups for
+ LONG_MAX etc would need to go into gmp.h, and on Cray it
+ might, unfortunately, be necessary to forcibly include <limits.h>
+ since there's no apparent way to get SHRT_MAX with an
+ expression (since short and unsigned short can
+ be different sizes).
mpz_powm and mpz_powm_ui aren't very
fast on one or two limb moduli, due to a lot of function call
overheads. These could perhaps be handled as special cases.
mpz_powm and mpz_powm_ui want better
algorithm selection, and the latter should use REDC. Both could
change to use an mpn_powm and mpn_redc.
+mpz_powm REDC should do multiplications by g[]
+ using the division method when they're small, since the REDC form of a
+ small multiplier is normally a full size product. Probably would need a
+ new tuned parameter to say what size multiplier is "small", as a function
+ of the size of the modulus.
+mpz_powm REDC should handle even moduli if possible. Maybe
+ this would mean for m=n*2^k doing mod n using REDC and an auxiliary
+ calculation mod 2^k, then putting them together at the end.
mpn_gcd might be able to be sped up on small to
moderate sizes by improving find_a, possibly just by
providing an alternate implementation for CPUs with slowish
count_leading_zeros.
-USE_MORE_MPN code. The necessary
- right-to-left mpn_divexact_by3c exists.
+USE_MORE_MPN could use a low to high cache localized
+ evaluate and interpolate. The necessary mpn_divexact_by3c
+ exists.
mpn_mul_basecase on NxM with big N but small M could try for
better cache locality by taking N piece by piece. The current code could
be left available for CPUs without caching. Depending how karatsuba etc
is applied to unequal size operands it might be possible to assume M is
always smallish.
+mpn_perfect_square_p on small operands might be better off
+ skipping the residue tests and just taking a square root.
+mpz_perfect_power_p could be improved in a number of ways.
+ Test for Nth power residues modulo small primes like
+ mpn_perfect_square_p does. Use p-adic arithmetic to find
+ possible roots. Divisibility by other primes should be tested by
+ grouping into a limb like PP.
+mpz_perfect_power_p might like to use mpn_gcd_1
+ instead of a private GCD routine. The use it's put to isn't
+  time-critical, and it might help ensure correctness to use the main GCD
+ routine.
+mpz_perfect_power_p could use
+ mpz_divisible_ui_p instead of mpz_tdiv_ui for
+ divisibility testing, the former is faster on a number of systems. (But
+ all that prime test stuff is going to be rewritten some time.)
+PP/PP_INVERTED into an array of such
+ pairs, listing several hundred primes. Perhaps actually make the
+ products larger than one limb each.
+PP can have factors of 2 introduced in order to get the high
+ bit set and therefore a PP_INVERTED existing. The factors
+ of 2 don't affect the way the remainder r = a % ((x*y*z)*2^n) is used,
+ further remainders r%x, r%y, etc, are the same since x, y, etc are odd.
+ The advantage of this is that mpn_preinv_mod_1 can then be
+ used if it's faster than plain mpn_mod_1. This would be a
+ change only for 16-bit limbs, all the rest already have PP
+ in the right form.
+PP could have extra factors of 3 or 5 or whatever introduced
+ if they fit, and final remainders mod 9 or 25 or whatever used, thereby
+ making more efficient use of the mpn_mod_1 done. On a
+ 16-bit limb it looks like PP could take an extra factor of
+ 3.
+mpz_probab_prime_p, mpn_perfect_square_p and
+ mpz_perfect_power_p could use mpn_mod_34lsub1
+ to take a remainder mod 2^24-1 or 2^48-1 and quickly get remainders mod
+ 3, 5, 7, 13 and 17 (factors of 2^24-1). This could either replace the
+ PP division currently done, or allow PP to do
+ larger primes, depending how many residue tests seem worthwhile before
+ launching into full root extractions or Miller-Rabin etc.
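  To make the arithmetic concrete, a single-word version of that folding
  might look like the following (mpn_mod_34lsub1 would do the analogous
  summing across a whole limb vector); purely illustrative:

      #include <stdint.h>

      /* r24 is congruent to n mod 2^24-1 since 2^24 == 1 (mod 2^24-1), and
         3, 5, 7, 13 and 17 all divide 2^24-1, so the small residues of r24
         equal those of n.  */
      static void
      small_residues (uint32_t n, unsigned r[5])
      {
        uint32_t r24 = (n & 0xFFFFFF) + (n >> 24);
        r[0] = r24 % 3;
        r[1] = r24 % 5;
        r[2] = r24 % 7;
        r[3] = r24 % 13;
        r[4] = r24 % 17;
      }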
+mpz_probab_prime_p (and maybe others) could code the
+  divisibility tests like n%7 == 0 in the form
+
+      #define MP_LIMB_DIVISIBLE_7_P(n) \
+        ((n) * MODLIMB_INVERSE_7 <= MP_LIMB_T_MAX/7)
+
+  This would help compilers which don't know how to optimize divisions by
+  constants, and would help current gcc (3.0) too since gcc forms a whole
+  remainder rather than using a modular inverse and comparing.  This
+  technique works for any odd modulus, and with some tweaks for even moduli
+  too.  See Granlund and Montgomery "Division By Invariant Integers"
+  section 9.
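  As a quick sanity check of that form, here is a stand-alone sketch
  assuming a 32-bit limb, with uint32_t standing in for mp_limb_t and the
  constants being 7^-1 mod 2^32 and floor((2^32-1)/7); it is only an
  illustration, not the library macro.

      #include <assert.h>
      #include <stdint.h>

      #define MODLIMB_INVERSE_7  UINT32_C (0xB6DB6DB7)  /* 7^-1 mod 2^32 */
      #define LIMB_MAX_DIV_7     (UINT32_MAX / 7)

      /* n is divisible by 7 exactly when n * 7^-1 (mod 2^32) lands in
         the range [0, floor((2^32-1)/7)].  */
      #define LIMB_DIVISIBLE_7_P(n) \
        ((uint32_t) ((n) * MODLIMB_INVERSE_7) <= LIMB_MAX_DIV_7)

      int
      main (void)
      {
        for (uint32_t n = 0; n < 1000000; n++)
          assert (LIMB_DIVISIBLE_7_P (n) == (n % 7 == 0));
        return 0;
      }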
mpz_probab_prime_p and mpz_nextprime could
+ offer certainty for primes up to 2^32 by using a one limb miller-rabin
+ test to base 2, combined with a table of actual strong pseudoprimes in
+ that range (2314 of them). If that table is too big then both base 2 and
+ base 3 tests could be done, leaving a table of 104. The test could use
+  REDC and therefore be a modlimb_invert, a remainder (maybe),
+ then two multiplies per bit (successively dependent). Processors with
+ pipelined multipliers could do base 2 and 3 in parallel. Vector systems
+ could do a whole bunch of bases in parallel, and perhaps offer near
+ certainty up to 64-bits (certainty might depend on an exhaustive search
+ of pseudoprimes up to that limit). Obviously 2^32 is not a big number,
+ but an efficient and certain calculation is attractive. It might find
+ other uses internally, and could even be offered as a one limb prime test
+ mpn_probab_prime_1_p or gmp_probab_prime_ui_p
+ perhaps.
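  For scale, a base-2 strong probable prime test on a single 32-bit word
  can be written with plain 64-bit arithmetic as below.  This is only a
  sketch: no REDC, and without the strong pseudoprime table that would turn
  it into a certainty test.

      #include <stdint.h>

      static uint32_t
      powmod2_32 (uint32_t e, uint32_t m)        /* 2^e mod m, m odd > 2 */
      {
        uint64_t r = 1, b = 2;
        for (; e != 0; e >>= 1)
          {
            if (e & 1)
              r = r * b % m;
            b = b * b % m;
          }
        return (uint32_t) r;
      }

      static int
      strong_probable_prime_base2 (uint32_t n)   /* n odd, n > 2 */
      {
        uint32_t d = n - 1;
        int s = 0;
        while ((d & 1) == 0)
          {
            d >>= 1;
            s++;
          }
        uint64_t x = powmod2_32 (d, n);
        if (x == 1 || x == n - 1)
          return 1;
        while (--s > 0)
          {
            x = x * x % n;
            if (x == n - 1)
              return 1;
          }
        return 0;                                /* definitely composite */
      }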
+mpz_probab_prime_p doesn't need to make a copy of
+ n when the input is negative, it can setup an
+ mpz_t alias, same data pointer but a positive size. With no
+ need to clear before returning, the recursive function call could be
+ dispensed with too.
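  A sketch of the aliasing idea using the public __mpz_struct fields
  (inside the library the PTR/SIZ/ALLOC macros would be used instead);
  the alias is read-only, so there is nothing to clear:

      #include <gmp.h>

      /* Test |n| for primality without copying n: share its limb data
         under a positive size.  */
      static int
      probab_prime_abs (mpz_srcptr n, int reps)
      {
        mpz_t abs_n;
        abs_n[0]._mp_d     = n->_mp_d;
        abs_n[0]._mp_alloc = n->_mp_alloc;
        abs_n[0]._mp_size  = n->_mp_size >= 0 ? n->_mp_size : -n->_mp_size;
        return mpz_probab_prime_p (abs_n, reps);
      }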
+mpf_set_str produces low zero limbs when a string has a
+ fraction but is exactly representable, eg. 0.5 in decimal. These could be
+ stripped to save work in later operations.
+mpz_and, mpz_ior and mpz_xor should
+ use mpn_and_n etc for the benefit of the small number of
+ targets with native versions of those routines. Need to be careful not to
+ pass size==0. Is some code sharing possible between the mpz
+ routines?
+mpf_add: Don't do a copy to avoid overlapping operands
+ unless it's really necessary (currently only sizes are tested, not
+ whether r really is u or v).
+mpf_add: Under the check for v having no effect on the
+ result, perhaps test for r==u and do nothing in that case, rather than
+ currently it looks like an MPN_COPY_INCR will be done to
+ reduce prec+1 limbs to prec.
+mpn_divrem_2 could usefully accept unnormalized divisors and
+ shift the dividend on-the-fly, since this should cost nothing on
+ superscalar processors and avoid the need for temporary copying in
+ mpn_tdiv_qr.
+mpf_sqrt_ui calculates prec+1 limbs, whereas just prec would
+ satisfy the application requested precision. It should suffice to simply
+ reduce the rsize temporary to 2*prec-1 limbs. mpf_sqrt
+ might be similar.
+invert_limb generic C: The division could use dividend
+ b*(b-d)-1 which is high:low of (b-1-d):(b-1), instead of the current
+ (b-d):0, where b=2^BITS_PER_MP_LIMB and d=divisor. The
+ former is per the original paper and is used in the x86 code, the
+ advantage is that the current special case for 0x80..00 could be dropped.
+ The two should be equivalent, but a little check of that would be wanted.
+mpq_cmp_ui could form the num1*den2 and
+ num2*den1 products limb-by-limb from high to low and look at
+ each step for values differing by more than the possible carry bit from
+ the uncalculated portion.
+mpq_cmp could do the same high-to-low progressive multiply
+ and compare. The benefits of karatsuba and higher multiplication
+ algorithms are lost, but if it's assumed only a few high limbs will be
+ needed to determine an order then that's fine.
+mpn_add_1, mpn_sub_1, mpn_add,
+ mpn_sub: Internally use __GMPN_ADD_1 etc
+ instead of the functions, so they get inlined on all compilers, not just
+ gcc and others with inline recognised in gmp.h.
+ __GMPN_ADD_1 etc are meant mostly to support application
+ inline mpn_add_1 etc and if they don't come out good for
+ internal uses then special forms can be introduced, for instance many
+ internal uses are in-place. Sometimes a block of code is executed based
+ on the carry-out, rather than using it arithmetically, and those places
+ might want to do their own loops entirely.
+__gmp_extract_double on 64-bit systems could use just one
+ bitfield for the mantissa extraction, not two, when endianness permits.
+ Might depend on the compiler allowing long long bit fields
+ when that's the only actual 64-bit type.
+mpf_get_d could be more like mpz_get_d and do
+ more in integers and give the float conversion as such a chance to round
+ in its preferred direction. Some code sharing ought to be possible. Or
+ if nothing else then for consistency the two ought to give identical
+ results on integer operands (not clear if this is so right now).
+usqr_ppm or some such could do a widening square in the
+ style of umul_ppmm. This would help 68000, and be a small
+ improvement for the generic C (which is used on UltraSPARC/64 for
+ instance). GCC recognises the generic C ul*vh and vl*uh are identical,
+ but does two separate additions to the rest of the result.
+TMP_FREE releases all memory, so
+ there's an allocate and free every time a top-level function using
+ TMP is called. Would need
+ mp_set_memory_functions to tell tal-notreent.c to release
+ any cached memory when changing allocation functions though.
+__gmp_tmp_alloc from tal-notreent.c could be partially
+ inlined. If the current chunk has enough room then a couple of pointers
+ can be updated. Only if more space is required then a call to some sort
+ of __gmp_tmp_increase would be needed. The requirement that
+ TMP_ALLOC is an expression might make the implementation a
+ bit ugly and/or a bit sub-optimal.
+
+      #define TMP_ALLOC(n)                                        \
+        ((ROUND_UP (n) > current->end - current->point            \
+            ? __gmp_tmp_increase (ROUND_UP (n)) : 0),             \
+         current->point += ROUND_UP (n),                          \
+         current->point - ROUND_UP (n))
+
__mp_bases has a lot of data for bases which are pretty much
+ never used. Perhaps the table should just go up to base 16, and have
+ code to generate data above that, if and when required. Naturally this
+ assumes the code would be smaller than the data saved.
+__mp_bases field big_base_inverted is only used
+ if USE_PREINV_DIVREM_1 is true, and could be omitted
+ otherwise, to save space.
+mpf_get_str and mpf_set_str call the
+ corresponding, much faster, mpn functions.
+mpn_mod_1 could pre-calculate values of R mod N, R^2 mod N,
+ R^3 mod N, etc, with R=2^BITS_PER_MP_LIMB, and use them to
+ process multiple limbs at each step by multiplying. Suggested by Peter
+ L. Montgomery.
+mpz_get_str, mtox: For power-of-2 bases, which
+ are of course fast, it seems a little silly to make a second pass over
+ the mpn_get_str output to convert to ASCII. Perhaps combine
+ that with the bit extractions.
+mpz_gcdext: If the caller requests only the S cofactor (of
+ A), and A<B, then the code ends up generating the cofactor T (of B) and
+ deriving S from that. Perhaps it'd be possible to arrange to get S in
+ the first place by calling mpn_gcdext with A+B,B. This
+ might only be an advantage if A and B are about the same size.
+mpn_toom3_mul_n, mpn_toom3_sqr_n: Temporaries
+ B and D are adjacent in memory and at the final
+ coefficient additions look like they could use a single
+ mpn_add_n of l4 limbs rather than two of
+ l2 limbs.
udiv_qrnnd_preinv2norm, the branch-free version of
+ udiv_qrnnd_preinv, might be faster on various pipelined
+ chips. In particular the first if (_xh != 0) in
+ udiv_qrnnd_preinv might be roughly a 50/50 chance and might
+ branch predict poorly. (The second test is probably almost always
+ false.) Measuring with the tuneup program would be possible, but perhaps
+ a bit messy. In any case maybe the default should be the branch-free
+ version.
+ udiv_qrnnd_preinv2norm implementation
+ assumes a right shift will sign extend, which is not guaranteed by the C
+ standards, and doesn't happen on Cray vector systems.
mpn_addmul_1, mpn_submul_1, and
- mpn_mul_1 for the 21264. On 21264, they should run at 4, 3,
- and 3 cycles/limb respectively, if the code is unrolled properly. (Ask
- Torbjörn for his xm.s and xam.s skeleton files.)
-mpn_addmul_1, mpn_submul_1, and
- mpn_mul_1 for the 21164. This should use both integer
+
+      #if defined (__GNUC__)
+      #if __GNUC__ == 2 && __GNUC_MINOR__ == 7
+      ...
+      #endif
+      #if __GNUC__ == 2 && __GNUC_MINOR__ == 8
+      ...
+      #endif
+      #ifndef MUL_KARATSUBA_THRESHOLD
+      /* Default GNUC values */
+      ...
+      #endif
+      #else /* system compiler */
+      ...
+      #endif
+
invert_limb on various processors might benefit from the
+ little Newton iteration done for alpha and ia64.
+mpn_mul_1,
+ mpn_addmul_1, and mpn_submul_1.
+mpn_mul_1, mpn_addmul_1,
+ and mpn_submul_1 for the 21164. This should use both integer
multiplies and floating-point multiplies. For the floating-point
operations, the single-limb multiplier should be split into three 21-bit
- chunks.
-mpn_addmul_1,
- mpn_submul_1, and mpn_mul_1. Should use
- floating-point operations, and split the invariant single-limb multiplier
- into 21-bit chunks. Should give about 18 cycles/limb, but the pipeline
- will become very deep. (Torbjörn has C code that is useful as a starting
- point.)
-mpn_lshift and mpn_rshift.
- Should give 2 cycles/limb. (Torbjörn has code that just needs to be
- finished.)
-mpn_addmul_1
- and the other multiplies varies so much on successive sizes.
+ chunks, or perhaps even better in four 16-bit chunks. Probably possible
+ to reach 9 cycles/limb.
+ctlz and cttz for
+  count_leading_zeros and count_trailing_zeros.
+ Use inline for gcc, probably want asm files for elsewhere.
+umul_ppmm to call
+ __umulsidi3 in libgcc. Could be copied straight across, but
+ perhaps ought to be tested.
+clz instruction can be used for
+ count_leading_zeros.
+mpn_divexact_by3 isn't particularly important, but
+ the generic C runs at about 27 c/l, whereas with the multiplies off the
+ dependent chain about 3 c/l ought to be possible.
+mpn_hamdist could be put together based on the
+ current mpn_popcount.
+popc_limb in gmp-impl.h could use the
+ popcnt insn.
+mpn_submul_1 is not implemented directly, only via
+ a combination of mpn_mul_1 and mpn_sub_n.
+mpn_mul_1, mpn_addmul_1,
+ for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk). Not sure how
+ much this can improve the speed, though, since the symmetry that we rely
+ on is lost. Perhaps we can just gain cycles when s2 < 2^16, or more
+ accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
+mpn_submul_1, analogous to
+ mpn_addmul_1.
+umul_ppmm. Using four
+ "mulx"s either with an asm block or via the generic C code is
+ about 90 cycles. Try using fp operations, and also try using karatsuba
+ for just three "mulx"s.
+mpn_divrem_1, mpn_mod_1,
+ mpn_divexact_1 and mpn_modexact_1_odd could
+ process 32 bits at a time when the divisor fits 32-bits. This will need
+ only 4 mulx's per limb instead of 8 in the general case.
+mpn_lshift, mpn_rshift.
+ Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
+mulx for umul_ppmm if
+ possible (see commented out code in longlong.h). This is unlikely to
+ save more than a couple of cycles, so perhaps isn't worth bothering with.
+__sparc_v9__
+ or anything to indicate V9 support when -mcpu=v9 is selected. See
+ gcc/config/sol2-sld-64.h. Will need to pass something through from
+ ./configure to select the right code in longlong.h. (Currently nothing
+ is lost because mulx for multiplying is commented out.)
+modlimb_invert might save a few cycles from
+ masking down to just the useful bits at each point in the calculation,
+ since mulx speed depends on the highest bit set. Either
+ explicit masks or small types like short and
+ int ought to work.
+mpn_popcount and mpn_hamdist
+ could use popc currently commented out in gmp-impl.h. This
+ chip reputedly implements popc properly (see gcc sparc.md),
+ would need to recognise the chip as sparchalr1 or something
+ in configure / config.sub / config.guess.
mpn_addmul_1, mpn_submul_1, and
- mpn_mul_1. The current development code runs at 11
- cycles/limb, which is already very good. But it should be possible to
- saturate the cache, which will happen at 7.5 cycles/limb.
-umul_ppmm. Important in particular for
- mpn_sqr_basecase. Using four "mulx"s either
- with an asm block or via the generic C code is about 90 cycles.
+ mpn_mul_1. The current code runs at 11 cycles/limb. It
+ should be possible to saturate the cache, which will happen at 8
+ cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32;
+ it should be possible to make them run at about 5 cycles/limb.
+mpn_addmul_1, mpn_submul_1, and
+ mpn_mul_1. Use both integer and floating-point operations,
+ possibly two floating-point and one integer limb per loop. Split operands
+ into four 16-bit chunks for fast fp operations. Should easily reach 9
+ cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
+ (using one int + two fp).
+mpn_rshift could do the same sort of unrolled loop
+ as mpn_lshift. Some judicious use of m4 might let the two
+ share source code, or with a register to control the loop direction
+ perhaps even share object code.
+mpn_rshift should do the same sort of unrolled
+ loop as mpn_lshift.
mpn_mul_basecase and mpn_sqr_basecase
for important machines. Helping the generic sqr_basecase.c with an
mpn_sqr_diagonal might be enough for some of the RISCs.
mpn_lshift/mpn_rshift.
Will bring time from 1.75 to 1.25 cycles/limb.
-mpn_lshift for shifts by 1. (See Pentium code.)
-count_leading_zeros.
-udiv_qrnnd. (Ask Torbjörn for the file
- test-udiv-preinv.c as a starting point.)
-mpn_add_n and mpn_sub_n.
- It should just require 3 cycles/limb, but the current code propagates
- carry poorly. The trick is to add carry-in later than we do now,
- decreasing the number of operations used to generate carry-out from 4
- to 3.
+mpn_lshift for shifts by 1. (See
+ Pentium code.)
+rep
+ movs would upset GCC register allocation for the whole function.
+ Is this still true in GCC 3? It uses rep movs itself for
+ __builtin_memcpy. Examine the code for some simple and
+ complex functions to find out. Inlining rep movs would be
+ desirable, it'd be both smaller and faster.
+mpn_lshift and mpn_rshift can come
+ down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
+ shrdl and shldl, see mpn/x86/pentium/README.
+mpn_lshift and mpn_rshift
+ might benefit from some destination prefetching.
+mpn_divrem_1 might be able to use a
+ mul-by-inverse, hoping for maybe 30 c/l.
+mpn_add_n and mpn_sub_n should be able to go
+ faster than the generic x86 code at 3.5 c/l. The athlon code for instance
+ runs at about 2.7.
+mpn_lshift and mpn_rshift might be able to
+ do something branch-free for unaligned startups, and shaving one insn
+ from the loop with alternative indexing might save a cycle.
mpn_lshift.
- The pipeline is now extremely deep, perhaps unnecessarily deep. Also, r5
- is unused. (Ask Torbjörn for a copy of the current code.)
+ The pipeline is now extremely deep, perhaps unnecessarily deep.
mpn_rshift based on new mpn_lshift.
mpn_add_n and mpn_sub_n. Should
- run at just 3.25 cycles/limb. (Ask for xxx-add_n.s as a starting point.)
+ run at just 3.25 cycles/limb.
mpn_mul_basecase and
mpn_sqr_basecase. This should use a "vertical multiplication
method", to avoid carry propagation. splitting one of the operands in
11-bit chunks.
-mpn_mul_basecase and
- mpn_sqr_basecase. Same comment applies to this as to the
- same functions for Fujitsu VPP.
+mpn_lshift by 31 should use the special rshift
+ by 1 code, and vice versa mpn_rshift by 31 should use the
+ special lshift by 1. This would be best as a jump across to the other
+ routine, could let both live in lshift.asm and omit rshift.asm on finding
+ mpn_rshift already provided.
+mpn_com_n and mpn_and_n etc very probably
+  want a pragma like MPN_COPY_INCR.
+mpn_lshift, mpn_rshift,
+ mpn_popcount and mpn_hamdist are nice and small
+ and could be inlined to avoid function calls.
+TMP_ALLOC to use them, or introduce a new scheme. Memory
+ blocks wanted unconditionally are easy enough, those wanted only
+ sometimes are a problem. Perhaps a special size calculation to ask for a
+ dummy length 1 when unwanted, or perhaps an inlined subroutine
+ duplicating code under each conditional. Don't really want to turn
+ everything into a dog's dinner just because Cray don't offer an
+ alloca.
+mpn_get_str on power-of-2 bases ought to vectorize.
+ Does it? bits_per_digit and the inner loop over bits in a
+ limb might prevent it. Perhaps special cases for binary, octal and hex
+ would be worthwhile (very possibly for all processors too).
+popc_limb could use the Cray _popc
+ intrinsic. That would help mpz_hamdist and might make the
+ generic C versions of mpn_popcount and
+ mpn_hamdist suffice for Cray (if it vectorizes, or can be
+ given a hint to do so).
+mpn_mul_1, mpn_addmul_1,
+ mpn_submul_1: Check for a 16-bit multiplier and use two
+ multiplies per limb, not four.
+mpn_lshift and mpn_rshift could use a
+ roll and mask instead of lsrl and
+ lsll. This promises to be a speedup, effectively trading a
+ 6+2*n shift for one or two 4 cycle masks. Suggested by Jean-Charles
+ Meyrignac.
count_leading_zeros for 64-bit machines:
-
- if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/2) == 0) { x <<= W_TYPE_SIZE/2; cnt += W_TYPE_SIZE/2}
- if ((x >> W_TYPE_SIZE-W_TYPE_SIZE/4) == 0) { x <<= W_TYPE_SIZE/4; cnt += W_TYPE_SIZE/4}
- ...
-
+ if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
+ if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
+ ...
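  Written out in full as a function rather than a macro, that cascade might
  look like the following stand-alone sketch:

      #include <stdint.h>

      /* Binary-search count of leading zeros for a 64-bit word.  */
      static int
      clz64 (uint64_t x)
      {
        int cnt = 0;
        if (x == 0)
          return 64;
        if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
        if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
        if ((x >> 56) == 0) { x <<= 8;  cnt += 8;  }
        if ((x >> 60) == 0) { x <<= 4;  cnt += 4;  }
        if ((x >> 62) == 0) { x <<= 2;  cnt += 2;  }
        if ((x >> 63) == 0) { cnt += 1; }
        return cnt;
      }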
+__inline which could perhaps
+ be used in __GMP_EXTERN_INLINE. What would be the right way
+ to identify suitable versions of that compiler?
+double floats are straightforward and
+ could perhaps be handled directly in __gmp_extract_double
+ and maybe in mpz_get_d, rather than falling back on the
+ generic code. (Both formats are detected by configure.)
+mpn_get_str final divisions by the base with
+ udiv_qrnd_unnorm could use some sort of multiply-by-inverse
+ on suitable machines. This ends up happening for decimal by presenting
+ the compiler with a run-time constant, but the same for other bases would
+ be good. Perhaps use could be made of the fact base<256.
+mpn_umul_ppmm, mpn_udiv_qrnnd: Return a
+ structure like div_t to avoid going through memory, in
+ particular helping RISCs that don't do store-to-load forwarding. Clearly
+ this is only possible if the ABI returns a structure of two
+ mp_limb_ts in registers.
mpz_get_nth_ui. Return the nth word (not necessarily the nth limb).
+mp?_out_raw and
+ mp?_inp_raw.
mpz_crr (Chinese Remainder Reconstruction).
mpq_set_f for assignment from mpf_t
- (cf. mpq_set_d).
-mpz_init (and mpq_init) do lazy
- allocation. Set ALLOC(var) to 0, and have
- mpz_realloc special-handle that case. Update functions that
- rely on a single limb (like mpz_set_ui,
- mpz_[tfc]div_r_ui, and others).
+mpz_init and mpq_init could do lazy allocation.
+ Set ALLOC(var) to 0 to indicate nothing allocated, and let
+ _mpz_realloc do the initial alloc. Set
+ z->_mp_d to a dummy that mpz_get_ui and
+ similar can unconditionally fetch from. Niels Möller has had a go at
+ this.
+ mpz_init and then
+ more or less immediately reallocating.
+ mpz_init would only store magic values in the
+ mpz_t fields, and could be inlined.
+ mpz_t z = MPZ_INITIALIZER;, which might be convenient
+ for globals.
+ mpz_set_ui and other similar routines needn't check the
+ size allocated and can just store unconditionally.
+ mpz_set_ui and perhaps others like
+ mpz_tdiv_r_ui and a prospective
+ mpz_set_ull could be inlined.
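  A rough sketch of the representation this would give, using the public
  __mpz_struct field names; _mpz_realloc, mpz_clear and anything that
  stores limbs would still have to learn to treat ALLOC==0 as "allocate
  now" (or "nothing to free"):

      #include <gmp.h>

      static mp_limb_t dummy_limb = 0;

      /* Lazy init: no limbs allocated, value zero, but _mp_d points at a
         shared dummy limb so mpz_get_ui-style reads need no special case. */
      static void
      lazy_mpz_init (mpz_ptr z)
      {
        z->_mp_alloc = 0;
        z->_mp_size  = 0;
        z->_mp_d     = &dummy_limb;
      }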
+ mpf_out_raw and mpf_inp_raw. Make sure
format is portable between 32-bit and 64-bit machines, and between
little-endian and big-endian machines.
-gmp_errno.
-gmp_fprintf, gmp_sprintf, and
- gmp_snprintf. Think about some sort of wrapper
- around printf so it and its several variants don't
- have to be completely reimplemented.
-mpq input and output functions.
-mpz_kronecker, leave
- mpz_jacobi for compatibility.
-mpz_set_str etc taking string
- lengths rather than null-terminators.
+mpn_and_n ... mpn_copyd: Perhaps make the mpn
+ logops and copys available in gmp.h, either as library functions or
+ inlines, with the availability of library functions instantiated in the
+ generated gmp.h at build time.
+mpz_set_str etc variants taking string lengths rather than
+ null-terminators.
<=" rather than "<", so a threshold can
be set to MP_SIZE_T_MAX to get only the simpler code (the
compiler will know size <= MP_SIZE_T_MAX is always true).
-mpz_cdiv_q_2exp and mpz_cdiv_r_2exp
- could be implemented to match the corresponding tdiv and fdiv.
- Maybe some code sharing is possible.
+ Alternately it looks like the ABOVE_THRESHOLD and
+ BELOW_THRESHOLD macros can do this adequately, and also pick
+ up cases where a threshold of zero should mean only the second algorithm.
+mpz_nthprime.
+mpz_init2, initializing and making initial room for
+ N bits. The actual size would be rounded up to a limb, and perhaps an
+ extra limb added since so many mpz routines need that on
+ their destination.
+mpz_andn, mpz_iorn, mpz_nand,
+ mpz_nior, mpz_xnor might be useful additions,
+ if they could share code with the current such functions (which should be
+ possible).
+mpz_and_ui etc might be of use sometimes. Suggested by
+ Niels Möller.
+mpf_set_str and mpf_inp_str could usefully
+ accept 0x, 0b etc when base==0. Perhaps the exponent could default to
+ decimal in this case, with a further 0x, 0b etc allowed there.
+ Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but
+ probably something like "0.123" ought not mean octal.
+GMP_LONG_LONG_LIMB or some such could become a documented
+ feature of gmp.h, so applications could know whether to
+ printf a limb using %lu or %Lu.
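  For illustration, application code might then pick a format string like
  this (GMP_LONG_LONG_LIMB being the proposed define, not an existing one):

      #include <stdio.h>
      #include <gmp.h>

      /* Choose a printf conversion matching mp_limb_t.  */
      #ifdef GMP_LONG_LONG_LIMB
      #define LIMB_FORMAT "%llu"
      #else
      #define LIMB_FORMAT "%lu"
      #endif

      static void
      print_limb (mp_limb_t limb)
      {
        printf (LIMB_FORMAT "\n", limb);
      }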
+PRIdMP_LIMB and similar defines following C99
+ <inttypes.h> might be of use to applications printing limbs.
+ Perhaps they should be defined only if specifically requested, the way
+ <inttypes.h> does. But if GMP_LONG_LONG_LIMB or
+ whatever is added then perhaps this can easily enough be left to
+ applications.
+mpf_get_ld and mpf_set_ld converting
+ mpf_t to and from long double. Other
+ long double routines would be desirable too, but these would
+ be a start. Often long double is the same as
+ double, which is easy but pretty pointless. Should
+ recognise the Intel 80-bit format on i386, and IEEE 128-bit quad on
+ sparc, hppa and power. Might like an ABI sub-option or something when
+ it's a compiler option for 64-bit or 128-bit long double.
+gmp_printf could accept %b for binary output.
+ It'd be nice if it worked for plain int etc too, not just
+ mpz_t etc.
+gmp_printf in fact could usefully accept an arbitrary base,
+ for both integer and float conversions. A base either in the format
+ string or as a parameter with * should be allowed. Maybe
+ &13b (b for base) or something like that.
+gmp_printf could perhaps have a type code for an
+ mp_limb_t. That would save an application from having to
+ worry whether it's a long or a long long.
+gmp_printf could perhaps accept mpq_t for float
+ conversions, eg. "%.4Qf". This would be merely for
+ convenience, but still might be useful. Rounding would be the same as
+ for an mpf_t (ie. currently round-to-nearest, but not
+ actually documented). Alternately, perhaps a separate
+ mpq_get_str_point or some such might be more use. Suggested
+ by Pedro Gimeno.
+gmp_printf could usefully accept a flag to control the
+  rounding of float conversions.  This wouldn't do much for
+ mpf_t, but would be good if mpfr_t was
+ supported in the future, or perhaps for mpq_t. Something
+ like &*r (r for rounding, and mpfr style
+ GMP_RND parameter).
+mpz_combit to toggle a bit would be a good companion for
+ mpz_setbit and mpz_clrbit. Suggested by Niels
+ Möller (and has done some work towards it).
+mpz_scan0_reverse or mpz_scan0low or some such
+ searching towards the low end of an integer might match
+ mpz_scan0 nicely. Likewise for scan1.
+ Suggested by Roberto Bagnara.
+mpz_bit_subset or some such to test whether one integer is a
+ bitwise subset of another might be of use. Some sort of return value
+ indicating whether it's a proper or non-proper subset would be good and
+ wouldn't cost anything in the implementation. Suggested by Roberto
+ Bagnara.
+gmp_randinit_r and maybe gmp_randstate_set to
+ init-and-copy or to just copy a gmp_randstate_t. Suggested
+ by Pedro Gimeno.
+mpf_get_ld, mpf_set_ld: Conversions between
+ mpf_t and long double, suggested by Dan
+ Christensen. There'd be some work to be done by configure
+ to recognise the format in use. xlc on aix for instance apparently has
+ an option for either plain double 64-bit or quad 128-bit precision. This
+ might mean library contents vary with the compiler used to build, which
+ is undesirable. It might be possible to detect the mode the application
+ is compiling with, and try to avoid mismatch problems.
+mpz_sqrt_if_perfect_square: When
+ mpz_perfect_square_p does its tests it calculates a square
+ root and then discards it. For some applications it might be useful to
+ return that root. Suggested by Jason Moxham.
+mpz_get_ull, mpz_set_ull,
+  mpz_get_sll, mpz_set_sll: Conversions for
+ long long. These would aid interoperability, though a
+ mixture of GMP and long long would probably not be too
+ common. Disadvantages of using long long in libgmp.a would
+ be
+ #ifdef block to decide if the
+ application compiler could take the long long
+ prototypes.
+ LIBGMP_HAS_LONGLONG would be wanted to
+ indicate whether the functions are available. (Applications using
+ autoconf could probe the library too.)
+ long long to
+ application compile time, by having something like
+ mpz_set_2ui called with two halves of a long
+ long. Disadvantages of this would be,
+ long
+ long is normally passed as two halves anyway.
+ mpz_get_ull would be a rather big inline, or would have
+ to be two function calls.
+ mpz_get_sll would be a worse inline, and would put the
+ treatment of -0x10..00 into applications (see
+ mpz_get_si correctness above).
+ long long is probably the lesser evil, if only
+ because it makes best use of gcc.
+mpz_strtoz parsing the same as strtol.
+ Suggested by Alexander Kruppa.
GMP_C_DOUBLE_FORMAT seems to work
+ well. Get rid of the #ifdef mess in gmp-impl.h and use the
+ results of the test instead.
+umul_ppmm in longlong.h always uses umull,
+ but is that available only for M series chips or some such? Perhaps it
+ should be configured in some way.
+ia64-*-hpux*. Does GMP need to know anything about that?
+AC_C_BIGENDIAN seems the best way to handle that for GMP.
+*-*-aix*. It might be more reliable to do some sort of
+ feature test, examining the compiler output perhaps. It might also be
+ nice to merge the aix.m4 files into powerpc-defs.m4.
+config.guess recognises various exact sparcs, make
+ use of that information in configure (work on this is in
+ progress).
+udiv should be selected
+ according to the CPU target. Currently floating point ends up being
+ used on all sparcs, which is probably not right for generic V7 and V8.
+-xtarget=native with cc is
+ incorrect when cross-compiling, the target should be set according to the
+ configured $host CPU.
+.balignw <n>,0x4e7f to get
+ nops, if ALIGN is going to be used for anything that's
+ executed across.
+umul and udiv code not being
+ used. Check all such for bit rot and then put umul and udiv in
+ $gmp_mpn_functions_optional as "standard optional" objects.
+ mpn_umul_ppmm and mpn_udiv_qrnnd have a
+ different parameter order than those functions on other CPUs. It might
+ avoid confusion to have them under different names, maybe
+ mpn_umul_ppmm_r or some such. Prototypes then wouldn't
+ be conditionalized, and the appropriate form could be selected with the
+ HAVE_NATIVE scheme if/when the code switches to use a
+ PROLOGUE style.
+DItype: The setup in gmp-impl.h for non-GCC could use an
+ autoconf test to determine whether long long is available.
+AC_OUTPUT
+ would work, but it might upset "make" to have things like L$
+ get into the Makefiles through AC_SUBST.
+ AC_CONFIG_COMMANDS would be the alternative. With some
+ careful m4 quoting the changequote calls might not be
+ needed, which might free up the order in which things had to be output.
+make distclean: Only the mpn directory links which were
+ created are removed, but perhaps all possible links should be removed, in
+ case someone runs configure a second time without a
+ distclean in between. The only tricky part would be making
+ sure all possible extra_functions are covered.
+-mno-cygwin. For --host=*-*-mingw32* it might
+ be convenient to automatically use that option, if it works. Needs
+ someone with a dual cygwin/mingw setup to test.
+CCAS, CCASFLAGS
+ scheme. Though we probably wouldn't be using its assembler support we
+ could try to use those variables in compatible ways.
+  The implver and amask assembly instructions are better, but that
+  doesn't differentiate between ev5 and ev56.
+_gmp_rand is not particularly fast on the linear
+  congruential algorithm and could stand various improvements:
+ gmp_randstate_t (or
+ _mp_algdata rather) to save some copying.
+ 2exp modulus, to
+ avoid mpn_mul calls. Perhaps the same for two limbs.
+ lc code, to avoid a function call and
+ TMP_ALLOC for every chunk.
+ seedn==0 will be very rarely used,
+ and on that basis seems unnecessary.
+ 2exp and general LC cases should be split,
+ for clarity (if the general case is retained).
+ gmp_randinit_mers for a Mersenne Twister generator. It's
+ likely to be more random and about the same speed as Knuth's 55-element
+ Fibonacci generator, and can probably become the default. Pedro Gimeno
+ has started on this.
+gmp_randinit_lc: Finish or remove. Doing a division for
+ every every step won't be very fast, so check whether the usefulness of
+ this algorithm can be justified. (Consensus is that it's not useful and
+ can be removed.)
+gmp_randinit_bbs would be wanted, not the currently
+ commented out case in gmp_randinit.
+_gmp_rand could be done as a function pointer within
+ gmp_randstate_t (or rather in the _mp_algdata
+ part), instead of switching on a gmp_randalg_t. Likewise
+ gmp_randclear, and perhaps gmp_randseed if it
+ became algorithm-specific. This would be more modular, and would ensure
+ only code for the desired algorithms is dragged into the link.
+mpz_urandomm should do something for n<=0, but what?
+mpz_urandomm implementation looks like it could be improved.
+ Perhaps it's enough to calculate nbits as ceil(log2(n)) and
+ call _gmp_rand until a value <n is obtained.
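  That scheme, sketched with the public interface only and
  mpz_sizeinbase(n,2) serving as the bit count; the expected number of
  iterations is at most about 2:

      #include <stddef.h>
      #include <gmp.h>

      /* Uniform random value in [0, n-1] by rejection: draw nbits bits
         until the result is below n.  Requires n > 0.  */
      static void
      urandomm_by_rejection (mpz_ptr rop, gmp_randstate_t state, mpz_srcptr n)
      {
        size_t nbits = mpz_sizeinbase (n, 2);
        do
          mpz_urandomb (rop, state, nbits);
        while (mpz_cmp (rop, n) >= 0);
      }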
+gmp_randstate_t used for parameters perhaps should become
+ gmp_randstate_ptr the same as other types.
-In general, getting the exact right configuration, passing the
-exact right options to the compiler, etc, might mean that the GMP
-performance more than doubles.
-
-When testing, make sure to test at least the following for all our
-target machines: (1) Both gcc and cc (and c89). (2) Both 32-bit mode
-and 64-bit mode (such as -n32 vs -64 under Irix). (3) Both the system
-`make' and GNU `make'. (4) With and without GNU binutils.
mpz_div and mpz_divmod use rounding
analogous to mpz_mod. Document, and list as an
incompatibility.
-mcount call at
- the start of each assembler subroutine (for important targets at
- least).
+mpz_gcdext and mpn_gcdext ought to document
+ what range of values the generated cofactors can take, and preferably
+ ensure the definition uniquely specifies the cofactors for given inputs.
+ A basic extended Euclidean algorithm or multi-step variant leads to
+ |x|<|b| and |y|<|a| or something like that, but there's probably
+ two solutions under just those restrictions.
+mpz_invert should call mpn_gcdext directly.
+mpz_divisible_ui_p rather than
+ mpz_tdiv_qr_ui. (Of course dividing multiple primes at a
+ time would be better still.)
+libgmp. This establishes good cross-checks, but it might be
+ better to use simple reference routines where possible. Where it's not
+ possible some attention could be paid to the order of the tests, so a
+ libgmp routine is only used for tests once it seems to be
+ good.
+mpf_set_q is very similar to mpf_div, it'd be
+ good for the two to share code. Perhaps mpf_set_q should
+ make some mpf_t aliases for its numerator and denominator
+ and just call mpf_div. Both would be simplified a good deal
+ by switching to mpn_tdiv_qr perhaps making them small enough
+ not to bother with sharing (especially since mpf_set_q
+ wouldn't need to watch out for overlaps).
+mftb and
+ mftbu) could be used for the speed and tune programs. Would
+ need to know its frequency of course. Usually it's 1/4 of bus speed
+ (eg. 25 MHz) but some chips drive it from an external input. Probably
+ have to measure to be sure.
+MUL_FFT_THRESHOLD etc: the FFT thresholds should allow a
+ return to a previous k at certain sizes. This arises basically due to
+ the step effect caused by size multiples effectively used for each k.
+ Looking at a graph makes it fairly clear.
+__gmp_doprnt_mpf does a rather unattractive round-to-nearest
+ on the string returned by mpf_get_str. Perhaps some variant
+ of mpf_get_str could be made which would better suit.
malloc
- separately for each TMP_ALLOC block, so a redzoning
- malloc debugger could be used during development.
-ASSERTs at the start of each user-visible
- mpz/mpq/mpf function to check the validity of each
- mp?_t parameter, in particular to check they've been
- mp?_inited. This might catch elementary mistakes in
- user programs. Care would need to be taken over
- MPZ_TMP_INITed variables used internally.
+ASSERTs at the start of each user-visible mpz/mpq/mpf
+ function to check the validity of each mp?_t parameter, in
+ particular to check they've been mp?_inited. This might
+ catch elementary mistakes in user programs. Care would need to be taken
+ over MPZ_TMP_INITed variables used internally. If nothing
+ else then consistency checks like size<=alloc, ptr not
+ NULL and ptr+size not wrapping around the address space,
+ would be possible. A more sophisticated scheme could track
+ _mp_d pointers and ensure only a valid one is used. Such a
+ scheme probably wouldn't be reentrant, not without some help from the
+ system.
+getrusage and gettimeofday are reliable.
+ Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
+ getrusage doesn't exist. If a test might take a long time
+ to run then perhaps cache the result in a file somewhere.
mpz_inp_str (etc) doesn't say when it stops reading digits.
mpn_umul_ppmm, and the corresponding umul.asm file could be
+ included in libgmp only in that case, the same as is effectively done for
+ __clz_tab. Likewise udiv.asm and perhaps cntlz.asm. This
+ would only be a very small space saving, so perhaps not worth the
+ complexity.
+mpz_get_si returns 0x80000000 for -0x100000000, whereas it's
+ sort of supposed to return the low 31 (or 63) bits. But this is
+ undocumented, and perhaps not too important.
+mpz_*_ui division routines currently return abs(a%b).
+ Perhaps make them return the real remainder instead? Return type would
+ be signed long int. But this would be an incompatible
+ change, so it might have to be under newly named functions.
+mpz_init_set* and mpz_realloc could allocate
+ say an extra 16 limbs over what's needed, so as to reduce the chance of
+ having to do a reallocate if the mpz_t grows a bit more.
+ This could only be an option, since it'd badly bloat memory usage in
+ applications using many small values.
+mpq functions could perhaps check for numerator or
+ denominator equal to 1, on the assumption that integers or
+ denominator-only values might be expected to occur reasonably often.
+count_trailing_zeros is used on more or less uniformly
+ distributed numbers in a couple of places. For some CPUs
+ count_trailing_zeros is slow and it's probably worth handling
+ the frequently occurring 0 to 2 trailing zeros cases specially.
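  An illustration of that special-casing, with GCC's __builtin_ctzll
  standing in for the slow count_trailing_zeros:

      #include <stdint.h>

      /* For uniformly distributed x the 0, 1 and 2 trailing-zero cases
         cover 87.5% of inputs, so handle them before the full count.
         x must be nonzero.  */
      static int
      trailing_zeros (uint64_t x)
      {
        if (x & 1) return 0;
        if (x & 2) return 1;
        if (x & 4) return 2;
        return __builtin_ctzll (x);
      }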
+mpf_t might like to let the exponent be undefined when
+ size==0, instead of requiring it 0 as now. It should be possible to do
+ size==0 tests before paying attention to the exponent. The advantage is
+ not needing to set exp in the various places a zero result can arise,
+ which avoids some tedium but is otherwise perhaps not too important.
+ Currently mpz_set_f and mpf_cmp_ui depend on
+ exp==0, maybe elsewhere too.
+__gmp_allocate_func: Could use GCC __attribute__
+ ((malloc)) on this, though don't know if it'd do much. GCC 3.0
+ allows that attribute on functions, but not function pointers (see info
+ node "Attribute Syntax"), so would need a new autoconf test. This can
+ wait until there's a GCC that supports it.
+mpz_add_ui contains two __GMPN_COPYs, one from
+ mpn_add_1 and one from mpn_sub_1. If those two
+ routines were opened up a bit maybe that code could be shared. When a
+ copy needs to be done there's no carry to append for the add, and if the
+  copy is non-empty no high zero for the sub.  The same applies to mpz_add
+ when one operand is longer than the other, and to mpz_com
+ since it's just -(x+1).
+restrict'ed pointers: Does the C99 definition of restrict
+ (one writer many readers, or whatever it is) suit the GMP style "same or
+ separate" function parameters? If so, judicious use might improve the
+ code generated a bit. Do any compilers have their own flavour of
+ restrict as "completely unaliased", and is that still usable?
+ABI
+ option, selecting _SHORT_LIMB in gmp.h. Naturally a new set
+ of asm subroutines would be necessary. Would need new
+ mpz_set_ui etc since the current code assumes limb>=long,
+ but 2-limb operand forms would find a use for long long on
+ other processors too.
+A fixed modulus m could be handled by pre-calculating powers
+  "p[i] = 2^(i*BITS_PER_MP_LIMB) mod m", then for the input limbs x
+  calculating an inner product "sum
+ m", then for the input limbs x calculating an inner product "sum
+ p[i]*x[i]", and a final 3x1 limb remainder mod m. If those powers take
+ roughly N divide steps to calculate then there'd be an advantage any time
+ the same m is used three or more times. Suggested by Victor Shoup in
+ connection with chinese-remainder style decompositions, but perhaps with
+ other uses.
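  A much-simplified single-word sketch of the idea, with B = 2^32 and the
  sum reduced as it goes rather than with the final 3x1 remainder described
  above; the names are made up:

      #include <stddef.h>
      #include <stdint.h>

      /* x[0..n-1] are the limbs of the operand, least significant first;
         pow[i] must hold B^i mod m with B = 2^32.  Returns the operand
         mod m.  */
      static uint32_t
      mod_by_precomputed_powers (const uint32_t *x, size_t n,
                                 const uint32_t *pow, uint32_t m)
      {
        uint64_t acc = 0;
        for (size_t i = 0; i < n; i++)
          acc = (acc + (uint64_t) x[i] * pow[i]) % m;
        return (uint32_t) acc;
      }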
+