===================================================================
RCS file: /home/cvs/OpenXM_contrib/gmp/mpn/x86/pentium/Attic/README,v
retrieving revision 1.1
retrieving revision 1.1.1.2
diff -u -p -r1.1 -r1.1.1.2
--- OpenXM_contrib/gmp/mpn/x86/pentium/Attic/README	2000/01/10 15:35:26	1.1
+++ OpenXM_contrib/gmp/mpn/x86/pentium/Attic/README	2000/09/09 14:12:44	1.1.1.2
@@ -1,6 +1,52 @@
-This directory contains mpn functions optimized for Intel Pentium
-processors.
+                    INTEL PENTIUM P5 MPN SUBROUTINES
+
+
+This directory contains mpn functions optimized for Intel Pentium (P5,P54)
+processors. The mmx subdirectory has code for Pentium with MMX (P55).
+
+
+STATUS
+
+                                cycles/limb
+
+    mpn_add_n/sub_n                2.375
+
+    mpn_copyi/copyd                1.0
+
+    mpn_divrem_1                   44.0
+    mpn_mod_1                      44.0
+    mpn_divexact_by3               15.0
+
+    mpn_l/rshift                   5.375 normal (6.0 on P54)
+                                   1.875 special shift by 1 bit
+
+    mpn_mul_1                      13.0
+    mpn_add/submul_1               14.0
+
+    mpn_mul_basecase               14.2 cycles/crossproduct (approx)
+
+    mpn_sqr_basecase               8 cycles/crossproduct (approx)
+                                   or 15.5 cycles/triangleproduct (approx)
+
+Pentium MMX gets the following improvements
+
+    mpn_l/rshift                   1.75
+
+
+1. mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
+documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
+or 5 cycles/limb asymptotically. The P55 runs them at the expected speed.
+
+2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb. Due to loop
+overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.
+
+3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
+should. Intel documentation says a mul instruction is 10 cycles, but it
+measures 9 and the routines using it run with it as 9.
+
+
+
 
 RELEVANT OPTIMIZATION ISSUES
 
 1. Pentium doesn't allocate cache lines on writes, unlike most other modern
@@ -13,14 +59,19 @@ to different cache banks. The simplest way to insure
 two words from the same object. If we make operations on different objects,
 they might or might not be to the same cache bank.
 
-STATUS
 
-1. mpn_lshift and mpn_rshift run at about 6 cycles/limb, but the Pentium
-documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
-or 5 cycles/limb asymptotically.
 
-2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb. Due to loop
-overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.
+REFERENCES
 
-3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
-should...
+"Intel Architecture Optimization Manual", 1997, order number 242816. This
+is mostly about P5, the parts about P6 aren't relevant. Available on-line:
+
+    http://download.intel.com/design/PentiumII/manuals/242816.htm
+
+
+
+----------------
+Local variables:
+mode: text
+fill-column: 76
+End:
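
Note on the cycles/limb figures in the STATUS table added by this diff: a
figure is the average number of clock cycles a routine spends per limb
processed, and a limb is a 32-bit word on these processors. As a rough
illustration of what one such per-limb step involves, here is a portable C
sketch of the operation mpn_add_n performs. The names limb_t and ref_add_n
are hypothetical and chosen only for this sketch. It is not the Pentium
assembly in this directory and makes no claim to reach the 2.375 cycles/limb
listed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for this sketch; a limb is 32 bits on Pentium. */
typedef uint32_t limb_t;

/* Reference version of the operation mpn_add_n performs:
   rp[0..n-1] = up[0..n-1] + vp[0..n-1], least significant limb first.
   Returns the carry out of the most significant limb (0 or 1). */
limb_t ref_add_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s  = up[i] + vp[i];   /* may wrap around modulo 2^32 */
        limb_t c1 = s < up[i];       /* 1 if the first add wrapped */
        limb_t r  = s + carry;       /* add the incoming carry */
        limb_t c2 = r < s;           /* 1 if adding the carry wrapped */
        rp[i] = r;
        carry = c1 | c2;             /* at most one of c1, c2 is set */
    }
    return carry;
}
```

The loop body runs once per limb, so a cycles/limb figure is in effect the
cost of one iteration of a loop like this, with loop overhead averaged in.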