Accurate Math Functions on the Intel IA-32 ... - RNC 7
Transcript of Accurate Math Functions on the Intel IA-32 ... - RNC 7
INTEL CONFIDENTIAL
Accurate Math Functions on the Intel IA-32 Architecture: A Performance-Driven Design
Cristina Anderson, Nikita Astafiev and Shane StoryIntel Corporation
July 2006
Agenda
• Intel LIBM design– Goals– Techniques– Verification
• Implementation tricks for gaining– Performance– Accuracy
• Current results– Performance and Accuracy tables
• Future development– Challenges of coding in C– Correctly rounded
Intel LIBM design
• Goals– Performance: Latency optimized routines
• Given the maximum resources function should finish execution as soon as possible
• Opposed to Through-put optimization: maximum number of independent functions run in parallel – each is constrained in resources
sin(x)
sin(x) sin(x) sin(x)
Intel LIBM design
• Goals– Performance: Latency optimized routines– Accuracy: 0.55 ulps (units in the last place) or better in round-to-nearest
Rounding error = |bexp·d0.d1d2 … dp-1 – bexp·d0.d1d2 … dp-1dddddddd… |·bp-1-exp
p-digits format Exact base b representation
d0.d1d2 … dp-20
d0.d1d2 … dp-201
d0.d1d2 … dp-21
Intel LIBM design
• Goals– Performance: Latency optimized routines– Accuracy: 0.55 ulps or better in round-to-nearest– Correctly rounded in IEEE-754 mandated cases– Exception flags– C99, F90 conformance– Structured Exception Handling
Intel LIBM design
• Algorithms– Argument reduction– Table-lookup– Polynomial approximation– Reconstruction
log(2n · M) = n·log2 + logM = n·log2 – log(B) + log(B·M)
= n·log2 – log(B) + log(1 + (B·M – 1)) B 1/M≈
≈ n·Tlog2 – Tlog(1/M) + Poly(r)
= n·(Tlog2_hi + Tlog2_lo) – T(indexlog(1/M)) + r·(P1 + r·(P2… )… )
Intel LIBM design
• Architecture– IA-32 instructions set
• General Purpose– 32-bit GP registers
• x87 FPU– 80-bit FP registers
MOVLoads
CMP, TESTTests
SHL, SHR, SARShifts
OR, AND, XORLogical
ADD, SUBArithmetic
General Purpose
FADD, FMULArithmetic
FPREM1Remainder
FLDLoad
x87 FPU
Intel LIBM design
• Architecture– IA-32 instructions set
• General Purpose• x87 FPU• SSE/SSE2/SSE3…
– 128-bit registers– 2 double FP– 4 single FP
PSHUFDPack / unpack
MOVD, PEXTRW, PINSRWTransfers to /from GPR
CVTSS2SD, CVTSD2SSFormat conversions
PSRLQ, PSLLQ,PSRLD, PSLLD
Logical shifts
MOVSSMOVAPS
ORPS,ANDPS,XORPS
ORPD,ANDPD,XORPD
Logical
MOVSDMOVAPD
Load one elementLoad 128 bit
MOVDDUPPack
MULPS/SSDIVPS/SS
MULPD/SDDIVPD/SD
FP SIMD/scalarmultiply, divide
ADDPS/SSSUBPS/SS
ADDPD/SDSUBPD/SD
FP SIMD/scalaradd, subtract
SinglePrecision
DoublePrecision
SSE/SSE2/SSE3
Intel LIBM design
• Architecture– IA-32 instructions set
• General Purpose• x87 FPU• SSE/SSE2/SSE3…
– Instruction-level parallelism
Intel LIBM design
• Architecture– IA-32 instructions set
• General Purpose• x87 FPU• SSE/SSE2/SSE3…
– Instruction-level parallelism– SIMD parallelism
P(x) = (… ((P9·x + P8)·x + P7)·x +… +P0
Podd = (((P9·x2 + P7)·x2 + P5)·x2 + P3)·x2 + P1
Peven = (((P8·x2 + P6)·x2 + P4)·x2 + P2)·x2 + P0
P(x) = Podd·x + Peven
Intel LIBM design
• Verification– Paper proofs are limited
• IEEE required cases (sqrt, division, remainder)– IMLTS (Intel Math Library Test Suite)
• Black and White box testing– Accuracy / Monotonicity / Symmetry / Special Values / Flags– Hard to round cases
• Errors have been found in the 3 available CR libraries
Implementation tricksBranch collapsing
y = y + a1
a1 = ma & ay = y + a
ma = (b – x) >> 31if (x > b)
Branch collapsing via bit-mask
Implementation tricksBranch collapsing
if ((unsigned) (x-b) > a-b )goto special
…<main path>…
if (x > a)goto overflow
if (x < b)goto underflow
…<main path>…
Let a > b
Branch collapsing via condition shrinking
Implementation tricksFast conversions
cvtsd2si y, xcvtsi2sd r, y
Result is represented in double FP format
Let double |x| < 251
Fast conversion to integer via Right Shifter
252 + 251
1 1 1 * *
rounding
20
.fraction
movd index, x
addsd x, 252 + 251
subsd x, 252 + 251
Implementation tricks
y = 2(E - 1023)·significand
psllq x, 52 – 23paddd x, 0x3800000000000000
cvtss2sd y, x
x = 2(E - 127)·significand
Let single x > 0
Fast Single to Double conversion
y = |x|y = (x + S) xor S
N = 32 or 64
000…0 , if x >= 0111…1 , if x < 0
S = x >> (N − 1)
Fast integer ABS
S =
Implementation tricksAccuracy and performance
• arcsinh(x) = ln(x + sqrt(x2+1)) = ln(1 + (x + sqrt(x2+1) – 1))= ln(1 + V), V >= 0
• Relative error
dy d(ln(1+V)) dV dV y ln(1+V) (1+V)·ln(1+V) V
• Multiprecision computations– Newton iterations for sqrt and argument reduction– Polynomial reconstruction of log (lower terms)
A + B != round to double (A + B)Let |A| > |B|Shi = round to double (A + B)
Slo = (A – Shi) + B
A + B = Shi + Slo
Sometimes it is possible to make the lower terms of the polynomial simple – to have less high-low parts computations
__ _______ _________ __= = < V > 0
Current results: Accuracy
0.50.501025POWF0.50.502916LOGF0.50.506582EXPF0.50.520918ATANF0.50.500003ACOSF0.50.500003ASINF7.04E+130.504488TANF3.52E+130.509615COSF1.13E+150.513276SINF8.48E+080.506688POW
0.50.5005830.5003760.501397LOG0.50.5000470.7872140.540348EXP0.50.50.5003070.542333ATAN
2.15E+050.531348ACOS0.50.501573.4408710.535745ASIN0.50.5000012.60E+330.541852TAN0.50.5000012.60E+330.51851COS0.50.52.60E+330.515082SIN
CRlibmCRlibm (first step)GNU libmIntel libmFunction
Intel 9.1 compiler LIBM: Intel(R) C++ Compiler for 32-bit applications, Version 9.1 Build 20060505Zglibc 2.3.6 built on SUSE* SLES9 with gcc 3.3.3Modified CRLIBM* 0.11beta1, built on SUSE* SLES9 with gcc 3.3.3 Performance tests and ratings are measured using specific computer systems and/or components. Any difference in system design or configuration may affect actual performance.Intel, the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.* Other brands and names may be claimed as the property of others.
Current results: Performance (clock cycles)
594115POWF19757LOGF35448EXPF30269ATANF35784ACOSF35782ASINF28386TANF23265COSF23166SINF665249POW
1505365342159LOG2548528444145EXP25907730338155ATAN
498206ACOS35421244498203ASIN47194732304270TAN20468455256164COS21406460258164SIN
CRlibmCRlibm (first step)GNU libmIntel libmFunction
Intel 9.1 compiler LIBM: Intel(R) C++ Compiler for 32-bit applications, Version 9.1 Build 20060505Zglibc 2.3.6 built on SUSE* SLES9 with gcc 3.3.3Modified CRLIBM* 0.11beta1, built on SUSE* SLES9 with gcc 3.3.3 Performance tests and ratings are measured using specific computer systems and/or components. Any difference in system design or configuration may affect actual performance.Intel, the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.* Other brands and names may be claimed as the property of others.
Future Development
• Code in C– Ease of use and support
• In-lining• Portability
– Some architectural features are inaccessible from the high-level programming language (e.g. logical operations on FP numbers)
• Compiler built-ins (extensions to language) can help– Dependency on the compiler
• Correctly rounded– Currently out of scope– Challenge: accuracy (consistency) vs performance– Proof is the main problem: e.g. 3 publicly available CR libraries have had accuracy
issues• IBM Ultimate LIBM*. 1999, 2002.• Sun LIBMCR*. Ver. 0.9; December 2004.• ENS-Lyon CRLIBM*. Ver. 0.8beta; January 2005
* Other brands and names may be claimed as the property of others.