Lp fp us with multispeculative techniques_19_12_12
Low Power Floating Point Units
Dr. Alberto A. Del Barrio García Complutense University of Madrid
Outline
• Introduction
• FP-MACs
  – Why not the multiplier?
• Multispeculative Adders (MSADD)
• Multispeculative FP-MACs
  – Integrating MSADD and FP-MAC
  – The CCTA
  – The Sign Predictor
• Experiments
• Conclusions
Introduction: FPUs
Blue Gene/Q FPU (1st in the Top 500, June 2012). Power distribution within a node of a hypothetical Exascale system [Kogge et al., 2008].
[Figure: Blue Gene/Q FPU with four FP-MAC units (FP-MAC0 to FP-MAC3), each with a register file (RF), plus Load, Permute and A2 blocks; the datapath labels read 256 and 64.]
Introduction: FP-MAC, R=AxB+C
• Multiplication stage
  – Effective addition/subtraction decision, sub=(s(A) xor s(B)) xor s(C)
  – Exponent difference, d=exp(C)-(exp(A)+exp(B))
  – C alignment and inversion (Cinv)
  – AxB in CSA format (X,Y) such that Z=2X+Y
• Addition stage
  – Sign calculation
  – Addition: 2X+Y+Cinv
  – LZA: normalizing shift calculation
• Round and Normalize
  – 2's complement
  – Normalize and exponent adjustment
  – Round and postnormalization
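The two control values computed in the multiplication stage can be sketched in Python. This is only an illustration, assuming IEEE 754 double-precision normal operands and unbiased exponents; the helper names `fields` and `mac_control` are hypothetical and are not taken from the presented design.

```python
import struct

def fields(x):
    """Return (sign, unbiased exponent) of a normal IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = ((bits >> 52) & 0x7FF) - 1023  # remove the bias (assumption)
    return sign, exponent

def mac_control(a, b, c):
    """Compute sub and d for R = A x B + C as defined on the slide."""
    sa, ea = fields(a)
    sb, eb = fields(b)
    sc, ec = fields(c)
    sub = (sa ^ sb) ^ sc   # 1 means an effective subtraction
    d = ec - (ea + eb)     # exponent difference used to align C
    return sub, d

if __name__ == "__main__":
    print(mac_control(1.5, 2.0, -8.0))
```

With A=1.5, B=2.0 and C=-8.0 the product is positive and C is negative, so the operation is an effective subtraction, and C's exponent exceeds the product exponent by 2.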
Introduction: FP-MAC
[Figure: baseline FP-MAC datapath. Inputs A[wt-1:0], B[wt-1:0], C[wt-1:0]; blocks: Sign Logic, Exponent Difference, Alignment Shifter, Inversion, Booth Multiplier Array, 3:2 CSA Compressor, Carry Propagate Adder, Leading Zero Anticipator, Incrementer, MUX, Sign Adjust, Complement, Normalization Shifter, Rounder & Post-norm., Sticky Logic, Exponent Adjust; internal signals include Compl, cout, sticky, rnd and lsb; output R[wt-1:0].]
Introduction: FP-MAC
• Baseline implementation [Huang et al., Trans. on Comp. 2012]
• An FPU consists of one or several floating-point multiply-accumulate FUs
• Three significant components: the multiplier, the adder, and the rounding and normalizing module
Here is some data [Huang et al. 2012]
Introduction
• Multispeculation, Exascale and FPUs
  – The FPU's role is quite important in Exascale systems
    • Exaflop energy efficiency must be improved
  – FPUs are critical from both the performance and the power/energy point of view
  – Multispeculative adders are low-power
  – This is an excellent scenario for applying multispeculation
    • Multiplier → Extremely difficult
    • Addition → Possible
    • Rounder → Let me think :)
FP-MAC: Multiplier
• Why not the multiplier?
  – Partial Product Matrix (PPM) + Last Stage Adder (LSA)
    • In the FP-MAC the LSA corresponds to the accumulation
  – All PPM implementations consist of a Booth recoder and a Wallace/Dadda tree
    • Booth (1951), Wallace (1964), Dadda (1965)
    • Important improvements since then
      – Using 4:2 compressors instead of (3,2) counters [Weinberger, 1981]
      – Balancing delays: Three Dimensional optimization Method (TDM) [Oklobdzija, Villeger and Liu, 1996]
  – LSA designs continue evolving with the appearance of VLFUs
FP-MAC: Multiplier
Radix-4 Booth: higher radices have been tried, but the additional complexity makes them not worth the cost. Radix-4 is what is used.
FP-MAC: Booth Encoding
• Add a 0 to the right of the LSB, since the first group has no group with which to overlap
• Examine 3 bits at a time
• Encode 2 bits at a time
• Overlap one bit between partial products
• Example: multiplier = 1001 = -7 (C2 format)
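The recoding steps listed above can be sketched as follows. This is a minimal illustration of radix-4 Booth recoding, assuming an even operand width and a two's-complement bit pattern passed as a non-negative integer; the function name `booth_radix4_digits` is hypothetical.

```python
def booth_radix4_digits(multiplier, width):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant first.
    """
    bits = multiplier << 1          # add a 0 to the right of the LSB
    digits = []
    for i in range(0, width, 2):    # encode 2 bits at a time
        group = (bits >> i) & 0b111  # examine 3 bits, overlapping by one
        b_hi = (group >> 2) & 1
        b_mid = (group >> 1) & 1
        b_lo = group & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits

def booth_value(digits):
    """Value represented by the digits: sum of digit * 4**position."""
    return sum(d * 4 ** i for i, d in enumerate(digits))

if __name__ == "__main__":
    digits = booth_radix4_digits(0b1001, 4)
    print(digits, booth_value(digits))
```

For the slide's multiplier 1001 the groups are 010 and 100, which recode to the digits +1 and -2, that is 1 - 2·4 = -7.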
FP-MAC: Multiplier
Number of stages in a Dadda tree
Dadda tree for an 8x8 multiplier with (3,2) counters
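The number of reduction stages can be estimated with the usual Dadda height sequence 2, 3, 4, 6, 9, 13, ... (each term is the previous one times 1.5, rounded down). The sketch below assumes that sequence and counts how many heights lie below the number of partial-product rows; `dadda_stages` is a hypothetical helper name.

```python
def dadda_stages(rows):
    """Number of reduction stages needed to bring `rows` rows down to 2."""
    stages = 0
    height = 2
    while height < rows:
        height = height * 3 // 2   # next term of the Dadda sequence
        stages += 1
    return stages

if __name__ == "__main__":
    # 8 rows for an 8x8 multiplier without Booth recoding
    print(dadda_stages(8))
```

An 8x8 multiplier has 8 partial-product rows, which this count reduces through the heights 6, 4, 3 and 2.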
FP-MAC: Multiplier
Conceptually: a 4:2 compressor built with two (3,2) counters (full adders).
(7,3) and (15,4) counters and several other compressors have also been tried, but 4:2 compressors seem to be the most efficient. Most implementations today use 4:2 compressors.
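A bit-level sketch of the conceptual 4:2 compressor built from two full adders is shown below. It only models the logic function, not the delay balancing, and the function names are hypothetical.

```python
from itertools import product

def full_adder(a, b, c):
    """(3,2) counter: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor made of two chained full adders.

    Returns (sum, carry, cout); carry and cout both have weight 2.
    cout does not depend on cin, so no carry ripples along a row.
    """
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout

if __name__ == "__main__":
    for inputs in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*inputs)
        assert sum(inputs) == s + 2 * (carry + cout)
    print("4:2 compressor preserves the input sum")
```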
FP-MAC: Multiplier
An example of the technique of Oklobdzija et al.: a balanced 4:2 compressor built with two (3,2) counters. Signals with large delays are connected to "fast inputs" and signals with short delays to "slow inputs".
Multispeculative Adder
Kogge-Stone Adder
• White dots:
  • gi = xi · yi
  • pi = xi xor yi
• Purple dots:
  • (G',P') * (G'',P'') = (G'+P'·G'', P'·P''), where X' is more significant than X''
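The white-dot and purple-dot operations can be put together into a small software model of a Kogge-Stone adder. This is a behavioural sketch with a hypothetical function name; each pass of the `while` loop corresponds to one level of the prefix tree.

```python
def kogge_stone_add(x, y, n):
    """n-bit Kogge-Stone addition. Returns (sum mod 2**n, carry_out)."""
    # White dots: bitwise generate and propagate
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    dist = 1
    while dist < n:
        new_G, new_P = G[:], P[:]
        for i in range(dist, n):
            # Purple dots: (G',P') * (G'',P'') = (G' + P'.G'', P'.P'')
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0  # carry into bit i
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1]

if __name__ == "__main__":
    print(kogge_stone_add(0b1011, 0b0110, 4))
```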
Multispeculative Adder
n-bit Kogge-Stone Adder
o Complex carry propagation tree
o The fastest non-speculative design
  o O(log(n))
o Huge area
  o O(n*log(n)) with large n
Image taken from http://www.aoki.ecei.tohoku.ac.jp/arith/mg/algorithm.html
n-bit Multispeculative KS
o n/k simpler carry propagation trees
o Extremely fast
  o O(log(k))
  o Depends on the predictors' accuracy
o Reduced area
  o Small KS have roughly linear area, O(k)
  o Area: n/k*O(k) ≅ O(n)
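A behavioural sketch of a multispeculative addition is given below. It assumes the simplest predictor, a static zero prediction for every fragment carry-in, and that k divides n; the function names are hypothetical.

```python
def msadd_static_zero(a, b, n, k):
    """Add a and b as n/k independent k-bit fragments, each with carry-in 0.

    Returns (speculative_sum, carry_outs), with carry_outs[f] being the
    carry out of fragment f (fragment 0 is the least significant).
    """
    mask = (1 << k) - 1
    spec_sum = 0
    carry_outs = []
    for f in range(n // k):
        fa = (a >> (f * k)) & mask
        fb = (b >> (f * k)) & mask
        total = fa + fb                      # predicted carry-in is 0
        spec_sum |= (total & mask) << (f * k)
        carry_outs.append(total >> k)
    return spec_sum, carry_outs

def is_hit(carry_outs):
    """All predictions were right iff no fragment but the last carried out."""
    return not any(carry_outs[:-1])

if __name__ == "__main__":
    s, c = msadd_static_zero(0x1234, 0x0101, 16, 4)
    print(hex(s), c, is_hit(c))
```

When every prediction is right the speculative sum is already the exact sum, and each fragment only needed a k-bit carry tree.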
Multispeculative Adder
[Figure: (a) and (b), two 16-bit carry trees with bit positions 15 to 0, comparing the nodes affected by a bit flip.]
The same bit flip affects far fewer nodes, especially if there is a hit. Moreover, the MSADD contains fewer nodes.
FP-MAC and Multispeculation
Conceptually, the baseline design consists of two adders:
• 1 adder to perform the addition
• 1 constant adder to complement the result when necessary
The idea consists of substituting the adders with an MSADD with Static Zero Prediction. In order to avoid additional delays due to corrections, mispredictions (+1) will be corrected on the fly in the C2-complementer.
[Figure: the baseline n-bit adder followed by a C2-complementer controlled by Compl, and the proposed MSADD followed by the new C2-complementer. The MSADD splits A and B into n/k fragments of k bits (Frag. 0 to Frag. n/k-1) with predictors P1 to Pn/k-1, and produces Z, Cout, hit and the per-fragment pred/err signals; each k-bit slice of the new C2-complementer is controlled by Compl and Mi.]
FP-MAC and Multispeculation
• In every k-bit fragment of the C2-complementer, 4 cases can happen:
  – Compl='0', Mi='0'. The k-bit result is correct and it must not be complemented.
  – Compl='0', Mi='1'. The k-bit result is not correct and it must not be complemented. Hence, we must perform the operation X+1.
  – Compl='1', Mi='0'. The k-bit result is correct and it must be complemented. Hence, we must perform the operation C1(X).
  – Compl='1', Mi='1'. The k-bit result is incorrect and it must be complemented. Hence, we must perform the operation C1(X+1), except in fragment 0, which must perform C1(X)+1 instead.
• Lemma. C1(X+1) = C1(X) + "11…11"
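The lemma can be checked exhaustively for small fragment widths. The sketch below assumes k-bit modular arithmetic, so that adding "11…11" is the same as subtracting 1; `c1` and `lemma_holds` are hypothetical helper names.

```python
def c1(x, k):
    """One's complement of a k-bit value."""
    return ~x & ((1 << k) - 1)

def lemma_holds(k):
    """Check C1(X+1) == C1(X) + "11...11" (mod 2**k) for every k-bit X."""
    mask = (1 << k) - 1   # the all-ones vector "11...11"
    return all(
        c1((x + 1) & mask, k) == (c1(x, k) + mask) & mask
        for x in range(1 << k)
    )

if __name__ == "__main__":
    print(all(lemma_holds(k) for k in (1, 2, 4, 8)))
```

This is why a mispredicted fragment can be fixed after the bitwise inversion: the correction becomes a single k-bit addition of a constant vector.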
FP-MAC and Multispeculation
• I1 = 1111 1111 1111 1111
• I2 = 0000 0000 0000 0001
• Without multispeculation, I1+I2 = 0000 0000 0000 0000
• Applying multispeculation, the result of I1+I2 after the first step is:
  – S = 1111 1111 1111 0000
  – C = 0000 0000 0001 0000
• As the carry-out of fragment 0, i.e. M1, is equal to '1', the new two's complementer must add it to the addition result of fragment 1. The new vectors would be:
  – S = 1111 1111 0000 0000
  – C = 0000 0001 0000 0000
• We need more steps to complete the correction → NOT ALLOWED
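The chained correction in this example can be reproduced with a short model. The sketch assumes static zero prediction and a naive scheme in which each step adds the pending carry-outs to the next fragment; the function name is hypothetical, and the final carry-out of the top fragment is dropped (arithmetic modulo 2^n).

```python
def naive_correction_history(a, b, n, k):
    """Return the list of S vectors after each naive correction step."""
    mask = (1 << k) - 1
    frags = n // k
    # Each entry keeps the k-bit sum plus its carry-out in bit k
    raw = [((a >> (f * k)) & mask) + ((b >> (f * k)) & mask)
           for f in range(frags)]
    history = []
    while True:
        sums = [v & mask for v in raw]
        carries = [v >> k for v in raw]
        history.append(sum(s << (f * k) for f, s in enumerate(sums)))
        if not any(carries[:-1]):
            return history
        # Add each pending carry to the next fragment, one step at a time
        raw = [sums[f] + (carries[f - 1] if f > 0 else 0)
               for f in range(frags)]

if __name__ == "__main__":
    for s in naive_correction_history(0xFFFF, 0x0001, 16, 4):
        print(format(s, "016b"))
```

For I1 and I2 above, the carry has to walk through every fragment, which is the serial behaviour the design must avoid.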
FP-MAC and Multispeculation
With the proposed scheme it is possible to correct individual mispredictions, but the case of propagating mispredictions still exists. To solve this problem, the Correction Carry Tree Anticipator (CCTA) feeds the k-bit modules with the proper Cin, so that the output of the C2-complementer is fully correct.
[Figure: block diagram with the MSADD, the CCTA and the new C2-complementer. Signals: S (n bits), G, P and C (n/k-1 bits), Mi (n/k-1 bits), Compl, and the output ZSM (n bits).]
FP-MAC and Multispeculation: An example
I1 = 1111 1111 1111 1111
I2 = 0000 0000 0000 0011
The result will be positive, hence Compl='0'.
With a conventional flow, Z = ZSM = 0000 0000 0000 0010
Applying multispeculation in the addition, the result of I1+I2 is:
S = 1111 1111 1111 0010
C = 0000 0000 0001 0000
With the proper transformations and the Mi's coming from the CCTA:
T = 1111 1111 1111 0010
M = 0001 0001 0001 0000
ZSM = 0000 0000 0000 0010
These 4-bit additions are performed in parallel, without propagating carries from one fragment to the next.
CCTA values for this example:
G0 = '1', P0 = '0'
G1 = '0', P1 = '1'
G2 = '0', P2 = '1'
G3 = '0', P3 = '1'
M1 = '1', M2 = '1', M3 = '1', M4 = '1'
[Figure: CCTA prefix tree combining the (G,P) pairs of the four fragments.]
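The example can be reproduced with a small model of the CCTA for the Compl='0' path. The sketch rests on assumptions read off the slide's values: Gi is the carry-out of fragment i computed with carry-in 0, Pi is '1' when the fragment sum is all ones, and M(i+1) = Gi + Pi·Mi with M0 = 0. The recurrence is evaluated serially here for clarity, whereas the hardware would use a parallel prefix tree; the function name is hypothetical.

```python
def ccta_correct(a, b, n, k):
    """Speculative add plus CCTA correction (no complement).

    Returns (z, G, P, M) where M[i] is the true carry into fragment i
    and M[n//k] is the overall carry-out.
    """
    mask = (1 << k) - 1
    frags = n // k
    S, G, P = [], [], []
    for f in range(frags):
        total = ((a >> (f * k)) & mask) + ((b >> (f * k)) & mask)
        S.append(total & mask)
        G.append(total >> k)                          # fragment generates
        P.append(1 if (total & mask) == mask else 0)  # fragment propagates
    M = [0]
    for f in range(frags):
        M.append(G[f] | (P[f] & M[f]))
    z = 0
    for f in range(frags):
        # Independent k-bit additions: no carry crosses fragments
        z |= ((S[f] + M[f]) & mask) << (f * k)
    return z, G, P, M

if __name__ == "__main__":
    z, G, P, M = ccta_correct(0xFFFF, 0x0003, 16, 4)
    print(format(z, "016b"), G, P, M[1:])
```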
FP-MAC and Multispeculation: An example (2)
I1 = 1111 1110 1111 1001
I2 = 0000 0000 0001 0011
The result (-263)+19 is negative, hence Compl='1'.
With a conventional flow, Z = 1111 1111 0000 1100 and ZSM = 0000 0000 1111 0100
Applying multispeculation in the addition, the result of I1+I2 is:
S = 1111 1110 0000 1100
C = 0000 0001 0000 0000
With the proper transformations and the Mi's coming from the CCTA:
T = 0000 0001 1111 0011
M = 0000 1111 0000 0001
ZSM = 0000 0000 1111 0100
These 4-bit additions are performed in parallel, without propagating carries from one fragment to the next.
CCTA values for this example:
G0 = '0', P0 = '0'
G1 = '1', P1 = '0'
G2 = '0', P2 = '0'
G3 = '0', P3 = '1'
M1 = '0', M2 = '1', M3 = '0', M4 = '0'
[Figure: CCTA prefix tree combining the (G,P) pairs of the four fragments.]
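The T and M vectors of both examples follow from the four-case rule applied fragment by fragment. The sketch below assumes that the Mi values are already available from the CCTA and that fragment 0 adds the two's-complement +1 when Compl='1'. It does not model a carry out of that +1 into fragment 1, since the slides do not show how that case is handled. The function name is hypothetical.

```python
def new_c2_complementer(s, compl, m, n, k):
    """Apply complement and misprediction correction per k-bit fragment.

    s: speculative sum, compl: 0 or 1, m: list with m[f] = Mi of fragment f
    (m[0] is ignored). Returns ZSM.
    """
    mask = (1 << k) - 1
    out = 0
    for f in range(n // k):
        x = (s >> (f * k)) & mask
        t = (~x & mask) if compl else x       # T fragment: C1(X) or X
        if f == 0:
            add = 1 if compl else 0           # C1(X)+1 in fragment 0
        elif compl:
            add = mask if m[f] else 0         # C1(X+1) = C1(X) + "11...11"
        else:
            add = 1 if m[f] else 0            # X+1
        out |= ((t + add) & mask) << (f * k)  # carries never leave a fragment
    return out

if __name__ == "__main__":
    zsm = new_c2_complementer(0xFE0C, 1, [0, 0, 1, 0], 16, 4)
    print(format(zsm, "016b"))
```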
FP-MAC and Multispeculation: a fast evaluation
• Execution time: worst case log(k)+log(n/k)+log(k) = log(n)+log(k) vs 2log(n)
• Area and energy: MSADDs occupy/consume less than conventional adders
[Figure: delay annotations on both designs. Proposed: MSADD O(log(k)), CCTA O(log(n/k)), new C2-complementer O(log(k)). Baseline: adder O(log(n)), C2-complementer O(log(n)).]
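The worst-case comparison can be evaluated for concrete widths. This is a rough count of logic levels under the slide's asymptotic model only, ignoring constants and wiring; the function names are hypothetical and n and k are assumed to be powers of two.

```python
import math

def conventional_levels(n):
    """Adder plus C2-complementer, each with about log2(n) levels."""
    return 2 * math.ceil(math.log2(n))

def multispeculative_levels(n, k):
    """MSADD log2(k) + CCTA log2(n/k) + new C2-complementer log2(k)."""
    return (2 * math.ceil(math.log2(k))
            + math.ceil(math.log2(n // k)))

if __name__ == "__main__":
    for n, k in ((64, 4), (128, 8)):
        print(n, k, multispeculative_levels(n, k), conventional_levels(n))
```

With n=64 and k=4 this gives log(n)+log(k) = 8 levels against 2log(n) = 12.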
FP-MAC and Multispeculation: the Compl signal
• In conventional implementations, Compl = addition.msbit
• With the multispeculative adder stage, the msbit can be incorrect
• Solutions
  – [Lang & Bruguera 2004, 2005] Sign detector + full mantissa comparison in the worst case
  – Ours → Predict the sign
FP-MAC and Multispeculation: the Sign Predictor
• Only a few cases can mispredict, given R=AxB+C
  – d = Exp(C)-(Exp(A)+Exp(B))
    • d<0 → R>=0
    • d>1 → R<0
  – sub = sign(C) xor sign(A) xor sign(B)
    • This signal is '1' if there is an effective subtraction, i.e. iff sign(AxB) differs from sign(C)
  – Full mantissa comparison (misprediction in our case) iff sub='1' and (d=0 or d=1) [Lang and Bruguera]
FP-MAC and Multispeculation: the Sign Predictor
est is a combinational estimation of the Compl signal.
Initially, est='1' iff sub='1' and d>1
hit <= est xor msbit;
Q(i+1) <= not(hit);
Compl <= Q(i) xor est;
If there is a misprediction, only a stall cycle is required for recomplementing the result.
[Figure: two-state diagram with states S0 and S1 and transitions labelled hit='1' and hit='0'.]
hit | Q(i) | Compl  | Q(i+1) | Comments
0   | 0    | est    | 1      | S0 and failure, transition to S1
0   | 1    | -      | -      | This case never happens; in S1 there is always a hit
1   | 0    | est    | 0      | S0 and hit, remain in S0
1   | 1    | ¬(est) | 0      | S1 and hit, transition to S0
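The transition table can be exercised with a small cycle-level model. The sketch takes hit to mean that the Compl value actually applied agrees with the true sign bit, which is how the table reads; the exact gate-level definition of hit is an assumption here, and the function name is hypothetical.

```python
def sign_predictor_trace(operations):
    """Simulate the two-state predictor.

    operations: iterable of (est, msbit) pairs, one per FP-MAC operation.
    Returns a list of (compl, hit) pairs, one per cycle, stall cycles included.
    """
    q = 0                      # Q = 0 is state S0, Q = 1 is state S1
    trace = []
    for est, msbit in operations:
        while True:
            compl = q ^ est    # Compl <= Q(i) xor est
            hit = 1 if compl == msbit else 0
            trace.append((compl, hit))
            q = 0 if hit else 1   # Q(i+1) <= not(hit)
            if hit:
                break          # a miss costs exactly one stall cycle
    return trace

if __name__ == "__main__":
    print(sign_predictor_trace([(0, 0), (1, 0), (1, 1)]))
```

In this run the second operation is mispredicted, so it occupies two cycles: the failing one in S0 and the recomplementing one in S1.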
FP-MAC and Multispeculation: the Sign Predictor
• Hit rate is critical. It may be necessary to increase it.
• N-bit synchronous predictor. Use the most significant bits of the addition operands [Ashmila et al. 2005]
  – 1 bit -> est_i = x_i·y_i
  – 2 bits -> est_i = x_i·y_i + (x_i+y_i)·(x_{i-1}·y_{i-1})
  – 3 bits -> est_i = x_i·y_i + (x_i+y_i)·(x_{i-1}·y_{i-1} + (x_{i-1}+y_{i-1})·(x_{i-2}·y_{i-2}))
• Hit probability: 1-2^-(N+1)
• The estimation logic is out of the critical path
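The N-bit estimation and its hit probability can be checked by brute force. The sketch assumes uniformly distributed operands, which real traces need not follow, and measures how often the estimate built from the top N bits matches the true carry-out of a short adder; the function names are hypothetical.

```python
def estimate_carry(x, y, width, n_bits):
    """Estimate the carry-out of a width-bit addition from its top n_bits."""
    est = 0
    for i in range(width - n_bits, width):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        # est_i = x_i.y_i + (x_i + y_i).est_{i-1}, with no carry assumed below
        est = (xi & yi) | ((xi | yi) & est)
    return est

def hit_rate(width, n_bits):
    """Fraction of operand pairs whose estimate equals the true carry-out."""
    hits = 0
    for x in range(1 << width):
        for y in range(1 << width):
            true_carry = (x + y) >> width
            hits += estimate_carry(x, y, width, n_bits) == true_carry
    return hits / (1 << (2 * width))

if __name__ == "__main__":
    for n_bits in (1, 2, 3):
        print(n_bits, hit_rate(8, n_bits), 1 - 2 ** -(n_bits + 1))
```

The estimate only fails when all N inspected bit positions propagate and a carry arrives from below, which is where the 2^-(N+1) term comes from.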
FP-MAC and Multispeculation: the Sign Predictor
[Figure: alignment of the mul and Cinv operands for d=1 and d=0, marking the sign bit s and the estimation bits.]
• Estimate the carry-out of the yellow cell
• Predict the msbit, i.e. the red cell, with z_i = x_i xor y_i xor c_i = x_i xor y_i xor c_est
• The bits more significant than the red cell are a sign extension
Multispeculative FP-MAC
[Figure: multispeculative FP-MAC datapath. Inputs A[wt-1:0], B[wt-1:0], C[wt-1:0]; blocks: Sign Logic, Exponent Difference, Alignment Shifter, Inversion, Booth Multiplier Array, 3:2 CSA Compressor, MSADD, Leading Zero Anticipator, Sign Predictor, CCTA, NC2C, Normalization Shifter, Rounder & Post-norm., Sticky Logic, Exponent Adjust; internal signals include Complpred, Mi, P, G, msNC2C, sticky, rnd and lsb; output R[wt-1:0].]
Experiments: synthesis results
Design | Precision | Area (um2) | Delay (ns) | Power (mW) | % P Gain | Energy (pJ) | % E Gain
Conv   | Single    | 23676.1    | 4.98       | 15.10      | -        | 75.22       | -
Conv   | Double    | 80851.3    | 5.64       | 52.85      | -        | 298.09      | -
Conv   | Quad      | 301294.4   | 6.23       | 191.56     | -        | 1193.43     | -
MS     | Single    | 22478      | 4.24       | 14.67      | -2.85    | 62.22       | -17.28
MS     | Double    | 77074.56   | 4.44       | 51.77      | -2.06    | 229.84      | -22.90
MS     | Quad      | 289275.5   | 5.34       | 186.72     | -2.53    | 997.06      | -16.45
Non-pipelined implementation. MS Single and Double with k=4 bits, Quad with k=8 bits.
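The energy columns of the table can be cross-checked from the delay and power columns. The sketch below copies the reported values and assumes that energy is power times delay (mW × ns = pJ) and that the gain is the relative change against the conventional design; the names are hypothetical.

```python
# (delay in ns, power in mW, energy in pJ) as reported in the table
CONV = {"Single": (4.98, 15.10, 75.22),
        "Double": (5.64, 52.85, 298.09),
        "Quad": (6.23, 191.56, 1193.43)}
MS = {"Single": (4.24, 14.67, 62.22),
      "Double": (4.44, 51.77, 229.84),
      "Quad": (5.34, 186.72, 997.06)}

def energy_from_power_delay(entry):
    """Energy in pJ computed as power (mW) times delay (ns)."""
    delay, power, _ = entry
    return delay * power

def energy_gain_percent(precision):
    """Relative energy change of MS against Conv, in percent."""
    old = CONV[precision][2]
    new = MS[precision][2]
    return (new - old) / old * 100.0

if __name__ == "__main__":
    for precision in ("Single", "Double", "Quad"):
        print(precision,
              round(energy_from_power_delay(MS[precision]), 2),
              round(energy_gain_percent(precision), 2))
```

The products match the reported energies up to rounding of the power figures, so most of the energy saving comes from the shorter delay rather than from the power reduction.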
Power breakdown, DP: 3:2 + adder + C2C
Design | 3:2 Compr | Adder | CCTA     | C2C   | Sign
Conv   | 0.481     | 1.778 | 0        | 1.694 | 0
K64    | 0.499     | 1.614 | 1.17E-06 | 2.113 | 3.33E-02
K32    | 0.501     | 1.285 | 5.01E-03 | 1.849 | 3.23E-02
K16    | 0.503     | 1.212 | 1.45E-02 | 1.924 | 3.24E-02
K8     | 0.496     | 1.013 | 3.75E-02 | 1.566 | 3.24E-02
K4     | 0.501     | 0.979 | 1.36E-01 | 1.589 | 3.24E-02
Experiments: Sign Predictor Accuracy
[Figure: sign predictor hit rate per benchmark for the 0-bit, 1-bit, 2-bit and 3-bit estimators, in two panels labelled Single and Double. One panel (axis 0.95 to 1) covers ferret, blackscholes, bodytrack, x264 and streamcluster; the other (axis 0.9 to 1) covers ferret, blackscholes, bodytrack, x264, swaptions, freqmine, streamcluster, canneal, splash2x.barnes, splash2x.fmm and splash2x.ocean_cp.]
Modified libsoft-fp for x86_64. PARSEC and SPLASH2x were compiled with this library. Single and double precision traces are processed afterwards.
Conclusions
• Multispeculative ideas contribute to the energy decrease
  • 1st, a probably inexact addition is performed
  • 2nd, the C2C is used for complementing and correcting
• The sign is predicted in order to avoid huge comparisons
• In the future new optimizations must be performed
  – Multiplier and Normalizer
Thank you very much for your attention!
You can email me at: [email protected]