Microarchitecture of Superscalars (3) Branch Prediction
Dezső Sima
Fall 2007
(Ver. 2.0) Dezső Sima, 2007
Branch prediction
1. Introduction
2. Basic branch prediction mechanisms
3. Auxiliary branch prediction mechanisms
4. Accessing the branch target path
1.1 The branch processing problem of pipelining (1)
Figure 1.1: Straightforward processing of an unconditional branch on a four-stage pipeline
[Pipeline diagram, stages F, D, E, W. Steps shown for the branch b: branch fetching, branch detection, BTA calculation, BTI fetching. 2 bubbles precede the BTI.]
1.1 The branch processing problem of pipelining (2)
Figure 1.2: Straightforward processing of a conditional branch on a four-stage pipeline with immediate condition resolution
[Pipeline diagram, stages F, D, E, W. Steps shown for the conditional branch bc: bc fetching, bc detection, condition checking (branch!), BTA calculation, BTI fetching. 3 bubbles precede the BTI.]
1.1 The branch processing problem of pipelining (3)
Figure 1.3: Straightforward processing of a conditional branch on a four-stage pipeline, with delayed condition resolution
[Pipeline diagram, stages F, D, E, W. Steps shown for the conditional branch bc: bc fetching, bc detection, then a dynamic stop while condition checking is repeated until the condition resolves (branch!), followed by BTA calculation and BTI fetching. A large number of bubbles precedes the BTI.]
1.1 The branch processing problem of pipelining (4)
Figure 1.4: Number of pipeline stages in Intel’s and AMD’s processors
[Plot: number of pipeline stages (10 to 40) versus year (1990 to 2005). Data points: Pentium (5), K6 (6), Athlon (6), Pentium Pro (~12), Athlon-64 (12), Pentium 4 (~20), P4 Prescott (~30), Core Duo, Conroe (14).]
1.2 Branch statistics (2)
Figure 1.6: Ratio of the main instruction types
Source: Stephens et al., "Instruction level profiling and evaluation of the IBM RS/6000", Proc. 18th ISCA, pp. 137-146
Figure 1.7: Grohoski’s estimate of branch statistics
Source: Grohoski, G.F., IBM J. Res. Develop., 34, Jan., pp. 37-58
Branches
  Unconditional branches (~ 1/3): simple unconditional branch, branch to subroutine, return from subroutine. Taken.
  Conditional branches
    Loop-closing conditional branch (~ 1/3): taken for the first (n-1) iterations, not taken in the last one.
    Other conditional branches (~ 1/3): taken ~ 1/6, not taken ~ 1/6.
  Overall: taken ~ 5/6, not taken ~ 1/6
1.2 Branch statistics (3)
Figure 1.8: Frequency of taken and not taken branches
Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 303
Reference              Frequency of taken branches   Frequency of not taken branches
Lee, Smith 1984        57 - 99 %                     1 - 43 %
Edenfield & al. 1990   75 %                          25 %
Grohoski 1990          ~ 5/6                         ~ 1/6
1.3 The principle of branch prediction (1)
Figure 1.9: Correctly predicted conditional branch with delayed condition resolution on a four-stage pipeline
[Pipeline diagram, stages F, D, E, W. After bc fetching and bc detection, branch prediction (branch!) and BTA calculation are followed by BTA fetching and BTI decode, so the BTI proceeds speculatively after 2 bubbles while condition checking is still being repeated. When the condition resolves (branch!), the speculative execution is acknowledged.]
1.3 The principle of branch prediction (2)
Figure 1.10: Incorrectly predicted conditional branch with delayed condition resolution on a four-stage pipeline
[Pipeline diagram, stages F, D, E, W. After bc fetching and bc detection, branch prediction (branch!) and BTA calculation are followed by BTA fetching and BTI decode, so the BTI proceeds speculatively. Condition checking later resolves to (no branch!), and fetching restarts on the sequential path, leaving a large number of bubbles.]
1.3 The principle of branch prediction (3)
Figure 1.11: Branch misprediction penalty on a long pipeline
[Pipeline diagram, stages F1, F2, F3, D1, D2, E1, E2, W. After bc fetching and bc detection the branch prediction says (no branch!), so the sequential instructions keep entering the pipeline. Condition checking late in the pipeline resolves to (branch!), the misprediction is detected, and only then does BTI fetching start. The cycles lost in between are the misprediction penalty.]
1.4 Branch prediction accuracy/penalty (1)
Figure 1.12: Branch prediction accuracy
Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 340
Processor | Guessing method (relevant for prediction accuracy) | Implementation | Prediction accuracy | Reference
Am 29000 (1987) | Implicit dynamic | 32-entry two-way set associative BTIC | 60 % for repetitive branches | Weiss 1987
MC 88110 (1991) | Implicit dynamic, overridden by opcode-based static | 32-entry fully associative BTIC | 70 % on SPEC | Diefendorff, Allen 1992
MC 68060 (1993) | 2-bit dynamic | 256-entry BTAC | > 90 % | Circello, Goodrich 1993
MIPS R10000 (1996) | 2-bit dynamic | 512-entry BHT | 90 % | Halfhill, 1994
PowerPC 620 (1995) | Implicit dynamic, augmented with 2-bit dynamic | 256-entry fully associative BTAC, 2-K-entry BHT | 90 % | Thomson, Ryan 1994
PA-8000 (1995) | Implicit dynamic, overridden by 3-bit dynamic or compiler based static | 32-entry fully associative BTAC, 256-entry BHT | 80 % on SPECint92 | Gwennap 1994
UltraSparc (1995) | 2-bit dynamic | 2 K-entries in the IC, each shared among two instructions | 88 % on SPECint92, 94 % on SPECfp92 | Wayner 1994
BHT: Branch history table
BTAC: Branch target address cache
BTIC: Branch target instruction cache
IC: Instruction cache
1.4 Prediction accuracy/penalty (2)
Effective penalty of branch processing (simplified)
$$P = f_c \cdot P_c + f_m \cdot P_m$$
f_c: Probability (frequency) of correctly predicted branches
f_m: Probability (frequency) of mispredicted branches
P_c: Penalty of correctly predicted branches
P_m: Penalty of mispredicted branches
If $P_c = 0$:
$$P = f_m \cdot P_m$$
Examples:
Processor        f_m    P_m         P
PPro             0.1    10 cycles   1 cycle
P4 Willamette    0.05   20 cycles   1 cycle
P4 Prescott      0.05   30 cycles   1.5 cycles
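The simplified penalty formula can be checked numerically. The following is a minimal sketch; the function name `effective_branch_penalty` and the default `f_c = 1 - f_m` are illustrative assumptions, while the three (f_m, P_m) pairs are the slide's own examples.

```python
def effective_branch_penalty(f_m, p_m, p_c=0.0, f_c=None):
    """Return P = f_c * P_c + f_m * P_m, in cycles per branch.

    f_c defaults to 1 - f_m (assumption: every branch is either
    predicted correctly or mispredicted).
    """
    if f_c is None:
        f_c = 1.0 - f_m
    return f_c * p_c + f_m * p_m


# (f_m, P_m) pairs taken from the examples on the slide, with P_c = 0.
examples = {
    "PPro": (0.1, 10),
    "P4 Willamette": (0.05, 20),
    "P4 Prescott": (0.05, 30),
}

for name, (f_m, p_m) in examples.items():
    penalty = effective_branch_penalty(f_m, p_m)
    print(f"{name}: {penalty:.2f} cycles per branch")
```

With P_c = 0 the result depends only on the product f_m * P_m, which is why a longer pipeline needs a proportionally lower misprediction rate to keep the same effective penalty.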
2. Basic branch prediction mechanisms
2.1 Introduction (1)
Branch processing
  Branch detection
  Branch prediction
  Accessing the branch target path
Branch prediction mechanisms
  Basic branch prediction mechanism
  Auxiliary branch prediction mechanism
2.1 Introduction (2)
Basic branch prediction mechanism
  Processor based
    Local
    Global (2-level)
  Compiler hints
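Figure 2.2: Global prediction
[Diagram: the same branch (marked ?) is reached along two different paths, Path 1 and Path 2, with different outcomes of the preceding branches (e.g. . . 0 0 0 versus . . 1 0 0).]
Prediction depends on the actual execution path, that is, on all branches executed.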
Basic branch prediction mechanism
  Processor based
    Local
    Global (2-level)
    Combined (choice prediction)
  Compiler hints
2.2. Local prediction (2)
[Chart: 1-level (local) prediction]
  Fixed prediction (always the same prediction): 'always not taken' approach, 'always taken' approach
  Static prediction (based on the object code): opcode-based, displacement-based
  Dynamic prediction (based on the execution history): 1-bit prediction
  Processors shown: 80486 (1989), POWER1 (1990), MC 68040 (1990), SuperSparc (1992), R4000 (1992), PPC 601 (1993), POWER2 (1993), R8000 (1994)
PPC: PowerPC
2.2. Local prediction (3)
Figure 2.3: Principle of the 1-bit dynamic prediction
[Diagram: the IFA selects an entry x of the BHT (Branch History Table). x = 0: sequential continuation; x = 1: branch.]
2.2. Local prediction (4)
Figure 2.4: State transition diagram of the 1-bit dynamic prediction
[Diagram: two states, "Taken" and "Not taken". T keeps or enters "Taken"; NT keeps or enters "Not taken".]
T: Branch has been taken
NT: Branch has not been taken
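The 1-bit scheme of Figures 2.3 and 2.4 can be sketched in a few lines. This is an illustrative model only: the class name, the table size of 256 entries and the modulo indexing by IFA are assumptions, not details given on the slides.

```python
class OneBitPredictor:
    """1-bit dynamic prediction: each BHT entry remembers the last outcome."""

    def __init__(self, entries=256):
        # 0: sequential continuation, 1: branch (as in Figure 2.3)
        self.bht = [0] * entries

    def _index(self, ifa):
        # Assumed simple indexed access: low-order part of the IFA.
        return ifa % len(self.bht)

    def predict(self, ifa):
        return self.bht[self._index(ifa)] == 1

    def update(self, ifa, taken):
        # T enters "Taken", NT enters "Not taken" (Figure 2.4).
        self.bht[self._index(ifa)] = 1 if taken else 0


def count_mispredictions(predictor, ifa, outcomes):
    """Predict, compare and update for each outcome of one branch."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# A loop-closing branch: taken three times, then not taken, run three times.
loop_outcomes = ([True] * 3 + [False]) * 3
print(count_mispredictions(OneBitPredictor(), 0x40, loop_outcomes))
```

For a loop-closing branch this model mispredicts twice per loop execution: once at the loop exit and once again when the loop is re-entered.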
2.2. Local prediction (6)
[Chart: 1-level (local) prediction, extended with 2-bit prediction]
  Fixed prediction (always the same prediction): 'always not taken' approach, 'always taken' approach
  Static prediction (based on the object code): opcode-based, displacement-based
  Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction
  Processors shown: 80486 (1989), POWER1 (1990), MC 68040 (1990), SuperSparc (1992), R4000 (1992), PPC 601 (1993), POWER2 (1993), Pentium (1993), MC 68060 (1993), R8000 (1994), PPC 604 (1995), UltraSparc (1995), PPC 620 (1996), R10000 (1996)
PPC: PowerPC
2.2. Local prediction (7)
Figure 2.6: Principle of the 2-bit dynamic prediction
[Diagram: the IFA selects an entry xx of the BHT. xx = 00, 01: sequential continuation; xx = 10, 11: branch.]
BHT: Branch History Table
2.2. Local prediction (8)
Figure 2.7: State transition diagram of the most frequently used 2-bit dynamic prediction (Smith algorithm)
[Diagram: four states. 11 (strongly taken) and 10 (weakly taken) give the prediction "Taken"; 01 (weakly not taken) and 00 (strongly not taken) give the prediction "Not Taken". AT moves one state towards 11, ANT moves one state towards 00. One state is marked "Initialised when a branch is taken first".]
AT: actually taken
ANT: actually not taken
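The 2-bit scheme is commonly modelled as a saturating counter per BHT entry. The sketch below follows the state coding of Figures 2.6 and 2.7 (10/11 predict taken); the class name, the table size, the modulo indexing and the initial state 01 are assumptions made for illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counters: 00/01 predict not taken, 10/11 predict taken."""

    def __init__(self, entries=256, initial=0b01):
        # Initial state is an assumption; the slide only notes that an
        # entry is initialised when a branch is first taken.
        self.bht = [initial] * entries

    def _index(self, ifa):
        return ifa % len(self.bht)

    def predict(self, ifa):
        return self.bht[self._index(ifa)] >= 0b10

    def update(self, ifa, taken):
        i = self._index(ifa)
        if taken:
            self.bht[i] = min(self.bht[i] + 1, 0b11)  # AT: towards 11
        else:
            self.bht[i] = max(self.bht[i] - 1, 0b00)  # ANT: towards 00


def count_mispredictions(predictor, ifa, outcomes):
    """Predict, compare and update for each outcome of one branch."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(ifa) != taken:
            misses += 1
        predictor.update(ifa, taken)
    return misses


# Same loop-closing branch as before: taken three times, then not taken.
loop_outcomes = ([True] * 3 + [False]) * 3
print(count_mispredictions(TwoBitPredictor(), 0x40, loop_outcomes))
```

Because a single not-taken outcome only moves the counter from 11 to 10, the loop exit costs one misprediction but re-entering the loop does not, which is the usual argument for 2-bit over 1-bit prediction.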
2.2. Local prediction (5)
Figure 2.5: Alternatives for accessing Branch History Tables or Branch Target Address Buffers
Accessing BHTs/BTACs:
Indexed access
  Bits of the IFA serve as an index that selects a BHT entry (counters, C).
  For large tables most branches will map to a unique entry. For smaller tables multiple branches may map to the same entry, resulting in interference and thus in degraded prediction accuracy.
  Examples: 16K entry local BHT (Power4), 16K entry global BHT (Power4), 16K entry selector table (Power4)
Associative access
  The IFA is compared with the IFAs stored along with the entries (C).
  Avoids interference but strongly increases cost.
  Example: 64 entry BTAC (PPC 604)
Cache-like access (direct / set associative)
  Part of the IFA serves as an index, and the stored tags are compared (e.g. two-way set associative).
  Reduces interference but increases cost.
  Examples: 128*4 way BHT/BTAC (Pentium Pro), 1K*4 way BHT/BTAC (Pentium II, III, 4), 128*2 way BTAC (Power3)
2.2. Local prediction (9)
Figure 2.8: Early branch prediction mechanisms and their trends indicated by subsequent models of pipelined, 1. and 2. generation superscalars
[Chart: 1-level (local) prediction, extended with 3-bit prediction]
  Fixed prediction (always the same prediction): 'always not taken' approach, 'always taken' approach
  Static prediction (based on the object code): opcode-based, displacement-based
  Dynamic prediction (based on the execution history): 1-bit prediction, 2-bit prediction, 3-bit prediction
  Processors shown: 80486 (1989), POWER1 (1990), MC 68040 (1990), SuperSparc (1992), R4000 (1992), PPC 601 (1993), POWER2 (1993), Pentium (1993), MC 68060 (1993), R8000 (1994), PPC 604 (1995), UltraSparc (1995), PPC 620 (1996), R10000 (1996)
PPC: PowerPC
2.2. Local prediction (10)
Local prediction
  Fixed prediction: always the same prediction
  Static prediction: based on the object code
  Dynamic prediction: based on the execution history
    1-level
    2-level
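2.2. Local prediction (11)
2-level local branch prediction (1st level: branch patterns, 2nd level: history bits)
With a shared global history table for all patterns (shared counters), e.g. Alpha 21264:
  The IFA selects an entry of a local BHT (e.g. 1K×10 bit) that holds the pattern of the branch, such as 1 1 0 0 1 0 1 0 0 1. The pattern selects an entry of a second table, also labelled local BHT (e.g. 1K×3 bit), that is shared by all branches.
  The 21264 uses 3-bit saturating counters whose most significant bit provides the prediction.
With individual history tables for different patterns (individual counters), e.g. Pentium Pro:
  The IFA selects an entry of a local BHT (e.g. 128×4 bit, 4 ways each) that holds the pattern of the branch, such as 0 1 1 0. Each branch has its own second-level table (e.g. 16×2 bit), from which the pattern selects a 2-bit entry such as 1 0.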
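The shared-counter variant of 2-level local prediction can be sketched as follows. The table sizes (1K×10 bit histories, 1K×3 bit counters) and the use of the most significant counter bit follow the slide; the class name, the modulo indexing by IFA and the zero initialisation are assumptions, and the sketch is not meant to reproduce the actual 21264 hardware.

```python
class TwoLevelLocalPredictor:
    """2-level local prediction with one counter table shared by all patterns."""

    def __init__(self, history_entries=1024, history_bits=10, counter_bits=3):
        self.history_bits = history_bits
        self.histories = [0] * history_entries      # 1st level: per-branch patterns
        self.counters = [0] * (1 << history_bits)   # 2nd level: shared counters
        self.counter_max = (1 << counter_bits) - 1
        self.msb = 1 << (counter_bits - 1)

    def _slot(self, ifa):
        return ifa % len(self.histories)

    def predict(self, ifa):
        pattern = self.histories[self._slot(ifa)]
        # The most significant bit of the counter provides the prediction.
        return bool(self.counters[pattern] & self.msb)

    def update(self, ifa, taken):
        slot = self._slot(ifa)
        pattern = self.histories[slot]
        count = self.counters[pattern]
        if taken:
            self.counters[pattern] = min(count + 1, self.counter_max)
        else:
            self.counters[pattern] = max(count - 1, 0)
        # Shift the outcome into the branch's own history pattern.
        mask = (1 << self.history_bits) - 1
        self.histories[slot] = ((pattern << 1) | int(taken)) & mask


def run(predictor, ifa, outcomes):
    """Return a list of booleans: True where the prediction was correct."""
    hits = []
    for taken in outcomes:
        hits.append(predictor.predict(ifa) == taken)
        predictor.update(ifa, taken)
    return hits


# A loop-closing branch with a short, regular pattern (T T T N repeated).
hits = run(TwoLevelLocalPredictor(), 0x1234, [True, True, True, False] * 50)
print("mispredictions in the last 40 outcomes:", hits[-40:].count(False))
```

Once the history pattern has warmed up, each pattern is always followed by the same outcome, so even the loop exit is predicted correctly; a plain 2-bit counter cannot do this.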
2.2. Local prediction (12)
Figure 2.9: The principle of Pentium Pro’s 128x4 way set associative BHT
[Diagram: the linear address is split into tag and index; index bits 6 to 0 select one of the sets 0 to 127. Each set holds four ways (Way 3 to Way 0), each with a tag and a 4-bit history. The history of the matching way selects one of 16 two-bit counters (entries 0 to 15). Counter value xx: 00/01 not taken, 10/11 taken.]
2.2. Local prediction (13)
Figure 2.10: The actual layout of Pentium Pro’s 128x4 way set associative BHT
[Diagram: each of the sets 0 to 127 holds four tags, four histories (H) and four counter blocks (C).]
2.3. Global prediction (1)
Figure 2.11: Simple global prediction
[Diagram: the global history, a shift register holding the outcomes of the most recent branches (e.g. 1 1 1 1), indexes the BHT; the selected entry x gives the prediction.]
Figure 2.12: Principle of the Gshare prediction
[Diagram: the global history (branch history shift register) is XORed with bits of the IFA; the result indexes the BHT, and the selected entry x gives the prediction.]
Figure 2.13: Principle of the Gselect prediction
[Diagram: bits of the global history (branch history shift register) and bits of the IFA are concatenated; the result indexes the BHT, and the selected entry x gives the prediction.]
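The three global schemes differ only in how the BHT index is formed. The sketch below shows the index computations; the function names, the bit widths in the example and the placement of the IFA bits in the upper part of the Gselect index are assumptions, and a real design would typically drop the low-order address bits first.

```python
def shift_history(global_history, taken, history_bits):
    """Shift the newest branch outcome into the global history register."""
    mask = (1 << history_bits) - 1
    return ((global_history << 1) | int(taken)) & mask


def simple_global_index(global_history, index_bits):
    """Simple global prediction: the history alone indexes the BHT."""
    return global_history & ((1 << index_bits) - 1)


def gshare_index(ifa, global_history, index_bits):
    """Gshare: XOR of IFA bits and global history indexes the BHT."""
    return (ifa ^ global_history) & ((1 << index_bits) - 1)


def gselect_index(ifa, global_history, history_bits, address_bits):
    """Gselect: IFA bits concatenated with global history bits."""
    history = global_history & ((1 << history_bits) - 1)
    address = ifa & ((1 << address_bits) - 1)
    return (address << history_bits) | history


history = 0b0110
print(bin(gshare_index(0b1011, history, 4)))      # XOR of the two bit fields
print(bin(gselect_index(0b1011, history, 4, 4)))  # address bits, then history bits
print(bin(shift_history(history, True, 4)))       # oldest bit drops out
```

Gshare uses the full index width for both the address and the history, whereas Gselect has to split the index bits between them.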
2.4. Combined prediction (1)
Figure 2.14: Principle of the combined local and global prediction (as used in the Alpha 21264 or the POWER 4)
[Diagram: the IFA accesses a local BHT, which delivers the local prediction; the global history accesses a global BHT, which delivers the global prediction. A "best choice" BHT selects between the local and the global prediction to give the resulting prediction x. The actual outcome is fed back for updating.]
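The selection step of a combined predictor can be sketched with a table of 2-bit choice counters. This is a hedged illustration: the class name, the coding (values 2 and 3 prefer the global prediction), the initial value and the update rule (train only when the two predictions disagree) are common textbook choices assumed here, not details taken from the slide.

```python
class ChoiceTable:
    """Selects between a local and a global prediction with 2-bit counters."""

    def __init__(self, entries=4096):
        # Assumed coding: 0..1 prefer the local, 2..3 prefer the global prediction.
        self.counters = [1] * entries

    def select(self, index, local_pred, global_pred):
        use_global = self.counters[index % len(self.counters)] >= 2
        return global_pred if use_global else local_pred

    def update(self, index, local_pred, global_pred, taken):
        # Train only when the two component predictors disagree.
        if local_pred == global_pred:
            return
        i = index % len(self.counters)
        if global_pred == taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


choice = ChoiceTable()
index = 5  # e.g. derived from the global history

# Local says taken, global says not taken; the branch is actually not taken.
print(choice.select(index, True, False))   # initially follows the local prediction
choice.update(index, True, False, taken=False)
print(choice.select(index, True, False))   # now follows the global prediction
```

The component predictors themselves are updated with the actual outcome independently of the choice table.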
2.4. Combined prediction (2)
Figure 2.15: Implementation alternatives of the combined prediction
Combined prediction in the Alpha 21264:
  1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits)
  2. prediction: Simple 2-level global prediction (12-bit global history / 4K * 2 bits)
  Choice: Global history referenced choice table (12-bit global history / 4K * 2 bits)
2.4. Combined prediction (3)
Source: Microprocessor Report, 10/28/96
• Minimum branch penalty: 7 cycles
• Typical branch penalty: 11+ cycles (IQ delay)
• 48K bits of target addresses stored in I-cache
• 32-entry return address stack
• Predictor tables are reset on a context switch
2.4. Combined prediction (4)
Figure 2.16.: The combined predictor of the Alpha 21264
Figure 2.17: Implementation alternatives of the combined prediction
Combined prediction in the Alpha 21264:
  1. prediction: 2-level local dynamic prediction with a shared counter table for all patterns (1K * 10 bits / 1K * 3 bits)
  2. prediction: Simple 2-level global prediction (12-bit global history / 4K * 2 bits)
  Choice: Global history referenced choice table (12-bit global history / 4K * 2 bits)
Combined prediction in the POWER 4:
  1. prediction: 1-level local dynamic prediction (16K * 1-bit)
  2. prediction: 2-level Gshare global prediction (11-bit global history is hashed with the IFA, 16K * 1-bit counter table)
  Choice: Accessed in the same way as the global counter table (16K * 1-bit)
2.4. Combined prediction (5)
Figure 2.18.: The principle of the combined predictor of the POWER 4
2.4. Combined prediction (6)
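[Diagram of the POWER 4 predictor: bits of the IFA index the local history table (16K*1bit), which gives the local prediction. The 11-bit global history is XORed with bits of the IFA to index the global history table (16K*1bit), which gives the global prediction, and the selector table (16K*1bit), which selects the better of the two. The tables use 14-bit indices, hold 1 bit per group, and are updated with the branch outcome.]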
2.5. Overview of the basic branch prediction mechanisms
Figure 2.20: Trends of branch prediction schemes used in 2. and 3. generation superscalars
[Chart: rows are the processors Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, Alpha 21164, Alpha 21264, PA-8000, PA-8500/8700 and UltraSPARC-III. Columns are the basic prediction mechanisms: fixed prediction; static prediction; dynamic prediction, subdivided into local 1-level (1-bit, 2-bit, 3-bit), local 2-level (shared counters, individual counters), global 2-level (simple, Gshare, Gselect) and combined (choice prediction). Entries labelled in the chart: Alpha 21264 local (1K*10/1K*3) and global (12 bits/4K*2); POWER 4 local (16K*1) and global (11-bit/16K*1).]
1: 1. generation superscalars
3. Auxiliary branch prediction mechanisms
Figure 3.2: Static branch prediction algorithm of the Pentium Pro
Source: Shanley T., "Pentium Pro Processor System Architecture", Addison-Wesley Developers Press, 1996
Return Address Stack (RAS)
Architectural stack with preserved sequential consistency:
  PUSH return address on a CALL
  POP return address on a RET
RAS:
  PUSH return address on a CALL
  POP return address on a RET
  Used to continue execution speculatively from the popped return address.
The problem of RASs:
A procedure such as printf() might be called from many different locations, so there are many different return addresses. During speculative out-of-order execution, however, the logical sequence of the related CALL and RET instructions (and thus of the PUSH and POP operations) may be disturbed, so the predicted return address may be wrong.
To check the prediction, the RET instruction is executed, and on a misprediction a repair mechanism is activated (to cancel wrongly executed instructions and repair the corrupted RAS).
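The RAS behaviour described above, including the need for repair after wrong-path execution, can be sketched as a small bounded stack. The class and method names, the 12-entry default, the overwrite-oldest overflow policy and the snapshot-based repair are all illustrative assumptions; the slides do not specify how any particular processor repairs its RAS.

```python
class ReturnAddressStack:
    """Small hardware-style stack that predicts return addresses."""

    def __init__(self, entries=12):
        self.entries = entries
        self.stack = []

    def on_call(self, return_address):
        # PUSH the return address on a CALL; drop the oldest entry when full.
        if len(self.stack) == self.entries:
            self.stack.pop(0)
        self.stack.append(return_address)

    def on_ret(self):
        # POP the predicted return address on a RET (None if empty).
        return self.stack.pop() if self.stack else None

    def checkpoint(self):
        # Assumed repair scheme: snapshot taken before speculating.
        return list(self.stack)

    def repair(self, snapshot):
        # Restore the state saved before the mispredicted path.
        self.stack = list(snapshot)


ras = ReturnAddressStack()
ras.on_call(0x1004)            # CALL on the correct path

snapshot = ras.checkpoint()    # a branch is predicted here
ras.on_call(0x2008)            # CALL fetched on the (wrong) speculative path
ras.repair(snapshot)           # misprediction detected: undo the wrong-path PUSH

print(hex(ras.on_ret()))       # the RET now gets the correct return address
```

Without the repair step, the RET would pop the wrong-path address 0x2008, which is the corruption problem the slide refers to.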
Figure 3.1: Overview of auxiliary branch prediction mechanisms in 2. and 3. generation superscalars
[Chart: rows are the processors Pentium, Pentium Pro, P4 Will/Northw., P4 Prescott, K6, K7, K8, PPC 604, PPC 620, POWER 3, POWER 4, POWER 5, Alpha 21164, Alpha 21264, PA-8000, PA-8500/8700 and UltraSPARC-III. Columns are the auxiliary branch prediction mechanisms: backup use of static prediction; dedicated prediction (RAS, loop detector, indirect branch prediction); preemptive use of compiler hints. RAS sizes shown in the chart: 16, 12, 12, 8, 32 and 12 entries.]
1: 1. generation superscalars
2: Supported by compiler hints
RAS: Return Address Stack
4. Accessing the branch target path (1)
4.1. Overview
Figure 4.1: Alternatives to generate the BTA: calculated on the fly
Figure 4.2: Principle of calculating the BTA on the fly
[Diagram: the IFAR holds the instruction fetch address (IFA) used to access the I-cache (instructions I, I+1, I+2, I+3 at address A; BTI, BTI+1, BTI+2, BTI+3 at the BTA). The next IFA is either the sequential address (IIFA) or the BTA computed from the fetched branch.]
This scheme is employed in earlier scalar (pipelined) processors as well as in a number of superscalar processors, such as:
Z 80000 (1984), i486 (1989), MC 68040 (1990),
Sparc CY7C601 (1988), SuperSparc (1992p), Power PC 601 (1993), 603 (1993), Power1 (1990), Power2 (1993),
21064 (1992), 21064A (1994), 21164 (1995), R4000 (1992), R 10000 (1996)
Source: Sima, D. et al., ACA, Addison Wesley, 1997, pp. 303
POWER4 (2001), POWER5 (2005)
Ultra SPARC III (2003)
Figure 4.1: Alternatives to generate the BTA: calculated on the fly, accessed from the BTAC
Figure 4.3: Principle of the BTAC scheme to access the branch target path
[Diagram: the IFAR holds the instruction fetch address (IFA), which accesses the I-cache (instructions I, I+1, I+2, I+3 at address A; BTI, BTI+1, BTI+2, BTI+3 at the BTA) and, in parallel, the BTAC, whose entries pair BA-1 with a BTA. The next IFA is either the sequential address (IIFA) or the branch target address read from the BTAC.]
The Branch Target Address Cache (BTAC) contains branch target addresses (BTAs). These BTAs are read from the BTAC when the instruction immediately preceding a branch is fetched. (The addresses of these preceding instructions are designated as BA-1.)
Figure 4.4: The principle of branch prediction using both a BHT and a BTAC (C: counter)
[Diagram: the IFA from the IFAR accesses the I$, the BHT (counters, C) and the BTAC (tags, BTAs) in parallel; fetched instructions go to the IB for further processing. The next IFA is the IIFA if the BTAC misses and the BTA if the BTAC hits; on a misprediction the BTA is supplied.]
Update BHT with branch result.
Update BTAC with BTA if BHT initiates it (create/delete BTAC entry).
(The BTAC is designated as BTB (Branch Target Buffer) by Intel.)
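One way to read the interplay of BHT and BTAC is sketched below: the BTAC supplies the next fetch address on a hit, and the BHT counter decides, after the branch is resolved, whether a BTAC entry is created or deleted. This is an assumption-laden illustration, not the design of any listed processor: the class name, the dictionary-based tables, the 2-bit counters with initial value 1, the 4-byte sequential increment and keying the BTAC by the branch address itself are all simplifications chosen here.

```python
class BranchUnit:
    """Next-fetch-address selection with a BHT and a BTAC (simplified)."""

    def __init__(self, fetch_bytes=4):
        self.fetch_bytes = fetch_bytes
        self.bht = {}    # IFA -> 2-bit counter (C)
        self.btac = {}   # IFA -> BTA, kept only for branches predicted taken

    def next_ifa(self, ifa):
        # BTAC hit: continue at the BTA; BTAC miss: continue sequentially.
        return self.btac.get(ifa, ifa + self.fetch_bytes)

    def resolve(self, ifa, taken, bta):
        # Update the BHT with the branch result.
        counter = self.bht.get(ifa, 1)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.bht[ifa] = counter
        # The BHT initiates creating or deleting the BTAC entry.
        if counter >= 2:
            self.btac[ifa] = bta
        else:
            self.btac.pop(ifa, None)


unit = BranchUnit()
print(hex(unit.next_ifa(0x100)))          # BTAC miss: sequential address
unit.resolve(0x100, taken=True, bta=0x200)
print(hex(unit.next_ifa(0x100)))          # BTAC hit: branch target address
unit.resolve(0x100, taken=False, bta=0x200)
print(hex(unit.next_ifa(0x100)))          # entry deleted: sequential again
```

In this reading the BTAC only has to hold branches that are currently predicted taken, which keeps a small table useful.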
Figure 4.5: Examples of processors using the BTAC scheme
Processor                          Number of BTAC entries   Implementation of the BTAC
ES/9000 520-based procs (1992p)    4K                       2-way associative
Pentium (1994)                     256                      Fully associative
Pentium Pro                        512                      4-way associative
Pentium 4                          4K                       4-way associative
MC 68060 (1993)                    256                      4-way associative
R 8000 (1994) (1)                  1K
PA 8000 (1995)                     32                       Fully associative
Power PC 604 (1994)                64                       Fully associative
Power PC 620 (1995)                256                      Fully associative
(1): Each entry is shared among 4 instructions
Figure 4.6: The physical implementation of branch prediction in Intel’s P4 Northwood and Prescott cores
Source: de Vries H., "Looking at Intel’s Prescott die, part II.", http://www.chip-architect.com, April 2003
Figure 4.1: Alternatives to generate the BTA: calculated on the fly, accessed from the BTAC, from the I$
Figure 4.7: Principle of the BTIC scheme to access the branch target path
[Diagram: the IFAR holds the instruction fetch address (IFA), which accesses the I-cache (instruction I at address A) and the BTIC in parallel. BTIC entries hold BA, BTI and BTA+. The selected instruction goes to decoding.]
The BTIC contains the addresses of the most recently taken branches (BA), the corresponding branch target instructions (BTI) and the addresses of the instructions following the BTIs (BTA+). When there is an entry in the BTIC for the actual IFA, the corresponding BTI is fetched from the BTIC and selected for decoding instead of the instruction from the I-cache. The address of the subsequent instruction along the taken path is also read from the BTIC and becomes the next IFA.
Examples: Gmicro/200 (1988), AM 29000 (1988), MC 88110 (1993).
Figure 4.8: Trends to generate the BTA
[Chart: the alternatives are calculated on the fly, accessed from BTAC, and from the I$. Examples shown: Ultra SPARC III, K6, PPro/PII/PIII/P4, K7/K8, Power 3, Power 4, 5, 21264.]
4.2. Case example 1: K7 (1)
To each 16-byte fetch block a 16-bit selector block is allocated as follows:
[Diagram: fetch block (16 bytes, bytes 15 to 0) and selector block (16 bits, bits 15 to 0); each pair of selector bits is associated with one pair of bytes of the fetch block. The selected BTA steers instruction execution.]
The selector block identifies the branches included in the associated fetch block. Two bits of the selector block correspond to two bytes of the fetch block.
RETs are a single byte long; all other branches are at least two bytes long. Assuming at most a single RET in the fetch block, there may be at most one branch ending in any pair of bytes.
A fetch block may contain up to a single RET and two non-RET branches. More branches in a fetch block lead to conflicts in the prediction logic.
4.2. Case example 1: K7 (2)
Each two-bit entry indicates whether or not there is a branch ending in the corresponding two bytes of the fetch block; if so, it also identifies the type of the branch. A branch instruction that crosses the 16-byte boundary is counted in the second 16-byte window.
Coding of the two bits (assumed):
00: no branch
01: RET
10: There is a conditional branch whose BTA is in the BTA0 field of the BTAC
11: There is a conditional branch whose BTA is in the BTA1 field of the BTAC
4.2. Case example 1: K7 (3)
Characteristic examples of selector settings and the resulting next fetch address:
xx 00 00 00 00 00 00 00   No branch: IFA+16
xx 00 01 00 00 00 00 00   A RET instruction: return address of the RET
xx 00 00 00 10 00 00 00   A cond. branch (its BTA is in the BTAC 0 field): BTA0 if taken, else IFA+16
xx 00 00 10 00 11 00 00   Two cond. branches (their BTAs are in the BTAC 0 and BTAC 1 fields): BTA0 if BC1 is taken; otherwise BTA1 if BC2 is taken; otherwise IFA+16
During predecoding, instruction boundaries as well as branch instructions are detected, and the appropriate selector entries are marked accordingly.
Predecoding proceeds at no more than 4 bytes/cycle.
If a cache line (64 bytes = 4 fetch blocks) is replaced, all associated selector blocks are invalidated.
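The selector coding and the four example settings above can be turned into a small decoder. The sketch below is hypothetical throughout: the function names, the textual selector format, ignoring the leading xx entry, and the fixed decision order (RET first, then the branch with its target in BTA0, then the one in BTA1) are assumptions that cover only the four cases shown on the slide, and the two-bit coding itself is already marked as assumed in the source.

```python
# Assumed coding of the two-bit selector entries.
CODES = {
    "00": "none",
    "01": "ret",
    "10": "cond_bta0",
    "11": "cond_bta1",
}


def decode_selector(selector):
    """Turn e.g. 'xx 00 01 00 00 00 00 00' into a list of entry kinds."""
    return [CODES[field] for field in selector.split() if field != "xx"]


def next_fetch_address(selector, ifa, bta0, bta1, ret_address,
                       taken0=False, taken1=False):
    """Pick the next fetch address for one 16-byte fetch block."""
    kinds = decode_selector(selector)
    if "ret" in kinds:
        return ret_address
    if "cond_bta0" in kinds and taken0:   # BC1 predicted taken
        return bta0
    if "cond_bta1" in kinds and taken1:   # BC2 predicted taken
        return bta1
    return ifa + 16                       # no (taken) branch: next fetch block


ifa, bta0, bta1, ret = 0x1000, 0x2000, 0x3000, 0x4000
print(hex(next_fetch_address("xx 00 00 00 00 00 00 00", ifa, bta0, bta1, ret)))
print(hex(next_fetch_address("xx 00 01 00 00 00 00 00", ifa, bta0, bta1, ret)))
print(hex(next_fetch_address("xx 00 00 00 10 00 00 00", ifa, bta0, bta1, ret,
                             taken0=True)))
print(hex(next_fetch_address("xx 00 00 10 00 11 00 00", ifa, bta0, bta1, ret,
                             taken1=True)))
```

The taken/not-taken inputs stand in for the global prediction, which the K7 figures deliberately leave out.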
4.2. Case example 1: K7 (4)
The selector table is shared between the upper and lower parts of the I$, and an extra address bit (A) identifies whether the entry belongs to the upper or the lower part of the I$.
4.2. Case example 1: K7 (5)
Figure 4.9: Assumed simplified scheme of accessing the branch target path in the K7, without showing the global prediction (A: address bit, C: conditional branch, W: way)
Source: Kaiser, A., "K7 Branch Prediction", Dec. 1999, http://www.s.netic.de
[Diagram. Components: the IFAR; a 2-way set associative I$ (Way 1 and Way 0, each with 1K*16B fetch blocks, tags and predecode bits, 16B+P); the BTAC (1K x 2 addresses, BTA 1 and BTA 0, filled by the fetch unit during predecoding and at execution); the selector table (shared for the upper and lower parts of the I$), which delivers the 16-bit selector block belonging to the 16 B fetch block; and a 12-entry RAT supplying the RET address. Address fields shown: IFA [14:4], IFA [13:4], IFA [3:1], IFA [3:0], upper bits [31:15], address bit A (IFA14). According to the selector, the next IFA is the sequential address (+16, no branch), the RET address, or BTA 0 / BTA 1: an unconditional branch is taken, a conditional branch is taken or not according to the global prediction. Instructions are decoded and issued beginning with the given address.]
4.2. Case example 2: K8 (1)
The K8 doubled the size of the selector table, so each fetch block has its own selector entry.
The K8 allows any mix of up to 3 branches (CALL, JMP, RET, conditional) per fetch block; the coding of the selector entries is modified accordingly.
When instruction cache lines are evicted to the L2 cache, branch selectors and predecode information are also stored in the L2 cache.
The K8 uses 48-bit addresses, but the BTAC keeps only the 15 least significant bits to identify the next address.
Each BTA entry identifies the least significant 15 bits of the IFA as well as additional information, such as
  3-bit old IFA (bits 16, 15)
  W bit: way identifier
4.2. Case example 2: K8 (2)
Figure 4.10: Assumed simplified scheme of accessing the branch target path in the K8, without showing the global prediction (C: conditional branch, R: return, W: way 0/1, SA: start address)
[Diagram. Components: the IFAR; a 2-way set associative I$ (Way 1 and Way 0, each with 1K*16B fetch blocks, tags and predecode bits, 16B+P); the BTAC (512 x 4 addresses); the selector table, filled during predecoding, which delivers the 16-bit selector block belonging to the 16 B fetch block; a BTA calculator; and a 12-entry RAT supplying the RET address. Address fields shown: IFA [14:4], IFA [12:4], IFA [3:1], SA [3:0], upper bits [31:15]. BTA entry fields shown: R, W, old IFA (bits 16, 15), bits 14 to 0. According to the selector, the next IFA is the sequential address (+16, no branch) or one of BTA0/RET, BTA1/RET, BTA2/RET: an unconditional branch is taken, a conditional branch is taken or not according to the global prediction. Instructions are decoded and issued beginning with the given address.]
4.2. Case example 2: K8 (3)
Figure 4.11: Logical view of Opteron’s (K8’s) instruction fetch and decode stages
Source: de Vries H., "Understanding the detailed Architecture of AMD’s 64 bit Core", http://www.chip-archtect.com, Sept., 2003