The Heterogeneous Block Architecture
A Flexible Substrate for Building Energy-Efficient High-Performance Cores
Chris Fallin (1), Chris Wilkerson (2), Onur Mutlu (1)
(1) Carnegie Mellon University, (2) Intel Corporation
Executive Summary
• Problem: General-purpose core design is a compromise between performance and energy efficiency
• Our Goal: Design a core that achieves high performance and high energy efficiency at the same time
• Two New Observations: 1) Applications exhibit fine-grained heterogeneity in regions of code, 2) A core can exploit this heterogeneity by grouping tens of instructions into atomic blocks and executing each block in the best-fit backend
• Heterogeneous Block Architecture (HBA): 1) Forms atomic blocks of code, 2) Dynamically determines the best-fit (most efficient) backend for each block, 3) Specializes the block for that backend
• Initial HBA Implementation: Chooses from out-of-order, VLIW, or in-order backends, using simple schedule-stability and stall heuristics
• Results: HBA provides higher efficiency than four previous designs at negligible performance loss; HBA enables new tradeoff points
Picture of HBA
[Figure: a shared frontend feeds a "specialization" stage, which dispatches blocks to Backend 1, Backend 2, and Backend 3. A second panel shows the same structure with out-of-order and VLIW/IO backends.]
Talk Agenda
• Background and Motivation
• Two New Observations
• The Heterogeneous Block Architecture (HBA)
• An Initial HBA Design
• Experimental Evaluation
• Conclusions
Competing Goals in Core Design
• High performance and high energy efficiency
  - Difficult to achieve both at the same time
• High performance: Sophisticated features to extract it
  - Out-of-order execution (OoO), complex branch prediction, wide instruction issue, …
• High energy efficiency: Use of only the features that provide the highest efficiency for each workload
  - Adaptation of resources to different workload requirements
• Today’s high-performance designs: Features may not yield high performance, but every piece of code pays for their energy penalty
Principle: Exploiting Heterogeneity
• Past observation 1: Workloads have different characteristics at coarse granularity (thousands to millions of instructions)
  - Each workload or phase may require different features
• Past observation 2: This heterogeneity can be exploited by switching “modes” of operation
  - Separate out-of-order and in-order cores
  - Separate out-of-order and in-order pipelines with shared frontend
• Past coarse-grained heterogeneous designs
  - provide higher energy efficiency
  - at slightly lower performance vs. a homogeneous OoO core
Two Key Observations
• Fine-Grained Heterogeneity of Application Behavior
  - Small chunks (10s or 100s of instructions) of code have different execution characteristics
    - Some instruction chunks always have the same instruction schedule in an out-of-order processor
    - Nearby chunks can have different schedules
• Atomic Block Execution with Multiple Execution Backends
  - A core can exploit fine-grained heterogeneity by
    - Having multiple execution backends
    - Dividing the program into atomic blocks
    - Executing each block in the best-fit backend
Fine-Grained Heterogeneity
• Small chunks of code have different execution characteristics
[Figure: execution timeline; time is measured in instructions.
 Coarse-grained heterogeneity (1K-1M insns): phase 1: regular floating-point; phase 2: pointer-chasing; phase 3: ...
 Fine-grained heterogeneity (10-100 insns), three example chunks:
   fadd r1,r2,r3; fadd r4,r5,r6; add r13,r14,8; fmul r0,r1,r4; ... (high ILP, stable instruction schedule: good fit for wide VLIW)
   ld r1,(r13) [misses LLC, stall]; add r13,r14,8; fadd r2,r1,r7; fadd r4,r5,r6; ... (high ILP, varying instruction schedule: good fit for wide OoO)
   fadd r4,r1,r2; fadd r4,r4,r3; fadd r4,r4,r7; ... (low ILP, stable instruction schedule: good fit for narrow in-order)]
Instruction Schedule Stability Varies at Fine Grain
• Observation 1: Some regions of code are scheduled in the same instruction scheduling order in an OoO engine across different dynamic instances
• Observation 2: Stability of instruction schedules of (nearby) code regions can be very different
• An experiment:
  - Same-schedule chunk: A chunk of code that has the same schedule it had in its previous dynamic instance
  - Examined all chunks of <=16 instructions in 248 workloads
  - Computed the fraction of “same-schedule chunks” in each workload
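The experiment above can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the trace format (a chunk identifier plus the issue order observed in the OoO engine) and the name `same_schedule_fraction` are assumptions, as is the choice to count a chunk's first instance as "different" because it has no previous instance to match.

```python
from typing import Dict, Hashable, Iterable, Tuple

# One dynamic chunk instance: (chunk_id, schedule). The schedule is the order
# in which the chunk's instructions issued (hypothetical trace format).
ChunkInstance = Tuple[Hashable, Tuple[int, ...]]


def same_schedule_fraction(trace: Iterable[ChunkInstance]) -> float:
    """Fraction of dynamic chunks whose schedule equals their previous instance's."""
    last_schedule: Dict[Hashable, Tuple[int, ...]] = {}
    same = 0
    total = 0
    for chunk_id, schedule in trace:
        total += 1
        # A first instance has no predecessor, so it never counts as "same".
        if last_schedule.get(chunk_id) == schedule:
            same += 1
        last_schedule[chunk_id] = schedule
    return same / total if total else 0.0


trace = [
    ("A", (0, 1, 2, 3)),  # first instance of A
    ("B", (0, 2, 1, 3)),  # first instance of B
    ("A", (0, 1, 2, 3)),  # same schedule as the previous A
    ("B", (2, 0, 1, 3)),  # B's schedule changed
    ("A", (0, 1, 2, 3)),  # same schedule again
]
fraction = same_schedule_fraction(trace)
```

In this toy trace, chunk A is stable while chunk B is not, which is the kind of nearby-but-different behavior Observation 2 describes.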
Instruction Schedule Stability Varies at Fine Grain
[Figure: fraction of dynamic code chunks (y-axis, 0 to 1) that are same-schedule vs. different-schedule chunks, per benchmark (x-axis, 0 to 250, sorted by y-axis value).]
1. Many chunks have the same schedule in their previous instances → Can reuse the previous instruction scheduling order the next time
2. Stability of instruction schedule of chunks can be very different → Need a core that can dynamically schedule insts. in some chunks
3. Temporally-adjacent chunks can have different schedule stability → Need a core that can switch quickly between multiple schedulers
Two Key Observations
• Fine-Grained Heterogeneity of Application Behavior
  - Small chunks (10s or 100s of instructions) of code have different execution characteristics
    - Some instruction chunks always have the same schedule
    - Nearby chunks can have different schedules
• Atomic Block Execution with Multiple Execution Backends
  - A core can exploit fine-grained heterogeneity by
    - Having multiple execution backends
    - Dividing the program into atomic blocks
    - Executing each block in the best-fit backend
Atomic Block Execution w/ Multiple Backends
• Fine-grained heterogeneity can be exploited by a core that:
  - Has multiple (specialized) execution backends
  - Divides the program into atomic (all-or-none) blocks
  - Executes each atomic block in the best-fit backend
Atomicity enables specialization of a block for a backend (can freely reorder/rewrite/eliminate instructions in an atomic block)
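A software analogy may help make "all-or-none" concrete. The sketch below is an illustration only, not the hardware mechanism: registers are modeled as a Python dict, instructions as functions that mutate it, and a fault as a Python `ArithmeticError`; the name `execute_atomically` is hypothetical.

```python
from typing import Callable, Dict, List

Regs = Dict[str, int]
Op = Callable[[Regs], None]


def execute_atomically(global_regs: Regs, block: List[Op]) -> bool:
    """Run every op in the block, or leave global_regs untouched."""
    local = dict(global_regs)  # speculative working copy of the state
    try:
        for op in block:
            op(local)
    except ArithmeticError:
        return False  # squash: nothing from this block becomes visible
    global_regs.update(local)  # commit all results at once
    return True


def add_one(regs: Regs) -> None:
    regs["r1"] = regs["r1"] + 1


def divide(regs: Regs) -> None:
    regs["r2"] = regs["r1"] // regs["r3"]  # raises ZeroDivisionError if r3 == 0


regs = {"r1": 1, "r2": 0, "r3": 0}
ok_faulting = execute_atomically(regs, [add_one, divide])
regs_after_fault = dict(regs)  # unchanged: add_one's result was discarded too

regs["r3"] = 2
ok_clean = execute_atomically(regs, [add_one, divide])
```

Because intermediate results are invisible until commit, the order of operations inside the block can be changed freely as long as the final committed state is the same, which is the property the slide relies on for specialization.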
HBA: Principles
• Multiple different execution backends
  - Customized for different block execution requirements: OoO, VLIW, in-order, SIMD, …
  - Can be simultaneously active, executing different blocks
  - Enables efficient exploitation of fine-grained heterogeneity
• Block atomicity
  - Either the entire block executes or none of it
  - Enables specialization (reordering and rewriting) of instructions freely within the block to fit a particular backend
• Exploit stable block characteristics (instruction schedules)
  - E.g., reuse schedule learned by the OoO backend in VLIW/IO
  - Enables high efficiency by adapting to stable behavior
HBA Operation at High Level
[Figure: the application's instruction sequence is divided into Block 1, Block 2, and Block 3; a shared frontend and a "specialization" stage send the blocks to Backend 1, Backend 2, and Backend 3.]
HBA Design: Three Components (I)
• Block formation and fetch
• Block sequencing and value communication
• Block execution
HBA Design: Three Components (II)
[Figure: block diagram divided into a Block Frontend and a Block Core. Components shown: branch predictor, block cache, block formation, block liveout rename, global RF, subscription matrix, sequencer, multiple block execution units, block ROB & atomic retire, shared LSQ, L1D cache.]
1. Frontend forms blocks dynamically
2. Blocks are dispatched and communicate via global registers
3. Heterogeneous block execution units execute different types of blocks in a specialized way
An Example HBA Design: OoO/VLIW
• Block formation and fetch
• Block sequencing and value communication
• Block execution
• Two types of backends
  - OoO scheduling
  - VLIW/in-order scheduling
[Figure: a shared frontend and "specialization" stage feeding out-of-order and VLIW/IO backends.]
Block Formation and Fetch
• Atomic blocks are microarchitectural and dynamically formed
• Block info cache (BIC) stores metadata about a formed block
  - Indexed with PC and branch path of a block (à la trace caches)
  - Metadata info: backend type, instruction schedule, …
• Frontend fetches instructions/metadata from I-cache/BIC
• If miss in BIC, instructions are sent to OoO backend
• Otherwise, buffer instructions until the entire block is fetched
• The first time a block is executed on the OoO backend, its OoO version is formed (& OoO schedule/names stored in BIC)
  - Max 16 instructions, ends in a hard-to-predict or indirect branch
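The BIC lookup rule (index by PC and branch path; on a miss, fall back to the OoO backend) can be modeled as a small keyed cache. The sketch below is an assumption-laden illustration: the class and field names, the capacity, and the FIFO replacement policy are all hypothetical and are not stated in the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple

# Key: (block start PC, taken/not-taken outcomes of branches inside the block).
BlockKey = Tuple[int, Tuple[bool, ...]]


@dataclass
class BlockInfo:
    backend: str               # "OoO" or "VLIW"
    schedule: Tuple[int, ...]  # recorded instruction issue order


class BlockInfoCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: Dict[BlockKey, BlockInfo] = {}

    def lookup(self, pc: int, path: Sequence[bool]) -> Optional[BlockInfo]:
        return self.entries.get((pc, tuple(path)))

    def insert(self, pc: int, path: Sequence[bool], info: BlockInfo) -> None:
        key = (pc, tuple(path))
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Assumed FIFO replacement: dicts preserve insertion order,
            # so the first key is the oldest entry.
            self.entries.pop(next(iter(self.entries)))
        self.entries[key] = info


def choose_backend(bic: BlockInfoCache, pc: int, path: Sequence[bool]) -> str:
    """BIC miss: send to the OoO backend. BIC hit: use the recorded backend."""
    info = bic.lookup(pc, path)
    return "OoO" if info is None else info.backend


bic = BlockInfoCache(capacity=2)
first = choose_backend(bic, 0x400, [True, False])       # miss
bic.insert(0x400, [True, False], BlockInfo("VLIW", (0, 2, 1)))
second = choose_backend(bic, 0x400, [True, False])      # hit
other_path = choose_backend(bic, 0x400, [True, True])   # same PC, other path
```

Including the branch path in the key means the same start PC can map to several distinct blocks, one per path through the code, as in a trace cache.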
Block Sequencing and Communication
• Block Sequencing
  - Block-based reorder buffer for in-order block sequencing
  - All subsequent blocks squashed on a branch misprediction
  - Current block squashed on an exception or intra-block misprediction
  - Single-instruction sequencing from non-optimized code upon an exception in a block to reach the exception point
• Value Communication
  - Across blocks: via the Global Register File
  - Within block: via the Local Register File
  - Only liveins and liveouts to a block need to be renamed and allocated global registers [Sprangle & Patt, MICRO 1994]
• Reduces energy consumption of register file and bypass units
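To see why renaming only liveins and liveouts saves work, consider computing them for a block. The sketch below is a simplified illustration: the `Inst` tuple format is hypothetical, and it assumes every register written in the block is a liveout (the block cannot know which values later code will ignore). The example block reuses the dependent `fadd` chain shown on the earlier heterogeneity slide.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Inst(NamedTuple):
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...]


def liveins_and_liveouts(block: List[Inst]) -> Tuple[Set[str], Set[str]]:
    """Liveins: registers read before being written inside the block.
    Liveouts (assumed): every register written inside the block."""
    written: Set[str] = set()
    liveins: Set[str] = set()
    for inst in block:
        for src in inst.srcs:
            if src not in written:
                liveins.add(src)  # value must come from the global RF
        if inst.dst is not None:
            written.add(inst.dst)
    return liveins, written


block = [
    Inst("fadd", "r4", ("r1", "r2")),
    Inst("fadd", "r4", ("r4", "r3")),
    Inst("fadd", "r4", ("r4", "r7")),
]
liveins, liveouts = liveins_and_liveouts(block)
```

Here three instructions write r4, but only the final value leaves the block, so one global register is allocated instead of three; the two intermediate values stay in the local register file.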
Block Execution
• Each backend contains
  - Execution units, scheduler, local register file
• Each backend receives
  - A block specialized for it (at block dispatch time)
  - Liveins for the executing block (as they become available)
• A backend executes the block
  - As specified by its logic and any information from the BIC
• Each backend produces
  - Liveouts to be written to the Global RegFile (as available)
  - Branch misprediction, exception results, block completion signal to be sent to the Block Sequencer
• Memory operations are handled via a traditional LSQ
OoO and VLIW Backends
• Two customized backend types (physically, they are unified)
Out-of-order:
• Dynamic scheduling with matrix scheduler
• Renaming performed before the backend (block specialization logic)
• No need to maintain per-instruction order
VLIW/In-order:
• Static scheduling
• Stall-on-use policy
• VLIW blocks specialized by OoO backends
• No renaming or matrix computation logic (done for previous instance of block)
• No per-instruction order
OoO Backend
[Figure: OoO backend block diagram. Generic interface: block dispatch, liveins (reg, val) on the livein bus, liveouts (reg, val), resolved branches, block completion. Dynamically-scheduled implementation: dispatch queue, scheduler (no CAM) with per-entry "srcs ready?" state and consumer wakeup vectors, selection logic, local RF, execution units (to shared EUs), writeback bus, liveout and branch checks, counter.]
VLIW/In-Order Backend
[Figure: VLIW/in-order backend block diagram. Generic interface: block dispatch, liveins (reg, val) on the livein bus, liveouts (reg, val), resolved branches, block completion. Statically-scheduled implementation: dispatch queue, issue logic, ready-bit scoreboard, local RF, execution units (to shared EUs), writeback bus, liveout and branch checks, counter.]
Backend Switching and Specialization
• OoO to VLIW: If dynamic schedule is the same for N previous executions of the block on the OoO backend:
  - Record the schedule in the Block Info Cache
  - Mark the block to be executed on the VLIW backend
• VLIW to OoO: If there are too many false stalls in the block on the VLIW backend
  - Mark the block to be executed on the OoO backend
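The two switching rules amount to a small per-block state machine. The sketch below is one possible reading, not the evaluated implementation: the slides do not give N or the false-stall threshold, so `n` and `max_false_stalls` are hypothetical parameters, and "same for N previous executions" is interpreted here as N consecutive repeats of the preceding schedule.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Schedule = Tuple[int, ...]


@dataclass
class BlockState:
    backend: str = "OoO"
    last_schedule: Optional[Schedule] = None
    stable_count: int = 0                       # consecutive repeats seen
    recorded_schedule: Optional[Schedule] = None


def after_ooo_execution(state: BlockState, schedule: Schedule, n: int) -> None:
    """OoO to VLIW: switch once the schedule has repeated n times in a row."""
    if schedule == state.last_schedule:
        state.stable_count += 1
    else:
        state.stable_count = 0
    state.last_schedule = schedule
    if state.stable_count >= n:
        state.recorded_schedule = schedule  # would be stored in the BIC
        state.backend = "VLIW"


def after_vliw_execution(state: BlockState, false_stalls: int,
                         max_false_stalls: int) -> None:
    """VLIW to OoO: switch back when the static schedule stalls too often."""
    if false_stalls > max_false_stalls:
        state.backend = "OoO"
        state.stable_count = 0
        state.last_schedule = None  # relearn the schedule from scratch


state = BlockState()
for _ in range(3):
    after_ooo_execution(state, (0, 2, 1, 3), n=2)
backend_after_stable_runs = state.backend

after_vliw_execution(state, false_stalls=5, max_false_stalls=3)
backend_after_stalls = state.backend
```

Resetting the stability counter on the way back to OoO is a design choice in this sketch; it forces the block to prove its schedule is stable again before returning to the VLIW backend.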
Experimental Methodology
• Models
  - Cycle-accurate execution-driven x86-64 simulator
    - ISA remains unchanged
  - Energy model with modified McPAT [Li+ MICRO’09]
    - various leakage parameters investigated
• Workloads
  - 184 different representative execution checkpoints
  - SPEC CPU2006, Olden, MediaBench, Firefox, Ffmpeg, Adobe Flash player, MySQL, lighttpd web server, LaTeX, Octave, an x86 simulator
Simulation Setup
[Slide content not recoverable from the extracted text.]
Four Core Comparison Points
• Out-of-order core with large, monolithic backend
• Clustered out-of-order core with split backend [Farkas+ MICRO’97]
  - Clusters of <scheduler, register file, execution units>
• Coarse-grained heterogeneous design [Lukefahr+ MICRO’12]
  - Combines out-of-order and in-order cores
  - Switches mode for coarse-grained intervals based on a performance model (ideal in our results)
  - Only one mode can be active: in-order or out-of-order
• Coarse-grained heterogeneous design, with clustered OoO core
Key Results: Power, EPI, Performance
• HBA greatly reduces power and energy-per-instruction (>30%)
• HBA performance is within 1% of monolithic out-of-order core
• HBA’s performance higher than coarse-grained hetero. core
• HBA is the most power- and energy-efficient design among the five different core configurations
(Averaged across 184 workloads)
Energy Analysis (I)
• HBA reduces energy/power by concurrently
  - Decoupling Execution Backends to Simplify the Core (Clustering)
  - Exploiting Block Atomicity to Simplify the Core (Atomicity)
  - Exploiting Heterogeneous Backends to Specialize (Heterogeneity)
• What is the relative energy benefit of each when applied successively?
Energy Analysis (II)
[Figure: stacked bars of energy per instruction (nJ; y-axis 0 to 2) for OoO, Clustered, Coarse, Coarse Clustered, HBA/OoO, and HBA. Components: Frontend, RAT, ROB, RS (Scheduler), RF, Exec (ALUs), Bypass Buses, LSQ, L1, L2. Annotations: +Clustering: 8%, +Atomicity: 17%, +Heterogeneity: 6%. Averaged across 184 workloads.]
HBA effectively takes advantage of clustering, atomic blocks, and heterogeneous backends to reduce energy
Comparison to Best Previous Design
• Coarse-grained heterogeneity + clustering the OoO core
[Figure: stacked bars of energy per instruction (nJ; y-axis 0 to 2) for OoO, Clustered, Coarse, Coarse Clustered, HBA/OoO, and HBA. Components: Frontend, RAT, ROB, RS (Scheduler), RF, Exec (ALUs), Bypass Buses, LSQ, L1, L2. Annotations: +Coarse-grained hetero: 9%, +Clustering: 9%, HBA: 15% lower EPI. Averaged across 184 workloads.]
HBA provides higher efficiency (+15%) at higher performance (+2%) than coarse-grained heterogeneous core with clustering
Per-Workload Results
• HBA reduces energy on almost all workloads
• HBA can reduce or improve performance (analysis in paper)
[Figure: S-curve of all 184 workloads. IPC and EPI relative to baseline (y-axis, 0.5 to 2) per benchmark (x-axis, 0 to 150+, sorted by relative HBA performance).]
Power-Performance Tradeoff Space
• HBA enables new design points in the tradeoff space
[Figure: scatter plot of relative average core power (vs. 4-wide OoO; y-axis 0.2 to 1.4) against relative IPC (vs. 4-wide OoO; x-axis 0.4 to 1.3), averaged across 184 workloads. Design families: In-Order, Out-of-Order, Coarse-grained Hetero, HBA. Points: 1-wide, 2-wide, and 4-wide in-order; 128-ROB and 256-ROB 4-wide OoO; 256-ROB and 1024-ROB 4-wide clustered OoO; Coarse-grained (128-ROB) and Coarse-grained (256-ROB); HBA(4), HBA(8), HBA(16), HBA(64), HBA(16,OoO), HBA(16,2-wide). Annotations: "More Efficient", "GOAL".]
Cost of Heterogeneity
• Higher core area
  - Due to more backends
  - Can be optimized with good design (unified backends)
  - Core area cost increasingly small in an SoC
    - only 17% in Apple’s A7
  - Analyzed in the paper but our design is not optimized for area
• Increased leakage power
  - HBA benefits reduce with higher transistor leakage
  - Analyzed in the paper
More Results and Analyses in the Paper
• Performance bottlenecks
• Fraction of OoO vs. VLIW blocks
• Energy optimizations in VLIW mode
• Energy optimizations in general
• Area and leakage
• Other and detailed results available in our technical report
  Chris Fallin, Chris Wilkerson, and Onur Mutlu, "The Heterogeneous Block Architecture," SAFARI Technical Report, TR-SAFARI-2014-001, Carnegie Mellon University, March 2014.
Conclusions
• HBA is the first heterogeneous core substrate that enables
  - concurrent execution of fine-grained code blocks
  - on the most-efficient backend for each block
• Three characteristics of HBA
  - Atomic block execution to exploit fine-grained heterogeneity
  - Exploits stable instruction schedules to adapt code to backends
  - Uses clustering, atomicity and heterogeneity to reduce energy
• HBA greatly improves energy efficiency over four core designs
• A flexible execution substrate for exploiting fine-grained heterogeneity in core design
Building on This Work …
• HBA is a substrate for efficient core design
• One can design different cores using this substrate:
  - More backends of different types (CPU, SIMD, reconfigurable logic, …)
  - Better switching heuristics between backends
  - More optimization of each backend
  - Dynamic code optimizations specific to each backend
  - …
Why Atomic Blocks?
• Each block can be pre-specialized to its backend (pre-rename for OoO, pre-schedule VLIW in hardware, reorder instructions, eliminate instructions)
• Each backend can operate independently
• User-visible ISA can remain unchanged despite multiple execution backends
Nearby Chunks Have Different Behavior
[Figure: histogram of frequency (y-axis, 0 to 0.6) of the number of chunks retired before a chunk-type switch (x-axis, 1 to 32).]
In-Order vs. Superscalar vs. OoO
[Figure: two bar charts comparing in-order, superscalar, and out-of-order cores: average performance (IPC; y-axis 0 to 1) and average core power (W; y-axis 0 to 8).]
Baseline EPI Breakdown
[Figure: stacked bars of energy per instruction (nJ; y-axis 0 to 2.5) for 1-wide in-order, 4-wide in-order, and 4-wide OoO cores. Components: Frontend, RAT, ROB, RS (Scheduler), RF, Exec (ALUs), Bypass Buses, LSQ, L1, L2.]
Retired Block Types (I)
[Figure: fraction of retired blocks (y-axis, 0 to 1) that are OoO blocks vs. VLIW blocks (wide or narrow), per benchmark (x-axis, sorted by OoO fraction).]
Retired Block Types (II)
[Figure: frequency spectrum of retire-order block types. y-axis: spectrum (0 to 0.06); x-axis: frequency (per 64 block retires), 5 to 30.]
Retired Block Types (III)
[Figure: average block type, histogram per block. y-axis: frequency (0 to 0.5); x-axis: average block type, from OoO through 50% OoO/VLIW to VLIW.]
BIC Size Sensitivity
[Figure: relative IPC (y-axis, 0.8 to 1) vs. Block Info Cache size in blocks of 16 uops (x-axis: 64, 128, 256, 512, 1K, 2K, 4K, perfect); the default size is marked.]
Livein-Liveout Statistics
[Figure: two histograms of frequency (y-axis, 0 to 0.3): liveins per block and liveouts per block, binned 0-1, 2-3, 4-5, 6-7, 8-9, 10-11, 12-13, 14-15, 16+.]
Auxiliary Data: Benefits of HBA
[Figure: normalized accesses (y-axis, 0 to 1) to the RAT, ROB, and Global RF, baseline vs. HBA.]
Auxiliary Data: Benefits of HBA
[Figure: three bar charts of normalized reads and writes (y-axis, 0 to 1), baseline vs. HBA: RAT accesses, ROB accesses, and RF accesses.]
Some Specific Workloads
[Figure: IPC and EPI relative to baseline (y-axis, 0.5 to 1) for ffmpeg, 473.astar, 429.mcf, latex, 401.bzip2, graphchi, 403.gcc, 437.leslie3d, octave, 456.hmmer, 450.soplex, 433.milc.]
Related Works
• Forming and reusing instruction schedules [DIF, ISCA’97][Transmeta’00][Banerjia, IEEE TOC’98][Palomar, ICCD’09]
• Coarse-grained heterogeneous cores [Lukefahr, MICRO’12]
• Atomic blocks and block-structured ISA [Melvin&Patt, MICRO’88, IJPP’95]
• Instruction prescheduling [Michaud&Seznec, HPCA’01]
• Yoga [Villavieja+, HPS Tech Report’14]