Transcript of "Models and Architectures for Heterogeneous System Design" · 2018. 4. 3.
Models and Architectures for Heterogeneous System Design
Andreas Gerstlauer
Electrical and Computer Engineering
University of Texas at Austin
http://www.ece.utexas.edu/~gerstl
© 2013 A. Gerstlauer 2
Heterogeneous Computing
• Embedded systems
  • Costs of custom design  Increasingly programmable
• General-purpose computing
  • Power limits of scaling  Increasingly heterogeneous
 Multi-processor/multi-core systems-on-chip (MPCSoC)
  • Mix of processors/cores?
  • Programming and mapping?  Application/architecture interactions
© 2013 A. Gerstlauer 3
[Figure: example heterogeneous MPCSoC platform with ARM and DSP processors, custom DCT hardware and IPs, connected via BUS1 (AHB), BUS2 (DSP), MBUS and a DCT bus through a bridge, arbiter and DMA, with memory (M1, M1Ctrl) and I/O (I/O1-I/O4), running a JPEG encoder/decoder and MP3 decoder application]
System-Level Design
[Figure: system-level design flow, from an application (App) in a domain-specific model of computation (MoC), through exploration & compilation, to an implementation (HW/SW synthesis) on multiple processors (Proc)]
• From specification
  • Application functionality
    – Parallel programming models
• To implementation
  • Heterogeneous MPCSoC
    – Hardware & software
 Design automation
  • Modeling
  • Synthesis
  • Architectures
© 2013 A. Gerstlauer 4
Outline
• Modeling and simulation
• Host-compiled modeling
• Synthesis
• Design-space exploration
• System compilation
• Architectures
• Domain-specific processors
• Approximate computing
© 2013 A. Gerstlauer 5
System-Level Modeling Space
[Figure: system-level modeling space spanned by communication granularity (cycles, words, packets, messages) vs. accuracy and speed, ranging from cycle-accurate RTL through virtual platforms and host-compiled models up to models of computation (MoC) and network simulators (NS)]
© 2013 A. Gerstlauer 6
Host-Compiled Modeling
• Coarse-grain system model [ASPDAC’12, RSP’10]
  • Source-level application model
  • Abstract OS and processor models
  • Transaction-level model (TLM) backplane
  • C-based discrete-event simulation kernel [SpecC, SystemC]
 Fast and accurate full-system simulation
 Speed vs. accuracy tradeoffs
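To make the discrete-event backbone of such a model concrete, here is a minimal sketch of a C-based-style simulation kernel, written in Python for brevity. Processes are modeled as generators that yield the simulated delay until their next activation; the kernel, event queue and task names are illustrative, not the SpecC/SystemC implementation.

```python
import heapq

class DEKernel:
    """Minimal discrete-event kernel: processes are generators that
    yield the simulated delay until their next activation."""
    def __init__(self):
        self.now = 0
        self._queue = []   # (wakeup_time, seq, process)
        self._seq = 0      # tie-breaker for deterministic ordering

    def spawn(self, process):
        self._schedule(self.now, process)

    def _schedule(self, t, process):
        heapq.heappush(self._queue, (t, self._seq, process))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, proc = heapq.heappop(self._queue)
            try:
                delay = next(proc)              # run until next wait(delay)
                self._schedule(self.now + delay, proc)
            except StopIteration:
                pass                            # process finished

trace = []
def task(name, period, n):
    for i in range(n):
        trace.append((name, i))
        yield period                            # wait(period)

kernel = DEKernel()
kernel.spawn(task("T1", 10, 3))
kernel.spawn(task("T2", 15, 2))
kernel.run()
print(kernel.now, trace)
```

Real SLDL kernels add channels, events and delta cycles on top of this core loop; the point here is only the time-ordered event queue that advances simulated time between process activations.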
4
© 2013 A. Gerstlauer 7
Application Model
• Application source code
  • C-based task/process model
    – Parallel programming model
    – Canonical API
  • Communication primitives
    – IPC channel library
• Timing and energy model
  • Granularity?
  • Compiler optimizations?
  • Dynamic micro-architecture effects?
 Basic-block level annotation
 Compile to intermediate representation [gcc]
 Block-level timing and energy estimates [ISS + McPAT]
 Hybrid simulation (static annotation + cache and branch-predictor models)
© 2013 A. Gerstlauer 8
Retargetable Back-Annotation
• Back-annotation flow
  • Intermediate representation (IR)
    – Frontend optimizations [gcc]
    – IR-to-C conversion
  • Target binary matching
    – Cross-compiler backend [gcc]
    – Control-flow graph matching
  • Timing and power estimation
    – Micro-architecture description language (uADL) or RTL
    – Cycle-accurate timing
    – Reference power model [McPAT]
  • Back-annotation
    – IR basic-block level
[Figure: back-annotation example. Source code (a=b=c=0; if(a<=0) { a=1; c=2; } … printf(…);) is lowered to IR basic blocks (e.g., a = 1; b = 0; c = 2; goto bb_7;), matched against the target binary via control-flow graph matching into a mapping table, characterized per basic block for timing and energy with the uADL ISS and McPAT, and finally instrumented by the back-annotator with calls such as incrDelay(15); incrEnergy(2); bb = BB_2; or path-dependent incrDelay(delay[bb][BB_3]); incrEnergy(energy[bb][BB_3]); bb = BB_3;]
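The annotation style sketched in the figure can be mimicked in a few lines. The following is a hypothetical sketch, not the actual tool output: each IR basic block increments simulated delay and energy by an amount that depends on its immediate predecessor (the pairwise, path-dependent characterization), with made-up table values standing in for ISS + McPAT estimates.

```python
# Pairwise characterization table, (pred_bb, cur_bb) -> cost, as an
# ISS plus power model might produce it. Values are illustrative only.
delay = {("BB1", "BB3"): 12, ("BB2", "BB3"): 15}
energy = {("BB1", "BB3"): 3, ("BB2", "BB3"): 2}

class SimState:
    def __init__(self):
        self.cycles = 0
        self.energy = 0
        self.bb = None            # last executed basic block

    def annotate(self, cur_bb):
        """Emitted at the top of every IR basic block."""
        key = (self.bb, cur_bb)
        self.cycles += delay.get(key, 0)
        self.energy += energy.get(key, 0)
        self.bb = cur_bb          # remember predecessor for next block

def annotated_program(s, a):
    # if (a <= 0) { ... }  -- two paths reaching the join block BB3
    if a <= 0:
        s.annotate("BB1")         # then-branch block
    else:
        s.annotate("BB2")         # else-branch block
    s.annotate("BB3")             # join block: cost depends on predecessor

s = SimState()
annotated_program(s, 0)           # takes the BB1 -> BB3 path
print(s.cycles, s.energy)
```

Running the same program with a = 1 takes the BB2 path and accrues the other (cycles, energy) pair, which is exactly why predecessor-indexed tables capture pipeline-state effects that a single per-block cost cannot.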
5
© 2013 A. Gerstlauer 9
Power and Performance Estimation
• Pairwise block characterization
  • Path-dependent metrics
    – Timing & energy
  • Over all immediate predecessors
 Close to cycle-accurate at source-level speeds
• Single- (z4) and dual-issue (z6) PowerPC [MiBench]
 >98% timing and energy accuracy @ 2000 MIPS
[Figure: two execution flows through basic blocks BB1/BB2 into BB3 under different system states (SS = registers, memory, pipeline)]
[Chart: simulation speed in MIPS of host-compiled (HC), IR and source models on SHA, ADPCM, CRC32 (small/large inputs) and Sieve]
[Chart: timing and energy error in % (log scale, 0.000001 to 10) for z4 and z6 on the same benchmarks]
© 2013 A. Gerstlauer 10
Multi-Core Processor Model
• Layered processor model [DATE’11, TODAES’10]
  • Abstract RTOS model
  • Hardware abstraction layer
  • TLM backplane
[Figure: layered processor model. Application tasks (T1-T3) and channels (CH) run over an OS layer with interrupt tasks and handlers, a HAL with I/O drivers and I/O and interrupt interfaces, and a TLM backplane, all on top of the SLDL simulation kernel]
 Full-system simulation
  • Heterogeneous MPCSoC
  • 600 MIPS (≈ real time)
  • >97% accuracy
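The abstract RTOS layer above schedules tasks on simulated time without modeling the target instruction set. As a rough illustration only (the real model is preemptive and far richer), here is a non-preemptive priority scheduler over abstract execution times; task names, priorities and times are invented:

```python
class Task:
    def __init__(self, name, prio, exec_time, release=0):
        self.name, self.prio = name, prio
        self.remaining = exec_time      # abstract execution time
        self.release = release          # earliest activation time

def run_rtos(tasks):
    """Non-preemptive priority scheduling; returns (finish_time, order)."""
    now, order = 0, []
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if t.release <= now]
        if not ready:
            now = min(t.release for t in pending)   # idle until next release
            continue
        t = max(ready, key=lambda t: t.prio)        # highest priority wins
        now += t.remaining                          # run task to completion
        order.append(t.name)
        pending.remove(t)
    return now, order

finish, order = run_rtos([Task("T1", 2, 5), Task("T2", 3, 3),
                          Task("T3", 1, 4, release=6)])
print(finish, order)
```

The key abstraction is that only task-level timing is simulated; everything below (pipeline, ISA) has already been folded into the per-task execution times by back-annotation.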
© 2013 A. Gerstlauer 11
Automatic Timing Granularity Adjustment
• Errors in discrete preemption models [ESL’12]
  • Potentially large preemption errors
    – Not bounded by simulation granularity
 ATGA-based OS model [ASPDAC’12]
  • Observe system state to predict preemption points
  • Dynamically and optimally adjust timing model
 Full-system simulation at 900 MIPS w/ 96% accuracy
[Figure: task execution over time at high (Thigh) and low (Tlow) timing granularity, with run and idle phases and the resulting preemption errors]
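The benefit of adjusting timing granularity can be shown with a toy event-count comparison. This is a simplified sketch under invented numbers, not the ATGA algorithm itself: a naive model wakes up every quantum to check for preemption, while an ATGA-style model advances time in one step per predicted preemption point.

```python
def simulate(run_time, preemption_times, quantum=1):
    """Return number of simulation events for naive vs. adjusted models."""
    # Naive: wake up every quantum to poll for preemption.
    naive_events = run_time // quantum
    # ATGA-style: one event per predicted preemption point inside the
    # run, plus one final event at the end of the run.
    atga_events = sum(1 for t in preemption_times if t < run_time) + 1
    return naive_events, atga_events

naive, atga = simulate(run_time=1000, preemption_times=[250, 700])
print(naive, atga)   # 1000 vs. 3 simulation events
```

Fewer kernel events directly translate into simulation speed, which is why predicting preemption points (rather than polling for them) recovers accuracy without giving up throughput.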
© 2013 A. Gerstlauer 12
Ongoing Modeling Work
• Dynamic architecture features
  • Multi-core memory hierarchy, caches [ESLSyn’13]
  • Microarchitecture effects (branch predictors, …)
• Parallelizing the simulation
  • Parallel SLDL simulation kernels [ASPDAC’11]
  • Parallelizing the model itself
• Modeling of implementation effects
  • Performance, energy, reliability, power, thermal (PERPT)
  • Application to other processor models (GPUs)
  • Integration with network simulation
© 2013 A. Gerstlauer 13
Outline
Modeling and simulation
Host-compiled modeling
• Synthesis
• Design-space exploration
• System compilation
• Architectures
• Domain-specific processors
• Approximate computing
© 2013 A. Gerstlauer 14
Design Space Exploration
• Electronic system-level synthesis [TCAD’10]
  • From formal dataflow models to MPCSoC architectures
  • Genetic algorithms + host-compiled models [T-HiPEAC’11]
    – Hierarchy of layered models for successive design-space pruning
[Figure: exploration loop. An exploration engine (select, cross-over, mutate) generates SDF mapping solutions from formal models; decision making plus model refinement & evaluation turn them into executable models, whose analysis and simulation feed performance metrics back into the engine]
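The select/cross-over/mutate loop in the figure can be sketched in a few dozen lines. This is a toy genetic algorithm over a made-up cost table, not the actual exploration engine: a mapping assigns each task to a processor, and fitness is the makespan under per-processor execution costs.

```python
import random

cost = [        # cost[task][proc]: execution time of task on processor
    [4, 2],
    [3, 6],
    [5, 5],
    [2, 1],
]
NPROC = 2

def makespan(mapping):
    load = [0] * NPROC
    for task, proc in enumerate(mapping):
        load[proc] += cost[task][proc]
    return max(load)

def explore(pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(NPROC) for _ in cost] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)                    # select: best first
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, len(cost))     # one-point cross-over
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                # mutate one gene
                child[rng.randrange(len(cost))] = rng.randrange(NPROC)
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = explore()
print(best, makespan(best))
```

In the real flow, the call to makespan() is replaced by evaluating an executable (host-compiled) model of the candidate mapping, which is exactly where the layered-model hierarchy pays off.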
© 2013 A. Gerstlauer 15
Dataflow Mapping
• Streaming real-time applications (Synchronous Dataflow, SDF)
  • Heterogeneous multi-processor partitioning & scheduling
  • Computation & communication
  • Pipelined latency & throughput
• Mapping heuristics [JSPS’12]
  • Globally optimal ILP formulation (1-ILP)
  • Partitioning ILP + scheduling ILP (2-ILP)
  • Evolutionary algorithm + scheduling ILP / list scheduler (EA)
[Chart: MP3 decoder design space explored by the EA, with the global ILP optimum marked]
[Figure: mapping model. SDF actors with per-processor execution times and edges with per-bus communication times are bound to processors (j, j′) with local memories, connected via buses (l, l′) and shared memory; binary decision variables (10110…) encode the mapping]
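As a companion to the EA above, the list-scheduler alternative can be sketched as follows. The actor names, dependency graph and cost table are invented for illustration; the scheduler greedily places each ready actor on the processor that finishes it earliest, a common list-scheduling heuristic rather than the paper's exact algorithm.

```python
exec_time = {            # exec_time[actor][proc], illustrative values;
    "src":  [2, 4],      # e.g. the second processor is DSP-like and
    "dct":  [8, 3],      # runs "dct" much faster
    "q":    [3, 3],
    "sink": [2, 5],
}
deps = {"src": [], "dct": ["src"], "q": ["dct"], "sink": ["q"]}

def list_schedule(actors):
    """Greedy earliest-finish-time list scheduling on two processors."""
    finish = {}                  # actor -> finish time
    free = [0, 0]                # per-processor available times
    placement = {}
    for a in actors:             # actors given in topological order
        ready = max((finish[p] for p in deps[a]), default=0)
        best = min(range(len(free)),
                   key=lambda j: max(free[j], ready) + exec_time[a][j])
        start = max(free[best], ready)
        finish[a] = start + exec_time[a][best]
        free[best] = finish[a]
        placement[a] = best
    return finish, placement

finish, placement = list_schedule(["src", "dct", "q", "sink"])
print(finish["sink"], placement)
```

Note how the heuristic moves "dct" onto the fast processor even though it costs extra idle time elsewhere; an ILP would certify optimality, which is the trade the 1-ILP/2-ILP/EA spectrum explores.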
© 2013 A. Gerstlauer 16
System Compilation
• System-on-Chip Environment (SCE) [JES’08]
[Figure: SCE flow. An application model (App) with behaviors (B1-B5) and channels (C1-C4) is refined through architecture models (Archn) and host-compiled models (HCMn) into implementations (Impln) on a platform with a CPU and DSP (each with OS + drivers), memory, custom HW, IP and a bridge between the CPU and DSP buses]
 Synthesize target HW/SW
 Compile onto MPCSoC platform
 Backend synthesis
© 2013 A. Gerstlauer 17
System Compiler Backend
• Communication synthesis [TCAD’07]
  • From dataflow primitives
    – Queues, message-passing
  • To bus transactions
    – Protocol stacks over heterogeneous architectures
• Software synthesis [ASPDAC’08]
  • Code generation + OS targeting + compilation
    – Drivers, interrupt handlers, task handling
• Hardware synthesis [CODES+ISSS’12]
  • Transactor synthesis + high-level synthesis
    – Custom communication interfaces
  • Transactor optimizations (merging & fusion)
    – Average 20% latency and 70% area improvement
    – 16% faster in similar area vs. a manual design
[Figure: synthesis stack, with communication synthesis above block-level, high-level and logic synthesis]
© 2013 A. Gerstlauer 18
Outline
Modeling and simulation
Host-compiled modeling
Synthesis
Design-space exploration
System compilation
• Architectures
• Domain-specific processors
• Approximate computing
© 2013 A. Gerstlauer 19
Adapted from: T. Noll, RWTH Aachen, via R. Leupers, “From ASIP to MPSoC”, Computer Engineering Colloquium, TU Delft, 2006
Algorithm/Architecture Co-Design
[Figure: efficiency vs. flexibility of architectures, positioning domain-specific processors relative to GP-GPUs]
© 2013 A. Gerstlauer 20
Linear Algebra (w/ R. van de Geijn, CS)
[Figure: layered library stack. A host application sits above LAPACK-level and BLAS-level routines, LAPACK kernels, packing routines, and hand-tuned BLAS kernels]
• Layered libraries [FLAME]
  • Linear Algebra Package (LAPACK) level
    – Cholesky and QR factorization
  • Basic Linear Algebra Subroutines (BLAS)
    – Matrix-matrix and matrix-vector operations
    – General matrix-matrix multiplication (GEMM)
  • Inner kernels
    – Hand-optimized assembly
 Key to many applications
  • High-performance computing
  • Embedded computing
© 2013 A. Gerstlauer 21
Linear Algebra Fabric
Parallelism and locality
© 2013 A. Gerstlauer 22
Linear Algebra Fabric
Broadcast nature of collective communications
GEMM mapping
© 2013 A. Gerstlauer 23
A Linear Algebra Core (LAC)
• Scalable 2-D array of n×n processing elements (PEs) [ASAP’11]
  • Specialized floating-point units w/ 1-cycle MAC throughput
  • Broadcast busses (possibly pipelined)
  • Distributed memory architecture
  • Distributed, PE-local micro-coded control
[Figure: 4×4 PE array (PE(0,0)-PE(3,3)) with row and column bus reads/writes (RBR/RBW, CBR/CBW). Each PE contains a MAC unit with accumulator (ACC_in, Cin), a register file (RF), dual-ported local memory (MEM, Addr1/Addr2) and a memory interface; the GEMM mapping onto the array is shown]
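The GEMM mapping hinted at in the figure works by rank-1 updates: in step p, column p of A is broadcast along the row buses and row p of B along the column buses, and every PE(i,j) performs one multiply-accumulate. A pure-Python functional model of this dataflow (not a cycle-accurate one) might look like:

```python
def lac_gemm(A, B, n):
    """Compute C = A*B on an n-by-n PE grid via n rank-1 updates."""
    C = [[0.0] * n for _ in range(n)]       # accumulator in each PE
    for p in range(n):                      # one rank-1 update per step
        col = [A[i][p] for i in range(n)]   # row-bus broadcast of A's column p
        row = [B[p][j] for j in range(n)]   # column-bus broadcast of B's row p
        for i in range(n):
            for j in range(n):
                C[i][j] += col[i] * row[j]  # PE(i,j) multiply-accumulate
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(lac_gemm(A, B, 2))   # [[19.0, 22.0], [43.0, 50.0]]
```

Because every PE touches only its broadcast inputs and its local accumulator, the inner two loops run fully in parallel in hardware, one MAC per PE per cycle, which is where the near-peak utilization comes from.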
© 2013 A. Gerstlauer 24
Core-Level Implementation
 4×4 LAC w/ 256 kB of local SRAM @ 1 GHz
 Running GEMM, the FMAC units alone achieve 120 GFLOPS/W (single) or 60 GFLOPS/W (double precision)

                        W/mm²   GFLOPS/mm²   GFLOPS/W   Utilization
Cell SPE (SP)           0.4     6.4          16         83%
NVidia GTX480 SM (SP)   0.5     3.8          7.0        58%
NVidia GTX480 SM (DP)   0.5     1.7          3.4        58%
Intel Core (DP)         0.5     0.4          0.9        95%
ClearSpeed (DP)         0.02    0.3          13         80%
LAC (SP)                0.2     20           104        95+%
LAC (DP)                0.3     16           47         95+%
© 2013 A. Gerstlauer 25
Linear Algebra Processor (LAP)
 Multiple Linear Algebra Cores [TC’12]
  • Fine- and coarse-grain parallelism
  • System integration
    – Memory hierarchy via shared memory interface
    – Host applications and libraries
[Figure: LAP chip with multiple LACs alongside a host CPU]
© 2013 A. Gerstlauer 26
Mapping GEMM to the LAP
• Multiple LACs on chip
  • S cores (LACs)
  • Shared memory hierarchy
• Second- & third-level blocking
  • Streaming of n×n blocks of Ci,j
  • Prefetching of Bp,j & Ai,p
  • Bp,j replicated across column PEs
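The multi-level blocking above can be illustrated with a plain blocked GEMM loop nest. This is a hedged sketch of the idea only: the outer loops stream n×n tiles of C and stage panels of A and B (standing in for SRAM prefetch), while the innermost three loops stand in for one LAC-level GEMM call. Tile size and loop order are illustrative.

```python
def blocked_gemm(A, B, C, N, n):
    """C += A*B on N-by-N matrices with n-by-n tiles (N divisible by n)."""
    for i0 in range(0, N, n):            # stream tiles of C
        for j0 in range(0, N, n):
            for p0 in range(0, N, n):    # stage panels of A and B
                # inner tile multiply: one "LAC GEMM" invocation
                for i in range(i0, i0 + n):
                    for j in range(j0, j0 + n):
                        for p in range(p0, p0 + n):
                            C[i][j] += A[i][p] * B[p][j]

N, n = 4, 2
A = [[1 if i == j else 0 for j in range(N)] for i in range(N)]  # identity
B = [[i * N + j for j in range(N)] for i in range(N)]
C = [[0] * N for _ in range(N)]
blocked_gemm(A, B, C, N, n)
print(C == B)   # identity * B == B
```

The blocking keeps each tile of C resident (streamed through a core) while A and B panels are reused from on-chip memory, which is what lets the LAP sustain near-peak GFLOPS against off-chip bandwidth limits.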
© 2013 A. Gerstlauer 27
                          GFLOPS   W/mm²   GFLOPS/mm²   GFLOPS/W   Utilization
Cell BE (SP)              200      0.3     1.5          5          88%
NVidia GTX480 SM (SP)     780      0.2     0.9          4.5        58%
NVidia GTX480 SM (DP)     390      0.2     0.4          2.2        58%
Intel Core-i7 960 (SP)    96       0.4     0.5          1.2        95%
Intel Core-i7 960 (DP)    48       0.4     0.25         0.6        95%
Altera Stratix IV (DP)    100      0.02    0.05         3.5        90+%
ClearSpeed CSX700 (DP)    75       0.02    0.2          12.5       78%
LAP (SP)                  1200     0.2     6-11         55         90+%
LAP (DP)                  600      0.2     3-5          25         90+%
LAP Summary and Comparison
• 45 nm scaled power/performance @ 1.4 GHz
• Equivalent throughput (# of cores), running GEMM
 15-core LAP chip
 SGEMM/DGEMM at 1200/600 GFLOPS
 90% utilization, 25 W, 120 mm²
 1-2 orders of magnitude improvement vs. CPUs/GPUs
© 2013 A. Gerstlauer 28
LAP Generalization
• Level-3 BLAS [ASAP’12]
  • Triangular Solve with Multiple Right-hand sides (TRSM)
  • Symmetric Rank-K Update (SYRK)
  • Symmetric Rank-2K Update (SYR2K)
   Addition of a reciprocal (1/x) unit
• Linear solvers [ARITH’13]
  • Cholesky, LU, QR factorization
   Addition of a reciprocal square-root (1/√x) unit
• Signal processing [ASAP’13]
  • FFT, neural nets
   Addition of memory interfaces
 Minimal modifications for flexibility
 Coarse-grain programming model
FFT        GFLOPS   GFLOPS/mm²   GFLOPS/W   Utilization
Xeon E3    16       0.65         0.64       66%
ARM A9     0.6      0.09         2.13       60%
Cell       12       0.12         0.19       12%
Tesla      110      0.21         0.73       21%
LAP-FFT    26.7     2.01         27.7       83%
© 2013 A. Gerstlauer 29
Approximate Computing
• Controlled timing-error acceptance (w/ Prof. Orshansky)
  • Statistical error treatment under scaled Vdd
  • Quality-energy profile shaping
 2D-IDCT prototype [TCSVT’13]
  • ~50% energy savings possible
  • <3% area overhead
 Adder synthesis [ICCAD’12]
  • Family of quality-energy optimal arithmetic units
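To give one concrete flavor of such quality-energy arithmetic, here is a sketch of a lower-part-OR adder, a well-known approximate-adder style used here purely as an illustration of the family, not as the specific ICCAD’12 design: the carry chain of the k least significant bits is replaced by a bitwise OR, trading a small, bounded error for a shorter critical path (and hence lower energy at scaled Vdd).

```python
def lopa_add(a, b, k):
    """Approximate add: exact upper bits, OR'ed lower k bits."""
    mask = (1 << k) - 1
    upper = ((a >> k) + (b >> k)) << k   # exact, no carry-in from below
    lower = (a & mask) | (b & mask)      # approximate lower part (no carry chain)
    return upper | lower

a, b, k = 100, 29, 4
approx, exact = lopa_add(a, b, k), a + b
print(approx, exact, abs(approx - exact))
```

The error is bounded by the dropped lower-bit carries (at most 2^k), which is what makes the quality degradation statistically controllable rather than catastrophic, the core idea behind shaping a quality-energy profile.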
[Chart: PSNR (dB) vs. energy (J) for the 2D-IDCT, comparing the original curve (orig PSNR) against shaped quality-energy profiles (all3 PSNR); annotations mark 80 J and 22 dB]
© 2013 A. Gerstlauer 30
Summary & Conclusions
• Heterogeneous system design
• Host-compiled modeling
  • Full-system simulation in close to real time with >90% accuracy
  • Automatic timing granularity adjustment
• System compilation & synthesis
  • Design-space exploration, mapping, scheduling
  • Hardware/software/interface synthesis
• Domain-specific processor architectures
  • High efficiency, performance and flexibility
  • Linear algebra processor (LAP)
• Approximate computing for digital signal processing