Abbas Rahimi † , A. Ghofrani ‡ , M. A. Montano ‡ , K-T Cheng ‡ , L. Benini * , R. K....
Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing
Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†
†UCSD, ‡UCSB, *ETHZ, *UNIBO
Variability.org Micrel.deis.unibo.it/MultiTherman
Energy-Efficient GPGPU
Thousands of deep and wide pipelines make GPGPUs power-hungry parts
Near-threshold (NT) operation and voltage overscaling (VOS) achieve energy efficiency at the cost of:
1. Performance loss
2. Increased timing sensitivity in the presence of variations
Total delay: corner + 3σ stochastic delay
Kakoee et al, TCAS-II’12
× Conservative guardbands: loss of operational efficiency
[Figure labels: ✓ SIMD; guardband]
Variability is about Cost and Scale
Eliminating the guardband exposes timing errors, and error recovery is costly for SIMD (Bowman et al, JSSC’09):
Wide lanes: error rate × wider width
Deep pipes: recovery cycles increase linearly with pipeline length
Combined: quadratically expensive
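The "quadratically expensive" claim can be illustrated with a first-order cost model. This is a minimal sketch, not the model from Bowman et al: it assumes a lock-step SIMD unit in which a timing error in any lane stalls all lanes, and a recovery that costs one pipeline's worth of cycles. The function names and the per-lane error probability are hypothetical.

```python
def any_lane_error_prob(p_lane: float, width: int) -> float:
    """Probability that at least one of `width` independent lanes has a timing error."""
    return 1.0 - (1.0 - p_lane) ** width


def expected_recovery_cycles(p_lane: float, width: int, depth: int) -> float:
    """Expected recovery cycles per SIMD instruction.

    Assumption: every error triggers a flush and replay costing `depth`
    cycles, and one faulty lane forces the whole SIMD unit to recover.
    """
    return any_lane_error_prob(p_lane, width) * depth


# Hypothetical per-lane error probability, chosen only for illustration.
p = 1e-4
base = expected_recovery_cycles(p, width=4, depth=5)
scaled = expected_recovery_cycles(p, width=8, depth=10)
print(f"doubling width and depth scales the cost by {scaled / base:.2f}x")
```

For small error probabilities the cost is close to `p_lane * width * depth`, so doubling both the lane count and the pipeline depth roughly quadruples the expected recovery cost.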
[Figure: SIMD pipeline with stages IF, RF, ALU, M, WB; the RF, ALU, M, WB stages are repeated across lanes]
Taxonomy of SIMD Variability-Tolerance
Guardband
  Adaptive: no timing error
    Predict & prevent: hierarchically focused guardbanding and uniform instruction assignment
  Eliminating: timing error
    Error recovery (detect-then-correct)
      Independent recovery: lane decoupling through private queues
      Memoization: recalling recent context of error-free execution (approximately / exactly)
Computing modes shown: exact / approximate computing; exact computing
Cited work: Pawlowski et al, ISSCC’12; Krimer et al, ISCA’12; Rahimi et al, TCAS’13; Rahimi et al, DATE’14; Rahimi et al, DATE’13; Rahimi et al, DAC’13
Contributions
Efficient spatiotemporal reuse of computation in GPGPUs through collaborative:
1. Micro-architectural design
   An associative memristive memory (AMM) module is integrated with FPUs, representing partial functionality
2. Compiler profiling
   Fine-grained partitioning of values (searching the space of possible inputs)
   Pre-storing highly frequent sets of values in AMM modules
Ensure their resiliency under voltage overscaling for Evergreen GPGPUs
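The compiler profiling step can be sketched as frequency counting over operand sets. This is an illustration under stated assumptions, not the authors' profiler: `partition` uses a hypothetical uniform step to group nearby values, `profile_add` is an invented name, and the default of 32 entries follows the 32-entry AMMs mentioned in the conclusion.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def partition(value: float, step: float = 0.25) -> float:
    """Map a value to the representative of its partition (assumed uniform step)."""
    return round(value / step) * step


def profile_add(
    operand_pairs: Iterable[Tuple[float, float]],
    entries: int = 32,
    step: float = 0.25,
) -> Dict[Tuple[float, float], float]:
    """Return the `entries` most frequent partitioned operand pairs of an
    ADD, each mapped to its precomputed result, as candidate AMM contents."""
    counts = Counter(
        (partition(a, step), partition(b, step)) for a, b in operand_pairs
    )
    return {ops: ops[0] + ops[1] for ops, _ in counts.most_common(entries)}


# Hypothetical training trace: (a, b) operand pairs observed for ADD.
trace = [(1.0, 2.0)] * 5 + [(0.5, 0.5)] * 3 + [(3.0, 1.0)]
print(profile_add(trace, entries=2))
```

A coarser `step` groups more inputs into each entry at the price of approximate results; a finer one keeps results exact but spreads the trace over more distinct keys.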
Collaborative compilation framework and memristive-based computing
1) Profiling: the profiler takes the OpenCL kernel and training datasets and extracts highly frequent computations
2) Code generation: customized clCreateBuffer to insert AMM contents
3) Runtime: programming the AMMs, then launching the kernel on FPUs with AMMs
[Figure labels: Kernel; AMM contents; FPU; AMM; one-off activity; =?]
AMM with FPU
AMM:
  Software programmable
  Mimics partial functionality of FPU
  Two pipelined stages:
    1. TCAM: a self-referenced sensing scheme†, 2-bit encoding, 15% positive slack at 45nm
    2. Memory block: avoids read disturbance
[Figure labels: ternary content addressable memory (TCAM); crossbar-based 1T-1R memristive memory block; search operands; return pre-stored result; error; no recovery]
†Li et al, JSSC’14
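The two pipelined stages above (search the operands, then return the pre-stored result) can be modelled in software as follows. This is a behavioural sketch only: the class `AMM`, the function `fadd`, and the single global `care_mask` are assumptions made for illustration (a real TCAM stores don't-care bits per entry), and nothing here models the memristive circuit.

```python
import struct
from typing import List, Optional, Tuple


def f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


class AMM:
    """Software model: TCAM search over operand bits, then a result read."""

    def __init__(self, care_mask: int = 0xFFFFFFFF) -> None:
        # Bits cleared in care_mask act as ternary don't-care positions.
        self.care_mask = care_mask
        self.rows: List[Tuple[int, int, float]] = []

    def program(self, a: float, b: float, result: float) -> None:
        """Store one (operands -> result) entry."""
        m = self.care_mask
        self.rows.append((f32_bits(a) & m, f32_bits(b) & m, result))

    def search(self, a: float, b: float) -> Optional[float]:
        """Stage 1: match the operands. Stage 2: return the stored result."""
        ka, kb = f32_bits(a) & self.care_mask, f32_bits(b) & self.care_mask
        for ra, rb, q in self.rows:
            if ra == ka and rb == kb:
                return q
        return None


def fadd(amm: AMM, a: float, b: float) -> Tuple[float, bool]:
    """Return (result, hit). On a miss the FPU path computes the sum."""
    q = amm.search(a, b)
    if q is not None:
        return q, True
    return a + b, False


exact = AMM()
exact.program(1.0, 2.0, 3.0)
print(fadd(exact, 1.0, 2.0))  # hit: the pre-stored result is returned
print(fadd(exact, 1.5, 2.0))  # miss: falls back to the FPU computation
```

Clearing low mantissa bits in `care_mask` (for example `0xFFFF0000`) turns the exact match into an approximate one, so operands that differ only in those bits hit the same entry.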
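AMM Hit Rates
Offline: the profiler processes the training dataset (train) and derives the AMM contents for each operation:
  +: {a, b} → {q}
  *: {a, b} → {q}
  √: {a} → {q}
  …
Runtime: the AMMs (AMM+ with FPU+, AMM* with FPU*, AMM√ with FPU√, …) are programmed before launching the kernel, which then runs on test1 to test4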
[Figure: OpenCL Sobel. AMM hit rate for Sobel (%), axis 0 to 60, per operation (ADD, MUL, SQRT, MULADD) for test1 to test4]
[Figure: Overall AMMs hit rate (%), axis 0 to 50, for Sobel, Gaussian, URNG]
# Trains = 20, # Tests = 400
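A hit rate of the kind plotted here can be computed by replaying a test trace against the keys programmed from training. The sketch below is an assumption about how such a measurement could be made, not the authors' tooling; `hit_rate` and the small traces are hypothetical.

```python
from typing import Iterable, Set, Tuple

Operands = Tuple[float, ...]


def hit_rate(amm_keys: Set[Operands], trace: Iterable[Operands]) -> float:
    """Percentage of operations in `trace` whose operands are stored in the AMM."""
    ops = list(trace)
    if not ops:
        return 0.0
    hits = sum(1 for operands in ops if operands in amm_keys)
    return 100.0 * hits / len(ops)


# Hypothetical AMM contents obtained offline from a training input.
amm_keys = {(1.0, 2.0), (0.5, 0.5)}

# Hypothetical operand traces from two unseen test inputs.
tests = {
    "test1": [(1.0, 2.0), (1.0, 2.0), (3.0, 4.0), (5.0, 6.0)],
    "test2": [(0.5, 0.5), (7.0, 7.0)],
}
for name, trace in tests.items():
    print(f"{name}: {hit_rate(amm_keys, trace):.1f}%")
```

Because the keys come from training inputs only, the measured rate on test inputs indicates how well the profiled value sets generalize.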
Efficiency under Voltage Overscaling
[Figure: three panels of energy (μJ) versus voltage (V), 0.88 V to 1.00 V, comparing FPUs with FPUs+AMMs. Panel 1: Sobel and Eigenvalue (axis 0 to 4000 μJ). Panel 2: Gaussian and URNG (axis 0 to 120 μJ). Panel 3: Prefixsum and Haar (axis 0 to 100 μJ).]
Percentage annotations on the plots, in extraction order: 17%, 33%, 28%, 32%, 37%, 28%, 36%, 19%, 39%, 29%, 33%, 30%
At 1.0V, without any timing error, 36% average energy saving (7 kernels)
At 0.88V, on average 39% energy saving
Reduce timing errors from 38% to 24%
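The drop in timing errors can also be read in relative terms. The arithmetic below uses only the two figures on the slide; treating the result as the share of would-be errors that no longer need recovery is an interpretation, not a statement from the slide, and `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Fraction of the original quantity that was removed."""
    return (before_pct - after_pct) / before_pct


# 38% and 24% are the timing-error figures reported on the slide.
print(f"{relative_reduction(38.0, 24.0):.1%}")  # prints 36.8%
```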
Conclusion
Static compiler analysis and coordinated microarchitectural design enable efficient reuse of computations in GPGPUs
Emerging associative memristive modules are coupled with FPUs for fast spatial and temporal reuse
GPGPU kernels exhibit low entropy, yielding an average energy saving of 36% on the 32-entry AMMs