Abbas Rahimi † , A. Ghofrani ‡ , M. A. Montano ‡ , K-T Cheng ‡ , L. Benini * , R. K....

Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing


Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†

†UCSD, ‡UCSB, *ETHZ, *UNIBO

Variability.org: http://variability.org/

MultiTherman: http://www-micrel.deis.unibo.it/multitherman/

Energy-Efficient GPGPU

Thousands of deep and wide pipelines make GPGPUs high power-consuming parts.

Near-threshold (NT) operation and voltage overscaling (VOS) achieve energy efficiency at the cost of:

1. Performance loss
2. Increased timing sensitivity in the presence of variations

Total delay: corner + 3σ stochastic delay

Kakoee et al, TCAS-II’12

× Conservative guardbands: loss of operational efficiency

✓ SIMD

guardband

Variability is about Cost and Scale

Eliminating guardband

Timing error

Costly error recovery for

SIMD

Bowman et al, JSSC’09

Wide lanes

Deep pipes

error rate × wider width

Recovery cycles increase linearly with pipeline length

quadratically expensive
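The scaling argument on this slide can be illustrated with a rough sketch. It assumes a deliberately simple cost model that is not taken from the slides: the chance that a SIMD instruction sees an error is approximated as the per-lane error rate times the number of lanes, and one recovery costs as many cycles as there are pipeline stages. The function name and parameters are hypothetical.

```python
def expected_recovery_cycles(lane_error_rate, lanes, pipeline_depth):
    """Rough expected recovery cycles per SIMD instruction (illustrative model)."""
    # Error rate scales with the SIMD width (capped at 1.0).
    simd_error_rate = min(1.0, lane_error_rate * lanes)
    # Assumption: one flush-and-replay costs one cycle per pipeline stage.
    cycles_per_recovery = pipeline_depth
    return simd_error_rate * cycles_per_recovery


# Doubling both the width and the depth roughly quadruples the recovery cost.
base = expected_recovery_cycles(0.001, lanes=4, pipeline_depth=5)
scaled = expected_recovery_cycles(0.001, lanes=8, pipeline_depth=10)
ratio = scaled / base
```

Under these assumptions the cost grows with the product of width and depth, which is the sense in which recovery becomes quadratically expensive for wide, deep SIMD pipelines.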

[Figure: parallel SIMD lanes, each a pipeline of IF, RF, ALU, M, and WB stages]

Taxonomy of SIMD Variability-Tolerance

Guardband

Timing error

Error recovery

Independent recovery

Memoization

Lane decoupling through private queues

Recalling recent context of error-free execution

(approximately / exactly)

No timing error

Eliminating

Adaptive

Detect-then-correct

Predict & prevent

Hierarchically focused guardbanding and uniform instruction assignment

Pawlowski et al, ISSCC’12

Krimer et al, ISCA’12

Rahimi et al, TCAS’13

Rahimi et al, DATE’14

Rahimi et al, DATE’13

Rahimi et al, DAC’13

Exact / approximate computing

Exact computing

Contributions

Efficient spatiotemporal reuse of computation in GPGPUs by collaborative:

1. Micro-architectural design
   An associative memristive memory (AMM) module is integrated with FPUs, representing partial functionality

2. Compiler profiling
   Fine-grained partitioning of values (searching the space of possible inputs)
   Pre-storing highly frequent sets of values in AMM modules

Ensuring their resiliency under voltage overscaling for Evergreen GPGPUs

Collaborative compilation framework and memristive-based computing

[Figure labels: OpenCL kernel; profiler; training datasets; highly frequent computations; customized clCreateBuffer to insert AMM contents; kernel; AMM contents; FPU; AMM; =?]

1) Profiling

2) Code generation

3) Runtime

Programming, then launching the kernel

One-off activity
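The profiling step can be sketched as follows. This is a minimal illustration, assuming the profiler sees a trace of (operation, operands, result) tuples collected on training datasets and keeps the most frequent operand sets per operation. The trace format, the function name, and the `entries_per_amm` parameter are hypothetical; the hand-off of the contents through the customized clCreateBuffer is not modelled.

```python
from collections import Counter


def build_amm_contents(trace, entries_per_amm=32):
    """Pick the most frequent operand sets per operation from a training trace.

    trace: iterable of (op, operands, result), e.g. ("ADD", (1.0, 2.0), 3.0).
    Returns {op: {operands: result}} with at most entries_per_amm entries per op.
    """
    counts = {}   # op -> Counter of operand tuples
    results = {}  # (op, operands) -> result observed in the trace
    for op, operands, result in trace:
        counts.setdefault(op, Counter())[operands] += 1
        results[(op, operands)] = result
    return {
        op: {operands: results[(op, operands)]
             for operands, _ in counter.most_common(entries_per_amm)}
        for op, counter in counts.items()
    }


# Usage: a tiny made-up trace; (1.0, 2.0) dominates the ADD operations.
trace = [("ADD", (1.0, 2.0), 3.0)] * 3 + [("ADD", (3.0, 4.0), 7.0),
                                          ("SQRT", (4.0,), 2.0)]
contents = build_amm_contents(trace, entries_per_amm=1)
```

Keeping one table per operation mirrors the pairing of one AMM with one FPU type; the table size would be set to the AMM capacity.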

AMM with FPU

Ternary content addressable memory (TCAM)

Crossbar-based 1T-1R memristive memory block

AMM:

Software programmable

Mimics partial functionality of FPU

Two pipelined stages: search operands, then return pre-stored result

Error

No recovery

1. TCAM: a self-referenced sensing scheme†, 2-bit encoding, 15% positive slack at 45nm

2. Memory block: avoids read disturbance

†Li et al, JSSC’14
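The two-stage behavior of an AMM beside an FPU can be sketched in software. This is only a functional illustration, not the hardware design: the class and function names are hypothetical, and the optional masking of low-order bits is an assumption used here to mimic the don't-care matching a ternary CAM allows.

```python
import math
import struct


def f32_bits(x):
    """Bit pattern of x as an IEEE-754 single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


class AMM:
    """Functional model: search operands, then return the pre-stored result."""

    def __init__(self, entries, ignore_low_bits=0):
        # Assumption: masked low-order bits act as ternary don't-care bits.
        self.mask = (0xFFFFFFFF >> ignore_low_bits) << ignore_low_bits
        self.table = {self._key(ops): res for ops, res in entries.items()}

    def _key(self, operands):
        return tuple(f32_bits(x) & self.mask for x in operands)

    def search(self, operands):
        # Stage 1: associative search; stage 2: read the stored result.
        return self.table.get(self._key(operands))


def sqrt_with_amm(amm, a):
    """Return (result, hit). On a hit the FPU result is not needed."""
    stored = amm.search((a,))
    if stored is not None:
        return stored, True
    return math.sqrt(a), False


exact = AMM({(4.0,): 2.0})
approx = AMM({(4.0,): 2.0}, ignore_low_bits=8)
```

With `exact`, only the operand 4.0 hits; with `approx`, operands that differ from 4.0 only in the lowest eight mantissa bits also return the stored result.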

AMM Hit Rates

[Figure: offline, the profiler processes the training input (train) and produces per-operation AMM contents, +: {a, b} → {q}, *: {a, b} → {q}, √: {a} → {q}, …; at runtime, the AMMs paired with the FPUs (FPU+/AMM+, FPU*/AMM*, FPU√/AMM√, …) are programmed before launching the kernel and exercised with test1 to test4]

[Figure: AMM hit rate for Sobel (%), axis 0 to 60, for ADD, MUL, SQRT, and MULADD; series test1 to test4; kernel: OpenCL Sobel]

[Figure: overall AMMs hit rate (%), axis 0 to 50, for Sobel, Gaussian, and URNG]

# Trains = 20 # Tests = 400
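A hit rate of this kind can be computed as the share of test-time operations whose operands are found in the pre-stored contents. The sketch below assumes a hypothetical trace format of (operation, operands) pairs and a contents table keyed by operation; the values are made up for illustration and are not the measured results.

```python
def amm_hit_rate(contents, test_trace):
    """Per-operation hit rate in percent.

    contents: {op: {operands: result}} programmed before the kernel launch.
    test_trace: iterable of (op, operands) seen while running a test input.
    """
    hits = {}
    totals = {}
    for op, operands in test_trace:
        totals[op] = totals.get(op, 0) + 1
        if operands in contents.get(op, {}):
            hits[op] = hits.get(op, 0) + 1
    return {op: 100.0 * hits.get(op, 0) / totals[op] for op in totals}


# Usage with made-up data: two of four ADDs hit, the MUL has no entry.
contents = {"ADD": {(1.0, 2.0): 3.0}}
test_trace = [("ADD", (1.0, 2.0)), ("ADD", (5.0, 6.0)), ("MUL", (2.0, 2.0)),
              ("ADD", (1.0, 2.0)), ("ADD", (0.0, 0.0))]
rates = amm_hit_rate(contents, test_trace)
```

Running this over every test input, with contents built from the training inputs only, gives a per-operation breakdown like the one plotted for Sobel.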

Efficiency under Voltage Overscaling

[Figure: three panels of energy (μJ) versus voltage (V), from 0.88 V to 1.00 V, comparing FPUs with FPUs+AMMs. Panel 1: Sobel and Eigenvalue (axis 0 to 4000 μJ). Panel 2: Gaussian and URNG (axis 0 to 120 μJ). Panel 3: Prefixsum and Haar (axis 0 to 100 μJ). Energy-saving annotations on the plots: 17%, 33%, 28%, 32%, 37%, 28%, 36%, 19%, 39%, 29%, 33%, 30%]

At 1.0V, without any timing error, 36% average energy saving (7 kernels)

At 0.88V, on average 39% energy saving

Reduce timing errors from 38% to 24%

Conclusion

Static compiler analysis and coordinated microarchitectural design that enable efficient reuse of computations in GPGPUs

Emerging associative memristive modules are coupled with FPUs for fast spatial and temporal reuse

GPGPU kernels exhibit low entropy, yielding an average energy saving of 36% on the 32-entry AMMs
