
Designing Energy-Efficient Microprocessors in the Era of Unpredictable Transistors

Radu Teodorescu
Department of Computer Science and Engineering
The Ohio State University
http://arch.cse.ohio-state.edu
Computer Architecture Research Lab

Task Parallel Programming in the Partitioned Global Address Space
James Dinan and Prof. P. Sadayappan

PGAS Models and The Asynchronous Gap

• PGAS models provide an asynchronous irregular data model
• E.g. Global Arrays, UPC, CAF
• Computation model is still regular, process-centric SPMD
• Irregularity in the data can lead to load imbalance
• Scioto extends PGAS models to bridge asynchronous gap
• Dynamic task-based view of the computation

[Figure: global array X[M][M][N], e.g. X[1..9][1..9][1..9]]

Scioto Task Model

• Task Inputs: Global data, Immediates, Common Local Objects (CLO)
• Task Outputs: Global data, CLOs, Child tasks

[Figure: a task f(...) with In: 5, Y[0], ... and Out: Y[1]; Y[0], Y[1], ..., Y[N] are shared across Proc0, Proc1, ..., Procn, and each process holds a private CLO1]

Runtime System Design

• Per-process ARMCI circular task queues for efficient one-sided access
• Queues are prioritized by affinity
• Use the work first principle (LIFO task execution)
• Load balancing off the tail via random work stealing (FIFO stealing)

Introduction

This poster describes our work on Scioto, a new parallel programming model that provides scalable support for task parallel programming on distributed memory clusters. Scioto's task model complements existing Partitioned Global Address Space (PGAS) data models to form a complete environment for expressing and managing irregular and dynamic parallelism. The Scioto programming model is supported by a scalable runtime system that provides dynamic load balancing and reduces communication overheads by co-locating tasks with the data on which they operate. We present an evaluation of Scioto on several benchmarks, including the MADNESS computational chemistry kernel, and demonstrate strong scaling and high efficiency on an 8,192 core cluster.

2. Reduce Search Time: Work Splitting

• Problem: Search time grows with system size
• Strategy: Divide tasks evenly between victim and thief
• Double number of work sources after each step
• Reduce avg. time to find work to log(ncpus)
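The doubling argument behind the log(ncpus) bound can be sketched in a few lines. This is an idealized model and not Scioto's runtime: it assumes one process starts with all the tasks and that every idle process finds a victim in each round, so the number of work sources doubles per round. The function name is illustrative.

```python
import math

def steps_until_all_have_work(ncpus: int) -> int:
    """Count steal rounds until every process holds work.

    Idealized assumption: each round, every process that already has
    work gives half of it to one idle process, so the number of work
    sources doubles (capped at ncpus).
    """
    sources = 1
    rounds = 0
    while sources < ncpus:
        sources = min(2 * sources, ncpus)
        rounds += 1
    return rounds

for n in (8, 1024, 8192):
    # The round count matches ceil(log2(n)) under this model.
    print(n, steps_until_all_have_work(n), math.ceil(math.log2(n)))
```

Real steals pick victims at random and can fail, so this is best read as a lower bound on the number of rounds.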

1. Optimize Local Accesses: Split Queues

• Queues are split into two parts:
• Private: Local-only
• Shared: Any, locked
• Removes locking from critical path
• Local enqueue/dequeue
• Periodically move split as computation progresses
• Reacquire work
• Release work (lockless)

Scioto: Scalable Collections of Task Objects

• Programmer expresses the computation as collection of tasks
• Tasks operate on data stored in PGAS (Global Arrays)
• Executed in collective task parallel phases
• Runtime system manages task execution / task parallel phases
• Load balancing, locality optimizations, fault resilience, etc

[Figure: an SPMD phase, a task parallel phase ending in termination, and another SPMD phase; each process (Proc0, Proc1, ..., Procn) has a queue with shared and private portions]

Scalable Work Stealing

• Enhancements to enable efficient scaling to 8,192 cores
• Highest known scaling for work stealing

1. Split work queues: Optimize local accesses, reduce locking on critical path
2. Work splitting (Steal-half): Reduce search time, improve work distribution
3. Aborting lock operations: Abort long waits on exhausted resources

[Figure: queue indices tail, split, nlocal]

3. Manage Contention: Aborting Steals

• ARMCI Locks: Bakery Algorithm
• Take a ticket, wait in line
• Fair, but if victim runs out of work must still wait to give up ticket
• Spinlocks:
• while(!atomic_swap(lock))
• Can give up at any time
• Spinlocks + Aborting Steals:
• Periodically check if we should abort lock()
• Avoid waits on stale resource
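The abortable-spinlock idea can be sketched with a standard thread lock. This is an analogy only: it uses Python's `threading.Lock` rather than ARMCI one-sided operations, and `lock_with_abort`, `should_abort`, and `poll_interval` are hypothetical names introduced for illustration.

```python
import threading
import time
from typing import Callable

def lock_with_abort(lock: threading.Lock,
                    should_abort: Callable[[], bool],
                    poll_interval: float = 0.001) -> bool:
    """Spin on a lock, but give up when the resource is no longer worth it.

    Returns True if the lock was acquired, False if the attempt was
    aborted (for example because the victim's queue ran out of work).
    """
    while not lock.acquire(blocking=False):   # one non-blocking attempt
        if should_abort():                    # periodic staleness check
            return False
        time.sleep(poll_interval)             # back off before retrying
    return True

# Usage: the lock is already held and the victim is reported as exhausted,
# so the thief abandons the attempt instead of waiting.
victim_lock = threading.Lock()
victim_lock.acquire()
print(lock_with_abort(victim_lock, should_abort=lambda: True))
victim_lock.release()
```

A ticket-based lock cannot do this, because a waiter that walks away would leave its ticket in the line.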

Experimental Setup and Benchmarks

• HP Infiniband Cluster
• 2,310 Nodes, 2x2.2GHz 4-core AMD
• BPC: Bouncing Producer Consumer
• Producer task migrates due to load balancing operations
• MADNESS: Comp. chemistry kernel
• Project 3-d function into oct-tree spatial representation
• UTS: Unbalanced Tree Search Benchmark
• Exhaustive parallel DFS on highly unbalanced tree

http://www.cse.ohio-state.edu/~teodores/arch/























The case for energy efficiency

• Mobility
• Battery life
• Energy cost
• Environment

Energy efficiency is now crucial to all computing markets, in particular the growth areas: mobile and cloud computing.























Near-threshold voltage (NTV)

[Figure: voltage axis from 0 to 1 marking Vth, NT Vdd, and nominal Vdd]

• Power reduction: 100X
• Frequency cost: 10X
• Energy reduction: 10X

Near-threshold computing is a promising energy-efficient solution.
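The three factors on this slide are tied together by energy = power x time. A minimal sketch of that arithmetic, assuming a fixed amount of work whose runtime scales inversely with frequency (the function name is illustrative):

```python
def relative_energy(power_reduction: float, frequency_cost: float) -> float:
    """Energy per task relative to nominal operation.

    power_reduction: factor by which power drops (100 on the slide).
    frequency_cost:  factor by which frequency drops, and hence the
                     factor by which runtime grows (10 on the slide).
    """
    relative_power = 1.0 / power_reduction
    relative_runtime = frequency_cost
    return relative_power * relative_runtime

energy = relative_energy(power_reduction=100.0, frequency_cost=10.0)
print(f"energy reduction: {1.0 / energy:.0f}X")
```

The 100X power saving is partly spent on the 10X longer runtime, which leaves the 10X energy reduction quoted above.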























Intel NTV prototype
























NTV faces significant challenges

• Reliability
[Figure: probability of SRAM bit failure (1E-10 to 1E+00, log scale) vs. supply voltage (900 down to 300 millivolts), marking Intel Vcc-min and a 5% error rate at NTV]

• Process Variation
[Figure: frequency distribution (axis from 0 to 2) at nominal voltage and at NTV]

• Voltage Variation
[Figure: voltage emergency]























Variation effects at NTV

delay = f(Vdd - Vth)

[Figure: voltage axis from 0 to 1 marking Vth and Vdd at NTV and at nominal voltage, comparing the nominal delay with the NTV delay]























Outline of our solutions

• Reliability: Parichute [micro2010]
• Process Variation: Booster [hpca2012]
• Voltage Variation: VRSync [isca2012]
• Voltage Speculation in Itanium II [isca2013]
















































SRAM failure rates

[Figure: probability of bit cell failure (1E-10 to 1E+00, log scale) vs. supply voltage (900 down to 300 millivolts), marking Intel Vcc-min and Parichute at 350mV (5%)]























Turbo product codes

[Figure: a data block with two sets of parity bits]
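The basic product-code idea, parity computed along two dimensions of the same data, can be illustrated with a toy example. This is a deliberately simplified sketch: it uses a single even-parity bit per row and per column with hard-decision correction, assumes the stored parity bits are intact, and is not the turbo decoding or the code used by Parichute. All names are illustrative.

```python
from typing import List, Tuple

Grid = List[List[int]]

def encode(data: Grid) -> Tuple[List[int], List[int]]:
    """Compute one even-parity bit per row and one per column."""
    row_parity = [sum(row) % 2 for row in data]
    col_parity = [sum(col) % 2 for col in zip(*data)]
    return row_parity, col_parity

def correct(data: Grid, row_parity: List[int], col_parity: List[int]) -> bool:
    """Repair a single flipped data bit in place.

    Returns True if no mismatch remains to be explained, False if the
    mismatch pattern cannot be caused by one flipped data bit.
    """
    rows_now, cols_now = encode(data)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows_now, row_parity)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols_now, col_parity)) if a != b]
    if not bad_rows and not bad_cols:
        return True                          # nothing detected
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        data[bad_rows[0]][bad_cols[0]] ^= 1  # the intersection is the bad bit
        return True
    return False                             # e.g. two errors in the same row

original = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
row_p, col_p = encode(original)

damaged = [row[:] for row in original]
damaged[1][2] ^= 1                           # inject a 1-bit error
print(correct(damaged, row_p, col_p), damaged == original)
```

Two errors in the same row cancel in that row's parity, which is why this toy scheme reports them as uncorrectable.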























Parichute ECC

[Figure: data bit indices arranged under Permutation 0, Permutation 1, Permutation 2, and Permutation 3; Permutation 0 and Permutation 1 label the two axes of the layout]























Parichute cache architecture

[Figure: Data In enters an encoder, where a permutation network produces Permutation 0, Permutation 1, ..., Permutation N of the data block (cache line). Each permutation feeds the parity encoders of one parity group (Parity Group 0 to Parity Group N), producing parity words (PW). Data + parity are stored in the cache (Line 0 to Line 7 shown, with data bits and redundant bits), and a decoder produces Data Out.]























Error correction example

[Figure: bits a to e arranged under different permutations and passed to Corrector 0, Corrector 1, Corrector 2, and Corrector 3; a 1-bit error is correctable (✓), a 2-bit error is not (✗)]























Experimental setup

• SRAM error model
• SPICE model of cell
• 8-way 2MB caches
• VARIUS

• Processor model
• SESC [Intel Core]
• CACTI & WATTCH

• Benchmarks
• SPECint, SPECfp 2000

• Prototype
• Verilog
• Synopsys Design Compiler
• Nangate 45nm standard cell
• Formality

Overhead used in simulations:

          Vdd      Freq     + Latency
Nominal   0.9V     3GHz     0
NTHigh    0.375V   460MHz   4
NTMid     0.350V   355MHz   4
NTLow     0.337V   300MHz   6























Error correction strength

[Figure: three panels plotting percent of lines correctable (0% to 100%) against errors in 512 data bits (0 to 35) for SECDED, OLSC 256, and Parichute 252]

Z. Chishti, A. R. Alameldeen, C. Wilkerson, W. Wu, and S. L. Lu, “Improving cache lifetime reliability at ultra-low voltages,” in International Symposium on Microarchitecture, December 2009.























Cache capacity

[Figure: remaining cache capacity (0% to 100%) for No Protection, SECDED, OLSC 256, and Parichute 252; x-axis from 600 down to 250, axis label not recoverable]

• Parichute: 50%, OLSC: 24%
• Parichute: 25%, OLSC: 7%























Parichute hardware overhead

• Encoder and decoder hardware
• 27628 standard cells
• Area: 0.056mm²
• Power: 11mW
• Critical path: 0.95ns (1GHz)

• Cache area
• +4%
















































Variation effects on frequency

[Figure: frequency distribution, axis from 0 to 2]

• Vdd = 900mV: Vth σ/μ = 12% (Vth = 210 ± 50mV), F σ/μ = 4.4% (F = 3GHz ± 260MHz)























Variation effects on frequency

[Figure: frequency distributions at Vdd = 900mV and at Vdd = 400mV, axis from 0 to 2]

• Vdd = 900mV: Vth σ/μ = 12% (Vth = 210 ± 50mV), F σ/μ = 4.4% (F = 3GHz ± 260MHz)
• Vdd = 400mV: Vth σ/μ = 12%, F σ/μ = 30.6% (F = 400 ± 245MHz)
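Why the same Vth spread hurts so much more at low Vdd can be illustrated with a textbook delay model. The sketch below assumes the alpha-power law, f proportional to (Vdd - Vth)^alpha / Vdd, with an illustrative exponent of 1.3. That model and exponent are assumptions, not the model behind the numbers on this slide, and the alpha-power law is itself only a rough guide close to threshold.

```python
ALPHA = 1.3  # assumed velocity-saturation exponent, illustrative only

def relative_frequency(vdd: float, vth: float, alpha: float = ALPHA) -> float:
    """Alpha-power-law estimate: f is proportional to (Vdd - Vth)^alpha / Vdd."""
    if vdd <= vth:
        raise ValueError("model only applies above threshold")
    return (vdd - vth) ** alpha / vdd

def frequency_spread(vdd: float,
                     vth_mean: float = 0.210,
                     vth_delta: float = 0.050) -> float:
    """Ratio of fastest to slowest frequency across Vth = mean +/- delta (volts)."""
    fast = relative_frequency(vdd, vth_mean - vth_delta)
    slow = relative_frequency(vdd, vth_mean + vth_delta)
    return fast / slow

for vdd in (0.9, 0.4):
    print(f"Vdd = {vdd:.1f} V: fastest/slowest = {frequency_spread(vdd):.2f}")
```

Because delay depends on the overdrive Vdd - Vth, a fixed ±50mV shift in Vth is a small fraction of the overdrive at 900mV and a large fraction of it at 400mV.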























Impact of frequency variation

[Figure: two panels plotting frequency against execution progress, one with no variation and one with NTV variation; the NTV panel marks the bottleneck and the wasted performance]























Dual-Vdd chip multiprocessor

[Figure: 16 cores (Core 0 to Core 15), each connected to a Vdd High rail and a Vdd Low rail]

• Each core assigned two power rails:
• NT Vdd High & Low, with Fhigh and Flow
• Cores can switch rapidly between the two rails and Fhigh and Flow























Frequency interpolation

Target: 1100 MHz

Core   Low Vdd (MHz)   High Vdd (MHz)   Time at Low Vdd   Time at High Vdd
C0     775             2025             74%               26%
C1     650             1775             60%               40%
C2     575             1625             50%               50%
C3     425             1375             29%               71%
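The time split for each core follows from requiring the time-weighted average of its two rail frequencies to equal the target. A minimal sketch of that calculation, assuming switching overhead is negligible (the function name is illustrative):

```python
def high_rail_fraction(f_low: float, f_high: float, f_target: float) -> float:
    """Fraction of time on the high rail so the average frequency is f_target.

    Solves (1 - x) * f_low + x * f_high = f_target for x.
    """
    if not f_low <= f_target <= f_high:
        raise ValueError("target must lie between the two rail frequencies")
    return (f_target - f_low) / (f_high - f_low)

# (Low Vdd MHz, High Vdd MHz) per core, taken from the slide.
cores = {"C0": (775, 2025), "C1": (650, 1775), "C2": (575, 1625), "C3": (425, 1375)}
for name, (f_low, f_high) in cores.items():
    high = high_rail_fraction(f_low, f_high, 1100)
    print(f"{name}: {1 - high:.0%} low, {high:.0%} high")
```

With the 1100 MHz target this reproduces the slide's splits: 74/26, 60/40, 50/50, and 29/71. The slowest core needs the most time on the high rail.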























Frequency interpolation - in action

[Figure: 64-core CMP with cores labeled Slowest, Slow, Fast, and Fastest]























Two Booster algorithms: VAR & SYNC

• Booster VAR:

• Eliminates heterogeneity: all cores appear to run at target F

• Booster SYNC

• Dynamically redistribute “boost” from blocked to active threads

• Use hints from synchronization primitives

• Hardware support

[Figure legend: thread states Blocked, Normal, Critical]























Experimental setup

• Processor
• Modeled by SESC
• Dual-issue OOO
• 32nm, 32 cores
• 3GHz at 900mV
• 300-2500MHz at NT
• NT at 400-635mV

• Benchmarks
• SPLASH-2
• PARSEC

• Circuit modeling
• SPICE
• Marković, et al.

• Variation modeling
• VARIUS























Booster runtime: Static workloads

[Figure: normalized execution time (0.5 to 1.1) for barnes, ocean, water-nsqd, cholesky, fft, lu, radix, blackscholes, fluidanimate, swaptions, dedup, streamcluster, and g.mean, comparing Heterogeneous, Hetero Scheduling, Booster VAR, and Booster SYNC; callouts mark 14% and 22%]























Booster runtime: Dynamic workloads

[Figure: normalized execution time (0.5 to 1.1) for radiosity, raytrace, volrend, bodytrack, and g.mean, comparing Heterogeneous, Hetero Scheduling, Booster VAR, and Booster SYNC; callouts mark 9% and 18%]
















































Voltage Variability

[Figure: two panels plotting V(out) (0.3 to 0.7 V) and I(load) (0 to 45 A) against time (0.0010 to 0.0015 s), with +10% and -10% voltage bounds. One panel shows normal operation within the guardband; the other shows a voltage emergency.]























Synchronization-Induced Voltage Emergencies

[Figure: voltage emergency]























Thread Synchronization Effects on CMP Power Profile

[Figure: four panels (4 cores, 8 cores, 16 cores, 32 cores) plotting power (0 to 70 Watts) and the number of cores in barrier against time (milliseconds); a callout marks "6X!"]























VRSync

• VRSync: voltage-aware synchronization library

• Reduces dI/dt caused by synchronization events

• Helps reduce voltage guardband

• Lower voltage guardband = Energy savings

• On average VRSync saves 33% energy
























Our solution: VRSync Barriers

[Figure: thread timelines (T0 to T7) for the Linear schedule and the Bulk schedule, with phases labeled First enter, Execution, Blocked on barrier 1, All in barrier 1, Delay, and All out]

[Figure: for each schedule, V(out) (400 to 700 mV, with a -10% bound) and the number of active cores (0 to 36) against time (0 to 300 µs)]























Experimental setup

• Processor
• Modeled by SESC
• 32nm, 32 cores
• 1GHz at 600mV

• Benchmarks
• SPLASH2
• PARSEC

• Circuit modeling
• SPICE
• Marković, et al.

• Voltage Regulator
• Linear Technology’s LTC3729L-6 polyphase
• LTspice

• Barrier
• Software Combining Tree Barrier























Eliminating Barrier-Induced Emergencies

[Figure: fluidanimate (PARSEC). Three panels (Baseline, Linear, Bulk) plotting power (0 to 90 Watts) and cores in barrier (0 to 32) against time (milliseconds), with emergencies marked]























Eliminating Barrier Emergencies

[Figure: lu (SPLASH2), Baseline vs. VRSync Bulk]























Eliminating Phase Alignment Emergencies

[Figure: fft (SPLASH2), Baseline vs. VRSync Linear]























Execution time overhead

[Figure: normalized execution time (0 to 1.6, with one clipped bar labeled 2.1) for radiosity, barnes, ocean, raytrace, water-nsquared, cholesky, fft, lu, radix, blackscholes, bodytrack, fluidanimate, swaptions, dedup, streamcluster, and g.mean under the Linear and Bulk schedules; callouts mark 11% and 6%]























VRSync: Energy Savings

Technique                      Guardband   Runtime   Power   Energy
Baseline with High Guardband   210mV       1.0       1.563   1.563
VRSync Linear                  60mV        1.112     0.98    1.086
VRSync Bulk                    60mV        1.063     0.99    1.049
VRSync Bulk Fast               160mV       1.045     1.361   1.422

Energy savings relative to the baseline: 31% for VRSync Linear, 33% for VRSync Bulk.

VRSync Bulk uses 33% less energy than the baseline with high guardband
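The savings figures can be rechecked from the table's own columns, since normalized energy is runtime times power and the saving is one minus the ratio to the baseline energy. The sketch below uses only the values in the table; the variable and function names are illustrative.

```python
# (runtime, power, energy), all normalized, copied from the table.
rows = {
    "Baseline with High Guardband": (1.0, 1.563, 1.563),
    "VRSync Linear": (1.112, 0.98, 1.086),
    "VRSync Bulk": (1.063, 0.99, 1.049),
    "VRSync Bulk Fast": (1.045, 1.361, 1.422),
}

def energy_saving(energy: float, baseline_energy: float) -> float:
    """Fraction of the baseline energy that is saved."""
    return 1.0 - energy / baseline_energy

baseline = rows["Baseline with High Guardband"][2]
for name, (runtime, power, energy) in rows.items():
    # The printed columns are rounded, so runtime * power only
    # approximately matches the tabulated energy.
    print(f"{name}: runtime*power = {runtime * power:.3f}, "
          f"table energy = {energy:.3f}, "
          f"saving = {energy_saving(energy, baseline):.1%}")
```

The small gaps between runtime x power and the Energy column are consistent with the Power column having been rounded to two decimals.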
















































Voltage speculation in Itanium II

• High voltage margins in modern CPUs - energy inefficient
• Voltage speculation techniques exist (Razor, etc.)
• Require dedicated hardware
• Reliability challenges lead to heavy use of on-chip ECC
• Caches, register file, TLBs, etc.
• Idea: leverage on-chip ECC to dynamically lower voltage margins (voltage speculation)























Voltage margin exploration

• HP BL860-i4 Server (2X 9560 Itanium II 8-core CPUs)
• Gradually lowered supply voltage (Vdd) for each core individually
• 1.1V nominal -> 0.9V, constant frequency (2.53GHz)
• Logged correctable errors, power consumption
• Recorded crashes, data corruption
• Experiments performed with HP stress test application, SPECjbb, SPECfp, and SPECint benchmarks























Correctable errors vs. Vdd

Observation: Correctable errors always triggered before uncorrectable ones, while running a stress test workload.

[Figure: error rate (0 to 18 errors/minute) vs. supply voltage (0.96 to 1.1) for an Itanium core, marking the correctable error range, the failure Vdd, and the unsafe Vdd region]























High variation in safe Vdd

[Figure: nominal Vdd, safe/min Vdd, and fail Vdd for Core 0 through Core 7; y-axis: supply voltage (V), 0.8 to 1.1.]

Core-to-core variation in safe/min Vdd: 0.96-1V























ECC-based voltage speculation

• Our solution: dynamically lower supply voltage

• Use correctable errors as “early warning system”

• Two-step approach:

  • Margin Voltage: determined post-manufacturing by running a stress test workload

  • Runtime reevaluation based on correctable error reports

• Monitoring & control implemented in firmware, transparent to OS

• Prototyped in HP Server with Itanium II CPUs
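
The first step of the approach above (margin discovery) can be sketched as a simple downward voltage sweep. This is a minimal illustration, not the firmware used in the prototype: `discover_margin_mv`, `count_errors`, the 5 mV sweep step, and the 900 mV floor are hypothetical names and values, and the sketch assumes the margin voltage sits one safety padding (the 10 mV shown on the margin discovery slide) above the first voltage at which a correctable error is reported.

```python
from typing import Callable, Optional

SAFETY_PADDING_MV = 10  # padding value shown on the margin discovery slide


def discover_margin_mv(
    count_errors: Callable[[int], int],  # hypothetical: runs the stress test at a Vdd, returns correctable errors
    nominal_mv: int = 1100,              # nominal Vdd from the slides (1.1 V)
    floor_mv: int = 900,                 # assumed lower bound of the sweep
    step_mv: int = 5,                    # assumed sweep step
) -> Optional[int]:
    """Lower Vdd from nominal until the first correctable error appears.

    Returns the first-error voltage plus the safety padding (capped at
    nominal), or None if no correctable error is seen down to the floor.
    """
    for vdd_mv in range(nominal_mv, floor_mv - 1, -step_mv):
        if count_errors(vdd_mv) > 0:
            return min(vdd_mv + SAFETY_PADDING_MV, nominal_mv)
    return None


# Example with a made-up core that starts reporting errors at 980 mV.
margin = discover_margin_mv(lambda mv: 0 if mv > 980 else 3)
```

Voltages are kept as integer millivolts so that repeated 5 mV steps do not accumulate floating-point error.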














Margin discovery and runtime

[Figure: core Vdd (supply voltage) and correctable core errors over time, across the discovery phase and runtime; marks the first error voltage, the safety padding (10 mV), and the margin voltage.]























Aggressive speculation

• Some applications/cores more amenable to voltage speculation

• Constant stream of correctable errors

[Figure: core Vdd (supply voltage) and correctable errors over time in aggressive mode; marks the margin voltage, the first error voltage, the safety padding (10 mV), the max and min error thresholds, and burst testing.]























Voltage speculation in action

[Figure: SPECjbb trace over 20 minutes showing margin voltage, core voltage (supply voltage, 0.965 V to 1 V), and error rate (per minute, 0 to 50).]























Power savings

[Figure: relative power (0.5 to 1) for SPECjbb2005, SPECint, and SPECfp; series: cores-only and CPU total.]

• Cores-only: 22% SPECjbb, 23% SPECint and 18% SPECfp

• Total (with uncore): 14% SPECjbb, 15% SPECint and 11% SPECfp














High variation in correctable errors

... a few corner cases that led to system crashes during the initial testing of this solution. This was root-caused to some applications triggering their correctable errors at much lower voltages than others (more than 10 mV difference). This translated to system crashes whenever another application that required a higher operating voltage was switched in for execution on the core. This issue was solved by making sure that the application running on an aggressive core can tolerate aggressive mode operation before lowering its voltage below the “margin voltage”.

6. Evaluation

In this section we examine the power and energy savings achieved by our dynamic voltage speculation system as well as its performance overhead. We begin by characterizing the process variation effects on voltage margins and types of errors triggered at low voltages.

6.1. Process Variation Effects

In order to characterize the effects of core-to-core process variation on voltage margins, we run our stress tests and benchmarks on each core while progressively lowering the voltage. We record the lowest supply voltage at which the stress test application runs for at least 20 minutes. We also collect all correctable error reports raised by the hardware at that supply voltage. Figure 9 shows the distribution of correctable errors for each core, for two different Itanium II processors. Both processors show a wide range of behaviors, with cores 0-2 of processor A (Figure 9a) showing a large number of correctable cache failures and core 4 exhibiting a large number of register file correctable errors. Most cores seem to trigger either cache errors or register file errors, depending on which critical paths in each core are affected by process variation. Cores 3 and 4 are the exception, triggering both cache and register file errors.

Processor B (Figure 9b) shows similar variability but with a different distribution of error rates and types. Processor B triggers fewer cache errors and slightly more correctable register file errors.

In general, we observed that correctable cache errors have a more graceful onset and are overall a better predictor for aggressive cores and aggressive mode operation beyond the margin voltage. Correctable register file errors, on the other hand, are an indication that the core’s execution pipeline is in the critical path of the core. In these cases the core is less tolerant of further voltage scaling, which could lead to errors. We classify cores that exhibit correctable register file errors as “conservative” and we always run them at a voltage that is above the one at which correctable register file errors are observed (margin voltage).
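
The classification rule in the paragraph above can be written down in a few lines. This is a sketch under stated assumptions rather than the authors' implementation: the function name, the label for cores that show no correctable errors at all, and the example counts are hypothetical. The only rule taken from the text is that any correctable register file error makes a core “conservative”.

```python
def classify_core(cache_errors: int, rf_errors: int) -> str:
    """Classify a core from its correctable error counts at the margin voltage."""
    if rf_errors > 0:
        # Register file errors suggest the execution pipeline is on the
        # critical path, so the core is not pushed below its margin voltage.
        return "conservative"
    if cache_errors > 0:
        # Only cache errors: candidate for aggressive mode operation.
        return "aggressive"
    # Assumption: the text does not say how a core with no errors is treated.
    return "unclassified"


# Made-up counts for illustration only.
labels = {
    "core0": classify_core(cache_errors=250, rf_errors=0),
    "core4": classify_core(cache_errors=40, rf_errors=120),
}
```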

6.2. Dynamic Adaptation to Workload

[Figure 9 plot: correctable cache errors and correctable RF errors per core (core0 to core7); panels (a) Processor A and (b) Processor B.]

Figure 9: Distribution of correctable error rates and error types over a 20 minute run of the stress test application at the margin voltage, for two Itanium 8-core processors.

[Figure 10 plot: margin voltage, core voltage (supply voltage, 0.965 V to 1 V), and error rate (per minute) over time (0 to 20 minutes).]

Figure 10: Dynamic adaptation of supply voltage to runtime conditions in SPECjbb running on an “aggressive” core.

The Voltage Speculation Governor continuously adjusts the supply voltage to ensure reliable operation. For “aggressive” cores, the governor attempts to lower supply voltage below the “margin voltage” as long as the rate of correctable errors is maintained at a target level. Figure 10 shows a trace of the supply voltage over time for the SPECjbb workload. We also show the correctable error rate for the same interval. The supply voltage is initially set at the margin voltage. The Voltage Speculation Governor lowers the voltage in 5 mV increments every minute of operation as long as the core exhibits an error rate of 1 correctable error per minute. The voltage is immediately raised back to the safety voltage when one of two events occurs over the current time interval: (1) the error rate increases above 1 error per minute or (2) no correctable errors are triggered over the previous interval.
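
The adjustment rule for aggressive cores described above can be sketched as a small state machine that is stepped once per one-minute interval. This is an illustrative sketch, not the firmware governor: the class and method names are hypothetical, voltages are kept in integer millivolts, and the “safety voltage” the text mentions is assumed here to be the core's margin voltage.

```python
from dataclasses import dataclass, field

STEP_MV = 5         # 5 mV decrement per interval, as stated in the text
TARGET_ERRORS = 1   # target rate: 1 correctable error per minute


@dataclass
class VoltageGovernor:
    margin_mv: int                    # margin voltage from the discovery phase
    vdd_mv: int = field(init=False)   # current core supply voltage

    def __post_init__(self) -> None:
        # The supply voltage starts at the margin voltage.
        self.vdd_mv = self.margin_mv

    def step(self, errors_last_minute: int) -> int:
        """Update Vdd from the correctable error count of the last interval."""
        if errors_last_minute == TARGET_ERRORS:
            # Error rate is at the target level: keep lowering.
            self.vdd_mv -= STEP_MV
        else:
            # More than 1 error per minute, or none at all: go back to
            # the safety voltage (assumed equal to the margin voltage).
            self.vdd_mv = self.margin_mv
        return self.vdd_mv


# Example trace with made-up error counts per minute.
gov = VoltageGovernor(margin_mv=1000)
trace = [gov.step(e) for e in [1, 1, 1, 3, 1, 0]]
```

A real governor would also need a lower voltage bound and the per-application tolerance check mentioned earlier; both are left out to keep the sketch close to the rule as stated.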

6.3. Power Reduction through Voltage Speculation

Voltage Speculation lowers supply voltage by an average of 9-11% across all benchmarks we test, as shown in Figure 11. The data is collected on Processor A, on which we identify cores 0, 1, 2 and 6 as “aggressive” and 3, 4, 5 and 7 as “conservative”.
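
As a rough consistency check, the textbook first-order model of dynamic power relates a voltage reduction to a power reduction at fixed frequency. This model is not taken from the source: the activity factor \alpha and the switched capacitance C are symbols introduced here for illustration, and leakage power is ignored.

```latex
\begin{align}
P_{\text{dyn}} &= \alpha C V_{dd}^{2} f
  & \frac{P'_{\text{dyn}}}{P_{\text{dyn}}} &= \left(\frac{V'_{dd}}{V_{dd}}\right)^{2}
  \quad (f \text{ constant}) \\
\frac{V'_{dd}}{V_{dd}} &= 0.91
  & 1 - 0.91^{2} &= 1 - 0.8281 \approx 17\% \\
\frac{V'_{dd}}{V_{dd}} &= 0.89
  & 1 - 0.89^{2} &= 1 - 0.7921 \approx 21\%
\end{align}
```

Under this model, a 9-11% reduction in supply voltage at constant frequency corresponds to roughly 17-21% lower dynamic power, which is in the same range as the cores-only savings of 18-23% reported on the power savings slide.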



Voltage Variation

Booster [hpca2012]

VRSync [isca2012]

[Figure: power (Watts) and number of cores in barrier over time (milliseconds).]
























Acknowledgments

• The Research Team:

• Timothy N. Miller, PhD 2012, now Assist. Prof. @ SUNY Binghamton

• Renji Thomas

• Xiang Pan

• Naser Sedaghati

• Anys Bacha


• The Sponsors:
