
Page 1

Designing Energy-Efficient Microprocessors in the Era of Unpredictable Transistors

Radu Teodorescu
Department of Computer Science and Engineering
The Ohio State University
http://arch.cse.ohio-state.edu

computer architecture research lab

Task Parallel Programming in the Partitioned Global Address Space
James Dinan and Prof. P. Sadayappan

PGAS Models and The Asynchronous Gap

• PGAS models provide an asynchronous irregular data model
  • E.g. Global Arrays, UPC, CAF
• Computation model is still regular, process-centric SPMD
  • Irregularity in the data can lead to load imbalance
• Scioto extends PGAS models to bridge asynchronous gap
  • Dynamic task-based view of the computation

[Figure: global array X[M][M][N] and the region X[1..9][1..9][1..9]]

Scioto Task Model

• Task Inputs: Global data, Immediates, Common Local Objects (CLO)
• Task Outputs: Global data, CLOs, Child tasks

[Figure: a task f(...) with In: 5, Y[0], ... and Out: Y[1]; shared data Y[0], Y[1], ..., Y[N] and a private CLO1 on each of Proc0, Proc1, ..., Procn]

Runtime System Design

• Per-process ARMCI circular task queues for efficient one-sided access
• Queues are prioritized by affinity
• Use the work first principle (LIFO task execution)
• Load balancing off the tail via random work stealing (FIFO stealing)

Introduction

This poster describes our work on Scioto, a new parallel programming model that provides scalable support for task parallel programming on distributed memory clusters. Scioto's task model complements existing Partitioned Global Address Space (PGAS) data models to form a complete environment for expressing and managing irregular and dynamic parallelism. The Scioto programming model is supported by a scalable runtime system that provides dynamic load balancing and reduces communication overheads by co-locating tasks with the data on which they operate. We present an evaluation of Scioto on several benchmarks, including the MADNESS computational chemistry kernel, and demonstrate strong scaling and high efficiency on an 8,192-core cluster.

2. Reduce Search Time: Work Splitting

• Problem: Search time grows with system size
• Strategy: Divide tasks evenly between victim and thief
  • Double number of work sources after each step
  • Reduce avg. time to find work to log(ncpus)
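The doubling argument behind work splitting can be checked with a small idealized simulation. This is only a sketch: it assumes every loaded process is paired with exactly one idle thief per round, whereas the poster describes random victim selection. The function name `rounds_to_spread_work` is hypothetical.

```python
def rounds_to_spread_work(ncpus, ntasks):
    """Idealized steal-half: count rounds until every process holds work.

    Assumption: in each round every process with more than one task is
    paired with one idle process, which takes half of its tasks.
    """
    queues = [ntasks] + [0] * (ncpus - 1)   # all work starts on process 0
    rounds = 0
    while any(q == 0 for q in queues):
        victims = [i for i, q in enumerate(queues) if q > 1]
        thieves = [i for i, q in enumerate(queues) if q == 0]
        if not victims:
            break                           # too few tasks to share further
        for victim, thief in zip(victims, thieves):
            stolen = queues[victim] // 2    # steal half of the victim's tasks
            queues[victim] -= stolen
            queues[thief] += stolen
        rounds += 1
    return rounds, queues


rounds, queues = rounds_to_spread_work(ncpus=8, ntasks=1024)
print(rounds)   # 3, i.e. log2(8)
print(queues)   # [128, 128, 128, 128, 128, 128, 128, 128]
```

With one source of work the number of sources doubles each round, so 8 processes are all busy after 3 rounds; under the same idealization 8,192 processes need 13.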

1. Optimize Local Accesses: Split Queues

• Queues are split into two parts:
  • Private: Local-only
  • Shared: Any, locked
• Removes locking from critical path
  • Local enqueue/dequeue
• Periodically move split as computation progresses
  • Reacquire work
  • Release work (lockless)
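The split-queue bookkeeping can be sketched as index arithmetic over one task array. This is a minimal illustration, not the Scioto implementation: the class and method names are hypothetical, a Python list stands in for the ARMCI circular buffer, and the sketch ignores the memory-ordering concerns a real lockless release must handle.

```python
import threading


class SplitQueue:
    """Tasks in [tail, split) are shared; tasks in [split, end) are private."""

    def __init__(self):
        self.tasks = []                 # grows at the private (head) end
        self.tail = 0                   # oldest shared task; thieves take here
        self.split = 0                  # boundary between shared and private
        self.lock = threading.Lock()    # guards the shared part only

    def push(self, task):
        self.tasks.append(task)         # local enqueue, no lock

    def pop(self):
        """Owner takes the newest private task (LIFO), or None if none left."""
        if len(self.tasks) == self.split:
            return None
        return self.tasks.pop()         # local dequeue, no lock

    def release(self, count):
        """Expose up to `count` of the oldest private tasks by moving the split."""
        private = len(self.tasks) - self.split
        self.split += min(count, private)

    def reacquire(self, count):
        """Move up to `count` shared tasks back to the private part (locked)."""
        with self.lock:
            shared = self.split - self.tail
            self.split -= min(count, shared)

    def steal(self):
        """Thief takes the oldest shared task (FIFO), or None if none shared."""
        with self.lock:
            if self.tail == self.split:
                return None
            task = self.tasks[self.tail]
            self.tail += 1
            return task


q = SplitQueue()
for t in range(6):
    q.push(t)
q.release(3)            # tasks 0, 1, 2 become stealable
print(q.pop())          # 5 (owner works LIFO on the private part)
print(q.steal())        # 0 (thief takes the oldest shared task)
```

Only `reacquire` and `steal` touch the lock, which is the sense in which the owner's enqueue and dequeue stay off the locked path.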

Scioto: Scalable Collections of Task Objects

• Programmer expresses the computation as collection of tasks
  • Tasks operate on data stored in PGAS (Global Arrays)
  • Executed in collective task parallel phases
• Runtime system manages task execution / task parallel phases
  • Load balancing, locality optimizations, fault resilience, etc.

[Figure: SPMD, task-parallel, and SPMD phases with termination; shared and private queue portions on Proc0, Proc1, ..., Procn]

Scalable Work Stealing

• Enhancements to enable efficient scaling to 8,192 cores
• Highest known scaling for work stealing

1. Split work queues
  • Optimize local accesses, reduce locking on critical path
2. Work splitting: Steal-half
  • Reduce search time, improve work distribution
3. Aborting lock operations
  • Abort long waits on exhausted resources

[Figure: task queue with tail, split, and nlocal markers]

3. Manage Contention: Aborting Steals

• ARMCI Locks: Bakery Algorithm
  • Take a ticket, wait in line
  • Fair, but if victim runs out of work must still wait to give up ticket
• Spinlocks:
  • while(!atomic_swap(lock))
  • Can give up at any time
• Spinlocks + Aborting Steals:
  • Periodically check if we should abort lock()
  • Avoid waits on stale resource
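An aborting spinlock acquire can be sketched as follows. This is a hedged illustration only: `threading.Lock.acquire(blocking=False)` stands in for the poster's `atomic_swap(lock)`, and the names `lock_or_abort`, `should_abort`, and `check_every` are hypothetical.

```python
import threading


def lock_or_abort(lock, should_abort, check_every=64):
    """Spin on a try-lock, giving up once the target is no longer worth it.

    Returns True if the lock was acquired, False if the attempt was aborted.
    """
    spins = 0
    while not lock.acquire(blocking=False):     # one try-lock per iteration
        spins += 1
        # Periodically ask whether waiting is still worthwhile.
        if spins % check_every == 0 and should_abort():
            return False
    return True


victim_lock = threading.Lock()
victim_lock.acquire()       # another thief currently holds the victim's lock
victim_has_work = False     # and the victim has since run out of work

acquired = lock_or_abort(victim_lock, lambda: not victim_has_work)
print(acquired)             # False: the steal is aborted instead of waiting
```

Unlike a ticket lock, nothing is queued on the thief's behalf, so returning early leaves no state to clean up.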

Experimental Setup and Benchmarks

• HP InfiniBand Cluster
  • 2,310 Nodes, 2 x 2.2 GHz 4-core AMD
• BPC: Bouncing Producer Consumer
  • Producer task migrates due to load balancing operations
• MADNESS: Comp. chemistry kernel
  • Project 3-d function into oct-tree spatial representation
• UTS: Unbalanced Tree Search Benchmark
  • Exhaustive parallel DFS on highly unbalanced tree

Page 2


The case for energy efficiency

• Mobility
• Battery life
• Energy cost
• Environment

Energy efficiency is now crucial to all computing markets, in particular the growth areas: mobile and cloud computing.

Page 3


Near-threshold voltage (NTV)

[Figure: voltage axis (0 to 1) marking Vth, NT Vdd, and nominal Vdd. Moving from nominal Vdd to NT Vdd: power reduction 100X, frequency cost 10X, energy reduction 10X]

Near-threshold computing, a promising energy-efficient solution.
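The three figures quoted on this slide are tied together by simple arithmetic: energy for a fixed task is power multiplied by run time, so a 100X power reduction paid for with a 10X frequency cost leaves a 10X energy reduction. A minimal check, assuming run time scales inversely with frequency (the function name `energy_reduction` is illustrative):

```python
def energy_reduction(power_reduction, frequency_cost):
    """Energy = power * time; time for fixed work grows by frequency_cost."""
    return power_reduction / frequency_cost


POWER_REDUCTION = 100.0   # power drops 100X at NTV (from the slide)
FREQUENCY_COST = 10.0     # frequency drops 10X, so run time grows 10X

print(energy_reduction(POWER_REDUCTION, FREQUENCY_COST))   # 10.0
```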

Page 4


Intel NTV prototype


Page 5


NTV faces significant challenges

Reliability

[Figure: probability of SRAM bit failure (log scale, 1E-10 to 1E+00) vs. supply voltage (900 mV down to 300 mV); annotations: Intel Vcc-min; NTV, 5% error rate]

Process Variation

[Figure: frequency distribution (x-axis 0 to 2) at nominal voltage and at NTV]

Voltage Variation

[Figure: voltage emergency]

Page 6


Variation effects at NTV

delay = f(Vdd - Vth)

[Figure: voltage axis (0 to 1) marking Vth and Vdd at NTV and at nominal, with the resulting delay ranges labelled Delay Nom. and Delay NTV]
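To see why the same threshold-voltage variation hurts more at NTV, one can plug numbers into a concrete choice of f. The sketch below assumes the alpha-power-law delay model, delay proportional to Vdd / (Vdd - Vth)^alpha, which is not stated on the slide and is only a rough approximation close to threshold. All voltages, the value of alpha, and the function names are illustrative assumptions.

```python
def gate_delay(vdd, vth, alpha=1.5):
    """Alpha-power-law delay in arbitrary units (assumed model)."""
    if vdd <= vth:
        raise ValueError("model only applies above threshold (vdd > vth)")
    return vdd / (vdd - vth) ** alpha


def delay_spread(vdd, vth_nominal, dvth, alpha=1.5):
    """Slowest-to-fastest delay ratio when Vth varies by +/- dvth."""
    slow = gate_delay(vdd, vth_nominal + dvth, alpha)
    fast = gate_delay(vdd, vth_nominal - dvth, alpha)
    return slow / fast


VTH = 0.30    # assumed threshold voltage, volts
DVTH = 0.03   # assumed +/- 30 mV Vth variation

nominal_spread = delay_spread(1.00, VTH, DVTH)   # assumed nominal Vdd
ntv_spread = delay_spread(0.45, VTH, DVTH)       # assumed near-threshold Vdd

print(f"nominal: {nominal_spread:.2f}x, NTV: {ntv_spread:.2f}x")
```

Under these assumptions the slowest device is about 1.14 times slower than the fastest at nominal voltage and about 1.84 times slower at NTV, because the fixed Vth shift is a much larger fraction of Vdd - Vth.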

Page 7


Outline of our solutions

Challenges: Reliability, Process Variation, Voltage Variation

• Parichute [micro2010]

[Figure: data with parity]

• Booster [hpca2012]
• VRSync [isca2012]

[Figure: power (Watts) and cores in barrier vs. time (milliseconds)]

• Voltage Speculation in Itanium II [isca2013]

Page 8

Outline of our solutions

Reliability, Process Variation, Voltage Variation

• Parichute [micro2010]

• Booster [hpca2012]

• VRSync [isca2012]

• Voltage Speculation in Itanium II [isca2013]

[Figures: data and parity block layout; plot of Power (Watts) and Cores in Barrier versus Time (milliseconds)]

Page 9: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the

SRAM failure rates

[Figure: Probability of Bit Cell Failure (log scale, 1E-10 to 1E+00) versus Supply Voltage (900 down to 300 millivolts). Annotations: Intel Vcc-min; Parichute; 350mV, 5%.]

Page 10: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the

Turbo product codes

[Figure: a Data block with two Parity regions]
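As a rough illustration of the product-code idea behind the Data/Parity/Parity layout, the sketch below arranges data bits in a grid and keeps one even-parity bit per row and one per column. This is a simplified single-parity product code chosen for illustration, not the turbo product code used in Parichute; the function names `encode` and `correct_single_error` and the sample data are hypothetical.

```python
def encode(grid):
    """Return (row_parity, col_parity) for a 2D list of 0/1 data bits."""
    row_parity = [sum(row) % 2 for row in grid]
    col_parity = [sum(col) % 2 for col in zip(*grid)]
    return row_parity, col_parity


def correct_single_error(grid, row_parity, col_parity):
    """Fix at most one flipped data bit in place.

    Returns the (row, col) of the repaired bit, or None if the parity
    mismatches do not point at exactly one position.
    """
    new_rows, new_cols = encode(grid)
    bad_rows = [i for i, (a, b) in enumerate(zip(new_rows, row_parity)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(new_cols, col_parity)) if a != b]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        grid[r][c] ^= 1  # flip the bit where the failing row and column meet
        return (r, c)
    return None


# Hypothetical 3x4 data block.
data = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0]]
rp, cp = encode(data)

# Simulate a single bit-cell failure, then repair it from the stored parity.
stored = [row[:] for row in data]
stored[1][2] ^= 1
fixed_at = correct_single_error(stored, rp, cp)
```

A single flipped bit makes exactly one row check and one column check fail, and their intersection locates it. Two errors in the same row or column break that property, which is the kind of case a stronger code has to handle.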

Page 11: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the

Parichute ECC

[Figure: data bit indices (0 to 511) arranged under Permutation 0, Permutation 1, Permutation 2, and Permutation 3; the two axes of the block are labeled ← Permutation 0 → and ← Permutation 1 →]

Page 12: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the

Parichute cache architecture

[Figure: block diagram with the following labeled parts.]

• Encoder: Data In, Data Block (cache line), Permutation Network (Permutation 0, Permutation 1, ..., Permutation N), Parity encoders, PW entries grouped into Parity Group 0, Parity Group 1, ..., Parity Group N

• Cache: Data+Parity, Line 0 through Line 7

• Decoder: Data Out

• Legend: Data bits, Redundant bits

Page 13: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the

Error correction example

[Figure: bits a through e shown in several different arrangements, feeding Corrector 0, Corrector 1, Corrector 2, and Corrector 3. Legend: 1-bit error ✓, 2-bit error ✗]
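One reading of this figure is that each corrector handles a single error within a parity group but not two, and that a different permutation can place the same two failing bits in different groups. The sketch below illustrates that idea only. The group size, the stride permutation, and the names `parity_groups` and `correctable` are assumptions made for the example, not the permutations or correctors used in Parichute.

```python
def parity_groups(permutation, group_size):
    """Split a permuted ordering of bit indices into consecutive parity groups."""
    return [set(permutation[i:i + group_size])
            for i in range(0, len(permutation), group_size)]


def correctable(groups, error_bits):
    """A single-error corrector copes only if no group holds two or more errors."""
    return all(len(group & error_bits) <= 1 for group in groups)


n_bits, group_size = 16, 4

# Permutation "0": bits in their natural order.
identity = list(range(n_bits))

# Hypothetical permutation "1": a stride ordering 0, 4, 8, 12, 1, 5, ...
stride = [(i % 4) * 4 + i // 4 for i in range(n_bits)]

# Two failing bits that are neighbours in the natural order.
errors = {0, 1}

ok_identity = correctable(parity_groups(identity, group_size), errors)
ok_stride = correctable(parity_groups(stride, group_size), errors)
```

With the natural order, bits 0 and 1 share a group, so that view sees a 2-bit error. Under the stride ordering they land in separate groups, so each group sees at most one error.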

Page 14: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the

Experimental setup

• SRAM error model
  • SPICE model of cell
  • 8-way 2MB caches
  • VARIUS

• Processor model
  • SESC [Intel Core]
  • CACTI & WATTCH

• Benchmarks
  • SPECint, SPECfp 2000

• Prototype
  • Verilog
  • Synopsys Design Compiler
  • Nangate 45nm standard cell
  • Formality

Configuration   Vdd      Freq     + Latency
Nominal         0.9V     3GHz     0
NTHigh          0.375V   460MHz   4
NTMid           0.350V   355MHz   4
NTLow           0.337V   300MHz   6

(+ Latency: overhead used in simulations)
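To put the operating points above in perspective, the sketch below compares each one with the nominal setting using the common first-order approximation that dynamic energy per operation scales with Vdd squared and dynamic power with Vdd squared times frequency. That approximation ignores leakage and is an assumption of this example, not a result from the talk; the `configs` dictionary and `relative_to_nominal` are hypothetical names.

```python
# (Vdd in volts, frequency in MHz), copied from the table.
configs = {
    "Nominal": (0.900, 3000.0),
    "NTHigh": (0.375, 460.0),
    "NTMid": (0.350, 355.0),
    "NTLow": (0.337, 300.0),
}


def relative_to_nominal(name):
    """Return (frequency, dynamic energy/op, dynamic power) ratios vs. Nominal."""
    v, f = configs[name]
    v0, f0 = configs["Nominal"]
    freq_ratio = f / f0
    energy_ratio = (v / v0) ** 2              # energy per operation ~ V^2
    power_ratio = energy_ratio * freq_ratio   # dynamic power ~ V^2 * f
    return freq_ratio, energy_ratio, power_ratio


for name in configs:
    freq, energy, power = relative_to_nominal(name)
    print(f"{name:8s} freq x{freq:.3f}  energy/op x{energy:.3f}  power x{power:.4f}")
```

Under this model NTLow runs at one tenth of the nominal frequency while its dynamic energy per operation falls to roughly 14% of nominal.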

Page 15: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the

Error correction strength

[Figure: percent of lines correctable (0% to 100%) vs. errors in 512 data bits (0 to 35), in three panels: SECDED, OLSC 256, and Parichute 252.]

Z. Chishti, A. R. Alameldeen, C. Wilkerson, W. Wu, and S. L. Lu, "Improving cache lifetime reliability at ultra-low voltages," in International Symposium on Microarchitecture, December 2009.
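As a rough illustration of why a weak code loses lines quickly as errors accumulate, the sketch below computes the fraction of 512-bit lines that remain correctable when k bit errors land uniformly at random. It assumes, hypothetically, that SECDED is applied independently to eight 64-bit words and that errors fall only in data bits. The slide does not state the protection granularity, so the numbers are illustrative and are not read off the plot.

```python
from math import comb

LINE_BITS = 512   # data bits per cache line (from the slide's axis label)
WORD_BITS = 64    # assumed SECDED word size (hypothetical, not from the slide)
WORDS = LINE_BITS // WORD_BITS


def secded_correctable_fraction(k: int) -> float:
    """Fraction of k-error patterns with at most one error per word."""
    if k < 0 or k > LINE_BITS:
        raise ValueError("k must be between 0 and LINE_BITS")
    if k > WORDS:
        return 0.0  # pigeonhole: some word must hold two or more errors
    # Choose which k words are hit, then one bit position inside each.
    good_patterns = comb(WORDS, k) * WORD_BITS ** k
    return good_patterns / comb(LINE_BITS, k)


if __name__ == "__main__":
    for k in range(10):
        print(k, f"{secded_correctable_fraction(k):.4f}")
```

Under these assumptions a line with more than eight errors can never be corrected, which is consistent with stronger codes being needed at high error counts.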

Page 16


Cache capacity

[Figure: remaining cache capacity (0% to 100%) for No Protection, SECDED, OLSC 256, and Parichute 252; x-axis ticks run from 600 down to 250. Annotations: Parichute 50% vs. OLSC 24%; Parichute 25% vs. OLSC 7%.]

Page 17


Parichute hardware overhead

• Encoder and decoder hardware
  • 27,628 standard cells
  • Area: 0.056 mm²
  • Power: 11 mW
  • Critical path: 0.95 ns (1 GHz)
• Cache area
  • +4%

Page 18


Outline of our solutions

Reliability, Process Variation, Voltage Variation

• Parichute [micro2010]
• Booster [hpca2012]
• VRSync [isca2012]
• Voltage Speculation in Itanium II [isca2013]

[Diagram labels: Data, Parity, Parity.]

[Figure: power (Watts, 0 to 70) and cores in barrier (0 to 32) vs. time (36 to 50 milliseconds).]

Page 19


Variation effects on frequency

[Figure: frequency distribution, x-axis from 0 to 2.]

• Vdd = 900mV: Vth = 210 ± 50mV (Vth σ/μ = 12%), F = 3GHz ± 260MHz (F σ/μ = 4.4%)

Page 20


Variation effects on frequency

[Figure: frequency distribution, x-axis from 0 to 2.]

• Vdd = 900mV: Vth = 210 ± 50mV (Vth σ/μ = 12%), F = 3GHz ± 260MHz (F σ/μ = 4.4%)
• Vdd = 400mV: Vth σ/μ = 12%, F = 400 ± 245MHz (F σ/μ = 30.6%)
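The ± ranges quoted on this slide appear to span two standard deviations: 50/210 ≈ 24% ≈ 2 × 12%, 260/3000 ≈ 8.7% ≈ 2 × 4.4%, and 245/400 ≈ 61% ≈ 2 × 30.6%. The sketch below illustrates why the same threshold-voltage spread hurts far more at low supply voltage. It is a minimal model under stated assumptions, not the model behind the slide: it uses the alpha-power law F ∝ (Vdd − Vth)^α / Vdd with an assumed exponent `ALPHA = 1.5`, Gaussian Vth, and the same 210mV mean at both voltages. This simple law is known to be less accurate near threshold, so it should reproduce the trend but not the slide's 4.4% and 30.6% figures.

```python
import random
import statistics

ALPHA = 1.5  # assumed velocity-saturation exponent (hypothetical value)


def relative_frequency(vdd_mv: float, vth_mv: float) -> float:
    """Alpha-power-law frequency, up to a constant factor."""
    overdrive = max(vdd_mv - vth_mv, 0.0)  # clamp: no switching below threshold
    return overdrive ** ALPHA / vdd_mv


def frequency_spread(vdd_mv: float, vth_mean_mv: float = 210.0,
                     vth_sigma_over_mu: float = 0.12,
                     samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of sigma/mu of frequency for Gaussian Vth."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    sigma_mv = vth_mean_mv * vth_sigma_over_mu
    freqs = [relative_frequency(vdd_mv, rng.gauss(vth_mean_mv, sigma_mv))
             for _ in range(samples)]
    return statistics.pstdev(freqs) / statistics.fmean(freqs)


if __name__ == "__main__":
    for vdd in (900.0, 400.0):
        print(f"Vdd = {vdd:.0f}mV: F sigma/mu = {frequency_spread(vdd):.1%}")
```

The key term is the overdrive Vdd − Vth: a fixed Vth deviation is a small fraction of 690mV of overdrive at 900mV, but a large fraction of 190mV at 400mV.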

Page 21


Impact of frequency variation

[Figure: two panels, "No variation" and "NTV Variation", each plotting frequency against execution progress. Labels: "Wasted Perf." and "Bottleneck".]
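One reading of the "Bottleneck" and "Wasted Perf." labels is that, in a parallel phase where every core gets the same amount of work, the slowest core sets the finish time and faster cores idle while they wait. The sketch below quantifies that reading. The equal-work assumption and the function names are illustrative and do not come from the slide.

```python
def barrier_finish_time(freqs_mhz, work_mcycles):
    """Seconds until the slowest core finishes its equal share of work."""
    if not freqs_mhz or min(freqs_mhz) <= 0:
        raise ValueError("need at least one core with positive frequency")
    return work_mcycles / min(freqs_mhz)  # Mcycles / MHz = seconds


def wasted_fraction(freqs_mhz):
    """Share of available cycles that cores spend idle at the barrier."""
    if not freqs_mhz or min(freqs_mhz) <= 0:
        raise ValueError("need at least one core with positive frequency")
    slowest = min(freqs_mhz)
    # Every core does only as many cycles of work as the slowest one can.
    return 1.0 - len(freqs_mhz) * slowest / sum(freqs_mhz)


if __name__ == "__main__":
    print(wasted_fraction([400, 400, 400, 400]))  # no variation
    print(wasted_fraction([200, 400, 600, 400]))  # one slow core
```

With uniform frequencies nothing is wasted. With one core at half the average frequency, half of the chip's available cycles go unused in this model.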

Page 22


Dual-Vdd chip multiprocessor

[Diagram: Core 0 through Core 15 (grid shown twice), with power rails labeled Vdd High and Vdd Low.]

• Each core is assigned two power rails:
  • NT Vdd High and NT Vdd Low, with frequencies Fhigh and Flow
• Cores can switch rapidly between the two rails, and therefore between Fhigh and Flow
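If a core can alternate quickly between the two rails, its average speed over an interval is the time-weighted mix of Fhigh and Flow. The sketch below computes the share of time a core would spend on the high rail to reach a target effective frequency. It is only an illustration: the slide does not describe a switching policy, the function names are hypothetical, and switching overhead is ignored.

```python
def high_rail_fraction(f_target, f_low, f_high):
    """Fraction of time on the high rail needed to average f_target."""
    if not f_low < f_high:
        raise ValueError("f_low must be below f_high")
    frac = (f_target - f_low) / (f_high - f_low)
    return min(max(frac, 0.0), 1.0)  # clamp targets outside [f_low, f_high]


def effective_frequency(frac_high, f_low, f_high):
    """Time-weighted average frequency for a given high-rail fraction."""
    return frac_high * f_high + (1.0 - frac_high) * f_low


if __name__ == "__main__":
    frac = high_rail_fraction(500.0, 400.0, 600.0)
    print(frac, effective_frequency(frac, 400.0, 600.0))
```

A slow core could use a larger high-rail fraction than a fast one so that all cores reach a similar effective frequency, at the cost of more time spent at the higher voltage.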

Page 23


� Reacquire%work

� Release%work%(lockless)

Scioto:%Scalable%Collections%of%Task%Objects

� Programmer%expresses the computation%as%collection%of%tasks

� Tasks%operate%on%data%stored%in%PGAS%(Global%Arrays)

� Executed%in%collective%task%parallel%phases

� Runtime%system%manages%task%execution%/%task%parallel%phases

� Load%balancing,%locality%optimizations,%fault%resilience,%etc

SPMD

SPMD

TaskParallel

�����������������������n

Termination

Shared

Private

Proc0 Proc1 Procn

Scalable%Work%Stealing� Enhancements%to%enable%efficient%scaling%to%8,192%cores� Highest%known%scaling%for%work%stealing

1. Split%work%queues� Optimize%local%accesses,%reduce%locking%on%critical%path

2. Work%splitting:%Steal:half� Reduce%search%time,%improve%work%distribution

3. Aborting%lock%operations� Abort long%waits%on%exhausted%resources

tailsplitnlocal

3.%Manage%Contention:%Aborting%Steals

� ARMCI%Locks:%BakeryAlgorithm

� Take%a%ticket,%wait%in%line� Fair,%but%if%victim%runs%outof%work%must%still%wait%togive%up%ticket

� Spinlocks:

� while(!atomic_swap(lock))%

� Can%give%up%at%any%time

� Spinlocks%+%Aborting%Steals:

� Periodically%check%if%we%should%abort%lock()

� Avoid%waits on%%stale%resource

Experimental%Setup%and%Benchmarks

� HP%Infiniband Cluster

� 2,310%Nodes,%2x2.2GHz%4:core%AMD%

� BPC:%Bouncing%Producer%Consumer� Producer%task%migrates%due%toload%balancing%operations

� MADNESS:%Comp.%chemistry%kernel� Project%3:d%function%into%oct:tree%spatial%representation

� UTS:%Unbalanced%Tree%Search%Benchmark� Exhaustive%parallel%DFS%on%highly%unbalanced%tree

computerarchitectureresearch lab

Frequency interpolation


Target: 1100 MHz

Core   Low Vdd (MHz)   High Vdd (MHz)   Time at Low Vdd   Time at High Vdd
C0     775             2025             74%               26%
C1     650             1775             60%               40%
C2     575             1625             50%               50%
C3     425             1375             29%               71%
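The percentages on this slide are consistent with a simple time-weighted average of the two rail frequencies. The sketch below assumes that model and ignores any rail-switching overhead; the function name `rail_time_split` and the `CORES` table are illustrative only and are not taken from the talk.

```python
def rail_time_split(f_low_mhz, f_high_mhz, f_target_mhz):
    """Return (low_share, high_share): the fraction of time to spend on each
    rail so that the time-weighted average frequency equals the target.

    Assumption: effective frequency = low_share * f_low + high_share * f_high,
    with no cost for switching rails.
    """
    if not f_low_mhz <= f_target_mhz <= f_high_mhz:
        raise ValueError("target must lie between the low-rail and high-rail frequencies")
    if f_high_mhz == f_low_mhz:
        # Both rails give the same frequency; staying on the low rail is enough.
        return 1.0, 0.0
    low_share = (f_high_mhz - f_target_mhz) / (f_high_mhz - f_low_mhz)
    return low_share, 1.0 - low_share


# Per-core (low-Vdd MHz, high-Vdd MHz) pairs as listed on the slide.
CORES = {"C0": (775, 2025), "C1": (650, 1775), "C2": (575, 1625), "C3": (425, 1375)}

if __name__ == "__main__":
    for name, (f_low, f_high) in CORES.items():
        low, high = rail_time_split(f_low, f_high, 1100)
        print(f"{name}: {low:.0%} at low Vdd, {high:.0%} at high Vdd")
```

For C0, for example, (2025 - 1100) / (2025 - 775) = 0.74, which matches the 74% / 26% split listed for that core.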


Frequency interpolation - in action


[Figure: 64-core CMP. Legend: Slowest, Slow, Fast, Fastest.]


Two Booster algorithms: VAR & SYNC

• Booster VAR:

• Eliminates heterogeneity: all cores appear to run at target F

• Booster SYNC:

• Dynamically redistribute “boost” from blocked to active threads

• Use hints from synchronization primitives

• Hardware support
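As a rough illustration of the Booster SYNC idea of moving "boost" away from blocked threads, the sketch below hands out a fixed budget of high-rail time by thread state. Everything in it is an assumption made for illustration: the three states, the priority order (critical first, then normal), the notion of a numeric `boost_budget`, and the names `ThreadState` and `redistribute_boost`. The slides do not specify the actual policy or its hardware implementation.

```python
from enum import Enum


class ThreadState(Enum):
    BLOCKED = "blocked"    # e.g. waiting at a barrier or on a lock
    NORMAL = "normal"      # running ordinary code
    CRITICAL = "critical"  # e.g. holding a lock that other threads wait on


def redistribute_boost(states, boost_budget):
    """Return the share of time (0.0 to 1.0) each core spends on the high rail.

    Hypothetical policy: blocked cores get no boost; critical cores are served
    first, then normal cores, each group splitting what is left evenly.
    boost_budget is the total high-rail share available across all cores.
    """
    shares = [0.0] * len(states)
    remaining = float(boost_budget)
    for wanted in (ThreadState.CRITICAL, ThreadState.NORMAL):
        cores = [i for i, state in enumerate(states) if state is wanted]
        if not cores or remaining <= 0:
            continue
        per_core = min(1.0, remaining / len(cores))
        for i in cores:
            shares[i] = per_core
        remaining -= per_core * len(cores)
    return shares


if __name__ == "__main__":
    states = [ThreadState.BLOCKED, ThreadState.NORMAL,
              ThreadState.CRITICAL, ThreadState.NORMAL]
    print(redistribute_boost(states, boost_budget=2.0))
```

In this toy policy a core that blocks frees its share, so the remaining active cores see a larger high-rail fraction the next time the function is evaluated.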


[Figure legend: thread states Blocked, Normal, Critical.]


Experimental setup


• Processor
  • Modeled by SESC
  • Dual-issue OOO
  • 32nm, 32 cores
  • 3GHz at 900mV
  • 300-2500MHz at NT
  • NT at 400-635mV

• Benchmarks
  • SPLASH-2
  • PARSEC

• Circuit modeling
  • SPICE
  • Marković et al.

• Variation modeling
  • VARIUS


Booster runtime: Static workloads

[Figure: Normalized execution time (y-axis, 0.5 to 1.1) for barnes, ocean, water-nsqd, cholesky, fft, lu, radix, blackscholes, fluidanimate, swaptions, dedup, streamcluster, and the geometric mean. Series: Heterogeneous, Hetero Scheduling, Booster VAR, Booster SYNC. Annotations on the plot: 14% and 22%.]


Booster runtime: Dynamic workloads

[Figure: Normalized execution time (y-axis, 0.5 to 1.1) for radiosity, raytrace, volrend, bodytrack, and the geometric mean. Series: Heterogeneous, Hetero Scheduling, Booster VAR, Booster SYNC. Annotations on the plot: 9% and 18%.]


Outline of our solutions


Challenges: Reliability, Process Variation, Voltage Variation

• Parichute [micro2010] (figure labels: Data, Parity, Parity)

• Booster [hpca2012]

• VRSync [isca2012] (figure: Cores in Barrier and Power (Watts) plotted against Time (milliseconds))

• Voltage Speculation in Itanium II [isca2013]

Page 30: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


Voltage Variability

[Figure: two plots of V(out) (V) and I(load) (A) against time (s), with the +10% and -10% levels marked. Labels: "Voltage emergency!", "Normal operation", "Guardband".]

Page 31: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


Synchronization-Induced Voltage Emergencies


Voltage Emergency

Page 32: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


Thread Synchronization Effects on CMP Power Profile

[Figure: four panels (4, 8, 16, and 32 cores) plotting Power (Watts) and Cores in Barrier against Time (milliseconds). Annotation: "6X!".]

Page 33: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


VRSync

• VRSync: voltage-aware synchronization library

• Reduces dI/dt caused by synchronization events

• Helps reduce voltage guardband

• Lower voltage guardband = Energy savings

• On average VRSync saves 33% energy
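To see why a smaller guardband can translate into energy savings, here is a rough back-of-the-envelope sketch in Python. It assumes that dynamic energy per operation scales with the square of the supply voltage (E ≈ C·V²) and that the supply is set to a minimum safe voltage plus a fractional guardband. The function names and the guardband values are illustrative assumptions, not figures from the talk, and the sketch does not attempt to reproduce the 33% result.

```python
def dynamic_energy_ratio(guardband_old: float, guardband_new: float) -> float:
    """Ratio of dynamic energy per operation after vs. before changing the guardband.

    Assumption: supply voltage = v_min * (1 + guardband) and energy ~ C * V^2,
    so both v_min and C cancel out of the ratio.
    """
    if guardband_old < 0 or guardband_new < 0:
        raise ValueError("guardbands must be non-negative fractions")
    return ((1.0 + guardband_new) / (1.0 + guardband_old)) ** 2


def energy_savings(guardband_old: float, guardband_new: float) -> float:
    """Fraction of dynamic energy saved by moving to the new guardband."""
    return 1.0 - dynamic_energy_ratio(guardband_old, guardband_new)


if __name__ == "__main__":
    # Hypothetical example: start from a 10% guardband and shrink it.
    for new in (0.10, 0.05, 0.0):
        saved = energy_savings(0.10, new)
        print(f"guardband 10% -> {new:.0%}: saves {saved:.1%} dynamic energy")
```

The quadratic dependence is the reason even a few percent of guardband matters. Static (leakage) energy and any performance cost of the synchronization changes are left out of this model.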


Page 34: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


Our solution: VRSync Barriers

[Figure: thread timelines (threads T0 to T7 against Time) for the linear schedule and the bulk schedule. Phases labeled: First enter, Execution, Blocked on barrier 1, All in barrier 1, Delay, All out.]

[Figure: two plots of V(out) (mV) and No. of active cores against time (µs), with the -10% level marked.]
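The intuition behind spreading out barrier wake-ups can be illustrated with a toy model. This is not VRSync code: the function names, the fixed per-step group size, and the use of the change in active-core count as a stand-in for the change in load current are all assumptions made for illustration. The sketch compares waking every core in one step against waking them in smaller groups.

```python
from __future__ import annotations


def active_core_trace(num_cores: int, group_size: int) -> list[int]:
    """Number of active cores after each release step.

    All cores start blocked (0 active). Each step wakes up to `group_size`
    cores; group_size == num_cores models releasing everyone at once.
    """
    if num_cores <= 0 or group_size <= 0:
        raise ValueError("num_cores and group_size must be positive")
    trace = [0]
    while trace[-1] < num_cores:
        trace.append(min(num_cores, trace[-1] + group_size))
    return trace


def max_step(trace: list[int]) -> int:
    """Largest jump in active cores between consecutive steps (a crude dI/dt proxy)."""
    return max(b - a for a, b in zip(trace, trace[1:]))


if __name__ == "__main__":
    cores = 32  # hypothetical core count for the example
    for group in (32, 8, 1):
        trace = active_core_trace(cores, group)
        print(f"group size {group:2d}: {len(trace) - 1:2d} steps, "
              f"largest jump = {max_step(trace)} cores")
```

Smaller groups lower the largest jump at the cost of more release steps, which is the trade-off between added delay and a gentler load change that the schedules in the figures appear to explore.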

Page 35: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


Experimental setup

• Processor: modeled by SESC; 32nm, 32 cores; 1GHz at 600mV

• Benchmarks: SPLASH2, PARSEC

• Circuit modeling: SPICE; Marković et al.*

• Voltage regulator: Linear Technology's LTC3729L-6 polyphase; LTspice

• Barrier: Software Combining Tree Barrier

Page 36: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


Eliminating Barrier-Induced Emergencies

[Figure: three panels (Baseline, Linear, Bulk) for fluidanimate (PARSEC), plotting Power (Watts) and Cores in Barrier against Time (milliseconds), with "Emergency" markers.]

Page 37: Designing Energy-Efficient Microprocessors in the Era of ...web.cse.ohio-state.edu/~teodorescu.1/download/slides/radu_wiscon… · Designing Energy-Efficient Microprocessors in the


Eliminating Barrier Emergencies

[Figure: lu (splash2); panels: VRSync Bulk and Baseline]

Eliminating Phase Alignment Emergencies

[Figure: fft (splash2); panels: Baseline and VRSync Linear]

Execution time overhead

[Figure: normalized execution time (y-axis 0 to 1.6, one off-scale bar labeled 2.1) for VRSync Linear and VRSync Bulk on radiosity, barnes, ocean, raytrace, water-nsquared, cholesky, fft, lu, radix, blackscholes, bodytrack, fluidanimate, swaptions, dedup, streamcluster, and their geometric mean. Callouts: 11% and 6%.]

VRSync: Energy Savings

Technique                      Guardband   Runtime   Power   Energy
Baseline with High Guardband   210 mV      1.0       1.563   1.563
VRSync Linear                  60 mV       1.112     0.98    1.086
VRSync Bulk                    60 mV       1.063     0.99    1.049
VRSync Bulk Fast               160 mV      1.045     1.361   1.422

Callouts: 31% (VRSync Linear) and 33% (VRSync Bulk)

VRSync Bulk is 33% more energy efficient than baseline with high guardband
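The 31% and 33% figures can be checked from the table with a few lines of arithmetic. The sketch below assumes that the Energy column is approximately Runtime × Power (which holds to within rounding for every row) and that savings are measured against the high-guardband baseline row. The names `RESULTS` and `energy_savings` are illustrative and do not come from the slides.

```python
# Values copied from the table: (guardband in mV, runtime, power, energy).
RESULTS = {
    "Baseline with High Guardband": (210, 1.0, 1.563, 1.563),
    "VRSync Linear": (60, 1.112, 0.98, 1.086),
    "VRSync Bulk": (60, 1.063, 0.99, 1.049),
    "VRSync Bulk Fast": (160, 1.045, 1.361, 1.422),
}

BASELINE = "Baseline with High Guardband"


def energy_savings(name: str) -> float:
    """Fractional energy reduction of `name` relative to the baseline row."""
    baseline_energy = RESULTS[BASELINE][3]
    return 1.0 - RESULTS[name][3] / baseline_energy


for name, (guardband_mv, runtime, power, energy) in RESULTS.items():
    # Assumption: energy is runtime * power; the table agrees to within rounding.
    product = runtime * power
    print(f"{name}: runtime*power={product:.3f}, "
          f"reported energy={energy:.3f}, "
          f"savings vs. baseline={energy_savings(name):.1%}")
```

Under these assumptions, VRSync Linear and VRSync Bulk round to the 31% and 33% callouts, while VRSync Bulk Fast gives up most of the savings in exchange for a smaller runtime penalty.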

Outline of our solutions

Reliability | Process Variation | Voltage Variation

• Parichute [micro2010]

• Booster [hpca2012]

• VRSync [isca2012]

• Voltage Speculation in Itanium II [isca2013]

[Figure: power (Watts) and cores in barrier vs. time (milliseconds)]

Voltage speculation in Itanium II

• High voltage margins in modern CPUs are energy inefficient

• Voltage speculation techniques exist (Razor, etc.), but they require dedicated hardware

• Reliability challenges lead to heavy use of on-chip ECC (caches, register file, TLBs, etc.)

• Idea: leverage on-chip ECC to dynamically lower voltage margins (voltage speculation)

Voltage margin exploration

• HP BL860-i4 Server (2X 9560 Itanium II 8-core CPUs)

• Gradually lowered supply voltage (Vdd) for each core individually

• From 1.1 V nominal down to 0.9 V, at constant frequency (2.53 GHz)

• Logged correctable errors, power consumption

• Recorded crashes, data corruption

• Experiments performed with HP stress test application, SPECjbb, SPECfp, and SPECint benchmarks
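To get a rough sense of what such a sweep could buy, the sketch below applies the standard first-order model in which dynamic power scales with V² · f. The model is an assumption brought in for illustration: it ignores leakage and is not a result from the slides, which report measured power instead. The function name `dynamic_power_ratio` is hypothetical.

```python
def dynamic_power_ratio(v_new: float, v_old: float,
                        f_new: float = 1.0, f_old: float = 1.0) -> float:
    """Ratio of dynamic power after/before, assuming P_dyn ~ C * V^2 * f.

    Switched capacitance C is assumed unchanged, so it cancels out.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Full sweep range from the slide: 1.1 V nominal down to 0.9 V,
# with frequency held constant (so the frequency ratio is 1).
ratio = dynamic_power_ratio(v_new=0.9, v_old=1.1)
print(f"dynamic power ratio at 0.9 V vs. 1.1 V: {ratio:.3f}")
```

Under this simplified model, the bottom of the sweep corresponds to roughly a one-third reduction in dynamic power. How much of that range is actually usable depends on where errors start to appear for each core.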

Correctable errors vs. Vdd


Observation: Correctable errors always triggered before uncorrectable ones, while running a stress test workload.

[Figure: Itanium core error rate (errors/minute, 0 to 18) versus supply voltage (0.96 to 1.1 V). Labeled regions: failure Vdd, unsafe Vdd, correctable error range.]


High variation in safe Vdd

[Figure: supply voltage (V, 0.8 to 1.1) for Core 0 through Core 7, showing nominal Vdd, safe/min Vdd, and fail Vdd for each core.]

Core-to-core variation in safe/min Vdd: 0.96-1V


ECC-based voltage speculation

• Our solution: dynamically lower supply voltage

• Use correctable errors as “early warning system”

• Two-step approach:

• Margin Voltage - determined post-manufacturing by running stress test workload

• Runtime reevaluation based on correctable error reports

• Monitoring & control implemented in firmware, transparent to OS

• Prototyped in HP Server with Itanium II CPUs


Margin discovery and runtime

[Figure: supply voltage and correctable errors versus time, showing core Vdd and core errors across the discovery phase and runtime. Labeled levels: first error voltage, margin voltage, safety padding (10 mV).]
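The discovery phase on this slide can be illustrated with a minimal sketch. It assumes the margin voltage is the first error voltage plus the 10 mV safety padding, which is how the figure labels read; the step size, the voltage floor, the function names, and the stand-in stress test are all hypothetical and are not taken from the talk.

```python
from typing import Callable

STEP_MV = 5             # hypothetical step size per discovery iteration
SAFETY_PADDING_MV = 10  # padding value shown on the slide


def discover_margin_mv(
    nominal_mv: int,
    floor_mv: int,
    count_correctable_errors: Callable[[int], int],
) -> int:
    """Lower Vdd step by step until the stress test reports a correctable
    error, then return that first error voltage plus the safety padding."""
    vdd = nominal_mv
    while vdd >= floor_mv:
        if count_correctable_errors(vdd) > 0:
            # First error voltage found: add the padding, but never
            # return a margin above the nominal voltage.
            return min(vdd + SAFETY_PADDING_MV, nominal_mv)
        vdd -= STEP_MV
    # No errors seen down to the floor: use the floor plus padding.
    return min(floor_mv + SAFETY_PADDING_MV, nominal_mv)


def fake_stress_test(vdd_mv: int) -> int:
    """Stand-in for a real stress run (assumption): errors appear at or below 980 mV."""
    return 3 if vdd_mv <= 980 else 0


margin = discover_margin_mv(1100, 900, fake_stress_test)
print(margin)
```

Voltages are kept as integer millivolts so that the stepping is exact. A real implementation would live in firmware and read hardware error reports rather than call a Python function.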


Aggressive speculation

• Some applications/cores more amenable to voltage speculation

• Constant stream of correctable errors

[Figure: core Vdd and correctable errors versus time in aggressive mode. Labeled levels: margin voltage, first error voltage, safety padding (10 mV), max error threshold, min error threshold. Annotation: burst testing.]
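One way to read the two error thresholds on this slide is as a simple band controller: raise Vdd when the correctable error rate exceeds the max threshold, lower it when the rate falls below the min threshold, and hold otherwise. The sketch below follows that reading, which is an assumption rather than the documented firmware policy; the step size, voltage limits, threshold values, and the toy error model are hypothetical, and burst testing is not modeled.

```python
def next_vdd_mv(
    vdd_mv: int,
    errors_per_min: int,
    min_thresh: int,
    max_thresh: int,
    step_mv: int = 5,       # hypothetical adjustment step
    floor_mv: int = 900,    # hypothetical lower voltage limit
    ceiling_mv: int = 1100, # hypothetical upper voltage limit
) -> int:
    """Return the next core voltage given the observed correctable error rate."""
    if errors_per_min > max_thresh:
        # Too many correctable errors: back off toward a safer voltage.
        return min(vdd_mv + step_mv, ceiling_mv)
    if errors_per_min < min_thresh:
        # Very few errors: there is room to speculate further.
        return max(vdd_mv - step_mv, floor_mv)
    # Error rate is inside the band: hold the current voltage.
    return vdd_mv


def toy_error_rate(vdd_mv: int) -> int:
    """Toy model (assumption): errors grow linearly below 985 mV."""
    return max(0, (985 - vdd_mv) * 2)


vdd = 990
trace = []
for _ in range(6):
    vdd = next_vdd_mv(vdd, toy_error_rate(vdd), min_thresh=2, max_thresh=20)
    trace.append(vdd)
print(trace)
```

With this toy model the voltage steps down until the error rate enters the band and then stays there, which matches the "constant stream of correctable errors" behavior the slide describes for aggressive cores.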


Voltage speculation in action

[Figure: SPECjbb run. Supply voltage (V, 0.965 to 1) and error rate (per minute, 0 to 50) versus time (0 to 20 minutes), plotting margin voltage, core voltage, and error rate.]


Power savings

[Figure: relative power (0.5 to 1) for SPECjbb2005, SPECint, and SPECfp, comparing cores-only and CPU total.]

• Cores-only: 22% SPECjbb, 23% SPECint and 18% SPECfp

• Total (with uncore): 14% SPECjbb, 15% SPECint and 11% SPECfp
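For intuition on why small voltage reductions matter, recall the standard first-order model in which dynamic power scales as C·V²·f. The sketch below applies that model at fixed frequency and capacitance. The voltages are illustrative values chosen for the example, not measurements from the talk, and the model ignores leakage, so it is not a reproduction of the reported savings.

```python
def relative_dynamic_power(v_new: float, v_old: float) -> float:
    """Ratio of dynamic power after/before a voltage change,
    assuming P ~ C * V^2 * f with C and f held constant."""
    return (v_new / v_old) ** 2


BASELINE_V = 1.10  # hypothetical starting voltage for the example

for v in (1.05, 1.00, 0.97):  # hypothetical lowered voltages
    ratio = relative_dynamic_power(v, BASELINE_V)
    print(f"{v:.2f} V -> {100 * (1 - ratio):.1f}% lower dynamic power")
```

Because the dependence is quadratic, each additional step down in voltage yields a slightly larger saving in absolute terms than a linear model would suggest.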


High variation in correctable errors

[...] a few corner cases that led to system crashes during the initial testing of this solution. The root cause was that some applications triggered their correctable errors at much lower voltages than others (more than 10 mV difference). This translated to system crashes whenever another application that required a higher operating voltage was switched in for execution on the core. This issue was solved by making sure that the application running on an aggressive core can tolerate aggressive mode operation before lowering its voltage below the “margin voltage”.

6. Evaluation

In this section we examine the power and energy savings achieved by our dynamic voltage speculation system as well as its performance overhead. We begin by characterizing the process variation effects on voltage margins and types of errors triggered at low voltages.

6.1. Process Variation Effects

In order to characterize the effects of core-to-core process variation on voltage margins we run our stress tests and benchmarks on each core while progressively lowering the voltage. We record the lowest supply voltage at which the stress test application runs for at least 20 minutes. We also collect all correctable error reports raised by the hardware at that supply voltage. Figure 9 shows the distribution of correctable errors for each core, for two different Itanium II processors. Both processors show a wide range of behaviors, with cores 0-2 of processor A (Figure 9a) showing a large number of correctable cache failures and core 4 exhibiting a large number of register file correctable errors. Most cores seem to trigger either cache errors or register file errors depending on which critical paths in each core are affected by process variation. Cores 3 and 4 are the exception, triggering both cache and register file errors.

Processor B (Figure 9b) shows similar variability but with a different distribution of error rates and types. Processor B triggers fewer cache errors and slightly more correctable register file errors.
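The characterization procedure described at the start of this section (lower the voltage step by step and record the lowest value at which the stress test still completes) can be sketched as follows. This is only an illustration under stated assumptions: the `runs_ok` callback, the 5 mV characterization step, and all voltage values are hypothetical and are not taken from the measurements.

```python
from typing import Callable, Optional


def find_margin_voltage(
    runs_ok: Callable[[int], bool],
    nominal_mv: int,
    floor_mv: int,
    step_mv: int = 5,  # assumed step size; the text does not state one
) -> Optional[int]:
    """Step the voltage down from nominal and return the lowest passing value.

    runs_ok(mv) is a hypothetical callback that should run the stress test
    for the full 20 minutes at `mv` millivolts and report whether it
    completed. Returns None if the test fails even at the nominal voltage.
    """
    lowest_passing = None
    mv = nominal_mv
    while mv >= floor_mv:
        if not runs_ok(mv):
            break  # first failing voltage: stop lowering
        lowest_passing = mv
        mv -= step_mv
    return lowest_passing


# Hypothetical core whose stress test passes down to 985 mV.
print(find_margin_voltage(lambda mv: mv >= 985, nominal_mv=1100, floor_mv=900))
```

Voltages are kept as integer millivolts so that repeated 5 mV steps do not accumulate floating-point error.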

In general, we observed that correctable cache errors have a more graceful onset and are overall a better predictor for aggressive cores and aggressive mode operation beyond the margin voltage. Correctable register file errors, on the other hand, are an indication that the core’s execution pipeline is in the critical path of the core. In these cases the core is less tolerant of further voltage scaling, and lowering the voltage further could lead to errors. We classify cores that exhibit correctable register file errors as “conservative” and we always run them at a voltage that is above the one at which correctable register file errors are observed (margin voltage).
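The classification rule above can be written down in a few lines. This is a minimal sketch, assuming that any nonzero count of correctable register file errors at the margin voltage marks a core as “conservative” and that every other core is treated as “aggressive”. The `CoreErrorCounts` type and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CoreErrorCounts:
    """Correctable errors observed on one core during the stress test."""
    cache_errors: int
    rf_errors: int  # register file errors


def classify_core(counts: CoreErrorCounts) -> str:
    # Assumption: a single register file error is enough to be conservative.
    return "conservative" if counts.rf_errors > 0 else "aggressive"


# Hypothetical counts, not the measured data behind Figure 9.
cores = {
    "core0": CoreErrorCounts(cache_errors=120, rf_errors=0),
    "core3": CoreErrorCounts(cache_errors=40, rf_errors=15),
}
for name, counts in cores.items():
    print(name, classify_core(counts))
```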

6.2. Dynamic Adaptation to Workload

Figure 9: Distribution of correctable error rates and error types over a 20 minute run of the stress test application at the margin voltage, for two Itanium 8-core processors: (a) Processor A, (b) Processor B. [Bar charts of correctable cache errors and correctable RF errors for core0 through core7.]

Figure 10: Dynamic adaptation of supply voltage to runtime conditions in SPECjbb running on an “aggressive” core. [Plot of margin voltage, core supply voltage (V) and error rate (per minute) over time (minutes).]

The Voltage Speculation Governor continuously adjusts the supply voltage to ensure reliable operation. For “aggressive” cores, the governor attempts to lower supply voltage below the “margin voltage” as long as the rate of correctable errors is maintained at a target level. Figure 10 shows a trace of the supply voltage over time for the SPECjbb workload. We also show the correctable error rate for the same interval. The supply voltage is initially set at the margin voltage. The Voltage Speculation Governor lowers the voltage in 5 mV increments every minute of operation as long as the core exhibits an error rate of 1 correctable error per minute. The voltage is immediately raised back to the safety voltage when one of two events occurs over the current time interval: (1) the error rate increases above 1 error per minute or (2) no correctable errors are triggered over the previous interval.
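The governor policy just described can be sketched as a per-minute update rule. This is an illustration only, not the actual governor code: the function names, the integer-millivolt representation, and the example error trace are assumptions, and the example treats the margin voltage and the safety voltage as the same value for simplicity.

```python
from typing import List


def governor_step(
    core_mv: int,
    safety_mv: int,
    errors_last_minute: int,
    target_rate: int = 1,  # correctable errors per minute, from the text
    step_mv: int = 5,      # voltage decrement per minute, from the text
) -> int:
    """Return the supply voltage (mV) to use for the next one-minute interval."""
    # Back off if the error rate exceeded the target or no errors were seen.
    if errors_last_minute > target_rate or errors_last_minute == 0:
        return safety_mv
    # Error rate is at the target level: keep speculating downward.
    return core_mv - step_mv


def run_governor(start_mv: int, safety_mv: int, error_trace: List[int]) -> List[int]:
    """Apply governor_step once per entry of a per-minute error trace."""
    voltages = []
    mv = start_mv
    for errors in error_trace:
        mv = governor_step(mv, safety_mv, errors)
        voltages.append(mv)
    return voltages


# Hypothetical trace: three quiet minutes, a burst of errors, then recovery.
print(run_governor(1000, 1000, [1, 1, 1, 3, 1, 0]))
# -> [995, 990, 985, 1000, 995, 1000]
```

Keeping the decision in a pure function of the previous interval's error count makes the policy easy to replay against recorded error traces.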

6.3. Power Reduction through Voltage Speculation

Voltage Speculation lowers supply voltage by an average of 9-11% across all benchmarks we test, as shown in Figure 11. The data is collected on Processor A, on which we identify cores 0, 1, 2 and 6 as “aggressive” and 3, 4, 5 and 7 as “conservative”.


Outline of our solutions

Reliability, Process Variation, Voltage Variation

• Parichute [micro2010]

[Diagram: Data and Parity blocks]

• Booster [hpca2012]

• VRSync [isca2012]

[Plot: Power (Watts) and Cores in Barrier versus Time (milliseconds)]

• Voltage Speculation in Itanium II [isca2013]


Acknowledgments

• The Research Team:

• Timothy N. Miller, PhD 2012, now Assist. Prof. @ SUNY Binghamton

• Renji Thomas

• Xiang Pan

• Naser Sedaghati

• Anys Bacha

• The Sponsors: