Designing Energy-Efficient Microprocessors in the Era of Unpredictable Transistors
Radu Teodorescu
Department of Computer Science and Engineering
The Ohio State University
http://arch.cse.ohio-state.edu
Computer Architecture Research Lab
Task Parallel Programming in the Partitioned Global Address Space
James Dinan and Prof. P. Sadayappan
PGAS Models and The Asynchronous Gap
• PGAS models provide an asynchronous irregular data model
• E.g. Global Arrays, UPC, CAF
• Computation model is still regular, process-centric SPMD
• Irregularity in the data can lead to load imbalance
• Scioto extends PGAS models to bridge asynchronous gap
• Dynamic task-based view of the computation
Figure labels: X[M][M][N]; X[1..9][1..9][1..9].
Scioto Task Model
• Task Inputs: Global data, Immediates, Common Local Objects (CLO)
• Task Outputs: Global data, CLOs, Child tasks
Figure labels: Proc0, Proc1, ..., Procn; Shared: Y[0], Y[1], ..., Y[N]; Private: CLO1 on each process. Example task: f(...), In: 5, Y[0], ..., Out: Y[1].
Runtime System Design
• Per-process ARMCI circular task queues for efficient one-sided access
• Queues are prioritized by affinity
• Use the work-first principle (LIFO task execution)
• Load balancing off the tail via random work stealing (FIFO stealing)
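The queue discipline in the bullets above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it runs in a single process, uses a plain Python deque instead of an ARMCI circular queue, and omits affinity priorities and locking. The class and method names (`TaskQueue`, `pop_local`, `steal`) are hypothetical and are not taken from Scioto.

```python
from collections import deque


class TaskQueue:
    """Toy per-process task queue: owner works at the head, thieves at the tail."""

    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        # The owner enqueues new work at the head (right end of the deque).
        self._tasks.append(task)

    def pop_local(self):
        # Work-first principle: the owner runs the newest task first (LIFO).
        return self._tasks.pop() if self._tasks else None

    def steal(self):
        # A thief removes the oldest task from the tail (FIFO stealing).
        return self._tasks.popleft() if self._tasks else None


q = TaskQueue()
for name in ["t0", "t1", "t2"]:
    q.push(name)

local = q.pop_local()   # newest task, taken by the owner
stolen = q.steal()      # oldest task, taken by a thief
```

Taking local work from one end and stolen work from the other keeps the owner and the thieves from competing for the same tasks.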
Introduction
This poster describes our work on Scioto, a new parallel programming model that provides scalable support for task parallel programming on distributed memory clusters. Scioto's task model complements existing Partitioned Global Address Space (PGAS) data models to form a complete environment for expressing and managing irregular and dynamic parallelism. The Scioto programming model is supported by a scalable runtime system that provides dynamic load balancing and reduces communication overheads by co-locating tasks with the data on which they operate. We present an evaluation of Scioto on several benchmarks, including the MADNESS computational chemistry kernel, and demonstrate strong scaling and high efficiency on an 8,192-core cluster.
2. Reduce Search Time: Work Splitting
• Problem: Search time grows with system size
• Strategy: Divide tasks evenly between victim and thief
• Double number of work sources after each step
• Reduce avg. time to find work to log(ncpus)
Figure axis label: Time.
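The doubling argument can be checked with a small simulation. The sketch below is idealized and makes assumptions the poster does not: all work starts on one process, every idle process finds a victim immediately, and each victim serves one thief per round. The poster describes random victim selection, which this does not model. The function name `rounds_until_all_have_work` is hypothetical.

```python
def rounds_until_all_have_work(nprocs, ntasks):
    """Count steal-half rounds until every process holds at least one task."""
    work = [0] * nprocs
    work[0] = ntasks                      # all tasks start on process 0
    rounds = 0
    while any(w == 0 for w in work):
        sources = [p for p in range(nprocs) if work[p] > 1]
        idle = [p for p in range(nprocs) if work[p] == 0]
        if not sources:
            break                         # too few tasks to share further
        # Idealized pairing: each source hands half its tasks to one idle thief.
        for victim, thief in zip(sources, idle):
            half = work[victim] // 2
            work[victim] -= half
            work[thief] += half
        rounds += 1
    return rounds, work


rounds_8, work_8 = rounds_until_all_have_work(8, 64)
rounds_1024, _ = rounds_until_all_have_work(1024, 1 << 20)
```

Under these assumptions the number of work sources doubles every round, so 8 processes are all busy after 3 rounds and 1,024 processes after 10, which matches the log(ncpus) trend in the bullets.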
1. Optimize Local Accesses: Split Queues
• Queues are split into two parts:
• Private: Local-only
• Shared: Any, locked
• Removes locking from critical path
• Local enqueue/dequeue
• Periodically move split as computation progresses
• Reacquire work
• Release work (lockless)
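A minimal sketch of a split queue follows, assuming an array-backed layout in which `[tail, split)` is the shared portion and `[split, split + nlocal)` is the private portion. The field names echo the figure labels on the poster (tail, split, nlocal), but the class, its methods, and the choice to reacquire half of the shared tasks are illustrative assumptions. `threading.Lock` stands in for a remote lock, and the lockless release is modeled as a plain index update with no memory-ordering concerns.

```python
import threading


class SplitQueue:
    """Toy split queue: shared part is locked, private part is lock-free."""

    def __init__(self):
        self.tasks = []
        self.tail = 0        # oldest shared task; thieves steal from here
        self.split = 0       # boundary between shared and private portions
        self.nlocal = 0      # number of tasks in the private portion
        self.lock = threading.Lock()   # guards the shared portion only

    def push_local(self, task):
        # Private enqueue: no lock on the owner's critical path.
        self.tasks.append(task)
        self.nlocal += 1

    def pop_local(self):
        # Private dequeue (LIFO); reacquire shared work if the private part is empty.
        if self.nlocal == 0 and not self.reacquire():
            return None
        self.nlocal -= 1
        return self.tasks.pop()

    def release(self, n):
        # Move the split toward the head to expose n private tasks (no lock taken).
        n = min(n, self.nlocal)
        self.split += n
        self.nlocal -= n

    def reacquire(self):
        # Move the split back toward the tail; needs the lock because thieves
        # may be removing shared tasks at the same time.
        with self.lock:
            shared = self.split - self.tail
            if shared == 0:
                return False
            n = max(1, shared // 2)
            self.split -= n
            self.nlocal += n
            return True

    def steal(self):
        # Thief side: take the oldest shared task under the lock.
        with self.lock:
            if self.tail == self.split:
                return None
            task = self.tasks[self.tail]
            self.tail += 1
            return task


q = SplitQueue()
for name in ["a", "b", "c", "d"]:
    q.push_local(name)
q.release(2)                 # "a" and "b" become stealable
stolen = q.steal()           # thief takes "a"
drained = [q.pop_local() for _ in range(4)]
```

The owner touches the lock only when its private portion runs dry, which is the point of keeping locking off the critical path.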
Scioto: Scalable Collections of Task Objects
• Programmer expresses the computation as collection of tasks
• Tasks operate on data stored in PGAS (Global Arrays)
• Executed in collective task parallel phases
• Runtime system manages task execution / task parallel phases
• Load balancing, locality optimizations, fault resilience, etc.
Figure labels: SPMD, Task Parallel, SPMD; Termination; Shared, Private; Proc0, Proc1, ..., Procn.
Scalable Work Stealing
• Enhancements to enable efficient scaling to 8,192 cores
• Highest known scaling for work stealing
1. Split work queues
• Optimize local accesses, reduce locking on critical path
2. Work splitting: Steal-half
• Reduce search time, improve work distribution
3. Aborting lock operations
• Abort long waits on exhausted resources
Figure labels: tail, split, nlocal.
3. Manage Contention: Aborting Steals
• ARMCI Locks: Bakery Algorithm
• Take a ticket, wait in line
• Fair, but if victim runs out of work must still wait to give up ticket
• Spinlocks:
• while(!atomic_swap(lock))
• Can give up at any time
• Spinlocks + Aborting Steals:
• Periodically check if we should abort lock()
• Avoid waits on stale resource
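The aborting spinlock can be sketched as a try-lock loop with a periodic abort check. This is a hedged illustration, not the ARMCI implementation: a non-blocking `threading.Lock.acquire` stands in for the `atomic_swap(lock)` on the slide, and `lock_or_abort`, `should_abort`, and `check_every` are hypothetical names. In a real runtime the abort check would presumably test whether the victim still has work to steal.

```python
import threading


def lock_or_abort(lock, should_abort, check_every=100):
    """Spin on a try-lock; return False if should_abort() says to give up."""
    spins = 0
    while not lock.acquire(blocking=False):   # stand-in for atomic_swap(lock)
        spins += 1
        # Periodically check whether waiting is still worthwhile.
        if spins % check_every == 0 and should_abort():
            return False
    return True


victim_lock = threading.Lock()

# Case 1: the lock is held and the victim has run out of work, so the thief aborts.
victim_lock.acquire()
gave_up = not lock_or_abort(victim_lock, should_abort=lambda: True)
victim_lock.release()

# Case 2: the lock is free, so the thief acquires it on the first attempt.
acquired = lock_or_abort(victim_lock, should_abort=lambda: True)
```

Unlike a ticket-based lock, the waiter here holds no place in line, so it can leave at any point without delaying other waiters.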
Experimental Setup and Benchmarks
• HP InfiniBand Cluster
• 2,310 Nodes, 2x2.2GHz 4-core AMD
• BPC: Bouncing Producer Consumer
• Producer task migrates due to load balancing operations
• MADNESS: Comp. chemistry kernel
• Project 3-d function into oct-tree spatial representation
• UTS: Unbalanced Tree Search Benchmark
• Exhaustive parallel DFS on highly unbalanced tree
The case for energy efficiency
• Mobility
• Battery life
• Energy cost
• Environment
Energy efficiency is now crucial to all computing markets, in particular the growth areas: mobile and cloud computing.
Near-threshold voltage (NTV)
Figure: horizontal axis Voltage (0 to 1), marking Vth, the NTV Vdd, and the nominal Vdd. Annotations for moving from nominal Vdd to NTV: power reduction 100X, frequency cost 10X, energy reduction 10X.
Near-threshold computing, a promising energy-efficient solution.
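The three factors on this slide are mutually consistent, since energy per task is power multiplied by execution time. A minimal worked check, assuming that execution time scales inversely with frequency (the function name `energy_reduction` is illustrative):

```python
def energy_reduction(power_reduction, frequency_cost):
    """Energy = power x time; a lower frequency stretches time by frequency_cost."""
    return power_reduction / frequency_cost


# Slide figures: 100X lower power at a 10X lower frequency.
ntv_energy_gain = energy_reduction(power_reduction=100.0, frequency_cost=10.0)
```

With the slide's numbers this gives the stated 10X energy reduction: power falls 100X, but each task runs 10X longer.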
Intel NTV prototype
NTV faces significant challenges
Reliability
Figure: probability of SRAM bit failure (1E-10 to 1E+00) versus supply voltage (900 down to 300 millivolts). Annotations: Intel Vcc-min; NTV, 5% error rate.
Process Variation
Figure: frequency distribution (horizontal axis 0 to 2) at nominal voltage and at NTV.
Voltage Variation
Figure: voltage emergency.
Variation effects at NTV
delay = f(Vdd - Vth)
Figure: horizontal axis Voltage (0 to 1), marking Vth, the NTV Vdd, and the nominal Vdd, with annotations Delay Nom. and Delay NTV.
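The slide gives only delay = f(Vdd - Vth). To make the effect concrete, the sketch below assumes one common form for f, the alpha-power-law delay model, with delay proportional to Vdd / (Vdd - Vth)^alpha. The model choice, the exponent alpha = 1.5, and every voltage value are illustrative assumptions and do not come from the talk.

```python
def gate_delay(vdd, vth, alpha=1.5):
    """Assumed alpha-power-law delay model, in arbitrary units."""
    return vdd / (vdd - vth) ** alpha


def delay_increase(vdd, vth, dvth, alpha=1.5):
    """Relative delay increase when Vth rises by dvth at a fixed Vdd."""
    return gate_delay(vdd, vth + dvth, alpha) / gate_delay(vdd, vth, alpha) - 1.0


VTH = 0.30    # hypothetical threshold voltage (V)
DVTH = 0.02   # hypothetical Vth shift caused by process variation (V)

nominal_increase = delay_increase(vdd=1.0, vth=VTH, dvth=DVTH)
ntv_increase = delay_increase(vdd=0.4, vth=VTH, dvth=DVTH)
```

Under these assumptions the same Vth shift slows the gate far more at the near-threshold supply than at the nominal one, because Vdd - Vth is small there and the shift is a large fraction of it.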
Outline of our solutions
• Reliability
• Process Variation
• Voltage Variation
Parichute [micro2010] (figure labels: Data, Parity, Parity)
Booster [hpca2012]
VRSync [isca2012]
Figure: cores in barrier (0 to 32) and power in watts (0 to 70) versus time (36 to 50 milliseconds).
Voltage Speculation in Itanium II [isca2013]
Designing Energy-Efficient Microprocessors in the Era of Unpredictable Transistors
Task%Parallel%Programming%in%the%Partitioned%Global%Address%SpaceJames&Dinan and&Prof.&P.&Sadayappan
PGAS%Models%and%The%Asynchronous%Gap
� PGAS%models%provide%an%asynch:ronous irregular%data%model
� E.g.%Global%Arrays,%UPC,%CAF
� Computation%model%is%stillregular,%process:centric%SPMD
� Irregularity%in%the%data%canlead%to%load%imbalance
� Scioto%extends%PGAS%models%to%bridge%asynchronous%gap
� Dynamic%task:based%view%of%the%computation
X[M][M][N]
X[1..9][1..9][1..9]X
Scioto%Task%Model
� Task%Inputs:%Global%data,%Immediates,%Common%Local%Objects (CLO)
� Task%Outputs:%Global%data,%CLOs,%Child%tasks
CLO1 CLO1
SharedY[0]
Private
Y[1] Y[N]
Proc0 Proc1 Procn
CLO1
f(...)
In:%5,%Y[0],%...
Out:%Y[1]
Task:
Runtime%System%Design
� Per:process%ARMCI%circular%task%queues for%efficient%one:sided%access
� Queues%are%prioritized%by%affinity
� Use%the%work%first%principle%(LIFO%task%execution)
� Load%balancing%off%the%tail%via%random%work%stealing%(FIFO%stealing)
Introduction
This poster describes our work on Scioto, a new parallelprogramming model that provides scalable support for task parallelprogramming on distributed memory clusters. Scioto's task modelcomplements existing Partitioned Global Address Space (PGAS) datamodels to form a complete environment for expressing andmanaging irregular and dynamic parallelism. The Sciotoprogramming model is supported by a scalable runtime system thatprovides dynamic load balancing and improves communicationoverheads by co:locating tasks with data on which they operate. Wepresent an evaluation of Scioto on several benchmarks including theMADNESS computational chemistry kernel and demonstrate strongscaling and high efficiency on an 8,192 core cluster.
2.%Reduce%Search Time:%Work%Splitting
� Problem: Search%time%grows%with%system%size
� Strategy: Divide%tasks evenly%between%victim%and%thief
� Double%number%of%work%sources%after%each%step
� Reduce%avg.%time%to%findwork%to%log(ncpus)
Time
1.%Optimize%Local%Accesses:%Split%Queues
� Queues%are%split%into%two%parts:
� Private: Local:only
� Shared: Any,%locked
� Removes%locking%from%criticalpath
� Local%enqueue/dequeue� Periodically%move%split%as%computation%progresses
� Reacquire%work
� Release%work%(lockless)
Scioto:%Scalable%Collections%of%Task%Objects
� Programmer%expresses the computation%as%collection%of%tasks
� Tasks%operate%on%data%stored%in%PGAS%(Global%Arrays)
� Executed%in%collective%task%parallel%phases
� Runtime%system%manages%task%execution%/%task%parallel%phases
� Load%balancing,%locality%optimizations,%fault%resilience,%etc
SPMD
SPMD
TaskParallel
�����������������������n
Termination
Shared
Private
Proc0 Proc1 Procn
Scalable%Work%Stealing� Enhancements%to%enable%efficient%scaling%to%8,192%cores� Highest%known%scaling%for%work%stealing
1. Split%work%queues� Optimize%local%accesses,%reduce%locking%on%critical%path
2. Work%splitting:%Steal:half� Reduce%search%time,%improve%work%distribution
3. Aborting%lock%operations� Abort long%waits%on%exhausted%resources
tailsplitnlocal
3.%Manage%Contention:%Aborting%Steals
� ARMCI%Locks:%BakeryAlgorithm
� Take%a%ticket,%wait%in%line� Fair,%but%if%victim%runs%outof%work%must%still%wait%togive%up%ticket
� Spinlocks:
� while(!atomic_swap(lock))%
� Can%give%up%at%any%time
� Spinlocks%+%Aborting%Steals:
� Periodically%check%if%we%should%abort%lock()
� Avoid%waits on%%stale%resource
Experimental Setup and Benchmarks
• HP InfiniBand Cluster
  • 2,310 Nodes, 2x2.2GHz 4-core AMD
• BPC: Bouncing Producer Consumer
  • Producer task migrates due to load balancing operations
• MADNESS: Comp. chemistry kernel
  • Project 3-d function into oct-tree spatial representation
• UTS: Unbalanced Tree Search Benchmark
  • Exhaustive parallel DFS on highly unbalanced tree
Outline of our solutions
Reliability
Process Variation
Voltage Variation
Parichute [micro2010]
Booster [hpca2012]
VRSync [isca2012]
[Figure: Power (Watts) and Cores in Barrier versus Time (milliseconds)]
Voltage Speculation in Itanium II [isca2013]
SRAM failure rates
[Figure: Probability of Bit Cell Failure (1E-10 to 1E+00) versus Supply Voltage (900 to 300 millivolts); annotations: Intel Vcc-min, 350mV, 5%, Parichute]
Turbo product codes
[Figure: Data block with two Parity regions]
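The idea of a product code can be illustrated with its simplest form: one even-parity bit per row and one per column of a two-dimensional data block. The Python sketch below is an assumption-laden illustration and not Parichute's actual code; the function names are hypothetical, the parity bits are assumed to be error-free, and turbo product codes as used in practice rely on stronger component codes and iterative decoding.

```python
def encode(data):
    """Return (row_parity, col_parity) for a 2D list of bits."""
    rows = [sum(r) % 2 for r in data]
    cols = [sum(c) % 2 for c in zip(*data)]
    return rows, cols

def correct_single_error(data, row_parity, col_parity):
    """Fix at most one flipped data bit in place.

    Returns the (row, col) of the corrected bit, None if no error is seen,
    and raises ValueError for patterns this simple code cannot correct.
    """
    rows, cols = encode(data)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, row_parity)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, col_parity)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        data[i][j] ^= 1  # the failing row and column intersect at the bad bit
        return (i, j)
    raise ValueError("uncorrectable error pattern")

block = [[1, 0, 1],
         [0, 1, 1],
         [0, 0, 1]]
row_parity, col_parity = encode(block)
block[1][2] ^= 1  # inject a single-bit error
print(correct_single_error(block, row_parity, col_parity))
```

A single flipped bit disturbs exactly one row parity and one column parity, which pinpoints its position; two flipped bits in the same row leave the row parity intact and are rejected by this sketch.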
Parichute ECC
[Figure: bit positions of the data block arranged under Permutation 0, Permutation 1, Permutation 2, and Permutation 3]
Parichute cache architecture
[Figure: Data In enters the Encoder, where a Permutation Network produces Permutation 0 through Permutation N of the Data Block (cache line); parity encoders generate PW entries for Parity Group 0 through Parity Group N; the Cache stores Data+Parity in Line 0 through Line 7; a Decoder produces Data Out. Legend: data bits, redundant bits]
Error correction example
[Figure: bits a, b, c, d, e processed by Corrector 0, Corrector 1, Corrector 2, and Corrector 3; 1-bit error ✓, 2-bit error ✗]
Experimental setup
• SRAM error model
  • SPICE model of cell
  • 8-way 2MB caches
  • VARIUS
• Processor model
  • SESC [Intel Core]
  • CACTI & WATTCH
• Benchmarks
  • SPECint, SPECfp 2000
• Prototype
  • Verilog
  • Synopsys Design Compiler
  • Nangate 45nm standard cell
  • Formality

          Vdd     Freq    + Latency
Nominal   0.9V    3GHz    0
NTHigh    0.375V  460MHz  4
NTMid     0.350V  355MHz  4
NTLow     0.337V  300MHz  6
Overhead used in simulations
Error correction strength
[Figure: three panels (SECDED, OLSC 256, Parichute 252) plotting Percent lines correctable (0% to 100%) versus Errors in 512 data bits (0 to 35)]
Z. Chishti, A. R. Alameldeen, C. Wilkerson, W. Wu, and S. L. Lu, "Improving cache lifetime reliability at ultra-low voltages," in International Symposium on Microarchitecture, December 2009.
Cache capacity
[Figure: remaining cache capacity (0% to 100%) over an x-axis running from 600 down to 250, for No Protection, SECDED, OLSC 256, and Parichute 252. Callouts: Parichute 50% vs. OLSC 24%; Parichute 25% vs. OLSC 7%.]
Parichute hardware overhead
• Encoder and decoder hardware
  • 27,628 standard cells
  • Area: 0.056 mm²
  • Power: 11 mW
  • Critical path: 0.95 ns (1 GHz)
• Cache area: +4%
Outline of our solutions
• Problem areas: Reliability, Process Variation, Voltage Variation
• Parichute [micro2010]
• Booster [hpca2012]
• VRSync [isca2012]
• Voltage Speculation in Itanium II [isca2013]
[Figure: data and parity blocks (Parichute)]
[Figure: power (watts, 0 to 70) and cores in barrier (0 to 32) over time (36 to 50 milliseconds)]
Variation effects on frequency
[Figure: frequency distribution, x-axis from 0 to 2]
• Vdd = 900 mV
  • Vth σ/μ = 12%
  • Vth = 210 ± 50 mV
  • F σ/μ = 4.4%
  • F = 3 GHz ± 260 MHz
Variation effects on frequency
[Figure: frequency distributions at two supply voltages, x-axis from 0 to 2]
• Vdd = 900 mV
  • Vth σ/μ = 12%
  • Vth = 210 ± 50 mV
  • F σ/μ = 4.4%
  • F = 3 GHz ± 260 MHz
• Vdd = 400 mV
  • Vth σ/μ = 12%
  • F σ/μ = 30.6%
  • F = 400 ± 245 MHz
Impact of frequency variation
[Figure: two panels, "No variation" and "NTV Variation", each plotting per-core frequency against execution progress. The NTV Variation panel marks a "Bottleneck" and "Wasted Perf."]
Dual-Vdd chip multiprocessor
[Figure: 16-core chip (Core 0 to Core 15) supplied by two rails, Vdd High and Vdd Low]
• Each core is assigned two power rails:
  • NT Vdd High and Low, with frequencies Fhigh and Flow
• Cores can switch rapidly between the two rails, and so between Fhigh and Flow
Frequency interpolation
Target: 1100 MHz

Core   Low Vdd (MHz)   High Vdd (MHz)   Time at Low Vdd   Time at High Vdd
C0     775             2025             74%               26%
C1     650             1775             60%               40%
C2     575             1625             50%               50%
C3     425             1375             29%               71%
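The percentages on this slide are consistent with a simple time-weighted average: each core spends just enough time on the high rail that its average frequency lands on the 1100 MHz target. The sketch below reproduces the split under that assumption. It ignores rail-switching overhead and any detail of the actual controller, and the names `high_rail_fraction`, `schedule`, `CORES`, and `TARGET_MHZ` are hypothetical, introduced only for illustration.

```python
def high_rail_fraction(f_low_mhz, f_high_mhz, target_mhz):
    """Fraction of time to spend on the high rail so that the
    time-weighted average frequency equals target_mhz.

    Assumes a linear average: target = (1 - x) * f_low + x * f_high.
    """
    if f_low_mhz >= f_high_mhz:
        raise ValueError("low-rail frequency must be below high-rail frequency")
    if not f_low_mhz <= target_mhz <= f_high_mhz:
        raise ValueError("target must lie between the two rail frequencies")
    return (target_mhz - f_low_mhz) / (f_high_mhz - f_low_mhz)


# Per-core (low Vdd MHz, high Vdd MHz) values taken from the slide.
CORES = {
    "C0": (775, 2025),
    "C1": (650, 1775),
    "C2": (575, 1625),
    "C3": (425, 1375),
}
TARGET_MHZ = 1100


def schedule(cores=CORES, target_mhz=TARGET_MHZ):
    """Return {core: (percent of time at low Vdd, percent at high Vdd)}."""
    result = {}
    for name, (f_low, f_high) in cores.items():
        high = high_rail_fraction(f_low, f_high, target_mhz)
        result[name] = (round(100 * (1 - high)), round(100 * high))
    return result


if __name__ == "__main__":
    for name, (low_pct, high_pct) in schedule().items():
        print(f"{name}: {low_pct}% at low Vdd, {high_pct}% at high Vdd")
```

Slower cores (lower frequencies on both rails) need a larger share of time on the high rail to reach the same target, which is why C3 spends most of its time there while C0 spends most of its time on the low rail.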
Frequency interpolation - in action
[Figure: 64-core CMP with cores shaded by speed; legend: Fastest, Fast, Slow, Slowest]
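One way to read "frequency interpolation" is as time-multiplexing a core between two operating points so that its average speed lands on a chosen target. The sketch below works through that arithmetic. It assumes the effective frequency is the time-weighted average of a low and a high frequency; the function names and the example frequencies are illustrative and are not taken from the slides.

```python
def high_fraction(f_low, f_high, f_target):
    """Fraction of time to spend at f_high so the time-weighted
    average frequency equals f_target (assumed model, see lead-in)."""
    if not f_low < f_high:
        raise ValueError("f_low must be strictly below f_high")
    # A core cannot average below f_low or above f_high, so clamp the target.
    f = min(max(f_target, f_low), f_high)
    return (f - f_low) / (f_high - f_low)


def effective_frequency(f_low, f_high, fraction_high):
    """Time-weighted average of the two operating frequencies."""
    return fraction_high * f_high + (1.0 - fraction_high) * f_low


if __name__ == "__main__":
    # Hypothetical core: 400 MHz at the low point, 800 MHz at the high point.
    frac = high_fraction(400.0, 800.0, 500.0)
    print(frac, effective_frequency(400.0, 800.0, frac))
```

Under this model a slow core spends more of its time at the high operating point than a fast core does, which is how cores with different native speeds could all appear to run at the same frequency.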
Two Booster algorithms: VAR & SYNC
• Booster VAR:
  • Eliminates heterogeneity: all cores appear to run at target F
• Booster SYNC:
  • Dynamically redistribute “boost” from blocked to active threads
  • Use hints from synchronization primitives
  • Hardware support
[Figure legend, thread states: Blocked, Normal, Critical]
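The Booster SYNC bullets can be illustrated with a small allocation sketch: blocked threads give up their boost, and the freed budget goes to threads that can still make progress. The policy below (critical threads first, then normal ones, with a fixed number of boost slots) is an assumption made for illustration only. The slides do not specify the actual policy, and `State`, `assign_boost`, and `budget` are hypothetical names.

```python
from enum import Enum


class State(Enum):
    BLOCKED = 0   # waiting at a synchronization point
    NORMAL = 1    # running, not on the critical path
    CRITICAL = 2  # running, holding up other threads


def assign_boost(states, budget):
    """Pick which cores to boost, given a dict of core id -> State
    and a fixed number of boost slots (assumed policy, see lead-in)."""
    # Blocked threads never receive boost: they cannot use it.
    runnable = [core for core, s in states.items() if s is not State.BLOCKED]
    # Critical threads sort first; ties are broken by core id.
    runnable.sort(key=lambda core: (states[core] is not State.CRITICAL, core))
    return set(runnable[:budget])


if __name__ == "__main__":
    states = {0: State.BLOCKED, 1: State.NORMAL, 2: State.CRITICAL, 3: State.NORMAL}
    print(assign_boost(states, 2))
```

In this sketch the thread states would come from the synchronization hints mentioned above, for example a thread entering or leaving a barrier.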
Experimental setup
• Processor
  • Modeled by SESC
  • Dual-issue OOO
  • 32nm, 32 cores
  • 3GHz at 900mV
  • 300-2500MHz at NT
  • NT at 400-635mV
• Benchmarks
  • SPLASH-2
  • PARSEC
• Circuit modeling
  • SPICE
  • Marković, et al.
• Variation modeling
  • VARIUS
Booster runtime: Static workloads
[Figure: Normalized Execution Time (y-axis 0.5 to 1.1) for barnes, ocean, water-nsqd, cholesky, fft, lu, radix, blackscholes, fluidanimate, swaptions, dedup, streamcluster, and g.mean; series: Heterogeneous, Hetero Scheduling, Booster VAR, Booster SYNC; callouts: 14%, 22%]
Booster runtime: Dynamic workloads
[Figure: Normalized Execution Time (y-axis 0.5 to 1.1) for radiosity, raytrace, volrend, bodytrack, and g.mean; series: Heterogeneous, Hetero Scheduling, Booster VAR, Booster SYNC; callouts: 9%, 18%]
Outline of our solutions
Reliability, Process Variation, Voltage Variation
• Parichute [micro2010]
• Booster [hpca2012]
• VRSync [isca2012]
• Voltage Speculation in Itanium II [isca2013]
[Figures: data and parity blocks; Power (Watts) and Cores in Barrier plotted against Time (milliseconds)]
Voltage Variability
[Figure: two panels plotting V(out) (V) and I(load) (A) against time (s), each with +10% and -10% guardband limits marked; panel labels: "Normal operation" and "Voltage emergency!"]
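The figure's guardband can be read as a tolerance band around the nominal supply voltage, with a voltage emergency occurring whenever V(out) leaves that band. The sketch below flags such samples in a voltage trace. The ±10% margin follows the figure labels; the function name, the nominal voltage, and the sample values are made up for illustration.

```python
def find_emergencies(samples, v_nominal, margin=0.10):
    """Return the indices of voltage samples that fall outside
    the guardband [v_nominal*(1-margin), v_nominal*(1+margin)]."""
    low = v_nominal * (1.0 - margin)
    high = v_nominal * (1.0 + margin)
    return [i for i, v in enumerate(samples) if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical trace around a 0.50 V nominal supply.
    trace = [0.50, 0.52, 0.44, 0.56, 0.50]
    print(find_emergencies(trace, 0.50))
```

A wider guardband tolerates larger swings but, in a real design, costs energy because the chip must run at a higher voltage than normal operation needs.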
Synchronization-Induced Voltage Emergencies
[Figure annotation: Voltage Emergency]
Thread Synchronization Effects on CMP Power Profile
[Figure: four panels (4 cores, 8 cores, 16 cores, 32 cores), each plotting Power (Watts) and Cores in Barrier against Time (milliseconds). Annotation: 6X!]
VRSync
• VRSync: voltage-aware synchronization library
• Reduces dI/dt caused by synchronization events
• Helps reduce voltage guardband
• Lower voltage guardband = Energy savings
• On average VRSync saves 33% energy
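The link between a smaller guardband and energy savings can be illustrated with a rough sketch. It assumes dynamic energy scales with the square of the supply voltage and that the supply equals a minimum safe voltage plus a fractional guardband. Both are simplifying assumptions made for illustration, not the model used in the talk. The function name and the guardband values are hypothetical, and the sketch does not attempt to reproduce the 33% figure above.

```python
def dynamic_energy_savings(guardband_old, guardband_new):
    """Fraction of dynamic energy saved when the voltage guardband shrinks.

    Assumes (for illustration only):
      - supply voltage V = V_min * (1 + guardband)
      - dynamic energy E is proportional to V**2 at fixed frequency
    Guardbands are fractions, e.g. 0.10 for a 10% margin.
    """
    if guardband_old < 0 or guardband_new < 0:
        raise ValueError("guardbands must be non-negative fractions")
    # V_min cancels out, so only the ratio of the two margins matters.
    voltage_ratio = (1.0 + guardband_new) / (1.0 + guardband_old)
    return 1.0 - voltage_ratio ** 2


if __name__ == "__main__":
    # Hypothetical margins: shrinking a 10% guardband to 5%.
    saved = dynamic_energy_savings(0.10, 0.05)
    print(f"dynamic energy saved: {saved:.1%}")
```

Leakage energy and any change in execution time are ignored here, so the result is a first-order estimate only.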
Our solution: VRSync Barriers
[Figure: thread timelines (T0 to T7 against Time) for the Linear schedule and the Bulk schedule, marking first enter, execution, blocked on barrier 1, all in barrier, delay, and all out.]
[Figure: two plots of V(out) (mV) and No. of active cores against time (µs), each with a -10% marker.]
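The idea of spreading barrier exits over time, so that fewer cores wake at once, can be sketched as a release-time calculation. This is a minimal illustration, not the VRSync implementation: the function names, the `step_us` delay, and the `group_size` parameter are all hypothetical. With `group_size=1` threads leave one per step, which is one reading of the linear schedule. Larger groups are a generalization added here and are not claimed to match the bulk schedule on the slide.

```python
from collections import Counter


def staggered_release_times(n_threads, step_us, group_size=1):
    """Release offset (microseconds) for each thread leaving a barrier.

    Threads are released in groups of `group_size`, with `step_us`
    between consecutive groups. All parameters are illustrative.
    """
    if n_threads < 0 or step_us < 0 or group_size < 1:
        raise ValueError("invalid schedule parameters")
    return [(i // group_size) * step_us for i in range(n_threads)]


def max_simultaneous_wakeups(release_times):
    """Largest number of threads released at the same instant.

    Used here as a crude proxy for the size of the current step (dI/dt).
    """
    if not release_times:
        return 0
    return max(Counter(release_times).values())


if __name__ == "__main__":
    conventional = staggered_release_times(8, step_us=0)  # all exit together
    one_by_one = staggered_release_times(8, step_us=5)    # one thread per step
    print(max_simultaneous_wakeups(conventional), max_simultaneous_wakeups(one_by_one))
```

The trade-off is visible in the last release offset: a finer stagger lowers the number of simultaneous wake-ups but delays the final thread, which appears as execution time overhead.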
Experimental setup
• Processor
  • Modeled by SESC
  • 32nm, 32 cores
  • 1GHz at 600mV
• Benchmarks
  • SPLASH2
  • PARSEC
• Circuit modeling
  • SPICE
  • Marković, et al.*
• Voltage Regulator
  • Linear Technology’s LTC3729L-6 polyphase
  • LTspice
• Barrier
  • Software Combining Tree Barrier
Eliminating Barrier-Induced Emergencies
[Figure: fluidanimate (PARSEC), three panels (Baseline, Linear, Bulk), each plotting Power (Watts) and Cores in Barrier against Time (milliseconds). Legend: Emergency, Cores in Barrier, Power (Watts).]
Eliminating Barrier Emergencies
[Figure: lu (SPLASH2), Baseline compared with VRSync Bulk.]
Eliminating Phase Alignment Emergencies
[Figure: fft (SPLASH2), Baseline compared with VRSync Linear.]
Execution time overhead
[Figure: Normalized Execution Time (axis 0 to 1.6) for the Linear and Bulk schedules across radiosity, barnes, ocean, raytrace, water-nsquared, cholesky, fft, lu, radix, blackscholes, bodytrack, fluidanimate, swaptions, dedup, streamcluster, and g.mean. Annotations: 2.1, 11%, 6%.]
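The g.mean entry summarizes the per-benchmark normalized execution times with a geometric mean. A minimal sketch of that summary follows. The function names are hypothetical, and the input values in the example are placeholders chosen for illustration, not measurements from the talk.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    # Average in log space to avoid overflow on long products.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def overhead_percent(normalized_times):
    """Overhead implied by normalized times (1.0 means same as baseline)."""
    return (geometric_mean(normalized_times) - 1.0) * 100.0


if __name__ == "__main__":
    # Placeholder normalized execution times, not data from the figure.
    example = [1.00, 1.05, 1.20]
    print(f"overhead: {overhead_percent(example):.1f}%")
```

The geometric mean is the usual choice for ratios such as normalized times because it treats a 2x slowdown and a 2x speedup symmetrically, and a single clipped outlier bar shifts it less than it would an arithmetic mean.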
VRSync: Energy Savings

Technique                      Guardband   Runtime   Power   Energy
Baseline with High Guardband   210 mV      1.0       1.563   1.563
VRSync Linear                  60 mV       1.112     0.98    1.086
VRSync Bulk                    60 mV       1.063     0.99    1.049
VRSync Bulk Fast               160 mV      1.045     1.361   1.422

Energy savings relative to the baseline: 31% (VRSync Linear), 33% (VRSync Bulk)
VRSync Bulk uses 33% less energy than the baseline with high guardband.
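The energy column and the 31% / 33% figures can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes energy is the product of the normalized runtime and power columns, and the names `ROWS`, `energy_product`, and `savings_vs_baseline` are hypothetical. The products differ from the printed energy values in the third decimal, presumably because the table entries are rounded.

```python
# Normalized values copied from the VRSync energy table.
ROWS = {
    "Baseline with High Guardband": {"runtime": 1.0, "power": 1.563, "energy": 1.563},
    "VRSync Linear": {"runtime": 1.112, "power": 0.98, "energy": 1.086},
    "VRSync Bulk": {"runtime": 1.063, "power": 0.99, "energy": 1.049},
    "VRSync Bulk Fast": {"runtime": 1.045, "power": 1.361, "energy": 1.422},
}


def energy_product(row):
    """Energy estimated as runtime x power (assumption, see lead-in)."""
    return row["runtime"] * row["power"]


def savings_vs_baseline(name, baseline="Baseline with High Guardband"):
    """Fraction of baseline energy saved, using the table's energy column."""
    return 1.0 - ROWS[name]["energy"] / ROWS[baseline]["energy"]


if __name__ == "__main__":
    for name, row in ROWS.items():
        print(f"{name}: runtime x power = {energy_product(row):.3f}, "
              f"table energy = {row['energy']:.3f}, "
              f"savings = {savings_vs_baseline(name):.1%}")
```

Rounded to whole percentages, the savings come out to 31% for VRSync Linear and 33% for VRSync Bulk, matching the slide.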
Outline of our solutions
• Reliability: Parichute [micro2010]
• Process Variation: Booster [hpca2012]
• Voltage Variation: VRSync [isca2012]
• Voltage Speculation in Itanium II [isca2013]
[Figures: data and parity block diagram; power (Watts) and cores in barrier vs. time (milliseconds)]
Voltage speculation in Itanium II
• High voltage margins in modern CPUs are energy inefficient
• Voltage speculation techniques exist (Razor, etc.)
  • Require dedicated hardware
• Reliability challenges lead to heavy use of on-chip ECC
  • Caches, register file, TLBs, etc.
• Idea: leverage on-chip ECC to dynamically lower voltage margins (voltage speculation)
Voltage margin exploration
• HP BL860-i4 Server (2X 9560 Itanium II 8-core CPUs)
• Gradually lowered supply voltage (Vdd) for each core individually
  • From 1.1 V nominal down to 0.9 V, at constant frequency (2.53 GHz)
• Logged correctable errors and power consumption
• Recorded crashes and data corruption
• Experiments performed with the HP stress test application and the SPECjbb, SPECfp, and SPECint benchmarks
Correctable errors vs. Vdd
Observation: while running a stress test workload, correctable errors were always triggered before uncorrectable ones.
[Figure: error rate (errors/minute, 0 to 18) vs. supply voltage (0.96 to 1.1 V) for an Itanium core, marking the correctable error range, the failure Vdd, and the unsafe Vdd region.]
High variation in safe Vdd
[Figure: supply voltage (V), 0.8 to 1.1, for Core 0 through Core 7, showing nominal Vdd, safe/min Vdd, and fail Vdd.]
Core-to-core variation in safe/min Vdd: 0.96 V to 1 V
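For a rough sense of what this spread could be worth, the sketch below applies the textbook approximation that dynamic power scales with the square of Vdd at constant frequency. This is an assumption for illustration, not a measurement from the talk: it ignores leakage and any per-core differences in activity. The 1.1 V nominal value is the one given for the test system; `dynamic_power_ratio` is a hypothetical helper name.

```python
NOMINAL_VDD = 1.1  # volts, nominal supply reported for the test system


def dynamic_power_ratio(vdd, nominal=NOMINAL_VDD):
    """Relative dynamic power at constant frequency, assuming P ~ Vdd^2."""
    return (vdd / nominal) ** 2


if __name__ == "__main__":
    # The two ends of the reported safe/min Vdd range.
    for vdd in (1.0, 0.96):
        reduction = 1.0 - dynamic_power_ratio(vdd)
        print(f"Vdd = {vdd:.2f} V: about {reduction:.1%} lower dynamic power")
```

Under this approximation, the core with the highest safe Vdd (1 V) would save roughly 17% of dynamic power and the core with the lowest (0.96 V) roughly 24%, which suggests why per-core margins could matter.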
ECC-based voltage speculation
• Our solution: dynamically lower supply voltage
• Use correctable errors as an “early warning system”
• Two-step approach:
  • Margin voltage: determined post-manufacturing by running a stress test workload
  • Runtime reevaluation based on correctable error reports
• Monitoring & control implemented in firmware, transparent to the OS
• Prototyped in an HP server with Itanium II CPUs
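The runtime reevaluation step might look like the sketch below. The slides do not give the actual firmware policy, so everything here is an assumption: the function name `reevaluate_vdd`, the 10 mV step, the zero-error budget, and the rule that the voltage never drops below the margin voltage are all hypothetical choices made for illustration.

```python
def reevaluate_vdd(vdd, correctable_errors, margin_vdd, nominal_vdd=1.1,
                   step=0.01, error_budget=0):
    """Return the next supply voltage (volts) for one core.

    Hypothetical policy: if the last monitoring interval reported more
    correctable errors than the budget, back off by one step (capped at
    nominal); otherwise move one step toward the margin voltage and
    never go below it.
    """
    if correctable_errors > error_budget:
        return min(nominal_vdd, round(vdd + step, 3))
    return max(margin_vdd, round(vdd - step, 3))


if __name__ == "__main__":
    vdd = 1.1
    # Simulated correctable error counts, one per monitoring interval.
    for errors in [0, 0, 0, 0, 2, 0, 0]:
        vdd = reevaluate_vdd(vdd, errors, margin_vdd=1.0)
        print(f"errors={errors} -> Vdd={vdd:.2f} V")
```

Treating any correctable error as a reason to back off is the most conservative reading of the “early warning system” idea; a real controller could tolerate a nonzero error rate.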
Margin discovery and runtime
[Figure: core supply voltage (Core Vdd) and correctable core errors over time, across a discovery phase and a runtime phase. Marked levels: First Error Voltage, Margin Voltage, and a 10 mV safety padding.]
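The discovery phase on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we read the slide as placing the margin voltage one safety padding (10 mV) above the first-error voltage, and the probe function, step size, and voltage values are hypothetical stand-ins for real hardware error reporting.

```python
from typing import Callable, Tuple

SAFETY_PADDING_MV = 10  # safety padding shown on the slide


def discover_margin_mv(
    start_mv: int,
    floor_mv: int,
    step_mv: int,
    has_correctable_error: Callable[[int], bool],
) -> Tuple[int, int]:
    """Lower the voltage step by step until the first correctable error.

    Returns (first_error_mv, margin_mv). Assumption: the margin voltage is
    the first-error voltage plus the safety padding.
    """
    voltage = start_mv
    while voltage >= floor_mv:
        if has_correctable_error(voltage):
            return voltage, voltage + SAFETY_PADDING_MV
        voltage -= step_mv
    raise RuntimeError("no correctable error observed above the floor voltage")


# Hypothetical core that starts reporting correctable errors at 985 mV.
first_error, margin = discover_margin_mv(
    start_mv=1100,
    floor_mv=900,
    step_mv=5,
    has_correctable_error=lambda mv: mv <= 985,
)
print(first_error, margin)
```

On real hardware the probe would run a stress test at each voltage and read the correctable error reports, rather than compare against a fixed threshold.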
Aggressive speculation
• Some applications/cores more amenable to voltage speculation
• Constant stream of correctable errors
[Figure: Core Vdd and correctable errors over time, with the First Error Voltage, Margin Voltage, and 10 mV safety padding levels marked, plus minimum and maximum error thresholds and a burst testing interval.]
Voltage speculation in action
[Figure: SPECjbb trace over 20 minutes. Left axis: supply voltage (V), 0.965 to 1. Right axis: error rate (per minute), 0 to 50. Series: Margin Voltage, Error rate, Core Voltage.]
Power savings
[Figure: relative power (0.5 to 1) for Specjbb2005, SPECint, and SPECfp. Series: Cores-only, CPU Total.]
• Cores-only: 22% SPECjbb, 23% SPECint and 18% SPECfp
• Total (with uncore): 14% SPECjbb, 15% SPECint and 11% SPECfp
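As a rough sanity check on how the two bullets relate, the short Python calculation below estimates the share of total CPU power drawn by the cores. It assumes the percentages are power savings relative to baseline and that uncore power is unaffected by voltage speculation; under those assumptions, total saving = core share × cores-only saving. The derived shares are our arithmetic, not figures reported on the slide.

```python
# Savings read off the slide, in percent of baseline power.
cores_only = {"SPECjbb": 22, "SPECint": 23, "SPECfp": 18}
total = {"SPECjbb": 14, "SPECint": 15, "SPECfp": 11}


def implied_core_share(core_saving_pct: float, total_saving_pct: float) -> float:
    """Core share of total power, assuming uncore power is unchanged."""
    return total_saving_pct / core_saving_pct


shares = {
    name: round(implied_core_share(cores_only[name], total[name]), 2)
    for name in cores_only
}
print(shares)
```

Under these assumptions the cores would account for roughly 60 to 65% of total CPU power across the three suites.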
High variation in correctable errors
[...] a few corner cases that led to system crashes during the initial testing of this solution. This was root-caused to some applications triggering their correctable errors at much lower voltages than others (more than 10 mV difference). This translated to system crashes whenever another application that required a higher operating voltage was switched in for execution on the core. This issue was solved by making sure that the application running on an aggressive core can tolerate aggressive mode operation before lowering its voltage below the “margin voltage”.
6. Evaluation
In this section we examine the power and energy savings achieved by our dynamic voltage speculation system as well as its performance overhead. We begin by characterizing the process variation effects on voltage margins and types of errors triggered at low voltages.
6.1. Process Variation Effects
In order to characterize the effects of core-to-core process variation on voltage margins we run our stress tests and benchmarks on each core while progressively lowering the voltage. We record the lowest supply voltage at which the stress test application runs for at least 20 minutes. We also collect all correctable error reports raised by the hardware at that supply voltage. Figure 9 shows the distribution of correctable errors for each core, for two different Itanium II processors. Both processors show a wide range of behaviors, with cores 0-2 of processor A (Figure 9a) showing a large number of correctable cache failures and core 4 exhibiting a large number of register file correctable errors. Most cores seem to trigger either cache errors or register file errors depending on which critical paths in each core are affected by process variation. Cores 3 and 4 are the exception, triggering both cache and register file errors.
Processor B (Figure 9b) shows similar variability but with a different distribution of error rates and types. Processor B triggers fewer cache errors and slightly more correctable register file errors.
In general, we observed that correctable cache errors have a more graceful onset and are overall a better predictor for aggressive cores and aggressive mode operation beyond the margin voltage. Correctable register file errors, on the other hand, are an indication that the core’s execution pipeline is in the critical path of the core. In these cases the core is less tolerant of further voltage scaling, which could lead to errors. We classify cores that exhibit correctable register file errors as “conservative” and we always run them at a voltage that is above the one at which correctable register file errors are observed (margin voltage).
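The classification rule above can be written as a short Python sketch. The error counts are made up for illustration and are not the measured values from Figure 9. The "unclassified" fallback for a core with no correctable errors is our assumption; the text does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class CoreErrors:
    cache_errors: int  # correctable cache errors seen at the margin voltage
    rf_errors: int     # correctable register file errors seen at the margin voltage


def classify(core: CoreErrors) -> str:
    """Label a core from its correctable error profile."""
    if core.rf_errors > 0:
        # Register file errors suggest the pipeline is on the critical path.
        return "conservative"
    if core.cache_errors > 0:
        # Cache errors have a more graceful onset.
        return "aggressive"
    return "unclassified"  # assumption: not specified in the text


# Hypothetical counts, for illustration only.
print(classify(CoreErrors(cache_errors=120, rf_errors=0)))
print(classify(CoreErrors(cache_errors=40, rf_errors=15)))
```

Note that a core showing both error types is treated as conservative, since any register file errors trigger the conservative label.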
6.2. Dynamic Adaptation to Workload
The Voltage Speculation Governor continuously adjusts the supply voltage to ensure reliable operation. For “aggressive” cores, the governor attempts to lower supply voltage below the “margin voltage” as long as the rate of correctable errors is maintained at a target level. Figure 10 shows a trace of the supply voltage over time for the SPECjbb workload. We also show the correctable error rate for the same interval. The supply voltage is initially set at the margin voltage. The Voltage Speculation Governor lowers the voltage in 5 mV increments every minute of operation as long as the core exhibits an error rate of 1 correctable error per minute. The voltage is immediately raised back to the safety voltage when one of two events occurs over the current time interval: (1) the error rate increases above 1 error per minute or (2) no correctable errors are triggered over the previous interval.
Figure 9: Distribution of correctable error rates and error types over a 20 minute run of the stress test application at the margin voltage, for two Itanium 8-core processors. Panels: (a) Processor A, (b) Processor B. Each panel plots correctable cache errors and correctable RF errors for core0 through core7.
Figure 10: Dynamic adaptation of supply voltage to runtime conditions in SPECjbb running on an “aggressive” core. Axes: supply voltage (V) and error rate (per minute) versus time (minutes). Series: Margin Voltage, Error rate, Core Voltage.
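The governor policy described in Section 6.2 can be sketched as a one-minute control step in Python. This is a minimal illustration, not the authors' implementation: it assumes the "safety voltage" the governor returns to is the margin voltage, and the error trace and the 995 mV margin are hypothetical.

```python
from typing import List

STEP_MV = 5                # voltage decrement per one-minute interval
TARGET_ERRORS_PER_MIN = 1  # target correctable error rate


def next_voltage_mv(current_mv: int, margin_mv: int, errors_last_minute: int) -> int:
    """Return the supply voltage for the next interval."""
    if errors_last_minute == TARGET_ERRORS_PER_MIN:
        # Error rate is at the target level: keep speculating downward.
        return current_mv - STEP_MV
    # Rate above target, or no correctable errors at all: back off.
    # Assumption: the safety voltage is the margin voltage.
    return margin_mv


def run_governor(margin_mv: int, error_trace: List[int]) -> List[int]:
    """Apply the policy to a per-minute trace of correctable error counts."""
    voltage = margin_mv  # the supply voltage starts at the margin voltage
    history = [voltage]
    for errors in error_trace:
        voltage = next_voltage_mv(voltage, margin_mv, errors)
        history.append(voltage)
    return history


# Hypothetical trace: three quiet minutes, an error burst, then a silent minute.
history = run_governor(995, [1, 1, 1, 3, 1, 0])
print(history)
```

The sketch treats a silent interval the same way as an error burst, following rule (2) in the text.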
6.3. Power Reduction through Voltage Speculation
Voltage Speculation lowers supply voltage by an average of 9-11% across all benchmarks we test, as shown in Figure 11. The data is collected on Processor A, on which we identify cores 0, 1, 2 and 6 as “aggressive” and 3, 4, 5 and 7 as “conservative”.
Outline of our solutions
Reliability, Process Variation, Voltage Variation
• Parichute [micro2010]
• Booster [hpca2012]
• VRSync [isca2012]
• Voltage Speculation in Itanium II [isca2013]
[Figure: thumbnails, including a data/parity diagram and a plot of power (Watts) and cores in barrier versus time (milliseconds).]
Acknowledgments
• The Research Team:
• Timothy N. Miller, PhD 2012, now Assist. Prof. @ SUNY Binghamton
• Renji Thomas
• Xiang Pan
• Naser Sedaghati
• Anys Bacha
• The Sponsors: