Motivation · Baseline decoupled look-ahead · Look-ahead thread acceleration · Experimental analysis · Summary
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Raj Parihar
Advisor: Prof. Michael C. Huang
March 22, 2013
Raj Parihar Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Motivation
Despite the proliferation of multi-core, multi-threaded systems:
- Single-thread performance is still an important design goal
- Modern programs do not lack instruction-level parallelism
- The real challenge is to exploit implicit parallelism without undue costs
- One effective approach: decoupled look-ahead
[Figure: IPC on a log scale (1 to 100) for bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and Gmean; series labeled 128, 512, and 2K]
Motivation: Decoupled Look-ahead
The decoupled look-ahead architecture targets:
- Performance hurdles: branch mispredictions, cache misses, etc.
- Exploration of parallelization opportunities and dependence information
- Microarchitectural complexity and energy inefficiency, through decoupling
The look-ahead thread can often become a new bottleneck, so we explore techniques to accelerate it:
- Speculative parallelization: aptly suited, due to the increased parallelism in the look-ahead binary
- Weak dependence: the lack of a correctness constraint allows weak-instruction removal w/o affecting the quality of look-ahead
Outline
Motivation
Baseline decoupled look-ahead
Look-ahead thread acceleration
Speculative parallelization in look-ahead
Weak dependence removal in look-ahead
Experimental analysis
Summary
Baseline Decoupled Look-ahead
A binary parser generates the skeleton from the original program. The skeleton runs on a separate core and:
- Maintains its memory image in the local L1, with no writeback to the shared L2
- Sends branch outcomes through a FIFO queue, which also helps prefetching
A. Garg and M. Huang, “A Performance-Correctness Explicitly Decoupled Architecture”, MICRO-08
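The FIFO decoupling above can be sketched as a bounded queue of branch outcomes. Everything here (class name, capacity, fallback policy) is our illustrative model, not the implementation from the MICRO-08 paper:

```python
from collections import deque

class BranchOutcomeQueue:
    """Bounded FIFO carrying branch outcomes from the look-ahead
    thread to the main thread (hypothetical model)."""
    def __init__(self, capacity=64):
        self.q = deque()
        self.capacity = capacity

    def push(self, taken):
        # The look-ahead thread stalls when the queue is full:
        # this is the natural throttling mechanism.
        if len(self.q) >= self.capacity:
            return False
        self.q.append(taken)
        return True

    def pop(self):
        # The main thread consumes one outcome per branch; an empty
        # queue means look-ahead has fallen behind, and the main
        # thread falls back to its conventional branch predictor.
        return self.q.popleft() if self.q else None

# Usage: look-ahead runs ahead, the main thread drains outcomes.
boq = BranchOutcomeQueue(capacity=4)
for taken in [True, False, True, True, False]:
    boq.push(taken)            # the fifth push is rejected: queue full
hints = [boq.pop() for _ in range(5)]
print(hints)                   # [True, False, True, True, None]
```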
Practical Advantages of Decoupled Look-ahead
The look-ahead thread is a self-reliant agent, completely independent of the main thread:
- No need for quick spawning or register-communication support
- Low management overhead on the main thread
- Easier for run-time control to disable
It also provides a natural throttling mechanism that prevents run-away prefetching and cache pollution, and the look-ahead thread's size is comparable to an aggregation of short helper threads.
Look-ahead: A New Bottleneck
Comparing four systems to discover new bottlenecks: single-thread, decoupled look-ahead, ideal, and look-ahead alone. Application categories:
- Bottleneck removed, or the speed of look-ahead is not an issue
- The look-ahead thread is the new bottleneck (right half)
[Figure: per-application IPC (0 to 4) for the Single-thread, Decoupled look-ahead, Ideal(Cache,Br), and Look-ahead only configurations]
Speculative parallelization in look-ahead · Weak dependence removal in look-ahead
Unique Opportunities for Speculative Parallelization
The skeleton code offers more parallelism:
- Certain dependences are removed during slicing for the skeleton
- Short-distance dependence chains become long-distance chains, suitable for TLP exploitation
Look-ahead is inherently error-tolerant:
- Can ignore dependence violations
- Little to no support needed, unlike in conventional TLS
Software Support
Dependence analysis:
- Profile guided, coarse-grain, at the basic block level
Spawn and target points:
- Basic blocks with a consistent dependence distance of more than a threshold DMIN
- The spawned thread executes from the target point
Loop-level parallelism is also exploited.
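As a rough illustration of the spawn/target selection just described, the sketch below keeps only basic blocks whose profiled dependence distances consistently exceed DMIN. The profile contents, block names, and selection rule are all hypothetical:

```python
# Hypothetical profile: basic block id -> observed dependence
# distances (in basic blocks) to its nearest upstream producer.
profile = {
    "BB3":  [22, 25, 21, 24],   # consistently far: good target point
    "BB7":  [3, 40, 2, 35],     # inconsistent: unsafe to spawn
    "BB12": [18, 19, 17, 18],
}

DMIN = 15  # minimum dependence distance threshold, in basic blocks

def select_target_points(profile, dmin):
    """Pick blocks whose every profiled dependence distance exceeds
    dmin; spawning a thread that starts at such a block should
    rarely violate a true dependence."""
    return sorted(bb for bb, dists in profile.items()
                  if min(dists) > dmin)

print(select_target_points(profile, DMIN))  # ['BB12', 'BB3']
```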
Parallelism Potential in Look-ahead Binary
Available parallelism for a 2 core/context system; DMIN = 15 basic blocks:
- The skeleton exhibits significantly more BB-level parallelism (17%)
- Loop-based FP applications exhibit more BB-level parallelism
Hardware and Runtime Support
Thread spawning and merging are very similar to regular thread spawning, except:
- The spawned thread shares the same register and memory state
- The spawning thread terminates at the target PC
Value communication:
- Register-based: handled naturally through shared registers in SMT
- Memory-based: can be supported at different levels, with partial versioning in the cache at line granularity
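The line-level partial versioning mentioned above might be modeled as follows. The class, its merge policy, and the 64-byte line size are our assumptions for illustration, not the talk's hardware design:

```python
LINE = 64  # bytes per cache line (assumed)

class PartiallyVersionedCache:
    """Line-granularity partial versioning (sketch): a spawned
    look-ahead thread writes to private line copies; its loads check
    the private version first and fall back to the shared state."""
    def __init__(self, shared):
        self.shared = shared      # line index -> value map
        self.private = {}         # the spawned thread's versions

    def store(self, addr, val):
        self.private[addr // LINE] = val   # version the whole line

    def load(self, addr):
        line = addr // LINE
        # The private version wins; no violation detection is needed,
        # because look-ahead tolerates occasional stale values.
        return self.private.get(line, self.shared.get(line))

    def merge(self):
        # At the target PC the private versions are simply dropped,
        # leaving the shared image as the single state.
        self.private.clear()

shared = {1: "old"}
c = PartiallyVersionedCache(shared)
c.store(100, "new")            # 100 // 64 == 1: private copy of line 1
print(c.load(100), shared[1])  # new old
c.merge()
print(c.load(100))             # old
```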
Speedup of Speculative Parallelization
For the 14 applications in which the look-ahead thread is the bottleneck, speedup of look-ahead systems over single-thread:
- Decoupled look-ahead over the single-thread baseline: 1.53x
- Speculative parallel look-ahead over single-thread: 1.73x
- Speculative look-ahead over decoupled look-ahead: 1.13x
Speculative Look-ahead vs Conventional TLS
The skeleton provides more opportunities for parallelization:
- Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
- Speculative main thread over the single-thread baseline: 1.07x
Motivation for Exploiting Weak Dependences
Not all instructions are equally important and critical. Examples of weak instructions:
- Inconsequential adjustments
- Load and store instructions that are (mostly) silent
- Dynamic NOP instructions
Plenty of weak instructions are present in programs. Challenges involved:
- Weak instructions are context-based, and hard to identify and combine – much like Jenga
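A (mostly) silent store, one of the weak-instruction categories above, can be illustrated with a toy trace-based detector; the helper name and trace are invented for illustration:

```python
# A store is "silent" when it writes the value already in memory;
# such stores (and chains feeding only them) are candidates for
# removal from the skeleton. Toy detector over a memory dict.
def run_trace(mem, stores):
    silent = []
    for i, (addr, val) in enumerate(stores):
        if mem.get(addr) == val:
            silent.append(i)      # no architectural effect
        else:
            mem[addr] = val
    return silent

mem = {0x10: 7}
trace = [(0x10, 7), (0x10, 8), (0x10, 8), (0x20, 0)]
print(run_trace(mem, trace))      # [0, 2]
```

Note that whether a given static store is silent depends on the dynamic context, which is exactly why static heuristics struggle here.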
Comparison of Weak and Strong Instructions
The static attributes of weak and strong instructions are remarkably similar:
- Static attributes: opcode, number of inputs
- The correlation coefficient of the two distributions is 0.96
Weakness correlates very poorly with static attributes, making weak instructions hard to identify through static heuristics.
Genetic Algorithm based Framework
A genetic algorithm based framework identifies and eliminates weak instructions from the look-ahead skeleton:
- Genetic evolution: procreation and natural selection
- Chromosome creation and hybridization
- Baseline look-ahead skeleton construction
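The evolution loop can be sketched as a textbook genetic algorithm over bitmask chromosomes, one bit per removable skeleton instruction. The fitness function below is a stand-in for the real simulation-based fitness test, and all constants (population size, mutation rate, the toy "ground truth") are arbitrary:

```python
import random

random.seed(0)

N_INSTS = 16              # instructions in the (toy) skeleton
TRULY_WEAK = {2, 5, 11}   # ground truth, known only to this toy fitness

def fitness(mask):
    """Stand-in for a simulation run: reward removing weak
    instructions, penalize removing strong ones (which would
    degrade look-ahead quality and cause recoveries)."""
    removed = {i for i in range(N_INSTS) if mask[i]}
    return len(removed & TRULY_WEAK) - 2 * len(removed - TRULY_WEAK)

def evolve(pop_size=30, gens=25, mut=0.05):
    pop = [[random.random() < 0.2 for _ in range(N_INSTS)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]              # natural selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)        # procreation
            cut = random.randrange(1, N_INSTS)    # single-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < mut) for g in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

best = evolve()
print(sorted(i for i in range(N_INSTS) if best[i]))
```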
Heuristic Based Solutions
Heuristic based solutions help jump-start the evolution:
- Superposition based chromosomes
- Orthogonal subroutine based chromosomes
Progress of Genetic Evolution Process
Per-generation progress compared to the final best solution:
- After 2 generations, more than half of the benefits are achieved
- After 5 generations, at least 90% of the benefits are achieved
Experimental Setup
Program/binary analysis tool: based on ALTO
Simulator: based on heavily modified SimpleScalar
- SMT, look-ahead, and speculative parallelization support
- True execution-driven simulation (faithful value modeling)
Genetic algorithm framework:
- Modeled as offline and online extensions to the simulator
Microarchitectural configuration: [table not reproduced in this transcript]
Speedup of Self-tuned Look-ahead
For applications in which the look-ahead thread is a bottleneck, the self-tuned, genetic algorithm based decoupled look-ahead achieves:
- Speedup over baseline decoupled look-ahead: 1.16x
- Speedup over the single-thread baseline: 1.78x
Comparison with Speculative Parallel Look-ahead
The self-tuned skeleton is used in the speculative parallel look-ahead. In some cases, the self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art).
Ongoing and Future Explorations
- Load balancing through skipping non-critical branches
- Weak instruction classification and identification
- Single-core version of decoupled look-ahead
- Static heuristics to construct adaptive skeletons
- Role of look-ahead in promoting parallelization and accelerating the execution of interpreted programs
Summary
Decoupled look-ahead can uncover significant implicit parallelism; however, the look-ahead thread often becomes a new bottleneck. Fortunately, look-ahead lends itself to various optimizations:
- Speculative parallelization is more beneficial in the look-ahead thread
- Weak instructions can be removed w/o affecting look-ahead quality
Intelligent look-ahead is a promising solution in the era of flat frequency and modest microarchitecture scaling. Idle cores in multicore environments will further strengthen the case for adopting decoupled look-ahead in mainstream systems.
Backup Slides
Microthreads vs Decoupled Look-ahead
Lightweight microthreads vs. decoupled look-ahead: [comparison table not reproduced in this transcript]
Look-ahead Skeleton Construction
Performance Benefits of Decoupled Look-ahead
IPC of Speculative Parallelization
Speculative Parallelization: Cortex-A9 vs POWER5
Flexibility in Look-ahead Hardware Design
Comparison of regular (partial versioning) cache support with two other alternatives:
- No cache versioning support
- Dependence violation detection and squash
Partial Recoveries and Spawns
Partial recoveries:
Breakdown of all the spawns:
Genetic Algorithm Evolution
Multi-instruction Gene Examples
Optimizations to Implementation
Fitness test optimizations:
- Sampling based fitness
- Multi-instruction genes
- Early termination of tests
GA framework optimizations:
- Hybridization of solutions
- Adaptive mutation rate
- Unique chromosomes
- Fusion crossover operator
- Elitism policy
Superposition based Chromosomes
Recovery based Early Termination of Fitness Test
Weak Dependence: IPC and Instructions Removed
IPC comparison:
Instructions removed from baseline skeleton:
Sampling based Fitness Test
L2 Cache Sensitivity Study
The speedup for various L2 cache sizes is quite stable: 1.161x (1 MB), 1.154x (2 MB), and 1.152x (4 MB). Average speedups, shown in the figure, are relative to single-threaded execution with a 1 MB L2 cache.
Single-Gene vs Heuristic based Chromosomes
Genetic Algorithm based Heuristics
Other Details
Energy reduction: 11% over baseline decoupled look-ahead, due to reduced cache accesses and less stalling of the main thread. On average, 10% of the dynamic instructions are removed from the baseline skeleton. Offline profiling and control software overhead:
- Offline profiling time: 2 to 20 seconds on the target machine
- Online control software: 17 million instructions for the whole evolution
Average extra recoveries: 3 to 4 per 100,000 instructions
Load Balancing in Look-ahead
Non-critical branches can be transformed to accelerate the look-ahead thread and achieve better load balancing:
- Non-critical branches are skipped in the look-ahead thread
- The main thread executes such branches by itself, w/o any help
- We call these branches Do-It-Yourself (DIY) branches
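One plausible way to pick DIY branches, sketched under the assumption that "non-critical" means highly predictable by the main thread's own predictor (our criterion for illustration, not necessarily the talk's):

```python
# Toy skeleton transform: branches whose outcomes the main thread
# predicts well on its own are marked DIY and dropped from the
# look-ahead skeleton, shortening the look-ahead thread's path.
branches = {          # branch PC -> main-thread predictor accuracy
    0x400100: 0.999,  # highly predictable: main thread needs no help
    0x400180: 0.62,   # hard branch: keep it in the skeleton
    0x400200: 0.995,
}
DIY_THRESHOLD = 0.99  # arbitrary cutoff for this sketch

skeleton = sorted(pc for pc, acc in branches.items()
                  if acc < DIY_THRESHOLD)
diy = sorted(pc for pc in branches if pc not in skeleton)
print([hex(pc) for pc in skeleton])   # ['0x400180']
print([hex(pc) for pc in diy])        # ['0x400100', '0x400200']
```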
Preliminary Speedup of Load Balancing
Load balancing through skipping of the non-critical branches:
- Max speedup over decoupled look-ahead: 1.76x (art)
- Avg speedup over decoupled look-ahead: 1.12x (Gmean)