Parallelizing FPGA Placement with TM Steffan
Parallelizing FPGA Placement with Transactional Memory
Steven Birk*, Greg Steffan**, and Jason Anderson**
*CS Department / **ECE Department
University of Toronto
Implications of Moore’s Law
need for parallel CAD is intensifying
[Figure: growth over time. Axis: Year (1995, 2000, 2005, 2010). Series: CPUs (Pentium II 7.5m, PIV 42m, Core 2 Duo 291m, Core i7 Quad 731m), FPGAs (70m, 350m, 1.1b, 2.5b), and CAD Complexity.]
Parallelizing CAD Software
• The focus of this talk:
– simulated-annealing-based placement
key algorithm in FPGA CAD
Simulated Annealing Placement: Basic Idea
Algorithm:
1) Start with random placement of blocks
2) Randomly pick a pair of blocks to swap
3) Keep new placement if an improvement
…
[Figure: blocks A, B, C, D connected by nets, shown before and after a candidate swap of A and B]
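The three listed steps can be sketched as follows. This is a minimal illustration, not VPR's implementation: the cost function (half-perimeter bounding box per net), the square grid, and the names `wirelength` and `place` are assumptions made for the example. It covers only the steps shown on the slide, so it keeps a swap only when the cost improves.

```python
import random

def wirelength(placement, nets):
    """Half-perimeter bounding-box cost summed over all nets (assumed cost function)."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def place(blocks, nets, grid, moves, seed=0):
    """Return (placement, initial_cost, final_cost) for a hypothetical grid x grid device."""
    rng = random.Random(seed)
    # 1) Start with a random placement of blocks on distinct grid sites.
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))
    initial = cost = wirelength(placement, nets)
    for _ in range(moves):
        # 2) Randomly pick a pair of blocks to swap.
        a, b = rng.sample(blocks, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        # 3) Keep the new placement if it is an improvement, otherwise undo the swap.
        if new_cost < cost:
            cost = new_cost
        else:
            placement[a], placement[b] = placement[b], placement[a]
    return placement, initial, cost
```

The explicit undo in the `else` branch is the part that a transactional version can hand over to the TM system.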
Potential Parallelism: the Intuition
parallelism when blocks/nets are disjoint
[Figure: three panels over blocks A, B, C, D. Single-Threaded: Thread 1 evaluates one swap. Parallel Moves (success): Thread 1 and Thread 2 evaluate swaps at the same time. Parallel Moves (failure): the swaps of Thread 1 and Thread 2 collide.]
nice match to Transactional Memory
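The disjointness condition can be written down directly. The sketch below is an illustration under assumptions: `nets_of` is a hypothetical map from each block to the set of nets attached to it, and a move is a pair of blocks to swap. Two moves are treated as independent only when they share no block and no net.

```python
def footprint(move, nets_of):
    """Blocks and nets touched by swapping the two blocks in `move`."""
    a, b = move
    return {a, b}, nets_of[a] | nets_of[b]

def can_run_in_parallel(move1, move2, nets_of):
    """True when the two swaps touch disjoint blocks and disjoint nets."""
    blocks1, nets1 = footprint(move1, nets_of)
    blocks2, nets2 = footprint(move2, nets_of)
    return blocks1.isdisjoint(blocks2) and nets1.isdisjoint(nets2)
```

A TM system does not need this check up front: it runs both moves optimistically and discovers the overlap, if any, as a conflict at run time.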
Transactional Memory (TM): the Basic Idea
Source Code:
...
atomic {
  ...
  access_shared_data();
  ...
}
...
Programmer:
Specifies transactions in source code
TM System:
Executes transactions optimistically in parallel
1) Checkpoints execution
2) Detects conflicts
3) Commits or aborts and re-executes
Exploits available parallelism
while maintaining correctness!
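The checkpoint, detect, and commit-or-abort cycle can be illustrated with a toy model. This is a single-threaded sketch under assumptions, not how tinySTM or any real TM system is built: `Store`, `Txn`, `Conflict`, and `atomic` are hypothetical names, conflicts are detected by comparing per-key version numbers at commit time, and writes are buffered until commit.

```python
class Conflict(Exception):
    """Raised at commit when a value read by the transaction has since changed."""

class Store:
    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}

class Txn:
    def __init__(self, store):
        self.store = store
        self.reads = {}   # key -> version observed on first read
        self.writes = {}  # buffered writes, invisible to others until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        self.reads.setdefault(key, self.store.version[key])
        return self.store.data[key]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # 2) Detect conflicts: every key we read must still be at the version we saw.
        for key, seen in self.reads.items():
            if self.store.version[key] != seen:
                raise Conflict(key)
        # 3) Commit: publish buffered writes and bump their versions.
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1

def atomic(store, body):
    """Run body(txn, attempt) until it commits; return (result, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        txn = Txn(store)  # 1) Checkpoint: a fresh transaction starts with empty buffers.
        try:
            result = body(txn, attempts)
            txn.commit()
            return result, attempts
        except Conflict:
            continue      # 3) Abort: drop the buffers and re-execute.
```

The attempt number is passed to `body` only so that an example can inject a conflicting update on the first attempt and observe the re-execution.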
TM Implementations
• Software TM (STM)
– compiler or library based
– works on current multicores, but high overheads
– Java: DSTM, ASTM
– C or C++: McRT icc, TL2, RSTM, JudoSTM, tinySTM (this work)
• Hardware TM (HTM)
– more automatic, low overhead, limited transaction size
– commercial systems don’t exist yet
– Stanford’s TCC, Wisconsin’s LogTM, SUN’s ROCK
STM has high overhead, no HTMs (yet)
Goals of this Work
• Parallelize simulated-annealing placement
– using software transactional memory (tinySTM)
– demonstrate the potential for good scaling
– not expecting great speedup due to the overheads of STM
• For the FPGA community
– evaluate potential for easier parallelization via TM
– suggest CAD algorithm changes to capitalize on TM
• For the systems/TM community
– lessons from a real application
– TM feature wish-list
Methodology
• CAD SW: Versatile Place and Route (VPR) 5.0
– available at www.eecg.toronto.edu/vpr
• Benchmark circuits: provided by VPR
– sizes ranging from: 67-6000 blocks, 100-60000 nets
– target architecture: 4 LUTs, cluster size 10
• STM: tinySTM
– available at www.tinystm.org
• Platform: 8 CPUs
– 2 x Quad-Core Intel Xeon E5345 @ 2.33 GHz
Challenges: Non-Determinism & Measurement
• Our initial implementation is non-deterministic
– however a deterministic version is possible, see paper
• Non-determinism makes measurement difficult
– different numbers of threads -> different work/results
• Solution: consider both runtime & quality-of-result (QoR)
– QoR: worst-case critical path delay
can trade-off runtime and QoR
The Parallelization Story
First Parallelization Attempt
• Fast: one student-month
– includes time to get familiar with tinySTM, VPR code
– very few code changes
– produced correct results very quickly
– no deadlocks or data races
• Standard parallelism optimizations:
– reductions: e.g. success_sum += 1
– scheduling: move unnecessary code out of transactions
additional effort devoted to improving perf.
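The reduction idea can be sketched as follows. A counter such as success_sum that every transaction increments is shared state, so updating it inside transactions would make otherwise independent transactions conflict. In this illustration each thread counts privately and the partial counts are summed once at the end. The function name and the input format (one list of accept/reject outcomes per thread) are assumptions for the example; the slides do not show how the VPR code was restructured.

```python
import threading

def count_successes(outcomes_per_thread):
    """Each thread counts into its own slot; the slots are summed once at the end."""
    partial = [0] * len(outcomes_per_thread)

    def worker(tid, outcomes):
        local = 0
        for accepted in outcomes:
            if accepted:
                local += 1      # private counter: no shared write, so no conflict
        partial[tid] = local    # a single write per thread, to a distinct slot

    threads = [threading.Thread(target=worker, args=(tid, outcomes))
               for tid, outcomes in enumerate(outcomes_per_thread)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(partial)         # the reduction step
```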
Performance (avg all benchmark circuits)
[Figure: performance and QoR degradation (deg.), averaged over all benchmark circuits]
high QoR degradation (30%), high abort rate (60%)
More Optimization: Reduce Aborts
• Use feedback to identify causes of aborts
– 80% of aborts caused by accesses to x_lookup[]
  • array used to locate 2nd block in a swap
– interesting: not used by “I/O” type blocks
• Interesting resulting behavior: “favoritism”
– system favors swapping I/O blocks
  • I/O block swaps have much shorter txns, no conflicts
– only one non-I/O block swapping at a time
  • others conflict immediately on x_lookup[]
– intuition: causing QoR degradation, ‘false’ speedup
solution: privatize x_lookup[]
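Privatization in general means giving each thread its own copy of a structure so that accesses to it can no longer conflict. The sketch below shows that pattern with thread-local storage. It is an assumption-laden illustration: the slides say only that x_lookup[] is used to locate the second block of a swap, so the array contents, the `width` parameter, and the helper name `get_x_lookup` are hypothetical.

```python
import threading

_private = threading.local()

def get_x_lookup(width):
    """Return the calling thread's own scratch array, creating it on first use."""
    if not hasattr(_private, "x_lookup"):
        _private.x_lookup = [0] * width  # hypothetical contents
    return _private.x_lookup
```

Because each thread sees a different list, writes by one thread are invisible to the others and never appear in another transaction's read or write set.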
Transactions and Swaps: Terminology
• Swaps
– ACCEPTED or REJECTED
• Transactions
– COMMIT or ABORT
[Figure: swap of blocks A and B]
More Optimization: Leveraging TM
• VPR code implements commit/abort
– directly modifies placement data structures
– undoes modifications if swap is rejected
• TM implements commit/abort, hence optimize:
– delete VPR code for undoing rejected swaps
– force transaction to abort if swap is rejected
requires API for forcing a transaction to abort
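The mapping from a rejected swap to a forced abort can be sketched with a toy transaction. This is a single-threaded illustration under assumptions and does not use the tinySTM API: `ForcedAbort`, `run_transaction`, and `try_swap` are hypothetical names, and the transaction is modeled as a private working copy that is published on commit and dropped on abort.

```python
class ForcedAbort(Exception):
    """Raised by the swap code to make the enclosing transaction abort."""

def run_transaction(shared, body):
    """Run body on a private copy; publish it on commit, discard it on abort."""
    working = dict(shared)       # buffered view: writes are not yet visible
    try:
        body(working)
    except ForcedAbort:
        return "ABORT"           # buffered writes are simply dropped
    shared.clear()
    shared.update(working)       # commit: make all writes visible
    return "COMMIT"

def try_swap(placement, a, b, cost):
    """Swap a and b inside a transaction; a REJECTED swap forces an ABORT."""
    def body(p):
        old = cost(p)
        p[a], p[b] = p[b], p[a]  # modify the placement directly, with no undo code
        if cost(p) >= old:
            raise ForcedAbort()  # swap REJECTED -> transaction ABORT
    return run_transaction(placement, body)
```

The swap code contains no undo path: discarding the transaction restores the placement, which is the role the deleted VPR undo code used to play.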
Impact on Abort Rate
[Figure: abort rate, two panels: Standard Optimizations; Privatization and Leveraging]
significant decrease in abort rate
Performance of Privatization and Leveraging TM
[Figure: performance plots with QoR degradation (deg.)]
improved QoR deg: max 35% to 8%, avg 7% to 2%
Even More Optimization: Ignoring Large Nets
[Figure: abort rate, two panels: Privatization and Leveraging; Ignore Large Nets]
improves abort rate, little impact on QoR
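One way to picture this optimization is as a filter on the nets a swap has to examine. The sketch below is an assumption, not the authors' code: `nets_of`, `net_size`, and the `max_fanout` threshold are hypothetical, and the slides do not state what size counts as large. The presumed benefit is that a net attached to many blocks is touched by many swaps, so leaving it out of the transaction removes a frequent source of conflicts.

```python
def nets_to_evaluate(block_a, block_b, nets_of, net_size, max_fanout):
    """Nets a swap of block_a and block_b reads and updates, skipping large nets."""
    touched = nets_of[block_a] | nets_of[block_b]
    return {net for net in touched if net_size[net] <= max_fanout}
```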
Evaluating Scaling
[Figure: two panels: Relative to Single Thread STM (estimated); Single Thread STM vs. Sequential]
Conclusions
• Parallel placement via STM
– good algorithmic fit (accept/reject -> commit/abort)
– speedup poor due to overheads, scaling good, need HTM!
• FPGA community:
– should pay attention to TM, especially HTM
– TM offers fast & correct parallelization, focus on performance
– algorithms can be modified to better exploit TM (ignoring nets)
• Systems/TM community:
– need API for forced abort, ordered transactions