1Parallelizing FPGA Placement with TM Steffan
Parallelizing FPGA Placement with Transactional Memory
Steven Birk*, Greg Steffan**, and Jason Anderson**
*CS Department / **ECE Department
University of Toronto
Implications of Moore’s Law
need for parallel CAD is intensifying
[Figure: transistor counts by year (1995, 2000, 2005, 2010) for FPGAs and CPUs, plotted against CAD complexity. CPU labels: Pentium II 7.5m, PIV 42m, Core 2 Duo 291m, Core i7 Quad 731m. Other labels: 70m, 350m, 1.1b, 2.5b.]
Parallelizing CAD Software
• The focus of this talk:
– simulated-annealing-based placement
key algorithm in FPGA CAD
Simulated Annealing Placement: Basic Idea
Algorithm:
1) Start with random placement of blocks
2) Randomly pick a pair of blocks to swap
3) Keep new placement if an improvement
…
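[Figure: blocks A, B, C, D connected by nets, shown before and after a candidate swap of A and B; "?" marks the swap under evaluation.]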
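The three listed steps can be sketched as follows. This is a minimal illustration, not VPR's code: the cost function (summed bounding-box half-perimeter), the function names, and the tiny four-block example are all assumptions made for the sketch. A full simulated annealer also accepts some worsening moves depending on a temperature; the slide elides that part and so does this sketch.

```python
import random


def wirelength(placement, nets):
    """Assumed cost: sum of bounding-box half-perimeters over all nets."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def greedy_swap_placement(blocks, nets, slots, moves=200, seed=0):
    rng = random.Random(seed)
    # 1) Start with a random placement of blocks.
    shuffled = slots[:]
    rng.shuffle(shuffled)
    placement = dict(zip(blocks, shuffled))
    cost = wirelength(placement, nets)
    history = [cost]
    for _ in range(moves):
        # 2) Randomly pick a pair of blocks to swap.
        a, b = rng.sample(blocks, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        # 3) Keep the new placement if it is an improvement, else undo it.
        if new_cost < cost:
            cost = new_cost
        else:
            placement[a], placement[b] = placement[b], placement[a]
        history.append(cost)
    return placement, history


# Hypothetical example: four blocks on a 2x2 grid, three two-pin nets.
blocks = ["A", "B", "C", "D"]
nets = [("A", "C"), ("B", "D"), ("A", "D")]
slots = [(0, 0), (1, 0), (0, 1), (1, 1)]
placement, history = greedy_swap_placement(blocks, nets, slots)
```

Note the hand-written undo in step 3: this sequential version restores the old positions itself when a swap is rejected.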
Potential Parallelism: the Intuition
[Figure: three panels over blocks A, B, C, D. "Single-Threaded": Thread 1 evaluates one swap. "Parallel Moves (success)": Thread 1 and Thread 2 each evaluate a swap. "Parallel Moves (failure)": Thread 1 and Thread 2 each evaluate a swap.]
parallelism when blocks/nets are disjoint
nice match to Transactional Memory
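The disjointness condition can be made concrete with a small check. This is only a sketch: the representation of a swap as a pair of block names and a net as a tuple of block names is an assumption, and a TM system detects such overlaps at the level of memory accesses rather than with an explicit test like this one.

```python
def touched(swap, nets):
    """Return the blocks in a swap and the indices of nets attached to them."""
    blocks = set(swap)
    net_ids = {i for i, net in enumerate(nets) if blocks & set(net)}
    return blocks, net_ids


def swaps_conflict(swap1, swap2, nets):
    """Two swaps conflict when they share a block or a net."""
    blocks1, nets1 = touched(swap1, nets)
    blocks2, nets2 = touched(swap2, nets)
    return bool(blocks1 & blocks2) or bool(nets1 & nets2)


# Hypothetical netlist: three two-pin nets over six blocks.
nets = [("A", "B"), ("C", "D"), ("E", "F")]

disjoint = swaps_conflict(("A", "B"), ("C", "D"), nets)     # no shared block or net
overlapping = swaps_conflict(("A", "C"), ("B", "D"), nets)  # both touch nets 0 and 1
```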
Transactional Memory (TM): the Basic Idea
Programmer: Specifies transactions in source code
Source Code: ...atomic { ... access_shared_data(); ...}...
TM System: Executes transactions optimistically in parallel
1) Checkpoints execution
2) Detects conflicts
3) Commits or aborts and re-executes
[Figure: three copies of the atomic block labeled "Transactions:", with "? ?" conflict markers and an "abort!" label.]
Exploits available parallelism while maintaining correctness!
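The checkpoint / detect / commit-or-abort cycle can be illustrated with a toy model. The sketch below is not tinySTM and does not reflect any real TM API; `ToyTM`, `atomic`, and `read` are hypothetical names, and conflict detection here validates the whole store, which is far coarser than a real system.

```python
import threading


class ToyTM:
    """Toy optimistic transactions over one dict of named values."""

    def __init__(self, **initial):
        self._values = dict(initial)
        self._versions = {name: 0 for name in initial}
        self._lock = threading.Lock()
        self.aborts = 0

    def atomic(self, body):
        """Run body(snapshot) -> dict of writes, retrying until it commits."""
        while True:
            # 1) Checkpoint: capture the values and versions the body will see.
            with self._lock:
                snapshot = dict(self._values)
                seen = dict(self._versions)
            writes = body(snapshot)  # runs optimistically, outside the lock
            with self._lock:
                # 2) Detect conflicts: did anything commit since the checkpoint?
                if seen == self._versions:
                    # 3a) Commit the buffered writes.
                    for name, value in writes.items():
                        self._values[name] = value
                        self._versions[name] += 1
                    return
                # 3b) Abort; the loop re-executes the body.
                self.aborts += 1

    def read(self, name):
        with self._lock:
            return self._values[name]


tm = ToyTM(counter=0)


def increment(snapshot):
    return {"counter": snapshot["counter"] + 1}


def worker():
    for _ in range(500):
        tm.atomic(increment)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every increment eventually commits exactly once, so the final counter equals the number of transactions issued, however many aborts happened along the way.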
TM Implementations
• Software TM (STM)
– compiler or library based
– works on current multicores, but high overheads
– Java: DSTM, ASTM
– C or C++: McRT icc, TL2, RSTM, JudoSTM, tinySTM (this work)
• Hardware TM (HTM)
– more automatic, low overhead, limited transaction size
– commercial systems don’t exist yet
– Stanford’s TCC, Wisconsin’s LogTM, SUN’s ROCK
STM has high overhead, no HTMs (yet)
Goals of this Work
• Parallelize simulated-annealing placement
– using software transactional memory (tinySTM)
– demonstrate the potential for good scaling
– not expecting great speedup due to the overheads of STM
• For the FPGA community
– evaluate potential for easier parallelization via TM
– suggest CAD algorithm changes to capitalize on TM
• For the systems/TM community
– lessons from a real application
– TM feature wish-list
Methodology
• CAD SW: Versatile Place and Route (VPR) 5.0
– available at www.eecg.toronto.edu/vpr
• Benchmark circuits: provided by VPR
– sizes ranging from: 67-6000 blocks, 100-60000 nets
– target architecture: 4 LUTs, cluster size 10
• STM: tinySTM
– available at www.tinystm.org
• Platform: 8 CPUs
– 2 x Quad-Core Intel Xeon E5345 @ 2.33 GHz
Challenges: Non-Determinism & Measurement
• Our initial implementation is non-deterministic
– however, a deterministic version is possible, see paper
• Non-determinism makes measurement difficult
– different numbers of threads -> different work/results
• Solution: consider both runtime & quality-of-result (QoR)
– QoR: worst-case critical path delay
can trade off runtime and QoR
The Parallelization Story
First Parallelization Attempt
• Fast: one student-month
– includes time to get familiar with tinySTM, VPR code
– very few code changes
– produced correct results very quickly
– no deadlocks or data races
• Standard parallelism optimizations:
– reductions: e.g. success_sum += 1
– scheduling: move unnecessary code out of transactions
additional effort devoted to improving perf.
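The reduction optimization for a counter like success_sum can be sketched as below. Only the name success_sum comes from the slide; the function, the thread layout, and the sample data are assumptions. Each thread counts into a private variable, so the counter never has to be touched inside a transaction, and the partial counts are summed once at the end.

```python
import threading


def count_successes(outcomes_per_thread):
    """Count accepted swaps with one private counter per thread."""
    partial = [0] * len(outcomes_per_thread)

    def worker(index, outcomes):
        local_sum = 0  # private to this thread, so it cannot cause conflicts
        for accepted in outcomes:
            if accepted:
                local_sum += 1
        partial[index] = local_sum  # each thread writes only its own slot

    threads = [
        threading.Thread(target=worker, args=(i, outcomes))
        for i, outcomes in enumerate(outcomes_per_thread)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    success_sum = sum(partial)  # the reduction, done once after all threads finish
    return success_sum


# Hypothetical accept/reject outcomes for three threads.
total = count_successes([[True, False, True], [True], [False, False]])
```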
Performance (avg all benchmark circuits)
high QoR degradation (30%), high abort rate (60%)
More Optimization: Reduce Aborts
• Use feedback to identify causes of aborts
– 80% of aborts caused by accesses to x_lookup[]
    • array used to locate 2nd block in a swap
– interesting: not used by “I/O” type blocks
• Interesting resulting behavior: “favoritism”
– system favors swapping I/O blocks
    • I/O block swaps have much shorter txns, no conflicts
– only one non-I/O block swapping at a time
    • others conflict immediately on x_lookup[]
– intuition: causing QoR degradation, ‘false’ speedup
solution: privatize x_lookup[]
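Privatization means giving each thread its own copy of the array so that writes to it can no longer collide. The sketch below shows the pattern only: the slides do not describe what x_lookup[] contains, so `build_lookup`, `locate_second_block`, and the array contents are placeholders invented for illustration.

```python
import threading

GRID_WIDTH = 6   # hypothetical size
NUM_THREADS = 4  # hypothetical thread count


def build_lookup():
    """Placeholder contents; the real x_lookup[] is not described in the slides."""
    return list(range(GRID_WIDTH))


def locate_second_block(x_lookup, first_column):
    """Stand-in for code that overwrites the array as scratch space, then reads it."""
    for i in range(len(x_lookup)):
        x_lookup[i] = (first_column + i + 1) % len(x_lookup)
    return x_lookup[0]


private_lookups = [None] * NUM_THREADS
results = [None] * NUM_THREADS


def worker(index):
    x_lookup = build_lookup()  # private copy: no other thread reads or writes it
    private_lookups[index] = x_lookup
    results[index] = locate_second_block(x_lookup, index)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a single shared array, every call to the scratch routine would write the same memory and the transactions would conflict immediately, which matches the behavior reported above for non-I/O swaps.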
Transactions and Swaps: Terminology
• Swaps
– ACCEPTED or REJECTED
• Transactions
– COMMIT or ABORT
More Optimization: Leveraging TM
• VPR code implements commit/abort
– directly modifies placement data structures
– undoes modifications if swap is rejected
• TM implements commit/abort, hence optimize:
– delete VPR code for undoing rejected swaps
– force transaction to abort if swap is rejected
requires API for forcing a transaction to abort
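The idea of mapping REJECTED onto ABORT can be sketched with a toy buffered transaction. Everything here is hypothetical (`ForcedAbort`, `BufferedTransaction`, `run_transaction`, `try_swap`, and the one-dimensional cost); it is not tinySTM's interface and it omits concurrency entirely. The point is that the swap code contains no undo logic: a rejected swap simply discards the transaction's buffered writes.

```python
class ForcedAbort(Exception):
    """Raised by a transaction body to request an abort (hypothetical API)."""


class BufferedTransaction:
    def __init__(self, store):
        self._store = store
        self._writes = {}

    def read(self, key):
        # A transaction sees its own uncommitted writes first.
        return self._writes.get(key, self._store[key])

    def write(self, key, value):
        self._writes[key] = value

    def abort(self):
        raise ForcedAbort()

    def commit(self):
        self._store.update(self._writes)


def run_transaction(store, body):
    """Return True if the body committed, False if it forced an abort."""
    txn = BufferedTransaction(store)
    try:
        body(txn)
    except ForcedAbort:
        return False  # buffered writes are dropped; the store is untouched
    txn.commit()
    return True


def try_swap(placement, a, b, cost):
    def body(txn):
        before = cost(txn.read)
        pos_a, pos_b = txn.read(a), txn.read(b)
        txn.write(a, pos_b)
        txn.write(b, pos_a)
        if cost(txn.read) >= before:
            txn.abort()  # REJECTED swap becomes ABORT; no hand-written undo

    return run_transaction(placement, body)


# Hypothetical 1-D example: the cost is the distance between A and C.
def distance_a_to_c(read):
    return abs(read("A") - read("C"))


placement = {"A": 0, "B": 1, "C": 2}
first = try_swap(placement, "A", "B", distance_a_to_c)   # cost 2 -> 1, accepted
after_first = dict(placement)
second = try_swap(placement, "A", "B", distance_a_to_c)  # cost 1 -> 2, rejected
```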
Impact on Abort Rate
[Figure: abort rate, two panels: "Standard Optimizations" and "Privatization and Leveraging".]
significant decrease in abort rate
Performance of Privatization and Leveraging TM
improved QoR deg: max 35% to 8%, avg 7% to 2%
Even More Optimization: Ignoring Large Nets
improves abort rate, little impact on QoR
[Figure: two panels: "Privatization and Leveraging" and "Ignore Large Nets".]
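The slide does not say how large nets are ignored. One plausible reading, sketched below purely as an assumption, is that nets with more than some number of terminals are skipped when a swap's cost is evaluated, so that swaps no longer all touch the same high-fanout net data. The threshold, the function name, and the bounding-box cost are hypothetical.

```python
def bounding_box_cost(placement, nets, max_net_size=None):
    """Sum bounding-box half-perimeters, optionally skipping large nets."""
    total = 0
    for net in nets:
        if max_net_size is not None and len(net) > max_net_size:
            continue  # ignored: many swaps would otherwise conflict on this net
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical example: one two-pin net and one four-pin net.
placement = {"A": (0, 0), "B": (2, 0), "C": (0, 3), "D": (1, 1)}
nets = [("A", "B"), ("A", "B", "C", "D")]

full_cost = bounding_box_cost(placement, nets)                      # both nets counted
filtered_cost = bounding_box_cost(placement, nets, max_net_size=3)  # four-pin net skipped
```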
Evaluating Scaling
[Figure: two panels, "Relative to Single Thread STM" and "Single Thread STM vs. Sequential"; one label reads "(estimated)".]
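The two comparisons combine into an overall speedup over the sequential program: scaling relative to single-thread STM, divided by the single-thread STM slowdown relative to sequential. The helper below just does that arithmetic; the function name and the runtimes are made up for illustration and are not measurements from this work.

```python
def speedup_breakdown(t_sequential, t_stm_1thread, t_stm_nthreads):
    """Split overall speedup into STM scaling and single-thread STM overhead."""
    scaling = t_stm_1thread / t_stm_nthreads  # relative to single-thread STM
    overhead = t_stm_1thread / t_sequential   # single-thread STM vs. sequential
    overall = scaling / overhead              # equals t_sequential / t_stm_nthreads
    return scaling, overhead, overall


# Made-up runtimes in seconds, chosen only to make the arithmetic easy to follow.
scaling, overhead, overall = speedup_breakdown(
    t_sequential=10.0, t_stm_1thread=80.0, t_stm_nthreads=16.0
)
```

With these made-up numbers, scaling is 5x and overhead is 8x, so the parallel STM version is still slower than sequential code. This is the shape of result that "scaling good, speedup poor" describes.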
Conclusions
• Parallel placement via STM
– good algorithmic fit (accept/reject -> commit/abort)
– speedup poor due to overheads, scaling good, need HTM!
• FPGA community:
– should pay attention to TM, especially HTM
– TM offers fast & correct parallelization, focus on performance
– algorithms can be modified to better exploit TM (ignoring nets)
• Systems/TM community:
– need API for forced abort, ordered transactions