National Center for Supercomputing Applications
Case Studies of Porting Two Algorithms to Reconfigurable Processors
Reconfigurable Systems Summer Institute
Wednesday, July 13, 2005
Craig Steffen
National Center for Supercomputing Applications
Reminder: FPGA Computational Strengths
• Parallel elements
• Integer processing means small footprint
• Processor cache structure—data reuse
• Simple problems: maximum utilization
• Simple problems to describe
Example Algorithm: Matrix Multiply
• Popular and much-used algorithm
• Well-known API in place
• Advantages: Simple, dividable, inherently parallel, data re-use
• Disadvantage: floating point
Matrix Multiply Algorithm
• Parallel computations
• Multiple data uses
• Lends well to MAC units
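$$
\underbrace{\begin{pmatrix} a & b & c & d \\ e & f & g & h \end{pmatrix}}_{A}
\times
\underbrace{\begin{pmatrix} k \\ m \\ n \\ p \end{pmatrix}}_{B}
=
\underbrace{\begin{pmatrix} q \\ r \end{pmatrix}}_{C}
$$

$$
q = ak + bm + cn + dp, \qquad r = ek + fm + gn + hp
$$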
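Matrix Multiply Algorithm (matrix dimensions)

$$
\underbrace{\begin{pmatrix} a & b & c & d \\ e & f & g & h \end{pmatrix}}_{\alpha \times \theta}
\times
\underbrace{\begin{pmatrix} k \\ m \\ n \\ p \end{pmatrix}}_{\theta \times \delta}
=
\underbrace{\begin{pmatrix} q \\ r \end{pmatrix}}_{\alpha \times \delta}
$$

$$
q = ak + bm + cn + dp, \qquad r = ek + fm + gn + hp
$$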
Matrix Multiply Implementation in C
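/* alpha, delta, theta stand for the dimensions α, δ, θ:
   A is alpha x theta, B is theta x delta, C is alpha x delta */
for (i = 0; i < alpha; i++) {
    for (j = 0; j < delta; j++) {
        C[i][j] = 0.0;
        for (k = 0; k < theta; k++) {
            C[i][j] += A[i][k] * B[k][j];   /* one multiply-accumulate (MAC) */
        }
    }
}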
Naïve Implementation Performance
• Generated a version that timed matrix multiply with inputs α,δ,θ and N (iterations)
• Going from (α,δ,θ,N) = (40,800,250,40) to (400,800,250,40) caused a 2.5x slowdown (cache issues: data re-use rears its head)
• Speed was 500M MACs per second, or 1B operations per second, on a 2.8 GHz CPU
• A real optimized library would run about 6x faster
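A rough check of these figures, assuming each pass through the innermost loop counts as one multiply-accumulate (MAC) and each MAC as two floating-point operations (one multiply, one add):

$$
\begin{aligned}
\text{MACs per multiply} &= \alpha\,\delta\,\theta = 40 \times 800 \times 250 = 8 \times 10^{6} \\
\text{with } \alpha = 400:\quad \alpha\,\delta\,\theta &= 400 \times 800 \times 250 = 8 \times 10^{7} \\
5 \times 10^{8}\ \text{MAC/s} \times 2\ \text{ops/MAC} &= 1 \times 10^{9}\ \text{ops/s} \\
\frac{1 \times 10^{9}\ \text{ops/s}}{2.8 \times 10^{9}\ \text{cycles/s}} &\approx 0.36\ \text{ops per cycle}
\end{aligned}
$$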
Matrix Multiply: block-wise divisible
• Any block of elements may be multiplied as a unit
• As long as the general rules are followed, the final result is the same as it would have been
• This can be exploited to take advantage of specialized units with preferred operand sizes
[Figure: a matrix product (block × block = block) rewritten as the sum of two block products (= block × block + block × block)]
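The block-wise property can be sketched in C as follows. This is an illustrative sketch, not the implementation from the talk: the function name `matmul_blocked`, the flat row-major storage, and the block edge length `bs` are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C = A * B computed block by block.
 * A is alpha x theta, B is theta x delta, C is alpha x delta,
 * all stored row-major in flat arrays. bs is an assumed block
 * edge length, e.g. the operand size a hardware unit prefers. */
void matmul_blocked(size_t alpha, size_t delta, size_t theta, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);

    for (size_t i = 0; i < alpha * delta; i++)
        C[i] = 0.0;

    for (size_t i0 = 0; i0 < alpha; i0 += bs) {
        size_t i1 = (i0 + bs < alpha) ? i0 + bs : alpha;
        for (size_t k0 = 0; k0 < theta; k0 += bs) {
            size_t k1 = (k0 + bs < theta) ? k0 + bs : theta;
            for (size_t j0 = 0; j0 < delta; j0 += bs) {
                size_t j1 = (j0 + bs < delta) ? j0 + bs : delta;
                /* Multiply one block of A by one block of B and
                 * accumulate into the matching block of C. */
                for (size_t i = i0; i < i1; i++) {
                    for (size_t k = k0; k < k1; k++) {
                        double a = A[i * theta + k];
                        for (size_t j = j0; j < j1; j++)
                            C[i * delta + j] += a * B[k * delta + j];
                    }
                }
            }
        }
    }
}
```

Any positive `bs` gives the same mathematical product as the naive triple loop (floating-point rounding can differ slightly because the order of additions changes); only the order in which partial products are accumulated changes, which is what lets a block be sized to fit a cache or a hardware unit.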
64-bit Floating-Point FPGA Matrix Multiplication
• Yong Dou, S. Vassiliadis, G. K. Kuzmanov, G. N. Gaydadjiev
• FPGA ’05, February 20-22, 2005, Monterey, California, USA
• Contains a 12-stage pipelined MAC block design
• Multi-FPGA master-slave multiple simultaneous execution design
Plan for Implementation on SRC MAP Processor
• MAP runs at 100 MHz clock speed
• Assuming fully pipelined logical units (MAC in this case), matching the CPU's 500M MACs per second requires 5 MACs running in parallel (disregarding transfer latencies)
Single Data Use: MAP Starves
• RAM to MAP data pipe can feed 2 MACs
• 6 required to make it worthwhile
[Figure: data paths into the MAP Processor, two labeled 64 bit and four labeled 32 bit]
Using Caching: Now Equals CPU Speed
• Data re-use in OBM
[Figure: MAP Processor containing the FPGA and six On-Board Memory (OBM) banks, with two 64-bit and four 32-bit data paths]
More Speedup: Requires Data Re-use in FPGA Block RAM
[Figure: MAP Processor containing the FPGA and six On-Board Memory banks, with two 64-bit and four 32-bit data paths]
Matrix Multiply Status
• On hold for the moment
• Need to understand programming and access issues for Block RAM
BLAST: DNA Comparison Code
• Basic Local Alignment Search Tool
• Biology code for comparing DNA and protein sequences
• Multiple modes: DNA-DNA, DNA-protein, protein-DNA, protein-protein
• Answers the question: “Is A a subset of B?” where A is short and B is very long
• Not a complete match: takes into account DNA combinatoric and protein substitution rules
blastp: compare DNA to Protein
• First, translate DNA to Protein
• Translate all forward frames (Frame 1, Frame 2, Frame 3)
• Complete “6-Frame translation”
[Figure: a DNA strand and the amino acid sequences translated from each reading frame]
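A 6-frame translation can be sketched in C as follows. This is a minimal illustration and not code from the talk: it assumes the standard genetic code, uppercase A/C/G/T input, and the hypothetical helper names `translate_frame` and `reverse_complement`. The three forward frames come from offsets 0, 1 and 2 of the DNA; the other three come from the same offsets of its reverse complement.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard genetic code, indexed by 16*b1 + 4*b2 + b3 with bases
 * ordered T=0, C=1, A=2, G=3. '*' marks a stop codon. */
static const char CODON_TABLE[] =
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG";

static int base_index(char b)
{
    switch (b) {
    case 'T': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;   /* unknown or ambiguous base */
    }
}

static char complement(char b)
{
    switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
    }
}

/* Translate one frame (offset 0, 1 or 2) of dna[0..n) into out, which
 * must hold at least n/3 + 1 chars. Codons containing an unknown base
 * become 'X'. Returns the number of amino acids written. */
size_t translate_frame(const char *dna, size_t n, int offset, char *out)
{
    size_t count = 0;
    assert(offset >= 0 && offset < 3);
    for (size_t i = (size_t)offset; i + 3 <= n; i += 3) {
        int b1 = base_index(dna[i]);
        int b2 = base_index(dna[i + 1]);
        int b3 = base_index(dna[i + 2]);
        out[count++] = (b1 < 0 || b2 < 0 || b3 < 0)
                           ? 'X'
                           : CODON_TABLE[16 * b1 + 4 * b2 + b3];
    }
    out[count] = '\0';
    return count;
}

/* Write the reverse complement of dna[0..n) into rc (n + 1 chars). */
void reverse_complement(const char *dna, size_t n, char *rc)
{
    for (size_t i = 0; i < n; i++)
        rc[i] = complement(dna[n - 1 - i]);
    rc[n] = '\0';
}
```

Calling `translate_frame` with offsets 0 to 2 on the original strand and again on the output of `reverse_complement` yields the six amino acid sequences. Because the six frames do not depend on one another, they can be processed independently.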
BLAST Method (per Frame):
• Finds small local matches
• Tries to expand matches to improve them, by changing matches or inserting gaps, all of which have weights tied to the probability of one combination mutating to another
• Each change causes a change in the “goodness of match” score
• Many combinations are attempted until the goodness value peaks
• This method is very unsuited to FPGAs: each step depends on the previous one, with one small comparison loop carrying all the weight
[Figure: an initial match and its extensions, with scores "two elements", "2 elements – 1 gap", and "4 elements – 1 gap"]
BLAST: Solve the same problem differently
• Component-by-component comparison for multiple offsets
• For each offset record a score based on the number and/or arrangement of matches
• After all comparisons are finished, then (perhaps) do iterative matching
[Figure: the short sequence compared against the long one at five successive offsets, with position scores 2, 0, 0, 2, 0]
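The offset-by-offset comparison can be sketched in C as follows. The slides leave the scoring rule open ("number and/or arrangement of matches"), so this sketch assumes the simplest one, a count of exactly matching elements. The function names `offset_score` and `score_all_offsets` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed scoring rule: count the positions where the short sequence
 * (query) and the long sequence (subject) agree at a given offset. */
unsigned offset_score(const char *query, size_t qlen,
                      const char *subject, size_t offset)
{
    unsigned score = 0;
    for (size_t i = 0; i < qlen; i++)
        if (query[i] == subject[offset + i])
            score++;
    return score;
}

/* Score every offset at which the query fits inside the subject.
 * scores must have room for slen - qlen + 1 entries.
 * Returns the number of offsets scored. */
size_t score_all_offsets(const char *query, size_t qlen,
                         const char *subject, size_t slen,
                         unsigned *scores)
{
    assert(qlen > 0);
    if (qlen > slen)
        return 0;

    size_t n = slen - qlen + 1;
    for (size_t off = 0; off < n; off++)
        scores[off] = offset_score(query, qlen, subject, off);
    return n;
}
```

For example, scoring the query `"AB"` against the subject `"ABXABX"` gives the scores 2, 0, 0, 2, 0 for offsets 0 through 4. No offset's score depends on any other, and within an offset no element comparison depends on any other.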
Performance
• Problem provided by Matt Hudson of Crop Sciences: a protein that is encoded by DNA in a certain plant chromosome
• The protein is ~1100 amino acids; the chromosome is 31 million bases
• BLAST detects two very strong hits and two weak ones, taking 3 CPU-seconds
• My algorithm, when coded naively, takes 20 minutes
• With reasonable speed-ups, takes 19 to 40 seconds
FPGA Advantages to this Algorithm:
• Each offset is independent
• Each element’s comparison is independent
• Data re-use factor is the length of the short sequence
• 6-fold parallelism due to 6-frame translation
[Figure: the short sequence compared against the long one at five successive offsets, with position scores 2, 0, 0, 2, 0]
Implementation:
• Do comparisons in parallel
• Shift DNA sequence, push new element into pipe
• Repeat
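The compare, shift, repeat loop above can be modeled in software as a sliding window. This is only a sketch of the idea in plain C, not MAP-C: the `shift_pipe` type, the `WINDOW_MAX` bound, and the match-count score are all assumptions made for illustration, and the inner comparison loop stands in for comparisons that hardware would perform in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WINDOW_MAX 64   /* assumed upper bound on the short sequence length */

typedef struct {
    char   query[WINDOW_MAX];    /* short sequence, held fixed */
    char   window[WINDOW_MAX];   /* most recent qlen elements of the long one */
    size_t qlen;
    size_t filled;               /* number of valid elements in the window */
} shift_pipe;

void pipe_init(shift_pipe *p, const char *query, size_t qlen)
{
    assert(qlen > 0 && qlen <= WINDOW_MAX);
    memcpy(p->query, query, qlen);
    memset(p->window, 0, sizeof p->window);
    p->qlen = qlen;
    p->filled = 0;
}

/* Shift the window by one element, push `next` in at the end, then
 * compare every position against the query. Returns the match count
 * for the current window, or -1 while the window is still filling. */
int pipe_step(shift_pipe *p, char next)
{
    memmove(p->window, p->window + 1, p->qlen - 1);
    p->window[p->qlen - 1] = next;
    if (p->filled < p->qlen)
        p->filled++;
    if (p->filled < p->qlen)
        return -1;

    int score = 0;
    for (size_t i = 0; i < p->qlen; i++)
        score += (p->window[i] == p->query[i]);
    return score;
}
```

Feeding the long sequence through `pipe_step` one element at a time produces one offset score per step once the window is full, so each element of the long sequence is read only once and then re-used for as many comparisons as the short sequence is long.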
Current Status
• Element-wise shift is not trivially defined in MAP-C; defined it with a Perl-expanded macro
• Must create parallel comparison and adder tree to finish
[Figure: parallel comparisons (marked "????") feeding a tree of adders that produces the total offset score]
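The adder tree can be illustrated in software as a pairwise reduction. This is a hedged sketch in plain C rather than MAP-C; the function name `adder_tree_sum` is hypothetical, and each pass of the outer loop corresponds to what would be one level of adders in hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n per-element match flags by repeatedly adding neighbouring
 * pairs, the software analogue of an adder tree. The vals array is
 * overwritten with partial sums. */
unsigned adder_tree_sum(unsigned *vals, size_t n)
{
    assert(n > 0);

    while (n > 1) {
        size_t half = n / 2;

        /* One tree level: these additions are independent of each
         * other, so hardware could perform them all at once. */
        for (size_t i = 0; i < half; i++)
            vals[i] = vals[2 * i] + vals[2 * i + 1];

        /* An odd element out is carried up to the next level. */
        if (n % 2) {
            vals[half] = vals[n - 1];
            half++;
        }
        n = half;
    }
    return vals[0];
}
```

With one match flag (0 or 1) per compared element as input, the result is the total offset score. The tree needs about log2(n) levels instead of the n - 1 sequential additions of a simple running sum.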
Conclusion
• Tools are gaining in usefulness and sophistication
• The programmer must explicitly deal with memory architectures and data movement
• Some things just don’t work as you’re used to thinking about them