MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu...
-
date post
15-Jan-2016 -
Category
Documents
-
view
220 -
download
0
Transcript of MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu...
MEMOCODE 2007HW/SW Co-design Contest
Documentation of the submission by
Eric Simpson
Pengyuan Yu
Sumit Ahuja
Sandeep Shukla
Patrick Schaumont
Electrical and Computer Engineering
Department
Virginia Tech
Table of Contents
Section 1 Performance Evaluation and Analysis Section 2 Matrix Multiplication Algorithm Optimization Section 3 HW/SW System Implementation Section 4 Co-design Flow and Methodology Section 5 Conclusion
Section 1Performance Evaluation and
Analysis
Performance Results
Section 1 Performance Evaluation and Analysis
Matrix Size 64 128 256 512 1024
Run Time
(sec)
Our Design
(Average)0.0052 0.0322 0.2170 1.5176 11.882
Reference 0.0346 0.6697 5.3133 42.302 338.72
SpeedUp 6.65 20.8 24.5 26.9 28.5
Device
Utilization
BRAM 80 (64 Coprocessor + 16 On-Chip-Memory)
Mult 128
Performance Calculation
FCPU-Speed = 1, we used 300Mhz PPC
FFPGA-Capacity = 1, we used XUP’s XC2VP30
FFPGA-speed = 1, we used 100Mhz clock for bus and coprocessor
TimeEffective = (Tmeas,N=1024 + Tmeas,N=256 * 64) *
FCPU-Speed * FFPGA-Capacity * FFPGA-speed
= (11.882 + 64*0.217) * 1 * 1 * 1
= 25.77 secondsSection 1 Performance Evaluation and Analysis
Performance Results
Section 1 Performance Evaluation and Analysis
Speed Up Factor Against Reference Design (Blocked by 16)
6.66
20.79
24.4926.92
28.51
0.00
5.00
10.00
15.00
20.00
25.00
30.00
64 128 256 512 1,024
Test Case Matrix Size
Spe
edU
p Fa
ctor
Section 2Matrix Multiplication
Algorithm Optimization
Algorithm Optimization
Algorithm is optimized based on targeting platform (Virtex2 Pro VP30)
Optimization goal: Best utilized the slow DDR Memory Interface
Optimally 128-bit/cycle transfers => 4 Complex Numbers Linear accesses result in better throughput
Utilize as many fast discrete FPGA Resources as possible 136 18x18-Hardware Multipliers 136 18kbits Block Rams
Section 2 Matrix Multiplication Algorithm Optimization
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Optimized Algorithm
A
B
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
Bring in 4 complex numbers from “A”
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
• Bring in four numbers from “B” and perform the following calculations:
C[0][0] = C[0][0] + A[0][0]*B[0][0]C[0][1] = C[0][0] + A[0][0]*B[0][1]C[0][2] = C[0][0] + A[0][0]*B[0][2]C[0][3] = C[0][0] + A[0][0]*B[0][3]…C[8][0] = C[8][0] + A[8][0]*B[0][0]C[8][1] = C[8][0] + A[8][0]*B[0][1]C[8][2] = C[8][0] + A[8][0]*B[0][2]C[8][3] = C[8][0] + A[8][0]*B[0][3]
• Where “A*B” is a complex multiplication.
• 32 Complex multiplication in parallel = 128 multiplies, 64 additions/subtractions and 64 accumulates per cycle
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
At this point we have completed calculating the first 8xN rows of C in our coprocessor and we write the results back to RAM
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
C
Optimized Algorithm
A
B
C
• [A] currently in coprocessor
• [A] currently used for calculation
• [B] currently used for calculation
• [C] stored and accumulated in BRAM
• [C] being multiplied and accumulated
Section 2 Matrix Multiplication Algorithm Optimization
Next, we repeat the previous algorithm to
calculate the next “8xN CSlice”
Optimized Algorithm
Performs 128 MACs per cycle (utilizing 128 out of 136 hard multipliers)
Linear scan through B matrix (optimizing interface to DDR storage)
Section 2 Matrix Multiplication Algorithm Optimization
Section 3HW/SW System Implementation
System Architecture
Processor Local Bus
Section 3 HW/SW System Implementation
Minor deviation from proposed algorithm I/O size for coprocessor: B elements are loaded 2
at a time instead of 4 PLB DMA failed to function resulting in a much slower
{DDR->PPC->Coprocessor FIFO} datapath. FIFO width of 64-bit => 2-number sends from PPC to
Coprocessor FIFO
To maintain SAME calculation capacity: A-Block dimension doubled from 8x4 to 16x4. C-Slice doubled from 8xN to 16xN Still utilizes 128 Hardware Multipliers.
Coprocessor Architecture vs. Optimized Algorithm
Section 3 HW/SW System Implementation
Coprocessor Architecture
Coprocessor is scalable! Reduce the depth of the A-matrix subblock to
reduce the amount of MAC needed
Section 3 HW/SW System Implementation
Coprocessor Architecture
Section 3 HW/SW System Implementation
MAC Unit ArchitectureB(0,0) B(0,1)
A(0,0)
C(0,1)
X
C(0,0)
X
A(L,0)
C(L,1)
X
C(L,0)
X
......
...
Row[0]
Row[15]
Section 3 HW/SW System Implementation
MAC Unit ArchitectureB(0,0) B(0,1)
A(0,0)
C(0,1)
X
C(0,0)
X
A(L,0)
C(L,1)
X
C(L,0)
X
......
...
Row[0]
Row[15]
Complex Multiply
Accumulate
BlockRAM Storage for current “C” value
Input “B” Value
“A” Values
Section 3 HW/SW System Implementation
Section 4Co-design Flow and
Methodology
Design Flow ReferenceC Algorithm
OptimizedC Algorithm
Driver CAlgorithm
GEZELCoprocessor
VHDLPPC Binary
XUP Board
Manual Partitioning
Rectangular-BlockTransformation
Cosimulation
Synthesis
PerformanceAnalysis
Section 4 Co-design Flow and Methodology
Simulation ReferenceC Algorithm
OptimizedC Algorithm
Driver CAlgorithm
GEZELCoprocessor
VHDLPPC Binary
XUP Board
workstation
cycle-basedinstruction-set
cosimulator
FPGA
Section 4 Co-design Flow and Methodology
Simulation Simulation-based verification on three levels
workstation (behavioral) cycle-based ISS (functional model of coprocessor) FPGA board (skipping VHDL simulation since
synthesis is swift and easy) Drawback - simulations capture only behavior,
but not the architecture. Example: Hard to estimate post-synthesis timing Example: Hard to reflect memory-bus behavior
(DMA, DDR, ...) in a C simulation model
Section 4 Co-design Flow and Methodology
Cycle-based Instruction-set Simulation Uses GEZEL Cosimulation Tool
http://rijndael.ece.vt.edu/gezel2
Application SW(C Code)
uP DDR
“N” Reg FIFO IN FIFO OUT
Coprocessor
ExecutableInstruction
Set simulator
CosimulationInterfaces
CoprocessorHardware
Section 4 Co-design Flow and Methodology
Cycle-based Instruction-set Simulation Need cycle-based cosimulation of software
and hardware before synthesis Coprocessor mapped in FSMD semantics
Modular bottom-up hardware description Cosimulation Interfaces captured with GEZEL
simulation primitives Memory-mapped register FIFO based (with request/acknowledge
handshake)
Section 4 Co-design Flow and Methodology
HW-SW Interface Example
ipblock fsl1(out data : ns(32); out exists : ns(1); in read : ns(1)) { iptype "armfslslave"; ipparm "core=ppc"; ipparm "write=0x80000000"; ipparm "status=0x80000004"; } to
cop
roce
ssor
data
exists
read
connectedto ISS
PPC SW can write to address 0x80000000 Will drive data output and perform handshake
PPC SW can check status with read from 0x80000004
fsl1
GEZEL Code Hardware
Section 4 Co-design Flow and Methodology
Synthesis
Application SW(C Code)
uP DDR
“N” Reg FIFO IN FIFO OUT
Coprocessor
InstructionSet simulator
CosimulationInterfaces
CoprocessorHardware
Automatic conversion to hierarchical RTL-VHDL, withblack-boxes for cosimulation
interfacesXilinx EDK + ISE
Section 4 Co-design Flow and Methodology
Conclusions
Matrix Multiplication can be sped up by 25 times over standard reference C implementation Rectangular Blocking Dedicated Coprocessor Hardware, highly scalable Integrated design flow
Conclusions
Remaining Challenges Memory bottleneck (hardware/software codesign
yields ~7 % computation time and 93 % memory access time) Further optimization possible using DMA and data
caching schemes
Conclusions
Challenge to the MEMOCODE community accurate system-level modeling of platform artifacts to
support the designer