Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA =...
Transcript of Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA =...
![Page 1: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/1.jpg)
Spatial: A Language and Compiler for Application Accelerators
Raghu Prabhakar Stanford University / SambaNova
Systems
TVM Conference Dec 13, 2018
![Page 2: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/2.jpg)
The Future Is (Probably) Reconfigurable
10,000
1,000
100
10
1
0.1
Ene
rgy
Effi
cien
cy (M
OP
S/m
W)
Not programmable Less programmable More programmable
Programmability
ASIC
CPU
GPU
Reconfigurable
Instruction-BasedFPGA
�2
CGRADedicated
![Page 3: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/3.jpg)
The Future Is (Probably) Reconfigurable
10,000
1,000
100
10
1
0.1
Ene
rgy
Effi
cien
cy (M
OP
S/m
W)
Not programmable Less programmable More programmable
Programmability
ASIC
CPU
GPU
Reconfigurable
Instruction-BasedFPGA
�2
CGRADedicated
25x perf/W vs. CPU
XPU (HotChips ’17)
287 MOps/mW
Brainwave (ISCA ’18)
![Page 4: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/4.jpg)
The Future Is (Probably) Reconfigurable
10,000
1,000
100
10
1
0.1
Ene
rgy
Effi
cien
cy (M
OP
S/m
W)
Not programmable Less programmable More programmable
Programmability
ASIC
CPU
GPU
Reconfigurable
Instruction-BasedFPGA
�2
CGRADedicated
25x perf/W vs. CPU
XPU (HotChips ’17)
287 MOps/mW
Brainwave (ISCA ’18) 77x perf/W vs. FPGA
Plasticine (ISCA ’17)
![Page 5: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/5.jpg)
Key QuestionHow can we more productively target
reconfigurable architectures like FPGAs?
�3
![Page 6: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/6.jpg)
Key Question
Performance Productivity
Portability
How can we more productively target reconfigurable architectures like FPGAs?
Fast and efficient designs Fast and efficient programmers
Target-generic solutions
�3
![Page 7: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/7.jpg)
HDLs
�4
Hardware Description Languages (HDLs) e.g. Verilog, VHDL, Chisel, Bluespec
![Page 8: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/8.jpg)
HDLs
Performance
�4
Hardware Description Languages (HDLs) e.g. Verilog, VHDL, Chisel, Bluespec
✓ Arbitrary RTL
![Page 9: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/9.jpg)
HDLs
Performance
Portability
�4
Hardware Description Languages (HDLs) e.g. Verilog, VHDL, Chisel, Bluespec
✓ Arbitrary RTL
✘ Significant target-specific code
![Page 10: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/10.jpg)
HDLs
Performance Productivity
Portability
�4
Hardware Description Languages (HDLs) e.g. Verilog, VHDL, Chisel, Bluespec
✓ Arbitrary RTL ✘ No high-level abstractions
✘ Significant target-specific code
![Page 11: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/11.jpg)
C + Pragmas
�5
Existing High Level Synthesis (C + Pragmas) e.g. Vivado HLS, SDAccel, Altera OpenCL
HDLs
![Page 12: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/12.jpg)
C + Pragmas
Performance
�5
✘ No memory hierarchy
✘ No arbitrarypipelining
Existing High Level Synthesis (C + Pragmas) e.g. Vivado HLS, SDAccel, Altera OpenCL
HDLs
![Page 13: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/13.jpg)
C + Pragmas
Performance
Portability
�5
✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrarypipelining
Existing High Level Synthesis (C + Pragmas) e.g. Vivado HLS, SDAccel, Altera OpenCL
HDLs
![Page 14: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/14.jpg)
C + Pragmas
Performance Productivity
Portability
�5
✓ Nested loops
✘ Difficult to optimize
✘ Ad-hoc mix of software/hardware
✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrarypipelining
Existing High Level Synthesis (C + Pragmas) e.g. Vivado HLS, SDAccel, Altera OpenCL
HDLs
![Page 15: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/15.jpg)
Rethinking HLS
�6
HDLs C + PragmasImproved HLS
![Page 16: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/16.jpg)
Rethinking HLS
Performance
�6
✓ Memory hierarchy
✓ Arbitrary pipelining
HDLs C + PragmasImproved HLS
![Page 17: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/17.jpg)
Rethinking HLS
Performance
Portability
�6
✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic sourceacross reconfigurable architectures
HDLs C + PragmasImproved HLS
![Page 18: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/18.jpg)
Rethinking HLS
Performance Productivity
Portability
�6
✓ Nested loops✓ Automatic memory
banking/buffering✓ Implicit design parameters
(unrolling, banking, etc.)
✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic sourceacross reconfigurable architectures
✓ Automated design tuning
HDLs C + PragmasImproved HLS
![Page 19: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/19.jpg)
Introducing Spatial
■ Programming language to simplify configurable accelerator design ■ Constructs to express:
■ Hierarchical parallel and pipelined data paths ■ explicit memory hierarchies
■ Simple APIs to manage CPU Accelerator communication
■ Open source: https://spatial-lang.org/
■ Allows programmers to focus on “interesting stuff” ■ Designed for performance oriented programmers ■ More intuitive than CUDA: dataflow instead of threads
David Koeplinger et al, “Spatial: A Language And Compiler For Application Accelerators”, PLDI 2018
![Page 20: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/20.jpg)
Spatial: Memory Hierarchy
�8
DDR DRAM GB
On-Chip SRAM MB
Local Regs KB
![Page 21: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/21.jpg)
Spatial: Memory Hierarchy
�8
DDR DRAM GB
On-Chip SRAM MB
Local Regs KB
val image = DRAM[UInt8](H,W)
![Page 22: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/22.jpg)
Spatial: Memory Hierarchy
�8
DDR DRAM GB
On-Chip SRAM MB
Local Regs KB
val image = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)val fifo = FIFO[Float](D)val lbuf = LineBuffer[Int](R,C)
![Page 23: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/23.jpg)
Spatial: Memory Hierarchy
�8
DDR DRAM GB
On-Chip SRAM MB
Local Regs KB
val image = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)val fifo = FIFO[Float](D)val lbuf = LineBuffer[Int](R,C)
buffer load image(i, j::j+C) // densebuffer gather image(a) // sparse
![Page 24: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/24.jpg)
Spatial: Memory Hierarchy
�8
DDR DRAM GB
On-Chip SRAM MB
Local Regs KB
val image = DRAM[UInt8](H,W)
val buffer = SRAM[UInt8](C)val fifo = FIFO[Float](D)val lbuf = LineBuffer[Int](R,C)
val accum = Reg[Double]val pixels = RegFile[UInt8](R,C)
buffer load image(i, j::j+C) // densebuffer gather image(a) // sparse
![Page 25: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/25.jpg)
Spatial: Control And Design Parameters
�9
![Page 26: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/26.jpg)
val P = 16 (1 ! 32)Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}
Implicit/Explicit parallelization factors(optional, but can be explicitly declared)
Spatial: Control And Design Parameters
�9
![Page 27: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/27.jpg)
val P = 16 (1 ! 32)Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}Stream.Foreach(0 until N){i => …}
Implicit/Explicit parallelization factors(optional, but can be explicitly declared)
Implicit/Explicit control schemes(also optional, but can be used to override compiler)
Spatial: Control And Design Parameters
�9
![Page 28: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/28.jpg)
val B = 64 (64 ! 1024)val buffer = SRAM[Float](B)Foreach(N by B){i => …}
val P = 16 (1 ! 32)Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}Stream.Foreach(0 until N){i => …}
Implicit/Explicit parallelization factors(optional, but can be explicitly declared)
Explicit size parameters for loop step size and buffer sizes(informs compiler it can tune this value)
Implicit/Explicit control schemes(also optional, but can be used to override compiler)
Spatial: Control And Design Parameters
�9
![Page 29: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/29.jpg)
val B = 64 (64 ! 1024)val buffer = SRAM[Float](B)Foreach(N by B){i => …}
val P = 16 (1 ! 32)Reduce(0)(N by 1 par P){i => data(i)}{(a,b) => a + b}Stream.Foreach(0 until N){i => …}
Implicit/Explicit parallelization factors(optional, but can be explicitly declared)
Explicit size parameters for loop step size and buffer sizes(informs compiler it can tune this value)
Implicit/Explicit control schemes(also optional, but can be used to override compiler)
Foreach(64 par 16){i => buffer(i) // Parallel read}
Implicit memory banking and buffering schemes for parallelized access
Spatial: Control And Design Parameters
�9
![Page 30: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/30.jpg)
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b} }{a, b => a + b}
DRAM
vectorA
Off-chip memory declarations
vectorB
FPGA
output
�10
![Page 31: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/31.jpg)
DRAM
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b} }{a, b => a + b}
vectorA
vectorB
Explicit work division in IR
FPGA
output
24
![Page 32: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/32.jpg)
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b}
DRAM
vectorA
vectorB
Tiled reduction (outer)
FPGAOuter Reduce
output
24
![Page 33: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/33.jpg)
DRAM
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b} }
vectorA
vectorB
FPGAOuter Reduce
On-chip memory declarations
tileB (0)
tileA (0) tileA (1)
tileB (1)acc
output
24
acc
![Page 34: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/34.jpg)
DRAM
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b}
vectorA
vectorB
FPGAOuter Reduce
DRAM ! SRAM transfers(also have store, scatter, and
gather)
Stage 1
tileB (0)
tileA (0) tileA (1)
tileB (1)
output
24
acc acc
![Page 35: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/35.jpg)
DRAM
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b}
vectorA
vectorB
FPGAOuter Reduce
acc
Stage 1
tileB (0)
tileA (0) tileA (1)
tileB (1)
Tiled reduction (pipelined)
Stage 2
+× acc
output
24
acc acc
![Page 36: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/36.jpg)
FPGAOuter ReduceStage 3
DRAM
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b} }{a, b => a + b}
vectorA
vectorB
acc
Stage 1
tileB (0)
tileA (0) tileA (1)
tileB (1)
Stage 2
+× acc
output
Outer reduce function
+
24
acc acc
![Page 37: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/37.jpg)
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b} }{a, b => a + b}
24
FPGAOuter ReduceStage 3
DRAM
vectorA
vectorB
acc
Stage 1
tileB (0)
tileA (0) tileA (1)
tileB (1)
Stage 2
+× acc
output +
acc acc
Tile Size (B) Banking strategy
Parallelism factor #1Metapipelining toggle
Parallelism factor #3
Parallelism factor #2
![Page 38: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/38.jpg)
Dot Product in Spatialval output = ArgOut[Float]val vectorA = DRAM[Float](N)val vectorB = DRAM[Float](N)
Accel { Reduce(output)(N by B){ i => val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float]
tileA load vectorA(i :: i+B) tileB load vectorB(i :: i+B)
Reduce(acc)(B by 1){ j => tileA(j) * tileB(j) }{a, b => a + b} }{a, b => a + b}}
25
![Page 39: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/39.jpg)
Dot Product in Spatial
Spatial Program Design Parameters
25
![Page 40: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/40.jpg)
Spatial Program
The Spatial Compiler
26
![Page 41: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/41.jpg)
Spatial IR
The Spatial Compiler
26
![Page 42: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/42.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design Parameters
Intermediate Representatio
nDesign Parameters
IR Transformation
IR Analysis
Code Generation
Legend
The Spatial Compiler
26
![Page 43: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/43.jpg)
Control SchedulingSpatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR
■ Creates loop pipeline schedules ■ Detects data dependencies across loop intervals ■ Calculate initiation interval of pipelines ■ Set maximum depth of buffers
■ Supports arbitrarily nested pipelines (Commercial HLS tools don’t support this)
27
![Page 44: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/44.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Modified Parameters
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design ParametersDesign Tuning
29
![Page 45: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/45.jpg)
FPGA
+×tileB
tileA
acc
DRAM
vectorA
vectorB
Design Space Parameters ExamplevectorA ∙ vectorB
LegendControl Compute
RegsSRAM
Small and simple, but slow!
ctr
�22
![Page 46: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/46.jpg)
■ Increases length of DRAM accesses Runtime■ Increases exploited locality Runtime ■ Increases local memory sizes Area
FPGA
+×tileB
tileA
acc
DRAM
vectorA
vectorB
vectorA ∙ vectorB
Important Parameters: Buffer Sizes
LegendControl Compute
RegsSRAM
ctr
�23
![Page 47: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/47.jpg)
FPGA
Stage 2
Tile B
■ Overlaps memory and compute Runtime■ Increases local memory sizes Area■ Adds synchronization logic Area
Important Parameters: Pipelining
Stage 1
+×tileB (0)
tileA (0)
acc
DRAM
vectorA
vectorB
tileA (1)
tileB (1)
vectorA ∙ vectorB
LegendControl Compute
RegsSRAM
Double Buffer
�24
Metapipelining requires buffering
![Page 48: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/48.jpg)
■ Improves element throughput Runtime■ Duplicates compute resources Area
Important Parameters: Parallelization
FPGA
+× acc
DRAM
vectorA
vectorBctr
vectorA ∙ vectorB
×ctrctr
×+
+
LegendControl Compute
RegsSRAM
tileB
tileA
�25
![Page 49: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/49.jpg)
■ Improves memory bandwidth Runtime■ May duplicate memory resources Area
Important Parameters: Memory Banking
+× acc
DRAM
vectorA
vectorBctr
vectorA ∙ vectorB
×ctrctr
×+
+
LegendControl Compute
RegsSRAM
tileB
tileA
Banked SRAM
�26
Parallelization requires banking
![Page 50: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/50.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Modified Parameters
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design ParametersDesign Tuning
29
![Page 51: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/51.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Modified Parameters
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design Parameters
Original tuning methods:■ Pre-prune space using simple
heuristics■ Randomly sample ~100,000 design
points■ Model area/runtime of each point
Design Tuning
29
![Page 52: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/52.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Modified Parameters
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design Parameters
Original tuning methods:■ Pre-prune space using simple
heuristics■ Randomly sample ~100,000 design
points■ Model area/runtime of each point
Proposed tuning method■ Active learning: HyperMapper (More details in paper)
Design Tuning
29
![Page 53: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/53.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Modified Parameters
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design Parameters
Original tuning methods:■ Pre-prune space using simple
heuristics■ Randomly sample ~100,000 design
points■ Model area/runtime of each point
Proposed tuning method■ Active learning: HyperMapper (More details in paper)
■ Fast: No slow transformers in loop
Design Tuning
29
![Page 54: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/54.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Modified Parameters
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design Parameters
Original tuning methods:■ Pre-prune space using simple
heuristics■ Randomly sample ~100,000 design
points■ Model area/runtime of each point
Proposed tuning method■ Active learning: HyperMapper (More details in paper)
■ Fast: No slow transformers in loop
Design Tuning
29
![Page 55: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/55.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Modified Parameters
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IR Design Parameters
Original tuning methods:■ Pre-prune space using simple
heuristics■ Randomly sample ~100,000 design
points■ Model area/runtime of each point
Proposed tuning method■ Active learning: HyperMapper (More details in paper)
■ Fast: No slow transformers in loop
Design Tuning
29
![Page 56: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/56.jpg)
Spatial IR
Control Scheduling
Mem. Banking/Buffering
Access Pattern Analysis
Control Inference
Pipeline Unrolling
Pipeline Retiming
[Optional] Design Tuning
Host Resource Allocation
Control Signal InferenceChisel Code Generation
Area/Runtime Analysis
Spatial IRThe Spatial Compiler: The Rest
Code generation ■ Synthesizable Chisel ■ C++ code for host CPU
30
![Page 57: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/57.jpg)
■ FPGA:■ Amazon EC2 F1 Instance: Xilinx VU9P FPGA■ Fixed clock rate of 150 MHz
■ Applications■ SDAccel: Hand optimized, tuned implementations■ Spatial: Hand written, automatically tuned implementations
■ Execution time = FPGA execution time
Evaluation: Performance
31
![Page 58: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/58.jpg)
0
5
10
15
BlackScholes GEMM PageRank TPC-H Q6
Performance (Spatial vs. SDAccel) Average 2.9x faster hardware than SDAccel
Spee
dup
over
SD
Acc
el 8.5x 1.4x1.6x 1.4x 3.5x 14.1x 1.3x
32
![Page 59: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/59.jpg)
Productivity: Lines of Code
0
63
125
188
250
BlackScholes GEMM PageRank TPC-H Q6
SDAccelSpatial
12%
Average 42% shorter programs versus SDAccel
60%47% 44% 31% 66% 35%
Lines
33
![Page 60: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/60.jpg)
■ FPGA 1■ Amazon EC2 F1 Instance: Xilinx VU9P FPGA■ 19.2 GB/s DRAM bandwidth (single channel)
■ FPGA 2■ Xilinx Zynq ZC706 ■ 4.3 GB/s
■ Applications■ Spatial: Hand written, automatically tuned implementations■ Fixed clock rate of 150 MHz
Evaluation: Portability
34
![Page 61: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/61.jpg)
Portability: VU9P vs. Zynq ZC706
0
7.5
15
22.5
30
BlackScholes GEMM PageRank TPC-H Q6
2.5x 1.2x2.5x 2.5x 1.3x 2.5x 4.6xIdentical Spatial source, multiple targets
Porting: Speedup (VU9P / Zynq) only from moving to larger FPGA
Spee
dup
DRAM Bandwidth: 4.5x
LUTs (GP compute): 47.3x
DSPs (integer FMA): 7.6x
On-chip memory*: 4.0x
VU9P / ZC706* No URAM used on VU9P
35
![Page 62: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/62.jpg)
Portability: VU9P vs. Zynq ZC706
0
7.5
15
22.5
30
BlackScholes GEMM PageRank TPC-H Q6
2.6x 2.1x9.4x 2.7x 1.7x 1.0x 1.1xIdentical Spatial source, multiple targets
Porting: Speedup (VU9P / Zynq) only from moving to larger FPGATuning: Speedup only from tuning parameters for larger FPGA
Spee
dup
DRAM Bandwidth: 4.5x
LUTs (GP compute): 47.3x
DSPs (integer FMA): 7.6x
On-chip memory*: 4.0x
VU9P / ZC706* No URAM used on VU9P
35
![Page 63: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/63.jpg)
Portability: VU9P vs. Zynq ZC706
0
7.5
15
22.5
30
BlackScholes GEMM PageRank TPC-H Q6
6.5x 2.5x23.4x 6.8x 2.2x 2.5x 5.0xIdentical Spatial source, multiple targets
Porting: Speedup (VU9P / Zynq) only from moving to larger FPGATuning: Speedup only from tuning parameters for larger FPGAProduct = Porting × Tuning
Spee
dup
DRAM Bandwidth: 4.5x
LUTs (GP compute): 47.3x
DSPs (integer FMA): 7.6x
On-chip memory*: 4.0x
VU9P / ZC706* No URAM used on VU9P
35
![Page 64: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/64.jpg)
Portability: Plasticine CGRAIdentical Spatial source, multiple targets
Even reconfigurable hardware that isn’t an FPGA!
36
BenchmarkDRAM Bandwidth (%)
Load StoreResource Utilization (%)
PCU PMU AG Speedup vs. VU9P
BlackScholes 77.4 12.9 73.4 10.9 20.6 1.6
GDA 24.0 0.2 95.3 73.4 38.2 9.8
GEMM 20.5 2.1 96.8 64.1 11.7 55.0
K-Means 8.0 0.4 89.1 57.8 17.6 6.3
TPC-H Q6 97.2 0.0 29.7 37.5 70.6 1.6
Prabhakar et al. Plasticine: A Reconfigurable Architecture For Parallel Patterns (ISCA ‘17)
![Page 65: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/65.jpg)
Halide to Spatial
![Page 66: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/66.jpg)
● DSL for computational photography ● Separation between algorithm (what to compute) and schedule (how to compute) ● Straightforward to express and iterate over various schedules
What is Halide?
Var x, y; Func f; f(x, y) = x + y;
Algorithm
f.tile(x,y,xi,yi,8,8);
Schedule #1Implementations
![Page 67: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/67.jpg)
● DSL for computational photography ● Separation between algorithm (what to compute) and schedule (how to compute) ● Straightforward to express and iterate over various schedules
What is Halide?
Var x, y; Func f; f(x, y) = x + y;
Algorithm
f.parallel(y); f.vectorize(x, 8);
Schedule #2Implementations
![Page 68: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/68.jpg)
Why use Halide as Front-End to Spatial?
● Separation of concerns ○ High-level transformations: Tiling, Vectorization etc can happen in
Halide ○ Lift the hard work of transforming loop nests to Halide ○ Optimized code can be lowered into spatial
● Loop-based IR ○ Easy mapping to Spatial front-end
![Page 69: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/69.jpg)
Halide IR
// Algorithm Var x, y; Func f; f(x, y) = x + y; // Schedule f.parallel(y); f.vectorize(x, 8); f.realize(32, 32);
produce f { let t6 = (f.extent.0 + f.min.0) let t7 = (f.min.1*f.stride.1) let t8 = max((f.extent.0/8), 0) let t3 = (t8 < ((f.extent.0 + 7)/8)) let t2 = (0 - t7) let t5 = (((t6 - t7) - f.min.0) + -8) let t4 = (t6 + -8) parallel (f.s0.y, f.min.1, f.extent.1) { let t10 = ((f.s0.y*f.stride.1) + t2) let t9 = (f.min.0 + f.s0.y) for (f.s0.x.x, 0, t8) { f[ramp(((f.s0.x.x*8) + t10), 1, 8)] = ramp(((f.s0.x.x*8) + t9), 1, 8) } if (t3) { f[ramp(((f.s0.y*f.stride.1) + t5), 1, 8)] = ramp((f.s0.y + t4), 1, 8) } } }
![Page 70: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/70.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
SpatialHalide
![Page 71: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/71.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
val g_wrapper = DRAM[Int](16, 16); Accel { val g = SRAM[Int](16, 16); Foreach(0 until 4 by 1) {yo => Foreach(0 until 4 by 1) {xo => val f_wrapper = SRAM[Int](4, 5); f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5); Foreach(0 until 4 by 1) {yi => Foreach(0 until 4 by 1) {xi => g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2; } } } } g_wrapper store g; }
SpatialHalide
![Page 72: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/72.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
val g_wrapper = DRAM[Int](16, 16); Accel { val g = SRAM[Int](16, 16); Foreach(0 until 4 by 1) {yo => Foreach(0 until 4 by 1) {xo => val f_wrapper = SRAM[Int](4, 5); f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5); Foreach(0 until 4 by 1) {yi => Foreach(0 until 4 by 1) {xi => g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2; } } } } g_wrapper store g; }
SpatialHalide
Compute at Accelerator
![Page 73: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/73.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
val g_wrapper = DRAM[Int](16, 16); Accel { val g = SRAM[Int](16, 16); Foreach(0 until 4 by 1) {yo => Foreach(0 until 4 by 1) {xo => val f_wrapper = SRAM[Int](4, 5); f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5); Foreach(0 until 4 by 1) {yi => Foreach(0 until 4 by 1) {xi => g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2; } } } } g_wrapper store g; }
SpatialHalideAllocate SRAM
to store ‘g’
![Page 74: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/74.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
val g_wrapper = DRAM[Int](16, 16); Accel { val g = SRAM[Int](16, 16); Foreach(0 until 4 by 1) {yo => Foreach(0 until 4 by 1) {xo => val f_wrapper = SRAM[Int](4, 5); f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5); Foreach(0 until 4 by 1) {yi => Foreach(0 until 4 by 1) {xi => g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2; } } } } g_wrapper store g; }
SpatialHalide
Tile g
![Page 75: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/75.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
val g_wrapper = DRAM[Int](16, 16); Accel { val g = SRAM[Int](16, 16); Foreach(0 until 4 by 1) {yo => Foreach(0 until 4 by 1) {xo => val f_wrapper = SRAM[Int](4, 5); f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5); Foreach(0 until 4 by 1) {yi => Foreach(0 until 4 by 1) {xi => g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2; } } } } g_wrapper store g; }
SpatialHalide
Load ‘f’ into the
accelerator’s memory
![Page 76: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/76.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
val g_wrapper = DRAM[Int](16, 16); Accel { val g = SRAM[Int](16, 16); Foreach(0 until 4 by 1) {yo => Foreach(0 until 4 by 1) {xo => val f_wrapper = SRAM[Int](4, 5); f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5); Foreach(0 until 4 by 1) {yi => Foreach(0 until 4 by 1) {xi => g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2; } } } } g_wrapper store g; }
SpatialHalide
Do the load at loop level ‘xo’ and store
in SRAM
![Page 77: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/77.jpg)
Example: Halide to Spatial
// Algorithm f(x, y) = x + y; g(x, y) = (f(x, y) + f(x, y+1))/2; // Schedule g.in().spatial();g.store_in(MemoryType::SRAM) .compute_at(g.in(), Var::outermost()); g.tile(x, y, xo, yo, xi, yi, 4, 4); f.compute_root(); f.in() .copy_to_device() .store_in(MemoryType::SRAM) .compute_at(g, xo);
g.in().copy_to_host();wrapper.compile_to_spatial(...);
val g_wrapper = DRAM[Int](16, 16); Accel { val g = SRAM[Int](16, 16); Foreach(0 until 4 by 1) {yo => Foreach(0 until 4 by 1) {xo => val f_wrapper = SRAM[Int](4, 5); f_wrapper load f(xo*4::xo*4+4, yo*4::yo*4+5); Foreach(0 until 4 by 1) {yi => Foreach(0 until 4 by 1) {xi => g(xo*4+xi, yo*4+yi) = (f_wrapper(xi,yi)+f_wrapper(xi,yi+1))/2; } } } } g_wrapper store g; }
SpatialHalide
Store ‘g’ back into
host’s DRAM
![Page 78: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/78.jpg)
Conclusion
37
![Page 79: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/79.jpg)
Conclusion■ Reconfigurable architectures are becoming key for performance / energy efficiency
37
![Page 80: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/80.jpg)
Conclusion■ Reconfigurable architectures are becoming key for performance / energy efficiency■ Current programming solutions for reconfigurables are still inadequate
37
![Page 81: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/81.jpg)
Conclusion■ Reconfigurable architectures are becoming key for performance / energy efficiency■ Current programming solutions for reconfigurables are still inadequate■ Need to rethink outside of the C box for high level synthesis:
■ Memory hierarchy for optimization ■ Design parameters for tuning■ Arbitrarily nestable pipelines
37
![Page 82: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/82.jpg)
Conclusion■ Reconfigurable architectures are becoming key for performance / energy efficiency■ Current programming solutions for reconfigurables are still inadequate■ Need to rethink outside of the C box for high level synthesis:
■ Memory hierarchy for optimization ■ Design parameters for tuning■ Arbitrarily nestable pipelines
■ Spatial prototypes these language and compiler criteria:■ Average speedup of 2.9x versus SDAccel on VU9P■ Average 42% less code than SDAccel■ Achieves transparent portability through internal support for automated design tuning (HyperMapper)
37
Performance Productivity
Portability
![Page 83: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/83.jpg)
Conclusion■ Reconfigurable architectures are becoming key for performance / energy efficiency■ Current programming solutions for reconfigurables are still inadequate■ Need to rethink outside of the C box for high level synthesis:
■ Memory hierarchy for optimization ■ Design parameters for tuning■ Arbitrarily nestable pipelines
■ Spatial prototypes these language and compiler criteria:■ Average speedup of 2.9x versus SDAccel on VU9P■ Average 42% less code than SDAccel■ Achieves transparent portability through internal support for automated design tuning (HyperMapper)
37
Spatial is open source: https://spatial-lang.org/
Performance Productivity
Portability
![Page 84: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/84.jpg)
Backup Slides
![Page 85: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/85.jpg)
The Team
Raghu Prabhakar
Yaqi Zhang
David Koeplinger
Matt Feldman
Tian Zhao
Ardavan Pedram
Christos Kozyrakis
Kunle Olukotun
Stefan Hadjis
Ruben Fiszel
Luigi Nardi
38
![Page 86: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/86.jpg)
Custom ASICs
![Page 87: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/87.jpg)
Custom ASICsGood for widely used, fixed specifications (like compression)
![Page 88: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/88.jpg)
Custom ASICsGood for widely used, fixed specifications (like compression)Expensive with long design turnaround for developing fields
like ML
![Page 89: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/89.jpg)
Custom ASICsGood for widely used, fixed specifications (like compression)Expensive with long design turnaround for developing fields
like ML
TimeJeff Dean, Scaled ML 2018Kunle Olukotun, ISCA 2018
20,000
15,000
10,000
5,000
02009 20172011 2013 2015
20
15
10
5
0
Relative # of Papers / Year Since 2009
ML Arxiv Papers
![Page 90: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/90.jpg)
C + Pragmas ExampleAdd 512 integers originating from accelerator DRAM
void sum(int* mem) { mem[512] = 0;
for(int i=0; i < 512; i++) { mem[512] += mem[i]; }
}
�54
![Page 91: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/91.jpg)
C + Pragmas ExampleAdd 512 integers originating from accelerator DRAM
void sum(int* mem) { mem[512] = 0;
for(int i=0; i < 512; i++) { mem[512] += mem[i]; }
}
�54
Commercial HLS Tool
Runtime: 27,236 clock cycles
(100x too long!)
![Page 92: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/92.jpg)
C + Pragmas ExampleAdd 512 integers originating from external DRAM
�55
#define CHUNKSIZE (sizeof(MPort)/sizeof(int)) #define LOOPCOUNT (512/CHUNKSIZE)
void sum(MPort* mem) { MPort buff[LOOPCOUNT]; memcpy(buff, mem, LOOPCOUNT);
int sum = 0; for(int i=1; i<LOOPCOUNT; i++) { #pragma PIPELINE for(int j=0; j<CHUNKSIZE; j++) {
#pragma UNROLL sum += (int)
(buff[i]>>j*sizeof(int)*8);}
} mem[512] = sum; }
Runtime: 302 clock cycles
![Page 93: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/93.jpg)
C + Pragmas ExampleAdd 512 integers originating from external DRAM
�55
#define CHUNKSIZE (sizeof(MPort)/sizeof(int)) #define LOOPCOUNT (512/CHUNKSIZE)
void sum(MPort* mem) { MPort buff[LOOPCOUNT]; memcpy(buff, mem, LOOPCOUNT);
int sum = 0; for(int i=1; i<LOOPCOUNT; i++) { #pragma PIPELINE for(int j=0; j<CHUNKSIZE; j++) {
#pragma UNROLL sum += (int)
(buff[i]>>j*sizeof(int)*8);}
} mem[512] = sum; }
Width of DRAM controller interface
Burst Access
Use local variable
Special compiler directives
Loop Restructuring
Bit shifting to extract individual elements
Special compiler directives
Runtime: 302 clock cycles
![Page 94: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/94.jpg)
Hardware Design Considerations
![Page 95: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/95.jpg)
Hardware Design Considerations
1. Finite physical compute and memory resources
![Page 96: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/96.jpg)
Hardware Design Considerations
1. Finite physical compute and memory resources2. Requires aggressive pipelining for performance
■ Maximize useful execution time of compute resources
![Page 97: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/97.jpg)
Hardware Design Considerations
1. Finite physical compute and memory resources2. Requires aggressive pipelining for performance
■ Maximize useful execution time of compute resources3. Disjoint memory space
■ No hardware managed memory hierarchy
![Page 98: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/98.jpg)
Hardware Design Considerations
1. Finite physical compute and memory resources2. Requires aggressive pipelining for performance
■ Maximize useful execution time of compute resources3. Disjoint memory space
■ No hardware managed memory hierarchy4. Huge design parameter spaces
■ Parameters are interdependent, change runtime by orders of magnitude
![Page 99: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/99.jpg)
Hardware Design Considerations
1. Finite physical compute and memory resources2. Requires aggressive pipelining for performance
■ Maximize useful execution time of compute resources3. Disjoint memory space
■ No hardware managed memory hierarchy4. Huge design parameter spaces
■ Parameters are interdependent, change runtime by orders of magnitude
5. Others… pipeline timing, clocking, etc.
![Page 100: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/100.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
a
![Page 101: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/101.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
a
![Page 102: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/102.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
a2i
2i+1
Foreach{i =>
![Page 103: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/103.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
a2i
2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
![Page 104: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/104.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
a2i
2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
2k 2k+1
Foreach{k =>
![Page 105: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/105.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
a2i
2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
2k 2k+1
Foreach{k =>
Write port
Read port
1 “instance” of a
![Page 106: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/106.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read
a2i
2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
2k 2k+1
Foreach{k =>
Write port
Read port
1 “instance” of a
![Page 107: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/107.jpg)
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read
![Page 108: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/108.jpg)
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
a
Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read
![Page 109: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/109.jpg)
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
a
a
a
Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read
![Page 110: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/110.jpg)
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
a a
Step 1: For each read: Find the banking and buffering for that read and all writes that may be visible to that read
![Page 111: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/111.jpg)
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
Metapipeline Distance = 1 a a
a a
![Page 112: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/112.jpg)
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
b(2j) b(2j+1)
Reduce{j =>
Metapipeline Distance = 1 a a
a a
(~4-8x memory)
![Page 113: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/113.jpg)
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
2k 2k+1
Foreach{k =>
![Page 114: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/114.jpg)
a
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
Metapipeline Distance = 2
2k 2k+1
Foreach{k =>
aa
![Page 115: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/115.jpg)
a
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
Metapipeline Distance = 2
2k 2k+1
Foreach{k =>
aa (~3-6x memory)
![Page 116: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/116.jpg)
a
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
2k 2k+1
Foreach{k =>
aa
b(2j) b(2j+1)
Reduce{j =>
a a
a a
![Page 117: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/117.jpg)
a
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
2k 2k+1
Foreach{k =>
aa
b(2j) b(2j+1)
Reduce{j =>
a a
a a(~7-14x memory)
![Page 118: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/118.jpg)
a
Local Memory Analysis Example (Cont.)
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
2k 2k+1
Foreach{k =>
aa
b(2j) b(2j+1)
Reduce{j =>
a a
a aStep 2: Greedily combine (merge) instances - Don’t combine if there are port conflicts - Don’t combine if the cost of merging is greater than sum of unmerged**Recompute banking for merged instances!
(~7-14x memory)
![Page 119: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/119.jpg)
Local Memory Analysis
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
2k 2k+1
Foreach{k =>
b(2j) b(2j+1)
Reduce{j =>
aa
Step 2: Greedily combine (merge) instances - Don’t combine if there are bank conflicts - Don’t combine if the cost of merging is greater than sum of unmerged**Recompute banking for merged instances!
aa
a
![Page 120: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/120.jpg)
Local Memory Analysis
Foreach(N by 1){ r => val a = SRAM[Float](D) val b = SRAM[Float](D) val c = SRAM[Float](D) Foreach(D par 2){i => a(i) = … } Reduce(sum)(D par 2){j => a(b(j)) }{(a,b) => a + b} Foreach(D par 2){k => c(k) = a(k) * sum } }
2i 2i+1
Foreach{i =>
2k 2k+1
Foreach{k =>
b(2j) b(2j+1)
Reduce{j =>
aa
Step 2: Greedily combine (merge) instances - Don’t combine if there are bank conflicts - Don’t combine if the cost of merging is greater than sum of unmerged**Recompute banking for merged instances!
aa
a(~5-10x memory) (40% less)
![Page 121: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/121.jpg)
Kernel-Based Approach
Manually implement each DSL operation; use a simple compiler to stitch them together
![Page 122: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/122.jpg)
Kernel-Based Approach
Performance
Manually implement each DSL operation; use a simple compiler to stitch them together
Misses cross-kernel optimizationsExcessive memory transfersExcessive buffering
![Page 123: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/123.jpg)
Kernel-Based Approach
Performance ProductivityHigh level specificationno hardware design knowledge required
Manually implement each DSL operation; use a simple compiler to stitch them together
Misses cross-kernel optimizationsExcessive memory transfersExcessive buffering
![Page 124: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/124.jpg)
Kernel-Based Approach
Performance Productivity
Portability
High level specificationno hardware design knowledge required
Reasonably target-generic if done right
Manually implement each DSL operation; use a simple compiler to stitch them together
Misses cross-kernel optimizationsExcessive memory transfersExcessive buffering
![Page 125: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/125.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Stochastic Gradient Descent in Spatial
![Page 126: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/126.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Stochastic Gradient Descent in Spatial
![Page 127: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/127.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Off-chip memory allocations
Stochastic Gradient Descent in Spatial
![Page 128: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/128.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
Stochastic Gradient Descent in Spatial
![Page 129: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/129.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Stochastic Gradient Descent in Spatial
![Page 130: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/130.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Stochastic Gradient Descent in Spatial
![Page 131: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/131.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Stochastic Gradient Descent in Spatial
![Page 132: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/132.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Explicit memory transfer
Stochastic Gradient Descent in Spatial
![Page 133: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/133.jpg)
type TM = FixPt[TRUE,_9,_23]type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)val y = DRAM[TM](N)val weights = DRAM[TM](D)
Accel { val yAddr = Reg[Int](-1) val yCache = SRAM[TM](CSIZE) val wK = SRAM[TM](D)
wK load weights(0::D)
Sequential.Foreach(E by 1){e => epoch(random[Int](N), …) breakpoint() }
weights(0 :: D) store wK}
123456789
101112131415161718192021
Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Explicit memory transfer
Debugging breakpoint
Stochastic Gradient Descent in Spatial
![Page 134: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/134.jpg)
SGD in Spatialdef epoch(i: Int, ...): Unit = { val yPt = Reg[TM] if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) { yPt := yCache(i - yAddr) } else { yAddr := i - (i % CSIZE) yCache load y(yAddr::yAddr + CSIZE) yPt := yCache(i % CSIZE) }
val x = SRAM[TX](D) x load data(i, 0::D)
// Compute gradient against wK_t val yHat = Reg[TM] Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_} val yErr = yHat - yPt
// Update wK_t with reduced variance update Foreach(D by 1){i => wK(i) = wK(i) – (A.to[TM] * yErr * x(i).to[TM]) }}
222324252627282930313233343536373839404142434445
![Page 135: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/135.jpg)
SGD in Spatialdef epoch(i: Int, ...): Unit = { val yPt = Reg[TM] if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) { yPt := yCache(i - yAddr) } else { yAddr := i - (i % CSIZE) yCache load y(yAddr::yAddr + CSIZE) yPt := yCache(i % CSIZE) }
val x = SRAM[TX](D) x load data(i, 0::D)
// Compute gradient against wK_t val yHat = Reg[TM] Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_} val yErr = yHat - yPt
// Update wK_t with reduced variance update Foreach(D by 1){i => wK(i) = wK(i) – (A.to[TM] * yErr * x(i).to[TM]) }}
222324252627282930313233343536373839404142434445
Custom caching for random access on y
![Page 136: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/136.jpg)
SGD in Spatialdef epoch(i: Int, ...): Unit = { val yPt = Reg[TM] if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) { yPt := yCache(i - yAddr) } else { yAddr := i - (i % CSIZE) yCache load y(yAddr::yAddr + CSIZE) yPt := yCache(i % CSIZE) }
val x = SRAM[TX](D) x load data(i, 0::D)
// Compute gradient against wK_t val yHat = Reg[TM] Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_} val yErr = yHat - yPt
// Update wK_t with reduced variance update Foreach(D by 1){i => wK(i) = wK(i) – (A.to[TM] * yErr * x(i).to[TM]) }}
222324252627282930313233343536373839404142434445
Custom caching for random access on y
Explicit memory transfer
![Page 137: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/137.jpg)
SGD in Spatialdef epoch(i: Int, ...): Unit = { val yPt = Reg[TM] if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) { yPt := yCache(i - yAddr) } else { yAddr := i - (i % CSIZE) yCache load y(yAddr::yAddr + CSIZE) yPt := yCache(i % CSIZE) }
val x = SRAM[TX](D) x load data(i, 0::D)
// Compute gradient against wK_t val yHat = Reg[TM] Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_} val yErr = yHat - yPt
// Update wK_t with reduced variance update Foreach(D by 1){i => wK(i) = wK(i) – (A.to[TM] * yErr * x(i).to[TM]) }}
222324252627282930313233343536373839404142434445
Custom caching for random access on y
Explicit memory transfer
Gradient computation
![Page 138: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/138.jpg)
SGD in Spatialdef epoch(i: Int, ...): Unit = { val yPt = Reg[TM] if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) { yPt := yCache(i - yAddr) } else { yAddr := i - (i % CSIZE) yCache load y(yAddr::yAddr + CSIZE) yPt := yCache(i % CSIZE) }
val x = SRAM[TX](D) x load data(i, 0::D)
// Compute gradient against wK_t val yHat = Reg[TM] Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_} val yErr = yHat - yPt
// Update wK_t with reduced variance update Foreach(D by 1){i => wK(i) = wK(i) – (A.to[TM] * yErr * x(i).to[TM]) }}
222324252627282930313233343536373839404142434445
Custom caching for random access on y
Explicit memory transfer
Gradient computation
Weight update
![Page 139: Spatial: A Language and Compiler for Application Accelerators · 2020. 12. 12. · val tileA = SRAM[Float](B) val tileB = SRAM[Float](B) val acc = Reg[Float] tileA load vectorA(i](https://reader036.fdocuments.in/reader036/viewer/2022071511/612ff4ea1ecc51586943c8e0/html5/thumbnails/139.jpg)
FPGA15.Sequential.Foreach
41. Reduce
DRAM
13.load
×wKweights
22.store
37.load yHat+
-45. Foreach
- ×
y
27. if … else
yPt
yAddr
yCache
yErr
xdata
SGD in Spatial: Hardware