Improving Memory System Performance for Soft Vector Processors
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
WoSPS – Oct 26, 2008
Slide 2: Soft Processors in FPGA Systems
[Diagram: a soft processor is programmed in C through a compiler (easier), while custom logic is designed in HDL through CAD tools (faster, smaller, less power)]
Data-level parallelism → soft vector processors
Configurable – how can we make use of this?
Slide 3: Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

Each vector instruction holds many units of independent operations: the single vadd performs b[0]+=a[0] through b[15]+=a[15].
[Diagram: with 1 vector lane, the 16 element operations of the vadd issue one at a time]
Slide 4: Vector Processing Primer (cont.)
[Same code as the previous slide. Diagram: with 16 vector lanes, all 16 element operations of the vadd execute in parallel]
16x speedup
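To make the lane/cycle relationship concrete, here is a minimal C sketch (not from the slides; vadd_model and its parameters are illustrative) that models how a vector add of vl elements maps onto a configurable number of lanes:

#include <stdio.h>

/* Model of a vector add on a soft vector processor: each cycle,
 * every lane retires one element operation, so a vector of vl
 * elements needs ceil(vl / lanes) cycles. */
static int vadd_model(int *b, const int *a, int vl, int lanes)
{
    int cycles = 0;
    for (int i = 0; i < vl; i += lanes) {   /* one iteration = one cycle */
        for (int l = 0; l < lanes && i + l < vl; l++)
            b[i + l] += a[i + l];           /* element op in lane l */
        cycles++;
    }
    return cycles;
}

int main(void)
{
    int a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 0; }
    printf("1 lane:   %d cycles\n", vadd_model(b, a, 16, 1));   /* 16 */
    printf("16 lanes: %d cycles\n", vadd_model(b, a, 16, 16));  /* 1: 16x */
    return 0;
}

This idealized model is exactly where the 16x figure comes from; the next slide shows why real hardware falls short of it.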
Slide 5: Sub-Linear Scalability
[Bar chart: cycle performance relative to 1 lane for autcor, conven, ip_checksum, imgblend, and the geometric mean, at 1, 2, 4, 8, and 16 lanes. At 16 lanes the speedups shown are 8.0x, 6.0x, 5.2x, and 3.1x, with a geometric mean of 4.7x]
Vector lanes are not being fully utilized
Slide 6: Where Are The Cycles Spent? (16 lanes)
[Bar chart: fraction of total cycles spent on memory unit stalls and miss cycles for autcor, conven, ip_checksum, imgblend, and the average; the average is 67%]
Two thirds of cycles are spent waiting on the memory unit, often due to cache misses
Slide 7: Our Goals
1. Improve the memory system:
   - Better cache design
   - Hardware prefetching
2. Evaluate improvements for real:
   - Using a complete hardware design (in Verilog)
   - On real FPGA hardware (Stratix 1S80C6)
   - Running full benchmarks (EEMBC)
   - From off-chip memory (DDR-133MHz)
Slide 8: Current Infrastructure
[Diagram – SOFTWARE flow: EEMBC C benchmarks are compiled with GCC and linked (ld) with vectorized assembly subroutines assembled by GNU as with vector support, producing an ELF binary that runs on the MINT instruction set simulator (scalar μP + VPU). HARDWARE flow: the Verilog design is simulated in Modelsim (an RTL simulator) to obtain cycle counts and synthesized with Altera Quartus II v8.0 to obtain area and frequency. The two flows verify each other]
Slide 9: VESPA Architecture Design
[Diagram: a 3-stage scalar pipeline (Icache fetch, decode/RF, ALU, MUX, writeback) runs alongside a 3-stage vector control pipeline and a 6-stage vector pipeline (decode/replicate, hazard check, vector register read, ALU with multiply & saturate, right shift, saturate, vector register writeback) plus a memory unit; the scalar and vector pipelines share the Dcache]
Supports integer and fixed-point operations, and predication
32-bit datapaths
Slide 10: Memory System Design
[Diagram: the scalar processor and the 16-lane VESPA vector coprocessor share a 4KB Dcache with 16B lines through the vector memory crossbar; the Dcache is backed by DDR with a 9-cycle access]
vld.w (load 16 contiguous 32-bit words)
Slide 11: Memory System Design (cont.)
[Same diagram with the Dcache grown 4x in both dimensions: 16KB with 64B lines]
vld.w (load 16 contiguous 32-bit words)
The 4x wider line reduces cache accesses (one per vld.w instead of four) and provides some prefetching; see the sketch below
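As a rough back-of-the-envelope, a small C sketch (illustrative, not from the slides) of why the 64B line cuts cache accesses 4x for this access pattern:

#include <stdio.h>

/* A unit-stride vld.w of vl 32-bit words starting at a line-aligned
 * address touches ceil(vl * 4 / line_bytes) cache lines, i.e. that
 * many cache accesses when one line is transferred per access. */
static int cache_accesses(int vl, int line_bytes)
{
    int bytes = vl * 4;                          /* 32-bit elements */
    return (bytes + line_bytes - 1) / line_bytes;
}

int main(void)
{
    printf("16B lines: %d accesses\n", cache_accesses(16, 16));  /* 4 */
    printf("64B lines: %d accesses\n", cache_accesses(16, 64));  /* 1 */
    return 0;
}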
Slide 12: Improving Cache Design
Vary the cache depth & cache line size using a parameterized design:
- Cache line size: 16, 32, 64, 128 bytes
- Cache depth: 4, 8, 16, 32, 64 KB
Measure performance on 9 benchmarks (6 from EEMBC), all executed in hardware
Measure area cost by equating the silicon area of all resources used, reported in units of equivalent LEs
Slide 13: Cache Design Space – Performance (Wall Clock Time)
[Line graph: speedup vs. the 4KB, 16B-line baseline across depths 4KB-64KB and line sizes 16B-128B; plotted speedups include 1.13, 1.37, 1.50, 1.55, 1.68, 1.77, and 1.93, with clock frequencies of 122MHz, 123MHz, 126MHz, and 129MHz across the designs]
The best cache design almost doubles the performance of the original VESPA
More pipelining/retiming could reduce the clock frequency penalty
Cache line size is more important than cache depth (lots of streaming)
Slide 14: Cache Design Space – Area
[Line graph: area vs. the 4KB, 16B-line baseline for the same design space, built from M4K block RAMs and MRAMs]
[Diagram: each M4K block RAM supplies 4096 bits at 16 bits wide, so a 64B (512-bit) line needs 32 M4Ks in parallel; filling those 32 M4Ks gives 16KB of storage]
System area almost doubled in the worst case
Slide 15: Cache Design Space – Area (cont.)
[Same graph, annotated with design guidelines]
a) Choose the cache depth to fill the block RAMs needed for the line size
b) Don't use MRAMs: big, few, and overkill
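The M4K arithmetic from the previous slide generalizes. A small C sketch (illustrative, assuming the Stratix M4K geometry of 4096 bits configured 16 bits wide) that computes how many block RAMs a line size needs and the cache depth that fills them:

#include <stdio.h>

#define M4K_BITS  4096   /* capacity of one M4K block RAM */
#define M4K_WIDTH 16     /* bits read per access in this configuration */

/* M4Ks needed to read one full cache line per access, and the
 * cache capacity that exactly fills those M4Ks (guideline a). */
static void m4k_cost(int line_bytes)
{
    int line_bits = line_bytes * 8;
    int num_m4ks  = line_bits / M4K_WIDTH;           /* M4Ks in parallel */
    int fill_kb   = num_m4ks * M4K_BITS / 8 / 1024;  /* depth filling them */
    printf("%3dB line: %2d M4Ks, filled at %2dKB\n",
           line_bytes, num_m4ks, fill_kb);
}

int main(void)
{
    m4k_cost(16);    /*  8 M4Ks, filled at  4KB */
    m4k_cost(64);    /* 32 M4Ks, filled at 16KB */
    m4k_cost(128);   /* 64 M4Ks, filled at 32KB */
    return 0;
}

This is why the chosen 64B-line design pairs naturally with a 16KB depth: any shallower cache would leave the 32 required M4Ks partly empty.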
Slide 16: Hardware Prefetching Example
[Diagram: with no prefetching, two successive vld.w instructions each miss in the Dcache and each pay the 9-cycle DDR penalty. When prefetching 3 blocks, the first vld.w misses (9-cycle penalty) and brings in its line plus the next 3, so the second vld.w hits]
Slide 17: Hardware Data Prefetching
Advantages:
- Little area overhead
- Parallelizes memory fetching with computation
- Uses the full memory bandwidth
Disadvantages:
- Cache pollution
We use sequential prefetching triggered on: a) any miss, or b) a sequential vector instruction miss (a sketch of the policy follows)
We measure performance/area using a 64B-line, 16KB dcache
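A minimal runnable sketch of the any-miss sequential prefetching policy, assuming a direct-mapped cache model (the simulator and its names are illustrative, not VESPA's Verilog):

#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 64
#define NUM_SETS   256   /* 16KB direct-mapped cache with 64B lines */
#define PREFETCH_K 3     /* extra sequential lines fetched per miss */

static uint32_t tags[NUM_SETS];
static int      valid[NUM_SETS];
static int      demand_misses;   /* each stalls ~9 cycles on DDR */

static int lookup(uint32_t line)
{
    return valid[line % NUM_SETS] && tags[line % NUM_SETS] == line;
}

static void install(uint32_t line)
{
    tags[line % NUM_SETS]  = line;
    valid[line % NUM_SETS] = 1;
}

/* Sequential prefetching triggered on any miss: fetch the missing
 * line plus the next PREFETCH_K lines so later accesses hit. */
static void access(uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    if (lookup(line))
        return;                              /* hit */
    demand_misses++;
    for (uint32_t k = 0; k <= PREFETCH_K; k++)
        install(line + k);                   /* demand line + prefetches */
}

int main(void)
{
    for (uint32_t a = 0; a < 4096; a += 4)   /* unit-stride word stream */
        access(a);
    printf("demand misses: %d (64 without prefetching)\n", demand_misses);
    return 0;
}

For a streaming access pattern this cuts demand misses by a factor of PREFETCH_K + 1; the cache-pollution risk arises when the prefetched lines displace data a non-streaming benchmark still needs.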
Slide 18: Prefetching K Blocks – Any Miss
[Line graph: speedup vs. no prefetching for 0, 1, 3, 7, 15, 31, and 63 cache lines prefetched, across autcor, conven, viterb, fbital, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and the GMEAN. Several benchmarks are not receptive; the best reaches 2.2x; the peak average speedup is 28%]
Only half the benchmarks are significantly sped up: a max of 2.2x, an average of 28%
Slide 19: Prefetching Area Cost: Writeback Buffer
When a prefetch would evict dirty lines, there are two options: deny the prefetch, or buffer all the dirty lines
Area cost is small: 1.6% of system area, mostly block RAMs, little logic
No clock frequency impact
[Diagram: prefetching 3 blocks – a vld.w miss (9-cycle penalty) brings prefetched lines into the Dcache while the evicted dirty lines drain to DDR through the WB buffer]
Slide 20: Any Miss vs Sequential Vector Miss
[Line graph: speedup vs. number of cache lines prefetched (0 to 63) for the two trigger policies, "any cache miss" and "sequential vector only"; the two curves are collinear]
Collinear – nearly all misses in our benchmarks are sequential vector misses
Slide 21: Vector Length Prefetching
Previously: a constant number of cache lines prefetched
Now: prefetch a multiple of the current vector length, only for sequential vector memory instructions (e.g. a vector load of 32 elements)
Guarantees <= 1 miss per vector memory instruction
[Diagram: a vld.w of elements 0 to 31, annotated "fetch + prefetch 28*k"]
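A sketch of how the prefetch amount could be derived from the vector length (illustrative C; the slide's own arithmetic, e.g. the "28*k" annotation, may differ in detail):

#include <stdio.h>

#define LINE_BYTES 64
#define ELEM_BYTES 4

/* For a sequential vector load of vl elements, prefetching k*vl
 * elements beyond the demand lines covers the next k vector memory
 * instructions of the stream, so each one misses at most once. */
static int prefetch_lines(int vl, int k)
{
    int bytes = k * vl * ELEM_BYTES;
    return (bytes + LINE_BYTES - 1) / LINE_BYTES;   /* round up to lines */
}

int main(void)
{
    /* a vector load of 32 words spans 128B, i.e. two 64B lines */
    for (int k = 1; k <= 8; k *= 2)
        printf("%d*VL: prefetch %d lines\n", k, prefetch_lines(32, k));
    return 0;
}

Because the amount scales with the vector length, long vectors prefetch aggressively while short vectors stay conservative, which is what keeps cache pollution low at 1*VL.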
Slide 22: Vector Length Prefetching – Performance
[Line graph: speedup vs. amount of prefetching (none, 1*VL, 2*VL, 4*VL, 8*VL, 16*VL, 32*VL) for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and the GMEAN. Some benchmarks are not receptive; the best reaches 2.2x; 1*VL gains 21% on average with no cache pollution, and the peak average is 29%]
1*VL prefetching provides good speedup without tuning; 8*VL is best
Slide 23: Overall Memory System Performance
[Bar chart: fraction of total cycles spent on memory unit stalls and miss cycles for three configurations – 16-byte line (4KB): 67%; 64-byte line (16KB): 48%; 64-byte line + prefetch: 31%]
The wider line + prefetching reduces memory unit stall cycles significantly and eliminates all but 4% of miss cycles
Slide 24: Improved Scalability
[Bar chart: cycle performance relative to 1 lane for autcor, conven, fbital, viterb, rgbcmyk, rgbyiq, ip_checksum, imgblend, filt3x3, and the GMEAN, at 1, 2, 4, 8, and 16 lanes]
Previous: 3-8x range, average of 5x for 16 lanes. Now: 6-13x range, average of 10x for 16 lanes
Slide 25: Summary
- Explored the cache design space: ~2x performance for ~2x system area (area growth due largely to the memory crossbar); widened the cache line size to 64B and the depth to 16KB
- Enhanced VESPA with hardware data prefetching: up to 2.2x performance, average of 28% for K=15
- Vector length prefetching gains 21% on average for 1*VL – good for mixed workloads, needs no tuning, causes no cache pollution; peaks at 8*VL with an average speedup of 29%
- Overall, improved the VESPA memory system & scalability: decreased miss cycles to 4% and memory unit stall cycles to 31%
Slide 26: Vector Memory Unit
[Diagram: each lane i computes its address as base + stride*i or base + index_i, selected by a MUX; requests enter a memory request queue in front of the Dcache. A read crossbar returns rddata_0 .. rddata_L to the lanes, and a write crossbar collects wrdata_0 .. wrdata_L through a memory write queue, where L = # lanes - 1 and Memory Lanes = 4]
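To make the diagram's address generation concrete, a minimal C sketch of the per-lane computation (the function name and the MemoryLanes batching in main are illustrative assumptions):

#include <stdio.h>
#include <stdint.h>

#define MEM_LANES 4   /* lanes served by the crossbar per memory cycle */

/* Per-lane address generation for a vector memory instruction:
 * unit/strided accesses use base + stride*i, indexed accesses use
 * base + index[i] (the MUX in the diagram selects between them). */
static uint32_t lane_addr(uint32_t base, int32_t stride,
                          const uint32_t *index, int i, int indexed)
{
    return indexed ? base + index[i] : base + (uint32_t)(stride * i);
}

int main(void)
{
    uint32_t base = 0x1000;
    /* a word-strided vector access across 16 lanes, issued 4 at a time */
    for (int i = 0; i < 16; i++) {
        if (i % MEM_LANES == 0 && i > 0)
            printf("--- next memory cycle ---\n");
        printf("lane %2d: addr 0x%x\n", i,
               (unsigned)lane_addr(base, 4, NULL, i, 0));
    }
    return 0;
}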