Vector Architectures Vs. Superscalar and VLIW
for Embedded Media Benchmarks
Christos Kozyrakis (Stanford University)
David Patterson (U.C. Berkeley)
http://csl.stanford.edu/~christos
© C. Kozyrakis, 11/2002
Motivation
• Ideal processor for embedded media processing
  – High performance for media tasks
  – Low cost
    • Small code size, low power consumption, highly integrated
  – Low power consumption (for portable applications)
  – Low design complexity
  – Easy to program with HLLs
  – Scalable
• This work
  – How efficient is a simple vector processor for embedded media processing?
    • No cache, no wide issue, no out-of-order execution
– How does it compare to superscalar and VLIW embedded designs?
Outline
• Motivation
• Overview of VIRAM architecture
– Multimedia instruction set, processor organization, vectorizing compiler
• EEMBC benchmarks & alternative architectures
• Evaluation
– Instruction set analysis & code size comparison
– Performance comparison
– VIRAM scalability study
• Conclusions
VIRAM Instruction Set
• Vector load-store instruction set for media processing
– Coprocessor extension to the MIPS architecture
• Architecture state
– 32 general-purpose vector registers
– 16 flag registers
– Scalar registers for control, addresses, strides, etc.
• Vector instructions
– Arithmetic: integer, floating-point, logical
– Load-store: unit-stride, strided, indexed
– Misc: vector & flag processing (pop count, insert/extract)
– 90 unique instructions
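The three load-store access modes can be sketched as a behavioral model in plain Python (the function names are illustrative, not VIRAM mnemonics):

```python
def unit_stride_load(mem, base, vl):
    # Consecutive elements starting at base (e.g. a pixel row)
    return [mem[base + i] for i in range(vl)]

def strided_load(mem, base, stride, vl):
    # Every stride-th element (e.g. one channel of interleaved RGB data)
    return [mem[base + i * stride] for i in range(vl)]

def indexed_load(mem, indices):
    # Gather: one arbitrary address per element (e.g. table lookups)
    return [mem[i] for i in indices]

# Example: extract the R channel from interleaved RGBRGB... data
rgb = [10, 20, 30, 11, 21, 31, 12, 22, 32]
assert strided_load(rgb, 0, 3, 3) == [10, 11, 12]
```

A single strided or indexed instruction replaces an entire scalar address-computation loop, which is part of why vector code is so compact.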
VIRAM ISA Enhancements
• Multimedia processing
– Support for multiple data-types (64b/32b/16b)
• Element & operation width specified with control register
– Saturated and fixed-point arithmetic
• Flexible multiply-add model without accumulators
– Simple element permutations for reductions and FFTs
– Conditional execution using the flag registers
• General-purpose systems
– TLB-based virtual memory
• Separate TLB for vector loads & stores
– Hardware support for reduced context switch overhead
• Valid/dirty bits for vector registers
• Support for “lazy” save/restore of vector state
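Two of the multimedia enhancements above can be sketched behaviorally: a saturating signed 16-bit add and flag-masked (conditional) execution. This is an illustrative model, not the actual ISA semantics:

```python
def sat_add16(a, b):
    # Saturating signed 16-bit add: clamp to the representable range
    # instead of wrapping around (essential for audio/video samples)
    return max(-32768, min(32767, a + b))

def masked_add(va, vb, flags):
    # Conditional execution: a flag register selects which elements
    # are updated; masked-off elements keep their old values
    return [a + b if f else a for a, b, f in zip(va, vb, flags)]

assert sat_add16(30000, 10000) == 32767      # clamps instead of wrapping
assert masked_add([1, 2, 3], [10, 10, 10], [1, 0, 1]) == [11, 2, 13]
```

Masking is what lets the compiler vectorize loops with `if` bodies, which matters for partially vectorizable kernels such as Viterbi.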
VIRAM Chip
[Die floorplan: 8 DRAM macros surrounding 4 vector lanes, the MIPS core, the I/O FPU, and the vector & DRAM control logic]
• 0.18µm CMOS (IBM)
• Features
  – 8KB vector register file
  – 2 256-bit integer ALUs
  – 1 256-bit FPU
  – 13MB DRAM
• 325mm² die area
• 125M transistors
• 200 MHz, 2 Watts
• Peak vector performance
  – Integer: 1.6/3.2/6.4 Gop/s (64b/32b/16b)
  – Fixed-point: 2.4/4.8/9.6 Gop/s (64b/32b/16b)
  – FP: 1.6 Gflop/s (32b)
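The integer and FP peaks follow directly from the datapath widths and the 200 MHz clock; a quick sanity check (the fixed-point figures follow a different accounting, presumably counting multiply-adds, and are not covered by this simple formula):

```python
CLOCK_GHZ = 0.2  # 200 MHz

def peak_gops(n_units, datapath_bits, elem_bits):
    # elements per cycle per unit x number of units x clock (Gop/s)
    return n_units * (datapath_bits // elem_bits) * CLOCK_GHZ

# 2 x 256-bit integer ALUs
assert peak_gops(2, 256, 64) == 1.6
assert peak_gops(2, 256, 32) == 3.2
assert peak_gops(2, 256, 16) == 6.4
# 1 x 256-bit FPU at 32-bit
assert peak_gops(1, 256, 32) == 1.6
```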
Vectorizing Compiler
• Based on Cray PDGCS compiler
– Used with all vector and MPP Cray machines
• Extensive vectorization capabilities
– Including outer-loop vectorization
• Vectorization of narrow operations and reductions
[Diagram: C, C++, and Fortran95 frontends feed Cray's PDGCS optimizer, with code generators for C90/T90/SV1, T3D/T3E, and X1(SV2)/VIRAM]
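One reduction pattern such a compiler can target (building on the ISA's element permutations) is easy to sketch behaviorally: fold the upper half of the vector onto the lower half, halving the length each step. This assumes a power-of-two length and is an illustration, not the actual PDGCS code-generation strategy:

```python
def vector_reduce_sum(v):
    # Tree reduction: log2(n) vector adds instead of n scalar adds
    v = list(v)
    n = len(v)
    while n > 1:
        half = n // 2
        for i in range(half):   # models one vector add per step
            v[i] += v[i + half]
        n = half
    return v[0]

assert vector_reduce_sum(range(8)) == 28
```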
EEMBC Benchmarks
• The de facto industry standard for embedded CPUs
• Used the consumer & telecommunications categories
  – Representative of the workload of multimedia devices with wireless/broadband capabilities
  – C code, EEMBC reference input data
• Consumer category
  – Image processing tasks for digital camera devices
  – Rgb2cmyk & rgb2yiq conversions, convolutional filter, jpeg encode & decode
• Telecommunications category
  – Encoding/decoding tasks for DSL/wireless
  – Autocorrelation, convolutional encoder, DSL bit allocation, FFT, Viterbi decoding
EEMBC Metrics
• Performance: repeats/second (throughput)
– Use geometric means to summarize scores
• Code size and static data size in bytes
  – Data sizes are the same for the processors we discuss
• Pitfall with caching behavior
  – Fundamentally, no temporal locality in most benchmarks
  – Repeating the kernel on the same (small) data set creates locality
  – Unfair advantage for cache-based architectures
• VIRAM has no data cache for vector loads/stores
Embedded Processors
Architecture          Chip         Issue Width  OOO  Cache L1I/L1D/L2 (KB)  Clock (MHz)  Power (W)
VIRAM (vector)        VIRAM        1            –    8/–/–                  200          2.0
x86 (CISC)            K6-IIIE+     3            √    32/32/256              550          21.6
PowerPC (RISC)        MPC7455      4            √    32/32/256              1000         21.3
MIPS (RISC)           VR5000       2            –    32/32/–                250          5.0
Trimedia (VLIW+SIMD)  TM1300       5            –    32/16/–                166          2.7
VelociTI (VLIW+DSP)   TMS320C6203  8            –    96/512/–               300          1.7
Degree of Vectorization
[Bar chart: percentage of dynamic operations executed as vector vs. scalar operations for Rgb2cmyk, Rgb2yiq, Filter, Cjpeg, Djpeg, Autocor, Convenc, Bital, Fft, and Viterbi]
• Typical degree of vectorization is 90%
  – Even for Viterbi decoding, which is only partially vectorizable
  – The ISA & compiler can capture the data parallelism in EEMBC
  – Great potential for vector hardware
Vector Length
[Bar chart: average and maximum vector length (elements) per benchmark, against the supported maximum; 16-bit ops (Rgb2cmyk, Filter, Cjpeg, Djpeg, Convenc, Viterbi) and 32-bit ops (Rgb2yiq, Cjpeg, Djpeg, Autocor, Bital, Fft)]
• Short vectors
  – Unavoidable with Viterbi; an artifact of the code with Cjpeg/Djpeg
• Even the shortest operations have ~20 16-bit elements
  – More parallelism than 128-bit SSE/Altivec can capture
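For comparison, the number of elements a fixed-width SIMD instruction touches at once is just the register width divided by the element width, so even the shortest ~20-element vectors here exceed what one 128-bit SSE/AltiVec instruction covers:

```python
def simd_elems(register_bits, elem_bits):
    # elements processed by one fixed-width SIMD instruction
    return register_bits // elem_bits

assert simd_elems(128, 16) == 8   # vs ~20-element vectors in EEMBC
assert simd_elems(128, 32) == 4
```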
Code Size
• VIRAM's high code density is due to
  – No loop unrolling or software pipelining; compact expression of strided/indexed/permutation patterns; no small loops
• Vectors: a “code compression” technique for RISC
[Bar chart: code size normalized to VIRAM for VIRAM, x86, PowerPC, MIPS, VLIW (cc), and VLIW (opt), for the Consumer and Telecommunications categories]
Comparison: Performance
[Bar chart: EEMBC performance (normalized) for VIRAM, PowerPC, VLIW (cc), and VLIW (opt), per benchmark and geometric mean; some bars exceed the chart scale]
• The 200MHz, cache-less, single-issue VIRAM is
  – 100% faster than a 1-GHz, 4-way OOO processor
  – 40% faster than a manually optimized VLIW with SIMD/DSP
• VIRAM can sustain high computation throughput
  – Up to 16 (32-bit) or 32 (16-bit) arithmetic ops/cycle
• VIRAM can hide the latency of accesses to DRAM
Comparison: Performance/MHz
[Bar chart: performance per MHz for VIRAM, PowerPC, VLIW (cc), and VLIW (opt), per benchmark and geometric mean; some bars exceed the chart scale]
• VIRAM vs. 4-way OOO & VLIW with compiled code
  – 10x-50x with long vectors, 2x-5x with short vectors
• VIRAM vs. VLIW with manually optimized code
  – VLIW within 50% of VIRAM
  – VIRAM would benefit from the same optimizations
• Similar results if normalized by power or complexity
VIRAM Scalability
• Same executable, frequency, and memory system
• Decreased efficiency with 8 lanes
  – Short vector lengths, conflicts in the memory system
  – Difficult to hide overheads with short execution times
• Overall: 2.5x with 4 lanes, 3.5x with 8 lanes
  – Could a superscalar processor scale similarly?
[Bar chart: per-benchmark performance for 1, 2, 4, and 8 lanes, normalized to 1 lane; geometric-mean speedups of 1.0, 1.7, 2.5, and 3.5]
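The geometric-mean speedups translate into a simple parallel-efficiency figure (speedup divided by lane count), which makes the drop-off at 8 lanes explicit:

```python
# Geometric-mean speedups from the scalability study (normalized to 1 lane)
speedup = {1: 1.0, 2: 1.7, 4: 2.5, 8: 3.5}

def lane_efficiency(lanes):
    # Fraction of ideal linear scaling actually achieved
    return speedup[lanes] / lanes

assert lane_efficiency(2) == 0.85
assert lane_efficiency(8) == 0.4375
```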
Conclusions
• Vector architectures are a great match for embedded multimedia processing
  – They combine high performance, low power, and low complexity
  – Add a vector unit to your media processor!
• VIRAM code density
  – Similar to x86; 5-10 times better than optimized VLIW
• VIRAM performance, with compiler vectorization and no hand-tuning
  – 2x the performance of a 4-way OOO superscalar
    • Even though the OOO processor runs at 5x the clock frequency
  – 50% faster than a manually optimized 5- to 8-way VLIW
    • Even though the VLIW has hand-inserted SIMD and DSP support