Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004...
![Page 1: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/1.jpg)
Multimedia Characteristics and Optimizations
Marilyn Wolf, Dept. of EE, Princeton University
© 2004 Marilyn Wolf
![Page 2: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/2.jpg)
Outline
- Fritts: compiler studies.
- Lv: compiler studies.
- Memory system optimizations.
![Page 3: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/3.jpg)
Basic Characteristics
- Comparison of operation frequencies with SPEC: (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1). Lower frequency of memory and floating-point operations; more arithmetic operations; larger variation in memory usage.
- Basic block statistics: average of 5.5 operations per basic block; global scheduling techniques are needed to extract ILP.
- Static branch prediction: average static branch prediction rate of 89.5% on the training input and 85.9% on the evaluation input.
- Data types and sizes: nearly 70% of all instructions require only 8- or 16-bit data types.
![Page 4: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/4.jpg)
Breakdown of Data Types by Media Type
[Figure: ratio of data types (%), 0% to 100%, by media type. Series: floating-point, pointers, word, halfword, byte.]
![Page 5: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/5.jpg)
Memory Statistics
- Working set size, by cache regression: cache sizes from 1 KB to 4 MB; assumed line size of 64 bytes; measured read and write miss ratios.
- Spatial locality, by cache regression: line sizes from 8 to 1024 bytes; assumed cache size of 64 KB; measured read and write miss ratios.
- Memory results: data memory: 32 KB and 60.8% spatial locality (up to 128 bytes); instruction memory: 8 KB and 84.8% spatial locality (up to 256 bytes).
- [Equation: spatial locality expressed in terms of A, B, and the line sizes l_a and l_b; the formula itself is not recoverable from this transcript.]
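The spatial-locality regression above (fixed cache size, varying line size, measured miss ratio) can be sketched with a very small cache model. This is only an illustration: the direct-mapped organization, the function name `miss_ratio`, and the sequential trace are assumptions, not the setup used in the study.

```python
def miss_ratio(addresses, cache_bytes, line_bytes):
    """Miss ratio of a direct-mapped cache on a trace of byte addresses."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines            # block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes       # memory block containing this byte
        index = block % num_lines        # cache line that block maps to
        if tags[index] != block:         # cold or conflict miss
            misses += 1
            tags[index] = block
    return misses / len(addresses)


# Hypothetical trace: sequential byte reads over a 16 KB buffer.
trace = range(16 * 1024)
for line_bytes in (8, 16, 32, 64, 128, 256, 512, 1024):
    # Cache size fixed at 64 KB, as in the spatial-locality regression.
    print(line_bytes, miss_ratio(trace, 64 * 1024, line_bytes))
```

On a purely sequential trace the miss ratio halves each time the line size doubles; real media workloads deviate from this, which is what the spatial-locality measurement captures.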
![Page 6: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/6.jpg)
Data Spatial Locality
[Figure: degree of spatial locality (%), -200 to 150, versus line size (16 to 1024 bytes). Series: video, image, graphics, audio, speech, security, average.]
![Page 7: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/7.jpg)
Multimedia Looping Characteristics
- Highly loop-centric: nearly 95% of execution time is spent within the two innermost loop levels.
- Large number of iterations: significant processing regularity; about 10 iterations per loop on average.
- Path ratio indicates intra-loop complexity: computed as the ratio of the average number of instructions executed per loop invocation to the total number of instructions in the loop; average path ratio of 78%; indicates greater control complexity than expected.
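The path ratio defined above can be sketched as a short calculation. The function name and the sample counts are hypothetical, and the sketch assumes the executed-instruction counts are taken per pass through the loop body, since that is what keeps the ratio at or below 1.

```python
def path_ratio(executed_per_pass, static_loop_size):
    """Average executed instructions per pass / instructions in the loop body."""
    average_executed = sum(executed_per_pass) / len(executed_per_pass)
    return average_executed / static_loop_size


# Hypothetical loop: 50 static instructions, four passes that each skip
# some instructions because of branches inside the body.
ratio = path_ratio([40, 38, 40, 38], 50)
print(f"path ratio = {ratio:.2f}")
```

A ratio of 1.0 would mean every instruction in the loop body runs on every pass; lower values indicate that control flow inside the loop skips part of the body.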
![Page 8: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/8.jpg)
Average Iterations per Loop and Path Ratio
[Figure, two panels by media type (video, image, graphics, audio, speech, security, average). Panel 1: average number of loop iterations, log scale from 1 to 1000. Panel 2: average path ratio, 0 to 1.]
![Page 9: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/9.jpg)
Instruction Level Parallelism
- Instruction-level parallelism: base model: single issue using classical optimizations only; parallel model: 8-issue.
- Explores only parallel scheduling performance: assumes an ideal processor model; no performance penalties from branches, cache misses, etc.
[Figure: speedup, 0 to 3.5, by media type (video, image, graphics, audio, speech, security, average). Series: 8-issue classical only, 8-issue classical w/ inlining, 8-issue superblock, 8-issue hyperblock.]
![Page 10: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/10.jpg)
Workload Evaluation Conclusions
- Operation characteristics: more arithmetic operations; less memory and floating-point usage; large variation in memory usage; (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1).
- Good static branch prediction: multimedia: 10-15% avg. miss ratio; general-purpose: 20-30% avg. miss ratio; similar basic block sizes (5 instrs per basic block).
- Primarily small data types (8 or 16 bits): nearly 70% of instructions require 16-bit or smaller data types; significant opportunity for subword parallelism or narrower datapaths.
- Memory: typically small data and instruction working set sizes; high data and instruction spatial locality.
- Loop-centric: majority of execution time spent in two innermost loops; average of 10 iterations per loop invocation; path ratio indicates greater control complexity than expected.
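The subword-parallelism opportunity noted above can be illustrated in software: four 8-bit values packed into one 32-bit word are added with a single pass of word-wide operations. This is a sketch of the general idea only, not the instruction set of any processor in these slides; the helper names `pack4`, `unpack4`, and `add4_u8` are hypothetical.

```python
LANE_HIGH = 0x80808080  # top bit of each 8-bit lane
LANE_LOW = 0x7F7F7F7F   # low seven bits of each 8-bit lane


def pack4(values):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def add4_u8(a, b):
    """Lane-wise modulo-256 addition of two packed words."""
    # Add the low 7 bits of every lane; the sum fits in 8 bits, so no
    # carry can cross into the neighbouring lane.
    low = (a & LANE_LOW) + (b & LANE_LOW)
    # Fold in the top bit of each lane with XOR (addition without carry-out).
    return (low ^ ((a ^ b) & LANE_HIGH)) & 0xFFFFFFFF


print(unpack4(add4_u8(pack4([1, 2, 3, 4]), pack4([10, 20, 30, 40]))))
```

Hardware subword instructions do the same lane isolation in the datapath, so four byte additions cost one operation instead of four.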
![Page 11: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/11.jpg)
Architecture Evaluation
![Page 12: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/12.jpg)
Architecture Evaluation
- Determine the fundamental architecture style.
- Statically scheduled => Very Long Instruction Word (VLIW): allows wider issue; simple hardware => potentially higher frequencies.
- Dynamically scheduled => superscalar: allows decoupled data memory accesses; effective at reducing penalties from stalls.
- Examine a variety of architecture parameters: fundamental architecture style; instruction fetch architecture; high-frequency effects; cache memory hierarchy.
Related Work
- [Lee98] “Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors,” DAC-35, 1998.
- [PChang91] “Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors,” MICRO-24, 1991.
- [ZWu99] “Architecture Evaluation of Multi-Cluster Wide-Issue Video Signal Processors,” Ph.D. Thesis, Princeton University, 1999.
- [DZucker95] “A Comparison of Hardware Prefetching Techniques for Multimedia Benchmarks,” Technical Report CSL-TR-95-683, Stanford University, 1995.
![Page 13: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/13.jpg)
Fundamental Architecture Evaluation
- Fundamental architecture evaluation included: static vs. dynamic scheduling; issue width.
- Focused on non-memory-limited applications: determine the impact of datapath features independent of memory; assume memory techniques can solve the memory bottleneck.
- Architecture model:
- 8-issue processor
- Operation latencies targeted for 500 MHz to 1 GHz
- 64 integer and floating-point registers
- Pipeline: 1 fetch, 2 decode, 1 write back, variable execute stages
- 32 KB direct-mapped L1 data cache with 64-byte lines
- 16 KB direct-mapped L1 instruction cache with 256-byte lines
- 256 KB 4-way set-associative on-chip L2 cache
- 4:1 processor to external bus frequency ratio
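As a worked example of the cache parameters in the architecture model, the number of lines and sets follows directly from size, line size, and associativity. The helper below is a sketch with a hypothetical name; the slides do not give the L2 line size, so the 64-byte value used for L2 is an assumption for illustration only.

```python
def cache_geometry(size_bytes, line_bytes, ways=1):
    """Return (lines, sets, offset_bits, index_bits) for a cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways                      # direct-mapped when ways == 1
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return lines, sets, offset_bits, index_bits


KB = 1024
print("L1 data :", cache_geometry(32 * KB, 64))     # direct-mapped, 64-byte lines
print("L1 instr:", cache_geometry(16 * KB, 256))    # direct-mapped, 256-byte lines
# L2 line size is NOT stated in the slides; 64 bytes is assumed here.
print("L2      :", cache_geometry(256 * KB, 64, ways=4))
```

The long 256-byte instruction lines leave the instruction cache with only 64 lines, which fits the high instruction spatial locality reported earlier.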
![Page 14: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/14.jpg)
Static versus Dynamic Scheduling
[Figure, two panels. Panel 1: static versus dynamic scheduling for various compiler methods; IPC (0 to 3) versus compilation method (Classical, Superscalar, Hyperblock); series: VLIW, in-order superscalar, out-of-order superscalar, each also shown with perfect caches. Panel 2: result of increasing issue width for the given architecture and compiler methods; IPC (0 to 2.5) versus issue width (0 to 10); series: VLIW, in-order superscalar, out-of-order superscalar.]
![Page 15: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/15.jpg)
Instruction Fetch Architecture
[Figure, two panels. Panel 1: aggressive versus conservative fetch methods; IPC difference (%), 0 to 6, versus compilation method (Classical, Superscalar, Hyperblock); series: VLIW w/ fixed-width instrs, VLIW w/ variable-width instrs, in-order superscalar, out-of-order superscalar. Panel 2: comparison of dynamic branch prediction schemes; branch prediction rate (%), 90.5 to 94.5, versus predictor size (0 to 25 KB); series: 2-bit counter, PAs(6,16), and PAs(10,8), each for Classical, Superscalar, and Hyperblock.]
![Page 16: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/16.jpg)
Experimental Configuration: Single-Issue Processor
- SimpleScalar sim-outorder
- Single-issue configuration
- RISC
![Page 17: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/17.jpg)
Experimental Configuration: Benchmarks
Selected from different areas of MediaBench
Additional real-world applications
![Page 18: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/18.jpg)
Baseline benchmark characteristics
Measured on the single-issue processor
Execution time closely related to dynamic instruction count
[Figure: dynamic instruction count and execution cycles per benchmark, log scale from 1.0E+00 to 1.0E+10. Benchmarks: g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm.]
![Page 19: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/19.jpg)
VLIW vs. Single Issue
- Static code size
- Dynamic operation count
- Execution speed
- Basic block size
![Page 20: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/20.jpg)
Static Code Size Results
[Figure: unified static code size, 0 to 200, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG). Series: Single Issue, TM1300.]
![Page 21: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/21.jpg)
Static Code Size Analysis
- Similar static code size
- On average, the TM1300 requires 17% more space
![Page 22: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/22.jpg)
Dynamic Operation Count Results
[Figure: unified dynamic operation count, 0 to 250, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG). Series: Single Issue, TM1300.]
![Page 23: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/23.jpg)
Dynamic Operation Count Analysis
- Dynamic instruction counts are similar for the two types of processors.
- On average, the TM1300 needs 20% more operations.
- The execution-time difference resulting from the ISA difference is small.
![Page 24: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/24.jpg)
Execution Speed Results
[Figure: unified execution speed, 0 to 600, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG). Series: Single Issue, TM1300.]
![Page 25: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/25.jpg)
Execution Speed Analysis
- The TM1300 executes all benchmarks faster than the single-issue processor.
- On average, the speedup is 3.4x.
- The speedup results partly from the wide-issue capability and partly from architecture features.
![Page 26: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/26.jpg)
Unoptimized Basic Block Size Results
[Figure: basic block size, 0 to 35, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG). Series: Single Issue, TM1300.]
![Page 27: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/27.jpg)
Unoptimized Basic Block Size Analysis
- The TriMedia compiler produces code with larger basic blocks.
- On average, basic blocks on the TM1300 are twice as large.
![Page 28: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/28.jpg)
Exploiting Special Features
- Methods: using custom instructions; loop transformations.
- Metrics: execution speed; memory access count; basic block size; operation-level parallelism.
![Page 29: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/29.jpg)
Execution Time Results
[Figure: unified execution speed, 0 to 250, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG). Series: Un-optimized, Optimized.]
![Page 30: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/30.jpg)
Execution Time Analysis
- 1.5x average speedup.
- Data-transfer-intensive, floating-point-intensive, and table-lookup-intensive applications see less speedup.
![Page 31: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/31.jpg)
Memory Access Count Results
[Figure: unified memory reference count, 0 to 120, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG). Series: Un-optimized, Optimized.]
![Page 32: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/32.jpg)
Memory Access Count Analysis
Reduced average memory access count
Memory access can be a bottleneck (MPEG)
![Page 33: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/33.jpg)
Optimized Basic Block Size
[Figure: basic block size, 0 to 70, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG). Series: Un-optimized, Optimized.]
![Page 34: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/34.jpg)
Optimized Basic Block Size Analysis
A significant change in basic block size results in a performance gain (Region)
![Page 35: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/35.jpg)
Operation Level Parallelism Results
[Figure: operations per cycle (OPC), 0 to 3.5, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG).]
![Page 36: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/36.jpg)
Operation Level Parallelism Analysis
- OPC close to 2
- Memory access can be a bottleneck
- Wider bus and super-word parallelism
![Page 37: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/37.jpg)
Overall Performance Change Results
[Figure: speedup, 0 to 12, per benchmark (g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, hmm, AVG).]
![Page 38: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/38.jpg)
Overall Performance Change Analysis
- The TM1300 exhibits a significant performance gain over the single-issue processor.
- 5x speedup on average.
- 10x best-case speedup.
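As a rough consistency check, the two average speedups reported earlier in the slides (3.4x for the unoptimized TM1300 over the single-issue processor, and 1.5x from the optimizations) can be composed. The sketch below assumes the two factors are independent and multiply, which is only approximately true for averages taken over a benchmark set; `combined_speedup` is a hypothetical helper.

```python
def combined_speedup(*factors):
    """Multiply successive speedup factors, assuming they are independent."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


unoptimized_vs_single_issue = 3.4   # from the execution speed analysis slide
optimized_vs_unoptimized = 1.5      # from the execution time analysis slide
overall = combined_speedup(unoptimized_vs_single_issue, optimized_vs_unoptimized)
print(f"estimated overall speedup: {overall:.1f}x")
```

The estimate lands close to the reported 5x average; the remaining gap is expected because an average of per-benchmark products is not the product of the averages.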
![Page 39: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/39.jpg)
What type of memory system?
Cache: size, # sets, block size.
On-chip main memory: amount, type, banking, network to PEs.
Off-chip main memory: type, organization.
![Page 40: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/40.jpg)
Memory system optimizations
- Strictly software: effectively using the cache and partitioned memory.
- Hardware + software: scratch-pad memories; custom memory hierarchies.
![Page 41: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/41.jpg)
Taxonomy of memory optimizations (Wolf/Kandemir)
- Data vs. code.
- Array/buffer vs. non-array.
- Cache/scratch pad vs. main memory.
- Code size vs. data size.
- Program vs. process.
- Languages.
![Page 42: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/42.jpg)
Software performance analysis
- Worst-case execution time (WCET) analysis (Li/Malik): find the longest path through the CDFG; can use annotations of branch probabilities; can be mapped onto cache lines; difficult in practice, since optimized code must be analyzed.
- Trace-driven analysis: well understood; requires code and input vectors.
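The "longest path through the CDFG" step can be sketched as a longest-path search over an acyclic graph of basic blocks. This is a simplified illustration, not the formulation used by Li and Malik: it assumes loops have already been bounded and unrolled so the graph has no cycles, and the block names and cycle costs are hypothetical.

```python
from functools import lru_cache


def longest_path(cost, successors, entry):
    """Worst-case total cost from `entry` to any exit of an acyclic graph."""

    @lru_cache(maxsize=None)
    def worst_case(node):
        # Cost of this block plus the most expensive continuation, if any.
        tails = (worst_case(s) for s in successors.get(node, ()))
        return cost[node] + max(tails, default=0)

    return worst_case(entry)


# Hypothetical if/else region: per-block costs in cycles.
cost = {"entry": 2, "then": 10, "else": 4, "exit": 1}
successors = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"]}
print(longest_path(cost, successors, "entry"))
```

Memoization keeps the search linear in the number of edges, so shared tails such as `exit` are evaluated only once.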
![Page 43: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/43.jpg)
Software energy/power analysis
- Analytical models of cache (Su/Despain, Kamble/Ghose, etc.): decoding, memory core, I/O path, etc.
- System-level models (Li/Henkel).
- Power simulators (Vijaykrishnan et al., Brooks et al.).
![Page 44: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/44.jpg)
Power-optimizing transformations
- Kandemir et al.: most energy is consumed by the memory system, not the CPU core.
- Performance-oriented optimizations reduce memory system energy but increase datapath energy consumption.
- Larger caches increase cache energy consumption but reduce overall memory system energy.
![Page 45: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/45.jpg)
Scratch pad memories
- Explicitly managed local memory.
- Panda et al. used a static management scheme: data structures are assigned to off-chip memory or the scratch pad at compile time; scalars go in the scratch pad, arrays in main memory.
- May want to manage the scratch pad at run time.
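The static scheme summarized above (scalars in the scratch pad, arrays in main memory, decided at compile time) can be sketched as a one-pass assignment. This follows only the rule stated on the slide and is not a reproduction of the algorithm by Panda et al.; the function name, the capacity check, and the variable list are hypothetical.

```python
def assign_static(variables, scratch_bytes):
    """Map each (name, size_bytes, is_array) entry to 'scratch_pad' or 'main'."""
    placement = {}
    free = scratch_bytes
    for name, size, is_array in variables:
        if not is_array and size <= free:
            placement[name] = "scratch_pad"   # scalar that still fits
            free -= size
        else:
            placement[name] = "main"          # arrays, or scalars that overflow
    return placement


# Hypothetical program variables and a 1 KB scratch pad.
variables = [("i", 4, False), ("frame", 76800, True), ("acc", 4, False)]
print(assign_static(variables, 1024))
```

Because the mapping is fixed before execution, no run-time bookkeeping is needed, at the cost of not adapting when the program's hot data changes.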
![Page 46: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/46.jpg)
Reconfigurable caches
- Use the compiler to determine the best cache configuration for various program regions.
- Must be able to quickly reconfigure the cache.
- Must be able to identify where program behavior changes.
![Page 47: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/47.jpg)
Software methods for cache placement
- McFarling analyzed inter-function dependencies.
- Tomiyama and Yasuura used ILP.
- Li and Wolf used a process-level model.
- Kirovski et al. use profiling information plus a graph model.
- Dwyer/Fernando use bit vectors to construct bounds in instruction caches.
- Parameswaran and Henkel use heuristics.
![Page 48: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/48.jpg)
Addressing optimizations
- Addressing can be expensive: 55% of DSP56000 instructions performed addressing operations in MediaBench.
- Utilize specialized addressing registers, pre/post-increment/decrement, etc.
- Place variables in the proper order in memory so that simpler operations can be used to calculate the next address from the previous address.
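The effect of variable ordering on addressing cost can be sketched with a simple model: when two consecutively accessed variables sit in the same or adjacent memory slots, the address register is updated by a free auto-increment or auto-decrement; otherwise an explicit address computation is charged. The cost model, function name, and access sequence are assumptions for illustration and do not describe a specific DSP.

```python
def addressing_cost(layout, accesses):
    """Count accesses that need an explicit address update under `layout`."""
    position = {name: slot for slot, name in enumerate(layout)}
    cost = 0
    for previous, current in zip(accesses, accesses[1:]):
        # A distance of 0 or 1 is covered by auto-increment/decrement.
        if abs(position[current] - position[previous]) > 1:
            cost += 1
    return cost


# Hypothetical access sequence over four variables.
accesses = ["a", "c", "a", "c", "b", "d", "b"]
print(addressing_cost(["a", "b", "c", "d"], accesses))  # declaration order
print(addressing_cost(["a", "c", "b", "d"], accesses))  # reordered layout
```

Reordering so that frequently paired variables are neighbours removes the explicit address updates in this example, which is the goal of the ordering optimization described above.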
![Page 49: Multimedia Characteristics and Optimizations Marilyn Wolf Dept. of EE Princeton University © 2004 Marilyn Wolf.](https://reader035.fdocuments.in/reader035/viewer/2022062314/56649f0d5503460f94c21bcc/html5/thumbnails/49.jpg)
Hardware methods for cache optimization
Kirk and Strosnider divided the cache into sections and allocated timing-critical code to its own section.