System-Level Memory Bus Power And Performance Optimization for
Embedded Systems
Ke Ning (kning@ece.neu.edu)
David Kaeli (kaeli@ece.neu.edu)
2
Why Has Power Become So Important?
“Power: A First Class Design Constraint for Future Architecture” – Trevor Mudge 2001
Increasing complexity for higher performance (MIPS)
- Parallelism, pipelining, larger memory/cache sizes
- Higher clock frequencies, larger die sizes
- Rising dynamic power consumption
The CMOS process continues to shrink: smaller logic gates reduce V_threshold
- Lower V_threshold leads to higher leakage
- Leakage power will exceed dynamic power
Things get worse in embedded systems
- Low-power and low-cost systems
- Fixed or limited applications/functionality
- Real-time systems with timing constraints
3
Power Breakdown of An Embedded System
[Figure: power breakdown pie chart - internal dynamic power, internal leakage, and external power for the RTC, PPI, SPORT0, SPORT1, UART, and SDRAM]
Conditions: 25°C; 1.2V internal, 400MHz CCLK Blackfin processor; 3.3V external, 133MHz SDRAM, 27MHz PPI
Source: Analog Devices Inc.
Research target: the external (SDRAM) bus power
4
Introduction
Related work on microprocessor power
- Low-power design trends
- Power metrics
- Power/performance tradeoffs
- Power optimization techniques
Power estimation framework
- Experimental framework built from a Blackfin cycle-accurate simulator
- Validated against a Blackfin EZ-Kit board
Power optimizations
- Power-aware bus arbitration
- Memory page remapping
5
Outline
Research Motivation and Introduction
Related Work
Power Estimation Framework
Optimization I – Power-Aware Bus Arbitration
Optimization II – Memory Page Remapping
Summary
6
Power Modeling
Dynamic power estimation
- Instruction-level models: [Tiwari94], JouleTrack [Sinha01]
- Function-level model: [Qu00]
- Architecture models: Cai-Lim model, TEMPEST [CaiLim99], Wattch [Brooks00], SimplePower [Ye00]
Static power estimation
- Butts-Sohi model [Butts00]
Previous memory system power estimation
- Activity model: CACTI [Wilton96]
- Trace-driven model: Dinero IV [Elder98]
7
Power Equation

P = P_dynamic + P_leakage = A · C · V_DD^2 · f + V_DD · N · k_design · I_leakage

A - activity factor
C - total capacitance
V_DD - supply voltage
f - clock frequency
N - transistor count
k_design - design-dependent technology factor
I_leakage - normalized leakage current
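As a sanity check, the power equation P = A·C·V_DD^2·f + V_DD·N·k_design·I_leakage can be evaluated directly. A minimal sketch; the parameter values below are purely illustrative placeholders, not measurements from the slides:

```python
def total_power(A, C, Vdd, f, N, k_design, I_leak):
    """P = dynamic + leakage = A*C*Vdd^2*f + Vdd*N*k_design*I_leak."""
    dynamic = A * C * Vdd ** 2 * f
    leakage = Vdd * N * k_design * I_leak
    return dynamic, leakage

# Illustrative values only: 0.2 activity factor, 1 nF switched
# capacitance, 1.2 V supply, 400 MHz clock; the leakage-term
# parameters (N, k_design, I_leak) are arbitrary placeholders.
dyn, leak = total_power(0.2, 1e-9, 1.2, 400e6, 1e7, 1e-3, 1e-8)
```

With these placeholder values the dynamic term dominates, which matches the slide's point that leakage only overtakes dynamic power as V_threshold shrinks.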
8
Common Power Optimization Techniques
Gating (turn off unused components)Clock gatingVoltage gating: Cache decay [Hu01]
Scaling: (scale operating point of an component)Voltage scaling: Drowsy cache [Flautner02]Frequency scaling: [Pering98]Resource scaling: DRAM power mode [Delaluz01]
Banking: (break single component into smaller sub-units)Vertical sub-banking: Filter cache[Kin97]Horizontal sub-banking: Scratchpad [Kandemir01]
Clustering: (partition components into clusters)Switching reduction: (redesigning with lower activity)
Bus encoding: Permutation Code [Mehta96], redundant code[Stan95, Benini98], WZE[Musoll97]
9
Power Aware Figure of Merit
Delay, D - performance, MIPS
Power, P - battery life (mobile), packaging (high performance)
Energy, P·D: the obvious choice for a power/performance tradeoff
- Joules/instruction; inversely, MIPS/W
- Mobile / low-power applications
Energy-delay, P·D^2
- MIPS^2/W [Gonzalez96]
Energy-delay-square, P·D^3
- MIPS^3/W
- Voltage and frequency independent
More generically, MIPS^m/W
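The generic MIPS^m/W metric is easy to compare numerically. A minimal sketch; the two design points below are hypothetical, purely for illustration:

```python
def merit(power_w, mips, m):
    """Generic power-aware figure of merit: MIPS^m / W.

    m = 1 -> MIPS/W (energy), m = 2 -> MIPS^2/W (energy-delay),
    m = 3 -> MIPS^3/W (energy-delay-square).
    """
    return mips ** m / power_w

# Hypothetical design points: B runs faster but burns more power.
a = merit(0.5, 400, 2)   # design A: 400 MIPS at 0.5 W
b = merit(1.0, 600, 2)   # design B: 600 MIPS at 1.0 W
```

Under MIPS^2/W design B wins (360000 vs 320000), while under plain MIPS/W design A wins (800 vs 600): raising m weights performance more heavily relative to power.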
10
Power Optimization Effect on Power Figure

Most optimization schemes sacrifice performance for lower power consumption; switching reduction is the exception.
All optimization schemes yield higher power efficiency.
All optimization schemes increase hardware complexity.
11
Outline
Research Motivation and Introduction
Related Work
Power Estimation Framework
Optimization I – Power-Aware Bus Arbitration
Optimization II – Memory Page Remapping
Summary
12
External Bus
External bus components
- Typically an off-chip bus
- Includes: control bus, address bus, data bus
External bus power consumption
- Dynamic power factors: activity, capacitance, frequency, voltage
- Leakage power factors: supply voltage, threshold voltage, CMOS technology
- Differs from internal memory bus power:
  - Longer physical distance, higher bus capacitance, lower speed
  - Cross-line interference, higher leakage current
  - Different communication protocols (memory/peripheral dependent)
  - Multiplexed row/column address bus, narrower data bus
13
Embedded SOC System Architecture
[Figure: embedded SOC block diagram. A media processor core with data cache and instruction cache, a system DMA controller (Memory DMA 0/1, PPI DMA, SPORT DMA), an NTSC/PAL encoder, a streaming interface, S-Video/CVBS, and a NIC share the internal bus. The External Bus Interface Unit (EBIU) connects the internal bus over the external bus to SDRAM, FLASH memory, and asynchronous devices.]
Power modeling area: the external bus.
14
ADSP-BF533 EZ-Kit Lite Board
[Figure: board photo. The BF533 Blackfin processor connects to FLASH memory, SDRAM memory, SPORT data I/O, a video codec/ADV converter (video in & out), and an audio codec/AD converter (audio in, audio out).]
15
External Bus Power Estimator
Previous approaches
- Used Hamming distance [Benini98]
- Control signals were not considered
- The shared row and column address bus was not modeled
- Memory state transitions were not considered
Our estimator
- Integrates memory control signal power into the model
- Considers the case where row and column addresses share a bus
- Accounts for the power of memory state transitions and stalls
- Considers the page-miss penalty and the traffic-reversal penalty
P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation)+ P(data transmission) + P(leakage)
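The data-transmission term of this sum is driven by bit toggles on the bus. A minimal sketch of that one Hamming-distance component only (function names and the per-toggle energy are illustrative assumptions; the full estimator adds the page-miss, turnaround, control, address, and leakage terms):

```python
def hamming(a, b):
    """Number of bus lines that toggle between consecutive values."""
    return bin(a ^ b).count("1")

def bus_dynamic_energy(values, e_toggle):
    """Toy estimate of data-bus switching energy: each toggled line
    costs e_toggle joules (roughly 0.5 * C * V^2 per transition)."""
    return sum(hamming(a, b) for a, b in zip(values, values[1:])) * e_toggle
```

For example, the word sequence 00 -> 11 -> 11 -> 01 toggles 2 + 0 + 1 = 3 lines in total.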
16
Two External Bus SDRAM Timing Models
[Figure: SDRAM command timing for a bank 0 request followed by a bank 1 request, in system clock (SCLK) cycles.
(a) Sequential command mode: bank 0 issues P A R R R R, then bank 1 issues P A R R, with NOP cycles (t_RP, t_RCD, t_CAS) separating the command sequences.
(b) Pipelined command mode: bank 1's PRECHARGE/ACTIVATE overlap bank 0's reads, eliminating most NOP cycles.
P - PRECHARGE, A - ACTIVATE, N - NOP, R - READ]
17
Bus Power Simulation Framework
[Figure: simulation flow. Program -> Compiler -> Target Binary -> Instruction-Level Simulator -> Memory Trace Generator -> Memory Hierarchy Model -> External Bus Power Estimator -> Bus Power. The estimator takes a memory power model and a memory technology timing model as inputs. Shaded boxes are the software modules we developed.]
18
Multimedia Benchmark Configurations
Name        Description                                                          I-Cache  D-Cache
MPEG2-ENC   MPEG-2 video encoder with 720x480 4:2:0 input frames                 16k      16k
MPEG2-DEC   MPEG-2 video decoder of 720x480 sequence, 4:2:2 CCIR frame output    16k      16k
H264-ENC    H.264/MPEG-4 Part 10 (AVC) video encoder, very high compression      16k      16k
H264-DEC    H.264/MPEG-4 Part 10 (AVC) video decompression algorithm             16k      16k
JPEG-ENC    JPEG image encoder for 512x512 image                                 8k       8k
JPEG-DEC    JPEG image decoder for 512x512 image                                 8k       8k
PGP-ENC     Pretty Good Privacy encryption and digital signature of text message 8k       4k
PGP-DEC     Pretty Good Privacy decryption of encrypted message                  8k       4k
G721-ENC    G.721 voice encoder of 16-bit input audio samples                    4k       2k
G721-DEC    G.721 voice decoder of encoded bits                                  4k       2k
19
Outline
Research Motivation and Introduction
Related Work
Power Estimation Framework
Optimization I – Power-Aware Bus Arbitration
Optimization II – Memory Page Remapping
Summary
20
Optimization I – Bus Arbitration
Multiple bus-access masters in an SOC system
- Processor cores
- Data/instruction caches
- DMA
- ASIC modules
Multimedia applications
- High bus bandwidth demands
- Large memory footprints
An efficient arbitration algorithm can:
- Increase power awareness
- Increase bus throughput
- Reduce bus power
21
Bus Arbitration Target Region
[Figure: the same SOC block diagram as slide 13 (media processor core with data/instruction caches, system DMA controller, NTSC/PAL encoder, streaming interface, S-Video/CVBS, NIC on the internal bus), with the EBIU (arbitration enabled) highlighted as the target region between the internal bus and the external bus to SDRAM, FLASH memory, and asynchronous devices.]
22
Bus Arbitration Schemes
EBIU with arbitration enabled
- Handles core-to-memory and core-to-peripheral communication
- Resolves bus access contention
- Schedules bus access requests
Traditional algorithms
- First Come First Serve (FCFS)
- Fixed Priority
Power-aware algorithms (categorized by power metric / cost function)
- Minimum power (P1D0), or (1, 0)
- Minimum delay (P0D1), or (0, 1)
- Minimum power-delay product (P1D1), or (1, 1)
- Minimum power-delay-square product (P1D2), or (1, 2)
- More generically, (PnDm) or (n, m)
23
Bus Arbitration Schemes (Continued)
Power-aware arbitration
- From the currently pending requests in the waiting queue, find a permutation of the external bus requests that achieves the minimum total power and/or performance cost.
- Reducible to the minimum Hamiltonian path problem in a graph G(V,E).
Vertex = request R(t, s, b, l)
- t - request arrival time; s - starting address; b - block size; l - read/write
Edge = transition from request i to request j
- Edge weight w(i, j) is the cost of the transition.
24
Minimum Hamiltonian Path Problem
[Figure: complete directed graph over R0, R1, R2, R3 with edge weights w(i,j)]
R0 - the last request on the bus; it must be the starting point of any path.
R1, R2, R3 - requests in the queue.
w(i,j) = P(i,j)^n · D(i,j)^m
- P(i,j) - power of Rj issued after Ri
- D(i,j) - delay of Rj issued after Ri
Example Hamiltonian path: R0 -> R3 -> R1 -> R2
Minimum path weight = w(0,3) + w(3,1) + w(1,2)
An NP-complete problem.
25
Greedy Solution
[Figure: the same request graph, with the greedy edge selection highlighted]
Greedy algorithm (local minimum): only the next request in the path is needed:
min{ w(0,j) | w(i,j) is the edge weight of graph G(V,E) }
In each iteration of arbitration:
1. A new graph G(V,E) is constructed.
2. The greedy-minimum request is granted the bus.
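The greedy selection step above can be sketched in a few lines. The cost function w(i,j) = P(i,j)^n · D(i,j)^m is assumed to come from a power/delay estimator (here a plain callable); names are illustrative, not the hardware implementation:

```python
def greedy_arbitrate(last_request, pending, cost, n=1, m=0):
    """Grant the pending request j minimizing P(last,j)^n * D(last,j)^m.

    cost(i, j) must return a (power, delay) pair for issuing j after i.
    (n, m) = (1, 0) is the minimum-power scheme; (0, 1) is minimum delay.
    """
    def w(j):
        p, d = cost(last_request, j)
        return (p ** n) * (d ** m)
    return min(pending, key=w)
```

For example, with a toy cost table where request "b" costs (2 W, 5 cycles) after "a" and "c" costs (4 W, 1 cycle), the (1, 0) arbiter grants "b" while the (0, 1) arbiter grants "c".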
26
Experimental Setup
Utilized the embedded power modeling framework.
Implemented eleven different arbitration schemes inside the EBIU:
- FCFS, Fixed Priority
- Minimum power (1, 0), minimum delay (0, 1), and (1, 1), (1, 2), (2, 1), (1, 3), (3, 1), (3, 2), (2, 3)
10 multimedia application benchmarks were ported to the Blackfin architecture and simulated, including MPEG-2, H.264, JPEG, PGP, and G.721.
27
Power Improvement
Power-aware arbitration schemes consume less power than Fixed Priority and FCFS.
The difference across power-aware arbitration strategies is small.
The pipelined command model saves 6-7% power over the sequential command model for the MPEG-2 encoder and decoder. The results are consistent across all other benchmarks.

[Figure: average external bus power (mW) for the MPEG-2 encoder and decoder under each arbitration algorithm (FP, FCFS, (0,1), (1,0), (1,1), (1,2), (2,1), (1,3), (2,3), (3,2), (3,1)), sequential vs. pipelined command mode]
28
Speed Improvement
Power-aware schemes have smaller bus delay than traditional Fixed Priority and FCFS.
The difference across power-aware arbitration strategies is small.
The pipelined command model achieves a 3-9% speedup over the sequential command model for the MPEG-2 encoder and decoder. The results are consistent across all other benchmarks.

[Figure: average external bus delay (SCLK cycles) for the MPEG-2 encoder and decoder under each arbitration algorithm, sequential vs. pipelined command mode]
29
Comparison with Exhaustive Algorithm
The greedy algorithm can fail in certain cases, but its complexity is O(n) vs. O(n!) for exhaustive search, and the performance difference is negligible.

[Figure: example request graph over R0-R3 where exhaustive search finds a cheaper total path than the locally-minimal greedy search]
30
Comments on Experimental Results
Power-aware arbiters significantly reduce external bus power for all 8 benchmarks: on average, a 14% power saving.
Power-aware arbiters reduce bus access delay: delays drop by 21% on average across the 8 benchmarks.
The pipelined SDRAM model has a large performance advantage over the sequential SDRAM model, achieving 6% power savings and a 12% speedup.
Power and delay on the external bus are highly correlated: minimum power also achieves minimum delay.
Minimum-power schemes lead to simpler design options; scheme (1, 0) is preferred due to its simplicity.
31
Design of A Power Estimation Unit (PEU)
[Figure: PEU datapath]
The PEU keeps the last bank address, per-bank open-row address registers (banks 0-3), and the last column address. The next request's address is split into bank, row, and column fields:
- If the bank address differs from the last bank address, output bank-miss power.
- If the row address differs from that bank's open-row register, output page-miss penalty power and update the register.
- Column address data power is computed from the Hamming distance to the last column address, which is then updated.
The PEU outputs the estimated power of the candidate request.
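The PEU's behavior can be sketched as a small state machine. This is a behavioral sketch only; the class name, the cost constants, and the simplification to row-compare plus Hamming distance are illustrative assumptions, not the hardware's actual values:

```python
class PEU:
    """Behavioral sketch of the Power Estimation Unit."""

    def __init__(self, banks=4, miss_cost=10.0, bit_cost=1.0):
        self.open_row = [None] * banks  # per-bank open-row registers
        self.last_col = 0               # last column address register
        self.miss_cost = miss_cost      # page-miss penalty power (illustrative)
        self.bit_cost = bit_cost        # per-toggled-line power (illustrative)

    def estimate(self, bank, row, col):
        power = 0.0
        if self.open_row[bank] != row:       # row differs: page miss
            power += self.miss_cost
            self.open_row[bank] = row        # update open-row register
        # column-address switching power via Hamming distance
        power += bin(self.last_col ^ col).count("1") * self.bit_cost
        self.last_col = col
        return power
```

A repeated access to the same open row and column then estimates zero dynamic cost, which is exactly what the arbiter exploits when it favors page-hit requests.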
32
Two Arbitrator Implementation Structures
[Figure: shared-PEU structure vs. dedicated-PEU structure]
Both structures hold pending requests (t, s, b, l) in a request queue buffer, estimate each candidate's power against memory/bus state information, and use a comparator to select the minimum-power request for the access command generator driving the external bus, updating the state afterward.
- Shared PEU structure: a single PEU evaluates the queued requests in turn.
- Dedicated PEU structure: one PEU per queue entry evaluates the requests in parallel.
33
Performance of the Two Structures

Higher PEU delay lowers external bus performance for both the MPEG-2 encoder and decoder.
When the PEU delay is 5 cycles or more, the dedicated structure is preferred over the shared structure; otherwise the shared structure is sufficient.

[Figure: average delay (cycles) vs. estimator logic delay (0-10 cycles) for the shared and dedicated PEU structures, MPEG-2 encoder and decoder with the (1,0) arbitrator]
34
Summary of Bus Arbitration Schemes
Efficient bus arbitration provides benefits in both power and performance over traditional arbitration schemes.
Minimum power and minimum delay are highly correlated in external bus performance.
The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.
Arbitration scheme (1, 0) is recommended: the minimum-power approach provides more design options and leads to simpler design implementations.
The trade-off between design complexity and performance was presented.
35
Outline
Research Motivation and Introduction
Related Work
Power Estimation Framework
Optimization I – Power-Aware Bus Arbitration
Optimization II – Memory Page Remapping
Summary
36
Data Access Pattern in Multimedia Apps

[Figure: three address-vs-time plots showing the common patterns: fixed stride, 2-way stream, and 2-D stride]
Three common data access patterns in multimedia applications:
- The majority of cycles are spent in loop bodies and array accesses
- High data access bandwidth
- Poor locality, cross-page references
37
Previous work on Access Pattern
Previous work was performance-driven and used OS/compiler-related approaches:
- Data prefetching [Chen94] [Zhang00]
- Memory customization [Adve00] [Grun01]
- Data layout optimization [Catthoor98] [DeLaLuz04]
Shortcomings of OS/compiler-based strategies:
- A multimedia benchmark's dominant activities occur within large monolithic data buffers.
- Buffers generally span many memory pages and cannot be further optimized.
- Constrained by OS and compiler capabilities; poor flexibility.
38
Optimization II - Page Remapping
A technique currently used for large-memory-space peripheral memory access.
External memories in embedded multimedia systems suffer
- High bus access overhead
- Page miss penalties
Efficient page remapping can
- Reduce page misses
- Improve external bus throughput
- Reduce power / energy consumption
39
Page Remapping Target Region
[Figure: the same SOC block diagram (media processor core with data/instruction caches, system DMA controller, NTSC/PAL encoder, streaming interface, S-Video/CVBS, NIC on the internal bus), with the External Bus Interface Unit (EBIU) and the external memories (SDRAM, FLASH memory, asynchronous devices) highlighted as the page remapping target region.]
40
SDRAM Memory Pages
[Figure: M banks of N pages each (bank 0 .. bank M-1, page 0 .. page N-1), with accessed pages marked]
- High memory access latency; minimum latency of one SCLK cycle
- Page miss penalty
- Additional latency due to refresh cycles
- No guaranteed access latency due to the arbitration logic
- Non-sequential reads/writes suffer the most
41
SDRAM Page Miss Penalty

[Figure: command/data timing comparison in system clock (SCLK) cycles. On a page miss, the command stream is P A R R R R P A R R: each new page inserts PRECHARGE and ACTIVATE (t_RP, t_RCD) before the CAS latency (t_CAS) and the data burst. On a page hit, reads stream back-to-back and data words (D) follow continuously.
P - PRECHARGE, A - ACTIVATE, N - NOP, R - READ, D - DATA]
42
Access type
Number of cycles
Read cycle trp +n*(tcas)
Write cycle twp
Page miss trp + trcd
Refresh cycle
2*(trcd) * nrows
SDRAM parameter
Sclk cycles
trcd1-15
trp1-7
trcd = tras + trp 1-15
tcas2-3
twp = write to prechargetrp = read to prechargetras = activate to prechargetcas = read latency
~8-10 sclk penalty associated with a page miss
SDRAM Timing Parameters
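The table above translates directly into a toy latency model. A minimal sketch; the specific parameter values are illustrative picks within the listed SCLK ranges, not a particular device's datasheet values:

```python
# Illustrative timing parameters, in SCLK cycles.
TRP, TRCD, TCAS = 3, 5, 2

def read_latency(page_hit, burst_len):
    """Read latency in SCLK cycles: a page miss adds PRECHARGE (tRP)
    and ACTIVATE (tRCD) before the CAS latency and the data burst."""
    penalty = 0 if page_hit else TRP + TRCD
    return penalty + TCAS + burst_len
```

With these values a missed 4-beat read takes 14 cycles vs. 6 for a hit; the 8-cycle gap is the tRP + tRCD page-miss penalty quoted above.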
43
SDRAM Page Access Sequence (I)

[Figure: 12 reads across 4 banks; with a poor data layout, each bank repeatedly re-opens pages (P A R command triples) between short read bursts. P - Precharge, A - Activation, R - Read]
A typical access pattern for 2-D stride / 2-way stream workloads: the poor data layout causes significant access overhead.
44
SDRAM Page Access Sequence (II)

[Figure: the same 12 reads across 4 banks; with a distributed data layout, each bank is opened once (P A R) and the reads then proceed back-to-back. P - Precharge, A - Activation, R - Read]
Less access overhead with a distributed data layout.
45
Why We Use Page Remapping

[Figure: the data of page 2 is spread across banks 0-3; remapping reorders the bank assignment so that consecutive accesses land in different banks]
Page remapping entry of page 2: {2, 0, 1, 3}
46
Module in an SOC System
[Figure: the page remapping module sits between the internal bus and the EBIU, which drives the external bus to SDRAM, FLASH memory, and asynchronous devices]
- The address translation unit only translates the bank address.
- A non-MMU system inserts a page remapping module before the EBIU.
- An MMU system can take advantage of the existing address translation unit; no extra hardware is needed.
47
Sequence (I) after Remapping

[Figure: the 12 reads of sequence (I) after remapping; the accesses now alternate across the 4 banks and each bank is opened only once. P - Precharge, A - Activation, R - Read]
Same performance as sequence (II). Applicable to monolithic data buffers (e.g., frame buffers).
48
Page Remapping Algorithm
An NP-complete problem, reducible to a graph coloring problem on a page transition graph G(V,E).
Vertex = page I(m,n)
- m - page bank number; n - page row number
Edge = transition from page I(m,n) to page I(p,q)
- Weighted edges capture the page traversal during program execution
- The edge weight is the number of transitions from page I(m,n) to page I(p,q)
Color = bank
- Each bank has one distinct color.
- Every page is assigned one color.
49
Page Remapping Algorithm (continued)
Page remapping algorithm: from the page transition graph, find a color (bank) assignment for each page such that the transition cost between same-color pages is minimized.
Algorithm steps:
- Sort the edges by transition weight
- Process the edges in decreasing weight order
- Color the pages associated with each edge
- A weight parameter array for each page represents the cost of mapping that page into each bank, e.g. {500, 200, 0, 0}
- 5 different situations arise when processing an edge
- The page remapping table (PMT) is generated as a result of the mapping.
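The core of these steps can be sketched in a few lines. This is a simplified sketch: conflict handling is reduced to "charge the neighbor's bank and pick the cheapest bank", whereas the full algorithm distinguishes 5 edge-processing situations (and so may make different assignments than the worked example that follows):

```python
def remap(edges, n_banks=4):
    """edges: list of (page_a, page_b, weight); pages are hashable ids.
    Returns {page: bank}, greedily keeping heavy-edge neighbors in
    different banks."""
    cost = {}   # page -> per-bank cost array (the weight parameter array)
    bank = {}   # page -> assigned bank (color)
    for a, b, w in sorted(edges, key=lambda e: -e[2]):
        for p in (a, b):
            cost.setdefault(p, [0] * n_banks)
        for p, q in ((a, b), (b, a)):
            if p not in bank:
                # cheapest bank given the conflict costs charged so far
                bank[p] = min(range(n_banks), key=lambda k: cost[p][k])
            # mapping q into p's bank would now cost w more
            cost[q][bank[p]] += w
    return bank
```

Run on the example edge list from the next slides, the two heaviest pairs (I0,0-I0,1 at weight 500 and I1,1-I1,2 at weight 200) always end up in different banks.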
50
Example Case

[Figure: the original page allocation across banks 0-3 (pages 0-3), and the page transition graph over pages I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, I3,1]
Edge weights: (I0,0-I0,1): 500, (I1,1-I1,2): 200, (I0,0-I3,1): 100, (I1,2-I2,1): 80, (I3,1-I1,3): 60, (I1,1-I3,1): 50, (I2,1-I1,3): 40, (I0,0-I1,1): 30
51
Initial Step

[Figure: empty bank/page grid - banks 0-3, pages 0-3]
No page is mapped; all slots are available.
52
Step (1) - two unmapped pages

Selected edge: I0,0 - I0,1, weight 500
Weight parameter updates:
I0,0[0]: { 0, 500, 0, 0}
I0,1[1]: { 500, 0, 0, 0}
Actions: allocate the unmapped pages I0,0 and I0,1.
[Figure: grid with I0,0 in bank 0 and I0,1 in bank 1]
53
Step (2) - two unmapped pages

Selected edge: I1,1 - I1,2, weight 200
Weight parameter updates:
I1,1[0]: { 0, 200, 0, 0}
I1,2[1]: { 200, 0, 0, 0}
Actions: allocate the unmapped pages I1,1 and I1,2.
[Figure: grid now holds I0,0, I0,1, I1,1, I1,2]
54
Step (3) - one unmapped page

Selected edge: I0,0 - I3,1, weight 100
Weight parameter updates:
I3,1[2]: { 100, 0, 0, 0}
I0,0[0]: { 0, 500, 100, 0}
Actions: map page I3,1; no change for I0,0.
[Figure: grid now also holds I3,1]
55
Step (4) - one unmapped page

Selected edge: I1,2 - I2,1, weight 80
Weight parameter updates:
I2,1[3]: { 0, 80, 0, 0}
I1,2[1]: { 200, 0, 0, 80}
Actions: map page I2,1; no change for I1,2.
[Figure: grid now also holds I2,1]
56
Step (5) - one unmapped page

Selected edge: I3,1 - I1,3, weight 60
Weight parameter updates:
I1,3[0]: { 0, 0, 60, 0}
I3,1[2]: { 160, 0, 0, 0}
Actions: map page I1,3; no change for I3,1.
[Figure: grid now also holds I1,3]
57
Step (6) - same row pages

Selected edge: I1,1 - I3,1, weight 50
Actions: both I1,1 and I3,1 are on the same row; no actions.
[Figure: grid unchanged]
58
Step (7) - two mapped pages

Selected edge: I2,1 - I1,3, weight 40
Weight parameter updates:
I1,3[0]: { 0, 0, 60, 40}
I2,1[3]: { 40, 80, 0, 0}
Actions: both I2,1 and I1,3 are already mapped; no conflicts.
[Figure: grid unchanged]
59
Step (8) - conflict resolving

Selected edge: I0,0 - I1,1, weight 30
Actions: both I0,0 and I1,1 are mapped and in the same bank.
Current weight parameters:
I2,1[3]: { 40, 80, 0, 0}
I3,1[2]: { 160, 0, 0, 0}
I1,1[0]: { 30, 200, 0, 0}
I0,1[1]: { 500, 0, 0, 0}
Updated weight parameters:
I0,0[0]: { 0, 500, 100, 30}
No conflict.
[Figure: grids before and after conflict resolution, with all seven pages (I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, I3,1) mapped across the four banks]
60
Generated PMT Table

[Figure: the external memory address from the I-cache/D-cache is split into a 14-bit memory page address and the rest; the page address indexes the 4kB page remapping table (PMT), which outputs a 2-bit bank address that is combined with the 22-bit row/column address and sent through the EBIU to the 16MB external SDRAM. PMT entries map the example pages I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, I3,1 to their assigned 2-bit bank addresses; unused entries remain xx.]
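The PMT lookup itself is a simple bank-field substitution. A minimal sketch; the field widths for page index and bank follow the slide (14 and 2 bits), but the page-offset width and the exact address layout are assumptions for illustration:

```python
PAGE_BITS, BANK_BITS = 14, 2

def translate(addr, pmt):
    """Replace the bank field of an address using the PMT.

    Assumed layout: [page index (14)] [bank (2)] [offset (8)].
    pmt: list of 2-bit bank values indexed by page number.
    """
    offset_bits = 8                                  # assumed offset width
    page = addr >> (BANK_BITS + offset_bits)         # PMT index
    offset = addr & ((1 << offset_bits) - 1)         # untouched low bits
    # The original bank bits are discarded and replaced by the PMT entry.
    return (page << (BANK_BITS + offset_bits)) | (pmt[page] << offset_bits) | offset
```

Because only the 2-bit bank field changes, the translation is a single table lookup per access, which is why an existing MMU can absorb it at no extra hardware cost.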
61
Experimental Setup
Utilized the embedded power modeling framework.
Extended the address translation unit for page remapping.
A page coloring program generates the PMT.
Same 10 multimedia application benchmarks:
- MPEG-2 encoder and decoder
- H.264 encoder and decoder
- JPEG encoder and decoder
- PGP encoder and decoder
- G.721 encoder and decoder
62
Page Miss Reduction
[Figure: page misses per 100 requests for each benchmark (MPEG2-ENC/DEC, H264-ENC/DEC, JPEG-ENC/DEC, PGP-ENC/DEC, G721-ENC/DEC), comparing 2-, 4-, and 8-bank original layouts against their remapped counterparts]
63
External Bus Power
[Figure: external bus power (mW) for each benchmark, 2-, 4-, and 8-bank original layouts vs. their remapped counterparts]
64
Average Access Delay
[Figure: average request delay (cycles) for each benchmark, 2-, 4-, and 8-bank original layouts vs. their remapped counterparts]
65
Comments on Page Remapping

The page remapping algorithm was presented by example.
Our algorithm significantly reduces the memory page miss rate, by 70-80% on average.
For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.
The proposed algorithm reduces power consumption in the majority of the benchmarks, by 13.2% on average.
Combining the power and delay effects, our algorithm significantly reduces the total energy cost.
A stability study was done in the dissertation: a PMT generated from one test input vector performs well on different inputs.
66
Outline
Research Motivation and Introduction
Related Work
Power Estimation Framework
Optimization I – Power-Aware Bus Arbitration
Optimization II – Memory Page Remapping
Summary
67
Summary
Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.
Built an external bus power estimation framework and experimental methodology. [PACS'04]
Proposed a series of power-aware bus arbitration schemes and showed their performance improvement over traditional schemes. [HiPEAC'05; also appeared in LNCS Transactions on High-Performance Embedded Architectures and Compilers]
Proposed a page remapping algorithm to reduce page misses, and showed its power and delay improvements. [LCTES'07]
68
Future Work
Integrate the power estimation framework into a complete tool chain.
Extend the arbitration schemes to multiple memory interfaces and other peripheral interfaces.
Compare the performance of page remapping with corresponding OS/compiler schemes.
69
Thank You!