Profiles in Power: Optimizing Real-Time Systems for Power as well as Speed (IPS), Response Latency and Cost
Graham Hellestrand
Mahdi Seddighnazhad
James Brogan
VaST Systems Technology Corporation
CONFIDENTIAL2
Key Focus: Low Cost, Power Reduction and Increased Features
Competitive positions must be maintained
Product complexity is increasing
• Hardware growth
• Software growth
Critical program schedules
Market windows must be hit
Revenue opportunities must be captured
Burden has moved to design and development
Wireless Trends
The Metric: Power
Reducing power regardless of the effect on other optimization factors is of limited value.
Example: Saving 50% power while yielding:
• 50% speed hit and/or
• Failure to meet response latency specifications
is likely to be unacceptable in the marketplace
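One way to see why a raw power saving can be of limited value: for a fixed workload, energy is average power multiplied by run time. The sketch below uses hypothetical numbers (a 1 W baseline and a 10 s run time, neither taken from the slides) and assumes that a 50% speed hit doubles the run time.

```python
def energy_joules(avg_power_w: float, runtime_s: float) -> float:
    """Energy for a fixed workload: average power times run time."""
    return avg_power_w * runtime_s


# Hypothetical baseline: 1 W for 10 s (illustrative values only).
baseline = energy_joules(1.0, 10.0)

# 50% power saving combined with a 50% speed hit (run time doubles).
slowed = energy_joules(0.5, 20.0)

print(f"baseline: {baseline} J, half power at half speed: {slowed} J")
```

Under these assumptions the energy used per task is unchanged, while the task takes twice as long and may miss its response deadlines.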
Implications
Real-time software architecture and development need to be subject to rigorous optimization of an appropriate objective function, based on:
• Power
• Speed
• Event response latencies (examples: interrupts, exceptions)
• Cost, approximated by:
  – Cache sizes
  – Memory sizes and hierarchies
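A minimal sketch of what such an objective function could look like, assuming a simple weighted sum with the response latency treated as a hard constraint. The class, field names, weights and candidate values are all hypothetical and are not taken from the slides; the slides do not specify the form of the function.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate design point (all fields are illustrative)."""
    name: str
    power_mw: float          # average power
    speed_mips: float        # delivered speed
    worst_latency_us: float  # worst-case event response latency
    cost_kb: float           # cache + memory size, used as a cost proxy


def objective(c: Candidate, latency_limit_us: float,
              w_power: float = 1.0, w_speed: float = 1.0,
              w_cost: float = 0.1) -> float:
    """Lower is better; a missed latency specification disqualifies."""
    if c.worst_latency_us > latency_limit_us:
        return math.inf
    return w_power * c.power_mw - w_speed * c.speed_mips + w_cost * c.cost_kb


candidates = [
    Candidate("small-cache", 120.0, 90.0, 40.0, 8.0),
    Candidate("large-cache", 150.0, 140.0, 25.0, 32.0),
    Candidate("low-power", 60.0, 45.0, 80.0, 8.0),
]

best = min(candidates, key=lambda c: objective(c, latency_limit_us=50.0))
print(best.name)
```

With these made-up numbers the lowest-power candidate is rejected because it misses the latency limit, which mirrors the point that power cannot be optimized in isolation.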
System Architecture & Optimization
Software Architecture
Platform Architecture
Real-world interaction architecture
Processor µ-architecture
+ Empirical experimentation
Architecture Addresses the Whole System
[Figure: diagram of the whole system. Software labels: Device Drivers, Operating Systems, Middleware, Comms, Applications. Hardware labels: RF, Mechanical, Physical; Devices, Structures, Sub-systems; Systems Architecture; Buses & Bridges; VPMs & Peripheral Devices. Abstraction labels: Physical, RTL, Behav. Platform. Virtual Prototype: Evaluation, Exploration.]
Optimization effect: Software Architecture & Design
1st Order Effect on system performance
Software Architecture & Design
[Figure: software flow (Create, Compile, Assemble, Link, Load, Debug + Monitor) in the SW IDE, targeting the VaST VSP or HW. Inputs: UML, Simulink, C, C++, … Labels: Hardware, Software, Architecture, VSP.]
Monitor prototype internals
• Cache hits/misses
• Bus transactions
• Processor performance
• Memory usage
• Interrupt latency
Trigger hardware and software debuggers
Example usage: analyze processor and platform power
Make intelligent tradeoffs between power, performance and cost
Optimization effect: Platform Architecture & Design
1st Order Effect on system
Typical 3G Cell Phone Controller
3 processors, 12 buses, 10 bus bridges, 70 peripherals
[Figure: VaST Virtual System Prototype (model). Virtual Processor Models for ARM1176 (P1), ARM1156 (P2) and StarCore SC1400, each with I Cache, D Cache and StdBus I/F blocks; P1 Memory, P2 Memory and Shared Memory blocks; Arb. Ctrl DRAM; D ROM and P ROM; StdBus bridges; AHB buses; P1 Devices and P2 Devices (UART, TIMER, INTC) with Console 1 and Console 2.]
Optimization effect: Real-world Interaction Architecture
1st Order Effect on system
Automotive Power-train Control
Igniting fuel under pressure at the wrong part of the cylinder stroke results in spectacular destruction of the engine (and maybe the experimenter)
Real-time Engine Monitoring
Engine control unit
Optimization of: Processor µ-architecture
2nd / 3rd Order Effect (apart from caches & buffering)
Generic Single Pipeline Operation
[Figure: pipeline stage (including Exec, Memory, Write) against time (ticks) for two dependent instructions. ADD: Read R1; Read R2, R3, (R2 LSL R3); R1 + (R2 LSL R3); Write R0. SUB: Cannot read R0 - stall; Read R0 (R0 bypass); Read R6, (R6 LSL R0); R5 + (R6 LSL R0); Write R4.]
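The stall and bypass shown for the ADD/SUB pair can be sketched with a small model. This is only an illustration under stated assumptions: a hypothetical four-stage pipeline (Read, Exec, Memory, Write) with one stage per tick, a result that is readable 4 ticks after operand read when it must pass through Write, and 2 ticks when it is bypassed from the end of Exec. These stage timings are not given on the slide.

```python
def issue_cycles(program, result_delay):
    """Return the tick in which each instruction reads its operands.

    program: list of (dest, sources) tuples in program order.
    result_delay: ticks after an instruction's operand-read tick
    before its result can be read by a later instruction (assumed).
    """
    ready = {}      # register -> first tick its new value is readable
    cycles = []
    next_free = 0   # earliest tick the read stage is free
    for dest, sources in program:
        start = max([next_free] + [ready.get(r, 0) for r in sources])
        cycles.append(start)
        ready[dest] = start + result_delay
        next_free = start + 1
    return cycles


program = [
    ("R0", ["R1", "R2", "R3"]),  # ADD: R1 + (R2 LSL R3), writes R0
    ("R4", ["R5", "R6", "R0"]),  # SUB: R5 + (R6 LSL R0), writes R4
]

no_bypass = issue_cycles(program, result_delay=4)    # wait for Write
with_bypass = issue_cycles(program, result_delay=2)  # forwarded after Exec

stalls_no_bypass = no_bypass[1] - (no_bypass[0] + 1)
stalls_with_bypass = with_bypass[1] - (with_bypass[0] + 1)
print(stalls_no_bypass, stalls_with_bypass)
```

In this model the bypass path shortens the SUB stall but does not remove it, because the shifted operand R0 is needed at operand-read time.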
Pre-Silicon System Design Process
System Development Process
[Figure: CoMET System Level Design Tool flow. Business Requirements and Functional Requirements lead to an Executable System Specification and an Executable System Architecture (VSP). Architecture + Concurrent, Iterative S/W – H/W Development + Integrated & Optimized Final Product. Software and Hardware tracks each run Translate, Architect and Test, Design and Test, Develop and Test. Tool labels: CoMET, METeor, Virtual System Platform. Vertical labels: Integrate & Co-Verify VSP; Integrate & Co-Verify Silicon Hardware Platform + Embedded System Software.]
Electronic System Design Process
System architecture: Virtual Prototype (timing accurate)
+ Software || Hardware design: Virtual System Prototypes (high speed)
Develop behavioral-level executable specification and verify RTL
Design, develop and debug software before silicon or hardware prototypes are available
Evaluate architectures of candidate designs using real software applications
[Figure labels: Hardware development, Software development, Virtual Prototype, Architecture]
So what performance can we get from a timing-accurate VSP on a single-processor host?
That is, how useful are these things?
VSP Computation Performance: Multiple Independent Platforms
[Figure: replicated, independent platforms, each with an ARM926E VPM (INST, DATA, CONFIG & CONTROL), two GP MEM blocks, GP INTC ARM, GP TIMER, GP UART, GP CONSOLE and three bridges.]
Results - Computational Performance Study
[Figure: 6x Single Processor, Virtual System Prototypes - Cached. X axis: Number of Processors (1 to 6). Y axis: Effective MIPS (0 to 60). Series: VPM MIPS Performance, Simulation Overhead, Hardware Simulation.]
Platform-dominated study: As Virtual System Prototypes (VSPs), whose processors have software and data resident in cache, are switched into the simulation (pink line), the sharing of host cycles between the processor and the hardware (purple line) of each VSP stays in proportion for each additional VSP activated. The frequent switching between VSPs, each having a processor and hardware that also share the host cycles, increases the simulation overhead (blue line).
VSP with TLM Bus Matrix
[Figure: two SC1200 processors (each with DMA Master, Core Master and Slave ports) and four DMA Traffic Generators connected through bridges (connections labeled 32) and AHB buses to a 2MB DRAM and six 512KB memory banks (Mem Bank 0 to Mem Bank 5), with an OCP Channel Wrapper. Annotations: Application software (Viterbi), on INT will shuffle data from DRAM to MemBanks; Application software (Vocoder), on INT will shuffle data from DRAM to MemBanks; approx. 60% utilization; AHB-like transactions every 300-500 cycles.]
Results – Bus Matrix Performance
[Figure: Communications vs Computation Loading (Double Core). X axis: Transaction Size (bytes): 1024, 64, 4. Depth axis: Number of Processors. Y axis: Count (Trans/DSec | MIPS), 0.00 to 150.00. Series: Trans. (0 VPM), Headroom MIPS; Trans. (1 VPM), Headroom MIPS; Trans. (2 VPM), Headroom MIPS.]
Communications and computation sharing study: This is a multi-variable study measuring the simulation performance of a system in which transactions of various sizes (1024, 64 and 4 bytes) are transmitted at a high rate over a complex switch to which two SC1200 processors are attached. Initially no processors are activated; each is then activated in turn. The bar chart is best read as a sequence of 3 pairs (Transaction / Headroom (MIPS)) going into the slide. As transactions become progressively smaller, there is relatively more work for the model to perform to transmit and receive them. The Headroom measure is the amount of host cycles available for further simulation. As more processors are activated and the transaction size is reduced, the available headroom diminishes.
Study 4: VSP Interrupt Handling
Automotive Benchmark, Feb 2004
Capability of a VSP under interrupt loads: This is a relatively simple experiment that shows the performance of a single-processor Virtual System Prototype under increasingly stressful rates of processing asynchronous events (interrupts). Even at high interrupt rates (one interrupt every 3,750 cycles is equivalent to a 12-cylinder engine running at 20,000 RPM and producing an interrupt every 10 degrees of crank angle), the VPM is capable of simulating high software execution rates (4 MIPS) while handling the interrupts.
[Figure: VPM Performance under High Interrupt Load. X axis: Cycles between Interrupts (3750, 50000, 100000). Y axis: Event Count (0 to 5000). Series: Cycles per Interrupt; VPM Performance (MIPS/10000).]
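The interrupt rate quoted for this study can be checked with a little arithmetic. The sketch below assumes a single crank-angle interrupt stream (one interrupt per 10 degrees of crankshaft rotation for the engine as a whole); the target clock frequency is not stated on the slide, so the value computed here is only what the quoted figures would imply under that assumption.

```python
def interrupts_per_second(rpm: float, degrees_per_interrupt: float) -> float:
    """Interrupt rate for one interrupt every N degrees of crank angle."""
    interrupts_per_rev = 360.0 / degrees_per_interrupt
    return rpm * interrupts_per_rev / 60.0


rate = interrupts_per_second(20_000, 10)

# Clock frequency implied by "one interrupt every 3,750 cycles"
# (an inference under the single-stream assumption, not a slide value).
implied_clock_hz = rate * 3_750

print(f"{rate:.0f} interrupts/s, implied clock {implied_clock_hz / 1e6:.0f} MHz")
```

Under this reading, 20,000 RPM with an interrupt every 10 degrees gives 12,000 interrupts per second, and 3,750 cycles between interrupts corresponds to a 45 MHz target clock.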
Back to Building Systems
Physical Prototype
Virtual Prototypes
[Figure: a 32-bit MPU connected over a virtual bus to RAM, ROM, Flash, Bus Interface, Interrupt Controller, DMA, Interrupt Timer, General I/O, A2D Convert, Clock Gen. and Serial Comms.]
It is all about optimization, stupid!
[Figure: optimization axes: Asynch-Signal Response Latency, Power Consumption, Speed. Other labels: Software, Specifications, Very Smart System Instantiator, Physical (Mechanical, RF, ..), H-type Respecifier.]
Typical 2.5G Wireless Systems built using a Virtual System Prototype
Virtual Prototyping: Mobile Handset Development
Full System Development
Architecture, Software, Hardware, I/F
[Figure: SG2 virtual prototype with I/Q Signals, Virtual COM Port, ARM Debugger and TeakLite Debugger.]
Early Design Feedback in Semiconductor Development Process
• Enabled 1st Pass Silicon Success
• Eliminated Costly 2nd Silicon
• Provided Complete Software Development Environment 9 Months Prior to Silicon
• Resulted in a Better Quality Product 5 Months Earlier Than Standard Development Process
Advanced Debugging
Multi-Core debugging
• ARM926 (ADS 1.2)
• TeakLite* (DSP group)
Complete system visibility
• S-GOLD programmer model
  – Bus status & Interrupt behavior
  – System cycle count, monitors
I/O Test Bench Support
Open Model Extension
Wireless VP Benefits
[Figure: SGOLD2 Architecture virtual prototype with Keypad Test Bench, LCD Display QCIF/CIF, Camera Test Bench, Win32 Terminal for all Serial IO, Virtual COM Ports, ARM Debugger and TeakLite Debugger. Caption: Linux OS Execution + MPEG4 Encoding Camera Input.]
Concurrent Bus Activity
Optimizing for Power and Performance
Separated Functions
General Form of Multi-Objective Optimization
Equation: Characterize an objective function in terms of events directly measurable from the VSP

\[
\begin{aligned}
F_{VSP} = \Big(\;
 & f_{CPU_{cc=0..cn}}\big( f_{CPU_{cc},\,CEvType=1..cet}\big( g_{CPU_{cc},\,CEvType,\,CEvCnt=scecn..tcecn}( Event_{CPU_{cc},\,CEvType,\,CEvCnt} ) \big)\big), \\
 & f_{Bus_{bc=0..bcn}}\big( f_{Bus_{bc},\,BEvType=1..bet}\big( g_{Bus_{bc},\,BEvType,\,BEvCnt=sbecn..tbecn}( Event_{Bus_{bc},\,BEvType,\,BEvCnt} ) \big)\big), \\
 & f_{BusBridge_{bbc=0..bbcn}}\big( f_{BBus_{bbc},\,BBEvType=1..bbet}\big( g_{BBus_{bbc},\,BBEvType,\,BBEvCnt=sbbecn..tbbecn}( Event_{BBus_{bbc},\,BBEvType,\,BBEvCnt} ) \big)\big), \\
 & f_{Mem_{mc=0..mcn}}\big( f_{Mem_{mc},\,MEvType=1..met}\big( g_{Mem_{mc},\,MEvType,\,MEvCnt=smecn..tmecn}( Event_{Mem_{mc},\,MEvType,\,MEvCnt} ) \big)\big), \\
 & f_{Dev_{dc=0..dcn}}\big( f_{Dev_{dc},\,DEvType=1..det}\big( g_{Dev_{dc},\,DEvType,\,DEvCnt=sdecn..tdecn}( Event_{Dev_{dc},\,DEvType,\,DEvCnt} ) \big)\big)
\;\Big)
\end{aligned}
\]

where

\[
f_{CPU_k,\,EvType=1..et}\big( g_{CPU_k,\,EvType}(\ldots) \big) = f_{CPU_k}\big( g_{CPU_k,1}(\ldots),\; g_{CPU_k,2}(\ldots),\; \ldots,\; g_{CPU_k,et}(\ldots) \big)
\]
Problem: Huge volume of data some of which may be highly correlated with other data – leading to multiple counting and unreliability in composite measures.
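One possible way to limit the multiple counting described above is to screen event counters for strong pairwise correlation before combining them. The sketch below is an illustration only, not a method taken from the slides: it uses Pearson correlation with a hypothetical 0.95 threshold, and the per-interval sample values are made up (the counter names are borrowed from Table 2).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(vx * vy)


def drop_correlated(counters, threshold=0.95):
    """Keep a counter only if it is not strongly correlated with one already kept."""
    kept = []
    for name, series in counters.items():
        if all(abs(pearson(series, counters[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical event counts per sampling interval.
counters = {
    "icache_miss": [10, 20, 30, 40, 50],
    "membus_transaction": [21, 41, 61, 81, 101],  # tracks icache_miss
    "tlb_miss": [3, 1, 4, 1, 5],
}

kept = drop_correlated(counters)
print(kept)
```

Here the memory bus counter moves in lockstep with the instruction cache misses, so only one of the two is retained for the composite measure.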
A Simple Power Function for a Full Platform
Equation 2:

\[
f_{Power} = W_{Pipe} f_{Pipe} + W_{Instr} f_{Instr} + W_{Cache} f_{Cache} + W_{TLB} f_{TLB} + W_{RegAcc} f_{RegAcc} + W_{MemAcc} f_{MemAcc} + W_{PeriphAcc} f_{PeriphAcc}
\]

where $f_{Instr}$ combines the per-type terms $f_{Instr,jmp}$, $f_{Instr,except}$, $f_{Instr,ctrl}$, $f_{Instr,coproc}$, $f_{Instr,LdSt}$, $f_{Instr,arith}$ and $f_{Instr,other}$, each scaled by a numeric coefficient, and

\[
f_{Instr,i} = (\#\ \text{instructions of type } i \text{ in } k \text{ cycles})
\]
Resolving the Weights for the Power Function
Table 2: Power: Function Types, Events & Weighting Functions

Function Type                            Event                  Weight Function
Pipeline                                 ibase                  6.0
Instruction Types                        ijmp                   2.0
                                         iexcept                2.0
                                         icoproc                12.0
                                         iarith                 1.0
Caches (I&D)                             cache_lookup           fi-dcache(size, ways)
                                         icache_hit             icache_lookup + ficache(line size, decode)
                                         icache_miss            icache_lookup
                                         dcache_hit             dcache_lookup + fdcache(size, ways, line size)
                                         dcache_miss            dcache_lookup
TLB                                      tlb_miss               30.0
Register                                 regfile_access         1.0
Memory (incl. bus transactions)          membus_transaction     50.0
Periph Device (incl. bus transactions)   periphbus_reg_access   50.0
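A minimal sketch of how the constant weights in Table 2 might be applied, assuming the power measure is a weighted sum of event counts and that each event is counted once per occurrence. The event counts are hypothetical, the result is in arbitrary units, and the cache events are left out because their weights are functions of cache geometry that the table does not spell out.

```python
# Constant weights copied from Table 2 (cache events omitted, see above).
WEIGHTS = {
    "ibase": 6.0,
    "ijmp": 2.0,
    "iexcept": 2.0,
    "icoproc": 12.0,
    "iarith": 1.0,
    "tlb_miss": 30.0,
    "regfile_access": 1.0,
    "membus_transaction": 50.0,
    "periphbus_reg_access": 50.0,
}


def power_estimate(event_counts):
    """Weighted sum of event counts (arbitrary units)."""
    unknown = set(event_counts) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"no weight for events: {sorted(unknown)}")
    return sum(WEIGHTS[event] * count for event, count in event_counts.items())


# Hypothetical counts observed over some window of k cycles.
counts = {
    "ibase": 1000,
    "iarith": 600,
    "ijmp": 100,
    "tlb_miss": 2,
    "membus_transaction": 10,
}

estimate = power_estimate(counts)
print(estimate)
```

Even in this toy window, ten memory bus transactions contribute almost as much as six hundred arithmetic instructions, which is why the memory hierarchy dominates the analyses that follow.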
Single Task Working Set vs Cache Size Analysis
[Graph 1A: VPM Speed - Viterbi on ARM926E Subsystem of VSP in Figure 1. X axis: Cache Size (Bytes), 0 to 40,000. Y axis: Instructions / 10-Cycles, 0.00 to 8.00. Series: Cache Line = 16 bytes; Cache Line = 32 bytes.]
[Graph 1B: Power Consumption - Viterbi on ARM926E Subsystem of VSP in Figure 1. X axis: Cache Size (Bytes), 0 to 40,000. Y axis: Ave. Power * 10^7 / # Instruction, 5.00 to 17.00. Series: Cache Line = 16 bytes; Cache Line = 32 bytes.]
Linux Boot - Memory Hierarchy Analysis
(I&D cache + bus + bus bridge + Mem (DDR | SDR)) Analysis
[Graph 2A: VPM Speed - Linux Boot on ARM926E Subsystem of Fig. 1 VSP. X axis: Cache Size (Bytes), 0 to 40,000. Y axis: Instructions / 10-Cycles, 1.00 to 2.60. Series: CL = 16B, Mem = DDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]
[Graph 2B: Power Consumption - Linux Boot on ARM926E Subsystem of Fig. 1 VSP. X axis: Cache Size (Bytes), 0 to 40,000. Y axis: Ave. Power * 10^7 / # Instructions, 1.00 to 2.00. Series: CL = 16B, Mem = DDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]
Replace Cache with Simple External Buffer for a Known Task Set
[Graph: Speed - Sieve of Eratosthenes on ARM926E. X axis: Cache Size (Bytes), -10 to 1490. Y axis: # Instructions / 10-cycles, 1.00 to 6.00. Series: CL = 16B, Mem = DDR; CL = 16B, Mem = SDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]
[Graph: Power Consumption - Sieve of Eratosthenes on ARM926E. X axis: Cache Size (Bytes), -10 to 1490. Y axis: Ave. Power * 10^7 / # Instructions, 4.00 to 10.00. Series: CL = 16B, Mem = DDR; CL = 16B, Mem = SDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]
The Message
System optimization needs a composite, complex optimization function of functions operating on a complete (model of a) system. The constituent functions include:
• Power
• Speed
• Response deadline compliance
• Cost ……
A rigorous scientific methodology is required for empirical experimentation