Profiles in Power: Optimizing Real-Time Systems for Power As well as Speed (IPS), Response Latency...

Profiles in Power: Optimizing Real-Time Systems for Power as well as Speed (IPS), Response Latency and Cost

Graham Hellestrand

Mahdi Seddighnazhad

James Brogan

VaST Systems Technology Corporation

CONFIDENTIAL2

Key Focus: Low Cost, Power Reduction and Increased Features

Competitive positions must be maintained
Product complexity is increasing
• Hardware growth
• Software growth
Critical Program Schedules
Market windows must be hit
Revenue opportunities must be captured
Burden has moved to design and development

Wireless Trends


The Metric Power

Reducing power regardless of the effect on other optimization factors is of limited value.

Example: Saving 50% power while yielding:
• 50% speed hit and/or
• Failure to meet response latency specifications
is likely to be unacceptable in the marketplace.
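The arithmetic behind this example can be made explicit. The sketch below assumes (the slide does not say so) that a 50% speed hit means a fixed task takes twice as long, and that energy is average power multiplied by run time. The function name and the normalized units are illustrative only.

```python
def task_energy(avg_power, run_time):
    """Energy for one task: average power multiplied by run time."""
    return avg_power * run_time

# Normalized baseline: 1 unit of power for 1 unit of time.
baseline = task_energy(avg_power=1.0, run_time=1.0)

# Assumption: 50% power saving, and a 50% speed hit doubles the run time.
optimized = task_energy(avg_power=0.5, run_time=2.0)

print(baseline, optimized)
```

Under these assumptions the energy per task is unchanged, so the power saving buys nothing for a fixed workload while the product becomes slower.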


Implications

Real-time software architecture and development need to be subject to rigorous optimization of an appropriate objective function, based on:
Power
Speed
Event response latencies
• Examples: interrupts, exceptions
Cost, approximated by:
• Cache sizes
• Memory sizes and hierarchies
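One way to read this slide is as a scalar objective with a hard latency constraint. The following is a minimal sketch under that assumption; the `Candidate` fields, the weights, and the rule that a missed latency specification disqualifies a design are all illustrative choices, not taken from the slides.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    power: float          # average power, normalized (lower is better)
    speed: float          # execution speed, normalized (higher is better)
    worst_latency: float  # worst-case event response latency, in cycles
    cost: float           # cost proxy, e.g. derived from cache and memory sizes

def objective(c, latency_spec, w_power=1.0, w_speed=1.0, w_cost=1.0):
    """Hypothetical objective: lower is better; infeasible designs score inf."""
    if c.worst_latency > latency_spec:
        return math.inf  # response latency is treated as a hard constraint
    return w_power * c.power + w_cost * c.cost - w_speed * c.speed

# Two made-up candidate designs, for illustration only.
designs = [
    Candidate(power=1.0, speed=1.0, worst_latency=800, cost=1.0),
    Candidate(power=0.5, speed=0.5, worst_latency=1600, cost=1.0),
]
best = min(designs, key=lambda c: objective(c, latency_spec=1000))
print(best)
```

Treating latency as a constraint rather than a weighted term reflects the earlier point that a design which misses its response latency specification is unacceptable however much power it saves.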

System Architecture & Optimization

Software Architecture
Platform Architecture
Real-world interaction architecture
Processor µ-architecture
+ Empirical experimentation


Architecture Addresses the Whole System

[Diagram: the whole system. Software: Device Drivers; Operating Systems; Middleware, Comms; Applications. Hardware: RF, Mechanical, Physical; Devices, Structures, Sub-systems, Systems. Platform levels: Physical, RTL, Behav. Architecture: Buses & Bridges; VPMs & Peripheral Devices. Virtual Prototype: Evaluation, Exploration.]

Optimization effect: Software Architecture & Design

1st Order Effect on system performance


Software Architecture & Design

[Diagram: software development flow in the SW IDE (inputs: UML, Simulink, C, C++, …): Create, Compile, Assemble, Link, Load, Debug + Monitor. Targets: VaST VSP, HW. Label: Hardware Software Architecture.]

VSP
Monitor prototype internals
• Cache hits/misses
• Bus transactions
• Processor performance
• Memory usage
• Interrupt latency
Trigger hardware and software debuggers
Example usage: analyze processor and platform power
Make intelligent tradeoffs between power, performance and cost

Optimization effect: Platform Architecture & Design

1st Order Effect on system


Typical 3G Cell Phone Controller

3 processors, 12 buses, 10 bus bridges, 70 peripherals

[Diagram: VaST Virtual System Prototype (model) of the controller. ARM1176 (P1) Virtual Processor Model with I Cache, D Cache and StdBus I/F; StarCore SC1400 Virtual Processor Model with D ROM and P ROM; P1 Memory, P2 Memory and Shared Memory blocks; Arb. Ctrl DRAM; StdBus bridges; AHB buses; P1 Devices and P2 Devices (UART, TIMER, INTC), with Console 1 and Console 2.]

Optimization effect: Real-world Interaction Architecture

1st Order Effect on system


Automotive Power-train Control

Igniting fuel under pressure at the wrong part of the cylinder stroke results in spectacular destruction of the engine (and maybe the experimenter).

Real-time Engine Monitoring
Engine control unit

Optimization of: Processor µ-architecture

2nd / 3rd Order Effect (apart from caches & buffering)


Generic Single Pipeline Operation

[Diagram: Pipeline Stage (Read, Exec, Memory, Write) versus Time (ticks) for an ADD followed by a dependent SUB.
ADD: Read R3, R2; Read R2, (R2 LSL R3); R1 + (R2 LSL R3); Write R0.
SUB: Cannot read R0 - stall; Read R0 (R0 bypass); Read R6, (R6 LSL R0); R5 + (R6 LSL R0); Write R4.]
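The stall in this example can be reproduced with a very small model. The sketch below assumes a single-issue, in-order pipeline in which an instruction can start only once all of its source registers are readable. `count_stalls` and the `result_latency` parameter are hypothetical names, and the latency values are illustrative rather than taken from any particular core.

```python
def count_stalls(program, result_latency):
    """Count stall ticks for a list of (dest, sources) instructions.

    result_latency is the assumed number of ticks after an instruction
    starts before its destination register can be read by a later one.
    """
    ready_at = {}  # register -> first tick at which its new value is readable
    tick = 0       # earliest tick at which the next instruction could start
    stalls = 0
    for dest, sources in program:
        start = max([tick] + [ready_at.get(r, 0) for r in sources])
        stalls += start - tick
        ready_at[dest] = start + result_latency
        tick = start + 1
    return stalls

program = [
    ("R0", ["R1", "R2", "R3"]),  # ADD: writes R0, reads R1, R2, R3
    ("R4", ["R5", "R6", "R0"]),  # SUB: writes R4, reads R5, R6, R0
]

for latency in (1, 2, 4):
    print(latency, count_stalls(program, latency))
```

A shorter `result_latency` stands in for a bypass path and a longer one for waiting on register write-back, which is why bypassing reduces, but does not always eliminate, the stall on the dependent SUB.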

Pre-Silicon System Design Process


System Development Process

[Diagram: CoMET System Level Design Tool flow. Business Requirements and Functional Requirements; Executable System Specification; Executable System Architecture (VSP). Architecture + Concurrent, Iterative S/W - H/W Development + Integrated & Optimized Final Product. Software track and Hardware track, each: Translate; Architect and Test; Design and Test; Develop and Test. Tools: METeor, Virtual System Platform, CoMET. Integration steps: Integrate & Co-Verify VSP; Integrate & Co-Verify Silicon Hardware Platform + Embedded System Software.]


Electronic System Design Process

System architecture: Virtual Prototype (timing accurate)
+
Software || Hardware design: Virtual System Prototypes (high speed)

Architecture (Virtual Prototype): Evaluate architectures of candidate designs using real software applications
Hardware development: Develop behavioral-level executable specification and verify RTL
Software development: Design, develop and debug software before silicon or hardware prototypes are available

So what performance can we get from a timing-accurate VSP on a single-processor host?

That is, how useful are these things?


VSP Computation Performance: Multiple Independent Platforms

[Diagram: several independent platforms, each comprising an ARM926E VPM (INST, DATA, CONFIG & CONTROL), two GPMEM blocks, three Bridges, GPINTCARM, GPTIMER, GP UART and GP CONSOLE.]


Results - Computational Performance Study

6x Single Processor, Virtual System Prototypes - Cached

[Chart: Effective MIPS (0 to 60) versus Number of Processors (1 to 6). Series: VPM MIPS Performance; Simulation Overhead; Hardware Simulation.]

Platform-dominated study: As Virtual System Prototypes (VSPs), whose processors have software and data resident in cache, are switched into the simulation (pink line), the sharing of host cycles between the processor and the hardware (purple line) of each VSP stays in proportion for each additional VSP activated. The frequent switching between VSPs, each having a processor and hardware that also share the host cycles, also increases the simulation overhead (blue line).


VSP with TLM Bus Matrix

[Diagram: two SC1200 processors (each with DMA Master, Core Master and Slave ports) and a DMA Traffic Generator, connected through Bridges (links labeled 32) and AHB buses to DRAM (2MB) and Mem Banks 0 to 5 (512KB each), with an OCP Channel Wrapper.
Annotations:
• Application software (Viterbi), on INT will shuffle data from DRAM to MemBanks
• Application software (Vocoder), on INT will shuffle data from DRAM to MemBanks
• AHB-like transactions every 300-500 cycles
• approx. 60% utilization]


Results – Bus Matrix Performance

[Chart: Communications vs Computation Loading (Double Core). Count (Trans/DSec | MIPS), 0 to 150, versus Transaction Size (bytes: 1024, 64, 4) and Number of Processors. Series: Trans. (0 VPM), Headroom MIPS; Trans. (1 VPM), Headroom MIPS; Trans. (2 VPM), Headroom MIPS.]

Communications and computation sharing study: This is a multi-variable study measuring simulation performance of a system in which transactions of various sizes (1024, 64 and 4 bytes) are transmitted at a high rate over a complex switch to which two SC1200 processors are attached. Initially no processors are activated; each is then activated in turn. The bar chart is best read as a sequence of 3 pairs (Transaction / Headroom (MIPS)) going into the slide. As transactions become progressively smaller, there is relatively more work to be performed by the model to transmit and receive them. The Headroom measure is the amount of host cycles available for further simulation. As more processors are activated and the transaction size is reduced, the available headroom diminishes.


Study 4: VSP Interrupt Handling

Automotive Benchmark, Feb 2004

Capability of a VSP under interrupt loads: This is a relatively simple experiment that shows the performance of a single-processor Virtual System Prototype under increasingly stressful rates of processing asynchronous events (interrupts). Even at high interrupt rates (an interrupt every 3,750 cycles is equivalent to a 12-cylinder engine running at 20,000 RPM and producing an interrupt every 10 degrees of crank angle), the VPM is capable of simulating high software execution rates (4 MIPS) while handling the interrupts.
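The engine equivalence can be checked with a few lines of arithmetic. This sketch assumes a single interrupt stream driven by the crankshaft (one interrupt per 10 degrees, not one per cylinder), which the slide does not state explicitly. The implied clock frequency is derived here and is not given in the source.

```python
def interrupts_per_second(rpm, degrees_per_interrupt):
    """Interrupt rate for one interrupt every N degrees of crank angle."""
    interrupts_per_revolution = 360.0 / degrees_per_interrupt
    return rpm * interrupts_per_revolution / 60.0

rate = interrupts_per_second(rpm=20_000, degrees_per_interrupt=10)

# 3,750 cycles between interrupts is the figure quoted on the slide.
implied_clock_hz = rate * 3_750

print(rate, implied_clock_hz / 1e6)
```

Under this reading the engine produces 12,000 interrupts per second, and a 3,750-cycle spacing corresponds to a target clock of roughly 45 MHz.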

[Chart: VPM Performance under High Interrupt Load. Event Count (0 to 5000) versus Cycles between Interrupts (3750, 50000, 100000). Series: Cycles per Interrupt; VPM Performance (MIPS/10000).]

Back to Building Systems


Physical Prototype

Virtual Prototypes

[Diagram: components attached to a Virtual bus: 32-bit MPU, RAM, Interrupt Controller, ROM, Bus Interface, Flash, DMA, Interrupt, Timer, General I/O, A2D Convert, Clock Gen., Serial Comms.]

It is all about optimization, stupid!

[Diagram labels: Asynch-Signal Response Latency; Power Consumption; Speed; Software; Specifications; Very Smart System Instantiator; Physical (Mechanical, RF, ..); H-type Respecifier.]

Typical 2.5G Wireless Systems built using a Virtual System Prototype


Virtual Prototyping: Mobile Handset Development

Full System Development

Architecture, Software, Hardware, I/F

[Diagram labels: I/Q Signals; Virtual COM Port; ARM Debugger; TeakLite Debugger; SG2.]

http://www.teknooptik.se/images/10a6.gif

http://www.hama.de/web-bilder/pressetreff/pressemitteilungen/feindaten/pr172_prfd1.jpg

http://www.easy-use.de/media/SIM-KARTE.JPG


Wireless VP Benefits

Early Design Feedback in Semiconductor Development Process
• Enabled 1st Pass Silicon Success
• Eliminated Costly 2nd Silicon
• Provided Complete Software Development Environment 9 Months Prior to Silicon
• Resulted in a Better Quality Product 5 Months Earlier Than Standard Development Process

Advanced Debugging
Multi-Core debugging
• ARM926 (ADS 1.2)
• TeakLite* (DSP group)
Complete system visibility
• S-GOLD programmer model
  - Bus status & Interrupt behavior
  - System cycle count, monitors

I/O Test Bench Support
Open Model Extension

[Diagram: SGOLD2 Architecture with Keypad Test Bench; LCD Display; QCIF/CIF Camera Test Bench; Win32 Terminal for all Serial IO; Virtual COM Ports; ARM Debugger; TeakLite Debugger; Linux OS Execution + MPEG4 Encoding Camera Input.]


Concurrent Bus Activity

Optimizing for Power and Performance

Separated Functions


General Form of Multi-Objective Optimization

Equation:

Characterize an objective function in terms of events directly measurable from the VSP

\[
\begin{aligned}
F_{VSP} = f\Big(\;
 & \big|_{cc=0..cn}\; f_{CPU_{cc}}\big(g_{CPU_{cc},\,CEvType=1..cet}\big(Event_{CPU_{cc},\,CEvCnt=scecn..tcecn}\big)\big), \\
 & \big|_{bc=0..bcn}\; f_{Bus_{bc}}\big(g_{Bus_{bc},\,BEvType=1..bet}\big(Event_{Bus_{bc},\,BEvCnt=sbecn..tbecn}\big)\big), \\
 & \big|_{bbc=0..bbcn}\; f_{BusBridge_{bbc}}\big(g_{BBus_{bbc},\,BBEvType=1..bbet}\big(Event_{BBus_{bbc},\,BBEvCnt=sbbecn..tbbecn}\big)\big), \\
 & \big|_{mc=0..mcn}\; f_{Mem_{mc}}\big(g_{Mem_{mc},\,MEvType=1..met}\big(Event_{Mem_{mc},\,MEvCnt=smecn..tmecn}\big)\big), \\
 & \big|_{dc=0..dcn}\; f_{Dev_{dc}}\big(g_{Dev_{dc},\,DEvType=1..det}\big(Event_{Dev_{dc},\,DEvCnt=sdecn..tdecn}\big)\big)
 \;\Big)
\end{aligned}
\]

where

\[
f_{CPU_k}\big(g_{EvType=1..et}(\ldots)\big) = f_{CPU_k}\big(g_{CPU_k,1}(\ldots),\; g_{CPU_k,2}(\ldots),\; \ldots,\; g_{CPU_k,et}(\ldots)\big)
\]

Problem: Huge volume of data some of which may be highly correlated with other data – leading to multiple counting and unreliability in composite measures.
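One simple screen for this problem is to look for event counters that move together before combining them into a composite measure. The sketch below uses a hand-written Pearson correlation over per-interval counts. The counter values are made up for illustration, the 0.95 threshold is an arbitrary assumption, and the code assumes no counter is constant (which would give a zero variance).

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(counters, threshold=0.95):
    """Return pairs of counter names whose |r| meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(counters), 2)
        if abs(pearson(counters[a], counters[b])) >= threshold
    ]

# Made-up per-interval event counts, for illustration only.
counters = {
    "dcache_miss": [10, 20, 30, 40],
    "membus_transaction": [11, 21, 29, 41],
    "tlb_miss": [5, 1, 4, 2],
}
print(correlated_pairs(counters))
```

Pairs flagged this way are candidates for being counted once rather than twice in a composite function; whether they really measure the same underlying activity still has to be judged from the platform architecture.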


A Simple Power Function for a Full Platform

Equation 2:

\[
f_{Power} = W_{Pipe}\, f_{Pipe} + W_{Instr}\, f_{Instr} + W_{Cache}\, f_{Cache} + W_{TLB}\, f_{TLB} + W_{RegAcc}\, f_{RegAcc} + W_{MemAcc}\, f_{MemAcc} + W_{PeriphAcc}\, f_{PeriphAcc}
\]

where

\[
f_{Instr} = f_{Instr,jmp} + f_{Instr,except} + f_{Instr,ctrl} + f_{Instr,coproc} + f_{Instr,LdSt} + f_{Instr,arith} + f_{Instr,other}
\]

and

\[
f_{Instr,i} = \text{instructions of type } i \text{ in } k \text{ cycles}
\]


Resolving the Weights for the Power Function

Table 2: Power: Function Types, Event & Weighting Functions

Function Types                           Events                  Weight Functions
Pipeline                                 ibase                   6.0
Instruction Types                        ijmp                    2.0
                                         iexcept                 2.0
                                         icoproc                 12.0
                                         iarith                  1.0
Caches (I&D)                             Cache_lookup            fi-dcache(size, ways)
                                         icache_hit              iCache-lookup + ficache(line size, decode)
                                         icache_miss             Icache_lookup
                                         dcache_hit              Dcache_lookup + fdcache(size, ways, line size)
                                         dcache_miss             Dcache_lookup
TLB                                      tlb_miss                30.0
Register                                 regfile_access          1.0
Memory (incl. bus transactions)          membus_transaction      50.0
Periph Device (incl. bus transactions)   periphbus_reg_access    50.0
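As a rough illustration of how these weights might be applied, the sketch below computes a weighted sum of event counts using only the constant weights from Table 2. The cache weight functions depend on cache geometry and are deliberately left out. The function name, the event counts, and the treatment of the result as relative (unitless) power are assumptions made for this example.

```python
# Constant weights copied from Table 2. Cache events are omitted because
# their weights are functions of cache size, ways and line size.
WEIGHTS = {
    "ibase": 6.0,
    "ijmp": 2.0,
    "iexcept": 2.0,
    "icoproc": 12.0,
    "iarith": 1.0,
    "tlb_miss": 30.0,
    "regfile_access": 1.0,
    "membus_transaction": 50.0,
    "periphbus_reg_access": 50.0,
}

def power_estimate(event_counts):
    """Weighted sum of event counts; events without a constant weight are ignored."""
    return sum(WEIGHTS[e] * n for e, n in event_counts.items() if e in WEIGHTS)

# Hypothetical event counts for one measurement window.
counts = {"ibase": 100, "iarith": 60, "ijmp": 10, "membus_transaction": 4}
print(power_estimate(counts))  # 600 + 60 + 20 + 200 = 880.0
```

Even in this toy window, four memory bus transactions contribute more than sixty arithmetic instructions, which is consistent with the emphasis on the memory hierarchy in the analyses that follow.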


Single Task Working Set vs Cache Size Analysis

Graph 1A: VPM Speed - Viterbi on ARM926E Subsystem of VSP in Figure 1
[Plot: Instructions / 10-Cycles (0.00 to 8.00) versus Cache Size (Bytes) (0 to 40,000).]

Graph 1B: Power Consumption - Viterbi on ARM926E Subsystem of VSP in Figure 1
[Plot: Ave. Power * 10^7 / # Instruction (5.00 to 17.00) versus Cache Size (Bytes) (0 to 40,000). Cache Line = 16 bytes.]




Linux Boot - Memory Hierarchy Analysis (I&D cache + bus + bus bridge + Mem (DDR | SDR))

Graph 2A: VPM Speed - Linux Boot on ARM926E Subsystem of Fig. 1 VSP
[Plot: Instructions / 10-Cycles (1.00 to 2.60) versus Cache Size (Bytes) (0 to 40,000). Series: CL = 16B, Mem = DDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]

Graph 2B: Power Consumption - Linux Boot on ARM926E Subsystem of Fig. 1 VSP
[Plot: Ave. Power * 10^7 / # Instructions (1.00 to 2.00) versus Cache Size (Bytes) (0 to 40,000). Series: CL = 16B, Mem = DDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]


Replace Cache with Simple External Buffer for a Known Task Set

Speed - Sieve of Eratosthenes on ARM926E
[Plot: # Instructions / 10-cycles (1.00 to 6.00) versus Cache Size (Bytes) (-10 to 1490). Series: CL = 16B, Mem = DDR; CL = 16B, Mem = SDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]

Power Consumption - Sieve of Eratosthenes on ARM926E
[Plot: Ave. Power * 10^7 / # Instructions (4.00 to 10.00) versus Cache Size (Bytes) (-10 to 1490). Series: CL = 16B, Mem = DDR; CL = 16B, Mem = SDR; CL = 32B, Mem = DDR; CL = 32B, Mem = SDR.]


The Message

System optimization needs a composite, complex optimization function of functions operating on a complete (model of a) system. The constituent functions include:

Power

Speed

Response deadline compliance

Cost ……

A rigorous scientific methodology is required for empirical experimentation
