Software for embedded multiprocessors
Introduction on embedded OS Code parallelization Interprocessor communication Task allocation Power management
Introduction on embedded operating systems
OS Overview
Kernel, device drivers, boot loader, user interface, file system and utilities.
Kernel components: interrupt handler, scheduler, memory manager, system services (networking and IPC).
The kernel runs in a protected memory space (kernel space) and has full access to the HW, while apps run in user space.
Apps communicate with the kernel via system calls.
[Figure: apps call library functions such as printf() and strcpy(); library functions and apps invoke system calls such as write() and open()]
OS Overview (II)
The operating system takes control of execution on: timer interrupts, I/O interrupts, system calls, and exceptions (undefined instruction, data abort, page faults, etc.).
Processes
A process is a unique execution of a program. Several copies of a program may run
simultaneously or at different times. A process has its own state:
registers; memory.
The operating system manages processes.
Process state
A process can be in one of three states: executing on the CPU; ready to run; waiting for data.
[Figure: state diagram; executing goes to waiting when it needs data, waiting goes to ready when it gets data, ready goes to executing when it gets the CPU, and executing goes back to ready when preempted]
Processes and CPUs
Activation record: copy of the process state.
Context switch: the current CPU context goes out; the new CPU context goes in.
[Figure: the CPU (PC, registers) exchanges its context with the activation records of process 1, process 2, ... stored in memory]
Terms
Thread = lightweight process: a process that shares memory space with other processes.
Reentrancy: ability of a program to be executed several times with the same results.
Processes in POSIX
Create a process with fork:
the parent process keeps executing the old program;
the child process executes a new program.
[Figure: process a forks into process a (parent) and process b (child)]
The fork call creates the child:

childid = fork();
if (childid == 0) {
    /* child operations */
} else {
    /* parent operations */
}
execv()
Overlays the child with new code:

childid = fork();
if (childid == 0) {
    execv("mychild", childargs);
    perror("execv");
    exit(1);
}

[Figure: execv() loads the file with the child code]
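A complete, runnable version of the fork/execv pattern above; the helper name and the /bin/true child program are ours, chosen only so the example can run:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, overlay it with a new program via execv,
 * and return the child's exit status to the parent. */
int spawn_and_wait(const char *path, char *const argv[]) {
    pid_t childid = fork();
    if (childid == 0) {
        execv(path, argv);   /* only returns on failure */
        perror("execv");
        exit(127);
    }
    int status;
    waitpid(childid, &status, 0);   /* parent blocks until the child exits */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that execv never returns on success: the perror/exit pair after it runs only when the overlay fails.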
Context switching
Who controls when the context is switched?
How is the context switched?
Co-operative multitasking
Improvement on co-routines: hides the context switching mechanism, but still relies on processes to give up the CPU. Each process allows a context switch at a cswitch() call. A separate scheduler chooses which process runs next.
Problems with co-operative multitasking
Programming errors can keep other processes out: a process never gives up the CPU, or a process waits too long to switch, missing input.
Context switching
Must copy all registers to activation record, keeping proper return value for PC.
Must copy new activation record into CPU state.
How does the program that copies the context keep its own context?
Preemptive multitasking
Most powerful form of multitasking: the OS controls when contexts switch and determines which process runs next. A timer is used to interrupt the CPU and call the OS:
[Figure: timer interrupt driving the CPU]
Flow of control with preemption
[Figure: timeline; P1 runs, a timer interrupt invokes the OS, P1 resumes, a second interrupt invokes the OS, which switches to P2]
Preemptive context switching
The timer interrupt gives control to the OS, which saves the interrupted process's state in an activation record. The OS chooses the next process to run and installs the desired activation record as the current CPU state.
Operating systems
The operating system controls resources: who gets the CPU; when I/O takes place; how much memory is allocated.
The most important resource is the CPU itself. CPU access is controlled by the scheduler.
Design Issues
Kernel space / user space / real-time space
Monolithic versus micro-kernel:
Monolithic: OS services (including device drivers, networking and file system) run in privileged mode (easier to make efficient) (Linux, WinNT)
Microkernel: privileged mode only for task management, scheduling, IPC, interrupt handling, memory management (more robust) (QNX, VxWorks)
Preemptable kernel or not
Memory management versus shared memory
Dedicated versus general
Embedded vs General Purpose OS
Small footprint
Stability (must run for years without manual intervention)
Hardware watchdogs
Little power
Autonomous reboot (safely and instantly)
Taxonomy
High-end embedded systems: down-sized derivatives of existing general-purpose OSes (routers, switches, PDAs, set-top boxes).
Deeply embedded OSes: small OSes with a handful of basic functions, designed from the ground up for a particular application. They typically lack high-performance GUI and networking (automotive control, digital cameras, mobile phones). They are statically linked to the application: after compilation, the OS kernel and the applications are concatenated into a single package that can be loaded onto the embedded machine.
Run-time environment: boot routine + embedded libraries. Java and C++ offer functionalities such as memory management, threading, task synchronization and exception handling.
Embedded operating system
[Figures: typical OS configuration, user programs on top of the operating system, which runs on the hardware; typical embedded configuration, a single user program including operating system components, running directly on the hardware]
Real-time operating system
Dynamic VS Static Loading
Dynamic loading: the OS is loaded as a separate entity and applications are dynamically loaded into memory (more flexibility, code relocation is needed) (e.g. uClinux).
Static loading: the OS is linked and loaded together with the applications (no flexibility, higher predictability) (e.g. eCos, RTEMS); the OS is a set of libraries that provide OS services.
How about:
memory protection? (shared address space)
system calls? (no CPU mode change required)
process creation (fork, exec)? (shared address space, no overloading)
file system? (only for input/output data)
Static Loading
No address space separation: user applications run with the same access privilege as the kernel. Functions are accessed as plain function calls, not "system calls": no need for copying parameters and data, no need for state saving. Speed and control.
Dynamic Loading
[Figure: with dynamic loading, the OS sits at a constant address in system memory from boot time, while processes are loaded from the file system at run time and relocated (compile-time addresses != run-time addresses, address space separation)]
Focus on software for embedded multiprocessors
Embedded vs. General Purpose
Embedded applications: asymmetric multi-processing
Differentiated processors; specific tasks are known early and mapped to dedicated processors; configurable and extensible processors for performance and power efficiency.
Communication: coherent memory, shared local memories, HW FIFOs, other direct connections.
Dataflow programming models.
Classical example: smart mobile, i.e. RISC + DSP + media processors.

Server applications: symmetric multi-processing
Homogeneous cores; general tasks known late; tasks run on any core; high-performance, high-speed microprocessors.
Communication: large coherent memory space on the multi-core die or bus.
SMT programming models (Simultaneous Multi-Threading).
Examples: large server chips (e.g. Sun Niagara, 8 cores x 4 threads), scientific multi-processors.
Parallel programming of embedded multiprocessors
Parallelism & Programming Models
MP is difficult: concurrency, and "fear of concurrency". No robust and general models to automatically extract concurrency in 20-30+ years of research. Many programming models/libraries: SMT, SMP, OpenMP, MPI (Message Passing Interface). Users manually modify code: concurrent tasks or threads, communications, synchronisation.
Today: coarse-grained (whole-application / data-wise) concurrency; unmodified source + MP scheduler; API for communications and synchronisation.
Sequential execution model
The most common: supported by traditional (imperative) languages (C, C++, Fortran, etc.), with a huge bulk of legacy code.
The best understood: we are trained to solve problems algorithmically (as a sequence of steps), and microprocessors were originally designed to run sequential code.
The easiest to debug: tracing the state of the CPU, step-by-step execution.
But... it HIDES parallelism!!
Types of Parallelism
Instruction Level Parallelism (ILP): compilers & HW are mature.
Task parallelism: parallelism explicit in the algorithm; between filters without a producer/consumer relationship.
Data parallelism: between iterations of a stateless filter; placed within a scatter/gather pair (fission); filters with state cannot be parallelized this way.
Pipeline parallelism: between producers and consumers; stateful filters can be parallelized.
[Figure: stream graph showing scatter/gather pairs (data parallelism), independent branches (task parallelism) and producer/consumer chains (pipeline parallelism)]
Parallelizing Loops: a Key Problem
FORALL: no "loop-carried dependences", fully parallel.
FORACROSS: some "loop-carried dependences".
90% of execution time is in loops. Automatic extraction has had partial success, mostly on "well-behaved" loops. Challenges: dependency analysis and its interaction with data placement. Cooperative approaches are common: the programmer drives automatic parallelization (OpenMP).
Parallelized loops rely on barrier synchronization.
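A minimal sketch of how a FORALL loop with no loop-carried dependences can be split across worker threads with pthreads; the chunking scheme, sizes and names are ours, not from the slides:

```c
#include <pthread.h>

#define N 1000
#define NTHREADS 4

static double a[N], b[N];

struct chunk { int lo, hi; };

/* Each worker executes a disjoint range of the FORALL iteration space:
 * no loop-carried dependences, so no synchronization inside the loop. */
static void *body(void *arg) {
    struct chunk *c = arg;
    for (int i = c->lo; i < c->hi; i++)
        b[i] = 2.0 * a[i];
    return NULL;
}

void forall_parallel(void) {
    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        c[t].lo = t * N / NTHREADS;          /* static block partition */
        c[t].hi = (t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, body, &c[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);          /* join plays the barrier role */
}
```

Here the final join is the barrier: the serial code after the loop may not run until every chunk has finished.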
Barrier with Pthreads
[Figure: in the serial region, the master core alone initializes the synchronization structures (pthread_mutex_init()) and creates the workers (pthread_create()); in the parallel region all threads meet on a BARRIER]
Pthreads on Heterogeneous CPUs?
ISSUES: there is an OS running on each core, and the master core has no means to fork new threads on the slave nodes; the use of pthreads is not a suitable solution.
SOLUTION: standalone implementation; master/slave cores instead of threads; synchronization through shared memory.
Heterogeneous MPSoC
[Figure: a master CPU and three slave CPUs, each with a private memory, connected through an interconnect to a shared memory]
SPMD Barrier
In the serial region, all cores initialize the synchronization structures and the common data in shared memory. Additional serial code is executed only by the master core while the slaves wait on the barrier: the slaves notify their presence on the barrier to the master, and the master releases them as soon as it is ready to start the parallel region.
Code implementation flow
[Figure: the original C code goes through the parallel compiler to produce parallel code, which gcc compiles into one binary per CPU; the binaries run on the MPARM platform (master CPU and slave CPUs with private memories, shared memory, interconnect)]
Runtime Library
The runtime library is responsible for:
initializing the needed synchronization features, creating new worker threads (in the original implementation) and coordinating their parallel execution over multiple cores;
providing the implementation of the synchronization facilities (locks, barriers).
Code execution: each CPU executes the same program. Based on the CPU id, we separate the portions of code executed by the master and by the slaves. The master CPU executes the serial code and initializes the synchronization structures in shared memory; the slave CPUs only execute the parallel regions of code, behaving like the typical slave threads.
[Figure: the same program image runs on the master CPU and on each slave CPU of the MPSoC (private memories, shared memory, NoC interconnect)]
int main() {
    ...
    if (MASTERID) {
        serial code
        synchronization
    }
    ...
    if (SLAVEID) {
        parallel code
    }
    ...
}
Synchronization structures
(barriers, locks)
Synchronization
Parallel programming through shared memory requires global and point-to-point synchronization. On symmetric architectures, implementations use the pthreads library synchronization facilities; on heterogeneous architectures, hardware semaphores must be used.

void lock(pthread_mutex_t *lock)   { pthread_mutex_lock(lock); }
void unlock(pthread_mutex_t *lock) { pthread_mutex_unlock(lock); }

/* hardware semaphore: the read itself atomically tests and sets the location */
void lock(int *lock)   { while (*lock); }
void unlock(int *lock) { *lock = 0; }
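On a host platform without hardware semaphores, the same interface can be sketched portably with C11 atomics; atomic_flag_test_and_set plays the role of the hardware read-and-set (this is our illustration, not the library's code):

```c
#include <stdatomic.h>

/* Spinlock: the test-and-set must be a single atomic operation,
 * otherwise two cores can both observe the lock free and enter
 * the critical section together. */
void spin_lock(atomic_flag *l) {
    while (atomic_flag_test_and_set(l))
        ;   /* spin until this core is the one that sets the flag */
}

void spin_unlock(atomic_flag *l) {
    atomic_flag_clear(l);
}
```

The busy-wait body is empty on purpose: on an MPSoC the polling ideally targets local memory to keep the interconnect free, which is exactly the point made later about scratchpad polling.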
Typical Barrier implementation
Shared counters protected by locks:

struct barrier {
    lock_type lock;
    int entry_count;
    int exit_count;
} *bar;

LOCK(bar->lock);
bar->entry_count++;
if (bar->entry_count < nproc) {
    UNLOCK(bar->lock);
    while (bar->entry_count != nproc);
    LOCK(bar->lock);
    bar->exit_count++;
    if (bar->exit_count == nproc)
        bar->entry_count = 0x0;
    UNLOCK(bar->lock);
} else {
    bar->exit_count = 0x1;
    if (bar->exit_count == nproc)
        bar->entry_count = 0x0;
    UNLOCK(bar->lock);
}
while (bar->exit_count != nproc);
Barrier Implementation Issues
ISSUES: this approach is not very scalable. Every processor notifies its arrival at the barrier by increasing the value of a common shared variable; as the number of cores increases, contention for the shared resource may increase significantly.
POSSIBLE SOLUTION: a vector of flags, one per core, instead of a single shared counter.
New Barrier Implementation

typedef struct Barrier {
    int entered[NSLAVES];
    int usecount;
} Barrier;

void Slave_Enter(Barrier *b, int id) {
    int ent = b->usecount;
    b->entered[id] = 1;
    while (ent == b->usecount);
}

void Master_Wait(Barrier *b, int num_procs) {
    int i;
    for (i = 1; i < num_procs; i++)
        while (!b->entered[i]);
    /* Reset flags to 0 */
}

void Master_Release(Barrier *b) {
    b->usecount++;
}
No busy-waiting due to contention on a shared counter: each slave updates its own flag, and only the master spin-waits on each slave's flag to detect its presence on the barrier. The slaves spin-wait only on a counter that is updated by the master alone.
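A host-runnable rendering of this vector-of-flags barrier, using C11 atomics in place of uncached shared memory (an assumption of ours; on the target the flags would live in shared memory and core 0 would be the master, which is why the wait loop starts at id 1):

```c
#include <stdatomic.h>

#define NSLAVES 4   /* illustrative: total cores, slave ids are 1..NSLAVES-1 */

typedef struct Barrier {
    atomic_int entered[NSLAVES];
    atomic_int usecount;
} Barrier;

/* A slave raises its own flag, then spins on the release counter:
 * no two slaves ever write the same location, so no lock is needed. */
void Slave_Enter(Barrier *b, int id) {
    int ent = atomic_load(&b->usecount);
    atomic_store(&b->entered[id], 1);
    while (atomic_load(&b->usecount) == ent)
        ;
}

/* The master polls each slave's private flag, then resets it. */
void Master_Wait(Barrier *b, int num_procs) {
    for (int i = 1; i < num_procs; i++) {
        while (!atomic_load(&b->entered[i]))
            ;
        atomic_store(&b->entered[i], 0);   /* reset flags to 0 */
    }
}

/* Bumping the counter releases every waiting slave at once. */
void Master_Release(Barrier *b) {
    atomic_fetch_add(&b->usecount, 1);
}
```

Each slave samples usecount before raising its flag, so a release that happens after Master_Wait returns can never be missed by a slave the master has already seen.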
Compiler aware of synchronization cost?
A lightweight implementation of the synchronization structures allows parallelized code with a large number of barrier instructions to still perform better than the serial version.
It would be useful to let the compiler know about the cost of synchronization. This would allow it not only to select the parallelizable loops, but also to establish whether the parallelization is worthwhile.
For workloads well distributed across the cores the proposed barrier performs well, but for a high degree of load imbalance an interrupt-based implementation may be better suited. The compiler may choose which barrier instruction to insert depending on the amount of busy waiting.
[Chart: upper triangular 32x32 matrix filling; cost (*) for Serial, 1, 2, 4 and 8 cores, split into Tinit, Tsync and Texec, each with ideal and overhead components]
Performance analysis
Time needed for initializing synchronization structures in shared memory was measured on a single core simulation.
It is expected to be invariant with increasing numbers of cores.
Simultaneous accesses to the shared memory generate a traffic on the bus that produces a significant overhead.
Ideal synchronization time was estimated for the various configurations making the master core wait on the barrier after all slaves entered.
In the real case synchronization requires additional waiting time.
Those additional cycles also include the contribution due to polling on the synchronization structures in shared memory.
Ideal parallel execution time was calculated simulating on one core the computational load of the various configurations.
As expected, it almost halves with the doubling of the number of working cores.
Overall execution time is lengthened by the waiting cycles due to the concurrent accesses to shared memory.
(*) Overall number of cycles normalized by the number of cycles spent for an ideal bus transaction (1 read + 1 write)
[Chart: upper triangular 1024x1024 matrix filling; cost for Serial, 1, 2, 4 and 8 cores, split into Tinit, Tsync and Texec, each with ideal and overhead components]
Performance analysis 2
For small computational loads (i.e. few matrix elements) initialization and synchronization have a big impact on overall performance: no speedup. Possible optimizations on the barriers to reduce accesses to shared memory; possible optimizations on initialization, serializing/interleaving accesses to the bus.
For bigger computational loads the initialization and synchronization contributions go completely unnoticed: big speedup margin. Speedup is heavily limited by frequent accesses to shared memory. Would pure computation follow the profile of the blue bars? Would cacheable shared memory regions help?
Example: MP-Queue library
MP-Queue is a library intended for message passing among different cores in an MPSoC environment.
Highly optimized C implementation, with low-level exploitation of data structures and semaphores: low overhead; data transfer optimized for performance (analyses of disassembled code); synch operations optimized for minimal interconnect utilization.
Producer-consumer paradigm, different topologies: 1-N, N-1, N-N.
Communication library API
1. Autoinit_system(): every core has to call it at the very beginning; allocates the data structures and prepares the semaphore arrays.
2. Autoinit_producer(): to be called by a producer core only; requires a queue id; creates the queue buffers and signals their position to the n consumers.
3. Autoinit_consumer(): to be called by a consumer core only; requires a queue id; waits for the n producers to be bound to the consumer structures.
4. Read(): gets a message from the circular buffer (consumer only).
5. Write(): puts a message into the circular buffer (producer only).
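MP-Queue itself targets bare-metal MPSoCs; as a behavioural sketch of the producer-consumer circular buffer behind Read()/Write(), here is a 1-1 queue built on pthreads primitives (all names and sizes here are ours, not the MP-Queue API):

```c
#include <pthread.h>

#define QSIZE 8

typedef struct {
    int buf[QSIZE];
    int head, tail, count;      /* extraction/insertion indexes + fill level */
    pthread_mutex_t mtx;
    pthread_cond_t not_empty, not_full;
} queue_t;

void queue_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->mtx, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Producer side: blocks while the circular buffer is full. */
void queue_write(queue_t *q, int msg) {
    pthread_mutex_lock(&q->mtx);
    while (q->count == QSIZE)
        pthread_cond_wait(&q->not_full, &q->mtx);
    q->buf[q->tail] = msg;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);   /* notify one consumer */
    pthread_mutex_unlock(&q->mtx);
}

/* Consumer side: blocks while the buffer is empty. */
int queue_read(queue_t *q) {
    pthread_mutex_lock(&q->mtx);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->mtx);
    int msg = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);    /* notify one producer */
    pthread_mutex_unlock(&q->mtx);
    return msg;
}
```

The single pthread_cond_signal corresponds to the round-robin/fixed notification described below; broadcasting instead would correspond to notify-all.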
Communication semantics
Notification mechanisms available: round robin; notify all; target core specifying.
The i-th producer gets the write position index, puts the data onto the buffer, and signals either one consumer (round-robin / fixed) or all consumers (notify all).
The i-th consumer gets the read position index, gets the data from the buffer, and signals either one producer (round-robin / fixed) or all producers (notify all).
[Figure: producers P1, P2 and consumers C1, C2 around a queue]
Architectural Flexibility
1. Multi-core architectures with distributed memory.
2. Purely shared-memory-based architectures.
3. Hybrid platforms.
Transaction Chart
Shared bus accesses are minimized as much as possible: local polling on scratchpad memories; insertion and extraction indexes are stored in shared memory and protected by a mutex. The data transfer section involves the shared bus: critical for performance.
[Sequence diagrams: 1 producer and 1 consumer (parallel activity); synch time vs. pure data transfer; local polling on the scratchpad semaphore; signaling to the remote core's scratchpad; "pure" data transfer to and from the FIFO buffer in shared memory; message sizes of 8 and 64 words]
Communication efficiency
Comparison against ideal point-to-point communication. 1-N queues leverage bus pipelining: bigger asymptotic efficiency. Interrupt-based notification allows more than one task per core, at the price of significant overhead (up to 15%). Low-level optimizations are critical! (16 and 32 words per token)
Growth of assembly length for copy sections: from 32 words on, the gcc compiler avoids emitting the multiple-load / multiple-store copy sequence (it is not produced any more), since code size would otherwise keep rising. Where high throughput is required, a less compact but more optimized representation is desired.
Compiler-aware optimization benefits
The compiler may be forced to unroll the data transfer loops: about 15% improvement with 32-word messages. A typical JPEG 8x8 block is encoded in a 32-word struct (8x8x16-bit data).
Task allocation in MPSoC architectures
Application Mapping
The problem of allocation, scheduling and frequency selection for task graphs on multi-processors in a distributed real-time system is NP-hard.
New tool flows are needed for efficient mapping of multi-task applications onto hardware platforms.
[Figure: a task graph (T1..T8) is allocated to processors Proc. 1..Proc. N, each with a private memory and connected by an interconnect, then scheduled with frequency selection so that all tasks complete within the deadline]
Approach
Focus: statically scheduled applications.
Objectives: a complete approach to allocation, scheduling and frequency selection, with high computational efficiency w.r.t. commercial solvers and high accuracy of the generated solutions; a new methodology for multi-task application development, to quickly develop multi-task applications and to easily apply the optimal solution found by our optimizer.
Target architecture
An architectural template for a message-oriented distributed-memory MPSoC: support for message exchange between the computation tiles; single-token communication; availability of local memory devices at the computation tiles and of remote memories for program data.
Several MPSoC platforms available on the market match this template: the Silicon Hive Avispa-CH1 processor; the Cradle CT3600 family of multiprocessors; the Cell processor; the ARM MPCore platform.
The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor.
[Figure: activities A..N repeating within the period T]
A task graph represents:
– a group of tasks T
– task dependencies
– execution times expressed in clock cycles: WCN(Ti)
– communication times (writes & reads) expressed as WCN(WTiTj) and WCN(RTiTj)
These values can be back-annotated from functional simulation.
Application model
[Figure: task graph of Task1..Task6 annotated with computation costs WCN(Ti) on the nodes and communication costs WCN(WTiTj), WCN(RTiTj) on the arcs]
//Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

//Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

uint queue_consumer[..][..] = { {0,1,1,0,..},
    {0,0,0,1,1,.}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};
Example
Number of nodes: 12. Graph of activities. Node type: normal, branch, conditional, terminator. Node behaviour: or, and, fork, branch. Number of CPUs: 2. Outputs: task allocation, task scheduling, arc priorities, frequency & voltage.

#define TASK_NUMBER 12

[Figure: activity graph of 12 nodes (N1, B2, B3, C4..C7, N8..N11, T12) connected by arcs a1..a14 with fork/or/and/branch behaviours, and the resulting schedule of the activities on processors P1 and P2 within the deadline]
Queue ordering optimization
Communication ordering affects system performance.
[Figure: tasks T1..T6 mapped on CPU1 and CPU2; with communication C3 scheduled before C1, T4 must wait while T3 runs]
Queue ordering optimization
Communication ordering affects system performance.
[Figure: with the communication order changed (C1 before C3), T4 is re-activated as soon as its input is available]
Synchronization among tasks
[Figure: task graph with T1 feeding T2, T3 and T4 through communications C1..C3; on Proc. 2, T4 is suspended until its input data arrive; non-blocking semaphores]
Logic Based Benders Decomposition
The approach decomposes the problem into 2 sub-problems:
Allocation & assignment of frequency settings → INTEGER PROGRAMMING; objective function: minimizing the communication cost and the energy consumption during execution and communication of tasks, under memory constraints.
Scheduling → CONSTRAINT PROGRAMMING; objective function: minimizing the energy consumption during frequency switching, under the real-time constraint.
The IP master passes a valid allocation to the CP scheduler; when scheduling fails, a no-good (linear constraint) is returned to the master. The process continues until the master problem and the sub-problem converge, providing the same value. The methodology has been proven to converge to the optimal solution [J.N. Hooker and G. Ottosson].
Application Development Methodology
[Figure: development flow; a characterization phase runs the application on a simulator to produce application profiles for the CTG; the optimization phase feeds them to the optimizer, which returns the allocation and the schedule; the optimal SW application implementation is then built with the application development support and executed on the platform]
GSM Encoder
Throughput required: 1 frame / 10 ms, with 2 processors and 4 possible frequency & voltage settings.
Task graph: 10 computational tasks, 15 communication tasks.
Without optimizations: 50.9 μJ. With optimizations: 17.1 μJ (−66.4%).
Energy Management
o Basic techniques: Shutdown and DVFS
o Advanced techniques: Feedback control
Urbino, 19-10-2006
Energy Optimization in MPSoCs
Two main steps:
Workload allocation to the processing elements: task mapping and scheduling.
After workload allocation, the resources of the processing elements should be adapted to the required performance to minimize energy consumption: shut-down and voltage scaling.
Shut-Down
When the system is idle the processor can be placed into a low-power state. Deeper states trade reactivity for power level:
– core clock gating (woken up by a timer interrupt): no need for context restore
– core power gating (woken up by on-chip peripherals): context restore needed
– chip power gating (woken up by external, on-board interrupts): context restore needed
Frequency/Voltage Scaling
DVFS: adapting frequency/voltage to the workload. Frequency must be scaled together with voltage to keep the circuit functional. Dynamic power goes with the square of Vdd and linearly with the clock speed: P = Ceff * Vdd^2 * f. Scaling V and f by a factor s -> power scales as s^3.
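A quick numeric check of the cubic rule (the Ceff, Vdd and f values below are arbitrary placeholders; the ratio is independent of them):

```c
/* Dynamic power model from the slide: P = Ceff * Vdd^2 * f. */
double dyn_power(double ceff, double vdd, double f) {
    return ceff * vdd * vdd * f;
}

/* Ratio of scaled to nominal power when both Vdd and f scale by s:
 * (s*Vdd)^2 * (s*f) / (Vdd^2 * f) = s^3, for any Ceff, Vdd, f. */
double power_scaling(double s) {
    double nominal = dyn_power(1e-9, 1.2, 200e6);
    double scaled  = dyn_power(1e-9, 1.2 * s, 200e6 * s);
    return scaled / nominal;
}
```

Halving both voltage and frequency (s = 0.5) cuts dynamic power to one eighth, which is why DVFS pays off even when the work takes twice as long.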
Power Manager Implementation
A power management policy consists of algorithms that use input information and parameter settings to generate commands to steer mechanisms [Pouwelse03].
[Figure: the policy takes the workload and the operational conditions as inputs and, according to its parameter settings, issues commands that select among the available operating points]
A dynamic power management system is a set of rules and procedures that move the system from one operating point to another as events occur [IBM/Montavista 02].
Power Manager Components
Monitoring: utilization, idle times, busy times.
Prediction: averaging (e.g. EMA), filtering (e.g. LMS); per-task based (e.g. Vertigo) or global utilization (e.g. Grunwald).
Control: shutdown, DVFS; open-loop or closed-loop (e.g. adaptive control).
Update rule: establishes the decision points.
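An averaging predictor like the EMA mentioned above is a one-liner at each decision point; a sketch (the alpha value and the busy-time units are illustrative):

```c
/* Exponential moving average: the prediction for the next interval
 * is a blend of the latest observation and the previous prediction. */
typedef struct { double alpha, value; } ema_t;

void ema_init(ema_t *e, double alpha, double first) {
    e->alpha = alpha;   /* 0 < alpha <= 1: higher reacts faster */
    e->value = first;
}

double ema_update(ema_t *e, double observed) {
    e->value = e->alpha * observed + (1.0 - e->alpha) * e->value;
    return e->value;    /* predicted workload for the next interval */
}
```

The alpha parameter is exactly the reactivity/stability trade-off discussed under Limitations below: a small alpha smooths out oscillations but adapts slowly to workload changes.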
Traditional Approach
[Figure: per-task and global utilization monitors plus an idle-time monitor feed a workload predictor and an idle-time predictor, which drive the DVFS controller and the shutdown controller; an update rule establishes the decision points for TASK 1 and TASK 2]
Limitations
Assuming no specific info from the applications is available, traditional approaches are based on observation of the utilization history.
Slow adaptation impacts system reactivity; specific techniques for tracking interactive tasks have been proposed [Flautner2002]. For soft real-time (multimedia) applications, deadlines may be missed. Frequent voltage oscillations impact energy efficiency: power has a square relationship with voltage, and switching has a cost (power/time/functionality).
Multimedia applications can be represented as communicating objects, e.g. the GStreamer multimedia framework: objects connected through pads, passing data.
21/04/23 — OSHMA Workshop, Brasov, Romania
Streaming Applications
Multimedia streaming applications are going multicore: objects are mapped to tasks that are distributed on the cores to increase performance. Specific tasks can be executed by hardware accelerators such as IPU and GPU units.
[Figure: pipeline stages P0..P3 mapped onto CORE #0, CORE #1 and CORE #2, passing data between them]
Overcoming Traditional Approaches
Key observation: multimedia applications are mapped onto MPSoCs as communicating tasks, i.e. a pipeline with single and parallel blocks (split-join) communicating through queues. Feedback paths are also common.
[Figure: software FM radio as a split-join pipeline of tasks P0..P5, with the last stage feeding an external peripheral]
Frequency, Voltage Setting Middleware
[Figure: pipeline stages N and N+1 mapped on processors N, N+1 and N+2, each with its own operating system and a per-processor frequency controller in the OS/middleware layer, on top of a communication and synchronization layer; each frequency controller monitors the occupancy of the queue between data in and data out]
• Migration + dynamic f,Vdd setting is critical for energy management
– DVFS for multi-stage producer-consumer streaming exploits info on the occupancy of the synchronization queues
– At equilibrium, the average output rate should match the input rate in each queue; the queue occupancy level is monitored to adjust PE speed and Vdd
[Carta TECS07]
Middleware Support
Almost OS-independent: if interrupts are used, an OS-specific ISR must be written. Easy integration into the communication library; support for GStreamer and OpenMAX.

Producer:  main() { produce_data(); write_queue(); }
Consumer:  main() { read_queue(); consume_data(); }

On every queue operation the communication library runs:
    check_queue_level();
    run_control_algorithm();
    set_frequency_and_voltage();
Feedback Controlled DVS
A technique to perform run-time energy optimization of pipelined computation in MPSoCs. Queue occupancy provides an estimate of the level of performance required of the preceding block. Unlike traditional approaches, the idea is to look at the occupancy level of the inter-processor queues to compute the speed of the processing elements.
[Figure: pipeline of processing elements connected by queues; each queue's occupancy drives the speed control of the element feeding it]
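The speed-control box can be sketched as a simple proportional controller; the gain, setpoint and frequency bounds below are invented for the example, and real implementations (e.g. the one in [Carta TECS07]) are more elaborate:

```c
/* Proportional DVFS controller for the producer of one queue:
 * a nearly empty output queue means the consumer is starving and
 * the producer must speed up; a full one lets it slow down.
 * occupancy and setpoint are fill fractions in [0, 1]. */
double next_frequency(double f_cur, double occupancy, double setpoint,
                      double gain, double f_min, double f_max) {
    double f = f_cur + gain * (setpoint - occupancy);
    if (f < f_min) f = f_min;   /* clamp to the platform's legal range */
    if (f > f_max) f = f_max;
    return f;
}
```

Called at each decision point, this drives the queue toward the setpoint, matching the equilibrium condition above where the average output rate equals the input rate.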
Energy/power optimization is NOT thermal optimization!
Need for temperature awareness
Thermal optimization
OS-MPSoC Thermal Studies
Focus on embedded multimedia streaming and interactive applications
Efficient automatic code parallelization for embedded multiprocessors
Efficient communication and synchronization infrastructure
Static + dynamic task allocation for performance/energy/thermal balancing
EU projects: PREDATOR, REALITY
Research directions