Software for embedded multiprocessors
Introduction on embedded OS Code parallelization Interprocessor communication Task allocation Power management
Introduction on embedded operating systems
OS Overview
Kernel, device drivers, boot loader, user interface, file system and utilities.
Kernel components: interrupt handler, scheduler, memory manager, system services (networking and IPC).
The kernel runs in a protected memory space (kernel space) and has full access to the HW, while apps run in user space.
Apps communicate with the kernel via system calls.
[Figure: apps call library functions such as printf() and strcpy(); library functions and apps invoke system calls such as write() and open()]
OS Overview (II)
The operating system takes control of execution on: timer interrupts, I/O interrupts, system calls, and exceptions (undefined instruction, data abort, page faults, etc.).
Processes
A process is a unique execution of a program. Several copies of a program may run
simultaneously or at different times. A process has its own state:
registers; memory.
The operating system manages processes.
Process state
A process can be in one of three states: executing on the CPU; ready to run; waiting for data.
[Figure: state diagram; executing goes to waiting when it needs data, waiting goes to ready when it gets data, ready goes to executing when it gets the CPU, and executing goes back to ready when preempted]
Processes and CPUs
Activation record: copy of the process state.
Context switch: the current CPU context goes out; the new CPU context goes in.
[Figure: the CPU (PC, registers) exchanges its context with the activation records of process 1, process 2, ... stored in memory]
Terms
Thread = lightweight process: a process that shares memory space with other processes.
Reentrancy: ability of a program to be executed several times with the same results.
Processes in POSIX
Create a process with fork:
the parent process keeps executing the old program;
the child process executes a new program.
[Figure: process a forks into process a (parent) and process b (child)]
The fork call creates the child:

childid = fork();
if (childid == 0) {
    /* child operations */
} else {
    /* parent operations */
}
execv()
Overlays the child with new code:

childid = fork();
if (childid == 0) {
    execv("mychild", childargs);
    perror("execv");
    exit(1);
}

[Figure: execv() loads the file with the child code]
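A complete, runnable version of the fork/execv pattern above; the helper name and the /bin/true child program are ours, chosen only so the example can run:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, overlay it with a new program via execv,
 * and return the child's exit status to the parent. */
int spawn_and_wait(const char *path, char *const argv[]) {
    pid_t childid = fork();
    if (childid == 0) {
        execv(path, argv);   /* only returns on failure */
        perror("execv");
        exit(127);
    }
    int status;
    waitpid(childid, &status, 0);   /* parent blocks until the child exits */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that execv never returns on success: the perror/exit pair after it runs only when the overlay fails.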
Context switching
Who controls when the context is switched?
How is the context switched?
Co-operative multitasking
Improvement on co-routines: hides the context switching mechanism, but still relies on processes to give up the CPU. Each process allows a context switch at a cswitch() call. A separate scheduler chooses which process runs next.
Problems with co-operative multitasking
Programming errors can keep other processes out: a process never gives up the CPU, or a process waits too long to switch, missing input.
Context switching
Must copy all registers to activation record, keeping proper return value for PC.
Must copy new activation record into CPU state.
How does the program that copies the context keep its own context?
Preemptive multitasking
Most powerful form of multitasking: the OS controls when contexts switch and determines which process runs next. A timer is used to interrupt the CPU and call the OS:
[Figure: timer interrupt driving the CPU]
Flow of control with preemption
[Figure: timeline; P1 runs, a timer interrupt invokes the OS, P1 resumes, a second interrupt invokes the OS, which switches to P2]
Preemptive context switching
The timer interrupt gives control to the OS, which saves the interrupted process's state in an activation record. The OS chooses the next process to run and installs the desired activation record as the current CPU state.
Operating systems
The operating system controls resources: who gets the CPU; when I/O takes place; how much memory is allocated.
The most important resource is the CPU itself. CPU access is controlled by the scheduler.
Design Issues
Kernel space / user space / real-time space
Monolithic versus micro-kernel:
Monolithic: OS services (including device drivers, networking and file system) run in privileged mode (easier to make efficient) (Linux, WinNT)
Microkernel: privileged mode only for task management, scheduling, IPC, interrupt handling, memory management (more robust) (QNX, VxWorks)
Preemptable kernel or not
Memory management versus shared memory
Dedicated versus general
Embedded vs General Purpose OS
Small footprint
Stability (must run for years without manual intervention)
Hardware watchdogs
Little power
Autonomous reboot (safely and instantly)
Taxonomy
High-end embedded systems: down-sized derivatives of existing general-purpose OSes (routers, switches, PDAs, set-top boxes).
Deeply embedded OSes: small OSes with a handful of basic functions, designed from the ground up for a particular application. They typically lack high-performance GUI and networking (automotive control, digital cameras, mobile phones). They are statically linked to the application: after compilation, the OS kernel and the applications are concatenated into a single package that can be loaded onto the embedded machine.
Run-time environment: boot routine + embedded libraries. Java and C++ offer functionalities such as memory management, threading, task synchronization and exception handling.
Embedded operating system
[Figures: typical OS configuration, user programs on top of the operating system, which runs on the hardware; typical embedded configuration, a single user program including operating system components, running directly on the hardware]
Real-time operating system
Dynamic VS Static Loading
Dynamic loading: the OS is loaded as a separate entity and applications are dynamically loaded into memory (more flexibility, code relocation is needed) (e.g. uClinux).
Static loading: the OS is linked and loaded together with the applications (no flexibility, higher predictability) (e.g. eCos, RTEMS); the OS is a set of libraries that provide OS services.
How about:
memory protection? (shared address space)
system calls? (no CPU mode change required)
process creation (fork, exec)? (shared address space, no overloading)
file system? (only for input/output data)
Static Loading
No address space separation: user applications run with the same access privilege as the kernel. Functions are accessed as plain function calls, not "system calls": no need for copying parameters and data, no need for state saving. Speed and control.
Dynamic Loading
[Figure: with dynamic loading, the OS sits at a constant address in system memory from boot time, while processes are loaded from the file system at run time and relocated (compile-time addresses != run-time addresses, address space separation)]
Focus on software for embedded multiprocessors
Embedded vs. General Purpose
Embedded applications: asymmetric multi-processing
Differentiated processors; specific tasks are known early and mapped to dedicated processors; configurable and extensible processors for performance and power efficiency.
Communication: coherent memory, shared local memories, HW FIFOs, other direct connections.
Dataflow programming models.
Classical example: smart mobile, i.e. RISC + DSP + media processors.

Server applications: symmetric multi-processing
Homogeneous cores; general tasks known late; tasks run on any core; high-performance, high-speed microprocessors.
Communication: large coherent memory space on the multi-core die or bus.
SMT programming models (Simultaneous Multi-Threading).
Examples: large server chips (e.g. Sun Niagara, 8 cores x 4 threads), scientific multi-processors.
Parallel programming of embedded multiprocessors
Parallelism & Programming Models
MP is difficult: concurrency, and "fear of concurrency". No robust and general models to automatically extract concurrency in 20-30+ years of research. Many programming models/libraries: SMT, SMP, OpenMP, MPI (Message Passing Interface). Users manually modify code: concurrent tasks or threads, communications, synchronisation.
Today: coarse-grained (whole-application / data-wise) concurrency; unmodified source + MP scheduler; API for communications and synchronisation.
Sequential execution model
The most common: supported by traditional (imperative) languages (C, C++, Fortran, etc.), with a huge bulk of legacy code.
The best understood: we are trained to solve problems algorithmically (as a sequence of steps), and microprocessors were originally designed to run sequential code.
The easiest to debug: tracing the state of the CPU, step-by-step execution.
But... it HIDES parallelism!!
Types of Parallelism
Instruction Level Parallelism (ILP): compilers & HW are mature.
Task parallelism: parallelism explicit in the algorithm; between filters without a producer/consumer relationship.
Data parallelism: between iterations of a stateless filter; placed within a scatter/gather pair (fission); filters with state cannot be parallelized this way.
Pipeline parallelism: between producers and consumers; stateful filters can be parallelized.
[Figure: stream graph showing scatter/gather pairs (data parallelism), independent branches (task parallelism) and producer/consumer chains (pipeline parallelism)]
Parallelizing Loops: a Key Problem
FORALL: no "loop-carried dependences", fully parallel.
FORACROSS: some "loop-carried dependences".
90% of execution time is in loops. Automatic extraction has had partial success, mostly on "well-behaved" loops. Challenges: dependency analysis and its interaction with data placement. Cooperative approaches are common: the programmer drives automatic parallelization (OpenMP).
Parallelized loops rely on barrier synchronization.
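A minimal sketch of how a FORALL loop with no loop-carried dependences can be split across worker threads with pthreads; the chunking scheme, sizes and names are ours, not from the slides:

```c
#include <pthread.h>

#define N 1000
#define NTHREADS 4

static double a[N], b[N];

struct chunk { int lo, hi; };

/* Each worker executes a disjoint range of the FORALL iteration space:
 * no loop-carried dependences, so no synchronization inside the loop. */
static void *body(void *arg) {
    struct chunk *c = arg;
    for (int i = c->lo; i < c->hi; i++)
        b[i] = 2.0 * a[i];
    return NULL;
}

void forall_parallel(void) {
    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        c[t].lo = t * N / NTHREADS;          /* static block partition */
        c[t].hi = (t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, body, &c[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);          /* join plays the barrier role */
}
```

Here the final join is the barrier: the serial code after the loop may not run until every chunk has finished.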
Barrier with Pthreads
[Figure: in the serial region, the master core alone initializes the synchronization structures (pthread_mutex_init()) and creates the workers (pthread_create()); in the parallel region all threads meet on a BARRIER]
Pthreads on Heterogeneous CPUs?
ISSUES: there is an OS running on each core, and the master core has no means to fork new threads on the slave nodes; the use of pthreads is not a suitable solution.
SOLUTION: standalone implementation; master/slave cores instead of threads; synchronization through shared memory.
Heterogeneous MPSoC
[Figure: a master CPU and three slave CPUs, each with a private memory, connected through an interconnect to a shared memory]
SPMD Barrier
In the serial region, all cores initialize the synchronization structures and the common data in shared memory. Additional serial code is executed only by the master core while the slaves wait on the barrier: the slaves notify their presence on the barrier to the master, and the master releases them as soon as it is ready to start the parallel region.
Code implementation flow
[Figure: the original C code goes through the parallel compiler to produce parallel code, which gcc compiles into one binary per CPU; the binaries run on the MPARM platform (master CPU and slave CPUs with private memories, shared memory, interconnect)]
Runtime Library
The runtime library is responsible for:
initializing the needed synchronization features, creating new worker threads (in the original implementation) and coordinating their parallel execution over multiple cores;
providing the implementation of the synchronization facilities (locks, barriers).
Code execution: each CPU executes the same program. Based on the CPU id, we separate the portions of code executed by the master and by the slaves. The master CPU executes the serial code and initializes the synchronization structures in shared memory; the slave CPUs only execute the parallel regions of code, behaving like the typical slave threads.
[Figure: the same program image runs on the master CPU and on each slave CPU of the MPSoC (private memories, shared memory, NoC interconnect)]
int main() {
    ...
    if (MASTERID) {
        serial code
        synchronization
    }
    ...
    if (SLAVEID) {
        parallel code
    }
    ...
}
Synchronization structures
(barriers, locks)
Synchronization
Parallel programming through shared memory requires global and point-to-point synchronization. On symmetric architectures, implementations use the pthreads library synchronization facilities; on heterogeneous architectures, hardware semaphores must be used.

void lock(pthread_mutex_t *lock)   { pthread_mutex_lock(lock); }
void unlock(pthread_mutex_t *lock) { pthread_mutex_unlock(lock); }

/* hardware semaphore: the read itself atomically tests and sets the location */
void lock(int *lock)   { while (*lock); }
void unlock(int *lock) { *lock = 0; }
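On a host platform without hardware semaphores, the same interface can be sketched portably with C11 atomics; atomic_flag_test_and_set plays the role of the hardware read-and-set (this is our illustration, not the library's code):

```c
#include <stdatomic.h>

/* Spinlock: the test-and-set must be a single atomic operation,
 * otherwise two cores can both observe the lock free and enter
 * the critical section together. */
void spin_lock(atomic_flag *l) {
    while (atomic_flag_test_and_set(l))
        ;   /* spin until this core is the one that sets the flag */
}

void spin_unlock(atomic_flag *l) {
    atomic_flag_clear(l);
}
```

The busy-wait body is empty on purpose: on an MPSoC the polling ideally targets local memory to keep the interconnect free, which is exactly the point made later about scratchpad polling.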
Typical Barrier implementation
Shared counters protected by locks:

struct barrier {
    lock_type lock;
    int entry_count;
    int exit_count;
} *bar;

LOCK(bar->lock);
bar->entry_count++;
if (bar->entry_count < nproc) {
    UNLOCK(bar->lock);
    while (bar->entry_count != nproc);
    LOCK(bar->lock);
    bar->exit_count++;
    if (bar->exit_count == nproc)
        bar->entry_count = 0x0;
    UNLOCK(bar->lock);
} else {
    bar->exit_count = 0x1;
    if (bar->exit_count == nproc)
        bar->entry_count = 0x0;
    UNLOCK(bar->lock);
}
while (bar->exit_count != nproc);
Barrier Implementation Issues
ISSUES: this approach is not very scalable. Every processor notifies its arrival at the barrier by increasing the value of a common shared variable; as the number of cores increases, contention for the shared resource may increase significantly.
POSSIBLE SOLUTION: a vector of flags, one per core, instead of a single shared counter.
New Barrier Implementation

typedef struct Barrier {
    int entered[NSLAVES];
    int usecount;
} Barrier;

void Slave_Enter(Barrier *b, int id) {
    int ent = b->usecount;
    b->entered[id] = 1;
    while (ent == b->usecount);
}

void Master_Wait(Barrier *b, int num_procs) {
    int i;
    for (i = 1; i < num_procs; i++)
        while (!b->entered[i]);
    /* Reset flags to 0 */
}

void Master_Release(Barrier *b) {
    b->usecount++;
}
No busy-waiting due to contention on a shared counter: each slave updates its own flag, and only the master spin-waits on each slave's flag to detect its presence on the barrier. The slaves spin-wait only on a counter that is updated by the master alone.
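A host-runnable rendering of this vector-of-flags barrier, using C11 atomics in place of uncached shared memory (an assumption of ours; on the target the flags would live in shared memory and core 0 would be the master, which is why the wait loop starts at id 1):

```c
#include <stdatomic.h>

#define NSLAVES 4   /* illustrative: total cores, slave ids are 1..NSLAVES-1 */

typedef struct Barrier {
    atomic_int entered[NSLAVES];
    atomic_int usecount;
} Barrier;

/* A slave raises its own flag, then spins on the release counter:
 * no two slaves ever write the same location, so no lock is needed. */
void Slave_Enter(Barrier *b, int id) {
    int ent = atomic_load(&b->usecount);
    atomic_store(&b->entered[id], 1);
    while (atomic_load(&b->usecount) == ent)
        ;
}

/* The master polls each slave's private flag, then resets it. */
void Master_Wait(Barrier *b, int num_procs) {
    for (int i = 1; i < num_procs; i++) {
        while (!atomic_load(&b->entered[i]))
            ;
        atomic_store(&b->entered[i], 0);   /* reset flags to 0 */
    }
}

/* Bumping the counter releases every waiting slave at once. */
void Master_Release(Barrier *b) {
    atomic_fetch_add(&b->usecount, 1);
}
```

Each slave samples usecount before raising its flag, so a release that happens after Master_Wait returns can never be missed by a slave the master has already seen.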
Compiler aware of synchronization cost?
A lightweight implementation of the synchronization structures allows parallelized code with a large number of barrier instructions to still perform better than the serial version.
It would be useful to let the compiler know about the cost of synchronization. This would allow it not only to select the parallelizable loops, but also to establish whether the parallelization is worthwhile.
For workloads well distributed across the cores the proposed barrier performs well, but for a high degree of load imbalance an interrupt-based implementation may be better suited. The compiler may choose which barrier instruction to insert depending on the amount of busy waiting.
[Chart: upper triangular 32x32 matrix filling; cost (*) for Serial, 1, 2, 4 and 8 cores, split into Tinit, Tsync and Texec, each with ideal and overhead components]
Performance analysis
Time needed for initializing synchronization structures in shared memory was measured on a single core simulation.
It is expected to be invariant with increasing numbers of cores.
Simultaneous accesses to the shared memory generate a traffic on the bus that produces a significant overhead.
Ideal synchronization time was estimated for the various configurations making the master core wait on the barrier after all slaves entered.
In the real case synchronization requires additional waiting time.
Those additional cycles also include the contribution due to polling on the synchronization structures in shared memory.
Ideal parallel execution time was calculated simulating on one core the computational load of the various configurations.
As expected, it almost halves with the doubling of the number of working cores.
Overall execution time is lengthened by the waiting cycles due to the concurrent accesses to shared memory.
(*) Overall number of cycles normalized by the number of cycles spent for an ideal bus transaction (1 read + 1 write)
[Chart: upper triangular 1024x1024 matrix filling; cost for Serial, 1, 2, 4 and 8 cores, split into Tinit, Tsync and Texec, each with ideal and overhead components]
Performance analysis 2
For small computational loads (i.e. few matrix elements) initialization and synchronization have a big impact on overall performance: no speedup. Possible optimizations on the barriers to reduce accesses to shared memory; possible optimizations on initialization, serializing/interleaving accesses to the bus.
For bigger computational loads the initialization and synchronization contributions go completely unnoticed: big speedup margin. Speedup is heavily limited by frequent accesses to shared memory. Would pure computation follow the profile of the blue bars? Would cacheable shared memory regions help?
Example: MP-Queue library
MP-Queue is a library intended for message passing among different cores in an MPSoC environment.
Highly optimized C implementation, with low-level exploitation of data structures and semaphores: low overhead; data transfer optimized for performance (analyses of disassembled code); synch operations optimized for minimal interconnect utilization.
Producer-consumer paradigm, different topologies: 1-N, N-1, N-N.
Communication library API
1. Autoinit_system(): every core has to call it at the very beginning; allocates the data structures and prepares the semaphore arrays.
2. Autoinit_producer(): to be called by a producer core only; requires a queue id; creates the queue buffers and signals their position to the n consumers.
3. Autoinit_consumer(): to be called by a consumer core only; requires a queue id; waits for the n producers to be bound to the consumer structures.
4. Read(): gets a message from the circular buffer (consumer only).
5. Write(): puts a message into the circular buffer (producer only).
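MP-Queue itself targets bare-metal MPSoCs; as a behavioural sketch of the producer-consumer circular buffer behind Read()/Write(), here is a 1-1 queue built on pthreads primitives (all names and sizes here are ours, not the MP-Queue API):

```c
#include <pthread.h>

#define QSIZE 8

typedef struct {
    int buf[QSIZE];
    int head, tail, count;      /* extraction/insertion indexes + fill level */
    pthread_mutex_t mtx;
    pthread_cond_t not_empty, not_full;
} queue_t;

void queue_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->mtx, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Producer side: blocks while the circular buffer is full. */
void queue_write(queue_t *q, int msg) {
    pthread_mutex_lock(&q->mtx);
    while (q->count == QSIZE)
        pthread_cond_wait(&q->not_full, &q->mtx);
    q->buf[q->tail] = msg;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);   /* notify one consumer */
    pthread_mutex_unlock(&q->mtx);
}

/* Consumer side: blocks while the buffer is empty. */
int queue_read(queue_t *q) {
    pthread_mutex_lock(&q->mtx);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->mtx);
    int msg = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);    /* notify one producer */
    pthread_mutex_unlock(&q->mtx);
    return msg;
}
```

The single pthread_cond_signal corresponds to the round-robin/fixed notification described below; broadcasting instead would correspond to notify-all.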
Communication semantics
Notification mechanisms available: round robin; notify all; target core specifying.
The i-th producer gets the write position index, puts the data onto the buffer, and signals either one consumer (round-robin / fixed) or all consumers (notify all).
The i-th consumer gets the read position index, gets the data from the buffer, and signals either one producer (round-robin / fixed) or all producers (notify all).
[Figure: producers P1, P2 and consumers C1, C2 around a queue]
Architectural Flexibility
1. Multi-core architectures with distributed memory.
2. Purely shared-memory-based architectures.
3. Hybrid platforms.
Transaction Chart
Shared bus accesses are minimized as much as possible: local polling on scratchpad memories; insertion and extraction indexes are stored in shared memory and protected by a mutex. The data transfer section involves the shared bus: critical for performance.
[Sequence diagrams: 1 producer and 1 consumer (parallel activity); synch time vs. pure data transfer; local polling on the scratchpad semaphore; signaling to the remote core's scratchpad; "pure" data transfer to and from the FIFO buffer in shared memory; message sizes of 8 and 64 words]
Communication efficiency
Comparison against ideal point-to-point communication. 1-N queues leverage bus pipelining: bigger asymptotic efficiency. Interrupt-based notification allows more than one task per core, at the price of significant overhead (up to 15%). Low-level optimizations are critical! (16 and 32 words per token)
Growth of assembly length for copy sections: from 32 words on, the gcc compiler avoids emitting the multiple-load / multiple-store copy sequence (it is not produced any more), since code size would otherwise keep rising. Where high throughput is required, a less compact but more optimized representation is desired.
Compiler-aware optimization benefits
The compiler may be forced to unroll the data transfer loops: about 15% improvement with 32-word messages. A typical JPEG 8x8 block is encoded in a 32-word struct (8x8x16-bit data).
Task allocation in MPSoC architectures
Application Mapping
The problem of allocation, scheduling and frequency selection for task graphs on multi-processors in a distributed real-time system is NP-hard.
New tool flows are needed for efficient mapping of multi-task applications onto hardware platforms.
[Figure: a task graph (T1..T8) is allocated to processors Proc. 1..Proc. N, each with a private memory and connected by an interconnect, then scheduled with frequency selection so that all tasks complete within the deadline]
Approach
Focus: statically scheduled applications.
Objectives: a complete approach to allocation, scheduling and frequency selection, with high computational efficiency w.r.t. commercial solvers and high accuracy of the generated solutions; a new methodology for multi-task application development, to quickly develop multi-task applications and to easily apply the optimal solution found by our optimizer.
Target architecture
An architectural template for a message-oriented distributed-memory MPSoC: support for message exchange between the computation tiles; single-token communication; availability of local memory devices at the computation tiles and of remote memories for program data.
Several MPSoC platforms available on the market match this template: the Silicon Hive Avispa-CH1 processor; the Cradle CT3600 family of multiprocessors; the Cell processor; the ARM MPCore platform.
The throughput requirement is reflected in the maximum tolerable scheduling period T of each processor.
[Figure: activities A..N repeating within the period T]
A task graph represents:
– a group of tasks T
– task dependencies
– execution times expressed in clock cycles: WCN(Ti)
– communication times (writes & reads) expressed as WCN(WTiTj) and WCN(RTiTj)
These values can be back-annotated from functional simulation.
Application model
[Figure: task graph of Task1..Task6 annotated with computation costs WCN(Ti) on the nodes and communication costs WCN(WTiTj), WCN(RTiTj) on the arcs]
//Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

//Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

uint queue_consumer[..][..] = { {0,1,1,0,..},
    {0,0,0,1,1,.}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};
Example
Number of nodes: 12. Graph of activities. Node type: normal, branch, conditional, terminator. Node behaviour: or, and, fork, branch. Number of CPUs: 2. Outputs: task allocation, task scheduling, arc priorities, frequency & voltage.

#define TASK_NUMBER 12

[Figure: activity graph of 12 nodes (N1, B2, B3, C4..C7, N8..N11, T12) connected by arcs a1..a14 with fork/or/and/branch behaviours, and the resulting schedule of the activities on processors P1 and P2 within the deadline]
Queue ordering optimization
Communication ordering affects system performance.
[Figure: tasks T1..T6 mapped on CPU1 and CPU2; with communication C3 scheduled before C1, T4 must wait while T3 runs]
Queue ordering optimization
Communication ordering affects system performance.
[Figure: with the communication order changed (C1 before C3), T4 is re-activated as soon as its input is available]
Synchronization among tasks
[Figure: task graph with T1 feeding T2, T3 and T4 through communications C1..C3; on Proc. 2, T4 is suspended until its input data arrive; non-blocking semaphores]
Logic Based Benders Decomposition
The approach decomposes the problem into 2 sub-problems:
Allocation & assignment of frequency settings → INTEGER PROGRAMMING; objective function: minimizing the communication cost and the energy consumption during execution and communication of tasks, under memory constraints.
Scheduling → CONSTRAINT PROGRAMMING; objective function: minimizing the energy consumption during frequency switching, under the real-time constraint.
The IP master passes a valid allocation to the CP scheduler; when scheduling fails, a no-good (linear constraint) is returned to the master. The process continues until the master problem and the sub-problem converge, providing the same value. The methodology has been proven to converge to the optimal solution [J.N. Hooker and G. Ottosson].
Application Development Methodology
[Figure: development flow; a characterization phase runs the application on a simulator to produce application profiles for the CTG; the optimization phase feeds them to the optimizer, which returns the allocation and the schedule; the optimal SW application implementation is then built with the application development support and executed on the platform]
GSM Encoder
Throughput required: 1 frame / 10 ms, with 2 processors and 4 possible frequency & voltage settings.
Task graph: 10 computational tasks, 15 communication tasks.
Without optimizations: 50.9 μJ. With optimizations: 17.1 μJ (−66.4%).
Energy Management
o Basic techniques: Shutdown and DVFS
o Advanced techniques: Feedback control
Urbino, 19-10-2006
Energy Optimization in MPSoCs
Two main steps:
Workload allocation to the processing elements: task mapping and scheduling.
After workload allocation, the resources of the processing elements should be adapted to the required performance to minimize energy consumption: shut-down and voltage scaling.
Shut-Down
When the system is idle the processor can be placed into a low-power state. Deeper states trade reactivity for power level:
– core clock gating (woken up by a timer interrupt): no need for context restore
– core power gating (woken up by on-chip peripherals): context restore needed
– chip power gating (woken up by external, on-board interrupts): context restore needed
Frequency/Voltage Scaling
DVFS: adapting frequency/voltage to the workload. Frequency must be scaled together with voltage to keep the circuit functional. Dynamic power goes with the square of Vdd and linearly with the clock speed: P = Ceff * Vdd^2 * f. Scaling V and f by a factor s -> power scales as s^3.
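A quick numeric check of the cubic rule (the Ceff, Vdd and f values below are arbitrary placeholders; the ratio is independent of them):

```c
/* Dynamic power model from the slide: P = Ceff * Vdd^2 * f. */
double dyn_power(double ceff, double vdd, double f) {
    return ceff * vdd * vdd * f;
}

/* Ratio of scaled to nominal power when both Vdd and f scale by s:
 * (s*Vdd)^2 * (s*f) / (Vdd^2 * f) = s^3, for any Ceff, Vdd, f. */
double power_scaling(double s) {
    double nominal = dyn_power(1e-9, 1.2, 200e6);
    double scaled  = dyn_power(1e-9, 1.2 * s, 200e6 * s);
    return scaled / nominal;
}
```

Halving both voltage and frequency (s = 0.5) cuts dynamic power to one eighth, which is why DVFS pays off even when the work takes twice as long.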
Power Manager Implementation
A power management policy consists of algorithms that use input information and parameter settings to generate commands to steer mechanisms [Pouwelse03].
[Figure: the policy takes the workload and the operational conditions as inputs and, according to its parameter settings, issues commands that select among the available operating points]
A dynamic power management system is a set of rules and procedures that move the system from one operating point to another as events occur [IBM/Montavista 02].
Power Manager Components
Monitoring: utilization, idle times, busy times.
Prediction: averaging (e.g. EMA), filtering (e.g. LMS); per-task based (e.g. Vertigo) or global utilization (e.g. Grunwald).
Control: shutdown, DVFS; open-loop or closed-loop (e.g. adaptive control).
Update rule: establishes the decision points.
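An averaging predictor like the EMA mentioned above is a one-liner at each decision point; a sketch (the alpha value and the busy-time units are illustrative):

```c
/* Exponential moving average: the prediction for the next interval
 * is a blend of the latest observation and the previous prediction. */
typedef struct { double alpha, value; } ema_t;

void ema_init(ema_t *e, double alpha, double first) {
    e->alpha = alpha;   /* 0 < alpha <= 1: higher reacts faster */
    e->value = first;
}

double ema_update(ema_t *e, double observed) {
    e->value = e->alpha * observed + (1.0 - e->alpha) * e->value;
    return e->value;    /* predicted workload for the next interval */
}
```

The alpha parameter is exactly the reactivity/stability trade-off discussed under Limitations below: a small alpha smooths out oscillations but adapts slowly to workload changes.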
Traditional Approach
[Figure: per-task and global utilization monitors plus an idle-time monitor feed a workload predictor and an idle-time predictor, which drive the DVFS controller and the shutdown controller; an update rule establishes the decision points for TASK 1 and TASK 2]
Limitations
Assuming no specific info from the applications is available, traditional approaches are based on observation of the utilization history.
Slow adaptation impacts system reactivity; specific techniques for tracking interactive tasks have been proposed [Flautner2002]. For soft real-time (multimedia) applications, deadlines may be missed. Frequent voltage oscillations impact energy efficiency: power has a square relationship with voltage, and switching has a cost (power/time/functionality).
Multimedia applications can be represented as communicating objects, e.g. the GStreamer multimedia framework: objects connected through pads, passing data.
21/04/23 — OSHMA Workshop, Brasov, Romania
Streaming Applications
Multimedia streaming applications are going multicore: objects are mapped to tasks that are distributed on the cores to increase performance. Specific tasks can be executed by hardware accelerators such as IPU and GPU units.
[Figure: pipeline stages P0..P3 mapped onto CORE #0, CORE #1 and CORE #2, passing data between them]
Overcoming Traditional Approaches
Key observation: multimedia applications are mapped onto MPSoCs as communicating tasks, i.e. a pipeline with single and parallel blocks (split-join) communicating through queues. Feedback paths are also common.
[Figure: software FM radio as a split-join pipeline of tasks P0..P5, with the last stage feeding an external peripheral]
Frequency, Voltage Setting Middleware
[Figure: pipeline stages N and N+1 mapped on processors N, N+1 and N+2, each with its own operating system and a per-processor frequency controller in the OS/middleware layer, on top of a communication and synchronization layer; each frequency controller monitors the occupancy of the queue between data in and data out]
• Migration + dynamic f,Vdd setting is critical for energy management
– DVFS for multi-stage producer-consumer streaming exploits info on the occupancy of the synchronization queues
– At equilibrium, the average output rate should match the input rate in each queue; the queue occupancy level is monitored to adjust PE speed and Vdd
[Carta TECS07]
Middleware Support
Almost OS-independent: if interrupts are used, an OS-specific ISR must be written. Easy integration into the communication library; support for GStreamer and OpenMAX.

Producer:  main() { produce_data(); write_queue(); }
Consumer:  main() { read_queue(); consume_data(); }

On every queue operation the communication library runs:
    check_queue_level();
    run_control_algorithm();
    set_frequency_and_voltage();
Feedback Controlled DVS
A technique to perform run-time energy optimization of pipelined computation in MPSoCs. Queue occupancy provides an estimate of the level of performance required of the preceding block. Unlike traditional approaches, the idea is to look at the occupancy level of the inter-processor queues to compute the speed of the processing elements.
[Figure: pipeline of processing elements connected by queues; each queue's occupancy drives the speed control of the element feeding it]
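The speed-control box can be sketched as a simple proportional controller; the gain, setpoint and frequency bounds below are invented for the example, and real implementations (e.g. the one in [Carta TECS07]) are more elaborate:

```c
/* Proportional DVFS controller for the producer of one queue:
 * a nearly empty output queue means the consumer is starving and
 * the producer must speed up; a full one lets it slow down.
 * occupancy and setpoint are fill fractions in [0, 1]. */
double next_frequency(double f_cur, double occupancy, double setpoint,
                      double gain, double f_min, double f_max) {
    double f = f_cur + gain * (setpoint - occupancy);
    if (f < f_min) f = f_min;   /* clamp to the platform's legal range */
    if (f > f_max) f = f_max;
    return f;
}
```

Called at each decision point, this drives the queue toward the setpoint, matching the equilibrium condition above where the average output rate equals the input rate.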
Energy/power optimization is NOT thermal optimization!
Need for temperature awareness
Thermal optimization
OS-MPSoC Thermal Studies
Focus on embedded multimedia streaming and interactive applications
Efficient automatic code parallelization for embedded multiprocessors
Efficient communication and synchronization infrastructure
Static + dynamic task allocation for performance/energy/thermal balancing
EU projects: PREDATOR, REALITY
Research directions