Samar Abdi (slides courtesy of A. Gerstlauer, D. Gajski and R. Doemer)
Assistant Professor
Electrical and Computer Engineering
Concordia University
http://www.ece.concordia.ca/~samar
COEN 691B: Embedded System Design
Lecture 3: Computation Modeling
System Design Flow
[Figure: system modeling graph with a computation axis and a communication axis, each graded un-timed, approximate-timed, and cycle-timed. Points A to F mark the models listed below.]
A. System specification model
B. Timed functional model
C. Transaction-level model (TLM)
D. Bus cycle-accurate model (BCAM)
E. Computation cycle-accurate model (CCAM)
F. Cycle-accurate model (CAM)
• Abstraction based on level of detail & granularity
– Computation and communication
• System design flow
– Path from model A to model F
• Design methodology and modeling flow
– Set of models and transformations between models
Source: L. Cai, D. Gajski. “Transaction level modeling: An overview”, ISSS 2003
Lecture 3: Outline
• Profiling
• Timing estimation using processor model
• RTOS modeling
• Hardware abstraction layer modeling
Profiling
• Input specification MoC
– Hierarchy
– Computation & communication
• Multi-dimensional analysis
– Multi-entities
• Behavior, channel, port, variable
– Multi-metrics
• Operation, traffic, storage
• Static, dynamic
– Multi-levels
• Application, transaction, bus-functional
[Figure: profiling flow with the stages Instrumentation, Simulation, and Static Analysis and the artifacts Application, Instr. Appl, Counters, and Profiled App.; the example behavior B contains B1, B2, channel c, and variable v]
Profiling
• Instrumentation-based profiling
– B_b: the execution count of basic block b
• Enumerate execution paths
– C_{b,i,d}: number of computed characteristics for item type i and data type d in block b
– Data type d: float, int, ...
– Item type i: metric-dependent
• Specification metrics
– $R_{i,d} = \sum_b C_{b,i,d} \cdot B_b$
– $R = \sum_i \sum_d R_{i,d}$
• Example, with $B_1 = 1$, $B_3 = 3$, $C_{1,++,int} = 1$, $C_{3,++,int} = 2$:
– $R_{++,int} = \sum_b (B_b \cdot C_{b,++,int}) = 1 \cdot 1 + 3 \cdot 2 = 7$
Source: L. Cai, A. Gerstlauer, D. Gajski, “Retargetable Profiling for Rapid, Early System-Level Design Space Exploration,“ DAC, 2004.
int b, c;
if (a == 0) {   /* block with one int ++ (C = 1) */
    b++;
}
else {          /* block with two int ++ (C = 2) */
    b++;
    c++;
}
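The specification metric can be computed mechanically from the per-block counters. A minimal sketch in C, assuming the counts are held in plain arrays; the function name `spec_metric` and the array layout are illustrative and not taken from the cited profiler:

```c
#include <stddef.h>

/* Hypothetical helper: R_{i,d} = sum over blocks b of C_{b,i,d} * B_b,
 * for one fixed item type i and data type d.
 *   exec_count[b] = B_b,       execution count of basic block b
 *   item_count[b] = C_{b,i,d}, occurrences of the item in block b */
unsigned long spec_metric(const unsigned long *exec_count,
                          const unsigned long *item_count,
                          size_t num_blocks)
{
    unsigned long r = 0;
    for (size_t b = 0; b < num_blocks; b++)
        r += item_count[b] * exec_count[b];
    return r;
}
```

With the slide's values (execution counts 1 and 3, item counts 1 and 2) the function returns 7, matching the worked example.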
Retargeting
• Target machine model
– W_{i,d}: weights of the component to which the entity is mapped
• Manual
• Simulation
• Complex cost function/algorithm
• Implementation estimates
– $E = \sum_i \sum_d (R_{i,d} \cdot W_{i,d})$
– Time complexity: O(n)
[Figure: behavior B (B1, B2, channel c, variable v) mapped onto PE1, PE2, and Mem]
• Example
– $R(B1)_{++,int} = 7$
– $W(PE1)_{++,int} = 1$
– $E(B1,PE1)_{++,int} = 7 \times 1 = 7$
Source: L. Cai, A. Gerstlauer, D. Gajski, “Retargetable Profiling for Rapid, Early System-Level Design Space Exploration,“ DAC, 2004.
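The retargeting step is a weighted sum, which is why it runs in linear time. A minimal sketch in C, assuming the (i, d) pairs are flattened into one index; the name `impl_estimate` and the flattened layout are assumptions made for illustration:

```c
#include <stddef.h>

/* Hypothetical helper: E = sum over all (i,d) pairs of R_{i,d} * W_{i,d}.
 * The pairs are flattened into a single index k, so
 *   r[k] = specification metric R for pair k
 *   w[k] = weight W of the target component for pair k */
double impl_estimate(const double *r, const double *w, size_t num_pairs)
{
    double e = 0.0;
    for (size_t k = 0; k < num_pairs; k++)
        e += r[k] * w[k];   /* one multiply-add per pair: O(n) */
    return e;
}
```

For the slide's single pair (R = 7, W = 1) the estimate is 7. Retargeting to another PE only swaps the weight array; the profiled metrics are reused unchanged.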
Vocoder Example Profiling
Computational complexity of top-level Vocoder behaviors:
Behavior       Complexity
LP_Analysis    377.0 MOp
Open_Loop      337.1 MOp
Closed_Loop    478.7 MOp
Codebook       646.4 MOp
Update         43.6 MOp
Codebook operation mix:
Operation      Share
(x, int)       46.2%
(+, int)       33.5%
(-, int)       9.1%
(/, int)       7.1%
(others, int)  4.1%
• HW acceleration
• Floating-point not required
• Dedicated hardware multipliers
Vocoder Design Space Exploration
• Mapping of 8 top-level encoder behaviors onto ColdFire + DSP + HW
• 85:04h for 6561 alternatives (1.7s simulation + 3s refinement each)
• 100% fidelity
[Figure: transcoding delay (10 ms to 35 ms) plotted against cost (10 to 170) for the explored alternatives, with the timing constraint marked. Labeled points: HW (144.1, 12.24 ms) and SW (20.0, 30.73 ms).]
General Processor Micro-Architecture
• Basic computation component is a processor (PE)
– Programmable, general-purpose software processor (CPU)
– Programmable special-purpose processor (e.g. DSPs)
– Application-specific instruction set processor (ASIP)
– Custom hardware processor
• Functionality and timing
[Figure: PE made of a controller and a datapath, linked by control signals and status lines, with a bus interface and a clock CLK of period ∆t]
Processor Models (1)
• Structural RTL models
– Sub-cycle accurate
[Figure: two structural models. Hardware processor (HW): controller (state, next state logic, output logic) and datapath (register file, memory, FU1, bus interface), clocked by CLK. Software processor (CPU): controller (PC, IR, fetch, decode) and datapath (register file, ALU, load/store unit, data & program memory), clocked by CLK.]
Processor Models (2)
• Behavioral RTL/IS models
– Cycle accurate
[Figure: hardware processor (HW, clock HW_CLK) modeled as an FSMD; software processor (CPU, clock CPU_CLK) modeled by instruction set simulation (ISS) of the binary containing App., RTOS, and HAL]
Computation Modeling
• Application modeling
– Native process execution (C code)
– Back-annotated execution timing
• Processor modeling
– Operating system
• Real-time multi-tasking (RTOS model)
• Bus drivers (C code)
– Hardware abstraction layer (HAL)
• Interrupt handlers
• Media accesses
– Processor hardware
• Bus interfaces (I/O state machines)
• Interrupt suspension and timing
[Figure: CPU model with processes P1 and P2 on top of the OS and drivers (Drv), the HAL with ISR, interrupts, and the bus]
Process B1()
{
…
waitfor(15000);
…
waitfor(25000);
…
};
Application Layer
• High-level, abstract programming model
– Hierarchical process graph
• ANSI C leaf processes
• Parallel-serial composition
– Abstract, typed inter-process communication
• Channels
• Shared variables
• Timed simulation of application functionality
– Back-annotate timing
• Estimation or measurement (trace, ISS)
• Function or basic block level granularity
– Execute natively on simulation host
• Discrete event simulator
• Fast, native compiled simulation
[Figure: processes B1, B2, B3 and channels C1, C2 executing on the CPU along a logical time axis (0, 5, 10); leaf process code in p1.c]
...
void f() {
waitfor(5);
...
}
...
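Back-annotation can be illustrated with a single process and a logical clock. A minimal sketch in C, assuming a stand-in `waitfor()` that only advances a counter; a real discrete event simulator would suspend the calling process and run others in the meantime. The names `sim_time`, `now`, and `process_B1` are hypothetical, and the two delays are the ones shown for process B1 earlier in the lecture:

```c
/* Hypothetical stand-in for the simulator's logical clock. */
unsigned long long sim_time = 0;

/* Advance logical time by the annotated delay. In a real SLDL kernel
 * this call would also yield to the discrete event scheduler. */
void waitfor(unsigned long long delay)
{
    sim_time += delay;
}

unsigned long long now(void)
{
    return sim_time;
}

/* Host-compiled process with back-annotated block delays. */
void process_B1(void)
{
    /* ... functional code of the first block runs natively ... */
    waitfor(15000);   /* estimated execution time of block 1 */
    /* ... functional code of the second block runs natively ... */
    waitfor(25000);   /* estimated execution time of block 2 */
}
```

The functional code runs at host speed; only the `waitfor()` calls move logical time, so after one run of `process_B1()` the clock reads 40000 time units.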
Timing Estimation Input: Application Model
[Figure: application model with processes P1 to P4, channels C1 and C2, and variable v1]
• Application model consists of
– Processes for computation (e.g. P1, P2, P3, P4)
– Channels for communication (e.g. C1 between P1 and P3)
– Variables for storage (e.g. v1)
Application Model Objects
• Processes
– Symbolic representation of computation
– Contain C/C++ code imported from reference
• Process ports
– Symbolic representation of communication services required by processes
– Provide object orientation by allowing processes to connect to different channels
• Channels
– Symbolic representation of inter-process communication
– Implement communication services such as blocking, non-blocking, handshake, FIFO etc.
– Encapsulation for communication functions
• Variables
– Symbolic representation of data storage
Timing Estimation Input: Platform Architecture
[Figure: platform with CPU1 (OS1), CPU2 (OS2), HW, Mem, arbiter, Bus1, Bus2, and transducer TX]
• Platform consists of
– Hardware: PEs (e.g. CPU1, HW), buses (e.g. Bus1), memories (e.g. Mem), interfaces (e.g. Transducer)
– Software: operating systems (e.g. OS1) on SW PEs
Platform Objects
• Processing element (PE)
– Symbolic representation of computation resources
– Different types such as SW processors, HW IPs etc.
• Bus
– Symbolic representation of communication media
– Types include shared, point-to-point, link, crossbar etc.
• Memory
– Symbolic representation of physical storage
– May contain shared variables or SW program/data
• Transducer
– For protocol conversion and store-forward routing
– Necessary for PEs with different bus protocols
• Operating system (OS)
– Software platform for individual PEs
– Needed for scheduling multiple processes on a PE
Timing Estimation Input: Mapping
[Figure: processes P1 to P4, channels C1 and C2, and variable v1 placed on the platform (CPU1, CPU2, HW IP, Mem, arbiter, Bus1, Bus2, TX, one OS per CPU)]
• Processes → PEs
• Channels → Routes
• Variables → Memories
Mapping Rules
• Processes to PEs
– Each process in the application must be mapped to a PE
– Multiple processes may be mapped to SW PE with OS support
– Example: P1, P2 → CPU1
• Channels to Routes
– All channels between processes mapped to different PEs are mapped to routes in the platform
– Route consists of bus segments and interfaces
– Channel on each bus segment is assigned a unique address
• Variables to Memories
– Variables accessed by processes mapped to different PEs are mapped to shared memories
– All variables are assigned an address range depending on size
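The first two mapping rules lend themselves to a mechanical check. A minimal sketch in C, assuming processes and PEs are identified by small integers; the `struct channel` layout, the `has_route` flag, and the function name are hypothetical and only illustrate the rules:

```c
#include <stddef.h>

/* Hypothetical channel record: endpoints are process indices. */
struct channel {
    int src_proc;
    int dst_proc;
    int has_route;   /* nonzero if a platform route was assigned */
};

/* Returns 1 if the mapping obeys both rules, 0 otherwise.
 *   proc_to_pe[p] = index of the PE that process p is mapped to,
 *                   or -1 if the process is unmapped */
int mapping_is_valid(const int *proc_to_pe, size_t num_procs,
                     const struct channel *ch, size_t num_ch)
{
    /* Rule 1: every process must be mapped to a PE. */
    for (size_t p = 0; p < num_procs; p++)
        if (proc_to_pe[p] < 0)
            return 0;

    /* Rule 2: a channel that crosses PEs needs a route. */
    for (size_t c = 0; c < num_ch; c++) {
        int crosses = proc_to_pe[ch[c].src_proc] !=
                      proc_to_pe[ch[c].dst_proc];
        if (crosses && !ch[c].has_route)
            return 0;
    }
    return 1;
}
```

A channel whose endpoints share a PE needs no route, so the check only rejects cross-PE channels without one.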
Computation Timing Estimation
• Stochastic memory delay model
• DFG scheduling to compute basic block delay [DATE 08]
• RTOS model added for PEs with multiple processes
[Figure: timing estimation takes the process CDFG (basic blocks BB1, BB2, BB3 with if branches) and the processor model (datapath and controller) and produces a timed process in which the basic blocks are annotated with wait(t1), wait(t2), and wait(t3)]
Stochastic Memory Delay Model
• Assumption
– Cache and branch prediction hit rates are available in the data model
• Delay estimation
– Operation access overhead = Nop * ((1.0 - HRi) * (CD + Lmem))
– Data access overhead = Nld * ((1.0 - HRd) * (CD + Lmem))
– Branch prediction miss penalty = MPrate * Penalty
Memory/branch model:
• Cache: direct-mapped, 16K
– Icache hit rate: 97.79%
– Dcache hit rate: 69.96%
– Delay: 1
• Memory
– Delay: 8
• BrPredictPolicy: Taken
– Penalty: 260.00%
LLVM bytecode:
1: a = $i - 1
2: t1 = a + 2
3: t2 = $n * $m
4: t3 = t1 - t2
5: load b
6: t4 = b / 10
7: jmp
Memory/branch delay calculation:
• Mem. overhead = 4.1
• Branch delay = 1.2
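The memory overhead figure can be reproduced from the model data. A minimal sketch in C, under the following assumptions that the slide does not state explicitly: Nop = 7 (all seven listed operations are fetched), Nld = 1 (the single load), CD is the cache delay of 1, and Lmem is the memory delay of 8. The function names are illustrative:

```c
/* Expected extra cycles caused by cache misses:
 *   n_acc * ((1.0 - hit_rate) * (cd + lmem))
 * n_acc    : number of accesses (Nop or Nld)
 * hit_rate : cache hit rate in [0, 1] (HRi or HRd)
 * cd       : cache delay in cycles (CD)
 * lmem     : memory latency in cycles (Lmem) */
double access_overhead(unsigned n_acc, double hit_rate,
                       double cd, double lmem)
{
    return n_acc * ((1.0 - hit_rate) * (cd + lmem));
}

/* Overhead for the example block, using the assumed counts above. */
double example_mem_overhead(void)
{
    double op = access_overhead(7, 0.9779, 1.0, 8.0); /* instruction fetches */
    double ld = access_overhead(1, 0.6996, 1.0, 8.0); /* data load */
    return op + ld;
}
```

Under these assumptions the instruction part is about 1.39 cycles and the data part about 2.70 cycles, which sums to roughly 4.1 and agrees with the value on the slide. The branch delay of 1.2 is not recomputed here because the misprediction rate is not given.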
Pipeline Scheduling
• Assumptions
– In-order, single issue processor
– Optimistic during scheduling (100% cache hit)
Processor data model:
• Operations
– Add: IF, ID, EX: int-ALU IntAdd
– Sub: IF, ID, EX: int-ALU IntSub
• Datapath
– Int-ALU: Qty: 1; IntAdd Lat: 1; IntSub Lat: 1
LLVM bytecode after processor timing estimation:
1: a = $i - 1
2: t1 = a + 2
3: t2 = $n * $m
4: t3 = t1 - t2
5: load b
6: t4 = b / 10
7: jmp
8: wait 47*CT
• Operation delay = 42
• Total BB delay = Op. + Mem. + Br. = 42 + 4.1 + 1.2 = 47.3 cycles
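The scheduling assumptions can be sketched as a small list scheduler. This is a simplified illustration in C, not the algorithm of the cited DATE 08 work: it models only the execute stage, ignores functional unit counts, and uses hypothetical latencies. The `struct op` layout and the name `schedule_block` are assumptions:

```c
#include <stddef.h>

/* Hypothetical operation record for one basic block. */
struct op {
    int latency;   /* execute latency in cycles (assumed values) */
    int dep0;      /* index of an earlier op this one reads, or -1 */
    int dep1;      /* second operand producer, or -1 */
};

/* In-order, single-issue schedule with 100% cache hits:
 * op k issues at least one cycle after op k-1 and not before
 * its operands are available. finish[] receives the completion
 * cycle of each op; the return value is the block delay. */
int schedule_block(const struct op *ops, size_t n, int *finish)
{
    int prev_issue = -1;
    int block_end = 0;
    for (size_t k = 0; k < n; k++) {
        int issue = prev_issue + 1;
        if (ops[k].dep0 >= 0 && finish[ops[k].dep0] > issue)
            issue = finish[ops[k].dep0];
        if (ops[k].dep1 >= 0 && finish[ops[k].dep1] > issue)
            issue = finish[ops[k].dep1];
        finish[k] = issue + ops[k].latency;
        if (finish[k] > block_end)
            block_end = finish[k];
        prev_issue = issue;
    }
    return block_end;
}
```

The resulting operation delay is then added to the memory and branch overheads to obtain the total basic block delay, as in the sum on the slide.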
Output: SystemC Timed Model
[Figure: generated SystemC model of the mapped platform with CPU1 and CPU2 (each with an OS), HW IP, Mem, Bus1, Bus2, TX, and processes P1 to P4]
Model generation technique:
• Application code → sc_thread
• Processing element → sc_module
• OS model → sc_module
• Bus → sc_channel
• Memory → array inside sc_module
• Interface → FIFO channel + sc_process
Operating System Layer
• Scheduling
– Group processes into tasks
• Static scheduling
– Schedule tasks
• Dynamic scheduling, multitasking
• Preemption, interrupt handling
• Task communication (IPC)
• Scheduling refinement
– Flatten hierarchy
– Reorder behaviors
• OS refinement
– Insert OS model
– Task refinement
– IPC refinement
[Figure: App and OS layers; processes P1, P2, P3 and channels C1, C2 grouped into tasks on top of the OS model (task scheduler), which runs on the SLDL]
OS Modeling
• High-level RTOS abstraction
– Specification is fast but inaccurate
• Native execution, concurrency model
– Traditional ISS-based validation infeasible
• Accurate but slow (esp. in multi-processor context), requires full binary
• Model of operating system
– High accuracy but small overhead at early stages
– Focus on key effects, abstract unnecessary implementation details
– Model all concepts: multi-tasking, scheduling, preemption, interrupts, IPC
[Figure: three abstraction levels. Specification: application tasks T1, T2 and channels on the SLDL. TLM: application tasks T1, T2, channels, and the RTOS model on the SLDL. Implementation: application, comm. & sync. API, and RTOS on an instruction set simulator on the SLDL.]
Source: A. Gerstlauer, H. Yu, D. Gajski. "RTOS Modeling for System-Level Design," DATE03.
Simulated Dynamic Behavior
[Figure: execution traces of P1, P2, P3, the ISR, S1, channel C1, and the bus over logical time t0 to t8, with the calls c1.send(), c1.recv(), and bus.recv(). Unscheduled model: delays are modeled with waitfor(). Scheduled model: Task P2 and Task P3 run under the OS model with delays modeled by time_wait(). Annotation: inaccuracy due to timing granularity.]
RTOS Model Implementation
• RTOS model
– OS, task, event management
• Descriptors & queues
– Scheduling
• Select and dispatch task based on algorithm
• Block all but active task on SystemC level
– Preemption
• Allow rescheduling whenever simulation time advances
– Event handling
• Remove task temporarily from OS while waiting for SystemC event
• RTOS model library
– RTOS models for different scheduling strategies
• Round robin, priority based
– Parametrizable
• Task parameters (priorities)
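channel OS implements OSAPI {
    Task current = 0;   /* task that currently owns the CPU */
    os_queue rdyq;      /* queue of ready tasks */

    /* Select the next task and wake it up. */
    void dispatch(void) {
        current = schedule();
        notify(current.event);
    }
    /* Give up the CPU and block until dispatched again. */
    void yield() {
        Task task = current;
        dispatch();
        wait(task.event);
    }
    /* Model an execution delay, then allow rescheduling. */
    void time_wait(time t) {
        waitfor(t);
        yield();
    }
    /* Before an SLDL wait: take the task out of the OS. */
    Task pre_wait(void) {
        Task t = rdyq.get(current);
        dispatch(); return t;
    }
    /* After an SLDL wait: requeue the task and wait for dispatch. */
    void post_wait(Task t) {
        rdyq.put(t);
        wait(t.event);
    }
};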
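The `schedule()` call is where the library's scheduling strategies differ. A minimal sketch of a priority-based variant in plain C, assuming the ready queue is a simple array and that a larger number means a higher priority; the `struct task` layout and the name `schedule_priority` are hypothetical, not part of the RTOS model library:

```c
#include <stddef.h>

/* Hypothetical task descriptor. */
struct task {
    int id;
    int priority;   /* assumption: larger value = higher priority */
    int ready;      /* nonzero if the task is in the ready queue */
};

/* Returns the index of the ready task with the highest priority,
 * or -1 if no task is ready. Ties go to the lowest index. */
int schedule_priority(const struct task *tasks, size_t n)
{
    int best = -1;
    for (size_t k = 0; k < n; k++) {
        if (!tasks[k].ready)
            continue;
        if (best < 0 || tasks[k].priority > tasks[best].priority)
            best = (int)k;
    }
    return best;
}
```

A round-robin variant would instead return the next ready index after the current task; only the selection function changes, while `dispatch()` and `yield()` stay the same.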
RTOS Model Interface
interface OSAPI
{
    /* OS management */
    void init();
    void start(int sched_alg);
    void interrupt_return();

    /* Task management */
    Task task_create(char *name, int type,
                     sim_time period);
    void task_terminate();
    void task_sleep();
    void task_activate(Task t);
    void task_endcycle();
    void task_kill(Task t);
    Task par_start();
    void par_end(Task t);

    /* Event handling */
    Task pre_wait();
    void post_wait(Task t);

    /* Delay modeling */
    void time_wait(sim_time nsec);
};
• Canonical, target-independent API
Task Refinement
• Convert processes into tasks
• Task initialization
– Register task with OS model
• Task activation
– Wait for task start trigger from OS
• Replace delay model
– Trigger rescheduling in OS
– Preemption points
• Communication and synchronization
– Wrap around SLDL event handling
process task_B2(OSAPI os) {
    Task h;
    /* task initialization: register task with OS model */
    void task_B2(void) {
        h = os.task_create("B2",
                           APERIODIC, 0);
    }
    void main(void) {
        /* task activation: wait for start trigger from OS */
        os.task_activate(h);
        ...
        /* model execution delay, replaces waitfor(BLOCK1_DELAY) */
        os.time_wait(BLOCK1_DELAY);
        ...
        send();
        /* model execution delay, replaces waitfor(BLOCK2_DELAY) */
        os.time_wait(BLOCK2_DELAY);
        ...
        os.task_terminate(h);
    }
    void send() {
        Task t;
        /* wrap the SLDL event wait with OS calls */
        t = os.pre_wait();
        wait(ack);
        os.post_wait(t);
    }
};
Operating System Layer
• OS model
– On top of standard SystemC
– Wrap around SystemC primitives, replace event handling
• Block all but active task
• Select and dispatch tasks
– Target-independent, canonical API
• Task management
• Channel communication
• Timing and all events
[Figure: App and OS layers; Task P2 and Task P3 with P1 and channels C1, C2 on the OS model, which runs on the SLDL]
Hardware Abstraction Layer (HAL)
• External communication
– Software drivers
• Presentation, session, network communication layers
• Synchronization (interrupts)
– Hardware/software boundary
• Low-level HW access
• Bus drivers and interrupt handlers
• Canonical HW/SW interface
– External interface
• Bus transactions (TLM)
• Interrupt trigger
[Figure: App, OS, and HAL layers; tasks with P1, P2, P3 and channels C1, C2 on the OS model; drivers and user interrupt handlers UsrInt1, UsrInt2; interrupts INTA to INTD; bus TLM]
App.:
sample.send(v1);
Driver:
void send(…) {
    intr.receive();
    bus.masterWrite(0xA000,
                    &tmp,
                    len);
}
Hardware Layer
• Processor TLM
– HW interrupt handling
• Interrupt logic: suspend user code
• Interrupt scheduling: priority, nesting
– Peripherals
• Interrupt controller
• Timers
– TLM bus model
• Bus transactions
[Figure: two timelines, labeled HAL and Hardware, showing task blocks TB1 and TB2 and interrupt IntA at times t1, t2, t3]
[Figure: App, OS, HAL, and HW layers; tasks with P1, P2, P3 and channels C1, C2 on the OS model; drivers, user interrupt handlers UsrInt1 and UsrInt2, handlers IntA to IntD, HW Int block, bus access, bus TLM, interrupt lines INTA to INTD]
Hardware Layer
• Bus-functional model (BFM)
– Pin-accurate processor model
• Timing-accurate bus and interrupt protocols
– Bus model
• Pin- and cycle-accurate
• Driving and sampling of bus wires
[Figure: bus waveform with the signals REQ, GRANT, CNTRL, ADDR, WDATA, and READY; values shown include nonseq., word, 0x27000000, 0xA000 0000, and 0x2F00 9801]
[Figure: App, OS, HAL, and HW layers as before, with the bus TLM replaced by a protocol-level bus interface and interrupt lines INTA to INTD]
Processor Model
[Figure: processor model built up layer by layer, from the application alone, to App + OS, to App + OS + HAL, to App + OS + HAL + HW with bus TLM]
Model levels, from most abstract to most detailed: Appl., OS, HAL, HW-TLM, HW-BFM, BFM-ISS
Features (added cumulatively as the model is refined):
• Target approx. computation timing
• Task mapping, dynamic scheduling
• Task communication, synchronization
• Interrupt handlers, low level SW drivers
• HW interrupt handling, int. scheduling
• Cycle accurate communication
• Cycle accurate computation
• Processor layers
– Application
• Native, host-compiled C
• Annotated timing
– OS
• OS model
• Middleware, drivers
– HAL
• Firmware
– Processor hardware
• Bus interfaces
• Interrupts handling & suspension
Source: G. Schirner, A. Gerstlauer, R. Doemer. “Fast and Accurate Processor Models for Efficient MPSoC Design," TODAES, 2009.