STRUCTURED CODESIGN FOR MANYCORE SYSTEMS Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich Sofsem Novy...
-
Upload
loren-herde -
Category
Documents
-
view
217 -
download
0
Transcript of STRUCTURED CODESIGN FOR MANYCORE SYSTEMS Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich Sofsem Novy...
STRUCTURED CODESIGN FOR MANYCORE SYSTEMSJürg Gutknecht & Lisa (Ling) Liu, ETH Zürich
Sofsem Novy Smokovec, January 2011
About Me
1968 System programming at Swissair
1977 PhD in Mathematics 1981 Joined Niklaus
Wirth's Lilith/ Modula team 1985 Sabbatial stay at
Xerox PARC 1986 Project Oberon
together with Wirth 2000 Academic languages
researcher at MSR
Outline of Talk
Context & Vision A Structured Approach Use Cases Programming Language & Compiler Power Management Codesign Hardware Library
Microsoft Innovation Cluster
Launched in 2008 by Microsoft (Reseach) Volume 5 years/ $5 mio Theme embedded systems software Participants
ETH Zürich (3 projects) EPFL Lausanne (4 projects)
Goals Research in embedded systems Technology transfer Education
„Supercomputerin the pocket“ is one among them
Supercomputer in the Pocket
Manycore architecture for embedded systems on the basis of programmable hardware (FPGA)
High-performance computing in the small Generic technology for wide range of apps
Sensor driven medical IT Data streaming in financial apps Running robot with limb control Real time audio processing
Hardware/ software design from the ground up
will be focussed in this talk
People Involved
Microsoft Research Chuck Thacker (consultant)
ETH Zürich Niklaus Wirth (processor design) Jürg Gutknecht (project leader) Lisa (Ling) Liu (hardware design) Felix Friedrich (compiler)
University Hospital Basel Alexej Morozow (medical IT app)
The Vision
Custom hardware design for embedded systems
Programmers need no hardware knowledge
System design process at high level of abstraction
Fully automated mapping process to FPGA
FPGA resources are used efficiently
Semantic Gap
Object Thread Data structure Statement Communication I/O ...
Lookup tables (LUT)
Block RAMs (BRAM),
DSP slices …
Program Constructs FPGA Resources
Map
Options for How to Achieve It
Hardware compilation: Custom mapping of specific algorithm (or hot spots) to hardware circuits.
Uniprocessor: Single universal processor plus on-chip cache memory. Transparently connected to external memory.
SMP: Several universal processors, each with on-chip cache memory, and each transparently connected to external memory. Cache coherence mechanism needed.
Preconfigured: Several universal processors, each with private on-chip memory. Interconnected via on-chip network. One processor connected to external memory.
A Better Approach
Hardware/ software codesign based on a suitable high-level computing model and programming language
Fully automated mapping/ synthesizing to FPGA hardware based on suitable library of highly configurable hardware components
Our Computing Model
Active Cell (Actor) Object with private state space Behavior control thread Communicating with other actors via
channels Actor Graph
Collection of interoperating actors running in parallel
Some actors connected to I/O via serial port
Our Hardware Library
TRM processor (Tiny Register Machine) Extremely simple Two level pipelined instruction execution Several variants
VTRM (vectors via DSP), DTRM (DMA) Communication FIFO
Ring buffer Sizes 32, 64, 128, 1024
I/O controllers DDR2, CF, LCD, UART
Mapping
Actor
Communication channel
I/ O
TRM processor („core“)
Instruction memory Data memory FIFO buffer
I/ O controllers connected to cores
Actor Graph FPGA
Map
TRM/ FIFO Cooperation
TRM M
FIFO
FIFOchannel
channel
rec
vse
nd
•fully orchestrated by TRM•no interrupts!
Realtime Multichannel ECG Monitor Analyze the activity of the heart, the
morphology of the corresponding waves, and the heart rate variability (HRV), with the aim of detecting and classifying potential anomalies
The signal to be analyzed decomposes into 8 physical channels, each of them sampled at 500 Hz
Decomposition into Actor Graph
Signal
input
Wave proc_
1
QRSdete
ct
HRV analys
is
Disease
classifier
Wave proc_
2
Wave proc_
8
ECGbitstream
outstream
Actions
Receive ECG signal from UART, compose individual samples, and distribute them to channel processors.
(Per channel): Precondition wave by suppressing noise via linear filtering; Detect the heart beats and contractions.
Detect QRS patterns and make a final decision about heart rate on the basis of standard multichannel logic.
Analyze the current heart rhythm and the heart rate variability (HRV).
Use decision tree logic to detect and classify arrhythmia events such as premature ventricular contractions (PVC), ventricular tachycardia etc. Feed results back to configure wave processing.
ECG
TRM12
UART
Ctrl
LCD
Ctrl
CFCtrl
RS232 CF
LCD
TRM11
TRM10
TRM2
TRM3
TRM9
TRM1
TRM4
FIFO1
FIFO8
FIFO9
FIFO16
FIFO17 FIFO18
FIFO19
FIFO20
FIFO33
FIFO34
ResultingFPGA
configuration
ECG Monitor
Maximum number of TRMs in communication chain
Use of Resources
#TRM
#LUT#BRA
M#DSP
TRM load@116 MHz
121385
9(48%)
52(86%)
12(25%)
< 10%
FPGA #TRM #LUT #BRAM #DSP
Virtex-5
3027692(96%)
60(100%)
30(62%)
Virtex-6
500
Preconfigured Version
Column 0
Column 1
Column 2
Column 3
H0H1H2H3
TRM1
TRM2
TRM3
TRM7
TRM8
TRM9
TRM4
TRM5
TRM6
TRM10
TRM11
TRM12
inbound arbiteroutbound arbiter
inbound arbiteroutbound arbiter
inbound arbiteroutbound arbiter
inbound arbiteroutbound arbiter
UART controller
CF controller
LCD controller
Virtex-5LX50T FPGA
CF
LCD
RS232
Xilinx ML505 board
ECG Sensor
Comparative Power Usage
Preconfigured FPGA (TRM, IM/ DM, I/O, interconnect)
Fully configurable
System
Quiescent
power (W)
Dynamicpower
(W)
Preconfigured
3.43823 0.58988
Dynamically
configured0.49742 0.48060
86% saving!
Graphics Based Motion Detection Problem: Detect moving objects in a
series of image frames Approach: Parallelize detection process
by domain decomposition (into 4 parts) Design: A reader process continuously
reads frames from external memory and forwards them to (4) part-detection processes running in parallel and reporting detected movements
Performance Results
Data base 10 frames of resolution 576 x 768 (432 KP)
Estimated performance Transfer from external DDR2 memory ca. 40
MP/sec Computation: 4 x 31 MP/sec Total time used per frame 55 ms Total throughput 18 frames/ sec
The ActiveCells Language
History & Profile Evolution of Pascal, Modula, Oberon Actor based Compositional
Active cell (Actor) Object with active behavior, communicating via
channels Assembly
Network of interoperating active cells Reusable software component with ports
interface
Example of Functional Actor
F = actor (in1, in2: instr; out: outstr); var i, j: integer;begin loop recv(in1, i); recv(in2, j); send(out, someOp(i, j)) endend
Example of User Interface Actor UI = actor (out1, out2: outstr; in: instr);
var i, j, k: INTEGER;begin loop RS232.RecvInt(i); RS232.RecvInt(j); send(out1, i); send(out2, j); recv(in, k); RS232.SendInt(k) endend
Examples of Assemblies
Assembly without ports
Assembly with ports
UIout1out2
inF
in1 in2
out
connect
Gin1 in2
out
Fin1 in2
outF
in1 in2
out
delegate
RS232 actor
in1 in2 in3 in4
out
A B
Assembly A Code
assembly A; (*without ports*) import RS232; type F = actor (in1, in2: instr; out: outstr); UI = actor (out1, out2: outstr; in: instr); var ifc: UI; f: F;begin new(ifc); new(f); connect(ifc.out1, f.in1); connect(ifc.out2, f.in2); connect(f.out, ifc.in)end A.
Assembly B Code
Assembly B (in1, in2, in3, in4: instr; out: outstr); (*with five ports*) type F, G = actor (in1, in2: instr; out: outstr); var f1, f2: F; g: G;begin new(f1); new(f2); new(g); connect(f1.out, g.in1); connect(f2.out2, g.in2); delegate(in1, f1.in1); delegate(in2, f1.in2); delegate(in3, f2.in1); delegate(in4, f2.in2); delegate(out, g.out)end B.
Built-In Vector Types and Operators Runge-Kutta (x, x1, k1, k2, … 3d vectors)
while t <= tmax do k1 := f(t, x); k2 := f(t + dt/2, x + dt/2 * k1); k3 := f(t + dt/2, x + dt/2 * k2); k4 := f(t + dt, x + dt * k3); x1 := x + dt/3 * (1/2 * k1 + k2 + k3 + 1/2 * k4); Draw(x, x1); x := x1; t := t + dt;end
Built-In Matrix Types and Operators
Graphics pipeline (Matrix multiplication) M := Graphics.Proj(left, right, bot, top, near,
far) * Graphics.Trans(0.0, 0.0, -d) * Graphics.RotX(elev) * Graphics.RotY(-azim) * Graphics.Trans(0.0, 0.0,- zm)
Hybrid Compilation
Code body
Role Compilation method
Actor Business logic
Software compilation (TRM/ DSP)
Assembly
Creating actor graph (wiring)
Hardware compilation (Verilog)
Actor Code
F = actor (in1, in2: instr; out: outstr); var i, j: integer;begin loop recv(in1, i); recv(in2, j); send(out, someOp(i, j)) endend
TRM
Assembly Code
assembly B (in1, in2, in3, in4: instr; out: outstr); type F, G = actor (in1, in2: instr; out: outstr); var f1, f2: F; g: G;begin new(f1); new(f2); new(g); connect(f1.out, g.in1); connect(f2.out2, g.in2); delegate(in1, f1.in1); delegate(in2, f1.in2); delegate(in3, f2.in1); delegate(in4, f2.in2); delegate(out, g.out)end B.
Verilog
Automated Mapping to FPGA
source program
hybridcompiler
memory images.mem
Verilog codescripts
make.tcl, ram.bmm
Xilinxsynthesizer
bits
runtime
library
hardware
library
TRMcode
Program Model Refinement
Each thread may spawn any number mutually independent sub-threads
Advantages Allows (lock-free) fine-grained parallel
computing Requirements
Needs core clustering Needs runtime scheduling support Needs barrier mechanism
spawn
barrier
AA1
A2
A1
Next Step
Use the ActiveCells language for developing embedded software on top of some standard IDE Including design, programming, debugging,
analyzing Analyzer may need cycle accurate
simulator Use fully automated tool to generate an
FPGA imageburndown
Integrated HW/SW power management systemCollaboration with Prof. Shiao-Li Tsao, National Chiao Tung University, Taiwan
Power Management Codesign
Power Management as Add-On Clock gating PM Add-On generated automatically on
demand actor { PM } (...);
PMAdd-On Circuitr
y
TRM
clk
outin
•Instruction• clockOff()
•Control registers• TRM mode, clock rate,
voltage•Signals
• Data on port•I/O ports
• Interop with PM controller•Internal memory
• backup TRM state/ registers
data
Clock Gating Off Procedure
Clo
ckM
an
ag
er
PMControll
er
PM Add-On Circuitr
y
TRMdata
clk
clk
outin
signal PM controller
stop clock
Clock Gating On Procedure
Clo
ckM
an
ag
er
PMControll
er
PM Add-On Circuitr
y
TRMdata
clk
clk
outin
Data arrives
PM controller feeds in clock
processor resumes
SW Add-on Enhancements
Conditional compilation of (blocking) recv statement recv(in, a) without { PM } option
repeat until nonblockingRecv(in, a); recv(in, a) with { PM } option
resetTimer(shortTime);repeat dataAvailable := nonblockingRecv(in, a)until timerExpired() or dataAvailable;stopTimer();if ~dataAvailable then clockOff() end
Next Step for Real Time Software begin { T } ... (* statements *) end
Adjust idle/ busy periods or clock rate between begin ... end to just meet indicated time limit T
Motivation
Allow automatic generating tailored hardware for a given stream application
The semantic gap between application model and hardware circuitry is too big An abstraction of hardware circuitry is
required to bridge the gap A clear classification of hardware
components is required to achieve efficient mapping with regards to resource, performance and energy
Hardware Components Classification
Computation Components
• General purpose minimal machine: TRM
• Vector machine: VTRM
Communication Components
• FIFOs• 32 * 128• 512 * 128• 32, 64, 128, 1k * 32
Storage Components
• DMA + TRM: DTRM
• direct transfer vector from DDR to VTRM
I/O Components
• TRM + I/O access: IOTRM
• packing/unpacking I/O data to vectors or words
Abstraction
Hardware interfaces Computation components
#(IMB, DMB) TRM (input clk, rst, irq0, irq1, input[31:0] inbus,
output[5:0] ioadr, output iowr, iord, output[31:0] outbus)
#(VL, IMB) VTRM (input clk, rst, input[VL*32-1:0] inbus, output[5:0] ioadr, output iowr, iord, output[VL*32-1:0] outbus)
Communication components#(Width, Depth) ParChannel (input clk, rst, input[Width-
1:0] inData, input wreq, rdreq, output[Width-1:0] outData, output[31:0] status)
Storage component#(DataWidth) DTRM (input clk, rst,
input[DataWidth-1:0] inbus, output[5:0] ioadr, output iowr, iord, output[DataWidth-1:0] outbus)
IO component#(VL) IOTRM (input clk, rst, input [VL*32-1:0] inbus, output [5:0] ioadr, output iowr, iord, output[VL*32-1:0] outbus)
TRM (Tiny Register Machine)
2-address register machine (8 registers) Configurable instruction/ data memory Optional I/O controller added
IMemory(4K x 18
bits)
DMemory(1K x 32
bits)
Decoder
Registers18
32
ALU
116 MHz
Vector TRM
8 vector registers (each 8 32-bit floats) Vector add/ multiply takes 4 cycles Horizontal addition takes 10 cycles
IMemory(4K x 18
bits)
DMemory(8K x 32
bits)
TRM
Vector
256
256
DMA TRM
256 bits wide data bus Loading 256 bits from DMA takes 2
cycles Storing 256 bits to DMA takes 1 cycle
IMemory(4K x 18
bits)
DMemory(1K x 32
bits)TRM
DMA
I/O data bus
256
256
Area, Performance Features (on Virtex-5LX50T)
System clock speed: 116MHz TRM : 2% LUTs, 1 DSP, 5 cycles for multiplication VTRM
integer vector unit, VL=4: 8% LUTs, 8 DSPs, 5 cycles for Vector multiplication, 3 cycles for horizontal vector addition
Floating point vection unit, VL = 4: 18% LUTs, 9 DSPs DMA: 10% LUTs, 1 DSP, 2 cycles for loading a
block from DDR2 controller buffer, 1 cycle for writing a block into DDR2 controller buffer
IOTRM: 5% LUTs, 1 DSP, 2 cycles for loading a vector, 1 cycle for writing a vector
References
http://www.nativesystems.inf.ethz.ch/ Reference papers
Ling Liu, Oleksii Morozov, A Process-Oriented Streaming System Design Paradigm for FPGAs, Reconfig’2010, Cancun, Mexico, December 13-15, 2010.
Ling Liu, Oleksii Morozov, Yuxing Han, Jürg Gutknecht, Patrick Hunziker, Automatic SoC Design Flow on Many-core Processors: a Software Hardware Co-Design Approach for FPGAs, FPGA’2011, Monterey California, February 27 ~ March 1, 2011.
Program Model Refinement 2 Separate agent thread for each
communication Each actor running one main thread
(behavior) and several communication threads (agents) under mutual exclusion
Advantages Stateful dialogs No deadlocks
Requirements Fast context switches
Y XX
behavior
communication
c
Wiring Integrated into Actorsmodule M; var x1, x2: X; y: Y; type X = object … end X; Y = object … end Y; begin new(y); new(x1, y); new (x2, y)end M.
X = object var c: Y.C; activity A; var i, j, k: integer; begin (*behave*) …; c(i, j); …; c(k); … end A; procedure X (y: Y); begin (*build object*) …; new (c); … end X;begin new A (*launch behavior*)end X;
Y = object activity A; begin (*behave*) … end A; activity C; var u, v, w: integer; begin (*communicate*) …; accept(u, v); …; accept(w); … end C; procedure Y; begin (*construct*) … end Y; begin new A end Y;