Exploring Power and Throughput for Dataflow Applications on Predictable NoC Multiprocessors
Kathrin Rosvall, Tage Mohammadat*, George Ungureanu, Johnny Oberg, Ingo Sander
Department of Electronics, School of Electrical Engineering and Computer Science
KTH Royal Institute of Technology, Stockholm, Sweden
*[email protected]
Tage Mohammadat et al. Architectural Design of Efficient MPSoCs 1 / 26
Introduction
Outline
1 Introduction
2 Design Space Exploration: Paper Excerpt
3 Closing remarks
Design Challenge
Timing Guarantees in the Many-Core Era
Technology advances lead to
• increasingly parallel, powerful and complex architectures
• increasingly advanced and demanding applications
Difficult to verify timing requirements
[Figure: Network-on-Chip (NoC), drawn as a grid of nodes labelled S and R.]
We already have problems with complexity today! How do we design tomorrow's "sea-of-cores" systems with timing guarantees?
Design Challenge
Mixed-Criticality Challenge for the Transport Industry
Hard requirements on timing.
Typical use cases require low power consumption and are mixed-critical.
Simple use cases have hundreds to millions of points in the design space.
Dependencies between software and hardware exacerbate complexity.
Consequence
• Architectural designs are experience-based rather than methodical.
• Large safety margins at the cost of efficiency.
• Engineering costs (design and verification time) are exploding.
Vision
Automated design methods that are correct-by-construction!
ForSyDe (Formal System Design) Vision
The approach
ForSyDe envisions a correct-by-construction design flow by combining
• system modelling based on models of computation
• computing platforms providing tight computational metrics, e.g. computation time, communication time, ...
Leveraging
• modelling libraries
• simulation methods
• transformation and refinement rules
• design space exploration
• deterministic computing architectures
• sound compilation and synthesis methods
• ...
Envisioned Design Flow
The big picture
[Figure: envisioned design flow with the following blocks.
Analyzable Application Models (Appl. A with actors A1 to A3, Appl. B with actors B1 to B3)
• formal base (MoCs)
• executable
Design Constraints (per application and global)
• real-time
• power and energy
• ...
Platform Architecture
• multiprocessor
• predictable performance
Design Space Exploration (based on formal model and predictable architecture), producing a mapping of the actors onto the platform
Synthesis and Compilation
Implementation
• customized hardware
• efficient software]
Envisioned Design Flow
Scope and Context
[Figure: abstraction stack from devices, logic gates and IPs up to System-on-Chip (components: processors, memories, I/Os) and System of Systems (hierarchical SoCs). Goal: satisfy criticality and throughput; minimize power and cost.]
[Figure: design domains and their abstraction levels.
Behavioural domain: systems, algorithms, register transfers, logic, transfer functions.
Structural domain: system-of-systems, processor system, ALUs/RAM etc., gates/flip-flops etc., transistors.
Physical domain: physical partitions, floorplans, module layout, cell layout, transistor layout.]
[Figure: taxonomy of computing system design.
Platforms: ASIC, FPGA, APU, DSP, GPU.
Design: evaluation (off-line, on-line); exploration (exhaustive, incomplete); space (IS, architecture, allocation, mapping, routing, scheduling, configuration); objective (optimisation, satisfaction); decisions (compile-time, run-time).
Application: dataflow (static, dynamic); tasks (periodic, aperiodic); specifications (functional, extra-functional).]
Design Space Exploration: Paper Excerpt
Contributions
Support for low-power design of mixed-critical applications on NoC-based MPSoCs by:
• spatio-temporal partitioning of computing and networking resources, while
• satisfying real-time constraints (throughput), and
• minimising the system's power consumption.
Exploiting:
• the applications' formal models,
• the platform's heterogeneity, modes and options,
• the temporally-disjoint networks property.
The Framework
[Figure: overview of the framework.
Inputs to Design Space Exploration:
• application graphs G1 ... Gz (actors a_{i,j}, channel rates p_{i,j} and q_{i,j}) with application constraints D1 ... Dz
• system constraints Ds
• implementation library with target-specific properties (WCET, WCCT, memory footprint, power consumption, run-time services, mode change profiling)
• architecture model: NoC of nodes S_{r,c} and R_{r,c}; network interfaces with per-cycle injection tables (cycle 0 ... cycle n, TDM axis); flits t_0 ... t_k advancing through switching stages 0 ... k (modulo-k time axis); tiles with cores 1 ... m (µP kind, mode), memory and peripherals on a TDM (time division multiple-access) bus
Outputs: configuration, mapping, schedules, performance data.]
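The Framework
Application Model
Synchronous dataflow graphs
Actor characterisation library for each processing element type and mode
• WCET
• memory footprint
[Figure: SDF graph with actors A and B connected by a channel with rates 2 and 3, and its expansion into the firings A0, A1, A2 and B0, B1.]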
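The expansion in the figure (three firings of A, two of B) follows from the balance equations of synchronous dataflow. The sketch below is a minimal illustration, assuming the rates 2 and 3 mean that A produces 2 tokens per firing and B consumes 3; the function name `repetition_vector` and the channel tuple layout are hypothetical and not taken from the authors' tool.

```python
from fractions import Fraction
from math import lcm  # requires Python 3.9+


def repetition_vector(actors, channels):
    """Smallest positive firing counts that balance every channel.

    channels: list of (src, produced_per_firing, dst, consumed_per_firing).
    Assumes a connected graph; raises ValueError if rates are inconsistent.
    """
    rate = {actors[0]: Fraction(1)}  # relative firing rate of each actor
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in channels:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate:
                # Balance equation: tokens produced == tokens consumed.
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent SDF graph")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the rational rates to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}


# Hypothetical encoding of the slide's two-actor graph.
firings = repetition_vector(["A", "B"], [("A", 2, "B", 3)])
print(firings)
```

With these rates, one graph iteration moves 6 tokens over the channel, which matches the three A firings and two B firings drawn on the slide.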
The Framework
Architectural Model
Mesh topology, with temporally-disjoint networks
• bufferless switching
• fixed routing (X-, Y-, Z-)
• guaranteed services via time slot assignment
Low power and small footprint
[Figure: Nostrum Network-on-Chip. Mesh of S and R nodes; each tile has a CPU (with mode) or hardware accelerator, memory, and a network interface whose injection table has one entry per cycle (cycle 0 ... cycle n).]
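To make "guaranteed services via time slot assignment" concrete, the sketch below models a bufferless mesh in which a flit injected in slot t uses the h-th link of its fixed route in slot (t + h) modulo the table length. This is a simplified illustration under that assumption, with X-then-Y routing on a 2D mesh; it is not the Nostrum implementation, and the names `xy_route` and `reserve` are hypothetical.

```python
def xy_route(src, dst):
    """Links of a fixed X-then-Y route between two (x, y) mesh coordinates."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def reserve(table, period, src, dst, slot):
    """Try to reserve injection slot `slot` for the flow src -> dst.

    table is a set of (link, slot) pairs already taken. Without buffers,
    hop h of the route is used exactly h cycles after injection, so the
    claim on each link is fixed once the injection slot is chosen.
    Returns True and updates the table if no link is claimed twice.
    """
    claims = [(link, (slot + hop) % period)
              for hop, link in enumerate(xy_route(src, dst))]
    if any(claim in table for claim in claims):
        return False
    table.update(claims)
    return True


# Hypothetical example: two flows sharing the link (1,0) -> (2,0).
table = set()
first = reserve(table, 4, (0, 0), (2, 0), slot=0)   # claims that link in slot 1
clash = reserve(table, 4, (1, 0), (2, 0), slot=1)   # same link, same slot
moved = reserve(table, 4, (1, 0), (2, 0), slot=2)   # different slot
print(first, clash, moved)
```

Flows whose claims never coincide on a link in the same slot cannot interfere, which is the property the temporally-disjoint networks rely on.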
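The Framework
Performance Models
Temporal analysis for applications based on MoC theories.
Linear power models:
• static: as a function of the platform components' kind and mode.
• dynamic: as a function of resource utilisation/traffic.
\[ \mathit{systemPower} = \Big( \sum_{p \in P} \mathit{dynPower}_p + \mathit{statPower}_p \Big) + \mathit{dynPower}_{NoC} + \mathit{statPower}_{NoC} \]
\[ \mathit{dynPower}_p = \Big( \sum_{a \in A \,\mid\, \mathit{proc}_a = p} \mathit{wcet}_a(p, \mathit{proc\_mode}_p) \Big) \cdot \mathit{dynPow}(\mathit{proc\_mode}_p) \div \mathit{period}_a \]
\[ \mathit{dynPower}_{NoC} = \mathit{dynPower}_{NIs} + \mathit{dynPower}_{switches} + \mathit{dynPower}_{links} \]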
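The linear power model can be evaluated directly once a mapping and the processor modes are fixed. The sketch below is an illustration only: it divides each actor's WCET by that actor's period inside the sum, which coincides with the slide's formula when all actors on a processor share one period (an assumed reading), and every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    name: str
    proc: str      # processor the actor is mapped to
    period: float  # iteration period of the actor's application
    wcet: dict     # processor mode -> worst-case execution time


def dyn_power_p(p, mode, actors, dyn_pow):
    """Dynamic power of processor p: utilisation times dynamic power of its mode."""
    utilisation = sum(a.wcet[mode] / a.period for a in actors if a.proc == p)
    return utilisation * dyn_pow[mode]


def system_power(proc_mode, actors, dyn_pow, stat_pow, dyn_noc, stat_noc):
    """Sum of per-processor dynamic and static power plus the NoC terms."""
    procs = sum(dyn_power_p(p, m, actors, dyn_pow) + stat_pow[m]
                for p, m in proc_mode.items())
    return procs + dyn_noc + stat_noc


# Hypothetical numbers (mW and arbitrary time units), not measured data.
actors = [
    Actor("A", "p0", period=10, wcet={"fast": 2, "slow": 4}),
    Actor("B", "p1", period=10, wcet={"fast": 3, "slow": 5}),
]
total = system_power(
    proc_mode={"p0": "fast", "p1": "slow"},
    actors=actors,
    dyn_pow={"fast": 200, "slow": 80},
    stat_pow={"fast": 50, "slow": 20},
    dyn_noc=10,   # stands in for dynPower_NIs + dynPower_switches + dynPower_links
    stat_noc=5,
)
print(total)
```

Here p0 contributes 0.2 · 200 + 50 and p1 contributes 0.5 · 80 + 20, so the NoC terms bring the total to 165 mW.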
The Framework
Design Constraints
Support for mixed-critical applications through:
• spatio-temporal partitioning of computing and networking resources.
• time constraints (throughput): satisfy
• low power execution: minimize
Currently supported performance metrics:
• throughput / iteration period
• energy consumption
• others: area and monetary cost, memory consumption
The Method
Constraint Programming (CP) Programme
Captures the problem as a whole
• no decomposition into sub-problems: (potentially) optimal, complete and trade-off-aware
• simultaneous mapping, scheduling, configuration and performance analysis
Declarative nature: flexible, modular and extendable model
• supports different design goals with the same CP model
Separation of concerns: modelling vs solving
• complete or heuristic search using the same CP model
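The idea of deciding mapping and configuration together, rather than one after the other, can be shown on a toy problem. The sketch below uses plain exhaustive enumeration as a stand-in for a CP solver, so it has none of the propagation or search heuristics a real CP model would bring. The feasibility rule (total WCET per processor must fit in the period), the assumption that unused processors draw no power, and all names and numbers are hypothetical.

```python
from itertools import product


def explore(actors, procs, modes, wcet, dyn_pow, stat_pow, period):
    """Jointly enumerate actor placements and processor modes.

    Returns (power, mapping, mode_of) for the lowest-power solution whose
    per-processor load fits within `period`, or None if none exists.
    """
    best = None
    for placement in product(procs, repeat=len(actors)):
        used = sorted(set(placement))  # unused processors are assumed off
        for config in product(modes, repeat=len(used)):
            mode_of = dict(zip(used, config))
            load = {p: 0 for p in used}
            for actor, p in zip(actors, placement):
                load[p] += wcet[actor][mode_of[p]]
            if max(load.values()) > period:
                continue  # throughput constraint violated
            power = sum(stat_pow[mode_of[p]]
                        + dyn_pow[mode_of[p]] * load[p] / period
                        for p in used)
            if best is None or power < best[0]:
                best = (power, dict(zip(actors, placement)), mode_of)
    return best


# Hypothetical two-actor, two-processor instance.
result = explore(
    actors=["A", "B"],
    procs=["p0", "p1"],
    modes=["slow", "fast"],
    wcet={"A": {"slow": 6, "fast": 3}, "B": {"slow": 8, "fast": 4}},
    dyn_pow={"slow": 40, "fast": 100},
    stat_pow={"slow": 10, "fast": 30},
    period=10,
)
print(result)
```

In this instance, placing both actors on one fast processor is feasible but costs more than spreading them over two slow ones, a trade-off that only shows up when placement and mode are chosen in the same search.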
The Method
CP model + Dataflow Analysis
A CP model reflects a mapping- and scheduling-aware graph (MSAG):
• captures applications and platform.
• analyses implications of mapping and scheduling (e.g. power).
[Figure: MSAG fragment for two channels a and b between processors Px and Py. Nodes per channel: src, block, send, rec, dst. Edge labels: next, sendNext. The interconnect is reached through the NI; send and receive buffers (buffer_s, buffer_r) are annotated with the channel's src and dst.]
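One standard dataflow analysis that can be run on a graph of this kind is the maximum cycle ratio: for a single-rate graph, the iteration period is bounded by the largest ratio of total execution time to total initial tokens over all cycles. The slides do not state which analysis the CP model uses, so the sketch below is only an illustration of that general technique; it enumerates simple cycles by brute force, which is practical only for small graphs, and the graph encoding is hypothetical.

```python
from fractions import Fraction


def max_cycle_ratio(exec_time, edges):
    """Largest (sum of execution times) / (sum of tokens) over simple cycles.

    exec_time: {node: integer execution time}
    edges: list of (src, dst, initial_tokens)
    Returns a Fraction, or None if the graph has no cycle.
    """
    adj = {}
    for src, dst, tokens in edges:
        adj.setdefault(src, []).append((dst, tokens))
    best = None

    def dfs(start, node, time_sum, token_sum, visited):
        nonlocal best
        for nxt, tokens in adj.get(node, []):
            if nxt == start:  # closed a cycle
                total = token_sum + tokens
                if total == 0:
                    raise ValueError("deadlock: cycle without tokens")
                ratio = Fraction(time_sum, total)
                if best is None or ratio > best:
                    best = ratio
            elif nxt not in visited and nxt > start:
                # Only extend through nodes ordered after `start`, so each
                # cycle is found once, from its smallest node.
                dfs(start, nxt, time_sum + exec_time[nxt],
                    token_sum + tokens, visited | {nxt})

    for start in sorted(exec_time):
        dfs(start, start, exec_time[start], 0, {start})
    return best


# Hypothetical three-node graph with two cycles.
period = max_cycle_ratio(
    {"a": 4, "b": 6, "c": 3},
    [("a", "b", 0), ("b", "a", 1), ("b", "c", 0), ("c", "b", 2)],
)
print(period)
```

The cycle through a and b has ratio (4 + 6) / 1 and the cycle through b and c has ratio (6 + 3) / 2, so the first one limits the throughput in this example.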
Experiments
Experiments set
#    Description
1    Physical experiment: TDN-NoC FPGA platform
2-5  DSE for a larger problem with fixed mapping and different optimization goals:
2    power consumption, one TDN slot/processor
3    power consumption, multiple TDN slots/processor
4    throughput for cyclic graph, one TDN slot/processor
5    throughput for cyclic graph, multiple TDN slots/processor
Experiments
Applications set
[Figure: dataflow graphs of the five applications; all channel rates shown are 1.
Sobel: getPx, gx, gy, abs.
SUSAN: getIm, usan, dir, thin, putIm.
RASTA-PLP: frontE, rasta, pows, auds, comp, filter, backE.
JPEG encoder: read, CC, DCT1 to DCT6, Huff1 to Huff6, CS, write.
Synthetic cyclic graph: a0 to a9.]
Experiment 1: Physical experiment
Validation on a Xilinx Zynq Quadprocessor: Description
1 Goals: minimize power consumption + satisfy throughput.
2 App/platform: Sobel & SUSAN on Nostrum NoC.
[Figure: platform. Tile 0 is the ARM tile (dual Cortex-A9) in the Processing System (PS); tiles 1 to 3 are MicroBlaze systems in the Programmable Logic (PL). Each tile attaches through a Network Interface (NI) to the Network-on-Chip, which consists of switches S00, S01, S10, S11 and links. A dynamically reconfigurable clock generator provides clk0 to clk3. Further labels: DFS, PS-PL barrier, off-chip DVFS.]
Experiment 1: Physical experiment
Validation on a Xilinx Zynq Quadprocessor: Solution
Allocation
Configuration
Mapping
Scheduling
Exact area cost
Caches disabled
10% time margin
Low-Power
[Figure: solution on the platform. The Sobel actors (getPx, gx, gy, abs) and SUSAN actors (getIm, usan, dir, thin, putIm) are mapped onto the tiles. Of the three MicroBlaze tiles, two run at F=50MHz and one is disabled (F=0MHz); clk0 runs at F=50MHz. NI injection tables: one NI uses {X,−,−,−}, the other three use {−,−,−,−}. Switches S0 to S3 are connected by 32-bit links. Processing System (PS): Mode=High Performance, #partitions=2. Dynamic Frequency Scaling (DFS) is applied in the Programmable Logic (PL) across the PS-PL barrier; off-chip DVFS.]
Experiment 3: Larger design
Power Minimisation and Slot Assignment for Fixed Mapping
[Figure: fixed mapping of all application actors onto a 6×5 mesh of switches S0,0 ... S5,4. Mapped actors: Sobel (getPx, gx, gy, abs), SUSAN (getIm, usan, dir, thin, putIm), RASTA-PLP (frontE, rasta, pows, auds, comp, filter, backE), JPEG encoder (read, CC, DCT1 to DCT6, Huff1 to Huff6, CS, write), synthetic cyclic graph (a0 to a9).]
Experiment 3: Larger design
Power Minimisation and Slot Assignment for Fixed Mapping: Solution
[Figure: power (mW, axis from 0.9·10^4 to 1.4·10^4) versus exploration time (ms, axis from 0 to 4·10^5) for Exp. 3, Runs 1 to 5, with the best found solution marked.]
Closing remarks
Summary
Low-power design space exploration of
• multiple dataflow applications with
• static mixed-criticality support
• on a shared MPSoC platform
Support for predictable network-on-chip
Open Problems
Exploring models of computation that change over time.
• Scenario-aware dataflow graphs.
• Reconfigurable computing.
Exploring computer architectures
• parallelism within computing elements
• network topologies
• memory hierarchy
Targeting resiliency
• Graceful degradation
• Security trade-offs
Designing for adaptive and distributed computing.
• Distributed exploration: limits 10x10 (Vivado), 6x8 (Gecode).
This work is part of ForSyDe/DeSyDe
Kathrin Rosvall, Tage Mohammadat, George Ungureanu, Johnny Oberg, and Ingo Sander. Exploring power and throughput for dataflow applications on predictable NoC multiprocessors. In Euromicro Conference on Digital System Design (DSD 2018), Prague, Czech Republic, August 2018.
Kathrin Rosvall and Ingo Sander. Flexible and trade-off-aware constraint-based design space exploration for streaming applications on heterogeneous platforms. ACM Transactions on Design Automation of Electronic Systems (TODAES), 23(2), 2017.
Ingo Sander, Axel Jantsch, and Seyed-Hosein Attarzadeh-Niaki. ForSyDe: System design using a functional language and models of computation. In Handbook of Hardware/Software Codesign. Springer, Dordrecht, 2017.
George Ungureanu and Ingo Sander. A layered formal framework for modeling of cyber-physical systems. In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1715–1720. IEEE, 2017.
More resources: https://github.com/forsyde/desyde