www.bsc.es
NoCo: ILP-based Worst-Case Contention
Estimation for Mesh Real-Time Manycores
December 13th Nashville, USA
39th IEEE Real-Time Systems Symposium
RTSS 2018
Jordi Cardona (1,2), Carles Hernandez (1), Enrico Mezzetti (1), Jaume Abella (1)
and Francisco J. Cazorla (1,3)
(1) Barcelona Supercomputing Center (BSC) (2) Universitat Politècnica de Catalunya (UPC)
(3) IIIA-CSIC
Critical Real-Time Embedded Systems
Used in industries like:
– Avionics
– Space
– Railway
Require:
– Functional Correctness
– Timing Correctness
Need to provide evidence against the safety standards
– Avionics: DO-178B/C
– Automotive: ISO 26262
Validation & Verification (V&V) process
Increasing Performance Needs in CRTES
New software implementing complex functionalities
– Complex AI algorithms
– Manage huge amounts of data
Performance needs increase significantly
Autonomous driving
ARM predicts that the performance requirements
of ADAS will grow 100x from 2016 to 2024
Covering high-performance needs
How to deliver the performance needed by CRTES Software in
an efficient way?
– “Embrace” high-performance hardware coming from the mainstream market
• Multicores and Manycores
• Caches
• Accelerators
• Networks on chip (NoCs)
Examples: SnapDragon (automotive), Nvidia Pascal (automotive), Kalray MPPA-256 (aviation)
The other side of the coin …
High-performance (complex) hardware complicates timing
analysis, i.e. deriving WCET estimates for tasks
Source of the problem: contention
– Must be bounded and reduced
– Worst-case Contention Delay (WCD) → Worst-Case Execution Time (WCET)
[Figure: 2x2 2D mesh with 4 cores]
Related work
Real-Time Specific NoC designs:
– Provide contention-free NoCs that are easy to V&V
– Do not scale well (bad average performance in general)
– High adoption costs for industry
Wormhole NoC designs (wNoC)
– Best-effort wormhole NoCs (wormhole switching)
• Used in commercial off-the-shelf (COTS) processors (low costs for industry)
• More difficult to derive upper bounds (can be very pessimistic)
– Optimize parameters of these NoCs
» Mapping, Routing, Bandwidth distribution, …
Worst Case Execution Time (WCET) - ZLL
WCET = f(ZLL,WCD)
– Zero Load Latency (ZLL) = f(distance)
1. Mapping
[Figure: example mappings with paths of 3 hops and 5 hops]
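The dependence of ZLL on distance can be illustrated with a minimal sketch. It assumes minimal routing on a 2D mesh, so the hop count is the Manhattan distance between nodes; the function names and the `cycles_per_hop` and `base_cycles` parameters are hypothetical and not taken from the slides.

```python
def hop_distance(src, dst):
    """Hops between two mesh nodes (x, y) under minimal routing (Manhattan distance)."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def zero_load_latency(src, dst, cycles_per_hop=1, base_cycles=0):
    """Assumed linear ZLL model: fixed cost plus a per-hop cost (both hypothetical)."""
    return base_cycles + cycles_per_hop * hop_distance(src, dst)


# A task mapped closer to its target node sees a lower ZLL.
near = zero_load_latency((0, 0), (2, 1), cycles_per_hop=3)
far = zero_load_latency((0, 0), (2, 3), cycles_per_hop=3)
```

Under this model, moving a task from a 5-hop position to a 3-hop position lowers its ZLL in proportion to the hops saved, which is why mapping is the first parameter considered.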
Worst Case Execution Time (WCET)
WCET = f(ZLL,WCD)
– Zero Load Latency (ZLL) = f(distance)
1. Mapping
– Worst case Contention Delay (WCD) = f(Routing, Arbitration)
2. Routing
3. Bandwidth weighted allocation (walloc)
Worst Case Execution Time (WCET) - WCD
[Figure: 3x3 mesh flows mapping using XY (left) and using an XY-YX combination (right); axes X and Y]
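How the routing choice changes which links flows share can be sketched as follows. This is an illustrative sketch only: nodes are (x, y) tuples, a link is a pair of adjacent nodes, and the helper names are hypothetical.

```python
def xy_route(src, dst):
    """Links traversed when a flow moves along X first, then along Y."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def yx_route(src, dst):
    """Links traversed when a flow moves along Y first, then along X."""
    # Swap coordinates, reuse the XY logic, and swap back.
    swapped = xy_route(src[::-1], dst[::-1])
    return [(a[::-1], b[::-1]) for a, b in swapped]


def shared_links(route_a, route_b):
    """Links where two flows can contend with each other."""
    return set(route_a) & set(route_b)


target = (2, 1)
both_xy = shared_links(xy_route((0, 0), target), xy_route((1, 0), target))
mixed = shared_links(yx_route((0, 0), target), xy_route((1, 0), target))
```

In this toy case the two flows share two links when both use XY and none when the first one uses YX, which is the kind of effect an XY-YX combination exploits.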
Worst Case Execution Time (WCET) - WCD
[Figure: 2x2 2D mesh, XY flows mapping. Left: RR arbitration (WCD = 15). Right: weighted mesh arbitration, WRR (WCD = 10)]
WCET is affected by all three parameters:
Mapping, Routing and Walloc
Parameters are inter-dependent
WCET = f(ZLL, WCD) = f(Mapping, Routing, Walloc)
Optimizing each parameter individually or in pairs does not provide a globally
optimal NoC configuration, just a local one.
All the parameters need to be optimized at the
same time
Routing constraints
Mapping constraints
Bandwidth constraints
Our proposal: NoCo
Given a
– Workload (Tasks)
– Wormhole Mesh NoC configuration
Optimizes
– The WCET of applications by finding the best mesh configuration:
• Mapping
• Routing
• Weights allocation (Walloc)
NoCo uses:
– Stochastic exploration to optimize routing
– Integer Linear Programming (ILP) to optimize Mapping and Walloc
Agenda
Introduction and Motivation
Background and problem analysis
NoCo: Stochastic/ILP model
Evaluation
Conclusions
NoCo: ILP-based Worst-Case Contention Estimation
for Mesh Real-Time Manycores
Jordi Cardona
PROPOSAL: NOCO
NoCo Optimization Framework
– Routing
• Stochastic
• Generates random routes and passes them to the ILP optimizer
– Placement, walloc
• Integer Linear Programming (ILP)
• Placement and Walloc are optimized for each routing
– Selection of the best setup
[Figure: Approach/Concept. Route generation (stochastic, random selection) passes each route to the ILP (mapping and walloc optimization); the optimized NoC configurations are compared by NoC performance and the best configuration is selected.]
NoCo Proposal
Problem description:
– Tasks information
• Execution Time Observed (ETO)
• Memory Accesses
– NoC information:
• Target Node location
• Number of routers
– Constraints (only one task per core)
[Figure: Main stages of the NoCo framework (mapping and walloc)]
NoCo Proposal
Routing Stochastic exploration
– Randomly generate routing configurations
• Minimal distance routing policies
• Deterministic routing policies (i.e., XY, YX)
• Deadlock avoidance
– Prohibiting certain turns (no cycles)
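The random generation step can be sketched as below. The encoding is an assumption: each node independently uses XY or YX, so an n-node mesh has 2^n routing configurations, which is consistent with the 2^9 = 512 and 2^16 = 65536 populations quoted in the backup slides. The function name is hypothetical, and the deadlock-avoidance (turn prohibition) check is not modeled here.

```python
import random


def sample_routings(num_nodes, num_samples, seed=0):
    """Draw distinct routing configurations uniformly at random.

    A configuration is a tuple with one 'XY' or 'YX' choice per node
    (assumed encoding; deadlock checks are left out of this sketch).
    """
    rng = random.Random(seed)
    population = 2 ** num_nodes
    # Sampling integer codes without replacement guarantees distinct configurations.
    codes = rng.sample(range(population), num_samples)
    return [
        tuple("YX" if (code >> node) & 1 else "XY" for node in range(num_nodes))
        for code in codes
    ]


routings = sample_routings(num_nodes=9, num_samples=100)
```

Drawing integer codes and decoding them bit by bit avoids enumerating the whole population, which matters once the mesh grows beyond a few nodes.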
NoCo Proposal
Routing Random sampling (finite population)
– C = probability that none of the top X% routes is in the random
sample.
The probability that a sample of 1000 routings contains at least one of the
top 1% of routings is 1 - 0.000043 = 0.999957 (99.99%)
[Figure: routings ordered from worst to best, with the top 1% and top 0.1% marked]
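The sampling figure above can be checked with a short calculation. The sketch below gives two versions: the large-population approximation (1 - X)^n, and the exact finite-population (hypergeometric) form. The function names are hypothetical.

```python
from math import comb


def miss_probability(top_fraction, sample_size):
    """Probability that no sampled routing is in the top fraction
    (large-population approximation, i.e. independent draws)."""
    return (1.0 - top_fraction) ** sample_size


def miss_probability_finite(population, top_count, sample_size):
    """Same probability when sampling without replacement from a finite population."""
    return comb(population - top_count, sample_size) / comb(population, sample_size)


# Top 1% with 1000 samples: about 4.3e-5, in line with the 0.000043 quoted above.
c = miss_probability(0.01, 1000)
hit = 1.0 - c
```

Sampling without replacement from a finite population can only lower the miss probability for the same top fraction, so the approximation is on the safe side.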
NoCo Proposal
Stochastic Routing
– It stochastically guarantees finding one of the best routing solutions at
low cost (without exploring all the possible routings)
• With 330 samples out of 2^16 = 65536 routings (0.5% of the population),
it finds the best routing
NoCo Proposal
Mapping and Walloc ILP optimization
NoCo Proposal
ILP model – Objective function:
• Minimize the maximum WCET over all threads (maxWCET)
The WCET of the application is
determined by the WCET of the
slowest thread
[Figure: parallel applications, with per-core WCET labels W_C1, W_C2, W_C3, W_C3]
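Since the application WCET is that of its slowest thread, the objective is a min-max problem. One common way to express a min-max objective in an ILP is to introduce an auxiliary bound variable; the symbols $z$, $T$ and $\mathrm{WCET}_t$ below are assumed names for illustration, not notation taken from the slides.

```latex
\begin{aligned}
\min\ & z \\
\text{s.t.}\ & z \ge \mathrm{WCET}_t \quad \forall t \in T
\end{aligned}
```

At the optimum, $z$ equals the largest per-thread WCET, so minimizing $z$ minimizes the WCET of the slowest thread while keeping the formulation linear.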
ILP model
– Compute WCET:
• Bandwidth and WCD modeling
NoCo Proposal
[Equation annotations: number of memory accesses; WCET in isolation; BW distribution constraints from the routing configuration; path flows mapping from the routing configuration]
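The annotations above suggest that a task's WCET combines its WCET in isolation with a contention term that scales with its number of memory accesses. A minimal sketch under that assumption follows; the additive form, the function names and the `wcd_per_access` parameter are hypothetical and may differ from the model used in NoCo.

```python
def wcet_with_contention(wcet_isolation, mem_accesses, wcd_per_access):
    """Assumed additive model: isolation WCET plus worst-case delay per NoC access."""
    return wcet_isolation + mem_accesses * wcd_per_access


def application_wcet(task_wcets):
    """The application WCET is that of its slowest thread."""
    return max(task_wcets)


# Two tasks described as (WCET in isolation, memory accesses, WCD per access).
tasks = [(1000, 50, 10), (1200, 20, 5)]
per_task = [wcet_with_contention(*task) for task in tasks]
total = application_wcet(per_task)
```

In this toy example the task with the lower isolation WCET ends up being the slowest one because it accesses the NoC more often, which is why mapping and bandwidth weights have to account for access frequency.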
NoCo Proposal
ILP model
– Compute WCET:
• Routing rules:
– Bandwidth distribution
– Path restrictions
• Other restrictions:
– Each task is assigned to one core
– Each core can only run one task
– BW assigned to each core > 0.0
– WCD of all tasks > 0.0
– Total BW in the mesh must be 1
Encoded in Boolean matrices
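The restrictions listed above can be sketched as simple checks. The layout is an assumption: `assign[t][c]` is a Boolean matrix that is True when task t runs on core c, and `shares` holds the bandwidth fraction given to each core. Both names and both helper functions are hypothetical.

```python
def is_valid_mapping(assign):
    """Each task on exactly one core, and each core running at most one task."""
    one_core_per_task = all(sum(row) == 1 for row in assign)
    one_task_per_core = all(sum(col) <= 1 for col in zip(*assign))
    return one_core_per_task and one_task_per_core


def is_valid_bandwidth(shares, tol=1e-9):
    """Every share strictly positive and the shares adding up to 1."""
    return all(s > 0.0 for s in shares) and abs(sum(shares) - 1.0) <= tol


ok_mapping = is_valid_mapping([[True, False], [False, True]])
clash = is_valid_mapping([[True, False], [True, False]])  # two tasks on core 0
ok_bw = is_valid_bandwidth([0.5, 0.25, 0.25])
```

In an ILP these checks become linear constraints over the same Boolean variables: row sums equal to 1, column sums at most 1, and a sum-to-one constraint on the weights.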
NoCo Proposal
Stochastic + ILP model
– Local solutions:
• Provides WCET of each task
• Mapping
• Bandwidth distribution (arbitration weights)
– Post-processing (minimum WCET) → global solution
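The combination of local solutions into a global one can be sketched as a loop over the sampled routings. The `optimize_mapping_walloc` callback stands in for the ILP step and is hypothetical, as is the shape of the local solution (a dict with `wcets`, `mapping` and `weights`); the toy optimizer only returns canned values for illustration.

```python
def noco_search(routings, optimize_mapping_walloc):
    """Return (max_wcet, routing, local_solution) with the smallest maximum WCET."""
    best = None
    for routing in routings:
        # Local solution: best mapping and weights for this particular routing.
        local = optimize_mapping_walloc(routing)
        max_wcet = max(local["wcets"])
        if best is None or max_wcet < best[0]:
            best = (max_wcet, routing, local)
    return best


def toy_optimizer(routing):
    """Stand-in for the ILP solver: canned per-task WCETs for three routings."""
    table = {"r1": [120, 90], "r2": [100, 95], "r3": [80, 130]}
    return {"wcets": table[routing], "mapping": None, "weights": None}


best_wcet, best_routing, _ = noco_search(["r1", "r2", "r3"], toy_optimizer)
```

Note that the routing with the single fastest task ("r3") is not selected, because the criterion is the slowest task of each local solution.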
EVALUATION
Evaluation
Cycle-accurate Simulator
– SoCLib simulator integrated with gNoCSim
Benchmarks
– Key parameter: frequency of access to the NoC for loads/stores
Workloads
– Cover the range shown for Mediabench and EEMBC auto
– MIX Benchmarks (e.g., MIX1 => ABCDEFFGH)
Evaluation: impact of optimizing each parameter
Incremental Evaluation
[Table: optimization versions evaluated. Columns: NoC configuration; ILP optimizations (Routing, Weights, Mapping). Rows: Static-base (RR), Static-opt (WRR), Map, Map + Walloc, Map + Walloc + R. Row-group labels: Baseline, NoCo.]
Results
Incremental Evaluation
– Static_opt (WRR) vs Static_base (RR)
[Figure: Effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads). Highlighted values: 16%, -1%]
Results
Incremental Optimizations
– Map vs Static-base (RR)
[Figure: Effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads). Highlighted values: 30%, 17%, 23%]
Results
Incremental Optimizations
– Map_Walloc vs Static-base (RR)
[Figure: Effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads). Highlighted values: 37%, 31%, 41%]
Results
Incremental Optimizations
– Map_Walloc_Routing (NoCo) vs XY_RR
[Figure: Effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads). Highlighted values: 50%, 46%, 40%, 23%, 14%, 9%, 3%]
CONCLUSIONS
Conclusions
Optimizing NoC to reduce WCET is a multidimensional problem
– Zero Load Latency (Mapping)
– Worst-case Contention Delay (Routing and Arbitration)
Some state-of-the-art proposals optimize one of the parameters that affect the
WCET of applications, or a combination of them.
We propose NoCo, a stochastic/ILP hybrid solution that optimizes
at the same time:
– Routing (XY, YX combinations)
– Arbitration (Walloc)
– Applications’ mapping
NoCo reduces the maxWCET of heterogeneous tasks in 3x3
meshes by between 40% and 50% with respect to the XY-RR configuration.
BACKUP
Reliability of stochastic method
Random Routing vs Optimal Routing
[Figure panels: 2^9 = 512 routings; 2^16 = 65536 routings. Annotated values: 94.6%, 88.7%; 100 samples: 98.5%; 330 samples: best solution in fifth examples]
Reliability of stochastic method
Improvement in homogeneous tasks running in parallel
– Reducing max WCET of all tasks
[Figure panels: rILP(m,w) maxWCET results for 9 threads; rILP(m,w) maxWCET results for 16 threads]
rILP (9 threads):
– Avg w.r.t. RR: 74%
– Avg w.r.t. WRR: 26%
rILP (16 threads):
– Avg w.r.t. RR: 88%
– Avg w.r.t. WRR: 29%
Reliability of stochastic method
Improvement in heterogeneous tasks running in parallel
– Reducing summation of max WCET of all tasks
[Figure panels: rILP(m,w) sumWCET results for 9 threads; rILP(m,w) sumWCET results for 16 threads]
rILP (9 threads):
– Avg w.r.t. RR: 26%
– Avg w.r.t. WRR: 19%
rILP (16 threads):
– Avg w.r.t. RR: 30%
– Avg w.r.t. WRR: 23%