Programming and Timing Analysis of Parallel Programs on Multicores Eugene Yip, Partha Roop, Morteza...

Programming and Timing Analysis of Parallel Programs on Multicores

Eugene Yip, Partha Roop, Morteza Biglari-Abhari, Alain Girault

ACSD 2013

Introduction

• Safety-critical systems:

– Perform specific real-time tasks.
– Strict safety standards (IEC 61508, DO-178).
– Time-predictability useful in real-time designs.
– Shift towards multicore designs.

[Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures.
[Pellizzoni et al 2009] Handling Mixed-Criticality in SoC-Based Real-Time Embedded Systems.
[Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.

[Figure: Embedded Systems; Safety-critical concerns.]

Introduction

• Designing safety-critical systems:
– Certified Real-Time Operating Systems (RTOS)
  • E.g., VxWorks, LynxOS, and SafeRTOS.
  • Programmer manages shared variables.
  • Hard to verify timing.

[VxWorks] http://www.windriver.com/products/vxworks/
[LynxOS] http://www.lynuxworks.com/rtos/rtos-178.php
[SafeRTOS] http://www.freertos.org/FreeRTOS-Plus/Safety_Critical_Certified/SafeRTOS.shtml
[Sandell et al 2006] Static Timing Analysis of Real-Time Operating System Code.

Introduction

• Designing safety-critical systems:
– Certified Real-Time Operating Systems (RTOS)
  • E.g., VxWorks, LynxOS, and SafeRTOS.
  • Programmer manages shared variables.
  • Hard to verify timing.
– Synchronous Languages
  • E.g., Esterel, Esterel C Language (ECL), and PRET-C.
  • Deterministic concurrency (Synchrony hypothesis).
  • Difficult to distribute: Instantaneous communication or sequential semantics.

[Benveniste et al 2003] The Synchronous Languages 12 Years Later.
[Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.
[Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[Girault 2005] A Survey of Automatic Distribution Method for Synchronous Programs.

Research Objective

• To design a C-based, parallel programming language that:
– has deterministic execution behaviour,
– can take advantage of multicore execution, and
– is amenable to static timing analysis.

Outline

• Introduction
• ForeC Language
• Timing Analysis
• Results
• Conclusions

ForeC (Foresee) Language

• C-based, multi-threaded, synchronous language. Inspired by Esterel and PRET-C.
• Minimal set of synchronous constructs.
• Fork/join parallelism and shared memory thread communication.
• Structured preemption.

Execution Example

shared int sum = 1 combine with plus;

int plus(int copy1, int copy2) {
  return (copy1 + copy2);
}

void main(void) {
  par(f(1), f(2));
}

void f(int i) {
  sum = sum + i;
  pause;
  ...
}

• shared ... combine with: Shared variable and its combine function.
• par: Fork-join.
  – Blocking statement.
  – Arbitrary thread execution order.
• pause: Global synchronisation barrier.

[Figure: values of the shared variable sum and of the thread copies sum1 and sum2 over the first global tick.]

Global tick start: sum = 1.

Threads get a conceptual copy of the shared variables at the start of every global tick: sum1 = 1, sum2 = 1.

Threads modify their own copy during execution: sum1 = 2, sum2 = 3.

Global tick end: sum = 5.

When a global tick ends, the modified copies are combined and assigned to the actual shared variables.

Combine function is defined by the programmer and must be commutative and associative.

• Modifications are isolated.
• Interleaving does not matter.
• Do not need locks or critical sections.
• But, the programmer has to specify the combine function and placement of pauses.

Next global tick start: sum1 = 5, sum2 = 5, ...
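The copy-and-combine behaviour of the shared variable can be mimicked in plain C. The following is only a sketch, not ForeC itself or its runtime: the helper names run_global_tick and f_tick and the MAX_THREADS limit are assumptions made for illustration, and every thread is assumed to modify its copy (as f does in the example).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 16  /* illustrative limit, not part of ForeC */

/* Combine function from the slides: commutative and associative. */
static int plus(int copy1, int copy2) {
    return copy1 + copy2;
}

/* Hypothetical body of f(i) up to its pause: "sum = sum + i" is applied
   to the thread's private copy only. */
static void f_tick(int *sum_copy, int i) {
    *sum_copy = *sum_copy + i;
}

/* Hypothetical helper: runs one global tick for the threads f(args[0]),
   ..., f(args[n-1]) and returns the value assigned to the shared variable
   at the end of the tick. */
int run_global_tick(int sum, const int *args, size_t n) {
    assert(n >= 1 && n <= MAX_THREADS);
    int copies[MAX_THREADS];

    /* Global tick start: every thread gets its own copy. */
    for (size_t t = 0; t < n; t++) {
        copies[t] = sum;
    }
    /* Threads execute; the order of this loop does not affect the result
       because each thread only touches its own copy. */
    for (size_t t = 0; t < n; t++) {
        f_tick(&copies[t], args[t]);
    }
    /* Global tick end: the modified copies are combined. */
    int combined = copies[0];
    for (size_t t = 1; t < n; t++) {
        combined = plus(combined, copies[t]);
    }
    return combined;
}
```

With sum = 1 and the threads f(1) and f(2), the copies become 2 and 3 and the combined value is 5, matching the slides. Swapping the thread order gives the same value because plus is commutative.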

int x = 1;
abort {
  x = 2;
  pause;
  x = 3;
} when (x > 0);
...

1. Initialise variable x.
2. Abort body starts executing.
3. Check the abort condition.
4. The abort body is preempted.
5. Execution continues.

Preemption construct:

[weak] abort {
  st
} when [immediate] (cond)

• immediate: The abort condition is checked when execution first reaches the abort.
• weak: Lets the abort body execute one last time before it is preempted.
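The four abort variants can be compared by simulating the example (x = 1, body "x = 2; pause; x = 3", condition x > 0) tick by tick in plain C. This is a sketch of one reading of the two bullet points above, not the ForeC compiler's implementation; the function run_abort_example and its program-counter encoding are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulates:  int x = 1; [weak] abort { x = 2; pause; x = 3; }
                          when [immediate] (x > 0);
   Returns the value of x once the abort statement has finished.
   Assumed reading: without "immediate" the condition is first checked in
   the tick after the abort is reached; "weak" lets the body run once more
   in the tick where the condition holds. */
int run_abort_example(bool weak, bool immediate) {
    int x = 1;
    int pc = 0;              /* 0: at "x = 2", 1: paused before "x = 3", 2: body done */
    bool first_tick = true;

    while (pc < 2) {
        bool check = !first_tick || immediate;
        bool preempt = check && (x > 0);

        if (preempt && !weak) {
            break;           /* strong abort: body does not run in this tick */
        }
        /* Run the body up to its next pause (or to its end). */
        if (pc == 0) {
            x = 2;
            pc = 1;
        } else {
            x = 3;
            pc = 2;
        }
        if (preempt) {
            break;           /* weak abort: body ran one last time */
        }
        first_tick = false;  /* the pause ends this tick */
    }
    return x;
}
```

Under these assumptions the plain abort of the example leaves x at 2 (the body is preempted before "x = 3"), the immediate variant leaves x at 1, the weak variant reaches 3, and weak immediate stops at 2.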

Variable type-qualifiers: input and output
• Declare a variable whose value is updated from (input) or emitted to (output) the environment at each global tick.

E.g., input int x;

Scheduling

Light-weight static scheduling:
– Take advantage of multicore performance while delivering time-predictability (eases static timing analysis).
– Thread allocation and scheduling order on each core decided at compile time by the programmer.
– Cooperative (non-preemptive) scheduling.
– Fork/join semantics and notion of a global tick are preserved via synchronisation.

Scheduling

Light-weight static scheduling:
– One core to perform housekeeping tasks at the end of the global tick.
  • Combining of shared variables.
  • Emitting outputs and sampling inputs.
  • Starting the next global tick.

Timing Analysis

Compute the program’s Worst-Case Reaction Time (WCRT).

[Figure: reaction times plotted against physical time (1s, 2s, 3s, 4s), with the maximum time allowed by the design specification marked.]

WCRT = max(Reaction times)

Must validate: WCRT ≤ Maximum time allowed

[Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
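The validation step can be written down directly. The sketch below assumes the reaction times of all global ticks are already available as an array of cycle counts; the function names wcrt and meets_deadline are illustrative and do not come from the authors' tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* WCRT = max(Reaction times): the longest reaction time over all ticks. */
unsigned long wcrt(const unsigned long *reaction_times, size_t n) {
    assert(n > 0);
    unsigned long worst = reaction_times[0];
    for (size_t i = 1; i < n; i++) {
        if (reaction_times[i] > worst) {
            worst = reaction_times[i];
        }
    }
    return worst;
}

/* Must validate: WCRT <= maximum time allowed by the design specification. */
bool meets_deadline(const unsigned long *reaction_times, size_t n,
                    unsigned long max_time_allowed) {
    return wcrt(reaction_times, n) <= max_time_allowed;
}
```

A static analysis cannot enumerate reaction times from measurements, so in practice the array would hold the computed reaction time of each reachable global tick.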

Timing Analysis

Construct a Concurrent CFG (CCFG) of the executable binary.

[Figure: CCFG of the example program (shared int sum = 1 combine with plus; ...). Node types: Graph Start, Graph End, Computation, Condition.]

Timing Analysis

One existing approach for multicores:
• [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.
• Uses ILP, which is NP-complete; no tightness result; analysis results are only for a 4-core processor.

Existing approaches for single-core:
– Integer Linear Programming (ILP)
– Model Checking/Reachability
– Max-Plus

[P. S. Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.
[M. Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.

Reachability

• Traverse the CCFG to find all possible global ticks.
• State-space explosion.
• Precision vs. analysis time.

[Figure: tree of reachable global ticks (g1, ..., g3a, g3b, g4a, g4b, g4c), each with its reaction time (RT1 = Reaction Time of g1, ..., RT4b, RT4c).]

WCRT = MAX(RT1 … RT4c)

Reachability

[Figure: tree of reachable global ticks (g3a, g3b, g4a, g4b, g4c) with WCRT = RT4b marked.]

Identify the path leading to the WCRT. Good for understanding the timing behaviour.

Max-Plus

• Makes the safe assumption that the program’s WCRT occurs when all threads execute their longest reaction together.
– Compute the WCRT of each thread separately.
– Compute the program’s WCRT by using the WCRT of the threads.
– Fast analysis time but over-estimation could be large.
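A simplified max-plus style bound for a statically scheduled multicore can be sketched as follows. The composition rule is an assumption made for illustration: thread WCRTs on the same core are added (the threads run back to back), and the maximum is taken across cores. Scheduling overheads and bus delays are ignored, and the function name max_plus_wcrt and the MAX_CORES limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CORES 8  /* illustrative limit */

/* thread_wcrt[t]: WCRT of thread t, computed separately for each thread.
   core_of[t]:     core that thread t is statically allocated to.
   Returns a bound that assumes all threads take their longest reaction
   in the same global tick. */
unsigned long max_plus_wcrt(const unsigned long *thread_wcrt,
                            const size_t *core_of,
                            size_t n_threads, size_t n_cores) {
    assert(n_cores >= 1 && n_cores <= MAX_CORES);
    unsigned long core_time[MAX_CORES] = {0};

    /* "plus": accumulate the thread WCRTs per core. */
    for (size_t t = 0; t < n_threads; t++) {
        assert(core_of[t] < n_cores);
        core_time[core_of[t]] += thread_wcrt[t];
    }
    /* "max": the slowest core determines the bound. */
    unsigned long worst = 0;
    for (size_t c = 0; c < n_cores; c++) {
        if (core_time[c] > worst) {
            worst = core_time[c];
        }
    }
    return worst;
}
```

The bound is safe but can be pessimistic: if the threads never take their longest reactions in the same tick, no real global tick is as long as the value returned.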

Timing Analysis

Propose the use of Reachability for multicore analysis:
– Trade off analysis time for higher precision.
– Analyse inter-core synchronisations in detail.
– Handle state-space explosion by reducing the program’s CCFG before reachability analysis.

[Figure: analysis flow. Program binary (annotated); Program’s reduced CCFG; Compute each global tick; WCRT.]

Timing Analysis

CCFG optimisations:
– merge: Reduces the number of CFG nodes that need to be traversed.
– merge-b: Reduces the number of alternate paths in the CFG. (Reduces the number of global ticks)

[Figure: example CFG with node costs 1, 4, 3 and 1. After merge, the alternate paths cost 1 + 3 = 4 and 1 + 4 + 1 = 6. After merge-b, a single node with cost = 6 remains.]
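The arithmetic in the figure can be reproduced with two small helpers. This is a sketch inferred from the costs shown on the slide, not the tool's actual graph transformation: merge is modelled as summing the costs of a straight-line sequence of nodes, and merge-b as keeping the most expensive of the alternate paths. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* merge: a sequence of nodes collapses into one node whose cost is the
   sum of the individual costs. */
unsigned long merge_sequence(const unsigned long *node_costs, size_t n) {
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += node_costs[i];
    }
    return total;
}

/* merge-b: alternate paths collapse into one node whose cost is that of
   the most expensive path (assumed from "cost = 6" in the figure). */
unsigned long merge_branches(const unsigned long *path_costs, size_t n) {
    assert(n > 0);
    unsigned long worst = path_costs[0];
    for (size_t i = 1; i < n; i++) {
        if (path_costs[i] > worst) {
            worst = path_costs[i];
        }
    }
    return worst;
}
```

For the figure's example, the two paths merge to 4 and 6, and merge-b keeps 6. Taking the maximum is safe for a worst-case analysis because the cheaper path can never produce a longer reaction.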

Timing Analysis

• Computing each global tick:
1. Parallel thread execution and inter-core synchronisations.
2. Scheduling overheads.
3. Variable delay in accessing the shared bus.

Timing Analysis

1. Parallel thread execution and inter-core synchronisations.
• An integer counter to track each core’s execution time.
• Static scheduling allows us to determine the thread execution order on each core.
• Synchronisation at fork/join, and end of the global tick.

[Figure: thread allocation to Core 1 and Core 2, and the resulting execution timeline of threads main, f1 and f2.]
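The per-core counters and the effect of a synchronisation can be sketched as below. The two-core setup follows the figure, but the CoreClock type, the function names and any cycle counts passed in are assumptions for illustration; synchronisation is modelled simply as every core waiting for the slowest one, with no overhead added.

```c
#include <assert.h>
#include <stddef.h>

#define N_CORES 2  /* Core 1 and Core 2 as in the figure */

/* One integer counter per core tracks that core's execution time. */
typedef struct {
    unsigned long time[N_CORES];
} CoreClock;

/* A thread segment of the given length executes on one core. */
void run_on_core(CoreClock *clk, size_t core, unsigned long cycles) {
    assert(core < N_CORES);
    clk->time[core] += cycles;
}

/* Fork, join or end of global tick: all cores advance to the time of the
   slowest core. Returns the synchronised time. */
unsigned long synchronise(CoreClock *clk) {
    unsigned long latest = 0;
    for (size_t c = 0; c < N_CORES; c++) {
        if (clk->time[c] > latest) {
            latest = clk->time[c];
        }
    }
    for (size_t c = 0; c < N_CORES; c++) {
        clk->time[c] = latest;
    }
    return latest;
}
```

For example, if main runs for 10 cycles on the first core before the fork, then f1 takes 30 cycles on the first core and f2 takes 50 cycles on the second, the join completes at time 60. The order of the run_on_core calls for each core comes from the static schedule.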

Timing Analysis

2. Scheduling overheads.
– Synchronisation: Fork/join and global tick.
  • Via global memory.
– Thread context-switching.
  • Copying of shared variables at the start of the thread’s local tick via global memory.

[Figure: timeline of Core 1 and Core 2 over one global tick (threads main, f1, f2), with synchronisation and thread context-switch overheads marked.]

Timing Analysis

2. Scheduling overheads.
– Required scheduling routines statically known.
– Analyse the control-flow of the routines.
– Compute the execution time for each scheduling overhead.

[Figure: timeline of Core 1 and Core 2 (threads main, f1, f2) with the scheduling overheads included.]

Timing Analysis

3. Variable delay in accessing the shared bus.
– Global memory accessed by scheduling routines.
– TDMA bus delay has to be considered.

[Figure: timeline of Core 1 and Core 2 against the TDMA bus schedule, whose slots alternate between the two cores (1, 2, 1, 2, ...).]
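The variable bus delay can be computed from the core's current time and the TDMA schedule. The sketch below assumes a round-robin schedule with equal slots (core 0 first), and that an access must fit entirely inside the requesting core's own slot; both are assumptions for illustration, as is the function name tdma_wait. Cores are numbered from 0 here.

```c
#include <assert.h>

/* Returns how many cycles a core must wait at time t before a global
   memory access of access_len cycles can start.
   Assumed schedule: slots of slot_len cycles, round-robin over n_cores. */
unsigned long tdma_wait(unsigned long t, unsigned long core,
                        unsigned long n_cores, unsigned long slot_len,
                        unsigned long access_len) {
    assert(n_cores > 0 && core < n_cores);
    assert(access_len > 0 && access_len <= slot_len);

    unsigned long round = slot_len * n_cores;  /* length of one bus schedule round */
    unsigned long pos = t % round;             /* position inside the current round */
    unsigned long start = core * slot_len;     /* start of this core's slot */
    unsigned long end = start + slot_len;      /* end of this core's slot */

    if (pos >= start && pos + access_len <= end) {
        return 0;                              /* access fits in the current slot */
    }
    if (pos < start) {
        return start - pos;                    /* own slot is still to come this round */
    }
    return round - pos + start;                /* wait for the slot in the next round */
}
```

With two cores and 5-cycle slots, a 5-cycle access requested by core 0 at time 1 no longer fits in its slot and waits 9 cycles, while core 1 at the same time waits 4 cycles. Because the schedule is fixed, the delay depends only on the time of the request, which is what lets a static analysis account for it.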

Results

• For the proposed reachability-based timing analysis, we demonstrate:
– the precision of the computed WCRT.
– the efficiency of the analysis, in terms of analysis time.

Results

• Timing analysis tool:

[Figure: tool flow. Program binary (annotated); Program CCFG (optimisations); Proposed Reachability or Max-Plus; WCRT.]

Results

• Multicore simulator (Xilinx MicroBlaze):
– Based on http://www.jwhitham.org/c/smmu.html and extended to be cycle-accurate and support multiple cores and a TDMA bus.

[Figure: simulated architecture. Cores up to Core n, each with instruction memory and data memory, a TDMA shared bus, and a data memory on the bus. Labels: 16KB, 32KB, 1 cycle, 5 cycles, 5 cycles/core (bus schedule round = 5 * no. cores).]

Results

• Mix of control/data computations, thread structure and computation load.

[Table: Benchmark programs.]

* [Pop et al 2011] A Stream-Computing Extension to OpenMP.
# [Nemer et al 2006] A Free Real-Time Benchmark.

Results

• Each benchmark program was distributed over 1 to n cores.
– n = maximum number of parallel threads.
• Observed the WCRT:
– Input vectors to elicit the worst-case execution path identified by Reachability analysis.
• Computed the WCRT:
– Reachability
– Max-Plus

802.11a Results

Observed:
• WCRT decreases until 5 cores.
• TDMA bus is a bottleneck: global memory becomes more expensive.
• Synchronisation overheads.

[Figure: 802.11a WCRT versus number of cores (1 to 10); vertical axis from 0 to 200,000; series: Observed, Reachability, MaxPlus.]

802.11a Results

Reachability:
• ~2% over-estimation.
• Benefit of explicit path exploration.

[Figure: 802.11a WCRT versus number of cores (1 to 10); vertical axis from 0 to 200,000; series: Observed, Reachability, MaxPlus.]

802.11a Results

Max-Plus:
• Assumes one global tick where all threads execute their worst-case.
• Loss of thread execution context: Max execution time of the scheduling routines.

[Figure: 802.11a WCRT versus number of cores (1 to 10); vertical axis from 0 to 200,000; series: Observed, Reachability, MaxPlus.]

802.11a Results

Both approaches:
• Estimation of synchronisation cost is conservative. Assumed that the receive only starts after the last sender.

[Figure: 802.11a WCRT versus number of cores (1 to 10); vertical axis from 0 to 200,000; series: Observed, Reachability, MaxPlus.]

802.11a Results

[Figure: results for 1 to 10 cores; series: Reachability, Reachability (merge), Reachability (merge-b).]

Max-Plus takes less than 2 seconds.

Reachability
merge:
• Reduction of ~9.34x
merge-b:
• Reduction of ~342x
• Less than 7 sec.

802.11a Results

Reduction in states ⇒ reduction in analysis time.

[Figure: Number of global ticks explored.]

Results

Reachability:
• ~1 to 8% over-estimation.
• Loss in precision mainly from over-estimating the synchronisation costs.

[Figure: WCRT versus number of cores for the other benchmarks; panel titles: FmRadio, Fly by Wire, Matrix; horizontal axis: Cores; series: Observed, Reachability, MaxPlus.]

Results

Max-Plus:
• Over-estimation very dependent on program structure.
• FmRadio and Life very imprecise. Loops can “amplify” over-estimations.
• Matrix quite precise. Executes in one global tick. Thus, Max-Plus assumption is valid.

[Figure: WCRT versus number of cores for the other benchmarks; panel titles: FmRadio, Fly by Wire, Matrix; horizontal axis: Cores; series: Observed, Reachability, MaxPlus.]

Results

• Our tool generates a timing trace for the computed WCRT:
– For each core: Thread start/end time, context-switching, fork/join, ...
– Can be used to tune the thread distribution.
• Was used to find good thread distributions for each benchmark program.

Conclusions

• ForeC language for deterministic parallel programming.

• Based on synchronous framework.
• Able to achieve WCRT speedup while providing time-predictability.
• Precise, fast and scalable timing analysis for multicore programs using reachability.

Future work

Implementation:
• WCRT-guided, automatic thread distribution.
• Decrease global synchronisation overhead without increasing analysis complexity.

Analysis:
• Prune additional infeasible paths using value analysis.
• Include the use of caches/scratchpads in the multicore memory hierarchy.

Questions?

Introduction

• Existing parallel programming solutions.
– Shared memory model.
  • OpenMP, Pthreads
  • Intel Cilk Plus, Threading Building Blocks
  • Unified Parallel C, ParC, X10
– Message passing model.
  • MPI, SHIM
– Provide ways to manage shared resources but do not prevent concurrency errors.

[OpenMP] http://openmp.org
[Pthreads] https://computing.llnl.gov/tutorials/pthreads/
[X10] http://x10-lang.org/
[Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus
[Intel Threading Building Blocks] http://threadingbuildingblocks.org/
[Unified Parallel C] http://upc.lbl.gov/
[Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing.
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[SHIM] SHIM: A Language for Hardware/Software Integration.

Introduction

– Desktop variants optimised for average-case performance (FLOPS), not time-predictability.
– Threaded programming model.
  • Non-deterministic thread interleaving makes understanding and debugging hard.

[Lee 2006] The Problem with Threads.

Introduction

• Parallel programming:
– Programmer manages the shared resources.
– Concurrency errors:
  • Deadlock, race condition, atomicity violation, order violation.

[McDowell et al 1989] Debugging Concurrent Programs.
[Lu et al 2008] Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics.

Introduction

• Deterministic runtime support.
– Pthreads
  • dOS, Grace, Kendo, CoreDet, Dthreads.
– OpenMP
  • Deterministic OMP
– Concept of logical time.
– Each logical time step broken into an execution and communication phase.

[Bergan et al 2010] Deterministic Process Groups in dOS.
[Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software.
[Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution.
[Liu et al 2011] Dthreads: Efficient Deterministic Multithreading.
[Aviram 2012] Deterministic OpenMP.

ForeC language

• Behaviour of shared variables is similar to:
  • Esterel (Valued signals)
  • Intel Cilk+ (Reducers)
  • Unified Parallel C (Collectives)
  • DOMP (Workspace consistency)
  • Grace (Copy-on-write)
  • Dthreads (Copy-on-write)

ForeC language

• Parallel programming patterns:
– Specifying an appropriate combine function.
– Sacrifice for deterministic parallel programs.
– Map-reduce
– Scatter-gather
– Software pipelining
– Delayed broadcast or point-to-point communication.
