Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor

cpeg421-10-F/Topic-3-II-EARTH

Topic 2 -- II: Compilers and Runtime Technology

Optimization Under Fine-Grain Multithreading: The EARTH Model (in More Detail)

Guang R. Gao

ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical & Computer Engineering

University of [email protected]


Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization - SSB
• The percolation model and its applications
• Summary

The EARTH Multithreaded Execution Model


Two levels of fine-grain threads:
- threaded procedures
- fibers

[Figure: threaded-procedure frames containing fibers. Legend: fiber within a frame; async. function invocation; a sync operation; invoke a threaded function; signal token; total # signals; arrived # signals.]

EARTH vs. CILK


[Figure: side-by-side diagrams of the EARTH model and the CILK model. Labels: fiber within a frame; parallel function invocation; frames; fork a procedure; SYNC ops.]

Note: EARTH has its origin in the static dataflow model.

The “Fiber” Execution Model


[Figure: animation of signal tokens arriving at five sync slots, each labeled with (arrived # signals, total # signals). The slots start at (0,2) (0,2) and (0,1) (0,2) (0,4); one counter advances per frame until every slot is full: (2,2) (2,2) and (1,1) (2,2) (4,4).]
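The counter behavior in the fiber execution model can be mimicked in a few lines of C. This is only a minimal sketch and not the EARTH runtime: the names `sync_slot`, `slot_init`, and `slot_signal` are hypothetical, and "enabling a fiber" is reduced to a boolean return value.

```c
#include <stdbool.h>

/* Hypothetical sync slot: counts arrived signals against a required total. */
typedef struct {
    int arrived;  /* signals received so far */
    int total;    /* signals required before the fiber may run */
} sync_slot;

/* Reset the slot so that it waits for `total` signals. */
void slot_init(sync_slot *s, int total)
{
    s->arrived = 0;
    s->total = total;
}

/* Deliver one signal token; returns true when the fiber becomes enabled. */
bool slot_signal(sync_slot *s)
{
    s->arrived++;
    return s->arrived == s->total;
}
```

A slot initialized with a total of 2 reports "enabled" only on its second signal, which corresponds to a counter moving from (0,2) through (1,2) to (2,2).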


A Loop Example


    for (i = 1; i <= N; ++i) {
        S1: …
        S2: x[i] = …
        S3: y[i] = … + x[i-1] …
        . . .
        Sk: …
    }

[Figure: statements S1, S2, S3, …, Sk of iterations i = 1, 2, 3, …, N laid out side by side, with thread labels T1, T2, T3.]

Note: How are loop-carried dependencies handled, and what are the implications for cross-core software pipelining?

Main Features of EARTH

* Fast thread context switching
• Efficient parallel function invocation
• Good support for fine-grain dynamic load balancing
* Efficient support for split-phase transactions and fibers

* Features unique to the EARTH model in comparison to the CILK model

Compiling C for EARTH: Objectives

• Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multi-threaded architectures. (EARTH-C)

• Develop compiler techniques to automatically translate programs written in EARTH-C to multi-threaded programs. (EARTH-C, Threaded-C)

• Determine if EARTH-C + compiler can compete with hand-coded Threaded-C programs.


Summary of EARTH-C Extensions

• Explicit Parallelism
  – Parallel versus Sequential statement sequences
  – Forall loops
• Locality Annotation
  – Local versus Remote Memory references (global, local, replicate, …)
• Dynamic Load Balancing
  – Basic versus remote function and invocation sites


EARTH-C Compiler Environment


[Figure, two panels: "EARTH Compilation Environment" and "The EARTH Compiler". Labels: C; EARTH-C; McCAT; EARTH-C Compiler; EARTH SIMPLE; Threaded-C; Threaded-C Compiler; Program Dependence Analysis; Thread Generation; Thread Partitioning.]

The McCAT/EARTH Compiler


[Figure: phases of the McCAT/EARTH compiler, from EARTH-C through EARTH-SIMPLE-C to THREADED-C.]

Phases shown:
• Simplify
• Goto elimination
• Local function inlining
• Points-to analysis
• Heap analysis
• R/W set analysis
• Array dependence tester
• Forall loop detection
• Loop partitioning
• Build hierarchical DDG
• Thread generation
• Code generation


The Fibonacci Example

[Figure: threaded function fib with arguments n, result, done; sync slot counters 0 0 and 2 2.]

    if (n < 2)
        DATA_RSYNC (1, result, done);
    else {
        TOKEN (fib, n-1, &sum1, slot_1);
        TOKEN (fib, n-2, &sum2, slot_2);
    }
    END_THREAD( );

    THREAD-1:
    DATA_RSYNC (sum1 + sum2, result, done);
    END_THREAD( );

    END_FUNCTION
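The two-fiber structure of the Fibonacci example can be traced with a sequential emulation in plain C. This is a sketch under stated assumptions rather than Threaded-C: ordinary calls stand in for TOKEN, the hypothetical helper `data_rsync` stands in for DATA_RSYNC, and a single hypothetical `slot_t` counter with an initial count of 2 stands in for the sync slot that guards the second fiber.

```c
/* Hypothetical sync slot: remaining signals before the waiting fiber may run. */
typedef struct { int pending; } slot_t;

/* Stand-in for DATA_RSYNC: store a value, then signal the slot. */
void data_rsync(int value, int *result, slot_t *done)
{
    *result = value;
    done->pending--;
}

/* Sequential emulation of the threaded fib; calls replace TOKEN. */
void fib(int n, int *result, slot_t *done)
{
    if (n < 2) {                     /* first fiber, base case */
        data_rsync(1, result, done);
        return;
    }
    int sum1, sum2;
    slot_t slot_1 = { 2 };           /* second fiber waits for two results */
    fib(n - 1, &sum1, &slot_1);      /* TOKEN (fib, n-1, ...) */
    fib(n - 2, &sum2, &slot_1);      /* TOKEN (fib, n-2, ...) */
    if (slot_1.pending == 0)         /* second fiber is now enabled */
        data_rsync(sum1 + sum2, result, done);
}
```

In a real fine-grain runtime the two child invocations could run concurrently on different nodes; the counter is what lets the summing fiber start only after both results have arrived, regardless of their order.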


Matrix Multiplication: Sequential Version

    /* N and the arrays a, b, c are assumed to be declared globally. */
    int main(void)
    {
        int i, j, k;
        float sum;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                sum = 0;
                for (k = 0; k < N; k++)
                    sum = sum + a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        return 0;
    }


The Inner Product Example

[Figure: threaded function inner with arguments a, b, result, done; sync slot counters 0 0 and 2 2.]

    BLKMOV_SYNC (a, row_a, N, slot_1);
    BLKMOV_SYNC (b, column_b, N, slot_1);
    sum = 0;
    END_THREAD( );

    THREAD-1:
    for (i = 0; i < N; i++)
        sum = sum + (row_a[i] * column_b[i]);
    DATA_RSYNC (sum, result, done);
    END_THREAD( );

    END_FUNCTION
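For comparison, the same inner product can be written as an ordinary C function. This is a rough sketch only: `memcpy` stands in for the split-phase BLKMOV_SYNC block moves, the return value stands in for DATA_RSYNC, and the size `N = 4` is an assumption chosen for illustration.

```c
#include <string.h>

#define N 4  /* assumed vector length, for illustration only */

/* Copy both operands into local buffers, then accumulate the products. */
float inner(const float *a, const float *b)
{
    float row_a[N], column_b[N];

    /* Stand-ins for BLKMOV_SYNC: fetch the operands into local storage. */
    memcpy(row_a, a, sizeof row_a);
    memcpy(column_b, b, sizeof column_b);

    float sum = 0;
    for (int i = 0; i < N; i++)
        sum = sum + row_a[i] * column_b[i];
    return sum;  /* DATA_RSYNC would deliver this value to `result` */
}
```

In the threaded version the two block moves are issued back to back and the loop lives in a separate fiber, so the processor is free to run other fibers while the data is in flight.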



EARTH-C to Threaded-C (Thread Generation)

Given a sequence of statements s1, s2, …, sn, we wish to create threads that:
– Maximize thread length (minimize thread switching overhead)
– Retain sufficient parallelism
– Issue remote memory requests as early as possible (prefetching)
– Compile split-phase remote memory operations and remote function calls correctly


An Example


    int f(int *x, int i, int j)
    {
        int a, b, sum, prod, fact;
        int r1, r2, r3;

        a = x[i];
        fact = 1;
        fact = fact * a;
        b = x[j];
        sum = a + b;
        prod = a * b;
        r1 = g(sum);
        r2 = g(prod);
        r3 = g(fact);
        return (r1 + r2 + r3);
    }

Example Partitioned into Four Fibers


    a = x[i];
    fact = 1;
    fact = fact * a;
    b = x[j];
    sum = a + b;
    prod = a * b;
    r1 = g(sum);
    r2 = g(prod);
    r3 = g(fact);
    return (r1 + r2 + r3);

[Figure: the statements above divided into four fibers, labeled Fiber-0, Fiber-1, Fiber-2, and Fiber-3.]

Better Strategy Using List Scheduling

• Put each instruction in the earliest possible thread.

• Within a thread, the remote operations are executed as early as possible.

Build a Data Dependence Graph (DDG), and use a list scheduling strategy, where the selection of instructions is guided by Earliest Thread Number and Statement Type.


Instruction Types

• Schedule First
  – remote_read, remote_write
  – remote_fn_call
  – local_simple
  – remote_compound
  – local_compound
  – basic_fn_call
• Schedule Last


List Scheduling Previous Example


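[Figure: DDG of the previous example, each node labeled (earliest thread number, statement type), where RR = remote_read, RF = remote_fn_call, LS = local_simple, LC = local_compound. Node labels by row: (0,RR) (0,RR) (0,LS); (1,LS) (1,LS) (1,LC); (1,RF) (1,RF) (1,RF); (2,LS).]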
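The earliest thread numbers for the example can be reproduced with a short pass over the DDG. The sketch below rests on an assumed rule that the slides do not spell out: a statement must start at least one thread later than any split-phase producer (remote read or remote function call) it depends on, and no earlier than any local producer. The type and function names are hypothetical.

```c
/* Statement types: remote_read, remote_fn_call, local_simple, local_compound. */
enum stmt_type { RR, RF, LS, LC };

#define MAX_PREDS 3

struct ddg_node {
    enum stmt_type type;
    int npreds;
    int preds[MAX_PREDS];  /* indices of earlier nodes this one depends on */
};

/* Nodes must be in topological order (every pred index < node index). */
void earliest_threads(const struct ddg_node *g, int n, int *thread)
{
    for (int v = 0; v < n; v++) {
        thread[v] = 0;
        for (int k = 0; k < g[v].npreds; k++) {
            int p = g[v].preds[k];
            /* A split-phase producer completes asynchronously, so its
               consumer is pushed into a later thread (assumed rule). */
            int t = thread[p] + ((g[p].type == RR || g[p].type == RF) ? 1 : 0);
            if (t > thread[v])
                thread[v] = t;
        }
    }
}

/* DDG of the example function f, in statement order. */
const struct ddg_node example_f[10] = {
    { RR, 0, { 0 } },        /* 0: a = x[i]         */
    { RR, 0, { 0 } },        /* 1: b = x[j]         */
    { LS, 0, { 0 } },        /* 2: fact = 1         */
    { LS, 2, { 0, 1 } },     /* 3: sum = a + b      */
    { LS, 2, { 0, 1 } },     /* 4: prod = a * b     */
    { LC, 2, { 0, 2 } },     /* 5: fact = fact * a  */
    { RF, 1, { 3 } },        /* 6: r1 = g(sum)      */
    { RF, 1, { 4 } },        /* 7: r2 = g(prod)     */
    { RF, 1, { 5 } },        /* 8: r3 = g(fact)     */
    { LS, 3, { 6, 7, 8 } },  /* 9: return r1+r2+r3  */
};
```

Calling `earliest_threads(example_f, 10, t)` gives thread 0 for the two reads and `fact = 1`, thread 1 for the six statements in the middle, and thread 2 for the return, matching the node labels in the figure. A full list scheduler would then order statements within each thread by statement type.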

Resulting List Scheduled Threads


Thread 0:
    a = x[i];
    b = x[j];
    fact = 1;

Thread 1 (sync count 2):
    sum = a + b;
    r1 = g(sum);
    prod = a * b;
    r2 = g(prod);
    fact = fact * a;
    r3 = g(fact);

Thread 2 (sync count 3):
    return (r1 + r2 + r3);

Generating Threaded-C Code


    THREADED f (int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
    {
        SLOTS SYNC_SLOTS[2];
        int a, b, sum, prod, fact, r1, r2, r3;

        /* THREAD_0:; */
        INIT_SYNC (0, 2, 2, 1);
        INIT_SYNC (1, 3, 3, 2);
        GET_SYNC_L (&x[i], &a, 0);
        GET_SYNC_L (&x[j], &b, 0);
        fact = 1;
        END_THREAD( );

        THREAD_1:;
        sum = a + b;
        TOKEN (g, &r1, SLOT_ADR(1), sum);
        prod = a * b;
        TOKEN (g, &r2, SLOT_ADR(1), prod);
        fact = fact * a;
        TOKEN (g, &r3, SLOT_ADR(1), fact);
        END_THREAD( );

        THREAD_2:;
        DATA_RSYNC_L (r1 + r2 + r3, ret_parm, rsync_parm);
        END_FUNCTION( );
    }

Fine-Grain Synchronization: Two Types

Sync Type                   Enforce Mutual Exclusion                 Enforce Data Dependencies
Order                       No specific order required               Uni-directional
Fine-Grain Sync. Solution   • Software fine-grained locks            • I-structures
                            • Lock-free concurrent data structures   • Full/Empty bits
                            • Full/Empty bits


Enforce Data Dependencies

• A DoAcross loop with positive and constant dependence distance.


    for (i = D; i < N; ++i) {
        A[i] = …;
        …
        … = A[i-D];
    }

When the loop runs in parallel, iterations are assigned to different threads:

    T0 (i = 2):          T1 (i = 2 + D):
    {                    {
        A[2] = …;            A[2+D] = …;
        …                    …
        … = A[2-D];          … = A[2];
    }                    }

The data dependence needs to be enforced by synchronization

Memory Based Fine-Grain Synchronization:

• Full/Empty Bits (HEP, Tera MTA, etc) & I-Structures (dataflow based machines)

• Associate a “state” with each memory location (fine granularity). Fine-grain synchronization on that location is realized through “state transitions” on this “state”.


I-Structure State Transition [ArvindEtAl89 @ TOPLAS]

[Figure: state diagram with three states, Empty, Full, and Deferred, and transitions labeled read, write, and reset.]
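The I-structure state diagram can be written down as a small transition function. The sketch below follows the usual I-structure semantics, taken here as an assumption about what the figure shows: a read of an Empty cell is deferred, a write fills the cell and satisfies deferred reads, and reset returns a Full cell to Empty. All names are hypothetical, and treating any other operation as an error is a choice made for this illustration.

```c
#include <stdbool.h>

enum istate { EMPTY, DEFERRED, FULL };
enum iop { OP_READ, OP_WRITE, OP_RESET };

/* Apply `op` to the cell state; returns false if the operation is illegal. */
bool istructure_step(enum istate *s, enum iop op)
{
    switch (*s) {
    case EMPTY:
        if (op == OP_READ)  { *s = DEFERRED; return true; }  /* reader waits */
        if (op == OP_WRITE) { *s = FULL;     return true; }
        break;
    case DEFERRED:
        if (op == OP_READ)  { return true; }                 /* one more waiting reader */
        if (op == OP_WRITE) { *s = FULL; return true; }      /* satisfies deferred reads */
        break;
    case FULL:
        if (op == OP_READ)  { return true; }                 /* value is available */
        if (op == OP_RESET) { *s = EMPTY; return true; }
        break;
    }
    return false;  /* e.g. a second write to a FULL cell */
}
```

The single-assignment property shows up as the missing write transition out of Full: a producer can fill a cell once, and consumers never observe a half-written value.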

With Memory Based Fine-Grain Sync

• A single atomic operation completes a synchronized write/read directly in memory

• No need to implement synchronization with other resources, e.g., shared memory

• Low overhead: just one memory transaction

Original loop:

    for (i = D; i < N; ++i) {
        A[i] = …;
        …
        … = A[i-D];
    }

With memory-based fine-grain synchronization:

    for (i = D; i < N; ++i) {
        write_sync(&(A[i]), …);
        …
        … = read_sync(&(A[i-D]));
    }



    T0 (i = 2):                        T1 (i = 2 + D):
    {                                  {
        write_sync(&(A[2]), …);            write_sync(&(A[2 + D]), …);
        …                                  …
        … = read_sync(&(A[2-D]));          … = read_sync(&(A[2]));
    }                                  }
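On hardware without full/empty bits, the effect of write_sync and read_sync can be approximated in software with C11 atomics. This is a sketch under assumptions, not the interface used on the slides: the `sync_word` type, the signatures, and the extra `try_read_sync` helper are hypothetical, and the blocking read spins where a memory-based mechanism would defer the read inside the memory system.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A memory word paired with a software full/empty flag. */
typedef struct {
    atomic_bool full;
    double value;
} sync_word;

/* Store the value, then mark the word full.
   The release store makes the value visible before the flag. */
void write_sync(sync_word *w, double v)
{
    w->value = v;
    atomic_store_explicit(&w->full, true, memory_order_release);
}

/* Non-blocking read: succeeds only once the word is full. */
bool try_read_sync(sync_word *w, double *out)
{
    if (!atomic_load_explicit(&w->full, memory_order_acquire))
        return false;
    *out = w->value;
    return true;
}

/* Blocking read as a consumer thread would use it: spin until full. */
double read_sync(sync_word *w)
{
    double v;
    while (!try_read_sync(w, &v))
        ;  /* busy-wait until the producer has written */
    return v;
}
```

With an array of `sync_word` elements, the thread running iteration i calls `write_sync(&A[i], …)` and the thread running iteration i + D calls `read_sync(&A[i])`, so the data dependence is enforced per element without a separate lock or barrier.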

An Alternative: control-flow based synchronizations


• The post/wait instructions need to be implemented in shared memory, in coordination with the underlying memory (consistency) model

    for (i = D; i < N; ++i) {
        A[i] = …;
        post(i);
        …
        wait(i-D);
        … = A[i-D];
    }

• You may need to worry about this: there is no data dependency between A[i] = … and post(i), or between wait(i-D) and … = A[i-D], so fences are needed to keep each pair in order

    A[i] = …;        wait(i-D);
    fence;           fence;
    post(i);         … = A[i-D];

For computation with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS]
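The post/wait pattern with explicit fences maps fairly directly onto C11 atomics. The following is a minimal sketch under assumptions: `produce` and `try_consume` are hypothetical names, one event flag per iteration is assumed, the array size is arbitrary, and the wait is shown in a non-blocking form so the example does not depend on a thread library.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define N 8  /* assumed array size, for illustration only */

double A[N];
atomic_bool posted[N];  /* one event flag per iteration, initially false */

/* A[i] = ...; fence; post(i); */
void produce(int i, double v)
{
    A[i] = v;
    atomic_thread_fence(memory_order_release);  /* keep the store before the post */
    atomic_store_explicit(&posted[i], true, memory_order_relaxed);
}

/* Non-blocking form of: wait(i); fence; ... = A[i]; */
bool try_consume(int i, double *out)
{
    if (!atomic_load_explicit(&posted[i], memory_order_relaxed))
        return false;                            /* event not posted yet */
    atomic_thread_fence(memory_order_acquire);   /* keep the load after the wait */
    *out = A[i];
    return true;
}
```

Compared with a memory-based scheme, the data and its event flag live in two separate locations, which is why the ordering between them has to be stated explicitly.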

A Question!



Key Observation:

Solution:

What is SSB?

• A small hardware buffer attached to the memory controller of each memory bank.

• Record and manage states of actively synchronized data units.

• Hardware Cost
  – Each SSB is a small look-up table: easy to implement
  – Independence of each SSB: hardware cost increases only linearly with the number of memory banks



SSB on Many-Core (IBM C64)

IBM Cyclops-64, Designed by Monty Denneau.

SSB Synchronization Functionalities

Data Synchronization: Enforce RAW data dependencies
• Support word-level
  – Two single-writer-single-reader (SWSR) modes
  – One single-writer-multiple-reader (SWMR) mode

Fine-Grain Locking: Enforce mutual exclusion
• Support word-level
  – write lock (exclusive lock)
  – read lock (shared lock)
  – recursive lock

SSB is capable of supporting more functionality


Experimental Infrastructure


IBM Cyclops-64 Chip Architecture
• 160 thread units (500 MHz)
• Three-level explicitly addressable memory hierarchy
• Efficient thread-level execution support
• SSB for on-chip SRAM bank: 16-entry, 8-way associative

Software Infrastructure
• Cyclops-64 Micro Kernel
• Simulation testbed: FAST Simulator (software), Ms. Clops Hardware Emulator
• C compiler (GCC/Open64), OpenMP compiler
• Binutils: assembler, linker
• Libraries: OpenMP RTS, TiNy Threads Library/RTS, Std C/Math lib

SSB Fine-Grain Sync. is Efficient

• For all the benchmarks, the SSB-based version shows significant performance improvement over the versions based on other synchronization mechanisms.

• For example, with up to 128 threads
  – Livermore loop 6 (linear recurrence): a 312% improvement over the barrier-based version
  – Ordered integer set (hash table): outperforms the software-based fine-grain methods by up to 84%


Research Layout

[Figure: layered diagram of the research layout.]
• Future Programming Models
• Advanced Execution / Programming Model: Percolation, Location Consistency
• Base Execution Model: Fine-Grain Multithreading (e.g. EARTH, CARE)
• Infrastructure & Tools: System Software; Simulation / Emulation; Analytical Modeling
• Architectures: HTMT-like Architecture; Cellular Multithreaded Architecture (e.g. BG/c); High-End PIM Architecture

Percolation Model: A User’s Perspective

[Figure: three tiers labeled High Speed CPUs, SRAM PIM, and DRAM PIM. Role labels: Primary Execution Engine; Prepare and percolate “parceled threads”; Perform intelligent memory operations; Global Memory Management.]

The Percolation Model


• What is percolation?
  Dynamic, adaptive computation/data movement, migration, and transformation, in place or on the fly, to keep system resources usefully busy

• Features of percolation
  – both data and threads may percolate
  – computation reorganization and data layout reorganization
  – asynchronous invocation

An Example of percolation—Cannon’s Algorithm

[Figure: percolation across four levels of an HTMT-like architecture: Level 0 (fast CPU), Level 1 (PIM), Level 2 (PIM), Level 3.]

• Cannon’s nearest-neighbor data transfer
• Data layout reorganization during percolation

Performance of SSCA2 Kernel 4

#threads    C64         SMPs       MTA2
4           2917082     5369740    752256
8           5513257     2141457    619357
16          9799661     915617     488894
32          17349325    362390     482681

Metric: TEPS -- Traversed Edges Per Second

SMPs: 4-way Xeon dual-core, 2MB L2 Cache

• Reasonable scalability
  – Scales well with # threads
  – Linear speedup for #threads < 32
• Commodity SMPs have poor performance
• Competitive vs. MTA-2
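The scalability claim can be checked against the C64 column with a little arithmetic. The sketch below uses only the TEPS values from the table; the function names and the choice of the 4-thread run as the baseline are assumptions made for this illustration.

```c
/* TEPS on C64 for 4, 8, 16, and 32 threads, copied from the table. */
const double c64_teps[4] = { 2917082.0, 5513257.0, 9799661.0, 17349325.0 };
const int thread_counts[4] = { 4, 8, 16, 32 };

/* Speedup of entry k relative to the 4-thread run. */
double speedup(int k)
{
    return c64_teps[k] / c64_teps[0];
}

/* Parallel efficiency: speedup divided by the growth in thread count. */
double efficiency(int k)
{
    return speedup(k) / ((double)thread_counts[k] / thread_counts[0]);
}
```

With these numbers the 8-thread run is about 1.89 times faster than the 4-thread run, and the 32-thread run is about 5.9 times faster, which is roughly 74% efficiency against an ideal factor of 8.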
