GRID superscalar: a programming model for the Grid


Transcript of GRID superscalar: a programming model for the Grid

Page 1: GRID superscalar: a programming model for the Grid

GRID superscalar: a programming model for the Grid

Raül Sirvent Pardell
Advisor: Rosa M. Badia Sala

Doctoral Thesis
Computer Architecture Department
Technical University of Catalonia

Page 2: GRID superscalar: a programming model for the Grid


Outline

1. Introduction
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work

Page 3: GRID superscalar: a programming model for the Grid


Outline

1. Introduction
   1.1 Motivation
   1.2 Related work
   1.3 Thesis objectives and contributions
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work

Page 4: GRID superscalar: a programming model for the Grid


1.1 Motivation

The Grid architecture layers:

- Applications
- Grid Middleware (job management, data transfer, security, information, QoS, ...)
- Distributed Resources

Page 5: GRID superscalar: a programming model for the Grid


1.1 Motivation

What middleware should I use?

Page 6: GRID superscalar: a programming model for the Grid


1.1 Motivation

Programming tools: are they easy?

Grid AWARE vs. Grid UNAWARE

Page 7: GRID superscalar: a programming model for the Grid


1.1 Motivation

Can I run my programs in parallel?

Explicit parallelism (a fork/join graph drawn by hand) vs. implicit parallelism:

for(i=0; i < MSIZE; i++)
  for(j=0; j < MSIZE; j++)
    for(k=0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j))

Drawing the task graph by hand means the parallelism is explicit.

Page 8: GRID superscalar: a programming model for the Grid


1.1 Motivation

The Grid: a massive, dynamic and heterogeneous environment, prone to failures
- Study different techniques to detect and overcome failures: checkpointing, retries, replication

Page 9: GRID superscalar: a programming model for the Grid


1.2 Related work

System    | Grid unaware | Implicit parallelism | Language
----------|--------------|----------------------|------------
Triana    | No           | No                   | Graphical
Satin     | Yes          | No                   | Java
ProActive | Partial      | Partial              | Java
Pegasus   | Yes          | Partial              | VDL
Swift     | Yes          | Partial              | SwiftScript

Page 10: GRID superscalar: a programming model for the Grid


1.3 Thesis objectives and contributions

Objective: create a programming model for the Grid
- Grid unaware
- Implicit parallelism
- Sequential programming
- Allows the use of well-known imperative languages
- Speeds up applications
- Includes fault detection and recovery

Page 11: GRID superscalar: a programming model for the Grid


1.3 Thesis objectives and contributions

Contribution: GRID superscalar
- Programming interface
- Runtime environment
- Fault tolerance features

Page 12: GRID superscalar: a programming model for the Grid


Outline

1. Introduction
2. Programming interface
   2.1 Design
   2.2 User interface
   2.3 Programming comparison
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work

Page 13: GRID superscalar: a programming model for the Grid


2.1 Design

Interface objectives:
- Grid unaware
- Implicit parallelism
- Sequential programming
- Allows the use of well-known imperative languages

Page 14: GRID superscalar: a programming model for the Grid


2.1 Design

Target applications:
- Algorithms that can be easily split into tasks
  - Branch and bound computations, divide and conquer algorithms, recursive algorithms, ...
- Coarse-grained tasks
- Independent tasks
  - Scientific workflows, optimization algorithms, parameter sweeps
- Main parameters: FILES
  - External simulators, finite element solvers, BLAST, GAMESS

Page 15: GRID superscalar: a programming model for the Grid


2.1 Design

Application's architecture: a master-worker paradigm
- The master-worker parallel paradigm fits our objectives
- Main program: the master
- Functions: the workers
  - Function = generic representation of a task
- Glue to transform a sequential application into a master-worker application: stubs and skeletons (as in RMI, RPC, ...)
  - Stub: a call to the runtime interface
  - Skeleton: a binary that calls the user function

Page 16: GRID superscalar: a programming model for the Grid


2.1 Design

Local scenario:

app.c (master):

for(i=0; i < MSIZE; i++)
  for(j=0; j < MSIZE; j++)
    for(k=0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j))

app-functions.c (worker):

void matmul(char *f1, char *f2, char *f3)
{
  getBlocks(f1, f2, f3, A, B, C);
  for (i = 0; i < A->rows; i++)
    for (j = 0; j < B->cols; j++)
      for (k = 0; k < A->cols; k++)
        C->data[i][j] += A->data[i][k] * B->data[k][j];
  putBlocks(f1, f2, f3, A, B, C);
}

Page 17: GRID superscalar: a programming model for the Grid


2.1 Design

Master-worker paradigm over the Grid: one app.c (the master) drives many instances of app-functions.c (the workers) through the middleware.

[Diagram: app.c master connected via the middleware to multiple app-functions.c workers]

Page 18: GRID superscalar: a programming model for the Grid


2.1 Design

Intermediate language concept: like assembler code

  C, C++, ...  ->  Assembler  ->  Processor execution
  C, C++, ...  ->  Workflow   ->  Grid execution

In GRIDSs, the Execute generic interface:
- The instruction set is defined by the user
- Single entry point to the runtime
- Allows easy building of programming-language bindings (Java, Perl, Shell Script)
  - Easier technology adoption
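The single-entry-point idea above can be sketched in C. This is a minimal, illustrative model only: the real GRIDSs Execute signature and internals are not shown in the slides, so the names and the bookkeeping here are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative bookkeeping so we can observe what was submitted. */
static const char *last_operation;
static int last_nargs;

/* Single entry point to the runtime: every task goes through here.
   A real runtime would register the task, analyze its file
   dependences and schedule it on the Grid. */
void Execute(const char *operation, int nargs, char *args[])
{
    (void)args;
    last_operation = operation;
    last_nargs = nargs;
    printf("submit %s with %d argument(s)\n", operation, nargs);
}

/* Stub that could be generated from the IDL entry:
   void matmul(in File f1, in File f2, inout File f3); */
void matmul(char *f1, char *f2, char *f3)
{
    char *args[] = { f1, f2, f3 };
    Execute("matmul", 3, args);
}
```

Because every task funnels through one function, a binding for another language only has to wrap Execute, which is why the slides call it a single entry point.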

Page 19: GRID superscalar: a programming model for the Grid


2.2 User interface

Steps to program an application:
- Task definition
  - Identify the functions/programs in the application that are going to be executed in the computational Grid
  - All parameters must be passed in the header (remote execution)
- Interface Definition Language (IDL)
  - For every task defined, identify which parameters are input/output files and which are input/output scalars
- Programming API: master and worker
  - Write the main program and the tasks using the GRIDSs API

Page 20: GRID superscalar: a programming model for the Grid


2.2 User interface

Interface Definition Language (IDL) file
- CORBA-IDL-like interface:
  - in/out/inout files
  - in/out/inout scalar values
- The functions listed in this file will be executed on the Grid

interface MATMUL {
  void matmul(in File f1, in File f2, inout File f3);
};

Page 21: GRID superscalar: a programming model for the Grid


2.2 User interface

Programming API: master and worker

Master side (app.c):
- GS_On
- GS_Off
- GS_FOpen / GS_FClose
- GS_Open / GS_Close
- GS_Barrier
- GS_Speculative_End

Worker side (app-functions.c):
- GS_System
- gs_result
- GS_Throw

Page 22: GRID superscalar: a programming model for the Grid


2.2 User interface

Task constraints and cost specification
- Constraints: specify the needs of a task (CPU, memory, architecture, software, ...)
  - Build an expression in a constraint function (evaluated for every machine), e.g.:

      other.Mem == 1024

- Cost: estimated execution time of a task (in seconds)
  - Useful for scheduling
  - Calculated in a cost function; GS_GFlops / GS_Filesize may be used, or an external estimator can be called, e.g.:

      cost = operations / GS_GFlops();

Page 23: GRID superscalar: a programming model for the Grid


2.3 Programming comparison

Globus vs GRIDSs

Globus:

int main()
{
  rsl = "&(executable=/home/user/sim)(arguments=input1.txt output1.txt)
    (file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input1.txt /home/user/input1.txt))
    (file_stage_out=/home/user/output1.txt gsiftp://bscgrid01.bsc.es/path/output1.txt)
    (file_clean_up=/home/user/input1.txt /home/user/output1.txt)";
  globus_gram_client_job_request(bscgrid02.bsc.es, rsl, NULL, NULL);

  rsl = "&(executable=/home/user/sim)(arguments=input2.txt output2.txt)
    (file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input2.txt /home/user/input2.txt))
    (file_stage_out=/home/user/output2.txt gsiftp://bscgrid01.bsc.es/path/output2.txt)
    (file_clean_up=/home/user/input2.txt /home/user/output2.txt)";
  globus_gram_client_job_request(bscgrid03.bsc.es, rsl, NULL, NULL);

  rsl = "&(executable=/home/user/sim)(arguments=input3.txt output3.txt)
    (file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input3.txt /home/user/input3.txt))
    (file_stage_out=/home/user/output3.txt gsiftp://bscgrid01.bsc.es/path/output3.txt)
    (file_clean_up=/home/user/input3.txt /home/user/output3.txt)";
  globus_gram_client_job_request(bscgrid04.bsc.es, rsl, NULL, NULL);
}

Globus: Grid-aware, explicit parallelism.

Page 24: GRID superscalar: a programming model for the Grid


2.3 Programming comparison

Globus vs GRIDSs

GRIDSs:

void sim(File input, File output)
{
  command = "/home/user/sim " + input + ' ' + output;
  gs_result = GS_System(command);
}

int main()
{
  GS_On();
  sim("/path/input1.txt", "/path/output1.txt");
  sim("/path/input2.txt", "/path/output2.txt");
  sim("/path/input3.txt", "/path/output3.txt");
  GS_Off(0);
}

Page 25: GRID superscalar: a programming model for the Grid


2.3 Programming comparison

DAGMan vs GRIDSs

DAGMan:

JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D

GRIDSs:

int main()
{
  GS_On();
  task_A(f1, f2, f3);
  task_B(f2, f4);
  task_C(f3, f5);
  task_D(f4, f5, f6);
  GS_Off(0);
}

Workflow: A -> (B, C) -> D

DAGMan: explicit parallelism, no if/while clauses.

Page 26: GRID superscalar: a programming model for the Grid


2.3 Programming comparison

Ninf-G vs GRIDSs

Ninf-G:

int main()
{
  grpc_initialize("config_file");
  grpc_object_handle_init_np("A", &A_h, "class");
  grpc_object_handle_init_np("B", &B_h, "class");
  for(i = 0; i < 25; i++) {
    grpc_invoke_async_np(A_h, "foo", &sid, f_in[2*i], f_out[2*i]);
    grpc_invoke_async_np(B_h, "foo", &sid, f_in[2*i+1], f_out[2*i+1]);
    grpc_wait_all();
  }
  grpc_object_handle_destruct_np(&A_h);
  grpc_object_handle_destruct_np(&B_h);
  grpc_finalize();
}

GRIDSs:

int main()
{
  GS_On();
  for(i = 0; i < 50; i++)
    foo(f_in[i], f_out[i]);
  GS_Off(0);
}

Ninf-G: Grid-aware, explicit parallelism.

Page 27: GRID superscalar: a programming model for the Grid


2.3 Programming comparison

VDL vs GRIDSs

VDL:

DV trans1( a2=@{output:tmp.0}, a1=@{input:filein.0} );
DV trans2( a2=@{output:fileout.0}, a1=@{input:tmp.0} );
DV trans1( a2=@{output:tmp.1}, a1=@{input:filein.1} );
DV trans2( a2=@{output:fileout.1}, a1=@{input:tmp.1} );
...
DV trans1( a2=@{output:tmp.999}, a1=@{input:filein.999} );
DV trans2( a2=@{output:fileout.999}, a1=@{input:tmp.999} );

GRIDSs:

int main()
{
  GS_On();
  for(i = 0; i < 1000; i++) {
    tmp = "tmp." + i;
    filein = "filein." + i;
    fileout = "fileout." + i;
    trans1(tmp, filein);
    trans2(fileout, tmp);
  }
  GS_Off(0);
}

VDL: no if/while clauses.

Page 28: GRID superscalar: a programming model for the Grid


Outline

1. Introduction
2. Programming interface
3. Runtime
   3.1 Scientific contributions
   3.2 Developments
   3.3 Evaluation tests
4. Fault tolerance at the programming model level
5. Conclusions and future work

Page 29: GRID superscalar: a programming model for the Grid


3.1 Scientific contributions

Runtime objectives:
- Extract implicit parallelism from sequential applications
- Speed up execution using the Grid

Main requirement: Grid middleware providing
- Job management
- Data transfer
- Security

Page 30: GRID superscalar: a programming model for the Grid


3.1 Scientific contributions

Apply computer architecture knowledge (the superscalar processor) to the Grid: what a processor does with instructions in nanoseconds, the runtime does with Grid tasks that take seconds, minutes or hours.

[Diagram: block diagram of a superscalar processor (IFU, BXU, IDU, ISU, LSU, FXU, FPU units, L2 caches, L3 directory/control) next to the Grid; time scales of ns vs. seconds/minutes/hours]

Page 31: GRID superscalar: a programming model for the Grid


3.1 Scientific contributions

Data dependence analysis: allows parallelism to be extracted

Read after Write (RAW):   task1(..., f1); task2(f1, ...)
Write after Read (WAR):   task1(f1, ...); task2(..., f1)
Write after Write (WAW):  task1(..., f1); task2(..., f1)
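The three dependence cases above can be sketched as a small check over file parameters. This is a minimal sketch assuming IDL-style parameter directions (in/out/inout); the types and names are illustrative, not the actual GRIDSs data structures.

```c
#include <assert.h>
#include <string.h>

/* Parameter direction, as declared in the IDL (illustrative types). */
typedef enum { DIR_IN, DIR_OUT, DIR_INOUT } Dir;
typedef struct { const char *file; Dir dir; } Param;

static int writes(Dir d) { return d != DIR_IN;  }  /* OUT or INOUT */
static int reads(Dir d)  { return d != DIR_OUT; }  /* IN  or INOUT */

/* Returns 1 if a later task touching q depends on an earlier task
   touching p: RAW (p writes, q reads), WAR (p reads, q writes),
   or WAW (both write) on the same file. */
int depends(Param p, Param q)
{
    if (strcmp(p.file, q.file) != 0) return 0;   /* different files */
    return (writes(p.dir) && reads(q.dir))       /* Read after Write  */
        || (reads(p.dir) && writes(q.dir))       /* Write after Read  */
        || (writes(p.dir) && writes(q.dir));     /* Write after Write */
}
```

Two tasks that both only read a file do not depend on each other, which is exactly what lets independent chains of the task graph run in parallel.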

Page 32: GRID superscalar: a programming model for the Grid


3.1 Scientific contributions

for(i=0; i < MSIZE; i++)
  for(j=0; j < MSIZE; j++)
    for(k=0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j))

[Diagram: task graph built from the loop above; for i = 0, j = 0 the chain matmul(A(0,0), B(0,0), C(0,0)) -> matmul(A(0,1), B(1,0), C(0,0)) -> matmul(A(0,2), B(2,0), C(0,0)) (k = 0, 1, 2), and the analogous chain for i = 0, j = 1 on C(0,1), ...]

Page 33: GRID superscalar: a programming model for the Grid


3.1 Scientific contributions

for(i=0; i < MSIZE; i++)
  for(j=0; j < MSIZE; j++)
    for(k=0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j))

[Diagram: the complete task graph; each (i, j) block of C has its own chain of k = 0, 1, 2 tasks, and the chains for different (i, j) are mutually independent, so they can run in parallel]

Page 34: GRID superscalar: a programming model for the Grid


3.1 Scientific contributions

File renaming: increases parallelism

Read after Write:   task1(..., f1); task2(f1, ...)   -> unavoidable
Write after Read:   task1(f1, ...); task2(..., f1)   -> avoidable: task2(..., f1_NEW)
Write after Write:  task1(..., f1); task2(..., f1)   -> avoidable: task2(..., f1_NEW)
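The renaming trick above can be sketched with a per-file version counter: every write to a logical file gets a fresh physical name, so WAR and WAW dependences on the name disappear and only true RAW dependences remain. The names, table size and `_vN` suffix here are illustrative, not the actual GRIDSs naming scheme.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXF 16
static char logical_name[MAXF][64];
static int  version[MAXF];
static int  nfiles;

/* Physical name a task should use when it WRITES 'file'. */
const char *rename_on_write(const char *file, char *buf, size_t len)
{
    int i;
    for (i = 0; i < nfiles; i++)
        if (strcmp(logical_name[i], file) == 0) break;
    if (i == nfiles) {                        /* first time this file is seen */
        if (nfiles == MAXF) return file;      /* table full: sketch gives up  */
        snprintf(logical_name[i], sizeof logical_name[i], "%s", file);
        nfiles++;
    }
    version[i]++;                             /* a new version per write */
    snprintf(buf, len, "%s_v%d", file, version[i]);
    return buf;
}
```

Readers of an old version keep using the old physical name, so a later writer never has to wait for them, which is precisely how the WAR and WAW rows in the table become avoidable.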

Page 35: GRID superscalar: a programming model for the Grid


3.2 Developments

Basic functionality
- Job submission (middleware usage)
  - Select sources for input files
  - Submit, monitor or cancel jobs
  - Collect results
- API implementation
  - GS_On: read configuration file and environment
  - GS_Off: wait for tasks, clean up remote data, undo renaming
  - GS_(F)Open: create a local task
  - GS_(F)Close: notify end of a local task
  - GS_Barrier: wait for all running tasks to finish
  - GS_System: translate path
  - GS_Speculative_End: barrier until a throw; if a throw occurs, discard tasks from the throw to GS_Speculative_End
  - GS_Throw: uses gs_result to notify it

Page 36: GRID superscalar: a programming model for the Grid


3.2 Developments

Task scheduling: tasks are organized in a Directed Acyclic Graph (DAG) and submitted through the middleware as their dependences are satisfied.

Page 37: GRID superscalar: a programming model for the Grid


3.2 Developments

Task scheduling: resource brokering
- A resource broker is needed (but is not an objective of this thesis)
- Grid configuration file
  - Information about hosts (hostname, limit of jobs, queue, working directory, quota, ...)
  - Initial set of machines (can be changed dynamically)

<?xml version="1.0" encoding="UTF-8"?>
<project isSimple="yes" masterBandwidth="100000" masterBuildScript=""
         masterInstallDir="/home/rsirvent/matmul-master" masterName="bscgrid01.bsc.es"
         masterSourceDir="/datos/GRID-S/GT4/doc/examples/matmul" name="matmul"
         workerBuildScript="" workerSourceDir="/datos/GRID-S/GT4/doc/examples/matmul">
...
<workers>
<worker Arch="x86" GFlops="5.985" LimitOfJobs="2" Mem="1024" NCPUs="2"
        NetKbps="100000" OpSys="Linux" Queue="none" Quota="0"
        deploymentStatus="deployed" installDir="/home/rsirvent/matmul-worker"
        name="bscgrid01.bsc.es">

Page 38: GRID superscalar: a programming model for the Grid


3.2 Developments

Task scheduling: resource brokering
- Scheduling policy: estimate the total execution time of a single task on each resource,

    t = FileTransferTime + ExecutionTime

  - FileTransferTime: time to transfer the needed files to a resource (calculated from the hosts information and the location of files); the fastest source for each file is selected
  - ExecutionTime: estimation of the task's run time on a resource; an interface function (can be calculated, or estimated by an external entity); the fastest resource for execution is selected
  - The resource with the smallest estimate is selected
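The policy above reduces to picking the minimum of FileTransferTime + ExecutionTime over the candidate resources. A minimal sketch with made-up numbers (the struct and field names are illustrative):

```c
#include <assert.h>

typedef struct {
    const char *host;
    double file_transfer_s;   /* time to stage the needed files there */
    double execution_s;       /* estimated run time of the task there */
} Candidate;

/* Index of the candidate with the smallest total estimate
   t = FileTransferTime + ExecutionTime. */
int pick_resource(const Candidate c[], int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (c[i].file_transfer_s + c[i].execution_s
            < c[best].file_transfer_s + c[best].execution_s)
            best = i;
    return best;
}
```

Note how a fast machine can still lose to a slower one that already holds the input files: the transfer term is what makes the locality-aware scheduling of the next slides fall out of the same formula.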

Page 39: GRID superscalar: a programming model for the Grid


3.2 Developments

Task scheduling: resource brokering
- Match task constraints and machine capabilities
- Implemented using the ClassAd library
  - Machine: offers capabilities (from the Grid configuration file: memory, architecture, ...)
  - Task: demands capabilities
- Filters the candidate machines for a particular task, e.g. a task demanding Software = BLAST matches a machine offering SoftwareList = BLAST, GAMESS but not one offering only SoftwareList = GAMESS
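The software match in the example can be sketched as follows. This is deliberately crude: the real implementation uses the ClassAd library's expression matching, while here SoftwareList is just a comma-separated string and the check is a substring search, so it is illustrative only.

```c
#include <assert.h>
#include <string.h>

/* 1 if the machine's SoftwareList (e.g. "BLAST, GAMESS") offers the
   software a task demands. A substring search stands in for real
   ClassAd matchmaking. */
int machine_offers(const char *software_list, const char *required)
{
    return strstr(software_list, required) != NULL;
}
```

Running this filter over the Grid configuration file yields the candidate set on which the time-based scheduling estimate is then computed.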

Page 40: GRID superscalar: a programming model for the Grid


3.2 Developments

Task scheduling: file locality — the FileTransferTime term of t = FileTransferTime + ExecutionTime favors machines that already hold copies of the input files.

[Diagram: tasks scheduled through the middleware near the machines that hold f1, f2 and f3]

Page 41: GRID superscalar: a programming model for the Grid


3.2 Developments

Other file locality exploitation mechanisms
- Shared input disks (NFS or replicated data)
- Shared working directories (NFS)
- Erasing unused versions of files (decreases disk usage)
- Disk quota control (locality increases disk usage, and the quota may be lower than expected)

Page 42: GRID superscalar: a programming model for the Grid


3.3 Evaluation

NAS Grid Benchmarks — representative benchmark; includes different types of workflows that emulate a wide range of Grid applications

Simple optimization example — representative of optimization algorithms; workflow with two-level synchronization

New product and process development — production application; workflow with parallel chains of computation

Potential energy hypersurface for acetone — massively parallel, long-running application

Protein comparison — production application; big computational challenge; massively parallel; high number of tasks

fastDNAml — well-known application in the context of MPI for Grids; workflow with synchronization steps

Page 43: GRID superscalar: a programming model for the Grid


3.3 Evaluation

NAS Grid Benchmarks

[Diagrams: the four NGB workflow classes between Launch and Report nodes — ED (independent SP tasks), VP (BT -> MG -> FT pipelines connected by MF data filters), MB (LU -> MG -> FT chains with MF filters), HC (a chain of BT, SP and LU tasks with MF filters)]

Page 44: GRID superscalar: a programming model for the Grid


3.3 Evaluation

Run with classes S, W, A (2 machines x 4 CPUs). The VP benchmark must be analyzed in detail (it does not scale beyond 3 CPUs).

[Chart: speedup (0.0-3.0) vs. limit of tasks (1-4) on Kadesh8, for MB.S, MB.W, VP.S, VP.W and VP.A]

Page 45: GRID superscalar: a programming model for the Grid


3.3 Evaluation

Performance analysis
- The GRID superscalar runtime was instrumented
- Paraver tracefiles are obtained from the client side
- The lifecycle of all tasks has been studied in detail
- Overhead of the GRAM Job Manager polling interval

[Chart: Globus overhead (VP.W) — per-task time in seconds split into task duration, Active to Done, and Request to Active]

Page 46: GRID superscalar: a programming model for the Grid


3.3 Evaluation

VP.S task assignment
- Only 14.7% of the transfers are remote when exploiting locality
- VP is parallel, but its last part is executed sequentially

[Diagram: BT -> MF -> MG -> MF -> FT chains mapped onto Kadesh8 and Khafre; arrows mark the remote file transfers]

Page 47: GRID superscalar: a programming model for the Grid


3.3 Evaluation

Conclusion: workflow and granularity are important to achieve speed up

Page 48: GRID superscalar: a programming model for the Grid


Two-dimensional potential energy hypersurface for acetone as a function of two torsion angles

3.3 Evaluation

Page 49: GRID superscalar: a programming model for the Grid


3.3 Evaluation

- Number of executed tasks: 1120
- Each task takes between 45 and 65 minutes
- Speed-up: 26.88 (32 CPUs), 49.17 (64 CPUs)
- Long-running test on a heterogeneous, transatlantic Grid (22 + 14 + 28 CPUs)

Page 50: GRID superscalar: a programming model for the Grid


3.3 Evaluation

15 million protein sequences have been compared using BLAST and GRID superscalar.

[Diagram: all-against-all comparison of 15 million proteins from the genomes]

Page 51: GRID superscalar: a programming model for the Grid


3.3 Evaluation

- 100,000 tasks on 4,000 CPUs (= 1,000 exclusive nodes)
- A "Grid" of 1,000 machines with very low latency between them
  - Stress test for the runtime
- Spares the user from working with the queuing system
- Saves the queuing system from handling a huge set of independent tasks

Page 52: GRID superscalar: a programming model for the Grid


GRID superscalar: programming interface and runtime

Publications

Raül Sirvent, Josep M. Pérez, Rosa M. Badia, Jesús Labarta, "Automatic Grid workflow based on imperative programming languages", Concurrency and Computation: Practice and Experience, John Wiley & Sons, vol. 18, no. 10, pp. 1169-1186, 2006.

Rosa M. Badia, Raul Sirvent, Jesus Labarta, Josep M. Perez, "Programming the GRID: An Imperative Language-based Approach", Engineering The Grid: Status and Perspective, Section 4, Chapter 12, American Scientific Publishers, January 2006.

Rosa M. Badia, Jesús Labarta, Raül Sirvent, Josep M. Pérez, José M. Cela and Rogeli Grima, "Programming Grid Applications with GRID Superscalar", Journal of Grid Computing, Volume 1, Issue 2, 2003.

Page 53: GRID superscalar: a programming model for the Grid


GRID superscalar: programming interface and runtime

Work related to standards

R.M. Badia, D. Du, E. Huedo, A. Kokossis, I. M. Llorente, R. S. Montero, M. de Palol, R. Sirvent, and C. Vázquez, "Integration of GRID superscalar and GridWay Metascheduler with the DRMAA OGF Standard", Euro-Par, 2008.

Raül Sirvent, Andre Merzky, Rosa M. Badia, Thilo Kielmann, "GRID superscalar and SAGA: forming a high-level and platform-independent Grid programming environment", CoreGRID Integration Workshop. Integrated Research in Grid Computing, Pisa (Italy), 2005.

Page 54: GRID superscalar: a programming model for the Grid


Outline

1. Introduction
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
   4.1 Checkpointing
   4.2 Retry mechanisms
   4.3 Task replication
5. Conclusions and future work

Page 55: GRID superscalar: a programming model for the Grid


4.1 Checkpointing

Inter-task checkpointing recovers sequential consistency in the out-of-order execution of tasks
- A single version of every file is saved
- No need to save any runtime data structures

Drawback: some completed tasks may be lost
- An application-level checkpoint can avoid this

[Diagram: tasks 0-6 in program order; the checkpoint sits at task 3 even though some later tasks have already finished]
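The checkpoint position described above can be sketched as a scan in program order: with out-of-order completion, the checkpoint is the first task that has not finished, and any completed tasks after that point are the ones that may be lost on restart. The done[] representation is illustrative.

```c
#include <assert.h>

/* Position of the inter-task checkpoint: the first task in program
   order that has not completed. On restart, tasks n..ntasks-1 are
   re-executed, even if some of them had already finished. */
int checkpoint_position(const int done[], int ntasks)
{
    int n = 0;
    while (n < ntasks && done[n]) n++;
    return n;
}
```

Because only this single position (plus one version of each file) is needed, no runtime data structures have to be saved, which is where the ~1% overhead figure on the next slide comes from.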

Page 56: GRID superscalar: a programming model for the Grid


4.1 Checkpointing

Conclusions
- Low complexity to checkpoint a task (~1% overhead introduced)
- Can deal with both application-level and Grid-level errors
  - Most important when an unrecoverable error appears
- Transparent for end users

Page 57: GRID superscalar: a programming model for the Grid


4.2 Retry mechanisms

Automatic drop of machines: when a machine stops responding, it is discarded from the pool and its tasks are resubmitted elsewhere through the middleware.

[Diagram: a failed machine dropped from the pool and its tasks rescheduled]

Page 58: GRID superscalar: a programming model for the Grid


4.2 Retry mechanisms

Soft and hard timeouts for tasks: a soft timeout triggers a check on the task (on success it keeps running, on failure it is resubmitted); the hard timeout declares the task failed.

[Diagram: task timeline with repeated soft timeouts and a final hard timeout]
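The two-level timeout policy above can be sketched as a small decision function. The thresholds and enum names are invented for illustration; the real runtime derives its timeouts from the task cost estimates.

```c
#include <assert.h>

typedef enum { TASK_WAIT, TASK_CHECK, TASK_FAILED } TimeoutAction;

/* Soft timeout: check whether the task is still alive (it may yet
   succeed). Hard timeout: declare failure so the task can be
   resubmitted elsewhere. */
TimeoutAction on_tick(double elapsed_s, double soft_s, double hard_s)
{
    if (elapsed_s >= hard_s) return TASK_FAILED;  /* resubmit the task */
    if (elapsed_s >= soft_s) return TASK_CHECK;   /* poll the machine  */
    return TASK_WAIT;
}
```

The soft timeout is what lets the runtime detect performance degradations (a machine that is alive but slow) without paying any cost when everything runs normally.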

Page 59: GRID superscalar: a programming model for the Grid


4.2 Retry mechanisms

Retry of operations: each request to the middleware and each system call is retried on failure until it succeeds or the retry limit is reached.

[Diagram: request/syscall loops with Success and Failure transitions]

Page 60: GRID superscalar: a programming model for the Grid


4.2 Retry mechanisms

Conclusions
- Keeps running despite failures
- Dynamic: decides when and where to resubmit
- Detects performance degradations
- No overhead when no failures are detected
- Transparent for end users

Page 61: GRID superscalar: a programming model for the Grid


4.3 Task replication

Replicate running tasks depending on their successors: tasks whose results unblock more successors in the workflow are replicated first, so a failure does not delay the critical part of the graph.

[Diagram: workflow with tasks 0-7; the running tasks with most waiting successors are replicated through the middleware]
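The successor-driven choice above can be sketched as picking, among the running tasks, the one that unblocks the most work. The counts and array representation are illustrative; the real runtime reads them off the workflow graph.

```c
#include <assert.h>

/* Among n_running currently running tasks, return the index of the
   one with the most waiting successors in the workflow graph: its
   replica buys the most protection against a failure or a slow host. */
int most_critical(const int n_successors[], int n_running)
{
    int best = 0;
    for (int i = 1; i < n_running; i++)
        if (n_successors[i] > n_successors[best]) best = i;
    return best;
}
```

This is application-level knowledge (the workflow shape) driving a fault-tolerance decision, which is what distinguishes this replication scheme from blind task duplication.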

Page 62: GRID superscalar: a programming model for the Grid


4.3 Task replication

Replicate running tasks to speed up the execution: in a heterogeneous Grid, a replica submitted to a faster machine may finish before the original.

[Diagram: workflow with tasks 0-7; a replica of a slow running task launched on a second machine]

Page 63: GRID superscalar: a programming model for the Grid


4.3 Task replication

Conclusions
- Dynamic replication: application-level knowledge (the workflow) is used
- Replication can deal with failures, hiding the retry overhead
- Replication can speed up applications in heterogeneous Grids
- Transparent for end users
- Drawback: increased usage of resources

Page 64: GRID superscalar: a programming model for the Grid


4. Fault tolerance features

Publications

Vasilis Dialinos, Rosa M. Badia, Raül Sirvent, Josep M. Pérez and Jesús Labarta, "Implementing Phylogenetic Inference with GRID superscalar", Cluster Computing and Grid 2005 (CCGRID 2005), Cardiff, UK, 2005.

Raül Sirvent, Rosa M. Badia and Jesús Labarta, "Graph-based task replication for workflow applications", Submitted, HPCC 2009.

Page 65: GRID superscalar: a programming model for the Grid


Outline

1. Introduction
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work

Page 66: GRID superscalar: a programming model for the Grid


5. Conclusions and future work

Grid-unaware programming model

Transparent features for users, exploiting parallelism and failure treatment

Used in REAL systems and REAL applications

Some future research is already ONGOING (StarSs)

Page 67: GRID superscalar: a programming model for the Grid


5. Conclusions and future work

Future work
- Grid of supercomputers (Red Española de Supercomputación)
- Higher-scale tests (hundreds? thousands?)
- More complex brokering
  - Resource discovery/monitoring
  - New scheduling policies based on the workflow
  - Automatic prediction of execution times
- New policies for task replication
- New architectures for StarSs