GRID superscalar: a programming model for the Grid
Raül Sirvent Pardell
Advisor: Rosa M. Badia Sala
Doctoral Thesis
Computer Architecture Department
Technical University of Catalonia
Outline
1. Introduction
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work
Outline
1. Introduction
   1.1 Motivation
   1.2 Related work
   1.3 Thesis objectives and contributions
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work
1.1 Motivation
The Grid architecture layers:
– Applications
– Grid Middleware (job management, data transfer, security, information, QoS, …)
– Distributed Resources
1.1 Motivation
What middleware should I use?
1.1 Motivation
Programming tools: are they easy?
Grid AWARE vs. Grid UNAWARE
1.1 Motivation
Can I run my programs in parallel?
Explicit parallelism vs. implicit parallelism
– Explicit: the fork/join task graph is drawn by hand
– Implicit: the parallelism is extracted from sequential code such as:

for (i = 0; i < MSIZE; i++)
  for (j = 0; j < MSIZE; j++)
    for (k = 0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j))
1.1 Motivation
The Grid: a massive, dynamic and heterogeneous environment prone to failures
– Study different techniques to detect and overcome failures: checkpointing, retries, replication
1.2 Related work
System / Grid unaware / Implicit parallelism / Language
Triana: No / No / Graphical
Satin: Yes / No / Java
ProActive: Partial / Partial / Java
Pegasus: Yes / Partial / VDL
Swift: Yes / Partial / SwiftScript
1.3 Thesis objectives and contributions

Objective: create a programming model for the Grid
– Grid unaware
– Implicit parallelism
– Sequential programming
– Allows the use of well-known imperative languages
– Speeds up applications
– Includes fault detection and recovery
1.3 Thesis objectives and contributions

Contribution: GRID superscalar
– Programming interface
– Runtime environment
– Fault tolerance features
Outline
1. Introduction
2. Programming interface
   2.1 Design
   2.2 User interface
   2.3 Programming comparison
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work
2.1 Design
Interface objectives
– Grid unaware
– Implicit parallelism
– Sequential programming
– Allows the use of well-known imperative languages
2.1 Design
Target applications
– Algorithms that may be easily split into tasks
  • Branch and bound computations, divide and conquer algorithms, recursive algorithms, …
– Coarse-grained tasks
– Independent tasks
  • Scientific workflows, optimization algorithms, parameter sweeps
– Main parameters: FILES
  • External simulators, finite element solvers, BLAST, GAMESS
2.1 Design
Application’s architecture: a master-worker paradigm
– The master-worker parallel paradigm fits our objectives
– Main program: the master
– Functions: the workers
  • Function = generic representation of a task
– Glue to transform a sequential application into a master-worker application: stubs and skeletons (as in RMI, RPC, …)
  • Stub: call to the runtime interface
  • Skeleton: binary which calls the user function
2.1 Design
Local scenario

app.c:
for (i = 0; i < MSIZE; i++)
  for (j = 0; j < MSIZE; j++)
    for (k = 0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j));

app-functions.c:
void matmul(char *f1, char *f2, char *f3)
{
  getBlocks(f1, f2, f3, A, B, C);
  for (i = 0; i < A->rows; i++)
    for (j = 0; j < B->cols; j++)
      for (k = 0; k < A->cols; k++)
        C->data[i][j] += A->data[i][k] * B->data[k][j];
  putBlocks(f1, f2, f3, A, B, C);
}
2.1 Design
Master-worker paradigm: app.c (the master) dispatches tasks through the Grid middleware to many instances of app-functions.c (the workers).
2.1 Design
Intermediate language concept: assembler code
  C, C++, … -> Assembler -> Processor execution
  In GRIDSs: C, C++, … -> Workflow -> Grid execution
The Execute generic interface
– Instruction set is defined by the user
– Single entry point to the runtime
– Allows easy building of programming language bindings (Java, Perl, Shell script)
  • Easier technology adoption
2.2 User interface
Steps to program an application
– Task definition
  • Identify the functions/programs in the application that are going to be executed in the computational Grid
  • All parameters must be passed in the header (remote execution)
– Interface Definition Language (IDL)
  • For every task defined, identify which parameters are input/output files and which are input/output scalars
– Programming API: master and worker
  • Write the main program and the tasks using the GRIDSs API
2.2 User interface

Interface Definition Language (IDL) file
– CORBA-IDL-like interface:
  • in/out/inout files
  • in/out/inout scalar values
– The functions listed in this file will be executed in the Grid

interface MATMUL {
  void matmul(in File f1, in File f2, inout File f3);
};
2.2 User interface
Programming API: master and worker
Master side (app.c): GS_On, GS_Off, GS_FOpen/GS_FClose, GS_Open/GS_Close, GS_Barrier, GS_Speculative_End
Worker side (app-functions.c): GS_System, gs_result, GS_Throw
2.2 User interface
Task constraints and cost specification
– Constraints: specify the needs of a task (CPU, memory, architecture, software, …)
  • Build an expression in a constraint function (evaluated for every machine), e.g.:
      other.Mem == 1024
– Cost: estimated execution time of a task (in seconds)
  • Useful for scheduling
  • Calculated in a cost function; GS_GFlops / GS_Filesize may be used, e.g.:
      cost = operations / GS_GFlops();
  • An external estimator can also be called
2.3 Programming comparison
Globus vs GRIDSs
int main()
{
  rsl = "&(executable=/home/user/sim)(arguments=input1.txt output1.txt)"
        "(file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input1.txt /home/user/input1.txt))"
        "(file_stage_out=/home/user/output1.txt gsiftp://bscgrid01.bsc.es/path/output1.txt)"
        "(file_clean_up=/home/user/input1.txt /home/user/output1.txt)";
  globus_gram_client_job_request("bscgrid02.bsc.es", rsl, NULL, NULL);

  rsl = "&(executable=/home/user/sim)(arguments=input2.txt output2.txt)"
        "(file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input2.txt /home/user/input2.txt))"
        "(file_stage_out=/home/user/output2.txt gsiftp://bscgrid01.bsc.es/path/output2.txt)"
        "(file_clean_up=/home/user/input2.txt /home/user/output2.txt)";
  globus_gram_client_job_request("bscgrid03.bsc.es", rsl, NULL, NULL);

  rsl = "&(executable=/home/user/sim)(arguments=input3.txt output3.txt)"
        "(file_stage_in=(gsiftp://bscgrid01.bsc.es/path/input3.txt /home/user/input3.txt))"
        "(file_stage_out=/home/user/output3.txt gsiftp://bscgrid01.bsc.es/path/output3.txt)"
        "(file_clean_up=/home/user/input3.txt /home/user/output3.txt)";
  globus_gram_client_job_request("bscgrid04.bsc.es", rsl, NULL, NULL);
}

Grid-aware; explicit parallelism
2.3 Programming comparison
Globus vs GRIDSs
void sim(File input, File output)
{
  command = "/home/user/sim " + input + ' ' + output;
  gs_result = GS_System(command);
}

int main()
{
  GS_On();
  sim("/path/input1.txt", "/path/output1.txt");
  sim("/path/input2.txt", "/path/output2.txt");
  sim("/path/input3.txt", "/path/output3.txt");
  GS_Off(0);
}
2.3 Programming comparison
DAGMan vs GRIDSs
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D

(graph: A -> B, C; B, C -> D)
Explicit parallelism; no if/while clauses

int main()
{
  GS_On();
  task_A(f1, f2, f3);
  task_B(f2, f4);
  task_C(f3, f5);
  task_D(f4, f5, f6);
  GS_Off(0);
}
2.3 Programming comparison
Ninf-G vs GRIDSs

int main()
{
  grpc_initialize("config_file");
  grpc_object_handle_init_np("A", &A_h, "class");
  grpc_object_handle_init_np("B", &B_h, "class");
  for (i = 0; i < 25; i++) {
    grpc_invoke_async_np(A_h, "foo", &sid, f_in[2*i], f_out[2*i]);
    grpc_invoke_async_np(B_h, "foo", &sid, f_in[2*i+1], f_out[2*i+1]);
    grpc_wait_all();
  }
  grpc_object_handle_destruct_np(&A_h);
  grpc_object_handle_destruct_np(&B_h);
  grpc_finalize();
}

int main()
{
  GS_On();
  for (i = 0; i < 50; i++)
    foo(f_in[i], f_out[i]);
  GS_Off(0);
}

Grid-aware; explicit parallelism
2.3 Programming comparison
VDL vs GRIDSs

DV trans1( a2=@{output:tmp.0}, a1=@{input:filein.0} );
DV trans2( a2=@{output:fileout.0}, a1=@{input:tmp.0} );
DV trans1( a2=@{output:tmp.1}, a1=@{input:filein.1} );
DV trans2( a2=@{output:fileout.1}, a1=@{input:tmp.1} );
...
DV trans1( a2=@{output:tmp.999}, a1=@{input:filein.999} );
DV trans2( a2=@{output:fileout.999}, a1=@{input:tmp.999} );

int main()
{
  GS_On();
  for (i = 0; i < 1000; i++) {
    tmp = "tmp." + i;
    filein = "filein." + i;
    fileout = "fileout." + i;
    trans1(tmp, filein);
    trans2(fileout, tmp);
  }
  GS_Off(0);
}

No if/while clauses
Outline
1. Introduction
2. Programming interface
3. Runtime
   3.1 Scientific contributions
   3.2 Developments
   3.3 Evaluation tests
4. Fault tolerance at the programming model level
5. Conclusions and future work
3.1 Scientific contributions
Runtime objectives
– Extract implicit parallelism in sequential applications
– Speed up execution using the Grid
Main requirement: Grid middleware
– Job management
– Data transfer
– Security
3.1 Scientific contributions
Apply computer architecture knowledge to the Grid (the superscalar processor)
(diagram: superscalar processor block diagram with IFU, BXU, IDU, ISU, LSU, FXU and FPU units plus L2/L3 directory/control, working at the ns scale; the Grid works at the scale of seconds/minutes/hours)
3.1 Scientific contributions
Data dependence analysis: allows parallelism
Read after Write:  task1(..., f1) ; task2(f1, ...)
Write after Read:  task1(f1, ...) ; task2(..., f1)
Write after Write: task1(..., f1) ; task2(..., f1)
3.1 Scientific contributions
for (i = 0; i < MSIZE; i++)
  for (j = 0; j < MSIZE; j++)
    for (k = 0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j))

Generated task chains (each chained by the inout C block):
i = 0, j = 0 (k = 0, 1, 2): matmul(A(0,0), B(0,0), C(0,0)) -> matmul(A(0,1), B(1,0), C(0,0)) -> matmul(A(0,2), B(2,0), C(0,0))
i = 0, j = 1 (k = 0, 1, 2): matmul(A(0,0), B(0,1), C(0,1)) -> matmul(A(0,1), B(1,1), C(0,1)) -> matmul(A(0,2), B(2,1), C(0,1))
...
3.1 Scientific contributions
for (i = 0; i < MSIZE; i++)
  for (j = 0; j < MSIZE; j++)
    for (k = 0; k < MSIZE; k++)
      matmul(A(i,k), B(k,j), C(i,j))

Each (i, j) pair yields a chain of tasks over k = 0, 1, 2 linked by the inout block C(i,j); chains for different (i, j) carry no dependences between them, so (i = 0, j = 0), (i = 0, j = 1), (i = 0, j = 2), …, (i = 1, j = 0), (i = 1, j = 1), (i = 1, j = 2), … can all run in parallel.
3.1 Scientific contributions
File renaming: increase parallelism
Read after Write:  task1(..., f1) ; task2(f1, ...)  (unavoidable)
Write after Read:  task1(f1, ...) ; task2(..., f1) renamed to task2(..., f1_NEW)  (avoidable)
Write after Write: task1(..., f1) ; task2(..., f1) renamed to task2(..., f1_NEW)  (avoidable)
3.2 Developments
Basic functionality
– Job submission (middleware usage)
  • Select sources for input files
  • Submit, monitor or cancel jobs
  • Results collection
– API implementation
  • GS_On: read configuration file and environment
  • GS_Off: wait for tasks, clean up remote data, undo renaming
  • GS_(F)Open: create a local task
  • GS_(F)Close: notify end of local task
  • GS_Barrier: wait for all running tasks to finish
  • GS_System: translate path
  • GS_Speculative_End: barrier until a throw; if a throw occurs, discard tasks from the throw to GS_Speculative_End
  • GS_Throw: use gs_result to notify it
3.2 Developments
Task scheduling: Directed Acyclic Graph (tasks are submitted to the Grid middleware as their dependences are satisfied)
3.2 Developments
Task scheduling: resource brokering
– A resource broker is needed (but not an objective)
– Grid configuration file
  • Information about hosts (hostname, limit of jobs, queue, working directory, quota, …)
  • Initial set of machines (can be changed dynamically)

<?xml version="1.0" encoding="UTF-8"?>
<project isSimple="yes" masterBandwidth="100000" masterBuildScript=""
         masterInstallDir="/home/rsirvent/matmul-master" masterName="bscgrid01.bsc.es"
         masterSourceDir="/datos/GRID-S/GT4/doc/examples/matmul" name="matmul"
         workerBuildScript="" workerSourceDir="/datos/GRID-S/GT4/doc/examples/matmul">
...
<workers>
<worker Arch="x86" GFlops="5.985" LimitOfJobs="2" Mem="1024" NCPUs="2"
        NetKbps="100000" OpSys="Linux" Queue="none" Quota="0"
        deploymentStatus="deployed" installDir="/home/rsirvent/matmul-worker"
        name="bscgrid01.bsc.es">
3.2 Developments
Task scheduling: resource brokering
– Scheduling policy
  • Estimation of the total execution time of a single task on a resource:
      t = FileTransferTime + ExecutionTime
  • FileTransferTime: time to transfer the needed files to a resource (calculated with the hosts information and the location of files); the fastest source for each file is selected
  • ExecutionTime: estimation of the task's run time in a resource, given by an interface function (can be calculated, or estimated by an external entity); the fastest resource for execution is favoured
  • The resource with the smallest estimation is selected
3.2 Developments
Task scheduling: resource brokering
– Match task constraints and machine capabilities
– Implemented using the ClassAd library
  • Machine: offers capabilities (from the Grid configuration file: memory, architecture, …)
  • Task: demands capabilities
– Filter candidate machines for a particular task, e.g. a task demanding
    Software = BLAST
  matches a machine offering
    SoftwareList = BLAST, GAMESS
  but not one offering
    SoftwareList = GAMESS
3.2 Developments
Task scheduling: file locality
– The FileTransferTime term of t = FileTransferTime + ExecutionTime favours machines that already hold a task's files (diagram: input files f1, f2, f3 reused on the machines where they already reside)
3.2 Developments
Other file locality exploitation mechanisms
– Shared input disks
  • NFS or replicated data
– Shared working directories
  • NFS
– Erasing unused versions of files (decreases disk usage)
– Disk quota control (locality increases disk usage and quota may be lower than expected)
3.3 Evaluation
NAS Grid Benchmarks: representative benchmark; includes different types of workflows which emulate a wide range of Grid applications
Simple optimization example: representative of optimization algorithms; workflow with two-level synchronization
New product and process development: production application; workflow with parallel chains of computation
Potential energy hypersurface for acetone: massively parallel, long-running application
Protein comparison: production application; big computational challenge; massively parallel; high number of tasks
fastDNAml: well-known application in the context of MPI for Grids; workflow with synchronization steps
3.3 Evaluation
NAS Grid Benchmarks
(workflow diagrams, each running from Launch to Report, with MF nodes marking file transfers between tasks:
  ED: independent SP tasks
  HC: chains of BT -> SP -> LU tasks
  VP: pipeline of BT -> MG -> FT stages
  MB: LU -> MG -> FT stages with dependences between them)
3.3 Evaluation
Run with classes S, W, A (2 machines x 4 CPUs)
VP benchmark must be analyzed in detail (does not scale up to 3 CPUs)
(chart: speedup vs. limit of tasks, 1 to 4, on Kadesh8, for the series MB.S, MB.W, VP.S, VP.W, VP.A; speedups range from 0 to 3)
3.3 Evaluation
Performance analysis
– GRID superscalar runtime instrumented
– Paraver tracefiles from the client side
– The lifecycle of all tasks has been studied in detail
Globus overhead (VP.W): overhead of the GRAM Job Manager polling interval
(chart: per-task time in seconds for tasks 1 to 15, split into task duration, "Request to Active" and "Active to Done" phases; up to ~45 s per task)
3.3 Evaluation
VP.S task assignment
– 14.7% of the transfers when exploiting locality
– VP is parallel, but its last part is sequentially executed
(diagram: BT -> MG -> FT chains with MF transfer nodes assigned to Kadesh8 and Khafre; arrows mark remote file transfers)
3.3 Evaluation
Conclusion: workflow and granularity are important to achieve speed up
Two-dimensional potential energy hypersurface for acetone as a function of two torsion angles
3.3 Evaluation
3.3 Evaluation
Number of executed tasks: 1120 Each task between 45 and 65 minutes Speed up: 26.88 (32 CPUs), 49.17 (64 CPUs) Long running test, heterogeneous and
transatlantic Grid
22 CPUs14 CPUs
28 CPUs
3.3 Evaluation

(figure: 15 million proteins compared against genomes)
15 million protein sequences have been compared using BLAST and GRID superscalar
3.3 Evaluation
100,000 tasks on 4,000 CPUs (= 1,000 exclusive nodes)
"Grid" of 1,000 machines with very low latency between them
– Stress test for the runtime
Saves the user from dealing with the queuing system directly
Saves the queuing system from handling a huge set of independent tasks
GRID superscalar: programming interface and runtime
Publications
Raül Sirvent, Josep M. Pérez, Rosa M. Badia, Jesús Labarta, "Automatic Grid workflow based on imperative programming languages", Concurrency and Computation: Practice and Experience, John Wiley & Sons, vol. 18, no. 10, pp. 1169-1186, 2006.
Rosa M. Badia, Raul Sirvent, Jesus Labarta, Josep M. Perez, "Programming the GRID: An Imperative Language-based Approach", Engineering The Grid: Status and Perspective, Section 4, Chapter 12, American Scientific Publishers, January 2006.
Rosa M. Badia, Jesús Labarta, Raül Sirvent, Josep M. Pérez, José M. Cela and Rogeli Grima, "Programming Grid Applications with GRID Superscalar", Journal of Grid Computing, Volume 1, Issue 2, 2003.
GRID superscalar: programming interface and runtime
Work related to standards
R.M. Badia, D. Du, E. Huedo, A. Kokossis, I. M. Llorente, R. S. Montero, M. de Palol, R. Sirvent, and C. Vázquez, "Integration of GRID superscalar and GridWay Metascheduler with the DRMAA OGF Standard", Euro-Par, 2008.
Raül Sirvent, Andre Merzky, Rosa M. Badia, Thilo Kielmann, "GRID superscalar and SAGA: forming a high-level and platform-independent Grid programming environment", CoreGRID Integration Workshop. Integrated Research in Grid Computing, Pisa (Italy), 2005.
Outline
1. Introduction
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
   4.1 Checkpointing
   4.2 Retry mechanisms
   4.3 Task replication
5. Conclusions and future work
4.1 Checkpointing
Inter-task checkpointing
Recovers sequential consistency in the out-of-order execution of tasks
– A single version of every file is saved
– No need to save any data structures in the runtime
Drawback: some completed tasks may be lost
– Application-level checkpointing can avoid this
(diagram: task sequence 0 to 6 with the checkpoint position marked)
4.1 Checkpointing
Conclusions
– Low complexity in order to checkpoint a task
  • ~1% overhead introduced
– Can deal with both application-level and Grid-level errors
  • Most important when an unrecoverable error appears
– Transparent for end users
4.2 Retry mechanisms
Automatic drop of machines (diagram: a failing machine is dropped from the set of available machines)
4.2 Retry mechanisms
Soft and hard timeouts for tasks (diagram: state transitions with soft timeout, hard timeout, success and failure)
4.2 Retry mechanisms
Retry of operations (diagram: requests and syscalls are re-attempted on failure until they succeed)
4.2 Retry mechanisms
Conclusions
– Keep running despite failures
– Dynamic: when and where to resubmit
– Detects performance degradations
– No overhead when no failures are detected
– Transparent for end users
4.3 Task replication
Replicate running tasks depending on successors (diagram: a task graph where a running task with many pending successors is replicated)
4.3 Task replication
Replicate running tasks to speed up the execution (diagram: a running task replicated on a second machine; the first replica to finish is kept)
4.3 Task replication
Conclusions
– Dynamic replication: application-level knowledge is used (the workflow)
– Replication can deal with failures, hiding the retry overhead
– Replication can speed up applications in heterogeneous Grids
– Transparent for end users
– Drawback: increased usage of resources
4. Fault tolerance features
Publications
Vasilis Dialinos, Rosa M. Badia, Raül Sirvent, Josep M. Pérez and Jesús Labarta, "Implementing Phylogenetic Inference with GRID superscalar", Cluster Computing and Grid 2005 (CCGRID 2005), Cardiff, UK, 2005.
Raül Sirvent, Rosa M. Badia and Jesús Labarta, "Graph-based task replication for workflow applications", Submitted, HPCC 2009.
Outline
1. Introduction
2. Programming interface
3. Runtime
4. Fault tolerance at the programming model level
5. Conclusions and future work
5. Conclusions and future work
Grid-unaware programming model
Transparent features for users, exploiting parallelism and failure treatment
Used in REAL systems and REAL applications
Some future research is already ONGOING (StarSs)
5. Conclusions and future work
Future work
– Grid of supercomputers (Red Española de Supercomputación)
– Higher-scale tests (hundreds? thousands?)
– More complex brokering
  • Resource discovery/monitoring
  • New scheduling policies based on the workflow
  • Automatic prediction of execution times
– New policies for task replication
– New architectures for StarSs