
San Diego Supercomputer Center

Grid Physics Network (GriPhyN)

University of Florida

Programming Gridflows using Matrix

Arun Jagatheesan

Architect, SDSC Matrix

San Diego Supercomputer Center

SDSC Tech Talk

SDSC, UCSD

Talk Outline

• Where do we need this?
• Infrastructure-based Execution Logic (Concept?)
• Matrix Project Overview (Who?)
• Data Grid Language and Programming
  • Gridflow Runnable (flowable)
  • Flow
  • Gridflow Metadata
  • ECA rules
• Other benefits
• What Next – Straight Talk

Data handling pipeline (data information pipeline)

• Metadata derivation
• Ingest Metadata
• Ingest Data
• Determine analysis pipeline
• Initiate automated analysis
• Organize result data into distributed data grid collections

Use the optimal set of resources based on the task – on demand

Pipeline could be triggered by input at data source or by a data request from user

All gridflow activities stored for data flow provenance

Generic Gridflow Scenario

• Application X, Application Y, Application Z
• May be different programming languages, programmers, different execution environments
• May be in different grid domains (sites)
• Pass data between each other during their execution
• SDSC Note: Might use a data grid environment that works!

Example for Generic Gridflow

1. Ingest 1 million URLs into digital library using URL Ingestor (or harvester – App X)
2. For each URL, iterate with 5 parallel executions:
   1. Do some processing on the file (App Y)
   2. Store the output file from App Y in a grid disk resource
   3. Replicate a copy of the same file in a grid archive resource
   4. Calculate MD5 checksum (App Z) for file in disk
   5. Calculate MD5 checksum (App Z) for file in archive
   6. If checksums mismatch, ingest a metadata warning flag

Annotations on the slide: For each; If checksums mismatch; Pattern; Gridflow metadata processing; Rules; Late binding

Traditional way

• Write a customized program
• Create a common program that can invoke the distributed or localized applications using appropriate client code
• Hardwire all the apps (X, Y, Z) together
• Have this customized program as the delegator invoking all other applications
• Declare the necessary variables, implement the rules/conditions also [like the checksum1 == checksum2]
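
A minimal sketch of what such a hard-wired delegator might look like in plain Java. Everything here is an illustrative assumption: `HardwiredPipeline`, `appY` and `appZ` are hypothetical stand-ins for the real applications, and the "disk" and "archive" copies are just byte arrays in memory. Note how the degree of parallelism (5) and the checksum rule are both compiled into the program.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical hard-wired delegator; none of these names come from Matrix.
final class HardwiredPipeline {

    // Stand-in for App Y: "processes" a URL and returns the output bytes.
    static byte[] appY(String url) {
        return ("processed:" + url).getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for App Z: MD5 checksum as lowercase hex.
    static String appZ(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns the URLs whose disk and archive checksums differ.
    static List<String> run(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(5); // parallelism is hard-coded
        try {
            List<Future<Boolean>> matches = new ArrayList<>();
            for (String url : urls) {
                matches.add(pool.submit(() -> {
                    byte[] diskCopy = appY(url);            // "store" on disk
                    byte[] archiveCopy = diskCopy.clone();  // "replicate" to archive
                    return appZ(diskCopy).equals(appZ(archiveCopy)); // rule is hard-coded too
                }));
            }
            List<String> flagged = new ArrayList<>();
            for (int i = 0; i < urls.size(); i++) {
                if (!matches.get(i).get()) {
                    flagged.add(urls.get(i)); // would ingest a metadata warning flag
                }
            }
            return flagged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pipeline failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Flagged: " + run(List.of("http://example.org/a", "http://example.org/b")));
    }
}
```

Changing the parallelism, the rule, or the order of the apps means editing and recompiling this program.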

Why take the Gridflow approach?

• What if scenarios…
  • The infrastructure can run more or fewer things in parallel
  • The cyber-infrastructure has more resources for distribution (an app can be run at multiple places for different parameters – parameter sweep distribution)
  • Different metadata conditions or milestones
    • Run this till the molecule changes from green to red (or yellow)
  • Change in the sequence of execution itself (new app)
• Process provenance is required
• Anyway, you are not coding/changing your application to fit into the gridflow environment (it’s the other way around) – make simple changes only in the execution logic…

Infrastructure-based Execution Logic

• Each gridflow has different executables
  • App X, App Y, App Z – Runnable or “Flowable”
• How should these flowables be run?
  • Parallel, Sequential, for-each input item (pipeline), while, switch
  • Capture this as a Flow
• Is there a condition?
  • Run till exit value = 0 or till molecule color changes to red
  • Are there metadata variables? (color)
• Describe this Execution Logic separately
  • Loosely coupled, modified without compilation
  • Use an XML-based language
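
One way to picture the separation is a flow object that holds the "how to run" decision as data, apart from the things being run. The sketch below is an assumption for illustration only: `Flowable`, `GridFlow` and `Pattern` are hypothetical names, not the Matrix API, and only two of the patterns are shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical types for illustration; not the Matrix API.
interface Flowable {
    void run();
}

// A flow is itself a flowable, so flows can be nested.
final class GridFlow implements Flowable {
    enum Pattern { SEQUENTIAL, PARALLEL }

    private final Pattern pattern;
    private final List<Flowable> members = new ArrayList<>();

    GridFlow(Pattern pattern) {
        this.pattern = pattern;
    }

    GridFlow add(Flowable member) {
        members.add(member);
        return this;
    }

    @Override
    public void run() {
        if (pattern == Pattern.SEQUENTIAL) {
            members.forEach(Flowable::run);
        } else {
            // Start every member, then wait for all of them to finish.
            CompletableFuture<?>[] running = members.stream()
                    .map(m -> CompletableFuture.runAsync(m::run))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(running).join();
        }
    }
}

final class GridFlowDemo {
    public static void main(String[] args) {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        GridFlow flow = new GridFlow(GridFlow.Pattern.SEQUENTIAL)
                .add(() -> log.add("App X"))
                .add(new GridFlow(GridFlow.Pattern.PARALLEL)
                        .add(() -> log.add("App Y"))
                        .add(() -> log.add("App Z")));
        flow.run();
        System.out.println(log); // App X first; App Y and App Z in either order
    }
}
```

Switching the inner flow from PARALLEL to SEQUENTIAL changes only the description of the flow, not App Y or App Z.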

That is why we started Matrix Project

• Movie break ….

• Language to describe and execute this Infrastructure-based Execution Logic

• Software to design, query, run this logic

“Flowable”

• Anything that can run in a gridflow
• Not using Runnable (Java) as it’s taken in Thread paradigms
• Any App (single execution of App X, Y, Z)
• Any SRB based data grid step (to handle data)

“Flowable” in java

ExecuteProcessStep executeMD5 = new ExecuteProcessStep("executeMD5-Metadata", "md5");

executeMD5.setStdOut(new StreamData("$md5Sum", false));

executeMD5.addParameterAsExpression("$locationOfFile");

“Flowable” in DGL

<ns1:Step stepID="executeMD5-Metadata">
  <ns1:Operation>
    <ns1:ExecuteProcessOp>
      <ns1:StdParams>
        <ns1:exeURI>md5</ns1:exeURI>
        <ns1:input name="$locationOfFile">
          <ns1:string>$locationOfFile</ns1:string>
        </ns1:input>
        <ns1:std_out>
          <ns1:StdStreamData>
            <ns1:variable>$md5Sum</ns1:variable>
          </ns1:StdStreamData>
        </ns1:std_out>
      </ns1:StdParams>
    </ns1:ExecuteProcessOp>
  </ns1:Operation>
</ns1:Step>
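
To make the correspondence between a Java-side step and its DGL form concrete, here is a rough sketch of a writer that emits the same element structure as the slide. `DglStepWriter` is a hypothetical helper, not part of Matrix; it copies only the element names visible above, does no XML escaping, and assumes the values contain no special characters.

```java
// Hypothetical generator: mirrors the element names on the slide, nothing more.
final class DglStepWriter {

    // Renders one execute-process step; arguments are assumed to be XML-safe.
    static String executeProcessStep(String stepId, String exeUri,
                                     String inputVar, String stdOutVar) {
        return String.join("\n",
                "<ns1:Step stepID=\"" + stepId + "\">",
                "  <ns1:Operation>",
                "    <ns1:ExecuteProcessOp>",
                "      <ns1:StdParams>",
                "        <ns1:exeURI>" + exeUri + "</ns1:exeURI>",
                "        <ns1:input name=\"" + inputVar + "\">",
                "          <ns1:string>" + inputVar + "</ns1:string>",
                "        </ns1:input>",
                "        <ns1:std_out>",
                "          <ns1:StdStreamData>",
                "            <ns1:variable>" + stdOutVar + "</ns1:variable>",
                "          </ns1:StdStreamData>",
                "        </ns1:std_out>",
                "      </ns1:StdParams>",
                "    </ns1:ExecuteProcessOp>",
                "  </ns1:Operation>",
                "</ns1:Step>");
    }

    public static void main(String[] args) {
        System.out.println(executeProcessStep(
                "executeMD5-Metadata", "md5", "$locationOfFile", "$md5Sum"));
    }
}
```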

Data Grid Language (DGL)

• XML based gridflow description
  • Describes execution flow logic
• ECA-based rule description for execution
  • ECA = Event, Condition, Action
• Querying of Status of Gridflow
  • XQuery / Simple query of a Gridflow Execution
• Scoped variables and gridflow patterns
  • For control of execution flow logic
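
As a rough illustration of the Event, Condition, Action idea, the sketch below models a rule as three parts evaluated against the flow's variables. `EcaRule`, the event name `afterChecksums`, and the variable names are all hypothetical; DGL expresses rules in XML, and this is not its syntax.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical ECA rule; names are illustrative, not DGL or Matrix identifiers.
final class EcaRule {
    private final String event;
    private final Predicate<Map<String, Object>> condition;
    private final Consumer<Map<String, Object>> action;

    EcaRule(String event,
            Predicate<Map<String, Object>> condition,
            Consumer<Map<String, Object>> action) {
        this.event = event;
        this.condition = condition;
        this.action = action;
    }

    // Runs the action only when the event name matches and the condition holds.
    boolean onEvent(String firedEvent, Map<String, Object> variables) {
        if (event.equals(firedEvent) && condition.test(variables)) {
            action.accept(variables);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        EcaRule rule = new EcaRule(
                "afterChecksums",
                vars -> !vars.get("md5Disk").equals(vars.get("md5Archive")),
                vars -> vars.put("warningFlag", Boolean.TRUE));

        Map<String, Object> vars = new HashMap<>();
        vars.put("md5Disk", "abc");
        vars.put("md5Archive", "abd");
        rule.onEvent("afterChecksums", vars);
        System.out.println("warningFlag = " + vars.get("warningFlag"));
    }
}
```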

Gridflow Patterns

• These basic things can be combined together
  • E.g. Execute all 9 flowables in parallel
  • Switch based on color:
    • Red: App X
    • Green: App Y
• Gridflow Patterns
  • Sequential, Parallel, For-Each-Parallel, For-each-sequential, Switch, While / MileStone processing
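
The switch and while / milestone patterns could be sketched roughly as follows. This is plain Java under stated assumptions: `PatternSketch`, `switchOn` and `runUntil` are hypothetical helpers, the "color" is a simple string, and a run limit is added so the loop cannot spin forever.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helpers illustrating two gridflow patterns; not the Matrix API.
final class PatternSketch {

    // Switch pattern: run the flowable registered for the current value, if any.
    static boolean switchOn(String value, Map<String, Runnable> cases) {
        Runnable chosen = cases.get(value);
        if (chosen == null) {
            return false;
        }
        chosen.run();
        return true;
    }

    // While / milestone pattern: repeat the body until the observed value
    // reaches the milestone, or until maxRuns is hit. Returns the run count.
    static int runUntil(Supplier<String> observe, String milestone,
                        Runnable body, int maxRuns) {
        int runs = 0;
        while (runs < maxRuns && !milestone.equals(observe.get())) {
            body.run();
            runs++;
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] color = {"green"};
        int[] steps = {0};

        // The molecule turns red on the third run of the body.
        int runs = runUntil(() -> color[0], "red",
                () -> { if (++steps[0] == 3) color[0] = "red"; }, 100);
        System.out.println("runs until red: " + runs);

        Map<String, Runnable> cases = new HashMap<>();
        cases.put("red", () -> System.out.println("App X"));
        cases.put("green", () -> System.out.println("App Y"));
        switchOn(color[0], cases);
    }
}
```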

Gridflow Pattern in Java

// forEach file in the collectionList, do some processing
ForEachFlow forEach = new ForEachFlow("forEachFlow", "file",
        new CollectionList("$collectionList"));

// could also say how many files to be handled in parallel
// A DGL (XML) code would be generated

Flow

• Scoped Variables that can control the flow
• Logic used by the sub-members
• Sub-members that are the real execution statements

Gridflow Variable in Java

/* Create a variable called "collectionList" with an initial value of "empty".
   This variable is a string now, but will later be used to hold a CollectionList.
   This is OK because variables are dynamically typed in DGL. */

processFilesFlow.addVariable("collectionList", "empty");
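
A small sketch of how scoped, dynamically typed variables might behave, assuming a simple parent-chain lookup. `VariableScope` is a hypothetical class for illustration; how Matrix actually stores and resolves variables is not shown in these slides.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variable scope; not the Matrix implementation.
final class VariableScope {
    private final VariableScope parent; // null for the outermost flow
    private final Map<String, Object> values = new HashMap<>();

    VariableScope(VariableScope parent) {
        this.parent = parent;
    }

    void declare(String name, Object initialValue) {
        values.put(name, initialValue);
    }

    // Assigns to the nearest enclosing declaration; the new value may have
    // a different type from the old one (dynamic typing).
    void set(String name, Object value) {
        for (VariableScope s = this; s != null; s = s.parent) {
            if (s.values.containsKey(name)) {
                s.values.put(name, value);
                return;
            }
        }
        throw new IllegalArgumentException("Undeclared variable: " + name);
    }

    Object get(String name) {
        for (VariableScope s = this; s != null; s = s.parent) {
            if (s.values.containsKey(name)) {
                return s.values.get(name);
            }
        }
        throw new IllegalArgumentException("Undeclared variable: " + name);
    }

    public static void main(String[] args) {
        VariableScope flow = new VariableScope(null);
        flow.declare("collectionList", "empty");          // starts as a string
        VariableScope subFlow = new VariableScope(flow);
        subFlow.set("collectionList", List.of("a", "b")); // later holds a list
        System.out.println(flow.get("collectionList"));
    }
}
```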

Data Grid Request

• Annotations about the Data Grid Request
• Can be either a Flow or a Status Query

DGL Requests

• Data Grid Flow
  • An XML structure that describes the execution logic, associated procedural rules and grid environment variables
• Status Query
  • An XML structure used to query the execution status of any gridflow or sub-flow at any level of granularity
• A DGL or Matrix client sends either of these to the Matrix Server
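
Querying status "at any level of granularity" can be pictured as looking up a node in a tree of flows, sub-flows and steps. The sketch below is only an assumed in-memory model: `StatusNode`, its `State` values and the ids are hypothetical, and the real status query is an XML request sent to the server.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical status tree; not the Matrix status model.
final class StatusNode {
    enum State { PENDING, RUNNING, COMPLETE, FAILED }

    final String id;
    State state = State.PENDING;
    private final Map<String, StatusNode> children = new LinkedHashMap<>();

    StatusNode(String id) {
        this.id = id;
    }

    // Returns the child with this id, creating it on first use.
    StatusNode child(String childId) {
        return children.computeIfAbsent(childId, StatusNode::new);
    }

    // Depth-first lookup, so a query can target a flow, a sub-flow, or a step.
    Optional<StatusNode> find(String wantedId) {
        if (id.equals(wantedId)) {
            return Optional.of(this);
        }
        for (StatusNode c : children.values()) {
            Optional<StatusNode> hit = c.find(wantedId);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        StatusNode root = new StatusNode("processFilesFlow");
        StatusNode forEach = root.child("forEachFlow");
        forEach.child("executeMD5-Metadata").state = State.COMPLETE;
        forEach.state = State.RUNNING;

        System.out.println(root.find("executeMD5-Metadata").map(n -> n.state));
        System.out.println(root.find("forEachFlow").map(n -> n.state));
    }
}
```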

Matrix Gridflow Server Architecture

[Architecture diagram. Components shown:]

• SOAP Service for Matrix Clients; JAXM Wrapper; WSDL Description
• Matrix Data Grid Request Processor
• Transaction Handler; Status Query Handler
• Flow Handler and Execution Manager
• ECA rules Handler
• Gridflow Metadata Manager
• Workflow Query Processor; XQuery Processor
• JMS Messaging Interface; Event Publish, Subscribe, Notification
• Persistence (Store) Abstraction; In Memory Store; JDBC
• Matrix Agent Abstraction; SDSC SRB Agents; Other SDSC Data Services; Agents for Java, WSDL and other grid executables
• Sangam P2P Gridflow Broker and Protocols

Matrix Folks (Emeritus)

• Jonathan Weinberg
• Daniel Moore
• Allen Ding
• Reena Mathew
• Erik Vandekieft

Don’t you guys have a group picture?

SRB Java Folks

Luke - Jargon Man

One man development team

Says he works on strategies for SRB Java software

Hey, the guy on the right is all talk and no walk

Advantages from SRB Perspective

• Reduces the Client-Server Communication
  • The whole execution logic is sent to the server
  • Fewer WAN messages
  • Our experiments prove a significant increase in performance
• Datagrid Information Lifecycle Management
  • Autonomic: “Move data at 9:00 PM in weekdays and in weekends”
• Data Grid Administration
  • Power-users and Sophisticated Users
  • Data Grid Administrator (Rules to manage data grid)
  • Scientist or Librarian (Visualized data flow programming)

Using DG-Modeler

• GUI for dataflow programming

Gridflow Process I (Vision)

End User using DGBuilder

Gridflow Description

Data Grid Language

Gridflow Process II (Vision)

Abstract Gridflow using Data Grid Language

Planner

Concrete Gridflow Using Data Grid Language

Gridflow Process III (Vision)

Gridflow P2P Network

Gridflow Processor

Concrete Gridflow Using Data Grid Language

got ideas/suggestions?

Contact: SDSC Matrix project

[email protected]

Google key word: SDSC Gridflow
