San Diego Supercomputer Center
Grid Physics Network (GriPhyN)
University of Florida
Programming Gridflows using Matrix
Arun Jagatheesan
Architect, SDSC Matrix
San Diego Supercomputer Center
SDSC Tech Talk
SDSC, UCSD
Talk Outline
• Where do we need this?
• Infrastructure-based Execution logic (Concept?)
• Matrix Project Overview (Who?)
• Data Grid Language and Programming
  • Gridflow Runnable (flowable)
  • Flow
  • Gridflow Metadata
  • ECAA rules
• Other benefits
• What Next – Straight Talk
Data handling pipeline (data information pipeline)
• Metadata derivation
• Ingest Metadata
• Ingest Data
• Determine analysis pipeline
• Initiate automated analysis
• Organize result data into distributed data grid collections
Use the optimal set of resources based on the task – on demand
Pipeline could be triggered by input at data source or by a data request from user
All gridflow activities stored for data flow provenance
Generic Gridflow Scenario
• Application X, Application Y, Application Z
• May be different programming languages, programmers, different execution environments
• May be in different grid domains (sites)
• Pass data between each other during their execution
• SDSC Note: Might use a data grid environment that works!
Example for Generic Gridflow
1. Ingest 1 million URLs into digital library using URL Ingestor (or harvester – App X)
2. For each URL, iterate with 5 parallel executions:
   1. Do some processing on the file (App Y)
   2. Store the output file from App Y in a grid disk resource
   3. Replicate a copy of same file in a grid archive resource
   4. Calculate MD5 checksum (App Z) for file in disk
   5. Calculate MD5 checksum (App Z) for file in archive
   6. If checksums mismatch, ingest a metadata warning flag
Slide callouts: For each; If checksums mismatch; Pattern; Gridflow metadata processing; Rules; Late binding
Traditional way
• Write a customized program
  • Create a common program that can invoke the distributed or localized applications using appropriate client code
  • Hardwire all the apps (X, Y, Z) together
  • Have this customized program as the delegator invoking all other applications
  • Declare the necessary variables, implement the rules/conditions also [like the checksum1 == checksum2]
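The hardwired approach listed above might look roughly like the following Java sketch. Everything in it is a hypothetical stand-in, not Matrix or SRB code: `appY` plays the role of App Y, `md5Hex` plays App Z, and the `disk`, `archive` and `warnings` maps stand in for the grid disk resource, the grid archive resource and the metadata catalog. Note how the parallelism of 5 and the checksum rule are compiled into the program.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class HardwiredDelegator {
    // Hypothetical stand-ins for the grid disk and grid archive resources.
    static final Map<String, byte[]> disk = new ConcurrentHashMap<>();
    static final Map<String, byte[]> archive = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the metadata catalog holding warning flags.
    static final Map<String, String> warnings = new ConcurrentHashMap<>();

    // Stand-in for App Y: "processes" the file behind a URL.
    static byte[] appY(String url) {
        return ("processed:" + url).getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for App Z: MD5 checksum rendered as a hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Steps 2.1 to 2.6 for one URL, wired together in code.
    static void handle(String url) {
        byte[] out = appY(url);
        disk.put(url, out);
        archive.put(url, out.clone());
        String checksum1 = md5Hex(disk.get(url));
        String checksum2 = md5Hex(archive.get(url));
        if (!checksum1.equals(checksum2)) {
            warnings.put(url, "checksum mismatch");
        }
    }

    // The delegator: the degree of parallelism (5) is fixed at compile time.
    static int run(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(pool.submit(() -> handle(url)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every URL to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return warnings.size();
    }

    public static void main(String[] args) {
        int flagged = run(Arrays.asList("http://example.org/a", "http://example.org/b"));
        System.out.println("files flagged: " + flagged);
    }
}
```

Changing the parallelism, the rule, or the order of the steps in this style means editing and recompiling the delegator.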
Why take the Gridflow approach?
• What if scenarios…
  • The infrastructure can run more or fewer things in parallel
  • The cyber-infrastructure has more resources for distribution (an app can be run at multiple places for different parameters: parameter sweep distribution)
  • Different metadata conditions or milestones
    • Run this till the molecule changes from green to red (or yellow)
  • Change in the sequence of execution itself (new app)
• Process provenance is required
• Anyway, you are not coding/changing your application to fit into the gridflow environment (it’s the other way around). Make simple changes only in the execution logic…
Infrastructure-based Execution Logic
• Each gridflow has different executables
  • App X, App Y, App Z – Runnable or “Flowable”
• How should these flowables be run?
  • Parallel, Sequential, for-each input item (pipeline), while, switch
  • Capture this as a Flow
• Is there a condition?
  • Run till exit value = 0 or till molecule color changes to red
• Are there metadata variables? (color)
• Describe this Execution Logic Separately
  • Loosely coupled, modified without compilation
  • Use an XML based language
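To make the separation concrete, here is a minimal Java sketch of execution logic kept apart from the thing being executed. It is an illustrative model only, not the Matrix API: the `Flowable` interface, the `WhileFlow` class and the `color` and `steps` variables are all assumptions made for this example. The "application" is a lambda that flips the color after three runs; the flow only knows the condition "run till the molecule color changes to red".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

class ExecutionLogicDemo {
    // Hypothetical model (not the Matrix API): anything that can run in a flow.
    interface Flowable {
        int run(Map<String, Object> vars); // returns an exit value
    }

    // "While" pattern: repeat the body as long as the condition holds.
    static class WhileFlow implements Flowable {
        private final Predicate<Map<String, Object>> condition;
        private final Flowable body;

        WhileFlow(Predicate<Map<String, Object>> condition, Flowable body) {
            this.condition = condition;
            this.body = body;
        }

        @Override
        public int run(Map<String, Object> vars) {
            int exit = 0;
            while (condition.test(vars)) {
                exit = body.run(vars);
            }
            return exit;
        }
    }

    // Builds and runs "repeat the simulation step until the color is red".
    static Map<String, Object> runUntilRed() {
        Map<String, Object> vars = new HashMap<>();
        vars.put("color", "green");
        vars.put("steps", 0);

        // Stand-in for a real application: turns the molecule red on its third run.
        Flowable simulate = v -> {
            int steps = (Integer) v.get("steps") + 1;
            v.put("steps", steps);
            if (steps == 3) {
                v.put("color", "red");
            }
            return 0;
        };

        Flowable flow = new WhileFlow(v -> !"red".equals(v.get("color")), simulate);
        flow.run(vars);
        return vars;
    }

    public static void main(String[] args) {
        System.out.println(runUntilRed());
    }
}
```

In this sketch the condition lives in the flow object rather than in the application, which is the property the slide asks for. In Matrix the same information is described in XML so it can be modified without compilation.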
That is why we started Matrix Project
• Movie break ….
• Language to describe and execute this Infrastructure-based Execution Logic
• Software to design, query, run this logic
“Flowable”
• Anything that can Run in a gridflow
• Not using Runnable (Java) as it's taken in Thread paradigms
• Any App (single execution of App X, Y, Z)
• Any SRB based data grid step (to handle data)
“Flowable” in Java
ExecuteProcessStep executeMD5 = new ExecuteProcessStep("executeMD5-Metadata", "md5");
executeMD5.setStdOut(new StreamData("$md5Sum", false));
executeMD5.addParameterAsExpression("$locationOfFile");
“Flowable” in DGL
<ns1:Step stepID="executeMD5-Metadata">
  <ns1:Operation>
    <ns1:ExecuteProcessOp>
      <ns1:StdParams>
        <ns1:exeURI>md5</ns1:exeURI>
        <ns1:input name="$locationOfFile">
          <ns1:string>$locationOfFile</ns1:string>
        </ns1:input>
        <ns1:std_out>
          <ns1:StdStreamData>
            <ns1:variable>$md5Sum</ns1:variable>
          </ns1:StdStreamData>
        </ns1:std_out>
      </ns1:StdParams>
    </ns1:ExecuteProcessOp>
  </ns1:Operation>
</ns1:Step>
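Because a DGL step is plain XML, any standard XML parser can inspect it. The sketch below reads the step shown above with the JDK's DOM parser and pulls out the step ID, the executable and the output variable. The namespace URI `urn:example:dgl` is a placeholder invented for this example (the slides do not show the real `ns1` binding), and `DglStepReader` is a hypothetical helper, not part of Matrix.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DglStepReader {
    // Placeholder namespace URI; the real DGL namespace is not given in the slides.
    static final String NS = "urn:example:dgl";

    // The step from the slide, with the placeholder namespace declared on the root.
    static final String STEP_XML =
        "<ns1:Step xmlns:ns1=\"" + NS + "\" stepID=\"executeMD5-Metadata\">"
        + "<ns1:Operation><ns1:ExecuteProcessOp><ns1:StdParams>"
        + "<ns1:exeURI>md5</ns1:exeURI>"
        + "<ns1:input name=\"$locationOfFile\">"
        + "<ns1:string>$locationOfFile</ns1:string></ns1:input>"
        + "<ns1:std_out><ns1:StdStreamData>"
        + "<ns1:variable>$md5Sum</ns1:variable>"
        + "</ns1:StdStreamData></ns1:std_out>"
        + "</ns1:StdParams></ns1:ExecuteProcessOp></ns1:Operation></ns1:Step>";

    // Parses the XML text and returns its root element.
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parseable step", e);
        }
    }

    // Returns the text of the first descendant element with the given local name.
    static String text(Element step, String localName) {
        return step.getElementsByTagNameNS(NS, localName).item(0).getTextContent();
    }

    public static void main(String[] args) {
        Element step = parse(STEP_XML);
        System.out.println("step:       " + step.getAttribute("stepID"));
        System.out.println("executable: " + text(step, "exeURI"));
        System.out.println("stdout to:  " + text(step, "variable"));
    }
}
```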
Data Grid Language (DGL)
• XML based gridflow description
  • Describes execution flow logic
• ECA-based rule description for execution
  • ECA = Event, Condition, Action
• Querying of Status of Gridflow
  • XQuery / Simple query of a Gridflow Execution
• Scoped variables and gridflow patterns
  • For control of execution flow logic
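As an illustration of the Event, Condition, Action idea, the following Java sketch models a rule as those three parts and applies it to the checksum example from earlier in the talk. The `Rule` and `RuleEngine` classes, the event name `afterChecksums` and the variable names are assumptions made for this sketch; DGL expresses rules in XML, and its actual schema is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

class EcaDemo {
    // Hypothetical rule shape: when `event` occurs and `condition` holds, run `action`.
    static class Rule {
        final String event;
        final Predicate<Map<String, Object>> condition;
        final Consumer<Map<String, Object>> action;

        Rule(String event,
             Predicate<Map<String, Object>> condition,
             Consumer<Map<String, Object>> action) {
            this.event = event;
            this.condition = condition;
            this.action = action;
        }
    }

    static class RuleEngine {
        private final List<Rule> rules = new ArrayList<>();

        void add(Rule rule) {
            rules.add(rule);
        }

        // Fires every matching rule and returns how many actions ran.
        int fire(String event, Map<String, Object> vars) {
            int fired = 0;
            for (Rule rule : rules) {
                if (rule.event.equals(event) && rule.condition.test(vars)) {
                    rule.action.accept(vars);
                    fired++;
                }
            }
            return fired;
        }
    }

    // "If checksums mismatch, ingest a metadata warning flag" as an ECA rule.
    static Rule checksumRule() {
        return new Rule(
            "afterChecksums",
            v -> !String.valueOf(v.get("md5Disk")).equals(String.valueOf(v.get("md5Archive"))),
            v -> v.put("warning", "checksum mismatch"));
    }

    public static void main(String[] args) {
        RuleEngine engine = new RuleEngine();
        engine.add(checksumRule());

        Map<String, Object> vars = new HashMap<>();
        vars.put("md5Disk", "aaa");
        vars.put("md5Archive", "bbb");
        engine.fire("afterChecksums", vars);
        System.out.println(vars.get("warning"));
    }
}
```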
Gridflow Patterns
• These basic things can be combined together
  • E.g. Execute all 9 flowables in parallel
  • Switch based on color:
    • Red: App X
    • Green: App Y
• Gridflow Patterns
  • Sequential, Parallel, For-Each-Parallel, For-Each-Sequential, Switch, While / Milestone processing
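Two of these patterns, Switch and Parallel, can be sketched in plain Java as follows. This is an illustrative model under stated assumptions, not Matrix code: `Flowable` here is a hypothetical one-method interface, `switchOn` picks a member by the value of a variable such as the color, and `parallel` runs all members on a thread pool and returns their results in member order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PatternDemo {
    // Hypothetical flowable: runs and reports a result string.
    interface Flowable {
        String run();
    }

    // Switch pattern: choose one flowable by the value of a metadata variable.
    static Flowable switchOn(String value, Map<String, Flowable> cases) {
        Flowable chosen = cases.get(value);
        if (chosen == null) {
            throw new IllegalArgumentException("no case for " + value);
        }
        return chosen;
    }

    // Parallel pattern: run all members concurrently, results in member order.
    static List<String> parallel(List<Flowable> members) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, members.size()));
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Flowable member : members) {
                Callable<String> task = member::run;
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // The color switch from the slide: Red runs App X, Green runs App Y.
    static Map<String, Flowable> colorCases() {
        Map<String, Flowable> cases = new HashMap<>();
        cases.put("red", () -> "App X");
        cases.put("green", () -> "App Y");
        return cases;
    }

    public static void main(String[] args) {
        System.out.println(switchOn("red", colorCases()).run());

        List<Flowable> nine = new ArrayList<>();
        for (int i = 1; i <= 9; i++) {
            final int n = i;
            nine.add(() -> "flowable " + n);
        }
        System.out.println(parallel(nine));
    }
}
```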
Gridflow Pattern in Java
// forEach file in the collectionList, do some processing
ForEachFlow forEach = new ForEachFlow("forEachFlow", "file",
new CollectionList("$collectionList"));
// could also say how many files to be handled in parallel
// A DGL (XML) code would be generated
Flow
• Scoped Variables that can control the flow
• Logic used by the sub-members
• Sub-members that are the real execution statements
Gridflow Variable in Java
/* Create a variable called "collectionList" with an initial value of "empty".
   This variable is a string now, but will later be used to hold a CollectionList.
   This is OK because variables are dynamically typed in DGL. */
processFilesFlow.addVariable("collectionList", "empty");
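One way to picture scoped, dynamically typed variables is a chain of name-to-value tables, where a sub-flow can read and update a variable declared by an enclosing flow. The sketch below is a hypothetical model of that idea (the `Scope` class and its methods are invented for illustration and are not the Matrix implementation). It repeats the slide's scenario: `collectionList` starts as the string "empty" and later holds a list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ScopedVariablesDemo {
    // Hypothetical variable scope with an optional enclosing (parent) scope.
    static class Scope {
        private final Scope parent;
        private final Map<String, Object> values = new HashMap<>();

        Scope(Scope parent) {
            this.parent = parent;
        }

        // Declares a variable in this scope; values are untyped (Object).
        void declare(String name, Object initialValue) {
            values.put(name, initialValue);
        }

        // Finds the nearest scope, walking outward, that declares the name.
        private Scope owner(String name) {
            for (Scope s = this; s != null; s = s.parent) {
                if (s.values.containsKey(name)) {
                    return s;
                }
            }
            throw new IllegalArgumentException("undeclared variable: " + name);
        }

        Object get(String name) {
            return owner(name).values.get(name);
        }

        // Assigns in the declaring scope, so the enclosing flow sees the change.
        void set(String name, Object value) {
            owner(name).values.put(name, value);
        }
    }

    static Object demo() {
        Scope flow = new Scope(null);
        flow.declare("collectionList", "empty"); // a string for now

        Scope subFlow = new Scope(flow);
        // Later the same variable holds a list instead of a string.
        subFlow.set("collectionList", Arrays.asList("fileA", "fileB"));

        return flow.get("collectionList");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```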
Data Grid Request
• Annotations about the Data Grid Request
• Can be either a Flow or a Status Query
DGL Requests
• Data Grid Flow
  • An XML Structure that describes the execution logic, associated procedural rules and grid environment variables
• Status Query
  • An XML Structure used to query the execution status of any gridflow or a sub-flow at any granular level
• A DGL or Matrix client sends any of these to the Matrix Server
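The two request kinds can be pictured with a small Java model in which a server accepts either a flow or a status query. All names here (`FlowRequest`, `StatusQuery`, `Server`, and the status strings `STARTED` and `UNKNOWN`) are assumptions made for this sketch. The real requests are XML documents sent to the Matrix Server, and nothing below reflects its actual interface.

```java
import java.util.HashMap;
import java.util.Map;

class RequestDemo {
    // Hypothetical marker for the two kinds of data grid request.
    interface Request {
    }

    // Asks the server to execute the gridflow with the given ID.
    static class FlowRequest implements Request {
        final String flowId;

        FlowRequest(String flowId) {
            this.flowId = flowId;
        }
    }

    // Asks the server for the execution status of a gridflow.
    static class StatusQuery implements Request {
        final String flowId;

        StatusQuery(String flowId) {
            this.flowId = flowId;
        }
    }

    // In-memory stand-in for a server that dispatches on the request kind.
    static class Server {
        private final Map<String, String> status = new HashMap<>();

        String handle(Request request) {
            if (request instanceof FlowRequest) {
                String id = ((FlowRequest) request).flowId;
                status.put(id, "STARTED");
                return "accepted " + id;
            }
            if (request instanceof StatusQuery) {
                String id = ((StatusQuery) request).flowId;
                return status.getOrDefault(id, "UNKNOWN");
            }
            throw new IllegalArgumentException("unsupported request");
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        System.out.println(server.handle(new FlowRequest("processFilesFlow")));
        System.out.println(server.handle(new StatusQuery("processFilesFlow")));
    }
}
```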
Matrix Gridflow Server Architecture
[Architecture diagram; component labels:]
• Matrix Agent Abstraction
• In Memory Store
• JDBC
• Agents for Java, WSDL and other grid executables
• Persistence (Store) Abstraction
• ECA rules Handler
• Matrix Data Grid Request Processor
• Transaction Handler
• Status Query Handler
• Gridflow Meta data Manager
• JAXM Wrapper
• SOAP Service for Matrix Clients
• Flow Handler and Execution Manager
• Workflow Query Processor
• XQuery Processor
• JMS Messaging Interface
• Event Publish Subscribe, Notification
• SDSC SRB Agents
• Other SDSC Data Services
• WSDL Description
• Sangam P2P Gridflow Broker and Protocols
Matrix Folks (Emeritus)
• Jonathan Weinberg
• Daniel Moore
• Allen Ding
• Reena Mathew
• Erik Vandekieft
Don’t you guys have a group picture?
SRB Java Folks
Luke - Jargon Man
One man development team
Says he works on strategies for SRB Java software
Hey, The guy on right is all talk and no walk
Advantages from SRB Perspective
• Reduces the Client-Server Communication
  • The whole execution logic is sent to the server
  • Fewer WAN messages
  • Our experiments show a significant increase in performance
• Datagrid Information Lifecycle Management
  • Autonomic: “Move data at 9:00 PM on weekdays and on weekends”
• Data Grid Administration
• Power-users and Sophisticated Users
  • Data Grid Administrator (Rules to manage data grid)
  • Scientist or Librarian (Visualized data flow programming)
Using DG-Modeler
• GUI for dataflow programming
Gridflow Process I (Vision)
End User using DGBuilder
Gridflow Description Data Grid Language
Gridflow Process II (Vision)
Abstract Gridflow using Data Grid Language
Planner
Concrete Gridflow Using Data Grid Language
Gridflow Process III (Vision)
Gridflow P2P Network
Gridflow Processor
Concrete Gridflow Using Data Grid Language
got ideas/suggestions?
Contact: SDSC Matrix project
Google key word: SDSC Gridflow