The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration
Summer Grid 2004
UT Brownsville South Padre Island Center
24 June 2004
Mike Wilde
Argonne National Laboratory
Mathematics and Computer Science Division
Summer Grid 2004 www.griphyn.org/chimera 24 June, UTB/SPI
GriPhyN: Grid Physics Network Mission
Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together
Acknowledgements: Virtual Data is a Large Team Effort
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams
Virtual Data Scenario
[Figure: workflow graph linking simulate –t 10 …, reformat –f fz …, conv –I esd –o aod, summarize –t 10 …, and psearch –t 10 … through file1 to file8]
On-demand data generation
Update workflow following changes
Manage workflow
Explain provenance, e.g. for file8:
psearch –t 10 –i file3 file4 file5 –o file8
summarize –t 10 –i file6 –o file7
reformat –f fz –i file2 –o file3 file4 file5
conv –l esd –o aod –i file2 –o file6
simulate –t 10 –o file1 file2
Virtual Data Describes Analysis Workflow
The recorded virtual data “recipe” here is:
– Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
– Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate
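The recipe can be held as two small lookup tables. The following Python sketch is an illustration only: the names `PRODUCED_BY`, `INPUTS`, and `explain` are hypothetical and are not part of Chimera. It records which program produced each file and which files each file was derived from, then walks those records backwards to explain the provenance of file8.

```python
# Hypothetical encoding of the slide's recipe; not Chimera's actual data model.
# Program that produced each file (the "Programs" line of the recipe).
PRODUCED_BY = {
    "file1": "simulate", "file2": "simulate",
    "file3": "reformat", "file4": "reformat", "file5": "reformat",
    "file6": "conv", "file7": "summarize", "file8": "psearch",
}
# Files that each file was derived from (the "Files" line of the recipe).
INPUTS = {
    "file8": ["file1", "file3", "file4", "file5", "file7"],
    "file7": ["file6"],
    "file3": ["file2"], "file4": ["file2"],
    "file5": ["file2"], "file6": ["file2"],
}


def explain(target):
    """Return every file that `target` depends on, directly or indirectly."""
    seen = set()
    stack = [target]
    while stack:
        for parent in INPUTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


ancestors = explain("file8")
# Programs that must have run to produce file8 and everything it depends on.
programs = {PRODUCED_BY[f] for f in ancestors | {"file8"}}
```

Under this encoding, file8 depends on all seven other files, so all five programs appear in its provenance.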
[Figure: the workflow graph, with file8 marked as the requested dataset]
Virtual Data Describes Analysis Workflow
To re-create file8: step 1
– simulate > file1, file2
Virtual Data Describes Analysis Workflow
To re-create file8: step 2
– files 3, 4, 5, 6 derived from file2
– reformat > file3, file4, file5
– conv > file6
Virtual Data Describes Analysis Workflow
To re-create file8: step 3
– file7 depends on file6
– summarize > file7
Virtual Data Describes Analysis Workflow
To re-create file8: final step
– file8 depends on files 1, 3, 4, 5, 7
– psearch < file1, file3, file4, file5, file7 > file8
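The four re-creation steps correspond to stages of the dependency graph: a program can run as soon as all of its input files exist. The sketch below is a hand-rolled illustration in Python, not the Chimera or Pegasus planner; the table `JOBS` and the function `stages` are hypothetical names.

```python
# Hypothetical job table for the slide's example: program -> (inputs, outputs).
JOBS = {
    "simulate":  ([], ["file1", "file2"]),
    "reformat":  (["file2"], ["file3", "file4", "file5"]),
    "conv":      (["file2"], ["file6"]),
    "summarize": (["file6"], ["file7"]),
    "psearch":   (["file1", "file3", "file4", "file5", "file7"], ["file8"]),
}


def stages(jobs):
    """Group jobs into stages; a job runs once all of its inputs exist."""
    available, pending, plan = set(), dict(jobs), []
    while pending:
        ready = sorted(name for name, (ins, _) in pending.items()
                       if set(ins) <= available)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for name in ready:
            # Outputs of this stage become available to later stages.
            available.update(pending.pop(name)[1])
        plan.append(ready)
    return plan


plan = stages(JOBS)
```

The resulting plan has four stages, matching steps 1, 2, 3, and the final step on the slides: simulate; then conv and reformat; then summarize; then psearch.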
Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.
VDL: Virtual Data Language Describes Data Transformations
Transformation
– Abstract template of program invocation
– Similar to "function definition"
Derivation
– “Function call” to a Transformation
– Store past and future:
> A record of how data products were generated
> A recipe of how data products can be generated
Invocation
– Record of a Derivation execution
These XML documents reside in a “virtual data catalog” (VDC), a relational database
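One way to picture the three record types is as plain data classes. This is a minimal Python sketch under stated assumptions: the field names (`formal_args`, `actual_args`, `exit_code`, `wall_seconds`) are invented for illustration and do not reflect the actual VDC schema.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    formal_args: tuple  # (direction, name) pairs, e.g. (("in", "a1"), ("out", "a2"))


@dataclass
class Derivation:
    """A 'function call' to a transformation, with actual arguments bound."""
    name: str
    transformation: Transformation
    actual_args: dict  # formal name -> logical file name or value


@dataclass
class Invocation:
    """Record of one execution of a derivation."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


def outputs(dv):
    """Logical files a derivation produces, read off the 'out' formals."""
    return [dv.actual_args[n]
            for direction, n in dv.transformation.formal_args
            if direction == "out"]


tr1 = Transformation("tr1", (("in", "a1"), ("out", "a2")))
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
# The same derivation is a recipe before it runs and a record afterwards.
run = Invocation(x1, exit_code=0, wall_seconds=1.5)
```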
VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
argument stdin = ${a1};
argument stdout = ${a2}; }
TR tr2(in a1, out a2) {
argument stdin = ${a1};
argument stdout = ${a2}; }
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Figure: file1 → x1 → file2 → x2 → file3]
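No ordering between x1 and x2 is declared; it follows from the shared logical file name file2. A small Python sketch of that inference (the tuple layout and the helper `infer_edges` are assumptions made for illustration):

```python
# Each derivation as (name, input files, output files), transcribed from the DV lines.
DERIVATIONS = [
    ("x1", ["file1"], ["file2"]),
    ("x2", ["file2"], ["file3"]),
]


def infer_edges(derivations):
    """Return (producer, consumer) pairs implied by shared logical file names."""
    producer = {f: name for name, _, outs in derivations for f in outs}
    return sorted((producer[f], name)
                  for name, ins, _ in derivations
                  for f in ins if f in producer)


edges = infer_edges(DERIVATIONS)
# Files that no derivation produces must already exist somewhere.
external_inputs = sorted(
    {f for _, ins, _ in DERIVATIONS for f in ins}
    - {f for _, _, outs in DERIVATIONS for f in outs})
```

Here the only edge is x1 before x2, and file1 is the only external input.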
Workflow example
Graph structure
– Fan-in
– Fan-out
– "left" and "right" can run in parallel
Needs external input file
– Located via replica catalog
Data file dependencies
– Form graph structure
[Figure: diamond-shaped graph with preprocess, two findrange nodes, and analyze]
Complete VDL workflow
Generate appropriate derivations
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
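The four derivations are enough to recover the graph and to see that "left" and "right" can run in parallel. The sketch below uses Python's standard `graphlib` module (Python 3.9+) as a stand-in for a workflow planner; the dictionary `DVS` is a hand transcription of the DV statements, not output from the VDL tools.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Inputs and outputs transcribed from the four DV statements.
DVS = {
    "top":    (["f.a"], ["f.b1", "f.b2"]),
    "left":   (["f.b1", "f.b2"], ["f.c1"]),
    "right":  (["f.b1", "f.b2"], ["f.c2"]),
    "bottom": (["f.c1", "f.c2"], ["f.d"]),
}
producer = {f: dv for dv, (_, outs) in DVS.items() for f in outs}
# Map each derivation to the set of derivations it must wait for.
graph = {dv: {producer[f] for f in ins if f in producer}
         for dv, (ins, _) in DVS.items()}

sorter = TopologicalSorter(graph)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)   # everything in one wave may run in parallel
    sorter.done(*ready)
```

The file f.a has no producer, so it is the external input that would be located via the replica catalog.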
Compound Transformations Enable Functional Abstractions
Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
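The analogy to a function definition can be made concrete: calling the compound transformation expands into its four primitive calls with the actual arguments substituted. This Python sketch mirrors the structure of `rangeAnalysis`; the function name `range_analysis`, the tuple representation, and the sample argument values are assumptions for illustration.

```python
def range_analysis(fa, p1, p2, fd, fc1, fc2, fb1, fb2):
    """Expand the compound call into its four primitive calls, in order."""
    return [
        ("preprocess", {"a": fa, "b": [fb1, fb2]}),
        ("findrange", {"a1": fb1, "a2": fb2, "name": "LEFT", "p": p1, "b": fc1}),
        ("findrange", {"a1": fb1, "a2": fb2, "name": "RIGHT", "p": p2, "b": fc2}),
        ("analyze", {"a": [fc1, fc2], "b": fd}),
    ]


# Sample values only; the file names and parameters are hypothetical.
calls = range_analysis(fa="f.a", p1="0.5", p2="1.0", fd="f.d",
                       fc1="f.c1", fc2="f.c2", fb1="f.b1", fb2="f.b2")
```

A caller sees one transformation with eight arguments, while the planner still sees the four underlying jobs and their file dependencies.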
Derivation scripts
Representation of virtual data provenance:
DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
...
DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
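Scripts like this are typically generated rather than typed. In the three derivations shown, each one uses six consecutive hexadecimal file numbers (d1 uses 00000 to 00005, d70 uses 0019E to 001A3), which the Python sketch below reproduces. The helper names are hypothetical, and the rule that produced the p1 and p2 values is not given on the slide, so they are passed in by the caller.

```python
# Formal arguments in order of ascending file number, as seen in d1, d2, and d70.
FORMALS = ["fa", "fb2", "fb1", "fc2", "fc1", "fd"]


def file_names(k):
    """Six logical file names used by derivation d<k> (k starts at 1)."""
    base = 6 * (k - 1)
    return {formal: f"f.{base + i:05X}" for i, formal in enumerate(FORMALS)}


def derivation(k, p1, p2):
    """Format one DV statement; p1 and p2 are supplied by the caller."""
    n = file_names(k)
    io = ", ".join(f'{f}=@{{io:"{n[f]}"}}'
                   for f in ["fc1", "fc2", "fb1", "fb2", "fa"])
    return (f'DV d{k}->diamond( fd=@{{out:"{n["fd"]}"}}, {io}, '
            f'p2="{p2}", p1="{p1}" );')


first = derivation(1, "0", "100")
last = derivation(70, "18", "800")
```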
Invocation Provenance
Completion status and resource usage
Attributes of executable transformation
Attributes of input and output files
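As a rough illustration of what an invocation record might hold, the Python sketch below runs a command and captures its completion status, elapsed time, and file lists. The record layout and the function `run_and_record` are assumptions; the real system records more (for example, attributes of the executable and of each file).

```python
import subprocess
import sys
import time


def run_and_record(argv, inputs, outputs):
    """Run a command and return a dict describing the invocation."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True)
    return {
        "argv": list(argv),
        "exit_code": proc.returncode,                  # completion status
        "wall_seconds": time.perf_counter() - start,   # resource usage
        "executable": argv[0],                         # attribute of the transformation
        "inputs": list(inputs),
        "outputs": list(outputs),
    }


# Use the running Python interpreter as a stand-in for a real transformation.
record = run_and_record([sys.executable, "-c", "print('ok')"], [], [])
```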
Executing VDL Workflows
[Figure: execution pipeline. Labels: abstract workflow; local planner; global planner (“Pegasus”); “jit” planner (research); concrete DAG; DAGman / Condor-G; Grid info]
GriPhyN-iVDGL Applications to Date
ATLAS, BTeV, CMS – HEP event simulation
Argonne Computational Biology – sequence comparison and result capture
LIGO – pulsar search
Sloan Digital Sky Survey – cluster finding; near-earth object search planned
Quarknet – science education – cosmic rays, HEP analysis
Genome Analysis Database Update
[Figure: GADU - GServer architecture. Labels: Jazz/ANL, Grid3, UofWisc; “Automatic workflows created as per user request or project”; “Interface to the server” (Jetspeed); end users: hit and run (public), registered groups, collaborators; “Data flow and storage at various levels”; “Chimera, Condor, Globus”]
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS
Described in GGF10 workshop paper.
Virtual Data Example: Galaxy Cluster Search
[Figure: workflow DAG over Sloan data, and a log-log plot of the galaxy cluster size distribution (number of clusters vs. number of galaxies)]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper
Cluster Search Workflow Graph and Execution Trace
[Figure: workflow jobs vs time]
Virtual Data Application: High Energy Physics Data Analysis
[Figure: graph of derived datasets, each labeled by its parameters:]
mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
mass = 200, decay = WW, stability = 1, event = 8
mass = 200, decay = WW, stability = 1, plot = 1
mass = 200, decay = WW, plot = 1
mass = 200, decay = WW, event = 8
mass = 200, decay = WW, stability = 1
mass = 200, decay = WW, stability = 3
mass = 200
mass = 200, decay = WW
mass = 200, decay = ZZ
mass = 200, decay = bb
mass = 200, plot = 1
mass = 200, event = 8
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper
Using Virtual Data for Science Education
The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
It's an experiment to give students the means to:
– discover and apply datasets, algorithms, and data analysis methods
– collaborate by developing new ones and sharing results and observations
– learn data analysis methods that will ready and excite them for a scientific career
And in later steps, we may actually use the Grid!
Quarknet Virtual Data Project
[Figure: three sites, each with a cosmic ray detector, locally collected data, and student/teacher teams: Central High School, Reston, Virginia; Yale / Middletown High Collaboration, Hartford, Connecticut; Foothills High School, Great Falls, Montana. Through standard Web access they reach the Quarknet Virtual Data Portal (student data, algorithms, results, notes, and communications), with the Virtual Data Toolkit and the Virtual Data Catalog]
Student teacher teams sharing data, methods, programs, and knowledge
Enabling collaboration-intensive science discovery with virtual data tools and methods
Detector Performance Study
Example: BTeV Event Simulation
Support for Search and Discovery
Goal: make it as easy to use as Google
More advanced capabilities lie below the surface (as with Google)
Understand the structure and meaning of the datasets and their fields.
Advanced search, using SQL-like queries
Find both DATA and TRANSFORMATIONS
Create datasets from queries
Perform calculations on datasets, filtering results to look for patterns
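Since the catalog is relational, a metadata search that returns both data and transformations can be pictured as a single join. The Python sketch below uses an in-memory SQLite database; the table layout, the `topic` attribute, and its values are invented for illustration and are not the actual VDC schema. The object names are borrowed from the provenance excerpt later in the talk.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE object (name TEXT PRIMARY KEY, kind TEXT);   -- kind: 'TR' or 'DS'
    CREATE TABLE metadata (name TEXT, key TEXT, value TEXT);
    INSERT INTO object VALUES ('ECalEnergySum', 'TR'),
                              ('run1a.esm', 'DS'),
                              ('run1a.event', 'DS');
    INSERT INTO metadata VALUES ('ECalEnergySum', 'topic', 'energy'),
                                ('run1a.esm', 'topic', 'energy'),
                                ('run1a.event', 'topic', 'raw');
""")


def search(key, value):
    """Find both datasets and transformations carrying a metadata attribute."""
    rows = db.execute(
        "SELECT o.kind, o.name FROM object o "
        "JOIN metadata m ON m.name = o.name "
        "WHERE m.key = ? AND m.value = ? ORDER BY o.kind, o.name",
        (key, value))
    return rows.fetchall()


hits = search("topic", "energy")
```

With this sample data the query returns one dataset and one transformation.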
Search by Metadata
Deriving a new dataset
…to find mass of “z” particle:
Workflow for missing energy calculations
Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
  <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
  <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" …
  <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument>…
</job>
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
  <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …
  <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<!--list of all files used -->
<filename file="ecal.pct" link="inout"/>
<filename file="electron10GeV.avg" link="inout"/>
<filename file="electron10GeV.sum" link="inout"/>
<filename file="hcal.pct" link="inout"/>
….(excerpted for display)
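A job list in this form is easy to consume with a standard XML parser. The Python sketch below reads the one job that appears in full in the excerpt and separates its inputs from its outputs using the `link` attribute. The `<jobs>` root element is a hypothetical wrapper added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# First job from the excerpt, wrapped in a hypothetical <jobs> root element.
XML = """
<jobs>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
       dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
    <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
    <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
  </job>
</jobs>
"""


def files_by_job(xml_text):
    """Map each job id to its input and output file lists."""
    result = {}
    for job in ET.fromstring(xml_text).iter("job"):
        uses = job.findall("uses")
        result[job.get("id")] = {
            "inputs": [u.get("file") for u in uses if u.get("link") == "input"],
            "outputs": [u.get("file") for u in uses if u.get("link") == "output"],
        }
    return result


jobs = files_by_job(XML)
```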
Virtual Provenance in XML: control flow graph
<child ref="ID000003"> <parent ref="ID000002"/> </child>
<child ref="ID000004"> <parent ref="ID000003"/> </child>
<child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/>…
<child ref="ID000009"> <parent ref="ID000008"/> </child>
<child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/>…
<child ref="ID000012"> <parent ref="ID000011"/> </child>
<child ref="ID000013"> <parent ref="ID000011"/> </child>
<child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>… <parent ref="ID000013"/>… </child>…
(excerpted for display…)
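The child and parent elements encode the edges of the job graph, so the chain of jobs behind any one job can be recovered by following parent references. The Python sketch below uses only the `<child>` elements that appear complete in the excerpt; the `<graph>` root element and the helper names are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Complete <child> elements from the excerpt, under a hypothetical <graph> root.
XML = """
<graph>
  <child ref="ID000003"> <parent ref="ID000002"/> </child>
  <child ref="ID000004"> <parent ref="ID000003"/> </child>
  <child ref="ID000012"> <parent ref="ID000011"/> </child>
  <child ref="ID000013"> <parent ref="ID000011"/> </child>
</graph>
"""


def parents_of(xml_text):
    """Map each child job id to the list of its parent job ids."""
    return {c.get("ref"): [p.get("ref") for p in c.findall("parent")]
            for c in ET.fromstring(xml_text).iter("child")}


def ancestors(job, parents):
    """List the jobs that must finish before `job`, nearest first."""
    found = []
    for p in parents.get(job, []):
        found.append(p)
        found.extend(ancestors(p, parents))
    return found


parents = parents_of(XML)
chain = ancestors("ID000004", parents)
```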
And writing the results up in a “poster”
Poster describing analysis
Using active data from Web Services
Levels of Interaction
“Skins” – use it like a calculator, experiment with scenarios and settings, use virtual data like a log book to document, assess, and share parameter values.
“Blocks” – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks
“Code” – write new transforms in a variety of languages and data models
Observations
A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder
Vision for Provenance in the Large
Universal knowledge management and production systems
Vendors integrate the provenance tracking protocol into data processing products
Ability to run anywhere “in the Grid”
Virtual Data Grid Vision
[Figure: Virtual Data Grid architecture. Users: production manager, researcher, science review, Grid operations. Activities: planning, discovery, composition, simulation, analysis, sharing, derivation. Data Grid: storage elements, replica location service, data transport, storage resource management, virtual data catalogs, virtual data index. Computing Grid: workflow planner, request planner, workflow executor (DAGman), request executor (Condor-G, GRAM), request predictor (Prophesy), Grid monitor. Data sources: raw data from the detector, simulation data]
Planned Dataset Model
File
Set of files
Relational query or spreadsheet range
XML Element: <FORM <Title…>/FORM>
Set of files with relational index
Object closure
New user-defined dataset type:
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao
Planned Dataset Type Model
[Figure: dataset type hierarchy with “Representational” and “Logical” groupings; nonleaf types are superclasses. Types shown: FileDataset, File, FileSet, MultiFileSet, TarFileSet, EventCollection, RawEventSet, SimulatedEventSet, MonteCarloSimulation, DiscreteEventSimulation]
Provenance Server Plans
OGSA-based Grid services
– Discovery, security, resource management
Supports code and data discovery and workflow management
Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
Derivations can reference remote transformations and datasets
Structured object namespaces & object-level access control enable large VO collaboration
Generalize transforms to describe service calls, database queries and language interpreters
Provenance Hyperlinks
[Figure: a collaboration VDS, a group VDS, and two personal VDSs, holding transformations (TR), derivations (DV), and datasets (DS) that link to one another across servers]
Indexing Servers to Support Discovery
[Figure: collaboration, group, and personal VDSs (holding TR, DV, and DS objects), served by a collaboration-wide index, a collaboration-level index, a group index, and personal indexes]
For Information and Software
Virtual Data System
– www.griphyn.org/chimera - Chimera Virtual Data System: Overview, papers, software
Grids and Grid Software
– www.ivdgl.org/grid2003 - Using Grid3
– www.griphyn.org/vdt - Virtual Data Toolkit
– www.globus.org – The Globus Toolkit
– www.cs.wisc.edu/condor - The Condor Project
– www.ppdg.net – Particle Physics Data Grid
Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation
The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM