Virtual Data and the Chimera System*
Ian Foster
Mathematics and Computer Science Division, Argonne National Laboratory
and Department of Computer Science, The University of Chicago
http://www.mcs.anl.gov/~foster
*Joint work with Jens Vöckler, Mike Wilde, Yong Zhao
HPC 2002 Conference, Cetraro, June 26, 2002
[email protected] ARGONNE CHICAGO

Overview
Problem
– Managing programs and computations as community resources
Technology
– Chimera virtual data system
Applications
– Virtual Data ≠ Virtual Concept!
Futures
– Research challenges & plans

Programs as Community Resources: Data Derivation and Provenance
Most [scientific] data are not simple “measurements”; essentially all are:
– Computationally corrected/reconstructed
– And/or produced by numerical simulation
And thus, as data and computers become ever larger and more expensive:
– Programs are significant community resources
– So are the executions of those programs
Management of the transformations that map between datasets is an important problem

Motivations (1)
[Figure: relationships among Transformation, Derivation, and Data, with edges labeled created-by, execution-of, and consumed-by/generated-by]
“I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
“I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
“I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
“I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”

Motivations (2)
Data track-ability and result audit-ability
– Universally sought by GriPhyN applications
Repair and correction of data
– Rebuild data products (cf. “make”)
Workflow management
– A new, structured paradigm for organizing, locating, specifying, and requesting data products
Performance optimizations
– Ability to re-create data rather than move it
And others, some we haven’t thought of

Chimera Virtual Data System
Virtual data catalog
– Transformations, derivations, data
Virtual data language
– VDC definition and query
Applications include browsers and data analysis applications
[Figure: Chimera architecture. Boxes: Virtual Data Applications; Virtual Data Language (definition and query); VDL Interpreter (manipulate derivations and transformations); Virtual Data Catalog (implements Chimera Virtual Data Schema); Task Graphs (compute and data movement tasks, with dependencies); Data Grid Resources (distributed execution and data management); GriPhyN VDT: Replica catalog, DAGMan, Globus Toolkit, etc. Labels: SQL, Chimera]

Transformations and Derivations
Transformation
– Abstract template of program invocation
– Similar to "function definition" in C
Derivation
– Formal invocation of a Transformation
– Similar to "function call" in C
– Store past and future:
> A record of how data products were generated
> A recipe of how data products can be generated
Invocation (future)
– Record of each Derivation (re)execution
– Similar to strace (BSD) or truss (SysV)
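
The function-definition / function-call analogy can be sketched as three small record types. This is only an illustrative Python model, not the Chimera Virtual Data API (which the talk describes as a Java class hierarchy); all class, field, and method names here are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (the 'function definition')."""
    name: str
    executable: str
    formal_args: tuple  # formal argument names, e.g. ("a1", "a2")


@dataclass
class Derivation:
    """Binding of actual values to a Transformation (the 'function call')."""
    name: str
    transformation: Transformation
    actual_args: dict  # formal name -> actual value

    def unbound(self):
        """Formal arguments that this derivation leaves without a value."""
        return [a for a in self.transformation.formal_args
                if a not in self.actual_args]


@dataclass
class Invocation:
    """Record of one (re)execution of a Derivation."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


# Hypothetical values, for illustration only.
tr = Transformation("t1", "/usr/bin/app3", ("a1", "a2"))
dv = Derivation("d1", tr, {"a1": "run1.raw", "a2": "run1.summary"})
inv = Invocation(dv, exit_code=0, wall_seconds=12.5)
```

A Derivation can exist before it has ever been executed (a recipe) or after (a record); in this sketch the difference is simply whether any Invocation refers to it.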

Virtual Data Tools
Virtual Data API
– A Java class hierarchy to represent transformations and derivations
Virtual Data Language
– Textual for people & illustrative examples
– XML for machine-to-machine interfaces
Virtual Data Database
– Makes the objects of a virtual data definition persistent
Virtual Data Service
– Provides a service interface (e.g., OGSA) to persistent objects

Example Transformation
TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
  profile hints.exec-pfn = "/usr/bin/app3";
  argument = "-p "${pa};
  argument = "-f "${a1};
  argument = "-x -y";
  argument stdout = ${a2};
  profile env.MAXMEM = ${env};
}
[Diagram: $a1 → t1 → $a2]

Example Derivations
DV d1->t1 (
  env="20000", pa="600",
  a2=@{out:run1.exp15.T1932.summary},
  a1=@{in:run1.exp15.T1932.raw}
);
DV d2->t1 (
  a1=@{in:run1.exp16.T1918.raw},
  a2=@{out:run1.exp16.T1918.summary}
);
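
To make the relationship between TR t1 and its derivations concrete, the following Python sketch expands the template with each derivation's actual arguments, falling back to the declared defaults for pa and env. The dictionary layout and the expand function are hypothetical stand-ins, not Chimera's data model or the VDL interpreter, and the sketch assumes each "flag value" pair becomes two separate command-line tokens.

```python
# Hypothetical in-memory form of TR t1 from the slide.
T1 = {
    "exec": "/usr/bin/app3",
    "defaults": {"pa": "500", "env": "100000"},
    "argv": ["-p", "{pa}", "-f", "{a1}", "-x", "-y"],
    "stdout": "{a2}",
    "environ": {"MAXMEM": "{env}"},
}


def expand(tr, actuals):
    """Bind actual arguments over the defaults and fill in the template."""
    bound = {**tr["defaults"], **actuals}
    return {
        "argv": [tr["exec"]] + [part.format(**bound) for part in tr["argv"]],
        "stdout": tr["stdout"].format(**bound),
        "environ": {k: v.format(**bound) for k, v in tr["environ"].items()},
    }


# DV d1 overrides both defaults; DV d2 relies on them.
d1 = expand(T1, {"env": "20000", "pa": "600",
                 "a1": "run1.exp15.T1932.raw",
                 "a2": "run1.exp15.T1932.summary"})
d2 = expand(T1, {"a1": "run1.exp16.T1918.raw",
                 "a2": "run1.exp16.T1918.summary"})
```

Under these assumptions d1 runs with -p 600 and MAXMEM=20000, while d2 picks up -p 500 and MAXMEM=100000 from the transformation's defaults.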

Managing Dependencies
TR tr1( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app1";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app2";
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1( a2=@{out:file2}, a1=@{in:file1});
DV x2->tr2( a2=@{out:file3}, a1=@{in:file2});
[Diagram: file1 → x1 → file2 → x2 → file3]
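
The dependency between x1 and x2 is implicit: it follows from file2 being an output of one derivation and an input of the other. A minimal Python sketch of that inference is shown here, using the standard-library graphlib module (Python 3.9+). The dictionary representation and function names are assumptions for illustration; the slides do not show how Chimera itself computes this.

```python
from graphlib import TopologicalSorter

# The two derivations from the slide, reduced to their input and output files.
DERIVATIONS = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}


def dependency_graph(derivations):
    """Map each derivation to the derivations that produce its inputs."""
    producer = {f: name for name, dv in derivations.items() for f in dv["out"]}
    return {
        name: {producer[f] for f in dv["in"] if f in producer}
        for name, dv in derivations.items()
    }


def execution_order(derivations):
    """One valid order in which producers run before their consumers."""
    return list(TopologicalSorter(dependency_graph(derivations)).static_order())


def stale_after(derivations, changed_file):
    """Derivations to re-run, transitively, if changed_file changes."""
    stale, frontier = [], [changed_file]
    while frontier:
        f = frontier.pop()
        for name, dv in derivations.items():
            if f in dv["in"] and name not in stale:
                stale.append(name)
                frontier.extend(dv["out"])
    return stale
```

The stale_after helper corresponds to the earlier "which derived data to recompute" question: a change to file1 invalidates both x1 and x2, while a change to file2 invalidates only x2.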

Initial “Strawman” Architecture (Use of GriPhyN Virtual Data Toolkit)
[Diagram: VDLx → abstract planner → DAX → concrete planner → DAGMan]
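
As a rough illustration of the final hand-off in this pipeline, the sketch below writes a DAGMan input file from a list of jobs and dependency edges. The JOB and PARENT ... CHILD lines are standard DAGMan syntax; everything else is an assumption, including the to_dagman function and the convention that each job has a submit file named after it. This is not the GriPhyN concrete planner.

```python
def to_dagman(jobs, edges):
    """Render jobs and (parent, child) edges as DAGMan input text.

    Assumes a submit description file named <job>.sub exists for each job.
    """
    lines = [f"JOB {job} {job}.sub" for job in jobs]
    lines += [f"PARENT {parent} CHILD {child}" for parent, child in edges]
    return "\n".join(lines) + "\n"


# The x1 -> x2 chain from the "Managing Dependencies" example.
dag_text = to_dagman(["x1", "x2"], [("x1", "x2")])
```

In the architecture on the slide, producing the per-job submit files (choosing sites and resolving logical file names to physical replicas) would be the concrete planner's job; the sketch covers only the ordering information passed to DAGMan.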

Chimera Application: Sloan Digital Sky Survey Analysis
Joint work with Jim Annis, Steve Kent, FNAL
Size distribution of galaxy clusters?
Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)
[Plot: Galaxy cluster size distribution. x-axis: Number of Galaxies (1 to 100, log scale); y-axis: Number of Clusters (1 to 100000, log scale)]

Cluster-finding Data Pipeline
[Diagram: cluster-finding pipeline with steps numbered 1 to 5; node labels: tsObj, field, brg, core, cluster, catalog]

Virtual Data Usage Model
Transformation designers create programmatic abstractions
– Simple or compound; augment with metadata
Production managers create bulk derivations
– Can materialize data products or leave virtual
Users track their work through derivations
– Augment (replace?) the scientist’s log book
Definitions can be augmented with metadata
– The key to intelligent data retrieval
– Issues relating to metadata propagation

Virtual Data Research Issues
Representation
– Metadata: how is it created, stored, propagated?
– What knowledge must be represented? How?
– Capturing notions of data approximation
– Higher-order knowledge: virtual transformations
VDC as a community resource
– Automating data capture
– Access control and privacy issues
– Quality control
Data derivation
– Query estimation and request planning

Virtual Data Research Issues
“Engineering” issues
– Dynamic (runtime-computed) dependencies
– Large dependent sets
– Extensions to other data models: relational, OO
– Virtual data browsers
– XML vs. relational databases & query languages
Additional usage modalities
– E.g., meta-analyses, automated experiment generation, “active notebooks”
Virtual data browsers, editors

Status of Chimera R&D
Early virtual data system demonstrated Nov ’01: HEP collision simulations
Larger scale problems addressed recently: “cluster finding” in SDSS
First public release in June: Chimera v1.0
Enhancements planned throughout the summer
Physics & astronomy applications by SC’02
Future R&D focus #1: request planning
Future R&D focus #2: knowledge representation
Future apps: bioinformatics, earth sciences

Related Work
Data provenance
– Materialized views, lineage: Cui, Widom
– Data provenance tracking: Buneman et al.
Capturing transformations
– ZOO system and conceptual schema
Data Grid technologies
– GriPhyN, Globus Project, EU DataGrid

Summary
Concept: Tools to support management of transformations and derivations as community resources
Technology: Chimera virtual data system including virtual data catalog and virtual data language; use of GriPhyN virtual data toolkit for automated data derivation
Results: Successful early applications to CMS and SDSS data generation/analysis
Future: Public release of prototype, new apps, knowledge representation, planning

For More Information
GriPhyN project (NSF ITR funded)
– www.griphyn.org
Chimera virtual data system
– www.griphyn.org/chimera
– “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation,” SSDBM, July 2002
– “Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey,” SC’02, November 2002