Virtual Data and the Chimera System*

Virtual Data and the Chimera System* Ian Foster Mathematics and Computer Science Division Argonne National Laboratory and Department of Computer Science The University of Chicago http://www.mcs.anl.gov/~foster *Joint work with Jens Vöckler, Mike Wilde, Yong Zhao HPC 2002 Conference, Cetraro, June 26, 2002


Transcript of Virtual Data and the Chimera System*

Page 1: Virtual Data and the Chimera System*

Virtual Data and the Chimera System*

Ian Foster
Mathematics and Computer Science Division
Argonne National Laboratory
and
Department of Computer Science
The University of Chicago

http://www.mcs.anl.gov/~foster

*Joint work with Jens Vöckler, Mike Wilde, Yong Zhao
HPC 2002 Conference, Cetraro, June 26, 2002

Page 2: Virtual Data and the Chimera System*


[email protected] ARGONNE CHICAGO

Overview

Problem
– Managing programs and computations as community resources

Technology
– Chimera virtual data system

Applications
– Virtual Data ≠ Virtual Concept!

Futures
– Research challenges & plans


Page 4: Virtual Data and the Chimera System*

Programs as Community Resources: Data Derivation and Provenance

Most [scientific] data are not simple “measurements”; essentially all are:
– Computationally corrected/reconstructed
– And/or produced by numerical simulation

And thus, as data and computers become ever larger and more expensive:
– Programs are significant community resources
– So are the executions of those programs

Management of the transformations that map between datasets is an important problem

Page 5: Virtual Data and the Chimera System*

Motivations (1)

[Figure: Data, Transformation, and Derivation, linked by the relationships created-by, execution-of, and consumed-by/generated-by]

“I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”

“I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”

“I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”

“I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”
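The first scenario on this slide (a calibration error that invalidates derived data) amounts to a reachability query over recorded derivations. The following is a minimal Python sketch of that idea, not Chimera's implementation; the record layout, the dataset names, and the `downstream` helper are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical provenance records: (derivation name, inputs, outputs).
DERIVATIONS = [
    ("calibrate", ["raw.dat", "instrument.cal"], ["calibrated.dat"]),
    ("reconstruct", ["calibrated.dat"], ["events.dat"]),
    ("summarize", ["events.dat"], ["summary.dat"]),
    ("simulate", ["model.cfg"], ["simulated.dat"]),
]


def downstream(changed, derivations):
    """Return every dataset derived, directly or indirectly, from `changed`."""
    # Map each dataset to the outputs of the derivations that consume it.
    consumers = defaultdict(list)
    for _name, inputs, outputs in derivations:
        for item in inputs:
            consumers[item].extend(outputs)

    # Breadth-first walk along the "consumed-by / generated-by" links.
    stale, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for product in consumers[current]:
            if product not in stale:
                stale.add(product)
                queue.append(product)
    return stale


if __name__ == "__main__":
    # Datasets that need recomputing if the calibration file was wrong.
    print(sorted(downstream("instrument.cal", DERIVATIONS)))
```

The simulated data are untouched by the calibration change, so they never enter the result; only products reachable from the changed input are flagged.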

Page 6: Virtual Data and the Chimera System*

Motivations (2)

Data track-ability and result audit-ability
– Universally sought by GriPhyN applications

Repair and correction of data
– Rebuild data products (cf. “make”)

Workflow management
– A new, structured paradigm for organizing, locating, specifying, and requesting data products

Performance optimizations
– Ability to re-create data rather than move it

And others, some we haven’t thought of
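The performance point above (re-create data rather than move it) can be read as a cost comparison. The sketch below is one simple way to frame it in Python; the `plan` function and the idea of comparing a single transfer-time estimate against a single recompute-time estimate are illustrative assumptions, not a description of how Chimera plans requests.

```python
def plan(size_bytes, bandwidth_bytes_per_s, recompute_seconds):
    """Choose between moving an existing data product and re-deriving it locally.

    All three inputs are estimates supplied by the caller (hypothetical here).
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    transfer_seconds = size_bytes / bandwidth_bytes_per_s
    return "recompute" if recompute_seconds < transfer_seconds else "transfer"


if __name__ == "__main__":
    # 10 GB over a 10 MB/s link takes about 1000 s; a 120 s recompute is cheaper.
    print(plan(10e9, 10e6, 120))
    # 1 MB over the same link takes 0.1 s; moving the file is cheaper.
    print(plan(1e6, 10e6, 120))
```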


Page 8: Virtual Data and the Chimera System*

Chimera Virtual Data System

Virtual data catalog
– Transformations, derivations, data

Virtual data language
– VDC definition and query

Applications include browsers and data analysis applications

[Figure: architecture diagram with the following labeled components]
– Virtual Data Applications
– Virtual Data Language (definition and query)
– Chimera: VDL Interpreter (manipulate derivations and transformations); Virtual Data Catalog (implements Chimera Virtual Data Schema); SQL
– Task Graphs (compute and data movement tasks, with dependencies)
– Data Grid Resources (distributed execution and data management)
– GriPhyN VDT: Replica catalog, DAGMan, Globus Toolkit, etc.

Page 9: Virtual Data and the Chimera System*

Transformations and Derivations

Transformation
– Abstract template of program invocation
– Similar to "function definition" in C

Derivation
– Formal invocation of a Transformation
– Similar to "function call" in C
– Store past and future:
  > A record of how data products were generated
  > A recipe of how data products can be generated

Invocation (future)
– Record of each Derivation (re)execution
– Similar to strace (BSD) or truss (SysV)
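The three concepts on this slide can be modeled as plain records. The Python sketch below is illustrative only: the class and field names are hypothetical and do not reflect the Chimera schema or its Java API. It shows a derivation acting as a recipe until a successful invocation is recorded, at which point it also serves as a record.

```python
from dataclasses import dataclass, field


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    executable: str
    formals: tuple  # names of the formal parameters


@dataclass
class Derivation:
    """Actual arguments bound to a transformation (like a function call)."""
    name: str
    transformation: Transformation
    actuals: dict

    def unbound(self):
        """Formal parameters that this derivation leaves without a value."""
        return [p for p in self.transformation.formals if p not in self.actuals]


@dataclass
class Invocation:
    """Record of one (re)execution of a derivation."""
    derivation: str
    exit_code: int


@dataclass
class Catalog:
    """Hypothetical in-memory stand-in for a virtual data catalog."""
    derivations: dict = field(default_factory=dict)
    invocations: list = field(default_factory=list)

    def define(self, derivation):
        self.derivations[derivation.name] = derivation

    def record(self, derivation_name, exit_code):
        self.invocations.append(Invocation(derivation_name, exit_code))

    def is_virtual(self, derivation_name):
        """True while the derivation is only a recipe with no successful run."""
        return not any(
            inv.derivation == derivation_name and inv.exit_code == 0
            for inv in self.invocations
        )
```

A derivation defined but never run is "virtual": the catalog knows how to produce its outputs without their having been materialized.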

Page 10: Virtual Data and the Chimera System*


Virtual Data Catalog Structure

Page 11: Virtual Data and the Chimera System*

Virtual Data Tools

Virtual Data API
– A Java class hierarchy to represent transformations and derivations

Virtual Data Language
– Textual for people & illustrative examples
– XML for machine-to-machine interfaces

Virtual Data Database
– Makes the objects of a virtual data definition persistent

Virtual Data Service
– Provides a service interface (e.g., OGSA) to persistent objects

Page 12: Virtual Data and the Chimera System*


Virtual Data Language: XML

Page 13: Virtual Data and the Chimera System*

Example Transformation

TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
  profile hints.exec-pfn = "/usr/bin/app3";
  argument = "-p "${pa};
  argument = "-f "${a1};
  argument = "-x -y";
  argument stdout = ${a2};
  profile env.MAXMEM = ${env};
}

[Figure: $a1 → t1 → $a2]

Page 14: Virtual Data and the Chimera System*

Example Derivations

DV d1->t1 (
  env="20000", pa="600",
  a2=@{out:run1.exp15.T1932.summary},
  a1=@{in:run1.exp15.T1932.raw}
);

DV d2->t1 (
  a1=@{in:run1.exp16.T1918.raw},
  a2=@{out:run1.exp16.T1918.summary}
);
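To make the relationship between the two derivations and transformation t1 concrete, here is a hedged Python sketch of how actual arguments might be bound to t1's formals and defaults. The `T1` dictionary, the `expand` function, and the way argument fragments are split into an argv list are assumptions made for illustration; the slides do not say how Chimera assembles the final command line.

```python
# Hypothetical in-memory form of transformation t1 from the previous slide.
T1 = {
    "executable": "/usr/bin/app3",
    "defaults": {"pa": "500", "env": "100000"},  # the "none" parameters
    "required": ("a1", "a2"),                    # the "in" and "out" parameters
}

# Actual arguments of the two derivations, with logical file names as strings.
D1 = {
    "env": "20000",
    "pa": "600",
    "a2": "run1.exp15.T1932.summary",
    "a1": "run1.exp15.T1932.raw",
}
D2 = {
    "a1": "run1.exp16.T1918.raw",
    "a2": "run1.exp16.T1918.summary",
}


def expand(actuals, transformation=T1):
    """Bind a derivation's actuals to the transformation and build a command."""
    bound = dict(transformation["defaults"])
    bound.update(actuals)  # explicit values override the defaults
    missing = [name for name in transformation["required"] if name not in bound]
    if missing:
        raise ValueError(f"unbound parameters: {missing}")
    argv = [transformation["executable"],
            "-p", bound["pa"], "-f", bound["a1"], "-x", "-y"]
    return {"argv": argv, "stdout": bound["a2"], "env": {"MAXMEM": bound["env"]}}


if __name__ == "__main__":
    print(expand(D1))
    print(expand(D2))  # d2 omits pa and env, so the defaults apply
```

Derivation d2 supplies only the two files, so it inherits pa = "500" and env = "100000" from the transformation, much as a function call falls back on default parameter values.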

Page 15: Virtual Data and the Chimera System*

Managing Dependencies

TR tr1( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app1";
  argument stdin = ${a1};
  argument stdout = ${a2};
}

TR tr2( out a2, in a1 ) {
  profile hints.exec-pfn = "/usr/bin/app2";
  argument stdin = ${a1};
  argument stdout = ${a2};
}

DV x1->tr1( a2=@{out:file2}, a1=@{in:file1});
DV x2->tr2( a2=@{out:file3}, a1=@{in:file2});

[Figure: file1 → x1 → file2 → x2 → file3]
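The dependency between x1 and x2 is implicit: x2 reads file2, which x1 writes. A minimal Python sketch of how such dependencies could be inferred from the derivations' input and output files follows. The dictionary layout and function names are hypothetical, and `graphlib` (Python 3.9+) stands in for whatever planner Chimera actually uses.

```python
from graphlib import TopologicalSorter

# The two derivations from this slide, listed out of order on purpose.
DERIVATIONS = {
    "x2": {"transformation": "tr2", "in": ["file2"], "out": ["file3"]},
    "x1": {"transformation": "tr1", "in": ["file1"], "out": ["file2"]},
}


def dependency_graph(derivations):
    """Map each derivation to the set of derivations that produce its inputs."""
    producer = {
        name_of_file: dv_name
        for dv_name, dv in derivations.items()
        for name_of_file in dv["out"]
    }
    return {
        dv_name: {producer[f] for f in dv["in"] if f in producer}
        for dv_name, dv in derivations.items()
    }


def execution_order(derivations):
    """Return the derivations in an order that respects file dependencies."""
    return list(TopologicalSorter(dependency_graph(derivations)).static_order())


if __name__ == "__main__":
    print(dependency_graph(DERIVATIONS))
    print(execution_order(DERIVATIONS))
```

file1 has no producer in the catalog, so it is treated as existing input data rather than as a dependency.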

Page 16: Virtual Data and the Chimera System*

Initial “Strawman” Architecture (Use of GriPhyN Virtual Data Toolkit)

[Figure: pipeline diagram with the components VDLx, abstract planner, DAX, concrete planner, and DAGMan]
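The last stage of this pipeline hands a task graph to DAGMan. As a rough illustration, the Python sketch below renders a dependency graph using the `JOB` and `PARENT ... CHILD` keywords of a DAGMan input file. The `to_dagman` function and the `.sub` submit-file names are hypothetical, and the output is not claimed to match what the Chimera planners generate.

```python
def to_dagman(graph, submit_suffix=".sub"):
    """Render {job: set of parent jobs} as DAGMan input-file text."""
    lines = []
    # One JOB line per node, naming a (hypothetical) submit description file.
    for job in sorted(graph):
        lines.append(f"JOB {job} {job}{submit_suffix}")
    # One PARENT/CHILD line per dependency edge.
    for child in sorted(graph):
        for parent in sorted(graph[child]):
            lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The x1 -> x2 dependency from the "Managing Dependencies" slide.
    print(to_dagman({"x1": set(), "x2": {"x1"}}), end="")
```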


Page 18: Virtual Data and the Chimera System*

Chimera Application: Sloan Digital Sky Survey Analysis

Joint work with Jim Annis, Steve Kent, FNAL

Size distribution of galaxy clusters?

[Figure: galaxy cluster size distribution; log-log plot of Number of Clusters (1 to 100000) against Number of Galaxies (1 to 100)]

Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)

Page 19: Virtual Data and the Chimera System*

Cluster-finding Data Pipeline

[Figure: pipeline dataflow diagram; node labels tsObj, field, brg, core, cluster, and catalog, with stages numbered 1 to 5]

Page 20: Virtual Data and the Chimera System*


Cluster-Finding Pipeline Execution

Page 21: Virtual Data and the Chimera System*


Small SDSS Cluster-Finding DAG

Page 22: Virtual Data and the Chimera System*

And Even Bigger: 744 Files, 387 Nodes

[Figure: larger DAG, annotated with the counts 108, 168, 60, and 50]


Page 24: Virtual Data and the Chimera System*

Virtual Data Usage Model

Transformation designers create programmatic abstractions
– Simple or compound; augment with metadata

Production managers create bulk derivations
– Can materialize data products or leave virtual

Users track their work through derivations
– Augment (replace?) the scientist’s log book

Definitions can be augmented with metadata
– The key to intelligent data retrieval
– Issues relating to metadata propagation

Page 25: Virtual Data and the Chimera System*

Virtual Data Research Issues

Representation
– Metadata: how is it created, stored, propagated?
– What knowledge must be represented? How?
– Capturing notions of data approximation
– Higher-order knowledge: virtual transformations

VDC as a community resource
– Automating data capture
– Access control and privacy issues
– Quality control

Data derivation
– Query estimation and request planning

Page 26: Virtual Data and the Chimera System*

Virtual Data Research Issues

“Engineering” issues
– Dynamic (runtime-computed) dependencies
– Large dependent sets
– Extensions to other data models: relational, OO
– Virtual data browsers
– XML vs. relational databases & query languages

Additional usage modalities
– E.g., meta-analyses, automated experiment generation, “active notebooks”

Virtual data browsers, editors

Page 27: Virtual Data and the Chimera System*

Status of Chimera R&D

Early virtual data system demonstrated Nov ’01: HEP collision simulations
Larger scale problems addressed recently: “cluster finding” in SDSS
First public release in June: Chimera v1.0
Enhancements planned throughout the summer
Physics & astronomy applications by SC’02
Future R&D focus #1: request planning
Future R&D focus #2: knowledge representation
Future apps: bioinformatics, earth sciences

Page 28: Virtual Data and the Chimera System*

Related Work

Data provenance
– Materialized views, lineage: Cui, Widom
– Data provenance tracking: Buneman et al.

Capturing transformations
– ZOO system and conceptual schema

Data Grid technologies
– GriPhyN, Globus Project, EU DataGrid

Page 29: Virtual Data and the Chimera System*

Summary

Concept: Tools to support management of transformations and derivations as community resources

Technology: Chimera virtual data system including virtual data catalog and virtual data language; use of GriPhyN virtual data toolkit for automated data derivation

Results: Successful early applications to CMS and SDSS data generation/analysis

Future: Public release of prototype, new apps, knowledge representation, planning

Page 30: Virtual Data and the Chimera System*

For More Information

GriPhyN project (NSF ITR funded)
– www.griphyn.org

Chimera virtual data system
– www.griphyn.org/chimera
– “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation,” SSDBM, July 2002
– “Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey,” SC’02, November 2002