Page 1: Datasets

David Adams

ATLAS

Datasets

David Adams

BNL

December 19, 2002

PPDG meeting

Interactive analysis

Page 2: Datasets

Contents

DIAL

Dataset properties

Dataset representations

Dataset package status

Future

Page 3: Datasets

DIAL

DIAL is
• Distributed Interactive Analysis of Large datasets

DIAL described at
• http://www.usatlas.bnl.gov/~dladams/dial/talks/021219_dial.ppt

Use DIAL to deduce dataset properties

Page 4: Datasets

Dataset properties

Dataset is a collection of data objects
• Means to iterate over objects
• Typically objects are also indexed with labels
  – Unique within dataset
  – For event data: event ID + type + string key
    > E.g. run 123, event 456, EM jet, cone_0.5
  – Allows for random access
• Data may be in a persistent store
  – Each object has a GUID
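
The properties above (iteration, unique labels, random access, a GUID per object) can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the names `Label` and `Dataset`, the field layout, and the use of a random UUID as a stand-in for a persistent-store GUID are hypothetical and are not taken from the Dataset package.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Iterator, Tuple


@dataclass(frozen=True)
class Label:
    """Hypothetical label for event data: event ID + type + string key."""
    run: int
    event: int
    type: str
    key: str


class Dataset:
    """Toy dataset: iteration over objects plus random access by label."""

    def __init__(self) -> None:
        self._objects: Dict[Label, Any] = {}
        self._guids: Dict[Label, str] = {}

    def add(self, label: Label, obj: Any) -> str:
        # Labels must be unique within the dataset.
        if label in self._objects:
            raise KeyError(f"label already used in this dataset: {label}")
        self._objects[label] = obj
        # Assumption: a random UUID stands in for the store-assigned GUID.
        self._guids[label] = str(uuid.uuid4())
        return self._guids[label]

    def __iter__(self) -> Iterator[Tuple[Label, Any]]:
        # Means to iterate over objects (in insertion order).
        return iter(self._objects.items())

    def __getitem__(self, label: Label) -> Any:
        # Random access through the label index.
        return self._objects[label]

    def guid(self, label: Label) -> str:
        return self._guids[label]


ds = Dataset()
jet = Label(run=123, event=456, type="EM jet", key="cone_0.5")
ds.add(jet, {"et": 42.0})  # the payload is made-up example data
```

Here `ds[jet]` returns the stored object directly, while `for label, obj in ds` walks the whole collection.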

Page 5: Datasets

Dataset properties (cont)

Dataset has content
• Indicates suitability for a particular analysis or other transformation
• Might be expressed in terms of object labels
• For ATLAS event data:
  – Event IDs + type-keys for each (ATLAS) event
• (Part of type in GriPhyN VDG)
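
One way to read "content indicates suitability" is as a check that every event carries the type-keys an analysis needs. The sketch below assumes content is stored as a mapping from event ID to a set of type-keys; the `(run, event)` encoding, the function name `is_suitable`, and the sample values are all hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

EventId = Tuple[int, int]  # assumed encoding: (run, event)
TypeKey = Tuple[str, str]  # (type, string key)
Content = Dict[EventId, FrozenSet[TypeKey]]


def is_suitable(content: Content, required: Set[TypeKey]) -> bool:
    """True if the content is non-empty and every event has all required type-keys."""
    return bool(content) and all(
        required <= type_keys for type_keys in content.values()
    )


# Made-up example content: two events, one of which lacks tracks.
content: Content = {
    (123, 456): frozenset({("EM jet", "cone_0.5"), ("track", "default")}),
    (123, 457): frozenset({("EM jet", "cone_0.5")}),
}

jets_only = is_suitable(content, {("EM jet", "cone_0.5")})
jets_and_tracks = is_suitable(
    content, {("EM jet", "cone_0.5"), ("track", "default")}
)
```

With this sample content, a jet-only analysis is accepted and one that also needs tracks is rejected, because event 457 has no track entry.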

Page 6: Datasets

Dataset properties (cont)

Data in dataset has a location
• Persistent store where data may be found
• List of files holding the data
  – File IDs or LFNs
    > Persistent store locates physical replicas
• Or rows in RDB tables…
• May be multiple locations for a dataset
  – Due to different representations
  – More later
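
The split between logical file names held by the dataset and physical replicas found by the store can be sketched with a plain dictionary acting as a replica catalog. The catalog contents, the LFN and path strings, and the function name `resolve` are hypothetical; no real replica-catalog API is implied.

```python
from typing import Dict, Iterable, List

# Hypothetical replica catalog: logical file name -> physical replicas.
REPLICA_CATALOG: Dict[str, List[str]] = {
    "lfn:run123.events.root": [
        "file:/data/site_a/run123.events.root",
        "file:/data/site_b/run123.events.root",
    ],
}


def resolve(lfns: Iterable[str], catalog: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each LFN in a dataset location to its known physical replicas."""
    # An LFN with no registered replica maps to an empty list.
    return {lfn: list(catalog.get(lfn, [])) for lfn in lfns}


location = ["lfn:run123.events.root", "lfn:run124.events.root"]
replicas = resolve(location, REPLICA_CATALOG)
```

The dataset itself only carries `location`; which replica is read is left to the store.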

Page 7: Datasets

Dataset properties (cont)

Dataset has a history
• Transformation used to create the dataset
  – Executable, version, input parameters
  – (VDG transformation)
• Input datasets
  – (VDG derivation)
• Run-time properties (node, time, …)
  – Multiple values for distributed processing
  – (VDG invocation)
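
The three parts of a history listed above map naturally onto three record types. The sketch below is an assumption-laden illustration: the class and field names are hypothetical and do not reproduce the GriPhyN VDG schema, and the executable name, dataset name, nodes, and times are made-up values.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transformation:
    """What was run: executable, version, input parameters."""
    executable: str
    version: str
    parameters: Dict[str, str]


@dataclass
class Invocation:
    """Run-time properties of one execution."""
    node: str
    start_time: str


@dataclass
class History:
    transformation: Transformation
    input_datasets: List[str]  # names of the input datasets
    # Distributed processing yields one invocation per job, hence a list.
    invocations: List[Invocation] = field(default_factory=list)


history = History(
    transformation=Transformation("reco.exe", "1.0.0", {"cone": "0.5"}),
    input_datasets=["raw_run123"],
    invocations=[
        Invocation("node01", "2002-12-19T10:00:00"),
        Invocation("node02", "2002-12-19T10:00:05"),
    ],
)
```

Keeping `invocations` as a list is what lets a single dataset record several nodes and times when the work was split across machines.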

Page 8: Datasets

Dataset properties (cont)

Dataset has a unique identity (name)
• So it can be referenced

Dataset has portable representation
• Possible to carry around a description of the content and location of a dataset without reference to any DBs
• Dataset package uses XML
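
A portable description of this kind can be sketched with Python's standard `xml.etree.ElementTree`. The element and attribute names below (`dataset`, `content`, `event`, `location`, `file`, `lfn`) are hypothetical and are not the schema used by the Dataset package; the point is only that name, content, and location round-trip through a self-contained string.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

EventId = Tuple[int, int]  # assumed encoding: (run, event)


def to_xml(name: str, event_ids: List[EventId], lfns: List[str]) -> str:
    """Serialize a dataset description (name, content, location) to XML text."""
    root = ET.Element("dataset", name=name)
    content = ET.SubElement(root, "content")
    for run, event in event_ids:
        ET.SubElement(content, "event", run=str(run), event=str(event))
    location = ET.SubElement(root, "location")
    for lfn in lfns:
        ET.SubElement(location, "file", lfn=lfn)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Tuple[str, List[EventId], List[str]]:
    """Rebuild the description from XML text, with no database lookup."""
    root = ET.fromstring(text)
    event_ids = [
        (int(e.get("run")), int(e.get("event"))) for e in root.find("content")
    ]
    lfns = [f.get("lfn") for f in root.find("location")]
    return root.get("name"), event_ids, lfns


# Made-up example values.
xml_text = to_xml("my_dataset", [(123, 456), (123, 457)], ["lfn:run123.events.root"])
restored = from_xml(xml_text)
```

The string in `xml_text` can be mailed or copied between sites and parsed back without contacting any catalog.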

Page 9: Datasets

Dataset representations

There are different ways to represent the data in a dataset

Simple datasets:
• All data in a single file
• Table in a RDB
• Indexed list of GUIDs for a persistent store
  – Commercial ODB such as Objectivity
  – HES such as LCG POOL

Page 10: Datasets

Dataset representations (cont)

Compound datasets
• Concatenation of datasets
  – Concatenation of content
  – Any overlap between content of constituent datasets must index identical objects
• Subset of a dataset
  – Based on content
• Result of an algorithm applied on a dataset
  – Virtual data
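
The overlap rule for concatenation and the idea of a content-based subset can be sketched on a label-to-GUID index. This assumes a dataset is represented as such an index; the type alias `Index`, the function names, and the sample labels and GUID strings are hypothetical.

```python
from typing import Callable, Dict, Hashable

Index = Dict[Hashable, str]  # label -> GUID of the object it indexes


def concatenate(a: Index, b: Index) -> Index:
    """Concatenate two datasets; overlapping labels must index identical objects."""
    for label in a.keys() & b.keys():
        if a[label] != b[label]:
            raise ValueError(f"label {label!r} indexes different objects")
    return {**a, **b}


def subset(dataset: Index, keep: Callable[[Hashable], bool]) -> Index:
    """Select the part of a dataset whose labels satisfy a predicate."""
    return {label: guid for label, guid in dataset.items() if keep(label)}


# Made-up example: labels are (run, event) pairs, GUIDs are placeholder strings.
first = {(123, 456): "guid-a", (123, 457): "guid-b"}
second = {(123, 457): "guid-b", (124, 1): "guid-c"}

combined = concatenate(first, second)
run_123 = subset(combined, lambda label: label[0] == 123)
```

The shared label `(123, 457)` is accepted because both inputs map it to the same GUID; had they differed, `concatenate` would raise instead of silently picking one.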

Page 11: Datasets

Dataset package status

Datasets
• Generic implementation in place
  – http://www.usatlas.bnl.gov/~dladams/dataset
• Assumes content is event data
• Supported representations:
  – Single file
    > AthenaRoot format
    > ATLAS Monte Carlo generator output
  – Concatenation of events
  – Selection based on event ID

Page 12: Datasets

Future

Support other types of ATLAS event data

Add concatenation and selection based on event content

Add representation for POOL EventCollection

Add non-event data
• Relevant conditions data objects
• Derived metadata
• Provenance and production history