David Adams
ATLAS
Datasets
David Adams
BNL
December 19, 2002
PPDG meeting
Interactive analysis
Contents
DIAL
Dataset properties
Dataset representations
Dataset package status
Future
DIAL
DIAL is
• Distributed Interactive Analysis of Large datasets
DIAL described at
• http://www.usatlas.bnl.gov/~dladams/dial/talks/021219_dial.ppt
Use DIAL to deduce dataset properties
Dataset properties
Dataset is a collection of data objects
• Means to iterate over objects
• Typically objects are also indexed with labels
– Unique within dataset
– For event data: event ID + type + string key
> E.g. run 123, event 456, EM jet, cone_0.5
– Allows for random access
• Data may be in a persistent store
– Each object has a GUID
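The properties on this slide can be illustrated with a minimal sketch. It is not the dataset package's API: the names `Label` and `Dataset` and their methods are hypothetical, and the GUID is simply generated with Python's `uuid` module to stand in for whatever a persistent store would assign.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Iterator, Tuple


@dataclass(frozen=True)
class Label:
    """Hypothetical label for event data: event ID + type + string key."""
    run: int
    event: int
    type: str
    key: str


class Dataset:
    """Hypothetical collection of data objects indexed by unique labels."""

    def __init__(self) -> None:
        # label -> (GUID, object)
        self._objects: Dict[Label, Tuple[str, Any]] = {}

    def add(self, label: Label, obj: Any) -> str:
        # Labels must be unique within the dataset.
        if label in self._objects:
            raise KeyError(f"label already present: {label}")
        guid = str(uuid.uuid4())  # stand-in for a store-assigned GUID
        self._objects[label] = (guid, obj)
        return guid

    def __iter__(self) -> Iterator[Tuple[Label, Any]]:
        # Means to iterate over objects.
        for label, (_, obj) in self._objects.items():
            yield label, obj

    def __getitem__(self, label: Label) -> Any:
        # Random access by label.
        return self._objects[label][1]

    def guid(self, label: Label) -> str:
        return self._objects[label][0]

    def __len__(self) -> int:
        return len(self._objects)
```

With this sketch, the slide's example object would be stored under `Label(123, 456, "EM jet", "cone_0.5")` and fetched back with `dataset[label]`.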
Dataset properties (cont)
Dataset has content
• Indicates suitability for a particular analysis or other transformation
• Might be expressed in terms of object labels
• For ATLAS event data:
– Event IDs + type-keys for each (ATLAS) event
• (Part of type in GriPhyN VDG)
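As a rough illustration of content indicating suitability, the sketch below assumes content is recorded as a mapping from event ID to the set of type-keys available for that event. The type aliases and the function `is_suitable` are hypothetical names introduced here, not part of the dataset package.

```python
from typing import Dict, FrozenSet, Set, Tuple

EventId = Tuple[int, int]   # (run, event), assumed form of an event ID
TypeKey = Tuple[str, str]   # (type, string key)
Content = Dict[EventId, FrozenSet[TypeKey]]


def is_suitable(content: Content, required: Set[TypeKey]) -> bool:
    """Return True if every event in the content carries all required type-keys.

    An empty content is treated as unsuitable.
    """
    return bool(content) and all(
        required <= type_keys for type_keys in content.values()
    )
```

An analysis that needs EM jets with key `cone_0.5` would pass `{("EM jet", "cone_0.5")}` as `required`.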
Dataset properties (cont)
Data in dataset has a location
• Persistent store where data may be found
• List of files holding the data
– File IDs or LFNs
> Persistent store locates physical replicas
• Or rows in RDB tables…
• May be multiple locations for a dataset
– Due to different representations
– More later
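The split between a dataset's list of logical file names and the store's knowledge of physical replicas might look like the sketch below. The catalog is modeled as a plain dictionary purely for illustration; `ReplicaCatalog` and `resolve` are hypothetical names, and no real replica catalog interface is implied.

```python
from typing import Dict, List

# Hypothetical replica catalog: logical file name -> physical replicas.
ReplicaCatalog = Dict[str, List[str]]


def resolve(lfns: List[str], catalog: ReplicaCatalog) -> Dict[str, List[str]]:
    """Map each LFN in a dataset location to its known physical replicas."""
    missing = [lfn for lfn in lfns if not catalog.get(lfn)]
    if missing:
        raise LookupError(f"no replica found for: {missing}")
    # Copy the replica lists so callers cannot modify the catalog.
    return {lfn: list(catalog[lfn]) for lfn in lfns}
```

The dataset itself carries only the LFNs, so its description stays valid when replicas are added or moved.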
Dataset properties (cont)
Dataset has a history
• Transformation used to create the dataset
– Executable, version, input parameters
– (VDG transformation)
• Input datasets
– (VDG derivation)
• Run-time properties (node, time, …)
– Multiple values for distributed processing
– (VDG invocation)
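One way to picture the three parts of a dataset history is as a small record type. This is a sketch only: the class and field names are hypothetical, times are kept as plain strings, and nothing here reproduces the GriPhyN VDG schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transformation:
    """What was run (compare: VDG transformation)."""
    executable: str
    version: str
    parameters: Dict[str, str]


@dataclass
class Invocation:
    """Run-time properties of one execution (compare: VDG invocation)."""
    node: str
    start_time: str


@dataclass
class History:
    transformation: Transformation
    # Names of the input datasets (compare: VDG derivation).
    input_datasets: List[str]
    # Distributed processing yields one invocation per job.
    invocations: List[Invocation] = field(default_factory=list)
```

Holding `invocations` as a list reflects the slide's point that distributed processing gives multiple values for the run-time properties.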
Dataset properties (cont)
Dataset has a unique identity (name)
• So it can be referenced
Dataset has portable representation
• Possible to carry around a description of the content and location of a dataset without reference to any DBs
• Dataset package uses XML
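A portable description of this kind could be sketched as follows with Python's standard `xml.etree.ElementTree`. The element and attribute names (`dataset`, `content`, `event`, `location`, `file`, `lfn`) are assumptions made for illustration and are not the schema used by the dataset package.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def dataset_to_xml(name: str, event_ids: List[Tuple[int, int]],
                   lfns: List[str]) -> str:
    """Serialize name, content and location to a self-contained XML string."""
    root = ET.Element("dataset", {"name": name})
    content = ET.SubElement(root, "content")
    for run, number in event_ids:
        ET.SubElement(content, "event", {"run": str(run), "number": str(number)})
    location = ET.SubElement(root, "location")
    for lfn in lfns:
        ET.SubElement(location, "file", {"lfn": lfn})
    return ET.tostring(root, encoding="unicode")


def dataset_from_xml(text: str) -> Tuple[str, List[Tuple[int, int]], List[str]]:
    """Rebuild name, content and location from the XML, with no DB lookup."""
    root = ET.fromstring(text)
    events = [(int(e.get("run")), int(e.get("number")))
              for e in root.find("content").findall("event")]
    lfns = [f.get("lfn") for f in root.find("location").findall("file")]
    return root.get("name"), events, lfns
```

The round trip needs nothing beyond the XML string itself, which is the sense in which the representation is portable.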
Dataset representations
There are different ways to represent the data in a dataset
Simple datasets:
• All data in a single file
• Table in an RDB
• Indexed list of GUIDs for a persistent store
– Commercial ODB such as Objectivity
– HES such as LCG POOL
Dataset representations (cont)
Compound datasets
• Concatenation of datasets
– Concatenation of content
– Any overlap between content of constituent datasets must index identical objects
• Subset of a dataset
– Based on content
• Result of an algorithm applied on a dataset
– Virtual data
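The concatenation rule and content-based subsetting can be sketched on a simplified model in which a dataset is a mapping from label to object GUID. The functions `concatenate` and `select` are hypothetical and only illustrate the constraints stated on this slide.

```python
from typing import Callable, Dict, Hashable

# Simplified model: a dataset maps each label to the GUID of its object.
SimpleDataset = Dict[Hashable, str]


def concatenate(*datasets: SimpleDataset) -> SimpleDataset:
    """Concatenate content; overlapping labels must index identical objects."""
    result: SimpleDataset = {}
    for dataset in datasets:
        for label, guid in dataset.items():
            if label in result and result[label] != guid:
                raise ValueError(f"label {label!r} indexes different objects")
            result[label] = guid
    return result


def select(dataset: SimpleDataset,
           predicate: Callable[[Hashable], bool]) -> SimpleDataset:
    """Subset of a dataset, chosen by a predicate on the labels (content)."""
    return {label: guid for label, guid in dataset.items() if predicate(label)}
```

For example, `select(ds, lambda label: label[0] == 123)` would keep only the entries whose label starts with run 123, assuming labels are tuples beginning with the run number.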
Dataset package status
Datasets
• Generic implementation in place
– http://www.usatlas.bnl.gov/~dladams/dataset
• Assumes content is event data
• Supported representations:
– Single file
> AthenaRoot format
> ATLAS Monte Carlo generator output
– Concatenation of events
– Selection based on event ID
Future
Support other types of ATLAS event data
Add concatenation and selection based on event content
Add representation for POOL EventCollection
Add non-event data
• Relevant conditions data objects
• Derived metadata
• Provenance and production history