MICE Data Flow

Henry Nebrensky

Brunel University

MICE CM24 - 2 June 2009


MICE Data and the Grid


Storage, archiving and dissemination of experimental data:

Not been a high priority so far

Overall strategy not documented anywhere obvious

Individual work on parts of this – but do the pieces fit together?

Grid:

Certain Grid services are separately funded to provide a production service to MICE

Provides a ready-made set of building blocks – but “we” have to put them together

MICE need to know what they want, to make sure that the finished edifice meets all their needs (and that Grid includes all the necessary bricks)


Decision Time

We need to start putting the pieces together very soon.

Once data starts going on tape it will not be possible to change how and where it is stored

need an agreed plan in the near future (i.e. by end of CM24)

There are a number of unresolved issues – see Note 252 and the data flow diagram.

Data volumes, lifetime and access control mostly unclear

(LFC) File naming scheme – see MICE Note 247

File metadata requirements – raised at CM23



The Awesome Power of Grid Computing

The Grid provides seamless interconnection between tens of thousands of computers.

It therefore generates new acronyms and jargon at superhuman speed.



Grid Middleware

We are currently using EGEE/WLCG middleware and resources, as they are receiving significant development effort and are a reasonable match for our needs (shared with various minor experiments such as LHC)

Outside Europe other software may be expected – e.g. the OSG stack in the US. Interoperability is, from our perspective, yet another “known unknown”...



MICE and Grid Data Storage

The Grid provides MICE not only with computing (number-crunching) power, but also with a secure global framework allowing users access to data

Good news: storing development data on the Grid keeps it available to the collaboration – not stuck on an old PC in the corner of the lab

Bad news: loss of ownership – who picks up the data curation responsibilities?

Data can be downloaded from the Grid to user’s “own” PC – doesn’t need to be analysed remotely



Grid File Management (1)

Each file is given a unique, machine-generated, GUID when stored on the Grid

The file is physically uploaded to one (or more) SEs (Storage Elements) where it is given a machine-generated SURL (Storage URL)

Machine-generated names are not (meant to be) human-usable

A “replica catalogue” tracks the multiple SURLs of a GUID

For sanity's sake we would like to associate sensible filenames with each file (LFN, Logical File Name)

A “file catalogue” is a database that translates between something that looks like a Unix filesystem and the GUIDs and SURLs needed to actually access the data on the Grid
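As a rough illustration of how these layers relate, the sketch below models a file catalogue and a replica catalogue as two in-memory dictionaries. It is a toy model only, not the LFC API: the class and method names are invented for illustration, and the LFN and SURLs are made-up examples.

```python
import uuid


class ToyCatalogue:
    """Toy model of a file catalogue plus replica catalogue (not the LFC API)."""

    def __init__(self):
        self.lfn_to_guid = {}    # file catalogue: human-readable name -> GUID
        self.guid_to_surls = {}  # replica catalogue: GUID -> list of SURLs

    def register(self, lfn, surl):
        """Register a new file under a logical name with its first replica."""
        if lfn in self.lfn_to_guid:
            raise ValueError(f"LFN already registered: {lfn}")
        guid = str(uuid.uuid4())  # machine-generated, not meant for humans
        self.lfn_to_guid[lfn] = guid
        self.guid_to_surls[guid] = [surl]
        return guid

    def add_replica(self, lfn, surl):
        """Record an additional copy of an existing file on another SE."""
        self.guid_to_surls[self.lfn_to_guid[lfn]].append(surl)

    def replicas(self, lfn):
        """Translate a logical file name into the SURLs that hold the data."""
        return list(self.guid_to_surls[self.lfn_to_guid[lfn]])


catalogue = ToyCatalogue()
lfn = "/grid/mice/users/example/run0001.dat"  # hypothetical LFN
catalogue.register(lfn, "srm://se1.example.org/mice/file-aaaa")
catalogue.add_replica(lfn, "srm://se2.example.org/mice/file-bbbb")
print(catalogue.replicas(lfn))
```

The user only ever handles the LFN; the GUID and the SURLs stay behind the lookup, which is the point of having the catalogue.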



Grid File Management (2)


MICE has an instance of LFC (LCG File Catalogue) run by the Tier 1 at RAL

The LFC service can do both the replica and LFN cataloguing

LFC presents the user with what looks like a normal Unix filespace - the Grid client SW keeps track of the data behind the scenes.

[Diagram, from MICE Note 247. Components shown: UI with local disk; LFC, comprising the File Catalogue and the Replica Catalogue; Metadata Catalogue; File Transfer Service; three SEs, one drawn with a head node, pools and tape. Example identifiers shown:]

LFN /grid/mice/users/Nebrensky/sw/g4beamline-1.15.3-Linux-g++.tgz

GUID ad9e349c-7a56-4961-8741-8242949433b0

SURL srm://dgc-grid-34.brunel.ac.uk/dpm/brunel.ac.uk/home/mice/generated/2009-03-09/file34c34ee1-f10b-463a-80b3-2d257231261f

TURL gsiftp://dgc-grid-52.brunel.ac.uk/data2/dpmfs/mice/2009-03-09/file34c34ee1-f10b-463a-80b3-2d257231261f.3660836.0


Data Integrity

(For recent SE releases) a checksum is calculated automatically when a file is uploaded.

This can be checked when the file is transferred between SEs, or the value retrieved to check local copies.
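Checking a local copy against a retrieved checksum could look like the sketch below. It assumes the SE reports an Adler-32 value as eight hexadecimal digits, which the talk does not state; the function names are illustrative, and a temporary file stands in for a downloaded copy.

```python
import os
import tempfile
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)  # running checksum
    return f"{value & 0xFFFFFFFF:08x}"


def matches_catalogue(path, expected_hex):
    """Compare a local file against the checksum retrieved from the SE."""
    return adler32_of_file(path) == expected_hex.lower()


# Demonstration with a temporary file standing in for a downloaded copy.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Wikipedia")
local_copy = tmp.name
checksum = adler32_of_file(local_copy)
ok = matches_catalogue(local_copy, checksum)
os.remove(local_copy)
print(checksum, ok)
```

Reading in chunks keeps memory use flat for large files, which matters if individual RAW files turn out to be big.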



The VOMS server

File permissions will be needed, e.g. to ensure that users can’t accidentally delete RAW data. These rules will need to last for at least the life of the experiment.

VOMS is a Grid service that allows us to define specific roles (e.g. DAQ data archiver) which will then be allowed certain privileges (such as writing to tape at RAL Tier 1).

The VOMS service then maps humans to those roles, via their certificates.

MICE VOMS server is provided via GridPP at Manchester, UK.

New Mice are added or assigned to roles by the VO Manager (and Mouse) Paul Hodgson.

Thus the VOMS service provides us with a single portal where we can add/remove/reassign Mice, without needing to negotiate with the operators of every Grid resource worldwide – we actually keep control “in-house.”
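The mapping from people to roles to privileges can be pictured with the following sketch. It is a conceptual model only, not the VOMS API or its configuration format; the role names, privilege names and certificate subject are hypothetical.

```python
# Roles defined by the VO, each granting a set of privileges (names hypothetical).
ROLE_PRIVILEGES = {
    "archiver": {"write_raw_tape"},
    "production": {"read_raw", "write_reco_disk"},
    "member": {"read_reco"},
}

# VO membership: certificate subject (DN) -> roles assigned by the VO Manager.
MEMBER_ROLES = {
    "/C=UK/O=eScience/CN=example mouse": {"member", "production"},
}


def privileges_for(subject_dn):
    """Collect every privilege granted by the roles held by this certificate."""
    granted = set()
    for role in MEMBER_ROLES.get(subject_dn, set()):
        granted |= ROLE_PRIVILEGES[role]
    return granted


def is_allowed(subject_dn, privilege):
    """A resource only needs to ask whether the privilege was granted."""
    return privilege in privileges_for(subject_dn)


dn = "/C=UK/O=eScience/CN=example mouse"
print(is_allowed(dn, "write_reco_disk"))
print(is_allowed(dn, "write_raw_tape"))
```

In this picture, reassigning a person means editing only the membership table; the resources that check privileges are left untouched.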



MICE Data Flow

The basic data flow in MICE is thus something like:

The raw data files from the experiment are sent to tape using Grid protocols, including registering the files in LFC.

The offline reconstruction can then use Grid/LFC to pull down the raw data, and upload reconstructed (“RECO” or DST) files.

Users can use Grid/LFC to access RECO files they want to play with.

If I combine the above description with some background knowledge of the Grid, some snippets of what people are working on and a whole lot of guesswork I get:



MICE Data Flow Diagram

[Figure 1: Data flow from the MICE experiment. Short-dashed entities require confirmation. Long-dashed lines represent borders between subnets.]

Labels recovered from the diagram:

Networks: MiceNet DAQ network; Controls & Monitoring network; ISIS or PPD network; Tier 1 network

Links: Optical Fibre 1 Gbps to Tier 1; Backup link to Tier 1?; Failover link to Tier 2?; 2 TB/day = 200 Mbps

Online: MICE DAQ; Online Buffer; Online Farm; Online Reconstruction; Transfer Box (MICEACQ05); Conditions Database (EPICS); Configuration Database?; Config DB “API”?; Create file metadata?

Data: RAW; RAW ephemeral ROOT histograms; RECO (ROOT trees); Analysis results; results archive?

Storage: CASTOR tape; CASTOR disk; Tier 2 SE disk; User local disk

Grid services: LFC; AMGA; MyProxy; BDII; VOMS, DNS, NTP?

Processing: Offline Reconstruction; MC simulation; UK GridPP Tier 2 Farms; Any Tier 2 Farm; Semi-automated process; “Chaotic” (on-demand) analysis

Credentials: “Robot”? certificate, VOMS “archiver”? role, MICE_RAW_TAPE? token; “Anointed user” certificate, VOMS “production” role, MICE_RECO_DISK? token; Generic user certificate

Short-dashed lines indicate entities that still need confirmation

Question marks indicate even higher levels of uncertainty

More details in MICE Note 252

The diagram would look pretty much the same if non-Grid tools were used
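The “2 TB/day = 200 Mbps” label on the diagram can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes) and a transfer spread evenly over 24 hours, neither of which is stated on the slide.

```python
def sustained_mbps(terabytes_per_day):
    """Average rate in Mbps needed to move a daily volume, decimal units."""
    bits_per_day = terabytes_per_day * 1e12 * 8
    seconds_per_day = 24 * 60 * 60
    return bits_per_day / seconds_per_day / 1e6


rate = sustained_mbps(2)
print(f"{rate:.0f} Mbps")
```

Under these assumptions the average comes to about 185 Mbps, so the 200 Mbps on the diagram reads as a rounded figure, a little under a fifth of the 1 Gbps fibre to the Tier 1.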


Data Flow Implementation


Most of this is NOT in place yet (at production level)!

[Reduced version of Figure 1, showing only the following entities:]

Networks: MiceNet DAQ network; Controls & Monitoring network; Tier 1 network

Links: 2 TB/day = 200 Mbps

Online: MICE DAQ; Online Buffer; Online Farm; Online Reconstruction; Transfer Box (MICEACQ05); Conditions Database (EPICS)

Data: RAW; Analysis results

Storage: CASTOR tape; CASTOR disk; Tier 2 SE disk; User local disk

Grid services: LFC

Processing: Offline Reconstruction; MC simulation; UK GridPP Tier 2 Farms; Any Tier 2 Farm; “Chaotic” (on-demand) analysis

Credentials: Generic user certificate


MICE Data Unknowns

MICE Note 252 identifies four main flavours of data: RAW, RECO, analysis results, and MonteCarlo simulation.

For all four, we need to understand the:

volume (the total amount of data, the rate at which it will be produced, and the size of the individual files in which it will be stored)

lifetime (ephemeral or longer lasting? will it need archiving to tape?)

access control (who will create the data? who is allowed to see it? can it be modified or deleted, and if so who has those privileges?)

Also need to identify use cases I’ve missed, especially ones that will need more VOMS roles or CASTOR space tokens.



File Catalogue Namespace (1)

Also, we need to agree on a consistent namespace for the file catalogue

Proposal (MICE Note 247, Grid talk at CM23):

We get given /grid/mice/ by the server

Five upper-level directories:

Construction/ – historical data from detector development and QA

Calibration/ – needed during analysis (large datasets, c.f. DB)

TestBeam/ – test beam data

MICE/ – DAQ output and corresponding MC simulation


File Catalogue Namespace (2)

/grid/mice/users/name

For people to use as scratch space for their own purposes, e.g. analysis

Encourage people to do this through LFC – helps avoid “dark data”

LFC allows Unix-style access permissions

Again, the LFC namespace is something that needs to be finalised before production data can start to be registered.
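A proposed namespace like this is easy to enforce mechanically before files are registered. The sketch below checks that an LFN sits under one of the five proposed top-level directories; the function name is hypothetical, and the exact spelling and case of the directory names are taken from the proposal as transcribed here and may change when the scheme is finalised.

```python
import posixpath

ROOT = "/grid/mice"
# Top-level directories from the proposal in MICE Note 247.
TOP_LEVEL = ("Construction", "Calibration", "TestBeam", "MICE", "users")


def check_lfn(lfn):
    """Return the top-level directory of an LFN, or raise if it is outside the scheme."""
    normalised = posixpath.normpath(lfn)  # resolves "..", "." and doubled slashes
    if not normalised.startswith(ROOT + "/"):
        raise ValueError(f"not under {ROOT}/: {lfn}")
    parts = normalised[len(ROOT) + 1:].split("/")
    if parts[0] not in TOP_LEVEL:
        raise ValueError(f"unknown top-level directory: {parts[0]}")
    if len(parts) < 2:
        raise ValueError(f"LFN names a directory, not a file: {lfn}")
    return parts[0]


print(check_lfn("/grid/mice/users/Nebrensky/sw/g4beamline-1.15.3-Linux-g++.tgz"))
```

A check of this kind could run in whatever script registers files, so that stray paths are rejected before they reach the catalogue.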



Metadata Catalogue

For many applications – such as analysis – you will want to identify the list of files containing the data that matches some parameters

This is done by a “metadata catalogue”.

For MICE this doesn't yet exist

A metadata catalogue can in principle return either the GUID or an LFN – it shouldn’t matter which as long as it’s properly integrated with the other Grid services.

(Grid talk at CM23)



MICE Metadata Catalogue

We need to select a technology to use for this

use the configuration database?

gLite AMGA (who else uses it – will it remain supported?)

Need to implement – i.e. register metadata to files

What metadata will be needed for analysis?

Should the catalogue include the file format and compression scheme (gzip ≠ PKzip)?


MICE Metadata Catalogue for Humans

or, in non-Gridspeak:

we have several databases (configuration DB, EPICS, e-Logbook) where we should be able to find all sorts of information about a run/timestamp.

but how do we know which runs to be interested in, for our analysis?

we need an “index” to the MICE data, and for this we need to define the set of “index terms” that will be used to search for relevant datasets.



MICE Metadata

Run, date/time

Step

Nominal 4-d / transverse normalised Emittance

Diffuser setting

Nominal Momentum

Configuration:

Magnet currents

Physical geometry

RF?

???
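To make the idea of an “index” concrete, the sketch below stores a few candidate index terms per run and returns the LFNs of the runs that match a query. It is a toy in-memory model, not AMGA or the configuration database; the field names, values, units and LFNs are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One index entry; fields follow the candidate terms, values are invented."""
    run: int
    step: int
    momentum: float   # nominal momentum, arbitrary units
    emittance: float  # nominal emittance, arbitrary units
    lfn: str


INDEX = [
    RunRecord(1001, 1, 200.0, 6.0, "/grid/mice/MICE/example/run1001.dat"),
    RunRecord(1002, 1, 140.0, 6.0, "/grid/mice/MICE/example/run1002.dat"),
    RunRecord(1003, 2, 200.0, 10.0, "/grid/mice/MICE/example/run1003.dat"),
]


def find_lfns(index, **criteria):
    """Return the LFNs of all records whose fields equal the given criteria."""
    return [
        record.lfn
        for record in index
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


print(find_lfns(INDEX, momentum=200.0))
print(find_lfns(INDEX, step=1, emittance=6.0))
```

Whatever technology is chosen, the set of fields in the record is the part that has to be agreed first, since it fixes which questions the index can answer.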



Conclusions

The data flow is more complex than people realise…

… and probably won’t work by accident

Some specific issues that need to be understood are the attributes of the data flows (Note 252), the LFC Namespace (Note 247) and the index terms for the metadata catalogue.

This needs discussion and (where necessary) decision pretty soon – by end CM24 – to be ready for data taking.
