Marianne Bargiotti, BK Workshop – CERN - 6/12/2007. Bookkeeping Meta Data catalogue: present status

Bookkeeping Meta Data catalogue: present status

Marianne Bargiotti, CERN

Outline

- BK overview
- Logical data model and DB schema
- BK services and User Interface
- Conclusions
- Appendix A, B

LHCb Bookkeeping Meta Data Catalogue

The Bookkeeping (BK) is the AMGA*-based system that manages the metadata of data files. It contains information about jobs, files and their relations:

- Job: application name, application version, application parameters, which files it has generated, etc.
- File: size, event, filename, from which job it was generated, etc.

The Bookkeeping DB is the main gateway for users to select the available data and datasets.

Three main services are available:

- Booking service: to write data to the bookkeeping.
- Servlets service: for BK web browsing and selection of the data files.
- AMGA server: for remote application use.

*: AMGA is the ARDA implementation of the ARDA/gLite Metadata Catalog Interface (see http://amga.web.cern.ch).

Logical data models

The AMGA schema shows how the information is logically grouped:

- The logical model is built around two main entities: Jobs and Files.
- The relation between them is of type input/output. A job can take one or more files as input and produce more than one file (usually a data file plus a couple of log files).
- Around these two entities there is a full set of satellite entities (Fileparams, Jobparams, etc.) that hold extra information.
- Each entity has one or more attributes. For instance, LFN is an attribute of Files and Program Name is an attribute of Jobparams.
- The entities in the AMGA logical model are directories (see Appendix A).
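The Jobs/Files relation can be pictured with a small Python sketch. It is only an illustration of the input/output model described on this slide: the class layout, the field names and the example LFNs are assumptions, not the actual AMGA schema.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    lfn: str                                      # Logical File Name, an attribute of Files
    file_type: str                                # e.g. a data file or a log file
    params: dict = field(default_factory=dict)    # stands in for Fileparams


@dataclass
class Job:
    config_name: str
    config_version: str
    params: dict = field(default_factory=dict)    # stands in for Jobparams
    inputs: list = field(default_factory=list)    # files read by the job
    outputs: list = field(default_factory=list)   # files produced by the job


# Hypothetical example: one job reads one file and writes a data file plus a log.
raw = File(lfn="/lhcb/example/input.dst", file_type="DST")
job = Job(config_name="ExampleConfig", config_version="v1",
          params={"ProgramName": "ExampleApp"}, inputs=[raw])
job.outputs.append(File(lfn="/lhcb/example/output.dst", file_type="DST"))
job.outputs.append(File(lfn="/lhcb/example/output.log", file_type="LOG"))
```

The provenance of a file is then the job that lists it among its outputs, and the ancestors of that job are reached through its inputs.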

DB schema

The database tables are logically grouped in two (plus one) sets based on their functionality:

- The Warehouse tables: each AMGA directory/entity has an associated table in the database (see Appendix A).
- The Views: the views summarise the information stored in the Warehouse database to best suit the physicists' queries while providing good performance. The most important are the roottree and jobfileinfo views: each row in roottree summarises the attributes associated with the data files stored in a JobFileInfoXXX table (there are as many JobFileInfoXXX tables as entries in the roottree table).
- Auxiliary tables: the process that elaborates the Warehouse data to create or update the Views makes use of auxiliary tables (see Appendix B).

BK service

The bookkeeping service is made up of:

- an application: BkkManager
- two sub-services: BkkReceiver and Tomcat

Plus two satellite services tightly related to the bookkeeping: FileCatalog and BkkMonitor.

All these services are currently deployed on volhcb01.cern.ch.

Booking of data

The booking of data is how the information about jobs and files reaches the bookkeeping and how it is registered in the database.

The information about jobs and files is sent in XML format and stored in files.

Two central services are involved:

- BkkReceiver
- BkkManager

BkkReceiver

- BkkReceiver is responsible for receiving the XML files and storing them in a directory.
- The directory works as a queue where files are processed in FIFO order.
- The BkkReceiver service listens on port 8092 (on the deployment machine volhcb01) for jobs sending the XML-formatted information about the generated files.
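A directory used as a FIFO queue can be sketched as follows. This is a minimal illustration, assuming that arrival order is given by the file modification time and that queued files end in .xml; the slides do not say how BkkReceiver actually orders or names its files, and the helper names are hypothetical.

```python
import os
import tempfile


def enqueue(queue_dir, name, payload, mtime=None):
    """Store payload as a file in queue_dir; mtime can be forced for demos."""
    path = os.path.join(queue_dir, name)
    with open(path, "w") as handle:
        handle.write(payload)
    if mtime is not None:
        os.utime(path, (mtime, mtime))  # explicit timestamp, demo only
    return path


def next_in_queue(queue_dir):
    """Return the path of the oldest XML file in queue_dir, or None if empty."""
    paths = [os.path.join(queue_dir, name)
             for name in os.listdir(queue_dir) if name.endswith(".xml")]
    if not paths:
        return None
    # Oldest modification time first; ties are broken by path.
    return min(paths, key=lambda p: (os.path.getmtime(p), p))


with tempfile.TemporaryDirectory() as demo_dir:
    enqueue(demo_dir, "second.xml", "<Job/>", mtime=200)
    enqueue(demo_dir, "first.xml", "<Job/>", mtime=100)
    oldest = os.path.basename(next_in_queue(demo_dir))  # the file that arrived first
```

A consumer would call next_in_queue, process the file, remove it from the directory, and repeat until None is returned.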

BkkManager

BkkManager is responsible for reading the XML files, checking the correctness of their format and information, and uploading the new data into the database.

Two DTD definition files are used:

- one defines the jobs and files tags (Book.xml);
- the second is used for the information on the replica (Replica.dtd).

NewConfirm servlet

- To accomplish its task, BkkManager deploys the NewConfirm servlet service.
- NewConfirm checks the conformity of the XML files to their DTD format, then checks that the information provided is correct.
- The information is inserted in the database only if all the checks pass. If one of the checks fails, an error message is saved in a file and no information is uploaded.
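The all-or-nothing behaviour can be sketched in Python. This is not the NewConfirm code (which is a Java servlet); the tag and attribute names (Job, OutputFile, Name) are hypothetical, and since the Python standard library does not validate against a DTD, the format check here is reduced to well-formedness plus a few required-element tests.

```python
import xml.etree.ElementTree as ET


def check_booking(xml_text):
    """Return a list of error messages; an empty list means all checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return ["not well-formed XML: %s" % exc]
    errors = []
    if root.tag != "Job":                      # hypothetical root tag
        errors.append("root element must be Job")
    if root.find("OutputFile") is None:        # hypothetical child tag
        errors.append("no OutputFile element")
    for node in root.findall("OutputFile"):
        if not node.get("Name"):               # hypothetical attribute
            errors.append("OutputFile without Name")
    return errors


def book(xml_text, database, error_log):
    """Insert into database only if every check passes; otherwise log errors."""
    errors = check_booking(xml_text)
    if errors:
        error_log.extend(errors)   # stands in for writing the error file
        return False
    database.append(xml_text)      # stands in for the real DB upload
    return True
```

Here database and error_log are plain lists, so a failed booking leaves database untouched and only adds lines to error_log.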

Night updates

- The BkkManager application selects the XML files from the queue and asks NewConfirm to book them.
- Every night it makes a backup of all the XML files that have been successfully booked and runs the update views script.
- Before extracting the XML files from the queue, it checks the errors generated during the processing of the previous XML files to see if there are files that need to be reprocessed.

Tomcat & BkkMonitor

- Tomcat is the servlet container used by the bookkeeping, listening on port 8080 on the deployment machine.
- BkkMonitor is a monitoring service which controls the FileCatalog and BkkReceiver servers.
- The service actively pings these two services at one-minute intervals. In case of problems (service not responding):
  - a warning email is sent to the BK operation manager in charge;
  - the server is restarted.
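One monitoring pass of this kind can be sketched as below. It is an assumption-laden illustration, not the BkkMonitor implementation: the slides do not say how the ping is performed, so a plain TCP connection attempt is used, and the notify and restart actions are passed in as hypothetical callables.

```python
import socket


def is_listening(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services, probe, notify, restart):
    """Probe each service once; notify and restart those that do not respond.

    services maps a service name to (host, port). Returns the failed names.
    """
    failed = []
    for name, (host, port) in services.items():
        if not probe(host, port):
            notify(name)    # e.g. send the warning email
            restart(name)   # e.g. relaunch the server
            failed.append(name)
    return failed
```

A monitor loop would call check_services with probe=is_listening and then sleep for 60 seconds before the next pass.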

User Interface: Bookkeeping Web Page

- The web page allows users to browse the bookkeeping contents and get information about files and their provenance.
- It is also used to generate the Gaudi Card, the list of files to be processed by a job.
- The left frame links to many browsing options: File look-up, Job look-up, Production look-up, BK summary.
- Dataset search: retrieves a list of files based on their provenance history.

The result is:

FileCatalog

- FileCatalog is the service used by the genCatalog script and the bookkeeping to get information about the Physical File Name of a file and its ancestors.
- It is a frontend to the LFC and the bookkeeping database. No security is required on this service since it provides a read-only API.
- It is accessible through the web page by selecting the ‘Dataset Replicated at’ section: the system first looks for LFNs in the bookkeeping database and then tries to get the physical location of each of them from the LFC.
- The search is expensive: it is always done on a limited number of files (bunches of 200).
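Splitting a long LFN list into bunches of 200 before querying can be sketched like this. The lookup callable is hypothetical and stands in for the real LFC query, whose interface is not described in these slides.

```python
def bunches(lfns, size=200):
    """Yield successive lists of at most `size` LFNs."""
    for start in range(0, len(lfns), size):
        yield lfns[start:start + size]


def resolve_replicas(lfns, lookup, size=200):
    """Resolve physical locations bunch by bunch.

    `lookup` is a hypothetical callable taking a list of LFNs and returning
    a dict {lfn: physical_location}; it stands in for the real LFC query.
    """
    replicas = {}
    for bunch in bunches(lfns, size):
        replicas.update(lookup(bunch))
    return replicas
```

With 450 LFNs, for instance, the lookup is called three times, on 200, 200 and 50 files.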

Conclusion

Several issues raised:

- By users: web interface: lack of functionality in the data sets search
- Java code to be replaced with Python
- Necessity of having a defined structure embedded in DIRAC
- Forthcoming changes in the DB schema with data taking: necessity for a new versatile tool able to match different requests

Appendix A

Description of each entity/directory:

- Jobs: each job has a Configuration Name and Version plus the date of its execution. These three attributes are always present. Extra information about it is kept in the Jobparams and Inputfiles entities.
- Jobparams: provides extra information about the job, like the program name and version, the location where it was executed, etc. Some attributes may not be compulsory.
- Inputfiles: contains the list of input files used by each job. No entries are present for jobs that didn’t take any input file.
- Files: each file always has an associated Logical Filename, the job that generated it and the type of file.
- Fileparams: similar to Jobparams, provides extra information about files, like the file size, file GUID, etc. Some attributes may not be compulsory.
- Qualityparams: provides information on the quality of the files. It says for which group of physicists a file may be of interest.
- Eventtypes: keeps information about the event types, like their description.
- Typeparams: extra information about the file type: name, description and version. The description may not be present.

Appendix B

Auxiliary tables:

- FileSummary: contains an entry for each file with all the related information.
- JobSummary: contains an entry for each job with the related information.
- JobHistory: contains, for each job, the information on its immediate ancestor, if any.
- JobHistory2Level: contains, for each job, the information on its second-degree ancestor, if any.
- Summary: contains an entry for each possible n-tuple (eventtype, config, filetype, dbversion, program0, inputfile1, program1, inputfile2, program2).
- Jobs_FileSummary: simply the join of filesummary and jobsummary on the column job_id.
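The Jobs_FileSummary join can be illustrated with an in-memory SQLite database. This is only a sketch: SQLite is a stand-in for the real bookkeeping database, and every column except job_id (program, lfn, filetype) is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobsummary  (job_id INTEGER PRIMARY KEY, program TEXT);
    CREATE TABLE filesummary (lfn TEXT, job_id INTEGER, filetype TEXT);
    INSERT INTO jobsummary  VALUES (1, 'ExampleApp');
    INSERT INTO filesummary VALUES ('/lhcb/example/a.dst', 1, 'DST');
    INSERT INTO filesummary VALUES ('/lhcb/example/a.log', 1, 'LOG');
    -- Jobs_FileSummary built as a plain join on job_id
    CREATE TABLE jobs_filesummary AS
        SELECT f.lfn, f.filetype, j.job_id, j.program
        FROM filesummary AS f JOIN jobsummary AS j ON f.job_id = j.job_id;
""")
rows = conn.execute(
    "SELECT lfn, program FROM jobs_filesummary ORDER BY lfn").fetchall()
```

Each row of the joined table carries both the file attributes and the attributes of the job that produced the file, so a file-based query does not need a second lookup.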

AMGA server

- There is no direct access to the DB: access goes through the AMGA server, which then takes care of contacting the DB to serve the information to the client.
- The AMGA server comes with client APIs for C++, Python, Java and Perl.