Data Aggregation System

CMS Data Aggregation System. Valentin Kuznetsov, Cornell University. ICCS Workshop, Amsterdam, May 31 - June 2, 2010. How can I find my data? DBS, SiteDB, Phedex, GenDB, LumiDB, RunDB, PSetDB, Data Quality, Overview.

description

The talk presents a new Data Aggregation System for the CMS experiment at CERN. We use the MongoDB database as a caching layer to query multiple data providers (backed by RDBMS) and aggregate data across them. The talk was presented at the ICCS 2010 conference.

Transcript of Data Aggregation System

Page 1: Data Aggregation System

CMS Data Aggregation System
Valentin Kuznetsov, Cornell University

ICCS Workshop, Amsterdam, May 31 - June 2, 2010

How can I find my data?

DBS, SiteDB, Phedex, GenDB, LumiDB, RunDB, PSetDB, DataQuality, Overview

Page 2: Data Aggregation System

Talk outline

✤ Introduction

✤ Motivations

✤ What is DAS?

✤ Design, architecture, implementations

✤ Current status & benchmarks

✤ Future plans


Page 3: Data Aggregation System

Introduction

✤ CMS is a general purpose physics detector built for the LHC

✤ beam collisions every 25 ns, online trigger rate 300 Hz, event size 1-2 MB

✤ More than 3000 physicists, 183 institutions, 38 countries

✤ CMS uses distributed computing and data model

✤ 1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers

✤ 2-6 PB/year of real data + 1x Simulated data, ~500GB/year of meta-data

✤ Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...

Page 4: Data Aggregation System

Motivations ...

✤ A user wants to query different meta-data services without knowing of their existence

✤ A user wants to combine information from different meta-data services

✤ A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats, to get their data


Diagram: data-services and the keys they expose, all accessed through the Data Aggregation System

DBS: run, file, block, site, config, tier, dataset, lumi, parameters, ...

LumiDB: lumi, luminosity, hltpath

SiteDB: site, admin, site.status, ...

Phedex: block, file, block.replica, file.replica, se, node, ...

GenDB: generator, xsection, process, decay, ...

RunSummary: run, trigger, detector, ...

DataQuality: trigger, ecal, hcal, ...

Overview: country, node, region, ...

Parameter Set DB: CMSSW parameters

Service A, B, C, D, E: param1, param2, ...

Keys shown on the links between services: block, site, lumi, run, MC id, pset

Page 5: Data Aggregation System

What is DAS?

✤ DAS stands for Data Aggregation System

✤ It is a layer on top of existing data-services

✤ It aggregates data across distributed data-services while preserving their integrity, security policy and data-formats

✤ It provides caching for data-services (as a side effect)

✤ It represents data in a defined format: JSON documents

✤ It allows querying data via free text-based queries

✤ It is agnostic to data content

Page 6: Data Aggregation System

Challenges ...

✤ Combining N data-services is a great idea, but

✤ there is no ad-hoc IT solution

✤ DAS doesn’t hold the data, so it can’t have a pre-defined schema

✤ must support existing APIs, data formats, interfaces, security policies

✤ must relate and aggregate meta-data

✤ must be efficient, flexible, scalable and easy to use

✤ Work on a DAS prototype to understand those challenges

Page 7: Data Aggregation System

DAS prototype

✤ Code written in Python, ideal for prototyping

✤ Use existing meta-data from CMS data-services as test-bed

✤ 8 data-services, 75/250GB in tables/indexes

✤ Use a document-oriented "schema-less" database: MongoDB

✤ raw cache, merge result cache, mapping and analytics DBs

✤ Support free keyword-based queries, e.g. site=T1_CERN, run=100

✤ Aggregate information using key-value matching
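
The keyword-based query style and key-value matching listed above can be illustrated with a few lines of Python. This is only a minimal sketch, not the DAS parser: the names `parse_query` and `matches` are hypothetical, the query is assumed to be a comma- or whitespace-separated list of `key=value` pairs, and the sample records are made up.

```python
import re

def parse_query(text):
    """Turn 'site=T1_CERN, run=100' into {'site': 'T1_CERN', 'run': 100}."""
    conditions = {}
    for token in re.split(r"[,\s]+", text.strip()):
        if not token:
            continue
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {token!r}")
        # Integer-looking values are stored as int so run=100 matches a numeric field.
        conditions[key] = int(value) if value.isdigit() else value
    return conditions

def matches(record, conditions):
    """Key-value matching: every condition must equal the record's field."""
    return all(record.get(k) == v for k, v in conditions.items())

# Made-up sample records, for illustration only.
records = [
    {"site": "T1_CERN", "run": 100, "block": "/a/b#1"},
    {"site": "T2_US", "run": 100, "block": "/a/b#2"},
]
query = parse_query("site=T1_CERN, run=100")
selected = [r for r in records if matches(r, query)]
```

Because the conditions are plain key-value pairs, the same dictionary could in principle be handed to a document store as a query specification, which is one reason a schema-less database suits this style of query.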

Page 8: Data Aggregation System

DAS architecture

Architecture diagram, components and labels:

DAS webserver: UI, RESTful interface

DAS Cache server

DAS core (more than one instance shown, next to a "CPU core" label): parser, plugins, aggregator

data-services: dbs, sitedb, phedex, lumidb, runsum

DAS cache and DAS merge

DAS mapping: map data-service output to DAS records

DAS Analytics: record query and API call to Analytics

DAS robot: fetch popular queries/APIs, invoke the same API(params), update cache periodically
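
The analytics and robot roles in the diagram (record each API call, fetch the popular ones, invoke them again to refresh the cache) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Analytics` class, the `refresh_cache` function, the API names and the in-memory `Counter` stand in for whatever the real analytics database and robot do.

```python
from collections import Counter

class Analytics:
    """Counts how often each (api, params) call was made."""
    def __init__(self):
        self.calls = Counter()

    def record(self, api, params):
        # Params are frozen into a sorted tuple so they can serve as a dict key.
        self.calls[(api, tuple(sorted(params.items())))] += 1

    def popular(self, n):
        """Return the n most frequently recorded calls."""
        return [call for call, _ in self.calls.most_common(n)]

def refresh_cache(analytics, cache, fetch, top=2):
    """Robot step: re-invoke the most popular API calls and update the cache."""
    for api, params in analytics.popular(top):
        cache[(api, params)] = fetch(api, dict(params))

def fake_fetch(api, params):
    """Stand-in for a real data-service call (hypothetical)."""
    return {"api": api, "params": params, "rows": []}

analytics = Analytics()
analytics.record("dbs.blocks", {"site": "T1_CERN"})
analytics.record("dbs.blocks", {"site": "T1_CERN"})
analytics.record("phedex.replicas", {"block": "b1"})

cache = {}
refresh_cache(analytics, cache, fake_fetch, top=1)
```

Running `refresh_cache` on a schedule would keep the most requested results warm without the robot having to know anything about the data-services themselves.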

Page 9: Data Aggregation System

DAS workflow

✤ Query parser

✤ Query DAS merge collection

✤ Query DAS cache collection

✤ Invoke call to data service

✤ Write to analytics

✤ Aggregate results (generator)

Workflow diagram, labels: query, parser, query DAS merge (yes/no), query DAS cache (yes/no), query data-services, Aggregator, results, Web UI

Components shown: DAS core, DAS merge, DAS cache, DAS Mapping, DAS Analytics, DAS logging
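
The lookup order in the workflow (merge cache first, then raw cache, then the data-services, with results produced by a generator) can be sketched as follows. This is an illustration under stated assumptions, not the DAS implementation: `Cache` is an in-memory stand-in for the MongoDB collections, and `das_query`, `merge_records`, the two fake services and the sample fields are all hypothetical.

```python
import time

class Cache:
    """In-memory stand-in for a cache collection with per-entry expiry."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] < time.time():
            return None  # missing or expired
        return entry[0]

    def put(self, key, records, expire):
        self._data[key] = (records, time.time() + expire)

def merge_records(records, field):
    """Merge records that share the same value of `field` into one document."""
    merged = {}
    for rec in records:
        merged.setdefault(rec[field], {}).update(rec)
    return list(merged.values())

def das_query(query, field, merge_cache, raw_cache, services, analytics, expire=30):
    """Yield merged records, consulting the caches before the data-services."""
    merged = merge_cache.get(query)
    if merged is None:
        raw = raw_cache.get(query)
        if raw is None:
            raw = []
            for name, call in services.items():
                analytics.append((name, query))  # write to analytics
                raw.extend(call(query))          # invoke the data-service
            raw_cache.put(query, raw, expire)
        merged = merge_records(raw, field)
        merge_cache.put(query, merged, expire)
    yield from merged  # results are produced lazily, as a generator

service_calls = []

def dbs(query):
    service_calls.append("dbs")
    return [{"block": "b1", "nfiles": 3}]

def phedex(query):
    service_calls.append("phedex")
    return [{"block": "b1", "node": "T1_CERN"}]

merge_cache, raw_cache, analytics = Cache(), Cache(), []
services = {"dbs": dbs, "phedex": phedex}
first = list(das_query("block=b1", "block", merge_cache, raw_cache, services, analytics))
second = list(das_query("block=b1", "block", merge_cache, raw_cache, services, analytics))
```

The second call is served entirely from the merge cache, so the fake services are invoked only once; this is the caching side effect mentioned earlier in the talk.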

Page 10: Data Aggregation System

DAS and data-services

✤ DAS is data-service agnostic

✤ a data-service is identified by its URI and input parameters

✤ Use plug-and-play mechanism:

✤ add new data-service using ASCII map file (URI, parameters, ...)

✤ use generic HTTP access and standard data-parsers (XML, JSON)

✤ Use dedicated plugin:

✤ specific access requirements, custom parsers, etc.
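
A rough sketch of the generic, plug-and-play path: a data-service is described only by its URI and parameters, the request URL is assembled from that description, and the response is handed to a standard JSON or XML parser. The function names `build_url` and `parse_payload`, the example endpoint and the sample payload are assumptions for illustration; the `"required"` marker follows the convention visible in the map file on the next page. No network call is made here.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def build_url(service, user_params):
    """Combine a service's base URL with its default and user-supplied parameters."""
    params = {}
    for name, default in service["params"].items():
        if name in user_params:
            params[name] = user_params[name]
        elif default == "required":
            raise ValueError(f"missing required parameter {name!r}")
        else:
            params[name] = default
    return service["url"] + "?" + urlencode(params)

def parse_payload(fmt, payload):
    """Standard parsers: return a list of dict rows from JSON or XML text."""
    if fmt == "JSON":
        data = json.loads(payload)
        return data if isinstance(data, list) else [data]
    if fmt == "XML":
        # Each child element of the root becomes one row built from its attributes.
        return [dict(child.attrib) for child in ET.fromstring(payload)]
    raise ValueError(f"no generic parser for format {fmt!r}")

service = {
    "url": "http://example.org/api/blocks",  # hypothetical endpoint
    "format": "XML",
    "params": {"site": "required", "output": "xml"},
}
url = build_url(service, {"site": "T1_CERN"})
rows = parse_payload(service["format"], '<blocks><block name="b1" size="10"/></blocks>')
```

A service whose access rules or response layout do not fit this generic path would be the case for a dedicated plugin.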

Page 11: Data Aggregation System

DAS map files

system : google_maps
format : JSON
---
urn : google_geo_maps
url : "http://maps.google.com/maps/geo"
expire : 30
params : { "q" : "required", "output": "json" }
daskeys : [
    {"key":"city","map":"city.name","pattern":""},
]

Diagram labels: Data Aggregation System, DAS mapping, Data Service: URL/api?params
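
One plausible reading of the `daskeys` entry in the map file is that a DAS query key (`city`) corresponds to a dotted attribute path (`city.name`) inside a DAS record. The sketch below shows that reading only; the helper names `set_path`, `get_path` and `to_das_record` are hypothetical, and the slide does not specify how DAS itself applies the mapping.

```python
def set_path(record, dotted, value):
    """Store value under a dotted path, e.g. 'city.name' -> record['city']['name']."""
    keys = dotted.split(".")
    node = record
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return record

def get_path(record, dotted):
    """Read a dotted path from a nested record; return None if any part is missing."""
    node = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# Same entry as in the map file above.
daskeys = [{"key": "city", "map": "city.name", "pattern": ""}]

def to_das_record(query, daskeys):
    """Build a nested DAS-style record from query conditions using the daskeys mapping."""
    record = {}
    for entry in daskeys:
        if entry["key"] in query:
            set_path(record, entry["map"], query[entry["key"]])
    return record

record = to_das_record({"city": "Amsterdam"}, daskeys)
name = get_path(record, "city.name")
```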

Page 12: Data Aggregation System

DAS benchmark

✤ Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data services

✤ parse, remap notations, store to cache, merge matched records (aggregation)

✤ Linux 64-bit, 1CPU for DAS, 1CPU for MongoDB, record size ~1KB

✤ Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing time + output creation time

Step           Format   Records   Time, no cache   Time w/ cache
DBS yield      XML      387K      68s              0.98s
PhEDEx yield   XML      190K      107s             0.98s
Merge step     JSON     577K      63s              0.9s
DAS total      JSON     393K      238s             2.05s

393K DAS records: create ~6K docs/s, read ~7.6K docs/s
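
As a quick check of the table, the three uncached stage times add up to the reported DAS total, and the ratio of the two totals gives the overall speedup from caching. The snippet only redoes the arithmetic on the numbers above; the variable names are arbitrary, and the cached column is not summed because the slide does not say how the cached total is composed.

```python
# Times in seconds, copied from the benchmark table.
no_cache = {"DBS yield": 68, "PhEDEx yield": 107, "Merge step": 63}
das_total_no_cache = 238
das_total_cached = 2.05

# The uncached stages should account for the uncached total.
stage_sum = sum(no_cache.values())

# Overall speedup from serving the same query out of the cache.
speedup = das_total_no_cache / das_total_cached

print(f"sum of stages: {stage_sum}s, speedup: {speedup:.0f}x")
```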

Page 13: Data Aggregation System

Future plans

✤ DAS goes into production this year in CMS:

✤ confirm scalability, transparency and durability w/ various data-services

✤ work on analytics to organize pre-fetch strategies

✤ Apply to other domain disciplines

✤ Release as open source

Page 14: Data Aggregation System

Summary

✤ Data Aggregation System is data agnostic and allows querying and aggregating meta-data in a customizable way

✤ The current architecture easily integrates with existing data-services, preserving their access, security policy and development cycle

✤ DAS is designed to work with existing CMS data-services, but can easily go beyond that boundary

✤ Plug-and-play mechanism makes it easy to add new data-services and configure DAS for a specific domain