Status of the BaBar Databases


Page 1: Status of the BaBar Databases

1/29

Status of the BaBar Databases

Jacek Becla

BaBar Database Group

Page 2: Status of the BaBar Databases

2/29

BaBar Is in Production

Run 1: May 1999 – Oct 2000
– ~24.2 fb^-1 (~1.3 per month)

Run 2: Feb 2001 – July 2002
– up to 12.6 fb^-1 now (~2.5 per month)

Expected ~100 fb^-1 by July 2002
– already well over designed luminosity

Page 3: Status of the BaBar Databases

3/29

Prognosis

                                          FY00  FY01  FY02  FY03  FY04  FY05
Peak luminosity [10^33 cm^-2 sec^-1]       2.5     5     8    10    13    24
Yearly integrated luminosity [fb^-1]        25    40    80   115   135   225
Total integrated luminosity [fb^-1]         25    65   145   260   395   620
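
The total integrated luminosity row is the running sum of the yearly row. A minimal sketch to check the figures above (the variable names are illustrative, not part of any BaBar tool):

```python
from itertools import accumulate

# Figures copied from the prognosis table above.
fiscal_years = ["FY00", "FY01", "FY02", "FY03", "FY04", "FY05"]
yearly_fb = [25, 40, 80, 115, 135, 225]   # yearly integrated luminosity [fb^-1]

# Running sum gives the total integrated luminosity per fiscal year.
total_fb = list(accumulate(yearly_fb))

for fy, yearly, total in zip(fiscal_years, yearly_fb, total_fb):
    print(f"{fy}: yearly {yearly:4d} fb^-1, total {total:4d} fb^-1")
```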

Page 4: Status of the BaBar Databases

4/29

Changes

4 -> 21 streams
– >5 times more files, locks
– no data duplication (streams not self-contained)

Smaller files
– 2 -> 0.5, 10 -> 2 [GB]

Using Objy 6.1, read only dbs

Clustering hint server and cond OID server

Migrating production to Linux (now)

Introducing multi-fds (now)

Cannot afford a large test-bed anymore

Page 5: Status of the BaBar Databases

5/29

OPR

In general keeps up with data
– ~150 pb^-1 per day
– faster than at the end of Run 1, in spite of 5x load
– will have to deal with 300 pb^-1 soon

Page 6: Status of the BaBar Databases

6/29

Current OPR Configuration

Hardware
– 6 4-CPU data servers, lock server, jnl server, catalog server, clustering hint server + conditions OID server
– 220 clients

Software
– Objy 6.1, Solaris 7
– about to migrate to Linux

Page 7: Status of the BaBar Databases

7/29

OPR – Short Term Future

Use multi-fds
– 2 event store fds, 1 conditions
– 6 + 6 data servers
– new federation approx. every week

Migrate clients to Linux
– 2.2x faster CPU, more memory

Use faster machine for lock servers
– now: Sun Netra T1, 440 MHz
– planned: Sun Blade 1000, 750 MHz UltraSPARC-3

Discussions about storing all digis in Objy, and reprocessing from Objy, not xtc

Page 8: Status of the BaBar Databases

8/29

REPRO

Hardware configuration similar to OPR

Occasionally up to 3 repro farms
– over 300 pb^-1 on a good day (150+150+200 nodes)
– condition merging nightmare

Page 9: Status of the BaBar Databases

9/29

REPRO – Near Future

Use multi-fds
– 2 event store fds, 1 conditions
– 5 + 5 data servers
– new federation ~ every other week
– same slow lock servers

Move to Linux

Run in Italy. Timescale ~mid 2002

Page 10: Status of the BaBar Databases

10/29

Robustness

Db creation (weak point) removed
– precreation in background by CHS, automatic recovery, new C++ API in 6.0

AMS crash
– ¾ of the farm continues, unless it is a "default" AMS (used by CHS)

CHS – new central point of failure
– entirely in our hands, very stable so far

One event store fd down (e.g. lock server crash)
– the second should finish processing current run

Cleanup server – worked on

Page 11: Status of the BaBar Databases

11/29

Page 12: Status of the BaBar Databases

12/29

Page 13: Status of the BaBar Databases

13/29

Analysis

200 CPUs (~Sun Netra T1 like)

17 servers, 24 TB disk cache

On demand staging turned off

Read only dbs
– starting to see effect now

Disk space – always a problem
– micro – 5.4 KB/event (aod, col, tag, evt, evshdr)
– mini – 4.7 KB/event (esd)

Page 14: Status of the BaBar Databases

14/29

Analysis – cont…

Veritas File System reconfiguration
– direct I/O instead of buffered I/O: more than doubles effective data rate

Lock server memory leak
– grows up to 600 MB in a week
– switching every week

Kanga (ROOT based) will become deprecated
– recent computing model: enhance Objy, deprecate Kanga (freeze by mid 2002, produce files till late 2002)

Page 15: Status of the BaBar Databases

15/29

AMS

Known (but not fixed) problem
– file used immediately after being closed
– crashes AMS (in 6.1 kills the client)

Ported to Linux
– no performance figures yet

New feature – compression

Redesigning front end part
– got ok from Objy

Page 16: Status of the BaBar Databases

16/29

A Word on Conditions

Using OID server to find time interval
– only in REPRO so far, about to put in OPR

Staircase problem
– incorrect design
– purging every 2 weeks, ~15 min per rolling calibration (35 in total), run in parallel

Finalize problem
– based on genealogy object (all objects named), result of iteration in unpredicted order. Just slow

Condition merging problem

Page 17: Status of the BaBar Databases

17/29

Conditions…cont

Index problem
– occasionally index inconsistent (does not return all objects in given range). Solution – rebuild. Happens ~once every 2 months. Not reported yet.

Index scaling
– range query (the way we use it) does not scale: response time linear (100 K entries -> 0.5 sec)

Will extend OID server – now read only access

Will redesign & re-implement conditions
– and address all the problems, timescale: end of 01
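
To illustrate why a linear range query hurts, the quoted figure (100 K entries -> 0.5 sec) works out to roughly 5 microseconds per entry scanned. The sketch below contrasts a linear scan with a binary search over validity intervals sorted by start time. It is only an illustration of the scaling argument: it is not the Objectivity index and not the OID server, and the function names and the assumption of contiguous, sorted intervals are hypothetical.

```python
from bisect import bisect_right

# Per-entry cost implied by the quoted figure: 0.5 s for 100 K entries.
seconds_per_entry = 0.5 / 100_000

def find_interval_linear(starts, payloads, t):
    """O(n): walk all interval start times; keep the last one that is <= t."""
    found = None
    for start, payload in zip(starts, payloads):
        if start <= t:
            found = payload
        else:
            break
    return found

def find_interval_bisect(starts, payloads, t):
    """O(log n): binary search for the last interval start that is <= t.

    Assumes `starts` is sorted and each interval is valid until the next start.
    """
    i = bisect_right(starts, t) - 1
    return payloads[i] if i >= 0 else None

# Hypothetical conditions: each calibration is valid from its start time onward.
starts = [0, 100, 250, 400]
payloads = ["calib-A", "calib-B", "calib-C", "calib-D"]
print(find_interval_bisect(starts, payloads, 260))
```

Both functions return the same answer; only the number of entries touched per lookup differs.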

Page 18: Status of the BaBar Databases

18/29

Data Distribution

Micro-level data mirrored @ in2p3

Run2 – mirror raw as well

Current tools do not scale with increased data volume
– a lot of manual work

Will try using data grid based tools soon

Page 19: Status of the BaBar Databases

19/29

Operations

2 DBAs + 3rd coming soon

Many manual tasks slowly being automated

Page 20: Status of the BaBar Databases

20/29

Some Numbers

Total size of data – 300+ TB

# files – 128K

# users in analysis ~220

10 active production federations– this includes 5 analysis fds

Cond dbs – 12 GB

Page 21: Status of the BaBar Databases

21/29

Tuning

Performance

Scalability

Page 22: Status of the BaBar Databases

22/29

4 -> 20 Streams Was Non-trivial

4 streams:
– 100 nodes: ~60 Hz
– 200 nodes: ~115 Hz

160 nodes run, 20 streams, with duplication: 8 Hz
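
The 4-stream figures (~60 Hz at 100 nodes, ~115 Hz at 200 nodes) can be turned into a scaling efficiency. A minimal sketch, with hypothetical helper names, where 1.0 would mean perfectly linear scaling:

```python
def per_node_rate(rate_hz, nodes):
    """Average event rate contributed by one node."""
    return rate_hz / nodes

def scaling_efficiency(rate_small, nodes_small, rate_large, nodes_large):
    """Observed speed-up divided by the ideal (linear) speed-up."""
    return (rate_large / rate_small) / (nodes_large / nodes_small)

# Approximate 4-stream figures quoted on the slide.
efficiency = scaling_efficiency(60, 100, 115, 200)
print(f"per node at 100 nodes: {per_node_rate(60, 100):.3f} Hz")
print(f"per node at 200 nodes: {per_node_rate(115, 200):.3f} Hz")
print(f"scaling efficiency:    {efficiency:.1%}")
```

Since the input rates are themselves approximate, the result is only indicative.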

Page 23: Status of the BaBar Databases

23/29

Clustering Hint Server

CORBA based, multithreaded

Precreates in background dbs and conts, distributes oid to clients

Many other features:
– containers reused
– full integration with HPSS (precreated files pinned in cache, full dbs immediately migrated)
– file disparsification: file transfer to tape: 1MB -> 15-25MB now
– db creation locally, pre-sizing: no container extensions on the client side
– round robin load balancing
– automatic recovery, and so on
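
The combination of precreation and round robin load balancing can be pictured with a toy model. This is a sketch under stated assumptions, not the CHS implementation: the class, the server names and the database naming are all hypothetical, and the refill happens synchronously here rather than in a background thread.

```python
from collections import deque
from itertools import cycle

class HintServerSketch:
    """Toy model: keep a pool of precreated dbs, hand them out to clients."""

    def __init__(self, servers, pool_size=2):
        self._servers = cycle(servers)   # round robin over data servers
        self._pool = deque()             # precreated (server, db name) pairs
        self._pool_size = pool_size
        self._next_id = 0
        self.refill()

    def _precreate(self):
        # Stand-in for creating and pre-sizing a database on the next server.
        server = next(self._servers)
        self._next_id += 1
        return (server, f"db{self._next_id:04d}")

    def refill(self):
        # The real service would do this in the background.
        while len(self._pool) < self._pool_size:
            self._pool.append(self._precreate())

    def next_hint(self):
        """Give a client an already-created db, so it never creates one itself."""
        if not self._pool:
            self.refill()
        hint = self._pool.popleft()
        self.refill()
        return hint

chs = HintServerSketch(["srv1", "srv2", "srv3"], pool_size=2)
hints = [chs.next_hint() for _ in range(4)]
print(hints)
```

Because creation happens ahead of demand, a client request only pops an entry from the pool, which is the point of moving db creation out of the clients.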

Page 24: Status of the BaBar Databases

24/29

Others

commitAndHold
– significant reduction in lock traffic

Initial transaction for condition
– one instead of 50 transactions

Cache authorization
– rather than check on every event

Tune # client file descriptor limit
– Hit 8K limit on AMS site. Reduced client fd limit: 196 -> 32. AMS response improved, AMS CPU usage decreased

Increase transaction granularity
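
A rough worst-case budget shows why lowering the client file descriptor limit matters. The sketch assumes "8K" means 8192 descriptors on one AMS and that every client uses its full allowance against that single server; both are assumptions, not figures from the talk.

```python
AMS_FD_LIMIT = 8 * 1024   # "8K limit", assumed to mean 8192 descriptors

def max_clients(server_fd_limit, fds_per_client):
    """Clients one server can hold if each uses its full fd allowance."""
    return server_fd_limit // fds_per_client

clients_before = max_clients(AMS_FD_LIMIT, 196)   # old client limit
clients_after = max_clients(AMS_FD_LIMIT, 32)     # reduced client limit

print(f"limit 196 per client: up to {clients_before} clients")
print(f"limit  32 per client: up to {clients_after} clients")
```

Under these assumptions the lower limit lets several times as many clients share one server before the descriptor ceiling is reached.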

Page 25: Status of the BaBar Databases

25/29

Bottlenecks

Lock server
– 1st signs of saturation: with ~200 nodes
– use faster CPU
– use Objy 7 (33% lock traffic reduction), scheduled for October 2001
– more event store fds per farm

CPU on data servers
– buy more – expensive
– improve AMS, reduce event size

Page 26: Status of the BaBar Databases

26/29

Use Faster CPU…

Page 27: Status of the BaBar Databases

27/29

Miscellaneous

64 K pages?
– unfortunately not working with multi-fds

Maybe precreate/purge dbs only in between runs?

David is stepping down as head of the BaBar DB group

Page 28: Status of the BaBar Databases

28/29

Future Looks Bright

Lock server bottleneck
– multi-fds – can always add one more event store fd
– Objy 7 will feature faster lock server
– CPUs are getting faster

Data server CPU saturation
– AMS redesign should help
– size of event (rec) being reduced now by ~10%, looking for more
– can always buy more servers

Page 29: Status of the BaBar Databases

29/29

Summary

No serious problems
– conditions need to be redesigned

Likely OPR will keep up

Working in the BaBar DB group is fun!