Computing and LHCb
Raja Nandakumar
The LHCb experiment
  Universe is made of matter
    Still not clear why
    Andrei Sakharov’s theory of CP violation
  Study CP violation
    Indirect evidence of new physics
  There are many other questions (of course)
  The LHCb experiment has been built
    Hope to answer some of these questions
The LHCb detector
  February 2002
  Cavern ready for detector installation
  August 2008
How the data looks
  The detector records …
    >1 million channels of data every bunch crossing
    25 ns between bunch crossings
  Trigger reduces this to about 2000 events/sec
    ~7 million events/hour
    25 KB raw event size
    4.3 TB/day
      Not as much as ATLAS / CMS, but still …
      Assuming continuous operation
      Breaks for fills, etc.
  These events will need to be farmed out of CERN
    Reconstructed and stripped at Tier-1s
    Then replicated to all LHCb Tier-1 sites
    Finally available for user analysis
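The rates on this slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes), a trigger output of exactly 2000 events/sec, 25 KB per raw event, and continuous running with no breaks for fills. The variable names are illustrative only.

```python
# Assumed inputs, taken from the slide and rounded as stated there.
TRIGGER_RATE_HZ = 2000          # events per second after the trigger
RAW_EVENT_SIZE_BYTES = 25_000   # 25 KB per raw event (decimal KB assumed)

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86_400

# Event count per hour of continuous running.
events_per_hour = TRIGGER_RATE_HZ * SECONDS_PER_HOUR

# Raw data rate out of the trigger, in bytes per second.
bytes_per_second = TRIGGER_RATE_HZ * RAW_EVENT_SIZE_BYTES

# Daily raw data volume in terabytes (1 TB = 1e12 bytes assumed).
tb_per_day = bytes_per_second * SECONDS_PER_DAY / 1e12

print(f"{events_per_hour / 1e6:.1f} million events/hour")
print(f"{bytes_per_second / 1e6:.0f} MB/s")
print(f"{tb_per_day:.2f} TB/day")
```

Under these assumptions the numbers agree with the slide: roughly 7 million events per hour and a little over 4.3 TB per day. Any downtime between fills lowers the daily figure.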
The LHCb computing model
  [Diagram: data flow, starting at CERN]
  Production (T2/T1/T0): simulation + digitization -> .digi
  Reconstruction (T1/T0): .digi -> .rdst
  Stripping (T1/T0): .rdst -> .dst
  .dst distributed to T1/T0 via FTS
  User analysis (T1/T0)
LHCb job submission
  Computing distributed all over the world
    Particle physics is collaborative across institutes in various nations
    Both CPU and storage available at various sites
  Welcome to the world of grid computing
    Take advantage of distributed resources
    Set up a framework for other disciplines also
    Fault tolerant job execution
    Also used by medicine, chemistry, space science, …
  LHCb interface: DIRAC
What the user sees …
  Submit job to the “grid”
    Ganga (ATLAS/LHCb)
    Sometimes needs a lot of persuasion
  Usually the job comes back successful
  On occasion problems seen
    Frequently wrong parameters, code, …
    Correct and resubmit
What the user does not see …
Requirements of DIRAC
  Fault tolerance
    Retries
    Duplication
    Failover
    Caching
    Watchdogs
    Logs
    Thread safety
  Guard against possible grid problems …
    Network, timeouts
    Drive failures
    Systems hacked
    Bugs in code
    Overloaded machine, service
    Fire, cooling problems
    If it cannot go wrong, it still will
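Two of the fault-tolerance items, retries and failover, can be illustrated with a short sketch. This is not DIRAC code: the function `run_with_failover`, its parameters, and the endpoint names are hypothetical, chosen only to show the pattern of retrying an operation a few times and then moving on to the next replica of a service, logging each failure on the way.

```python
import logging
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")
log = logging.getLogger("failover-sketch")


def run_with_failover(
    task: Callable[[str], T],
    endpoints: Sequence[str],
    retries_per_endpoint: int = 3,
    delay_s: float = 0.0,
) -> T:
    """Try `task` on each endpoint in order, retrying before failing over.

    Hypothetical helper: `task` takes an endpoint name and either returns
    a result or raises an exception.
    """
    last_error: Optional[Exception] = None
    for endpoint in endpoints:                              # failover loop
        for attempt in range(1, retries_per_endpoint + 1):  # retry loop
            try:
                return task(endpoint)
            except Exception as exc:
                last_error = exc
                # Keep a log entry for every failure so it can be traced later.
                log.warning("attempt %d on %s failed: %s", attempt, endpoint, exc)
                time.sleep(delay_s)
    raise RuntimeError("all endpoints failed") from last_error
```

A caller would pass something like `run_with_failover(upload, ["primary", "backup"])`, where `upload` and the two endpoint names are again placeholders. In a real system the retry count and delay would be tuned per service; here they are arbitrary defaults.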
Submitting jobs on the grid
  Two ways of submitting jobs
  Push jobs out to a site’s batch system
    The grid is a simple multiple batch system
    Job waits at the site until it runs
    Lose control of jobs when they leave us (LHCb)
    Many things can change in the time between job submission and running
    We only see the batch systems / queues
    We do not see the status of the grid in real time
    Cause of low success rate – previous experience
      Load on site
      Site temporary downtime
      Change in job priority within the experiment
  Pull jobs into the site
    Pilot jobs
Pilot jobs
  “Wrapper” jobs
    Submitted to a site if the site is available, free & there are waiting jobs
  Pilot job returns information at current time
    Job may have resource requirements too …
    Look at local environment and request job from DIRAC
    DIRAC returns job with highest priority matching available resource
  Internal job prioritisation within DIRAC
    Has latest information on experiment priorities
  Exit after a short delay if no matching job found
  Have fine grained (level of worker node) view of the grid
    Very high job success rate
    Pioneered by LHCb
  Very simple requirements for sites
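The pull model can be sketched in a few lines. This is an illustration only, not the DIRAC implementation: the classes `Job` and `PilotReport`, the functions `match_job` and `run_pilot`, and the resource fields (remaining CPU hours, a platform label) are all hypothetical. The sketch shows the central queue handing the pilot the highest-priority waiting job that fits what the worker node reports, and the pilot exiting after a short delay when nothing matches.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Job:
    name: str
    priority: int         # higher value is served first (assumed convention)
    min_cpu_hours: float  # resource requirement of the job
    platform: str         # platform label the job needs


@dataclass
class PilotReport:
    site: str
    cpu_hours_left: float  # what the worker node can offer right now
    platform: str


def match_job(queue: List[Job], pilot: PilotReport) -> Optional[Job]:
    """Return and remove the highest-priority job the pilot can run."""
    candidates = [
        job for job in queue
        if job.platform == pilot.platform
        and job.min_cpu_hours <= pilot.cpu_hours_left
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda job: job.priority)
    queue.remove(best)
    return best


def run_pilot(
    queue: List[Job],
    pilot: PilotReport,
    idle_delay_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Ask for a job; if none matches, wait briefly and exit empty-handed."""
    job = match_job(queue, pilot)
    if job is None:
        sleep(idle_delay_s)  # short delay before the pilot exits
        return None
    return f"{job.name} running at {pilot.site}"
```

Because matching happens only when the pilot is already running on the worker node, the decision uses the state of the node and the priorities at that moment rather than at submission time. That is the difference from the push model.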
  Does all on previous slide
  Refinements still needed (as always)
    Job prioritisation still static
      Dynamic job prioritisation on the way
    Basic logs all in place
      Not everything easy to view for user / shifter
      Being improved
    More improvements in resilience upcoming
  DIRAC portal: http://lhcbweb.pic.es
    All needed information for LHCb users
      Locating data, job monitoring, …
    Restricted information for outsiders
      Grid privacy issues
  Ganga + DIRAC the only official LHCb grid interface
    Will support any reasonable use case
Successes …
  A single machine is the DIRAC server
    No particular load issues seen
  Analysis also going on
    Comparison of different Monte Carlo
The occasional problem
  Black hole worker nodes
    Bad environment that cannot match jobs
    Sink for our pilot jobs
    Once a sink for production jobs also
      Migration from SL3 to SL4
    Introduce short sleep time before pilot exits
  DoS attack on CERN servers
    Software being downloaded from CERN
      Was done if software was not available locally
    Now users do not install software
We do not understand …
  Very very preliminary
  Still working on understanding this
  “Same” class of CPUs at different sites
  CPU time scaled median for the CPU class
Now over to ATLAS …