The LEGO Train Framework Andrei Gheata Costin Grigoras Jan Fiete Grosse-Oetringhaus.

The LEGO Train Framework

Andrei Gheata

Costin Grigoras

Jan Fiete Grosse-Oetringhaus

The LEGO Framework - Jan Fiete Grosse-Oetringhaus 2

Idea

• Manage trains using MonALISA
  – Users register wagons
  – Train operators compose trains
• Automatic testing per wagon
• Train file generation
• Submission managed by ML (existing LPM infrastructure)
• Merging managed by LPM
• Aim: allow operators easy running of analysis trains (~weekly), getting output on the scale of 1-2 days

Configuration & Testing

• Train Configuration
  – New class AliAnalysisTaskCfg
    • Contains description of wagons (add task macro, libraries, dependencies)
    • See talk by Andrei on Monday
• Testing
  – Uses alientest04 machine
  – Downloads AliEn packages (ROOT, AliRoot)
  – Copies a part of the input data set to the local machine
  – Runs tests per wagon
  – Uses syswatch to extract mem/cpu information
  – Also tests a "base line" task, which is empty

[Diagram of tested items: Base line, Phys Sel, Centr Sel, User A, User B, User C]
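
One plausible use of the empty "base line" task is as a reference: subtracting its memory and CPU usage from each wagon's test result gives the cost added by that wagon alone. The sketch below illustrates this idea. The record layout, the field names and all numbers are assumptions made for illustration; they are not the syswatch format or real measurements.

```python
from dataclasses import dataclass


@dataclass
class WagonTest:
    """Resource usage measured for one test (hypothetical record layout)."""
    name: str
    mem_mb: float  # peak memory in MB
    cpu_s: float   # CPU time in seconds


def overhead(test: WagonTest, baseline: WagonTest) -> WagonTest:
    """Return the usage of a wagon with the empty base-line task subtracted."""
    return WagonTest(
        name=test.name,
        mem_mb=test.mem_mb - baseline.mem_mb,
        cpu_s=test.cpu_s - baseline.cpu_s,
    )


# Illustrative numbers only, not measurements from the real system.
baseline = WagonTest("Base line", mem_mb=300.0, cpu_s=40.0)
tests = [
    WagonTest("Phys Sel", mem_mb=350.0, cpu_s=55.0),
    WagonTest("User A", mem_mb=520.0, cpu_s=130.0),
]

for t in tests:
    extra = overhead(t, baseline)
    print(f"{extra.name}: +{extra.mem_mb:.0f} MB, +{extra.cpu_s:.0f} s")
```

Reporting the difference rather than the raw numbers makes wagons comparable even if the framework's own footprint changes between AliRoot tags.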

Workflow

[Diagram with the participants MonALISA, User, Train operator, Test machine, AliEn and LPM, exchanging config, test results and train files]

1. adds wagons
2. composes train
3. generates test files + executes test
4. recompose after test
5. generates train jdl + scripts
6. runs train

Screenshot

[Screenshot with annotated areas: Handler configuration; Wagon configuration; Data configuration; Testing and running status]

[Screenshot slides, one per view: Handler, Wagon, Dataset, Run, Syswatch]

Operator Workflow

Select dataset

Select wagon

Start testing

Inspect output

Operator Workflow (2)

status of analysis

status of merging

intermediate merging steps

Submit final merge job (to be automated)

final merging status

check output

Demo…

• Enough theory, let's do some clicking…

http://alimonitor.cern.ch/trains

Some More Details

• Train runs with an analysis tag
  – All code + "AddTask" macro has to be in the tag (no par file!)
• Output per run stored in the input data directory (like AOD, QA trains). E.g.:
  /alice/data/2010/LHC10h/000137366/ESDs/pass2/PWG4/CorrelationTrain/7_20111117_1350
• All merged runs found in
  /alice/cern.ch/user/a/alitrain/PWG4/CorrelationTrain/7_20111117_1350/merge
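
The two example paths share the same trailing components (working group, train name, train run identifier). A minimal sketch of how such paths could be assembled is shown below. The function names and the split into components are assumptions read off the two examples above, not the framework's actual code.

```python
import posixpath


def run_output_dir(data_dir: str, pwg: str, train: str, train_run: str) -> str:
    """Per-run output directory, placed under the input data directory."""
    return posixpath.join(data_dir, pwg, train, train_run)


def merged_output_dir(pwg: str, train: str, train_run: str,
                      home: str = "/alice/cern.ch/user/a/alitrain") -> str:
    """Directory holding the merged output of all runs (hypothetical helper)."""
    return posixpath.join(home, pwg, train, train_run, "merge")


# Components taken from the example paths on this slide.
data_dir = "/alice/data/2010/LHC10h/000137366/ESDs/pass2"
print(run_output_dir(data_dir, "PWG4", "CorrelationTrain", "7_20111117_1350"))
print(merged_output_dir("PWG4", "CorrelationTrain", "7_20111117_1350"))
```

`posixpath` is used instead of `os.path` so that the forward-slash separators of the AliEn file catalogue are produced on any platform.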

Operations

• After 10-12h most jobs are done (~90-98%)
  – Few running, few waiting
  – This situation can persist for days: a killer for merging the output
  – Solutions
    • Kill jobs that have waited longer than X (being tested on the level of the LPM, better as a JDL tag)
    • Remove CE requirement after a certain time (thx Latchezar for this idea), to be implemented
• Merge jobs have the same tails of a few jobs that wait a long time
  – Ideas: same as above or run them on any CE (problem with splitting, Pablo is investigating)
• Output available after ~2 days
  – 25% (real time) spent in running
  – 75% in merging
  – I believe this can still be improved!
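
The first proposed solution, killing jobs that have waited longer than X, can be sketched as a simple filter over the job list. This is only an illustration of the policy: the `Job` record, the status strings and the 12-hour threshold are assumptions, and the sketch does not call LPM, AliEn or any JDL feature.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One grid job as seen by the monitoring (hypothetical record)."""
    job_id: int
    status: str            # "WAITING", "RUNNING", "DONE" or "ERROR"
    hours_in_status: float


def jobs_to_kill(jobs, max_wait_hours):
    """Select jobs that have been waiting longer than the threshold X."""
    return [j for j in jobs
            if j.status == "WAITING" and j.hours_in_status > max_wait_hours]


def done_fraction(jobs):
    """Fraction of jobs that have finished successfully."""
    return sum(j.status == "DONE" for j in jobs) / len(jobs)


# Illustrative snapshot: 8 done, 1 running, 1 stuck in the waiting tail.
jobs = [Job(i, "DONE", 0.0) for i in range(1, 9)]
jobs.append(Job(9, "RUNNING", 3.0))
jobs.append(Job(10, "WAITING", 20.0))

stuck = jobs_to_kill(jobs, max_wait_hours=12.0)
print([j.job_id for j in stuck], done_fraction(jobs))
```

As a cross-check of the time budget quoted above: 25% of ~2 days (48 h) is about 12 h, which is consistent with most analysis jobs being done after 10-12h; the remaining ~36 h go into merging.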

Operations (visually…)

[Three plots of job counts by state (Waiting, Running, Done, Error) versus hours since submission: two for analysis jobs, one for merging jobs. Annotations: "here we kill the remaining ones" and "80% done in 4 hours"]

Current Trains

• Four active beta testers
  – Jets (Christian KB)
  – D2H (Zaida)
  – Correlations in pp (Eva)
  – Correlations in PbPb (JF)
• We got a lot of feedback and improved the system

TODO

• Graphs for CPU/Wall/Mem consumption of user tasks as function of AliRoot tag
• Some improvements in the web interface
• Automatic launching of final job

Documentation

• Mailing list (for operators)
  – [email protected]
• TWiki (Users + operators)
  – https://twiki.cern.ch/twiki/bin/viewauth/ALICE/AnalysisTrains