The LEGO Train Framework

Page 1: The LEGO Train Framework

The LEGO Train Framework

Andrei Gheata

Costin Grigoras

Jan Fiete Grosse-Oetringhaus

Page 2: The LEGO Train Framework

The LEGO Framework - Jan Fiete Grosse-Oetringhaus 2

Idea

• Manage trains using MonALISA

– Users register wagons

– Train operators compose trains

• Automatic testing per wagon

• Train file generation

• Submission managed by MonALISA (ML), using the existing LPM infrastructure

• Merging managed by LPM

• Aim: let operators run analysis trains easily (~weekly), with output available within 1-2 days

Page 3: The LEGO Train Framework

Configuration & Testing

• Train Configuration

– New class AliAnalysisTaskCfg

• Contains description of wagons (add task macro, libraries, dependencies)

• See talk by Andrei on Monday

• Testing

– Uses alientest04 machine

– Downloads AliEn packages (ROOT, AliRoot)

– Copies a part of the input data set to the local machine

– Runs tests per wagon

– Uses syswatch to extract mem/cpu information

– Also tests an empty "base line" task

[Figure labels: Base line, Phys Sel, Centr Sel, User A, User B, User C]
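
The point of testing an empty "base line" task alongside the wagons can be illustrated with a small sketch. It assumes each test yields a peak memory figure and a CPU time, and that every wagon test carries the same framework overhead as the base line. The `TestResult` type, its fields, and the numbers are hypothetical; they are not the real syswatch output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestResult:
    """Hypothetical summary of one test run (not the real syswatch format)."""
    name: str
    peak_mem_mb: float
    cpu_s: float


def overhead(wagon: TestResult, baseline: TestResult) -> TestResult:
    """Subtract the empty base line task to isolate a wagon's own cost."""
    return TestResult(
        name=wagon.name,
        peak_mem_mb=wagon.peak_mem_mb - baseline.peak_mem_mb,
        cpu_s=wagon.cpu_s - baseline.cpu_s,
    )


# Illustrative numbers only; they do not come from a real test.
baseline = TestResult("Base line", peak_mem_mb=400.0, cpu_s=100.0)
user_a = TestResult("User A", peak_mem_mb=650.0, cpu_s=160.0)
extra = overhead(user_a, baseline)
print(f"{extra.name}: +{extra.peak_mem_mb:.0f} MB, +{extra.cpu_s:.0f} s CPU")
```

Reporting the difference rather than the raw figures makes it easier to see which wagon is responsible for a memory or CPU increase.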

Page 4: The LEGO Train Framework

Workflow

[Diagram: User, Train operator, MonALISA, LPM, Test machine, AliEn; exchanged items: config, test results, train files]

1. adds wagons

2. composes train

3. generates test files + executes test

4. recompose after test

5. generates train jdl + scripts

6. runs train
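
The six numbered workflow steps can be read as a small state machine. The sketch below is only one possible reading: the `Step` names and the rule that a failed test leads to recomposition and a new test are assumptions, not something the slide states.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    ADD_WAGONS = 1            # 1. adds wagons
    COMPOSE_TRAIN = 2         # 2. composes train
    TEST = 3                  # 3. generates test files + executes test
    RECOMPOSE = 4             # 4. recompose after test
    GENERATE_TRAIN_FILES = 5  # 5. generates train jdl + scripts
    RUN_TRAIN = 6             # 6. runs train


def next_step(current: Step, test_ok: bool = True) -> Optional[Step]:
    """Return the step that follows `current`, or None after the last one.

    Assumption: a failed test sends the operator to RECOMPOSE and then back
    to TEST; a passed test goes straight to generating the train files.
    """
    if current is Step.TEST:
        return Step.GENERATE_TRAIN_FILES if test_ok else Step.RECOMPOSE
    if current is Step.RECOMPOSE:
        return Step.TEST
    if current is Step.RUN_TRAIN:
        return None
    return Step(current + 1)
```

For example, `next_step(Step.TEST, test_ok=False)` yields `Step.RECOMPOSE`, from which the train is tested again before any train files are generated.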

Page 5: The LEGO Train Framework

Screenshot

Handler configuration

Wagon configuration

Data configuration

Testing and running status

Page 6: The LEGO Train Framework

Handler

Page 7: The LEGO Train Framework

Wagon

Page 8: The LEGO Train Framework

Dataset

Page 9: The LEGO Train Framework

Run

Page 10: The LEGO Train Framework

Syswatch

Page 11: The LEGO Train Framework

Operator Workflow

Select dataset

Select wagon

Start testing

Inspect output

Page 12: The LEGO Train Framework

Operator Workflow (2)

• status of analysis

• status of merging

• intermediate merging steps

• Submit final merge job (to be automatized)

• final merging status

• check output

Page 13: The LEGO Train Framework

Demo…

• Enough theory, let's do some clicking…

http://alimonitor.cern.ch/trains

Page 14: The LEGO Train Framework

Some More Details

• Train runs with an analysis tag

– All code + "AddTask" macro has to be in the tag (no par file!)

• Output per run stored in the input data directory (like AOD, QA trains). E.g.: /alice/data/2010/LHC10h/000137366/ESDs/pass2/PWG4/CorrelationTrain/7_20111117_1350

• All merged runs found in /alice/cern.ch/user/a/alitrain/PWG4/CorrelationTrain/7_20111117_1350/merge
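
The two example locations share the train identifier `7_20111117_1350`, which suggests a simple path layout. The sketch below rebuilds both example paths from their components. The function names, the parameter names, and the assumption that the run number is always zero-padded to nine digits are illustrative guesses drawn from these two examples alone.

```python
def run_output_dir(year: int, period: str, run: int, reco_pass: str,
                   pwg: str, train: str, train_id: str) -> str:
    """Per-run output directory inside the input data directory (assumed layout)."""
    return (f"/alice/data/{year}/{period}/{run:09d}/ESDs/{reco_pass}/"
            f"{pwg}/{train}/{train_id}")


def merged_output_dir(pwg: str, train: str, train_id: str) -> str:
    """Directory holding the merged runs of one train run (assumed layout)."""
    return f"/alice/cern.ch/user/a/alitrain/{pwg}/{train}/{train_id}/merge"


per_run = run_output_dir(2010, "LHC10h", 137366, "pass2",
                         "PWG4", "CorrelationTrain", "7_20111117_1350")
merged = merged_output_dir("PWG4", "CorrelationTrain", "7_20111117_1350")
print(per_run)
print(merged)
```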

Page 15: The LEGO Train Framework

Operations

• After 10-12h most jobs are done (~90-98%)

– Few running, few waiting

– This situation can persist for days, which is a killer for merging the output

– Solutions

• Kill jobs that have waited longer than X (being tested at the level of the LPM; would be better as a JDL tag)

• Remove CE requirement after a certain time (thx Latchezar for this idea), to be implemented

• Merge jobs have the same tails of a few jobs that wait a long time

– Ideas: same as above, or run them on any CE (problem with splitting, Pablo is investigating)

• Output available after ~2 days

– 25% (real time) spent in running

– 75% in merging

– I believe this can still be improved!
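
The first proposed solution, killing jobs that have waited longer than X, amounts to a simple filter over the job list. The sketch below shows the selection logic only. The `Job` type, its fields, and the status strings are hypothetical and do not reflect the LPM or AliEn interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; not the LPM or AliEn data model."""
    job_id: int
    status: str             # "WAITING", "RUNNING", "DONE" or "ERROR"
    hours_in_status: float  # time spent in the current status


def jobs_to_kill(jobs: List[Job], max_wait_hours: float) -> List[Job]:
    """Select jobs that have been waiting longer than the threshold X."""
    return [j for j in jobs
            if j.status == "WAITING" and j.hours_in_status > max_wait_hours]


# Illustrative data only.
jobs = [
    Job(1, "DONE", 3.0),
    Job(2, "WAITING", 20.0),
    Job(3, "WAITING", 2.0),
    Job(4, "RUNNING", 30.0),
]
stale = jobs_to_kill(jobs, max_wait_hours=12.0)
print([j.job_id for j in stale])
```

Taking ~2 days as 48 hours, the quoted split corresponds to roughly 12 hours of running and 36 hours of merging, which is consistent with most jobs being done after 10-12h.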

Page 16: The LEGO Train Framework

Operations (visually…)

[Plots: three panels (Analysis jobs, Merging jobs, Analysis jobs) showing jobs in the states Waiting, Running, Done and Error versus hours since submission. Annotations: "80% done in 4 hours"; "here we kill the remaining ones".]

Page 17: The LEGO Train Framework

Current Trains

• Four active beta testers

– Jets (Christian KB)

– D2H (Zaida)

– Correlations in pp (Eva)

– Correlations in PbPb (JF)

• We got a lot of feedback and improved the system

Page 18: The LEGO Train Framework

TODO

• Graphs for CPU/Wall/Mem consumption of user tasks as function of AliRoot tag

• Some improvements in the web interface

• Automatic launching of the final merge job

Page 19: The LEGO Train Framework

Documentation

• Mailing list (for operators)

[email protected]

• TWiki (Users + operators)

– https://twiki.cern.ch/twiki/bin/viewauth/ALICE/AnalysisTrains