LHCb on-line / off-line computing
INFN CSN1
Lecce, 24.9.2003
Domenico Galli, Bologna
LHCb on-line / off-line computing. 2D. Galli
Off-line Computing
We plan LHCb-Italy off-line computing resources to be as centralized as possible: put as much computing power as possible in the CNAF Tier-1, to minimize system-administration manpower and to optimize resource exploitation.
“Distributed” for us means distributed among CNAF and the other European Regional Centres.
Possible drawback: strong dependence on CNAF resource sharing.
The improvement from setting up Tier-3s at the major INFN sites for parallel ntuple analysis should be evaluated later.
2003 Activities
In 2003 LHCb-Italy contributed to DC03 (production of MC samples for the TDR):
47 Mevt in 60 days: 32 Mevt minimum bias; 10 Mevt inclusive b; 50 signal samples of 50 to 100 kevt each.
18 computing centres involved.
Italian contribution: 11.5% (should be 15%).
2003 Activities (II)
The Italian contribution to DC03 has been obtained using limited resources (40 kSI2000, i.e. 100 1-GHz PIII CPUs).
Larger contributions (Karlsruhe, D; Imperial College, UK) come from the huge, dynamically allocated resources of those centres.
DIRAC, the LHCb distributed MC production system, has been used to run 36600 jobs; 85% of them ran outside CERN with 92% mean efficiency.
2003 Activities (III)
DC03 has also been used to validate the LHCb distributed analysis model:
Distribution to the Tier-1 centres of the signal and background MC samples stored at CERN during production.
Samples have been pre-reduced based on kinematic or trigger criteria.
Selection algorithms for specific decay channels (~30) have been executed.
Events have been classified by means of tagging algorithms.
LHCb-Italy contributed the implementation of selection algorithms for B decays into 2 charged pions/kaons.
2003 Activities (IV)
To perform analysis of high-statistics data samples, the PVFS distributed file system has been used:
110 MB/s aggregate I/O using 100Base-T Ethernet connections (to be compared with the 50 MB/s of a typical 1000Base-T NAS).
2003 Activities (V)
Analysis work by LHCb-Italy has been included in the “Reoptimized Detector Design and Performance” TDR (2-hadron channels + tagging).
3 LHCb internal notes have been written:
CERN-LHCb/2003-123: Bologna group, “Selection of B/Bs → h+h− decays at LHCb”;
CERN-LHCb/2003-124: Bologna group, “CP sensitivity with B/Bs → h+h− decays at LHCb”;
CERN-LHCb/2003-115: Milano group, “LHCb flavour tagging performance”.
Software Roadmap
DC04 (April-June 2004) – Physics Goals
Demonstrate the performance of the HLTs (needed for the computing TDR): large minimum-bias sample + signal.
Improve the B/S estimates of the optimisation TDR: large bb sample + signal.
Physics improvements to generators.
DC04 – Computing Goals
Main goal: gather information to be used for writing the LHCb computing TDR.
Robustness test of the LHCb software and production system, using software as realistic as possible in terms of performance.
Test of the LHCb distributed computing model, including distributed analyses.
Incorporation of the LCG application-area software into the LHCb production environment.
Use of LCG resources as a substantial fraction of the production capacity.
DC04 – Production Scenario
Generate (Gauss, “SIM” output):
150 million events minimum bias;
50 million events inclusive b decays;
20 million exclusive b decays in the channels of interest.
Digitize (Boole, “DIGI” output): all events; apply the L0+L1 trigger decision.
Reconstruct (Brunel, “DST” output): minimum bias and inclusive b decays passing the L0 and L1 triggers; the entire exclusive b-decay sample.
Store: SIM+DIGI+DST of all reconstructed events.
Goal: Robustness Test of the LHCb Software and Production System
First use of the simulation program Gauss, based on Geant4.
Introduction of the new digitisation program, Boole, with HLTEvent as output.
Robustness of the reconstruction program, Brunel, including any new tuning or other available improvements; not including mis-alignment/calibration.
Pre-selection of events based on physics criteria (DaVinci), AKA “stripping”: performed by the production system after the reconstruction, producing multiple DST output streams.
Further development of production tools (Dirac etc.): e.g. integration of stripping, book-keeping improvements, monitoring improvements.
Goal: Test of the LHCb Computing Model
Distributed data production: as in 2003, will be run on all available production sites, including LCG1; controlled by the production manager at CERN, in close collaboration with the LHCb production site managers.
Distributed data sets:
CERN: complete DST (copied from the production centres); master copies of the pre-selections (stripped DST).
Tier-1: complete replica of the pre-selections; master copy of the DST produced at associated sites; master (unique!) copy of the SIM+DIGI produced at associated sites.
Distributed analysis.
Goal: Incorporation of the LCG Software
Gaudi will be updated to use the POOL (hybrid persistency implementation) mechanism and certain SEAL (general framework services) services, e.g. the plug-in manager.
All the applications will use the new Gaudi; this should be ~transparent but must be commissioned.
N.B.: POOL provides the existing functionality of ROOT I/O, and more: e.g. location-independent event collections.
But it is incompatible with the existing TDR data; we may need to convert it if we want just one data format.
Needed Resources for DC04
The CPU requirement is 10 times what was needed for DC03.
Current resource estimates indicate DC04 will last 3 months (assuming Gauss is twice as slow as SICBMC); currently planned for April-June.
GOAL: use of LCG resources as a substantial fraction of the production capacity; we can hope for up to 50%.
Storage requirement: 6 TB at CERN for the complete DST;
19 TB distributed among the Tier-1s for locally produced SIM+DIGI+DST;
up to 1 TB per Tier-1 for pre-selected DSTs.
Resources Request to the Bologna Tier-1 for DC04
CPU power: 200 kSI2000 (500 1-GHz PIII CPUs).
Disk: 5 TB
Tape: 5 TB
Tier-1 Growth in the Next Years

                 2004  2005  2006  2007
  CPU [kSI2000]   200   200   400   800
  Disk [TB]         5    20   100   200
  Tape [TB]         5    20   200   600
Online Computing
LHCb-Italy has been involved in the online group to design the L1/HLT trigger farm.
Sezione di Bologna: G. Avoni, A. Carbone, D. Galli, U. Marconi, G. Peco, M. Piccinini, V. Vagnoni
Sezione di Milano: T. Bellunato, L. Carbone, P. Dini
Sezione di Ferrara: A. Gianoli
Online Computing (II)
Lots of changes since the Online TDR:
Network Processors abandoned; Level-1 DAQ included; Ethernet now all the way from the readout boards; destination assignment by TFC (Timing and Fast Control).
The main ideas are the same:
a large gigabit Ethernet Local Area Network to connect detector sources to CPU destinations; a simple (push) protocol, no event-manager; commodity components wherever possible; everything controlled, configured and monitored by the ECS (Experimental Control System).
DAQ Architecture
[Diagram: the front-end electronics feed a multiplexing layer (Level-1 traffic: 126-224 links, 44 kHz, 5.5-11.0 GB/s; HLT traffic: 323 links, 4 kHz, 1.6 GB/s) through 29 switches (32 links each) into the Gb Ethernet readout network (94-175 links, 7.1-12.6 GB/s of mixed Level-1/HLT traffic), then to 94-175 SFCs, each with its own switch, feeding the CPU farm (62-87 switches, 64-137 links at 88 kHz, ~1800 CPUs). Ancillary components: TRM, Sorter, TFC system, L1-Decision, storage system.]
Following the Data-Flow
[Diagram: the same architecture as the previous slide (94 SFCs, 94 links at 7.1 GB/s into the readout network, ~1800 CPUs), annotated with the path of a single event: L0-accepted data flow from the front-end electronics through the readout network to an SFC and a farm CPU, where the L1 trigger decision is taken; on L1-yes the full event follows the same path and the HLT accepts it (example: B → ΦKs).]
Design Studies Items under study:
Physical farm implementation (choice of cases, cooling, etc.)
Farm management (bootstrap procedure, monitoring)
Subfarm Controllers (event-builders, load-balancing queue)
Ethernet Switches
Integration with TFC and ECS
System Simulation
LHCb-Italy is involved in Farm management, Subfarm Controllers and their communication with Subfarm Nodes.
Tests in Bologna
To begin the activity in Bologna we started (August 2003) from scratch, trying to transfer data through 1000Base-T (gigabit Ethernet on copper cables) from PC to PC and to measure performance.
As we plan to use an unreliable protocol (raw Ethernet, raw IP or UDP), because reliable ones (like TCP, which retransmits unacknowledged data) introduce unpredictable latency, we need to benchmark data loss together with throughput and latency.
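A minimal sketch of such a loss benchmark (illustrative only: it runs over the loopback interface, with hypothetical packet counts and sizes, whereas the real tests used netperf and custom sender/receiver programs over the gigabit link). The sender numbers each datagram; the receiver counts what actually arrives.

```python
import socket
import struct

def run_udp_loss_test(n_datagrams=1000, payload_size=1024):
    """Send numbered UDP datagrams over loopback and count arrivals/losses."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))           # let the OS pick a free port
    rx.settimeout(0.5)
    addr = rx.getsockname()
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pad = b"\x00" * (payload_size - 4)  # 4 bytes used by the sequence number
    received = 0
    for seq in range(n_datagrams):
        tx.sendto(struct.pack("!I", seq) + pad, addr)
        try:                            # read back at once, so the receive
            rx.recvfrom(65536)          # buffer never overflows
            received += 1
        except socket.timeout:          # a datagram that never arrives = loss
            pass
    tx.close()
    rx.close()
    return received, n_datagrams - received
```

Over loopback this should report zero loss; on a real link, losses show up as timed-out sequence numbers.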
Tests in Bologna (II) – Previous Results
In the IEEE 802.3 standard specifications, for 100 m long cat5e cables, the BER (Bit Error Rate) is stated to be < 10^-10.
Previous measurements, performed by A. Barczyc, B. Jost and N. Neufeld using Network Processors (not real PCs) and 100 m long cat5e cables, showed a BER < 10^-14.
Recent measurements (presented by A. Barczyc at Zürich, 18.09.2003), performed using PCs, gave a frame drop rate of O(10^-6).
Much data (too much for L1!) gets lost inside the kernel network stack implementation in PCs.
Tests in Bologna (III)
Transferring data on 1000Base-T Ethernet is not as trivial as it was for 100Base-TX Ethernet.
A new bus (PCI-X) and new chipsets (e.g. Intel E7501, 875P) have been designed to support gigabit NIC data flow (the PCI bus and old chipsets do not have enough bandwidth to drive a gigabit NIC at gigabit rate).
The Linux kernel implementation of the network stack has been rewritten twice since kernel 2.4 to support gigabit data flow (networking code is 20% of the kernel source). The last modification implies a change of the kernel-to-driver interface (network drivers must be rewritten).
A standard Linux RedHat 9A setup uses the back-compatibility code and loses packets.
Not many people are interested in achieving very low packet loss (except for video streaming).
A DataTAG group is also working on packet losses (M. Rio, T. Kelly, M. Goutelle, R. Hughes-Jones, J.P. Martin-Flatin, “A map of the networking code in Linux Kernel 2.4.20”, draft 8, 18 August 2003).
Tests in Bologna. Results Summary
Throughput was always higher than expected (957 Mb/s of IP payload measured), while data loss was our main concern.
We have understood, first (at least) in the LHCb collaboration, how to send IP datagrams at gigabit/second rate from Linux to Linux on 1000Base-T Ethernet without datagram loss (4 datagrams lost per 2.0x10^10 datagrams sent).
This required using the appropriate software:
NAPI kernel (>= 2.4.20);
NAPI-enabled drivers (for the Intel e1000 driver, recompilation with a special flag set was needed);
kernel parameter tuning (buffer & queue lengths);
1000Base-T flow control enabled on the NICs.
Test-bed 0
2 x PC with 3 x 1000Base-T interfaces each:
Motherboard: SuperMicro X5DPL-iGM, dual Pentium IV Xeon 2.4 GHz, 1 GB ECC RAM
Chipset: Intel E7501, 400/533 MHz FSB (front side bus)
Bus Controller Hub: Intel P64H2 (2 x PCI-X, 64 bit, 66/100/133 MHz)
On-board Ethernet controller: Intel 82545EM, 1 x 1000Base-T interface (supports jumbo frames)
Plugged-in PCI-X Ethernet card: Intel Pro/1000 MT Dual Port Server Adapter, Ethernet controller Intel 82546EB, 2 x 1000Base-T interfaces (supports jumbo frames)
1000Base-T 8-port switch: HP ProCurve 6108; 16 Gbps non-blocking backplane; latency < 12.5 µs (LIFO, 64-byte packets); throughput 11.9 million pps (64-byte packets); switching capacity 16 Gbps
Cat. 6e cables, max 500 MHz (cf. the 125 MHz of 1000Base-T)
Test-bed 0 (II)
echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
to use only one interface to receive packets belonging to a certain network (131.154.10, 10.10.0 and 10.10.1).
[Diagram: lhcbcn1 (131.154.10.2, 10.10.0.2, 10.10.1.2) and lhcbcn2 (131.154.10.7, 10.10.0.7, 10.10.1.7) connected to the 1000Base-T switch, which also carries the uplink.]
Test-bed 0 (III)
SuperMicro X5DPL-iGM Motherboard (Chipset Intel E7501)
The chipset internal bandwidth is guaranteed: 6.4 Gb/s minimum.
Benchmark Software
We used 2 benchmark programs:
Netperf 2.2p14 (UDP_STREAM);
self-made basic sender & receiver programs using UDP & raw IP.
We discovered a bug in netperf on the Linux platform: on Linux, the calls setsockopt(SO_SNDBUF) & setsockopt(SO_RCVBUF) set the buffer size to twice the requested size, while getsockopt(SO_SNDBUF) & getsockopt(SO_RCVBUF) return the actual buffer size; since netperf uses the same variable for both system calls, when it iterates to achieve the requested precision in the results it doubles the buffer size at each iteration.
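The doubling behaviour is easy to demonstrate (a small sketch; the requested size is illustrative, and on Linux the result is also subject to the kernel's minimum size and the net.core.rmem_max cap):

```python
import socket

def requested_vs_actual_rcvbuf(requested):
    """On Linux, setsockopt(SO_RCVBUF, n) books roughly 2*n bytes (half for
    kernel bookkeeping overhead) and getsockopt returns that doubled value,
    which is what confuses netperf's iteration loop."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    s.close()
    return actual
```

On a typical Linux box, requesting 65536 bytes yields 131072 back; a program that feeds the returned value into the next setsockopt call keeps doubling its buffer.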
Benchmark Environment
Kernel 2.4.20-18.9smp
Gigabit Ethernet driver: e1000, version 5.0.43-k1 (RedHat 9A) and version 5.2.16 recompiled with the NAPI flag enabled
System disconnected from the public network
Runlevel 3 (X11 stopped)
Daemons stopped (crond, atd, sendmail, etc.)
Flow control on (on both NICs and switch)
Number of descriptors allocated by the driver rings: 256, 4096
IP send buffer size: 524288 (x2) Bytes
IP receive buffer size: 524288 (x2), 1048576 (x2) Bytes
Tx queue length: 100, 1600
First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning
The first benchmark results on datagram loss showed big fluctuations which, in principle, can be due to packet queue resets, other CPU processes, interrupts, soft_irqs, broadcast network traffic, etc.
The resulting distribution is multi-modal.
Mean loss: 1 datagram lost every 20000 datagrams sent. Too much for LHCb L1!
[Histogram: lost-datagram fraction for 10^6-datagram runs, 2600 runs; peaks near 1/33300, 1/10300 and 1/3550; mean value 5.095x10^-5 = 1/19630. Settings: flow control on, switch uplink off, Tx/Rx rings 256 descriptors, send/receive buffers 524288 B, Tx queue length 100, driver e1000 5.0.43-k1 (no NAPI).]
First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning (II)
We think that the peak behavior is due to kernel queue resets (all queued packets are silently dropped when the queue is full).
[Histograms: lost-datagram fraction for 10^6-datagram runs (2600 runs), zoomed views of the same data; mean value 5.095x10^-5 = 1/19630; secondary peaks at loss fractions around 1/186916, 1/74074, 1/46512, 1/45872, 1/44643, 1/42735, 1/33112, 1/33003 and 1/26596; same settings as the previous figure.]
Changes in the Linux Network Stack Implementation
2.1 → 2.2: netlink, bottom halves, HFC (hardware flow control).
As little computation as possible while in interrupt context (interrupts disabled).
Part of the processing is deferred from the interrupt handler to bottom halves, to be executed at a later time (with interrupts enabled).
HFC (to prevent interrupt livelock): when the backlog queue is completely filled, interrupts are disabled until the backlog queue is emptied.
Bottom-half execution is strictly serialized among CPUs; only one packet at a time can enter the system.
2.3.43 → 2.4: softnet, softirq. Softirqs are software threads that replace bottom halves, allowing parallelism on SMP machines.
2.5.53 → 2.4.20 (N.B.: a back-port): NAPI (New API), an interrupt mitigation technology (a mixture of interrupt and polling mechanisms).
Interrupt Livelock
Given the interrupt rate coming in, the IP processing thread never gets a chance to remove any packets from the system.
There are so many interrupts coming into the system that no useful work is done.
Packets go all the way to being queued, but are dropped because the backlog queue is full.
System resources are heavily used, but no useful work is accomplished.
NAPI (New API)
NAPI is an interrupt mitigation mechanism combining interrupts and polling:
Polling: useful under heavy load;
introduces more latency under light load;
wastes CPU by polling devices that have no packet to offer.
Interrupts: improve latency under light load;
make the system vulnerable to livelock as the interrupt load exceeds the MLFFR (Maximum Loss Free Forwarding Rate).
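The hybrid scheme can be sketched as a toy model (the names `budget` and `polls` are illustrative, not the kernel's actual identifiers): an interrupt only puts the device on a poll list; the softirq then drains at most a budget of packets per pass, and interrupts stay off until the ring is empty.

```python
from collections import deque

def napi_poll_model(rx_packets, budget=64):
    """Toy model of NAPI reception: instead of taking one interrupt per
    packet, the driver is polled and drains at most `budget` packets per
    softirq pass; interrupts stay disabled until the rx ring is empty."""
    ring = deque(rx_packets)
    delivered = []
    polls = 0
    while ring:                     # device stays on the poll_list...
        polls += 1                  # ...one net_rx_action()-like pass
        for _ in range(min(budget, len(ring))):
            delivered.append(ring.popleft())
    # ring empty: device would leave the poll_list, interrupts re-enabled
    return delivered, polls
```

A burst of 200 packets is then handled in 4 poll passes rather than 200 interrupts, which is the packets/interrupt trade-off described above.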
Packet Reception in Linux kernel 2.4.19 (softnet) and 2.4.20 (NAPI)
[Diagram, softnet (kernel 2.4.19): the NIC DMA engine writes the packet into kernel memory pointed to by an rx-ring descriptor and raises an interrupt; the interrupt handler disables interrupts, allocates an sk_buff, fetches the data (DMA), pushes it with netif_rx() onto the per-CPU backlog queue, schedules a softirq (__cpu_raise_softirq()) and re-enables interrupts; the softirq handler net_rx_action() (a kernel thread) pops packets from the backlog queue and hands them to ip_rcv() for further processing; if the backlog queue fills up it is emptied completely (HFC, to avoid interrupt livelock).
NAPI (kernel 2.4.20): the interrupt handler instead calls netif_rx_schedule(), which adds a device pointer (eth0, eth1, …) to the poll_list; net_rx_action() then polls the devices and reads packets directly from the rx-ring descriptors, passing them to ip_rcv().]
NAPI (II)
Under low load, before the MLFFR is reached, the system converges towards an interrupt-driven system: the packets/interrupt ratio is low and latency is reduced.
Under heavy load, the system takes its time to poll the registered devices; interrupts are admitted only as fast as the system can process them: the packets/interrupt ratio is high and latency is increased.
NAPI (III)
NAPI changes the driver-to-kernel interface: all network drivers should be rewritten.
To accommodate devices that are not NAPI-aware, the old interface (backlog queue) is still available for old drivers (back-compatibility).
Backlog queues, when used in back-compatibility mode, are polled just like other devices.
True NAPI vs Back-Compatibility Mode
[Diagram: with a NAPI kernel and a NAPI driver, the interrupt handler calls netif_rx_schedule() to add the device pointer (eth0, eth1, …) to the poll_list, and net_rx_action() reads packets directly from the rx-ring descriptors into ip_rcv(). With a NAPI kernel and an old (not NAPI-aware) driver, the interrupt handler still calls netif_rx() to push packets onto the per-CPU backlog queues, and the backlog queue itself is added to the poll_list and polled like a device.]
The Intel e1000 Driver
Even in the latest version of the e1000 driver (5.2.16), NAPI is turned off by default (to allow the use of the driver also in kernels up to 2.4.19).
To enable NAPI, the e1000 5.2.16 driver must be recompiled with the option:
make CFLAGS_EXTRA=-DCONFIG_E1000_NAPI
Best Results
Maximum transfer rate (UDP, 4096-byte datagrams): 957 Mb/s.
Mean datagram loss fraction (@ 957 Mb/s): 2.0x10^-10 (4 datagrams lost per 2.0x10^10 4k-datagrams sent),
corresponding to a BER of 6.2x10^-15 (using 1 m cat6e cables) if the data loss is entirely due to hardware CRC errors.
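The quoted BER follows from simple arithmetic, under the assumption that every lost datagram was dropped because of (at least) one corrupted bit; neglecting header bits, as done in this sketch, gives a value of order 6x10^-15:

```python
def loss_to_ber(datagrams_lost, datagrams_sent, payload_bytes):
    """Convert a datagram-loss fraction into an equivalent bit error rate,
    assuming each lost datagram is due to (at least) one bad bit; header
    bits are neglected, so this is only a rough estimate."""
    loss_fraction = datagrams_lost / datagrams_sent
    return loss_fraction / (payload_bytes * 8)

# 4 datagrams lost out of 2.0e10, 4096-byte payloads: BER of order 6e-15
ber = loss_to_ber(4, 2.0e10, 4096)
```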
To be Tested to Improve Further
Kernel 2.5:
fully preemptive (real time);
sysenter & sysexit (instead of int 0x80) for the context switch on system calls (3-4 times faster);
asynchronous datagram receiving.
Jumbo frames: Ethernet frames whose MTU (Maximum Transmission Unit) is 9000 instead of 1500; less IP datagram fragmentation into packets.
Kernel Mode Linux (http://web.yl.is.s.u-tokyo.ac.jp/~tosh/kml/):
KML is a technology that enables the execution of ordinary user-space programs inside kernel space;
protection by software (as in Java bytecode) instead of protection by hardware;
system calls become function calls (132 times faster than int 0x80, 36 times faster than sysenter/sysexit).
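The fragmentation gain from jumbo frames mentioned above can be estimated directly (a sketch: a 20-byte IP header is assumed, and the 8-byte alignment rule for fragment payloads is ignored since 1480 is already a multiple of 8):

```python
import math

def ip_fragments(udp_payload_bytes, mtu, ip_header=20, udp_header=8):
    """Number of IP fragments needed to carry one UDP datagram: the UDP
    payload plus its 8-byte header is split into chunks of (MTU - IP
    header) bytes, i.e. 1480 bytes for a standard MTU of 1500."""
    ip_payload = udp_payload_bytes + udp_header
    return math.ceil(ip_payload / (mtu - ip_header))
```

A 4096-byte UDP datagram needs 3 fragments with standard frames but fits in a single jumbo frame.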
Milestones
8.2004 – Streaming benchmarks:
maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet with a loopback cable;
test of switch performance (streaming throughput, latency and packet loss, using standard frames and jumbo frames);
maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet for 2 or 3 simultaneous connections on the same PC;
test of event building (receive 2 message streams and send 1 joined message stream).
12.2004 – SFC (Sub-Farm Controller) to nodes communication:
definition of the SFC-to-nodes communication protocol;
definition of the SFC queueing and scheduling mechanism;
first implementation of the queueing/scheduling procedures (possibly zero-copy).
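The event-building test named in the milestones can be sketched as follows (a minimal illustration, not the SFC protocol itself; the function and field names are hypothetical): two fragment streams are merged into complete events keyed by an event number.

```python
def build_events(stream_a, stream_b):
    """Join two streams of (event_id, fragment) pairs into complete events;
    fragments are buffered until both halves of an event have arrived."""
    pending = {}
    events = []
    for source in (stream_a, stream_b):
        for event_id, fragment in source:
            if event_id in pending:
                events.append((event_id, pending.pop(event_id) + fragment))
            else:
                pending[event_id] = fragment
    return events, pending       # pending holds still-incomplete events
```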
Milestones (II)
OS tests (if performance needs to be improved):
kernel Linux 2.5.53;
KML (Kernel Mode Linux).
Design and test of bootstrap procedures:
measurement of the failure rate of the simultaneous boot of a cluster of PCs, using PXE/DHCP and TFTP;
test of node switch on/off and power cycling using ASF;
design of the bootstrap system (ratio of nodes to proxy servers to servers, software alignment among servers).
Definition of requirements for the trigger software: error trapping; timeout.