LHCb on-line / off-line computing
INFN CSN1
Lecce, 24.9.2003
Domenico Galli, Bologna
LHCb on-line / off-line computing. 2D. Galli
Off-line Computing
We plan LHCb-Italy off-line computing resources to be as centralized as possible: put as much computing power as possible in the CNAF Tier-1, to minimize system-administration manpower and to optimize resource exploitation.
“Distributed” for us means distributed among CNAF and the other European Regional Centres.
Possible drawback: strong dependence on CNAF resource sharing.
The improvement from setting up Tier-3s at the major INFN sites for parallel ntuple analysis should be evaluated later.
2003 Activities
In 2003 LHCb-Italy contributed to DC03 (production of MC samples for the TDR):
47 Mevt in 60 days: 32 Mevt minimum bias; 10 Mevt inclusive b; 50 signal samples of 50 to 100 kevt each.
18 computing centres involved.
Italian contribution: 11.5% (should be 15%).
2003 Activities (II)
The Italian contribution to DC03 has been obtained using limited resources (40 kSI2000, i.e. 100 1-GHz PIII CPUs).
Larger contributions (Karlsruhe, D; Imperial College, UK) come from the huge, dynamically allocated resources of those centres.
DIRAC, the LHCb distributed MC production system, has been used to run 36600 jobs; 85% of them ran outside CERN with 92% mean efficiency.
2003 Activities (III)
DC03 has also been used to validate the LHCb distributed analysis model:
Distribution to the Tier-1 centres of the signal and background MC samples stored at CERN during production.
Samples have been pre-reduced based on kinematic or trigger criteria.
Selection algorithms for specific decay channels (~30) have been executed.
Events have been classified by means of tagging algorithms.
LHCb-Italy contributed the implementation of selection algorithms for B decays into 2 charged pions/kaons.
2003 Activities (IV)
To perform analysis of high-statistics data samples, the PVFS distributed file system has been used:
110 MB/s aggregate I/O using 100Base-T Ethernet connections (to be compared with the 50 MB/s of a typical 1000Base-T NAS).
2003 Activities (V)
Analysis work by LHCb-Italy has been included in the “Reoptimized Detector Design and Performance” TDR (2-hadron channels + tagging).
3 LHCb internal notes have been written:
CERN-LHCb/2003-123: Bologna group, “Selection of B/Bs → h+h− decays at LHCb”;
CERN-LHCb/2003-124: Bologna group, “CP sensitivity with B/Bs → h+h− decays at LHCb”;
CERN-LHCb/2003-115: Milano group, “LHCb flavour tagging performance”.
Software Roadmap
DC04 (April-June 2004) – Physics Goals
Demonstrate the performance of the HLTs (needed for the computing TDR): large minimum-bias sample + signal.
Improve the B/S estimates of the optimisation TDR: large bb sample + signal.
Physics improvements to generators.
DC04 – Computing Goals
Main goal: gather information to be used for writing the LHCb computing TDR.
Robustness test of the LHCb software and production system, using software as realistic as possible in terms of performance.
Test of the LHCb distributed computing model, including distributed analyses.
Incorporation of the LCG application-area software into the LHCb production environment.
Use of LCG resources as a substantial fraction of the production capacity.
DC04 – Production Scenario
Generate (Gauss, “SIM” output):
150 million events minimum bias;
50 million events inclusive b decays;
20 million exclusive b decays in the channels of interest.
Digitize (Boole, “DIGI” output): all events; apply the L0+L1 trigger decision.
Reconstruct (Brunel, “DST” output): minimum bias and inclusive b decays passing the L0 and L1 triggers; the entire exclusive b-decay sample.
Store: SIM+DIGI+DST of all reconstructed events.
Goal: Robustness Test of the LHCb Software and Production System
First use of the simulation program Gauss, based on Geant4.
Introduction of the new digitisation program, Boole, with HLTEvent as output.
Robustness of the reconstruction program, Brunel, including any new tuning or other available improvements; not including mis-alignment/calibration.
Pre-selection of events based on physics criteria (DaVinci), AKA “stripping”: performed by the production system after the reconstruction, producing multiple DST output streams.
Further development of production tools (Dirac etc.): e.g. integration of stripping, book-keeping improvements, monitoring improvements.
Goal: Test of the LHCb Computing Model
Distributed data production: as in 2003, will be run on all available production sites, including LCG1; controlled by the production manager at CERN, in close collaboration with the LHCb production site managers.
Distributed data sets:
CERN: complete DST (copied from the production centres); master copies of the pre-selections (stripped DST).
Tier-1: complete replica of the pre-selections; master copy of the DST produced at associated sites; master (unique!) copy of the SIM+DIGI produced at associated sites.
Distributed analysis.
Goal: Incorporation of the LCG Software
Gaudi will be updated to use the POOL (hybrid persistency implementation) mechanism and certain SEAL (general framework services) services, e.g. the plug-in manager.
All the applications will use the new Gaudi; this should be ~transparent but must be commissioned.
N.B.: POOL provides the existing functionality of ROOT I/O, and more: e.g. location-independent event collections.
But it is incompatible with the existing TDR data; we may need to convert it if we want just one data format.
Needed Resources for DC04
The CPU requirement is 10 times what was needed for DC03.
Current resource estimates indicate DC04 will last 3 months (assuming Gauss is twice as slow as SICBMC); currently planned for April-June.
GOAL: use of LCG resources as a substantial fraction of the production capacity; we can hope for up to 50%.
Storage requirement: 6 TB at CERN for the complete DST;
19 TB distributed among the Tier-1s for locally produced SIM+DIGI+DST;
up to 1 TB per Tier-1 for pre-selected DSTs.
Resources Request to the Bologna Tier-1 for DC04
CPU power: 200 kSI2000 (500 1-GHz PIII CPUs).
Disk: 5 TB
Tape: 5 TB
Tier-1 Growth in the Next Years

                 2004  2005  2006  2007
  CPU [kSI2000]   200   200   400   800
  Disk [TB]         5    20   100   200
  Tape [TB]         5    20   200   600
Online Computing
LHCb-Italy has been involved in the online group to design the L1/HLT trigger farm.
Sezione di Bologna: G. Avoni, A. Carbone, D. Galli, U. Marconi, G. Peco, M. Piccinini, V. Vagnoni
Sezione di Milano: T. Bellunato, L. Carbone, P. Dini
Sezione di Ferrara: A. Gianoli
Online Computing (II)
Lots of changes since the Online TDR:
Network Processors abandoned; Level-1 DAQ included; Ethernet now all the way from the readout boards; destination assignment by TFC (Timing and Fast Control).
The main ideas are the same:
a large gigabit Ethernet Local Area Network to connect detector sources to CPU destinations; a simple (push) protocol, no event-manager; commodity components wherever possible; everything controlled, configured and monitored by the ECS (Experimental Control System).
DAQ Architecture
[Diagram: the front-end electronics feed a multiplexing layer (Level-1 traffic: 126-224 links, 44 kHz, 5.5-11.0 GB/s; HLT traffic: 323 links, 4 kHz, 1.6 GB/s) through 29 switches (32 links each) into the Gb Ethernet readout network (94-175 links, 7.1-12.6 GB/s of mixed Level-1/HLT traffic), then to 94-175 SFCs, each with its own switch, feeding the CPU farm (62-87 switches, 64-137 links at 88 kHz, ~1800 CPUs). Ancillary components: TRM, Sorter, TFC system, L1-Decision, storage system.]
Following the Data-Flow
[Diagram: the same architecture as the previous slide (94 SFCs, 94 links at 7.1 GB/s into the readout network, ~1800 CPUs), annotated with the path of a single event: L0-accepted data flow from the front-end electronics through the readout network to an SFC and a farm CPU, where the L1 trigger decision is taken; on L1-yes the full event follows the same path and the HLT accepts it (example: B → ΦKs).]
Design Studies Items under study:
Physical farm implementation (choice of cases, cooling, etc.)
Farm management (bootstrap procedure, monitoring)
Subfarm Controllers (event-builders, load-balancing queue)
Ethernet Switches
Integration with TFC and ECS
System Simulation
LHCb-Italy is involved in Farm management, Subfarm Controllers and their communication with Subfarm Nodes.
Tests in Bologna
To begin the activity in Bologna we started (August 2003) from scratch, trying to transfer data through 1000Base-T (gigabit Ethernet on copper cables) from PC to PC and to measure performance.
As we plan to use an unreliable protocol (raw Ethernet, raw IP or UDP), because reliable ones (like TCP, which retransmits unacknowledged data) introduce unpredictable latency, we need to benchmark data loss together with throughput and latency.
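A minimal sketch of such a loss benchmark (illustrative only: it runs over the loopback interface, with hypothetical packet counts and sizes, whereas the real tests used netperf and custom sender/receiver programs over the gigabit link). The sender numbers each datagram; the receiver counts what actually arrives.

```python
import socket
import struct

def run_udp_loss_test(n_datagrams=1000, payload_size=1024):
    """Send numbered UDP datagrams over loopback and count arrivals/losses."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))           # let the OS pick a free port
    rx.settimeout(0.5)
    addr = rx.getsockname()
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    pad = b"\x00" * (payload_size - 4)  # 4 bytes used by the sequence number
    received = 0
    for seq in range(n_datagrams):
        tx.sendto(struct.pack("!I", seq) + pad, addr)
        try:                            # read back at once, so the receive
            rx.recvfrom(65536)          # buffer never overflows
            received += 1
        except socket.timeout:          # a datagram that never arrives = loss
            pass
    tx.close()
    rx.close()
    return received, n_datagrams - received
```

Over loopback this should report zero loss; on a real link, losses show up as timed-out sequence numbers.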
Tests in Bologna (II) – Previous Results
In the IEEE 802.3 standard specifications, for 100 m long cat5e cables, the BER (Bit Error Rate) is stated to be < 10^-10.
Previous measurements, performed by A. Barczyc, B. Jost and N. Neufeld using Network Processors (not real PCs) and 100 m long cat5e cables, showed a BER < 10^-14.
Recent measurements (presented by A. Barczyc at Zürich, 18.09.2003), performed using PCs, gave a frame drop rate of O(10^-6).
Much data (too much for L1!) gets lost inside the kernel network stack implementation in PCs.
Tests in Bologna (III)
Transferring data on 1000Base-T Ethernet is not as trivial as it was for 100Base-TX Ethernet.
A new bus (PCI-X) and new chipsets (e.g. Intel E7501, 875P) have been designed to support gigabit NIC data flow (the PCI bus and old chipsets do not have enough bandwidth to drive a gigabit NIC at gigabit rate).
The Linux kernel implementation of the network stack has been rewritten twice since kernel 2.4 to support gigabit data flow (networking code is 20% of the kernel source). The last modification implies a change of the kernel-to-driver interface (network drivers must be rewritten).
A standard Linux RedHat 9A setup uses the back-compatibility code and loses packets.
Not many people are interested in achieving very low packet loss (except for video streaming).
A DataTAG group is also working on packet losses (M. Rio, T. Kelly, M. Goutelle, R. Hughes-Jones, J.P. Martin-Flatin, “A map of the networking code in Linux Kernel 2.4.20”, draft 8, 18 August 2003).
Tests in Bologna. Results Summary
Throughput was always higher than expected (957 Mb/s of IP payload measured), while data loss was our main concern.
We have understood, first (at least) in the LHCb collaboration, how to send IP datagrams at gigabit/second rate from Linux to Linux on 1000Base-T Ethernet without datagram loss (4 datagrams lost per 2.0x10^10 datagrams sent).
This required using the appropriate software:
NAPI kernel (>= 2.4.20);
NAPI-enabled drivers (for the Intel e1000 driver, recompilation with a special flag set was needed);
kernel parameter tuning (buffer & queue lengths);
1000Base-T flow control enabled on the NICs.
Test-bed 0
2 x PC with 3 x 1000Base-T interfaces each:
Motherboard: SuperMicro X5DPL-iGM, dual Pentium IV Xeon 2.4 GHz, 1 GB ECC RAM
Chipset: Intel E7501, 400/533 MHz FSB (front side bus)
Bus Controller Hub: Intel P64H2 (2 x PCI-X, 64 bit, 66/100/133 MHz)
On-board Ethernet controller: Intel 82545EM, 1 x 1000Base-T interface (supports jumbo frames)
Plugged-in PCI-X Ethernet card: Intel Pro/1000 MT Dual Port Server Adapter, Ethernet controller Intel 82546EB, 2 x 1000Base-T interfaces (supports jumbo frames)
1000Base-T 8-port switch: HP ProCurve 6108; 16 Gbps non-blocking backplane; latency < 12.5 µs (LIFO, 64-byte packets); throughput 11.9 million pps (64-byte packets); switching capacity 16 Gbps
Cat. 6e cables, max 500 MHz (cf. the 125 MHz of 1000Base-T)
Test-bed 0 (II)
echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
to use only one interface to receive packets belonging to a certain network (131.154.10, 10.10.0 and 10.10.1).
[Diagram: lhcbcn1 (131.154.10.2, 10.10.0.2, 10.10.1.2) and lhcbcn2 (131.154.10.7, 10.10.0.7, 10.10.1.7) connected to the 1000Base-T switch, which also carries the uplink.]
Test-bed 0 (III)
SuperMicro X5DPL-iGM Motherboard (Chipset Intel E7501)
The chipset internal bandwidth is guaranteed: 6.4 Gb/s minimum.
Benchmark Software
We used 2 benchmark programs:
Netperf 2.2p14 (UDP_STREAM);
self-made basic sender & receiver programs using UDP & raw IP.
We discovered a bug in netperf on the Linux platform: on Linux, the calls setsockopt(SO_SNDBUF) & setsockopt(SO_RCVBUF) set the buffer size to twice the requested size, while getsockopt(SO_SNDBUF) & getsockopt(SO_RCVBUF) return the actual buffer size; since netperf uses the same variable for both system calls, when it iterates to achieve the requested precision in the results it doubles the buffer size at each iteration.
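The doubling behaviour is easy to demonstrate (a small sketch; the requested size is illustrative, and on Linux the result is also subject to the kernel's minimum size and the net.core.rmem_max cap):

```python
import socket

def requested_vs_actual_rcvbuf(requested):
    """On Linux, setsockopt(SO_RCVBUF, n) books roughly 2*n bytes (half for
    kernel bookkeeping overhead) and getsockopt returns that doubled value,
    which is what confuses netperf's iteration loop."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    s.close()
    return actual
```

On a typical Linux box, requesting 65536 bytes yields 131072 back; a program that feeds the returned value into the next setsockopt call keeps doubling its buffer.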
Benchmark Environment
Kernel 2.4.20-18.9smp
Gigabit Ethernet driver: e1000, version 5.0.43-k1 (RedHat 9A) and version 5.2.16 recompiled with the NAPI flag enabled
System disconnected from the public network
Runlevel 3 (X11 stopped)
Daemons stopped (crond, atd, sendmail, etc.)
Flow control on (on both NICs and switch)
Number of descriptors allocated by the driver rings: 256, 4096
IP send buffer size: 524288 (x2) Bytes
IP receive buffer size: 524288 (x2), 1048576 (x2) Bytes
Tx queue length: 100, 1600
First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning
The first benchmark results on datagram loss showed big fluctuations which, in principle, can be due to packet queue resets, other CPU processes, interrupts, soft_irqs, broadcast network traffic, etc.
The resulting distribution is multi-modal.
Mean loss: 1 datagram lost every 20000 datagrams sent. Too much for LHCb L1!
[Histogram: lost-datagram fraction for 10^6-datagram runs, 2600 runs; peaks near 1/33300, 1/10300 and 1/3550; mean value 5.095x10^-5 = 1/19630. Settings: flow control on, switch uplink off, Tx/Rx rings 256 descriptors, send/receive buffers 524288 B, Tx queue length 100, driver e1000 5.0.43-k1 (no NAPI).]
First Results. Linux RedHat 9A, Kernel 2.4.20, Default Setup, no Tuning (II)
We think that the peak behavior is due to kernel queue resets (all queued packets are silently dropped when the queue is full).
[Histograms: lost-datagram fraction for 10^6-datagram runs (2600 runs), zoomed views of the same data; mean value 5.095x10^-5 = 1/19630; secondary peaks at loss fractions around 1/186916, 1/74074, 1/46512, 1/45872, 1/44643, 1/42735, 1/33112, 1/33003 and 1/26596; same settings as the previous figure.]
Changes in the Linux Network Stack Implementation
2.1 → 2.2: netlink, bottom halves, HFC (hardware flow control).
As little computation as possible while in interrupt context (interrupts disabled).
Part of the processing is deferred from the interrupt handler to bottom halves, to be executed at a later time (with interrupts enabled).
HFC (to prevent interrupt livelock): when the backlog queue is completely filled, interrupts are disabled until the backlog queue is emptied.
Bottom-half execution is strictly serialized among CPUs; only one packet at a time can enter the system.
2.3.43 → 2.4: softnet, softirq. Softirqs are software threads that replace bottom halves, allowing parallelism on SMP machines.
2.5.53 → 2.4.20 (N.B.: a back-port): NAPI (New API), an interrupt mitigation technology (a mixture of interrupt and polling mechanisms).
Interrupt Livelock
Given the interrupt rate coming in, the IP processing thread never gets a chance to remove any packets from the system.
There are so many interrupts coming into the system that no useful work is done.
Packets go all the way to being queued, but are dropped because the backlog queue is full.
System resources are heavily used, but no useful work is accomplished.
NAPI (New API)
NAPI is an interrupt mitigation mechanism combining interrupts and polling:
Polling: useful under heavy load;
introduces more latency under light load;
wastes CPU by polling devices that have no packet to offer.
Interrupts: improve latency under light load;
make the system vulnerable to livelock as the interrupt load exceeds the MLFFR (Maximum Loss Free Forwarding Rate).
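The hybrid scheme can be sketched as a toy model (the names `budget` and `polls` are illustrative, not the kernel's actual identifiers): an interrupt only puts the device on a poll list; the softirq then drains at most a budget of packets per pass, and interrupts stay off until the ring is empty.

```python
from collections import deque

def napi_poll_model(rx_packets, budget=64):
    """Toy model of NAPI reception: instead of taking one interrupt per
    packet, the driver is polled and drains at most `budget` packets per
    softirq pass; interrupts stay disabled until the rx ring is empty."""
    ring = deque(rx_packets)
    delivered = []
    polls = 0
    while ring:                     # device stays on the poll_list...
        polls += 1                  # ...one net_rx_action()-like pass
        for _ in range(min(budget, len(ring))):
            delivered.append(ring.popleft())
    # ring empty: device would leave the poll_list, interrupts re-enabled
    return delivered, polls
```

A burst of 200 packets is then handled in 4 poll passes rather than 200 interrupts, which is the packets/interrupt trade-off described above.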
Packet Reception in Linux kernel 2.4.19 (softnet) and 2.4.20 (NAPI)
[Diagram, softnet (kernel 2.4.19): the NIC DMA engine writes the packet into kernel memory pointed to by an rx-ring descriptor and raises an interrupt; the interrupt handler disables interrupts, allocates an sk_buff, fetches the data (DMA), pushes it with netif_rx() onto the per-CPU backlog queue, schedules a softirq (__cpu_raise_softirq()) and re-enables interrupts; the softirq handler net_rx_action() (a kernel thread) pops packets from the backlog queue and hands them to ip_rcv() for further processing; if the backlog queue fills up it is emptied completely (HFC, to avoid interrupt livelock).
NAPI (kernel 2.4.20): the interrupt handler instead calls netif_rx_schedule(), which adds a device pointer (eth0, eth1, …) to the poll_list; net_rx_action() then polls the devices and reads packets directly from the rx-ring descriptors, passing them to ip_rcv().]
NAPI (II)
Under low load, before the MLFFR is reached, the system converges towards an interrupt-driven system: the packets/interrupt ratio is low and latency is reduced.
Under heavy load, the system takes its time to poll the registered devices; interrupts are admitted only as fast as the system can process them: the packets/interrupt ratio is high and latency is increased.
NAPI (III)
NAPI changes the driver-to-kernel interface: all network drivers should be rewritten.
To accommodate devices that are not NAPI-aware, the old interface (backlog queue) is still available for old drivers (back-compatibility).
Backlog queues, when used in back-compatibility mode, are polled just like other devices.
True NAPI vs Back-Compatibility Mode
[Diagram: with a NAPI kernel and a NAPI driver, the interrupt handler calls netif_rx_schedule() to add the device pointer (eth0, eth1, …) to the poll_list, and net_rx_action() reads packets directly from the rx-ring descriptors into ip_rcv(). With a NAPI kernel and an old (not NAPI-aware) driver, the interrupt handler still calls netif_rx() to push packets onto the per-CPU backlog queues, and the backlog queue itself is added to the poll_list and polled like a device.]
The Intel e1000 Driver
Even in the latest version of the e1000 driver (5.2.16), NAPI is turned off by default (to allow the use of the driver also in kernels up to 2.4.19).
To enable NAPI, the e1000 5.2.16 driver must be recompiled with the option:
make CFLAGS_EXTRA=-DCONFIG_E1000_NAPI
Best Results
Maximum transfer rate (UDP, 4096-byte datagrams): 957 Mb/s.
Mean datagram loss fraction (@ 957 Mb/s): 2.0x10^-10 (4 datagrams lost per 2.0x10^10 4k-datagrams sent),
corresponding to a BER of 6.2x10^-15 (using 1 m cat6e cables) if the data loss is entirely due to hardware CRC errors.
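The quoted BER follows from simple arithmetic, under the assumption that every lost datagram was dropped because of (at least) one corrupted bit; neglecting header bits, as done in this sketch, gives a value of order 6x10^-15:

```python
def loss_to_ber(datagrams_lost, datagrams_sent, payload_bytes):
    """Convert a datagram-loss fraction into an equivalent bit error rate,
    assuming each lost datagram is due to (at least) one bad bit; header
    bits are neglected, so this is only a rough estimate."""
    loss_fraction = datagrams_lost / datagrams_sent
    return loss_fraction / (payload_bytes * 8)

# 4 datagrams lost out of 2.0e10, 4096-byte payloads: BER of order 6e-15
ber = loss_to_ber(4, 2.0e10, 4096)
```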
To be Tested to Improve Further
Kernel 2.5:
fully preemptive (real time);
sysenter & sysexit (instead of int 0x80) for the context switch on system calls (3-4 times faster);
asynchronous datagram receiving.
Jumbo frames: Ethernet frames whose MTU (Maximum Transmission Unit) is 9000 instead of 1500; less IP datagram fragmentation into packets.
Kernel Mode Linux (http://web.yl.is.s.u-tokyo.ac.jp/~tosh/kml/):
KML is a technology that enables the execution of ordinary user-space programs inside kernel space;
protection by software (as in Java bytecode) instead of protection by hardware;
system calls become function calls (132 times faster than int 0x80, 36 times faster than sysenter/sysexit).
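The fragmentation gain from jumbo frames mentioned above can be estimated directly (a sketch: a 20-byte IP header is assumed, and the 8-byte alignment rule for fragment payloads is ignored since 1480 is already a multiple of 8):

```python
import math

def ip_fragments(udp_payload_bytes, mtu, ip_header=20, udp_header=8):
    """Number of IP fragments needed to carry one UDP datagram: the UDP
    payload plus its 8-byte header is split into chunks of (MTU - IP
    header) bytes, i.e. 1480 bytes for a standard MTU of 1500."""
    ip_payload = udp_payload_bytes + udp_header
    return math.ceil(ip_payload / (mtu - ip_header))
```

A 4096-byte UDP datagram needs 3 fragments with standard frames but fits in a single jumbo frame.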
Milestones
8.2004 – Streaming benchmarks:
maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet with a loopback cable;
test of switch performance (streaming throughput, latency and packet loss, using standard frames and jumbo frames);
maximum streaming throughput and packet loss using UDP, raw IP and raw Ethernet for 2 or 3 simultaneous connections on the same PC;
test of event building (receive 2 message streams and send 1 joined message stream).
12.2004 – SFC (Sub-Farm Controller) to nodes communication:
definition of the SFC-to-nodes communication protocol;
definition of the SFC queueing and scheduling mechanism;
first implementation of the queueing/scheduling procedures (possibly zero-copy).
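The event-building test named in the milestones can be sketched as follows (a minimal illustration, not the SFC protocol itself; the function and field names are hypothetical): two fragment streams are merged into complete events keyed by an event number.

```python
def build_events(stream_a, stream_b):
    """Join two streams of (event_id, fragment) pairs into complete events;
    fragments are buffered until both halves of an event have arrived."""
    pending = {}
    events = []
    for source in (stream_a, stream_b):
        for event_id, fragment in source:
            if event_id in pending:
                events.append((event_id, pending.pop(event_id) + fragment))
            else:
                pending[event_id] = fragment
    return events, pending       # pending holds still-incomplete events
```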
Milestones (II)
OS tests (if performance needs to be improved):
kernel Linux 2.5.53;
KML (Kernel Mode Linux).
Design and test of bootstrap procedures:
measurement of the failure rate of the simultaneous boot of a cluster of PCs, using PXE/DHCP and TFTP;
test of node switch on/off and power cycling using ASF;
design of the bootstrap system (ratio of nodes to proxy servers to servers, software alignment among servers).
Definition of requirements for the trigger software: error trapping; timeout.