L. Arrabito, D. Bouvet, X. Canehan, P. Girard, Y. Perret, S. Poulat, R. Rumler
SW distribution tests at Lyon
Pierre Girard, Luisa Arrabito, David Bouvet, Yannick Perret, Xavier Canehan, Suzanne Poulat, Rolf Rumler
Initially presented at the LHCb Jamboree, March 7th, 2011
Updated on March 25th, 2011 for the Atlas CAF
Content
• AFS latency story
  – AFS latency problem schedule
  – LHCb SetupProject tests
  – Preliminary conclusions
• xxx-FS client stress tests
  – Test suite results
  – CVMFS tests results
• Conclusions
AFS latency problem schedule
Timeline: May 2010 to January 2011

AFS story:
• 17/05: SetupProject.sh timeouts (5%)
• New AFS server version; read-only AFS volumes for the LHCb SW area; new AFS client
• 17/06: SetupProject.sh timeouts (0.5%)
• 08/07: New LHCb setup timeout problems
• 06/09: CCIN2P3 added new HW (24 logical cores): 110 WNs
• 03/11: ATLAS timeout problems (50% failures)
• 05/11: ATLAS increased its timeout (3600 s)
• 26/11: PARTLY SOLVED: LHCb increased its timeout (3600 s)
• 07/01: SOLVED: CCIN2P3 reduced the number of job slots on recent HW

SL5 story:
• Many WN crashes due to an IO problem
• Many (temporarily) freezing WNs after an OS patch
• Stable SL5 WNs after a new kernel patch

Tests story:
• LHCb test infrastructure set up for different tunings (AFS, SL5, LRMS)
• Tests of different kernel parameters
LHCb job setup tests
Source: L. Arrabito
AFS Latency / LHCb job efficiency
All-T1s CPU efficiency during the same period
Source: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier1_view.html
CPU efficiency (%), January 2010 to January 2011:

| Site    | Jan 10 | Feb 10 | Mar 10 | Apr 10 | May 10 | Jun 10 | Jul 10 | Aug 10 | Sep 10 | Oct 10 | Nov 10 | Dec 10 | Jan 11 | Total |
| CC      | 39.9   | 46.3   | 67.1   | 80.3   | 84.7   | 87.3   | 88.5   | 88.6   | 79.8   | 80.7   | 88.6   | 88.5   | 90.9   | 86.0  |
| All T1s | 42.7   | 58.4   | 74.6   | 82.4   | 88.2   | 88.4   | 88.3   | 73.5   | 79.4   | 85.1   | 81.7   | 89.8   | 92.8   | 82.5  |
Visible effect of adding the latest (24-core) WNs?
Preliminary conclusions
• LHCb/ATLAS environment setup is very (too) FS-intensive: stracing SetupProject.sh shows 17 868 open() and 110 765 stat() calls
• Investigate job distribution strategies to avoid too many similar jobs on the same WN, following the "lhcb-alone" and "atlas-excluded" test results
• The AFS latency problem is now an AFS client scalability problem: temporarily solved by decreasing the number of job slots on the most recent machines, but is that a major concern for the near future?
• Have other sites already experienced the same problem?
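The scale of the first bullet can be made concrete with a back-of-envelope calculation. The call counts come from the strace figures above; the per-call latencies are purely illustrative assumptions, not measurements from the site:

```python
# Call counts from the slides (strace of SetupProject.sh).
opens, stats = 17_868, 110_765
calls = opens + stats

# Hypothetical per-call latencies: a warm local FS vs. an overloaded
# shared-FS client. These figures are illustrative assumptions only.
for name, latency_s in [("local FS (~20 us/call)", 20e-6),
                        ("loaded shared-FS client (~5 ms/call)", 5e-3)]:
    print(f"{name}: ~{calls * latency_s:.0f} s spent on metadata calls")
```

Under these assumed latencies, the same setup goes from a few seconds to roughly ten minutes of metadata time, which is why a metadata-heavy setup is so sensitive to FS client load.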
Test suite conditions
• For each test:
  – The same directory tree (DAVINCI) is walked
  – The same actions are performed in the same order
  – LHCb SetupProject-like load: ~100 000 stat(), ~7 000 open()
  – The first block of each file is read to ensure the open() is effective
• The cache (if any) is pre-loaded by executing the test once beforehand
• Averaged results are taken from 4 executions
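A minimal sketch of a harness matching these conditions (function names are hypothetical; the real suite walks the DAVINCI tree, while this sketch accepts any root directory):

```python
import os
import time

def walk_once(root, block=4096):
    """One test pass: stat() every entry, and open() every file,
    reading the first block so the open() is effective."""
    n_stat = n_open = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            os.stat(os.path.join(dirpath, name))
            n_stat += 1
        for name in filenames:
            with open(os.path.join(dirpath, name), "rb") as f:
                f.read(block)  # first block only
            n_open += 1
    return n_stat, n_open

def average_time(root, runs=4):
    """Pre-load the cache (if any) with one untimed pass, then average
    the wall-clock time of `runs` identical passes (4 on the slide)."""
    walk_once(root)
    times = []
    for _ in range(runs):
        t0 = time.monotonic()
        walk_once(root)
        times.append(time.monotonic() - t0)
    return sum(times) / len(times)
```

Running identical passes in an identical order is what makes the per-FS comparisons on the next slide meaningful.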
FS Test Results
CVMFS test suite conditions: different cache sizes
• Dedicated Squid proxy: used by the tested WN only, with a pre-loaded LHCb cache
• On the CVMFS client (0.2.53-1), before each test: the cache was removed and the service was restarted
• Different cache sizes, grown with "ls -lR" on sibling directories
CVMFS Cache Size Test Results
CVMFS Min/Max results
Comparison with latest CVMFS (0.2.61)
NEW (2011/03/25)
Conclusions
• LHCb/Atlas job setup should be optimized
• Multi-VO sites should try to implement a fair distribution of VO jobs over the cluster WNs, i.e. restrict the number of similar jobs per WN
• The issue is (most likely) shared-FS client scalability
  – Checked with AFS, NFS 4.0, and CVMFS
  – Tests must go on: NFS 4.1 (pNFS) still to be tested; with other HW (for now, only the PowerEdge C6100); by virtualizing the WNs ("divide and rule" principle)
  – A first attempt split a 24-core WN into 2 x 12-core VM-WNs; this must be further investigated
• CVMFS is interesting for VO SW distribution (no installation job required), but beware that latency increases with cache size
Questions & Comments
Other T1s CPU Efficiency, from January 2010 to January 2011
CERN KIT PIC
CNAF NL-T1 RAL
Source: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tier1_view.html
Virtualized WN: divide and rule?