The DataTransfer status
Experience on VSR2
A. Bozzi, L. Salconi – 27 Oct 2009
The new software procedures 1/2
We implemented a simple, robust replica manager architecture.
An automatic system that:
- scans for new DAQ files and builds metadata for them (based on FrDump output);
- keeps track of those files and orders them in multiple queues (one for each kind of file);
- prepares the data transfer sessions (built from static configuration parameters) and starts them, one session for each data flow;
- checks the output status of each session and acts on it (different actions are performed after a successful or a failed data transfer);
- schedules a retry for a failed transfer session;
- keeps track of all scheduled operations (successful or failed);
- builds a metadata structure for each file (a raw ffl entry).
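The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual VSR2 code: the class name `ReplicaManager`, the `max_attempts` limit, and the per-flow `send` callable are hypothetical, and for simplicity the sketch keeps one queue per data flow, fed according to the kind of file.

```python
from collections import deque


class ReplicaManager:
    """Hypothetical sketch: queue files per flow, send them, retry failures."""

    def __init__(self, flows, max_attempts=3):
        # flows maps a flow name (e.g. "raw2bo") to (file kind, send callable).
        # The send callable is an assumption: it returns True on success.
        self.flows = flows
        self.max_attempts = max_attempts
        self.queues = {name: deque() for name in flows}
        self.journal = []  # every attempted operation, successful or failed

    def add_file(self, kind, filename):
        """Queue a new file on every flow that carries this kind of file."""
        for name, (flow_kind, _send) in self.flows.items():
            if flow_kind == kind:
                self.queues[name].append((filename, 1))

    def run_sessions(self):
        """Run one transfer session per flow over the files queued so far."""
        for name, (_kind, send) in self.flows.items():
            queue = self.queues[name]
            for _ in range(len(queue)):
                filename, attempt = queue.popleft()
                ok = bool(send(filename))
                self.journal.append((name, filename, attempt, ok))
                if not ok and attempt < self.max_attempts:
                    # Schedule a retry in a later session.
                    queue.append((filename, attempt + 1))
```

A failed file stays on its queue and is picked up again by the next call to `run_sessions()`, while the journal keeps the full history of attempts.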
The new software procedures 2/2
… and it also:
- has the same architecture and a similar topology for each data flow: only the sendFile class changes, so we have a few primitives that are wrappers around the bbftp and SRB commands (and, why not, around gridFTP in the future);
- has a network “star configuration” (from Cascina to the CCs, with 8 independent flows);
- collects information on closed sessions only by parsing the local log and the output of the performed operations, in order to find the status of the transferred files;
- builds a remote ffl locally, combining the FrDump output for the local file with the static information on the remote destination directory;
- organizes the data path in the same way in all repositories, so that the same script can search for missing files or errors.
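The idea that only the sendFile class changes between flows could look roughly like the sketch below. Everything here is an assumption made for illustration: the class names, the constructor parameters, the host and user values, and the exact command lines (a basic `bbftp -u user -e "put ..." host` call and a basic `Sput localFile collection` call) are not taken from the real procedure, and treating exit code 0 as success is a simplification of the log parsing described above.

```python
import subprocess


class SendFile:
    """Hypothetical base class: one instance per data flow."""

    def __init__(self, flow, remote_dir, run=subprocess.call):
        self.flow = flow              # e.g. "raw2bo"
        self.remote_dir = remote_dir  # static destination directory
        self.run = run                # callable: argv list -> exit code

    def command(self, local_path):
        raise NotImplementedError

    def send(self, local_path):
        # Assumption: exit code 0 means the transfer succeeded.
        return self.run(self.command(local_path)) == 0


class BbftpSendFile(SendFile):
    """Wrapper around the bbftp client (command line assumed)."""

    def __init__(self, flow, remote_dir, host, user, run=subprocess.call):
        super().__init__(flow, remote_dir, run)
        self.host = host
        self.user = user

    def command(self, local_path):
        name = local_path.rsplit("/", 1)[-1]
        put = f"put {local_path} {self.remote_dir}/{name}"
        return ["bbftp", "-u", self.user, "-e", put, self.host]


class SrbSendFile(SendFile):
    """Wrapper around the SRB Sput command (command line assumed)."""

    def command(self, local_path):
        return ["Sput", local_path, self.remote_dir]
```

Because the runner is injected, a new protocol (for example gridFTP) would only need another subclass with its own `command()` method, and the wrappers can be exercised with a fake runner without touching the network.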
The Cascina – Bologna – Lyon star architecture
[Diagram: datagw.virgo.infn.it in Cascina, fed by the rawdata circular buffer and the procdata volumes, sends data to Lyon (SRB) and to Bologna (bbftp).]
The LIGO data interface (using LDR)
[Diagram labels: LIGO, Lyon, Bologna; hosts dataldr.virgo.infn.it and datagw.virgo.infn.it; LIGO vols (RW) and procdata vols (RO); transfers via SRB and bbftp.]
The achieved performance
2009-07-20 16:45:27,108 INFO DtDBase: adding V-raw-932135940-180.gwf to rawdata queque
2009-07-20 16:45:28,563 INFO SRBEngine: [raw2ly] sending file V-raw-932135940-180.gwf
2009-07-20 16:45:31,314 INFO BBEngine: [raw2bo] sending file V-raw-932135940-180.gwf
2009-07-20 16:46:32,715 INFO BBEngine: [raw2bo] file V-raw-932135940-180.gwf successfully sent
2009-07-20 16:46:36,227 INFO BBEngine: [raw2bo] sent updated ffl ./ffl/raw2bo.ffl
2009-07-20 16:46:57,363 INFO SRBEngine: [raw2ly] file V-raw-932135940-180.gwf successfully sent
2009-07-20 16:47:00,978 INFO SRBEngine: [raw2ly] sent updated ffl ./ffl/raw2ly.ffl

fflGen.pl [Mon Jul 20 16:45:27 2009] -> file to insert V-raw-932135940-180.gwf on st4rear::v081
fflGen.pl [Mon Jul 20 16:45:27 2009] -> sending infos about V-raw-932135940-180.gwf to dataSend
fflGen.pl [Mon Jul 20 16:45:27 2009] -> sending infos about V-raw-932135940-180.gwf to dataBackup
fflGen.pl [Mon Jul 20 16:45:27 2009] -> generate a new ffl file...
fflGen.pl [Mon Jul 20 16:45:34 2009] -> ...public ffl file updated with 87962 records
An example with a VSR2 rawdata file (V-raw-932135940-180.gwf):
→ available in Cascina to users (circular buffer) at 16:45:27
→ published in the local ffl in Cascina at 16:45:34
→ available in Bologna (published with ffl) at 16:46:36 (1'09")
→ available in Lyon (published with ffl) at 16:47:00 (1'33")
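The delays quoted in this example can be recomputed from the transfer log itself. The following is a minimal sketch, assuming the log format shown in the excerpt and a log that covers a single file; the function name `publication_delays` and the regular expression are illustrative choices, not part of the real procedure.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt above (only the ones the sketch needs).
LOG = """\
2009-07-20 16:45:27,108 INFO DtDBase: adding V-raw-932135940-180.gwf to rawdata queque
2009-07-20 16:46:36,227 INFO BBEngine: [raw2bo] sent updated ffl ./ffl/raw2bo.ffl
2009-07-20 16:47:00,978 INFO SRBEngine: [raw2ly] sent updated ffl ./ffl/raw2ly.ffl
"""

LINE = re.compile(r"^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d),(\d{3}) INFO (\w+): (.*)$")
SENT_FFL = re.compile(r"^\[(\w+)\] sent updated ffl")


def parse_time(stamp, millis):
    """Combine the timestamp and the millisecond field into a datetime."""
    base = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(millis) * 1000)


def publication_delays(log_text):
    """Seconds from 'adding ... to queue' to 'sent updated ffl', per flow."""
    queued_at = None
    delays = {}
    for line in log_text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        when = parse_time(match.group(1), match.group(2))
        message = match.group(4)
        if message.startswith("adding "):
            queued_at = when
            continue
        sent = SENT_FFL.match(message)
        if sent and queued_at is not None:
            delays[sent.group(1)] = (when - queued_at).total_seconds()
    return delays


for flow, seconds in publication_delays(LOG).items():
    print(f"{flow}: {int(seconds) // 60}'{int(seconds) % 60:02d}\"")
```

For the excerpt this gives about 69 s for raw2bo and about 93 s for raw2ly, matching the 1'09" and 1'33" figures quoted for Bologna and Lyon.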
The amount of data sent to CCs
(27 Oct 09 – 10:00am)
                         Bologna                       Lyon
                  n. files  space used (TB)   n. files  space used (TB)
raw (931-933)        16605             30        16605             30
raw (934-936)        16667             31        16667             31
raw (937-939)        16666             32        16666             32
raw (940-now)         3720            6.6         3720            6.6
proc                  2401            1.3         2401            1.3
LIGO (S6/H1)           528            1.1          528            1.1
LIGO (S6/L1)           466            0.9          466            0.9
Total                57053          102.9        57053          102.9
Here is the amount of data sent to the remote CCs so far (27 Oct '09 at 10:00am):
- from the logs we see that we are in a “just in time” situation for about 93% of the data transfer activity (meaning a delay of about 2 minutes between the publication of a file in Cascina and the availability of its replica at the remote CCs);
- so far only 3 files (2 raw and 1 proc, out of a total of about 53000 files) were missing from the sent list, due to exceptions not handled by the procedure. The problems were fixed manually.
Conclusions
We achieved a good level of performance for all 8 active independent data flows (rawdata, hreconline, LIGO H1 and LIGO L1, each from Cascina to Bologna and to Lyon).
No particular problems were detected in Bologna: only two files were missing from the list.
Some problems were detected in Lyon: one missing file, and all the others related to the SRB interface:
- the Sput command sometimes locks up (a manual procedure is needed to unlock it);
- good FFL files were transferred to Lyon but turned out to have zero length at destination;
- sometimes we lose the synchronization between the SRB/xrootd layer and the HPSS layer (e.g. with the Smv command).
For these problems we received good support from the Lyon SRB service team.