CASPUR / GARR / CERN / CNAF / CSP
New results from CASPUR Storage Lab
Andrei Maslennikov, CASPUR Consortium
May 2003
A.Maslennikov - May 2003 - SLAB update
Participants:
CASPUR : M. Goretti, A. Maslennikov (*), M. Mililotti, G. Palumbo
ACAL FCS (UK) : N. Houghton
GARR : M. Carboni
CERN : M. Gug, G. Lee, R. Többicke, A. Van Praag
CNAF : P.P. Ricci, S. Zani
CSP Turin : R. Boraso
Nishan (UK) : S. Macfall
(*) Project Coordinator
Sponsors:
E4 Computer (Italy) : loaned 6 SuperMicro servers (MBs and assembly); excellent hardware quality and support
Intel : donated 12 x 2.8 GHz Xeon CPUs
San Valley Systems : loaned two SL1000 units; good remote CE support during the tests
ACAL FCS / Nishan : loaned two 4300 units; active participation in the tests, excellent support
Contents
• Goals
• Components and test setup
• Measurements:
  - SAN over WAN
  - NAS protocols
  - IBM GPFS
  - Sistina GFS
• Final remarks
• Vendors’ contact info
Goals for these test series
1. Feasibility study for a SAN-based Distributed Staging System
2. Comparison of the well-known NAS protocols on the latest commodity hardware
3. Evaluation of the new versions of IBM GPFS and Sistina GFS as a possible underlying technology for a scalable NFS server
Remote Staging
1. Feasibility study for a SAN-based Distributed Staging System
- Most of the large centers keep the bulk of their data on tape and use some kind of disk caching (staging, HSM, etc.) to access these data.
- Sharing datastores between several centers is frequently requested, which means that some kind of remote tape access mechanism has to be implemented.
- Suppose now that your centre has implemented a tape<->disk migration system, and you have to extend it to access data located on remote tape drives.
Let us see how this can be achieved.
Remote Staging
Solution 1: To access a remote tape file, stage it on a remote disk, then copy it via the network to the local disk.
[Diagram: remote tape -> remote disk -> network -> local disk]
Disadvantages:
- 2-step operation: more time is needed, harder to orchestrate
- Wasted remote disk space
Remote Staging
Solution 2: Use a "tape server": a process residing on a remote host that has access to the tape drive. The data are read remotely and then "piped" via the network directly to the local disk.
[Diagram: remote tape -> tape server -> network -> local disk]
Disadvantages:
- a remote machine is needed
- the architecture is quite complex
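The "tape server" pipe can be sketched as a shell pipeline. The host name, tape device and staging path below are hypothetical stand-ins, not details taken from the tests:

```shell
# Hypothetical sketch of Solution 2: a process on the remote tape host reads
# the drive, and the data are piped over the network straight to local disk,
# bypassing the remote staging disk. All names here are illustrative:
#
#   rsh tapehost.remote.site "dd if=/dev/nst0 bs=256k" \
#       | dd of=/stage/local/datafile bs=256k
#
# Local stand-in exercising the same pipeline shape without a tape drive:
dd if=/dev/zero bs=256k count=4 2>/dev/null | dd of=/tmp/staged.dat bs=256k 2>/dev/null
echo "staged $(wc -c < /tmp/staged.dat) bytes"
```

The point of the pipeline is that no remote disk space is consumed; the cost is the extra tape-server process and a more complex failure model.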
Remote Staging
Solution 3: Access the remote tape drive as a native device on the SAN, then use it as if it were a local unit attached to one of your local data movers.
[Diagram: remote tape on the SAN, accessed directly by the local data mover]
Benefits:
- Makes the staging software a lot simpler: the local, field-tested solution applies.
- Best performance guaranteed (provided the remote drive can be used locally at native speed)
Remote Staging
"Verify whether FC tape drives may be used at native speeds over the WAN, using SAN-over-WAN interconnection middleware."
- In 2002 we had already tried to reach this goal, using the CISCO 5420 iSCSI appliance to access an FC tape over a 400 km distance. We were able to write at the native speed of the drive, but the read performance was very poor.
- This year we were able to assemble a setup that implements a symmetric SAN interconnection, and we used it to repeat these tests.
NAS protocols
2. Benchmark the well-known NAS protocols on modern commodity hardware.
- We run these tests on a regular basis, as we want to know what performance we can currently count on, and how the different protocols compare on the same hardware base.
- Our test setup was visibly more powerful than last year's, so we expected to obtain better numbers.
- We compared two remote-copy protocols, RFIO and Atrans (cacheless AFS), and two protocols that provide transparent file access, NFS and AFS.
Goal 3
3. Evaluate the new versions of IBM GPFS and Sistina GFS as a possible underlying technology for a scalable NFS server.
- In 2002 we had already tried both GPFS and GFS.
- GFS 5.0 showed interesting performance figures, but we observed several issues: unbalanced performance with multiple clients, and an exponential increase of load on the lock server as the number of clients grew.
- GPFS 1.2 showed poor performance with concurrent writing on several storage nodes.
- We used GFS 5.1.1 and GPFS 1.3.0-2 during this test session.
Components
- High-end Linux units for both servers and clients: 6x SuperMicro Superserver 7042M-6 and 2x HP Proliant DL380, each with:
  2 Pentium IV Xeon 2.8 GHz CPUs
  SysKonnect 9843 Gigabit Ethernet NIC (fibre)
  Qlogic QLA2300 2 Gbit Fibre Channel HBA
  Myrinet HBA
- Disk systems: 4x Infortrend IFT-6300 IDE-to-FC arrays, each with:
  12 x Maxtor DiamondMax Plus 9 200 GB IDE disks (7200 rpm)
  Dual Fibre Channel outlet at 2 Gbit
  Cache: 256 MB
Components - 2
- Tape drives: 4x LTO/FC (IBM Ultrium 3580)
- Network:
  12-port NPI Keystone GE switch (fibre)
  28-port Dell 5224 GE switches (fibre / copper)
  Myricom Myrinet 8-port switch
  Fast geographical link (Rome-Bologna, 400 km) with a guaranteed throughput of 1 Gbit
- SAN:
  Brocade 2400, 2800 (1 Gbit) and 3800 (2 Gbit) switches
  San Valley Systems SL1000 IP-SAN Gateway
  Nishan IPS 4300 multiprotocol IP Storage Switch
Components - 3
New devices
We were loaned two new devices, one from San Valley Systems and one from Nishan Systems. Both units provide a SAN-over-IP interconnect function and are suitable for wide-area SAN connectivity.
Let me give some more detail on both units.
San Valley Systems IP-SAN Gateway SL-700 / SL-1000
- 1 or 4 wirespeed Fibre Channel-to-Gigabit Ethernet channels
- Uses UDP, and hence delegates the handling of a network outage to the application
- Easy to configure
- Allows fine-grained traffic shaping (step size 200 Kbit, from 1 Gbit/s down to 1 Mbit/s) and QoS
- Connecting two SANs over IP with a pair of SL1000 units is in all aspects equivalent to connecting these two SANs with a simple fibre cable
- Approximate cost: 20 KUSD/unit (SL-700, 1 channel); 30 KUSD/unit (SL-1000, 4 channels)
- Recommended number of units per site: 1
Nishan IPS 3300 / 4300 multiprotocol IP Storage Switch
- 2 or 4 wirespeed iFCP ports for SAN interconnection over IP
- Uses TCP and is capable of seamlessly handling network outages
- Allows traffic shaping at predefined bandwidths (8 steps, 1 Gbit - 10 Mbit) and QoS
- Implements an intelligent router function: allows multiple fabrics from different vendors to be interconnected, and makes them look like a single SAN
- When interconnecting two or more separately managed SANs, maintains their independent administration
- Approximate cost: 33 KUSD/unit (6 universal FC/GE ports + 2 iFCP ports, IPS 3300); 48 KUSD/unit (12 universal FC/GE ports + 4 iFCP ports, IPS 4300)
- Recommended number of units per site: 2 (to provide redundant routing)
CASPUR Storage Lab
[Testbed diagram: two sites linked by a 1 Gbit WAN over 400 km. Rome: FC SAN with disks and tapes, six SM 7042M-6 servers on Gigabit IP and Myrinet, an HP DL380, plus IPS 4300 and SL1000 gateways. Bologna: FC SAN and Gigabit IP, an HP DL380, plus IPS 4300 and SL1000 gateways.]
Series 1: accessing remote SAN devices
[Diagram: the HP DL380 in Bologna reaches disks and tapes on the Rome FC SAN through the IPS 4300 / SL1000 gateways over the 1 Gbit, 400 km WAN; the HP DL380 in Rome serves as the local reference.]
Series 1 - results
We were able to operate at wire speed (100 MB/sec over a 400 km distance) with both the SL-1000 and the IPS 4300 units!
- Both middleware devices worked fairly well.
- We were able to operate the tape drives at their native speed (read and write): 15 MB/sec in the case of LTO and 25 MB/sec in the case of another, faster drive.
- In the case of disk devices we observed a small (5%) loss of performance on writes and a more visible (up to 12%) loss on reads, on both units. Several powerful devices grab the whole available bandwidth of the GigE.
- In the case of Nishan (TCP-based SAN interconnection) we witnessed a successful job completion after an emulated 1-minute network outage.
Conclusion: Distributed Staging based on direct tape drive access is POSSIBLE.
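Why outage handling and buffering matter on this link can be seen from a rough bandwidth-delay estimate. This is an aside with assumed figures (propagation in fibre of roughly 200,000 km/s is not from the slides):

```shell
# Rough arithmetic, assumed figures: light in fibre travels ~200,000 km/s,
# so 400 km is ~2 ms one way, ~4 ms round trip. At 1 Gbit/s the data in
# flight (bandwidth-delay product) is what a TCP sender (Nishan) must keep
# buffered, and what a UDP sender (San Valley) loses during an outage.
rtt_ms=4                                  # ~2 x 400 km / 200,000 km/s
link_mbit=1000                            # 1 Gbit guaranteed WAN throughput
bdp_kb=$(( link_mbit * rtt_ms / 8 ))      # Mbit/s * ms / 8 = KB in flight
echo "in-flight data over the 400 km link: ~${bdp_kb} KB"
```

Half a megabyte in flight is modest for a disk stream but explains why a drive that stalls on late data (as in the 2002 iSCSI read tests) suffers far more on reads than on writes.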
Series 2 – Comparison of NAS protocols
[Diagram: Infortrend IFT-6300 array attached to the server over 2 Gbit FC; server and client connected by Gigabit Ethernet. Raw array performance: W 78 MB/sec, R 123 MB/sec.]
Series 2 - details
Some settings:
- Kernels on server: 2.4.18-27 (RedHat 7.3, 8.0)
- Kernel on client: 2.4.20-9 (RedHat 9)
- AFS: cache was set up on a ramdisk (400 MB)
- ext2 filesystem used on the server
Problems encountered:
- Poor array performance on reads with kernel 2.4.20-9
Series 2 – more detail
Write tests:
- Measured the average time needed to transfer 20 x 1.9 GB files from memory on the client to the disk of the file server, including the time needed to run the "sync" command on both client and server at the end of the operation:
  20 x {dd if=/dev/zero of=<filename on server> bs=1000k count=1900}
  T = Tdd + max(Tsync client, Tsync server)
Read tests:
- Measured the average time needed to transfer 20 x 1.9 GB files from the disk of the server to memory on the client (output directly to /dev/null).
Because of the large number of files in use, and a file size comparable with the available RAM on both client and server machines, caching effects were negligible.
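The write-test procedure can be sketched as a small script. The target directory and sizes below are parameters scaled down for illustration; the original runs wrote 20 x 1.9 GB files onto the server filesystem mounted on the client:

```shell
# Sketch of the Series 2 write test: time N dd transfers plus the final sync,
# then report the aggregate volume. Defaults are scaled down; the original
# tests used COUNT=20 and BLOCKS=1900 (1.9 GB per file).
TARGET=${TARGET:-/tmp/slabtest}   # stand-in for the server-mounted filesystem
COUNT=${COUNT:-2}                 # number of files written
BLOCKS=${BLOCKS:-10}              # 1000k blocks per file

mkdir -p "$TARGET"
start=$(date +%s)
i=1
while [ "$i" -le "$COUNT" ]; do
    dd if=/dev/zero of="$TARGET/file$i" bs=1000k count="$BLOCKS" 2>/dev/null
    i=$((i + 1))
done
sync    # the original measurement took max(Tsync client, Tsync server)
elapsed=$(( $(date +%s) - start ))
total_mb=$(( COUNT * BLOCKS ))    # each 1000k block is ~1 MB
echo "wrote ${total_mb} MB in ${elapsed} s"
```

For the read test the direction is simply reversed, e.g. dd if=<file on server> of=/dev/null bs=1000k, with no sync step needed.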
Series 2 – current results (MB/sec) [SM 7042, 2 GB RAM on server and client]

                          Write   Read
Pure disk                   78     123
RFIO                        78     111
NFS                         77      80
AFS cacheless (Atrans)      70      59
AFS                         48      30
Series 3a – IBM GPFS
[Diagram: four SM 7042M-6 nodes interconnected via Myrinet, all attached to the FC SAN with 4x IFT-6300 disk arrays; the GPFS file system is also exported via NFS.]
Series 3a - details
GPFS installation:
- GPFS version 1.3.0-2
- Kernel 2.4.18-27.8.0smp
- Myrinet as the server interconnection network
- All nodes see all disks (NSDs)
What was measured:
1) Read and write transfer rates (memory <-> GPFS file system) for large files
2) Read and write rates (memory on NFS client <-> GPFS exported via NFS)
Series 3a – GPFS native (MB/sec)
R / W speed for a single disk array: 123 / 78

              Read   Write
1 node          96     117
2 nodes        135     127
3 nodes        157     122
Series 3a – GPFS exported via NFS (MB/sec)

1 node exporting:
            1 client   2 clients   3 clients   9 clients
  Read         35          44          44          44
  Write        55          73          83          88

2 nodes exporting:
            2 clients   4 clients   6 clients
  Read         60          72          85
  Write        90         106         106

3 nodes exporting:
            3 clients   6 clients   9 clients
  Read         84         113         120
  Write       107         106         106
Series 3b – Sistina GFS
[Diagram: four SM 7042M-6 I/O nodes on the FC SAN with 4x IFT-6300 disk arrays, exporting via NFS; a fifth SM 7042M-6 runs the lock server.]
Series 3b - details
GFS installation:
- GFS version 5.1.1
- Kernel: SMP 2.4.18-27.8.0.gfs (may be downloaded from Sistina together with the trial distribution; includes all the required drivers)
Problems encountered:
- The kernel-based NFS daemon does not work well on GFS nodes (I/O ends in error). Sistina is aware of the bug and is working on it using our setup. We hence used the user-space NFSD in these tests; it was quite stable.
What was measured:
1) Read and write transfer rates (memory <-> GFS file system) for large files
2) Same for the case (memory on NFS client <-> GFS exported via NFS)
Series 3b – GFS native (MB/sec)
NB: out of 5 nodes, 1 node ran the lock server process and 4 nodes did only I/O.
R / W speed for a single disk array: 123 / 78

              Read   Write
1 client       122     156
2 clients      230     245
3 clients      291     297
4 clients      330     300
Series 3b – GFS exported via NFS (MB/sec)
NB: user-space NFSD was used.

1 node exporting:
            1 client   2 clients   4 clients   8 clients
  Read         54          67          78          93
  Write        56          64          64          61

3 nodes exporting:
            3 clients   6 clients   9 clients
  Read        145         194         207
  Write       164         190         185

4 nodes exporting:
            8 clients
  Read        250
  Write       236
Final remarks
- We are proceeding with the test program. Currently under test: new middleware from CISCO and a new tape drive from Sony. We are also expecting a new iSCSI appliance from HP and an LTO2 drive.
- We are open to any collaboration.
Vendors' contact info
- Supermicro servers for Italy: E4 Computer, Vincenzo Nuti - [email protected]
- FC over IP:
  San Valley Systems, John McCormack - [email protected]
  Nishan Systems, Stephen Macfall - [email protected]
  ACAL FCS, Nigel Houghton - [email protected]