Initial Data Access Module & Lustre Deployment
Tan Li
Date posted: 21-Dec-2015
Outline
• Disk I/O test for netqos03 and netqos04
• Initial design for file I/O module
  - Data read with different functions and buffer sizes
  - Data read with fread() with different wait times and buffer sizes
  - Some conclusions
• Intro to Lustre setup
• Lustre deployment for the new servers
Initial Design for Data Access
Current data access module (block sizes: 100K, 1M, 10M, 100M, 500M for a 100G file)
Initial design for file I/O module
1. Header file: ftp_io.h
2. Data access functions:
int ftp_open(char *path, int block_size, int mode);
int ftp_read(int infile_fd, char *out_buf, int block_size);
int ftp_write(int outfile_fd, char *in_buf, int block_size);
int ftp_close(int close_fd, int block);
Usage of ftp_open(): the block size is passed to the function so that it can decide the open method (open, fopen, or open with O_DIRECT). The close method used by ftp_close() must match the open method chosen by ftp_open(). mode=0 opens for reading and mode=1 opens for writing.
Initial design for file I/O module
[Flowchart: ftp_open() decision logic]
1. Block size > 400K? (No / Yes)
2. On either branch: Mode = 0 or 1?
3. Open methods selected: open/fopen (read only), open/fopen (write only), open with O_DIRECT (read only), open with O_DIRECT (write only)
4. Return the file descriptor
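The decision logic in the flowchart can be sketched in C roughly as follows. This is a minimal sketch and not the module's actual code. It assumes that blocks above the threshold take the O_DIRECT path, that "400K" means 400 KiB, and that the write path creates and truncates the file. The names FTP_MODE_READ, FTP_MODE_WRITE, FTP_DIRECT_THRESHOLD, and ftp_use_direct() are hypothetical. Only open() is used here, so the fopen variant is left out.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical names; the slides only state mode=0 (read) and mode=1 (write). */
#define FTP_MODE_READ  0
#define FTP_MODE_WRITE 1
/* Assumption: the "400K" threshold is 400 KiB. */
#define FTP_DIRECT_THRESHOLD (400 * 1024)

/* Returns 1 when this sketch would pick O_DIRECT for the given block size. */
static int ftp_use_direct(int block_size)
{
    return block_size > FTP_DIRECT_THRESHOLD;
}

/* Opens path for reading (mode 0) or writing (mode 1).
 * Returns a file descriptor, or -1 on error or unknown mode. */
int ftp_open(char *path, int block_size, int mode)
{
    int flags;

    if (mode == FTP_MODE_READ)
        flags = O_RDONLY;
    else if (mode == FTP_MODE_WRITE)
        flags = O_WRONLY | O_CREAT | O_TRUNC;   /* assumption: overwrite */
    else
        return -1;

#ifdef O_DIRECT
    /* Assumption: large blocks bypass the page cache. */
    if (ftp_use_direct(block_size))
        flags |= O_DIRECT;
#endif

    return open(path, flags, 0644);
}
```

Because the descriptor alone does not record which open method was chosen, a matching ftp_close() would need the same block size to pick the corresponding close path, which is consistent with the block argument in its prototype.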
Initial design for file I/O module
Problem with O_DIRECT when writing data
When writing data with O_DIRECT, the block size must be a multiple of 512 bytes on our platform, so the last few bytes of a file cannot be written this way.
Possible solutions:
1. Use the regular write() to output the remaining data.
2. Integrate the open function into the read and write functions.
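The first possible solution can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: the aligned part of the buffer is written first, then O_DIRECT is cleared on the descriptor with fcntl(F_SETFL) and the unaligned tail is written with a regular write(). The names FTP_ALIGN, ftp_aligned_part(), write_all(), and ftp_write_with_tail() are hypothetical. With O_DIRECT the buffer address usually has to be aligned as well (for example via posix_memalign()), which is left to the caller here.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* 512-byte multiple required for O_DIRECT writes on the platform in the slides. */
#define FTP_ALIGN 512

/* Largest prefix of len that is a multiple of FTP_ALIGN. */
static size_t ftp_aligned_part(size_t len)
{
    return len - (len % FTP_ALIGN);
}

/* Calls write() until len bytes are written; returns -1 on failure. */
static ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n <= 0)
            return -1;
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Writes the aligned bulk as-is, then drops O_DIRECT (if it is set)
 * and writes the remaining bytes with a regular buffered write(). */
ssize_t ftp_write_with_tail(int fd, const char *buf, size_t len)
{
    size_t bulk = ftp_aligned_part(len);
    size_t tail = len - bulk;

    if (bulk > 0 && write_all(fd, buf, bulk) < 0)
        return -1;

    if (tail > 0) {
#ifdef O_DIRECT
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
            return -1;
        if ((flags & O_DIRECT) && fcntl(fd, F_SETFL, flags & ~O_DIRECT) < 0)
            return -1;
#endif
        if (write_all(fd, buf + bulk, tail) < 0)
            return -1;
    }
    return (ssize_t)len;
}
```

For a 1000-byte buffer this writes 512 bytes on the direct path and the remaining 488 bytes through the page cache. The descriptor stays in buffered mode afterwards, so this fits best as the final write of a file.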
Data reading test on fread()
1. Test results from the Linux time tool
2. Test results from nmon (recording data every two seconds)
Data reading test on fread()
Some conclusions
• Bandwidth grows as the buffer size increases, especially when the buffer size goes from 100K to 1000K (3 times).
• Bandwidth is not sensitive to the wait time until the wait time reaches some threshold. The larger the buffer size, the less sensitive the bandwidth is to the delay.
• CPU utilization is 0% when the buffer size is below 100K, and it grows as the buffer size increases.
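A measurement of this kind could be reproduced with a small harness like the one below. It is a hedged sketch, not the test program behind the slides: fread_bench() is a hypothetical helper that reads a whole file with fread() using a given buffer size, optionally sleeps between reads to model the wait time, and reports the elapsed wall-clock time so that bandwidth can be computed as bytes divided by seconds.

```c
#define _POSIX_C_SOURCE 200809L   /* clock_gettime(), nanosleep() */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Reads the file at path with fread() in chunks of buf_size bytes,
 * sleeping wait_us microseconds after each chunk when wait_us > 0.
 * Returns the number of bytes read, or -1 on error; the elapsed
 * wall-clock time in seconds is stored in *seconds. */
long long fread_bench(const char *path, size_t buf_size, long wait_us,
                      double *seconds)
{
    struct timespec t0, t1, pause;
    long long total = 0;
    size_t n;
    char *buf;
    FILE *fp;

    if (buf_size == 0)
        return -1;
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    buf = malloc(buf_size);
    if (buf == NULL) {
        fclose(fp);
        return -1;
    }

    pause.tv_sec = wait_us / 1000000;
    pause.tv_nsec = (wait_us % 1000000) * 1000;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = fread(buf, 1, buf_size, fp)) > 0) {
        total += (long long)n;
        if (wait_us > 0)
            nanosleep(&pause, NULL);   /* models the per-read wait time */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *seconds = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

    free(buf);
    fclose(fp);
    return total;
}
```

Calling it in a loop over the buffer sizes and wait times of interest, and dividing the returned byte count by the elapsed seconds, gives one bandwidth figure per combination. Page-cache effects would need to be controlled separately between runs.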
iWARP and InfiniBand
Hardware
  InfiniBand: specialized I/O structure
  iWARP: a set of mechanisms over Ethernet that move data management and network protocol processing to the RNIC card
Transport method
  InfiniBand: point-to-point
  iWARP: end-to-end
Compatibility
  InfiniBand: specialized infrastructure
  iWARP: fully compatible with existing Ethernet switching
Vendors
  InfiniBand: only two, Mellanox and QLogic
  iWARP: a broad range of vendors
RoCEE
RoCEE = InfiniBand over Ethernet (IBoE)
The RDMA over Converged Enhanced Ethernet (RoCEE) protocol proposal is designed to allow the deployment of RDMA semantics on a Converged Enhanced Ethernet fabric by running the IB transport protocol over Ethernet frames. In other words, it takes the InfiniBand transport layer and packages it into Ethernet frames, instead of using the iWARP protocol, for Ethernet-based high-performance cluster networking.
RoCEE
Problem 1: iWARP already delivers the performance benefit that RoCEE offers.
Problem 2: RoCEE is hard to implement.
Problem 3: RoCEE depends on the deployment of 10GbE CEE infrastructure; currently only one vendor (Cisco) offers CEE switches, and they come at relatively high price points.