1
Cplant I/O
Pang Chen
Lee Ward
Sandia National Laboratories
Scalable Computing Systems
Fifth NASA/DOE Joint PC Cluster
Computing Conference
October 6-8, 1999
2
Conceptual Partition Model
[Diagram: the Service, Compute, File I/O, and Net I/O partitions, with Users and /home.]
3
File I/O Model
• Support large-scale unstructured grid applications.
– Manipulate single file per application, not per processor.
• Support collective I/O libraries.
– Require fast concurrent writes to a single file.
4
Problems
• Need a file system NOW!
• Need scalable, parallel I/O.
• Need file management infrastructure.
• Need to present the I/O subsystem as a single parallel file system both internally and externally.
• Need production-quality code.
5
Approaches
• Provide independent access to file systems on each I/O node.
– Can’t stripe across multiple I/O nodes to get better performance.
• Add a file management layer to “glue” the independent file systems so as to present a single file view.
– Require users (both on and off Cplant) to differentiate between this “special” file system and other “normal” file systems.
– Lots of special utilities are required.
• Build our own parallel file system from scratch.
– A lot of work just to reinvent the wheel, let alone the right wheel.
• Port other parallel file systems into Cplant.
– Also a lot of work with no immediate payoff.
6
Current Approach
• Build our I/O partition as a scalable nexus between Cplant and external file systems.
+ Leverage off existing and future parallel file systems.
+ Allow immediate payoff with Cplant accessing existing file systems.
+ Reduce data storage, copies, and management.
– Expect lower performance with non-local file systems.
– Waste external bandwidth when accessing scratch files.
7
Building the Nexus
• Semantics
– How can and should the compute partition use this service?
• Architecture
– What are the components and the protocols between them?
• Implementation
– What do we have now, and what do we hope to achieve in the future?
8
Compute Partition Semantics
• POSIX-like.
– Allow users to be in a familiar environment.
• No support for ordered operations (e.g., no O_APPEND).
• No support for data locking.
– Enable fast non-overlapping concurrent writes to a single file.
– Prevent a job from slowing down the entire system for others.
• Additional call to invalidate buffer cache.
– Allow file views to synchronize when required.
9
Cplant I/O
[Diagram: a row of I/O nodes linking Cplant to Enterprise Storage Services.]
10
Architecture
• I/O nodes present a symmetric view.
– Every I/O node behaves the same (except for the cache).
– Without any control, a compute node may open a file with one I/O node, and write that file via another I/O node.
• I/O partition is fault-tolerant and scalable.
– Any I/O node can go down without the system losing jobs.
– Appropriate number of I/O nodes can be added to scale with the compute partition.
• I/O partition is the nexus for all file I/O.
– It provides our POSIX-like semantics to the compute nodes and accomplishes tasks on their behalf outside the compute partition.
• Links/protocols to external storage servers are server dependent.
– External implementation hidden from the compute partition.
11
Compute -- I/O node protocol
• Base protocol is NFS version 2.
– Stateless protocols allow us to repair faulty I/O nodes without aborting applications.
– Inefficiency/latency between the two partitions is currently moot; the bottleneck is not here.
• Extensions/modifications:
– Larger I/O requests.
– Propagation of a call to invalidate cache on the I/O node.
12
Current Implementation
• Basic implementation of the I/O nodes
• Have straight NFS inside Linux with ability to invalidate cache.
• I/O nodes have no cache.
• I/O nodes are dumb proxies knowing only about one server.
• Credentials rewritten by the I/O nodes and sent to the server as if the requests came from the I/O nodes.
• I/O nodes are attached via 100BaseT links to a Gigabit Ethernet, with an SGI O2K as the (XFS) file server on the other end.
• Don’t have jumbo packets.
• Bandwidth is about 30MB/s with 18 clients driving 3 I/O nodes, each using about 15% of CPU.
13
Current Improvements
• Put a VFS infrastructure into the I/O node daemon.
– Allow access to multiple servers.
– Allow a Linux /proc interface to tune individual I/O nodes quickly and easily.
– Allow vnode identification to associate buffer cache with files.
• Experiment with a multi-node server (SGI/CXFS).
14
Future Improvements
• Stop retries from going out of network.
• Put in jumbo packets.
• Put in read cache.
• Put in write cache.
• Port over Portals 3.0.
• Put in bulk data services.
• Allow dynamic compute-node-to-I/O-node mapping.
15
Looking for Collaborations
Lee Ward
505-844-9545
Pang Chen
510-796-9605