Page 1:

Cplant I/O

Pang Chen
Lee Ward

Sandia National Laboratories
Scalable Computing Systems

Fifth NASA/DOE Joint PC Cluster Computing Conference

October 6-8, 1999

Page 2:

Conceptual Partition Model

[Diagram: conceptual partition model showing Users, Service, Compute, File I/O, and Net I/O partitions, with a /home file store]

Page 3:

File I/O Model

• Support large-scale unstructured grid applications.
– Manipulate a single file per application, not a file per processor.

• Support collective I/O libraries.
– Require fast concurrent writes to a single file (see the sketch below).
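To make the target write pattern concrete, here is a minimal sketch in C, assuming ordinary POSIX calls on the compute nodes; the file name, block size, and per-process layout are illustrative assumptions, not part of the Cplant interface.

    /* Minimal sketch (not Cplant's API): each process of a job writes its
     * own non-overlapping block of one shared file, so no locking and no
     * O_APPEND-style ordering are needed.                                 */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE (1 << 20)   /* 1 MB per process -- assumed, not prescribed */

    void write_my_block(int rank)  /* rank: this process's index within the job */
    {
        char *buf = malloc(BLOCK_SIZE);
        memset(buf, rank & 0xff, BLOCK_SIZE);

        /* Every process opens the same file; the file name is hypothetical. */
        int fd = open("/home/job/output.dat", O_WRONLY | O_CREAT, 0644);

        /* Seek to this process's own region, so writes never overlap. */
        lseek(fd, (off_t)rank * BLOCK_SIZE, SEEK_SET);
        write(fd, buf, BLOCK_SIZE);

        close(fd);
        free(buf);
    }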

Page 4:

Problems

• Need a file system NOW!

• Need scalable, parallel I/O.

• Need file management infrastructure.

• Need to present the I/O subsystem as a single parallel file system both internally and externally.

• Need production-quality code.

Page 5:

Approaches

• Provide independent access to file systems on each I/O node.
– Can’t stripe across multiple I/O nodes to get better performance.

• Add a file management layer to “glue” the independent file systems so as to present a single file view.
– Requires users (both on and off Cplant) to differentiate between this “special” file system and other “normal” file systems.
– Lots of special utilities are required.

• Build our own parallel file system from scratch.
– A lot of work just to reinvent the wheel, let alone the right wheel.

• Port other parallel file systems to Cplant.
– Also a lot of work, with no immediate payoff.

Page 6:

Current Approach

• Build our I/O partition as a scalable nexus between Cplant and external file systems.
+ Leverage existing and future parallel file systems.
+ Allow immediate payoff, with Cplant accessing existing file systems.
+ Reduce data storage, copies, and management.
– Expect lower performance with non-local file systems.
– Waste external bandwidth when accessing scratch files.

Page 7:

Building the Nexus

• Semantics
– How can and should the compute partition use this service?

• Architecture
– What are the components and the protocols between them?

• Implementation
– What do we have now, and what do we hope to achieve in the future?

Page 8:

Compute Partition Semantics

• POSIX-like.
– Allow users to work in a familiar environment.

• No support for ordered operations (e.g., no O_APPEND).

• No support for data locking.
– Enable fast non-overlapping concurrent writes to a single file.
– Prevent a job from slowing down the entire system for others.

• Additional call to invalidate the buffer cache (sketched below).
– Allow file views to synchronize when required.
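A hedged illustration of how these semantics could look from a compute node. The POSIX calls are the familiar ones; cplant_cache_invalidate() is a hypothetical name standing in for the additional cache-invalidation call described above, not a documented Cplant entry point.

    #include <unistd.h>

    /* Hypothetical name for the extra call described on this slide; the
     * real Cplant entry point may be spelled differently.               */
    extern int cplant_cache_invalidate(int fd);

    void read_region_written_elsewhere(int fd, char *buf, size_t len, off_t where)
    {
        /* With no locking and no ordering guarantees, another node may have
         * rewritten this region.  Drop any locally cached pages first ...   */
        cplant_cache_invalidate(fd);

        /* ... then read with ordinary POSIX-style calls. */
        lseek(fd, where, SEEK_SET);
        read(fd, buf, len);
    }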

Page 9:

Cplant I/O

[Diagram: compute nodes connected through a row of I/O nodes to Enterprise Storage Services]

Page 10:

Architecture

• I/O nodes present a symmetric view.
– Every I/O node behaves the same (except for the cache).
– Without any control, a compute node may open a file with one I/O node and write that file via another I/O node.

• I/O partition is fault-tolerant and scalable.
– Any I/O node can go down without the system losing jobs.
– An appropriate number of I/O nodes can be added to scale with the compute partition.

• I/O partition is the nexus for all file I/O.
– It provides our POSIX-like semantics to the compute nodes and accomplishes tasks on their behalf outside the compute partition.

• Links/protocols to external storage servers are server dependent.
– External implementation is hidden from the compute partition.

Page 11:

Compute -- I/O Node Protocol

• Base protocol is NFS version 2.
– Stateless protocols allow us to repair faulty I/O nodes without aborting applications (see the sketch below).
– Inefficiency/latency between the two partitions is currently moot; the bottleneck is not here.

• Extensions/modifications:
– Larger I/O requests.
– Propagation of a call to invalidate the cache on the I/O node.
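A sketch of why statelessness matters here: when every request carries everything needed to service it, any of the symmetric I/O nodes can handle it, and a repaired node simply starts servicing retries. The structure below is an illustrative stand-in, not NFS v2's actual wire format and not Cplant's extension of it.

    #include <stdint.h>

    /* Illustrative stand-in for a stateless write request.  Because the
     * file handle, offset, and count travel with every request, no I/O
     * node has to remember anything between calls.                      */
    struct io_write_request {
        uint8_t  fhandle[32];   /* opaque server file handle (NFS v2 uses 32 bytes) */
        uint64_t offset;        /* where in the file to write                       */
        uint32_t count;         /* bytes of data that follow; the "larger I/O
                                   requests" extension amounts to letting this
                                   exceed NFS v2's usual 8 KB transfer size         */
        /* data bytes follow */
    };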

Page 12:

Current Implementation

• Basic implementation of the I/O nodes.

• Have straight NFS inside Linux, with the ability to invalidate the cache.

• I/O nodes have no cache.

• I/O nodes are dumb proxies that know about only one server.

• Credentials are rewritten by the I/O nodes and sent to the server as if the requests came from the I/O nodes.

• I/O nodes are attached via 100BaseT links to a Gigabit Ethernet, with an SGI O2K as the (XFS) file server on the other end.

• Don’t have jumbo packets.

• Bandwidth is about 30 MB/s with 18 clients driving 3 I/O nodes, each using about 15% of CPU.
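A rough back-of-envelope reading of that figure, assuming 100BaseT tops out near 12.5 MB/s of raw bandwidth:

    30 MB/s / 3 I/O nodes  ≈ 10 MB/s per I/O node  (roughly 80% of one 100BaseT link)
    30 MB/s / 18 clients   ≈ 1.7 MB/s per client

which suggests the per-node 100BaseT links to the external network, rather than the Gigabit uplink or the clients, are running close to saturation.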

Page 13:

Current Improvements

• Put a VFS infrastructure into the I/O node daemon.
– Allow access to multiple servers.
– Allow a Linux /proc interface to tune individual I/O nodes quickly and easily (illustrated below).
– Allow vnode identification to associate buffer cache with files.

• Experiment with a multi-node server (SGI CXFS).
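As an illustration of the /proc-style tuning described above, a daemon-exported knob could be adjusted at run time by writing to a file; the path and parameter name below are hypothetical, not the daemon's actual interface.

    #include <stdio.h>

    /* Hypothetical /proc knob; the names actually exported by the
     * Cplant I/O node daemon may differ.                           */
    int set_readahead_kb(int kb)
    {
        FILE *f = fopen("/proc/cplant/io/readahead_kb", "w");
        if (f == NULL)
            return -1;
        fprintf(f, "%d\n", kb);
        fclose(f);
        return 0;
    }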

Page 14:

Future Improvements

• Stop retries from going out of the network.

• Put in jumbo packets.

• Put in read cache.

• Put in write cache.

• Port over Portals 3.0.

• Put in bulk data services.

• Allow dynamic compute-node-to-I/O-node mapping (sketched below).
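A hedged sketch of what that last item could mean, contrasting a fixed mapping with a dynamic one; the load metric and function names are invented for illustration, not taken from Cplant.

    /* Static mapping: a compute node always talks to the same I/O node. */
    int io_node_static(int compute_rank, int num_io_nodes)
    {
        return compute_rank % num_io_nodes;
    }

    /* Dynamic mapping (illustrative): pick whichever I/O node currently
     * reports the lightest load, so I/O nodes can be added, removed, or
     * repaired without re-partitioning the compute side.                */
    int io_node_dynamic(const int *load, int num_io_nodes)
    {
        int i, best = 0;
        for (i = 1; i < num_io_nodes; i++)
            if (load[i] < load[best])
                best = i;
        return best;
    }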

Page 15:

Looking for Collaborations

Lee Ward

505-844-9545

[email protected]

Pang Chen

510-796-9605

[email protected]