Object Storage: Scalable Bandwidth for HPC Clusters

Garth A. Gibson, Brent B. Welch, David F. Nagle, Bruce C. Moxon

{ ggibson, bwelch, dnagle, bmoxon }@panasas.com
Panasas Inc., 6520 Kaiser Dr., Fremont, CA 94555

This paper describes the Object Storage Architecture solution for cost-effective, high-bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system that scales to very large sizes and performance levels without sacrificing cost-effectiveness or ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage Area Network (SAN), and Network-Attached Storage (NAS) implementations, fail to find a balance between performance, ease of use, and cost as the storage system scales up. In contrast, building storage systems as specialized storage clusters using commodity off-the-shelf (COTS) components promises excellent price-performance at scale, provided that binding them into a single system image and linking them to HPC compute clusters can be done without introducing bottlenecks or management complexity. A file interface (typified by NAS systems) at each storage cluster component is too high-level to provide scalable bandwidth and simple management across large numbers of components, and a block interface (typified by SAN systems) is too low-level to avoid synchronization bottlenecks in a shared storage cluster. An object interface (typified by the inode layer of traditional file system implementations) is at the intermediate level needed for independent, highly parallel operation at each storage cluster component under centralized, but infrequently applied, control. The Object Storage Device (OSD) interface achieves this independence by storing an unordered collection of named variable-length byte arrays, called objects, and embedding extensible attributes, fine-grain capability-based access control, and encapsulated data layout and allocation into each object. With this higher-level interface, object storage clusters are capable of highly parallel data transfers between storage and compute cluster nodes under the infrequently applied control of out-of-band metadata managers. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.

1 The HPC Storage Bandwidth Problem

A structural transformation is taking place within the HPC environment. Traditional low-volume, proprietary systems are being replaced with clusters of computers made from commodity, off-the-shelf (COTS) components and free operating systems such as Linux. These compute clusters deliver new levels of application performance and allow cost-effective scaling to 10s and, soon, 100s of Tflops. The large datasets and main memory checkpoints of such science-oriented cluster computations also demand record-breaking data throughput from the storage system. One rule of thumb is that 1 GB/sec of storage bandwidth is needed per Tflop in the computing cluster [SGSRFP01]. Complicating matters for the HPC community is the fact that storage bandwidth issues are given low priority by mainstream storage vendors, because it is expensive and difficult to provide high bandwidth using traditional storage architectures and there is a limited market for systems that scale to HPC levels.

As an example of this challenge, BP’s seismic analysis supercomputing, which cost as much as $80 million per Tflop in 1997, today costs about $2 million per Tflop. The 170 TB of storage on this 7.5 Tflop Intel-Linux cluster today costs about $15 thousand per TB. BP hopes to cut the cost of each Tflop and each TB by as much as 50% by the end of 2003 [Knott03].

Combining BP’s example with the bandwidth rule of thumb and adjusting to round numbers gives us a simple model for science-oriented cluster computing requirements for early in 2004:

• per Tflop, a cluster roughly needs 10 TB storage sustaining 1 GB/sec and costing $100,000

Capital equipment costs are only a part of the total cost of ownership. The cost of operating, or managing, a storage system often adds up to more than the capital costs over the lifetime of the system. Storage management tasks include installing and configuring new hardware, allocating space to functions or users, moving collections of files between subsystems to load and capacity balance, taking backups, replacing failed equipment and reconstructing or restoring lost data, creating new users, and resolving performance problems or capacity requests from users. The costs of storage management tasks are driven by the loaded labor rates of experienced Linux/Unix cluster, network, server and storage administrators and are typically calculated as cluster nodes per administrator or terabytes per administrator.

HPC clusters, in contrast to monolithic supercomputers, have many more subsystems to be managed. COTS clusters, which are typically built from comparatively small computers and storage subsystems, are likely to have the highest number of subsystems per Tflop provided. Additionally, computational algorithms for clusters usually decompose a workload into thousands or millions of tasks, each of which is executed (mostly) independently. This algorithmic strategy often requires the decomposition of stored data into partitions and replicas, whose placement and balancing in a cluster can be a time-consuming set of tasks for cluster operators and users, especially in large cluster and grid computing environments shared amongst a number of projects or organizations, and in environments where core datasets change regularly.

With these extra storage management difficulties compounding the cutting-edge demands of HPC cluster scalability and bandwidth, HPC cluster designers need to carefully consider the storage architecture they employ. In this paper we review common hardware and software architectures and contrast these to the new Object Storage architecture for use in HPC cluster storage. Qualitatively, we seek storage architectures that:

• Scale to PBs of stored data and 100s of GB/sec of storage bandwidth.

• Leverage COTS hardware, including both networking and disk drives, for cost-effectiveness.

• Unify management of storage subsystems to simplify and lower operational costs.

• Share stored files with non-cluster nodes to simplify application development and experiment pre- and post-processing.

• Grow capacity and performance incrementally and independently to cost-effectively customize to a cluster application’s unique balance between size and bandwidth.

2 Storage Architectures for Cluster Computing

The fundamental tradeoffs in storage architectures are tied to two basic issues: 1) the semantics of the storage interface (i.e., blocks versus files), and 2) the flow of metadata and data (i.e., control and data traffic flow) between storage and applications. The interface is important because it defines the granularity of access, locking and synchronization and the security for access to shared data. Traffic flow fundamentally defines the parallelism available for bandwidth. The architectural flexibility and implementation costs of these two basic storage properties ultimately determine the performance and scalability of any storage architecture.

Consider the common block-based disk interface, commonly referred to as Direct Attached Storage (DAS) and Storage Area Network (SAN). DAS and SAN have historically been managed by separate file system or database software located on a single host. Performance is good at the small scale, but bottlenecks appear as these systems scale. Moreover, fine-grained data sharing on different hosts is difficult with DAS, requiring data copies that significantly reduce performance. Therefore, most file systems are not distributed over multiple hosts, except to use a single secondary host for failover, in which the secondary host takes over control of shared disks when the primary host fails.

The high-level network file service interfaces, including the NFS and CIFS protocols and broadly known as Network Attached Storage (NAS), overcome many of the block-level interface limitations. Presenting a file/directory interface, NAS servers can dynamically and efficiently mediate requests among multiple users, avoiding the sharing problems of DAS/SAN. The high-level file interface also provides secure access to storage and enables low-level performance optimizations, including file-based pre-fetching and caching of data.

Fig. 1. Traditional Scalable Bandwidth Cluster vs. Out-of-Band Scalable Bandwidth

However, the traditional NAS in-band data flow forces all data to be copied through the server, resulting in a performance bottleneck at the server’s processor, memory subsystem, and network interface. To overcome this limitation, recent storage architectures have decoupled data and metadata access using an out-of-band architecture where clients fetch metadata from servers, directly accessing data from storage and avoiding the server bottleneck. Unfortunately, the client’s out-of-band data accesses must utilize the DAS/SAN block-based interface, working behind the NAS file-level interface and eliminating many of the security and performance benefits provided by NAS.

The new object-based storage architecture combines the performance benefits of DAS with the manageability and data sharing features of NAS. The object-based storage interface is richer than the block-based interface, hiding details like sectors by providing access to objects, named ranges of bytes on a storage device whose access is cryptographically secured to enable sharing among untrusted clients. Moreover, the object-based storage architecture efficiently supports out-of-band data transfer, enabling high bandwidth data flow between clients and storage while preserving many of the performance and security enhancements available for NAS systems.

In the following sections, we discuss the tradeoffs between different block-based, NAS-based and object-based systems with in-band and out-of-band data movement.

2.1 Scaling at the Disk Abstraction

Direct Attached Storage (DAS) and its evolution as Storage Area Network (SAN) storage are the dominant storage architectures in use today.

2.1.1 Disk Per Node

Because the commodity PC components used in the nodes of most HPC clusters usually come with a local disk, some cluster designers have chosen to use these disks as the cluster’s primary storage system [Fineberg99]. In some ways this is a superb HPC solution because today each disk in a node provides 10 to 20 MB/sec of storage bandwidth and 80 to 200 GB of storage capacity for $1 to $2 per GB. Given that compute nodes today provide 2 to 10 Gflops, depending on the number and type of CPUs, a one-disk-per-node storage solution offers 0.2 to 0.5 GB/sec bandwidth per Tflop and 2 to 4 TB capacity per Tflop. With two to five disks per node, this approach to building cluster storage systems meets our 2004 storage system target bandwidth and capacity, and has capital costs that are much less than our 2004 target cost.

The downside of this approach to cost-effective scalable storage bandwidth is its implications for programmer complexity, reliability and manageability.

Programmer complexity arises because the data path between a disk and a processor is only fast and dedicated when both are on the same node. With a simple disk-per-node storage architecture there is no path at all when the data needed at a processor is on a disk on a different node. For these non-local accesses it is the application programmer’s job to move the data as an integral part of an application. Algorithms have to be tailored to this restriction, often specializing the uses of the cluster to only a very few applications that have been appropriately tailored, such as out-of-core database sort machines [Fineberg99]. Storage-tailored applications that cannot adapt the computation on each node to where the data is dynamically found must first move the data. For instance, if input to a 32-node run is determined by output from a prior run on 16 different nodes, special transforming copy programs may be needed to transform the 16-node output files (on nodes 10-25, for example) to the 32-node input files (on nodes 4-35). This extra transforming work effectively reduces compute performance and storage bandwidth, and it costs scarce human development time.
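
As a concrete illustration of such a transforming copy, the following minimal sketch repartitions a record-oriented dataset left by a 16-node run into the layout a 32-node run expects. The file names, the fixed record size, and the round-robin dealing of records are illustrative assumptions, not part of the paper; a real transformer would also have to match the application's actual partitioning function.

    RECORD_SIZE = 4096  # assumed fixed-size records

    def repartition(old_files, new_files, record_size=RECORD_SIZE):
        """Read each old partition in turn and deal its records round-robin to the new partitions."""
        outs = [open(path, "wb") for path in new_files]
        index = 0
        try:
            for path in old_files:
                with open(path, "rb") as src:
                    while True:
                        record = src.read(record_size)
                        if not record:
                            break
                        outs[index % len(outs)].write(record)
                        index += 1
        finally:
            for f in outs:
                f.close()

    # e.g. repartition(["out.%d" % i for i in range(16)], ["in.%d" % i for i in range(32)])

Every byte is read and rewritten once, which is exactly the lost compute time and storage bandwidth described above.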

With data stored across the nodes of the computing cluster, compute node failure is also the loss of gigabytes of stored data. If applications must also be written to create, find and restore replicas of data on other nodes, development time may be significantly increased, and data dependability weakened. If a system service implements mirroring across nodes, then useful capacity and write bandwidth are at least halved. Additionally, disks inside a compute node add to the reasons that a node fails. Five disks significantly increase node failure rates, possibly causing users to take more frequent checkpoints, which also lowers storage bandwidth. Adding RAID across the disks inside a node can reduce the frequency that disk failure causes immediate failure of the node, but it lowers per-node capacity, lowers per-node bandwidth, and adds per-node cost for RAID controllers. And because the rate of individual disk failure is not changed by RAID, administrators still have to get inside cluster nodes and replace failed disks, a potentially disruptive and error-prone management activity.

Administrators are also saddled with the management of a relatively small file system on every node. Balancing the available capacity without impacting programmer decisions about where data is stored is hard. And because small file systems fill quickly, administrators are likely to be called far more often to inspect and manipulate 1000s of small file systems. Moreover, the compute capabilities and storage capabilities of the cluster are not independently changeable. Based on characteristics of the important applications, and the sizes of cost-effective disks and CPUs, there may be a small number of reasonable ratios of disk to CPU, leading to over-designed systems, another way of paying for more than you need. For large clusters, this co-dependency leads to severe capacity increment scenarios. For example, to upgrade the overall storage capacity of a 1000-node cluster using 36 GB local disks, the smallest capacity increment may be 36 TB (upgrading a 36 GB drive to a 72 GB drive on each of 1000 nodes). In addition, there can be very extensive periods of downtime for the cluster, measured in weeks, to accomplish an upgrade of the disk in every node of a large cluster.

Fig. 2. HPC Cluster with Disk per Node Storage

Finally, if all data is stored on the nodes of the cluster, then pre-processing or post-processing of experimental data, or application development, reaches into the nodes of the cluster even if the cluster is currently allocated to some other application. This effectively turns the cluster into a massively parallel file system server that also timeshares with HPC applications.

Since file servers enforce access control to protect stored data from accidental or malicious unauthorized tampering, massively parallel file systems sharing COTS nodes with all applications suffer from the file system analog of pre-multi-tasking operating systems – just as executing all applications and the operating system in a single address space exposes the stability of all applications and the node itself to bugs in or attacks on any one application, executing the access control enforcement code on the same machines as the applications whose access is being controlled exposes the entire storage system to damage if any node has bugs in its file system code or is breached by an attack. Since COTS clusters exploit rapidly evolving code, often from multi-organizational open source collaborations, bad interactions between file systems and operating system code, imperfect ports of system-level code, and trapdoors in imported code put the integrity of all stored data at risk.

Inevitably, these major inconveniences drive HPC cluster administrators to maintain the permanent copy of data off the cluster, in dedicated and restricted-function servers, devoting the disks on each compute node to replicas and temporaries. This means applications must perform data staging and destaging – where application datasets are loaded from a shared server prior to job execution, and results unloaded back to a shared server when done. Staging and destaging add to execution time, lowering effective storage bandwidth and compute performance again. In some environments, particularly multi-user or multi-project environments, staging and destaging can waste as much as 25% of the available time on the cluster.

These issues, and the desire for a more manageable cluster computing environment, have driven many facilities to look at shared-storage cluster computing models, where a significant portion of, if not all, application data is maintained in shared storage and dynamically accessed directly.

2.1.2 SAN Attached Disk per Node

Commercially, most high-end storage systems are Storage Area Network (SAN) storage: big disk arrays offering a simple disk block service interface (SCSI) and interconnected with other arrays and multiple host systems for fault tolerance, so that after a host fails another host can see the failed host’s storage and take over the management of its data. SAN networking is usually FibreChannel, a packetized SCSI network protocol built on the same physical wires and transceivers as Gigabit Ethernet and transmitting at 128 or 256 MB/s. While the first generation of FibreChannel only expanded on parallel SCSI’s 16 addresses with 126 addresses in an arbitrated loop, later versions, called Fabrics, can be switched in much larger domains.

SAN storage offers a technological basis for consolidating the disks of a disk-per-node solution outside of the nodes on a shared network. Using disk array controllers providing RAID and virtualized disk mappings, so that a set of N physical disks looks like a set of M logical disks, where M is the right number for the disk-per-node solution, allows SAN storage to overcome the reliability, availability and incremental growth problems of the disk-per-node solutions.

Fig. 3. SAN Attached Disk per Node Cluster

Unfortunately, SAN NICs (called host bus adapters), SAN switches and SAN RAID controllers and subsystems are far less cost-effective than PC disks. Because the high-end commercial market has relatively small volume and relatively small numbers of relatively big nodes, FibreChannel SAN equipment is expensive: factors of four higher capital costs for individual components are not unusual, and the external consolidated storage approach adds another switched network in addition to the cluster interconnection network, with additional NICs and switches, which the disk-per-node approach did not have. The cost of distributing a FibreChannel SAN storage system over all nodes of a COTS cluster is generally prohibitive.

The recent IETF definition of iSCSI [Satran2003], which maps SCSI, the transport and command protocols used in FibreChannel SANs, onto the IETF’s TCP/IP transport and network protocols so that it can be run on commodity Gigabit Ethernet, may renew interest in SAN storage systems. The cost of an iSCSI SAN may be much less than a FibreChannel SAN once iSCSI is deployed and in volume.

Traditionally SAN storage executes any command arriving from any accessible node, because in the past the only nodes that could reach storage were the trusted primary and backup server hosts attached via a daisy-chained SCSI ribbon cable. With consolidated SAN storage, it becomes convenient to pool the storage of many computing systems on one SAN network where tape archive systems and SAN storage management servers are also available, exposing cluster storage to errors, bugs and attacks on non-cluster systems and vice versa. For example, in a past release of a server operating system, new manageability code tried to help administrators by formatting any disk storage it could see, even if these disks were owned and in use by another host!

FibreChannel has provided a first-step improvement to the total lack of access control in SAN storage. Using the time-honored memory protection key scheme, where a requestor presents a key (its host ID, for example) with every request and the storage validates that this key is in the access list for a specific virtual disk, FibreChannel can detect accidental misaddressing of a request by an incorrect node. However, because the unit of access control, which is the granularity of accident detection, is large – an entire virtual disk – and because any node that might ever need to touch any part of a virtual disk must always be listed, FibreChannel access control is useful only for isolating completely independent systems attached to the same SAN. Finer-grain controls, and dynamically changing access rules, required for true data sharing with robust integrity for storage metadata and files, must be provided by host-based cooperating software.
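
The following toy sketch illustrates how coarse this memory-protection-key style of check is. The data structures and names are illustrative, not any vendor's implementation; the point is that the decision is made per host and per whole virtual disk, with no notion of files, extents or users.

    # Access is granted or denied per virtual disk (LUN); any listed host may then issue
    # any command against any block of that LUN at any time.
    lun_access_list = {
        "lun0": {"host-A", "host-B"},   # every node that might ever touch lun0 must be listed
        "lun1": {"host-C"},
    }

    def admit(host_id, lun, command):
        """Admit or reject a request using only the (host key, whole-LUN access list) pair."""
        if host_id not in lun_access_list.get(lun, set()):
            return False    # catches accidental misaddressing by an unlisted node
        return True         # command contents are never inspected

Per-file or per-user rules, or rules that change as files are created and deleted, have nowhere to live in this scheme and must be layered on by cooperating host software.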

2.1.3 Cluster Networking and I/O Nodes

A key limitation to the external shared storage approach, particularly evident in FibreChannel SAN architectures, is the cost of interconnecting all cluster nodes with external storage. The cost of a SAN NIC and SAN switch port for every cluster node is comparable to the node’s cost. Moreover, HPC clusters usually already have a fully interconnected, high-performance switching infrastructure, usually optimized for small packets and low latency as well as high bandwidth, such as Quadrics, Myrinet or Infiniband. While these cluster-specialized networks are also expensive, they are needed for the computational purpose of the HPC cluster and are unlikely to be replaced by a storage-optimized network such as FibreChannel. Instead, cluster designers often seek to transport storage traffic over the cluster network. Because storage devices, particularly cost-effective commodity storage devices, are not available with native attachment to HPC cluster-specialized networks, cluster designers often designate a subset of cluster nodes as I/O nodes and equip these with cluster and storage NICs, as illustrated in Figure 4a.

The primary function of such an I/O node is to convert storage requests arriving on its cluster network NIC into storage requests leaving on its storage network NIC. This I/O node architecture avoids the cost of provisioning two complete network infrastructures to each node in the cluster. In order to limit the fraction of the cluster compute capacity lost to providing I/O service, the whole I/O node is usually devoted to this function and its capacity for data copying from one protocol stack to another is fully exploited.

With a SAN-attached disk-per-node storage architecture, an I/O node’s protocol conversion is very simple: it terminates a connection in the cluster network’s transport layer, collects the embedded disk request, then wraps that request in the storage network’s transport layer and forwards it into the storage network. The cost of this protocol conversion is the I/O node, its two NICs (or more if the bandwidths of the cluster and storage network are not matched), and a network switch port for each NIC. While this can be much less expensive than equipping each node with both networks, it can still be a significant cost if high bandwidth is sought, especially in comparison to disks embedded in each node.

The simple protocol conversion, or bridging, function of such I/O nodes lends itself to being offloaded into the networks themselves. As illustrated in Figure 4b, a multi-protocol switch contains line cards, or blades, with different types of network ports and employs hardware protocol conversion, allowing cluster networks such as Myrinet or Infiniband to efficiently and cost-effectively switch data with storage networks such as FibreChannel (FCP/FC) or Ethernet (iSCSI/GE) [Seitz02, Topspin360]. Instead of terminating a cluster network connection, parsing the embedded storage request, and proxying each request into the storage network, the storage connection can be “tunneled” through the compute cluster connection. In this approach the payload of a compute cluster network connection consists of the routing-layer packets of a storage connection.

Fig. 4a. Storage Protocol Conversion in I/O Nodes

Fig. 4b. HPC Cluster Network with Bridging Switch

For example, Myricom’s new M3-SW16-8E switch line card connects up to 8 Gigabit Ethernet ports into an 8-link Myrinet backplane fabric. This provides a seamless conversion between Ethernet and Myrinet’s physical layers, eliminating the need for multiple switch infrastructures and reducing by half the number of switch ports and NICs employed between the endpoint cluster node and storage device. To transport storage or internet data, a Myrinet client node encapsulates TCP/IP traffic inside Myricom’s GM protocol. Received by the protocol conversion switch, the storage or internet TCP/IP packets are stripped of their GM headers and then forwarded over Ethernet to an IP-based storage or internet destination.

Multi-protocol switches such as the one described in this Myricom example, and similar products being introduced for Infiniband clusters from vendors such as Topspin, essentially eliminate the cost of I/O nodes that only convert storage requests from one network protocol to another. Because of the widespread use of TCP/IP on Ethernet, it is the “second” protocol in most of these new products. IP is a particularly appealing protocol because it is ubiquitous, has a universal address space, and is media independent, allowing multi-protocol switches to reach other non-storage links using the same networking protocols inside the same compute cluster tunnels. While FibreChannel line cards will also become available, Ethernet’s cost-effectiveness and applicability to internet traffic as well as storage traffic make iSCSI-based storage protocols very appealing.

2.1.4 Summary

Attaching a small number of disks to every node in an HPC cluster provides a low-cost, scalable-bandwidth storage architecture, but at the cost of:

• complexity for programmers to effectively use local disks,

• complexity for administrators to backup, grow and capacity balance many small file systems,

• susceptibility to data loss with each node failure,

• competition for compute resources from pre- and post-processing, and

• susceptibility of data and metadata to damage caused by bugs in a cluster node’s local software.

Network disk, or SAN storage, whether the dominant FibreChannel or the newcomer iSCSI, appears to each node as a logical disk per node, retaining the programming complexity, the data integrity exposure, the pre- and post-processing competition and the weakness of managing 1000s of small file systems. But network disk storage does simplify the physical configuration and growth complexity. Network disk storage also decouples node and storage failure handling. Unfortunately, network disk storage also requires a significant additional storage network infrastructure investment, although this is greatly ameliorated by multi-protocol switching and iSCSI transport of disk commands over TCP/IP and Ethernet.

2.2 Scaling at the File Abstraction

A shared file system is the simplest approach for users and programmers, and the most manageable approach for administrators, providing convenient organization of a huge collection of storage.

2.2.1 NAS Servers for Shared Repository

To date, the most common approach to providing a shared repository outside of the cluster entails the use of dedicated multi-TB Network-Attached Storage (NAS) systems. An external NAS system is almost the opposite of the disk-per-node solution; that is, a NAS server is a good solution for reliability, availability and manageability, but a weak solution for bandwidth.

With a NAS repository, all data is accessed externally, so transforming the data layout to match the compute layout gets done implicitly with every access. Storage redundancy for reliability is offloaded from the cluster to the NAS system. Since a NAS system is usually built as a primary/backup front end for one or more fault-tolerant disk arrays, it is explicitly designed for high data reliability and availability. Moreover, because NAS systems simply distribute the single system image of a single file system, a single NAS system is generally taken as the benchmark standard for simple, inexpensive storage management, including incremental capacity growth independent of cluster characteristics. Finally, because a NAS system supports file sharing with enforced access control as its most basic use case, application development, pre- and post-processing and data staging and destaging can all occur in parallel without interfering with the applications running on the HPC cluster.

Unfortunately, a traditional NAS system delivers fractions of 100 MB/sec per file, and aggregates of at most a few 100 MB/sec. When most NAS vendors advertise scaling, they mean that a few NICs and a few hundred disks can be attached to a single (or possibly dual failover) NAS system. Getting one GB/sec from the files in one directory is virtually unheard of in NAS products.

For HPC purposes, scaling NAS performance and capacity means multiple independent NAS servers. But multiple NAS servers re-introduce many of the problems with administering multiple file systems in the disk-per-node solutions, albeit with many fewer and bigger file systems. For example, with two near-capacity file systems online, an administrator would need to purchase an additional NAS server and migrate data (for both capacity and bandwidth balancing) from other servers. This typically requires significant downtime while data is moved and applications are re-configured to access data on the new mount points. And any given file is on just one NAS server, so access is only faster if a collection of files is assigned into the namespace of multiple NAS servers in just the right way – a few files on each NAS server – reducing the primary advantage of NAS, which is its manageability.

An external NAS repository for permanent copies of data staged to or destaged from an HPC cluster containing disks on each node is, unfortunately, not the best of both worlds. It does enable external sharing and management, while allowing cluster algorithms coded for disk-per-node to get scalable bandwidth. But the disks on each compute node still raise cluster node failure rates, still bind the unit of incremental bandwidth growth to the total size of the cluster and still present the administrator with 1000s of small file systems to manage, even if the disks at each node contain only replicas of externally stored data. And staging/destaging time can be significant because of NAS bandwidth limitations.

Fig. 5. HPC Cluster with NAS Storage Repository

2.2.2 SAN File Systems

A SAN file system is the combination of SAN-attached storage devices and a multi-processor implementation of a file system, with a piece of the file system distributed onto every compute cluster node. SAN file systems differ from NAS systems by locating the controlling metadata management function away from the storage. A SAN file system overcomes the management inconveniences of a consolidated disk-per-node solution like SAN storage. Rather than 1000s of independent small file systems, a SAN file system is managed as a single large file system, simplifying capacity and load balancing. For large HPC datasets and main memory checkpoints that need unrivalled bandwidth, this direct data access approach between cluster node and SAN storage device can provide full disk bandwidth to the cluster, limited only by network bisection bandwidth. SAN file systems seek to provide the manageability of NAS with the scalability of SAN, but suffer from the poor sharing support of a block-based interface, which must be compensated for with messaging between the nodes of the cluster.

2.2.2.1 In-Band SAN File Systems

An in-band SAN file system, such as Sistina’s GFS [Preslan99], is a fully symmetric distributed implementation of a file system, with a piece of the file system running all of its services on each node of the cluster and no nodes differentiated in their privileges or powers to change the contents of disk storage. Each node’s piece of the file system negotiates with any other piece to gain temporary ownership of any storage block it needs access to. To make changes in file data, it obtains ownership and up-to-date copies of file data, metadata and associated directory entries. To allocate space or move data between physical media locations, it obtains ownership of the encompassing physical media and up-to-date copies of that media’s metadata. Unfortunately, allocation is not a rare event, so even with minimal data sharing, inter-node negotiation over unallocated media is a common arbitration event. Because of the central nature of arbitrating for ownership of resources, these systems often have distributed lock managers that employ sophisticated techniques to minimize the arbitration bottleneck.
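
The following highly simplified sketch shows the shape of that ownership negotiation. It assumes a single hypothetical lock service reachable from every node; real systems such as GFS use far more elaborate distributed lock managers, but the key point is the same: allocation structures are shared resources that must be arbitrated even when file data is not shared.

    class LockManager:
        def __init__(self):
            self.owner = {}                    # resource id -> node currently holding it

        def acquire(self, node, resource):
            """Transfer ownership of a block, inode or allocation-bitmap region to `node`."""
            holder = self.owner.get(resource)
            if holder is not None and holder != node:
                self.revoke(holder, resource)  # message the holder: flush caches and release
            self.owner[resource] = node

        def revoke(self, holder, resource):
            pass  # stands in for an inter-node message round trip

    # Appending to a private file still requires acquiring the free-space bitmap region being
    # allocated from, so nodes arbitrate over allocation metadata on nearly every file
    # extension even when no file data is shared.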

Fully symmetric distributed file systems have the same problems with data integrity as disk-per-node solutions. Bugs in compute node operating systems, or bad interactions between separately ported system services, can cause compute nodes to bypass access control and metadata use rules, allowing any and all data or metadata in the system to be damaged. This is perhaps controllable in a homogeneous HPC cluster with carefully screened updates to the cluster node operating systems, although COTS operating systems such as Linux evolve rapidly with independent changes collected by different integrators.

Enabling pre- and post-processing from non-cluster systems concurrent with cluster computations is another problem. Enabling direct access from all non-cluster nodes to shared cluster storage is particularly prone to accidents because of the diversity of machines needing to run the in-band SAN file system software. Most often, SAN file systems run only on the server cluster, and not on all desktops and workstations in the environment, in order to limit the number of different ports that must be fully interoperably correct. But this forces at least some of the server cluster nodes to service storage requests from non-cluster nodes, re-introducing the I/O node approach for proxying a different set of storage requests.

2.2.2.2 Out-of-Band SAN File Systems

Out-of-band SAN file systems, such as IBM’s Sanergy and EMC’s High Road [EMC03, IBM03], improve the robustness and administrator manageability of in-band SAN file systems by differentiating the capabilities of file system code running on cluster nodes from the file system code running on I/O nodes: only I/O node file system software can arbitrate the allocation and ownership decisions and only this software can change most metadata values. The file system software running on cluster nodes is still allowed to read and write SAN storage directly, provided it synchronizes with I/O nodes to obtain permission, up-to-date metadata and all allocation decisions and metadata changes. Because metadata control is not available in the data path from the cluster node to the storage device, these I/O nodes are called metadata servers or out-of-band metadata servers. Metadata servers can become a bottleneck because the block abstraction of SAN storage is so simple that many cluster write commands will require synchronization with metadata servers [Gibson98].

Unfortunately, the isolation of metadata control on the I/O nodes is by convention only; the SAN storage interface will allow any node that can access it for any reason to execute any command, including metadata changes. There is no protection from accidental or deliberate inappropriate access. Data integrity is greatly weakened by this lack of storage-enforced protection; the block interface doesn’t provide the fundamental support needed for multi-user access control that is provided, for example, by separate address spaces in a virtual memory system.

Out-of-band metadata file system software running on I/O nodes can also offer the same proxy file system access for non-cluster workstation or desktop clients. The proxy file system protocols used by non-cluster nodes are usually simple NAS protocols, leading the metadata servers to sometimes be called NAS heads.

2.2.3 Summary

While the file sharing interface provided by NAS is enjoyed by users, it has had difficulty scaling to meet the performance demands of the HPC environment. SAN solutions can provide good performance but are difficult and expensive to manage. SAN file systems can provide performance and data sharing, but the poor sharing support of the SAN block interface limits scalability.

2.3 Scaling at the Object Abstraction

Storage offering the Object Storage Device (OSD) interface stores an unordered collection of named variable-length byte arrays, called objects, each with embedded attributes, fine-grain access control, and encapsulated data layout and allocation [Gibson98]. The OSD interface, which is described in more detail in Section 3, coupled with an out-of-band storage networking architecture such as shown in Figure 6, improves the scalability of out-of-band SAN file systems because it encapsulates much of the information that an out-of-band SAN file system must synchronize with metadata servers. The OSD interface is richer than the block-based interface used in DAS and SAN, but not as complex as the file-based interface of a NAS (NFS or CIFS, for example) file server. The art in object storage architecture is finding the right level of abstraction for the storage device that supports security and performance in the I/O path, without limiting flexibility in the metadata management.
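
For orientation, the sketch below suggests the level of abstraction an OSD presents; the method names and signatures are illustrative only, and the actual command set is the SNIA/T10 work described in Section 3.2.

    class ObjectStorageDevice:
        """Intermediate abstraction: richer than a block device, simpler than a file server."""
        def create(self, group_id, object_id, capability): ...
        def delete(self, group_id, object_id, capability): ...
        def read(self, group_id, object_id, offset, length, capability): ...
        def write(self, group_id, object_id, offset, data, capability): ...
        def get_attributes(self, group_id, object_id, page, capability): ...
        def set_attributes(self, group_id, object_id, page, values, capability): ...

    # Unlike a block device, every operation names an object and carries a capability the
    # device can check; unlike a NAS server, the device knows nothing about directories,
    # users or pathname semantics, which remain the metadata managers' responsibility.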

Fig. 6. HPC Cluster with an Out-of-Band File System

Storage objects were inspired by the inode layer of traditional UNIX local file systems [McKusick84]. File systems are usually constructed as two or more layers. The lower layer, inodes in UNIX and objects in OSD, encapsulates physical layout allocation decisions and per-file attributes like size and create time. This simplifies the representation of a file in the upper layer, which handles directory structures, interpreting access control and coordinating with environmental authentication services, layering these on top of object storage. For example, consider file naming. An OSD does not implement hierarchical file names or content-based addressing. Instead, it allows those naming schemes to be implemented on top of a simple (group ID, object ID) naming system and an extensible set of object attributes. To implement a hierarchical naming scheme, some objects are used as directories while others are data files, just as a traditional UNIX file system uses inodes. The semantics of the directory tree hierarchy are known to unprivileged file system code, called clients, running on cluster nodes, and to privileged metadata managers. To the OSD, a directory is just another object. Object attributes include standard attributes like modify times and capacity information, as well as higher-level attributes that are managed by the metadata managers (e.g., parent directory pointer, file type). The OSD operations include operations like create, delete, get attributes, set attributes, and list objects.
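
A minimal sketch of how client file system code might layer hierarchical naming on top of the flat (group ID, object ID) namespace follows. The directory encoding, the root object ID and the decode_directory helper are illustrative assumptions rather than anything defined by the OSD interface.

    def lookup(osd, metadata_manager, path, root=(0, 1)):
        """Resolve a path like /a/b/c to a (group_id, object_id) pair by walking directory objects."""
        current = root
        for name in path.strip("/").split("/"):
            cap = metadata_manager.get_capability(current, "READ")
            raw = osd.read(current[0], current[1], 0, None, cap)  # read the whole directory object
            entries = decode_directory(raw)  # assumed helper: bytes -> {name: (group_id, object_id)}
            current = entries[name]
        return current

    # To the OSD each directory is just another object; only the clients and the metadata
    # managers know that its bytes encode a name table.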

Most importantly for scalability, changes in the layout of an object on media and most allocations extending the length of an object can be handled locally at the OSD without changing any metadata cached by unprivileged clients.

To support various access control schemes, the OSD provides capability-based access enforcement. Capabilities are compact representations naming only the specific objects that can be accessed and the specific actions that can be done by the holder of a capability. Capabilities can be cryptographically secured. The metadata manager and the OSD have a shared key used to generate and check capabilities. For example, when a client wants to access a file it requests a capability from the metadata manager. The metadata manager enforces access control by whatever means it chooses. Typically it consults ownership and access control list information stored as attributes on an object. The metadata manager generates a capability and returns it to the client, which includes that capability in its request to the OSD. The OSD validates the capability, which includes bits that specify which OSD operations are allowed. Metadata managers can be designed to use various authentication (e.g., NIS, Kerberos, Active Directory) and authorization (e.g., Windows ACLs or POSIX ACLs) schemes to grant access, and rely on the OSD to enforce these access policy decisions without knowledge of the authentication or authorization system in use.
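
The three-party exchange just described might look roughly like the sketch below. The method names are illustrative, the ACL-attribute check is one possible policy a metadata manager could choose, and sign_capability stands in for the shared-secret signature whose fields are detailed in Section 3.2.

    def client_read(client_id, obj, metadata_manager, osd):
        # 1. Ask the metadata manager for permission to read this object.
        cap = metadata_manager.grant(client_id, obj, ops={"READ"})
        # 2. Go directly to the OSD, presenting the capability with the command.
        return osd.read(obj[0], obj[1], 0, 65536, cap)

    class MetadataManager:
        def grant(self, client_id, obj, ops):
            acl = self.get_acl_attribute(obj)        # ownership/ACL kept as object attributes
            if not acl.allows(client_id, ops):       # policy check is entirely the manager's choice
                raise PermissionError(client_id)
            return self.sign_capability(obj, ops)    # signed with the key shared with the OSD

    # The OSD only verifies the signature and the permitted-operation bits; it never talks to
    # NIS, Kerberos or Active Directory and never parses ACL formats itself.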

Most importantly for scalability, capabilities are small, cacheable and revocable (by changing an attribute of the named object on an OSD), so file system client code can cache many permission decisions for a long time and an out-of-band metadata server can always synchronously and immediately control the use of a capability. Metadata servers can even change the representation of the file, migrating its objects or reconstructing a failed OSD, with no more interruption to client access than is required to update the small capability the first time the client tries to use it in a way that is no longer valid.

With this higher-level interface, object storage clusters are capable of highly parallel data transfers between storage and compute cluster nodes under the infrequently applied control of the out-of-band metadata managers. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.

2.4 Cost Analysis Example

In this section we examine typical costs for storage systems built from commodity, off-the-shelf (COTS) components, best-in-class server systems, and SAN storage components. Following the example of Beowulf-style HPC compute clusters, we show that Object Storage systems should also be built as clusters of relatively small nodes, which we characterize as powerful disks rather than thin servers.

To illustrate the tradeoffs between standard practices for building a shared multiple NAS server storage system and a comparable clustered COTS storage system, in April 2003 we priced the state-of-the-art hardware needed for a 50 TB shared storage system providing 2.5 GB/sec bandwidth. Our NAS solution prices were taken from Sun Microsystems’ online store. For the COTS pricing we looked at a number of online stores; we report the best pricing, which was for Supermicro 1U rack-mount Intel servers from ASA Servers of Santa Clara, CA. The results are shown in Table 1.

For the multiple NAS server system we used five 4-way Sun V480 servers with 4 GE ports each and built a SAN storage system for each server using 144 GB 10,000 rpm FibreChannel disk drives, Brocade 3800 FC switches and Qlogic 2340 adapter cards. For the COTS storage cluster to run an object storage file system, we priced five 2 GHz Xeon-based metadata servers and forty-one OSDs, each OSD with 6 ATA disks storing 200 GB and sustaining 10 MB/sec each. This is conservative; fewer metadata servers will be needed by most HPC workloads. The object storage cluster is connected through inexpensive 4-port gigabit Ethernet switches as concentrators and Ethernet-to-Myrinet protocol converter blades attaching into the compute cluster’s interconnect.
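
A quick back-of-the-envelope check (ours, not the paper's) shows how this COTS configuration meets the 50 TB and 2.5 GB/sec targets:

    osds, disks_per_osd = 41, 6
    capacity_tb = osds * disks_per_osd * 200 / 1000.0    # 200 GB per ATA disk
    bandwidth_gbs = osds * disks_per_osd * 10 / 1000.0   # 10 MB/sec sustained per disk
    print(capacity_tb, bandwidth_gbs)                    # about 49.2 TB and 2.46 GB/sec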

Table 1. NFS NAS vs. COTS Object Storage Costs

                           Sun NFS server      COTS Object Storage
  MetaData Server          $234,475            $18,630
  50 TB storage            $508,090            $134,904
  Ethernet switch          $3,000              $12,500
  TOTAL                    $744,565            $166,034

Table 1 shows that the object storage COTS hardware cost is 4.5 times lower than that of a multiple NAS server solution sustaining the same bandwidth. The largest cost difference is the Fibre Channel storage system (including disks and FC switches) in the NAS solution versus ATA disk drives in the COTS hardware. Even if ATA drives are substituted into the NAS configuration (which is not done for performance reasons), the cost of the NAS solution is $371,379, which is 2.2 times more than the object storage hardware. The second largest cost difference is the expensive NAS server bottleneck; 4-way multiprocessors are used to deliver 4 Gbit/sec of bandwidth per server. In contrast, object storage’s metadata servers only require a single commodity Xeon processor per server, because all data movement is offloaded, performed directly between clients and storage.

Table 2. Disk per Node + NAS vs. COTS Object Storage Costs

                              Disk per Node + NAS    COTS Object Storage
  MetaData Server             $43,995                $18,630
  50 TB storage               $508,090               $134,904
  Extra 200 GB drive/node     $64,750
  Ethernet switch             $3,000                 $12,500
  Total                       $619,835               $166,034

To reduce the cost of the original NAS system, we also priced a disk-per-node compute cluster with NAS as a shared repository incapable of the required bandwidth. All data needed for a computation is copied onto 200 GB of local storage attached to each cluster node, allowing local storage to provide each cluster node with sufficient bandwidth. This disk-per-node plus shared NAS solution allows us to eliminate all but one of the NAS servers, reducing system cost to $619,835. With FibreChannel disks, this disk-per-node plus NAS solution, however, is still over 3.5 times more expensive than the object storage system. Even replacing the FibreChannel disks with ATA disks, the object storage system is still 33% cheaper. And by not having the disk-per-node data management problems, the object storage solution is more easily managed as well.

3 The Object Storage Architecture

The Object Storage Architecture (OSA) provides a foundation for building an out-of-band clustered storage system that is independent of the computing cluster, yet still provides extremely high bandwidth, shared access, fault tolerance, easy manageability, and robust data reliability and security. Most importantly, because object storage is designed for systems built from COTS components, its storage solutions will be cost-effective.

There are two key elements of the OSA that enable exceptional scalability and performance: a high-level interface to storage for I/O, and an out-of-band interface for metadata management. The architecture uses a cluster of Object Storage Devices (OSD) that is managed by a much smaller set of metadata manager nodes. I/O between the computing cluster and the OSDs is direct; the metadata managers are not involved in the main data path. This is shown in Figure 7.

3.1 High-Level OSD Interface

Each OSD provides a high-level interface to its storage that hides traditional storage details like sectors, partitions, and LUNs. Instead, an OSD is a server for objects that have a range of bytes and an extensible set of attributes. Objects can represent files, databases, or components of files. The high-level interface is necessary in a large-scale system where storage devices are shared by many clients. Traditional block device interfaces do not have support for data sharing and access control, making it more difficult to optimize I/O streams from multiple clients with block storage.

For example, consider the case where multiple clients are reading large files from the storage cluster simultaneously. When each client issues a READ command for an object, the OSD knows exactly how big that specific object is and where it is located on disks. The OSD can schedule read-ahead operations on behalf of its clients, and balance buffer space and queue depths among I/O streams. In contrast, in traditional storage systems the operating system or client manages read-ahead by issuing explicit read requests. That approach does not scale well in a distributed storage system. By implementing intelligent read-ahead logic on the storage device, the client is simpler, fewer network operations are required, and the storage device can stream data to multiple clients efficiently.
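
The kind of per-object read-ahead decision an OSD can make is sketched below. The sequential-stream heuristic, the window size and the helper methods (media_read, queue_prefetch, object_size) are illustrative assumptions; the point is that the device knows the object's size and layout and sees all competing streams.

    READAHEAD_WINDOW = 8 * 1024 * 1024   # assumed 8 MB prefetch window per stream

    def on_read(osd, stream, offset, length):
        data = osd.media_read(stream.object_id, offset, length)
        if offset == stream.expected_next:                # sequential access detected
            end = min(offset + length + READAHEAD_WINDOW,
                      osd.object_size(stream.object_id))  # never prefetch past the object's end
            osd.queue_prefetch(stream.object_id, offset + length, end - (offset + length))
        stream.expected_next = offset + length
        return data

    # Because it sees every stream, the OSD can also shrink prefetch depth per stream when
    # buffer space runs low, something a remote client cannot easily coordinate.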

A WRITE command example may be even more illuminating. When a write is done beyond the end of a file in a block-based SAN file system, the writing client needs to synchronize with its metadata server to allocate additional space on media and modify the file’s metadata in its cache and on disk. With object storage, however, the metadata server that issued the client the right to issue write commands to an OSD can mark the client’s capability with a quota far in excess of the size of the file. Then the OSD can increase the size of the file without synchronizing with metadata servers at the time of the write, and still not bind actual media to the newly written data until performance optimizations such as write-behind decide it is time to write the media.
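
A minimal sketch of the OSD side of this append case follows; the capability field names and helper methods are illustrative. The only check needed at write time is against the byte-range quota already embedded in the capability, so no metadata server round trip occurs.

    def osd_write(osd, group_id, object_id, offset, data, capability):
        end = offset + len(data)
        if end > capability.offset + capability.length:      # beyond the granted quota
            raise PermissionError("client must return to the metadata manager for a larger grant")
        osd.buffer_write(group_id, object_id, offset, data)  # write-behind; media bound later
        new_size = max(osd.get_size(group_id, object_id), end)
        osd.set_size(group_id, object_id, new_size)          # object grows with no MDS round trip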

Fig. 7. Object Storage Architecture

The Object Storage Architecture supports high bandwidth by striping file data across storage nodes. Clients issue parallel I/O requests to several OSDs to obtain an aggregate bandwidth in a networked environment that is comparable to the bandwidth obtained from a locally attached RAID controller. In addition, by distributing file components among OSDs using RAID techniques, the storage system is protected from OSD failures. Thus we see that the Object Storage Architecture lets us create a system where many clients are simultaneously accessing many storage nodes to obtain very high bandwidth access to large data repositories. In addition, balanced scaling is built into the system because each storage node has a network interface, processor, and memory, as well as disks.
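
The sketch below shows one way a client might map a file byte range onto component objects striped round-robin across several OSDs and fetch the pieces in parallel. The stripe unit, the single shared capability and the thread pool are illustrative simplifications; a real client would hold one capability per component object and would add the RAID redundancy mentioned above.

    from concurrent.futures import ThreadPoolExecutor

    STRIPE_UNIT = 64 * 1024   # assumed bytes per stripe unit

    def read_striped(osds, component_ids, capability, offset, length):
        """Split a file byte range into per-OSD (object, offset, length) pieces and read them in parallel."""
        pieces, pos = [], offset
        while pos < offset + length:
            unit = pos // STRIPE_UNIT                     # global stripe unit index
            osd_index = unit % len(osds)                  # round-robin placement across OSDs
            obj_offset = (unit // len(osds)) * STRIPE_UNIT + pos % STRIPE_UNIT
            chunk = min(STRIPE_UNIT - pos % STRIPE_UNIT, offset + length - pos)
            pieces.append((osds[osd_index], component_ids[osd_index], obj_offset, chunk))
            pos += chunk
        with ThreadPoolExecutor(len(osds)) as pool:
            futures = [pool.submit(osd.read, cid[0], cid[1], off, ln, capability)
                       for osd, cid, off, ln in pieces]
            return b"".join(f.result() for f in futures)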

3.2 Objects and OSD Command Set

Drawing on the lessons of iSCSI and Fibre Channel, the OSD protocol is designed to work within the SCSI framework, allowing it to be transported directly over iSCSI and providing cluster nodes with a standard protocol for communicating with OSDs.

The OSD object model and command set are being defined by SNIA (www.snia.org/osd) and ANSI T10 (www.t10.org) OSD working groups. The basic data object, called a user-object, stores data as an ordered set of bytes contained within the storage device and addressed by a unique 96-bit identifier. Essentially, user-objects are data containers, abstracting the physical layout details under the object interface and enabling vendor-specific OSD-based layout policies. OSDs also support group-objects, which are logical collections of user-objects addressed using a unique 32-bit identifier. Group-objects allow for efficient addressing and capacity management over collections of user-objects, enabling such basic storage management functions as quota management and backup.
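
To make the capacity-management role of group-objects concrete, the following hypothetical Python sketch tracks member user-objects against a byte quota; the class and field names are illustrative assumptions and not part of the T10 object model.

    # Hypothetical sketch: a group-object as a capacity-managed collection
    # of user-objects (32-bit group ids, 96-bit user-object ids).
    from dataclasses import dataclass, field

    MAX_GROUP_ID = 2**32 - 1
    MAX_USER_OBJECT_ID = 2**96 - 1

    @dataclass
    class GroupObject:
        group_id: int
        quota_bytes: int
        used_bytes: int = 0
        members: set = field(default_factory=set)   # user-object ids in the group

        def create_user_object(self, user_object_id: int, reserve_bytes: int = 0):
            if not (0 <= self.group_id <= MAX_GROUP_ID):
                raise ValueError("group id must fit in 32 bits")
            if not (0 <= user_object_id <= MAX_USER_OBJECT_ID):
                raise ValueError("user-object id must fit in 96 bits")
            if self.used_bytes + reserve_bytes > self.quota_bytes:
                raise RuntimeError("group quota exceeded")
            self.members.add(user_object_id)
            self.used_bytes += reserve_bytes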

Associated with each object is an extensible set of attributes, which store per-object information. The OSD predefines and manages some attributes, such as user-object size (physical size), object create time, object-data last modified time, and object-attribute last modified time. The OSD also provides storage for an extensible set of externally managed attributes, allowing higher-level software (e.g., file systems) to record higher-level information such as user names, permissions, and application-specific timestamps on a per-user-object or per-group-object basis. All OSD attributes are organized into collections (called pages), with 2^16 attributes per page and 2^16 attribute pages; each attribute can be a maximum of 256 bytes. The OSD interprets attributes that are defined by the standard (e.g., last access time) while treating vendor- or application-specific attributes as opaque blobs that are read or updated by higher-level software.
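
As a concrete illustration of these limits, the following hypothetical Python sketch enforces the 2^16-page, 2^16-attributes-per-page, 256-byte-per-value bounds for one object's attributes; the dictionary-backed store and the example page number are assumptions for illustration, not how an OSD would implement or assign them.

    # Hypothetical sketch: per-object attribute pages with the size limits
    # described above.
    MAX_PAGE = 2**16 - 1
    MAX_NUMBER = 2**16 - 1
    MAX_VALUE_LEN = 256

    class AttributeStore:
        def __init__(self):
            self._attrs = {}                          # (page, number) -> bytes

        def set(self, page, number, value):
            if not (0 <= page <= MAX_PAGE and 0 <= number <= MAX_NUMBER):
                raise ValueError("attribute page or number out of range")
            if len(value) > MAX_VALUE_LEN:
                raise ValueError("attribute value exceeds 256 bytes")
            self._attrs[(page, number)] = value       # opaque to the OSD if vendor-defined

        def get(self, page, number):
            return self._attrs.get((page, number), b"")

    # e.g., a file system might record an owner name in a vendor-defined page
    attrs = AttributeStore()
    attrs.set(0x30, 0x01, b"owner=ggibson")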

OSD operations include commands such as create, delete, get attributes, set attributes, and list objects, as well as the traditional read and write commands. Commands are transported over a SCSI extended CDB (i.e., operation code 0x7F) and include the command, the capability, and any application-defined attributes that are to be set or retrieved as a side effect of the command. When commands complete, they return a status code, any requested data, and any requested attributes. This coupling of attribute get and set processing with data access enables atomic access to both data and attributes within a single command, significantly decreasing the complexity of higher-level applications while increasing overall performance by reducing the number of round-trip messages.

Fig. 8. Typical OSD SCSI CDB and Read Service Action

To ensure security, OSD commands include a cryptographically signed capability granting permission to perform a specified set of operations on an object or set of objects. The capability is signed with an SHA-1 signature derived from a secret shared between the OSD and the metadata manager. The capability defines the minimum level of security (i.e., integrity and/or privacy on the command header and/or data) allowed, a key identifier that specifies which secret OSD key was used to sign the capability, a signature nonce to avoid replay attacks, an expiration time for the capability, a bitmap of permissible operations (e.g., read data, set attribute), and user-object information, including the user-object id, the length and offset of data over which the capability can be applied, and the object creation time. Embedding the object creation time ensures that after a user-object is deleted, any reuse of the object identifier does not create a security hole.
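
A minimal sketch of the capability check, assuming the SHA1(Secret || CAPABILITY) construction that appears in the CDB example below; the capability encoding here is an illustrative byte string, not the T10 wire format.

    # Hypothetical sketch: the manager signs a capability with a secret it
    # shares with the OSD; the OSD verifies by recomputing the digest.
    import hashlib, hmac

    def sign_capability(secret: bytes, capability: bytes) -> bytes:
        return hashlib.sha1(secret + capability).digest()

    def osd_verify(secret: bytes, capability: bytes, signature: bytes) -> bool:
        expected = hashlib.sha1(secret + capability).digest()
        return hmac.compare_digest(expected, signature)   # constant-time compare

    secret = b"shared-osd-key"                 # installed on the OSD by the manager
    cap = b"obj=0x47;perm=read;nonce=0x221"    # illustrative capability fields
    sig = sign_capability(secret, cap)         # manager returns (cap, sig) to the client
    assert osd_verify(secret, cap, sig)        # OSD accepts the client's command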

To illustrate the use of an OSD command, consider the following READ example that fetches 255 bytes from object 0x47. The CDB consists of a 10-byte header plus the Service Action Specific Fields; the OSD CDB also includes a security capability and specifies any attribute retrievals or updates to be performed as a side effect of the command. The prototype READ command CDB is shown below. To issue this command, the initiator (i.e., the client) generates the CDB with the following information:

Byte 0       OPERATION CODE = 0x7F (OSD command)
Byte 6       IS_CDB = 0; IS_DATA = 0; PS_CDB = 0; PS_DATA = 0 (no on-the-wire tests)
Byte 7       Additional CDB length = 176
Byte 8-9     SERVICE ACTION = 0x8805 (READ)
Byte 10      OPTIONS BYTE = 0x00
Byte 12-15   OBJECT_GROUP_ID = 0x01
Byte 16-23   USER_OBJECT_ID = 0x47
Byte 28-35   LENGTH = 0xFF (255)
Byte 36-43   STARTING ADDRESS = 0x00 (beginning of the object)
Byte 44-55   GET ATTRIBUTE = {0x03, 0x01} (return the create time)
Byte 56-75   SET ATTRIBUTE = {0x03, 0x04, time = 2/21/2003, 10:15 pm}
Byte 76-176  CAPABILITY = {object id 0x47, accessPermission = read object + write attribute {0x03, 0x04} + read attribute {0x03, 0x01}, version 0x123, nonce = 0x221} || SHA1(Secret || CAPABILITY)

The READ command specifies that after the object is read, the last access time (attribute page 0x03, attribute number 0x04) should be set, while the create time (attribute page 0x03, attribute number 0x01) should be returned along with the data. Appended to the command is a capability that defines which object can be accessed, what operations can be performed (accessPermission), and which attributes can be read or written. The entire capability is signed with a secret, allowing the OSD to verify that the capability has not been tampered with. Upon receiving this command, the OSD would: 1) verify the capability signature; 2) fetch the requested object's data; 3) update and fetch the specified attributes; and 4) return the data and the specified attributes.
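
For illustration, the following hypothetical Python sketch packs a few of the fields from the listing above into a CDB buffer at the stated byte offsets; the overall buffer size and the omitted attribute and capability descriptors are placeholders rather than the exact T10 encoding.

    # Hypothetical sketch: assembling part of the OSD READ CDB shown above.
    import struct

    def build_read_cdb(group_id, object_id, length, offset):
        cdb = bytearray(192)                         # generous size; the exact CDB size
                                                     # depends on the attribute and
                                                     # capability lists appended to it
        cdb[0] = 0x7F                                # OPERATION CODE (extended CDB)
        cdb[7] = 176                                 # additional CDB length
        struct.pack_into(">H", cdb, 8, 0x8805)       # SERVICE ACTION = READ
        struct.pack_into(">I", cdb, 12, group_id)    # OBJECT_GROUP_ID (bytes 12-15)
        struct.pack_into(">Q", cdb, 16, object_id)   # USER_OBJECT_ID (bytes 16-23)
        struct.pack_into(">Q", cdb, 28, length)      # LENGTH (bytes 28-35)
        struct.pack_into(">Q", cdb, 36, offset)      # STARTING ADDRESS (bytes 36-43)
        # GET/SET attribute descriptors and the signed capability would follow.
        return bytes(cdb)

    cdb = build_read_cdb(group_id=0x01, object_id=0x47, length=255, offset=0)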

4. Related Work

The Object Storage Device interface standardization effort can be traced directly to the DARPA-sponsored research project "Network Attached Secure Disks" (NASD), conducted between 1995 and 1999 at Carnegie Mellon University (CMU) by some of the authors of this paper [Gibson98]. Building on CMU's RAID research at the Parallel Data Lab (www.pdl.cmu.edu) and the Data Storage Systems Center (www.dssc.ece.cmu.edu), NASD was chartered to "enable commodity storage components to be the building blocks of high-bandwidth, low-latency, secure scalable storage systems."


From prior experience defining the RAID taxonomy at Berkeley in 1988 [Patterson88], the NASD team understood that it is industry adoption of revolutionary ideas that yields impact on technology, so in 1997 CMU initiated an industry working group in the National Storage Industry Consortium (now www.insic.org). This group, including representatives from CMU, HP, IBM, Seagate, StorageTek, and Quantum, worked on the initial transformation of CMU NASD research into what became, in 1999, the founding document of the "Object Storage Device" working groups in the Storage Networking Industry Association (www.snia.org/osd) and the ANSI X3 T10 (SCSI) standards body (www.t10.org). Since that time the OSD working group in SNIA has guided the evolution of Object Storage Device interfaces as member companies experiment with the technology in their R&D labs. Today the SNIA OSD working group is co-led by Intel and IBM, with participation from across the spectrum of storage technology companies.

CMU's NASD project was not the only academic research contributing to today's understanding of Object Storage. In the same timeframe as the NASD work, LAN-based block storage was explored in multiple research labs [Cabrera91, Lee96, VanMeter98]. Almost immediately, academics leapt to embedding computational elements in each smart storage device [Acharya98, Keeton98, Riedel98]. A couple of years later, more detailed analyses of transparency, synchronization, and security were published [Amiri00, Anderson01, Burns01, Miller02, Aguilera03]. Initiated from CMU during the NASD project, Peter Braam's ambitious and stimulating open-source object-based file system continues to evolve [Lustre03]. And just this year a spate of new object storage research has addressed scalability for both object storage and metadata servers [Azagury02, Brandt03, Bright03, Rodeh02, Yasuda03].

5. Conclusion

High Performance Computing (HPC) environments make exceptional demands on storage systems to provide high capacity, high bandwidth, high reliability, easy sharing, low capital cost, and low operating cost. While the disk-per-node storage architecture, which embeds one or a few disks in each computing cluster node, is a very capital-cost-effective way to scale capacity and bandwidth, it has poor reliability, high programmer complexity, inconvenient sharing, and high operating costs. SAN-attached virtual-disk-per-node, multiple NAS servers, NAS repositories with data replicas in disk-per-node storage, and in-band and out-of-band SAN file systems are alternatives with a variety of advantages and disadvantages, but none clearly solves the problem for HPC cluster storage. From a capital cost viewpoint, it is clear that scalable storage should be constructed from COTS components specialized to the storage function and linked into the cluster through multi-protocol conversion in the inter-processor-optimized cluster switch.

Object Storage Devices (OSDs) present a new storage interface developed specifically for scaling shared storage to extraordinary levels of storage bandwidth and capacity without sacrificing reliability or simple, low-cost operations. Coupled with a COTS cluster implementation, OSD storage systems promise a complete solution for HPC clusters. The key properties of a storage object are its variable-length ordered sequence of addressable bytes, its embedded management of data layout, its extensible attributes, and its fine-grain device-enforced access restrictions. These properties make objects closer to the widely understood UNIX inode abstraction than to block storage and allow direct, parallel access from client nodes under the firm but infrequently applied control of an out-of-band metadata server. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.

References

[Acharya98] Acharya, A., Uysal, M., Saltz, J., "Active Disks," International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.


[Anderson01] Anderson, D., Chase, J., Vahdat, A., "Interposed Request Routing for Scalable Network Storage," Fourth Symposium on Operating System Design and Implementation (OSDI), ACM 2001.

[Aguilera03] Aguilera, M., Ji, M., Lillibridge, M., MacCormick, J., Oertli, E., Andersen, D., Burrows, M., Mann, T., Thekkath, C., "Block-Level Security for Network-Attached Disks," USENIX Conference on File and Storage Technologies (FAST), April 2003.

[Amiri00] Amiri, K., Gibson, G. A., Golding, R., "Highly Concurrent Shared Storage," International Conference on Distributed Computing Systems (ICDCS), April 2000.

[Azagury02] Azagury, A., Dreizin, V., Factor, M., Henis, E., Naor, D., Rinetzky, N., Satran, J., Tavory, A., Yerushalmi, L., "Towards an Object Store," IBM Storage Systems Technology Workshop, November 2002.

[Bright03] Bright, J., Chandy, J., "A Scalable Architecture for Clustered Network Attached Storage," Twentieth IEEE / Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.

[Brandt03] Brandt, S., Xue, L., Miller, E., Long, D., "Efficient Metadata Management in Large Distributed File Systems," Twentieth IEEE / Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.

[Burns01] Burns, R. C., Rees, R. M., Long, D. D. E., "An Analytical Study of Opportunistic Lease Renewal," Proc. of the 16th International Conference on Distributed Computing Systems (ICDCS), IEEE, 2001.

[Cabrera91] Cabrera, L., Long, D., "Swift: Using Distributed Disk Striping to Provide High I/O Data Rates," Computing Systems 4:4, Fall 1991.

[EMC03] "EMC Celerra HighRoad," 2003, http://www.emc.com/products/software/highroad.jsp.

[Fineberg99] Fineberg, S. A., Mehra, P., "The Record-Breaking Terabyte Sort on a Compaq Cluster," Proc. of the 3rd USENIX Windows NT Symposium, July 1999.

[Gibson98] Gibson, G. A., et al., "A Cost-Effective, High-Bandwidth Storage Architecture," International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.

[IBM03] "Tivoli SANergy," 2003, http://www.ibm.com/software/tivoli/products/sanergy/.

[Keeton98] Keeton, K., Patterson, D. A., Hellerstein, J. M., "A Case for Intelligent Disks (IDISKs)," SIGMOD Record 27 (3), August 1998.

[Knott03] Knott, T., "Computing colossus," BP Frontiers magazine, Issue 6, April 2003,http://www.bp.com/frontiers.

[Lee96] Lee, E., Thekkath, C., "Petal: Distributed Virtual Disks," ACM 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1996.

[Lustre03] "Lustre: A Scalable, High Performance File System," Cluster File Systems, Inc., 2003, http://www.lustre.org/docs.html.

[McKusick84] McKusick, M. K., et al., "A Fast File System for UNIX," ACM Transactions on Computer Systems, vol. 2, August 1984.

[Miller02] Miller, E. L., Freeman, W. E., Long, D. E., Reed, B. C., "Strong Security for Network-Attached Storage," USENIX Conference on File and Storage Technologies (FAST), 2002.


[Patterson88] Patterson, D. A., Gibson, G. A., Katz, R. H., "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.

[Preslan99] Preslan, K. W., O'Keefe, M. T., et al., "A 64-bit Shared File System for Linux," Proc. of the 16th IEEE Mass Storage Systems Symposium, 1999.

[Riedel98] Riedel, E., Gibson, G., Faloutsos, C., "Active Storage for Large-Scale Data Mining and Multimedia," VLDB, August 1998.

[Rodeh02] Rodeh, O., Schonfeld, U., Teperman, A., "zFS - A Scalable Distributed File System Using Object Disks," IBM Storage Systems Technology Workshop, November 2002.

[SGSRFP01] SGS File System RFP, DOE NNSA and DOD NSA, April 25, 2001.

[Seitz02] Seitz, Charles L., "Myrinet Technology Roadmap," Myrinet User's Group Conference,Vienna, Austria, May 2002, http://www.myri.com/news/02512/.

[Topspin360] "Topspin 360 Switched Computing System," 2003, http://www.topspin.com/solutions/topspin360.html.

[VanMeter98] Van Meter, R., Finn, G., Hotz, S., "VISA: Netstation's Virtual Internet SCSI Adapter," ACM 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.

[Yasuda03] Yasuda, Y., Kawamoto, S., Ebata, A., Okitsu, J., Higuchi, T., "The Concept and Evaluation of X-NAS: a Highly Scalable NAS System," Twentieth IEEE / Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.