
GESTS Int'l Trans. Computer Science and Engr., Vol. 20, No. 1, Oct. 2005

A Self-Organizing Storage Cluster for Decentralized and Reliable File Storage*

Bin Cai 1, Changsheng Xie 2, Jianzhong Huang 3, and Chengfeng Zhang 4

1, 2, 3, 4 National Storage System Laboratory, Department of Computer Science, Huazhong University of Science and Technology, Postal Code 430074, Wuhan, P.R. China

[email protected]

Abstract. This paper describes the design and implementation of a decentralized storage cluster for file storage, so as to provide self-organizing, available, scalable, and consistent data management capability for cluster platforms. The self-organizing capability of the cluster-based hash table (CHT) grants our storage cluster the potential to handle both failure and online provisioning gracefully. We use soft state to manage membership changes, and evenly distribute data across nodes by adopting a linear hashing algorithm so as to achieve incremental scalability of throughput and storage capacity, making the system easy to scale as the number of nodes grows. In order to guarantee strict consistency of metadata and data among multiple replicas, we adopt a decentralized weighted voting scheme that allows data and metadata operations to be safely interleaved, while enabling the system to perform self-tuning. We also present experimental results to demonstrate the features and performance of our design.

* This research is supported by the National 973 Great Research Project of P.R. China under grant No. 2004CB318200 and the National Natural Science Foundation under grants No. 60273037 and No. 60303031.


1 Introduction

DHT-based P2P cooperative storage systems such as CFS [19], OceanStore [14], FarSite [23], and Ivy [5] offer a Distributed Hash Table (DHT) abstraction [6], [7], [8], [9] for data storage so as to harness a potentially tremendous and self-scaling storage resource. Decentralized storage systems that consist of a cluster of storage nodes with local disks, such as RepStore [12], FAB [29], and Sorrento [13], are also a cost-effective solution for a wide range of applications, ranging from enterprise-class storage backends and HPC (High-Performance Computing) to data mining and Internet services, thereby providing high resource utilization, high availability, and easy scalability.

In this paper, we design and implement a decentralized storage cluster for file storage so as to provide self-organizing, available, scalable, and consistent data management capability for cluster platforms. The key design points of our storage cluster are summarized as follows:

1. To reveal important principles for building a self-organizing storage cluster system, we design a CHT (Cluster-Based Hash Table) to store persistent data such as user data and metadata. Meanwhile, we make it tolerant of the failure of a set of storage nodes, with minimal performance degradation, by adopting a replication redundancy scheme in the CHT. The self-organizing capability of the CHT grants our storage cluster the potential to handle both failure and online provisioning gracefully.

2. We use soft state, in contrast to hard state, to manage membership changes when nodes join or leave the system. We use a linear hashing algorithm [1], [18] to evenly distribute data across nodes so as to achieve incremental scalability of throughput and storage capacity when more nodes are added to the system, thus making the system easy to scale as the number of nodes grows.

3. We distribute versioned copies of the metadata along with the user data in each replica, and enforce an additional constraint to ensure data consistency among multiple replicas by adopting a weighted voting scheme [2], [15]. Therefore, we can ensure that quorums are based on the latest version of the metadata, and thus every request accesses the most current data. This allows data and metadata operations to be safely interleaved, while enabling the system to perform self-tuning.

4. We provide a cluster-aware buffer cache scheme to hide the distributed nature of the cluster nodes' caches. It constructs an efficient and cooperative cache-to-disk access policy. In [3], [4], we describe the design and implementation of this scheme.

The rest of the paper is organized as follows: in Section 2, we describe related work. In Section 3, we describe the architecture of the CHT. Section 4 presents the details of the system design. In Section 5, the experimental results are evaluated, and we conclude in Section 6.


2 Related Work

Our work is motivated by previous work on parallel and distributed file systems and cluster-based storage systems, such as Sorrento [13], AFS [24], GPFS [26], PVFS [27], xFS [16], [17], CFS [19], GoogleFS [28], FAB [29], OceanStore [14], FarSite [23], and Ivy [5]. xFS organizes storage in RAID volumes, which are difficult to scale incrementally. PVFS focuses on parallel I/O and does not provide fault tolerance or scalability. GoogleFS has a narrow focus on applications arising from Web search, and some of its design choices may not be suitable for other applications. The FAB system defines a storage system with a block-level interface; clients use SCSI commands for data manipulation, and its implementation follows the thin-client, thick-server paradigm. The objectives of the Sorrento self-organizing storage cluster are the closest to ours. However, Sorrento does not implement any caching for either the namespace server or file data.

Peer-to-peer systems, such as CFS, OceanStore, and Ivy, target the wide-area network environment, which has a different set of assumptions and constraints. For instance, they assume an incompletely trusted environment. In contrast, our design targets a trusted LAN-based environment, so security issues are not our focus. In particular, they do not export traditional file system semantics as our system does. Our design can be considered a transition ground between traditional tightly coupled storage systems and loosely coupled peer-to-peer systems, like the Sorrento system. Dynamic distributed data location has been studied in CAN, Chord, Pastry, and Tapestry. Because our system only operates in a LAN environment, our solution is simplified.

Systems based on version-based data consistency models and immutable files include Oasis [2], Elephant [10], and CVFS [11]. Their goals are to protect against data loss or to provide information for intrusion analysis, and differ from our base objectives.

3 Architecture of Cluster-Based Hash Table

Our storage cluster is designed as a serverless system [16], [17] in which multiple storage nodes deal with the storage of file data and manage metadata cooperatively. Cluster-based hash tables (CHTs) have emerged as a critical component in the overall state-storage solution. One primary advantage of a CHT is its ability to scale linearly to achieve high performance. Figure 1 illustrates the architecture of the CHT in our storage cluster system, which consists of the following components:

CHT Library: the CHT library is a Java class library that presents the hash API to storage nodes. The library accepts hash table operations and cooperates with the local file system to realize those operations. The library contains only soft state, including metadata about the cluster's current configuration and the partitioning of data in the cluster hash tables across the nodes' local file systems.

Storage Nodes: storage nodes are the basic building blocks of the storage cluster. They are responsible for managing locally attached disks through the native file system interface. Storage nodes are the only system components that manage durable data. Each node manages a set of network-accessible single-node hash tables. A node consists of a local buffer cache, a cluster-aware buffer cache [3], [4], a persistent chained hash table implementation, and network stubs and skeletons for remote communication.

CHT API: the CHT API is the boundary between a storage node and its CHT library. The API provides put(), get(), remove(), create(), and destroy() operations on hash tables. Each operation is atomic, and all storage nodes see the same consistent image of all existing hash tables through this API. Hash table names are strings, hash table keys are 128-bit integers, and hash table values are opaque byte arrays; operations affect hash table values in their entirety.
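To make the interface concrete, the following is a minimal Java sketch of such a hash-table API; the interface name and method signatures are illustrative assumptions rather than the actual class definitions of our library.

// Illustrative sketch of the CHT API described above; names and
// signatures are assumptions, not the library's actual Java classes.
import java.math.BigInteger;

public interface ClusterHashTable {
    // Create or destroy a named hash table; table names are strings.
    void create(String tableName);
    void destroy(String tableName);

    // Keys are 128-bit integers, values are opaque byte arrays.
    // Each operation is atomic and affects the value in its entirety.
    void put(String tableName, BigInteger key, byte[] value);
    byte[] get(String tableName, BigInteger key);
    void remove(String tableName, BigInteger key);
}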

Fig. 1. The Architecture of CHT in Cluster System

The CHT is made up of the following components:

1. The hash table server processes: the hash table state is distributed across one or more server processes. A server is a UNIX daemon-like process that accepts network connections from clients and services requests from those clients to insert, delete, and retrieve values from named hash tables. Each server maintains many local hash tables. Conceptually, a server is a storage node that exports an RPC-like interface, so that clients can access the values stored on that node.

2. The hash table client library: the client library exports a very simple, clean interface to applications, and internally handles all of the complexities of marshalling arguments, performing the RPCs to the server processes, and partitioning and replicating data across multiple servers. In order to know how to map data across the different servers, the library reads two files, the "map" file and the "phys" file. These two files determine how to map a series of "logical nodes" across the many different physical server processes. Conceptually, the data is partitioned evenly across the N logical nodes in the system, and each logical node is replicated across one or more physical server processes. The "map" file determines the mapping of logical nodes to physical nodes, and the "phys" file identifies the host name and port number of each physical server process (a resolver sketch follows this list).
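The fragment below sketches how such a client-side resolver could read the two files and map a key to its physical servers. The file formats assumed here (one line of physical-server indices per logical node in "map", one host:port per line in "phys"), the class name, and the modulo partitioning are illustrative assumptions standing in for the actual hash-based distribution.

// Hypothetical resolver for the two-level "map"/"phys" indirection.
// Assumed formats: each line of "map" lists the physical-server ids of one
// logical node's replicas; each line of "phys" is "host:port".
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NodeMap {
    private final List<int[]> logicalToPhysical = new ArrayList<>();    // from "map"
    private final List<InetSocketAddress> physical = new ArrayList<>(); // from "phys"

    public NodeMap(Path mapFile, Path physFile) throws IOException {
        for (String line : Files.readAllLines(mapFile)) {
            logicalToPhysical.add(
                Arrays.stream(line.trim().split("\\s+")).mapToInt(Integer::parseInt).toArray());
        }
        for (String line : Files.readAllLines(physFile)) {
            String[] hostPort = line.trim().split(":");
            physical.add(new InetSocketAddress(hostPort[0], Integer.parseInt(hostPort[1])));
        }
    }

    // Data is partitioned evenly across the N logical nodes; each logical node
    // is replicated on one or more physical server processes.
    public List<InetSocketAddress> replicasForKey(java.math.BigInteger key) {
        int logical = key.mod(java.math.BigInteger.valueOf(logicalToPhysical.size())).intValue();
        List<InetSocketAddress> replicas = new ArrayList<>();
        for (int physIndex : logicalToPhysical.get(logical)) {
            replicas.add(physical.get(physIndex));
        }
        return replicas;
    }
}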


Each node in the storage cluster runs a user-level daemon program for the purposes of transferring file data, transferring control messages, and collecting statistical information. TCP/IP sockets are used explicitly to send messages to the daemon from the individual local caches, regardless of which application process makes the call. When a node needs to fetch a file stored on another node, the daemon first locates the file and then retrieves it from the remote node. When a node's residual cache capacity falls below a threshold value, the daemon moves some blocks from its local cache to a remote cache in order to balance the load.
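The following is a minimal sketch of the threshold-driven balancing step just described; the threshold, the choice of blocks to migrate, and the wire format are assumptions made only for illustration.

// Hypothetical sketch of the daemon's load-balancing check: when the
// residual (free) cache capacity drops below a threshold, some cached
// blocks are pushed to a remote peer's cache over a TCP socket.
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Deque;

public class CacheBalancer {
    private final long thresholdBytes;            // assumed configuration value

    public CacheBalancer(long thresholdBytes) {
        this.thresholdBytes = thresholdBytes;
    }

    /** Push least-recently-used blocks to a peer while free cache space is low. */
    public void rebalance(long freeBytes, Deque<byte[]> lruBlocks,
                          InetSocketAddress peer) throws IOException {
        if (freeBytes >= thresholdBytes || lruBlocks.isEmpty()) {
            return;                                // enough local cache capacity left
        }
        try (Socket socket = new Socket(peer.getAddress(), peer.getPort());
             OutputStream out = socket.getOutputStream()) {
            while (freeBytes < thresholdBytes && !lruBlocks.isEmpty()) {
                byte[] block = lruBlocks.removeLast();   // evict the coldest block
                out.write(block);                        // hand it to the remote cache
                freeBytes += block.length;
            }
        }
    }
}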

4 System Design

4.1 File Availability and Storage Overhead

At some level of abstraction, all serverless systems offer a storage service. Therefore, a very important issue is data availability. Since no component is 100% reliable, we cannot have 100% availability; but we can achieve very high availability by providing data redundancy schemes. Whole-file replication is the simplest technique used to increase reliability in distributed systems.

Assume that a file is replicated into k copies placed on k storage nodes, and that the average node availability is λ. Assuming that node availability is independent and identically distributed (i.i.d.), and that any 1 out of the k copies is enough to recover the whole file, the file availability A is:

A = \binom{k}{1}\lambda(1-\lambda)^{k-1} + \binom{k}{2}\lambda^{2}(1-\lambda)^{k-2} + \cdots + \binom{k}{k}\lambda^{k} = 1 - (1-\lambda)^{k}    (1)

From formula (1), we can solve for the storage overhead k of a file:

k = \frac{\log(1-A)}{\log(1-\lambda)}    (2)

For example, assume that the average node availability in our storage cluster is 0.8 and that the system should provide 99.999% file availability. By formula (2), the storage overhead of a file is approximately 7. Therefore, the replica number in our implementation is 7, which is also the value used in the Chord implementation [20].
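The arithmetic of formula (2) can be checked with a few lines of Java; the method name is ours.

// Minimal check of formula (2): with lambda = 0.8 and A = 0.99999,
// k = log(1 - A) / log(1 - lambda) ≈ 7.15, which the text rounds to 7 replicas.
public class ReplicationFactor {
    static double storageOverhead(double targetAvailability, double nodeAvailability) {
        return Math.log(1.0 - targetAvailability) / Math.log(1.0 - nodeAvailability);
    }

    public static void main(String[] args) {
        System.out.printf("k = %.2f%n", storageOverhead(0.99999, 0.8));  // prints k = 7.15
    }
}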

4.2 Membership Management and Two-Level Replication Map

For a large-scale cluster, nodes may join or leave the network fairly frequently. For membership maintenance, we use soft state, in contrast to the hard state used in xFS [16] or PVFS [27], where membership information is either updated synchronously among all nodes or maintained at a central location. Updating the membership information synchronously is costly and may lead to scalability problems. In our implementation,


all storage nodes periodically send out heartbeat packets through a multicast channel, and each node constructs the set of live nodes by monitoring the same channel. If a node fails to receive heartbeat packets from another node for a timeout period, that node is removed from its membership set.
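A condensed sketch of such a heartbeat monitor is given below; the multicast group, port, payload format (a plain node identifier), and the 5-second timeout are assumed configuration details that are not specified here.

// Hypothetical soft-state membership monitor: every node multicasts a
// heartbeat periodically; a node that has not been heard from within
// TIMEOUT_MS is dropped from the local membership set.
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MembershipMonitor implements Runnable {
    private static final long TIMEOUT_MS = 5_000;           // assumed timeout period
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final MulticastSocket socket;

    public MembershipMonitor(InetAddress group, int port) throws IOException {
        socket = new MulticastSocket(port);
        socket.joinGroup(group);                             // listen on the heartbeat channel
    }

    @Override public void run() {
        byte[] buf = new byte[256];
        while (true) {
            try {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);                      // heartbeat payload = node id
                String nodeId = new String(packet.getData(), 0, packet.getLength(),
                                           StandardCharsets.UTF_8);
                lastSeen.put(nodeId, System.currentTimeMillis());
                // Expire nodes whose heartbeats have timed out.
                long now = System.currentTimeMillis();
                lastSeen.values().removeIf(t -> now - t > TIMEOUT_MS);
            } catch (IOException e) {
                return;                                      // socket closed, stop monitoring
            }
        }
    }
}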

In order to achieve incremental scalability of throughput and storage capacity as more nodes are added to the system, we evenly distribute data across nodes using the linear hashing algorithm; when new nodes are added to the cluster, this even distribution is altered so that data is spread onto the new nodes. We use a two-level replication map scheme to find the partition that manages a particular hash table key, and to determine the list of replicas in the partition's replica group.

The first map is the partition map: given a hash table key, this map returns the partition key. This map thus controls the even distribution of data across the storage nodes. A typical partition-map lookup is a trie search over hash table keys: to find a key's partition, key bits are used to walk down the trie, starting from the least significant key bit, until a leaf node is found. The second map is the replica group membership map: given a partition key, this map returns the list of nodes that are currently storing replicas in the partition's replica group. The two-level mapping scheme is shown in Figure 2.
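A minimal sketch of the partition-map lookup (the first of the two maps) is shown below, assuming the trie leaves are kept as a map from low-order bit patterns to partition keys; the representation and class name are illustrative assumptions.

// Hypothetical partition-map lookup: a leaf is identified by the low-order
// bits of the key. A partition labelled "10" matches keys whose two least
// significant bits are 10; splitting it yields "010" and "110".
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

public class PartitionMap {
    // Maps a bit pattern (low-order bits of the key) to a partition key.
    private final Map<String, String> leaves = new HashMap<>();

    public void addPartition(String bitPattern) {
        leaves.put(bitPattern, bitPattern);       // partition key = its bit pattern
    }

    /** Walk down the trie by taking more low-order key bits until a leaf matches. */
    public String partitionFor(BigInteger key) {
        for (int bits = 1; bits <= 128; bits++) {
            BigInteger low = key.mod(BigInteger.ONE.shiftLeft(bits));
            String pattern = String.format("%" + bits + "s", low.toString(2)).replace(' ', '0');
            String leaf = leaves.get(pattern);
            if (leaf != null) {
                return leaf;                      // found the partition owning this key
            }
        }
        throw new IllegalStateException("no partition covers key " + key);
    }
}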

Through the two-level replication map, our storage cluster provides significant scalability and flexibility. If the cluster grows, the trie subdivides in a split operation. For instance, partition 10 in the trie in Figure 2 could split into partitions 010 and 110. If the cluster shrinks, a merge operation is performed on the trie. For example, partitions 000 and 100 in Figure 2 could be merged into a single partition 00. If a node fails, it is removed from all replica group membership maps that contain it. Rejoining a replica group after finishing recovery is an important issue that has been described in [21], [22], but goes beyond the scope of this article.
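Continuing the partition-map sketch above, split and merge reduce to local edits of the leaf set; again, the representation is assumed.

// Hypothetical split/merge on the leaf set of the sketch above.
// Splitting "10" produces "010" and "110"; merging "000" and "100" produces "00".
public class PartitionMapOps {
    /** Split a leaf: prepend a disambiguating bit to form two children. */
    static void split(java.util.Map<String, String> leaves, String pattern) {
        leaves.remove(pattern);
        leaves.put("0" + pattern, "0" + pattern);
        leaves.put("1" + pattern, "1" + pattern);
    }

    /** Merge two sibling leaves back into their common parent pattern. */
    static void merge(java.util.Map<String, String> leaves, String parentPattern) {
        leaves.remove("0" + parentPattern);
        leaves.remove("1" + parentPattern);
        leaves.put(parentPattern, parentPattern);
    }
}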

Fig. 2. Two-Level Replication Mapping Scheme

4.3 Decentralized Weighted Voting Replicas Consistency

The traditional way to provide strict consistency is with a quorum-based scheme in which a majority of the replicas must be present to update the data. Quorums are established by acquiring locks on replicas with a total of at least R or W votes, as needed for reads or writes respectively. Weighted voting ensures strict consistency by requiring that R + W > N, where N is the total number of votes assigned to all replicas of a file. This constraint guarantees that no read can complete without seeing at least one replica updated by the last write, since R > N − W. The weighted voting scheme offers greater flexibility than the traditional quorum-based scheme by assigning a number of votes to every replica of a file.
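As a trivial illustration of the constraint, the check below uses N = 7 votes (one vote per replica, matching the replication factor of Section 4.1); the helper is not part of our implementation.

// Trivial check of the strict-consistency constraint R + W > N.
// (Section 4.3 additionally enforces W >= R in the decentralized scheme.)
public class QuorumConstraint {
    static boolean strictlyConsistent(int r, int w, int n) {
        return r + w > n;          // every read quorum intersects every write quorum
    }

    public static void main(String[] args) {
        int n = 7;                                        // one vote per replica, as in Sec. 4.1
        System.out.println(strictlyConsistent(3, 5, n));  // true:  3 + 5 > 7
        System.out.println(strictlyConsistent(3, 4, n));  // false: 3 + 4 = 7, stale reads possible
    }
}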

Because user data and metadata are cooperatively managed by multiple nodes in our decentralized storage system, we decentralize the functionality of the collector, which gathers a quorum for every request to a file, by distributing versioned copies of the metadata along with the user data in each replica, and we enforce the additional constraint W ≥ R to ensure sequential consistency, as in [2]. Therefore, we can ensure that quorums are based on the latest version of the metadata, and thus every request accesses the most current data. This allows data and metadata operations to be safely interleaved, while enabling the system to perform self-tuning.

Fig. 3. Establishing a Quorum and Locking the Most Up-to-Date Version

In order to guarantee strict consistency, we employ three steps in the decentralized weighted voting scheme. First, when a client wants to access data, it must initially obtain a copy of the metadata of the file to be read or written. Second, the client uses the acquired metadata to establish a quorum. For read requests, we acquire the version of the file from a replica group with R votes and lock one of the replicas holding the latest version of the file; for write requests, we lock the replica group of the file to construct a write quorum with W votes. Finally, once the corresponding quorum is established, the requests are issued to the nodes storing the locked replica(s) to perform operations such as reading data, writing data, and updating metadata. When one of the above operations has been executed, we unlock the replica(s), and the corresponding storage nodes return the data to the client. If a write operation fails to complete, a failure handling mechanism is used to ensure the integrity of the data and consistency across replicas.

A client must establish a quorum of votes to access a file and process requests. In order to obtain the votes, the client contacts the nodes storing replicas of the file, which are listed in the replica group membership map, to obtain a lock on their replicas. The client issues a locking request to a node for a specific file, and the node returns "yes" or "no", indicating whether the lock was acquired, together with the data version and the metadata version that it is storing. Newer metadata may require the client to reconstruct the quorum of votes when the vote distribution, R, or W has changed, and to issue additional locking requests to the replica locations. This process is shown in Figure 3, where the client receives a newer version of the metadata from node 4.
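The client-side loop implied by this description can be sketched as follows: collect lock replies until R votes are held, and start over if any reply reveals newer metadata. The message fields, method names, and vote accounting are assumptions for illustration.

// Hypothetical client-side read-quorum loop for the decentralized weighted
// voting scheme: lock replicas until R votes are collected, restarting if
// any reply carries newer metadata (cf. Figure 3, node 4).
import java.util.ArrayList;
import java.util.List;

public class ReadQuorum {
    /** Reply to a locking request, as assumed for illustration. */
    public record LockReply(boolean granted, int votes, long dataVersion, long metadataVersion) {}

    public interface ReplicaNode {
        LockReply tryLock(String fileId);   // assumed RPC: "yes"/"no" plus stored versions
        void unlock(String fileId);
    }

    /** Try to gather a read quorum of at least r votes; false means "retry with fresh metadata". */
    static boolean establishReadQuorum(String fileId, List<ReplicaNode> replicas,
                                       int r, long knownMetadataVersion) {
        List<ReplicaNode> locked = new ArrayList<>();
        int votes = 0;
        for (ReplicaNode node : replicas) {
            LockReply reply = node.tryLock(fileId);
            if (reply.metadataVersion() > knownMetadataVersion) {
                // Newer metadata: the vote distribution, R or W may have changed, so
                // release the locks held so far and rebuild the quorum.
                locked.forEach(n -> n.unlock(fileId));
                return false;
            }
            if (reply.granted()) {
                locked.add(node);
                votes += reply.votes();
                if (votes >= r) {
                    return true;   // quorum held; read from a replica with the latest data version
                }
            }
        }
        locked.forEach(n -> n.unlock(fileId));   // could not gather R votes
        return false;
    }
}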

5 Performance Evaluation

To evaluate the performance, we used a laboratory cluster testbed consisting of 6 nodes connected to the network through 100 Mbit/s Ethernet cards. Each node was a Pentium 4/2 GHz PC with 256 MB of RAM running Linux with a 2.4 kernel. Each node was equipped with a single 160 GB IDE hard disk to provide storage space, ran a communication daemon and a cluster-aware buffer cache kernel module [3], [4], and contributed 128 MB of RAM to the cooperative cache.

Fig. 4. Maximum throughput scalability with increasing node count (left panel: read and write requests/s vs. number of nodes), and read throughput with increasing hash table value size (right panel: reads/s for value sizes from 40 to 3000 bytes)

We varied the number of nodes in the CHT; for each configuration, we slowly increased the number of requests and measured the completion throughput flowing from the nodes. All configurations had 2 replicas per replica group, and each read/write request accessed 150-byte values. We also varied the size of the elements that we read out of a 6-node hash table. Throughput was flat from 40 bytes through 800 bytes, but then began to degrade; for larger hash table values, the 100 Mbit/s switched network became the throughput bottleneck. Figure 4 shows the maximum throughput sustained by the CHT as a function of the number of nodes, and the throughput sustained by a 6-node hash table serving these values.

We also conducted a cache hit ratio test. In this test, one node was dedicated to servicing HTTP requests and the other five nodes were available for data storage. Each client was a Windows PC running the WebBench [25] suite of e-commerce tests. A comparison between the remote cache, the local cache, and the local disk in terms of request operations and request bytes is presented in Figure 5. Notice that while the fraction of requests served from the Web server's local disk degrades quickly as the number of nodes increases, the large fraction of requests hitting the Web server's local cache or the remote caches improves the Web server's I/O performance.

Fig. 5. Comparison of cache hit ratios (remote cache, local cache, and local disk) in terms of request operations and request bytes, for 1 to 6 nodes

6 Conclusion

In this paper, we presented a CHT-based decentralized storage cluster built upon off-the-shelf commodity components for reliable file storage. The design of our system targets self-organization, incremental expandability, easy manageability, and robust failure tolerance. The self-organizing capability of the CHT granted our storage cluster the potential to handle both failure and online provisioning gracefully. We used soft state to manage membership changes, and evenly distributed data across nodes by adopting a linear hashing algorithm so as to achieve incremental scalability of throughput and storage capacity, thus making the system easy to scale as the number of nodes grows. In order to guarantee sequential consistency of metadata and data among multiple replicas, we adopted a decentralized weighted voting scheme that allowed data and metadata operations to be safely interleaved, while enabling the system to perform self-tuning. Our experiments show that such a storage cluster can achieve high availability and flexibility while delivering reasonable I/O performance.


References

[1] Steven D. Gribble, Eric A. Brewer, Joseph M. Hellerstein, and David Culler, "Scalable, Distributed Data Structures for Internet Service Construction," In Proceedings of the 4th USENIX Symposium on Operating Systems Design and Implementation, San Diego, CA, 2000.

[2] Maya Rodrig and Anthony LaMarca, "Decentralized Weighted Voting for P2P Data Management," In Proceedings of ACM MobiDE’03, San Diego, California, USA, Sept. 2003.

[3] Bin Cai, Changsheng Xie, and Qiang Cao, "Cluster-Aware Cache for Network Attached Storage," to appear in the 2nd IFIP International Conference on Network and Parallel Computing, Nov.-Dec. 2005.

[4] Bin Cai, Changsheng Xie, and Huaiyang Li, "Building Strongly-Reliable and High-Performance Storage Cluster with Network Attached Storage Disks," to appear in the 5th International Conference on Computer and Information Technology, Sept. 2005.

[5] Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen, "Ivy: A Read/Write Peer-to-Peer File System," In Proceedings of OSDI’02, 2002.

[6] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications," In SIGCOMM, Aug. 2001.

[7] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker, "A Scalable Content-Addressable Network," In SIGCOMM, 2001.

[8] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, and Anthony D. Joseph, "Tapestry: A Resilient Global-Scale Overlay for Service Deployment," IEEE Journal on Selected Areas in Communications, Vol. 22, No. 1, Jan. 2004.

[9] A. Rowstron and P. Druschel, "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems," In Proceedings of IFIP/ACM Middleware 2001, Heidelberg, Germany, Nov. 2001.

[10] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, et al, "Deciding When to Forget in The Elephant File System," In Proc. of 17th ACM Symposium on Operating Systems Principles, Dec. 1999.

[11] Craig A. N. Soules, Garth R. Goodson, John D. Strunk, and Gregory R. Ganger, "Metadata Efficiency in Versioning File Systems," In Proceedings of 2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, USA, Mar. 2003.

[12] Zheng Zhang, Shiding Lin, Qiao Lian and Chao Jin, "RepStore: A Self-Managing and Self-Tuning Storage Backend with Smart Bricks," In Proceedings of ICAC’04, 2004.

[13] Hong Tang, Aziz Gulbeden, Jingyu Zhou, William Strathearn, Tao Yang, and Lingkun Chu, "Sorrento: A Self-Organizing Storage Cluster for Parallel Data-Intensive Applications," In Proc. of the High Performance Computing, Networking and Storage Conference, Pittsburgh, PA, Nov. 2004.

[14] John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, et al, "OceanStore: An Architecture for Global-Scale Persistent Storage," In ASPLOS, 2000.

[15] Gifford, D.K., "Weighted Voting for Replicated Data," In Proceedings of the 7th Symposium on Operating Systems Principles, pp. 150-162, 1979.

[16] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang, "Serverless Network File Systems," ACM Trans. Computer Systems, Feb. 1996.


[17] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang, "Serverless Network File Systems," In Proc. Symposium on Operating Systems Principles, pp. 109-126, Dec. 1995.

[18] W.Litwin and M.A.Neimat, "High-availability LH* Scheme with Mirroring," In Proceedings of the Conference on Cooperative Information Systems, pp. 196-205, 1996.

[19] Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica, "Wide-area Cooperative Storage with CFS," In SOSP, 2001.

[20] F.Dabek, J.Li, E.Sit, J.Robertson, F.Kaashoek, and R.Morris, "Designing a DHT for Low Latency and High Throughput," In Proceedings of NSDI’04.

[21] Rodrigo Rodrigues and Barbara Liskov, "High Availability in DHTs: Erasure Coding vs. Replication," In Proceedings of the 4th International Workshop on Peer-to-Peer Systems, 2005.

[22] Charles Blake and Rodrigo Rodrigues, "High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two," In Proceedings of HotOS, 2003.

[23] Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, et al., "FarSite: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment," In Proceedings of OSDI, 2002.

[24] J. Howard et al, "Scale and Performance in A Distributed File System," ACM Trans. On Computer System, Vol. 6, Issue 1, 1988.

[25] http://www.veritest.com/benchmarks/webbench/

[26] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," In Proceedings of the First USENIX Conference on File and Storage Technologies, Monterey, California, USA, Jan. 2002.

[27] P. Carns, W. Ligon III, R. Ross, and R. Thakur, "PVFS: A Parallel File System for Linux Clusters," In Proc. of 4th Annual Linux Showcase and Conference, 2000.

[28] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," In SOSP, New York, USA, Oct. 2003.

[29] Svend Frolund, Arif Merchant, Yasushi Saito, Susan Spence and Alistair Veitch, "FAB: Enterprise Storage Systems on A Shoestring," In HOTOS, 2003.

Biography

▲ Name: Bin Cai
Address: National Storage System Laboratory, Department of Computer Science, Huazhong University of Science and Technology, Postal Code 430074, Wuhan, P.R. China
Education & Work experience: He received the BS and MS degrees from Huazhong University of Science and Technology, Wuhan, China, in 1999 and 2002, respectively. He is currently pursuing the PhD degree in computer science and technology at Huazhong University of Science and Technology. His research interests include operating systems, network storage systems, peer-to-peer computing, distributed systems, and embedded systems.
Tel: +86-27-87556569
E-mail: [email protected]


▲ Name: Changsheng Xie
Address: National Storage System Laboratory, Department of Computer Science, Huazhong University of Science and Technology, Postal Code 430074, Wuhan, P.R. China
Education & Work experience: He is a professor of computer science and technology at Huazhong University of Science and Technology. His research interests include operating systems, optical/network storage systems, high-speed interfaces and channels, and super-high-density, super-high-speed storage technology.
Tel: +86-27-87556569
E-mail: [email protected]

▲ Name: Jianzhong Huang
Address: National Storage System Laboratory, Department of Computer Science, Huazhong University of Science and Technology, Postal Code 430074, Wuhan, P.R. China
Education & Work experience: He received the BS and MS degrees from Wuhan University of Technology, Wuhan, China, in 1998 and 2001, respectively. He is currently pursuing the PhD degree in computer science and technology at Huazhong University of Science and Technology. His research interests include network storage systems and storage security.
Tel: +86-27-87556569
E-mail: [email protected]

▲ Name: Chengfeng Zhang
Address: National Storage System Laboratory, Department of Computer Science, Huazhong University of Science and Technology, Postal Code 430074, Wuhan, P.R. China
Education & Work experience: He is currently pursuing the PhD degree in computer science and technology at Huazhong University of Science and Technology. His research interests include network storage systems.
Tel: +86-27-87556569
E-mail: [email protected]