
Performance Modeling and Analysis

of Disk Arrays

by

Edward Kihyen Lee

B.S. (University of California at Los Angeles) 1987
M.S. (University of California at Berkeley) 1990

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

GRADUATE DIVISION

of the

UNIVERSITY of CALIFORNIA at BERKELEY

Committee in Charge

Professor Randy H. Katz, Chair
Professor David A. Patterson

Professor Ronald Wolff

1993


The dissertation of Edward Kihyen Lee is approved:

Chair Date

Date

Date

University of California at Berkeley

1993


Performance Modeling and Analysis

of Disk Arrays

Copyright © 1993

by

Edward Kihyen Lee


Abstract

Performance Modeling and Analysis

of Disk Arrays

by

Edward Kihyen Lee

Doctor of Philosophy in Computer Science

University of California at Berkeley

Professor Randy H. Katz, Chair

As disk arrays become widely used, tools for modeling and analyzing the perfor-

mance of disk arrays become increasingly important. In particular, accurate performance

models and systematic analysis techniques, combined with a thorough understanding of the

expected workload, are invaluable in both configuring and designing disk arrays. Unfortunately, disk arrays, like many parallel systems, are difficult to model and analyze because of queueing and fork-join synchronization. In this dissertation, we present an analytic performance model for non-redundant disk arrays and a new technique based on utilization profiles for analyzing the performance of redundant disk arrays. In both cases, we provide

applications of our work. We use the analytic model to derive an equation for the opti-

mal size of data striping in disk arrays, and we apply utilization profiles to analyze the


performance of RAID-II, our second disk array prototype. The results of the analysis are

used to answer several performance related questions about RAID-II and to compare the

performance of RAID-II to RAID-I, our first disk array prototype.

Professor Randy H. Katz, Chair Date


To my parents.


Contents

List of Figures vi

List of Tables vii

1 Introduction 1

1.1 Motivation 1
1.2 Contributions 4
1.3 Organization 6

2 Overview 9

2.1 Basic Definitions 9
2.2 The RAID Taxonomy 10

2.2.1 Non-Redundant (RAID Level 0) 11
2.2.2 Mirrored (RAID Level 1) 13
2.2.3 Hamming Coded (RAID Level 2) 13
2.2.4 Bit-Interleaved Parity (RAID Level 3) 14
2.2.5 Block-Interleaved Parity (RAID Level 4) 14
2.2.6 Block-Interleaved Distributed-Parity (RAID Level 5) 15
2.2.7 PQ Redundant (RAID Level 6) 15

2.3 RAID Access Modes 17

2.3.1 Read 18
2.3.2 Read-Modify-Write 18
2.3.3 Reconstruct-Write 19

2.4 Comparison of RAID levels 19
2.5 Related Work 22

2.5.1 Related Modeling Work 22
2.5.2 Related Analysis Work 24
2.5.3 Summary 27

3 Experimental Methodology 29

3.1 The RAID Simulator 29
3.2 The RAID Performance Measurement Tool 33


3.3 The RAID-II Disk Array Prototype 35
3.4 The Performance Evaluation Process 41

3.4.1 Introduction 42
3.4.2 Experimental Design 44
3.4.3 Linear Regression 45
3.4.4 Error Metrics 48
3.4.5 Summary 49

4 The Analytic Model 51

4.1 The Modeled System 52
4.2 The Model System 52
4.3 The Expected Utilization 54
4.4 Approximating the Expected Utilization 56
4.5 Experimental Design 58
4.6 Validation of the Analytic Model 59
4.7 Validation of Model Properties 64

4.7.1 Significance of Factors 64
4.7.2 Relationships Between Factors 66

4.8 Variable Request Sizes 69
4.9 The Optimal Stripe Unit Size 72

4.9.1 Derivation 74
4.9.2 Validation 76

4.10 Summary 80

5 Utilization Profiles 82

5.1 First Look at Utilization Profiles 84
5.2 An Example: Utilization Profile for a Disk 85
5.3 Analysis of Blackbox Systems 88
5.4 Second Look at Utilization Profiles 92
5.5 Analysis of RAID-II 95

5.5.1 Analysis of Reads 96
5.5.2 Analysis of Read-Modify-Writes 101
5.5.3 Analysis of Reconstruct-Writes 106

5.6 Applications of Utilization Profiles 110
5.7 Summary 116

6 Summary and Conclusions 118

Bibliography 122


List of Figures

1.1 The Modeling and Analysis Processes 7

2.1 RAID Definitions 10
2.2 RAID Levels 0 Through 6 12
2.3 RAID Level 5 Parity Placements 16
2.4 RAID Level 6 Left Symmetric Parity Placement 17
2.5 Block-Interleaved RAID Access Modes 18

3.1 RAID-II Network Environment 36
3.2 Photograph of the RAID-II Storage Server 37
3.3 Schematic View of the RAID-II Storage Server 38
3.4 X-Bus Board 39
3.5 System Level Performance 41

4.1 Data Striping in Disk Arrays 52
4.2 Closed Queuing Model for Disk Arrays 53
4.3 Time-line of Events at a Given Disk 55
4.4 Analytic U with 90 Percentile Intervals as a Function of L and p 63
4.5 Plots of log(1/U - 1) vs. log2 L 67
4.6 UL vs. UL1 70
4.7 Analytic U with Maximum Error Intervals and Individual Data Points 73
4.8 Analytic vs. Empirical Optimal Stripe Unit Sizes 77
4.9 Comparison with Chen's Model 78
4.10 Optimal Stripe Unit Sizes with 95% Performance Intervals 79

5.1 Compound Requests 94
5.2 Analysis of Reads for each Resource 102
5.3 Final Analysis of Read 103
5.4 Analysis of Read-Modify-Write for each Resource 107
5.5 Final Analysis of Read-Modify-Write 108
5.6 Analysis of Reconstruct-Write for each Resource 111
5.7 Final Analysis of Reconstruct-Write 112
5.8 Maximum Throughput Envelopes for Reads 113


List of Tables

2.1 Throughput per dollar relative to RAID level 0 20

4.1 Disk Model Parameters 60


Acknowledgements

My experiences as a graduate student at UC Berkeley have been the most re-

warding of my educational and personal life. Never before have I worked with so many

bright, motivated and supportive individuals. The experience has been both exhilarating

and humbling, and I feel extremely fortunate for having worked with them.

First and foremost, I would like to thank my advisor, Prof. Randy Katz, for sup-

porting my work and for his technical and professional guidance. He has allowed me to pur-

sue my own interests while also giving me the opportunity to contribute to group projects.

He has contributed immeasurably to both my professional and personal development.

I would like to thank Prof. David Patterson for his support and especially for

his philosophical guidance. I will never forget his timeless comments and advice. I have

frequently thought of him as my second advisor and guardian angel.

I would like to thank Prof. John Ousterhout for his invaluable advice in the orga-

nization and presentation of talks and his technical guidance in the development of software

for RAID.

I would like to thank Prof. Ronald Wolff for his courses on queueing theory, the

conversations I have had with him regarding analytic models for disk arrays, and his par-

ticipation on my dissertation committee.

I would like to thank the other members of the RAID Group, Peter Chen, Ann

Drapeau, Garth Gibson, Ken Lutz, Bob Miller, Ethan Miller, Srini Seshan, and Theresa

Lessard-Smith. I especially owe Peter Chen, Ann Drapeau and Garth Gibson my gratitude

for the many interesting discussions that have influenced my thinking, their collaboration


on the RAID project, and their friendship. Ken Lutz, with his attention to details and

valuable experience, kept the hardware side of the project running smoothly. Without the

dedication and efforts of Bob Miller and Theresa Lessard-Smith, the orderly organization

of the project would be overcome by chaos.

I would like to thank members of the Sprite operating system group and the

Postgres database group with whom I have had the pleasure to work. In particular, I would

like to thank Mary Baker, Fred Douglis, John Hartman, Wei Hong, Mendel Rosenblum,

Margo Seltzer, Ken Shirriff, Mark Sullivan and Brent Welch. I would especially like to thank Mary Baker, John Hartman, Mendel Rosenblum and Ken Shirriff for explaining the

operation of the Sprite operating system, and their support of the software components of

the RAID project.

I would like to thank our government and industrial affiliates, Array Technologies, DARPA/NASA (NAG2-591), DEC, Hewlett-Packard, IBM, Intel Scientific Computers, Cal-

ifornia MICRO, NSF (MIP 8715235), Seagate, Storage Tek, Sun Microsystems and Thinking

Machines Corporation for their support.

Last but not least, I would like to thank my family for their encouragement,

understanding and love.


Chapter 1

Introduction

1.1 Motivation

In recent years, interest in RAID, Redundant Arrays of Inexpensive Disks, as high-

performance, reliable secondary storage systems has grown explosively. The driving force

behind this phenomenon is the sustained exponential improvement in the performance and density of semiconductors. Improvements in semiconductor technology make possible faster microprocessors and larger primary memory systems, which in turn require larger, higher-performance secondary storage systems. More specifically, the consequences of im-

provements in microprocessors on secondary storage systems can be broken into quantitative

and qualitative components.

On the quantitative side, Amdahl's Law predicts that, in general, large improve-

ments in microprocessors will result in only marginal improvements in overall system per-

formance unless accompanied by corresponding improvements in secondary storage sys-

tems. Unfortunately, while RISC microprocessor performance has been improving 50% per


year [33], disk access times, which depend on improvements of mechanical systems, have

been improving less than 10% per year. Over the same time interval, disk transfer rates,

which track improvements in both mechanical systems and magnetic media densities, have

improved at a significantly faster rate of 20% per year. Assuming semiconductor and disk

technologies continue their current trends, we must conclude that the performance gap

between microprocessors and magnetic disks will continue to widen. Thus, as long as sec-

ondary storage systems are based on magnetic disks, technologies such as disk arrays will

become increasingly important.

In addition to the quantitative effect, a second, perhaps more important, qualitative effect is driving the need for higher-performance secondary storage systems. As

microprocessors become faster, they make possible new applications and greatly expand

the scope of existing applications. In particular, applications such as video, hypertext and

multi-media are becoming common. Even in existing application areas such as CAD and

scientific computing, faster microprocessors make it possible to process and generate in-

creasingly larger datasets. This shift in applications combined with a trend toward large,

shared, high-performance, network-based storage systems is causing us to reevaluate the

way we design and use secondary storage systems.

Disk arrays, which organize many independent disks into a large, high-performance

logical disk, are a natural solution to the problem. They o�er several advantages over

independent disks. First, disk arrays stripe data across multiple disks and access them in

parallel to achieve both higher data transfer rates on large data accesses and higher I/O

rates on small data accesses. Second, data striping results in uniform load balancing across


all the disks, eliminating hot spots that saturate a small number of disks while the majority

of disks sit idle. Third, by adding redundancy, disk arrays can protect against disk failures

and continue to operate while the contents of the failed disk are reconstructed onto a spare

disk. Fourth, a few large logical disks are easier to administer than hundreds of small

independent disks and facilitate the implementation of extremely large, shared file systems

composed of very large objects.

Given the important role disk arrays will play in the future, tools for modeling

and analyzing the performance of disk arrays become increasingly important. In particular,

accurate performance models and systematic analysis techniques, combined with a thorough

understanding of an installation's workload, are invaluable in both configuring and designing disk arrays. Unfortunately, disk arrays, like many parallel systems, are difficult to model and analyze because of queueing and fork-join synchronization; a logical request to a disk array is split up into multiple disk requests, which are queued and serviced independently; however,

the logical request is not completed until all the corresponding disk requests are completed.

Either queueing or fork-join synchronization by itself is tractable but the combination of

the two is intractable. This is a classic problem in the modeling of queueing systems that

currently has no general solution [2, 10, 11, 39]. The modeling and analysis of such systems

is highly dependent on the characteristics of the particular system and requires the judicious

use of approximations and simplifying assumptions.

In this dissertation, we develop an analytic performance model for non-redundant

disk arrays by approximating the queueing and fork-join overheads. We also develop a new

empirical technique based on utilization profiles for analyzing the performance of redundant


disk arrays. The technique is not specific to disk arrays and can be used to analyze a wide variety of systems. In both cases, we provide applications of our work. We use the analytic model to derive an equation for the optimal size of data striping in disk arrays, and utilization profiles are used to analyze the performance of RAID-II, our second hardware disk-array prototype. The results of the analysis are used to answer several performance questions about RAID-II and to compare the performance of RAID-II to RAID-I, our first disk array

prototype.

1.2 Contributions

The primary contributions of this work are the development of an analytic per-

formance model for non-redundant disk arrays, the application of the analytic performance

model to determine a mathematical equation for the optimal size of data striping, the de-

velopment of a general performance analysis technique based on utilization profiles, and the

application of the technique to analyze the performance of RAID-II, our second disk array

prototype.

In this dissertation, we develop, validate and apply an analytic performance model

for disk arrays by modeling them as closed queueing systems and approximating the queue-

ing and fork-join overheads. Our model is different from previous analytic models of disk arrays [4, 8, 19, 20, 26, 32, 36] for the following reasons. First, we use a closed queueing model with a fixed number of processes. Previous analytic models of disk arrays used open queueing models with Poisson arrivals [8, 19, 26]. A closed model more accurately models the

synchronous I/O behavior of scientific, time-sharing and distributed systems. In such sys-


tems, processes tend to wait for previous I/O requests to complete before issuing new I/O requests, whereas in transaction-oriented systems, I/O requests are issued randomly in time

regardless of whether the previous I/O requests have completed.

Second, to the best of our knowledge, we present the first analytic model for

disk arrays that have both queueing at individual disks and the fork-join synchronization

introduced by data striping, over a continuum of request sizes. Previous analytic models

for disk arrays with both queueing and fork-join synchronization model only small or large

requests [8, 26].

We have found the analytic model especially useful for comparative studies that

focus on the relative performance of different disk array configurations, such as determining

the optimal size of data striping. The model can also be used to calculate quantitative price/performance metrics for disk arrays under a range of system and workload parameters and to identify important factors in the performance of disk arrays, and it has provided us with useful insights into characterizing the performance of disk arrays.

The model, of course, is not without limitations. Although our analytic model

handles both reads and writes to non-redundant disk arrays and reads to a subclass of

redundant disk arrays, it does not handle writes to redundant disk arrays. Also, the work-

load model is not expressive enough to accurately model real workloads. Thus, although

the model can be used to accurately predict the performance of real disk arrays under the

synthetic workloads used in this paper, its prediction of performance under real workloads

is highly approximate. A final limitation is that we only model the disk components of a

disk array. Real disk arrays have many other components such as strings, I/O controllers,


I/O busses and CPUs that affect the performance of the system.

Due to the above limitations, which are characteristic of analytic models in general,

it is desirable to have more general tools for analyzing the performance of disk arrays.

In particular, we will present an analysis technique that combines empirical performance

models with statistical techniques to summarize the performance of disk arrays over a wide

range of system and workload parameters. The analysis technique is generalizable to a wide

variety of systems, requires minimal information about the system to be analyzed, does not

require measurements of internal system resources, and produces results that are compact,

intuitive, and easily compared with those of similar systems.

Finally, we will apply the analysis technique to analyze the performance of RAID-

II, our second disk array prototype. We will show how it can be used to determine the limits

of system performance, to determine the bottleneck resource for a given system configuration and workload, to eliminate resource bottlenecks, to prevent excessive queueing delays at specified resources, and to compare similar systems.

1.3 Organization

This document is concerned with the modeling and analysis of disk arrays. Fig-

ure 1.1 schematically illustrates the two processes. Modeling consists of formulating the

system of interest as a formal specification, deriving a solution to the specification, and

validating the solution. Analysis consists of designing experiments to study the system of

interest, performing the experiments either via measurement or simulation, and interpreting

the resulting data. In both processes, successive steps will uncover new information that


Figure 1.1: The Modeling and Analysis Processes. (Modeling comprises Formulation, Derivation, and Validation; Analysis comprises Experimental Design, Measurement, and Interpretation.)

may require back-tracking and repeating previous steps.

As illustrated in Figure 1.1, modeling and analysis are mutually dependent pro-

cesses; the validation of a model requires analysis and the interpretation of data requires

a model to guide the interpretation. Without analysis, we can never know the accuracy of

our model and without modeling, we cannot draw general conclusions from our analysis.

Although modeling and analysis are interdependent, most studies emphasize one

over the other. Studies interested in the general properties of a certain type of system


will emphasize modeling while studies interested in the properties of a specific system

emphasize analysis. This document is no exception to this rule. Modeling is emphasized

in Chapter 4, which develops an analytic performance model for disk arrays, validates the

model via simulation, and applies the model to derive an equation for the optimal size

of data striping in disk arrays. Analysis is emphasized in Chapter 5, which develops a

new technique for analyzing systems based on utilization profiles, applies the technique to

analyze the performance of RAID-II, our second disk array prototype, and uses the results

of the analysis to answer several performance questions about RAID-II. These are the two

core chapters of this dissertation.

The remaining chapters are organized as follows: Chapter 2 provides an overview

of disk arrays and a summary of the related work in the field, Chapter 3 describes the

empirical methods that will be used in both the modeling and analysis of disk arrays,

and Chapter 6 summarizes the main results of this dissertation and concludes with our

experiences in modeling and analyzing disk arrays.


Chapter 2

Overview

Redundant Arrays of Inexpensive Disks (RAID) employ two orthogonal concepts,

data striping and redundancy, for improved performance and reliability. Data striping pro-

vides high-performance by increasing parallelism and improving load-balance. Redundancy

improves reliability by allowing RAID to operate in the face of disk failures without data

loss. Although the basic concepts of data striping and redundancy are simple, selecting be-

tween the many possible data striping and redundancy schemes involves complex tradeoffs between reliability, performance and cost. This chapter defines basic terms for describing RAID, describes the commonly accepted RAID organizations and how a RAID services I/O requests, and compares their performance and cost.

2.1 Basic Definitions

Figure 2.1 illustrates and defines the terms stripe unit and parity stripe in the context of a simple redundant disk array consisting of five disks. In the disk array illustrated


Stripe unit is the unit of data interleaving, that is, the amount of data that is placed on a disk before data is placed on the next disk. Stripe units typically range from a sector to a track in size (512 bytes to 64 kilobytes).

Parity stripe is a collection of data stripe units along with its corresponding parity stripe unit.

Figure 2.1: RAID Definitions.

in Figure 2.1, the parity stripe units are simply grouped together on a single disk but many

other RAID organizations distribute the parity in complex ways across the disks in the disk

array.
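To make the striping terminology concrete, the following minimal sketch (in Python; it is not part of the dissertation, and the simple round-robin layout of Figure 2.1 is assumed) maps a logical block to a data disk and a block offset within that disk, given the stripe unit size and the number of data disks.

    def map_logical_block(logical_block, stripe_unit_blocks, num_data_disks):
        """Map a logical block to (data_disk, block_within_disk) under simple striping.

        stripe_unit_blocks is the number of blocks per stripe unit; stripe units are
        assumed to be assigned to the data disks round-robin, as in Figure 2.1.
        """
        stripe_unit = logical_block // stripe_unit_blocks     # which stripe unit holds the block
        offset_in_unit = logical_block % stripe_unit_blocks   # block offset within that unit
        data_disk = stripe_unit % num_data_disks              # round-robin placement
        parity_stripe = stripe_unit // num_data_disks         # row (parity stripe) number
        return data_disk, parity_stripe * stripe_unit_blocks + offset_in_unit

    # Example: with 4 data disks and 2-block stripe units, logical block 9 falls in
    # stripe unit 4 and therefore on data disk 0, at block 3 of that disk.
    print(map_logical_block(9, 2, 4))   # -> (0, 3)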

2.2 The RAID Taxonomy

Most redundant disk array organizations can be distinguished based on the granu-

larity of data interleaving and the method and pattern in which the redundant information

is computed and distributed, respectively, across the disk array. Data interleaving can be

characterized as either fine-grained or coarse-grained. Fine-grained disk arrays conceptually interleave data at a small unit such that all I/O requests, regardless of their size, access all

the disks in the disk array. This results in high data transfer rates for all I/O requests but

only one I/O request can be serviced at a time. Coarse-grained disk arrays interleave data

at a large unit so that small I/O requests access a few disks but large requests access all


the disks in the disk array. This allows many small requests to be serviced simultaneously

but allows large requests to use all the disks.

The incorporation of redundancy in disk arrays brings up two somewhat orthog-

onal problems. The first problem is selecting the method for computing the redundant

information. Most redundant disk arrays today use parity, but there are some that use

Hamming or Reed-Solomon codes [12]. The second problem is selecting a method for dis-

tributing the redundant information across the disk array. Although there are an unlimited

number of patterns in which redundant information can be distributed, we can roughly

distinguish between two different distribution schemes: those that concentrate redundant information on a few disks, and those that distribute redundant information uniformly

across all disks. Schemes that uniformly distribute redundant information are generally

more desirable because they avoid contention for the redundant information, which must

be updated on each write to the disk array.

2.2.1 Non-Redundant (RAID Level 0)

The simplest redundancy scheme is to have no redundancy; this is clearly the

lowest cost redundancy scheme. This scheme has the best write performance, because it

never needs to update redundant information. Surprisingly, it does not have the best read

performance. Redundancy schemes such as mirroring, which duplicate data, can perform

better on small reads by selectively scheduling requests on the disk with the shortest ex-

pected seek and rotational delays. Without redundancy, any single disk failure will result

in data-loss, making this the least reliable redundancy scheme. Non-redundant disk arrays

are widely used in supercomputing environments where performance rather than reliability


Non-Redundant (RAID Level 0)

Mirrored (RAID Level 1)

Hamming Coded (RAID Level 2)

Bit-Interleaved Parity (RAID Level 3)

Block-Interleaved Parity (RAID Level 4)

Block-Interleaved Distributed-Parity (RAID Level 5)

PQ Redundant (RAID Level 6)

Figure 2.2: RAID Levels 0 Through 6. All RAID levels are illustrated at a user capacity of four disks. Disks with multiple platters indicate block-level striping while disks without platters indicate bit-level striping. The shaded platters represent redundant information.


is the primary concern.

2.2.2 Mirrored (RAID Level 1)

Mirroring is the simplest redundancy scheme. All data are duplicated so that a

complete backup copy is available when a disk fails. The duplication of data results in poor

storage utilization, but provides high reliability and allows small read requests to be serviced

by one of two disks. This usually results in more efficient disk scheduling. Unfortunately,

each write results in two disk accesses. This is a disadvantage compared to other RAID

schemes which have lower write overheads for large write requests. Mirroring is frequently

used in database systems where availability and transaction rate are more important than

write bandwidth or storage efficiency.
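The read-scheduling optimization mentioned above can be sketched as follows; the seek-distance estimate is only a crude stand-in for the expected seek and rotational delays, and the names are illustrative rather than taken from any real implementation.

    def choose_mirror_for_read(request_cylinder, replicas):
        """Pick the replica whose head is expected to reach the requested data soonest.

        replicas is a list of disk objects with a head_cylinder attribute; seek
        distance is used here as a rough proxy for expected positioning time.
        """
        return min(replicas, key=lambda disk: abs(disk.head_cylinder - request_cylinder))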

2.2.3 Hamming Coded (RAID Level 2)

Hamming-coded disk arrays use a redundancy scheme similar to that used in main

memory systems [12]. A collection of parity disks computes parity over different subsets of

the data and parity disks. By noting which parity disks are in error, one can determine

both the disk that has failed and the data that was stored on the failed disk. The storage

efficiency for Hamming-coded disk arrays can be much better than for mirrored disk arrays.

In practice, Hamming-coded disk arrays are not common because redundancy schemes such

as bit-interleaved parity provide similar reliability at better performance and cost.


2.2.4 Bit-Interleaved Parity (RAID Level 3)

In a bit-interleaved parity disk array, data is interleaved bit-wise over the data

disks and a single parity disk is added to tolerate any single disk failure. Each read request

accesses all the data disks and each write request accesses all the data disks and parity disk.

Thus, only one read or write request can be serviced at a time. Because the parity disk

contains only parity and no data, the parity disk cannot participate in reads, resulting in

slightly lower read performance than for redundancy schemes that distribute the parity and

data over all disks. Bit-interleaved parity disk arrays are frequently used in high-bandwidth

applications that do not require many small I/O's per second.

2.2.5 Block-Interleaved Parity (RAID Level 4)

The block-interleaved parity disk array is similar to the bit-interleaved parity disk

array except that data is interleaved across disks in blocks rather than bits. Small read

requests access only a single data disk and small write requests access a single data disk

and the parity disk. Each small write request results in a read-modify-write operation.

The old data and parity are read from the corresponding data and parity disks and xored (checksum modulo two) with the new data to compute the new parity. The new data and

parity are then written out to the corresponding disks. Thus, each small write request

results in four disk accesses: two reads and two writes. Because a block-interleaved parity

disk array has only one parity disk, which must be updated on all write operations, the

parity disk can easily become a bottleneck. Because of this limitation, the distributed-parity

disk array is universally preferred over the non-distributed parity disk array.


2.2.6 Block-Interleaved Distributed-Parity (RAID Level 5)

The block-interleaved distributed-parity disk array eliminates the parity disk bot-

tleneck in RAID level 4 disk arrays by distributing the parity uniformly over all the disks.

A frequently overlooked advantage to distributing the parity is that it also distributes the

data over all the disks. This allows all disks to participate in servicing read operations, in

contrast to redundancy schemes with dedicated parity disks that cannot participate in ser-

vicing read requests. Block-interleaved distributed-parity disk arrays have among the best small read, large read and large write performance of any redundant disk array. Small write requests are somewhat inefficient compared with redundancy schemes such as mirroring,

however, because of the read-modify-write operations.

The exact method used to distribute parity in block-interleaved distributed-parity

disk arrays can make a significant impact on performance. Figure 2.3 illustrates four simple

parity distribution schemes. The symmetric parity distributions are obtained by numbering

each striping unit starting from the parity striping unit; the asymmetric parity distributions

are obtained by numbering each striping unit starting from the left-most striping unit. In

earlier work, we showed that, of the four parity distributions illustrated in Figure 2.3, the

left-symmetric parity distribution is the most desirable for several reasons [22]. The most

important reason is that a sequential traversal of the striping units in the left-symmetric

parity placement will access each disk once before accessing any disk twice. This reduces

disk conflicts when servicing large requests.
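As an illustration of how a parity placement can be computed, the following sketch encodes one common formulation of the left-symmetric placement: parity rotates across the disks from right to left, and the data stripe units of each row are placed on the disks following the parity unit, wrapping around. The exact indexing conventions of [22] may differ; this is an assumption for illustration only.

    def left_symmetric_location(data_unit, num_disks):
        """Return (disk, row) for a data stripe unit in a left-symmetric RAID level 5.

        Each row (parity stripe) holds num_disks - 1 data units plus one parity unit.
        Placing data units on the disks after the parity unit gives the property noted
        above: a sequential scan of data units touches every disk once before touching
        any disk twice.
        """
        data_per_row = num_disks - 1
        row = data_unit // data_per_row
        parity_disk = (num_disks - 1 - row) % num_disks          # parity rotates right to left
        disk = (parity_disk + 1 + data_unit % data_per_row) % num_disks
        return disk, row

    # With 5 disks, data units 0..4 fall on disks 0, 1, 2, 3, 4 in turn, units 5..9
    # fall on disks 0, 1, 2, 3, 4 again, and so on.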

2.2.7 PQ Redundant (RAID Level 6)


Figure 2.3: RAID Level 5 Parity Placements. Each square corresponds to a stripe unit. Each column of squares corresponds to a disk. Only the minimum repeating pattern is shown. (The four placements illustrated are left-asymmetric, right-asymmetric, left-symmetric and right-symmetric.)


Figure 2.4: RAID Level 6 Left Symmetric Parity Placement. Each square corresponds to a stripe unit. Each column of squares corresponds to a disk.

The PQ redundant disk array is identical to the RAID Level 5 disk array except

that it uses two disks per redundancy group instead of one. The P disk stores parity while

the Q disk stores a Reed-Solomon code [12]. The Reed-Solomon code in combination with

the parity allows the recovery of data on any two disks. Figure 2.4 illustrates a variation of

the left-symmetric parity placement when extended to PQ redundant disk arrays.

2.3 RAID Access Modes

The access modes for bit-interleaved disk arrays are simple; all the data and parity

in a parity stripe are read and written together. The access modes for block-interleaved

disk arrays are more complex since only a part of a parity stripe may be read or written.

During normal operation, a block-interleaved RAID uses three di�erent access modes to

service I/O requests. Figure 2.5 illustrates the read, read-modify-write and reconstruct-

write access modes.


Figure 2.5: Block-Interleaved RAID Access Modes. The lightly shaded regions correspond to data that is to be read or written. The dark regions correspond to parity. (The three modes illustrated are read, read-modify-write and reconstruct-write.)

2.3.1 Read

Read is the simplest access mode. It simply reads the requested data from the

appropriate disks.

2.3.2 Read-Modify-Write

The read-modify-write access mode is used if a small portion of the stripe is writ-

ten. In such a case, fewer disk accesses are required if the parity is updated incrementally

by xoring the new data, old data and old parity than by xoring the new data with the rest

of the parity stripe. The read-modify-write method is executed in three steps:


1. Read old data and old parity.

2. Compute new parity.

3. Write new data and new parity.
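The incremental parity update of step 2 is just an exclusive-or of the three quantities read and written above: the new parity is the old parity xored with the old and new data. A minimal sketch in Python, assuming equally sized byte blocks (this is illustrative, not the dissertation's implementation):

    def xor_blocks(*blocks):
        """Bytewise exclusive-or of equally sized blocks."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    def read_modify_write_parity(old_data, old_parity, new_data):
        """Step 2 above: new parity = old parity XOR old data XOR new data.

        The parity remains the XOR of all data stripe units in the parity stripe
        without having to read the units that are not being written.
        """
        return xor_blocks(old_parity, old_data, new_data)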

2.3.3 Reconstruct-Write

The reconstruct-write access mode is used if a large portion of the stripe is written.

Usually, the reconstruct-write method is preferred to the read-modify-write method if half

or more of the parity stripe is being written. The reconstruct-write method is executed in

three distinct steps.

1. Read data from rest of the parity stripe.

2. Compute parity.

3. Write new data and new parity.

If an entire parity stripe is being written, step one is trivial and does not require disk

accesses.
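For comparison, here is a sketch of the reconstruct-write parity computation and of the selection rule between the two write modes (the one-half threshold stated above). The helper xor_blocks is the one sketched in Section 2.3.2, and the counting of stripe units is an assumption for illustration.

    def reconstruct_write_parity(new_units, untouched_units):
        """Steps 1-2 above: recompute parity from the new data plus the rest of the
        parity stripe read back from disk.  If the whole stripe is being written,
        untouched_units is empty and step 1 requires no disk accesses."""
        return xor_blocks(*(list(new_units) + list(untouched_units)))

    def pick_write_mode(data_units_written, data_units_per_parity_stripe):
        """Prefer reconstruct-write when half or more of the stripe's data units change."""
        if 2 * data_units_written >= data_units_per_parity_stripe:
            return "reconstruct-write"
        return "read-modify-write"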

2.4 Comparison of RAID levels

There are many ways to compare RAID organizations. To a large extent, the

metrics used depend on the intended use of the systems. In time-sharing applications, the primary metric may be user capacity (disk capacity excluding the redundant information) per dollar; in transaction processing applications the primary metric may be I/O rate per dollar; and in scientific applications, the primary metric may be bandwidth per dollar. In


            Small Read    Small Write      Large Read    Large Write    Storage Efficiency
RAID 0      1             1                1             1              1
RAID 1      1             1/2              1             1/2            1/2
RAID 3      1/G           1/G              (G-1)/G       (G-1)/G        (G-1)/G
RAID 5      1             max(1/G, 1/4)    1             (G-1)/G        (G-1)/G
RAID 6      1             max(1/G, 1/6)    1             (G-2)/G        (G-2)/G

Table 2.1: Throughput per dollar relative to RAID level 0. G is the parity group size in disks. For small requests, throughput is measured in I/O's per second while for large requests, throughput is measured in bytes per second. Storage efficiency is the fraction of total storage capacity that is available to the user.
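For reference, the entries of Table 2.1 can be written directly as functions of the parity group size G; the following sketch merely encodes the table and is not a performance model.

    def relative_throughput_per_dollar(raid_level, G):
        """Throughput per dollar relative to RAID level 0 (Table 2.1), as a tuple of
        (small read, small write, large read, large write, storage efficiency).
        G is the parity group size in disks."""
        table = {
            0: (1, 1, 1, 1, 1),
            1: (1, 1 / 2, 1, 1 / 2, 1 / 2),
            3: (1 / G, 1 / G, (G - 1) / G, (G - 1) / G, (G - 1) / G),
            5: (1, max(1 / G, 1 / 4), 1, (G - 1) / G, (G - 1) / G),
            6: (1, max(1 / G, 1 / 6), 1, (G - 2) / G, (G - 2) / G),
        }
        return table[raid_level]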

certain heterogeneous systems, both I/O rate and bandwidth may be important. In this

section, we will use throughput, either I/O rate or bandwidth, per dollar as our primary

performance metric in comparing RAID organizations.

Even after the appropriate metrics have been chosen, comparisons between RAID

levels one through six can be confusing. Frequently, a RAID level specifies not a specific implementation of a system but its configuration and use. For example, a RAID level 5

disk array with a parity group size of two is equivalent to RAID level 1 with the exception

that in a mirrored disk array, certain disk scheduling optimizations on reads are performed

that generally are not implemented for RAID level 5 disk arrays. Analogously, a RAID

level 5 disk array can be configured to operate equivalently to a RAID level 3 disk array

by choosing a striping unit such that the smallest I/O request always accesses a full parity

stripe of data. RAID level 1 and RAID level 3 disk arrays can be viewed as a subclass of

RAID level 5 disk arrays. Since RAID level 2 and RAID level 4 disk arrays are, practically

speaking, in all ways inferior to RAID level 5 disk arrays, we will exclude them from our

comparisons to accentuate the differences between RAID levels 0, 1, 3, and 5.


Table 2.1 tabulates the maximum throughput per dollar relative to RAID level

0. The cost of each system is assumed to be proportional to the total number of disks in

the disk array; this is somewhat inaccurate since more complex RAID organizations may

require more expensive hardware and software. The table illustrates, for example, that

given equivalent cost RAID level 0 and RAID level 1 systems, the RAID level 1 system can

sustain half the number of small writes per second that a RAID level 0 system can sustain.

Equivalently, we can say that small writes are twice as expensive in a RAID level 1

system as in a RAID level 0 system. In addition to performance, the table shows the storage efficiency of each disk array organization. The storage efficiency is the inverse of the cost of each unit of user capacity relative to a RAID level 0 system. Note that the storage efficiency is the same as the performance/cost for large writes.

As expected, Table 2.1 shows that the performance/cost of RAID level 1 is equivalent

to the performance/cost of RAID level 5 when the parity group size is equal to two. The

performance/cost of RAID level 3 is always less than or equal to the performance/cost of

RAID level 5. This is expected given that RAID level 3 is a subclass of RAID level 5 derived

by restricting the striping unit size such that requests access exactly a parity stripe. Since

the configuration of RAID level 5 is not subject to such a restriction, the performance/cost

of RAID level 5 can never be less than that of an equivalent RAID level 3. In the rest of

this dissertation, we will deal exclusively with RAID level 0 and RAID level 5 disk arrays,

the two most "efficient" disk array organizations.


2.5 Related Work

This section describes related work in the modeling and analysis of disk arrays. The

related work most relevant to this dissertation concerns the analytic modeling of disk arrays.

The analysis work presented in this dissertation is difficult to compare directly with other

related work because we develop a new analysis technique and use it to analyze a unique,

real disk array prototype. Still, there are a few measurement studies that are relevant to

our work. For the benefit of our readers, we will also describe disk array research that is

not directly relevant to this dissertation.

2.5.1 Related Modeling Work

Analytic queueing models of disk arrays are difficult to formulate because of queueing and fork-join synchronization; that is, a request to a disk array is broken up into independent disk requests that must all complete. Related work in analytic models for disk arrays falls into one of three categories: models that ignore queueing, models that ignore fork-join

synchronization, and models that handle both queueing and fork-join synchronization using

approximate techniques.

Analytic models that ignore queueing are frequently used to calculate expected

minimum response times and maximum throughputs of systems. Such metrics are useful

for estimating the limits of system performance but are applicable to only extreme workloads

when the load in the system is either very light or very heavy. Salem and Garcia-Molina [36]

derive expected minimum response time equations to study the benefits of data striping in

synchronized (disk arm and rotational position) non-redundant disk arrays. They also show


the effects of several low-level disk optimizations on the response time for individual disks.

Bitton and Gray [4] derive expected disk seek times for unsynchronized mirrored disk arrays.

Kim and Tantawi [20] derive service time distributions for unsynchronized, bit-interleaved,

non-redundant disk arrays. Patterson, Gibson and Katz [32] derive maximum throughput

estimates for RAID levels 0 through 5.

Analytic models that handle queueing but ignore fork-join synchronization tend to

be more sophisticated than models that ignore queueing. Such models are frequently used to

model synchronized, bit-interleaved disk arrays, which can be accurately modeled without

fork-join synchronization. Kim [19] models synchronized bit-interleaved disk arrays as an

M/G/1 queueing system and shows that bit-interleaved disk arrays provide lower service

times and better load balancing but decrease the number of requests that can be serviced

concurrently. Chen and Towsley [8] analytically model RAID level 1 and RAID level 5 disk

arrays. Bounds are used to model the queueing and fork-join synchronization in RAID level

1 disk arrays. The fork-join synchronization overhead is ignored for small write requests,

resulting in an optimistic model. Large requests are modeled by using a single queue for all

the disks in the disk array. The main drawback with the model is that it only models small

and large requests. Moderate sized requests are ignored.

Systems that exhibit both queueing and fork-join synchronization are beyond the

current capabilities of queueing theory. Few analytic models for disk arrays attempt to

model both elements. Menon and Mattson [26] formulate an approximate model for RAID

level 5 systems under transaction processing workloads. Their model is based on a mathe-

matical relationship for two-server M/M/1 systems presented by Nelson and Tantawi [31].


The work presented by Menon and Mattson is flawed, however, since they directly apply

the results of Nelson and Tantawi to systems with more than two servers without justifying

the generalization. In earlier work, we derive an analytic performance model for unsynchro-

nized, block-interleaved, non-redundant disk arrays by approximating the queueing and

fork-join overheads [23]. This work is the first to model such disk arrays as closed queueing

systems with a continuum of request sizes from small to large requests. The work reported

in this dissertation is a superset of the work reported in that paper.

To summarize, a definitive analytic model for block-interleaved redundant disk

arrays does not exist. Basic advances in the analysis of fork-join queueing systems are

needed before such a model can be derived. Such advances are not likely in the near future.

Until they occur, we will have to live with many specialized approximate models for disk

arrays. Aside from the work mentioned above, readers interested in modeling disk arrays

may be interested in several queueing theoretic papers that deal specifically with fork-join queueing models [1, 2, 10, 11, 14, 15, 18].

2.5.2 Related Analysis Work

Work on the analysis of disk arrays typically focuses on measurement or simulation.

Measurement studies usually attempt to characterize the overall performance of a system

or to answer specific performance questions about the system. Simulation studies usually examine the effect of specific design or configuration tradeoffs in canonical disk arrays.

Simulation studies usually address one of three main topics: data striping, redundancy

schemes, or operation during the reconstruction of failed disks.

Since a large part of this dissertation is concerned with the analysis of RAID-II, our


second disk array prototype, other measurement-based studies are the most relevant for the

purposes of comparison with our work. Chen et al. [5] compares the throughput of RAID

level 1 disk arrays versus RAID level 5 disk arrays using Amdahl mainframes. The primary

purpose of the study is to validate the maximum throughput comparisons presented in a

paper by Patterson, Gibson and Katz [32]. Drapeau [9], also known as Chervenak, presents low-level performance measurements of RAID-I, a disk array prototype constructed using off-the-shelf components. The paper is primarily concerned with the efficiency of the SCSI

busses used in the system. Chen, Lee, et al. [6] describe the design, implementation and

performance of RAID-II. The paper presents system-level and component-level performance

measurements and provides justifications for major design decisions. Our work differs from the above related work in that we use a systematic framework, utilization profiles, to both

characterize overall system performance and to answer specific performance questions about

RAID-II, instead of ad-hoc techniques and measurements.

Data striping is the most studied topic in disk arrays. Livny [24] and Gray, Horst

and Walker [13] study the benefits of striping versus not striping in non-redundant and redundant disk arrays, respectively. Livny shows that striping is generally preferable except at high loads. Gray, Horst and Walker show that it is sometimes desirable to distribute

the redundant information across many disks even when the data is not striped. Reddy

and Banerjee [34] investigate the performance of bit-level versus block-level data striping in

non-redundant disk arrays. Chen and Patterson [7] and Lee and Katz [21] extend the work

by Reddy and Banerjee by studying performance over a continuum of data striping sizes.

Chen and Patterson present empirical rules of thumb for picking the data striping size. Lee


and Katz derive analytic equations for the best data striping size.

Redundancy schemes concern the pattern in which redundant information is dis-

tributed across the disks in a disk array, the method in which the redundant information

is computed, and the technique used to update the redundant information when data is

written to the disks. This is currently a very active area of disk array research. Lee and

Katz [22] investigate the performance consequences of di�erent methods of placing parity

in RAID level 5 disk arrays and proposes several properties that are generally desirable of

parity placements. Bhide and Dias [3] and Stodolsky and Gibson [37] propose using logs

to transform random parity updates into sequential disk accesses, significantly reducing the

parity update overhead for small writes in RAID level 5 disk arrays. Menon, Roche and

Kasson [25] propose an alternative scheme, called floating parity, for reducing small write overheads. Floating parity significantly reduces the rotational latency for parity updates

by opportunistically writing the new parity to the rotationally nearest sectors.

Because most disk arrays will operate with a failed disk for only brief periods of

time, performance in the face of disk failures has not been given serious attention until

recently. The commercialization of disk arrays for continuous, real-time applications such

as video service is changing the trend. In many applications, it is simply not acceptable

for I/O performance to degrade sharply without warning. Menon and Mattson [27, 28] and Reddy and Banerjee [35] collectively propose three different ways of "sparing" redundant

disk arrays: dedicated sparing, parity sparing, and distributed sparing. In redundant disk

arrays, it is desirable to have a spare disk that can be used immediately to replace a failed

disk. More complex sparing schemes, such as parity sparing and distributed sparing, allow


the spare disk to be used during the normal operation of the disk array by distributing

the spare disk storage across all disks in the array. In addition to improved performance

during the normal operation of the system, the distribution of the spare storage results in

more uniform load distribution when a disk fails. Muntz and Lui [30] propose a method of

organizing parity similar to RAID level 5 disk arrays that uniformly distributes the load

during reconstruction. In contrast, a disk failure in a standard RAID level 5 disk array

dramatically increases the load for the small group of disks in the same parity group as the

failed disk due to the additional I/O needed to reconstruct the failed disk and to service

I/O requests issued to the failed disk. Hou, Menon and Patt [16] investigate the e�ects on

the response time of user requests when using di�erent sized units of data recovery during

the rebuilding of a failed disk.

2.5.3 Summary

Analytic queueing models of disk arrays are difficult to analyze because of queueing

and fork-join synchronization. Basic advances in the analysis of fork-join queueing systems

are needed before general analytic models of disk arrays are feasible. Until they occur,

we will have to live with many specialized approximate models for disk arrays. In this

dissertation, we present a specialized analytic model for disk arrays. It differs from other

models in that it models the disk array as a closed rather than open queueing system, and

handles a continuum of request sizes from small to large requests.

The analysis of disk arrays has so far focused mainly on simulation studies of canonical disk arrays rather than the measurement of real disk arrays. In either case, the

analysis studies rely on ad-hoc techniques. The lack of systematic analysis procedures is


characteristic of computer science in general and especially noticeable in the analysis of

disk arrays. In this dissertation, we present a new analysis technique based on utilization

profiles and apply the technique to systematically analyze RAID-II, our second disk array

prototype.


Chapter 3

Experimental Methodology

The previous chapter has provided an overview of RAID and a survey of the related

work on RAID. This chapter introduces the experimental tools and techniques that were

used in conducting the work presented in this dissertation. In particular, we describe the

RAID simulator used to validate our analytic model; the RAID performance measurement

tool used to measure the performance of RAID-II, our second disk array prototype; RAID-II

itself; and the performance evaluation process.

3.1 The RAID Simulator

The RAID Simulator, raidSim, is an event-driven simulator developed by us at

Berkeley for modeling both non-redundant and redundant disk arrays. The simulator mod-

els only disks; in particular, it does not model the host CPU, host disk controllers, or

I/O busses. This allows us to study idealized disk arrays without being concerned with

implementation details. RaidSim is used in Chapter 4 to validate our analytic model of


non-redundant disk arrays.

Two questions basic to any event-driven simulation are: (1) when to start collecting

simulation statistics, and (2) when to stop. When to start statistics collection is the simpler

question to answer. In regenerative simulation, the process that is being modeled has

well-defined beginning and end points referred to as regeneration points. Thus, if you

are interested in the behavior of the entire process from beginning to end, then statistics

collection should begin immediately from the initial regeneration point.

Usually, however, the processes we are interested in, such as the servicing of I/O requests by a disk array, are non-regenerative. Such a process is continuous, without well-defined beginning or end points, and we are primarily interested in the long-term steady-state

behavior of the system. In such systems, the initial behavior of the system is unrepresen-

tative of the steady state behavior of the system because the process begins from an initial

state that may never or rarely occur in the steady state. Thus, it is desirable to delay

statistics collection until the system has "warmed up".

In our simulations, we use two different techniques to avoid the initial transient behavior. First, instead of issuing all the requests at the same time at the beginning of the simulation, we stagger their start. This starts the simulation in an initial state that is closer to the steady state. Second, we wait for all the initially issued requests to complete before collecting statistics. Other requests are issued as the initial requests complete; however, we do not start collecting statistics until the initially issued requests have been flushed from the system.

The question of when to end statistics collection is more difficult. With regenerative simulation, statistics collection should clearly end at a regeneration point. But even so, how many regeneration points should be simulated? The longer we simulate, the more accurate the results, but the more expensive the simulation.

Several simple stopping conditions are as follows: (1) simulate a constant number

of regeneration points or requests, (2) simulate for a constant period of real time, or (3)

simulate for a constant period of simulation time. Alas, all these alternatives are highly

wasteful of resources. Because the variance in the collected statistics varies greatly as a

function of the simulation parameters, low variance parameters are over-simulated to get

accurate results for the few high variance parameters.

A more sophisticated stopping condition is to run a "pre-simulation" with a few regeneration points or requests before running the main simulation. The pre-simulation provides an estimate of the variance of individual regeneration points or requests, and the estimated variance can be used to calculate the number of regeneration points or requests necessary to achieve the desired variance in the final average statistics. The main simulation stops when a constant number of regeneration points or requests, as determined by the pre-simulation, has been processed. This alternative, however, has a couple of unaesthetic qualities. First, one needs a mechanism to determine when the estimated variance is sufficiently accurate. Second, one cannot reuse the work performed by the pre-simulation during the actual simulation.

A general problem with any stopping condition dependent on variance estimation is correlation. If correlation is ignored, the estimated variance of average statistics can be much higher or lower than the actual variance of the average statistics. Regenerative simulation, by definition, does not suffer from correlation. In non-regenerative simulation, however, statistics taken close together in simulation time will tend to be positively correlated and hence result in estimated variances for averages that are much smaller than the actual variances. That is, the average statistics will vary much more than expected. When statistics are correlated in time, the correlation is referred to as autocorrelation.

In most non-regenerative simulation, the autocorrelation is highest for measurements taken close together in time and will rapidly decrease as they are taken farther apart. To make the problem more concrete, consider the following two sequences of binary integers: (0, 1, 1, 0, 1, 0, 0, 1) and (0, 0, 1, 1, 1, 1, 0, 0). The first is generated by a set of uncorrelated binary random variables and the second is generated by a set of highly correlated binary random variables. Both sequences contain the same number of zeros and ones and, thus, the individual variances and the means are the same. In the second sequence, however, the zeros and ones tend to occur in groups of two. The second sequence might just as well have been represented by the following third sequence: (0, 2, 2, 0). Thus, although the individual variances and the means of the two sequences are the same, the variance of the mean for the second sequence is significantly higher simply because the second sequence was determined with much less "freedom" than the first sequence.
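To make the effect concrete, the following minimal Python sketch (our own illustration, not part of raidSim) estimates the variance of the mean for the correlated sequence above, first naively assuming independent samples and then using batch means of size two; the batched estimate comes out noticeably larger, as the argument above predicts.

    import numpy as np

    # The correlated sequence from the text.
    x = np.array([0, 0, 1, 1, 1, 1, 0, 0])

    # Naive estimate, valid only for independent samples: Var(mean) = Var(x)/n.
    naive = x.var(ddof=1) / len(x)

    # Batch-means estimate: average pairs first, then treat the batch means
    # as (roughly) independent samples.
    batch_means = x.reshape(-1, 2).mean(axis=1)        # -> [0., 1., 1., 0.]
    batched = batch_means.var(ddof=1) / len(batch_means)

    print(naive, batched)    # approximately 0.036 versus 0.083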

In our simulations, we employ a modified version of the stopping condition based on variance estimation previously described. First, we batch together statistics from many requests when estimating the variance in order to reduce autocorrelation. Since most of the correlation occurs close together in time, the correlated statistics will be grouped together in batches and the batches themselves will be only slightly correlated. We examine autocorrelation coefficients for the batches to ensure that the batches are large enough to eliminate most of the correlation. Second, rather than use a pre-simulation to estimate variance, we iteratively generate variance estimates as the simulation proceeds, stopping when the desired variance for the average statistics is reached. To prevent premature termination of a simulation, a minimum of thirty requests must complete before checking the variance of the average statistics. Unlike the original stopping condition based on variance estimation, this stopping condition is dependent on the estimated variance. Thus, it will tend to produce variance estimates that are slightly smaller than the actual variance.
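The following Python sketch outlines the flavor of this stopping rule. It is our own reconstruction under stated assumptions (the batch size, thresholds, and the sample_response() callback are all hypothetical), not the code used in raidSim.

    import numpy as np

    def run_until_converged(sample_response, target_var, batch_size=32,
                            min_requests=30, max_requests=100_000):
        # sample_response() is assumed to return the statistic of interest
        # (e.g. response time) for one simulated request.
        samples = []
        var_of_mean = float("inf")
        while len(samples) < max_requests:
            samples.append(sample_response())
            if len(samples) < max(min_requests, 4 * batch_size):
                continue                              # too early to check
            n_batches = len(samples) // batch_size
            data = np.array(samples[:n_batches * batch_size])
            batch_means = data.reshape(n_batches, batch_size).mean(axis=1)
            # Batching should have removed most of the correlation; the lag-1
            # autocorrelation of the batch means is a cheap sanity check.
            lag1 = np.corrcoef(batch_means[:-1], batch_means[1:])[0, 1]
            var_of_mean = batch_means.var(ddof=1) / n_batches
            if abs(lag1) < 0.1 and var_of_mean <= target_var:
                break
        return np.mean(samples), var_of_mean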

3.2 The RAID Performance Measurement Tool

The RAID Performance Measurement Tool, raidPerf, developed by us at Berkeley, is a generic tool for measuring the performance of concurrent I/O systems. Unlike most I/O devices today, a concurrent I/O system can service multiple requests simultaneously. Although the internal structures of raidPerf and raidSim are quite different, both use similar application-level interfaces. This makes it easy to run the same workloads on both the simulator and the real system. RaidPerf is used in Chapter 5 to gather performance measurements of RAID-II, our second disk array prototype. The measurements are subsequently used to analyze RAID-II using utilization profiles, a new empirical analysis technique presented in Chapter 5.

RaidPerf is primarily a workload generation tool with built-in statistics collection. As in raidSim, raidPerf models a closed system with a fixed number of independent processes continuously issuing I/O requests. Each process is created via the UNIX fork() system call.

The processes communicate with a central process that generates workloads and collects statistics via UNIX pipes. Using a single process that uses asynchronous I/O interfaces would be much more efficient than using multiple processes and pipes, but asynchronous I/O interfaces are not generally available on UNIX.

We have found that using pipes to communicate between multiple processes is highly CPU intensive. In the worst case, each communication operation between processes requires a context switch that takes on the order of a few milliseconds. To reduce the overhead, the central control process pipelines one request ahead, effectively creating an "on-deck" request for each process issuing I/O requests. This buffering of requests, however, creates greater measurement uncertainties for the central control process since more requests are outstanding at a time. The tradeoff is thus CPU overhead versus variations in the measured statistics.
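The structure described above can be sketched as follows. This is a minimal, self-contained Python illustration of the fork()-and-pipe organization (UNIX only); raidPerf itself is written differently, and the process counts, request sizes, and the simulated service time are hypothetical.

    import os, struct, time

    NUM_PROCS = 4          # hypothetical number of load-generating processes
    REQS_PER_PROC = 16     # requests measured from each process

    def io_process(cmd_r, res_w, nreqs):
        # Child: read (offset, size) commands, perform the I/O (simulated here
        # with a sleep), and send the measured service time back to the parent.
        for _ in range(nreqs):
            offset, size = struct.unpack("qq", os.read(cmd_r, 16))
            start = time.time()
            time.sleep(0.01)                 # stand-in for a real disk request
            os.write(res_w, struct.pack("d", time.time() - start))
        os._exit(0)

    pipes = []
    for _ in range(NUM_PROCS):
        cmd_r, cmd_w = os.pipe()
        res_r, res_w = os.pipe()
        if os.fork() == 0:
            io_process(cmd_r, res_w, REQS_PER_PROC)
        pipes.append((cmd_w, res_r))

    # Prime each child with two commands so that one request is always
    # "on deck", as described in the text.
    for cmd_w, _ in pipes:
        os.write(cmd_w, struct.pack("qq", 0, 4096))
        os.write(cmd_w, struct.pack("qq", 4096, 4096))

    times = []
    for _ in range(REQS_PER_PROC):
        for cmd_w, res_r in pipes:
            times.append(struct.unpack("d", os.read(res_r, 8))[0])
            os.write(cmd_w, struct.pack("qq", 0, 4096))   # refill the pipeline

    for _ in range(NUM_PROCS):
        os.wait()
    print("mean service time: %.4f s" % (sum(times) / len(times)))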

Statistics collection during measurement also brings up the question of when to start and stop. Although many of the issues are the same as with simulation, the solutions are different due to the practical constraints of measurement. For example, in a simulation, getting the current simulation time is inexpensive since it is usually kept in a global variable. In a measurement run, however, getting the current real time usually requires an expensive system call. Furthermore, there is a wide disparity in the precision of the time available on different machines, with some machines keeping track of time in increments as large as 20 milliseconds!

Thus, stopping conditions based on the measurement of individual events are undesirable in measurement studies; a throughput-based criterion is generally preferable. In raidPerf, measurement stops when the measured throughput, as calculated from the start of measurement to the current point in time, varies by less than one percent for two consecutive measurement intervals. Each successive measurement interval contains twenty percent more requests than the previous interval.
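A sketch of this criterion, in Python, might look as follows. It illustrates the rule just described and is not raidPerf's actual code; issue_request(), the interval sizes, and the tolerance are assumptions.

    import time

    def measure_until_stable(issue_request, first_interval=100,
                             growth=1.20, tolerance=0.01):
        # issue_request() is a hypothetical callback that performs one I/O.
        start = time.time()
        completed = 0
        interval = first_interval
        prev_throughput = None
        stable_intervals = 0
        while stable_intervals < 2:
            for _ in range(int(interval)):
                issue_request()
                completed += 1
            throughput = completed / (time.time() - start)
            if (prev_throughput is not None and
                    abs(throughput - prev_throughput) / prev_throughput < tolerance):
                stable_intervals += 1
            else:
                stable_intervals = 0
            prev_throughput = throughput
            interval *= growth          # next interval has 20% more requests
        return throughput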

3.3 The RAID-II Disk Array Prototype

This section describes RAID-II, our second disk array prototype, which will be analyzed in depth in Chapter 5. In the simplest view, RAID-II is a network disk controller that interfaces a 100MB/s HIPPI (High-Performance Parallel Interface) interconnect to a large number of 3.5" SCSI disks. In its overall operation, it is somewhat more complicated. RAID-II is a complete network file server that supports a custom network file system that efficiently supports both large and small file requests from a wide variety of clients, from supercomputers to desktop workstations. As of this writing, RAID-II is fully operational as a network disk controller and in experimental use as a network file server. It is the former capability, operation as a network disk controller, that interests us here.

Figure 3.1 illustrates RAID-II's current network environment. RAID-II is connected via HIPPI to a network of HIPPI switches and an UltraNet. Only one of the two connections can be in operation at any given time. A HIPPI fiber extender connects a HIPPI switch port to Lawrence Berkeley Labs where, in the near future, we will have access to both high-performance clients and peripherals that can effectively use RAID-II. The UltraNet currently spans two buildings at UC Berkeley and interconnects three Sun 4 workstations via 250Mb/s Ultra links. In addition to the two high-performance networks, RAID-II is connected to an Ethernet to support client workstations without access to a high-performance network.

Figure 3.1: RAID-II Network Environment. [Diagram: RAID-II attached to a 4x4 HIPPI switch over 1 Gb/s HIPPI links (with a fiber link to LBL), to an UltraNet Ultra 1000 hub over 250 Mb/s Ultra links serving three Sun4 clients, and to a 10 Mb/s Ethernet serving workstation clients.]

Figure 3.2 and Figure 3.3 illustrate RAID-II, which consists of three racks: two side racks, each containing 72 SCSI disks, and a center rack housing most of the electronic components of the system in three chassis. The top chassis contains up to seven independent VME backplanes, of which only four are currently in use. Each VME backplane can support up to two off-the-shelf VME disk controllers. The middle chassis contains the HIPPI interfaces and a custom-designed board called the X-Bus board, which acts as the central memory and data interconnect for the system. The bottom chassis contains a Sun 4 workstation that controls all the previously mentioned components of the system via off-the-shelf VME-to-VME links. Each of the VME backplanes in the top chassis is connected via an independent VME link to a dedicated port on the X-Bus board. This allows data to be transferred directly between the disks and the network without having to go through the low-bandwidth memory system of the Sun 4 workstation. The X-Bus board serves as the primary data interconnect in the system. Each of the two unidirectional HIPPI source and destination interfaces is connected via a 100MB/s data bus to the X-Bus board. The remaining five visible X-Bus ports are VME ports, four of which are used to connect to the VME disk controllers and one of which is used by the Sun 4 workstation to access network headers and file system metadata.

Figure 3.2: Photograph of the RAID-II Storage Server. [Photograph not reproduced here.]

Figure 3.3: Schematic View of the RAID-II Storage Server. Solid lines indicate data paths and dotted lines indicate VME control paths. [Schematic: the HIPPIS and HIPPID interfaces connect to the X-Bus board over 100 MB/s data paths, four VME disk controllers connect over VME, and a Sun4 server controls the components; external connections are HIPPI (1 Gb/s) and Ethernet (10 Mb/s).]

Figure 3.4: X-Bus Board. [Schematic: a 4x8 crossbar interconnect joins four interleaved memory banks (32 MB DRAM, sixteen-word interleave) to eight X-Bus ports: HPPIS and HPPID to the HIPPI boards, VME to the host, VME0 through VME3 to the disk controllers, and the XOR port; 40 MB/s per port, 160 MB/s aggregate bandwidth.]

As illustrated by Figure 3.4, the memory system of the X-Bus board consists of 32MB of DRAM organized into four independent memory banks with a sixteen-word interleave. By using the page-mode accesses supported by the DRAM, each memory bank can supply a 32-bit word every 80ns, resulting in a peak data transfer rate of 50MB/s per memory bank. Due to arbitration, addressing and memory latency overheads, each memory bank can sustain up to 40MB/s on reads and 44MB/s on writes.
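As a quick arithmetic check of the quoted figures (our own calculation, not part of the original text), one 32-bit word every 80 ns gives

$$\frac{4\ \text{bytes}}{80\ \text{ns}} = 50\ \text{MB/s per bank},$$

and four banks at the sustained rate of 40 MB/s each account for the 160 MB/s aggregate bandwidth noted in Figure 3.4.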

The memory banks are connected via a four-by-eight crossbar interconnect to eight independent X-Bus ports. Each X-Bus port can sustain up to 40MB/s on reads and 44MB/s on writes, up to the aggregate bandwidth of the four memory banks. The X-Bus interconnects the X-Bus ports only to the memory banks, not to other X-Bus ports. All data transfers must go through the memory banks. Of the eight X-Bus ports, we have already described all except the XOR/DMA port. The XOR/DMA port is the only port that is completely internal to the X-Bus board. It is used by the XOR/DMA engine to generate parity in support of RAID storage systems and also to perform block data transfers within the X-Bus board.

On a typical read operation, data is transferred from the disks via the VME disk controllers to the X-Bus memory and out over the HIPPI source interface. On a write operation, data is transferred from the network via the HIPPI destination interface to the X-Bus memory, the XOR/DMA engine computes the parity, and the parity and data are written to the disks via the VME disk controllers. To properly orchestrate the data transfers, network headers and other control information is transferred back and forth between the X-Bus memory and the Sun 4 server. The Sun 4 server also controls the HIPPI interfaces and VME disk controllers by directly reading and writing control words over the VME links.

Figure 3.5 summarizes the system-level performance of the RAID-II storage server when accessed as a raw disk device over the HIPPI interconnect. Up to approximately 20MB/s can be sustained for both large reads and large writes. The performance for writes is somewhat less than the performance for reads because parity information must be updated during writes and the X-Bus board's VME interfaces are slightly more efficient for disk reads (writes to memory) than disk writes (reads from memory). The bottlenecks in both cases turn out to be the Interphase Cougar disk controllers and our implementation of the X-Bus board's VME interfaces. Since we can easily add four more disk controllers, the ultimate bottleneck is our implementation of the VME interfaces. These can support only 7MB/s each because we implemented them as less efficient synchronous rather than asynchronous interfaces. Fortunately, they are implemented as daughtercards on the X-Bus board and can easily be replaced. With an asynchronous VME interface, each VME port should be able to support up to 20MB/s, and RAID-II should be able to sustain close to 40MB/s in both reads and writes.

Figure 3.5: System Level Performance. Performance over HIPPI to/from disks. [Plot of throughput (MB/s) versus request size (KB) for reads and writes.] RAID-II is configured with 4 Interphase Cougar disk controllers, 8 SCSI strings and 24 SCSI disks in a RAID level 5 configuration. Due to the current lack of a client machine that can sustain the bandwidth provided by RAID-II, all HIPPI transfers are performed using the loopback mode of the HIPPI interfaces. Thus, data sent out from the X-Bus board over the HIPPI source interface is always looped back into the X-Bus board via the HIPPI destination interface. This stresses RAID-II more than it would if an actual client were sourcing and sinking the data.

3.4 The Performance Evaluation Process

The previous sections have described the tools at our disposal for performing analytic, simulation and measurement studies. This section describes the performance evaluation process in which the tools are used. Much of the process described here is used both in Chapter 4 to validate our analytic model and in Chapter 5 to analyze the performance of RAID-II.

3.4.1 Introduction

We identify four main steps in the performance evaluation process:

1. Definition of goals.

2. Experimental design.

3. Analysis of data.

4. Presentation of results.

In most studies, backtracking will frequently occur as new results suggest different things

to try. Because each step is dependent on the previous step, it is particularly important to

avoid mistakes in the earlier steps.

Experimental design refers to the process of identifying the important system and workload parameters, picking the most significant values for each parameter, and choosing the combinations of parameter values to study. Frequently, investigators select too narrow a range of parameters or design studies in which only one parameter is varied at a time. Such studies lead to misleading results that are not generalizable. On the other hand, selecting too large a range of parameters leads to results that are "generally" applicable but do not apply well to the most interesting parameter ranges. One can think of experimental design as analogous to the selection of questions an investigator may ask to solve a mystery. Just as the answer to each question reveals an aspect of the mystery, the performance of the system at each parameter combination reveals an aspect of the system under study. Clearly, without the right questions, you cannot get the right answers.

Once the experimental design is complete and the required data has been collected, we need to analyze the data and present the results. Unfortunately, investigators frequently do not distinguish between the analysis of data and the presentation of results. The consequence is papers that are full of graphs and tables illustrating simulation or measurement output but lacking in significant conclusions. The analysis of data involves the formulation of analytic models and the application of statistical techniques to summarize measured data. The presentation of results, on the other hand, communicates the conclusions derived from the analysis in a form that is both convincing and easily understood by the intended readers. In particular, the presentation of results should leave the reader with the following three things:

- An understanding of the results.

- The conditions under which the results apply.

- An understanding of the strength of the results.

We do not consider describing the process used to derive the results as absolutely necessary if the above three goals can be achieved without it. Usually, however, it will be impossible to give the reader an understanding of the strength of the results without describing the process that was used.


3.4.2 Experimental Design

Central to any experimental study is the determination of which variables are important, what values of the variables are significant, and what combinations of variable values should be used. This process is referred to as experimental design, and is equally applicable to both simulation and measurement studies. In both cases, we are interested in studying the effect of certain parameter values on the performance of the system. We use experimental design extensively in both Chapter 4 and Chapter 5 to determine the parameters for our simulation and measurement experiments, respectively.

We will find the following definitions useful in describing the experimental design process:

Experiment      A single measurement or simulation.
Response        The result of an experiment.
Factor          An input variable that can influence the response.
Factor level    A possible value for a factor. For example, the factor request size may have a factor level of 4KB, 8KB, or 16KB.
Replication     Repetition of an experiment at the same factor levels.
Design          An experimental design is specified by the list of factors, factor levels, combination of factor levels used for each experiment, and the number of replications.
Design point    The factor levels for a single experiment. For example, (request type = read, request size = 4KB).
Design set      The set of design points used in an experimental design.

Replication of experiments allows the quantification of experimental, or "random," errors. In simulations, replication requires repeating the simulation at the same factor levels with a different random seed. Replication is frequently ignored in experimental studies in computer science, but some way to gauge experimental errors is always necessary. Without such a gauge, it is impossible to tell if the data is strong enough to support the results of the study.


Given the factors and factor levels for an experimental study, there are several common ways to design experiments. A full factorial design uses a complete cross product of all possible factor levels. This has the benefit that the design set is balanced; that is, it is easier to study the effect of a factor independently of the other factors. It also has the advantage that all interactions between any combination of factors can be studied. The only drawback is the large number of experiments required.
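As an illustration of a full factorial design, the following Python sketch enumerates the cross product of some hypothetical factor levels, loosely patterned after, but not identical to, the design set used in Chapter 4.

    from itertools import product

    factors = {
        "diskModel": ["Lightning", "Fujitsu", "FutureDisk"],
        "N":         [2, 3, 4, 8, 16],
        "L":         [1, 2, 4, 8, 16, 32],
        "B_KB":      [1, 4, 16, 64],
    }

    # A full factorial design is the complete cross product of all factor levels.
    design_set = [dict(zip(factors, levels))
                  for levels in product(*factors.values())]
    print(len(design_set), "design points")    # 3 * 5 * 6 * 4 = 360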

Other techniques are sometimes preferred over full factorial designs because they require many fewer experiments [17]. Such designs assume that the effect from higher-order interactions between two or three factors is usually much smaller than first- or second-order effects. They construct design sets in which higher-order effects cannot be distinguished from lower-order effects.

In this dissertation, we always use full factorial designs because the number of factors and factor levels we are interested in is small enough and we have sufficient resources to perform the necessary experiments. The alternative design techniques are more frequently used in fields such as biology or chemistry where large amounts of time and resources are often required for each experiment.

3.4.3 Linear Regression

In Chapter 5, we will frequently fit equations to experimental data to formulate models for the measured system, to study the effects of parameters, and to summarize the experimental data. Given enough CPU power, it is feasible to fit arbitrary equations using arbitrary criteria; however, it is far more efficient to fit linear equations while minimizing the sum of squared errors. This process is called linear regression, and is the main form of curve fitting used in this dissertation.

Linear regression takes as input the data to be fitted and a list of basis functions. It outputs the best-fit coefficients: the coefficients for the basis functions that minimize the sum of squared errors. For example, given the basis functions $(1, x_1, x_2, x_1 x_2)$, linear regression will calculate coefficients $\beta_0$ through $\beta_3$ such that the equation $y(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$ minimizes the sum of squared errors. Here, $\beta_0$ is the base response of the system, the system response when $x_1$ and $x_2$ are both zero. $\beta_1$ is the first-order effect of $x_1$; when $x_2$ is equal to zero, a unit change in $x_1$ will change the response on average by $\beta_1$. Likewise, $\beta_2$ is the first-order effect of $x_2$. $\beta_3$ is the interaction effect of $x_1$ and $x_2$; a non-zero $\beta_3$ indicates that the change in the response due to a change in $x_1$ or $x_2$ depends on the current value of $x_2$ or $x_1$, respectively.
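For example, the following Python sketch fits these four coefficients to some made-up data by ordinary least squares; the data and coefficient values are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0.0, 1.0, 200)
    x2 = rng.uniform(0.0, 1.0, 200)
    # Synthetic responses with known coefficients plus a little noise.
    y = 2.0 + 3.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(0.0, 0.05, 200)

    # Design matrix: one column per basis function (1, x1, x2, x1*x2).
    X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)    # approximately [2.0, 3.0, -1.0, 0.5]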

Aside from the best-fit coefficients, most linear regression packages also output statistics describing the significance of each coefficient, the correlation between each pair of coefficients, and the function's goodness-of-fit. This allows us to eliminate insignificant or redundant factors and to gauge the importance of each factor to the accuracy of the linear regression. As the preceding sentence illustrates, it is important to distinguish between statistical significance and practical significance. A factor is statistically significant if it has a distinctly measurable effect; it is practically significant only if the effect is sufficiently large. For example, using suntan lotion before jumping into a volcano may have a statistically significant effect, but for all practical purposes, the effect is insignificant.

The value of each coefficient divided by its standard deviation is referred to as a t-statistic and provides a measure of the coefficient's statistical significance. A small t-statistic results when the standard deviation is large relative to the estimated value of the coefficient, and indicates that the coefficient is statistically insignificant. Practically speaking, you may want to eliminate the corresponding basis function and retry the linear regression. If the errors in the regression are independently distributed as a normal variable with a common standard deviation, the t-statistics can be used to formulate confidence intervals for the best-fit coefficients. Most errors in modeling and analyzing computer systems, however, do not meet these requirements.

The correlation matrix tabulates the correlation coefficient between each pair of best-fit coefficients. High correlation between a pair of coefficients indicates that it may be possible to remove a correlated factor without significantly affecting the accuracy of the linear regression. Although the t-statistic and the correlation matrix for best-fit coefficients can be used to iteratively eliminate unimportant or redundant factors, the only sure way to get the best regression with the fewest factors is to try all subsets of factors as a basis.

So far, we have discussed how to tell if the best-fit coefficients are statistically significant and some techniques for eliminating unimportant or redundant basis functions. We have yet to discuss, however, a method for characterizing the errors in a regression to determine if the regression is sufficiently accurate. Although a statistically insignificant result is useless, just because a result is statistically significant does not necessarily mean that it is practically significant or useful. Methods and statistics for characterizing such errors will be discussed in the next section.


3.4.4 Error Metrics

Performance evaluation often requires comparing two or more models and characterizing the differences between them. Usually, one model is considered less accurate than the other, such as when an analytic model is compared against a simulation model, or when a simulation model is compared to measurements of the modeled system. In such cases, the differences can be thought of as errors in the less accurate model. Because such comparisons frequently involve many data points, aggregate statistics and graphs must be used to summarize the errors in a form that is more easily interpreted. The following defines statistical terms and metrics that will be used in Chapter 4 and Chapter 5 to characterize errors in modeling and analyzing disk arrays, respectively:

$y_i$          The response for the $i$th design point.
$\hat{y}_i$    The predicted response for the $i$th design point.
$e_i$          $y_i - \hat{y}_i$ (error for the $i$th design point).
$\bar{y}$      Arithmetic mean of the $y_i$'s.
SSE            $\sum e_i^2$ (sum of squared errors).
SST            $\sum (y_i - \bar{y})^2$ (sum of squares total).
$R^2$          $(SST - SSE)/SST$ (coefficient of determination).
max error      $\max_i e_i$ (maximum error).
90% error      The 90th percentile of the $e_i$.

The coefficient of determination, $R^2$, is a measure of how closely the predicted response matches the actual response. A baseline prediction predicting all responses equal to the mean, $\bar{y}$, would have $R^2 = 0$, while a perfect predictor would have $R^2 = 1$. $R^2$ can be interpreted as the fraction of "total variation" that is accurately predicted. Although $R^2$ is a good measure of overall accuracy, it may be highly dependent on the choice of the design set and may vary significantly for certain subsets of the design set. That is, a predictor that is wildly inaccurate for a significant collection of design points can still have $R^2 \approx 1$ if the predictor is accurate for the other design points. Because $R^2$ can be insensitive to large errors in the predictor, the max error metric is useful as a measure of the worst-case prediction. The 90% error metric is a compromise between $R^2$ and max error which compensates for the extreme sensitivity of max error to outliers. We will frequently use the coefficient of determination, max error, and 90% error in characterizing the error of analytic models and linear regressions.
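These metrics are straightforward to compute. The following Python sketch is our own illustration, using absolute errors for the max and 90th-percentile metrics and made-up numbers; it is not code from this dissertation's tools.

    import numpy as np

    def error_metrics(actual, predicted):
        e = actual - predicted
        sse = np.sum(e ** 2)
        sst = np.sum((actual - actual.mean()) ** 2)
        return {
            "R2":        (sst - sse) / sst,
            "max error": np.max(np.abs(e)),
            "90% error": np.percentile(np.abs(e), 90),
        }

    # Illustrative responses and predictions; the metrics are computed on the
    # logarithm of utilization, as done in Chapters 4 and 5.
    y    = np.array([0.30, 0.55, 0.70, 0.85, 0.93])
    yhat = np.array([0.28, 0.57, 0.66, 0.88, 0.92])
    print(error_metrics(np.log(y), np.log(yhat)))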

In addition to the above error metrics, we will frequently use graphs to illustrate errors. During analysis, it is useful to graph each predicted response along with its actual response; however, this usually results in either a large number of graphs or a very cluttered graph. Thus, we will frequently graph the actual response in aggregate form as percentile intervals containing the indicated percentage of the response. We specifically use percentile intervals rather than confidence intervals because standard methods for calculating confidence intervals require that errors be independently distributed as a normal variable with a common standard deviation. If the errors are independent, certain transformations can be made on responses to satisfy the requirements, but these transformations are often artificial. Furthermore, errors resulting from modeling errors are almost never independent. Percentile intervals, on the other hand, do not require any assumptions about the distribution of errors or artificial transformations of the response. They do, however, require a larger number of data points.

3.4.5 Summary

Simulation and measurement are widely used in computer science, but few researchers use systematic performance evaluation procedures. Experimental design, linear regression, and statistical error metrics are widely used in other experimental sciences, but have yet to gain acceptance in computer science. This makes it difficult to compare or duplicate the work of other researchers and sometimes makes it impossible to assess the validity of advanced research. In this chapter, we have described several basic tools and techniques that are used throughout this dissertation to guarantee the accuracy and validity of our results. In particular, the RAID simulator is used in Chapter 4 to validate our analytic model and the results derived from the model. The RAID performance measurement tool is used in Chapter 5 to analyze the performance of RAID-II. The performance evaluation process is used throughout the dissertation in designing experiments and interpreting the results of both simulation and measurement.


Chapter 4

The Analytic Model

In this chapter, we derive and validate via simulation an analytic performance model for block-interleaved, non-redundant disk arrays. Our approach is to derive the expected utilization of a given disk in the disk array. Because we are modeling a closed system where each disk plays a symmetric role, knowing the expected utilization of a given disk will allow us to compute the entire system's throughput and response time. We examine via simulation the accuracy of the analytic model and the error introduced by each approximation in the derivation of the analytic model. We then extend the basic analytic model to handle variable-sized requests and validate the result. Finally, we apply the model to derive an equation for the optimal size of data striping in disk arrays.

For the sake of convenience and clarity, we present the complete derivation of the analytic model before presenting its empirical validation; the actual development of the model followed an iterative process, alternating between analytical derivations and empirical validations.


Figure 4.1: Data Striping in Disk Arrays. [Diagram: a five-disk array with stripe units 0 through 9 laid out round-robin across the disks; a data stripe spanning stripe units 3 through 6 is highlighted.]

Stripe unit is the unit of data interleaving, that is, the amount of data that is placed on a disk before data is placed on the next disk. Stripe units typically range from a sector to a track in size (512 bytes to 64 kilobytes). The figure illustrates a disk array with five disks with the first ten stripe units labeled.

Data stripe is a sequence of logically consecutive stripe units. A logical I/O request to a disk array corresponds to a data stripe. The figure illustrates a data stripe consisting of four stripe units spanning stripe units three through six.

4.1 The Modeled System

Our primary focus is on modeling non-redundant asynchronous disk arrays. Figure 4.1 illustrates the basic disk array of interest and the terms stripe unit and data stripe. For readers who are familiar with the RAID taxonomy, we mention that the analytic model we develop can also be used to model reads for RAID level 5 disk arrays using the left-symmetric parity placement [22]; the left-symmetric parity placement does not disturb the data mapping illustrated in Figure 4.1.

4.2 The Model System

Consider the closed queueing system illustrated in Figure 4.2. The system consists

of L processes, each of which issues, one at a time, an array request of size n stripe units.


Figure 4.2: Closed Queueing Model for Disk Arrays. [Diagram: L processes each issue requests of size n; each request is split into disk requests that are queued at the N disks, one queue per disk.]

L ≡ number of processes issuing requests.
N ≡ number of disks.
n ≡ request size in disks ($n \le N$).
S ≡ service time of a given disk request.


Each array request is broken up into n disk requests and the disk requests are queued round-robin starting from a randomly chosen disk. Each disk services a single disk request at a time in a FIFO manner. When all the disk requests corresponding to an array request are serviced, the issuing process generates another array request, repeating the cycle. Two or more array requests may partially overlap on some of the disks, resulting in complex interactions. We sometimes refer to array requests simply as requests. In the derivation of the analytic model, we will assume that L and n are fixed. We will also assume that the processes do nothing but issue I/O requests. Later, we will extend the model to allow variable-sized requests.

4.3 The Expected Utilization

In deriving the expected utilization of the model system, we will find the following definitions useful:

U ≡ expected utilization of a given disk.
R ≡ response time of a given array request.
W ≡ disk idle (wait) time between disk request servicings.
Q ≡ queue length at a given disk when a request arrives.
$p_0$ ≡ probability that the queue at a given disk is empty when the disk finishes servicing a disk request.
p ≡ n/N, the probability that a request will access a given disk.

If we visualize the activity at a given disk as an alternating sequence of busy periods of length S and idle periods of length W, the expected utilization of a given disk is

$$U = \frac{E(S)}{E(S) + E(W)}. \qquad (4.1)$$


Figure 4.3: Time-line of Events at a Given Disk. [Diagram: a time line $t_0$ through $t_6$ marking when disk requests finish, when array requests are issued, and the intervals $r_{-2}, r_{-1}, r_0, r_1, r_2, r_3$ between them.] After the disk request finishes service at time $t_2$, $M = 2$ array requests that do not access the given disk are issued at times $t_3$ and $t_4$ before an array request that accesses the given disk is issued at time $t_5$. The disk remains idle for a time period of $W = r_0 + r_1 + r_2$.

Idle periods of length zero can occur and imply that another disk request is already waiting for service, Q > 0, when the current disk request finishes service.

Let $r_0$ denote the time between the end of service of a given disk request and the issuing of a new array request into the system. Let $r_i$, $i \in \{1, 2, \ldots\}$, denote the time intervals between successive issues of array requests, numbered relative to $r_0$. Let M denote the number of array requests issued after a given disk finishes a disk request until, but excluding, the array request that accesses the given disk. Since each array request has probability p of accessing a given disk, M is a modified geometric random variable with $E(M) = 1/p - 1$. Figure 4.3 illustrates the above terms.

By conditioning on the queue length at the time a disk request finishes service, we have

$$E(W) = P(Q > 0)\,E(W \mid Q > 0) + P(Q = 0)\,E(W \mid Q = 0),$$
$$E(W) = (1 - p_0)\cdot 0 + p_0\,E\!\left(\sum_{i=0}^{M} r_i\right),$$
$$E(W) = p_0\left(E(r_0) + E\!\left(\sum_{i=1}^{M} r_i\right)\right).$$

Substituting into Equation 4.1 we have

$$U = \frac{E(S)}{E(S) + p_0\left(E(r_0) + E\!\left(\sum_{i=1}^{M} r_i\right)\right)}. \qquad (4.2)$$

Equation 4.2 is an exact, though not directly useful, equation for the expected utilization of the model system.

4.4 Approximating the Expected Utilization

In the previous section, we formulated Equation 4.2, an exact equation for the expected utilization of the model system. Unfortunately, the exact equation consists of terms that are difficult, if not impossible, to compute. In this section, we approximate the components of Equation 4.2 to make it analytically tractable.

To simplify Equation 4.2, the first approximation we make is

$$E\!\left(\sum_{i=1}^{M} r_i\right) \approx E(M)\,E(r_i) = (1/p - 1)\,E(R)/L.$$

From Little's Law, we know that the average time between successive issues of array requests is $E(R)/L$. Note also that M is a stopping time; that is, the event that the mth request misses the given disk, $M > m$, is independent of the random variables $r_{m+1}, r_{m+2}, \ldots$.


Thus, from Wald's Equation, the above approximation would be exact if $r_i$, $i \in \{1, 2, \ldots\}$, were independently distributed with a common mean of $E(R)/L$. For the moment, we will take the above approximation as given, but later show via simulation that the above is an extremely good approximation.

The second approximation we make is to assume that $E(r_0) \approx 0$. As motivation, we present the following observations about $E(r_0)$:

- $E(r_0) = 0$ implies that disk requests associated with the same array request finish at the same time and thus an array request is issued immediately whenever any disk request finishes.

- $E(r_0) = 0$ when n = 1; that is, when each array request consists of a single disk request, the completion of each disk request corresponds to the completion of the corresponding array request and, thus, the process that issued the disk request will immediately issue another array request.

- $E(r_0) \approx 0$ when n = N; that is, when an array request always uses all the disks, disk requests associated with the same array request will tend to finish at close to the same time because all the disks will be in similar states and operate in a lock-step fashion, since disk service times are deterministic and disk requests across disks will be almost identical.

The third and final approximation is $p_0 \approx E(S)/E(R)$. This equation is true for M/M/1 systems and approximately holds at low to moderate loads for M/G/1 systems, but the primary motivation for the approximation comes from empirical observations.


Incorporating the three approximations, we can rewrite Equation 4.2 as

$$U \approx \frac{1}{1 + \frac{1}{L}(1/p - 1)}. \qquad (4.3)$$

Under the approximations we have made, the expected utilization is insensitive to the disk service time distribution, S.

Since this is a closed system, the expected response time can be directly calculated from the expected utilization:

$$E(R) = \frac{E(S)\,L\,n}{U\,N}. \qquad (4.4)$$

The expected throughput can be written as

$$T = \frac{U\,N\,B}{E(S)}, \qquad (4.5)$$

where B is the size of the stripe unit. Future references to a specific analytic model will refer to the above equations and to Equation 4.3 in particular.
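The model is simple enough to evaluate directly. The following Python sketch computes Equations 4.3 through 4.5 for an illustrative configuration; the function name and parameter values are ours and purely illustrative.

    def analytic_model(L, N, n, expected_service_time, stripe_unit_bytes):
        p = n / N                     # probability a request touches a given disk
        U = 1.0 / (1.0 + (1.0 / L) * (1.0 / p - 1.0))            # Equation 4.3
        ER = expected_service_time * L * n / (U * N)             # Equation 4.4
        T = U * N * stripe_unit_bytes / expected_service_time    # Equation 4.5
        return U, ER, T

    # Example: 16 disks, 8 processes, 4-stripe-unit requests, 25 ms average
    # disk service time, 32 KB stripe units (illustrative numbers only).
    U, ER, T = analytic_model(L=8, N=16, n=4,
                              expected_service_time=0.025,
                              stripe_unit_bytes=32 * 1024)
    print("U = %.2f, E(R) = %.3f s, T = %.1f MB/s" % (U, ER, T / 1e6))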

4.5 Experimental Design

We are interested in the following factors to validate our analytic model.

diskModel    The type of disk.
N            Number of disks in the disk array.
L            The number of processes issuing I/O requests.
B            Size of the stripe unit (block size).
p            Request size as a fraction of the disk array.

The design set consists of the complete cross product of the following factor levels.


Factor       Factor Levels
diskModel    Lightning, Fujitsu, FutureDisk
N            2, 3, 4, 8, 16
L            1, 2, 4, 8, 16, 32
B            (1, 4, 16, 64) KB
p            (1, 2, ..., N)/N

The parameters for the three disk models listed above are described in Table 4.1. The number of factor levels for p depends on N, and the levels are expressed as multiples of 1/N. This means, for example, that there are twice as many design points with N = 4 as with N = 2. Thus, although we do not intrinsically value disk arrays with N = 4 more than disk arrays with N = 2, the design set implicitly assigns the former twice the importance of the latter. To compensate for this effect, we weight all design points by the factor 1/N when calculating statistics.

We simulate each design point in the design set twice with two different random seeds to quantify the experimental error, which intrinsically cannot be explained by any model. Thus, the total number of simulation runs is 2 × (3 × 4 × 6 × (2 + 3 + 4 + 8 + 16)) = 4,752. In each simulation run, the disks are rotationally synchronized, requests to the disk array are aligned on stripe unit boundaries, and L and p are held constant.

4.6 Validation of the Analytic Model

In this section, we examine via simulation the accuracy of the analytic model and the error introduced by each of the three approximations in the model of Section 4.4. Recall that the three approximations are as follows:

1. $E\!\left(\sum_{i=1}^{M} r_i\right) \approx (1/p - 1)\,E(R)/L$,

2. $E(r_0) \approx 0$,

3. $p_0 \approx E(S)/E(R)$.


                            Lightning    Fujitsu     FutureDisk
bytes per sector            512          512         512
sectors per track           48           88          132
tracks per cylinder         14           20          20
cylinders per disk          949          1944        2500
revolution time             13.9ms       11.1ms      9.1ms
single cylinder seek time   2.0ms        2.0ms       1.8ms
average seek time           12.6ms       11.0ms      10.0ms
max stroke seek time        25.0ms       22.0ms      20.0ms
sustained transfer rate     1.8MB/s      4.1MB/s     7.4MB/s

Table 4.1: Disk Model Parameters. Average-seek-time is the average time needed to seek between two randomly selected cylinders. Sustained-transfer-rate is a function of bytes-per-sector, sectors-per-track and revolution-time. Lightning is the IBM 0661 3.5" 320MB SCSI disk drive, Fujitsu is the Fujitsu M2652H/S 5.25" 1.8GB SCSI disk drive, and FutureDisk is a hypothetical disk of the future created by projecting the parameters of the Fujitsu disk three years into the future based on current trends in disk technology. The most dramatic improvements are in the bit and track density of the disks rather than in mechanical positioning times. Thus, disks in the future will have much higher sustained transfer rates but only marginally better positioning times. The seek profile for each disk is computed using the following formula:

$$\mathrm{seekTime}(x) = \begin{cases} 0 & \text{if } x = 0 \\ a\sqrt{x - 1} + b(x - 1) + c & \text{if } x > 0 \end{cases}$$

where x is the seek distance in cylinders and a, b and c are chosen to satisfy the single-cylinder-seek-time, average-seek-time and max-stroke-seek-time constraints. The square-root term in the above formula models the constant acceleration/deceleration period of the disk head and the linear term models the period after maximum disk head velocity is reached. If cylinders-per-disk is greater than approximately 200, a, b and c can be approximated using the following formulas:

$$a = (-10\,\mathrm{minSeek} + 15\,\mathrm{avgSeek} - 5\,\mathrm{maxSeek}) / (3\sqrt{\mathrm{numCyl}})$$
$$b = (7\,\mathrm{minSeek} - 15\,\mathrm{avgSeek} + 8\,\mathrm{maxSeek}) / (3\,\mathrm{numCyl})$$
$$c = \mathrm{minSeek}$$

where minSeek, avgSeek, maxSeek and numCyl correspond to the disk parameters single-cylinder-seek-time, average-seek-time, max-stroke-seek-time and cylinders-per-disk, respectively. We have compared the model to the seek profile of the Amdahl 6380A published by Thisquen [38] and have found the model to closely approximate the seek profile of the actual disk. In practice, we have found the model to be well behaved, although care must be taken to check that a and b evaluate to positive numbers.
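The seek-time model in Table 4.1 translates directly into code. The following Python sketch, our own illustration rather than raidSim's implementation, computes the coefficients and evaluates the seek profile for the Lightning parameters; the single-cylinder and full-stroke values it reproduces match the table.

    import math

    def seek_coefficients(min_seek, avg_seek, max_seek, num_cyl):
        # Valid when the disk has more than roughly 200 cylinders.
        a = (-10 * min_seek + 15 * avg_seek - 5 * max_seek) / (3 * math.sqrt(num_cyl))
        b = (7 * min_seek - 15 * avg_seek + 8 * max_seek) / (3 * num_cyl)
        c = min_seek
        assert a > 0 and b > 0, "model is well behaved only for positive a and b"
        return a, b, c

    def seek_time(x, a, b, c):
        # Seek time for a seek distance of x cylinders.
        return 0.0 if x == 0 else a * math.sqrt(x - 1) + b * (x - 1) + c

    # Lightning parameters from Table 4.1 (times in milliseconds).
    a, b, c = seek_coefficients(2.0, 12.6, 25.0, 949)
    print(seek_time(1, a, b, c))      # 2.0 ms (single cylinder seek)
    print(seek_time(948, a, b, c))    # about 25 ms (full stroke)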


Consider the following definitions:

$$U \equiv \frac{1}{1 + \frac{1}{L}(1/p - 1)}$$

$$U_1 \equiv \frac{E(S)}{E(S) + p_0\left(E(r_0) + (1/p - 1)\,E(R)/L\right)}$$

$$U_2 \equiv \frac{E(S)}{E(S) + p_0\,E\!\left(\sum_{i=1}^{M} r_i\right)}$$

$$U_3 \equiv \frac{1}{1 + \frac{1}{E(R)}\left(E(r_0) + E\!\left(\sum_{i=1}^{M} r_i\right)\right)}$$

The variable U represents the analytic model derived in Section 4.4 by applying approximations 1, 2 and 3. The variables $U_1$, $U_2$ and $U_3$ are submodels derived by applying only one of approximations 1, 2 or 3, respectively. By examining the errors in $U_1$, $U_2$ and $U_3$, we can study the errors introduced by each of the corresponding approximations.

The table below tabulates the error metrics discussed in Section 3.4.4 for the best empirical model, U, $U_1$, $U_2$ and $U_3$. The best empirical model is the model that minimizes the sum of squared errors or, equivalently, maximizes $R^2$, and is computed by averaging the responses of data points with the same design point. The error metrics for the best empirical model are useful for comparison purposes and give a feel for the magnitude of experimental errors that cannot be explained by any model. The metric $1 - R^2$ is included for easier comparisons of models that have values of $R^2$ close to one. Because we are more interested in relative rather than absolute errors, the error metrics are calculated for the logarithm of utilization rather than utilization directly. This means that for small errors, less than approximately 20%, the max error and 90% error metrics can be interpreted as percentage deviations. For example, a max error of 0.0430 represents a 4.30% deviation from the predicted utilization.

Model   R2       1 - R2   max error   90% error
BEST    0.9995   0.0005   0.0430      0.0133
U       0.9814   0.0186   0.1863      0.0987
U1      0.9990   0.0010   0.0834      0.0192
U2      0.9888   0.0112   0.1676      0.0742
U3      0.9808   0.0192   0.1807      0.0951

The above table shows that 0.05% of the variation in response is caused by experimental errors that cannot be explained by any model. The maximum experimental error is 4.30%, and 90% of all experimental errors are less than 1.33%. The maximum error for the analytic model U derived in Section 4.4 is 18.63%, with 90% of errors less than 9.87% of the predicted utilization. Finally, approximations 2 and 3 introduce most of the error in the analytic model, while approximation 1, as expected, introduces very little error.

Figure 4.4 plots the response predicted by the analytic model U together with the 90 percentile intervals as a function of L and p, the only two factors considered important by the analytic model. In the figure, each 90 percentile interval encloses 90% of the weighted simulated responses for the given values of L and p. We have elected to give percentile intervals rather than confidence intervals because standard methods for calculating confidence intervals require that errors be normally distributed with a constant standard deviation. Certain transformations can be made on responses to satisfy the requirement, but these transformations are often artificial. Percentile intervals, on the other hand, do not require any assumptions about the distribution of errors but do require more data points to calculate. Figure 4.4 shows good correspondence between the analytically predicted utilization and the 90 percentile intervals determined by simulation. As we will see later, the percentile intervals are jagged because the factor N, which is ignored by the analytic model, is significant in explaining variations in the response.

Figure 4.4: Analytic U with 90 Percentile Intervals as a Function of L and p. [Plot of utilization versus p from 0 to 1.] Each line and shaded region represents the analytically predicted response and empirically determined 90 percentile intervals, respectively, for a different value of L. From bottom to top the values for L are 1, 2, 4, 8, 16 and 32, respectively.

4.7 Validation of Model Properties

The previous section examined the errors in the analytic model, but regardless of the accuracy of the model, certain properties of the model may be valid where the model itself is not. This section will investigate the model's prediction that the utilization is independent of the disk service time distribution, S, or more specifically is dependent only on the factors L and p. We will also examine whether, as implied by the model, the utilization can be accurately modeled by an equation of the form $U = 1/(1 + \frac{1}{L} f(p, N))$, where $f(p, N)$ represents an arbitrary function of p and N. Recall that $f(p, N)$ is equal to $1/p - 1$ in the analytic model.

4.7.1 Significance of Factors

The following table tabulates the error metrics for the best empirical models (models that maximize $R^2$) when certain factors are excluded. The first entry, labeled NONE, corresponds to the best empirical model that can be constructed when no factors are excluded and is identical to the model identified as BEST in the previous section. If the exclusion of a factor results in error metrics that are only slightly different from the error metrics of the NONE entry, this provides strong evidence that the factor can be safely ignored by the analytic model, at least over the range of factor levels investigated. The last two entries in the table illustrate the effects of excluding more than one factor at a time. As before, the metrics are calculated for the logarithm of utilization rather than for utilization directly.

Deleted Factors     R2       1 - R2   max error   90% error
NONE                0.9995   0.0005   0.0430      0.0133
diskModel           0.9990   0.0010   0.0570      0.0205
N                   0.9936   0.0064   0.1653      0.0525
L                   0.4238   0.5762   1.4311      0.4835
B                   0.9986   0.0014   0.0789      0.0252
p                   0.4285   0.5715   1.8733      0.5018
diskModel, B        0.9983   0.0017   0.0843      0.0269
diskModel, B, N     0.9926   0.0074   0.1678      0.0577

As predicted by the analytic model, the above table illustrates that utilization is insensitive to the disk service time distribution, S, or more specifically to the two principal factors, diskModel and B, that determine S. Excluding both the factors diskModel and B results in error metrics of max error = 8.43% and 90% error = 2.69%, which compare favorably with the best-case error metrics, when no factors are excluded, of max error = 4.30% and 90% error = 1.33%. Thus, we conclude that utilization is insensitive to the disk service time distribution, S.

The insensitivity of utilization to the disk service time distribution is hardly surprising given that we have a closed queueing system where processes do nothing but I/O. Consider the following thought experiment. If we replaced all the disks with devices twice as fast but the same in other respects, we would expect throughput to exactly double but utilization to remain the same. Real systems are more complicated because changing disks or stripe unit sizes not only changes the mean of the service time distribution but also its shape. However, this simple thought experiment shows why utilization is insensitive to the disk service time distribution. It also provides evidence that this property will hold under factor levels not investigated here.


In contrast to excluding factors diskModel and B, excluding factor N introduces much larger errors. The errors, however, are not large enough to say that it is never acceptable to ignore N. Whether N can be ignored depends on the actual use of the model. If one is primarily interested in the effects of varying N, it may not be acceptable, but otherwise, it is probably acceptable. The table shows that the remaining factors, L and p, are clearly important and cannot be ignored.

4.7.2 Relationships Between Factors

The previous section identified the factors N, L and p as significant in formulating a model for the utilization of the modeled system. In this section, we empirically investigate the relationships between these factors. In particular, we examine whether, as implied by the model, the utilization can be accurately modeled by an equation of the form $U = 1/(1 + \frac{1}{L} f(p, N))$, where $f(p, N)$ represents an arbitrary function of p and N. We then search for values of $f(p, N)$ that result in accurate analytic models.

Figure 4.5 plots $\log(1/\bar{U} - 1)$ versus $\log_2 L$ for the different factor levels of N and p, where $\bar{U}$ is the geometric mean of the utilization over all design points with the same values of N, L and p. We use $\bar{U}$ rather than arbitrarily selecting design points with a specific value for diskModel and B to reduce experimental error. If $\bar{U} \approx 1/(1 + \frac{1}{L} f(p, N))$, a plot of $\log(1/\bar{U} - 1)$ versus $\log_2 L$ should result in straight lines with a slope of -1. As is evident from the figure, this is approximately the case.

Now that we have verified that $\bar{U} \approx 1/(1 + \frac{1}{L} f(p, N))$, it remains to determine a good approximation for $f(p, N)$. Since f is a function only of p and N and not L, we can theoretically determine f by fixing L to any particular value. In particular, if L = 1, the resulting system has no queueing and $U \approx p$. Substituting, we have $p \approx 1/(1 + \frac{1}{L} f(p, N))$, and solving for $f(p, N)$ we have $f(p, N) = 1/p - 1$. This results in the same analytic model, $U \approx 1/(1 + \frac{1}{L}(1/p - 1))$, derived in Section 4.4. The reader can look upon the above result as an alternative derivation of the analytic model based on empirical techniques.

Figure 4.5: Plots of $\log(1/\bar{U} - 1)$ vs. $\log_2 L$. [Four panels, one each for N = 2, 3, 4 and 8, plotting $\log(1/\bar{U} - 1)$ against $\log_2 L$.] Each plot corresponds to a different value of N. Within each plot, each line represents a plot of $\log(1/\bar{U} - 1)$ versus $\log_2 L$ for a different value of p. From top to bottom the values for p are $1/N, 2/N, \ldots,$ and $N/N$, respectively.

Having rederived the analytic model above, two questions immediately arise:

1. Can we get a more accurate model by solving for utilization with L = 2 rather than L = 1? Theoretically, a solution to such a model would take into account a greater amount of the interaction between processes and should result in a more accurate model.

2. How good is the best model of the form $U \approx 1/(1 + \frac{1}{L} f(p, N))$?

The second question is easily answered empirically; we simply calculate the values of $f(p, N)$ that maximize $R^2$ and examine the errors of the resulting model. The first question is more difficult to answer. Even for L = 2, we must take into account both queueing and fork-join synchronization. We model the system with L = 2 as a discrete-time, discrete-state Markov chain. If we assume that disk service times are constant, the number of states required to model the system is equal to the number of disks in the system. That is, the queue length at each disk is either zero or one, disks with a queue length of one are always consecutively located in the disk array, and the state where every disk has a queue can be merged with the state where no disk has a queue. We have formulated and solved such a model for arbitrary N. The solution is complex enough, and would require enough explanation, that it is not presented here.

Let $\hat{U}_L$ represent the best empirical model of the form $U \approx 1/(1 + \frac{1}{L} f(p, N))$, let $\hat{U}_{L1} \equiv 1/(1 + \frac{1}{L}(1/p - 1))$, and let $\hat{U}_{L2}$ represent the approximate solution for utilization derived by assuming L = 2. The table below tabulates the error metrics for $\hat{U}_L$, $\hat{U}_{L1}$ and $\hat{U}_{L2}$.

Model             R2       1 - R2   max error   90% error
$\hat{U}_L$       0.9929   0.0071   0.1299      0.0642
$\hat{U}_{L1}$    0.9814   0.0186   0.1863      0.0987
$\hat{U}_{L2}$    0.9814   0.0186   0.2176      0.0920

The error metrics of U_L1, the analytic model derived in Section 4.4, compare favorably with the error metrics of U_L, the best empirical model of the form U ≈ 1/(1 + (1/L) f(p, N)). Somewhat surprisingly, the max error for U_L2 is larger than that for U_L1, although the 90% error is smaller. Although not visible from the table due to roundoff, R^2 for U_L2 is slightly larger than that for U_L1. We conclude that the additional complexity of U_L2 does not, in general, merit its use over U_L1. To more accurately gauge the differences between U_L and U_L1, Figure 4.6 plots U_L, U_L1 and the 90 percentile intervals as a function of N, L and p.

Figure 4.6: U_L vs. U_L1. Each plot corresponds to a different value of N (N = 2, 3, 4, 8). Within each plot, each set of solid line, dotted line and shaded region represents U_L, U_L1 and the 90 percentile intervals, respectively, for a different value of L. From bottom to top the values for L are 1, 2, 4, 8, 16 and 32, respectively.

4.8 Variable Request Sizes

In this section, we extend our model to handle variable-sized workloads. Although we derived Equation 4.3 assuming a constant request size, it can easily be extended by noting that the parameter p is just the probability that a given request will access a given disk. Thus, given a workload in which a fraction f1 of the requests are of size p1 and a fraction f2 = 1 - f1 are of size p2, p = f1 p1 + f2 p2. In general, if x is the size of requests as a fraction of

the disk array and F(x) is its corresponding cumulative distribution function, then

p = \int_0^1 x \, dF(x)    (4.6)

where p is the average request size as a fraction of the disk array.
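As a quick illustration of this extension, the following sketch (our own helper functions, with made-up workload values, not part of the dissertation's simulator) computes the average request size for a two-size workload and evaluates the utilization predicted by the model of Section 4.4.

```python
def mean_request_fraction(sizes_and_fractions):
    """Average request size as a fraction of the disk array.

    sizes_and_fractions: list of (p_i, f_i) pairs, where p_i is the request
    size as a fraction of the array and f_i is the fraction of requests of
    that size (the f_i should sum to 1)."""
    return sum(p * f for p, f in sizes_and_fractions)

def predicted_utilization(L, p):
    """Analytic model: U = 1 / (1 + (1/L) * (1/p - 1))."""
    return 1.0 / (1.0 + (1.0 / L) * (1.0 / p - 1.0))

# Example: 40% of requests touch 2 of 8 disks, 60% touch 6 of 8 disks.
p = mean_request_fraction([(2 / 8, 0.4), (6 / 8, 0.6)])
for L in (1, 2, 4, 8, 16, 32):
    print(f"L={L:2d}  p={p:.3f}  U={predicted_utilization(L, p):.3f}")
```

Note that for L = 1 the predicted utilization reduces to p itself, as observed in the previous section.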

To validate the above result, consider the design set consisting of the following parameter and factor levels:

  Parameter    Parameter Value
  diskModel    Fujitsu
  N            8
  B            32KB

  Factor       Factor Levels
  L            1, 2, 4, 8, 16, 32
  p1           2/N, 3/N, ..., N/N
  p2           1/N, 2/N, ..., (p1 N - 1)/N
  f1           0.20, 0.40, 0.60, 0.80

Since we have shown in Section 4.7.1 that utilization is insensitive to diskModel, N and B, we hold them constant. In each simulation run, L processes randomly issue requests of size p1, f1 fraction of the time, and requests of size p2, 1 - f1 fraction of the time. Each design point in the design set is simulated twice with two different random seeds to quantify the experimental error. Thus, the total number of simulation runs is 2(6 × 28 × 4) = 1344. In each simulation run, the disks are rotationally synchronized, requests to the disk array are aligned on stripe unit boundaries and L is held constant.
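The size of this design set is easy to check mechanically; the short sketch below (our own enumeration, not part of the simulator) counts the runs implied by the factor levels above.

```python
# Enumerate the design points for N = 8: p1 ranges over 2/N .. N/N,
# p2 over 1/N .. p1 - 1/N, f1 over four levels, L over six levels,
# and each design point is simulated with two random seeds.
N = 8
loads = (1, 2, 4, 8, 16, 32)
f1_levels = (0.20, 0.40, 0.60, 0.80)
seeds = 2

runs = 0
for L in loads:
    for p1_units in range(2, N + 1):          # p1 = p1_units / N
        for p2_units in range(1, p1_units):   # p2 = p2_units / N
            for f1 in f1_levels:
                runs += seeds

print(runs)   # 2 * (6 * 28 * 4) = 1344
```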

The following are the error metrics from the experiment, where U ≡ 1/(1 + (1/L)(1/p - 1)).

  Model   R^2     1-R^2    max error   90% error
  BEST    0.9986  0.0014   0.0801      0.0192
  U       0.9969  0.0031   0.0934      0.0300

The table shows that the analytic model, which uses only the factors L and p, compares favorably with the best empirical model, which uses all the factors L, p1, p2 and f1.

Figure 4.7 plots the response predicted by the analytic model U together with

the maximum error intervals and individual data points as a function of L and p. The

analytically predicted response approximately passes through the simulated data points.

Figure 4.7: Analytic U with Maximum Error Intervals and Individual Data Points. In this plot, diskModel = Fujitsu, N = 8 and B = 32KB. Each line, shaded region and set of data points corresponds to a different value of L. From bottom to top the values for L are 1, 2, 4, 8, 16 and 32, respectively.

4.9 The Optimal Stripe Unit Size

In this section, we illustrate the utility of the analytic model derived in the preceding sections by using it to derive an equation for the optimal stripe unit size, the stripe unit size that maximizes throughput in bytes per second. The equation for the optimal stripe unit size is useful as a rule of thumb in configuring disk arrays and also provides valuable insights into the factors that influence the optimal stripe unit size. Given today's disk technology and disk arrays of approximately eight disks in size, the optimal stripe unit equation is most useful for servicing I/O requests that are a couple of hundred or more kilobytes in size. Miller [29] has shown that such workloads are typical of scientific applications. For such workloads, simulation shows that there is typically a 20% degradation in performance when the stripe unit is a factor of two smaller or larger than the optimal size.

In addition to deriving the equation for the optimal stripe unit size, we will show that the stripe unit size that maximizes throughput also minimizes response time. Note,

however, that maximizing throughput is not the same as maximizing utilization; just because a disk is busy does not mean that it is doing useful work. The fundamental tradeoff in selecting a stripe unit size is one of parallelism versus concurrency. Small stripe unit sizes increase the parallelism for a single request by mapping the request over many disks, but reduce concurrency because each request uses more disks [7].

4.9.1 Derivation

We will derive the equation for the optimal stripe unit size from Equation 4.5. But first, because the disk service time, S, is dependent on the stripe unit size, B, we must formulate a simple model that makes this dependency explicit. Consider the following definitions:

  T      Throughput in bytes per second.
  E(S)   Expected disk service time.
  P      Average disk positioning time (seek + rotational latency).
  X      Sustained disk transfer rate.
  N      Number of disks in the disk array.
  L      The number of processes issuing I/O requests.
  B      Size of the stripe unit (block size).
  Z      Request size in bytes.

Then,

E(S) = P + B/X.    (4.7)

Note that n, the number of stripe units per request, can be calculated as follows:

n = Z/B.    (4.8)

Substituting Equations 4.3, 4.7 and 4.8 into Equation 4.5 and simplifying, we can write the throughput in bytes per second as

T = LNXBZ / ((PX + B)(NB + Z(L - 1))).    (4.9)

Solving for the local maximum of the above equation as a function of B, we get the following equation for the optimal stripe unit size:

Bopt = sqrt(PX(L - 1)Z / N).    (4.10)

Repeating the above procedure to minimize response time, starting from Equation 4.4, results in the same equation for the optimal stripe unit size; thus, the stripe unit size that maximizes throughput also minimizes response time and is given by Equation 4.10.

The following remarks can be made about Equation 4.10:

- Changes to the system that increase the effective load, that is, an increase in L, an increase in Z, or a decrease in N, favor larger optimal stripe units. The opposite is true for changes that decrease the effective load.

- In our model system, the optimal stripe unit size is dependent only on the product PX, the relative rate at which a disk can position and transfer data, and not on P or X independently. If you replace the disks with disks that position and transfer data twice as quickly, the optimal stripe unit size remains unchanged [7]. In this respect, the selection of an optimal stripe unit size is a trade-off between the disk positioning time and the data transfer time.
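To make Equations 4.9 and 4.10 convenient to experiment with, here is a small sketch with illustrative disk parameters (not the parameters of the simulated disks); it evaluates the throughput model and the optimal stripe unit size, clamping the latter to the feasible range Z/N ≤ B ≤ Z discussed in the validation below.

```python
import math

def throughput(B, P, X, N, L, Z):
    """Equation 4.9: throughput in bytes/second as a function of the stripe
    unit size B (bytes), positioning time P (seconds), transfer rate X
    (bytes/second), N disks, L processes and request size Z (bytes)."""
    return (L * N * X * B * Z) / ((P * X + B) * (N * B + Z * (L - 1)))

def optimal_stripe_unit(P, X, N, L, Z):
    """Equation 4.10, clamped to the feasible range Z/N <= B <= Z."""
    b = math.sqrt(P * X * (L - 1) * Z / N)
    return min(max(b, Z / N), Z)

# Illustrative values: 15 ms average positioning time, 2 MB/s transfer rate,
# 16 disks, 8 processes, 512 KB requests.
P, X, N, L, Z = 0.015, 2e6, 16, 8, 512 * 1024
B_opt = optimal_stripe_unit(P, X, N, L, Z)
print(f"Bopt = {B_opt / 1024:.1f} KB, "
      f"T(Bopt) = {throughput(B_opt, P, X, N, L, Z) / 1e6:.2f} MB/s")
```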

4.9.2 Validation

As a further validation of the analytic model and of Equation 4.10 in particular, we compare the analytic values for the optimal stripe unit size with empirically determined values. Figure 4.8 plots the analytically determined optimal stripe unit sizes versus the empirically determined optimal stripe unit sizes on a log-log scale. The shaded regions on the figure represent optimal stripe unit sizes that can be ruled out for the following reasons. First, throughput when B < Z/N for fixed Z is less than or equal to the throughput when B = Z/N. At this stripe unit size, requests are being distributed uniformly across all disks and it is not possible to increase parallelism or concurrency by reducing the stripe unit size. Second, throughput when B > Z for fixed Z is identical to when B = Z. In this case, the request already fits completely within a single disk and there is no advantage or disadvantage to increasing the stripe unit size. We have empirically verified the above two facts. Thus, Z/N ≤ B ≤ Z.

For comparison purposes, Figure 4.9 adds the optimal stripe unit sizes predicted by Chen [7]. Chen's model assumes that the optimal stripe unit size is independent of the request size. As can be seen, our optimal stripe unit sizes correspond well with Chen's optimal stripe unit sizes and both correspond closely with the measured optimal stripe unit sizes.

To get a feel for the sensitivity of performance to the choice of stripe unit size, Figure 4.10 individually plots each group of lines from Figure 4.9 with vertical bars to indicate the range of stripe unit sizes providing 95% of the throughput of the optimal stripe unit size.

Figure 4.8: Analytic vs. Empirical Optimal Stripe Unit Sizes. N = 16, B = 32KB, L in {1, 2, 4, 8, 16, 32}. Each set of three lines is labeled with its corresponding value of L. Bopt = optimal stripe unit size. Axes: request size (KB) versus Bopt (KB), log-log. The graph may appear odd at first because the beginning and end of the individual lines overlap at the edges of the shaded regions.

Figure 4.9: Comparison with Chen's Model. N = 16, B = 32KB, L in {1, 2, 4, 8, 16, 32}. Each pair of lines is labeled with its corresponding value of L. Bopt = optimal stripe unit size. Axes: request size (KB) versus Bopt (KB), log-log; lines: Analytical, Empirical and Chen's.

Figure 4.10: Optimal Stripe Unit Sizes with 95% Performance Intervals. N = 16, B = 32KB, Bopt = optimal stripe unit size. Panels: L = 1, 2, 4, 8. Stripe unit sizes within the solid vertical bars provide 95% of the performance of the optimal stripe unit size indicated by the oblique thin solid line.


There is a good correspondence between the analytically and empirically deter-

mined values of the optimal stripe unit size. Except when L = 1 and request sizes are small,

the optimal stripe unit sizes determined by both Chen's model and our model lie within

the 95% performance intervals.

4.10 Summary

In this section, we derived and validated an analytic performance model for disk arrays. We initially modeled disk arrays as a closed queueing system consisting of a fixed number, L, of processes continuously issuing requests of a fixed size, p, to a disk array consisting of N disks. We then extended the model to handle variable-sized requests. The resulting model predicts the expected utilization of the model system, U, as 1/(1 + (1/L)(1/p - 1)), where p is the average size of requests as a fraction of the number of disks in the disk array. We then derived the expected response time and throughput as a function of utilization. We showed via simulation that the simulated utilization is generally within ±10% of the utilization predicted by the analytic model. We also examined the error introduced by each approximation made in the derivation of the analytic model to better understand the validity of the approximations. Finally, we applied the analytic model to show that the optimal size of data striping simultaneously maximizes throughput and minimizes response time and is equal to sqrt(PX(L - 1)Z / N), where P is the average disk positioning time, X is the average disk transfer rate and Z is the request size.

To the best of our knowledge, we have presented the first analytic performance model that treats disk arrays as a closed queueing system and handles a continuum of


request sizes from small requests to large requests. The primary weaknesses of the model are twofold. First, it assumes that processes do nothing but issue I/O requests. In particular, it does not model CPU think time. Second, although the model can be used to analyze reads in RAID level 5 disk arrays, it cannot be used to analyze writes. Because of the above limitations, the next chapter will develop and apply an empirical technique based on utilization profiles, which can be used to analyze a much wider variety of systems under a wide range of workloads.


Chapter 5

Utilization Profiles

The performance analysis of real systems is currently an art with techniques rang-

ing from analytic models to measurement. When feasible, analytic models are the most de-

sirable because they can be quickly evaluated, are applicable under a wide range of system

and workload parameters, and can be manipulated by a range of mathematical techniques.

Unfortunately, analytic models are highly limited in the types of workloads and systems

they can analyze.

Another approach to performance analysis is based on measuring the system of

interest over a large range of system and workload parameters. Such measurements can be

used to accurately interpolate the performance of the system over the measured range of

parameters. Unfortunately, it is dangerous to directly extrapolate performance from such

measurements and the large number of measurements is very cumbersome to manipulate.

A technique that falls between the purely analytic and measurement based ap-

proaches is to construct empirical performance models from the measurements using high-


level information about the measured system. At the crudest level, this may consist of nothing more than curve-fitting the measured data with arbitrary functions. At its most refined, the measurements may be used to parameterize a mostly analytic model. In both cases, however, the basic techniques are similar; we seek to summarize the vast quantities of measured data in a mathematical form by using additional information specific to the measured system.

This chapter presents one method for summarizing the measured data using what we call utilization profiles. We will show that utilization profiles are useful for system-level performance characterization and are especially useful in analyzing throughput-oriented systems. The following are some of the questions that can be answered by utilization profiles:

- What is the bottleneck at a given workload?

- What will be the bottleneck if you improve the bottleneck resource?

- How much do you need to improve the bottleneck resource before it stops being the bottleneck?

- What is the utilization of a given resource at a given system throughput?

In the following sections, we will present a simple way of viewing utilization profiles, an example application of utilization profiles in analyzing a single-disk system, a generalization of the technique to analyze blackbox systems, and an alternative but equivalent way of viewing utilization profiles. We will see that the two different views of utilization profiles each stress a different aspect of the analysis technique and provide different insights into how utilization profiles can be used to model real systems.

5.1 First Look at Utilization Profiles

One view of utilization profiles is based on modeling the expected service time of requests at a given resource as a linear combination of arbitrary functions of system and workload parameters. Here it is important to distinguish between the terms service time and response time. Service time is the actual amount of time spent by a resource servicing a request, whereas response time is the total elapsed time between the issuing of a request and its completion. Thus, in addition to service time, response time may include queueing and synchronization delays. By arbitrary functions, we mean that we do not impose any restriction on the mathematical form of the functions. Mathematically, we can summarize the above statements as follows.

S = k1 f1 + k2 f2 + ... + kn fn    (5.1)
S = ~k · ~f    (5.2)

  S    Mean service time.
  ~k   Utilization profile. A constant vector.
  ~f   Throughput transform. A vector of arbitrary functions.

Note that S in this chapter refers to the mean service time whereas S in Chapter 4 refers to the service time. Henceforth, we will refer to the vector of constants, ~k, as the utilization profile, and the vector of functions, ~f, as the throughput transform.


Strictly speaking, each resource in the system is characterized by its own utilization profile and throughput transform. We will frequently find, however, that all the resources in a system can be conveniently characterized by a single throughput transform. This means that many systems can be characterized by a single throughput transform and a utilization profile for each resource in the system. Further simplification results if the system contains multiple identical resources that are uniformly utilized, such as disks in a disk array. In such cases, a single utilization profile can be used to characterize all the disks.

For a given resource in the system, the throughput transform, ~f, is not unique; there are usually many reasonable choices. The particular throughput transform that is used depends on the characteristics of the system that are of interest, the knowledge of the system that is available for modeling the system, and the desired accuracy of the model. In general, the throughput transform characterizes a "class" of systems and the utilization profile characterizes a specific implementation of a system. For example, once determined, the same throughput transforms can be used to analyze most RAID level 5 systems; where the systems will differ is in the values of their utilization profiles. This makes it convenient to compare similar systems by directly comparing their utilization profiles.

5.2 An Example: Utilization Profile for a Disk

In this section, we solidify the abstract concepts discussed in the previous section by calculating the utilization profile for a disk. We start by modeling the expected service time of the disk as a sum of the average data positioning time, k1, and data transfer time, k2 Z, where k2 is the time needed to transfer a single byte of data and Z is the request size


in bytes.

S = k1 + k2 Z    (5.3)
~k = (k1, k2)    (5.4)
~f = (1, Z)    (5.5)

  S    Expected service time.
  Z    Request size in bytes.
  k1   Per-request overhead (latency).
  k2   Per-byte overhead (1/transfer-rate).
  ~k   Utilization profile.
  ~f   Throughput transform.

In this example, the utilization profile ~k consists of the constants k1 and k2, and the throughput transform ~f consists of the functions 1 and Z. In general, we will find that the utilization profile is a vector of overhead values and the throughput transform corresponds to the number of each type of overhead in each request. In this example, k1 is the per request overhead and k2 is the per byte overhead; correspondingly, there is one per request overhead per disk request and Z per byte overheads per disk request.
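The service-time model of Equations 5.3 through 5.5 is just a dot product, as the following sketch (with hypothetical overhead values of roughly the scale used in the example that follows) makes explicit.

```python
def service_time(k, f):
    """Equation 5.2: expected service time as the dot product of the
    utilization profile k and the throughput transform f."""
    return sum(ki * fi for ki, fi in zip(k, f))

# Hypothetical disk: 20 ms per-request overhead, 500 ns per byte.
k_disk = (0.020, 500e-9)

for Z in (1024, 4096, 8192, 16384):   # request size in bytes
    f = (1, Z)                        # one request overhead, Z byte overheads
    print(f"Z = {Z:6d} bytes  ->  S = {service_time(k_disk, f) * 1e3:.2f} ms")
```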

To calculate the values of the utilization profiles, we can measure the expected service times for a range of request sizes and use linear regression to solve for the value of ~k that minimizes the sum of squared errors. Thus, given the following measurements,

  Z      S
  1KB    21ms
  4KB    24ms
  8KB    27ms
  16KB   37ms

we get the following system of linear equations:

S = k1 + k2 · Z    (5.6)

21ms = k1 + k2 · 1KB    (5.7)
24ms = k1 + k2 · 4KB    (5.8)
27ms = k1 + k2 · 8KB    (5.9)
37ms = k1 + k2 · 16KB    (5.10)

The above system is overconstrained and has no solution. If we use linear regression to minimize the sum of squared errors, we get the following value for ~k:

~k = (20ms, 500 ns)    (5.11)

That is, the average positioning time of the disk is 20 ms and the sustained transfer rate of the disk is 1/500ns = 2 MB/s.

As this example illustrates, the throughput transform characterizes a class of systems whereas the utilization profile characterizes a specific implementation of the system. In the context of the current example, we see that the throughput transform (1, Z) can be used to analyze many other disks, each of which will have a different utilization profile. Comparing two or more disks by directly comparing the utilization profiles can be simpler and more complete than comparing the performance of the disks over a range of workloads. This is because, given a reasonable model for the disk service time, the utilization profiles are very effective in characterizing the performance of disks. Similar analysis is possible for more complex systems consisting of many more resources. As we will see, the benefits from such analysis are even greater for complex systems than for the simple disk system we analyzed here.


To summarize, in the above disk example, we used the following steps to determine the utilization profile.

1. Identify the relevant overheads. In the disk example, we assumed that the relevant overheads are the per request overhead and the per byte overhead.

2. Determine the corresponding throughput transform by noting the number of each type of overhead per request. In the disk example, we noted that there is one per request overhead per disk request and Z per byte overheads per disk request.

3. Measure the resource utilization and system throughput over a range of system and workload parameters.

4. Use linear regression to solve for the utilization profile, as sketched below.
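A minimal sketch of steps 3 and 4, assuming numpy is available; it performs the least-squares fit on the measurements tabulated in the example above and prints the fitted per-request and per-byte overheads.

```python
import numpy as np

# Step 3: measured expected service times (seconds) for several request sizes (bytes).
Z = np.array([1, 4, 8, 16]) * 1024.0
S = np.array([0.021, 0.024, 0.027, 0.037])

# Step 4: throughput transform f = (1, Z); solve S ~ k1 * 1 + k2 * Z by least squares.
F = np.column_stack([np.ones_like(Z), Z])
k, *_ = np.linalg.lstsq(F, S, rcond=None)

k1, k2 = k
print(f"per-request overhead k1 = {k1 * 1e3:.1f} ms")
print(f"per-byte overhead    k2 = {k2 * 1e9:.0f} ns  (~{1 / k2 / 1e6:.2f} MB/s)")
```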

5.3 Analysis of Blackbox Systems

The previous section provides a simple method for computing the utilization profiles for a system when the expected service time can be measured for each resource in the system. Frequently, however, it is not possible to measure the performance of each resource in the system. In many cases, the system must be treated as a blackbox. This section presents a method for computing the utilization profiles of resources in blackbox systems by expanding on the techniques previously presented.

Since most systems contain more than one resource, it will be useful at this point to rewrite Equation 5.2 to reflect this fact. For the purposes of analyzing blackbox systems, it will also be useful to express Equation 5.2 as a function of the resource's utilization and

the system's overall throughput.

Si = ~k'i · ~f'i    (5.12)
Ui/Ti = ~k'i · ~f'i    (5.13)
Ui/T = ~ki · ~fi    (5.14)

  Si           Expected service time at resource i.
  Ui           Utilization of resource i.
  Ti           Throughput at resource i.
  T            Overall system throughput.
  ~ki, ~k'i    Utilization profile of resource i.
  ~fi, ~f'i    Throughput transform of resource i.

Here, we have used two operational laws: the utilization law, which relates the expected service time at a resource to the utilization and throughput at the resource, and the forced flow law, which relates the throughput at a resource to the overall system throughput. The laws are very general and apply to both open and closed systems. The application of the forced flow law changes the values of the utilization profile and throughput transform but not the general form of the equation. The values of ~ki and ~fi can easily be computed from ~k'i and ~f'i, and vice versa.

To apply Equation 5.14 in analyzing real systems, we need to measure the uti-

lization of a given resource and the overall system throughput over a range of system and

workload parameters. The system throughput is easily measured, but it is not always pos-

sible to measure the utilization of a given resource in the system. Fortunately, there are

certain conditions under which the utilization of a resource can be determined without di-

rect measurements. A trivial example is when the system is completely idle. In this case,

the utilization of all resources in the system is 0%. A more useful case is when the system


is operating at a bottleneck. In this case, the utilization of some resource in the system is close to 100%. It is easy to identify the bottleneck states because in such a state, increasing the effective system workload does not result in an increase in system throughput. Thus, if we analyze the system only when it is bottlenecked, we can rewrite Equation 5.14 as follows:

1.0/T = ~ki · ~fi    (5.15)

This eliminates utilization from the equation and allows us to calculate utilization profiles by measuring only the overall system throughput.

Until now, we have ignored an important practical consideration in blackbox analysis. Although it is easy to determine when a system is bottlenecked, it is not as easy to determine which resource is causing the bottleneck. Given two bottleneck points, a different resource may be responsible for each of the two bottlenecks. Although it is not essential to identify the exact resource that is responsible for each bottleneck, it is necessary to separate the bottleneck points belonging to different resources since the calculation of utilization profiles is performed on a per-resource basis.

Because Equation 5.15 is linear, the process of separating the bottleneck points for each resource and computing the corresponding utilization profiles is analogous to using planar surfaces to fit the sides of a multidimensional polyhedron. The process is actually somewhat more complicated since a good value for the throughput transform may not be apparent. This means that the choice of the throughput transform and the calculation of the utilization profiles may have to proceed concurrently. In reference to the polyhedron example, changing the throughput transform is analogous to changing the shape of the polyhedron. In essence, we are trying to fit the surfaces of a polyhedron that simultaneously changes shape!

To summarize, the following steps should be followed in determining utilization profiles for a blackbox system:

1. Measure the system throughput over the range of system and workload parameters of interest.

2. Isolate the bottleneck throughputs (one simple way to do this is sketched after this list).

3. Itemize the relevant overheads. A reasonable first guess can be made based on the system and workload parameters and a general knowledge of the system's logical function. In the worst case, when we have no intuition about the operation of the system, we can perform a "blind" statistical analysis by modeling the expected service time as a first-order polynomial and gradually adding higher-order terms as necessary.

4. Determine the corresponding throughput transform by noting the number of each type of overhead per request.

5. Use linear regression to fit the bottleneck throughputs and solve for the utilization profiles. The linear regression will also indicate if certain terms in the throughput transform are not useful and can be eliminated.

6. At this point, a poor fit may indicate that more than one resource is responsible for the bottleneck or that a new throughput transform should be tried. In the former case, the graph of the fitted dataset can be used to guide the partitioning of the dataset into two or more datasets such that the bottlenecks in each dataset are caused by a single resource. If this does not result in a better fit, it is necessary to try a different throughput transform. Go back to Step 3 until a satisfactory fit is achieved.
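Step 2 can be mechanized in a simple way: for a fixed workload point, the system is treated as bottlenecked once increasing the load no longer increases throughput appreciably. The sketch below, using hypothetical measurements and a tolerance of our own choosing, illustrates one way to pick out the bottleneck throughput from a series of runs at increasing load.

```python
def bottleneck_throughput(measurements, tolerance=0.02):
    """Given (load, throughput) pairs for one workload point, return the
    bottleneck throughput, or None if throughput is still growing at the
    highest measured load.

    The system is treated as bottlenecked once adding load improves
    throughput by less than `tolerance` (a fraction)."""
    measurements = sorted(measurements)
    for (l0, t0), (l1, t1) in zip(measurements, measurements[1:]):
        if t1 <= t0 * (1.0 + tolerance):
            return max(t0, t1)
    return None

# Hypothetical measurements: throughput (MB/s) at L = 1, 2, 4, 8, 16 processes.
series = [(1, 1.9), (2, 3.6), (4, 5.0), (8, 5.2), (16, 5.2)]
print(bottleneck_throughput(series))   # saturates near 5.2 MB/s
```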

5.4 Second Look at Utilization Profiles

Section 5.1 has defined utilization profiles from the view of modeling the expected service time of resources. In this section, we present an alternative, throughput-oriented view based on a system with constant-service-time requests. This alternative view provides a better understanding of utilization profiles and insights that are useful in modeling real systems.

Consider a system consisting of a single resource that services two different types of requests: type-1 and type-2. Assume that each type-i request requires a constant amount of time, ki, to service. If ni type-i requests are serviced during a time interval t, then the total time spent servicing requests, Ut, where U is the utilization of the resource during time interval t, can be expressed as follows:

Ut = k1 n1 + k2 n2.    (5.16)

Dividing through by t, we have

U = k1 T1 + k2 T2,    (5.17)

where Ti = ni/t is the throughput of type-i requests. By applying the forced flow law, we can express the above equation in terms of the system throughput, T, of both type-1 and

type-2 requests as follows:

U = k1 f1 T + k2 f2 T.    (5.18)

Here, fi is a function that maps the system throughput to the throughput of type-i requests. This is the reason we refer to the fi, collectively, as a throughput transform. By rearranging terms we have

U/T = k1 f1 + k2 f2,    (5.19)

which is very similar to Equation 5.14. If we have N identical resources with uniformly distributed load, we can rewrite Equation 5.18 as

U = k1 f1 T/N + k2 f2 T/N.    (5.20)

That is, each resource gets only 1/N of the requests it would have gotten if there were only one resource. Technically, the 1/N term could be incorporated into the fi terms, but we will find it useful to make fi independent of the number of identical resources in the system. Rearranging, we finally have

UN/T = k1 f1 + k2 f2.    (5.21)

More generally, we have

UN/T = k1 f1 + k2 f2 + ... + kn fn,    (5.22)
UN/T = ~k · ~f,    (5.23)

which is a more general form of Equation 5.14. Specializing this equation to bottleneck throughputs, we have

N/Tmax = ~k · ~f,    (5.24)

which is a generalization of Equation 5.15.

Figure 5.1: Compound Requests. Requests whose service times depend on system or workload parameters are modeled as a collection of simpler requests, each of which requires a constant amount of time to service.

One problem with the derivation we have presented so far is that all type-i requests must require a constant amount of time, ki, to service. In real systems, however, the service time of requests usually depends on the system and workload parameters. For example, it usually takes a disk more time to service an 8KB request than it takes to service a 4KB request. We get around this problem by modeling a real request as a compound request consisting of many simple requests, each of which takes a constant amount of time to service. For example, Figure 5.1 illustrates the modeling of a four-byte disk request as a compound request consisting of a single type-1 request representing the per request overhead and four type-2 requests representing the per byte overheads. By generalizing this example, we can model arbitrarily complex requests by breaking them down into simpler components, each of which takes a constant amount of time to service.
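The compound-request view translates directly into the utilization computation of Equation 5.21. The sketch below, with hypothetical overhead values, computes the utilization of N identical disks from the overall request throughput by counting one per-request overhead and Z per-byte overheads for each request (each request is assumed to be serviced entirely by one disk, with requests spread uniformly over the disks).

```python
def utilization(k, f, T, N=1):
    """Equation 5.21 rearranged: U = (k . f) * T / N, where f counts how
    many of each constant-time overhead a single request contains."""
    return sum(ki * fi for ki, fi in zip(k, f)) * T / N

# Hypothetical disks: 15 ms per-request overhead, 600 ns per byte;
# 8 identical disks; 64 KB requests; 100 requests/second overall.
k = (0.015, 600e-9)
Z = 64 * 1024
f = (1, Z)          # one type-1 overhead, Z type-2 overheads per request
print(f"U = {utilization(k, f, T=100, N=8):.2f}")
```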

In the next section, we will apply the concepts discussed in this and the preceding sections to analyze the performance of RAID-II, our second disk array prototype. We will see that utilization profiles are a flexible tool that can systematically characterize the performance of complex systems. The analysis of RAID-II should also show that utilization profiles are a general tool that can be used to analyze a wide variety of systems in addition to disk arrays.

5.5 Analysis of RAID-II

In this section we analyze the performance of RAID-II's RAID level 5 disk subsystem as a blackbox. By blackbox, we do not mean that we have absolutely no information about the system, but that we cannot directly measure the performance of individual components of the system. In particular, we examine in detail the performance of three access modes: read, read-modify-write, and reconstruct-write. These access modes, discussed in Section 2.3, are the only modes used during the normal operation of a RAID level 5 storage system.

The measurement environment consists of RAID-II configured with four Interphase Cougar disk controllers, eight SCSI strings and twenty-four SCSI disks in a RAID level 5 configuration. We use our measurement program, raidPerf, to fork a fixed number of processes that issue I/O requests to the disk array. In our measurements, data will be transferred between the X-Bus board's memory system and the disk array but not over the


network. We are interested in analyzing the following factors for both reads and writes from/to the disk array.

  N   Number of disks in disk array.
  B   Block/stripe-unit size.
  n   Request size in stripe units.
  L   Load (number of processes).

Note that nB equals the request size in bytes.

The design set consists of the complete cross-product of the following factor levels.

  Factor       Factor Levels
  N            2, 3, 4, 8, 12, 16, 20, 24
  B            (1, 2, 4, 8, 16, 32, 64) KB
  L            1, 2, 4, 8, 16, 32
  n (reads)    1, 2, ..., N
  n (writes)   1, 2, ..., N - 1

For writes, the request size, n, ranges up to N - 1 instead of N because the parity disk must also be updated. Each design point is measured twice to quantify the experimental errors. The total number of design points is 2(7 × 6 × (89 + 81)) = 14,280.

5.5.1 Analysis of Reads

This section defines and calculates the utilization profiles for the read access mode on RAID-II. An additional purpose is to familiarize the reader with the analysis of real systems using utilization profiles. Thus, this section will present a more detailed exposition of the process than later sections.

As described in Section 2.3, the read access mode is serviced by simply reading the requested data from the specified disks. We will assume that the relevant overheads are the per array request overhead, per disk overhead, and the per byte overhead. That is, since each array request is broken into independent disk requests, any resource that services an


array request is likely to pay a fixed overhead for each array request, a fixed overhead for each disk accessed and a fixed overhead for each byte transferred. Thus, the throughput transform for the read access mode can be represented as

~f = (1, n, nB).    (5.25)

That is, there is one array request overhead per read request, n disk overheads per array request, and nB byte overheads per array request.

At this point, we can simplify our analysis by identifying the resources that are

likely to limit system performance and noting how they will be used by the system. In

particular, the system resources are the CPU, VME data busses, Cougar disk controllers,

SCSI strings, and disks. As previously described, a single Sun 4 CPU runs the code that

implements the RAID level 5 storage system and controls the four Cougar disk controllers,

each of which controls two SCSI strings with three disks each. Each disk controller also has

access to its own VME data bus for transferring data to the memory system on the X-Bus

board.

Although the general throughput transform for reads is given by Equation 5.25, individual resources can be characterized by simpler transforms. For example, in RAID-II, because the data by-passes the memory system of the Sun 4 CPU, the CPU never touches the data. Thus, for the purposes of analyzing the CPU, we can ignore the per byte overhead and use a simpler throughput transform that is a subset of the full transform:

~f1 = (1, n, -).    (5.26)


Also, except for the CPU, which breaks array requests into disk requests, none of the other resources see array requests. Thus, for the other resources, we can use a simpler transform that excludes array request overheads:

~f2 = (-, n, nB).    (5.27)

The above observations make it easier to analyze RAID-II, but it is not absolutely necessary to make them at this stage. As we will see, statistical analysis will lead us to the same conclusions as our intuition.

The next step is to select ranges of system and workload parameters for which a specific resource is likely to be a bottleneck and to analyze the resource within that range. To start off, we will select disks as our first bottleneck resource. Since disks are most likely to be a bottleneck when there are few of them and the requests to the disks are large, we will analyze them as a bottleneck when the number of disks in the system, N, is equal to two, and the stripe-unit size, B, is greater than or equal to eight kilobytes. According to Equation 5.24, the following relationship holds under the specified circumstances if the disks are the bottleneck:

2/Tmax = ~kdisk · ~f2.    (5.28)

Solving for the disk utilization profile, ~kdisk, using linear regression, we have

~kdisk = (0, 14.8 ms, 608 ns)    (5.29)
~tdisk = (-, 37.9, 58.6)    (5.30)


The above indicates that the average per disk overhead is 14.8 ms, and the average time to transfer a byte is 608 ns. That is, the disk has an average positioning time of 14.8 ms and a sustained transfer rate of 1.64 MB/s. The entries of ~tdisk are the corresponding t-statistics discussed in Section 3.4.3. The t-statistics are obtained by dividing each element of ~kdisk by its corresponding standard deviation, and are a measure of the statistical significance of each element of the utilization profile. The above indicates that both terms of the utilization profile are highly significant.
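Equation 5.28 is an ordinary linear regression and can be carried out in a few lines; the sketch below uses hypothetical bottleneck measurements (the real design points come from raidPerf) and assumes numpy is available.

```python
import numpy as np

# Hypothetical bottleneck measurements for N = 2 disks: request size in
# stripe units (n), stripe unit size in bytes (B), and measured maximum
# throughput Tmax in array requests per second.
n    = np.array([1, 2, 1, 2, 1, 2])
B    = np.array([8192, 8192, 16384, 16384, 32768, 32768])
Tmax = np.array([100.0, 50.8, 81.0, 40.0, 58.0, 28.6])

N = 2
y = N / Tmax                          # Equation 5.28: N/Tmax = kdisk . f2
F = np.column_stack([n, n * B])       # throughput transform f2 = (-, n, nB)
k, *_ = np.linalg.lstsq(F, y, rcond=None)

print(f"per-disk overhead = {k[0] * 1e3:.1f} ms, "
      f"per-byte overhead = {k[1] * 1e9:.0f} ns")
```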

At this point, it is reasonable to ask what would have happened if we had not made our simplifying observations and had used (1, n, nB) as our throughput transform instead of (-, n, nB). The following illustrates the results.

~kdisk = (1.1 ms, 14.2 ms, 608 ns)    (5.31)
~tdisk = (1.0, 18.0, 58.6)    (5.32)

The t-statistic for the per request overhead is much smaller than the other t-statistics. Examination of the correlation matrix for the utilization profiles also indicates that there is a strong correlation between the per array request overhead and the per disk overhead. Thus, it may be desirable to drop the per request overhead, as our intuition indicated. This decision cannot be made, however, until the effect of dropping the per request overhead on the overall accuracy of the linear regression is known. To do this, we must examine the error metrics discussed in Section 3.4.4.

  ~f            R^2     1-R^2    max err   90% err
  BEST          0.9964  0.0036   0.093     0.075
  (1, n, nB)    0.9948  0.0052   0.129     0.109
  (-, n, nB)    0.9946  0.0054   0.152     0.116


The above table tabulates the errors in using the empirically best throughput transform, (1, n, nB), and (-, n, nB), respectively. The empirically best throughput transform is the throughput transform, from the infinite set of all throughput transforms, that maximizes the coefficient of determination, R^2. The best throughput transform takes into account all factors, n, B, and N, and all interactions among the factors. As can be seen, there is only a small difference in accuracy between (1, n, nB) and (-, n, nB), and both compare favorably with the best throughput transform. Thus, our earlier observation that disks do not see array requests is justified.

The reader should note that the results presented so far depend on the subset of the design set, N = 2 and B ≥ 8KB, we selected for analysis. Other subsets should yield equivalent results as long as the disks are the bottleneck, but this is by no means guaranteed. In practice, our first choice of subset is rarely the best choice, and it is only by iterating through the analysis process that we gradually come to recognize the proper subsets for which each resource is the bottleneck. By applying the above analysis procedure to the other subsets, we can analyze each bottleneck resource in turn. When the response at all bottleneck design points is accurately predicted by the utilization profile for an analyzed resource, the analysis is complete.

Figure 5.2 summarizes the analysis for each bottleneck resource, including the disks, for the read access mode. In the case of the CPU, eliminating an element from the full throughput transform actually results in an improvement of the 90% err metric from 0.062 to 0.059. This occasionally occurs when the eliminated element is statistically insignificant. Note, however, that using a simpler throughput transform always results in a worse coefficient of determination, or R^2. This is because linear regression maximizes R^2, and thus having more elements can only improve the fit. In all cases, the use of a simpler throughput transform results in errors that are comparable to those of the empirically best throughput transforms.

Figure 5.3 summarizes the utilization profiles of the bottleneck resources, uses the profiles to create an upper bound, Tmax, on system throughput, and compares the bound with the measured bounds and the best empirical bound. As can be seen, the bound based on utilization profiles, Tmax, compares favorably with the best empirical bound with respect to the max err and 90% err metrics, although the R^2 for Tmax is significantly worse than for the best empirical bound. The graph in Figure 5.3 plots the predicted maximum throughput in MB/s and compares it to the range of measured maximum throughputs for N = 24. In the future, we will primarily rely on such figures and corresponding commentaries on the analysis to succinctly present our results.

Resource: CPU. Design subset: N = 24 and B ≤ 16KB.

  ~f            ~k                           ~t (t-stat for ~k)
  (1, n, nB)    (8.5 ms, 910 us, 1.3 ns)     (138, 177, 3.3)
  (1, n, -)     (8.5 ms, 916 us, 0)          (137, 202, -)

  ~f            R^2     1-R^2    max err   90% err
  BEST          0.9970  0.0030   0.071     0.029
  (1, n, nB)    0.9892  0.0108   0.161     0.062
  (1, n, -)     0.9890  0.0110   0.158     0.059

Resource: Disks. Design subset: N = 2 and B ≥ 8KB.

  ~f            ~k                           ~t (t-stat for ~k)
  (1, n, nB)    (1.1 ms, 14.2 ms, 608 ns)    (1.0, 18.0, 58.6)
  (-, n, nB)    (0, 14.8 ms, 608 ns)         (-, 37.9, 58.6)

  ~f            R^2     1-R^2    max err   90% err
  BEST          0.9964  0.0036   0.093     0.075
  (1, n, nB)    0.9948  0.0052   0.129     0.109
  (-, n, nB)    0.9946  0.0054   0.152     0.116

Resource: SCSI Strings. Design subset: N = 24, B ≥ 32KB and n ≥ 12.

  ~f            ~k                           ~t (t-stat for ~k)
  (1, n, nB)    (11.0 ms, 2.5 ms, 402 ns)    (2.1, 7.6, 114)
  (-, n, nB)    (0, 3.1 ms, 402 ns)          (-, 16.7, 112)

  ~f            R^2     1-R^2    max err   90% err
  BEST          0.9958  0.0042   0.066     0.038
  (1, n, nB)    0.9948  0.0052   0.070     0.042
  (-, n, nB)    0.9945  0.0055   0.072     0.050

Figure 5.2: Analysis of Reads for each Resource.

~f = (1, n, nB)

  Resource   ~k
  CPU        (8.5 ms, 916 us, 0)
  Disk       (0, 14.8 ms, 608 ns)
  String     (0, 3.1 ms, 402 ns)

1/Tmax = max(~kcpu · ~f, ~kdisk · ~f/N, ~kstring · ~f/8)

  Model     R^2     1-R^2    max err   90% err
  BEST      0.9951  0.0049   0.243     0.079
  1/Tmax    0.9654  0.0346   0.495     0.153

Figure 5.3: Final Analysis of Read. The graph plots the maximum throughput predicted by Tmax in MB/s and compares it to the range of measured maximum throughputs for N = 24 and various values of B. From bottom to top, each line represents the predicted maximum throughput for B = 1, 2, 4, 8, 16, 32, 64 KB, respectively. For the smaller values of B, the measured ranges are so small that they overlap with the prediction lines.

5.5.2 Analysis of Read-Modify-Writes

This section defines and calculates the utilization profiles for the read-modify-write access mode on RAID-II. As described in Section 2.3, the read-modify-write access mode is serviced by reading the old data and old parity, xoring them with the new data to generate the new parity, and then writing the new data and new parity to disk.

For many of the resources in the system, we can assume that the relevant overheads are the per array request overhead, per read-modify-write disk overhead, and the per read-modify-write byte overhead. We specify the overheads in terms of read-modify-write accesses rather than simply disk or byte accesses because read-modify-write accesses can

behave differently from independent disk or byte accesses. With disks, for example, a write access immediately following a read to the same sectors may have to wait a full rotational delay rather than the average of half a rotational delay. Additionally, writes on most disks are slightly slower than reads because the disks require that the disk head be more precisely positioned before writing data. The corresponding throughput transform is

~f' = (1, n + 1, (n + 1)B).    (5.33)

That is, for each read-modify-write array access, n data disks plus 1 parity disk are read and written, and correspondingly, (n + 1)B bytes are read and written.

Unfortunately, some resources are not accurately characterized by ~f'. The xor engine, for example, which generates the new parity for each read-modify-write array request, is better characterized with the following overheads: per array request, per stripe-unit xored, per parity stripe-unit generated, per byte xored, and per parity byte generated. The corresponding throughput transform is

~f'' = (1, 2n + 1, 1, (2n + 1)B, B).    (5.34)

Of course, we can use different throughput transforms to analyze each resource, but there is a more elegant solution. We can use a more general throughput transform of the form

~f = (1, n, B, nB)    (5.35)


by showing that using ~f is mathematically equivalent to using ~f' or ~f''. That is,

~k · ~f = ~k' · ~f'
k1 + k2 n + k3 B + k4 nB = k1' + k2'(n + 1) + k3'(n + 1)B
k1 + k2 n + k3 B + k4 nB = (k1' + k2') + k2' n + k3' B + k3' nB

Thus, ~f is at least as general as ~f', and we can easily convert from one set of utilization profiles to another by using the following equations:

k1 = k1' + k2'
k2 = k2'
k3 = k3'
k4 = k3'
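The change of basis is mechanical; a small sketch, assuming the primed profile was fit against ~f' = (1, n + 1, (n + 1)B):

```python
def to_general_basis(k1p, k2p, k3p):
    """Convert a profile fit against f' = (1, n+1, (n+1)B) into the
    equivalent profile for the general transform f = (1, n, B, nB)."""
    return (k1p + k2p, k2p, k3p, k3p)

# Example, using values that appear in Figures 5.4 and 5.5 below: the disks
# were fit against (n+1, (n+1)B), i.e. with no per-array-request term
# (k1' = 0), giving 30.7 ms per disk access and 1.3 us per byte.
print(to_general_basis(0.0, 30.7e-3, 1.3e-6))
# -> (0.0307, 0.0307, 1.3e-06, 1.3e-06), the disk entry shown in Figure 5.5
```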

Although the two sets of throughput transforms are mathematically equivalent, they are not statistically equivalent. Note that ~f has four elements while ~f' has only three. This additional degree of freedom is necessary in modeling resources such as the xor unit, but is wasted on resources such as disks, which constrain some of the relationships. In fact, using the more general throughput transform for modeling disks has the detrimental effect that it produces larger variations in the computed utilization profiles than using the less general throughput transform. In cases where the design set is sufficiently large and accurate, the additional variation can be insignificant, but otherwise, the additional variation can be detrimental to the accuracy of the regression. In our analysis, the additional variation is

sometimes unacceptable. In such cases, we will use the less general throughput transform in

analyzing the resources, but convert them to the more general basis when summarizing the

results. This technique should not come as a complete surprise to the reader. It has already

been used trivially in the section analyzing the performance of the read access mode, where

we excluded the per array request overhead in analyzing resources such as disks that we

knew could not possibly see array requests.

Figure 5.4 summarizes the analysis for each bottleneck resource for the read-modify-write access mode. In the case of the CPU, it is clear from the t-statistics that only the terms 1 and n are significant. For the disks and SCSI strings, however, all the t-statistics corresponding to the throughput transform (1, n, B, nB) are relatively small. This indicates that the data is not "strong" enough to accurately estimate the utilization profiles for the given throughput transform. To get more accurate estimates, we would have to perform additional experiments or more accurate experiments. In this case, we know that the throughput transform (n + 1, (n + 1)B) is sufficient, as the corresponding t-statistics and error metrics illustrate. We can now use the utilization profiles for (n + 1, (n + 1)B) to accurately calculate the utilization profiles for (1, n, B, nB). Figure 5.5 summarizes the utilization profiles of the bottleneck resources. The utilization profiles for the disks and SCSI strings are calculated indirectly from regression on a simpler throughput transform.

Resource: CPU. Design subset: N = 24 and B ≤ 8KB.

  ~f               ~k                                      ~t (t-stat for ~k)
  (1, n, B, nB)    (11.7 ms, 1.9 ms, -15.4 ns, 13.4 ns)    (72.4, 82.0, -0.45, 2.7)
  (1, n, -, -)     (11.6 ms, 2.0 ms, 0, 0)                 (117, 136, -, -)

  ~f               R^2     1-R^2    max err   90% err
  BEST             0.9950  0.0050   0.078     0.029
  (1, n, B, nB)    0.9919  0.0081   0.099     0.037
  (1, n, -, -)     0.9907  0.0093   0.088     0.038

Resource: Disks. Design subset: N = 8, B ≥ 16KB and n ≥ 2.

  ~f               ~k                                      ~t (t-stat for ~k)
  (1, n, B, nB)    (70.2 ms, 15.4 ms, 1.5 us, 1.3 us)      (2.8, 1.5, 2.6, 5.5)
  (n+1, (n+1)B)    (30.7 ms, 1.3 us)                       (17.4, 32.7)

  ~f               R^2     1-R^2    max err   90% err
  BEST             0.9900  0.0100   0.070     0.061
  (1, n, B, nB)    0.9895  0.0105   0.079     0.066
  (n+1, (n+1)B)    0.9817  0.0183   0.122     0.084

Resource: SCSI Strings. Design subset: N = 24, B ≥ 32KB and n ≥ 6.

  ~f               ~k                                      ~t (t-stat for ~k)
  (1, n, B, nB)    (193 ms, -8.8 ms, -2.6 us, 1.0 us)      (5.1, -2.0, -3.6, 12.1)
  (n+1, (n+1)B)    (11.8 ms, 642 ns)                       (12.7, 35.9)

  ~f               R^2     1-R^2    max err   90% err
  BEST             0.9864  0.0136   0.099     0.062
  (1, n, B, nB)    0.9846  0.0154   0.135     0.059
  (n+1, (n+1)B)    0.9763  0.0237   0.158     0.096

Figure 5.4: Analysis of Read-Modify-Write for each Resource.

~f = (1, n, B, nB)

  Resource   ~k
  CPU        (11.6 ms, 2.0 ms, 0, 0)
  Disk       (30.7 ms, 30.7 ms, 1.3 us, 1.3 us)
  String     (11.8 ms, 11.8 ms, 642 ns, 642 ns)

1/Tmax = max(~kcpu · ~f, ~kdisk · ~f/N, ~kstring · ~f/8)

  Model     R^2     1-R^2    max err   90% err
  BEST      0.9909  0.0091   0.327     0.067
  1/Tmax    0.9440  0.0560   0.454     0.177

Figure 5.5: Final Analysis of Read-Modify-Write. The graph plots the maximum throughput predicted by Tmax in MB/s and compares it to the range of measured maximum throughputs for N = 24 and various values of B. From bottom to top, each line represents the predicted maximum throughput for B = 1, 2, 4, 8, 16, 32, 64 KB, respectively. For the smaller values of B, the measured ranges are so small that they overlap with the prediction lines.

5.5.3 Analysis of Reconstruct-Writes

This section defines and calculates the utilization profiles for the reconstruct-write access mode on RAID-II. As described in Section 2.3, the reconstruct-write access mode is serviced by reading the part of the parity stripe that is not being written, xoring it with

the new data to regenerate the new parity, and then writing the new data and new parity

to disk.

For many of the resources in the system, we can assume that the relevant overheads are the per array request overhead, per write disk overhead, per read disk overhead, per write byte overhead and per read byte overhead. The corresponding throughput transform is

~f' = (1, N - n - 1, n + 1, (N - n - 1)B, (n + 1)B).    (5.36)

That is, for each reconstruct-write array access, N - n - 1 disks must be read, n data disks plus 1 parity disk must be written, and the corresponding numbers of bytes must be read and written. If the write and read overheads are similar, the throughput transform can be simplified to (1, N, NB), corresponding to the per array request overhead, per disk overhead and per byte overhead, respectively.

As was the case for read-modify-writes, the xor engine's overheads are different and consist of overheads per array request, per stripe-unit xored, per parity stripe-unit generated, per byte xored, and per parity byte generated:

~f'' = (1, N - 1, 1, (N - 1)B, B).    (5.37)

The above discussion suggests the use of (1, n, N, B, nB, NB) as a general throughput transform. However, preliminary analysis of the data suggests that n is an insignificant factor; the best basis that excludes n is only slightly less accurate than the best basis that includes it. Thus, we will use the following simpler throughput transform:

~f = (1, N, B, NB).    (5.38)

Figure 5.6 summarizes the analysis for each bottleneck resource for the reconstruct-write access mode. Since the CPU does not explicitly transfer bytes of data, the t-statistics for the CPU indicate that the per-byte overheads are statistically insignificant. Analogously, since the disks and SCSI strings do not see array requests, the t-statistics for those resources indicate that the fixed, or per array request, overheads are statistically insignificant. The factor B is statistically insignificant for all three resources. Figure 5.7 summarizes the utilization profiles of the bottleneck resources.

5.6 Applications of Utilization Pro�les

The previous sections have used utilization pro�les to characterize the performance

of RAID-II for the read, read-modify-write and reconstruct-write access modes. To illustrate

the usefulness of utilization pro�les, this section will use the utilization pro�les derived in

the earlier sections to answer the following performance-oriented questions:

� What is the maximum throughput at a given system con�guration and workload?

� What is the bottleneck resource at a given system con�guration and workload?

� Given that the disks are the bottleneck, what is the e�ect of increasing the number

of disks in the system?

Page 123: Disk - University of California, Berkeley

111

Resource: CPU.  Design Subset: N ≥ 12 and B ≤ 8KB.

  \vec{f}           \vec{k}                               \vec{t} (t-stat for \vec{k})
  (1, N, B, NB)     (12.4ms, 825us, -119ns, 14.9ns)       (33.9, 41.7, -1.5, 3.6)
  (1, N, -, -)      (12.0ms, 883us, 0, 0)                 (53.8, 73.5, -, -)

  \vec{f}           R^2       1-R^2     max err   90% err
  BEST              0.9116    0.0884    0.144     0.075
  (1, N, B, NB)     0.8737    0.1263    0.188     0.091
  (1, N, -, -)      0.8624    0.1376    0.193     0.092

.........................................................................

Resource: Disks.  Design Subset: N ≤ 3.

  \vec{f}           \vec{k}                               \vec{t} (t-stat for \vec{k})
  (1, N, B, NB)     (-2.2ms, 15.7ms, 28ns, 591ns)         (-1.0, 19.0, 0.38, 20.4)
  (-, N, -, NB)     (0, 14.9ms, 0, 602us)                 (-, 92.1, -, 107)

  \vec{f}           R^2       1-R^2     max err   90% err
  BEST              0.9930    0.0070    0.427     0.051
  (1, N, B, NB)     0.9908    0.0092    0.546     0.045
  (-, N, -, NB)     0.9907    0.0093    0.553     0.049

.........................................................................

Resource: SCSI Strings.  Design Subset: N ≥ 20 and B ≥ 32KB.

  \vec{f}           \vec{k}                               \vec{t} (t-stat for \vec{k})
  (1, N, B, NB)     (83.8ms, 766us, 6.7us, 45.6ns)        (2.4, 0.48, 9.7, 1.5)
  (-, N, -, NB)     (0, 4.5ms, 0, 346ns)                  (-, 12.3, -, 48.4)

  \vec{f}           R^2       1-R^2     max err   90% err
  BEST              0.9909    0.0091    0.129     0.043
  (1, N, B, NB)     0.9833    0.0167    0.141     0.057
  (-, N, -, NB)     0.8910    0.1090    0.231     0.126

Figure 5.6: Analysis of Reconstruct-Write for each Resource.


\vec{f} = (1, N, B, NB)

  Resource    \vec{k}
  CPU         (12.0ms, 883us, 0, 0)
  Disk        (0, 14.9ms, 0, 602us)
  String      (0, 4.5ms, 0, 346ns)

  1/Tmax = max(\vec{k}_{cpu} \cdot \vec{f},  \vec{k}_{disk} \cdot \vec{f}/N,  \vec{k}_{string} \cdot \vec{f}/8)

  \vec{f}     R^2       1-R^2     max err   90% err
  BEST        0.9941    0.0059    0.427     0.062
  1/Tmax      0.9045    0.0955    0.553     0.180

[Graph: predicted and measured maximum throughput (MB/s) versus n, one curve per value of B.]

Figure 5.7: Final Analysis of Reconstruct-Write. The graph plots the maximum throughput predicted by Tmax in MB/s and compares it to the range of measured maximum throughputs for N = 24 and various values of B. From bottom to top, each line represents the predicted maximum throughput for B = 1, 2, 4, 8, 16, 32, 64, respectively. For the smaller values of B, the measured ranges are so small that they overlap with the prediction lines.
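As an illustration of how the profiles above can be applied, the following minimal sketch evaluates 1/Tmax for a chosen N and B and reports which resource saturates first. It is a sketch only: the function names are hypothetical, the per-byte coefficients are assumed to be nanoseconds per byte, and the division by N disks and eight SCSI strings follows the expression above.

# Sketch: locating the bottleneck resource for reconstruct-writes from the
# utilization profiles tabulated above.  Times are in seconds; B is the
# stripe unit size in bytes; the array has N disks and 8 SCSI strings.

def dot(k, f):
    return sum(ki * fi for ki, fi in zip(k, f))

k_cpu    = (12.0e-3, 883e-6, 0.0, 0.0)
k_disk   = (0.0, 14.9e-3, 0.0, 602e-9)   # per-byte term assumed to be ~600 ns
k_string = (0.0, 4.5e-3, 0.0, 346e-9)

def bottleneck(N, B, strings=8):
    f = (1, N, B, N * B)                       # throughput transform of Eq. 5.38
    busy = {
        "CPU":    dot(k_cpu, f),               # a single CPU serves every request
        "Disk":   dot(k_disk, f) / N,          # work is spread over N disks
        "String": dot(k_string, f) / strings,  # and over the SCSI strings
    }
    name = max(busy, key=busy.get)
    return name, 1.0 / busy[name]              # bottleneck, max array requests/sec

print(bottleneck(N=24, B=16 * 1024))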


• What throughput can the system support before certain resources become a certain percent utilized?

• How does RAID-II compare with RAID-I?

[Graph, two panels: throughput (MB/s) versus n for N=8, B=16 and for N=12, B=16; curves labeled T, Tcpu, Tdisk.]

Figure 5.8: Maximum Throughput Envelopes for Reads. The solid line illustrates the measured maximum throughput envelope. The dotted lines illustrate the maximum throughputs predicted by the CPU and disk utilization profiles.

Figure 5.8 plots the measured maximum throughput envelope and the maximum throughputs predicted by the CPU and disk utilization profiles for the specified values of N, the number of disks in the system, and B, the stripe unit size, as a function of n, the request size in stripe units. The predicted envelope roughly corresponds to the measured envelope. Thus, utilization profiles can be used to determine, for each system configuration and workload parameter, the maximum throughput that the system can achieve and the resource that is the bottleneck. We can also use utilization profiles to quickly evaluate changes in the system configuration. For example, the first graph illustrates the throughput


envelope when there are eight disks in the system. In this case, the bottleneck throughput when the request size equals six disks is approximately 5.2MB/s and the bottleneck resource is the disks. The second graph illustrates what would happen if we increased the number of disks to twelve. In such a case, the bottleneck throughput increases to 6.2MB/s and the bottleneck resource becomes the CPU. Thus, increasing the number of disks from eight to twelve effectively eliminates the disks as a bottleneck for the given stripe unit and request size.
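A similar what-if calculation can be sketched directly from the read profiles. The coefficients below are the RAID-II entries from the comparison table later in this section, the SCSI strings are omitted for brevity, and the helper names are hypothetical, so the results only approximate the envelopes of Figure 5.8.

# Sketch: effect of adding disks for reads.  f = (1, n, nB) is the read
# throughput transform; n is the request size in stripe units and B the
# stripe unit size in bytes.  Coefficients are the RAID-II read profiles.

k_cpu_read  = (8.5e-3, 916e-6, 0.0)
k_disk_read = (0.0, 14.8e-3, 608e-9)

def dot(k, f):
    return sum(ki * fi for ki, fi in zip(k, f))

def read_throughput_mb(N, n, B):
    f = (1, n, n * B)
    per_request = max(dot(k_cpu_read, f), dot(k_disk_read, f) / N)  # bottleneck time
    return (n * B) / per_request / 1e6                              # user data, MB/s

print(read_throughput_mb(N=8,  n=6, B=16 * 1024))   # disks are the bottleneck
print(read_throughput_mb(N=12, n=6, B=16 * 1024))   # CPU becomes the bottleneck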

We have so far applied utilization profiles to workloads where the system is operating close to its bottleneck throughputs. Utilization profiles, however, can also be used to answer questions about the system at non-bottleneck throughputs. For example, suppose that we want to limit the utilization of certain resources, such as disks, to less than 30% to ensure that queueing delays do not exceed a certain threshold. What throughput can the system support without violating this criterion? Because we can use utilization profiles to determine throughput envelopes for arbitrary levels of utilization, this question can be answered much the same way as our previous questions.
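The calculation can be sketched as follows, assuming (as the 1/Tmax expression above suggests) that disk utilization grows linearly with the array request rate; the function names are hypothetical and the disk profile is the RAID-II read profile used later in this section.

# Sketch: throughput supportable before disk utilization exceeds a target.
# Disk utilization is modeled as U_disk = T * (k_disk . f) / N, so the
# supported request rate is T = U_target * N / (k_disk . f).

def dot(k, f):
    return sum(ki * fi for ki, fi in zip(k, f))

k_disk_read = (0.0, 14.8e-3, 608e-9)     # RAID-II read profile, f = (1, n, nB)

def supported_request_rate(U_target, N, n, B):
    f = (1, n, n * B)
    return U_target * N / dot(k_disk_read, f)    # array requests per second

rate = supported_request_rate(U_target=0.30, N=24, n=6, B=16 * 1024)
print(rate, rate * 6 * 16 * 1024 / 1e6)          # requests/sec and MB/s of user data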

In general, utilization profiles mathematically relate the utilization of individual resources in the system to the overall system throughput as a function of the system and workload parameters. They can be used to study the effect of varying the system throughput, utilization, or system and workload parameters on either the system throughput or the utilization of a given resource in the system.

The throughput transforms are abstract quantities that are independent of a particular system's implementation; it is reasonable for all RAID level 5 implementations to


have a per array request overhead, a per disk overhead and a per byte overhead. Thus, we can use the same throughput transform to analyze "similar" systems; where the systems differ is in the values of their utilization profiles.

Because utilization profiles accurately summarize the performance of a given system over a wide range of system and workload parameters, it is especially convenient to compare the performance of two or more similar systems by directly comparing their utilization profiles. As an example, the table below tabulates the CPU and disk utilization profiles of RAID-II and RAID-I, our first disk array prototype, for read requests.

\vec{f} = (1, n, nB)

  Resource    \vec{k}_{II}              \vec{k}_{I}                 \vec{k}_{II}/\vec{k}_{I}
  CPU         (8.5ms, 916us, 0)         (3.7ms, 1.0ms, 388ns)       (2.3, 0.92, 0)
  Disk        (0, 14.8ms, 608ns)        (0, 25.4ms, 805ns)          (-, 0.58, 0.76)

Unlike RAID-II, the CPU utilization profile for RAID-I has a non-zero per byte overhead. Because RAID-I is constructed using completely off-the-shelf parts, all data goes through the host CPU's memory system. Moreover, for each I/O request, the data is copied by the CPU to/from user address space from/to kernel address space. Thus, the per byte overhead reflects the CPU overhead in copying the data and also contention for memory with I/O devices when the cache is being filled or flushed.

The per array request CPU overhead for RAID-II is 2.3 times higher than that for RAID-I. This difference is primarily an artifact of the different measurement programs that we used to gather statistics. On RAID-I, we used a measurement tool that relied on lightweight processes within a common address space, whereas with RAID-II we used


heavyweight processes with separate address spaces; we made the change to heavyweight processes to increase generality in the measurement tool since not all UNIX systems support lightweight processes. As can be seen, however, it has the detrimental consequence of significantly increasing the per array request overhead since context switches between heavyweight processes are significantly more expensive. In reality, the per array request overheads for both RAID-II and RAID-I are similar, as is their per disk overhead. Thus, the main difference between the CPU utilization profiles of RAID-II and RAID-I is in the per byte overheads. Since RAID-II's CPU does not manipulate bytes, its per byte overhead is dramatically smaller than RAID-I's per byte overhead. Comparison of the disk utilization profiles tells us that the disks used for RAID-II both position and transfer data significantly faster than the disks used for RAID-I.

5.7 Summary

In this section, we have presented an analysis technique that combines empirical performance models with statistical techniques to summarize the performance of disk arrays over a wide range of system and workload parameters. The analysis technique is generalizable to a wide variety of systems, requires minimal information about the system to be analyzed, does not require measurements of internal system resources, and produces results that are compact, intuitive, and easily compared with those of similar systems. We then applied the technique to analyze the performance of RAID-II, our second disk array prototype, and used the results of the analysis to answer several performance questions about RAID-II. Finally, we used utilization profiles to compare the performance of RAID-II


relative to RAID-I, our first disk array prototype. We showed that such comparisons between similar systems immediately bring to attention the relative strengths and weaknesses of the two systems.


Chapter 6

Summary and Conclusions

In this dissertation, we have presented an analytic performance model for non-redundant disk arrays, applied the analytic model to derive an equation for the optimal size of data striping, developed an analysis technique based on utilization profiles, applied the technique to analyze RAID-II, our second disk array prototype, and used the results of the analysis to determine throughput limits and bottlenecks and to compare RAID-II with RAID-I.

In deriving the analytic model, we modeled disk arrays as a closed queueing system consisting of N disks and a fixed number, L, of processes continuously issuing requests of fixed size p. We later extended the model to handle distributions of request sizes. The resulting model predicts the expected utilization of a disk in the modeled system, U, as

U = \frac{1}{1 + \frac{1}{L}\left(\frac{1}{p} - 1\right)}

where p is the average size of requests as a fraction of the number of disks in the disk array. We then derived the expected response time and throughput as a function of utilization. We showed via simulation that the simulated utilization is generally within 10%


of the utilization predicted by the analytic model. We also examined the error introduced by each approximation made in the derivation of the analytic model to better understand the validity of the approximations. Finally, we validated two results of the analytic model, namely that utilization is insensitive to the disk service time distribution and that utilization can be accurately represented by an equation of the form U = 1/(1 + \frac{1}{L} f(p, N)), where f(p, N) represents an arbitrary function of p and N.
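A minimal sketch of this utilization equation, reading it as U = 1/(1 + (1/p - 1)/L); the parameter values are hypothetical and chosen only to illustrate its shape.

# Sketch: expected disk utilization U = 1 / (1 + (1/p - 1)/L).
# L is the number of processes issuing requests; p is the average request
# size expressed as a fraction of the number of disks in the array.

def expected_utilization(L, p):
    return 1.0 / (1.0 + (1.0 / p - 1.0) / L)

for L in (1, 4, 16):
    print(L, expected_utilization(L, p=0.25))   # requests spanning a quarter of the disks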

Next, we used the analytic model to show that the optimal size of data striping, which simultaneously maximizes throughput and minimizes response time, is equal to \sqrt{PX(L-1)Z/N}, where P is the average disk positioning time, X is the average disk transfer rate and Z is the request size. We used simulation to validate the optimal stripe unit equation and showed that our results correspond well with those obtained by Chen [7]. An advantage of our analytically derived optimal stripe unit equation over Chen's empirically derived rules-of-thumb is that we explicitly take into account both the number of disks in the disk array and the request size, whereas Chen does not.
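Read this way, the optimal stripe unit can be computed directly; the sketch below uses hypothetical parameter values chosen only for illustration, not measurements from RAID-II.

# Sketch: optimal stripe unit size  sqrt(P * X * (L - 1) * Z / N).
# P: average positioning time (s), X: average transfer rate (bytes/s),
# L: number of concurrent processes, Z: request size (bytes), N: disks.

import math

def optimal_stripe_unit(P, X, L, Z, N):
    return math.sqrt(P * X * (L - 1) * Z / N)

# hypothetical values: 15 ms positioning, 1.5 MB/s transfer, 8 processes,
# 64 KB requests, 16 disks
print(optimal_stripe_unit(P=0.015, X=1.5e6, L=8, Z=64 * 1024, N=16))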

Due to limitations in the applicability of the analytic model, we next developed an analysis technique based on utilization profiles. The technique combines intuitive performance models with statistical techniques and metrics to summarize the performance of a system over a wide range of system and workload parameters with a relatively small set of performance metrics called utilization profiles. We showed that the analysis technique is applicable to a wide variety of systems, requires minimal information about the system to be analyzed, does not require measurements of internal system resources, and produces results that are compact, intuitive, and easily compared with those of other similar systems.


Finally, we used the analysis technique to examine the performance of RAID-II, our second disk array prototype. We used the results of the analysis to determine the performance limits of RAID-II, the bottleneck resource for a given system configuration and workload, and to compare the performance of RAID-II to RAID-I. We showed that utilization profiles are good at characterizing the overall performance of a system and can be used to characterize the performance effects of changes in the system configuration or workload, making them ideal for performance tuning and capacity planning.

In conclusion, we have found that both analytic models and empirical performance analysis techniques are invaluable in understanding and analyzing the performance of real systems. The analytic models allow the formulation of abstract relationships that are difficult to determine directly from empirical measurements and aid in differentiating important factors from insignificant factors. Analytic models, however, are frequently very limited in their application to real systems. Thus, empirical analysis techniques must be applied to validate the analytic models and also to study aspects of the real system that cannot be modeled analytically. Since accurate empirical studies require the analysis of a large amount of data, methods, such as utilization profiles, for summarizing and characterizing the raw performance measurements are necessary.

There are several areas for future work based on the material presented in this

dissertation. First, the analytic model's workload can be extended to handle CPU think

time. In such a system, processes, instead of simply issuing I/O requests, would alternate

between computation and I/O. Second, the model can be extended to handle write requests

in RAIDs. It would be particularly interesting to see whether the general properties we validated


for non-redundant disk arrays, such as the insensitivity of disk utilization to the disk service time distribution, hold for RAIDs when servicing writes. Third, the model can be applied to solve other problems in the design and configuration of disk arrays, such as computing price/performance ratios for various disk array and workload parameters, and quantifying tradeoffs between the performance and reliability of disk arrays. Fourth, the analysis technique based on utilization profiles can be used to analyze other types of systems such as networks or time sharing systems. This would verify that utilization profiles are a general-purpose technique applicable to a wide variety of systems including disk arrays. Fifth, and finally, utilization profiles can be applied to solve real problems in systems with real users. The examples we have provided in this dissertation are, admittedly, artificial; it would be difficult to assess the practical usefulness of utilization profiles without using them to solve real problems.


Bibliography

[1] Francois Baccelli. Two parallel queues created by arrivals with two demands. Technical Report 426, INRIA, Rocquencourt, France, 1985.

[2] Francois Baccelli, Armand M. Makowski, and Adam Shwartz. The fork-join queue and related systems with synchronization constraints: Stochastic ordering and computable bounds. Adv. Appl. Prob., 21:629-660, September 1989.

[3] Anupam Bhide and Daniel Dias. RAID architectures for OLTP. Technical Report RC 17879 (78489), IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598, March 1992.

[4] Dina Bitton and Jim Gray. Disk shadowing. In Proc. Very Large Data Bases, pages 331-338, August 1988.

[5] Peter M. Chen, Garth A. Gibson, Randy H. Katz, and David A. Patterson. An evaluation of redundant arrays of disks using an Amdahl 5890. In Proc. SIGMETRICS, pages 74-85, May 1990.

[6] Peter M. Chen, Edward K. Lee, Ann L. Drapeau, Ken Lutz, Ethan L. Miller, Srinivasan Seshan, Ken Shirriff, David A. Patterson, and Randy H. Katz. Performance and design evaluation of the RAID-II storage server. In International Parallel Processing Symposium, April 1993.

[7] Peter M. Chen and David A. Patterson. Maximizing performance in a striped disk array. In Proc. International Symposium on Computer Architecture, pages 322-331, May 1990.

[8] Shenze Chen and Don Towsley. A queueing analysis of RAID architectures. Technical Report COINS 91-71, University of Massachusetts at Amherst, 1991.

[9] Ann L. Chervenak and Randy H. Katz. Performance of a disk array prototype. In Proc. SIGMETRICS, pages 188-197, May 1991.

[10] L. Flatto. Two parallel queues created by arrivals with two demands II. SIAM J. Appl. Math., 45:861-878, October 1985.

[11] L. Flatto and S. Hahn. Two parallel queues created by arrivals with two demands I. SIAM J. Appl. Math., 44:1041-1053, October 1984.

[12] Neal Glover and Trent Dudley. Practical Error Correction Design for Engineers. Data Systems Technology, Corp., 1988.

[13] Jim Gray, Bob Horst, and Mark Walker. Parity striping of disc arrays: Low-cost reliable storage with acceptable throughput. In Proc. Very Large Data Bases, pages 148-161, August 1990.

[14] Philip Heidelberger and Kishor S. Trivedi. Queueing network models for parallel processing with asynchronous tasks. IEEE Trans. on Computers, C-31:1099-1109, November 1982.

[15] Philip Heidelberger and Kishor S. Trivedi. Analytic queueing models for programs with internal concurrency. IEEE Trans. on Computers, C-32:73-82, January 1983.

[16] Robert Y. Hou, Jai Menon, and Yale N. Patt. Balancing I/O response time and disk rebuild time in a RAID5 disk array. In Proc. Hawaii International Conference on Computer Systems, January 1993.

[17] Raj Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, Inc., 1991.

[18] Cheeha Kim and Ashok K. Agrawala. Analysis of the fork-join queue. IEEE Trans. on Computers, 38:250-255, February 1989.

[19] Michelle Y. Kim. Synchronized disk interleaving. IEEE Trans. on Computers, C-35:978-988, November 1986.

[20] Michelle Y. Kim and Asser N. Tantawi. Asynchronous disk interleaving: Approximating access delays. IEEE Trans. on Computers, 40:801-810, July 1991.

[21] Edward K. Lee and Randy H. Katz. An analytic performance model of disk arrays and its application. Technical Report UCB/CSD 91/660, University of California at Berkeley, November 1991.

[22] Edward K. Lee and Randy H. Katz. Performance consequences of parity placement in disk arrays. In Proc. ASPLOS, pages 190-199, April 1991.

[23] Edward K. Lee and Randy H. Katz. An analytic performance model of disk arrays. In Proc. SIGMETRICS, pages 98-109, May 1993.

[24] Miron Livny, S. Khoshafian, and H. Boral. Multi-disk management algorithms. In Proc. SIGMETRICS, pages 69-77, May 1987.

[25] Jai Menon and Jim Kasson. Methods for improved update performance of disk arrays. In Proc. Hawaii International Conference on Computer Systems, January 1992.

[26] Jai Menon and Dick Mattson. Performance of disk arrays in transaction processing environments. Research Report RJ 8230, IBM Research Division, 1991.

[27] Jai Menon and Dick Mattson. Comparison of sparing alternatives for disk arrays. In Proc. International Symposium on Computer Architecture, pages 318-329, May 1992.

[28] Jai Menon and Dick Mattson. Distributed sparing in disk arrays. In Proc. IEEE COMPCON, pages 410-421, February 1992.

[29] Ethan L. Miller. Input/output behavior of supercomputing applications. Technical Report UCB/CSD 91/616, University of California at Berkeley, 1991.

[30] Richard R. Muntz and John C.S. Lui. Performance analysis of disk arrays under failure. Technical Report CSD 900005, University of California at Los Angeles, 1990.

[31] R. Nelson and A. N. Tantawi. Approximate analysis of fork/join synchronization in parallel queues. IEEE Trans. on Computers, 37:739-743, June 1988.

[32] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. ACM SIGMOD, pages 109-116, June 1988.

[33] David A. Patterson and John L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 1993.

[34] A. L. Narasimha Reddy and Prithviraj Banerjee. An evaluation of multiple-disk I/O systems. IEEE Trans. on Computers, 38:1680-1690, December 1989.

[35] A. L. Narasimha Reddy and Prithviraj Banerjee. Gracefully degradable disk arrays. In Proc. International Symposium on Fault-Tolerant Computing, June 1991.

[36] K. Salem and H. Garcia-Molina. Disk striping. In Proc. IEEE Data Engineering, pages 336-342, February 1986.

[37] Daniel Stodolsky and Garth A. Gibson. Parity logging: Overcoming the small write problem in redundant disk arrays. In Proc. International Symposium on Computer Architecture, May 1993.

[38] J. Thisquen. Seek time measurements. Technical report, Amdahl Peripheral Products Division, May 1988.

[39] Ronald W. Wolff. Stochastic Modeling and the Theory of Queues. Prentice-Hall, 1989.