High Speed Sequential IO on Windows NT™ 4.0 (sp3)

Erik Riedel (of CMU)

Catharine van Ingen

Jim Gray

http://Research.Microsoft.com/BARC/Sequential_IO/

Outline

• Intro/Overview
• Disk background, technology trends
• Measurements of Sequential IO
– Single disk (temp, buffered, unbuffered, deep)
– Multiple disks and busses
– RAID
– Pitfalls

• Summary

We Got a Lot of Help

• Brad Waters, Wael Bahaa-El-Din, and Maurice Franklin: shared experience, results, tools, and hardware lab; helped us understand NT; gave feedback on our preliminary measurements.

• Tom Barclay: iostress benchmark program.

• Barry Nolte & Mike Parkes: allocate issues.

• Doug Treuting, Steve Mattos + Adaptec: SCSI and Adaptec device drivers.

• Bill Courtright, Stan Skelton, Richard Vanderbilt, Mark Regester: loaned us a Symbios Logic array, host adapters, and their expertise.

• Will Dahli: helped us understand NT configuration and measurement.

• Joe Barrera, Don Slutz & Felipe Cabrera: valuable comments and feedback; helped in understanding NTFS internals.

• David Solomon: Inside Windows NT 2nd edition draft.

The Actors

• Measuring & Modeling Sequential IO
• Where are the bottlenecks?
• How does it scale with
– SMP, RAID, new interconnects

Goals:
• Balanced bottlenecks
• Low overhead
• Scale to many processors (10s)
• Scale to many disks (100s)

[Figure: data path from the app address space and file cache in memory, over the memory bus and PCI, to the adapter, SCSI bus, and disk controller]

PAP (Peak Advertised Performance) vs RAP (Real Application Performance)

• Goal: RAP = PAP / 2 (the half-power point)

[Figure: PAP and RAP at each stage of the path between application data, file system buffers, and disk]
Stage: PAP, RAP
System bus: 422 MBps, 7.2 MB/s
PCI: 133 MBps, 7.2 MB/s
SCSI: 40 MBps, 7.2 MB/s
Disk: 10-15 MBps, 7.2 MB/s

Two Basic Shapes

• Circle (disk)
– Storage frequently returns to the same spot
– So less total surface area
• Line (tape)
– Lots more area
– Longer time to get to the data
• Key idea: multiplex an expensive read/write head over a large storage area: trade $/GB for accesses/second

Disk Terms

• Disks are called platters
• Data is recorded on tracks (circles) on the disk
• Tracks are formatted into fixed-size sectors
• A pair of read/write heads for each platter
• Heads are mounted on a disk arm
• Client addresses logical blocks (cylinder, head, sector)
• Bad blocks are remapped to spare good blocks

Disk Access Time

• Access time = SeekTime (6 ms) + RotateTime (3 ms) + ReadTime (1 ms)
• Rotate time:
– 5,000 to 10,000 rpm
– ~12 to 6 milliseconds per rotation
– ~6 to 3 ms average rotational latency
– Improved 3x in 20 years
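The access-time sum above can be sketched in a few lines of C. This is only an illustration: the function names are made up for this example, and the model assumes average rotational latency is half a revolution, which is how the slide's 6 to 3 ms figures follow from 12 to 6 ms per rotation.

```c
#include <assert.h>

/* Average rotational latency in ms: half of one revolution (assumed model). */
double rotational_latency_ms(double rpm)
{
    assert(rpm > 0.0);
    double ms_per_rotation = 60000.0 / rpm;   /* 60,000 ms per minute */
    return ms_per_rotation / 2.0;
}

/* Access time = seek + rotate + transfer, all in milliseconds. */
double access_time_ms(double seek_ms, double rpm,
                      double bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    double transfer_ms = 1000.0 * bytes / bytes_per_sec;
    return seek_ms + rotational_latency_ms(rpm) + transfer_ms;
}
```

With a 6 ms seek, a 10,000 rpm spindle, and a transfer that takes 1 ms, access_time_ms returns about 10 ms, matching the 6 + 3 + 1 breakdown on the slide.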

Disk Seek Time

• Seek time is ~ sqrt(distance), since distance = 1/2 x acceleration x time^2
• Specs assume seek is 1/3 of disk
• Short seeks are common (over 50% are zero length)
• Typical 1/3 seek time: 8 ms
• 4x improvement in 20 years

[Figure: arm speed vs time: full acceleration, then full stop]

Read/Write Time: Density

• Time = Size / BytesPerSecond
• Bytes/Second = Speed * Density
– 5 to 15 MBps
• MAD (Magnetic Areal Density)
– Today 3 Gbits/inch^2, 5 Gbpsi in the lab
– Rising > 60%/year
– Paramagnetic limit: 10 Gb/inch^2
– Linear density grows sqrt(10)x per decade

[Figure: Hoagland's Law: MAD (Mbpsi), log scale 1 to 10,000, vs year, 1970 to 2000]

[Figure: Throughput (MB/s), 0 to 10, vs radial distance, 0% to 100%, for Fast Wide SCSI and Ultra SCSI]

Read/Write Time: Rotational Speed

• Bytes/Second = Speed * Density
• Speed is greater at the edge of the circle
• Speed 3600 -> 10,000 rpm
– 5%/year improvement
• Bit rate varies by ~1.5x today

[Figure: concentric circles: r = 1 gives r^2 = 1, r = 2 gives r^2 = 4]

Read/Write Time: Zones

• Disks are sectored
– Typical: 512 bytes/sector
– Sector is the read/write unit
– Failfast: can detect bad sectors
• Disks are zoned
– Outer zones have more sectors
– Bytes/second is higher in outer zones

[Figure: zoned disk with 14 sectors/track in the outer zone and 8 sectors/track in the inner zones]
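To see why outer zones deliver more bytes per second, here is a small C sketch of the Bytes/Second = Speed * Density relation applied to one track. The function name is hypothetical, and the 14 and 8 sectors/track counts are the schematic values from the figure, not those of a real drive.

```c
#include <assert.h>

/* Media rate of one track in bytes/second:
   sectors per track x bytes per sector x revolutions per second. */
double media_rate_bytes_per_sec(int sectors_per_track,
                                int bytes_per_sector, double rpm)
{
    assert(sectors_per_track > 0 && bytes_per_sector > 0 && rpm > 0.0);
    double revs_per_sec = rpm / 60.0;
    return (double)sectors_per_track * bytes_per_sector * revs_per_sec;
}
```

At a fixed rpm the rate scales directly with sectors per track, so the figure's 14-sector outer track moves data 14/8 = 1.75x faster than the 8-sector inner track.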

Disk Access Time

• Access time = SeekTime (6 ms, improving 5%/y) + RotateTime (3 ms, improving 5%/y) + ReadTime (1 ms, improving 25%/y)
• Other useful facts:
– Power rises more than size^3 (so small is indeed beautiful)
– Small devices are more rugged
– Small devices can use plastics (forces are much smaller), e.g. bugs fall without breaking anything

The Access Time Myth

The Myth: seek or pick time dominates
The Reality:
(1) Queuing dominates
(2) Transfer dominates BLOBs
(3) Disk seeks are often short
Implication: many cheap servers are better than one fast expensive server
– shorter queues
– parallel transfer
– lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays

[Figure: components of access time: wait, seek, rotate, transfer]

Storage Ratios Changed

• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
• DRAM/disk media price ratio changed
– 1970-1990 100:1
– 1990-1995 10:1
– 1995-1997 50:1
– today ~ 0.2 $/MB disk, 10 $/MB DRAM

[Figure: Disk Performance vs Time: seeks per second and bandwidth (MB/s), log scale 1 to 100, and capacity (GB), log scale 0.1 to 10, vs year, 1980 to 2000]

[Figure: Disk accesses/second vs Time: accesses per second, log scale 1 to 100, vs year, 1980 to 2000]

[Figure: Storage Price vs Time: megabytes per kilo-dollar (MB/k$), log scale 0.1 to 10,000, vs year, 1980 to 2000]

Year 2002 Disks

• Big disk (10 $/GB)
– 3”
– 100 GB
– 150 kaps (k accesses per second)
– 20 MBps sequential
• Small disk (20 $/GB)
– 3”
– 4 GB
– 100 kaps
– 10 MBps sequential
• Both running Windows NT™ 7.0? (see below for why)

Tape & Optical: Beware of the Media Myth

• Optical is cheap: 200 $/platter, 3 GB/platter => 70 $/GB (cheaper than disc)
• Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc)

The Media Myth

• Tape needs a robot (10 k$ ... 3 m$) for 10 ... 1000 tapes (at 20 GB each) => 10 $/GB ... 150 $/GB (1x…10x cheaper than disc)
• Optical needs a robot (100 k$): 100 platters = 200 GB (TODAY) => 400 $/GB (more expensive than mag disc)
• Robots have poor access times
– Not good for Library of Congress (25 TB)
– Data motel: data checks in but it never checks out!

Crazy Disk Ideas

• Disk farm on a card: surface mount disks
• Disk (magnetic store) on a chip: (micro machines in silicon)
• NT and BackOffice in the disk controller (a processor with 100 MB DRAM)

The Disk Farm On a Card

• The 100 GB disc card: an array of discs
• Can be used as
– 100 discs
– 1 striped disc
– 10 fault tolerant discs
– ....etc
• LOTS of accesses/second and bandwidth

[Figure: 14" card carrying an array of discs and an ASIC]

Life is cheap, it's the accessories that cost ya.

Processors are cheap, it’s the peripherals that cost ya (a 10 k$ disc card).

Functionally Specialized Cards

• Storage
• Network
• Display

Each card has an ASIC, a P mips processor, and M MB of DRAM.

• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB

It’s Already True of Printers: Peripheral = CyberBrick

• You buy a printer
• You get
– several network interfaces
– a PostScript engine: cpu, memory, software, a spooler (soon)
– and… a print engine.

All Device Controllers will be Cray 1’s

• TODAY
– Disk controller is a 10 mips RISC engine with 2 MB DRAM
– NIC is similar power
• SOON
– Will become 100 mips systems with 100 MB DRAM
• They are nodes in a federation (can run Oracle on NT in disk controller)
• Advantages
– Uniform programming model
– Great tools
– Security
– Economics (cyberbricks)
– Move computation to data (minimize traffic)

[Figure: central processor & memory connected to device controllers over a Tera Byte Backplane]

System On A Chip

• Integrate processing with memory on one chip
– chip is 75% memory now
– 1 MB cache >> 1960 supercomputers
– 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM,… projects abound
• Integrate networking with processing on one chip
– system bus is a kind of network
– ATM, FiberChannel, Ethernet,.. logic on chip
– Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.

With Tera Byte Interconnect and Super Computer Adapters

• Processing is incidental to
– Networking
– Storage
– UI
• Disk Controller/NIC is
– faster than device
– close to device
– can borrow device package & power
• So use idle capacity for computation.
• Run app in device.

[Figure: devices attached to a Tera Byte Backplane]

Implications

Conventional
• Offload device handling to NIC/HBA
• Higher level protocols: I2O, NASD, VIA…
• SMP and cluster parallelism is important.

Radical
• Move app to NIC/device controller
• Higher-higher level protocols: CORBA / DCOM.
• Cluster parallelism is VERY important.

[Figure: central processor & memory on a Tera Byte Backplane]

How Do They Talk to Each Other?

• Each node has an OS
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
– CORBA? DCOM? IIOP? RMI?
– One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.

[Figure: two protocol stacks joined by wire(s): applications over RPC?, streams, and datagrams, over VIAL/VIPL]

Will He Ever Get to The Point?

• I thought this was about NTFS sequential IO.

• Why is he telling me all this other crap?

It is relevant background

The Actors

• Processor - Memory bus
• Memory
– holds file cache and app data
• Application
– reads and writes memory
• The Disk: writes, stores, reads data
• The Disk Controller:
– manages drive (error handling)
– reads & writes drive
– converts SCSI commands to disk actions
– may buffer or do RAID
• The SCSI bus: carries bytes
• The Host-Bus Adapter:
– protocol converter to system bus
– may do RAID

[Figure: app address space and file cache in memory, memory bus, PCI, adapter, SCSI, controller]

Sequential vs Random IO

• Random IO is typically small IO (8 KB)
– seek + rotate + transfer is ~10 ms
– 100 IO per second
– 800 KB per second
• Sequential IO is typically large IO
– almost no seek (one per cylinder read/written)
– no rotational delay (reading whole disk track)
– runs at MEDIA speed: 8 MB per second
• Sequential is 10x more bandwidth than random!
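The 10x claim is simple arithmetic, sketched here in C. The function names are invented for this example, and the model assumes every random IO pays the full seek + rotate + transfer time while sequential IO runs at the quoted media speed.

```c
#include <assert.h>

/* Throughput of random IO in KB/s: one io_kb-sized request every ms_per_io. */
double random_kb_per_sec(double io_kb, double ms_per_io)
{
    assert(io_kb > 0.0 && ms_per_io > 0.0);
    double ios_per_sec = 1000.0 / ms_per_io;
    return io_kb * ios_per_sec;
}

/* How many times faster sequential IO is than random IO. */
double sequential_advantage(double seq_kb_per_sec,
                            double io_kb, double ms_per_io)
{
    return seq_kb_per_sec / random_kb_per_sec(io_kb, ms_per_io);
}
```

For 8 KB requests at 10 ms each, random_kb_per_sec gives 800 KB/s; against 8,000 KB/s of media speed the advantage is 10x, as the slide states.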

Basic File Concepts

• Buffered:
– File reads/writes go to file cache
– File system does pre-fetch, post write, aggregation
– Data written to disk at file close or LRU or lazy write
– Unbuffered bypasses file cache
• Overlapped:
– requests are pipelined
– completions via events, completion ports
– a simpler alternative to multi-threaded IO
• Temporary Files:
– Files written to cache, not flushed on close.

Experiment Background

• Used Intel/Gateway 2000 G6-200 MHz Pentium Pro
• 64 MB DRAM (4x interleave)
• 32-bit PCI
• Adaptec 2940 Fast-Wide (20 MBps) and Ultra-Wide (40 MBps) controllers
• Seagate 4 GB SCSI disks (fast and ultra)
– (7200 rpm, 7-15 MBps “internal”)
• NT 4.0 SP3, NTFS
• i.e.: modest 1997 technology.
• Not multi-processor, not DEC Alpha, some RAID

Simplest Possible Code

• Error checking adds some more, but still, it's easy

#include <stdio.h>
#include <windows.h>

int main()
{
    const int iREQUEST_SIZE = 65536;              // 64 KB per read request
    char cRequest[iREQUEST_SIZE];                 // request buffer
    unsigned long ibytes;                         // bytes returned by each read

    HANDLE hFile = CreateFile("C:\\input.dat",    // name
        GENERIC_READ,                             // desired access
        0, NULL,                                  // share & security
        OPEN_EXISTING,                            // pre-existing file
        FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_SEQUENTIAL_SCAN,
        NULL);                                    // file template

    while (ReadFile(hFile, cRequest, iREQUEST_SIZE, &ibytes, NULL))  // do read
    {
        if (ibytes == 0) break;                   // break on end of file
        /* do something with the data */
    }

    CloseHandle(hFile);
    return 0;
}

The Best Case: Temp File, NO IO

• Temp file read/write goes to the file system cache
• Program uses a small (in cpu cache) buffer
• So, write/read time is bus move time (3x better than copy)
• Paradox: fastest way to move data is to write then read it
• This hardware is limited to 150 MBps per processor

[Figure: Temp File Read/Write, MBps: temp read 148, temp write 136, memcopy() 54]

Out of the Box Disk File Performance

• One NTFS disk
• Buffered read
• NTFS does 64 KB read-ahead
– if you ask for FILE_FLAG_SEQUENTIAL_SCAN
– or if it thinks you are sequential
• NTFS does 64 KB write-behind
– under the same conditions
– aggregates many small IOs into a few big IOs

Synchronous Buffered Read/Write

• Read throughput is GREAT!
• Write throughput is 40% of read
• WCE is fast but dangerous
• Net: default out of the box performance is good.
• 20 ms/MB ~ 2 instructions/byte!
• CPU will saturate at 50 MBps

[Figure: Out of the Box Throughput: throughput (MB/s), 0 to 10, vs request size (2 to 192 KB), for read, write, and write + WCE]

[Figure: Out of the Box Overhead: overhead (cpu msec/MB), 0 to 80, vs request size (2 to 192 KB), for read, write, and write + WCE]
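The saturation figure follows directly from the overhead: if each MB costs 20 ms of CPU, one processor can move at most 1000 / 20 = 50 MB per second. A minimal C sketch, with hypothetical function names, assuming 1 MB = 10^6 bytes and the 200 MHz clock from the experiment setup:

```c
#include <assert.h>

/* Throughput (MB/s) at which one CPU is 100% busy doing IO. */
double saturation_mbps(double cpu_ms_per_mb)
{
    assert(cpu_ms_per_mb > 0.0);
    return 1000.0 / cpu_ms_per_mb;
}

/* Clock cycles spent per byte moved (assumes 1 MB = 1,000,000 bytes). */
double cycles_per_byte(double clock_hz, double cpu_ms_per_mb)
{
    assert(clock_hz > 0.0 && cpu_ms_per_mb > 0.0);
    double cpu_seconds_per_mb = cpu_ms_per_mb / 1000.0;
    return clock_hz * cpu_seconds_per_mb / 1.0e6;
}
```

At 20 ms/MB on a 200 MHz processor this gives about 4 clocks per byte. How that maps onto the slide's ~2 instructions/byte depends on instructions per clock, which the slides do not state.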

Write Multiples of Cluster Size

• For IOs less than 4 KB, if OVERWRITING data, the file system
– reads the 4 KB page
– then overwrites the bytes
– then writes the page back
• Cuts throughput by 2x - 3x
• So, write in multiples of cluster size.
• 2 KB writes are 5x slower than reads, and 2x or 3x slower than 4 KB writes

[Figure: throughput (MB/s), 0 to 10, vs request size (2 to 192 KB), for read, write, and write + WCE]
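One way to follow this advice is to round each write size up to a whole number of clusters before issuing it. The helpers below are an illustrative sketch with made-up names; they only do the arithmetic and assume the caller pads the buffer out to the rounded size.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero when nbytes is a whole number of clusters. */
int is_cluster_multiple(size_t nbytes, size_t cluster)
{
    assert(cluster > 0);
    return nbytes % cluster == 0;
}

/* Smallest multiple of the cluster size that is >= nbytes. */
size_t round_up_to_cluster(size_t nbytes, size_t cluster)
{
    assert(cluster > 0);
    return ((nbytes + cluster - 1) / cluster) * cluster;
}
```

With the 4 KB cluster size used here, a 2 KB write would be padded to 4096 bytes and a 4097-byte write to 8192 bytes. The same rounding applies to the sector alignment that unbuffered IO requires.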

What is WCE?

• Write Cache Enable lets the disk controller respond “yes” before data is on disk.
• Dangerous
– If power fails, WCE can destroy data integrity
– Most RAID controllers have non-volatile RAM, which makes WCE safe (invisible) if they do RESET right.
• About 50% of the disks we see have WCE on. You can turn it off with 3rd party SCSI utilities.
• As seen later: 3-deep request buffering gets similar performance.

Synchronous Un-Buffered Read/Write

• Reads do well above 2 KB
• Writes are terrible
• WCE helps writes
• Ultra media is 1.5x faster
• 1/2 power point
– Read: 4 KB
– Write: 64 KB with no WCE, 4 KB with WCE

[Figure: Unbuffered Throughput: throughput (MB/s), 0 to 10, vs request size (2 to 192 KB), for ultra read, fast read, ultra write, fast write]

[Figure: WCE Unbuffered Write Throughput: throughput (MB/s), 0 to 10, for fast write WCE and ultra write WCE]

Cost of Un-Buffered IO

• Saves buffer memory copy.
• Was 20 ms/MB, now 2 ms/MB
• Cost/request ~ 120 µs (wow)
• Note: unbuffered must be sector aligned.
• Buffered:
– saturates CPU at 50 MB/s
• Unbuffered:
– saturates CPU at 500 MB/s

[Figure: CPU milliseconds per MB: cost (ms/MB), log scale 1 to 100, vs request size (2 to 192 KB)]

[Figure: CPU Utilization: cost (CPU%), 0% to 35%; cpu idle because non-WCE writes are so slow]

[Figure: CPU milliseconds per Request: cost (ms/request), 0.10 to 0.30, for fast read, ultra read, fast write, ultra write, ultra write WCE, fast write WCE]
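The 2 ms/MB figure is consistent with the per-request cost: at 64 KB per request there are 16 requests per MB, and 16 x 120 µs is about 1.9 ms. A small sketch of that arithmetic, with a hypothetical function name, assuming the per-request cost stays constant across request sizes (the chart suggests it actually varies between roughly 0.10 and 0.30 ms):

```c
#include <assert.h>

/* CPU cost per MB (in ms) when every request costs a fixed number of
   microseconds, with 1 MB taken as 1024 KB. */
double cpu_ms_per_mb(double request_kb, double cost_us_per_request)
{
    assert(request_kb > 0.0 && cost_us_per_request >= 0.0);
    double requests_per_mb = 1024.0 / request_kb;
    return requests_per_mb * cost_us_per_request / 1000.0;
}
```

Under this model small requests are expensive: at 2 KB per request the same 120 µs works out to more than 60 ms of CPU per MB.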

Summary

• Out of the box
– Read RAP ~ PAP (thanks NTFS)
– Write RAP ~ PAP/10 … PAP/2
• Buffering small IO is great!
• Buffering large IO is expensive
• WCE is a dangerous way out, but frequently used.
• Parallelism tricks:
– deep requests (async, overlap)
– striping (raid0, raid5)
– allocation and other tricks

[Figure: Out of the Box Overhead: cpu cost, 0 to 60, vs request size (2 to 192 KB), for buffered read, buffered write, buffered write + WCE, and unbuffered read, write, write + WCE]

[Figure: throughput (MB/s), 0 to 10, vs request size (2 to 192 KB), for un-buffered read & write and FS buffered read & write]

[Figure: WCE Out of Box Throughput: throughput, 0 to 10, vs request size (2 to 192 KB), for un-buffered write and buffered write]

Bottleneck Analysis

• Drawn to linear scale

[Figure: bandwidth of each stage, drawn to linear scale]
Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
Memory read/write: ~150 MBps
MemCopy: ~50 MBps
Disk R/W: ~9 MBps

Kinds of Parallel Execution

• Pipeline: any sequential step feeds another sequential step
• Partition: outputs split N ways, inputs merge M ways, with a copy of the sequential step on each partition

[Figure: pipelined and partitioned arrangements of sequential steps]

Pipeline Requests to One Disk

• Pipeline (async, overlap) IO is a BIG win (RAP ~ 85% PAP)
• Does not help reads much
– They were already pipelined by the disk controller
• Helps writes a LOT
– Above 16 KB, 3-deep matches WCE

[Figure: Read Throughput - 1 Fast Disk, Various Request Depths: throughput (MB/s), 0 to 10]

[Figure: Write Throughput - 1 Fast Disk, Various Request Depths: throughput (MB/s), 0 to 10, for WCE, 1 buffer, 3 buffers, 8 buffers]

Parallel Access To Data?

• 1 Terabyte at 10 MB/s: 1.2 days to scan
• 1 Terabyte, 1,000 x parallel (10 GB/s of bandwidth): 100 second SCAN
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
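The scan times above are just size divided by aggregate bandwidth. A minimal C sketch, with an invented function name, assuming the data is spread evenly and the disks scale perfectly:

```c
#include <assert.h>

/* Seconds to scan total_bytes when ndisks disks each deliver
   bytes_per_sec_per_disk in parallel (perfect scaling assumed). */
double scan_seconds(double total_bytes, double bytes_per_sec_per_disk,
                    int ndisks)
{
    assert(bytes_per_sec_per_disk > 0.0 && ndisks > 0);
    return total_bytes / (bytes_per_sec_per_disk * ndisks);
}
```

One disk at 10 MB/s needs 100,000 seconds for 10^12 bytes, which is about 1.2 days; 1,000 such disks need 100 seconds.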

Pipeline Access: Stripe Across 4 disks

• Stripes NEED pipeline
• 3-deep is good enough
• Saturate at 15 MBps
• 8-deep pipeline matches WCE

[Figure: Write 4 Disk Stripes, Throughput vs Request Depth: throughput (MB/s), 0 to 20, for WCE, 1 buffer, 3 buffers, 8 buffers]

[Figure: Read 4 Disk Stripes, Throughput vs Request Depth: throughput (MB/s), 0 to 20]

3 Stripes and You're Out!

• 3 disks can saturate the adapter
• Similar story with UltraWide
• CPU time goes down with request size
• Ftdisk (striping) is cheap

[Figure: Read Throughput vs Stripes - 3 deep Fast: throughput (MB/s), 0 to 20, for 1, 2, 3, and 4 disks]

[Figure: Write Throughput vs Stripes - 3 deep Fast: throughput (MB/s), 0 to 20, for 1, 2, 3, and 4 disks]

[Figure: CPU milliseconds per MB: cost (CPU ms/MB), log scale 1 to 100, vs request size (2 to 192 KB)]

Parallel SCSI Busses Help

• Second SCSI bus nearly doubles read and WCE throughput
• Write needs deeper buffers
• Experiment is unbuffered (3-deep + WCE)

[Figure: One or Two SCSI Busses: throughput (MB/s), 0 to 25, vs request size (2 to 192 KB), for read, write, and WCE on 1 bus and on 2 busses; 2 busses give about 2x]

File System Buffering & Stripes (UltraWide Drives)

• FS buffering helps small reads
• FS buffered writes peak at 12 MBps
• 3-deep async helps
• Write peaks at 20 MBps
• Read peaks at 30 MBps

[Figure: Three Disks, 1 Deep: throughput (MB/s), 0 to 35, for FS read, read, FS write WCE, write WCE]

[Figure: Three Disks, 3 Deep: throughput (MB/s), 0 to 35]

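PAP vs RAP

• Reads are easy, writes are hard
• Async write can match WCE.

[Figure: PAP and RAP at each stage of the path between application data, file system, and disks]
Stage: PAP, RAP
System bus: 422 MBps, 142 MBps
PCI: 133 MBps, 72 MBps
SCSI: 40 MBps, 31 MBps
Disks: 10-15 MBps, 9 MBps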
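To compare the stages against the half-power goal, the ratio RAP/PAP can be computed per stage. This is a sketch with hypothetical function names; pairing each measured number with its advertised number is our reading of the figure labels.

```c
#include <assert.h>

/* Fraction of peak advertised performance that is actually delivered. */
double rap_over_pap(double rap_mbps, double pap_mbps)
{
    assert(pap_mbps > 0.0);
    return rap_mbps / pap_mbps;
}

/* Nonzero when RAP reaches the half-power point (RAP >= PAP / 2). */
int at_half_power(double rap_mbps, double pap_mbps)
{
    return rap_over_pap(rap_mbps, pap_mbps) >= 0.5;
}
```

Under that pairing, the PCI (72 of 133), SCSI (31 of 40), and disk (9 of 10-15) stages are past the half-power point, while memory (142 of 422) is at about one third.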

Bottleneck Analysis

• NTFS Read/Write: 9 disks, 2 SCSI busses, 1 PCI
– ~65 MBps unbuffered read
– ~43 MBps unbuffered write
– ~40 MBps buffered read
– ~35 MBps buffered write

[Figure: memory read/write ~150 MBps; PCI ~70 MBps; two adapters at ~30 MBps each; 70 MBps into memory]

Hypothetical Bottleneck Analysis

• NTFS Read/Write: 12 disks, 4 SCSI busses, 2 PCI (not measured; we had only one PCI bus available, the 2nd one was “internal”)
– ~120 MBps unbuffered read
– ~80 MBps unbuffered write
– ~40 MBps buffered read
– ~35 MBps buffered write

[Figure: memory read/write ~150 MBps; two PCI busses at ~70 MBps each; four adapters at ~30 MBps each; 120 MBps into memory]
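The hypothetical numbers can be reproduced by taking the minimum of the aggregate bandwidth at each stage. The sketch below uses a made-up Stage type and function name, and it assumes stages add up linearly and the disks themselves are not the limit.

```c
#include <assert.h>
#include <stddef.h>

/* One stage of the IO path: how many units, and MBps per unit. */
typedef struct {
    const char *name;    /* label for the reader only */
    int count;
    double mbps_each;
} Stage;

/* Returns the smallest aggregate bandwidth across all stages and,
   if limiting is not NULL, stores the index of that stage. */
double bottleneck_mbps(const Stage *stages, size_t n, size_t *limiting)
{
    assert(stages != NULL && n > 0);
    size_t worst = 0;
    double lowest = stages[0].count * stages[0].mbps_each;
    for (size_t i = 1; i < n; i++) {
        double capacity = stages[i].count * stages[i].mbps_each;
        if (capacity < lowest) {
            lowest = capacity;
            worst = i;
        }
    }
    if (limiting != NULL)
        *limiting = worst;
    return lowest;
}
```

With four adapters at ~30 MBps, two PCI busses at ~70 MBps, and memory at ~150 MBps, the adapters limit the system to ~120 MBps, the slide's unbuffered read estimate.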

Stripes, Mirrors, Parity (RAID 0, 1, 5)

• RAID 0: Stripes
– bandwidth
– block layout on three disks: 0,3,6,.. | 1,4,7,.. | 2,5,8,..
• RAID 1: Mirrors, Shadows,…
– Fault tolerance
– Reads faster, writes 2x slower
– block layout on two disks: 0,1,2,.. | 0,1,2,..
• RAID 5: Parity
– Fault tolerance
– Reads faster
– Writes 4x or 6x slower
– block layout on three disks: 0,2,P2,.. | 1,P1,4,.. | P0,3,5,..
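The block layouts above can be written as small mapping functions. This is a sketch with invented names that reproduces the three-disk layouts in the figure; real arrays map chunks rather than single blocks and may rotate parity differently.

```c
#include <assert.h>

/* RAID 0: block b lives on disk b mod ndisks (0,3,6,.. on disk 0). */
int raid0_disk(int block, int ndisks)
{
    assert(block >= 0 && ndisks > 0);
    return block % ndisks;
}

/* RAID 5: parity for a stripe row rotates from the last disk to the first. */
int raid5_parity_disk(int row, int ndisks)
{
    assert(row >= 0 && ndisks > 1);
    return ndisks - 1 - (row % ndisks);
}

/* RAID 5: data blocks fill the non-parity disks of each row in order. */
int raid5_data_disk(int block, int ndisks)
{
    assert(block >= 0 && ndisks > 1);
    int row = block / (ndisks - 1);     /* ndisks - 1 data blocks per row */
    int index = block % (ndisks - 1);   /* position among the data disks */
    int parity = raid5_parity_disk(row, ndisks);
    return index < parity ? index : index + 1;   /* skip the parity disk */
}
```

For three disks this places blocks 0 and 2 on disk 0, blocks 1 and 4 on disk 1, and blocks 3 and 5 on disk 2, with parity P0, P1, P2 on disks 2, 1, 0, as in the figure.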

Where To Do RAID?

• RAID in host (= NT)
– no special hardware
– FtDisk is responsible for data integrity
– can stripe across multiple busses/adapters
• RAID in adapter
– gets safe WCE if its cache is non-volatile
– offloads host
– not good for WolfPack
• RAID in disk controller
– gets safe WCE if its cache is non-volatile
– offloads host
– best data integrity for MSCS

NT Host-Based Striping is OK

• 3 Ultra disks per stripe.
• WCE is enabled in all cases
• Requests are 3-deep

[Figure: Striping Read Throughput: throughput (MB/s), 0 to 35, vs request size (2 to 128 KB), for controller-based, host-based, and array-based striping]

[Figure: Striping Write Throughput: throughput (MB/s), 0 to 35, vs request size (2 to 128 KB)]

Surprise: Good NT RAID5 Performance

• Ignores read performance in the case of disk fault.
• Above 32 KB requests, CPU write cost is significant.
• At 8 KB, performance is similar
• Write performance is bad in all cases.

[Figure: RAID5 Throughput vs Request Depth: throughput (MB/s), 0 to 35, for read and write]

[Figure: RAID5 CPU milliseconds per MB: log scale 1 to 100, for array read, array write, host read, host write]

Controller & Adapters are Complex

• Min response time 300 µs
• Typical 1 ms for 8 KB
• Many strange effects (e.g. Ultra cache is busted).

[Figure: Elapsed time vs Request Size, Controller Cache vs Controller Prefetch: elapsed time (ms), log scale 0.1 to 10, vs request size, 0 to 70, for ultra, fast, and narrow, each cached and prefetch]

Bus Overhead Grows

• Small requests (8 KB) are more than 1/2 overhead.
• 3x more disks means 5x more overhead

[Figure: SCSI Overhead Grows with Disks: SCSI bus utilization, 0% to 90%, split into overhead and data, for 1 disk 8 KB, 1 disk 64 KB, 2 disks 64 KB, 3 disks 64 KB; bar labels as extracted: 31%, 3%, 11%, 18%, 27%, 27%, 56%, 80%]

Allocate/Extend Suppresses Async Writes

• When you allocate space, NT zeros it (both DRAM and disk)
• Prevents others from reading data you “delete”
• This “kills” pipeline writes.
• Solution: pre-allocate or reuse files whenever you can.
• Do VERY large writes.

[Figure: Allocate/Extend While Writing: throughput (MB/s), 0 to 20, for 4-disk write 8-deep no-extend and 1-disk write 8-deep no-extend; with extend, 1-deep equals 8-deep]

Stripe Alignment: Chunk vs Cluster

• Stripe has chunk size (64 KB)
• Volume has cluster size
– default is 4 KB (for big disks).
• When unaligned, a 64 KB read becomes two reads: 4 KB and 60 KB
• Twice as many physical requests.

[Figure: Alignment, 4-disk (ultra), 3-deep: throughput (MB/s), 0 to 35, vs request size (2 to 192 KB), for unaligned read, aligned read, aligned write, unaligned write]

[Figure: aligned requests cover whole 64 KB chunks; an unaligned 64 KB request covers 4 KB of one chunk and 60 KB of the next]
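The splitting of an unaligned request can be sketched as follows. The function name is hypothetical, and the example assumes the request starts 60 KB into a 64 KB chunk, which is one offset that yields the 4 KB + 60 KB split described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits [offset, offset + length) at stripe-chunk boundaries.
   Writes each physical request size into pieces[] and returns the
   number of pieces (at most max_pieces). */
size_t split_on_chunks(uint64_t offset, uint64_t length, uint64_t chunk,
                       uint64_t *pieces, size_t max_pieces)
{
    assert(chunk > 0 && pieces != NULL);
    size_t n = 0;
    while (length > 0 && n < max_pieces) {
        uint64_t room = chunk - (offset % chunk);   /* bytes left in this chunk */
        uint64_t take = length < room ? length : room;
        pieces[n++] = take;
        offset += take;
        length -= take;
    }
    return n;
}
```

A 64 KB request at offset 0 stays one physical request, while the same request at offset 60 KB becomes two: 4 KB on one disk and 60 KB on the next.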

Other Issues

• Multi-processor
• DEC Alpha
• Memory mapped files
• Fragmentation
• Ultra-2, Merced, FC,…
• NT5
– Veritas volume manager
– 64-bit
– performance improvements
– I2O,...

Summary

• Read is easy, write is hard
– SCSI & FS read prefetch works
– Read RAP ~ 0.8 PAP
– Write RAP ~ 0.05 PAP to 0.8 PAP
• NTFS buffering is good for small IOs
– coalesces into 64 KB requests
• Bigger is better: 8 KB ok, 64 KB best
• Deep requests help
– 3-deep is good, 8-deep is better
• WCE is fast but dangerous
– 3-deep writes approximate WCE for > 8 KB requests
• 3 disks can saturate a SCSI bus, either Fast-Wide (15 MBps) or Ultra-Wide (31 MBps)
• Memory speed is the ultimate limit with multiple disks and multiple PCI busses
– 50 MBps copy, 150 MBps r/w
• Avoid FS buffering above 16 KB
– costs 20 ms/MB of cpu
• Preallocate & reuse files when possible
– avoids Allocate/Extend synchronous IO
• Software RAID5 performs well
– but fault tolerance is a problem
– writes are expensive in any case

Pitfalls

• Read-before-write: 2 KB buffered IO
• Allocate/Extend: synchronous write
• Zoned disks => 50% speed bump
• RAID alignment => 20% speed bump

More Details at

• Web site has
– Paper
– Sample code
– Test program we used
– These slides
• http://research.Microsoft.com/BARC/Sequential_IO/
