Storage devices and disk arrays
-
Upload
networksguy -
Category
Documents
-
view
1.818 -
download
1
Transcript of Storage devices and disk arrays
ECE 6160: Advanced Computer Networks
Introduction to Storage Devices
Instructor: Dr. Xubin (Ben) He
Email: [email protected]
Tel: 931-372-3462
ECE6160:Advanced Computer Networks
2
Prev…
• Gigabit Ethernet– MAC
– PHY
• VLAN
ECE6160:Advanced Computer Networks
3
Anatomy: Key components for a typical computer
4ECE6160:Advanced Computer Networks
Motivation: Who Cares About I/O?
• CPU Performance: 60% per year• I/O system performance limited by mechanical delays
(disk I/O)< 10% per year (IO per sec)
• Amdahl's Law: system speed-up limited by the slowest part!
10% IO & 10x CPU => 5x Performance (lose 50%)10% IO & 100x CPU => 10x Performance (lose 90%)
• I/O bottleneck: Diminishing fraction of time in CPUDiminishing value of faster CPUs
ECE6160:Advanced Computer Networks
5
I/O Systems
Processor
Cache
Memory - I/O Bus
MainMemory
I/OController
Disk Disk
I/OController
I/OController
Graphics Network
interruptsinterrupts
ECE6160:Advanced Computer Networks
6
Storage Technology Drivers
• Driven by the prevailing computing paradigm– 1950s: migration from batch to on-line processing
– 1990s: migration to ubiquitous computing
» computers in phones, books, cars, video cameras, …
» nationwide fiber optical network with wireless tails
• Effects on storage industry:– Embedded storage
» smaller, cheaper, more reliable, lower power
– Data utilities
» high capacity, hierarchically managed storage
ECE6160:Advanced Computer Networks
7
Outline
• Disk Basics/Disk History
• Current Disk options
• Disk fallacies and performance
• FLASH
• Tapes
• RAID
8
Disk Device Terminology
• Several platters, with information recorded magnetically on both surfaces (usually)
• Actuator moves head (end of arm,1/surface) over track (“seek”), select surface, wait for sector rotate under head, then read or write
– “Cylinder”: all tracks under heads
• Bits recorded in tracks, which in turn divided into sectors (e.g., 512 Bytes)
Platter
OuterTrack
InnerTrackSector
Actuator
HeadArm
ECE6160:Advanced Computer Networks
9
Tracks, Sectors, and Cylinders
A cylinder is comprised of the set of tracks described by all the heads (on separate platters) at a single seek position
ECE6160:Advanced Computer Networks
10
Photo of Disk Head, Arm, Actuator
Actuator
ArmHead
Platters (12)
{Spindle
source: www.seagate.com
11ECE6160:Advanced Computer Networks
Disk Device Performance
Platter
Arm
Actuator
HeadSectorInnerTrack
OuterTrack
• Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Overhead
• Seek Time? depends on tracks move arm, seek speed of disk
• Rotation Time? depends on speed disk rotates, how far sector is from head
• Transfer Time? depends on data rate (bandwidth) of disk (bit density), size of request
ControllerSpindle
12ECE6160:Advanced Computer Networks
Disk Device Performance
• Average rotation time:– Average distance sector from head?– 1/2 time of a rotation
» 10000 Revolutions Per Minute 166.67 Rev/sec
» 1 revolution = 1/ 166.67 sec 6.00 milliseconds
» 1/2 rotation (revolution) 3.00 ms
• Average seek time?– Sum all possible seek distances
from all possible tracks / # possible
» Assumes average seek distance is random
– Disk industry standard benchmark
– About 10ms (ATA) and 5ms (SCSI)
ECE6160:Advanced Computer Networks
13
Data Rate: Inner vs. Outer Tracks
• To keep things simple, orginally kept same number of sectors per track
– Since outer track longer, lower bits per inch
• Competition decided to keep BPI the same for all tracks (“constant bit density”)
More capacity per disk
More of sectors per track towards edge
Since disk spins at constant speed, outer tracks have faster data rate
• Bandwidth outer track 1.7X inner track!– Inner track highest density, outer track lowest, so not really constant
– 2.1X length of track outer / inner, 1.7X bits outer / inner
14ECE6160:Advanced Computer Networks
Disk Performance Model /Trends
• Capacity+ 100%/year (2X / 1.0 yrs)
• Transfer rate (BW)+ 40%/year (2X / 2.0 yrs)
• Rotation + Seek time– 8%/ year (1/2 in 10 yrs)
• MB/$> 100%/year (2X / 1.0 yrs)
60GB/$300 => $5/GB =>0.5c/MB
ECE6160:Advanced Computer Networks
15
State of the Art: Barracuda 180
– 181.6 GB, 3.5 inch disk
– 12 platters, 24 surfaces
– 24,247 cylinders
– 7,200 RPM;
– 7.4/8.2 ms avg. seek (r/w)
– 64 to 35 MB/s (internal)
– 0.1 ms controller time
– 10.3 watts (idle)
source: www.seagate.com
Latency = Queuing Time + Controller time +Seek Time + Rotation Time + Size / Bandwidth
per access
per byte{+
Sector
Track
Cylinder
Head PlatterArmTrack Buffer
Q:Calculate time to read 64 KB (128 sectors) for Barracuda 180, sector is on outer track.
ECE6160:Advanced Computer Networks
16
Disk Performance Example
• Calculate time to read 64 KB (128 sectors) for Barracuda 180 X using advertised performance; sector is on outer track
Disk latency = average seek time + average rotational delay + transfer time + controller overhead
= 7.4 ms + 0.5 * 1/(7200 RPM) + 64 KB / (64 MB/s) + 0.1 ms
= 7.4 ms + 0.5 /(7200 RPM/(60000ms/M)) + 64 KB / (64 KB/ms) + 0.1 ms
= 7.4 + 4.2 + 1.0 + 0.1 ms = 12.7 ms
ECE6160:Advanced Computer Networks
17
Areal Density: linear density x track density
• Bits recorded along a track– Metric is Bits Per Inch (BPI) along a track
• Number of tracks per surface– Metric is Tracks Per Inch (TPI)
• Disk Designs Brag about bit density per unit area– Metric is Bits Per Square Inch
– Called Areal Density
– Areal Density = BPI x TPI
ECE6160:Advanced Computer Networks
18
Areal DensityYear Areal Density
1973 1.71979 7.71989 631997 30902000 17100
1
10
100
1000
10000
100000
1970 1980 1990 2000
Year
Are
al D
ensity
– Areal Density = BPI x TPI
– Change slope 30%/yr to 60%/yr about 1991
ECE6160:Advanced Computer Networks
19
Historical Perspective
• 1956 IBM Ramac — early 1970s Winchester– Developed for mainframe computers, proprietary interfaces– Steady shrink in form factor: 27 in. to 14 in
• Form factor and capacity drives market, more than performance• 1970s: Mainframes 14 inch diameter disks• 1980s: Minicomputers,Servers 8”,5 1/4” diameter• PCs, workstations Late 1980s/Early 1990s:
– Mass market disk drives become a reality» industry standards: SCSI, IPI, IDE
– Pizzabox PCs 3.5 inch diameter disks– Laptops, notebooks 2.5 inch disks– Palmtops didn’t use disks,
so 1.8 inch diameter disks didn’t make it
• 2000s:– 1 inch for cameras, cell phones?
20
Disk History
Data densityMbit/sq. in.
Capacity ofUnit ShownMegabytes
1973:1. 7 Mbit/sq. in140 MBytes
1979:7. 7 Mbit/sq. in2,300 MBytes
source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”
21
Disk History
1989:63 Mbit/sq. in60,000 MBytes
1997:1450 Mbit/sq. in2300 MBytes
source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even mroe data into even smaller spaces”
1997:3090 Mbit/sq. in8100 MBytes
22
1 inch disk drive!
• 2003 IBM MicroDrive:– 1.7” x 1.4” x 0.2”
– 1 GB, 3600 RPM, 13.3 MB/s, 15 ms seek
– Digital camera, PalmPC?
• 2003 USB mini flash drive– 75mmx28mmx12mm, 17g
– 256MB
– 0.5-1MB/s
• 2006 MicroDrive?– 9 GB, 50 MB/s!
» Assuming it finds a niche in a successful product
» Assuming past trends continue
23
Disk Characteristics in 2003
Seagate Cheetah3146807LW
Hitachi TravelStar 7K60
IBM Microdrive (07N5574)
Dimension 3.5”, 1.5 lbs 2.5”, 0.2 lbs 1.7”, 0.6 lbs
Interface Ultra 320 SCSI ATA-100 (Ultra) CF+
Capacity 146.8GB 60GB 1GB
Data Transfer Rate
320MB/s 100MB/s 13.3MB/s
Average seek time
4.7ms 10ms 15ms
Spindle Speed 10000RPM 7200RPM 3600RPM
Buffer 8MB 8MB 128KB
Price (warehouse.com)
$899.95 $299.95 $319.95
24
Disk Characteristics in 2007
Seagate CheetahST3146707LW
Hitachi TravelStar 7K60
Seagate Barracuda ESST3750640NS
Dimension 3.5” 2.5”, 0.2 lbs 3.5”
Interface Ultra 320 SCSI ATA-100 (Ultra) Serial ATA-300
Capacity 146.8GB 60GB 750GB
Data Transfer Rate
320MB/s 100MB/s
Average seek time
4.7ms 10ms
Spindle Speed 10000RPM 7200RPM 7200RPM
Buffer 8MB 8MB 16MB
Price (warehouse.com)
$259 $93.99 (80GB) $319.95
ECE6160:Advanced Computer Networks
25
Today’s Hard Drive (09/29/2008)(warehouse.com)
• Seagate Barracuda 7200.11 - hard drive - 750 GB - SATA-300 Hard drive - 750 GB - internal - 3.5" - SATA-300 - 7200 rpm - buffer: 32 MB, $116
• Seagate FreeAgent Pro - hard drive - 500 GB - FireWire/Hi-Speed USB/eSATA Hard drive – 500 GB – external – 3.5” – USB 2.0, eSATA-300, FireWire interfaces – 7200 rpm – buffer: 32 MB, $100
ECE6160:Advanced Computer Networks
26
Fallacy: Use Data Sheet “Average Seek” Time
• Manufacturers needed standard for fair comparison (“benchmark”)
– Calculate all seeks from all tracks, divide by number of seeks => “average”
• Real average would be based on how data laid out on disk, where seek in real applications, then measure performance
– Usually, tend to seek to tracks nearby, not to random track
• Rule of Thumb: observed average seek time is typically about 1/4 to 1/3 of quoted seek time (i.e., 3X-4X faster)
– Barracuda 180 X avg. seek: 7.4 ms 2.5 ms
ECE6160:Advanced Computer Networks
27
Fallacy: Use Data Sheet Transfer Rate
• Manufacturers quote the speed off the data rate off the surface of the disk.
• Sectors contain an error detection and correction field (can be 20% of sector size) plus sector number as well as data
• There are gaps between sectors on track
• Rule of Thumb: disks deliver about 3/4 of internal media rate (1.3X slower) for data
• For example, Barracuda 180X quotes 64 to 35 MB/sec internal media rate
47 to 26 MB/sec external data rate (74%)
ECE6160:Advanced Computer Networks
28
Disk Performance Example: revision
• Calculate time to read 64 KB for UltraStar 72 again, this time using 1/3 quoted seek time, 3/4 of internal outer track bandwidth; (12.7 ms before)
Disk latency = average seek time + average rotational delay + transfer time + controller overhead
= (0.33 * 7.4 ms) + 0.5 * 1/(7200 RPM) + 64 KB / (0.75 * 65 MB/s) + 0.1 ms
= 2.5 ms + 0.5 /(7200 RPM/(60000ms/M)) + 64 KB / (47 KB/ms) + 0.1 ms
= 2.5 + 4.2 + 1.4 + 0.1 ms = 8.2 ms (64% of 12.7)
ECE6160:Advanced Computer Networks
29
Future Disk Size and Performance
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr).
• Slow improvement in seek, rotation (8%/yr).
• Time to read whole disk.
Year Sequentially Randomly (1 sector/seek)
1990 4 minutes 6 hours
2000 12 minutes 1 week(!)
• 3.5” form factor make sense in 5 yrs?– What is capacity, bandwidth, seek time, RPM?
– Assume today 80 GB, 100 MB/sec, 6 ms, 10000 RPM
» 838GB, 537MB/s, 4ms, 15000RPM
ECE6160:Advanced Computer Networks
30
What about FLASH
• Compact Flash Cards– Intel Strata Flash
» 16 Mb in 1 square cm. (.6 mm thick)– 100,000 write/erase cycles.– Standby current = 100uA, write = 45mA– Compact Flash 256MB~=$120 512MB~=$542– Transfer @ 3.5MB/s
• IBM Microdrive 1G~370– Standby current = 20mA, write = 250mA– Efficiency advertised in wats/MB
• VS. Disks– Nearly instant standby wake-up time– Random access to data stored– Tolerant to shock and vibration (1000G of operating shock)
ECE6160:Advanced Computer Networks
31
Disk Operations
• Seek: move head to track;
• Rotation: wait for sector under head;
• Transfer:Move data to/from disks;
• Overhead: – Controller delay
– Queuing delay
Access time = seek time + rotational delay + transfer time+overhead
ECE6160:Advanced Computer Networks
32
Improving disk performance.
• Use large sectors to improve bandwidth
• Use track caches and read ahead;– Read entire track into on-controller cache
– Exploit locality (improves both latency and BW)
• Design file systems to maximize locality– Allocate files sequentially on disks (exploit track cache)
– Locate similar files in same cylinder (reduce seeks)
– Locate simlar files in near-by cylinders (reduce seek distance)
• Pack bits closer together to improve transfer rate and density.
• Use a collection of small disks to form a large, high performance one--->disk array
Stripping data across multiple disks to allow parallel I/O, thus improving performance.
ECE6160:Advanced Computer Networks
33
Use Arrays of Small Disks?
14”10”5.25”3.5”
3.5”
Disk Array: 1 disk design
Conventional: 4 disk designs
Low End High End
•Katz and Patterson asked in 1987: •Can smaller disks be used to close gap in performance between disks and CPUs?
34
Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)
Capacity
Volume
Power
Data Rate
I/O Rate
MTTF
Cost
IBM 3390K
20 GBytes
97 cu. ft.
3 KW
15 MB/s
600 I/Os/s
250 KHrs
$250K
IBM 3.5" 0061
320 MBytes
0.1 cu. ft.
11 W
1.5 MB/s
55 I/Os/s
50 KHrs
$2K
x70
23 GBytes
11 cu. ft.
1 KW
110 MB/s
3900 IOs/s
??? Hrs
$150K
Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?
9X
3X
8X
6X
ECE6160:Advanced Computer Networks
35
Array Reliability
•MTTF: Mean Time To Failure: average time that a non - repairable component will operate before experiencing failure.
•Reliability of N disks = Reliability of 1 Disk ÷N
50,000 Hours ÷ 70 disks = 700 hours Disk system MTTF: Drops from 6 years to 1 month!
• Arrays without redundancy too unreliable to be useful!
Solution: redundancy.
ECE6160:Advanced Computer Networks
36
Redundant Arrays of (Inexpensive) Disks
• Replicate data over several disks so that no data will be lost if one disk fails.
• Redundancy yields high data availability– Availability: service still provided to user, even if some components
failed
• Disks will still fail• Contents reconstructed from data redundantly stored
in the array Capacity penalty to store redundant info
Bandwidth penalty to update redundant info
ECE6160:Advanced Computer Networks
37
ECE6160:Advanced Computer Networks
38
Levels of RAID
• Original RAID paper described five categories (RAID levels 1-5). (Patterson et al, “A case for redundant arrays of inexpensive disks (RAID)”, ACM SIGMOD, 1988)
• Disk striping with no redundant now is called RAID0 or JBOD(Just a bunch of disks).
• Other kinds have been proposed in literature,Level 6 (P+Q Redundancy), Level 10, RAID53, etc.
• Except RAID0, all the RAID levels trade disk capacity for reliability, and the extra reliability makes parallism a practical way to improve performance.
ECE6160:Advanced Computer Networks
39
RAID 0: Nonredundant (JBOD)
file data block 1block 0 block 2 block 3
Disk 1Disk 0 Disk 2 Disk 3
• High I/O performance.•Data is not save redundantly.•Single copy of data is striped across multiple disks.
•Low cost.•Lack of redundancy.
•Least reliable: single disk failure leads to data loss.
ECE6160:Advanced Computer Networks
40
RAID Level 0 requires a minimum of 2 drives to implement
RAID 0: Striped Disk Array without Fault Tolerance
41
Redundant Arrays of Inexpensive DisksRAID 1: Disk Mirroring/Shadowing
• Each disk is fully duplicated onto its “mirror” Very high availability can be achieved• Bandwidth sacrifice on write: Logical write = two physical writes
• Reads may be optimized, minimize the queue and disk search time
• Most expensive solution: 100% capacity overhead
recoverygroup
Targeted for high I/O rate , high availability environments
ECE6160:Advanced Computer Networks
42
RAID 1: Mirroring and Duplexing
For Highest performance, the controller must be able to perform two concurrent separate Reads per mirrored pair or two duplicate Writes per mirrored pair.
RAID Level 1 requires a minimum of 2 drives to implement
43
RAID 4: Block Interleaved Parity
block 0
block 4
block 8
block 12
block 1
block 5
block 9
block 13
block 2
block 6
block 10
block 14
block 3
block 7
block 11
block 15
P(0-3)
P(4-7)
P(8-11)
P(12-15)
•Blocks: striping units •Allow for parallel access by multiple I/O requests, high I/O rates • Doing multiple small reads is now faster than before. (allows small read requests to be restricted to a single disk).• Large writes(full stripe), update the parity:
P’ = d0’ + d1’ + d2’ + d3’; • Small writes(eg. write on d0), update the parity:
P = d0 + d1 + d2 + d3P’ = d0’ + d1 + d2 + d3 = P + d0’ + d0;
• However, writes are still very slow since the parity disk is the bottleneck.
ECE6160:Advanced Computer Networks
44
RAID 4: Independent Data disks with shared Parity disk
Each entire block is written onto a data disk. Parity for same rank blocks is generated on Writes, recorded on the parity disk and checked on Reads.
RAID Level 4 requires a minimum of 3 drives to implement
45
Inspiration for RAID 5
• RAID 4 works well for small reads• Small writes (write to one disk):
– Option 1: read other data disks, create new sum and write to Parity Disk
– Option 2: since P has old sum, compare old data to new data, add the difference to P
• Small writes are limited by Parity Disk: Write to D0, D5 both also write to P disk. Parity disk must be updated for every write operation!
D0 D1 D2 D3 P
D4 D5 D6 PD7
46
Redundant Arrays of Inexpensive Disks RAID 5: High I/O Rate Interleaved Parity
Independent writespossible because ofinterleaved parity
Independent writespossible because ofinterleaved parity
D0 D1 D2 D3 P
D4 D5 D6 P D7
D8 D9 P D10 D11
D12 P D13 D14 D15
P D16 D17 D18 D19
D20 D21 D22 D23 P
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.Disk Columns
IncreasingLogical
Disk Addresses
Example: write to D0, D5 uses disks 0, 1, 3, 4
47
RAID 5: Block Interleaved Distributed-Parity
block 0
block 4
block 8
block 12
P(16-19)
block 1
block 5
block 9
P(12-15)
block 16
block 2
block 6
P(8-11)
block 13
block 17
block 3
P(4-7)
block 10
block 14
block 18
P(0-3)
block 7
block 11
block 15
block 19
• Parity disk = (block number/4) mod 5 • Eliminate the parity disk bottleneck of RAID 4• Best small read, large read and large write performance• Can correct any single self-identifying failure• Small logical writes take two physical reads and two physical writes.• Recovering needs reading all nonfailed disks
Left Symmetric Distribution
ECE6160:Advanced Computer Networks
48
RAID 5: Independent Data disks with distributed parity blocks
Each entire data block is written on a data disk; parity for blocks in the same rank is generated on Writes, recorded in a distributed location and checked on Reads.
RAID Level 5 requires a minimum of 3 drives to implement
49
Comparison of RAID Levels (N disks, each with capacity of C)
RAID level
Technique Capacity Advantage Disadvantage
0 Striping NxC Maximum data transfer rate and sizes
No redundancy
1 Mirroring (N/2)xC High performance,
fastest write
cost
3 Bit(byte)-level parity
(N-1)xC Easy to implement, high error recoverability
Low performance
4 Block-level parity
(N-1)xC High redundancy and better performance
Write-related bottleneck
5 Interleave parity
(N-1)xC High performance, reliability
Small write problem
ECE6160:Advanced Computer Networks
50
Other RAIDs
• HP AutoRAID
• AFRAID
• RAPID
• SwiftRAID
• TickerTAIP
• SMDA
• ……
ECE6160:Advanced Computer Networks
51
Berkeley History: RAID-I
• RAID-I (1989) – Consisted of a Sun 4/280
workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software
• Today RAID is $19 billion dollar industry, 80% nonPC disks sold in RAIDs