Storage Attachment Evolution

CS 61C: Great Ideas in Computer Architecture
Lecture 25: Dependability and RAID
Krste Asanović & Randy H. Katz
http://inst.eecs.berkeley.edu/~cs61c/fa17
11/27/17, Fall 2017

Transcript of Storage Attachment Evolution

Page 1: Storage Attachment Evolution

Storage Attachment Evolution

[Figure: two panels.
Direct Attachment: Host, OS, Disk Interface (DI), Allocation Table; addressing by Disk, Cylinder, Track, Sector.
Network Server Attachment: three Hosts, each with a Network Interface (NI), on a LAN with a Network File Server (OS); addressing by File Name, Offset, Length.]

Storage Attachment Evolution

[Figure: two panels.
Channel Attached: Work Station and two Main Frames (each with an OS), Channel Interface, Disk Storage Subsystem; addressing by LUN, Offset, Length; LUN to PHY.
Network Attached: three Hosts, each with a Network Interface (NI), on a LAN with a Network File Server (OS), labeled Network-attached Storage (NAS); addressing by File Name, Offset, Length; Disk, Cylinder, Track, Sector.]

Storage Attachment Evolution: Storage Area Networks (SAN)

[Figure: Hosts with Network Interfaces (NI) on a LAN address File Servers by File Name, Offset, Length. File Servers and a Main Frame attach through Channel Interfaces (CI) to a SAN, addressing by LUN, Offset, Length. The SAN connects Disk, Tape, and Optical Disk Storage Subsystems (PHY Device, Cyl, Trk, Sector). Gateways link across a WAN to a Remote SAN with its own LAN, Main Frame, FS, and DSS.]

Storage Class Memory aka Rack-Scale Memory

[Figure omitted in transcript.]

• Cheaper than DRAM
• More expensive than disk
• Non-volatile and faster than disk

Page 2: Storage Attachment Evolution

Remote Direct Memory Access

[Figure: Conventional Networking versus cut-through memory access over the network.]

Outline

• Dependability via Redundancy
• Error Correction/Detection
• RAID
• And, in Conclusion…

Six Great Ideas in Computer Architecture

1. Design for Moore’s Law (Multicore, Parallelism, OpenMP, Project #3)
2. Abstraction to Simplify Design (Everything a number, Machine/Assembler Language, C, Project #1; Logic Gates, Datapaths, Project #2)
3. Make the Common Case Fast (RISC Architecture, Instruction Pipelining, Project #2)
4. Memory Hierarchy (Locality, Consistency, False Sharing, Project #3)
5. Performance via Parallelism/Pipelining/Prediction (the five kinds of parallelism, Project #3, #4)
6. Dependability via Redundancy (ECC, RAID)

Great Idea #6: Dependability via Redundancy

• Redundancy so that a failing piece doesn’t make the whole system fail

[Figure: three units each compute 1+1. Two answer 2 and one answers 1 (FAIL!). Since 2 of 3 agree, the output is 1+1=2.]

• Increasing transistor density reduces the cost of redundancy
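The "2 of 3 agree" picture can be sketched as a majority voter. This is only an illustration of the idea, not something the slide specifies; the function name `majority_vote` is an assumption.

```python
from collections import Counter

def majority_vote(results):
    """Return the value a strict majority of replicas agree on.

    Raises ValueError when no value has more than half of the votes.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replicas")
    return value

# Three redundant units compute 1+1; one of them is faulty.
print(majority_vote([2, 2, 1]))  # prints 2
```

With three replicas, one faulty result is outvoted; two faulty results that happen to agree would win the vote, which is the limit of this scheme.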

Page 3: Storage Attachment Evolution

Great Idea #6: Dependability via Redundancy

• Applies to everything from datacenters to memory
– Redundant datacenters, so that one datacenter can be lost but the Internet service stays online
– Redundant routes, so that nodes can be lost but the Internet doesn’t fail
– Redundant disks, so that one disk can be lost without losing data (Redundant Arrays of Independent Disks/RAID)
– Redundant memory bits, so that 1 bit can be lost without losing data (Error Correcting Code/ECC Memory)

Dependability

• Fault: failure of a component
– May or may not lead to system failure

[Figure: two states. “Service accomplishment” (service delivered as specified) moves to “Service interruption” (deviation from specified service) on Failure, and back on Restoration.]

Dependability via Redundancy: Time vs. Space

• Spatial Redundancy – replicated data or check information or hardware to handle hard and soft (transient) failures
• Temporal Redundancy – redundancy in time (retry) to handle soft (transient) failures

Dependability Measures

• Reliability: Mean Time To Failure (MTTF)
• Service interruption: Mean Time To Repair (MTTR)
• Mean time between failures (MTBF)
– MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
• Improving Availability
– Increase MTTF: more reliable hardware/software + fault tolerance
– Reduce MTTR: improved tools and processes for diagnosis and repair
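The two formulas above can be written out as a small sketch. The numbers (999 hours to failure, 1 hour to repair) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours

def availability(mttf_hours, mttr_hours):
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical component: fails every 999 hours on average, 1 hour to repair.
print(mtbf(999, 1))          # 1000
print(availability(999, 1))  # 0.999
```

Halving MTTR in this example raises availability about as much as doubling MTTF, which is why both levers appear on the slide.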

Understanding MTTF

[Figure: probability of failure (vertical axis, up to 1) plotted against time.]

Availability Measures

• Availability = MTTF / (MTTF + MTTR) as %
– MTTF, MTBF usually measured in hours
• Since we hope to be rarely down, the shorthand is “number of 9s of availability per year”
• 1 nine: 90% => 36 days of repair/year
• 2 nines: 99% => 3.6 days of repair/year
• 3 nines: 99.9% => 526 minutes of repair/year
• 4 nines: 99.99% => 53 minutes of repair/year
• 5 nines: 99.999% => 5 minutes of repair/year
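The table of nines follows from multiplying the unavailable fraction by the hours in a year. A minimal sketch, assuming a 365-day year; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a 365-day year

def downtime_hours_per_year(nines):
    """Hours of repair per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailability

for n in range(1, 6):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {hours / 24:.2f} days = {hours * 60:.1f} minutes per year")
```

The printed values round to the figures on the slide (about 36 days for one nine, about 526 minutes for three nines).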

Page 4: Storage Attachment Evolution

Reliability Measures

• Another is the average number of failures per year: Annualized Failure Rate (AFR)
– E.g., 1000 disks with 100,000 hour MTTF
– 365 days * 24 hours = 8760 hours
– (1000 disks * 8760 hrs/year) / 100,000 = 87.6 failed disks per year on average
– 87.6/1000 = 8.76% annual failure rate
• Google’s 2007 study* found that actual AFRs for individual drives ranged from 1.7% for first year drives to over 8.6% for three-year old drives

*research.google.com/archive/disk_failures.pdf
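The AFR arithmetic above, written as a short sketch so other fleet sizes or MTTF values can be tried. The function names are assumptions; the inputs are the slide's own example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def expected_failures_per_year(num_disks, mttf_hours):
    """Average number of disks that fail per year in a fleet."""
    return num_disks * HOURS_PER_YEAR / mttf_hours

def annualized_failure_rate(mttf_hours):
    """AFR as a fraction: expected failures per disk per year."""
    return HOURS_PER_YEAR / mttf_hours

print(expected_failures_per_year(1000, 100_000))  # 87.6
print(annualized_failure_rate(100_000))           # 0.0876
```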

Breaking News, 1Q17, BackBlaze
https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/

Dependability Design Principle

• Design Principle: No single points of failure
– “Chain is only as strong as its weakest link”
• Dependability Corollary of Amdahl’s Law
– Doesn’t matter how dependable you make one portion of system
– Dependability limited by part you do not improve

Outline

• Dependability via Redundancy
• Error Correction/Detection
• RAID
• And, in Conclusion…

Error Detection/Correction Codes

• Memory systems generate errors (accidentally flipped bits)
– DRAMs store very little charge per bit
– “Soft” errors occur occasionally when cells are struck by alpha particles or other environmental upsets
– “Hard” errors can occur when chips permanently fail
– Problem gets worse as memories get denser and larger
• Memories protected against failures with EDC/ECC
• Extra bits are added to each data-word
– Used to detect and/or correct faults in the memory system
– Each data word value mapped to unique code word
– A fault changes valid code word to invalid one, which can be detected

Block Code Principles

• Hamming distance = difference in # of bits
• p = 011011, q = 001111, Ham. distance (p,q) = 2
• p = 011011, q = 110001, distance (p,q) = ?
• Can think of extra bits as creating a code with the data
• What if minimum distance between members of code is 2 and get a 1-bit error?

[Photo: Richard Hamming, 1915-98, Turing Award Winner]
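Hamming distance is simply a count of differing bit positions. A minimal sketch over bit strings; the function name is an assumption.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length bit strings differ."""
    if len(p) != len(q):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(p, q))

print(hamming_distance("011011", "001111"))  # 2
```

The second pair on the slide can be checked the same way by calling `hamming_distance("011011", "110001")`.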

Page 5: Storage Attachment Evolution

Parity: Simple Error-Detection Coding

• Each data value, before it is written to memory, is “tagged” with an extra bit to force the stored word to have even parity
• Each word, as it is read from memory, is “checked” by finding its parity (including the parity bit)

[Figure: on write, bits b7 b6 b5 b4 b3 b2 b1 b0 are summed (+) to give parity bit p; on read, bits b7 b6 b5 b4 b3 b2 b1 b0 and p are summed (+) to give check bit c.]

• Minimum Hamming distance of parity code is 2
• A non-zero parity check indicates an error occurred:
– 2 errors (on different bits) are not detected
– Nor any even number of errors, just odd numbers of errors are detected

Parity Example

• Data 0101 0101
• 4 ones, even parity now
• Write to memory: 0101 0101 0 to keep parity even
• Data 0101 0111
• 5 ones, odd parity now
• Write to memory: 0101 0111 1 to make parity even
• Read from memory 0101 0101 0
• 4 ones => even parity, so no error
• Read from memory 1101 0101 0
• 5 ones => odd parity, so error
• What if error in parity bit?
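The example above can be reproduced with a short sketch of even-parity tagging and checking. Words are bit strings with the parity bit appended last, as on the slide; the function names are assumptions.

```python
def add_even_parity(data_bits):
    """Append one bit so the stored word has an even number of ones."""
    parity = data_bits.count("1") % 2
    return data_bits + str(parity)

def parity_check(stored_bits):
    """Return 0 if the word has even parity (no error detected), else 1."""
    return stored_bits.count("1") % 2

print(add_even_parity("01010101"))  # 010101010
print(add_even_parity("01010111"))  # 010101111
print(parity_check("010101010"))    # 0
print(parity_check("110101010"))    # 1
```

Because the check covers the parity bit too, a flip in the parity bit itself also produces odd parity and is flagged, although the check cannot tell which bit flipped.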

Suppose Want to Correct One Error?

• Hamming came up with simple to understand mapping to allow Error Correction at minimum distance of three
– Single error correction, or double error detection (distance three cannot do both at once)
• Called “Hamming ECC”
– Worked weekends on relay computer with unreliable card reader, frustrated with manual restarting
– Got interested in error correction; published 1950
– R. W. Hamming, “Error Detecting and Error Correcting Codes,” The Bell System Technical Journal, Vol. XXIX, No. 2 (April 1950), pp. 147-160.

Detecting/Correcting Code Concept

• Detection: bit pattern fails code word check
• Correction: map to nearest valid code word

[Figure: space of possible bit patterns (2^N) with a sparse population of code words (2^M << 2^N), each with an identifiable signature; an error changes a bit pattern to a non-code word.]

Hamming Distance: Eight Code Words

[Figure omitted in transcript.]

Hamming Distance 2: Detection
Detect Single Bit Errors

• No 1 bit error goes to another valid code word
• ½ code words are valid

[Figure: invalid code words marked.]

Page 6: Storage Attachment Evolution

Hamming Distance 3: Correction
Correct Single Bit Errors, or Detect Double Bit Errors

• No 2 bit error goes to another valid code word; 1 bit error near
• 1/4 code words are valid

[Figure: patterns with one 1 are nearest 000; patterns with one 0 are nearest 111.]
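"Map to nearest valid code word" can be sketched with the two code words from the figure, 000 and 111. This brute-force search is only an illustration of the concept; real decoders do not search the code word list.

```python
CODEWORDS = ["000", "111"]  # the distance-3 code from the figure

def distance(p, q):
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(p, q))

def correct(received):
    """Return the valid code word closest to the received pattern."""
    return min(CODEWORDS, key=lambda c: distance(c, received))

print(correct("010"))  # 000
print(correct("110"))  # 111
```

Note that 110 is one flip away from 111 but two flips away from 000, so a double error in 000 would be "corrected" to the wrong word. That is the trade-off between correcting single errors and detecting double errors at distance 3.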

Administrivia (1/2)

• Final exam: the last Thursday examination slot!
– 14 December, 7-10 PM, Room TBD
– Contact us about conflicts
– Review Lectures and Book with eye on the important concepts of the course, e.g., the Great Ideas in Computer Architecture and the Different Kinds of Parallelism
• Review Session Fri Dec 8, 5-8 PM @ TBA
• Electronic Course Evaluations this week! See https://course-evaluations.berkeley.edu

Administrivia (2/2)

• Project 3 Contest results to be announced during Thursday’s lecture
• Lab 11 (Spark) is due any day this week
• Lab 13 (VM) is due any day next week
• VM Guerrilla Session tonight!
– 7-9 pm @ Cory 293 (unless bigger room found)
– Last Guerrilla Session is next Tuesday, same time and place
• Will go over the most difficult topics this semester
• Project 4 Party tomorrow night 7-9 pm @ Cory 293

Graphic of Hamming Code

• http://en.wikipedia.org/wiki/Hamming_code

Hamming ECC

Set parity bits to create even parity for each group
• A byte of data: 10011010
• Create the coded word, leaving spaces for the parity bits:
• _ _ 1 _ 0 0 1 _ 1 0 1 0
  1 2 3 4 5 6 7 8 9 a b c – bit position
• Calculate the parity bits

Page 7: Storage Attachment Evolution

Hamming ECC

• Position 1 checks bits 1,3,5,7,9,11: ? _ 1 _ 0 0 1 _ 1 0 1 0. Set position 1 to a _:
• Position 2 checks bits 2,3,6,7,10,11: 0 ? 1 _ 0 0 1 _ 1 0 1 0. Set position 2 to a _:
• Position 4 checks bits 4,5,6,7,12: 0 1 1 ? 0 0 1 _ 1 0 1 0. Set position 4 to a _:
• Position 8 checks bits 8,9,10,11,12: 0 1 1 1 0 0 1 ? 1 0 1 0. Set position 8 to a _:

Hamming ECC

• Final code word: 011100101010
• Data word: 10011010
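The encoding steps can be sketched as follows: data bits fill the positions that are not powers of two, and the parity bit at position p gives even parity to every position whose index has bit p set. The function names are assumptions; positions are 1-based as on the slides.

```python
def is_power_of_two(x):
    return (x & (x - 1)) == 0

def hamming_encode(data_bits):
    """Encode a bit string with even-parity Hamming bits at positions 1, 2, 4, 8, ..."""
    code = [0]  # index 0 unused so list indices match 1-based bit positions
    data = [int(b) for b in data_bits]
    pos = 1
    while data:
        # Power-of-two positions are reserved for parity; others take data.
        code.append(0 if is_power_of_two(pos) else data.pop(0))
        pos += 1
    p = 1
    while p < len(code):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, len(code)) if i & p) % 2
        p *= 2
    return "".join(str(b) for b in code[1:])

print(hamming_encode("10011010"))  # 011100101010
```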

Hamming ECC Error Check

• Suppose receive 011100101110
– Parity 1 group (bits 1,3,5,7,9,11): 0 1 0 1 1 1 √
– Parity 2 group (bits 2,3,6,7,10,11): 1 1 0 1 1 1 X - Parity 2 in error
– Parity 4 group (bits 4,5,6,7,12): 1 0 0 1 0 √
– Parity 8 group (bits 8,9,10,11,12): 0 1 1 1 0 X - Parity 8 in error
• Implies position 8+2=10 is in error: 011100101110

Hamming ECC Error Correct

• Flip the incorrect bit… 011100101010

Page 8: Storage Attachment Evolution

Hamming ECC Error Correct

• Suppose receive 011100101010
– Parity 1 group: 0 1 0 1 1 1 √
– Parity 2 group: 1 1 0 1 0 1 √
– Parity 4 group: 1 0 0 1 0 √
– Parity 8 group: 0 1 0 1 0 √
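The check-and-correct procedure can be sketched as a syndrome computation: add up the positions of the parity groups that fail, and the sum is the position of the flipped bit. This sketch assumes at most one bit is in error; the function names are hypothetical.

```python
def hamming_syndrome(codeword):
    """Sum of the parity positions whose group has odd parity (0 means no error)."""
    bits = [0] + [int(b) for b in codeword]  # 1-based positions
    syndrome = 0
    p = 1
    while p < len(bits):
        if sum(bits[i] for i in range(1, len(bits)) if i & p) % 2:
            syndrome += p
        p *= 2
    return syndrome

def hamming_correct(codeword):
    """Flip the bit named by the syndrome, assuming at most one bit error."""
    s = hamming_syndrome(codeword)
    if s == 0:
        return codeword
    if s > len(codeword):
        raise ValueError("syndrome points outside the word: more than one error")
    bits = list(codeword)
    bits[s - 1] = "1" if bits[s - 1] == "0" else "0"
    return "".join(bits)

print(hamming_syndrome("011100101110"))  # 10
print(hamming_correct("011100101110"))   # 011100101010
```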

What if More Than 2-Bit Errors?

• Network transmissions, disks, distributed storage: common failure mode is bursts of bit errors, not just one or two bit errors
– Contiguous sequence of B bits in which first, last and any number of intermediate bits are in error
– Caused by impulse noise or by fading in wireless
– Effect is greater at higher data rates
• Solve with Cyclic Redundancy Check (CRC), interleaving or other more advanced codes

Peer Instruction Question

The following word is received, encoded with Hamming code: 0 1 1 0 0 0 1

What is the corrected data bit sequence?

A. 1111
B. 0001
C. 1101
D. 1011

Outline

• Dependability via Redundancy
• Error Correction/Detection
• RAID
• And, in Conclusion…

Evolution of the Disk Drive

[Photos: IBM RAMAC 305, 1956; IBM 3390K, 1986; Apple SCSI, 1986]

Page 9: Storage Attachment Evolution

Arrays of Small Disks

Can smaller disks be used to close gap in performance between disks and CPUs?

[Figure: Conventional: 4 disk designs (14”, 10”, 5.25”, 3.5”), from low end to high end. Disk Array: 1 disk design (3.5”).]

Replace Small Number of Large Disks with Large Number of Small Disks! (1988 Disks)

            IBM 3390K     IBM 3.5" 0061   x70
Capacity    20 GBytes     320 MBytes      23 GBytes
Volume      97 cu. ft.    0.1 cu. ft.     11 cu. ft.    (9X)
Power       3 KW          11 W            1 KW          (3X)
Data Rate   15 MB/s       1.5 MB/s        120 MB/s      (8X)
I/O Rate    600 I/Os/s    55 I/Os/s       3900 I/Os/s   (6X)
MTTF        250 KHrs      50 KHrs         ??? Hrs
Cost        $250K         $2K             $150K

Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?

RAID: Redundant Arrays of (Inexpensive) Disks

• Files are "striped" across multiple disks
• Redundancy yields high data availability
– Availability: service still provided to user, even if some components failed
• Disks will still fail
• Contents reconstructed from data redundantly stored in the array
– Capacity penalty to store redundant info
– Bandwidth penalty to update redundant info

Redundant Arrays of Inexpensive Disks
RAID 1: Disk Mirroring/Shadowing

• Each disk is fully duplicated onto its “mirror”
– Very high availability can be achieved
• Writes limited by single-disk speed
• Reads may be optimized
• Most expensive solution: 100% capacity overhead

[Figure: a disk and its mirror form a recovery group.]

Redundant Array of Inexpensive Disks
RAID 3: Parity Disk

[Figure: a logical record (100100111100110110010011...) is split into striped physical records (10100011, 11001101, 10100011, 11001101) across the data disks, with a parity disk P.]

• P contains sum of other disks per stripe mod 2 (“parity”)
• If disk fails, subtract P from sum of other disks to find missing information
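Sum mod 2 is bitwise XOR, and in mod-2 arithmetic subtraction is the same operation, so the lost disk is the XOR of the parity with the surviving disks. A minimal sketch with hypothetical one-byte blocks; the function names are assumptions.

```python
from functools import reduce

def parity_block(blocks):
    """XOR equal-length blocks byte by byte (the per-stripe sum mod 2)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity):
    """Recover the one missing block from the survivors and the parity."""
    return parity_block(surviving_blocks + [parity])

# Hypothetical stripe of three data disks, one byte each.
data = [bytes([0b10100011]), bytes([0b11001101]), bytes([0b00001111])]
parity = parity_block(data)

# Disk 1 fails; rebuild it from disks 0 and 2 plus the parity disk.
recovered = reconstruct([data[0], data[2]], parity)
print(recovered == data[1])  # True
```

This works only because the array knows which disk failed; parity alone cannot locate an unknown error.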

Redundant Arrays of Inexpensive Disks
RAID 4: High I/O Rate Parity

Insides of 5 disks (disk columns; each row is a stripe; logical disk address increases downward):

D0   D1   D2   D3   P
D4   D5   D6   D7   P
D8   D9   D10  D11  P
D12  D13  D14  D15  P
D16  D17  D18  D19  P
D20  D21  D22  D23  P
...

Example: small read D0 & D5, large write D12-D15

Page 10: Storage Attachment Evolution

Inspiration for RAID 5

• RAID 4 works well for small reads
• Small writes (write to one disk):
– Option 1: read other data disks, create new sum and write to Parity Disk
– Option 2: since P has old sum, compare old data to new data, add the difference to P
• Small writes are limited by Parity Disk: Write to D0, D5 both also write to P disk

D0   D1   D2   D3   P
D4   D5   D6   D7   P

RAID 5: High I/O Rate Interleaved Parity

Independent writes possible because of interleaved parity

Disk columns (logical disk addresses increase downward):

D0   D1   D2   D3   P
D4   D5   D6   P    D7
D8   D9   P    D10  D11
D12  P    D13  D14  D15
P    D16  D17  D18  D19
D20  D21  D22  D23  P
...

Example: write to D0, D5 uses disks 0, 1, 3, 4
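The rotating layout in the table can be sketched as a mapping from logical block number to disk. This follows the specific rotation drawn on the slide (parity starts on the last disk and moves one disk left per stripe); other RAID 5 implementations may rotate differently, and the function names are assumptions.

```python
def raid5_location(block, num_disks=5):
    """Return (stripe, data_disk, parity_disk) for a logical data block."""
    data_per_stripe = num_disks - 1
    stripe, index = divmod(block, data_per_stripe)
    # Parity starts on the last disk and shifts one disk left each stripe.
    parity_disk = (num_disks - 1) - (stripe % num_disks)
    # Data blocks fill the remaining disks left to right, skipping parity.
    data_disk = index if index < parity_disk else index + 1
    return stripe, data_disk, parity_disk

def disks_touched_by_small_writes(blocks):
    """Disks used when each block is written with its parity update."""
    touched = set()
    for b in blocks:
        _, data_disk, parity_disk = raid5_location(b)
        touched.update((data_disk, parity_disk))
    return sorted(touched)

print(disks_touched_by_small_writes([0, 5]))  # [0, 1, 3, 4]
```

Writes to D0 and D5 touch disjoint pairs of disks, so they can proceed in parallel. Under RAID 4 both would queue on the single parity disk.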

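Problems of Disk Arrays: Small Writes

RAID-5: Small Write Algorithm

1 Logical Write = 2 Physical Reads + 2 Physical Writes

[Figure: stripe D0 D1 D2 D3 P is updated with new data D0'. (1. Read) old data D0 and (2. Read) old parity P. XOR the new data with the old data, then XOR the result with the old parity to form P'. (3. Write) D0' and (4. Write) P', giving stripe D0' D1 D2 D3 P'.]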
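The small-write algorithm can be sketched with byte-wise XOR: the new parity is the old parity XOR the old data XOR the new data, so the other data disks are never read. The stripe contents below are hypothetical and the function names are assumptions.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write_parity(old_data, old_parity, new_data):
    """New parity from the two physical reads (old data, old parity)."""
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))

# Hypothetical stripe D0..D3, one byte per disk.
stripe = [b"\x0f", b"\x33", b"\x55", b"\xaa"]
old_parity = b"\x00"
for block in stripe:
    old_parity = xor_bytes(old_parity, block)

new_d0 = b"\xf0"
new_parity = small_write_parity(stripe[0], old_parity, new_d0)

# Cross-check against recomputing parity over the whole updated stripe.
full = b"\x00"
for block in [new_d0] + stripe[1:]:
    full = xor_bytes(full, block)
print(new_parity == full)  # True
```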

Tech Report Read ‘Round the World (December 1987)

RAID-I

• RAID-I (1989)
– Consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software

RAID II

• 1990-1993
• Early Network Attached Storage (NAS) System running a Log Structured File System (LFS)
• Impact:
– $25 Billion/year in 2002
– Over $150 Billion in RAID devices sold from 1990 to 2002
– 200+ RAID companies (at the peak)
– Software RAID a standard component of modern OSs

Page 11: Storage Attachment Evolution

Outline

• Dependability via Redundancy
• Error Correction/Detection
• RAID
• And, in Conclusion…

And, in Conclusion, …

• Great Idea: Redundancy to Get Dependability
– Spatial (extra hardware) and Temporal (retry if error)
• Reliability: MTTF & Annualized Failure Rate (AFR)
• Availability: % uptime (MTTF / (MTTF + MTTR))
• Memory
– Hamming distance 2: Parity for Single Error Detect
– Hamming distance 3: Single Error Correction Code + encode bit position of error
• Treat disks like memory, except you know when a disk has failed; erasure makes parity an Error Correcting Code
• RAID-2, -3, -4, -5: Interleaved data and parity