Intelligent People. Uncommon Ideas. Yottabytes and Beyond Demystifying Storage and Building large Storage Networks Part I by Bhavin Turakhia, CEO, Directi [email protected] (shared under Creative Commons Attribution Share-alike License incorporated herein by reference) (http://creativecommons.org/licenses/by-sa/3.0/ )

Page 1:

Intelligent People. Uncommon Ideas.

Yottabytes and Beyond

Demystifying Storage and Building large Storage Networks

Part I

by Bhavin Turakhia, CEO, Directi [email protected]

(shared under Creative Commons Attribution Share-alike License incorporated herein by reference)

(http://creativecommons.org/licenses/by-sa/3.0/)

Page 2:

Why is storage important?

• Web 2.0 applications are an extension of your Desktop

• SaaS is here and growing

• Broadband is a reality

• Storage costs are dropping

• Everyone expects near-unlimited storage online – YouTube, Flickr, Facebook et al. are storing your life online*

• (... and yes, let’s not forget your personal BitTorrent collection)

* It would take 1400 TB to store your entire life in video, or 5700 TB if you also want to know what was happening around you. Add another 73 TB for the audio of everything you heard (MP3 quality). That’s about 6000 TB for a copy of your life.

Page 3:

Agenda

• Hard disks: SATA, SAS, FC, Solid state

• RAID

• DAS

• SAN

Page 4:

“Large scale storage requires careful planning”

Page 5:

Choosing your Hard Disk (SATA, FC, SAS, SCSI, Solid state)

Page 6:

Introduction to Hard Drives

• Basic physical storage unit (aka Physical block device)

• Variables to consider when selecting a drive: type (SAS, SATA, FC), RPM, capacity, MTBF (Mean Time Between Failures), life expectancy

Page 7:

Hard Disk types

SATA (Serial ATA)
• Typical use: low-cost, high-volume, low-speed, large-storage environments; CDP / backups
• Performance: average, typically 7200 RPM
• Hard drive capacities: typically 250 GB, 500 GB, 750 GB, 1 TB

SAS (Serial Attached SCSI)
• Typical use: replacement for SCSI; high performance transaction oriented applications with high IOPs requirement
• Performance: good (similar to FC), 10k / 15k RPM
• Hard drive capacities: typically 73 GB, 146 GB, 300 GB, 400 GB

FC (Fibre Channel)
• Typical use: high performance transaction oriented applications with high IOPs requirement
• Performance: good (similar to SAS), 10k / 15k RPM
• Hard drive capacities: typically 73 GB, 146 GB, 300 GB, 400 GB

Page 8:

Hard Disk types

Price per GB (based on max drive capacity, retail web price)
• SATA: $0.33
• SAS: $2
• FC: $3

Misc
• SATA: -
• SAS: backward compatible with SATA; allows mixing SATA drives on the same backplane
• FC: -

Page 9:

Hard Disk Conclusions

• For high IOPs, database applications, low-storage requirements – you have a choice between FC and SAS

• SAS currently seems like the better option

• Future SAS standards promise to be faster than FC (though it is likely they will remain neck and neck)

• For high-storage requirements (video servers, file servers, photo storage, archives, mail servers, backup servers) SATA is the way to go

• You can combine SAS and SATA drives to reduce average cost, especially since a SAS backplane also accepts SATA drives

• Read up on the spec sheets of the hard drives you plan to use to determine the specifics
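The cost argument for mixing drive types can be checked with a little arithmetic. The sketch below uses the 2007 retail prices per GB quoted on the "Hard Disk types" slide; the capacities in the example mix are hypothetical and only there to illustrate the calculation.

```python
# Prices per GB are the 2007 retail figures quoted on the "Hard Disk types" slide.
PRICE_PER_GB = {"SATA": 0.33, "SAS": 2.00, "FC": 3.00}

def blended_cost(capacity_gb):
    """Return (total_cost, average_cost_per_gb) for a mix of drive types.

    capacity_gb maps a drive type to the number of gigabytes bought of that type.
    """
    total_gb = sum(capacity_gb.values())
    total_cost = sum(PRICE_PER_GB[kind] * gb for kind, gb in capacity_gb.items())
    return total_cost, total_cost / total_gb

# Hypothetical mix: 1 TB of SAS for the database, 10 TB of SATA for bulk data.
cost, per_gb = blended_cost({"SAS": 1000, "SATA": 10000})
print(f"total ${cost:,.0f}, average ${per_gb:.2f}/GB")
```

With this mix the average price lands much closer to the SATA figure than to the SAS figure, because most of the capacity is SATA.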

Page 10:

Solid State Drives

• Uses solid state memory to store persistent data

• Eliminates mechanical parts

• Useful for creating efficient in-between caches or storing small to mid-sized high performance databases

Page 11:

Solid State Drives

References
• Intro - http://en.wikipedia.org/wiki/Solid_state_disk
• RAM vs Flash based - http://www.storagesearch.com/ssd-ram-v-flash.html
• SSD based SAN!!! - http://www.superssd.com/

Advantages
• Faster startup, no spinning
• Significantly faster on random IO (from 250x to 1000x+)
• Extremely low latency (25x to 200x better)
• No noise
• Lower power consumption
• Less heat production

Disadvantages
• Significantly more expensive ($10-30/GB for Flash based, $100-200/GB for DDR RAM based)
• Slightly slower on large sequential reads
• Slower random write speeds in the case of Flash based storage

Page 12:

RAID Primer (0, 1, 2, 3, 4, 5, 6, TP, 0+1, 10, 50, 60)

Page 13:

Introduction to RAID

• allows multiple disks to appear as a single contiguous physical block device

• provides redundancy / high availability

• A raid group appears as a single physical block device

[Diagram: hard disks HD1 and HD2 grouped under RAID]

Page 14:

Comparison of Single RAID Levels

Description
• RAID 0: Striping
• RAID 1: Mirroring
• RAID 5: Striping with Parity
• RAID 6: Striping with Dual Parity

Minimum Disks
• RAID 0: 2
• RAID 1: 2
• RAID 5: 3
• RAID 6: 4

Maximum Disks
• RAID 0: Controller dependent
• RAID 1: 2
• RAID 5: Controller dependent
• RAID 6: Controller dependent

Array Capacity
• RAID 0: No. of Drives x Drive Capacity
• RAID 1: Drive Capacity
• RAID 5: (No. of Drives - 1) x Drive Capacity
• RAID 6: (No. of Drives - 2) x Drive Capacity
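The Array Capacity and Minimum Disks rows can be written out as one small function. This is a minimal sketch that assumes identical drives and treats RAID 1 as the two-drive mirror described in the table; the function name and the 500 GB example drives are illustrative only.

```python
def usable_capacity(level, num_drives, drive_gb):
    """Usable capacity in GB for the single RAID levels compared above."""
    minimum = {0: 2, 1: 2, 5: 3, 6: 4}
    if level not in minimum:
        raise ValueError(f"unsupported RAID level: {level}")
    if num_drives < minimum[level]:
        raise ValueError(f"RAID {level} needs at least {minimum[level]} drives")
    if level == 1 and num_drives != 2:
        raise ValueError("RAID 1 in this comparison is a two-drive mirror")
    if level == 0:
        return num_drives * drive_gb        # striping only, no redundancy
    if level == 1:
        return drive_gb                     # one drive's worth, mirrored
    if level == 5:
        return (num_drives - 1) * drive_gb  # one drive's worth of parity
    return (num_drives - 2) * drive_gb      # RAID 6: two drives' worth of parity

for level, drives in [(0, 4), (1, 2), (5, 4), (6, 4)]:
    print(f"RAID {level} with {drives} x 500 GB: {usable_capacity(level, drives, 500)} GB")
```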

Page 15:

Comparison of Single RAID Levels

Storage Efficiency
• RAID 0: 100%
• RAID 1: 50%
• RAID 5: (Num of drives – 1) / Num of drives
• RAID 6: (Num of drives – 2) / Num of drives

Fault Tolerance
• RAID 0: None
• RAID 1: 1 drive failure
• RAID 5: 1 drive failure
• RAID 6: 2 drive failures

High Availability
• RAID 0: None
• RAID 1: Good
• RAID 5: Good
• RAID 6: Very Good

Degradation during rebuild
• RAID 0: NA
• RAID 1: Slight degradation; rebuilds very fast
• RAID 5: High degradation; slow rebuild (due to write penalty of parity)
• RAID 6: Very high degradation; very slow rebuild (due to write penalty of dual parity)

Page 16:

Comparison of Single RAID Levels

Random Read Performance
• RAID 0: Very Good
• RAID 1: Good
• RAID 5: Very Good
• RAID 6: Very Good

Random Write Performance
• RAID 0: Very Good
• RAID 1: Good (slightly worse than single drive)
• RAID 5: Fair (parity overhead)
• RAID 6: Poor (dual parity overhead)

Sequential Read Performance
• RAID 0: Very Good
• RAID 1: Fair
• RAID 5: Good
• RAID 6: Good

Sequential Write Performance
• RAID 0: Very Good
• RAID 1: Good
• RAID 5: Fair
• RAID 6: Fair

Cost
• RAID 0: Lowest
• RAID 1: High
• RAID 5: Moderate
• RAID 6: Moderate+

Page 17:

Comparison of Single RAID Levels

Use Case
• RAID 0: Non critical data; high speed requirements; data backed up elsewhere
• RAID 1: Typically used as RAID 10 in OLTP / OLAP applications
• RAID 5: Non-write intensive OLTP applications / file servers etc
• RAID 6: Non-write intensive OLTP applications / file servers etc

Misc
• RAID 0: -
• RAID 1: -
• RAID 5: Parity can considerably slow down system
• RAID 6: Not supported on all RAID cards

Page 18:

Understanding the Parity Penalty

• RAID 5 and RAID 6 store parity information against data for rebuild

• Single Parity can be calculated using a simple XOR

• e.g. “ABCDEFGHIJKL” on a 4 disk RAID 5 array

• If Disk 2 fails then the data “B” can be recalculated as (01000001 XOR 01000011 XOR 01000000) => 01000010 => B

Disk 1        Disk 2        Disk 3        Disk 4
A (01000001)  B (01000010)  C (01000011)  {P – 01000000}
Parity {P}    D             E             F
G             Parity {P}    H             I
J             K             Parity {P}    L
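The first stripe of the table can be reproduced in a few lines. This is a minimal sketch of single parity as a byte-wise XOR, using the same one-byte blocks "A", "B" and "C" as the slide; real arrays work on much larger blocks, and the helper name `parity` is just for illustration.

```python
from functools import reduce

def parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

a, b, c = b"A", b"B", b"C"

# Parity for the first stripe: A XOR B XOR C.
p = parity([a, b, c])

# Disk 2 fails: XOR the surviving blocks with the parity to get "B" back.
rebuilt_b = parity([a, c, p])

print(format(p[0], "08b"), rebuilt_b)   # 01000000 b'B'
```

The same function serves for both computing parity and rebuilding a lost block, which is why a single-parity array can survive exactly one missing disk per stripe.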

Page 19:

Understanding the Parity Penalty

• Steps to change “B” to “X” on Disk 2

• Read A, C and {P}

• Recalculate {P} as ‘A’ XOR ‘X’ XOR ‘C’

• Write ‘X’ and {P}

• A single update required 3 reads and 2 writes

• Random writes in RAID 5 and RAID 6 are very expensive

Disk 1        Disk 2                          Disk 3        Disk 4
A (01000001)  B->X (01000010 -> 01011000)     C (01000011)  {P – 01000000}
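The update of "B" to "X" can be followed with the slide's own bit patterns. The first function below recomputes parity the way the steps describe, from the new block and the other data blocks in the stripe. The second function is an addition that is not on the slides: a commonly described alternative that derives the new parity from the old block, the new block and the old parity. Both are sketches on single bytes, and both function names are made up for this example.

```python
def parity_by_reconstruction(new_block, other_data_blocks):
    """New parity from the new block plus the untouched data blocks in the stripe."""
    p = new_block
    for block in other_data_blocks:
        p ^= block
    return p

def parity_by_delta(old_block, new_block, old_parity):
    """New parity from the old block, the new block and the old parity only."""
    return old_parity ^ old_block ^ new_block

A, B, C, X = 0b01000001, 0b01000010, 0b01000011, 0b01011000
P = A ^ B ^ C                      # parity before the update

new_p = parity_by_reconstruction(X, [A, C])
assert new_p == parity_by_delta(B, X, P)

print(format(new_p, "08b"))        # 01011010
```

Either way, one logical write turns into several physical reads and two physical writes (the data block and the parity block), which is the penalty the slide is describing.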

Page 20:

Understanding the Parity Penalty

• Rebuilding in RAID 5 and RAID 6 is expensive

• The cost increases with increase in number of disks

• As if this isn’t enough, there is an additional penalty

• All the writes after the computation (i.e. parity and the changed block) must be simultaneous (involving a two-phase commit operation)

• The impact can be marginally reduced through write-back caching

Page 21:

Comparison of Nested RAID Levels

Description
• RAID 10: Mirroring then Striping
• RAID 50: Striping with Parity then Striping without parity

Minimum Disks
• RAID 10: Even number, at least 4
• RAID 50: At least 6

Maximum Disks
• RAID 10: Controller dependent
• RAID 50: Controller dependent

Array Capacity
• RAID 10: (Size of Drive) * (Number of Drives) / 2
• RAID 50: (Size of Drive) * (No. of Drives In Each RAID 5 Set - 1) * (No. of RAID 5 Sets)
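The two nested capacity formulas translate directly into code. This is a sketch that assumes identical drives and, for RAID 50, equally sized RAID 5 sets; the drive counts and the 300 GB size in the example are hypothetical.

```python
def raid10_capacity(num_drives, drive_gb):
    """Usable GB of a RAID 10 array: half the drives hold mirror copies."""
    if num_drives < 4 or num_drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return drive_gb * (num_drives // 2)

def raid50_capacity(drives_per_set, num_sets, drive_gb):
    """Usable GB of a RAID 50 array: each RAID 5 set gives up one drive to parity."""
    if drives_per_set < 3 or num_sets < 2:
        raise ValueError("RAID 50 needs at least 2 RAID 5 sets of 3 or more drives")
    return drive_gb * (drives_per_set - 1) * num_sets

# Eight 300 GB drives either way: four mirrored pairs, or two RAID 5 sets of four.
print(raid10_capacity(8, 300))      # 1200
print(raid50_capacity(4, 2, 300))   # 1800
```

With the same eight drives, RAID 50 yields more usable space, at the price of the parity write penalty discussed earlier.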

Page 22:

Comparison of Nested RAID Levels

Storage Efficiency
• RAID 10: 50%
• RAID 50: (No. of Drives In Each RAID 5 Set - 1) / (No. of Drives In Each RAID 5 Set)

Fault Tolerance
• RAID 10: Multiple drive failure as long as 2 drives from the same RAID 1 set do not fail
• RAID 50: Multiple drive failure as long as 2 drives from the same RAID 5 set do not fail

High Availability
• RAID 10: Excellent
• RAID 50: Excellent

Degradation during rebuild
• RAID 10: Minor
• RAID 50: Moderate degradation; slow rebuild (due to write penalty of parity)
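The fault tolerance rule for nested levels depends on which drives fail, not only on how many. The sketch below checks a set of failed drives against that rule. The drive numbering and grouping are hypothetical; a real controller decides which drives form each set.

```python
def raid10_survives(mirror_sets, failed):
    """True if no RAID 1 pair has lost both of its drives."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in mirror_sets)

def raid50_survives(raid5_sets, failed):
    """True if every RAID 5 set has lost at most one drive."""
    failed = set(failed)
    return all(len(set(group) & failed) <= 1 for group in raid5_sets)

mirrors = [(0, 1), (2, 3), (4, 5)]        # six drives as three mirrored pairs
parity_sets = [(0, 1, 2), (3, 4, 5)]      # six drives as two RAID 5 sets

print(raid10_survives(mirrors, [0, 2, 4]))     # True: one drive lost per pair
print(raid10_survives(mirrors, [0, 1]))        # False: a whole pair is gone
print(raid50_survives(parity_sets, [0, 3]))    # True: one drive lost per set
print(raid50_survives(parity_sets, [0, 1]))    # False: two drives in one set
```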

Page 23:

Comparison of Nested RAID Levels

Read Performance
• RAID 10: Very Good
• RAID 50: Very Good

Write Performance
• RAID 10: Very Good
• RAID 50: Good

Use Case
• RAID 10: OLTP / OLAP applications
• RAID 50: Medium-write intensive OLTP / OLAP applications

Page 24:

Nested RAID Misc Notes

• RAID 10 is faster and better than RAID 0+1 for the same cost

• RAID 60 is similar to RAID 50 except that the striped sets with parity contain dual parity

• Ideally RAID 10 and RAID 50 will be the only nested RAID levels you will use

Page 25:

RAID Considerations

• Select your stripe size by empirical testing
  • A smaller stripe size increases transfer performance and decreases positioning performance, and vice versa
  • Ideal stripe sizes depend on your application, the typical amount of data fetched per read, sequential vs random reads etc

• Try and select hard drives from separate production batches

• Maintain sufficient Spares in a large array (typically 1 per 10-15 disks is sufficient)

• Use Global spares across RAID groups if your controller supports it
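The stripe size trade-off in the first bullet is easier to see with a sketch of how a striped (RAID 0 style) array maps a logical offset onto its disks. The round-robin layout below is the textbook one and an assumption here; actual controllers may lay data out differently, and the sizes in the example are arbitrary.

```python
def locate(offset, stripe_unit, num_disks):
    """Map a byte offset in a striped volume to (disk index, byte offset on that disk)."""
    unit = offset // stripe_unit    # which stripe unit the byte falls in
    disk = unit % num_disks         # units are dealt round-robin across the disks
    row = unit // num_disks         # full stripes that precede this unit
    return disk, row * stripe_unit + offset % stripe_unit

def disks_touched(offset, length, stripe_unit, num_disks):
    """Return the set of disks that a single read of `length` bytes has to visit."""
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return {unit % num_disks for unit in range(first, last + 1)}

KIB = 1024
read = 256 * KIB

# Small stripe unit: the read is spread over all four disks (better transfer rate,
# but every disk has to position its head).
print(sorted(disks_touched(0, read, 64 * KIB, 4)))     # [0, 1, 2, 3]

# Large stripe unit: the same read stays on one disk, leaving the others free.
print(sorted(disks_touched(0, read, 1024 * KIB, 4)))   # [0]
```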

Page 26:

RAID Considerations

• Use hardware RAID unless performance is not a consideration
  • Nested RAID levels and parity based RAID in particular consume more CPU cycles and increase rebuild time if implemented in software

• General rule about Controller Cache – the higher the better

• Ensure the controller has battery backup to retain its cache in case of power failure

• For internal RAID Controller cards use faster PCI buses (PCI-X)

Page 27:

The Fun starts – Let’s build our storage system

Page 28:

Passive Disk Enclosure based Direct Attached Storage (PDE based DAS)

Page 29:

Passive Disk Enclosure based DAS

• DAS – Direct Attached storage

• RAID controller inside host machine

• External chassis is simply a JBOD (Just a Bunch Of Disks) (or what I’d like to call a Passive Disk Enclosure or PDE)

• A PDE lets you string together a larger number of drives than an internal RAID array allows

• E.g. Dell PowerVault MD1000

Page 30:

Passive Disk Enclosure based DAS

• Passive Disk Enclosure can consist of SAS, SATA or FC drives

• Passive Disk Enclosure to RAID Controller connectivity can be SAS, FC, SCSI (possibly different from the backplane)

• Multiple PDEs can be daisy chained if they support it

• RAID card is a single point of failure

• Only one host machine supported

• Array of disks can be divided into multiple RAID groups

Page 31:

Passive Disk Enclosure based DAS

• Array of disks can be divided into multiple heterogeneous RAID groups

• Size and type of a RAID group depends on RAID card

• PDE may have multiple paths to system with possibility of multiplexing for increased speed

• Global spares can be defined on the RAID card

• Maximum storage size = maximum number of PDEs that can be daisy chained x number of drives per PDE x size of drives
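As a rough sizing aid, the maximum storage size can be computed with the number of drive bays per enclosure as an explicit factor, and combined with the earlier rule of thumb of one spare per 10-15 disks. The enclosure count, bay count and drive size below are hypothetical figures, not a vendor specification.

```python
import math

def raw_das_capacity_tb(max_enclosures, bays_per_enclosure, drive_tb):
    """Raw capacity of a fully populated chain of enclosures, before RAID overhead."""
    return max_enclosures * bays_per_enclosure * drive_tb

def hot_spares(total_drives, drives_per_spare=15):
    """Spares to reserve at one spare per `drives_per_spare` drives, rounded up."""
    return math.ceil(total_drives / drives_per_spare)

enclosures, bays, drive_tb = 3, 15, 1   # hypothetical chain of three 15-bay PDEs
raw_tb = raw_das_capacity_tb(enclosures, bays, drive_tb)
spares = hot_spares(enclosures * bays)

# Raw capacity, spares reserved, and what is left for RAID groups.
print(raw_tb, spares, raw_tb - spares * drive_tb)   # 45 3 42
```

The figure left after spares is still raw capacity; the RAID level chosen for each group reduces it further.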

Page 32:

Passive Disk Enclosure based DAS

• Performance Considerations
  • Drives
  • RAID configuration
  • PDE interconnect
  • PDE to RAID card connection
  • RAID card config (cache etc)
  • PCI bus

Page 33:

Active Disk Enclosure based Direct Attached Storage (ADE based DAS)

Page 34:

Active Disk Enclosure based DAS

• What makes an ADE different: the RAID card is not in the host machine but in the enclosure

• Host machine has a SAS/FC Host Bus Adaptor (HBA), depending on what ADE to host connectivity is supported
  • Some ADEs may support multiple connection protocols

• ADE may support SAS/FC/SATA drives

• ADE can support daisy-chaining PDEs

• Examples of ADEs: Dell MD3000, Infortrend EonStor devices, Nexsan SATABeast and SATABoy etc

Page 35:

Active Disk Enclosure based DAS

• ADE may support dual RAID Controllers

• RAID Controllers can be used as Active-Active (in case of multiple RAID Groups), otherwise as Active-Passive

• RAID Controller to HBA connectivity can be multiplexed, if supported, for higher throughput

• ADEs are wrongly but commonly referred to as a SAN (“SAN device” would still be alright)

Page 36:

Partitioning and Mounting

Page 37:

Logical Volumes

• A RAID Group is a physical unit of storage

• At the operating system level a Logical Group (a Volume Group in LVM terms) can be created out of multiple RAID Groups

• Each Logical Group can be further divided into Logical Volumes

• Each Logical Volume represents a mountable block device

• In Linux this is done using LVM

• In LVM Logical Volumes are resizable
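On Linux the layering above corresponds to three LVM commands: `pvcreate` for each RAID group's block device, `vgcreate` to pool them, and `lvcreate` for each mountable volume, with `lvextend` to grow one later. The sketch below only builds the command lines and does not run them. The device paths, group name and volume sizes are hypothetical, and the exact options are worth checking against the LVM man pages for your distribution.

```python
def lvm_commands(raid_devices, group, volumes):
    """Return the LVM command lines that carve `volumes` out of `raid_devices`.

    volumes maps a logical volume name to its size in GB.
    """
    cmds = [["pvcreate", dev] for dev in raid_devices]        # one per RAID group
    cmds.append(["vgcreate", group, *raid_devices])           # pool them together
    for name, size_gb in volumes.items():
        cmds.append(["lvcreate", "-L", f"{size_gb}G", "-n", name, group])
    return cmds

def grow_command(group, name, extra_gb):
    """Command line to enlarge an existing logical volume by `extra_gb` GB."""
    return ["lvextend", "-L", f"+{extra_gb}G", f"/dev/{group}/{name}"]

# Hypothetical devices: two RAID groups exposed by the controller as sdb and sdc.
commands = lvm_commands(["/dev/sdb", "/dev/sdc"], "vg_data", {"mail": 200, "db": 100})
for cmd in commands + [grow_command("vg_data", "mail", 50)]:
    print(" ".join(cmd))
```

Growing a logical volume enlarges only the block device; the filesystem on it has to be resized separately.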

Page 38:

SAN (Storage Area Network)

Page 39:

SAN

• Multiple host machines connected to an ADE through a SAN switch

• SAN refers to the interconnect + Switch + ADE + PDE

• Switch and HBA can be SAS / FC depending on interconnect type supported by ADE

• ADE would support creation of Volumes

• These can be mounted onto Client and further subdivided

Page 40:

SAN

• Care must be taken to mount each Logical Volume onto a single client (unless you are running a Clustered File System)

• This can be achieved by host masking supported by ADE and/or the Switch

• Without careful host masking and mounting data corruption can take place
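Host masking is configured on the ADE or the switch with vendor-specific tools, but the rule it enforces is simple enough to model. The sketch below is a hypothetical audit over a masking table, flagging volumes that more than one host can see; the volume and host names are made up.

```python
def shared_volumes(masking):
    """Return the volumes that more than one host is allowed to see.

    masking maps a volume name to the set of hosts it is exposed to.
    """
    return {vol: hosts for vol, hosts in masking.items() if len(hosts) > 1}

def can_mount(masking, host, volume):
    """True if the masking table exposes `volume` to `host`."""
    return host in masking.get(volume, set())

masking = {
    "vol_mail": {"host-a"},
    "vol_db": {"host-b", "host-c"},   # risky unless a clustered file system is used
}

print(sorted(shared_volumes(masking)))          # ['vol_db']
print(can_mount(masking, "host-a", "vol_db"))   # False
```

Any volume reported by `shared_volumes` is a candidate for data corruption unless the hosts sharing it run a clustered file system.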

Page 41:

SAN

• Complex SAN configs include multiple hosts and multiple ADEs connected to active-active switches with multiplexed connections

• Client hosts can be of heterogeneous operating systems

• (Funnily, ADE to PDE paths are sometimes not multiplexed)

Page 42:

SAN

• While this looks complex – just think of it as removing hard disks from the machine and hosting them outside in separate enclosures

• Each machine mounts an independent partition from the SAN

Page 43:

SAN

• Performance Considerations
  • All variables we covered before
  • Switch config
  • Ensure that the switch / HBA / interconnect does not become the bottleneck and full HDD throughput can be utilized

Page 44:

Throughput Calculations

• Hard disk performance – Type, RPM etc

• Data distribution and Type of Data access

• RAID performance, number of drives, RAID type

• RAID card performance – cache, active-active config etc

• ADE to switch connection speed

• Switch to HBA connection speed

• HBA to PCI bus speed
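A first-order way to use this list is to treat the data path as a chain and take the slowest link as the ceiling on sustained throughput. The sketch below does exactly that. Every figure in it is a hypothetical placeholder in MB/s, and the model ignores access patterns, caching and protocol overhead.

```python
def bottleneck(stages):
    """Return (name, MB/s) of the slowest stage in the data path."""
    return min(stages.items(), key=lambda item: item[1])

# Hypothetical figures in MB/s, for illustration only.
path = {
    "drives (12 x 70 MB/s sequential)": 12 * 70,
    "RAID controller": 600,
    "ADE to switch link": 400,
    "switch to HBA link": 400,
    "HBA to PCI bus": 1000,
}

name, rate = bottleneck(path)
print(f"limited to about {rate} MB/s by: {name}")
```

In this made-up configuration the drives could deliver more than the links can carry, so adding drives would not help until the interconnect is upgraded or multiplexed.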

Page 45:

That’s all Folks

“Let’s go build out our Yottabyte arrays and fill ‘em up”

[Considerably exaggerated hyperbole, given that the combined space of all computers in the world today (2007) doesn’t add up to 1 Yottabyte (2 ^ 80 bytes). In fact the entire world’s storage is projected to hit 988 exabytes (1 exabyte = 2 ^ 60 bytes) by 2010]

[6th Sep 2007 - http://www.networkworld.com/newsletters/stor/2007/0903stor2.html – Nanotech breakthrough could put entire YouTube contents on an iPod-size device]

Page 46:

Part II sneak preview

• Complex SAN configurations

• iSCSI

• NAS

• Clustered Storage

• GFS

• Backups

• Storage Monitoring

• Storage Benchmarking

• Some Commercial storage vendors

Page 47:

Intelligent People. Uncommon Ideas.

Shameless HR Propaganda Slide

• Directi builds cool Web products

• Deployed on distributed architecture

• Using terabytes of storage

• Used by millions of users

• Generating billions of pageviews and transactions

• Spanning every possible software engineering technology

http://careers.directi.com | http://wiki.directi.com | http://cosmos.directi.com

Personal Blog: http://bhavin.directi.com

Mail: [email protected]