The Art of Database Sharding

Maxym Kharchenko

Amazon.com

April 22-26, 2012

Mandalay Bay Convention Center

Las Vegas, Nevada, USA

http://www.collaborate12.org/

http://www.collaborate12.ioug.org/

When your data grows …

Old System

New System

Problem

One machine is not enough

The Big Data problem

Vertical Scaling

Scaling Up …

Scaled!

What you get when you scale up

2+2=5

What you get when you scale up

2+2=3

Scale out, not up

[Chart: Difficulty (1 to 10,000,000) vs. Number of machines (0 to 5); annotation: Running on >1 machines]

Courtesy: John Rauser @amazon.com

Distributed computing is hard

Distributed System

Sharded System

Sharding is (relatively) easy

Split your data into small independent chunks

And run each chunk on cheap commodity hardware

How to split your data

Step 1: Split off different things

Vertical Partitioning

Step 2: Choose sharding key and function

Sharding

Bad Sharding

[Chart: Last Names Distribution, by first letter A to Z]

[Chart: Shard Size, shards 1 to 4]

Can we partition Collaborate participants by last name?

CREATE TABLE Collaborate_Participants (
  last_name   varchar2(30) PRIMARY KEY,
  signup_date date
)

Avalanche Effect

Bad Distribution

Good Distribution

e.g. MD5
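The avalanche effect can be checked with a few lines of Python. This is a minimal sketch, assuming the sharding key is a string such as a last name; the helper names `hashed_id` and `differing_bits` are hypothetical and not from the slides.

```python
import hashlib


def hashed_id(key: str) -> int:
    """Return the 128-bit MD5 digest of the key as a non-negative integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def differing_bits(a: str, b: str) -> int:
    """Count how many of the 128 digest bits differ between two keys."""
    return bin(hashed_id(a) ^ hashed_id(b)).count("1")


if __name__ == "__main__":
    # Two last names that differ by a single letter.
    print(differing_bits("Smith", "Smyth"))
```

With a well-mixing hash, keys that look almost identical end up with unrelated hashed ids, which is what spreads skewed inputs such as last names evenly.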

Step 3: Make enough shards

Hashes and Buckets

[Diagram: Good Distribution, then MOD into buckets]
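Hashing followed by MOD can be sketched as below. The function names `shard_for` and `shard_counts` and the synthetic `user-N` keys are assumptions made for illustration; MD5 is used only because the slides mention it.

```python
import hashlib


def shard_for(key: str, n_shards: int) -> int:
    """Hash the key with MD5, then take MOD to pick a bucket in 0..n_shards-1."""
    h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return h % n_shards


def shard_counts(keys, n_shards: int) -> list:
    """Count how many keys land in each bucket."""
    counts = [0] * n_shards
    for key in keys:
        counts[shard_for(key, n_shards)] += 1
    return counts


if __name__ == "__main__":
    # Synthetic keys, used only to eyeball how even the buckets are.
    keys = [f"user-{i}" for i in range(10000)]
    print(shard_counts(keys, 4))
```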

Resharding

3 shards

Hashed_id   Shard: mod(hashed_id, 3)
1           1
2           2
3           0
4           1
5           2
6           0
7           1
8           2
9           0
10          1
11          2
12          0

Adding 4th shard

Hashed_id   Old Shard: mod(hashed_id, 3)   New Shard: mod(hashed_id, 4)
1           1                              1
2           2                              2
3           0                              3
4           1                              0
5           2                              1
6           0                              2
7           1                              3
8           2                              0
9           0                              1
10          1                              2
11          2                              3
12          0                              0

75% bad: 9 of the 12 keys land on a different shard
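The 75% figure can be reproduced from the hashed ids 1 through 12 used on the slide. This is a minimal sketch; the function name `moved_fraction` is hypothetical.

```python
def moved_fraction(hashed_ids, old_n: int, new_n: int) -> float:
    """Fraction of ids whose shard changes when the shard count changes."""
    ids = list(hashed_ids)
    moved = sum(1 for h in ids if h % old_n != h % new_n)
    return moved / len(ids)


if __name__ == "__main__":
    # Hashed ids 1..12, going from 3 shards to 4.
    print(moved_fraction(range(1, 13), 3, 4))  # 0.75
```

Only ids 1, 2 and 12 keep their shard number; every other row has to be physically relocated.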

Logical Shards

[Diagram: Good Distribution, MOD]
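One way to read the logical shards idea: MOD by a fixed, generously sized number of logical shards, and keep a small lookup table from logical to physical shard. The sketch below assumes 8 logical shards numbered 0 to 7 and two hypothetical mapping tables, `BEFORE` and `AFTER`; none of these values come from the slides.

```python
N_LOGICAL = 8  # assumption: chosen once and never changed


def logical_shard(hashed_id: int) -> int:
    """The logical shard depends only on the key, never on the hardware."""
    return hashed_id % N_LOGICAL


# Hypothetical lookup tables: logical shard -> physical shard.
BEFORE = {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2}
AFTER = {0: 1, 1: 1, 2: 3, 3: 3, 4: 2, 5: 2, 6: 4, 7: 4}


def physical_shard(hashed_id: int, mapping: dict) -> int:
    """Route a key through its logical shard to a physical shard."""
    return mapping[logical_shard(hashed_id)]


def relocated(before: dict, after: dict) -> list:
    """Logical shards that must be copied when the mapping changes."""
    return sorted(l for l in before if before[l] != after[l])


if __name__ == "__main__":
    print(relocated(BEFORE, AFTER))  # [2, 3, 6, 7]
```

Growing from 2 to 4 physical shards then means moving whole logical shards and updating the lookup table, rather than rehashing every row.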

Implementing Shards: Standbys

Unsharded

Standby

Shard 1 Shard 2

Apps

Read Only

Implementing Shards: Tables

Shard 1

Apps

TabA

Shard 2

MVA

TabA

Create materialized view … as select … from a@shard1

Drop materialized view … preserve table

Read Only

Why shards are awesome

• Small data, small load
  – Better caching, faster queries
  – Smaller load, fewer surprises
  – Faster maintenance, e.g. restores
• Eggs not in one basket:
  – Availability redefined
  – Safer maintenance
• Multiple points of view:
  – SQL performance
  – System load

Why shards are NOT so great

• More systems
  – Power, rack space etc
  – Needs automation … bad
  – More likely to fail overall
• Some operations become impractical:
  – Joins across shards
  – Foreign keys across shards
• More work:
  – Applications, developers, DBAs
  – High skill, DIY everything

Thank you

Implementing Shards: Moving “data head”

Shard 1

Apps

Shard 2

Logical Shard   Physical Shard
(1,2,3,4)       1
(5,6,7,8)       2

Time   Logical Shard   Physical Shard
2011   (1,2,3,4)       1
2011   (5,6,7,8)       2

Time   Logical Shard   Physical Shard
2011   (1,2,3,4)       1
2011   (5,6,7,8)       2
2012   (1,2)           1
2012   (3,4)           3
2012   (5,6)           2
2012   (7,8)           4

Shard 3 Shard 4
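The time-aware mapping above can be sketched as a lookup keyed by (time, logical shard). The interpretation is an assumption: rows written in 2011 stay on their original physical shard, and only rows written from 2012 on follow the new mapping. The names `MAPPING`, `physical_shard` and `shards_holding` are hypothetical.

```python
# Rows copied from the slide's table: (time, logical shards, physical shard).
MAPPING = [
    (2011, (1, 2, 3, 4), 1),
    (2011, (5, 6, 7, 8), 2),
    (2012, (1, 2), 1),
    (2012, (3, 4), 3),
    (2012, (5, 6), 2),
    (2012, (7, 8), 4),
]

# Flatten into a direct lookup: (time, logical shard) -> physical shard.
ROUTES = {
    (year, logical): physical
    for year, logicals, physical in MAPPING
    for logical in logicals
}


def physical_shard(year: int, logical: int) -> int:
    """Where a row written in the given year for this logical shard lives."""
    return ROUTES[(year, logical)]


def shards_holding(logical: int) -> list:
    """All physical shards a read across all years has to visit."""
    return sorted({p for (_, l), p in ROUTES.items() if l == logical})


if __name__ == "__main__":
    print(physical_shard(2011, 3), physical_shard(2012, 3))  # 1 3
    print(shards_holding(3))  # [1, 3]
```

Under this reading no existing data is copied: old rows stay put, and only the "head" of new writes moves to Shard 3 and Shard 4.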

Bad Sharding. Example 2

order_id:10000 - 20000

order_id:20001 - 30000

order_id:30001 - 40000

order_id:40001 - 50000

CREATE TABLE Orders (
  order_id       number PRIMARY KEY,
  customer_fname varchar2(30),
  customer_lname varchar2(30),
  order_date     date
)

Can we shard customers by meaningless sequence?
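The slide does not spell out why this is bad. One common reading is that a sequence-generated order_id sends every new order to the last range, so a single shard takes all current writes. The sketch below assumes the four ranges shown above map to shards numbered 1 to 4; the shard numbers and the name `range_shard` are hypothetical.

```python
import bisect

# Upper bounds of the order_id ranges from the slide.
UPPER_BOUNDS = [20000, 30000, 40000, 50000]
LOWEST_ID = 10000


def range_shard(order_id: int) -> int:
    """Return the (assumed) shard number 1..4 whose range contains order_id."""
    if not LOWEST_ID <= order_id <= UPPER_BOUNDS[-1]:
        raise ValueError(f"order_id {order_id} is outside every range")
    return bisect.bisect_left(UPPER_BOUNDS, order_id) + 1


if __name__ == "__main__":
    # The 1,000 most recent ids from a sequence all hit the same shard.
    newest = range(49001, 50001)
    print({range_shard(i) for i in newest})  # {4}
```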
