Data sharding

Data Sharding Michał Gruchała [email protected] WebClusters 2011
Date posted: 17-Oct-2014

description

WebClusters'11 presentation about database sharding

Transcript of Data sharding

Page 1: Data sharding

Data Sharding

Michał Gruchała [email protected]

WebClusters 2011

Page 2: Data sharding

TODO

● Background
● Theory
● Practice
● Summary

Page 3: Data sharding

Background

Microblogging site
● user messages (blog)
● cockpit/wall

Classic architecture
● database
● web server(s)
● load balancer(s)

Page 4: Data sharding

Background

Web servers, load balancers
● one server
● ...
● 1000 servers
● not a problem

Database
● one database
● two databases (master -> slave)
● two databases (master <-> master)
● n databases (slave(s) <- master <-> master -> slave(s))

a lot of replication ;)

Page 5: Data sharding

Background

Replication
● increases read performance (raid1)
● increases data safety (raid1)
● does not increase the system's capacity (GBs)

Page 6: Data sharding

Background

Scalability

● stateless elements scale well
● stateful elements
  ○ quite easy to scale
    ■ if we want more reads (cache, replication)
  ○ hard to scale
    ■ if we want more writes
    ■ if we want more capacity

Page 7: Data sharding

Background

Sharding ;)

One database:
  A B C D E F G H I J K L

Split into shards:
  A B C D
  E F G H
  I J K L

Page 8: Data sharding

Theory

Page 9: Data sharding

Theory

Scaling
● Scale Back
  ○ delete, archive unused data
● Scale Up (vertical)
  ○ more power, more disks
● Scale Out (horizontal)
  ○ add machines
    ■ functional partitioning
    ■ replication
    ■ sharding

Page 10: Data sharding

Theory

Sharding
● split one big database into many smaller databases
  ○ spread rows
  ○ spread them across many servers
● shared-nothing partitioning
● not replication

Page 11: Data sharding

Theory

Sharding key

● shard by a key
● all data with that key will be on the same shard
● e.g. shard by user: all information connected to a user is on one shard (user info, messages, friends list)

user 1 -> shard 1
user 2 -> shard 2
user 3 -> shard 1
user 4 -> shard 2

● choosing the right key is very important!

Page 12: Data sharding

Theory

Sharding function

● maps keys to shards
● where to find the data
● where to store the data

shard number = sf(key)
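A minimal sketch of what sf(key) could look like, assuming a numeric user id or a string key such as an email address. The function names and the example address are illustrative only and do not come from the talk.

```python
import hashlib


def shard_by_modulo(numeric_id: int, shards_count: int) -> int:
    """Fixed sharding function for numeric keys: id modulo number of shards."""
    return numeric_id % shards_count


def shard_by_hash(key: str, shards_count: int) -> int:
    """Fixed sharding function for non-numeric keys: hash first, then modulo."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shards_count


print(shard_by_modulo(2, 2))  # 0
print(shard_by_modulo(3, 2))  # 1
# The hashed variant always returns the same shard for the same key.
print(shard_by_hash("bob@example.com", 4))
```

The same function is used both when storing and when looking up a row, which is why it has to be deterministic.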

Page 13: Data sharding

Theory

Sharding function

● Dynamic
  ○ Mapping in a database table
● Fixed
  ○ Modulo
    shard number = id % shards_count
  ○ Hash + Modulo
    shard number = md5(email) % shards_count
  ○ Consistent hashing
    http://en.wikipedia.org/wiki/Consistent_hashing
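Consistent hashing is only linked on the slide, so here is a rough sketch of the idea, not the speaker's implementation. The HashRing class, the use of md5 for ring positions, and the 100 virtual points per node are all assumptions made for illustration.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Position of a string on the ring (md5 used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=100):
        self._replicas = replicas  # virtual points per node (assumed value)
        self._points = []          # sorted ring positions
        self._owner = {}           # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            p = _point(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def node_for(self, key):
        # First ring point clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user{i}" for i in range(1000)]
ring = HashRing(["shard0", "shard1", "shard2"])
before = {k: ring.node_for(k) for k in keys}
ring.add("shard3")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(len(moved), "of", len(keys), "keys moved, all of them to the new shard")
```

With plain modulo, changing shards_count remaps most keys. On the ring, adding a node only takes keys away from existing nodes and hands them to the new one.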

Page 14: Data sharding

Theory

Advantages

● Linear write/read performance scalability (raid0)
● Capacity increase (raid0)
● Smaller databases are easier to manage
  ○ alter
  ○ backup/restore
  ○ truncate ;)
● Smaller databases are faster
  ○ as they may fit into memory
● Cost effective
  ○ 80 cores, 20 HDs, 80 GB RAM vs
  ○ 10 x (8 cores, 2 HDs, 8 GB RAM)

Page 15: Data sharding

Theory

Challenges

● Globally unique IDs
  ○ unique across all shards
    ■ auto_increment_increment, auto_increment_offset
    ■ global IDs table
  ○ not unique across shards
    ■ IDs in dbs - not unique
    ■ shard_number - unique
    ■ global unique ID = shard_number + db ID
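Two of the ID schemes above can be sketched roughly as follows. The first generator only imitates what MySQL's auto_increment_offset and auto_increment_increment settings do on each server. The second packs shard_number and the local ID into one integer; the 48-bit width and the function names are assumptions for illustration, since the slide does not say how the two parts are combined.

```python
from itertools import count, islice


def interleaved_ids(offset: int, increment: int):
    """Imitates auto_increment_offset / auto_increment_increment on one shard."""
    return count(start=offset, step=increment)


LOCAL_BITS = 48  # assumed width reserved for the per-shard ID


def make_global_id(shard_number: int, local_id: int) -> int:
    """global unique ID = shard_number + db ID, packed into one integer."""
    return (shard_number << LOCAL_BITS) | local_id


def split_global_id(global_id: int):
    """Recover (shard_number, local_id) from a packed global ID."""
    return global_id >> LOCAL_BITS, global_id & ((1 << LOCAL_BITS) - 1)


# Two shards, increment 2: the sequences never collide.
print(list(islice(interleaved_ids(1, 2), 3)))  # [1, 3, 5]
print(list(islice(interleaved_ids(2, 2), 3)))  # [2, 4, 6]

# Local ID 7 exists on both shards, but the global IDs differ.
print(split_global_id(make_global_id(0, 7)))  # (0, 7)
print(split_global_id(make_global_id(1, 7)))  # (1, 7)
```

A side effect of the packed form is that the shard can be read straight from the ID, without calling the sharding function.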

Page 16: Data sharding

Challenges

Re-sharding

● consistent hashing or
● more shards than machines/nodes (e.g. 100 shards on 10 machines)

9 shards on 3 machines:  1,4,7 | 2,5,8 | 3,6,9
9 shards on 5 machines:  1,6 | 2,7 | 3,8 | 4,9 | 5
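The layout in the diagram can be reproduced with a simple placement rule. This is a sketch: the rule (shard - 1) % machines_count is inferred from the numbers on the slide, not stated by the speaker, and the function names are made up.

```python
def machine_for(shard: int, machines_count: int) -> int:
    """Place shard numbers 1..N on machines 0..machines_count-1, round-robin."""
    return (shard - 1) % machines_count


def layout(shards, machines_count: int):
    """Return, per machine, the list of shards it hosts."""
    result = [[] for _ in range(machines_count)]
    for shard in shards:
        result[machine_for(shard, machines_count)].append(shard)
    return result


shards = range(1, 10)      # 9 logical shards
print(layout(shards, 3))   # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(layout(shards, 5))   # [[1, 6], [2, 7], [3, 8], [4, 9], [5]]
```

The key-to-shard mapping never changes here, because the number of shards stays at 9. Only whole shards are moved between machines, so no rows have to be re-hashed.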

Page 17: Data sharding

Challenges

Cross-shard

● queries
  ○ sent to many shards
  ○ collect result from one
  ○ avoidable (better sharding key, more sharding keys)
● joins
  ○ send query to many shards
  ○ join results in an application
  ○ sometimes unavoidable
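A rough sketch of a cross-shard join done in the application. In-memory lists stand in for the shard databases, the sample rows and function names are hypothetical, and message IDs are assumed to be globally unique so that they can be used for ordering.

```python
# Unsharded lookup table and two message shards (stand-ins for real databases).
USERS = {1: "John", 2: "Bob", 3: "Andy"}
MESSAGE_SHARDS = [
    [{"id": 1, "owner": 2, "message": "M1"}, {"id": 3, "owner": 2, "message": "M3"}],
    [{"id": 2, "owner": 1, "message": "M2"}, {"id": 4, "owner": 3, "message": "M4"}],
]


def query_shard(shard):
    """Stand-in for running the same SELECT on one shard."""
    return list(shard)


def latest_messages_with_login(limit: int):
    # Scatter: send the query to every shard.
    partials = [query_shard(shard) for shard in MESSAGE_SHARDS]
    # Gather: merge the partial results and order them in the application.
    merged = sorted(
        (row for partial in partials for row in partial),
        key=lambda row: row["id"],
        reverse=True,
    )[:limit]
    # Join: attach the login from the user table, again in the application.
    return [(USERS[row["owner"]], row["message"]) for row in merged]


print(latest_messages_with_login(3))
# [('Andy', 'M4'), ('Bob', 'M3'), ('John', 'M2')]
```

Everything the database would normally do for a join (merging, sorting, limiting) moves into application code, which is the complexity cost mentioned later in the talk.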

Page 18: Data sharding

Challenges

Network

● more machines, more smaller streams
● full-mesh between webservers and shards
● pconnect vs. connect

Complexity

● usually sharding is done in application logic

Page 19: Data sharding

Practice

Page 20: Data sharding

Practice

Microblogging site
● see a user's messages
● see stream/wall

Classic architecture
● database
● web server(s)
● load balancer(s)

Page 21: Data sharding

Practice

Data

User
id  login
1   John
2   Bob
3   Andy
4   Claire
5   Megan

Message
id  owner  message
1   2      M1
2   1      M2
3   2      M3
4   3      M4
5   2      M5

Follow
who  whose
1    2
3    4
3    2
1    3
5    2
2    1
1    5
4    3
4    1

John's messages?
John's follows?

Page 22: Data sharding

Practice

User
● no need for sharding

Message
● sharded by user (owner field)
● shard_number = owner % 2

Follow
● sharded by user (who field)
● shard_number = who % 2

2 shards, 3 machines

  machine 1:           User
  machine 2 (shard0):  Message, Follow
  machine 3 (shard1):  Message, Follow
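Applying the two rules above to the sample Message and Follow rows can be sketched like this. Tuples stand in for table rows and the split helper is hypothetical.

```python
SHARDS_COUNT = 2

# (id, owner, message)
MESSAGES = [(1, 2, "M1"), (2, 1, "M2"), (3, 2, "M3"), (4, 3, "M4"), (5, 2, "M5")]
# (who, whose)
FOLLOWS = [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (2, 1), (1, 5), (4, 3), (4, 1)]


def split(rows, key_index: int):
    """Distribute rows over shards using rows[key_index] % SHARDS_COUNT."""
    shards = [[] for _ in range(SHARDS_COUNT)]
    for row in rows:
        shards[row[key_index] % SHARDS_COUNT].append(row)
    return shards


message_shards = split(MESSAGES, key_index=1)  # shard by owner
follow_shards = split(FOLLOWS, key_index=0)    # shard by who

print(message_shards[0])  # [(1, 2, 'M1'), (3, 2, 'M3'), (5, 2, 'M5')]
print(message_shards[1])  # [(2, 1, 'M2'), (4, 3, 'M4')]
print(follow_shards[0])   # [(2, 1), (4, 3), (4, 1)]
print(follow_shards[1])   # [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (1, 5)]
```

All of Bob's rows (id 2) end up on shard0, and all rows for John, Andy and Megan on shard1.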

Page 23: Data sharding

Practice

User (not sharded)
id  login
1   John
2   Bob
3   Andy
4   Claire
5   Megan

shard0

Message
id  owner  message
1   2      M1
3   2      M3
5   2      M5

Follow
who  whose
2    1
4    3
4    1

shard1

Message
id  owner  message
2   1      M2
4   3      M4

Follow
who  whose
1    2
3    4
3    2
1    3
5    2
1    5

mapping?

Page 24: Data sharding

Practice

Bob's blog

● Bob's messages
  ○ find Bob's id in User table (id = 2)
  ○ find Bob's shard (2%2 = 0, shard0)
  ○ fetch Messages (shard0) where owner = 2
● People Bob follows
  ○ find Bob's id in User table (id = 2)
  ○ find Bob's shard (2%2 = 0, shard0)
  ○ fetch whose id from Follow table (shard0)
  ○ fetch people info from User table
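The lookup steps above can be sketched as follows, with in-memory structures standing in for the User table and the two shards. The function names are hypothetical.

```python
USERS = {1: "John", 2: "Bob", 3: "Andy", 4: "Claire", 5: "Megan"}

# Index 0 is shard0, index 1 is shard1.
MESSAGE_SHARDS = [
    [(1, 2, "M1"), (3, 2, "M3"), (5, 2, "M5")],
    [(2, 1, "M2"), (4, 3, "M4")],
]
FOLLOW_SHARDS = [
    [(2, 1), (4, 3), (4, 1)],
    [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (1, 5)],
]


def user_id(login: str) -> int:
    """Step 1: find the user's id in the (unsharded) User table."""
    return next(uid for uid, name in USERS.items() if name == login)


def shard_of(uid: int) -> int:
    """Step 2: find the user's shard."""
    return uid % 2


def messages_of(login: str):
    """Step 3: fetch messages from that single shard."""
    uid = user_id(login)
    shard = MESSAGE_SHARDS[shard_of(uid)]
    return [message for _id, owner, message in shard if owner == uid]


def followed_by(login: str):
    """Fetch 'whose' ids from one shard, then resolve them in the User table."""
    uid = user_id(login)
    shard = FOLLOW_SHARDS[shard_of(uid)]
    return [USERS[whose] for who, whose in shard if who == uid]


print(messages_of("Bob"))  # ['M1', 'M3', 'M5']
print(followed_by("Bob"))  # ['John']
```

Both questions are answered from one shard plus the User table, because the sharding key (owner, who) matches the question being asked.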

Page 25: Data sharding

Practice

(Same tables as on Page 23: User, plus Message and Follow split across shard0 and shard1.)

Page 26: Data sharding

Practice

Who follows Andy ?

● find Andy's id in User table (id=3)
● find Andy's shard (3%2 = 1, shard1)
● hmmm
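Why this lookup is awkward can be sketched in code. Follow is sharded by who, but the question filters on whose, so every shard has to be asked. The second half shows one possible way out, in the spirit of "more sharding keys" from the Cross-shard slide: a second copy of Follow sharded by whose. That second copy and all function names are assumptions, not something the talk prescribes.

```python
USERS = {1: "John", 2: "Bob", 3: "Andy", 4: "Claire", 5: "Megan"}
FOLLOW_SHARDS = [  # (who, whose), sharded by who % 2
    [(2, 1), (4, 3), (4, 1)],
    [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (1, 5)],
]


def user_id(login: str) -> int:
    return next(uid for uid, name in USERS.items() if name == login)


def followers_of(login: str):
    """Cross-shard query: followers can live on any shard, so ask them all."""
    uid = user_id(login)
    found = []
    for shard in FOLLOW_SHARDS:
        found.extend(who for who, whose in shard if whose == uid)
    return sorted(USERS[who] for who in found)


def build_reverse_shards(follow_shards, shards_count: int = 2):
    """Hypothetical second copy of Follow, sharded by whose instead of who."""
    reverse = [[] for _ in range(shards_count)]
    for shard in follow_shards:
        for who, whose in shard:
            reverse[whose % shards_count].append((who, whose))
    return reverse


FOLLOWER_SHARDS = build_reverse_shards(FOLLOW_SHARDS)


def followers_of_single_shard(login: str):
    """Same answer from one shard, at the price of writing each row twice."""
    uid = user_id(login)
    shard = FOLLOWER_SHARDS[uid % 2]
    return sorted(USERS[who] for who, whose in shard if whose == uid)


print(followers_of("Andy"))               # ['Claire', 'John']
print(followers_of_single_shard("Andy"))  # ['Claire', 'John']
```

The trade-off is that every new follow now has to be written to two shards, and the two copies have to be kept consistent.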

Page 27: Data sharding

Practice

(Same tables as on Page 23: User, plus Message and Follow split across shard0 and shard1.)

Cross-shard query!

Page 28: Data sharding

Practice

(Same tables as on Page 23: User, plus Message and Follow split across shard0 and shard1.)

Ideas?

Page 29: Data sharding

Summary

Page 30: Data sharding

Summary

Shard or not to shard

● many reads, few writes? - don't
● many writes and no capacity problems? - don't (use SSD)
● capacity problems? - yes
● many writes and capacity problems? - yes
● scale-up is affordable? - don't shard

As you see... it depends!

Page 31: Data sharding

Summary

If you have to shard

● always use sharding + replication = raid10
  ○ sharding reduces high availability (like raid0)
● more shards than you need
  ○ e.g. 4 machines, 100 shards
  ○ or dynamic allocation
● think of network capacity (full-mesh)
  ○ load sharding (google it ;))
● sharding key - important!
  ○ cross-shard queries
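The "sharding + replication = raid10" advice could look roughly like this at the routing level. It is a sketch only: the ReplicatedShard class, the route function and the server names are hypothetical, and modulo is assumed as the sharding function.

```python
import random


class ReplicatedShard:
    """One shard = one master plus zero or more read-only slaves."""

    def __init__(self, master, slaves=()):
        self.master = master
        self.slaves = list(slaves)

    def pick(self, for_write: bool):
        # Writes always go to the master; reads are spread over the slaves.
        if for_write or not self.slaves:
            return self.master
        return random.choice(self.slaves)


def route(shards, user_id: int, for_write: bool):
    """Pick the shard by key first (raid0), then a replica inside it (raid1)."""
    shard = shards[user_id % len(shards)]
    return shard.pick(for_write)


shards = [
    ReplicatedShard("db0-master", ["db0-slave1", "db0-slave2"]),
    ReplicatedShard("db1-master", ["db1-slave1"]),
]

print(route(shards, 2, for_write=True))   # db0-master
print(route(shards, 3, for_write=False))  # db1-slave1
```

Losing a single machine then no longer makes a whole shard's data unavailable, which is the raid0 weakness the slide warns about.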

Page 32: Data sharding

Wake Up!

Thanks

Questions?