Advanced Sharding Features in MongoDB 2.4

Software Engineer, 10gen

Jeremy Mikola

#MongoDBDays


jmikola

Sharded cluster

Sharding is a powerful way to scale your database…

MongoDB 2.4 adds some new features to get more out of it.

Agenda

• Shard keys
  – Desired properties
  – Evaluating shard key choices

• Hashed shard keys
  – Why and how to use hashed shard keys
  – Limitations

• Tag-aware sharding
  – How it works
  – Use case examples

Shard Keys

What is a shard key?

• Incorporates one or more fields

• Used to partition your collection

• Must be indexed and exist in every document

• Definition and values are immutable

• Used to route requests to shards

Cluster request routing

• Targeted queries

• Scatter/gather queries

• Scatter/gather queries with sort

Cluster request routing: writes

• Inserts
  – Shard key required
  – Targeted query

• Updates and removes
  – Shard key optional for multi-document operations
  – May be targeted or scattered

Cluster request routing: reads

• Queries
  – With shard key: targeted
  – Without shard key: scatter/gather

• Sorted queries
  – With shard key: targeted in order
  – Without shard key: distributed merge sort

Cluster request routing: targeted query

Routable request received

Request routed to appropriate shard

Shard returns results

Mongos returns results to client

Cluster request routing: scattered query

Non-targeted request received

Request sent to all shards

Shards return results to mongos

Mongos returns results to client
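
The two routing paths above can be sketched as a lookup against chunk ranges. This is only an illustration, not how mongos is implemented: the `chunks` table, the `routeQuery` helper, and the shard names are hypothetical, and the sketch assumes a single numeric shard key field named `user`.

```javascript
// Hypothetical chunk table: each chunk covers [min, max) of the shard key "user".
const chunks = [
  { min: -Infinity, max: 100, shard: "shard0000" },
  { min: 100, max: 200, shard: "shard0001" },
  { min: 200, max: Infinity, shard: "shard0002" },
];

// Return the list of shards a query must be sent to.
function routeQuery(query) {
  if (typeof query.user === "number") {
    // Shard key present: target the single shard that owns the key.
    const chunk = chunks.find((c) => query.user >= c.min && query.user < c.max);
    return [chunk.shard];
  }
  // Shard key absent: scatter to every shard and gather the results.
  return [...new Set(chunks.map((c) => c.shard))];
}

const targeted = routeQuery({ user: 123 });
const scattered = routeQuery({ subject: "hello" });
console.log(targeted);
console.log(scattered);
```

A query that includes the shard key touches one shard; a query without it touches all three.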

Distributed merge sort
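
A distributed merge sort can be pictured as each shard returning its own sorted results and the router merging them. The sketch below assumes that model; `mergeSorted` and the sample data are hypothetical and do not come from mongos itself.

```javascript
// Merge per-shard result lists, each already sorted, into one sorted list.
function mergeSorted(shardResults, compare) {
  // One read position per shard.
  const positions = shardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < shardResults.length; i++) {
      if (positions[i] >= shardResults[i].length) continue; // shard exhausted
      if (
        best === -1 ||
        compare(shardResults[i][positions[i]], shardResults[best][positions[best]]) < 0
      ) {
        best = i;
      }
    }
    if (best === -1) return merged; // every shard exhausted
    merged.push(shardResults[best][positions[best]++]);
  }
}

// Each shard returns its documents sorted by time.
const shard1 = [{ time: 1 }, { time: 4 }, { time: 9 }];
const shard2 = [{ time: 2 }, { time: 3 }];
const shard3 = [{ time: 5 }];
const sorted = mergeSorted([shard1, shard2, shard3], (a, b) => a.time - b.time);
console.log(sorted.map((d) => d.time).join(","));
```

The router never re-sorts the full result set; it only compares the next unread document from each shard.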

Shard key considerations

• Cardinality

• Write Distribution

• Query Isolation

• Reliability

• Index Locality

Request distribution and index locality

[Diagram: mongos distributing requests across Shard 1, Shard 2, and Shard 3]

{
  _id: ObjectId(),
  user: 123,
  time: Date(),
  subject: "…",
  recipients: [],
  body: "…",
  attachments: []
}

Example: email storage

Most common scenario, can be applied to 90% of cases

Each document can be up to 16MB

Each user may have GBs of storage

Most common query: get user emails sorted by time

Indexes on {_id}, {user, time}, {recipients}


Shard key  | Cardinality | Write scaling | Query isolation | Reliability | Index locality
_id        |             |               |                 |             |
hash(_id)  |             |               |                 |             |
user       |             |               |                 |             |
user, time |             |               |                 |             |

ObjectId composition

ObjectId("51597ca8e28587b86528edfd")

12 bytes: Timestamp | Host | PID | Counter
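
To make the composition concrete, the example value can be split into its four fields. The byte widths are an assumption, since the slide does not state them: 4 bytes of timestamp, 3 of host, 2 of PID, and 3 of counter, which is the ObjectId layout documented for MongoDB at the time. `parseObjectId` is a hypothetical helper.

```javascript
// Split a 24-character hex ObjectId into its four fields.
// Assumed widths (not stated on the slide): 4-byte timestamp, 3-byte host,
// 2-byte PID, 3-byte counter.
function parseObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    timestamp: parseInt(hex.slice(0, 8), 16), // seconds since the Unix epoch
    host: hex.slice(8, 14),
    pid: parseInt(hex.slice(14, 18), 16),
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

const parts = parseObjectId("51597ca8e28587b86528edfd");
console.log(parts);
console.log(new Date(parts.timestamp * 1000).toISOString());
```

Because the leading bytes are a timestamp, newly generated values keep increasing, which matters when `_id` is used as a shard key.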

Sharding on ObjectId

// enable sharding on test database
mongos> sh.enableSharding("test")
{ "ok" : 1 }

// shard the test collection
mongos> sh.shardCollection("test.test", { _id: 1 })
{ "collectionsharded" : "test.test", "ok" : 1 }

// insert many documents in a loop
mongos> for (x=0; x<10000; x++) db.test.insert({ value: x });

shards:
  { "_id" : "shard0000", "host" : "localhost:30000" }
  { "_id" : "shard0001", "host" : "localhost:30001" }
databases:
  { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
    test.test
      shard key: { "_id" : 1 }
      chunks:
        shard0001 2
      { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("…") } on : shard0001 { "t" : 1000, "i" : 1 }
      { "_id" : ObjectId("…") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 1000, "i" : 2 }

Uneven chunk distribution

Incremental values lead to a hot shard

[Diagram: chunk ranges minKey to 0 and 0 to maxKey]



Shard key  | Cardinality | Write scaling | Query isolation | Reliability         | Index locality
_id        | Doc level   | One shard     | Scatter/gather  | All users affected  | Good
hash(_id)  | Hash level  | All shards    | Scatter/gather  | All users affected  | Poor
user       | Many docs   | All shards    | Targeted        | Some users affected | Good
user, time | Doc level   | All shards    | Targeted        | Some users affected | Good

Hashed Shard Keys

Why is this relevant?

• Documents may not already have a suitable value

• Hashing allows us to utilize an existing field

• More efficient index storage
  – At the expense of locality

Hashed shard keys

{x:2}  md5  c81e728d9d4c2f636f067f89cc14862c
{x:3}  md5  eccbc87e4b5ce2fe28308fd9f2a7baf3
{x:1}  md5  c4ca4238a0b923820dcc509a6f75849b

[Diagram: chunk ranges minKey to 0 and 0 to maxKey]

Hashed shard keys avoid a hot shard
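
The digests on this slide match the MD5 of the values written as decimal strings, which is easy to reproduce. The sketch below uses Node's built-in crypto module and a hypothetical `md5Hex` helper. It only illustrates how adjacent inputs land far apart; MongoDB itself hashes the BSON element, not this string form.

```javascript
const { createHash } = require("node:crypto");

// MD5 of the decimal string form of a value, as shown on the slide.
// Assumption: illustration only. The server hashes the BSON element
// (type and value), so its hashed index values will differ.
function md5Hex(value) {
  return createHash("md5").update(String(value)).digest("hex");
}

for (const x of [1, 2, 3]) {
  console.log(`{x:${x}} -> ${md5Hex(x)}`);
}
```

Consecutive inputs produce unrelated digests, so inserts with increasing keys spread across the hash range instead of piling onto the last chunk.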

Under the hood

• Create a hashed index for use with sharding

• Contains first 64 bits of a field’s md5 hash

• Considers BSON type and value

• Represented as NumberLong in the JS shell

// hash on 1 as an integer
> db.runCommand({ _hashBSONElement: 1 })
{
  "key" : 1,
  "seed" : 0,
  "out" : NumberLong("5902408780260971510"),
  "ok" : 1
}

// hash on "1" as a string
> db.runCommand({ _hashBSONElement: "1" })
{
  "key" : "1",
  "seed" : 0,
  "out" : NumberLong("-2448670538483119681"),
  "ok" : 1
}

Hashing BSON elements

Using hashed indexes

• Create index:
  – db.collection.ensureIndex({ field : "hashed" })

• Options:
  – seed: specify a hash seed to use (default: 0)
  – hashVersion: currently supports only version 0 (md5)

Using hashed shard keys

• Enable sharding on collection:
  – sh.shardCollection("test.collection", { field: "hashed" })

• Options:
  – numInitialChunks: chunks to create (default: 2 per shard)

// enable sharding on test database
mongos> sh.enableSharding("test")
{ "ok" : 1 }

// shard by hashed _id field
mongos> sh.shardCollection("test.hash", { _id: "hashed" })
{ "collectionsharded" : "test.hash", "ok" : 1 }

Sharding on hashed ObjectId

databases:
  { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
    test.hash
      shard key: { "_id" : "hashed" }
      chunks:
        shard0000 2
        shard0001 2
      { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611...") } on : shard0000 { "t" : 2000, "i" : 2 }
      { "_id" : NumberLong("-4611...") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 3 }
      { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611...") } on : shard0001 { "t" : 2000, "i" : 4 }
      { "_id" : NumberLong("4611...") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 2000, "i" : 5 }
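
With two shards and the default of two initial chunks per shard, the boundaries in the output above (`-4611...`, `0`, `4611...`) are consistent with dividing the signed 64-bit hash range into four equal parts. A rough sketch of that arithmetic follows; `evenSplitPoints` is a hypothetical helper, and the server's exact rounding may differ in the last digits.

```javascript
// Evenly spaced split points across the signed 64-bit hash range.
// Assumption: illustrative arithmetic only, not the server's own code.
function evenSplitPoints(numChunks) {
  const min = -(2n ** 63n); // smallest signed 64-bit value
  const span = 2n ** 64n;   // size of the whole hash range
  const points = [];
  for (let i = 1n; i < BigInt(numChunks); i++) {
    points.push(min + (span * i) / BigInt(numChunks));
  }
  return points;
}

const splitPoints = evenSplitPoints(4);
console.log(splitPoints.map(String));
```

Four chunks need three split points, and the middle one falls at zero, as in the chunk listing.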

Pre-splitting the data

test.hash
  shard key: { "_id" : "hashed" }
  chunks:
    shard0000 4
    shard0001 4
  { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374...") } on : shard0000 { "t" : 2000, "i" : 8 }
  { "_id" : NumberLong("-7374...") } -->> { "_id" : NumberLong("-4611...") } on : shard0000 { "t" : 2000, "i" : 9 }
  { "_id" : NumberLong("-4611…") } -->> { "_id" : NumberLong("-2456…") } on : shard0000 { "t" : 2000, "i" : 6 }
  { "_id" : NumberLong("-2456…") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 7 }
  { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483…") } on : shard0001 { "t" : 2000, "i" : 12 }

Even chunk distribution after insertions

Hashed keys are great for equality queries

• Equality queries routed to a specific shard

• Will make use of the hashed index

• Most efficient query possible

mongos> db.hash.find({ x: 1 }).explain()
{
  "cursor" : "BtreeCursor x_hashed",
  "n" : 1,
  "nscanned" : 1,
  "nscannedObjects" : 1,
  "numQueries" : 1,
  "numShards" : 1,
  "indexBounds" : {
    "x" : [
      [
        NumberLong("5902408780260971510"),
        NumberLong("5902408780260971510")
      ]
    ]
  },
  "millis" : 0
}

Explain plan of an equality query

But not so good for range queries

• Range queries will be scatter/gather

• Cannot utilize a hashed index
  – Supplemental, ordered index may be used at the shard level

• Inefficient query pattern

mongos> db.hash.find({ x: { $gt: 1, $lt: 99 }}).explain()
{
  "cursor" : "BasicCursor",
  "n" : 97,
  "nscanned" : 1000,
  "nscannedObjects" : 1000,
  "numQueries" : 2,
  "numShards" : 2,
  "millis" : 3
}

Explain plan of a range query

Other limitations of hashed indexes

• Cannot be used in compound or unique indexes

• No support for multi-key indexes (i.e. array values)

• Incompatible with tag aware sharding
  – Tags would be assigned hashed values, not the original key

• Will not overcome keys with poor cardinality
  – Floating point numbers are truncated before hashing
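
The truncation point is worth seeing with numbers: values that differ only after the decimal point end up with the same hashed key. The sketch below uses a stand-in hash (MD5 of the truncated value's decimal string, first 64 bits) rather than the server's BSON-based function, and `hashedKey` is a hypothetical helper.

```javascript
const { createHash } = require("node:crypto");

// Stand-in for the hashed-index function: truncate to an integer, then hash.
// Assumption: the real server hashes a BSON element; MD5 of the decimal
// string is used here only to show the effect of truncation.
function hashedKey(value) {
  const truncated = Math.trunc(value);
  return createHash("md5").update(String(truncated)).digest("hex").slice(0, 16);
}

const keys = [2.2, 2.3, 2.9].map(hashedKey);
console.log(keys);
console.log(hashedKey(3.1));
```

All three values between 2 and 3 collapse to one key, so a field of closely spaced floating point numbers gives a hashed shard key with poor cardinality.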

Summary

• There are multiple approaches for sharding

• Hashed shard keys give great distribution

• Hashed shard keys are good for equality queries

• Pick a shard key that best suits your application

Tag Aware Sharding

Global scenario

Single database

Optimal architecture

Tag aware sharding

• Associate shard key ranges with specific shards

• Shards may have multiple tags, and vice versa

• Dictates behavior of the balancer process

• No relation to replica set member tags

// tag a shard
mongos> sh.addShardTag("shard0001", "APAC")

// shard by country code and user ID
mongos> sh.shardCollection("test.tas", { c: 1, uid: 1 })
{ "collectionsharded" : "test.tas", "ok" : 1 }

// tag a shard key range
mongos> sh.addTagRange("test.tas",
... { c: "aus", uid: MinKey },
... { c: "aut", uid: MaxKey },
... "APAC"
... )


Configuring tag aware sharding
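
One way to read the tag range above is as a half-open interval over compound shard keys: the lower bound is inclusive and the upper bound is exclusive. The sketch below models that with simplified comparisons. The `MinKey` and `MaxKey` symbols, `compareKey`, and `tagFor` are hypothetical stand-ins, not the shell's own objects or the balancer's logic.

```javascript
// Sentinels standing in for the shell's MinKey and MaxKey.
const MinKey = Symbol("MinKey");
const MaxKey = Symbol("MaxKey");

// Compare two scalar values, ordering MinKey below and MaxKey above everything.
function compareValue(a, b) {
  if (a === b) return 0;
  if (a === MinKey || b === MaxKey) return -1;
  if (a === MaxKey || b === MinKey) return 1;
  return a < b ? -1 : a > b ? 1 : 0;
}

// Compare compound shard keys { c, uid } field by field.
function compareKey(a, b) {
  return compareValue(a.c, b.c) || compareValue(a.uid, b.uid);
}

// Tag range from the slide: lower bound inclusive, upper bound exclusive.
const tagRanges = [
  { min: { c: "aus", uid: MinKey }, max: { c: "aut", uid: MaxKey }, tag: "APAC" },
];

// Return the tag covering a shard key, or null if no range matches.
function tagFor(key) {
  const range = tagRanges.find(
    (r) => compareKey(key, r.min) >= 0 && compareKey(key, r.max) < 0
  );
  return range ? range.tag : null;
}

console.log(tagFor({ c: "aus", uid: 42 }));
console.log(tagFor({ c: "usa", uid: 42 }));
```

Documents whose country code falls inside the range carry the APAC tag and so belong on a shard with that tag; other documents are not constrained by this range.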

Use cases for tag aware sharding

• Operational and/or location-based separation

• Legal requirements for data storage

• Reducing latency of geographical requests

• Cost of overseas network bandwidth

• Controlling collection distribution
  – http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/


Other Changes in 2.4

Other changes in 2.4

• Make secondaryThrottle the default
  – https://jira.mongodb.org/browse/SERVER-7779

• Faster migration of empty chunks
  – https://jira.mongodb.org/browse/SERVER-3602

• Specify chunk by bounds for moveChunk
  – https://jira.mongodb.org/browse/SERVER-7674

• Read preferences for commands
  – https://jira.mongodb.org/browse/SERVER-7423





Questions?


Thank You
