Advanced Sharding Features in MongoDB 2.4
Software Engineer, 10gen
Jeremy Mikola
#MongoDBDays
jmikola
Sharded cluster
Sharding is a powerful way to scale your database…
MongoDB 2.4 adds some new features to get more out of it.
Agenda
• Shard keys
  – Desired properties
  – Evaluating shard key choices
• Hashed shard keys
  – Why and how to use hashed shard keys
  – Limitations
• Tag-aware sharding
  – How it works
  – Use case examples
Shard Keys
What is a shard key?
• Incorporates one or more fields
• Used to partition your collection
• Must be indexed and exist in every document
• Definition and values are immutable
• Used to route requests to shards
Cluster request routing
• Targeted queries
• Scatter/gather queries
• Scatter/gather queries with sort
Cluster request routing: writes
• Inserts
  – Shard key required
  – Targeted query
• Updates and removes
  – Shard key optional for multi-document operations
  – May be targeted or scattered
Cluster request routing: reads
• Queries
  – With shard key: targeted
  – Without shard key: scatter/gather
• Sorted queries
  – With shard key: targeted in order
  – Without shard key: distributed merge sort
Cluster request routing: targeted query
Routable request received
Request routed to appropriate shard
Shard returns results
Mongos returns results to client
Cluster request routing: scattered query
Non-targeted request received
Request sent to all shards
Shards return results to mongos
Mongos returns results to client
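The two routing paths above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not mongos code: the chunk table, the routeQuery helper, and the numeric user shard key are all hypothetical.

```javascript
// Hypothetical chunk table for a collection sharded on { user: 1 }.
// Each chunk covers [min, max) and lives on exactly one shard.
const chunks = [
  { min: -Infinity, max: 100, shard: "shard0000" },
  { min: 100, max: 200, shard: "shard0001" },
  { min: 200, max: Infinity, shard: "shard0002" },
];

// Return the list of shards a query has to be sent to.
// With an exact shard key value the request is targeted;
// without one it is scattered to every shard.
function routeQuery(query) {
  if (typeof query.user === "number") {
    const chunk = chunks.find((c) => query.user >= c.min && query.user < c.max);
    return [chunk.shard];
  }
  return [...new Set(chunks.map((c) => c.shard))];
}

console.log(routeQuery({ user: 123 }));        // targeted: one shard
console.log(routeQuery({ subject: "hello" })); // scatter/gather: all shards
```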
Distributed merge sort
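A distributed merge sort can be pictured as a k-way merge: each shard returns its own results already sorted, and the router repeatedly takes the smallest head element. The sketch below assumes in-memory arrays and a hypothetical mergeSorted helper; a real mongos works with cursors rather than arrays.

```javascript
// Merge per-shard result sets that are each already sorted by `key`.
// Hypothetical helper for illustration only.
function mergeSorted(shardResults, key) {
  const positions = shardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < shardResults.length; i++) {
      const doc = shardResults[i][positions[i]];
      if (doc === undefined) continue; // this shard is exhausted
      if (best === -1 || doc[key] < shardResults[best][positions[best]][key]) {
        best = i;
      }
    }
    if (best === -1) return merged; // every shard is exhausted
    merged.push(shardResults[best][positions[best]++]);
  }
}

const fromShards = [
  [{ time: 1 }, { time: 4 }, { time: 9 }],
  [{ time: 2 }, { time: 3 }],
  [{ time: 5 }],
];
console.log(mergeSorted(fromShards, "time").map((d) => d.time));
```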
Shard key considerations
• Cardinality
• Write Distribution
• Query Isolation
• Reliability
• Index Locality
Request distribution and index locality
(Figure: mongos distributing requests to Shard 1, Shard 2, and Shard 3)
{
_id: ObjectId(),
user: 123,
time: Date(),
subject: "…",
recipients: [],
body: "…",
attachments: []
}
Example: email storage
Most common scenario, can be applied to 90% of cases
Each document can be up to 16MB
Each user may have GBs of storage
Most common query: get user emails sorted by time
Indexes on {_id}, {user, time}, {recipients}
Example: email storage
Candidate shard keys: _id, hash(_id), user, {user, time}
Evaluation criteria: cardinality, write scaling, query isolation, reliability, index locality
ObjectId composition
ObjectId("51597ca8e28587b86528edfd")
12 Bytes
Timestamp
Host
PID
Counter
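To make the composition concrete, the example ObjectId can be split into its four fields. The byte widths used here (4-byte timestamp, 3-byte host, 2-byte PID, 3-byte counter) are not stated on the slide; they are assumed from the ObjectId layout in use at the time, and parseObjectId is a hypothetical helper.

```javascript
// Split a 24-character ObjectId hex string into its four fields,
// assuming the 4/3/2/3-byte layout (timestamp, host, PID, counter).
function parseObjectId(hex) {
  return {
    timestamp: parseInt(hex.slice(0, 8), 16), // seconds since the Unix epoch
    host: hex.slice(8, 14),
    pid: parseInt(hex.slice(14, 18), 16),
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

const parts = parseObjectId("51597ca8e28587b86528edfd");
console.log(parts);
console.log(new Date(parts.timestamp * 1000).toISOString());
```

Because the timestamp occupies the leading bytes, ObjectIds generated later always sort after earlier ones, which matters for the sharding example that follows.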
Sharding on ObjectId
// enable sharding on test database
mongos> sh.enableSharding("test")
{ "ok" : 1 }

// shard the test collection
mongos> sh.shardCollection("test.test", { _id: 1 })
{ "collectionsharded" : "test.test", "ok" : 1 }

// insert many documents in a loop
mongos> for (x=0; x<10000; x++) db.test.insert({ value: x });
shards:
  { "_id" : "shard0000", "host" : "localhost:30000" }
  { "_id" : "shard0001", "host" : "localhost:30001" }
databases:
  { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
    test.test
      shard key: { "_id" : 1 }
      chunks:
        shard0001 2
      { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("…") } on : shard0001 { "t" : 1000, "i" : 1 }
      { "_id" : ObjectId("…") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 1000, "i" : 2 }
Uneven chunk distribution
Incremental values lead to a hot shard
(Figure: key range from minKey to maxKey)
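The hot shard effect can be reproduced with a small simulation. The chunk boundaries and the shardFor helper below are hypothetical; the only assumption carried over from the slides is that each new key is larger than every existing key, as with ObjectId timestamps.

```javascript
// Hypothetical range-based chunk table on an increasing key.
// Boundaries are [min, max); the last chunk extends to maxKey.
const rangeChunks = [
  { min: -Infinity, max: 1000, shard: "shard0000" },
  { min: 1000, max: 2000, shard: "shard0001" },
  { min: 2000, max: Infinity, shard: "shard0002" },
];

function shardFor(key) {
  return rangeChunks.find((c) => key >= c.min && key < c.max).shard;
}

// Every new key is larger than the existing ones, so each insert
// falls into the chunk that ends at maxKey.
const writes = {};
for (let key = 5000; key < 15000; key++) {
  const shard = shardFor(key);
  writes[shard] = (writes[shard] || 0) + 1;
}
console.log(writes); // every write lands on shard0002
```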
Example: email storage

Shard key    Cardinality  Write scaling  Query isolation  Reliability         Index locality
_id          Doc level    One shard      Scatter/gather   All users affected  Good
hash(_id)
user
user, time
Example: email storage

Shard key    Cardinality  Write scaling  Query isolation  Reliability          Index locality
_id          Doc level    One shard      Scatter/gather   All users affected   Good
hash(_id)    Hash level   All shards     Scatter/gather   All users affected   Poor
user         Many docs    All shards     Targeted         Some users affected  Good
user, time   Doc level    All shards     Targeted         Some users affected  Good
Hashed Shard Keys
Why is this relevant?
• Documents may not already have a suitable value
• Hashing allows us to utilize an existing field
• More efficient index storage
  – At the expense of locality
Hashed shard keys
{x:1}  md5 c4ca4238a0b923820dcc509a6f75849b
{x:2}  md5 c81e728d9d4c2f636f067f89cc14862c
{x:3}  md5 eccbc87e4b5ce2fe28308fd9f2a7baf3
(Figure: key range from minKey to maxKey)
Hashed shard keys avoid a hot shard
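A rough sketch of why hashing spreads sequential values: take the first 64 bits of an md5 digest and see which chunk owns the result. This is an approximation for illustration only. The hash64 and chunkIndex helpers and the four-chunk layout are hypothetical, and because MongoDB hashes the BSON element together with a seed, the server's actual hash values differ from the ones computed here.

```javascript
const crypto = require("crypto");

// Illustrative hash: the first 64 bits of the md5 of the value's string
// form, read as a signed integer. Not MongoDB's actual hash function.
function hash64(value) {
  const digest = crypto.createHash("md5").update(String(value)).digest();
  return digest.readBigInt64BE(0);
}

// Four hypothetical chunks that split the signed 64-bit range evenly.
const STEP = 2n ** 62n;
const MIN = -(2n ** 63n);
function chunkIndex(hash) {
  return Number((hash - MIN) / STEP); // 0..3
}

// Insert 10000 consecutive values and count writes per chunk.
const counts = [0, 0, 0, 0];
for (let x = 0; x < 10000; x++) counts[chunkIndex(hash64(x))]++;
console.log(counts); // consecutive values are spread over all four chunks

// The md5 of the string "1" matches the digest shown for {x:1}.
console.log(crypto.createHash("md5").update("1").digest("hex"));
```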
Under the hood
• Create a hashed index for use with sharding
• Contains first 64 bits of a field’s md5 hash
• Considers BSON type and value
• Represented as NumberLong in the JS shell
// hash on 1 as an integer
> db.runCommand({ _hashBSONElement: 1 })
{
  "key" : 1,
  "seed" : 0,
  "out" : NumberLong("5902408780260971510"),
  "ok" : 1
}

// hash on "1" as a string
> db.runCommand({ _hashBSONElement: "1" })
{
  "key" : "1",
  "seed" : 0,
  "out" : NumberLong("-2448670538483119681"),
  "ok" : 1
}
Hashing BSON elements
Using hashed indexes
• Create index:
  – db.collection.ensureIndex({ field : "hashed" })
• Options:
  – seed: specify a hash seed to use (default: 0)
  – hashVersion: currently supports only version 0 (md5)
Using hashed shard keys
• Enable sharding on collection:
  – sh.shardCollection("test.collection", { field: "hashed" })
• Options:
  – numInitialChunks: chunks to create (default: 2 per shard)
// enable sharding on test database
mongos> sh.enableSharding("test")
{ "ok" : 1 }

// shard by hashed _id field
mongos> sh.shardCollection("test.hash", { _id: "hashed" })
{ "collectionsharded" : "test.hash", "ok" : 1 }
Sharding on hashed ObjectId
databases:
  { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
    test.hash
      shard key: { "_id" : "hashed" }
      chunks:
        shard0000 2
        shard0001 2
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611...") }
on : shard0000 { "t" : 2000, "i" : 2 }
{ "_id" : NumberLong("-4611...") } -->> { "_id" : NumberLong(0) }
on : shard0000 { "t" : 2000, "i" : 3 }
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611...") }
on : shard0001 { "t" : 2000, "i" : 4 }
{ "_id" : NumberLong("4611...") } -->> { "_id" : { "$maxKey" : 1 } }
on : shard0001 { "t" : 2000, "i" : 5 }
Pre-splitting the data
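The initial chunk boundaries in the output above (-4611..., 0, 4611...) look like an even division of the signed 64-bit hash range. The sketch below computes such split points with a hypothetical evenSplitPoints helper. It is an assumption about the general idea, not the server's algorithm, and the server may choose values that differ in the low-order digits.

```javascript
// Evenly divide the signed 64-bit hash range into `numChunks` chunks and
// return the split points between them. Hypothetical helper.
function evenSplitPoints(numChunks) {
  const min = -(2n ** 63n);
  const step = 2n ** 64n / BigInt(numChunks);
  const points = [];
  for (let i = 1; i < numChunks; i++) points.push(min + step * BigInt(i));
  return points;
}

// Two shards with the default of two initial chunks each gives four chunks.
console.log(evenSplitPoints(4).map(String));
```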
test.hash
  shard key: { "_id" : "hashed" }
  chunks:
    shard0000 4
    shard0001 4
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374...") }
on : shard0000 { "t" : 2000, "i" : 8 }
{ "_id" : NumberLong("-7374...") } -->> { "_id" : NumberLong("-4611...") }
on : shard0000 { "t" : 2000, "i" : 9 }
{ "_id" : NumberLong("-4611…") } -->> { "_id" : NumberLong("-2456…") }
on : shard0000 { "t" : 2000, "i" : 6 }
{ "_id" : NumberLong("-2456…") } -->> { "_id" : NumberLong(0) }
on : shard0000 { "t" : 2000, "i" : 7 }
{ "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483…") }
on : shard0001 { "t" : 2000, "i" : 12 }
Even chunk distribution after insertions
Hashed keys are great for equality queries
• Equality queries routed to a specific shard
• Will make use of the hashed index
• Most efficient query possible
mongos> db.hash.find({ x: 1 }).explain()
{
  "cursor" : "BtreeCursor x_hashed",
  "n" : 1,
  "nscanned" : 1,
  "nscannedObjects" : 1,
  "numQueries" : 1,
  "numShards" : 1,
  "indexBounds" : {
    "x" : [
      [
        NumberLong("5902408780260971510"),
        NumberLong("5902408780260971510")
      ]
    ]
  },
  "millis" : 0
}
Explain plan of an equality query
But not so good for range queries
• Range queries will be scatter/gather
• Cannot utilize a hashed index
  – Supplemental, ordered index may be used at the shard level
• Inefficient query pattern
mongos> db.hash.find({ x: { $gt: 1, $lt: 99 }}).explain()
{
  "cursor" : "BasicCursor",
  "n" : 97,
  "nscanned" : 1000,
  "nscannedObjects" : 1000,
  "numQueries" : 2,
  "numShards" : 2,
  "millis" : 3
}
Explain plan of a range query
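The difference between the two explain plans comes down to whether the router can turn the predicate into a hash value. A minimal sketch, assuming an illustrative md5-based hash (not MongoDB's) and a hypothetical layout of four equal hash ranges on two shards:

```javascript
const crypto = require("crypto");

// Illustrative 64-bit hash (not MongoDB's actual hash function).
function hash64(value) {
  return crypto.createHash("md5").update(String(value)).digest().readBigInt64BE(0);
}

// Hypothetical layout: four equal hash ranges, two per shard.
const shards = ["shard0000", "shard0000", "shard0001", "shard0001"];
function shardForHash(hash) {
  return shards[Number((hash + 2n ** 63n) / 2n ** 62n)];
}

// Route a predicate on the hashed field `x`.
function route(predicate) {
  if (typeof predicate === "number") {
    // Equality: hash the value and target the chunk that owns it.
    return [shardForHash(hash64(predicate))];
  }
  // Range ({ $gt, $lt }): hashing does not preserve order, so matching
  // documents can sit in any chunk and every shard must be asked.
  return [...new Set(shards)];
}

console.log(route(1));                   // one shard
console.log(route({ $gt: 1, $lt: 99 })); // both shards
```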
Other limitations of hashed indexes
• Cannot be used in compound or unique indexes
• No support for multi-key indexes (i.e. array values)
• Incompatible with tag aware sharding
  – Tags would be assigned hashed values, not the original key
• Will not overcome keys with poor cardinality
– Floating point numbers are truncated before hashing
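The truncation point can be illustrated in a few lines. The illustrativeHash helper is hypothetical and uses md5 of the truncated value's string form rather than MongoDB's real hash; it only shows that two floating point values with the same integer part collapse to one hash value.

```javascript
const crypto = require("crypto");

// Truncate first, then hash: values that differ only in their fractional
// part produce the same result. Hypothetical helper, not MongoDB's hash.
function illustrativeHash(value) {
  const truncated = Math.trunc(value);
  return crypto.createHash("md5").update(String(truncated)).digest("hex").slice(0, 16);
}

console.log(illustrativeHash(2.1) === illustrativeHash(2.9)); // true
console.log(illustrativeHash(2.1) === illustrativeHash(3.1)); // false
```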
Summary
• There are multiple approaches for sharding
• Hashed shard keys give great distribution
• Hashed shard keys are good for equality queries
• Pick a shard key that best suits your application
Tag Aware Sharding
Global scenario
Single database
Optimal architecture
Tag aware sharding
• Associate shard key ranges with specific shards
• Shards may have multiple tags, and vice versa
• Dictates behavior of the balancer process
• No relation to replica set member tags
// tag a shard
mongos> sh.addShardTag("shard0001", "APAC")

// shard by country code and user ID
mongos> sh.shardCollection("test.tas", { c: 1, uid: 1 })
{ "collectionsharded" : "test.tas", "ok" : 1 }

// tag a shard key range
mongos> sh.addTagRange("test.tas",
... { c: "aus", uid: MinKey },
... { c: "aut", uid: MaxKey },
... "APAC"
... )
Configuring tag aware sharding
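How a tag range covers compound shard key values can be sketched as a field-by-field comparison. The MinKey and MaxKey symbols below merely stand in for the shell's constants, and compareKeys and tagFor are hypothetical helpers; the range and tag are the ones configured above.

```javascript
// Sentinels standing in for the shell's MinKey and MaxKey.
const MinKey = Symbol("MinKey");
const MaxKey = Symbol("MaxKey");

function compareValues(a, b) {
  if (a === b) return 0;
  if (a === MinKey || b === MaxKey) return -1;
  if (a === MaxKey || b === MinKey) return 1;
  return a < b ? -1 : 1;
}

// Compare compound shard keys { c, uid } field by field.
function compareKeys(a, b) {
  return compareValues(a.c, b.c) || compareValues(a.uid, b.uid);
}

// Tag ranges as configured with sh.addTagRange: [min, max) -> tag.
const tagRanges = [
  { min: { c: "aus", uid: MinKey }, max: { c: "aut", uid: MaxKey }, tag: "APAC" },
];

// Hypothetical lookup: which tag, if any, covers this shard key value?
function tagFor(key) {
  const range = tagRanges.find(
    (r) => compareKeys(key, r.min) >= 0 && compareKeys(key, r.max) < 0
  );
  return range ? range.tag : null;
}

console.log(tagFor({ c: "aus", uid: 42 })); // "APAC"
console.log(tagFor({ c: "usa", uid: 42 })); // null
```

Chunks whose range falls under a tag are kept by the balancer on shards carrying that tag, which is what ties the "APAC" range to shard0001 in the configuration above.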
Use cases for tag aware sharding
• Operational and/or location-based separation
• Legal requirements for data storage
• Reducing latency of geographical requests
• Cost of overseas network bandwidth
• Controlling collection distribution
  – http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
Other Changes in 2.4
Other changes in 2.4
• Make secondaryThrottle the default
  – https://jira.mongodb.org/browse/SERVER-7779
• Faster migration of empty chunks
  – https://jira.mongodb.org/browse/SERVER-3602
• Specify chunk by bounds for moveChunk
  – https://jira.mongodb.org/browse/SERVER-7674
• Read preferences for commands
  – https://jira.mongodb.org/browse/SERVER-7423
Questions?
Thank You