MongoDB for Time Series Data Part 3: Sharding
![Page 1: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/1.jpg)
Sharding Time Series Data
Jake Angerman
Sr. Solutions Architect, MongoDB
#MongoDBWorld
![Page 2: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/2.jpg)
Let's Pretend We Are DevOps
What my friends think I do
What society thinks I do
What my Mom thinks I do
What my boss thinks I do
What I think I do
What I really do
DevOps
![Page 3: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/3.jpg)
Sharding Overview
[Diagram: Application -> Driver -> Query Routers -> Shard 1, Shard 2, Shard 3, … Shard N; each shard is a replica set with one Primary and two Secondaries]
![Page 4: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/4.jpg)
Why do we need to shard?
• Reaching a limit on some resource
– RAM (working set)
– Disk space
– Disk IO
– Client network latency on writes (tag aware sharding)
– CPU
![Page 5: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/5.jpg)
Do we need to shard right now?
• Two schools of thought:
1. Shard at the outset to avoid technical debt later
2. Shard later to avoid complexity and overhead today
• Either way, shard before you need to!
– 256GB data size threshold published in documentation
– Chunk migrations can cause memory contention and disk IO
[Diagram, two panels: "Working Set" plus "Free RAM" ("Things seemed fine…"); "Working Set" only ("… then I waited too long to shard")]
![Page 6: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/6.jpg)
Collection stats
> db.mdbw.stats()
{
  "ns" : "test.mdbw",
  "count" : 16000, // one hour's worth of documents
  "size" : 65280000, // size of user data, padding included
  "avgObjSize" : 4080,
  "storageSize" : 93356032, // size of data extents, unused space included
  "numExtents" : 11,
  "nindexes" : 1,
  "lastExtentSize" : 31354880,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 1,
  "totalIndexSize" : 801248,
  "indexSizes" : { "_id_" : 801248 },
  "ok" : 1
}
![Page 7: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/7.jpg)
Storage model spreadsheet
• sensors: 16,000
• years to keep data: 6
• docs per day: 384,000
• docs per year: 140,160,000
• docs total across all years: 840,960,000
• indexes per hour (from the collection stats): 801,248 bytes
• storage per hour: 63 MB
• storage per day: 1.5 GB
• storage per year: 539 GB
• storage across all years: 3,235 GB
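The spreadsheet figures can be reproduced from the collection stats. The following is a minimal sketch, not the author's spreadsheet: it assumes one document per sensor per hour, that "storage" means user data plus index size, and binary units (1 MB = 1024 * 1024 bytes). The variable names are illustrative.

```javascript
// Figures copied from the db.mdbw.stats() output (one hour's worth of documents).
const stats = { count: 16000, size: 65280000, totalIndexSize: 801248 };

const YEARS_TO_KEEP = 6; // from the spreadsheet
const MiB = 1024 * 1024;
const GiB = 1024 * MiB;

// Assumption: one document per sensor per hour, so docs per day = count * 24.
const docsPerDay = stats.count * 24;
const docsPerYear = docsPerDay * 365;
const docsTotal = docsPerYear * YEARS_TO_KEEP;

// Assumption: storage per hour = user data + indexes for that hour.
const bytesPerHour = stats.size + stats.totalIndexSize;
const storage = {
  perHourMB: bytesPerHour / MiB,
  perDayGB: (bytesPerHour * 24) / GiB,
  perYearGB: (bytesPerHour * 24 * 365) / GiB,
};
storage.totalGB = storage.perYearGB * YEARS_TO_KEEP;

console.log({ docsPerDay, docsPerYear, docsTotal });
console.log(
  Object.fromEntries(Object.entries(storage).map(([k, v]) => [k, v.toFixed(1)]))
);
```

Under these assumptions the rounded results agree with the spreadsheet values (63 MB per hour, 1.5 GB per day, 539 GB per year, 3,235 GB over six years).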
![Page 8: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/8.jpg)
Why we need to shard now
539 GB in year one alone
16,000 sensors today… 47,000 tomorrow?
[Chart: total storage (GB) by year for years 1 to 6; y-axis from 0 to 3,500]
![Page 9: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/9.jpg)
What will our sharded cluster look like?
• We need to model the application to answer this question
• Model should include:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements
• Two main collections
– summary data (fast query times)
– historical data (analysis of environmental conditions)
![Page 10: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/10.jpg)
Option 1: Everything in one sharded cluster
[Diagram: one sharded cluster of Shard 1 (the primary shard), Shard 2, Shard 3, Shard 4, … Shard N; each shard is a replica set with one Primary and two Secondaries]
• Issue: prevent analytics jobs from affecting application performance
• Summary data is small (16,000 * N bytes) and accessed frequently
![Page 11: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/11.jpg)
Option 2: Distinct replica set for summaries
[Diagram: a sharded cluster of Shard 1, Shard 2, Shard 3, … Shard N, plus a separate replica set; each has one Primary and two Secondaries]
• Pros: Operational separation between business functions
• Cons: application must write to two different databases
![Page 12: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/12.jpg)
Application read patterns
• Web browsers, mobile phones, and in-car navigation devices
• Working set should be kept in RAM
• 5M subscribers * 1% active * 50 sensors/query * 1 device query/min = 41,667 reads/sec
• 41,667 reads/sec * 4080 bytes = 162 MB/sec
– and that's without any protocol overhead
• Gigabit Ethernet is ≈ 118 MB/sec
[Diagram: replica set with one Primary and two Secondaries; one 1 Gbps link]
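The read-load arithmetic above can be checked with a few lines. This is a rough sketch using the slide's inputs; the variable names are illustrative, the 118 MB/sec figure is taken from the slide as given, and MB is treated as 1024 * 1024 bytes.

```javascript
// Inputs from the slide.
const subscribers = 5000000;
const activePercent = 1;          // 1% of subscribers active
const sensorsPerQuery = 50;
const queriesPerDevicePerMin = 1;
const avgObjSize = 4080;          // bytes, from the collection stats
const gigabitMBps = 118;          // approximate usable Gigabit Ethernet throughput
const MiB = 1024 * 1024;

// Document reads per second across all active devices.
const readsPerSec =
  (subscribers * activePercent / 100) * sensorsPerQuery * queriesPerDevicePerMin / 60;

// Payload bandwidth, ignoring protocol overhead.
const mbPerSec = (readsPerSec * avgObjSize) / MiB;

// How many 1 Gbps links the payload alone would occupy.
const linksNeeded = Math.ceil(mbPerSec / gigabitMBps);

console.log(Math.round(readsPerSec), Math.round(mbPerSec), linksNeeded);
```

Since the payload alone is larger than one Gigabit link can carry, a single node cannot serve this read load over one link.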
![Page 13: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/13.jpg)
Application read patterns (continued)
• Options
– provision more bandwidth ($$$)
– tune application read pattern
– add a caching layer
– secondary reads from the replica set
[Diagram: replica set with one Primary and two Secondaries; three 1 Gbps links]
![Page 14: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/14.jpg)
Secondary Reads from the Replica Set
• Stale data OK in this use case
• caution: read preference of secondary could be disastrous in a 3-member replica set if a secondary fails!
• app servers with mixed read preferences of primary and secondary are operationally cumbersome
• Use nearest read preference to access all nodes
[Diagram: replica set with one Primary and two Secondaries; three 1 Gbps links]
db.collection.find().readPref("nearest")
![Page 15: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/15.jpg)
Replica Set Tags
• app servers in different data centers use replica set tags plus read preference nearest
• db.collection.find().readPref("nearest", [ { "datacenter": "east" } ])
[Diagram: data center "east" with one Primary and two Secondaries]
> rs.conf()
{ "_id" : "rs0",
  "version" : 2,
  "members" : [
    { "_id" : 0,
      "host" : "node0.example.net:27017",
      "tags" : { "datacenter": "east" }
    },
    { "_id" : 1,
      "host" : "node1.example.net:27017",
      "tags" : { "datacenter": "east" }
    },
    { "_id" : 2,
      "host" : "node2.example.net:27017",
      "tags" : { "datacenter": "east" }
    }
  ]
}
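To illustrate how a tag set narrows the candidate members, here is a simplified model in plain JavaScript. It is not the driver's implementation: it only models tag matching (each tag set is tried in order, and the first one that matches any member wins) and ignores the latency comparison that nearest performs afterwards. The hosts node3 and node4 and their data centers are hypothetical additions for the example.

```javascript
// Members shaped like the rs.conf() output; node3 and node4 are hypothetical.
const members = [
  { host: "node0.example.net:27017", tags: { datacenter: "east" } },
  { host: "node1.example.net:27017", tags: { datacenter: "east" } },
  { host: "node2.example.net:27017", tags: { datacenter: "east" } },
  { host: "node3.example.net:27017", tags: { datacenter: "central" } },
  { host: "node4.example.net:27017", tags: { datacenter: "west" } },
];

// A member matches a tag set when it carries every tag in the set.
function matchesTagSet(member, tagSet) {
  return Object.entries(tagSet).every(
    ([key, value]) => member.tags !== undefined && member.tags[key] === value
  );
}

// Try each tag set in order; return the members matching the first set that matches any.
function eligibleMembers(allMembers, tagSets) {
  for (const tagSet of tagSets) {
    const matched = allMembers.filter((m) => matchesTagSet(m, tagSet));
    if (matched.length > 0) return matched;
  }
  return [];
}

const eastOnly = eligibleMembers(members, [{ datacenter: "east" }]);
console.log(eastOnly.map((m) => m.host));
```

An app server in the east data center would pass the east tag set, so its reads stay on the three east members.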
![Page 16: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/16.jpg)
Replica Set Tags
• Enables geographic distribution
[Diagram: data centers east, central, and west; replica set with one Primary and two Secondaries]
![Page 17: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/17.jpg)
Replica Set Tags
• Enables geographic distribution
• Allows scaling within each data center
[Diagram: data centers east, central, and west; replica set with one Primary and eight Secondaries]
![Page 18: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/18.jpg)
Analytic read patterns
• How does an analyst look at the data on the sharded cluster?
• 1 Year of data = 539 GB
[Chart: Server RAM (y-axis, 0 to 300) against Number of machines (x-axis, 2 to 18); data labels 256, 192, 128, 64, 32]
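One way to read the chart is as the number of machines whose combined RAM could hold one year of data. That reading is an assumption, as is rounding up to whole machines; the sketch below only shows the arithmetic under it.

```javascript
const yearOfDataGB = 539;                    // from the slide
const ramSizesGB = [256, 192, 128, 64, 32];  // data labels from the chart

// Assumption: machines needed = year of data divided by RAM per server, rounded up.
const plan = ramSizesGB.map((ramGB) => ({
  ramGB,
  machines: Math.ceil(yearOfDataGB / ramGB),
}));

console.table(plan);
```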
![Page 19: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/19.jpg)
Application write patterns
• 16,000 sensors every minute = 267 writes/sec
• Could we handle 16,000 writes in one second?
– 16,000 writes * 4080 bytes = 62 MB
• Load test the app!
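The two write scenarios above work out as follows. This is a small sketch with illustrative variable names, treating MB as 1024 * 1024 bytes.

```javascript
const sensors = 16000;
const avgObjSize = 4080; // bytes, from the collection stats
const MiB = 1024 * 1024;

// Steady state: check-ins spread evenly over each minute.
const steadyWritesPerSec = sensors / 60;

// Worst case: every sensor checks in within the same second.
const burstMB = (sensors * avgObjSize) / MiB;

console.log(Math.round(steadyWritesPerSec), Math.round(burstMB));
```

The gap between the two figures is why the deck recommends load testing: the average rate is modest, but a synchronized check-in concentrates a full minute of writes into one second.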
![Page 20: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/20.jpg)
Modeling the Application - summary
• We modeled:
– application write patterns (sensors)
– application read patterns (clients)
– analytic read patterns
– data storage requirements
– the network, a little bit
![Page 21: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/21.jpg)
Shard Key
![Page 22: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/22.jpg)
Shard Key characteristics
• A good shard key has:
– sufficient cardinality
– distributed writes
– targeted reads ("query isolation")
• Shard key should be in every query if possible
– scatter gather otherwise
• Choosing a good shard key is important!
– affects performance and scalability
– changing it later is expensive
![Page 23: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/23.jpg)
Hashed shard key
• Pros:
– Evenly distributed writes
• Cons:
– Random data (and index) updates can be IO intensive
– Range-based queries turn into scatter gather
[Diagram: mongos routing to Shard 1, Shard 2, Shard 3, … Shard N]
![Page 24: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/24.jpg)
Low cardinality shard key
• Induces "jumbo chunks"
• Examples: sensor ID
[Diagram: mongos routing to Shard 1, Shard 2, Shard 3, … Shard N; chunk range shown: [ a, b )]
![Page 25: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/25.jpg)
Ascending shard key
• Monotonically increasing shard key values cause "hot spots" on inserts
• Examples: timestamps, _id
[Diagram: mongos routing to Shard 1, Shard 2, Shard 3, … Shard N; chunk range shown: [ ISODate(…), $maxKey )]
![Page 26: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/26.jpg)
Choosing a shard key for time series data
• Consider compound shard key: {arbitrary value, incrementing value}
• Best of both worlds – local hot spotting, targeted reads
[Diagram: mongos routing to Shard 1, Shard 2, Shard 3, … Shard N, with chunk ranges grouped by arbitrary value:]
[ {V1, ISODate(A)}, {V1, ISODate(B)} ), [ {V1, ISODate(B)}, {V1, ISODate(C)} ), [ {V1, ISODate(C)}, {V1, ISODate(D)} ), …
[ {V2, ISODate(A)}, {V2, ISODate(B)} ), [ {V2, ISODate(B)}, {V2, ISODate(C)} ), [ {V2, ISODate(C)}, {V2, ISODate(D)} ), …
[ {V3, ISODate(A)}, {V3, ISODate(B)} ), [ {V3, ISODate(B)}, {V3, ISODate(C)} ), [ {V3, ISODate(C)}, {V3, ISODate(D)} ), …
[ {V4, ISODate(A)}, {V4, ISODate(B)} ), [ {V4, ISODate(B)}, {V4, ISODate(C)} ), [ {V4, ISODate(C)}, {V4, ISODate(D)} ), …
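The difference between an ascending key and a compound key can be seen in a toy routing simulation. This is a simplified model, not how mongos is implemented: chunks are [min, max) ranges, Infinity stands in for $maxKey, timestamps are plain numbers, and the chunk layout and shard names are made up for the example.

```javascript
// Compare two keys (arrays of fields) lexicographically, field by field.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Return the shard owning the chunk whose [min, max) range contains the key.
function route(chunks, key) {
  const chunk = chunks.find(
    (c) => compareKeys(c.min, key) <= 0 && compareKeys(key, c.max) < 0
  );
  return chunk.shard;
}

// Count how many inserts each shard receives.
function countByShard(chunks, keys) {
  const counts = {};
  for (const key of keys) {
    const shard = route(chunks, key);
    counts[shard] = (counts[shard] || 0) + 1;
  }
  return counts;
}

// Ascending key {time}: the newest chunk ends at "maxKey" and lives on one shard.
const timeChunks = [
  { min: [-Infinity], max: [1000], shard: "shard1" },
  { min: [1000], max: [2000], shard: "shard2" },
  { min: [2000], max: [Infinity], shard: "shard3" },
];

// Compound key {prefix, time}: one (hypothetical) chunk per prefix, each on its own shard.
const prefixes = ["V1", "V2", "V3", "V4"];
const compoundChunks = prefixes.map((p, i) => ({
  min: [p, -Infinity],
  max: [p, Infinity],
  shard: `shard${i + 1}`,
}));

// 400 new inserts with increasing timestamps.
const ascendingKeys = [];
const compoundKeys = [];
for (let t = 2000; t < 2100; t++) {
  for (let i = 0; i < prefixes.length; i++) {
    ascendingKeys.push([2000 + ascendingKeys.length]);
    compoundKeys.push([prefixes[i], t]);
  }
}

const ascendingCounts = countByShard(timeChunks, ascendingKeys);
const compoundCounts = countByShard(compoundChunks, compoundKeys);
console.log(ascendingCounts); // every insert lands on the shard holding the last chunk
console.log(compoundCounts);  // inserts are spread across the shards, one prefix each
```

With the compound key each prefix still has its own "latest" chunk, which is the local hot spotting the slide mentions, but those chunks sit on different shards.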
![Page 27: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/27.jpg)
What is our shard key?
• Let's choose: linkID, date
– example: { linkID: 9000006, date: 140312 }
– example: { _id: "900006:140312" }
– this application's _id is in this form already, yay!
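A key of this form can be built and queried as in the sketch below. The helper names are hypothetical, and the code assumes the date part is a fixed-width string (the slide's 140312 is read here as YYMMDD), so that string order matches date order.

```javascript
// Build the compound _id string "<linkID>:<date>".
function makeId(linkID, date) {
  return `${linkID}:${date}`;
}

// Query filter selecting one link over an inclusive date range.
// Relies on fixed-width date strings so that lexicographic order equals date order.
function linkDateRange(linkID, fromDate, toDate) {
  return {
    _id: { $gte: makeId(linkID, fromDate), $lte: makeId(linkID, toDate) },
  };
}

const id = makeId(900006, "140312");
const marchFilter = linkDateRange(900006, "140301", "140331");
console.log(id);
console.log(JSON.stringify(marchFilter));
```

If the collection is sharded on _id, a filter like this bounds the shard key to a single link, so it can be passed to find() as a targeted read rather than a scatter gather.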
![Page 28: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/28.jpg)
Summary
• Model the read/write patterns and storage
• Choose an appropriate shard key
• DevOps influenced the application
– write recent summary data to separate database
– replica set tags for summary database
– avoid synchronous sensor checkins
– consider changing client polling frequency
– consider throttling REST API access to app servers
![Page 29: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/29.jpg)
Which DevOps person are you?
![Page 30: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/30.jpg)
Thank You
Jake Angerman
Sr. Solutions Architect, MongoDB
#MongoDBWorld
![Page 31: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/31.jpg)
Sharding Experimentation
$ mongo --nodb
> // first shell: start a test cluster with one shard
> cluster = new ShardingTest({"shards": 1, "chunksize": 1})
$ mongo --nodb
> // second shell: now connect to mongos on 30999
> db = (new Mongo("localhost:30999")).getDB("test")
![Page 32: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/32.jpg)
I decided to shard from the outset
• Sensor summary documents can all fit in RAM
– 16,000 sensors * N bytes
• Velocity of sensor events is only 267 writes/sec
• Volume of sensor events is what dictates sharding
{ _id : <linkID>,
update : ISODate("2013-10-10T23:06:37.000Z"),
last10 : {
avgSpeed : <int>,
avgTime : <int>
},
lastHour : {
avgSpeed : <int>,
avgTime : <int>
},
speeds : [ 52, 49, 45, 51, ... ],
times : [ 237, 224, 246, 233,... ],
pavement: "Wet Spots",
status: "Wet Conditions",
weather: "Light Rain"
}
![Page 33: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/33.jpg)
> this_is_for_replica_sets_not_sharding = {
_id : "mySet",
members : [
{_id : 0, host : "A", priority : 3},
{_id : 1, host : "B", priority : 2},
{_id : 2, host : "C"},
{_id : 3, host : "D", hidden : true},
{_id : 4, host : "E", hidden : true, slaveDelay : 3600}
]
}
> rs.initiate(this_is_for_replica_sets_not_sharding)
Configuring Sharding
![Page 34: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/34.jpg)
I'm off to my private island in New Zealand
![Page 35: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/35.jpg)
Replica Set Diagram
![Page 36: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/36.jpg)
> conf = {
_id : "mySet",
members : [
{_id : 0, host : "A", priority : 3},
{_id : 1, host : "B", priority : 2},
{_id : 2, host : "C"},
{_id : 3, host : "D", hidden : true},
{_id : 4, host : "E", hidden : true, slaveDelay : 3600}
]
}
> rs.initiate(conf)
Configuration Options
![Page 37: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/37.jpg)
My Wonderful Subsection
![Page 38: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/38.jpg)
> conf = {
_id : "mySet",
members : [
{_id : 0, host : "A", priority : 3},
{_id : 1, host : "B", priority : 2},
{_id : 2, host : "C"},
{_id : 3, host : "D", hidden : true},
{_id : 4, host : "E", hidden : true, slaveDelay : 3600}
]
}
> rs.initiate(conf)
Configuration Options
Primary DC
![Page 39: MongoDB for Time Series Data Part 3: Sharding](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b6d9d24a7959ca538b468a/html5/thumbnails/39.jpg)
Tag Aware Sharding
• Control where data is written to, and read from
• Each member can have one or more tags
– tags: {dc: "ny"}
– tags: {dc: "ny", subnet: "192.168", rack: "row3rk7"}
• Replica set defines rules for write concerns
• Rules can change without changing app code