A Century Of Weather Data - Midwest.io
Weather of the Century
J. Randall Hunt (@jrhunt), Developer Advocate, MongoDB
@midwestio
What was the weather the day you were born?
Agenda
• Data and Schema
• Application
• Operational Concerns
MONGODB INTERLUDE!
What Is It And Why Use It?
• Document Data Store
• Geo Indexing
• "Simple" Sharded deployments
Terminology
RDBMS       MongoDB (Document Store)
Database    Database
Table       Collection
Row(s)      (BSON) Document
Index       Index
Join        Nope.
The Data
Where To Get Data?
A Weather Datum
• A station ID
• A timestamp
• Lat, Long, Elevation
• A LOT OF WEATHER DATA (135 page manual for parsing)
• Lots of optional sections
How much of it do we have?
• 2.5 billion distinct data points
• 4 Terabytes
• Number of documents is huge, overall data size is reasonable
• We'll call this: "moderately big" data
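A quick back-of-the-envelope check of what these two figures imply per data point. This is only a sketch: it assumes decimal terabytes (10^12 bytes), which the slide does not specify, and it ignores whether the 4 TB includes indexes.

```python
# Rough average size per document implied by the figures above.
# Assumption: "4 Terabytes" means 4 * 10**12 bytes (decimal units).
TOTAL_BYTES = 4 * 10**12
TOTAL_DOCUMENTS = 2_500_000_000

avg_bytes_per_document = TOTAL_BYTES / TOTAL_DOCUMENTS
print(f"~{avg_bytes_per_document:.0f} bytes per document")
```

Under that assumption each data point averages about 1.6 KB, which is why the document count is the harder number than the total size.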
How does it grow?
Who Else Is This Relevant For?
• Particle Physics
• Stocks, high frequency trading
• Insurance
• People with lots of small pieces of data
Schema Design 101
Things We Care About
• Performance
‣ Ingestion
‣ App Specific
‣ Ad-hoc
• Cost
• Flexibility
Performance Breakdown
• Bulk Loading
• Latency and throughput for queries
‣ point in space-time
‣ one station, one year
‣ the whole world at one time
• Aggregation and Exploration
‣ warmest and coldest day ever, average temperature, etc.
0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...
{ "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } }
Station ID: NYC Central Park
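The step from the raw record to the document can be sketched as a fixed-width parse. The field offsets below are assumptions checked only against this one sample record, not against the full parsing manual, and the function name is made up for illustration. Missing-value markers and the optional sections after the fixed-width part are not handled.

```python
from datetime import datetime, timezone

# First 105 characters of the sample record above (the fixed-width part).
RECORD = ("0303725053947282013060322517+40779-073969FM-15+0048KNYC "
          "V0309999C00005030485MN0080475N5+02115+02005100975")


def parse_mandatory_section(line):
    """Parse a few fixed-width fields (offsets are assumptions, see lead-in)."""
    usaf = line[4:10]
    ts = datetime.strptime(line[15:27], "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
    return {
        "st": "u" + usaf,
        "ts": ts,
        "position": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude]; raw values are scaled by 1000.
            "coordinates": [int(line[34:41]) / 1000, int(line[28:34]) / 1000],
        },
        "elevation": int(line[46:51]),
        # Temperature and pressure are stored in tenths of a unit.
        "airTemperature": {"value": int(line[87:92]) / 10, "quality": line[92]},
        "atmosphericPressure": {"value": int(line[99:104]) / 10, "quality": line[104]},
    }


doc = parse_mandatory_section(RECORD)
print(doc["st"], doc["ts"].isoformat(), doc["airTemperature"])
```

For the sample record this reproduces the station ID, timestamp, air temperature and pressure of the document shown above.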
Schema
{
  st: "u724463",
  ts: ISODate("1991-01-01T00:00:00Z"),
  position: {
    type: "Point",
    coordinates: [ -94.6, 39.117 ]
  },
  elevation: 231,
  … other fields …
}
station ID and source
Stations
• USAF and WBAN IDs exist for most of North America. We prefix the ID with "u" (USAF) or "w" (WBAN).
• For ships we use the prefix "x" and their lat and lng to create a station ID.
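The naming rule can be sketched as a small helper. Several details are assumptions rather than anything the slides state: that the USAF ID is preferred when both IDs exist (the sample record has both and uses "u"), that all-nines values mean "no ID", and the way a ship's position is encoded after the "x", which is purely hypothetical.

```python
MISSING_USAF = "999999"  # assumption: placeholder meaning "no USAF ID"
MISSING_WBAN = "99999"   # assumption: placeholder meaning "no WBAN ID"


def make_station_id(usaf=None, wban=None, lat=None, lng=None):
    """Build a station ID following the prefix scheme described above."""
    if usaf and usaf != MISSING_USAF:
        return "u" + usaf
    if wban and wban != MISSING_WBAN:
        return "w" + wban
    if lat is None or lng is None:
        raise ValueError("need a USAF ID, a WBAN ID, or a position")
    # Hypothetical encoding for ships; the slides do not give the format.
    return f"x{lat:+.3f}{lng:+.3f}"


print(make_station_id("725053", "94728"))
print(make_station_id("999999", "13731"))
print(make_station_id(lat=10.5, lng=-20.25))
```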
GeoJSON
• A rich geographical data format
• Lines, MultiLines, Polygons, Geometries
• Able to perform queries on complex structures
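One detail that is easy to get wrong: GeoJSON coordinates are ordered longitude first, then latitude, as in the schema's position field. A minimal helper, with a made-up name, that builds such a point and rejects values that look swapped:

```python
def geojson_point(lng, lat):
    """Return a GeoJSON Point; coordinates are [longitude, latitude]."""
    if not -180 <= lng <= 180:
        raise ValueError(f"longitude out of range: {lng}")
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    return {"type": "Point", "coordinates": [lng, lat]}


# Same values as the position field in the schema above.
station_position = geojson_point(-94.6, 39.117)
print(station_position)
```

The range check only catches swaps where the longitude is outside the valid latitude range; it is a sanity check, not a validator.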
Schema
airTemperature: {
  value: -4.9,
  quality: "1"
}
Choice: Embedding?
Problem: ~100 "weather codes" and optional sections
• Store them inline
• Store them in another collection
Choice: Embedding?
• Embedding keeps your logic in the schema instead of the application.
• Depends on cardinality, don't embed "squillions"
• Don't embed objects that have to change frequently.
Choice: Unique Identifier
{ _id: ObjectId("53a33f823ed4ac438f8c63b7") }
• Simple, guaranteed unique identifier
• 12 bytes
Choice: Unique Identifier
{
  _id: {
    'st': 'w12345',
    'ts': ISODate("2014-06-19T19:53:58.680Z")
  }
}
• Not great if there are duplicates
• Slightly more complex queries
• ~12 bytes saved per document
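Why duplicates are a problem with a compound _id can be shown without a database. The sketch below emulates the unique _id index with a plain dict; the helper names and the sample readings are made up for illustration.

```python
from datetime import datetime, timezone


def with_compound_id(reading):
    """Move st and ts into _id; key order must be the same in every document."""
    rest = {k: v for k, v in reading.items() if k not in ("st", "ts")}
    return {"_id": {"st": reading["st"], "ts": reading["ts"]}, **rest}


def insert_unique(readings):
    """Emulate the unique _id index with a dict keyed on (st, ts)."""
    stored, rejected = {}, []
    for reading in readings:
        doc = with_compound_id(reading)
        key = (doc["_id"]["st"], doc["_id"]["ts"])
        if key in stored:
            rejected.append(doc)  # MongoDB would report a duplicate key error here
        else:
            stored[key] = doc
    return stored, rejected


ts = datetime(2014, 6, 19, 19, 53, 58, 680000, tzinfo=timezone.utc)
stored, rejected = insert_unique([
    {"st": "w12345", "ts": ts, "airTemperature": {"value": 21.1, "quality": "5"}},
    {"st": "w12345", "ts": ts, "airTemperature": {"value": 21.3, "quality": "1"}},
])
print(len(stored), "stored,", len(rejected), "rejected")
```

Two readings from the same station at the same instant cannot both be stored, so the loader has to decide which one wins.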
Choice: Field Shortening
• Indexes are still the same size
• Decreases readability
• In our example you can save ~40% space with minimum field lengths
• Probably better to go for semi-readable with ~20% space savings
{
  "_id": ObjectId("5298c40f3004e2fe02922e29"),
  "st": "w13731",
  "ts": ISODate("1949-01-01T05:00:00Z"),
  "airTemperature": {
    "quality": "5",
    "value": 1.1
  },
  "skyCondition": {
    "cavok": "N",
    "ceilingHeight": {
      "determination": "9",
      "quality": "4",
      "value": 1433
    }
  },
  ... ... ...
}
1236 Bytes
{
  "_id": ObjectId("5398c40f3004e2fe02922e29"),
  "st": "w13731",
  "ts": ISODate("1949-01-01T05:00:00Z"),
  "aT": {
    "q": "5",
    "v": 1.1
  },
  "sC": {
    "c": "N",
    "cH": {
      "d": "9",
      "q": "4",
      "v": 1433
    }
  },
  ... ... ...
}
786 Bytes
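The renaming between the two documents above can be sketched as a recursive key mapping. The mapping table is read off the two examples; the function name is made up, and the saving is computed from the two byte counts on the slides rather than measured.

```python
# Long name -> short name, as seen in the two example documents above.
SHORT_NAMES = {
    "airTemperature": "aT", "skyCondition": "sC", "ceilingHeight": "cH",
    "cavok": "c", "determination": "d", "quality": "q", "value": "v",
}


def shorten_keys(value):
    """Recursively rename dict keys; unknown keys and all values are kept."""
    if isinstance(value, dict):
        return {SHORT_NAMES.get(k, k): shorten_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten_keys(item) for item in value]
    return value


long_doc = {
    "st": "w13731",
    "airTemperature": {"quality": "5", "value": 1.1},
    "skyCondition": {
        "cavok": "N",
        "ceilingHeight": {"determination": "9", "quality": "4", "value": 1433},
    },
}
short_doc = shorten_keys(long_doc)

# Relative saving implied by the two sizes quoted above.
saving = 1 - 786 / 1236
print(short_doc)
print(f"{saving:.1%} smaller")
```

For this document the saving works out to roughly 36%, in line with the "~40% with minimum field lengths" estimate.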
Choice: Indexes
• Prefer sparse indexes! All Geo indexes are sparse.
• Relying on index intersection can reduce storage needs but compound indexes are more performant.
• Build indexes AFTER ingesting the data!
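The last bullet can be sketched as "insert everything, then create the index once". The function below expects an object with PyMongo-style `insert_many` and `create_index` methods; the recording stand-in class exists only so the sketch runs without a server and is not part of any library.

```python
def load_then_index(collection, batches):
    """Insert all batches first, then build the compound index once."""
    inserted = 0
    for batch in batches:
        # ordered=False lets the server continue past individual failures.
        collection.insert_many(batch, ordered=False)
        inserted += len(batch)
    # Key order (ts, st) matches the index name ts_1_st_1 used in the benchmarks.
    index_name = collection.create_index([("ts", 1), ("st", 1)])
    return inserted, index_name


class RecordingCollection:
    """Stand-in for a real collection; it only records the order of calls."""

    def __init__(self):
        self.calls = []

    def insert_many(self, documents, ordered=True):
        self.calls.append("insert_many")

    def create_index(self, keys):
        self.calls.append("create_index")
        return "_".join(f"{field}_{direction}" for field, direction in keys)


fake = RecordingCollection()
count, name = load_then_index(fake, [[{"st": "u1"}], [{"st": "u2"}, {"st": "u3"}]])
print(count, name, fake.calls)
```

With a real deployment the `collection` argument would be a PyMongo collection such as `db.data`; that wiring is assumed here, not shown in the slides.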
The Application
Overview
[Architecture diagram. Client: Chrome with JavaScript and the Google Earth browser plugin. Server: Python with PyMongo. Labels on the connections: KML, Data.]
Aggregation
pipeline = [{
    '$match': {
        'ts': {
            '$gte': dt,
            '$lt': dt + timedelta(hours=1)},
        'airTemperature.quality': {
            '$in': ['0', '1', '5', '9']}
    }
}, {
    '$group': {
        '_id': '$st',
        'position': {'$first': '$position'},
        'airTemperature': {'$first': '$airTemperature'}}
}]

cursor = db.data.aggregate(pipeline, cursor={})
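What the two pipeline stages compute can be emulated on a plain list of dicts. This is only an illustration of the semantics, not how the server executes the pipeline; the function name and the sample readings are made up.

```python
from datetime import datetime, timedelta

GOOD_QUALITY = {"0", "1", "5", "9"}  # same codes as the $in list above


def first_reading_per_station(docs, dt):
    """Emulate the $match + $group stages on an iterable of dicts."""
    window_end = dt + timedelta(hours=1)
    by_station = {}
    for doc in docs:
        temperature = doc.get("airTemperature", {})
        in_window = dt <= doc["ts"] < window_end               # $gte / $lt
        good = temperature.get("quality") in GOOD_QUALITY      # $in
        if in_window and good and doc["st"] not in by_station:  # $first
            by_station[doc["st"]] = {
                "_id": doc["st"],
                "position": doc.get("position"),
                "airTemperature": temperature,
            }
    return list(by_station.values())


# Made-up sample readings, for illustration only.
sample = [
    {"st": "u1", "ts": datetime(2000, 1, 1, 0, 10), "airTemperature": {"value": 3.0, "quality": "1"}},
    {"st": "u1", "ts": datetime(2000, 1, 1, 0, 40), "airTemperature": {"value": 3.5, "quality": "1"}},
    {"st": "u2", "ts": datetime(2000, 1, 1, 0, 20), "airTemperature": {"value": 9.9, "quality": "3"}},
    {"st": "u3", "ts": datetime(2000, 1, 1, 1, 0), "airTemperature": {"value": 7.0, "quality": "5"}},
]
result = first_reading_per_station(sample, datetime(2000, 1, 1))
print(result)
```

Only station u1 survives: u2 fails the quality filter and u3 falls exactly on the excluded end of the one-hour window. Which reading counts as "first" depends on the order in which documents arrive.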
{
  name : "New York",
  geometry : {
    type: "MultiPolygon",
    coordinates: [
      [
        [
          [-71.94, 41.28],
          [-71.92, 41.29],
          /* 2000 more points... */
          [-71.94, 41.28]
        ]
      ]
    ]
  }
}
db.states.createIndex({
  geometry: '2dsphere'
});
GeoFencing
db.states.find_one({
    'geometry': {
        '$geoIntersects': {
            '$geometry': {
                'type': 'Point',
                'coordinates': [lng, lat]}}}})
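The question this query answers, "which state contains this point?", can be illustrated with a planar ray-casting test. This is a stand-in technique for illustration only: a 2dsphere index works with spherical geometry, and nothing here reflects how the server implements `$geoIntersects`. Holes in polygons are ignored, and the square "state" is made up.

```python
def point_in_ring(lng, lat, ring):
    """Even-odd ray casting on a closed ring of [lng, lat] pairs (planar)."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


def find_state(states, lng, lat):
    """Return the name of the first state whose outer ring contains the point."""
    for state in states:
        # A MultiPolygon is a list of polygons; each polygon is a list of rings.
        for polygon in state["geometry"]["coordinates"]:
            outer_ring = polygon[0]  # holes (further rings) are ignored here
            if point_in_ring(lng, lat, outer_ring):
                return state["name"]
    return None


# Made-up square "state" for illustration only.
states = [{
    "name": "Squareland",
    "geometry": {
        "type": "MultiPolygon",
        "coordinates": [[[[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]]]],
    },
}]
print(find_state(states, 2, 2), find_state(states, 5, 2))
```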
Operational Concerns
Single Server
• Application: c3.8xlarge
• mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
Sharded Cluster
• Application / mongos: c3.8xlarge
• mongod: 100 x r3.2xlarge, each with 61 GB RAM and 100 GB disk
Cost?
• Single server: $60,000 / yr
• Sharded cluster: $700,000 / yr
Performance Breakdown
• Bulk Loading
• Latency and throughput for queries
• point in space-time
• one station, one year
• the whole world at one time
• Aggregation and Exploration
• warmest and coldest day ever, average temperature, etc.
Bulk Loading: Single Server
Settings: 8 threads, batch size 100
Total loading time: 10 h 20 min
Documents per second: ~70,000
Index build time: 7 h 40 min (ts_1_st_1)
Bulk Loading: Sharded Cluster
Shard key: Station ID, hashed
Settings: 10 mongos @ 144 threads, batch size 200
Total loading time: 3 h 10 min
Documents per second: ~228,000
Index build time: 5 min (ts_1_st_1)
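As a sanity check, the loading rate multiplied by the loading time should land near the size of the data set in both setups. A small calculation using only the figures quoted in the two bulk-loading summaries:

```python
def documents_loaded(docs_per_second, hours, minutes):
    """Documents loaded at a constant rate over the given wall-clock time."""
    return docs_per_second * (hours * 3600 + minutes * 60)


single_server = documents_loaded(70_000, 10, 20)   # 10 h 20 min at ~70,000/s
cluster = documents_loaded(228_000, 3, 10)         # 3 h 10 min at ~228,000/s
load_speedup = 228_000 / 70_000

print(f"single server: {single_server:,}")
print(f"cluster:       {cluster:,}")
print(f"speed-up:      {load_speedup:.1f}x")
```

Both products come out at about 2.6 billion documents, consistent with the roughly 2.5 billion data points stated earlier, and the cluster loads a little over three times faster.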
Queries: Point in Space-Time
db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})
[Chart: query latency in ms (avg, 95th and 99th percentile), single server vs. cluster; axis from 0 to 2 ms]
Max. throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)
Queries: One Station, One Year
db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})
[Chart: query latency in ms (avg, 95th and 99th percentile), single server vs. cluster; axis from 0 to 5000 ms]
Max. throughput: 20/s (single server), 430/s (cluster, 10 mongos)
Cluster: targeted query
Queries: The Whole World
db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})
[Chart: query latency in ms (avg, 95th and 99th percentile), single server vs. cluster; axis from 0 to 10000 ms]
Max. throughput: 8/s (single server), 310/s (cluster, 10 mongos)
Cluster: scatter/gather query
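The difference between a targeted and a scatter/gather query follows from the shard key, the hashed station ID. The sketch below is a simplified model: it uses SHA-256 and a modulo as a stand-in, whereas a real cluster uses its own hash function and maps hash ranges to chunks. The function names are made up for illustration.

```python
import hashlib

N_SHARDS = 100  # the cluster above has 100 mongod shards


def shard_for_station(st, n_shards=N_SHARDS):
    """Map a station ID to a shard (stand-in hash, not MongoDB's own)."""
    digest = hashlib.sha256(st.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def shards_to_contact(query, n_shards=N_SHARDS):
    """Return the set of shards a router would have to ask in this model."""
    st = query.get("st")
    if isinstance(st, str):
        # The query pins the shard key to one value: targeted query.
        return {shard_for_station(st, n_shards)}
    # No shard key in the query: every shard must be asked (scatter/gather).
    return set(range(n_shards))


one_station = shards_to_contact({"st": "u103840", "ts": {"$gte": "1989-01-01"}})
whole_world = shards_to_contact({"ts": "2000-01-01T00:00:00Z"})
print(len(one_station), len(whole_world))
```

A query for one station touches a single shard, while a query for one timestamp across all stations fans out to all 100.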
Analytics: Maximum Temperature
db.data.aggregate([
  { "$match" : { "airTemperature.quality" : { "$in" : [ "1", "5" ] } } },
  { "$group" : { "_id" : null, "maxTemp" : { "$max" : "$airTemperature.value" } } }
])
Result: 61.8 °C = 143 °F
Single server: 2 h 30 min
Cluster: 2 min
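The same computation on an in-memory list, together with the unit conversion, as a minimal sketch. The sample readings are made up apart from the 61.8 °C value quoted above, and the function names are illustrative.

```python
def max_temperature(docs, good_quality=("1", "5")):
    """Maximum air temperature among readings with an accepted quality code."""
    values = [
        doc["airTemperature"]["value"]
        for doc in docs
        if doc.get("airTemperature", {}).get("quality") in good_quality
    ]
    return max(values) if values else None


def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


# Made-up readings; the 99.9 value has a rejected quality code and is ignored.
readings = [
    {"airTemperature": {"value": 61.8, "quality": "5"}},
    {"airTemperature": {"value": 99.9, "quality": "3"}},
    {"airTemperature": {"value": -4.9, "quality": "1"}},
]
hottest = max_temperature(readings)
print(hottest, "°C =", round(celsius_to_fahrenheit(hottest)), "°F")
```

The quality filter matters: without it, a single bad sensor reading would become the "record".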
Summary: Single Server
Pro
• Cost Effective
• Low latency for single queries
Con
• Table scans are still slow
Summary: Cluster
Pro
• High throughput
• Very good latency for single queries
• Scatter-gather yields significant speed-up
• Analytics are possible
Con
• High cost
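The trade-off in the two summaries can be put into numbers by dividing the cluster figures by the single-server figures. The throughput values come from the benchmark slides; the assignment of $60,000 to the single server and $700,000 to the cluster is an assumption based on the order in which the two costs appear.

```python
# Assumption: the lower yearly cost belongs to the single server.
COST_PER_YEAR = {"single": 60_000, "cluster": 700_000}

# (single server, cluster) in documents or queries per second.
THROUGHPUT = {
    "bulk load": (70_000, 228_000),
    "point in space-time": (40_000, 610_000),
    "one station, one year": (20, 430),
    "whole world": (8, 310),
}

cost_ratio = COST_PER_YEAR["cluster"] / COST_PER_YEAR["single"]
speedups = {name: cluster / single for name, (single, cluster) in THROUGHPUT.items()}

print(f"cost: {cost_ratio:.1f}x")
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Under that assumption the cluster costs about 11.7 times as much. All three query workloads gain more than that factor in throughput, while bulk loading gains only about 3.3 times.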
Thank You!
J. Randall Hunt (@jrhunt), Developer Advocate, MongoDB
@midwestio