A Century Of Weather Data - Midwest.io

Weather of the Century J. Randall Hunt @jrhunt Developer Advocate, MongoDB @midwestio

Use MongoDB to store and query 4 TB of weather data. Presented at Midwest.io.

Transcript of A Century Of Weather Data - Midwest.io

Page 1: A Century Of Weather Data - Midwest.io

Weather of the Century

J. Randall Hunt @jrhunt Developer Advocate, MongoDB

@midwestio

Page 2: A Century Of Weather Data - Midwest.io

What was the weather the day you were born?

Page 3: A Century Of Weather Data - Midwest.io
Page 4: A Century Of Weather Data - Midwest.io

Agenda

• Data and Schema

• Application

• Operational Concerns

Page 5: A Century Of Weather Data - Midwest.io

MONGODB INTERLUDE!

Page 6: A Century Of Weather Data - Midwest.io

What Is It And Why Use It?

• Document Data Store

• Geo Indexing

• "Simple" Sharded deployments

Page 7: A Century Of Weather Data - Midwest.io

Terminology

RDBMS        MongoDB (Document Store)
Database     Database
Table        Collection
Row(s)       (BSON) Document
Index        Index
Join         Nope.

Page 8: A Century Of Weather Data - Midwest.io

The Data

Page 9: A Century Of Weather Data - Midwest.io

Where To Get Data?

Page 10: A Century Of Weather Data - Midwest.io
Page 11: A Century Of Weather Data - Midwest.io
Page 12: A Century Of Weather Data - Midwest.io

A Weather Datum

• A station ID

• A timestamp

• Lat, Long, Elevation

• A LOT OF WEATHER DATA (135-page manual for parsing)

• Lots of optional sections

Page 13: A Century Of Weather Data - Midwest.io

How much of it do we have?

• 2.5 billion distinct data points

• 4 Terabytes

• Number of documents is huge, overall data size is reasonable

• We'll call this: "moderately big" data
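As a rough sanity check on "moderately big", the average document size follows from the two figures above. This is a back-of-the-envelope sketch only: it assumes decimal terabytes (10^12 bytes) and treats both rounded figures as exact. The slide does not say whether the 4 TB refers to raw or stored data.

```python
# Rough average document size from the slide's figures.
# Assumption: "4 Terabytes" means 4 * 10**12 bytes (decimal TB).
TOTAL_BYTES = 4 * 10**12
DOCUMENT_COUNT = 2_500_000_000  # "2.5 billion distinct data points"

avg_bytes_per_document = TOTAL_BYTES / DOCUMENT_COUNT
print(f"~{avg_bytes_per_document:.0f} bytes per document")
```

Under those assumptions each document is small, which is the point of the slide: the challenge is the document count, not the total volume.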

Page 14: A Century Of Weather Data - Midwest.io

How does it grow?

Page 15: A Century Of Weather Data - Midwest.io

How does it grow?

Page 16: A Century Of Weather Data - Midwest.io

Who Else Is This Relevant For?

• Particle Physics

• Stocks, high frequency trading

• Insurance

• People with lots of small pieces of data

Page 17: A Century Of Weather Data - Midwest.io

Schema Design 101

Page 18: A Century Of Weather Data - Midwest.io

Things We Care About

• Performance

‣ Ingestion

‣ App Specific

‣ Ad-hoc

• Cost

• Flexibility

Page 19: A Century Of Weather Data - Midwest.io

Performance Breakdown

• Bulk Loading

• Latency and throughput for queries

‣ point in space-time

‣ one station, one year

‣ the whole world at one time

• Aggregation and Exploration

‣ warmest and coldest day ever, average temperature, etc.

Page 20: A Century Of Weather Data - Midwest.io

0303725053947282013060322517+40779-073969FM-15+0048KNYC V0309999C00005030485MN0080475N5+02115+02005100975 ADDAA101000095AU100001015AW1105GA1025+016765999GA2045+024385999 GA3075+030485999GD11991+0167659GD22991+0243859GD33991+0304859...

{ "st" : "u725053", "ts" : ISODate("2013-06-03T22:51:00Z"), "airTemperature" : { "value" : 21.1, "quality" : "5" }, "atmosphericPressure" : { "value" : 1009.7, "quality" : "5" } }

Station ID: NYC Central Park
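The step from the raw record to the document can be sketched as fixed-width slicing. The character offsets and scale factors below are assumptions inferred from this one example (they reproduce the values in the document shown above); the function name `parse_fixed_section` is made up for illustration. The optional sections after the fixed part and the missing-value sentinels are not handled.

```python
from datetime import datetime, timezone

# First 105 characters (the fixed-width section) of the record shown above.
RECORD = (
    "0303725053947282013060322517+40779-073969FM-15+0048KNYC "
    "V0309999C00005030485MN0080475N5+02115+02005100975"
)


def parse_fixed_section(rec: str) -> dict:
    """Slice a handful of fields out of the fixed-width section.

    Offsets and scale factors are assumptions inferred from the example.
    """
    # Date (8 digits) and time (4 digits) sit next to each other.
    ts = datetime.strptime(rec[15:27], "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
    return {
        "st": "u" + rec[4:10],  # USAF station ID with the "u" prefix
        "ts": ts,
        "position": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude]; raw values are 1/1000 degree.
            "coordinates": [int(rec[34:41]) / 1000, int(rec[28:34]) / 1000],
        },
        "elevation": int(rec[46:51]),
        # Temperature and pressure are stored as tenths.
        "airTemperature": {"value": int(rec[87:92]) / 10, "quality": rec[92]},
        "atmosphericPressure": {"value": int(rec[99:104]) / 10, "quality": rec[104]},
    }


doc = parse_fixed_section(RECORD)
print(doc["st"], doc["ts"].isoformat(), doc["airTemperature"])
```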

Page 21: A Century Of Weather Data - Midwest.io

Schema

{
  st: "u724463",
  ts: ISODate("1991-01-01T00:00:00Z"),
  position: {
    type: "Point",
    coordinates: [-94.6, 39.117]
  },
  elevation: 231,
  … other fields …
}

station ID and source

Page 22: A Century Of Weather Data - Midwest.io

Stations

• USAF and WBAN IDs exist for most of North America. We prefix the ID with "u" (USAF) or "w" (WBAN).

• For ships we use the prefix "x" and their lat and lng to create a station ID.
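The prefix convention can be sketched as a small helper. Several things here are assumptions, not taken from the slides: the helper name `make_station_id`, the all-nines markers for a missing ID, the choice to prefer the USAF ID when both exist (consistent with the "u725053" example), and above all the way the ship position is encoded after the "x", which the slide does not specify.

```python
from typing import Optional

# Assumption: the source data marks a missing USAF or WBAN ID with all nines.
MISSING_USAF = "999999"
MISSING_WBAN = "99999"


def make_station_id(usaf: str, wban: str,
                    lat: Optional[float] = None,
                    lng: Optional[float] = None) -> str:
    """Build a station ID following the prefix convention on this slide."""
    if usaf != MISSING_USAF:
        return "u" + usaf
    if wban != MISSING_WBAN:
        return "w" + wban
    if lat is None or lng is None:
        raise ValueError("a position is needed when both IDs are missing")
    # Hypothetical encoding of the position; the slide does not give the real one.
    return f"x{lat:+.3f}{lng:+.3f}"


print(make_station_id("725053", "94728"))
print(make_station_id("999999", "13731"))
print(make_station_id("999999", "99999", 40.5, -70.25))
```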

Page 23: A Century Of Weather Data - Midwest.io

Schema

{
  st: "u724463",
  ts: ISODate("1991-01-01T00:00:00Z"),
  position: {
    type: "Point",
    coordinates: [-94.6, 39.117]
  },
  elevation: 231,
  … other fields …
}

GeoJSON

Page 24: A Century Of Weather Data - Midwest.io

GeoJSON

• A rich geographical data format

• Lines, MultiLines, Polygons, Geometries

• Able to perform queries on complex structures

Page 25: A Century Of Weather Data - Midwest.io

Schema

airTemperature: {
  value: -4.9,
  quality: "1"
}

Page 26: A Century Of Weather Data - Midwest.io

Choice: Embedding?

Problem: ~100 "weather codes" and optional sections

• Store them inline

• Store them in another collection

Page 27: A Century Of Weather Data - Midwest.io

Choice: Embedding?

• Embedding keeps your logic in the schema instead of the application.

• Depends on cardinality, don't embed "squillions"

• Don't embed objects that have to change frequently.

Page 28: A Century Of Weather Data - Midwest.io

Choice: Unique Identifier

{_id: ObjectId("53a33f823ed4ac438f8c63b7")}

• Simple, guaranteed unique identifier

• 12 bytes

Page 29: A Century Of Weather Data - Midwest.io

Choice: Unique Identifier

{
  _id: {
    'st': 'w12345',
    'ts': ISODate("2014-06-19T19:53:58.680Z")
  }
}

• Not great if there are duplicates

• Slightly more complex queries

• ~12 bytes saved per document

Page 30: A Century Of Weather Data - Midwest.io

Choice: Field Shortening

• Indexes are still the same size

• Decreases readability

• In our example you can save ~40% space with minimum field lengths

• Probably better to go for semi-readable with ~20% space savings

Page 31: A Century Of Weather Data - Midwest.io

{
  "_id": ObjectId("5298c40f3004e2fe02922e29"),
  "st": "w13731",
  "ts": ISODate("1949-01-01T05:00:00Z"),
  "airTemperature": {
    "quality": "5",
    "value": 1.1
  },
  "skyCondition": {
    "cavok": "N",
    "ceilingHeight": {
      "determination": "9",
      "quality": "4",
      "value": 1433
    }
  },
  ...
}

1236 Bytes

Page 32: A Century Of Weather Data - Midwest.io

{
  "_id": ObjectId("5398c40f3004e2fe02922e29"),
  "st": "w13731",
  "ts": ISODate("1949-01-01T05:00:00Z"),
  "aT": {
    "q": "5",
    "v": 1.1
  },
  "sC": {
    "c": "N",
    "cH": {
      "d": "9",
      "q": "4",
      "v": 1433
    }
  },
  ...
}

786 Bytes
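The renaming between the two documents above can be sketched as a recursive key mapping. The mapping is read off the two examples; the helper name `shorten` and the trimmed sample document are illustrative only. The last line works out the saving implied by the two sizes quoted on the slides.

```python
# Long-to-short field names, read off the two example documents above.
SHORT_NAMES = {
    "airTemperature": "aT",
    "skyCondition": "sC",
    "ceilingHeight": "cH",
    "cavok": "c",
    "determination": "d",
    "quality": "q",
    "value": "v",
}


def shorten(doc):
    """Recursively rename keys; names missing from the mapping are kept."""
    if isinstance(doc, dict):
        return {SHORT_NAMES.get(k, k): shorten(v) for k, v in doc.items()}
    if isinstance(doc, list):
        return [shorten(item) for item in doc]
    return doc


# Trimmed version of the long document, without _id and ts.
long_doc = {
    "st": "w13731",
    "airTemperature": {"quality": "5", "value": 1.1},
    "skyCondition": {
        "cavok": "N",
        "ceilingHeight": {"determination": "9", "quality": "4", "value": 1433},
    },
}
short_doc = shorten(long_doc)

# Space saving implied by the two sizes quoted on the slides.
saving_percent = (1236 - 786) / 1236 * 100
print(short_doc)
print(f"saving: ~{saving_percent:.1f}%")
```

Keeping the mapping in one place means the application can translate back to readable names when presenting data, which softens the readability cost.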

Page 33: A Century Of Weather Data - Midwest.io

Choice: Indexes

• Prefer sparse indexes! All Geo indexes are sparse.

• Relying on index intersection can reduce storage needs but compound indexes are more performant.

• Build indexes AFTER ingesting the data!

Page 34: A Century Of Weather Data - Midwest.io

The Application

Page 35: A Century Of Weather Data - Midwest.io

Overview

[Diagram: the client is Chrome with the Google Earth browser plugin, driven by JavaScript; the server is Python with PyMongo on top of the data. Data reaches the client as KML.]

Page 36: A Century Of Weather Data - Midwest.io

Aggregation

from datetime import timedelta

# dt is the start of the hour of interest (a datetime);
# db is a PyMongo database handle.
pipeline = [{
    '$match': {
        'ts': {'$gte': dt, '$lt': dt + timedelta(hours=1)},
        'airTemperature.quality': {'$in': ['0', '1', '5', '9']}
    }
}, {
    '$group': {
        '_id': '$st',
        'position': {'$first': '$position'},
        'airTemperature': {'$first': '$airTemperature'}
    }
}]

cursor = db.data.aggregate(pipeline, cursor={})
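To make the two stages concrete, here is a plain-Python sketch of what the pipeline above computes, run over an in-memory list instead of MongoDB. The function name `first_reading_per_station` and the sample documents are made up for illustration. As with `$first` without a preceding `$sort`, which reading counts as "first" depends on the order in which documents arrive.

```python
from datetime import datetime, timedelta

GOOD_QUALITY = {"0", "1", "5", "9"}


def first_reading_per_station(docs, dt):
    """Mimic $match (hour window, quality filter) then $group with $first."""
    result = {}
    for doc in docs:
        in_window = dt <= doc["ts"] < dt + timedelta(hours=1)
        if in_window and doc["airTemperature"]["quality"] in GOOD_QUALITY:
            # setdefault keeps the first matching document per station.
            result.setdefault(doc["st"], {
                "_id": doc["st"],
                "position": doc["position"],
                "airTemperature": doc["airTemperature"],
            })
    return list(result.values())


# Made-up sample data.
hour = datetime(1991, 1, 1, 0)
point = {"type": "Point", "coordinates": [-94.6, 39.117]}
docs = [
    {"st": "u724463", "ts": hour, "position": point,
     "airTemperature": {"value": -4.9, "quality": "1"}},
    {"st": "u724463", "ts": hour + timedelta(minutes=30), "position": point,
     "airTemperature": {"value": -4.5, "quality": "1"}},
    {"st": "w13731", "ts": hour, "position": point,
     "airTemperature": {"value": 1.1, "quality": "3"}},  # dropped: quality code
    {"st": "w13731", "ts": hour + timedelta(hours=2), "position": point,
     "airTemperature": {"value": 1.5, "quality": "5"}},  # dropped: outside the hour
]
groups = first_reading_per_station(docs, hour)
print(groups)
```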

Page 37: A Century Of Weather Data - Midwest.io
Page 38: A Century Of Weather Data - Midwest.io
Page 39: A Century Of Weather Data - Midwest.io

GeoFencing

{
  name: "New York",
  geometry: {
    type: "MultiPolygon",
    // MultiPolygon nesting: polygons > rings > positions
    coordinates: [
      [
        [
          [-71.94, 41.28],
          [-71.92, 41.29],
          /* 2000 more points... */
          [-71.94, 41.28]
        ]
      ]
    ]
  }
}

db.states.createIndex({
  geometry: '2dsphere'
});

Page 40: A Century Of Weather Data - Midwest.io

GeoFencing

db.states.find_one({
    'geometry': {
        '$geoIntersects': {
            '$geometry': {
                'type': 'Point',
                'coordinates': [lng, lat]}}}})
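Conceptually, the geofence lookup asks whether a point lies inside a stored outline. The sketch below shows that idea with a planar ray-casting test in plain Python. It is not how MongoDB implements `$geoIntersects` (a 2dsphere index works with spherical geometry), and the function name `point_in_ring` and the square fence are made up for illustration.

```python
def point_in_ring(lng, lat, ring):
    """Ray-casting test: is (lng, lat) inside a closed ring of [lng, lat] pairs?"""
    inside = False
    # Walk consecutive vertex pairs; the ring repeats its first point at the end.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        crosses = (y1 > lat) != (y2 > lat)
        # Toggle when a ray going east from the point crosses this edge.
        if crosses and lng < x1 + (lat - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


# Made-up square fence, closed like a GeoJSON linear ring.
fence = [[-75.0, 40.0], [-73.0, 40.0], [-73.0, 42.0], [-75.0, 42.0], [-75.0, 40.0]]

print(point_in_ring(-74.0, 41.0, fence))
print(point_in_ring(-72.0, 41.0, fence))
```

For real state outlines with thousands of points, the indexed database query is the better tool; the sketch only shows what question is being asked.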

Page 41: A Century Of Weather Data - Midwest.io

Operational Concerns

Page 42: A Century Of Weather Data - Midwest.io

Single Server

Application: c3.8xlarge

mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD

Page 43: A Century Of Weather Data - Midwest.io

Sharded Cluster

Application / mongos: c3.8xlarge

mongod: 100 x r3.2xlarge, each with 61 GB RAM and 100 GB disk

Page 44: A Century Of Weather Data - Midwest.io

Cost?

Single server: $60,000 / yr

Sharded cluster: $700,000 / yr

Page 45: A Century Of Weather Data - Midwest.io

Performance Breakdown

• Bulk Loading

• Latency and throughput for queries

‣ point in space-time

‣ one station, one year

‣ the whole world at one time

• Aggregation and Exploration

‣ warmest and coldest day ever, average temperature, etc.

Page 46: A Century Of Weather Data - Midwest.io

Bulk Loading: Single Server

8 threads, batch size 100

Page 47: A Century Of Weather Data - Midwest.io

Bulk Loading: Single Server

Settings: 8 threads, batch size 100

Total loading time: 10 h 20 min

Documents per second: ~70,000

Index build time: 7 h 40 min (ts_1_st_1)
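Loading in fixed-size batches can be sketched as below. This is an assumption about the general shape of such a loader, not the loader used for these numbers. `insert_batch` is a hypothetical caller-supplied callable; in a real loader it would wrap the driver's bulk insert (for example PyMongo's `Collection.insert_many`). Threading is left out.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_load(documents, insert_batch, batch_size=100):
    """Send documents in batches; `insert_batch` is a caller-supplied callable."""
    loaded = 0
    for batch in batches(documents, batch_size):
        insert_batch(batch)
        loaded += len(batch)
    return loaded


# Stand-in for a real driver call, so the sketch runs without a database.
received = []
total = bulk_load(({"n": i} for i in range(250)), received.append)
print(total, [len(b) for b in received])
```

Because `batches` consumes a plain iterator, the input never has to fit in memory at once, which matters when the source is billions of records.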

Page 48: A Century Of Weather Data - Midwest.io

Bulk Loading: Sharded Cluster

144 threads, batch size 200

Page 49: A Century Of Weather Data - Midwest.io

Bulk Loading: Sharded Cluster

Shard key: station ID, hashed

Settings: 10 mongos @ 144 threads, batch size 200

Total loading time: 3 h 10 min

Documents per second: ~228,000

Index build time: 5 min (ts_1_st_1)
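The loading rates can be cross-checked against the total loading times. A rough sketch, assuming the 2.5 billion figure from earlier in the deck as the document count; since that figure is rounded, the results land near the quoted ~70,000 and ~228,000 rather than exactly on them.

```python
DOCUMENTS = 2_500_000_000  # "2.5 billion distinct data points", a rounded figure


def seconds(hours, minutes):
    """Convert a duration given as hours and minutes to seconds."""
    return hours * 3600 + minutes * 60


single_rate = DOCUMENTS / seconds(10, 20)   # single server: 10 h 20 min
cluster_rate = DOCUMENTS / seconds(3, 10)   # sharded cluster: 3 h 10 min
load_speedup = seconds(10, 20) / seconds(3, 10)

print(f"single server: ~{single_rate:,.0f} docs/s")
print(f"cluster:       ~{cluster_rate:,.0f} docs/s")
print(f"speed-up:      ~{load_speedup:.1f}x")
```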

Page 50: A Century Of Weather Data - Midwest.io

Queries: Point in Space-Time

db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})

Page 51: A Century Of Weather Data - Midwest.io

Queries: Point in Space-Time

[Chart: query latency in ms (avg, 95th and 99th percentile), single server vs. cluster; axis from 0 to 2 ms]

Max. throughput: 40,000/s (single server), 610,000/s (cluster, 10 mongos)

Page 52: A Century Of Weather Data - Midwest.io

Queries: One Station, One Year

db.data.find({"st" : "u103840", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})

Page 53: A Century Of Weather Data - Midwest.io

Queries: One Station, One Year

[Chart: query latency in ms (avg, 95th and 99th percentile), single server vs. cluster; axis from 0 to 5000 ms]

Max. throughput: 20/s (single server), 430/s (cluster, 10 mongos)

targeted query

Page 54: A Century Of Weather Data - Midwest.io

Queries: The Whole World

db.data.find({"ts" : ISODate("2000-01-01T00:00:00Z")})

Page 55: A Century Of Weather Data - Midwest.io

Queries: The Whole World

[Chart: query latency in ms (avg, 95th and 99th percentile), single server vs. cluster; axis from 0 to 10000 ms]

Max. throughput: 8/s (single server), 310/s (cluster, 10 mongos)

scatter/gather query

Page 56: A Century Of Weather Data - Midwest.io

Analytics: Maximum Temperature

db.data.aggregate([
  { "$match": { "airTemperature.quality": { "$in": [ "1", "5" ] } } },
  { "$group": { "_id": null,
                "maxTemp": { "$max": "$airTemperature.value" } } }
])

Result: 61.8 °C = 143 °F

Single server: 2 h 30 min

Cluster: 2 min
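The same analytic, plus the unit conversion, as a plain-Python sketch over an in-memory list. The function names and the sample readings are made up for illustration; only the quality codes "1" and "5" and the 61.8 °C figure come from the slide.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32


def max_temperature(docs, good_quality=("1", "5")):
    """Plain-Python analogue of the $match + $group/$max pipeline."""
    values = [
        d["airTemperature"]["value"]
        for d in docs
        if d.get("airTemperature", {}).get("quality") in good_quality
    ]
    return max(values) if values else None


# Made-up readings; the 99.9 value is excluded by its quality code.
readings = [
    {"airTemperature": {"value": 21.1, "quality": "5"}},
    {"airTemperature": {"value": 61.8, "quality": "1"}},
    {"airTemperature": {"value": 99.9, "quality": "3"}},
    {"st": "w13731"},  # no temperature section at all
]
hottest = max_temperature(readings)
print(hottest, round(celsius_to_fahrenheit(hottest)))
```

The quality filter matters here: without it, a single bad reading would win the maximum.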

Page 57: A Century Of Weather Data - Midwest.io

Summary: Single Server

Pro

• Cost Effective

• Low latency for single queries

Con

• Table scans are still slow

Page 58: A Century Of Weather Data - Midwest.io

Summary: Cluster

Pro

• High throughput

• Very good latency for single queries

• Scatter-gather yields significant speed-up

• Analytics are possible

Con

• High cost

Page 59: A Century Of Weather Data - Midwest.io

Thank You!

J. Randall Hunt @jrhunt Developer Advocate, MongoDB

@midwestio