Agility and Scalability with MongoDB

Post on 29-Nov-2014


Description

MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and flexibility to maintain development velocity, despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly-scaled customer applications of MongoDB.

Transcript of Agility and Scalability with MongoDB

MongoDB Scalability and Agility

Chris.Biow@MongoDB.com

2

• Now

• Secure

• All varieties

• Fast and interactive

• Scalable to “Big”

• Agile to develop and deploy operationally

• Cloud and edge

Data Challenge: “I want my data...”

iStock licensed (pixelfit)

3

Scalability with MongoDB

• Operations per Second: concurrent reads and writes per second (> 1 million per second)

• Nodes per Cluster: horizontal scale-out, distributed to multiple data centers worldwide, with high availability, using inexpensive cloud resources (> 1,000 nodes)

• Records / Documents: data objects in any number of schemas or structures (> 10 billion)

• Data Volume: total amount of data, documents × average size (> 1 petabyte = 10^15 bytes ≈ 2^50)
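The data-volume arithmetic above can be checked directly (plain Python, nothing MongoDB-specific; the comparison shows how close the decimal and binary petabyte are):

```python
# Decimal petabyte: 10^15 bytes; nearest binary power: 2^50 (a pebibyte).
pb_decimal = 10**15
pb_binary = 2**50

print(f"{pb_decimal:,}")   # 1,000,000,000,000,000
print(f"{pb_binary:,}")    # 1,125,899,906,842,624
print(round(pb_binary / pb_decimal, 3))   # 1.126, i.e. within ~13%
```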

Key Differentiation

5

Operational Database Landscape

6

Document Data Model

Relational vs. MongoDB

{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  city: ‘London’,
  location: [45.123, 47.232],
  cars: [
    { model: ‘Bentley’, year: 1973, value: 100000, … },
    { model: ‘Rolls Royce’, year: 1965, value: 330000, … }
  ]
}

7

Documents are Rich Data Structures

{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  cell: ‘+447557505611’,
  city: ‘London’,
  location: [45.123, 47.232],
  profession: [‘banking’, ‘finance’, ‘trader’],
  cars: [
    { model: ‘Bentley’, year: 1973, value: 100000, … },
    { model: ‘Rolls Royce’, year: 1965, value: 330000, … }
  ]
}

Annotations on the document above:

• Fields

• Typed field values: String, Number, Geo-Coordinates

• Fields can contain arrays

• Fields can contain an array of sub-documents

8

Document Model Benefits

• Agility and flexibility
  – Data model supports business change
  – Rapidly iterate to meet new requirements

• Intuitive, natural data representation
  – Eliminates the ORM layer
  – Developers are more productive

• Reduces the need for joins and disk seeks
  – Simpler programming
  – Performance delivered at scale
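As a sketch of the "eliminates ORM / reduces joins" point: a plain Python dict standing in for the BSON document from the earlier slide shows how embedded sub-documents let one read return the whole aggregate (no driver or server involved; field values are from the slide):

```python
# The customer and their cars as one embedded document, rather than
# separate 'customers' and 'cars' tables joined on a foreign key.
customer = {
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# A single fetch yields the entire object: no JOIN, no ORM mapping layer.
total_value = sum(car["value"] for car in customer["cars"])
print(total_value)  # 430000
```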

11

Big Data Tech Interest Comparison

j.mp/Ssvpev

12

Enterprise Adoption Comparison

bit.ly/1vAI7rF

Architecture for Availability & Scalability

14

• Replica set: two or more copies

• Availability solution
  – High availability
  – Disaster recovery
  – Maintenance

• Deployment flexibility
  – Data locality to users
  – Workload isolation: operational & analytics

• Self-healing shard

(Diagram: the application’s driver writes to the primary, which replicates to two secondaries.)

16

Global Data Distribution

(Diagram: a primary and secondaries distributed across data centers worldwide, each region receiving real-time replication.)

17

Automatic Sharding

• Sharding types
  – Range
  – Hash
  – Tag-aware

• Elastic increase or decrease in capacity

• Automatic balancing
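A toy illustration of the elastic-capacity and automatic-balancing bullets (the chunk counts and the balancing rule are simplifications; the real balancer also weighs data size and migration cost):

```python
# 12 chunks (contiguous shard-key ranges) spread across 3 shards.
chunks = list(range(12))
shards = {0: chunks[0:4], 1: chunks[4:8], 2: chunks[8:12]}

# Elastically add an empty shard; the "balancer" migrates chunks until
# no shard holds more than one chunk above the minimum.
shards[3] = []
while max(map(len, shards.values())) - min(map(len, shards.values())) > 1:
    donor = max(shards, key=lambda s: len(shards[s]))
    recipient = min(shards, key=lambda s: len(shards[s]))
    shards[recipient].append(shards[donor].pop())

print({s: len(c) for s, c in shards.items()})  # {0: 3, 1: 3, 2: 3, 3: 3}
```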

18

Query Routing

• Multiple query optimization models

• Each sharding option appropriate for different apps

Performance

20

Drag Strip: straight ahead, quarter-mile, stop

21

Road Race: stay fast, stay agile, continuous

Nürburgring, Germany

MongoDB at Scale

24

• Large data set

CarFax

25

Baseline

• Vehicle History Database

• 11 billion records (growing at 1 billion per year)

• 30-year-old VMS-based RDBMS

• Cumbersome

• Costly

MongoDB comparison (in-depth NoSQL evaluation)

• Performance: 4x faster than baseline, 10x faster than key-value

• Scale-out using inexpensive commodity servers

• Built-in redundancy

• Flexible dynamic-schema data model

• Strong consistency

• Analytics/aggregation

Initial production

• MongoDB is the primary data store

• 50 servers: 10 shards with 5-node replica sets per shard

26

• 13+ billion documents
  – 1.5 billion documents added every year

• 1 vehicle history report is > 200 documents

• 12 shards

• 9-node replica sets

• Replicas distributed across 3 data centers

CARFAX Sharding and Replication

27

CARFAX Replication

28

29

• 50M users

• 6B check-ins to date (growing 6M per day)

• 55M points of interest / venues

• 1.7M merchants using the platform for marketing

• Operations per second: 300,000

• Documents: 5.5B (~16.5B with replication)

Foursquare

30

• 11 MongoDB clusters
  – 8 are sharded

• Largest cluster is for check-ins
  – 15 shards
  – Shard key: user_id

Foursquare clusters

31

Facebook / parse.com mobile apps

• Persistent database for 270,000 mobile applications

• 200M end-user mobile devices

• 250% annual growth in client apps

• 500% growth in requests

• 1.5M collections

• Key differentiators:
  – Document data model
  – High performance & availability
  – Geospatial query and index

• Charity Majors on operations: j.mp/X3jVRC
  – “Understand your database and your data, and build for them.”

Scalability Exercises in the Cloud with Amazon Web Services

35

• 27x hs1.8xlarge instances
  – 16x vCPU
  – 24x 2TB SATA drives, RAID0
  – 8x mongod microshards

• Modified Yahoo! Cloud Serving Benchmark (YCSB)
  – Long integer IDs (> 2B)
  – Zipfian-distributed integer fields
  – Aggregation queries

• Load direct to 216 shards, 10 days, $4K

• Resulting collection stats (commas added):
      "objects" : 7,170,648,489,
      "avgObjSize" : 147,438.99952658816,
      "dataSize" : NumberLong("1,057,240,224,818,640")

Petascale Database
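The reported stats are internally consistent, which is worth checking when transcribing numbers this large (plain Python arithmetic; byte counts from the slide):

```python
objects = 7_170_648_489
avg_obj_size = 147_438.99952658816            # bytes, as reported
data_size = 1_057_240_224_818_640             # bytes

# db.stats() computes avgObjSize as dataSize / objects; the figures
# above agree to within about a byte per object.
print(data_size / objects)        # ~147,439 bytes per document
print(data_size / 10**15)         # ~1.057 petabytes
```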

CGroup Memory Segregation

# One 48 GB memory cgroup per mongod microshard. Fixes vs. the slide:
# the stray $D references become $DB, and `sudo echo 48G >` is replaced
# with `tee` (the redirect would otherwise run without root).
for DB in `seq 0 3`; do
  sudo cgcreate \
    -a mongodb:mongodb \
    -t mongodb:mongodb \
    -g memory:mongodb$DB
  echo 48G | sudo tee \
    /sys/fs/cgroup/memory/mongodb$DB/memory.limit_in_bytes
  cgexec \
    -g memory:mongodb$DB \
    numactl --interleave=all \
    mongod --config ~/mongod$DB.conf
done

37

• Ingest 250-byte stock quotes at 2M/s

• Concurrently run 5 QPS with subsecond, indexed response on timeStamp, accountId, instrumentId, systemKey

• 5x r3.4xlarge
  – 16x vCPU, 1x 320GB SSD, 122GB RAM, 16x mongod
  – 2.1M inserts/second direct to shards

• 16x c3.8xlarge
  – 32x vCPU, 2x 320GB SSD, 60GB RAM, 16x mongod, 4x mongos
  – 2.1M inserts/second via mongos

Megawrite Ingest

38

• 2 threads on c3.8xlarge

• 264-byte (bsonsize) objects, _id index only

• coll.insert(): 15,600 inserts/sec

• coll.insert(List<DBObject>), list size = 64: 118,000 inserts/sec

• Bulk ops API, size = 64: 120,000 inserts/sec

Java API comparison

BulkWriteOperation bo = null;
for (a = 0; a < this.items && stayAlive; a++) {
    if (bo == null) {
        bo = collection.initializeUnorderedBulkOperation();
    }
    fillMap(this.m);
    bo.insert(new BasicDBObject(this.m));
    // Execute every `listsize` inserts; (a + 1) avoids flushing a
    // one-document batch at a == 0, as the slide's `a % listsize` did.
    if ((a + 1) % listsize == 0) {
        bo.execute();
        bo = null;
    }
}
// Flush any remaining documents in a partial final batch.
if (bo != null) {
    bo.execute();
}

7x Load with BulkOp

How do I Pick A Shard Key?

41

Shard Key characteristics

• A good shard key has:
  – Sufficient cardinality
  – Distributed writes
  – Targeted reads ("query isolation")

• The shard key should be in every query if possible
  – Otherwise the query is scatter-gather

• Choosing a good shard key is important!
  – It affects performance and scalability
  – Changing it later is expensive

42

Hashed shard key

• Pros:
  – Evenly distributed writes

• Cons:
  – Random data (and index) updates can be I/O intensive
  – Range-based queries turn into scatter-gather

(Diagram: mongos distributing operations across Shard 1 through Shard N.)
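The scatter-gather con can be seen in a few lines; MD5 here merely stands in for MongoDB's hashed-index function (an assumption for illustration, as is the date-string key):

```python
import hashlib

SHARDS = 4

def hash_shard(key: str) -> int:
    # Stand-in for MongoDB's hashed shard key (MD5 is an assumption).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % SHARDS

days = [f"2014-{m:02d}-{d:02d}" for m in range(1, 13) for d in range(1, 29)]

# Writes spread evenly across the shards...
writes = {hash_shard(k) for k in days}

# ...but a first-quarter range query still has to visit every shard,
# because consecutive keys hash to unrelated shards.
q1 = {hash_shard(k) for k in days if k[5:7] in ("01", "02", "03")}
print(len(writes), len(q1))
```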

43

Low cardinality shard key

• Induces "jumbo chunks"

• Example: a boolean field

(Diagram: mongos with an oversized [ a, b ) chunk confined to one shard.)

44

Ascending shard key

• Monotonically increasing shard key values cause "hot spots" on inserts

• Examples: timestamps, _id

(Diagram: every insert routed to the single shard owning the [ ISODate(…), $maxKey ) chunk.)
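The hot spot is easy to see in a toy range-partitioned setup (the split points and key values are made up for illustration):

```python
# Chunk boundaries on an ascending key; the last chunk is unbounded
# above ([3000, $maxKey)), so it owns every future key.
split_points = [1000, 2000, 3000]

def owning_chunk(key: int) -> int:
    for i, split in enumerate(split_points):
        if key < split:
            return i
    return len(split_points)          # the final, unbounded chunk

# Fresh, monotonically increasing keys (e.g. timestamps, ObjectIds)...
new_keys = range(3000, 3100)
targets = {owning_chunk(k) for k in new_keys}
print(targets)  # {3}: every insert lands on one chunk, hence one shard
```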

Ensuring Success with High Scalability

46

Success Factors

• Storage: random seeks (IOPS)

• RAM: working set based on query patterns

• Query: indexing

• Delete: most expensive operation

• Real-time vs. bulk operations

• Continuity: HA, DR, backup, restore

• Agile process: iterate by powers of 4

• Sharding: shard key and strategy

• Resources: don’t go it alone!