MongoDB Tokyo - Monitoring and Queueing

MongoDB Queueing & Monitoring

Presentation given by @davidmytton at Mongo Tokyo meetup 15th Nov 2011.

Transcript of MongoDB Tokyo - Monitoring and Queueing

Page 1: MongoDB Tokyo - Monitoring and Queueing

MongoDB Queueing & Monitoring

Page 2: MongoDB Tokyo - Monitoring and Queueing
Page 3: MongoDB Tokyo - Monitoring and Queueing

• Server Density

• 26 nodes

• 6 replica sets

• Primary datastore = 15 nodes

Page 4: MongoDB Tokyo - Monitoring and Queueing

• Server Density

• +7TB / mth

• +1bn docs / mth

• 2-5k inserts/s @ 3ms

We use MongoDB as our primary data store but also as a queueing system. So I’m going to talk first about how we built the queuing functionality into Mongo and then more generally about what you need to keep an eye on when monitoring MongoDB in production.

Page 6: MongoDB Tokyo - Monitoring and Queueing

www.flickr.com/photos/triplexpresso/496995086/

Queuing: Uses

• Background processing

Page 7: MongoDB Tokyo - Monitoring and Queueing

www.flickr.com/photos/triplexpresso/496995086/

Queuing: Uses

• Background processing

• Sending notifications

Page 8: MongoDB Tokyo - Monitoring and Queueing

www.flickr.com/photos/triplexpresso/496995086/

Queuing: Uses

• Background processing

• Sending notifications

• Event streaming

Asynchronous

Page 9: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

Page 10: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• Consumers

Page 11: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• Atomic

• Consumers

Page 12: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• Speed

• Atomic

• Consumers

Page 13: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• Speed

• Atomic

• Consumers

• GC

Page 14: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• Consumers

Page 15: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• Consumers

MongoDB: Mongo Wire Protocol
RabbitMQ: AMQP

If you’re building a queue, consumers need a protocol to connect with - for RabbitMQ that’s AMQP; for MongoDB it’s the Mongo Wire Protocol, spoken by the standard drivers.

Page 16: MongoDB Tokyo - Monitoring and Queueing

en.wikipedia.org/wiki/State_of_matter

Queuing: Features

• Atomic

Page 17: MongoDB Tokyo - Monitoring and Queueing

en.wikipedia.org/wiki/State_of_matter

Queuing: Features

• Atomic

MongoDB: findAndModify
RabbitMQ: consume/ack

Page 18: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• Speed

Page 19: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• GC

Page 20: MongoDB Tokyo - Monitoring and Queueing

Queuing: Features

• GC

MongoDB: ☹
RabbitMQ: consume/ack

Page 21: MongoDB Tokyo - Monitoring and Queueing

Implementation

• Consumers

2 things we need to implement - consumers and GC

Page 22: MongoDB Tokyo - Monitoring and Queueing

Implementation

• Consumers

db.runCommand( { findAndModify : <collection>, <options> } )

findAndModify command takes 2 parameters - collection and options.

Page 23: MongoDB Tokyo - Monitoring and Queueing

Implementation

• Consumers

db.runCommand( { findAndModify : <collection>, <options> } )

{ query: { inProg: false } }

query: filter (WHERE)

Specify the query just like any normal query against Mongo. The very first document that matches will be returned. Since we’re building a queuing system, we use a field called inProg and ask for documents where it is false - i.e. documents whose processing isn’t yet in progress.

Page 24: MongoDB Tokyo - Monitoring and Queueing

Implementation

• Consumers

db.runCommand( { findAndModify : <collection>, <options> } )

{ update: { $set: {inProg: true, start: new Date()} } }

update: modifier object

Atomic update.

Page 25: MongoDB Tokyo - Monitoring and Queueing

Implementation

• Consumers

db.runCommand( { findAndModify : <collection>, <options> } )

{ sort: { added: -1 } }

sort: selects the first one on multi-match

We can also sort, e.g. on a timestamp, to control which matching document comes back - in time order, or by a priority field so more important documents are returned first.

Page 26: MongoDB Tokyo - Monitoring and Queueing

Implementation

• Consumers

db.runCommand( { findAndModify : <collection>, <options> } )

remove: true = deletes on return
new: true = returns modified object
fields: return specific fields
upsert: true = create object if !exists()
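Putting those options together, a minimal consumer loop might look like the following sketch in Python with PyMongo. It uses the modern find_one_and_update wrapper around findAndModify; the database/collection names, the process() stub and the back-off interval are illustrative assumptions, not part of the original talk.

import datetime
import time

from pymongo import MongoClient, ReturnDocument

client = MongoClient()          # assumes a local mongod/mongos
queue = client.myapp.queue      # hypothetical database and collection names

def process(job):
    print('processing', job['_id'])   # stand-in for real worker logic

while True:
    # Atomically claim the next unclaimed job: query + update + sort,
    # the same findAndModify options shown above.
    job = queue.find_one_and_update(
        {'inProg': False},
        {'$set': {'inProg': True, 'start': datetime.datetime.now()}},
        sort=[('added', -1)],
        return_document=ReturnDocument.AFTER,
    )
    if job is None:
        time.sleep(1)           # queue empty; back off before polling again
        continue
    process(job)
    queue.delete_one({'_id': job['_id']})   # "ack" by removing the job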

Page 27: MongoDB Tokyo - Monitoring and Queueing

Implementation

• GC

Page 28: MongoDB Tokyo - Monitoring and Queueing

Implementation

• GC

import datetime

now = datetime.datetime.now()
difference = datetime.timedelta(seconds=10)
timeout = now - difference

queue.find({'inProg': True, 'start': {'$lte': timeout}})
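A sketch of the GC itself, releasing any job that has been claimed for longer than the timeout so another consumer can pick it up (PyMongo 3+ update_many; the 10-second timeout follows the slide):

import datetime

def requeue_stale_jobs(queue, seconds=10):
    # Jobs still flagged inProg after the timeout are assumed to belong
    # to a dead consumer; clearing the flag puts them back on the queue.
    timeout = datetime.datetime.now() - datetime.timedelta(seconds=seconds)
    result = queue.update_many(
        {'inProg': True, 'start': {'$lte': timeout}},
        {'$set': {'inProg': False}},
    )
    return result.modified_count

Run it periodically, e.g. from a cron job or a background thread.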

Page 29: MongoDB Tokyo - Monitoring and Queueing

Stick with RabbitMQ?

Page 30: MongoDB Tokyo - Monitoring and Queueing

Stick with RabbitMQ?

QoS

Page 31: MongoDB Tokyo - Monitoring and Queueing

Stick with RabbitMQ?

AMQP

QoS

Page 32: MongoDB Tokyo - Monitoring and Queueing

AMQP

Stick with RabbitMQ?

Throttling

QoS

Page 33: MongoDB Tokyo - Monitoring and Queueing

It’s a little different,but not entirely new.

The problem is that MongoDB is fairly new and whilst it’s still just another database running on a server, there are things that are new and unusual. This means that some old assumptions are still valid, but others aren’t. You don’t have to approach it as a completely new thing, but it is a little different. There are disadvantages to this but one advantage is you can use it for novel tasks, like queuing.

Page 34: MongoDB Tokyo - Monitoring and Queueing

www.flickr.com/photos/comedynose/4388430444/

Keep it in RAM. Obviously.

The first and most obvious thing to note is that keeping everything in RAM is faster. But what does that actually mean and how do you know when something is in RAM?

Page 35: MongoDB Tokyo - Monitoring and Queueing

http://www.flickr.com/photos/comedynose/4388430444/

How do you know?

> db.stats()
{
    "collections" : 3,
    "objects" : 379970142,
    "avgObjSize" : 146.4554114991488,
    "dataSize" : 55648683504,
    "storageSize" : 61795435008,
    "numExtents" : 64,
    "indexes" : 1,
    "indexSize" : 21354514128,
    "fileSize" : 100816388096,
    "ok" : 1
}

dataSize ≈ 51GB

indexSize ≈ 19GB

The easiest way is to check the database size. The MongoDB console provides an easy way to look at the data and index sizes, and the output is provided in bytes.
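The same check is easy to script if you want to alert on it; a minimal sketch with PyMongo (the 'sd' database name is an assumption):

from pymongo import MongoClient

client = MongoClient()
stats = client.sd.command('dbstats')   # the command behind db.stats()

gb = 1024 ** 3
print('indexSize: %.1f GB' % (stats['indexSize'] / gb))   # keep in RAM, always
print('dataSize:  %.1f GB' % (stats['dataSize'] / gb))    # in RAM if you can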

Page 36: MongoDB Tokyo - Monitoring and Queueing

http://www.flickr.com/photos/comedynose/4388430444/

Where should it go?

What? Should it be in memory?

Indexes Always

Data If you can

In every case, having something in memory is going to be faster than not. However, that’s not always feasible if you have massive data sets. Instead, you want to make sure you always have enough RAM to store all the indexes, which is what the db.stats() output is for. And if you can, have space for data too. MongoDB is smart about its memory management so it will keep commonly accessed data in RAM where possible.

Page 37: MongoDB Tokyo - Monitoring and Queueing

How you’ll know

1) Slow queries

Thu Oct 14 17:01:11 [conn7410] update sd.apiLog query: { c: "android/setDeviceToken", a: 1466, u: "blah", ua: "Server Density Android" } 51926ms

www.flickr.com/photos/tonivc/2283676770/

Although it’s not the only cause, a slow query can indicate insufficient memory. It might be that you don’t have the most optimal indexes for the query, but if indexes are being used and it’s still slow, it could be a disk i/o bottleneck because the data isn’t in RAM. Doing an explain on the query will show you which indexes it is using.
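For example, a sketch of running explain from PyMongo (the collection and filter mirror the log line above):

from pymongo import MongoClient

client = MongoClient()
cursor = client.sd.apiLog.find({'c': 'android/setDeviceToken', 'a': 1466})
print(cursor.explain())   # shows which index (if any) the query used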

Page 38: MongoDB Tokyo - Monitoring and Queueing

How you’ll know

2) Timeouts

cursor timed out (20000 ms)

These slow queries will obviously cause a slowdown in your app, but they may also cause timeouts. In the PHP driver a cursor will time out after 20,000 ms by default, although this is configurable.
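Other drivers expose the same knobs. In PyMongo, for instance, you can stop the server expiring an idle cursor, or bound a query's execution time instead (a sketch; database and collection names are assumed):

from pymongo import MongoClient

client = MongoClient()
coll = client.myapp.events

# Keep the server-side cursor alive for long-running batch jobs...
for doc in coll.find({}, no_cursor_timeout=True):
    pass   # remember to exhaust or close the cursor when done

# ...or fail fast instead, capping server execution at 20 seconds.
docs = list(coll.find({}).max_time_ms(20000))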

Page 39: MongoDB Tokyo - Monitoring and Queueing

How you’ll know

3) Disk i/o spikes

www.flickr.com/photos/daddo83/3406962115/

You’ll see write spikes because MongoDB syncs data to disk periodically, but if you’re seeing read spikes then that can indicate MongoDB is having to read the data files rather than accessing data from memory. Be careful though because this won’t distinguish between data and indexes, or even other server activity. Read spikes can also occur even if you have little or no read activity if the mongod is part of a cluster where the slaves are reading from the oplog.

Page 40: MongoDB Tokyo - Monitoring and Queueing

Watch your storage

1) Pre-alloc

It sounds obvious, but our statistics show that people run out of disk space suddenly, even though usage increases predictably over time. Remember that MongoDB pre-allocates files before the space is used, so you’ll see your storage being consumed in 2GB increments (once you go past the smaller initial data file sizes).

Page 41: MongoDB Tokyo - Monitoring and Queueing

Watch your storage

2) Sharding maxSize

When adding a new shard you can specify the maximum amount of data you want to store on that shard. This isn’t a hard limit and is instead used as a guide. MongoDB will try to keep the data balanced across all your shards so that it meets this setting but it may not. MongoDB doesn’t currently look at actual disk levels and assumes available capacity is the same across all nodes. As such, it’s advisable that you set this to around 70% of the total available disk space.
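For instance, adding a shard with a maxSize of 70% of a 500GB volume might look like this (a sketch run against a mongos via PyMongo; the hostnames are illustrative and maxSize is specified in MB):

from pymongo import MongoClient

mongos = MongoClient('mongos1', 27017)   # hypothetical mongos address

# 70% of 500GB ≈ 358400 MB
mongos.admin.command('addShard', 'set3/rs3a:27018,rs3b:27018',
                     maxSize=358400)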

Page 42: MongoDB Tokyo - Monitoring and Queueing

Watch your storage

3) Logging

--quiet

db.runCommand("logRotate");

killall -SIGUSR1 mongod

Logging is verbose by default, so you’ll want to use the quiet option to ensure only important things are output. And assuming you’re logging to a file, you’ll want to rotate it periodically via the MongoDB console so that it doesn’t get too big. You can also send SIGUSR1 to all your mongod processes with killall from the shell, which triggers a log rotation (that’s what the SIGUSR1 signal does). This is useful if you want to script log rotation or put it into a cron job.
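If you’d rather drive the rotation from a script than via signals, the same command is available through any driver; a PyMongo sketch:

from pymongo import MongoClient

client = MongoClient()
client.admin.command('logRotate')   # same as db.runCommand("logRotate")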

Page 43: MongoDB Tokyo - Monitoring and Queueing

Watch your storage

4) Journaling

david@rs2b ~: ls -alh /mongodbdata/journal/
total 538M
drwxrwxr-x 2 david david   29 Mar 20 16:50 .
drwx------ 4 david david 4.0K Mar 13 09:50 ..
-rw------- 1 david david 538M Mar 20 17:00 j._862
-rw------- 1 david david   88 Mar 20 17:00 lsn

Mongo should rotate the journal files often but you need to remember that they will take up some space too, and as new files are allocated and old ones deleted, you may see your disk usage spiking up and down.

Page 44: MongoDB Tokyo - Monitoring and Queueing

db.serverStatus()

The server status command provides a lot of different statistics that can help you, like this map of traffic in central Tokyo.

Page 45: MongoDB Tokyo - Monitoring and Queueing

1) Used connections

db.serverStatus()

www.flickr.com/photos/armchaircaver/2061231069/

Every connection to the database has an overhead. You want to reduce this number by using persistent connections through the drivers.

Page 46: MongoDB Tokyo - Monitoring and Queueing

2) Available connections

db.serverStatus()

Every server has its limits. If you run out of available connections then you’ll have a problem, which will look like this in the logs.

Page 47: MongoDB Tokyo - Monitoring and Queueing

Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [mongosMain] Listener: accept() returns -1 errno:24 Too many open files
Fri Nov 19 17:24:32 [conn2335] getaddrinfo("rs1b") failed: No address associated with hostname
Fri Nov 19 17:24:32 [conn2335] getaddrinfo("rs1d") failed: No address associated with hostname
Fri Nov 19 17:24:32 [conn2335] getaddrinfo("rs1c") failed: No address associated with hostname
Fri Nov 19 17:24:32 [conn2335] getaddrinfo("rs2b") failed: No address associated with hostname
Fri Nov 19 17:24:32 [conn2335] getaddrinfo("rs2d") failed: No address associated with hostname
Fri Nov 19 17:24:32 [conn2335] getaddrinfo("rs2c") failed: No address associated with hostname
Fri Nov 19 17:24:32 [conn2335] getaddrinfo("rs2a") failed: No address associated with hostname
Fri Nov 19 17:24:32 [conn2268] checkmaster: rs2b:27018 { setName: "set2", ismaster: false, secondary: true, hosts: [ "rs2b:27018", "rs2d:27018", "rs2c:27018", "rs2a:27018" ], arbiters: [ "rs2arbiter:27018" ], primary: "rs2a:27018", maxBsonObjectSize: 8388608, ok: 1.0 }
MessagingPort say send() errno:9 Bad file descriptor (NONE)
Fri Nov 19 17:24:32 [conn2268] checkmaster: caught exception rs2d:27018 socket exception
Fri Nov 19 17:24:32 [conn2268] MessagingPort say send() errno:9 Bad file descriptor (NONE)
Fri Nov 19 17:24:32 [conn2268] checkmaster: caught exception rs2c:27018 socket exception
Fri Nov 19 17:24:32 [conn2268] MessagingPort say send() errno:9 Bad file descriptor (NONE)
Fri Nov 19 17:24:32 [conn2268] checkmaster: caught exception rs2a:27018 socket exception
Fri Nov 19 17:24:33 [conn2330] getaddrinfo("rs1a") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2330] getaddrinfo("rs1b") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2330] getaddrinfo("rs1d") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2330] getaddrinfo("rs1c") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2327] getaddrinfo("rs2b") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2327] getaddrinfo("rs2d") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2327] getaddrinfo("rs2c") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2327] getaddrinfo("rs2a") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2126] getaddrinfo("rs2b") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2126] getaddrinfo("rs2d") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2126] getaddrinfo("rs2c") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2126] getaddrinfo("rs2a") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2343] getaddrinfo("rs1b") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2343] getaddrinfo("rs1d") failed: No address associated with hostname
Fri Nov 19 17:24:33 [conn2343] getaddrinfo("rs1c") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2332] getaddrinfo("rs1b") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2332] getaddrinfo("rs1d") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2332] getaddrinfo("rs1c") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2332] getaddrinfo("rs2b") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2332] getaddrinfo("rs2d") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2332] getaddrinfo("rs2c") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2332] getaddrinfo("rs2a") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2343] getaddrinfo("rs2d") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2343] getaddrinfo("rs2c") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2343] getaddrinfo("rs2a") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2343] trying reconnect to rs2d:27018
Fri Nov 19 17:24:34 [conn2343] getaddrinfo("rs2d") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2343] reconnect rs2d:27018 failed
Fri Nov 19 17:24:34 [conn2343] MessagingPort say send() errno:9 Bad file descriptor (NONE)
Fri Nov 19 17:24:34 [conn2343] trying reconnect to rs2c:27018
Fri Nov 19 17:24:34 [conn2343] getaddrinfo("rs2c") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2343] reconnect rs2c:27018 failed
Fri Nov 19 17:24:34 [conn2343] MessagingPort say send() errno:9 Bad file descriptor (NONE)
Fri Nov 19 17:24:34 [conn2343] trying reconnect to rs2a:27018
Fri Nov 19 17:24:34 [conn2343] getaddrinfo("rs2a") failed: No address associated with hostname
Fri Nov 19 17:24:34 [conn2343] reconnect rs2a:27018 failed
Fri Nov 19 17:24:34 [conn2343] MessagingPort say send() errno:9 Bad file descriptor (NONE)
Fri Nov 19 17:24:35 [conn2343] checkmaster: rs2b:27018 { setName: "set2", ismaster: false, secondary: true, hosts: [ "rs2b:27018", "rs2d:27018", "rs2c:27018", "rs2a:27018" ], arbiters: [ "rs2arbiter:27018" ], primary: "rs2a:27018", maxBsonObjectSize: 8388608, ok: 1.0 }
MessagingPort say send() errno:9 Bad file descriptor (NONE)

We’ve recently had this problem and it manifests itself by the logs filling up all available disk space instantly, and in some cases completely crashing the server.

Page 48: MongoDB Tokyo - Monitoring and Queueing

connPoolStats

> db.runCommand("connPoolStats")
{
    "hosts" : {
        "config1:27019" : {
            "available" : 2,
            "created" : 6
        },
        "set1/rs1a:27018,rs1b:27018" : {
            "available" : 1,
            "created" : 249
        },
        ...
    },
    "totalAvailable" : 5,
    "totalCreated" : 1002,
    "numDBClientConnection" : 3490,
    "numAScopedConnection" : 3
}

connPoolStats allows you to see the connection pools that have been set up by a mongos to connect to different members of the replica set shards. This is useful to correlate against open file descriptors so you can see if there are suddenly a large number of connections, or if there are a low number of available connections across your entire cluster.

Page 49: MongoDB Tokyo - Monitoring and Queueing

3) Index counters

db.serverStatus()

"indexCounters" : {
    "btree" : {
        "accesses" : 15180175,
        "hits" : 15178725,
        "misses" : 1450,
        "resets" : 0,
        "missRatio" : 0.00009551932
    }
},

The miss ratio is what you’re looking at here. If you’re seeing a lot of index misses then you need to look at your queries to see if they’re making optimal use of the indexes you’ve created. You should consider adding new indexes and seeing if your queries run faster as a result. You can use the explain syntax to see which indexes queries are hitting, and the total execution time so you can benchmark them before and after.

Page 50: MongoDB Tokyo - Monitoring and Queueing

4) Op counters

db.serverStatus()

www.flickr.com/photos/cosmic_bandita/2395369614/

The op counters - inserts, updates, deletes and queries - are fun to look at, especially if the numbers are high. But be careful: on their own they are just vanity metrics. There are some things you can use them for, though. If you have a high number of inserts and updates, i.e. writes, then you may want to look at your fsync time setting. By default this flushes to disk every 60 seconds, but if you’re doing thousands of writes per second you might want to sync sooner for durability. Of course, you can also ensure the write happens from within the driver. A high query count can show whether you need to offload reads to your slaves, which can be done through the drivers, so that you’re spreading the load across your servers and only writing to the master. Deletes can also cause concurrency problems if you’re doing a large number of them and the database keeps having to yield.
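Both of those driver-side options look roughly like this in PyMongo (a sketch using the modern API; in 2011 the equivalents were "safe" writes and slaveOkay, and the database/collection names are illustrative):

from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient('mongodb://rs2a:27018,rs2b:27018/?replicaSet=set2')

# Make writes wait for the journal instead of relying on the periodic sync.
metrics = client.sd.get_collection(
    'metrics', write_concern=WriteConcern(w=1, j=True))

# Send reads to secondaries so queries don't load the master.
logs = client.sd.get_collection(
    'apiLog', read_preference=ReadPreference.SECONDARY_PREFERRED)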

Page 51: MongoDB Tokyo - Monitoring and Queueing

5) Background flushing

db.serverStatus()

Picture is unrelated! Mmm, ice cream.

The server status output shows the last time data was flushed to disk and how long that took. This is useful for seeing whether you’re causing high disk load, and for monitoring how often data is being written. Remember that until data is synced to disk, you could lose it in the event of a crash or power outage.

Page 52: MongoDB Tokyo - Monitoring and Queueing

6) Dur

db.serverStatus()

If you have journaling enabled then serverStatus will also show some stats, such as how many commits have occurred, the amount of data written and how long various operations have taken. This can be useful for seeing how much overhead durability adds to servers. We’ve found no noticeable difference when enabling journaling, and that’s on servers processing billions of operations.

Page 53: MongoDB Tokyo - Monitoring and Queueing

rs.status()

www.ex-astris-scientia.org/inconsistencies/ent_vs_tng.htm (yes it’s a replicator from Star Trek)

{
    "_id" : 1,
    "name" : "rs3b:27018",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "uptime" : 1886098,
    "optime" : {
        "t" : 1291252178000,
        "i" : 13
    },
    "optimeDate" : ISODate("2010-12-02T01:09:38Z"),
    "lastHeartbeat" : ISODate("2010-12-02T01:09:38Z")
},

If you’re running a replica set then you can use the rs.status() command to get information about the whole replica set, on any set member. This gives you a few stats about the current member as well as a full list of every member in the set.

Page 54: MongoDB Tokyo - Monitoring and Queueing

1) myState

rs.status()

Value  Meaning
0      Starting up (phase 1)
1      Primary
2      Secondary
3      Recovering
4      Fatal error
5      Starting up (phase 2)
6      Unknown state
7      Arbiter
8      Down

en.wikipedia.org/wiki/State_of_matter

The first value is myState which shows you the status of the server you executed the command on. However, it’s also used in the list of members the command also provides so you can see the state of any member in the replica set, as that member sees it. This is useful to understand why members might be down because other members can’t see them.

Page 55: MongoDB Tokyo - Monitoring and Queueing

2) Optime

rs.status()

www.flickr.com/photos/robbie73/4244846566/

"optimeDate" : ISODate("2010-12-02T01:09:38Z")

Replica set members that are not the master will be secondary, meaning they act as slaves staying up to date with the master. The optimeDate lets you see whether a member is behind on the replication sync. The timestamp is the last applied oplog entry, so if the member is up to date it will be very close to the current actual time on the server.
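A sketch of turning optimeDate into a lag figure with PyMongo (replSetGetStatus is the command behind rs.status(); the hostname is illustrative, and PyMongo returns naive UTC datetimes by default so the subtraction works directly):

import datetime

from pymongo import MongoClient

client = MongoClient('rs3b', 27018)   # any member of the set
status = client.admin.command('replSetGetStatus')

now = datetime.datetime.utcnow()
for member in status['members']:
    # optimeDate is the last applied oplog entry; the gap is the sync lag.
    print(member['name'], member['stateStr'], 'lag:', now - member['optimeDate'])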

Page 56: MongoDB Tokyo - Monitoring and Queueing

3) Heartbeat

rs.status()

www.flickr.com/photos/drawblindfaith/3400981091/

"lastHeartbeat" : ISODate("2010-12-02T01:09:38Z")

The whole idea behind replica sets is that they automate failover in the event of failure somewhere. This is done by a regular heartbeat that all members send out to all other members. The status output shows you the last time that particular member was contacted from the current member. In the event of a network partition it may be that some members can’t communicate with each other, and when there is an error you’ll see it in this section too.

Page 57: MongoDB Tokyo - Monitoring and Queueing

mongostat

The mongostat tool is included as part of the standard MongoDB download and gives you a quick, real time snapshot of the current state of your servers.

Page 58: MongoDB Tokyo - Monitoring and Queueing

1) faults

mongostat

Picture is unrelated! Snowmobile in Norway.

The faults column shows the number of Linux page faults per second. These occur when Mongo accesses something that is mapped to the virtual address space but not in physical memory, i.e. it results in a read from disk. High values here indicate you may not have enough RAM to store all necessary data, and disk accesses may start to become the bottleneck.

Page 59: MongoDB Tokyo - Monitoring and Queueing

2) locked

mongostat

www.flickr.com/photos/bbusschots/4541573665/

The next column is locked, which shows the % of time in a global write lock. When this is happening no other queries will complete until the lock is given up, or the lock owner yields. This is indicative of a large, global operation like a remove() or dropping a collection and can result in slow performance.

Page 60: MongoDB Tokyo - Monitoring and Queueing

3) index miss

mongostat

www.flickr.com/photos/gareandkitty/276471187/

Index miss is like we saw in the server status output except instead of an aggregate total, you can see queries hitting (or missing) the index in real time. This is useful if you’re debugging specific queries in development or need to track down a server that is performing badly.

Page 61: MongoDB Tokyo - Monitoring and Queueing

4) queues

mongostat

When MongoDB gets too many queries to handle in real time, it queues them up. This is represented in mongostat by the read and write queue columns. When these start to increase you will see slowdowns in executing queries, as they have to wait their turn. You can alleviate this by holding off further queries until the queue has dissipated. Queues will tend to spike if you’re doing a lot of write operations alongside other write-heavy ops, such as large ranged removes. The second pair of columns shows the active reads and writes.

Page 62: MongoDB Tokyo - Monitoring and Queueing

5) Diagnostics

mongostat

The last three columns show the total number of connections per server, the replica set they belong to and the status of that server. This is useful if you need to quickly see which server is a master in a replica set.

Page 63: MongoDB Tokyo - Monitoring and Queueing

Current operations

www.flickr.com/photos/jeffhester/2784666811/

db.currentOp();
{
    "opid" : "shard1:299939199",
    "active" : true,
    "lockType" : "write",
    "waitingForLock" : false,
    "secs_running" : 15419,
    "op" : "remove",
    "ns" : "sd.metrics",
    "query" : {
        "accId" : 1391,
        "tA" : {
            "$lte" : ISODate("2010-11-24T19:53:00Z")
        }
    },
    "client" : "10.121.12.228:44426",
    "desc" : "conn"
},

The db.currentOp() function will give you a full list of every operation currently in progress. In this case there’s a long running remove which has been active for over 4 hours. You can see that it’s targeted at shard 1, and the query is based on an account ID and a timestamp. It’s part of our retention scripts to remove older metrics data. This is useful because you can track down long running queries which might be hurting performance, and kill them off using the opid.
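In the shell that’s db.currentOp() and db.killOp(opid); a rough PyMongo equivalent might look like this sketch (the currentOp/killOp admin commands need a reasonably recent server, and the one-hour threshold is illustrative):

from pymongo import MongoClient

client = MongoClient()
ops = client.admin.command('currentOp')   # what db.currentOp() wraps

for op in ops['inprog']:
    if op.get('secs_running', 0) > 3600:  # anything running over an hour
        print('killing', op['opid'])
        client.admin.command('killOp', op=op['opid'])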

Page 64: MongoDB Tokyo - Monitoring and Queueing

Monitoring tools

Server Density

Page 65: MongoDB Tokyo - Monitoring and Queueing
Page 66: MongoDB Tokyo - Monitoring and Queueing
Page 67: MongoDB Tokyo - Monitoring and Queueing
Page 68: MongoDB Tokyo - Monitoring and Queueing

Monitoring tools

www.mongomonitor.com

Page 69: MongoDB Tokyo - Monitoring and Queueing

Recap

Page 70: MongoDB Tokyo - Monitoring and Queueing

Keep it in RAM

Recap

Page 71: MongoDB Tokyo - Monitoring and Queueing

Keep it in RAM

Watch your storage

Recap

Page 72: MongoDB Tokyo - Monitoring and Queueing

Keep it in RAM

Watch your storage

db.serverStatus()

rs.status()

Recap

Page 73: MongoDB Tokyo - Monitoring and Queueing

David Mytton

[email protected]

@davidmytton

Woop Japan!

www.mongomonitor.com