Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Rick Copeland (@rick446)
SourceForge ♥s MongoDB
- Tried CouchDB – liked the dev model, not so much the performance
- Migrated consumer-facing pages (summary, browse, download) to MongoDB and it worked great (on MongoDB 0.8 no less!)
- Built an entirely new tool platform around MongoDB (Allura)
The Problem We’re Trying to Solve
- We have lots of users (good)
- We have lots of projects (good)
- We don’t know what those users and projects are doing (not so good)
- We have tons of code in PHP, Perl, and Python (not so good)
Introducing Zarkov 0.0.1
- Asynchronous TCP server for event logging, built with gevent
- Turn OFF “safe” writes; turn OFF Ming validation (or do it in the client)
- Incrementally calculate aggregate stats from the event log using mapreduce with {'out': 'reduce'} (sketched below)
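A minimal sketch of that incremental pattern, with an assumed pymongo db handle and invented collection/field names; with out={'reduce': ...}, MongoDB re-reduces new results into the existing output collection instead of replacing it:

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().zarkov   # hypothetical database name

map_js = Code("function () { emit(this.project, 1); }")
reduce_js = Code("function (key, values) { return Array.sum(values); }")

# Aggregate only events not yet rolled up, merging results into 'stats'
db.command('mapReduce', 'events',
           map=map_js, reduce=reduce_js,
           query={'aggregated': False},
           out={'reduce': 'stats'})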
Zarkov Architecture
[Architecture diagram: events arrive as BSON over ZeroMQ; journal and commit greenlets manage write-ahead logs in front of MongoDB, and an aggregation greenlet computes the aggregate stats.]
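A minimal sketch of that layout, with all names, ports, and file paths assumed for illustration (zmq.green is pyzmq's gevent-compatible API):

import bson
import gevent
from gevent.queue import Queue
from pymongo import MongoClient
import zmq.green as zmq   # gevent-friendly ZeroMQ

db = MongoClient(w=0).zarkov           # w=0: unacknowledged ("unsafe") writes
incoming, to_commit = Queue(), Queue()

def listener():
    # Receive BSON-encoded events over a ZeroMQ PULL socket
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind('tcp://0.0.0.0:6543')    # hypothetical port
    while True:
        incoming.put(bson.decode(sock.recv()))

def journal():
    # Journal greenlet: append every event to the write-ahead log
    with open('zarkov.wal', 'ab') as wal:
        while True:
            event = incoming.get()
            wal.write(bson.encode(event))
            to_commit.put(event)

def commit():
    # Commit greenlet: flush journaled events into MongoDB
    while True:
        db.events.insert_one(to_commit.get())

gevent.joinall([gevent.spawn(f) for f in (listener, journal, commit)])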
Technologies
- MongoDB: fast (10k+ inserts/s, single-threaded)
- ZeroMQ: built-in buffering; PUSH/PULL sockets (push never blocks, easy to distribute work)
- BSON: fast Python/C implementation; more types than JSON (example after this list)
- Gevent: “green threads” for Python
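For example, a quick BSON round trip (using pymongo's bson package) shows types JSON has no direct equivalent for:

import datetime
import bson

doc = {'_id': bson.ObjectId(),              # ObjectId: no JSON equivalent
       'when': datetime.datetime.utcnow(),  # real datetime, not a string
       'payload': b'\x00\x01\x02'}          # raw binary data
data = bson.encode(doc)                     # fast C extension when available
print(bson.decode(data))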
“Wow, it’s really fast; can it replace…”
- Download statistics?
- Google Analytics?
- Project realtime statistics?
“Probably, but it’ll take some work…”
Moving towards production…
- MongoDB MapReduce: convenient, but not so fast
  - Global JS interpreter lock per mongod
  - Lots of writing to temp collections (high lock %)
  - JavaScript without libraries (ick!)
- Hadoop? Painful to configure, high latency, non-seamless integration with MongoDB
Zarkov’s already doing a lot…
So we added a lightweight map/reduce framework
- Write your map/reduce jobs in Python
- Input/output is MongoDB
- Intermediate files are local .bson files
- Use ZeroMQ for job distribution
Quick Map/reduce Refresher

import itertools
import operator

def map_reduce(input_collection, query, output_collection,
               map, reduce):
    objects = input_collection.find(query)
    # map takes the object cursor and returns (key, value) pairs
    map_results = list(map(objects))
    map_results.sort(key=operator.itemgetter(0))
    # group by key and reduce each key's values to a single result
    for key, kv_pairs in itertools.groupby(
            map_results, operator.itemgetter(0)):
        value = reduce(key, [v for k, v in kv_pairs])
        output_collection.save(
            {"_id": key, "value": value})
Zarkov Map/Reduce Architecture
[Architecture diagram: a JobMgr drives the Query → Map → Sort → Reduce → Commit pipeline, spooling intermediate data through map_in_#.bson, map_out_#.bson, and reduce_in.bson files.]
Zarkov Map/Reduce
- Phases managed by greenlets
- Map and reduce jobs parceled out to remote workers via zmq PUSH/PULL (a worker sketch follows this list)
- Adaptive timeout/retry to cope with dead workers
- Sort phase is local (big mergesort) but still done in worker processes
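A sketch of what a remote worker's loop might look like; the endpoints, message shape, and run_phase are assumptions, not Zarkov's actual protocol:

import bson
import zmq

ctx = zmq.Context()
jobs = ctx.socket(zmq.PULL)
jobs.connect('tcp://jobmgr.example.com:5555')     # hypothetical job source
results = ctx.socket(zmq.PUSH)
results.connect('tcp://jobmgr.example.com:5556')  # hypothetical result sink

def run_phase(job):
    # Placeholder: run the map, sort, or reduce work the job describes
    return {'job_id': job.get('job_id'), 'status': 'done'}

while True:
    job = bson.decode(jobs.recv())   # e.g. {'phase': 'map', 'file': 'map_in_0.bson'}
    results.send(bson.encode(run_phase(job)))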
Zarkov Web Service
- We’ve got the data in, now how do we get it out?
- Zarkov includes a tiny HTTP server
$ curl -d foo='{"c":"sfweb", "b":"date/2011-07-01/", "e":"date/2011-07-04"}' http://localhost:8081/q
{"foo": {"sflogo": [[1309579200000.0, 12774], [1309665600000.0, 13458], [1309752000000.0, 13967]], "hits": [[1309579200000.0, 69357], [1309665600000.0, 68514], [1309752000000.0, 68494]]}}
- Values come out pre-formatted for flot (millisecond timestamps paired with values)
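The same query from Python (assuming the third-party requests package and the local server from the curl example above):

import json
import requests

q = {'c': 'sfweb', 'b': 'date/2011-07-01/', 'e': 'date/2011-07-04'}
resp = requests.post('http://localhost:8081/q', data={'foo': json.dumps(q)})
print(resp.json())   # {'foo': {'sflogo': [[ts_ms, value], ...], 'hits': [...]}}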
Zarkov Deployment at SF.net
Lessons learned at SF.net
MongoDB Tricks
- Autoincrement integers are harder than in MySQL but not impossible
- Unsafe writes; insert beats update

class IdGen(object):
    @classmethod
    def get_ids(cls, inc=1):
        # Atomically bump a counter document and hand back the reserved
        # block of ids, e.g. get_ids(10) -> [n-10, ..., n-1]
        obj = cls.query.find_and_modify(
            query={'_id': 0},
            update={'$inc': dict(inc=inc)},
            upsert=True,
            new=True)
        return range(obj.inc - inc, obj.inc)
MongoDB Pitfalls
- $addToSet is nice, but nothing beats an integer range query (example after this list)
- Avoid server-side JavaScript like the plague (mapreduce, group, $where)
- Indexing is nice, but it slows down writes; use _id when you can
- mongorestore is fast, but locks a lot
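For instance (an illustrative query, not from the talk), documents carrying ids from an IdGen-style block can be selected with a plain range query on _id, with no extra index and no array maintenance:

# Fetch everything in an integer id block reserved by the IdGen pattern above
db.events.find({'_id': {'$gte': 1000, '$lt': 2000}})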
Open Source
- Ming: http://sf.net/projects/merciless/ (MIT License)
- Allura: http://sf.net/p/allura/ (Apache License)
- Zarkov: http://sf.net/p/zarkov/ (Apache License)
Future Work
- Remove the SPoF (single point of failure)
- Better way of expressing aggregates (suggestions?)
- Better web integration (WebSockets/Socket.io)
- Maybe trigger aggregations based on event activity?
Credits
- http://www.flickr.com/photos/jprovost/5733297977/in/photostream/