Anti-RDBMS: A list of distributed key-value stores

Perhaps you’re considering using a dedicated key-value or document store instead of a traditional relational database. Reasons for this might include:

1. You’re suffering from Cloud-computing Mania.
2. You need an excuse to ‘get your Erlang on’.
3. You heard CouchDB was cool.
4. You hate MySQL, and although PostgreSQL is much better, it still doesn’t have decent replication. There’s no chance you’re buying Oracle licenses.
5. Your data is stored and retrieved mainly by primary key, without complex joins.
6. You have a non-trivial amount of data, and the thought of managing lots of RDBMS shards and replication failure scenarios gives you the fear.

Whatever your reasons, there are a lot of options to choose from. At Last.fm we do a lot of batch computation in Hadoop, then dump it out to other machines where it’s indexed and served up over HTTP and Thrift as an internal service (stuff like ‘most popular songs in London, UK this week’, etc.). Presently we’re using a home-grown index format which points into large files containing lots of data spanning many keys, similar to the Haystack approach mentioned in this article about Facebook photo storage. It works, but rather than build our own replication and partitioning system on top of this, we are looking to potentially replace it with a distributed, resilient key-value store for reasons 4, 5 and 6 above.

This article represents my notes and research to date on distributed key-value stores (and some other stuff) that might be suitable as RDBMS replacements under the right conditions. I’m expecting to try some of these out and investigate further in the coming months.

Glossary and Background Reading

- Distributed Hash Table (DHT) and algorithms such as Chord or Kademlia
- Amazon’s Dynamo Paper, and this ReadWriteWeb article about Dynamo which explains why such a system is invaluable
- Amazon’s SimpleDB Service, and some commentary
- Google’s BigTable paper
- The Paxos Algorithm - read this page in order to appreciate that knocking up a Paxos implementation isn’t something you’d want to do whilst hungover on a Saturday morning.

The Shortlist

Here is a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren’t suitable for low-latency data serving, but are interesting nonetheless.

| Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community |
|------|----------|-----------------|-------------|-----------------|------------|------|-----------|
| Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no |
| Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append-only log) | HTTP | blob | B | Nokia, no |
| Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no |
| Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no |
| Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ascii, Thrift | blob | D+ | Powerset, no |
| MemcacheDB | C | replication | BerkeleyDB | Memcached | blob | B | some |
| ThruDB | C++ | Replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure |
| CouchDB | Erlang | Replication, partitioning? | Custom on-disk | HTTP, JSON | Document oriented (JSON) | A | Apache, yes |
| Cassandra | Java | Replication, partitioning | Custom on-disk | Thrift | Bigtable meets Dynamo | F | Facebook, no |
| HBase | Java | Replication, partitioning | Custom on-disk | Custom API, Thrift, REST | Bigtable | A | Apache, yes |
| Hypertable | C++ | Replication, partitioning | Custom on-disk | Thrift, other | Bigtable | A | Zvents, Baidu, yes |

Why 5 of these aren’t suitable

What I’m really looking for is a low latency, replicated, distributed key-value store. Something that scales well as you feed it more machines, and doesn’t require much setup or maintenance - it should just work. The API should be that of a simple hashtable: set(key, val), get(key), delete(key). This would dispense with the hassle of managing a sharded / replicated database setup, and hopefully be capable of serving up data by primary key efficiently.
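To make that target concrete, here’s a minimal in-process sketch of the hashtable-style API in Python. The class and names are illustrative only - none of the stores above expose exactly this interface, and a real one would route each call across many nodes.

```python
# Toy in-process stand-in for the set/get/delete interface described above.
# Illustrative only: a real distributed store would hash the key to pick
# replica nodes and handle failures behind these three calls.
class KVStore:
    def __init__(self):
        self._data = {}

    def set(self, key, val):
        self._data[key] = val

    def get(self, key):
        # Return None on a miss, much like memcached's behaviour
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.set("user:42", {"name": "alice"})
print(store.get("user:42"))  # {'name': 'alice'}
```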

Five of the projects on the list are far from being simple key-value stores, and as such don’t meet the requirements - but they are definitely worth a mention.

1) We’re already heavy users of Hadoop, and have been experimenting with HBase for a while. It’s much more than a KV store, but latency is too great to serve data to the website. We will probably use HBase internally for other stuff though - we already have stacks of data in HDFS.

2) Hypertable provides a similar feature set to HBase (both are inspired by Google’s Bigtable). They recently announced a new sponsor, Baidu - the biggest Chinese search engine. Definitely one to watch, but it doesn’t fit the low-latency KV store bill either.

3) Cassandra sounded very promising when the source was released by Facebook last year. They use it for inbox search. It’s Bigtable-esque, but uses a DHT so doesn’t need a central server (one of the Cassandra developers previously worked at Amazon on Dynamo). Unfortunately it’s languished in relative obscurity since release, because Facebook never really seemed interested in it as an open-source project. From what I can tell there isn’t much in the way of documentation or a community around the project at present.

4) CouchDB is an interesting one - it’s a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. Data is stored in ‘documents’, which are essentially key-value maps themselves, using the data types you see in JSON. Read the CouchDB Technical Overview if you are curious how the web’s trendiest document database works under the hood. This article on the Rules of Database App Aging goes some way to explaining why document-oriented databases make sense. CouchDB can do full text indexing of your documents, and lets you express views over your data in Javascript. I could imagine using CouchDB to store lots of data on users: name, age, sex, address, IM name and lots of other fields, many of which could be null, and each site update adds or changes the available fields. In situations like that it quickly gets unwieldy adding and changing columns in a database, and updating versions of your application code to match. Although many people are using CouchDB in production, their FAQ points out they may still make backwards-incompatible changes to the storage format and API before version 1.0.
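As a rough illustration of that modelling style, here’s a Python sketch where each user ‘document’ is a free-form map. Plain dicts stand in for CouchDB documents here, and the field names are invented for the example.

```python
import json

# Schema-free, JSON-style documents: fields can differ per record, and a
# site update can add a new field to one record without any migration.
users = {
    "user:1": {"name": "Alice", "age": 30, "im": "alice_im"},
    "user:2": {"name": "Bob", "address": "London, UK"},  # no age or IM name
}

# Later application code adds a brand-new field to a single document only;
# no ALTER TABLE, and user:2 is untouched.
users["user:1"]["twitter"] = "@alice"

print(json.dumps(users["user:1"], sort_keys=True))
```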

5) ThruDB is a document storage and indexing system made up of four components: a document storage service, indexing service, message queue and proxy. It uses Thrift for communication, and has a pluggable storage subsystem, including an Amazon S3 option. It’s designed to scale well horizontally, and might be a better option than CouchDB if you are running on EC2. I’ve heard a lot more about CouchDB than ThruDB recently, but it’s definitely worth a look if you need a document database. It’s not suitable for our needs for the same reasons as CouchDB.

Distributed key-value stores

The rest are much closer to being ‘simple’ key-value stores with low enough latency to be used for serving data used to build dynamic pages. Latency will be dependent on the environment, and whether or not the dataset fits in memory. If it does, I’d expect sub-10ms response time, and if not, it all depends on how much money you spent on spinning rust.

MemcacheDB is essentially just memcached that saves stuff to disk using a Berkeley database. As useful as this may be for some situations, it doesn’t deal with replication and partitioning (sharding), so it would still require a lot of work to make it scale horizontally and be tolerant of machine failure. Other memcached derivatives such as repcached go some way to addressing this by giving you the ability to replicate entire memcache servers (async master-slave setup), but without partitioning it’s still going to be a pain to manage.
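A back-of-envelope sketch of why full-server replication alone gets painful (the numbers below are made up for illustration): every replica stores the whole dataset, so adding machines buys read throughput but no extra capacity, whereas partitioning divides the data up across them.

```python
# Hypothetical numbers: a 200 GB dataset spread over 4 servers.
dataset_gb = 200
servers = 4

# repcached-style full replication: every server holds the entire dataset,
# so per-server storage never shrinks as you add machines.
gb_per_server_replicated = dataset_gb

# Partitioned with 2 copies of each shard: each server holds only a slice,
# and adding servers reduces the slice each one must carry.
replication_factor = 2
gb_per_server_partitioned = dataset_gb * replication_factor / servers

print(gb_per_server_replicated, gb_per_server_partitioned)  # 200 100.0
```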

Project Voldemort looks awesome. Go and read the rather splendid website, which explains how it works, and includes pretty diagrams and a good description of how consistent hashing is used in the Design section. (If consistent hashing butters your muffin, check out libketama - a consistent hashing library and the Erlang libketama driver). Project Voldemort handles replication and partitioning of data, and appears to be well written and designed. It’s reassuring to read in the docs how easy it is to swap out and mock different components for testing. It’s non-trivial to add nodes to a running cluster, but according to the mailing list this is being worked on. It sounds like this would fit the bill if we ran it with a Java load-balancer service (see their Physical Architecture Options diagram) that exposed a Thrift API so all our non-Java clients could use it.
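For a flavour of the consistent hashing idea, here’s a toy ketama-style ring in Python (a sketch, not libketama’s or Voldemort’s actual code): each node is hashed to many virtual points on a ring, a key belongs to the first node point at or after the key’s own hash, and only a fraction of keys move when a node joins.

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Toy consistent hash ring: each node gets `replicas` virtual points
    so keys spread evenly, and adding a node only reclaims the slices of
    the ring that fall just before its new points."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = {}     # point on the ring -> node name
        self.points = []   # sorted ring points
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            self.ring[self._hash(f"{node}:{i}")] = node
        self.points = sorted(self.ring)

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping at the top.
        idx = bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[self.points[idx]]

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.node_for("user:42"))
```

With a naive `hash(key) % n_servers` scheme, changing the server count remaps almost every key; here, adding a fourth node should move only roughly a quarter of them.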

Scalaris is probably the most face-meltingly awesome thing you could build in Erlang. CouchDB, Ejabberd and RabbitMQ are cool, but Scalaris packs by far the most impressive collection of sexy technologies. Scalaris is a key-value store - it uses a modified version of the Chord algorithm to form a DHT, and stores the keys in lexicographical order, so range queries are possible. Although I didn’t see this explicitly mentioned, this should open up all sorts of interesting options for batch processing - map-reduce for example. On top of the DHT they use an improved version of Paxos to guarantee ACID properties when dealing with multiple concurrent transactions. So it’s a key-value store, but it can guarantee the ACID properties and do proper distributed transactions over multiple keys.
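To see why lexicographically ordered keys matter, here’s an illustrative single-node sketch in Python (not Scalaris code, and the key names are invented): once keys are kept sorted, a range query is just a scan between two bounds, which is also what makes prefix scans over key ranges feasible.

```python
from bisect import bisect_left, bisect_right

class SortedKVStore:
    """Toy store that keeps its keys in sorted order, so a range query is
    a binary search for each bound followed by a contiguous scan."""

    def __init__(self):
        self._keys = []   # sorted, unique
        self._vals = {}

    def set(self, key, val):
        if key not in self._vals:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._vals[key] = val

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key <= hi, in key order."""
        i, j = bisect_left(self._keys, lo), bisect_right(self._keys, hi)
        return [(k, self._vals[k]) for k in self._keys[i:j]]

s = SortedKVStore()
for k in ["track:aa", "track:ab", "track:zz", "user:1"]:
    s.set(k, k.upper())
# '~' sorts after alphanumerics in ASCII, so this grabs the whole prefix.
print(s.range("track:", "track:~"))
```

A store that hashes keys randomly across nodes can only answer this by asking every node; order-preserving placement keeps each range on a small, known set of nodes.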

Oh, and to demonstrate how you can scale a webservice based on such a system, the Scalaris folk implemented their own version of Wikipedia on Scalaris, loaded in the Wikipedia data, and benchmarked their setup to prove it can do more transactions/sec on equal hardware than the classic PHP/MySQL combo that Wikipedia use. Yikes.

From what I can tell, Scalaris is only memory-resident at the moment and doesn’t persist data to disk. This makes it entirely impractical to actually run a service like Wikipedia on Scalaris for real - but it sounds like they tackled the hard problems first, and persisting to disk should be a walk in the park after you rolled your own version of Chord and made Paxos your bitch. Take a look at this presentation about Scalaris from the Erlang Exchange conference: Scalaris presentation video.

The remaining projects, Dynomite, Ringo and Kai are all, more or less, trying to be Dynamo. Of the three, Ringo looks to be the most specialist - it makes a distinction between small (less than 4KB) and medium-size data items (<100MB). Medium-sized items are stored in individual files, whereas small items are all stored in an append-log, the index of which is read into memory at startup. From what I can tell, Ringo can be used in conjunction with the Erlang map-reduce framework Nokia are working on called Disco.

I didn’t find out much about Kai other than it’s rather new, and some mentions in Japanese. You can choose either Erlang ets or dets as the storage system (memory or disk, respectively), and it uses the memcached protocol, so it will already have client libraries in many languages.

Dynomite doesn’t have great documentation, but it seems to be more capable than Kai, and is under active development. It has pluggable backends including the storage mechanism from CouchDB, so the 2GB file limit in dets won’t be an issue. Also I heard that Powerset are using it, so that’s encouraging.

Summary

Scalaris is fascinating, and I hope I can find the time to experiment more with it, but it needs to save stuff to disk before it’d be useful for the kind of things we might use it for at Last.fm.

I’m keeping an eye on Dynomite - hopefully more information will surface about what Powerset are doing with it, and how it performs at a large scale.

Based on my research so far, Project Voldemort looks like the most suitable for our needs. I’d love to hear more about how it’s used at LinkedIn, and how many nodes they are running it on.

What else is there?

Here are some other related projects:

- Hazelcast - Java DHT/clustering library
- nmdb - a network database (dbm-style)
- Open Chord - Java DHT

If you know of anything I’ve missed off the list, or have any feedback/suggestions, please post a comment. I’m especially interested in hearing about people who’ve tested or are using KV-stores in lieu of relational databases.

UPDATE 1: Corrected table: memcachedb does replication, as per BerkeleyDB.

Tags: databases, dht, erlang, hashing, java

Monday, January 19th, 2009 programming