Making Massive Manageable:
Hadoop and Cassandra (at Rackspace)
Big Data Workshop
Stu Hood (@stuhood) – Technical Lead, Rackspace
April 23rd 2010
My, what a large dataset you have...
Processing 3 TB/day of logs
Using Hadoop/Pig
And the sticking points?
“How fast can we provision machines?”
“How do we get data on/off the cluster?”
“How do we add structure?”
MapReduce
Distributed processing methodology
Adapt a problem to MapReduce
Scale forever
Crunch almost anything
Typically adding structure to unstructured data
Logs
Also great for structured data
Graph processing
Machine learning
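A minimal sketch of the idea in Java, using Hadoop's org.apache.hadoop.mapreduce API: a job that adds structure to raw access-log lines by counting requests per HTTP status code. The class names and the status-code field index are hypothetical, not the Rackspace pipeline.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical job: turn unstructured access-log lines into (status code, count) pairs.
public class StatusCodeCount {

    public static class StatusMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(" ");
            if (fields.length > 8) {
                // Assumes common log format, where field 8 is the HTTP status code.
                context.write(new Text(fields[8]), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text status, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(status, new IntWritable(sum));
        }
    }
}

The same job keeps working as the data grows: each mapper only sees its own input split, so scaling is a matter of adding nodes.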
“You want to use how many clients?”
Need to store structured inputs/outputs
Solution needs to
Support arbitrary number of clients
Preferably provide locality
Possibly provide 'web' latency
Solutions of varying quality
Sharding the RDBMS
shard n. - A horizontal partition in a database
Example: Sharding by userid
Provided by ORM? Developing from scratch?
Fixed partitions: manual rebalancing
Adding/removing nodes
Handling failover
As a library? As a middle tier?
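For illustration only, a toy hash-based router for sharding by userid (a hypothetical helper, not a full library or middle tier). With a fixed shard count like this, adding or removing shards means manual rebalancing, which is exactly the pain point above.

import java.util.List;

// Toy shard router: maps a userid onto one of a fixed set of database shards.
public class ShardRouter {
    private final List<String> shardJdbcUrls; // one JDBC URL per shard, fixed at deploy time

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    public String shardFor(long userId) {
        // Simple modulo partitioning; changing the shard count relocates most keys.
        long m = userId % shardJdbcUrls.size();
        if (m < 0) {
            m += shardJdbcUrls.size();
        }
        return shardJdbcUrls.get((int) m);
    }
}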
Solutions of varying quality
Leaving data in Hadoop
Storage in Map/SequenceFile
Serialized with Thrift/Avro/ProtoBuffs
No random access
High latency
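A small sketch of that approach using Hadoop's SequenceFile API: records are serialized (e.g. with Thrift or Avro) and appended under a key. Reading one record back still means scanning the file, which is why there is no random access and per-lookup latency is high. Class and path names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Store pre-serialized records (Thrift/Avro/protobuf bytes) in a SequenceFile.
public class SequenceFileStore {

    public static void append(Configuration conf, Path path, String key, byte[] record)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, BytesWritable.class);
        try {
            writer.append(new Text(key), new BytesWritable(record));
        } finally {
            writer.close();
        }
    }

    // There is no index: fetching a single key means iterating the whole file
    // with SequenceFile.Reader, hence "no random access" and "high latency".
}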
Solutions of varying quality
Storing in HBase/Hypertable
Column stores implemented on Hadoop
Modeled after Google's Bigtable
Multiple points of failure
Namenode
Master
High latency (not quite 'web' latency)
And the newest contender...
Standing on the shoulders of: Amazon Dynamo
No node in the cluster is special
No special roles
No scaling bottlenecks
No single point of failure
Techniques
Gossip
Eventual consistency
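Dynamo-style clusters typically place nodes on a consistent-hash ring so that no node plays a special role and nodes can join without a central coordinator. A toy ring in Java (not Cassandra's actual partitioner) to show the idea:

import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: each node owns the token range up to its token.
public class TokenRing {
    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

    public void addNode(long token, String nodeAddress) {
        // A new node simply takes over part of a neighbour's range; no master involved.
        ring.put(token, nodeAddress);
    }

    public String nodeFor(long keyToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("empty ring");
        }
        SortedMap<Long, String> tail = ring.tailMap(keyToken);
        Long owner = tail.isEmpty() ? ring.firstKey() : tail.firstKey(); // wrap around
        return ring.get(owner);
    }
}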
Standing on the shoulders of: Google Bigtable
“Column family” data model
Range queries for rows:
Scan rows in order
Memtable/SSTable structure
Always writes sequentially to disk
Bloom filters to minimize random reads
Trounces B-Trees for big data
Linear insert performance
Log growth for reads
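A highly simplified sketch of the memtable/SSTable write path (a toy, not Cassandra's storage engine): writes land in a sorted in-memory table and are periodically flushed as one sequential, sorted file, which is where the linear insert performance comes from. A real implementation also appends to a commit log and builds bloom filters per SSTable to avoid random reads.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Toy memtable/SSTable write path: buffer writes in a sorted map, flush sequentially to disk.
public class ToyMemtable {
    private final TreeMap<String, String> memtable = new TreeMap<String, String>();
    private final int flushThreshold;
    private int sstableGeneration = 0;

    public ToyMemtable(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String key, String value) throws IOException {
        memtable.put(key, value); // in-memory, already sorted by key
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() throws IOException {
        String fileName = "sstable-" + (sstableGeneration++) + ".dat";
        BufferedWriter out = new BufferedWriter(new FileWriter(fileName));
        try {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue()); // sequential, sorted write
                out.newLine();
            }
        } finally {
            out.close();
        }
        memtable.clear(); // a real engine also keeps a commit log and bloom filters
    }
}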
Enter Cassandra
Hybrid of ancestors
Adopts listed features
And adds:
A sweet logo!
Pluggable partitioning
Multi-datacenter support
Pluggable locality awareness
Datamodel improvements
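One common mental model for the column-family data model (a sketch, not Cassandra's API): a column family maps a row key to a sorted map of column names to values, and with an order-preserving partitioner the rows themselves can be scanned in key order.

import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a column family as nested maps: row key -> (column name -> value).
public class ToyColumnFamily {
    private final SortedMap<String, SortedMap<String, byte[]>> rows =
            new TreeMap<String, SortedMap<String, byte[]>>();

    public void insert(String rowKey, String columnName, byte[] value) {
        SortedMap<String, byte[]> columns = rows.get(rowKey);
        if (columns == null) {
            columns = new TreeMap<String, byte[]>();
            rows.put(rowKey, columns);
        }
        columns.put(columnName, value); // columns stay sorted within the row
    }

    // Range query: scan rows in key order (requires an order-preserving partitioner).
    public SortedMap<String, SortedMap<String, byte[]>> rowRange(String start, String end) {
        return rows.subMap(start, end);
    }
}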
Enter Cassandra
Project status
Open sourced by Facebook in 2008 (Facebook no longer active in the project)
Apache License
Graduated to Apache TLP February 2010
Major releases: 0.3 through 0.6 (0.7 in two months)
cassandra.apache.org
Enter Cassandra
The code base
Java, Apache Ant, Git/SVN
5+ committers from 3+ companies
Known deployments at:
Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
Performance
Like peanut butter and jelly
Apache Cassandra 0.6:
MapReduce input support out of the box
Locality information partially exposed
Hadoop InputFormat
Pig LoadFunc
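Roughly what wiring a job looks like with the 0.6 integration, modeled on the contrib word_count example that ships with Cassandra 0.6; treat the ConfigHelper calls as approximate and the "Logs"/"RawLines"/"line" names as hypothetical values, not an exact recipe.

import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: read a Cassandra column family as MapReduce input (Cassandra 0.6-era API).
public class CassandraInputJob {

    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "cassandra-input-example");
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Hypothetical keyspace and column family names.
        ConfigHelper.setColumnFamily(job.getConfiguration(), "Logs", "RawLines");

        // Ask for one named column per row ("line" is a hypothetical column name).
        SlicePredicate predicate =
                new SlicePredicate().setColumn_names(Arrays.asList("line".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        return job;
    }
}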
Hadoop + Cassandra at RAX
Multiple Hadoop clusters deployed
Smaller Cassandra deployments
Preparing for large scale Cassandra deployment
In the pipeline
MapReduce output support
Adding an OutputFormat with locality information
Improving locality for Hadoop inputs
Getting started
http://cassandra.apache.org/
Read "Getting Started"... Roughly:
Start one node
Test/develop app, editing node config as necessary
Launch cluster by starting more nodes with chosen config
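For reference, the single-node defaults in the 0.6 series live in conf/storage-conf.xml (the element names below are recalled from memory and not exhaustive; keyspace and column family definitions also live in this file). Edit the seed list and addresses, then start additional nodes with the same config to grow the cluster:

<!-- conf/storage-conf.xml, 0.6-era format (approximate) -->
<Storage>
  <ClusterName>Test Cluster</ClusterName>
  <Seeds>
    <Seed>127.0.0.1</Seed>
  </Seeds>
  <ListenAddress>127.0.0.1</ListenAddress>
  <ThriftAddress>127.0.0.1</ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
  <DataFileDirectories>
    <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
  </DataFileDirectories>
</Storage>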
Thanks!
Big Data Workshop
Participants!
Questions?
References
Brandon Williams's performance tests
http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
Hadoop/Cassandra Integration
http://issues.apache.org/jira/browse/CASSANDRA-342