Making Massive Manageable:
Hadoop and Cassandra (at Rackspace)
Big Data Workshop
Stu Hood (@stuhood) – Technical Lead, Rackspace
April 23rd 2010
My, what a large dataset you have...
Processing 3 TB/day of logs
Using Hadoop/Pig
And the sticking points?
“How fast can we provision machines?”
“How do we get data on/off the cluster?”
“How do we add structure?”
MapReduce
Distributed processing methodology
Adapt a problem to MapReduce
Scale forever
Crunch almost anything
Typically adding structure to unstructured data
Logs
Also great for structured data
Graph processing
Machine learning
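A minimal sketch of the idea in Java, using Hadoop's org.apache.hadoop.mapreduce API: a job that adds structure to raw access-log lines by counting requests per HTTP status code. The class names and the status-code field index are hypothetical, not the Rackspace pipeline.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical job: turn unstructured access-log lines into (status code, count) pairs.
public class StatusCodeCount {

    public static class StatusMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(" ");
            if (fields.length > 8) {
                // Assumes common log format, where field 8 is the HTTP status code.
                context.write(new Text(fields[8]), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text status, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(status, new IntWritable(sum));
        }
    }
}

The same job keeps working as the data grows: each mapper only sees its own input split, so scaling is a matter of adding nodes.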
“You want to use how many clients?”
Need to store structured inputs/outputs
Solution needs to
Support arbitrary number of clients
Preferably provide locality
Possibly provide 'web' latency
Solutions of varying quality
Sharding the RDBMS
shard n. - A horizontal partition in a database
Example: Sharding by userid
Provided by ORM? Developing from scratch?
Fixed partitions: manual rebalancing
Adding/removing nodes
Handling failover
As a library? As a middle tier?
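For illustration only, a toy hash-based router for sharding by userid (a hypothetical helper, not a full library or middle tier). With a fixed shard count like this, adding or removing shards means manual rebalancing, which is exactly the pain point above.

import java.util.List;

// Toy shard router: maps a userid onto one of a fixed set of database shards.
public class ShardRouter {
    private final List<String> shardJdbcUrls; // one JDBC URL per shard, fixed at deploy time

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    public String shardFor(long userId) {
        // Simple modulo partitioning; changing the shard count relocates most keys.
        long m = userId % shardJdbcUrls.size();
        if (m < 0) {
            m += shardJdbcUrls.size();
        }
        return shardJdbcUrls.get((int) m);
    }
}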
Solutions of varying quality
Leaving data in Hadoop
Storage in Map/SequenceFile
Serialized with Thrift/Avro/ProtoBuffs
No random access
High latency
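A small sketch of that approach using Hadoop's SequenceFile API: records are serialized (e.g. with Thrift or Avro) and appended under a key. Reading one record back still means scanning the file, which is why there is no random access and per-lookup latency is high. Class and path names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Store pre-serialized records (Thrift/Avro/protobuf bytes) in a SequenceFile.
public class SequenceFileStore {

    public static void append(Configuration conf, Path path, String key, byte[] record)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, BytesWritable.class);
        try {
            writer.append(new Text(key), new BytesWritable(record));
        } finally {
            writer.close();
        }
    }

    // There is no index: fetching a single key means iterating the whole file
    // with SequenceFile.Reader, hence "no random access" and "high latency".
}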
Solutions of varying quality
Storing in HBase/Hypertable
Column stores implemented on Hadoop
Modeled after Google's Bigtable
Multiple points of failure
Namenode
Master
High latency (not quite 'web' latency)
And the newest contender...
Standing on the shoulders of: Amazon Dynamo
No node in the cluster is special
No special roles
No scaling bottlenecks
No single point of failure
Techniques
Gossip
Eventual consistency
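Dynamo-style clusters typically place nodes on a consistent-hash ring so that no node plays a special role and nodes can join without a central coordinator. A toy ring in Java (not Cassandra's actual partitioner) to show the idea:

import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hash ring: each node owns the token range up to its token.
public class TokenRing {
    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();

    public void addNode(long token, String nodeAddress) {
        // A new node simply takes over part of a neighbour's range; no master involved.
        ring.put(token, nodeAddress);
    }

    public String nodeFor(long keyToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("empty ring");
        }
        SortedMap<Long, String> tail = ring.tailMap(keyToken);
        Long owner = tail.isEmpty() ? ring.firstKey() : tail.firstKey(); // wrap around
        return ring.get(owner);
    }
}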
Standing on the shoulders of: Google Bigtable
“Column family” data model
Range queries for rows:
Scan rows in order
Memtable/SSTable structure
Always writes sequentially to disk
Bloom filters to minimize random reads
Trounces B-Trees for big data
Linear insert performance
Log growth for reads
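A highly simplified sketch of the memtable/SSTable write path (a toy, not Cassandra's storage engine): writes land in a sorted in-memory table and are periodically flushed as one sequential, sorted file, which is where the linear insert performance comes from. A real implementation also appends to a commit log and builds bloom filters per SSTable to avoid random reads.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

// Toy memtable/SSTable write path: buffer writes in a sorted map, flush sequentially to disk.
public class ToyMemtable {
    private final TreeMap<String, String> memtable = new TreeMap<String, String>();
    private final int flushThreshold;
    private int sstableGeneration = 0;

    public ToyMemtable(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String key, String value) throws IOException {
        memtable.put(key, value); // in-memory, already sorted by key
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() throws IOException {
        String fileName = "sstable-" + (sstableGeneration++) + ".dat";
        BufferedWriter out = new BufferedWriter(new FileWriter(fileName));
        try {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.write(e.getKey() + "\t" + e.getValue()); // sequential, sorted write
                out.newLine();
            }
        } finally {
            out.close();
        }
        memtable.clear(); // a real engine also keeps a commit log and bloom filters
    }
}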
Enter Cassandra
Hybrid of ancestors
Adopts listed features
And adds:
A sweet logo!
Pluggable partitioning
Multi-datacenter support
Pluggable locality awareness
Datamodel improvements
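One common mental model for the column-family data model (a sketch, not Cassandra's API): a column family maps a row key to a sorted map of column names to values, and with an order-preserving partitioner the rows themselves can be scanned in key order.

import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a column family as nested maps: row key -> (column name -> value).
public class ToyColumnFamily {
    private final SortedMap<String, SortedMap<String, byte[]>> rows =
            new TreeMap<String, SortedMap<String, byte[]>>();

    public void insert(String rowKey, String columnName, byte[] value) {
        SortedMap<String, byte[]> columns = rows.get(rowKey);
        if (columns == null) {
            columns = new TreeMap<String, byte[]>();
            rows.put(rowKey, columns);
        }
        columns.put(columnName, value); // columns stay sorted within the row
    }

    // Range query: scan rows in key order (requires an order-preserving partitioner).
    public SortedMap<String, SortedMap<String, byte[]>> rowRange(String start, String end) {
        return rows.subMap(start, end);
    }
}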
Enter Cassandra
Project status
Open sourced by Facebook in 2008 (Facebook no longer active in the project)
Apache License
Graduated to Apache TLP February 2010
Major releases: 0.3 through 0.6 (0.7 in two months)
cassandra.apache.org
Enter Cassandra
The code base
Java, Apache Ant, Git/SVN
5+ committers from 3+ companies
Known deployments at:
Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
Performance
Like peanut butter and jelly
Apache Cassandra 0.6:
MapReduce input support out of the box
Locality information partially exposed
Hadoop InputFormat
Pig LoadFunc
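Roughly what wiring a job looks like with the 0.6 integration, modeled on the contrib word_count example that ships with Cassandra 0.6; treat the ConfigHelper calls as approximate and the "Logs"/"RawLines"/"line" names as hypothetical values, not an exact recipe.

import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: read a Cassandra column family as MapReduce input (Cassandra 0.6-era API).
public class CassandraInputJob {

    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "cassandra-input-example");
        job.setInputFormatClass(ColumnFamilyInputFormat.class);

        // Hypothetical keyspace and column family names.
        ConfigHelper.setColumnFamily(job.getConfiguration(), "Logs", "RawLines");

        // Ask for one named column per row ("line" is a hypothetical column name).
        SlicePredicate predicate =
                new SlicePredicate().setColumn_names(Arrays.asList("line".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        return job;
    }
}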
Hadoop + Cassandra at RAX
Multiple Hadoop clusters deployed
Smaller Cassandra deployments
Preparing for large scale Cassandra deployment
In the pipeline
MapReduce output support
Adding an OutputFormat with locality information
Improving locality for Hadoop inputs
Getting started
http://cassandra.apache.org/
Read "Getting Started"... Roughly:
Start one node
Test/develop app, editing node config as necessary
Launch cluster by starting more nodes with chosen config
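For reference, the single-node defaults in the 0.6 series live in conf/storage-conf.xml (the element names below are recalled from memory and not exhaustive; keyspace and column family definitions also live in this file). Edit the seed list and addresses, then start additional nodes with the same config to grow the cluster:

<!-- conf/storage-conf.xml, 0.6-era format (approximate) -->
<Storage>
  <ClusterName>Test Cluster</ClusterName>
  <Seeds>
    <Seed>127.0.0.1</Seed>
  </Seeds>
  <ListenAddress>127.0.0.1</ListenAddress>
  <ThriftAddress>127.0.0.1</ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
  <DataFileDirectories>
    <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
  </DataFileDirectories>
</Storage>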
Thanks!
Big Data Workshop
Participants!
Questions?
References
Brandon Williams's performance tests
http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
Hadoop/Cassandra Integration
http://issues.apache.org/jira/browse/CASSANDRA-342