Cassandra synergy

Cassandra Synergy, by Niall Milton, DigBigData. Presented to: Dublin Cassandra User Group

description

Presented to the Dublin Cassandra User Group by Niall Milton of DigBigData. This presentation is on Cassandra and its use with other technologies such as Storm, Spark, Hadoop, ElasticSearch and Redis. This presentation should act as a solid foundation to explore some of the mentioned technologies in more depth.

Transcript of Cassandra synergy

Page 1: Cassandra synergy

Cassandra Synergy

By Niall Milton, DigBigData

Presented to: Dublin Cassandra User Group

Page 2: Cassandra synergy

Agenda

What do we mean by synergy?

Storm

Shark / Spark

Redis

ElasticSearch

Hadoop

Page 3: Cassandra synergy

What do we mean by Synergy?

synergy: 1. The interaction of two or more agents or forces so that their combined effect is greater than the sum of their individual effects.

Page 4: Cassandra synergy

What do we mean by Synergy?

Cassandra is excellent for:

Fast read or write performance

Scalable, runs on commodity hardware

Reliable cross-DC replication

Robust persistence for high volume data

Needs some special sauce for:

Real-time calculations for high volume streams

Complex search functions (free-text etc.)

Map Reduce on RDDs

Page 5: Cassandra synergy

Twitter Storm

Page 6: Cassandra synergy

Storm

Open Sourced by Twitter in 2011

Distributed event processor

Operates on unbounded streams of tuples (Resilient Distributed Datasets are a Spark concept, not a Storm one)

Getting started in Apache Incubator

Can persist to and read from Cassandra (C*)

Great for high volume, real time (complex) calculations on streamed data

Page 7: Cassandra synergy
Page 8: Cassandra synergy

Storm

Is a complex event processing (CEP) architecture

Spout – Collects and submits tuples for processing

Bolt – Processes tuples and emits new tuples

Tuple – A named list of values passed between components in Storm

Stream – Identifies the outputs from a spout or bolt and enforces tuple structure

Uses ZooKeeper for coordination and ZeroMQ for message passing
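
To make the vocabulary above concrete, here is a minimal sketch in plain Python. It is not the Storm API: a generator stands in for the spout, a second generator stands in for the bolt, and ordinary Python tuples stand in for Storm tuples. The names `sentence_spout`, `split_bolt` and `run_topology` are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout stand-in: emits one single-field tuple per incoming sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt stand-in: consumes sentence tuples and emits (word, 1) tuples."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word, 1)


def run_topology(sentences: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire spout -> bolt, roughly the way a topology wires its components."""
    return list(split_bolt(sentence_spout(sentences)))
```

In a real topology the components run in parallel across a cluster, and the stream definition (not Python's type hints) fixes the field names of each tuple.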

Page 9: Cassandra synergy

Example Topology

Page 10: Cassandra synergy

Synergy?

Can use Cassandra as the input data source

Can write tuples into Cassandra

Example project here… https://github.com/tjake/stormscraper/

See CassandraWriterBolt.java for a simple example of a Java Driver, CQL-based bolt that writes to Cassandra.

Good as an example application, but not production ready

Page 11: Cassandra synergy

Use Case

Top N words for popularity tracking

Input: a constant stream of messages into the system

Count occurrences of each word in a message

Store raw messages in Cassandra

Use a bolt to break up messages and maintain sorted list of top N words

Persist the Top N words and their counts periodically in Cassandra
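
The counting and ranking steps can be sketched in a few lines of plain Python. This is only an illustration of the logic, assuming whitespace tokenisation and lower-casing (neither is specified in the slides), and it does no time windowing. The function name `top_n_words` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_n_words(messages: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Count word occurrences across messages and return the n most frequent."""
    counts: Counter = Counter()
    for message in messages:
        # Assumption: words are separated by whitespace and compared case-insensitively.
        counts.update(message.lower().split())
    # Sort by count (descending), then by word, so ties come out in a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```

In Storm the same work would be split across bolts, with the ranked list flushed to Cassandra on a timer rather than returned from a function.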

Page 12: Cassandra synergy

Use Case

CREATE TABLE messages (date_hour TIMESTAMP, message_id TIMEUUID, message VARCHAR, PRIMARY KEY (date_hour, message_id));

CREATE TABLE top_words (date_hour TIMESTAMP, position INT, word VARCHAR, PRIMARY KEY (date_hour, position));
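
A rough sketch of how rows for these two tables might be shaped before being written. It assumes that `date_hour` holds the timestamp truncated to the start of the hour, which is a reading of the column name rather than something the slides state. The helper names are hypothetical and no Cassandra driver is involved.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, List


def hour_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (assumed partition key)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def message_row(ts: datetime, message: str) -> Dict[str, object]:
    """Shape one row for the messages table."""
    return {
        "date_hour": hour_bucket(ts),
        # uuid1() is a time-based (version 1) UUID, the kind CQL calls timeuuid.
        # It is stamped with the current clock, not with ts.
        "message_id": uuid.uuid1(),
        "message": message,
    }


def top_word_rows(ts: datetime, ranked_words: List[str]) -> List[Dict[str, object]]:
    """Shape one row per rank for the top_words table, position starting at 1."""
    bucket = hour_bucket(ts)
    return [
        {"date_hour": bucket, "position": position, "word": word}
        for position, word in enumerate(ranked_words, start=1)
    ]
```

With this layout, every message from the same hour lands in one partition, ordered by `message_id`, and the Top N for an hour can be read back with a single partition query.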

Page 13: Cassandra synergy

Use Case

https://github.com/nathanmarz/storm-starter/

Use RollingTopWords.java as base

Integrate CassandraWriterBolt into use case

Add spout for input messages

Add bolt for persisting messages & writing Top N words

Reference : http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

Page 14: Cassandra synergy

Storm: Conclusion

Powerful Architecture

Lots of potential as an Apache project

Nice abstractions to simplify development (Trident)

Great for operating on high velocity, high volume streams

Not prohibitively difficult to integrate with other systems for input and output

Lots of people experimenting with it!

Page 15: Cassandra synergy

Spark & Shark

Lightning fast cluster computing

Page 16: Cassandra synergy

Apache Spark

Up to 100x faster than Hadoop MapReduce for in-memory workloads (the project's own headline figure)

Faster in-memory MapReduce-style operations

Integration with Cassandra either via:

https://github.com/tuplejump/calliope-release

Or via Cassandra’s Hadoop support

Combines SQL, Streaming and Complex Analytics

Page 17: Cassandra synergy

Apache Spark

Can read and write to Cassandra…

Reading from CF / Table into RDD via Calliope (Scala)

val cas = CasBuilder.cql3.withColumnFamily("casDemo", "Words").where("book = 'The Three Musketeers'")

val rdd = sc.cql3Cassandra[Map[String, String], Map[String, String]](cas)

* The where clause can use the partition key or a secondary index; CasBuilder also supports paging

Page 18: Cassandra synergy

Shark

With Spark we can achieve super fast in-memory queries on subsets of data in Cassandra

Effectively all the features of Hive, executing on Spark RDDs rather than as MapReduce jobs over HDFS

Uses HiveQL queries

Includes machine learning algorithms out of the box

CqlStorageHandler provided to read RDD from Cassandra or read SSTables directly

https://github.com/richardalow/cassowary

Page 19: Cassandra synergy

Spark / Shark: Conclusion

Need resource isolation if running directly on Cassandra nodes

Otherwise (separate nodes), expect higher latency but no contention for Cassandra cluster resources

Impressive possibilities for machine learning algorithms as well as more basic Hive queries

Introduces possibilities for JOINs on hot data!
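
Cassandra itself has no JOIN, so the join has to happen once both datasets are in memory. The idea can be sketched as a plain hash join in Python. This is not Spark or Shark code, and `hash_join` and the sample field names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def hash_join(left: Iterable[Dict], right: Iterable[Dict], key: str) -> List[Dict]:
    """Inner-join two collections of rows (dicts) on a shared key."""
    # Build phase: index the right-hand rows by join key.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    # Probe phase: look up each left-hand row and merge matching pairs.
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            # On a field name clash, the left-hand row's value wins.
            joined.append({**match, **row})
    return joined
```

For example, joining event rows against user rows on `user_id` yields one merged row per event whose user exists, and drops events with no matching user.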

Page 20: Cassandra synergy

REDIS

Page 21: Cassandra synergy

What is it?

“Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.”

Page 22: Cassandra synergy

Synergy?

Good for…

Sorting sets & lists

Pubsub messaging

(more) Accurate counters

Merging sets

Transactions!

Works in memory, can serve data fast based on key

Good for runtime storage of aggregate data

Could use shared resources on Cassandra nodes (could populate most recent data via triggers (naughty))
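
As a sketch of the "sorted sets plus counters" idea, the helpers below keep a running word ranking in a Redis sorted set. They only call `zincrby(name, amount, value)` and `zrevrange(name, start, end, withscores=...)`, which is the argument order used by redis-py 3.0 and later (an assumption about the client you have installed). `FakeSortedSetClient` is a hypothetical in-memory stand-in so the example runs without a Redis server.

```python
class FakeSortedSetClient:
    """In-memory stand-in exposing only the two sorted-set calls used below."""

    def __init__(self):
        self._sets = {}

    def zincrby(self, name, amount, value):
        scores = self._sets.setdefault(name, {})
        scores[value] = scores.get(value, 0.0) + amount
        return scores[value]

    def zrevrange(self, name, start, end, withscores=False):
        scores = self._sets.get(name, {})
        # Highest score first; equal scores fall back to reverse member order.
        ordered = sorted(scores.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
        # Only end == -1 ("to the last element") is handled for negative indexes.
        stop = None if end == -1 else end + 1
        page = ordered[start:stop]
        return page if withscores else [member for member, _ in page]


def record_word(client, key, word):
    """Add one occurrence of word to the ranking stored under key."""
    client.zincrby(key, 1, word)


def top_words(client, key, n):
    """Return the n highest-scoring (word, score) pairs."""
    if n <= 0:
        return []  # guard: an end index of -1 would mean "the whole set"
    return client.zrevrange(key, 0, n - 1, withscores=True)
```

Against a real server you would pass a `redis.Redis(...)` instance instead of the stand-in; members then come back as bytes unless the client is created with `decode_responses=True`.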

Page 23: Cassandra synergy

ElasticSearch

Distributed real-time search engine based on Apache Lucene

Page 24: Cassandra synergy

What is it?

Distributed real-time search engine

Built from the ground up for reliability and scalability

Supports lots of other features as well:

Free-text search

Spatial

Query by arbitrary fields

Facets

Multi-lingual query support

Page 25: Cassandra synergy

Synergy?

Although external to Cassandra it can provide rich query capabilities over the same data

Simplify Data Models in Cassandra to maximise storage

Separate read and write workloads (read from ES, write to Cassandra)

Some Storm integrations exist for writing records to both ElasticSearch and Cassandra as data enters the system

Again… Spatial!
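
The "write to Cassandra, read from ES" split can be illustrated with two in-memory stand-ins: a dict keyed by record id plays the Cassandra table, and a tiny inverted index plays the search engine. Nothing here talks to either system. The class name `DualStore` is hypothetical, and whitespace tokenisation is an assumption.

```python
from collections import defaultdict
from typing import List


class DualStore:
    """Toy model of a dual write: durable store plus a separate search index."""

    def __init__(self):
        self.records = {}              # stand-in for a Cassandra table keyed by id
        self.index = defaultdict(set)  # stand-in for a search index: term -> ids

    def write(self, record_id: int, text: str) -> None:
        """Write path: persist the record, then index its terms."""
        # Simplification: rewriting an id does not remove its old index entries.
        self.records[record_id] = text
        for term in text.lower().split():
            self.index[term].add(record_id)

    def search(self, query: str) -> List[str]:
        """Read path: return records containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self.index.get(t, set()) for t in terms))
        return [self.records[i] for i in sorted(ids)]
```

The point of the split is that the key-value side can keep a simple data model, while every free-text or multi-field question is answered from the index.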

Page 26: Cassandra synergy

Hadoop

Batch Analytics

Page 27: Cassandra synergy

What is it?

Open Source under Apache License 2.0

Top Level Apache project

Runs on commodity hardware

Used for storage and large scale processing of data-sets

Lots of complementary tools… Impala, Mahout, etc.

Page 28: Cassandra synergy

Some terms…

HDFS - A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop MapReduce - A programming model for large scale data processing.

Hive - An SQL-like abstraction for MapReduce jobs

Pig - A procedural style language for expressing MapReduce jobs
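
The MapReduce programming model itself fits in a few lines. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process to count messages per hour. It illustrates the model only, not the Hadoop API; the function names and the `(date_hour, message)` input shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Map: emit (date_hour, 1) for every (date_hour, message) input row."""
    for date_hour, _message in rows:
        yield (date_hour, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def messages_per_hour(rows: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Run map -> shuffle -> reduce end to end."""
    return reduce_phase(shuffle(map_phase(rows)))
```

On a cluster, the map and reduce functions run on many machines at once and the shuffle moves data between them over the network. Hive and Pig generate jobs of this shape from higher-level queries.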

Page 29: Cassandra synergy

Synergy?

Multiple ways to use it with Cassandra

DataStax Enterprise supports Hadoop on top of a Cassandra File System

Replication managed in-cluster (efficient)

Full Hadoop toolset available

Some Hadoop support in vanilla distribution. Limited support for efficient querying

Page 30: Cassandra synergy

Questions?