StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013
An efficient data mining solution by integrating Spark and Cassandra
-
Upload
stratio -
Category
Technology
-
view
966 -
download
0
description
Transcript of An efficient data mining solution by integrating Spark and Cassandra
AN EFFICIENT DATA MINING SOLUTION
Hadoop?
Cassandra?
Spark?
Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”
G. Orwell
#StratioBD
Goals
• Why do you need Cassandra?• What is the problem?• Why do you need Spark?• How do they work together?
#StratioBD
Cassandra
#StratioBD
• Based on DynamoDB…• Replication, Key/Value, P2P• And based on Big Table…• Column oriented
ROBUST FAST EFFICENT
NO BOTTLENECK REPLICATE
DDECENTRALIZED
Another Databas
e?
Why?
One User – Lot of data
Case A
#StratioBD
Many User – Few data
Case B
#StratioBD
Many user – Lot of data
Case C
#StratioBD
Crawler app
#StratioBD
Cassandra, I choose you
100M
Indexedpages
3kreads
Query time
< 1s
But…
Marketingwalks in
New query
“I need to find all the reference to the domain
ACME. I need the answer by Friday.”
#StratioBD
Problem
Cassandra is not well suited to resolved this
type of queries
You need to design the schema with the query
in mind
#StratioBD
ChallengeAccepted
What options do we have?
• Run Hive Query on top of C*• Write an ETL script and load data into
another DB• Clone the cluster
#StratioBD
What options do we have?
Run Hive Query on top of C*
Write ETL scripts and load into another DB
Clone the cluster
#StratioBD
And now… what can we do?
“We can't solve problems by using the same kind of thinking
we used when we created them”
#StratioBD
Albert Einstein
• Alternative to MapReduce• A low latency cluster computing system• For very large datasets• Create by UC Berkeley AMP Lab in 2010.• May be 100 times faster than MapReduce for:
Interactive algorithms. Interactive data mining
Spark
#StratioBD
Logistic regression inSpark vs Hadoop
SOURCE | http://spark.incubator.apache.org/
#StratioBD
WHO USES SPARK?
Spark and Cassandra
Integration points
#StratioBD
Cassandra’s HDFS abstraction layer
Advantantages:• Easily integrates with legacy systems.
Drawbacks:• Very high-level: no access to low level Cassandra’s features.• Questionable performance.
INTEGRATION POINTS: HDFS OVER CASSANDRA
#StratioBD
Cassandra’s Hadoop Interface• Thrift protocol• CQL3 (our implementation)
Uses the novel Cassandra’s
CqlPagingInputFormat
INTEGRATION POINTS: HDFS OVER CASSANDRA
#StratioBD
• Supports CQL3 features• Respects data locality • Good compromise between performance / implementation complexity
CQL3 Integration
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
#StratioBD
CQL3 Integration (II)
Provides a Java friendly API:
• Developers map Column Families to custom serializable
POJOs
• StratioDeep wraps the complexity of performing Spark
calculations directly over the user provided POJOs.
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
#StratioBD
Demo
Drawbacks:
• Still not preforming as well as we’d like
Uses Cassandra’s Hadoop Interface• No analyst-friendly interface:
No SQL-like query features
CQL3 Integration (III)
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3#StratioBD
Bring the integration to another level:
• Dump Cassandra’s Hadoop Interface• Direct access to Cassandra’s SSTable(s) files.• Extend Cassandra’s CQL3 to make use of Spark’s
distributed data processing power
Future extensions
What are we currently working on?
#StratioBD
#StratioBD
Conclusion
THANKS