Apache Cassandra and Python for Analyzing Streaming Big Data

Apache Cassandra and Python

For streaming Big Data

Prajod S Vettiyattil, Architect, Wipro
@prajods
https://in.linkedin.com/in/prajod

Nishant Sahay, Architect, Wipro
@nsahaytech
https://in.linkedin.com/in/nishantsahay

Open Source India, Nov 2015

Database track

Agenda

1. Time Series Data Analysis
2. Spark, Python, Cassandra and D3
3. Business problem
4. Solution using Logical Architecture
5. Data Processor
6. Data Persistence
7. Data Visualization

What this session is about

What: Big Data, Streaming, Time Series

How: Spark, Python, Cassandra, D3.js, Node.js

Tools: Python, Spark, Cassandra, Node and D3

• Python and Spark for big data processing
• Cassandra for persistence and serving
• D3 for visualization
• Node for
    • Enabling scalability
    • Data aggregation

Python

• Popular with open source projects
• Wide support base
• Strong in data science
• Visualization libraries
• Statistics functions

Cassandra

• NoSQL database
• Column family
• Dynamic columns
• AP in CAP theorem
    • Tunable consistency
• Suited for time series storage
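
Tunable consistency can be illustrated with the usual replica-overlap rule: a read is guaranteed to see the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor. The sketch below is only that arithmetic in plain Python; the function names are made up for illustration and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replica_sets_overlap(write_replicas: int, read_replicas: int,
                         replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3
print(quorum(rf))                                         # 2
print(replica_sets_overlap(quorum(rf), quorum(rf), rf))   # True: QUORUM write + QUORUM read
print(replica_sets_overlap(1, 1, rf))                     # False: ONE + ONE trades consistency for availability
```

Choosing lower levels for both reads and writes keeps the cluster available during node failures, which is the AP side of the trade-off named on the slide.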

D3.js

• Data driven documents
• SVG, HTML, CSS and JavaScript
• Fine grained control of screen elements
• Plethora of UI widgets

Business Problem

• Handle streaming data
    • Stock ticks
    • Weather movements
    • Satellite captures
    • Astronomical observations
    • Large Hadron Collider
• Ingest
• Persist
• Visualize
• Analysing stock prices

Logical Solution Architecture

Time Series Data Producer (IoT devices, Stock ticks)

Data Processor (pySpark)

Data Persistence (Cassandra)

Visualization Aggregator (Node.js)

Visualization (D3.js)

Data Processor: pySpark

• Apache Spark is a big data processor
    • Streaming data
    • Batch data
    • Lambda architecture
• pySpark for using Python's power on top of Spark
• Python
    • Machine learning
    • Statistics
    • Visualization
• Cassandra integration
    • pyspark-cassandra adapter from TargetHoldings

Logical Architecture diagram of Spark

Components: Spark SQL, MLlib, GraphX, SparkR, pySpark, Spark Streaming

Base: Apache Spark

Apache Spark: Core

• In memory processing for Big Data
• Cached intermediate data sets
• Multi-step DAG based execution
• Resilient Distributed Datasets (RDDs)

pySpark and Cassandra

Java

Python

Cassandra

Apache Spark: Processing stock ticks

• Ingest stock tick stream, coming in at a high rate
• Calculate moving average of stock prices
• Insert the average of prices into Cassandra
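
The moving-average step can be sketched in plain Python, independent of Spark. This is a minimal illustration of the calculation only, assuming a fixed window measured in number of ticks; the deck does not state the window size or whether it is count-based or time-based, and the function name is hypothetical. In the actual pipeline the same arithmetic would run inside the pySpark streaming job.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(prices: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the last `window` prices once the window is full."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    running_sum = 0.0
    for price in prices:
        if len(recent) == window:
            running_sum -= recent[0]  # oldest value is evicted by the append below
        recent.append(price)
        running_sum += price
        if len(recent) == window:
            yield running_sum / window


ticks = [10.0, 20.0, 30.0, 40.0]
averages = list(moving_average(ticks, window=2))
print(averages)  # [15.0, 25.0, 35.0]
```

Keeping a running sum avoids re-adding the whole window for every tick, which matters when ticks arrive at a high rate.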

Data Persistence - Cassandra

• Masterless: Peer to peer
• Built to Scale: Scales to support millions of operations per second
• High Availability: No single point of failure
• Ease of Use: Operational simplicity, CQL for developers
• Battle tested in production at companies such as Facebook, Apple and Netflix

Data Persistence - Cassandra

Ring of eight nodes: n1 to n8

Write request: the partition key hashes to a value owned by n1

• n8: coordinator node
• n1: primary node responsible for handling the request
• n2, n3: replication nodes (RF=3)
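
The replica placement described on this slide can be sketched as a toy token ring. This is a simplified model, not Cassandra's implementation: it assumes the nodes sit on the ring in the order n1 to n8 with evenly spaced tokens, uses MD5 as a stand-in for the real partitioner, and places replicas on the next nodes clockwise. `token_for` and `replicas_for` are hypothetical helper names.

```python
import hashlib
from bisect import bisect_left

NODES = [f"n{i}" for i in range(1, 9)]  # assumed ring order n1..n8
RING_SIZE = 2 ** 32                      # illustrative token space
# Each node owns the token range that ends at its own token (evenly spaced here).
TOKENS = [(i + 1) * RING_SIZE // len(NODES) - 1 for i in range(len(NODES))]


def token_for(partition_key: str) -> int:
    """Hash the partition key onto the ring (MD5 stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def replicas_for(partition_key: str, replication_factor: int = 3) -> list:
    """Return the primary owner followed by the next nodes clockwise on the ring."""
    start = bisect_left(TOKENS, token_for(partition_key)) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]


print(replicas_for("GAZP:2015-11-10"))  # primary first, then two replicas
```

Any node can act as coordinator: it computes the same placement and forwards the write to the replicas, which is why the ring has no master.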

Cassandra Data Model – Skinny Rows

Skinny rows: the primary key contains only the partition key. Here it is a composite partition key.

CREATE TABLE stock_info (stock_id text, date text, price double, PRIMARY KEY ((stock_id, date)));

Logical view:

stock_id  date        price
GAZP      2015-11-11  556.50
GAZP      2015-11-10  556.65

Disk view (one partition per stock_id:date, spread across nodes n1 and n4):

GAZP:2015-11-11
    price = 556.50
GAZP:2015-11-10
    price = 556.65

Cassandra Data Model – Wide Rows

Wide rows: the primary key contains columns (clustering columns) other than the partition key. This is a compound primary key (partition + clustering).

CREATE TABLE stock_ticker (stock_id text, price double, event_time timestamp, PRIMARY KEY (stock_id, event_time));

Logical view:

stock_id  price   date        event_time
GAZP      559.45  2015-11-10  2015-11-10 09:30:00
GAZP      556.45  2015-11-10  2015-11-10 13:30:00
GAZP      556.65  2015-11-11  2015-11-11 18:00:00

Disk view (a single partition, GAZP, on node n1):

GAZP
    2015-11-10 13:30:00:price = 556.45
    2015-11-10 09:30:00:price = 559.45
    2015-11-11 16:00:00:price = 556.65

Time Series – Cassandra Data Model

Wide row + row partition

CREATE TABLE stock_info (stock_id text, date text, price double, event_time timestamp, PRIMARY KEY ((stock_id, date), event_time));

Logical view:

stock_id  price   date        event_time
GAZP      559.45  2015-11-10  2015-11-10 09:30:00
GAZP      556.45  2015-11-10  2015-11-10 13:30:00
GAZP      556.65  2015-11-11  2015-11-11 18:00:00

Disk view (one partition per stock_id:date, spread across nodes n1 and n6):

GAZP:2015-11-10
    2015-11-10 13:30:00:price = 556.45
    2015-11-10 09:30:00:price = 559.45
GAZP:2015-11-11
    2015-11-11 18:00:00:price = 556.65
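
How the (stock_id, date) partition key and the event_time clustering column shape the stored data can be modelled in a few lines of Python. This is only an in-memory sketch of the grouping and ordering, assuming the default ascending clustering order; `to_partitions` is a hypothetical helper and nothing here talks to Cassandra.

```python
from collections import defaultdict
from datetime import datetime

ticks = [
    ("GAZP", datetime(2015, 11, 10, 13, 30), 556.45),
    ("GAZP", datetime(2015, 11, 10, 9, 30), 559.45),
    ("GAZP", datetime(2015, 11, 11, 18, 0), 556.65),
]


def to_partitions(rows):
    """Group rows by (stock_id, date) and sort each partition by event_time."""
    partitions = defaultdict(list)
    for stock_id, event_time, price in rows:
        partition_key = (stock_id, event_time.date().isoformat())
        partitions[partition_key].append((event_time, price))
    for cells in partitions.values():
        cells.sort(key=lambda cell: cell[0])  # clustering order: event_time ascending
    return dict(partitions)


partitions = to_partitions(ticks)
for key, cells in partitions.items():
    print(key, [(t.strftime("%H:%M:%S"), p) for t, p in cells])
```

Each day of a stock becomes its own partition, so a query for one day reads a single, time-ordered partition and no partition grows without bound.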

Summary – Cassandra Data Model

• Skinny Row: one partition per stock_id:date holding a single price (GAZP:2015-11-11, GAZP:2015-11-10; nodes n1 and n4)
• Wide Row: one partition per stock_id with one column per event_time (GAZP; node n1)
• Wide Row + Row Partition: one partition per stock_id:date with one column per event_time (GAZP:2015-11-10, GAZP:2015-11-11; nodes n1 and n6)
    • Optimize with expiring columns, or split the day bucket into multiple rows
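
Splitting the day bucket into multiple rows amounts to adding a finer time component to the partition key. The sketch below shows one possible key scheme, assuming buckets of a whole number of hours that divide the day evenly; the key format and the `bucket_key` name are illustrative and do not come from the deck. Expiring columns are a separate mechanism, set per write with a TTL in CQL, and are not modelled here.

```python
from datetime import datetime


def bucket_key(stock_id: str, event_time: datetime, hours_per_bucket: int = 24) -> str:
    """Build a partition key of the form stock_id:date[:bucket_index]."""
    if hours_per_bucket <= 0 or 24 % hours_per_bucket != 0:
        raise ValueError("hours_per_bucket must divide 24")
    day = event_time.date().isoformat()
    if hours_per_bucket == 24:
        return f"{stock_id}:{day}"  # one bucket per day, as on the slides
    return f"{stock_id}:{day}:{event_time.hour // hours_per_bucket:02d}"


tick_time = datetime(2015, 11, 10, 13, 30)
print(bucket_key("GAZP", tick_time))                      # GAZP:2015-11-10
print(bucket_key("GAZP", tick_time, hours_per_bucket=6))  # GAZP:2015-11-10:02
```

Smaller buckets keep each partition short when the tick rate is high, at the cost of reading several partitions to cover a full day.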

Node.js, Cassandra and D3.js

• Web UI layer (browser): D3.js graph
• Server layer: ExpressJS, cassandra-driver
• Database layer: Cassandra DB

• Browser to server: REST-based polling, get JSON data
• Server to database: CQL select of time series data

Data Aggregator

• Node.js is a proxy for data aggregation
    • Expose REST endpoint for visualization
    • Retrieve data from Cassandra
    • Data transformation as per business need
• ExpressJS: Flexible web application framework
• DataStax cassandra-driver: client library for Apache Cassandra
• EJS: For quick templating of on-the-fly Node application
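
The data transformation step can be sketched as turning time series rows into JSON points for a chart. In the deck this step runs in Node.js; the Python below only illustrates the shape of the transformation. The {x, y} point format with x in epoch seconds is an assumption about what the charting layer expects, timestamps are assumed to be UTC, and `rows_to_series` is a hypothetical name.

```python
import json
from datetime import datetime, timezone


def rows_to_series(rows):
    """Convert (event_time, price) rows into {x, y} points sorted by time."""
    points = [
        {"x": int(event_time.replace(tzinfo=timezone.utc).timestamp()), "y": price}
        for event_time, price in rows
    ]
    return sorted(points, key=lambda point: point["x"])


rows = [
    (datetime(2015, 11, 10, 13, 30), 556.45),
    (datetime(2015, 11, 10, 9, 30), 559.45),
]
series = rows_to_series(rows)
payload = json.dumps(series)  # body a REST endpoint could return to the browser
print(payload)
```

Sorting on the server keeps the browser code simple: the chart can draw the points in the order received.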

Visualization - Frameworks

• D3 for transformation of time series data into visual information
    • Consume REST API
    • Generate customized data driven graphs and visualization
• Rickshaw is a JavaScript toolkit for creating interactive time series graphs
    • Built on D3.js
    • Generate time-series graph

Visualization – Graphs

[Figure: "Stock Price" chart with series for Price, Moving Average and Trade Volume]

Summary

• Processing time series data
    • Apache Spark
    • Cassandra
    • Node.js
    • D3.js

QUESTIONS
