How We Used Cassandra/Solr to Build Real-Time Analytics Platform

BUILDING REAL-TIME ANALYTICS

With DataStax Enterprise (DSE).

jKoolCloud.com


Objectives
• Store everything, analyze everything…
• Combined real-time & historical analytics
• Fast response, flexible query capabilities
• Target - for business user
• Insulate us from underlying software
• Hide complexity
• Scale for ingesting data-in-motion
• Scale for storing data-at-rest
• Elasticity & operational efficiency
• Ease of monitoring & management


Technologies we considered?
• SQL (Oracle, MySQL, etc.)
  • No scale. We have had a lot of experience with our customers’ issues with this at our parent company Nastel…
  • RAM was “the” bottleneck. Commits take too long, and while a commit is in progress everything else stops
• NoSQL
  • Cassandra/Solr (DSE)
  • Hadoop/MapReduce
  • MongoDB
• Clustered Computing Platforms
  • STORM
  • MapReduce
  • Spark (we learned about this while building jKool)


Why we chose Cassandra/Solr?
• Pros:
  • Simple to set up & scale for clustered deployments
  • Scalable, resilient, fault-tolerant (easy replication)
  • Ability to have data automatically expire (TTL – necessary for our pricing model)
  • Configurable replication strategy
  • Great for heavy write workloads
    • Write performance was better than Hadoop.
    • Insert rate was of paramount importance for us – getting data in as fast as possible was our goal
  • Java driver balances the load amongst the nodes in a cluster for us (master-slave would never have worked for us)
  • Solr provides a way to index all incoming data - essential
  • DSE provides a nice integration between Cassandra and Solr
• Cons:
  • Susceptible to GC pauses (memory management)
    • The more memory, the more GC pauses
    • Less memory and more nodes seems a better approach than one big “honking” server (we see 6-8GB optimal, so far)
  • Data compaction tasks may hang
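The TTL-based expiry listed among the pros can be illustrated with a small in-memory model. This is only a conceptual sketch in Python, not how Cassandra implements TTLs internally: the class `TtlStore` and its method names are hypothetical, and the clock is injected so the behaviour can be checked deterministically.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TtlStore:
    """Toy key-value store whose entries expire after a per-write TTL.

    Hypothetical illustration of the idea only, not a Cassandra API.
    """

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # key -> (value, absolute expiry time, or None for "never expires")
        self._rows: Dict[str, Tuple[Any, Optional[float]]] = {}

    def put(self, key: str, value: Any, ttl_seconds: Optional[float] = None) -> None:
        """Store a value; with a TTL it becomes unreadable after ttl_seconds."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._rows[key] = (value, expires_at)

    def get(self, key: str) -> Optional[Any]:
        """Return the value, or None if it is missing or has expired."""
        entry = self._rows.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            # Expired: remove lazily on read.
            del self._rows[key]
            return None
        return value
```

With a retention period expressed as a TTL on every write, old data disappears without a separate clean-up job, which is what makes it a natural fit for retention-based pricing.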


Why not Hadoop MapReduce?
• MapReduce too slow for real-time workloads
  • OK for batch, not so great for real-time
• Needs to be paired with other technologies for query (Hive/Pig)
• Complex to set up, run and operate
  • Our goals were simplicity first…
• Opted for STORM/Spark wrapped with our own micro-services platform, FatPipes, instead of the MapReduce functionality


Why we chose Cassandra/Solr vs. Mongo?
• Why not Mongo?
  • Global write-lock performance concerns…
• Cassandra/Solr
  • Java based (our project was in Java)
  • Easy to scale and replicate data
  • Flexible read & write consistency levels (ALL, QUORUM, ANY, etc.)
  • Did we say Java? Yes. (We like Java…)
  • Flexible choice of platform coverage
  • Great for time-series data streams (market focus for jKool)
  • Inherent query limitations in Cassandra solved via Solr integration (provided with DSE – as mentioned earlier)
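The consistency levels mentioned above trade latency against the number of replicas that must acknowledge an operation. The arithmetic can be sketched as follows; the function names are hypothetical, the replication factors are example values rather than jKool's settings, and only a few levels are modelled.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: a strict majority of the replicas."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed at a given consistency level.

    ANY is a write-only level that a stored hint can satisfy, so it is
    modelled here as needing zero replica acknowledgements.
    """
    fixed = {"ANY": 0, "ONE": 1, "TWO": 2, "THREE": 3}
    name = level.upper()
    if name == "ALL":
        return replication_factor
    if name == "QUORUM":
        return quorum(replication_factor)
    if name in fixed:
        return fixed[name]
    raise ValueError(f"level not modelled in this sketch: {level}")


def read_sees_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

For example, with a replication factor of 3, QUORUM reads combined with QUORUM writes always overlap on at least one replica, while ONE/ONE favours speed and relies on eventual consistency.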


How we achieved near real-time analytics?
• Created our own micro-services architecture (FatPipes) which runs on top of:
  • STORM/JMS/Kafka
  • FatPipes can be embedded or distributed
• Real-time Grid
  • Feeds tracking data and real-time queries to CEP and back
• User interacts with Real-time via JKQL (jKool Query Language)
  • English-like query language for analyzing data in motion and at rest.
  • “Subscribe” verb for real-time updates

Real-time (Real-time.png)


Why clustered computing platforms?
• STORM paired with Kafka/JMS and CEP
  • Clustered way to process incoming real-time streams
  • STORM handles clustering/distribution
  • Kafka/JMS for messaging between grids
  • Split streaming workload across the cluster
  • Achieve linear scalability for incoming real-time streams
• Apache Spark (alternative to MapReduce)
  • For distributing queries and trend analysis
  • Micro-batching for historical analytics
  • Loading large datasets into memory (across different nodes)
  • Running queries against large datasets
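Two of the ideas on this slide, splitting a stream across workers and micro-batching, can be sketched in a few lines. This is a plain-Python illustration under assumed names (`worker_for`, `partition`, `micro_batches`, and the `source` field are hypothetical); it does not use the STORM or Spark APIs.

```python
import zlib
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def worker_for(key: str, num_workers: int) -> int:
    """Map a key to a worker index so the same key always lands together.

    crc32 is used because it is stable across processes, unlike the
    built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(events: Iterable[Dict[str, Any]], key_field: str,
              num_workers: int) -> Dict[int, List[Dict[str, Any]]]:
    """Split a stream of events into per-worker shards by key."""
    shards: Dict[int, List[Dict[str, Any]]] = {w: [] for w in range(num_workers)}
    for event in events:
        shards[worker_for(str(event[key_field]), num_workers)].append(event)
    return shards


def micro_batches(events: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Group a stream into small fixed-size batches; the last may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because each key is routed to exactly one worker, adding workers spreads the keys further without requiring coordination between them, which is the property that makes near-linear scaling of ingest plausible.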


Key to Real-time Analytics
• Process streams as they come while at the same time avoiding IO
• Streams are split into a real-time queue and a persistence queue with eventual consistency (eventually… both real-time and historical must reconcile)
• Both have to be processed in parallel
• Writing to the persistence layer and then analyzing will not achieve near-real-time processing
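The split into a real-time queue and a persistence queue can be sketched with two worker threads. This is a minimal illustration, assuming a hypothetical event shape of `{"name": ..., "value": ...}`; a list stands in for the persistence layer and a running sum stands in for the real-time analysis.

```python
import queue
import threading
from typing import Dict, Iterable, List, Tuple

_STOP = object()  # sentinel that tells a worker to finish


def process_stream(events: Iterable[dict]) -> Tuple[Dict[str, int], List[dict]]:
    """Fan each event out to a real-time path and a persistence path."""
    realtime_q: queue.Queue = queue.Queue()
    persist_q: queue.Queue = queue.Queue()
    running_totals: Dict[str, int] = {}  # real-time view, kept in memory
    stored: List[dict] = []              # stand-in for the database

    def realtime_worker() -> None:
        while True:
            event = realtime_q.get()
            if event is _STOP:
                break
            name = event["name"]
            running_totals[name] = running_totals.get(name, 0) + event["value"]

    def persistence_worker() -> None:
        while True:
            event = persist_q.get()
            if event is _STOP:
                break
            stored.append(event)  # a real system would write to storage here

    workers = [threading.Thread(target=realtime_worker),
               threading.Thread(target=persistence_worker)]
    for worker in workers:
        worker.start()
    for event in events:
        # Neither path waits for the other: analysis never blocks on IO.
        realtime_q.put(event)
        persist_q.put(event)
    realtime_q.put(_STOP)
    persist_q.put(_STOP)
    for worker in workers:
        worker.join()
    return running_totals, stored
```

Once both queues drain, totals recomputed from `stored` should match `running_totals`; that is the reconciliation between the real-time and historical views that the slide refers to.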


High Level Architecture


Deeper View
[Architecture diagram; labels shown, in order of appearance:]
• jKool Web Grid: Web Application Server (×3)
• Storage Grid: Cassandra (×4)
• Search Grid: Solr (×4)
• Digest, Index
• Real-time Grid
• JKQL
• FatPipes Micro Services (INGEST)
• Compute Grid
• FatPipes Micro Services (REAL-TIME) (STORM/CEP)
• Distributed Messaging (JMS or Kafka)


Challenges we ran into?
• So many technology options (…so little time…)
  • Deciding on the right combination is key early on
• Cassandra/Solr deployment – (it was a learning experience for us)
  • Lots of configuration, memory management, replication options
• Monitoring, managing clusters
  • Cassandra/Solr, STORM, Zookeeper, Messaging
  • +Leverage parent company’s AutoPilot Technology
• Achieving near real-time analytics proved extremely challenging – but we did it!
  • Keeping track of latencies across cluster
  • Estimating computational capacity required to crunch incoming streams
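A back-of-envelope version of the capacity estimate mentioned above might look like the following. It is a rough sketch under stated assumptions, not the method used for jKool: the function name, the parameters, and the 70% utilization target are all hypothetical, and it considers CPU only.

```python
import math


def nodes_required(events_per_second: float, cpu_ms_per_event: float,
                   cores_per_node: int, target_utilization: float = 0.7) -> int:
    """Estimate the nodes needed to keep up with an incoming stream.

    Assumes processing cost is CPU-bound and scales linearly with the
    event rate; memory, network, and GC pauses are ignored.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # CPU-seconds of work arriving per wall-clock second.
    cpu_seconds_needed = events_per_second * cpu_ms_per_event / 1000.0
    # CPU-seconds one node can supply per second while leaving headroom.
    usable_per_node = cores_per_node * target_utilization
    return max(1, math.ceil(cpu_seconds_needed / usable_per_node))
```

For instance, 50,000 events per second at 0.2 ms of CPU each is 10 CPU-seconds of work per second; on 8-core nodes run at 50% utilization (4 usable cores each) that rounds up to 3 nodes. Measured latencies across the cluster would then be used to correct the estimate.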


Business Analyst User Interface
It's easy to “visualize your data”


jKOOL IN REAL-TIME
Real-time Demonstration of jKool’s usage of DSE