BUILDING REAL-TIME ANALYTICS
With DSE Enterprise.
jKoolCloud.com
Objectives
• Store everything, analyze everything…
• Combined real-time & historical analytics
• Fast response, flexible query capabilities
• Target - for business user
• Insulate us from underlying software
• Hide complexity
• Scale for ingesting data-in-motion
• Scale for storing data-at-rest
• Elasticity & operational efficiency
• Ease of monitoring & management
Technologies we considered?
• SQL (Oracle, MySQL, etc.)
  • No scale. We have had a lot of experience with our customers’ issues with this at our parent company, Nastel…
  • RAM was “the” bottleneck. Commits take too long, and while a commit is in progress everything else stops
• NoSQL
  • Cassandra/Solr (DSE)
  • Hadoop/MapReduce
  • MongoDB
• Clustered Computing Platforms
  • STORM
  • MapReduce
  • Spark (we learned about this while building jKool)
Why we chose Cassandra/Solr?
• Pros:
  • Simple to set up & scale for clustered deployments
  • Scalable, resilient, fault-tolerant (easy replication)
  • Ability to have data automatically expire (TTL – necessary for our pricing model)
  • Configurable replication strategy
  • Great for heavy write workloads
    • Write performance was better than Hadoop
    • Insert rate was of paramount importance for us: getting data in as fast as possible was our goal
  • Java driver balances the load amongst the nodes in a cluster for us (master-slave would never have worked for us)
  • Solr provides a way to index all incoming data, which is essential
  • DSE provides a nice integration between Cassandra and Solr
• Cons:
  • Susceptible to GC pauses (memory management)
    • The more memory, the more GC pauses
    • Less memory and more nodes seems a better approach than one big “honking” server (we see 6-8GB optimal, so far)
  • Data compaction tasks may hang
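To illustrate what automatic expiry means for anyone reading the store, here is a minimal in-memory sketch. It is an analogue only, not how Cassandra implements TTL: the class `TtlStore` and its methods are hypothetical, and the clock is passed in explicitly to keep the example deterministic. In CQL, expiry is requested per write with `USING TTL <seconds>`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory analogue of TTL expiry; not Cassandra's implementation. */
class TtlStore {
    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Stores a value that stops being visible ttlMillis after nowMillis. */
    void put(String key, String value, long ttlMillis, long nowMillis) {
        entries.put(key, new Entry(value, nowMillis + ttlMillis));
    }

    /** Returns the value, or null if the key is absent or its TTL has elapsed. */
    String get(String key, long nowMillis) {
        Entry entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            entries.remove(key); // lazily drop the expired entry on read
            return null;
        }
        return entry.value;
    }
}
```

A retention period tied to a pricing tier would simply translate into a different `ttlMillis` per write.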
Why not Hadoop MapReduce?
• MapReduce too slow for real-time workloads
  • OK for batch, not so great for real-time
• Needs to be paired with other technologies for query (Hive/Pig)
• Complex to set up, run and operate
  • Our goals were simplicity first…
• Opted for STORM/Spark wrapped with our own micro-services platform, FatPipes, instead of the MapReduce functionality
Why we chose Cassandra/Solr vs. Mongo?
• Why not Mongo?
  • Global write-lock performance concerns…
• Cassandra/Solr
  • Java-based (our project was in Java)
  • Easy to scale and replicate data
  • Flexible read & write consistency levels (ALL, QUORUM, ANY, etc.)
  • Did we say Java? Yes. (We like Java…)
  • Flexible choice of platform coverage
  • Great for time-series data streams (market focus for jKool)
  • Inherent query limitations in Cassandra solved via Solr integration (provided with DSE, as mentioned earlier)
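The trade-off behind tunable consistency levels comes down to simple arithmetic: a quorum is a majority of the replicas, and a read is guaranteed to see the latest acknowledged write when the replicas consulted on read and on write must overlap. The sketch below works through that arithmetic; the class and method names are hypothetical and no driver API is involved.

```java
/** Hypothetical helper illustrating consistency-level arithmetic. */
class ConsistencyMath {
    /** Replicas that must respond at QUORUM: a majority of the replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read replica set must intersect every write replica set. */
    static boolean overlaps(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3; // assumed replication factor for the example
        int q = quorum(rf);
        System.out.println("QUORUM at RF=3 needs " + q + " replicas");
        System.out.println("QUORUM read + QUORUM write overlap: " + overlaps(q, q, rf));
        System.out.println("ONE read + ONE write overlap: " + overlaps(1, 1, rf));
    }
}
```

Lower levels favour write speed; higher levels favour read-your-writes behaviour at the cost of waiting for more replicas.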
How we achieved near real-time analytics?
• Created our own micro-services architecture (FatPipes) which runs on top of:
  • STORM/JMS/Kafka
  • FatPipes can be embedded or distributed
• Real-time Grid
  • Feeds tracking data and real-time queries to CEP and back
• User interacts with Real-time via JKQL (jKool Query Language)
  • English-like query language for analyzing data in motion and at rest
  • “Subscribe” verb for real-time updates
[Figure: Real-time (Real-time.png)]
Why clustered computing platforms?
• STORM paired with Kafka/JMS and CEP
  • Clustered way to process incoming real-time streams
  • STORM handles clustering/distribution
  • Kafka/JMS for messaging between grids
• Split streaming workload across the cluster
  • Achieve linear scalability for incoming real-time streams
• Apache Spark (alternative to MapReduce)
  • For distributing queries and trend analysis
  • Micro-batching for historical analytics
  • Loading large datasets into memory (across different nodes)
  • Running queries against large datasets
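One common way to split a streaming workload across a cluster is to hash a key from each event and route it to a worker by that hash, so events with the same key always land on the same worker and adding workers spreads the load. The sketch below shows the idea in plain Java; it is an illustration under that assumption, not the STORM or FatPipes routing code, and `StreamPartitioner` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hash partitioner for spreading keyed events over workers. */
class StreamPartitioner {
    /** Maps a key to a worker index in [0, workers). */
    static int partitionFor(String key, int workers) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), workers);
    }

    /** Groups keys into one bucket per worker, preserving arrival order. */
    static List<List<String>> split(List<String> keys, int workers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, workers)).add(key);
        }
        return buckets;
    }
}
```

Because routing depends only on the key, per-key state (counters, windows) can stay local to a single worker.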
Key to Real-time Analytics
• Process streams as they come while at the same time avoiding I/O
  • Streams are split into a real-time queue and a persistence queue with eventual consistency (eventually… both real-time and historical must reconcile)
• Both have to be processed in parallel
  • Writing to the persistence layer and then analyzing will not achieve near real-time processing
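The split described above can be sketched with two in-memory queues and two consumer threads: every incoming event is placed on both queues, the real-time path updates its analytics without waiting for storage, and the persistence path writes at its own pace. This is a minimal illustration under those assumptions, not the FatPipes implementation; the class name, the end-of-stream marker, and the list standing in for the data store are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical dual-path pipeline: real-time analytics and persistence run in parallel. */
class DualPathPipeline {
    private static final String END = "__END__"; // hypothetical end-of-stream marker

    private final BlockingQueue<String> realTimeQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<String> persistenceQueue = new LinkedBlockingQueue<>();
    private final AtomicLong realTimeCount = new AtomicLong();
    private final List<String> persisted = new ArrayList<>(); // stand-in for the data store

    /** Streams the events through both paths and waits until both have drained. */
    void run(List<String> events) {
        Thread analytics = new Thread(() -> drain(realTimeQueue, e -> realTimeCount.incrementAndGet()));
        Thread writer = new Thread(() -> drain(persistenceQueue, persisted::add));
        analytics.start();
        writer.start();
        try {
            for (String event : events) {
                realTimeQueue.put(event);    // analyzed immediately, no storage I/O
                persistenceQueue.put(event); // written independently
            }
            realTimeQueue.put(END);
            persistenceQueue.put(END);
            analytics.join();
            writer.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while streaming", ex);
        }
    }

    private static void drain(BlockingQueue<String> queue, Consumer<String> handler) {
        try {
            for (String e = queue.take(); !END.equals(e); e = queue.take()) {
                handler.accept(e);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    /** Both views agree once the two paths have processed the same events. */
    boolean reconciled() {
        return realTimeCount.get() == persisted.size();
    }

    long realTimeCount() {
        return realTimeCount.get();
    }
}
```

While the stream is in flight the two counts may differ; `reconciled()` only has to hold once both paths have caught up, which is the eventual-consistency requirement on the slide.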
High Level Architecture
Deeper View
[Architecture diagram. Labels: jKool Web Grid (three Web Application Servers); Storage Grid (four Cassandra nodes); Search Grid (four Solr nodes); Digest, Index; Real-time Grid; JKQL; FatPipes Micro Services (INGEST); Compute Grid; FatPipes Micro Services (REAL-TIME) (STORM/CEP); Distributed Messaging (JMS or Kafka)]
Challenges we ran into?
• So many technology options (…so little time…)
  • Deciding on the right combination is key early on
• Cassandra/Solr deployment (it was a learning experience for us)
  • Lots of configuration, memory management, replication options
• Monitoring, managing clusters
  • Cassandra/Solr, STORM, Zookeeper, Messaging
  • +Leverage parent company’s AutoPilot Technology
• Achieving near real-time analytics proved extremely challenging – but we did it!
  • Keeping track of latencies across cluster
  • Estimating computational capacity required to crunch incoming streams
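Keeping track of latencies across a cluster usually means recording samples per node and comparing a high percentile rather than the average, since a single slow node hides easily in a mean. The sketch below shows one way to do that with the nearest-rank percentile; `LatencyTracker` and its methods are hypothetical names for illustration, not part of jKool or AutoPilot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-node latency tracker using nearest-rank percentiles. */
class LatencyTracker {
    private final Map<String, List<Long>> samplesByNode = new HashMap<>();

    void record(String node, long latencyMillis) {
        samplesByNode.computeIfAbsent(node, n -> new ArrayList<>()).add(latencyMillis);
    }

    /** Nearest-rank percentile (percent in 1..100) for one node; -1 if no samples. */
    long percentile(String node, int percent) {
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        List<Long> samples = samplesByNode.get(node);
        if (samples == null || samples.isEmpty()) {
            return -1;
        }
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        // Integer form of ceil(percent / 100 * n), avoiding floating-point rounding
        int rank = (percent * sorted.size() + 99) / 100;
        return sorted.get(rank - 1);
    }

    /** Node with the highest latency at the given percentile; null if nothing recorded. */
    String slowestNode(int percent) {
        String slowest = null;
        long worst = -1;
        for (String node : samplesByNode.keySet()) {
            long value = percentile(node, percent);
            if (value > worst) {
                worst = value;
                slowest = node;
            }
        }
        return slowest;
    }
}
```

In a real deployment the samples would be windowed or sketched rather than kept in full, but the comparison across nodes stays the same.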
Business Analyst User Interface
It's easy to “visualize your data”
jKOOL IN REAL-TIME
Real-time Demonstration of jKool’s usage of DSE