A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise
How We Used Cassandra/Solr to Build Real-Time Analytics Platform
Uploaded by datastax-academy. Category: Engineering.
![Page 1: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/1.jpg)
BUILDING REAL-TIME ANALYTICS
With DSE Enterprise.
jKoolCloud.com
![Page 2: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/2.jpg)
Objectives
• Store everything, analyze everything…
• Combined real-time & historical analytics
• Fast response, flexible query capabilities

• Target - for business user
• Insulate us from underlying software
• Hide complexity

• Scale for ingesting data-in-motion
• Scale for storing data-at-rest
• Elasticity & Operational efficiency
• Ease of monitoring & management
![Page 3: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/3.jpg)
Technologies we considered?
• SQL (Oracle, MySQL, etc.)
  • No scale. We have had a lot of experience with our customers’ issues with this at our parent company Nastel…
  • RAM was “the” bottleneck. Commits take too long, and while that is happening everything else stops
• NoSQL
  • Cassandra/Solr (DSE)
  • Hadoop/MapReduce
  • MongoDB
• Clustered Computing Platforms
  • STORM
  • MapReduce
  • Spark (we learned about this while building jKool)
![Page 4: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/4.jpg)
Why we chose Cassandra/Solr?
• Pros:
  • Simple to set up & scale for clustered deployments
  • Scalable, resilient, fault-tolerant (easy replication)
  • Ability to have data automatically expire (TTL – necessary for our pricing model)
  • Configurable replication strategy
  • Great for heavy write workloads
    • Write performance was better than Hadoop.
    • Insert rate was of paramount importance for us – getting data in as fast as possible was our goal
  • Java driver balances the load amongst the nodes in a cluster for us (master-slave would never have worked for us)
  • Solr provides a way to index all incoming data - essential
  • DSE provides a nice integration between Cassandra and Solr
• Cons:
  • Susceptible to GC pauses (memory management)
    • The more memory, the more GC pauses
    • Less memory and more nodes seems a better approach than one big “honking” server (we see 6-8GB optimal, so far)
  • Data compaction tasks may hang
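To make the TTL point concrete, here is a minimal in-memory sketch of time-to-live semantics: each row is written with a lifetime and stops being readable once that lifetime has passed. In Cassandra itself a TTL is attached at write time (for example with `USING TTL` in CQL); the `TtlStore` class, its method names, and the 60-second lifetime below are assumptions made up for illustration and are not taken from the jKool code.

```python
class TtlStore:
    """Toy key-value store in which every row carries an expiry time."""

    def __init__(self):
        # key -> (value, expires_at), with expires_at in seconds
        self._rows = {}

    def insert(self, key, value, ttl_seconds, now):
        """Write a row that expires ttl_seconds after `now`."""
        self._rows[key] = (value, now + ttl_seconds)

    def get(self, key, now):
        """Return the value, or None if the row is missing or expired."""
        row = self._rows.get(key)
        if row is None:
            return None
        value, expires_at = row
        if now >= expires_at:
            # Expired rows are dropped lazily on read in this sketch.
            del self._rows[key]
            return None
        return value


store = TtlStore()
store.insert("event-1", {"value": 42}, ttl_seconds=60, now=0)
still_there = store.get("event-1", now=59)   # row is within its lifetime
gone = store.get("event-1", now=60)          # row has expired
```

The clock is passed in explicitly so the behaviour is deterministic; a real store would read the current time itself.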
![Page 5: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/5.jpg)
Why not Hadoop MapReduce?
• MapReduce too slow for real-time workloads
  • Ok for batch, not so great for real-time
  • Needs to be paired with other technologies for query (Hive/Pig)
  • Complex to set up, run and operate
• Our goals were simplicity first…
• Opted for STORM/SPARK wrapped with our own micro services platform FatPipes instead of the MapReduce functionality
![Page 6: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/6.jpg)
Why we chose Cassandra/Solr vs. Mongo?
• Why not Mongo?
  • Global write-lock performance concerns…
• Cassandra/Solr
  • Java based (our project was in Java)
  • Easy to scale, replicate data
  • Flexible read & write consistency levels (ALL, QUORUM, ANY, etc.)
  • Did we say Java? Yes. (we like Java…)
  • Flexible choice of platform coverage
  • Great for time-series data streams (market focus for jKool)
• Inherent query limitations in Cassandra solved via Solr integration (provided with DSE – as mentioned earlier)
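As a rough illustration of what tunable consistency levels buy you, the sketch below computes the quorum size for a given replication factor and checks whether a write level and a read level are guaranteed to overlap on at least one replica (the usual R + W > N rule). The function names and the replication factor of 3 are assumptions for illustration; the slides do not say which levels or replication factor jKool uses.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that form a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed for a few common levels.

    ANY is left out on purpose: it applies to writes only and can be
    satisfied by a hint, so it does not map to a replica count here.
    """
    level = level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {level}")


def overlaps(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when a read must touch a replica that acknowledged the write."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


rf = 3  # hypothetical replication factor
quorum_pair = overlaps("QUORUM", "QUORUM", rf)  # 2 + 2 > 3
fast_pair = overlaps("ONE", "ONE", rf)          # 1 + 1 > 3 does not hold
```

Writing and reading at ONE favors the insert rate the slides emphasize, at the cost of reads that may briefly miss recent writes.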
![Page 7: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/7.jpg)
How we achieved near real-time analytics?
• Created our own micro-services architecture (FatPipes) which runs on top of:
  • STORM/JMS/Kafka
  • FatPipes can be embedded or distributed
• Real-time Grid
  • Feeds tracking data and real-time queries to CEP and back
• User interacts with Real-time via JKQL (jKool Query Language)
  • English like query language for analyzing data in motion and at rest.
  • “Subscribe” verb for real-time updates

[Figure: Real-time (Real-time.png)]
![Page 8: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/8.jpg)
Why clustered computing platforms?
• STORM paired with Kafka/JMS and CEP
  • Clustered way to process incoming real-time streams
  • STORM handles clustering/distribution
  • Kafka/JMS for messaging between grids
• Split streaming workload across the cluster
  • Achieve linear scalability for incoming real-time streams
• Apache Spark (alternative to MapReduce)
  • For distributing queries and trend analysis
  • Micro batching for historical analytics
  • Loading large datasets into memory (across different nodes)
  • Running queries against large data-sets
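One common way to split a streaming workload across a cluster is to hash a key from each event and use the hash to pick a worker, so events with the same key always land on the same node. The sketch below shows that idea in plain Python; it is similar in spirit to a fields grouping in Storm but is not how FatPipes is known to do it. The function names, the sample keys, and the worker count of 4 are all hypothetical.

```python
import zlib
from collections import defaultdict


def assign_worker(stream_key: str, worker_count: int) -> int:
    """Map a key to a worker index using a stable hash (CRC32)."""
    return zlib.crc32(stream_key.encode("utf-8")) % worker_count


def partition(events, worker_count):
    """Group (key, payload) events by the worker that should process them."""
    buckets = defaultdict(list)
    for key, payload in events:
        buckets[assign_worker(key, worker_count)].append((key, payload))
    return dict(buckets)


# Hypothetical events keyed by the application that emitted them.
events = [("app-a", 1), ("app-b", 2), ("app-a", 3), ("app-c", 4)]
buckets = partition(events, worker_count=4)
```

CRC32 is used instead of Python's built-in `hash()` because the built-in string hash changes between interpreter runs, which would move keys between workers after a restart.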
![Page 9: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/9.jpg)
Key to Real-time Analytics
• Process streams as they come while at the same time avoiding IO
  • Streams are split into real-time queue and persistence queue with eventual consistency (eventually… both real-time and historical must reconcile)
• Both have to be processed in parallel
  • Writing to persistence layer and then analyzing will not achieve near-real time processing
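The split described on this slide can be sketched as a fan-out: every incoming event is placed on two queues, one drained by an in-memory analytics worker and one drained by a persistence worker, and the two run in parallel. This is a single-process illustration using Python threads, not the FatPipes implementation; the worker functions, the `value` field, and the list standing in for the database are assumptions.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to finish


def fan_out(events, realtime_q, persist_q):
    """Put every event on both queues, then signal end of stream."""
    for event in events:
        realtime_q.put(event)
        persist_q.put(event)
    realtime_q.put(STOP)
    persist_q.put(STOP)


def realtime_worker(q, totals):
    """Update running aggregates in memory, with no storage IO."""
    while True:
        event = q.get()
        if event is STOP:
            break
        totals["count"] += 1
        totals["sum"] += event["value"]


def persistence_worker(q, store):
    """Append events to `store`, a stand-in for a database write."""
    while True:
        event = q.get()
        if event is STOP:
            break
        store.append(event)


def run(events):
    realtime_q, persist_q = queue.Queue(), queue.Queue()
    totals, store = {"count": 0, "sum": 0}, []
    workers = [
        threading.Thread(target=realtime_worker, args=(realtime_q, totals)),
        threading.Thread(target=persistence_worker, args=(persist_q, store)),
    ]
    for worker in workers:
        worker.start()
    fan_out(events, realtime_q, persist_q)
    for worker in workers:
        worker.join()
    return totals, store


totals, store = run([{"value": v} for v in range(5)])
# Reconciliation check: the real-time view and the stored history agree.
reconciled = totals["sum"] == sum(e["value"] for e in store)
```

The real-time path never waits on the persistence path, which is the point of the slide; the final comparison mirrors the requirement that the two views eventually reconcile.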
![Page 10: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/10.jpg)
High Level Architecture
![Page 11: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/11.jpg)
Deeper View
[Architecture diagram; labels recovered from the slide:]
• jKool Web Grid: Web Application Server (×3)
• Storage Grid: Cassandra (×4)
• Search Grid: Solr (×4)
• Digest, Index
• Real-time Grid
• JKQL
• FatPipes Micro Services (INGEST)
• Compute Grid
• FatPipes Micro Services (REAL-TIME) (STORM/CEP)
• Distributed Messaging (JMS or Kafka)
![Page 12: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/12.jpg)
Challenges we ran into?
• So many technology options (…so little time…)
  • Deciding on the right combination is key early on
• Cassandra/Solr deployment – (it was a learning experience for us)
  • Lots of configuration, memory management, replication options
• Monitoring, managing clusters
  • Cassandra/Solr, STORM, Zookeeper, Messaging
  • +Leverage parent company’s AutoPilot Technology
• Achieving near real-time analytics proved extremely challenging – but we did it!
  • Keeping track of latencies across cluster
  • Estimating computational capacity required to crunch incoming streams
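A first-pass capacity estimate for incoming streams can be as simple as dividing the expected event rate by the rate one node sustains, after reserving some headroom for spikes, GC pauses, and compaction. The sketch below does only that; the function name, the headroom parameter, and every number in the example are hypothetical and are not measurements from jKool.

```python
import math


def nodes_needed(events_per_sec, events_per_sec_per_node, headroom=0.5):
    """Estimate node count, keeping `headroom` of each node's capacity free.

    headroom=0.5 means a node is planned to run at half its measured rate.
    """
    if events_per_sec_per_node <= 0:
        raise ValueError("per-node throughput must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_rate = events_per_sec_per_node * (1 - headroom)
    return max(1, math.ceil(events_per_sec / usable_rate))


# Hypothetical inputs: 100,000 events/s arriving, 10,000 events/s per node.
estimate = nodes_needed(100_000, 10_000, headroom=0.5)
```

An estimate like this is only a starting point; the per-node rate has to come from load tests on the actual hardware and workload.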
![Page 13: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/13.jpg)
Business Analyst User Interface
It's easy to “visualize your data”
![Page 14: How We Used Cassandra/Solr to Build Real-Time Analytics Platform](https://reader033.fdocuments.in/reader033/viewer/2022042723/58f2f7331a28ab03358b4635/html5/thumbnails/14.jpg)
jKOOL IN REAL-TIME
Real-time Demonstration of jKool’s usage of DSE