The journey to real-time analytics
Ido Friedman
IdoFriedman.yml
Name: Ido Friedman
Past: "SQL Server consultant, Instructor, Team Leader"
Present: "Data engineer and Architect, Elasticsearch, CouchBase, MongoDB, Python", …]
WorkPlace: Perion
WhenNotWorking: @Sea
Agenda
What is real-time analytics
Our use case, goals and insights
What’s next
Real-Time analytics
Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources when they are needed. It consists of dynamic analysis and reporting, based on data entered into a system less than one minute before the actual time of use. Real-time analytics is also known as real-time data analytics, real-time data integration, and real-time intelligence.
Time dimensions/SLAs
Real Time (Msec/Secs)
Near Real Time (Min/Hour)
Batch (Hours/Days)
Analytics
Batch Analytics
Real Time analytics
Stream Analytics
Our goals
Online segmentation
User report dashboard
Segmentation
Single event granularity
Full filtering flexibility, no predefinition
No restriction on time range queries
No data purging
Msec response time
Hundreds to Thousands of requests per minute
So it began
Elasticsearch was selected because
No overhead on indexing fields – it's all indexed
VERY fast filtering and aggregation
Rich aggregation and querying
Relatively easy maintenance of large data sets
Some words on Elasticsearch
Full-text engine gone wild
Highly available Search and analytics
Ultra scalable and easily maintainable
By developers for Developers
https://www.elastic.co/products/elasticsearch
ES Examples
Date histograms
Filters
Aggs
Cardinality
Top
Many more..
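The features listed above can be combined in a single search request. The following is a minimal sketch of such a request body, not the queries used in the talk: the field names (`timestamp`, `country`, `user_id`, `product`) are hypothetical, "Top" is read here as a `terms` aggregation, and the `bool`/`filter` and `calendar_interval` syntax assumes a recent Elasticsearch version (older versions use a `filtered` query and `interval`).

```python
import json


def build_search_body(start, end, country):
    """Build a search body that filters events and aggregates them per day."""
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "query": {
            "bool": {
                "filter": [
                    # Filters: a time range plus an exact-match term
                    {"range": {"timestamp": {"gte": start, "lt": end}}},
                    {"term": {"country": country}},
                ]
            }
        },
        "aggs": {
            "events_per_day": {
                # Date histogram: one bucket per calendar day
                "date_histogram": {"field": "timestamp", "calendar_interval": "day"},
                "aggs": {
                    # Cardinality: approximate distinct users per bucket
                    "unique_users": {"cardinality": {"field": "user_id"}},
                    # Terms: the five most frequent products per bucket
                    "top_products": {"terms": {"field": "product", "size": 5}},
                },
            }
        },
    }


body = build_search_body("2016-01-01", "2016-02-01", "IL")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against an index whose mapping contains these fields.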
POC
Number of indexes and shards was decide…
Index mapping was set
Search patterns, queries and SLA were achieved
Data set was not big enough
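As an illustration of what fixing the shard count and the mapping up front looks like, here is a hedged sketch of a create-index request body. The shard and replica counts and the field names are placeholders, not the values chosen in the POC, and the typeless `mappings` layout with the `keyword` type assumes a recent Elasticsearch version.

```python
import json


def build_index_body(primary_shards, replicas):
    """Return a create-index request body with explicit sharding and mapping."""
    return {
        "settings": {
            # The number of primary shards is fixed when the index is created
            "number_of_shards": primary_shards,
            # The number of replicas can be changed later on a live index
            "number_of_replicas": replicas,
        },
        "mappings": {
            "properties": {
                # Hypothetical event fields
                "timestamp": {"type": "date"},
                "user_id": {"type": "keyword"},
                "country": {"type": "keyword"},
            }
        },
    }


index_body = build_index_body(primary_shards=4, replicas=1)
print(json.dumps(index_body, indent=2))
```

Because the primary shard count cannot simply be edited afterwards, a POC on a data set that is too small can lock in a sharding scheme that does not hold up at production size.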
RE – POC
IN PRODUCTION
POC v2 - Goals
Find the correct sharding / nodes combination
Create a manageable cluster
Achieve repeatable results
Reduce costs
The insights
Sharding
Replication
Nodes
Routing
Cluster management
Doc Values vs Field Data
Master nodes
The insights - Nodes
[Diagram: 1 TB of data versus data nodes holding 250 GB of data each]
Data nodes option 1
Nodes option 1: effect of a single node downtime
50% / 25%
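The arithmetic behind these percentages can be sketched as follows. This assumes data is spread evenly across data nodes and that no replica can take over; reading 50% and 25% as "one of two nodes" and "one of four nodes" is an interpretation of the slide, not something it states.

```python
def data_per_node_gb(total_gb, node_count):
    """Data held by each node when the total is spread evenly."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return total_gb / node_count


def share_affected_by_one_node_down(node_count):
    """Fraction of the data that sits on a single node (evenly spread, no replicas)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return 1.0 / node_count


for nodes in (1, 2, 4):
    per_node = data_per_node_gb(1000, nodes)  # 1 TB taken as 1000 GB
    share = share_affected_by_one_node_down(nodes)
    print(f"{nodes} node(s): {per_node:.0f} GB per node, {share:.0%} affected by one node down")
```

More, smaller nodes reduce the impact of a single failure, at the price of more machines to pay for and manage.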
Data loading
• Analyze your needs and choose your tools to suit
• If you know your data, don't invest in a generic solution
• Check your data load processes and verify their accuracy
Resharding
Will be handled internally in Elasticsearch in future versions
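Why changing the shard count is expensive can be seen from the routing rule documented for Elasticsearch, where the target shard is a hash of the routing value modulo the number of primary shards. The sketch below is a simplified model of that rule: `zlib.crc32` is only a stand-in for the hash Elasticsearch actually uses (Murmur3), and the key names and shard counts are made up.

```python
import zlib


def shard_for(routing_key, primary_shards):
    """Pick a shard as hash(routing) % primary_shards (crc32 is a stand-in hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % primary_shards


def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys that land on a different shard after the count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10000)]  # hypothetical routing values
print("moved when going from 5 to 6 shards:", moved_fraction(keys, 5, 6))
```

Under this model most documents would have to move when a single shard is added, which is why resharding meant reindexing the data into a new index.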
$$$$$
Money is not your enemy
Use costs as the main driver to improve your solution
Use costs as the main metric; it will keep your company running
Issues – not all is perfect
Cardinality aggregation
Performance
Accuracy
Data set size
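The cardinality aggregation returns an approximate distinct count, and its `precision_threshold` option trades memory for accuracy. A minimal sketch of setting it is shown below; the field name and the threshold value are hypothetical and not taken from the talk.

```python
import json


def build_unique_count_agg(field, precision_threshold=None):
    """Build a cardinality aggregation, optionally with an explicit precision threshold."""
    agg = {"cardinality": {"field": field}}
    if precision_threshold is not None:
        # Counts below this threshold are expected to be close to exact;
        # higher values use more memory per aggregation bucket.
        agg["cardinality"]["precision_threshold"] = precision_threshold
    return agg


unique_users = {"aggs": {"unique_users": build_unique_count_agg("user_id", 10000)}}
print(json.dumps(unique_users, indent=2))
```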
Hardware resource balance
Find your real bottleneck
Choose the correct node for your load
Best practices are sometimes too general
We are not happy yet
We need joins – data modeling
Elasticsearch's main issue for us –> data piping
Where do we go next?
Other analytics engines?
Druid
MongoDB
Couchbase
MongoDB Aggregation framework
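For comparison with the Elasticsearch aggregations, a MongoDB aggregation pipeline is a list of stages. The sketch below only builds such a pipeline; the field names (`country`, `day`, `user_id`) are hypothetical, and running it would mean passing the list to `collection.aggregate(pipeline)` in pymongo against a real collection.

```python
import json


def build_daily_users_pipeline(country, limit=10):
    """Build a pipeline that counts events and distinct users per day for one country."""
    return [
        # Keep only events from the requested country
        {"$match": {"country": country}},
        # Group by day, counting events and collecting distinct user ids
        {"$group": {
            "_id": "$day",
            "events": {"$sum": 1},
            "users": {"$addToSet": "$user_id"},
        }},
        # Replace the collected set with its size
        {"$project": {"events": 1, "unique_users": {"$size": "$users"}}},
        # Busiest days first, capped at `limit` rows
        {"$sort": {"events": -1}},
        {"$limit": limit},
    ]


pipeline = build_daily_users_pipeline("IL", limit=5)
print(json.dumps(pipeline, indent=2))
```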
Couchbase - Global Secondary Indexes (GSI)
CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
  WHERE type="product"
  USING GSI WITH {"nodes":"node1:8091"};
CREATE INDEX productName_index2 ON bucket_name(productName, ProductID)
  WHERE type="product"
  USING GSI WITH {"nodes":"node2:8091"};

CREATE INDEX productName_index1 ON bucket_name(productName, ProductID)
  WHERE type="product" AND productName BETWEEN "A" AND "K"
  USING GSI WITH {"nodes":"node1:8091"};
CREATE INDEX productName_index2 ON bucket_name(productName, ProductID)
  WHERE type="product" AND productName BETWEEN "K" AND "Z"
  USING GSI WITH {"nodes":"node2:8091"};
Manual scale out and replication
Druid
Joins in Elasticsearch
http://siren.solutions/relational-joins-for-elasticsearch-the-siren-join-plugin/
$ curl -XGET 'http://localhost:9200/articles/_coordinate_search?pretty' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "filterjoin": {
          "mentions": {
            "indices": ["companies"],
            "path": "id",
            "query": {
              "term": { "name": "orient" }
            }
          }
        }
      }
    }
  }
}'
Summary
No magic solutions
Always understand your data and needs
Invest the time on modeling and optimization