Druid for real-time analysis
Yann Esposito
7 April 2016
Druid the Sales Pitch
Intro
Experience
▶ Real Time Social Media Analytics
Real Time?
▶ Ingestion Latency: seconds
▶ Query Latency: seconds
Demand
▶ Twitter: 20k msg/s, 1 msg = 10 kB, during 24h
▶ Facebook public: 1000 to 2000 msg/s continuously
▶ Low Latency
Reality
▶ Twitter: 400 msg/s continuously, burst to 1500
▶ Facebook: 1000 to 2000 msg/s
Origin (PHP)
1st Refactoring (Node.js)
Lessons Learned
Lessons Learned
2nd Refactoring
2nd Refactoring (FTW!)
2nd Refactoring: Lessons Learned
Demo
Pre Considerations
Discovered vs Invented
Try to conceptualize a system such that:
▶ Ingest Events
▶ Real-Time Queries
▶ Scalable
▶ Highly Available
Analytics: timeseries, alerting system, top N, etc…
In the End
Druid's concepts always emerge naturally
Druid
Who?
Metamarkets
Powered by Druid
▶ Alibaba, Cisco, Criteo, eBay, Hulu, Netflix, PayPal…
Goal
Druid is an open source store designed for real-time exploratory analytics on large data sets.
A hosted dashboard that would allow users to arbitrarily explore and visualize event streams.
Concepts
▶ Column-oriented storage layout
▶ distributed, shared-nothing architecture
▶ advanced indexing structure
Key Features
▶ Sub-second OLAP Queries
▶ Real-time Streaming Ingestion
▶ Power Analytic Applications
▶ Cost Effective
▶ Highly Available
▶ Scalable
Right for me?
▶ require fast aggregations
▶ exploratory analytics
▶ analysis in real-time
▶ lots of data (trillions of events, petabytes of data)
▶ no single point of failure
High Level Architecture
Inspiration
▶ Google’s BigQuery/Dremel
▶ Google’s PowerDrill
Index / Immutability
Druid indexes data to create mostly immutable views.
Storage
Store data in a custom column format highly optimized for aggregation & filtering.
Specialized Nodes
▶ A Druid cluster is composed of various types of nodes
▶ Each designed to do a small set of things very well
▶ Nodes don’t need to be deployed on individual hardware
▶ Many node types can be colocated in production
Druid vs X
Elasticsearch
▶ resource requirement much higher for ingestion & aggregation
▶ No data summarization (which can shrink real-world data by up to 100x)
Key/Value Stores (HBase/Cassandra/OpenTSDB)
▶ Must Pre-compute Result
  ▶ Exponential storage
  ▶ Hours of pre-processing time
▶ Use the dimensions as key (like in OpenTSDB)
  ▶ No filter index other than range
  ▶ Hard for complex predicates
Spark
▶ Druid can be used to accelerate OLAP queries in Spark
▶ Druid focuses on the latencies to ingest and serve queries
▶ Spark: too slow for end users to arbitrarily explore data
SQL-on-Hadoop (Impala/Drill/SparkSQL/Presto)
▶ Queries: more data transfer between nodes
▶ Data Ingestion: bottlenecked by the backing store
▶ Query Flexibility: more flexible (full joins)
Data
Concepts
▶ Timestamp column: query centered on time axis
▶ Dimension columns: strings (used to filter or to group)
▶ Metric columns: used for aggregations (count, sum, mean, etc…)
Indexing
▶ Immutable snapshots of data
▶ data structure highly optimized for analytic queries
▶ Each column is stored separately
▶ Indexes data on a per shard (segment) level
Loading
▶ Real-Time
▶ Batch
Querying
▶ JSON over HTTP
▶ Single Table Operations, no joins.
Segments
▶ Per time interval
  ▶ skip segments when querying
▶ Immutable
  ▶ Cache friendly
  ▶ No locking
▶ Versioned
  ▶ No locking
  ▶ Read-write concurrency
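The "skip segments when querying" point can be sketched as a plain interval-overlap test. This is a minimal illustration, not Druid's implementation: the `Segment` tuple, the segment names and the helper functions are all hypothetical, and intervals are assumed to be half-open.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Segment(NamedTuple):
    """Hypothetical in-memory stand-in for a segment's metadata."""
    name: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def overlaps(seg: Segment, q_start: datetime, q_end: datetime) -> bool:
    # Two half-open intervals [a, b) and [c, d) overlap iff a < d and c < b.
    return seg.start < q_end and q_start < seg.end


def segments_to_scan(segments, q_start, q_end):
    """Keep only the segments whose interval intersects the query interval."""
    return [s for s in segments if overlaps(s, q_start, q_end)]


def utc(hour: int) -> datetime:
    return datetime(2011, 1, 1, hour, tzinfo=timezone.utc)


segments = [
    Segment("00h-01h", utc(0), utc(1)),
    Segment("01h-02h", utc(1), utc(2)),
    Segment("02h-03h", utc(2), utc(3)),
]

# A query over [01:00, 02:00) only needs the middle segment.
selected = segments_to_scan(segments, utc(1), utc(2))
print([s.name for s in selected])
```

Because every segment covers one time interval, a time-bounded query can discard most segments before reading any column data.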
Roll-up
Example
timestamp             page     ...  added  deleted
2011-01-01T00:01:35Z  Cthulhu          10       65
2011-01-01T00:03:63Z  Cthulhu          15       62
2011-01-01T01:04:51Z  Cthulhu          32       45
2011-01-01T01:01:00Z  Azatoth          17       87
2011-01-01T01:02:00Z  Azatoth          43       99
2011-01-01T02:03:00Z  Azatoth          12       53

timestamp             page     ...  nb  added  deleted
2011-01-01T00:00:00Z  Cthulhu        2     25      127
2011-01-01T01:00:00Z  Cthulhu        1     32       45
2011-01-01T01:00:00Z  Azatoth        2     60      186
2011-01-01T02:00:00Z  Azatoth        1     12       53
as SQL
GROUP BY timestamp, page :: nb = COUNT(1), added = SUM(added), deleted = SUM(deleted)
In practice, roll-up can dramatically reduce the size (up to 100x)
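The roll-up can be reproduced in a few lines of Python. This is a sketch of the idea only, assuming hourly granularity as in the example; the function names are made up for illustration, and the rows are copied from the example table as they appear on the slide.

```python
from collections import defaultdict

# Raw events copied verbatim from the example: (timestamp, page, added, deleted).
events = [
    ("2011-01-01T00:01:35Z", "Cthulhu", 10, 65),
    ("2011-01-01T00:03:63Z", "Cthulhu", 15, 62),
    ("2011-01-01T01:04:51Z", "Cthulhu", 32, 45),
    ("2011-01-01T01:01:00Z", "Azatoth", 17, 87),
    ("2011-01-01T01:02:00Z", "Azatoth", 43, 99),
    ("2011-01-01T02:03:00Z", "Azatoth", 12, 53),
]


def truncate_to_hour(ts: str) -> str:
    # Keeps the "YYYY-MM-DDTHH" prefix and zeroes minutes and seconds.
    return ts[:13] + ":00:00Z"


def roll_up(rows):
    """Group by (hour, page); keep nb = COUNT, added = SUM, deleted = SUM."""
    acc = defaultdict(lambda: [0, 0, 0])
    for ts, page, added, deleted in rows:
        bucket = acc[(truncate_to_hour(ts), page)]
        bucket[0] += 1
        bucket[1] += added
        bucket[2] += deleted
    return {key: tuple(values) for key, values in acc.items()}


rolled = roll_up(events)
for (ts, page), (nb, added, deleted) in sorted(rolled.items()):
    print(ts, page, nb, added, deleted)
```

Six raw events become four stored rows here; the reduction grows with the number of events that share the same time bucket and dimension values.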
Segments
Sharding
sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0
timestamp             page     ...  nb  added  deleted
2011-01-01T01:00:00Z  Cthulhu        1     20       45
2011-01-01T01:00:00Z  Azatoth        1     30      106

sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_1
timestamp             page     ...  nb  added  deleted
2011-01-01T01:00:00Z  Cthulhu        1     12       45
2011-01-01T01:00:00Z  Azatoth        2     30       80
Core Data Structure
▶ dictionary
▶ a bitmap for each value
▶ a list of the column's values encoded using the dictionary
Example
dictionary: { "Cthulhu": 0, "Azatoth": 1 }
column data: [0, 0, 1, 1]
bitmaps (one for each value of the column):
  value="Cthulhu": [1,1,0,0]
  value="Azatoth": [0,0,1,1]
Example (multiple matches)
dictionary: { "Cthulhu": 0, "Azatoth": 1 }
column data: [0, [0,1], 1, 1]
bitmaps (one for each value of the column):
  value="Cthulhu": [1,1,0,0]
  value="Azatoth": [0,1,1,1]
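The three structures can be built from a column with a short function. This is a sketch under simplifying assumptions: plain Python lists stand in for the compressed bitmaps a real store would use, and `build_index` and `bitmap_and` are hypothetical names.

```python
def build_index(column):
    """Dictionary-encode a string column and build one bitmap per value.

    Each entry of `column` is a single string or a list of strings
    (a multi-value entry, as in the second example).
    """
    dictionary = {}
    encoded = []
    for entry in column:
        values = entry if isinstance(entry, list) else [entry]
        # A new value gets the next free id; a known value keeps its id.
        ids = [dictionary.setdefault(v, len(dictionary)) for v in values]
        encoded.append(ids if isinstance(entry, list) else ids[0])

    bitmaps = {value: [0] * len(column) for value in dictionary}
    for row, entry in enumerate(column):
        for value in (entry if isinstance(entry, list) else [entry]):
            bitmaps[value][row] = 1
    return dictionary, encoded, bitmaps


def bitmap_and(a, b):
    """Rows matching both filters: bitwise AND of the two bitmaps."""
    return [x & y for x, y in zip(a, b)]


column = ["Cthulhu", ["Cthulhu", "Azatoth"], "Azatoth", "Azatoth"]
dictionary, encoded, bitmaps = build_index(column)
print(dictionary)
print(encoded)
print(bitmaps)
print(bitmap_and(bitmaps["Cthulhu"], bitmaps["Azatoth"]))
```

A filter such as page = "Cthulhu" AND page = "Azatoth" then reduces to a bitwise operation on two bitmaps, without touching the column data itself.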
Real-time ingestion
▶ Via Real-Time Node and Firehose
  ▶ No redundancy or HA, thus not recommended
▶ Via Indexing Service and Tranquility API
  ▶ Core API
  ▶ Integration with Streaming Frameworks
  ▶ HTTP Server
  ▶ Kafka Consumer
Batch Ingestion
▶ File based (HDFS, S3, …)
Real-time Ingestion
Task 1: [   Interval   ][ Window ]
Task 2:                 [                ]
----------------------------------------------------->
                                                  time
Querying
Query types
▶ Group by: group by multiple dimensions
▶ Top N: like grouping by a single dimension
▶ Timeseries: without grouping over dimensions
▶ Search: Dimensions lookup
▶ Time Boundary: Find available data timeframe
▶ Metadata queries
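The difference between the first three query types can be illustrated on a handful of in-memory rows. This is a conceptual sketch only: the rows, the `country` dimension and the function names are invented for the illustration, and it says nothing about how Druid executes these queries.

```python
from collections import defaultdict

# Hypothetical rolled-up rows: (hour, page, country, added).
rows = [
    ("00h", "Cthulhu", "FR", 25),
    ("01h", "Cthulhu", "US", 32),
    ("01h", "Azatoth", "FR", 60),
    ("02h", "Azatoth", "US", 12),
]
FIELDS = {"hour": 0, "page": 1, "country": 2}


def group_by(rows, dimensions):
    """Sum `added` for every distinct combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[FIELDS[d]] for d in dimensions)
        totals[key] += row[3]
    return dict(totals)


def top_n(rows, dimension, n):
    """Group by a single dimension, then keep the n largest totals."""
    totals = group_by(rows, [dimension])
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(key[0], total) for key, total in ranked[:n]]


def timeseries(rows):
    """Aggregate per time bucket only, with no dimension in the key."""
    return {key[0]: total for key, total in group_by(rows, ["hour"]).items()}


print(group_by(rows, ["page", "country"]))
print(top_n(rows, "page", 1))
print(timeseries(rows))
```

The more dimensions a query groups on, the more result rows it has to keep and merge, which is why the narrower query types exist alongside the general group by.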
Example(s)
{
  "queryType": "groupBy",
  "dataSource": "druidtest",
  "granularity": "all",
  "dimensions": [],
  "aggregations": [
    {"type": "count", "name": "rows"},
    {"type": "longSum", "name": "imps", "fieldName": "impressions"},
    {"type": "doubleSum", "name": "wp", "fieldName": "wp"}
  ],
  "intervals": ["2010-01-01T00:00/2020-01-01T00"]
}
Result
[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "imps" : 5,
    "wp" : 15000.0,
    "rows" : 5
  }
} ]
Caching
▶ Historical node level
  ▶ By segment
▶ Broker Level
  ▶ By segment and query
  ▶ groupBy is disabled on purpose!
▶ By default: local caching
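Caching "by segment and query" can be pictured as a dictionary keyed on both. This is a minimal sketch, not Druid's cache: `SegmentResultCache`, `scan_segment` and the segment names are hypothetical, and the partial result is a placeholder.

```python
import json


class SegmentResultCache:
    """Hypothetical local cache keyed by (segment id, query)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(segment_id: str, query: dict):
        # Serialise with sorted keys so equal queries give equal cache keys.
        return (segment_id, json.dumps(query, sort_keys=True))

    def get_or_compute(self, segment_id, query, compute):
        key = self._key(segment_id, query)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(segment_id, query)
        return self._store[key]


def scan_segment(segment_id, query):
    # Stand-in for a real per-segment scan; returns a fake partial result.
    return {"segment": segment_id, "rows": 1}


cache = SegmentResultCache()
q = {"queryType": "timeseries", "dataSource": "druidtest"}
for _ in range(2):
    for seg in ("seg_a", "seg_b"):
        cache.get_or_compute(seg, q, scan_segment)
print(cache.hits, cache.misses)
```

Because segments are immutable and versioned, a cached per-segment result never goes stale: new data arrives as a new segment with a new identifier, so no invalidation logic is needed in this sketch.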
Druid Components
Druid
▶ Real-time Nodes
▶ Historical Nodes
▶ Broker Nodes
▶ Coordinator
▶ For indexing:
  ▶ Overlord
  ▶ Middle Manager
Also
▶ Deep Storage (S3, HDFS, …)
▶ Metadata Storage (SQL)
▶ Load Balancer
▶ Cache
Coordinator
▶ Real-time Nodes (pull data, index it)
▶ Historical Nodes (keep old segments)
▶ Broker Nodes (route queries to RT & Hist. nodes, merge)
▶ Coordinator (manage segments)
▶ For indexing:
  ▶ Overlord (distribute tasks to the Middle Managers)
  ▶ Middle Manager (execute tasks via Peons)
When not to choose Druid
Graphite (metrics)
Pivot (exploring data)
Caravel
Conclusions
Precompute your time series?
Don’t reinvent it
▶ need a user facing API
▶ need time series on many dimensions
▶ need real-time
▶ big volume of data
The Druid way is the right way!
1. Push to Kafka
2. Add the right dimensions
3. Push to Druid
4. ???
5. Profit!