Scalable Online Analytics for Monitoring
![Page 1: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/1.jpg)
Scalable Online Analytics for Monitoring
LISA15, Nov. 13, 2015, Washington, D.C.
Heinrich Hartmann, PhD, Chief Data Scientist, Circonus
![Page 2: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/2.jpg)
I’m Heinrich
· From Mainz, Germany
· Studied Math. in Mainz, Bonn, Oxford
· PhD in Algebraic Geometry
· Sensor Data Mining at Univ. Koblenz
· Freelance IT consultant
· Chief Data Scientist at Circonus
@HeinrichHartman(n)
![Page 3: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/3.jpg)
Circonus is ...
· monitoring and telemetry analysis platform
· scalable to millions of incoming metrics
· available as public and private SaaS
· built-in histograms, forecasting, anomaly-detection, ...
![Page 4: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/4.jpg)
This talk is about...
I. The Future of Monitoring
II. Patterns of Telemetry Analysis
III. Design of the Online Analytics Engine ‘Beaker’
![Page 5: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/5.jpg)
Part I - The Future of Monitoring
![Page 6: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/6.jpg)
Monitor this:
· 100 containers in one fleet
· 200 metrics per container
· 50 code pushes a day
· For each push all containers are recreated
The “cloud monitoring challenge” 2015
![Page 7: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/7.jpg)
Line-plots work well for small numbers of nodes
[Figure: line plot of CPU utilization for a DB cluster, one line per node (Node 1 to Node 6). Source: Internal.]
![Page 8: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/8.jpg)
... but can get polluted easily
[Figure: the same kind of line plot with many more nodes (N). CPU utilization for a DB cluster. Source: Internal.]
![Page 9: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/9.jpg)
“Information is not a scarce resource. Attention is.”
Herbert A. Simon
Source: http://www.circonus.com/the-future-of-monitoring-qa-with-john-allspaw/
![Page 10: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/10.jpg)
Store Histograms and Alert on Percentiles
[Figure: CPU utilization of a DB service with a variable number of nodes (N), with the 90th percentile marked.]
![Page 11: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/11.jpg)
Anomaly Detection for Surfacing Relevant Data
Mockup of a metric overview page. Source: Circonus
![Page 12: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/12.jpg)
Part II - Patterns of Telemetry Analysis
![Page 13: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/13.jpg)
Telemetry analysis comes in two forms
Online Analytics (stream processing)
· Anomaly detection
· Percentile alerting
· Smart alerting rules
· Smart dashboards
Offline Analytics (‘Big Data’)
· Post mortem analysis
· Assisted thresholding
![Page 14: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/14.jpg)
New Components in Circonus
· Bunsen
· Beaker
![Page 15: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/15.jpg)
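Beaker & Bunsen in the Circonus Architecture
[Figure: architecture diagram with the following components and labels]
· Ernie: Alerting Service
· Snowth: Metric Store
· Beaker: Online AE
· Bunsen: Offline AE
· Web-UI
· Q (message queue)
· Data flows labeled: metrics, alerts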
![Page 16: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/16.jpg)
Pattern 1: Windowing
Examples
· Supervised Machine Learning
· Fourier Transformation
· Anomaly Detection (etsy)
Remarks
· Tradeoff in window size: rich features vs. memory
· Tradeoff in overlap: latency vs. CPU
windows = window(y_stream)
def results():
    for w in windows:
        yield method(w)
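The windowing pattern can be sketched in runnable Python. This is a minimal illustration, not code from the talk: `sliding_windows`, its `size` and `step` parameters, and the choice of a mean as the per-window `method` are assumptions made for brevity.

```python
from collections import deque
from statistics import mean


def sliding_windows(stream, size, step=1):
    """Yield tuples of `size` consecutive samples, advancing `step` samples each time."""
    buf = deque(maxlen=size)   # memory use grows with the window size
    since_last = 0             # samples seen since the last emitted window
    for y in stream:
        buf.append(y)
        since_last += 1
        if len(buf) == size and since_last >= step:
            since_last = 0
            yield tuple(buf)


def results(stream, method, size, step=1):
    """Apply `method` to every window of the stream."""
    for w in sliding_windows(stream, size, step):
        yield method(w)


smoothed = list(results([1, 2, 3, 4, 5, 6], mean, size=3, step=1))
```

A smaller `step` means more overlap: results arrive after fewer new samples, at the cost of calling `method` more often. With `step` equal to `size` the windows do not overlap at all.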
![Page 17: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/17.jpg)
Pattern 3: Processing Units
processing_unit = {
state = ...
update = function(self, y) ... end
}
Example
· Exponential Smoothing
· Holt-Winters forecasting
· Anomaly Detection (Circonus)
Remarks
· Fast updates
· Fully general
· Cost: maintains state
Circonus Anomaly Detection
![Page 18: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/18.jpg)
A Processing Unit for Exponential Smoothing
local q = 0.9
exponential_smoothing = {
  s = 0,
  update = function(self, y)
    self.s = (y * q) + (self.s * (1 - q))
    return self.s
  end
}
Exponential smoothing applied to a dns-duration metric
More examples on http://heinrichhartmann.com/.../Generative-Models-for-Time-Series
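For readers who prefer to experiment in Python, the same unit might look as follows. This is a sketch that mirrors the Lua code above; the class name `ExponentialSmoothing` and the sample values are assumptions for illustration only.

```python
class ExponentialSmoothing:
    """Processing unit: keeps one number of state and updates it per sample."""

    def __init__(self, q=0.9):
        self.q = q      # weight of the newest sample, as in the Lua code
        self.s = 0.0    # state: current smoothed value

    def update(self, y):
        self.s = y * self.q + self.s * (1 - self.q)
        return self.s


pu = ExponentialSmoothing(q=0.5)
outputs = [pu.update(y) for y in [8.0, 8.0, 0.0]]
```

Each update costs one multiply-add and needs no access to past samples, which is what makes this kind of unit cheap to run on a stream.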
![Page 19: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/19.jpg)
Processing units are convenient
· Primitive transformations are readily implemented: arithmetic, smoothing, forecasting, ...
· Fully general. Allows window-based processing as well
· Composable. Compose several PUs to get a new one!
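Composition can be illustrated with a small Python sketch. The unit names `Diff`, `Scale` and `Pipe` are hypothetical and not taken from Beaker; the point is only that a chain of units again exposes `update`, so it is itself a processing unit.

```python
class Scale:
    """Stateless unit: multiply each sample by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def update(self, y):
        return y * self.factor


class Diff:
    """Stateful unit: difference to the previous sample (0 for the first one)."""

    def __init__(self):
        self.prev = None

    def update(self, y):
        d = 0 if self.prev is None else y - self.prev
        self.prev = y
        return d


class Pipe:
    """Compose units: the output of each unit is the input of the next."""

    def __init__(self, *units):
        self.units = units

    def update(self, y):
        for unit in self.units:
            y = unit.update(y)
        return y


pipeline = Pipe(Diff(), Scale(10))
out = [pipeline.update(y) for y in [1, 3, 6]]
```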
![Page 20: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/20.jpg)
Circonus Analytics Query Language
· Create your own customized processing unit from primitives
· UNIX-inspired syntax with pipes ‘|’
· Native support for histogram metrics
![Page 21: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/21.jpg)
CAQL: Example 1 - Low frequency AD
Pre-process a metric before feeding it into anomaly detection
metric:average(<>) | rolling:mean(30m) | anomaly_detection()
![Page 22: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/22.jpg)
CAQL: Example 2 - Histogram Aggregation
Histograms are first-class citizens
metric:average(<uuid>) | window:histogram(1h) | histogram:percentile(95)
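What such a pipeline computes can be approximated in plain Python. This is an analogy only, not CAQL semantics: the functions `window_histograms` and `percentile`, the fixed `bin_width`, and the rule of reporting the lower edge of a bin are all assumptions made for this sketch.

```python
from collections import Counter
import math


def window_histograms(samples, window, bin_width):
    """Group samples into consecutive windows and count them per bin."""
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        yield Counter(math.floor(y / bin_width) * bin_width for y in chunk)


def percentile(hist, p):
    """Lower edge of the first bin at which the cumulative count reaches p percent."""
    total = sum(hist.values())
    needed = total * p / 100
    seen = 0
    for edge in sorted(hist):
        seen += hist[edge]
        if seen >= needed:
            return edge
    raise ValueError("empty histogram")


samples = list(range(1, 101))
p95 = [percentile(h, 95) for h in window_histograms(samples, window=100, bin_width=10)]
```

Because only bin counts are kept, the percentile is accurate to the bin width, and the memory per window does not depend on the number of samples.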
![Page 23: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/23.jpg)
Part III - The Design of the Online Analytics Engine ‘Beaker’
![Page 24: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/24.jpg)
Beaker v. 0.1: Simple stream processing
1. Read messages from a message queue
2. Execute processing units over incoming metrics
3. Publish computed values to a message queue
[Figure: Beaker: Basic Data Flow. input Q, Beaker Service, output Q]
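The three steps can be sketched with Python's standard `queue` module standing in for the message queue. This is not Beaker code: the function `process_available`, the `(metric, value)` message shape and the `units` mapping are hypothetical.

```python
import queue


def process_available(input_q, output_q, units):
    """Read (metric, value) messages, run the matching unit, publish the result."""
    while True:
        try:
            metric, value = input_q.get_nowait()
        except queue.Empty:
            return  # a long-running service would block and wait instead
        update = units.get(metric)
        if update is not None:          # ignore metrics nobody subscribed to
            output_q.put((metric, update(value)))


input_q, output_q = queue.Queue(), queue.Queue()
for message in [("cpu", 1), ("mem", 5), ("cpu", 2)]:
    input_q.put(message)
process_available(input_q, output_q, {"cpu": lambda v: v * 2})
published = [output_q.get_nowait() for _ in range(output_q.qsize())]
```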
![Page 25: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/25.jpg)
Challenge 1: Rollup metrics by the minute
· Metrics arrive asynchronously on the input queue
· PUs expect exactly one sample per minute
· Rollup logic needs to allow for
  · late arrival, e.g. when the broker is behind
  · out-of-order arrival
  · errors in system time (time zones, drifts)
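One way to tolerate late and out-of-order arrival is to keep a minute open for a grace period after it ends. The sketch below is an assumption about how such logic could look, not Beaker's implementation; `MinuteRollup`, `grace` and the use of the mean as rollup value are hypothetical, and clock errors are not handled.

```python
from collections import defaultdict


class MinuteRollup:
    """Average samples per minute; emit a minute once it is `grace` seconds old."""

    def __init__(self, grace=30):
        self.grace = grace
        self.buckets = defaultdict(list)   # minute start (seconds) -> samples
        self.emitted_up_to = None          # start of the last emitted minute

    def add(self, timestamp, value):
        minute = int(timestamp // 60) * 60
        if self.emitted_up_to is not None and minute <= self.emitted_up_to:
            return False                    # too late: that minute was already emitted
        self.buckets[minute].append(value)  # arrival order does not matter
        return True

    def flush(self, now):
        """Return [(minute, mean)] for minutes that ended at least `grace` seconds ago."""
        ready = sorted(m for m in self.buckets if m + 60 + self.grace <= now)
        out = []
        for m in ready:
            values = self.buckets.pop(m)
            out.append((m, sum(values) / len(values)))
            self.emitted_up_to = m
        return out


rollup = MinuteRollup(grace=30)
rollup.add(65, 4.0)
rollup.add(61, 2.0)     # arrives out of order, same minute
rollup.add(125, 10.0)
early = rollup.flush(now=140)
on_time = rollup.flush(now=150)
```

Minutes without any samples produce no output here; a unit that expects exactly one sample per minute would additionally need a gap-filling rule.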
![Page 26: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/26.jpg)
Consequence: No real-time processing in Beaker
· Rolling up data in 1-minute periods delays data processing by 30 seconds on average.
· Real-time threshold-based alerting still available.
· Approach: Avoid rolling up data for stateless PUs.
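The 30-second figure follows from a short calculation, sketched here under the assumption (not stated on the slide) that samples arrive uniformly at random within each 60-second rollup period and are released when the period closes. A sample arriving at offset $t$ then waits $60 - t$ seconds:

```latex
\mathbb{E}[\text{delay}] = \frac{1}{60}\int_{0}^{60} (60 - t)\,\mathrm{d}t = \frac{60}{2} = 30\ \text{s}
```

Any extra waiting time granted to late arrivals adds to this average.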
![Page 27: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/27.jpg)
Challenge 2: Multiple input slots
[Figure: Beaker: Logical Data Flow. Five input metrics feed three processing units, which produce three output metrics.]
![Page 28: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/28.jpg)
Challenge 3: Synchronizing roll-ups for multiple inputs is tricky
![Page 29: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/29.jpg)
Challenge 4: Fault Tolerance
Definition (Birman): A service is fault tolerant if it tolerates the failure of nodes without producing wrong output messages.
Failed nodes must be able to recover from a crash and rejoin the cluster.
The time to recovery should be as low as possible.
Source: K. Birman - Reliable Distributed Systems, Springer, 2005
![Page 30: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/30.jpg)
Beaker v0.5:
· Automated restarts
  a. Service Management Facility (svcadm) in OmniOS
  b. systemd or watchdogs in Linux
· But recovery can take a long time: state has to be rebuilt from the input stream.
· Need a way to recover faster from errors:
  a. Persist processing unit state (software updates!)
  b. Access persisted metric data
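Option (a) can be sketched as a snapshot and restore pair on the processing unit. Everything here is an assumption for illustration: the class `SmoothingUnit`, JSON as the storage format, and the `version` field, which is one possible way to keep snapshots readable across software updates.

```python
import json


class SmoothingUnit:
    """Exponential smoothing unit whose state can be saved and restored."""

    def __init__(self, q=0.9, s=0.0):
        self.q, self.s = q, s

    def update(self, y):
        self.s = y * self.q + self.s * (1 - self.q)
        return self.s

    def snapshot(self):
        """Serialize the full state, tagged with a format version."""
        return json.dumps({"version": 1, "q": self.q, "s": self.s})

    @classmethod
    def restore(cls, blob):
        data = json.loads(blob)
        if data.get("version") != 1:
            raise ValueError("unknown snapshot version")
        return cls(q=data["q"], s=data["s"])


original = SmoothingUnit(q=0.5)
original.update(8.0)
recovered = SmoothingUnit.restore(original.snapshot())
```

After a restart, `recovered` continues exactly where `original` left off instead of warming up again from the input stream.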
![Page 31: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/31.jpg)
Beaker v. 0.5: Use db to rebuild state on startup
[Figure: Beaker Service between input Q and output Q, connected to the Snowth db]
![Page 32: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/32.jpg)
Challenge 5: High Availability
Definition (Birman): A service is highly available if it continues to publish valid messages during a node failure, after a small reconfiguration period.
For Beaker we require a reconfiguration period of less than 1 minute. In this time messages may be delayed (e.g. 30sec) and duplicated messages may be published.
Source: K. Birman - Reliable Distributed Systems, Springer, 2005
![Page 33: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/33.jpg)
Beaker v. 0.5 -- HA Cluster
[Figure: Beaker HA Cluster. One Beaker Master and two Beaker Slaves between input Q and output Q, with the Snowth db attached. On failover: master election.]
![Page 34: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/34.jpg)
Challenge 6: Scalability
Beaker needs to scale in the following dimensions:
· The number of processing units, with no fixed upper limit.
· The number of incoming metrics, up to ~100M.
![Page 35: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/35.jpg)
Beaker v. 0.6 -- Multiple HA Clusters
[Figure: several Beaker HA clusters with a Meta Service, two MsgBrokers and the Snowth db]
· Beaker HA Cluster I (5 slaves): BI-M, BI-S1 ... BI-S5
· Beaker HA Cluster II (3 slaves): BII-M, BII-S1 ... BII-S3
· Beaker HA Cluster III (1 slave): BIII-M, BIII-S1, BIII-S2
· ...
![Page 36: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/36.jpg)
Done! Great. This works...
![Page 37: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/37.jpg)
Can we do better?
![Page 38: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/38.jpg)
Beaker v. 1.0 -- Divide and Conquer
· Divide Beaker into multiple services
· Avoid master election in the processing service by allowing duplicates
· Upside: Only the processing service needs to scale out
[Figure: the Beaker Service between input Q and output Q, divided into Rollup Service, Processing Service and Dedup Service]
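If several workers may publish the same result, a downstream step has to drop the copies. The following sketch shows one possible approach, assuming each output message can be identified by its metric name and an integer minute index; `Deduplicator` and `horizon` are hypothetical names, not part of Beaker.

```python
class Deduplicator:
    """Forward the first message per (metric, minute); drop later copies."""

    def __init__(self, horizon=10):
        self.horizon = horizon   # how many recent minutes to remember per metric
        self.seen = {}           # metric -> set of minutes already forwarded

    def accept(self, metric, minute):
        minutes = self.seen.setdefault(metric, set())
        if minute in minutes:
            return False
        minutes.add(minute)
        # Bound memory: forget minutes that fell out of the horizon.
        cutoff = max(minutes) - self.horizon
        self.seen[metric] = {m for m in minutes if m > cutoff}
        return True


dedup = Deduplicator()
messages = [("cpu", 1), ("cpu", 1), ("mem", 1), ("cpu", 2)]
forwarded = [m for m in messages if dedup.accept(*m)]
```

The bounded memory has a price: a duplicate that arrives more than `horizon` minutes late is no longer recognized and would be forwarded again.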
![Page 39: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/39.jpg)
Scaling the Processing Service is simplified
· No rollup logic in workers
· All workers publish messages
· No failover logic necessary
· Replicate each PU on multiple workers
· No crosstalk between nodes (the crosstalk coefficient is 0 in the USL, the Universal Scalability Law)
[Figure: Processing Service: Data Flow. Several Workers between input Q and output Q, with the Meta Service and the snowth db]
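For reference, the Universal Scalability Law expresses the relative capacity $C(N)$ of a system with $N$ nodes as shown below. The symbols follow Gunther's common notation, with $\sigma$ for contention and $\kappa$ for crosstalk (coherency); this choice is an assumption, since the symbol used on the slide was lost in extraction.

```latex
C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}
\qquad\xrightarrow{\;\kappa = 0\;}\qquad
C(N) = \frac{N}{1 + \sigma (N - 1)}
```

With $\kappa = 0$ the quadratic term disappears, so capacity keeps growing as workers are added instead of peaking and then declining.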
![Page 40: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/40.jpg)
Benchmark results on prototype
· PU type: anomaly_detection, PU count: 100, throughput: 15 kHz
· PU type: anomaly_detection, PU count: 10k, throughput: 4.2 kHz
Machine: 6-core Xeon E5 2630 v2 at 2.6 GHz
![Page 41: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/41.jpg)
Conclusion
· Stateful processing units allow implementation of the next generation of monitoring analytics
· Use CAQL to build your own processing units
· Service orientation facilitates scaling
· Beaker will be out soon
![Page 42: Scalable Online Analytics for Monitoring](https://reader034.fdocuments.in/reader034/viewer/2022050613/58ecff261a28ab785c8b4587/html5/thumbnails/42.jpg)
Credits
Joint work with:
· Jonas Kunze
· Theo Schlossnagle
Image Credits:
Example of Kummer Surface, Claudio Rocchini, CC-BY-SA, https://en.wikipedia.org/wiki/File:Kummer_surface.png
Indian truck, by strudelt, CC-BY, https://commons.wikimedia.org/wiki/File:Truck_in_India_-_overloaded.jpg
Kafka, Public Domain, https://en.wikipedia.org/wiki/File:Kafka_portrait.jpg
Thinker, Public Domain, https://www.flickr.com/photos/mustangjoe/5966894496