f'•IH•''-'''''' THE COLLECTORS' DIGEST t Digest/1956-07... · 2016. 7. 14. · 11 Jhe
T digest-update
-
Upload
ted-dunning -
Category
Data & Analytics
-
view
448 -
download
1
Transcript of T digest-update
![Page 1: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/1.jpg)
© 2017 MapR Technologies 1
Update on t-digest
![Page 2: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/2.jpg)
© 2017 MapR Technologies 2
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email [email protected] [email protected]
Twitter @ted_dunning
![Page 3: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/3.jpg)
© 2017 MapR Technologies 3
Who We Are
• MapR echnologies
– We make a kick-ass platform for big data computing
– Support many workloads including Hadoop / Spark / HPC / Other
– Extended to allow streams and tables in basic platform
– Free for academic research / training
• Apache Software Foundation
– Culture hub for building open source communities
– Shared values around openness for contribution as well as use
– Many major projects are part of Apache
– Even more minor ones!
![Page 4: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/4.jpg)
© 2017 MapR Technologies 4
Basic Outline
• Why we should measure distributions
• Basic Ideas
• How t-digest works
• Recent results
• Applications
![Page 5: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/5.jpg)
© 2017 MapR Technologies 5
Why Is This Practically Important
• The novice came to the master and says “something is broken”
![Page 6: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/6.jpg)
© 2017 MapR Technologies 6
Why Is This Practically Important
• The novice came to the master and says “something is broken”
• The master replied “What has changed?”
![Page 7: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/7.jpg)
© 2017 MapR Technologies 7
Why Is This Practically Important
• The novice came to the master and says “something is broken”
• The master replied “What has changed?”
• And the student was enlightened
![Page 8: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/8.jpg)
© 2017 MapR Technologies 8
Finding change is key
but what kind?
![Page 9: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/9.jpg)
© 2017 MapR Technologies 9
Last Night’s Latencies
• These are ping latencies from my hotel
• Looks pretty good, right?
• But what about longer term?
208.302198.571185.099191.258201.392214.738197.389187.749201.693186.762185.296186.390183.960188.060190.763
> mean(y$t[i])[1] 198.6047> sd(y$t[i])[1] 71.43965
![Page 10: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/10.jpg)
© 2017 MapR Technologies 10
Not So Fast …
![Page 11: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/11.jpg)
© 2017 MapR Technologies 11
This is long-tailed land
![Page 12: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/12.jpg)
© 2017 MapR Technologies 12
This is long-tailed land
You have to know the distribution
of values
![Page 13: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/13.jpg)
© 2017 MapR Technologies 13
![Page 14: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/14.jpg)
© 2017 MapR Technologies 14
A single number
is simply not enough
![Page 15: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/15.jpg)
© 2017 MapR Technologies 15
What We Really Need Here
• I want to be able to compute the distribution from any time
period
• From any subset of measurements
• With lots of keys and filters
• And not a lot of space
• Basically, any OLAP kind of query
select distribution(x) from … where … group by y,z
![Page 16: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/16.jpg)
© 2017 MapR Technologies 16
Idea 0 – Pre-defined bins
• So let’s assume we have bins– Upper, lower bound, constant width
• Get a measurement, pick a bin, increment count
• Works great if you know the data– And you have limited dynamic range (too many bins)
– And the distribution is fixed
• Useful, but not general enough
![Page 17: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/17.jpg)
© 2017 MapR Technologies 17
Idea 1 – Exponential Bins
• Suppose we want relative accuracy in measurement space
• Latencies are positive and only matter within a few percent
– 1.1 ms versus 1.0 ms
– 1100 ms versus 1000 ms
• We can cheat by using floating point representations
– Compute bin using magic
– Count
![Page 18: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/18.jpg)
© 2017 MapR Technologies 18
FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
– is typical
• Relative error is bounded in measurement space
![Page 19: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/19.jpg)
© 2017 MapR Technologies 19
FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
– is typical
• Relative error is bounded in measurement space
• Bin index can be computed using FP representation!
![Page 20: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/20.jpg)
© 2017 MapR Technologies 20
Fixed Size Bins
![Page 21: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/21.jpg)
© 2017 MapR Technologies 21
Approximate Exponential Bins
![Page 22: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/22.jpg)
© 2017 MapR Technologies 22
Non-linear bins are better
(sometimes)
Still not general enough
![Page 23: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/23.jpg)
© 2017 MapR Technologies 23
Idea 2 – Fully Adaptive Bins
• First intuition – in general, we want accuracy in terms of
percentile
• Second intuition – we want better accuracy at extreme
quantiles
– 50%-ile versus 50.1%-ile?
– What does 0.1% error even mean for 99.99th percentile
• We need bins with small counts near the edges
![Page 24: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/24.jpg)
© 2017 MapR Technologies 24
First 1% of data shown.
Left graph has 100 x 100 sample bins.
Right graph has ~130bins, variable size
![Page 25: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/25.jpg)
© 2017 MapR Technologies 25
The Basic t-digest
• Take a bunch of data
• Sort it
• Group into bins
– But make the bins be smaller at the beginning and end
• Remember the centroid and count of each bin
• That’s a t-digest
![Page 26: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/26.jpg)
© 2017 MapR Technologies 26
But Wait, You Need a Bit More
• Take a bunch of new data, old t-digest
• Sort the data and the old bins together
• Group into bins
– Note that existing bins have bigger weights
– So they might survive … or might clump
• Remember the centroid and count of each new bin
• That’s an updated t-digest
![Page 27: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/27.jpg)
© 2017 MapR Technologies 27
Oh … and Merging
• Take a bunch of old t-digests
• Sort the bins
• Group into mega-bins
– Respect the size constraint
• Remember the centroid and count of each new bin
• That’s a merged t-digest
![Page 28: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/28.jpg)
© 2017 MapR Technologies 28
Adaptive non-linear bins are good
and general
And can be grouped
and regrouped
![Page 29: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/29.jpg)
© 2017 MapR Technologies 29
Results
![Page 30: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/30.jpg)
© 2017 MapR Technologies 30
![Page 31: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/31.jpg)
© 2017 MapR Technologies 31
Status
• Current release
– Small accuracy bugs in corner cases
– Best overall is still AVLTreeDigest
![Page 32: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/32.jpg)
© 2017 MapR Technologies 32
Status
• Current release (3.x)
– Small accuracy bugs in corner cases
– Best overall is still AVLTreeDigest
• Upcoming release (4.0)
– Better accuracy in pathological cases
– Strictly bounded size
– No dynamic allocation (with MergingDigest)
– Good speed (100ns for MergingDigest, 5ns for FloatHistogram)
– Real Soon Now
![Page 33: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/33.jpg)
© 2017 MapR Technologies 33
Example Application
• The data:– ~ 1 million machines
– Even more services
– Each producing thousands of measurements per second
• Store t-digest for each 5 minute period for each measurement
• Want to query any combination of keys, produce t-digest result
“what was the distribution of launch times yesterday?”
“what about last month?”
“in Europe versus in North America versus in Asia?”
![Page 34: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/34.jpg)
© 2017 MapR Technologies 34
Collect Data
log consolidator
web server
web server
Web-server
Log
Web-server
Log
log_events
log-stash
log-stash
data center
![Page 35: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/35.jpg)
© 2017 MapR Technologies 35
And Transport to Global Analytics
log consolidator
web server
web server
Web-server
Log
Web-server
Log
log_events
log-stash
log-stash
data center GHQ
log_events
eventsElaborate
events(log-stash)
Aggregate
Signal detection
![Page 36: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/36.jpg)
© 2017 MapR Technologies 36
With Many Sources
log consolidator
web server
web server
Web-server
Log
Web-server
Log
log_events
log-stash
log-stash
data center GHQ
log_events
eventsElaborate
events(log-stash)
Aggregate
Signal detection
![Page 37: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/37.jpg)
© 2017 MapR Technologies 37
With Many Sources
log consolidator
web server
web server
Web-server
Log
Web-server
Log
log_events
log-stash
log-stash
data center GHQ
log_events
eventsElaborate
events(log-stash)
Aggregate
Signal detection
log consolidator
web server
Web-server
Log
web server
Web-server
Log
log_events
log-stash
log-stash
data center
![Page 38: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/38.jpg)
© 2017 MapR Technologies 38
With Many Sources
log consolidator
web server
web server
Web-server
Log
Web-server
Log
log_events
log-stash
log-stash
data center GHQ
log_events
eventsElaborate
events(log-stash)
Aggregate
Signal detection
log consolidator
web server
Web-server
Log
web server
Web-server
Log
log_events
log-stash
log-stash
data center
log consolidator
web server
Web-server
Log
web server
Web-server
Log
log_events
log-stash
log-stash
data center
![Page 39: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/39.jpg)
© 2017 MapR Technologies 39
What about visualization?
![Page 40: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/40.jpg)
© 2017 MapR Technologies 40
Can’t see small count bars
![Page 41: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/41.jpg)
© 2017 MapR Technologies 41
Good Results
![Page 42: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/42.jpg)
© 2017 MapR Technologies 42
Bad Results – 1% of measurements are 3x bigger
![Page 43: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/43.jpg)
© 2017 MapR Technologies 43
Bad Results – 1% of measurements are 3x bigger
![Page 44: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/44.jpg)
© 2017 MapR Technologies 44
With Better Vertical Scaling
![Page 45: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/45.jpg)
© 2017 MapR Technologies 45
Uniform Bins
![Page 46: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/46.jpg)
© 2017 MapR Technologies 46
FloatHistogram Bins
![Page 47: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/47.jpg)
© 2017 MapR Technologies 47
With FloatHistogram
![Page 48: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/48.jpg)
© 2017 MapR Technologies 48
Original Ping Latency Data
![Page 49: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/49.jpg)
© 2017 MapR Technologies 49
Summary
• Single measurements insufficient, need distributions
• Uniform binned histograms not good
• FloatHistogram for some cases
• T-digest for general cases
• Upcoming release has super-
fast and accurate versions
• Good visualization also key
0.0 0.2 0.4 0.6 0.8 1.0q
02
46
81
0k
![Page 50: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/50.jpg)
© 2017 MapR Technologies 50
Q & A
![Page 51: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/51.jpg)
© 2017 MapR Technologies 51
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email [email protected] [email protected]
Twitter @ted_dunning
![Page 52: T digest-update](https://reader031.fdocuments.in/reader031/viewer/2022030317/5a65c3df7f8b9a38648b4d73/html5/thumbnails/52.jpg)
© 2017 MapR Technologies 52
T-digest
• Or we can talk about small errors in q
• Accumulate samples, sort, merge
• Merge if k-size < 1
• Interpolate using centroids in x
• Very good near extremes, no dynamic allocation
0.0 0.2 0.4 0.6 0.8 1.0q
02
46
81
0k