Scaling by Cheating
![Page 1: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/1.jpg)
Scaling by Cheating
Approximation, Sampling and Fault-Friendliness for Scalable Big Learning
Sean Owen / Director, Data Science @ Cloudera
![Page 2: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/2.jpg)
Two Big Problems
![Page 3: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/3.jpg)
Grow Bigger
“Today’s big is just tomorrow’s small. We’re expected to process arbitrarily large data sets by just adding computers. You can’t tell the boss that anything’s too big to handle these days.”
David, Sr. IT Manager
![Page 4: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/4.jpg)
And Be Faster
“Speed is king. People expect up-to-the-second results, and millisecond response times. No more overnight reporting jobs. My data grows 10x but my latency has to drop 10x.”
Shelly, CTO
![Page 5: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/5.jpg)
Two Big Solutions
![Page 6: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/6.jpg)
Plentiful Resources
“Disk and CPU are cheap, on-demand. Frameworks to harness them, like Hadoop, are free and mature. We can easily bring to bear plenty of resources to process data quickly and cheaply.”
“Scooter”, White Lab
![Page 7: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/7.jpg)
Not Right, but Close Enough
Cheating
![Page 8: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/8.jpg)
Kirk: What would you say the odds are on our getting out of here?
Spock: Difficult to be precise, Captain. I should say approximately seven thousand eight hundred twenty four point seven to one.
Kirk: Difficult to be precise? Seven thousand eight hundred and twenty four to one?
Spock: Seven thousand eight hundred twenty four point seven to one.
Kirk: That's a pretty close approximation.
Star Trek, “Errand of Mercy”
http://www.redbubble.com/people/feelmeflow
![Page 9: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/9.jpg)
When To ~~Cheat~~ Approximate
• Only a few significant figures matter
• Least-significant figures are noise
• Only relative rank matters
• Only care about “high” or “low”
Do you care about 37.94% vs simply 40%?
![Page 10: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/10.jpg)
Approximation
![Page 11: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/11.jpg)
The Mean
• Huge stream of values: $x_1, x_2, x_3, \dots$ *
• Finding entire population mean $\mu$ is expensive
• Mean of small sample of $N$ is close:
$\mu_N = \frac{1}{N}(x_1 + x_2 + \dots + x_N)$
• How much gets close enough?
* independent, roughly normal distribution
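The sample mean above can be maintained incrementally as values stream in, without storing them. The following is a minimal sketch, assuming Welford's online update (a standard technique the slides do not name); the class name `RunningStats` is hypothetical and is not taken from the talk's code.

```java
/** Hypothetical helper: running sample mean and standard deviation of a stream. */
final class RunningStats {
    private long n;       // number of values seen so far (N)
    private double mean;  // running sample mean (mu_N)
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one more value into the statistics (Welford's online update). */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    long count() {
        return n;
    }

    double mean() {
        return mean;
    }

    /** Sample standard deviation (sigma_N); NaN until two values have been seen. */
    double stdDev() {
        return n < 2 ? Double.NaN : Math.sqrt(m2 / (n - 1));
    }
}
```

Both statistics are available after every value, which is what a "stop when close enough" check needs.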
![Page 12: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/12.jpg)
“Close Enough” Mean
• Want: with high probability $p$, at most $\varepsilon$ error
$\mu = (1 \pm \varepsilon)\,\mu_N$
• Use Student’s t-distribution ($N-1$ d.o.f.)
$t = (\mu - \mu_N) / (\sigma_N / \sqrt{N})$
• How unknown $\mu$ behaves relative to known sample stats
![Page 13: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/13.jpg)
“Close Enough” Mean
• Critical value for one tail
$t_{crit} = \mathrm{CDF}^{-1}((1+p)/2)$
• Use library like Commons Math3:
TDistribution.inverseCumulativeProbability()
• Solve for critical $\mu_{crit}$
$\mathrm{CDF}^{-1}((1+p)/2) = (\mu_{crit} - \mu_N) / (\sigma_N / \sqrt{N})$
• $\mu$ “probably” at most $\mu_{crit}$
• Stop when $(\mu_{crit} - \mu_N) / \mu_N$ small ($< \varepsilon$)
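Written out, the stopping rule rearranges into a condition on the sample size. The derivation below uses only the definitions on these two slides and assumes the sample mean is positive. The numbers in the last line are illustrative assumptions, not measurements from the talk: p = 0.90, ε = 0.10, a sample mean near 0.40, a sample standard deviation near 0.49, and a critical value of about 1.65 (the large-sample value for the 0.95 quantile).

```latex
\begin{align*}
t_{crit} = \frac{\mu_{crit} - \mu_N}{\sigma_N / \sqrt{N}}
  \quad&\Longrightarrow\quad
  \mu_{crit} = \mu_N + t_{crit}\,\frac{\sigma_N}{\sqrt{N}} \\
\frac{\mu_{crit} - \mu_N}{\mu_N}
  = \frac{t_{crit}\,\sigma_N}{\mu_N \sqrt{N}} < \varepsilon
  \quad&\Longleftrightarrow\quad
  N > \left(\frac{t_{crit}\,\sigma_N}{\varepsilon\,\mu_N}\right)^{2} \\
\text{e.g.}\quad
  N > \left(\frac{1.65 \times 0.49}{0.10 \times 0.40}\right)^{2} &\approx 409
\end{align*}
```

Because the sample mean, the sample standard deviation and the critical value all change as values arrive, this is a check to repeat while streaming rather than a sample size to fix in advance.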
![Page 14: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/14.jpg)
Sampling
![Page 15: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/15.jpg)
![Page 16: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/16.jpg)
Word Count: Toy Example
• Input: text documents
• Exactly how many times does each word occur?
• Necessary precision?
• Interesting question?
Why?
![Page 17: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/17.jpg)
Word Count: Useful Example
• About how many times does each word occur?
• Which 10 words occur most frequently?
• What fraction are Capitalized?
Hmm!
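The two "useful" questions, the most frequent words and the Capitalized fraction, can be sketched on a single machine as follows. This is an illustration only, not the talk's Hadoop job: the class name `WordStats` is hypothetical, and splitting on whitespace is an assumed, deliberately crude tokenizer.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: word frequencies plus the fraction of Capitalized words. */
final class WordStats {
    private final Map<String, Long> counts = new HashMap<>();
    private long total;        // all words seen
    private long capitalized;  // words whose first character is upper case

    /** Splits on whitespace only; a real tokenizer would also handle punctuation. */
    void addDocument(String text) {
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            total++;
            if (Character.isUpperCase(word.charAt(0))) {
                capitalized++;
            }
            counts.merge(word.toLowerCase(), 1L, Long::sum);
        }
    }

    /** The k most frequent words, ties broken alphabetically. */
    List<String> topWords(int k) {
        return counts.entrySet().stream()
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    double fractionCapitalized() {
        return total == 0 ? 0.0 : (double) capitalized / total;
    }
}
```

Both answers depend only on relative frequencies, which is why they should change little when computed from a sample instead of the full input.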
![Page 18: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/18.jpg)
Common Crawl
• s3n://aws-publicdatasets/common-crawl/parse-output/segment/*/textData-*
• Count top words, Capitalized, zucchini in 35GB subset
• github.com/srowen/commoncrawl
• Amazon EMR: 4 c1.xlarge instances
![Page 19: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/19.jpg)
Raw Results
• 40 minutes
• 40.1% Capitalized
• Most frequent words: the and to of a in de for is
• zucchini occurs 9,571 times
![Page 20: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/20.jpg)
Sample 10% of Documents
• 21 minutes
• 39.9% Capitalized
• Most frequent words: the and to of a in de for is
• zucchini occurs 967 times (≈ 9,670 overall)
...if (Math.random() >= 0.1) continue;...
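The one-line filter above keeps roughly one document in ten, and a count taken on the sample is then scaled back up by the sampling rate (967 / 0.1 = 9,670). A minimal sketch of that idea follows; the class name `SampledCounter` and its methods are hypothetical, and a seeded `java.util.Random` stands in for `Math.random()` so that runs are repeatable.

```java
import java.util.Random;

/** Hypothetical helper: counts occurrences in a random sample of documents. */
final class SampledCounter {
    private final double rate;   // fraction of documents to keep, e.g. 0.1
    private final Random random;
    private long sampledHits;    // occurrences counted in the kept documents

    SampledCounter(double rate, long seed) {
        this.rate = rate;
        this.random = new Random(seed);
    }

    /** Offers one document's occurrence count; most documents are skipped. */
    void offer(long occurrencesInDocument) {
        if (random.nextDouble() >= rate) {
            return;  // same test as the slide's `continue`
        }
        sampledHits += occurrencesInDocument;
    }

    long sampledHits() {
        return sampledHits;
    }

    /** Scales the sampled count up to an estimate for the full data set. */
    double estimatedTotal() {
        return sampledHits / rate;
    }
}
```

In a real job the skip would happen before the document is parsed, since avoiding that work is where the time saving comes from.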
![Page 21: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/21.jpg)
Stop When “Close Enough”
• CloseEnoughMean.java
• Stop mapping when % Capitalized is close enough
• 10% error, 90% confidence per Mapper
• 18 minutes
• 39.8% Capitalized
...if (m.isCloseEnough()) { break;}...
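A check like `isCloseEnough()` might look like the sketch below for the Capitalized fraction. This is not the author's CloseEnoughMean.java: the class `CloseEnoughFraction`, the `minSamples` guard and the pluggable critical-value function are all assumptions made for illustration. With Commons Math3 on the classpath, that function could be `dof -> new TDistribution(dof).inverseCumulativeProbability((1 + p) / 2)`; the sketch itself stays library-free.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch of an early-stopping check for a 0/1 fraction. */
final class CloseEnoughFraction {
    private final double maxRelativeError;        // epsilon, e.g. 0.10
    private final long minSamples;                // assumed guard against stopping too early
    private final IntToDoubleFunction tCritical;  // degrees of freedom -> t_crit
    private long n;     // words seen
    private long hits;  // words that were Capitalized

    CloseEnoughFraction(double maxRelativeError, long minSamples,
                        IntToDoubleFunction tCritical) {
        this.maxRelativeError = maxRelativeError;
        this.minSamples = minSamples;
        this.tCritical = tCritical;
    }

    void add(boolean capitalized) {
        n++;
        if (capitalized) {
            hits++;
        }
    }

    double mean() {
        return n == 0 ? 0.0 : (double) hits / n;
    }

    /** True once t_crit * sigma_N / (sqrt(N) * mu_N) falls below epsilon. */
    boolean isCloseEnough() {
        if (n < 2 || n < minSamples || hits == 0) {
            return false;  // not enough information yet
        }
        double mean = mean();
        // Sample standard deviation of 0/1 indicator values.
        double stdDev = Math.sqrt(mean * (1.0 - mean) * n / (n - 1.0));
        int dof = (int) Math.min(n - 1, Integer.MAX_VALUE);
        double t = tCritical.applyAsDouble(dof);
        return t * stdDev / (Math.sqrt(n) * mean) < maxRelativeError;
    }
}
```

The minimum-sample guard matters because a short run of identical values has zero sample variance and would otherwise pass the test immediately.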
![Page 22: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/22.jpg)
Fault-Friendliness
![Page 23: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/23.jpg)
Oryx (α)
![Page 24: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/24.jpg)
Oryx (α)
• Computation Layer
  • Offline, Hadoop-based
  • Large-scale model building
• Serving Layer
  • Online, REST API
  • Query model in real-time
  • Update model approximately
• Few Key Algorithms
  • Recommenders: ALS
  • Clustering: k-means++
  • Classification: Random decision forests
![Page 25: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/25.jpg)
Not A Bank
![Page 26: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/26.jpg)
Oryx (α)
No Transactions!
![Page 27: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/27.jpg)
Serving Layer Designs For …
Fast Availability
• Independent replicas
• Need not have a globally consistent view
• Clients have consistent view through sticky load balancing
Fast “99.9%” Durability
• Push data into durable store, HDFS
• Buffer a little locally
• Tolerate loss of “a little bit”
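The "buffer a little locally" trade-off can be illustrated with a small sketch. It is not Oryx code: the class `SmallLossBuffer` is hypothetical, the `Consumer` stands in for whatever writes batches to HDFS, and the sketch is single-threaded for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: buffer a few records locally, push them on in batches. */
final class SmallLossBuffer<T> {
    private final int capacity;
    private final Consumer<List<T>> durableStore;  // stand-in for an HDFS writer
    private final List<T> pending = new ArrayList<>();

    SmallLossBuffer(int capacity, Consumer<List<T>> durableStore) {
        this.capacity = capacity;
        this.durableStore = durableStore;
    }

    /** Returns quickly; the record is not durable until the next flush. */
    void add(T record) {
        pending.add(record);
        if (pending.size() >= capacity) {
            flush();
        }
    }

    /** Hands a copy of the buffered records to the durable store. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        durableStore.accept(new ArrayList<>(pending));
        pending.clear();
    }

    /** Records that would be lost if the process died right now. */
    int atRisk() {
        return pending.size();
    }
}
```

The capacity bounds how much a crash can lose, which is the "a little bit" the slide accepts in exchange for fast writes.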
![Page 28: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/28.jpg)
If losing 90% of the data might make <1% difference here, why spend effort saving every last 0.1%?
![Page 29: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/29.jpg)
Resources
• Oryx: github.com/cloudera/oryx
• Apache Commons Math: commons.apache.org/proper/commons-math/
• Common Crawl example: github.com/srowen/commoncrawl
![Page 30: Scaling by Cheating](https://reader035.fdocuments.in/reader035/viewer/2022081512/56815c53550346895dca547f/html5/thumbnails/30.jpg)