News From Mahout
Transcript of News From Mahout
![Page 1: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/1.jpg)
1©MapR Technologies - Confidential
News From Mahout
![Page 2: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/2.jpg)
whoami – Ted Dunning
Chief Application Architect, MapR Technologies
Committer and member, Apache Software Foundation
– particularly Mahout, ZooKeeper and Drill
(we’re hiring)
Contact me [email protected]
@ted_dunning
![Page 3: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/3.jpg)
Slides and such (available late tonight):
– http://www.mapr.com/company/events/nyhug-03-05-2013
Hash tags: #mapr #nyhug #mahout
![Page 5: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/5.jpg)
New in Mahout
0.8 is coming soon (1-2 months)
gobs of fixes
QR decomposition is 10x faster
– makes ALS 2-3 times faster
May include Bayesian Bandits
Super fast k-means
– fast
– online (!?!)
– fast
Possible new edition of MiA (Mahout in Action) coming
– Japanese and Korean editions released, Chinese coming
![Page 7: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/7.jpg)
Real-time Learning
![Page 8: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/8.jpg)
We have a product to sell …
from a web-site
![Page 9: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/9.jpg)
Bogus Dog Food is the Best!
Now available in handy 1 ton
bags!
Buy 5!
What picture?
What tag-line?
What call to action?
![Page 10: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/10.jpg)
The Challenge
Design decisions affect probability of success
– Cheesy web-sites don’t even sell cheese
The best designers do better when allowed to fail
– Exploration juices creativity
But failing is expensive
– If only because we could have succeeded
– But also because offending or disappointing customers is bad
![Page 11: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/11.jpg)
More Challenges
Too many designs
– 5 pictures
– 10 tag-lines
– 4 calls to action
– 3 back-ground colors
=> 5 x 10 x 4 x 3 = 600 designs
It gets worse quickly
– What about changes on the back-end?
– Search engine variants?
– Checkout process variants?
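The design count is simply the product of the option counts. A minimal sketch, with placeholder option names (only the list sizes come from the slide):

```python
from itertools import product

# Placeholder option lists; only their sizes (5, 10, 4, 3) come from the slide.
pictures = [f"picture-{i}" for i in range(5)]
tag_lines = [f"tag-line-{i}" for i in range(10)]
calls_to_action = [f"cta-{i}" for i in range(4)]
backgrounds = ["white", "grey", "blue"]

# Every design is one choice from each list.
designs = list(product(pictures, tag_lines, calls_to_action, backgrounds))
print(len(designs))  # 5 * 10 * 4 * 3 = 600
```

Each extra dimension (search engine variant, checkout variant) multiplies this number again, which is why testing every design separately stops being practical.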
![Page 12: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/12.jpg)
Example – A/B testing in real-time
I have 15 versions of my landing page
Each visitor is assigned to a version
– Which version?
A conversion or sale or whatever can happen
– How long to wait?
Some versions of the landing page are horrible
– Don’t want to give them traffic
![Page 13: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/13.jpg)
A Quick Diversion
You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
I flip the coin and while it is in the air ask again
I catch the coin and ask again
I look at the coin (and you don’t) and ask again
Why does the answer change?
– And did it ever have a single value?
![Page 14: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/14.jpg)
A Philosophical Conclusion
Probability as expressed by humans is subjective and depends on information and experience
![Page 15: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/15.jpg)
I Dunno
![Page 16: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/16.jpg)
5 heads out of 10 throws
![Page 17: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/17.jpg)
2 heads out of 12 throws
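One common way to turn the counts on these slides into a belief about the coin is a Beta posterior. The sketch below assumes a uniform Beta(1, 1) prior, which the slides do not state, and the function name is illustrative only:

```python
def beta_posterior(heads, throws, prior_a=1.0, prior_b=1.0):
    """Return (a, b, mean) of the Beta posterior for P(heads).

    Assumes a Beta(prior_a, prior_b) prior; the default is uniform.
    """
    a = prior_a + heads
    b = prior_b + (throws - heads)
    return a, b, a / (a + b)

# No data ("I dunno"), then the two experiments from the slides.
for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    a, b, mean = beta_posterior(heads, throws)
    print(f"{heads}/{throws}: Beta({a:g}, {b:g}), mean {mean:.3f}")
```

With no throws the belief is flat; each new throw narrows it and moves it toward the observed frequency.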
![Page 18: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/18.jpg)
So now you understand Bayesian probability
![Page 19: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/19.jpg)
Another Quick Diversion
Let’s play a shell game
This is a special shell game
It costs you nothing to play
The pea has constant probability of being under each shell (trust me)
How do you find the best shell?
How do you find it while maximizing the number of wins?
![Page 20: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/20.jpg)
Pause for short con-game
![Page 21: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/21.jpg)
Interim Thoughts
Can you identify winners or losers without trying them out?
Can you ever completely eliminate a shell with a bad streak?
Should you keep trying apparent losers?
![Page 22: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/22.jpg)
So now you understand multi-armed bandits
![Page 23: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/23.jpg)
Conclusions
Can you identify winners or losers without trying them out?
– No
Can you ever completely eliminate a shell with a bad streak?
– No
Should you keep trying apparent losers?
– Yes, but at a decreasing rate
![Page 24: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/24.jpg)
Is there an optimum strategy?
![Page 25: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/25.jpg)
Bayesian Bandit
Compute distributions based on data so far
Sample p1, p2 and p3 from these distributions
Pick shell i where $i = \arg\max_i p_i$
Lemma 1: The probability of picking shell i will match the probability it is the best shell
Lemma 2: This is as good as it gets
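The three steps above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the speaker's code: the hidden win rates are made up, a uniform Beta(1, 1) prior is assumed, and the function names are illustrative.

```python
import random

def select(counts, rng):
    """Pick a shell by sampling each Beta posterior and taking the argmax.

    counts[i] = [losses, wins] for shell i; a uniform Beta(1, 1) prior is assumed.
    """
    samples = [rng.betavariate(wins + 1, losses + 1) for losses, wins in counts]
    return max(range(len(samples)), key=samples.__getitem__)

def run(true_p, steps, seed=0):
    """Play `steps` rounds against shells with hidden win rates `true_p`."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_p]
    for _ in range(steps):
        i = select(counts, rng)
        win = rng.random() < true_p[i]
        counts[i][1 if win else 0] += 1  # learning is just counting
    return counts

# Hypothetical win rates; the third shell is the best one.
counts = run([0.1, 0.2, 0.6], steps=2000)
plays = [losses + wins for losses, wins in counts]
print(plays)
```

The weaker shells still get occasional plays, but less and less often as the evidence against them accumulates.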
![Page 26: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/26.jpg)
And it works!
[Figure: regret (0 to 0.12) versus n (0 to 1100) for ε-greedy with ε = 0.05 and for the Bayesian Bandit with Gamma-Normal]
![Page 27: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/27.jpg)
Video Demo
![Page 28: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/28.jpg)
The Code
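Select an alternative
select = function(k) {
  # k is an n x 2 matrix of counts: column 1 = losses, column 2 = wins
  n = dim(k)[1]
  p0 = rep(0, length.out=n)
  for (i in 1:n) {
    # one sample from the Beta posterior of alternative i
    p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
  }
  return (which.max(p0))
}
Select and learn
But we already know how to count!
learn = function(k, steps, test) {
  # test(i) is assumed to return 1 for a loss and 2 for a win
  for (z in 1:steps) {
    i = select(k)
    j = test(i)
    k[i,j] = k[i,j]+1
  }
  return (k)
}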
![Page 29: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/29.jpg)
The Basic Idea
We can encode a distribution by sampling
Sampling allows unification of exploration and exploitation
Can be extended to more general response models
![Page 30: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/30.jpg)
The Original Problem
Bogus Dog Food is the Best!
Now available in handy 1 ton
bags!
Buy 5!
x1, x2, x3
![Page 31: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/31.jpg)
Response Function
$$p(\text{win}) = w\left(\sum_i \theta_i x_i\right)$$
[Figure: plot of the response function w; x axis from -6 to 6, y axis from 0 to 1 with a tick at 0.5]
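A minimal sketch of this response function. The slide does not name w; the logistic function is assumed here because the plotted curve is S-shaped between 0 and 1. The weights are hypothetical.

```python
import math

def logistic(t):
    """Assumed link function w; the slide only shows an S-shaped curve."""
    return 1.0 / (1.0 + math.exp(-t))

def p_win(theta, x):
    """p(win) = w(sum_i theta_i * x_i) for a feature vector x."""
    return logistic(sum(t * xi for t, xi in zip(theta, x)))

# Hypothetical weights for three design features x1, x2, x3.
theta = [0.8, -0.3, 0.5]
print(p_win(theta, [1, 0, 1]))
print(p_win(theta, [0, 0, 0]))  # no active features: w(0) = 0.5
```

Each design choice contributes its own weight, so 600 designs can share a handful of parameters instead of needing one win rate apiece.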
![Page 32: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/32.jpg)
Generalized Banditry
Suppose we have an infinite number of bandits
– suppose they are each labeled by two real numbers x and y in [0,1]
– also that expected payoff is a parameterized function of x and y: $E[z] = f(x, y \mid \theta)$
– now assume a distribution for θ that we can learn online
Selection works by sampling θ, then computing f
Learning works by propagating updates back to θ
– If f is linear, this is very easy
– For some other special kinds of f it isn’t too hard
Need not have just two labels; could have labels and context
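The selection step can be sketched as follows. Everything specific here is an assumption made for illustration: f is taken to be linear, the belief about θ is a set of independent Gaussians, and a coarse grid stands in for the infinite set of bandits. The learning step is omitted.

```python
import random

def f(x, y, theta):
    """Assumed linear payoff model: E[z] = theta0 + theta1*x + theta2*y."""
    return theta[0] + theta[1] * x + theta[2] * y

def sample_theta(belief, rng):
    """Draw theta from independent Gaussian beliefs [(mean, std), ...]."""
    return [rng.gauss(mean, std) for mean, std in belief]

def select(belief, candidates, rng):
    """Sample theta once, then pick the bandit (x, y) with the largest f."""
    theta = sample_theta(belief, rng)
    return max(candidates, key=lambda xy: f(xy[0], xy[1], theta))

rng = random.Random(1)
# A coarse grid stands in for the infinite set of bandits in [0, 1]^2.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
# Hypothetical belief: x almost surely helps, the effect of y is still uncertain.
belief = [(0.0, 0.1), (1.0, 0.2), (0.0, 1.0)]
picks = [select(belief, grid, rng) for _ in range(200)]
print(sorted(set(picks)))
```

Because the effect of x is nearly certain, every pick exploits it, while the uncertain effect of y is explored in both directions.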
![Page 33: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/33.jpg)
Context Variables
Bogus Dog Food is the Best!
Now available in handy 1 ton
bags!
Buy 5!
x1, x2, x3
user.geo env.time env.day_of_week env.weekend
![Page 34: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/34.jpg)
Caveats
Original Bayesian Bandit only requires real-time
Generalized Bandit may require access to long history for learning
– Pseudo online learning may be easier than true online
Bandit variables can include content, time of day, day of week
Context variables can include user id, user features
Bandit × context variables provide the real power
![Page 35: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/35.jpg)
You can do this yourself!
![Page 36: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/36.jpg)
Super-fast k-means Clustering
![Page 37: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/37.jpg)
Rationale
![Page 38: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/38.jpg)
What is Quality?
Robust clustering not a goal
– we don’t care if the same clustering is replicated
Generalization is critical
Agreement to “gold standard” is a non-issue
![Page 39: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/39.jpg)
An Example
![Page 40: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/40.jpg)
An Example
![Page 41: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/41.jpg)
Diagonalized Cluster Proximity
![Page 42: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/42.jpg)
Clusters as Distribution Surrogate
![Page 43: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/43.jpg)
Clusters as Distribution Surrogate
![Page 44: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/44.jpg)
Theory
![Page 45: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/45.jpg)
For Example
Grouping these two clusters seriously hurts squared distance
$$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$$
![Page 46: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/46.jpg)
Algorithms
![Page 47: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/47.jpg)
Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd’s algorithm
Result is that these two clusters get glued together
![Page 48: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/48.jpg)
Ball k-means
Provably better for highly clusterable data
Tries to find initial centroids in the “core” of each real cluster
Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer to their own centroid than to the closest other cluster
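The pseudocode above can be made concrete as a small pure-Python sketch. It is not Mahout's implementation. Two choices are assumptions: seeding uses k-means++-style sampling proportional to squared distance, and "much closer" is read as "within `trim` times the distance from a centroid to its nearest other centroid". It requires k >= 2 and points that are not all identical.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def seed_centroids(points, k, rng):
    """k-means++-style seeding: far-away points are more likely to be picked."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(dist(p, c) for c in centroids) ** 2 for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

def mean(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    rng = random.Random(seed)
    centroids = seed_centroids(points, k, rng)
    for _ in range(iterations):
        # Ball radius: a fraction of the distance to the nearest other centroid.
        radius = [trim * min(dist(c, o) for j, o in enumerate(centroids) if j != i)
                  for i, c in enumerate(centroids)]
        members = [[] for _ in centroids]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            if dist(p, centroids[i]) <= radius[i]:
                members[i].append(p)  # points outside the ball are ignored
        # Keep the old centroid if its ball turned out to be empty.
        centroids = [mean(m) if m else c for m, c in zip(members, centroids)]
    return centroids

# Two well-separated synthetic blobs, centred at (0, 0) and (20, 20).
data_rng = random.Random(42)
blob_a = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(data_rng.gauss(20, 0.5), data_rng.gauss(20, 0.5)) for _ in range(50)]
centroids = sorted(ball_kmeans(blob_a + blob_b, k=2))
print(centroids)
```

Trimming to the ball keeps stray points between clusters from dragging the centroids away from the cluster cores.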
![Page 50: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/50.jpg)
Still Not a Win
Ball k-means is nearly guaranteed with k = 2
Probability of successful seeding drops exponentially with k
Alternative strategy has high probability of success, but takes $O(nkd + k^3 d)$ time
But for big data, k gets large
![Page 51: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/51.jpg)
Surrogate Method
Start with sloppy clustering into lots of clusters
κ = k log n clusters
Use this sketch as a weighted surrogate for the data
Results are provably good for highly clusterable data
![Page 53: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/53.jpg)
Algorithm Costs
Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster, $O(d \log \kappa) = O(d (\log k + \log \log n))$ per point
– fast, in-memory, high-quality clustering of κ weighted centroids
  $O(\kappa k d + k^3 d) = O(k^2 d \log n + k^3 d)$ for small k, high quality
  $O(\kappa d \log k)$ or $O(d \log k (\log k + \log \log n))$ for larger k, looser quality
– result is k high-quality centroids
  • For many purposes, even the sloppy surrogate may suffice
![Page 54: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/54.jpg)
Algorithm Costs
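How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 520,000, roughly 500,000 (log n = 26 is the slide’s figure; the base-2 log of 100,000 is nearer 17)
– d (log k + log log n) = 10(11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal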
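The two per-point cost terms can be recomputed directly for the stated k, d and n. The sketch assumes base-2 logarithms, since the slide does not give the base, and the function name is illustrative:

```python
import math

k, d, n = 2000, 10, 100_000

def costs(k, d, n):
    """Per-point cost terms from the slide, using base-2 logs (an assumption)."""
    log_n = math.log2(n)
    # Exact search: compare against all kappa = k log n centroids in d dimensions.
    brute = k * d * log_n
    # Approximate search: d log kappa = d (log k + log log n).
    sketch = d * (math.log2(k) + math.log2(log_n))
    return brute, sketch

brute, sketch = costs(k, d, n)
print(round(brute), round(sketch), round(brute / sketch))
```

Under these assumptions the speedup comes out in the low thousands, the same order of magnitude as the slide's rounded estimate.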
![Page 56: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/56.jpg)
How It Works
For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new cluster, with u drawn uniformly from [0, 1]
– Else add to nearest centroid
If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
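A compact sketch of this single-pass procedure, under stated assumptions: nearest-centroid search is exact rather than approximate, a point starts a new cluster with probability d/threshold, and the threshold grows by a fixed factor whenever the sketch is collapsed. The names `max_centroids` (standing in for κ) and `growth` are illustrative, not Mahout parameters.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def add_point(centroids, point, weight, threshold, rng):
    """Fold one weighted point into the sketch; each centroid is [position, weight]."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    nearest = min(centroids, key=lambda c: dist(c[0], point))
    d = dist(nearest[0], point)
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])  # start a new cluster
    else:
        total = nearest[1] + weight  # weighted mean update of the nearest centroid
        nearest[0] = [(c * nearest[1] + p * weight) / total
                      for c, p in zip(nearest[0], point)]
        nearest[1] = total

def streaming_sketch(points, max_centroids, threshold=1.0, growth=1.5, seed=0):
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the sketch itself.
            threshold *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for pos, w in old:
                add_point(centroids, pos, w, threshold, rng)
    return centroids

data_rng = random.Random(7)
pts = [(data_rng.random(), data_rng.random()) for _ in range(500)]
sketch = streaming_sketch(pts, max_centroids=20, threshold=0.01)
print(len(sketch), sum(w for _, w in sketch))
```

The weights record how many original points each centroid stands for, which is what lets the sketch serve as a weighted surrogate for the data in the final clustering step.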
![Page 57: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/57.jpg)
Implementation
![Page 58: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/58.jpg)
But Wait, …
Finding nearest centroid is inner loop
This could take O( d κ ) per point and κ can be big
Happily, approximate nearest centroid works fine
![Page 59: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/59.jpg)
Projection Search
total ordering!
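The idea on this slide appears to be that projecting every point onto a line yields a total ordering, so candidates near a query can be found by binary search. The sketch below illustrates that idea with a single random projection; it is not Mahout's implementation, and the class name, method name and `window` parameter are hypothetical.

```python
import bisect
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

class ProjectionSearch:
    def __init__(self, points, seed=0):
        rng = random.Random(seed)
        dim = len(points[0])
        self.direction = [rng.gauss(0, 1) for _ in range(dim)]
        # Sorting by the 1-D projection gives a total ordering of the points.
        self.entries = sorted((dot(p, self.direction), p) for p in points)
        self.keys = [key for key, _ in self.entries]

    def nearest(self, query, window=8):
        """Approximate nearest neighbour: scan only points near the query's projection."""
        pos = bisect.bisect_left(self.keys, dot(query, self.direction))
        lo, hi = max(0, pos - window), min(len(self.entries), pos + window)
        candidates = [p for _, p in self.entries[lo:hi]]
        return min(candidates, key=lambda p: sq_dist(p, query))

data_rng = random.Random(3)
pts = [tuple(data_rng.random() for _ in range(5)) for _ in range(200)]
index = ProjectionSearch(pts)
query = tuple(data_rng.random() for _ in range(5))
print(index.nearest(query))
```

A small window checks only a handful of full distances per query; widening it, or combining several projections, trades speed for accuracy.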
![Page 60: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/60.jpg)
LSH Bit-match Versus Cosine
[Figure: plot with X axis from 0 to 64 and Y axis from -1 to 1]
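The relationship in the slide title can be reproduced with random-hyperplane hashing, a standard LSH scheme for cosine similarity. Whether this matches the hash used for the figure is an assumption; the 64-bit signature length is chosen to match the figure's X axis.

```python
import math
import random

def random_hyperplanes(dim, bits, rng):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

def signature(v, planes):
    """One bit per hyperplane: which side of the plane v falls on."""
    return [sum(a * b for a, b in zip(v, plane)) >= 0 for plane in planes]

def matching_bits(a, b):
    return sum(1 for x, y in zip(a, b) if x == y)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

rng = random.Random(0)
planes = random_hyperplanes(dim=10, bits=64, rng=rng)
u = [rng.gauss(0, 1) for _ in range(10)]
neg_u = [-a for a in u]
other = [rng.gauss(0, 1) for _ in range(10)]
for v in (u, neg_u, other):
    bits = matching_bits(signature(u, planes), signature(v, planes))
    print(bits, round(cosine(u, v), 2))
```

Identical vectors match on all 64 bits and opposite vectors on none; in between, the expected fraction of matching bits is 1 minus the angle between the vectors divided by π, so the bit count is a cheap proxy for cosine similarity.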
![Page 61: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/61.jpg)
Results
![Page 62: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/62.jpg)
Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the threaded version, the non-threaded version and perfect scaling]
![Page 63: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/63.jpg)
Quality
Ball k-means implementation appears significantly better than simple k-means
Streaming k-means + ball k-means appears to be about as good as ball k-means alone
All evaluations on 20 newsgroups with held-out data
Figure of merit is mean and median squared distance to nearest cluster
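The figure of merit named above is straightforward to compute. A minimal sketch with toy data (the points and centroids are made up, and the function names are illustrative):

```python
import statistics

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def figure_of_merit(held_out, centroids):
    """Mean and median squared distance from each held-out point to its nearest centroid."""
    nearest = [min(squared_distance(p, c) for c in centroids) for p in held_out]
    return statistics.mean(nearest), statistics.median(nearest)

# Toy data: the nearest squared distances are 1, 4 and 9.
centroids = [(0.0, 0.0), (10.0, 10.0)]
held_out = [(1.0, 0.0), (0.0, 2.0), (10.0, 13.0)]
mean_d2, median_d2 = figure_of_merit(held_out, centroids)
print(mean_d2, median_d2)
```

Scoring on held-out points rather than training points measures how well the centroids generalize, which the deck names as the quality criterion that matters.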
![Page 64: News From Mahout](https://reader033.fdocuments.in/reader033/viewer/2022052911/559f74f31a28abf4718b477e/html5/thumbnails/64.jpg)
Contact Me!
We’re hiring at MapR in US and Europe
MapR software available for research use
Get the code as part of Mahout trunk (or 0.8 very soon)
Contact me at [email protected] or @ted_dunning
Share news with @apachemahout