Big Data Infrastructure
Week 8: Data Mining (1/4)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
CS 489/698 Big Data Infrastructure (Winter 2016)
Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo
March 1, 2016
These slides are available at http://lintool.github.io/bigdata-2016w/
Structure of the Course
“Core” framework features and algorithm design
- Analyzing Text
- Analyzing Graphs
- Analyzing Relational Data
- Data Mining
Supervised Machine Learning
The generic problem of function induction given sample instances of input and output
Classification: output draws from finite discrete labels
Regression: output is a continuous value
This is not meant to be an exhaustive treatment of machine learning!
Focus today
Source: Wikipedia (Sorting)
Classification
Applications
Spam detection
Sentiment analysis
Content (e.g., genre) classification
Link prediction
Document ranking
Object recognition
And much much more!
Fraud detection
Supervised Machine Learning
[Diagram: at training time, training data feeds a machine learning algorithm, which produces a model; at testing/deployment time, the model assigns labels to new, unseen items]
Objects are represented in terms of features:
- “Dense” features: sender IP, timestamp, # of recipients, length of message, etc.
- “Sparse” features: contains the term “viagra” in message, contains “URGENT” in subject, etc.
Who comes up with the features?
How?
Feature Representations
Applications
Spam detection
Sentiment analysis
Content (e.g., genre) classification
Link prediction
Document ranking
Object recognition
And much much more!
Fraud detection
Features are highly application-specific!
Components of a ML Solution
- Data
- Features
- Model: logistic regression, naïve Bayes, SVM, random forests, perceptrons, neural networks, etc.
- Optimization: gradient descent, stochastic gradient descent, L-BFGS, etc.
What “matters” the most?
No data like more data!
(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)
s/knowledge/data/g;
Limits of Supervised Classification?
- Why is this a big data problem?
  - Isn’t gathering labels a serious bottleneck?
- Solution: crowdsourcing
- Solution: bootstrapping, semi-supervised techniques
- Solution: user behavior logs
  - Learning to rank
  - Computational advertising
  - Link recommendation
- The virtuous cycle of data-driven products
Supervised Binary Classification
- Restrict the output label to be binary
  - Yes/No
  - 1/0
- Binary classifiers form a primitive building block for multi-class problems
  - One-vs-rest classifier ensembles
  - Classifier cascades
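The one-vs-rest idea can be made concrete with a short sketch: train one binary scorer per class, then predict the class whose scorer is most confident. The function name and the toy per-feature “classifiers” below are illustrative assumptions, not from the slides.

```python
def predict_one_vs_rest(classifiers, x):
    """Run every binary classifier (a scoring function) on x and
    predict the class whose classifier scores highest."""
    scores = [f(x) for f in classifiers]
    return scores.index(max(scores))

# Three toy "classifiers", one per class, each scoring by one feature.
classifiers = [lambda x: x[0], lambda x: x[1], lambda x: x[2]]
print(predict_one_vs_rest(classifiers, [0.1, 0.9, 0.3]))  # class 1 wins
```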
The Task
- Given D = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ, where xᵢ = [x₁, x₂, x₃, …, x_d] is a (sparse) feature vector and y ∈ {0, 1} is its label
- Induce f : X → Y such that the loss is minimized:
  (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
- Typically, consider functions of a parametric form:
  argmin_θ (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ; θ), yᵢ)
  where ℓ is the loss function and θ are the model parameters
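The empirical-loss objective above can be sketched directly: average a per-example loss over the dataset. The 0/1 loss and the hypothetical thresholded classifier below are one illustrative choice of ℓ and f, not the slides’ definitions.

```python
def zero_one_loss(prediction, label):
    """One possible loss function: 1 on a mistake, 0 otherwise."""
    return 0.0 if prediction == label else 1.0

def empirical_loss(f, xs, ys):
    """(1/n) * sum over i of loss(f(x_i), y_i)."""
    return sum(zero_one_loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

f = lambda x: 1 if x[0] >= 0.0 else 0   # toy classifier: sign of the first feature
xs = [[1.0], [-1.0], [2.0], [-2.0]]
ys = [1, 0, 0, 1]                        # two of the four examples are misclassified
print(empirical_loss(f, xs, ys))         # 0.5
```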
Key insight: machine learning as an optimization problem! (Closed-form solutions are generally not possible.)
Gradient Descent: Preliminaries
- Rewrite:
  argmin_θ (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ; θ), yᵢ)  as  argmin_θ L(θ)
- Compute the gradient, which “points” in the fastest-increasing “direction”:
  ∇L(θ) = [∂L(θ)/∂w₀, ∂L(θ)/∂w₁, …, ∂L(θ)/∂w_d]
- So, at any point a, a small step against the gradient does not increase the loss:*
  b = a − γ∇L(a)  implies  L(a) ≥ L(b)
  * with caveats
Gradient Descent: Iterative Update
- Start at an arbitrary point, iteratively update:
  θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ − γ⁽ᵗ⁾ ∇L(θ⁽ᵗ⁾)
- We have:
  L(θ⁽⁰⁾) ≥ L(θ⁽¹⁾) ≥ L(θ⁽²⁾) ≥ …
- Lots of details:
  - Figuring out the step size
  - Getting stuck in local minima
  - Convergence rate
  - …
Gradient Descent
Repeat until convergence:
θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ − γ⁽ᵗ⁾ (1/n) Σᵢ₌₁ⁿ ∇ℓ(f(xᵢ; θ⁽ᵗ⁾), yᵢ)
Intuition behind the math…
θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ − γ⁽ᵗ⁾ (1/n) Σᵢ₌₁ⁿ ∇ℓ(f(xᵢ; θ⁽ᵗ⁾), yᵢ)
New weights ← old weights, updated based on the gradient of the loss; for a scalar loss ℓ(x), the gradient ∇ℓ is just the derivative dℓ/dx.
Gradient Descent
θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ − γ⁽ᵗ⁾ (1/n) Σᵢ₌₁ⁿ ∇ℓ(f(xᵢ; θ⁽ᵗ⁾), yᵢ)
Source: Wikipedia (Hills)
Lots More Details…
- Gradient descent is a “first order” optimization technique
  - Often, slow convergence
  - Conjugate techniques accelerate convergence
- Newton and quasi-Newton methods:
  - Intuition: Taylor expansion
    f(x + Δx) = f(x) + f′(x)Δx + ½ f″(x)Δx²
  - Requires the Hessian (the square matrix of second-order partial derivatives): impractical to fully compute
Source: Wikipedia (Hammer)
Logistic Regression
Logistic Regression: Preliminaries
- Given D = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ, where xᵢ = [x₁, x₂, x₃, …, x_d] and y ∈ {0, 1}
- Let’s define:
  f(x; w) : ℝᵈ → {0, 1}
  f(x; w) = 1 if w·x ≥ t, 0 if w·x < t
- Interpretation:
  ln[ Pr(y = 1|x) / Pr(y = 0|x) ] = w·x
  ln[ Pr(y = 1|x) / (1 − Pr(y = 1|x)) ] = w·x
Relation to the Logistic Function
- After some algebra:
  Pr(y = 1|x) = e^(w·x) / (1 + e^(w·x))
  Pr(y = 0|x) = 1 / (1 + e^(w·x))
- The logistic function:
  f(z) = eᶻ / (eᶻ + 1)
[Plot: logistic(z) for z from −8 to 8, rising from 0 to 1 with logistic(0) = 0.5]
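The logistic function is a one-liner, written here in the algebraically equivalent form 1 / (1 + e^(−z)); the sample points below are illustrative.

```python
import math

def logistic(z):
    """f(z) = e^z / (e^z + 1) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(0.0))   # 0.5, the midpoint of the curve
print(logistic(8.0))   # close to 1
print(logistic(-8.0))  # close to 0
```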
Training an LR Classifier
- Maximize the conditional likelihood:
  argmax_w Πᵢ₌₁ⁿ Pr(yᵢ | xᵢ, w)
- Define the objective in terms of the conditional log likelihood:
  L(w) = Σᵢ₌₁ⁿ ln Pr(yᵢ | xᵢ, w)
- We know y ∈ {0, 1}, so:
  Pr(y|x, w) = Pr(y = 1|x, w)^y · Pr(y = 0|x, w)^(1−y)
- Substituting:
  L(w) = Σᵢ₌₁ⁿ [ yᵢ ln Pr(yᵢ = 1|xᵢ, w) + (1 − yᵢ) ln Pr(yᵢ = 0|xᵢ, w) ]
LR Classifier Update Rule
- Take the derivative of
  L(w) = Σᵢ₌₁ⁿ [ yᵢ ln Pr(yᵢ = 1|xᵢ, w) + (1 − yᵢ) ln Pr(yᵢ = 0|xᵢ, w) ]
  to get:
  ∂L(w)/∂w = Σᵢ₌₁ⁿ xᵢ ( yᵢ − Pr(yᵢ = 1|xᵢ, w) )
- General form of the update rule (gradient ascent, since we maximize L):
  w⁽ᵗ⁺¹⁾ ← w⁽ᵗ⁾ + γ⁽ᵗ⁾ ∇_w L(w⁽ᵗ⁾)
  where ∇L(w) = [∂L(w)/∂w₀, ∂L(w)/∂w₁, …, ∂L(w)/∂w_d]
- Final update rule, per weight wᵢ:
  wᵢ⁽ᵗ⁺¹⁾ ← wᵢ⁽ᵗ⁾ + γ⁽ᵗ⁾ Σⱼ₌₁ⁿ x_{j,i} ( yⱼ − Pr(yⱼ = 1|xⱼ, w⁽ᵗ⁾) )
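The final update rule can be sketched as plain batch gradient ascent on the conditional log likelihood. The toy dataset, step size, and names below are assumptions for illustration, not the slides’ implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(xs, ys, gamma, iterations):
    """Batch gradient ascent:
    w_i <- w_i + gamma * sum_j x[j][i] * (y[j] - Pr(y_j = 1 | x_j, w))."""
    d = len(xs[0])
    w = [0.0] * d
    for _ in range(iterations):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            error = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(d):
                grad[i] += x[i] * error
        for i in range(d):
            w[i] += gamma * grad[i]
    return w

# Toy data: a constant bias feature plus one real feature.
xs = [[1.0, 2.0], [1.0, -2.0]]
ys = [1, 0]
w = train_lr(xs, ys, gamma=0.1, iterations=200)
print(sigmoid(sum(wi * xi for wi, xi in zip(w, xs[0]))))  # above 0.5: predicts 1
```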
Lots more details…
- Regularization
- Different loss functions
- …
Want more details? Take a real machine-learning course!
MapReduce Implementation
[Diagram: mappers compute partial gradients; a single reducer sums them and updates the model; iterate until convergence]
θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ − γ⁽ᵗ⁾ (1/n) Σᵢ₌₁ⁿ ∇ℓ(f(xᵢ; θ⁽ᵗ⁾), yᵢ)
Shortcomings
- Hadoop is bad at iterative algorithms
  - High job startup costs
  - Awkward to retain state across iterations
- High sensitivity to skew
  - Iteration speed bounded by the slowest task
- Potentially poor cluster utilization
  - Must shuffle all data to a single reducer
- Some possible tradeoffs
  - Number of iterations vs. complexity of computation per iteration
  - E.g., L-BFGS: faster convergence, but more to compute per iteration
Spark Implementation
val points = spark.textFile(...).map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}
[Diagram: mappers compute partial gradients; a reducer updates the model]
What’s the difference?
Source: Wikipedia (Japanese rock garden)
Questions?