Data-Intensive Distributed Computing - University of Waterloo
Data-Intensive Distributed Computing
Part 6: Data Mining (1/4)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
CS 431/631 451/651 (Winter 2022)
Ali Abedi
These slides are available at https://www.student.cs.uwaterloo.ca/~cs451
1
Structure of the Course
“Core” framework features and algorithm design
[Course modules: Analyzing Text, Analyzing Graphs, Analyzing Relational Data, Data Mining]
2
Supervised Machine Learning
The generic problem of function induction given sample instances of input and output
Classification: output draws from finite discrete labels
Regression: output is a continuous value
This is not meant to be an exhaustive treatment of machine learning!
3
Classification
Source: Wikipedia (Sorting)
4
Applications
Spam detection
Sentiment analysis
Content (e.g., topic) classification
Link prediction
Document ranking
Object recognition
Fraud detection
And much much more!
5
Supervised Machine Learning
[Diagram: in the training phase, a machine learning algorithm induces a model from example inputs and outputs; at testing/deployment time, the model predicts the label (“?”) for new inputs]
7
Feature Representations
Objects are represented in terms of features:
“Dense” features: sender IP, timestamp, # of recipients, length of message, etc.
“Sparse” features: contains the term “Viagra” in message, contains “URGENT” in subject, etc.
8
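The dense vs. sparse distinction above is easiest to see in code. A minimal Scala sketch (feature names and types are illustrative assumptions, not from the course code) of the two styles for an email in a spam detector:

object FeatureSketch {
  // "Dense" features: every message has a value for every feature.
  case class DenseFeatures(senderIp: Long, timestamp: Long, numRecipients: Int, messageLength: Int)

  // "Sparse" features: store only the features that actually fire, as name -> value pairs.
  val sparseExample: Map[String, Double] = Map(
    "body_contains_viagra"    -> 1.0,
    "subject_contains_URGENT" -> 1.0
  )
}

The sparse style matters at scale: the space of possible features (e.g., every term that could appear in a message) is huge, but each individual message activates only a handful of them.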
Applications
Spam detection
Sentiment analysis
Content (e.g., genre) classification
Link prediction
Document ranking
Object recognition
Fraud detection
And much much more!
9
Components of a ML Solution
Data, Features, Model, Optimization
10
Data matters the most: if we throw a lot of data at almost any algorithm, it performs well.
No data like more data!
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
11
We have seen these examples before. For example, Stupid Backoff outperforms other algorithms when it’s trained on a lot of data.
Limits of Supervised Classification?
Why is this a big data problem? Isn’t gathering labels a serious bottleneck?
Solutions:
Crowdsourcing
Bootstrapping, semi-supervised techniques
Exploiting user behavior logs
12
Supervised Binary Classification
Restrict output label to be binary: Yes/No, 1/0
Binary classifiers form primitive building blocks for multi-class problems…
13
Binary Classifiers as Building Blocks
Example: four-way classification
One-vs.-rest classifiers: four parallel binary classifiers, “A or not?”, “B or not?”, “C or not?”, “D or not?”
Classifier cascades: a sequence of binary classifiers, “A or not?”, then “B or not?”, then “C or not?”, then “D or not?”
14
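A rough Scala sketch (the trait and method names here are assumptions for illustration, not the course’s code) of the two compositions on the slide, given binary classifiers that return a score where higher means “more likely yes”:

object MultiClassSketch {
  trait BinaryClassifier {
    def score(features: Map[String, Double]): Double // higher = more confident "yes"
  }

  // One-vs.-rest: one "X or not?" classifier per class; predict the class whose classifier scores highest.
  def oneVsRest(classifiers: Map[String, BinaryClassifier],
                features: Map[String, Double]): String =
    classifiers.maxBy { case (_, clf) => clf.score(features) }._1

  // Classifier cascade: ask "A or not?", "B or not?", ... in order; return the first confident "yes".
  def cascade(ordered: Seq[(String, BinaryClassifier)],
              features: Map[String, Double],
              threshold: Double = 0.5): Option[String] =
    ordered.collectFirst { case (label, clf) if clf.score(features) > threshold => label }
}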
The Task
Given:
(sparse) feature vector
label
Induce:
Such that loss is minimized
loss function
Typically, we consider functions of a parametric form:
model parameters
15
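The formulas on this slide are images in the original deck. A standard reconstruction of the setup, assuming feature vectors x_i, labels y_i, a loss function ℓ, and a parametric family f(x; w) with model parameters w:

f : X \to Y, \qquad w^{*} = \arg\min_{w} \sum_{i} \ell\bigl(f(x_i; w),\, y_i\bigr)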
Key insight: machine learning as an optimization problem!
16
Gradient Descent
Because closed-form solutions are generally not possible…
17
Gradient Descent: Preliminaries
Intuition behind the math: the new weights are the old weights, adjusted by an update based on the gradient
18
Linear Regression + Gradient Descent
19
Gradient Descent: Preliminaries
Rewrite:
Compute gradient: “points” to the fastest-increasing “direction”
So, at any point:
20
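The formulas on this slide are also images; a reconstruction under the usual squared-loss formulation for linear regression, with training pairs (x_i, y_i) and weights w:

\ell(w) = \frac{1}{2} \sum_{i} (w \cdot x_i - y_i)^2, \qquad \nabla \ell(w) = \sum_{i} (w \cdot x_i - y_i)\, x_i

The gradient points in the direction of steepest increase of the loss, so stepping against it decreases the loss.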
Gradient Descent: Iterative Update
Start at an arbitrary point, iteratively update:
We have:
21
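The missing update formula, in standard notation with step size \gamma^{(t)}:

w^{(t+1)} = w^{(t)} - \gamma^{(t)}\, \nabla \ell\bigl(w^{(t)}\bigr)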
Gradient Descent: Iterative Update
Start at an arbitrary point, iteratively update:
Lots of details:
Figuring out the step size
Getting stuck in local minima
Convergence rate
…
We have:
22
Gradient Descent
Repeat until convergence:
Note: sometimes formulated as ascent, but entirely equivalent
23
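A minimal single-machine Scala sketch (not the course’s code) of batch gradient descent for linear regression with squared loss, matching the formulas above:

object GradientDescentSketch {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  def gradientDescent(xs: Array[Vec], ys: Array[Double],
                      stepSize: Double, iterations: Int): Vec = {
    var w: Vec = Array.fill(xs.head.length)(0.0) // start at an arbitrary point (here: all zeros)
    for (_ <- 1 to iterations) {
      // full-batch gradient of the squared loss: sum_i (w.x_i - y_i) * x_i
      val gradient: Vec = xs.zip(ys)
        .map { case (x, y) => val err = dot(w, x) - y; x.map(_ * err) }
        .reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
      w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi } // step against the gradient
    }
    w
  }
}

Every iteration touches the full dataset; the MapReduce and Spark implementations later in the deck parallelize exactly this per-iteration gradient computation.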
Even More Details…
Gradient descent is a “first order” optimization technique
Often, slow convergence
Newton and quasi-Newton methods:
Intuition: Taylor expansion
Requires the Hessian (square matrix of second order partial derivatives): impractical to fully compute
24
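For reference, the (quasi-)Newton step the slide alludes to, written out (a standard reconstruction, not taken from the deck), where H is the Hessian of the loss:

w^{(t+1)} = w^{(t)} - H^{-1}\bigl(w^{(t)}\bigr)\, \nabla \ell\bigl(w^{(t)}\bigr)

With d features the Hessian is a d-by-d matrix, so forming and inverting it is impractical at scale; quasi-Newton methods such as L-BFGS approximate it instead.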
Logistic Regression: Preliminaries
26
Logistic Regression: Preliminaries
Given:
Define:
Interpretation:
27
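The definitions on this slide are images in the original. The standard setup (a reconstruction) for labels y \in \{0, 1\} and feature vector x:

\ln \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)} = w \cdot x

Interpretation: the model assumes the log-odds of the positive class is a linear function of the features.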
Relation to the Logistic Function
After some algebra:
The logistic function:
[Plot: the logistic function, logistic(z) as a function of z for z from −8 to 8, rising from 0 to 1]
28
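Reconstructing the missing algebra: solving the log-odds definition above for the positive-class probability gives

\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-w \cdot x}} = \mathrm{logistic}(w \cdot x), \qquad \mathrm{logistic}(z) = \frac{1}{1 + e^{-z}}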
Training an LR Classifier
Maximize the conditional likelihood:
Define the objective in terms of conditional log likelihood:
We know:
So:
Substituting:
29
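The objective on this slide is an image in the original. In standard form (a reconstruction), with \sigma denoting the logistic function and labels y_i \in \{0, 1\}, the conditional log likelihood is

L(w) = \sum_{i} \ln \Pr(y_i \mid x_i; w) = \sum_{i} \Bigl[ y_i \ln \sigma(w \cdot x_i) + (1 - y_i) \ln \bigl(1 - \sigma(w \cdot x_i)\bigr) \Bigr]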
LR Classifier Update Rule
Take the derivative:
General form of update rule:
Final update rule:
30
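A reconstruction of the missing derivative and update rule under the same conventions, with step size \gamma^{(t)} (gradient ascent on the log likelihood, which the earlier note points out is equivalent to descent on its negation):

\frac{\partial L}{\partial w} = \sum_{i} x_i \bigl(y_i - \sigma(w \cdot x_i)\bigr), \qquad w^{(t+1)} = w^{(t)} + \gamma^{(t)} \sum_{i} x_i \bigl(y_i - \sigma(w^{(t)} \cdot x_i)\bigr)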
Want more details? Take a real machine-learning course!
Lots more details…
Regularization
Different loss functions…
31
32
MapReduce Implementation
[Diagram: each mapper computes a partial gradient over its split of the data; a single reducer sums the partial gradients and updates the model; iterate until convergence]
33
Shortcomings
Hadoop is bad at iterative algorithms:
High job startup costs
Awkward to retain state across iterations
High sensitivity to skew: iteration speed bounded by the slowest task
Potentially poor cluster utilization: must shuffle all data to a single reducer
Some possible tradeoffs:
Number of iterations vs. complexity of computation per iteration
E.g., L-BFGS: faster convergence, but more to compute
34
Spark Implementation
val points = spark.textFile(...).map(parsePoint).persist()  // keep the parsed points cached across iterations
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  // each task computes a partial gradient over its partition; reduce sums them on the driver
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient  // update the model; the new w is shipped with the tasks of the next iteration
}
[Diagram: as in the MapReduce version, mappers compute partial gradients and a reducer updates the model]
35