Gilad Lerman
School of Mathematics
University of Minnesota
Topics in Machine Learning
Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng
Machine Learning - Motivation
• Arthur Samuel (1959): “Field of study that
gives computers the ability to learn
without being explicitly programmed”
• Lies in between computer science, statistics, optimization, …
• Three categories (a soft division):
Supervised learning
Unsupervised learning
Reinforcement learning
Difficulties
• Understanding the methods
(requires knowledge of various areas)
• Understanding data and application areas
• Sometimes hard to establish mathematical
guarantees
• Sometimes hard to code and test
• Fast-developing area of research
Simplification
• To avoid such difficulties, but still obtain a fine level of knowledge in 2 days, we’ll follow the book by G. James, D. Witten, T. Hastie, and R. Tibshirani (An Introduction to Statistical Learning)
• The book is available online
• Plan: last 3 chapters (8–10) and a bit more…
Review
• Supervised learning (training and test
sets) vs. unsupervised learning
• Examples of supervised learning:
regression, classification
• Examples of unsupervised learning:
density/function estimation, clustering,
dimension reduction
• Recall: regression, bias-variance tradeoff,
resampling (e.g., cross validation), linear
and non-linear models
Quick Review of Regression
and Nearest Neighbors
• Regression predicts a response variable Y (a quantitative variable) in terms of input variables (predictors) X1, …, Xp, given n samples in ℝ^p; denote X = (X1, …, Xp)
• The regression function f(x) = E(Y | X = x) is the minimizer of the mean squared prediction error
• We cannot compute f precisely, since we have few if any observations at any given x
Estimating f by NN
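A minimal sketch of the nearest-neighbors estimate of f (my own illustration, assuming Euclidean distance and a user-chosen k; the function name knn_regress is not from the course):

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k=5):
    """Estimate f(x0) = E(Y | X = x0) by averaging the responses of the
    k training points nearest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance from x0 to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbors N(x0)
    return y_train[nearest].mean()                # average their responses
```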
Remarks on NN and
Classification
• Need 𝑝 ≤ 4 and sufficiently large n
• Nearest neighbors tend to be far away in
high dimensions
• Can use kernel or spline smoothing
• Other common methods: parametric and structured models
Neighborhoods in Increasing
Dimensions
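The phenomenon can be checked numerically: for data uniform on the unit hypercube in p dimensions, a sub-cube capturing a fraction r of the data must have edge length r^(1/p), which approaches 1 as p grows (a standard illustration, not taken from the slides):

```python
# Edge length of a sub-cube needed to capture 10% of data uniform on [0,1]^p.
for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: edge length = {0.10 ** (1 / p):.3f}")
# p=1: 0.100, p=2: 0.316, p=10: 0.794, p=100: 0.977
# "local" neighborhoods stop being local, so nearest neighbors are far away.
```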
More on Regression
• Assessing model accuracy:
More on Regression
Flexibility = degrees of freedom (each square marks the flexibility of the method drawn in the same color); the dashed line is explained later (irreducible error)
On Regression Error
• For an estimator f̂ learned on a training set, the mean squared error at x is
E[(Y − f̂(X))² | X = x]
• Assume Y = f(X) + ε, where ε is independent noise with mean zero; then
E[(Y − f̂(X))² | X = x] = E[(f(X) + ε − f̂(X))² | X = x]
= E[(f(X) − f̂(X))² | X = x] + Var(ε)
• Var(ε) is the irreducible error
• E[(f(X) − f̂(X))² | X = x] is the reducible error
(f̂(X) depends on the random training sample)
Regression Error: Bias and Variance
• E[(f(X) − f̂(X))² | X = x] =
E[(f̂(X) − E[f̂(X)])² | X = x] + (E[f̂(X) | X = x] − f(x))² =
Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x)
• E[(Y − f̂(X))² | X = x] =
Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x) + Var(ε)
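A quick numerical check of this decomposition (my own sketch, not from the slides): fit a fixed-degree polynomial to many independent training sets drawn from Y = f(X) + ε and compare the Monte Carlo estimate of the test MSE at a point x0 with variance + bias² + Var(ε).

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)              # true regression function
sigma = 0.3                              # noise standard deviation
x0, n, trials = 0.5, 50, 2000

preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(0, 1, n)             # a fresh training sample
    Y = f(X) + rng.normal(0, sigma, n)
    coef = np.polyfit(X, Y, deg=2)       # the estimator: a degree-2 polynomial fit
    preds[t] = np.polyval(coef, x0)      # its prediction f_hat(x0)

y0 = f(x0) + rng.normal(0, sigma, trials)             # fresh test responses at x0
mse = np.mean((y0 - preds) ** 2)                      # E[(Y - f_hat(X))^2 | X = x0]
var, bias2 = preds.var(), (preds.mean() - f(x0)) ** 2
print(mse, var + bias2 + sigma ** 2)                  # the two should nearly agree
```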
Bias-Variance Tradeoff
Two other tradeoffs:
Bias-Variance Tradeoff
Quick Review of Classification
and Nearest Neighbors
• Classification: the response Y is qualitative (categorical); predict its class from the predictors X1, …, Xp
Quick Review of Classification
and Nearest Neighbors
• Example:
Chapter 9: SVM
Separation of 2 Classes by a hyperplane
• Training set: n points (xi,1, …, xi,p), 1 ≤ i ≤ n, with n labels yi ∈ {−1, 1}, 1 ≤ i ≤ n
• A separating hyperplane (if one exists) satisfies
yi(β0 + β1 xi,1 + … + βp xi,p) > 0 for all 1 ≤ i ≤ n
Separation of 2 Classes by a
hyperplane
Example:
Separation of 2 Classes by a hyperplane
• If a separating hyperplane exists, then for a test observation x*, a classifier is obtained by the sign of
f(x*) = β0 + β1 x1* + … + βp xp*
(negative/positive sign → class −1/1)
• The magnitude of f(x*) provides confidence in the class assignment
• Distance to the hyperplane:
d(x*, Hyp.) = |β0 + Σ_{i=1..p} βi xi*| / √(Σ_{i=1..p} βi²)
Maximal Margin Classifier
Maximal Margin Classifier
• MMC is the solution of the optimization problem given below
• No explanation in book, but immediate for
a math student…
• Actual algorithm is not discussed…
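The optimization problem referred to above, as formulated in the book:

```latex
\[
\begin{aligned}
\max_{\beta_0,\beta_1,\dots,\beta_p,\, M} \quad & M \\
\text{subject to} \quad & \sum_{j=1}^{p} \beta_j^{2} = 1, \\
& y_i \bigl(\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p}\bigr) \ge M
  \quad \text{for all } i = 1,\dots,n.
\end{aligned}
\]
```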
Numerical Solution (following
A. Ng’s Cs229 notes)
• Change of notation: y(i) = yi, x(i) = (xi,1, …, xi,p)
• Recall: the distance of (x(i), y(i)) to a hyperplane wᵀx + b = 0 is |wᵀx(i) + b| / ‖w‖
Numerical Solution (following
A. Ng’s Cs229 notes)
Original problem (non-convex), and an equivalent non-convex problem via the functional margin (see the formulations collected below):
Numerical Solution (following
A. Ng’s Cs229 notes)
Scale w and b by the same constant so that min_i y(i)(wᵀx(i) + b) = 1 (this has no effect on the problem), and change to the convex problem (a quadratic program) given below:
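Collecting the formulas from the two slides above, following the CS229 notes (a reconstruction; γ is the geometric margin and γ̂ the functional margin):

```latex
% Original (non-convex) problem: maximize the geometric margin gamma
\[
\max_{\gamma,\, w,\, b}\ \gamma
\quad \text{s.t.} \quad y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \gamma,\ \ i = 1,\dots,n,
\qquad \|w\| = 1.
\]

% Equivalent (still non-convex) problem via the functional margin \hat\gamma = \gamma \|w\|
\[
\max_{\hat\gamma,\, w,\, b}\ \frac{\hat\gamma}{\|w\|}
\quad \text{s.t.} \quad y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge \hat\gamma,\ \ i = 1,\dots,n.
\]

% After rescaling w, b so that \hat\gamma = 1: a convex quadratic program
\[
\min_{w,\, b}\ \tfrac{1}{2}\|w\|^{2}
\quad \text{s.t.} \quad y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \ge 1,\ \ i = 1,\dots,n.
\]
```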
Equivalent Formulation
(following A. Ng’s Cs229 notes)
Lagrangian:
Dual:
Solution: Hence:
(used later)
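The Lagrangian, dual, and solution referred to above, again following the CS229 notes (a reconstruction):

```latex
% Lagrangian of the quadratic program above
\[
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^{2}
  - \sum_{i=1}^{n} \alpha_i \Bigl[\, y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) - 1 \Bigr],
\qquad \alpha_i \ge 0.
\]

% Dual problem
\[
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    y^{(i)} y^{(j)} \alpha_i \alpha_j \bigl\langle x^{(i)}, x^{(j)} \bigr\rangle
\quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{n} \alpha_i y^{(i)} = 0.
\]

% Solution w = \sum_i \alpha_i y^{(i)} x^{(i)}, and hence (used later)
\[
w^{\top} x + b = \sum_{i=1}^{n} \alpha_i y^{(i)} \bigl\langle x^{(i)}, x \bigr\rangle + b.
\]
```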
A Non-separable Example
Non-robustness of the
Maximal Margin Classifier
The Support Vector Classifier
• If εi=0 → correct side of the margin
• If εi>0 → wrong side of margin
• If εi>1 → wrong side of hyperplane
• The solution is affected only by the support vectors, i.e., observations that lie on the margin or on the wrong side of the margin.
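The slack variables εi above refer to the support vector classifier optimization problem, which in the book's notation reads:

```latex
\[
\begin{aligned}
\max_{\beta_0,\dots,\beta_p,\ \varepsilon_1,\dots,\varepsilon_n,\ M} \quad & M \\
\text{subject to} \quad & \sum_{j=1}^{p} \beta_j^{2} = 1, \\
& y_i \bigl(\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p}\bigr) \ge M(1 - \varepsilon_i), \\
& \varepsilon_i \ge 0, \qquad \sum_{i=1}^{n} \varepsilon_i \le C.
\end{aligned}
\]
```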
Concept Demonstration
More on the Optimization
Problem
• C controls the number of observations allowed on the wrong side of the margin
• C controls the bias-variance trade-off
• The optimizer is affected only by the support vectors
Increasing C in clockwise order:
Equivalent Formulation
(following A. Ng’s Cs229 notes)
• Dual:
• As before, wᵀx is a linear combination of the inner products <x, x(i)> (see the dual below)
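The dual referred to above, following the CS229 notes (a reconstruction; it differs from the separable case only in the box constraint on the αi, and here C is the penalty parameter of the CS229 formulation):

```latex
\[
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    y^{(i)} y^{(j)} \alpha_i \alpha_j \bigl\langle x^{(i)}, x^{(j)} \bigr\rangle
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n} \alpha_i y^{(i)} = 0.
\]
```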
Support Vector Machine (SVM)
• From linear to nonlinear boundaries by embedding into a higher-dimensional space
• The algorithm can be written in terms of dot products
• Instead of embedding into a very high-dimensional space, replace the dot products with kernels
Clarification
Clarification
More (following book)
By the solution of the SVC (recall the earlier comment), f(x) is a linear combination of the inner products <x, xi>
Only the support vectors are needed to evaluate the SVC
For SVM, replace the dot products with kernels (see below)
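The corresponding formulas, in the book's notation (S denotes the set of support-vector indices; the polynomial and radial kernels are shown as examples):

```latex
% SVC decision function as a linear combination of inner products
\[
f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i \,\bigl\langle x, x_i \bigr\rangle.
\]

% SVM: replace the inner product with a kernel, f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i K(x, x_i);
% two common choices are the polynomial and radial kernels
\[
K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\Bigr)^{d},
\qquad
K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}\Bigr).
\]
```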
Demonstration
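A minimal runnable illustration of this kind of demonstration, using scikit-learn rather than the book's R labs (my own sketch on synthetic data):

```python
import numpy as np
from sklearn.svm import SVC

# Two classes that are not linearly separable: labels depend on distance from the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1.2, 1, -1)

# Linear support vector classifier vs. SVM with a radial (RBF) kernel.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
# The radial kernel can bend the boundary around the circle; the linear one cannot.
```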
SVM for K>2 Classes
• OVO (One vs. One): for the training data, construct K(K−1)/2 classifiers with labels 1/−1, one for each pair of the K classes. For a test point, use voting (assign the class that wins the most pairwise comparisons); see the sketch below
• OVA (One vs. All): for training, construct K classifiers (one class labeled 1 vs. the rest labeled −1). For a test point x*, classify according to the largest estimated f(x*)
• OVO is preferred when K is not too large
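A short sketch of both strategies using scikit-learn's built-in wrappers (an illustration, not part of the course material):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # K(K-1)/2 = 3 classifiers
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # K = 3 classifiers
print("OVO accuracy:", ovo.score(X, y))
print("OVA accuracy:", ova.score(X, y))
```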
Chapter 8: Tree-based
Methods (or CART)
• Decision Trees for Regression
• Demonstration of predicting log(salary/1000) as a function of the number of years in the major leagues and the number of hits in the previous year
• Terminology: leaf/terminal node, internal node, branch
Chapter 8: Tree-based
Methods (or CART)
Building a Decision Tree
• We wish to find regions R1, …, RJ minimizing the RSS (residual sum of squares)
Σ_{j=1..J} Σ_{i: xi ∈ Rj} (yi − ŷRj)², where ŷRj is the mean response of the training observations in Rj
• Finding the optimal partition is computationally infeasible. Use instead recursive binary splitting (a top-down, greedy procedure)
Recursive Binary Splitting
• At each node (top to bottom), determine the predictor Xj and cutoff s minimizing
Σ_{i: xi ∈ R1(j,s)} (yi − ȳR1(j,s))² + Σ_{i: xi ∈ R2(j,s)} (yi − ȳR2(j,s))²,
where R1(j,s) = {X | Xj < s} and R2(j,s) = {X | Xj ≥ s}
Recursive Binary Splitting
• For each j = 1, …, p, determine the cutoff s minimizing
Σ_{i: xi ∈ R1(j,s)} (yi − ȳR1(j,s))² + Σ_{i: xi ∈ R2(j,s)} (yi − ȳR2(j,s))²
• Can be done by sorting the values of the j-th coordinate, checking all n−1 consecutive pairs (xi, xi+1) (O(1) operations for each after sorting), and reporting the average of xi and xi+1 for the optimal i (see the sketch below)
• Total cost is O(pn) per split (after sorting)
• We assumed continuous random variables (can be modified for discrete ones)
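A minimal sketch of the single-split search just described (my own illustration, assuming numpy arrays; for clarity it recomputes each RSS from scratch rather than using the O(1) running-sum update mentioned above):

```python
import numpy as np

def best_split(X, y):
    """Find the predictor j and cutoff s minimizing
    RSS(R1) + RSS(R2), where R1 = {Xj < s}, R2 = {Xj >= s}."""
    n, p = X.shape
    best = (np.inf, None, None)                 # (rss, j, s)
    for j in range(p):
        order = np.argsort(X[:, j])             # sort the j-th coordinate
        xs, ys = X[order, j], y[order]
        for i in range(n - 1):
            if xs[i] == xs[i + 1]:
                continue                        # no valid cutoff between equal values
            left, right = ys[: i + 1], ys[i + 1 :]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[0]:
                best = (rss, j, (xs[i] + xs[i + 1]) / 2)   # cutoff = average of neighbors
    return best
```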
More on Recursive Binary
Splitting
• The previous process is repeated until a stopping criterion is met
• Predict the response by the mean of the training observations in the region to which the test sample belongs
Tree Pruning
• Continue on page 17 of the book’s slides (trees.pdf)