Supervised Learning (Part I)
Weinan Zhang, Shanghai Jiao Tong University
http://wnzhang.net
2019 EE448, Big Data Mining, Lecture 4
http://wnzhang.net/teaching/ee448/index.html
Content of Supervised Learning
• Introduction to Machine Learning
• Linear Models
• Support Vector Machines
• Neural Networks
• Tree Models
• Ensemble Methods
Content of This Lecture
• Introduction to Machine Learning
• Linear Models
What is Machine Learning
• Learning
"Learning is any process by which a system improves performance from experience."
--- Herbert Simon
• Turing Award (1975): artificial intelligence, the psychology of human cognition
• Nobel Prize in Economics (1978): decision-making process within economic organizations
What is Machine Learning
A more mathematical definition by Tom Mitchell:
• Machine learning is the study of algorithms that
  • improve their performance P
  • at some task T
  • based on experience E
  • without explicit programming
• A well-defined learning task is given by <P, T, E>
Programming vs. Machine Learning
• Traditional Programming: a human programmer writes the Program; Input + Program → Output
• Machine Learning: Data (Input, Output) + Learning Algorithm → Program
Slide credit: Feifei Li
When Does ML Have Advantages
ML is used when:
• Models are based on a huge amount of data
  • Examples: Google web search, Facebook news feed
• Output must be customized
  • Examples: News / item / ads recommendation
• Humans cannot explain their expertise
  • Examples: Speech / face recognition, game of Go
• Human expertise does not exist
  • Examples: Navigating on Mars
Machine Learning Categories
• Supervised Learning
  • To perform the desired output given the data and labels
• Unsupervised Learning
  • To analyze and make use of the underlying data patterns/structures
• Reinforcement Learning
  • To learn a policy of taking actions in a dynamic environment and acquire rewards
Machine Learning Process
• Basic assumption: the same patterns exist across training and test data
Diagram: Raw Data → Data Formalization → Training/Test Data; the Model is trained on the Training Data and evaluated on the Test Data
Supervised Learning
• Given the training dataset of (data, label) pairs, let the machine learn a function from data to label
• The function set is called the hypothesis space
• Learning is referred to as updating the parameters
• How to learn? Update the parameters to make the prediction close to the corresponding label
  • What is the learning objective?
  • How to update the parameters?
Learning Objective
• Make the prediction close to the corresponding label
• A loss function measures the error between the label and the prediction
• The definition of the loss function depends on the data and the task
• Most popular loss function: squared loss, $L(y, f(x)) = \frac{1}{2}(y - f(x))^2$
Squared Loss
• Penalizes larger distances much more
• Accepts small distances (errors), which arise from observation noise etc., and helps generalization
Gradient Learning Methods
A Simple Example
• Observing the data, we can use different models (hypothesis spaces) to learn
  • First, model selection (linear or quadratic)
  • Then, learn the parameters
An example from Andrew Ng
Learning Linear Model - Curve
Learning Linear Model - Weights
Learning Quadratic Model
Learning Cubic Model
Model Selection
• Which model is the best?
• Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
Example 1. Linear model: underfitting; quadratic model: well fitting; 5th-order model: overfitting
Example 2. Linear model: underfitting; 4th-order model: well fitting; 15th-order model: overfitting
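As a quick illustration of this effect, here is a small Python sketch (toy quadratic data; the noise level and polynomial degrees are arbitrary choices, not from the slides) comparing test error across model orders:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = 1 + 2 * x - 3 * x**2 + 0.05 * rng.normal(size=x.size)  # noisy quadratic data

x_test = np.linspace(0, 1, 100)
y_test = 1 + 2 * x_test - 3 * x_test**2                    # noise-free ground truth
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)                      # least-squares polynomial fit
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, test_err)  # degree 1 underfits; degree 5 tends to fit the noise
```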
Regularization
• Add a penalty term on the parameters to prevent the model from overfitting the data
Typical Regularization
• L2-Norm (Ridge): $J(\theta) = \frac{1}{2}\sum_{i=1}^{N}(y_i - \theta^\top x_i)^2 + \frac{\lambda}{2}\|\theta\|_2^2$
• L1-Norm (LASSO): $J(\theta) = \frac{1}{2}\sum_{i=1}^{N}(y_i - \theta^\top x_i)^2 + \lambda\|\theta\|_1$
More Norm-Based Regularization
• Contours of constant value of $\|\theta\|_q$ for different q (Ridge: q = 2; LASSO: q = 1)
• Sparse models are learned with q no higher than 1
• q > 2 is seldom used
• Actually, 99% of cases use q = 1 or 2
Principle of Occam's Razor
Among competing hypotheses, the one with the fewest assumptions should be selected.
• Recall that the function set is called the hypothesis space
• In the regularized objective $J(\theta) = L(\theta) + \lambda\Omega(\theta)$, the first term is the original loss and the second is the penalty on assumptions
Model Selection
• An ML solution has model parameters and optimization hyperparameters
• Hyperparameters
  • Define higher-level concepts about the model, such as complexity or capacity to learn
  • Cannot be learned directly from the data in the standard model training process and need to be predefined
  • Can be decided by setting different values, training different models, and choosing the values that test better
• Model selection (or hyperparameter optimization) cares about how to select the optimal hyperparameters
Cross Validation for Model Selection
K-fold Cross Validation
1. Set hyperparameters
2. Repeat K times:
   • Randomly split the original training data into training and validation datasets
   • Train the model on the training data and evaluate it on the validation data, leading to an evaluation score
3. Average the K evaluation scores as the model performance (see the sketch below)
Diagram: Original Training Data → random split → Training Data (train Model) and Validation Data (Evaluation)
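A minimal Python/NumPy sketch of this procedure; the `train_fn`/`eval_fn` interface is a hypothetical placeholder for any model and scoring function, and the sketch uses the standard variant that splits the data once into K disjoint folds:

```python
import numpy as np

def k_fold_cv(X, y, train_fn, eval_fn, k=5, seed=0):
    """Estimate model performance by K-fold cross validation.

    Assumes hypothetical interfaces train_fn(X_tr, y_tr) -> model
    and eval_fn(model, X_va, y_va) -> score."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random split of the original training data
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        va = folds[i]                       # validation fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[tr], y[tr])      # train on the training part
        scores.append(eval_fn(model, X[va], y[va]))  # evaluate on the validation part
    return np.mean(scores)                  # average the K evaluation scores
```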
Machine Learning Process
• After selecting 'good' hyperparameters, we train the model over the whole training data, and then the model can be used on test data
Diagram: Raw Data → Data Formalization → Training/Test Data; the Model is trained on the Training Data and evaluated on the Test Data
Generalization Ability
• Generalization ability is the model's prediction capacity on unobserved data
• It can be evaluated by the generalization error, defined by
$$R(f) = \mathbb{E}_{(x,y) \sim p(x,y)}[L(y, f(x))]$$
where $p(x, y)$ is the underlying (usually unknown) joint data distribution
• The empirical estimation of the generalization error on a training dataset is
$$\hat{R}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$$
A Simple Case Study on Generalization Error
• Finite hypothesis set $F = \{f_1, f_2, \dots, f_d\}$
• Theorem of generalization error bound: for any function $f \in F$, with probability no less than $1 - \delta$, it satisfies
$$R(f) \le \hat{R}(f) + \varepsilon(d, N, \delta), \qquad \text{where } \varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\left(\log d + \log\frac{1}{\delta}\right)}$$
• N: number of training instances
• d: number of functions in the hypothesis set
Section 1.7 in Dr. Hang Li's textbook.
Content of This Lecture
• Introduction to Machine Learning
• Linear Models
  • Linear regression, linear classification, applications
Linear Regression
Linear Models for Supervised Learning
Linear Discriminative Models
• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: $y = f_\theta(x)$
  • Probabilistic: $p_\theta(y|x)$
• Focus of this course
  • Linear regression model
  • Linear classification model
Linear Regression
• One-dimensional linear & quadratic regression
Figures: linear regression; quadratic regression (a kind of generalized linear model)
Linear Regression
• Two-dimensional linear regression
Least Square Linear Regression
• Objective function to minimize:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{N}(y_i - \theta^\top x_i)^2$$
Minimize the Objective Function
• Let N = 1 for a simple case, with (x, y) = (2, 1): $J(\theta) = \frac{1}{2}(1 - 2\theta)^2$
Gradient Learning Methods
Batch Gradient Descent
• Update with the whole batch:
$$\theta \leftarrow \theta - \eta \frac{\partial J(\theta)}{\partial \theta} = \theta + \eta \sum_{i=1}^{N}(y_i - \theta^\top x_i)\, x_i$$
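A minimal NumPy sketch of batch gradient descent for least-square linear regression (the learning rate, iteration count, and toy data are arbitrary choices):

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_iters=5000):
    """Minimize J(theta) = 0.5 * ||y - X theta||^2 with full-batch updates."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ theta)   # gradient of J over the whole batch
        theta -= lr * grad              # theta <- theta - eta * grad
    return theta

# Toy usage: fit y = 1 + 2*x (bias folded in as a constant feature)
X = np.c_[np.ones(50), np.linspace(0, 1, 50)]
y = 1 + 2 * X[:, 1]
print(batch_gradient_descent(X, y))     # approximately [1., 2.]
```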
Learning Linear Model - Curve
Learning Linear Model - Weights
Stochastic Gradient Descent
• Update with every single instance:
$$\theta \leftarrow \theta + \eta\,(y_i - \theta^\top x_i)\, x_i$$
• Compared with BGD:
  • Faster learning
  • Uncertainty or fluctuation in learning
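The corresponding per-instance sketch, under the same assumptions as the batch version above:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, n_epochs=50, seed=0):
    """Per-instance updates: noisier per step, but progress begins immediately."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):       # shuffle instances each epoch
            error = y[i] - X[i] @ theta         # scalar residual for one instance
            theta += lr * error * X[i]          # theta <- theta + eta*(y_i - theta^T x_i) x_i
    return theta
```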
Linear Classification Model
Mini-Batch Gradient Descent
• A combination of batch GD and stochastic GD
• Split the whole dataset into K mini-batches
• Update with each mini-batch: for each mini-batch k, perform a one-step BGD update toward minimizing the mini-batch objective $J_k(\theta)$ (see the sketch after the next slide)
Mini-Batch Gradient Descent
• Good learning stability (from BGD)
• Good convergence rate (from SGD)
• Easy to parallelize: parallelization within a mini-batch
Diagram: a mini-batch is split across workers 1-3; gradients are computed in parallel (map) and then summed (reduce)
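A sketch of the serial (non-parallelized) version on the same least-square objective; the batch size and learning rate are arbitrary choices:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, n_epochs=50, seed=0):
    """One BGD step per mini-batch: a compromise between BGD and SGD."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]        # indices of mini-batch k
            grad = -X[b].T @ (y[b] - X[b] @ theta)   # gradient of J_k(theta)
            theta -= lr * grad                       # one-step BGD on the mini-batch
    return theta
```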
Basic Search Procedure
• Choose an initial value for $\theta$
• Update $\theta$ iteratively with the data
• Until we reach a minimum
Basic Search Procedure
• Choose a new initial value for $\theta$
• Update $\theta$ iteratively with the data
• Until we reach a minimum
Unique Minimum for Convex Objective
• Different initial parameters and different learning algorithms lead to the same optimum
Convex Set
• A convex set S is a set of points such that, given any two points A, B in that set, the line AB joining them lies entirely within S:
$$tx_1 + (1-t)x_2 \in S \quad \text{for all } x_1, x_2 \in S,\; 0 \le t \le 1$$
Figures: a convex set vs. a non-convex set
[Boyd, Stephen, and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.]
Convex Function
$f: \mathbb{R}^n \to \mathbb{R}$ is convex if $\mathrm{dom}\, f$ is a convex set and
$$f(tx_1 + (1-t)x_2) \le t f(x_1) + (1-t) f(x_2) \quad \text{for all } x_1, x_2 \in \mathrm{dom}\, f,\; 0 \le t \le 1$$
Choosing the Learning Rate
• To see if gradient descent is working, print out $J(\theta)$ every iteration (or every several iterations). If $J(\theta)$ does not drop properly, adjust $\eta$
• $\eta$ too small: slow convergence
• $\eta$ too large: increasing value of $J(\theta)$
  • May overshoot the minimum
  • May fail to converge
  • May even diverge
• The initial point may also be too far away from the optimal solution, which takes much time to converge
Slide credit: Eric Eaton
Algebra Perspective
• Prediction: $\hat{y} = X\theta$, with
$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(n)} \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & x^{(1)}_3 & \dots & x^{(1)}_d \\ x^{(2)}_1 & x^{(2)}_2 & x^{(2)}_3 & \dots & x^{(2)}_d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & x^{(n)}_3 & \dots & x^{(n)}_d \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_d \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \hat{y} = X\theta = \begin{bmatrix} x^{(1)}\theta \\ x^{(2)}\theta \\ \vdots \\ x^{(n)}\theta \end{bmatrix}$$
• Objective:
$$J(\theta) = \frac{1}{2}(y - \hat{y})^\top(y - \hat{y}) = \frac{1}{2}(y - X\theta)^\top(y - X\theta)$$
Matrix Form
• Objective:
$$J(\theta) = \frac{1}{2}(y - X\theta)^\top(y - X\theta), \qquad \min_\theta J(\theta)$$
• Gradient:
$$\frac{\partial J(\theta)}{\partial \theta} = -X^\top(y - X\theta)$$
• Solution:
$$\frac{\partial J(\theta)}{\partial \theta} = 0 \;\Rightarrow\; X^\top(y - X\theta) = 0 \;\Rightarrow\; X^\top y = X^\top X\theta \;\Rightarrow\; \theta = (X^\top X)^{-1}X^\top y$$
http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/ECE275A_Viewgraphs_5.pdf
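This closed-form solution is easy to check numerically; a small NumPy sketch on synthetic data (the true parameters and noise scale are arbitrary; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # n=100 instances, d=3 features
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)     # targets with small noise

# Closed-form least-squares solution: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                                        # close to [1.5, -2.0, 0.5]
```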
Matrix Form
• Then the predicted values are
$$\hat{y} = X(X^\top X)^{-1}X^\top y = Hy$$
where H is the hat matrix.
• Geometrical explanation: writing $X = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_d]$ by columns, the column vectors $[\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_d]$ form a subspace of $\mathbb{R}^n$, and H is a least-square projection minimizing $\|y - X\theta\|^2$.
More details refer to Sec 3.2, Hastie et al., The Elements of Statistical Learning.
$X^\top X$ Might Be Singular
• When some column vectors are not independent, for example $\mathbf{x}_2 = 3\mathbf{x}_1$, then $X^\top X$ is singular, thus $\theta = (X^\top X)^{-1}X^\top y$ cannot be directly calculated.
• Solution: regularization
$$J(\theta) = \frac{1}{2}(y - X\theta)^\top(y - X\theta) + \frac{\lambda}{2}\|\theta\|_2^2$$
Matrix Form with Regularization
• Objective:
$$J(\theta) = \frac{1}{2}(y - X\theta)^\top(y - X\theta) + \frac{\lambda}{2}\|\theta\|_2^2, \qquad \min_\theta J(\theta)$$
• Gradient:
$$\frac{\partial J(\theta)}{\partial \theta} = -X^\top(y - X\theta) + \lambda\theta$$
• Solution:
$$\frac{\partial J(\theta)}{\partial \theta} = 0 \;\Rightarrow\; -X^\top(y - X\theta) + \lambda\theta = 0 \;\Rightarrow\; X^\top y = (X^\top X + \lambda I)\theta \;\Rightarrow\; \theta = (X^\top X + \lambda I)^{-1}X^\top y$$
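A NumPy sketch showing that the regularized solution stays well-defined even in the singular case from the previous slide ($\mathbf{x}_2 = 3\mathbf{x}_1$); λ = 0.1 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.c_[x1, 3 * x1]              # second column = 3 * first, so X^T X is singular
y = x1 + 0.1 * rng.normal(size=100)

lam = 0.1                          # regularization strength lambda
d = X.shape[1]
# Ridge solution: theta = (X^T X + lambda I)^{-1} X^T y, solvable despite singular X^T X
theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta)
```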
Linear Regression with Gaussian Noise
• The probabilistic discriminative view: model $p_\theta(y|x)$ directly
Objective: Likelihood
• Gaussian observation noise: $\epsilon \sim \mathcal{N}(0, \sigma^2)$, i.e.
$$p(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\epsilon^2}{2\sigma^2}}$$
• Data likelihood:
$$p(y|x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \theta^\top x)^2}{2\sigma^2}}$$
Learning
• Maximize the data likelihood
$$\max_\theta \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}}$$
• Equivalently, maximize the data log-likelihood
$$\log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}} = \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}} = -\sum_{i=1}^{N} \frac{(y_i - \theta^\top x_i)^2}{2\sigma^2} + \text{const}$$
• This yields
$$\min_\theta \sum_{i=1}^{N} (y_i - \theta^\top x_i)^2$$
which is equivalent to least square error learning.
Linear Classification
Linear Models for Supervised Learning
Classification Problem
• Given:
  • A description of an instance, $x \in X$, where $X$ is the instance space
  • A fixed set of categories: $C = \{c_1, c_2, \dots, c_m\}$
• Determine:
  • The category of $x$: $f(x) \in C$, where $f(x)$ is a categorization function whose domain is $X$ and whose range is $C$
• If the category set is binary, i.e. $C = \{0, 1\}$ ({false, true}, {negative, positive}), then it is called binary classification
Binary Classification
Figures: a linearly separable case vs. a non-linearly separable case
Linear Discriminative Models
• Discriminative model
  • modeling the dependence of unobserved variables on observed ones
  • also called conditional models
  • Deterministic: $y = f_\theta(x)$ (non-differentiable)
  • Probabilistic: $p_\theta(y|x)$ (differentiable)
• For binary classification:
$$p_\theta(y=1|x), \qquad p_\theta(y=0|x) = 1 - p_\theta(y=1|x)$$
Loss Function
• Cross entropy:
$$H(p, q) = -\sum_x p(x) \log q(x) \;\text{(discrete case)}, \qquad H(p, q) = -\int_x p(x) \log q(x)\,dx \;\text{(continuous case)}$$
• For classification problems, the cross entropy loss is
$$L(y, x; p_\theta) = -\sum_k \delta(y = c_k) \log p_\theta(y = c_k|x), \qquad \delta(z) = \begin{cases} 1, & z \text{ is true} \\ 0, & \text{otherwise} \end{cases}$$
• Example: ground truth (one-hot) [0, 1, 0, 0, 0] vs. prediction [0.1, 0.6, 0.05, 0.05, 0.2]
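A short check of the slide's example in Python (the helper name is ours):

```python
import numpy as np

def cross_entropy(p_true, q_pred):
    """H(p, q) = -sum_x p(x) log q(x), discrete case."""
    p_true, q_pred = np.asarray(p_true), np.asarray(q_pred)
    return -np.sum(p_true * np.log(q_pred))

gt   = [0, 1, 0, 0, 0]               # one-hot ground truth from the slide
pred = [0.1, 0.6, 0.05, 0.05, 0.2]   # predicted distribution from the slide
print(cross_entropy(gt, pred))       # -log(0.6) ≈ 0.511
```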
Cross Entropy for Binary Classification
• Example: ground truth [0, 1] vs. prediction [0.3, 0.7] over class 1 and class 2
• Loss function:
$$L(y, x; p_\theta) = -\delta(y=1)\log p_\theta(y=1|x) - \delta(y=0)\log p_\theta(y=0|x) = -y \log p_\theta(y=1|x) - (1-y)\log(1 - p_\theta(y=1|x))$$
Logistic Regression
• Logistic regression is a binary classification model:
$$p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}, \qquad p_\theta(y=0|x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$$
• Cross entropy loss function:
$$L(y, x; p_\theta) = -y \log \sigma(\theta^\top x) - (1 - y)\log(1 - \sigma(\theta^\top x))$$
• Gradient, with $z = \theta^\top x$ and $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1 - \sigma(z))$:
$$\frac{\partial L(y, x; p_\theta)}{\partial \theta} = -y \frac{1}{\sigma(\theta^\top x)}\sigma(z)(1 - \sigma(z))\,x - (1 - y)\frac{-1}{1 - \sigma(\theta^\top x)}\sigma(z)(1 - \sigma(z))\,x = (\sigma(\theta^\top x) - y)\,x$$
• Update rule: $\theta \leftarrow \theta + \eta\,(y - \sigma(\theta^\top x))\,x$
Label Decision
• Logistic regression provides the probability:
$$p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}, \qquad p_\theta(y=0|x) = \frac{e^{-\theta^\top x}}{1 + e^{-\theta^\top x}}$$
• The final label of an instance is decided by setting a threshold $h$:
$$y = \begin{cases} 1, & p_\theta(y=1|x) > h \\ 0, & \text{otherwise} \end{cases}$$
Evaluation Measures
• True / False
  • True: prediction = label
  • False: prediction ≠ label
• Positive / Negative
  • Positive: predict y = 1
  • Negative: predict y = 0

| Label \ Prediction | 1 (Positive) | 0 (Negative) |
| --- | --- | --- |
| Class 1 | True Positive | False Negative |
| Class 0 | False Positive | True Negative |
Evaluation Measures
• Accuracy: the ratio of cases where prediction = label
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
Evaluation Measures
• Precision: the ratio of true class-1 cases among those predicted as 1
$$\mathrm{Prec} = \frac{TP}{TP + FP}$$
• Recall: the ratio of cases predicted as 1 among all true class-1 cases
$$\mathrm{Rec} = \frac{TP}{TP + FN}$$
Evaluation Measures
• Precision/recall tradeoff via the decision threshold $h$ (where $y = 1$ if $p_\theta(y=1|x) > h$):
  • Higher threshold: higher precision, lower recall (extreme case: threshold = 0.99)
  • Lower threshold: lower precision, higher recall (extreme case: threshold = 0)
• F1 measure:
$$F_1 = \frac{2 \times \mathrm{Prec} \times \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$$
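A small sketch computing these measures from label/prediction vectors (the toy vectors are ours):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Compute Prec, Rec, F1 from binary label and prediction vectors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

print(precision_recall_f1([1, 0, 1, 1, 0, 1, 0, 0],
                          [1, 1, 1, 0, 0, 1, 0, 0]))  # (0.75, 0.75, 0.75)
```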
Evaluation Measures
• Ranking-based measure: Area Under ROC Curve (AUC)
Figures: a perfect prediction gives AUC = 1; a random prediction gives AUC = 0.5
Evaluation Measures
• A simple example of Area Under ROC Curve (AUC)

| Prediction | Label |
| --- | --- |
| 0.91 | 1 |
| 0.85 | 0 |
| 0.77 | 1 |
| 0.72 | 1 |
| 0.61 | 0 |
| 0.48 | 1 |
| 0.42 | 0 |
| 0.33 | 0 |

ROC curve: True Positive Ratio vs. False Positive Ratio; here AUC = 0.75
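One common way to compute AUC is the pairwise-ranking view: the fraction of (positive, negative) pairs the model ranks correctly. A sketch that reproduces the slide's 0.75:

```python
import numpy as np

def auc_pairwise(scores, labels):
    """AUC = fraction of (positive, negative) pairs ranked correctly,
    with ties counted as 0.5; equivalent to the area under the ROC curve."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    correct = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (correct + 0.5 * ties) / (len(pos) * len(neg))

scores = [0.91, 0.85, 0.77, 0.72, 0.61, 0.48, 0.42, 0.33]
labels = [1, 0, 1, 1, 0, 1, 0, 0]
print(auc_pairwise(scores, labels))   # 0.75, matching the slide
```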
Multi-Class Classification
• Still cross entropy loss, e.g. ground truth [0, 1, 0] vs. prediction [0.1, 0.7, 0.2]:
$$L(y, x; p_\theta) = -\sum_k \delta(y = c_k) \log p_\theta(y = c_k|x), \qquad \delta(z) = \begin{cases} 1, & z \text{ is true} \\ 0, & \text{otherwise} \end{cases}$$
Multi-Class Logistic Regression
• Class set: $C = \{c_1, c_2, \dots, c_m\}$
• Predicting the probability $p_\theta(y = c_j|x)$ with softmax:
$$p_\theta(y = c_j|x) = \frac{e^{\theta_j^\top x}}{\sum_{k=1}^{m} e^{\theta_k^\top x}} \quad \text{for } j = 1, \dots, m$$
• Parameters: $\theta = \{\theta_1, \theta_2, \dots, \theta_m\}$, which can be normalized with m−1 groups of parameters
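A minimal sketch of the softmax computation (the parameter values are arbitrary; the max-shift is a standard numerical-stability trick, not part of the slide):

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y=c_j|x) = exp(theta_j^T x) / sum_k exp(theta_k^T x).
    Theta has shape (m, d): one parameter vector per class."""
    z = Theta @ x
    z -= z.max()                 # shift for stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

Theta = np.array([[0.5, -1.0], [0.0, 0.2], [-0.3, 0.8]])   # m=3 classes, d=2
x = np.array([1.0, 2.0])
print(softmax_probs(Theta, x))   # probabilities over the 3 classes, sums to 1
```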
Multi-Class Logistic Regression
• Learning on one instance $(x, y = c_j)$
• Maximize the log-likelihood: $\max_\theta \log p_\theta(y = c_j|x)$
• Gradient:
$$\frac{\partial \log p_\theta(y = c_j|x)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \log \frac{e^{\theta_j^\top x}}{\sum_{k=1}^{m} e^{\theta_k^\top x}} = x - \frac{\partial}{\partial \theta_j} \log \sum_{k=1}^{m} e^{\theta_k^\top x} = x - \frac{e^{\theta_j^\top x}\, x}{\sum_{k=1}^{m} e^{\theta_k^\top x}}$$
Application Case Study: Click-Through Rate (CTR) Estimation in Online Advertising
Linear Models for Supervised Learning
Ad Click-Through Rate Estimation
• Click or not? [http://news.ifeng.com]
User Response Estimation Problem
• Problem definition: given one instance of data, predict the corresponding label, i.e. click (1) or not (0), as a predicted CTR (e.g. 0.15)
• Example instance:
  • Date: 20160320, Hour: 14, Weekday: 7
  • IP: 119.163.222.*, Region: England, City: London, Country: UK
  • Ad Exchange: Google, Domain: yahoo.co.uk, URL: http://www.yahoo.co.uk/abc/xyz.html
  • OS: Windows, Browser: Chrome
  • Ad size: 300*250, Ad ID: a1890
  • User occupation: Student, User tags: Sports, Electronics
One-Hot Binary Encoding
• A standard feature engineering paradigm:
x = [Weekday=Friday, Gender=Male, City=Shanghai]
x = [0,0,0,0,1,0,0 0,1 0,0,1,0…0]
Sparse representation: x = [5:1 9:1 12:1]
• High-dimensional sparse binary feature vector
  • Usually higher than 1M dimensions, even 1B dimensions
  • Extremely sparse
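A sketch reproducing the slide's example; the field vocabularies below are assumptions chosen so the indices come out as on the slide:

```python
# Hypothetical field vocabularies (only the positions of Friday, Male,
# Shanghai are implied by the slide; the rest are made up).
fields = {
    "Weekday": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
                "Saturday", "Sunday"],
    "Gender":  ["Female", "Male"],
    "City":    ["Beijing", "London", "Shanghai", "Shenzhen"],
}

def one_hot_sparse(instance):
    """Map {field: value} to sparse 'index:1' entries with a global 1-based offset."""
    entries, offset = [], 0
    for field, vocab in fields.items():
        entries.append(f"{offset + vocab.index(instance[field]) + 1}:1")
        offset += len(vocab)
    return " ".join(entries)

x = {"Weekday": "Friday", "Gender": "Male", "City": "Shanghai"}
print(one_hot_sparse(x))   # "5:1 9:1 12:1"
```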
Training/Validation/Test Data
• Examples (in LibSVM format):
1 5:1 9:1 12:1 45:1 154:1 509:1 4089:1 45314:1 988576:1
0 2:1 7:1 18:1 34:1 176:1 510:1 3879:1 71310:1 818034:1
…
• Training/validation/test data split
  • Sort data by time
  • Train:validation:test = 8:1:1
  • Shuffle training data
Training Logistic Regression
• Logistic regression is a binary classification model:
$$p_\theta(y=1|x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$$
• Cross entropy loss function with L2 regularization:
$$L(y, x; p_\theta) = -y\log\sigma(\theta^\top x) - (1-y)\log(1 - \sigma(\theta^\top x)) + \frac{\lambda}{2}\|\theta\|_2^2$$
• Parameter learning:
$$\theta \leftarrow (1 - \lambda\eta)\theta + \eta\,(y - \sigma(\theta^\top x))\,x$$
• Only update non-zero entries
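A sketch of one such sparse update: for binary features, $\theta^\top x$ reduces to a sum over the non-zero indices, and applying the weight decay only to touched entries is a common lazy approximation (an assumption of ours, not spelled out on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_lr_update(theta, nz_idx, y, lr=0.01, lam=1e-4):
    """One SGD step of L2-regularized logistic regression on a sparse binary
    instance: theta <- (1 - lambda*eta)*theta + eta*(y - sigma(theta^T x))*x."""
    z = theta[nz_idx].sum()              # theta^T x for a binary sparse x
    err = y - sigmoid(z)
    theta[nz_idx] *= (1 - lam * lr)      # lazy weight decay on touched entries only
    theta[nz_idx] += lr * err            # gradient step on non-zero entries only
    return theta

theta = np.zeros(1_000_000)                                     # high-dimensional weights
theta = sparse_lr_update(theta, np.array([5, 9, 12]) - 1, y=1)  # 1-based -> 0-based indices
```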
Experimental Results
• Datasets
  • Criteo Terabyte Dataset
    • 13 numerical fields, 26 categorical fields
    • 7 consecutive days out of 24 days in total (about 300 GB) during 2014
    • 79.4M impressions, 1.6M clicks after negative down-sampling
  • iPinYou Dataset
    • 65 categorical fields
    • 10 consecutive days during 2013
    • 19.5M impressions, 937.7K clicks without negative down-sampling
Performance

| Model | Linearity | AUC (Criteo) | AUC (iPinYou) | Log Loss (Criteo) | Log Loss (iPinYou) |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | Linear | 71.48% | 73.43% | 0.1334 | 5.581e-3 |
| Factorization Machine | Bi-linear | 72.20% | 75.52% | 0.1324 | 5.504e-3 |
| Deep Neural Networks | Non-linear | 75.66% | 76.19% | 0.1283 | 5.443e-3 |

[Yanru Qu et al. Product-based Neural Networks for User Response Prediction. ICDM 2016.]

• Compared with non-linear models, linear models:
  • Pros: standardized, easily understood and implemented, efficient and scalable
  • Cons: modeling limits (feature-independence assumption); cannot explore feature interactions
For more machine learning materials, you can check out my machine learning course at Zhiyuan College:
CS420 Machine Learning
Course webpage: http://wnzhang.net/teaching/cs420/index.html
Weinan Zhang