Chapter 1: Introduction & Chapter 2: Overview of Supervised Learning (2006.01.20)
Supervised learning

- Training data set: several features and an outcome.
- Build a learner based on the training data set.
- Predict the unseen outcome of future data from its observed features.
An example of supervised learning: email spam

[Diagram: known normal emails and known spam are fed to a learner; the learner then classifies new (unknown) emails as normal or spam.]
Input & Output

- Input = predictor = independent variable
- Output = response = dependent variable
Output Types

- Quantitative → regression. Ex) stock price, temperature, age
- Qualitative → classification. Ex) Yes/No, spam/normal
Input Types

- Quantitative
- Qualitative
- Ordered categorical. Ex) small, medium, big
Terminology

- $X$: input vector; $X_j$: the $j$th component
- $\mathbf{X}$: matrix of inputs; $x_j$: the $j$th observed value
- $Y$: quantitative output; $\hat{Y}$: prediction
- $G$: qualitative output
General model

Given input $X$ and output $Y$, assume $Y = f(X) + \varepsilon$, where $f$ is unknown. We want to estimate $f$ based on a known data set (the training data).
Two simple methods

- Linear model (linear regression)
- Nearest-neighbor method
Linear model

Given a vector of input features $X = (X_1, \ldots, X_p)$, assume the linear relationship

$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j.$$

Least squares criterion: pick $\beta$ to minimize the residual sum of squares

$$RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2,$$

which gives $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ when $\mathbf{X}^T\mathbf{X}$ is nonsingular.
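As a concrete illustration (my own sketch, not from the slides), a minimal NumPy least squares fit; the synthetic data, noise level, and variable names here are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: N observations, p features (made up for illustration).
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = 2.0 + X @ beta_true + rng.normal(scale=0.1, size=N)

# Least squares: prepend a column of ones for the intercept, then solve
# min_beta ||y - X1 @ beta||^2. lstsq is numerically safer than forming
# (X^T X)^{-1} explicitly.
X1 = np.column_stack([np.ones(N), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Prediction for a new input x0: Y_hat = beta0_hat + x0 . beta_hat
x0 = np.array([0.2, -0.1, 0.3])
y_hat = beta_hat[0] + x0 @ beta_hat[1:]
print(beta_hat, y_hat)
```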
Classification example in two dimensions (1)

[Figure: two-class data in the plane classified by the linear model; the decision boundary is a straight line.]
Nearest-neighbor method

Majority vote within the $k$ nearest neighbors.

[Figure: a new point is classified brown for $k = 1$ and green for $k = 3$.]
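A minimal sketch of the majority-vote rule above (my own illustration; the Euclidean metric and tie-breaking behavior are assumptions not specified on the slide):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = np.bincount(y_train[nearest])             # count class labels among them
    return np.argmax(votes)                           # ties go to the smallest label

# Tiny example: two classes in the plane.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.8]), k=3))  # -> 1
```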
Classification example in two dimensions (2)

[Figure: the same two-class data classified by the nearest-neighbor method.]
Linear model vs. k-nearest neighbor

|                | Linear model             | k-nearest neighbor        |
|----------------|--------------------------|---------------------------|
| # parameters   | $p$                      | $N/k$ (effective)         |
| Fit            | stable, smooth           | unstable, wiggly          |
| Error profile  | low variance, high bias  | high variance, low bias   |

Each method has its own situations for which it works best.
Misclassification curves
Enhanced Methods

- Kernel methods using weights; modifying the distance kernels
- Locally weighted least squares
- Expansion of inputs for arbitrarily complex models
- Projection pursuit & neural networks
Statistical decision theory (1)

Given input $X \in \mathbb{R}^p$ and output $Y \in \mathbb{R}$ with joint distribution $\Pr(X, Y)$, we look for a prediction function $f(X)$. Under squared error loss $L(Y, f(X)) = (Y - f(X))^2$, the expected prediction error is

$$EPE(f) = E\,(Y - f(X))^2.$$

Minimizing EPE pointwise yields $f(x) = E(Y \mid X = x)$, the conditional expectation. Nearest-neighbor methods approximate this directly:

$$\hat f(x) = \mathrm{Ave}\,(y_i \mid x_i \in N_k(x)).$$
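A quick numeric sanity check (my own sketch, with a made-up model $Y = X^2 + \varepsilon$) that the conditional mean minimizes squared-error EPE:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100_000)
Y = X**2 + rng.normal(scale=0.3, size=X.size)   # conditional mean is x^2

epe = lambda f: np.mean((Y - f(X)) ** 2)
print(epe(lambda x: x**2))        # ~0.09, the noise variance: the conditional mean
print(epe(lambda x: x**2 + 0.2))  # larger: any other predictor does worse
```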
Statistical decision theory (2)

k-nearest neighbor: $\hat f(x) = \mathrm{Ave}\,(y_i \mid x_i \in N_k(x))$. If $N, k \to \infty$ with $k/N \to 0$, then $\hat f(x) \to E(Y \mid X = x)$.

But: insufficient samples! Curse of dimensionality!

Linear model: assume $f(x) \approx x^T \beta$; then $\beta = [E(X X^T)]^{-1} E(XY)$.

But the true function might not be linear!
Statistical decision theory (3)

If we instead use the absolute error loss $E\,|Y - f(X)|$, the solution is the conditional median, $\hat f(x) = \mathrm{median}(Y \mid X = x)$. More robust, but absolute-loss criteria are discontinuous in their derivatives.
Statistical decision theory (4)

For a categorical output variable $G$ with loss function $L$, $EPE = E[L(G, \hat G(X))]$. With 0–1 loss, minimizing pointwise gives the Bayes classifier:

$$\hat G(x) = \arg\max_{g} \Pr(g \mid X = x).$$
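A minimal sketch of the Bayes classifier for the idealized case where the class-conditional densities are known exactly; the Gaussian parameters and priors below are made up for illustration (in practice $\Pr(g \mid X = x)$ is unknown and must be estimated):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed (hypothetical) class priors and class-conditional densities.
priors = {0: 0.5, 1: 0.5}
densities = {
    0: multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    1: multivariate_normal(mean=[2.0, 2.0], cov=np.eye(2)),
}

def bayes_classify(x):
    # G_hat(x) = argmax_g Pr(g | X = x), proportional to Pr(x | g) * Pr(g).
    posteriors = {g: densities[g].pdf(x) * priors[g] for g in priors}
    return max(posteriors, key=posteriors.get)

print(bayes_classify([0.2, -0.1]))  # -> 0
print(bayes_classify([1.8, 2.2]))   # -> 1
```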
References

- Reading group on "Elements of Statistical Learning" – overview.ppt, http://sifaka.cs.uiuc.edu/taotao/stat.html
- Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf, http://www.stat.ohio-state.edu/~goel/STATLEARN/
- The Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
- A First Course in Probability
2.5 Local Methods in High Dimensions

With a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging.

The curse of dimensionality: to capture 1% of the data to form a local average in $p = 10$ dimensions, we must cover 63% of the range of each input variable. The expected edge length of a hypercube capturing a fraction $r$ of uniformly distributed data is

$$e_p(r) = r^{1/p}.$$

All sample points are close to an edge of the sample: the median distance from the origin to the closest of $N$ data points uniformly distributed in the unit ball is

$$d(p, N) = \left(1 - \tfrac{1}{2}^{1/N}\right)^{1/p}.$$
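A quick numeric check of the two formulas above (a sketch I added; the specific values of $p$, $r$, and $N$ follow the slide's 1%/63% example):

```python
import numpy as np

def edge_length(r, p):
    """Edge length of a hypercube capturing fraction r of uniform data in p dims."""
    return r ** (1.0 / p)

def median_closest_distance(p, N):
    """Median distance from the origin to the nearest of N uniform points in the unit ball."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(edge_length(0.01, p=10))               # ~0.63: 63% of each axis to capture 1% of the data
print(median_closest_distance(p=10, N=500))  # ~0.52: the nearest point is over halfway to the boundary
```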
2.5 Local Methods in High Dimensions: Example, 1-NN vs. Linear

1-NN: as $p$ increases, the MSE and its bias component tend to 1.0. For the 1-NN estimate $\hat y_0$ at a test point $x_0$, averaging over training sets $\mathcal{T}$:

$$\begin{aligned}
MSE(x_0) &= E_{\mathcal{T}}\,[f(x_0) - \hat y_0]^2 \\
&= E_{\mathcal{T}}\,[\hat y_0 - E_{\mathcal{T}}(\hat y_0)]^2 + [E_{\mathcal{T}}(\hat y_0) - f(x_0)]^2 \\
&= \mathrm{Var}_{\mathcal{T}}(\hat y_0) + \mathrm{Bias}^2(\hat y_0).
\end{aligned}$$

Linear model: at $x_0$, the expected EPE increases linearly as a function of $p$:

$$\begin{aligned}
EPE(x_0) &= E_{y_0 \mid x_0}\, E_{\mathcal{T}}\,(y_0 - \hat y_0)^2 \\
&= \mathrm{Var}(y_0 \mid x_0) + E_{\mathcal{T}}\,[\hat y_0 - E_{\mathcal{T}}\hat y_0]^2 + [E_{\mathcal{T}}\hat y_0 - x_0^T\beta]^2 \\
&= \mathrm{Var}(y_0 \mid x_0) + \mathrm{Var}_{\mathcal{T}}(\hat y_0) + \mathrm{Bias}^2(\hat y_0),
\end{aligned}$$

where the bias term is 0.

By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is substantially larger.
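A small simulation (my own reconstruction of the textbook's example setup: $f(x) = e^{-8\|x\|^2}$, $X$ uniform on $[-1,1]^p$, no noise) showing the 1-NN MSE at $x_0 = 0$ climbing toward 1.0 with dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_mse_at_origin(p, N=1000, trials=200):
    f = lambda X: np.exp(-8.0 * np.sum(X**2, axis=1))
    errs = []
    for _ in range(trials):
        X = rng.uniform(-1.0, 1.0, size=(N, p))    # fresh training set each trial
        y = f(X)                                   # noiseless responses
        nearest = np.argmin(np.sum(X**2, axis=1))  # the 1-NN of x0 = 0
        errs.append((1.0 - y[nearest]) ** 2)       # true value f(0) = 1
    return np.mean(errs)

for p in (1, 2, 5, 10):
    print(p, one_nn_mse_at_origin(p))
# MSE grows toward 1.0 as p grows: the nearest neighbor drifts away from x0.
```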
2.6 Statistical Models, Supervised Learning and Function Approximation

Goal: find a useful approximation $\hat f(x)$ to the function $f(x)$ that underlies the predictive relationship between the inputs and outputs.

- Supervised learning: the machine learning point of view.
- Function approximation: the mathematics and statistics point of view.
2.7 Structured Regression Models

Nearest-neighbor and other local methods face problems in high dimensions, and may be inappropriate even in low dimensions. Hence the need for structured approaches.

Difficulty of the problem: there are infinitely many solutions minimizing the residual sum of squares

$$RSS(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2;$$

a unique solution comes only from restrictions on $f$.
2.8 Classes of Restricted Estimators

Methods categorized by the nature of the restrictions:

- Roughness penalty and Bayesian methods: penalizing functions that vary too rapidly over small regions of input space (see the penalized RSS sketch below).
- Kernel methods and local regression: explicitly specifying the nature of the local neighborhood (the kernel function); need adaptation in high dimensions.
- Basis functions and dictionary methods: linear expansion of basis functions.
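For concreteness, the roughness-penalty idea can be written as a penalized RSS; the cubic-smoothing-spline penalty below is the standard example from the book, not something the slides spell out:

$$PRSS(f; \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\, dx,$$

where a large multiplier $\lambda$ forces $f$ toward smoother functions and $\lambda = 0$ imposes no restriction at all.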
2.9 Model Selection and the Bias-Variance Tradeoff

All models have a smoothing or complexity parameter to be determined:

- the multiplier of the penalty term,
- the width of the kernel,
- the number of basis functions.
Bias-Variance tradeoff

The error from $\varepsilon$ is essential: there is no way to reduce it. For bias and variance, reducing one tends to increase the other. Tradeoff!
Bias-Variance tradeoff in kNN

[Figure: bias-variance tradeoff for k-nearest neighbors as a function of $k$.]
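The tradeoff in the figure can be made explicit; the decomposition below is the standard one for kNN regression under the model $Y = f(X) + \varepsilon$ (from the book, not shown on the slide):

$$EPE_k(x_0) = \sigma^2 + \Bigl[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Bigr]^2 + \frac{\sigma^2}{k},$$

where $x_{(\ell)}$ are the $k$ nearest neighbors of $x_0$. Small $k$ gives low bias but a large variance term $\sigma^2/k$; large $k$ lowers the variance but raises the bias.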
Model complexity

[Figure: training error and test error curves, plotted as prediction error versus model complexity (low to high). The low-complexity end has high bias and low variance; the high-complexity end has low bias and high variance.]
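A toy illustration of the figure (my own sketch; polynomial degree stands in for model complexity, and the sine target and noise level are assumptions): training error falls monotonically with complexity, while test error is U-shaped.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(size=30);  y_tr = f(x_tr) + rng.normal(scale=0.3, size=30)
x_te = rng.uniform(size=300); y_te = f(x_te) + rng.normal(scale=0.3, size=300)

for degree in (1, 3, 5, 9, 12):
    coef = np.polyfit(x_tr, y_tr, degree)  # least squares polynomial fit
    train_err = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    test_err = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
# Training error keeps shrinking; test error improves, then worsens (overfitting).
```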