MLHEP 2015: Introductory Lecture #2
Transcript of MLHEP 2015: Introductory Lecture #2
![Page 1: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/1.jpg)
MACHINE LEARNING IN HIGH ENERGY PHYSICS
LECTURE #2
Alex Rogozhnikov, 2015
![Page 2: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/2.jpg)
RECAPITULATION
- classification, regression
- kNN classifier and regressor
- ROC curve, ROC AUC
![Page 3: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/3.jpg)
OPTIMAL BAYESIAN CLASSIFIER
Given knowledge about the distributions, we can build the optimal classifier:

$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$

But the distributions are complex and contain many parameters.
![Page 4: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/4.jpg)
QDA
QDA follows the generative approach: each class is modelled with a multivariate Gaussian distribution, and the Bayes rule above is applied.
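As an illustration of the generative approach, here is a minimal QDA sketch in plain NumPy (the function names `fit_qda` and `predict_qda` are mine, not from the lecture): fit one Gaussian per class and classify by the larger $p(y)\,p(x \mid y)$.

```python
import numpy as np

def fit_qda(X, y):
    # estimate per-class prior, mean and covariance (generative approach)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return params

def log_gauss(x, mean, cov):
    # log-density of a multivariate Gaussian at point x
    d = x - mean
    return -0.5 * (np.log(np.linalg.det(2 * np.pi * cov)) + d @ np.linalg.inv(cov) @ d)

def predict_qda(params, x):
    # pick the class maximizing log p(y) + log p(x | y)
    scores = {c: np.log(p) + log_gauss(x, m, S) for c, (p, m, S) in params.items()}
    return max(scores, key=scores.get)
```

Unlike the real optimal Bayesian classifier, this only works well if the class distributions are indeed close to Gaussian.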
![Page 5: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/5.jpg)
LOGISTIC REGRESSION
Decision function: $d(x) = \langle w, x \rangle + w_0$

Sharp rule: $\hat{y} = \mathrm{sgn}\, d(x)$
![Page 6: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/6.jpg)
LOGISTIC REGRESSION
Smooth rule:

$$d(x) = \langle w, x \rangle + w_0$$
$$p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$

Optimizing the weights $w, w_0$ to maximize the log-likelihood:

$$\mathcal{L} = -\frac{1}{N} \sum_{i \in \text{events}} \ln\big(p_{y_i}(x_i)\big) = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$
![Page 7: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/7.jpg)
LOGISTIC LOSS
Loss penalty for a single observation:

$$L(x_i, y_i) = -\ln\big(p_{y_i}(x_i)\big) = \begin{cases} \ln\big(1 + e^{-d(x_i)}\big), & y_i = +1 \\ \ln\big(1 + e^{+d(x_i)}\big), & y_i = -1 \end{cases}$$
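The two cases collapse to $\ln(1 + e^{-y_i d(x_i)})$, which is convenient in code. A small NumPy check (a sketch; the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(d, y):
    # single-observation penalty; y in {-1, +1}, d = decision function value
    return np.log1p(np.exp(-y * d))
```

Note that `logistic_loss(d, +1)` equals $-\ln \sigma(d) = -\ln p_{+1}(x)$, matching the definition above.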
![Page 8: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/8.jpg)
GRADIENT DESCENT & STOCHASTIC OPTIMIZATION
Problem: find $w$ to minimize $\mathcal{L}$.

$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$

$\eta$ is the step size (also `shrinkage`, `learning rate`).
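A minimal full-batch gradient descent for the logistic loss (my sketch, not lecture code; `eta` and `n_iter` defaults are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_gd(X, y, eta=0.5, n_iter=300):
    # X: (N, d) with a constant column appended for w0; y in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # gradient of (1/N) * sum_i ln(1 + exp(-y_i <w, x_i>))
        grad = -(X.T @ (y * sigmoid(-y * (X @ w)))) / len(y)
        w = w - eta * grad
    return w
```

On linearly separable data the weights keep growing (the loss never reaches zero), which is one reason regularization is added later in the lecture.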
![Page 9: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/9.jpg)
STOCHASTIC GRADIENT DESCENT

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$

On each iteration, make a step with respect to only one event:

1. take a random event $i$ from the training data
2. $w \leftarrow w - \eta \dfrac{\partial L(x_i, y_i)}{\partial w}$

Each iteration is done much faster, but the training process is less stable.
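The stochastic variant, one event per step (a sketch; shuffling events each epoch and the `seed` parameter are my additions for reproducibility):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_sgd(X, y, eta=0.1, n_epochs=50, seed=0):
    # X: (N, d) with a constant column appended for w0; y in {-1, +1}
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):   # visit events in random order
            # gradient of L(x_i, y_i) = ln(1 + exp(-y_i <w, x_i>))
            w = w + eta * y[i] * sigmoid(-y[i] * (X[i] @ w)) * X[i]
    return w
```

Each update touches a single row of `X`, so one pass is as cheap as one full-batch gradient step, at the cost of noisier convergence.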
![Page 10: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/10.jpg)
POLYNOMIAL DECISION RULE

$$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$$

This is again a linear model; introduce new features

$$z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}$$

and reuse logistic regression:

$$d(x) = \sum_i w_i z_i$$

We can add $x_0 = 1$ as one more variable to the dataset and forget about the intercept:

$$d(x) = w_0 + \sum_{i=1}^N w_i x_i = \sum_{i=0}^N w_i x_i$$
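Building the feature vector $z$ explicitly (a sketch; `poly_features` is my name, and only products with $i \le j$ are kept since $x_i x_j = x_j x_i$):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X):
    # z = {1} ∪ {x_i}_i ∪ {x_i x_j}_{i<=j}: constant, linear and quadratic terms
    cols = [np.ones(len(X))]
    cols += [X[:, i] for i in range(X.shape[1])]
    cols += [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(X.shape[1]), 2)]
    return np.column_stack(cols)
```

Feeding `poly_features(X)` to any linear classifier yields a quadratic decision rule in the original features.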
![Page 11: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/11.jpg)
PROJECTING INTO A HIGHER-DIMENSIONAL SPACE
SVM with polynomial kernel (visualization)
After adding new features, classes may become separable.
![Page 12: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/12.jpg)
KERNEL TRICK
$P$ is a projection operator (which adds new features):

$$d(x) = \langle w, P(x) \rangle$$

Assume $w = \sum_i \alpha_i P(x_i)$ and look for the optimal $\alpha_i$:

$$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle = \sum_i \alpha_i K(x_i, x)$$

We need only the kernel:

$$K(x, y) = \langle P(x), P(y) \rangle$$
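The kernelized decision function, written directly from the formula above (a sketch with the RBF kernel of the next slide as the example; all names are mine):

```python
import numpy as np

def rbf_kernel(x, y, c=1.0):
    # K(x, y) = exp(-c * ||x - y||^2)
    return np.exp(-c * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def kernel_decision(alphas, X_train, x, c=1.0):
    # d(x) = sum_i alpha_i * K(x_i, x): no explicit projection P is needed
    return sum(a * rbf_kernel(xi, x, c) for a, xi in zip(alphas, X_train))
```

This is the whole point of the trick: `kernel_decision` never materializes $P(x)$, even when the corresponding projection is infinite-dimensional.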
![Page 13: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/13.jpg)
KERNEL TRICK
A popular kernel is the Gaussian Radial Basis Function:

$$K(x, y) = e^{-c\,||x - y||^2}$$

It corresponds to a projection into a Hilbert space.
Exercise: find a corresponding projection.
![Page 14: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/14.jpg)
SUPPORT VECTOR MACHINE
SVM selects the decision rule with the maximal possible margin.
![Page 15: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/15.jpg)
HINGE LOSS FUNCTION
SVM uses a different loss function (only signal losses are compared):
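For comparison with the logistic loss, a sketch of the hinge loss (the function name is mine):

```python
import numpy as np

def hinge_loss(d, y):
    # max(0, 1 - y*d): zero penalty once the event is on the correct side
    # of the boundary with margin at least 1
    return np.maximum(0.0, 1.0 - y * d)
```

Unlike the logistic loss, the penalty is exactly zero beyond the margin, which is why only the support vectors influence the SVM solution.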
![Page 16: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/16.jpg)
SVM + RBF KERNEL
![Page 17: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/17.jpg)
SVM + RBF KERNEL
![Page 18: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/18.jpg)
OVERFITTING
kNN with k=1 gives ideal classification of the training data.
![Page 19: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/19.jpg)
OVERFITTING
![Page 20: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/20.jpg)
There are two definitions of overfitting, which often coincide.

DIFFERENCE-OVERFITTING
There is a significant difference in quality of predictions between train and test.

COMPLEXITY-OVERFITTING
The formula has too high complexity (e.g. too many parameters); increasing the number of parameters drives quality lower.
![Page 21: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/21.jpg)
MEASURING QUALITY
To get an unbiased estimate, one should test the formula on independent samples (and be sure that no information about them was given to the algorithm during training).
In most cases, simply splitting the data into train and holdout is enough.
More approaches in the seminar.
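The simple train/holdout split can be sketched as follows (function and parameter names are mine):

```python
import numpy as np

def train_holdout_split(X, y, holdout_fraction=0.3, seed=0):
    # shuffle the indices, reserve a fraction of events for the holdout
    idx = np.random.RandomState(seed).permutation(len(y))
    n_holdout = int(len(y) * holdout_fraction)
    holdout, train = idx[:n_holdout], idx[n_holdout:]
    return X[train], y[train], X[holdout], y[holdout]
```

Shuffling before splitting matters: event order in real datasets is rarely random (e.g. runs recorded at different times).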
![Page 22: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/22.jpg)
Difference-overfitting is inessential, provided that we measure quality on a holdout (and it is easy to check).
Complexity-overfitting is a problem: we need to test different parameters for optimality (more examples through the course).
Don't use distribution comparison to detect overfitting.
![Page 23: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/23.jpg)
![Page 24: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/24.jpg)
REGULARIZATION
When the number of weights is high, overfitting is very probable.
Add a regularization term to the loss function:

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{\text{reg}} \to \min$$

$L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2$

$L_1$ regularization: $\mathcal{L}_{\text{reg}} = \beta \sum_j |w_j|$

$L_1 + L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
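The three penalties as code, following the formulas above term by term (a sketch; function names are mine):

```python
import numpy as np

def l2_reg(w, alpha):
    # alpha * sum_j |w_j|^2
    return alpha * np.sum(np.abs(w) ** 2)

def l1_reg(w, beta):
    # beta * sum_j |w_j|
    return beta * np.sum(np.abs(w))

def l1_l2_reg(w, alpha, beta):
    # elastic combination of both penalties
    return l2_reg(w, alpha) + l1_reg(w, beta)
```

In training, the chosen penalty is simply added to the data loss before taking the gradient.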
![Page 25: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/25.jpg)
$L_2$, $L_1$ REGULARIZATIONS
$L_2$ regularization; $L_1$ (solid), $L_1 + L_2$ (dashed)
![Page 26: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/26.jpg)
REGULARIZATIONS
$L_1$ regularization encourages sparsity.
![Page 27: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/27.jpg)
![Page 28: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/28.jpg)
$L_p$ REGULARIZATIONS

$$\mathcal{L}_p = \sum_i w_i^p$$

What is the expression for $L_0$?

$$\mathcal{L}_0 = \sum_i [w_i \neq 0]$$

But nobody uses $L_p$ with $0 < p < 1$, or even $L_0$. Why? Because it is not convex.
![Page 29: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/29.jpg)
LOGISTIC REGRESSION
- classifier based on a linear decision rule
- training is reduced to convex optimization
- other decision rules are achieved by adding new features
- stochastic optimization is used
- can handle > 1000 features, but requires regularization
- no interaction between features
![Page 30: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/30.jpg)
[ARTIFICIAL] NEURAL NETWORKS
Based on our understanding of natural neural networks:
- neurons are organized in networks
- receptors activate some neurons, those neurons activate other neurons, etc.
- connections are via synapses
![Page 31: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/31.jpg)
STRUCTURE OF ARTIFICIAL FEED-FORWARD NETWORK
![Page 32: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/32.jpg)
ACTIVATION OF A NEURON
Neuron states:

$$n = \begin{cases} 1, & \text{activated} \\ 0, & \text{not activated} \end{cases}$$

Let $n_i$ be the state of the $i$-th neuron and $w_i$ the weight of the connection between the $i$-th neuron and the output neuron:

$$n = \begin{cases} 1, & \sum_i w_i n_i > 0 \\ 0, & \text{otherwise} \end{cases}$$

Problem: find the set of weights that minimizes the error on the train dataset (discrete optimization).
![Page 33: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/33.jpg)
SMOOTH ACTIVATIONS: ONE HIDDEN LAYER

$$h_i = \sigma\Big(\sum_j w_{ij} x_j\Big)$$

$$y_i = \sigma\Big(\sum_j v_{ij} h_j\Big)$$
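The two formulas above are just matrix-vector products followed by an elementwise sigmoid; a one-hidden-layer forward pass in NumPy (a sketch; bias terms are omitted, as on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, V):
    # h = sigma(W x): hidden layer;  y = sigma(V h): output layer
    h = sigmoid(W @ x)
    return sigmoid(V @ h)
```

Training amounts to running gradient descent on `W` and `V` through this composition (backpropagation is just the chain rule applied to it).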
![Page 34: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/34.jpg)
VISUALIZATION OF NN
![Page 35: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/35.jpg)
NEURAL NETWORKS
- Powerful general-purpose algorithm for classification and regression
- Non-interpretable formula
- Optimization problem is non-convex, has local optima and many parameters
- Stochastic optimization speeds up the process and helps avoid getting caught in a local minimum
- Overfitting due to the large number of parameters: $L_1$, $L_2$ regularizations (and other tricks)
![Page 36: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/36.jpg)
x MINUTES BREAK
![Page 37: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/37.jpg)
DEEP LEARNING
The gradient diminishes as the number of hidden layers grows, so usually 1-2 hidden layers are used.
But modern ANNs for image recognition have 7-15 layers.
![Page 38: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/38.jpg)
![Page 39: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/39.jpg)
CONVOLUTIONAL NEURAL NETWORK
![Page 40: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/40.jpg)
![Page 41: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/41.jpg)
DECISION TREES
Example: predict outside play based on weather conditions.
![Page 42: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/42.jpg)
![Page 43: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/43.jpg)
DECISION TREES: IDEA
![Page 44: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/44.jpg)
DECISION TREES
![Page 45: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/45.jpg)
DECISION TREES
![Page 46: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/46.jpg)
DECISION TREE
- fast & intuitive prediction
- building the optimal decision tree is NP-complete
- instead, the tree is built from the root using greedy optimization:
  - each time we split one leaf, finding the optimal feature and threshold
  - we need a criterion to select the best split (feature, threshold)
![Page 47: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/47.jpg)
SPLITTING CRITERIA

$$\text{TotalImpurity} = \sum_{\text{leaf}} \text{impurity}(\text{leaf}) \times \text{size}(\text{leaf})$$

$$\begin{aligned} \text{Misclass.} &= \min(p, 1 - p) \\ \text{Gini} &= p(1 - p) \\ \text{Entropy} &= -p \log p - (1 - p) \log(1 - p) \end{aligned}$$
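The three impurities as functions of the signal fraction $p$ in a leaf (a sketch; natural log is used for the entropy, and the clipping is my addition to avoid $\log 0$):

```python
import numpy as np

def misclass(p):
    # min(p, 1 - p)
    return np.minimum(p, 1 - p)

def gini(p):
    # p * (1 - p)
    return p * (1 - p)

def entropy(p):
    # -p log p - (1 - p) log(1 - p), clipped to avoid log(0)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)
```

All three vanish for pure leaves ($p = 0$ or $p = 1$) and peak at $p = 0.5$; they differ in how strongly they reward nearly-pure splits, which is the subject of the next slide.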
![Page 48: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/48.jpg)
SPLITTING CRITERIA
Why use Gini or entropy rather than misclassification?
![Page 49: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/49.jpg)
REGRESSION TREE
Greedy optimization (minimizing MSE):

$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$

Can be rewritten as:

$$\text{GlobalMSE} \sim \sum_{\text{leaf}} \text{MSE}(\text{leaf}) \times \text{size}(\text{leaf})$$

$\text{MSE}(\text{leaf})$ is like the 'impurity' of the leaf:

$$\text{MSE}(\text{leaf}) = \frac{1}{\text{size}(\text{leaf})} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$$
![Page 50: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/50.jpg)
![Page 51: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/51.jpg)
![Page 52: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/52.jpg)
![Page 53: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/53.jpg)
![Page 54: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/54.jpg)
![Page 55: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/55.jpg)
In most cases, regression trees optimize MSE:

$$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$

But other options also exist, e.g. MAE:

$$\text{GlobalMAE} \sim \sum_i |y_i - \hat{y}_i|$$

For MAE, the optimal value in a leaf is the median, not the mean.
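A quick numeric check of the last claim: scan candidate leaf values on a grid and see which minimizes each loss (example data is mine):

```python
import numpy as np

ys = np.array([1.0, 2.0, 10.0])            # targets falling into one leaf
candidates = np.linspace(0.0, 12.0, 1201)  # grid of candidate leaf predictions

mse = [(np.sum((ys - c) ** 2), c) for c in candidates]
mae = [(np.sum(np.abs(ys - c)), c) for c in candidates]

best_mse_value = min(mse)[1]   # lands near the mean
best_mae_value = min(mae)[1]   # lands at the median
```

The outlier at 10 drags the MSE-optimal value toward itself, while the MAE-optimal value stays at the median; this is why MAE-based trees are more robust to outliers.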
![Page 56: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/56.jpg)
DECISION TREES: INSTABILITY
Little variations in the training dataset produce different classification rules.
![Page 57: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/57.jpg)
PRE-STOPPING OF DECISION TREE
Without stopping, the tree keeps splitting until each event is correctly classified.
![Page 58: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/58.jpg)
PRE-STOPPING
We can stop the splitting process by imposing different restrictions:
- limit the depth of the tree
- set the minimal number of samples needed to split a leaf
- limit the minimal number of samples in a leaf
- more advanced: limit the maximal number of leaves in the tree

Any combination of the rules above is possible.
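A sketch of the greedy split search with one of these pre-stopping rules (`min_samples_leaf`) built in; all names are mine, and only a single feature is searched for brevity:

```python
import numpy as np

def best_split(x, y, min_samples_leaf=1):
    # exhaustive search over thresholds, minimizing total Gini impurity;
    # x: one feature column, y: binary labels in {0, 1}
    def weighted_impurity(labels):
        p = np.mean(labels)
        return p * (1 - p) * len(labels)    # impurity(leaf) * size(leaf)

    best_score, best_threshold = np.inf, None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        if min(len(left), len(right)) < min_samples_leaf:
            continue                         # pre-stopping: a leaf would be too small
        score = weighted_impurity(left) + weighted_impurity(right)
        if score < best_score:
            best_score, best_threshold = score, t
    return best_threshold
```

Tightening `min_samples_leaf` forbids splits that isolate a handful of events, trading training accuracy for stability.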
![Page 59: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/59.jpg)
(panels: no pre-pruning; max_depth; min # of samples in leaf; maximal number of leaves)
![Page 60: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/60.jpg)
POST-PRUNING
When the tree is already built, we can try to optimize it to simplify the formula.
Generally, this is much slower than pre-stopping.
![Page 61: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/61.jpg)
![Page 62: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/62.jpg)
![Page 63: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/63.jpg)
SUMMARY OF DECISION TREES
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclassification

But:

1. Training an optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Non-stable
4. Uses only trivial conditions
![Page 64: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/64.jpg)
MISSING VALUES IN DECISION TREES
If the event being predicted lacks $x_1$, we use prior probabilities.
![Page 65: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/65.jpg)
FEATURE IMPORTANCES
Different approaches exist to measure the importance of a feature in the final model.
Importance of a feature is not the same as the quality provided by that feature alone.
![Page 66: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/66.jpg)
FEATURE IMPORTANCES
- tree: count the number of splits made over this feature
- tree: count the gain in purity (e.g. Gini); fast and adequate
- common recipe: train without one feature, then compare quality on the test set with and without it (requires many evaluations)
- common recipe: feature shuffling: take one column in the test dataset and shuffle it, then compare quality with and without shuffling
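The shuffling recipe as code (a sketch; `predict` stands for any fitted model's prediction function, and the names are mine):

```python
import numpy as np

def shuffling_importance(predict, X_test, y_test, column, seed=0):
    # quality drop after shuffling one column of the test dataset
    base_quality = np.mean(predict(X_test) == y_test)
    X_shuffled = X_test.copy()
    np.random.RandomState(seed).shuffle(X_shuffled[:, column])
    return base_quality - np.mean(predict(X_shuffled) == y_test)
```

No retraining is needed, which is the recipe's main advantage over the train-without-one-feature approach.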
![Page 67: MLHEP 2015: Introductory Lecture #2](https://reader030.fdocuments.in/reader030/viewer/2022012923/58aba9f51a28abdf3c8b5f0d/html5/thumbnails/67.jpg)
THE END
Tomorrow: ensembles and boosting.