Machine Learning in High Energy Physics
Lectures 5 & 6
Alex Rogozhnikov
Lund, MLHEP 2016
Linear models: linear regression
Minimizing MSE:
$d(x) = \langle w, x \rangle + w_0$

$\mathcal{L} = \frac{1}{N} \sum_i L_{mse}(x_i, y_i) \to \min$

$L_{mse}(x_i, y_i) = (d(x_i) - y_i)^2$
Linear models: logistic regression
Minimizing logistic loss:
$d(x) = \langle w, x \rangle + w_0, \qquad y_i = \pm 1$

$\mathcal{L} = \sum_i L_{logistic}(x_i, y_i) \to \min$

Penalty for a single observation: $L_{logistic}(x_i, y_i) = \ln(1 + e^{-y_i d(x_i)})$
Linear models: support vector machine (SVM)
$L_{hinge}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$

Within the margin ($y_i d(x_i) > 1$) there is no penalty.
Kernel trick

We can project data into a higher-dimensional space, e.g. by adding new features. Hopefully, the distributions are separable in the new space.
Kernel trick

$P$ is a projection operator:

$w = \sum_i \alpha_i P(x_i), \qquad d(x) = \langle w, P(x) \rangle_{new}$

We only need the kernel:

$d(x) = \sum_i \alpha_i K(x_i, x), \qquad K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{new}$

Popular choices: polynomial kernel and RBF kernel.
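As a quick illustration, a minimal sketch with scikit-learn's SVC (the toy dataset and gamma value are placeholders):

    import numpy as np
    from sklearn.svm import SVC

    # toy data: two classes, not linearly separable in the original space
    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

    # RBF kernel: K(x, x~) = exp(-gamma * ||x - x~||^2)
    clf = SVC(kernel='rbf', gamma=1.0).fit(X, y)
    print(clf.score(X, y))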
Regularizations
$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{reg} \to \min$

$L_2$ regularization: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2$

$L_1$ regularization: $\mathcal{L}_{reg} = \beta \sum_j |w_j|$

$L_1 + L_2$: $\mathcal{L}_{reg} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
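In scikit-learn these correspond to Ridge, Lasso and ElasticNet (note that its alpha is the overall regularization strength, not the same symbol as above):

    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    ridge = Ridge(alpha=1.0)                       # L2 penalty
    lasso = Lasso(alpha=0.1)                       # L1 penalty (zeroes out weights)
    elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mix of L1 and L2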
Stochastic optimization methods

Stochastic gradient descent: take a random event $i$ from the training data and update

$w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$

(can be applied to additive loss functions)
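A minimal sketch of one SGD epoch for linear regression with MSE (numpy only; the learning rate and data are placeholders):

    import numpy as np

    def sgd_epoch(w, w0, X, y, eta=0.01):
        """One pass over the data in random order; MSE loss, linear model."""
        for i in np.random.permutation(len(X)):
            d = X[i] @ w + w0          # prediction d(x_i)
            grad = 2 * (d - y[i])      # dL/dd for MSE
            w -= eta * grad * X[i]     # dL/dw = dL/dd * x_i
            w0 -= eta * grad
        return w, w0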
Decision trees

building an optimal tree is NP-hard; heuristic: use greedy optimization
optimization criteria (impurities): misclassification, Gini, entropy
Decision trees for regression

Optimizing MSE, the prediction inside a leaf is constant.
Overfitting in decision trees

pre-stopping
post-pruning
unstable to changes in the training dataset
Random Forest

Many trees built independently:
bagging of samples
subsampling of features

Simple voting is used to get the prediction of the ensemble.
Random Forest

overfitted (in the sense that predictions for train and test are different)
doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier
Random Forest

simple and parallelizable
doesn't require much tuning
hardly interpretable, but feature importances can be computed
doesn't fix samples poorly classified at previous stages
Ensembles

Averaging decision functions: $D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$

Weighted decision: $D(x) = \sum_j \alpha_j d_j(x)$
Sample weights in ML

Can be used with many estimators. We now have triples $x_i, y_i, w_i$ ($i$ is the index of an event):

weight corresponds to the frequency of observation
expected behavior: $w_i = n$ is the same as having $n$ copies of the $i$-th event
global normalization of weights doesn't matter

Example for logistic regression:

$\mathcal{L} = \sum_i w_i L(x_i, y_i) \to \min$
Weights (parameters) of a classifier ≠ sample weights

In code:

    tree = DecisionTreeClassifier(max_depth=4)
    tree.fit(X, y, sample_weight=weights)

Sample weights are a convenient way to regulate the importance of training events.

Only sample weights are meant when talking about AdaBoost.
AdaBoost [Freund, Schapire, 1995]

Bagging: information from previous trees is not taken into account.

Adaptive Boosting is a weighted composition of weak learners:

$D(x) = \sum_j \alpha_j d_j(x)$

We assume $d_j(x) = \pm 1$ and labels $y_i = \pm 1$;
the $j$-th weak learner misclassified the $i$-th event iff $y_i d_j(x_i) = -1$.
AdaBoost

$D(x) = \sum_j \alpha_j d_j(x)$

Weak learners are built in sequence; each classifier is trained using different weights:

initially $w_i = 1$ for each training sample

After building the $j$-th base classifier:

1. compute the total weights of correctly and wrongly classified events and take
   $\alpha_j = \frac{1}{2} \ln \left( \frac{w_{correct}}{w_{wrong}} \right)$
2. increase the weights of misclassified samples:
   $w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$
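A minimal sketch of this procedure (assuming labels are ±1 and using depth-1 trees as weak learners; not the scikit-learn implementation):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, n_estimators=50):
        """y must be +-1; returns weak learners and their coefficients."""
        w = np.ones(len(X))                        # initially w_i = 1
        learners, alphas = [], []
        for _ in range(n_estimators):
            d = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = d.predict(X)
            alpha = 0.5 * np.log(w[pred == y].sum() / w[pred != y].sum())
            w = w * np.exp(-alpha * y * pred)      # increases weights of mistakes
            learners.append(d)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(X, learners, alphas):
        return np.sign(sum(a * d.predict(X) for d, a in zip(learners, alphas)))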
AdaBoost example

Decision trees of depth 1 will be used.

(1, 2, 3, 100 trees)
AdaBoost secret

$D(x) = \sum_j \alpha_j d_j(x)$

$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \exp(-y_i D(x_i)) \to \min$

sample weight is equal to the penalty for an event: $w_i = L(x_i, y_i) = \exp(-y_i D(x_i))$

$\alpha_j$ is obtained as a result of analytical optimization.
Exercise: prove the formula for $\alpha_j$.
Loss function of AdaBoost
AdaBoost summary

is able to combine many weak learners
takes mistakes into account
simple; the overhead for boosting is negligible
too sensitive to outliers
In scikit-learn, one can run AdaBoost over other algorithms.
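For instance (in scikit-learn versions of that time the argument is named base_estimator; it was later renamed to estimator):

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.linear_model import LogisticRegression

    # any base estimator supporting sample_weight in fit() will do
    clf = AdaBoostClassifier(base_estimator=LogisticRegression(), n_estimators=100)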
Gradient Boosting
Decision trees for regression
Gradient boosting to minimize MSE

Say, we're trying to build an ensemble to minimize MSE:

$\sum_i (D(x_i) - y_i)^2 \to \min$

The ensemble's prediction is obtained by taking the sum of the estimators:

$D(x) = \sum_j d_j(x), \qquad D_j(x) = \sum_{j'=1}^{j} d_{j'}(x) = D_{j-1}(x) + d_j(x)$

Assuming that we have already built $j - 1$ estimators, how do we train the next one?
Natural solution is to greedily minimize MSE:

$\sum_i (D_j(x_i) - y_i)^2 = \sum_i (D_{j-1}(x_i) + d_j(x_i) - y_i)^2 \to \min$

Introduce the residual $R_j(x_i) = y_i - D_{j-1}(x_i)$; now we simply need to minimize MSE:

$\sum_i (d_j(x_i) - R_j(x_i))^2 \to \min$

So the $j$-th estimator (tree) is trained using the following data: $\{ x_i, R_j(x_i) \}$
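A minimal sketch of this loop (depth-2 trees, no learning rate; X, y are placeholders for the training data):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gb_mse_fit(X, y, n_estimators=100):
        """Each new tree fits the residuals left by the previous ones."""
        trees, prediction = [], np.zeros(len(X))
        for _ in range(n_estimators):
            residual = y - prediction          # R_j(x_i) = y_i - D_{j-1}(x_i)
            tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
            prediction += tree.predict(X)      # D_j = D_{j-1} + d_j
            trees.append(tree)
        return trees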
Example: regression with GB

using regression trees of depth = 2
(number of trees = 1, 2, 3, 100)
Gradient Boosting visualization
Gradient Boosting [Friedman, 1999]

composition of weak regressors:

$D(x) = \sum_j \alpha_j d_j(x)$

Borrow an approach to encode probabilities from logistic regression:

$p_{+1}(x) = \sigma(D(x)), \qquad p_{-1}(x) = \sigma(-D(x))$

Optimization of the log-likelihood ($y_i = \pm 1$):

$\mathcal{L} = \sum_i L(x_i, y_i) = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$
Gradient Boosting

Optimization problem: find all $\alpha_j$ and weak learners $d_j$:

$\mathcal{L} = \sum_i \ln(1 + e^{-y_i D(x_i)}) \to \min$

Mission impossible.

Main point: greedy optimization of the loss function by training one more weak learner $d_j$.
Each new estimator follows the gradient of the loss function.
Gradient Boosting

Gradient boosting ~ steepest gradient descent.

$D_j(x) = \sum_{j'=1}^{j} \alpha_{j'} d_{j'}(x), \qquad D_j(x) = D_{j-1}(x) + \alpha_j d_j(x)$

At the $j$-th iteration:

compute the pseudo-residual $R(x_i) = - \left. \frac{\partial \mathcal{L}}{\partial D(x_i)} \right|_{D(x) = D_{j-1}(x)}$
train a regressor $d_j$ to minimize MSE: $\sum_i (d_j(x_i) - R(x_i))^2 \to \min$
find the optimal $\alpha_j$

Important exercise: compute pseudo-residuals for the MSE and logistic losses.
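For reference, a numpy/scipy sketch of the pseudo-residuals for both losses (try deriving them by hand first):

    import numpy as np
    from scipy.special import expit  # logistic sigmoid

    def pseudo_residual_mse(y, D):
        # L = (D - y)^2  =>  -dL/dD = 2 * (y - D)
        return 2 * (y - D)

    def pseudo_residual_logloss(y, D):
        # L = ln(1 + exp(-y * D))  =>  -dL/dD = y * sigma(-y * D)
        return y * expit(-y * D)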
Additional GB tricks

to make training more stable, add a learning rate $\eta$:

$D(x) = \eta \sum_j \alpha_j d_j(x)$

randomization to fight noise and build different trees:
subsampling of features
subsampling of training samples
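In scikit-learn these tricks map to the following parameters (values here are illustrative):

    from sklearn.ensemble import GradientBoostingClassifier

    # learning_rate shrinks each tree's contribution;
    # subsample / max_features randomize events and features per tree
    clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                     subsample=0.5, max_features='sqrt')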
AdaBoost is a particular case of gradient boosting with a different target loss function*:

$\mathcal{L}_{ada} = \sum_i e^{-y_i D(x_i)} \to \min$

This loss function is called ExpLoss or AdaLoss.

*(also AdaBoost expects that $d_j(x_i) = \pm 1$)
Loss functions

Gradient boosting can optimize different smooth loss functions.

regression, $y \in \mathbb{R}$:
Mean Squared Error $\sum_i (d(x_i) - y_i)^2$
Mean Absolute Error $\sum_i |d(x_i) - y_i|$

binary classification, $y_i = \pm 1$:
ExpLoss (aka AdaLoss) $\sum_i e^{-y_i d(x_i)}$
LogLoss $\sum_i \log(1 + e^{-y_i d(x_i)})$
Usage of second-order information

For additive loss functions, apart from the gradients $g_i$, we can make use of the second derivatives $h_i$. E.g. select the leaf values using a second-order step:

$\mathcal{L} = \sum_i L(D_j(x_i), y_i) = \sum_i L(D_{j-1}(x_i) + d_j(x_i), y_i) \approx$

$\approx \sum_i L(D_{j-1}(x_i), y_i) + g_i d_j(x_i) + \frac{h_i}{2} d_j^2(x_i)$

Since a tree's prediction is constant inside each leaf, this becomes

$\mathcal{L} \approx \text{const} + \sum_{leaf} \left( g_{leaf} w_{leaf} + \frac{h_{leaf}}{2} w_{leaf}^2 \right) \to \min$
Using second-order information

$\mathcal{L} \approx \text{const} + \sum_{leaf} \left( g_{leaf} w_{leaf} + \frac{h_{leaf}}{2} w_{leaf}^2 \right) \to \min$

where $g_{leaf} = \sum_{i \in leaf} g_i$ and $h_{leaf} = \sum_{i \in leaf} h_i$.

The leaves are optimized independently. Explicit solution for the optimal values in the leaves:

$w_{leaf} = - \frac{g_{leaf}}{h_{leaf}}$
Using second-order information: recipe

On each iteration of gradient boosting:

1. train a tree to follow the gradient (minimize MSE with the gradient)
2. change the values assigned in the leaves to $w_{leaf} \leftarrow - \frac{g_{leaf}}{h_{leaf}}$
3. update predictions (no weight for the estimator, $\alpha_j = 1$): $D_j(x) = D_{j-1}(x) + \eta \, d_j(x)$

This improvement is quite cheap and allows smaller GBs to be more effective.
We can also use the information about hessians at the tree-building step (step 1).
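A sketch of step 2, assuming g and h are arrays with the per-event first and second derivatives of the loss; tree.apply returns the index of the leaf each event falls into:

    import numpy as np

    def second_order_leaf_values(tree, X, g, h):
        """Return {leaf_id: w_leaf} with w_leaf = -g_leaf / h_leaf."""
        leaf_ids = tree.apply(X)          # leaf index for every event
        values = {}
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            values[leaf] = -g[mask].sum() / h[mask].sum()
        return values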
Multiclass classification: ensembling

One-vs-one: $\frac{n_{classes} \times (n_{classes} - 1)}{2}$ classifiers
One-vs-rest: $n_{classes}$ classifiers

scikit-learn implements those as meta-algorithms.
Multiclass classification: modifying an algorithm

Most classifiers have natural generalizations to multiclass classification.

Example for logistic regression: introduce for each class $c \in 1, 2, \ldots, C$ a vector $w_c$:

$d_c(x) = \langle w_c, x \rangle$

Converting to probabilities using the softmax function:

$p_c(x) = \frac{e^{d_c(x)}}{\sum_{\tilde{c}} e^{d_{\tilde{c}}(x)}}$

And minimize LogLoss:

$\mathcal{L} = - \sum_i \log p_{y_i}(x_i)$
Softmax function

Typical way to convert $n$ numbers to $n$ probabilities.

The mapping is surjective, but not injective ($n$ dimensions to $n - 1$ dimensions).
Invariant to a global shift: $d_c(x) \to d_c(x) + \text{const}$

For the case of two classes:

$p_1(x) = \frac{e^{d_1(x)}}{e^{d_1(x)} + e^{d_2(x)}} = \frac{1}{1 + e^{-(d_1(x) - d_2(x))}}$

Coincides with the logistic function for $d(x) = d_1(x) - d_2(x)$.
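A minimal numpy sketch (the max-subtraction uses the shift invariance above for numerical stability):

    import numpy as np

    def softmax(d):
        e = np.exp(d - d.max())
        return e / e.sum()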
Loss function: ranking example

In ranking we need to order items by $y_i$:

$y_i < y_j \Rightarrow d(x_i) < d(x_j)$

We can penalize for misordering:

$\mathcal{L} = \sum_{i, \tilde{i}} L(x_i, x_{\tilde{i}}, y_i, y_{\tilde{i}}), \qquad L(x, \tilde{x}, y, \tilde{y}) = \begin{cases} \sigma(d(x) - d(\tilde{x})), & y < \tilde{y} \\ 0, & \text{otherwise} \end{cases}$
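A brute-force numpy sketch of this penalty (O(n^2) over all pairs; for illustration only):

    import numpy as np
    from scipy.special import expit

    def ranking_loss(d, y):
        """Sum of sigma(d_i - d_j) over all pairs with y_i < y_j."""
        loss = 0.0
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] < y[j]:
                    loss += expit(d[i] - d[j])
        return loss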
Adapting boosting

By modifying boosting or changing the loss function we can solve different problems:

classification
regression
ranking
HEP-specific examples in Tatiana's lecture tomorrow.
Gradient Boosting classification playground
$n^3$-minutes break
Recapitulation: AdaBoost

Minimizes

$\mathcal{L}_{ada} = \sum_i e^{-y_i D(x_i)} \to \min$

by increasing the weights of misclassified samples:

$w_i \leftarrow w_i \times e^{-\alpha_j y_i d_j(x_i)}$
Gradient Boosting overview

A powerful ensembling technique (typically used over trees, GBDT):

a general way to optimize differentiable losses
can be adapted to other problems
'follows' the gradient of the loss at each step
makes steps in the space of functions
the gradient of poorly-classified events is higher
increasing the number of trees can drive to overfitting (= getting worse quality on new data)
requires tuning; works better when trees are not complex
widely used in practice
Feature engineering

Feature engineering = creating features to get the best result with ML

an important step
mostly relying on domain knowledge
requires some understanding
most of practitioners' time is spent at this step
Feature engineering

Analyze the available features:
scale and shape of features

Analyze which information is lacking:
challenge example: maybe subleading jets matter?

Validate your guesses.

Machine learning is a proper tool for checking your understanding of data.
Linear models example

A single event with a sufficiently large value of a feature can break almost all linear models.

Heavy-tailed distributions are harmful; pre-transforming is required:
logarithm
power transform
throwing out outliers

The same tricks actually help more advanced methods too.

Which transformation is the best for Random Forest?
One-hot encoding

Categorical features (= 'not orderable'), being one-hot encoded, are easier for ML to operate with.
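For example, with pandas (the category values here are made up):

    import pandas as pd

    df = pd.DataFrame({'particle': ['pion', 'kaon', 'muon', 'pion']})
    # one binary column per category
    print(pd.get_dummies(df['particle']))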
Decision tree example

$\eta_{lepton}$ is hard for a tree to use, since it provides no good splitting.

Don't forget that a tree can't reconstruct linear combinations; take care of this.
Example of feature: invariant mass

Using HEP coordinates, the invariant mass of two products is:

$m^2_{inv} \approx 2 \, p_{T1} p_{T2} \left( \cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2) \right)$

It can't be recovered with ensembles of trees of depth < 4 when using only canonical features.

What about the invariant mass of 3 particles? (see Vicens' talk today)

Good features are ones that are explainable by physics. Start from the simplest and most natural.

Mind the cost of computing the features.
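As a one-line engineered feature in numpy (valid for approximately massless products):

    import numpy as np

    def inv_mass_squared(pt1, eta1, phi1, pt2, eta2, phi2):
        """m^2 of a pair of particles in HEP coordinates."""
        return 2 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))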
Output engineering

Typically not discussed, but the target of learning plays an important role.

Example: predicting the number of upvotes for a comment. Assume the error is MSE:

$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$

Consider two cases:
100 comments when predicted 0
1200 comments when predicted 1000
Output engineering

Ridiculously large impact of highly-commented articles. We need to predict the order, not the exact number of comments.

Possible solutions:
alternate the loss function, e.g. use MAPE
apply a logarithm to the target, predict $\log(\#\text{comments})$

The evaluation score should be changed accordingly.
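A sketch of the second option (the regressor is an arbitrary choice; X, y, X_test are placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # train on log1p(y) (log1p handles y = 0), invert with expm1 at prediction time
    reg = RandomForestRegressor().fit(X, np.log1p(y))
    predicted = np.expm1(reg.predict(X_test))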
Sample weights

Typically used to estimate the contribution of an event (how often we expect it to happen).

Sample weights also matter in other situations:
highly imbalanced datasets (e.g. 99% of events in class 0) tend to cause problems during optimization
changing sample weights to balance the dataset frequently helps
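A sketch of the rebalancing trick with a scikit-learn utility (clf, X, y are placeholders):

    from sklearn.utils.class_weight import compute_sample_weight

    # 'balanced' makes each class carry the same total weight
    weights = compute_sample_weight('balanced', y)
    clf.fit(X, y, sample_weight=weights)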
Feature selection

Why?
speed up training / prediction
reduce the time of data preparation in the pipeline
help the algorithm to 'focus' on finding reliable dependencies
useful when the amount of training data is limited

Problem: find the subset of features which provides the best quality.
Feature selection

Exhaustive search: $2^d$ cross-validations.
requires an incredible amount of resources
too many cross-validation cycles drive to overly-optimistic quality on the test data

Basic nice solution: estimate importances with RF / GBDT.

Filtering methods

Eliminate variables which seem not to carry statistical information about the target, e.g. by measuring Pearson correlation or mutual information.

Example: all angles $\phi$ will be thrown out.
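A sketch of such a filter (mutual_info_classif is available in recent scikit-learn; X, y and the threshold are placeholders):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    scores = mutual_info_classif(X, y)   # one score per feature
    keep = np.where(scores > 0.01)[0]    # features carrying information about y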
Feature selection: embedded methods

Feature selection is a part of training. Example: $L_1$-regularized linear models.

Forward selection

Start from an empty set of features. For each feature in the dataset, check if adding this feature improves the quality.

Backward elimination

Almost the same, but this time we iteratively eliminate features.

Bidirectional combinations of the above are possible; some algorithms can use a previously trained model as the new starting point for optimization.
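A minimal sketch of forward selection (assumes X is a pandas DataFrame and uses the cross-validated score as the quality; names are illustrative):

    from sklearn.model_selection import cross_val_score

    def forward_selection(model, X, y, candidates):
        selected, best_score = [], -float('inf')
        while True:
            best_feature = None
            for f in candidates:
                if f in selected:
                    continue
                score = cross_val_score(model, X[selected + [f]], y).mean()
                if score > best_score:
                    best_score, best_feature = score, f
            if best_feature is None:     # no remaining feature improves quality
                return selected
            selected.append(best_feature)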
Unsupervised dimensionality reduction
Principal component analysis [Pearson, 1901]

PCA finds the axes along which the variance is maximal.
PCA description

PCA is based on the principal axis theorem:

$Q = U \Lambda U^T$

where $Q$ is the covariance matrix of the dataset, $U$ is an orthogonal matrix, and $\Lambda$ is a diagonal matrix:

$\Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n), \qquad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$
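In scikit-learn (X is a placeholder for the data matrix):

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2).fit(X)
    X_reduced = pca.transform(X)       # projection on the two leading axes
    print(pca.explained_variance_)     # the leading eigenvalues of Q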
PCA optimization visualized
PCA: eigenfaces
Emotion = α[scared] + β[laughs] + γ[angry] + ...
Locally linear embedding

handles the case of non-linear dimensionality reduction

Express each sample as a convex combination of its neighbours:

$\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i \tilde{i}} x_{\tilde{i}} \right\| \to \min_w$
Locally linear embedding

$\sum_i \left\| x_i - \sum_{\tilde{i}} w_{i \tilde{i}} x_{\tilde{i}} \right\| \to \min_w$

subject to the constraints: $\sum_{\tilde{i}} w_{i \tilde{i}} = 1$, $w_{i \tilde{i}} \geq 0$, and $w_{i \tilde{i}} = 0$ if $i, \tilde{i}$ are not neighbors.

Finding an optimal mapping for all points simultaneously ($y_i$ are the images, i.e. positions in the new space):

$\sum_i \left\| y_i - \sum_{\tilde{i}} w_{i \tilde{i}} y_{\tilde{i}} \right\| \to \min_y$
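In scikit-learn (parameters are illustrative):

    from sklearn.manifold import LocallyLinearEmbedding

    lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
    X_embedded = lle.fit_transform(X)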
PCA and LLE
Isomap

Isomap is targeted to preserve the geodesic distance on the manifold between two points.
Supervised dimensionality reduction
Fisher's LDA (Linear Discriminant Analysis) [1936]

Original idea: find the projection that discriminates the classes best.
Fisher's LDA

Mean and variance within a single class $c$ ($c \in \{1, 2, \ldots, C\}$):

$\mu_c = \langle x \rangle_{\text{events of class } c}$

$\sigma_c = \langle \| x - \mu_c \|^2 \rangle_{\text{events of class } c}$

Total within-class variance: $\sigma_{within} = \sum_c p_c \sigma_c$

Total between-class variance: $\sigma_{between} = \sum_c p_c \| \mu_c - \mu \|^2$

Goal: find a projection that maximizes the ratio $\frac{\sigma_{between}}{\sigma_{within}}$
Fisher's LDA
LDA: solving the optimization problem

We are interested in finding a 1-dimensional projection $w$:

$\frac{w^T \Sigma_{between} w}{w^T \Sigma_{within} w} \to \max_w$

Naturally connected to the generalized eigenvalue problem.
The projection vector corresponds to the highest generalized eigenvalue.
Finds a subspace of $C - 1$ components when applied to a classification problem with $C$ classes.

Fisher's LDA is a basic popular binary classification technique.
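In scikit-learn (X, y are placeholders; for two classes the projection is 1-dimensional):

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis(n_components=1)
    X_projected = lda.fit_transform(X, y)   # supervised: uses the class labels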
Common spatial patterns

When we expect that each class is close to some linear subspace, we can

(naively) find this subspace for each class by PCA
(better idea) take into account the variation of the other data and optimize

$\text{tr} \, W^T \Sigma_{class} W \to \max_W \quad \text{subject to} \quad W^T \Sigma_{total} W = I$

A natural generalization is to take several components: $W \in \mathbb{R}^{n \times n_1}$ is a projection matrix; $n$ and $n_1$ are the numbers of dimensions in the original and new spaces.

Frequently used in neural sciences, in particular in BCI based on EEG / MEG.
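This constrained trace maximization reduces to a generalized eigenproblem; a sketch with scipy (Sigma_class, Sigma_total and n1 are placeholders for the estimated covariances and the target dimension):

    import numpy as np
    from scipy.linalg import eigh

    # generalized eigenproblem: Sigma_class v = lambda * Sigma_total v
    eigvals, eigvecs = eigh(Sigma_class, Sigma_total)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n1]]   # top-n1 eigenvectors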
Common spatial patterns

The patterns found describe the projection into a 6-dimensional space.
Dimensionality reduction summary

capable of extracting sensible features from high-dimensional data
frequently used to 'visualize' the data
nonlinear methods rely on the distance in the original space
works well with high-dimensional spaces with features of the same nature
Finding optimal hyperparameters

some algorithms have many parameters (regularizations, depth, learning rate, ...)
not all the parameters can be guessed
checking all combinations takes too long

We need automated hyperparameter optimization!
Finding optimal parameters

randomly picking parameters is a partial solution
given a target metric, we can optimize it
but there is no gradient with respect to the parameters
results are noisy
reconstructing the function is a problem

Before running grid optimization, make sure your metric is stable (e.g. by training/testing on different subsets).

Overfitting (= getting a too optimistic quality estimate on a holdout) by using many attempts is a real issue.
Optimal grid search

stochastic optimization (Metropolis-Hastings, annealing):
requires too many evaluations, uses only the last checked combination

regression techniques, reusing all known information (ML to optimize ML!)
Optimal grid search using regression

General algorithm (a point of the grid = a set of parameters):

1. evaluate quality at random points
2. build a regression model based on the known results
3. select the point with the best expected quality according to the trained model
4. evaluate quality at this point
5. go to 2 if not enough evaluations

Why not use linear regression?

Exploration vs. exploitation trade-off: should we explore poorly-covered regions or try to enhance what currently seems optimal?
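A toy sketch of this loop with a random forest as the surrogate model (evaluate and grid are placeholders; this greedy version ignores uncertainty, i.e. it only exploits):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def model_based_search(evaluate, grid, n_random=10, n_total=50):
        """evaluate(point) -> quality; grid is a 2D array of candidate points."""
        rng = np.random.RandomState(0)
        tried = list(rng.choice(len(grid), n_random, replace=False))
        scores = [evaluate(grid[i]) for i in tried]
        for _ in range(n_total - n_random):
            surrogate = RandomForestRegressor().fit(grid[tried], scores)
            expected = surrogate.predict(grid)
            expected[tried] = -np.inf          # don't re-evaluate known points
            best = int(np.argmax(expected))
            tried.append(best)
            scores.append(evaluate(grid[best]))
        return grid[tried[int(np.argmax(scores))]]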
Gaussian processes for regression

Some definitions: $Y \sim GP(m, K)$, where $m$ and $K$ are functions of mean and covariance: $m(x)$, $K(x, \tilde{x})$.

$m(x) = \mathbb{E} \, Y(x)$ represents our prior expectation of quality (may be taken constant)

$K(x, \tilde{x}) = \mathbb{E} \, Y(x) Y(\tilde{x})$ represents the influence of known results on the expectation of values in new points

The RBF kernel is used here too: $K(x, \tilde{x}) = \exp(-c \| x - \tilde{x} \|^2)$
Another popular choice: $K(x, \tilde{x}) = \exp(-c \| x - \tilde{x} \|)$

We can model the posterior distribution of results in each point.
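In recent scikit-learn this is available directly (X_tried, scores, X_grid are placeholders for the evaluated points, their qualities and the candidate grid):

    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
    gp.fit(X_tried, scores)
    mean, std = gp.predict(X_grid, return_std=True)  # posterior mean and spread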
Gaussian processes

Gaussian processes model the posterior distribution at each point of the grid:
we know at which points the quality is already well-estimated,
and we are able to find regions which need exploration.

See also this demo.
Summary about hyperoptimization

parameters can be tuned automatically
be sure that the metric being optimized is stable
mind the optimistic quality estimation (resolved by one more holdout)
but this is not what you should spend your time on:
the gain from properly cooking features / reconsidering the problem is much higher