Tell me What Label Noise is and I will Tell you how to Dispose of it

Benoît Frénay - Namur Digital Institute, PReCISE research center - UNamur
Outline of this Talk
- label noise: overview of the literature
- probabilistic models for label noise
  - HMMs for ECG segmentation
  - robust feature selection with MI
  - dealing with streaming data
  - going further with modelling
- robust maximum likelihood inference
Label Noise: Overview of the Literature
Label Noise: a Complex Phenomenon
Frénay, B., Verleysen, M. Classification in the Presence of Label Noise: a Survey. IEEE TNNLS, 25(5), 2014, p. 845-869.
Frénay, B., Kabán, A. A Comprehensive Introduction to Label Noise. In Proc. ESANN, Bruges, Belgium, 23-25 April 2014, p. 667-676.
Sources and Effects of Label Noise
Label noise can come from several sources:
- insufficient information provided to the expert
- errors in the expert labelling itself
- subjectivity of the labelling task
- communication/encoding problems

Label noise can have several effects:
- it decreases classification performance
- it increases/decreases the complexity of learned models
- it poses a threat to tasks such as feature selection
State-of-the-Art Methods to Deal with Label Noise
- label noise-robust models rely on overfitting avoidance; learning algorithms are seldom completely robust to label noise
- data cleansing removes instances which seem to be mislabelled, mostly based on model predictions or kNN-based methods
- label noise-tolerant learning algorithms take label noise into account, based on e.g. probabilistic models of label noise
Label Noise-Robust Models
some losses are robust to uniform label noise (Manwani and Sastry):
- theoretically robust: least-squares loss → Fisher linear discriminant
- theoretically non-robust: exponential loss → AdaBoost, log loss → logistic regression, hinge loss → support vector machines

one cannot expect most recent learning algorithms in machine learning to be completely label noise-robust ⇒ research on label noise matters!

robustness to label noise is method-specific:
- boosting: AdaBoost < LogitBoost / BrownBoost
- decision trees: C4.5 < imprecise info-gain

empirical comparisons exist in the literature, but it is not easy to draw conclusions from them
Data Cleansing Methods
many methods to detect and remove mislabelled instances:
- measures and thresholds: model complexity, entropy of P(Y|X), ...
- model prediction-based: class probabilities, voting, partition filtering
- model influence: LOO perturbed classification (LOOPC), CL-stability
- kNN-based: CNN, RNN, BBNR, DROP1-6, GE, Tomek links, PRISM, ...
- boosting: ORBoost exploits the tendency of AdaBoost to overfit
Label Noise-Tolerant Methods
≠ label noise-robust ⇔ fights the effects of label noise
≠ data cleansing methods ⇔ no filtering of instances

- Bayesian priors and frequentist methods (e.g. Lawrence et al.)
- clustering-based: the structure of the data (= clusters) is used to model label noise
- belief functions: each instance is seen as a piece of evidence, used to robustify kNN classifiers, neural networks, decision trees, boosting
- modifications of SVMs, neural networks, decision trees, boosting, ...
Probabilistic Models of Label Noise
Probabilistic Modelling of Lawrence et al.
p(Y) = prior over the true labels Y
p(X|Y) = distribution of the observed features X
p(Ỹ|Y) = distribution of the observed labels Ỹ
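As a minimal illustration of this factorisation (a sketch in my own notation, not the authors' code), the posterior over the true label given the features and the observed noisy label is proportional to p(Y) p(X|Y) p(Ỹ|Y):

```python
import numpy as np

def true_label_posterior(p_y, p_x_given_y, p_tilde_given_y, y_tilde):
    """Posterior over the true label Y given features X and the observed
    (possibly noisy) label Y~, following p(Y) * p(X|Y) * p(Y~|Y).

    p_y             : (C,) prior over true labels
    p_x_given_y     : (C,) likelihood p(x|y) evaluated at one instance x
    p_tilde_given_y : (C, C) noise model, entry [y, y~] = p(y~|y)
    y_tilde         : observed label index
    """
    joint = p_y * p_x_given_y * p_tilde_given_y[:, y_tilde]
    return joint / joint.sum()

# toy example: 2 classes, 10% label-flipping noise
p_y = np.array([0.5, 0.5])
noise = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
post = true_label_posterior(p_y, np.array([0.2, 0.8]), noise, y_tilde=0)
```

Here the features favour class 1, while the observed label says class 0; the posterior trades the two off through the noise model.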
Frénay, B., de Lannoy, G., Verleysen, M. Label noise-tolerant hidden Markov models for segmentation: application to ECGs. ECML-PKDD 2011, p. 455-470.
Frénay, B., Doquire, G., Verleysen, M. Estimating mutual information for feature selection in the presence of label noise. Computational Statistics & Data Analysis, 71, 2014, p. 832-848.
Frénay, B., Hammer, B. Label-noise-tolerant classification for streaming data. IJCNN 2017, Anchorage, AK, 14-19 May 2017, p. 1748-1755.
Bion, Q., Frénay, B. Modelling non-uniform label noise to robustify a classifier with application to neural networks. Submitted to ESANN'18 (under review).
HMMs for ECG Segmentation
What is an Electrocardiogram Signal?
an ECG is a measure of the electrical activity of the human heart
patterns of interest: P wave, QRS complex, T wave, baseline
Where do ECG Signals Come from?
an ECG results from the superposition of several signals
What Real-World ECG Signals Look Like
real ECGs are polluted by various sources of noise
Solving an ECG Segmentation Task
- learn from a few manual segmentations from experts
- split/segment the entire ECG into patterns
- sequence modelling with hidden Markov models (+ wavelet transform)
Hidden Markov Models in a Nutshell
hidden Markov models (HMMs) are probabilistic models of sequences

S_1, ..., S_T is the sequence of annotations (e.g. the state of the heart), with transition probabilities P(S_t = s_t | S_t−1 = s_t−1)

O_1, ..., O_T is the sequence of observations (e.g. the measured voltage), with emission probabilities P(O_t = o_t | S_t = s_t)
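For concreteness, the standard forward algorithm for such an HMM with discrete emissions can be sketched as follows (an illustrative toy, with parameter values of my own choosing):

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Likelihood P(O|Theta) of an observation sequence under an HMM with
    initial distribution pi, transition matrix A (A[s, s'] =
    P(S_t = s' | S_{t-1} = s)) and discrete emission matrix B
    (B[s, o] = P(O_t = o | S_t = s))."""
    alpha = pi * B[:, obs[0]]          # joint of first state and observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then absorb the emission
    return alpha.sum()                 # marginalise the last hidden state

# toy 2-state HMM
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
lik = forward_likelihood(pi, A, B, [0, 1, 0])
```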
Information Extraction with Wavelet Transform
Standard Inference Algorithms for HMMs
Supervised learning:
- assumes the observed labels are correct
- maximises the likelihood P(S, O|Θ)
- learns the correct concepts
- sensitive to label noise

Baum-Welch algorithm:
- unsupervised, i.e. the observed labels are discarded
- iteratively (i) labels the samples and (ii) learns a model
- may learn concepts which differ significantly
- theoretically insensitive to label noise
Example of Label Noise: Electrocardiogram Signals
Label Noise Model
Modelling label noise in sequences:
- two distinct sequences of states: the observed, noisy annotations Y and the hidden, true labels S
- the annotation probability is

    d_ij = 1 − p_i            if i = j
    d_ij = p_i / (|S| − 1)    if i ≠ j

where p_i is the expert error probability in state i
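This annotation model can be built as a matrix in a few lines (a sketch, with names of my own choosing):

```python
import numpy as np

def annotation_matrix(p):
    """Build the annotation probability matrix d_ij from per-state expert
    error probabilities p = (p_1, ..., p_|S|): the correct label is kept
    with probability 1 - p_i, and errors are spread uniformly over the
    |S| - 1 other labels."""
    n = len(p)
    p = np.asarray(p, dtype=float)
    D = np.tile((p / (n - 1))[:, None], (1, n))  # off-diagonal mass
    np.fill_diagonal(D, 1.0 - p)                 # probability of a correct label
    return D

D = annotation_matrix([0.1, 0.2, 0.3])  # each row of D sums to 1
```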
Label Noise-Tolerant HMMs
Compromise between supervised learning and Baum-Welch:
- assumes the observed labels are potentially noisy
- maximises the likelihood P(Y, O|Θ) = Σ_S P(O, Y, S|Θ)
- learns the correct concepts (and the error probabilities)
- less sensitive to label noise (can estimate the label noise level)
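This marginal likelihood can be sketched with a forward-style recursion in which each hidden state also "emits" the observed annotation through the d_ij noise model (a toy illustration under that assumption, not the paper's implementation):

```python
import numpy as np

def noisy_label_likelihood(pi, A, B, D, obs, annotations):
    """P(Y, O | Theta) = sum over S of P(O, Y, S | Theta): a forward pass
    where each hidden state s emits the observation o_t with B[s, o_t]
    and the noisy annotation y_t with D[s, y_t]."""
    alpha = pi * B[:, obs[0]] * D[:, annotations[0]]
    for o, y in zip(obs[1:], annotations[1:]):
        alpha = (alpha @ A) * B[:, o] * D[:, y]
    return alpha.sum()

# sanity check: with a noise-free D (identity), the quantity reduces to the
# supervised joint likelihood of the annotated state path
pi = np.array([0.5, 0.5])
A = np.full((2, 2), 0.5)
B = np.eye(2)
lik = noisy_label_likelihood(pi, A, B, np.eye(2), [0, 1], [0, 1])
```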
Results for Artificial ECGs
Supervised learning, Baum-Welch and label noise-tolerant.
Results for Sinus ECGs
Supervised learning, Baum-Welch and label noise-tolerant.
Results for Arrhythmia ECGs
Supervised learning, Baum-Welch and label noise-tolerant.
Robust Feature Selection with Mutual Information
Feature Selection with Mutual Information
Problem statement:
- problems with high-dimensional data: interpretability of the data, curse of dimensionality, concentration of distances
- feature selection consists in using only a subset of the features

How to select features:
- mutual information (MI) assesses the quality of feature subsets: rigorous definition (information theory), interpretation in terms of uncertainty reduction, can detect linear as well as non-linear relationships, can be defined for multi-dimensional variables
Label Noise-Tolerant Mutual Information Estimation
Gómez et al. propose to estimate MI (using the Kozachenko-Leonenko kNN-based entropy estimator, a.k.a. the Kraskov estimator) as

    I(X;Y) = H(X) − Σ_{y∈Y} p_Y(y) H(X|Y = y)
           = ψ(n) − (1/n) Σ_{y∈Y} n_y ψ(n_y)
             + (d/n) [ Σ_{i=1}^{n} log ε_k(i) − Σ_{y∈Y} Σ_{i|y_i=y} log ε_k(i|y) ]

assumption: the density remains constant in a small hypersphere of diameter ε_k(i) containing the k nearest neighbours of the instance x_i

Solution:
find hyperspheres with an expected number of k instances really belonging to the target class s (with true class memberships, similar to Lawrence et al.)
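A self-contained sketch of the (noise-unaware) estimator above for a discrete class variable may make the formula concrete. This is my own illustrative implementation: ε_k is taken as the distance to the k-th neighbour computed by brute force, and ψ is evaluated with a harmonic sum since all its arguments here are integers:

```python
import numpy as np

def _digamma_int(n):
    """psi(n) for integer n >= 1: -EulerGamma + sum_{m=1}^{n-1} 1/m."""
    return -0.5772156649015329 + sum(1.0 / m for m in range(1, n))

def _eps_k(X, k):
    """Distance from each row of X to its k-th nearest neighbour."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbours
    return np.sqrt(np.sort(d2, axis=1)[:, k - 1])

def mi_kraskov(X, y, k=3):
    """I(X;Y) = psi(n) - (1/n) sum_y n_y psi(n_y)
              + (d/n) (sum_i log eps_k(i) - sum_y sum_{i|y_i=y} log eps_k(i|y))."""
    n, d = X.shape
    est = _digamma_int(n)
    log_full = np.log(_eps_k(X, k)).sum()
    log_cond = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        est -= (len(Xc) / n) * _digamma_int(len(Xc))
        log_cond += np.log(_eps_k(Xc, k)).sum()
    return est + (d / n) * (log_full - log_cond)

rng = np.random.default_rng(0)
X_inf = np.vstack([rng.normal(0.0, 0.5, (100, 1)),   # class 0
                   rng.normal(5.0, 0.5, (100, 1))])  # class 1, well separated
y = np.repeat([0, 1], 100)
mi_informative = mi_kraskov(X_inf, y)                # close to log 2
mi_noise = mi_kraskov(rng.normal(size=(200, 1)), y)  # close to 0
```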
Experimental Results for Feature Selection
the resulting MI estimate is only I(X;Y) = 0.58, instead of I(X;Y) = 0.63 with clean data ⇒ the label noise-tolerant estimate is I(X;Y) = 0.61
Experimental Results for Feature Selection
Dealing with Noisy Streaming Data
Motivation for Robust Online Learning
Context:
- large-scale ML = dealing with large (batch) or infinite (streaming) datasets; online learning can handle such datasets (see e.g. Bottou's works)
- robust classification = dealing with label noise in datasets; real-world datasets contain ±5% labelling errors; low-quality labels are used (crowdsourcing)

Online learning with label noise:
- only a few online-learning approaches, related to the perceptron
- the λ-trick modifies the adaptation criterion if previously misclassified
- the α-bound does not update the weights if already misclassified α times
Prototype-based Models: Robust Soft LVQ
Motivation:
easy to optimise, online learning, interpretable (controlled complexity)

RSLVQ relies on a data-generating Gaussian mixture model (= prototypes)

    p(x|j) := (2πσ²)^(−d/2) exp( −‖x − w_j‖² / (2σ²) )

where the bandwidth σ is considered identical for each component.

assuming equal priors P(j) = 1/k, (un)labelled instances follow

    p(x) = (1/k) Σ_{j=1}^{k} p(x|j)        p(x, y) = (1/k) Σ_{j|c(w_j)=y} p(x|j)
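These two densities are straightforward to write down directly (a small illustrative helper; prototype and label arrays below are my own toy setup):

```python
import numpy as np

def gaussian(x, w, sigma):
    """Isotropic Gaussian component p(x|j) centred on prototype w."""
    d = len(x)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-np.sum((x - w) ** 2) / (2.0 * sigma ** 2)) / norm

def mixture_densities(x, W, c, y, sigma):
    """p(x) and p(x, y) under the RSLVQ mixture with equal priors 1/k.
    W: (k, d) prototypes, c: (k,) prototype class labels."""
    k = len(W)
    comp = np.array([gaussian(x, w, sigma) for w in W])
    p_x = comp.sum() / k                 # sum over all prototypes
    p_xy = comp[c == y].sum() / k        # sum over prototypes of class y only
    return p_x, p_xy

# two coinciding 1-D prototypes, one per class
W = np.array([[0.0], [0.0]])
c = np.array([0, 1])
p_x, p_xy = mixture_densities(np.array([0.0]), W, c, y=0, sigma=1.0)
```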
Prototype-based Models: Robust Soft LVQ
RSLVQ training = optimization of the conditional log-likelihood

    Σ_{i=1}^{m} log p(y_i|x_i) = Σ_{i=1}^{m} log [ p(x_i, y_i) / p(x_i) ]

gradient ascent can be used in streaming scenarios ⇒ update rule

    Δw_j = (α/σ²) (P_{y_i}(j|x_i) − P(j|x_i)) (x_i − w_j)    if c(w_j) = y_i
    Δw_j = −(α/σ²) P(j|x_i) (x_i − w_j)                      if c(w_j) ≠ y_i

where

    P(j|x_i) = p(x_i|j) / Σ_{j′=1}^{k} p(x_i|j′)        P_{y_i}(j|x_i) = p(x_i|j) / Σ_{j′|c(w_j′)=y_i} p(x_i|j′)
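One gradient step can be sketched as follows (illustrative code; note that the Gaussian normalisation constant cancels in both posteriors, so unnormalised components suffice):

```python
import numpy as np

def rslvq_update(x, y, W, c, sigma, alpha):
    """One stochastic gradient-ascent step of RSLVQ on instance (x, y):
    prototypes of the correct class are attracted with weight
    P_y(j|x) - P(j|x), all others are repelled with weight P(j|x)."""
    comp = np.exp(-((x - W) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    P = comp / comp.sum()                    # P(j|x)
    same = (c == y)
    Py = np.where(same, comp, 0.0)
    Py = Py / Py.sum()                       # P_y(j|x), correct-class posterior
    coeff = np.where(same, Py - P, -P)
    return W + (alpha / sigma ** 2) * coeff[:, None] * (x - W)

# 1-D toy: one prototype per class, instance of class 0 at x = -0.5
W = np.array([[-1.0], [1.0]])
c = np.array([0, 1])
W_new = rslvq_update(np.array([-0.5]), 0, W, c, sigma=1.0, alpha=0.1)
```

After the step the class-0 prototype has moved towards the instance and the class-1 prototype away from it.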
Label Noise-Tolerant RSLVQ
using the Lawrence and Schölkopf methodology, the RSLVQ equations become

    p(x, ỹ) = Σ_{y∈Y} P(ỹ|y) [ (1/k) Σ_{j|c(w_j)=y} p(x|j) ] = (1/k) Σ_{j=1}^{k} P(ỹ|c(w_j)) p(x|j)

where P(ỹ|c(w_j)) = probability of observing label ỹ if the true label is c(w_j)

the online update rules become

    ∀j ∈ 1…k : Δw_j = (α/σ²) (P_{y_i}(j|x_i) − P(j|x_i)) (x_i − w_j)

where

    P(j|x_i) = p(x_i|j) / Σ_{j′=1}^{k} p(x_i|j′)        P_{y_i}(j|x_i) = P(y_i|c(w_j)) p(x_i|j) / Σ_{j′=1}^{k} P(y_i|c(w_j′)) p(x_i|j′)
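The modified update can be sketched in the same style (illustrative code; `noise[y, y_obs]` is my own encoding of P(ỹ|y)). A useful sanity check: with a completely uninformative noise model the observed label carries no information, P_y(j|x) collapses onto P(j|x), and the update vanishes:

```python
import numpy as np

def rslvq_noise_update(x, y_obs, W, c, sigma, alpha, noise):
    """Label noise-tolerant RSLVQ step: every prototype j is weighted by
    P(y_obs | c(w_j)) = noise[c[j], y_obs], so prototypes of another class
    can also softly claim a (possibly mislabelled) instance."""
    comp = np.exp(-((x - W) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    P = comp / comp.sum()                    # P(j|x)
    weighted = noise[c, y_obs] * comp        # P(y_obs|c(w_j)) p(x|j)
    Py = weighted / weighted.sum()           # P_y(j|x)
    return W + (alpha / sigma ** 2) * (Py - P)[:, None] * (x - W)

W = np.array([[-1.0], [1.0]])
c = np.array([0, 1])
# uninformative noise model: any observed label is equally likely
uninformative = np.full((2, 2), 0.5)
W_same = rslvq_noise_update(np.array([-0.5]), 0, W, c, 1.0, 0.1, uninformative)
```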
Results in batch setting
Results in streaming setting
Going Further with Modelling
Statistical Taxonomy of Label Noise (inspired by Schafer)
- most works consider that label noise affects instances with no distinction
- in specific cases, empirical evidence was given that more difficult samples are labelled randomly (e.g. in text entailment)
- it seems natural to expect less reliable labels in regions of low density
Non-uniform Label Noise in the Literature
Lachenbruch/Chhikara models of label noise:
the probability of misallocation g_y(z) for LDA is defined w.r.t. a z-axis which passes through the centers of both classes s.t. each center is at z = ±Δ/2

- random misallocation: g_y(z) = α_y is constant for each class
- truncated label noise: g(z) is zero as long as the instance is close enough to the mean of its class, then equal to a small constant
- exponential model:

    g_y(z) = 0                                if z ≤ −Δ/2
    g_y(z) = 1 − exp(−(1/2) k_y (z + Δ/2)²)   if z > −Δ/2
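The exponential model is easy to evaluate or sample from; a minimal sketch (function name and parameter values are mine):

```python
import numpy as np

def g_exponential(z, delta, k_y):
    """Exponential misallocation model: no mislabelling on the correct side
    of the class center (z <= -delta/2), then an error probability that
    grows with the distance past it."""
    z = np.asarray(z, dtype=float)
    g = 1.0 - np.exp(-0.5 * k_y * (z + delta / 2.0) ** 2)
    return np.where(z <= -delta / 2.0, 0.0, g)

# error probability is zero near the class center and grows towards
# (and beyond) the decision boundary
probs = g_exponential([-2.0, 0.0, 2.0], delta=2.0, k_y=1.0)
```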
Non-uniform Label Noise in the Literature
- non-uniform label noise is considered in far fewer works
- experiments rely on simple models (e.g. Lachenbruch/Chhikara's or Sastry's quadrant-dependent probability of mislabelling)
- to our knowledge, there is almost no empirical evidence or study of the characteristics of real non-uniform label noise

Call for discussion:
It would be very interesting to obtain more real-world datasets where mislabelled instances are clearly identified. An important open research problem is to find out what the characteristics of real-world label noise are. It is not yet clear if and when (non-)uniform label noise is realistic.
What if Label Noise ⊥ Classification Task?
Work in Progress
Quentin Bion, a student from Nancy (École des Mines, Ing. Mathématique), did an internship in our lab and worked on non-uniform label noise

results have been submitted to the ESANN'18 conference
he studied how to combine simple non-uniform models of label noise with complex classification models using generic mechanisms
gain = power of up-to-date classifiers + interpretability/transparency of simple non-uniform models of label noise = best of both worlds
Robust Maximum Likelihood Inference
What is the common point of...

linear regression
LS-SVMs
logistic regression
principal component analysis

Answer
common methods in machine learning
can be interpreted in probabilistic terms
sensitive to outliers (but can be robustified)
Everything Wrong with Maximum Likelihood Inference

What is maximum likelihood inference?
maximise the log-likelihood

$$\mathcal{L}(\theta; x) = \log p(x_1, \ldots, x_n) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

equivalently, minimise the KL divergence of the empirical vs. the parametric distribution

$$D_{\mathrm{KL}}(\text{empirical distr.} \,\|\, \text{parametric distr.}) = -\sum_{i=1}^{n} \log p(x_i \mid \theta) + \text{const.}$$

Why is it sensitive to outliers?
abnormally frequent data / outliers = too frequent observations
the reference distribution in D_KL is the empirical one
inference is biased towards models supporting outliers
otherwise, D_KL is too large because of the low probability assigned to outliers
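As a minimal illustration (our own toy example, not from the talk): the ML estimate of a Gaussian mean is the sample mean, and a single gross outlier shifts it noticeably, because the fitted model must assign the outlier non-negligible probability to keep the divergence finite:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=100)
noisy = np.append(clean, 50.0)  # one gross outlier

# The ML estimate of a Gaussian mean is the sample mean:
# the outlier drags the estimate towards itself.
mu_ml_clean = clean.mean()
mu_ml_noisy = noisy.mean()
```

One point out of 101 is enough to bias the estimate by roughly 50/101 ≈ 0.5, i.e. half a standard deviation of the clean data.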
Pointwise Probability Reinforcements

Idea: provide an alternative way to deal with outliers
good distributions cannot explain outliers ⇒ low probability
consequence: not-so-good distributions are enforced
our solution: provide another mechanism to deal with outliers

Reinforced log-likelihood
each x_i is given a PPR r(x_i) ≥ 0 as a reinforcement to the probability p(x_i | θ):

$$\mathcal{L}(\theta; x, r) = \sum_{i=1}^{n} \log\left[p(x_i \mid \theta) + r(x_i)\right]$$

reinforced maximum likelihood inference:
PPR r(x_i) ≈ 0 if x_i is clean ⇒ x_i impacts the inference of θ
PPR r(x_i) ≫ 0 if x_i is an outlier ⇒ x_i is ignored by the inference of θ
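In code, the reinforced log-likelihood is a one-line change to the usual sum of log-probabilities (a sketch; the function name and the array-based interface are our own):

```python
import numpy as np

def reinforced_loglik(log_p, r):
    """L(theta; x, r) = sum_i log(p(x_i | theta) + r_i), with r_i >= 0.
    A large r_i lets the model 'give up' on instance i instead of
    distorting theta to explain it."""
    return float(np.sum(np.log(np.exp(log_p) + r)))
```

With all r_i = 0 this reduces exactly to the ordinary log-likelihood; raising r_i for one instance caps its negative contribution.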
Control of PPRs through Regularisation

Keeping the PPRs under control
PPRs cannot be allowed to take arbitrary positive values
non-parametric approach: one reinforcement r(x_i) = r_i for each x_i

$$\mathcal{L}_{\Omega}(\theta; X, r) = \sum_{i=1}^{n} \log\left[p(x_i \mid \theta) + r_i\right] - \alpha\,\Omega(r)$$

no prior knowledge on outliers is required (e.g. uniform distribution of outliers or mislabelling probability w.r.t. distance to the classification boundary)
Ω(r) = penalisation term (e.g. to allow only a few non-zero PPRs)
α = trade-off between fitting the model and considering data as outliers
Control of PPRs through Regularisation

theoretical guarantees and closed-form expressions exist for some Ω(r)
L1 penalisation Ω(r) = $\sum_{i=1}^{n} r_i$ shrinks (sparse) PPRs towards zero
L2 regularisation Ω(r) = $\frac{1}{2}\sum_{i=1}^{n} r_i^2$ provides smoother solutions

all details in Frénay, B., Verleysen, M. Pointwise Probability Reinforcements for Robust Statistical Inference. Neural Networks, 50, 124-141, 2014. Short version: "Robustifying Maximum Likelihood Inference" at BENELEARN'16.
Iterative Method for Reinforced Inference

no closed-form solution for the penalised reinforced log-likelihood... but

$$\sum_{i=1}^{n} \log\left[p(x_i \mid \theta) + r_i\right] \ge \sum_{i=1}^{n} w_i \log p(x_i \mid \theta) + \text{const.}$$

where θ_old is the current estimate and

$$w_i = \frac{p(x_i \mid \theta_{\text{old}})}{p(x_i \mid \theta_{\text{old}}) + r_i}$$

EM-like algorithm
initialise θ, then loop over:
1. compute PPRs with L1 or L2 regularisation ⇒ model-independent
2. compute instance weights from the PPRs ⇒ model-independent
3. maximise the weighted log-likelihood to update θ ⇒ model-specific
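Putting the three steps together, here is a toy end-to-end sketch for robust estimation of a Gaussian mean. This is our own minimal example, not code from the paper; `robust_mean_ppr`, the choice of `alpha`, and the L1 closed form r_i = max(0, 1/α − p_i) are assumptions consistent with the slides:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Gaussian density, used as p(x_i | theta) with theta = mu."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def robust_mean_ppr(x, alpha, sigma=1.0, n_iter=50):
    """EM-like reinforced ML estimation of a Gaussian mean (sketch).
    Each iteration: (1) closed-form L1 PPRs, (2) instance weights
    w_i = p_i / (p_i + r_i), (3) weighted ML update of mu."""
    mu = np.median(x)  # reasonable starting point
    for _ in range(n_iter):
        p = normal_pdf(x, mu, sigma)
        r = np.maximum(0.0, 1.0 / alpha - p)   # step 1: model-independent
        w = p / (p + r)                        # step 2: model-independent
        mu = np.sum(w * x) / np.sum(w)         # step 3: weighted ML (Gaussian)
    return mu
```

Outliers get p_i ≈ 0, hence r_i ≈ 1/α and w_i ≈ 0, so they barely affect the update; α tunes how cheaply an instance may be declared an outlier rather than fitted.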
Illustration: Linear Regression / OLS
Illustration: LS-SVMs
comparison to Suykens et al. (2002) on 22 datasets ⇒ see full paper
Illustration: Logistic Regression
Illustration: Principal Component Analysis
Conclusion
Take Home Messages

Probabilistic approaches
probabilistic modelling of label noise is a powerful approach
easily plugged into various problems/models
can be used to provide feedback to users
can embed prior knowledge on label noise

Pointwise probability reinforcements
a generic approach to robustify maximum likelihood inference
many models can be formulated in probabilistic terms
no parametric assumption → could deal with non-uniform noise
easy to implement if weights can be enforced on instances
further work: more complex models + noise level estimation
try it on your own favourite probabilistic model (and let us know!)
References

Frénay, B., Verleysen, M. Classification in the Presence of Label Noise: a Survey. IEEE Trans. Neural Networks and Learning Systems, 25(5), 2014, p. 845-869.

Frénay, B., Kabán, A. A Comprehensive Introduction to Label Noise. In Proc. ESANN, Bruges, Belgium, 23-25 April 2014, p. 667-676.

Frénay, B., de Lannoy, G., Verleysen, M. Label Noise-Tolerant Hidden Markov Models for Segmentation: Application to ECGs. In Proc. ECML-PKDD 2011, p. 455-470.

Frénay, B., Doquire, G., Verleysen, M. Estimating Mutual Information for Feature Selection in the Presence of Label Noise. Computational Statistics & Data Analysis, 71, 832-848, 2014.

Frénay, B., Hammer, B. Label-Noise-Tolerant Classification for Streaming Data. In Proc. IJCNN 2017, Anchorage, AK, 14-19 May 2017, p. 1748-1755.

Frénay, B., Verleysen, M. Pointwise Probability Reinforcements for Robust Statistical Inference. Neural Networks, 50, 124-141, 2014.

Bion, Q., Frénay, B. Modelling Non-Uniform Label Noise to Robustify a Classifier with Application to Neural Networks. Submitted to ESANN'18 (under review).