Classification with imperfect training labels
Richard J. Samworth
University of Cambridge
39th Conference on Applied Statistics in Ireland (CASI 2019), Dundalk, Ireland
15 May 2019
Collaborators
Tim Cannings Yingying Fan
Richard J. Samworth 2/26
Supervised classification
Classification and label noise
With perfect labels in the binary response setting, we observe

(X1, Y1), …, (Xn, Yn) ~ P i.i.d., taking values in R^d × {0, 1}.

Task: predict the class Y of a new observation X, where (X, Y) ~ P independently of the training data.

In many modern applications, however, it may be too expensive, difficult or time-consuming to determine class labels perfectly:

Uncorrupted: (X1, 1), (X2, 1), (X3, 0), (X4, 0), …, (Xn, 0)
Corrupted: (X1, 1), (X2, 0), (X3, 0), (X4, 0), …, (Xn, 1)
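The ρ-homogeneous corruption mechanism can be simulated in a few lines; this is a hypothetical illustration (the values n = 1000 and ρ = 0.3 are chosen to match the motivating example later in the talk), not code from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 1000, 0.3          # sample size and flip probability (illustrative)

# True binary labels.
y = rng.integers(0, 2, size=n)

# rho-homogeneous noise: each label is flipped independently with probability rho.
flip = rng.random(n) < rho
y_tilde = np.where(flip, 1 - y, y)

observed_noise_rate = np.mean(y != y_tilde)   # should be close to rho
```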
Existing work
The topic has been well-studied in the machine learning/computer science literature (Frénay and Kabán, 2014; Frénay and Verleysen, 2014).

▶ Lachenbruch (1966): LDA with zero intercept is consistent under ρ-homogeneous noise, where each observation is mislabelled independently with probability ρ ∈ (0, 1/2).
▶ Okamoto and Nobuhiro (1997) consider the k-nearest neighbour classifier with n = 32 and small k: '…the predictive accuracy of 1-NN is strongly affected by…class noise'.
▶ Ghosh et al. (2015): 'Many standard algorithms such as SVM perform poorly in the presence of label noise'.

Other work seeks to identify mislabelled observations and flip or remove them.
Motivating example
[Figure: two scatter plots of the simulated data.]

Priors π0 = 0.9, π1 = 0.1. Class conditionals X|Y = 0 ~ N2((−1, 0)^⊤, I2), X|Y = 1 ~ N2((1, 0)^⊤, I2), n = 1000.
Left: no noise; right: ρ-homogeneous noise with ρ = 0.3.
Risks in motivating example
[Figure: misclassification error against log(n) for predicting the true label of the test point, for the knn (black), SVM (red) and LDA (blue) classifiers. Solid lines: no label noise; dashed lines: 0.3-homogeneous label noise.]
Statistical setting
Let (X, Y, Ỹ), (X1, Y1, Ỹ1), …, (Xn, Yn, Ỹn) be i.i.d. triples taking values in X × {0, 1} × {0, 1}.

We observe (X1, Ỹ1), …, (Xn, Ỹn) and X. The task is to predict Y.

▶ For x ∈ X, define the regression function

η(x) := P(Y = 1 | X = x)

and its corrupted version

η̃(x) := P(Ỹ = 1 | X = x).

▶ For x ∈ X and r ∈ {0, 1}, the conditional noise probabilities are

ρr(x) := P(Ỹ ≠ Y | X = x, Y = r).

We also write PX for the marginal distribution of X.
Classifiers
A classifier C is a (measurable) function from X to {0, 1}.

The risk R(C) := P{C(X) ≠ Y} is minimised by the Bayes classifier

C^Bayes(x) := 1 if η(x) ≥ 1/2, and 0 otherwise.

A classifier Cn, depending on the training data, is said to be consistent if R(Cn) → R(C^Bayes) as n → ∞.

The corrupted risk R̃(C) := P{C(X) ≠ Ỹ} is minimised by the corrupted Bayes classifier

C̃^Bayes(x) := 1 if η̃(x) ≥ 1/2, and 0 otherwise.
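Under ρ-homogeneous noise, η̃(x) = ρ + (1 − 2ρ)η(x) is a monotone transformation of η(x), so the Bayes and corrupted Bayes classifiers agree whenever ρ < 1/2. A quick numerical sketch of this (the grid of η values is arbitrary and purely illustrative):

```python
import numpy as np

def eta_tilde(eta, rho0, rho1):
    # Corrupted regression function: P(Y_tilde = 1 | X = x) in terms of
    # eta(x) and the conditional noise probabilities rho0(x), rho1(x).
    return (1.0 - rho1) * eta + rho0 * (1.0 - eta)

eta = np.linspace(0.01, 0.99, 99)   # grid of eta(x) values, includes 0.5
rho = 0.3                           # homogeneous noise level, rho < 1/2

bayes = (eta >= 0.5).astype(int)
corrupted_bayes = (eta_tilde(eta, rho, rho) >= 0.5).astype(int)
agree = bool(np.array_equal(bayes, corrupted_bayes))
```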
General finite-sample result
Let S := {x ∈ X : η(x) = 1/2}, let B := {x ∈ S^c : ρ0(x) + ρ1(x) < 1} and let

A := {x ∈ B : [ρ1(x) − ρ0(x)] / [{2η(x) − 1}{1 − ρ0(x) − ρ1(x)}] < 1}.

Theorem. (i) PX(A △ {x ∈ B : C̃^Bayes(x) = C^Bayes(x)}) = 0.

(ii) Now suppose there exist ρ* < 1/2 and a* < 1 such that PX({x ∈ S^c : ρ0(x) + ρ1(x) > 2ρ*}) = 0, and

PX({x ∈ B : [ρ1(x) − ρ0(x)] / [{2η(x) − 1}{1 − ρ0(x) − ρ1(x)}] > a*}) = 0.

Then, for any classifier C,

R(C) − R(C^Bayes) ≤ [R̃(C) − R̃(C̃^Bayes)] / [(1 − 2ρ*)(1 − a*)].
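For ρ-homogeneous noise we may take ρ* = ρ and a* = 0, and the bound in part (ii) then holds with equality, since |2η̃(x) − 1| = (1 − 2ρ)|2η(x) − 1| and C̃^Bayes = C^Bayes. A hypothetical check on a discrete distribution (the atoms, η values and classifier C are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.2                            # homogeneous noise level, so a* = 0
m = 20
p = np.full(m, 1 / m)                # uniform marginal P_X on m atoms
eta = rng.uniform(0.0, 1.0, m)       # eta(x) at each atom
eta_t = rho + (1 - 2 * rho) * eta    # corrupted regression function

C = rng.integers(0, 2, m)            # an arbitrary classifier
bayes = (eta >= 0.5).astype(int)     # Bayes (= corrupted Bayes) classifier

# Excess risk = sum of p(x)|2 eta(x) - 1| over points where C disagrees with Bayes.
excess = np.sum(p * np.abs(2 * eta - 1) * (C != bayes))
excess_corrupted = np.sum(p * np.abs(2 * eta_t - 1) * (C != bayes))

bound = excess_corrupted / (1 - 2 * rho)   # theorem's bound with rho* = rho, a* = 0
```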
Discussion
▶ This result is particularly useful when the classifier C is trained using the noisy labels, i.e. with (X1, Ỹ1), …, (Xn, Ỹn), since then the training and test data in R̃(C) have the same distribution.
▶ We can then find conditions under which a classifier trained with imperfect labels will remain consistent for classifying uncorrupted test data points.

For specific classifiers and under stronger conditions, we can provide further control of the excess risk R(C) − R(C^Bayes).
The k-nearest neighbour classifier
For x ∈ R^d, let (X(1), Ỹ(1)), …, (X(n), Ỹ(n)) be the reordering of the corrupted training data pairs such that

‖X(1) − x‖ ≤ … ≤ ‖X(n) − x‖.

Define

C̃^knn(x) := 1 if (1/k) Σ_{i=1}^k 1{Ỹ(i) = 1} ≥ 1/2, and 0 otherwise.

Corollary. Assume the conditions of part (ii) of the general finite-sample theorem. If k = kn → ∞, but k/n → 0, then

R(C̃^knn) − R(C^Bayes) → 0

as n → ∞.
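A small simulation illustrating this consistency (the model is hypothetical: Y ~ Bernoulli(1/2), X|Y = r ~ N(2r, 1), with 0.3-homogeneous noise; here the Bayes risk is Φ(−1) ≈ 0.159). The knn classifier is trained on corrupted labels but evaluated against clean test labels:

```python
import numpy as np

rng = np.random.default_rng(2)

def knn_predict(x_train, y_train, x_test, k):
    # Vanilla k-nearest-neighbour majority vote in one dimension.
    d = np.abs(x_test[:, None] - x_train[None, :])
    idx = np.argsort(d, axis=1)[:, :k]
    return (y_train[idx].mean(axis=1) >= 0.5).astype(int)

n, n_test, rho, k = 2000, 2000, 0.3, 45   # k grows with n in the theory; fixed here

y = rng.integers(0, 2, n)
x = rng.normal(y * 2.0, 1.0)                        # X | Y = r ~ N(2r, 1)
y_tilde = np.where(rng.random(n) < rho, 1 - y, y)   # corrupted training labels

y_test = rng.integers(0, 2, n_test)
x_test = rng.normal(y_test * 2.0, 1.0)

pred = knn_predict(x, y_tilde, x_test, k)
err = np.mean(pred != y_test)   # error against the *clean* test labels
```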
Further assumptions
▶ Label noise: assume the conditions of part (ii) of the general finite-sample theorem and that

ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)),

where g : (0, 1) → [0, 1) is twice differentiable. Assume that g′(1/2) > 2g(1/2) − 1 and that g″ is uniformly continuous.
▶ Distribution (Cannings et al., 2018): among other technical conditions, assume that PX has a density f, that η is twice continuously differentiable with inf_{x0 ∈ S} ‖∇η(x0)‖ > 0, and that ∫_{R^d} ‖x‖^α f(x) dx < ∞.
▶ For β ∈ (0, 1/2), let

Kβ := {⌈(n−1)^β⌉, …, ⌊(n−1)^{1−β}⌋}.
Asymptotic expansion
Theorem. Under our assumptions, we have two cases:
(i) Suppose that d ≥ 5 and α > 4d/(d − 4), and let νn,k := k^{−1} + (k/n)^{4/d}. Then there exist B1 = B1(d, P) > 0 and B2 = B2(d, P) ≥ 0 such that, for each β ∈ (0, 1/2),

R(C̃^knn) − R(C^Bayes) = B1 / [k{1 − 2g(1/2) + g′(1/2)}²] + B2 (k/n)^{4/d} + o(νn,k)

as n → ∞, uniformly for k ∈ Kβ.

(ii) Suppose that either d ≤ 4, or d ≥ 5 and α ≤ 4d/(d − 4). Then, for each ε > 0 and β ∈ (0, 1/2), we have

R(C̃^knn) − R(C^Bayes) = B1 / [k{1 − 2g(1/2) + g′(1/2)}²] + o(1/k + (k/n)^{α/(α+d) − ε})

as n → ∞, uniformly for k ∈ Kβ.
Relative asymptotic performance
Given k to be used by the knn classifier in the noiseless case, let

kg := ⌊{1 − 2g(1/2) + g′(1/2)}^{−2d/(d+4)} k⌋.

This coupling reflects the ratio of the optimal choices of k in the corrupted and uncorrupted settings.

Corollary. Under the assumptions of part (i) of the theorem, and provided B2 > 0, we have that, for any β ∈ (0, 1/2),

[R(C̃^knn_{kg}) − R(C^Bayes)] / [R(C^knn_k) − R(C^Bayes)] → {1 − 2g(1/2) + g′(1/2)}^{−8/(d+4)}

as n → ∞, uniformly for k ∈ Kβ.

If g′(1/2) > 2g(1/2), then the label noise improves the asymptotic performance!
Intuition
For x ∈ S^c, we have

η̃(x) − 1/2 = {1 − ρ1(x)}η(x) + ρ0(x){1 − η(x)} − 1/2
           = {η(x) − 1/2}[1 − ρ0(x) − ρ1(x) + {ρ0(x) − ρ1(x)}/{2η(x) − 1}].

But, writing t := η(x) − 1/2,

1 − ρ0(x) − ρ1(x) + {ρ0(x) − ρ1(x)}/{2η(x) − 1}
  = 1 − g(1/2 + t) − g(1/2 − t) + {g(1/2 + t) − g(1/2 − t)}/(2t)
  → 1 − 2g(1/2) + g′(1/2) as t → 0.
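This convergence can be checked numerically with a smooth illustrative g (a hypothetical choice, not from the talk: g(u) = 0.1 + 0.15(u − 1/2) + 0.05(u − 1/2)², so g(1/2) = 0.1, g′(1/2) = 0.15 and the limit is 0.95):

```python
import numpy as np

def g(u):
    # A smooth illustrative noise function mapping (0,1) into [0,1).
    return 0.1 + 0.15 * (u - 0.5) + 0.05 * (u - 0.5) ** 2

def factor(t):
    # 1 - g(1/2+t) - g(1/2-t) + {g(1/2+t) - g(1/2-t)}/(2t)
    return 1 - g(0.5 + t) - g(0.5 - t) + (g(0.5 + t) - g(0.5 - t)) / (2 * t)

limit = 1 - 2 * g(0.5) + 0.15                       # 1 - 2g(1/2) + g'(1/2)
vals = np.array([factor(t) for t in (0.1, 0.01, 0.001)])
```

As t shrinks, factor(t) approaches the limit, matching the display above.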
Estimated regret ratios
Model: X|Y = r ~ N5(µr, I5), where µ1 = (3/2, 0, 0, 0, 0)^⊤ = −µ0, π1 = 0.5.

Labels: let g(1/2 + t) = 0 ∨ min{g0(1 + h0 t), 2g0}, then set ρ0(x) = g(η(x)) and ρ1(x) = g(1 − η(x)).

[Figure: estimated regret ratio against log(n), one curve per (g0, h0) setting.]

  g0    h0    Asymptotic RR
  0.1   −1    1.37
  0.1    0    1.22
  0.1    1    1.10
  0.1    2    1
  0.1    3    0.92
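The 'Asymptotic RR' column follows directly from the corollary: here g(1/2) = g0 and g′(1/2) = g0 h0, so the limiting regret ratio is {1 − 2g0 + g0 h0}^{−8/(d+4)} with d = 5. A quick reproduction:

```python
# Regret ratio {1 - 2g(1/2) + g'(1/2)}^{-8/(d+4)} for the table's settings.
d, g0 = 5, 0.1
rr = {h0: (1 - 2 * g0 + g0 * h0) ** (-8 / (d + 4)) for h0 in (-1, 0, 1, 2, 3)}
```

Rounding to two decimals recovers the table, including the crossover to ratios below 1 at h0 = 3.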
Support Vector Machines
Let H denote an RKHS, and let L(y, t) := max{0, 1 − (2y − 1)t} denote the hinge loss function. The SVM classifier, trained on the observed (possibly corrupted) labels, is given by

C̃^SVM(x) := 1 if f̂(x) ≥ 0, and 0 otherwise,

where

f̂ ∈ argmin_{f ∈ H} {(1/n) Σ_{i=1}^n L(Ỹi, f(Xi)) + λ‖f‖²_H}.

We focus on the case where H has the Gaussian radial basis reproducing kernel function K(x, x′) := exp(−σ²‖x − x′‖²), for σ > 0.
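A minimal sketch of this optimisation (not the paper's implementation): by the representer theorem, f = Σi αi K(Xi, ·), and the regularised hinge-loss objective can be minimised by subgradient descent on α. The toy data, step size, and iteration count are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(a, b, sigma):
    # Gaussian kernel K(x, x') = exp(-sigma^2 ||x - x'||^2), as on the slide.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-(sigma ** 2) * d2)

# Toy well-separated 2-d data with 0/1 labels.
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(0.0, 1.0, (n, 2)) + 3.0 * y[:, None]

K = rbf(X, X, sigma=1.0)
lam = 0.01
alpha = np.zeros(n)
s = 2 * y - 1   # hinge loss L(y, t) = max{0, 1 - (2y-1)t} uses labels in {-1, +1}

# Subgradient descent on (1/n) sum_i max{0, 1 - s_i f(X_i)} + lam ||f||_H^2,
# where f(X_i) = (K alpha)_i and ||f||_H^2 = alpha^T K alpha.
for _ in range(1000):
    f = K @ alpha
    viol = (s * f < 1).astype(float)            # points violating the margin
    grad = -(K @ (viol * s)) / n + 2 * lam * (K @ alpha)
    alpha -= 0.5 * grad

train_err = np.mean(((K @ alpha) >= 0).astype(int) != y)
```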
SVM asymptotic analysis
If PX is compactly supported and λ = λn is chosen appropriately, then this SVM classifier is consistent in the uncorrupted-labels case (Steinwart, 2005).

Corollary. Assume the conditions of part (ii) of our general finite-sample theorem, and suppose that PX is compactly supported. If λ = λn → 0 but nλn/|log λn|^{d+1} → ∞, then

R(C̃^SVM) − R(C^Bayes) → 0

as n → ∞.
SVM assumptions
1. We say that the distribution P satisfies the margin assumption with parameter γ1 ∈ [0, ∞) if there exists κ1 > 0 such that

PX({x ∈ R^d : 0 < |η(x) − 1/2| ≤ t}) ≤ κ1 t^{γ1}

for all t > 0.

2. Let S+ := {x ∈ R^d : η(x) > 1/2} and S− := {x ∈ R^d : η(x) < 1/2}, and for x ∈ R^d, let τx := inf_{x′ ∈ S∪S+} ‖x − x′‖ + inf_{x′ ∈ S∪S−} ‖x − x′‖. Say that P has geometric noise exponent γ2 ∈ [0, ∞) if there exists κ2 > 0 such that

∫_{R^d} |2η(x) − 1| exp(−τx²/t²) dPX(x) ≤ κ2 t^{γ2 d}

for all t > 0.
Rate of convergence
With perfect labels and when PX(B(0, 1)) = 1, the excess risk of the SVM classifier is O(n^{−Γ+ε}) for every ε > 0, where

Γ := γ2/(2γ2 + 1) if γ2 ≤ (γ1 + 2)/(2γ1), and Γ := 2γ2(γ1 + 1)/{2γ2(γ1 + 2) + 3γ1 + 4} otherwise

(Steinwart and Scovel, 2007).
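The exponent Γ is straightforward to compute; a hypothetical helper (the γ1 = 0 case is taken to fall in the first branch, since the threshold (γ1 + 2)/(2γ1) is then infinite):

```python
def svm_rate_exponent(gamma1, gamma2):
    # Gamma from Steinwart and Scovel (2007), as displayed above.
    threshold = float("inf") if gamma1 == 0 else (gamma1 + 2) / (2 * gamma1)
    if gamma2 <= threshold:
        return gamma2 / (2 * gamma2 + 1)
    return 2 * gamma2 * (gamma1 + 1) / (2 * gamma2 * (gamma1 + 2) + 3 * gamma1 + 4)
```

The two branches agree at the boundary γ2 = (γ1 + 2)/(2γ1), so Γ is continuous in γ2.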
Theorem. Suppose that P has margin parameter γ1 ∈ [0, ∞], geometric noise exponent γ2 ∈ (0, ∞) and PX(B(0, 1)) = 1. Assume the conditions of part (ii) of the general finite-sample theorem, and that ρ0(x) = g(η(x)), ρ1(x) = g(1 − η(x)), where g : (0, 1) → [0, 1) is differentiable at 1/2.

Let λ = λn := n^{−(γ2+1)Γ/γ2} and σ = σn := n^{Γ/(γ2 d)}. Then

R(C̃^SVM) − R(C^Bayes) = O(n^{−Γ+ε})

as n → ∞, for every ε > 0.
Linear Discriminant Analysis
Suppose that Pr = Nd(µr, Σ) for r = 0, 1. Then

C^Bayes(x) = 1 if log(π1/π0) + (x − (µ0 + µ1)/2)^⊤ Σ^{−1}(µ1 − µ0) ≥ 0, and 0 otherwise.

Define

C̃^LDA(x) := 1 if log(π̂1/π̂0) + (x − (µ̂0 + µ̂1)/2)^⊤ Σ̂^{−1}(µ̂1 − µ̂0) ≥ 0, and 0 otherwise,

where π̂r := n^{−1} Σ_{i=1}^n 1{Ỹi = r}, µ̂r := Σ_{i=1}^n Xi 1{Ỹi = r} / Σ_{i=1}^n 1{Ỹi = r}, and

Σ̂ := (n − 2)^{−1} Σ_{i=1}^n Σ_{r=0}^1 (Xi − µ̂r)(Xi − µ̂r)^⊤ 1{Ỹi = r}.
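A sketch of these plug-in estimates in Python (the data-generating model is hypothetical; the fitting step follows the displayed formulas, including the (n − 2) divisor for the pooled covariance):

```python
import numpy as np

rng = np.random.default_rng(4)

def lda_fit(X, y):
    # Plug-in estimates: class priors, class means, pooled covariance.
    n = len(y)
    pi_hat = np.array([np.mean(y == r) for r in (0, 1)])
    mu_hat = np.array([X[y == r].mean(axis=0) for r in (0, 1)])
    R0, R1 = X[y == 0] - mu_hat[0], X[y == 1] - mu_hat[1]
    Sigma_hat = (R0.T @ R0 + R1.T @ R1) / (n - 2)
    return pi_hat, mu_hat, Sigma_hat

def lda_predict(x, pi_hat, mu_hat, Sigma_hat):
    # Rule: 1{log(pi1/pi0) + (x - (mu0+mu1)/2)^T Sigma^{-1} (mu1 - mu0) >= 0}.
    w = np.linalg.solve(Sigma_hat, mu_hat[1] - mu_hat[0])
    b = np.log(pi_hat[1] / pi_hat[0]) - 0.5 * (mu_hat[0] + mu_hat[1]) @ w
    return (x @ w + b >= 0).astype(int)

# Hypothetical data: X | Y = r ~ N2(mu_r, I2), mu_0 = (-1,-1), mu_1 = (1,1).
n = 500
y = rng.integers(0, 2, n)
X = rng.normal(0.0, 1.0, (n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)

pi_hat, mu_hat, Sigma_hat = lda_fit(X, y)
pred = lda_predict(X, pi_hat, mu_hat, Sigma_hat)
train_err = np.mean(pred != y)
```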
LDA asymptotic analysis
Theorem. Assume we have ρ-homogeneous noise (ρ < 1/2) and suppose that Pr = Nd(µr, Σ), for r = 0, 1. Then

lim_{n→∞} C̃^LDA(x) = 1 if c0 + (x − (µ0 + µ1)/2)^⊤ Σ^{−1}(µ1 − µ0) > 0, and 0 if c0 + (x − (µ0 + µ1)/2)^⊤ Σ^{−1}(µ1 − µ0) < 0,

where c0 can be expressed in terms of ∆² := (µ1 − µ0)^⊤ Σ^{−1}(µ1 − µ0), ρ and π1. As a consequence,

lim_{n→∞} R(C̃^LDA) = π0 Φ(c0/∆ − ∆/2) + π1 Φ(−c0/∆ − ∆/2) ≥ R(C^Bayes),   (1)

with equality if π0 = π1 = 1/2. Moreover, for each ρ ∈ (0, 1/2) and π0 ≠ π1, there is a unique value of ∆ > 0 for which we have equality in (1).
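The limiting risk in (1) is easy to evaluate; as a sanity check (with hypothetical values ∆ = 3, π1 = 0.9), setting c0 = log(π1/π0) recovers the Bayes risk of the Gaussian model, while a mismatched intercept such as c0 = 0 gives a strictly larger risk:

```python
import math

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lda_limit_risk(c0, delta, pi1):
    # RHS of (1): pi0 Phi(c0/Delta - Delta/2) + pi1 Phi(-c0/Delta - Delta/2).
    pi0 = 1.0 - pi1
    return pi0 * Phi(c0 / delta - delta / 2) + pi1 * Phi(-c0 / delta - delta / 2)

delta, pi1 = 3.0, 0.9
bayes = lda_limit_risk(math.log(pi1 / (1 - pi1)), delta, pi1)  # c0 = log(pi1/pi0)
mismatched = lda_limit_risk(0.0, delta, pi1)                   # equal-priors intercept
```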
LDA with ρ-homogeneous noise
[Figure: misclassification error against log(n).]

Here, X|{Y = r} ~ N5(µr, I5), where µ1 = (3/2, 0, …, 0)^⊤ = −µ0 ∈ R^5, and π1 = 0.9.

No label noise (black), ρ-homogeneous noise for ρ = 0.1 (red), 0.2 (blue), 0.3 (green) and 0.4 (purple). The dotted lines show our asymptotic limit.
Summary
▶ The knn and SVM classifiers remain consistent with label noise under mild assumptions on the noise mechanism and data distribution.
▶ Under stronger conditions, the rate of convergence of the excess risk for these classifiers is preserved.
▶ However, the LDA classifier is typically not consistent, unless the class priors are equal (even with homogeneous noise).

Main reference:
▶ Cannings, T. I., Fan, Y. and Samworth, R. J. (2018) Classification with imperfect training labels. https://arxiv.org/abs/1805.11505.
Other references
▶ Cannings, T. I., Berrett, T. B. and Samworth, R. J. (2018) Local nearest neighbour classification with applications to semi-supervised learning. https://arxiv.org/abs/1704.00642v2.
▶ Frénay, B. and Kabán, A. (2014) A comprehensive introduction to label noise. Proc. Euro. Sym. Artificial Neural Networks, 667–676.
▶ Frénay, B. and Verleysen, M. (2014) Classification in the presence of label noise: a survey. IEEE Trans. on NN and Learn. Sys., 25, 845–869.
▶ Ghosh, A., Manwani, N. and Sastry, P. S. (2015) Making risk minimization tolerant to label noise. Neurocomputing, 160, 93–107.
▶ Lachenbruch, P. A. (1966) Discriminant analysis when the initial samples are misclassified. Technometrics, 8, 657–662.
▶ Okamoto, S. and Nobuhiro, Y. (1997) An average-case analysis of the k-nearest neighbor classifier for noisy domains. In Proc. 15th Int. Joint Conf. Artif. Intell., 1, 238–243.
▶ Steinwart, I. (2005) Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans. Inf. Th., 51, 128–142.
▶ Steinwart, I. and Scovel, C. (2007) Fast rates for support vector machines using Gaussian kernels. Ann. Statist., 35, 575–607.