DUAL STRATEGY ACTIVE LEARNING
Presenter: Pinar Donmez¹
Joint work with Jaime G. Carbonell¹ & Paul N. Bennett²
¹ Language Technologies Institute, Carnegie Mellon University   ² Microsoft Research
Active Learning (Pool-based)
[Diagram of the pool-based loop: the learning mechanism sends a label request for unlabeled data from the data source to the expert; the expert returns labeled data; the mechanism learns a new model and produces output for the user.]
Two Different Trends in Active Learning
Uncertainty Sampling:
  selects the example with the lowest certainty, i.e. closest to the decision boundary
  e.g. maximum entropy sampling, ...
Density-based Sampling:
  considers the underlying data distribution
  selects representatives of large clusters
  aims to cover the input space quickly
  e.g. representative sampling, active learning using pre-clustering, etc.
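The two families can be sketched as scoring functions. A minimal Python sketch; the binary-entropy and Gaussian-kernel forms here are illustrative choices for the two trends, not the exact scores of any one cited method:

```python
import numpy as np

def uncertainty_score(p_pos):
    """Binary entropy of the predicted positive-class probability:
    highest when p = 0.5, i.e. closest to the decision boundary."""
    p = np.clip(p_pos, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def density_score(x, pool, sigma=1.0):
    """Average Gaussian similarity of x to the unlabeled pool:
    high for points in dense regions / large clusters."""
    d2 = np.sum((np.asarray(pool) - np.asarray(x)) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * sigma ** 2))))
```

Uncertainty sampling would pick the pool point maximizing the first score; density-based sampling the one maximizing the second.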
Goal of this Work
Find an active learning method that works well everywhere
  Some work best when very few instances are sampled (i.e. density-based sampling)
  Some work best after substantial sampling (i.e. uncertainty sampling)
Combine the best of both worlds for superior performance
Main Features of DUAL
DUAL:
  is dynamic rather than static
  is context-sensitive
  builds upon the work titled “Active Learning with Pre-Clustering” (Nguyen & Smeulders, 2004)
  proposes a mixture model of density and uncertainty
DUAL’s primary focus is to:
  outperform static strategies over a large operating range
  improve learning for the later iterations rather than concentrating on the initial data labeling
Related Work
                         DUAL   AL with Pre-Clustering   Representative Sampling   COMB
Clustering               Yes    Yes                      Yes                       No
Uncertainty + Density    Yes    Yes                      Yes                       No
Dynamic                  Yes    No                       No                        Yes
Active Learning with Pre-Clustering
We call it Density Weighted Uncertainty Sampling (DWUS in short). Why?
  assumes a hidden clustering structure of the data
  calculates the posterior P(y | x) as

    P(y | x) = Σ_{k=1}^{K} P(y, k | x) = Σ_{k=1}^{K} P(y | k, x) P(k | x)   [2]

  x and y are conditionally independent given k, since points in one cluster are assumed to share the same label:

    P(y | x) = Σ_{k=1}^{K} P(y, k | x) = Σ_{k=1}^{K} P(y | k) P(k | x)   [3]

  selection criterion = uncertainty score × density score:

    x_s = argmax_{i ∈ I_U} E[(ŷ_i − y_i)² | x_i] · p(x_i)   [1]
Outline of DWUS
1. Cluster the data using the K-medoid algorithm to find the cluster centroids c_k
2. Estimate P(k | x) by a standard EM procedure
3. Model P(y | k) as a logistic regression classifier:

     P(y | k) = 1 / (1 + exp(−y (a · c_k + b)))

4. Estimate P(y | x) using

     P(y | x) = Σ_{k=1}^{K} P(y | k) P(k | x)

5. Select an unlabeled instance using Eq. 1
6. Update the parameters of the logistic regression model (hence update P(y | k))
7. Repeat steps 3-6 until the stopping criterion is met
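The DWUS outline above can be sketched end-to-end. A minimal illustration on toy data: the centroids are fixed by hand (as if K-medoid had run), the cluster priors P(k) are fixed rather than EM-estimated, and the logistic parameters a, b are illustrative values, not fitted as in step 6:

```python
import numpy as np

# Toy data: two Gaussian blobs standing in for a clustered pool.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (30, 2)), rng.normal(2.0, 0.5, (30, 2))])
centroids = np.array([[-2.0, -2.0], [2.0, 2.0]])  # as if found by K-medoid
sigma = 1.0
prior_k = np.array([0.5, 0.5])  # P(k), here fixed instead of EM-estimated

# Step 2 (one E-step): P(k|x) from Gaussian responsibilities with shared sigma.
d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
resp = prior_k * np.exp(-d2 / (2.0 * sigma ** 2))
p_k_given_x = resp / resp.sum(axis=1, keepdims=True)

# Step 3: P(y=1|k) via the logistic model on the centroids.
a, b = np.array([1.0, 1.0]), 0.0  # illustrative parameters
p_y_given_k = 1.0 / (1.0 + np.exp(-(centroids @ a + b)))

# Step 4: P(y=1|x) = sum_k P(y=1|k) P(k|x)
p_y_given_x = p_k_given_x @ p_y_given_k

# Step 5 (Eq. 1): pick the point maximizing expected error times density.
p_x = (prior_k * np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)).sum(1)
y_hat = (p_y_given_x >= 0.5).astype(float)
exp_err = (y_hat - 1.0) ** 2 * p_y_given_x + y_hat ** 2 * (1.0 - p_y_given_x)
selected = int(np.argmax(exp_err * p_x))
```

The selected index is a point that is both uncertain under the current model and in a dense region, which is exactly the DWUS trade-off.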
Notes on DWUS
Posterior class distribution:

  P(y | x) = Σ_{k=1}^{K} P(y | k) P(k | x)

P(y | k) is calculated via the logistic model

  P(y | k) = 1 / (1 + exp(−y (a · c_k + b)))

P(k | x) is estimated using an EM procedure after the clustering
p(x | k) is a multivariate Gaussian with the same σ for all clusters:

  p(x | k) = (2πσ²)^(−d/2) exp{−||x − c_k||² / (2σ²)}

and the data density is p(x) = Σ_{k=1}^{K} p(x | k) P(k)

The logistic regression model estimates its parameters (a, b) by maximizing the log-likelihood

  L = Σ_{i ∈ I_l} ln P(y_i | x_i; a, b)
Motivation for DUAL
Strength of DWUS: favors higher density samples close to the decision boundary => fast decrease in error
But! DWUS establishes diminishing returns! Why?
  Early iterations -> many points are highly uncertain
  Later iterations -> points with high uncertainty are no longer in dense regions
  DWUS wastes time picking instances with no direct effect on the error
How does DUAL do better? Runs DWUS until it estimates a cross-over:
  monitor the change in expected error at each iteration to detect when it is stuck in a local minimum, i.e. when

    δ_DWUS(t) = (1/n) Σ_i E[(ŷ_i − y_i)² | x_i] ≈ 0

  DUAL uses a mixture model after the cross-over (saturation) point:

    x_s* = argmax_{i ∈ I_U} π · E[(ŷ_i − y_i)² | x_i] + (1 − π) · p(x_i)

Our goal should be to minimize the expected future error. If we knew the future error of Uncertainty Sampling (US) to be zero, then we’d force π = 1. But in practice, we do not know it.
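The cross-over detection can be sketched as monitoring the mean expected error at each iteration; in this sketch, `eps` and `window` are illustrative thresholds I chose, not values from the paper:

```python
import numpy as np

def crossover_detected(delta_history, eps=1e-3, window=3):
    """Flag saturation: the mean expected error of DWUS has stopped
    decreasing.  delta_history[t] is the mean E[(y_hat - y)^2 | x] over
    the unlabeled pool at iteration t; eps and window are illustrative."""
    if len(delta_history) < window + 1:
        return False
    recent = delta_history[-(window + 1):]
    drops = np.abs(np.diff(recent))
    return bool(np.all(drops < eps))
```

Once this returns True, DUAL switches from pure DWUS to the mixture criterion.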
More on DUAL
After the cross-over, US does better => the uncertainty score should be given more weight
The weight should reflect how well US performs; it can be calculated from ê(US), the expected error of US on the unlabeled data*
Finally, we have the following selection criterion for DUAL:

  x_s = argmax_{i ∈ I_U} (1 − ê(US)) · E[(ŷ_i − y_i)² | x_i] + ê(US) · p(x_i)

* US is allowed to choose data only from among the already sampled instances, and ê(US) is calculated on the remaining unlabeled set
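The DUAL criterion is just a weighted combination of the two scores. A minimal sketch, assuming `err_us` is a scalar estimate of the US error in [0, 1]:

```python
import numpy as np

def dual_select(exp_err, density, err_us):
    """DUAL selection after the cross-over: the better US performs
    (the smaller err_us), the more weight the uncertainty term gets."""
    exp_err, density = np.asarray(exp_err), np.asarray(density)
    scores = (1.0 - err_us) * exp_err + err_us * density
    return int(np.argmax(scores))
```

With err_us near 0 this reduces to pure uncertainty sampling; with err_us near 1 it falls back to density-based sampling.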
A simple Illustration I
[Figure: toy two-class data, points labeled 1 and 2]
A simple Illustration II
[Figure: toy two-class data, points labeled 1 and 2]
A simple Illustration III
[Figure: toy two-class data, points labeled 1 and 2]
A simple Illustration IV
[Figure: toy two-class data, points labeled 1 and 2]
Experiments
initial training set size: 0.4% of the entire data (n+ = n−)
The results are averaged over 4 runs; each run takes 100 iterations
DUAL outperforms:
  DWUS with p < 0.0001 significance* after the 40th iteration
  Representative Sampling (p < 0.0001) on all datasets
  COMB (p < 0.0001) on 4 datasets, and p < 0.05 on Image and M-vs-N
  US (p < 0.001) on 5 datasets
  DS (p < 0.0001) on 5 datasets
* All significance results are based on a 2-sided paired t-test on the classification error
Results: DUAL vs DWUS
Results: DUAL vs US
Results: DUAL vs DS
Results: DUAL vs COMB
Results: DUAL vs Representative S.
Failure Analysis
The current estimate of the cross-over point is not accurate on the V-vs-Y dataset => simulate a better error estimator
Currently, DUAL only considers the performance of US. But on Splice, DS is better => modify the selection criterion:

  x_s = argmax_{i ∈ I_U} ê(DS) · E[(ŷ_i − y_i)² | x_i] + (1 − ê(DS)) · p(x_i)
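The modified criterion mirrors the US-weighted one with the roles reversed. A sketch, again assuming a scalar DS error estimate `err_ds` in [0, 1]:

```python
import numpy as np

def dual_select_ds(exp_err, density, err_ds):
    """Modified DUAL criterion for datasets where density sampling (DS)
    dominates: the better DS performs (the smaller err_ds), the more
    weight the density term gets."""
    exp_err, density = np.asarray(exp_err), np.asarray(density)
    scores = err_ds * exp_err + (1.0 - err_ds) * density
    return int(np.argmax(scores))
```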
Conclusion
DUAL robustly combines density and uncertainty (can be generalized to other active sampling methods which exhibit differential performance)
DUAL leads to more effective performance than individual strategies
DUAL shows the error of one method can be estimated using the data labeled by the other
DUAL can be applied to multi-class problems where the error is estimated either globally or at the class or the instance level
Future Work
Generalize DUAL to estimate which method is currently dominant or use a relative success weight
Apply DUAL to more than two strategies to maximize the diversity of an ensemble
Investigate better techniques to estimate the future classification error
THANK YOU!
The error expectation for a given point:

  E[(ŷ_i − y_i)² | x_i] = (ŷ_i − 1)² P(y_i = 1 | x_i) + ŷ_i² P(y_i = 0 | x_i)

Data density is estimated as a mixture of K Gaussians:

  p(x) = Σ_{k=1}^{K} p(x | k) P(k)

EM procedure to estimate P(k):

  E-step:  P(k | x_i) ∝ P(k) exp{−||x_i − c_k||² / (2σ²)}
  M-step:  P(k) = (1/n) Σ_{i=1}^{n} P(k | x_i)

Likelihood:

  L(a, b) = −(1/2) ||a||² + Σ_{i ∈ I_l} ln{ Σ_{k=1}^{K} P(k | x_i) P(y_i | k; a, b) }
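The E- and M-steps for P(k) fit in a few lines. A minimal sketch with fixed centroids and a shared σ, as assumed by the mixture:

```python
import numpy as np

def em_step(X, centroids, prior_k, sigma=1.0):
    """One EM iteration for the cluster priors P(k) with fixed centroids.
    E-step: responsibilities P(k|x_i) proportional to P(k) times the
    Gaussian kernel; M-step: P(k) = mean_i P(k|x_i)."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    resp = prior_k * np.exp(-d2 / (2.0 * sigma ** 2))
    p_k_given_x = resp / resp.sum(axis=1, keepdims=True)
    return p_k_given_x, p_k_given_x.mean(axis=0)
```

Iterating this until the priors stabilize gives the P(k | x) used in the posterior and density estimates above.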