
DUAL STRATEGY ACTIVE LEARNING

presenter: Pinar Donmez [1]

Joint work with Jaime G. Carbonell [1] & Paul N. Bennett [2]

[1] Language Technologies Institute, Carnegie Mellon University
[2] Microsoft Research


Active Learning (Pool-based)

[Diagram: the learning mechanism draws unlabeled data from the data source, sends a label request to the expert, receives the labeled data, learns a new model, and delivers the output to the user; the loop then repeats. A minimal sketch of this loop follows.]
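To make the loop concrete, here is a minimal sketch in Python, assuming a scikit-learn-style classifier. `y_oracle` (a NumPy label array standing in for the human expert) and `select` (a pluggable scoring strategy) are illustrative names, not part of the original slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pool_based_active_learning(X_pool, y_oracle, select, n_queries=50, n_seed=4):
    """Generic pool-based loop: select an instance, send a label request
    to the expert (simulated here by y_oracle), add the labeled data,
    and learn a new model."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))
    unlabeled = set(range(len(X_pool))) - set(labeled)

    model = LogisticRegression()
    for _ in range(n_queries):
        # learn a new model (assumes the random seed covers both classes)
        model.fit(X_pool[labeled], y_oracle[labeled])
        candidates = np.fromiter(unlabeled, dtype=int)
        scores = select(model, X_pool, candidates)   # any sampling strategy
        pick = int(candidates[np.argmax(scores)])    # label request
        labeled.append(pick)                         # expert returns the label
        unlabeled.remove(pick)
    return model, labeled
```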


Two Different Trends in Active Learning

Uncertainty Sampling:
  selects the example with the lowest certainty, e.g. closest to the boundary, maximum entropy, ...

Density-based Sampling:
  considers the underlying data distribution
  selects representatives of large clusters
  aims to cover the input space quickly
  e.g. representative sampling, active learning using pre-clustering, etc.

(Both scoring families are sketched below.)
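A rough sketch of the two scoring families, assuming a probabilistic binary classifier and NumPy arrays; the function names and the kernel bandwidth are illustrative choices, not from the slides:

```python
import numpy as np

def uncertainty_score(model, X_pool, candidates):
    """Uncertainty sampling: highest score for points closest to the
    decision boundary (posterior nearest 0.5 in the binary case)."""
    p = model.predict_proba(X_pool[candidates])[:, 1]
    return -np.abs(p - 0.5)

def density_score(model, X_pool, candidates, bandwidth=1.0):
    """Density-based sampling: favor points in dense regions, here via
    a simple Gaussian kernel density estimate over the whole pool."""
    d2 = ((X_pool[candidates, None, :] - X_pool[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)
```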


Goal of this Work

Find an active learning method that works well everywhere:
  some methods work best when very few instances have been sampled (e.g. density-based sampling)
  some work best after substantial sampling (e.g. uncertainty sampling)

Combine the best of both worlds for superior performance.


Main Features of DUAL

DUAL:
  is dynamic rather than static
  is context-sensitive
  builds upon the work titled "Active Learning with Pre-Clustering" (Nguyen & Smeulders, 2004)
  proposes a mixture model of density and uncertainty

DUAL's primary focus is to:
  outperform static strategies over a large operating range
  improve learning in the later iterations rather than concentrating on the initial data labeling


Related Work

Feature                 DUAL   AL with Pre-Clustering   Representative Sampling   COMB
Clustering              Yes    Yes                      Yes                       No
Uncertainty + Density   Yes    Yes                      Yes                       No
Dynamic                 Yes    No                       No                        Yes


Active Learning with Pre-Clustering

We call it Density-Weighted Uncertainty Sampling (DWUS for short). Why?
  it assumes a hidden clustering structure of the data
  it calculates the posterior P(y | x) by marginalizing over the clusters
  x and y are conditionally independent given k, since points in one cluster are assumed to share the same label

$$P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k, x)\, P(k \mid x) \qquad [2]$$

and, by the conditional independence of $x$ and $y$ given $k$:

$$P(y \mid x) = \sum_{k=1}^{K} P(y \mid k)\, P(k \mid x) \qquad [3]$$

Selection criterion:

$$x_s = \arg\max_{i \in I_U} \; \underbrace{E[(\hat{y}_i - y_i)^2 \mid x_i]}_{\text{uncertainty score}} \cdot \underbrace{p(x_i)}_{\text{density score}} \qquad [1]$$
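As an illustrative check with made-up numbers: for $K = 2$ clusters with $P(k \mid x) = (0.7,\, 0.3)$ and $P(y{=}1 \mid k) = (0.9,\, 0.2)$, Eq. [3] gives $P(y{=}1 \mid x) = 0.7 \cdot 0.9 + 0.3 \cdot 0.2 = 0.69$.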


Outline of DWUS

1. Cluster the data using the K-medoid algorithm to find the cluster centroids c_k
2. Estimate P(k|x) by a standard EM procedure
3. Model P(y|k) as a logistic regression classifier:
$$P(y \mid k) = \frac{1}{1 + \exp(-y\,(c_k \cdot a + b))}$$
4. Estimate P(y|x) using
$$P(y \mid x) = \sum_{k=1}^{K} P(y \mid k)\, P(k \mid x)$$
5. Select an unlabeled instance using Eq. 1
6. Update the parameters of the logistic regression model (hence update P(y|k))
7. Repeat steps 3-5 until the stopping criterion is met

(A sketch of one selection step follows.)


Notes on DWUS

Posterior class distribution:
$$P(y \mid x) = \sum_{k=1}^{K} P(y \mid k)\, P(k \mid x)$$

P(y | k) is calculated via
$$P(y \mid k) = \frac{1}{1 + \exp(-y\,(c_k \cdot a + b))}$$

P(k|x) is estimated using an EM procedure after the clustering.

p(x | k) is a multivariate Gaussian with the same σ for all clusters:
$$p(x \mid k) = (2\pi\sigma^2)^{-d/2} \exp\left\{-\frac{\|x - c_k\|^2}{2\sigma^2}\right\}, \qquad p(x) = \sum_{k=1}^{K} p(x \mid k)\, P(k)$$

The logistic regression model estimates its parameters by maximizing the log-likelihood
$$L = \sum_{i \in I_l} \ln P(y_i \mid x_i;\, a, b)$$


Motivation for DUAL

Strength of DWUS: it favors high-density samples close to the decision boundary, yielding a fast decrease in error. But...

DWUS exhibits diminishing returns! Why?
  early iterations -> many points are highly uncertain
  later iterations -> points with high uncertainty are no longer in dense regions
  DWUS wastes time picking instances with no direct effect on the error


How does DUAL do better?

Runs DWUS until it estimates a cross-over:
  monitors the change in expected error at each iteration to detect when DWUS is stuck in a local minimum, i.e. when

$$\hat{\delta}^{(t)}(\mathrm{DWUS}) = \frac{1}{n} \sum_{i} E[(\hat{y}_i - y_i)^2 \mid x_i] \approx 0$$

(a detection sketch follows below)

DUAL uses a mixture model after the cross-over (saturation) point:

$$x_s^* = \arg\max_{i \in I_U} \; \pi \cdot E[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \pi) \cdot p(x_i)$$

Our goal should be to minimize the expected future error. If we knew the future error of Uncertainty Sampling (US) to be zero, then we'd force $\pi = 1$; but in practice, we do not know it.
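One plausible way to operationalize the cross-over test, offered only as a sketch: the window size and tolerance below are illustrative knobs, not values from the paper.

```python
import numpy as np

def crossed_over(expected_errors, window=3, eps=1e-3):
    """Heuristic cross-over test: treat DWUS as 'stuck' when its mean
    expected error  delta(t) = (1/n) sum_i E[(yhat_i - y_i)^2 | x_i]
    has stopped decreasing over the last few iterations.

    `expected_errors` is the history of delta(t) over iterations.
    """
    if len(expected_errors) <= window:
        return False
    recent = np.diff(expected_errors[-(window + 1):])
    return bool(np.all(np.abs(recent) < eps))
```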


More on DUAL

After the cross-over, US does better => the uncertainty score should be given more weight.

The weight should reflect how well US performs; it can be calculated from $\hat{\epsilon}(\mathrm{US})$, the expected error of US on the unlabeled data*.

Finally, we have the following selection criterion for DUAL (sketched in code below):

$$x_s^* = \arg\max_{i \in I_U} \; (1 - \hat{\epsilon}(\mathrm{US})) \cdot E[(\hat{y}_i - y_i)^2 \mid x_i] + \hat{\epsilon}(\mathrm{US}) \cdot p(x_i)$$

* US is allowed to choose data only from among the already sampled instances, and $\hat{\epsilon}(\mathrm{US})$ is calculated on the remaining unlabeled set.
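The criterion itself reduces to a one-line reweighting of the two DWUS scores; a minimal sketch, assuming the per-candidate uncertainty and density scores are precomputed as in the DWUS sketch above:

```python
import numpy as np

def dual_select(unc, dens, eps_us, unlabeled):
    """DUAL's post-cross-over criterion:
        x* = argmax_i (1 - eps_us) * E[(yhat_i - y_i)^2 | x_i] + eps_us * p(x_i)
    where eps_us is the estimated error of pure uncertainty sampling
    on the remaining unlabeled data."""
    scores = (1.0 - eps_us) * unc + eps_us * dens
    return unlabeled[int(np.argmax(scores))]
```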


A simple Illustration I

[Figure: scatter plot of a two-class toy dataset (classes 1 and 2) illustrating the sampling process.]


A simple Illustration II

[Figure: the same two-class toy dataset at a later sampling step.]


A simple Illustration III

[Figure: the same two-class toy dataset at a later sampling step.]


A simple Illustration IV

[Figure: the same two-class toy dataset at a later sampling step.]


Experiments

Initial training set size: 0.4% of the entire data (n+ = n-). The results are averaged over 4 runs; each run takes 100 iterations.

DUAL outperforms:
  DWUS with p < 0.0001 significance* after the 40th iteration
  Representative Sampling (p < 0.0001) on all datasets
  COMB (p < 0.0001) on 4 datasets, and p < 0.05 on Image and M-vs-N
  US (p < 0.001) on 5 datasets
  DS (p < 0.0001) on 5 datasets

* All significance results are based on a 2-sided paired t-test on the classification error (see the sketch below).
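For reference, this kind of test can be run with `scipy.stats.ttest_rel`; the error arrays below are synthetic placeholders, not the reported results:

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative only: paired per-iteration classification errors of two
# strategies on the same runs (real values come from the experiments).
rng = np.random.default_rng(0)
errors_dual = 0.10 + 0.02 * rng.random(100)
errors_dwus = 0.12 + 0.02 * rng.random(100)

t_stat, p_value = ttest_rel(errors_dual, errors_dwus)  # 2-sided by default
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```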


Results: DUAL vs DWUS


Results: DUAL vs US


Results: DUAL vs DS


Results: DUAL vs COMB


Results: DUAL vs Representative Sampling


Failure Analysis

The current estimate of the cross-over point is not accurate on the V-vs-Y dataset => simulate a better error estimator.

Currently, DUAL only considers the performance of US. But on Splice, DS is better => modify the selection criterion (sketched below):

$$x_s^* = \arg\max_{i \in I_U} \; \hat{\epsilon}(\mathrm{DS}) \cdot E[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \hat{\epsilon}(\mathrm{DS})) \cdot p(x_i)$$
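The modified criterion is again a one-line reweighting; a sketch with illustrative names, mirroring `dual_select` above:

```python
import numpy as np

def dual_select_vs_ds(unc, dens, eps_ds, unlabeled):
    """Failure-analysis variant: weight by the estimated error of
    density-based sampling (DS) instead of US, so the uncertainty
    term gains weight as DS errs more."""
    scores = eps_ds * unc + (1.0 - eps_ds) * dens
    return unlabeled[int(np.argmax(scores))]
```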


Conclusion

DUAL robustly combines density and uncertainty (and can be generalized to other active sampling methods that exhibit differential performance).

DUAL leads to more effective performance than the individual strategies.

DUAL shows that the error of one method can be estimated using the data labeled by the other.

DUAL can be applied to multi-class problems, where the error is estimated either globally or at the class or instance level.


Future Work

Generalize DUAL to estimate which method is currently dominant or use a relative success weight

Apply DUAL to more than two strategies to maximize the diversity of an ensemble

Investigate better techniques to estimate the future classification error


THANK YOU!


The error expectation for a given point:

$$E[(\hat{y}_i - y_i)^2 \mid x_i] = (\hat{y}_i - 1)^2\, P(y_i = 1 \mid x_i) + (\hat{y}_i - 0)^2\, P(y_i = 0 \mid x_i)$$

Data density is estimated as a mixture of K Gaussians:

$$p(x) = \sum_{k=1}^{K} p(x \mid k)\, P(k)$$

EM procedure to estimate P(k) (a small sketch of this loop follows):

$$P(k \mid x_i) \propto P(k) \exp\left\{-\frac{\|x_i - c_k\|^2}{2\sigma^2}\right\}, \qquad P(k) = \frac{1}{n} \sum_{i=1}^{n} P(k \mid x_i)$$

Likelihood:

$$L(a, b) = -\frac{\lambda}{2}\|a\|^2 + \sum_{i \in I_l} \ln\left\{\sum_{k=1}^{K} P(k \mid x_i)\, P(y_i \mid k;\, a, b)\right\}$$