Active learning lecture
Transcript of Active learning lecture
![Page 1: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/1.jpg)
Active Learning
COMS 6998-4: Learning and Empirical Inference
Irina Rish, IBM T.J. Watson Research Center
![Page 2: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/2.jpg)
2
Outline
Motivation
Active learning approaches: membership queries, uncertainty sampling, information-based loss functions, uncertainty-region sampling, query by committee
Applications: active collaborative prediction, active Bayes net learning
![Page 3: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/3.jpg)
3
Standard supervised learning model: given m labeled points, we want to learn a classifier with misclassification rate ≤ ε, chosen from a hypothesis class H with VC dimension d < ∞.
VC theory: we need m to be roughly d/ε in the realizable case.
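As a back-of-the-envelope sketch of this bound (the helper name is ours, and constants and log factors are deliberately dropped):

```python
def passive_label_bound(d, eps):
    """Rough passive sample size from VC theory: m ~ d / eps (realizable case).

    Illustrative only -- constants and log factors are omitted.
    """
    return d / eps

# A class of VC dimension 10 and target error 1% needs on the order of
# 1000 labeled examples.
print(passive_label_bound(10, 0.01))  # -> 1000.0
```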
![Page 4: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/4.jpg)
4
Active learning
In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.
What is the minimum number of labels needed to achieve the target error rate?
![Page 5: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/5.jpg)
5
![Page 6: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/6.jpg)
6
What is Active Learning?
Unlabeled data are readily available; labels are expensive
Want to use adaptive decisions to choose which labels to acquire for a given dataset
Goal is accurate classifier with minimal cost
![Page 7: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/7.jpg)
7
Active learning warning
The choice of data is only as good as the model itself. Assume a linear model; then two data points are sufficient. What happens when the data are not linear?
![Page 8: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/8.jpg)
8
Active Learning Flavors
Selective sampling vs. membership queries
Pool-based vs. sequential
Myopic vs. batch
![Page 9: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/9.jpg)
9
Active Learning Approaches
Membership queries
Uncertainty sampling
Information-based loss functions
Uncertainty-region sampling
Query by committee
![Page 10: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/10.jpg)
10
![Page 11: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/11.jpg)
11
![Page 12: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/12.jpg)
12
Problem
Many results exist in this framework, even for complicated hypothesis classes.
[Baum and Lang, 1991] tried fitting a neural net to handwritten characters. The synthetic instances it created were incomprehensible to humans!
[Lewis and Gale, 1992] tried training text classifiers: "an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher."
![Page 13: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/13.jpg)
13
![Page 14: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/14.jpg)
14
![Page 15: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/15.jpg)
15
![Page 16: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/16.jpg)
16
Uncertainty Sampling
Query the event that the current classifier is most uncertain about
Used trivially in SVMs, graphical models, etc.
If uncertainty is measured in Euclidean distance:
[Lewis & Gale, 1994]
![Page 17: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/17.jpg)
17
Information-based Loss Function
Maximize KL-divergence between posterior and prior
Maximize reduction in model entropy between posterior and prior
Minimize cross-entropy between posterior and prior
All of these are notions of information gain
[MacKay, 1992]
![Page 18: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/18.jpg)
18
Query by Committee
Maintain a prior distribution over hypotheses
Sample a set of classifiers from the distribution
Query an example based on the degree of disagreement between the committee of classifiers
[Seung et al. 1992, Freund et al. 1997]
![Page 19: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/19.jpg)
Infogain-based Active Learning
![Page 20: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/20.jpg)
20
Notation
We Have:
1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q
![Page 21: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/21.jpg)
21
Dataset (D) Example

| t | Sex | Age   | Test A | Test B | Test C | Disease |
|---|-----|-------|--------|--------|--------|---------|
| 0 | M   | 40-50 | 0      | 1      | 1      | ?       |
| 1 | F   | 50-60 | 0      | 1      | 0      | ?       |
| 2 | F   | 30-40 | 0      | 0      | 0      | ?       |
| 3 | F   | 60+   | 1      | 1      | 1      | ?       |
| 4 | M   | 10-20 | 0      | 1      | 0      | ?       |
| 5 | M   | 40-50 | 0      | 0      | 1      | ?       |
| 6 | F   | 0-10  | 0      | 0      | 0      | ?       |
| 7 | M   | 30-40 | 1      | 1      | 0      | ?       |
| 8 | M   | 20-30 | 0      | 0      | 1      | ?       |
![Page 22: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/22.jpg)
22
Notation
We Have:
1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q
![Page 23: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/23.jpg)
23
Model Example
Probabilistic classifier: S_t → O_t

Notation:
T: number of examples
O_t: vector of features of example t
S_t: class of example t
![Page 24: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/24.jpg)
24
Model Example
Patient state (S_t): S_t = DiseaseState
Patient observations (O_t): O_t1 = Gender, O_t2 = Age, O_t3 = TestA, O_t4 = TestB, O_t5 = TestC
![Page 25: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/25.jpg)
25
Possible Model Structures
(Figure: two candidate network structures over the variables S, Age, Gender, TestA, TestB, TestC)
![Page 26: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/26.jpg)
26
Model Space
Model: S_t → O_t
Model parameters: P(S_t), P(O_t | S_t)
Generative model: must be able to compute P(S_t = i, O_t = o_t | w)
![Page 27: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/27.jpg)
27
Model Parameter Space (W)
• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models: P(W | D) ∝ P(D | W) P(W), where P(D | W) = ∏_{t=1}^T P(S_t, O_t | W)
![Page 28: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/28.jpg)
28
Notation
We Have:
1. Dataset, D
2. Model parameter space, W
3. Query algorithm, q
q(W,D) returns t*, the next sample to label
![Page 29: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/29.jpg)
29
Game
while NotDone:
• Learn P(W | D)
• q chooses the next example to label
• Expert adds the label to D
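The "game" above can be sketched as a generic loop; all helper names here (`fit_posterior`, `query`, `get_label`) are hypothetical placeholders standing in for a concrete learner, query strategy, and expert:

```python
import random

def active_learning_loop(unlabeled, fit_posterior, query, get_label, budget):
    """Generic active-learning 'game': fit P(W|D), pick a query, get its label."""
    labeled = {}                                  # D: index -> label
    for _ in range(budget):
        posterior = fit_posterior(labeled)        # learn P(W | D)
        t = query(posterior, unlabeled, labeled)  # q chooses the next example
        labeled[t] = get_label(t)                 # expert adds the label to D
    return labeled

# Toy run: the 'posterior' is ignored, q picks a random unqueried index,
# and the 'expert' labels an example True iff its value exceeds 0.5.
data = [0.1, 0.9, 0.4, 0.7]
result = active_learning_loop(
    unlabeled=list(range(len(data))),
    fit_posterior=lambda D: None,
    query=lambda post, U, D: random.choice([t for t in U if t not in D]),
    get_label=lambda t: data[t] > 0.5,
    budget=3,
)
print(len(result))  # 3 labels acquired
```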
![Page 30: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/30.jpg)
30
(Figure: simulation of the active-learning game. Seven examples O1..O7 with hidden states S1..S7; the query algorithm q has obtained the labels S2 = false, S7 = false, S5 = true, and is pondering which of the remaining '?' examples to query next.)
![Page 31: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/31.jpg)
31
Active Learning Flavors
• Pool
(“random access” to patients)
• Sequential
(must decide as patients walk in the door)
![Page 32: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/32.jpg)
32
q?
• Recall: q(W,D) returns the "most interesting" unlabeled example.
• Well, what makes a doctor curious about a patient?
![Page 33: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/33.jpg)
33
1994
![Page 34: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/34.jpg)
34
Score Function
score_uncert(S_t) = uncertainty(P(S_t | O_t)) = H(S_t) = −Σ_i P(S_t = i) log P(S_t = i)
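A minimal sketch of this score for a binary class variable (it uses base-2 entropy, so the numbers differ from the table on the next slide, but the selection rule is the same):

```python
import math

def entropy(p):
    """Binary entropy H(S_t) = -sum_i P(S_t=i) log2 P(S_t=i)."""
    if p in (0.0, 1.0):
        return 0.0                # a known outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def most_uncertain(posteriors):
    """Index of the example whose predicted class probability is most uncertain."""
    return max(range(len(posteriors)), key=lambda t: entropy(posteriors[t]))

probs = [0.02, 0.01, 0.05, 0.12, 0.01, 0.96]  # P(S_t = true) for six patients
print(most_uncertain(probs))  # -> 3 (p = 0.12 is closest to 0.5)
```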
![Page 35: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/35.jpg)
35
Uncertainty Sampling Example

| t | Sex | Age   | Test A | Test B | Test C | S_t | P(S_t) | H(S_t) |
|---|-----|-------|--------|--------|--------|-----|--------|--------|
| 1 | M   | 20-30 | 0      | 1      | 1      | ?   | 0.02   | 0.043  |
| 2 | F   | 20-30 | 0      | 1      | 0      | ?   | 0.01   | 0.024  |
| 3 | F   | 30-40 | 1      | 0      | 0      | ?   | 0.05   | 0.086  |
| 4 | F   | 60+   | 1      | 1      | 0      | ?   | 0.12   | 0.159  |
| 5 | M   | 10-20 | 0      | 1      | 0      | ?   | 0.01   | 0.024  |
| 6 | M   | 20-30 | 1      | 1      | 1      | ?   | 0.96   | 0.073  |

The example with maximal entropy (t = 4) is queried; its label comes back FALSE.
![Page 36: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/36.jpg)
36
Uncertainty Sampling Example (after labeling t = 4)

| t | Sex | Age   | Test A | Test B | Test C | S_t   | P(S_t) | H(S_t) |
|---|-----|-------|--------|--------|--------|-------|--------|--------|
| 1 | M   | 20-30 | 0      | 1      | 1      | ?     | 0.01   | 0.024  |
| 2 | F   | 20-30 | 0      | 1      | 0      | ?     | 0.02   | 0.043  |
| 3 | F   | 30-40 | 1      | 0      | 0      | ?     | 0.04   | 0.073  |
| 4 | F   | 60+   | 1      | 1      | 0      | FALSE | 0.00   | 0.00   |
| 5 | M   | 10-20 | 0      | 1      | 0      | ?     | 0.06   | 0.112  |
| 6 | M   | 20-30 | 1      | 1      | 1      | ?     | 0.97   | 0.059  |

Now t = 5 has maximal entropy; it is queried and labeled TRUE.
![Page 37: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/37.jpg)
37
Uncertainty Sampling
GOOD: couldn't be easier
GOOD: often performs pretty well

BAD: H(S_t) measures information gain about the samples, not the model
BAD: sensitive to noisy samples
![Page 38: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/38.jpg)
38
Can we do better thanuncertainty sampling?
![Page 39: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/39.jpg)
39
1992
Informative with respect to what?
![Page 40: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/40.jpg)
40
Model Entropy
(Figure: three posteriors P(W | D) over model space W, from a flat posterior with high H(W), to a more peaked "better" one, to a point mass with H(W) = 0.)
![Page 41: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/41.jpg)
41
Information-Gain
• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) − H(W | S_t), where H(W) is the current model-space entropy and H(W | S_t) is the expected model-space entropy if we learn S_t
![Page 42: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/42.jpg)
42
Score Function
score_IG(S_t) = MI(S_t; W) = H(W) − H(W | S_t)
![Page 43: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/43.jpg)
43
We usually can't integrate over all models to compute H(W):

H(W) = −∫ P(w) log P(w) dw

…but we can sample a committee C of models from P(W | D) and approximate

H(W) ≈ H(C) = −Σ_{c∈C} P(c) log P(c)
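One common committee-based surrogate for this is vote entropy over the labels assigned by the sampled models; this is a sketch of that idea (not necessarily the exact estimator the slide intends), with P(c) taken as the empirical vote fraction:

```python
import math
from collections import Counter

def committee_entropy(predictions):
    """Entropy of a committee's votes on one example.

    `predictions` holds the labels that each sampled model c in C assigns;
    P(c) is approximated by the empirical vote fraction.
    """
    n = len(predictions)
    counts = Counter(predictions)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

votes = [True, True, False, True]   # a committee of 4 models sampled from P(W|D)
print(round(committee_entropy(votes), 3))  # -> 0.811
```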
![Page 44: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/44.jpg)
44
Conditional Model Entropy

H(W) = −∫_w P(w) log P(w) dw

H(W | S_t = i) = −∫_w P(w | S_t = i) log P(w | S_t = i) dw

H(W | S_t) = Σ_i P(S_t = i) H(W | S_t = i) = −Σ_i P(S_t = i) ∫_w P(w | S_t = i) log P(w | S_t = i) dw
![Page 45: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/45.jpg)
45
Score Function
score_IG(S_t) = H(C) − H(C | S_t)
![Page 46: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/46.jpg)
46
| t | Sex | Age   | Test A | Test B | Test C | S_t | P(S_t) | Score = H(C) − H(C|S_t) |
|---|-----|-------|--------|--------|--------|-----|--------|-------------------------|
| 1 | M   | 20-30 | 0      | 1      | 1      | ?   | 0.02   | 0.53                    |
| 2 | F   | 20-30 | 0      | 1      | 0      | ?   | 0.01   | 0.58                    |
| 3 | F   | 30-40 | 1      | 0      | 0      | ?   | 0.05   | 0.40                    |
| 4 | F   | 60+   | 1      | 1      | 1      | ?   | 0.12   | 0.49                    |
| 5 | M   | 10-20 | 0      | 1      | 0      | ?   | 0.01   | 0.57                    |
| 6 | M   | 20-30 | 0      | 0      | 1      | ?   | 0.02   | 0.52                    |
![Page 47: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/47.jpg)
47
Score Function
score_IG(S_t) = H(C) − H(C | S_t) = H(S_t) − H(S_t | C)

Familiar?
![Page 48: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/48.jpg)
48
Uncertainty Sampling & Information Gain
score_Uncertain(S_t) = H(S_t)
score_InfoGain(S_t) = H(S_t) − H(S_t | C)
![Page 49: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/49.jpg)
49
But there is a problem…
![Page 50: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/50.jpg)
50
If our objective is to reduce the prediction error, then "the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries."
![Page 51: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/51.jpg)
51
Strategy #2:Query by Committee
Temporary assumptions:
Pool → Sequential
P(W | D) → Version space
Probabilistic → Noiseless

QBC attacks the size of the "version space."
![Page 52: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/52.jpg)
52
(Figure: Model #1 and Model #2, both drawn from the version space, classify a candidate example; both say FALSE, so there is no disagreement and little to learn.)
![Page 53: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/53.jpg)
53
(Figure: on another example, both models say TRUE; again there is no disagreement.)
![Page 54: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/54.jpg)
54
(Figure: on a third example, Model #1 says TRUE while Model #2 says FALSE.)
Ooh, now we’re going to learnsomething for sure!
One of them is definitely wrong.
![Page 55: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/55.jpg)
55
The Original QBC Algorithm

As each example arrives:
1. Choose a committee, C (usually of size 2), randomly from the version space
2. Have each member of C classify it
3. If the committee disagrees, query its label
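The three steps above can be sketched for a finite version space in the noiseless setting; the toy hypothesis class (threshold classifiers) and helper names are ours:

```python
import random

def qbc_filter(stream, version_space, classify, get_label):
    """Sketch of the original QBC filter: for each arriving example, draw a
    committee of two hypotheses from the version space; query on disagreement."""
    queried = []
    for x in stream:
        if len(version_space) < 2:
            break                                 # version space has collapsed
        h1, h2 = random.sample(version_space, 2)  # committee of size 2
        if classify(h1, x) != classify(h2, x):    # committee disagrees
            y = get_label(x)                      # ask the expert for the label
            queried.append((x, y))
            # keep only hypotheses consistent with the new label (noiseless case)
            version_space = [h for h in version_space if classify(h, x) == y]
    return queried, version_space

# Toy case: hypotheses are thresholds t with h_t(x) = (x >= t); the truth is t = 0.5.
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
queried, vs = qbc_filter([0.2, 0.4, 0.6, 0.8], thresholds,
                         classify=lambda t, x: x >= t,
                         get_label=lambda x: x >= 0.5)
print(0.5 in vs)  # -> True: the true threshold always survives the filtering
```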
![Page 56: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/56.jpg)
56
1992
![Page 57: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/57.jpg)
57
Infogain vs Query by Committee
[Seung, Opper, Sompolinsky, 1992; Freund, Seung, Shamir, Tishby 1997]
First idea: try to rapidly reduce the volume of the version space?
Problem: this doesn't take the data distribution into account.
Which pair of hypotheses is closest? It depends on the data distribution P. Distance measure on H: d(h, h′) = P(h(x) ≠ h′(x))
![Page 58: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/58.jpg)
58
Query-by-committee
First idea: try to rapidly reduce the volume of the version space?
Problem: doesn't take the data distribution into account.
To keep things simple, say d(h, h′) = Euclidean distance.
Error is likely to remain large!
![Page 59: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/59.jpg)
59
Query-by-committee
An elegant scheme that decreases the volume in a manner sensitive to the data distribution.

Bayesian setting: given a prior π on H.

H_1 = H
For t = 1, 2, …:
  receive an unlabeled point x_t drawn from P
  [informally: is there a lot of disagreement about x_t in H_t?]
  choose two hypotheses h, h′ randomly from (π, H_t)
  if h(x_t) ≠ h′(x_t): ask for x_t's label and set H_{t+1} accordingly

Problem: how to implement it efficiently?
![Page 60: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/60.jpg)
60
Query-by-committee
For t = 1, 2, …: receive an unlabeled point x_t drawn from P; choose two hypotheses h, h′ randomly from (π, H_t); if h(x_t) ≠ h′(x_t), ask for x_t's label and set H_{t+1}.

Observation: the probability of getting the pair (h, h′) in the inner loop (when a query is made) is proportional to π(h) π(h′) d(h, h′).
![Page 61: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/61.jpg)
61
![Page 62: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/62.jpg)
62
Query-by-committee
Label bound: for H = {linear separators in R^d} and P = uniform distribution, just d log 1/ε labels suffice to reach a hypothesis with error ≤ ε.

Implementation: need to randomly pick h according to (π, H_t), e.g., H = {linear separators in R^d}, π = uniform distribution.

How do you pick a random point from a convex body?
![Page 63: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/63.jpg)
63
Sampling from convex bodies
By random walk!
1. Ball walk
2. Hit-and-run
[Gilad-Bachrach, Navot, Tishby 2005] Studies random walks and also ways to kernelize QBC.
![Page 64: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/64.jpg)
64
![Page 65: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/65.jpg)
65
Some challenges
[1] For linear separators, analyze the label complexity for some distribution other than uniform!
[2] How to handle nonseparable data? Need a robust base learner.
(Figure: a nonseparable dataset with the true boundary separating the + and − regions.)
![Page 66: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/66.jpg)
Active Collaborative Prediction
![Page 67: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/67.jpg)
67
Approach: Collaborative Prediction (CP)
QoS measure (e.g., bandwidth)

(Table: a clients × servers matrix of observed QoS values for Client1-Client4 and Server1-Server3, with missing entries marked '?')
Given previously observed ratings R(x, y), where x is a "user" and y is a "product", predict unobserved ratings.
Movie Ratings

(Table: a users × movies rating matrix for Alina, Gerry, Irina, Raja over Matrix, Geisha, Shrek, with missing entries marked '?')
- will Alina like "The Matrix"? (unlikely)
- will Client 86 have fast download from Server 39?
- will member X of funding committee approve our project Y?
![Page 68: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/68.jpg)
68
(Figure: a 100 clients × 100 servers matrix)

Collaborative Prediction = Matrix Approximation
• Important assumption: matrix entries are NOT independent, e.g. similar users have similar tastes
• Approaches: mainly factorized models assuming hidden ‘factors’ that affect ratings (pLSA, MCVQ, SVD, NMF, MMMF, …)
![Page 69: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/69.jpg)
69
Assumptions:
- there are a number of (hidden) factors behind the user preferences that relate to (hidden) movie properties
- movies have intrinsic values associated with such factors
- users have intrinsic weights associated with such factors; user ratings are weighted (linear) combinations of the movie's values

(Figure: a row of ratings decomposed into the user's weights over hidden factors times the movies' factor values.)
![Page 70: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/70.jpg)
70
![Page 71: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/71.jpg)
71
![Page 72: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/72.jpg)
72
Objective: find a factorizable X = UV′ of rank k that approximates the partially observed matrix Y:

X* = argmin_{X′} Loss(X′, Y)

and satisfies some "regularization" constraints (e.g., rank(X) < k). The loss function depends on the nature of your problem.
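For the fully observed, sum-squared-loss special case, this objective is solved exactly by a truncated SVD; here is a minimal sketch of that baseline (MMMF, discussed next, replaces the rank constraint with a norm bound and handles partially observed Y):

```python
import numpy as np

def low_rank_approx(Y, k):
    """Best rank-k approximation of a fully observed Y under sum-squared loss.

    This is the classical SVD solution of argmin_X ||X - Y||^2 s.t. rank(X) <= k.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]   # X = U_k diag(s_k) V_k'

Y = np.array([[2., 4., 6.],
              [1., 2., 3.],
              [3., 6., 9.]])      # a rank-1 matrix
X = low_rank_approx(Y, 1)
print(np.allclose(X, Y))          # -> True: rank-1 approx recovers a rank-1 matrix
```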
![Page 73: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/73.jpg)
73
Matrix Factorization Approaches

Singular value decomposition (SVD): low-rank approximation. Assumes a fully observed Y and sum-squared loss.

In collaborative prediction, Y is only partially observed, and low-rank approximation becomes a non-convex problem with many local minima.

Furthermore, we may not want sum-squared loss, but instead: accurate predictions (0/1 loss, approximated by hinge loss); cost-sensitive predictions (missing a good server vs. suggesting a bad one); ranking cost (e.g., suggest the k "best" movies for a user). NON-CONVEX PROBLEMS!

Use instead the state-of-the-art Max-Margin Matrix Factorization [Srebro 05]: it replaces the bounded-rank constraint by a bounded norm of the U, V′ vectors, yielding a convex optimization problem that can be solved exactly by semidefinite programming, and it strongly relates to learning max-margin classifiers (SVMs).

Exploit MMMF's properties to augment it with active sampling!
![Page 74: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/74.jpg)
74
Key Idea of MMMF: rows are feature vectors, columns are linear classifiers (weight vectors).

X_ij = sign_ij × margin_ij
Predictor_ij = sign_ij: if sign_ij > 0, classify as +1; otherwise classify as −1
"margin" here = Dist(sample, line)

(Figure: feature vectors and classifier weight vectors, with the margin shown as the distance from a sample to the separating line.)
![Page 75: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/75.jpg)
75
MMMF: Simultaneous Search for Low-norm Feature Vectors and Max-margin Classifiers
![Page 76: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/76.jpg)
76
Margin-based heuristics:
- min-margin (most uncertain)
- min-margin positive ("good" uncertain)
- max-margin ("safe" choice, but no info)
(Figure: a matrix of predicted margins for unobserved entries, from which candidate queries are selected.)
Active Learning with MMMF

We extend MMMF to Active-MMMF using margin-based active sampling, and investigate the exploitation vs. exploration trade-offs imposed by different heuristics.
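The three margin-based heuristics can be sketched over a dictionary of predicted margins for unobserved entries (the function and variable names are ours):

```python
def pick_query(margins, heuristic="min-margin"):
    """Margin-based query selection over unobserved matrix entries.

    margins: dict mapping (i, j) -> predicted margin X_ij for unlabeled entries.
    """
    if heuristic == "min-margin":            # most uncertain prediction
        return min(margins, key=lambda ij: abs(margins[ij]))
    if heuristic == "min-margin-positive":   # "good" uncertain (predicted +)
        pos = {ij: m for ij, m in margins.items() if m > 0}
        return min(pos, key=pos.get)
    if heuristic == "max-margin":            # safe choice, but little info
        return max(margins, key=lambda ij: abs(margins[ij]))
    raise ValueError(heuristic)

m = {(0, 1): -0.9, (0, 2): 0.1, (1, 0): 0.8, (2, 2): -0.2}
print(pick_query(m))                         # -> (0, 2): |0.1| is the smallest
print(pick_query(m, "min-margin-positive"))  # -> (0, 2): smallest positive margin
```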
![Page 77: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/77.jpg)
77
Active Max-Margin Matrix Factorization

A-MMMF(M, s):
1. Given a sparse matrix Y, learn an approximation X = MMMF(Y)
2. Using current predictions, actively select the "best" s samples and request their labels (e.g., test a client/server pair via an "enforced" download)
3. Add the new samples to Y
4. Repeat 1-3

Issues: Beyond simple greedy margin-based heuristics? Theoretical guarantees? Not so easy with non-trivial learning methods and non-trivial data distributions. (Any suggestions?)
![Page 78: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/78.jpg)
78
Empirical Results
Network latency prediction
Bandwidth prediction (peer-to-peer)
Movie Ranking Prediction
Sensor net connectivity prediction
![Page 79: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/79.jpg)
79
Empirical Results: Latency Prediction
P2Psim data NLANR-AMP data
Active sampling with the most-uncertain (and most-uncertain-positive) heuristics provides consistent improvement over random and least-uncertain-next sampling.
![Page 80: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/80.jpg)
80
Movie Rating Prediction (MovieLens)
![Page 81: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/81.jpg)
81
Sensor Network Connectivity
![Page 82: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/82.jpg)
82
Introducing Cost: Exploration vs Exploitation
Active sampling yields lower prediction errors at lower cost (saving 100s of samples): better prediction → better server-assignment decisions → faster downloads.

Active sampling achieves a good exploration vs. exploitation trade-off: reduced decision cost AND information gain.
DownloadGrid: bandwidth prediction
PlanetLab: latency prediction
![Page 83: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/83.jpg)
83
Conclusions
Common challenge in many applications: need for cost-efficient sampling
This talk: linear hidden factor models with active sampling
Active sampling improves predictive accuracy while keeping sampling complexity low in a wide variety of applications
Future work:
Better active sampling heuristics?
Theoretical analysis of active sampling performance?
Dynamic Matrix Factorizations: tracking time-varying matrices
Incremental MMMF? (solving from scratch every time is too costly)
![Page 84: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/84.jpg)
84
References

Some of the most influential papers:
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research. Volume 2, pages 45-66. 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning, 28:133—168
• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models, Journal of Artificial Intelligence Research, (4): 129-145, 1996.
• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning, Machine Learning 15(2):201-221, 1994.
• D. J. C. Mackay. Information-Based Objective Functions for Active Data Selection. Neural Comput., vol. 4, no. 4, pp. 590--604, 1992.
![Page 85: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/85.jpg)
85
NIPS papers

• Francis Bach. Active learning for misspecified generalized linear models. NIPS-06
• Ran Gilad-Bachrach, Amir Navot, Naftali Tishby. Query by Committee Made Real. NIPS-05
• Brent Bryan, Jeff Schneider, Robert Nichol, Christopher Miller, Christopher Genovese, Larry Wasserman . Active Learning For Identifying Function Threshold Boundaries . NIPS-05
• Rui Castro, Rebecca Willett, Robert Nowak. Faster Rates in Regression via Active Learning. NIPS-05
• Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. NIPS-05
• Masashi Sugiyama. Active Learning for Misspecified Models. NIPS-05
• Brigham Anderson, Andrew Moore. Fast Information Value for Graphical Models. NIPS-05
• Dan Pelleg, Andrew W. Moore. Active Learning for Anomaly and Rare-Category Detection. NIPS-04
• Sanjoy Dasgupta. Analysis of a greedy active learning strategy. NIPS-04
• T. Jaakkola and H. Siegelmann. Active Information Retrieval. NIPS-01
• M. K. Warmuth et al. Active Learning in the Drug Discovery Process. NIPS-01
• Jonathan D. Nelson, Javier R. Movellan. Active Inference in Concept Learning. NIPS-00
• Simon Tong, Daphne Koller. Active Learning for Parameter Estimation in Bayesian Networks. NIPS-00
• Thomas Hofmann and Joachim M. Buhmann. Active Data Clustering. NIPS-97
• K. Fukumizu. Active Learning in Multilayer Perceptrons. NIPS-95
• Anders Krogh, Jesper Vedelsby. Neural Network Ensembles, Cross Validation, and Active Learning. NIPS-94
• Kah Kay Sung, Partha Niyogi. Active Learning for Function Approximation. NIPS-94
• David Cohn, Zoubin Ghahramani, Michael I. Jordan. Active Learning with Statistical Models. NIPS-94
• Sebastian B. Thrun and Knut Moller. Active Exploration in Dynamic Environments. NIPS-91
![Page 86: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/86.jpg)
86
ICML papers

• Maria-Florina Balcan, Alina Beygelzimer, John Langford. Agnostic Active Learning. ICML-06
• Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. ICML-06
• Sriharsha Veeramachaneni, Emanuele Olivetti, Paolo Avesani. Active Sampling for Detecting Irrelevant Features. ICML-06
• Kai Yu, Jinbo Bi, Volker Tresp. Active Learning via Transductive Experimental Design. ICML-06
• Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger, Ziv Bar-Joseph. Active Learning for Sampling in Time-Series Experiments With Application to Gene Expression Analysis. ICML-05
• Prem Melville, Raymond Mooney. Diverse Ensembles for Active Learning. ICML-04
• Klaus Brinker. Active Learning of Label Ranking Functions. ICML-04
• Hieu Nguyen, Arnold Smeulders. Active Learning Using Pre-clustering. ICML-04
• Greg Schohn and David Cohn. Less is More: Active Learning with Support Vector Machines, ICML-00
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. ICML-00.
COLT papers
• S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. COLT-05.
• H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. COLT-92, pages 287-294, 1992.
![Page 87: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/87.jpg)
87
Journal Papers

• Antoine Bordes, Seyda Ertekin, Jason Weston, Leon Bottou. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research (JMLR), vol. 6, pp. 1579-1619, 2005.
• Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research. Volume 2, pages 45-66. 2001.
• Y. Freund, H. S. Seung, E. Shamir, N. Tishby. 1997. Selective sampling using the query by committee algorithm. Machine Learning, 28:133--168
• David Cohn, Zoubin Ghahramani, and Michael Jordan. Active learning with statistical models, Journal of Artificial Intelligence Research, (4): 129-145, 1996.
• David Cohn, Les Atlas and Richard Ladner. Improving generalization with active learning, Machine Learning 15(2):201-221, 1994.
• D. J. C. Mackay. Information-Based Objective Functions for Active Data Selection. Neural Comput., vol. 4, no. 4, pp. 590--604, 1992.
• Haussler, D., Kearns, M., and Schapire, R. E. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83--113
• Fedorov, V. V. 1972. Theory of optimal experiment. Academic Press.
• Saar-Tsechansky, M. and F. Provost. Active Sampling for Class Probability Estimation and Ranking. Machine Learning 54:2 2004, 153-178.
![Page 88: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/88.jpg)
88
Workshops
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/nips05workshop.index.html
![Page 89: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/89.jpg)
89
Appendix
![Page 90: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/90.jpg)
Active Learning of Bayesian Networks
![Page 91: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/91.jpg)
91
Entropy Function
[Shannon, 1948]

• A measure of the information in a random event X with possible outcomes {x1, …, xn}:

H(X) = -∑i p(xi) log2 p(xi)

• Comments on the entropy function:
  – The entropy of an event is zero when the outcome is known
  – Entropy is maximal when all outcomes are equally likely
• The average minimum number of yes/no questions needed to determine the outcome (connection to binary search)
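The entropy formula and both comments above can be checked in a few lines; the example distributions are hypothetical, chosen only to illustrate the extreme cases:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum_i p(x_i) log2 p(x_i)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))   # 0.0 -- zero when the outcome is known
print(entropy([0.5, 0.5]))   # 1.0 -- maximal for two equally likely outcomes
print(entropy([0.25] * 4))   # 2.0 -- i.e., two yes/no questions on average
```

The last line shows the binary-search connection: identifying one of four equally likely outcomes takes two yes/no questions on average.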
![Page 92: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/92.jpg)
92
Kullback-Leibler divergence

• P is the true distribution; the distribution Q is used to encode data instead of P
• The KL divergence is the expected extra message length per datum that must be transmitted when using Q
• A measure of how "wrong" Q is with respect to the true distribution P

DKL(P || Q) = ∑i P(xi) log (P(xi) / Q(xi))
            = ∑i P(xi) log P(xi) - ∑i P(xi) log Q(xi)
            = H(P, Q) - H(P)
            = cross-entropy - entropy
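A direct translation of the definition (the distributions P and Q below are hypothetical; base-2 logs, so the answer reads as extra bits per datum):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) log2(P(x_i) / Q(x_i)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # true distribution
q = [0.9, 0.1]   # distribution used to encode the data
print(kl_divergence(p, p))  # 0.0 -- no penalty when Q matches P
print(kl_divergence(p, q))  # ~0.74 extra bits per datum when encoding with Q
```

Note that KL divergence is not symmetric: kl_divergence(q, p) gives a different value.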
![Page 93: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/93.jpg)
93
Learning Bayesian Networks

[Figure: example Bayesian network over E, B, R, A, C, with the conditional probability table P(A | E,B)]

Data + Prior Knowledge → Learner

• Model building:
  – Parameter estimation
  – Causal structure discovery
• Passive Learning vs. Active Learning
![Page 94: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/94.jpg)
94
Active Learning

• Selective active learning
• Interventional active learning

In either setting, repeat:
1. Obtain a measure of quality of the current model
2. Choose the query that most improves quality
3. Update the model
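The three steps above can be written as one generic loop; the 1-D threshold demo driving it below is a hypothetical illustration (not from the lecture), where "most improves quality" means halving the current uncertainty interval:

```python
def active_learning_loop(model, pool, oracle, budget, select, update):
    """Generic selective active learning loop (the three steps above)."""
    for _ in range(budget):
        q = select(model, pool)      # choose the query that most improves quality
        pool.remove(q)
        y = oracle(q)                # acquire the (costly) label
        model = update(model, q, y)  # update the model
    return model

# Hypothetical demo: learn a 1-D threshold by always querying the pool point
# closest to the midpoint of the current uncertainty interval (binary search).
def select(model, pool):
    lo, hi = model
    return min(pool, key=lambda x: abs(x - (lo + hi) / 2))

def update(model, x, y):
    lo, hi = model
    return (lo, x) if y == 1 else (x, hi)

true_t = 0.7
oracle = lambda x: int(x > true_t)
pool = [i / 100 for i in range(101)]
lo, hi = active_learning_loop((0.0, 1.0), pool, oracle, 7, select, update)
print(lo, hi)  # a narrow interval bracketing the true threshold 0.7
```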
![Page 95: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/95.jpg)
95
Active Learning: Parameter Estimation
[Tong & Koller, NIPS-2000]

• Given a BN structure G and a prior distribution p(θ)
• The learner requests a particular instantiation q (query)
• The environment responds with data x; the learner updates p(θ) to the posterior p′(θ)

[Figure: initial network G with p(θ); query (q) and response (x) produce the updated distribution p′(θ)]

Two questions:
• How to update the parameter density
• How to select the next query based on p
![Page 96: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/96.jpg)
96
Updating the parameter density p(θ | A:=a, X=x)

• Do not update A's parameters, since we are fixing A
• If we select A=a (observe it), do not update B:
  – we are then sampling from P(B | A=a) ≠ P(B)
• If we force A:=a (intervene), we can update B:
  – we are then sampling from P(B | A:=a) = P(B)*
• Update all other nodes as usual
• Obtain the new density p(θ | A:=a, X=x)

* Pearl 2000

[Figure: network over A, B, J, M]
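The observation-vs-intervention distinction above can be checked numerically. A minimal sketch with a two-node network B → A and hypothetical probabilities (all values below are made up for illustration):

```python
import random

random.seed(0)
p_b = 0.3                       # P(B=1): hypothetical prior on the parent B
p_a_given_b = {0: 0.1, 1: 0.9}  # P(A=1 | B): hypothetical CPT for the child A

def sample_joint():
    b = int(random.random() < p_b)
    a = int(random.random() < p_a_given_b[b])
    return b, a

N = 100_000
# Selecting A=1 (observation): keep only the samples where A happened to be 1.
obs = [b for b, a in (sample_joint() for _ in range(N)) if a == 1]
p_obs = sum(obs) / len(obs)
# Forcing A:=1 (intervention): A is set by us, so B is still drawn from its prior.
intv = [int(random.random() < p_b) for _ in range(N)]
p_intv = sum(intv) / len(intv)

print(p_obs)   # P(B=1 | A=1) = 0.27/0.34 ~ 0.79: observing A informs us about B
print(p_intv)  # P(B=1 | A:=1) = P(B=1) ~ 0.30: forcing A does not
```

This is why an interventional sample of A must not be used to update B's parameters when A was merely observed, and vice versa.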
![Page 97: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/97.jpg)
97
Bayesian point estimation

• Goal: a single estimate θ̃, instead of a distribution p(θ) over θ
• If we choose θ̃ and the true model is θ′, then we incur some loss L(θ′ || θ̃)

[Figure: density p(θ) over models, with the true θ′ and the estimate θ̃]
![Page 98: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/98.jpg)
98
Bayesian point estimation

• We do not know the true θ′
• The density p represents our optimal beliefs over θ′
• Choose the θ̃ that minimizes the expected loss:

θ̃ = argminθ ∫ p(θ′) L(θ′ || θ) dθ′

• Call θ̃ the Bayesian point estimate
• Use the expected loss of the Bayesian point estimate as a measure of quality of p(θ):
  – Risk(p) = ∫ p(θ′) L(θ′ || θ̃) dθ′
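A small numerical sketch of the minimization above, using a hypothetical discrete belief p(θ′) over Bernoulli parameters and KL loss. (For KL loss the minimizer is known to be the posterior mean, which the grid search below recovers.)

```python
import math

def kl_bernoulli(p, q):
    """L(theta' || theta) = KL(Bernoulli(theta') || Bernoulli(theta))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical discrete belief p(theta') over the true parameter:
thetas  = [0.2, 0.4, 0.6, 0.8]
weights = [0.1, 0.2, 0.4, 0.3]

def expected_loss(theta_hat):
    return sum(w * kl_bernoulli(t, theta_hat) for t, w in zip(thetas, weights))

theta_tilde = min((i / 1000 for i in range(1, 1000)), key=expected_loss)
posterior_mean = sum(t * w for t, w in zip(thetas, weights))
print(theta_tilde, posterior_mean)  # both ~0.58: under KL loss, the Bayesian
                                    # point estimate is the posterior mean
```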
![Page 99: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/99.jpg)
99
The Querying component

• Set the controllable variables so as to minimize the expected posterior risk:

ExPRisk(p | Q=q) = ∑x P(X=x | Q=q) ∫ p(θ′ | x, q) KL(θ′ || θ̃) dθ′

• KL divergence is used for the loss, via the conditional KL divergence:

KL(θ || θ′) = ∑i KL(Pθ(Xi | Ui) || Pθ′(Xi | Ui))
![Page 100: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/100.jpg)
100
Algorithm Summary

• For each potential query q:
  – compute the expected posterior risk ExPRisk(p | Q=q)
• Choose the q for which the expected posterior risk is smallest
• Cost of computing each ExPRisk: the cost of Bayesian network inference
• Overall complexity: O(|Q| · cost of inference)
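A minimal sketch of this query-selection loop, using two Beta-Bernoulli "coins" and posterior variance as a simple stand-in for the KL-based risk in the slides (the names and priors are hypothetical):

```python
def beta_var(a, b):
    """Variance of a Beta(a, b) posterior over a coin's heads probability."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def expected_posterior_risk(beliefs, q):
    """Expected total risk after querying coin q, averaged over its outcomes."""
    a, b = beliefs[q]
    p_heads = a / (a + b)
    total = lambda bs: sum(beta_var(x, y) for x, y in bs.values())
    heads = {**beliefs, q: (a + 1, b)}   # posterior if the flip comes up heads
    tails = {**beliefs, q: (a, b + 1)}   # posterior if it comes up tails
    return p_heads * total(heads) + (1 - p_heads) * total(tails)

# Two coins: one well observed, one barely observed (hypothetical priors).
beliefs = {"coin0": (50, 50), "coin1": (1, 1)}
best = min(beliefs, key=lambda q: expected_posterior_risk(beliefs, q))
print(best)  # coin1 -- querying the uncertain coin reduces risk the most
```

As on the slide, each candidate query requires one round of (here trivial) inference, so the cost scales as O(|Q| · cost of inference).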
![Page 101: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/101.jpg)
101
Uncertainty sampling

Maintain a single hypothesis, based on the labels seen so far.
Query the point about which this hypothesis is most "uncertain".

Problem: the confidence of a single hypothesis may not accurately represent the true diversity of opinion in the hypothesis class.

[Figure: labeled + and - points with the queried point X]
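The rule on this slide fits in a few lines; the logistic hypothesis and the pool below are hypothetical stand-ins for "the current single hypothesis" and the unlabeled data:

```python
import math

def uncertainty_sample(predict_proba, unlabeled):
    """Query the unlabeled point whose predicted P(y=1|x) is closest to 0.5."""
    return min(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))

# Hypothetical current hypothesis: a logistic curve with its boundary at x = 2.0.
predict_proba = lambda x: 1 / (1 + math.exp(-(x - 2.0)))

pool = [0.0, 1.0, 1.9, 2.5, 4.0]
print(uncertainty_sample(predict_proba, pool))  # 1.9 -- closest to the boundary
```

The weakness flagged above is visible here: the choice depends entirely on this one hypothesis's boundary, not on where other consistent hypotheses might disagree.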
![Page 102: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/102.jpg)
102
![Page 103: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/103.jpg)
103
Region of uncertainty

Suppose the data lie on a circle in R2 and the hypotheses are linear separators (the spaces X, H are superimposed).

Current version space: the portion of H consistent with the labels so far.
"Region of uncertainty": the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space).

[Figure: current version space and the region of uncertainty in data space]
![Page 104: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/104.jpg)
104
Region of uncertainty

Data and hypothesis spaces, superimposed (both are the surface of the unit sphere in Rd).

Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.

[Figure: current version space and the region of uncertainty in data space]
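For 1-D threshold classifiers the [CAL92] rule is easy to simulate; the grid, seed, and oracle below are hypothetical choices for illustration:

```python
import random

def cal_step(labeled, pool):
    """One [CAL92] step for 1-D thresholds: the version space is the interval
    between the largest point labeled 0 and the smallest labeled 1, and the
    region of uncertainty is the set of unlabeled points strictly inside it."""
    lo = max((x for x, y in labeled if y == 0), default=float("-inf"))
    hi = min((x for x, y in labeled if y == 1), default=float("inf"))
    region = [x for x in pool if lo < x < hi]
    return random.choice(region) if region else None  # pick one at random

random.seed(1)
true_t = 0.6                        # hypothetical target threshold
oracle = lambda x: int(x >= true_t)
pool = [i / 100 for i in range(101)]
labeled = []
while (q := cal_step(labeled, pool)) is not None:
    pool.remove(q)
    labeled.append((q, oracle(q)))
print(len(labeled))  # labels queried before the region empties: typically ~ log n
```

Each query either shrinks the region of uncertainty or ends the run, so the label count stays far below the pool size in expectation.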
![Page 105: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/105.jpg)
105
Region of uncertainty

The number of labels needed depends on H and also on P.

Special case: H = {linear separators in Rd}, P = uniform distribution over the unit sphere.

Then: just d log 1/ε labels are needed to reach a hypothesis with error rate < ε.

[1] Supervised learning: d/ε labels.
[2] Best we can hope for.
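Plugging in hypothetical numbers (d = 10, ε = 0.01, natural log) shows the scale of the saving:

```python
import math

d, eps = 10, 0.01   # hypothetical VC dimension and target error rate
passive = d / eps                 # supervised learning: ~ d/eps labels
active  = d * math.log(1 / eps)   # active, uniform-sphere case: ~ d log(1/eps)
print(round(passive), round(active))  # 1000 46
```

The label complexity drops from linear in 1/ε to logarithmic, an exponential improvement in the dependence on the target error rate.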
![Page 106: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/106.jpg)
106
Region of uncertainty

Algorithm [CAL92]: of the unlabeled points which lie in the region of uncertainty, pick one at random to query.

For more general distributions this is suboptimal. We need a way to measure the quality of a query, or alternatively, the size of the version space.
![Page 107: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/107.jpg)
107
[Figure: uncertainty sampling vs. the expected information gain of a sample]
![Page 108: Active learning lecture](https://reader033.fdocuments.in/reader033/viewer/2022052322/5577a2e7d8b42a410a8b554d/html5/thumbnails/108.jpg)
108