A Two-Stage Approach to Domain Adaptation for Statistical Classifiers
Jing Jiang & ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
CIKM 2007, Nov 7, 2007
Example: named entity recognition

train (labeled): New York Times
test (unlabeled): New York Times
An NER classifier labels persons, locations, organizations, etc.
Standard supervised learning: 85.5%
Example: named entity recognition

train (labeled): Reuters (labeled New York Times data not available)
test (unlabeled): New York Times
An NER classifier labels persons, locations, organizations, etc.
Non-standard (realistic) setting: 64.1%
Domain difference causes a performance drop:

train     test                    setting      performance
NYT       NYT (New York Times)    ideal        85.5%
Reuters   NYT (New York Times)    realistic    64.1%
Another NER example (gene name recognition)

train    test     setting      performance
mouse    mouse    ideal        54.1%
fly      mouse    realistic    28.1%
Other examples

- Spam filtering: public email collection → personal inboxes
- Sentiment analysis of product reviews: digital cameras → cell phones; movies → books

Can we do better than standard supervised learning?
Domain adaptation: design learning methods that are aware of the difference between the training and test domains.
Observation 1: domain-specific features

Fly gene names: wingless, daughterless, eyeless, apexless, …
- These names describe a phenotype, following fly gene nomenclature.
- A classifier trained on fly data weights the feature "-less" high.

Mouse gene names: CD38, PABPC5, …
Is the "-less" feature still useful for other organisms? No!
Observation 2: generalizable features

…decapentaplegic and wingless are expressed in analogous patterns in each… (fly)
…that CD38 is expressed by both neurons and glial cells… that PABPC5 is expressed in fetal brain and in a range of adult tissues. (mouse)

The context feature "X be expressed" is useful across organisms.
General idea: two-stage approach

Partition the feature set into generalizable features, which are shared between the source domains and the target domain, and domain-specific features.
Stage 1, generalization: emphasize generalizable features in the trained model.
Stage 2, adaptation: pick up domain-specific features for the target domain.
Comparison with related work

- We explicitly model generalizable features; previous work models them implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
- We do not need labeled target data, but we do need multiple source (training) domains. Some prior work requires labeled target data [Daumé III 2007].
- We have a 2nd stage of adaptation, which uses semi-supervised learning; previous work does not incorporate semi-supervised learning [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
Implementation of the two-stage approach with logistic regression classifiers
Logistic regression classifiers

The input is represented as a vector x of p binary features (e.g. "-less", "X be expressed"); for the input "… and wingless are expressed in …" both of these features fire. Each class y has a weight vector w_y, the score for class y is w_y^T x, and

    p(y | x; w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
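As a concrete sketch of this model, the class probabilities can be computed with a numerically stable softmax in numpy. The weights and feature names below are illustrative examples, not values from the talk:

```python
import numpy as np

def predict_proba(W, x):
    """p(y | x; W) = exp(w_y^T x) / sum_{y'} exp(w_{y'}^T x).

    W: (num_classes, p) weight matrix, one row w_y per class.
    x: (p,) binary feature vector.
    """
    scores = W @ x                 # w_y^T x for every class y
    scores = scores - scores.max() # stabilize the softmax numerically
    expd = np.exp(scores)
    return expd / expd.sum()

# Toy example: 2 classes (gene / not-gene) and 3 binary features,
# e.g. ["-less", "X be expressed", "capitalized"] (illustrative only).
W = np.array([[2.1, 1.5, 0.3],    # weights for class "gene"
              [-0.5, 0.2, 0.1]])  # weights for class "not-gene"
x = np.array([1, 0, 1])           # word ends in "-less" and is capitalized
p = predict_proba(W, x)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged but avoids overflow for large weights.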
Learning a logistic regression classifier

    ŵ = argmin_w [ λ ||w||^2 - (1/N) Σ_{i=1}^N log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) ]

The second term is the (negative) log likelihood of the training data; the first is a regularization term that penalizes large weights and controls model complexity.
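A minimal sketch of this training step, assuming plain gradient descent on the regularized objective; the toy data, λ, and learning rate are all illustrative:

```python
import numpy as np

def nll_l2(W, X, y, lam):
    """lam * ||W||^2 - (1/N) * sum_i log p(y_i | x_i; W)."""
    scores = X @ W.T
    scores = scores - scores.max(axis=1, keepdims=True)
    lp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return lam * (W ** 2).sum() - lp[np.arange(len(y)), y].mean()

def fit(X, y, num_classes, lam=0.1, lr=0.5, steps=200):
    """Minimize nll_l2 by batch gradient descent from zero weights."""
    N, p = X.shape
    W = np.zeros((num_classes, p))
    for _ in range(steps):
        scores = X @ W.T
        scores = scores - scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P = P / P.sum(axis=1, keepdims=True)    # p(y' | x_i) for every class
        Y = np.eye(num_classes)[y]              # one-hot labels
        grad = 2 * lam * W - (Y - P).T @ X / N  # gradient of the objective
        W = W - lr * grad
    return W

# Toy data: feature 0 perfectly predicts the class, feature 1 is noise.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
y = np.array([1, 1, 0, 0])
W = fit(X, y, num_classes=2)
```

The objective is convex, so gradient descent with a small enough step size reaches the global minimum.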
Generalizable features in weight vectors

Training separately on K source domains D1, D2, …, DK yields K weight vectors w1, w2, …, wK. Generalizable features receive similarly high weights in all K vectors; domain-specific features receive high weights in only some of them.
We want to decompose w in this way

w = (a vector with non-zero entries only for the h generalizable features) + (a vector of weights for the remaining, domain-specific features)
Feature selection matrix A

A is an h × p 0/1 matrix with exactly one 1 per row:

    A = [ 0 0 1 0 0 … 0
          0 0 0 0 1 … 0
          :
          0 0 0 0 0 … 1 ]

The matrix A selects the h generalizable features: z = Ax.
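A quick sketch of the selection, assuming p = 6 features of which h = 2 are generalizable (the selected indices are chosen arbitrarily for illustration):

```python
import numpy as np

# A (h x p) has one 1 per row, picking out one generalizable feature,
# so z = A x is the h-dimensional "generalizable" view of x.
p, selected = 6, [2, 4]
A = np.zeros((len(selected), p))
for row, col in enumerate(selected):
    A[row, col] = 1.0

x = np.array([0, 1, 1, 0, 1, 0], dtype=float)
z = A @ x      # same as x[selected]
```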
Decomposition of w

    w^T x = v^T z + u^T x

v: weights for the generalizable features (applied to z = Ax); u: weights for the domain-specific features.
Decomposition of w: shared by all domains vs. domain-specific

    w = A^T v + u

A^T v maps the shared weights v (the same for all domains) back into the full p-dimensional feature space; u holds the domain-specific weights.
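The identity w^T x = v^T z + u^T x follows directly from w = A^T v + u, and can be checked numerically; all dimensions and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p, selected = 6, [2, 4]                  # h = 2 generalizable features (illustrative)
A = np.zeros((len(selected), p))
A[np.arange(len(selected)), selected] = 1.0

v = rng.normal(size=len(selected))       # weights shared across domains
u = rng.normal(size=p)                   # domain-specific weights
x = rng.integers(0, 2, size=p).astype(float)

w = A.T @ v + u                          # the decomposition w = A^T v + u
z = A @ x
# w^T x equals v^T z + u^T x
```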
Framework for generalization

Fix A and optimize:

    (v̂, {û_k}) = argmin_{v, {u_k}} [ ||v||^2 + (λ_s / K) Σ_{k=1}^K ||u_k||^2
                  - (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) ]

The log term is the likelihood of the labeled data from the K source domains; λ_s >> 1 penalizes the domain-specific features.
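The stage-1 objective (with A fixed) can be sketched as a plain function over (v, {u_k}); the exact placement of the 1/K factors on the penalty and the toy shapes are illustrative assumptions:

```python
import numpy as np

def log_probs(scores):
    """Row-wise log softmax."""
    scores = scores - scores.max(axis=1, keepdims=True)
    return scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

def stage1_objective(V, U, A, domains, lam_s):
    """||v||^2 + (lam_s / K) * sum_k ||u_k||^2
       - (1/K) * sum_k (1/N_k) * sum_i log p(y_i^k | x_i^k; A, v, u_k)

    V: (C, h) shared weights on the generalizable features z = A x
    U: list of K (C, p) domain-specific weight matrices
    domains: list of K (X_k, y_k) pairs, X_k of shape (N_k, p)
    """
    K = len(domains)
    reg = (V ** 2).sum() + lam_s / K * sum((u ** 2).sum() for u in U)
    ll = 0.0
    for (X, y), u in zip(domains, U):
        W = V @ A + u                      # per-class weights w = A^T v + u_k
        ll += log_probs(X @ W.T)[np.arange(len(y)), y].mean() / K
    return reg - ll

# Toy check: with all weights zero, p(y | x) = 1/2 everywhere,
# so the objective is exactly log 2.
X = np.array([[1., 0., 0.], [0., 1., 0.]])
y = np.array([0, 1])
A = np.array([[1., 0., 0.]])               # h = 1 selected feature (illustrative)
V = np.zeros((2, 1))
U = [np.zeros((2, 3)), np.zeros((2, 3))]
obj = stage1_objective(V, U, A, [(X, y), (X, y)], lam_s=10.0)
```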
Framework for adaptation

Fix A and optimize:

    (v̂, û_t, {û_k}) = argmin_{v, u_t, {u_k}} [ ||v||^2 + (λ_s / K) Σ_{k=1}^K ||u_k||^2 + λ_t ||u_t||^2
                  - (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k)
                  - (1/m) Σ_{i=1}^m log p(y_i^t | x_i^t; A, v, u_t) ]

The last term is the likelihood of m pseudo-labeled target-domain examples; λ_t = 1 << λ_s, so the model can pick up domain-specific features in the target domain.
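Stage 2 adds a λ_t-regularized target weight vector u_t and the pseudo-labeled target likelihood on top of the stage-1 objective. A compact sketch of just the two new terms (shapes and λ_t are illustrative):

```python
import numpy as np

def target_terms(V, u_t, A, X_t, y_t, lam_t=1.0):
    """The two terms stage 2 adds to the stage-1 objective:
    lam_t * ||u_t||^2 - (1/m) * sum_i log p(y_i^t | x_i^t; A, v, u_t),
    where (X_t, y_t) are m pseudo-labeled target-domain examples."""
    W = V @ A + u_t                        # target weights w_t = A^T v + u_t
    scores = X_t @ W.T
    scores = scores - scores.max(axis=1, keepdims=True)
    lp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return lam_t * (u_t ** 2).sum() - lp[np.arange(len(y_t)), y_t].mean()

# With zero weights the penalty vanishes and, for 2 classes,
# the likelihood term is log 2.
X_t = np.array([[1., 0., 0.]])
y_t = np.array([0])
val = target_terms(np.zeros((2, 1)), np.zeros((2, 3)),
                   np.array([[1., 0., 0.]]), X_t, y_t)
```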
How to find A? (1) Joint optimization

    (Â, v̂, {û_k}) = argmin_{A, v, {u_k}} [ ||v||^2 + (λ_s / K) Σ_{k=1}^K ||u_k||^2
                  - (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A, v, u_k) ]

solved by alternating optimization over A and (v, {u_k}).
How to find A? (2) Domain cross validation

Idea: train on (K - 1) source domains and test on the held-out source domain. Approximation:
- w_f^k: weight for feature f learned from domain k
- w̄_f^k: weight for feature f learned from the other domains
Rank features by Σ_{k=1}^K w_f^k · w̄_f^k (see the paper for details).
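A sketch of the ranking heuristic, with toy per-domain weights loosely mirroring the fly example; the per-domain training that would produce these weights is elided:

```python
import numpy as np

def rank_generalizable(W_in, W_out):
    """Rank features f by sum_k w_f^k * wbar_f^k, where w_f^k is the
    weight learned from domain k and wbar_f^k the weight learned from
    the other K-1 domains.  Features with high weight in both score
    high; domain-specific features score high in only one factor.

    W_in, W_out: (K, p) arrays of per-feature weights.
    """
    scores = (W_in * W_out).sum(axis=0)    # sum over the K domains
    return np.argsort(-scores), scores

# Toy weights for p = 3 features over K = 3 domains (illustrative):
# feature 0 ("X be expressed") is high everywhere; feature 1 ("-less")
# is high only in the fly domain; feature 2 is noise.
W_in  = np.array([[1.5, 0.05, 0.2],
                  [1.8, 0.10, -0.1],
                  [1.2, 2.00, 0.0]])
W_out = np.array([[1.6, 1.00, 0.1],
                  [1.4, 0.90, 0.2],
                  [1.7, 0.08, -0.2]])
order, scores = rank_generalizable(W_in, W_out)
```

Feature 0 ranks first because both factors agree across all domains, which is exactly the property the domain cross-validation heuristic rewards.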
Intuition for domain cross validation

[Figure: weight vectors learned with different held-out domains D1, …, Dk (fly). The feature "expressed" receives a consistently high weight (e.g. 1.5, 1.8) no matter which domains are used for training, while "-less" receives a high weight (e.g. 2.0) only when the fly domain is in the training set, and a low weight (e.g. 0.05, 0.1) otherwise.]
Experiments

- Data set: BioCreative Challenge Task 1B; gene/protein name recognition; 3 organisms/domains: fly, mouse, and yeast
- Experiment setup: 2 organisms for training, 1 for testing; F1 as the performance measure
Experiments: Generalization

Method             F+M→Y   M+Y→F   Y+F→M
BL                 0.633   0.129   0.416
DA-1 (joint-opt)   0.627   0.153   0.425
DA-2 (domain CV)   0.654   0.195   0.470

(F: fly, M: mouse, Y: yeast; "F+M→Y" means train on fly and mouse, test on yeast)

Using generalizable features is effective, and domain cross validation is more effective than joint optimization.
Experiments: Adaptation

Method      F+M→Y   M+Y→F   Y+F→M
BL-SSL      0.633   0.241   0.458
DA-2-SSL    0.759   0.305   0.501

(F: fly, M: mouse, Y: yeast)

Domain-adaptive bootstrapping is more effective than regular bootstrapping.
Experiments: Adaptation

Domain-adaptive SSL is more effective than regular SSL, especially with a small number of pseudo labels.
Conclusions and future work

- Two-stage domain adaptation: generalization outperformed standard supervised learning; adaptation outperformed standard bootstrapping.
- Of the two ways to find generalizable features, domain cross validation is more effective.
- Future work: a single source domain? setting the parameters h and m.
References
S. Ben-David, J. Blitzer, K. Crammer & F. Pereira. Analysis of representations for domain adaptation. NIPS 2007.
J. Blitzer, R. McDonald & F. Pereira. Domain adaptation with structural correspondence learning. EMNLP 2006.
H. Daumé III. Frustratingly easy domain adaptation. ACL 2007.