Covariate Shift Correction - Alex Smola (alex.smola.org/drafts/covariate.pdf)
Covariate Shift Correction & Propensity Scores
Alex J. Smola
Monday, September 6, 2010
The Problem... aka two problems and one hammer ...
Covariate Shift
• Basic setting
  • Training data is drawn from $p(x)\,p(y|x)$
  • Test data is drawn from $q(x)\,p(y|x)$
• Examples
  • Training data from last week, deploy today
  • Training data for USA market, deploy in UK
  • Training data for mithril, deploy on axonite
  • Speech recognition - adapt to speaker
• No labels on test set, but $p(y|x) = q(y|x)$
Covariate Shift
• Importance Sampler Identity
  $\mathbf{E}_{x \sim q(x)}[l(x, y, f(x))] = \mathbf{E}_{x \sim p(x)}\left[\frac{dq(x)}{dp(x)}\, l(x, y, f(x))\right]$
• Radon Nikodym derivative
  $\beta(x) := \frac{dq(x)}{dp(x)}$
  (need measure theory to avoid ∞/∞ division)
• Reweighted Empirical Risk
  $\mathop{\mathrm{minimize}}_{f} \sum_{i} \beta(x_i)\, l(x_i, y_i, f(x_i)) + \lambda \Omega[f]$
Propensity Scores
• What-if questions in experiments
  • Display ad a for user u, what about a'
  • New feature for advertisers but uneven opt in
  • Efficacy of medical treatment (stent vs. drugs for coronary artery problem)
• More than 2 choices
  • Customized module / page layout
  • Real-valued dosage level of drug
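Propensity Scores
• Basic goal - changing the conditioning
  $\mathbf{E}_q[f(x)] = \mathbf{E}_p[\beta(x) f(x)] \longrightarrow \frac{1}{m}\sum_{i=1}^{m} \beta(x_i) f_i$
• Improvement estimation
  $\mathbf{E}_q[f(x) - g(x)] = \mathbf{E}_p[\beta(x) f(x)] - \mathbf{E}_q[g(x)] \longrightarrow \frac{1}{m}\sum_{i=1}^{m} \beta(x_i) f_i - \frac{1}{n}\sum_{i=1}^{n} g_i$
  This yields improvement when drawing from q.
• Doubly robust estimation (variance reduction)
  $\mathbf{E}_q[f(x)] = \mathbf{E}_q[f(x) - \hat{f}(x)] + \mathbf{E}_q[\hat{f}(x)] = \mathbf{E}_p[\beta(x)[f(x) - \hat{f}(x)]] + \mathbf{E}_q[\hat{f}(x)]$
  Here $\hat{f}$ is an estimate of f - we can evaluate the estimate on q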
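The variance reduction from the doubly robust decomposition can be checked numerically. The sketch below is an illustration under assumed conditions, not an experiment from the talk: p = N(0, 1) and q = N(1, 1) so that β(x) = exp(x - 0.5) is known, outcomes are observed only on the sample from p, and `f_hat` is a hypothetical, deliberately imperfect estimate of the outcome function that can be evaluated on the sample from q.

```python
import numpy as np

def beta(x):
    """Density ratio q/p for p = N(0, 1), q = N(1, 1) (assumed known)."""
    return np.exp(x - 0.5)

def outcome(x):
    return 1.0 + 2.0 * x          # hypothetical true outcome function f

def f_hat(x):
    return 0.8 + 1.9 * x          # hypothetical imperfect estimate of f

def estimates(rng, m=500, n=500):
    x_p = rng.normal(0.0, 1.0, size=m)        # sample from p, outcomes observed
    x_q = rng.normal(1.0, 1.0, size=n)        # sample from q, no outcomes
    f_obs = outcome(x_p) + rng.normal(size=m)  # noisy observations f_i
    # Plain importance-weighted estimate of E_q[f].
    iw = np.mean(beta(x_p) * f_obs)
    # Doubly robust: reweight only the residual, add the model average on q.
    dr = np.mean(beta(x_p) * (f_obs - f_hat(x_p))) + np.mean(f_hat(x_q))
    return iw, dr

rng = np.random.default_rng(0)
runs = np.array([estimates(rng) for _ in range(500)])
truth = outcome(1.0)  # E_q[f(x)] = 1 + 2 * E_q[x] = 3

print("importance weighting: mean %.3f, std %.3f" % (runs[:, 0].mean(), runs[:, 0].std()))
print("doubly robust:        mean %.3f, std %.3f" % (runs[:, 1].mean(), runs[:, 1].std()))
```

Both estimators target the same quantity; the doubly robust one has lower spread because the reweighted term only carries the residual between the observed outcome and the estimate.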
The goal
• Estimate the Radon Nikodym Derivative
  $\beta(x) := \frac{dq(x)}{dp(x)}$
  based on samples from p and q
Logistic Regression... aka the idiot-proof simple method ...
Logistic Regression 101
• Logistic transfer function
  $p(y|x, f) = \frac{1}{1 + e^{-y f(x)}}$
• Samples
  $Z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ where $(x_i, y_i) \sim p(y, x)$
• Risk minimization
  $\mathop{\mathrm{minimize}}_{f} \sum_{i=1}^{m} \log\left[1 + \exp(-y_i f(x_i))\right] + \lambda \Omega[f]$
Logistic to Radon Nikodym
• Key idea
  Generate artificial distribution from p and q
  $\rho(x, y) := \frac{1}{2}\delta_{y,1}\, p(x) + \frac{1}{2}\delta_{y,-1}\, q(x)$
• Connection to Radon Nikodym
  $\rho(y|x) = \frac{1}{1 + e^{-y f(x)}} \implies \beta(x) = \frac{\rho(-1|x)}{\rho(1|x)} = \frac{1 + e^{-f(x)}}{1 + e^{f(x)}} = e^{-f(x)}$
• Efficient optimization in primal space
  Stochastic gradient descent in f (VW, Dios)
  $f(x) = \langle \phi(x), \theta \rangle$
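A small sketch of the logistic route to β(x) follows. It is an assumption-laden toy, not the VW or Dios setup mentioned on the slide: samples from p get label +1, samples from q get label -1, both samples have equal size (matching the 1/2, 1/2 mixture), φ(x) = (1, x), and the optimizer is plain full-batch gradient descent rather than SGD. The ratio is read off as ρ(-1|x)/ρ(1|x) from the fitted probabilities. The names `fit_logistic` and `density_ratio` are hypothetical.

```python
import numpy as np

def fit_logistic(Phi, y, lam=1e-3, lr=1.0, steps=500):
    """Gradient descent on mean log(1 + exp(-y f)) + lam/2 ||theta||^2, f = Phi theta."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(steps):
        margin = y * (Phi @ theta)
        # d/dtheta log(1 + exp(-margin)) = -y * phi / (1 + exp(margin))
        grad = -(Phi * (y / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
        theta -= lr * (grad + lam * theta)
    return theta

def density_ratio(Phi, theta):
    """beta(x) = rho(-1|x) / rho(1|x) under the fitted logistic model."""
    f = Phi @ theta
    rho_pos = 1.0 / (1.0 + np.exp(-f))   # rho(y = +1 | x): x came from p
    rho_neg = 1.0 / (1.0 + np.exp(f))    # rho(y = -1 | x): x came from q
    return rho_neg / rho_pos

def phi(x):
    return np.column_stack([np.ones_like(x), x])

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=5000)    # assumed p = N(0, 1)
x_q = rng.normal(1.0, 1.0, size=5000)    # assumed q = N(1, 1)
x_all = np.concatenate([x_p, x_q])
y_all = np.concatenate([np.ones(x_p.size), -np.ones(x_q.size)])

theta = fit_logistic(phi(x_all), y_all)

grid = np.array([0.0, 1.0])
beta_hat = density_ratio(phi(grid), theta)
beta_true = np.exp(grid - 0.5)           # exact q(x)/p(x) for these Gaussians
print("estimated:", beta_hat, "exact:", beta_true)
```

With unequal sample sizes the class prior is no longer 1/2 and the ratio would need rescaling by the ratio of sample sizes.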
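Moment Matching Theorem
• Maximum Entropy Estimation (primal)
  $\mathop{\mathrm{maximize}}_{p \in \mathcal{P}} \sum_{i=1}^{m} H(y|x_i)$ subject to $\left\| \sum_{i=1}^{m} \left( \phi(x_i, y_i) - \mathbf{E}_{y|x_i}[\phi(x_i, y)] \right) \right\|^2 \le \epsilon$
• Maximum a Posteriori Estimation (dual)
  $\mathop{\mathrm{minimize}}_{\theta} \sum_{i=1}^{m} \left[ g(\theta|x_i) - \langle \phi(x_i, y_i), \theta \rangle \right] + \frac{\lambda}{2}\|\theta\|^2$
  here g is the conditional log-partition function
• Proof - analogous to Altun&Smola, 2006 via Fenchel Duality Theorem & operators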
Mean Operators... aka Fortet & Mourier 1946 revisited ...
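Mean operators
• Expectation map
  $f \to \mathbf{E}_{x \sim p}[f(x)] = \mathbf{E}_{x \sim p}[\langle \phi(x), f \rangle] = \langle f, \mathbf{E}_{x \sim p}[\phi(x)] \rangle =: \langle f, \mu[p] \rangle$
• Empirical average
  $X \to \mu[X] := \frac{1}{m}\sum_{i=1}^{m} \phi(x_i)$ hence $\langle f, \mu[X] \rangle = \frac{1}{m}\sum_{i=1}^{m} f(x_i)$
• Convergence theorem (Altun&Smola, 2006)
  $\Pr\left\{ \|\mu[X] - \mu[p]\| > \epsilon + \rho \right\} \le e^{-m \epsilon^2 R^{-2}}$
  where $\rho^2 = m^{-1}\, \mathbf{E}_{x, x' \sim p}\left[ k(x, x) - k(x, x') \right]$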
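The empirical mean map never has to be formed explicitly: the squared distance between two mean maps expands into kernel averages. The sketch below illustrates this under assumptions not in the slides: a Gaussian RBF kernel with a hypothetical bandwidth parameter `gamma`, and toy two-dimensional Gaussian samples. It shows that two samples from the same distribution have nearly coinciding mean maps, while a shifted sample does not.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mean_map_distance_sq(X, Xp, gamma=0.5):
    """||mu[X] - mu[X']||^2 expanded via inner products <phi(x), phi(x')> = k(x, x')."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Xp, Xp, gamma).mean()
            - 2.0 * rbf_kernel(X, Xp, gamma).mean())

rng = np.random.default_rng(0)
X_a = rng.normal(0.0, 1.0, size=(500, 2))
X_b = rng.normal(0.0, 1.0, size=(500, 2))   # same distribution as X_a
X_c = rng.normal(1.0, 1.0, size=(500, 2))   # mean shifted to (1, 1)

d_same = mean_map_distance_sq(X_a, X_b)
d_shift = mean_map_distance_sq(X_a, X_c)
print(f"same distribution: {d_same:.4f}, shifted: {d_shift:.4f}")
```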
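Mean operators
• Key idea
  • Have empirical mean operator for p and q
  • Find reweighted combination from X to X'
  $\mathop{\mathrm{minimize}}_{\beta} \left\| \frac{1}{m}\sum_{i=1}^{m} \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\|$
• By Cauchy-Schwarz this gives bound
  $\left| \frac{1}{m}\sum_{i=1}^{m} \beta_i f(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} f(x'_i) \right| \le \|f\| \left\| \frac{1}{m}\sum_{i=1}^{m} \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\|$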
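Guarantees
• Radon Nikodym derivative is unique solution when plugging in distributions.
• For empirical averages approximation error is small (upper bound by using RND).
  $\Pr\left\{ \left\| \frac{1}{m}\sum_{i=1}^{m} \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\| > \epsilon + \rho \right\} \le \exp\left( -\frac{\bar{m} \epsilon^2}{R^2} \right)$
  where $\frac{1}{\bar{m}} = \frac{B^2}{m} + \frac{1}{m'}$ and $\rho \le R / \sqrt{\bar{m}}$
• We can find it by optimization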
Optimization template
• Constrained problem
  $\mathop{\mathrm{minimize}}_{\beta}\ \Omega[\beta]$ subject to $\left\| \frac{1}{m}\sum_{i=1}^{m} \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\| \le \epsilon$
• Quadratic penalty: Kernel Mean Matching
• L infinity penalty: Bounded Mean Matching
• Entropy penalty: Entropy Mean Matching
  (Sugiyama, Bickel, Brefeld, Tsuboi, ...)
Optimization Problems... applied duality theory ...
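Quadratic Program
• Quadratic penalty on RND (this favors large effective sample size)
  $\mathop{\mathrm{minimize}}_{\alpha}\ \frac{1}{2}\alpha^\top [K + \lambda \mathbf{1}] \alpha - \alpha^\top u$ subject to $\alpha^\top \vec{1} = 1$ and $\alpha_i \ge 0$.
  looks like a single-class SVM
• Bounded range of RND (this bounds variance in McDiarmid tail)
  $\mathop{\mathrm{minimize}}_{\alpha}\ \frac{1}{2}\alpha^\top K \alpha - \alpha^\top u$ subject to $\alpha^\top \vec{1} = 1$ and $\alpha_i \in [0, \lambda]$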
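The quadratic-penalty program can be solved approximately on a small sample with projected gradient descent onto the simplex. This is a sketch under several assumptions that are not in the slides: an RBF kernel on one-dimensional toy data, u_i taken as the average kernel value between x_i and the test sample, a fixed iteration budget, and the reading β_i = m α_i to convert simplex weights into importance weights. The function names are hypothetical, and a dedicated QP solver would be the more usual choice.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between two 1-D samples."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def project_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum_i a_i = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def objective(alpha, K, u, lam):
    return 0.5 * alpha @ (K + lam * np.eye(K.shape[0])) @ alpha - alpha @ u

def kernel_mean_matching(K, u, lam=0.1, steps=2000):
    """Projected gradient on 1/2 a'(K + lam I)a - a'u over the simplex."""
    Q = K + lam * np.eye(K.shape[0])
    step = 1.0 / np.linalg.eigvalsh(Q)[-1]   # 1 / largest eigenvalue
    alpha = np.full(K.shape[0], 1.0 / K.shape[0])
    for _ in range(steps):
        alpha = project_simplex(alpha - step * (Q @ alpha - u))
    return alpha

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=200)     # assumed sample from p
x_test = rng.normal(1.0, 0.5, size=200)      # assumed sample from q

K = rbf(x_train, x_train)
u = rbf(x_train, x_test).mean(axis=1)
alpha = kernel_mean_matching(K, u)
beta = alpha.size * alpha                    # assumed rescaling to weights with mean 1

uniform = np.full(alpha.size, 1.0 / alpha.size)
print("objective uniform:", objective(uniform, K, u, 0.1))
print("objective matched:", objective(alpha, K, u, 0.1))
print("train mean:", x_train.mean(), "reweighted:", alpha @ x_train, "test mean:", x_test.mean())
```

The dense kernel matrix makes this approach expensive for large samples, so the sketch is only practical for a few hundred points.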
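Quadratic Program
• Problem
  Optimization problem is cubic in sample size
• Solution
  Find (ante)-primal problem and solve via SGD
  $\mathop{\mathrm{minimize}}_{\theta, b}\ \frac{1}{2}\|\theta\|^2 + b + \frac{1}{2\lambda}\sum_{i=1}^{n} \left( u_i - \langle \phi(x_i), \theta \rangle - b \right)_+^2$
  $\mathop{\mathrm{minimize}}_{\theta, b}\ \frac{1}{2}\|\theta\|^2 + b + \lambda \sum_{i=1}^{n} \left( u_i - \langle \phi(x_i), \theta \rangle - b \right)_+$
  where $u_i = \frac{1}{n}\sum_{j=1}^{n} k(x_i, x'_j)$ and $\beta_i$ via subdifferentials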
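Convex Program
• Minimum KL-Divergence regularization dual
  $\mathop{\mathrm{minimize}}_{\theta}\ g(\theta) - \langle \theta, \mu \rangle + \frac{1}{2\lambda}\|\theta\|^2$ with $g(\theta) = \log \sum_{i=1}^{n} e^{\langle \phi(x_i), \theta \rangle}$
  where $\beta(x) = e^{\langle \phi(x), \theta \rangle - g(\theta)}$
• Problem
  Computing the normalization g is expensive
• Solutions
  • MCMC sampler for gradient of g
  • Retain estimate of g (update parts frequently)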
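On a small sample the normalization g can simply be computed exactly, which makes the dual easy to try out. The sketch below assumes several things the slide does not specify: μ is taken to be the empirical feature mean of the test sample, φ is an explicit bounded feature map built from a few Gaussian bumps (hypothetical `centers`), and the optimizer is plain gradient descent with a step size derived from the bound on the features. It is meant as an illustration of the objective, not of the MCMC or cached-g strategies listed above.

```python
import numpy as np

def features(x, centers):
    """Bounded feature map: phi_k(x) = exp(-(x - c_k)^2 / 2), values in (0, 1]."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

def log_partition(Phi, theta):
    """g(theta) = log sum_i exp(<phi(x_i), theta>), computed stably."""
    s = Phi @ theta
    return s.max() + np.log(np.exp(s - s.max()).sum())

def weights(Phi, theta):
    """beta(x_i) = exp(<phi(x_i), theta> - g(theta)); these sum to one."""
    return np.exp(Phi @ theta - log_partition(Phi, theta))

def entropy_mean_matching(Phi, mu, lam=10.0, steps=3000):
    """Gradient descent on g(theta) - <theta, mu> + ||theta||^2 / (2 lam)."""
    # Each feature lies in [0, 1], so the Hessian of g (a covariance matrix)
    # has largest eigenvalue at most d / 4.
    step = 1.0 / (Phi.shape[1] / 4.0 + 1.0 / lam)
    theta = np.zeros(Phi.shape[1])
    for _ in range(steps):
        grad = Phi.T @ weights(Phi, theta) - mu + theta / lam
        theta -= step * grad
    return theta

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=500)      # assumed sample from p
x_test = rng.normal(1.0, 0.5, size=500)       # assumed sample from q
centers = np.linspace(-2.0, 2.0, 5)

Phi = features(x_train, centers)
mu = features(x_test, centers).mean(axis=0)   # assumed definition of mu

theta = entropy_mean_matching(Phi, mu)
beta = weights(Phi, theta)

gap_before = np.linalg.norm(Phi.mean(axis=0) - mu)
gap_after = np.linalg.norm(Phi.T @ beta - mu)
print(f"feature mean gap before: {gap_before:.4f}, after: {gap_after:.4f}")
```

Larger values of `lam` weaken the regularizer and tighten the match, at the cost of more uneven weights.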
Conclusions... good/bad news ...
Experimental results
• All methods work well (much better than doing nothing)
• Online optimization is effective
• Logistic regression works very well
• Entropy regularization works best
  (even though we have theory for the norms but not for entropy)