Mandoline: Model Evaluation under
Transcript of Mandoline: Model Evaluation under
![Page 1: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/1.jpg)
Mandoline: Model Evaluation under Distribution ShiftMayee Chen*, Karan Goel*, Nimit Sohoni*, Fait Poms, Kayvon Fatahalian, Christopher Ré
Stanford University
![Page 2: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/2.jpg)
Motivation
Q: How do we evaluate model performance during deployment?
● Model’s deployment setting ≠ training setting
![Page 3: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/3.jpg)
Motivation
Q: How do we evaluate model performance during deployment?
Source data distribution
evalu
ate
Model
Training set Validation set
Labeled
learn
● Model’s deployment setting ≠ training setting
![Page 4: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/4.jpg)
Motivation
Q: How do we evaluate model performance during deployment?
Source data distribution
evalu
ate
Model
Target data distributionTraining set Validation set
Mandoline: user-guided framework for evaluation under distribution shiftLabeled Unlabeled
learn deploy
Distribution shift
evaluate?
● Model’s deployment setting ≠ training setting
![Page 5: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/5.jpg)
Common approach: importance weightingSource data distribution Target data distribution
![Page 6: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/6.jpg)
Common approach: importance weightingSource data distribution Target data distribution
Density ratio to estimate
![Page 7: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/7.jpg)
● High dimensional data : harder to compute
Common approach: importance weighting
Problems:
● Support shift - what if ?
Source data distribution Target data distribution
Density ratio to estimate
![Page 8: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/8.jpg)
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
![Page 9: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/9.jpg)
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
![Page 10: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/10.jpg)
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
Source Accuracy: 91%
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
I love eating ice-cream.
He loved walking on the beach.
He didn’t like drinking coffee.
(Source) Labeled Validation Set
-1
Slices
-1
1
-1
1
1
1
1
-1
Model
![Page 11: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/11.jpg)
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
Source Accuracy: 91%
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
He does not love eating scones.
He loves taking risks.
She likes drinking coffee.
(Target) Unlabeled Test Set Slices
1
-1
-1
1
1
-1
1
1
-1
I love eating ice-cream.
He loved walking on the beach.
He didn’t like drinking coffee.
(Source) Labeled Validation Set
-1
Slices
-1
1
-1
1
1
1
1
-1
Model
![Page 12: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/12.jpg)
Mandoline: Slice-based reweighting frameworkSlice: user-defined grouping of data
Source Accuracy: 91% Target Accuracy:
84%
negationcontains not, n’t
male pronouncontains he, him
strong sentimentcontains love, adore
He does not love eating scones.
He loves taking risks.
She likes drinking coffee.
(Target) Unlabeled Test Set Slices
1
-1
-1
1
1
-1
1
1
-1
I love eating ice-cream.
He loved walking on the beach.
He didn’t like drinking coffee.
(Source) Labeled Validation Set
-1
Slices
-1
1
-1
1
1
1
1
-1
Model
Mandoline
![Page 13: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/13.jpg)
Results
Prop 1: if k slices capture all “relevant” distributional shift
between and , then reweighting with recovers .
● If support shift occurs on irrelevant slices (i.e. slices independent of Y), it can be corrected!
● Dimensionality: reduce from d → k
![Page 14: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/14.jpg)
Results
● How to compute ? Use any density ratio estimation method on g(x)○ Kullback-Leibler Importance Estimation Procedure (KLIEP)
■ Extend to correct for noisily-defined, incomplete slices
Prop 1: if k slices capture all “relevant” distributional shift
between and , then reweighting with recovers .
● If support shift occurs on irrelevant slices (i.e. slices independent of Y), it can be corrected!
● Dimensionality: reduce from d → k
![Page 15: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/15.jpg)
Results
![Page 16: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/16.jpg)
Results
CelebA SNLI → MNLI
Importance weighting on x
On g(x)
![Page 17: Mandoline: Model Evaluation under](https://reader031.fdocuments.in/reader031/viewer/2022012510/6188a3f3ac720d37b3622dd4/html5/thumbnails/17.jpg)
Summary
Thank you! Contact: [email protected]
Model evaluation under distribution shift:● When user-specified slices capture relevant distribution shift, can reweight
using them● Can mitigate 1) support shift and 2) high dimensionality in standard
importance reweighting● Future steps: slice design - frameworks for how to construct good g?