Intern Presentation short - Stanford...
![Page 1: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/1.jpg)
Intern Presentation
A/B Testing by Interleaving
Sida Wang
![Page 2: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/2.jpg)
My Project
• Evaluating search relevance by interleaving
results and collecting user data
– Interleaving Framework
• Generic, Extensible
– Experiments to evaluate relevance by interleaving
• Based on the paper "How Does Clickthrough Data
Reflect Retrieval Quality?" by F. Radlinski et al.
![Page 3: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/3.jpg)
Evaluating Search Relevance
• Without Interleaving
- Full-time human judges -> precision, recall, NDCG
- Compare Search
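For reference, the judge-based measures named above can be sketched as follows. This is a minimal illustration, not the relevance team's implementation: it assumes graded relevance labels from human judges and the common `rel / log2(rank + 1)` form of DCG, and the function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel / log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

All three need a human-judged label for every result, which is the cost that interleaving tries to avoid.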
![Page 4: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/4.jpg)
Compare Search
Result S1
Result S2
Result S3
Result G1
Result G2
Result G3
![Page 5: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/5.jpg)
![Page 6: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/6.jpg)
Issues
• Aas
• But do Microsoft people pick O14 Search or
Google Mini?
• Maybe people tend to pick the left?
• Alters the search experience
– Can never collect a lot of data using this method
![Page 7: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/7.jpg)
By Interleaving
[Diagram: ranking A (Results A1, A2, A3, all relevant) and ranking B (Results B1, B2, B3, all useless) merged into one interleaved result list.]
![Page 8: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/8.jpg)
By Interleaving
[Diagram: rankings A (Results A1, A2, A3) and B (Results B1, B2, B3) merged into one interleaved result list, shown without relevance labels.]
![Page 9: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/9.jpg)
Considerations
• Minimize impact to UX
– So no demo, it looks exactly like normal search
• Minimize Bias
– Summary normalization
– Interleaving algorithms
• Reliability / performance / and the usual
![Page 10: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/10.jpg)
Experiments I did
• Automated random clicks
• Automated clicks according to relevance
judgments
• Clicks from real people
![Page 11: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/11.jpg)
Random Clicks
[Figure: "Control Using Automated Random Clicks". x-axis: Clicks (0 to 4,500); y-axis: % of Votes Received; series: Betaa, MSW, Ties.]
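The idea behind the random-click control can be sketched with a small simulation: if clicks ignore relevance, neither system should collect noticeably more votes. This is an assumption-laden toy, not the experiment's actual harness. It assumes two systems with disjoint results, slots assigned in coin-flipped pairs, 1 to 3 uniformly random clicks per impression, and one vote per impression; the function name and parameters are hypothetical.

```python
import random


def simulate_random_clicks(n_impressions, list_len=10, seed=0):
    """Return the share of votes for systems A and B, and ties, under random clicks."""
    rng = random.Random(seed)
    votes = {"A": 0, "B": 0, "tie": 0}
    for _ in range(n_impressions):
        # Assign each adjacent pair of slots to the two systems in coin-flip order.
        teams = []
        for _ in range(list_len // 2):
            pair = ["A", "B"]
            rng.shuffle(pair)
            teams.extend(pair)
        # 1 to 3 clicks at uniformly random positions, ignoring relevance.
        n_clicks = rng.randint(1, 3)
        clicks = [teams[rng.randrange(len(teams))] for _ in range(n_clicks)]
        a, b = clicks.count("A"), clicks.count("B")
        votes["A" if a > b else "B" if b > a else "tie"] += 1
    return {name: count / n_impressions for name, count in votes.items()}
```

With enough impressions the A and B shares should converge toward each other, which is what a control run is meant to confirm.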
![Page 12: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/12.jpg)
A Lot of Random Clicks
[Figure: "Control Using Automated Random Clicks". x-axis: Clicks (0 to 30,000); y-axis: % of Votes Received; series: ac11, a86f, ties.]
![Page 13: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/13.jpg)
Experiments I did
• Automated random clicks
• Automated clicks according to relevance
judgments
• Clicks from real people
![Page 14: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/14.jpg)
O12 vs. O14
[Figure: "Automated Clicks Using Relevance Judgments". x-axis: Clicks (0 to 4,000); y-axis: % of Votes Received; series: Acing05 Degraded, Acing05, Ties.]
![Page 15: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/15.jpg)
Experiments I did
• Automated random clicks
• Automated clicks according to relevance
judgments
• Clicks from real people
![Page 16: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/16.jpg)
O12 vs. O14
[Figure: "O12 vs. O14 in BSG ALL". x-axis: Clicks (0 to 90); y-axis: % of Votes Received; series: O12, O14, Tie.]
![Page 17: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/17.jpg)
Method of Analysis (election)
• Vote by query, by user, by session etc.
• query = person, user = state
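One way to read the election analogy is sketched below. It assumes each query has already been scored as a win for one system or a tie, and that a user or session casts a single vote for whichever system won the majority of its queries. The record layout and the names `tally` and `majority` are hypothetical, not taken from the project.

```python
from collections import Counter, defaultdict


def majority(winners):
    """Winner of a group of per-query outcomes; 'tie' if there is no strict majority."""
    counts = Counter(w for w in winners if w != "tie")
    if not counts:
        return "tie"
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def tally(records, key=None):
    """Count votes from (user, session, winner) records, one record per query.

    key=None counts every query directly; otherwise queries are grouped by
    key(user, session) and each group casts one majority vote.
    """
    if key is None:
        return Counter(winner for _, _, winner in records)
    groups = defaultdict(list)
    for user, session, winner in records:
        groups[key(user, session)].append(winner)
    return Counter(majority(ws) for ws in groups.values())
```

Usage: `tally(records)` votes by query, `tally(records, key=lambda u, s: u)` by user, and `tally(records, key=lambda u, s: (u, s))` by session.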
![Page 18: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/18.jpg)
Summary of Results
| Method of voting | O12 | O14 |
|---|---|---|
| By queries (direct election) | 12 | 24 |
| By users (1 vote per state) | 4 | 9 |
| By sessions (~electoral votes) | 5 | 11 |
• System does not seem to matter much, but there are
too few clicks (85) to draw a significant
conclusion
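A rough way to check the significance concern is an exact two-sided sign test on the vote counts. This is a sketch under assumptions the slides do not state: votes are treated as independent, ties are dropped, and the null hypothesis is that each system is equally likely to win a vote. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided binomial test of wins_a vs. wins_b against a 50/50 null."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Under these assumptions, `sign_test_p_value(12, 24)` is about 0.065, and the by-user and by-session counts give larger p-values, so none of the three tallies falls below the usual 0.05 threshold.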
![Page 19: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/19.jpg)
What Logically Follows
• Google Mini vs. O14 (after fixing Google Mini)
• FAST vs. O14 (after fixing RSS in fssearchoffice)
• I’d love to see the results
![Page 20: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/20.jpg)
What can interleaving do?
• Give relevance team more confidence
• Use interleaving for displaying results
• Use interleaving to automatically tune the
search engine
[Arrow label beside the list: Ambition]
![Page 21: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/21.jpg)
Add Confidence
• In addition to traditional measures like
NDCG, precision, and recall, it is nice to have
another independent metric.
• Automatic
– Does not require human judgments
• Scalable
– Small impact to UX
![Page 22: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/22.jpg)
What can interleaving do?
• Give relevance team more confidence
• Use interleaving for displaying results
• Use interleaving to automatically tune the
search engine
[Arrow label beside the list: Ambition]
![Page 23: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/23.jpg)
Display
![Page 24: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/24.jpg)
Display
![Page 25: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/25.jpg)
What can we do?
• Give relevance team more confidence
• Use interleaving for displaying results
• Use interleaving to automatically tune the
search engine
[Arrow label beside the list: Ambition]
![Page 26: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/26.jpg)
Automatic Tuning
• Many relevance models, each good for a particular type of corpus (specs, user data, academic articles, product catalog, websites)
• Use interleaving in 10% of searches
• Use user click data to:
– Automatically and dynamically decide on the best model, or tweak model parameters
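The tuning loop could look roughly like the sketch below. It is only an illustration of the idea on this slide: the class `ModelSelector`, its methods, and the simple "most interleaving wins" rule are all hypothetical, and the 10% rate is the only number taken from the slide.

```python
import random
from collections import Counter


class ModelSelector:
    """Track interleaving wins per relevance model and expose the current best."""

    def __init__(self, models, interleave_rate=0.10, seed=None):
        self.models = list(models)
        self.interleave_rate = interleave_rate  # fraction of searches to interleave
        self.wins = Counter()
        self.rng = random.Random(seed)

    def should_interleave(self):
        """Decide whether this search is served as an interleaving experiment."""
        return self.rng.random() < self.interleave_rate

    def pick_challenger(self):
        """Pick a random model to interleave against the current best."""
        best = self.best()
        others = [m for m in self.models if m != best]
        return self.rng.choice(others) if others else None

    def record_win(self, model):
        """Credit a model whose results won the clicks in one interleaved search."""
        self.wins[model] += 1

    def best(self):
        """Model with the most wins so far (first listed model on ties)."""
        return max(self.models, key=lambda m: self.wins[m])
```

Searches that are not interleaved would simply be served by `best()`. A real system would also need a significance check before switching models.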
![Page 27: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/27.jpg)
Thank you!
• Dmitriy, Eugene, Puneet
• Jamie, Jessica, Ping, Victor, Relevance Team
• Russ, Jon
• Search Team
• Hope to see you again in the future!
![Page 28: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/28.jpg)
Extra Slides
![Page 29: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/29.jpg)
Automatic Tuning – Pairwise?
• Pairwise comparisons scale poorly
• But there seems to be “strong stochastic
transitivity”
– Given locations A, B, C
– If A > B > C then ΔAC > max(ΔAB, ΔBC)
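The transitivity condition can be checked on measured pairwise results with a sketch like the one below. It assumes Δ is defined as the probability that the first option beats the second, minus 0.5, which the slide does not spell out. It also uses the non-strict form (≥), so an equal margin still passes. The function names and the dictionary layout are hypothetical.

```python
from itertools import permutations


def delta(p_win, x, y):
    """Preference margin of x over y: P(x beats y) - 0.5."""
    if (x, y) in p_win:
        return p_win[(x, y)] - 0.5
    # Only the reverse pair was measured: P(x beats y) = 1 - P(y beats x).
    return 0.5 - p_win[(y, x)]


def satisfies_sst(p_win, items):
    """True if every chain a > b > c has delta(a, c) >= max(delta(a, b), delta(b, c))."""
    for a, b, c in permutations(items, 3):
        d_ab, d_bc = delta(p_win, a, b), delta(p_win, b, c)
        if d_ab > 0 and d_bc > 0 and delta(p_win, a, c) < max(d_ab, d_bc):
            return False
    return True
```

If the property holds, comparing A with C directly is at least as decisive as either adjacent comparison, so a tuner may not need to run every pair.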
![Page 30: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/30.jpg)
How to Interleave
• Balanced
• Team Draft
![Page 31: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/31.jpg)
Balanced Interleaving
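The slide gives only the title, so the following is a sketch of balanced interleaving as it is usually described in the clickthrough-evaluation literature, not necessarily the framework's own code. The function name and the `rng` parameter are hypothetical.

```python
import random


def balanced_interleave(ranking_a, ranking_b, rng=random):
    """Merge two rankings so every prefix draws about equally from the top of each."""
    interleaved = []
    k_a = k_b = 0
    a_first = rng.random() < 0.5  # one coin flip decides which ranking leads
    while k_a < len(ranking_a) and k_b < len(ranking_b):
        if k_a < k_b or (k_a == k_b and a_first):
            if ranking_a[k_a] not in interleaved:
                interleaved.append(ranking_a[k_a])
            k_a += 1
        else:
            if ranking_b[k_b] not in interleaved:
                interleaved.append(ranking_b[k_b])
            k_b += 1
    return interleaved
```

The two pointers never differ by more than one, so any prefix of the merged list contains the top results of both rankings to nearly the same depth.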
![Page 32: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/32.jpg)
Team Draft
[Diagram: draft analogy in which two teams each list a 1st, 2nd, and 3rd pick; the players named are LeBron James, Kobe Bryant, Tim Duncan, and John Smith.]
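The draft analogy corresponds to team-draft interleaving, sketched here as it is commonly described in the literature rather than as the framework's actual code. Each ranking acts as a team captain: the team with fewer picks (a coin flip breaks ties) takes its highest-ranked result that has not been drafted yet. The function names are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Return (interleaved list, results drafted by A, results drafted by B)."""
    interleaved, team_a, team_b = [], set(), set()
    while True:
        rest_a = [d for d in ranking_a if d not in interleaved]
        rest_b = [d for d in ranking_b if d not in interleaved]
        if not rest_a or not rest_b:
            break  # stop once either ranking has nothing left to draft
        a_picks = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5)
        if a_picks:
            interleaved.append(rest_a[0])
            team_a.add(rest_a[0])
        else:
            interleaved.append(rest_b[0])
            team_b.add(rest_b[0])
    return interleaved, team_a, team_b


def credit_clicks(clicked, team_a, team_b):
    """Vote for the ranking whose drafted results received more clicks."""
    a = sum(1 for d in clicked if d in team_a)
    b = sum(1 for d in clicked if d in team_b)
    return "A" if a > b else "B" if b > a else "tie"
```

Because a result shared by both rankings is drafted only once, a click on it is credited to whichever team picked it first.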
![Page 33: Intern Presentation short - Stanford Universityweb.stanford.edu/~sidaw/projects/nontechnicalslides.pdf•But do Microsoft people pick O14 Search or Google Mini? •Maybe people tend](https://reader031.fdocuments.in/reader031/viewer/2022030213/5ad705a37f8b9a9d5c8b8eb2/html5/thumbnails/33.jpg)
A/B Testing By Interleaving
[Diagram: ranking A (Results A1, A2, A3, all relevant) and ranking B (Results B1, B2, B3, all useless) merged into one interleaved result list.]