HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching
![Page 1: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/1.jpg)
HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING
Arnab Nandi, UNIV OF MICHIGAN
Phil Bernstein, MICROSOFT RESEARCH
PRESENTED BY VAIBHAV MEHTA
![Page 2: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/2.jpg)
Scenario
Arnab Nandi & Phil Bernstein
![Page 3: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/3.jpg)
Scenario
Search over structured data: commerce, entertainment
Data onboarding – merge an XML data feed from a 3rd party into the Microsoft data warehouse.
![Page 4: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/4.jpg)
Scenario
[Diagram: Users send a query to the Search engine + data warehouse and receive results; the warehouse takes in several 3rd Party Feeds, e.g. “Amazon.com”]
• High Precision (Irrespective of Recall)
• Minimal Human Involvement
![Page 5: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/5.jpg)
Example Feed
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
Warehouse: Movies (Host) | 3rd Party Movie Site (Foreign)
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
![Page 6: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/6.jpg)
Schema Matching
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
Warehouse: Movies (Host) | 3rd Party Movie Site (Foreign)
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
![Page 7: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/7.jpg)
Taxonomy Matching
(Same two records as on the Schema Matching slide. Here the values to align are the categories themselves: <Category>Action</Category> and <Category>Comedy</Category> in one record, <GENRE1>Action/Adventure</GENRE1> in the other.)
![Page 8: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/8.jpg)
Various Problems
Badly normalized…
Unit conversion…
Formatting choices…
In-band signaling…
Arbitrary labels
Non-standard vocabulary / language
Zero documentation
Not enough instances
![Page 9: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/9.jpg)
Unlike conventional matching…
We have web search click data
For both Warehouse & 3rd party website
The databases we are integrating (usually) have a presence on the web
Why not use click data as a feature for schema & taxonomy matching?
[Diagram: Users send a query to the Search engine + data warehouse and receive results; a 3rd Party Feed is merged into the warehouse]
![Page 10: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/10.jpg)
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
![Page 11: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/11.jpg)
Core idea
“If two (sets of) products are searched for by similar queries, then they are similar”
[Diagram: the query “Small laptop” issued to Web Search]
![Page 12: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/12.jpg)
Core idea
[Diagram: Clicklog entries for the query “Small laptop” pointing at pages X, Y and Z; labels: Warehouse “Small Laptops”, Warehouse “Pro. Laptops”, “eee ::: small laptops”]
![Page 13: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/13.jpg)
Query Distributions
[Chart: query distribution; axis label: click count]
![Page 14: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/14.jpg)
Mapping to Taxonomy
Map URL to product, which belongs to taxonomy
http://www.amazon.com/dp/B001JTA59C
Shopping | Electronics | Netbooks
3rd party DB (provided to us)
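A minimal sketch of this lookup, assuming the 3rd party DB can be read as a table from product id to taxonomy path and that the product id is the URL segment after `/dp/`. The names `PRODUCT_TAXONOMY` and `url_to_taxonomy` are hypothetical and not part of the original system.

```python
import re

# Hypothetical stand-in for the 3rd party DB: product id -> taxonomy path.
PRODUCT_TAXONOMY = {
    "B001JTA59C": ("Shopping", "Electronics", "Netbooks"),
}

# Assumed URL shape: the product id is the path segment after "/dp/".
_PRODUCT_ID = re.compile(r"/dp/([A-Z0-9]+)")


def url_to_taxonomy(url, product_taxonomy=PRODUCT_TAXONOMY):
    """Return the taxonomy path for a product URL, or None if it cannot be resolved."""
    match = _PRODUCT_ID.search(url)
    if match is None:
        return None
    return product_taxonomy.get(match.group(1))


path = url_to_taxonomy("http://www.amazon.com/dp/B001JTA59C")
print(" | ".join(path))  # Shopping | Electronics | Netbooks
```

URLs that do not resolve return `None`, so pages outside the provided DB can simply be skipped when click counts are aggregated.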
![Page 15: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/15.jpg)
Aggregating Query Distributions
[Diagram: query distributions aggregated for Warehouse “Small Laptops”, Warehouse “Pro. Laptops” and “eee ::: small laptops”]
![Page 16: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/16.jpg)
Generating Correspondences
Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
Process
For each page (URL)
  Identify query distribution
  Identify category / schema element of that page
For each category / schema element C
  Aggregate over pages in C to get query distribution
For each foreign category / schema element
  Find host category / schema element with most similar query distribution
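The process above can be sketched in a few functions. This is an illustration under stated assumptions, not the authors' implementation: the clicklog is taken to be rows of (query, click count, URL), the URL-to-category maps are given, and `shared_mass` is a deliberately simple hypothetical stand-in for whatever similarity function is plugged in. The sample rows are the ones from the taxonomy matching example in this deck.

```python
from collections import Counter, defaultdict


def page_distributions(clicklog):
    """Step 1: collect click counts per query for each page (URL)."""
    pages = defaultdict(Counter)
    for query, clicks, url in clicklog:
        pages[url][query] += clicks
    return pages


def category_distributions(pages, url_to_category):
    """Step 2: sum page-level counts per category, then normalise to relative frequencies."""
    counts = defaultdict(Counter)
    for url, dist in pages.items():
        category = url_to_category.get(url)
        if category is not None:  # pages we cannot place are skipped
            counts[category].update(dist)
    result = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        result[category] = {q: n / total for q, n in counter.items()}
    return result


def best_matches(foreign, host, similarity):
    """Step 3: for each foreign category, pick the host category with the highest similarity."""
    return {
        f_cat: max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        for f_cat, f_dist in foreign.items()
    }


def shared_mass(host_dist, foreign_dist):
    """Hypothetical placeholder similarity: frequency mass shared by identical queries."""
    return sum(min(f, foreign_dist[q]) for q, f in host_dist.items() if q in foreign_dist)


clicklog = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]
host_urls = {
    "http://searchengine.com/product/macbookpro": "Professional Laptops",
    "http://searchengine.com/product/mininote": "Small Laptops",
}
foreign_urls = {"http://asus.com/eeepc": "eee"}

pages = page_distributions(clicklog)
host = category_distributions(pages, host_urls)
foreign = category_distributions(pages, foreign_urls)
print(best_matches(foreign, host, shared_mass))  # {'eee': 'Small Laptops'}
```

Passing the similarity function as a parameter keeps the aggregation steps independent of the particular metric used to compare distributions.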
![Page 17: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/17.jpg)
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
![Page 18: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/18.jpg)
Example: Taxonomy Matching
| query | freq | url |
| --- | --- | --- |
| laptop | 70 | http://searchengine.com/product/macbookpro |
| laptop | 25 | http://searchengine.com/product/mininote |
| laptop | 5 | http://asus.com/eeepc |
| netbook | 5 | http://searchengine.com/product/macbookpro |
| netbook | 20 | http://searchengine.com/product/mininote |
| netbook | 15 | http://asus.com/eeepc |
| cheap netbook | 5 | http://asus.com/eeepc |

Warehouse: Small Laptops
Warehouse: Professional Laptops
eee
![Page 19: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/19.jpg)
Example: Taxonomy Matching
Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
![Page 20: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/20.jpg)
Distribution Similarity Metric
$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$
![Page 21: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/21.jpg)
Example: Taxonomy Matching
“small laptops” vs “eee”: laptop vs laptop, netbook vs netbook, laptop vs cheap laptop
1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74
Warehouse: Small Laptops (“laptop”: 25/45, “netbook”: 20/45) vs eee: 0.74
Warehouse: Professional Laptops (“laptop”: 70/75, “netbook”: 5/75) vs eee: 0.31
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
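The worked example can be checked with a short sketch of the metric. It rests on two assumptions read off the arithmetic above: Jaccard is computed over the word sets of the two queries, and MinFreq is the smaller of the two relative frequencies. The function names are illustrative, not taken from the original system.

```python
def jaccard(query_a, query_b):
    """Jaccard similarity between the word sets of two queries."""
    a, b = set(query_a.lower().split()), set(query_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def distribution_similarity(host, foreign):
    """Sum Jaccard x MinFreq over every (host query, foreign query) pair.

    host, foreign: dicts mapping query -> relative click frequency.
    MinFreq is read here as the smaller of the two relative frequencies.
    """
    return sum(
        jaccard(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )


# Distributions as given on the example slides.
small = {"laptop": 25 / 45, "netbook": 20 / 45}
professional = {"laptop": 70 / 75, "netbook": 5 / 75}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(round(distribution_similarity(small, eee), 2))         # 0.74
print(round(distribution_similarity(professional, eee), 2))  # 0.37
```

Under this reading the Small Laptops score matches the 0.74 on the slide. The Professional Laptops score comes out at about 0.37 rather than the slide's 0.31, so the slide presumably relies on a detail that cannot be recovered from this transcript. The ranking is the same either way: “eee” is closer to Small Laptops.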
![Page 22: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/22.jpg)
Advantages of Clicklogs
Resilient to language
Resilient to new domains, data, and features
  As long as people query & click, we have data to learn from
Generates mappings previous methods can’t
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
![Page 23: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/23.jpg)
System Design
![Page 24: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/24.jpg)
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
![Page 25: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/25.jpg)
Experimenting with Click Logs
Commercial warehouse mapping, 258 products from a 70,000 term Amazon.com taxonomy (613 in gold) to a 6,000 term warehouse taxonomy (40 in gold)
Live.com (now Bing.com) search querylog
Amazon to warehouse mapping task, consecutively halving the clicklog size used
1.8 million clicks to Amazon.com product pages
Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).
![Page 26: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/26.jpg)
Summary of Results
90% precision / recall possible
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
![Page 27: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/27.jpg)
Precision / Recall
Commercial warehouse mapping, 258 products from a 70K term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)
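The transcript does not say exactly how precision and recall were scored, so the following is only a sketch using the standard definitions over a gold mapping from foreign to host categories. The function name and the toy mappings are hypothetical and carry no experimental data.

```python
def precision_recall(predicted, gold):
    """Standard precision / recall of predicted correspondences against a gold mapping.

    predicted, gold: dicts mapping foreign category -> host category.
    """
    correct = sum(1 for foreign, host in predicted.items() if gold.get(foreign) == host)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy, made-up labels purely to exercise the function.
gold = {"f1": "h1", "f2": "h2", "f3": "h3"}
predicted = {"f1": "h1", "f2": "h9"}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

A matcher that only emits correspondences above a score threshold trades recall for precision, which fits the scenario's stated preference for high precision irrespective of recall.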
![Page 28: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/28.jpg)
Summary of Results
90% precision / recall possible
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
![Page 29: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/29.jpg)
Varying Clicklog Size
Successively decreased clicklog size by half
Recall decreases as clicklog size is decreased
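One way such a series of shrinking clicklogs could be produced is sketched below. How the original experiment selected the rows to keep is not stated, so uniform random sampling of rows, with each log nested in the previous one, is an assumption. `halved_clicklogs` and the synthetic rows are hypothetical.

```python
import random


def halved_clicklogs(clicklog, rounds, seed=0):
    """Return the full clicklog followed by `rounds` successively halved random samples."""
    rng = random.Random(seed)  # fixed seed so the series is reproducible
    current = list(clicklog)
    logs = [current]
    for _ in range(rounds):
        # Each sample is drawn from the previous one, so the logs are nested.
        current = rng.sample(current, len(current) // 2)
        logs.append(current)
    return logs


# Synthetic placeholder rows: (query, click count, url).
rows = [("query %d" % i, 1, "http://example.com/p%d" % (i % 4)) for i in range(16)]
logs = halved_clicklogs(rows, rounds=3)
print([len(log) for log in logs])  # [16, 8, 4, 2]
```

Running the matcher on each log in turn and scoring it against the same gold mapping would give one recall figure per clicklog size.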
![Page 30: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/30.jpg)
Summary of Results
90% precision / recall possible
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
![Page 31: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/31.jpg)
Comparing Query Distributions
$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$
Replace Jaccard with various phrase similarity metrics
Minimal difference due to size of most queries
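The slide does not name the alternative metrics, so the sketch below uses two common set-based ones, Dice and the overlap coefficient, purely as examples of functions that could stand in for Jaccard in the sum. All three operate on the word sets of the two queries.

```python
def _words(query):
    """Lower-cased word set of a query."""
    return set(query.lower().split())


def jaccard(a, b):
    x, y = _words(a), _words(b)
    return len(x & y) / len(x | y) if x | y else 0.0


def dice(a, b):
    x, y = _words(a), _words(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 0.0


def overlap(a, b):
    x, y = _words(a), _words(b)
    return len(x & y) / min(len(x), len(y)) if x and y else 0.0


pairs = [("netbook", "netbook"), ("laptop", "netbook"), ("laptop", "cheap laptop")]
for a, b in pairs:
    print(a, "|", b, "|", [round(f(a, b), 2) for f in (jaccard, dice, overlap)])
```

For identical queries all three return 1, and for queries with no shared words all three return 0. They differ only on partial overlaps such as “laptop” vs “cheap laptop”. With queries of one or two words there are few such cases, which may be what the slide's remark about query size refers to.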
![Page 32: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/32.jpg)
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
![Page 33: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/33.jpg)
Conclusion
Unsupervised mapping is possible
  Very high recall / precision when enough queries are present
Click logs are promising
  Finds results that other methods cannot find
  As clicklog size increases, it will produce more mappings
Combinable with existing methods
![Page 34: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching](https://reader033.fdocuments.in/reader033/viewer/2022051417/568149a2550346895db6e34f/html5/thumbnails/34.jpg)
Questions?