HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching


Transcript of HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Page 1: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING

Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)

Presented by Vaibhav Mehta

Page 2: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Scenario


Page 3: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Scenario

Search over structured data: commerce, entertainment

Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

Page 4: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Scenario

[Diagram labels: Users, query, results, Search engine + data warehouse, 3rd Party Feed (×4), “Amazon.com”]

• High precision (irrespective of recall)
• Minimal human involvement

Page 5: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example Feed

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host) | 3rd Party Movie Site (Foreign)

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Page 6: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Schema Matching

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host) | 3rd Party Movie Site (Foreign)

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Page 7: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Taxonomy Matching

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host) | 3rd Party Movie Site (Foreign)

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Page 8: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Various Problems

Badly normalized…
Unit conversion…
Formatting choices…
In-band signaling…
Arbitrary labels
Non-standard vocabulary / language
Zero documentation
Not enough instances

Page 9: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Unlike conventional matching…

We have web search click data

For both the warehouse and the 3rd party website

The databases we are integrating (usually) have a presence on the web

Why not use click data as a feature for schema & taxonomy matching?

[Diagram labels: Users, query, results, Search engine + data warehouse, 3rd Party Feed]

Page 10: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Outline

Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results

Page 11: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Core idea

“If two (sets of) products are searched for by similar queries, then they are similar”

[Diagram labels: query “Small laptop”, Web Search]

Page 12: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Core idea

[Diagram labels: Clicklog; query “Small laptop”; Warehouse categories “Small Laptops” and “Pro. Laptops”; “eee ::: small laptops”; X, Y, Z]

Page 13: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Query Distributions

[Chart; axis label: click count]

Page 14: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Mapping to Taxonomy

Map URL to product, which belongs to taxonomy

http://www.amazon.com/dp/B001JTA59C

Shopping | Electronics | Netbooks

3rd party DB (provided to us)

Page 15: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Aggregating Query Distributions

[Diagram labels: Warehouse categories “Small Laptops” and “Pro. Laptops”; “eee ::: small laptops”]

Page 16: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Generating Correspondences

Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.

Process
  For each page (URL)
    Identify query distribution
    Identify category / schema element of that page
  For each category / schema element C
    Aggregate over pages in C to get query distribution
  For each foreign category / schema element
    Find host category / schema element with most similar query distribution
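The process on this slide can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the HAMSTER implementation: the clicklog is taken to be a list of (query, click count, URL) rows, `url_to_category` is a lookup assumed to be available for each site, and `similarity` is any function that scores two query distributions. All function names, URLs, and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_by_category(clicklog, url_to_category):
    """Sum click counts per query over all pages in each category,
    then normalize so each category's distribution sums to 1."""
    counts = defaultdict(Counter)
    for query, freq, url in clicklog:
        category = url_to_category.get(url)
        if category is not None:  # skip pages we cannot place in a taxonomy
            counts[category][query] += freq
    distributions = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        distributions[category] = {q: n / total for q, n in counter.items()}
    return distributions


def best_matches(host_dists, foreign_dists, similarity):
    """For each foreign category, pick the host category whose query
    distribution scores highest under `similarity`."""
    matches = {}
    for f_cat, f_dist in foreign_dists.items():
        matches[f_cat] = max(
            host_dists, key=lambda h_cat: similarity(host_dists[h_cat], f_dist)
        )
    return matches


def shared_mass(dist_a, dist_b):
    """Placeholder similarity (an assumption): overlap on identical queries."""
    return sum(min(p, dist_b[q]) for q, p in dist_a.items() if q in dist_b)


# Hypothetical rows and URLs, for illustration only.
host_log = [
    ("thriller", 8, "host/m1"),
    ("comedy", 2, "host/m1"),
    ("comedy", 9, "host/m2"),
]
foreign_log = [("comedy", 5, "foreign/a"), ("funny movie", 1, "foreign/a")]

host = aggregate_by_category(host_log, {"host/m1": "Thrillers", "host/m2": "Comedies"})
foreign = aggregate_by_category(foreign_log, {"foreign/a": "Humor"})
print(best_matches(host, foreign, shared_mass))
```

Passing the similarity in as a parameter keeps the aggregation and matching steps independent of whichever distribution metric is chosen.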

Page 17: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Outline

Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results

Page 18: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example: Taxonomy Matching

query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc

[Diagram labels: Warehouse: Small Laptops; Warehouse: Professional Laptops; eee]

Page 19: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example: Taxonomy Matching

Warehouse: Small Laptops
  “laptop”: 25/45
  “netbook”: 20/45

Warehouse: Professional Laptops
  “laptop”: 70/75
  “netbook”: 5/75

eee
  “laptop”: 5/25
  “netbook”: 15/25
  “cheap laptop”: 5/25

Page 20: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Distribution Similarity Metric

$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$
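A minimal Python sketch of this metric follows. It assumes that Jaccard is computed over the sets of whitespace-separated words in each query and that MinFreq is the smaller of the two queries' relative frequencies; both readings are inferred from the worked taxonomy matching example in this deck rather than stated outright, and the function names are illustrative.

```python
def jaccard(query_a, query_b):
    """Word-level Jaccard similarity between two query strings."""
    words_a, words_b = set(query_a.split()), set(query_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def distribution_similarity(host_dist, foreign_dist, phrase_sim=jaccard):
    """Sum phrase_sim(q_host, q_foreign) * min(freq) over all query pairs."""
    return sum(
        phrase_sim(q_host, q_foreign) * min(p_host, p_foreign)
        for q_host, p_host in host_dist.items()
        for q_foreign, p_foreign in foreign_dist.items()
    )


# Query distributions as given in the taxonomy matching example.
small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

score = distribution_similarity(small_laptops, eee)
print(round(score, 2))  # 0.74
```

Query pairs with no words in common get a Jaccard of 0 and drop out of the sum, so only overlapping queries contribute.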

Page 21: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example: Taxonomy Matching

“small laptops” vs “eee”
  laptop vs laptop
  netbook vs netbook
  laptop vs cheap laptop

1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74

Warehouse: Small Laptops (similarity to eee: 0.74)
  “laptop”: 25/45
  “netbook”: 20/45

Warehouse: Professional Laptops (similarity to eee: 0.31)
  “laptop”: 70/75
  “netbook”: 5/75

eee
  “laptop”: 5/25
  “netbook”: 15/25
  “cheap laptop”: 5/25
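Spelling out the three terms of the “small laptops” vs “eee” score may help. The derivation assumes that MinFreq takes the smaller of the two relative frequencies and that Jaccard is computed over query words, so pairs with no word in common contribute 0:

```latex
\begin{aligned}
\text{laptop vs laptop:}\quad & 1 \times \min(25/45,\ 5/25) = 5/25 = 0.200 \\
\text{netbook vs netbook:}\quad & 1 \times \min(20/45,\ 15/25) = 20/45 \approx 0.444 \\
\text{laptop vs cheap laptop:}\quad & 0.5 \times \min(25/45,\ 5/25) = 0.5 \times 5/25 = 0.100 \\
\text{total:}\quad & 0.200 + 0.444 + 0.100 \approx 0.74
\end{aligned}
```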

Page 22: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Advantages of Clicklogs

Resilient to language

Resilient to new domains, data, and features
  As long as people query & click, we have data to learn from

Generates mappings previous methods can’t
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

Page 23: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

System Design

Page 24: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Outline

Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results

Page 25: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Experimenting with Click Logs

Commercial warehouse mapping: 258 products from a 70,000 term Amazon.com taxonomy (613 in gold) to a 6,000 term warehouse taxonomy (40 in gold)

Live.com (now Bing.com) search querylog

Amazon to warehouse mapping task, consecutively halving the clicklog size used

1.8 million clicks to Amazon.com product pages

Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).

Page 26: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Summary of Results


90% precision / recall possible

Bigger clicklogs imply better recall

Technique isn't very sensitive to similarity metric

Page 27: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Precision / Recall

Commercial warehouse mapping: 258 products from a 70K term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)
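For reference, precision and recall of a set of proposed correspondences against a gold standard can be computed as below. This is a generic sketch using the standard definitions, not the evaluation code behind these results, and the category pairs are hypothetical.

```python
def precision_recall(predicted, gold):
    """predicted, gold: sets of (foreign_category, host_category) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and matcher output, for illustration only.
gold = {
    ("eee", "Small Laptops"),
    ("Calculators", "Office Machines"),
    ("Visual Basic", "Developer Tools"),
}
predicted = {("eee", "Small Laptops"), ("Calculators", "Cameras")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

A matcher that only emits correspondences above a similarity threshold trades recall for precision, which suits a setting where high precision matters more than recall.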

Page 28: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Summary of Results


90% precision / recall possible

Bigger clicklogs imply better recall

Technique isn't very sensitive to similarity metric

Page 29: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Varying Clicklog Size

Successively decreased clicklog size by half

Recall decreases as clicklog size is decreased
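One way to set up such an experiment is sketched below: repeatedly keep a random half of the click records and rerun the matcher on each subset. The slides do not say how the log was halved, so uniform random sampling and the fixed seed are assumptions, and `clicklog` here is placeholder data.

```python
import random


def halved_clicklogs(clicklog, rounds, seed=0):
    """Yield the full log, then successively smaller random halves."""
    rng = random.Random(seed)
    current = list(clicklog)
    yield current
    for _ in range(rounds):
        # Each subset is drawn from the previous one, so the logs are nested.
        current = rng.sample(current, len(current) // 2)
        yield current


# Placeholder (query, click count, url) rows, for illustration only.
clicklog = [("query %d" % i, 1, "http://example.com/p/%d" % i) for i in range(16)]
sizes = [len(subset) for subset in halved_clicklogs(clicklog, rounds=3)]
print(sizes)  # [16, 8, 4, 2]
```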

Page 30: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Summary of Results


90% precision / recall possible

Bigger clicklogs imply better recall

Technique isn't very sensitive to similarity metric

Page 31: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Comparing Query Distributions

$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$

Replace Jaccard with various phrase similarity metrics

Minimal difference due to size of most queries
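To illustrate what swapping the phrase similarity looks like, the sketch below scores one pair of query distributions with word-level Jaccard and with the Dice coefficient. Dice is chosen here only as an example; the slides do not list which alternative metrics were tried, and the function names are hypothetical.

```python
def words(query):
    """Split a query string into its set of words."""
    return set(query.split())


def jaccard(a, b):
    union = words(a) | words(b)
    return len(words(a) & words(b)) / len(union) if union else 0.0


def dice(a, b):
    total = len(words(a)) + len(words(b))
    return 2 * len(words(a) & words(b)) / total if total else 0.0


def similarity(host_dist, foreign_dist, phrase_sim):
    """Sum phrase_sim * min(freq) over all (host, foreign) query pairs."""
    return sum(
        phrase_sim(qh, qf) * min(ph, pf)
        for qh, ph in host_dist.items()
        for qf, pf in foreign_dist.items()
    )


# Distributions from the taxonomy matching example in this deck.
host = {"laptop": 25 / 45, "netbook": 20 / 45}
foreign = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

for name, metric in [("jaccard", jaccard), ("dice", dice)]:
    print(name, round(similarity(host, foreign, metric), 3))
```

With one- and two-word queries, the two metrics agree on identical and disjoint pairs and differ only on partially overlapping ones, which is consistent with the slide's remark about query size.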

Page 32: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Summary of Results


90% precision / recall possible

Query distribution is a good similarity metric

Bigger clicklogs imply better recall

Technique isn't very sensitive to similarity metric

Page 33: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Conclusion

Unsupervised mapping is possible
  Very high recall / precision when enough queries are present

Click logs are promising
  Finds results that other methods cannot find
  As clicklog size increases, it will produce more mappings

Combinable with existing methods

Page 34: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Questions?
