HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING

Arnab Nandi, UNIV OF MICHIGAN
Phil Bernstein, MICROSOFT RESEARCH


Scenario

Search over structured data: commerce, entertainment

Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.


[Figure: Users send a query to the Search engine + data warehouse and get results back; several 3rd Party Feeds (for example “Amazon.com”) flow into the warehouse.]

• High Precision
• High Recall
• Minimal Human Involvement


Example Feed

3rd Party Movie Site (Foreign)

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host)

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>


Schema Matching


From       To
Movie      MOVIE
Title      MOVIE_NAME
RunTime    RUNTIME
Category   GENRE*
MPAA       RATING
Person     ACTOR*
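To make the correspondences above concrete, here is a minimal sketch that renames a foreign record into host element names. It assumes the `*` in GENRE* and ACTOR* means numbered host elements (GENRE1, GENRE2, ...), and it copies every `Person` regardless of role; the function name `rename_to_host` and the two lookup dictionaries are hypothetical names introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# One-to-one correspondences from the table (foreign element -> host element).
SINGLE = {"Title": "MOVIE_NAME", "RunTime": "RUNTIME", "MPAA": "RATING"}
# Repeating foreign elements that fan out to numbered host elements
# (assumed reading of GENRE* and ACTOR*).
REPEATED = {"Category": "GENRE", "Person": "ACTOR"}


def rename_to_host(foreign: ET.Element) -> ET.Element:
    """Copy a foreign <Movie> record into host element names, values unchanged."""
    host = ET.Element("MOVIE")  # Movie -> MOVIE
    for src, dst in SINGLE.items():
        node = foreign.find(src)
        if node is not None:
            ET.SubElement(host, dst).text = node.text
    for src, prefix in REPEATED.items():
        # iter() also reaches nested elements such as Categories/Category.
        for i, node in enumerate(foreign.iter(src), start=1):
            ET.SubElement(host, f"{prefix}{i}").text = node.text
    return host


FEED = """
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <RunTime>127</RunTime>
  <Categories><Category>Action</Category><Category>Comedy</Category></Categories>
  <MPAA>PG-13</MPAA>
  <Persons><Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person></Persons>
</Movie>
"""

host = rename_to_host(ET.fromstring(FEED))
print(ET.tostring(host, encoding="unicode"))
```

The renamed record still carries foreign values (127 rather than 02:00, Action rather than Action/Adventure), which is why schema matching alone is not enough and the values themselves also need to be matched.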


Taxonomy Matching

From     To
Action   Action/Adventure
PG-13    NR
R        R


Various Problems

Badly normalized…
Unit conversion…
Formatting choices…
In-band signaling…
Arbitrary labels
Non-standard vocabulary / language
Zero documentation
Not enough instances


Unlike conventional matching…


We have web search click data

For both Warehouse & 3rd party website

The databases we are integrating (usually) have a presence on the web

Why not use click data as a feature for schema & taxonomy matching?

[Figure: Users send a query to the Search engine + data warehouse, which takes in the 3rd Party Feed, and get results back.]


Outline

Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results


Core idea

“If two (sets of) products are searched for by similar queries, then they are similar”

[Figure: the query “Small laptop” is issued to Web Search, and the Clicklog records which results were clicked. Labels: Warehouse (Small Laptops, Pro. Laptops); Asus.com (hardware, eee); nodes X, Y, Z; resulting match eee ::: small laptops.]


Query Distributions

[Figure: bar chart of click count (0 to 50) per query; queries shown: small laptop, netbook, hp mini 1000, hp mini.]


Mapping to Taxonomy

Map URL to product, which belongs to taxonomy

Example: http://www.amazon.com/dp/B001JTA59C maps to Shopping | Electronics | Netbooks, using the 3rd party DB (provided to us).


Aggregating Query Distributions

[Figure: per-page query distributions (bar charts of click count, 0 to 50) are aggregated into one query distribution per category. Labels: Warehouse (Small Laptops, Pro. Laptops); Asus.com (hardware, eee); eee ::: small laptops.]


Generating Correspondences

Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.

Process
For each page (URL)
  Identify query distribution
  Identify category / schema element of that page
For each category / schema element C
  Aggregate over pages in C to get query distribution
For each foreign category / schema element
  Find host category / schema element with most similar query distribution
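The three steps above can be sketched as follows. The rows reuse the click counts from this deck's taxonomy matching example. `shared_mass` is a deliberately simple placeholder similarity introduced only for this sketch, not the metric the slides define, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# (query, click count, url) rows from the deck's taxonomy matching example.
CLICK_LOG = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]

# Page -> category lookups. Host = warehouse, foreign = 3rd party site.
HOST_CATEGORY = {
    "http://searchengine.com/product/mininote": "Small Laptops",
    "http://searchengine.com/product/macbookpro": "Professional Laptops",
}
FOREIGN_CATEGORY = {"http://asus.com/eeepc": "eee"}


def page_distributions(click_log):
    """Step 1: click counts per query for every page (URL)."""
    pages = defaultdict(Counter)
    for query, clicks, url in click_log:
        pages[url][query] += clicks
    return pages


def category_distributions(pages, url_to_category):
    """Step 2: sum page counts per category, then normalise to frequencies."""
    totals = defaultdict(Counter)
    for url, counts in pages.items():
        if url in url_to_category:
            totals[url_to_category[url]].update(counts)
    return {
        cat: {q: n / sum(counts.values()) for q, n in counts.items()}
        for cat, counts in totals.items()
    }


def shared_mass(host_dist, foreign_dist):
    """Placeholder similarity: frequency mass on identical queries."""
    return sum(min(p, foreign_dist[q]) for q, p in host_dist.items() if q in foreign_dist)


def match(host, foreign, similarity):
    """Step 3: for each foreign category, pick the most similar host category."""
    return {
        f_cat: max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        for f_cat, f_dist in foreign.items()
    }


pages = page_distributions(CLICK_LOG)
host = category_distributions(pages, HOST_CATEGORY)
foreign = category_distributions(pages, FOREIGN_CATEGORY)
print(match(host, foreign, shared_mass))
```

Passing the similarity in as an argument keeps the aggregation steps independent of whichever distribution metric is plugged in.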


Example: Taxonomy Matching

query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc

Categories
  Warehouse: Small Laptops: http://searchengine.com/product/mininote
  Warehouse: Professional Laptops: http://searchengine.com/product/macbookpro
  eee: http://asus.com/eeepc


Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25


Distribution Similarity Metric

$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$


Example: Taxonomy Matching

“small laptops” vs “eee”
  laptop vs laptop: 1 × (5/25)
  netbook vs netbook: 1 × (20/45)
  laptop vs cheap laptop: 0.5 × (5/25)
  1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74

Similarity to eee
  Warehouse: Small Laptops: 0.74
  Warehouse: Professional Laptops: 0.31
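The 0.74 for “small laptops” vs “eee” can be checked with a short sketch. It assumes Jaccard is computed over the words of each query (which gives 0.5 for “laptop” vs “cheap laptop”) and reads MinFreq as the smaller of the two queries' relative frequencies; under that reading the sum comes out to the 0.74 on the slide. The names `jaccard` and `similarity` are chosen for this illustration.

```python
from fractions import Fraction

# Query distributions as shown on the slide.
SMALL_LAPTOPS = {"laptop": Fraction(25, 45), "netbook": Fraction(20, 45)}
EEE = {
    "laptop": Fraction(5, 25),
    "netbook": Fraction(15, 25),
    "cheap laptop": Fraction(5, 25),
}


def jaccard(a: str, b: str) -> Fraction:
    """Word-level Jaccard similarity between two non-empty queries."""
    wa, wb = set(a.split()), set(b.split())
    return Fraction(len(wa & wb), len(wa | wb))


def similarity(host, foreign):
    """Sum Jaccard x MinFreq over all (host query, foreign query) combinations.

    Returns the total and the non-zero terms that contributed to it.
    """
    terms = {}
    for q_host, f_host in host.items():
        for q_foreign, f_foreign in foreign.items():
            term = jaccard(q_host, q_foreign) * min(f_host, f_foreign)
            if term:
                terms[(q_host, q_foreign)] = term
    return sum(terms.values()), terms


score, terms = similarity(SMALL_LAPTOPS, EEE)
for (q_host, q_foreign), term in terms.items():
    print(f"{q_host} vs {q_foreign}: {float(term):.3f}")
print(f"total: {float(score):.2f}")
```

Exact fractions are used so each term can be compared directly with the hand calculation; pairs with no words in common have a Jaccard of 0 and drop out of the sum.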


Advantages of Clicklogs

Resilient to language
Resilient to new domains, data, and features
  As long as people query & click, we have data to learn from
Generates mappings previous methods can’t
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools


System Design


Experimenting with Click Logs

Commercial warehouse mapping, 258 products
  from a 70,000 term Amazon.com taxonomy (613 in gold)
  to a 6,000 term warehouse taxonomy (40 in gold)
Live.com (now Bing.com) search querylog
  Amazon to warehouse mapping task, consecutively halving the clicklog size used
  1.8 million clicks to Amazon.com product pages
  On average, each product's query distribution had 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).


Summary of Results


90% precision / recall possible

Query distribution is a good similarity metric

Bigger clicklogs imply better recall

Technique isn't very sensitive to similarity metric


Precision / Recall

Commercial warehouse mapping, 258 products
  from a 70K term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)

[Figure: Precision (y-axis, 0 to 1) against Recall (x-axis, 0 to 1) for four matchers: Instance-based, Query Distribution, Consensus, Name-based.]


Match Quality

Correct matches between query distributions (row vs column):

                      Amazon Products  Amazon Categories  Warehouse Products  Warehouse Categories
Amazon Products       257/258          241/258            189/258 (73%)       226/258
Amazon Categories                      373/613            204/400             525/613 (85%)
Warehouse Products                                        392/400             383/400
Warehouse Categories                                                          40/40

QDs are unique to entities
QDs are unique to aggregate classes
QDs of entities are closest to the distributions of their aggregate classes
QDs of similar aggregates are similar


Varying Clicklog Size

Successively decreased clicklog size by half

Recall decreases as clicklog size is decreased

[Figure: Precision (0.65 to 0.95) against Recall (0 to 0.7) for Items and Categories, with points for clicklog sizes from 1/32 through ¼ and ½ to the Full Log.]


Comparing Query Distributions

$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$

Replace Jaccard with various phrase similarity metrics

Minimal difference due to size of most queries
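One way to see why the choice of phrase metric matters little: when queries are only a word or two long, word-overlap metrics mostly agree. The sketch below compares word-level Jaccard with the Dice coefficient. Dice is an assumed stand-in here, since the slides do not name the alternative metrics that were tried, and reading "size of most queries" as "most queries are short" is likewise an interpretation.

```python
def words(query: str) -> set:
    """Lower-cased set of words in a query."""
    return set(query.lower().split())


def jaccard(a: str, b: str) -> float:
    """|A & B| / |A | B| over the words of two non-empty queries."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)


def dice(a: str, b: str) -> float:
    """2 |A & B| / (|A| + |B|) over the words of two non-empty queries."""
    wa, wb = words(a), words(b)
    return 2 * len(wa & wb) / (len(wa) + len(wb))


PAIRS = [
    ("laptop", "laptop"),        # identical single-word queries
    ("laptop", "netbook"),       # disjoint single-word queries
    ("laptop", "cheap laptop"),  # partial overlap
]

for a, b in PAIRS:
    print(f"{a!r} vs {b!r}: jaccard={jaccard(a, b):.2f} dice={dice(a, b):.2f}")
```

For identical or disjoint queries both metrics return exactly 1 or 0, so they can only disagree on partially overlapping pairs.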


Related + Future Work

Usage Based / Crowdsourcing
  Usage-Based Schema Matching (ICDE 2008). Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A.
  Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan

Web Scale Integration
  Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy

“Mixed” methods
  Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm


Conclusion

Unsupervised mapping is possible
  Very high recall / precision when enough queries are present
Click logs are promising
  Finds results that other methods cannot find
  As clicklog size increases, it will produce more mappings
Combinable with existing methods


Arnab Nandi & Phil Bernstein

http://arnab.org/contact

http://research.microsoft.com/~philbe/

Questions?