Growing Parallel Paths for Entity-Page Retrieval

Data and Information Systems Laboratory, University of Illinois Urbana-Champaign. Advanced Data Mining, May 4, 2010. Tim Weninger, Cindy Xide Lin, and Jiawei Han, Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL. Work submitted to VLDB'10.

Page 1: Growing Parallel Paths for  Entity-Page Retrieval

Data and Information Systems Laboratory, University of Illinois Urbana-Champaign

Advanced Data Mining, May 4, 2010

Growing Parallel Paths for Entity-Page Retrieval

Tim Weninger, Cindy Xide Lin, and Jiawei Han

Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL

Work Submitted to VLDB'10

Page 2: Growing Parallel Paths for  Entity-Page Retrieval

Problem: Entity Page Retrieval

Given: Reference page

Page 3: Growing Parallel Paths for  Entity-Page Retrieval

Problem: Entity Page Retrieval

…Can we find entity pages of the same type?

Page 5: Growing Parallel Paths for  Entity-Page Retrieval

Definitions:

Defn 1: Root-to-link path:

◊ - hrefX contains HTML-TABLE-TR1-TD-hrefX

Defn 2: Parallel Links:

Share a root-to-link path, i.e., lists of links

Defn 3: Intra-page parallel paths:

◊ - hrefC ǁ ◊ - hrefB

◊ - hrefC ǁ ◊ - hrefX
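
Definitions 1 to 3 can be illustrated with a short Python sketch. It is only an assumption about how root-to-link paths might be computed, not the authors' implementation: the class `LinkPathParser` and the function `parallel_links` are hypothetical names, the path is taken to be the sequence of open tag names above each link, and sibling indices such as the `1` in `TR1` are ignored.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathParser(HTMLParser):
    """Collects (root-to-link path, href) pairs for every <a href=...>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.links = []   # list of (path tuple, href)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            # Assumption: the path is the chain of ancestors of the link.
            self.links.append((tuple(self.stack), href))
        self.stack.append(tag.upper())

    def handle_endtag(self, tag):
        tag = tag.upper()
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed children.
            while self.stack.pop() != tag:
                pass


def parallel_links(html):
    """Group hrefs that share a root-to-link path (Defn 2)."""
    parser = LinkPathParser()
    parser.feed(html)
    parser.close()
    groups = defaultdict(list)
    for path, href in parser.links:
        groups["-".join(path)].append(href)
    return dict(groups)


page = """<html><body>
<table>
  <tr><td><a href="b.html">B</a></td></tr>
  <tr><td><a href="c.html">C</a></td></tr>
</table>
<p><a href="x.html">X</a></p>
</body></html>"""

groups = parallel_links(page)
for path, hrefs in groups.items():
    print(path, hrefs)
```

In this toy page the two table links share one path and so form a list of parallel links, while the link in the paragraph has a different path and stands alone.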

Page 6: Growing Parallel Paths for  Entity-Page Retrieval

Definitions:

Defn 4: Inter-page parallel paths:

◊ - hrefC in Page A ǁ ◊ - hrefW in Page B

Defn 5: Parallel Web site paths:

Share intra- or inter-page parallel paths across multiple pages

Page 7: Growing Parallel Paths for  Entity-Page Retrieval

Properties of Parallel Paths

Prop. 1: Equal Path Length Property:

Parallel paths must contain the same number of pages.

Prop. 2: Parallel Page Property:

The test of two paths being in parallel is equivalent to the result of tests of respective pages.

Prop. 3: Equal Page Length Property:

Parallel paths must have the same number of nodes across pages.

Page 8: Growing Parallel Paths for  Entity-Page Retrieval

Properties of Parallel Paths

Prop. 4: Divergent Path Property:

Parallel Paths can extend through separate pages

Prop. 5: Early Termination Property:

The test of two paths can be terminated at the first occurrence of a dissimilar node
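
A minimal sketch of how Properties 1, 2, 3, and 5 could combine into a single parallel-path test. The representation is an assumption made for illustration: a Web site path is a list with one entry per page, each entry being the root-to-link path (a tuple of tag names) of the link followed on that page, and two nodes count as similar only when their tag names are equal. The function names are hypothetical.

```python
from typing import Sequence

PagePath = Sequence[str]       # root-to-link path inside one page
SitePath = Sequence[PagePath]  # one PagePath per page along the path


def pages_parallel(a: PagePath, b: PagePath) -> bool:
    """Test the paths of two respective pages."""
    if len(a) != len(b):       # Prop. 3: same number of nodes
        return False
    for node_a, node_b in zip(a, b):
        if node_a != node_b:   # Prop. 5: stop at the first dissimilar node
            return False
    return True


def paths_parallel(p: SitePath, q: SitePath) -> bool:
    """Test two Web site paths page by page."""
    if len(p) != len(q):       # Prop. 1: same number of pages
        return False
    # Prop. 2: the path test reduces to tests of respective pages.
    # all() short-circuits, so later pages are skipped after a mismatch.
    return all(pages_parallel(a, b) for a, b in zip(p, q))


example = [("UL", "LI"), ("TABLE", "TR", "TD")]
same = [("UL", "LI"), ("TABLE", "TR", "TD")]
different_node = [("UL", "LI"), ("TABLE", "TR", "TH")]
shorter = [("UL", "LI")]

print(paths_parallel(example, same))
print(paths_parallel(example, different_node))
print(paths_parallel(example, shorter))
```

Exact tag equality is the simplest possible similarity test; a real system might relax it, which the slides do not specify.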

Page 9: Growing Parallel Paths for  Entity-Page Retrieval

Finding Paths

Naive Method: Can be very costly

Growing Parallel Paths: First find an example path, then grow paths which are in parallel to the example

Repeat with alternate paths: This makes magic happen
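
The two steps of growing parallel paths (find an example path, then grow paths parallel to it) can be sketched as follows. This is an illustration under stated assumptions rather than the method from the paper: the site is a hypothetical in-memory dictionary mapping each page to its outgoing links, every link carries its root-to-link path as a string, the example path is found by breadth-first search, and the page names are invented.

```python
from collections import deque

# Hypothetical toy site: page -> list of (root-to-link path, target page).
site = {
    "home": [("UL-LI", "people"), ("UL-LI", "research")],
    "people": [("TABLE-TR-TD", "prof_a"),
               ("TABLE-TR-TD", "prof_b"),
               ("DIV-P", "contact")],
    "research": [("DL-DD", "group_x")],
}


def find_example_path(site, start, example):
    """Step 1: breadth-first search from the reference page to the example.

    Returns the root-to-link paths of the links followed, or None.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        page, steps = queue.popleft()
        if page == example:
            return steps
        for link_path, target in site.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, steps + [link_path]))
    return None


def grow_parallel(site, start, steps):
    """Step 2: follow every link whose path matches the example, level by level."""
    frontier = {start}
    for step in steps:
        frontier = {target
                    for page in frontier
                    for link_path, target in site.get(page, [])
                    if link_path == step}
    return frontier


steps = find_example_path(site, "home", "prof_a")
found = grow_parallel(site, "home", steps)
print(steps)
print(sorted(found))
```

Growing only along links that match the example path at each level avoids enumerating all paths in the site, which is where the naive method becomes costly.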

Page 10: Growing Parallel Paths for  Entity-Page Retrieval

Repeating with alternate paths

k-shortest paths: Do a k-shortest path search. Explore all of these paths

Removing links: After exploring a path, remove its edges from the graph
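
The link-removal variant can be sketched in a few lines of Python. The graph, the page names, and the `limit` parameter are assumptions made for illustration; the slides do not say how the shortest path is computed, so plain breadth-first search on an unweighted link graph is used here.

```python
from collections import deque


def shortest_path(graph, src, dst):
    """Breadth-first shortest path on an unweighted graph, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def alternate_paths(graph, src, dst, limit=10):
    """Repeatedly take the shortest path, then remove its edges."""
    graph = {node: set(targets) for node, targets in graph.items()}  # work on a copy
    found = []
    while len(found) < limit:
        path = shortest_path(graph, src, dst)
        if path is None:
            break
        found.append(path)
        for u, v in zip(path, path[1:]):
            graph[u].discard(v)  # this edge cannot be reused by later paths
    return found


# Hypothetical link graph with two routes from the home page to one entity page.
links = {
    "home": ["people", "research"],
    "people": ["prof_a"],
    "research": ["group_x"],
    "group_x": ["prof_a"],
}

paths = alternate_paths(links, "home", "prof_a")
for p in paths:
    print(" > ".join(p))
```

Unlike a true k-shortest path search, removing edges yields only edge-disjoint paths, so it may miss alternates that share a link with an earlier path; in exchange each round is a single breadth-first search.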

Page 11: Growing Parallel Paths for  Entity-Page Retrieval

Interpreting the Output

Side Effect of Repeating with Alternate Paths

Given: Jiawei Han

Result:

Jiawei Han 40

Cheng Zhai 38

Kevin Chang 38

Dan Roth 32

Vikram Adve 4

Roy Campbell 3
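
The slide does not state what the numbers next to each name measure. One plausible reading, assumed here and not confirmed by the source, is that each number counts how many alternate paths led to that page. Under that assumption the tally could be computed as below; the function name and the page names are hypothetical, and the data is invented purely to show the mechanics.

```python
from collections import Counter


def count_hits(results_per_path):
    """Count, for each page, how many alternate paths retrieved it."""
    counts = Counter()
    for pages in results_per_path:
        counts.update(set(pages))  # a page counts at most once per path
    return counts


# Invented example: pages retrieved by growing along three alternate paths.
results_per_path = [
    ["page_a", "page_b", "page_c"],
    ["page_a", "page_b"],
    ["page_a", "page_d"],
]

hits = count_hits(results_per_path)
for page, n in hits.most_common():
    print(page, n)
```

Read this way, a high count suggests a page of the same type as the given example, while a low count marks a page reached only by a few paths.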

Page 12: Growing Parallel Paths for  Entity-Page Retrieval

Interpreting the Output

Side Effect of Path Finding: What do the link labels on the path tell us about the entity?

First path: People, Faculty, Jiawei Han, Personal Site

Second path: Research, Data Mining

Page 13: Growing Parallel Paths for  Entity-Page Retrieval

Experiments

Top 25 CS Departments in US (according to US News): Find all professors

United States Congress: Find all senators, representatives, and committees

UIUC only: Find all courses; find all research groups

Baseline: Google’s find similar search (essentially TF-IDF-type ranking)

Page 14: Growing Parallel Paths for  Entity-Page Retrieval

Results

Page 17: Growing Parallel Paths for  Entity-Page Retrieval

Conclusions and Future Work

Given a reference page and an example entity type, we can retrieve all entity pages of the same type

Implications: We can use this for information integration. Search and retrieval can be enhanced

Shortcomings: Most errors are due to incorrect list finding

Page 18: Growing Parallel Paths for  Entity-Page Retrieval

Questions?