EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan...

28
Entities Directly and Entities Directly and Holistically Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam PhD Student CSE Department, UTA

Transcript of EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan...

Page 1: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

EntityRank: Searching EntityRank: Searching

Entities Directly and Entities Directly and

HolisticallyHolistically

- Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang

CS Department, UIUC

Presented By: Md. Abdus SalamPhD StudentCSE Department, UTA

Page 2: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Motivating ScenarioMotivating Scenario

Customer service phone number of Amazon?

Page 3: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Search on Amazon?

Page 4: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Search on Google?

Page 5: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Many many Similar CasesMany many Similar CasesThe email of Luis Gravano?What profs are doing databases at UIUC?The papers and presentations of ICDE

2007?Due date of SIGMOD 2008?Sale price of “Canon PowerShot A400”?“Hamlet” books available at bookstores?

Often times, we are looking for data entities, e.g. emails, dates, prices, etc, not pages.

Page 6: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

What you search is not what you want.

Page 7: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

From pages to entitiesFrom pages to entitiesTraditional Search Entity Search

Keywords Entities

ResultsResults Support

Page 8: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Concretely, what is meant by

Entity Search?

Page 9: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

9

Entity Search Problem:Entity Search Problem: Given:

Input: Keywords & Entities (optionally with a pattern)

E.g. Amazon Customer Service #phone

Output: Ranked Entity Tuples

……

0.60

0.80

0.90

Page 10: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

10

How to rank Entities?

Challenge: Challenge:

Page 11: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Characteristics I: ContextualCharacteristics I: Contextual -Utilize Entities’ Surrounding -Utilize Entities’ Surrounding ContextContext

Content

Context

Page 12: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Characteristics II: UncertainCharacteristics II: Uncertain -Extractions are -Extractions are

non”prefect”non”prefect”

Page 13: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Characteristics III: HolisticCharacteristics III: Holistic -Many evidences from multiple -Many evidences from multiple sourcessources

Page 14: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Characteristics IV: Characteristics IV: DiscriminativeDiscriminative - Web Pages are of Varying - Web Pages are of Varying QualityQuality

Page 15: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Characteristics V: AssociativeCharacteristics V: Associative -Tell True Associations from -Tell True Associations from AccidentalAccidental

Example: Finding Prof. Luis Gravano’s Email

Observation: [email protected] appears very frequently with keywords “Luis”, “Gravano”

However, such association is only accidental as [email protected] appears on many pages.

Page 16: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

EntityRankEntityRank: The Impression : The Impression ModelModel

Tireless Observer ... ... ...

?? ??

Access Layer: Global Aggregation

Recognition Layer: Local Assessment

Validation Layer: Hypothesis Testing

……

0.60

0.80

0.90

Page 17: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

17

Recognition Layer: Local Recognition Layer: Local AssessmentAssessment

Contextual

Uncertain

Holistic

Discriminative

Associative

Input: L1

L2:d

Output: )|)7575201)800((( dqp

)|)7400376408(( dqp )|)2006.02.9(( dqp

Page 18: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

18

Access Layer: Global Access Layer: Global AggregationAggregation

Contextual

Uncertain

Holistic

Discriminative

Associative

Holistic Discriminative

d

o dpqpp )()d|7575)-201-800((

Output:

Input:

1d

)|)7575201800(( 1dqp

2d

)|)7575201800(( 2dqp

3d

)|)7575201800(( 3dqp

Page 19: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

19

Validation Layer: Hypothesis Validation Layer: Hypothesis TestingTesting

rp

Contextual

Uncertain

Holistic

Discriminative

Associative

op

Input:

Collection E over D

Output:

)1

1log)1(log(2))((

r

oo

r

oo

p

pp

p

pptqScore

Virtual Collection E’ over D’

randomize

Page 20: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

EntityRankEntityRank: The Scoring : The Scoring FunctionFunction

Local RecognitionGlobal AggregationValidation

Page 21: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

21

Sort-merge Join

Query ProcessingQuery Processing

7, 33

d9

3d7

10d6

5d3

8, 25

d1

Doc Posting Doc

8, 24

d7

66d5

11d3

Posting

44d8

9d7

12d3

Doc Posting

Amazon Customer Service

(13,800-202-7575,1.0)(78,800-322-9266,1.0)

d7

(18,800-202-7575,1.0)

d3

(42,851-0400,0.8)d2

Doc Posting

#phone

Aggregation

800-202-7575: p1800-322-9266: p3800-202-7575: p2

800-322-9266: p5800-202-7575: p4

Hypothesis Test Result

Page 22: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

22

Experiment SetupExperiment SetupCorpus: General crawl of the Web(Aug,

2006), around 2TB with 93M pages.

Entities: Phone (8.8M distinctive instances) Email (4.6M distinctive instances)

System: A cluster of 34 machines

Page 23: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

23

Comparing EntityRank to the Comparing EntityRank to the Following Different ApproachesFollowing Different Approaches

Contextual

Uncertain Holistic Discriminative

Associative

Naïve

Local

Global

Combine

Without

EntityRank

Page 24: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Online Demo.

Page 25: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

25

Example Query ResultsExample Query Results

Page 26: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

26

ConclusionsConclusionsFormulate the entity search problem

Study and define the characteristics of entity search

Conceptual Impression Model and concrete EntityRank framework for ranking entities

An online prototype with real Web corpus

Page 27: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

Thank You !Thank You !

Questions?

Page 28: EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang CS Department, UIUC Presented By: Md. Abdus Salam.

ReferenceReferenceEntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. In Proceedings of the 33rd Very Large Data Bases Conference (VLDB 2007), pages 387-398, Vienna, Austria, September 2007

http://www-forward.cs.uiuc.edu/talks/2007/entityrank-vldb07-cyc-sep07.ppt