EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan...

Post on 18-Dec-2015

217 views 2 download

Tags:

Transcript of EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan...

EntityRank: Searching EntityRank: Searching

Entities Directly and Entities Directly and

HolisticallyHolistically

- Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang

CS Department, UIUC

Presented By: Md. Abdus SalamPhD StudentCSE Department, UTA

Motivating ScenarioMotivating Scenario

Customer service phone number of Amazon?

Search on Amazon?

Search on Google?

Many many Similar CasesMany many Similar CasesThe email of Luis Gravano?What profs are doing databases at UIUC?The papers and presentations of ICDE

2007?Due date of SIGMOD 2008?Sale price of “Canon PowerShot A400”?“Hamlet” books available at bookstores?

Often times, we are looking for data entities, e.g. emails, dates, prices, etc, not pages.

What you search is not what you want.

From pages to entitiesFrom pages to entitiesTraditional Search Entity Search

Keywords Entities

ResultsResults Support

Concretely, what is meant by

Entity Search?

9

Entity Search Problem:Entity Search Problem: Given:

Input: Keywords & Entities (optionally with a pattern)

E.g. Amazon Customer Service #phone

Output: Ranked Entity Tuples

……

0.60

0.80

0.90

10

How to rank Entities?

Challenge: Challenge:

Characteristics I: ContextualCharacteristics I: Contextual -Utilize Entities’ Surrounding -Utilize Entities’ Surrounding ContextContext

Content

Context

Characteristics II: UncertainCharacteristics II: Uncertain -Extractions are -Extractions are

non”prefect”non”prefect”

Characteristics III: HolisticCharacteristics III: Holistic -Many evidences from multiple -Many evidences from multiple sourcessources

Characteristics IV: Characteristics IV: DiscriminativeDiscriminative - Web Pages are of Varying - Web Pages are of Varying QualityQuality

Characteristics V: AssociativeCharacteristics V: Associative -Tell True Associations from -Tell True Associations from AccidentalAccidental

Example: Finding Prof. Luis Gravano’s Email

Observation: info@acm.org appears very frequently with keywords “Luis”, “Gravano”

However, such association is only accidental as info@acm.org appears on many pages.

EntityRankEntityRank: The Impression : The Impression ModelModel

Tireless Observer ... ... ...

?? ??

Access Layer: Global Aggregation

Recognition Layer: Local Assessment

Validation Layer: Hypothesis Testing

……

0.60

0.80

0.90

17

Recognition Layer: Local Recognition Layer: Local AssessmentAssessment

Contextual

Uncertain

Holistic

Discriminative

Associative

Input: L1

L2:d

Output: )|)7575201)800((( dqp

)|)7400376408(( dqp )|)2006.02.9(( dqp

18

Access Layer: Global Access Layer: Global AggregationAggregation

Contextual

Uncertain

Holistic

Discriminative

Associative

Holistic Discriminative

d

o dpqpp )()d|7575)-201-800((

Output:

Input:

1d

)|)7575201800(( 1dqp

2d

)|)7575201800(( 2dqp

3d

)|)7575201800(( 3dqp

19

Validation Layer: Hypothesis Validation Layer: Hypothesis TestingTesting

rp

Contextual

Uncertain

Holistic

Discriminative

Associative

op

Input:

Collection E over D

Output:

)1

1log)1(log(2))((

r

oo

r

oo

p

pp

p

pptqScore

Virtual Collection E’ over D’

randomize

EntityRankEntityRank: The Scoring : The Scoring FunctionFunction

Local RecognitionGlobal AggregationValidation

21

Sort-merge Join

Query ProcessingQuery Processing

7, 33

d9

3d7

10d6

5d3

8, 25

d1

Doc Posting Doc

8, 24

d7

66d5

11d3

Posting

44d8

9d7

12d3

Doc Posting

Amazon Customer Service

(13,800-202-7575,1.0)(78,800-322-9266,1.0)

d7

(18,800-202-7575,1.0)

d3

(42,851-0400,0.8)d2

Doc Posting

#phone

Aggregation

800-202-7575: p1800-322-9266: p3800-202-7575: p2

800-322-9266: p5800-202-7575: p4

Hypothesis Test Result

22

Experiment SetupExperiment SetupCorpus: General crawl of the Web(Aug,

2006), around 2TB with 93M pages.

Entities: Phone (8.8M distinctive instances) Email (4.6M distinctive instances)

System: A cluster of 34 machines

23

Comparing EntityRank to the Comparing EntityRank to the Following Different ApproachesFollowing Different Approaches

Contextual

Uncertain Holistic Discriminative

Associative

Naïve

Local

Global

Combine

Without

EntityRank

Online Demo.

25

Example Query ResultsExample Query Results

26

ConclusionsConclusionsFormulate the entity search problem

Study and define the characteristics of entity search

Conceptual Impression Model and concrete EntityRank framework for ranking entities

An online prototype with real Web corpus

Thank You !Thank You !

Questions?

ReferenceReferenceEntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. In Proceedings of the 33rd Very Large Data Bases Conference (VLDB 2007), pages 387-398, Vienna, Austria, September 2007

http://www-forward.cs.uiuc.edu/talks/2007/entityrank-vldb07-cyc-sep07.ppt