EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan...
-
Upload
pauline-andrews -
Category
Documents
-
view
217 -
download
2
Transcript of EntityRank: Searching Entities Directly and Holistically - Tao Cheng, Xifeng Yan, Kevin Chen-Chuan...
EntityRank: Searching EntityRank: Searching
Entities Directly and Entities Directly and
HolisticallyHolistically
- Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang
CS Department, UIUC
Presented By: Md. Abdus SalamPhD StudentCSE Department, UTA
Motivating ScenarioMotivating Scenario
Customer service phone number of Amazon?
Search on Amazon?
Search on Google?
Many many Similar CasesMany many Similar CasesThe email of Luis Gravano?What profs are doing databases at UIUC?The papers and presentations of ICDE
2007?Due date of SIGMOD 2008?Sale price of “Canon PowerShot A400”?“Hamlet” books available at bookstores?
Often times, we are looking for data entities, e.g. emails, dates, prices, etc, not pages.
What you search is not what you want.
From pages to entitiesFrom pages to entitiesTraditional Search Entity Search
Keywords Entities
ResultsResults Support
Concretely, what is meant by
Entity Search?
9
Entity Search Problem:Entity Search Problem: Given:
Input: Keywords & Entities (optionally with a pattern)
E.g. Amazon Customer Service #phone
Output: Ranked Entity Tuples
……
0.60
0.80
0.90
10
How to rank Entities?
Challenge: Challenge:
Characteristics I: ContextualCharacteristics I: Contextual -Utilize Entities’ Surrounding -Utilize Entities’ Surrounding ContextContext
Content
Context
Characteristics II: UncertainCharacteristics II: Uncertain -Extractions are -Extractions are
non”prefect”non”prefect”
Characteristics III: HolisticCharacteristics III: Holistic -Many evidences from multiple -Many evidences from multiple sourcessources
Characteristics IV: Characteristics IV: DiscriminativeDiscriminative - Web Pages are of Varying - Web Pages are of Varying QualityQuality
Characteristics V: AssociativeCharacteristics V: Associative -Tell True Associations from -Tell True Associations from AccidentalAccidental
Example: Finding Prof. Luis Gravano’s Email
Observation: [email protected] appears very frequently with keywords “Luis”, “Gravano”
However, such association is only accidental as [email protected] appears on many pages.
EntityRankEntityRank: The Impression : The Impression ModelModel
Tireless Observer ... ... ...
?? ??
Access Layer: Global Aggregation
Recognition Layer: Local Assessment
Validation Layer: Hypothesis Testing
……
0.60
0.80
0.90
17
Recognition Layer: Local Recognition Layer: Local AssessmentAssessment
Contextual
Uncertain
Holistic
Discriminative
Associative
Input: L1
L2:d
Output: )|)7575201)800((( dqp
)|)7400376408(( dqp )|)2006.02.9(( dqp
18
Access Layer: Global Access Layer: Global AggregationAggregation
Contextual
Uncertain
Holistic
Discriminative
Associative
Holistic Discriminative
d
o dpqpp )()d|7575)-201-800((
Output:
Input:
1d
)|)7575201800(( 1dqp
2d
)|)7575201800(( 2dqp
3d
)|)7575201800(( 3dqp
19
Validation Layer: Hypothesis Validation Layer: Hypothesis TestingTesting
rp
Contextual
Uncertain
Holistic
Discriminative
Associative
op
Input:
Collection E over D
Output:
)1
1log)1(log(2))((
r
oo
r
oo
p
pp
p
pptqScore
Virtual Collection E’ over D’
randomize
EntityRankEntityRank: The Scoring : The Scoring FunctionFunction
Local RecognitionGlobal AggregationValidation
21
Sort-merge Join
Query ProcessingQuery Processing
7, 33
d9
3d7
10d6
5d3
8, 25
d1
Doc Posting Doc
8, 24
d7
66d5
11d3
Posting
44d8
9d7
12d3
Doc Posting
Amazon Customer Service
(13,800-202-7575,1.0)(78,800-322-9266,1.0)
d7
(18,800-202-7575,1.0)
d3
(42,851-0400,0.8)d2
Doc Posting
#phone
Aggregation
800-202-7575: p1800-322-9266: p3800-202-7575: p2
800-322-9266: p5800-202-7575: p4
Hypothesis Test Result
22
Experiment SetupExperiment SetupCorpus: General crawl of the Web(Aug,
2006), around 2TB with 93M pages.
Entities: Phone (8.8M distinctive instances) Email (4.6M distinctive instances)
System: A cluster of 34 machines
23
Comparing EntityRank to the Comparing EntityRank to the Following Different ApproachesFollowing Different Approaches
Contextual
Uncertain Holistic Discriminative
Associative
Naïve
Local
Global
Combine
Without
EntityRank
Online Demo.
25
Example Query ResultsExample Query Results
26
ConclusionsConclusionsFormulate the entity search problem
Study and define the characteristics of entity search
Conceptual Impression Model and concrete EntityRank framework for ranking entities
An online prototype with real Web corpus
Thank You !Thank You !
Questions?
ReferenceReferenceEntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. In Proceedings of the 33rd Very Large Data Bases Conference (VLDB 2007), pages 387-398, Vienna, Austria, September 2007
http://www-forward.cs.uiuc.edu/talks/2007/entityrank-vldb07-cyc-sep07.ppt