ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems
Hiroshi Nakagawa: Introduction, Feature Extraction (Phrase Extraction)
Minoru Yoshida: Feature Extraction (Information Extraction Approach), End
(University of Tokyo)


Transcript of ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Page 1: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

ACML 2010 Tutorial
Web People Search: Person Name Disambiguation and Other Problems

Hiroshi Nakagawa: Introduction, Feature Extraction (Phrase Extraction)
Minoru Yoshida: Feature Extraction (Information Extraction Approach), End

(University of Tokyo)

Page 2: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Contents

1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues

Page 3: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Contents

1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues

Page 4: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Introduction

1. Motivation
2. Problem Settings
3. Differences from other problems
4. History

Page 5: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Motivation

• Web search for person names: over 10% of all queries
• “Same-name” problem in person name search
  – When different real-world entities have the same name, the reference from the name to the entity can be ambiguous.
  – Many different persons have the same name (e.g., John Smith)
  – Persons have the same name as a famous one (e.g., Bill Gates)
• As a result, it is difficult to reach the target person.

A study of the query logs of the AllTheWeb and Altavista search sites gives an idea of the relevance of the people search task: 11-17% of the queries were composed of a person name with additional terms, and 4% were identified simply as person names (Artiles+, 2009 WePS2).

With ordinary search engines, it is hard to find a Bill Gates who is not the Microsoft founder: the famous namesake dominates the results.

Page 6: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Problem in People Search

Query

Search engine

Results

Which pages for what persons?

Page 7: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Person Name Clustering

Query

Search engine

Search result

Clusters of Web pages

Each page in a cluster refers to the same entity.

Page 8: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Mathematical Informatics Seminar (2008/04/18)

Sample System
query = Ichiro Suzuki: famous Japanese baseball player

Keywords about the person

Documents about the same person

Page 9: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems


Output Example ( Ichiro Suzuki )

Painter

Lawyer

Dentist

Used as an example name because Ichiro is so famous

Page 10: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Introduction

1. Motivation
2. Problem Settings
3. Differences from other problems
4. History

Page 11: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Problem Setting

• Given: a set of Web pages returned from a search engine when entering person name queries
• Goal: to cluster Web pages
  – One cluster for one entity
  – Possibly with related information (e.g., biography and/or related words)
• Another usage: if a person has many aspects, such as scientist and poet, the pages for each aspect are grouped together, which makes it easy to grasp who he/she is.

Page 12: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Example: Sakai Shuichi

Sakai Shuichi is a professor at the University of Tokyo in the field of computer architecture. These pages are about his books on computer architecture.

He is a Japanese poet too. These pages are about his collection of poems.

Page 13: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Example: famous car maker “TOYOTA”

These pages are about TOYOTA’s retailer network.

These pages are about TOYOTA HOME, a house maker and one of the TOYOTA group’s enterprises.

Page 14: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Introduction

1. Motivation
2. Problem Settings
3. Differences from other problems
4. History

Page 15: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Difference from Other Tasks

• Cluster documents for the same person
• Difficult to use training data for other person names

Method             | WSD, Categorization | Person Name Clustering | Document Clustering
Goal               | Categorize | Cluster documents about the same entity (= person) | Cluster similar documents
Answers            | Definite (y/n) | Definite (y/n) | Not definite
Number of clusters | # of categories | # of entities (unknown, but an exact number exists in the real world) | Task dependent
Training data      | Yes | Difficult to use | No
Learning           | Supervised | Unsupervised | Unsupervised

Page 16: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WSD: Word Sense Disambiguation

bank

I was strolling the bank.
Do you use a bank card there?
Did you go to the bank?

?

Page 17: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Challenges

• Noisy Web data
  – Light linguistic tools
    • POS taggers, stemmers, NE taggers
    • Pattern-based information extraction
  – (1) Heavy and sophisticated NLP tools such as an HPSG parser are not suitable for the purpose.
  – (2) The system should work at a tolerable speed, so lightweight tools are needed.
• How to use “training data”
  – Most systems use an unsupervised clustering approach
  – Some systems assume “background knowledge”
• How to determine K (number of clusters)
  – Remember that in real use this K does not depend on the user’s intention but is exact and fixed. This is different from usual clustering!

Page 18: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Introduction

1. Motivation
2. Problem Settings
3. Differences from other problems
4. History

Page 19: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

History

1998: Cross-document coreference resolution [Bagga+, 98] – naive VSM
2003: Disambiguation for Web search results [Mann+, 03] – biographic data
2007: Web People Search Workshop (WePS) [Artiles+, 07][Artiles+, 09]

Related fields: (Word Sense Disambiguation), (Coreference Resolution)

Page 20: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

History

• Web People Search Workshop
  – 1st, SemEval-2007
  – 2nd, WWW-2009
    • Document Clustering
    • Attribute Extraction
  – 3rd, CLEF-2010 (Conference on Multilingual and Multimodal Information Access Evaluation), 20-23 September 2010, Padua
    • Document Clustering & Attribute Extraction
    • Organization Name Disambiguation

Page 21: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WePS2 Data Source: 30 names

Page 22: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WePS2 Data 1 (Artiles+, 09)

Page 23: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WePS2 Data 2

Page 24: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WePS2 Data 3

Page 25: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WePS2 summary report

Page 26: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Contents

1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues

Page 27: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Main Steps

1. Preprocessing
2. Feature extraction
3. Feature weighting / Similarity calculation
4. Clustering
5. (Related Information Extraction)

Page 28: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

PREPROCESSING

Page 29: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Preprocessing

• Filter out useless pages (“junk pages”)
  – The name is matched, but the matched string doesn’t refer to a person (e.g., company name)
  – In addition, pages that are alphabetically ordered name lists (Ono+, 08)
• Data cleaning
  – HTML tag removal
  – Sentence (snippet) extraction
  – Coreference resolution (used by Bagga+); in fact, a very difficult NLP task

Page 30: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Junk Page Filtering

• SVM-based classification (Wan+, 05)
  – Features
    • Simple lexical features: words related or not related to the person name
    • Stylistic features (fonts / tags): i.e., how many and which words are in bold font
    • Query-relevant features (next-to-query words)
    • Linguistic features (NE counts): e.g., how many person, organization, and location names appear
    • …

Page 31: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

FEATURE EXTRACTION

Page 32: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Feature Extraction

• How to characterize each name appearance
  – The name itself cannot be used for disambiguation!
• Each name appearance can be characterized by its contexts.
• Possible contexts
  – Surrounding words, adjacent strings, syntactically related words, etc.
  – Which to use?

Page 33: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Basic Approach

• Use all words in documents
  – Or snippets (texts around the name)
  – Or titles/summaries (first sentence, etc.)

• Use TFIDF weighting scheme
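
The basic approach above can be sketched as follows. This is a minimal illustration, assuming already tokenized pages and one common TFIDF variant (raw term frequency times log(N/df)); the slides do not specify which variant is used, and the toy documents and function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: tf * idf} dict per tokenized document (assumed variant)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy pages for the query "Bill Gates" (made up for illustration).
docs = [
    "bill gates microsoft founder software".split(),
    "bill gates microsoft chairman windows".split(),
    "bill gates painter gallery exhibition".split(),
]
vecs = tfidf_vectors(docs)
sim_same = cosine(vecs[0], vecs[1])   # pages sharing "microsoft"
sim_diff = cosine(vecs[0], vecs[2])   # pages sharing only the query name
```

Note that the query name itself appears in every page, so its IDF is zero here and it contributes nothing to the similarity, which matches the remark that the name cannot be used for disambiguation.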

Page 34: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Problem

• There exist relatively useful features and relatively useless features (especially for person name disambiguation)
  – Useful: NEs, biography, noun phrases, etc.
  – Useless: general words, boilerplate, etc.
• How to distinguish useful features from others
• How to give weight to each feature

Page 35: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Named Entities

• Documents about Bill Gates

related person name related organization name

Page 36: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Noun Phrases

• Documents about Bill Gates

related key words

Page 37: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Other Words

• Documents about Bill Gates

Page 38: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Other Words

• Documents about Bill Gates

more important

Page 39: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Extracting Useful Features

• Thresholding
  – Based on a score related to our purpose: TFIDF, etc.
• Tool-based approach
  – POS tagging, NE tagging
• Information Extraction approach
  – Described later by Yoshida
• Meta-data approach
  – Link structures, meta tags

Page 40: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Thresholding

• Calculate TFIDF scores of words
• Discard the words with low TFIDF scores
• Unigrams, bigrams, and even N-grams can be used (Chen+, 09), where the Google 5-gram corpus (from 1T words) is used to calculate the TFIDF score
• Other scores: log-likelihood ratio, mutual information, KL-divergence, etc.
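
A minimal sketch of thresholding, assuming feature scores have already been computed by some scoring function. The scores, the cutoff value, and the function names are hypothetical; the slides do not state how the threshold is chosen.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def threshold_features(scores, min_score):
    """Keep only the features whose score reaches min_score."""
    return {f: s for f, s in scores.items() if s >= min_score}

tokens = "chief software architect of microsoft".split()
bigrams = ngrams(tokens, 2)

# Hypothetical TFIDF-like scores for unigram and bigram features.
scores = {"microsoft": 2.1, "software architect": 1.4, "of": 0.01, "page": 0.2}
kept = threshold_features(scores, min_score=1.0)
```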

Page 41: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Tool Based Approach

• Available tools:
  – POS tagging: high-performance POS taggers have been developed for many languages. For Western languages, stemmers have also been developed.
  – NE extraction (sophisticated), as opposed to bigrams and N-grams (unsophisticated but simple)
  – Keyword extraction: in the middle between NE and bigram/N-gram

Page 42: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Part of Speech (POS) Tagging

• Detect the grammatical categories of the words
  – Nouns, verbs, prepositions, adverbs, adjectives, …
  – Typically nouns are used as features
  – Noun phrases can be extracted with some simple rules
  – Many available tools (e.g., Tree Tagger)

Example: William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, …

[Figure: the words of the example sentence are labeled with tags such as NOUNS, VERB, ADJECTIVE, and DETERMINER.]

Page 43: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Named Entities (NE) Extraction

• Find “proper names” in texts
  – e.g., names of persons, organizations, locations, …
  – Include time expressions in many cases
  – Many available tools (Stanford NER, OpenNLP, Espotter, …)

Example: William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, …

[Figure: spans of the example sentence are labeled PERSON and DATE.]

Page 44: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Key Phrase Extraction

• Noun phrases consisting of 2 or more words
  – Likely to be topic-related concepts
  – Term-extraction tool “Gensen” (Nakagawa+, 05)
    • Noun phrases with the score of “term-likelihood”
    • Topic-related term -> higher score

Example: Gates held the positions of CEO and chief software architect, and remains the largest individual shareholder …

[Figure: two noun phrases in the example sentence are highlighted with term scores, SCORE=45.2 and SCORE=22.4.]

Page 45: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Gensen (言選) Web Score

Example word: 処理 (processing, “proc.”)
Left-adjacent words: 信息 (information), 計算機 (computer)
Right-adjacent words: 学会 (society), 能力 (capacity), 段階 (step)

From the corpus we extract: 信息処理, 計算機処理能力, 処理段階, 信息処理学会
(information proc., computer proc. capacity, proc. step, information proc. society)

L: (# of left-adjacent words) + 1 = 2 + 1
R: (# of right-adjacent words) + 1 = 3 + 1

L(W = 処理) = 2 + 1, R(W = 処理) = 3 + 1, LR(W = 処理) = 3 × 4 = 12

Page 46: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Calculation of LR and FLR

Compound word: W = w_1 ... w_n, where each w_i is a simple noun.
L(w_i) = (# of left-side connections of w_i) + 1
R(w_i) = (# of right-side connections of w_i) + 1
The score LR of a compound word W = w_1 ... w_n, such as 信息処理学会, is defined as follows (normalized by length):

$$LR(W) = \left( \prod_{i=1}^{n} L(w_i)\, R(w_i) \right)^{\frac{1}{2n}}$$

Example: LR(信息処理) = [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}
or LR(information processing) = [L(info.) × R(info.) × L(proc.) × R(proc.)]^{1/4}

Page 47: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Calculation of LR and FLR

F(W) is the independent frequency of the compound word W, where “independent” means that W is not part of a longer compound word.

Then FLR(W) is defined as FLR(W) = F(W) × LR(W), where LR(W) is normalized by length:

$$LR(W) = \left( \prod_{i=1}^{n} L(w_i)\, R(w_i) \right)^{\frac{1}{2n}}$$

Example: FLR(信息処理) = F(信息処理) × [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}

This FLR is the score used to rank term candidates.

F(W) has a similar effect to TF. Thus, if the corpus is big, F(W) affects FLR(W) more.
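
The LR and FLR definitions can be checked on the four compounds from the Gensen example. This is a sketch under stated assumptions: connections are counted as distinct adjacent words (which reproduces L = 2 + 1 and R = 3 + 1 for "processing"), each listed compound is treated as one independent occurrence, and English glosses stand in for the Japanese nouns. The function names are hypothetical and this is not the Gensen implementation.

```python
from collections import defaultdict

def adjacency(compounds):
    """Collect the distinct left and right neighbours of every simple noun."""
    left, right = defaultdict(set), defaultdict(set)
    for words in compounds:
        for a, b in zip(words, words[1:]):
            right[a].add(b)
            left[b].add(a)
    return left, right

def lr(words, left, right):
    """LR(W) = (prod_i L(w_i) * R(w_i)) ** (1 / (2n)), with +1 smoothing."""
    prod = 1.0
    for w in words:
        prod *= (len(left[w]) + 1) * (len(right[w]) + 1)
    return prod ** (1.0 / (2 * len(words)))

def flr(words, compounds, left, right):
    """FLR(W) = F(W) * LR(W); F counts stand-alone occurrences of W."""
    freq = sum(1 for c in compounds if c == words)
    return freq * lr(words, left, right)

compounds = [
    ["information", "processing"],
    ["computer", "processing", "capacity"],
    ["processing", "step"],
    ["information", "processing", "society"],
]
left, right = adjacency(compounds)
L_proc = len(left["processing"]) + 1    # 2 + 1
R_proc = len(right["processing"]) + 1   # 3 + 1
score = flr(["information", "processing"], compounds, left, right)
```

Here "information processing" occurs once on its own (its second occurrence is inside a longer compound), so F(W) = 1 and the FLR score equals the LR score (1 × 2 × 3 × 4)^{1/4}.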

Page 48: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Example of term extraction by Gensen Web. English article: SVM on Wikipedia

Support vector machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The original SVM algorithm was invented by Vladimir Vapnik and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vladimir Vapnik[1]. The standard SVM is a non-probabilistic binary linear classifier, i.e. it predicts, for each given input, which of two possible classes the input is a member of. Since an SVM is a classifier, then given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. …….Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems.[10] Instead of solving a sequence of broken down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used to use the kernel trick.

Page 49: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Extracted terms (term score)

Top 1-17: hyperplane 116.65, margin 109.54, SVM 74.08, vector 56.12, point 52.85, support vector 49.34, training data 48.12, data 47.83, problem 44.27, space 44.09, data point 38.01, classifier 30.59, classification 29.58, optimization problem 26.05, set 25.30, support vector machine 24.66, kernel 21.00

Top 18-38: set of point 20.73, linear classifier 19.99, maximum-margin hyperplane 19.92, example 19.60, one 17.32, Vladimir Vapnik 15.87, parameter 14.70, linear SVM 14.40, training set 14.00, optimization 13.42, model 12.25, training vector 12.04, support vector classification 11.70, two classe 11.57, normal vector 11.38, kernel trick 11.22, maximum margin classifier 11.22

Top 408-426 (last): Vandewalle 1.00, derive 1.00, it 1.00, Leisch 1.00, 2.3 1.00, H1 1.00, c 1.00, Hornik 1.00, mean 1.00, testing 1.00, transformation 1.00, unconstrained 1.00, homogeneous 1.00, need 1.00, learner 1.00, grid-search 1.00, convex 1.00, See 1.00, trade 1.00

.....

Page 50: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Contents

1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues

Page 51: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Contents

1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues

Page 52: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Introduction

1. Motivation
2. Problem Settings
3. Differences from other problems
4. History

Page 53: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Information Extraction Approach

• Information extraction:
  – The task of extracting specific types of information
  – e.g., a person and his/her working place

Example: William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, …

[Figure: spans of the example sentence are labeled NAME, DATE OF BIRTH, NATIONALITY, and OCCUPATION.]

Page 54: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Information Extraction Approach

• Useful features for disambiguation (Wan+, 2005) (Mann+, 2003) (Niu+, 04)
• Also used as “summaries” of clusters
  – Helps users find the clusters they are looking for
  – WePS-2 “attribute extraction task”

Page 55: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Information Extraction Approach

• Different methods for different attributes
  – Simple patterns (hand-crafted / automatically obtained)
    • Phone, FAX, URL, E-mail
  – Syntactic rules (hand-crafted / automatically generated)
    • Date of birth, titles, positions
  – Dictionary match (from Wikipedia, etc.)
    • Occupation, Major, Degree, Nationality
  – Keywords extracted by NER tools
    • Birth place (LOCATION), Affiliation (ORGANIZATION), Schools (ORGANIZATION)

Page 56: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Hand-Crafted Patterns

• Typically written with regular expressions
• Phone, FAX
  – +## (#) ####-####
• URLs
  – http://www.xxx.xxx.xxx/...
• E-mails
  – [email protected]
• Needs some classification (Phone or FAX?)
  – Supervised learning
  – Keyword-based approach (e.g., “born” for date of birth)
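
A minimal sketch of hand-crafted patterns written as regular expressions, followed by a keyword-based Phone/FAX decision. The exact expressions, the 15-character context window, and the sample text are illustrative assumptions, not patterns taken from any of the cited systems.

```python
import re

# Phone/FAX numbers in the "+## (#) ####-####" shape shown on the slide.
PHONE_RE = re.compile(r"\+\d{2} \(\d\) \d{4}-\d{4}")
# Deliberately loose URL and e-mail patterns (assumptions for illustration).
URL_RE = re.compile(r'https?://[^\s<>"]+')
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def classify_number(text, match):
    """Keyword-based approach: label a number FAX if 'fax' appears just before it."""
    window = text[max(0, match.start() - 15):match.start()].lower()
    return "FAX" if "fax" in window else "PHONE"

text = ("Tel: +81 (3) 1234-5678, Fax: +81 (3) 1234-5679, "
        "http://www.example.org/~taro mail taro@example.org")
numbers = [(m.group(), classify_number(text, m)) for m in PHONE_RE.finditer(text)]
urls = URL_RE.findall(text)
emails = EMAIL_RE.findall(text)
```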

Page 57: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Automatically Generated Patterns

• Patterns for birth years (Mann+, 03)
  – <name> (<birth year> - ####)
  – <name> <name> ( <birth year>
  – <name> was born in <birth year>
• Patterns for titles (Wan+, 05)
  – <name> is a <title>

Page 58: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Automatically Generated Patterns

• Approach by (Mann+, 03)
  – Bootstrapping method
    • Start with seed facts (e.g., (Mozart, 1756))
    • Find sentences (from the Web) that contain both elements (e.g., “Mozart was born in 1756”)
    • Perform some generalization (e.g., “<name> was born in <birth year>”)
    • Extract substrings with high score (measured using current facts)
    • Extract new facts

Resulting patterns:
  – <name> (<birth year> - ####)
  – <name> <name> ( <birth year>
  – <name> was born in <birth year>
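
One round of the bootstrapping loop can be sketched as below. This is a toy illustration, not the method of (Mann+, 03): the pattern-scoring step is omitted, a fixed in-memory sentence list stands in for Web search, and the regular expressions used for names and years are assumptions.

```python
import re

def generalize(sentence, name, year):
    """Replace a seed fact in a sentence by placeholders, or return None."""
    if name in sentence and year in sentence:
        return sentence.replace(name, "<name>").replace(year, "<birth year>")
    return None

def to_regex(pattern):
    """Compile a placeholder pattern into a regex with named groups."""
    body = re.escape(pattern)
    body = body.replace(re.escape("<name>"), r"(?P<name>[A-Z][a-z]+)")
    body = body.replace(re.escape("<birth year>"), r"(?P<year>\d{4})")
    return re.compile(body)

def bootstrap(seeds, sentences):
    """Learn patterns from seed facts, then apply them to extract new facts."""
    patterns = set()
    for name, year in seeds:
        for s in sentences:
            p = generalize(s, name, year)
            if p:
                patterns.add(p)
    facts = set(seeds)
    for p in patterns:
        rx = to_regex(p)
        for s in sentences:
            m = rx.search(s)
            if m:
                facts.add((m.group("name"), m.group("year")))
    return patterns, facts

sentences = [
    "Mozart was born in 1756",
    "Beethoven was born in 1770",
    "Chopin ( 1810 - 1849 )",
]
patterns, facts = bootstrap({("Mozart", "1756")}, sentences)
```

In a fuller version the new facts would be fed back as seeds, so that further patterns (such as the parenthesized one in the third sentence) could be learned in later rounds.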

Page 59: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Dictionary Matching

• Construct a list of occupations, nations (for “nationality” attributes), etc. from existing dictionaries– Wikipedia, WordNet, etc.

Page 60: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Dictionary Matching

• e.g., List of countries

Page 61: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Link Structure Approach

• It is difficult to find correct network structures
  – Difficulty in finding “in-links”
• Needs some approximation
• (Bekkerman+, 05): “socially linked persons tend to link similar pages”
  – Determine whether two pages are linked or not
  – MaxEnt classification with “linked-page” (URLs in pages) features

Page 62: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

FEATURE WEIGHTING / SIMILARITY CALCULATION

Page 63: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Feature Weighting

• Knowledge-based approach
  – US Census data, WordNet
• Web-query approach
• SVD
• Bootstrapping
• Determination of link/non-link by supervised classifiers

Page 64: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Knowledge-Base Approach

• US Census data
  – Frequent name -> ambiguous (Fleischman+, 04)
• WordNet
  – Semantic similarity for concept words
    • WordNet distance

Page 65: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WordNet

• Publicly available “dictionary” (thesaurus)
  – Hierarchical structures between words
  – We can find “synonyms”, “hyponyms”, “hypernyms” of words
• Many “semantic distance” measures between two words
  – Path length
  – Depth of common hypernyms
  – …

Page 66: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Web-Query Approach

• Name-concept relation (Fleischman+, 04)
• Validate relations between context NEs by Web search counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• Use query “name + bigram”, concatenating the snippets into a new document (Chen+, 09)
• Obtaining reliable counts (google_df) (Bekkerman+, 05)

Page 67: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Name-concept relation (Fleischman+, 04)

• Task: distinguish (name, concept) pairs
  – (Paul Simon, pop star) ; (Paul Simon, singer)
  – (Paul Simon, pop star) ; (Paul Simon, politician)
• MaxEnt classifier
• Features using Web counts (N: name, c: concept, +: AND operation)
  – Q(N + c1 + c2): Intersection
  – | Q(N + c1) - Q(N + c2) |: Difference
  – Q(N + c1 + c2) / (Q(N + c1) + Q(N + c2)): Ratio
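
The three Web-count features can be computed as in the sketch below. The hit-count function `q` is a hypothetical stand-in for a search engine query with the AND operation, and the counts in `FAKE_COUNTS` are invented for illustration only.

```python
def web_count_features(q, name, c1, c2):
    """Intersection, difference, and ratio features from hit counts Q(...)."""
    both = q(name, c1, c2)
    n1, n2 = q(name, c1), q(name, c2)
    return {
        "intersection": both,
        "difference": abs(n1 - n2),
        "ratio": both / (n1 + n2) if (n1 + n2) else 0.0,
    }

# Invented hit counts; a real system would query a search engine instead.
FAKE_COUNTS = {
    frozenset(["Paul Simon", "pop star"]): 900,
    frozenset(["Paul Simon", "singer"]): 1100,
    frozenset(["Paul Simon", "politician"]): 300,
    frozenset(["Paul Simon", "pop star", "singer"]): 700,
    frozenset(["Paul Simon", "pop star", "politician"]): 12,
}

def fake_q(*terms):
    """Hypothetical stand-in for Q(t1 + t2 + ...)."""
    return FAKE_COUNTS.get(frozenset(terms), 0)

same = web_count_features(fake_q, "Paul Simon", "pop star", "singer")
diff = web_count_features(fake_q, "Paul Simon", "pop star", "politician")
```

With these made-up counts, the pair of concepts that plausibly describes the same person has a much higher ratio than the pair that does not, which is the kind of signal the classifier is meant to pick up.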

Page 68: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Validate relations between context NEs by Web search counts (Kalashnikov+, 08) (Nuray-Turan+, 09)

• NE-based document similarity calculated using Web counts
  – NE: persons or organizations
• WebDice (C: context set, queried as [c1] OR [c2] OR …)
  – 2Q(N + C1 + C2) / (Q(N + C1) + Q(N + C2))
  – 2Q(N + C1 + C2) / (Q(N) + Q(C1 + C2))
  – The second one was better
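
The two WebDice variants differ only in the denominator, as the following sketch shows. The functions take already obtained hit counts as arguments; the argument names and the example counts are hypothetical.

```python
def web_dice_v1(q_n_c1_c2, q_n_c1, q_n_c2):
    """First variant: 2Q(N + C1 + C2) / (Q(N + C1) + Q(N + C2))."""
    denom = q_n_c1 + q_n_c2
    return 2.0 * q_n_c1_c2 / denom if denom else 0.0

def web_dice_v2(q_n_c1_c2, q_n, q_c1_c2):
    """Second variant: 2Q(N + C1 + C2) / (Q(N) + Q(C1 + C2))."""
    denom = q_n + q_c1_c2
    return 2.0 * q_n_c1_c2 / denom if denom else 0.0

# Invented hit counts for one name N and two context sets C1, C2.
v1 = web_dice_v1(q_n_c1_c2=50, q_n_c1=80, q_n_c2=120)
v2 = web_dice_v2(q_n_c1_c2=50, q_n=1000, q_c1_c2=250)
```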

Page 69: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Use query “name + bigram”, concatenating the snippets into a new document (Chen+, 09)

• Obtain additional features for similarity calculation
  – Web page -> b: maximal weight bigram
  – Snippets100(N + b) -> one new document
  – New document -> additional features (tokens)

Page 70: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Obtaining reliable counts (google_df) (Bekkerman+, 05)

• Google_tfidf(w) = tf(w) / log(Q(w))

• Some recent systems use Google N-gram (Chen+, 09)

Page 71: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Dimension Reduction by SVD (Pedersen+, 05)

• Reduce sparseness of context vectors
• More semantic-level representations (can use word similarities in contexts)
• Bigram features (contexts)

Page 72: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Cluster Refinement by Bootstrapping (1/4) (Yoshida+, 10)

• Strong features can identify a person
  – High precision, but not always observed
  – Strong features: NEs, CKWs, ...
• Weak features
  – Not useful in general, but useful for this name

[Figure: three documents mentioning Bill Gates, containing {Paul Allen, Microsoft, program}, {Steve Ballmer, Microsoft, program}, and {program}, with a “same person” label.]

Page 73: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

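Cluster Refinement by Bootstrapping (2/4)

[Figure: bipartite diagram. The document set D = {d1, d2, ..., dn} and the feature set F = {f1, f2, ..., fm} are connected by the document-feature matrix P. The initial cluster is linked to the documents by the document-cluster relation r_{C,D} and to the features by the feature-cluster relation r_{C,F}.]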

Page 74: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

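Cluster Refinement by Bootstrapping (3/4)

[Figure: bipartite diagram of the document set D = {d1, ..., dn} and the feature set F = {f1, ..., fm}, connected by the document-feature matrix P, with the document-cluster relation r_{C,D} and the feature-cluster relation r_{C,F} attached to the initial cluster.]

Update rules:

$$r_{C,F}^{(t)} = P^{\mathsf{T}}\, r_{C,D}^{(t)}, \qquad r_{C,D}^{(t+1)} = P\, r_{C,F}^{(t)}$$

$$r_{C,D}^{(t+1)} = P P^{\mathsf{T}}\, r_{C,D}^{(t)}$$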

Page 75: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Cluster Refinement by Bootstrapping (4/4)

[Figure: numerical example. The initial document-cluster values (a 0/1 matrix) are multiplied by P P^T to obtain the refined values.]

Each document is placed in the cluster with the largest relation value.
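
The refinement step can be sketched in a few lines. This is a minimal illustration under assumptions: P is an unnormalized 0/1 document-feature matrix (the slide's example appears to use normalized weights), plain Python lists are used instead of a matrix library, and the toy data are invented.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def refine(p, r_cd, iterations=1):
    """Alternate r_CF = P^T r_CD and r_CD = P r_CF for a number of rounds."""
    for _ in range(iterations):
        r_cf = matmul(transpose(p), r_cd)   # feature-cluster relation
        r_cd = matmul(p, r_cf)              # refined document-cluster relation
    return r_cd

def assign(r_cd):
    """Put each document in the cluster with the largest relation value."""
    return [max(range(len(row)), key=row.__getitem__) for row in r_cd]

# Rows: documents d0..d3; columns: features f0 (strong), f1 (weak), f2 (strong).
P = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],   # d3 has only the weak feature f1
]
# Initial clusters built from strong features: {d0, d1}, {d2}, {d3}.
R0 = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
R1 = refine(P, R0)
labels = assign(R1)
```

In this toy case the weak feature f1 becomes associated with the first cluster because d0 and d1 both contain it, so d3 is pulled into that cluster after one round.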

Page 76: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Determination of “linked” or “not-linked” by supervised classifiers

• MaxEnt Classification (Fleischman+, 04)
  – Features: name features, web features, etc.
• SkyLine-Based Classification (Kalashnikov+, 08)
  – Features: search engine hit counts

Page 77: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

CLUSTERING

Page 78: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Problem: How to Determine K

• Hierarchical clustering with thresholds

• Online Clustering (Single Pass Clustering)

• Building “core” clusters (2-stage clustering)

• Variable-Component-Number Clustering (e.g., Dirichlet Process Mixture)

Page 79: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Hierarchical clustering with thresholds

• Used in many systems
• Popular settings:
  – Agglomerative clustering
  – Group-average method (or sometimes the single-link method)
  – Predetermined threshold (or sometimes determined by cross-validation)

Page 80: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Hierarchical clustering with thresholds

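[Figure: dendrogram over document IDs 5 2 3 9 1 8 7 6 4, with cluster similarity on the vertical axis (high to low). Cutting at a lower similarity gives 2 clusters, {1,2,3,5,9} and {4,6,7,8}; cutting at a higher similarity gives 4 clusters, {2,5}, {1,3,9}, {6,7,8}, {4}.]

Cluster similarity: group average method

$$\mathrm{sim}(C_x, C_y) = \frac{1}{|C_x|\,|C_y|} \sum_{d_x \in C_x} \sum_{d_y \in C_y} \mathrm{sim}(d_x, d_y)$$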
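
Agglomerative clustering with the group-average similarity and a stopping threshold can be sketched as follows. The similarity matrix and the thresholds are invented, similarities are assumed to be non-negative, and the quadratic search for the best pair is kept for readability rather than speed.

```python
def group_average(cx, cy, sim):
    """Average pairwise similarity between the documents of two clusters."""
    total = sum(sim[a][b] for a in cx for b in cy)
    return total / (len(cx) * len(cy))

def agglomerative(sim, threshold):
    """Merge the most similar pair of clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = group_average(clusters[i], clusters[j], sim)
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break   # the number of clusters K is decided by the threshold
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented document-document similarities for four pages.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
two_clusters = agglomerative(sim, threshold=0.5)
one_cluster = agglomerative(sim, threshold=0.1)
```

Lowering the threshold lets more merges happen, which corresponds to cutting the dendrogram at a lower similarity and obtaining fewer clusters.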

Page 81: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Cluster-Distance Calculation

[Figure: the distance between two clusters under three definitions.]
• Complete linkage method
• Single linkage method
• Centroid method

Page 82: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Online Clustering

• Single Pass Clustering (Balog+, 08)
  – Take pages in order, starting from the 1st search result

[Figure: search-result pages numbered 1 to 6.]

Page 83: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Online Clustering

• Single Pass Clustering (Balog+, 08)
  – Take pages in order, starting from the 1st search result
  – For each page, find the most similar cluster

[Figure: search-result pages numbered 1 to 6.]

Page 84: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Online Clustering

• Single Pass Clustering (Balog+, 08)
  – Take pages in order, starting from the 1st search result
  – For each page, find the most similar cluster
  – If the similarity is below the threshold, create a new cluster
• Similarity: Naïve Bayes | Cosine with TFIDF

[Figure: search-result pages numbered 1 to 6.]
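
The three steps above can be sketched as follows, using the cosine option. Representing each cluster by the sum of its members' vectors is an assumption made here for simplicity, and the toy vectors and the threshold are invented; this is not the implementation of (Balog+, 08).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold):
    """Process pages in rank order; join the best cluster or open a new one."""
    clusters = []   # each cluster: {"members": [...], "sum": {...}}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(vec, c["sum"])
            if s > best_sim:
                best, best_sim = c, s
        if best is None or best_sim < threshold:
            clusters.append({"members": [idx], "sum": dict(vec)})
        else:
            best["members"].append(idx)
            for w, x in vec.items():
                best["sum"][w] = best["sum"].get(w, 0.0) + x
    return [c["members"] for c in clusters]

# Invented feature vectors for four search results, in rank order.
pages = [
    {"baseball": 1.0, "mariners": 1.0},
    {"law": 1.0, "court": 1.0},
    {"baseball": 1.0, "seattle": 1.0},
    {"court": 1.0, "attorney": 1.0},
]
result = single_pass(pages, threshold=0.3)
```

Because each page is seen only once, the result depends on the order of the search results, which is why the pages are taken starting from the 1st.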

Page 85: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Building Core Clusters

• 1st stage clustering – High Precision Clusters
  – Relatively high threshold (Mann+, 03)
  – Use strong features only (Ikeda+, 09)
• 2nd stage clustering – Treat Remaining Documents
  – Add to the most similar 1st stage clusters (Mann+, 03) (Ikeda+, 09)
  – Feature weighting by 1st stage clusters (Yoshida+, 10)

Page 86: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Query Expansion Approach (Ikeda+, 09)

• Re-extract key-phrases by using 1st-stage clusters
  – Key-phrases for documents -> key-phrases for clusters
  – More reliable than one document

Page 87: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

[Figure: the current cluster yields top CKWs (e.g., “home runs”, “major leagues”, “all stars”), which are searched for in the other documents.]

1. Extract top CKWs from the current cluster
2. Search for the CKWs in documents outside the cluster
3. If such documents exist, copy them into the cluster (soft clustering)
4. Remove 1-element clusters

Page 88: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Feature weighting by 1st stage clusters (Yoshida+, 10)

1. Make clusters by strong features
2. Weight weak features using clusters, and refine similarities
3. Refine clusters by using new similarities

Page 89: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Using Dirichlet Process Mixture (Ono+, 08)

• Topic = word distribution
  – Topic “economics” = word distribution {“dollar”: 0.03, “stock”: 0.05, “share”: 0.01, ...}
• Document = mixture of topics
  – {economics: 0.3, politics: 0.2, ...}
• Document’s topic = topic with highest weight
• Modeling by DPUM (Dirichlet Process Unigram Mixture)
  – # of topics is automatically determined

Page 90: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Example: Estimation of Latent Topics

[Figure: documents plotted as points in a space with axes word-1, word-2, and word-3.]

Document = each point

Latent entity = each (red) bar

Page 91: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Dirichlet Process Unigram Mixture

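[Figure: graphical models of the unigram mixture (UM) and the Dirichlet process unigram mixture, with variables θ, θd, wdn and plates Nd and M.]

G0 = distribution for θ (Dirichlet distribution)
θ = multinomial distribution
G = a countable number of multinomial distributions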

Page 92: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

DPUM Parameter Estimation

[Figure: the initial entity distribution and the estimated entity distribution.]

The entity distribution is estimated by iteratively maximizing the likelihood.

Page 93: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

[Figure: clusters labeled with their topics: Politics, Politics, Economics, Economics, Politics, Entertainment, Sports, Arts, Society.]

Merge clusters with the same topic

Page 94: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

EVALUATION ISSUES

Page 95: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Evaluation Issues

• Evaluation Measures
• Available Corpus
• WePS Workshop

Page 96: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Evaluation Measures

• Precision / Recall / F-measure
• Purity / Inverse Purity
• B-cubed Precision / Recall / F-measure
  – Extended B-cubed

Page 97: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Recall and Precision for Clustering

• Features and recall/precision
• First stage cluster = high precision

A: size of cluster
B: # of correct documents
C: # of correct documents in cluster

Precision: P = C / A
Recall: R = C / B

Example: A = 5, B = 8, C = 3
precision = 3/5 = 0.6
recall = 3/8 = 0.375

Page 98: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Recall and Precision [Larsen and Aone 1999]

• Machine-made clusters
• Correct clusters
• Precision and recall are calculated for each correct cluster, using the machine-made cluster that maximizes its F-measure

Page 99: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

F-measure

Total F-measure (F):

Note: Precision (P) and Recall (R) are calculated in the same way.
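The per-cluster and total measures can be written as follows. This is a reconstruction based on the definition commonly attributed to Larsen and Aone (1999), and it agrees with the values in the example on slide 100. The symbols $C_i$ (a correct cluster), $D_j$ (a machine-made cluster), and $N$ (total number of documents) are assumed notation.

```latex
\begin{align*}
P(C_i, D_j) &= \frac{|C_i \cap D_j|}{|D_j|}, \qquad
R(C_i, D_j) = \frac{|C_i \cap D_j|}{|C_i|}, \\
F(C_i, D_j) &= \frac{2\, P(C_i, D_j)\, R(C_i, D_j)}{P(C_i, D_j) + R(C_i, D_j)}, \\
F &= \sum_i \frac{|C_i|}{N} \max_j F(C_i, D_j).
\end{align*}
```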

Page 100: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Example

Correct clusters: C
  – [A][A][A][A][A]

Machine-made clusters: D
  – [A][A][B]: P = 2/3, R = 2/5, F = 1/2
  – [A][A][A][B][C]: P = 3/5, R = 3/5, F = 3/5
  – [B][B] …
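The numbers in this example can be checked with a short script. It is a sketch only: the document ids are hypothetical, the function name `prf` is an assumption, and exact fractions are used so that the results match the slide.

```python
from fractions import Fraction

def prf(correct, machine):
    """Precision, recall and F-measure of one machine-made cluster
    against one correct cluster (both given as sets of document ids)."""
    overlap = len(correct & machine)
    if overlap == 0:
        return Fraction(0), Fraction(0), Fraction(0)
    p = Fraction(overlap, len(machine))
    r = Fraction(overlap, len(correct))
    return p, r, 2 * p * r / (p + r)

# Hypothetical ids; the letter is the person each page refers to.
correct_a = {"a1", "a2", "a3", "a4", "a5"}
machine = [
    {"a1", "a2", "b1"},              # [A][A][B]
    {"a3", "a4", "a5", "b2", "c1"},  # [A][A][A][B][C]
]

scores = [prf(correct_a, d) for d in machine]
# The correct cluster is scored by its best-matching machine-made cluster.
best_f = max(f for _, _, f in scores)
```

The second machine-made cluster gives the higher F-measure (3/5), so it is the one that represents correct cluster A.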

Page 101: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Purity / Inverse Purity

• Similar to precision / recall
  – L: manually annotated categories (clusters)
  – C: clusters output by systems
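A minimal sketch of the two measures, assuming the usual definition for hard clusterings: purity is the size-weighted average, over system clusters C, of the share taken by the best-matching category in L, and inverse purity swaps the roles of C and L. The function names are assumptions.

```python
from fractions import Fraction

def purity(clusters, categories):
    """(1/N) * sum over clusters of the largest overlap with any category."""
    n = sum(len(c) for c in clusters)
    return sum(Fraction(max(len(c & l) for l in categories), n) for c in clusters)

def inverse_purity(clusters, categories):
    """Purity with the roles of system clusters and categories swapped."""
    return purity(categories, clusters)

# Hypothetical toy data: five documents, two annotated people.
gold = [{1, 2}, {3, 4, 5}]
system = [{1, 2, 3}, {4, 5}]
all_in_one = [{1, 2, 3, 4, 5}]

p_sys, ip_sys = purity(system, gold), inverse_purity(system, gold)
p_one, ip_one = purity(all_in_one, gold), inverse_purity(all_in_one, gold)
```

Putting every document into one cluster drives inverse purity to 1 while purity drops, which is why the two measures are reported together.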

Page 102: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

B-Cubed Precision/Recall

• Entity-wise accuracy calculation
  – C: cluster (by system) containing e
  – L: cluster (by human) containing e
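In the standard B-cubed definition, the precision of an entity e is |C ∩ L| / |C| and its recall is |C ∩ L| / |L|, and both are averaged over all entities. The sketch below assumes a hard clustering (each entity in exactly one C and one L); the extended B-cubed for overlapping clusters is not covered. The function name `bcubed` is an assumption.

```python
from fractions import Fraction

def bcubed(clusters, categories):
    """B-cubed precision and recall for a hard clustering."""
    cluster_of = {e: c for c in clusters for e in c}
    category_of = {e: l for l in categories for e in l}
    items = list(category_of)
    precision = sum(
        Fraction(len(cluster_of[e] & category_of[e]), len(cluster_of[e]))
        for e in items
    ) / len(items)
    recall = sum(
        Fraction(len(cluster_of[e] & category_of[e]), len(category_of[e]))
        for e in items
    ) / len(items)
    return precision, recall

# Hypothetical toy data: five documents, two annotated people.
gold = [{1, 2}, {3, 4, 5}]
system = [{1, 2, 3}, {4, 5}]
bp, br = bcubed(system, gold)
```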

Page 103: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

B-Cubed Precision/Recall

• Borrowed from (Amigo+, 09)

Page 104: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Other Metrics

• Counting pairs
  – Given a pair of documents, label it "link" or "unlink"
  – Problem: # of pairs is quadratic in the size of clusters
• Entropy
  – Low entropy in cluster -> pure
• Edit distance
  – Distance from system output to correct output
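The pair-counting idea can be sketched as below: every pair of documents placed in the same cluster counts as a "link", and the system's links are compared with the annotated ones. The function names are assumptions, and the sketch assumes each side has at least one linked pair.

```python
from fractions import Fraction
from itertools import combinations

def linked_pairs(clusters):
    """All unordered document pairs that share a cluster ("link" pairs)."""
    return {frozenset(pair) for c in clusters for pair in combinations(sorted(c), 2)}

def pairwise_precision_recall(clusters, categories):
    """Precision and recall over "link" decisions (needs >= 1 link per side)."""
    system = linked_pairs(clusters)
    gold = linked_pairs(categories)
    agreed = len(system & gold)
    return Fraction(agreed, len(system)), Fraction(agreed, len(gold))

# Hypothetical toy data: five documents, two annotated people.
gold = [{1, 2}, {3, 4, 5}]
system = [{1, 2, 3}, {4, 5}]
pair_p, pair_r = pairwise_precision_recall(system, gold)
```

A cluster of n documents contributes n(n-1)/2 pairs, which is the quadratic growth noted above.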

Page 105: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Which Metrics to Use

• Constraints (borrowed from (Amigo+, 09))
• Homogeneity: the purer, the better
• Completeness: the more complete, the better

Page 106: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Which Metrics to Use

• Constraints (borrowed from (Amigo+, 09))
• Rag bag
  – Putting a noisy item into an already noisy cluster: better!
  – Putting a noisy item into a pure cluster: worse!

Page 107: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Which Metrics to Use

• Constraints (borrowed from (Amigo+, 09))
• Cluster size vs. quantity
  – A small error in a big cluster: better!
  – (Large number of) small errors in small clusters: worse!

Page 108: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Which Metrics to Use

• Borrowed from (Amigo+, 09)

Page 109: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Which Metrics to Use

• Borrowed from (Amigo+, 09)

Page 110: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Which Metrics to Use

• Borrowed from (Amigo+, 09)

Page 111: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Baselines


Page 112: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

P-IP vs. B-Cubed: for Practical Data

• Purity/Inverse-Purity measure is not appropriate in the soft-clustering case
  – It gives very high scores to the "cheat" baseline clustering (COMBINED in the table)
• B-cubed measure is appropriate in this case

Page 113: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

Available Corpus

• John Smith Corpus (Bagga+, 98)
• 12 different people (Bekkerman+, 05)
• WePS corpus (Artiles+, 07)(Artiles+, 09)
  – WePS-1
    • 79 person names (49 training + 30 test), 100 top pages for each
  – WePS-2
    • 30 person names, 150 top pages for each

Page 114: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

WePS (Web People Search) Workshops (Artiles+, 07)(Artiles+, 09)

• Evaluation campaigns for person name disambiguation (along with person attribute extraction)
• WePS-1
  – with SemEval-2007
  – 16 teams participated
• WePS-2
  – with WWW-2009
  – 17 teams participated

Page 115: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

References

• (Amigo+, 09) Enrique Amigó, Julio Gonzalo, Javier Artiles, Felisa Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, v.12 n.4, p.461-486, August 2009

• (Artiles+, 07) Javier Artiles, Julio Gonzalo, Satoshi Sekine, The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task, Proceedings of the 4th International Workshop on Semantic Evaluations, p.64-69, June 23-24, 2007, Prague, Czech Republic

• (Artiles+, 09) J. Artiles, J. Gonzalo, and S. Sekine. WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task. 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.

• (Bagga+, 98) Amit Bagga, Breck Baldwin, Entity-based cross-document coreferencing using the Vector Space Model, Proceedings of the 17th international conference on Computational linguistics, August 10-14, 1998, Montreal, Quebec, Canada

• (Balog+, 08) K. Balog, L. Azzopardi, and M. de Rijke. Personal name resolution of web people search. In WWW2008 Workshop: NLP Challenges in the Information Explosion Era (NLPIX 2008), 2008.

• (Balog+, 09) Krisztian Balog, Jiyin He, Katja Hofmann, Valentin Jijkoun, Christof Monz, Manos Tsagkias, Wouter Weerkamp and Maarten de Rijke, The University of Amsterdam at WePS2. 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.

Page 116: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

References

• (Bekkerman+, 05) Ron Bekkerman, Andrew McCallum, Disambiguating Web appearances of people in a social network, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan

• (Bollegala+, 06) Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, Extracting key phrases to disambiguate personal name queries in web search, Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval?, July 23, 2006, Sydney, Australia

• (Bunescu+, 06) R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 2006.

• (Chen+, 09) Ying Chen, Sophia Yat Mei Lee and Chu-Ren Huang, PolyUHK: A Robust Information Extraction System for Web Personal Names, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.

• (Chen+, 07) Ying Chen, James Martin, Towards Robust Unsupervised Personal Name Disambiguation, EMNLP-CoNLL 2007, pp. 190-198, 2007

• (Elmacioglu+, 07) Ergin Elmacioglu, Yee Fan Tan, Su Yan, Min-Yen Kan, Dongwon Lee, PSNUS: web people name disambiguation by simple clustering with rich features, Proceedings of the 4th International Workshop on Semantic Evaluations, p.268-271, June 23-24, 2007, Prague, Czech Republic

Page 117: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

References

• (Fleishman+, 2004) Fleischman, M.B. and E.H. Hovy, Multi-Document Person Name Resolution. Proceedings of the Reference Resolution Workshop at the 42nd Annual Meeting of the Association for Computational Linguistics (ACL). Barcelona, Spain, 2004

• (Gooi+, 04) Chung H. Gooi, James Allan, Cross-Document Coreference on a Large Scale Corpus, HLT-NAACL 2004: Main Proceedings, pp. 9-16, 2004

• (Han+, 04) Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, JCDL 2004, pp. 296-305, 2004

• (Ikeda+, 09) M. Ikeda, S. Ono, I. Sato, M. Yoshida, and H. Nakagawa. Person Name Disambiguation on the Web by Two-Stage Clustering. 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009.

• (Kalashnikov+, 08) Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra, Towards breaking the quality curse: a web-querying approach to web people search, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 20-24, 2008, Singapore, Singapore

• (Li+, 04) X. Li, P. Morie and D. Roth, Robust Reading: Identification and Tracing of Ambiguous Names. Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pp. 17-24, 2004

Page 118: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

References

• (Malin, 05) Bradley Malin, Unsupervised name disambiguation via social network similarity, In Workshop on Link Analysis, Counterterrorism, and Security, with SDM 2005

• (Murakami, 10) Hiroshi Ueda, Harumi Murakami, and Shoji Tatsumi, Suggesting Subject Headings using Web Information Sources, ... Conference on Agents and Artificial Intelligence (ICAART 2010) Volume 1 Artificial Intelligence, pp.640-643, 2010.

• (Mann+, 03) Gideon S. Mann, David Yarowsky, Unsupervised personal name disambiguation, Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, p.33-40, May 31, 2003, Edmonton, Canada

• (Nakagawa+, 03) H. Nakagawa and T. Mori. Automatic term recognition based on statistics of compound nouns and their components. Terminology, 9(2):201--219, 2003.

• (Niu+, 04) Cheng Niu, Wei Li, Rohini K. Srihari, Weakly supervised learning for cross-document person name disambiguation supported by information extraction, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, p.597-es, July 21-26, 2004, Barcelona, Spain

• (Nuray-Turan+, 09) R. Nuray-Turan, Z. Chen, D. Kalashnikov, and S. Mehrotra. Exploiting web querying for web people search in weps2. 2nd Web People Search Evaluation Workshop (WePS 2009), 2009.

Page 119: ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems

References

• (On+, 07) B.-W. On and D. Lee. Scalable name disambiguation using multi-level graph partition. In Proc. of the SIAM SDM Conf., Minneapolis, Minnesota, USA, 2007

• (Ono+, 08) Shingo Ono, Issei Sato, Minoru Yoshida, Hiroshi Nakagawa, Person name disambiguation in web pages using social network, compound words and latent topics, Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining, May 20-23, 2008, Osaka, Japan

• (Pedersen+, 05) Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Name Discrimination by Clustering Similar Contexts, CICLing 2005, pp. 226-237, 2005

• (Resnick+, 94) Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM conference on Computer supported cooperative work, p.175-186, October 22-26, 1994, Chapel Hill, North Carolina, United States

• (Yoshida+, 10) Minoru Yoshida, Masaki Ikeda, Shingo Ono, Issei Sato, Hiroshi Nakagawa, Person name disambiguation by bootstrapping, In SIGIR '10: Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 10-17, 2010

• (Wan+, 05) Xiaojun Wan, Jianfeng Gao, Mu Li, Binggong Ding, Person resolution in person search results: WebHawk, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany