Grouping Search-Engine Returned Citations for Person Name Queries

Page 1: Grouping Search-Engine Returned Citations for Person Name Queries

Grouping Search-Engine Returned Citations for Person Name Queries

Reema Al-Kamha

Research Supported by NSF

Page 2: Grouping Search-Engine Returned Citations for Person Name Queries

The Problem

Search engines return too many citations. Example: “Kelly Flanagan”. Google returns around 685 citations.

Many people are named “Kelly Flanagan”. It would help to group the citations by person. How do we group them?

Page 3: Grouping Search-Engine Returned Citations for Person Name Queries

“Kelly Flanagan” Query to Google

Page 4: Grouping Search-Engine Returned Citations for Person Name Queries

Our Solution

A multi-faceted approach: Attributes, Links, Page Similarity

Confidence matrix for each facet

Final confidence matrix

Grouping algorithm

Page 5: Grouping Search-Engine Returned Citations for Person Name Queries

A Multi-faceted Approach

Gather evidence from each of several different facets.

Combine the evidence.

Page 6: Grouping Search-Engine Returned Citations for Person Name Queries

Attributes

Phone number, email address, state, city, zip code.

Regular expression for each attribute.
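
The slides do not show the regular expressions themselves, so the following is only a minimal sketch of what per-attribute extraction could look like. Every pattern, the tiny state list, and the names `ATTRIBUTE_PATTERNS`, `extract_attributes`, and `shared_attributes` are assumptions made for illustration. City is omitted because it would likely need a gazetteer rather than a pattern.

```python
import re

# Illustrative patterns only; the actual expressions are not given in the slides.
ATTRIBUTE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
    # A real list would cover every state name and abbreviation.
    "state": re.compile(r"\b(?:UT|CA|Utah|California)\b"),
}


def extract_attributes(text):
    """Return, for each attribute, the set of values found in the text."""
    return {name: set(pattern.findall(text))
            for name, pattern in ATTRIBUTE_PATTERNS.items()}


def shared_attributes(text_a, text_b):
    """Return the names of attributes for which both texts share a value."""
    a, b = extract_attributes(text_a), extract_attributes(text_b)
    return {name for name in a if a[name] & b[name]}


page_a = "Contact: kelly@example.edu, (801) 422-1234, Provo, UT 84604"
page_b = "Office located in Provo, UT 84604"
print(shared_attributes(page_a, page_b))
```

The set of shared attribute names is the kind of evidence the attribute facet conditions on.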

Page 7: Grouping Search-Engine Returned Citations for Person Name Queries

Links

People usually post information on only a few host servers, so we check whether two returned citations have the same host.

People often link one page about a person to another page about the same person, so we check whether the URL of one citation has the same host as one of the URLs on the web page referenced by the other citation.
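
The two link checks can be sketched with the standard-library URL parser. The function names and the example URLs below are hypothetical, and fetching a page and collecting its outgoing links is assumed to happen elsewhere.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host of a URL, or '' if it has none."""
    return urlsplit(url).hostname or ""


def same_host(url_a, url_b):
    """First check: the two citation URLs live on the same host."""
    host = host_of(url_a)
    return bool(host) and host == host_of(url_b)


def links_to_host(citation_url, outlinks_of_other_page):
    """Second check: the other citation's page links to this citation's host."""
    host = host_of(citation_url)
    return bool(host) and any(host_of(link) == host
                              for link in outlinks_of_other_page)


print(same_host("http://faculty.example.edu/a", "http://FACULTY.example.edu/b"))
print(links_to_host("http://example.org/bio",
                    ["http://other.example.com/", "http://example.org/papers"]))
```

A later slide conditions on whether the shared host is popular; that test is not sketched here because the slides do not say how popularity is decided.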

Page 8: Grouping Search-Engine Returned Citations for Person Name Queries

Links (Cont)

Page 9: Grouping Search-Engine Returned Citations for Person Name Queries

Page Similarity

“adjacent cap-word pairs”: Cap-Word (Connector | Preposition (Article)? | (Capital-LetterDot))? Cap-Word.
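
As a rough illustration, the pattern above can be approximated with a Python regular expression. The connector, preposition, and article word lists are assumptions (the slides do not enumerate them), and `CAP`, `MIDDLE`, and `cap_word_pairs` are hypothetical names. A lookahead is used so that overlapping pairs such as "Palm Desert" and "Desert Real" are both found.

```python
import re

# A capitalized word, optionally hyphenated (e.g. "System-Assisted").
CAP = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*"
# Optional middle element: connector, preposition (+ article), or an initial.
# These word lists are illustrative assumptions.
MIDDLE = r"(?:(?:and|or|&)|(?:of|for|in|at|on)(?: (?:the|a|an))?|[A-Z]\.)"
# Match the first cap-word, then look ahead for the rest of the pair so that
# the second cap-word can start the next pair.
PAIR = re.compile(rf"\b({CAP})(?= ((?:{MIDDLE} )?{CAP})\b)")


def cap_word_pairs(text):
    """Return the set of adjacent cap-word pairs found in the text."""
    return {f"{m.group(1)} {m.group(2)}" for m in PAIR.finditer(text)}


print(cap_word_pairs("Palm Desert Real Estate"))
print(cap_word_pairs("with Brent E. Nelson and others"))
```

This sketch ignores all-caps tokens and pairs that span punctuation; a fuller implementation would need to decide how to treat both.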

Page 10: Grouping Search-Engine Returned Citations for Person Name Queries

Page Similarity

We count the number of shared adjacent cap-word pairs (1, 2, 3, or 4 or more).

We ignore adjacent cap-word pairs that often occur on web pages (such as Home Page and Privacy Policy) by constructing a stop-word list.
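
A minimal sketch of the counting step, assuming each page has already been reduced to a set of adjacent cap-word pairs. The stop list holds only the two pairs named on the slide, and the names `STOP_PAIRS` and `shared_pair_level` are hypothetical.

```python
# Only the two examples from the slide; a real stop list would be longer.
STOP_PAIRS = frozenset({"Home Page", "Privacy Policy"})


def shared_pair_level(pairs_a, pairs_b, stop_pairs=STOP_PAIRS):
    """Return 0, 1, 2, 3, or 4, where 4 stands for '4 or more' shared pairs."""
    shared = (set(pairs_a) & set(pairs_b)) - stop_pairs
    return min(len(shared), 4)


page_a = {"Brigham Young", "Home Page", "Palm Desert"}
page_b = {"Brigham Young", "Home Page", "Real Estate"}
print(shared_pair_level(page_a, page_b))
```

Capping the count at 4 mirrors the buckets listed above, so each bucket can be given its own conditional probability.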

Page 11: Grouping Search-Engine Returned Citations for Person Name Queries

Confidence Matrix Construction

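For each facet we construct a confidence matrix.

     C1    C2    ...   Ci    ...   Cj    ...   Cn
C1   1     C12         C1i         C1j         C1n
C2         1           C2i         C2j         C2n
:                      :           :           :
Ci                     1           Cij         Cin
:                                  :           :
Cj                                 1           Cjn
:                                              :
Cn                                             1

$$
C_{ij} =
\begin{cases}
P(C_i \text{ and } C_j \text{ refer to the same person} \mid \text{evidence for facet } f) & \text{if there is evidence for facet } f \\
0 & \text{if there is no evidence for facet } f
\end{cases}
$$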

Training set to compute the conditional probabilities.

Page 12: Grouping Search-Engine Returned Citations for Person Name Queries

Confidence Matrix Construction (Cont)

We select 9 person names.

For each name we collect the first 50 citations.

For 50 citations we have 50 × 49 / 2 = 1,225 comparison pairs.

The size of our training set is 9 × 1,225 = 11,025 pairs.

Page 13: Grouping Search-Engine Returned Citations for Person Name Queries

Confidence Matrix Construction (Cont)

For the attribute facet:

P(Same Person = “Yes” | Email = “Yes”)

P(Same Person = “Yes” | City = “Yes” and State = “Yes”)

For the link facet:

P(Same Person = “Yes” | Host1 = “Yes” and Host1 is non-popular)

For the page similarity facet:

P(Same Person = “Yes” | Share2 = “Yes”)
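
One plausible way to obtain such conditional probabilities from a labeled training set is simple relative-frequency counting. The sketch below assumes that; the slides do not state the estimator. The evidence labels and the toy training pairs are made up for illustration and are not the authors' data.

```python
from collections import Counter


def estimate_confidences(training_pairs):
    """Estimate P(same person | evidence) as a relative frequency.

    training_pairs: iterable of (evidence_label, is_same_person) tuples,
    one per compared pair of citations that exhibits that evidence.
    """
    seen, same = Counter(), Counter()
    for evidence, is_same in training_pairs:
        seen[evidence] += 1
        if is_same:
            same[evidence] += 1
    return {evidence: same[evidence] / seen[evidence] for evidence in seen}


# Toy data, invented for this example only.
toy_pairs = ([("email", True)] * 3
             + [("city+state", True)] * 2
             + [("city+state", False)])
print(estimate_confidences(toy_pairs))
```

Pairs with no evidence for a facet are not counted here; they would receive a confidence of 0 for that facet, as defined for the confidence matrix.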

Page 14: Grouping Search-Engine Returned Citations for Person Name Queries

Confidence Matrix for Attribute Facet

     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.99  0     0     0     0     0     0.96  0     0
C2         1     0     0     0     0     0     0.96  0     0
C3               1     0     0     0     0     0     0     0
C4                     1     0     0     0.96  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1

C1 and C2 have the same city, state, and zip code, which are “Provo”, “UT”, and “84604”.

C1 and C8, and C2 and C8, have the same city and state, which are “Provo” and “UT”.

C4 and C7 have the same city and state, which are “Palm Desert” and “California”.

Page 15: Grouping Search-Engine Returned Citations for Person Name Queries

Confidence Matrix for Link Facet

     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.99  0     0     0     0     0     0     0     0
C2         1     0     0     0.99  0     0     0     0     0
C3               1     0     0     0.99  0     0     0     0
C4                     1     0     0     0     0     0     0
C5                           1     0.99  0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1

C1 and C2 have the same host name, and C1 refers to the host of C2.

C5 and C6 have the same host name.

C3 refers to the host of C5, and C3 refers to the host of C6.

Page 16: Grouping Search-Engine Returned Citations for Person Name Queries

Confidence Matrix for Page Similarity Facet

     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.95  0     0     0     0     0     0.78  0     0
C2         1     0.95  0     0     0     0     0.78  0     0
C3               1     0     0     0     0     0     0     0
C4                     1     0     0     0.92  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1

C1 and C2 share Associate Professor, Brigham Young, Performance Evaluation, Trace Collection, Computer Organization, Computer Architecture.

C2 and C3 share Memory Hierarchy, Brent E. Nelson, System-Assisted Disk, Simulation Technique, Stochastic Disk, Winter Simulation, Chordal Spoke, Interconnection Network, Transaction Processing, Benchmarks Using, Performance Studies, Incomplete Trace, Heng Zho.

C1 and C8, and C2 and C8, share Brigham Young.

C4 and C7 share Palm Desert, Real Estate, Desert Real.

Page 17: Grouping Search-Engine Returned Citations for Person Name Queries

Final Matrix

Combine the confidence matrices for the three facets using the Stanford certainty measure.

For some observation B, if CF(E1) is the certainty factor associated with evidence E1 and CF(E2) is the certainty factor associated with evidence E2, the new certainty factor for B is:

CF(E1) + CF(E2) - CF(E1) * CF(E2)

Page 18: Grouping Search-Engine Returned Citations for Person Name Queries

Final Matrix (Cont)

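Each entry of the final matrix combines the corresponding entries of the confidence matrix for attributes, the confidence matrix for links, and the confidence matrix for page similarity. For example, for C1 and C8 (0.96 for attributes, 0 for links, 0.78 for page similarity):

0.96 + 0 + 0.78 - 0.96 * 0 - 0.96 * 0.78 - 0.78 * 0 + 0.96 * 0 * 0.78 = 0.9912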
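
The combination rule can be checked with a few lines of Python. This is a small sketch; `combine_two` and `combine` are hypothetical names. Applying the two-factor rule repeatedly gives the same value as the expanded three-factor expression, and both equal 1 - (1 - a)(1 - b)(1 - c).

```python
from functools import reduce


def combine_two(cf1, cf2):
    """Combine two certainty factors: CF1 + CF2 - CF1 * CF2."""
    return cf1 + cf2 - cf1 * cf2


def combine(certainty_factors):
    """Fold the two-factor rule over any number of facet confidences."""
    return reduce(combine_two, certainty_factors, 0.0)


# Attributes, links, and page similarity for the pair C1 and C8.
print(round(combine([0.96, 0.0, 0.78]), 4))  # 0.9912
```

Because the rule is commutative and associative, the order in which the facets are combined does not matter.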

Page 19: Grouping Search-Engine Returned Citations for Person Name Queries

Final Confidence Matrix

     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.95  0     0     0     0     0     0.99  0     0
C2         1     0.95  0     0     0     0     0.99  0     0
C3               1     0     0.99  0.99  0     0     0     0
C4                     1     0     0     0.99  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1

Page 20: Grouping Search-Engine Returned Citations for Person Name Queries

Grouping Algorithm

Input: the final confidence matrix.

Output: groups of search-engine returned citations, such that each group refers to the same person.

The idea is: if we are highly confident in {Ci, Cj} and in {Cj, Ck}, then we form the group {Ci, Cj, Ck}.

The threshold we use for “highly confident” is 0.8.

Page 21: Grouping Search-Engine Returned Citations for Person Name Queries

Grouping Algorithm (Cont)

     C1    C2    C3    C4    C5    C6    C7    C8    C9    C10
C1   1     0.95  0     0     0     0     0     0.99  0     0
C2         1     0.95  0     0     0     0     0.99  0     0
C3               1     0     0.99  0.99  0     0     0     0
C4                     1     0     0     0.99  0     0     0
C5                           1     0     0     0     0     0
C6                                 1     0     0     0     0
C7                                       1     0     0     0
C8                                             1     0     0
C9                                                   1     0
C10                                                        1

Highly confident pairs: {C1, C2}, {C2, C3}, {C3, C5}, {C3, C6}, {C4, C7}, {C1, C8}, {C2, C8}

Group 1: {C1, C2, C3, C5, C6, C8}, Group 2: {C4, C7}, Group 3: {C9}, Group 4: {C10}
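
Merging pairs transitively amounts to finding connected components among the citations, where an edge joins two citations whose confidence reaches the threshold. The slides do not say how the merging is implemented; the sketch below uses a union-find structure as one possible choice. It also assumes that "highly confident" means a confidence of at least 0.8, and `group_citations` is a hypothetical name.

```python
def group_citations(pair_confidence, n, threshold=0.8):
    """Group citations 1..n; pair_confidence maps (i, j) to a confidence."""
    parent = list(range(n + 1))  # index 0 is unused

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), confidence in pair_confidence.items():
        if confidence >= threshold:  # assumed reading of the 0.8 threshold
            parent[find(i)] = find(j)

    groups = {}
    for citation in range(1, n + 1):
        groups.setdefault(find(citation), []).append(citation)
    return sorted(groups.values())


# Non-zero off-diagonal entries of the final confidence matrix above.
final_matrix = {(1, 2): 0.95, (1, 8): 0.99, (2, 3): 0.95, (2, 8): 0.99,
                (3, 5): 0.99, (3, 6): 0.99, (4, 7): 0.99}
print(group_citations(final_matrix, 10))
# [[1, 2, 3, 5, 6, 8], [4, 7], [9], [10]]
```

The output matches the four groups listed on the slide, with C9 and C10 left as singleton groups.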

Page 22: Grouping Search-Engine Returned Citations for Person Name Queries

Experimental Results

We choose 10 arbitrary, different names.

For each name we get the first 50 returned citations. The size of the test set is 500.

We use split and merge measures.

Consider 8 returned citations C1, C2, C3, C4, C5, C6, C7, C8.

The correct grouping result:

Group 1: {C1, C2, C4, C6, C7}, Group 2: {C3, C8}, Group 3: {C5}

The grouping result of our system:

Group 1: {C1, C2, C4}, Group 2: {C3, C6, C7}, Group 3: {C5, C8}

The number of splits is 0 + 1 + 1 = 2. The total number of merges is 2. We normalize the split and merge scores.
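
The slides do not define the split and merge measures formally. One reading that reproduces the numbers in this example is sketched below: a system group that mixes citations from k correct groups needs k - 1 splits, and a correct group that ends up scattered over m system groups then needs m - 1 merges. This interpretation and the name `split_merge_counts` are assumptions. Normalization is not sketched because its formula is not given.

```python
def split_merge_counts(correct_groups, system_groups):
    """Count splits and merges needed to turn the system grouping into
    the correct grouping, under the interpretation described above."""
    correct_of = {citation: index
                  for index, group in enumerate(correct_groups)
                  for citation in group}
    # For each correct group, the system groups that hold a piece of it.
    pieces = [set() for _ in correct_groups]
    splits = 0
    for system_index, group in enumerate(system_groups):
        touched = {correct_of[citation] for citation in group}
        splits += len(touched) - 1
        for correct_index in touched:
            pieces[correct_index].add(system_index)
    merges = sum(len(p) - 1 for p in pieces)
    return splits, merges


correct = [{1, 2, 4, 6, 7}, {3, 8}, {5}]
system = [{1, 2, 4}, {3, 6, 7}, {5, 8}]
print(split_merge_counts(correct, system))  # (2, 2)
```

The sketch assumes every citation appears in exactly one correct group and one system group.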

Page 23: Grouping Search-Engine Returned Citations for Person Name Queries

Experimental Results (Cont)

Official College, Sports Network, Student Advantage.

Page 24: Grouping Search-Engine Returned Citations for Person Name Queries

Cases that Caused Missing Merges: Attributes Facet

No shared attributes: 1030 pairs (out of 1036 pairs) in 41 groups for “Larry Wild”.

Only the value of the attribute State is shared: 6 pairs in 41 groups for “Larry Wild”.

Page 25: Grouping Search-Engine Returned Citations for Person Name Queries

Techniques Used to Judge in Case of No Evidence or Weak Evidence

Page 26: Grouping Search-Engine Returned Citations for Person Name Queries

Conclusions

The multi-faceted approach is useful: it achieves a low normalized split score (0.004) and a low normalized merge score (0.014).

No individual facet scored better than using all facets together.

Page 27: Grouping Search-Engine Returned Citations for Person Name Queries

Contributions

Grouped the citations returned for person-name queries by person.

Provided an additional tool for search engine queries.