Page 1:

Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar

5th International Semantic Web Conference, Athens, GA, Nov. 5 – 9, 2006

Ontology-Driven Automatic Entity Disambiguation in Unstructured Text

Acknowledgement: NSF-ITR-IDM Award #0325464 ‘SemDIS: Discovering Complex Relationships in the Semantic Web’

Page 2:

The Question is …

• How to determine the most likely match of a named-entity in unstructured text?

• Example:

Which “A. Joshi” is this text referring to?

• out of, say, 20 candidate entities (in a populated ontology)

Page 3:

“likely match” = confidence score

• Idea is to spot entity names in text and assign each potential match a confidence score

• The confidence score represents a degree of certainty that a spotted entity refers to a particular object in the ontology

Page 4:

Our Approach, three steps

1. Spot Entity Names
   – assign initial confidence score

2. Adjust confidence score using:
   – proximity relationships (text)
   – co-occurrence relationships (text)
   – connections (graph)
   – popular entities (graph)

3. Iterate to propagate results
   – finish when confidence scores are no longer updated (a sketch of this loop follows below)
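The slides contain no code; purely as an illustration of this three-step loop, here is a minimal Python sketch. Everything in it (the Candidate class, boost_once, the adjuster convention) is an assumption invented for this sketch, not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    """One spotted entity name paired with one candidate match (URI)
    from the populated ontology."""
    name: str
    uri: str
    confidence: float = 0.0
    applied: set = field(default_factory=set)  # boost keys already applied

def boost_once(cand, key, amount):
    """Apply a named boost at most once, so that iteration converges."""
    if key not in cand.applied:
        cand.applied.add(key)
        cand.confidence += amount

def disambiguate(candidates, adjusters):
    """Steps 2 and 3: apply every adjuster (proximity, co-occurrence,
    graph connections, popularity) and iterate until no confidence
    score is updated."""
    changed = True
    while changed:
        before = [c.confidence for c in candidates]
        for cand in candidates:
            for adjust in adjusters:
                adjust(cand)
        changed = [c.confidence for c in candidates] != before
    # The highest-scoring candidate for each spotted name is the likely match.
    return sorted(candidates, key=lambda c: -c.confidence)
```

The boost_once bookkeeping is one simple way to guarantee termination: each named clue can raise a candidate's score only once.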

Page 5:

Spotting Entity Names

• Search document for entity names within the ontology

• Each match becomes a “candidate entity”

• Assign initial confidence scores
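Continuing the illustrative sketch above (and reusing its hypothetical Candidate class), step 1 might look like this; the real system also records where in the text each name occurs:

```python
def spot_entities(text, name_to_uris, initial_score=0.5):
    """Step 1 sketch: every ontology entity name found in the document
    becomes one Candidate per matching URI, so an ambiguous name like
    "A. Joshi" yields several candidates sharing one initial score.

    `name_to_uris` maps a name to all URIs in the ontology carrying it;
    the initial confidence value is invented."""
    candidates = []
    for name, uris in name_to_uris.items():
        if name in text:  # the real system also records positions
            for uri in uris:
                candidates.append(Candidate(name, uri, initial_score))
    return candidates
```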

Page 6:

Using Text-proximity Relationships

• Relationships that can be expected to appear in near text-proximity of the entity
  – Measured in terms of character spaces (see the sketch below)
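A hedged sketch of such a proximity check, reusing the hypothetical boost_once helper from the earlier sketch; the window size and boost amount are invented, since the slides only say proximity is measured in character spaces:

```python
def proximity_boost(cand, text, related_literal, window=50, amount=0.2):
    """Boost a candidate when a literal related to its entity in the
    ontology (e.g. an affiliation name) appears within `window`
    characters of the spotted name. Only the first occurrence of each
    string is checked here, to keep the sketch short."""
    name_pos = text.find(cand.name)
    lit_pos = text.find(related_literal)
    if name_pos != -1 and lit_pos != -1 and abs(lit_pos - name_pos) <= window:
        boost_once(cand, ("proximity", related_literal), amount)
```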

Page 7:

Using Co-occurrence Relations

• Similar to text-proximity, except that proximity is not relevant
  – i.e., location within the document does not matter (see the sketch below)
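The corresponding co-occurrence check differs from the proximity sketch only in that distance is ignored (again reusing the hypothetical boost_once helper):

```python
def cooccurrence_boost(cand, text, related_literal, amount=0.1):
    """Like proximity_boost, except that location does not matter:
    any occurrence of the related literal in the document counts."""
    if related_literal in text:
        boost_once(cand, ("cooccurrence", related_literal), amount)
```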

Page 8:

Using Popular Entities (graph)

• Intention: bias toward popular entities, on the assumption that the right match is often the most popular one

• This should be used with care, depending on the domain

• Good for tie-breaking

• DBLP scenario: entity with more papers
  – e.g., only two “A. Joshi” entities with >50 papers (sketch below)
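A sketch of this tie-breaker under the same assumptions; the 50-paper threshold echoes the “A. Joshi” example above, while the boost amount is invented:

```python
def popularity_boost(cand, paper_counts, threshold=50, amount=0.1):
    """Favor entities with many publications; `paper_counts` maps a
    URI to its number of papers in DBLP."""
    if paper_counts.get(cand.uri, 0) > threshold:
        boost_once(cand, "popularity", amount)
```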

Page 9:

Using Relations to Other Entities

• Entities can be related to one another through their collaboration network
  – ‘Neighboring’ entities get a boost in their confidence score

• i.e., propagation

– This is the ‘iterative’ step in our approach
  • It starts with the entities having the highest confidence scores

– Example (a propagation sketch follows below):

  “Conference Program Committee Members:”
  - Professor Smith
  - Professor Smith’s co-editor on a recent book
  - Professor Smith’s recently graduated Ph.D. advisee
  - . . .
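A sketch of this propagation step under the same assumptions: the highest-confidence candidates are taken as trusted, and candidates connected to them in the graph (e.g. co-authors) receive a boost:

```python
def propagate(candidates, neighbors, top_k=5, amount=0.15):
    """Take the current highest-confidence candidates as trusted and
    boost every candidate whose entity is connected to one of them;
    `neighbors` maps a URI to the set of URIs it is related to
    (e.g. co-authors). top_k and the boost amount are invented."""
    ranked = sorted(candidates, key=lambda c: -c.confidence)
    trusted = {c.uri for c in ranked[:top_k]}
    for cand in candidates:
        if cand.uri not in trusted and neighbors.get(cand.uri, set()) & trusted:
            boost_once(cand, "propagation", amount)
```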

Page 10:

In Summary, ontology-driven

• Using “clues”
  – from the text where the entity appears
  – from the ontology

Example: RDF/XML snippet of a person’s metadata

Page 11:

Overview of System Architecture

Page 12:

Once no more iterations are needed

• Output of results: XML format (a hypothetical snippet follows below)
  – URI
  – Confidence score
  – Entity name (as it appears in the text)
  – Start and end position (location in the document)

• Can easily be converted to other formats
  – Microformats, RDFa, ...
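The slides do not show the exact schema; a hypothetical record with the four listed fields might look like this (all element names are assumptions):

```xml
<!-- Hypothetical record; only the four fields are given on the slide,
     the element names are assumptions. -->
<spottedEntity>
  <uri>http://lsdis.cs.uga.edu/...#A._Joshi_1</uri>
  <confidence>0.92</confidence>
  <name>A. Joshi</name>
  <startPosition>1043</startPosition>
  <endPosition>1051</endPosition>
</spottedEntity>
```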

Page 13:

Sample Output

Page 14:

Sample Output - Microformat

Page 15:

Evaluation: Gold Standard Set

• We evaluate our method using a gold standard set of documents

– Randomly chose 20 consecutive posts from DBWorld

– Set of manually disambiguated documents: two humans validated the ‘right’ entity match

– We used precision and recall as the evaluation measures for our system

Page 16:

Evaluation, sample DBWorld post

Page 17:

Sample disambiguated document

Page 18:

Using DBLP data as ontology

• Converted DBLP’s bibliographic data to RDF
  – 447,121 authors
  – A SAX parser to convert DBLP’s XML data to RDF (see the sketch below)
  – Created relationships such as “co-author”
  – Added:
    • Affiliations (for a subset of authors)
    • Areas of interest (for a subset of authors)
    • Spellings for international characters

• Lessons learned led us to create SwetoDblp (containing many improvements)

[DBLP] http://www.informatik.uni-trier.de/~ley/db/

[SwetoDblp] http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
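As a rough illustration of such a SAX pass (the actual converter emitted RDF; this fragment only gathers co-author groups, using element names from the public dblp.xml format):

```python
import xml.sax

class DblpCoauthorHandler(xml.sax.ContentHandler):
    """Collect the author lists of DBLP publications, from which
    "co-author" relationships can be emitted as RDF triples."""
    PUBS = {"article", "inproceedings", "incollection", "book"}

    def __init__(self):
        super().__init__()
        self.in_author = False
        self.buf = []
        self.current = []         # authors of the publication being parsed
        self.coauthor_groups = []

    def startElement(self, name, attrs):
        if name == "author":
            self.in_author = True
            self.buf = []

    def characters(self, content):
        if self.in_author:
            self.buf.append(content)

    def endElement(self, name):
        if name == "author":
            self.in_author = False
            self.current.append("".join(self.buf))
        elif name in self.PUBS:
            if len(self.current) > 1:
                self.coauthor_groups.append(tuple(self.current))
            self.current = []

# Usage sketch:
# handler = DblpCoauthorHandler()
# xml.sax.parse("dblp.xml", handler)
# ...then emit RDF triples from handler.coauthor_groups
```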

Page 19:

Evaluation, Precision & Recall

• We define set A as the set of unique names identified using the disambiguated dataset (i.e., exact results)

• We define set B as the set of entities found by our method

• A ∩ B represents the set of entities correctly identified by our method

Page 20:

Evaluation, Precision & Recall

• Precision is the proportion of correctly disambiguated entities with regard to B (in symbols below)

• Recall is the proportion of correctly disambiguated entities with regard to A
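In symbols, using the sets A and B defined on the previous slide:

```latex
\mathrm{Precision} = \frac{|A \cap B|}{|B|}
\qquad
\mathrm{Recall} = \frac{|A \cap B|}{|A|}
```

These match the totals on the next slide: 602/620 ≈ 97.1% precision and 602/758 ≈ 79.4% recall.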

Page 21:

Evaluation, Results

• Precision and recall (compared to the gold standard):

  Correct Disambiguation: 602
  Found Entities: 620
  Total Entities: 758
  Precision: 97.1%
  Recall: 79.4%

• Precision and recall on a per-document basis:

[Chart: “Precision and Recall” — precision and recall percentages per document (1–20); x-axis: Documents, y-axis: Percentage]

Page 22:

Related Work

• Semex Personal Information Management:
  – The results of disambiguated entities are propagated to other ambiguous entities, which can then be reconciled based on recently reconciled entities (much like our work does)
  – Takes advantage of predictable structure, such as fields where an email or name is expected to appear

• Our approach works with unstructured data

[Semex] Dong, Halevy, Madhavan, SIGMOD-2005

Page 23:

Related Work

• KIM
  – Contains an entity recognition portion that uses natural language processing
  – Evaluations performed on human-annotated corpora

• SCORE Technology (now http://www.fortent.com/)

– Uses associations from a knowledge base, yet implementation details are not available (commercial product)

[SCORE] Sheth et al., IEEE Internet Computing, 6(4), 2002
[KIM] Popov et al., ISWC-2003

Page 24:

Conclusions

• Our method uses relationships between entities in the ontology to go beyond traditional syntax-based disambiguation techniques

• This work is among the first to successfully use relationships for identifying named-entities in text without relying on the structure of the text

Page 25:

Future Work

• Improvements on spotting
  – e.g., canonical names (Tim = Timothy)

• Integration/deployment as a UIMA component
  – Allows analysis across a document collection
  – For applications such as semantic annotation and search

• Further evaluations
  – Using different datasets and document sets
  – Compare with respect to other methods
  – Determine the best contributing factor in disambiguation
  – Measure how far down the list the ‘right’ entity was missed

[UIMA] IBM’s Unstructured Information Management Architecture

Page 26:

Scalability, Semantics, Automation

• Usage of background knowledge in the form of a (large) populated ontology

• Flexibility to use a different ontology, but
  – the ontology must ‘fit’ the domain

• It’s an ‘automatic’ approach, yet …
  – A human defines threshold values (and some weights)

Page 27:

References

1. Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C., Ding, L., Kolari, P., Sheth, A., Arpinar, B., Joshi, A., Finin, T.: Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. 15th International World Wide Web Conference, Edinburgh, Scotland (2006)
2. DBWorld. http://www.cs.wisc.edu/dbworld/ (accessed April 9, 2006)
3. Dong, X. L., Halevy, A., Madhavan, J.: Reference Reconciliation in Complex Information Spaces. Proc. of SIGMOD, Baltimore, MD (2005)
4. Ley, M.: The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. Proc. of the 9th International Symposium on String Processing and Information Retrieval, Lisbon, Portugal (Sept. 2002) 1-10
5. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM - Semantic Annotation Platform. Proc. of the 2nd Intl. Semantic Web Conf., Sanibel Island, Florida (2003)
6. Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y.: Managing Semantic Content for the Web. IEEE Internet Computing, 6(4) (2002) 80-87
7. Zhu, J., Uren, V., Motta, E.: ESpotter: Adaptive Named Entity Recognition for Web Browsing. 3rd Professional Knowledge Management Conference, Kaiserslautern, Germany (2005)

Evaluation datasets at: http://lsdis.cs.uga.edu/~aleman/publications/Hassell_ISWC2006/