8/28/08

Wikitology: Wikipedia as an Ontology
Tim Finin, Zareen Syed and Anupam Joshi
University of Maryland, Baltimore County
http://ebiquity.umbc.edu/resource/html/id/250/
Motivation

Identifying the topics and concepts associated with text or text entities is a task common to many applications:
– Annotation and categorization of documents
– Modelling user interests
– Business intelligence
– Selecting advertisements
– Improving information retrieval
– Better named entity extraction and disambiguation
What’s a document about?
Two common approaches:
(1) Select words and phrases using TF-IDF that characterize the document
(2) Map document to a list of terms from a controlled vocabulary or ontology
(1) is flexible and doesn’t require creating and maintaining an ontology
(2) can connect documents to a rich knowledge base
Wikitology!

• Using Wikipedia as an ontology offers the best of both approaches
– each article (~4M) is a concept in the ontology
– terms linked via Wikipedia’s category system (~200k) and inter-article links
– lots of structured and semi-structured data
• It’s a consensus ontology created and maintained by a diverse community
• Broad coverage, multilingual, very current
• Overall content quality is high
Constructing the Wikitology KB

[Figure: the Wikitology KB is assembled from Wikipedia together with WordNet, Yago, the Freebase KB, databases, human input & editing, and RDF and OWL statements]
ACE 2008
• ACE 2008 is a NIST-sponsored exercise in entity extraction from text
• Focus on resolving entities across documents, e.g., “Dr. Rice” mentioned in doc 18397 is the same as “Secretary of State” in doc 46281
• 20K documents in English and Arabic
• We participated on a team from the JHU Human Language Technology Center of Excellence
[Figure: processing pipeline from Documents to KB entities, through NLP, FEAT (featurizer), ML, and clustering stages]
ACE 2008
• BBN’s Serif system produces text annotated with named entities (people or organizations): Dr. Rice, Ms. Rice, the secretary, she, secretary Rice
• Featurizers score pairs of entities for co-reference: (CNN-264772-E32, AFP-7373726-E19, 0.6543)
• A machine learning system combines the evidence
• A simple clustering algorithm identifies clusters
Wikitology tagging
• Using Serif’s output, we produced an entity document for each entity, including the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions
• We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity
• We used the vectors to compute features measuring entity pair similarity/dissimilarity
Entity Document & Tags

<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER
Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NOM: "Mr." "friend" "income"
PRO: "he" "him" "his"
, . abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
</TEXT>
</DOC>
Wikitology article tag vector
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222
Wikitology category tag vector
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167
Wikitology derived features
• Seven features measured entity similarity using cosine similarity of various length article or category vectors
• Five features measured entity dissimilarity:
– two PER entities match different Wikitology persons
– two entities match Wikitology tags in a disambiguation set
– two ORG entities match different Wikitology organizations
– two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2)
– two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2)
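The similarity features are cosine similarities between sparse Wikitology tag vectors. A minimal sketch of that computation on {tag: weight} dicts (the second entity's weights are made up for illustration; this is not the team's actual code):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse tag vectors given as {tag: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Category tag vector for the Webb Hubbell entity document (weights as above)
e1 = {"Clinton_administration_controversies": 0.204,
      "American_political_scandals": 0.204,
      "Living_people": 0.201}
# A second, hypothetical entity's category vector
e2 = {"American_political_scandals": 0.3,
      "Living_people": 0.25,
      "1949_births": 0.1}

sim = cosine(e1, e2)   # a value in (0, 1]; identical vectors give 1.0
```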
Challenges

• Wikitology tagging is expensive
– ~2 seconds/document on a single processor
– Took ~24 hrs on a cluster for 150K entity docs
– A spreading activation algorithm on the underlying graphs improves accuracy at even more cost
• Exploiting the RDF metadata and data and the underlying graphs requires reasoning and graph processing
• Extracting entities from Wiki text to find more relations requires more graph processing
Next Steps
• Construct a Web-based API and demo system to facilitate experimentation
• Process Wikitology updates in real time
• Exploit machine learning to classify pages and improve performance
• Better use of clusters via Hadoop, etc.
• Exploit cell processor technology for spreading activation and other graph-based algorithms
– e.g., recognize people by the graph of relations they are part of
Spreading activation example

[Figure: four animation slides showing spreading activation on a small weighted graph of six nodes, with edge weights between 0.3 and 1.0. Each pulse updates the activation vector as a_t = W a_{t-1}, carrying the initial activation a0 through a1 to a2.]
SA as matrix multiplication
• Good news: SA is matrix multiplication
– Model the graph as an n×n matrix W where Wij is the strength of the connection from node i to node j
– Vector A of length n, where Ai is node i’s activation
– A(t) = W*A(t-1)
• Bad news: n is huge
– 140K category nodes and 4.2M edges
– 2.9M articles and 50M edges
• Good news: the matrices are sparse
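A toy version of the update A(t) = W*A(t-1), on a small hypothetical four-node graph rather than the six-node graph in the slides (whose weights are only partly recoverable). Here W[i, j] is taken as the weight of the edge from node j to node i, so one multiplication pushes activation forward along the edges:

```python
import numpy as np

# Hypothetical weighted graph: W[i, j] = weight of the edge from node j to node i
W = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.5],
    [0.0, 0.3, 1.0, 0.0],
])

a = np.array([1.0, 0.0, 0.0, 0.0])   # a0: all activation starts on node 0
history = [a]
for _ in range(2):                    # two pulses: a0 -> a1 -> a2
    a = W @ a                         # A(t) = W * A(t-1)
    history.append(a)
```

After the first pulse activation flows to nodes 1 and 2; after the second it spreads further and accumulates on node 3, which receives input along two edges.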
Sparse Matrix Vector Multiplication
Exploiting parallelism for sparse matrix-vector multiplication (SpMV) faces several challenges:
• High storage overhead and indirect, irregular memory access patterns
• How to parallelize
• Load balancing
Sparse Matrix Representation
Compressed Sparse Row (CSR) is a simple storage format:
Values: the non-zero values in the matrix
Columns: column indices of the non-zero values
Pointer B: for each row, the index in Values of its first non-zero value
Pointer E: for each row, the index in Values just past its last non-zero value
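A minimal sketch of the four-array CSR layout and the SpMV loop it supports, in plain Python (illustrative helper names, not any particular library's API):

```python
def to_csr(dense):
    """Build the four CSR arrays (Values, Columns, Pointer B, Pointer E)
    from a dense matrix given as a list of row lists."""
    values, columns, ptr_b, ptr_e = [], [], [], []
    for row in dense:
        ptr_b.append(len(values))          # index of the row's first non-zero
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                columns.append(j)
        ptr_e.append(len(values))          # one past the row's last non-zero
    return values, columns, ptr_b, ptr_e

def spmv(values, columns, ptr_b, ptr_e, x):
    """y = A @ x using the CSR arrays; each row is an independent dot product."""
    return [sum(values[k] * x[columns[k]] for k in range(b, e))
            for b, e in zip(ptr_b, ptr_e)]

A = [[0.0, 0.5, 0.0],
     [0.9, 0.0, 0.0],
     [0.0, 0.3, 1.0]]
values, columns, ptr_b, ptr_e = to_csr(A)
y = spmv(values, columns, ptr_b, ptr_e, [1.0, 1.0, 1.0])
```

Because each output row depends only on its own slice of Values/Columns, the outer loop in `spmv` is the natural unit to partition among threads.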
Thread Level Parallelism
• Partition matrix rows among processors
• Statically load balance SpMV by distributing the non-zero values approximately equally among processors/threads

Heuristic Load Balancing

• Sort rows in decreasing order of number of non-zeros
• Assign rows to processes/threads iteratively:
– Assign row #1 to process 0
– Assign each subsequent row to the process with the smallest total number of non-zeros
• This guarantees the maximum difference in the number of non-zero values between any two processes/threads is at most the largest number of non-zeros in a row
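This greedy heuristic can be sketched with a min-heap over thread loads (a hypothetical helper, not the actual implementation):

```python
import heapq

def balance_rows(row_nnz, nthreads):
    """Assign each row, in decreasing order of non-zero count, to the
    currently least-loaded thread. Returns {row_index: thread_id}."""
    heap = [(0, t) for t in range(nthreads)]   # (total nnz assigned, thread id)
    assignment = {}
    for row in sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r]):
        load, t = heapq.heappop(heap)          # least-loaded thread
        assignment[row] = t
        heapq.heappush(heap, (load + row_nnz[row], t))
    return assignment

# Hypothetical non-zero counts per row
row_nnz = [10, 9, 8, 2, 2, 1, 1, 1]
assignment = balance_rows(row_nnz, 3)
```

The imbalance bound follows because the last row added to the heaviest thread went to what was then the lightest thread, so the gap it created is at most that row's non-zero count.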
Conclusion

• Our initial experiments showed that the Wikitology idea has merit
• Wikipedia is increasingly being used as a knowledge source of choice
• Easily extendable to other wikis and collaborative KBs, e.g., Intellipedia
• Serious use requires exploiting cluster machines and cell processing
• Key processing of associated graph data can exploit cell architecture