8/28/08

Wikitology: Wikipedia as an Ontology
Tim Finin, Zareen Syed and Anupam Joshi
University of Maryland, Baltimore County
http://ebiquity.umbc.edu/resource/html/id/250/
Motivation

Identifying the topics and concepts associated with text or text entities is a task common to many applications:
– Annotation and categorization of documents
– Modelling user interests
– Business intelligence
– Selecting advertisements
– Improving information retrieval
– Better named entity extraction and disambiguation
What’s a document about?
Two common approaches:
(1) Select words and phrases using TF-IDF that characterize the document
(2) Map document to a list of terms from a controlled vocabulary or ontology
(1) is flexible and doesn’t require creating and maintaining an ontology
(2) can connect documents to a rich knowledge base
Wikitology!

• Using Wikipedia as an ontology offers the best of both approaches
– each article (~4M) is a concept in the ontology
– terms linked via Wikipedia’s category system (~200k) and inter-article links
– lots of structured and semi-structured data
• It’s a consensus ontology created and maintained by a diverse community
• Broad coverage, multilingual, very current
• Overall content quality is high
Constructing the Wikitology KB

[Figure: the Wikitology KB is assembled from Wikipedia together with WordNet, Yago, the Freebase KB, databases, human input & editing, and RDF and OWL statements]
ACE 2008
• ACE 2008 is a NIST-sponsored exercise in entity extraction from text
• Focus on resolving entities across documents, e.g., “Dr. Rice” mentioned in doc 18397 is the same as “Secretary of State” in doc 46281
• 20K documents in English and Arabic
• We participated on a team from the JHU Human Language Technology Center of Excellence
[Figure: processing pipeline from Documents to KB entities, through NLP, FEAT (featurizer), ML, and clustering stages]
ACE 2008
• BBN’s Serif system produces text annotated with named entities (people or organizations): Dr. Rice, Ms. Rice, the secretary, she, secretary Rice
• Featurizers score pairs of entities for co-reference: (CNN-264772-E32, AFP-7373726-E19, 0.6543)
• A machine learning system combines the evidence
• A simple clustering algorithm identifies clusters
Wikitology tagging
• Using Serif’s output, we produced an entity document for each entity, including the entity’s name, nominal and pronominal mentions, APF type and subtype, and words in a window around the mentions
• We tagged entity documents using Wikitology, producing vectors of (1) terms and (2) categories for the entity
• We used the vectors to compute features measuring entity pair similarity/dissimilarity
Entity Document & Tags

<DOC>
<DOCNO>ABC19980430.1830.0091.LDC2000T44-E2</DOCNO>
<TEXT>
Webb Hubbell
PER
Individual
NAM: "Hubbell" "Hubbells" "Webb Hubbell" "Webb_Hubbell"
NOM: "Mr." "friend" "income"
PRO: "he" "him" "his"
, . abc's accountant after again ago all alleges alone also and arranged attorney avoid been before being betray but came can cat charges cheating circle clearly close concluded conspiracy cooperate counsel counsel's department did disgrace do dog dollars earned eightynine enough evasion feel financial firm first four friend friends going got grand happening has he help him his hope house hubbell hubbells hundred hush income increase independent indict indicted indictment inner investigating jackie jackie_judd jail jordan judd jury justice kantor ken knew lady late law left lie little make many mickey mid money mr my nineteen nineties ninetyfour not nothing now office other others paying peter_jennings president's pressure pressured probe prosecutors questions reported reveal rock saddened said schemed seen seven since starr statement such tax taxes tell them they thousand time today ultimately vernon washington webb webb_hubbell were what's whether which white whitewater why wife years
</TEXT>
</DOC>
Wikitology article tag vector
Webster_Hubbell 1.000
Hubbell_Trading_Post_National_Historic_Site 0.379
United_States_v._Hubbell 0.377
Hubbell_Center 0.226
Whitewater_controversy 0.222
Wikitology category tag vector
Clinton_administration_controversies 0.204
American_political_scandals 0.204
Living_people 0.201
1949_births 0.167
People_from_Arkansas 0.167
Arkansas_politicians 0.167
American_tax_evaders 0.167
Arkansas_lawyers 0.167
Wikitology derived features
• Seven features measured entity similarity using cosine similarity of various length article or category vectors
• Five features measured entity dissimilarity:
– two PER entities match different Wikitology persons
– two entities match Wikitology tags in a disambiguation set
– two ORG entities match different Wikitology organizations
– two PER entities match different Wikitology persons, weighted by 1-abs(score1-score2)
– two ORG entities match different Wikitology orgs, weighted by 1-abs(score1-score2)
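The similarity features are cosine similarities between sparse Wikitology tag vectors. A minimal sketch of that computation on {tag: weight} dicts (the second entity's weights are made up for illustration; this is not the team's actual code):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse tag vectors given as {tag: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Category tag vector for the Webb Hubbell entity document (weights as above)
e1 = {"Clinton_administration_controversies": 0.204,
      "American_political_scandals": 0.204,
      "Living_people": 0.201}
# A second, hypothetical entity's category vector
e2 = {"American_political_scandals": 0.3,
      "Living_people": 0.25,
      "1949_births": 0.1}

sim = cosine(e1, e2)   # a value in (0, 1]; identical vectors give 1.0
```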
Challenges

• Wikitology tagging is expensive
– ~2 seconds/document on a single processor
– Took ~24 hrs on a cluster for 150K entity docs
– A spreading activation algorithm on the underlying graphs improves accuracy at even more cost
• Exploiting the RDF metadata and data and the underlying graphs requires reasoning and graph processing
• Extracting entities from Wiki text to find more relations requires more graph processing
Next Steps
• Construct a Web-based API and demo system to facilitate experimentation
• Process Wikitology updates in real time
• Exploit machine learning to classify pages and improve performance
• Better use of clusters via Hadoop, etc.
• Exploit cell processor technology for spreading activation and other graph-based algorithms
– e.g., recognize people by the graph of relations they are part of
Spreading activation example

[Figure: four animation slides showing spreading activation on a small weighted graph of six nodes, with edge weights between 0.3 and 1.0. Each pulse updates the activation vector as a_t = W a_{t-1}, carrying the initial activation a0 through a1 to a2.]
SA as matrix multiplication
• Good news: SA is matrix multiplication
– Model the graph as an n×n matrix W where Wij is the strength of the connection from node i to node j
– Vector A of length n, where Ai is node i’s activation
– A(t) = W*A(t-1)
• Bad news: n is huge
– 140K category nodes and 4.2M edges
– 2.9M articles and 50M edges
• Good news: the matrices are sparse
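A toy version of the update A(t) = W*A(t-1), on a small hypothetical four-node graph rather than the six-node graph in the slides (whose weights are only partly recoverable). Here W[i, j] is taken as the weight of the edge from node j to node i, so one multiplication pushes activation forward along the edges:

```python
import numpy as np

# Hypothetical weighted graph: W[i, j] = weight of the edge from node j to node i
W = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.5],
    [0.0, 0.3, 1.0, 0.0],
])

a = np.array([1.0, 0.0, 0.0, 0.0])   # a0: all activation starts on node 0
history = [a]
for _ in range(2):                    # two pulses: a0 -> a1 -> a2
    a = W @ a                         # A(t) = W * A(t-1)
    history.append(a)
```

After the first pulse activation flows to nodes 1 and 2; after the second it spreads further and accumulates on node 3, which receives input along two edges.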
Sparse Matrix Vector Multiplication
Exploiting parallelism for sparse matrix-vector multiplication (SpMV) faces several challenges:
• High storage overhead and indirect, irregular memory access patterns
• How to parallelize
• Load balancing
Sparse Matrix Representation
Compressed Sparse Row (CSR) is a simple storage format:
Values: the non-zero values in the matrix
Columns: column indices of the non-zero values
Pointer B: for each row, the index in Values of its first non-zero value
Pointer E: for each row, the index in Values just past its last non-zero value
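A minimal sketch of the four-array CSR layout and the SpMV loop it supports, in plain Python (illustrative helper names, not any particular library's API):

```python
def to_csr(dense):
    """Build the four CSR arrays (Values, Columns, Pointer B, Pointer E)
    from a dense matrix given as a list of row lists."""
    values, columns, ptr_b, ptr_e = [], [], [], []
    for row in dense:
        ptr_b.append(len(values))          # index of the row's first non-zero
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                columns.append(j)
        ptr_e.append(len(values))          # one past the row's last non-zero
    return values, columns, ptr_b, ptr_e

def spmv(values, columns, ptr_b, ptr_e, x):
    """y = A @ x using the CSR arrays; each row is an independent dot product."""
    return [sum(values[k] * x[columns[k]] for k in range(b, e))
            for b, e in zip(ptr_b, ptr_e)]

A = [[0.0, 0.5, 0.0],
     [0.9, 0.0, 0.0],
     [0.0, 0.3, 1.0]]
values, columns, ptr_b, ptr_e = to_csr(A)
y = spmv(values, columns, ptr_b, ptr_e, [1.0, 1.0, 1.0])
```

Because each output row depends only on its own slice of Values/Columns, the outer loop in `spmv` is the natural unit to partition among threads.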
Thread Level Parallelism
• Partition matrix rows among processors
• Statically load balance SpMV by distributing the non-zero values approximately equally among processors/threads

Heuristic Load Balancing

• Sort rows in decreasing order of number of non-zeros
• Assign rows to processes/threads iteratively:
– Assign row #1 to process 0
– Assign each subsequent row to the process with the smallest total number of non-zeros
• This guarantees the maximum difference in the number of non-zero values between any two processes/threads is at most the largest number of non-zeros in a row
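This greedy heuristic can be sketched with a min-heap over thread loads (a hypothetical helper, not the actual implementation):

```python
import heapq

def balance_rows(row_nnz, nthreads):
    """Assign each row, in decreasing order of non-zero count, to the
    currently least-loaded thread. Returns {row_index: thread_id}."""
    heap = [(0, t) for t in range(nthreads)]   # (total nnz assigned, thread id)
    assignment = {}
    for row in sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r]):
        load, t = heapq.heappop(heap)          # least-loaded thread
        assignment[row] = t
        heapq.heappush(heap, (load + row_nnz[row], t))
    return assignment

# Hypothetical non-zero counts per row
row_nnz = [10, 9, 8, 2, 2, 1, 1, 1]
assignment = balance_rows(row_nnz, 3)
```

The imbalance bound follows because the last row added to the heaviest thread went to what was then the lightest thread, so the gap it created is at most that row's non-zero count.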
Conclusion

• Our initial experiments showed that the Wikitology idea has merit
• Wikipedia is increasingly being used as a knowledge source of choice
• Easily extendable to other wikis and collaborative KBs, e.g., Intellipedia
• Serious use requires exploiting cluster machines and cell processing
• Key processing of associated graph data can exploit cell architecture