Learning the Semantic Meaning of a Concept from the Web
Yang Yu
Master’s Thesis Defense
August 03, 2006
2
The Problem
Manually preparing training data for text-classification-based ontology mapping is expensive.
LIVING_THINGS
  ANIMAL
    HUMAN
      MAN
      WOMAN
    CAT
  PLANT
    TREE
      ARBOR
      FRUTEX
    GRASS
3
The Thesis
Automatically collect training data for the concepts defined in an ontology.
Benefits
  Reduce the amount of human work
  Fully automated ontology mapping
http://www.google.com/
4
Overview
Background
  The Semantic Web and ontology
  Ontology mapping
Proposal
System
Experimental results
  WEAPONS ontology
  LIVING_THINGS ontology
Discussion and conclusion
5
Semantic Web and Ontology
What is it? “an extension of the current web”
An Example
Find all types of jets that are made in the USA
[Diagram: (jet) Made-in WA; WA partOf USA]
6
Ontology Mapping
Interoperability problem
  Independently developed ontologies for the same or overlapping domains
Mapping
  r = f(Ci, Cj), where i = 1, …, n and j = 1, …, m
  r ∈ {equivalent, subClassOf, superClassOf, complement, overlapped, other}
7
Approaches to Ontology Mapping
  Manual mapping
  String matching
  Text classification
    The semantic meaning of a concept is reflected in the training data that use the concept
    Probabilistic feature model
    Classification
    Results depend heavily on the training data
8
Motivation
Preparing exemplars manually is costly
Billions of documents are available on the web
Search engines
9
The Proposal
Use the concepts defined in an ontology as queries and process the search results to obtain exemplars
Verification
  Build a prototype system
  Check ontology mapping results
10
System overview – Part I
[Diagram: Ontology A → Parser → Queries → Search Engine → Links to Web Pages → Retriever (fetches from the WWW) → HTML Docs → Processor → Text Files]
11
The parser (Query expansion)
Example: in the hierarchy FOOD → FRUIT → {APPLE, ORANGE}, the query for APPLE is FOOD+FRUIT+APPLE
Concepts | Queries
living+things | living+things
animal | living+things+animal
plant | living+things+plant
cat | living+things+animal+cat
human | living+things+animal+human
man | living+things+animal+human+man
woman | living+things+animal+human+woman
tree | living+things+plant+tree
grass | living+things+plant+grass
frutex | living+things+plant+tree+Frutex
arbor | living+things+plant+tree+arbor
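The query expansion listed above can be sketched in a few lines of Python. This is only an illustration, not the thesis's parser: the `PARENT` dictionary is a hypothetical hand-written stand-in for a parsed ontology, and the function names are assumptions. Each query is the path from the root class down to the concept, joined with `+`.

```python
from typing import Dict, List, Optional

# Hypothetical child -> parent map standing in for the parsed LIVING_THINGS ontology.
PARENT: Dict[str, Optional[str]] = {
    "living things": None,
    "animal": "living things",
    "plant": "living things",
    "human": "animal",
    "cat": "animal",
    "man": "human",
    "woman": "human",
    "tree": "plant",
    "grass": "plant",
    "arbor": "tree",
    "frutex": "tree",
}


def path_from_root(concept: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the class names from the root down to the given concept."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path


def expand_query(concept: str, parent: Dict[str, Optional[str]]) -> str:
    """Build a search query by joining every word on the root-to-concept path with '+'."""
    words: List[str] = []
    for name in path_from_root(concept, parent):
        words.extend(name.split())
    return "+".join(words)


if __name__ == "__main__":
    for concept in PARENT:
        print(concept, "->", expand_query(concept, PARENT))
```

Including the ancestors in the query gives the search engine context, so that a word such as "arbor" is searched together with "plant" and "tree" rather than on its own.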
12
The retriever
13
The processor
14
Naïve Bayes text classifier
Bow toolkit
  McCallum, Andrew Kachites, Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, http://www.cs.cmu.edu/~mccallum/bow, 1996.
  rainbow -d model --index dir/*
  rainbow -d model --query
Bayes Rule
Naïve Bayes text classifier
15
Bayes Rule
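\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]
Here P(A | B) is the posterior, P(A) is the prior, and P(B) is the normalizing constant.
The rule follows from the two ways of conditioning the joint probability P(A, B):
\[ P(B \mid A) = \frac{P(A, B)}{P(A)}, \qquad P(A \mid B) = \frac{P(A, B)}{P(B)} \]
[Diagram: events A and B overlapping in the region P(A, B)]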
Mitchell Tom, Machine Learning, McGraw Hill, 1997
16
Naïve Bayes classifier
A text classification problem
  “What’s the most probable classification of the new instance given the training data?”
  vj: category j
  (a1, a2, …, an): attributes of a new document
So Naïve
(Mitchell, Tom, Machine Learning, McGraw Hill, 1997)
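The formula itself did not survive in the transcript. As a reminder, the standard textbook form of the naïve Bayes decision rule, written with the symbols defined on this slide (V for the set of categories is an added symbol), is sketched below. The "naïve" step is the assumption that the attributes are conditionally independent given the category.

```latex
\begin{align*}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
        &= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
        &= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \\
v_{NB}  &= \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
\end{align*}
```

The denominator is dropped in the third line because it does not depend on the category.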
17
System overview– Part II
[Diagram components: Ontology A, Ontology B, Model Builder, Text Files (A), Rainbow, Feature Model, Text Files (B), Rainbow, Calculator, Mapping Results]
18
The model builder
[Diagram: the LIVING_THINGS class hierarchy, shown twice]
Mutually exclusive and exhaustive
Leaf classes
C+ and C-
19
The calculator
The naïve Bayes text classifier tends to give extreme values (1/0)
Tasks
  Feed exemplars to the classifier one by one
  Keep records of classification results
  Take averages and generate a report
20
An Example of the Calculator
[Diagram: exemplars of APC → Classifier → TANK-VEHICLE, AIR-DEFENSE-GUN, SAUDI-NAVAL-MISSILE-CRAFT]
Categories in WeaponsA.n3 | Num. of exemplars
TANK-VEHICLE | 170
AIR-DEFENSE-GUN | 20
SAUDI-NAVAL-MISSILE-CRAFT | 10
Total | 200
P(TANK-VEHICLE | APC) = 170 /200= 0.85
P(AIR-DEFENSE-GUN | APC) = 0.10
P(SAUDI-NAVAL-MISSILE-CRAFT| APC) = 0.05
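The arithmetic of this example can be reproduced with a short Python sketch. It assumes the classifier returns one hard category label per exemplar (consistent with the 1/0 behaviour noted earlier); the function name and the list of predictions are hypothetical, and only the counts come from the slide.

```python
from collections import Counter
from typing import Dict, Iterable


def conditional_probabilities(predictions: Iterable[str]) -> Dict[str, float]:
    """Estimate P(category | new class) as the fraction of exemplars assigned to each category."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no classification results to average")
    return {category: n / total for category, n in counts.items()}


# Hypothetical per-exemplar results for the 200 APC exemplars in the example.
apc_predictions = (
    ["TANK-VEHICLE"] * 170
    + ["AIR-DEFENSE-GUN"] * 20
    + ["SAUDI-NAVAL-MISSILE-CRAFT"] * 10
)

report = conditional_probabilities(apc_predictions)

if __name__ == "__main__":
    for category, p in sorted(report.items(), key=lambda kv: kv[1], reverse=True):
        print(f"P({category} | APC) = {p:.2f}")
```

Averaging hard 1/0 decisions over the exemplars is the same as counting, which is why the estimate for TANK-VEHICLE is simply 170 / 200.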
21
Experiments with WEAPONS ontology
  Information Interpretation and Integration Conference (http://www.atl.lmco.com/projects/ontology/i3con.html)
  WeaponsA.n3 and WeaponsB.n3
    Both define over 80 classes
    More than 60 classes are leaf classes
    Similar structure
22
Part of WeaponsA.n3
[Diagram: class hierarchy. Classes shown: WEAPON, CONVENTIONAL-WEAPON, MODERN-NAVAL-SHIP, WARPLANE, ARMORED-COMBAT-VEHICLE, TANK-VEHICLE, PATROL-CRAFT, AIRCRAFT-CARRIER, SUPER-ETENDARD]
23
Part of WeaponsB.n3
[Diagram: class hierarchy. Classes shown: WEAPON, CONVENTIONAL-WEAPON, MODERN-NAVAL-SHIP, WARPLANE, ARMORED-COMBAT-VEHICLE, TANK-VEHICLE, LIGHT-TANK, APC, PATROL-WARTER-CRAFT, AIRCRAFT-CARRIER, LIGHT-AIRCRAFT-CARRIER, PATROL-BOAT-RIVER, PATROL-BOAT, FIGHTER-PLANE, FIGHTER-ATTACK-PLANE, SUPER-ETENDARD-FIGHTER]
24
Expected Results
TANK-VEHICLE SUPER-ETENDARD
LIGHT-TANK
APC
PATROL-WARTER-CRAFT
AIRCRAFT-CARRIER
LIGHT-AIRCRAFT-CARRIER
PATROL-BOAT-RIVER
PATROL-BOAT
FIGHTER-PLANE
FIGHTER-ATTACK-PLANE
SUPER-ETENDARD-FIGHTER
PATROL-CRAFT
25
A Typical Report
APC
SELF-PROPELLED-ARTILLERY 0.357180681
TANK-VEHICLE 0.277139274
ICBM 0.10423636
MRBM 0.080615147
TOWED-ARTILLERY 0.054724102
SUPPORT-VESSEL 0.023265054
PATROL-CRAFT 0.019570325
MOLOTOV-COCKTAIL 0.015032411
TORPEDO-CRAFT 0.013677696
SUPER-ETENDARD 0.009856519
MORTAR 0.00772997
AIR-DEFENSE-GUN 0.002997109
MACHINE-GUN 0.000211772
MOLOTOV-COCKTAIL 0.000187578
TRUCK-BOMB 0.000171675
AS-9-KYLE-ALCM 0.000156403
ARABIL-100-MISSILE 0.000111953
AL-HIJARAH-MISSILE 7.65E-05
OGHAB-MISSILE 7.12E-05
BADAR-2000 4.28E-05
…
P(Ci | APC), where i = 1, …, 63
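Reading a mapping off such a report amounts to picking the category with the highest conditional probability. The sketch below shows this on the top rows of the APC report; the dictionary layout and the function name are assumptions made for illustration, while the numbers are copied from the report.

```python
from typing import Dict, Tuple

# Top rows of the APC report shown above (category -> conditional probability).
apc_report: Dict[str, float] = {
    "SELF-PROPELLED-ARTILLERY": 0.357180681,
    "TANK-VEHICLE": 0.277139274,
    "ICBM": 0.10423636,
    "MRBM": 0.080615147,
    "TOWED-ARTILLERY": 0.054724102,
}


def best_mapping(report: Dict[str, float]) -> Tuple[str, float]:
    """Return the category with the highest conditional probability and that probability."""
    return max(report.items(), key=lambda item: item[1])


if __name__ == "__main__":
    category, probability = best_mapping(apc_report)
    print(f"APC maps to {category} with probability {probability:.2f}")
```

In this report the desired class, TANK-VEHICLE, comes second, which is why the mapping for APC counts as a miss in this run.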
26
Classes with highest conditional probability
New Classes | Whole file | Prob | Sentences with Keywords | Prob
LIGHT-AIRCRAFT-CARRIER | AIRCRAFT-CARRIER | 0.65 | AIRCRAFT-CARRIER | 0.57
APC | SILKWORM-MISSILE-MOD | 0.46 | SELF-PROPELLED-ARTILLERY | 0.36
SUPER-ETENDARD-FIGHTER | SILKWORM-MISSILE-MOD | 0.66 | MRBM | 0.51
FIGHTER-ATTACK-PLANE | SILKWORM-MISSILE-MOD | 0.83 | MRBM | 0.38
PATROL-WATERCRAFT | SILKWORM-MISSILE-MOD | 0.28 | PATROL-CRAFT | 0.52
PATROL-BOAT-RIVER | SILKWORM-MISSILE-MOD | 0.65 | PATROL-CRAFT | 0.54
PATROL-BOAT | SILKWORM-MISSILE-MOD | 0.51 | PATROL-CRAFT | 0.66
LIGHT-TANK | SILKWORM-MISSILE-MOD | 0.56 | TANK-VEHICLE | 0.3
FIGHTER-PLANE | AIRCRAFT-CARRIER | 0.49 | MRBM | 0.38
P(TANK-VEHICLE | APC) = 0.28
P(SUPER-ETENDARD | SUPER-ETENDARD-FIGHTER) = 0.21
27
Different numbers of exemplars (whole)
New Classes | Group-whole-50 | Prob | Group-whole-100 | Prob
LIGHT-AIRCRAFT-CARRIER | SILKWORM-MISSILE-MOD | 0.60 | AIRCRAFT-CARRIER | 0.65
APC | SILKWORM-MISSILE-MOD | 0.65 | SILKWORM-MISSILE-MOD | 0.46
SUPER-ETENDARD-FIGHTER | SILKWORM-MISSILE-MOD | 0.74 | SILKWORM-MISSILE-MOD | 0.66
FIGHTER-ATTACK-PLANE | SILKWORM-MISSILE-MOD | 0.83 | SILKWORM-MISSILE-MOD | 0.83
PATROL-WATERCRAFT | SILKWORM-MISSILE-MOD | 0.64 | SILKWORM-MISSILE-MOD | 0.28
PATROL-BOAT-RIVER | SILKWORM-MISSILE-MOD | 0.89 | SILKWORM-MISSILE-MOD | 0.65
PATROL-BOAT | SILKWORM-MISSILE-MOD | 0.64 | SILKWORM-MISSILE-MOD | 0.51
LIGHT-TANK | SILKWORM-MISSILE-MOD | 0.62 | SILKWORM-MISSILE-MOD | 0.56
FIGHTER-PLANE | SILKWORM-MISSILE-MOD | 0.80 | AIRCRAFT-CARRIER | 0.49
28
Different numbers of exemplars (sentence)
New Classes | Group-sentence-50 | Prob | Group-sentence-100 | Prob
LIGHT-AIRCRAFT-CARRIER | AIRCRAFT-CARRIER | 0.44 | AIRCRAFT-CARRIER | 0.57
APC | TANK-VEHICLE | 0.54 | SELF-PROPELLED-ARTILLERY | 0.36
SUPER-ETENDARD-FIGHTER | HY-4-C-201-MISSILE | 0.4 | MRBM | 0.51
FIGHTER-ATTACK-PLANE | ICBM | 0.19 | MRBM | 0.38
PATROL-WATERCRAFT | PATROL-CRAFT | 0.49 | PATROL-CRAFT | 0.52
PATROL-BOAT-RIVER | PATROL-CRAFT | 0.36 | PATROL-CRAFT | 0.54
PATROL-BOAT | PATROL-CRAFT | 0.37 | PATROL-CRAFT | 0.66
LIGHT-TANK | TANK-VEHICLE | 0.59 | TANK-VEHICLE | 0.3
FIGHTER-PLANE | MRBM | 0.38 | MRBM | 0.38
29
Comparison of mapping accuracy of different groups of experiments
Groups of experiments | Mapping accuracy judged by desired class mapped
Group-whole-50 | 0%
Group-whole-100 | 11%
Group-sentence-50 | 67%
Group-sentence-100 | 56%
Higher Conditional Probability
30
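Experiment with LIVING_THINGS ontology
  (1) P(MAN | HUMAN) and P(WOMAN | HUMAN)
  (2) Find a mapping for GIRL
[Diagram: LIVING_THINGS hierarchy (ANIMAL, PLANT, HUMAN, MAN, CAT, WOMAN, TREE, ARBOR, GRASS, FRUTEX) with depths marked Level1, Level2, Level3, plus the new class GIRL; inset: HUMAN with subclasses MAN and WOMAN]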
31
Actual Experiment Results: L-1
Results of experiment (1)
Conditional Probability | Using first 50 exemplars | Using first 100 exemplars | Using first 200 exemplars
P(MAN | HUMAN) | 0.75 | 0.58 | 0.62
P(WOMAN | HUMAN) | 0.24 | 0.41 | 0.38
[Diagram: HUMAN with subclasses MAN and WOMAN]
32
Actual Experiment Results: L-2
[Diagram: LIVING_THINGS hierarchy with Level1, Level2, Level3 marked and the new class GIRL]
P(WOMAN | GIRL) = 1
P(MAN | GIRL) = 0
P(CAT | GIRL) = 0.30
P(HUMAN | GIRL) = 0.70
P(PLANT | GIRL) = 0.23
P(ANIMAL | GIRL) = 0.76
P(PYCNOGONID | GIRL) = 0
P(HUMAN | GIRL) = 0.43
P(CAT | GIRL) = 0.01
P(DOG | GIRL) = 0.56
P(MAN | GIRL) = 0.37
P(WOMAN | GIRL) = 0.63
P(CAT | GIRL) = 0.08
P(HUMAN | GIRL) = 0.92
P(PLANT | GIRL) = 0.17
P(ANIMAL | GIRL) = 0.83
With clustering on exemplars Without clustering on exemplars
with additional classes
33
Actual Experiment Results: L-3
Comparison between different numbers of exemplars (sentence)
Conditional Probability | Using first 50 exemplars | Using first 100 exemplars | Using first 200 exemplars
P(ANIMAL | GIRL) | 0.66 | 0.53 | 0.77
P(PLANT | GIRL) | 0.34 | 0.47 | 0.23
P(HUMAN | GIRL) | 0.86 | 0.56 | 0.43
P(CAT | GIRL) | 0.01 | 0.15 | 0.01
P(DOG | GIRL) | 0.13 | 0.29 | 0.56
P(PYCNOGONID | GIRL) | 0 | 0 | 0
P(MAN | GIRL) | 0.02 | 0.03 | 0
P(WOMAN | GIRL) | 0.98 | 0.97 | 1
34
Actual Experiment Results: Different Queries
Queries augmented with class properties
Concepts | Queries
living+things | Living+things
animal | Living+things+animal+Animalia
plant | Living+things+plant+Plantae
cat | Living+things+animal+Animalia+cat+Felidae
human | Living+things+animal+Animalia+human+intelligent
man | Living+things+animal+Animalia+human+intelligent+man+male
woman | Living+things+animal+Animalia+human+intelligent+woman+female
tree | Living+things+plant+Plantae+tree
grass | Living+things+plant+Plantae+grass
frutex | Living+things+plant+Plantae+tree+Frutex
arbor | Living+things+plant+Plantae+tree+arbor
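The augmented queries follow a simple pattern: each class name on the path from the root is followed by its property words, if it has any. The Python sketch below reproduces that pattern. The `PROPERTIES` dictionary and the function name are hypothetical; the property words themselves are taken from the queries listed on this slide.

```python
from typing import Dict, List

# Property words per class, read off the augmented queries on this slide.
PROPERTIES: Dict[str, List[str]] = {
    "animal": ["Animalia"],
    "plant": ["Plantae"],
    "human": ["intelligent"],
    "cat": ["Felidae"],
    "man": ["male"],
    "woman": ["female"],
}


def augmented_query(path: List[str], properties: Dict[str, List[str]]) -> str:
    """Join the class names on a root-to-concept path, each followed by its property words."""
    words: List[str] = []
    for name in path:
        words.extend(name.split())
        words.extend(properties.get(name, []))
    return "+".join(words)


if __name__ == "__main__":
    woman_path = ["Living things", "animal", "human", "woman"]
    print(augmented_query(woman_path, PROPERTIES))
```

Classes without an entry in the dictionary, such as tree or arbor, contribute only their own name.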
35
Actual Experiment Results: L-4
Results of experiment (1) with new queries
Conditional Probability | Whole | Keyword Sentences
P(MAN | HUMAN) | 0.91 | 0.93
P(WOMAN | HUMAN) | 0.09 | 0.07
Results of experiment (2) with new queries
Conditional Probability | Whole | Keyword Sentences
P(ANIMAL | GIRL) | 0.9 | 0.83
P(PLANT | GIRL) | 0.1 | 0.17
P(HUMAN | GIRL) | 0.78 | 0.83
P(CAT | GIRL) | 0.22 | 0.17
P(MAN | GIRL) | 0.14 | 0.16
P(WOMAN | GIRL) | 0.86 | 0.84
[Diagrams: HUMAN with subclasses MAN and WOMAN; LIVING_THINGS hierarchy with Level1, Level2, Level3 marked and the new class GIRL]
36
Limitation 1: An exemplar is not a sample of a concept
  An exemplar is a combination of strings that represents some usage of a concept.
  An exemplar is not an instance of a concept.
  The conditional probability we calculate is therefore only an estimate.
[Diagram: HUMAN with subclasses MAN and WOMAN]
37
Limitation 2: Popularity does not equal relevancy
  Limited by the search engine’s algorithm
    PageRank™
    Popularity does not equal relevancy
  Weights cannot be specified for words in a search query
38
Limitation 3: Relevancy does not equal similarity
Search Results for concept A
Text related to concept A
Text for concept A (i.e. desired exemplars)
Text against concept A
Text for related concept B
39
Related Research
UMBC OntoMapper
  Sushama Prasad, Peng Yun and Finin Tim, A Tool for Mapping between Two Ontologies Using Explicit Information, AAMAS 2002 Workshop on Ontologies and Agent Systems, 2002.
CAIMEN
  Lacher S. Martin and Groh Georg, Facilitating the Exchange of Explicit Knowledge through Ontology Mappings, Proc. of the Fourteenth International FLAIRS Conference, 2001.
GLUE
  Doan Anhai, Madhavan Jayant, Dhamankar Robin, Domingos Pedro, and Halevy Alon, Learning to Match Ontologies on the Semantic Web, WWW2002, May 2002.
Google Conditional Probability
  P(HUMAN | MAN) = 1.77 billion / 2.29 billion = 0.77
  P(HUMAN | WOMAN) = 0.6 billion / 2.29 billion = 0.26
  Wyatt D., Philipose M., and Choudhury T., Unsupervised Activity Recognition Using Automatically Mined Common Sense, Proceedings of AAAI-05, pp. 21-27.
40
Conclusion and Future Work
Text retrieved from the web can be used as exemplars for text-classification-based ontology mapping
Many parameters affect the quality of the exemplars
The processed documents contain noise
Future work
  Clustering
41
Questions