Issues in Learning an Ontology from Text
Christopher Brewster, Simon Jupp, Joanne Luciano, David Shotton, Robert Stevens, and Ziqi Zhang
The Use Case: Animal Behaviour
• Animal behaviour community recognises the need for an ontology, e.g. for video annotation/retrieval
• The community created an “Animal Behaviour Ontology” - 339 terms
• Can we (semi-) automatically build from text?
Some Questions
• Do we get a “good ontology”?
• If not, is it useful?
• Is it low-effort?
• Should the result be “tidied up” or used as a donor?
Methodology: Dataset
• Journal “Animal Behaviour” from Elsevier
• 623 articles from Vol 71 (2006) - Vol 74 (2007)
• 2.2 million words
• Various formats - most usefully XML
We Want an Ontology of Green
• An ontology of “animal behaviours”
• Not an ontology of the corpus
We want the green terms in the ontology
Processing Steps (1)
1. Text extracted from XML, excluding affiliations, acknowledgements, bibliography (except for titles), etc.
2. Noise removed - person names, animal names, place names
3. Lemmatiser used to reduce data sparsity
4. Term extraction applied
Processing Steps (2)
5. Term selection
Regular expression used to select terms ending in behaviour, display, construction, inspection plus generic -ing, -ism, etc.
Build hierarchies using String Inclusion
6. Top level terms filtered using “Hearst Patterns” to test if X ISA behaviour/activity/etc.
Walking, Running, Jumping, Hunting, Pecking, Reed Bunting, Corn Bunting, Herring, Courtship, Studentship, Cannibalism, Dimorphism
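The term-selection step can be sketched roughly as below. The slides do not give the actual regular expression, so the pattern here is an assumption: it covers only the four head words and the two generic suffixes named explicitly (the "etc." is left unfilled), and `TERM_PATTERN` and `select_terms` are hypothetical names.

```python
import re

# Assumed pattern: the head words named on the slide, or a final word
# ending in the generic suffixes -ing / -ism. Not the authors' actual regexp.
TERM_PATTERN = re.compile(
    r"(?:\b(?:behaviour|display|construction|inspection)|\w+(?:ing|ism))$",
    re.IGNORECASE,
)

def select_terms(terms):
    """Keep the candidate terms whose last word matches the pattern."""
    return [t for t in terms if TERM_PATTERN.search(t.strip())]

candidates = [
    "Walking", "Reed Bunting", "mate selection",
    "grooming behaviour", "interaction", "Cannibalism",
]
selected = select_terms(candidates)
print(selected)
```

A purely suffix-based filter like this keeps wanted terms such as "Walking" but also lets through noise such as "Reed Bunting", and it drops "interaction", which is the kind of trade-off the examples on this slide illustrate.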
Applying String Inclusion/Rules to Terms
C
BC
AC
ABC
Selection
Mate Selection
Natural Selection
Female Mate Selection
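One way to read the string-inclusion rule is that a term becomes a subclass of the longest shorter term that forms its right-hand end (C above BC above ABC). The sketch below works on that assumption; the slides do not spell out the exact rule, and `build_hierarchy` is a hypothetical helper name.

```python
def build_hierarchy(terms):
    """Map each term to its parent: the longest proper token suffix
    that is itself a term, or None for a top-level term."""
    tokenised = {tuple(t.lower().split()) for t in terms}
    parents = {}
    for tokens in tokenised:
        parent = None
        # Drop words from the left; the first suffix found is the longest.
        for start in range(1, len(tokens)):
            suffix = tokens[start:]
            if suffix in tokenised:
                parent = " ".join(suffix)
                break
        parents[" ".join(tokens)] = parent
    return parents

parents = build_hierarchy([
    "Selection", "Mate Selection", "Natural Selection", "Female Mate Selection",
])
top_level = sorted(t for t, p in parents.items() if p is None)
print(parents)
print(top_level)
```

Under this reading, "Female Mate Selection" sits under "Mate Selection" rather than directly under "Selection", and any term with no shorter suffix in the term list becomes a top-level class.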
Lexico-Syntactic Patterns
• X such as P, Q, R; X is a Y
• Grooming is a behaviour
• Copulation is an activity
• Dimorphism is a behaviour
• Calls such as trills, whistles, grunts
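The two patterns above can be approximated with simple regular expressions over single-word terms. This is a minimal sketch, not the authors' implementation: the set `HEAD_CLASSES`, the function names, and the restriction to one-word terms are all assumptions made for illustration.

```python
import re

# Assumed set of accepted parent classes ("behaviour/activity/etc." on the slide).
HEAD_CLASSES = {"behaviour", "activity"}

IS_A = re.compile(r"\b(\w+) is an? (\w+)\b", re.IGNORECASE)
SUCH_AS = re.compile(r"\b(\w+) such as ((?:\w+, )*\w+)", re.IGNORECASE)

def isa_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs = [(m.group(1).lower(), m.group(2).lower())
             for m in IS_A.finditer(sentence)]
    for m in SUCH_AS.finditer(sentence):
        hypernym = m.group(1).lower()
        pairs += [(h.lower(), hypernym) for h in m.group(2).split(", ")]
    return pairs

def filter_top_level(candidates, sentences):
    """Keep candidates attested in the text as a kind of behaviour/activity."""
    accepted = {hypo for s in sentences for hypo, hyper in isa_pairs(s)
                if hyper in HEAD_CLASSES}
    return [c for c in candidates if c.lower() in accepted]

sentences = [
    "Grooming is a behaviour",
    "Copulation is an activity",
    "Calls such as trills, whistles, grunts",
]
kept = filter_top_level(["Grooming", "Herring", "Copulation"], sentences)
print(kept)
```

The filter only trusts what the corpus says, so a sentence like "Dimorphism is a behaviour" would let "dimorphism" through as a top-level term.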
Results
• 64,000 terms extracted
• The regexp selected 10,335 terms
• Step 6a resulted in an ontology with 17,776 classes and 1295 top level classes
• Step 6b resulted in an ontology with 13,058 classes and 912 top level classes
Results (2) - Copulation Sub-tree
Results (3)
• Evaluation of terms excluded by regexp:
• 56,000 terms excluded
• Random sample of 3140 terms evaluated by hand
• 7 verbs and 42 nouns should not have been excluded
• E.g., “interaction”
• A recall of 0.905
Discussion: The problem of focus
Other Issues
• More a vocabulary than an ontology
• SKOS-like rather than OWL-like
• Can deal with “selection”, “mate selection” and “natural selection”
• Highly compositional terms, e.g. “Adult male grooming behaviour”
• Cleanish list of top level terms: cannibalism, copulation, eating, foraging, fighting, grooming
Discussion: Is it useful?
• Answers: No, yes, yes, donor
• Useful ontological fragments
• Bringing ontology to ontology learning is the research challenge
• Limitations: noise; the problem of focus; only taxonomic relations
• Advantages: speed; ease; a step towards formal ontologies