Complete and Consistent Annotation of WordNet with the Top Concept Ontology

Javier Álvez, Jordi Atserias, Jordi Carrera, Salvador Climent, Egoitza Laparra, Antoni Oliver and German Rigau

Basque Country Univ., Pompeu Fabra Univ. (Barcelona), Open Univ. of Catalonia (Barcelona)


Introduction

• Four years of work

• Full annotation of WordNet’s Nouns with Semantic Features (EWN TCO)

• Aimed to be an important semantic resource for NLP (selectional preferences, synset clustering, reasoning…).

Result

• 65,989 noun concepts (synsets) = 116,364 noun lexemes (variants) consistently annotated

• Average of 6.47 features per synset

– Features organized in a multilevel hierarchy

Structure of the talk

• Methodology

• Examples and Discussion

• Conclusions

Methodology

• Annotation of the Inter Lingual Index (=EnWn1.6, SpaWN, mapping to other WNs...) with the nodes/features of the TCO (a shallow ontology defined in the EWN Project [Vossen et al. 1998])

• Methodology based on:

– INCOMPATIBILITY OF ONTOLOGICAL INFORMATION

– SUBSUMPTION BLOCKAGE POINTS

The Top Concept Ontology

• Organized in three orders of entities:

– 1st Order (physical entities)

– 2nd Order (situations)

– 3rd Order (abstract entities)

The Top Concept Ontology

• 1st Order entities organized in four Qualia-like features:

– Origin (Artifact, Natural…)

– Form (Object, Substance…)

– Composition (Group, Part)

– Function (Building, Container, Vehicle…)

The Top Concept Ontology

• 2nd Order Entities organized in two dimensions

– Situation Type: Dynamic (Bounded Events, Unbounded Events) & Static (Properties, Relations)

– Situation Component: (Cause, Manner, Modal…)

• 3rd Order Entities are not further subdivided

Methodology

• We do not modify the structure of either the TCO or WN (=> future work). We only annotate.

• We declared pairs of TCO properties as incompatible (e.g. Natural vs. Artifact, Substance vs. Object)

• Initial annotation situation: In EWN, TCO features were manually assigned to a basic set of 1024 EWN synsets (= Base Concepts)

Methodology

1. We automatically annotated the rest of the Top Synsets (from the BCs up to the Top) using a table of equivalence between WordNet's Semantic Files and TCO features (e.g. NounAct <=> Agentive, NounAttribute <=> Property)

2. We performed a full automatic top-down expansion of such information via the WN1.6 hierarchy (feature inheritance)
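The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the two table entries are the ones quoted on the slide, while the synset names, the `HYPONYMS` graph and the function names are hypothetical toy data, not the project's actual tooling.

```python
from collections import deque

# Semantic File -> TCO feature; only the two pairs quoted in the talk.
SEMANTIC_FILE_TO_TCO = {"NounAct": "Agentive", "NounAttribute": "Property"}

# Hypothetical toy fragment: top synsets with their Semantic File,
# and hypernym -> hyponyms links.
SEMANTIC_FILE = {"act": "NounAct", "attribute": "NounAttribute"}
HYPONYMS = {
    "act": ["action"],
    "action": ["change"],
    "change": [],
    "attribute": ["quality"],
    "quality": [],
}


def seed_from_semantic_files(semantic_file, table):
    """Step 1: give each top synset the TCO feature mapped from its Semantic File."""
    return {s: {table[f]} for s, f in semantic_file.items() if f in table}


def expand_top_down(hyponyms, seed):
    """Step 2: propagate every synset's features to all of its hyponyms."""
    features = {s: set(seed.get(s, ())) for s in hyponyms}
    queue = deque(hyponyms)
    while queue:
        parent = queue.popleft()
        for child in hyponyms[parent]:
            missing = features[parent] - features[child]
            if missing:
                features[child] |= missing
                queue.append(child)  # revisit so the change reaches deeper levels
    return features


seed = seed_from_semantic_files(SEMANTIC_FILE, SEMANTIC_FILE_TO_TCO)
features = expand_top_down(HYPONYMS, seed)
```

The worklist re-queues a synset whenever it gains a feature, so the sketch also copes with multiple inheritance, where a synset can be reached through more than one hypernym.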

Methodology

• This caused feature incompatibilities to arise:

– about 225,000 conflicts in 25,000 synsets

• Causes:

– Wrong manual annotation in EWN

– Wrong TCO-SF equivalence

– ... but basically: subsumption in WN does not always work (ISA overloading etc.)

– Multiple inheritance in WN
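A conflict in this sense is a synset that ends up carrying both members of a pair declared incompatible. A minimal sketch of that check follows; the two pairs are the ones named earlier in the talk, while the function name and the sample annotation are assumptions for illustration.

```python
from itertools import combinations

# Pairs declared incompatible in the talk (the full list is not given there).
INCOMPATIBLE = {
    frozenset({"Natural", "Artifact"}),
    frozenset({"Substance", "Object"}),
}


def find_conflicts(features, incompatible=INCOMPATIBLE):
    """Return {synset: [(feature_a, feature_b), ...]} for every clashing pair."""
    conflicts = {}
    for synset, feats in features.items():
        clashes = [
            (a, b)
            for a, b in combinations(sorted(feats), 2)
            if frozenset((a, b)) in incompatible
        ]
        if clashes:
            conflicts[synset] = clashes
    return conflicts


# Hypothetical annotation after expansion.
sample = {"Bandung": {"Natural", "Artifact"}, "Java": {"Natural"}}
result = find_conflicts(sample)
```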

Methodology

• We manually checked all feature incompatibilities in order to:

– (i) add and/or delete ontological features

– (ii) set inheritance blockage points.

• A blockage point is an annotation in WN1.6 which breaks the ISA relation between two synsets, thus no inheritance is allowed.

A simple example

[Diagram nodes: Bandung; Java; island; city]

A simple example

[Diagram nodes: Bandung; Java; island=NATURAL; city=ARTIFACT]

A simple example

[Diagram nodes: Bandung+NATURAL+ARTIFACT; Java+NATURAL; island=NATURAL; city=ARTIFACT]

A simple example

[Diagram nodes: Bandung+ARTIFACT; Java+NATURAL; island=NATURAL; city=ARTIFACT]
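The effect of a blockage point on the Bandung example can be sketched as follows. The transcript keeps only the node labels of the diagrams, so the links used here (Bandung under both city and Java, Java under island) are an assumption reconstructed from those labels, and the function name is hypothetical.

```python
# Assumed hyponym -> hypernyms links; the slides' edges are not in the transcript.
HYPERNYMS = {
    "Bandung": ["city", "Java"],
    "Java": ["island"],
    "city": [],
    "island": [],
}

# Features assigned directly (shown with "=" on the slides).
MANUAL = {"island": {"Natural"}, "city": {"Artifact"}}


def inherited_features(synset, hypernyms, manual, blocked=frozenset()):
    """Collect a synset's own features plus those inherited over unblocked links."""
    feats = set(manual.get(synset, ()))
    for parent in hypernyms.get(synset, ()):
        if (synset, parent) in blocked:
            continue  # blockage point: nothing is inherited across this link
        feats |= inherited_features(parent, hypernyms, manual, blocked)
    return feats


before = inherited_features("Bandung", HYPERNYMS, MANUAL)
after = inherited_features("Bandung", HYPERNYMS, MANUAL, blocked={("Bandung", "Java")})
```

Without the blockage point Bandung receives both Natural and Artifact; with it, only Artifact remains, while Java keeps Natural.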

Methodology: information used for decision making

• Relational information regarding every synset and neighbours; i.e. the WN structure

• Synsets' glosses as provided by EWN

• Glosses, descriptions and examples of the TCO features as provided in [Alonge et al. 1998]

• Usual word-substitution tests to acknowledge hyponymy, as in [Cruse 1986]

Methodology

• When all incompatibilities were fixed, a new automatic re-expansion was launched which resulted in a new (smaller) number of conflicts.

• Following this iterative and incremental approach, inheritance was re-calculated and the data were re-examined several times.

• The task finished when a new cycle of re-expansion of properties did not result in new conflicts.
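The expand, review, re-expand cycle can be sketched as a loop that stops at the first conflict-free expansion. Everything here is a toy under assumptions: the graph links are reconstructed from the Bandung example, and the `decisions` table is a stand-in for the human reviewer, who in the real process could also add or delete features rather than only block a link.

```python
INCOMPATIBLE = {frozenset({"Natural", "Artifact"})}


def expand(hypernyms, manual, blocked):
    """Recompute every synset's features, skipping blocked ISA links."""
    def feats(synset):
        out = set(manual.get(synset, ()))
        for parent in hypernyms.get(synset, ()):
            if (synset, parent) not in blocked:
                out |= feats(parent)
        return out
    return {s: feats(s) for s in hypernyms}


def conflicting_synsets(features):
    """Synsets that carry both members of some incompatible pair."""
    return sorted(
        s for s, f in features.items() if any(pair <= f for pair in INCOMPATIBLE)
    )


def run_cycles(hypernyms, manual, decisions, max_cycles=10):
    """Re-expand after each review round until a cycle yields no conflicts."""
    blocked = set()
    for cycle in range(1, max_cycles + 1):
        features = expand(hypernyms, manual, blocked)
        bad = conflicting_synsets(features)
        if not bad:
            return features, blocked, cycle
        for synset in bad:
            blocked.add(decisions[synset])  # stand-in for the manual decision
    raise RuntimeError("conflicts remain after max_cycles")


# Assumed toy data based on the Bandung example.
hypernyms = {"Bandung": ["city", "Java"], "Java": ["island"], "city": [], "island": []}
manual = {"island": {"Natural"}, "city": {"Artifact"}}
decisions = {"Bandung": ("Bandung", "Java")}
features, blocked, cycles = run_cycles(hypernyms, manual, decisions)
```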

Methodology

• Then, two final steps were applied:

1. Since the TCO is itself a hierarchy, for every synset, its annotation was expanded up-feature; e.g. Animal expands to Living, Natural, Origin and 1stOrderEntity

2. The whole hierarchy was checked for consistency using formal Theorem Provers like Vampire and E-prover

– This step resulted in a number of new conflicts which were finally fixed.
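The up-feature expansion in step 1 amounts to adding every ancestor of each feature in the TCO hierarchy. A minimal sketch, assuming the chain for Animal follows the order given in the example; the `TCO_PARENT` table covers only that chain and the function name is hypothetical.

```python
# Child feature -> parent feature. Only the chain from the Animal example;
# the order of the links is assumed from the order in which the talk lists them.
TCO_PARENT = {
    "Animal": "Living",
    "Living": "Natural",
    "Natural": "Origin",
    "Origin": "1stOrderEntity",
    "1stOrderEntity": None,
}


def expand_up(features, parent=TCO_PARENT):
    """Return the features together with all their ancestors in the TCO."""
    expanded = set()
    for feature in features:
        while feature is not None and feature not in expanded:
            expanded.add(feature)
            feature = parent.get(feature)
    return expanded


animal = expand_up({"Animal"})
```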

Typology of miscategorizations (IS-A Overload)

• Overgeneralization

• Reduction of sense

• Confusion of senses

• Suspect Type-to-role relationship

• Extensional ambiguity

• 3rd Order Entities vs Mental 2nd Order Entities (TCO labels)

• Technical inconsistencies

(in black: [Guarino 1998] original typology)

Typology of miscategorizations

• Overgeneralization = Hypernym has more features than Hyponym should have

• Reduction of Sense = Hypernym fails to capture part of the Hyponym’s meaning

• Confusion of senses = Multiple inheritance where hypernyms are incompatible

Typology of miscategorizations

• Extensional ambiguity = e.g. “layer”: is it an object or a substance?

• 3rd Order Entities vs Mental 2nd Order Entities (TCO labels) = e.g. “discipline” (a process, thus 2ndOrder) IS-A “knowledge domain” (3rdOrder)

• Technical inconsistencies = e.g. Hyponymy-Meronymy confusion

Conclusions

• WN1.6 (= ILI) fully and consistently annotated for Nouns with 60 semantic features organized in a shallow ontology

– 65,000 synsets, 116,000 variants

– Average of 6.48 TCO features per synset

• 350 inheritance-blocking points detected in WN

– 28,000 synsets have at least one in their hypernymy chain [= they are affected by WN hierarchy mistakes or inadequacies]

• The resource is free. It can be downloaded from our web site (see the proceedings)

The Statue of Liberty

[Diagram nodes:]

– The Statue of Liberty +OBJECT +IMAGE_REPRESENTATION +CONCEPT

– monument+OBJECT

– artifact+OBJECT

– art+OBJECT

– sculpture=IMAGE_REPRESENTATION +CONCEPT +OBJECT

– impressionism+OBJECT

– figure+CONCEPT

– shape+CONCEPT

– abstraction=CONCEPT

– object=OBJECT

The Statue of Liberty

[Diagram nodes:]

– The Statue of Liberty +OBJECT +IMAGE_REPRESENTATION

– monument+OBJECT

– artifact+OBJECT

– art=CONCEPT

– sculpture=IMAGE_REPRESENTATION =OBJECT

– impressionism+CONCEPT

– figure+CONCEPT

– shape+CONCEPT

– abstraction=CONCEPT

– object=OBJECT