An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues
http://www.hytex.info/
HyTex
Hypertextualisierung auf textgrammatischer Grundlage (Engl. hypertextualization on a text-grammatical basis)
Text Technological Modelling of Information
Irene Cramer & Marc Finthammer
Faculty of Cultural Studies, Technische Universität Dortmund, Germany
Outline
• Project Context and Motivation
• Lexical Chaining – Evaluation Steps
1. Preprocessing and Coverage
2. Sense Disambiguation
3. Semantic Relatedness/Similarity
4. Application
• Open Issues and Future Work
Project Context and Motivation
• Research project HyTex funded by DFG (German Research Foundation) – part of the research unit "Text Technological Modelling of Information"
• Research objective in HyTex: text-grammatical foundations for (semi-)automated text-to-hypertext conversion
• One focus of our research: topic-based linking strategies using lexical and topic chains/topic views
• Lexical chains
– based on the concept of lexical cohesion,
– regarded as a partial text representation,
– valuable resource for many NLP applications, such as text summarization, dialog systems etc.
– (to our knowledge) 2 lexical chainers for German (Mehler, 2006 and Cramer/Finthammer); in addition: research on semantic similarity based on GermaNet by Gurevych et al.
• Topic chains/views
– based on a selection of central words, so-called topic words,
– intended to support the user's orientation and navigation.
Steps:
1. Lexical chains are used to select topic words (1-3 topic words per passage),
2. topic words are used to construct the topic view (~"thematic index"),
3. topic words are re-connected via semantically meaningful edges (as in lexical chaining) to construct topic chains.
[Figure: Topic View. The topic words found in the running text of each chapter (Chapter 1.1, Chapter 1.2, Chapter 1.3 …, Chapter 2 …, Chapter 3.1 …) are collected in the topic view, which lists topic word 1, topic word 2 and topic word 3 under the corresponding chapter heading.]
Lexical Chaining – Evaluation Steps
• To evaluate our chainer, called GLexi, test data is required;
• experiments to develop such a gold standard for German emphasize:
– manual annotation of lexical chains is very demanding,
– rich interaction between various principles to achieve a cohesive text structure distracts annotators;
• results of these experiments partially reported in Stührenberg et al., 2007.
• Our conclusion: Evaluation of a lexical chainer might be best performed in several steps.
– Evaluation of coverage
– Evaluation of disambiguation quality
– Evaluation of semantic relatedness measures
– Evaluation of chains wrt. specific application
• Remainder of talk:
– very short presentation of GLexi's architecture and
– exemplary demonstration of applicability of evaluation procedure
• Resources used:
– GermaNet (version 5.0)
– HyTex corpus (specialized text) – approx. 29,000 (noun) tokens
– set of word pairs + results of human judgement experiment
– German word frequency list (thanks to Sabine Schulte im Walde)
Lexical Chaining - GLexi
Basic modules:
– preprocessing of corpora: tokenization, POS tagging, chunking, chaining candidate selection
– core algorithm: lexical semantic look-up, scoring of relation, sense disambiguation
– output creation: rating of chain strength, application-specific representation
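The core step (look up candidates, score the relation, attach to a chain) can be illustrated with a small greedy sketch. This is not GLexi's actual implementation: the relatedness table `TOY_RELATED`, the function names and the threshold of 0.5 are hypothetical stand-ins for the GermaNet look-up and relation scoring.

```python
# Hypothetical toy relatedness table standing in for a GermaNet-based measure.
TOY_RELATED = {
    frozenset(["Text", "Hypertext"]): 0.9,
    frozenset(["Text", "Dokument"]): 0.8,
    frozenset(["Kind", "Eltern"]): 0.7,
}

def toy_relatedness(a, b):
    """Return 1.0 for identical words, a table value for known pairs, else 0.0."""
    if a == b:
        return 1.0
    return TOY_RELATED.get(frozenset([a, b]), 0.0)

def build_chains(candidates, relatedness=toy_relatedness, threshold=0.5):
    """Greedily attach each chaining candidate to the chain with the strongest link."""
    chains = []
    for word in candidates:
        best_chain, best_score = None, threshold
        for chain in chains:
            # Strength of the best link between the word and any chain member.
            score = max(relatedness(word, member) for member in chain)
            if score >= best_score:
                best_chain, best_score = chain, score
        if best_chain is None:
            chains.append([word])      # no sufficiently related chain: start a new one
        else:
            best_chain.append(word)
    return chains

chains = build_chains(["Text", "Kind", "Hypertext", "Eltern", "Dokument", "Text"])
print(chains)
```

In this sketch a word joins at most one chain and sense disambiguation is left out entirely; both are simplifications made for brevity.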
Preprocessing
Core algorithm: lexical semantic look-up
Step 1: Coverage
approx. 29,000 (noun) tokens in our corpus split into:
– 56 % in GermaNet
– 44 % not in GermaNet, of these:
  – 15 % inflected
  – 12 % compounds
  – 17 % remaining, uncovered nouns
• Coverage without preprocessing: approx. 56 %
• Approx. 44 % not included in chaining
→ preprocessing necessary to improve coverage!
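The breakdown above implies an upper bound on the coverage that preprocessing can reach. The following lines only redo that arithmetic; the variable names are illustrative, and the result assumes that every inflected form and every compound can actually be mapped to a GermaNet entry.

```python
# Shares of noun tokens in percent, taken from the breakdown above.
in_germanet = 56
inflected = 15
compounds = 12
remaining = 17

# Sanity check: the four shares partition the corpus.
assert in_germanet + inflected + compounds + remaining == 100

# Upper bounds, assuming all inflected forms / compounds can be resolved.
after_lemmatization = in_germanet + inflected
after_compound_analysis = after_lemmatization + compounds

print(after_lemmatization)      # 71
print(after_compound_analysis)  # 83
```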
• Lemmatization: increases coverage to approx. 71 %
  theoretical value – open issue, e.g. Medien – Medium (Engl. media – psychic or data carrier)
• Compound analysis: increases coverage to approx. 83 %
  theoretical value – open issue, e.g. Agrarproduktion (Engl. agricultural production) analysed as Agrar (Engl. agricultural) + Produkt (Engl. artifact) + Ion (Engl. ion [chem.])
• Simple Named Entity Recognition in preprocessing phase
• Open issues: abbreviations, foreign words, nominalized verbs
  future work: include TermNet (domain specific language) as a resource – for more information: talk by Lüngen et al. – tomorrow, session 6, 10:40 h …

Remaining, uncovered nouns split into:
– 15 % Named Entities
– 30 % foreign words
– 25 % abbreviations
– 20 % nominalized verbs
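The Agrarproduktion example shows why the compound figure is only a theoretical value: a lexicon-driven splitter can find more than one segmentation. The sketch below illustrates this with a naive exhaustive splitter. The mini-lexicon `LEXICON` and the function `segmentations` are hypothetical; linking elements and other details of real compound analysis are ignored.

```python
# Hypothetical mini-lexicon; a real system would query GermaNet lemmas.
LEXICON = {"agrar", "produkt", "produktion", "ion"}

def segmentations(word, lexicon=LEXICON):
    """Return every way to split `word` into a sequence of lexicon entries."""
    word = word.lower()
    if not word:
        return [[]]                      # one way to split the empty string
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            for tail in segmentations(word[i:], lexicon):
                results.append([head] + tail)
    return results

splits = segmentations("Agrarproduktion")
for parts in splits:
    print(" + ".join(parts))
```

Both agrar + produkt + ion and agrar + produktion are returned. Preferring the segmentation with the fewest parts would pick the intended reading in this case, but that heuristic is an assumption, not something the slides describe.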
Step 2: Chaining-based WSD
• Approx. 45 % of tokens in our corpus have more than 1 synset
• Basis for chaining-based WSD: manual annotation
• Compare manually annotated data and disambiguation decision of semantic measure

Sense pairs ranked by Wu-Palmer value:

word A   word B      sense A   sense B   Wu-Palmer   rank
Text     Hypertext   1         1         0.9231      1
Text     Hypertext   2         1         0.8333      2

Manually annotated word senses:

word A   word B      sense A   sense B
Text     Hypertext   1         1

The manually annotated word senses (word A, sense A = 1 and word B, sense B = 1) of the word pair are on rank 1 (best Wu-Palmer value): correct disambiguation.
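The comparison described here (rank all sense pairs by a measure, then check whether rank 1 matches the annotation) can be sketched as follows. The tiny hypernym hierarchy, the sense identifiers and the function names are hypothetical and do not reproduce GermaNet or the values in the table; Wu-Palmer is computed in its usual form, 2 · depth(lcs) / (depth(a) + depth(b)), with the root at depth 1.

```python
from itertools import product

# Hypothetical toy hypernym hierarchy (child -> parent); not GermaNet data.
PARENT = {
    "communication": "root",
    "object": "root",
    "Text#1": "communication",
    "Hypertext#1": "Text#1",
    "Text#2": "object",
}
SENSES = {"Text": ["Text#1", "Text#2"], "Hypertext": ["Hypertext#1"]}

def path_to_root(node):
    """Return [node, parent, ..., 'root']."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """2 * depth(lcs) / (depth(a) + depth(b))."""
    path_b = path_to_root(b)
    lcs = next(n for n in path_to_root(a) if n in path_b)  # lowest common subsumer
    return 2 * depth(lcs) / (depth(a) + depth(b))

def rank_sense_pairs(word_a, word_b):
    """Return all sense pairs sorted by Wu-Palmer value, best first."""
    pairs = product(SENSES[word_a], SENSES[word_b])
    return sorted(pairs, key=lambda p: wu_palmer(*p), reverse=True)

ranking = rank_sense_pairs("Text", "Hypertext")
gold = ("Text#1", "Hypertext#1")   # manually annotated senses (made up here)
correct = ranking[0] == gold       # disambiguation counts as correct if rank 1 matches
print(ranking, correct)
```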
• Performance of chaining-based WSD: mediocre!
• Best semantic measures (Resnik, Wu-Palmer, and Lin):
– approx. 50-60 % correct disambiguation compared to manual annotation
– majority voting increased performance to approx. 63-65 %
• Future work
– include WSD in preprocessing?
– machine learning based new measure?
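Majority voting over several measures can be sketched in a few lines. How ties were resolved is not stated in the slides; the rule used below (the earliest-listed measure wins) and the sense identifiers are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(decisions):
    """Return the sense chosen by most measures; ties go to the earliest-listed measure."""
    counts = Counter(decisions.values())
    top = max(counts.values())
    for sense in decisions.values():   # dict order = order in which measures were listed
        if counts[sense] == top:
            return sense

# Hypothetical per-measure decisions for one token.
decisions = {"Resnik": "Text#1", "Wu-Palmer": "Text#1", "Lin": "Text#2"}
chosen = majority_vote(decisions)
print(chosen)
```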
Step 3: Semantic Relatedness
• Implemented 11 semantic relatedness measures (SRM) (GermaNet: 8 measures, Google co-occurrence counts: 3 measures)
– focus of this talk: GermaNet measures
• Evaluation of SRM performance used results of human judgement experiment:
– list of 100 word pairs, subjects' judgement of "semantic distance" (35 subjects) on 5-level scale
– compare human judgement and SRM values
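One common way to compare human judgements with SRM values is a rank correlation. The slides do not say which coefficient was used, so Spearman's rho is an assumption here, and the numbers below are made up rather than taken from the experiment.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up example values, NOT data from the experiment.
human = [0, 1, 2, 3, 4]                 # mean judgement per word pair (5-level scale)
srm = [0.10, 0.40, 0.35, 0.80, 0.90]    # values of one relatedness measure
rho = spearman(human, srm)
print(round(rho, 2))
```

A rank-based coefficient only asks whether the measure orders the word pairs the way the subjects do, which fits a setting where the human scale is ordinal.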
almost 2/3 extreme values (not related / strongly related)
[Figure: human judgement experiment, example of the results. English glosses of the word pairs shown: printer / fin and fluid / water.]
[Figure: relatedness (0.00 to 1.00) of the word pairs, ordered by relatedness value; series: Human Judgement and Resnik.]
all SRM values scatter
correlation between human judgement and SRM values low
Open issues – semantic relatedness:
– continuous SRM values necessary / helpful? instead: classes (e.g. 3 classes: not related, related, strongly related); machine learning (ML) experiment using parameters of SRM
– interactions between SRM quality and disambiguation quality?
– combination of GermaNet and Google co-occurrence based measures (and further resources) useful? integration in ML experiment?
Step 4: Application-oriented Evaluation
Example: newspaper article about child poverty in Germany
Topic words according to lexical chaining results
Kind, Engl. child
Geld, Engl. money
Deutschland, Engl. Germany
• Features used in calculation of topic words and views:
– chain / meta-chain info: link density, link strength
– in addition to chains: frequency (relative to passage and document), mark-up
• application-oriented evaluation: gold standard for topic words, topic views, and topic chains necessary
• manual annotation of topic words and topic views – work in progress – current annotation agreement > 75 % (before accordance)
• initial results show: link density and frequency are most relevant features
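A feature-based selection of topic words could look roughly like the sketch below, which combines the two features reported as most relevant. The exact feature definitions are not given in the slides: link density as a normalised link count, the equal weights and the function name `select_topic_words` are all assumptions, and the tokens and links are made up.

```python
from collections import Counter

def select_topic_words(tokens, links, k=3, w_density=0.5, w_freq=0.5):
    """Score each noun by link density and relative frequency; return the top k."""
    freq = Counter(tokens)
    degree = Counter()
    for a, b in links:                 # each chain link counts for both words
        degree[a] += 1
        degree[b] += 1
    max_degree = max(degree.values(), default=1)
    scores = {
        word: w_density * degree[word] / max_degree + w_freq * freq[word] / len(tokens)
        for word in freq
    }
    # Highest score first; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Made-up passage (noun tokens only) and made-up chain links.
tokens = ["Text", "Hypertext", "Text", "Link", "Knoten", "Text", "Hypertext"]
links = [("Text", "Knoten"), ("Text", "Hypertext"), ("Hypertext", "Knoten"), ("Text", "Link")]
topic_words = select_topic_words(tokens, links, k=3)
print(topic_words)
```

The default `k=3` mirrors the 1-3 topic words per passage mentioned earlier in the talk.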
Outlook and Future Work
To sum up:
– Application: use lexical chaining for construction of topic views
– Lexical chaining for German corpora: several challenges (coverage, disambiguation, SRM)
– room for improvement: disambiguation and SRM; possible solutions: WSD as preprocessing step, alternative SRM (potentially ML based), additional resources
– initial results using lexical chains for construction of topic views: chaining useful!
Thank you!
Comments, ideas, questions are very welcome.
Literature (back-up slide)
• Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources at NAACL-2001, Pittsburgh, PA, June 2001.
• M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman, London.
• Graeme Hirst and David St-Onge. 1998. Lexical chains as representation of context for the detection and correction of malapropisms. In C. Fellbaum, editor, WordNet: An electronic lexical database, chapter 13, pages 305–332. The MIT Press, Cambridge, MA.
• Alexander Mehler. 2005. Lexical chaining as a source of text chaining. In Proceedings of the 1st Computational Systemic Functional Grammar Conference, Sydney.
• Gregory H. Silber and Kathleen F. McCoy. 2002. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4):487–496.