Multilingual NER Using Wiki
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
http://www.kddresearch.org/tikiwiki/tiki-index.php
Presenter: Svitlana O. Volkova
Instructor: William Hsu
Multilingual Named Entity Recognition
using Wikipedia
AGENDA
I. Project Overview
II. Crawling Wikipedia
III. Synonymy Discovery with Google Sets
IV. Experiment Design
V. Conclusions
PROJECT MILESTONES
CRAWLING WIKIPEDIA
Input: Crawler Functionality
Output: Set of Multilingual Gazetteers
RELATIONSHIP DISCOVERY WITH GOOGLESETS
Input: Initial Gazetteer in one Language
Output: Extended Gazetteer with Synonyms
MULTILINGUAL NER TASK
Input: Extended Gazetteer with Synonyms + Content
Output: Extracted Entities from the Content
KEY IDEA - WIKIPEDIA
Apply Wikipedia knowledge representation for
multilingual information extraction
http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis
English Wiki Concepts of Interest
…, anthrax, bovine virus, …, camelpox, surra, …
Russian Wiki Concepts of Interest
…, Зоонозы (zoonoses), Классическая чума свиней (classical swine fever), Лептоспироз (leptospirosis), …
CRAWLING WIKIPEDIA
Multilingual NER (article + category + interwiki links)
Wiki Category Graph and Article Graph
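The crawl described above (category graph, article graph, interwiki links) can be sketched as a breadth-first walk. This is a minimal illustration under stated assumptions, not the crawler used in the project: the dictionaries `SUBCATEGORIES`, `ARTICLES`, and `INTERWIKI` are hypothetical toy data standing in for crawled Wikipedia structure, and the function names are invented for this example.

```python
from collections import deque

# Hypothetical toy data (assumption): a tiny category graph, the articles
# filed under each category, and the interwiki links of each article.
SUBCATEGORIES = {"Animal diseases": ["Zoonoses"], "Zoonoses": []}
ARTICLES = {
    "Animal diseases": ["Camelpox", "Surra"],
    "Zoonoses": ["Anthrax", "Leptospirosis"],
}
INTERWIKI = {
    "Leptospirosis": {"de": "Leptospirose", "ru": "Лептоспироз"},
    "Anthrax": {},  # an article may have no interwiki links
}


def collect_articles(root, subcategories, articles):
    """Walk the category graph breadth-first and return article titles."""
    seen_categories = {root}
    seen_articles = set()
    queue = deque([root])
    found = []
    while queue:
        category = queue.popleft()
        for title in articles.get(category, []):
            if title not in seen_articles:  # an article can sit in several categories
                seen_articles.add(title)
                found.append(title)
        for sub in subcategories.get(category, []):
            if sub not in seen_categories:  # guard against cycles in the graph
                seen_categories.add(sub)
                queue.append(sub)
    return found


def build_gazetteers(root, subcategories, articles, interwiki, source_lang="en"):
    """Return {language code: [entity names]} by following interwiki links."""
    gazetteers = {source_lang: []}
    for title in collect_articles(root, subcategories, articles):
        gazetteers[source_lang].append(title)
        for lang, translated in interwiki.get(title, {}).items():
            gazetteers.setdefault(lang, []).append(translated)
    return gazetteers


gazetteers = build_gazetteers("Animal diseases", SUBCATEGORIES, ARTICLES, INTERWIKI)
```

Because only articles with an interwiki link contribute to the non-English lists, gazetteers built this way come out smaller in the other languages than in the source language.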
GAZETTEER EXAMPLES IN DIFFERENT LANGUAGES
GAZETTEER SIZES IN DIFFERENT LANGUAGES
[Chart. Values as extracted: 86, 20, 37, 19. Language labels as extracted: English, Japanese, German, Russian.]
Decision: the dictionaries are too small, so we need to find a way to extend them.
GAZETTEER EXAMPLES:
GERMAN GOOGLE SETS OUTPUT
EXPERIMENT SETUP
Purpose: to perform named entity recognition in a specific domain and report the extraction accuracy using
a) Wiki knowledge
b) Extended lists with synonyms from Google Sets
Hypothesis: the synonym extraction phase is essential for increasing the accuracy of the information extraction task
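One way to report extraction accuracy for the two configurations is to compare extracted entities against annotated entities and compute precision, recall, and F1. The slides do not say which metric is used, so this choice is an assumption; the function name and the span tuples in the usage lines are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of extracted items, e.g. (start, end, name) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations and system output (assumption, not project data).
gold_spans = {(0, 21, "foot-and-mouth disease"), (30, 36, "anthrax")}
system_spans = {(0, 21, "foot-and-mouth disease")}
scores = precision_recall_f1(system_spans, gold_spans)
```

Running the same function once on the output obtained with the Wiki-only lists and once on the output obtained with the extended lists gives directly comparable numbers for configurations a) and b).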
DISEASE EXTRACTOR MODULE
INPUT AND OUTPUT
Input: Text from file
Disease Extractor Module
Output:
  Index of the first character
  Index of the last character
  Length of the matched text
  Matched Text
  Canonical disease name
DISEASE EXTRACTION TASK
The task of disease recognition can be considered an NER/information extraction (IE) task.
The main purpose is to retrieve tokens that match at least one term, including synonyms and abbreviations, from the list of animal disease names.
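The matching step can be illustrated with a small gazetteer lookup that returns the five output fields listed for the module. This is a sketch under assumptions, not the module's actual implementation: the `GAZETTEER` contents, the `Match` type, the function names, and the choice of an inclusive last-character index are all hypothetical.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    start: int      # index of the first character
    end: int        # index of the last character (inclusive; an assumption)
    length: int     # length of the matched text
    text: str       # matched text as it appears in the input
    canonical: str  # canonical disease name


# Hypothetical gazetteer (assumption): canonical name -> synonyms and abbreviations.
GAZETTEER = {
    "foot-and-mouth disease": ["foot and mouth disease", "FMD"],
    "Rift Valley fever": ["RVF"],
}


def build_pattern(gazetteer):
    """Compile one case-insensitive regex over all terms in the gazetteer."""
    term_to_canonical = {}
    for canonical, synonyms in gazetteer.items():
        for term in [canonical, *synonyms]:
            term_to_canonical[term.lower()] = canonical
    # Longest terms first so a long name wins over a shorter term it contains.
    alternatives = sorted(term_to_canonical, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in alternatives)
    # Lookarounds keep a term from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)
    return pattern, term_to_canonical


def extract_diseases(text, gazetteer):
    """Return a Match for every gazetteer term found in the text."""
    pattern, term_to_canonical = build_pattern(gazetteer)
    return [
        Match(m.start(), m.end() - 1, m.end() - m.start(),
              m.group(0), term_to_canonical[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


matches = extract_diseases(
    "Foot and mouth disease is one of the most contagious diseases "
    "of cloven-hooved mammals.",
    GAZETTEER,
)
```

Mapping every synonym and abbreviation back to one canonical name is what lets an extended gazetteer raise recall without multiplying the set of reported diseases.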
CONTEXT EXAMPLES IN DIFFERENT LANGUAGES
DUTCH
Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog. Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.
CZECH
Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.
GERMAN
Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.
ITALIAN
Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei casi si verifica in rianimazione grave e richiesti.
UKRAINIAN
Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока. Більше половини випадків відбувається в суворих і необхідність реанімації.
RUSSIAN
Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость высокая. Более половины случаев происходит в суровых и необходимости реанимации.
DISEASE EXTRACTOR MODULE DEMO
http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/
INPUT A: Foot and mouth disease is one of the most contagious diseases of cloven-hooved mammals…
OUTPUT A: [figure]
INPUT B: Rift Valley Fever | CDC Special Pathogens Branch Mission Statement Disease …
OUTPUT B: [figure]
RESULTS FOR DISEASE EXTRACTOR MODULE
CONCLUSIONS
Applying Wikipedia knowledge to the multilingual NER task
Phase 1: Crawling Wiki – completed
Phase 2: Google Sets Expansion – completed
Phase 3: Multilingual Disease Extraction – in progress
Novelty: overcome Wiki limitations by applying the Google Sets expansion approach
To estimate accuracy, we need annotated data in different languages
ACKNOWLEDGEMENTS
Dr. William Hsu for meaningful guidance
John Drouhard for building extraction architecture
Landon Fowles for expanding gazetteers using Google Sets