Multilingual NER Using Wiki

Laboratory for Knowledge Discovery in Databases

Department of Computing and Information Sciences

Kansas State University

http://www.kddresearch.org/tikiwiki/tiki-index.php

Presenter: Svitlana O. Volkova

Instructor: William Hsu

Multilingual Named Entity Recognition using Wikipedia




AGENDA

I. Project Overview

II. Crawling Wikipedia

III. Synonymy Discovery with Google Sets

IV. Experiment Design

V. Conclusions


PROJECT MILESTONES

Input: Crawler Functionality

CRAWLING WIKIPEDIA

Output: Set of Multilingual Gazetteers

Input: Initial Gazetteer in one Language

RELATIONSHIP DISCOVERY WITH GOOGLESETS

Output: Extended Gazetteer with Synonyms

Input: Extended Gazetteer with Synonyms + Content

MULTILINGUAL NER TASK

Output: Extracted Entities from the Content

KEY IDEA - WIKIPEDIA

Apply Wikipedia knowledge representation for multilingual information extraction

http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis

English Wiki Concepts of Interest

…, anthrax, bovine virus, …, camelpox, surra, …

Russian Wiki Concepts of Interest

…, Зоонозы, Классическая чума свиней, Лептоспироз, …










CRAWLING WIKIPEDIA

Multilingual NER

(article + category + interwiki links)

Wiki Category Graph and Article Graph
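The crawling step (category graph, article graph, and interwiki links) could be sketched roughly as follows. This is a minimal illustration, not the project's crawler: `CATEGORY_GRAPH` and `INTERWIKI` are hypothetical in-memory stand-ins for crawled pages, and `build_gazetteers` is an assumed name.

```python
from collections import deque

# Hypothetical toy data standing in for crawled Wikipedia pages.
# category -> (subcategories, articles)
CATEGORY_GRAPH = {
    "Animal diseases": (["Zoonoses"], ["Anthrax", "Camelpox"]),
    "Zoonoses": ([], ["Leptospirosis", "Anthrax"]),
}
# article -> {language code: title of the linked article in that language}
INTERWIKI = {
    "Anthrax": {"de": "Milzbrand", "ru": "Сибирская язва"},
    "Leptospirosis": {"de": "Leptospirose", "ru": "Лептоспироз"},
    "Camelpox": {},
}


def build_gazetteers(root, source_lang="en"):
    """Walk the category graph breadth-first from `root` and collect
    article titles per language by following interwiki links."""
    gazetteers = {source_lang: set()}
    seen = {root}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        subcategories, articles = CATEGORY_GRAPH.get(category, ([], []))
        for article in articles:
            gazetteers[source_lang].add(article)
            for lang, title in INTERWIKI.get(article, {}).items():
                gazetteers.setdefault(lang, set()).add(title)
        for sub in subcategories:
            if sub not in seen:  # guard against cycles in the category graph
                seen.add(sub)
                queue.append(sub)
    return gazetteers


gazetteers = build_gazetteers("Animal diseases")
```

Articles without interwiki links (here `Camelpox`) end up only in the source-language list, which is one reason the per-language gazetteers can differ in size.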

GAZETTEER EXAMPLES IN DIFFERENT LANGUAGES

GAZETTEER SIZES IN DIFFERENT LANGUAGES

Values in chart order: 86, 20, 37, 19 (English, Japanese, German, Russian)

Decision: the dictionaries are too small, so we need to find a way to extend them.


GAZETTEERS EXAMPLES:

GERMAN GOOGLE SETS OUTPUT
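One way to represent a gazetteer extended with set-expansion output is a map from every surface form (name, synonym, abbreviation) to its canonical name. The sketch below assumes this representation; `expand` is a hypothetical callable standing in for a set-expansion service such as Google Sets, whose actual interface is not shown in these slides, and `toy_expand` returns hand-written illustrative data.

```python
def extend_gazetteer(gazetteer, expand):
    """Map each lower-cased surface form to its canonical name.

    gazetteer: iterable of canonical names.
    expand:    hypothetical callable, name -> iterable of related terms.
    """
    surface_to_canonical = {}
    for name in gazetteer:
        # A canonical name always maps to itself.
        surface_to_canonical[name.lower()] = name
        for synonym in expand(name):
            # Keep the first canonical name seen for an ambiguous synonym.
            surface_to_canonical.setdefault(synonym.lower(), name)
    return surface_to_canonical


def toy_expand(name):
    # Illustrative stand-in for real set-expansion output.
    table = {"Foot-and-mouth disease": ["FMD", "hoof-and-mouth disease"]}
    return table.get(name, [])


extended = extend_gazetteer(["Foot-and-mouth disease", "Anthrax"], toy_expand)
```

Set expansion can return related but non-synonymous terms, so in practice the expanded entries would likely need filtering before use.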


EXPERIMENT SET UP

Purpose: perform named entity recognition in a specific domain and report extraction accuracy using

a) Wiki knowledge

b) Extended lists with synonyms from Google Sets

Hypothesis: the synonym extraction phase is essential for increasing the accuracy of the information extraction task
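The slides speak of "accuracy" without fixing a metric. A common choice for extraction tasks, assumed here rather than taken from the slides, is precision, recall, and F1 over exact-match spans, computed once per condition (a and b) against annotated data. `precision_recall_f1` and the span tuples are illustrative names and values.

```python
def precision_recall_f1(predicted, gold):
    """Score extracted spans against annotated spans (exact match).

    Both arguments are iterables of hashable spans, e.g. (start, end) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative spans only; not results from the experiment.
wiki_only = precision_recall_f1({(0, 7), (10, 15)}, {(0, 7), (20, 25), (30, 33)})
```

Under the hypothesis above, the extended lists would be expected to raise recall; precision could move either way, depending on how noisy the added synonyms are.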

DISEASE EXTRACTOR MODULE

INPUT AND OUTPUT

Input: Text from file

Output:

Index of the first character

Index of the last character

Length of the matched text

Matched Text

Canonical disease name

Disease Extraction Task

Disease recognition can be treated as a NER / information extraction (IE) task

The main purpose is to retrieve tokens that match at least one term, including synonyms and abbreviations, from the list of animal disease names
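The matching step described above can be sketched as a case-insensitive dictionary lookup that reports the fields listed for the extractor output. This is a minimal sketch under assumptions, not the module's actual implementation: `Match`, `extract_diseases`, and the gazetteer entries are hypothetical, and the last-character index is taken to be inclusive.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    start: int       # index of the first character
    end: int         # index of the last character (inclusive, by assumption)
    length: int      # length of the matched text
    text: str        # matched text as it appears in the input
    canonical: str   # canonical disease name


def extract_diseases(text, surface_to_canonical):
    """Find every gazetteer surface form in `text`, ignoring case."""
    if not surface_to_canonical:
        return []
    # Longest forms first, so the alternation prefers the longest term.
    forms = sorted(surface_to_canonical, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in forms)
    # Lookarounds keep matches from starting or ending inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

    def canonical_for(matched):
        # Re-check which form produced the match, using the same case rules.
        for form in forms:
            if re.fullmatch(re.escape(form), matched, re.IGNORECASE):
                return surface_to_canonical[form]
        raise LookupError(matched)

    return [
        Match(m.start(), m.end() - 1, m.end() - m.start(),
              m.group(0), canonical_for(m.group(0)))
        for m in pattern.finditer(text)
    ]


# Hypothetical gazetteer entries for illustration.
gazetteer = {
    "foot and mouth disease": "Foot-and-mouth disease",
    "FMD": "Foot-and-mouth disease",
    "лептоспироз": "Leptospirosis",
}
matches = extract_diseases(
    "Foot and mouth disease is one of the most contagious diseases", gazetteer
)
```

Exact surface matching like this does not handle inflected forms, which matters for languages such as Russian or Czech; a real system would likely need lemmatization or fuzzy matching on top.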

CONTEXT EXAMPLES IN DIFFERENT LANGUAGES

DUTCH

Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog. Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.

CZECH

Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.

GERMAN

Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.

ITALIAN

Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei casi si verifica in rianimazione grave e richiesti.

UKRAINIAN

Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока. Більше половини випадків відбувається в суворих і необхідність реанімації.

RUSSIAN

Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость высокая. Более половины случаев происходит в суровых и необходимости реанимации.

DISEASE EXTRACTOR MODULE DEMO

http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/


Foot and mouth disease is

one of the most contagious

diseases of cloven-hooved

mammals…

INPUT A OUTPUT A

Rift Valley Fever | CDC

Special Pathogens Branch

Mission Statement Disease …

INPUT B OUTPUT B

RESULTS FOR DISEASE EXTRACTOR MODULE


CONCLUSIONS

Applying Wikipedia knowledge to the multilingual NER task

Phase 1: Crawling Wiki – completed

Phase 2: Google Sets Expansion – completed

Phase 3: Multilingual Disease Extraction – in progress

Novelty: overcome Wiki limitations by applying the Google Sets expansion approach

To estimate accuracy, we need annotated data in different languages

REFERENCES

Torsten Zesch and Iryna Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1-8, April 2007. http://elara.tk.informatik.tu-darmstadt.de/publications/2007/hlt-textgraphs.pdf

Yotaro Watanabe, Masayuki Asahara, and Yuji Matsumoto. A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 649-657. http://www.aclweb.org/anthology/D/D07/D07-1068

C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.

ACKNOWLEDGEMENTS

Dr. William Hsu for meaningful guidance

John Drouhard for building extraction architecture

Landon Fowles for expanding gazetteers using Google Sets
