WiML Poster

1
Named Entity Annotation and Tagging in the Domain of Epizootics Svitlana Volkova, William Hsu, Doina Caragea Kansas State University, Department of Computing and Information Sciences, Manhattan, KS 66506 ACKNOWLEDGEMENTS This work is supported through a grant from the U.S. Department of Defense. A collaborative program on IE with faculty at the University of Illinois at Urbana-Champaign (ChengXiang Zhai, Dan Roth, Jiawei Han, and Kevin Chang), the 2009 Data Sciences Summer Institute (DSSI) on Multimodal Information Access and Synthesis (MIAS), was made possible through the support of DHS/ONR. We appreciate effective discussions with Dr. Chris Callison-Burch, Dr. Mark Dredze and Dr. Jason Eisner from Center for Language and Speech Processing, Johns Hopkins University; Tim Weninger, Research Fellow, UIUC; John Drouhard, Landon Fowles (KDD Lab, IE Team) for assistance with experiments. KANSAS STATE UNIVERSITY KNOWLEDGE DISCOVERY IN DATABASES LABORATORY NATIONAL AGRICULTURAL BIOSECURITY CENTER @ K-STATE K-State Laboratory for Knowledge Discovery in Databases (KDD) OVERVIEW We present an information extraction (IE) application in the domain of animal diseases. Previously, such tasks were performed only for human disease related data. As opposed to that, our task is directly related to web crawling for retrieving animal infectious disease related information. THE EFFECT OF THE ONTOLOGY SIZE AND QUALITY ON THE ACCURACY OF DISEASE EXTRACTION The data set is sampled from animal disease crawled sources with number of occurrences of the disease named entities above predefined threshold. All animal diseases were manually annotated within the dataset for future cross-validation. In the first experiment the baseline run is processed using dictionary look-up with and w/o capitalization feature (1a, 1b). The next runs include addition only synonyms (2a, 2b) and abbreviations (3a, 3b) respectively. The last run (4a, 4b) combines all above mentioned features. In the second experiment we divide data into training and test sets. Using training examples we learn the model for animal disease name extraction by discovering relations between concepts; we report accuracy on test data set. In the third experiment, we compare our approach of learning relations between concepts with Google Sets method. We report results in terms of precision, recall and F-measure. We build learning curves for both methods in order to show the influence of the ontology size and quality on the accuracy of extracted results. INFORMATION EXTRACTION IN THE DOMAIN OF EPIZOOTICS The IE task in the domain of the epizootics can be defined as automatic extraction of structured information that is related to animal diseases from unstructured web documents with different content. The IE task is related to development of several modules for tagging of specific entities such as: animal disease name, species, vaccines, serotypes etc. at the document-level within a crawled collection of documents. GAZETEER COLLECTION AND ONTOLOGY CONSTRUCTION The main purpose of IE using a gazetteer is to retrieve tokens that match at least one term with synonyms, abbreviations from known animal disease names. We collect prior domain specific knowledge and, as a result, construct ontology of animal disease concepts. The extraction technique is based on a pattern matching approach. The gazetteer is semi-automatically collected from official web-portals. Using initial gazetteer we enriched ontology with latent synonymic and causal relations between related concepts. EMAIL LITERATURE Document Collection CRAWLER QUERY DB CLASSIFICATION-BASED NAMED ENTITY RECOGNITION Named Entity Recognition (NER) task is a subtask of IE which seeks to locate and classify atomic elements in text into predefined categories, such as: disease names (e.g. “foot and mouth disease”); viruses (e.g. picornavirus) and serotypes (e.g. “Asia-1); species and its quantities (e.g. “sheep”, “pigs”); locations where outbreak happened (e.g. “United Kingdom”, “eastern provinces of Shandong and Jiangsu, China” different level of granularity); dates in different formats including special cases (e.g. “last Tuesday”, “two month ago”); organizations that reports outbreak (e.g. “DEFRA”, “CDC”). ANIMAL DISEASE Dipylidium infection Tapeworm Q fever Coxiella burnetii C. burnetii Baylisascariasis Baylisascaris procyonis B. melis B. procyonis B. transfuga DOMAIN INDEPENDENT KNOWLEDGE Location hierarchy, containing names of countries, states or provinces, cities, etc; canonical date/time representation. DOMAIN SPECIFIC KNOWLEDGE Medical ontology, containing names of diseases, viruses, animal species etc., organized in a conceptual hierarchy. Example: The US saw its latest FMD outbreak in Montebello, California in 1929 where 3,600 animals were slaughtered. Animal Disease Names Locations Dates/Times Quantities DOCUMENT COLLECTION Goal: to extract structured information with facts and entities related to events from unstructured or semistructured sources. FUTURE WORK The animal disease extraction task is a prerequisite for more advanced content analysis of the unstructured documents within corpora. So, the design of an NER-driven system for extracting structured tuples that describe animal disease-related events will be performed. The approach extends the shared NER task of identifying persons, organizations, and locations with not only disease names but constituent entities and attributes of these event tuples. These include dates and times, quantities with relevant units, and geo- referenced locations. A primary overall objective of the IE task is to support timeline and map-based visualization of events. 1. Disease names and fact sheets from Iowa State University Center for Food Security and Public Health (CFSPH): http://www.cfsph.iastate.edu/diseaseinfo/animaldiseaseindex.htm 2. Word Organization of Animal Health (OIE) Animal Disease Data: http://www.oie.int/eng/maladies/en_alpha.htm 3. Department for Environmental Food and Rural Affairs, UK (DEFRA): http://www.defra.gov.uk/animalh/diseases/vetsurveillance/az_index.htm 4. United States Department of Agriculture (USDA), Animal and Plant Health Inspection Service http://www.aphis.usda.gov/animal_health/animal_diseases/ 5. Medline Plus, Service of National Library of Medicine and National Institute of Health http://www.nlm.nih.gov/medlineplus/animaldiseasesandyourhealth.html 6. Wikipedia http://en.wikipedia.org/wiki/Animal_diseases RELATION DISCOVERY BETWEEN CONCEPTS Synonymy (“is a kind of” relation, e.g. “Swine influenza” is a kind of “Swine fever”); Example A: “Diseases such as Foot and Mouth Disease, Bovine TB or Johne’s Disease have far-reaching potential for major economic impact on cattle producers”. Disease Extractor Module Index of the first/last character Matched text and length Canonical disease names Associated Synonyms/Abbreviations Non-unique/Unique diseases Input: Text from file Output: Method A: Number of Training Instances Accuracy 429 773 955 1159 1287 1442 1561 1590 1619 1682 0.964 0.929 0.927 0.925 0.964 0.929 0.927 0.925 0.964 0.929 Method B: Number of Training Instances Accuracy 429 754 925 1118 1238 1385 1497 1524 1552 1611 0.962 0.961 0.864 0.862 0.962 0.961 0.864 0.862 0.962 0.961 Dictionary Look-Up: Number of Instances (max. 429) Accuracy 1a 1b 2a 2b 3a 3b 4a 4b - - 0.885 0.920 0.886 0.896 0.887 0.922 0.889 0.933 - - 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 2 3 4 5 6 7 8 9 10 Method B Method A Gazetteer F-Measure DICTIONARY LOOKUP METHOD FOR DISEASE EXTRACTION List’s look up features: flexible pattern match Document level features: keyword appearance within predefined window. Word level morphological features 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 4b 4a 3b 3a 2b 2a 1b 1a Precision Recall F-Measure 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0 50 100 4b 3b 3a 2b 2a 1b 1a 4a Recall Range Run 1, 3.60% Run 2 - only capitalization, 14.00% Run 3 - only abbreviations + synonyms, 84.36% Averaged "FMD" Extraction Performance over 50 web-sites Run 1, 0.20% Run 2 - only capitalization 38.02% Run 3 - only abbreviation + synonyms, 57.52% Averaged "RVF" Extraction Performance over 50 web-sites 0.80 0.82 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98 1.00 400 650 900 1150 1400 1650 0.80 0.82 0.84 0.86 0.88 0.90 0.92 0.94 0.96 0.98 1.00 400 650 900 1150 1400 1650 Learning Curve for Method A (Relation Discovery within Training Data) Learning Curve for Method B (Relation Discovery using Google Sets) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Metod A Metod B Dictionary Look-Up Precision/Recall Foot-and-mouth disease [DIS] killed 15 hog on farm in Taiwan [LOC] Fact: killed Disease: foot-and-mouth disease Location: Taiwan Species: hog Quantity: 15 Foot-and-mouth disease killed 15 hog on farm in Taiwan. Outbreak was reported on 9 June. Event: outbreak Species: 15 hog Disease: foot-and-mouth disease Location: Taiwan Date/Time: 9 June Foot-and-mouth disease [SUBJ] killed [VP] 15 hog on farm in Taiwan [PP] Syntactic Analysis Extraction Co-reference Resolution Template Generation NLP TASKS Run Precision, Recall, F-measure Document number Causal links (“is caused by”, e.g. “Ovine epididymitis is caused by Brucella ovis”). Example F: “Bluetongue virus (BTV), a member of Orbivirus genus within the Reoviridae family causes Bluetongue disease in livestock (sheep, goat, cattle)”. 1a - using only initial gazetteer w/o capitalization 1b - using initial gazetteer + capitalization 2a - initial gazetteer + only synonyms w/o capitalization 2b - initial gazetteer + only synonyms with capitalization 3a - init. gazetteer + only abbreviations w/o capitalization 3b - init. gazetteer + only abbreviations with capitalization 4a - init. gazetteer + synonyms + abbrev. w/o capitaliz. 4b - init. gazetteer + synonyms + abbrev. with capitaliz. WWW (official reports about animal disease outbreaks, surveillance networks disease descriptions, fact sheets etc.) Runs Number of Ontology Concepts Number of Ontology Concepts Accuracy Accuracy

description

 

Transcript of WiML Poster

Page 1: WiML Poster

Named Entity Annotation and Tagging in the Domain of EpizooticsSvitlana Volkova, William Hsu, Doina Caragea

Kansas State University, Department of Computing and Information Sciences, Manhattan, KS 66506

ACKNOWLEDGEMENTSThis work is supported through a grant from the U.S. Department

of Defense. A collaborative program on IE with faculty at the

University of Illinois at Urbana-Champaign (ChengXiang Zhai, Dan

Roth, Jiawei Han, and Kevin Chang), the 2009 Data Sciences

Summer Institute (DSSI) on Multimodal Information Access and

Synthesis (MIAS), was made possible through the support of

DHS/ONR.

We appreciate effective discussions with Dr. Chris Callison-Burch,

Dr. Mark Dredze and Dr. Jason Eisner from Center for Language

and Speech Processing, Johns Hopkins University; Tim Weninger,

Research Fellow, UIUC;

John Drouhard, Landon Fowles (KDD Lab, IE Team) for

assistance with experiments.

KANSAS STATE UNIVERSITY KNOWLEDGE DISCOVERY IN DATABASES LABORATORY NATIONAL AGRICULTURAL BIOSECURITY CENTER @ K-STATE

K-State Laboratory for Knowledge

Discovery in Databases (KDD)

OVERVIEW

We present an information extraction (IE) application in the domain of animal

diseases. Previously, such tasks were performed only for human disease related

data. As opposed to that, our task is directly related to web crawling for retrieving

animal infectious disease related information.

THE EFFECT OF THE ONTOLOGY SIZE AND QUALITY ON THE

ACCURACY OF DISEASE EXTRACTION

The data set is sampled from animal disease crawled sources with number of

occurrences of the disease named entities above predefined threshold. All animal

diseases were manually annotated within the dataset for future cross-validation.

In the first experiment the baseline run is processed using dictionary look-up with

and w/o capitalization feature (1a, 1b). The next runs include addition only

synonyms (2a, 2b) and abbreviations (3a, 3b) respectively. The last run (4a, 4b)

combines all above mentioned features.

In the second experiment we divide data into training and test sets. Using training

examples we learn the model for animal disease name extraction by discovering

relations between concepts; we report accuracy on test data set.

In the third experiment, we compare our approach of learning relations between

concepts with Google Sets method. We report results in terms of precision, recall

and F-measure. We build learning curves for both methods in order to show the

influence of the ontology size and quality on the accuracy of extracted results.

INFORMATION EXTRACTION IN THE DOMAIN OF EPIZOOTICS

The IE task in the domain of the epizootics can be defined as automatic extraction

of structured information that is related to animal diseases from unstructured web

documents with different content. The IE task is related to development of several

modules for tagging of specific entities such as: animal disease name, species,

vaccines, serotypes etc. at the document-level within a crawled collection of

documents.

GAZETEER COLLECTION AND ONTOLOGY CONSTRUCTION

The main purpose of IE using a gazetteer is to retrieve tokens that match at least

one term with synonyms, abbreviations from known animal disease names. We

collect prior domain specific knowledge and, as a result, construct ontology of

animal disease concepts. The extraction technique is based on a pattern matching

approach. The gazetteer is semi-automatically collected from official web-portals.

Using initial gazetteer we enriched ontology with latent synonymic and causal

relations between related concepts.EMAIL

LITERATURE

Document

Collection

CRAWLER

QUERY

DB

CLASSIFICATION-BASED NAMED ENTITY RECOGNITION

Named Entity Recognition (NER) task is a subtask of IE which seeks to locate and

classify atomic elements in text into predefined categories, such as:

disease names (e.g. “foot and mouth disease”);

viruses (e.g. “picornavirus”) and serotypes (e.g. “Asia-1”);

species and its quantities (e.g. “sheep”, “pigs”);

locations where outbreak happened (e.g. “United Kingdom”, “eastern provinces

of Shandong and Jiangsu, China” – different level of granularity);

dates in different formats including special cases (e.g. “last Tuesday”, “two

month ago”);

organizations that reports outbreak (e.g. “DEFRA”, “CDC”).

ANIMAL

DISEASE

Dipylidiuminfection

Tapeworm

Q fever

Coxiella burnetii

C. burnetii

Baylisascariasis

Baylisascaris procyonis

B. melis

B. procyonis

B. transfuga

DOMAIN INDEPENDENT KNOWLEDGE

Location hierarchy, containingnames of countries, states orprovinces, cities, etc; canonicaldate/time representation.

DOMAIN SPECIFIC KNOWLEDGE

Medical ontology, containingnames of diseases, viruses,animal species etc., organized ina conceptual hierarchy.

Example: The US saw its latest FMD outbreak in Montebello,

California in 1929 where 3,600 animals were slaughtered.

Animal Disease Names Locations

Dates/Times Quantities

DOCUMENT

COLLECTION

Goal: to extract structured

information with facts and

entities related to events from

unstructured or semistructured

sources.

FUTURE WORK

The animal disease extraction task is a

prerequisite for more advanced content

analysis of the unstructured documents within

corpora. So, the design of an NER-driven

system for extracting structured tuples that

describe animal disease-related events will

be performed.

The approach extends the shared NER task

of identifying persons, organizations, and

locations with not only disease names but

constituent entities and attributes of these

event tuples. These include dates and times,

quantities with relevant units, and geo-

referenced locations. A primary overall

objective of the IE task is to support timeline

and map-based visualization of events.

1. Disease names and fact sheets from Iowa State University Center for

Food Security and Public Health (CFSPH):http://www.cfsph.iastate.edu/diseaseinfo/animaldiseaseindex.htm

2. Word Organization of Animal Health (OIE) Animal Disease Data:http://www.oie.int/eng/maladies/en_alpha.htm

3. Department for Environmental Food and Rural Affairs, UK (DEFRA):http://www.defra.gov.uk/animalh/diseases/vetsurveillance/az_index.htm

4. United States Department of Agriculture (USDA), Animal and Plant

Health Inspection Servicehttp://www.aphis.usda.gov/animal_health/animal_diseases/

5. Medline Plus, Service of National Library of Medicine and National

Institute of Healthhttp://www.nlm.nih.gov/medlineplus/animaldiseasesandyourhealth.html

6. Wikipediahttp://en.wikipedia.org/wiki/Animal_diseases

RELATION DISCOVERY BETWEEN CONCEPTS

Synonymy (“is a kind of” relation, e.g. “Swine influenza” is a kind of “Swine fever”);

Example A: “Diseases such as Foot and Mouth Disease, Bovine TB or Johne’s Disease

have far-reaching potential for major economic impact on cattle producers”.

Disease

Extractor

Module

Index of the first/last character

Matched text and length

Canonical disease names

Associated Synonyms/Abbreviations

Non-unique/Unique diseases

Input:

Text from file

Output:

Method A: Number of Training Instances

Accuracy429 773 955 1159 1287 1442 1561 1590 1619 1682

0.964 0.929 0.927 0.925 0.964 0.929 0.927 0.925 0.964 0.929

Method B: Number of Training Instances

Accuracy429 754 925 1118 1238 1385 1497 1524 1552 1611

0.962 0.961 0.864 0.862 0.962 0.961 0.864 0.862 0.962 0.961

Dictionary Look-Up: Number of Instances (max. 429)

Accuracy1a 1b 2a 2b 3a 3b 4a 4b - -

0.885 0.920 0.886 0.896 0.887 0.922 0.889 0.933 - -

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1 2 3 4 5 6 7 8 9 10

Method B

Method A

Gazetteer

F-Measure

DICTIONARY LOOKUP METHOD FOR DISEASE EXTRACTION

List’s look up features: flexible pattern match

Document level features: keyword

appearance within predefined window.

Word level morphological features

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

4b 4a 3b 3a 2b 2a 1b 1a

Precision Recall F-Measure

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

0 50 100

4b

3b

3a

2b

2a

1b

1a

4a

Recall Range

Run 1, 3.60%

Run 2 - only capitalization,

14.00%

Run 3 - only abbreviations + synonyms,

84.36%

Averaged"FMD" Extraction

Performanceover 50 web-sites

Run 1, 0.20%

Run 2 - only capitalization

38.02%

Run 3 - only abbreviation+ synonyms,

57.52%

Averaged"RVF" Extraction

Performanceover 50 web-sites

0.80

0.82

0.84

0.86

0.88

0.90

0.92

0.94

0.96

0.98

1.00

400 650 900 1150 1400 1650

0.80

0.82

0.84

0.86

0.88

0.90

0.92

0.94

0.96

0.98

1.00

400 650 900 1150 1400 1650

Learning Curve for Method A (Relation Discovery within Training Data)

Learning Curve for Method B (Relation Discovery using Google Sets)

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Metod A Metod B Dictionary Look-Up

Precision/Recall

Foot-and-mouth disease[DIS] killed 15

hog on farm in Taiwan[LOC]

Fact: killed

Disease: foot-and-mouth disease

Location: Taiwan

Species: hog

Quantity: 15

Foot-and-mouth disease killed 15 hog

on farm in Taiwan. Outbreak was

reported on 9 June.

Event: outbreak

Species: 15 hog

Disease: foot-and-mouth disease

Location: Taiwan

Date/Time: 9 June

Foot-and-mouth disease [SUBJ] killed[VP]

15 hog on farm in Taiwan [PP]

Syntactic Analysis

Extraction

Co-reference

Resolution

Template Generation

NLP TASKS

Run

Precision, Recall, F-measure

Document number

Causal links (“is caused by”, e.g. “Ovine epididymitis is caused by Brucella ovis”).

Example F: “Bluetongue virus (BTV), a member of Orbivirus genus within the

Reoviridae family causes Bluetongue disease in livestock (sheep, goat, cattle)”.

1a - using only initial gazetteer w/o capitalization

1b - using initial gazetteer + capitalization

2a - initial gazetteer + only synonyms w/o capitalization

2b - initial gazetteer + only synonyms with capitalization

3a - init. gazetteer + only abbreviations w/o capitalization

3b - init. gazetteer + only abbreviations with capitalization

4a - init. gazetteer + synonyms + abbrev. w/o capitaliz.

4b - init. gazetteer + synonyms + abbrev. with capitaliz.

WWW (official reports about animal disease outbreaks,

surveillance networks disease descriptions, fact sheets etc.)

Runs

Number of Ontology Concepts Number of Ontology Concepts

Accuracy Accuracy