Corpus Statistics

Post on 04-Jan-2016

Corpus Statistics

• ACE2005/ACE2007 English EDR
  – Chars: 1.5M; Words: 257K
  – Entities: 18K (PER 9.7K, ORG 3K, GPE 3K, FAC 1K, LOC 897, WEA 579, VEH 571)
  – Mentions: 55K (PRO 20K, NAM 18K, NOM 17K)

• CDC Entities (PER, ORG, LOC, GPE)
  – IDC Entities: 7,129 (entities with at least one name)
  – CDC Entities: 3,660 (after manual linking)
    • 2,390 singleton entities

• CDC Annotation Effort
  – Approximately 2 staff weeks
  – Annotated after automatic pre-linking of entities that shared at least one identical (case-sensitive) name string
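
The automatic pre-linking step can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the entity IDs, the `prelink` function, and the sample names are hypothetical, and the sketch assumes links are transitive (if A shares a name with B and B with C, all three land in one group), which the slides do not state.

```python
from collections import defaultdict


def prelink(entity_names):
    """Group entity IDs that share at least one identical (case-sensitive) name.

    entity_names: dict mapping a hypothetical entity ID to the set of name
    strings used for that entity within its document.
    Returns a sorted list of sorted ID groups.
    """
    parent = {eid: eid for eid in entity_names}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Merge each entity with the first entity seen for the same exact string.
    first_seen = {}
    for eid, names in entity_names.items():
        for name in names:
            if name in first_seen:
                union(first_seen[name], eid)
            else:
                first_seen[name] = eid

    groups = defaultdict(list)
    for eid in entity_names:
        groups[find(eid)].append(eid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample data, for illustration only.
entities = {
    "doc1-E1": {"George W. Bush", "Bush"},
    "doc2-E4": {"Bush"},
    "doc3-E2": {"bush"},  # differs only in case, so it is not linked
    "doc3-E7": {"Iraq"},
    "doc4-E1": {"Iraq", "Republic of Iraq"},
}
clusters = prelink(entities)
print(clusters)
```

Groups of size one after this step correspond to entities with no exact name match elsewhere; under the workflow described above, the annotator would then review and correct these automatic links by hand.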

Cross-Document Entity Mention Count Histogram

[Figure: two chart panels, each with a single series (Series1) and a vertical axis from 0 to 300; the horizontal axis runs from 1 to 3043 in the first panel and from 1 to 37 in the second.]

Rank  MFreq  Entity Name
1     259    US
2     182    Iraq
3     96     Baghdad
4     93     George W. Bush
5     89     Saddam Hussein
6     83     CNN
…

Total Mentions Covered by Frequency-Sorted Entities

[Figure: single series (Series1); vertical axis from 0 to 8000, horizontal axis from 1 to 3095.]
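
A curve of this kind is presumably a running total of mention counts over entities sorted by descending frequency. The sketch below assumes that reading and uses only the six MFreq values from the ranked table in these slides; the function name `cumulative_coverage` is hypothetical.

```python
from itertools import accumulate


def cumulative_coverage(mention_counts):
    """Return running totals of mention counts, most frequent entity first."""
    ordered = sorted(mention_counts, reverse=True)
    return list(accumulate(ordered))


# MFreq values for the six top-ranked entities listed in the slides.
top_counts = [259, 182, 96, 93, 89, 83]
coverage = cumulative_coverage(top_counts)
print(coverage)  # [259, 441, 537, 630, 719, 802]
```

Read this way, the six most frequent entities alone would account for 802 mentions.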

Callisto/EDNA

• Entity Disambiguation and Normalization Annotation (EDNA) tool
  – A plug-in for the Callisto client
  – Multiple annotators supported with a single Tomcat server (with document locking)
  – Document set indexed by an APF-customized Lucene search engine

• Assumes documents annotated for ACE EDR (entity mentions and intra-document coreference)

Logging onto the Server

File Selection, Locking & Status

Highlighted Mentions and ACE Annotations

Source document

ACE Annotations

Default and Customizable Entity Search

Entity-based Search Criteria

Search Results

Selected Entity Details

Color Coding Entity Status & Type

Reviewing Target Link Target in Context of Source Document

Type Restrictions in Search Can Be Relaxed

Annotator Comments Can Be Added and Retained