Corpus Statistics
• ACE2005/ACE2007 English EDR
  – Chars: 1.5M; Words: 257K
  – Entities: 18K (PER 9.7K, ORG 3K, GPE 3K, FAC 1K, LOC 897, WEA 579, VEH 571)
  – Mentions: 55K (PRO 20K, NAM 18K, NOM 17K)
• CDC Entities (PER, ORG, LOC, GPE)
  – IDC Entities: 7,129 (entities with at least one name)
  – CDC Entities: 3,660 (after manual linking)
    • 2,390 singleton entities
• CDC Annotation Effort
  – Approximately 2 staff weeks
  – Annotated after automatic pre-linking of entities that shared at least one identical (case-sensitive) name string
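The pre-linking step described above can be sketched as a grouping problem: entities that share an exact, case-sensitive name string land in the same cluster. The sketch below is only an illustration under stated assumptions, not the procedure actually used for the corpus. The function name `prelink`, the entity IDs, and the toy names are hypothetical, and treating links as transitive (A shares a name with B, B with C, so all three are grouped) is an assumption the slide does not confirm.

```python
from collections import defaultdict


def prelink(entity_names):
    """Group entity IDs that share at least one identical (case-sensitive) name.

    entity_names: dict mapping a (hypothetical) entity ID to its name strings.
    Returns a sorted list of sorted ID lists, one list per pre-linked cluster.
    """
    # Union-find over entity IDs; every entity starts in its own cluster.
    parent = {eid: eid for eid in entity_names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first entity seen with each exact name string and merge
    # every later entity carrying the same string into its cluster.
    first_owner = {}
    for eid, names in entity_names.items():
        for name in names:
            if name in first_owner:
                union(first_owner[name], eid)
            else:
                first_owner[name] = eid

    clusters = defaultdict(list)
    for eid in entity_names:
        clusters[find(eid)].append(eid)
    return sorted(sorted(members) for members in clusters.values())


# Toy input (invented IDs): "bush" differs in case, so it is not pre-linked.
toy = {
    "doc1-E1": ["George W. Bush", "Bush"],
    "doc2-E4": ["Bush"],
    "doc3-E2": ["bush"],
    "doc3-E7": ["Saddam Hussein"],
}
clusters = prelink(toy)
```

In this toy run only `doc1-E1` and `doc2-E4` are grouped; the other two entities stay as singletons and would be left for manual linking.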
Cross-Document Entity Mention Count Histogram
[Figure: two charts, each with a vertical axis from 0 to 300; the horizontal axis runs from 1 to 3043 in the first chart and from 1 to 37 in the second]
Rank  MFreq  Entity Name
1     259    US
2     182    Iraq
3     96     Baghdad
4     93     George W. Bush
5     89     Saddam Hussein
6     83     CNN
…
Total Mentions Covered by Frequency-Sorted Entities
[Figure: chart with a vertical axis from 0 to 8000 and a horizontal axis from 1 to 3095]
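A coverage curve of this kind is a running total: sort entities by mention frequency, highest first, and accumulate. The sketch below shows the arithmetic using only the six frequencies listed in the rank table (259, 182, 96, 93, 89, 83). The function name `cumulative_coverage` is hypothetical, and the full per-entity counts behind the chart are not available here.

```python
from itertools import accumulate


def cumulative_coverage(mention_counts):
    """Return running totals of mentions, with entities sorted most frequent first."""
    ordered = sorted(mention_counts, reverse=True)
    return list(accumulate(ordered))


# Mention frequencies of the six top-ranked entities from the rank table.
top_counts = [259, 182, 96, 93, 89, 83]
coverage = cumulative_coverage(top_counts)
# coverage[k - 1] is the number of mentions covered by the k most frequent entities.
```

For these six entities the running totals are 259, 441, 537, 630, 719 and 802 mentions.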
Callisto/EDNA
• Entity Disambiguation and Normalization Annotation (EDNA) tool
  – A plug-in for Callisto client
  – Multiple annotators supported with single Tomcat server (with document locking)
  – Document set indexed by APF-customized Lucene search engine
• Assumes documents annotated for ACE EDR (entity mentions and intra-document coreference)
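Document locking for multiple annotators can be pictured as a table that records which annotator currently holds each document. The following is a minimal in-memory sketch of that idea only. It is not EDNA's server code: the class `DocumentLocks`, its methods, and the document and annotator names are all hypothetical, and the real tool's locking behaviour is not described in the slides.

```python
import threading


class DocumentLocks:
    """Hypothetical lock table: at most one annotator holds a document at a time."""

    def __init__(self):
        self._owners = {}               # doc_id -> annotator holding the lock
        self._guard = threading.Lock()  # protects _owners across server threads

    def acquire(self, doc_id, annotator):
        """Return True if the annotator holds the lock after this call."""
        with self._guard:
            owner = self._owners.setdefault(doc_id, annotator)
            return owner == annotator

    def release(self, doc_id, annotator):
        """Release the lock only if this annotator is the current holder."""
        with self._guard:
            if self._owners.get(doc_id) == annotator:
                del self._owners[doc_id]
                return True
            return False


locks = DocumentLocks()
first = locks.acquire("doc-001", "alice")     # alice takes the document
second = locks.acquire("doc-001", "bob")      # refused while alice holds it
released = locks.release("doc-001", "alice")  # alice finishes
third = locks.acquire("doc-001", "bob")       # now bob can take it
```

A single guard lock keeps the sketch simple; a real server would also need to persist lock state and expire locks from abandoned sessions.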
Logging onto the Server
File Selection, Locking & Status
Highlighted Mentions and ACE Annotations
Source document
ACE Annotations
Default and Customizable Entity Search
Entity-based Search Criteria
Search Results
Selected Entity Details
Color Coding Entity Status & Type
Reviewing Target Link Target in Context of Source Document
Type Restrictions in Search Can Be Relaxed
Annotator Comments Can Be Added and Retained