Page 1: Corpus Statistics

Corpus Statistics

• ACE2005/ACE2007 English EDR
  – Chars: 1.5M; Words: 257K
  – Entities: 18K (PER 9.7K, ORG 3K, GPE 3K, FAC 1K, LOC 897, WEA 579, VEH 571)
  – Mentions: 55K (PRO 20K, NAM 18K, NOM 17K)

• CDC Entities (PER, ORG, LOC, GPE)
  – IDC Entities: 7,129 (entities with at least one name)
  – CDC Entities: 3,660 (after manual linking)

• 2,390 singleton entities

• CDC Annotation Effort
  – Approximately 2 staff weeks
  – Annotated after automatic pre-linking of entities that shared at least one identical (case-sensitive) name string
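
The automatic pre-linking step described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function name `prelink`, the entity identifiers, and the input layout (a mapping from entity ID to its set of name strings) are all hypothetical. It also assumes that links are transitive, so entities are merged with a union-find structure, which the slides do not state.

```python
from collections import defaultdict


def prelink(entities):
    """Group entities that share at least one identical name string.

    entities: dict mapping a (hypothetical) entity ID to a set of name strings.
    Matching is case-sensitive, as on the slide. Transitive merging is an
    assumption of this sketch.
    """
    parent = {eid: eid for eid in entities}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first entity seen with each exact name string and
    # merge every later entity carrying the same string into its cluster.
    first_seen = {}
    for eid, names in entities.items():
        for name in names:
            if name in first_seen:
                union(first_seen[name], eid)
            else:
                first_seen[name] = eid

    clusters = defaultdict(set)
    for eid in entities:
        clusters[find(eid)].add(eid)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Illustrative input only; these IDs and names are not from the corpus files.
example = {
    "doc1-E1": {"George W. Bush", "Bush"},
    "doc2-E4": {"Bush"},
    "doc3-E2": {"bush"},  # differs in case, so it is not pre-linked
    "doc4-E7": {"Iraq"},
}
clusters = prelink(example)
```

Entities left alone in their cluster after this step would still need manual review, since a case-sensitive exact match misses variants such as "bush" versus "Bush".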

Page 2: Corpus Statistics

Cross-Document Entity Mention Count Histogram

[Figure: histogram in two panels. Y-axis ticks run from 0 to 300. X-axis ticks run from 1 to 3043 in the first panel and from 1 to 37 in the second.]

Rank  MFreq  Entity Name
1     259    US
2     182    Iraq
3     96     Baghdad
4     93     George W. Bush
5     89     Saddam Hussein
6     83     CNN
…

Page 3: Corpus Statistics

Total Mentions Covered by Frequency-Sorted Entities

[Figure: single-series chart. Y-axis ticks run from 0 to 8000; x-axis ticks run from 1 to 3095.]
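
A curve of this kind is presumably a running total of mention counts over entities sorted from most to least frequent. The sketch below illustrates that reading using only the six mention frequencies listed in the rank table on Page 2; the function name `cumulative_coverage` is hypothetical, and the full per-entity counts behind the chart are not available here.

```python
from itertools import accumulate

# Mention frequencies of the six top-ranked entities from the Page 2 table.
top_mention_counts = [259, 182, 96, 93, 89, 83]


def cumulative_coverage(counts):
    """Return running totals of mentions over frequency-sorted entities."""
    ordered = sorted(counts, reverse=True)  # most frequent entity first
    return list(accumulate(ordered))


coverage = cumulative_coverage(top_mention_counts)
# coverage[k - 1] is the number of mentions covered by the top k entities.
```

On these six values alone, the top-ranked entities account for 802 mentions.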

Page 4: Corpus Statistics

Callisto/EDNA

• Entity Disambiguation and Normalization Annotation (EDNA) tool
  – A plug-in for Callisto client
  – Multiple annotators supported with single Tomcat server (with document locking)
  – Document set indexed by APF-customized Lucene search engine

• Assumes documents annotated for ACE EDR (entity mentions and intra-document coreference)

Page 5: Corpus Statistics

Logging onto the Server

Page 6: Corpus Statistics

File Selection, Locking & Status

Page 7: Corpus Statistics

Highlighted Mentions and ACE Annotations

Source document

ACE Annotations

Page 8: Corpus Statistics

Default and Customizable Entity Search

Entity-based Search Criteria

Search Results

Selected Entity Details

Page 9: Corpus Statistics

Color Coding Entity Status & Type

Page 10: Corpus Statistics

Reviewing Link Target in Context of Source Document

Page 11: Corpus Statistics

Type Restrictions in Search Can Be Relaxed

Page 12: Corpus Statistics

Annotator Comments Can Be Added and Retained