Corpus Statistics


Page 1: Corpus Statistics

Corpus Statistics

• ACE2005/ACE2007 English EDR
  – Chars: 1.5M; Words: 257K
  – Entities: 18K (PER 9.7K, ORG 3K, GPE 3K, FAC 1K, LOC 897, WEA 579, VEH 571)
  – Mentions: 55K (PRO 20K, NAM 18K, NOM 17K)

• CDC Entities (PER, ORG, LOC, GPE)
  – IDC Entities: 7,129 (entities with at least one name)
  – CDC Entities: 3,660 (after manual linking)
    • 2,390 singleton entities

• CDC Annotation Effort
  – Approximately 2 staff weeks
  – Annotated after automatic pre-linking of entities that shared at least one identical (case-sensitive) name string
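The pre-linking step described above can be sketched as a union-find pass over name strings. This is only an illustration under stated assumptions: the function name `prelink`, the entity IDs, and the input layout (entity ID mapped to a set of name strings) are hypothetical, and the slides do not say how the actual pre-linking was implemented.

```python
from collections import defaultdict


def prelink(entity_names):
    """Group entity IDs that share at least one identical (case-sensitive) name.

    entity_names: dict mapping an entity ID to a set of name strings
    (a hypothetical layout, not the ACE APF format).
    """
    parent = {eid: eid for eid in entity_names}

    def find(x):
        # Follow parent pointers to the cluster root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # name string -> first entity ID seen carrying it
    for eid, names in entity_names.items():
        for name in names:
            if name in first_seen:
                # Exact string match: merge the two clusters.
                parent[find(eid)] = find(first_seen[name])
            else:
                first_seen[name] = eid

    clusters = defaultdict(set)
    for eid in entity_names:
        clusters[find(eid)].add(eid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical within-document entities, for illustration only.
example = {
    "doc1-E1": {"George W. Bush", "Bush"},
    "doc2-E4": {"Bush"},
    "doc3-E2": {"bush"},      # differs in case, so it is not pre-linked
    "doc3-E7": {"Baghdad"},
}
clusters = prelink(example)
singletons = [c for c in clusters if len(c) == 1]
```

Because matching is exact and case-sensitive, a pass like this can only propose links; the merged and unmerged clusters would still go to annotators for manual review, as the slide indicates.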

Page 2: Corpus Statistics

Cross-Document Entity Mention Count Histogram

[Figure: two histogram panels (Series1); y-axis ticks 0 to 300; x-axis ticks 1 to 3043 in the first panel and 1 to 37 in the second]

Rank  MFreq  Entity Name
1     259    US
2     182    Iraq
3     96     Baghdad
4     93     George W. Bush
5     89     Saddam Hussein
6     83     CNN
…
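A rank table like the one above can be produced from per-entity mention counts with a short sort. The sketch below uses only the six counts shown on the slide; the function name `rank_table` is an assumption made for illustration.

```python
from collections import Counter

# Mention counts for the six entities listed on the slide.
mention_counts = Counter({
    "US": 259,
    "Iraq": 182,
    "Baghdad": 96,
    "George W. Bush": 93,
    "Saddam Hussein": 89,
    "CNN": 83,
})


def rank_table(counts):
    """Return (rank, mention frequency, entity name) rows, most frequent first."""
    return [
        (rank, freq, name)
        for rank, (name, freq) in enumerate(counts.most_common(), start=1)
    ]


for rank, freq, name in rank_table(mention_counts):
    print(f"{rank:<5} {freq:<6} {name}")
```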

Page 3: Corpus Statistics

Total Mentions Covered by Frequency-Sorted Entities

[Figure: line chart (Series1); y-axis ticks 0 to 8000; x-axis ticks 1 to 3095]
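A coverage curve of this kind is a running total over entities sorted by mention frequency. As a minimal sketch, assuming the curve is computed that way, the example below uses the six mention counts from the rank table on the previous slide; the function name `cumulative_coverage` is hypothetical.

```python
from itertools import accumulate


def cumulative_coverage(mention_counts):
    """Running total of mentions covered by the k most frequent entities."""
    ordered = sorted(mention_counts, reverse=True)
    return list(accumulate(ordered))


# Mention counts of the six most frequent entities from the rank table.
top6 = [259, 182, 96, 93, 89, 83]
coverage = cumulative_coverage(top6)
print(coverage)  # [259, 441, 537, 630, 719, 802]
```

The curve rises steeply at first and then flattens, which is what a long tail of rarely mentioned entities would produce.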

Page 4: Corpus Statistics

Callisto/EDNA

• Entity Disambiguation and Normalization Annotation (EDNA) tool
  – A plug-in for Callisto client
  – Multiple annotators supported with single Tomcat server (with document locking)
  – Document set indexed by APF-customized Lucene search engine

• Assumes documents annotated for ACE EDR (entity mentions and intra-document coreference)
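The document locking mentioned above can be pictured as a server-side table recording which annotator currently holds each document. The sketch below is purely illustrative: the class `DocumentLockTable`, its methods, and the annotator and document names are hypothetical, and the slides do not describe how the EDNA server actually implements locking.

```python
import threading


class DocumentLockTable:
    """Hypothetical in-memory table of per-document locks held by annotators."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the owner table itself
        self._owners = {}               # document ID -> annotator holding the lock

    def acquire(self, doc_id, annotator):
        """Return True if the annotator holds the lock after this call."""
        with self._guard:
            owner = self._owners.setdefault(doc_id, annotator)
            return owner == annotator

    def release(self, doc_id, annotator):
        """Release the lock only if this annotator is the current holder."""
        with self._guard:
            if self._owners.get(doc_id) == annotator:
                del self._owners[doc_id]
                return True
            return False


locks = DocumentLockTable()
first = locks.acquire("doc1", "alice")     # alice opens the document
second = locks.acquire("doc1", "bob")      # bob is refused while alice holds it
released = locks.release("doc1", "alice")  # alice finishes and releases
third = locks.acquire("doc1", "bob")       # bob can now open it
```

Keeping a single owner per document means two annotators cannot write conflicting cross-document links for the same file at the same time.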

Page 5: Corpus Statistics

Logging onto the Server

Page 6: Corpus Statistics

File Selection, Locking & Status

Page 7: Corpus Statistics

Highlighted Mentions and ACE Annotations

Source document

ACE Annotations

Page 8: Corpus Statistics

Default and Customizable Entity Search

Entity-based Search Criteria

Search Results

Selected Entity Details

Page 9: Corpus Statistics

Color Coding Entity Status & Type

Page 10: Corpus Statistics

Reviewing Target Link Target in Context of Source Document

Page 11: Corpus Statistics

Type Restrictions in Search Can Be Relaxed

Page 12: Corpus Statistics

Annotator Comments Can Be Added and Retained