
Gerard Howard

Summary

This project attempts to solve the problems caused by manual dictionary creation for the use

of event coding systems, by providing an automatic dictionary generation system. These

problems are namely: a) the number of man-hours required to manually generate such dictionaries from a large corpus; b) the possibility of inconsistencies in actor assignment by a human dictionary compiler; and c) the possibility that a number of actors contained in the corpus are overlooked and not included in the dictionaries. By taking an algorithmic,

NLP-based approach to dictionary generation, it is predicted that the time taken to produce

such dictionaries will dramatically decrease, and the assignment of actors to dictionaries will

improve in consistency.

The second part of the system will produce an actor coding scheme from the dictionary files

generated. These actor coding schemes are used in both manual and automatic event

coding systems, but are often hand-created. As with dictionary generation, the prime

concerns with this approach are the time-consuming nature of scheme generation, and also

application of the scheme to the corpus to be coded. It has been noted in previous event

coding projects that human coders (who must work in teams due to the number of articles to

be coded) can disagree on the code assigned to a particular actor, producing an inconsistent

coding. An automatic, algorithmic system will remove the ambiguity in code assignment, as

well as reduce the time taken to a) generate the actor scheme and b) apply the scheme to an

input corpus.

Creating a system based on context-free grammar analysis will allow the system to generate

a dictionary (and hence actor coding scheme) for any region of interest. This will solve the

problem of a lack of reusability seen when human coders create such actor schemes – there

is no quick way to apply the same process on another corpus to generate an actor scheme

with a different focus in mind manually.

This system will take a corpus of Reuters news articles as input, and produce dictionaries of

people and organizations and an actor coding scheme as output.


Acknowledgements

I'd like to thank Dr. Katja Markert for her aid with project direction and scheduling, as well as

code consultation and finding a forest of relevant papers on event coding and generally

putting up with me.

Also, Eric Atwell for his comments on the report and ideas for testing the system, as well as

going through the system demonstration session with me half asleep (No more late nights!).

Acknowledgments must also be made to Sorin for his help with the Java version of the

NP/VP extraction program.

Robert Yaw, Dave Goodwin, and Andy Walton deserve a mention for frequently responding

to my answer to their questions of “So what does your system do again?” with “Huh?”, as

well as support during the late-night hacking sessions and beerathons.

I would also like to thank my parents for washing my clothes and sorting me out with a new

PC when my old one went “pop”. Oh, and for the manual event coding.

And finally to the University of Leeds staff, for providing such a comfortable working

environment.


Table of Contents

1. Requirements and Scheduling

1.1 Requirements

1.2 Scheduling

1.2.1 Preliminary Schedule

1.2.2 Revised Schedule

1.2.3 Explanation

2. Research and Background Reading

2.1 The Purpose of Event Coding

2.1.1 Problem Statement

2.2 Information Extraction and the English Language

2.3 Coding

2.3.1 Coding Systems

2.3.2 Event Coding Schemes

2.3.3 Actor Coding Schemes

3. Design and Methodology

3.1 Methodology

3.2 Scope

3.3 Programming Languages and Tools

3.4 Identifying Useful Information

3.4.1 Information Extraction

3.4.2 Exploiting POS and NER

3.5 System Design

4. Implementation

4.1 Programs, Inputs and Outputs

4.2 Information Extraction

4.3 Named Entity Recognition


4.4 Clustering

4.5 Dictionary Creation

4.6 Actor Coding

5. Testing

5.1 Test Plan

5.2 Dictionary Creation

5.3 Actor Coding Scheme

6. Evaluation

6.1 Results and Discussion

6.2 Dictionaries

6.3 Actor Coding Scheme

6.4 Conclusions

6.4.1 Improvements

6.4.2 Furthering the Project


Chapter 1

Requirements and Scheduling

1.1 Requirements

The requirements for this project can be split into two categories: needed and desired.

Those that are needed are specified in [1], defined early on in the project after consultation

with the project supervisor, and are as follows:

· An overview of current Event Coding Schemes and Event Coding Systems as well as

Natural Language Processing (NLP) methods for building Event Coding Systems.

· Implementation of an automatic extraction method of relevant Reuters leadlines from [2].

· Automatic dictionary extraction for proper name actors and mapping to actor codes

focusing on the Balkans conflict [3].

· Mapping proper names in Reuters leadlines automatically to actor codes.

· Evaluation of proper name Actor Coding accuracy by comparison to two human coders.

Possible extensions to this project include:

· Allowing users to extract articles from a corpus based on multiple criteria.

· Increasing the accuracy of the assignment of actor codes by improving the code

assignment algorithm.

· Expanding the project by allowing for Event Coding.

The full amount of allotted time is 21 weeks of term time, plus 8 weeks' worth of holidays.

Figures 1 and 2 (in Appendix B) show Gantt charts of the preliminary schedule for my

project.


1.2 Scheduling

1.2.1 Preliminary Schedule

Milestones

· Return completed preference forms: 30.09.2004

· Complete aim and minimum requirements: 22.10.2004

· Mid project report: 10.12.2004

· Complete background reading: 01.12.2004

· Complete implementation of system: 01.03.2005

· Complete project report: 10.04.2005

· Deadline for report submission: 27.04.2005

Dates | Task | Stage of Waterfall model | Details
01.11.2004 - 01.12.2004 | Preliminary reading | Research | Topics: basic NLP, automatic dictionary construction, event/actor coding schemes/systems
01.12.2004 - 10.12.2004 | Research and decide on methodology | Research | Comparison of methodologies, decision of best suited to project, write up
12.10.2004 - 22.10.2004 | Requirements capture | Research | Analyze needs of the users, capabilities of similar systems and minimum requirements
01.12.2004 - 23.12.2004 | Initial designs and testing | Design | Design initial ideas, implement in code using throwaway prototyping, write up
23.12.2004 - 01.01.2005 | Final design | Design | Include specifics (e.g. definite uses for each program, number of programs required). Relate development to research, write up
01.01.2005 - 01.03.2005 | Implementation and Testing | Implementation | Implement using waterfall methodology, test on sample corpora, write up
01.12.2004 - 01.04.2005 | Evaluation | Evaluation | Define suitable evaluation criteria (test plan), evaluate qualitative and quantitative aspects of the system, draw conclusions, write up

Figure 3: Project schedule


1.2.2 Revised Schedule

Revised Milestones

· Return completed preference forms: 30.09.2004

· Complete aim and minimum requirements: 22.10.2004

· Mid project report: 10.12.2004

· Complete background reading: 10.12.2004

· Complete implementation of system: 25.03.2005

· Complete project report: 17.04.2005

· Deadline for report submission: 27.04.2005

Dates | Task | Stage of Waterfall model | Details
01.11.2004 - 05.12.2004 | Preliminary reading | Research | Topics: basic NLP, automatic dictionary construction, event/actor coding schemes/systems. Added reading on Perl
05.12.2004 - 20.12.2004 | Research and decide on methodology | Research | Comparison of methodologies, decision of how to combine methodologies most effectively, write up
12.10.2004 - 22.10.2004 | Requirements capture | Research | Analyze needs of the users, capabilities of similar systems, minimum requirements and extensions. Added practice with Perl
01.12.2004 - 10.01.2005 | Initial designs and testing | Design | Design initial ideas, implement in code using throwaway prototyping, choose one to develop, write up
10.12.2004 - 20.01.2005 | Final design | Design | Include specifics. Relate development to research and previous design choices, write up
20.01.2005 - 20.04.2005 | Implementation and Testing | Implementation | Implement using waterfall methodology, test on sample corpora, improve clustering algorithm, write up
20.01.2005 - 25.04.2005 | Evaluation | Evaluation | Define suitable evaluation criteria (test plan), evaluate qualitative and quantitative aspects of the system, draw conclusions, write up

Figure 4: Revised project schedule

1.2.3 Explanation


The schedule had to be revised due to the following main factors:

· Unfamiliarity with Perl, underestimating the time Perl took to learn. This increased both

the design and implementation stages, pushing back the evaluation.

· The first implementation of the clustering algorithm performed poorly in testing. Altering it took a few days that should have been spent on other parts of the implementation. However, the revised schedule was adhered to, ensuring that the project was completed on time.

· The amount of background reading meant that the deadline was overshot by a number of

days. This created something of a cascade effect with the other deadline dates, shifting

them all back slightly.


Chapter 2

Research and Background Reading

2.1 The Purpose of Event Coding

Event Coding is a relatively recent social science-based discipline, utilizing generalized

codes to represent countries, political actors, and government branches (Actors), and the

actions those myriad actors take against/with each other (Events). In this way, diverse

inputs containing many events and actors [2,4] can receive a generalized classification, in

turn allowing a quick overview of a particular series of events. Clustering is traditionally used

to cover synonyms of an actor or event, allowing them to be correctly classified (e.g. U.S.A., America, and United States could all be clustered into the "USA" cluster for simplified coding).

Using a textual input such as a selection of newspaper headlines for a region, or a series of

stock market reports over a given time period, the events reported in the headlines or

reports can be analyzed quickly.

The primary use of Event Coding is related to early warning. By assigning each event a number on a scale such as [5] (where the most severe negative

actions receive the lowest negative mark, the most constructive actions receive the highest

positive mark), and analyzing the products of these marks between two countries over a

given period of time (e.g. every week for a number of months), a time-series analysis for

aggregate marks between two countries can be obtained. These can be compared to "template" time-series analyses showing a number of likely actions, and studied for

correlation. Put simply, if a stock market report for a company correlates roughly to that of a

template showing market growth, it can be predicted that that company will experience a

growth in its market. Similarly, if the time-series analysis graph between two countries

correlates to a template showing unrest, possibly descending into armed conflict, the two

countries in question could be reinforced with additional UN troops in an attempt to quell any

danger.
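As a simple illustration of the aggregation step only (a hypothetical Perl sketch; it is not part of the system described in this report, and the input format shown is invented), weekly scores for a country pair could be summed into a time series as follows:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Hypothetical input: one coded event per line, "week country-pair score",
  # e.g. "1 USA-RUS -2.2", where the score follows a Goldstein-style scale.
  my %weekly_total;
  while (my $line = <STDIN>) {
      chomp $line;
      my ($week, $pair, $score) = split /\s+/, $line;
      next unless defined $score;
      $weekly_total{$pair}{$week} += $score;      # aggregate the marks per week
  }

  # Print the resulting time series for each country pair.
  for my $pair (sort keys %weekly_total) {
      for my $week (sort { $a <=> $b } keys %{ $weekly_total{$pair} }) {
          print "$pair week $week: $weekly_total{$pair}{$week}\n";
      }
  }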


Event Coding Systems are created to study two main areas: international events (both

military and political) and economics.

Until recently, Event Coding was conducted manually, by consulting a coding scheme (such

as [6]) and assigning the event to a code in the scheme that most closely correlated to it.

Papers concerning the process of manual coding in more detail include [7] and [8].

Recent advancements have allowed these codes to be automatically assigned by algorithm

or pattern matching, saving many man-hours spent not only assigning codes, but also ensuring that human coders would assign the same code to any given event. For

example, in [9], the team automatically code both regional and international data using

machine-based automatic Event Coding.

One of the first automatic Event Coding Systems was [10], a coding system related to

actions between two countries, and centered mainly around warfare. It was first used in [11]

to automatically event code a dataset related to the Palestinian Intifada. Recently, a company

called Virtual Research Associates has released a system, [12], which is designed to predict

the fluctuations of various stock markets.

The rise of these new computerized Event Coding Systems can be attributed to their

advantages (speed, unsupervised automatic assignment, the ability to “plug in” and use one

of a number of Coding Schemes). However, these systems have many requirements that

must also be fulfilled in order to use them. Firstly, a mapping must be provided from the

myriad actors and events that would be encountered in a corpus, to a smaller set of

generalized actors and events. Secondly, the system must be able to identify the relevant

actors and events in the corpus of articles. Thirdly, there must be a program that takes the

extracted actors and events, and applies the mapping scheme to them, returning a set of

possible events and actors for a given area of conflict.

The purpose of my Event Coding System is firstly to automatically generate a dictionary of

actors for the Balkans area of conflict, using a one-year corpus of Reuters Newswire texts

([2]). I will extract from this corpus the relevant articles, then extract from those articles a

corpus of headlines and meta data such as the date the article was published. A list of

countries, persons and organizations involved in the conflict will then be created from this

corpus. I will then generate a dictionary of these actors, using clustering and mapping to

refine this process.


Secondly, an Event Coding scheme of the actors in the extracted articles will be generated,

using the actors extracted by the dictionary creation system.

My system will also be reusable (for different areas, by using a different input corpus and

different extraction parameters), compact (small, simple programs which can be used by

anyone), reasonably accurate (obviously hoping for the same accuracy as manual

assignment is unrealistic), and most of all, quick (this will be my system's main strength –

human-coded actor dictionaries can “take upwards of 150 man-hours to generate” [13]).

2.1.1 Problem Statement

The system will solve the problem of manually creating dictionaries for use in Event Coding

Systems, by providing an alternative whereby dictionaries can be generated automatically,

saving the user many hours of preparation time. To maximize potential users, these

dictionaries should be reasonably accurate, and dependent only on the input corpus used, to

allow any kind of Event Coding dictionaries to be generated.

The system will also attempt to automatically generate an Actor Coding Scheme, for use in

Event Coding Systems. This should provide a reasonable number of actors, and utilize

clustering and grouping to provide a useful coding scheme. This will solve the problem of

the traditional approach of manually assigning an actor code to a dictionary entry, by

providing a reasonably accurate, quick and comprehensive alternative.

2.2 Information Extraction and the English Language

The input I will use will be from Reuters [2]. Although a global news service, they are an

American company, and as such their articles use English as a primary language. The

corpus I will be using for input to my system will therefore be entirely in English.

Like all languages, English predominantly follows a structure for composing phrases and

sentences. Programs have been written to exploit this structure (taggers, tokenizers, and

parsers), providing facilities such as Part Of Speech (POS) tagging, and Named Entity

Recognition (NER).

Parsers are responsible for splitting an input into words and sentences. RASP, used in

projects such as [14], is installed on the University of Leeds School of Computing Linux systems and provides advanced structures such as phrasal and sub-phrasal identification.


Part of Speech (POS) taggers provide further structural analysis of a text, identifying the

types of words in a sentence. RASP also contains this functionality, which it uses for its

“Probabilistic Parsing” [15] running mode.

Named Entity Recognition (NER) software provides a tagging of words to types. Unlike POS

tagging, these word categories are not syntactic types, such as nouns and verbs, but instead semantic types such as people and dates. Aside from individual words, the context of a phrase

within a sentence is also important, and has been used in systems such as [13] to

automatically generate dictionaries for Event Coding Systems.

In English, each declarative sentence is split into a Noun Phrase (NP), and a Verb Phrase

(VP). Each of these NPs and VPs can be split recursively into further pairs of NPs and VPs,

Prepositional Phrases (PP), determiners (e.g. The), and nouns (N) and verbs (V). This type

of parsing, combined with POS tagging, allows the structure of a sentence to be

dissected and analyzed. Figure 5 shows such actions on the phrase “The cat sat”.

Figure 5: Example parsing and POS tagging.

RASP can generate these parse trees, which describe not only the types of words and

phrases in a sentence, but also tag words with their syntactic classification.

By extracting the NPs and VPs that contain the main actors (identified by the word-level tagging explained above), I would be able to identify not only the important NPs or VPs in the

sentence, but also extract the actors and events in those NPs or VPs with a surrounding

context, which could be used to aid in the classification of those actors and events.


A combination of NER and parsing can identify the parts of a sentence that would be useful

to my system (such as the names of countries, people, companies, agencies, political parties

and so on). This will allow me to extract certain types of words, which I would wish to

include in my dictionaries.

2.3 Coding

Coding is the assignment of a code (provided by a coding scheme) to a piece of text (in

this case a Reuters Newswire headline) by a coding system (such as [16]). In this chapter,

background knowledge and some preliminary design choices relating to Coding Systems

(2.3.1), Event Coding Schemes (2.3.2) and Actor Coding Schemes (2.3.3) will be provided.

2.3.1 Coding Systems

A coding system is a piece of software that allows, by way of “plug-in” script files, the

assignment of event codes to input of text. Research into this area is necessary as it gives

me a background of knowledge from which to work, allowing me to gear my system towards

compatibility with them, as well as giving me an idea of how Event Coding Systems work.

The three major coding systems of recent years are KEDS [10] (Kansas Event Data System), TABARI [16] (Textual Analysis By Augmented Replacement Instructions), and the VRA (Virtual Research Associates) project [20].

Developed at the University of Kansas, KEDS is a data coding system that uses a modification of

[17] to assign event codes to political event data focusing on the Middle East and the

Balkans. The predicted final use for this project is as a statistical early-warning model for

regions of conflict (see 2.1).

KEDS takes preformatted data as input, and uses a pattern-matching algorithm to assign a

code to that data, processing Reuters headlines. KEDS seems to be a logical choice

for analysis, due to its similarities to this project. However, it has recently been superseded

by TABARI, also produced by the KEDS team. It is therefore TABARI that will be the main

focus of my analysis.

TABARI is a coding system that uses pattern matching to assign codes to events in a given

text. It uses replaceable actor and Event Coding scripts (text files) to denote the assignment

of codes to actors/events.


The basic pattern is subject – verb – object, where:

Subject = The source of the event (actor code assigned here)

Verb = The event itself (event code assigned here).

Object of verb = target of event (actor code assigned here).

For example, given the input of:

Today, Italy declared war on Scotland.

Italy would be the subject, assigned an actor code. Scotland would be the object, and also

assigned an actor code. The bigram “declared war” is the event, and as such would be

assigned an event code.

Due to the simplistic way it tackles the complex issue of syntax, TABARI is prone to generating

errors with complex sentences/events. However, results from use of TABARI suggest that it

can handle the vast majority of Reuters headline syntax forms [18].

Like many Systems of its ilk (e.g. KEDS), TABARI requires a certain formatting of the input

file in order to work. An example of TABARI formatting can be seen below.

980216 REUT-0001-01

BALKANS. <<Sentence containing the event>>.

Figure 6: Example format of a TABARI leadline.

Requiring a strict format allows the input file to be read more easily, as well as providing a

common format that human coders are able to understand. Some other rules include:

· All letters must be upper case; words beginning with an upper case letter in the middle of

a sentence are tagged as nouns.

· All punctuation except commas is eliminated, then TABARI checks each individual word against the entries in the actor/verb dictionaries - words found in the dictionary are given an

integer to uniquely identify them. TABARI then represents the words with an array of

integers and assigns a word type to each literal (noun, verb, actor, comma, number,

conjunction). It also locates noun and verb phrases in the sentence using a sparse

parsing technique.


To code an event, the program finds each verb and attempts to match the words

surrounding that verb with the phrases associated with that verb in the dictionary. A

successful match results in the assignment of an Event Code to that event.

Using the example of “Italy declared war on Scotland”, as above, the pattern matching

algorithm would first apply the Actor Coding Scheme, generating:

Today, ITA declared war on SCO.

Secondly, the Event Coding Scheme would be applied, matching “declared war” to a code

from an Event Coding Scheme (see 2.3.2) giving a final output of:

ITA <EVENT CODE CORRESPONDING TO “DECLARED WAR”> SCO

Or, as an example (using DECW as the event code for “declared war”):

ITA DECW SCO

This would be done for each input into the system, generating a series of trigrams to

describe the events that took place in each article's headline.
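To make the flow described above concrete, the following is a minimal, hypothetical Perl sketch of the lookup-and-substitution idea. It is not TABARI's actual code, and the dictionaries shown are invented purely for illustration:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Invented actor and verb dictionaries, for illustration only.
  my %actors = (ITALY => 'ITA', SCOTLAND => 'SCO');
  my %events = ('DECLARED WAR' => 'DECW');

  my $headline = 'TODAY, ITALY DECLARED WAR ON SCOTLAND.';

  for my $phrase (keys %events) {
      next unless $headline =~ /(.*)\b\Q$phrase\E\b(.*)/;
      my ($before, $after) = ($1, $2);
      my ($source) = $before =~ /(\w+)\W*$/;   # word immediately before the event phrase
      my ($target) = $after  =~ /(\w+)\W*$/;   # last word of the sentence
      if (defined $source && defined $target
          && exists $actors{$source} && exists $actors{$target}) {
          print "$actors{$source} $events{$phrase} $actors{$target}\n";   # ITA DECW SCO
      }
  }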

TABARI applies a number of “dictionary rules” in its actor and verb dictionaries to increase

their flexibility with respect to the way they code events. Some useful dictionary rules from

TABARI include:

· A space at the end of a word marks it as a stem.

· Underscores connecting words denote words that will match only if found consecutively in

the text.

· An underscore at the end of a word means that the word must be followed by a space to

match.

Format of verbs file:

WON [---]

- + * PLEDGE FROM $ [054]

– * MORE {GROUND | LAND} [211] ; means ground OR land can follow.

Figure 7: Example format of a TABARI verb.


"+" and "$" are variables for actors. In TABARI, "+" denotes the target and "$"denotes the

source of the action. "%" is also sometimes used, and denotes a compound actor should be

assigned to source and target. A semicolon allows comments to be made after the code is

supplied.

Since "won" cannot be assigned a code in itself, it is assigned a null code - [---]. This null

code is used for "Eliminating words that have the same stem as common verbs, and

eliminating verb phrases that do not correspond to political activity". [19]

The other code to be mentioned is the discard code, [###] - used when stories need to be

discarded (e.g. Confusing Jordan the place with Michael Jordan the basketball player). If

found anywhere in a news report, none of the text will be coded.

"BASKETBALL[###]" is an example of its usage.

This detailed overview of how a coding scheme works suggests that a similar technique

could be employed in the system's Actor Coding Scheme, with specialist codes assigned to

do specific things within the file. This would give an amount of flexibility that would otherwise

be unattainable.

The VRA logger is a recent development by [20]. It is designed to work with the IDEA

coding scheme [21], and its application is more in the area of economics (predicting market

crashes, tax increases etc.).

Because the VRA logger is a commercial initiative, the VRA group would not make their

code available for analysis, so little is known about the algorithms that are of most interest

from a researchers point of view. However, the in-depth analysis of TABARI that was

conducted gave enough information on how to proceed with my system that this was not a

setback.

2.3.2 Event Coding Schemes

Before cementing my ideas for system design, a choice had to be made with regard to the

coding scheme my system would be aimed toward. Research into this area would allow me

to ensure compatibility between the output of my system and these schemes, as well as

research into ideas that may be applicable to my Actor Coding Scheme.


The schemes to be discussed here include WEIS (World Events Interaction Survey), IDEA

(Integrated Data for Events Analysis), PANDA (Protocol for the Analysis of Non-violent

Direct Action) [22], CREON (Comparative Research on the Events Of Nations) [23], BCOW

(Behavioral Correlates Of War) [24], COPDAB (Conflict and Peace Data Bank) [25], and

CAMEO (Conflict and Mediation Event Operations) [26].

There are a number of different coding schemes available, ranging from syntax-based

schemes such as COPDAB (see example later) to word-based (CAMEO, WEIS, IDEA).

Regardless of personal choice, the scheme would have to adhere to a set of criteria that would

allow it to be used in the Balkans area of conflict, in the context of a machine-based Event

Coding System.

Research into the Balkans area of conflict (using [27,28] as references) revealed it to be a

conflict zone comprised mainly of armed conflict, political unrest, social conflict and riots, and

genocide, followed by international mediation, and later international military action. Hence

any scheme chosen would have to include detailed classification of these types of events.

By creating a table of event types, and evaluating the schemes on their coverage of specific

points, the intent was to narrow down the seven potential coding schemes to a

final choice.

The scale chosen was as follows:

None – The type of event is not covered explicitly at all.

Basic – The type of event is covered once.

Average – The type of event is covered two to four times.

Detailed – The type of event is covered more than five times.

The marking of these events is as follows:

One count for an individual code relating to that event, with one count each for each

subsection of that event, if any are provided. If many codes are provided to cover a given

event, the overall score is the sum of the counts of all the codes that cover that event.


Scheme | Military action | Guerrilla warfare | Political instability | Mediation | Social action | Non-military warfare
WEIS | Basic | None | Basic | None | Basic | Basic
IDEA | Average | Basic | Detailed | Basic | Average | Average
CAMEO | Average | Basic | Detailed | Average | Average | Detailed
COPDAB | Basic | None | Average | None | Average | Average
CREON | None | None | Detailed | Average | Average | None
BCOW | Detailed | Average | Basic | None | Basic | Average
PANDA | Basic | None | Detailed | Average | Detailed | Basic

Figure 8: Table showing a breakdown of coding scheme coverage in a number of areas related to the Balkans area of conflict

The above table gives some indicators as to which coding scheme to use, but should be

combined with personal research to make an informed choice. A personal comparison of the

different event coding schemes follows:

WEIS

WEIS is a very old coding scheme, and seems too generalized (it has only 216 separate event codes, thanks in part to its outdated three-digit coding system, in contrast to CAMEO and IDEA, both of which use four digits to classify events; this extra digit provides an extra level of classification - see later).

This lack of detailed codes indicated that a detailed classification using the WEIS scheme

would not be possible. Besides this, both IDEA and CAMEO are second-generation

schemes (IDEA from PANDA, CAMEO from WEIS), and as such there is no reason to be

using a first-generation coding scheme, compared to either of the two mentioned above.

However, WEIS codes are available with an integrated Goldstein [5] scale rating for each

event. The Goldstein scale measures the projected effect of an event on a situation (such as

a mass-execution during a war). In the event that the system was used as part of an Event

Coding System, the inclusion of this feature could persuade the user to apply WEIS as the

Event Coding Scheme.

COPDAB

Unlike the WEIS or PANDA-based schemes, COPDAB seems to be centered around


human-coding. It was decided to remove COPDAB from contention due to the format of its

scheme, which does not lend itself to a machine-coding environment, and hence would not

be applicable to the output of my system.

09

Nation A expressed mild disaffection toward B's policies, objectives, goals, behaviors

with A's government objection to these protestations; A's communique or note

dissatisfied with B's policies in third party

Figure 9: An example COPDAB event code.

As can be seen, COPDAB would be very hard to implement in a machine-coding context,

due to its wordiness and generality. Also, terms such as “mild disaffection” would be hard to

apply to a machine-coding program as the term is hard to pin to an analyzed sentence, i.e. it

is harder to classify actions in this way. Since it contains no direct links to verbs, it will be

harder to implement this scheme when compared to IDEA or CAMEO, for example.

CREON

“CREON is to study the foreign policy process, rather than foreign policy output. In practice

this means that CREON is better suited than WEIS or COPDAB to studying the linkages

between the foreign policy decision-making environment and foreign-policy outputs for

specific decisions, but it cannot be used to study policy outputs over a continuous period of

time or for countries not in the sample” [29]

CREON is therefore unsuitable; the focus of the Balkans conflict has little to do with the

processes behind foreign policy. CREON also places less of an emphasis on warfare and

violent conflict, and hence generalizes many key areas of the Balkans conflict where a more

detailed approach would be effective.

PANDA

“PANDA's data set uses a superset of the WEIS coding scheme that provides greater detail

in internal political events”[29]

“The other major development by Bond and his collaborators is the IDEA -- Integrated Data

for Events Analysis -- coding system. This will supersede the PANDA coding scheme, and


more is designed to provide a general framework for coding events. ” [29]

These two quotations show why PANDA was eliminated from contention. PANDA places

little emphasis on violent conflict, instead focusing on humanitarian aid and internal political

structure. Although internal politics were vital to the Balkans conflict, wars also made up a

large part of the activities in that region - PANDA would be too specialized to provide a good

overview of the conflict and, like CREON, would generalize over the more violent actions of the

conflict.

IDEA

This is a very detailed coding scheme, and as shown by the table covers most areas in a

detailed manner. Utilizing a four-digit coding scheme, it would allow me to implement a

hierarchical algorithm for code assignment. Since both IDEA and CAMEO, below, encode

using four digits, both schemes have the capacity for four levels of abstraction of an event.

From a Computer Scientist's point of view, this would let an algorithm work through these hierarchies, allowing the coding algorithm to pull back to a higher level of abstraction should it be unable to assign an event to a specific subsection (more specifically, to pull back to using a three-digit general code rather than a four-digit specific code). This kind of hierarchy suits algorithm design; it is another plus point to using either CAMEO or

IDEA, especially when considering the addition of an actor coding scheme to the project

later on.
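As a sketch of how such a fallback might be written (the codes and phrases used here are invented purely for illustration and are not taken from IDEA or CAMEO):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Invented mapping from an event phrase to a four-digit code.
  my %event_codes = ('IMPOSE CEASEFIRE' => '0871');

  # Invented scheme: the codes it actually defines.
  my %known_codes = map { $_ => 1 } qw(087 0872);

  sub assign_code {
      my ($phrase) = @_;
      my $code = $event_codes{$phrase};
      return undef unless defined $code;
      return $code if $known_codes{$code};     # use the specific four-digit code if defined
      my $general = substr $code, 0, 3;        # otherwise pull back to the three-digit parent
      return $known_codes{$general} ? $general : undef;
  }

  print assign_code('IMPOSE CEASEFIRE'), "\n";   # prints 087, the more general code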

CAMEO

CAMEO, along with IDEA, utilizes a four-digit coding scheme, allowing it greater depth of

classification. The CAMEO coding scheme is also shorter than IDEA, meaning that it will be

easier to implement and evaluate.

The scheme itself goes into a sufficient amount of detail in the most important areas, and in

addition contains good support for mapping Mediation events (as were common toward the

latter stages of the conflict). It is also judged to be broad-ranging enough to code a large

percentage of the reported happenings during the conflict.

BCOW

BCOW is a scheme mainly used for analyzing heavily war-torn areas, but maintains a small


section for other types of action. Originally, it was hoped that this would be enough for a

detailed classification of events in the Balkans region, but it became apparent that this would

not be the case. The other events section would be too small to successfully capture the

depth of the actions in the region, and in addition to this the heavy reliance on war and

violent activities indicated that a bias might be included (e.g. there is more chance of a

“disputed” event being classified as violent, rather than political).

Overall, CAMEO appears to be the best all-round scheme for the analysis of the Balkans

conflict. It has a detailed all-round categorization of violent, non-violent, political, and social

actions, as well as the classification of mediation actions. A later extension to my project will

no doubt include an Event Coding Scheme – CAMEO would be the choice in this case.

2.3.3 Actor Coding Schemes

Since the VRA logger's code is not publicly available, and TABARI has superseded KEDS,

research into the Actor Coding Scheme will be based on the TABARI scheme.

The actor scheme is formatted to have one entry per line, to ease the process of reading the

actors into an array, and also for human-readability, to enhance clarity.

An example actor code:

EAST_AND_WEST_GERMANY [GME/GMW]

Since actors change over time, TABARI allows people and places to change code.

MOSCOW [USR (<901225) RUS (>901226)].

This is a very useful feature, especially considering the turbulent and dynamic events leading

up to and including the Balkans conflict. The actors are manually tagged, with the name of the person who attributed the tag appearing as a comment beside the tagged actor.

The scheme goes into great detail, making fine-grained distinctions between the main actors

in the conflict. For example, Yugoslavia has 42 subcategories, specifying presidents, armed

forces, rebel factions, political movements, states and other related actors. However,

countries that could be considered peripheral to the conflict receive a much more general

classification (countries such as Zaire and Scotland are tagged as single entities), to reduce

complexity as well as file sizes.


An advantage to using a dictionary of actors from the conflict to construct the Actor Coding

Scheme is that these varying levels of depth and complexity should emerge automatically.

For example, if an article makes a distinction between Yugoslavia, Yugoslavia's military, and

a Yugoslav President, they will appear as separate entities in the coding scheme. Also,

since the articles will be extracted from the corpus depending on whether or not they relate

to the Balkans conflict, more fine-grained distinctions will be found for the main actors in the

conflict, compared to actors peripheral to the conflict, meaning more distinctions (and

therefore more fine-grained definitions) will be made for those main actors in the scheme.


Chapter 3

Design and Methodology

3.1 Methodology

Adhering to a design methodology will enable me to complete my project in an effective,

systematic manner. [30] is a simple methodology, and often referred to as the “classic

methodology”. In it, the distinct stages (Analysis, Design, Implementation, Testing,

Evaluation, or similar) are completed sequentially, and the previous stage must be

completed before the next is started (hence “waterfall”).

The basic flow of this diagram (figure 10) can be seen mirrored in the Gantt charts (appendix

B, figures 1 and 2), as well as in the predicted “pipeline” flow of data through the system –

the output from one program is used as the input to the next program on the pipeline, with

data flowing from start to end of the system.

Figure 10: Basic Waterfall methodology

Figure 10 maps onto the report as follows: Section 2 can be thought of as Analysis, section 3

Design, section 4 Coding, and section 5 Testing. Maintenance is out of the scope of the


project, but nevertheless would be a continuing part of the life cycle of the system.

Waterfall methodologies have disadvantages, primarily that problems with the system are

not discovered until the “Testing” stage, and also that product requirements must be fixed

before the system is designed.

To this end, a “modified waterfall” methodology [31] may be used, removing the problems

associated with the use of its “classic” counterpart. In this case, modifications include

testing at every stage of the development process, and the inclusion of the two additional

methodologies mentioned below. [32] describes “throwaway prototyping”, which would be

ideal for the “Design” stage – having a throwaway model allows experimentation with

different ideas without setting anything in stone. I chose this as the final methodology for the

“Design” stage, as it removes the possibility that the project may be hampered by a poor

design decision early on due to unfamiliarity with Perl, as well as being more flexible than a

waterfall methodology.

The implementation stage could be thought of as an iterative cycle of smaller waterfalls –

analysis, design, coding and testing. Because the system will run as a pipeline of small,

single-function programs, one program cannot be fully tested until the previous one is

complete. These smaller waterfalls allow this pipeline development to take place, since each

program can be tackled sequentially.

Overall then, a modified Waterfall methodology best reflected the intended direction and

development of the system, and throwaway prototyping provided flexibility at the design

stage, as well as removing the need for product requirements to be fixed before the system

is designed. A series of smaller waterfalls allow efficient software implementation of a

pipeline system [31]. Deliverables for each stage are as follows:

Stage | Deliverables
Analysis | Evaluation of actor/event schemes and coding systems. Minimum requirements. "Research" write-up. Draft chapter
Design | Final design and "Design and Methodology" write-up. Mid-project report
Coding | Implementation plan, final solution, "Implementation" write-up
Testing | Evaluation, testing results, included with the "Testing" and "Evaluation" write-ups

Figure 11: Deliverables for each stage of the report

3.2 Scope

Scope refers to the scale of the project, its intended users, and other such information.

The target users for this system are primarily event coders, who can use the system to

generate an actor dictionary for use in the coding system of their choice, and/or generate a

precoded set of actors. Since both of these operations are independent of the input corpus

used, an event coder working on any region could feasibly use the system.

Examples of scale include Reuters one-day archives being between 284997 (19970809.zip)

and 4725275 (19970515.zip) bytes compressed, 668906 and 11600256 bytes uncompressed, and

containing between 214 and 3928 entries respectively. The entire Reuters one-year corpus

used “...is an archive of 806,791 English language news stories...”. [2]

An example TABARI actor scheme (the TABARI Balkans dataset, BALK.ACTOR.030502)

was 60369 bytes, with 1631 entries.

The system must therefore be able to cope with very large file sizes for input, and be

expected to generate actor files of between approximately 2500 and 60000 bytes, containing

between 100 and 1500 estimated discrete actors.

3.3 Programming Languages and Tools

The system is designed to be compact, simple, and run as a pipeline. This allows certain

functions to be performed on an input (tagging, parsing, generating a dictionary, generating

a list of organizations etc.), without the entire dictionary generation system being performed

at once.

Keeping each program to a single simple function, rather than grouping all required functions into one program, also allows inexperienced users to have a better grasp of how the system works (each program's code becomes easier to read since it is not surrounded by other methods, and each program also has its own readme file). [33] is a program implemented in Lisp, a language with roots in NLP, and its implementation is similar to that of my intended system.


This seems to point away from needing a more complex language such as Java or C++/C#

since no complex programming structure is necessary. Object-oriented functionality is not

required for a pipeline system such as mine, and would overcomplicate the implementation.

Perl or Python could therefore also be considered: like Java and C++, they are platform

independent. They also generate smaller file sizes than a comparable program in Java or

C++, since Perl and Python files need not be compiled.

Perl was designed (in part) as a language for language processing, similarly to Lisp. Perl

has a powerful, flexible, and simple regular expression system which could be used for

implementing information extraction. Regular expressions in Java take a minimum of five

lines to implement – in Perl they can be done in a single line.

Perl also allows simple output to (and input from) multiple files, using a system of shorthand

filehandles, and simple language-input manipulation. Due to its NLP functionality, Perl will

be the language used to implement the system.
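As a small illustration of this brevity (a hypothetical fragment, not part of the final system), appending every capitalized word from standard input to a file takes only a few lines of Perl:

  #!/usr/bin/perl
  use strict;
  use warnings;

  open my $out, '>>', 'names.txt' or die "Cannot open names.txt: $!";
  while (my $line = <STDIN>) {
      # A single regular expression pulls out every capitalized word on the line.
      print {$out} "$_\n" for $line =~ /\b([A-Z][a-z]+)\b/g;
  }
  close $out;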

RASP (a parser installed on the SoC Linux machines) could be used for the NP/VP

extraction, as well as parsing and POS tagging the articles for extraction of the required

elements.

3.4 Identifying Useful Information

The first step to extraction, once all files are accessible (i.e. in the same directory), is to take

only the parts of the article that I need. This design decision is backed up by [33], which

“was developed in response to the needs of the intelligence community for scanning and

processing huge volumes of written texts...FASTUS provides the analyst with a tool that will

help him or her to avoid being overwhelmed by the flood of information“.

Since FASTUS was designed to reduce the amount of text that users of a large corpus need

to handle, it indicates that there is a problem with maintaining large amounts of information.

For processing a corpus containing over 800,000 articles, it therefore seems necessary to

discard as much useless information as quickly as possible – to reduce processing time as

well as the size of the files my system will generate and process.


The elements of primary interest are the title of the article and its first paragraph (since the first paragraph always summarizes the contents of the remainder of the article). Most of the articles contain more than one paragraph, so the system must be able to extract only the first; because the extracted corpus will be very large (over 2 gigabytes [2]), it is necessary to reduce the size of the working corpus as soon as possible.

The Reuters corpus used [2] was encoded with an XML schema (Appendix B, Figure 12).

This schema clearly defines the elements of the article, such as “Title” or “Headline”, as well as

“code” fields that specify the countries involved, and the type of event the article is

describing.

These codes could be used to determine whether or not an article should be included in the

corpus of Balkans headlines, by taking only articles referring to certain countries, regions or

types of action.
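A possible sketch of such a filter is shown below. It assumes (an assumption about the corpus format, to be checked against [2]) that the country and topic codes appear as <code code="..."> elements within each article's XML; the region codes passed on the command line are hypothetical:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Usage (hypothetical): filter.pl CODE1 CODE2 < article.xml
  my %wanted = map { $_ => 1 } @ARGV;

  local $/;                         # slurp the whole article at once
  my $article = <STDIN>;
  exit unless defined $article;

  # Keep the article if any of its <code code="..."> elements matches a wanted code.
  my $keep = 0;
  while ($article =~ /<code\s+code="([^"]+)"/g) {
      $keep = 1 if $wanted{$1};
  }
  print $article if $keep;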

3.4.1 Information Extraction

Since information extraction must be performed only on specific parts of each article, it

follows that some form of identification must be performed to help my system to understand

the types of words and phrases it is working with.

Almost every stage of my system will involve some form of information extraction or

manipulation, since I will be working from an entire corpus down to a selection of words and

phrases. Hence, at each stage of my system, the data I am processing must have some

form of identification with it, so only the relevant information is manipulated.

Other systems extract using parsing ([20]) or tagging and pattern matching ([10], [16]).

Tagging-based extraction requires an accurate and detailed tagging of words, showing what

part of speech (determiner, verb, first person noun) the words are, and also what semantic category they represent (a person, a country, and so on).

Parse-based extraction, as used in [15], is harder to implement than its tagging-based

variant, since it requires knowledge not only of the word in question, but also of the

surrounding words, and the construction of the phrase those words appear in, and also

possibly of the sentence as a whole. By examining all three of these factors, and not just the

word, a more accurate classification can be made. However, implementation of such a


feature is often tricky and time consuming, and the benefits of its implementation are questionable

when a word-based classification is often sufficiently accurate for a system of this type.

Additionally, although both inputs are imperfect, and prone to mistagging that is out of the

sphere of control of my system (since NER must be done externally, using [34]), it is clear

that tagging-based classification will suffer less from a single mistagging in an article.

For example, if one word in an article is mistagged, only that word is affected by the

mistagging. Although this obviously has a detrimental effect on the accuracy of the

dictionary creation, the mistagging error is isolated (that is, a single mistagging is not

propagated through the rest of the article).

Parse-based classification requires the structure of an article to be defined, the structure

then being used to help classify the words in the article. Similarly to the word-based

approach, mistaggings occur. However, since one parameter is used to determine the next

(i.e. the end of the first noun phrase signifies the start of the first verb phrase), a mistagging

can throw the structural tagging of the entire article awry.

|---NP---|-VP-|

The cat sat

|-NP-|---VP---|

The cat sat

Figure 13: An example of the “knock-on” effect of a mistagging while using a parse-

based approach.

Additionally, testing the RASP parser has shown problems with some input types,

particularly fragments. These occur more frequently in news reports than they do in common

written English.

I therefore decided against using a full parse, primarily for simplicity, and because the tagging-based approach confines mistagging errors.

Regular expressions are likely to be the way to implement pattern matching for any

extraction the system is expected to perform. Perl has excellent integrated support for

regular expressions.


3.4.2 Exploiting POS and NER

For my system to function, it must be able to determine what type of words are being passed

to it.

Part of Speech (POS) tagging is used to identify the syntactic type of a word, and appends that type to the end of the word, which helps systems that need to distinguish between different types of words. Example POS tags include:

The-DET

Meeting-NN

The above examples show “The” to be of type “Determiner” (DET), and “Meeting” to be of

type “Noun” (NN). The actual tagset may vary from implementation to implementation, but

are generally standard across major POS systems. In either case, it is a regular expression

that is expected to do the matching, with some form of algorithm to process the input (the

words and their syntactic types), and produce an output that analyzes the structure and

places further tags on the words or phrases it determines to be actors or events.

This helps because it breaks down the structure of the sentence to a fine-grained detail

(assignment being on a “per-word” basis), preprocessing the data and turning it into a

manageable form for use with the dictionary creation system. If no tagging was performed,

the system would have no way to distinguish between different types of words (since it has

no inbuilt appreciation of context). NER and POS allow an algorithmic, computational

approach to a traditionally human-solved problem.

Named Entity Recognition (NER) is a method most commonly associated with NLP tasks. It

algorithmically assigns each word in an input to an entity type. These types include “person”, “country” and so on. Regular expressions can be used to separate the different entity types, allowing dictionaries of certain entity types to be constructed. An NER-tagged output would be similar to:

Microsoft-ORG

The suffix “ORG” tags Microsoft as being an organization. Creating a dictionary of

organizations is therefore likely to involve exploiting this NER tagging: extracting, for example, every word tagged “ORG” and placing it in a file containing all previously identified


organizations from the input. It will be easiest to implement this method of extraction using

regular expressions to match the suffixes of the words.
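A minimal sketch of this suffix matching (assuming NER output of the form word-ORG; this is illustrative only, not the final extraction program):

  #!/usr/bin/perl
  use strict;
  use warnings;

  open my $orgs, '>', 'organisations.txt' or die "Cannot open organisations.txt: $!";
  my %seen;
  while (my $line = <STDIN>) {
      # Match every token tagged as an organization, e.g. "Microsoft-ORG".
      while ($line =~ /\b([\w.]+)-ORG\b/g) {
          print {$orgs} "$1\n" unless $seen{$1}++;    # write each organization only once
      }
  }
  close $orgs;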

3.5 System Design

My system aimed to take a corpus of Reuters newswire headlines, and from them create

dictionaries of organizations, people, and dates (for any time-series analysis a user may

wish to perform).

Figure 10: The system as a “Black Box”.

The following is a demonstration of the desired inputs and outputs of the system.

Today (4/8/85), soviet tanks moved across the border of France towards Paris,

despite the protestations of the French President, Jacques Chirac.

From the sample shown above, the following organizations, people, and dates should be

generated:

ORGANISATION: Soviet tanks, France, Paris.

PERSON: French President, Jacques Chirac.

DATE: 4/8/85.

From the actors extracted, people who are the same, but identified with different names,


should be clustered together to show that they are the same actor. One such cluster would

be:

(French President, Jacques Chirac)
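One simple way such clusters might be formed (an illustrative sketch only; the clustering actually implemented is described in Chapter 4) is to group person mentions that share their final token, on the assumption that this is usually a surname:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Hypothetical list of extracted person mentions.
  my @people = ('Jacques Chirac', 'President Chirac', 'Tony Blair');

  # Group mentions by their final token - a crude surname-based clustering.
  my %cluster;
  for my $name (@people) {
      my ($surname) = $name =~ /(\S+)$/;
      push @{ $cluster{$surname} }, $name;
  }

  for my $surname (sort keys %cluster) {
      print '(', join(', ', @{ $cluster{$surname} }), ")\n";
  }
  # Prints: (Tony Blair)
  #         (Jacques Chirac, President Chirac)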

Finally, an Actor Coding should take place, assigning a distinct code to each actor.

FR_PRES (French President, Jacques Chirac)

SOV_MIL (Soviet tanks)

FRA (France, Paris)

This output could be used in Event Coding Systems, as the Actor Coding Scheme. The

system would search for the words in parenthesis, and assign the capitalized code on the

left to that actor should a match be positive.

Given a sample input of:

“Soviet tanks attack Paris, Chirac surrenders.”

The Event Coding program would use the actor codes generated from my program, together

with event codes from a scheme such as CAMEO (see 2.3.2) to generate an event code for

that sample (using examples of “SURR” as the event code for “surrender”, and “MIL_AGG”

as the event code for “military aggression”).

SOV_MIL MIL_AGG FRA

(The Soviet military committed an act of military aggression against France).

FR_PRES SURR SOV_MIL

(The French President surrendered to the Soviet military).

Proceeding in this way, it can be observed that a list of event codes will be generated, which

can then be analyzed for whatever their intended use is.

Whilst considering system design, a number of decisions had to be made. Primarily, how to

strike a balance between features and complexity.

Since the system is designed to be reusable for other areas, or other applications that


require dictionaries to be generated, a simple design would breed a simple (and hence easy

to document and use) interface. Conversely, a lack of additional features would reduce the

number of tasks the system could execute, reducing the chances that the system would

provide a feature some user might need.

So the design of the system must be directed towards providing some extra features, but

without losing a simple design and interface. With this considered, the easiest way to

provide added functionality would be to split the dictionary generation process into steps.

Each step would modify the data as required for the next step, so that the system would

function exactly the same running each separate program back-to-back as it would running

the system as a single program, the bonus being that the single programs could also be run

independently, allowing other programmers to use only the parts of the system they want,

and taking the output from any point in the dictionary creation process.

From a usability point of view, this allows each program to be viewed by whoever wishes to

use it, separate from the system as a whole. Not only would this aid in understanding of the

program (separate programs, out of the context of the system as a whole, would be easier to

understand since they would be shorter and simpler), each program could also have its own

documentation (readme file), explaining in more detail how a specific part of the system

works (since it would receive more coverage than it would in a general system-wide readme

file).

Finally, if a prospective user only wishes to use two or three programs from the system, they

need not concern themselves with any other part of the system.

Reusability could be catered for by allowing a plug-in capability, allowing other regions to

have Event Coding dictionaries generated for them. The part of my system (most likely a

single program) that extracts articles relevant to the Balkans conflict area should therefore

be designed to accept other countries and classification codes as arguments, extracting

from a Reuters corpus based on matching XML “code” tags to the country codes and topic

codes specified in the input. This would make the system reusable, but only if a Reuters

electronic corpus such as [2] was used as input, as the extraction is dependent on the XML

tags that accompany each article.


Chapter 4

Implementation

Implementation is to be carried out in Perl, due to its capacity for handling and processing

language-based input. For simplicity of analysis, the implementation can be thought of as a

two-stage, five-step process.

The first stage is the creation of the dictionaries, and the second stage the assignment of

actor codes to an input. In the first step, the relevant information is to be extracted from the

Reuters news corpus. Secondly, following NER preparation and preprocessing, a number of

word lists are to be created (using the NER tags). Thirdly, the word lists are to be clustered.

Fourthly, the word lists are to be transformed into dictionaries.

The second stage, and fifth step, is the assignment of actor codes to an input file based on


the actors in the dictionary file.

Figure 13: Flow chart of dictionary creation, showing the implementation of the system, mirroring the "waterfall" model methodology. Implementation starts at the top and flows down. Each program can itself be represented as a "mini-waterfall" of design, implementation, and evaluation. (ActorScheme.pl applies the scheme to a corpus.)

After analyzing the two techniques (tagging/pattern matching and parsing), I have decided to

implement the former as my primary dictionary creation technique. Not only is it simpler, but

it is very easily implemented using Perl regular expressions on an NER-tagged input.

Pattern matching the suffixes based on the three available classifications - PER, ORG and

DAT (or Person, Organization, and Date), and adding each of the three into a separate array

gives three arrays containing all of the PERs, ORGs and DATs in the input file. By keeping

the unique identifier of each article in the output files, a list of articles followed by any PERs,

ORGs or DATs contained in that article can be output relatively easily.
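As an illustration only (this is not the actual extractor source, and the exact tag format depends on the NER tool used), the sketch below shows the suffix-matching idea, assuming each token carries its classification as a suffix such as "Albania_ORG":

#!/usr/bin/perl
use strict;
use warnings;

# Sketch: collect tokens into one array per classification, based on an
# assumed "word_ORG" / "word_PER" / "word_DAT" suffix format.
my (@orgs, @pers, @dats);
while (my $line = <STDIN>) {
    foreach my $token (split /\s+/, $line) {
        if    ($token =~ /_ORG$/) { push @orgs, $token }
        elsif ($token =~ /_PER$/) { push @pers, $token }
        elsif ($token =~ /_DAT$/) { push @dats, $token }
    }
}
print "ORG: @orgs\nPER: @pers\nDAT: @dats\n";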

Although a full parse will not be the method used to extract or code actors, the capability to

perform a full parse has been included (in Extract.pl) as part of preprocessing for others who

may want to use my system for syntactic structure analysis. This in itself requires preprocessing via the RASP parser. The implementation of the noun and verb phrase extraction itself is relatively simple – a sample RASP output, with POS tagging and parsing, is shown in Appendix B, Figure 14.

The NP and VP extraction is based on parenthesis matching. The parsed form of an article

contains a number of pairs of parenthesis that mark the boundaries between word

classifications, noun phrases, verb phrases and the like. Taking the open parenthesis just

before the start of a NP or VP, and counting the number of open and closed parenthesis, the

end of the NP or VP can be located. The words contained between the matching pair of

parenthesis constitute the NP or VP itself.
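A minimal sketch of the parenthesis-counting idea follows; the subroutine and the sample parse string are illustrative, not the actual Extract.pl code.

# Starting at the opening parenthesis of an NP or VP, walk forwards,
# counting opening and closing parentheses until they balance; the text
# between the matching pair is the phrase itself.
sub extract_phrase {
    my ($parse, $start) = @_;        # $start: index of the opening '('
    my $depth = 0;
    for (my $i = $start; $i < length($parse); $i++) {
        my $char = substr($parse, $i, 1);
        $depth++ if $char eq '(';
        $depth-- if $char eq ')';
        return substr($parse, $start, $i - $start + 1) if $depth == 0;
    }
    return undef;                    # unbalanced fragment
}

my $parse = "(NP (AT The) (NP1 Chicago) (NNJ1 Board))";
print extract_phrase($parse, 0), "\n";    # prints the whole NP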

Although suffering from the mistagging problem mentioned above, it affords the system

some added functionality for those that wish to take advantage of it.


4.1 Programs, Inputs and Outputs

StoryExtractor.pl
Input(s): Directory of unzipped Reuters XML articles
Output(s): A single file containing the title, first paragraph, date, and any code fields for every article

LeadLine.pl
Input(s): The output from StoryExtractor.pl
Output(s): The articles that contain the codes provided by the user

NERPrep.pl
Input(s): The output from LeadLine.pl
Output(s): The same file, ready for NER

RASPer.pl
Input(s): The output from LeadLine.pl
Output(s): The same file, ready for RASP

Extract.pl
Input(s): RASP-processed corpus
Output(s): Each article, split into its NPs and VPs

ORGDATPERExtractor.pl
Input(s): NER-processed corpus
Output(s): Three files (ORG, DAT, PER), each containing only the entities of that syntactic type taken from the corpus

PuncRemover.pl
Input(s): Any of the three files from ORGDATPERExtractor.pl
Output(s): The same file with punctuation removed and simple stemming applied

HeuristicClusterer.pl
Input(s): The cleaned PER file from PuncRemover.pl
Output(s): A file containing each unique cluster

CountryClusterer.pl
Input(s): The cleaned ORG file from PuncRemover.pl
Output(s): A file containing each unique cluster

DictionaryBuilder.pl
Input(s): Any of the three files from PuncRemover.pl
Output(s): An alphabetically-ordered dictionary of unique actors

ActorCoder.pl
Input(s): The output from HeuristicClusterer.pl and CountryClusterer.pl
Output(s): An Actor Coding Scheme containing each unique cluster followed by its assigned code

ActorScheme.pl
Input(s): The output from ActorCoder.pl, plus an input corpus
Output(s): An actor-coded output corpus

The programs created can be assigned to the five-step process (see 3,4) as follows:

The first step included the two programs StoryExtractor.pl and LeadLine.pl. The second

step involved the program NERPrep.pl. Following the NER preprocessing, the

ORGDATPERExtractor program was used to create the separate word lists (one for each of

ORG, DAT, and PER). The program PuncRemover.pl was also added to this step of the


system (see 4.2 for details). To extract the NPs and VPs from the input, the program

Extract.pl must be used on an input parsed and POS tagged by RASP. This input must first

be formatted to be compatible with RASP, using the file RASPer.pl.

The third step involved the programs HeuristicClusterer.pl, CountryClusterer.pl, and

FileJoiner.pl.

The final step of the first stage involved the file DictionaryBuilder.pl.

The second stage, and fifth step, of my system will use ActorCoder.pl to assign codes to the actors in an input file, using the actors found in the dictionary file created by DictionaryBuilder.pl. The scheme can be invoked on an input via ActorScheme.pl.

4.2 Information Extraction

The preliminary information extraction (stage 1), produced a single file where each line

contained the information of a single article (its date, title, first paragraph, and any additional

“code” fields). This was implemented in the program StoryExtractor.pl.

When all of the files in the Reuters corpus were unzipped into the same directory (using a

conventional unzip program present on both Unix and Windows-based operating systems),

StoryExtractor.pl was run. Utilizing Perl's grep function over the directory listing, an array of all the XML files in the directory (@files) was generated as follows.

opendir(DIR, ".");                          # open the current directory
@files = grep(/\.xml$/, readdir(DIR));      # keep only the .xml entries
closedir(DIR);

This removed the need to specifically declare each file as needing to be read before any

processing could be done on the file.

Each of these XML files was then examined line by line, and a series of regular expressions

used to extract the necessary fields from each file via pattern matching. To maintain a

consistent style of extraction, the program was designed so that no matter the order the

fields appeared in, they would appear in the output file in a set order (determined by the

order of the regular expressions in the sequence).

An integer variable was used to ensure that only the first paragraph from each article (the

lead line) was taken. The rest of the extraction consisted of a simple series of if statements


coupled with regular expressions which process each line of the input.
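A minimal sketch of this kind of extraction loop is shown below; the variable names and regular expressions are illustrative and simplified relative to the actual StoryExtractor.pl code.

# Assumes Reuters-style <title>, <p> and <code code="..."> markup; only
# the first <p> (the lead line) is kept, tracked by a simple counter.
my ($title, $lead, $date) = ("", "", "");
my @codes;
my $paragraphs_seen = 0;
while (my $line = <STDIN>) {
    if ($line =~ /<title>(.*?)<\/title>/)        { $title = $1 }
    if ($line =~ /<p>(.*?)<\/p>/ and $paragraphs_seen == 0) {
        $lead = $1;
        $paragraphs_seen++;
    }
    push @codes, $1 while $line =~ /<code code="([^"]+)"/g;
    if ($line =~ /value="(\d{4}-\d{2}-\d{2})"/)  { $date = $1 }
}
# Everything for one article is written out on a single line, in a fixed order.
print join(" ", $title, $lead, $date, @codes), "\n";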

476032newsML.xml<title>ALBANIA: ALBANIA MOURNS REFUGEES,
ITALY PREPARES FORCE.</title> <p>Albania declared a day of mourning on
Monday for refugees drowned in a shipwreck off Italy, battering Rome's image as it
organised a multinational force to protect aid to the chaos-torn country.</p> <code
code="ALB"><code code="ITALY"><code code="GCAT"><code
code="GDIP"><code code="GDIS"><code code="GVIO"><dc
element="dc.date.published" value="1997-03-31"/>

Figure 15: The extracted parts of a Reuters news article.

Due to page constraints this appears cluttered; however, the output file presents all this

information on the same line, making the processing easier.

The second stage of the extraction used the output file from the StoryExtractor.pl program

as input. LeadLine.pl was used to extract articles relevant to the area of study via extraction

based on the “code” XML fields of each article. Since each article now occupied only a

single line, this could be performed by a regular expression matching the whole line (whole

article) at once. To support reusability, the user could dynamically construct the regular

expression, and hence could easily extract from a corpus based on an entirely different set

of criteria. Matches were performed on country and topic codes, as defined in the

“topic_codes.txt” and “region_codes.txt” files supplied with the Reuters corpus.

Perl regular expressions support the “|” symbol for “OR”. Hence a number of codes could be

specified as command line arguments, separated by “|” symbols. Once constructed, the

regular expression was run against each line in the file, outputting any matches to a

separate file ready for NER.

Appendix B contains Figure 16, illustrating this implementation.

4.3 Named Entity Recognition

Named Entity Recognition is a process that associates the syntactic category of a word with

that word, over a given input. Since it is used to process English texts, the input file had to

be preformatted to remove the XML tags – the inclusion of which would have made the NER

program malfunction, and produce a greater number of mistaggings.


The removal of XML tags was implemented in the file NERPrep.pl. Capturing groups were

used to extract the unique identifier of each article, and the first paragraph of that article.

Only the unique identifier and first paragraph were extracted: the unique identifier was required for identifying each article, and the first paragraph was to be the

subject of the NER processing. The remainder of the fields would have been meaningless

to the NER program, and may have led to mistaggings.
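A minimal sketch of this tag-stripping step follows, assuming the one-article-per-line format produced by StoryExtractor.pl (the exact capturing expression in the real program may differ); it produces lines like the one shown in Figure 17 below.

# Capturing groups keep only the file name (unique identifier) and the
# lead paragraph; everything else on the line is discarded.
while (my $article = <STDIN>) {
    if ($article =~ /^(\S+newsML\.xml).*?<p>(.*?)<\/p>/) {
        my ($id, $lead) = ($1, $2);
        print "$id. $lead\n";
    }
}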

476032newsML.xml. Albania declared a day of mourning on Monday for refugees

drowned in a shipwreck off Italy, battering Rome's image as it organized a

multinational force to protect aid to the chaos-torn country.

Figure 17: A cleaned Reuters news article, ready for NER

This produced a file suitable for input into an NER system. This file was then sent for NER

processing. The returned file was to be used in conjunction with the ORGDATPERExtractor.pl

file, which would extract every organization, date, and person from the file, and place them

into the relevant output file, one for each type of entity extracted.

The first stage was to split the entire input into an array of words, done by splitting a line on

white space.

Now each element of the array of words was matched against regular expressions designed

to capture the ORG, DAT, or PER suffixes placed by the NER process.

The unique article identifier was printed to each of the three output files, since it would be

needed by all of them. A word which matched ORG would be printed to the ORG output

filehandle, and similarly for the other two filehandles.

However, dry runs on some sample texts showed two major problems that had to be tackled.

Firstly, many organization and person names are over one word in length. Although the

program correctly identified them, it printed them as:

Ariel

Charon


Whereas the required format is:

Ariel Charon

This problem was rectified by having the next array element (the next word in the article)

checked before printing the new line characters that would separate the organization or

person words. The lines that controlled the output format of the word added a new line

character (signifying the end of the current name) if the following word in the array was not

of the same suffix (ORG, DAT, PER) as the word currently being processed. In the same

way that a regular expression match is signified with “=~”, a non-match can be signified with

“!~”.

The second problem stemmed from the fact that the NER software used to tag the input file

had a tendency to wrongly tag countries as people. This was sidestepped by including an

exclusion list – an associative array of countries.

If a match was found between the element and the exclusion list, that element was printed straight to the ORG output file, since it must be a country. This stopped countries that had been

incorrectly tagged as people by the NER process being added to the wrong dictionary.

# Each country name becomes a hash key (value 1), so that
# exists($ExclusionList{$word}) can be used as the membership test.
%ExclusionList = map { $_ => 1 } ("Yugoslavia", "Poland", "Albania", "Austria", "Bosnia",
    "Bulgaria", "Croatia", "Romania", "Turkey", "Slovenia", "Slovakia", "Georgia",
    "Macedonia", "Hungary", "Serbia", "Egypt", "Cyprus", "USA", "America", "U.S.A",
    "US", "England", "UK", "Germany", "Holland", "Netherlands", "Russia",
    "Switzerland", "Greece", "France", "Italy");

Figure 18: The Perl code for the exclusion list.

Postprocessing was done with the PuncRemover.pl program. This was because a sizable

proportion of the names detected as “unique” were in fact the same name, but with added

punctuation, e.g. England and England's. The role of PuncRemover.pl was simply to stem

every entry in each of the three files, so that the unique names could be taken with no plural

entries, with no punctuation. This was implemented using the Perl grouping support for

regular expressions.
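One possible sketch of this cleaning step, assuming one entity per line (the exact expressions in PuncRemover.pl may differ):

# Strip possessives, crude plural endings and remaining punctuation so
# that "England", "England's" and "Englands" collapse to one entry.
while (my $name = <STDIN>) {
    chomp $name;
    $name =~ s/'s$//;                          # England's -> England
    $name =~ s/s$// if $name =~ /\w{3,}s$/;    # very crude plural stemming
    $name =~ s/[[:punct:]]//g;                 # drop any leftover punctuation
    print "$name\n";
}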

4.4 Clustering


Clustering involved three programs, HeuristicClusterer.pl, CountryClusterer.pl, and

FileJoiner.pl.

HeuristicClusterer.pl was designed to cluster names based on a series of heuristics (values

possessed by one entity that would identify it as being the same as another entity). These

heuristics took the form of a set of regular expressions, the premise behind the program's operation being that each identified PER entry would be compared to an array whose elements themselves were arrays containing all of the clustered names (i.e. all the names that have already been through the system). If a match was found, the new element would

that have already been through the system). If a match was found, the new element would

be added to the array of names that it matched. If no match was found, a new array would

be created with that element at the first array position.

In practice, it became apparent that an associative array was the easiest way to implement

this – each unique name would have a unique key, two names that had been identified as

the same by the heuristics would have a common key.

A Lincoln, 1

Bill Gates, 2

President Lincoln, 1

Firstly, the heuristics had to be defined, first in words, then in code. Secondly, each

name had to be split into a first element (first name), and second element (second name), so

that it could be compared using the heuristics defined above. Thirdly, the associative array

had to be created, along with logic for adding clustered elements to existing arrays, and for

adding new elements with a new, unique key to the hash.

The heuristics chosen were as follows:

The first character of the first name, followed by zero or more word characters,

followed by a space, followed by the last name

Or, in Perl:

$firstName[0]\w*\s$lastName

$firstElement[0]\s$lastElement

The heuristics had to be altered to give improved clustering accuracy; this simple heuristic

outperformed a more complex, earlier version.


The variables $firstName and $lastName come from splitting of the word on whitespace. As

each name is processed, the program ensures that even if the person has three or more elements in their name, only the first and last elements are taken (processing middle names would be inefficient, as well as unnecessary – people are generally known by their first and last names only).

Once all the names are present, each element in turn is checked against the name

heuristics, and clustered if a match is found.

The element in question is compared with each key in the associative array (each distinct

cluster already defined by the program). For every id present, the name stored at that position is split into a first name and last name as above, giving two variables which can then be compared against the person's first and last name.

If the element's heuristics match those of the name currently being processed, that element

is given the same key as the associative array entry which gave the match. If no match is

found, a new id is created in the associative array, and the element with no match added as

the first member of that array.
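The following is a minimal sketch of that clustering loop (not the actual HeuristicClusterer.pl source); it applies the "first initial, zero or more word characters, a space, then the last name" heuristic in both directions, and the three sample names are illustrative.

#!/usr/bin/perl
use strict;
use warnings;

# %cluster_of maps each name to a numeric cluster id; names matching an
# existing cluster reuse its id, otherwise a new id is created.
my %cluster_of;
my $next_id = 1;

sub first_and_last {
    my @parts = split /\s+/, shift;
    return ($parts[0], $parts[-1]);
}

foreach my $name ("A Lincoln", "Bill Gates", "Abraham Lincoln") {
    my ($first, $last) = first_and_last($name);
    my $assigned;
    foreach my $seen (keys %cluster_of) {
        my ($seen_first, $seen_last) = first_and_last($seen);
        my $pat_a = quotemeta(substr($seen_first, 0, 1)) . '\w*\s' . quotemeta($seen_last);
        my $pat_b = quotemeta(substr($first, 0, 1))      . '\w*\s' . quotemeta($last);
        if ($name =~ /^$pat_a$/ or $seen =~ /^$pat_b$/) {
            $assigned = $cluster_of{$seen};   # same cluster as the matching name
            last;
        }
    }
    $cluster_of{$name} = defined $assigned ? $assigned : $next_id++;
}
print "$_ => cluster $cluster_of{$_}\n" for sort keys %cluster_of;

Run on the three sample names, "A Lincoln" and "Abraham Lincoln" share one cluster id and "Bill Gates" gets its own.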

CountryClusterer.pl was a simple program, which replaced a positive country match

(in the ORG file) with a predetermined country abbreviation (ITA for Italy, for example).

The regular expression syntax used was slightly different to the usual matching syntax – this

is because the program is designed to substitute the matched line with a set string.

As with ordinary matching, case insensitivity had to be explicitly enabled (using the "i" flag), and the program had to match globally throughout the string (the "g" flag). To enable substitution mode, the "s" prefix had to be included.

The regular expression construction had to include .* both before and after the search string,

to ensure that the whole line was substituted rather than just the matched word.

A problem with this implementation was that the matcher continued to match even if it had

assigned a country code to a country. It was found that some of the abbreviations I had

chosen would be replaced by a further abbreviation, giving that particular line an incorrect

code assignment. To stop this happening, a boolean was used to record whether an input had been matched, preventing further comparisons and substitutions from occurring after this first match.
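A sketch of this substitution loop is given below; the country-to-code table here is a small illustrative sample rather than the full list used by CountryClusterer.pl.

my %country_code = (
    "Italy"   => "ITA",
    "Albania" => "ALB",
    "France"  => "FRA",
);
while (my $line = <STDIN>) {
    my $matched = 0;        # only the first matching country may substitute
    foreach my $country (keys %country_code) {
        last if $matched;
        # .* on both sides replaces the whole line with the country code
        $matched = 1 if $line =~ s/.*$country.*/$country_code{$country}/gi;
    }
    print $line;
}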

4.5 Dictionary Creation

The dictionary creation implementation originally involved the following steps, and the file

DictionaryBuilder.pl:

· Using a PER, ORG or DAT file as input, take all the unique names

· Sort the names alphabetically

The first point was implemented using Perl's associative array support. Associative arrays

were not originally part of my plan, but research into different implementations of detecting

unique elements in an input file showed that associative arrays gave the cleanest, shortest

method possible.

A single amendment had to be made to this outline implementation plan. Namely, research

into Perl documentation revealed that Perl has an inbuilt sorting function. Originally, for the

sake of simplicity, this would be a separate program, taking an input file of unique names,

and bubble sorting them into an alphabetically ordered list. However, this could be

implemented in Perl in a single line, using two arrays as arguments (the unsorted array as

input, and a new sorted array as output).

Since it is only a single line, it was decided to include this in the DictionaryBuilder.pl program,

using a user command-line input to determine whether or not to sort the output.
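A minimal sketch of the unique-and-sort step follows (variable names are illustrative, not the actual DictionaryBuilder.pl code):

# An associative array collapses duplicate names; Perl's built-in sort
# then orders the unique names alphabetically in a single line.
my %seen;
my @names;
while (my $name = <STDIN>) {
    chomp $name;
    next if $name eq '' or $seen{$name}++;    # keep only the first occurrence
    push @names, $name;
}
my @sorted = sort @names;
print "$_\n" for @sorted;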

4.6 Actor Coding

The Actor Coding Scheme will take the dictionary files created by the system, and utilize

clustering to apply a code to each actor detected in the corpus. This will produce an output

file of actors, clustered based on their name, with a code assigned and appended to each.

Two main implementation points arose – the generation of unique codes for each actor, and

reusability of the actor scheme generation.

Since the program worked on an input of unique clustered actors (as all actors in a cluster

are assumed to be the same entity), one code was required for each input element. The

easiest way to ensure a unique code was therefore to use the actor name as the basis for a


code. However, the point of coding is to reduce the amount of information that one must

work with, hence the codes had to be smaller than the name they represented.

This was implemented by taking each element in turn, and splitting it into an array of

characters, taking the first two and capitalizing them (making the codes compatible with

[16]). If this pair of characters was unique, they became the code for that actor. If not,

characters were iteratively appended to the code until it became unique.
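The following sketch shows this prefix-extension idea (illustrative only; collisions on very short names and the handling of spaces are glossed over):

my %code_taken;
sub make_code {
    my ($actor) = @_;
    my @chars  = split //, uc $actor;         # capitalized characters of the name
    my $length = 2;                           # start with a two-character code
    my $code   = join '', @chars[0 .. $length - 1];
    while ($code_taken{$code} and $length < @chars) {
        $length++;                            # extend the prefix until unique
        $code = join '', @chars[0 .. $length - 1];
    }
    $code_taken{$code} = 1;
    return $code;
}
print make_code($_), "  ($_)\n" for ("France", "Franjo Tudjman", "Frank");
# assigns FR, FRA and FRAN respectively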

My coding of this program did not implement the more sophisticated features of, for

example, the TABARI actor file (allowing actors to change codes over time), because the

program would not know when, e.g., the USSR became Russia. These types of features are

only realistically implementable with manually-compiled Actor Coding Schemes.

Reusability was provided by basing the actor coding scheme on the actor dictionaries. This

meant that, for any input, the actor codes generated would reflect the corpus provided for those codes – allowing the input corpus to shape the coding scheme produced.


Chapter 5

Testing

General information about the project:

The dictionaries produced by the system were of the following sizes:

Organization Dictionary: 76610 bytes, 25.7 average bytes per article, 2977 discrete organizations (clustered).

Person Dictionary: 28219 bytes, 18.78 average bytes per article, 1502 discrete people (clustered).

Size of NER file: 136187 bytes.

The system was evaluated on the following criteria:

Dictionary creation was evaluated as a series of “precision and recall” experiments. The

errors from NER will be calculated by taking a percentage of actors that have been correctly

NER tagged. Other factors, such as any actors not relevant to the Balkans area of conflict

appearing in the dictionary, or entries existing in the wrong cluster will also be measured.

These three parameters will give me a precision and accuracy estimation for the dictionary

creation.

These experiments would take a large amount of time to complete, so each (with the

exceptions of experiments 4 and 5) was conducted on a test corpus of 200 articles, the

results then extrapolated to fit in with the true dictionary sizes, giving an approximation of the

outcome of each experiment.

The actor coding scheme will be evaluated via two methods, firstly a study of comparative

accuracy of coding a small corpus of 200 news articles, against an existing system

(TABARI).


Secondly, an accuracy comparison of the actor coding scheme on a small sample of 50

news articles to a manual event coding, done by two human coders (who must both agree on the code to assign, due to ambiguity issues with human code assignment).

Justification of the evaluation criteria can be seen as follows. Because the dictionaries were

designed to be used by other systems, the most important evaluation criterion was the

accuracy of selection of the entities in each dictionary. If entities in the dictionary are not of

the correct syntactic type, any system using that dictionary may generate errors (based on

the specific use of the dictionary). Correctness of the dictionary contents can be assessed

quantitatively, and will give the percentage of correct dictionary entries.

The accuracy of the clustering algorithm is also important – actors become “lost” and

incorrectly assigned an actor code if clustered with an actor they are not related to. Hence

the numbers of actors correctly/incorrectly clustered will make a quantitative evaluation

criterion for the percentage of correctly-assigned event codes.

A system-system comparison using TABARI as a benchmark system, as well as a comparison to human coding, will be used as evaluation criteria for system performance. By

comparing against an existing system and a human coder, some idea of “real world”

performance can be generated. A test plan was generated to encapsulate these ideas.

5.1 Test Plan

Test 1: Dictionary Generation (NER tagging)
Description: Check the actor dictionary for the existence of elements that are not actors.
Result type: Percentage error.

Test 2: Dictionary Generation (Code-based actor extraction)
Description: Check the actor dictionary for actors that are not related to the area of focus, via Google search.
Result type: Percentage error.

Test 3: Dictionary Generation (Clustering algorithm)
Description: Check each cluster for elements that do not belong in that cluster.
Result type: Percentage error.

Test 4: Actor Coding (System-system comparison vs. TABARI)
Description: Code a small input using both systems, compare accuracy.
Result type: Accuracy differences; comparative study (qualitative).

Test 5: Actor Coding (Comparison to human coding)
Description: Code a small input and compare for accuracy against a human-coded example.
Result type: Percentage incorrect for each coding; comparative study (qualitative).

Figure 19: Test plan


These tests will be referred to as “Experiment 1...Experiment 5”.

5.2 Dictionary Creation

Experiment 1: NER Errors

Methodology: Move through each dictionary file before its NER taggings are removed,

manually note any entities that are incorrectly tagged, and entities tagged as actors that are

not actually people or organizations.

Total number of taggings: 6014

People mistagged: 163

Organizations mistagged: 204

Entities that aren't actors: 383

Incorrectly tagged actors: 755

Correctly tagged actors: 5259

Percentage correct: 87.4%

Percentage error: 12.6%

Experiment 2: Extraction Errors

Methodology: Move through the actor file, manually note any actors that are unrelated to the

Balkans conflict (confirm via Google search for that actor if unsure).

Total actors: 6014

Actors that are related to the Balkans: 2104

Actors that are unrelated to the Balkans: 3310

Percentage correct: 45.0%

Percentage error: 55.0%

Experiment 3: Clustering Errors

Methodology: Examine the contents of each cluster, note any actors that do not belong in

their cluster, and any actors that are the same person, but in different clusters.

Total number of clusters: 4479


Total number of actors: 6014

Actors correctly clustered: 4682

Actors incorrectly clustered: 1332

Percentage correct: 77.9%

Percentage error: 22.1%

5.3 Actor Coding Scheme

The errors found in the coding scheme can be attributed in part to the dictionary generation

process, since the same actors from the dictionaries were used to create the Actor Coding

Scheme.

Experiment 4: Real-world Application

Methodology: Apply the Actor Coding Scheme to a test corpus containing 500 sample

articles. Compare accuracy with TABARI on the same test corpus. Compare Actor Coding

only. (Test machine: 1.4 GHz Pentium IV, 512 MB RAM, Fedora Core Linux version 3.0)

The produced system:
Total number of actors in corpus: 1598
Actors incorrectly coded: 228
Actors not coded: 0
Non-actors coded: 21
Correct coding produced on: 1379 articles
Percentage of correctly coded articles: 84.4%

TABARI:
Total number of actors in corpus: 1598
Actors incorrectly coded: 14
Actors not coded: 71
Non-actors coded: 5
Correct coding produced on: 1578 articles
Percentage of correctly coded articles: 94.4%

Experiment 5: Human Coding


Methodology: Generate Actor Codes for 50 articles, compare accuracy to that of a human

coder.

The number of correctly coded actors for each is as follows:

System: 36 (72%)

Human: 50 (100%)

Actors in the corpus that didn't appear in the Actor Coding Scheme: 0.


Chapter 6

Evaluation

6.1 Results and Discussion

Experiment 5 (Appearing first as it is a minimum requirement)

When compared to a human coder, the system produced an actor coding accuracy for the

input corpus that was 28% lower. Because the coding of each actor came from the name,

the coder could quickly check the scheme to see if the actor was classified, and assign the

code. This resulted in the large percentage of correctly coded samples.

Disagreements between the two methods of coding were largely due to the machine coding, which had a match for every actor in the corpus because of the sheer amount of detail that comes from being generated from such a large actor dictionary. However, 28% of these actors

were incorrectly clustered.

Since all samples were coded, the miscodings were due to clustering errors (one actor being

incorrectly identified as being the same as another actor).

These can be attributed to entities that share acronyms (AA could be American Airlines or the Automobile Association), compounded by the large number of actors, meaning there was

more chance for identical names to refer to different entities.

Also, clustering organizations that are referred to both as acronyms and as full names was

impossible, because of the heuristics employed. Since the system analyzes an article

without analyzing context, it will not link two syntactically distinct entities (such as "nicknames") with the organization's full name. Finally, if two individuals called "Paul

Simon” and “Pauline Simon” are detected, they will be clustered together, meaning that both


individuals, though different, will have the same actor code. Errors with the human coding

scheme assignments must be attributed to human error, since the alphabetical nature of the

scheme makes it very easy to follow and code to.

Testing the dictionary creation and actor coding systems (see 5) produced encouraging

results, but also highlighted three main types of errors to which the system is prone – NER-

based errors, entity extraction errors, and clustering assignment errors.

Experiment 1

NER-based errors are external errors caused by the NER program wrongly assigning a

syntactic type to an actor (e.g. a person being tagged as a country). This type of error

would be detrimental to a system using the dictionaries generated as an input, as the

dictionaries would contain errors. The total percentage of NER-based errors found in the

dictionaries was 12.6%. These can be attributed to the NER program itself [34], since the

NER process had to be done externally, and the writer had no control over the results.

Experiment 2

Entity extraction errors occur when the codes used to extract articles from the corpus are too general. Too general an extraction allows actors that are unrelated to the topic of interest to make their way into the dictionaries (and hence coding scheme), which can mean

that actor codes are generated but would not be used if run on a corpus specific to that area

of interest. The percentage of entity extraction errors found in my system in total was

55.0%. A more discriminating extraction would prevent actors unrelated to the area of study

being added to the actor dictionary/Event Coding Scheme.

Since the regular expression for extraction is dynamically generated, this can be done by the

user without any modifications being made to the code. This was also the largest cause of

error in my system, since such a large amount of the actors included were not related to the

event, so a coding scheme or NLP system would not benefit from their presence.

Experiment 3

Finally, clustering errors are when an actor either isn't clustered with other examples of that actor, or is clustered with an actor it shouldn't be clustered with. Clustering errors can be attributed to the heuristics used to perform the clustering operation, and to actors with similar

names. 22.1% of the clustered actors were incorrectly clustered. Possible causes of error

are explained above (Experiment 5).

Experiment 4

On the subject of speed, [35] refers to TABARI as “Very fast. I timed TABARI on Levant

texts for 1987-1990, about 26,000 sentences. On a 350Mhz Mac G3 and using the...mode

that provides no screen feedback, TABARI codes 2000 events per second... On a 650Mhz

Dell Pentium III, the speed is around 3000 events per second”.

The speeds at which the two systems coded the test corpus (see 5.2) show that the system developed as part of this project was faster than TABARI by 0.05 seconds. Although the

speed of the machine used to test was faster than that cited above, there was obviously a

loss of speed due to initial startup costs for the programs.

Given the "usual benchmark that human coders can reliably produce 40 events per

day”[35], the human-coding of the example corpus would take twelve and a half days!

(Although this means coding a whole event, not just the Actor). Obviously, in a full Event

Coding System, time would be saved by a factor of millions, and the results for accuracy of

the code assignment are (despite errors in the system) reasonable.

“Actors incorrectly coded: 228” (from experiment 4). This was far and away the largest

source of error in the coding. The errors with Actor Coding shown can be attributed to overly general selection criteria at the initial extraction stage. This left many actors unrelated to

the region of interest, and also increased the number of clusters containing actors that

should not be there – excessive information density meant that the clusterer found it hard to

distinguish between identical or nearly-identical names.

“Actors not coded: 0”. The test corpus used was taken from the corpus used to build the

dictionary [2]. The NER program, for the test corpus used, seems to have extracted all the

actors successfully – hence they all appear in the coding scheme, and are matched in the

test corpus when the Actor Coding scheme is run on it. For an input from a different corpus, this number is predicted to be substantially higher.

“Non-actors coded: 21”. These errors are due to the NER, non-actors that the NER

program has tagged as actors, which have then been included in the Actor Coding scheme.

Clustering seems to give a low number of actors not coded by the scheme, but a substantial


number of non-actors that have been wrongly coded.

The overall accuracy of TABARI was 10% higher. However, this was expected as TABARI

has a more detailed coding algorithm, which sidesteps some of the pitfalls the system

undoubtedly hit whilst coding (e.g. misclustering, wrong tagging).

6.2 Dictionaries

The dictionaries created could be used in NLP-based systems for lexical analysis, as well as

in my system for actor code generation. They contain a diverse variety of elements, divided

by syntactic type into three files (organization, person, and date), and are formatted in a way

that would be easily read into such a system.

The Actor Coding Scheme clustering algorithm produced a very fine-grained coding

scheme when compared to an existing example, such as the person classification seen in

the TABARI Actor Coding Scheme. This type of scheme would be especially useful for

users working on a smaller corpus, since the reduction in volume of information could allow

for a more detailed classification of the actors in the corpus. This would suit a more specific

(and smaller) corpus.

6.3 Actor Coding Scheme

The Actor Coding Scheme generated was plausible for use in an Event Coding System, but

is not an optimal solution. The main detractors were variable-length codes and “First Come,

First Served” (FCFS) code assignment.

Variable-length codes are generated because of the way the algorithm moves sequentially

through the dictionary of actors. This was not something observed in the TABARI actor file,

but the discrepancies can be attributed to the fact that TABARI uses a hand-coded actor file,

whereas the system uses a machine-based, algorithmic approach.

FCFS code assignment comes again from the algorithm, which assigns a code to the first

element in the input file, before moving onto the next. In an alphabetically-ordered input file,

this produced an alphabetical assignment priority, where the smallest codes were assigned

to elements coming near the start of the input file. Longer codes were attributed to elements

with a common beginning (e.g. if John Paul = JOHN, then John Prescott = JOHNP). By contrast,

manually-produced coding schemes have codes generated based on the importance of that


actor to the region the scheme is written for – the main actors and countries have codes that are prioritized by relevance and are easy to read. The algorithmic nature of code assignment for the

system does not take such factors into consideration when assigning codes. Example actor

codings on a three-entity article follow.

INPUT
Today Albania declared war on France
Pope John Paul arrived in Yugoslavia preaching peace
Hungary and Italy moved in on the WHO

OUTPUT
Today ALB declared war on FRA
Pope JOHNP arrived in YUGOSLAVI preaching peace
HUNGARY and ITALY moved in on the WHO

6.4 Conclusions

In summary, the system fulfilled its minimum requirements, and three additions, two of which

were confirmed by the Mid-Project Report.

A number of conclusions can be drawn from the project. Firstly, that automated dictionary

extraction for event coding systems is feasible. The system produced shows that algorithmic

and NLP methods for dictionary creation can generate dictionaries with useful contents.

Secondly, that successful actor code generation is heavily dependent on the article

extraction parameters. In the examples used in the evaluation, many of the extracted

articles were unrelated to the Balkans conflict, so many codes that would be unused in a

real-world coding would be generated. This also affected the actor dictionaries, since some

of the entries bore no relevance to the subject of that dictionary.

Thirdly, that the actor coding scheme and dictionaries produced are especially suited to

small, focused corpora, or where extremely detailed classifications are required.

Finally, that implementation of a non-trivial Actor Coding Scheme will be a complex task.

This is why even automated systems such as [16] use manually-generated actor coding

schemes for their classification.

6.4.1 Improvements

A number of improvements could be made to the system. For example, a way to verify the

NER classification would enhance the correctness of the dictionaries. This could be


implemented simply by running the corpus through a number of NER systems, and taking

only the entities where they all agree on a syntactic type for that entity. Failing that, a run-

through of the generated corpus with a checking utility (such as a POS program) would allow

wrongly-tagged entities to be detected and deleted (by comparing the type of word to an

“exclusion list” of word types), reducing the number of total dictionary entries, but increasing

dictionary accuracy.

Enhancing the algorithm for clustering will reduce errors related to the misclustering of

actors – in turn preventing the incorrect assignment of actor codes. Improvements to the

clustering algorithm will allow distinctions to be made between such actors.

Further improvements to the accuracy of the system could also be made. This could be

implemented by altering the algorithms used to cluster. Also, the code assignment algorithm

could be altered to prioritize frequently-used codes, giving them shorter abbreviations, while

demoting less-used codes to longer lengths. A human-readable, standard-length coding

algorithm for actors would be a major advancement in the project.

Improvements could also be made to the aesthetic and interactive qualities the current

implementation of the system is lacking. Although not a priority for this project, a GUI would

increase usability and reduce the time it takes to become comfortable with system use (since

there are so many small programs, novice users can become lost). Increasing the aesthetic

properties of the system could also entice more users into using it.

6.4.2 Furthering the Project

The most obvious way to further the project would be to turn the system into a fully-fledged

Event Coding System, using an existing Event Coding Scheme [21,22,23,24,25,26] along

with my Actor Coding Scheme generation to code an input corpus, and produce an output of

event coded headlines.

Finally, as commented on in the Mid-Project Report, a paper based on the system could be

submitted to a workshop, which could produce things such as requests for additions to the

program/bug fixes (improving the system), as well as the possibility of collaborative projects

in the field of Event Coding.


Bibliography

[1] School of Computing, University of Leeds, (published date unknown), Projects: Minimum Requirements Form, URL: http://www.comp.leeds.ac.uk/tsinfo/projects/minreq-form.html [21st April 2005]

[2] Tony Rose, Mark Stevenson, Miles Whitehead, (2004), The Reuters Corpus Volume 1, from Yesterday's News to Tomorrow's Language Resources, Technology Innovation Group, Reuters Limited.

[3] CNN News, (date unknown), CNN Balkan Conflict: History, URL: http://www.cnn.com/WORLD/Bosnia/history/ [21st April 2005]

[4] Geert-Jan M. Kruijff, Oliver Plaehn, Holger Stenzhorn, Thorsten Brants, (2001), NEGRA Corpus, URL: http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html [21st April 2005]

[5] Goldstein, Joshua S, (1992), A Conflict-Cooperation Scale for WEIS Events Data, Journal of Conflict Resolution 36, pp.369-385.

[6] Burgess, P.M., & Lawton, R.W, (1972), Indicators of International Behavior: An Assessment of Events Data Research, Beverly Hills: Sage Publications.

[7] Vincent, Jack E, (1983), WEIS vs. COPDAB: Correspondence Problems, International Studies Quarterly 27, pp.160-169.

[8] R Grishman, (1995), The NYU System for MUC-6 or Where's the Syntax, Proceedings of the Sixth Message Understanding Conference.

[9] DJ Gerner, PA Schrodt, RA Francisco, JL Weddle, (1994), The Analysis of Political Events Using Machine Coded Data, International Studies Quarterly 38, pp.91-119.

[10] KEDS Team, (2004), Kansas Events Data System (KEDS) Homepage, URL: http://raven.cc.ku.edu/~keds/ [11th April 2005]

[11] Philip A. Schrodt, Shannon G. Davis and Judith L. Weddle, (1994), Political Science: KEDS: A Program for the Machine Coding of Event Data, University of Kansas.

[12] Gary King and Will Lowe, (2002), An Automated Information Extraction Tool For International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design, Harvard University.

[13] Ellen Riloff, (1993), Automatically Constructing a Dictionary for Information Extraction Tasks, Proceedings of the Eleventh National Conference on Artificial Intelligence, AAAI Press / MIT Press, pp.811-816.

[14] Julie Weeds, Bill Keller, David Weir, Ian Wakeman, Jon Rimmer, and Tim Owen, (2004), Natural Language Expression of User Policies in Pervasive Computing Environments, Proceedings of OntoLex 2004 (LREC Workshop).

[15] G Sampson, (1991), Probabilistic parsing, in Svartvik, J (ed), Directions in Corpus Linguistics.

[16] KEDS Group, (1999), TABARI Readme File, URL: http://www.ku.edu/~keds/software.dir/tabari.info.html [14th April 2005]

[17] McClelland, C.A, (1976), World Event/Interaction Survey Codebook, ICPSR 5211.

[18] G. Dale Thomas, (2000), The Machine-Assisted Creation of Historical Event Data Sets: A Practical Guide, Paper presented at the 2000 annual meeting of the International Studies Association.

[19] Philip A. Schrodt, (2001), Automated Coding of International Event Data Using Sparse Parsing Techniques, Paper presented at the annual meeting of the International Studies Association, Chicago, February 2001.

[20] Virtual Research Associates, (date unknown), Virtual Research Associates Homepage, URL: http://www.vranet.com/main.html [16th April 2005]

[21] Bond et al, (2003), Integrated Data for Events Analysis (IDEA): An Event Typology for Automated Event Coding, Journal of Peace Research 40, pp.733-745.

[22] Bond, Bennett, and Vogele, (1994), The Protocol for the Assessment of Nonviolent Direct Action (PANDA), Paper presented at the Program on Nonviolent Sanctions in Conflict and Defense, Center for International Affairs at Harvard.

[23] Hermann, East, Hermann, Salmore, and Salmore, (1973), Comparative Research on the Events of Nations (CREON), 2nd ICPSR, 1977.

[24] Leng, Russell J, (1987), Behavioral Correlates of War, 1816-1975, ICPSR 8606.

[25] Azar, E. E, (1982), The Codebook of the Conflict and Peace Data Bank (COPDAB), College Park, MD: Center for International Development, University of Maryland.

[26] Deborah J. Gerner, Rajaa Abu-Jabr, Philip A. Schrodt, A Yilmaz, (2002), Conflict and Mediation Event Observations (CAMEO): A New Event Data Framework for the Analysis of Foreign Policy Interactions, Center for International Political Analysis, Department of Political Science, University of Kansas.

[27] H Poulton, (1993), The Balkans: Minorities and States in Conflict, London: Minority Rights Group.

[28] V Roudometof, R Robertson, (2001), Nationalism, Globalization, and Orthodoxy: The Social Origins of Ethnic Conflict in the Balkans, Westport, CT: Greenwood Press.

[29] PA Schrodt, B Hall, L Neack, PJ Haney, JAK Hey, (1993), Event Data in Foreign Policy Analysis, in Foreign Policy Analysis: Continuity and Change, Prentice Hall, 1995.

[30] AM Davis, EH Bersoff, ER Comer, BTG Inc, VA Vienna, (1998), A Strategy for Comparing Alternative Software Development Life Cycle Models, IEEE Transactions on Software Engineering.

[31] Stefan Junginger, Harald Kahn, Mark Heidenfeld, Dimitris Karagiannis, (2001), Building Complex Workflow Applications: How to Overcome the Limitations of the Waterfall Model, in Fischer, L. (Ed.): Workflow Management Coalition: Workflow Handbook 2001, Future Strategies Inc., October 2000, pp.191-206.

[32] T Gilb, (1985), Evolutionary Delivery versus the 'Waterfall Model', ACM SIGSOFT Software Engineering Notes, ACM Press, New York, NY, USA.

[33] Jerry R. Hobbs and David Israel, (1996), FASTUS: An Information Extraction System, published in Finite State Devices for Natural Language Processing, MIT Press.

[34] Curran and Clark, (2003), Language Independent NER using a Maximum Entropy Tagger, Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-03), pp.164-167.


Appendix A

With respect to time management, the project process can be considered a success, with other successes including the project write-up and research. Although fifty pages appears to be a weighty sum at first, generating a rough plan of the structure of my report at the start made the writing manageable. By writing about something that I had just done, the details were fresh in my mind,

meaning that the workload of writing the report was effectively split over the entire duration

of the project. This ties into scheduling – another area I personally felt was a success. By

deciding on and sticking to a design methodology, and creating a provisional schedule of

tasks early on, I was able to stay on task and complete the project on time.

Research aided the development of my solution greatly, with Dr. Markert providing a plethora

of white papers to read, and plenty of ideas to think about. The large amount of research

that I did before even beginning the design stage allowed me to sidestep pitfalls discussed in

some of the papers, steering the direction of the project towards ideas that have been

proved to work, using the best implementation methods available.

However, not all went according to plan. The main problem with the whole project was

actually getting such a large corpus of data in one place, and running programs on it. For

example, I had to ask the network administrators for more space in a folder I could access,

which took some time to set up. After that, it became necessary to leave a computer logged

in for a long time, as I had to unzip and extract from a corpus containing over 800,000

separate files. If I could change the project, I would either operate on a smaller corpus, get

my own PC (so it could be left on without interruption), or arrange any additional things

(extra storage) that I needed beforehand.

Advice to students attempting a similar style of project would be to do a lot of research into

systems similar to the one they want to implement, taking ideas and incorporating them into

their system, and also to create a project schedule early on, create realistic deadlines, and

stick to the plan. This will ensure that the project will be completed on time. Also, not to underestimate the time that learning a new language, or anything else they are not familiar with, will take.


Appendix B

Misc. Charts & Examples

Figure 1: Gantt Chart Showing Project Schedule for Semester 1 (weeks 1-11 plus Christmas; phases: Research, Design, Implement, Test, Evaluate, Compile)

Figure 2: Gantt Chart Showing Project Schedule for Semester 2 (weeks 1-8, Easter, then weeks 10-11; phases: Research, Design, Implement, Test, Evaluate, Compile)

Figure 12: Example Reuters XML-formatted article.


<?xml version="1.0" encoding="iso-8859-1" ?>

<newsitem itemid="804285" id="root" date="1997-08-16" xml:lang="en">

<title>PAKISTAN: Pakistan auctions 6-mth bonds for 19.91 bln

rupees.</title>

<headline>Pakistan auctions 6-mth bonds for 19.91 bln

rupees.</headline>

<dateline>KARACHI, Pakistan 1997-08-16</dateline>

<text>

<p>The State (central) Bank of Pakistan said it had auctioned six-month short-term

federal bonds worth 19.91 billion rupees on Saturday at a weighted average yield of

15.42049 percent per annum.</p>

<p>Bids worth a total of 26.485 billion rupees were received out of which bids for
19.91 billion rupees were accepted, a bank statement said.</p>

<p>In the previous auction held on August 4, bids worth 5.102 billion rupees were

accepted at a weighted average yield of 15.38533 percent per annum.</p>

<p>-- Karachi newsroom (9221) 5685192; Fax 5673428</p>

</text>

<copyright>(c) Reuters Limited 1997</copyright>

<metadata>

<codes class="bip:countries:1.0">

<code code="PAKIS">

<editdetail attribution="Reuters BIP Coding Group"

action="confirmed" date="1997-08-16"/>

</code>

</codes>

<codes class="bip:topics:1.0">

<code code="M12">

<editdetail attribution="Reuters BIP Coding Group"

action="confirmed"date="1997-08-16"/>

</code>

<code code="MCAT">

<editdetail attribution="Reuters BIP Coding Group"

action="confirmed"date="1997-08-16"/>

</code>

</codes>

<dc element="dc.date.created" value="1997-08-16"/>

<dc element="dc.publisher" value="Reuters Holdings Plc"/>

<dc element="dc.date.published" value="1997-08-16"/>


<dc element="dc.source" value="Reuters"/>

<dc element="dc.creator.location" value="KARACHI, Pakistan"/>

<dc element="dc.creator.location.country.name" value="PAKISTAN"/>

<dc element="dc.source" value="Reuters"/>

</metadata>

</newsitem>

Figure 14: A RASP-parsed and POS tagged Reuters news article.

("476185newsML.xml" "The" "Chicago" "Board" "Options" "Exchange" "said" "Monday" "an"

"exchange" "membership" "sold" "for" "a" "record" "price" "for" "the" "second" "time" "this"

"month") 1 ; (-29.820)


(|T/txt-sc1/---|

(|S/np_vp|

(|NP/n2_name|

(|NP/name_n2| (|NP/n1_name/-| (|N1/n| |476185newsML.xml:1_NP1|))

(|NP/name_n1| (|NP/det_n| |The:2_AT| (|N1/n| |Chicago:3_NP1|))

(|N1/n| |Board:4_NNJ1|)))

(|NP/n1_name/-|

(|N1/name+| |Options:5_NP1| (|N1/n| |Exchange:6_NP1|))))

(|VP/vp_npadv1|

(|V/np_np| |say+ed:7_VVD| |Monday:8_NPD1|

(|NP/det_n| |an:9_AT1|

(|N1/n1_nm| |exchange:10_NN1|

(|N1/n_pprt| |membership:11_NN1|

(|V/pp_pp| |sell+ed:12_VVN|

(|PP/p1|

(|P1/p_np| |for:13_IF|

(|NP/det_n| |a:14_AT1|

(|N1/n1_nm| |record:15_NN1| (|N1/n| |price:16_NN1|)))))

(|PP/p1|

(|P1/p_np| |for:17_IF|

(|NP/det_n| |the:18_AT|

(|N1/ap_n1/-| (|AP/a1| (|A1/a| |second:19_MD|))

(|N1/n| |time:20_NNT1|))))))))))

(|NP/det_n| |this:21_DD1| (|N1/n| |month:22_NNT1|)))))

Figure 16: Perl code for dynamic regular expression generation

$textstring = "code.code=(";

$i = 0;

for($i = 0; $i <=$#ARGV; $i++)

{

if ($i == $#ARGV)

{

$textstring = $textstring . $ARGV[$i] . ")";


}

else

{

$textstring = $textstring . $ARGV[$i] . "|";

}

}

print "$textstring\n\n";#delete
