Gerard Howard
Summary
This project attempts to solve the problems caused by manual dictionary creation for event coding systems, by providing an automatic dictionary generation system. These problems are namely: a) the number of man-hours required to manually generate such dictionaries from a large corpus; b) the possibility of inconsistent actor assignment by a human dictionary compiler; and c) the possibility that a number of actors contained in the corpus are overlooked and not included in the dictionaries. By taking an algorithmic, NLP-based approach to dictionary generation, it is predicted that the time taken to produce such dictionaries will decrease dramatically, and that the assignment of actors to dictionaries will become more consistent.
The second part of the system will produce an actor coding scheme from the generated dictionary files. These actor coding schemes are used in both manual and automatic event coding systems, but are often hand-created. As with dictionary generation, the prime concerns with the manual approach are the time-consuming nature of scheme generation, and of applying the scheme to the corpus to be coded. It has been noted in previous event coding projects that human coders (who must work in teams due to the number of articles to be coded) can disagree on the code assigned to a particular actor, producing an inconsistent coding. An automatic, algorithmic system will remove the ambiguity in code assignment, as well as reduce the time taken to a) generate the actor scheme and b) apply the scheme to an input corpus.
Creating a system based on context-free grammar analysis will allow the system to generate a dictionary (and hence an actor coding scheme) for any region of interest. This will solve the lack of reusability seen when human coders create such actor schemes: there is no quick way to manually repeat the same process on another corpus to generate an actor scheme with a different focus in mind.
This system will take a corpus of Reuters news articles as input, and produce dictionaries of
people and organizations and an actor coding scheme as output.
Acknowledgements
I'd like to thank Dr. Katja Markert for her aid with project direction and scheduling, as well as
code consultation and finding a forest of relevant papers on event coding and generally
putting up with me.
Also, Eric Atwell for his comments on the report and ideas for testing the system, as well as
going through the system demonstration session with me half asleep (No more late nights!).
Acknowledgments must also be made to Sorin for his help with the Java version of the
NP/VP extraction program.
Robert Yaw, Dave Goodwin, and Andy Walton deserve a mention for frequently responding
to my answer to their questions of “So what does your system do again?” with “Huh?”, as
well as support during the late-night hacking sessions and beerathons.
I would also like to thank my parents for washing my clothes and sorting me out with a new
PC when my old one went “pop”. Oh, and for the manual event coding.
And finally to the University of Leeds staff, for providing such a comfortable working
environment.
Table of Contents
1. Requirements and Scheduling
1.1 Requirements
1.2 Scheduling
1.2.1 Preliminary Schedule
1.2.2 Revised Schedule
1.2.3 Explanation
2. Research and Background Reading
2.1 The Purpose of Event Coding
2.1.1 Problem Statement
2.2 Information Extraction and the English Language
2.3 Coding
2.3.1 Coding Systems
2.3.2 Event Coding Schemes
2.3.3 Actor Coding Schemes
3. Design and Methodology
3.1 Methodology
3.2 Scope
3.3 Programming Languages and Tools
3.4 Identifying Useful Information
3.4.1 Information Extraction
3.4.2 Exploiting POS and NER
3.5 System Design
4. Implementation
4.1 Programs, Inputs and Outputs
4.2 Information Extraction
4.3 Named Entity Recognition
4.4 Clustering
4.5 Dictionary Creation
4.6 Actor Coding
5. Testing
5.1 Test Plan
5.2 Dictionary Creation
5.3 Actor Coding Scheme
6. Evaluation
6.1 Results and Discussion
6.2 Dictionaries
6.3 Actor Coding Scheme
6.4 Conclusions
6.4.1 Improvements
6.4.2 Furthering the Project
Chapter 1
Requirements and Scheduling
1.1 Requirements
The requirements for this project can be split into two categories: needed and desired.
Those that are needed are specified in [1], defined early on in the project after consultation
with the project supervisor, and are as follows:
· An overview of current Event Coding Schemes and Event Coding Systems as well as
Natural Language Processing (NLP) methods for building Event Coding Systems.
· Implementation of an automatic extraction method of relevant Reuters leadlines from [2].
· Automatic dictionary extraction for proper name actors and mapping to actor codes
focusing on the Balkans conflict [3].
· Mapping proper names in Reuters leadlines automatically to actor codes.
· Evaluation of proper name Actor Coding accuracy by comparison to two human coders.
Possible extensions to this project include:
· Allowing users to extract articles from a corpus based on multiple criteria.
· Increasing the accuracy of the assignment of actor codes by improving the code
assignment algorithm.
· Expanding the project by allowing for Event Coding.
The full amount of allotted time is 21 weeks of term time, plus 8 weeks' worth of holidays.
Figures 1 and 2 (in Appendix B) show Gantt charts of the preliminary schedule for my
project.
1.2 Scheduling
1.2.1 Preliminary Schedule
Milestones
· Return completed preference forms: 30.09.2004
· Complete aim and minimum requirements: 22.10.2004
· Mid project report: 10.12.2004
· Complete background reading: 01.12.2004
· Complete implementation of system: 01.03.2005
· Complete project report: 10.04.2005
· Deadline for report submission: 27.04.2005
Dates | Task | Stage of Waterfall model | Details
01.11.2004 - 01.12.2004 | Preliminary reading | Research | Topics: basic NLP, automatic dictionary construction, event/actor coding schemes/systems
01.12.2004 - 10.12.2004 | Research and decide on methodology | Research | Comparison of methodologies, decision of best suited to project, write up
12.10.2004 - 22.10.2004 | Requirements capture | Research | Analyze needs of the users, capabilities of similar systems and minimum requirements
01.12.2004 - 23.12.2004 | Initial designs and testing | Design | Design initial ideas, implement in code using throwaway prototyping, write up
23.12.2004 - 01.01.2005 | Final design | Design | Include specifics (e.g. definite uses for each program, number of programs required). Relate development to research, write up
01.01.2005 - 01.03.2005 | Implementation and Testing | Implementation | Implement using waterfall methodology, test on sample corpora, write up
01.12.2004 - 01.04.2005 | Evaluation | Evaluation | Define suitable evaluation criteria (test plan), evaluate qualitative and quantitative aspects of the system, draw conclusions, write up
Figure 3: Project schedule
1.2.2 Revised Schedule
Revised Milestones
· Return completed preference forms: 30.09.2004
· Complete aim and minimum requirements: 22.10.2004
· Mid project report: 10.12.2004
· Complete background reading: 10.12.2004
· Complete implementation of system: 25.03.2005
· Complete project report: 17.04.2005
· Deadline for report submission: 27.04.2005
Dates | Task | Stage of Waterfall model | Details
01.11.2004 - 05.12.2004 | Preliminary reading | Research | Topics: basic NLP, automatic dictionary construction, event/actor coding schemes/systems. Added reading on Perl
05.12.2004 - 20.12.2004 | Research and decide on methodology | Research | Comparison of methodologies, decision of how to combine methodologies most effectively, write up
12.10.2004 - 22.10.2004 | Requirements capture | Research | Analyze needs of the users, capabilities of similar systems, and minimum requirements and extensions. Added practice with Perl
01.12.2004 - 10.01.2005 | Initial designs and testing | Design | Design initial ideas, implement in code using throwaway prototyping, choose one to develop, write up
10.12.2004 - 20.01.2005 | Final design | Design | Include specifics. Relate development to research and previous design choices, write up
20.01.2005 - 20.04.2005 | Implementation and Testing | Implementation | Implement using waterfall methodology, test on sample corpora, improve clustering algorithm, write up
20.01.2005 - 25.04.2005 | Evaluation | Evaluation | Define suitable evaluation criteria (test plan), evaluate qualitative and quantitative aspects of the system, draw conclusions, write up
Figure 4: Revised project schedule
1.2.3 Explanation
The schedule had to be revised due to the following main factors:
· Unfamiliarity with Perl: the time Perl would take to learn was underestimated. This lengthened both the design and implementation stages, pushing back the evaluation.
· The first implementation of the clustering algorithm did not perform well in testing. Altering it took a few days that should have been spent on other parts of the implementation.
· The amount of background reading meant that the deadline was overshot by a number of days. This created something of a cascade effect on the other deadline dates, shifting them all back slightly.
However, the revised schedule was adhered to, ensuring that the project was completed on time.
Chapter 2
Research and Background Reading
2.1 The Purpose of Event Coding
Event Coding is a relatively recent social science-based discipline, utilizing generalized codes to represent countries, political actors, and government branches (Actors), and the actions those myriad actors take against/with each other (Events). In this way, diverse inputs containing many events and actors [2,4] can receive a generalized classification, in turn allowing a quick overview of a particular series of events. Clustering is traditionally used to cover synonyms of an actor or event, allowing them to be correctly classified (e.g. U.S.A., America, and United States could all be clustered into the "USA" cluster for simplified coding). Using a textual input such as a selection of newspaper headlines for a region, or a series of stock market reports over a given time period, the events reported in the headlines or reports can be analyzed quickly.
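The clustering idea described above can be sketched as a simple synonym-to-canonical mapping. This is a minimal illustration; the cluster labels and synonym lists below are invented for the example, not taken from a real dictionary:

```python
# Minimal sketch of actor synonym clustering: each surface form found in a
# corpus is mapped to a single canonical cluster label for coding.
# Cluster contents here are illustrative only.
ACTOR_CLUSTERS = {
    "USA": {"U.S.A", "America", "United States", "USA"},
    "UK": {"Britain", "United Kingdom", "UK", "Great Britain"},
}

def canonical_actor(mention):
    """Return the cluster label covering this mention, or None if unseen."""
    for label, synonyms in ACTOR_CLUSTERS.items():
        if mention in synonyms:
            return label
    return None
```

With this mapping, "America" and "United States" both code to the single actor "USA".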
The primary use of Event Coding is related to early warning. Each event is assigned a number on a scale such as [5] (where the most severe negative actions receive the lowest negative mark, and the most constructive actions receive the highest positive mark). By analyzing the products of these marks between two countries over a given period of time (e.g. every week for a number of months), a time series of aggregate marks between the two countries can be obtained. This series can then be compared against "template" time series representing a number of likely outcomes and studied for correlation. Put simply, if a stock market report series for a company correlates roughly with a template showing market growth, it can be predicted that the company will experience growth in its market. Similarly, if the time-series graph between two countries correlates with a template showing unrest, possibly descending into armed conflict, the two countries in question could be reinforced with additional UN troops in an attempt to quell any danger.
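The aggregation-and-correlation step can be sketched as follows. This is a toy illustration of the idea, not part of any real early-warning system; the event scores are invented:

```python
from statistics import mean

# Sketch of the early-warning idea: each coded event between two countries
# carries a Goldstein-style severity score; weekly aggregates form a time
# series that can be compared against a "template" series by correlation.
def weekly_aggregates(scored_events, weeks):
    """scored_events: list of (week_index, score) pairs -> per-week totals."""
    totals = [0.0] * weeks
    for week, score in scored_events:
        totals[week] += score
    return totals

def correlation(xs, ys):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A country pair whose aggregate series correlates strongly with an "unrest" template would be flagged for attention.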
Event Coding Systems are created to study two main areas: international events (both
military and political) and economics.
Until recently, Event Coding was conducted manually, by consulting a coding scheme (such
as [6]) and assigning the event to a code in the scheme that most closely correlated to it.
Papers concerning the process of manual coding in more detail include [7] and [8].
Recent advancements have allowed these codes to be assigned automatically by algorithm or pattern matching, saving the many man-hours spent not only assigning codes, but also ensuring that different human coders would assign the same code to any given event. For example, in [9], the team automatically code both regional and international data using machine-based automatic Event Coding.
One of the first automatic Event Coding Systems was [10], a coding system relating to actions between two countries, and centered mainly around warfare. It was first used in [11] to automatically event-code a dataset related to the Palestinian Intifada. Recently, a company called Virtual Research Associates has released a system, [12], which is designed to predict the fluctuations of various stock markets.
The rise of these new computerized Event Coding Systems can be attributed to their
advantages (speed, unsupervised automatic assignment, the ability to “plug in” and use one
of a number of Coding Schemes). However, these systems have many requirements that
must also be fulfilled in order to use them. Firstly, a mapping must be provided from the
myriad actors and events that would be encountered in a corpus, to a smaller set of
generalized actors and events. Secondly, the system must be able to identify the relevant
actors and events in the corpus of articles. Thirdly, there must be a program that takes the
extracted actors and events, and applies the mapping scheme to them, returning a set of
possible events and actors for a given area of conflict.
The purpose of my Event Coding System is firstly to automatically generate a dictionary of actors for the Balkans area of conflict, using a one-year corpus of Reuters Newswire texts ([2]). I will extract from this corpus the relevant articles, then extract from those articles a corpus of headlines and metadata such as the date each article was published. A list of countries, persons and organizations involved in the conflict will then be created from this corpus. I will then generate a dictionary of these actors, using clustering and mapping to refine this process.
Secondly, an Actor Coding Scheme for the extracted articles will be generated, using the actors extracted by the dictionary creation system.
My system will also be reusable (for different areas, by using a different input corpus and different extraction parameters), compact (small, simple programs which can be used by anyone), reasonably accurate (hoping for the same accuracy as manual assignment is obviously unrealistic), and most of all, quick (this will be my system's main strength – human-coded actor dictionaries can "take upwards of 150 man-hours to generate" [13]).
2.1.1 Problem Statement
The system will solve the problem of manually creating dictionaries for use in Event Coding Systems, by providing an alternative whereby dictionaries can be generated automatically, saving the user many hours of preparation time. To maximize the number of potential users, these dictionaries should be reasonably accurate, and dependent only on the input corpus used, to allow any kind of Event Coding dictionary to be generated.
The system will also attempt to automatically generate an Actor Coding Scheme, for use in
Event Coding Systems. This should provide a reasonable number of actors, and utilize
clustering and grouping to provide a useful coding scheme. This will solve the problem of
the traditional approach of manually assigning an actor code to a dictionary entry, by
providing a reasonably accurate, quick and comprehensive alternative.
2.2 Information Extraction and the English Language
The input I will use will be from Reuters [2]. Although a global news service, Reuters is a British company, and as such its articles use English as their primary language. The corpus I will be using for input to my system will therefore be entirely in English.
Like all languages, English predominantly follows a structure for composing phrases and
sentences. Programs have been written to exploit this structure (taggers, tokenizers, and
parsers), providing facilities such as Part Of Speech (POS) tagging, and Named Entity
Recognition (NER).
Parsers are responsible for splitting an input into words and sentences. RASP, used in projects such as [14], is installed on the University of Leeds School of Computing Linux systems, and provides advanced structures such as phrasal and sub-phrasal identification.
Part of Speech (POS) taggers provide further structural analysis of a text, identifying the
types of words in a sentence. RASP also contains this functionality, which it uses for its
“Probabilistic Parsing” [15] running mode.
Named Entity Recognition (NER) software provides a tagging of words with types. Unlike POS tagging, these word categories are not syntactic types such as nouns and verbs, but instead semantic types such as people and dates. Aside from individual words, the context of a phrase within a sentence is also important, and has been used in systems such as [13] to automatically generate dictionaries for Event Coding Systems.
In English, each declarative sentence is split into a Noun Phrase (NP) and a Verb Phrase (VP). Each of these NPs and VPs can be split recursively into further NPs and VPs, Prepositional Phrases (PP), determiners (e.g. "The"), nouns (N), and verbs (V). This type of parsing, combined with POS tagging, allows the structure of a sentence to be dissected and analyzed. Figure 5 shows such an analysis of the phrase "The cat sat".
Figure 5: Example parsing and POS tagging.
RASP can generate these parse trees, which describe not only the types of words and
phrases in a sentence, but also tag words with their syntactic classification.
By extracting the NPs and VPs that contain the main actors (identified by the word-extraction process explained above), I would be able to identify not only the important NPs or VPs in the sentence, but also to extract the actors and events in those NPs or VPs together with a surrounding context, which could be used to aid in the classification of those actors and events.
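The recursive NP/VP structure can be sketched with plain nested tuples; this is a hand-written tree mirroring the "The cat sat" example, not RASP output:

```python
# A parse tree for "The cat sat" as nested (label, children...) tuples:
# S -> NP VP; NP -> Det N; VP -> V.
PARSE = ("S",
         ("NP", ("Det", "The"), ("N", "cat")),
         ("VP", ("V", "sat")))

def leaves(tree):
    """Collect the words under a (sub)tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def phrases(tree, label):
    """Recursively collect every subtree carrying the given label (e.g. 'NP')."""
    found = []
    if not isinstance(tree, str):
        if tree[0] == label:
            found.append(tree)
        for child in tree[1:]:
            found.extend(phrases(child, label))
    return found
```

Walking the tree in this way recovers each NP or VP along with the words it spans, which is exactly the "phrase plus context" extraction described above.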
A combination of NER and parsing can identify the parts of a sentence that would be useful to my system (such as the names of countries, people, companies, agencies, political parties and so on). This will allow me to extract the particular types of words that I wish to include in my dictionaries.
2.3 Coding
Coding is the assignment of a code (provided by a coding scheme) to a piece of text (in this case a Reuters Newswire headline) by a coding system (such as [16]). In this chapter, background knowledge and some preliminary design choices relating to Coding Systems (2.3.1), Event Coding Schemes (2.3.2) and Actor Coding Schemes (2.3.3) will be provided.
2.3.1 Coding Systems
A coding system is a piece of software that allows, by way of "plug-in" script files, the assignment of event codes to input text. Research into this area is necessary as it gives me a background of knowledge from which to work, allowing me to gear my system towards compatibility with these systems, as well as giving me an idea of how Event Coding Systems work.
The three major coding systems of recent years are KEDS[10](Kansas Event Data System),
TABARI[16](Textual Analysis By Augmented Replacement Instructions), and the VRA
(Virtual Research Associates) project[20].
Developed at the University of Kansas, KEDS is a data coding system that uses a modification of [17] to assign event codes to political event data focusing on the Middle East and the Balkans. The predicted final use for this project is as a statistical early-warning model for regions of conflict (see 2.1).
KEDS takes preformatted data as input, and uses a pattern-matching algorithm to assign a code to that data, typically processing Reuters headlines. KEDS seems a logical choice for analysis, due to its similarities to this project. However, it has recently been superseded by TABARI, also produced by the KEDS team. It is therefore TABARI that will be the main focus of my analysis.
TABARI is a coding system that uses pattern matching to assign codes to events in a given text. It uses replaceable actor and event coding scripts (text files) to denote the assignment of codes to actors/events.
The basic pattern is subject – verb – object, where:
Subject = the source of the event (actor code assigned here).
Verb = the event itself (event code assigned here).
Object of verb = the target of the event (actor code assigned here).
For example, given the input of:
Today, Italy declared war on Scotland.
Italy would be the subject, assigned an actor code. Scotland would be the object, and also
assigned an actor code. The bigram “declared war” is the event, and as such would be
assigned an event code.
Due to the simplistic way TABARI tackles the complex issue of syntax, it is prone to generating errors on complex sentences/events. However, results from the use of TABARI suggest that it can handle the vast majority of Reuters headline syntax forms [18].
Like many systems of its ilk (e.g. KEDS), TABARI requires a certain formatting of the input file in order to work. An example of TABARI formatting can be seen below.
980216 REUT-0001-01
BALKANS. <<Sentence containing the event>>.
Figure 6: Example format of a TABARI leadline.
Requiring a strict format allows the input file to be read more easily, as well as providing a common format that human coders are able to understand. Some other rules include:
· All letters must be upper case; words beginning with an upper-case letter in the middle of a sentence are tagged as nouns.
· All punctuation except commas is eliminated. TABARI then checks each individual word against the entries in the actor/verb dictionaries; words found in a dictionary are given an integer to uniquely identify them. TABARI then represents the words as an array of integers and assigns a word type to each literal (noun, verb, actor, comma, number, conjunction). It also locates noun and verb phrases in the sentence using a sparse parsing technique.
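The preprocessing steps just described can be sketched as follows. This is my own toy rendering of the described behavior, not TABARI's actual code; in particular, real TABARI dictionaries are pre-built, whereas here unknown tokens simply receive a fresh ID:

```python
import re

# Sketch of the described preprocessing: upper-case the text, strip all
# punctuation except commas, then replace each token with an integer ID.
def normalize(sentence):
    return re.sub(r"[^\w,\s]", "", sentence.upper())

def to_int_array(sentence, ids=None):
    """Return (integer array, id table) for a sentence; commas are tokens."""
    ids = {} if ids is None else ids
    tokens = re.findall(r"\w+|,", normalize(sentence))
    out = []
    for tok in tokens:
        if tok not in ids:
            ids[tok] = len(ids)
        out.append(ids[tok])
    return out, ids
```

Working on an integer array rather than raw strings is what makes the subsequent sparse parsing and dictionary matching fast.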
To code an event, the program finds each verb and attempts to match the words
surrounding that verb with the phrases associated with that verb in the dictionary. A
successful match results in the assignment of an Event Code to that event.
Using the example of "Italy declared war on Scotland", as above, the pattern-matching algorithm would first apply the Actor Coding Scheme, generating:
Today, ITA declared war on SCO.
Secondly, the Event Coding Scheme would be applied, matching "declared war" to a code from an Event Coding Scheme (see 2.3.2), giving a final output of:
ITA <EVENT CODE CORRESPONDING TO "DECLARED WAR"> SCO
Or, as an example (using DECW as the event code for "declared war"):
ITA DECW SCO
This would be done for each input into the system, generating a series of trigrams to describe the events that took place in each article's headline.
TABARI applies a number of "dictionary rules" to its actor and verb dictionaries to increase their flexibility with respect to the way they code events. Some useful dictionary rules from TABARI include:
· A space at the end of a word marks it as a stem.
· Underscores connecting words denote words that will match only if found consecutively in the text.
· An underscore at the end of a word means that the word must be followed by a space to match.
Format of verbs file:
WON [---]
- + * PLEDGE FROM $ [054]
– * MORE {GROUND | LAND} [211] ; means ground OR land can follow.
Figure 7: Example format of a TABARI verb.
"+" and "$" are variables for actors. In TABARI, "+" denotes the target and "$"denotes the
source of the action. "%" is also sometimes used, and denotes a compound actor should be
assigned to source and target. A semicolon allows comments to be made after the code is
supplied.
Since "won" cannot be assigned a code in itself, it is assigned a null code - [---]. This null
code is used for "Eliminating words that have the same stem as common verbs, and
eliminating verb phrases that do not correspond to political activity". [19]
The other code to be mentioned is the discard code, [###], used when stories need to be discarded (e.g. to avoid confusing Jordan the country with Michael Jordan the basketball player). If the discard code is matched anywhere in a news report, none of the text will be coded.
"BASKETBALL[###]" is an example of its usage.
This detailed overview of how a coding scheme works suggests that a similar technique could be employed in the system's Actor Coding Scheme, with specialist codes assigned to perform specific functions within the file. This would give a degree of flexibility that would otherwise be unattainable.
The VRA logger is a recent development by [20]. It is designed to work with the IDEA
coding scheme [21], and its application is more in the area of economics (predicting market
crashes, tax increases etc.).
Because the VRA logger is a commercial initiative, the VRA group would not make their code available for analysis, so little is known about the algorithms that are of most interest from a researcher's point of view. However, the in-depth analysis of TABARI that was conducted gave enough information on how to proceed with my system that this was not a setback.
2.3.2 Event Coding Schemes
Before cementing my ideas for system design, a choice had to be made with regards to the
coding scheme my system would be aimed toward. Research into this area would allow me
to ensure compatibility between the output of my system and these schemes, as well as
research into ideas that may be applicable to my Actor Coding Scheme.
The schemes to be discussed here include WEIS (World Events Interaction Survey), IDEA
(Integrated Data for Events Analysis), PANDA (Protocol for the Analysis of Non-violent
Direct Action) [22], CREON (Comparative Research on the Events Of Nations) [23], BCOW
(Behavioral Correlates Of War) [24], COPDAB (Conflict and Peace Data Bank) [25], and
CAMEO (Conflict and Mediation Event Operations) [26].
There are a number of different coding schemes available, ranging from syntax-based schemes such as COPDAB (see the example later) to word-based ones (CAMEO, WEIS, IDEA). Regardless of personal choice, the scheme would have to adhere to a set of criteria allowing it to be used for the Balkans area of conflict, in the context of a machine-based Event Coding System.
Research into the Balkans area of conflict (using [27,28] as references) revealed it to be a conflict zone characterized mainly by armed conflict, political unrest, social conflict and riots, and genocide, followed by international mediation and later international military action. Hence any scheme chosen would have to include detailed classification of these types of events. By creating a table of event types, and evaluating the schemes on their coverage of specific points, the intent was to narrow the choice from seven potential coding schemes down to a final one.
The scale chosen was as follows:
None – The type of event is not covered explicitly at all.
Basic – The type of event is covered once.
Average – The type of event is covered two to four times.
Detailed – The type of event is covered five or more times.
The marking of these events is as follows: one count for an individual code relating to that event, plus one count for each subsection of that event, if any are provided. If many codes are provided to cover a given event, the overall score is the sum of the counts of all the codes that cover that event.
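The count-to-label mapping can be written directly as a small function. Note one assumption: the report's wording leaves a count of exactly five between "Average" (two to four) and "Detailed"; this sketch treats five and above as Detailed:

```python
# Sketch of the coverage scale: total counts (one per code plus one per
# subsection) map to the labels used in the comparison table.
# Boundary assumption: a count of 5 is treated as "Detailed".
def coverage_label(count):
    if count == 0:
        return "None"
    if count == 1:
        return "Basic"
    if count <= 4:
        return "Average"
    return "Detailed"
```

Summing the counts per event type and applying this function reproduces the kind of entries shown in the Figure 8 table.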
Scheme | Military action | Guerrilla warfare | Political instability | Mediation | Social action | Non-military warfare
WEIS | Basic | None | Basic | None | Basic | Basic
IDEA | Average | Basic | Detailed | Basic | Average | Average
CAMEO | Average | Basic | Detailed | Average | Average | Detailed
COPDAB | Basic | None | Average | None | Average | Average
CREON | None | None | Detailed | Average | Average | None
BCOW | Detailed | Average | Basic | None | Basic | Average
PANDA | Basic | None | Detailed | Average | Detailed | Basic
Figure 8: Table showing a breakdown of coding scheme coverage in a number of areas related to the Balkans area of conflict

The above table gives some indicators as to which coding scheme to use, but should be combined with personal research to make an informed choice. A personal comparison of the different event coding schemes follows:
WEIS
WEIS is a very old coding scheme, and seems too generalized (it has only 216 separate event codes, thanks in part to its use of an outdated three-digit coding system, in contrast to CAMEO or IDEA, both of which use four digits to classify events; this extra digit provides an extra level of classification, see later).
This lack of detailed codes indicated that a detailed classification using the WEIS scheme would not be possible. Besides this, both IDEA and CAMEO are second-generation schemes (IDEA from PANDA, CAMEO from WEIS), and as such there is no reason to use a first-generation coding scheme in preference to either of the two mentioned above.
However, WEIS codes are available with an integrated Goldstein [5] scale rating for each event. The Goldstein scale measures the projected effect of an event on a situation (such as a mass execution during a war). In the event that the system was used as part of an Event Coding System, the inclusion of this feature could persuade the user to apply WEIS as the Event Coding Scheme.
COPDAB
Unlike the WEIS or PANDA-based schemes, COPDAB seems to be centered around human coding. It was decided to remove COPDAB from contention due to the format of its scheme, which does not lend itself to a machine-coding environment, and hence would not be applicable to the output of my system.
09
Nation A expressed mild disaffection toward B's policies, objectives, goals, behaviors
with A's government objection to these protestations; A's communique or note
dissatisfied with B's policies in third party
Figure 9: An example COPDAB event code.
As can be seen, COPDAB would be very hard to implement in a machine-coding context, due to its wordiness and generality. Terms such as "mild disaffection" would be hard to apply in a machine-coding program, as the term is difficult to pin to an analyzed sentence; it is harder to classify actions in this way. Since the scheme contains no direct links to verbs, it would be harder to implement than IDEA or CAMEO, for example.
CREON
“CREON is to study the foreign policy process, rather than foreign policy output. In practice
this means that CREON is better suited than WEIS or COPDAB to studying the linkages
between the foreign policy decision-making environment and foreign-policy outputs for
specific decisions, but it cannot be used to study policy outputs over a continuous period of
time or for countries not in the sample” [29]
CREON is therefore unsuitable; the focus of the Balkans conflict has little to do with the
processes behind foreign policy. CREON also places less of an emphasis on warfare and
violent conflict, and hence generalizes many key areas of the Balkans conflict where a more
detailed approach would be effective.
PANDA
“PANDA's data set uses a superset of the WEIS coding scheme that provides greater detail
in internal political events”[29]
“The other major development by Bond and his collaborators is the IDEA -- Integrated Data
for Events Analysis -- coding system. This will supersede the PANDA coding scheme, and
15
Gerard Howard
more is designed to provide a general framework for coding events. ” [29]
These two quotations show why PANDA was eliminated from contention. PANDA places little emphasis on violent conflict, instead focusing on humanitarian aid and internal political structure. Although internal politics were vital to the Balkans conflict, wars also made up a large part of the activities in that region; PANDA would be too specialized to provide a good overview of the conflict and, like CREON, would generalize over the more violent actions of the conflict.
IDEA
This is a very detailed coding scheme and, as the table shows, covers most areas
thoroughly. Because IDEA utilizes a four-digit coding scheme, it would allow me to
implement a hierarchical algorithm for code assignment. Since both IDEA and CAMEO,
below, encode using four digits, both schemes support four levels of abstraction of an event.
From a computer scientist's point of view, this would let an algorithm work through these
hierarchies, falling back to a higher level of abstraction should the coding algorithm be
unable to assign an event to a specific subsection (more specifically, falling back to a
three-digit general code rather than a four-digit specific code). This kind of hierarchy suits
algorithm design, and it is another point in favour of using either CAMEO or IDEA, especially
when considering the later addition of an actor coding scheme to the project.
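This fallback can be sketched in a few lines. The following Python fragment is illustrative only – the code table is invented, not taken from IDEA or CAMEO – but it shows the back-off from an unknown four-digit code to its three-digit parent:

```python
# Hedged sketch of hierarchical code assignment: if the specific
# four-digit code is unknown, back off one digit at a time to a more
# general parent code. The code table below is invented for illustration.
KNOWN_CODES = {"057", "0571", "190"}  # hypothetical scheme entries

def assign_code(event_code):
    """Return the most specific known code, dropping trailing digits until a match."""
    code = event_code
    while code:
        if code in KNOWN_CODES:
            return code
        code = code[:-1]  # drop the last digit: 4-digit -> 3-digit, etc.
    return None  # no level of the hierarchy matched

print(assign_code("0571"))  # specific code is known: "0571"
print(assign_code("0572"))  # unknown subsection, falls back to "057"
print(assign_code("1904"))  # falls back to the general "190"
```

The same back-off idea generalizes to any number of hierarchy levels, which is what makes the four-digit schemes attractive for algorithmic coding.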
CAMEO
CAMEO, along with IDEA, utilizes a four-digit coding scheme, allowing it greater depth of
classification. The CAMEO coding scheme is also shorter than IDEA, meaning that it will be
easier to implement and evaluate.
The scheme itself goes into a sufficient amount of detail in the most important areas, and in
addition contains good support for mapping Mediation events (as were common toward the
latter stages of the conflict). It is also judged to be broad-ranging enough to code a large
percentage of the reported happenings during the conflict.
BCOW
BCOW is a scheme mainly used for analyzing heavily war-torn areas, but maintains a small
section for other types of action. Originally, it was hoped that this would be enough for a
detailed classification of events in the Balkans region, but it became apparent that this would
not be the case. The other events section would be too small to successfully capture the
depth of the actions in the region, and in addition to this the heavy reliance on war and
violent activities indicated that a bias might be introduced (e.g. there is more chance of a
“disputed” event being classified as violent rather than political).
Overall, CAMEO appears to be the best all-round scheme for the analysis of the Balkans
conflict. It has a detailed all-round categorization of violent, non-violent, political, and social
actions, as well as the classification of mediation actions. A later extension to my project will
likely include an Event Coding Scheme – CAMEO would be the choice in this case.
2.3.3 Actor Coding Schemes
Since the VRA logger's code is not publicly available, and TABARI has superseded KEDS,
research into the Actor Coding Scheme will be based on the TABARI scheme.
The actor scheme has one entry per line, which eases the process of reading the actors
into an array and keeps the file human-readable.
An example actor code:
EAST_AND_WEST_GERMANY [GME/GMW]
Since actors change over time, TABARI allows people and places to change code.
MOSCOW [USR (<901225) RUS (>901226)].
This is a very useful feature, especially considering the turbulent and dynamic events leading
up to and including the Balkans conflict. The actors are manually tagged, with the name of
the person who assigned each tag appearing as a comment beside the tagged actor.
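Resolving a date-conditional entry like the MOSCOW example can be sketched as follows. This is a Python illustration; the exact TABARI syntax details are assumed from the example above, with dates written as YYMMDD integers:

```python
import re

# Illustrative parser for a TABARI-style actor entry with date-ranged
# codes, e.g. "MOSCOW [USR (<901225) RUS (>901226)]". The format is
# assumed from the example in the text; dates are YYMMDD integers.
def resolve_code(entry, date):
    name, codes = re.match(r"(\S+)\s+\[(.*)\]", entry).groups()
    # Each code may carry a "<" (before) or ">" (after) date restriction.
    for code, op, bound in re.findall(r"(\w+)\s*\(([<>])(\d+)\)", codes):
        if (op == "<" and date < int(bound)) or (op == ">" and date > int(bound)):
            return name, code
    return name, None  # no restriction matched the given date

print(resolve_code("MOSCOW [USR (<901225) RUS (>901226)]", 891101))
# before 25 Dec 1990, Moscow codes as Soviet: ('MOSCOW', 'USR')
print(resolve_code("MOSCOW [USR (<901225) RUS (>901226)]", 950704))
# after the cutover it codes as Russian: ('MOSCOW', 'RUS')
```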
The scheme goes into great detail, making fine-grained distinctions between the main actors
in the conflict. For example, Yugoslavia has 42 subcategories, specifying presidents, armed
forces, rebel factions, political movements, states and other related actors. However,
countries that could be considered peripheral to the conflict receive a much more general
classification (countries such as Zaire and Scotland are tagged as single entities), to reduce
complexity as well as file sizes.
An advantage to using a dictionary of actors from the conflict to construct the Actor Coding
Scheme is that these varying levels of depth and complexity should emerge automatically.
For example, if an article makes a distinction between Yugoslavia, Yugoslavia's military, and
a Yugoslav President, they will appear as separate entities in the coding scheme. Also,
since articles are extracted from the corpus according to whether or not they relate to the
Balkans conflict, the main actors in the conflict will naturally receive more distinctions (and
therefore more fine-grained definitions) in the scheme than actors peripheral to it.
Chapter 3
Design and Methodology
3.1 Methodology
Adhering to a design methodology will enable me to complete my project in an effective,
systematic manner. [30] describes a simple methodology, often referred to as the “classic”
methodology, in which the distinct stages (Analysis, Design, Implementation, Testing,
Evaluation, or similar) are completed sequentially, each stage being completed before the
next is started (hence “waterfall”).
The basic flow of this diagram (figure 10) can be seen mirrored in the Gantt charts (appendix
B, figures 1 and 2), as well as in the predicted “pipeline” flow of data through the system –
the output from one program is used as the input to the next program on the pipeline, with
data flowing from start to end of the system.
Figure 10: Basic Waterfall methodology
Figure 10 maps onto the report as follows: Section 2 can be thought of as Analysis, section 3
Design, section 4 Coding, and section 5 Testing. Maintenance is out of the scope of the
project, but nevertheless would be a continuing part of the life cycle of the system.
Waterfall methodologies have disadvantages, primarily that problems with the system are
not discovered until the “Testing” stage, and also that product requirements must be fixed
before the system is designed.
To this end, a “modified waterfall” methodology [31] may be used, removing the problems
associated with the use of its “classic” counterpart. In this case, modifications include
testing at every stage of the development process, and the inclusion of the two additional
methodologies mentioned below. [32] describes “throwaway prototyping”, which would be
ideal for the “Design” stage – having a throwaway model allows experimentation with
different ideas without setting anything in stone. I chose this as the final methodology for the
“Design” stage, as it removes the possibility that the project may be hampered by a poor
design decision early on due to unfamiliarity with Perl, as well as being more flexible than a
waterfall methodology.
The implementation stage could be thought of as an iterative cycle of smaller waterfalls –
analysis, design, coding and testing. Because the system will run as a pipeline of small,
single-function programs, one program cannot be fully tested until the previous one is
complete. These smaller waterfalls allow this pipeline development to take place, since each
program can be tackled sequentially.
Overall then, a modified Waterfall methodology best reflected the intended direction and
development of the system, and throwaway prototyping provided flexibility at the design
stage, as well as removing the need for product requirements to be fixed before the system
is designed. A series of smaller waterfalls allows efficient software implementation of a
pipeline system [31]. Deliverables for each stage are as follows:
Stage      Deliverables
Analysis   Evaluation of actor/event schemes and coding systems. Minimum
           requirements. “Research” writeup. Draft chapter.
Design     Final design and “Design and Methodology” writeup. Mid-project report.
Coding     Implementation plan, final solution, “Implementation” writeup.
Testing    Evaluation and testing results, included with the “Testing” and
           “Evaluation” writeups.
Figure 11: Deliverables for each stage of the report
3.2 Scope
Scope refers to the scale of the project, its intended users, and other such information.
The target users for this system are primarily event coders, who can use the system to
generate an actor dictionary for use in the coding system of their choice, and/or generate a
precoded set of actors. Since both of these operations are independent of the input corpus
used, an event coder working on any region could feasibly use the system.
Examples of scale include Reuters one-day archives being between 284997 (19970809.zip)
and 4725275 (19970515.zip) bytes compressed, 668906 and 11600256 uncompressed, and
containing between 214 and 3928 entries respectively. The entire Reuters one-year corpus
used “...is an archive of 806,791 English language news stories...”. [2]
An example TABARI actor scheme (the TABARI Balkans dataset, BALK.ACTOR.030502)
was 60369 bytes, with 1631 entries.
The system must therefore be able to cope with very large input files, and can be
expected to generate actor files of between approximately 2500 and 60000 bytes, containing
between 100 and 1500 estimated discrete actors.
3.3 Programming Languages and Tools
The system is designed to be compact, simple, and run as a pipeline. This allows certain
functions to be performed on an input (tagging, parsing, generating a dictionary, generating
a list of organizations etc.), without the entire dictionary generation system being performed
at once.
Keeping each program to a single, simple function, rather than grouping all required
functions into one program, also gives inexperienced users a better grasp of how the
system works (each program's code becomes easier to read since it is not surrounded by
other methods, and each program has its own readme file). [33] is a program implemented
in Lisp, a language with roots in NLP, and its implementation is similar to that of my intended
system.
This seems to point away from needing a more complex language such as Java or C++/C#,
since no complex program structure is necessary. Object-oriented functionality is not
required for a pipeline system such as mine, and would overcomplicate the implementation.
Perl or Python could therefore also be considered: like Java and C++, they are platform-
independent. They also produce smaller files than a comparable program in Java or
C++, since Perl and Python scripts need not be compiled.
Perl was designed (in part) as a language for language processing, similarly to Lisp. Perl
has a powerful, flexible, and simple regular expression system which could be used for
implementing information extraction. Regular expressions in Java typically take several
lines to implement – in Perl they can often be written in a single line.
Perl also allows simple output to (and input from) multiple files, using a system of shorthand
filehandles, and simple language-input manipulation. Due to its NLP functionality, Perl will
be the language used to implement the system.
RASP (a parser installed on the SoC Linux machines) could be used for the NP/VP
extraction, as well as parsing and POS tagging the articles for extraction of the required
elements.
3.4 Identifying Useful Information
The first step to extraction, once all files are accessible (i.e. in the same directory), is to take
only the parts of the article that I need. This design decision is backed up by [33], which
“was developed in response to the needs of the intelligence community for scanning and
processing huge volumes of written texts...FASTUS provides the analyst with a tool that will
help him or her to avoid being overwhelmed by the flood of information“.
Since FASTUS was designed to reduce the amount of text that users of a large corpus need
to handle, it indicates that there is a problem with maintaining large amounts of information.
For processing a corpus containing over 800,000 articles, it therefore seems necessary to
discard as much useless information as quickly as possible – to reduce processing time as
well as the size of the files my system will generate and process.
The elements of primary interest are the title of the article and its first paragraph (since the
first paragraph always summarizes the contents of the remainder of the article). Since most
articles contain more than one paragraph, the system must be able to extract only the first;
because the extracted corpus was very large – over 2 gigabytes [2] – it is necessary to
reduce the size of the working corpus as soon as possible.
The Reuters corpus used [2] was encoded with an XML schema (Appendix B, Figure 12).
This clearly defines the elements of the article, such as “Title” or “Headline”, as well as
“code” fields that specify the countries involved and the type of event the article
describes.
These codes could be used to determine whether or not an article should be included in the
corpus of Balkans headlines, by taking only articles referring to certain countries, regions or
types of action.
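As a sketch, such code-based filtering might look like this in Python (the `code="ALB"` attribute format follows the corpus's XML markup; the code values used here are illustrative):

```python
import re

# Keep an article only if it carries at least one of the wanted
# region/topic codes in its XML "code" fields, e.g. <code code="ALB">.
# The wanted codes below are illustrative examples.
def matches_codes(article_line, wanted_codes):
    found = re.findall(r'code="([^"]+)"', article_line)
    return any(c in wanted_codes for c in found)

line = '476032newsML.xml <title>...</title> <code code="ALB"><code code="GVIO">'
print(matches_codes(line, {"ALB", "SERB"}))   # True: an Albania code is present
print(matches_codes(line, {"FRA"}))           # False: no French code
```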
3.4.1 Information Extraction
Since information extraction must be performed only on specific parts of each article, it
follows that some form of identification must be performed to help my system to understand
the types of words and phrases it is working with.
Almost every stage of my system will involve some form of information extraction or
manipulation, since I will be working from an entire corpus down to a selection of words and
phrases. Hence, at each stage of my system, the data I am processing must have some
form of identification with it, so only the relevant information is manipulated.
Other systems extract using parsing ([20]) or tagging and pattern matching ([10], [16]).
Tagging-based extraction requires an accurate and detailed tagging of words, showing what
part of speech (determiner, verb, proper noun) each word is, and also what semantic
category it represents (a person, a country, and so on).
Parse-based extraction, as used in [15], is harder to implement than its tagging-based
variant, since it requires knowledge not only of the word in question, but also of the
surrounding words, and the construction of the phrase those words appear in, and also
possibly of the sentence as a whole. By examining all three of these factors, and not just the
word, a more accurate classification can be made. However, implementation of such a
feature is often tricky and time-consuming, and the benefits of its implementation are
questionable when a word-based classification is often sufficiently accurate for a system of this type.
Additionally, although both inputs are imperfect, and prone to mistagging that is out of the
sphere of control of my system (since NER must be done externally, using [34]), it is clear
that tagging-based classification will suffer less from a single mistagging in an article.
For example, if one word in an article is mistagged, only that word is affected by the
mistagging. Although this obviously has a detrimental effect on the accuracy of the
dictionary creation, the mistagging error is isolated (that is, a single mistagging is not
propagated through the rest of the article).
Parse-based classification requires the structure of an article to be defined, the structure
then being used to help classify the words in the article. Similarly to the word-based
approach, mistaggings occur. However, since one parameter is used to determine the next
(i.e. the end of the first noun phrase signifies the start of the first verb phrase), a mistagging
can throw the structural tagging of the entire article awry.
|---NP---|-VP-|
The cat sat
|-NP-|---VP---|
The cat sat
Figure 13: An example of the “knock-on” effect of a mistagging while using a parse-
based approach.
Additionally, testing the RASP parser has shown problems with some input types,
particularly fragments. These occur more frequently in news reports than they do in common
written English.
I decided against using a full parse, primarily for simplicity, and for its confinement of
mistagging errors.
Regular expressions are likely to be the way to implement pattern matching for any
extraction the system is expected to perform. Perl has excellent integrated support for
regular expressions.
3.4.2 Exploiting POS and NER
For my system to function, it must be able to determine what type of words are being passed
to it.
Part of Speech (POS) tagging identifies the syntactic type of each word and appends that
type to the end of the word, helping systems that must distinguish between different types
of words. Example POS tags include:
The-DET
Meeting-NN
The above examples show “The” to be of type “Determiner” (DET), and “Meeting” to be of
type “Noun” (NN). The actual tagset may vary from implementation to implementation, but
is broadly standard across major POS systems. In either case, a regular expression is
expected to do the matching, with some form of algorithm processing the input (the
words and their syntactic types) and producing an output that analyzes the structure and
places further tags on the words or phrases it determines to be actors or events.
This helps because it breaks the structure of the sentence down to a fine-grained level
(assignment being on a “per-word” basis), preprocessing the data and turning it into a
manageable form for use with the dictionary creation system. If no tagging were performed,
the system would have no way to distinguish between different types of words (since it has
no inbuilt appreciation of context). NER and POS tagging allow an algorithmic, computational
approach to a traditionally human-solved problem.
Named Entity Recognition (NER) is a method most commonly associated with NLP tasks. It
algorithmically assigns each word in an input to an entity category, such as “person”,
“country” and so on. Regular expressions can be used to separate the different entity
types, allowing dictionaries of particular entity types to be constructed. An NER-tagged
output would be similar to:
Microsoft-ORG
The suffix “ORG” tags Microsoft as being an organization. Creating a dictionary of
organizations is therefore likely to involve exploiting this NER tagging to extract, for
example, every word tagged “ORG” and place it in a file containing all previously identified
organizations from the input. It will be easiest to implement this method of extraction using
regular expressions to match the suffixes of the words.
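That suffix-matching extraction can be sketched as follows. This is a Python illustration (the report's Perl version would use the same kind of regular expression); the word-TAG format is assumed from the examples above:

```python
import re

# Sort NER-tagged tokens (e.g. "Microsoft-ORG") into per-type lists,
# mirroring the ORG/DAT/PER split described in the text.
def split_by_tag(tagged_words):
    lists = {"ORG": [], "DAT": [], "PER": []}
    for token in tagged_words:
        m = re.match(r"(.+)-(ORG|DAT|PER)$", token)
        if m:  # tokens with other suffixes (e.g. "-DET") are ignored
            word, tag = m.groups()
            lists[tag].append(word)
    return lists

tokens = ["Microsoft-ORG", "Monday-DAT", "Chirac-PER", "the-DET"]
print(split_by_tag(tokens))
# {'ORG': ['Microsoft'], 'DAT': ['Monday'], 'PER': ['Chirac']}
```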
3.5 System Design
My system aimed to take a corpus of Reuters newswire headlines and from it create
dictionaries of organizations, people, and dates (for any time-series analysis a user may
wish to perform).
Figure 10: The system as a “Black Box”.
The following is a demonstration of the desired inputs and outputs of the system.
Today (4/8/85), Soviet tanks moved across the border of France towards Paris,
despite the protestations of the French President, Jacques Chirac.
From the sample shown above, the following organizations, people, and dates should be
generated:
ORGANISATION: Soviet tanks, France, Paris.
PERSON: French President, Jacques Chirac.
DATE: 4/8/85.
From the actors extracted, people who are the same, but identified with different names,
should be clustered together to show that they are the same actor. One such cluster would
be:
(French President, Jacques Chirac)
Finally, an Actor Coding should take place, assigning a distinct code to each actor.
FR_PRES (French President, Jacques Chirac)
SOV_MIL (Soviet tanks)
FRA (France, Paris)
This output could be used in Event Coding Systems as the Actor Coding Scheme. The
system would search for the words in parentheses, and assign the capitalized code on the
left to that actor should a match be positive.
Given a sample input of:
“Soviet tanks attack Paris, Chirac surrenders.”
The Event Coding program would use the actor codes generated from my program, together
with event codes from a scheme such as CAMEO (see 2.3.2), to generate an event code for
that sample (using examples of “SURR” as the event code for “surrender”, and “MIL_AGG”
as the event code for “military aggression”).
SOV_MIL MIL_AGG FRA
(The Soviet military committed an act of military aggression against France).
FR_PRES SURR SOV_MIL
(The French President surrendered to the Soviet military).
Proceeding in this way, it can be observed that a list of event codes will be generated, which
can then be analyzed for whatever their intended use is.
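A minimal sketch of how such an actor scheme could be applied follows (Python for illustration; the aliases and codes come from the worked example above, and the simple substring matching is an assumption, not necessarily the author's method):

```python
# Apply an actor coding scheme to a sentence: report the code of any
# actor whose alias appears in the text. Scheme values are taken from
# the worked example in the text; the matching strategy is illustrative.
SCHEME = {
    "FR_PRES": ["French President", "Jacques Chirac", "Chirac"],
    "SOV_MIL": ["Soviet tanks"],
    "FRA": ["France", "Paris"],
}

def code_actors(sentence):
    found = []
    for code, aliases in SCHEME.items():
        if any(alias in sentence for alias in aliases):
            found.append(code)
    return found

print(code_actors("Soviet tanks attack Paris, Chirac surrenders."))
# ['FR_PRES', 'SOV_MIL', 'FRA']
```

A real implementation would also need to handle overlapping aliases and word boundaries, but the lookup structure is the same.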
Whilst considering system design, a number of decisions had to be made. Primarily, how to
strike a balance between features and complexity.
Since the system is designed to be reusable for other areas, or other applications that
require dictionaries to be generated, a simple design would breed a simple (and hence easy
to document and use) interface. Conversely, a lack of additional features would reduce the
number of tasks the system could execute, reducing the chances that the system would
provide a feature some user might need.
So the design of the system must be directed towards providing some extra features, but
without losing a simple design and interface. With this in mind, the easiest way to
provide added functionality would be to split the dictionary generation process into steps.
Each step would modify the data as required for the next, so that the system would
function exactly the same whether the separate programs were run back-to-back or the
system were run as a single program. The bonus is that the individual programs could also
be run independently, allowing other programmers to use only the parts of the system they
want, taking the output from any point in the dictionary creation process.
From a usability point of view, this allows each program to be viewed by whoever wishes to
use it, separate from the system as a whole. Not only would this aid in understanding of the
program (separate programs, out of the context of the system as a whole, would be easier to
understand since they would be shorter and simpler), each program could also have its own
documentation (readme file), explaining in more detail how a specific part of the system
works (since it would receive more coverage than it would in a general system-wide readme
file).
Finally, if a prospective user only wishes to use two or three programs from the system, they
need not concern themselves with any other part of the system.
Reusability could be catered for by allowing a plug-in capability, allowing other regions to
have Event Coding dictionaries generated for them. The part of my system (most likely a
single program) that extracts articles relevant to the Balkans conflict area should therefore
be designed to accept other countries and classification codes as arguments, extracting
from a Reuters corpus based on matching XML “code” tags to the country codes and topic
codes specified in the input. This would make the system reusable, but only if a Reuters
electronic corpus such as [2] was used as input, as the extraction is dependent on the XML
tags that accompany each article.
Chapter 4
Implementation
Implementation is to be carried out in Perl, due to its capacity for handling and processing
language-based input. For simplicity of analysis, the implementation can be thought of as a
two-stage, five-step process.
The first stage is the creation of the dictionaries, and the second stage the assignment of
actor codes to an input. In the first step, the relevant information is extracted from the
Reuters news corpus. Secondly, following NER preparation and preprocessing, a number of
word lists are created (using the NER tags). Thirdly, the word lists are clustered.
Finally, the word lists are transformed into dictionaries.
The second stage, and fifth step, is the assignment of actor codes to an input file based on
the actors in the dictionary file.
Figure 13: Flow chart of dictionary creation, showing the implementation of the
system, mirroring the “waterfall” model methodology. Implementation starts at the
top and flows down. Each program in itself can be represented as a “mini-waterfall”
of design, implementation, and evaluation. ActorScheme.pl applies the scheme to a
corpus.
After analyzing the two techniques (tagging/pattern matching and parsing), I have decided to
implement the former as my primary dictionary creation technique. Not only is it simpler, but
it is very easily implemented using Perl regular expressions on an NER-tagged input.
Pattern matching the suffixes based on the three available classifications - PER, ORG and
DAT (or Person, Organization, and Date), and adding each of the three into a separate array
gives three arrays containing all of the PERs, ORGs and DATs in the input file. By keeping
the unique identifier of each article in the output files, a list of articles followed by any PERs,
ORGs or DATs contained in that article can be output relatively easily.
Although a full parse will not be the method used to extract or code actors, the capability to
perform a full parse has been included (in Extract.pl) as part of preprocessing for others who
may want to use my system for syntactic structure analysis. This in itself requires
preprocessing via the RASP parser. The implementation of the noun and verb phrase
extraction itself is relatively simple – a sample RASP output, with POS tagging and
parsing, is shown in Appendix B, Figure 14.
The NP and VP extraction is based on parenthesis matching. The parsed form of an article
contains a number of pairs of parentheses that mark the boundaries between word
classifications, noun phrases, verb phrases and the like. Taking the open parenthesis just
before the start of an NP or VP, and counting the open and closed parentheses, the
end of the NP or VP can be located. The words contained between the matching pair of
parentheses constitute the NP or VP itself.
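The parenthesis-counting extraction can be sketched as follows (Python for illustration; the bracketed input mimics the general shape of a parse rather than reproducing RASP's exact output format):

```python
# Extract the span starting at an opening parenthesis by counting
# open/close parentheses until they balance - the technique described
# above for pulling NPs and VPs out of a parsed article.
def extract_span(text, start):
    """Return the substring from the '(' at `start` to its matching ')'."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # unbalanced input

parsed = "(S (NP (DET The) (NN cat)) (VP (VBD sat)))"
np_start = parsed.index("(NP")
print(extract_span(parsed, np_start))  # (NP (DET The) (NN cat))
```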
Although suffering from the mistagging problem mentioned above, it affords the system
some added functionality for those that wish to take advantage of it.
4.1 Programs, Inputs and Outputs
Program: StoryExtractor.pl
  Input(s):  Directory of unzipped Reuters XML articles
  Output(s): A single file containing the title, first paragraph, date, and any
             code fields for every article

Program: LeadLine.pl
  Input(s):  The output from StoryExtractor.pl
  Output(s): The articles that contain the codes provided by the user

Program: NERPrep.pl
  Input(s):  The output from LeadLine.pl
  Output(s): The same file, ready for NER

Program: RASPer.pl
  Input(s):  The output from LeadLine.pl
  Output(s): The same file, ready for RASP

Program: Extract.pl
  Input(s):  RASP-processed corpus
  Output(s): Each article, split into its NPs and VPs

Program: ORGDATPERExtractor.pl
  Input(s):  NER-processed corpus
  Output(s): Three files (ORG, DAT, PER), each containing only entities of that
             syntactic type taken from the corpus

Program: PuncRemover.pl
  Input(s):  Any of the three files from ORGDATPERExtractor.pl
  Output(s): The same file with punctuation removed, and simple stemming

Program: HeuristicClusterer.pl
  Input(s):  The cleaned PER file from PuncRemover.pl
  Output(s): A file containing each unique cluster

Program: CountryClusterer.pl
  Input(s):  The cleaned ORG file from PuncRemover.pl
  Output(s): A file containing each unique cluster

Program: DictionaryBuilder.pl
  Input(s):  Any of the three files from PuncRemover.pl
  Output(s): An alphabetically-ordered dictionary of unique actors

Program: ActorCoder.pl
  Input(s):  The output from HeuristicClusterer.pl and CountryClusterer.pl
  Output(s): An Actor coding scheme containing each unique cluster followed by
             its assigned code

Program: ActorScheme.pl
  Input(s):  The output from ActorCoder.pl, and an input corpus
  Output(s): An actor-coded output corpus
The programs created can be assigned to the five-step process (see 3.4) as follows:
The first step included the two programs StoryExtractor.pl and LeadLine.pl. The second
step involved the program NERPrep.pl. Following the NER preprocessing, the
ORGDATPERExtractor program was used to create the separate word lists (one for each of
ORG, DAT, and PER). The program PuncRemover.pl was also added to this step of the
system (see 4.2 for details). To extract the NPs and VPs from the input, the program
Extract.pl must be used on an input parsed and POS tagged by RASP. This input must first
be formatted to be compatible with RASP, using the file RASPer.pl.
The third step involved the programs HeuristicClusterer.pl, CountryClusterer.pl, and
FileJoiner.pl.
The final step of the first stage involved the file DictionaryBuilder.pl.
The second stage, and fifth step, of my system will use ActorCoder.pl to assign codes to the
actors in an input file, using the actors found in the dictionary file created by
DictionaryBuilder.pl. The scheme can be invoked on an input via ActorScheme.pl.
4.2 Information Extraction
The preliminary information extraction (stage 1), produced a single file where each line
contained the information of a single article (its date, title, first paragraph, and any additional
“code” fields). This was implemented in the program StoryExtractor.pl.
When all of the files in the Reuters corpus were unzipped into the same directory (using a
conventional unzip program present on both Unix and Windows-based operating systems),
StoryExtractor.pl was run. Using Perl's built-in grep function, an array of all the XML files in
the directory (@files) was generated as follows.
opendir(DIR, ".");
@files = grep(/\.xml$/,readdir(DIR));
This removed the need to declare each file specifically before any
processing could be done on it.
Each of these XML files was then examined line by line, and a series of regular expressions
used to extract the necessary fields from each file via pattern matching. To maintain a
consistent style of extraction, the program was designed so that no matter what order the
fields appeared in, they would appear in the output file in a set order (determined by the
order of the regular expressions in the sequence).
An integer variable was used to ensure that only the first paragraph from each article (the
lead line) was taken. The rest of the extraction consisted of a simple series of if statements
coupled with regular expressions which process each line of the input.
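The fixed-order extraction can be sketched as below. This Python fragment mirrors the described series of regular expressions (the report's Perl version is analogous); the field patterns are assumptions based on the XML excerpt in Figure 15:

```python
import re

# Pull named fields out of an article and emit them in a fixed order,
# regardless of the order they appeared in - the regex sequence itself
# defines the output order, as described in the text.
FIELD_PATTERNS = [
    ("title", re.compile(r"<title>(.*?)</title>", re.S)),
    ("lead", re.compile(r"<p>(.*?)</p>", re.S)),   # first <p> = lead line
    ("codes", re.compile(r'code="([^"]+)"')),
]

def extract_fields(article_xml):
    out = []
    for name, pattern in FIELD_PATTERNS:
        matches = pattern.findall(article_xml)
        if matches:
            # only the first paragraph (lead line) is kept; all codes are kept
            out.append(matches[0] if name != "codes" else ",".join(matches))
    return " ".join(out)

xml = '<code code="ALB"><title>T1</title><p>First para.</p><p>Second.</p>'
print(extract_fields(xml))  # T1 First para. ALB
```

Note that although the code field appears first in the input, it is emitted last, because the pattern sequence fixes the output order.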
476032newsML.xml <title>ALBANIA: ALBANIA MOURNS REFUGEES,
ITALY PREPARES FORCE.</title> <p>Albania declared a day of mourning on
Monday for refugees drowned in a shipwreck off Italy, battering Rome's image as it
organised a multinational force to protect aid to the chaos-torn country.</p> <code
code="ALB"><code code="ITALY"><code code="GCAT"><code
code="GDIP"><code code="GDIS"><code code="GVIO"><dc
element="dc.date.published" value="1997-03-31"/>
Figure 15: The extracted parts of a Reuters news article.
Due to page constraints this appears cluttered; however, the output file presents all this
information on a single line, making processing easier.
The second stage of the extraction used the output file from the StoryExtractor.pl program
as input. LeadLine.pl was used to extract articles relevant to the area of study via extraction
based on the “code” XML fields of each article. Since each article now occupied only a
single line, this could be performed by a regular expression matching the whole line (whole
article) at once. To support reusability, the user could dynamically construct the regular
expression, and hence could easily extract from a corpus based on an entirely different set
of criteria. Matches were performed on country and topic codes, as defined in the
“topic_codes.txt” and “region_codes.txt” files supplied with the Reuters corpus.
Perl regular expressions support the “|” symbol for “OR”. Hence a number of codes could be
specified as command line arguments, separated by “|” symbols. Once constructed, the
regular expression was run against each line in the file, outputting any matches to a
separate file ready for NER.
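Building the alternation from user-supplied codes can be sketched like this (Python for illustration; Perl's construction is analogous, using the same “|” alternation):

```python
import re

# Build one regex from a list of user-supplied codes, joined with "|"
# (OR), then keep only the article lines that match - the reusable
# extraction step described in the text.
def build_code_regex(codes):
    # re.escape guards against codes containing regex metacharacters
    return re.compile("|".join(re.escape(c) for c in codes))

def filter_articles(lines, codes):
    pattern = build_code_regex(codes)
    return [line for line in lines if pattern.search(line)]

lines = ['1.xml ... code="ALB"', '2.xml ... code="FRA"', '3.xml ... code="SERB"']
print(filter_articles(lines, ["ALB", "SERB"]))
# ['1.xml ... code="ALB"', '3.xml ... code="SERB"']
```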
Appendix B contains Figure 16, illustrating this implementation.
4.3 Named Entity Recognition
Named Entity Recognition is a process that associates an entity category with each word
over a given input. Since it is used to process plain English text, the input file had to
be preformatted to remove the XML tags – the inclusion of which would have made the NER
program malfunction and produce a greater number of mistaggings.
The removal of XML tags was implemented in the file NERPrep.pl. Capturing groups were
used to extract the unique identifier of each article, and the first paragraph of that article.
Only the unique identifier and first paragraph were extracted since they were the unique
identifier was required for identifying each article, and the first paragraph was to be the
subject of the NER processing. The remainder of the fields would have been meaningless
to the NER program, and may have led to mistaggings.
476032newsML.xml. Albania declared a day of mourning on Monday for refugees
drowned in a shipwreck off Italy, battering Rome's image as it organized a
multinational force to protect aid to the chaos-torn country.
Figure 17: A cleaned Reuters news article, ready for NER
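The capturing-group extraction might look like the following sketch. This is an illustrative Python rendering of the Perl approach, and the simplified XML layout is an assumption based on Figure 15:

```python
import re

# A single-line article as produced by the earlier stages (simplified layout).
article = ('<newsitem itemid="476032"><title>ALBANIA</title>'
           '<p>Albania declared a day of mourning on Monday.</p>'
           '<code code="ALB"><code code="ITALY">')

# Two capturing groups: group 1 is the unique identifier,
# group 2 is the first <p> paragraph; every other field is discarded.
match = re.search(r'itemid="(\d+)".*?<p>(.*?)</p>', article)
if match:
    itemid, paragraph = match.groups()
    # Emit in the format shown in Figure 17: identifier, then the clean text.
    print("%snewsML.xml. %s" % (itemid, paragraph))
```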
This produced a file suitable for input into an NER system. This file was then sent for NER
processing. The returned file was to be used in conjunction with the OrgDatPerExtractor.pl
file, which would extract every organization, date, and person from the file, and place them
into the relevant output file, one for each type of entity extracted.
The first stage was to split the entire input into an array of words, done by splitting each line
on whitespace.
Now each element of the array of words was matched against regular expressions designed
to capture the ORG, DAT, or PER suffixes placed by the NER process.
The unique article identifier was printed to each of the three output files, since it would be
needed by all of them. A word which matched ORG would be printed to the ORG output
filehandle, and similarly for the other two filehandles.
However, dry runs on some sample texts showed two major problems that had to be tackled.
Firstly, many organization and person names are over one word in length. Although the
program correctly identified them, it printed them as:
Ariel
Charon
Whereas the required format is:
Ariel Charon
This problem was rectified by having the next array element (the next word in the article)
checked before printing the newline characters that would separate the organization or
person names. The lines that controlled the output format of a word added a newline
character (signifying the end of the current name) only if the following word in the array did
not carry the same suffix (ORG, DAT, PER) as the word currently being processed. In the
same way that a regular expression match is signified with “=~”, a non-match can be
signified with “!~”.
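The look-ahead fix can be sketched like this. It is an illustrative Python version (the word/TYPE token format is an assumption about the NER output, and the original was Perl), grouping consecutive words that carry the same suffix into one name:

```python
def group_entities(tagged_words, wanted):
    """Collect consecutive words carrying the wanted NER suffix into one name,
    ending the current name only when the next word's suffix differs – this
    mirrors the "check the next array element before printing" fix above."""
    names, current = [], []
    for token in tagged_words:
        word, _, suffix = token.rpartition("/")
        if suffix == wanted:
            current.append(word)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

tokens = ["Ariel/PER", "Charon/PER", "visited/O", "United/ORG", "Nations/ORG"]
print(group_entities(tokens, "PER"))  # ['Ariel Charon']
```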
The second problem stemmed from the fact that the NER software used to tag the input file
had a tendency to wrongly tag countries as people. This was sidestepped by including an
exclusion list – an associative array of countries.
If a match was found between the element and the exclusion list, that element was printed
straight to the ORG output file, since it must be a country. This stopped countries that had
been incorrectly tagged as people by the NER process from being added to the wrong
dictionary.
# Map each country to a true value so membership can be tested with exists().
%ExclusionList = map { $_ => 1 } ("Yugoslavia", "Poland", "Albania", "Austria",
    "Bosnia", "Bulgaria", "Croatia", "Romania", "Turkey", "Slovenia", "Slovakia",
    "Georgia", "Macedonia", "Hungary", "Serbia", "Egypt", "Cyprus", "USA",
    "America", "U.S.A", "US", "England", "UK", "Germany", "Holland",
    "Netherlands", "Russia", "Switzerland", "Greece", "France", "Italy");
Figure 18: The Perl code for the exclusion list.
Postprocessing was done with the PuncRemover.pl program, because a sizable proportion
of the names detected as “unique” were in fact the same name with added punctuation, e.g.
England and England's. The role of PuncRemover.pl was simply to stem every entry in each
of the three files, so that the unique names could be taken with no plural entries and no
punctuation. This was implemented using Perl's grouping support for regular expressions.
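The punctuation-stripping step might be sketched as follows (Python for illustration; this handles only possessives and trailing punctuation, a simplification of PuncRemover.pl's stemming):

```python
import re

def strip_punctuation(name):
    """Capture the leading run of word characters and spaces (the group),
    discarding possessives and trailing punctuation such as "England's"."""
    match = re.match(r"([\w ]+)", name)
    return match.group(1).rstrip() if match else name

print(strip_punctuation("England's"))  # England
print(strip_punctuation("NATO,"))      # NATO
```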
4.4 Clustering
Clustering involved three programs, HeuristicClusterer.pl, CountryClusterer.pl, and
FileJoiner.pl.
HeuristicClusterer.pl was designed to cluster names based on a series of heuristics (values
possessed by one entity that would identify it as being the same as another entity). These
heuristics took the form of a set of regular expressions. The premise behind the program's
operation was that each identified PER entry would be compared to an array whose
elements were themselves arrays containing all of the clustered names (i.e. all the names
that had already been through the system). If a match was found, the new element would
be added to the array of names that it matched. If no match was found, a new array would
be created with that element at the first array position.
In practice, it became apparent that an associative array was the easiest way to implement
this – each unique name would have a unique key, and two names that had been identified
as the same by the heuristics would share a common key:
A Lincoln, 1
Bill Gates, 2
President Lincoln, 1
Firstly, the heuristics had to be defined, first in words and later in code. Secondly, each
name had to be split into a first element (first name) and a second element (last name), so
that it could be compared using the heuristics defined above. Thirdly, the associative array
had to be created, along with logic for adding clustered elements to existing clusters, and for
adding new elements to the hash under a new, unique key.
The heuristics chosen were as follows:
The first character of the first name, followed by zero or more word characters,
followed by a space, followed by the last name
Or, in Perl, where $firstInitial holds the first character of the first name:
$name =~ /$firstInitial\w*\s$lastName/;
The heuristics had to be altered over time to give improved clustering accuracy; this simple
heuristic outperformed a more complex, earlier version.
The variables $firstName and $lastName come from splitting the name on whitespace. As
each name is processed, the program ensures that even if a person has more than two
elements in their name, only the first and last elements are taken (processing middle names
would be inefficient, as well as unnecessary – people are generally known by their first and
last names only).
Once all the names are present, each element in turn is checked against the name
heuristics, and clustered if a match is found.
The element in question is compared with each key in the associative array (each distinct
cluster already defined by the program). For every id present, the name stored at that
position is split into a first name and a last name as above, giving two variables which can
then be compared against the person's first and last name.
If the element's heuristics match those of the name currently being processed, that element
is given the same key as the associative array entry which gave the match. If no match is
found, a new id is created in the associative array, and the unmatched element is added as
the first member of that cluster.
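Putting the pieces together, the clustering logic might be sketched as below. This is an illustrative Python rendering (the project's HeuristicClusterer.pl was Perl, and the sample names are hypothetical):

```python
import re

def cluster_names(names):
    """Cluster person names with the heuristic described above: first initial,
    zero or more word characters, a space, then the same last name. Names are
    held in an associative array mapping each name to a cluster id."""
    clusters = {}   # name -> cluster id (the associative array)
    patterns = {}   # cluster id -> compiled heuristic regex
    next_id = 1
    for name in names:
        parts = name.split()
        first, last = parts[0], parts[-1]   # only first and last elements
        assigned = None
        for cid, pattern in patterns.items():
            if pattern.match(name):
                assigned = cid
                break
        if assigned is None:
            assigned = next_id
            next_id += 1
            patterns[assigned] = re.compile(
                r"%s\w*\s%s$" % (re.escape(first[0]), re.escape(last)))
        clusters[name] = assigned
    return clusters

print(cluster_names(["A Lincoln", "Bill Gates", "Abraham Lincoln"]))
# {'A Lincoln': 1, 'Bill Gates': 2, 'Abraham Lincoln': 1}
```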
CountryClusterer.pl was a simple program which replaced a positive country match (in the
ORG file) with a predetermined country abbreviation (ITA for Italy, for example).
The regular expression syntax used was slightly different to the usual matching syntax – this
is because the program is designed to substitute the matched line with a set string.
The default settings for substitutions are also different to the normal settings. With
substitutions, case insensitivity has to be explicitly enabled (using the “i” flag). Similarly, the
program had to match globally throughout the string (the “g” flag). To enable substitution
mode, the “s” prefix had to be included.
The regular expression construction had to include .* both before and after the search string,
to ensure that the whole line was substituted rather than just the matched word.
A problem with this implementation was that the matcher continued to match even after it
had assigned a country code to a country. It was found that some of the abbreviations I had
chosen would be replaced by a further abbreviation, giving that particular line an incorrect
code assignment. To stop this happening, a boolean was used to record whether an input
had been matched, preventing further comparisons and substitutions from occurring after
this first match.
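The guarded substitution might look like this sketch (Python for illustration; the abbreviation table beyond ITA/Italy is hypothetical):

```python
import re

# Hypothetical country -> abbreviation table (ITA for Italy is from the text;
# the other entries are illustrative).
ABBREVIATIONS = [("Italy", "ITA"), ("Albania", "ALB"), ("France", "FRA")]

def code_line(line):
    """Replace the whole line with a country code on the first country match.
    The boolean guard stops an abbreviation from being re-substituted by a
    later rule, which is the bug described above."""
    matched = False
    for country, code in ABBREVIATIONS:
        if not matched and re.search(country, line, re.IGNORECASE):
            # .* on both sides substitutes the whole line, not just the word.
            line = re.sub(r".*" + country + r".*", code, line,
                          flags=re.IGNORECASE)
            matched = True
    return line

print(code_line("Republic of Italy"))  # ITA
```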
4.5 Dictionary Creation
The dictionary creation implementation originally involved the following steps, and the file
DictionaryBuilder.pl:
· Using a PER, ORG or DAT file as input, take all the unique names
· Sort the names alphabetically
The first point was implemented using Perl's associative array support. Associative arrays
were not originally part of my plan, but research into different implementations of detecting
unique elements in an input file showed that associative arrays gave the cleanest, shortest
method possible.
A single amendment had to be made to this outline implementation plan: research into the
Perl documentation revealed that Perl has an inbuilt sorting function. Originally, for the
sake of simplicity, sorting would have been a separate program, taking an input file of
unique names and bubble sorting them into an alphabetically ordered list. However, this
could be implemented in Perl in a single line, using two arrays as arguments (the unsorted
array as input, and a new sorted array as output).
Since it is only a single line, it was decided to include this in the DictionaryBuilder.pl
program, using a user command-line input to determine whether or not to sort the output.
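The de-duplicate-then-sort behaviour can be sketched as follows (illustrative Python; DictionaryBuilder.pl itself used a Perl hash and the built-in sort):

```python
def build_dictionary(names, sort_output=True):
    """De-duplicate names using an associative array (dict keys play the role
    of the Perl hash), then optionally sort alphabetically in one step."""
    unique = dict.fromkeys(names)   # keys are the unique names, order kept
    entries = list(unique)
    return sorted(entries) if sort_output else entries

print(build_dictionary(["NATO", "UN", "NATO"]))  # ['NATO', 'UN']
```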
4.6 Actor Coding
The Actor Coding Scheme will take the dictionary files created by the system, and utilize
clustering to apply a code to each actor detected in the corpus. This will produce an output
file of actors, clustered based on their name, with a code assigned and appended to each.
Two main implementation points arose – the generation of unique codes for each actor, and
reusability of the actor scheme generation.
Since the program worked on an input of unique clustered actors (as all actors in a cluster
are assumed to be the same entity), one code was required for each input element. The
easiest way to ensure a unique code was therefore to use the actor name as the basis for a
code. However, the point of coding is to reduce the amount of information that one must
work with; hence the codes had to be smaller than the names they represented.
This was implemented by taking each element in turn, and splitting it into an array of
characters, taking the first two and capitalizing them (making the codes compatible with
[16]). If this pair of characters was unique, they became the code for that actor. If not,
characters were iteratively added to the array until the code became unique.
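A literal reading of this scheme can be sketched as follows (illustrative Python; note that under strict first-come-first-served extension, John Paul and John Prescott would receive JO and JOH rather than the JOHN/JOHNP shown in section 6.3, so the real program may have extended codes further than this sketch does):

```python
def assign_codes(actors):
    """Assign each actor a code: the first two characters, capitalized with
    spaces removed; while the code collides with one already taken, extend it
    by one character until it is unique."""
    taken, codes = set(), {}
    for actor in actors:
        letters = actor.replace(" ", "").upper()
        length = 2
        code = letters[:length]
        while code in taken and length < len(letters):
            length += 1
            code = letters[:length]
        taken.add(code)
        codes[actor] = code
    return codes

print(assign_codes(["John Paul", "John Prescott"]))
# {'John Paul': 'JO', 'John Prescott': 'JOH'}
```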
My coding of this program did not implement the more sophisticated features of, for
example, the TABARI actor file (which allows actors to change codes over time), because
the program would not know when, for example, the USSR became Russia. These types of
features are only realistically implementable with manually-compiled Actor Coding
Schemes.
Reusability was provided by basing the actor coding scheme on the actor dictionaries. This
meant that, for any input, the actor codes generated would reflect the corpus provided –
allowing the input corpus to shape the coding scheme produced.
Chapter 5
Testing
General information about the project:
The dictionaries produced by the system were of the following sizes:
Organization Dictionary: 76610 bytes, 25.7 average bytes per article, 2977 discrete
organizations (clustered).
Person Dictionary: 28219 bytes, 18.78 average bytes per article, 1502 discrete people
(clustered).
Size of NER file: 136187 bytes.
The system was evaluated on the following criteria. Dictionary creation was evaluated as a
series of “precision and recall” experiments. The errors from NER were calculated by taking
the percentage of actors that had been correctly NER tagged. Other factors, such as actors
not relevant to the Balkans area of conflict appearing in the dictionary, or entries existing in
the wrong cluster, were also measured. These three parameters give a precision and
accuracy estimation for the dictionary creation.
These experiments would take a large amount of time to complete, so each (with the
exceptions of experiments 4 and 5) was conducted on a test corpus of 200 articles, the
results then extrapolated to fit in with the true dictionary sizes, giving an approximation of the
outcome of each experiment.
The actor coding scheme will be evaluated via two methods, firstly a study of comparative
accuracy of coding a small corpus of 200 news articles, against an existing system
(TABARI).
Secondly, an accuracy comparison of the actor coding scheme on a small sample of 50
news articles to a manual event coding, done by two human coders (who must both agree
on the code to assign, due to ambiguity issues with human code assignment).
Justification of the evaluation criteria is as follows. Because the dictionaries were designed
to be used by other systems, the most important evaluation criterion was the accuracy of
selection of the entities in each dictionary. If entities in the dictionary are not of the correct
syntactic type, any system using that dictionary may generate errors (depending on the
specific use of the dictionary). Correctness of the dictionary contents can be assessed
quantitatively, and gives the percentage of correct dictionary entries.
The accuracy of the clustering algorithm is also important – actors become “lost” and are
incorrectly assigned an actor code if clustered with actors they are not related to. Hence the
number of actors correctly/incorrectly clustered makes a quantitative evaluation criterion for
the percentage of correctly-assigned event codes.
A system-system comparison using TABARI as a benchmark, as well as a comparison to
human coding, will be used as evaluation criteria for system performance. By comparing
against an existing system and a human coder, some idea of “real world” performance can
be gained. A test plan was generated to encapsulate these ideas.
5.1 Test Plan
Test | Focus | Description | Result type
Dictionary Generation | NER tagging | Check actor dictionary for the existence of elements that are not actors | Percentage error
Dictionary Generation | Code-based actor extraction | Check actor dictionary for actors that are not related to the area of focus, via Google search | Percentage error
Dictionary Generation | Clustering algorithm | Check each cluster for elements that do not belong in that cluster | Percentage error
Actor Coding | System-system comparison vs. TABARI | Code a small input using both systems, compare accuracy | Accuracy differences, comparative study (qualitative)
Actor Coding | Comparison to human coding | Code a small input and compare for accuracy against a human-coded example | Percentage incorrect for each coding, comparative study (qualitative)
Figure 19: Test plan
These tests will be referred to as “Experiment 1...Experiment 5”.
5.2 Dictionary Creation
Experiment 1: NER Errors
Methodology: Move through each dictionary file before its NER taggings are removed,
manually note any entities that are incorrectly tagged, and entities tagged as actors that are
not actually people or organizations.
Total number of taggings: 6014
People mistagged: 163
Organizations mistagged: 204
Entities that aren't actors: 383
Incorrectly tagged actors: 755
Correctly tagged actors: 5259
Percentage correct: 87.4%
Percentage error: 12.6%
Experiment 2: Extraction Errors
Methodology: Move through the actor file, manually note any actors that are unrelated to the
Balkans conflict (confirm via Google search for that actor if unsure).
Total actors: 6014
Actors that are related to the Balkans: 2104
Actors that are unrelated to the Balkans: 3310
Percentage correct: 45.0%
Percentage error: 55.0%
Experiment 3: Clustering Errors
Methodology: Examine the contents of each cluster, note any actors that do not belong in
their cluster, and any actors that are the same person, but in different clusters.
Total number of clusters: 4479
Total number of actors: 6014
Actors correctly clustered: 4682
Actors incorrectly clustered: 1332
Percentage correct: 77.9%
Percentage error: 22.1%
5.3 Actor Coding Scheme
The errors found in the coding scheme can be attributed in part to the dictionary generation
process, since the same actors from the dictionaries were used to create the Actor Coding
Scheme.
Experiment 4: Real-world Application
Methodology: Apply the Actor Coding Scheme to a test corpus containing 500 sample
articles. Compare accuracy with TABARI on the same test corpus. Compare Actor Coding
only. (Test machine: 1.4 GHz Pentium IV, 512 MB RAM, Fedora Core Linux version 3.0)
The produced system:
Total number of actors in corpus: 1598
Actors incorrectly coded: 228
Actors not coded: 0
Non-actors coded: 21
Correct coding produced on: 1379 articles
Percentage correctly coded articles: 84.4%

TABARI:
Total number of actors in corpus: 1598
Actors incorrectly coded: 14
Actors not coded: 71
Non-actors coded: 5
Correct coding produced on: 1578 articles
Percentage correctly coded articles: 94.4%
Experiment 5: Human Coding
Methodology: Generate Actor Codes for 50 articles, compare accuracy to that of a human
coder.
The number of correctly coded actors for each is as follows:
System: 36 (72%)
Human: 50 (100%)
Actors in the corpus that didn't appear in the Actor Coding Scheme: 0.
Chapter 6
Evaluation
6.1 Results and Discussion
Experiment 5 (Appearing first as it is a minimum requirement)
When compared to a human coder, the system produced an actor coding accuracy for the
input corpus that was 28% lower. Because the coding of each actor was derived from the
name, the human coder could quickly check the scheme to see whether the actor was
classified, and assign the code. This resulted in the large percentage of correctly coded
samples.
Disagreements between the two methods of coding were largely due to the machine coding,
which had a match for every actor in the corpus because of the sheer amount of detail that
came from being generated from such a large actor dictionary. However, 28% of these
actors were incorrectly clustered.
Since all samples were coded, the miscodings were due to clustering errors (one actor being
incorrectly identified as being the same as another actor).
These can be attributed to entities that share acronyms (AA could be American Airlines or
the Automobile Association), compounded by the large number of actors, which meant there
was more chance of identical names referring to different entities.
Also, clustering organizations that are referred to both by acronym and by full name was
impossible, because of the heuristics employed. Since the system analyzes an article
without analyzing context, it will not link two syntactically distinct entities (such as
“nicknames”) with the organization's full name. Finally, if two individuals called “Paul
Simon” and “Pauline Simon” are detected, they will be clustered together, meaning that both
individuals, though different, will have the same actor code. Errors with the human coding
scheme assignments must be attributed to human error, since the alphabetical nature of the
scheme makes it very easy to follow and code to.
Testing the dictionary creation and actor coding systems (see 5) produced encouraging
results, but also highlighted three main types of errors to which the system is prone – NER-
based errors, entity extraction errors, and clustering assignment errors.
Experiment 1
NER-based errors are external errors caused by the NER program wrongly assigning a
syntactic type to an actor (e.g. a person being tagged as a country). This type of error
would be detrimental to any system using the generated dictionaries as an input, as the
dictionaries would contain errors. The total percentage of NER-based errors found in the
dictionaries was 12.6%. These can be attributed to the NER program itself [34], since the
NER process had to be done externally, and the writer had no control over the results.
Experiment 2
Entity extraction errors occur when the codes used to extract articles from the corpus are
too general. Too general an extraction allows actors that are unrelated to the topic of
interest to make their way into the dictionaries (and hence the coding scheme), which can
mean that actor codes are generated but would not be used if run on a corpus specific to
that area of interest. The percentage of entity extraction errors found in my system in total
was 55.0%. A more discriminating extraction would prevent actors unrelated to the area of
study from being added to the actor dictionary/Event Coding Scheme.
Since the regular expression for extraction is dynamically generated, this can be done by the
user without any modifications being made to the code. This was also the largest cause of
error in my system, since such a large proportion of the included actors were not related to
the event, so a coding scheme or NLP system would not benefit from their presence.
Experiment 3
Finally, clustering errors occur when an actor either isn't clustered with other examples of
that actor, or is clustered with an actor it shouldn't be clustered with. Clustering errors can
be attributed to the heuristics used to perform the clustering operation, and to actors with
similar names. 22.1% of the clustered actors were incorrectly clustered. Possible causes of
error are explained above (Experiment 5).
Experiment 4
On the subject of speed, [35] refers to TABARI as “Very fast. I timed TABARI on Levant
texts for 1987-1990, about 26,000 sentences. On a 350Mhz Mac G3 and using the...mode
that provides no screen feedback, TABARI codes 2000 events per second... On a 650Mhz
Dell Pentium III, the speed is around 3000 events per second”.
The speeds at which the two systems coded the test corpus (see 5.2) show that the system
developed as part of this project was faster than TABARI by 0.05 seconds. Although the
machine used for testing was faster than those cited above, there was obviously a loss of
speed due to the initial startup costs of the programs.
Given that the “usual benchmark that human coders can reliably produce 40 events per
day”[35], the human-coding of the example corpus would take twelve and a half days!
(Although this means coding a whole event, not just the Actor). Obviously, in a full Event
Coding System, time would be saved by a factor of millions, and the results for accuracy of
the code assignment are (despite errors in the system) reasonable.
“Actors incorrectly coded: 228” (from Experiment 4). This was far and away the largest
source of error in the coding. The errors with Actor Coding shown can be attributed to
selection criteria that were too general at the initial extraction stage. This left many actors
unrelated to the region of interest, and also increased the number of clusters containing
actors that should not be there – excessive information density meant that the clusterer
found it hard to distinguish between identical or nearly-identical names.
“Actors not coded: 0”. The test corpus used was taken from the corpus used to build the
dictionary [2]. The NER program, for the test corpus used, seems to have extracted all the
actors successfully – hence they all appear in the coding scheme, and are matched in the
test corpus when the Actor Coding Scheme is run on it. For an input from a different corpus,
this number is predicted to be substantially higher.
“Non-actors coded: 21”. These errors are due to the NER: non-actors that the NER
program tagged as actors, which were then included in the Actor Coding Scheme.
Clustering seems to give a low number of actors not coded by the scheme, but a substantial
number of non-actors that have been wrongly coded.
The overall accuracy of TABARI was 10% higher. However, this was expected as TABARI
has a more detailed coding algorithm, which sidesteps some of the pitfalls the system
undoubtedly hit whilst coding (e.g. misclustering, wrong tagging).
6.2 Dictionaries
The dictionaries created could be used in NLP-based systems for lexical analysis, as well as
in my system for actor code generation. They contain a diverse variety of elements, divided
by syntactic type into three files (organization, person, and date), and are formatted in a
way that can be easily read into such a system.
The Actor Coding Scheme clustering algorithm produced a very fine-grained coding
scheme when compared to an existing example, such as the person classification seen in
the TABARI Actor Coding Scheme. This type of scheme would be especially useful for
users working on a smaller corpus, since the reduction in volume of information could allow
for a more detailed classification of the actors in the corpus. This would suit a more specific
(and smaller) corpus.
6.3 Actor Coding Scheme
The Actor Coding Scheme generated was plausible for use in an Event Coding System, but
is not an optimal solution. The main drawbacks were variable-length codes and “First
Come, First Served” (FCFS) code assignment.
Variable-length codes are generated because of the way the algorithm moves sequentially
through the dictionary of actors. This was not something observed in the TABARI actor file,
but the discrepancies can be attributed to the fact that TABARI uses a hand-coded actor file,
whereas the system uses a machine-based, algorithmic approach.
FCFS code assignment comes again from the algorithm, which assigns a code to the first
element in the input file, before moving onto the next. In an alphabetically-ordered input file,
this produced an alphabetical assignment priority, where the smallest codes were assigned
to elements coming near the start of the input file. Longer codes were attributed to elements
with a common beginning (e.g. if John Paul = JOHN, then John Prescott = JOHNP). By
contrast, manually-produced coding schemes have codes generated based on the
importance of that actor to the region the scheme is written for – the main actors and
countries have codes prioritized for relevance and readability. The algorithmic nature of
code assignment in the system does not take such factors into consideration when
assigning codes. Example actor codings on a three-entity article follow.
INPUT
Today Albania declared war on France
Pope John Paul arrived in Yugoslavia preaching peace
Hungary and Italy moved in on the WHO

OUTPUT
Today ALB declared war on FRA
Pope JOHNP arrived in YUGOSLAVI preaching peace
HUNGARY and ITALY moved in on the WHO
6.4 Conclusions
In summary, the system fulfilled its minimum requirements, and three additions, two of which
were confirmed by the Mid-Project Report.
A number of conclusions can be drawn from the project. Firstly, that automated dictionary
extraction for event coding systems is feasible. The system produced shows that algorithmic
and NLP methods for dictionary creation can generate dictionaries with useful contents.
Secondly, that successful actor code generation is heavily dependent on the article
extraction parameters. In the examples used in the evaluation, many of the extracted
articles were unrelated to the Balkans conflict, so many codes were generated that would
go unused in a real-world coding. This also affected the actor dictionaries, since some of
the entries bore no relevance to the subject of that dictionary.
Thirdly, that the actor coding scheme and dictionaries produced are especially suited to
small, focused corpora, or to cases where extremely detailed classifications are required.
Finally, that implementation of a non-trivial Actor Coding Scheme will be a complex task.
This is why even automated systems such as [16] use manually-generated actor coding
schemes for their classification.
6.4.1 Improvements
A number of improvements could be made to the system. For example, a way to verify the
NER classification would enhance the correctness of the dictionaries. This could be
implemented simply by running the corpus through a number of NER systems, and taking
only the entities where they all agree on a syntactic type for that entity. Failing that, a run-
through of the generated corpus with a checking utility (such as a POS program) would
allow wrongly-tagged entities to be detected and deleted (by comparing the type of word to
an “exclusion list” of word types), reducing the number of total dictionary entries, but
increasing dictionary accuracy.
Enhancing the clustering algorithm would reduce errors related to the misclustering of
actors – in turn preventing the incorrect assignment of actor codes – by allowing distinctions
to be made between actors with similar names.
Further improvements to the accuracy of the system could also be made by altering the
algorithms used to cluster. The code assignment algorithm could also be altered to prioritize
frequently-used codes, giving them shorter abbreviations, while demoting less-used codes
to longer lengths. A human-readable, standard-length coding algorithm for actors would be
a major advancement for the project.
Improvements could also be made to the aesthetic and interactive qualities that the current
implementation of the system lacks. Although not a priority for this project, a GUI would
increase usability and reduce the time it takes to become comfortable with the system
(since there are so many small programs, novice users can become lost). Improving the
aesthetic properties of the system could also entice more users.
6.4.2 Furthering the Project
The most obvious way to further the project would be to turn the system into a fully-fledged
Event Coding System, using an existing Event Coding Scheme [21,22,23,24,25,26] along
with my Actor Coding Scheme generation to code an input corpus, and produce an output of
event coded headlines.
Finally, as commented on in the Mid-Project Report, a paper based on the system could be
submitted to a workshop, which could produce requests for additions to the program and
bug fixes (improving the system), as well as the possibility of collaborative projects in the
field of Event Coding.
Bibliography
[1] School of Computing, University of Leeds, (published date unknown), Projects: Minimum Requirements Form, URL: http://www.comp.leeds.ac.uk/tsinfo/projects/minreq-form.html [21st April 2005]
[2] Tony Rose, Mark Stevenson, Miles Whitehead, (2004), The Reuters Corpus Volume 1, from Yesterday's News to Tomorrow's Language Resources, Technology Innovation Group, Reuters Limited.
[3] CNN News, (date unknown), CNN Balkan Conflict: History, URL: http://www.cnn.com/WORLD/Bosnia/history/ [21st April 2005]
[4] Geert-Jan M. Kruijff, Oliver Plaehn, Holger Stenzhorn, Thorsten Brants, (2001), NEGRA Corpus, URL: http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html [21st April 2005]
[5] Goldstein, Joshua S, (1992), A Conflict-Cooperation Scale for WEIS Events Data, Journal of Conflict Resolution 36, pp.369-385.
[6] Burgess, P.M., & Lawton, R.W, (1972), Indicators of International Behavior: An Assessment of Events Data Research, Beverly Hills: Sage Publications.
[7] Vincent, Jack E, (1983), WEIS vs. COPDAB: Correspondence Problems, International Studies Quarterly 27, pp.160-169.
[8] R Grishman, (1995), The NYU System for MUC-6 or Where's the Syntax, Proceedings of the Sixth Message Understanding Conference.
[9] DJ Gerner, PA Schrodt, RA Francisco, JL Weddle, (1994), The Analysis of Political Events Using Machine Coded Data, International Studies Quarterly 38, pp.91-119.
[10] KEDS Team, (2004), Kansas Events Data System (KEDS) Homepage, URL: http://raven.cc.ku.edu/~keds/ [11th April 2005]
[11] Philip A. Schrodt, Shannon G. Davis and Judith L. Weddle, (1994), Political Science: KEDS: A Program for the Machine Coding of Event Data, University of Kansas.
[12] Gary King and Will Lowe, (2002), An Automated Information Extraction Tool for International Conflict Data with Performance as Good as Human Coders: A Rare Events Evaluation Design, Harvard University.
[13] Ellen Riloff, (1993), Automatically Constructing a Dictionary for Information Extraction Tasks, Proceedings of the Eleventh National Conference on Artificial Intelligence, AAAI Press / MIT Press, pp.811-816.
[14] Julie Weeds, Bill Keller, David Weir, Ian Wakeman, Jon Rimmer, and Tim Owen, (2004), Natural Language Expression of User Policies in Pervasive Computing Environments, Proceedings of the OntoLex 2004 (LREC Workshop).
[15] G Sampson, (1991), Probabilistic parsing, in Svartvik, J (ed), Directions in Corpus Linguistics.
[16] KEDS Group, (1999), TABARI Readme File, URL: http://www.ku.edu/~keds/software.dir/tabari.info.html [14th April 2005]
[17] McClelland, C.A, (1976), World Event/Interaction Survey Codebook, ICPSR 5211.
[18] G. Dale Thomas, (2000), The Machine-Assisted Creation of Historical Event Data Sets: A Practical Guide, Paper presented at the 2000 annual meeting of the International Studies Association.
[19] Philip A. Schrodt, (2001), Automated Coding of International Event Data Using Sparse Parsing Techniques, Paper presented at the annual meeting of the International Studies Association, Chicago, February 2001.
[20] Virtual Research Associates, (date unknown), Virtual Research Associates Homepage, URL: http://www.vranet.com/main.html [16th April 2005]
[21] Bond et al, (2003), Integrated Data for Events Analysis (IDEA): An Event Typology for Automated Event Coding, Journal of Peace Research 40, pp.733-745.
[22] Bond, Bennett, and Vogele, (1994), The Protocol for the Assessment of Nonviolent Direct Action (PANDA), Paper presented at the Program on Nonviolent Sanctions in Conflict and Defense, Center for International Affairs at Harvard.
[23] Hermann, East, Hermann, Salmore, and Salmore, (1973), Comparative Research on the Events of Nations (CREON), 2nd ICPSR, 1977.
[24] Leng, Russell J, (1987), Behavioral Correlates of War, 1816-1975, ICPSR 8606.
[25] Azar, E. E, (1982), The Codebook of the Conflict and Peace Data Bank (COPDAB), College Park, MD: Center for International Development, University of Maryland.
[26] Deborah J. Gerner, Rajaa Abu-Jabr, Philip A. Schrodt, A Yilmaz, (2002), Conflict and Mediation Event Observations (CAMEO): A New Event Data Framework for the Analysis of Foreign Policy Interactions, Center for International Political Analysis, Department of Political Science, University of Kansas.
[27] H Poulton, (1993), The Balkans: Minorities and States in Conflict, London: Minority Rights Group.
[28] V Roudometof, R Robertson, (2001), Nationalism, Globalization, and Orthodoxy: The Social Origins of Ethnic Conflict in the Balkans, Westport, CT: Greenwood Press.
[29] PA Schrodt, B Hall, L Neack, PJ Haney, JAK Hey, (1993), Event Data in Foreign Policy Analysis, in Foreign Policy Analysis: Continuity and Change, Prentice Hall, 1995.
[30] AM Davis, EH Bersoff, ER Comer, BTG Inc, VA Vienna, (1998), A Strategy for Comparing Alternative Software Development Life Cycle Models, IEEE Transactions on Software Engineering.
[31] Stefan Junginger, Harald Kahn, Mark Heidenfeld, Dimitris Karagiannis, (2001), Building Complex Workflow Applications: How to Overcome the Limitations of the Waterfall Model, in Fischer, L. (Ed.): Workflow Management Coalition: Workflow Handbook 2001, Future Strategies Inc., October 2000, pp.191-206.
[32] T Gilb, (1985), Evolutionary Delivery versus the 'Waterfall Model', ACM SIGSOFT Software Engineering Notes, ACM Press, New York, NY, USA.
[33] Jerry R. Hobbs and David Israel, (1996), FASTUS: An Information Extraction System, published in Finite State Devices for Natural Language Processing, MIT Press.
[34] Curran and Clark, (2003), Language Independent NER using a Maximum Entropy Tagger, Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-03), pp.164-167.
Appendix A
With respect to time management, the project process can be considered a success, as were the project write-up and the research. Although fifty pages appears a weighty sum at first, generating a rough plan of the report's structure at the start allowed me to write each section as the corresponding work was completed. Because I was writing about something I had just done, the details were fresh in my mind, and the workload of writing the report was effectively spread over the entire duration of the project. This ties into scheduling, another area I personally felt was a success. By deciding on and sticking to a design methodology, and by creating a provisional schedule of tasks early on, I was able to stay on task and complete the project on time.
Research aided the development of my solution greatly, with Dr. Markert providing a plethora
of white papers to read, and plenty of ideas to think about. The large amount of research
that I did before even beginning the design stage allowed me to sidestep pitfalls discussed in
some of the papers, steering the direction of the project towards ideas that have been
proved to work, using the best implementation methods available.
However, not all went according to plan. The main problem with the whole project was assembling such a large corpus of data in one place and running programs over it. For example, I had to ask the network administrators for additional space in a folder I could access, which took some time to set up. After that, it became necessary to leave a computer logged in for long periods, as I had to unzip and extract a corpus containing over 800,000 separate files. If I could change the project, I would either operate on a smaller corpus, obtain my own PC (so it could be left running without interruption), or arrange beforehand any additional resources (such as extra storage) that I needed.
My advice to students attempting a similar style of project would be to do a lot of research into systems similar to the one they want to implement, taking ideas and incorporating them into their own system, and also to create a project schedule early on, set realistic deadlines, and stick to the plan. This will help ensure that the project is completed on time. They should also not underestimate the time it will take to learn a new language, or anything else they are not familiar with.
Appendix B
Misc. Charts & Examples
Figure 1: Gantt Chart Showing Project Schedule for Semester 1
[chart: weeks 1-11 plus the Christmas break, plotted against the phases Research, Design, Implement, Test, Evaluate, Compile]
Figure 2: Gantt Chart Showing Project Schedule for Semester 2
[chart: weeks 1-8, the Easter break, and weeks 10-11, plotted against the phases Research, Design, Implement, Test, Evaluate, Compile]
Figure 12: Example Reuters XML-formatted article.
<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="804285" id="root" date="1997-08-16" xml:lang="en">
<title>PAKISTAN: Pakistan auctions 6-mth bonds for 19.91 bln
rupees.</title>
<headline>Pakistan auctions 6-mth bonds for 19.91 bln
rupees.</headline>
<dateline>KARACHI, Pakistan 1997-08-16</dateline>
<text>
<p>The State (central) Bank of Pakistan said it had auctioned six-month short-term
federal bonds worth 19.91 billion rupees on Saturday at a weighted average yield of
15.42049 percent per annum.</p>
<p>Bids worth a total of 26.485 billion rupees were received out of which bids for 19.91 billion rupees were accepted, a bank statement said.</p>
<p>In the previous auction held on August 4, bids worth 5.102 billion rupees were
accepted at a weighted average yield of 15.38533 percent per annum.</p>
<p>-- Karachi newsroom (9221) 5685192; Fax 5673428</p>
</text>
<copyright>(c) Reuters Limited 1997</copyright>
<metadata>
<codes class="bip:countries:1.0">
<code code="PAKIS">
<editdetail attribution="Reuters BIP Coding Group"
action="confirmed" date="1997-08-16"/>
</code>
</codes>
<codes class="bip:topics:1.0">
<code code="M12">
<editdetail attribution="Reuters BIP Coding Group"
action="confirmed" date="1997-08-16"/>
</code>
<code code="MCAT">
<editdetail attribution="Reuters BIP Coding Group"
action="confirmed" date="1997-08-16"/>
</code>
</codes>
<dc element="dc.date.created" value="1997-08-16"/>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="1997-08-16"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="KARACHI, Pakistan"/>
<dc element="dc.creator.location.country.name" value="PAKISTAN"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>
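The metadata block in such an article can be read with any standard XML parser. As an illustrative sketch only (written in Python here, although the project's own scripts were Perl), the Reuters BIP country and topic codes could be collected like this:

```python
import xml.etree.ElementTree as ET

# Abbreviated RCV1-style article, taken from the example above
article = """<newsitem itemid="804285" id="root" date="1997-08-16" xml:lang="en">
  <title>PAKISTAN: Pakistan auctions 6-mth bonds for 19.91 bln rupees.</title>
  <metadata>
    <codes class="bip:countries:1.0">
      <code code="PAKIS"/>
    </codes>
    <codes class="bip:topics:1.0">
      <code code="M12"/>
      <code code="MCAT"/>
    </codes>
  </metadata>
</newsitem>"""

root = ET.fromstring(article)

# Group the assigned codes by the class attribute of their <codes> element
codes = {}
for codeset in root.iter('codes'):
    cls = codeset.get('class')
    codes[cls] = [c.get('code') for c in codeset.findall('code')]

print(codes['bip:countries:1.0'])  # ['PAKIS']
print(codes['bip:topics:1.0'])     # ['M12', 'MCAT']
```

Iterating over `<codes>` elements rather than hard-coding paths means the same loop handles articles with any mixture of country, topic, and industry code sets.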
Figure 14: A RASP-parsed and POS-tagged Reuters news article.
("476185newsML.xml" "The" "Chicago" "Board" "Options" "Exchange" "said" "Monday" "an"
"exchange" "membership" "sold" "for" "a" "record" "price" "for" "the" "second" "time" "this"
"month") 1 ; (-29.820)
(|T/txt-sc1/---|
(|S/np_vp|
(|NP/n2_name|
(|NP/name_n2| (|NP/n1_name/-| (|N1/n| |476185newsML.xml:1_NP1|))
(|NP/name_n1| (|NP/det_n| |The:2_AT| (|N1/n| |Chicago:3_NP1|))
(|N1/n| |Board:4_NNJ1|)))
(|NP/n1_name/-|
(|N1/name+| |Options:5_NP1| (|N1/n| |Exchange:6_NP1|))))
(|VP/vp_npadv1|
(|V/np_np| |say+ed:7_VVD| |Monday:8_NPD1|
(|NP/det_n| |an:9_AT1|
(|N1/n1_nm| |exchange:10_NN1|
(|N1/n_pprt| |membership:11_NN1|
(|V/pp_pp| |sell+ed:12_VVN|
(|PP/p1|
(|P1/p_np| |for:13_IF|
(|NP/det_n| |a:14_AT1|
(|N1/n1_nm| |record:15_NN1| (|N1/n| |price:16_NN1|)))))
(|PP/p1|
(|P1/p_np| |for:17_IF|
(|NP/det_n| |the:18_AT|
(|N1/ap_n1/-| (|AP/a1| (|A1/a| |second:19_MD|))
(|N1/n| |time:20_NNT1|))))))))))
(|NP/det_n| |this:21_DD1| (|N1/n| |month:22_NNT1|)))))
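The leaf nodes of this parse embed the lemma, token position, and POS tag in the form |lemma:position_TAG|. A small Python sketch (hypothetical; the project's own processing was done in Perl) showing how these triples could be recovered from one clause of the output with a regular expression:

```python
import re

# One clause from the RASP parse above
parse = "(|V/np_np| |say+ed:7_VVD| |Monday:8_NPD1| (|NP/det_n| |an:9_AT1|"

# Leaf tokens look like |lemma:position_TAG|; rule labels such as
# |V/np_np| contain no colon, so they are skipped automatically.
token_re = re.compile(r'\|([^|:]+):(\d+)_([A-Z0-9+]+)\|')

tokens = token_re.findall(parse)
print(tokens)
# [('say+ed', '7', 'VVD'), ('Monday', '8', 'NPD1'), ('an', '9', 'AT1')]
```

Filtering on the colon is what separates lexical leaves from grammar-rule nodes, so the bracketed tree structure can be ignored entirely when only the tagged words are needed.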
Figure 16: Perl code for dynamic regular expression generation
$textstring = "code.code=(";
$i = 0;
for($i = 0; $i <=$#ARGV; $i++)
{
if ($i == $#ARGV)
{
$textstring = $textstring . $ARGV[$i] . ")";