ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
by
Suraj Raghuraman
APPROVED BY SUPERVISORY COMMITTEE: ___________________________________________ Dr. Vincent Ng, Chair ___________________________________________ Dr. Yang Liu ___________________________________________ Dr. Latifur Khan
© Copyright 2008
Suraj Raghuraman
All Rights Reserved
To my late Grandfather & Family
To my advisor, Dr. Vincent Ng
To all my friends
ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
by
SURAJ RAGHURAMAN, B.E.
THESIS
Presented to the Faculty of
The University of Texas at Dallas
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT DALLAS
December, 2008
ACKNOWLEDGEMENTS
I would like to thank my advisor, Dr. Vincent Ng, for putting up with me. I would have
given up long before had it not been for his support and motivation. He somehow knew
what would motivate me and did his best to keep me going.
I would like to thank my family for motivating me to study abroad. I would still be
working in India if it weren't for them. Thank you for supporting me when I needed your
support the most.
I would like to thank my friends for keeping me well fed and happy during the bad
times, and for pushing me hard enough to finish my thesis. If it weren't for you guys, I
would never have thought of finishing my Master's in a year.
September, 2008
ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
Publication No. ___________________
Suraj Raghuraman, M.S.
The University of Texas at Dallas, 2008
Supervising Professor: Dr. Vincent Ng
ABSTRACT
Anomaly detection from Aviation Safety Reporting System (ASRS) reports is the task of
identifying the multiple types of anomalies that led to an aviation incident, based on the
narrative portion of the report, which is written with a great deal of domain-specific
terminology. This thesis proposes an effective semi-supervised approach to the problem
that uses no domain-related information. The approach relies on methods similar to
singular value decomposition and latent semantics for feature generation, and a Support
Vector Machine with transduction is used to classify the narratives. Experimental results
show that the method developed is marginally better than the existing methods for
high-frequency anomalies and substantially better for low-frequency anomalies on the
ASRS dataset.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ............................................................................................... v
ABSTRACT .......................................................................................................................vi
LIST OF TABLES ............................................................................................................. ix
LIST OF FIGURES............................................................................................................. x
CHAPTER 1 INTRODUCTION...................................................................................... 1
CHAPTER 2 PROBLEM DESCRIPTION ...................................................................... 3
CHAPTER 3 RELATED WORK..................................................................................... 5
3.1 Supervised Approach ........................................................................................ 5
3.1.1 Dynamic Template Rule.................................................................... 5
3.1.2 Naïve Bayes Classifier ...................................................................... 5
3.1.3 Support Vector Machines ................................................................... 6
3.2 Semi-supervised Approach................................................................................ 7
3.2.1 Naïve Bayes EM................................................................................ 7
CHAPTER 4 OUR SEMI-SUPERVISED APPROACH ................................................. 9
4.1 Introduction ...................................................................................................... 9
4.2 Data Analysis ................................................................................................. 10
4.3 Feature Generation ......................................................................................... 12
4.3.1 Content Extraction............................................................................ 12
4.3.2 Tokenization..................................................................................... 12
4.3.3 Hash Generation ............................................................................... 13
4.4 Word Selection ................................................................................................ 14
4.4.1 Plain Information Gain..................................................................... 15
4.4.2 Simple Subtraction ........................................................................... 15
4.4.3 Combined Metric.............................................................................. 15
4.5 Feature Reduction........................................................................................... 16
4.5.1 Term By Document Vector .............................................................. 16
4.5.2 Inner Product .................................................................................... 17
4.5.3 Singular Value Decomposition ........................................................ 19
4.5.4 Gram Schmidt Process ..................................................................... 19
4.5.5 Applying Gram Schmidt .................................................................. 20
4.8 Classification ................................................................................................... 21
4.8.1 Learning............................................................................................ 22
4.8.2 Classifying........................................................................................ 22
4.8.3 Interpreting Output ........................................................................... 23
CHAPTER 5 EVALUATION ........................................................................................ 24
5.1 Dataset ............................................................................................................ 24
5.2 Experiment Setup ........................................................................................... 25
5.3 Baseline Experiments ..................................................................................... 25
5.3.1 Plain SVM ........................................................................................ 25
5.3.2 Naïve Bayes EM............................................................................... 27
5.3.3 Gram Schmidt Kernel....................................................................... 29
5.4 Our Semi Supervised Approach ..................................................................... 31
5.4.1 Stage 1: Word Selection ................................................................... 31
5.4.2 Stage 2: Further Dimensionality Reduction ..................................... 33
5.5 Result And Error Analysis ............................................................................. 35
CHAPTER 6 CONCLUSION ........................................................................................ 37
REFERENCES................................................................................................................. 39
Vita
LIST OF TABLES
Number Page
2.1 ANOMALY DETAILS................................................................................................ 4
4.1 TERM BY DOCUMENT VECTOR.......................................................................... 17
4.2 TERM BY DOCUMENT VECTORS FOR S1, S2 AND S3 .................................... 18
5.1 DATASET COMPOSITIONS ................................................................................... 25
5.2 RESULTS OF PLAIN SVM ...................................................................................... 26
5.3 RESULTS OF NAÏVE BAYES EM .......................................................................... 28
5.4 RESULTS OF GSK.................................................................................................... 30
5.5 RESULTS OF STAGE 1............................................................................................ 33
5.6 RESULTS OF STAGE 2............................................................................................ 34
5.7 COMPARING THE F-SCORES................................................................................ 36
LIST OF FIGURES
Number Page
2.1 AN EXAMPLE NARRATIVE .................................................................................... 3
4.1 NARRATIVE FOR ANOMALY, INCURSION – RUNWAY................................. 10
4.2 NARRATIVE FOR ANOMALY, INFLIGHT ENCOUNTER – WEATHER ......... 11
4.3 TOKENIZED NARRATIVE FOR ANOMALY, INCURSION – RUNWAY ......... 13
4.4 HASHED VERSION OF TOKENIZED NARRATIVE............................................ 14
4.5 REDUCED HASHED NARRATIVE........................................................................ 16
4.6 SAMPLE TRAIN FILE CONTENT.......................................................................... 22
4.7 SAMPLE TEST FILE CONTENT............................................................................. 23
5.1 RESULTS OF PLAIN SVM ...................................................................................... 27
5.2 RESULTS OF NAÏVE BAYES EM .......................................................................... 29
5.3 RESULTS OF GSK.................................................................................................... 31
5.4 RESULTS OF STAGE 1............................................................................................ 33
5.5 RESULTS OF STAGE 2............................................................................................ 35
5.6 COMPARING ALL RESULTS................................................................................. 36
CHAPTER 1
INTRODUCTION
Aviation safety reports are a rich source of information on flight incidents. Every time an
aviation incident occurs anywhere in the world, a report is written that contains the
complete details of the incident. Over the years, these reports from around the world have
been submitted to the ASRS archive at the National Aeronautics and Space Administration
(NASA). Over the decades the database has grown to contain over 150 thousand incident
narratives.
The task of anomaly detection is to determine the cause of an incident by looking only at
the narrative portion of the report. There are about 62 anomaly classes in all. The
narratives come from varying backdrops and contain many grammatical mistakes,
misspellings, and conflicting abbreviations; they are hard for even an average person to
understand, which makes the task difficult. The goal of this thesis is to develop a
semi-supervised approach to anomaly detection that does not rely on any knowledge of
aviation domain terminology or procedures.
We start with the existing approaches that have been applied to the dataset, as well as
approaches used for similar classification tasks, since it is wise to build on approaches
that are already known to work. We therefore begin by enhancing the word selection
methods and obtaining better accuracies with semi-supervised learning algorithms such as
support vector machines with transduction.
After that, the focus shifts to using the entire dataset, not just the labeled instances, to
learn the features. For this we propose a new approach that uses similarity between
documents to generate features. We use latent semantics and singular value decomposition
to obtain the similarity between documents, giving greater emphasis to words that help
identify the presence of anomalies. The key idea this thesis introduces is the use of a set
of documents to generate the features for a single document. Instead of looking only at
the similarity between two documents and classifying based on that, an instance selection
algorithm is used to pick out points of reference; based on similarity to these varying
reference documents, patterns are extracted using an SVM, which is then used to classify
documents into anomaly classes. To take greater advantage of the instances we have, an
SVM with transduction is used instead of a fully supervised SVM.
Finally, we show that the word selection based methods we developed perform better
than the existing methods. The semantics based method does not perform as well as the
word selection based methods, but performs far better than the approximated latent
semantics based algorithms on this dataset.
CHAPTER 2
PROBLEM DESCRIPTION
Anomaly detection in Aviation Safety Reporting System (ASRS) reports is the task of
identifying the types of anomalies that resulted in an aviation incident based on the
narrative portion of the report. Incidents can occur for more than one reason, and hence
more than one anomaly can be present in a single narrative.
Aviation officials write an ASRS report whenever an incident occurs. These
reports are written all around the world and as a result contain varying styles of writing.
The content of a report is highly specific to the aviation domain and contains many
misspelled words, acronyms, and grammatical mistakes. To get a clearer picture of what a
report looks like, an example is given below.
ACR Y CLIMBING TO FL210 WAS STOPPED AT 160 FOR TFC AT 170; WHICH PASSED.
ACR Y WAS THEN KEPT AT 160. ACR X WAS DESCENDED TO 170 UNTIL CLEAR OF TFC;
THEN DESCENDED TO 150. ACR X WAS NOT OUT OF 170 WHEN CLRNC ISSUED TO
STOP AT 170. NO ANSWER. CLRNC GIVEN AGAIN. NO ANSWER. TURNED ACR Y FROM
320 DEG TO 250 DEG HDG. CLRNC TO ACR X AGAIN WITH TURN TO 060 DEG; NO
ANSWER. FINALLY ACR X ANSWERED AND TURNED TO 060 DEG WITH CLIMB BACK TOS
170. ACFT PASSED 4.3 MI AND 200' APART. ACR X DID NOT RESPOND TO CLRNC.
POSSIBLE LACK OF COCKPIT DISCIPLINE.
Figure 2.1. An example narrative
As Figure 2.1 shows, most of the text is not very readable to a non-aviation
expert. A further complication is that most acronyms are three characters long. For
example, ACR can stand for Aircraft as well as Air Control Radar, and it may also refer
to an airport in Colombia. In addition, ACFT is the standard way of referring to an aircraft.
There are over 62 classes of anomalies to which a narrative can belong. For the sake of
simplicity, and to reduce the amount of processing the data must go through, we consider
only 8 anomalies. So as not to lose the variation that each anomaly brings in terms of
frequency of occurrence and the situations in which it occurs, the anomalies chosen for
classification are listed in the table below.
Table 2.1. Anomaly Details

Class  Anomaly                                      Description
1      Aircraft Equipment Problem – Critical        Some equipment on the aircraft has malfunctioned.
2      Airspace Violation – Entry                   Aircraft was flown into an unauthorized location.
3      Incursion – Runway                           Incorrect presence of a vehicle, person, or aircraft on the runway.
4      Inflight Encounter – Turbulence              Heavy turbulence was encountered during the flight.
5      Inflight Encounter – Weather                 Severe weather was encountered while flying.
6      Inflight Encounter – VFR in IMC              Visual Flight Rules (VFR) flight into Instrument Meteorological Conditions (IMC), i.e., unqualified flight into adverse weather.
7      Maintenance Problem – Improper Maintenance   Some part of the aircraft was not properly maintained.
8      Non-Adherence – Clearance                    The Air Traffic Controller (ATC) did not clear the pilot to perform the operation.
CHAPTER 3
RELATED WORK
3.1 Supervised Approaches
Supervised approaches learn from labeled documents and then label the unlabeled
documents. The following sections cover some of the supervised approaches that have
been used on the ASRS dataset.
3.1.1 Dynamic Template Rule System
The basic approach tried in [1] was based on regular expressions. This approach is
heavily influenced by the domain and works much like an aviation expert identifying
anomalies from a narrative. After much brainstorming, a set of shaper words useful for
identifying anomaly expressions was created. These shaper words are used to identify
expressions in a narrative, and regular expression templates are generated for particular
types of anomalies using these expression rules. A narrative is run through these rule
trees and assigned anomalies depending on which regular expressions match.
3.1.2 Naïve Bayes Classifier
Naïve Bayes is a probabilistic classifier; an anomaly c* is assigned to a narrative document d
where

Anomaly class c* = argmax_c P(c | d)

We derive P(c | d) using Bayes' rule as follows:

P(c | d) = P(c) P(d | c) / P(d)

P(d) plays no role in determining argmax_c P(c | d) and so is ignored. Naïve Bayes
decomposes the term P(d | c) into a set of features f_i, assuming they are conditionally
independent given the class:

P(d | c) = Π_i P(f_i | c)

Despite such a strong assumption of conditional independence between words, which
clearly does not hold for real-world data, Naïve Bayes is known to be good for text
classification [2] and is sometimes optimal even for highly dependent features [3].
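As a concrete illustration, the decision rule above can be sketched in a few lines of Python. The toy narratives and class labels below are invented for illustration, and add-one smoothing is one common choice that the original classifier may or may not have used.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log P(c) and log P(w | c) with add-one smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)          # class -> word -> frequency
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_like

def classify_nb(doc, log_prior, log_like):
    """Return argmax_c [ log P(c) + sum_i log P(f_i | c) ]; unseen words are skipped."""
    def score(c):
        return log_prior[c] + sum(log_like[c].get(w, 0.0) for w in doc.split())
    return max(log_prior, key=score)
```

A word like TAXI that appears only in narratives of one class pulls the score sharply toward that class, which is exactly what makes the bag-of-words assumption workable here.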
3.1.3 Support Vector Machines
Support Vector Machines (SVMs) are known to be very good at classifying text [4]. The
classification task is split into multiple binary classification tasks. For a binary SVM,
the basic idea is to represent the input features in a high-dimensional feature space and
then identify a hyperplane that separates the classes with the largest possible margin. If
we take f(x) = <w, x> + b to be the hyperplane, then

f(x) = Σ_j α_j y_j <x_j, x> + b

where the x_j are the support vectors and the α_j are obtained by solving the dual
optimization problem [5].
In [6] an SVM classifier was used on the ASRS dataset with words as features. To
control the number of features given to the learner, Information Gain (IG) was used as
the metric for choosing words.
IG = H(C) – H(C | W)

where H(C) is the entropy of the anomaly class and H(C | W) is the entropy of the
anomaly class given the word.
Overall, it was observed that the top 2500 words are more than sufficient to obtain good
results. It was also noticed that anomalies that occur less frequently result in bad
F-scores and perform better with fewer words.
3.2 Semi-Supervised Approaches
Semi-supervised approaches use a combination of labeled and unlabeled data to label the
unlabeled documents. There are very few semi-supervised algorithms that have been applied
to ASRS dataset. The approach mentioned below is good for text classification in general but
never used on the ASRS dataset.
3.2.1 Naïve Bayes EM
Expectation Maximization (EM) is a meta-learning algorithm, which combines both the
labeled and unlabeled data. The algorithm works in two stages; in the first stage the
unlabeled data is probabilistically labeled using the parameter values. Then in the second
stage the value of the parameters are recalculated using maximum likelihood estimation.
In [7] Naives Bayes is combined with EM to classify text.
To classify (E-Step):
Where,
is the probability that a document belongs to that class j.
8
is the prior probability of class j.
is the probability of the kth word from the di
th document occurring in the class
j.
To re-estimate the parameters (M-Step):
The function below is used to distinguish between the label and unlabeled documents.
Where,
# is the weight factor for unlabeled documents
Dl are the labeled documents
Du are the unlabeled documents
For estimating the probabilities of a word to occur in a class.
N(ws , di ) is the frequency of the word s in the document di
The class prior probabilities are computed using
Using the above set of expressions, good classification accuracy was achieved in their
experiments. The best result is obtained when the value of # between 0 and 1. The value of #
depends on the type of data classified.
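The E- and M-steps above can be sketched as follows for the binary case. This is a minimal reading of the scheme in [7], with add-one smoothing and an invented toy split of labeled and unlabeled narratives; it is a sketch, not the authors' implementation.

```python
import math
from collections import Counter

def nb_em(labeled, unlabeled, lam=0.5, n_iter=10):
    """Semi-supervised Naive Bayes EM sketch.
    labeled: list of (tokens, class) with class in {0, 1};
    unlabeled: list of token lists; lam: weight of unlabeled documents."""
    vocab = sorted({w for doc, _ in labeled for w in doc} |
                   {w for doc in unlabeled for w in doc})
    docs = [d for d, _ in labeled] + unlabeled
    weights = [1.0] * len(labeled) + [lam] * len(unlabeled)
    # Responsibilities P(c_j | d_i): hard labels for labeled docs, uniform otherwise.
    resp = [[1.0 - c, float(c)] for _, c in labeled] + \
           [[0.5, 0.5] for _ in unlabeled]
    for _ in range(n_iter):
        # M-step: re-estimate P(c_j) and P(w | c_j) with add-one smoothing.
        prior, cond = [], []
        for j in (0, 1):
            mass = sum(w * r[j] for w, r in zip(weights, resp))
            prior.append((1 + mass) / (2 + sum(weights)))
            counts = Counter()
            for doc, w, r in zip(docs, weights, resp):
                for tok in doc:
                    counts[tok] += w * r[j]
            total = sum(counts.values())
            cond.append({t: (1 + counts[t]) / (len(vocab) + total) for t in vocab})
        # E-step: recompute P(c_j | d_i) for the unlabeled documents only.
        for i in range(len(labeled), len(docs)):
            logp = [math.log(prior[j]) +
                    sum(math.log(cond[j][t]) for t in docs[i])
                    for j in (0, 1)]
            m = max(logp)
            norm = sum(math.exp(l - m) for l in logp)
            resp[i] = [math.exp(l - m) / norm for l in logp]
    return prior, cond, resp
```

After a few iterations, an unlabeled narrative sharing words with one labeled class ends up with most of its probability mass on that class.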
CHAPTER 4
OUR SEMI-SUPERVISED APPROACH
4.1 Introduction
A good look into the narratives of the ASRS dataset reveals many connections between a
few keywords and the anomalies associated with them. There are many words that will
not occur with certain anomalies, and many words that occur only with certain anomalies.
This is not an easily exploitable feature, however, since a document can have multiple
anomalies. These are the two key observations on which our approach relies for
classification.
To start, the task of anomaly detection is split into binary classification tasks, each
checking for the presence of a single anomaly. For each binary task the following
operations are performed:
• Feature Generation: Words are mapped directly to features.
• Word Selection: Relevant features are selected using the Information Gain metric.
• Feature Reduction: Multiple features are combined into a single feature using Singular
Value Decomposition (SVD) [8] and the Gram Schmidt process [9].
• Classification: A transductive SVM is used for learning and classification.
The following sections detail each of these operations. We start by analyzing the data.
4.2 Data Analysis
Let us start by taking another look at the data. As mentioned before, the data has many
aspects, such as abbreviated words, misspellings, grammatical mistakes, and domain
specific words, that make it hard for people outside the aviation domain to understand.
There is therefore no way to readily use the more advanced tools of Natural Language
Processing (NLP), such as part of speech taggers, syntactic parsers, or named entity
recognizers. We are left with two alternatives: either we bite the bullet and take on the
big task of fixing the text so that advanced NLP tools can be used, or we avoid using
such tools and adopt a different approach to the problem, which is what most others
seem to be doing.
WHEN I CALLED GND CTL I STATED; 'WORCESTER GND (OUR ACFT IDENT) TAXI; VFR
TO BRADLEY.' SHE RESPONDED BY SAYING; '(ACFT IDENT); TAXI VIA BRAVO FOR
INTXN DEP 29.' SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH.
I RESPONDED BY SAYING; 'NEGATIVE.' SHE CAME BACK BY SAYING; 'BRAVO INTXN 29
FOR TKOF IS APPROVED.' I RESPONDED BY SAYING; 'CLRD TO GO AT 29?' SHE SAID;
'AFFIRMATIVE;' AFTER I SAID; 'CLRD TO GO 29.' WHEN I MADE MY 45 DEG LEFT TURN
FOR DEP; I DECIDED TO CALL TWR FREQ TO ADVISE (WHICH IS NOT REQUIRED).
WHEN I CALLED TWR FREQ; A MAN RESPONDED BY SAYING; '(ACFT IDENT); YOU WERE
NEVER CLRD FOR TKOF.' THE NEXT DAY MY F/O (NAME) AND I VISITED THE
WORCESTER TWR TO SPEAK TO THE CHIEF CTLR. WE LISTENED TO THE TAPE AND I
WAS CORRECT; THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT
INTXN BRAVO 29. THE CHIEF CTLR SAID THERE WOULD BE NO ACTION TAKEN
AGAINST ME AND MY F/O. THE CHIEF CTLR AND MYSELF AGREED THAT THE GND CTLR
AND MYSELF USED POOR PHRASEOLOGY. FOR INSTANCE; 'CLRD TO GO;' WHICH IS
USED QUITE OFTEN IN AVIATION FOR 'CLRD FOR TKOF.' I WOULD LIKE TO POINT OUT
THAT NO ATTEMPT WAS MADE BY TWR OR GND FREQ TO CONTACT ME NOT TO TKOF
AT BRAVO INTXN RWY 29. SUPPLEMENTAL INFORMATION FROM ACN 80338: CAPT
FLYING ACFT AND OPERATING RADIOS. I ASKED CAPT IF SHE CLRD US FOR TKOF ON
GND FREQ. HE SAID YES; AND THAT SHE WAS WORKING BOTH FREQS.
Figure 4.1. Narrative for Anomaly, Incursion – Runway
On closer examination of the narrative in Figure 4.1, it is easy to notice that many words
occur only with certain anomalies. For example, the mention of the word TAXI in the
above case increases the chance that the anomaly is a runway incursion or excursion.
Since it is easy for us to determine this by looking at the record, machines can do it too,
using simple measures like plain old information gain [6]. Another way of doing
something similar would be to use inverse document frequency, adding weight to words
that are less frequent [10].
Now that we know what kind of words to look for, let us look at the way the words in
the text are arranged. As we examine more and more documents, it becomes clear that in
most cases the order in which the words occur is not relevant. The other thing to notice
is that shaper words like TAXI are not mentioned when the anomaly is, for instance,
Weather (Figure 4.2). The more one looks at the documents, the more a bag of words
approach seems the best way forward, and that is what this thesis ended up doing.
AFTER 4 HRS FLT TIME ARRIVED AT SEA. WX HAD STARTED TO GO DOWN. ARPT USING
ILS 34R APCH WITH A RVR OF 3000'. ATTEMPTED FIRST APCH AND INITIATED MISSED
APCH AT DH WITH UNEVENTFUL GO AROUND. STARTED SECOND APCH; AT DH I SAW
APCH LIGHTS. AT SAME TIME COPLT STARTED TO PUSH THROTTLE FORWARD FOR GO
AROUND. I DID NOT REALIZE THAT THE TOGA BUTTONS HAD BEEN PUSHED. WHEN I
GRABBED THE YOKE I DISCONNECTED THE AUTOPLT. AT 100' RWY CAME IN SIGHT
AND WE WERE LINED UP A LITTLE RIGHT. AT THE SAME TIME I BECAME CONFUSED
WITH THE THROTTLE RESPONSES BECAUSE I DID NOT REALIZE THE TOGA BUTTON
HAD BEEN ACTIVATED. WE STARTED A LEFT DRIFT WHILE I TRIED TO TURN THE
LIGHTS ON FOR BETTER DEFINITION OF THE RWY. AT ABOUT 50' THE DRIFT BECAME
FASTER TO THE LEFT. WHEN EDGE LIGHTS CENTERED UP AND ALSO ILS TWR WAS
SEEN; GO AROUND WAS INITIATED FOR AN UNEVENTFUL GO AROUND. ON GO
AROUND IT WAS LEARNED THAT THE RVR HAD DROPPED TO 1000' ALL XMITTERS.
LESSONS LEARNED: 1) A BETTER APCH BRIEF WITH COPLT KNOWING TO CALL DH OUT
EXTRA LOUD SO IT IS DEFINITELY HEARD. 2) APCH SHOULD HAVE BEEN
DISCONTINUED AS SOON AS THE AUTOTHROTTLES FELT FUNNY AND DEFINITELY
WHEN THE RWY WAS NOT LINED UP. 3) ALSO BRIEF A POSITIVE GO AROUND CALLOUT
AGAIN TO DISTRACT THE OUTSIDE CONCENTRATE. I THINK ANOTHER CONTRIBUTING
FACTOR WAS THAT THE TIME WAS CLOSE TO 2 O'CLOCK BODY TIME AND
CONCENTRATION WAS JUST A LITTLE SLOWER THAN NORMAL.
Figure 4.2. Narrative for Anomaly, Inflight Encounter – Weather
The other key thing to note is that many anomalies occur together; for example,
anomalies 1 (Aircraft Equipment Problem) and 7 (Maintenance Problem) go hand in
hand. This is a feature unique to this dataset that many datasets will not share, so this
thesis does not exploit it. A Bayesian classifier could be used to predict, based on the
narrative and the anomalies already identified, whether another anomaly should also be
associated with the current narrative; this would be worth trying if the aim were only to
classify this particular dataset.
4.3 Feature Generation
This section details how the narratives were extracted from the ASRS dataset, the
preprocessing and cleaning applied to them, and finally how the words in a narrative are
converted into features.
4.3.1 Content Extraction
The ASRS document set used for this thesis is a set of comma separated values (csv)
files covering the years 1988 to 2007. Each file contains thousands of narratives, with
about 54 fields per record. The thesis uses only two of these fields: the narrative and the
event anomaly. More than 60 anomalies are present in these files in total. In this stage
we pick out all narratives containing at least one of the anomalies listed in Table 2.1.
This step returns about 75000 narratives.
4.3.2 Tokenization
This step converts the text seen in Figure 4.1 into the form shown in Figure 4.3.
Tokenizing the text is very important to ensure that words can be identified easily, since
a lot of punctuation is attached to the words of the narrative. The first step in
tokenization is to separate the punctuation from the words. However, there are a few
places where punctuation indicates a measure, for example 3000', which denotes a
distance of 3000 feet. In such places we cannot afford to strip or separate the
punctuation, because doing so would adversely affect the matching process. To handle
such cases, punctuation following or contained within numbers was not removed.
WHEN I CALLED GND CTL I STATED WORCESTER GND OUR ACFT IDENT TAXI VFR TO
BRADLEY SHE RESPONDED BY SAYING ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP 29
SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH I RESPONDED
BY SAYING NEGATIVE SHE CAME BACK BY SAYING BRAVO INTXN 29 FOR TKOF IS
APPROVED I RESPONDED BY SAYING CLRD TO GO AT 29 SHE SAID AFFIRMATIVE …
Figure 4.3. Portion of a tokenized Narrative for Anomaly, Incursion – Runway
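A minimal tokenizer along these lines might look as follows. The thesis does not fully specify its punctuation rules, so the handling of numeric tokens here is an assumption.

```python
import re

def tokenize(narrative):
    """Separate punctuation from word tokens, but leave tokens containing
    digits untouched so that measures such as 3000' keep their meaning.
    (Assumed reading of Section 4.3.2 -- the thesis's exact rules may differ.)"""
    tokens = []
    for raw in narrative.split():
        if any(ch.isdigit() for ch in raw):
            tokens.append(raw)                 # keep punctuation on numbers
        else:
            tokens.extend(re.findall(r"[A-Z/]+", raw.upper()))
    return tokens
```

On the quoted, semicolon-heavy report text, this strips the quoting marks from words while preserving numeric tokens intact.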
4.3.3 Hash Generation
The basic idea behind hash generation is to replace words with numbers, to facilitate
generating the input files for the SVM learning algorithm. All the narratives are read,
and each word is assigned a number that serves as its primary identity from then on. We
are not interested in the order in which words occur, only in the number of times they
occur in a narrative. So, to avoid repeatedly parsing a narrative to obtain word
frequencies, we associate with each word's primary key its normalized frequency:

Normalized Frequency = (total occurrences of the word) / (total number of words in the narrative)

Finally, the result is sorted in ascending order of primary key value. As an example,
Figure 4.4 shows the hashed version of Figure 4.3.
8:0.00574712643678161 10:0.028735632183908 17:0.0862068965517241 23:0.0229885057471264 33:0.0114942528735632 41:0.00574712643678161 58:0.0114942528735632 61:0.0229885057471264 77:0.00574712643678161 81:0.0229885057471264 84:0.0065359477124183 93:0.0065359477124183 95:0.0065359477124183 121:0.0065359477124183 125:0.0130718954248366 126:0.0065359477124183 128:0.0065359477124183 150:0.0065359477124183 155:0.0065359477124183 …
Figure 4.4. Hashed version of tokenized Narrative for Anomaly, Incursion – Runway
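The hashing step can be sketched as below. Word ids starting at 1 and the sorted (id, normalized frequency) pairs mirror the layout shown in Figure 4.4, though the exact id assignment scheme used in the thesis is an assumption.

```python
from collections import Counter

def hash_narratives(narratives):
    """Assign each word a numeric id on first sight, then map every
    narrative (a token list) to sorted (id, normalized frequency) pairs."""
    word_id = {}
    hashed = []
    for tokens in narratives:
        counts = Counter(tokens)
        for w in counts:
            word_id.setdefault(w, len(word_id) + 1)
        total = len(tokens)
        vec = sorted((word_id[w], n / total) for w, n in counts.items())
        hashed.append(vec)
    return word_id, hashed
```

Each pair can then be printed as "id:frequency" to produce lines in the style of Figure 4.4.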
4.4 Word Selection
There are over 20,000 unique words in the ASRS dataset. Since we treat words as
features, classifying well with such a large number of features is a tall order for any
classifier. Even though the SVM learning algorithm is known to handle large feature sets
gracefully, it is still better to reduce the number of features it receives by considering
what kind of features the method provides.
After experimenting with several methods on the dataset, we observed that the best
results were obtained when only a selective subset of the words in the narratives was
used instead of all of them. Typically, the selected words either appear frequently in
narratives with a particular anomaly, or in all narratives except those with that anomaly.
The Information Gain (IG) method for word selection used by the NASA group in [6]
was found to be one of the best ways of doing word selection (see Section 4.4.1). The
key advantage of IG is that it captures both of the cases mentioned above. Another
method, based on subtraction, was also added to the system to increase the accuracy of
the words picked (see Section 4.4.2).
4.4.1 Plain Information Gain
IG is calculated as

IG = H(anomaly class) – H(anomaly class | feature)

In this method, normalized frequencies are not considered when calculating the entropy;
all that matters is the presence or absence of a feature in a document. Features that occur
in only a single document are ignored as too rare.
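Using only the presence or absence of a feature, the computation can be sketched as follows; the toy documents are invented for illustration.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) class split, in bits."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, word):
    """IG = H(C) - H(C | W), conditioning only on presence/absence of `word`."""
    pos = sum(labels)
    h_c = entropy(pos, len(labels) - pos)
    with_w = [c for d, c in zip(docs, labels) if word in d.split()]
    without = [c for d, c in zip(docs, labels) if word not in d.split()]
    h_cond = 0.0
    for subset in (with_w, without):
        if subset:
            p = len(subset) / len(docs)
            h_cond += p * entropy(sum(subset), len(subset) - sum(subset))
    return h_c - h_cond
```

A word that perfectly splits the anomaly from the non-anomaly narratives attains the maximum gain of H(C); a word spread evenly across both attains a gain near zero.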
4.4.2 Simple Subtraction
One of the easiest ways to find out whether a feature is more prevalent in one class than
in the others is subtraction. The problem with subtraction is that the range of values it
returns can be huge. To fix this, we apply the log function, which brings the numbers
returned by subtraction down to a manageable level. So what is used is the log of the
difference between the frequencies of occurrence of a word inside and outside the class.
4.4.3 Combined Metric
A combination of the two word selection methods described above is used to eliminate the
irrelevant features for an anomaly. The IG and subtraction values are calculated for all the
features. Two lists are then formed, one with the features in descending order of IG and the
other in descending order of the subtraction score. 75% of the required features are chosen
from the top of the IG list and the remaining 25% from the top of the subtraction list. All
unselected features are then considered irrelevant and removed from both the training and
test files. A result of word selection is shown in Figure 4.5.
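The 75/25 combination can be sketched as follows (a hypothetical sketch; `ig_score` and `sub_score` stand for the per-word IG and log-subtraction values computed earlier, and the toy values are illustrative):

```python
def select_words(ig_score, sub_score, k):
    """Pick 75% of k words by IG rank, then fill the remaining 25% from the
    subtraction ranking, skipping words that were already chosen."""
    by_ig = sorted(ig_score, key=ig_score.get, reverse=True)
    by_sub = sorted(sub_score, key=sub_score.get, reverse=True)
    chosen = by_ig[: (3 * k) // 4]
    picked = set(chosen)
    for w in by_sub:
        if len(chosen) >= k:
            break
        if w not in picked:
            chosen.append(w)
            picked.add(w)
    return chosen

ig = {"rwy": 0.9, "txwy": 0.7, "clrd": 0.5, "acft": 0.1}
sub = {"clrd": 2.0, "rwy": 1.5, "gnd": 1.2, "acft": 0.3}
print(select_words(ig, sub, 4))  # ['rwy', 'txwy', 'clrd', 'gnd']
```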
16
17:0.0862068965517241 41:0.00574712643678161 58:0.0114942528735632 61:0.0229885057471264 77:0.00574712643678161 84:0.0065359477124183 93:0.0065359477124183 95:0.0065359477124183 125:0.0130718954248366 126:0.0065359477124183 155:0.0065359477124183 …
Figure 4.5. Reduced hashed Narrative for Anomaly, Incursion – Runway
4.5 Feature Reduction
In many classification tasks, the feature values for an instance are generated from the
instance itself. A less common approach is to use multiple instances to generate the features
for a given instance. One such approach uses latent semantic kernels [11]. Our feature
reduction approach is strongly influenced by latent semantics.
This section starts with the basic terminology and then explains how and why using the
method for feature reduction is a good idea.
4.5.1 Term By Document Vector
A document can be represented in many different ways. One of the more mathematical
representations is similar to a histogram: take all the words in alphabetical order and
associate with each word the frequency of its occurrence in the document. For example, the
term by document vector for the sentence given below is shown in Table 4.1.
THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO
Table 4.1. Term by document vector
acknowledged 1
at 1
bravo 1
clrd 1
ctlr 1
gnd 1
go 1
i 1
intxn 1
that 2
the 1
to 1
was 1
In a conventional term by document vector, the entries correspond to word stems instead
of the actual words. In our case the words themselves were used, due to the presence of a
large number of acronyms and misspellings.
The term by document vector in this case has already been generated based on the
results from section 4.4. The only key difference is that, instead of having words sorted in
alphabetical order, there are numbers, which represent the words sorted in ascending order. A
conventional term by document vector also doesn't have normalized frequencies in its
representation. Using them doesn't harm the approach in any way, since the numbers are still
as far apart as before but on a reduced scale.
4.5.2 Inner Product
The inner product (or the dot product) is an operation that takes two vectors and returns a
single number as the output. To get the scalar or dot product of two vectors, all the elements
in the first vector should be multiplied to all the corresponding elements in the other vector
and the sum of the result is the dot product.
The easiest way to calculate the similarity between two documents is to take the dot
product of their term by document vectors. For example, take the following sentences.
S1. THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO
S2. CLRD TO GO AT INTXN BRAVO
S3. ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP
Table 4.2. Term by document vectors for S1, S2 and S3
Sentence 1 Sentence 2 Sentence 3
acft 1
acknowledged 1
at 1 at 1
bravo 1 bravo 1 bravo 1
clrd 1 clrd 1
ctlr 1
dep 1
for 1
gnd 1
go 1 go 1
i 1
ident 1
intxn 1 intxn 1 intxn 1
taxi 1
that 2
the 1
to 1 to 1
via 1
was 1
Taking the dot product between each pair of term by document vectors, we have:
S1 . S2 = 1 + 1 + 1 + 1 + 1 + 1 = 6
S1 . S3 = 1 + 1 = 2
S2 . S3 = 1 + 1 = 2
Looking at the results of the dot product, it is easy to say that S1 and S2 are more similar
than S1 and S3 or S2 and S3. The same ordering holds if raw frequencies are replaced by
normalized frequencies: instead of the results being 6, 2 and 2, they would be fractions with
the same relative order.
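The computation above can be reproduced directly (a minimal sketch; the token sets are taken from S1–S3 with the quote marks stripped):

```python
from collections import Counter

def dot(a, b):
    """Dot product of two sparse term vectors stored as word -> count maps."""
    return sum(cnt * b.get(word, 0) for word, cnt in a.items())

s1 = Counter("THAT THE GND CTLR ACKNOWLEDGED THAT I WAS CLRD TO GO AT INTXN BRAVO".lower().split())
s2 = Counter("CLRD TO GO AT INTXN BRAVO".lower().split())
s3 = Counter("ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP".lower().split())

print(dot(s1, s2), dot(s1, s3), dot(s2, s3))  # 6 2 2
```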
4.5.3 Singular Value Decomposition
Singular value decomposition (SVD) is the process of factorizing an n x m matrix M into the
form
M = U Σ V*
where
U is a unitary matrix over the feature space,
Σ is a diagonal matrix containing the singular values, which can be used to transform other
vectors,
V is a matrix whose columns are orthonormal directions of M, and
V* is the conjugate transpose of V.
The training set is used to generate Σ, which is then used to generate the features for both the
training and test set.
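A quick illustration of the factorization with NumPy (NumPy is our assumption for illustration only; the thesis does not use it):

```python
import numpy as np

# Toy 3x4 term-by-document matrix: rows = documents, columns = words.
M = np.array([[2.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# s holds the singular values in descending order, and U @ diag(s) @ Vt
# reconstructs M exactly.
print(np.round(s, 4))
assert np.allclose(U @ np.diag(s) @ Vt, M)
```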
4.5.4 Gram Schmidt Process
The Gram Schmidt process converts a linearly independent set of vectors v1, v2, v3, ..., vn
into an orthogonal set of vectors q1, q2, q3, ..., qn in the inner product space (commonly the
Euclidean space).
The process iterates over the equations
q1 = v1
qk = vk - proj_q1(vk) - proj_q2(vk) - ... - proj_q(k-1)(vk)
where the projection proj_q(v) is
proj_q(v) = (<v, q> / <q, q>) q
Due to floating point issues, the direct algorithm described above does not give exact results
on a computer. A numerically stable version of the algorithm1 is used instead:
GRAM-SCHMIDT-STABLE
for i = 1 to n do
    vi = ai
end for
for i = 1 to n do
    rii = ||vi||
    qi = vi / rii
    for j = i + 1 to n do
        rij = qi^T vj
        vj = vj - rij qi
    end for
end for
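The stable (modified) Gram-Schmidt loop above translates to the following sketch (a minimal NumPy version for illustration, not the thesis implementation):

```python
import numpy as np

def gram_schmidt_stable(A):
    """Modified Gram-Schmidt: orthonormalize the columns of A.
    Returns Q whose columns are orthonormal and span the same space."""
    V = A.astype(float).copy()
    n = V.shape[1]
    Q = np.zeros_like(V)
    for i in range(n):
        r = np.linalg.norm(V[:, i])       # rii = ||vi||
        Q[:, i] = V[:, i] / r             # qi = vi / rii
        for j in range(i + 1, n):
            # subtract the component of vj along qi
            V[:, j] -= (Q[:, i] @ V[:, j]) * Q[:, i]
    return Q

A = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
Q = gram_schmidt_stable(A)
print(np.round(Q.T @ Q, 10))  # identity matrix: the columns are orthonormal
```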
4.5.5 Applying Gram Schmidt
Since we are not using any external semantic resource such as WordNet, the concept of
latent semantics here relies on co-occurrence: two words occurring together in a document
are related to each other. This fact can be exploited by generating the singular values using
SVD [11].
Estimating the singular values of a term-document matrix by matrix factorization is a
very computationally intensive task. As a result, we use Gram Schmidt orthogonalization
(GS) to estimate the singular values and use them to reduce the dimensionality of the space
effectively. The GS method is known to provide results similar to the latent semantic kernel
(LSK) [12].
All the feature values of every document are normalized. The document-term matrix is
then created and given to the GS algorithm for processing.
1 http://www.phys.ocean.dal.ca/~lukeman/teaching/3170-3111/slides/Lecture8.pdf
To generate the training data:
• The 2-norm is taken for all the document vectors.
• A biasing factor is applied to the norm of every training document containing the
anomaly.
• The K documents (where K is the resulting feature space size) with the highest biased
norm scores are taken as the primary vectors for calculating the singular value features.
• The GS algorithm is run on all the documents using these K documents. The reduced
features obtained are used for training.
To generate the test data, the GS algorithm is run with the K documents selected in the
training stage to generate features for the test documents.
The training and test files are generated using the features computed in the train and test
stages, as described above. Although there exists an unsupervised way of selecting the K
documents [13], the unsupervised approach was seen not to give results nearly as good as
the supervised method [12].
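The selection and projection steps above can be sketched as follows (a hypothetical sketch: the `bias` default, the selection rule, and the projection onto an orthonormalized reference basis are our reading of the steps, not the thesis code; `numpy.linalg.qr` stands in for the Gram-Schmidt orthogonalization):

```python
import numpy as np

def gs_features(train, is_anomaly, test, k, bias=3.0):
    """Pick k reference documents by biased 2-norm, orthonormalize them,
    and project every document onto the resulting basis (rows = documents)."""
    norms = np.linalg.norm(train, axis=1)
    norms[is_anomaly] *= bias                      # favor anomaly documents
    refs = train[np.argsort(norms)[::-1][:k]]      # k highest biased norms
    Q, _ = np.linalg.qr(refs.T)                    # orthonormal basis (QR ~ GS)
    return train @ Q, test @ Q                     # reduced k-dim features

rng = np.random.default_rng(0)
train = rng.random((6, 10))
test = rng.random((2, 10))
is_anomaly = np.array([True, False, True, False, False, False])
tr_feats, te_feats = gs_features(train, is_anomaly, test, k=3)
print(tr_feats.shape, te_feats.shape)  # (6, 3) (2, 3)
```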
4.6 Classification
SVM-light2 is a C implementation of SVM. It is small, fast, and very easy to use, which is
why we chose it. It comes with two main executables: svm_learn and svm_classify. The
learner svm_learn takes the training data and generates a model. The classifier svm_classify
takes the test data along with the model and generates the classifications.
2 http://svmlight.joachims.org/
4.6.1 Learning
Training an SVM with transduction is quite simple. Once all the features have been generated
using the methods from sections 4.2 to 4.5, we only need to put them in a format the
SVM-light learner can understand. Each line of the training file is a class label (-1 or 1)
followed by a space-separated list of feature number:value pairs. In our case, if the document
belongs to the anomaly we assign it -1, otherwise we assign it 1. To use transduction, the
unlabeled instances also need to be passed; these are assigned a class value of 0. See Figure
4.6 for an example of a training file. Once the learner encounters a line with label zero, it
switches to the transduction mode of operation.
-1 101:0.23 132:0.39 159:1.23 160:2.02 161:0.25
0 101:0.213 102:0.39 109:1.93 111:0.102 116:0.95 130:0.234
1 101:0.123 102:0.69 105:0.823 116:0.92 121:0.45
-1 104:1.025 178:7.34 180:1.4
Figure 4.6. Sample Training File content
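Writing such a file can be sketched as follows (a hypothetical helper, not the thesis code; SVM-light expects ascending feature numbers and no spaces inside an `index:value` pair):

```python
def write_svmlight(path, instances):
    """instances: list of (label, {feature_index: value}) with label in {-1, 0, 1};
    label 0 marks an unlabeled instance used for transduction."""
    with open(path, "w") as f:
        for label, feats in instances:
            pairs = " ".join(f"{i}:{v}" for i, v in sorted(feats.items()))
            f.write(f"{label} {pairs}\n")

write_svmlight("train.dat", [
    (-1, {101: 0.23, 132: 0.39}),
    (0, {101: 0.213, 109: 1.93}),   # unlabeled -> triggers transduction mode
    (1, {101: 0.123, 105: 0.823}),
])
print(open("train.dat").read())
```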
Finally, svm_learn is run with the training file and the model file name as parameters, as
shown below.
svm_learn train_file model_output_file
4.6.2 Classifying
The test file is created in the same way as the training file. All the unlabeled narratives are
put in a file. The features required for the classification of a test document are generated
using the same set of reference narratives as in section 4.6.1.
Each test document is processed through the steps described in sections 4.2 to 4.5. After
these stages, the features generated for each document are written out with a zero in place of
the class value when passing them to the SVM for classification. A sample test file is shown
in Figure 4.7.
0 101:0.23 132:0.39 159:1.23 160:2.02 161:0.25
0 101:0.213 102:0.39 109:1.93 111:0.102 116:0.95 150:2.345
0 101:0.123 102:0.69 105:0.823 116:0.92 121:0.45
Figure 4.7. Sample Test File content
To use the classifier, the executable svm_classify is run with the test file, the model file and
the output file name as command-line arguments, as shown below.
svm_classify unlabeled_documents_file model_file classification_output_file
4.6.3 Interpreting Output
SVM-light returns one number per line, each line corresponding to a test instance. If the
number is close to -1, the SVM classifies the instance as negative; if it is close to 1, as
positive. The magnitude of the number indicates the SVM's confidence in its decision. A
threshold of 0 was used to determine the class: anything above zero was considered positive
and everything below or equal to zero negative.
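Turning the raw scores into class decisions is then a one-liner (a sketch; the file name is a placeholder, not an SVM-light convention):

```python
def read_predictions(path, threshold=0.0):
    """Map each SVM-light output score to +1 (positive) or -1 (negative)."""
    with open(path) as f:
        return [1 if float(line) > threshold else -1 for line in f]

# Simulate an SVM-light output file with three scores.
with open("classification_output_file", "w") as f:
    f.write("0.73\n-0.21\n0.0\n")
print(read_predictions("classification_output_file"))  # [1, -1, -1]
```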
CHAPTER 5
EVALUATION
5.1 Dataset
As mentioned before, the evaluation dataset contains the aviation safety reports, with over 62
anomalies and over 130,000 narratives. To reduce the scale of the task, only 8 anomalies
were considered in our experiments. The number of relevant narratives (i.e., narratives
containing at least one of the eight anomalies) was over 75,000. To make sure the program
can run on everyday hardware, only 8,000 narratives were used. Using fewer narratives also
increased the share of low-frequency anomalies in the dataset. Since a single narrative can
have more than one anomaly, the total need not equal the total number of narratives. Table
5.1 lists the anomalies and the number of narratives associated with each.
The evaluation was performed using 2-fold cross-validation. This was done in order to
have a smaller percentage of training instances, which adds greater value when developing a
semi-supervised approach. Each fold contains 4,000 narratives, and the distribution of classes
across the 2 folds is given in Table 5.1.
Table 5.1. Dataset Compositions

Class  Anomaly Description                          Fold 1  Fold 2  Total  Participation (%)
1      Aircraft Equipment Problem – Critical          1028    1033   2061   16.78
2      Airspace Violation – Entry                      513     537   1050    8.55
3      Incursion – Runway                              541     573   1114    9.07
4      Inflight Encounter – Turbulence                 523     587   1010    8.22
5      Inflight Encounter – Weather                   1060    1055   2115   17.22
6      Inflight Encounter – VFR in IMC                 505     521   1026    8.36
7      Maintenance Problem – Improper Maintenance      510     496   1006    8.19
8      Non-Adherence – Clearance                      1455    1443   2898   23.60

5.2 Experiment Setup
Identification is done using binary SVM classifiers: eight binary classifiers, one per anomaly,
each handling its anomaly independently of the others. The experiments and results are
presented starting with the existing methods, followed by our proposed approach.
5.3 Baseline Experiments
We used three different approaches as baseline systems:
• The NASA approach with the SVM.
• The Naïve Bayes based EM.
• The Gram Schmidt Kernel based SVM.
5.3.1 Plain SVM
The most basic method that can be applied to the dataset is classification with a linear SVM.
To do this we take the output generated by the method described in
section 4.2.3 and pass it as input to the learning algorithm. The SVM-light package, which is
based on [7], was used to classify the data. For the classification stage all the options were
left at their default values. Words were selected using the IG metric mentioned in section
4.2.4.
The SVM-light system accepts input as line-separated instances. Each instance is given
as a class label (-1 or 1) followed by a space-separated list of feature number:value pairs. The
feature numbers must be strictly greater than zero. The learner generates a model file after
processing the labeled instances.
The test instances are passed to SVM-light in the same format, but with a placeholder
instead of the class label at the beginning of the line. The output is a number in the range of
-1 to 1, with numbers closer to -1 meaning a classification of -1 and numbers closer to 1
meaning a classification of 1. A threshold of 0 was used to discretize the output. The results
of the experiment are tabulated in Table 5.2 and graphed in Figure 5.1. The rows of Table 5.2
correspond to the number of features selected by IG and used by the SVM; the columns are
the eight anomalies.
Table 5.2. F-Scores of Plain SVM
No. Of Words/Class 1 2 3 4 5 6 7 8
500 83.21 62.13 85.64 50.19 65.38 77.32 71.80 78.42
1500 83.50 63.02 85.27 49.28 65.63 77.54 72.30 79.12
2500 83.65 62.99 85.49 49.21 65.48 77.85 71.87 78.95
5000 83.61 62.79 85.44 49.34 65.33 78.12 72.48 79.07
All 83.92 61.40 85.50 49.06 65.34 77.65 72.57 79.13
Figure 5.1. Results of Plain SVM
As the results show, the approach of simply selecting words using IG works very well on
this dataset, with the overall F-score reaching 73.81. The key reason it fails to perform well
on anomaly 4 is that this anomaly (Inflight Encounter – Turbulence) is almost always
accompanied by anomaly 5 (Inflight Encounter – Weather), so the two share many common
words and are not easily distinguished from each other. In the case of anomaly 2 (Airspace
Violation – Entry), the low performance can be attributed to the scarcity of strongly
indicative words, due to the low number of training instances belonging to this anomaly
class. Overall, anomalies with a higher presence were detected well, except for anomaly 5,
which overlaps heavily with anomaly 4.
5.3.2 Naïve Bayes EM
The Naïve Bayes classifier is known to provide good results on text classification tasks. In
this experiment the Naïve Bayes classifier is used in conjunction with EM as given in [6].
The details of the method can be found in section 3.2.1. Unlike the previous methods, where
a standard learning algorithm package was used, here a custom Naïve Bayes EM algorithm
was coded. This allowed greater flexibility in how the input was provided and easier tuning
of the parameters. The unlabeled data we use is the test set.
The input given to the system is again taken from the output of sections 4.2.3 and
4.2.4. Since the EM algorithm works differently from the SVM, a few changes had to be
made to the way the input was passed and handled. To keep the input format the same as in
the other methods, missing features (words that do not occur in the narrative) are handled in
the code by setting the value of the feature to zero. Furthermore, instead of using the raw
frequency of occurrence of words as given in the method in section 3.2.1, the normalized
frequency is used.
The output generated by this method is not the same as that of the SVM classifiers: it
is the probability of occurrence of the anomaly, which is then converted to either -1 or 1
depending on the probability value.
The importance given to the unlabeled data was varied. The best scores obtained for
each weight are listed in Table 5.3. The average number of EM iterations was 50.
Table 5.3. F-Scores of Naïve Bayes EM

Unlabeled Data Weight/Class 1 2 3 4 5 6 7 8
0.1 83.69 58.71 86.55 59.53 69.90 69.84 72.73 78.57
0.25 83.57 56.46 86.74 57.89 68.14 67.18 72.38 78.47
0.5 83.35 54.16 86.83 55.75 66.31 64.98 71.68 78.06
0.7 83.26 52.43 86.68 54.30 64.93 63.92 71.44 77.81
0.9 83.26 51.59 86.76 52.80 63.91 63.47 71.23 77.27
Figure 5.2. Results of Naïve Bayes EM
The overall F-score obtained over all the anomalies was 74.21. All words are used, so there
are more word frequencies that can influence the outcome; as a result the method performs
better on both anomalies 4 and 5. It performs poorly on anomaly 2 because the narratives do
not contain many high-frequency words that repeat across different documents of the same
anomaly. The performance on the other anomalies is similar to the plain SVM approach. So
the use of unlabeled data improves the performance of the Naïve Bayes classifier, but it
remains only comparable to a supervised SVM.
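The core idea, with the unlabeled documents down-weighted when re-estimating the model, can be sketched compactly (a hypothetical multinomial implementation; the thesis code and the exact update rules in section 3.2.1 may differ):

```python
import numpy as np

def nb_em(X_lab, y, X_unlab, weight=0.1, iters=50, alpha=1.0):
    """Semi-supervised multinomial Naive Bayes with EM (binary classes 0/1).
    Unlabeled counts enter the M-step scaled by `weight`."""
    Y = np.eye(2)[y]                               # hard labels, one-hot
    R = np.full((len(X_unlab), 2), 0.5)            # initial soft labels
    for _ in range(iters):
        # M-step: class priors and word probabilities from weighted counts
        cls = Y.sum(0) + weight * R.sum(0)
        prior = cls / cls.sum()
        counts = Y.T @ X_lab + weight * (R.T @ X_unlab) + alpha
        theta = counts / counts.sum(1, keepdims=True)
        # E-step: posteriors for the unlabeled documents
        log_p = np.log(prior) + X_unlab @ np.log(theta).T
        log_p -= log_p.max(1, keepdims=True)
        R = np.exp(log_p)
        R /= R.sum(1, keepdims=True)
    return R[:, 1]                                 # P(class 1 | doc)

# Toy word-count matrices: class 1 docs favor word 0, class 0 docs favor word 1.
X_lab = np.array([[3, 0], [2, 1], [0, 3], [1, 2]], float)
y = np.array([1, 1, 0, 0])
X_unlab = np.array([[4, 0], [0, 4]], float)
p = nb_em(X_lab, y, X_unlab)
print(p.round(3))  # first unlabeled doc leans to class 1, second to class 0
```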
5.3.3 Gram Schmidt Kernel
In this baseline, we generate the features using the Gram Schmidt Kernel (GSK) algorithm
given in [12]. The entire preprocessed dataset was used to generate the features with the
GSK algorithm. The numbers of features tried were 100, 500 and 1000, and the biases used
were 1, 2 and 3.
Once the features were generated, the labeled feature instances were given as input to
the SVM with all default options for training: -1 was used if the anomaly was present and 1
if it was not.
The test file was then generated using GSK via the exact same procedure as in training
and passed as input to the corresponding models.
The result was determined using a threshold of 0: instances with scores below 0 were
classified as -1 and scores above 0 as 1. The best results obtained are tabulated in Table 5.4
and graphed in Figure 5.3.
The overall F-score obtained was 72.77. The GSK approach reduces the feature set
similarly to the word selection approach, by combining the most relevant features; hence the
two methods perform similarly.
Table 5.4. Results of GSK
Class Precision Recall F-Score
1 86.37 79.62 82.56
2 90.67 46.36 61.29
3 88.35 81.76 84.90
4 79.91 34.30 47.98
5 77.56 55.03 64.38
6 87.43 68.49 76.68
7 81.79 64.88 72.36
8 80.73 76.99 78.81
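Since Table 5.4 reports precision, recall and F-score per class, recall the standard computation (a generic sketch of the balanced F1 measure, not tied to the thesis code):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_score(88.35, 81.76), 2))  # close to the 84.90 reported for class 3
```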
Figure 5.3. Results of GSK
5.4 Our Semi-supervised Approach
The semi-supervised approach has two stages. The first stage is based entirely on word
selection using IG and the subtractive log method described in section 4.4. The second stage
further reduces the dimensionality of the feature set obtained in the first stage by applying
our instance selection based method to generate the features.
5.4.1 Stage 1: Our Word Selection
An SVM with transduction can incorporate both labeled and unlabeled information to
potentially yield higher classification accuracies. As in the case of the plain SVM, the
SVM-light package was used with default options. Running SVM-light in transduction
mode requires a few extra steps.
The input is generated in a similar manner to the previous method. The output
generated in section 4.3.3 is used, with the difference that the labeled and unlabeled
instances are combined for processing by the learning algorithm. The class labels are passed
differently: labeled instances get -1 or 1, and unlabeled instances get 0. In all the
experiments, all the test instances were passed as the unlabeled instances for training the
learner. The learner learns from the data given to it and generates the model, which is used
for classification as usual.
Unlabeled instances are passed in the same way as in the previous method. Again the
classifier returns a confidence score between -1 and 1, and the threshold of 0 is used to
classify an instance as either -1 or 1.
The experiment was performed with word selection applied as given in section 4.3.2.
The results of the experiment for 500, 1500, 2500 and 5000 words are tabulated in Table 5.5
and graphed in Figure 5.4. Comparing this table with those for the baseline systems, we can
see that combining labeled and unlabeled data gives better results than training on labeled
data alone.
One of the key things that stands out in the results is the classification performance
obtained for anomaly 4. In the preceding experiments, word features were valued by the
frequency of occurrence of the words in the documents. In this experiment, however, we use
only binary-valued features that indicate the presence or absence of a word in the document,
instead of the number of times it occurs, as was the case with the plain SVM approach. The
slight increase seen in all the other anomalies is due partly to the presence/absence
modification and partly to the fact that both the labeled and unlabeled data were used by the
transductive SVM classifier.
Table 5.5. Results of Stage 1

Words/Class   1      2      3      4      5      6      7      8
500         84.16  72.64  85.99  69.53  72.48  81.30  76.73  79.70
1500        84.21  73.23  86.36  69.71  72.00  81.94  76.54  79.04
2500        84.61  73.57  86.79  69.68  72.23  82.22  76.53  79.30
5000        84.29  73.88  86.24  69.94  71.96  82.81  76.13  78.91

Figure 5.4. Results of Stage 1
5.4.2 Stage 2: Further Dimensionality Reduction
We continue from the reduced feature set produced by our word selection method. One
interesting observation that can be incorporated into this reduction method is that using word
presence instead of word frequency in a given document does not hurt the overall
classification; in anomalies with low classification rates, such as anomaly 4, word presence
was very beneficial. Since we cannot simply replace the normalized frequencies with 1 or
some fraction without influencing the value decomposition, we
use the method of taking the 1-norm of the presence-based feature set generated in the
previous step.
5000 features were selected in the word selection phase, so that no document is left with a
very small number of features. The train and test feature reduction algorithms are run on this
5000-feature set, and the transductive SVM with default parameters is then used to classify
the data. The results tabulated in Table 5.6 and graphed in Figure 5.5 were obtained using
reduced feature sets of 250, 500 and 800 with a bias value of 3.
The overall F-score obtained was 78.62, compared to the overall F-score of 78.64 from
stage 1. The F-scores do not change much with the application of GS-based feature
reduction, but for classes with average participation the reduction gives better results than
word selection alone. However, it performs badly in cases where there is a large amount of
overlap between classes, as with anomalies 4 and 5.
Table 5.6. Results of Stage 2
No. of Features/Class 1 2 3 4 5 6 7 8
250 82.93 70.51 85.85 66.73 70.57 81.13 76.46 77.96
500 84.41 73.51 86.17 68.31 72.32 81.87 77.26 79.09
800 84.77 74.01 86.77 69.05 73.03 82.41 77.22 79.33
Figure 5.5. Results of Stage 2
5.5 Result Comparison And Error Analysis
Looking at the results of the various approaches, it is clear that the word selection based
approaches perform very well on the ASRS dataset. Performance goes down when the words
are spread over many documents with similar frequencies in different documents. But
overall, in most cases the classification performance is good, considering that all the
approaches are bag-of-words based.
The best approach turned out to be our word selection using the combined metric
instead of plain IG, together with using the entire dataset to train the transductive SVM
classifier. The GS kernel based baseline method for determining features gave better results
than the other existing methods, but was not as good as the word presence method used in
combination with a transductive SVM with a linear kernel.
The reason our GS-based approach does not perform better is perhaps that the reference
points used to derive the new feature space are not chosen by a sufficiently good method.
Switching to simulated annealing or a transductive approach for selecting the reference
documents could boost the results.
Table 5.7. Comparing the F-Scores
Class NASA EM GSK Stage 1 Stage 2
1 83.92 83.69 82.56 84.61 84.77
2 63.02 58.71 61.29 73.88 74.01
3 85.64 86.83 84.90 86.79 86.77
4 50.19 59.53 47.98 69.94 69.05
5 65.63 69.90 64.38 72.48 73.03
6 78.12 69.84 76.68 82.81 82.41
7 72.57 72.73 72.36 76.73 77.22
8 79.13 78.57 78.81 79.70 79.33
Overall 73.82 74.21 72.77 78.64 78.62
Figure 5.6. Comparing all results
CHAPTER 6
CONCLUSION
Few methods use the entire dataset to generate the features for a single document in the
dataset. This thesis proposes a method that combines aspects of the different documents to
form the features, resulting in good scores.
The ASRS dataset is very word-centric. This is clearly shown by the experiments
conducted using different ways of selecting words: we could get very high F-scores just by
using simple word selection methods and passing the result through a learning algorithm
like SVM. Combining unlabeled and labeled data using transduction gave good
improvements over all the existing approaches, and it could perform even better if both the
labeled and unlabeled documents were used together to select the words.
The semi-supervised approach developed using the concepts of word selection and
GSK worked far better on this dataset than GSK alone. The key to our method is the way the
instances are selected and used to generate features for other documents. The main
contribution of this thesis lies in our view that documents should not be used just as input
data but also as points of reference from which patterns can be derived for future
classification.
There is a lot of room for improvement, especially in how the original instances are
selected for use during feature generation. The current method of using the training data,
though effective, could be improved by using both the training and test data. Also, the
broader the spread of the reference points, the better the results should be. The word
selection metric can also be improved to provide better features to the GSK module, such
that more features are present per document without introducing contradicting features.
At first glance, it appears that accurately determining the anomalies associated with a
narrative would be an extremely complicated task, requiring many stages of processing.
However, the results of our experiments and the details of the new approach suggest that
sometimes it is good to think simple. At least for the ASRS dataset, it turns out that the
simpler the approach, the better the results.
REFERENCES
[1] Christian Posse, Brett Matzke, Catherine Anderson, Alan Brothers, Melissa Matske, Thomas Ferryman. Extracting Information from Narratives: An Application to Aviation Safety Reports. In Proceedings of the IEEE Aerospace Conference, 2005.
[2] Pedro Domingos, Michael J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[3] David D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning (ECML), pages 4-15, 1998.
[4] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML), pages 137-142, 1998.
[5] N. Cristianini, J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[6] Ashok N. Srivastava, Ram Akella, Vesselin Diev, Sakthi Preethi Kumaresan, Dawn M. McIntosh, Emmanuel D. Pontikakis, Zuobing Xu, Yi Zhang. Enabling the Discovery of Recurring Anomalies in Aerospace Problem Reports using High-Dimensional Clustering Techniques. In Proceedings of the IEEE Aerospace Conference, 2006.
[7] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2-3):103-134, 2000.
[8] P. Dewilde. In E. F. Deprettere (Ed.), SVD and Signal Processing: Algorithms, Applications and Architectures, pages 3-41. Elsevier Science Publishers, 1988.
[9] Gene H. Golub, Charles F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996.
[10] Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML), pages 78-85, 1997.
[11] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[12] Nello Cristianini, John Shawe-Taylor, Huma Lodhi. Latent Semantic Kernels. Journal of Intelligent Information Systems, 2002.
[13] A. J. Smola, O. L. Mangasarian, B. Schölkopf. Sparse Kernel Feature Analysis. Technical Report 99-04, Data Mining Institute, University of Wisconsin, 1999.
VITA
Suraj Raghuraman was born in Madras (Chennai), India on 25th June 1984, the son of
Raghuraman and Usha Raghuraman. Suraj spent most of his childhood breaking and fixing
things around the house, later shifting to doing the same on the computer, saving his parents
a lot of money. He did his schooling switching schools and places very often, all over India.
He joined MVIT, Bangalore for a Bachelor of Engineering in Information Science and, after
a very patient four-year wait, was awarded the degree in 2005. He joined Subex as a Fraud &
Risk Management engineer during the final years of his Bachelor's. Suraj was fond of
starting companies during the later years of his Bachelor's and during his work at Subex; he
started a few companies and closed them down, all of them profitable. During 2007 he was
getting bored with his job at Subex, and one of his family's ventures finally became big
enough to sustain itself. He decided to leave his country, quit his job, and joined UTD for a
Master's in Computer Science in Fall 2007. Currently he works at the HLTRI at UTD.