ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
by
Suraj Raghuraman
APPROVED BY SUPERVISORY COMMITTEE: ___________________________________________ Dr. Vincent Ng, Chair ___________________________________________ Dr. Yang Liu ___________________________________________ Dr. Latifur Khan
© Copyright 2008
Suraj Raghuraman
All Rights Reserved
To my late Grandfather & Family
To my advisor, Dr. Vincent Ng
To all my friends
ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
by
SURAJ RAGHURAMAN, B.E.
THESIS
Presented to the Faculty of
The University of Texas at Dallas
in Partial Fulfillment
of the Requirements
for the Degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT DALLAS
December, 2008
ACKNOWLEDGEMENTS
I would like to thank my advisor, Dr. Vincent Ng, for putting up with me. I would have
given up long before had it not been for his support and motivation. He somehow knew
what would motivate me and did his best to keep me going.
I would like to thank my family for motivating me to study abroad. I would still be
working in India if it weren't for them. Thank you for supporting me when I needed your
support the most.
I would like to thank my friends for keeping me well fed and happy during the bad
times, and for pushing me hard enough to finish my thesis. If it weren't for you guys, I
would never have thought of finishing my Master's in a year.
September, 2008
ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
Publication No. ___________________
Suraj Raghuraman, M.S.
The University of Texas at Dallas, 2008
Supervising Professor: Dr. Vincent Ng
ABSTRACT
Anomaly detection from Aviation Safety Reporting System (ASRS) reports is the task of
identifying the multiple types of anomalies that led to an aviation incident, based on the
narrative portion of the report, which is written with a great deal of domain-specific
terminology. This thesis proposes an effective semi-supervised approach to the problem
that uses no domain-related information. The approach relies on methods similar to
singular value decomposition and latent semantics for feature generation, and a Support
Vector Machine with transduction is used to classify the narratives. Experimental results
show that the method developed is marginally better than the existing methods for
high-frequency anomalies and substantially better for low-frequency anomalies on the
ASRS dataset.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ............................................................................................... v
ABSTRACT .......................................................................................................................vi
LIST OF TABLES ............................................................................................................. ix
LIST OF FIGURES............................................................................................................. x
CHAPTER 1 INTRODUCTION...................................................................................... 1
CHAPTER 2 PROBLEM DESCRIPTION ...................................................................... 3
CHAPTER 3 RELATED WORK..................................................................................... 5
3.1 Supervised Approach ........................................................................................ 5
3.1.1 Dynamic Template Rule.................................................................... 5
3.1.2 Naïve Bayes Classifier ...................................................................... 5
3.1.3 Support Vector Machines ................................................................... 6
3.2 Semi-supervised Approach................................................................................ 7
3.2.1 Naïve Bayes EM................................................................................ 7
CHAPTER 4 OUR SEMI-SUPERVISED APPROACH ................................................. 9
4.1 Introduction ...................................................................................................... 9
4.2 Data Analysis ................................................................................................. 10
4.3 Feature Generation ......................................................................................... 12
4.3.1 Content Extraction............................................................................ 12
4.3.2 Tokenization..................................................................................... 12
4.3.3 Hash Generation ............................................................................... 13
4.4 Word Selection ................................................................................................ 14
4.4.1 Plain Information Gain..................................................................... 15
4.4.2 Simple Subtraction ........................................................................... 15
4.4.3 Combined Metric.............................................................................. 15
4.5 Feature Reduction........................................................................................... 16
4.5.1 Term By Document Vector .............................................................. 16
4.5.2 Inner Product .................................................................................... 17
4.5.3 Singular Value Decomposition ........................................................ 19
4.5.4 Gram Schmidt Process ..................................................................... 19
4.5.5 Applying Gram Schmidt .................................................................. 20
4.8 Classification ................................................................................................... 21
4.8.1 Learning............................................................................................ 22
4.8.2 Classifying........................................................................................ 22
4.8.3 Interpreting Output ........................................................................... 23
CHAPTER 5 EVALUATION ........................................................................................ 24
5.1 Dataset ............................................................................................................ 24
5.2 Experiment Setup ........................................................................................... 25
5.3 Baseline Experiments ..................................................................................... 25
5.3.1 Plain SVM ........................................................................................ 25
5.3.2 Naïve Bayes EM............................................................................... 27
5.3.3 Gram Schmidt Kernel....................................................................... 29
5.4 Our Semi Supervised Approach ..................................................................... 31
5.4.1 Stage 1: Word Selection ................................................................... 31
5.4.2 Stage 2: Further Dimensionality Reduction ..................................... 33
5.5 Result And Error Analysis ............................................................................. 35
CHAPTER 6 CONCLUSION ........................................................................................ 37
REFERENCES................................................................................................................. 39
Vita
LIST OF TABLES
Number Page
2.1 ANOMALY DETAILS................................................................................................ 4
4.1 TERM BY DOCUMENT VECTOR.......................................................................... 17
4.2 TERM BY DOCUMENT VECTORS FOR S1, S2 AND S3 .................................... 18
5.1 DATASET COMPOSITIONS ................................................................................... 25
5.2 RESULTS OF PLAIN SVM ...................................................................................... 26
5.3 RESULTS OF NAÏVE BAYES EM .......................................................................... 28
5.4 RESULTS OF GSK.................................................................................................... 30
5.5 RESULTS OF STAGE 1............................................................................................ 33
5.6 RESULTS OF STAGE 2............................................................................................ 34
5.7 COMPARING THE F-SCORES................................................................................ 36
LIST OF FIGURES
Number Page
2.1 AN EXAMPLE NARRATIVE .................................................................................... 3
4.1 NARRATIVE FOR ANOMALY, INCURSION – RUNWAY................................. 10
4.2 NARRATIVE FOR ANOMALY, INFLIGHT ENCOUNTER – WEATHER ......... 11
4.3 TOKENIZED NARRATIVE FOR ANOMALY, INCURSION – RUNWAY ......... 13
4.4 HASHED VERSION OF TOKENIZED NARRATIVE............................................ 14
4.5 REDUCED HASHED NARRATIVE........................................................................ 16
4.6 SAMPLE TRAIN FILE CONTENT.......................................................................... 22
4.7 SAMPLE TEST FILE CONTENT............................................................................. 23
5.1 RESULTS OF PLAIN SVM ...................................................................................... 27
5.2 RESULTS OF NAÏVE BAYES EM .......................................................................... 29
5.3 RESULTS OF GSK.................................................................................................... 31
5.4 RESULTS OF STAGE 1............................................................................................ 33
5.5 RESULTS OF STAGE 2............................................................................................ 35
5.6 COMPARING ALL RESULTS................................................................................. 36
CHAPTER 1
INTRODUCTION
Aviation safety reports are a rich source of information on flight incidents. Every time an
aviation incident occurs anywhere in the world, a report is written that contains the
complete details of the incident. Over the years, these reports from around the world have
been submitted to the ASRS archive at the National Aeronautics and Space Administration
(NASA). Over the decades the database has grown to contain over 150 thousand incident
narratives.
The task of anomaly detection is to determine the cause of an incident by looking only at
the narrative portion of the report. There are about 62 anomaly classes in all. The
narratives come from varying backdrops and contain many grammatical mistakes,
misspellings, and conflicting abbreviations; they are hard for even an average person to
understand, which makes the task difficult. The goal of this thesis is to develop a
semi-supervised approach to anomaly detection that does not rely on any knowledge of
aviation domain terminology or procedures.
We start with the existing approaches that have been applied to the dataset, as well as
approaches used for similar classification tasks, since it is wise to build on approaches
that are already known to work. We therefore begin by enhancing the word selection
methods and obtaining better accuracies with semi-supervised learning algorithms such as
support vector machines with transduction.
After that, the focus shifts to using the entire dataset, not just the labeled instances, to
learn the features. For this we propose a new approach that uses similarity between
documents to generate features. We use latent semantics and singular value decomposition
to obtain the similarity between documents, giving greater emphasis to words that help
identify the presence of anomalies. The key idea this thesis introduces is the use of a set
of documents to generate the features for a single document. Instead of looking only at
the similarity between two documents and classifying based on that, an instance selection
algorithm is used to pick out points of reference; based on similarity to these varying
reference documents, patterns are extracted using an SVM, which is then used to classify
documents into anomaly classes. To take greater advantage of the instances we have, an
SVM with transduction is used instead of a fully supervised SVM.
Finally, we show that the word selection based methods we developed perform better
than the existing methods. The semantics based method does not perform as well as the
word selection based methods, but performs far better than the approximated latent
semantics based algorithms on this dataset.
CHAPTER 2
PROBLEM DESCRIPTION
Anomaly detection in Aviation Safety Reporting System (ASRS) reports is the task of
identifying the types of anomalies that resulted in an aviation incident based on the
narrative portion of the report. Incidents can occur for more than one reason, and hence
more than one anomaly can be present in a single narrative.
Aviation officials write an ASRS report whenever an incident occurs. These
reports are written all around the world and as a result contain varying styles of writing.
The content of a report is highly specific to the aviation domain and contains many
misspelled words, acronyms, and grammatical mistakes. To get a clearer picture of what a
report looks like, an example is given below.
ACR Y CLIMBING TO FL210 WAS STOPPED AT 160 FOR TFC AT 170; WHICH PASSED.
ACR Y WAS THEN KEPT AT 160. ACR X WAS DESCENDED TO 170 UNTIL CLEAR OF TFC;
THEN DESCENDED TO 150. ACR X WAS NOT OUT OF 170 WHEN CLRNC ISSUED TO
STOP AT 170. NO ANSWER. CLRNC GIVEN AGAIN. NO ANSWER. TURNED ACR Y FROM
320 DEG TO 250 DEG HDG. CLRNC TO ACR X AGAIN WITH TURN TO 060 DEG; NO
ANSWER. FINALLY ACR X ANSWERED AND TURNED TO 060 DEG WITH CLIMB BACK TOS
170. ACFT PASSED 4.3 MI AND 200' APART. ACR X DID NOT RESPOND TO CLRNC.
POSSIBLE LACK OF COCKPIT DISCIPLINE.
Figure 2.1. An example narrative
As Figure 2.1 shows, most of the text is not very readable to a non-aviation
expert. A further complication is that most acronyms are three characters long. For
example, ACR can stand for Aircraft as well as Air Control Radar, and it may also refer
to an airport in Colombia. In addition, ACFT is the standard way of referring to an aircraft.
There are over 62 classes of anomalies to which a narrative can belong. For the sake of
simplicity, and to reduce the amount of processing the data must go through, we consider
only 8 anomalies. So as not to lose the variation that each anomaly brings in terms of
frequency of occurrence and the situations in which it occurs, the anomalies chosen for
classification are listed in the table below.
Table 2.1. Anomaly Details

Class  Anomaly                                      Description
1      Aircraft Equipment Problem – Critical        Some equipment on the aircraft has malfunctioned.
2      Airspace Violation – Entry                   Aircraft was flown into an unauthorized location.
3      Incursion – Runway                           Incorrect presence of a vehicle, person, or aircraft on the runway.
4      Inflight Encounter – Turbulence              Heavy turbulence was encountered during the flight.
5      Inflight Encounter – Weather                 Severe weather was encountered while flying.
6      Inflight Encounter – VFR in IMC              Visual Flight Rules (VFR) flight into Instrument Meteorological Conditions (IMC), i.e., unqualified flight into adverse weather.
7      Maintenance Problem – Improper Maintenance   Some part of the aircraft was not properly maintained.
8      Non-Adherence – Clearance                    The Air Traffic Controller (ATC) did not clear the pilot to perform the operation.
CHAPTER 3
RELATED WORK
3.1 Supervised Approaches
Supervised approaches learn from labeled documents and then label the unlabeled
documents. The following sections cover some of the supervised approaches that have
been used on the ASRS dataset.
3.1.1 Dynamic Template Rule System
The basic approach tried in [1] was based on regular expressions. This approach is
heavily influenced by the domain and works much like an aviation expert identifying
anomalies from a narrative. After much brainstorming, a set of shaper words useful for
identifying anomaly expressions was created. These shaper words are used to identify
expressions in a narrative, and regular expression templates are generated for particular
types of anomalies using these expression rules. A narrative is run through these rule
trees and assigned anomalies depending on which regular expressions match.
3.1.2 Naïve Bayes Classifier
Naïve Bayes is a probabilistic classifier; an anomaly c* is assigned to a narrative document d
where

Anomaly class c* = argmax_c P(c | d)

We derive P(c | d) using Bayes' rule as follows:

P(c | d) = P(c) P(d | c) / P(d)

P(d) plays no role in determining argmax_c P(c | d) and so is ignored. Naïve Bayes
decomposes the term P(d | c) into a set of features f_i, assuming they are conditionally
independent given the class:

P(d | c) = Π_i P(f_i | c)

Despite such a strong assumption of conditional independence between words, which
clearly does not hold for real-world data, Naïve Bayes is known to be good for text
classification [2] and is sometimes optimal even for highly dependent features [3].
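As a concrete illustration, the decision rule above can be sketched in a few lines of Python. The toy narratives and class labels below are invented for illustration, and add-one smoothing is one common choice that the original classifier may or may not have used.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log P(c) and log P(w | c) with add-one smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)          # class -> word -> frequency
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_like

def classify_nb(doc, log_prior, log_like):
    """Return argmax_c [ log P(c) + sum_i log P(f_i | c) ]; unseen words are skipped."""
    def score(c):
        return log_prior[c] + sum(log_like[c].get(w, 0.0) for w in doc.split())
    return max(log_prior, key=score)
```

A word like TAXI that appears only in narratives of one class pulls the score sharply toward that class, which is exactly what makes the bag-of-words assumption workable here.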
3.1.3 Support Vector Machines
Support Vector Machines (SVMs) are known to be very good at classifying text [4]. The
classification task is split into multiple binary classification tasks. For a binary SVM,
the basic idea is to represent the input features in a high-dimensional feature space and
then identify a hyperplane that separates the classes with the largest possible margin. If
we take f(x) = <w, x> + b to be the hyperplane, then

f(x) = Σ_j α_j y_j <x_j, x> + b

where the x_j are the support vectors and the α_j are obtained by solving the dual
optimization problem [5].
In [6] an SVM classifier was used on the ASRS dataset with words as features. To
control the number of features given to the learner, Information Gain (IG) was used as
the metric for choosing words.
IG = H(C) – H(C | W)

where H(C) is the entropy of the anomaly class and H(C | W) is the entropy of the
anomaly class given the word.
Overall, it was observed that the top 2500 words are more than sufficient to obtain good
results. It was also noticed that anomalies that occur less frequently result in bad
F-scores and perform better with fewer words.
3.2 Semi-Supervised Approaches
Semi-supervised approaches use a combination of labeled and unlabeled data to label the
unlabeled documents. There are very few semi-supervised algorithms that have been applied
to ASRS dataset. The approach mentioned below is good for text classification in general but
never used on the ASRS dataset.
3.2.1 Naïve Bayes EM
Expectation Maximization (EM) is a meta-learning algorithm, which combines both the
labeled and unlabeled data. The algorithm works in two stages; in the first stage the
unlabeled data is probabilistically labeled using the parameter values. Then in the second
stage the value of the parameters are recalculated using maximum likelihood estimation.
In [7] Naives Bayes is combined with EM to classify text.
To classify (E-Step):
Where,
is the probability that a document belongs to that class j.
8
is the prior probability of class j.
is the probability of the kth word from the di
th document occurring in the class
j.
To re-estimate the parameters (M-Step):
The function below is used to distinguish between the label and unlabeled documents.
Where,
# is the weight factor for unlabeled documents
Dl are the labeled documents
Du are the unlabeled documents
For estimating the probabilities of a word to occur in a class.
N(ws , di ) is the frequency of the word s in the document di
The class prior probabilities are computed using
Using the above set of expressions, good classification accuracy was achieved in their
experiments. The best result is obtained when the value of # between 0 and 1. The value of #
depends on the type of data classified.
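The E- and M-steps above can be sketched as follows for the binary case. This is a minimal reading of the scheme in [7], with add-one smoothing and an invented toy split of labeled and unlabeled narratives; it is a sketch, not the authors' implementation.

```python
import math
from collections import Counter

def nb_em(labeled, unlabeled, lam=0.5, n_iter=10):
    """Semi-supervised Naive Bayes EM sketch.
    labeled: list of (tokens, class) with class in {0, 1};
    unlabeled: list of token lists; lam: weight of unlabeled documents."""
    vocab = sorted({w for doc, _ in labeled for w in doc} |
                   {w for doc in unlabeled for w in doc})
    docs = [d for d, _ in labeled] + unlabeled
    weights = [1.0] * len(labeled) + [lam] * len(unlabeled)
    # Responsibilities P(c_j | d_i): hard labels for labeled docs, uniform otherwise.
    resp = [[1.0 - c, float(c)] for _, c in labeled] + \
           [[0.5, 0.5] for _ in unlabeled]
    for _ in range(n_iter):
        # M-step: re-estimate P(c_j) and P(w | c_j) with add-one smoothing.
        prior, cond = [], []
        for j in (0, 1):
            mass = sum(w * r[j] for w, r in zip(weights, resp))
            prior.append((1 + mass) / (2 + sum(weights)))
            counts = Counter()
            for doc, w, r in zip(docs, weights, resp):
                for tok in doc:
                    counts[tok] += w * r[j]
            total = sum(counts.values())
            cond.append({t: (1 + counts[t]) / (len(vocab) + total) for t in vocab})
        # E-step: recompute P(c_j | d_i) for the unlabeled documents only.
        for i in range(len(labeled), len(docs)):
            logp = [math.log(prior[j]) +
                    sum(math.log(cond[j][t]) for t in docs[i])
                    for j in (0, 1)]
            m = max(logp)
            norm = sum(math.exp(l - m) for l in logp)
            resp[i] = [math.exp(l - m) / norm for l in logp]
    return prior, cond, resp
```

After a few iterations, an unlabeled narrative sharing words with one labeled class ends up with most of its probability mass on that class.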
CHAPTER 4
OUR SEMI-SUPERVISED APPROACH
4.1 Introduction
A good look into the narratives of the ASRS dataset reveals many connections between a
few keywords and the anomalies associated with them. There are many words that will
not occur with certain anomalies, and many words that occur only with certain anomalies.
This is not an easily exploitable feature, however, since a document can have multiple
anomalies. These are the two key observations on which our approach relies for
classification.
To start, the task of anomaly detection is split into binary classification tasks, each
checking for the presence of a single anomaly. For each binary task the following
operations are performed:
• Feature Generation: Words are mapped directly to features.
• Word Selection: Relevant features are selected using the Information Gain metric.
• Feature Reduction: Multiple features are combined into a single feature using Singular
Value Decomposition (SVD) [8] and the Gram Schmidt process [9].
• Classification: A transductive SVM is used for learning and classification.
The following sections detail each of these operations. We start by analyzing the data.
4.2 Data Analysis
Let us start by taking another look at the data. As mentioned before, the data has many
aspects, such as abbreviated words, misspellings, grammatical mistakes, and domain
specific words, that make it hard for people outside the aviation domain to understand.
There is therefore no way to readily use the more advanced tools of Natural Language
Processing (NLP), such as part of speech taggers, syntactic parsers, or named entity
recognizers. We are left with two alternatives: either we bite the bullet and take on the
big task of fixing the text so that advanced NLP tools can be used, or we avoid using
such tools and adopt a different approach to the problem, which is what most others
seem to be doing.
WHEN I CALLED GND CTL I STATED; 'WORCESTER GND (OUR ACFT IDENT) TAXI; VFR
TO BRADLEY.' SHE RESPONDED BY SAYING; '(ACFT IDENT); TAXI VIA BRAVO FOR
INTXN DEP 29.' SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH.
I RESPONDED BY SAYING; 'NEGATIVE.' SHE CAME BACK BY SAYING; 'BRAVO INTXN 29
FOR TKOF IS APPROVED.' I RESPONDED BY SAYING; 'CLRD TO GO AT 29?' SHE SAID;
'AFFIRMATIVE;' AFTER I SAID; 'CLRD TO GO 29.' WHEN I MADE MY 45 DEG LEFT TURN
FOR DEP; I DECIDED TO CALL TWR FREQ TO ADVISE (WHICH IS NOT REQUIRED).
WHEN I CALLED TWR FREQ; A MAN RESPONDED BY SAYING; '(ACFT IDENT); YOU WERE
NEVER CLRD FOR TKOF.' THE NEXT DAY MY F/O (NAME) AND I VISITED THE
WORCESTER TWR TO SPEAK TO THE CHIEF CTLR. WE LISTENED TO THE TAPE AND I
WAS CORRECT; THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT
INTXN BRAVO 29. THE CHIEF CTLR SAID THERE WOULD BE NO ACTION TAKEN
AGAINST ME AND MY F/O. THE CHIEF CTLR AND MYSELF AGREED THAT THE GND CTLR
AND MYSELF USED POOR PHRASEOLOGY. FOR INSTANCE; 'CLRD TO GO;' WHICH IS
USED QUITE OFTEN IN AVIATION FOR 'CLRD FOR TKOF.' I WOULD LIKE TO POINT OUT
THAT NO ATTEMPT WAS MADE BY TWR OR GND FREQ TO CONTACT ME NOT TO TKOF
AT BRAVO INTXN RWY 29. SUPPLEMENTAL INFORMATION FROM ACN 80338: CAPT
FLYING ACFT AND OPERATING RADIOS. I ASKED CAPT IF SHE CLRD US FOR TKOF ON
GND FREQ. HE SAID YES; AND THAT SHE WAS WORKING BOTH FREQS.
Figure 4.1. Narrative for Anomaly, Incursion – Runway
On closer examination of the narrative in Figure 4.1, it is easy to notice that many words
occur only with certain anomalies. For example, the mention of the word TAXI in the
above case increases the chance that the anomaly is a runway incursion or excursion.
Since it is easy for us to determine this by looking at the record, machines can do it too,
using simple measures like plain old information gain [6]. Another way of doing
something similar would be to use inverse document frequency, adding weight to words
that are less frequent [10].
Now that we know what kind of words to look for, let us look at the way the words in
the text are arranged. As we examine more and more documents, it becomes clear that in
most cases the order in which the words occur is not relevant. The other thing to notice
is that shaper words like TAXI are not mentioned when the anomaly is, for instance,
Weather (Figure 4.2). The more one looks at the documents, the more a bag of words
approach seems the best way forward, and that is what this thesis ended up doing.
AFTER 4 HRS FLT TIME ARRIVED AT SEA. WX HAD STARTED TO GO DOWN. ARPT USING
ILS 34R APCH WITH A RVR OF 3000'. ATTEMPTED FIRST APCH AND INITIATED MISSED
APCH AT DH WITH UNEVENTFUL GO AROUND. STARTED SECOND APCH; AT DH I SAW
APCH LIGHTS. AT SAME TIME COPLT STARTED TO PUSH THROTTLE FORWARD FOR GO
AROUND. I DID NOT REALIZE THAT THE TOGA BUTTONS HAD BEEN PUSHED. WHEN I
GRABBED THE YOKE I DISCONNECTED THE AUTOPLT. AT 100' RWY CAME IN SIGHT
AND WE WERE LINED UP A LITTLE RIGHT. AT THE SAME TIME I BECAME CONFUSED
WITH THE THROTTLE RESPONSES BECAUSE I DID NOT REALIZE THE TOGA BUTTON
HAD BEEN ACTIVATED. WE STARTED A LEFT DRIFT WHILE I TRIED TO TURN THE
LIGHTS ON FOR BETTER DEFINITION OF THE RWY. AT ABOUT 50' THE DRIFT BECAME
FASTER TO THE LEFT. WHEN EDGE LIGHTS CENTERED UP AND ALSO ILS TWR WAS
SEEN; GO AROUND WAS INITIATED FOR AN UNEVENTFUL GO AROUND. ON GO
AROUND IT WAS LEARNED THAT THE RVR HAD DROPPED TO 1000' ALL XMITTERS.
LESSONS LEARNED: 1) A BETTER APCH BRIEF WITH COPLT KNOWING TO CALL DH OUT
EXTRA LOUD SO IT IS DEFINITELY HEARD. 2) APCH SHOULD HAVE BEEN
DISCONTINUED AS SOON AS THE AUTOTHROTTLES FELT FUNNY AND DEFINITELY
WHEN THE RWY WAS NOT LINED UP. 3) ALSO BRIEF A POSITIVE GO AROUND CALLOUT
AGAIN TO DISTRACT THE OUTSIDE CONCENTRATE. I THINK ANOTHER CONTRIBUTING
FACTOR WAS THAT THE TIME WAS CLOSE TO 2 O'CLOCK BODY TIME AND
CONCENTRATION WAS JUST A LITTLE SLOWER THAN NORMAL.
Figure 4.2. Narrative for Anomaly, Inflight Encounter – Weather
The other key thing to note is that many anomalies occur together; for example,
anomalies 1 (Aircraft Equipment Problem) and 7 (Maintenance Problem) go hand in
hand. This is a feature unique to this dataset that many datasets will not share, so this
thesis does not exploit it. A Bayesian classifier could be used to predict, based on the
narrative and the anomalies already identified, whether another anomaly should also be
associated with the current narrative; this would be worth trying if the aim were only to
classify this particular dataset.
4.3 Feature Generation
This section details how the narratives were extracted from the ASRS dataset, the
preprocessing and cleaning applied to them, and finally how the words in a narrative are
converted into features.
4.3.1 Content Extraction
The ASRS document set used for this thesis is a set of comma separated values (csv)
files covering the years 1988 to 2007. Each file contains thousands of narratives, with
about 54 fields per record. The thesis uses only two of these fields: the narrative and the
event anomaly. More than 60 anomalies are present in these files in total. In this stage
we pick out all narratives containing at least one of the anomalies listed in Table 2.1.
This step returns about 75000 narratives.
4.3.2 Tokenization
This step converts the text seen in Figure 4.1 into the form shown in Figure 4.3.
Tokenizing the text is very important to ensure that words can be identified easily, since
a lot of punctuation is attached to the words of the narrative. The first step in
tokenization is to separate the punctuation from the words. However, there are a few
places where punctuation indicates a measure, for example 3000', which denotes a
distance of 3000 feet. In such places we cannot afford to strip or separate the
punctuation, because doing so would adversely affect the matching process. To handle
such cases, punctuation following or contained within numbers was not removed.
WHEN I CALLED GND CTL I STATED WORCESTER GND OUR ACFT IDENT TAXI VFR TO
BRADLEY SHE RESPONDED BY SAYING ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP 29
SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH I RESPONDED
BY SAYING NEGATIVE SHE CAME BACK BY SAYING BRAVO INTXN 29 FOR TKOF IS
APPROVED I RESPONDED BY SAYING CLRD TO GO AT 29 SHE SAID AFFIRMATIVE …
Figure 4.3. Portion of a tokenized Narrative for Anomaly, Incursion – Runway
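A minimal tokenizer along these lines might look as follows. The thesis does not fully specify its punctuation rules, so the handling of numeric tokens here is an assumption.

```python
import re

def tokenize(narrative):
    """Separate punctuation from word tokens, but leave tokens containing
    digits untouched so that measures such as 3000' keep their meaning.
    (Assumed reading of Section 4.3.2 -- the thesis's exact rules may differ.)"""
    tokens = []
    for raw in narrative.split():
        if any(ch.isdigit() for ch in raw):
            tokens.append(raw)                 # keep punctuation on numbers
        else:
            tokens.extend(re.findall(r"[A-Z/]+", raw.upper()))
    return tokens
```

On the quoted, semicolon-heavy report text, this strips the quoting marks from words while preserving numeric tokens intact.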
4.3.3 Hash Generation
The basic idea behind hash generation is to replace words with numbers, to facilitate
generating the input files for the SVM learning algorithm. All the narratives are read,
and each word is assigned a number that serves as its primary identity from then on. We
are not interested in the order in which words occur, only in the number of times they
occur in a narrative. So, to avoid repeatedly parsing a narrative to obtain word
frequencies, we associate with each word's primary key its normalized frequency:

Normalized Frequency = (total occurrences of the word) / (total number of words in the narrative)

Finally, the result is sorted in ascending order of primary key value. As an example,
Figure 4.4 shows the hashed version of Figure 4.3.
8:0.00574712643678161 10:0.028735632183908 17:0.0862068965517241 23:0.0229885057471264 33:0.0114942528735632 41:0.00574712643678161 58:0.0114942528735632 61:0.0229885057471264 77:0.00574712643678161 81:0.0229885057471264 84:0.0065359477124183 93:0.0065359477124183 95:0.0065359477124183 121:0.0065359477124183 125:0.0130718954248366 126:0.0065359477124183 128:0.0065359477124183 150:0.0065359477124183 155:0.0065359477124183 …
Figure 4.4. Hashed version of tokenized Narrative for Anomaly, Incursion – Runway
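The hashing step can be sketched as below. Word ids starting at 1 and the sorted (id, normalized frequency) pairs mirror the layout shown in Figure 4.4, though the exact id assignment scheme used in the thesis is an assumption.

```python
from collections import Counter

def hash_narratives(narratives):
    """Assign each word a numeric id on first sight, then map every
    narrative (a token list) to sorted (id, normalized frequency) pairs."""
    word_id = {}
    hashed = []
    for tokens in narratives:
        counts = Counter(tokens)
        for w in counts:
            word_id.setdefault(w, len(word_id) + 1)
        total = len(tokens)
        vec = sorted((word_id[w], n / total) for w, n in counts.items())
        hashed.append(vec)
    return word_id, hashed
```

Each pair can then be printed as "id:frequency" to produce lines in the style of Figure 4.4.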
4.4 Word Selection
There are over 20,000 unique words in the ASRS dataset. Since we treat words as
features, classifying well with such a large number of features is a tall order for any
classifier. Even though the SVM learning algorithm is known to handle large feature sets
gracefully, it is still better to reduce the number of features it receives by considering
what kind of features the method provides.
After experimenting with several methods on the dataset, we observed that the best
results were obtained when only a selective subset of the words in the narratives was
used instead of all of them. Typically, the selected words either appear frequently in
narratives with a particular anomaly, or in all narratives except those with that anomaly.
The Information Gain (IG) method for word selection used by the NASA group in [6]
was found to be one of the best ways of doing word selection (see Section 4.4.1). The
key advantage of IG is that it captures both of the cases mentioned above. Another
method, based on subtraction, was also added to the system to increase the accuracy of
the words picked (see Section 4.4.2).
4.4.1 Plain Information Gain
IG is calculated as

IG = H(anomaly class) – H(anomaly class | feature)

In this method, normalized frequencies are not considered when calculating the entropy;
all that matters is the presence or absence of a feature in a document. Features that occur
in only a single document are ignored as too rare.
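Using only the presence or absence of a feature, the computation can be sketched as follows; the toy documents are invented for illustration.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) class split, in bits."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, word):
    """IG = H(C) - H(C | W), conditioning only on presence/absence of `word`."""
    pos = sum(labels)
    h_c = entropy(pos, len(labels) - pos)
    with_w = [c for d, c in zip(docs, labels) if word in d.split()]
    without = [c for d, c in zip(docs, labels) if word not in d.split()]
    h_cond = 0.0
    for subset in (with_w, without):
        if subset:
            p = len(subset) / len(docs)
            h_cond += p * entropy(sum(subset), len(subset) - sum(subset))
    return h_c - h_cond
```

A word that perfectly splits the anomaly from the non-anomaly narratives attains the maximum gain of H(C); a word spread evenly across both attains a gain near zero.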
4.4.2 Simple Subtraction
One of the easiest ways to find out whether a feature is more prevalent in one class than
in the others is subtraction. The problem with subtraction is that the range of values it
returns can be huge. To fix this, we apply the log function, which brings the numbers
returned by subtraction down to a manageable level. So what is used is the log of the
difference between the frequencies of occurrence of a word inside and outside the class.
4.4.3 Combined Metric
A combination of the two word selection methods described above is used to eliminate the
irrelevant features for an anomaly. The IG and subtraction values are calculated for all the
features. Two lists are then formed, one with the features in descending order of IG and the
other in descending order of the subtraction score. 75% of the required features are chosen
from the top of the IG list and the remaining 25% from the top of the subtraction list. All
unselected features are then considered irrelevant and removed from both the training and
test files. A result of word selection is shown in Figure 4.5.
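The 75/25 combination can be sketched as follows (a hypothetical sketch; `ig_score` and `sub_score` stand for the per-word IG and log-subtraction values computed earlier, and the toy values are illustrative):

```python
def select_words(ig_score, sub_score, k):
    """Pick 75% of k words by IG rank, then fill the remaining 25% from the
    subtraction ranking, skipping words that were already chosen."""
    by_ig = sorted(ig_score, key=ig_score.get, reverse=True)
    by_sub = sorted(sub_score, key=sub_score.get, reverse=True)
    chosen = by_ig[: (3 * k) // 4]
    picked = set(chosen)
    for w in by_sub:
        if len(chosen) >= k:
            break
        if w not in picked:
            chosen.append(w)
            picked.add(w)
    return chosen

ig = {"rwy": 0.9, "txwy": 0.7, "clrd": 0.5, "acft": 0.1}
sub = {"clrd": 2.0, "rwy": 1.5, "gnd": 1.2, "acft": 0.3}
print(select_words(ig, sub, 4))  # ['rwy', 'txwy', 'clrd', 'gnd']
```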
16
17:0.0862068965517241 41:0.00574712643678161 58:0.0114942528735632 61:0.0229885057471264 77:0.00574712643678161 84:0.0065359477124183 93:0.0065359477124183 95:0.0065359477124183 125:0.0130718954248366 126:0.0065359477124183 155:0.0065359477124183 …
Figure 4.5. Reduced hashed Narrative for Anomaly, Incursion – Runway
4.5 Feature Reduction
In many classification tasks, the feature values for an instance are generated from the
instance itself. A less common approach is to use multiple instances to generate the features
for a given instance. One such approach uses latent semantic kernels [11]. Our feature
reduction approach is strongly influenced by latent semantics.
This section starts with the basic terminology and then explains how and why using the
method for feature reduction is a good idea.
4.5.1 Term By Document Vector
A document can be represented in many different ways. One of the more mathematical
representations is similar to a histogram: take all the words in alphabetical order and
associate with each word the frequency of its occurrence in the document. For example, the
term by document vector for the sentence given below is shown in Table 4.1.
THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO
Table 4.1. Term by document vector
acknowledged 1
at 1
bravo 1
clrd 1
ctlr 1
gnd 1
go 1
i 1
intxn 1
that 2
the 1
to 1
was 1
In a conventional term by document vector, the entries correspond to word stems instead
of the actual words. In our case the words themselves were used, due to the presence of a
large number of acronyms and misspellings.
The term by document vector in this case has already been generated based on the
results from section 4.4. The only key difference is that, instead of having words sorted in
alphabetical order, there are numbers, which represent the words sorted in ascending order. A
conventional term by document vector also doesn't have normalized frequencies in its
representation. Using them doesn't harm the approach in any way, since the numbers are still
as far apart as before but on a reduced scale.
4.5.2 Inner Product
The inner product (or the dot product) is an operation that takes two vectors and returns a
single number as the output. To get the scalar or dot product of two vectors, all the elements
in the first vector should be multiplied to all the corresponding elements in the other vector
and the sum of the result is the dot product.
The easiest way to calculate the similarity between two documents is to take the dot
product of their term by document vectors. For example, take the following sentences.
S1. THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO
S2. CLRD TO GO AT INTXN BRAVO
S3. ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP
Table 4.2. Term by document vectors for S1, S2 and S3
Sentence 1 Sentence 2 Sentence 3
acft 1
acknowledged 1
at 1 at 1
bravo 1 bravo 1 bravo 1
clrd 1 clrd 1
ctlr 1
dep 1
for 1
gnd 1
go 1 go 1
i 1
ident 1
intxn 1 intxn 1 intxn 1
taxi 1
that 2
the 1
to 1 to 1
via 1
was 1
Taking the dot product between each pair of term by document vectors, we have:
S1 . S2 = 1 + 1 + 1 + 1 + 1 + 1 = 6
S1 . S3 = 1 + 1 = 2
S2 . S3 = 1 + 1 = 2
Looking at the results of the dot product, it is easy to say that S1 and S2 are more similar
than S1 and S3 or S2 and S3. The same ordering holds if raw frequencies are replaced by
normalized frequencies: instead of the results being 6, 2 and 2, they would be fractions with
the same relative order.
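The computation above can be reproduced directly (a minimal sketch; the token sets are taken from S1–S3 with the quote marks stripped):

```python
from collections import Counter

def dot(a, b):
    """Dot product of two sparse term vectors stored as word -> count maps."""
    return sum(cnt * b.get(word, 0) for word, cnt in a.items())

s1 = Counter("THAT THE GND CTLR ACKNOWLEDGED THAT I WAS CLRD TO GO AT INTXN BRAVO".lower().split())
s2 = Counter("CLRD TO GO AT INTXN BRAVO".lower().split())
s3 = Counter("ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP".lower().split())

print(dot(s1, s2), dot(s1, s3), dot(s2, s3))  # 6 2 2
```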
4.5.3 Singular Value Decomposition
Singular value decomposition (SVD) is the process of factorizing an n x m matrix M into the
form
M = U Σ V*
where
U is a unitary matrix over the feature space,
Σ is a diagonal matrix containing the singular values, which can be used to transform other
vectors,
V is a matrix whose columns are orthonormal directions of M, and
V* is the conjugate transpose of V.
The training set is used to generate Σ, which is then used to generate the features for both the
training and test set.
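A quick illustration of the factorization with NumPy (NumPy is our assumption for illustration only; the thesis does not use it):

```python
import numpy as np

# Toy 3x4 term-by-document matrix: rows = documents, columns = words.
M = np.array([[2.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# s holds the singular values in descending order, and U @ diag(s) @ Vt
# reconstructs M exactly.
print(np.round(s, 4))
assert np.allclose(U @ np.diag(s) @ Vt, M)
```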
4.5.4 Gram Schmidt Process
The Gram Schmidt process converts a linearly independent set of vectors v1, v2, v3, ..., vn
into an orthogonal set of vectors q1, q2, q3, ..., qn in the inner product space (commonly the
Euclidean space).
The process iterates over the equations
q1 = v1
qk = vk - proj_q1(vk) - proj_q2(vk) - ... - proj_q(k-1)(vk)
where the projection proj_q(v) is
proj_q(v) = (<v, q> / <q, q>) q
Due to floating point issues, the direct algorithm described above does not give exact results
on a computer. A numerically stable version of the algorithm1 is used instead:
GRAM-SCHMIDT-STABLE
for i = 1 to n do
    vi = ai
end for
for i = 1 to n do
    rii = ||vi||
    qi = vi / rii
    for j = i + 1 to n do
        rij = qi^T vj
        vj = vj - rij qi
    end for
end for
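The stable (modified) Gram-Schmidt loop above translates to the following sketch (a minimal NumPy version for illustration, not the thesis implementation):

```python
import numpy as np

def gram_schmidt_stable(A):
    """Modified Gram-Schmidt: orthonormalize the columns of A.
    Returns Q whose columns are orthonormal and span the same space."""
    V = A.astype(float).copy()
    n = V.shape[1]
    Q = np.zeros_like(V)
    for i in range(n):
        r = np.linalg.norm(V[:, i])       # rii = ||vi||
        Q[:, i] = V[:, i] / r             # qi = vi / rii
        for j in range(i + 1, n):
            # subtract the component of vj along qi
            V[:, j] -= (Q[:, i] @ V[:, j]) * Q[:, i]
    return Q

A = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
Q = gram_schmidt_stable(A)
print(np.round(Q.T @ Q, 10))  # identity matrix: the columns are orthonormal
```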
4.5.5 Applying Gram Schmidt
Since we are not using any external semantic resource such as WordNet, the concept of
latent semantics here relies on co-occurrence: two words occurring together in a document
are related to each other. This fact can be exploited by generating the singular values using
SVD [11].
Estimating the singular values of a term-document matrix by matrix factorization is a
very computationally intensive task. As a result, we use Gram Schmidt orthogonalization
(GS) to estimate the singular values and use them to reduce the dimensionality of the space
effectively. The GS method is known to provide results similar to the latent semantic kernel
(LSK) [12].
All the feature values of every document are normalized. The document-term matrix is
then created and given to the GS algorithm for processing.
1 http://www.phys.ocean.dal.ca/~lukeman/teaching/3170-3111/slides/Lecture8.pdf
To generate the training data:
• The 2-norm is taken for all the document vectors.
• A biasing factor is applied to the norm of every training document containing the
anomaly.
• The K documents (where K is the resulting feature space size) with the highest biased
norm scores are taken as the primary vectors for calculating the singular value features.
• The GS algorithm is run on all the documents using these K documents. The reduced
features obtained are used for training.
To generate the test data, the GS algorithm is run with the K documents selected in the
training stage to generate features for the test documents.
The training and test files are generated using the features computed in the train and test
stages, as described above. Although there exists an unsupervised way of selecting the K
documents [13], the unsupervised approach was seen not to give results nearly as good as
the supervised method [12].
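The selection and projection steps above can be sketched as follows (a hypothetical sketch: the `bias` default, the selection rule, and the projection onto an orthonormalized reference basis are our reading of the steps, not the thesis code; `numpy.linalg.qr` stands in for the Gram-Schmidt orthogonalization):

```python
import numpy as np

def gs_features(train, is_anomaly, test, k, bias=3.0):
    """Pick k reference documents by biased 2-norm, orthonormalize them,
    and project every document onto the resulting basis (rows = documents)."""
    norms = np.linalg.norm(train, axis=1)
    norms[is_anomaly] *= bias                      # favor anomaly documents
    refs = train[np.argsort(norms)[::-1][:k]]      # k highest biased norms
    Q, _ = np.linalg.qr(refs.T)                    # orthonormal basis (QR ~ GS)
    return train @ Q, test @ Q                     # reduced k-dim features

rng = np.random.default_rng(0)
train = rng.random((6, 10))
test = rng.random((2, 10))
is_anomaly = np.array([True, False, True, False, False, False])
tr_feats, te_feats = gs_features(train, is_anomaly, test, k=3)
print(tr_feats.shape, te_feats.shape)  # (6, 3) (2, 3)
```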
4.6 Classification
SVM-light2 is a C implementation of SVM. It is small, fast, and very easy to use, which is
why we chose it. It comes with two main executables: svm_learn and svm_classify. The
learner svm_learn takes the training data and generates a model. The classifier svm_classify
takes the test data along with the model and generates the classifications.
2 http://svmlight.joachims.org/
4.6.1 Learning
Training an SVM with transduction is quite simple. Once all the features have been generated
using the methods from sections 4.2 to 4.5, we only need to put them in a format the
SVM-light learner can understand. Each line of the training file is a class label (-1 or 1)
followed by a space-separated list of feature number:value pairs. In our case, if the document
belongs to the anomaly we assign it -1, otherwise we assign it 1. To use transduction, the
unlabeled instances also need to be passed; these are assigned a class value of 0. See Figure
4.6 for an example of a training file. Once the learner encounters a line with label zero, it
switches to the transduction mode of operation.
-1 101:0.23 132:0.39 159:1.23 160:2.02 161:0.25
0 101:0.213 102:0.39 109:1.93 111:0.102 116:0.95 130:0.234
1 101:0.123 102:0.69 105:0.823 116:0.92 121:0.45
-1 104:1.025 178:7.34 180:1.4
Figure 4.6. Sample Training File content
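Writing such a file can be sketched as follows (a hypothetical helper, not the thesis code; SVM-light expects ascending feature numbers and no spaces inside an `index:value` pair):

```python
def write_svmlight(path, instances):
    """instances: list of (label, {feature_index: value}) with label in {-1, 0, 1};
    label 0 marks an unlabeled instance used for transduction."""
    with open(path, "w") as f:
        for label, feats in instances:
            pairs = " ".join(f"{i}:{v}" for i, v in sorted(feats.items()))
            f.write(f"{label} {pairs}\n")

write_svmlight("train.dat", [
    (-1, {101: 0.23, 132: 0.39}),
    (0, {101: 0.213, 109: 1.93}),   # unlabeled -> triggers transduction mode
    (1, {101: 0.123, 105: 0.823}),
])
print(open("train.dat").read())
```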
Finally, svm_learn is run with the training file and the model file name as parameters, as
shown below.
svm_learn train_file model_output_file
4.6.2 Classifying
The test file is created in the same way as the training file. All the unlabeled narratives are
put in a file. The features required for the classification of a test document are generated
using the same set of reference narratives as in section 4.6.1.
Each test document is processed through the steps described in sections 4.2 to 4.5. After
these stages, the features generated for each document are written out with a zero in place of
the class value when passing them to the SVM for classification. A sample test file is shown
in Figure 4.7.
0 101:0.23 132:0.39 159:1.23 160:2.02 161:0.25
0 101:0.213 102:0.39 109:1.93 111:0.102 116:0.95 150:2.345
0 101:0.123 102:0.69 105:0.823 116:0.92 121:0.45
Figure 4.7. Sample Test File content
To use the classifier, the executable svm_classify is run with the test file, the model file and
the output file name as command-line arguments, as shown below.
svm_classify unlabeled_documents_file model_file classification_output_file
4.6.3 Interpreting Output
SVM-light returns one number per line, each line corresponding to a test instance. If the
number is close to -1, the SVM classifies the instance as negative; if it is close to 1, as
positive. The magnitude of the number indicates the SVM's confidence in its decision. A
threshold of 0 was used to determine the class: anything above zero was considered positive
and everything below or equal to zero negative.
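Turning the raw scores into class decisions is then a one-liner (a sketch; the file name is a placeholder, not an SVM-light convention):

```python
def read_predictions(path, threshold=0.0):
    """Map each SVM-light output score to +1 (positive) or -1 (negative)."""
    with open(path) as f:
        return [1 if float(line) > threshold else -1 for line in f]

# Simulate an SVM-light output file with three scores.
with open("classification_output_file", "w") as f:
    f.write("0.73\n-0.21\n0.0\n")
print(read_predictions("classification_output_file"))  # [1, -1, -1]
```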
CHAPTER 5
EVALUATION
5.1 Dataset
As mentioned before, the evaluation dataset contains the aviation safety reports, with over 62
anomalies and over 130,000 narratives. To reduce the scale of the task, only 8 anomalies
were considered in our experiments. The number of relevant narratives (i.e., narratives
containing at least one of the eight anomalies) was over 75,000. To make sure the program
can run on everyday hardware, only 8,000 narratives were used. Using fewer narratives also
increased the share of low-frequency anomalies in the dataset. Since a single narrative can
have more than one anomaly, the total need not equal the total number of narratives. Table
5.1 lists the anomalies and the number of narratives associated with each.
The evaluation was performed using 2-fold cross-validation. This was done in order to
have a smaller percentage of training instances, which adds greater value when developing a
semi-supervised approach. Each fold contains 4,000 narratives, and the distribution of classes
across the 2 folds is given in Table 5.1.
Table 5.1. Dataset Compositions

Class  Anomaly Description                          Fold 1  Fold 2  Total  Participation (%)
1      Aircraft Equipment Problem – Critical          1028    1033   2061   16.78
2      Airspace Violation – Entry                      513     537   1050    8.55
3      Incursion – Runway                              541     573   1114    9.07
4      Inflight Encounter – Turbulence                 523     587   1010    8.22
5      Inflight Encounter – Weather                   1060    1055   2115   17.22
6      Inflight Encounter – VFR in IMC                 505     521   1026    8.36
7      Maintenance Problem – Improper Maintenance      510     496   1006    8.19
8      Non-Adherence – Clearance                      1455    1443   2898   23.60

5.2 Experiment Setup
Identification is done using binary SVM classifiers: eight binary classifiers, one per anomaly,
each handling its anomaly independently of the others. The experiments and results are
presented starting with the existing methods, followed by our proposed approach.
5.3 Baseline Experiments
We used three different approaches as baseline systems:
• The NASA approach with the SVM.
• The Naïve Bayes based EM.
• The Gram Schmidt Kernel based SVM.
5.3.1 Plain SVM
The most basic method that can be applied to the dataset is classification with a linear SVM.
To do this we take the output generated by the method described in
section 4.2.3 and pass it as input to the learning algorithm. The SVM-light package, which is
based on [7], was used to classify the data. For the classification stage all the options were
left at their default values. Words were selected using the IG metric mentioned in section
4.2.4.
The SVM-light system accepts input as line-separated instances. Each instance is given
as a class label (-1 or 1) followed by a space-separated list of feature number:value pairs. The
feature numbers must be strictly greater than zero. The learner generates a model file after
processing the labeled instances.
The test instances are passed to SVM-light in the same format, but with a placeholder
instead of the class label at the beginning of the line. The output is a number in the range of
-1 to 1, with numbers closer to -1 meaning a classification of -1 and numbers closer to 1
meaning a classification of 1. A threshold of 0 was used to discretize the output. The results
of the experiment are tabulated in Table 5.2 and graphed in Figure 5.1. The rows of Table 5.2
correspond to the number of features selected by IG and used by the SVM; the columns are
the eight anomalies.
Table 5.2. F-Scores of Plain SVM
No. Of Words/Class 1 2 3 4 5 6 7 8
500 83.21 62.13 85.64 50.19 65.38 77.32 71.80 78.42
1500 83.50 63.02 85.27 49.28 65.63 77.54 72.30 79.12
2500 83.65 62.99 85.49 49.21 65.48 77.85 71.87 78.95
5000 83.61 62.79 85.44 49.34 65.33 78.12 72.48 79.07
All 83.92 61.40 85.50 49.06 65.34 77.65 72.57 79.13
Figure 5.1. Results of Plain SVM
As the results show, the approach of simply selecting words using IG works very well on
this dataset, with the overall F-score reaching 73.81. The key reason it fails to perform well
on anomaly 4 is that this anomaly (Inflight Encounter – Turbulence) is almost always
accompanied by anomaly 5 (Inflight Encounter – Weather), so the two share many common
words and are not easily distinguished from each other. In the case of anomaly 2 (Airspace
Violation – Entry), the low performance can be attributed to the scarcity of strongly
indicative words, due to the low number of training instances belonging to this anomaly
class. Overall, anomalies with a higher presence were detected well, except for anomaly 5,
which overlaps heavily with anomaly 4.
5.3.2 Naïve Bayes EM
The Naïve Bayes classifier is known to provide good results on text classification tasks. In
this experiment the Naïve Bayes classifier is used in conjunction with EM as given in [6].
The details of the method can be found in section 3.2.1. Unlike the previous methods, where
a standard learning algorithm package was used, here a custom Naïve Bayes EM algorithm
was coded. This allowed greater flexibility in how the input was provided and easier tuning
of the parameters. The unlabeled data we use is the test set.
The input given to the system is again taken from the output of sections 4.2.3 and
4.2.4. Since the EM algorithm works differently from the SVM, a few changes had to be
made to the way the input was passed and handled. To keep the input format the same as in
the other methods, missing features (words that do not occur in the narrative) are handled in
the code by setting the value of the feature to zero. Furthermore, instead of using the raw
frequency of occurrence of words as given in the method in section 3.2.1, the normalized
frequency is used.
The output generated by this method is not the same as that of the SVM classifiers: it
is the probability of occurrence of the anomaly, which is then converted to either -1 or 1
depending on the probability value.
The importance given to the unlabeled data was varied. The best scores obtained for
each weight are listed in Table 5.3. The average number of EM iterations was 50.
Table 5.3. F-Scores of Naïve Bayes EM

Unlabeled Data Weight/Class 1 2 3 4 5 6 7 8
0.1 83.69 58.71 86.55 59.53 69.90 69.84 72.73 78.57
0.25 83.57 56.46 86.74 57.89 68.14 67.18 72.38 78.47
0.5 83.35 54.16 86.83 55.75 66.31 64.98 71.68 78.06
0.7 83.26 52.43 86.68 54.30 64.93 63.92 71.44 77.81
0.9 83.26 51.59 86.76 52.80 63.91 63.47 71.23 77.27
Figure 5.2. Results of Naïve Bayes EM
The overall F-score obtained over all the anomalies was 74.21. All words are used, so there
are more word frequencies that can influence the outcome; as a result the method performs
better on both anomalies 4 and 5. It performs poorly on anomaly 2 because the narratives do
not contain many high-frequency words that repeat across different documents of the same
anomaly. The performance on the other anomalies is similar to the plain SVM approach. So
the use of unlabeled data improves the performance of the Naïve Bayes classifier, but it
remains only comparable to a supervised SVM.
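The core idea, with the unlabeled documents down-weighted when re-estimating the model, can be sketched compactly (a hypothetical multinomial implementation; the thesis code and the exact update rules in section 3.2.1 may differ):

```python
import numpy as np

def nb_em(X_lab, y, X_unlab, weight=0.1, iters=50, alpha=1.0):
    """Semi-supervised multinomial Naive Bayes with EM (binary classes 0/1).
    Unlabeled counts enter the M-step scaled by `weight`."""
    Y = np.eye(2)[y]                               # hard labels, one-hot
    R = np.full((len(X_unlab), 2), 0.5)            # initial soft labels
    for _ in range(iters):
        # M-step: class priors and word probabilities from weighted counts
        cls = Y.sum(0) + weight * R.sum(0)
        prior = cls / cls.sum()
        counts = Y.T @ X_lab + weight * (R.T @ X_unlab) + alpha
        theta = counts / counts.sum(1, keepdims=True)
        # E-step: posteriors for the unlabeled documents
        log_p = np.log(prior) + X_unlab @ np.log(theta).T
        log_p -= log_p.max(1, keepdims=True)
        R = np.exp(log_p)
        R /= R.sum(1, keepdims=True)
    return R[:, 1]                                 # P(class 1 | doc)

# Toy word-count matrices: class 1 docs favor word 0, class 0 docs favor word 1.
X_lab = np.array([[3, 0], [2, 1], [0, 3], [1, 2]], float)
y = np.array([1, 1, 0, 0])
X_unlab = np.array([[4, 0], [0, 4]], float)
p = nb_em(X_lab, y, X_unlab)
print(p.round(3))  # first unlabeled doc leans to class 1, second to class 0
```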
5.3.3 Gram Schmidt Kernel
In this baseline, we generate the features using the Gram Schmidt Kernel (GSK) algorithm
given in [12]. The entire preprocessed dataset was used to generate the features with the
GSK algorithm. The numbers of features tried were 100, 500 and 1000, and the biases used
were 1, 2 and 3.
Once the features were generated, the labeled feature instances were given as input to
the SVM with all default options for training: -1 was used if the anomaly was present and 1
if it was not.
The test file was then generated using GSK via the exact same procedure as in training
and passed as input to the corresponding models.
The result was determined using a threshold of 0: instances with scores below 0 were
classified as -1 and scores above 0 as 1. The best results obtained are tabulated in Table 5.4
and graphed in Figure 5.3.
The overall F-score obtained was 72.77. The GSK approach reduces the feature set
similarly to the word selection approach, by combining the most relevant features; hence the
two methods perform similarly.
Table 5.4. Results of GSK
Class Precision Recall F-Score
1 86.37 79.62 82.56
2 90.67 46.36 61.29
3 88.35 81.76 84.90
4 79.91 34.30 47.98
5 77.56 55.03 64.38
6 87.43 68.49 76.68
7 81.79 64.88 72.36
8 80.73 76.99 78.81
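Since Table 5.4 reports precision, recall and F-score per class, recall the standard computation (a generic sketch of the balanced F1 measure, not tied to the thesis code):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_score(88.35, 81.76), 2))  # close to the 84.90 reported for class 3
```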
Figure 5.3. Results of GSK
5.4 Our Semi-supervised Approach
The semi-supervised approach has two stages. The first stage is based entirely on word
selection using IG and the subtractive log method described in section 4.4. The second stage
further reduces the dimensionality of the feature set obtained in the first stage by applying
our instance selection based method to generate the features.
5.4.1 Stage 1: Our Word Selection
An SVM with transduction can incorporate both labeled and unlabeled information to
potentially yield higher classification accuracies. As in the case of the plain SVM, the
SVM-light package was used with default options. Running SVM-light in transduction
mode requires a few extra steps.
The input is generated in a similar manner to the previous method. The output
generated in section 4.3.3 is used, with the difference that the labeled and unlabeled
instances are combined for processing by the learning algorithm. The class labels are passed
differently: labeled instances get -1 or 1, and unlabeled instances get 0. In all the
experiments, all the test instances were passed as the unlabeled instances for training the
learner. The learner learns from the data given to it and generates the model, which is used
for classification as usual.
Unlabeled instances are passed in the same way as in the previous method. Again the
classifier returns a confidence score between -1 and 1, and the threshold of 0 is used to
classify an instance as either -1 or 1.
The experiment was performed with word selection applied as given in section 4.3.2.
The results of the experiment for 500, 1500, 2500 and 5000 words are tabulated in Table 5.5
and graphed in Figure 5.4. Comparing this table with those for the baseline systems, we can
see that combining labeled and unlabeled data gives better results than training on labeled
data alone.
One of the key things that stands out in the results is the classification performance
obtained for anomaly 4. In the preceding experiments, word features were valued by the
frequency of occurrence of the words in the documents. In this experiment, however, we use
only binary-valued features that indicate the presence or absence of a word in the document,
instead of the number of times it occurs, as was the case with the plain SVM approach. The
slight increase seen in all the other anomalies is due partly to the presence/absence
modification and partly to the fact that both the labeled and unlabeled data were used by the
transductive SVM classifier.
Table 5.5. Results of Stage 1

Words/Class   1      2      3      4      5      6      7      8
500         84.16  72.64  85.99  69.53  72.48  81.30  76.73  79.70
1500        84.21  73.23  86.36  69.71  72.00  81.94  76.54  79.04
2500        84.61  73.57  86.79  69.68  72.23  82.22  76.53  79.30
5000        84.29  73.88  86.24  69.94  71.96  82.81  76.13  78.91

Figure 5.4. Results of Stage 1
5.4.2 Stage 2: Further Dimensionality Reduction
We continue from the reduced feature set produced by our word selection method. One
interesting observation that can be incorporated into this reduction method is that using word
presence instead of word frequency in a given document does not hurt the overall
classification; in anomalies with low classification rates, such as anomaly 4, word presence
was very beneficial. Since we cannot simply replace the normalized frequencies with 1 or
some fraction without influencing the value decomposition, we
use the method of taking the 1-norm of the presence-based feature set generated in the
previous step.
5000 features were selected in the word selection phase, so that no document is left with a
very small number of features. The train and test feature reduction algorithms are run on this
5000-feature set, and the transductive SVM with default parameters is then used to classify
the data. The results tabulated in Table 5.6 and graphed in Figure 5.5 were obtained using
reduced feature sets of 250, 500 and 800 with a bias value of 3.
The overall F-score obtained was 78.62, compared to the overall F-score of 78.64 from
stage 1. The F-scores do not change much with the application of GS-based feature
reduction, but for classes with average participation the reduction gives better results than
word selection alone. However, it performs badly in cases where there is a large amount of
overlap between classes, as with anomalies 4 and 5.
Table 5.6. Results of Stage 2
No. of Features/Class 1 2 3 4 5 6 7 8
250 82.93 70.51 85.85 66.73 70.57 81.13 76.46 77.96
500 84.41 73.51 86.17 68.31 72.32 81.87 77.26 79.09
800 84.77 74.01 86.77 69.05 73.03 82.41 77.22 79.33
Figure 5.5. Results of Stage 2
5.5 Result Comparison And Error Analysis
Looking at the results of the various approaches, it is clear that the word selection based
approaches perform very well on the ASRS dataset. Performance goes down when the words
are spread over many documents with similar frequencies in different documents. But
overall, in most cases the classification performance is good, considering that all the
approaches are bag-of-words based.
The best approach turned out to be our word selection using the combined metric
instead of plain IG, together with using the entire dataset to train the transductive SVM
classifier. The GS kernel based baseline method for determining features gave better results
than the other existing methods, but was not as good as the word presence method used in
combination with a transductive SVM with a linear kernel.
The reason our GS-based approach does not perform better is perhaps that the reference
points used to derive the new feature space are not chosen by a sufficiently good method.
Switching to simulated annealing or a transductive approach for selecting the reference
documents could boost the results.
Table 5.7. Comparing the F-Scores
Class NASA EM GSK Stage 1 Stage 2
1 83.92 83.69 82.56 84.61 84.77
2 63.02 58.71 61.29 73.88 74.01
3 85.64 86.83 84.90 86.79 86.77
4 50.19 59.53 47.98 69.94 69.05
5 65.63 69.90 64.38 72.48 73.03
6 78.12 69.84 76.68 82.81 82.41
7 72.57 72.73 72.36 76.73 77.22
8 79.13 78.57 78.81 79.70 79.33
Overall 73.82 74.21 72.77 78.64 78.62
Figure 5.6. Comparing all results
CHAPTER 6
CONCLUSION
Few methods use the entire dataset to generate the features for a single document in the
dataset. This thesis proposes a method that combines aspects of the different documents to
form the features, resulting in good scores.
The ASRS dataset is very word-centric. This is clearly shown by the experiments
conducted using different ways of selecting words: we could get very high F-scores just by
using simple word selection methods and passing the result through a learning algorithm
like SVM. Combining unlabeled and labeled data using transduction gave good
improvements over all the existing approaches, and it could perform even better if both the
labeled and unlabeled documents were used together to select the words.
The semi-supervised approach developed using the concepts of word selection and
GSK worked far better on this dataset than GSK alone. The key to our method is the way the
instances are selected and used to generate features for other documents. The main
contribution of this thesis lies in our view that documents should not be used just as input
data but also as points of reference from which patterns can be derived for future
classification.
There is a lot of room for improvement, especially in how the original instances are
selected for use during feature generation. The current method of using the training data,
though effective, could be improved by using both the training and test data. Also, the
broader the spread of the reference points, the better the results should be. The word
selection metric can also be improved to provide better features to the GSK module, such
that more features are present per document without introducing contradicting features.
At first glance, it appears that accurately determining the anomalies associated with a
narrative would be an extremely complicated task, requiring many stages of processing.
However, the results of our experiments and the details of the new approach suggest that
sometimes it is good to think simple. At least for the ASRS dataset, it turns out that the
simpler the approach, the better the results.
REFERENCES
[1] Christian Posse, Brett Matzke, Catherine Anderson, Alan Brothers, Melissa Matske, Thomas Ferryman. Extracting Information from Narratives: An Application to Aviation Safety Reports. In Proceedings of the IEEE Aerospace Conference, 2005.
[2] Pedro Domingos, Michael J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[3] David D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the European Conference on Machine Learning (ECML), pages 4-15, 1998.
[4] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML), pages 137-142, 1998.
[5] N. Cristianini, J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[6] Ashok N. Srivastava, Ram Akella, Vesselin Diev, Sakthi Preethi Kumaresan, Dawn M. McIntosh, Emmanuel D. Pontikakis, Zuobing Xu, Yi Zhang. Enabling the Discovery of Recurring Anomalies in Aerospace Problem Reports using High-Dimensional Clustering Techniques. In Proceedings of the IEEE Aerospace Conference, 2006.
[7] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2-3):103-134, 2000.
[8] P. Dewilde. In E. F. Deprettere (Ed.), SVD and Signal Processing: Algorithms, Applications and Architectures, pages 3-41. Elsevier Science Publishers, 1988.
[9] Gene H. Golub, Charles F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996.
[10] Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML), pages 78-85, 1997.
[11] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[12] Nello Cristianini, John Shawe-Taylor, Huma Lodhi. Latent Semantic Kernels. Journal of Intelligent Information Systems, 2002.
[13] A. J. Smola, O. L. Mangasarian, B. Schölkopf. Sparse Kernel Feature Analysis. Technical Report 99-04, Data Mining Institute, University of Wisconsin, 1999.
VITA
Suraj Raghuraman was born in Madras (Chennai), India on 25th June 1984, the son of
Raghuraman and Usha Raghuraman. Suraj spent most of his childhood breaking and fixing
things around the house, later shifting to doing the same on the computer, saving his parents
a lot of money. He did his schooling switching schools and places very often, all over India.
He joined MVIT, Bangalore for a Bachelor of Engineering in Information Science and, after
a very patient four-year wait, was awarded the degree in 2005. He joined Subex as a Fraud &
Risk Management engineer during the final years of his Bachelor's. Suraj was fond of
starting companies during the later years of his Bachelor's and during his work at Subex; he
started a few companies and closed them down, all of them profitable. During 2007 he was
getting bored with his job at Subex, and one of his family's ventures finally became big
enough to sustain itself. He decided to leave his country, quit his job, and joined UTD for a
Master's in Computer Science in Fall 2007. Currently he works at the HLTRI at UTD.