Spatio-temporal outlier detection in streaming …762507/FULLTEXT01.pdf7.11 Outlier detection...

Spatio-temporal outlier detection in streamingtrajectory data

MÁTÉ SZEKÉR

Master’s Thesis at Computer Vision and Active Perception LabSupervisor: Carl Henrik Ek, Tove Gustavi

Examiner: Danica Kragic

AbstractThis thesis investigates the problem of detecting spatio-temporal anomalies in streamed trajectory data using bothsupervised and unsupervised algorithms. Anomaly detec-tion can be understood as an unsupervised classificationproblem which requires the knowledge of the normal courseof events or how the anomalies manifest themselves. To thisend, an algorithm is proposed to identify the normative pat-tern in a streamed dataset. A non-parametric algorithmbased on SVM is proposed for classifying trajectories basedon the explicit geometric properties alone. A parametricalgorithm based on dynamic Markov Chains is presentedfor analysing trajectories based on their semantics. Twomethods are proposed to fade the Markov Chains so thatnew behaviours can be modelled and obsolete behaviourscan be forgotten. Both the non-parametric and paramet-ric approaches are evaluated using both a synthetic and areal-life dataset. Fading the Markov Chains turns out tobe essential in order to accurately detect anomalies in adynamic dataset.

ReferatTidsberoende avvikelsedetektion i

trajektorie-dataströmmar

Detta examensarbete undersöker hur man kan detekteratidsberoende avvikelser i dataströmmar innehållande tra-jektorier. Avvikelsedetektion kräver vanligen kunskap omhur de normativa eller avvikande mönstren ser ut. Förstpresenteras därför en algoritm som kan användas för attidentifiera det normativa mönstret i en dataström. En icke-parametrisk algoritm baserad på SVM presenteras sedanför att klassificera trajektorier utifrån deras geometriskaegenskaper. Till sist presenteras en parametrisk algoritmbaserad på dynamiska Markov-kedjor för att analysera tra-jektorier genom att också ta hänsyn till de bakomliggandesemantiska informationerna. Två olika metoder presenterasför att bleka och uppdatera dessa Markov-kerdjor så att nyabeteenden lättare kan läras av modellen. Både den icke-parametriska och parametriska metoden utvärderas slutli-gen på en syntetisk och en fysikalisk datamängd. Slutresul-tatet är att blekningen av Markov-kedjorna är nödvändigtför att upptäcka avvikelser i en dynamisk datamängd.

Acknowledgements

I am very grateful to my supervisors Carl Henrik Ek and Tove Gustavi withoutwhom this thesis could not have been accomplished. I would also like to thankmy family for all the support even though they were over 5000 km away, 6timezones behind.The thesis was carried out with the resources provided by the Swedish DefenseResearch Agency and the Computer Vision and Active Perception Lab at KTH.The roles of the institutions are gratefully acknowledged.

Contents

1 Introduction 11.1 Anomalies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Background and motivation . . . . . . . . . . . . . . . . . . . . . 21.3 Nature of data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . 41.5 Outlier detection and Classification . . . . . . . . . . . . . . . . . 51.6 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.7 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Group discovery 72.1 Association rule mining . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 Apriori algorithm . . . . . . . . . . . . . . . . . . . . . . . 82.2 Finding groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Background on trajectory analysis 113.1 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . 113.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Trajectory dissimilarities . . . . . . . . . . . . . . . . . . . . . . . 12

3.3.1 Dynamic Time Warping . . . . . . . . . . . . . . . . . . . 123.3.2 Hausdorff distance . . . . . . . . . . . . . . . . . . . . . . 133.3.3 Frechet distance . . . . . . . . . . . . . . . . . . . . . . . 14

4 Non-parametric trajectory analysis 174.1 Introduction to SVM . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1.1 Feature mapping . . . . . . . . . . . . . . . . . . . . . . . 194.1.2 Linearly non-separable datasets . . . . . . . . . . . . . . . 194.1.3 Dual formulation of the SVM . . . . . . . . . . . . . . . . 204.1.4 Kernelization . . . . . . . . . . . . . . . . . . . . . . . . . 214.1.5 Multi-Class SVM . . . . . . . . . . . . . . . . . . . . . . . 23

4.2 Trajectory kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 244.3 Preliminary results . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.3.1 Gaussian RBF kernel . . . . . . . . . . . . . . . . . . . . 254.4 SVM applied to data streams . . . . . . . . . . . . . . . . . . . . 26

5 Parametric trajectory analysis 285.1 Introduction to Markov Chains . . . . . . . . . . . . . . . . . . . 295.2 Extensible Markov Model . . . . . . . . . . . . . . . . . . . . . . 30

5.2.1 Updating the Markov Chain . . . . . . . . . . . . . . . . . 31

5.2.2 State clustering . . . . . . . . . . . . . . . . . . . . . . . . 325.3 Estimating the parameters of the EMM . . . . . . . . . . . . . . 32

5.3.1 Clustering threshold . . . . . . . . . . . . . . . . . . . . . 325.3.2 Forgetting factor . . . . . . . . . . . . . . . . . . . . . . . 32

5.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365.5 Anomaly detection . . . . . . . . . . . . . . . . . . . . . . . . . . 385.6 Preliminary results . . . . . . . . . . . . . . . . . . . . . . . . . . 395.7 Database of individuals with same instantaneous behaviour . . . 42

6 Dataset 446.1 MIT Realitycommons Badge dataset . . . . . . . . . . . . . . . . 446.2 Proprietary FOI dataset . . . . . . . . . . . . . . . . . . . . . . . 45

7 Results 467.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467.2 Evaluation method . . . . . . . . . . . . . . . . . . . . . . . . . . 467.3 Parametric analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 47

7.3.1 Estimation of λ . . . . . . . . . . . . . . . . . . . . . . . . 487.3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . 527.3.3 Anomaly detection . . . . . . . . . . . . . . . . . . . . . . 557.3.4 Detecting anomalous groups . . . . . . . . . . . . . . . . . 597.3.5 Group discovery . . . . . . . . . . . . . . . . . . . . . . . 63

7.4 Non-parametric analysis . . . . . . . . . . . . . . . . . . . . . . . 65

8 Discussion and Future Work 688.1 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 68

8.1.1 Parametric analysis . . . . . . . . . . . . . . . . . . . . . 688.1.2 Non-parametric analysis . . . . . . . . . . . . . . . . . . . 70

8.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

8.3.1 Parametric analysis . . . . . . . . . . . . . . . . . . . . . 718.3.2 Non-parametric analysis . . . . . . . . . . . . . . . . . . . 72

Bibliography 74

Appendices 78

A Source code 79A.1 SVM preliminary results . . . . . . . . . . . . . . . . . . . . . . . 79A.2 Algorithm 1 for estimating λ . . . . . . . . . . . . . . . . . . . . 81A.3 Algorithm 2 for estimating λ . . . . . . . . . . . . . . . . . . . . 86A.4 Outlier detection in the Badge dataset . . . . . . . . . . . . . . . 89

List of Figures

1.1 Typical trajectory of an employee during a sliding window. Theaxes correspond to the x and y coordinates in the office environment. 4

1.2 Types of trajectory anomalies. . . . . . . . . . . . . . . . . . . . 5

2.1 Graph of frequent rules containing single itemsets. This graphcontains two groups/equivalence classes: (A,B,C) and (D). . . . 10

3.1 Example of a mapping of points between two curves of differentlength. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Two trajectories with a small Hausdorff distance but with a largeFrechet distance.[1] . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.3 Graphical representation of the Frechet distance calculation. . . . 16

4.1 A possible way to pre-process data is to change the coordinatesystem from Cartesian to polar. . . . . . . . . . . . . . . . . . . . 19

4.2 Example of a linear classifier applied to a linearly non-separabledataset in R2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.3 Different approaches to multi-class SVM. . . . . . . . . . . . . . 234.4 Test dataset for distance substitution kernel based SVM classifi-

cation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.5 Classification results using the Frechet kernel. . . . . . . . . . . . 254.6 Classification results using the Hausdorff kernel. . . . . . . . . . 264.7 Schematic view of the proposed online SVM model. . . . . . . . . 26

5.1 Schematic view of the EMM construction and maintenance. . . . 305.2 Left: Graph of the rate of change of behaviour in a hypothetical

dataset. The red line is used as a threshold to identify coherenthigh-activity periods. Right: During such periods, the value ofλ is set inversely proportional to the period length. Outside theseperiods, λ is set to a small constant value. (Graphs not to scale.) 35

5.3 Example of a not irreducible Markov Chain. The thickness of theedges represent the relative probabilities of the transitions. . . . 37

5.4 Test dataset. Notice that both the normative and anomalousbehaviour change with time. The colors indicate the type ofintervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.5 Results using C = 3, occurrence thr = 0.1, transition thr =0.05 and λ = 0. An anomaly score of 1 represents an anomalywhile 0 represents the normal data points. . . . . . . . . . . . . 40

5.6 Results using C = 3, occurrence thr = 0.1, transition thr =0.05 and λ = 0.5. An anomaly score of 1 represents an anomalywhile 0 represents the normal data points. . . . . . . . . . . . . . 41

5.7 Anomality detection results on the dynamic test dataset. Ananomaly score of 1 represents an anomaly while 0 represents thenormal behaviour. . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6.1 Floor plan of the MIT Reality Commons Badge dataset. [2] . . . 44

7.1 Overview of the evaluation method. . . . . . . . . . . . . . . . . 477.2 F1 score vs clustering threshold C. . . . . . . . . . . . . . . . . . 487.3 Locations of the EMM states, each identified by a unique ID. . . 487.4 Dissimilarities between the behaviours at different states during

two subsequent sliding windows during a complete day. . . . . . . 497.5 Dissimilarities between average behaviours during two subsequent

sliding windows during a complete day. . . . . . . . . . . . . . . . 507.6 λ estimated using Algorithm 1. . . . . . . . . . . . . . . . . . . . 517.7 λ estimated using Algorithm 2. . . . . . . . . . . . . . . . . . . . 517.8 Classification performance of the department group classifica-

tions during a day . . . . . . . . . . . . . . . . . . . . . . . . . . 537.9 Outlier detection results without fading. . . . . . . . . . . . . . . 567.10 Outlier detection results without fading, notice that the nor-

mative behaviour has changed from the previous behaviour haschanged. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

7.11 Outlier detection results with fading using Algorithm 1. . . . . . 577.12 Outlier detection results without fading, notice that the nor-


7.13 Outlier detection results with fading using Algorithm 2. . . . . . 587.14 Outlier detection results without fading, notice that the nor-


7.15 Percentual increase due to fading. First week in the dataset. . . . 617.16 Percentual increase due to fading. Second week in the dataset. . 627.17 Percentual increase due to fading. Third week in the dataset. . . 637.18 Non-parametric classification, training window: 2007/03/28 10:02

- 2007/03/28 10:04. . . . . . . . . . . . . . . . . . . . . . . . . . . 667.19 Non-parametric classification, training window: 2007/03/28 10:04

- 2007/03/28 10:06. . . . . . . . . . . . . . . . . . . . . . . . . . . 67

8.1 Transition probability matrices for the novice and senior sub-groups in the Configuration group. The colors correspond to theintensity of the transitions. . . . . . . . . . . . . . . . . . . . . . 69

8.2 Example of a situation where a first order Markov Chain may beinsufficient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

8.3 Example of a MC with extended states. . . . . . . . . . . . . . . 72

List of Tables

2.1 Example database. . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2 Frequent rules with support greater than 0.2 and confidence greater

than 0.5. Singletons are omitted. . . . . . . . . . . . . . . . . . . 9

6.1 Department groups in the MIT Badge dataset. . . . . . . . . . . 45

7.1 F1 scores of classification of the department groups. First weekin the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

7.2 F1 scores of classification of the department groups. Second weekin the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

7.3 F1 scores of classification of the department groups. Third weekin the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

7.4 Percentual change in F1 scores of classification with respect tothe classifier without fading. First week in the dataset. . . . . . . 53

7.5 Percentual change in F1 scores of classification with respect tothe classifier without fading. Second week in the dataset. . . . . 54

7.6 Percentual change in F1 scores of classification with respect tothe classifier without fading. Third week in the dataset. . . . . . 54

7.7 Percentual change in F1 scores of classification with respect tothe classifier without fading. First week in the dataset. . . . . . . 54

7.8 Percentual change in F1 scores of classification with respect tothe classifier without fading. Second week in the dataset. . . . . 54

7.9 Percentual change in F1 scores of classification with respect tothe classifier without fading. Third week in the dataset. . . . . . 54

7.10 First week in the dataset. The values correspond to the precisionof the outlier detector. . . . . . . . . . . . . . . . . . . . . . . . . 60

7.11 Second week in the dataset. The values correspond to the preci-sion of the outlier detector. . . . . . . . . . . . . . . . . . . . . . 60

7.12 Third week in the dataset. The values correspond to the precisionof the outlier detector. . . . . . . . . . . . . . . . . . . . . . . . . 60

7.13 Subgroups of the Configuration group. . . . . . . . . . . . . . . . 617.14 Precision of the novice subgroup outlier detector. The last col-

umn represents the precision of a random outlier detector. Firstweek in the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 61

7.15 Precision of the novice subgroup outlier detector. The last col-umn represents the precision of a random outlier detector. Secondweek in the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 62

7.16 Precision of the novice subgroup outlier detector. The last col-umn represents the precision of a random outlier detector. Thirdweek in the dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 62

7.17 Discovered groups with clustering threshold C = 8, min supp =0.1 and min conf = 0.1. The algorithm identified 27 groups, allof which were only containing one badge ID. . . . . . . . . . . . . 63

7.18 Discovered groups with clustering threshold C = 8, min supp =0.2 and min conf = 0.2. The algorithm identified 27 groups, allof which were only containing one badge ID. . . . . . . . . . . . . 64

7.19 Discovered groups with clustering threshold C = 9, min supp =0.1 and min conf = 0.3. . . . . . . . . . . . . . . . . . . . . . . . 64





7.24 F1 scores of the department group classifications on 2007/03/28. 65

8.1 The number of measurements of the supposed novice group variessignificantly over weeks in the dataset. . . . . . . . . . . . . . . . 69

Chapter 1

Introduction

Anomaly detection is performed by analyzing and modelling interactions be-tween entitites or groups based on some behavioural data. Behaviour is of-ten defined as a set of attributes of the participating individuals that can bemeasured objectively [3] like physical proximity, telephone call duration andfrequency to others etc. It is important to be able to automatically detect devi-ations from the normative activity pattern because in many applications manualanalysis is often infeasible due to complexity, information overload and fatiguefor human operators[4].Example of applications include:

• Intrusion detection systems: Card access control systems are increas-ingly being used to enhance the security of buildings and strategic areasby policy regulations for employees entering security doors. It is how-ever relatively easy to bypass such systems either by simply following anauthorized person through a door or by stealing another person’s card.

• Fraud detection: In many cases, credit card fraud can be detected bymonitoring the patterns in geographic locations of the transactions.

Detecting anomalous activities requires either knowing the normal course ofevents [5] or how the deviations manifest themselves. For example, in the con-text of traffic analysis anomalies could be characterized by the speed, directionor the type of vehicles (car, motorcycle, pedestrian etc.). Usually there are twomeans for obtaining this type of information: consulting with a human domainexpert to set-up predefined rules, or learning from historical data[4]. It is of-ten argued that the use of predefined rules from a domain expert is insufficientdue to the complexity of the underlying dynamical system and lack of expertknowledge that covers all possible situations [6]. On the other hand, there maynot be sufficient data available for the automatic set up of rules to filter rareincidents.

This thesis constitutes an empirical investigation of both rule-based and histor-ical data based outlier detection algorithms for use in intrusion detection.

1

1.1 Anomalies

Anomalies or outliers can be understood as deviations from some normativepattern. Unfortunately there is no unified formal definition of what characterizesa ”deviation”. In statistics, the following definition is used

”An outlying observation, or ”outlier,” is one that appears to deviatemarkedly from other members of the sample in which it occurs.” [7]

By narrowing down the abstract notion of other members in the definition above,anomalies can be split into the following categories[8]:

• Point anomalies: Single data points that deviate significantly from allother observations in the dataset.

• Contextual anomalies: Point anomalies that deviate significantly fromall other data points in a neighborhood of the data point itself. For exam-ple, the outside temperature of 25C in Stockholm sometime in January isa contextual anomaly since 25C is considered normal during summer butnot during winter. The context in this example is season.

• Collective anomalies: A set of neighbouring point or contextual anoma-lies.

It is important to observe that a good outlier detector should discover contextualrather than point anomalies since the dataset may have a dynamic nature inwhich an anomalous pattern at some time t may not necessarily be anomalous att+∆t. Furthermore, the quantity of collective anomalies should be relatively lowsince the outlier detector should ideally be able to ”forget” obsolete observationsin order to accommodate itself to new patterns [9].

1.2 Background and motivation

The European Union funded Security UPgrade for PORTs1 (SUPPORT) projectaims to increase the security at European port facilities by upgrading legacysurveillance systems. One problem in today’s surveillance systems is the highfalse-positive alarm rate in some subsystems. The project has explored differentways of reducing this false-positive alarm rate.In order to understand what is causing these alarms, it is hypothesised thatthe majority of the alarms can linked to different types of behaviours which areanomalous. The main goal of the thesis is thus to look for anomalous patternsin a behavioural dataset. These behaviours can eventually be incorporated intothe surveillance systems to prevent the triggering of false-alarms.The dataset provided by the project is confidential and has restricted access.As an alternative, a publicly available dataset arose with supposedly similarcharacteristics. At the beginning I have been using both the confidential andthe publicly available datasets parallel. Due to the constrained access to theconfidential dataset it was decided that the publicly dataset will be used for thethesis’ purposes.

1http://supportproject.info/

2

1.3 Nature of data

This section provides a brief overview of the datasets used in the thesis. For amore detailed description the Reader is referred to Chapter 7.

• The confidential dataset contains time stamped logs from a card accesscontrol system at a harbour facility. Each data point represents the textualidentifier of the harbour area in which an employee identified by his/herunique ID currently resides in. It is transaction based in the sense thatdata is delivered only if there is movement between different areas. Theformat of this dataset is

<Day, Time, Site Number, Location, Card number, Employee ID, Area>

• The publicly available dataset on the other hand contains time-stampedgeographic locations sampled at a predefined sampling period for employ-ees in an office environment. Each data point represents thus the mo-mentarily estimated position of an employee identified by his/her uniqueID with respect to some fixed Cartesian coordinate system. The datasetcontains position data collected from 39 employees over a period of onemonth[2]. The number of employees working in the office varies signifi-cantly during a typical day. In the mornings, less than 10 people are activewhile later on the afternoon the number of employees is higher. A typicaltrajectory from this dataset is shown in Figure 1.1. The format of thisdataset is

<Day, Time, Employee ID, X,Y>

Despite the obvious differences between the two datasets, the confidential datasetmay be transformed into geometric trajectories by introducing an abstract co-ordinate system. We now formalize the notion of geometric trajectories in Def-inition 1.

Definition 1. Trajectory: Consider a sequence of a finite number of pointspoints P = 〈p1, p2, . . . pn〉. The piecewise linear curve C connecting the vectors−−→p1p2,−−→p2p3, . . . ,−−−−→pn−1pn is called a trajectory.

In the thesis, data streams are segmented into a number of batches, each contain-ing observations from fixed length time windows. Due to the dynamic natureof the data streams, these batches contain varying number of measurements.As an extreme example of this variation, outside of normal working hours, thedata batches will be empty. As we shall see in Chapter 2 this fact is one of thechallenges of trajectory classification/outlier detection.

3

Figure 1.1: Typical trajectory of an employee during a sliding win-dow. The axes correspond to the x and y coordinates in the officeenvironment.

1.4 Problem description

The purpose of the the work in this thesis is to detect contextually anomalouspatterns in a two dimensional trajectory dataset. By the complex nature ofthe human behaviour such anomalies can arise in multiple ways. For example,employees may steal another person’s card or piggyback an authorized personthrough security doors. In this thesis, we focus on the following types of outliersin the aforementioned trajectory datasets:

• Anomalous trajectories with anomalous positions This correspondsto the situation where the trajectory is outside the normal region to whichthe trajectories are normally confined to. A possible interpretation of thisis when an individual visits a spatial region he/she rarely visits or haslimited access to. An example of such situation is shown in Figure 1.2a.

• Anomalous trajectories with normal positions Even though the tra-jectory may be located inside the normal region to which the trajectoriesare confined to, the temporal ordering of the trajectory points may berare. A possible scenario is when an individual bypasses a security doorby following another employee. Notice that this type of anomaly corre-sponds to the whole collection of outliers in the spatial derivatives such asvelocity, acceleration etc. An example of such situation is shown in Figure1.2b.

4

(a) Example of a trajectory markedas an outlier because of its anoma-lous segments. This may correspondto the following situations: an indi-vidual visit a room/spatial region thathe/she normally does not visit.

(b) Example of a trajectory marked asan outlier because of its anomalous or-dering of sample points. Notice thatthis type of anomaly also include out-liers in speed, acceleration etc.

Figure 1.2: Types of trajectory anomalies.

So far in the thesis, the focus has been on sub-trajectory outliers (anomalouspositions, instantaneous velocities etc.). It is however important to define therequirements for a whole trajectory (within a data frame) to be consideredanomalous. Unless stated otherwise, it will be assumed that if a trajectorycontains at least one anomaly of any kind, then the whole trajectory will beconsidered as anomalous.

We have now formalized the notion of trajectory outlier but have been relativelyvague in which context we expect to find them in: intra-personal (behaviour ofthe same person) or interpersonal (behaviour of a group)? It is assumed thatinterpersonal behaviour is more interesting to model than that of individualson their own for two reasons: First, according to [10], employees tend socializemore often with others with the same role/expertise since their cubicles areoften situated close by. In other words, there is a natural grouping of employeeswhose behaviour is assumed to easier to model than that of individuals. Second,the modelling of groups can at any time be reverted back to individuals.

1.5 Outlier detection and Classification

In this section, we explore the similarities and differences between outlier de-tection and classification.In machine learning, classification is defined as the process of creating a classify-ing function based on some training data. The classifier is used to automaticallyassign class labels to new observations. Depending on the availability of labelsin the training dataset, there are usually two approaches to this: supervised andunsupervised classification. In supervised classification, it is assumed that thecomplete training dataset is labelled, ie. there is a ground truth. In contrast,unsupervised algorithms do not require labelled data, instead the classes aredetermined implicitly by the characteristics of the data.

5

With this in mind, outlier detection can be understood as a binary (two classes:normal or anomalous) unsupervised classification problem. The underlying as-sumption for data characterization is that normal data points are far morefrequent than he anomalous data points. Thus, the problem of finding algo-rithms for outlier detection in trajectory data is closely related to trajectoryclassification.

1.6 Contribution

The thesis has reached the following main results:

• The main result of the thesis is an improved trajectory outlier detectionalgorithm that can be used to more accurately detect contextual anomaliesin datasets with dynamic behaviour.

• The thesis work also resulted in the creation of a new approach to trajec-tory classification which does not require feature selection.

• An algorithm is proposed to discover the normative behaviour of a dataset.This may be required for the anomaly detector if there is no such a prioriknowledge in the dataset.

1.7 Organization

The rest of the thesis is organized as follows. The first 4 chapters summarizethe result of my literature investigation and my customization of the algorithmsfound. The theoretical foundation is necessary to understand, use and possiblyfurther develop the source code I wrote which can be found in the Appendix.References to these programs are linked from the results they produced.

Chapter 2 introduces the concept of group discovery that is necessary for outlierdetection.Chapter 3 gives the necessary background in trajectory analysis and discussesits challenges.Chapter 4 introduces the concept of non-parametric classification and describesan algorithm to classify trajectories based on their geometric properties alone.Chapter 5 describes an algorithm to classify trajectories based on their semanticinformation and introduces the Extensible Markov model.In Chapter 6, the data set used in the project is also described in detail.In Chapter 7, the results of non-parametric and parametric analysis are pre-sented with explanations.Finally, in Chapter 8, the thesis is concluded with a discussion about the findingsand suggests ways to continue this work.

6

Chapter 2

Group discovery

As described in Chapter 1, the problem of outlier detection either necessitates ana priori knowledge of the normal course of events or some subset of a dataset inwhich the normal events are far more frequent than the outliers. Since the aim ofthe thesis is to identify anomalous activity in groups, the primary task is thus toidentify groups of individuals with similar long-term behaviour. Before movingon, it is important to define what is meant by group in this application. A groupcan be thought of an equivalence class having the following three properties:

• Reflexivity: A has the same property as A.

• Symmetry: If A and B have the same property then so does B and A.

• Transitivity: If A and B, and B and C have the same property, then sodo A and C.

Throughout this Chapter, it is assumed that there is an existing database con-taining distinct sets of individuals with the same instantaneous behaviour pat-terns at different times. Each record in the database - called a transaction - canthus be understood as a combination of individuals with the same behaviour attime t. One possible way of how such databases can be constructed is discussedin Chapter 6. The problem of finding the underlying long-term structure of suchdatabase leads over to the concept of data mining and association rule mining.

2.1 Association rule mining

In the field of data mining, association rule mining is defined as the processof discovering patterns in large databases [11]. Example of such pattern isbread, butter → ham which indicates that if a customer buys bread and buttersimultaneously, he/she is also likely to buy ham. It is important to observethat in the previous example the patterns are unordered, ie. the ordering ofproducts is not important bread, butter = butter, bread. Such patterns arecalled association rules while bread, butter, ham are referred to as items. Acombination of items is called an itemset. The formal definition of associationrule is given in Definition 2.

7

Definition 2. Association rule [12]: Statements on the form X,Y → Z that areused to represent the predicted occurrence of an item Z based on the occurrenceof other, possibly multiple items X and Y , where X,Y and Z are disjoint items.

The validity of an association rule is usually determined by its support and con-fidence values. The support of an association rule X → Y measures how oftenthe rule is correct with respect to some dataset. That is, how often two items Xand Y occur in the same transaction in the database. The confidence measureshow often the itemset Y appears in transactions containing X. In other words,the confidence can be understood as the conditional probability P (Y |X) whichmeasures the validity of the inference X → Y . It is important to understandthe difference between support and confidence, support measures the undirectedrelation while the confidence measures the directed relation between the items.

Given a database D, the goal of association rule mining is to find all rules Rthat satisfy the following conditions:

support(R) > min supportconfidence(R) > min confidence

where min support and min confidence are user-set thresholds. Frequent as-sociation rules are characterized by high support and confidence intervals.

2.1.1 Apriori algorithm

In this section, the Apriori algorithm [13] is presented to extract frequent asso-ciation rules from databases. The algorithm recursively finds association rulesby extending previous rules according to the Apriori principle.

Definition 3. Apriori principle [13]: An itemset containing k items is frequentonly if all of its k − 1 supersets are frequent.

The Apriori algorithm is shown in Algorithm 1 below. It starts by determiningthe frequent single items in the database At each iteration, the size of thepreviously discovered rules are increased by one. These new rules are formed byjoining two similar rules from the previous iteration. This is done by taking theCartesian product of the two previous rules and by verifying that the frequencyof the newly generated rule satisfies the minimum support inequality. Finally,the confidence values of each found rule is computed and only those that meetthe minimum threshold are retained.

8

Algorithm 1 Apriori algorithm

1: procedure Apriori(transactions,minsup)2: L1 ← [frequent 1-itemsets in transactions]3: k ← 24: while Lk−1 6= ∅ do5: Ck ← candidate generated from Lk−16: for transaction in transactions do7: Ct ← candidates in Ck that are contained in transaction8: for candidate in Ct do9: count[candidate]← count[candidate] + 1

10: end for11: Lk ← candidates in Ct with counts greater than minsup12: end for13: end while14: return ∪kLk15: end procedure

2.2 Finding groups

Table 2.1 shows an example of a transactional database containing equivalenceclasses over the alphabet A,B,C,D. Table 2.2 shows the frequent associationrules of these transactions that satisfy minsup > 0.2 and minconf > 0.5.It is now proposed to transform this set of rules into a directed graph by let-ting the edges represent the relations between the items. The problem of findingequivalence classes can than be transformed into finding the cycles in this graph.A cycle is a directed sequence of vertices starting and ending at the same vertex.

ID Transaction1 A B C2 B A3 C A4 B C5 A B D

Table 2.1: Exampledatabase.

ID Frequent patterns1 D → A2 D → B3 C → B4 B → C5 C → A6 A → C7 B → A8 A → B9 B,D → A10 A,D → B11 B,C → A12 A,C → B

Table 2.2: Frequent ruleswith support greater than0.2 and confidence greaterthan 0.5. Singletons areomitted.

Theorem 2.2.1. In order to represent the frequent association rules as a graph,it is sufficient to only discover the rules between single itemsets.

9

Proof: By the Apriori principle in Definition 3., it holds that any itemset longerthan 2 items can be decomposed into a set of 2-items since the supersets havealways greater (or equal) support than the subsets. Conversely, the supersets ofan infrequent itemset are also infrequent.

Figure 2.1 shows the graph constructed with the use of the frequent rules 1-8 ofTable 2.2. Complex rules, such as A,C → B may be recovered by the followingthe edges A→ C and C → B.

Figure 2.1: Graph of frequent rules containing single itemsets. Thisgraph contains two groups/equivalence classes: (A,B,C) and (D).

The problem of finding such cycles has been studied thoroughly and severalalgorithms have been proposed such as Tarjan’s strongly connected componentsalgorithm [14] and the Sharir algorithm [15]. The description of these algorithmsare beyond the scope of the thesis and the Reader is referred to the publicationsfor more details.

Finally, it is worth mentioning that the proposed group discovery algorithmdescribed above can also be used as a primitive outlier detector. Suppose thatin the database from the office environment, only employee A works outside ofduring normal working hours. This situation corresponds to that the specificindividual will mostly occur by itself in the database containing the combinationof individuals with the same behaviour.

Such individuals may already be captured by the Apriori algorithm with therule A→ A which corresponds to a simple loop in the graphical representation.Thus, groups only containing one individual may be understood as outliers withrespect to the rest of individuals.

10

Chapter 3

Background on trajectoryanalysis

This chapter introduces the necessary concepts and methods for trajectory anal-ysis which will be used in chapter 4.

With the wide availability of wireless positioning systems such as GPS, RFID,WiFi, ZigBee etc. huge amounts of location based data may be recorded withhigh temporal and spatial resolution. The main difficulty in analysing suchdatasets is their temporal nature which may lead to sequences with differentlengths.In the literature, two approaches to trajectory analysis can be found: the firstmaps the sequences into a fixed dimensional space and uses usual dissimilaritymeasures. The other approach is to use special dissimilarity measures that donot require fixed dimensions. In this Chapter, several preprocessing techniquesare discussed to overcome the problem of dimesionality and finally a number ofspecial dissimilarity measures are introduced.

3.1 Dimensionality reduction

The idea of dimensionality reduction is to represent a trajectory in a time win-dow with a small fixed number of parameters [16]. Clearly, the success of thisapproach depends to a large extent on the choice these parameters. A naturalway of mapping a trajectory into a fixed dimensional space would be to computethe position, velocity and acceleration of an object at regular time increments.However, the choice of parameters is often dependent on the specific application.In a more structured fashion, such parameters can be determined by principalcomponent analysis (PCA) [17].

3.2 Normalization

Normalization is used to ensure that all sequences have the same length by eithercompressing or stretching individual sequences. Examples of such techniques aretrajectory extension and resampling [16]. When extending a trajectory, the last

11

known dynamics of a trajectory is used to estimate extra hypothetical trajectorypoints. When re-sampling, linear interpolation is used to reduce the number ofpoints in the longest sequences. After normalization, usual metrics such as theEuclidean distance can be used to measure the dissimilarity. However, thisapproach is found to perform poorly in practice [16].

3.3 Trajectory dissimilarities

In the literature the following three dissimilarities are suggested to compare twotrajectories:

• Dynamic Time Warping

• Hausdorff distance

• Frechet distance

3.3.1 Dynamic Time Warping

Dynamic Time Warping (DTW) is one of the most used dissimilarity measurefor sequences with different lengths[18] and was chosen as the starting point forthe thesis. Simply put, the objective of DTW is to either compress or stretchthe time axis in some places of the sequences so that the individual points ofany two sequences can be mapped to each other in an optimal way.

So far, this reminds of the re-sampling technique discussed in Section 3.2. How-ever, instead of having a predetermined strategy for the re-sampling, the DTWaims to find a dynamic re-sampling strategy which gives the smallest distancebetween the sequences. A mapping function φ = (φP , φQ) of length n, mapsindividual points between two sequences P and Q so that P (φP (i)) is matchedto Q(φQ(i)) for i ≤ n. The dissimilarity between the sequences is defined as thesum of distances between the two trajectories’ corresponding points:

dDTW (P,Q) =

n∑i

d(P (φP (i)), Q(φQ(i))) (3.1)

where d is a ”normal” distance function used to measure the dissimilarity be-tween points in a mapping. The goal is to determine a warping of the time scaleso that the total distance in Eq. 3.1 is minimized. An example of mapping twotrajectories with different lengths is shown in Figure 3.1

12

Figure 3.1: Example of a mapping of points between two curves ofdifferent length.

Normally, there are several possible mappings that can be used to minimize thedistance between two sequences. Therefore, it is necessary to impose certainrestrictions on the mapping[18]:

• It must preserve the continuity of the trajectories. Consider two adjacentpoints P (k−1) and P (k) on the blue curve and another point Q(k−1) onthe red curve. If the point P (k−1) is mapped to the point Q(k−1), thenthe point P (k) may not be matched to any point that precedes Q(k − 1).Notice that this does not mean that each point on the first curve must bemapped to unique points on the other curve: it is possible that the samepoint on the blue curve is mapped to other points on the red curve.

• The start and end points of the curves are always matched to each other.This ensures that the entire curves are matched.

Unfortunately, it can be shown that the DTW fails the triangle inequality [19]and is thus not a valid metric. As we shall see in Chapter 4, the metric propertywill play a central role in our quest for a trajectory classifier/outlier detector.

3.3.2 Hausdorff distance

The Hausdorff distance between two curves P and Q is the smallest distance dsuch that the set of vertices of P is contained in a circle with radius d aroundevery point in the vertex set of Q.

Definition 4. Directed Hausdorff distance [20]: Let P and Q be any two tra-jectories in R2 and let || · || denote the Euclidean norm. The directed Hausdorffdistance is defined as

δH(P,Q) = maxq∈VQ

minp∈VP

||p− q||

where VP and VQ represents the set of vertices in P and Q, respectively.

An important observation is that the Hausdorff distance identifies each curve byits vertices only and does not take the ordering of the trajectories into account:a small Hausdorff distance does not necessarily correspond to a close match.Figure 3.2 shows an example of two trajectories whose vertex sets are closelyrelated even though the trajectories itself are not.

13

Figure 3.2: Two trajectories with a small Hausdorff distance butwith a large Frechet distance.[1]

This means that the Hausdorff distance can only measure geographic proximitiesbetween trajectories. Additional information such as speed, direction are nottaken into consideration. Furthermore, it should be noted that the directedHausdorff distance is assymmetric and is therefore not a metric. A more generalversion of the Hausdorff distance is given in Definition 5.

Definition 5. Undirected Hausdorff distance [20]: Let P and Q be any twotrajectories in R2 and let δH denote the directed Hausdorff distance. The undi-rected Hausdorff distance is defined as

dH(P,Q) = max(δH(P,Q), δH(Q,P ))

According to [21], the undirected Hausdorff-distance is a metric. Even thoughthe Hausdorff distance does not take the continuity of the trajectories into con-sideration, it is assumed in the thesis that it may be appropriate to differentiatebetween groups in an office environment. Under normal office conditions, it isplausible to assume that the cubicles of employees who have the same role areoften situated closer to each other than to other employees. Thus, there maybe sufficient information available in the geographic positions alone to be ableto classify trajectories to some degree.

3.3.3 Frechet distance

The Frechet distance is another curve dissimilarity measure that takes the conti-nuity of the curves into account and is therefore believed to be better suited forcapturing contour dissimilarities than the Hausdorff distance. First, we definethe monotone parameterization of a trajectory.

Definition 6. Monotone parameterization: A monotone paramterization of atrajectory P of length N is a monotonic function f(t) : [0, 1]→ [0, N ] such thatf(0) = 0 and f(1) = N .

Simply said, the reparameterization is a way to introduce new samples, ie. re-sample a trajectory. In this way, the Frechet distance reminds of the DTW inthe sense that both rely on a certain resampling of the trajectories.

14

The Frechet distance is often explained as follows[1]: suppose a man is walkinghis dog on a leash and both of them are constrained to follow their own tra-jectory. Both the man and his dog can control their speed on their own butare not allowed to go backwards on their path. The Frechet distance is thenthe minimum length of the leash that is needed so that both of them can travelthrough their polygonal path from start to end. The formal definition is givenin Definition 7.

Definition 7. Discrete Frechet distance [1]: Let P and Q be any two planarcurves in R2 with lengths N and M , let α and β be any monotone reparametriza-tions of P and Q, respectively. Let d be a distance function in R2. The Frechetdistance is defined as

dF (P,Q) = minα,β

maxt∈[0,1]

d(P (α(t)), Q(β(t)) (3.2)

In contrast to the previously introduced Hausdorff metric, the Frechet distanceis able to summarize multiple aspects of a trajectory curve in a concise way:Clearly, the minimal length of leash is closely related to the relative positionand orientation of two trajectories. When two trajectories are close and alignedwith each other, the minimum of leash required is shorter than for two curvesthat are orthogonal and far from each other. Trajectories with different speedstypically require a longer leash to map the start and end points to each other.According to [1], the Frechet distance is a metric.

Computing the discrete Frechet distance

The optimization problem in Eq. 3.2 cannot be minimized directly for all pos-sible choices of two trajectories in R2. Instead, an iterative method is used todetermine the minimal leash length ε. The algorithm is initialized with a guesson the minimal leash length ε = ε0: if it is long enough to couple the two tra-jectories then ε is decreased otherwise it is increased.

Given a guess on the minimal leash length ε = ε0, we define the Free Spacediagram as all pair of points on the trajectories P and Q whose distance to theother curve is at most ε.

Definition 8. Free Space diagram [1]: Let α(t) and β(t) be any monotoneparametrizations of the two trajectories P and Q, respectively. Further, let ε ≥ 0and d be a distance function be given. The corresponding Free Space diagram isthen the set

Fε = (α(t), β(t))|d(P (α(t)), Q(β(t)) ≤ ε (3.3)

The Free Space diagram is then used to verify that the guess ε = ε0 is longenough to couple the two trajectories. It can be shown that if there exists apath from the origin to the top right corner of the Free Space diagram which ismonotone in both coordinates, then ε solves the coupling problem dF ≤ ε. Anexample of two trajectories with different lengths and the corresponding FreeSpace diagram is shown in Figure 3.3.

15

(a) Example of two trajectories. (b) Corresponding Free Space dia-gram with an example of a monotonecurve in both coordinates from (0, 0)to (4, 5).

Figure 3.3: Graphical representation of the Frechet distance calcu-lation.

Each of the axes in figure 3.3b are used to represent the index of the line segmentsof the two trajectories in figure 3.3a separately. The blue trajectory containsfive segments which are numbered on the vertical axis in the Free Space diagramin figure 3.3b. The red trajectory contains four segments which are representedby the horizontal axis.

A position (x, y) in the Free Space diagram represents thus the situation wherewe try to couple the segments P (x) and Q(y) of two trajectories P and Q. Theblack regions correspond to situations where the leash is too short to couple thecurrent line segments. For example, it can be seen that the bottleneck occursbetween the 2nd and 3rd segment of the red trajectory and the 3rd and 4thsegment of the blue trajectory. Geometrically speaking, the bottleneck corre-sponds to the most distant segments of the two trajectories. These segmentshave been marked as bold in figure 3.3a for clarity.

16

Chapter 4

Non-parametric trajectoryanalysis

In this Chapter, the class of non-parametric outlier detection and classificationalgorithms are introduced. Non-parametric models are characterized by thatthey do not rely on any particular assumption of the nature of the data, neitherdo they require a predetermined model structure. In the following, popularnon-parametric analysis tools are discussed.

• Nearest neighbour clustering assumes that normal data points are closelypacked while anomalies are located far from other points or in regionswith low density. They are well suited for unsupervised learning but theresults are strongly dependent on the number of available data points. Ifnormal points do not have significantly higher number of neighbours thanoutliers, the classification may fail completely [22]. As the public datasetonly contains a limited number of employees, nearest neighbour clusteringwill not be used.

• Support Vector Machine (SVM)[23] is another commonly used data clas-sification and prediction technique. In contrast to nearest neighbour clus-tering, SVM is completely supervised but does not require a large numberof data points. Hence, regular SVM is more appropriate in situationswhere the anomalous and regular activities are known a priori. In thissection, we will assume that such knowledge is available to us and willdescribe an SVM based trajectory classifier.

In SVM classification, the task is to find an optimal hyperplane that separatesthe different classes of data. In prediction, the hyperplane is used to assignclass labels to previously unobserved data based on some characteristics, calledfeatures of the dataset.

17

4.1 Introduction to SVM

In this section we will describe binary classification. The goal is to train a linearclassifier on some linearly separable data D

D = (xi, si)|i = 1 . . . n

where xi are the individual data points (vectors) and si ∈ −1, 1 is the classlabel variable. In other words, the task is to find a hyperplane f(xi) = wTxi+b,where w and b are parameters such that

f(xi) =

wTxi + b ≥ 1, if si = 1.

wTxi + b ≤ −1, if si = −1.(4.1)

There might be several possible set of parameters 〈w, b〉 that satisfy Eq. 4.1but we would like to determine which one is best at separating the data. Theproblem now is that the underlying true probability distribution is normallyunknown, all we have is D which is a sample from that distribution. Noticehowever, that if we could repeat the sampling process somehow, we would getanother sample D′ whose general structure would be similar to that of D. Ifwe were to train a new hyperplane on D′ and compare it to our original hy-perplane trained on D, we would therefore expect that they resemble each other.

With this in mind, we would want choose the hyperplane that is the moststable with respect to changes in the dataset. In other words, we want to selectthe hyperplane whose orthogonal distances to the nearest data points are aslarge as possible. Mathematically, this distance can be written as

di =si(w

Txi + b)

||w||(4.2)

Namely, let x′i be the orthogonal projection of xi on the hyperplane. By defini-tion, wTx′i + b = 0 since xi is on the hyperplane. To get from x′i to xi we needto move disi along the normal vector of the hyperplane:

xi = x′i + disiw||w|| → wTxi = wTx′i + disi

wTw||w|| = −b+ disi||w||

di = si(wTxi+b)||w||

where we used si = 1si

. To summarize, the problem of finding the best separatinghyperplane is

maxw,bmin

i

si(wTxi + b)

||w|| (4.3)

This is a non-convex optimization problem. According to [24], it can be shownthat problem can be transferred to the following convex optimization problem

minimizew,b

1

2||w||2

subject to si(wTxi + b) ≥ 1, i = 1, . . . , n.

(4.4)

18

4.1.1 Feature mapping

Until now, we have restricted ourselves to linearly separable datasets and as-sumed that the separating hyperplane depends linearly on the input data:f(xi) = wTxi + b. However, for some datasets which are not initially lin-early separable it may be fruitful to pre-process the inputs before training theseparator hyperplane. This is called feature mapping.

Definition 9. Feature mapping: Preprocessing of the input data using a featuremap Φ : Rn → Rm where Rn and Rm is called the input and feature space,respectively.

Instead of training directly on the input space as in Figure 4.1, it is useful tofirst map the data points so that they become linearly separable.

Figure 4.1: A possible way to pre-process data is to change thecoordinate system from Cartesian to polar.

4.1.2 Linearly non-separable datasets

The linear classifier described in the previous section has one major shortcom-ing: it is limited to datasets that are linearly separable. If we were to applythe classifier on some non-separable data, the feasible set would be empty andthe optimization problem in Eq. 4.4 would be infeasible. In this section we willrelax this assumption and allow for some data points to be misclassified in thehope of being able to classify more interesting datasets.

The idea is to modify Eq. 4.4 so that the misclassified data points are penalizedby a factor proportional to their classification error.

minimizew,b

1

2||w||2 + C

n∑i=1

ξi

subject to si(wTφ(xi) + b) ≥ 1− ξi, i = 1, . . . , n.

ξi ≥ 0, i = 1, . . . , n.

(4.5)

Where ξi is a so called slack variable measuring the classification error of thefeature vector φ(xi) and C is the misclassication penalty factor.

19

Figure 4.2: Example of a linear classifier applied to a linearly non-separable dataset in R2.

To see how this works, consider the following cases in Figure 4.2.

• If ξi = 0, then sixi ≥ 1 which means that the feature vector is correctlyclassified. Note that in this case, the feature vector does not contribute tothe objective function. The feature vector has no influence on the choiceof the hyperplane.

• If 1 ≥ ξi ≥ 0, then sixi ≥ 0 which means that the feature vector iscorrectly classified but lays in the marginal area of the hyperplane. Notethat even though it is correctly classified, the feature vector contributes tothe objective function. Removing this point will change the hyperplane.Points like these are usually referred to as support vectors.

• If ξi ≥ 1, then sixi ≤ 0 which means that the feature vector is misclassified.This point has the same type of effect as a correctly classified feature vectorin the margin area. Note that however, the effect of misclassified pointsis larger than that of correctly classified points in the margin area.

In summary, the sum∑ni=1 ξi can be seen as an upper bound on the number of

training errors. The larger the value of C the larger the penalties for misclas-sification. Observe that by setting C =∞ we get back the hard-classifier fromthe previous section.

4.1.3 Dual formulation of the SVM

Instead of solving the optimization problem in Eq.4.5 directly, usually it is trans-formed into another equivalent formulation which can be solved more efficiently.This is the so called dual problem which is obtained by expressing the weightsw as a linear combination of the feature vectors:

w =∑

αiφ(xi) (4.6)

Theorem 4.1.1 proves the correctness of this transformation.

20

Theorem 4.1.1. Representer theorem [23]: Given a weight vector w ∈ Rn,the component w⊥ perpendicular to the space spanned by φ(x) has no directeffect on the classification performance and just adds extra costs to the objectivefunction in Eq. 4.5.

Proof: We can write w as the sum of the two components w = w‖ + w⊥ wherew‖ = ΦTα for some α ∈ Rn and w⊥ΦT = 0. The equation for the separatinghyperplane can thus be written as

f(x′) = wTΦ(x′) + b = (w‖ + w⊥)TΦ(x′) + b = wT‖Φ(x′) + 0 + b (4.7)

The intuition behind Eq. 4.7 is that the contribution w⊥ can not help in de-creasing the classification error in Eq. 4.5. Observe however that

||w||2 = ||w‖||2 + ||w⊥||2 ≥ ||w‖||2 (4.8)

One major consequence of Theorem 4.1.1 is that instead of minimizing over w,we may as well optimize over α without any loss of optimality. This leads overto the concept of kernelization.

4.1.4 Kernelization

By substituting Eq. 4.6 into Eq. 4.7, we get

f(x′) =

n∑i

αiφ(xi)Tφ(x′) + b = αk(x,x′) + b (4.9)

where we define

k(x,x′) =

n∑i

φ(xi)Tφ(x′) = 〈φ(x),φ(x′)〉 (4.10)

a kernel function where 〈·, ·〉 represents the dot product. In other words, giventhe kernel function we never need to compute the underlying kernel mappingin order to construct the separating hyperplane. This fact enables us to use anarbitrary number of features without actually computing them explicitly.

The intuition behind kernelization is that we transform the linearly non-separabledata points into a higher dimensional vector space in hope that they becomelinearly separable. Choosing the kernel function implicitly determines whichfeatures will be used. It is therefore important to identify what kind of infor-mation we are expecting to extract from the data.

By construction in Eq. 4.10, a kernel function must be symmetric and realvalued. However, not every such function is a valid kernel function. Definition10 gives the necessary conditions for a function to be a positive definite kernel.Using positive definite kernels is often desired since it ensures that the opti-mization problem in Eq. 4.5 is convex and that the minimum is global. [25] Inother words, the computed hyperplane is optimal maximum-margin hyperplaneseparator.

21

Definition 10. Mercer’s condition for positive definite kernels [25]Let k : Rd×Rd → R be a real-valued symmetric function. Consider a collectionof n inputs X = xi ∈ Rd, ∀i = 1 . . . n and define K as the Gram matrix of k

K =

k(x1,x1) k(x1,x2) . . .k(x2,x1) k(x2,x2) . . .

......

. . .

(4.11)

If K is positive definite, then k is a positive definite kernel.

For example, even though the Euclidean distance d : Rn × Rn → R, d =√(x− x′)2 is symmetric and positive definite, k(xi,xj) =

√(x− x′)T (x− x′)

does not give a positive definite Gram matrix, ie. it is not a valid kernel.

To understand why, we will show that valid kernel functions must measure thesimilarity rather than the dissimilarity between features. By using the definitionin Eq. 4.10 it can be shown that the following holds

k(x,x′) =||φ(x)||2 + ||φ(x′)||2 − d(φ(x),φ(x′))2

2(4.12)

where d(φ(x),φ(x′))2 = ||φ(x) − φ(x′)||2 measures the distance between themappings of the data points in the feature space. Equation 4.12 shows thatwhile d measures the dissimilarity, the kernel function measures the similarity[26].In summary, in order to use distances as ingredients in kernel functions, theymust first be transformed to measure the similarity rather than the dissimilaritybetween points. This is normally accomplished by applying a positive definite,monotonically decreasing function on the distance function [27].

Example 1. Typical kernel functions

• Lineark(xi,xj) = xTi xj + C (4.13)

• Radial basis functions (RBF) is a group of functions whose value dependsonly on the distance between two elements x and x’ so that: k(x,x′) =k(d(x,x′), 0) where d is a metric. Examples of RBF kernels include

– Multiquadratic RBF

k(x,x′) =√

1 + α||x− x′||2 (4.14)

where α ≥ 0.

– Inverse multiquadratic RBF

k(x,x′) =1√

1 + α||x− x′||2(4.15)

where α ≥ 0.

22

– Gaussian RBFk(x,x′) = e−α||xi−xj ||2 (4.16)

where α ≥ 0. The value of α determines the spread of the kernel. Ifoverestimated, the exponential will behave almost linearly. If under-estimated, the exponential will be overly sensitive to noise. [28]

Example 1 gives some examples of commonly used kernels in SVM. All of theseexcept the Multiquadratic RBF are examples of positive definite kernels.

Before continuing, it is important to notice that an ideal kernel function shouldmap the input data to a set of normalized feature vectors. If not, features withlarge values may dominate and cause the SVM to neglect features with smallvalues. In other words, normalization in the feature space is necessary in orderfor the SVM to consider all features equally important. By Definition 4.10, it canbe seen that the sufficient condition for this is that k(x,x) = 〈φ(x),φ(x)〉 = 1.Such normalized kernel functions k are usually formed by [29]

k(x,x′) =k(x,x′)√

k(x,x)k(x′,x′)(4.17)

It is easy to verify that the class of RBF-kernels satisfy this requirement.

4.1.5 Multi-Class SVM

So far, the focus has been on binary classification. To be able to classify datasetswith more classes, usually a combination of several binary SVMs is used [30].Popular methods for this are one class versus the rest and one-versus-one.

(a) One vs. all classification. The grayregions correspond to situations thatcannot be decided. The other colorscorrespond to different classes.

(b) One vs. one classification. Thegray regions correspond to situationsthat cannot be decided. The other col-ors correspond to different classes.

Figure 4.3: Different approaches to multi-class SVM.

As it can be seen in Figure 4.3, both methods leave regions where the class ofthe features cannot be decided deterministically. In the central gray area, all

23

binary SVMs reject the features while in the gray areas between the classes,more than one SVM accept the features. Since 1-vs-1 method has a smallernumber of gray areas, it will be used in in the thesis for classifying multipleclasses.

4.2 Trajectory kernels

In this section, we introduce two new SVM kernels for trajectory classification.These kernels are formed by modifying the RBF kernels by replacing their dis-tance measures with the Hausdorff and Frechet distances.

To see how this can be done notice that by Theorem 2.2.1 and 2.2.2 it holdsthat the undirected Hausdorff and Frechet distances are true metrics. Since theMercer condition is equivalent to the triangle inequality for distance functions[31], it holds that both the Gram matrix for both of these distance measuresare positive definite.

We now define the distance substitution kernel [32] kd of an RBF-kernel k by

kd(x,x′) = k(d(x,x′), 0) (4.18)

where d is either the Frechet distance dF or the undirected Hausdorff distancedH .

4.3 Preliminary results

In this section, we are going to evaluate the classification performances of theproposed distance substitution kernels on a test dataset. These results are calledpreliminary, because these are outcomes of running the classification algorithmon a test dataset which has very obvious differences between the classes. Itbasically shows that the algorithm is capable to sort the input dataset into twoobviously different groups.

The dataset is shown in Figure 4.4. As it can be seen, the training set con-tains multiple trajectories with different bearings and lengths. The evaluationdataset contains a dense set of oscillating trajectories whose vertices are con-fined to the same region as the training dataset. The classes in the evaluationdataset are determined by comparing the average bearing of each individualtrajectory and the training dataset.

24

(a) Training dataset, the colors indi-cate the predetermined classes.

(b) Evaluation dataset, the colors in-dicate the true class labels based onthe average bearing of each trajectory.

Figure 4.4: Test dataset for distance substitution kernel basedSVM classification.

4.3.1 Gaussian RBF kernel

Throughout this section, the misclassification penalty factor C in Eq. 4.5 wasset to C = 10.

Frechet distance

(a) Gaussian RBF-Frechet kernel, α =2, the colors indicate the predictedclasses.

(b) Classification results in theFrechet kernel space with α = 2. Thecolors indicate the true classes.

Figure 4.5: Classification results using the Frechet kernel.

25

Hausdorff distance

(a) Gaussian RBF-Hausdorff kernel,α = 2, the colors indicate the pre-dicted classes.

(b) Classification results in the Haus-dorff kernel space with α = 2. Thecolors indicate the true classes.

Figure 4.6: Classification results using the Hausdorff kernel.

Figures 4.5 and 4.6 confirm our intuition from the previous sections: sincethe trajectories in the evaluation dataset are confined to the same region, theHausdorff distance has a hard time classifying them. On the other hand, theFrechet distance based kernel correctly classifies the trajectories based on theiraverage bearing.

4.4 SVM applied to data streams

Training SVMs is normally limited to predefined, static datasets. As describedin Chapter 1, in order to handle data streams with characteristics that changeover time, the model has to be able to forget old observations. Here, a sim-ple extension is proposed to the regular SVM formulation that satisfies thisrequirement and is shown in Figure 4.7.

Figure 4.7: Schematic view of the proposed online SVM model.

26

By introducing a middle-layer of a non-circular buffer with n ≥ 1 elements, theproposed model is able to keep only the most recent n number of dataframesfor the training of the SVM. In this way, the SVM is able to forget data thatis more than n observations old. The length of the buffer is not fixed and canbe changed at any time either by adding empty elements after the rightmostelement, or by removing elements from the right.

Before a new dataframe can be added to the buffer, all buffer elements must firstbe shifted to the left by one position. In this way, the most recent dataframe isstored in the rightmost element while the oldest is stored in the leftmost elementof the buffer. Every time a new dataframe is added, the rightmost dataframe isremoved from the buffer. The contents of the buffer is then released to train anew SVM.

27

Chapter 5

Parametric trajectoryanalysis

In addition to the usual relative geometric properties of trajectories, such asspeed, direction, etc., additional semantic information can be obtained by alsoconsidering the context of the trajectories [33].One possible way of doing this is to identify interesting spatial regions in theoffice environment, such as the kitchen or cubicles of different departments andmap the trajectories to these regions. In other words, a trajectory can also bedescribed as a sequence of regions that it visits. The idea behind this chap-ter is to model the transitions between such geographic regions. It is assumedthat different groups of employees have different transition probabilities betweensuch regions. Markov Chains are readily available to describe/model such statetransitions.

Dynamical systems are processes that evolve over time according to some un-derlying law which is probabilistic. The Markov Chain (MC) and its variationssuch as the Hidden Markov Model (HMM) are one of the most powerful, yetsimplistic tools to analyze and model stochastic processes [34]. In its most sim-ple formulation however, the MC is restricted to stationary systems and hastherefore key issues when applied to real-life dynamical systems:

• The model may be incomplete at the time of construction and thus sensi-tive to the initial selection of parameters. Having a predetermined struc-ture and number of parameters may lead to that the model describes theinherent noise in the data rather than the underlying dynamics of thedata. In statistics, this is often called overfitting.

• Changes in dynamical system behaviour over time should be reflected inthe model. Keeping outdated data may negatively impact the ability ofthe model to classify and identify patterns.

In this section, a novel approach called the Extensible Markov Model (EMM)[35]is presented which deals with the issues stated above. First, let us first recallsome basic definitions concerning regular, time homogeneous Markov Chains.

28

5.1 Introduction to Markov Chains

A discrete-time Markov Chain is a special case of a stochastic process in discretetime and discrete space. At each time instant ti, it is characterized by a sequenceof random variablesX(t) = 〈X1(t), X2(t), X3(t), . . . 〉 taking values in some finiteset.Each random variable Xn can be though of as the state at time tn of somesystem which is governed by a set of probabilistic laws. The finite set of valueswhich the random variables assume is called the state space of the stochasticprocess and is denoted by S = 〈x1, x2, x3, . . . 〉. The process X(t) is called asimple Markov Chain if it satisfies the first order Markov property.

Definition 11. First order Markov property: Given the present state, the pastcontains no additional information for the future evolution of the system. For-mally, the future state is conditionally independent of the past, given the presentstate of the system:

P (Xk+1 = xk+1|X1 = x1, . . . , Xk = xk) = P (Xk+1 = xk+1|Xk = xk) (5.1)

Given a sequence of states s = x1x2 . . . xn over some state space S, we cancharacterize the dynamics of the underlying system by counting the number ofoccurrences of every state in the sequence.

Definition 12. Transition count and probability matrices: Let s0 = x1x2 . . . xnbe a sequence over some state space S. The transition count and probabilitymatrices TC and TP are defined as

TC =

c11 c12 . . . c1nc21 c22 . . . c2n...

.... . .

cn1 cn2 . . . cnn

TP =

p11 p12 . . . p1np21 p22 . . . p2n

......

. . .

pn1 pn2 . . . pnn

(5.2)

where each cij represents the number of state transitions xi → xj in the sequences0. The state transition probabilities are estimated from the transition countsby using the Maximum-Likelihood method [36]

pij =cij∑mk=1 cik

(5.3)

This probability measures how many times the transition xi → xj has occurrednormalized by the number of times the state xi actually occurred.

Suppose we are given a state transition probability matrix TP and an initialstate distribution p of some system. We can think of p as a probability vectorwhere the ith component represents the probability that the system starts instate xi. By the definition of conditional probabilities, the next state distri-bution after one time step is given by p(xj) = p(xj |xi)p(xi) ∀xj ∈ S, or withmatrix notation pn+1 = pnTP . More generally, by using the definition of thesimple Markov model in Eq. 5.1 it can be shown that the state probabilitydistribution after k time steps can be written as

pn+k = pnTkP (5.4)

29

We can thus predict the future behaviour of some group based on the current dy-namics. Before introducing the Extensible Markov Model there is some furtherconcepts that need to be defined.

Definition 13. Time homogeneous Markov Chain: The probability of the tran-sitions is independent of the time instant t:

P (Xt+1 = x|Xt = y) = P (Xt = x|Xt−1 = y)

Time homogeneous Markov chains are often represented as directed graphs,in which the vertices correspond to the states and the directed edges to theprobabilities from going from one state to another.Another important aspect is whether the Markov Chain is irreducible. We willreturn to this in the discussion of the EMM.

Definition 14. Irreducible Markov Chain: A Markov Chain in which it ispossible to get to any state from any other state in a finite number of transitions.

5.2 Extensible Markov Model

The Extensible Markov Model (EMM) is a combination of an adaptive, first or-der Markov Chain and a clustering algorithm. The overall structure is shown inFigure 5.1. During a data frame Dk from the sliding window (tk, tk+1) the EMMis characterized by a time homogeneous Markov Chain. The model’s ability toadapt to changes is achieved by updating the underlying Markov Chain betweensubsequent data frames. During these updates, the EMM is no longer stationarybut the discontinuities are ignored by the assumption that the computationaltime necessary to build and update the EMM is negligible with respect to thedata frame durations.

Each state of the EMM corresponds to a unique cluster. The clustering maybe off-line or on-line, meaning that the clusters are either predetermined ordependent on the data.

Figure 5.1: Schematic view of the EMM construction and mainte-nance.

30

5.2.1 Updating the Markov Chain

In this section, the update procedures for the EMM are described. To achievethis we define the cumulative and delta transition matrices.

Definition 15. Cumulative transition count matrix: The transition count ma-trix Ct is based on all previous data frames up to time t.

Definition 16. Delta transition count matrix: The transition count matrix ∆Ct

is based solely on the current data frame t.

Adding/removing states

As described previously, the EMM may be incomplete at the time of construc-tion. In order to adapt to possible new behaviour, we need to be able to add orremove states.

To add a new state to the model we augment the transition count matrix andinitialize the new row and column to zeros. The new state is given a unique IDto keep track of which state corresponds to which cluster.Similarly, to remove a state xi, we simply remove its row and column in thetransition count matrix. [37]

Fading the model

The primary purpose of the fading step is to reduce the importance of oldand possible obsolete transitions since current behaviours may not rely on allprevious data. Formally, this is accomplished by fading the previous EMM fromthe open time interval (tk−1, tk) before incorporating the new measurementsfrom the interval (tk, tk+1)

Tk+1C = Tk

C · 2−λ (5.5)

where λ ≥ 0 is the fading parameter. A negative λ would correspond to asituation where the transition matrix from the previous time interval wouldbe more similar to the current behaviour than the current transition matrix.Such situation is clearly unrealistic. The effect of fading on the state transitionprobabilities is given by

pt+1xy =

ctxy2−λ + ∆ct+1xy∑

k ctxk2−λ + ∆ct+1

xk

(5.6)

In the formula above, ∆c is the delta transition count matrix that is solelybased on the new data frame. pt+1 represents the target cumulative probabilitytransition matrix that describes the dynamics of the system based on bothcurrent and historical observations.

Incorporating new sequences

Assume that we start off with an empty EMM which is updated every timewhen a new sequence arrives. That is, for every new sequence the transitioncounts are increased by the number of times they appear in the new sequence.The transition probabilities are updated according to Eq. 5.6.

31

5.2.2 State clustering

The Markov Chain of the EMM allows to model temporal relationships. To ex-tend the EMM to the spatio-temporal domain, a clustering algorithm is needed.Regular clustering techniques such as k-means [38] are not suitable for handlingdata streams since they assume that the data points are independent and iden-tically distributed. In the case with streaming datasets however, there is oftenlatent correlation between subsequent data points. [39] In this section we willdescribe the threshold nearest neighbour (tNN) clustering algorithm.

In tNN, the data points are clustered using the nearest neighbour principle.Each cluster is represented by a central vector. For each new data point, thedissimilarities with respect to the existing clusters are computed and the bestmatching cluster is selected. However, instead of always assigning a new datapoint to the best matching cluster, new clusters may be created if the dissimi-larity to the best matching cluster is greater than the specified threshold. Forspatial clustering, the dissimilarity measure is normally the Euclidean distance.

5.3 Estimating the parameters of the EMM

As described in the previous sections, the EMM has two free parameters: theclustering threshold C and the forgetting factor λ.

5.3.1 Clustering threshold

The effect of the clustering threshold on the EMM can be understood as follows.If C is underestimated, it forces the EMM to create a large number of stateswith relatively small occurrence and transition counts. Eventually, this rendersthe Markov Chain isotropic, ie. all states become equally probable. On theother hand when overestimated instead, the EMM eventually collapses into asingle state, losing all of modelling capability.

In addition to the online tNN clustering described above, there is an alternativetechnique that also takes the semantical information into account when mappingtrajectories into discrete states. Examples of such semantics are stops, whichcorrespond to regions in which the trajectories remain for a certain amount oftime. Another example are moves, which corresponds to regions in which theaverage velocity is greater than a threshold[33]. In an office environment, suchsemantical regions could for example be the cubicles of different departments,the kitchen area, etc.

5.3.2 Forgetting factor

A straightforward way to ensure that the EMM is always up to date would beto set the forgetting factor equal to some constant value during the whole sim-ulation. In Section 5.6 it is found by empirical testing that while this approachsuccessfully updates the EMM, it is overly aggressive and leads to overfitting.In this section we propose two methods to estimating λ in a systematic way.

32

Proposed Algorithm 1

The idea is to link the relative value of λ to the rate of change in behaviour ofsome group over time. It is supposed that in an office environment, there existscertain time periods during which the general behaviour within a group of em-ployees changes significantly. Intuitively, such periods could exist for exampleduring morning hours, lunchtime and late afternoons during which employeesare supposed to arrive or leave the office and thereby change their general be-haviour.During such time periods, the importance of old patterns should be decreasedin order to accommodate the EMM to the new patterns of behaviour.

How do we measure the behaviour of a group and assess the similarity of twobehaviours? Suppose we are given a dataset containing the geometric trajecto-ries of a group of employees under a time period. By identifying certain staticregions of interests in the office environment, the trajectories can be clusteredand thus mapped to discrete states that each correspond a unique cluster. Thetransitions between the states can in turn be modelled using a Markov Chain.In this way, we can describe the behaviour of a group of employees either by atransition count or probability matrix.By using count matrices, Eq. 5.5 could be used to estimate λ by simply invert-ing Tk

C . However, the estimates of λ are typically noisy and not guaranteed tobe greater than zero. Unless stated otherwise, in this algorithm we will use theprobability matrix to represent the behaviour because it is per definition alwaysnormalized.

To measure the difference in behaviour over time, we use the probability distancebetween the probability transition matrices of a group during two subsequenttime intervals. Two distance functions are presented in Definition 17 and 18. Itis important to notice that in order to be able to use the probability matricesto compare the behaviours, the individual rows of the transition matrices musteach model the probability distribution of the exact same spatial region. Inother words, the rows of the transition matrices must describe the same spatialregions. This requires that we have predetermined spatial regions which are in-teresting in the sense that they are good at discriminating between behaviours.In an office environment, such regions could for example be the communal space,department areas etc.

A probability distance function compares individual probability distributions,in this case however we need to compare sets of distributions:

P = [p1(x),p2(x), . . . ,pn(x)]T , pi(x) ∈ R1×n (5.7)

By the assumption that the predetermined regions are all equally interesting,the distance between two sets of distributions is computed as the sum of allpointwise distances between the corresponding distributions in the two sets.

d(P(x),Q(x)) =

n∑i

δ(pi(x),qi(x)) (5.8)

where δ is a distance function between probability distributions.

33

Definition 17. Total variation distance: Let P and Q be two discrete probabilitydistributions over some finite set Ω. The total variation distance is [40]

δ(P,Q) =1

2

∑x

|P (x)−Q(x)| (5.9)

Definition 18. Kullback-Leibler divergence: Let P and Q be two discrete prob-ability distributions over some finite set Ω. The Kullback-Leibler divergence is[41]

δ(P,Q) =∑x

ln

(P (x)

Q(x)

)P (x) (5.10)

Notice that the KL-divergence is not symmetric and thus not a true metric. Inthe following, we will often used the average symmetric KL-divergence

δ(P,Q) =δ(P,Q) + δ(Q,P )

2(5.11)

Suppose now that we have computed the change of behaviour of the same groupover multiple time intervals. There are now two possible ways of linking λ tothe difference in behaviour:

• Setting λ proportional to the rate of change directly. Although intuitive,this option leads to λ > 0 during the whole experiment which may be tooaggressive and lead to overfitting as described in Section 5.6.

• Identifying high-activity regions to set λ. By thresholding the rate ofchange we identify coherent time intervals during which the rate of changeis greater than a threshold value. This threshold will be referred to asHIGH ACTIV ITY CONST . The idea is to set λ proportional to thelength of such high-activity periods.Besides being more softer than the first option, this also has an interestinginterpretation: by Eq. 5.5, it holds that λ is inversely proportional to thehalf life of the elements in the transition count matrix. Thus, the meanlife time τ of a transition count that was registered during a high-activityperiod is inversely proportional to λ:

τ =c

λ(5.12)

The proportionality constant c will be referred to as PROP CONST . Inother words, by the end of high-activity periods, half of the transitioncounts registered in the beginning of such periods remain.

34

Figure 5.2: Left: Graph of the rate of change of behaviour in ahypothetical dataset. The red line is used as a threshold to iden-tify coherent high-activity periods. Right: During such periods,the value of λ is set inversely proportional to the period length.Outside these periods, λ is set to a small constant value. (Graphsnot to scale.)

The implementation of the proposed algorithm is presented in A.2.

Proposed Algorithm 2

The second proposed algorithm is based on the fading equation for the EMMwhich is repeated here for clarity:

pt+1xy =

ctxy2−λ + ∆ct+1xy∑

k ctxk2−λ + ∆ct+1

xk

(5.13)

We will re-use the notion of behaviour from Algorithm 1 above. Clearly, if wewere given the target cumulative transition probability matrix pt+1, the previ-ous cumulative transition count matrix ct and the current delta transition countmatrix ∆ct+1,then we could easily estimate λ as in Algorithm 1 by measuringthe dissimilarity between sets of distributions and trying to minimize it.

However, all we have is the delta count matrices ∆ct and ∆ct+1 estimated fromthe the previous and current time intervals, respectively. These matrices cor-respond to the instantaneous behaviours during the previous and current timewindows. The target cumulative transition probability distribution is typicallyunknown because we have no estimate of λ. Neither do we have behaviouraldomain knowledge to say for example that group A should be in the conferenceroom or at their desk most of the time during in the time period 09:00-09:10.Thus, we are left in a chicken and egg problem.

The idea is to estimate the target distribution pt+1 by computing the deltaprobability matrix at time t + 2. Notice that this is not exactly what we arelooking for since the time indices are different those in Eq. 5.13. However, if thesliding windows are sufficiently short, we will assume that the behaviour in twosubsequent data batches will approximately be the same, that is pt+1 ≈ pt+2.In other words, we will assume that the probability of going from one region toan other is approximately the same during two adjacent time windows.

35

The problem of minimizing the probability distance between the distributionscannot be guaranteed to be a convex optimization problem. To alleviate thisproblem, the minimization problem can be performed multiple times with dif-ferent initial values.The implementation of this proposed algorithm is presented in A.3.

5.4 Classification

The proposed classification procedure builds upon the concept of model selec-tion and is shown in Algorithm 2. In model selection, the task is to select themodel with the highest explanatory power of an observation from a set of can-didate models [42]. If there are multiple models which are equally likely, usuallythe least complex model is chosen 1.

Given a set of EMMs that each models the behaviour of a unique class inthe dataset, the algorithm computes the similarity between a trajectory in thecurrent dataframe and all class models. Before doing this, the actual trajectoryconsisting of geographic coordinates must be mapped to a sequence of states ina Markov Chain. This is done on Line 14. To measure the similarity betweena sequence of states and a model, the Akaike Information Criterion[43] (AIC)is used. This is done on Line 1-6. The AIC can be understood as a trade-offbetween the goodness of fit and model complexity: the best matching model isthe one with minimum AIC. The number of parameters of the model is estimatedby the number of non-zero entries in the transition probability matrix of themodel.

1This principle is known as Occam’s Razor

36

Algorithm 2 EMM classification

1: procedure AkaikeInformationCriterion(model,stateSeq)2: k← number of non-zero elements in the transition count matrix in the model3: L← Likelihood of stateSeq seq in model4: AIC ← 2(k − ln (L))5: return AIC6: end procedure7:

8: procedure EMMClassification(dataFrame,classModels)9: predictedClasses← []

10: for trajectory in dataFrame do11: scores← 012: i← 113: for model in classModels do14: stateSeq← mapTrajectoryToStates(trajectory,model)15: scores[i]← AkaikeInformationCriterion(model,stateSeq)16: i← i + 117: end for18: predictedClasses← [predictedClasses,index of smallest element in scores]19: end for20: return predictedClasses21: end procedure

The classification performance of the EMM is dependent on whether the under-lying Markov Chain is irreducible or not. If the Markov Chain is fragmentedinto multiple, irreducible sub-chains which do not commute, it is possible thatthe classification algorithm in Algorithm 2 degenerates into a random classifier.

To understand why, consider the Markov Chain in Figure 5.3 and two sequencesS1 and S2 which are to be mapped to the states 1, 2, 6, 4, 5, 7 and 1, 2, 5, 7.

Figure 5.3: Example of a not irreducible Markov Chain. The thick-ness of the edges represent the relative probabilities of the transi-tions.

37

As it can be seen in Figure 5.3, even though the sequences may have differentlikelihoods up to the last state, their final likelihood will degenerate to 0 becausethere are no transitions 5 → 7 or 2 → 7 in the Markov Chain. In other words,the probabilities corresponding to the transitions 5→ 7 and 2→ 7 are 0.

In order to avoid such situations, it is advised in [44] to initialize the tran-sition count matrices with a prior distribution of counts equal to some smallε > 0.

5.5 Anomaly detection

The EMM outlier detection [45] procedure is shown in Algorithm 3. The ideabehind the algorithm is that there are mainly two ways in which anomalies canarise in a first order Markov Chain:

• First, it may happen that a specific state has not occurred frequently atall in the past. These types of anomalies are captured by thresholding thenumber of occurrences of each state/cluster in the EMM.

• Second, it may happen that although two events themselves are frequent,the transition between them is infrequent. These types of anomalies arecaptured by thresholding the transition counts.

In addition, individual states may be marked as anomalies by a domain expertto incorporate a priori knowledge into the detector. When a sequence goesthrough such states, the sequence may be marked as an anomaly.

Instead of thresholding the actual cluster and transition counts directly, theauthors in [45] propose to first normalize them by the sum of all cluster counts.In this way, the score values are mapped to the interval (0, 1).

The anomaly detection algorithm works in series with the update mechanismof the EMM: Every time a new sequence arrives and is incorporated into theEMM (Lines 5− 6), the algorithm finds the infrequent states and transitions ofthe model at that specific time instant. Given the current state of the EMM,the algorithm loops through the states in the sequence and returns True if anoutlier is found (otherwise False). In the next iteration, the current state isupdated to the previous state in the sequence.

38

Algorithm 3 EMM Update/ Anomaly detector

1: procedure EMMAnomaly(currState,stateSeq,model,occThr,transThr)2: isOutlier← False3:

4: for state in stateSeq do5: add state to model6: add transition currState → state to model7:

8: clusterCounts← cluster counts(model)9: normOcc← clusterCounts[state]/sum(clusterCounts)

10: normTrans← transition counts(currState,state)/sum(clusterCounts)11:

12: if normOcc < occThr | normTrans < transThr then13: isOutlier← True14: end if15:

16: currState← state17:

18: end for19: return isOutlier20: end procedure

5.6 Preliminary results

In this section, the anomaly detection performance of the EMM is evaluatedon a test dataset with dynamic behaviour. These results are preliminary in thesame sense as it was described in Section 4.3. The training dataset shown inFigure 5.4 contains 1000 data points drawn from a discrete uniform distributionwith a constant variance but with a varying mean. During the 200 first timeindices, the mean is set to 5. During a period of 50 samples starting from timeindex 200, the mean is temporarily set to 25. This short interval is supposed toact as a contextual anomaly. The mean is then changed back to 5 for the timeindices 250-500.After the first 500 samples, the behaviour is reversed: the majority of the datapoints are sampled with a mean value of 25 while the anomalous interval is sam-pled with a mean value of 5. In other words, both the normative and anomalousbehaviours are reversed. Notice that the transition between the initial anoma-lous period starting at time index 500, and the rest of that period up to timeindex 800 is gradual. In other words, the length of the anomalous period aftertime index 500 is not fixed and depends on the choice of parameters of the out-lier detector. The dataset was segmented using a sliding window with lengthcorresponding to 2 time increments.

39

Figure 5.4: Test dataset. Notice that both the normative andanomalous behaviour change with time. The colors indicate thetype of intervals.

Figure 5.5 and 5.6 show the results of the anomaly detection with C = 3, occThr =0.1 and transThr = 0.05 without and with constant fading, respectively. As itcan be seen in 5.5, the algorithm correctly detects the first sequence of anoma-lous data points in the interval 200-250, but misclassifies the second half ofthe data stream: it marks the majority of data points after time index 500 asanomalous and classifies the sequence starting at time index 800 as normal.

Figure 5.5: Results using C = 3, occurrence thr = 0.1,transition thr = 0.05 and λ = 0. An anomaly score of 1 rep-resents an anomaly while 0 represents the normal data points.

These results are due to that the outlier detector can not accommodate itselfcompletely to the new pattern after time index 500. It can also be seen thatthe results are less clear and start to fluctuate between normal and anomalousafter time index ≈ 650. This is believed to be caused by the fact that the

40

even without fading, the EMM is constantly updated when new sequences areincorporated into the MC. However, as this update is slower and more gradualthan with fading, the outlier detector ends up in an intermediate state betweenthe two patterns.

Figure 5.6: Results using C = 3, occurrence thr = 0.1,transition thr = 0.05 and λ = 0.5. An anomaly score of 1 repre-sents an anomaly while 0 represents the normal data points.

As it can be seen in Figure 5.6, fading the EMM has significant effect on theanomaly detection performance. All three changes in behaviour (first anomalyat t = 200, reversal at t = 500 and second anomaly at t = 800) are correctlydetected but the results are somewhat noisy. It is believed that this is causedby that a constant λ > 0 throughout the experiment is too aggressive: By con-stantly forgetting old measurements it is possible that the EMM models thenoise rather than the underlying characteristics of the data.

41

(a) Results using C = 3,occurrence thr = 0.1,transition thr = 0.05 and a variableλ given by Figure 5.7b.

(b) Variation of λ during the experi-ment.

Figure 5.7: Anomality detection results on the dynamic testdataset. An anomaly score of 1 represents an anomaly while 0represents the normal behaviour.

Figure 5.7a shows the results of the anomaly detection with C = 3, occThr = 0.1and transThr = 0.05 with a variable λ as given by Figure 5.7b. By comparingthe prediction with the ones obtained with constant fading, it can be seen thatthis second fading technique tends to give more accurate predictions.

Finally, it should be noted that in all three cases with the outlier detectorthere is a short series of misclassifications in the beginning of the experiment:the algorithm incorrectly marks data points as anomalous even though theyare sampled from the same distribution. This is believed to be caused by thecreation and population of new states. During this period, new states with afew number of elements are incorrectly classified as outliers.

5.7 Database of individuals with same instanta-neous behaviour

The purpose of this section is to create a database containing records of indi-viduals with the same instantaneous behaviour. The idea is that by combiningthe instantaneous groups from multiple time frames, long term-relations maybe captured. It is assumed that although behaviours may fluctuate over time,data mining techniques can be used to extract the underlying relations.During each dataframe, a first order Markov Chain is estimated for each in-dividual. The partitioning of individuals into groups with the approximatelysame behaviour is done by clustering the transition matrices of each MarkovChain using the probability distances introduced in Section 5.3.2. The thresholdnearest-neighbour clustering technique is used to guarantee that the differencein behaviour within each cluster is bounded.

42

Notice that in order to be able to use these matrices to compare individuals,the individual rows of the transition matrices must each model the probabilitydistribution of the exact same spatial region. In other words, the rows of alltransition matrices must describe the same spatial regions. This requires thatwe have predetermined spatial regions which are interesting in the sense thatthey are good at discriminating between behaviours.Finally, the instantaneous clusters from the different sliding windows are ap-pended to a database.

43

Chapter 6

Dataset

6.1 MIT Realitycommons Badge dataset

As described in the introduction in chapter 1, the Badge dataset contains time-stamped geographic locations for employees at a company during a period ofone month. Each employee is equipped with a unique badge for localizationusing radio signals. The in-door positioning system is based on measuring theradio signal strength (RSSI) of each employee’s badge at different base stations.These base stations are located at fixed locations in the office and are used totriangulate the instantaneous position of a badge.

Figure 6.1: Floor plan of the MIT Reality Commons Badgedataset. [2]

Location data is sampled at 10 measurements/minute. The layout of the officeis shown in Figure 6.1. The location of the base stations are shown in yellowboxes. There are 3 groups and 58 employees at the firm of which 39 participatedin the data collection. The cubicles of participating employees are identified bya unique employee ID and are marked by orange, green and purple boxes basedon their function at the firm. Employees who did not participate are represented

44

an ”N” and are not used in the thesis. The composition of department groups issummarized in Table 6.1. As it can be seen, the Configuration group correspondsto 69%, the Pricing group to 18% and the Coordinator group to 13% of the totalnumber of data points.

Group Employee ID Numberof datapoints

Configuration 276 106 101 294 103 265 253 292 99 107278 258 105 264 104 251 298 256 56 109281 82 267 273 290 280

672500

Pricing 266 293 272 288 297 268 263 171700Coordinator 291 285 257 261 129820

Table 6.1: Department groups in the MIT Badge dataset.

6.2 Proprietary FOI dataset

The confidential dataset does not contain explicit geographic trajectories asthe Badge dataset, but only textual location identifiers. As described in theintroduction in chapter 1, these can be transformed into regular trajectories byintroducing an abstract coordinate system. By doing so, the proposed trajectorySVM algorithm can be applied also to the confidential dataset. In contrast,in the case with parametric analysis using EMM, there is no need for suchtransformations as the model is readily available to describe transitions betweendifferent states.Due to the similarities described above, we expect the algorithms found anddeveloped for the Badge dataset to be directly applicable to the proprietaryFOI dataset.

45

Chapter 7

Results

7.1 Organization

This chapter is organized as follows:

• The results for parametric model are presented in section 7.3. First, theresults for the two proposed methods for estimating the forgetting factorare presented. This is followed by classification of the departments groupsand outlier detection. The validity of the outlier detector is verified bothquantitatively and qualitatively on two hypotheses. The correctness of thegroup discovery algorithm is verified in section 7.3.5.

• The results for the non-parametric model are presented in section 7.4.Since the proposed SVM based algorithm is supervised, only classificationof the department groups are presented.

In detail analysis of the results obtained in this chapter is presented in chapter8.

7.2 Evaluation method

The performance of the classification algorithms is estimated by having two sep-arate sliding windows for training and evaluation. The training and evaluationwindows are separated in time from each other by an offset parameter so thatthe evaluation window is always ahead of the training window. By doing so, weinvestigate the predictive power of the algorithm and the overall predictabilityof the behaviours in the datasets. The overview of the setup is shown in Figure7.1.

46

Figure 7.1: Overview of the evaluation method.

It should be noted that the EMM anomaly detector does not require such setupsince the outlier detection is done in series with constant update of the modelfrom the same data frame.

7.3 Parametric analysis

The proposed algorithms for estimating the forgetting factor λ in Chapter 5require static states. The states of the EMM are estimated by using the loca-tion of all participating employee’s cubicles, the locations of the RSSI and basestations. However, as it can be seen in Figure 6.1, there are multiple locationsthat are close to each other. It is believed that such configurations can degradethe performance of the EMM since trajectories that are in between two suchstations can be mapped randomly between the stations.To overcome this problem, we optimize the location of the EMM states by clus-tering the stations based on their location using the threshold nearest neighbourclustering. By using the aforementioned clustering, the distance between thestations within each cluster can be bounded.

The location of the EMM states are optimized by using a grid search over theclustering threshold parameter C ∈ 1, 201, . . . 3601 and measuring the F11

score of the classification performance of the department groups from Table 6.1without fading. The F1 score is computed from the sum of all classificationsresults from the sliding windows during the whole day. The sliding windowlengths were set to 10 minutes and the offset between the training and evaluationwindow was set to 5 minutes. Figure 7.2 shows the classification results of thegrid search during the first weekday in the dataset.

1The F1 score is a weighted average between the precision and recall of the classifier [46].

47

Figure 7.2: F1 score vs clustering threshold C.

The clustering threshold that gives the highest F1 score for all three groupsis C ≈ 800. The corresponding spatial configuration shown in Figure 7.3 willbe fixed and used throughout this chapter. The assumption is thus that thesespatial regions which are appropriate to classify the trajectories are also suitableto detect outliers.

Figure 7.3: Locations of the EMM states, each identified by aunique ID.

7.3.1 Estimation of λ

Using Algorithm 1

The series of Figures below shows the estimation procedure of the forgettingfactor for the Configuration group using the spatial states from the previoussection. The sliding window length is set to 10 minutes. The first week in thedataset is used to estimate a λ for each weekday. Subfigures 7.4a and 7.4b showthe symmetric KL-divergence and the Total Variation distance between the be-haviour at different states of adjacent sliding windows during Wednesdays forthe Configuration group. The horizontal axis represents the states in Figure 7.3and the vertical axis represents the sliding window index.

48

(a) Symmetric KL-divergence betweenadjacent sliding windows.

(b) Total variation distance betweenadjacent sliding windows.

Figure 7.4: Dissimilarities between the behaviours at differentstates during two subsequent sliding windows during a completeday.

The saturation of the colors indicate the degree of dissimilarity: red color cor-responds to high and white color corresponds to low similarities between thesliding windows. As it can be seen in the figures above, there seems to be a pat-tern in the behaviour during a complete day which is similar to our hypothesisin Section 5.3.2:

• The differences in behaviour between subsequent sliding windows seemsto be isotropic over all 12 states. This means that when there is a majorchange in the behaviour at one location in the office environment, thereare corresponding changes in the behaviour at approximately the sametime at other locations. This observation justifies that we average thedissimilarities over all states in Algorithm 1 order to determine the high-activity time periods.

• The difference between subsequent sliding windows is largest during morn-ing hours and in the afternoon. Additionally, a faint difference can be ob-served during lunch hours in figure 7.4a when using the Kullback-Leiblerdivergence. In contrast, when using the Total variation distance as infigure 7.4b, the results are typically more noisy and the aforementionedlunch period is no longer visible. The behaviour in between these timeperiods is relatively constant. This observation justifies the step whenidentify high-activity timer periods by setting a lower threshold for thedifference in behaviour at subsequent sliding windows.

Subfigure 7.5a and 7.5b shows the average symmetric KL-divergence and theTotal Variation distance between adjacent sliding windows over all 12 states.The red line corresponds to the threshold for high-activity periods and is hereset to the average of the dissimilarities during a complete day for the sake of

49

simplicity. From these figures it can be seen that the difference in behaviouraveraged over all 12 states follows a prominent W shape.

However, by using the Total Variation distance and the mean as the lowerthreshold to compare the behaviour during subsequent windows, the algorithmdoes not detect the high-activity period at time index 80. This time period isbelieved to correspond to the change in behaviour due to lunch break. Thus,the shape of the detected high-activity periods is more similar to a U shape.

(a) Average KL-divergenceover all states between adja-cent sliding windows. Thered line corresponds to thethreshold that is used to iden-tify high-activity periods.

(b) Average total variationdistance over all states be-tween adjacent sliding win-dows. The red line cor-responds to the thresholdthat is used to identify high-activity periods.

Figure 7.5: Dissimilarities between average behaviours during twosubsequent sliding windows during a complete day.

The threshold for high-activity periods, HIGH ACTIV ITY CONST is setto the mean value of differences between the average behaviours in subsequentsliding windows during a day. The proportionality constantPROP CONST isset to 1. Finally, the λ estimated by using these two approaches are presentedin Figure 7.6a and 7.6b.

50

(a) λ estimated using KL-divergence. (b) λ estimated using total variationdistance.

Figure 7.6: λ estimated using Algorithm 1.

Using Algorithm 2

In this section, λ is estimated by minimizing the probability distance betweenthe behaviours subsequent sliding windows. Since the problem of minimizingthe probability distances between the distributions is typically non-convex, theminimization is repeated 10 times with random initial values in the range 0 −1. The motivation for this choice of initial values is that behaviours duringsubsequent sliding windows are though to be generally similar. The final valueof λ is determined by computing the average of the different estimations foreach sliding window index.

(a) λ estimated by minimizing the KL-divergence.

(b) λ estimated by minimizing the To-tal variation distance.

Figure 7.7: λ estimated using Algorithm 2.

51

By comparing Figures 7.7a and 7.7b with the Figures 7.5a 7.5b from the previoussection, it can be seen that both estimates of λ feature the general U type shape.

7.3.2 Classification

In this section, the classification performances of the departments groups in thedataset are presented for the three first weeks in the dataset.

Without fading

As before, the F1 score is computed from the sum of all classifications resultsfrom the sliding windows during a whole day.

Date 2007/03/26 2007/03/27 2007/03/28 2007/03/29 2007/03/30Configuration 0.97 0.94 0.93 0.95 0.85Coordinator 0.75 0.79 0.60 0.74 0.41

Pricing 0.85 0.88 0.84 0.80 0.78

Table 7.1: F1 scores of classification of the department groups.First week in the dataset.


Pricing 0.76 0.70 0.89 0.93 0.91

Table 7.2: F1 scores of classification of the department groups.Second week in the dataset.


Pricing 0.73 N/A N/A N/A N/A

Table 7.3: F1 scores of classification of the department groups.Third week in the dataset.

In addition, the performance of the department group classifications during anormal day is shown in Figure 7.8.

52

Figure 7.8: Classification performance of the department groupclassifications during a day

As it can be seen in the figures above, the F1 score seem to follow an upsidedown U-shape: the classification performances are lowest during morning andafternoon hours when employees enter and exit the office. However, in betweenthese time instants the F1 scores are typically higher than 0.8. This observationjustifies again the fading strategies discussed in the sections above.

λ using Algorithm 1

In this section, the percentual change in the classification F1 score due to fadingwith Algorithm 1 using KL-divergence is presented. (A value less than 100%signifies a decrease.)

Date 2007/03/26 2007/03/27 2007/03/28 2007/03/29 2007/03/30Configuration 100 % 101 % 100 % 100 % 102 %Coordinator 103 % 105 % 107 % 102 % 108 %

Pricing 100% 102 % 101 % 100 % 100 %

Table 7.4: Percentual change in F1 scores of classification withrespect to the classifier without fading. First week in the dataset.

53


Pricing 88 % 98 % 102 % 100% 100 %

Table 7.5: Percentual change in F1 scores of classification withrespect to the classifier without fading. Second week in the dataset.

Date 2007/04/09 2007/04/10 2007/04/11 2007/04/12 2007/04/13Configuration 100 % 101 % 100 % 101 % 99 %Coordinator 102 % 103 % 99 % 103 % 99%

Pricing 96 % N/A N/A N/A N/A

Table 7.6: Percentual change in F1 scores of classification withrespect to the classifier without fading. Third week in the dataset.

From the tables above, it can be seen that fading with λ estimated accordingto algorithm 1 generally improves the classification performance of the EMM inthe Badge dataset.

λ using Algorithm 2

Similarly as in the previous section, the percentual change in the classificationF1 score due to fading with Algorithm 2 is presented.


Pricing 109 % 103 % 104 % 104 % 75 %

Table 7.7: Percentual change in F1 scores of classification withrespect to the classifier without fading. First week in the dataset.

Date 2007/04/02 2007/04/03 2007/04/04 2007/04/05 2007/04/06Configuration 101% 108 % 101 % 101 % 100 %Coordinator 100% 109 % 107 % 120 % 99 %

Pricing 103 % 106 % 104 % 101 % 100%

Table 7.8: Percentual change in F1 scores of classification withrespect to the classifier without fading. Second week in the dataset.

Date 2007/04/09 2007/04/10 2007/04/11 2007/04/12 2007/04/13Configuration 103 102% 100 % 103% 101%Coordinator 102 % 105 % 101 % 107 % 109 %

Pricing 103 % N/A N/A N/A N/A

Table 7.9: Percentual change in F1 scores of classification withrespect to the classifier without fading. Third week in the dataset.

54

Similarly to the previous section, it can be seen from the tables above thatfading with λ estimated according to Algorithm 2 also improves the classificationperformance. In addition, the percentual increase of the F1 scores with respectto the base line λ = 0 is typically larger than using Algorithm 1 to estimate λ.It is believed that this due to that the the Algorithm 1 does not accurately takeinto account the magnitude of the difference in behaviour between subsequentsliding windows.

7.3.3 Anomaly detection

In this section we look for individuals with anomalous behaviour in the Config-uration group. As in the previous sections, a λ for each weekday is estimatedfrom the first week in the dataset. The sliding window length is set to 2 minutesin order to get a clearer picture. Detecting outliers requires two parameters: theoccurrence and transition threshold which are both set to 0.001. In this way, wedefine anomalous trajectories as trajectories that contain rare states or transi-tions that are less frequent than 0.1% of the number of observations.In the following, a series of trajectory anomaly results will be shown with andwithout fading. The first two graphs belong to data in the morning. The sec-ond two graphs are for the afternoon. Each graph group shows data for eightconsecutive time intervals. The purpose of the chosen scenarios is to verify thatthe outlier detector works as anticipated in the preliminary results in Section5.6. The code used to generate these results can be found in Section A.4 in theAppendix.

In Figure 7.9 a series of eight trajectory outlier detection result without fadingfor the Configuration department are shown. The subgraphs are ordered inrows and correspond to subsequent time windows. The black color designatesnormal and red color anomalous trajectories. As it can be seen in the first 3subgraphs, there are two clusters of trajectories which are close to each other butstart to diverge after the 4th graph. The EMM outlier detector fails to detectdivergences when they are close to the original clusters at the current thresholdlevels. Whether this is correct is difficult to decide at this stage. However, as itcan be seen in the fifth and seventh subgraphs the strongly diverging trajectoriesare not marked as outliers at the current threshold levels. In the last graph it canbe seen that even without fading, the algorithm can correctly detect anomaliesthat are far enough from the original clusters.

55

Without fading, λ = 0

Figure 7.9: Outlier detection results without fading.

In Figure 7.10, a second series of outlier detection results are shown for theConfiguration group on the afternoon of the same day. As it can be seen, thenormative behaviour has now changed from being on the right side of the officeto the left side. Even without fading, the algorithm has correctly accommodateditself to the new behaviour but not entirely. As it can be seen in the last twosubgraphs, trajectories concentrated around the old clusters are still markedas normal. This situation corresponds to the reversal of the preliminary testdataset in Section 5.6.

Figure 7.10: Outlier detection results without fading, notice thatthe normative behaviour has changed from the previous behaviourhas changed.

56

Fading with λ estimated using Algorithm 1, KL-divergence

This section shows the outlier detection results for the same situation as abovein the first group of eight subgraphs in the Section 7.3.3 but now with fadingusing Algorithm 1 with the KL-divergence. As it can be seen in the first seriesof graphs, the algorithm now detects the diverging trajectories. It does so byreducing the old weights in the Markov Chain so that by using exactly the sameoccurrence and transition thresholds, outliers can be detected. Not shown inthe graphs below (for the sake of brevity) are a small number of trajectories inthe upper right part of the office that occurred previously.

Figure 7.11: Outlier detection results with fading using Algorithm1.

In Figure 7.12 a second series of outlier detection results are shown on theafternoon the same day. As it can be seen, the algorithm has now accommodateditself to a higher degree than without fading. The trajectories confined to theleft side of the office are now considered as normal, while the trajectories onthe right side has become outlier. Please note the differences between the 7thsubgraph in this series and the corresponding series in Section 7.3.3

57


Fading with λ estimated using Algorithm 2, KL-divergence

This section shows the outlier detection results for the same situation as abovebut now using λ estimated with Algorithm 2 and the KL-divergence. As itcan be seen in Figure 7.13, the algorithm is not able to detect the divergingtrajectories in subgraphs 4 and 7.

Figure 7.13: Outlier detection results with fading using Algorithm2.

In Figure 7.14 a second series of outlier detection results are shown on theafternoon the same day. Notice that similarity between this and the previousseries of graphs.

58


7.3.4 Detecting anomalous groups

In this section, the correctness of the outlier detector is verified by searching forinherent outliers in the dataset.

Configuration and Pricing

As verified in Section 7.3.2, employees in the dataset can be partitioned intodifferent groups. In this section, we verify the correctness of the outlier detectorby merging two such groups. The hypothesis is that if one group occurs morefrequently than the other, the second group can be described as an anomaly inthe merged group.To this end, we merge the Configuration and Pricing groups that have signifi-cantly different frequencies in the dataset (69 % vs. 18 %). The hypothesis isthus that the employees in the Pricing department can be found in the mergedgroup by looking for anomalies.

Below is a series of tables that show the precision of the outlier detector forthe first three weeks in the dataset. Precision is defined as the number of truepositives, ie. the number of employees from the Pricing group that are markedas anomalies, divided by the total number of anomalous employees.

Notice that the precision of a random outlier detector is related to the ratiobetween the cardinality of the subgroups. If the number of measurements ofthe Configuration group is much larger than that of the Pricing group, the pre-cision of the random outlier detector will typically be inferior to 50%. More

specifically, the precision of the outlier detector is |P ||C|+|P | . Since the cardinality

of the subgroups change over time, the instantaneous precision for the randomclassifier during each day is given in the last column in the tables below.

59

Date No fading Algorithm 1 Algorithm 2 Random2007/03/26 0.12 0.08 0.09 0.162007/03/27 0.30 0.32 0.35 0.282007/03/28 0.57 0.61 0.61 0.362007/03/29 0.37 0.56 0.56 0.182007/03/30 0.42 0.49 0.48 0.19

Table 7.10: First week in the dataset. The values correspond tothe precision of the outlier detector.


Table 7.11: Second week in the dataset. The values correspond tothe precision of the outlier detector.

Date No fading Algorithm 1 Algorithm 2 Random2007/04/09 0.36 0.64 0.64 0.152007/04/10 0 0 0 02007/04/11 0 0 0 02007/04/12 0 0 0 02007/04/13 0 0 0 0

Table 7.12: Third week in the dataset. The values correspond tothe precision of the outlier detector.

As it can be seen in the tables above, the performance of the outlier detectorgenerally outperforms the random outlier detector. Performance of the outlierdetector is in general improved by fading the EMM.

Novices and Seniors

As it is described in [2], the Configuration group can be further partitioned intoa senior and a novice subgroup. The senior subgroup spends relatively littletime with other employees and prefers to work alone. In contrast, the novicegroup discusses his/her tasks and thus often visits other employees. Thus, thehypothesis is that the novice group can be described as anomalies in the Con-figuration group.In this section, we verify this hypothesis on multiple days in the dataset. Unfor-tunately, the exact configuration of the subgroups is not given in [2]. Thus theresults in this section are to interpreted qualitatively only. The approximateIDs are presented in the Table 7.13 below:

60

Senior 278,294,99,258,109,281,82,273,101,276,253,290,280,292Novice 265,104,107,251,105,298,56

Table 7.13: Subgroups of the Configuration group.

In Table 7.14, the precision of the novice subgroup outlier detector are presentedfor the first week in the dataset. As noted in the previous section, precision isdefined as the fraction of number of novice individuals marked as outliers dividedby the total number of individuals marked as outliers.The ratio between the number of observations from the novice group and thetotal number of measurements in the Configuration group corresponds to the

precision of the random classifier: |Novice||Senior|+|Novice| . Since the cardinality of

the subgroups change over time, the instantaneous precision for the randomclassifier during each day is given in the last column in the tables below.


Table 7.14: Precision of the novice subgroup outlier detector. Thelast column represents the precision of a random outlier detector.First week in the dataset.

Figure 7.15 shows the increase in precision of the novice subroup outlier detectordue to fading.

(a) Percentual increase due to fadingusing Algorithm 1. First week in thedataset.

(b) Percentual increase due to fadingusing Algorithm 2. First week in thedataset.

Figure 7.15: Percentual increase due to fading. First week in thedataset.

In the figures below, the corresponding results are shown for the novice groupoutlier detector during the second and third week in the dataset. The figuresbelow show that fading typically increases the performance of the outlier de-tector. This suggest that there seems to be an underlying structure of noviceemployees but the outlier detector is to approximate to detect it completely.

61


Table 7.15: Precision of the novice subgroup outlier detector. Thelast column represents the precision of a random outlier detector.Second week in the dataset.

(a) Percentual increase due to fadingusing Algorithm 1. Second week in thedataset.

(b) Percentual increase due to fadingusing Algorithm 2. Second week in thedataset.

Figure 7.16: Percentual increase due to fading. Second week in thedataset.

Date No fading Algorithm 1 Algorithm 2 Random2007/04/09 0.12 0.13 0.10 0.182007/04/10 0.16 0.09 0.00 0.252007/04/11 0.00 0.00 0 0.02007/04/12 0.11 0.11 0 0.092007/04/13 0.11 0.17 0.16 0.24

Table 7.16: Precision of the novice subgroup outlier detector. Thelast column represents the precision of a random outlier detector.Third week in the dataset.

62

(a) Percentual increase due to fadingusing Algorithm 1. Third week in thedataset.

(b) Percentual increase due to fadingusing Algorithm 2. Third week in thedataset.

Figure 7.17: Percentual increase due to fading. Third week in thedataset.

As it can be seen from Table 7.14,7.15 and 7.16, detecting the novice subgroupis substantially more difficult than detecting the Pricing group in Section 7.3.4.From figures 7.15, 7.16 and 7.17 it can be seen that fading seems as an appro-priate tool to make the outlier detector more robust.

7.3.5 Group discovery

In addition to the predefined department groups in the dataset, the results of thegroup discovery algorithm using Tarjan’s algorithm is presented. The resultshere are based on the two first weeks in the dataset. The clustering thresh-old C, minimum support and confidence parameters were determined by usinga grid search C ∈ 1, 2, . . . 15, (min supp,min conf) ∈ (0.1, 0.2, . . . , 1) ×(0.1, 0.2, . . . , 1).The tables below present the results of the proposed group discovery algorithmwith different parameters.

Group 1 258Group 2 99Group 3 82Group 4 105Group 5 281Group 6 290Group 7 264

. . . . . .

Table 7.17: Discovered groups with clustering threshold C = 8,min supp = 0.1 and min conf = 0.1. The algorithm identified 27groups, all of which were only containing one badge ID.

63

Group 1 258Group 2 99Group 3 82Group 4 105Group 5 281Group 6 290Group 7 264

. . . . . .

Table 7.18: Discovered groups with clustering threshold C = 8,min supp = 0.2 and min conf = 0.2. The algorithm identified 27groups, all of which were only containing one badge ID.

Group 1 293, 268Group 2 280, 281, 56, 105, 99, 258, 109, 82, 278,

264, 294, 290, 265Group 3 266, 106, 276, 272

Table 7.19: Discovered groups with clustering threshold C = 9,min supp = 0.1 and min conf = 0.3.

Group 1 272Group 2 293Group 3 268Group 4 280, 281, 56, 105, 99, 258, 109, 82, 278,

264, 294, 290, 265Group 5 266, 106, 276


Group 1 272Group 2 293Group 3 268Group 4 280, 281, 56, 105, 99, 258, 109, 82, 278,

264, 294, 290, 265Group 5 266, 106, 276


64

Group 1 101Group 2 292,56, 273, 267, 256, 290, 104, 251,

109, 280, 278, 264, 281, 82, 265, 105,99, 258, 294

Group 3 288, 272, 293, 266, 285, 106, 276, 261,268

Group 4 268


Group 1 56, 99, 258, 290, 294, 281, 264, 109, 82,278, 265, 105

Group 3 266, 106, 276


7.4 Non-parametric analysis

In this section, the classification performances of the customized RBF kernelsare presented. The sliding window length was set to 2 minutes and the offsetbetween the training and evaluation windows was set to 5 minutes. The lengthof the buffer was set to 3 sliding windows.For each kernel, the parameters α and C were optimized using a grid searchover (α,C) ∈ (1000, 2000, . . . , 10000) × (10, 20, . . . , 100) and by measuringthe F1 score of the classification results on a randomly selected sliding win-dow on 2007/03/27. The optimal parameters were found to be αFrechet =5000, CFrechet = 10 for the Frechet kernel and αHausdorff = 4000, CHausdorff =10 for the Hausdorff kernel.The classification performances of the two approaches during a complete day issummarized in Table 7.24.

F1 scores/Kernels Frechet kernel Hausdorff kernelConfiguration 0.88 0.88

Pricing 0.63 0.72Coordinator 0.31 0.41

Table 7.24: F1 scores of the department group classifications on2007/03/28.

Finally, a series of classification results are shown for two subsequent time win-dows using optimum parameters.

65

(a) Training sliding window. (b) Evaluation sliding window.

(c) Classification results using theFrechet kernel, α = 5000.

(d) Classification results using theHausdorff kernel, α = 4000.

Figure 7.18: Non-parametric classification, training window:2007/03/28 10:02 - 2007/03/28 10:04.

From these figures it can be seen that trajectories seem to fluctuate and gatheraround certain clusters. This is believed to be caused by either the low spatialresolution of the in-door positioning system or the fact that employees wanderaround their cubicles. In both cases, this is believed to explain the relatively lowperformance of the Frechet based kernel SVM. Since the Frechet distance takesthe ordering of the trajectories into account, it may assign large dissimilarityvalues for two trajectories that are oriented differently but correspond to thesame employee being at his/her cubicle.

66

(a) Training sliding window. (b) Evaluation sliding window.

(c) Classification results using theFrechet kernel, α = 5000.

(d) Classification results using theHausdorff kernel, α = 4000.

Figure 7.19: Non-parametric classification, training window:2007/03/28 10:04 - 2007/03/28 10:06.

67

Chapter 8

Discussion and FutureWork

8.1 Discussion and Conclusion

8.1.1 Parametric analysis

Classification

As it can be seen in Section 7.3.2, fading the EMM generally improves theperformance of the classifier. In some cases however, using Algorithm 1 toestimate λ may lead to worse performance than without fading. A possibleexplanation is that during such days, the normal rhythm during a day as shownin Figures 7.4a and 7.4b is missing. Algorithm 1 may in such cases estimate aconstant λ during the whole day which may lead to overfitting.

Outlier detection

As it can be seen in Tables 7.10, 7.11, 7.12 and Figures 7.15,7.16, 7.17, fadingthe EMM significantly improves the precision of the outlier detector.However, the outlier detector performs much poorer in Section 7.3.4 on the Con-figuration group, than in Section 7.3.4 on the union of the Configuration andPricing groups. Unfortunately, this observation should be taken with a grainof salt since the exact novice subgroup composition is unknown. For exam-ple, it may happen that there were other novice employees than those in thegiven novice group at the firm. In addition, it may happen that the occurrenceand transition thresholds are overestimated and thus, even normal behaviour ismarked as anomalous.Another way of stating this is to say that the overall behaviour of the seniorand novice employees may be indistinguishable with the current Markov Chainapproach. By looking at Figure 8.1 representing the transition probability ma-trices for the two subgroups, we can quickly reject this explanation. There seemsto be a clear difference between the transition matrices.

68

Figure 8.1: Transition probability matrices for the novice and se-nior subgroups in the Configuration group. The colors correspondto the intensity of the transitions.

However, we can also observe that for some states there are certain similaritiesbetween the transition matrices for the two subgroups. For example, both ofthe matrices seem to be diagonally dominant. With this in mind, a possibleexplanation for the high false positive rate may be that the outlier detector isoverly aggressive: trajectories with even only one outlier are marked outliers.In real-life this approximation may be too rough compared to the complexity ofhuman behaviour. Senior employees may be incorrectly marked as novices andvice versa.

As it can be seen in Table 8.1, the number measurements for the novice groupvaries over weeks. The significantly fewer measurements during the third weekin the dataset decreases the reliability of the novice group outlier detection.This can be seen in Table 7.16 and Figure 7.17 as the percentual changes inprecision relative to without fading is typically less than 100% .

Week Number of measurements2007/03/26− 2007/03/30 481702007/04/02− 2007/04/06 175802007/04/09− 2007/04/13 3960

Table 8.1: The number of measurements of the supposed novicegroup varies significantly over weeks in the dataset.

Estimation of the forgetting factor

As it can be seen in section 7.3.2, Algorithm 2 typically outperforms Algorithm1 in estimating the forgetting factor for the EMM for classification. It is believedthat this is caused by that Algorithm does not accurately take into account thedifference in behaviour between subsequent sliding windows. Above the prede-termined threshold for high-activity periods, Algorithm 1 can not differentiatebetween the magnitude of the differences.In outlier detection on the other hand, the two algorithms give rise to seeminglyequivalent performances.

69

Group discovery

As shown in Section 7.3.5, the correctness of the group discovery algorithm de-pends to a great extent on the clustering threshold. A small threshold leads tothat all employees are put into separate groups, seemingly independent of themin supp and min conf parameters. This is expected since if the similaritythreshold for behaviours is too low, each employee will look different.

The algorithm consistently discovers a more or less invariant subset of the Con-figuration group. The Jaccard similarity [47] of the discovered subgroups isconsistently ≥ 0.55. This means that more than half of the Configuration groupemployee IDs are invariant in the found subsets. By tweaking the mining pa-rameters, also a subset of the Pricing group may be identified. This is shownin Table 7.21. However, the algorithm fails to discover the Coordinator groupwhich is probably due to that it is the smallest group in the dataset (it onlycorresponds to 13 % of the total number of measurements).

8.1.2 Non-parametric analysis

The proposed online SVM turned out be infeasible with the given time con-straints. The time between iterations proved to be too long to achieve con-clusive results on the whole dataset. However, from the limited experiment inTable 7.24, it seems that the classification performance of Hausdorff based ker-nel seems to outperform the Frechet based one.One possible explanation for this is the low spatial resolution of the employee-positioning system: even though an employee is stationary, random noise maylead to measurements which fluctuate around the true position of the employee.Since the Frechet distance takes the ordering of the trajectories into account,it may assign large dissimilarity values for two trajectories that correspond tothe same employee being stationary. In other words, the velocity measurementsmay be more noisy than the position measurements and this can negativelyaffect the classification using the Frechet kernel.

8.2 Conclusions

The thesis investigates both supervised and unsupervised anomaly detection intwo dimensional trajectory data streams.

• For non-parametric supervised anomaly detection, the Hausdorff distancebased RBF kernel performs better on discriminating between departmentgroups than the Frechet distance based kernel. The SVM parameters α,Cwere optimized by a grid search over a random sliding window, making thegeneralization performance between multiple sliding windows uncertain.However, these results show that the presented metrics are appropriate tocompare trajectories with different lengths without any need for featureselection. Therefore, the presented metrics could also be integrated intothe nearest neighbour clustering for an automatic anomaly detector.

• For parametric anomaly detection, fading the underlying MC turns outto be essential in order to accurately discriminate between department

70

groups and detect anomalies. Two algorithms are proposed for estimatinga forgetting factor λ ∈ R+, ie. an isotropic fading factor for all states. Inthe case with the Badge dataset, it is assumed that this approximationhas marginal effect on the performance since changes in behaviour seemto happen at the same time at all geographic locations. Whether a pos-itive real valued λ is sufficient to fade all the states when applied to theconfidential dataset is left be determined.

8.3 Future work

8.3.1 Parametric analysis

An obvious limitation of the EMM is that it is restricted to simple MarkovChains with a short lived memory that only extends back one step. It is simpleto imagine scenarios when this approximation can be too rough.

Suppose we aim to detect outliers in a hypothetical office space consisting ofthree spatial regions and two groups of employees: those who work with classi-fied material and those who don’t.Suppose that A correspond to the office space of the employees working withpublic material, B corresponds to the kitchen area and region C corresponds tothe office space of the employees working with classified material. Assume thatthe transitions between regions A to B and between regions B to C are normal.However, this does not necessarily mean that transitions from region C to Aare also normal. In other words, it may not be normal for an employee workingwith classified material to visit the cubicles of the public group. This scenariois shown in Figure 8.2.

Figure 8.2: Example of a situation where a first order MarkovChain may be insufficient.

A first step would be to investigate if extended EMM states that both includethe current, and a sequence of past k ≥ 1 visited regions give more robustclassification and outlier detection results. Thus, each state would correspondto a unique representative subtrajectory. This is sown in Figure 8.3. In order tomeasure the dissimilarity between trajectories with variable lengths and a set ofsuch states, distances such as the DTW, Hausdorff and Frechet distances couldbe used. However, as the number of such states grow with k, this seems onlyappropriate for situations with a small number of states.

71

Figure 8.3: Example of a MC with extended states.

For datasets with larger number of states, Probabilistic suffix trees [48] seemsas a promising step.

Outlier detection

The outlier detector currently marks all outliers as equal. Furthermore, tra-jectories with even only one outlying segment as anomalous trajectories. Amore fine-grained strategy would be to make a distinction between the typesof anomalies to assess the importance of each situation. The high false-positiverate could also be reduced by taking into account the frequency of anomaliesin a trajectory. Ie. a trajectory could be marked as anomalous if it contains aspecified number of outliers or its outlying segments occur more often than aspecified threshold.

Applicability to the classified dataset

As described in Section 1.3, the confidential system delivers location data onlyfor changes. In other words, it may happen that the measurements for anemployee during a sliding window only contains one state which represents thenew area in which the employee resides in. However, in the thesis we assumedthat the measurements contains both the starting and ending locations of asequence with more than one element. This is because the EMM classifier andoutlier detector needs a sequence of states. To alleviate this problem, the thecurrent area could be concatenated with the with previous area identifiers.

8.3.2 Non-parametric analysis

A continuation of my work can be to rewrite/optimize the program and theSQL database structures. In addition, more sophisticated incremental SVMalgorithms such as [49] and [50] could be explored. Another approach wouldbe to drop the SVM and go for a clustering algorithms instead, even thoughthe dataset used in this thesis is rather small. More specifically, the threshold

72

nearest-neighbour clustering introduced in Chapter 5 could be appropriate tohandle data streams.

Applicability to the classified dataset

As described in Section 1.3, the confidential system delivers identifiers of dif-ferent harbour areas rather than the explicit geometric locations of employeesduring a sliding window. This makes the non-parametric approach in the thesisrather inappropriate. However, by introducing an abstract floor plan similar tothe one in Figure 6.1, these sequences of harbour areas can be reverse engineeredinto geographic trajectories.

73

Bibliography

[1] Helmut Alt and Michael Godau. Computing the frechet distance betweentwo polygonal curves. Int. J. Comput. Geometry Appl., 5:75–91, 1995. ,14, 15

[2] Daniel Olguın Olguın, Benjamin N Waber, Taemie Kim, Akshay Mohan,Koji Ara, and Alex Pentland. Sensible organizations: Technology andmethodology for automatically measuring organizational behavior. Sys-tems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,39(1):43–55, 2009. , 3, 44, 60

[3] Nathan Eagle, Alex Sandy Pentland, and David Lazer. Inferring friendshipnetwork structure by using mobile phone data. Proceedings of the NationalAcademy of Sciences, 106(36):15274–15278, 2009. 1

[4] Rikard Laxhammar. Anomaly detection in trajectory data for surveillanceapplications. 1

[5] A. Dahlbom and L. Niklasson. Trajectory clustering for coastal surveillance.In Information Fusion, 2007 10th International Conference on, pages 1–8,July 2007. 1

[6] Animesh Patcha and Jung-Min Park. An overview of anomaly detectiontechniques: Existing solutions and latest technological trends. ComputerNetworks, 51(12):3448–3470, 2007. 1

[7] Frank E Grubbs. Procedures for detecting outlying observations in samples.Technometrics, 11(1):1–21, 1969. 2

[8] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detec-tion: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009. 2

[9] Charu C Aggarwal. Outlier analysis. Springer, 2013. 2

[10] Wen Dong, Daniel Olguin-Olguin, Benjamin Waber, Taemie Kim, and AlexPentland. Mapping organizational dynamics with body sensor networks.In Wearable and Implantable Body Sensor Networks (BSN), 2012 NinthInternational Conference on, pages 130–135. IEEE, 2012. 5

[11] Jochen Hipp, Ulrich Guntzer, and Gholamreza Nakhaeizadeh. Algorithmsfor association rule mining — a general survey and comparison.SIGKDD Explor. Newsl., 2(1):58–64, June 2000. 7

74

[12] Tan Pang-Ning, Michael Steinbach, Vipin Kumar, et al. Introduction todata mining. WP Co, 2006. 8

[13] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast algorithms for miningassociation rules. 8

[14] Robert Tarjan. Depth-first search and linear graph algorithms. SIAMjournal on computing, 1(2):146–160, 1972. 10

[15] Micha Sharir. A strong-connectivity algorithm and its applications in dataflow analysis. Computers & Mathematics with Applications, 7(1):67–72,1981. 10

[16] Brendan Tran Morris and Mohan M Trivedi. A survey of vision-basedtrajectory learning and analysis for surveillance. Circuits and Systems forVideo Technology, IEEE Transactions on, 18(8):1114–1127, 2008. 11, 12

[17] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chap-man & Hall/CRC Data Mining and Knowledge Discovery Series. Taylor &Francis, 2007. 11

[18] Meinard Muller. Information retrieval for music and motion, volume 6.Springer, 2007. 12, 13

[19] B.-K. Yi, H.V. Jagadish, and C. Faloutsos. Efficient retrieval of similar timesequences under time warping. In Data Engineering, 1998. Proceedings.,14th International Conference on, pages 201–208, Feb 1998. 13

[20] T. Hagerup and J. Katajainen. Algorithm Theory - SWAT 2004: 9th Scan-dinavian Workshop on Algorithm Theory, Humlebaek, Denmark, July 8-10,2004, Proceedings. Number v. 9 in Lecture Notes in Computer Science.Springer, 2004. 13, 14

[21] Jeff Henrikson. Completeness and total boundedness of the hausdorff met-ric. 14

[22] Vipin Kumar Jaideep Srivastava Aleksandar Lazarevic Arindam Banerjee,Varun Chandola. Anomaly detection: A tutorial. Society for Industrialand Applied Mathematics, 2008. 17

[23] B. Scholkopf and A.J. Smola. Learning with Kernels: Support Vector Ma-chines, Regularization, Optimization, and Beyond. Adaptive computationand machine learning. MIT Press, 2002. 17, 21

[24] L. Wang. Support Vector Machines: Theory and Applications. Studies inFuzziness and Soft Computing. Springer, 2005. 18

[25] M. Bianchini, M. Maggini, and F. Scarselli. Innovations in Neural Informa-tion Paradigms and Applications. Studies in Computational Intelligence.Springer, 2009. 21, 22

[26] Jean-Philippe Vert and Koji Tsuda. A primer on kernel methods. KernelMethods in Computational Biology, pages 35–70, 2004. 22

75

[27] C.H. Chen. Emerging Topics in Computer Vision and Its Applications.Computer Vision Series. World Scientific Publishing Company, Incorpo-rated, 2011. 22

[28] Cesar R. Souza. Kernel functions for machine learning applications. 23

[29] Arnulf BA Graf and Silvio Borer. Normalization in support vector ma-chines. In Pattern Recognition, pages 277–282. Springer, 2001. 23

[30] Kai-Bo Duan and S Sathiya Keerthi. Which is the best multiclass svmmethod? an empirical study. In Multiple Classifier Systems, pages 278–285. Springer, 2005. 23

[31] H. Zhang and Berkeley University of California. Adapting Learning Tech-niques for Visual Recognition. 2007. 24

[32] Bernard Haasdonk and Claus Bahlmann. Learning with distance substitu-tion kernels. In Pattern Recognition, pages 220–227. Springer, 2004. 24

[33] Stefano Spaccapietra, Christine Parent, Maria Luisa Damiani, Jose Antoniode Macedo, Fabio Porto, and Christelle Vangenot. A conceptual view ontrajectories. Data & knowledge engineering, 65(1):126–146, 2008. 28, 32

[34] Kristie Seymore. Learning hidden markov model structure for informationextraction. 28

[35] M.H. Dunham, Yu Meng, and Jie Huang. Extensible markov model. InData Mining, 2004. ICDM ’04. Fourth IEEE International Conference on,pages 371–374, Nov 2004. 28

[36] L.P. Kaelbling. Recent Advances in Reinforcement Learning. MachineLearning. Springer, 1996. 29

[37] Michael Hahsler and Margaret H Dunham. Tracds: Temporal relationshipamong clusters for data streams. 31

[38] John A Hartigan and Manchek A Wong. Algorithm as 136: A k-meansclustering algorithm. Applied statistics, pages 100–108, 1979. 32

[39] Anders Billesø Beck, Claus Risager, Nils A Andersen, and Ole Ravn.Spacio-temporal situation assessment for mobile robots. In InformationFusion (FUSION), 2011 Proceedings of the 14th International Conferenceon, pages 1–8. IEEE, 2011. 32

[40] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov Chainsand Mixing Times. American Mathematical Society, 2009. 34

[41] D.M. Gabbay, P. Thagard, J. Woods, P.S. Bandyopadhyay, and M.R.Forster. Philosophy of Statistics. Handbook of the Philosophy of Science.Elsevier Science, 2011. 34

[42] P. Lahiri. Model Selection. Institute of Mathematical Statistics LectureNotes Monograph. Institute of Mathematical Statistics, 2001. 36

[43] S. Konishi and G. Kitagawa. Information Criteria and Statistical Modeling.Springer Series in Statistics. Springer, 2008. 36

76

[44] Michael Hahsler and Margaret H Dunham. remm: Extensible markovmodel for data stream clustering in r. Journal of Statistical Software,35(5):1–31, 2010. 38

[45] Yu Meng, Margaret H Dunham, Marco F Marchetti, and Jie Huang. Rareevent detection in a spatiotemporal environment. In GrC, pages 629–634,2006. 38

[46] L. Cao, Y. Feng, and J. Zhong. Advanced Data Mining and Applications:6th International Conference, ADMA 2010, Chongqing, China, November19-21, 2010, Proceedings. Advanced Data Mining and Applications: 6thInternational Conference : Proceedings. Springer, 2010. 47

[47] Pang-Ning Tan and Michael Steinbach. Vipin kumar, introduction to datamining, 2006. 70

[48] Pei Sun, Sanjay Chawla, and Bavani Arunasalam. Mining for outliers insequential databases. In in ICDM, 2006, pages 94–106. 72

[49] Pavel Laskov, Christian Gehl, Stefan Kruger, and Klaus-Robert Muller.Incremental support vector learning: Analysis, implementation and appli-cations. The Journal of Machine Learning Research, 7:1909–1936, 2006.72

[50] Gert Cauwenberghs and Tomaso Poggio. Incremental and decrementalsupport vector machine learning. Advances in neural information processingsystems, pages 409–415, 2001. 72

77

Appendices

78

Appendix A

Source code

A.1 SVM preliminary results

rm(list=ls())

library(kernlab)

library(longitudinalData)

library(geometry)

source("C:/Users/mate/Dropbox/Thesis/bin/chapter1/Hausdorff.R")

#############################################################################

## hd = function(A,B)

## Hausdorff distance between the point sets A and B

#############################################################################

hd = function(A,B)

A <- na.omit(matrix(t(A),ncol=2,byrow=T))

B <- na.omit(matrix(t(B),ncol=2,byrow=T))

return(hausdorff(A,B))

#############################################################################

## Frechet distance calculation

#############################################################################

frechet <- function(A,B)

P <- na.omit(matrix(t(A),ncol=2,byrow=T))

Q <- na.omit(matrix(t(B),ncol=2,byrow=T))

if(ncol(P) != ncol(Q))

cat("Error: dimensionality must be the same \n")

return(NA)

result=distFrechet(P[,1],P[,2], Q[,1],Q[,2],timeScale=1,FrechetSumOrMax =

"max")

return(result)

79

#############################################################################

#Plot raw trajectories

#############################################################################

plotRawData <- function(data)

plot.new()

cols <-c("red","blue")

for(i in 1:nrow(data))

len <- length(data[i,])

class <- data[i,len]

tmp_xy <- matrix(data[i,1:len-1],nrow=2)

x <- tmp_xy[1,]

y <- tmp_xy[2,]

plot(x,y,xaxt=’n’,yaxt=’n’,

ann=FALSE,ylim=c(0,6),xlim=c(0,6),xlab="x",ylab="y")

## draw arrows from point to point :

s <- seq(length(x)-1) # one shorter than data

arrows(x[s], y[s], x[s+1], y[s+1],col=cols[class])

par(new=TRUE)

axis(1)

axis(2)

#############################################################################

#Plot trajectories with colors corresponding to the predicted classes

#############################################################################

plotPredData <- function(data,svp)

plot.new()

cols <-c("red","blue")

for(i in 1:nrow(data))

tmp_xy <- matrix(data[i,],nrow=2)

class <- predict(svp,data[i,, drop=FALSE])

x <- tmp_xy[1,]

y <- tmp_xy[2,]

plot(x,y,xaxt=’n’,yaxt=’n’,

ann=FALSE,ylim=c(0,6),xlim=c(0,6),xlab="x",ylab="y")

## draw arrows from point to point :

s <- seq(length(x)-1) # one shorter than data

arrows(x[s], y[s], x[s+1], y[s+1],col=cols[class])

par(new=TRUE)

axis(1)

axis(2)

#######################################################################

## Define test dataset

#######################################################################

train_data <- matrix(nrow=0,ncol=11)

train_data <-rbind(train_data, c(1,2.51,2,2.5,3,2.5,4,2.5,5,2.6,1))

train_data <-rbind(train_data, c(2.5,1,2.51,2,2.5,3,2.5,4,NA,NA,-1))

train_data <-rbind(train_data, c(2.5,1,2.6,2,2.5,3,2.5,4,NA,NA,-1))

80

eval_data <- matrix(nrow=0,ncol=11)

eval_data <-rbind(eval_data, c(1,1,2,3,3,1,4,3,5,1,1))

eval_data <-rbind(eval_data, c(1.5,1,2,3,3.5,1,4,3,5,1.5,1))

eval_data <-rbind(eval_data, c(1,1.5,2,3.5,3,1,4,3,5.5,1,1))

#eval_data <-rbind(eval_data, c(1,3,5,2.8,1,2.5,5,2.2,1,1.5,-1))

#eval_data <-rbind(eval_data, c(1,3,5,2.5,1,2.6,5,2.1,1,1.75,-1))

eval_data <-rbind(eval_data, c(1,1.5,5,2.2,1,2.5,5,2.8,1,3,-1))

eval_data <-rbind(eval_data, c(1,1.75,5,2.1,1,2.6,5,2.5,1,3,-1))

exp_hausdorff_k <- function (x,y)

sigma <- 2

return(exp(-hd(x,y)/sigma))

class(exp_hausdorff_k) <- "kernel"

exp_frechet_k <- function (x,y)

sigma <- 2

return(exp(-frechet(x,y)/sigma))

class(exp_frechet_k) <- "kernel"

#############################################################################

#Two class SVM

#############################################################################

kernel_function <- exp_frechet_k

x <- train_data[,1:10,drop=FALSE]

y <- train_data[,11,drop=FALSE]

svp <- ksvm(x,y,type="C-svc",C = 10,

kernel=kernel_function,scaled=c(),na.action=na.pass)

eval_x <- eval_data[,1:10,drop=FALSE]

eval_y <- eval_data[,11,drop=FALSE]

pred <- predict(svp,eval_x,type="response")

plotPredData(eval_x,svp)

title(main="Classification results",xlab="x", ylab="y")

A.2 Algorithm 1 for estimating λ

## Segments the specific day into a number of sliding windows and returns the

measurements for each such window.

## Format of output

## day_interval_data <- list()

## day_interval_data[[n]] <- list() representing time interval (t_n,t_n+1)

## XPOS, YPOS, GROUP_ID

81

getDayIntervalData =

function(day,intervalLen,startDate,endDate,startHour,endHour,GROUPS,conn)

#Number of intervals

intervalNum <- (endHour-startHour)*60/intervalLen +1

#Container to hold each interval

dayIntervalData <- vector("list", intervalNum)

for(t in seq(0,(endHour-startHour)*60-intervalLen,intervalLen))

intervalStartMinute <- t %% 60

intervalEndMinute <- (t+intervalLen) %% 60

intervalStartHour <- startHour+(t-intervalStartMinute) / 60

intervalEndHour <- startHour+(t-intervalStartMinute) / 60

if(intervalEndMinute<intervalStartMinute)

intervalEndHour <- intervalEndHour +1

cat(intervalStartHour,intervalStartMinute,">",intervalEndHour,intervalEndMinute,"\n")

intervalIndex <- t/intervalLen +1

dayIntervalData[[intervalIndex]] <- list()

for(groupID in 1:length(GROUPS))

#Determine badge ids in current group

groupIDs=GROUPS[[groupID]]

for(id in groupIDs)

#Skip anonymous

if(id=="N" | id=="-") next

query <- paste("SELECT",

" CASE WHEN(datepart(day,datetime)=datepart(day

,LEAD(datetime) over (ORDER BY datetime))) THEN",

" 0 ELSE 1 END AS NEW_DAY_FLAG,",

" XPOS,YPOS",

" FROM LOCATIONS",

" WHERE DATETIME BETWEEN ’",startDate,"’ AND ’",endDate,"’",

" AND ID=",id,

" AND DATENAME(dw,DATETIME)=’",day,"’",

" AND CONVERT(TIME,CAST(DATEPART(HOUR,DATETIME) AS VARCHAR)

+’:’+ CAST(DATEPART(MINUTE,DATETIME) AS VARCHAR))

BETWEEN

CONVERT(TIME,’",intervalStartHour,":",intervalStartMinute,"’)

AND

CONVERT(TIME,’",intervalEndHour,":",intervalEndMinute,"’)",

" ORDER BY DATETIME",sep="")

data <- dbGetQuery(conn, query)

#print(query)

if(nrow(data)==0) next

breakpoints <- which(data[,1]==1)

prevIndex=1

for(breakpoint in breakpoints)

82

len <- length(dayIntervalData[[intervalIndex]])

dayIntervalData[[intervalIndex]][[len+1]] <-

cbind(as.matrix(data[prevIndex:breakpoint,2:3],ncol=2),groupID)

prevIndex=breakpoint+1

return(dayIntervalData)

## Creates a list of Markov Chains for each sliding window during a complete day

getIntervalModels <-

function(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN)

#Number of intervals

intervalNum <- (endHour-startHour)*60/intervalLen +1

#Container for models

dayModels <- list()

## Get the states

nstates <- nclusters(regionClusters)

dayData <-

getDayIntervalData(day,intervalLen,startDate,endDate,startHour,endHour,GROUPS,CONN)

dayIntervalModels <- vector("list", intervalNum)

for(intervalIndex in 1:length(dayData))

intervalData <- dayData[[intervalIndex]]

dayIntervalModels[[intervalIndex]] <- list()

#######################################################################

#By default, the EMM allocates the states online so that the ordering

#of two models is not necessarily the same. To fix this, we initialize the

#models so that the ordering of the states is consistent

#######################################################################


dayIntervalModels[[intervalIndex]][[groupID]] <-

EMM(threshold=0.1,measure="euclidean",centroids=FALSE)

cluster(dayIntervalModels[[intervalIndex]][[groupID]],as.matrix(1:nstates),

verbose=FALSE)

update(dayIntervalModels[[intervalIndex]][[groupID]],

last_clustering(dayIntervalModels[[intervalIndex]][[groupID]]),

verbose=FALSE)

## Reset counters

dayIntervalModels[[intervalIndex]][[groupID]]@tnn_d$counts <-

dayIntervalModels[[intervalIndex]][[groupID]]@tnn_d$counts -1

n <- nrow(dayIntervalModels[[intervalIndex]][[groupID]]@tracds_d$mm@counts)

dayIntervalModels[[intervalIndex]][[groupID]]@tracds_d$mm@counts <-

matrix(0,nrow=n,ncol=n)

## Handle the case when there is no data

if(length(intervalData)==0) next

83

for(j in 1:length(intervalData))


if(length(intervalData[[j]])==0) next

mappedData <-

as.matrix(as.numeric(find_clusters(regionClusters,intervalData[[j]][,1:2],match_cluster="nn")))

groupID <- unique(intervalData[[j]][,3])

##Cluster new data

rEMM::cluster(dayIntervalModels[[intervalIndex]][[groupID]],mappedData,

verbose=FALSE)

update(dayIntervalModels[[intervalIndex]][[groupID]],

last_clustering(dayIntervalModels[[intervalIndex]][[groupID]]),

verbose=FALSE)

reset(dayIntervalModels[[intervalIndex]][[groupID]])

return(dayIntervalModels)

## lambda_propActivity - Adjust lambda so that after a high-activity period, the

half life of

## the transition counts originating from that period is

inversely proportional

## to the length of the period.

## T_1/2 = period length / PROP

## INPUTS:

## days - Names of weekdays to estimate the lambda for, format: list

of strings

## start_date - Estimation start date, format: string "YYYY/MM/DD"

## end_date - Estimation end date, format: string "YYYY/MM/DD"

## start_hour - Estimation start hour, format: integer between 0 and 23

## start_hour - Estimation end hour, format: integer between 0 and 23

## interval_len - Split the current day into intervals with length

interval_len, format: int

## region_clusters - EMM object containing the spatial mapping of the states

## GROUPS - Groups in the dataset, format: list of strings

## CONN - Database driver object

## PROP_CONST - Proportionality constant

## ACTIVITY_CONST - Determines what high-activity actually means.

## High activity periods are characterized by time instants

where the probability distr.

## distance between subsequent intervals is greater than

ACTIVITY_CONST* of the mean distance over the whole day

## OUTPUT:

## lambdas - List of matrices of lambdas for each day for each group. In

the matrices, the lambdas for each group is stored on one row.

##

lambdaAlgorithm1 <-

function(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN,PROP_CONST,ACTIVITY_CONST)

84

dayIntervalModels <-

getIntervalModels(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN)

#Number of Windows

windowNum <- (endHour-startHour)*60/intervalLen +1

lambdas <- matrix(0,nrow=0,ncol=windowNum-1)

for(gid in 1:length(GROUPS))

probDistance <- matrix(nrow=0,ncol=nclusters(regionClusters))

for(intervalIndex in 2:length(dayIntervalModels))

klDiv <- c()

totalVar <- c()

T_prev <-

transition_matrix(dayIntervalModels[[intervalIndex-1]][[gid]],type="probability")

T_curr <-

transition_matrix(dayIntervalModels[[intervalIndex]][[gid]],type="probability")

for(j in 1:nclusters(regionClusters))

## Total variation distance

##

http://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures

totalVar <- c(totalVar,sum(abs(T_curr[j,]-T_prev[j,])))

## Kullback-Leibler divergence

## http://en.wikipedia.org/wiki/Kullback-Leibler_divergence

klDiv <- c(klDiv,mean(KLdiv(cbind(T_curr[j,],T_prev[j,]))))

probDistance <- rbind(probDistance,klDiv)

rownames(probDistance) <- 1:nrow(probDistance)

##DEBUG

#heatmap(probDistance, Rowv=NA, Colv=NA, col = heat.colors(256),

scale="column", margins=c(3,3))

## Changes in the probability distributions may not be homogeneous over all

the states

## Select which states are homogeneous so that we can take the average over

them

homStates <- seq(1:nclusters(regionClusters))

avgActivity <- rowMeans(probDistance[,homStates])

#plot(avgActivity,ylab="Average symmetric Kullback-Leibler

divergence",xlab="Time window index")

#abline(ACTIVITY_CONST*mean(avgActivity),0,col=2)

## Now identify the time indices where the avg_activity is high

## High is definined to be greater than the mean

highActivity <- which(avgActivity > ACTIVITY_CONST*mean(avgActivity))

## Init the lambda vector with 0 for all time instants

lambda <- rep(0,length(avgActivity))

85

## For periods with high activity, set lambda proportional to the length of

these

## periods

segments <- list()

nseg <- 1

segments[[nseg]] <- highActivity[1]

for(i in 1:length(highActivity))

if(i>1)

if(highActivity[i]-prev==1)

segments[[nseg]] <- c(segments[[nseg]],highActivity[i])

else

nseg <- nseg +1

segments[[nseg]] <- c(highActivity[i])

prev <- highActivity[i]

tmp_lambda <- numeric(0)

for(i in 1:length(segments))

tmp_lambda <-

c(tmp_lambda,rep(PROP_CONST/length(segments[[i]]),length(segments[[i]])))

lambda[highActivity] <- tmp_lambda

lambdas <- rbind(lambdas,lambda)

return(lambdas)

A.3 Algorithm 2 for estimating λ

## lambda_optProbDist - Adjust lambda so that the distance between the

probability distributions is minimized for each time interval.

## The fading eq is C_t * 2^-lambda + delta_C_t+1 =

target_C_t+1

##

## INPUTS:

## days - Names of weekdays to estimate the lambda for, format: list

of strings

## start_date - Estimation start date, format: string "YYYY/MM/DD"

## end_date - Estimation end date, format: string "YYYY/MM/DD"

## start_hour - Estimation start hour, format: integer between 0 and 23

## start_hour - Estimation end hour, format: integer between 0 and 23

## interval_len - Split the current day into intervals with length

interval_len, format: int

## region_clusters - EMM object containing the spatial mapping of the states

## GROUPS - Groups in the dataset, format: list of strings

## CONN - Database driver object

## method - Name of probability distance to use, format: string, either

"tv" or "kld"

## OUTPUT:

## lambdas - List of matrices of lambdas for each day for each group. In

the matrices, the lambdas for each group is stored on one row.

##

86

lambda_optProbDist <-

function(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN,method)

#Number of Windows


##Objective function measuring the distance between two sets of probability

## distributions. The probabilities are computed pointwise either using total

variation or KLd

obj_probDist <- function(lambda)

Pa <- targetC/rowSums(targetC)

Pb <- ((c*2^-lambda)+deltaC)/(rowSums((c*2^-lambda)+deltaC))

probDistance <- 0

for(i in 1:nrow(Pa))

if(method=="tv")

probDistance <- probDistance + sum(abs(Pa[i,]-Pb[i,]))

else

probDistance <- probDistance + mean(KLdiv(cbind(Pa[i,],Pb[i,])))

return(probDistance)

## Upper threshold for lambda in optimization

OPT_LIM <- 100

dayIntervalMatrices <-

getIntervalCountMatrices(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN)

## Empty matrix to hold the lambda for each group

lambdas <- matrix(0,nrow=0,ncol=windowNum-2)


lambda <- numeric(0)

for(intervalIndex in 2:(length(dayIntervalMatrices)-1))

tmpC <- dayIntervalMatrices[[intervalIndex-1]][[groupID]]

tmpCDeltas <- dayIntervalMatrices[[intervalIndex]][[groupID]]

tmpCTarget <- dayIntervalMatrices[[intervalIndex+1]][[groupID]]

## Compute average C

c <- Reduce(’+’, tmpC) / length(tmpC)

deltaC <- Reduce(’+’, tmpCDeltas) / length(tmpCDeltas)

targetC <- Reduce(’+’, tmpCTarget) / length(tmpCTarget)

lambdaTmp <- numeric(0)

for(i in 1:1)

lambdaTmp <- c(lambdaTmp,optimx(runif(1,0,1),obj_probDist, method =

"L-BFGS-B",lower=0,upper=OPT_LIM)$p1)

# for(i in 1:10)

#

# ## Split deltas into two random partitions

# randomIndices <- permute(1:length(tmpCDeltas))

#

87

# len <- length(randomIndices)

# halfLen <- floor(len/2)

#

# tmpDeltaC <- tmpCDeltas[randomIndices[1:halfLen]]

# tmpTargetC <- tmpCDeltas[randomIndices[(halfLen+1):len]]

#

# deltaC <- Reduce(’+’, tmpDeltaC) / length(tmpDeltaC)

# targetC <- Reduce(’+’, tmpTargetC) / length(tmpTargetC)

#

# lambdaTmp <- c(lambdaTmp, optimx(0.01,obj_probDist, method =

"L-BFGS-B",lower=0,upper=OPT_LIM)$p1)

#

lambda <- c(lambda,mean(lambdaTmp))

lambdas <- rbind(lambdas,lambda)

return(lambdas)

## Creates a list of Markov Chains for each sliding window and employee during a

complete day

getIntervalCountMatrices <-

function(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN)

#Number of Windows


## Get the states

nstates <- nclusters(regionClusters)

## Get current day data

## The data is a list of a number of sub-lists, each representing a time

interval

## Each sub-list is itself a list and contains separate trajectories in the

time interval

## The format for a trajectory is <XPOS,YPOS, GROUP_ID>

dayData <-

getDayIntervalData(day,intervalLen,startDate,endDate,startHour,endHour,GROUPS,CONN)

## Create list to hold transition matrices for different days

dayIntervalMatrices <- vector(mode="list",length=length(dayData))

for(intervalIndex in 1:length(dayData))

intervalData <- dayData[[intervalIndex]]

dayIntervalMatrices[[intervalIndex]] <-

vector(mode="list",length=length(GROUPS))

## Add empty transition count matrices


dayIntervalMatrices[[intervalIndex]][[groupID]] <- list()

dayIntervalMatrices[[intervalIndex]][[groupID]][[1]] <-

matrix(0.001,ncol=nstates,nrow=nstates)


if(length(intervalData)==0) next

for(j in 1:length(intervalData))


if(length(intervalData[[j]])==0) next

88

mappedData <-

as.matrix(as.numeric(find_clusters(regionClusters,intervalData[[j]][,1:2],match_cluster="nn")))

groupID <- unique(intervalData[[j]][,3])

#######################################################################

#Init tmp model that will be used to estimate the transition matrix

#By default, the EMM allocates the states online so that the ordering

#of two models is not necessarily the same. To fix this, we initialize the

#models so that the ordering of the states is consistent

#######################################################################

model <- EMM(threshold=0.1,measure="euclidean",centroids=FALSE)

cluster(model,as.matrix(1:nstates), verbose=FALSE)

update(model, last_clustering(model), verbose=FALSE)

## Reset counters

model@tnn_d$counts <- model@tnn_d$counts -1

n <- nrow(model@tracds_d$mm@counts)

model@tracds_d$mm@counts <- matrix(0,nrow=n,ncol=n)

## Cluster new data into tmp model

rEMM::cluster(model,mappedData, verbose=FALSE)


len <- length(dayIntervalMatrices[[intervalIndex]][[groupID]])

dayIntervalMatrices[[intervalIndex]][[groupID]][[len+1]] <-

transition_matrix(model,type="counts")

return(dayIntervalMatrices)

A.4 Outlier detection in the Badge dataset

## Clear all

rm(list=ls())

setwd("C:/Users/mate/Dropbox/Thesis/badge/relations")

Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.7.0_45")

library(RJDBC)

library(rEMM)

library(flexmix)

library(optimx)

library(gtools)

library(raster)

source(’C:/Users/mate/Dropbox/Thesis/badge/outlier/mapToRegions.R’)

source("C:/Users/mate/Dropbox/Thesis/badge/classify/getDayIntervalData.R")

source(’C:/Users/mate/Dropbox/Thesis/badge/classify/getIntervalModels.R’)

source(’C:/Users/mate/Dropbox/Thesis/badge/outlier/getIDMeasurements.R’)

source(’C:/Users/mate/Dropbox/Thesis/badge/classify/lambda_propActivity.R’)

source(’C:/Users/mate/Dropbox/Thesis/badge/classify/lambda_optProbDist.R’)

source(’C:/Users/mate/Dropbox/Thesis/badge/classify/getIntervalCountMatrices.R’)

## Set whether or not we use fading

89

useFade <- TRUE

## Set whether or not we use static states

useStaticRegions <- TRUE

## Methods to estimate lambda

## 1) The inverse of lambda is the half life of the transition counts.

## Adjust lambda so that the half life is proportional to the length

## of the intervals with high variability.

## At the end of the intervals half of the counts remain

##

## 2) Minimize the probability distance between the probability distributions

## at fixed time intervals

estLambda <- 1

#######################################################################

## Determine group ids

#######################################################################

options(stringsAsFactors = FALSE)

ASSIGN <- read.csv("badge_assignment.csv",header=TRUE)

DEPARTMENTS <- unique(ASSIGN$role)[1:3]

GROUPS <- list()

for(i in 1:length(DEPARTMENTS))

GROUPS[[i]] <- ASSIGN$BID[which(ASSIGN$role %in% DEPARTMENTS[i])]

#######################################################################

## Create database connection

#######################################################################

DRV <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver",

"C:/JDBC/sqljdbc_4.0/sve/sqljdbc4.jar")

CONN <- dbConnect(DRV,

"jdbc:sqlserver://MATE-PC:1433;database=Badge;integratedSecurity=true;")

#######################################################################

## Create static states by decomposing the floorplan manually

## into regions

#######################################################################

if(useStaticRegions==TRUE)

map_xy <- matrix(nrow=0,ncol=2)

map_names <- c("Configuration","Coordinator","Pricing","RSSI","Base station")

for(i in 1:length(map_names))

map_xy <- rbind(map_xy,cbind(ASSIGN$x[which(ASSIGN$role %in%

map_names[i])],ASSIGN$y[which(ASSIGN$role %in% map_names[i])]))

#Segment the map_xy into different regions

max_dist <-1000

regionClusters <- tNN(threshold=max_dist,measure="euclidean",centroid=TRUE)

cluster(regionClusters,map_xy)

centers <- cluster_centers(regionClusters)

#######################################################################

## Select outlier detection parameters

#######################################################################

trainDate <- "2007/03/28"

90

evalDate <- "2007/03/28"

#Sliding window length, minutes

intervalLen <- 2

#Offset between training and eval

offset <- 5

#Start and end hours

startHour <- 0

endHour <- 23

#######################################################################

## Estimate the forgetting factor lambda

## Do this either by

## 1) Adjusting lambda so that roughly half of the transition counts remain

## after a high-activity period

##

## 2) Adjusting lambda so that the probabilistic distance is minimized

#######################################################################

## Parameters for lambda estimation

startDate <- "2007/03/23"

endDate <- "2007/04/06"

day <- "Wednesday"

if(estLambda==1)

PROP_CONST <- 10

ACTIVITY_CONST <- 1

lambdas_propActivity <-

lambda_propActivity(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN,PROP_CONST,ACTIVITY_CONST)

else if(estLambda==2)

lambdas_optProbDist <-

lambda_optProbDist(day,startDate,endDate,startHour,endHour,intervalLen,regionClusters,GROUPS,CONN,"kld")

lambdas <- lambdas_propActivity

groupOutliers <-

function(GROUP,startHour,endHour,intervalLen,trainDate,occurrenceThr,transitionThr,groupID)

## Select which group to look for outliers in

#GROUP <- GROUPS[[groupID]]

outliers <- numeric(0)

#######################################################################

## Build empty model for outlier detection

#######################################################################


#Discrete values only, threshold not important as long as it’s < 1

thr <- 0.1

model <-

EMM(threshold=thr,measure="euclidean",lambda=lambdas[groupID,],centroids=FALSE)

91

cluster(model,as.matrix(1:nclusters(regionClusters)), verbose=FALSE)


## Reset counters

model@tnn_d$counts <- model@tnn_d$counts -1

n <- nrow(model@tracds_d$mm@counts)

model@tracds_d$mm@counts <- matrix(0,nrow=n,ncol=n)

else

#Continous values, threshold serves as the minimum resolution for detecting

behaviours

thr <- 1400

model<-

EMM(threshold=thr,measure="euclidean",lambda=lambdas[groupID,],centroids=TRUE)

for(t in seq(0,(endHour-startHour)*60-intervalLen,intervalLen))

#######################################################################

## Determine training and eval intervals

#######################################################################

trainStartMinute <- t %% 60

trainEndMinute <- (t+intervalLen) %% 60

trainStartHour <- startHour+(t-trainStartMinute) / 60

trainEndHour <- startHour+(t-trainStartMinute) / 60

if(trainEndMinute<trainStartMinute)

trainEndHour <- trainEndHour +1

trainStartDate=paste(trainDate,trainStartHour,":",trainStartMinute);

trainEndDate=paste(trainDate,trainEndHour,":",trainEndMinute);

timeWindowIndex <- t/intervalLen +1

#######################################################################

## Get training data

#######################################################################

cat("Training:",trainStartDate,trainEndDate,"\n")

trainingData <- getIDMeasurements(CONN,GROUP,trainStartDate,trainEndDate)

#######################################################################

## Map to regions

#######################################################################


mappedData <- mapToRegions(regionClusters,trainingData)

#######################################################################

## Update EMM and detect outliers

#######################################################################

#Handle the case when the group is inactive

if(length(mappedData)==0) next

#Fade

if(useFade==TRUE)

model <- fade(model,1,model@lambda[timeWindowIndex])

92

outlier <- numeric(0)

normalTrajectory <- list()

outlierTrajectory <- list()

for(i in 1:length(mappedData))

#Skip empty

if(nrow(mappedData[[i]])==0) next

rEMM::cluster(model,as.matrix(mappedData[[i]][,1],ncol=1), verbose=FALSE)

isOutlier <- rEMM::updateAndFindOutliers(model,

last_clustering(model),occurrenceThr,transitionThr, verbose=FALSE)

rEMM::reset(model)

if(isOutlier)

outlier <- c(outlier,unique(mappedData[[i]][,2]))

outlierTrajectory[[length(outlierTrajectory)+1]] <-

trainingData[[i]][,1:2]

else

normalTrajectory[[length(normalTrajectory)+1]] <- trainingData[[i]][,1:2]

png(paste("outliers/outlier",timeWindowIndex,".png"))

par(new=F)

par(new=F)

if(length(normalTrajectory)!=0)

for(k in 1:length(normalTrajectory))

## Plot normal trajectory with arrows

plot(normalTrajectory[[k]],type="l",xlim=c(200,7200),ylim=c(1600,5200),col=1)

normalSequence <- seq(nrow(normalTrajectory[[k]])-1) # one shorter than

data

if(length(normalSequence)!=0)

normalX <- normalTrajectory[[k]][,1]

normalY <- normalTrajectory[[k]][,2]

arrows(normalX[normalSequence], normalY[normalSequence],

normalX[normalSequence+1], normalY[normalSequence+1],col=1)

par(new=T)

if(length(outlierTrajectory)!=0)

for(k in 1:length(outlierTrajectory))

## Plot outlier trajectory with arrows

plot(outlierTrajectory[[k]],type="l",xlim=c(200,7200),ylim=c(1600,5200),col=2

)

outlierSequence <- seq(nrow(outlierTrajectory[[k]])-1) # one shorter

than data

if(length(outlierSequence)!=0)

outlierX <- outlierTrajectory[[k]][,1]

outlierY <- outlierTrajectory[[k]][,2]

arrows(outlierX[outlierSequence], outlierY[outlierSequence],

outlierX[outlierSequence+1], outlierY[outlierSequence+1],col=2 )

par(new=T)

93

par(new=F)

title(main=paste("Trajectory outlier detection, ",trainStartDate," >

",trainEndDate),xlab="x",ylab="y")

print(outlier)

dev.off()

print(outlier)

#plot(raster(transition_matrix(model,type="probability")))

outliers <- c(outliers,outlier)

return(outliers)

94

Spatio-temporal outlier detection in streaming …762507/FULLTEXT01.pdf7.11 Outlier detection...

Documents

Transcript of Spatio-temporal outlier detection in streaming …762507/FULLTEXT01.pdf7.11 Outlier detection...