
X-OUTLIER DETECTION AND PERIODICITY

DETECTION IN LOAD CURVE DATA IN

POWER SYSTEMS

by

Zhihui Guo

B.Sc. Sun Yat-sen University, China, 2009

a Thesis submitted in partial fulfillment

of the requirements for the degree of

Master of Science

in the School

of

Computing Science

© Zhihui Guo 2011

SIMON FRASER UNIVERSITY

Summer 2011

All rights reserved. However, in accordance with the Copyright Act of

Canada, this work may be reproduced without authorization under the

conditions for Fair Dealing. Therefore, limited reproduction of this

work for the purposes of private study, research, criticism, review and

news reporting is likely to be in accordance with the law, particularly

if cited appropriately.


APPROVAL

Name: Zhihui Guo

Degree: Master of Science

Title of Thesis: X-Outlier Detection and Periodicity Detection in Load Curve

Data in Power Systems

Examining Committee: Dr. Jian Pei

Chair

Dr. Ke Wang

Senior Supervisor

Dr. Martin Ester

Supervisor

Dr. Fred Popowich

SFU Examiner

Date Approved: 16 August 2011

Partial Copyright Licence


Abstract

Load curve data is a type of time series data that records electric energy consumption at successive time points, and it plays an important role in the operation and planning of power systems. Unfortunately, load curves often contain abnormal, noisy, unrepresentative, and missing data due to various random factors. It is crucial for power systems to identify and repair corrupted and unrepresentative data before load curve data can be used for planning and modeling. In this thesis we present a new class of X-outliers, which have abnormal power consumption levels relative to a periodicity (the X-axis), and propose a novel solution to detect these outliers. The underlying assumption is that the data set follows a periodicity and that the length (not the pattern) of the periodicity is known. This is the case for most real load curve data collected at BC Hydro.

In the above, the periodicity is assumed to be known for X-outlier detection. In some other applications, however, the periodicity needs to be discovered. The latter is the case when the periodicity evolves, when a new time series is collected, or when conditions that affect the time series have changed. Periodicity detection for time series has important applications in forecasting, planning, trend detection, and outlier detection. For time series with unknown periodicity, X-outlier detection can still be performed after the periodicity is detected. Thus X-outlier detection and periodicity detection are highly related, and periodicity detection can be considered a pre-processing step of X-outlier detection for time series with unknown periodicity. Therefore, in this thesis we also propose a trend-based periodicity detection algorithm for time series data with unknown periodicity. This approach is trend preserving and noise resilient. Real load curve data from the BC Hydro system is used to demonstrate the effectiveness and accuracy of the proposed methods.

Keywords: Time Series, Load Management, Power Systems, Power Quality, Smoothing

Methods, Periodicity Detection.


Acknowledgments

My special thanks go to my senior supervisor, Dr. Ke Wang. I have benefited greatly from his insights and from every discussion with him. This work would not have been possible without his invaluable guidance and great patience. I am grateful for the inspiring discussions with him that led to this thesis. I would like to thank my supervisor, Dr. Martin Ester, and my examiner, Dr. Fred Popowich, for their precious time and useful comments on my thesis. I would also like to thank Dr. Jian Pei for taking the time to chair my thesis defense.

I am grateful to BC Hydro's Principal Engineer Dr. Wenyuan Li, Manager Dr. Tito Inga-Rojas, and Senior Engineer Dr. Adriel Lau for every discussion about my research. I thank them for giving me the opportunity to work at BC Hydro for over a year. I am thankful for the precious time they spent training me to be a better presenter and executor. I learned how to carry out a practical project from their on-site supervision, feedback, testing, and evaluation of our collaborative project. I would also like to thank BC Hydro for access to their precious data sets.

I would like to thank SFU for providing excellent facilities and a comfortable environment, and to thank NSERC and BC Hydro for funding my studies at SFU.

I would like to take this opportunity to thank my friends Wen Huang, Bo Hu, Hua Huang, Jiyi Chen, Chao Han, Peng Wang, Judy Yeh and Zhensong Qian for their care and help.

Finally, I would like to express my deepest gratitude to my family for their continuous love, support and encouragement.


Contents

Approval ii

Abstract iii

Acknowledgments iv

Contents v

List of Tables viii

List of Figures ix

1 Introduction 1

1.1 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 X-outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Periodicity Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.3.1 For Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.3.2 For Periodicity Detection . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Background 12

2.1 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1.1 Overview of Existing Techniques . . . . . . . . . . . . . . . . . . . . . 12

2.1.2 Outlier Detection in a Time Series Database . . . . . . . . . . . . . . 14

2.1.3 Outlier Detection in a Single Time Series . . . . . . . . . . . . . . . . 14

2.2 Periodicity Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


3 Trend Modelling 16

3.1 Nonparametric Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Kernel Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.3 Smoothing Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 X-Outlier Detection 19

4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.2.1 Observations for X-Outlier Detection . . . . . . . . . . . . . . . . . . . 21

4.2.2 Approximating the Smoothing Curve by Peaks and Valleys . . . . . . 22

4.2.3 Identifying Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2.4 Repairing Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3 Practical Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.4.1 Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.4.2 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.4.3 Parameter Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.4.4 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5 Trend Based Periodicity Detection 39

5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1.1 Periodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1.2 Dynamic Time Warping . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.1.3 The WARP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.2 The Trend Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.2.1 Observations for Periodicity Detection . . . . . . . . . . . . . . . . . . 43

5.2.2 Identifying Periodicities Using The Shape Sequence . . . . . . . . . . 44

5.2.3 Computing the Length of Candidate Periods . . . . . . . . . . . . . . 46

5.2.4 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.3.1 Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.3.2 Parameter Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.3.3 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3.4 Effect of Smoothness Levels . . . . . . . . . . . . . . . . . . . . . . . . 50


5.3.5 Effect of Discretization on WARP . . . . . . . . . . . . . . . . . . . . 51

5.3.6 Multiple Periodicities . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Conclusion and Future Work 54

Bibliography 56


List of Tables

4.1 Proposed method and traditional smoothing method for data sets with no

X-outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2 Running median method for data sets with no X-outliers . . . . . . . . . . . . 33

4.3 Proposed method and traditional smoothing method for data sets with X-

outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.4 Running median method for data sets with X-outliers . . . . . . . . . . . . . 34

5.1 Accuracy comparison on “Noisy” data . . . . . . . . . . . . . . . . . . . . . . 49

5.2 Accuracy comparison on “Normal” data . . . . . . . . . . . . . . . . . . . . . 49

5.3 Trend based algorithm for the “Noisy” data (confidence threshold set as 70%) 50

5.4 Trend based algorithm for the “Normal” data (confidence threshold set as 70%) 50

5.5 WARP on “Normal” data (confidence threshold set as 70%, equi-width binning) 51

5.6 WARP on “Normal” data (confidence threshold set as 70%, equi-depth binning) 52


List of Figures

1.1 Local Y-outliers identified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Global Y-outliers identified . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Outlier not identified by smoothing techniques (a) Load curve data (b) Model

the trend by a proper smoothing curve (c) Model the trend by an overly flat

smoothing curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Load curve data with labeled X-outliers . . . . . . . . . . . . . . . . . . . . . 5

1.5 X-outliers not identified by smoothing techniques . . . . . . . . . . . . . . 6

1.6 Four days' data with daily periodicity . . . . . . . . . . . . . . . . . . . . . 8

1.7 Five weeks’ data with weekly periodicity . . . . . . . . . . . . . . . . . . . . . 8

4.1 Example for an X-outlier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2 Example for an X-outlier within a valley . . . . . . . . . . . . . . . . . . . 22

4.3 Smoothing curve [t1, t12]. The horizontal axis is time. The values on the

curve are the slopes at each point. [t1, t5] is a maximal-decreasing interval;

[t6, t10] is a maximal-increasing interval. [t4, t8] is a ∪ shape. . . . . . . . . . . 23

4.4 Two similar load curves with noise . . . . . . . . . . . . . . . . . . . . . . . . 25

4.5 Two similar load curves with time shifting and stretching . . . . . . . . . . . 26

4.6 The system tool developed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.7 Outlier detection for a six-year test data set. (a) Outlier detection result for

smoothness level 5. (b) Outlier detection result for smoothness level 1. (c)

Outlier detection result for smoothness level 10. . . . . . . . . . . . . . . . . . 36

4.8 Outlier repairing for the six-year test data set. . . . . . . . . . . . . . . . . . 37

4.9 Outlier repairing for a five-week test data set. (a) Test data before outlier

repairing. (b) Test data after outlier repairing. . . . . . . . . . . . . . . . . . 38


5.1 An example for the DTW matrix . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.2 Alignment for the DTW matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.3 Alignment for T(3) and T (3) where T = “abcabcabd” . . . . . . . . . . . . . . . 42

5.4 DTW matrix for sequences T and T where T = e1e2 . . . en . . . . . . . . . . . 42

5.5 Example for a period consisting of a peak and a valley . . . . . . . . . . . . . 44

5.6 Example for a period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.7 One week's data with a weekly pattern . . . . . . . . . . . . . . . . . . . . 48

5.8 Five weeks' data with daily patterns, smoothness level 3 . . . . . . . . . . . 53

5.9 Five weeks' data with weekly patterns, smoothness level 5 . . . . . . . . . . 53


Chapter 1

Introduction

1.1 Outlier Detection

Load curve data is a type of time series data that records the power consumption measured by meters at regular time intervals. The quality of load curve data is essential to load forecasting [34, 40], system analysis, system operation and visualization, system reliability performance, energy saving, and accuracy in system planning [37]. Two key features of smart grids [42] are self-healing from power disturbance events and enabling active participation by consumers in demand response. The collection of valid load curve data is crucial for supporting decision making in smart metering and smart grid systems.

Collecting all load data accurately in fine granularity is a challenging objective. There is

often missing and corrupted data in the process of information collection and transfer, due

to various causes including meter malfunction, communication failures, equipment outages,

lost data, unexpected shutdown, unscheduled maintenance, and unknown factors. Since such

events cause a significant deviation in load and do not repeat regularly, their presence results

in load data records being unrepresentative of actual usage patterns. Therefore, before load

curve data can be used for load forecasting, modeling and system analysis, an important

task is to identify and repair abnormal data that are unrepresentative of underlying usage

patterns, called “outliers” below.

A load curve is a time series with the Y-axis representing power or energy consumption

and the X-axis representing the time. Outlier detection in time series has been a topic

in data mining [1, 2, 3] and statistics [4]. Most previous work focused on outliers that

have invalid Y-axis values compared to the behaviors in a local neighborhood [5]. We call


such outliers Y-outliers. Smoothing techniques have been used successfully to detect Y-outliers [5]. The idea is to model the data by a smoothing curve whose points are derived from the observations in a local neighborhood. The moving average [41] is another example of a smoothing technique. Y-outliers are the observations that deviate substantially from the smoothing curve. To define the “deviation” from the smoothing curve, a technique called the “confidence interval” [5] is applied. Data points outside the confidence interval are the detected Y-outliers.
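The smoothing-curve-plus-confidence-interval idea can be sketched in miniature. This is an illustrative toy, not the exact estimator of [5]: it uses a centred moving average as the smoothing curve and a band of k residual standard deviations as the confidence interval, and the function and parameter names (`detect_y_outliers`, `window`, `k`) are hypothetical.

```python
import numpy as np

def detect_y_outliers(series, window=5, k=3.0):
    """Flag Y-outliers: points deviating from a moving-average smoothing
    curve by more than k standard deviations of the residuals."""
    x = np.asarray(series, dtype=float)
    # Centred moving average as a simple smoothing curve.
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")
    # Confidence interval: smoothing curve +/- k * std of the residuals.
    resid = x - smooth
    band = k * resid.std()
    return np.where((x < smooth - band) | (x > smooth + band))[0]

# A roughly flat load curve with one corrupted meter reading at index 6.
load = [10.0, 11, 10, 12, 11, 10, 45, 11, 10, 12, 11, 10]
print(detect_y_outliers(load, window=5, k=2.0))  # → [6]
```

A real deployment would estimate the band locally rather than from one global residual deviation, but the principle is the same: whatever falls outside the interval around the smoothing curve is flagged as a Y-outlier.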

Figure 1.1: Local Y-outliers identified

Figure 1.2: Global Y-outliers identified

An example showing the Y-outliers detected by the smoothing techniques is displayed

in Figure 1.1 with one week’s data and Figure 1.2 with almost one year’s data. The red

curve is the smoothing curve and the two blue curves are the upper bound and lower bound


of the confidence interval. In Figure 1.1, the data points outside the confidence interval are

detected as Y-outliers, denoted in the circles. In Figure 1.2, some of the Y-outliers detected

using the smoothing techniques are denoted in the circles.

Y-outliers have been studied in previous work [5, 41]. In this thesis we do not consider Y-outliers, but rather a new type of outliers, called X-outliers, discussed below.

1.1.1 X-outliers

As is well known, load curve data exhibits some loose form of periodicity (e.g., daily, weekly,

monthly, seasonal, yearly). The term “loose” means that the actual data values can be

different but the trend of the data repeats itself regularly at some interval. In statistics,

such periodicity is known as seasonality. Periodicities of load curves are usually known

to power utilities due to regularities of usage patterns. Our investigation indicates that

abnormal data may occur as a deviation from such periodicities. In this thesis the term

X-outliers refers to such deviations. For example, the data in Figure 1.3(a) has a weekly

periodicity, i.e., high weekdays and low weekends, but the data in the rectangle box in the

first week is an X-outlier to this periodicity.

In general, X-outliers are caused by random events such as malfunction of data metering

or transfer systems, outages, unexpected full or partial shutdown of production lines, un-

scheduled strikes, temporary weather changes, etc. Such events are unlikely to occur again

in other periods, thus, are not representative of regular patterns of load curve. Therefore,

before the data can be used, such unrepresentative data must be identified and repaired.

Notice that we distinguish between an “important change” (say in temperature), which is

likely to persist in the future, and a “random rise or drop”, which most likely does not occur

again in the future. The latter case is caused by random events such as outages, temporary

weather change, union strike, unscheduled maintenance, etc. Identifying the affected data

in this case is exactly the motivation of this work.

Another purpose of identifying X-outliers is to focus further investigation on the potential

problematic areas of the data and find the cause of unusual data. In the example of Figure

1.3(a), because of the identification of the unusually low consumption during Wednesday-

Friday in the first week, the user could conduct further investigation and find that the low

consumption was caused by a union strike during Wednesday-Friday in that week. Without

identifying X-outliers, finding such problematic areas will require scanning huge load curves

manually, which is a painstaking task.


Figure 1.3: Outlier not identified by smoothing techniques. (a) Load curve data. (b) Modeling the trend by a proper smoothing curve. (c) Modeling the trend by an overly flat smoothing curve.


The neighborhood based techniques, such as moving average [41] and smoothing tech-

niques [5], are effective for detecting Y-outliers. Since X-outliers are deviations from a

periodicity, these outliers cannot be detected by such techniques for two reasons: (1) check-

ing deviations from a periodicity requires identifying the periodic pattern, thus, examining

the data in all periods; (2) an X-outlier could last for a considerable time and form its own

trend of a sizable neighborhood, thus, misleading all local information based methods.

To illustrate these points, consider the data in Figure 1.3(a). In Figure 1.3(b), the tra-

ditional smoothing curve models the normal weeks correctly by including most observations

in the confidence interval, but it also models the corrupted data in the first week.¹ In Figure 1.3(c), with a flatter smoothing curve and a smaller confidence interval, the traditional

smoothing technique detects all weekend drops as outliers. Neither result is satisfactory.

Using a larger confidence interval does not help because the outlier in the first week will not

be detected. The problem with the traditional smoothing techniques is that they ignore the

periodicity information. If the knowledge on periodicity is used, the data in the first week

would not be considered as normal because it does not repeat in other weeks.

Figure 1.4: Load curve data with labeled X-outliers

Another example of the limitations of the neighborhood based techniques for detecting

X-outliers is shown in Figure 1.4 and Figure 1.5. These two Figures show the same six-year

load curve data with yearly periodicity. The X-outliers are labeled in the rectangle boxes in

¹In the traditional smoothing techniques, only the data points falling outside the confidence interval are considered as outliers. That is, in this example, the data points outside the upper and lower confidence interval curves in Figure 1.3(b) and Figure 1.3(c).


Figure 1.5: X-outliers not identified by smoothing techniques

Figure 1.4. It can be seen from Figure 1.5 that the traditional smoothing technique does not

detect any X-outlier because the labeled X-outliers are all within the specified confidence

interval. Since the neighborhood based techniques fail to detect X-outliers, a new technique is required, and this is the motivation of this work.

1.2 Periodicity Detection

As discussed in Section 1.1.1, the X-outlier detection considered in this thesis depends on

a known periodicity. In some other applications, however, the periodicity is unknown and

needs to be discovered. For example, for a time series with unknown periodicity, if we would

like to see whether there are X-outliers in the time series, we have to detect the periodicity

first. Thus X-outlier detection and periodicity detection are highly related and periodicity

detection could be considered as a pre-processing step of X-outlier detection for time series

with unknown periodicity. Therefore, in this thesis a method for periodicity detection in

time series data will also be presented. Time series often have some form of periodicity

where a certain pattern repeats itself at regular time intervals, and due to the presence

of noise, such periodicities are subject to variances in both time and data value. Many

real life applications depend on knowing the periodicities of time series [34]. For example,

power utilities use periodic usage patterns for load forecasting, system analysis, scheduling

maintenance, energy saving, filtering missing and corrupted data [5, 37]. Other sources of

time series include weather data, sensor generated data, physical traffic data, stock index


data, network flow data, patient physiological data, etc. Periodic pattern mining is useful

in predicting the stock price movement, computer network fault analysis and detection

of security breach, earthquake prediction, and gene expression analysis [38, 39]. While

periodicities may be known in some applications, in other applications they need to be

discovered. The latter is the case when the periodicity evolves, when a new time series is

collected, or when conditions that affect time series have changed. In this thesis, we consider

periodicity detection from time series data.

Periodicity detection has been an active topic in data mining and statistics [27, 28, 29,

30, 31, 32, 33]. Three types of periodicity have been identified [27, 36]: segment periodicity,

symbol periodicity, and partial periodicity. In this thesis, we consider segment periodicity,

where a time series consists of the repetition of a segment in the series. Most existing

algorithms [27, 28, 36] first discretize a real valued time series into a sequence of discrete

symbols before performing periodicity detection. Common discretization methods include

equi-width binning, where each bin has the same size, or equi-depth binning, where each bin

contains the same number of data points. With this preprocessing step, most algorithms

assume a sequence of discrete symbols as the input [27, 28, 36].

Unfortunately, the above approach suffers from major drawbacks. First, it is difficult to

specify a proper number of bins. A large number makes similar data different and a small

number makes different data similar, both of which impair the detection of periodicity.

Second, a fixed binning scheme is not suitable for a time series where different parts may

have different characteristics. For example, daytime and nighttime may have different data

characteristics, so do weekdays and weekends. Third, the binning method considers each

time point independently and is not sensitive to the preservation of neighborhood based

trends.

To illustrate these drawbacks, let us consider the four-day hourly time series in Figure

1.6. This data has a strong daily periodicity as highlighted by the peaks and valleys in the

rectangles. With equi-width binning, the y-values are discretized into four equal-sized bins

{a, b, c, d}. This discretization breaks each big peak into several bins (i.e., a, b, and c) and

collapses the small peaks and valleys into one bin d. After such discretization, the daily

trend that a big peak is followed by two small valleys and two small peaks is lost.

Figure 1.6: Four days' data with daily periodicity

Figure 1.7 further illustrates the last drawback mentioned above. This is a time series of five weeks' hourly data with a clear periodic pattern and with noise shown in the rectangle boxes. Suppose that the data is discretized into eight bins {a, b, c, d, e, f, g, h} using equi-width binning. Since each data point is discretized independently, the noisy data are also discretized into bins, instead of being filtered, which increases the chance of misleading periodicity detection. Filtering noise with a smaller number of bins does not work, because doing so also diminishes the variance that is part of the trends. For example, if we discretize the above data into four bins, the data points that form the trend will be represented by the same symbol, making the data less useful for periodicity detection.

Figure 1.7: Five weeks’ data with weekly periodicity

A similar problem with equi-depth binning will be discussed in Section 5.3.5. To summarize, discretization is not sensitive to the preservation of trends; as a result, a significant amount of information can be lost.
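To make the two binning schemes concrete, here is a small sketch; the function names and the toy series (one big peak followed by small peaks and valleys, in the spirit of Figure 1.6) are illustrative, not from the thesis.

```python
import numpy as np

def equi_width_bins(x, n_bins):
    """Discretize by value range: every bin spans the same width."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])  # labels in 0..n_bins-1

def equi_depth_bins(x, n_bins):
    """Discretize by rank: every bin holds roughly the same number of points."""
    x = np.asarray(x, dtype=float)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    return np.digitize(x, edges[1:-1])

# One big peak (100) followed by small valleys and peaks (values 8..12).
day = [10, 60, 100, 60, 10, 8, 12, 8, 12, 10]
print(equi_width_bins(day, 4))  # → [0 2 3 2 0 0 0 0 0 0]: small shapes collapse into bin 0
print(equi_depth_bins(day, 4))  # → [1 3 3 3 1 0 2 0 2 1]: small shapes keep some structure
```

Equi-width binning maps all of the small valleys and peaks to a single symbol, losing the trend; equi-depth binning preserves more of their structure but, as Section 5.3.5 discusses, suffers from related problems.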


1.3 Contributions

This thesis focuses on both outlier detection and periodicity detection in time series data in

general, and in load curve data in particular. The contributions of this thesis are as follows.

1.3.1 For Outlier Detection

First, we present the novel notion of X-outliers under the assumption that data follows some

loose form of periodicity and the length (but not the pattern) of the periodicity is known.

Second, a novel method to detect and repair X-outliers is proposed. The proposed

method has four steps: (1) Approximate the load curve data by a smoothing curve. (2)

Represent the smoothing curve by a sequence of valleys and peaks, called ∪ shapes and ∩ shapes. (3) Identify X-outliers as the valleys and peaks that do not repeat according to the

known periodicity length. Our observation is that an X-outlier typically occurs at a time

interval where the smoothing curve either has a valley or has a peak. (4) Repair the outliers.

The novelty of the proposed method lies in considering the periodicity of the data and treating as X-outliers the data in the valleys and peaks that do not repeat according to the

periodicity. Therefore this method is able to detect the approximate locations and lengths

of X-outliers without making any assumption about them.

Third, a fully implemented system, which includes a method for detecting Y-outliers

[5], the method presented here for detecting X-outliers, and a user-friendly interface, is de-

veloped. This system will help power utilities identify and correct corrupted data efficiently

in applications of smart metering in particular and in load forecasting, system analysis,

operation modeling and planning studies of power systems in general.

Finally, though motivated by load curve data in power systems, the proposed approach

is rather general and can be applied to other types of data such as road traffic, network flow

traffic, call volume, weather data, etc. In this sense, the proposed method can be applied

to a wide range of data sets.

1.3.2 For Periodicity Detection

Consider the data in Figure 1.7 again. Despite the presence of noise, this data has a periodicity: it peaks on weekdays and dips on weekends, as illustrated by the smoothing

curve in red. The peak and valley of each week are not exactly the same because actual

data points are different in every week, but the trends represented by them are similar.


Therefore, if we can represent these peaks and valleys, it is possible to detect the periodicity

as re-occurrence of subsequences of peaks and valleys, by taking into account similarity of

such shapes. In this example, the period is approximately the length of a peak plus the

length of a valley.

With the above observation, a novel trend based algorithm is proposed to detect peri-

odicities in real valued time series. The term “trend based” means that the method focuses

on the trends in the data, rather than the actual value of every single data point. This al-

gorithm has four steps: (1) The trends of the data are approximated by a smoothing curve.

(2) The smoothing curve is represented by a sequence of ∪ shapes and ∩ shapes, which

correspond to valleys and peaks and capture the most interesting information in the data.

Each ∪ shape and ∩ shape is represented by a feature vector. (3) The WARP algorithm

[28] is extended to a sequence of ∪ shapes and ∩ shapes to discover periodicities in terms

of subsequences of ∪ shapes and ∩ shapes. (4) We express these periodicities in the length

of time.
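As a rough illustration of step 2's output, each ∪ or ∩ shape could be summarized by a small feature vector. The function name and the particular features used here (type, width, amplitude) are illustrative assumptions, not the thesis's exact choice:

```python
def shape_feature(kind, t_start, t_end, y_vals):
    """Hypothetical feature vector for one shape: its type ('U' or 'cap'),
    its width in time, and its amplitude (vertical extent)."""
    return (kind, t_end - t_start, max(y_vals) - min(y_vals))

# A smoothing curve then becomes a short sequence of such vectors,
# on which a warping-based comparison (as in WARP) can operate:
seq = [shape_feature("cap", 0, 5, [1.0, 4.2, 1.1]),
       shape_feature("U", 5, 7, [4.2, 1.0, 4.0])]
```

Comparing such feature vectors, rather than raw values, is what makes the algorithm robust to week-to-week variation in the actual data points.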

The novelty of the trend based algorithm is modeling a time series as a sequence of

local trends (i.e., ∪ shapes and ∩ shapes), instead of a sequence of symbols obtained by

a fixed binning scheme. Thus this approach is sensitive to the distinction between trends

and noise, which helps preserve trends and filter noise, and thus helps detect the underlying

periodicity. Another feature of the trend based algorithm is the ease of detecting multiple periodicities (e.g., daily, weekly, yearly). This can be done by choosing a proper smoothing parameter to model the trends at the desired resolution level. The user is not required to have deep knowledge of such choices; the software tool mentioned in Section 1.3.1 can help the user converge to a proper choice with little effort. We have evaluated the

proposed algorithm using real life load curve data obtained from our industrial collaborator.

The evaluation shows that the proposed method detects periodicities more accurately than the discretization based methods, increasing the F-measure by more than 30%.

1.4 Thesis Organization

The thesis is organized as follows: in Chapter 2, the related work about outlier detection

and periodicity detection in time series data is reviewed. In Chapter 3, a description of the

regression smoothing method to model the trends of time series data is given. In Chapter 4,

a new method for X-outlier detection is proposed and evaluated using the real data from BC


Hydro. In Chapter 5, the novel algorithm for periodicity detection is presented and tested

using the real data from BC Hydro. Finally, we summarize our conclusions in Chapter 6.


Chapter 2

Background

In this chapter, a review of the related techniques for outlier detection and periodicity

detection in time series data in the literature will be presented.

2.1 Outlier Detection

Outlier detection in time series data refers to the problem of finding behaviors in the data

which are not expected according to some regular patterns existing in the data. Outlier

detection is used widely in various applications, such as detecting credit card fraud or power consumption that is abnormally low or high with respect to the periodicity.

2.1.1 Overview of Existing Techniques

Outlier detection in time series has been studied in the field of statistics as a general mathe-

matical concept [3]. A simple statistical outlier detection proposed in [25] is to use informal

box plots to pinpoint the outliers. Many statistical approaches assume an underlying model

that generates data sets (e.g. normal distribution) [6]. Other methods [7, 8, 9, 10, 11] are

based on the ARMA (auto-regressive moving average) model, which impractically assumes

that the time series is stationary, as implied by the various parameters used by the model.

For load curve data, this kind of assumption does not hold. In addition, these methods

cannot handle a relatively large portion of missing data.

Most works on time series outlier detection employ smoothing techniques to identify

unusual data values within a local neighborhood. See [5] for a list of works. The idea is


to model the trends of data using a smoothing curve whose points are some aggregation of

those observed values within a small neighborhood. The moving average [41] is one example

of aggregation. Unfortunately, such techniques do not work for X-outliers considered in this

thesis because they do not consider periodicities. The mean and median method suggested

in [41] considers a given periodicity and replaces missing and corrupted data using the

average or medians of the corresponding observations at different periods. This method

does not fully factor data distribution. For example, though the median for {1, 1, 50, 100,

100} and the median for {49, 49, 50, 51, 51} are the same (i.e., 50), the second median is

more representative than the first median. Also, this method does not allow time shifting

and stretching that are commonly observed in load curve data.
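The two example sets above can be checked directly with Python's statistics module: identical medians, very different spreads.

```python
from statistics import median, pstdev

a = [1, 1, 50, 100, 100]
b = [49, 49, 50, 51, 51]

# Both sets have median 50, so a median-based repair treats them alike...
print(median(a), median(b))

# ...yet the spread around 50 is vastly larger in the first set, so 50 is
# far less representative of `a` than of `b`.
print(pstdev(a), pstdev(b))
```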

The SAS/ETS (Econometric and Time Series) package provides routines for outlier

detection in periodic time series data using the intervention analysis methods [12, 13], which

are based on the ARIMA (auto-regressive integrated moving average) model. The ARIMA

model treats each outlier as a single observation and detects multiple point outliers as a

sequence of observations. If multiple outliers exist in a close proximity, these outliers may

mask each other so that no points are identified as outliers. Besides, the ARIMA method

requires considerable computer time and memory for a long time series [14], which is the

case for load curves.

The Multivariate Linear Gaussian state space model [45] provides a more general mod-

eling technique for time series and it also allows for non-stationary models. The state space

model has primarily been used for forecasting, for example, see [46, 47], where observed

data are assumed to be valid and the parameters of the state space model are estimated by

fitting the observed data. Our work has a different focus and objective from forecasting: we

assume that a large portion of observed data (in general) may be corrupted according to a

known periodicity and our goal is to identify and repair corrupted data, instead of fitting

the observed data, so that the repaired data is more representative of the underlying data

pattern. Therefore, our work can be applied in a preprocessing step to fix corrupted data

prior to other applications such as forecasting.

The work [15] defines an object to be a distance-based outlier if at least some percent of

the data set lies greater than some distance away from the object. Other similar definitions

use the distance of a point to its k-th nearest neighbor [16] or the sum of the distances to its

k-nearest neighbors [17]. Unfortunately, these definitions do not deal with the time series

data that characterizes load curves with loose periodicity.
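A minimal one-dimensional sketch of the distance-based definition of [15] follows; the function name, the threshold handling, and the use of scalar points are our simplifications.

```python
def distance_outliers(points, radius, frac):
    """A point is a distance-based outlier if at least `frac` of the
    remaining data lies farther than `radius` away from it."""
    outliers = []
    for i, p in enumerate(points):
        far = sum(1 for j, q in enumerate(points)
                  if j != i and abs(q - p) > radius)
        if far >= frac * (len(points) - 1):
            outliers.append(p)
    return outliers

# The isolated value 10.0 is flagged; the cluster around 1.0 is not.
print(distance_outliers([1.0, 1.1, 0.9, 10.0], radius=2.0, frac=0.9))
```

As the paragraph above notes, such a test ignores the time order and periodicity of the data, which is why it does not fit X-outliers.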


In general, two forms of time series are considered for outlier detection: a time series

database and a single time series.

2.1.2 Outlier Detection in a Time Series Database

When detecting outliers in a time series database, most of the previous work tries to find a

time series which is abnormal with respect to a normal time series. Both of [26, 49] construct

a normal model from training time series that are known to be normal. If an input time

series does not conform to the model, it is detected as an outlier. These methods are not

suitable for solving our problem because we do not have any pre-determined normal time

series for training.

2.1.3 Outlier Detection in a Single Time Series

Our problem fits into the second form, where outliers are detected in a single time series and an anomalous subsequence may persist for an abnormally long time. It is assumed that most of the time series is normal.

In this scenario, much of the previous work slides a window across the time series data to search for anomalous subsequences [1, 18, 19, 20]. This approach has to predefine the window

size for anomalous subsequences. For example, Keogh et al. developed a suite of techniques

[1, 2] for finding discords within a large time series, where a discord is a subsequence that is

maximally different from all the rest of the time series data. To locate discords, they used a

sliding window to scan the whole time series data. However, in the case of load curve data,

the length of abnormal data can vary considerably, which makes it difficult to determine

a proper window size in advance. They find the most unusual subsequences, but in our

context the most unusual subsequences are not necessarily outliers. The problem is that the

notion of discords does not factor in the periodicity of data, which is crucial to load curve

data.

2.2 Periodicity Detection

Periodicity detection in time series data has been studied in the data mining field. Basically

there are three types of periodicity considered in the time series literature [27, 36]. The first

type is called segment or full-cycle periodicity, meaning that the time series consists of the


repetitions of a segment in the series. This is the type of periodicity we are going to

detect in this thesis. The second type is called symbol periodicity, where it is determined

whether the individual symbols repeat periodically or not. The last type is called partial

periodicity, with a pattern (length ≥ 1) repeating periodically.

There have been various approaches for detecting different kinds of periods. [30] devel-

oped a linear distance-based algorithm for detecting the symbol periodicity. [33] presented

a similar method with some pruning techniques. [32] proposed a multi-pass algorithm for

symbol periodicity, one symbol at a time. All the proposed algorithms in [30, 32, 33] dis-

cover the periodicities of some symbols of the time series rather than the periodicity of the

entire time series.

Previous work on segment periodicity detection can be divided into those for real valued

time series and those for sequences of discrete symbols. Representative works from the first

group include the sketching algorithm [29] and the wavelet transform based AWSOM [31].

The latter detects only periods that are powers of two. Representatives from the second

group include the convolution based technique [27], the dynamic time warping distance

based WARP algorithm [28] and the suffix tree based method [36]. The authors of [28]

showed that the WARP algorithm outperforms the algorithms in [27, 29, 31]. Like [28], our

algorithm is based on the dynamic time warping technique. Unlike [28], our algorithm deals

with a real valued time series without a prior discretization step. Dealing with real valued

but not discretized data has the benefit of preserving the trends of the data while avoiding

the limitations of discretization discussed in Section 1.2.


Chapter 3

Trend Modelling

In this chapter, we introduce the smoothing techniques for modelling the trends of the load

curve (time series) data. Trend modelling is the first and essential step of the proposed

methods for X-outlier detection and periodicity detection in this thesis. Modelling the

trends of the data provides a sequence of peaks and valleys which describes

how the data goes up and down. For X-outlier detection, these peaks and valleys indicate

the possible locations where an X-outlier tends to occur. For periodicity detection, the re-

occurrence of these peaks and valleys represents the periodicity. We begin with some basic

definitions:

Definition 1: A load curve T = {(t_i, y_i)}_{i=1}^{n} is an ordered sequence of n real-valued observations, where y_i is the observed value at time t_i. A sub load curve C = {(t_i, y_i)}_{i=j}^{k} of T is a contiguous part of T, where 1 ≤ j ≤ k ≤ n.

Given a load curve T = {(t_i, y_i)}_{i=1}^{n}, we can model the regression relationship of the data by a continuous function [21]

y_i = m(t_i) + ε_i,  i = 1, …, n    (3.1)

with the regression function m and the observation error ε_i. The error ε_i is assumed to be normally and independently distributed with mean zero and constant variance.


3.1 Nonparametric Regression

To smooth the observed data {(t_i, y_i)}_{i=1}^{n}, a key step is to estimate the function m(t) in Equation

3.1. The approximation can be done in two ways: parametric regression and nonparametric

regression. In parametric regression, m(t) is some known function and the researcher must

determine the appropriate parameters of m(t). In nonparametric regression, m(t) is an un-

known function. We choose nonparametric regression because we have no prior knowledge

about the structures of the load curves except that they have loose periodicity.

In nonparametric regression, the basic idea is local averaging: to estimate the value

at time t, the Y-observations in a neighborhood around t are taken into account, and the

further the Y-observations are away from t, the less they will contribute to the estimation

of the Y-observation at time t. Formally, the estimated value at time t can be modeled as

m̂(t) = (1/n) Σ_{i=1}^{n} W_i(t) y_i    (3.2)

where {W_i(t)}_{i=1}^{n} denotes a sequence of weights which depends on the whole vector {t_i}_{i=1}^{n}.

Equation 3.2 is also called the smoothing curve.

3.2 Kernel Smoothing

To instantiate the weight function Wi(t) in Equation 3.2, we consider kernel smoothing, one

of the most popular nonparametric smoothing techniques². In kernel smoothing, W_i(t) is

given by

W_i(t) = Kern_h(t − t_i) / f̂_h(t)    (3.3)

where

Kern_h(t) = (1/h) Kern(t/h)    (3.4)

is the kernel with the scale factor h. Using the Rosenblatt-Parzen kernel density estimator

[21] of the density of t

²In general, any smoothing technique could be considered.


f̂_h(t) = n⁻¹ Σ_{i=1}^{n} Kern_h(t − t_i)    (3.5)

we obtain the Nadaraya-Watson estimator [21] for Equation 3.2:

m̂_h(t) = [n⁻¹ Σ_{i=1}^{n} Kern_h(t − t_i) y_i] / [n⁻¹ Σ_{i=1}^{n} Kern_h(t − t_i)]    (3.6)

The shape of the kernel weights is determined by the function Kern whereas the size of

weights is parameterized by h, called bandwidth. Importantly, the bandwidth h controls

the smoothness of the smoothing curve and how wide the probability mass is spread around

a point. In this thesis, we choose Kern as the normal probability density function [21]:

Kern(t) = (1/√(2π)) e^{−t²/2}    (3.7)
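Equations 3.5-3.7 can be sketched in a few lines of Python; note that the 1/n and 1/h factors cancel between the numerator and denominator of Equation 3.6, so plain kernel values suffice:

```python
import math

def kern(t):
    """Gaussian kernel of Equation 3.7."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def nadaraya_watson(t_obs, y_obs, t, h):
    """Nadaraya-Watson estimate m_hat_h(t) of Equation 3.6 with bandwidth h."""
    weights = [kern((t - ti) / h) for ti in t_obs]
    return sum(w * y for w, y in zip(weights, y_obs)) / sum(weights)

# A larger h averages over a wider neighborhood, giving a smoother curve.
t_obs = list(range(10))
y_obs = [5.0] * 10
print(nadaraya_watson(t_obs, y_obs, 4.5, h=1.0))  # a constant series stays at 5.0
```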

3.3 Smoothing Parameter

Below, we refer to the bandwidth h as the smoothing parameter. This parameter con-

trols the smoothness of the smoothing curve by regulating the size of the neighborhood

around time t in the observed data. A large h corresponds to a large neighborhood, thus, a

smoother curve. There has been some work on choosing the optimal smoothing parameter

in the literature. Different methods have been proposed such as cross-validation (CV) [21],

minimizing mean squared error (MSE) [21] and mean integrated squared error (MISE) [48].

In general, there is no golden rule since the choice often depends on the user’s needs. For

example, in Y-outlier detection, the smoothing parameter should be set to make the Y-outliers stand out so that they can be detected by the confidence interval. In X-outlier

detection, the smoothing parameter should be properly set so that a “valley” or a “peak”

on the smoothing curve is approximately the position where an X-outlier occurs. In period-

icity detection, a best choice of the smoothing parameter ensures that the smoothing curve

models the periodic pattern of the data properly.


Chapter 4

X-Outlier Detection

In this chapter, we will present the X-outlier detection algorithm. We start with some

notions used in this chapter in Section 4.1. After that the detailed description of X-outlier

detection is provided in Section 4.2. In Section 4.3 we discuss some practical issues. In

Section 4.4 the proposed method is evaluated experimentally.

4.1 Problem Definition

In this section, we present the essential notions used in the chapter and the problem we

study.

As mentioned in Chapter 1, load curve data have a loose form of periodicity. Informally,

a loose periodicity of length l means that the load at time t is similar to the load at the

corresponding times in other periods, subject to variability in time and load caused by

background noise, where the corresponding time of t in the i-th period is t + i × l. In

other words, a similar “trend” is observed in all periods, even if the actual load at the

corresponding times may be different.

In this chapter, we assume that a load curve follows a loose periodicity and the length

(but not the pattern) of the periodicity is known. For example, we know that a data set

follows a yearly (or weekly, etc.) periodicity, thus, the length of the periodicity is one year

(or one week, etc.), but we do not know the actual trend or pattern of the periodicity. This

assumption holds for all the load curves we encountered because the usage of electricity does

follow daily, weekly, monthly and seasonal periodicities or working cycles (such as industrial

customers) in real life. We believe that this assumption also holds in many other real


applications beyond power systems, such as road traffic, network flow traffic, call volume,

weather data, etc.

Definition 2: Given a load curve data following a loose periodicity of a known length,

an X-outlier is a maximal sub load curve which deviates from the periodicity.

The exact definition of “deviation” in Definition 2 is unspecified so that the notion

of X-outliers can be adapted to a new instantiation of “deviation”. In Section 4.2.3, we

will consider one instantiation of “deviation” based on the longest common subsequence

similarity [22].

Consider the load curve data with the weekly periodicity in Figure 4.1, i.e., high consumption during weekdays and low consumption during the weekend, except for a deviation in Monday-Wednesday of the third week, denoted by the rectangle box. Thus the data for Monday-Wednesday of the third week is considered an X-outlier. It is worth noting that there is no requirement that all X-outliers have a similar

length or a length similar to the length of the periodicity. The length of an X-outlier is

determined by the length of the random event that causes the outlier, whereas the length of

periodicity is determined by the regularity of underlying patterns in the data. For example,

for a data set of a yearly periodicity, the length of the periodicity is 12 months; if a factory

shuts down for 3 months, the X-outlier has 3 months in length; if a union strike lasts for

1.5 weeks, this X-outlier is 1.5 weeks in length.

Figure 4.1: Example for an X-outlier

Definition 3: Given a load curve following a loose periodicity of a known length, the

outlier detection problem is to identify all X-outliers.


4.2 Proposed Method

At first glance, one may think that it is straightforward to detect X-outliers by comparing the

observations in the data against the periodic pattern assumed in the data. Unfortunately,

this approach does not work because we only know the length of the periodicity, but not

the pattern or trend of the periodicity, as is the case in most real applications. To detect

X-outliers, we need to first model the general trend in the data while ignoring background

noise. Since the data follow a periodicity, the trend tends to be periodic, i.e., repeating

itself at intervals equal to the known length of periodicity, with possible deviations. We

then detect outliers by identifying all deviations from this periodic trend.

Our X-outlier detection approach can be described in four steps:

1. Approximate the load curve data by a smoothing curve, which captures the general

trend of the data;

2. Represent the smoothing curve by a sequence of valleys and peaks, called ∪ shapes

and ∩ shapes;

3. Identify X-outliers as the valleys and peaks that do not repeat according to the known

periodicity length. Our observation is that an X-outlier typically occurs at a time

interval where the smoothing curve either has a valley or has a peak;

4. Repair the X-outliers.

Generating a smoothing curve takes O(n²) time. This is the most time-consuming part of the detection approach, so the overall time complexity of the approach is O(n²).

Step 1 is already explained in Chapter 3. Below, we first make some observations useful

for X-outlier detection in Section 4.2.1. After that we explain step 2, step 3 and step 4 in

Section 4.2.2, Section 4.2.3 and Section 4.2.4, respectively.

4.2.1 Observations for X-Outlier Detection

After the smoothing curve is obtained from the kernel smoothing technique discussed in

Chapter 3 (step 1), in the following we will make some observations on a few properties of

the smoothing curve which are useful for X-outlier detection.

As discussed in Chapter 1, an X-outlier has an unusual trend in the Y-axis over an

extended time period. Therefore, if the trend of a load curve is represented by a smoothing


curve, we have the observation that an X-outlier tends to occur at a “valley” or a “peak”

of the smoothing curve where the smoothing curve has local minimal or maximal values.

An example of this intuition is shown in Figure 4.2 which has a six-year load curve and the

length of the periodicity is one year. The red curve is the smoothing curve modelling the

trends of the load curve. It can be seen that the smoothing curve consists of a sequence of

“valleys” and “peaks”. An X-outlier lies within the valley inside the rectangle box.

Figure 4.2: Example for an X-outlier within a valley

In the following we present step 2, step 3 and step 4 of the proposed method. Unless

otherwise specified, the term “outlier” refers to an X-outlier.

4.2.2 Approximating the Smoothing Curve by Peaks and Valleys

This section is step 2 of the proposed method. To formally define the location of such valleys

and peaks discussed in Section 4.2.1, we first introduce some terminology. The slope at a

time t_i for a smoothing curve {(t_i, m̂_i)}_{i=1}^{n} is defined by Δm_i/Δt_i, where Δm_i = m̂_i − m̂_{i−1} and Δt_i = t_i − t_{i−1}, for 2 ≤ i ≤ n. A time t in an interval is called a steep time point if the

absolute value of the slope at time t is maximum in the interval. In other words, at a steep

time point the smoothing curve ascends or descends at the maximum rate in the concerned

interval. Note that there could be more than one steep time point in an interval.
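These definitions translate directly into code; a minimal sketch (index conventions are ours):

```python
def slopes(t, m):
    """Slope at each time t_i for i >= 2: (m_i - m_{i-1}) / (t_i - t_{i-1})."""
    return [(m[i] - m[i - 1]) / (t[i] - t[i - 1]) for i in range(1, len(t))]

def steep_time_points(t, m, lo, hi):
    """Indices in [lo, hi] whose slope has maximal absolute value in that
    interval; as noted above, there can be more than one such point."""
    s = slopes(t, m)
    mags = {i: abs(s[i - 1]) for i in range(max(lo, 1), hi + 1)}
    top = max(mags.values())
    return [i for i, v in mags.items() if v == top]
```

For example, for m = [4, 2, 1, 0.5, 0.25] at unit time steps, the steepest descent, and hence the only steep time point, is at index 1.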

In the smoothing curve, an interval [a, b] is maximal-decreasing if the slope at every

point in [a, b] is ≤ 0 and any interval containing [a, b] has at least one point with positive

slope. Let c and c′ be in a maximal-decreasing interval [a, b]. [c, b] is convex-decreasing if

c is the last steep time point in [a, b], and [a, c′] is concave-decreasing if c′ is the first steep


time point in [a, b]. Since c is the last steep time point and c′ is the first steep time point,

the concave-decreasing interval [a, c′] and convex-decreasing interval [c, b] overlap at most

one point.

In Figure 4.3, the values on the curve are the slopes of the smoothing curve. The

horizontal axis is time. [t1, t5] is a maximal-decreasing interval. t3 and t4 are steep time

points in [t1, t5]. Since t4 is the last steep time point in [t1, t5], [t4, t5] is a convex-decreasing

interval but [t3, t5] is not. [t1, t3] is a concave-decreasing interval.

Figure 4.3: Smoothing curve [t1, t12]. The horizontal axis is time. The values on the curve are the slopes at each point. [t1, t5] is a maximal-decreasing interval; [t6, t10] is a maximal-increasing interval. [t4, t8] is a ∪ shape.

Similarly, we can define concave and convex intervals for increasing intervals. In the

smoothing curve, an interval [a, b] is maximal-increasing if the slope at every point in [a, b]

is ≥ 0 and any interval containing [a, b] has at least one point with negative slope. Let c

and c′ be in a maximal-increasing interval [a, b]. [c, b] is concave-increasing if c is the last

steep time point in [a, b], and [a, c′] is convex-increasing if c′ is the first steep time point

in [a, b]. Notice that [a, c′] and [c, b] overlap at most one point.

In Figure 4.3, [t6, t10] is a maximal-increasing interval, t8 and t9 are steep time points,

[t6, t8] is a convex-increasing interval because t8 is the first steep time point in [t6, t10].

[t9, t10] is a concave-increasing interval.

The following definitions 4 and 5 formalize the notions of valleys and peaks.

Definition 4: (∪ shape) For a smoothing curve, a ∪ shape is a sub curve T_∪ = {(t_p, m̂_p)}_{p=i}^{j} such that for some k with i ≤ k ≤ j, [i, k] is a convex-decreasing interval and

[k + 1, j] is a convex-increasing interval.

Definition 5: (∩ shape) For a smoothing curve, a ∩ shape is a sub curve T_∩ = {(t_p, m̂_p)}_{p=i}^{j} such that for some k with i ≤ k ≤ j, [i, k] is a concave-increasing interval and


[k + 1, j] is a concave-decreasing interval.

In Figure 4.3, the curve [t4, t8] is a ∪ shape formed by a convex-decreasing interval

[t4, t5] and a convex-increasing interval [t6, t8]. The curve [t9, t12] is a ∩ shape formed by a

concave-increasing interval [t9, t10] and a concave-decreasing interval [t11, t12]. Notice that

adjacent ∪ shape and ∩ shape overlap at most one point. To see this, consider the adjacent

∪ shape [t4, t8] and ∩ shape [t9, t12] in Figure 4.3. The ∪ shape [t4, t8] must end at the first

steep time point t8 and the ∩ shape [t9, t12] must start at the last steep time point t9.

There might be a gap between two adjacent ∪ shape and ∩ shape. To cover all data

points on the smoothing curve, we extend each shape to cover a half of the gap on each of

its two ends.
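A simplified sketch of this step: split the smoothing curve into maximal monotone runs and pair adjacent runs into ∪ and ∩ shapes. The trimming of each run at its steep time points (the convex/concave sub-intervals of Definitions 4 and 5) is deliberately omitted here.

```python
def monotone_runs(m):
    """Maximal non-increasing / non-decreasing runs of the curve, as
    (start, end, direction) triples with direction -1 or +1."""
    runs, start, d = [], 0, None
    for i in range(1, len(m)):
        step = -1 if m[i] < m[i - 1] else 1
        if d is None:
            d = step
        elif step != d:
            runs.append((start, i - 1, d))
            start, d = i - 1, step
    runs.append((start, len(m) - 1, d))
    return runs

def u_and_cap_shapes(m):
    """Pair each decreasing run with the following increasing run (a U shape)
    and vice versa (a cap shape); adjacent shapes share one run."""
    runs = monotone_runs(m)
    return [("U" if d1 == -1 else "cap", s1, e2)
            for (s1, e1, d1), (s2, e2, d2) in zip(runs, runs[1:])]

print(u_and_cap_shapes([3, 2, 1, 2, 3, 2]))  # a valley followed by a peak
```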

4.2.3 Identifying Outliers

This section is step 3 of the proposed method. Intuitively, ∪ shapes and ∩ shapes capture

the regions on the smoothing curve where the raw load curve has large drops and large rises.

Such regions are the potential places where outliers may occur. Therefore, ∪ shapes and ∩ shapes are candidate regions for outliers.

Definition 6: (Candidate Regions) A candidate region for an outlier is the region for

a ∪ shape or ∩ shape of the smoothing curve.

All candidate regions can be found by computing ∪ shapes and ∩ shapes following

Definition 4 and Definition 5. We remind the reader that the length of a ∪ shape or ∩ shape, and thus the length of a candidate region, is determined by the width of a drop or rise

in the smoothing curve, which is independent of the length of the periodicity. There is no

correlation between a candidate region and a period of the periodicity.

The ∪ shape and ∩ shape provide candidates for outliers based on local neighborhood

information. A candidate may or may not be a real outlier depending on whether or not

it is part of a regular pattern. In other words, a candidate is not a real outlier if it occurs regularly according to the periodicity. As is well known, a load curve has daily, weekly, seasonal,

or yearly periodicity although there may be noise or time shifting in the periodicity.

Candidate regions are extracted based on local neighborhood information (i.e., valleys

and peaks). Whether a candidate region contains a real outlier depends on whether the data

in the region deviates from the given periodicity. A candidate is not a real outlier if similar

load data occurs regularly in other periods according to the periodicity. The similarity

should take into account background noise and time shifting of periodicity.


To identify all outliers, we consider every candidate region r found in the previous step.

Let C∗ denote the portion of the raw load curve data contained in r. Notice that C∗ contains

raw data in the load curve, not the data in the smoothing curve. If the data in C∗ occurs

approximately in the corresponding regions in different periods, C∗ is not a real outlier but

a part of the periodicity. For this purpose, we extract all the sub load curves C1, C2, . . . , Ck,

where each Ci is the portion of the raw load curve in the corresponding region of C∗ in the

i-th period. If C∗ is “similar” to the majority of C1, C2, . . . , Ck, C∗ is considered normal;

otherwise, C∗ is considered a real outlier.

The remaining question is how to measure the similarity between C∗ and Ci of the

same length. There are two considerations in choosing the similarity measure. First, the

similarity measure should be less sensitive to background noise. For example, the two load

curves TA and TB in Figure 4.4 should be considered similar despite some variability at the

first peak due to background noise. Second, the similarity measure should be less sensitive

to time shifting and stretching commonly observed in load curve data. For example, load

curves TA and TB in Figure 4.5 should be considered similar despite a small time shifting

and stretching.

Figure 4.4: Two similar load curves with noise

The commonly used Euclidean distance is not suitable for our purpose because it is

sensitive to time shifting and stretching. If the Euclidean distance was used, the two curves

TA and TB in Figure 4.5 would be recognized as dissimilar. What we need is a similarity

measure that will examine a small neighborhood in search of matching points and skip

noisy points. For this purpose, the Longest Common Sub-Sequence (LCSS) [22] concept

can be adopted. LCSS is a coarse-grained similarity measure in the sense that it measures

similarity in terms of “trends” instead of exact points. Below, we describe how LCSS is

extended to measure the similarity of load curves.


Figure 4.5: Two similar load curves with time shifting and stretching

Given two load curves A = <a_1, a_2, . . . , a_m> and B = <b_1, b_2, . . . , b_n>, which correspond to C∗ and C_i, we want to find the longest subsequence common to both A and B.

The idea is as follows. To allow time shifting and stretching, ai and bj that are within some

time proximity are examined for matching. If these load points are similar, they are con-

sidered as a match and are kept. Dissimilar values in one or both load curves are dropped.

Mathematically, given an integer δ and a real value ε, the cumulative similarity S_{i,j}(A, B), written S_{i,j} for short, is defined as follows:

    S_{i,j} = 0                            if i = 0 or j = 0
    S_{i,j} = 1 + S_{i−1,j−1}              if |a_i − b_j| ≤ ε and |i − j| ≤ δ
    S_{i,j} = max(S_{i,j−1}, S_{i−1,j})    otherwise        (4.1)

In Equation 4.1, the first case initializes the similarity for empty prefixes. The second case builds the similarity recursively: if |a_i − b_j| ≤ ε and a_i and b_j are close enough in time, i.e., |i − j| ≤ δ, then a_i and b_j are matched and the similarity is incremented.

Note that ε represents a tolerance of noise in load and δ represents a tolerance of time

shifting and stretching.

Let |A| and |B| be the lengths of A and B, respectively. The LCSS similarity of A and B is given by

    γ(δ, ε, A, B) = S_{|A|,|B|} / min(|A|, |B|)        (4.2)

where S_{|A|,|B|} is the length of the longest common subsequence of A and B computed by Equation 4.1. For a user-specified threshold θ, we say that A and B are similar if

    γ(δ, ε, A, B) ≥ θ        (4.3)
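To make the recurrence concrete, the following is a minimal Python sketch of Equations 4.1 and 4.2; the function name and the toy curves are illustrative, not part of the thesis.

```python
def lcss_similarity(A, B, delta, eps):
    """Cumulative similarity S (Equation 4.1) and normalized LCSS score (Equation 4.2)."""
    m, n = len(A), len(B)
    # S[i][j] holds S_{i,j}; row 0 and column 0 are the empty-prefix base case.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(A[i - 1] - B[j - 1]) <= eps and abs(i - j) <= delta:
                S[i][j] = 1 + S[i - 1][j - 1]   # a_i and b_j match
            else:
                S[i][j] = max(S[i][j - 1], S[i - 1][j])
    return S[m][n] / min(m, n)   # gamma(delta, eps, A, B)

# Two curves whose peaks are shifted by one time unit still score 5/6.
sim = lcss_similarity([1.0, 5.0, 5.0, 1.0, 1.0, 1.0],
                      [1.0, 1.0, 5.0, 5.0, 1.0, 1.0], delta=2, eps=0.5)
```

With a threshold such as θ = 0.4, these two curves would be judged similar despite the time shift.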


Before LCSS is applied to A and B, piecewise aggregate approximation (PAA) [24] is

utilized to reduce the dimensionality of A and B. In PAA, a load curve T = {(t_i, y_i)}_{i=1}^{n} of length n is represented in a w-dimensional space by a vector T̄ = <t̄_1, t̄_2, . . . , t̄_w>. The i-th element of T̄ is calculated by the following equation:

    t̄_i = (w/n) · Σ_{j = (n/w)(i−1)+1}^{(n/w)i} y_j        (4.4)

The load curve T is divided into w segments of equal size. Each segment is represented

by its mean value.
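As a rough illustration, the PAA reduction of Equation 4.4 can be sketched as below (this sketch assumes, for simplicity, that n is a multiple of w):

```python
def paa(y, w):
    """Represent a curve of n points by the means of w equal-size segments (Equation 4.4)."""
    n = len(y)
    assert n % w == 0, "this sketch assumes n is a multiple of w"
    seg = n // w  # points per segment
    return [sum(y[k * seg:(k + 1) * seg]) / seg for k in range(w)]

# Eight hourly points reduced to two segment means.
approx = paa([1, 1, 3, 3, 10, 10, 12, 12], w=2)  # -> [2.0, 11.0]
```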

4.2.4 Repairing Outliers

This section is step 4 of the proposed method. After detection, outliers should be replaced with valid data in the load curve. The replacement data can be derived from the data

in the corresponding time interval in other different periods with an adjustment for the

increasing/decreasing long-term trends over time. This is expressed using the following

multiplicative model [23]:

    Y(t_i) = T(t_i) × S(t_i)        (4.5)

where Y (ti) represents the value that will replace the abnormal value at a time ti belonging

to an outlier. T (ti) represents the long-term trends and S(ti) represents the periodic index

(i.e., how much the load curve deviates from the long-term trends at time ti).

T (ti) can be estimated by the smoothing curve defined in Equation 3.2 with an appro-

priate smoothness level. However, in the presence of outliers, the smoothing curve may have

some errors around the outliers to be replaced. To address this problem, the outliers in the

load curve are first replaced by the average of the data at the corresponding time in the

previous and next periods. If load points at these times are also outliers themselves, the

corresponding data from earlier and later periods is examined until normal data is obtained.

After that, the smoothing curve is re-generated using Equation 3.2 to obtain T (ti).

The periodic index S(ti) at a time ti belonging to an outlier is estimated by the average

of the periodic indexes at the corresponding time of its previous and next period, that is,

    S(t_i) = (1/2) (S(t_i − l) + S(t_i + l))        (4.6)


where l is the length of the periodicity. In the case where the data at previous and next

periods are outliers, earlier and later periods are examined until normal data is obtained.

Note that for a time ti not belonging to an outlier, the periodic index at ti is computed by

its definition:

    S(t_i) = y_i / T(t_i)        (4.7)

where yi is the load data value at time ti. After T (ti) and S(ti) for a time point belonging

to an outlier are obtained, Equation 4.5 is utilized to produce the replacing value Y (ti).
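The replacement step can be sketched as follows. In this sketch, `trend` and `periodic_index` stand for precomputed values of T(t_i) and S(t_i) over the whole curve; the fallback to earlier and later periods when the neighboring periods are also outliers is omitted, and all names are illustrative.

```python
def repair_value(trend, periodic_index, i, period):
    """Replacement value Y(t_i) = T(t_i) * S(t_i) (Equation 4.5), with S(t_i)
    estimated as the average of the periodic indexes one period before and
    one period after (Equation 4.6)."""
    s = 0.5 * (periodic_index[i - period] + periodic_index[i + period])
    return trend[i] * s

# Toy example: flat trend, outlier at position 4, period length 3.
trend = [10.0] * 9
index = [1.0, 1.2, 0.8, 1.0, 0.0, 0.8, 1.0, 1.2, 0.8]
y_rep = repair_value(trend, index, i=4, period=3)  # -> 12.0
```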

4.3 Practical Issues

The proposed approach uses several parameters: the smoothing parameter h (see Section

3.3), the load stretching threshold ε and the time stretching threshold δ (see Equation

4.1), and the LCSS similarity threshold θ (see Equation 4.3). A question is how to set

the values of these parameters. One approach is to use some statistically “optimal” setting

such as the optimal smoothing parameter [43, 44]. In the case of applications where the

user has background knowledge, the user often desires control over a small number of settings. This is the case for BC Hydro, and we believe that the situation is similar in other utilities. In particular, the result produced by the “optimal” setting was often unsatisfactory.

A close look reveals that certain background knowledge or certain business rules prefer

certain solutions to others. To address this issue, we take a practical approach to provide

a mechanism that helps the user to identify a proper setting of parameters. Below, we

describe such an approach using the smoothing parameter h as an example, but the same

approach can be applied to other parameters.

Recall that a larger h produces a smoother smoothing curve and thus models less detail of the data. A proper choice of the smoothing parameter h is therefore crucial for detecting the

outliers at a proper resolution. In practice, the user does not have to make such a choice

in advance. A software tool with a user-friendly interface has been developed, which allows

the user to slide a bar for the smoothing parameter and displays the smoothing curve and

the identified outliers to the user interactively. Based on visual inspection of the raw data,

the smoothing curve, and the identified outliers, the user can either accept the results or

slide the bar again for a different choice of h and get a display of the new smoothing curve


and outliers based on the new choice of h. In our experience, after several trials the user quickly converges to a proper choice of the smoothing parameter. In our case, five users have had experience with the tool. At the beginning, the users needed five to six trials on average to get a proper smoothing curve for outlier detection with good results. As they became more familiar with the tool, the number of trials was reduced to three to four.

A screen shot of the software tool is shown in Figure 4.6. The left side displays data

selections and algorithm options. When the user selects a data set and an algorithm, the raw

data will be displayed in the upper window and the data after detecting and/or repairing

outliers will be shown in the lower window. The user can slide the bar for the smoothing

parameter and display the outliers detected based on a different smoothing parameter.

Figure 4.6: The software tool developed


4.4 Experiments

This section evaluates the accuracy of the trend based algorithm proposed for detecting

X-outliers. The rest of this section is structured into four subsections: data selection,

evaluation criteria, parameter settings, and accuracy. All smoothing curves were generated

using the Nadaraya-Watson estimator.

4.4.1 Data Selection

Ten data sets from the industrial load curves in the BC Hydro system were used for our

experiments. These data sets are hourly electricity consumptions in different areas for the

six years from October 2004 to October 2010, with 24 × 365 × 6 = 52560 observations in

each data set. All data sets have the yearly periodicity and are categorized into two types:

five data sets contain no outliers and the other five data sets contain 21 outliers in total,

with 3 to 5 outliers in each data set. Note that these outliers are not usual deviations from a

neighborhood; they are deviations from the yearly periodicity. These outliers were identified

manually by experienced engineers in the industry in advance and were pre-labeled. We use

such pre-labeled outliers as the “ground truth” to evaluate the accuracy of the proposed

method. More details will be explained shortly. In addition, the time is normalized into the

interval [0, 1].

4.4.2 Evaluation Criteria

Recall that the proposed method uses each ∪ shape and ∩ shape (on the smoothing curve)

to detect an X-outlier or a non-X-outlier. The outliers pre-labeled by the user are not

necessarily delimited in the same way as such ∪ shapes and ∩ shapes are limited. For this

reason, we cannot simply count the pre-labeled outliers that are detected as X-outliers. To

address this issue, we consider accuracy at the observation level as follows. Let D denote

the set of observations on the load curve that are detected in the sense of belonging to the

outliers detected by an algorithm, and let L denote the set of observations on the load curve

that are pre-labeled in the sense of belonging to the pre-labeled outliers. Let |S| denote

the cardinality of a set S. Precision (P ) is the percentage of detected observations that

were pre-labeled. Recall (R) is the percentage of pre-labeled observations that are detected.

F-measure (F ) is the harmonic mean of precision and recall. Mathematically,


    P = |L ∩ D| / |D|        (4.8)

    R = |L ∩ D| / |L|        (4.9)

    F = (2 × P × R) / (P + R)        (4.10)

A higher F entails both a higher precision and a higher recall, thus, more agreement

between the ground truth and the detection made by an algorithm. P , R and F will be

used as our accuracy criteria.
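A minimal sketch of the observation-level accuracy computation (Equations 4.8 to 4.10), using hypothetical observation indices:

```python
def prf(detected, labeled):
    """Precision, recall and F-measure over sets of observation indices."""
    D, L = set(detected), set(labeled)
    tp = len(D & L)                  # |L ∩ D|
    p = tp / len(D)                  # Equation 4.8
    r = tp / len(L)                  # Equation 4.9
    f = 2 * p * r / (p + r)          # Equation 4.10
    return p, r, f

p, r, f = prf(detected={3, 4, 5, 9}, labeled={4, 5, 6, 7})  # -> (0.5, 0.5, 0.5)
```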

We compare the proposed method with two baseline algorithms. The first is the tra-

ditional smoothing method in [5], which uses a smoothing curve to model the trend of

data and uses a confidence interval around the smoothing curve to detect outliers. We use

the usual 95% confidence level for generating the confidence interval. The second baseline

algorithm is the running median method in [41]. At each observation (ti, yi) of the load

curve, a running median mi of a sub curve Ti centered at ti is computed and a filter band

Bi = mi ± 3 × SD(Ti − mi) is constructed, where Ti − mi is the sub curve obtained by

shifting Ti down by mi and SD() is the standard deviation. Then all observations outside

the filter bands are identified as outliers. For the running median method, we consider five levels for the length of T_i: for i = 1, . . . , 5, level i has length 24 × 7 × i (i.e., i weeks of hourly data).
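The running median baseline can be sketched as follows; the handling of window edges here is a simplification, and the function name is illustrative.

```python
import statistics

def running_median_outliers(y, half_window):
    """Flag observations outside the filter band B_i = m_i ± 3 * SD(T_i - m_i),
    where m_i is the median of the sub curve T_i centred at position i."""
    outliers = []
    for i in range(len(y)):
        lo, hi = max(0, i - half_window), min(len(y), i + half_window + 1)
        window = y[lo:hi]
        m = statistics.median(window)
        sd = statistics.pstdev([v - m for v in window])
        if abs(y[i] - m) > 3 * sd:
            outliers.append(i)
    return outliers

# A single spike in an otherwise flat curve is flagged.
flagged = running_median_outliers([1.0] * 10 + [50.0] + [1.0] * 10, half_window=5)
```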

4.4.3 Parameter Settings

As explained in Section 4.2.3, ε controls the closeness of two matching points ai and bj , and

δ controls how far i could be away from j in the matching. The choices of these parameters

are dependent on the noise level of the periodicity in the dataset. In the given studies, δ is

empirically set to be 14 days, whereas for each outlier candidate C∗, ε is set to be half of

the standard deviation of the sub load curves < C1, C2, . . . , Ck > that correspond to C∗ in

different years. The LCSS similarity threshold θ is set to be 40%. This is the best setting

from several settings we tried: 30%, 35%, 40%, 45% and 50%. For PAA, w = n/24, where n is the number of data points in C_i; that is, every 24 data points (one day of data) in C_i and C∗ are represented by their mean value.

The most important parameter is the smoothing parameter for generating the smoothing

curve, which determines the level of details to be modeled. The literature has suggested


some optimal setting that minimizes some notion of modeling error when the smoothing

curve is used to model the data. Among others, the mean integrated squared error (MISE)

[44] is a commonly used error measure. If modeling the data is the ultimate goal, such as

in forecasting, such an optimal setting would be sufficient. However, our goal is to identify

a special type of corrupted data or outliers, where the notion of error is the deviation from

a given periodicity. This notion of error requires checking the re-occurrence of a pattern in

all periods; therefore, it is not sufficient to minimize a standard estimation error such as MISE, where local data points play a more important role than distant points. In fact, the optimal smoothing parameter setting suggested in [44] does not depend on the

periodicity used by our outlier detection problem, thus, is unlikely to produce a good result

for our problem. This point will be evaluated experimentally in Section 4.4.4.

The theoretical range of the smoothing parameter h is (0,∞). A smaller h produces a

rougher smoothing curve whereas a larger h produces a smoother curve. In our experiments,

the following 10 smoothness levels are considered for generating the smoothing curve in the

trend based algorithm and the traditional smoothing method. The h at level i is given by

    h = 1 / (480 − 45 × i),    i = 1, 2, . . . , 10        (4.11)

Level 1 corresponds to the roughest smoothing curve and level 10 corresponds to the

smoothest smoothing curve. These levels are not equally spaced and are chosen so that

they cover a wide range of smoothness. The use of such smoothing levels assumes that the

time has been normalized into the interval [0, 1], where 0 corresponds to the starting time

and 1 corresponds to the ending time.

4.4.4 Accuracy

Results on Data Sets without X-Outliers In the first experiment, we study how dif-

ferent algorithms perform on the five data sets with no X-outlier. Since no outlier was

pre-labeled, all detected points are false positives and the percentage of detected points is

an indication of performance. Outlier detection was performed on each data set individually.

Let |D| be the sum of the numbers of detected points in the five data sets and let |T | be the

total number of points in the five data sets. |D|/|T | is the percentage of points that were

incorrectly detected. Table 4.1 and Table 4.2 summarize |D|/|T | for the three methods.

Each row corresponds to a smoothness level defined by Equation 4.11, and the last row


corresponds to the optimal setting of the smoothing parameter suggested in the literature

[44], h_opt = 1.06 × σ × n^(−1/5), where σ is the sample standard deviation and n is the number of samples.
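For reference, the smoothness-level schedule of Equation 4.11 and the rule-of-thumb bandwidth can be written as follows (a sketch; `h_opt` simply encodes the 1.06 σ n^(−1/5) formula quoted above):

```python
import statistics

# Smoothness levels of Equation 4.11 (time normalized to [0, 1]);
# level 1 is the roughest, level 10 the smoothest.
levels = [1 / (480 - 45 * i) for i in range(1, 11)]

def h_opt(y):
    """Rule-of-thumb 'optimal' bandwidth h_opt = 1.06 * sigma * n^(-1/5)."""
    return 1.06 * statistics.stdev(y) * len(y) ** (-1 / 5)
```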

Table 4.1: Proposed method and traditional smoothing method for data sets with no X-outliers

Smoothness Level   Proposed Method |D|/|T|   Traditional Smoothing |D|/|T|
1                  1.7%                      4.5%
2                  0.9%                      4.5%
3                  0.7%                      4.6%
4                  0.4%                      4.6%
5                  0.2%                      4.7%
6                  0%                        4.8%
7                  0%                        4.9%
8                  0%                        4.9%
9                  0%                        4.9%
10                 0%                        5.0%
h_opt              0%                        4.7%

Table 4.2: Running median method for data sets with no X-outliers

Length Level   |D|/|T|
1              2.2%
2              2.1%
3              2.2%
4              2.2%
5              2.2%

Even when the smoothness level is low and the smoothing curve is rather rough, the proposed method has a small |D|/|T|. The traditional smoothing method has a much higher

|D|/|T | across all smoothing levels, and the running median method has a lower |D|/|T | than

the traditional smoothing method, but a higher |D|/|T | than the proposed method. The

reason for the higher false positives of the traditional smoothing method and the running

median method is that these algorithms do not consider the periodicity of data, therefore,

even if a peak or valley is part of the periodicity, it may still be considered as an outlier.


Such peaks and valleys will not be considered as outliers by the trend based method.

Results on Data Sets with X-Outliers In the second experiment, we consider the

five data sets with pre-labeled outliers. We examined the precision (P ) and recall (R) and

F-measure (F ) on the five data sets with pre-labeled X-outliers. First, outlier detection was

performed on each data set individually. Then, D and L were aggregated over the five data

sets, and P/R/F was computed using the aggregated D and L. The results are summarized

in Table 4.3 and Table 4.4. Recall that precision is the percentage of detected points that

were pre-labeled (thus correctly detected), recall is the percentage of pre-labeled points that

were detected, and F-measure is the harmonic mean of precision and recall.

Table 4.3: Proposed method and traditional smoothing method for data sets with X-outliers

Smoothness Level   Proposed Method (P / R / F)   Traditional Smoothing (P / R / F)
1                  83% / 98% / 90%               0.5% / 0.2% / 0.3%
2                  85% / 98% / 91%               0.6% / 0.3% / 0.4%
3                  86% / 98% / 92%               0.9% / 0.4% / 0.6%
4                  87% / 97% / 92%               1.3% / 0.6% / 0.8%
5                  90% / 95% / 92%               3.2% / 1.5% / 2.1%
6                  91% / 95% / 93%               5.7% / 2.7% / 3.7%
7                  92% / 92% / 92%               6.4% / 2.3% / 4.0%
8                  91% / 84% / 87%               7.6% / 3.3% / 4.6%
9                  94% / 66% / 77%               13.1% / 5.6% / 7.9%
10                 93% / 48% / 63%               16.8% / 7.0% / 9.9%
h_opt              92% / 48% / 63%               17.2% / 6.0% / 8.9%

Table 4.4: Running median method for data sets with X-outliers

Length Level   P      R      F
1              2.4%   0.4%   0.7%
2              2.8%   0.5%   0.8%
3              2.7%   0.5%   0.8%
4              2.6%   0.5%   0.8%
5              1.5%   0.2%   0.4%

A clear trend shown in Table 4.3 and Table 4.4 is that the F-measure of the proposed


method is significantly higher than those of the traditional smoothing method and the run-

ning median method. Moreover, this gain was observed over all choices of parameter settings

of the three methods, thus, was not due to a careful choice of parameter settings. Specif-

ically, the traditional smoothing method and the running median method failed miserably

as P and R were extremely low, suggesting that many pre-labeled outliers were missed and

many detected outliers are part of normal data. These methods consider any deviation from

a local neighborhood as an outlier, even though such deviation is part of the periodicity in

the data. In contrast, both P and R of the proposed method are high, with the best results

being 91% and 95%, respectively, at the smoothness level of 6. This study clearly shows

that the traditional smoothing method and the running median method are not suitable for

detecting X-outliers, and the proposed method meets the expectation of detecting X-outliers.

The

study also shows that the optimal setting hopt of the smoothing parameter suggested in the

literature for the standard forecasting problem failed to generate the best result for outlier

detection, and that the proposed user-sliding bar is highly effective for choosing the best

smoothness level.

It is worth noting an interesting trend on the proposed method: as the smoothness level

increases, P increases and R decreases; the best result, in terms of the highest F-measure, was attained at an intermediate smoothness level. This trend can be explained as follows. At

a low smoothness level, the smoothing curve models more details of the load curve, and many small ∪ shapes and ∩ shapes are generated, some of which are pure noise. As a result, the number of false positives is large and precision is low. As the smoothness level increases, fewer details are modeled and larger ∪ shapes or ∩ shapes are identified, which

more likely correspond to real outliers. Therefore, the number of false positives decreases.

When the smoothness level increases further, the smoothing curve becomes flatter and thus under-fits the data. In this case, a ∪ shape or ∩ shape becomes so large that it contains a

portion of normal data, which leads to the failure of detecting some real outliers, thus, a

low recall.

To further illustrate the above points, an example for one data set is depicted in Figures

4.7(a), 4.7(b) and 4.7(c) for various smoothness levels. The curve in red is the smoothing

curve and each rectangle marks an X-outlier detected by the proposed method. Figure


(a)

(b)

(c)

Figure 4.7: Outlier detection for a six-year test data set. (a) Outlier detection result forsmoothness level 5. (b) Outlier detection result for smoothness level 1. (c) Outlier detectionresult for smoothness level 10.


4.7(a) shows that the smoothing curve at level 5 models the load curve properly and there

is no false positive or false negative. Figure 4.7(b) shows the result at smoothness level 1

where too much local information is modeled by the rather rough smoothing curve, which

leads to small ∪ shapes and ∩ shapes. A few of such small ∪ shapes and ∩ shapes, marked

as false positives FP1, FP2, FP3, FP4 and FP5, are pure noise and mislead the algorithm

to consider them as outliers. The exact opposite was observed in Figure 4.7(c) where the

smoothing curve at smoothness level 10 is rather flat and ∪ shapes and ∩ shapes are so

wide that a large portion of normal data are contained in them; consequently, no outlier

was detected.

Figure 4.8 shows the load curve data after repairing the outliers identified in Figure 4.7(a)

based on the repairing method in Section 4.2.4. The outliers are replaced by representative load values.

Figure 4.8: Outlier repairing for the six-year test data set.

The proposed method is applicable to load curve data of any time granularity and length.

An example of outlier repairing for a test data set of five weeks is depicted in Figure 4.9. Figure

4.9(a) illustrates the data before outlier repairing, with the detected outlier being marked

by the star rectangle. Figure 4.9(b) shows the data after outlier repairing based on the

method in Section 4.2.4.


(a)

(b)

Figure 4.9: Outlier repairing for a five-week test data set. (a) Test data before outlierrepairing. (b) Test data after outlier repairing.


Chapter 5

Trend Based Periodicity Detection

In this chapter, the trend based periodicity detection algorithm is presented. We start with an overview of a highly related periodicity detection technique called “WARP” (the WArping foR Periodicity algorithm). Then the trend based periodicity detection will be

introduced in Section 5.2. In Section 5.3, the performance of the trend based algorithm is

evaluated.

5.1 Preliminaries

In this section we review the WARP algorithm in [28], a periodicity detection algorithm for

a sequence of discrete symbols. In the next section we will extend the WARP algorithm to

deal with a real valued time series.

5.1.1 Periodicity

For convenience, in this chapter a time series T = {(t_i, y_i)}_{i=1}^{n} is denoted as T = e1e2 . . . en, an ordered list of n feature values e_i at times i, 1 ≤ i ≤ n. A sequence is the special case of a time series in which each feature value e_i is a discrete symbol taken from a finite alphabet. We adopt the notion of segment periodicity from [27]: a time series T is periodic

with a period p if it can be divided into equal length segments, each of length p, that are

“almost similar”. For example, the sequence T = “abcabcabb” has a period 3 with the noise

“b” at the end.

In this chapter, we consider time series where the feature values ei are real values for


CHAPTER 5. TREND BASED PERIODICITY DETECTION 40

periodicity detection. To deal with such real valued time series, most existing periodicity detection algorithms assume that the time series is first transformed into a sequence of discrete symbols using a binning method: typically equi-width binning, where each bin has the same size, or equi-depth binning, where each bin contains approximately the same number of data points. The data points in the same bin are represented using the same symbol. Thus, most of the existing algorithms deal with a sequence

of discrete symbols. The dynamic time warping based WARP algorithm [28] is such an

algorithm. Below, we review this algorithm.

5.1.2 Dynamic Time Warping

Dynamic time warping (DTW) [35] is a measure of the distance between two sequences A

= a1a2 . . . am and B = b1b2 . . . bn. The DTW distance of A and B, denoted as DTW (A,B)

or DTW(m, n), is computed by the following dynamic programming recurrence:

    DTW(i, j) = d(a_i, b_j) + min{ DTW(i−1, j−1), DTW(i−1, j), DTW(i, j−1) }        (5.1)

where the function d(a_i, b_j) returns the distance between two symbols a_i and b_j, defined as

    d(a_i, b_j) = 0 if a_i = b_j, and 1 otherwise        (5.2)

To compute the DTW distance, an m × n matrix is constructed where the cell (i, j)

contains the value d(ai, bj). A warping path is a contiguous path from cell (1, 1) to cell

(m, n)³, corresponding to a particular alignment between the two sequences. The DTW

distance is defined as the minimum cost of any warping path from (1, 1) to (m, n). A locality

constraint is added to control how far away i could be from j when computing DTW(i, j).

A window size w can be used to specify this constraint, that is, if ai is aligned with bj , then

|i− j| ≤ w.

Figure 5.1 shows an example of the DTW matrix for two sequences “abcbde” and

“abcefg”. The minimum cost warping path is circled. Figure 5.2 shows the actual alignment

represented by this minimum cost warping path. The warping cost of this path is 0 (a↔ a)

³ From cell (i, j), the path can go to cell (i + 1, j), cell (i, j + 1), or cell (i + 1, j + 1).


Figure 5.1: An example for the DTW matrix

Figure 5.2: Alignment for the DTW matrix

+ 0 (b↔ b) + 0 (c↔ c) + 1 (c↔ b) + 1 (c↔ d) + 0 (e↔ e) + 1 (f ↔ e) + 1 (g ↔ e) =

4, where “↔” means “paired with”.
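The recurrence of Equations 5.1 and 5.2 can be sketched in Python as follows (a minimal version without the locality constraint w):

```python
def dtw(A, B):
    """DTW distance with the 0/1 symbol distance of Equation 5.2."""
    m, n = len(A), len(B)
    INF = float("inf")
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if A[i - 1] == B[j - 1] else 1
            # Equation 5.1: extend the cheapest of the three predecessor paths.
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[m][n]

cost = dtw("abcabc", "abcabd")  # -> 1 (only the final c/d pairing costs 1)
```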

5.1.3 The WARP Algorithm

To detect the periodicity in a single sequence T = e1e2 . . . en, the WARP algorithm [28]

compares the original sequence T with the sequence obtained by shifting T by some number of symbols p. If there is high similarity between the two in terms of the DTW distance, p is

considered a candidate period. Specifically, for a given positive integer p, T_(p) denotes the prefix consisting of the first n − p symbols and T^(p) denotes the suffix consisting of the last n − p symbols. If DTW(T_(p), T^(p)) is small enough, p is considered a candidate period.

For example, with T = e1e2 . . . e9 = “abcabcabd”, T_(3) = “abcabc” and T^(3) = “abcabd”, and DTW(T_(3), T^(3)) = 1. If this warping cost is considered small enough, p = 3 is a

candidate period. The reason is as follows. Consider the re-occurrence of e1 in the actual

alignment in Figure 5.3: e4 = e1 and e7 = e4, where the symbols on the LHS of each equality are from T^(3) and the symbols on the RHS are from T_(3). These equalities imply e1 = e4 = e7.


For longer sequences T_(3) and T^(3) with a small DTW distance, these equalities imply that

e1 re-occurs at the regular interval of three time units. The same argument applies to e2

and e3. Therefore, e1e2e3 is periodic with a period 3.

Figure 5.3: Alignment for T(3) and T^(3), where T = "abcabcabd"

To find all candidate periods, DTW(T(p), T^(p)) is computed for p = 1, . . . , n/2. Note that the maximum value of DTW(T(p), T^(p)) is n − p. For each possible value of p, the confidence of p [28] is

(n − p − DTW(T(p), T^(p))) / (n − p). (5.3)

If the confidence of p is larger than or equal to a given threshold τ, p is considered a candidate period.

Figure 5.4: DTW matrix for sequences T and T, where T = e1e2 . . . en

Figure 5.4 shows the DTW matrix for sequence T compared against itself, where T = e1e2 . . . en. It can be seen that computing DTW(T(p), T^(p)) amounts to finding the minimum warping path from cell (1, p + 1) to cell (n − p, n) in this matrix. It should be noted that the values in


the diagonal of this matrix are all zeros. These zero values would drag the minimum warping path for T(p) and T^(p) towards the diagonal, meaning that the p-position shift would be ignored in the alignment of T(p) and T^(p). Therefore, to avoid this situation, the zero values in the diagonal are replaced by infinity (∞) [28].

Let cp be the warping cost of a candidate period value p, and let ca be the warping cost of an adjacent period value around p. A candidate period p is a local minimum if cp ≤ ca for its adjacent period values. Therefore, in order to reduce the number of redundant periods, only the candidate periods whose warping cost cp is a local minimum are retained [28]. For a more detailed description of the WARP algorithm, the reader is referred to [28].

5.2 The Trend Based Algorithm

The WARP technique described in Section 5.1.3 cannot be directly applied to real valued

time series. In this section, we present a novel periodicity detection algorithm, called the trend based algorithm, for real valued time series. The algorithm has four steps:

1. Approximate the time series using a smoothing curve;

2. Model the trends in the smoothing curve by a sequence of ∪ shapes and ∩ shapes that

correspond to the peaks and valleys in the smoothing curve;

3. Identify the periodicity by extending the DTW distance to sequences of ∪ and ∩ shapes, taking into account the similarity between such shapes;

4. Express the periodicity in the length of time.

As mentioned in Section 4.2, generating a smoothing curve takes O(n²) time. This is the most time consuming part of the algorithm, making the overall time complexity O(n²).

Step 1 is already explained in Chapter 3 and step 2 is explained in Section 4.2.2. Below, we first make some observations useful for periodicity detection in Section 5.2.1, and then explain steps 3 and 4 in detail in Section 5.2.2 and Section 5.2.3, respectively.

5.2.1 Observations for Periodicity Detection

In the following we make some observations on properties of the smoothing curve that are useful for periodicity detection.


For periodicity detection, as discussed in Chapter 1, the most interesting information

lies at peaks and valleys of the smoothing curve. The sequence of such peaks and valleys

describes the trends in how the data goes up and down, while paying less attention to the actual data values. Based on this observation, we detect the periodicity in the data from such peaks and valleys. An example of this intuition is shown in Figure 5.5, which shows eight

weeks’ data with a weekly periodicity. The period is approximately the length of a peak

plus the length of a valley of the smoothing curve in this example.

Figure 5.5: Example of a period consisting of a peak and a valley

5.2.2 Identifying Periodicities Using The Shape Sequence

This section describes step 3 of the trend based algorithm. As discussed in Section 4.2.2, the

smoothing curve is partitioned into a sequence of ∪ shapes and ∩ shapes. In this chapter,

the sequence is called the shape sequence and is denoted by S. Further, each ∪ shape and

∩ shape is represented by a feature vector (sig, len, max, ave, min), where sig indicates

whether it is a ∪ shape or ∩ shape; len is the number of time points of the shape; max is

the highest value of the shape; ave is the average value of the shape; min is the lowest value

of the shape. We can tell if two shapes are similar to each other using their feature vectors.

More details will follow shortly.

Now we will detect the periodicity using the shape sequence S defined above. Let us

consider an example to illustrate the idea. Figure 5.6 shows eight weeks' time series

data and the smoothing curve. The second rectangle box indicates a periodic pattern that

occurs in most of the weeks. This pattern is reflected by a periodic pattern of one ∩ shape


(weekdays) followed by one ∪ shape (weekend). The first and third rectangle boxes indicate

deviations from this pattern where the ∩ shapes and ∪ shapes have quite different time

lengths and y values from those in the other weeks. The periodicity of this data set comes

from the fact that a majority of weeks have a similar sequence of a large ∩ followed by a

small ∪ shape. Below, we describe an algorithm for detecting periodicity based on this idea.

Figure 5.6: Example of a period

The main idea of our algorithm is to extend the WARP framework in Section 5.1.3 to the

shape sequence S. A key component of the WARP algorithm is the DTW distance between

two sequences A = a1a2 . . . am and B = b1b2 . . . bn of symbols. The DTW distance makes

use of the distance function d(ai, bj) in Equation 5.2 for the two symbols ai and bj . For two

shapes ai and bj, a direct application of this distance function almost always yields d(ai, bj) = 1, because two shapes are unlikely to be exactly identical.

To adapt the distance function d(ai, bj) to two shapes ai and bj, we introduce a difference threshold ε: ai and bj are considered similar if they have the same type, i.e., either both are ∪ shapes or both are ∩ shapes, and if their relative differences in length, max value, ave value and min value are each at most ε. Precisely, let ai = (sig_a, len_a, max_a, ave_a, min_a) and bj = (sig_b, len_b, max_b, ave_b, min_b). d(ai, bj) = 0 if all of the following conditions hold:

1) sig_a = sig_b,
2) |len_a − len_b| / Max(len_a, len_b) ≤ ε,
3) |max_a − max_b| / Max(max_a, max_b) ≤ ε,
4) |ave_a − ave_b| / Max(ave_a, ave_b) ≤ ε, and
5) |min_a − min_b| / Max(min_a, min_b) ≤ ε.


Otherwise, we define d(ai, bj) = 1. With this definition of d(ai, bj) for two shapes, the

WARP framework in Section 5.1.3 can be applied to the shape sequence S, treating each

shape as a symbol.
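The shape distance can be sketched as follows (an illustrative Python sketch; the tuple order and names mirror the feature vector above but are our own choices):

```python
def rel_diff(x, y):
    """Relative difference |x - y| / max(x, y), as used in the conditions."""
    return abs(x - y) / max(x, y)

def shape_distance(a, b, eps=0.30):
    """0/1 distance between two shapes, each a (sig, len, max, ave, min)
    feature vector; 0 means the shapes are considered similar."""
    sig_a, len_a, max_a, ave_a, min_a = a
    sig_b, len_b, max_b, ave_b, min_b = b
    similar = (sig_a == sig_b and
               rel_diff(len_a, len_b) <= eps and
               rel_diff(max_a, max_b) <= eps and
               rel_diff(ave_a, ave_b) <= eps and
               rel_diff(min_a, min_b) <= eps)
    return 0 if similar else 1

# Two ∩ shapes of similar size count as the same "symbol" for DTW:
cap1 = ("cap", 120, 5.0, 3.5, 2.0)
cap2 = ("cap", 110, 4.6, 3.2, 1.9)
cup  = ("cup", 48, 2.2, 1.5, 1.0)
print(shape_distance(cap1, cap2))  # 0
print(shape_distance(cap1, cup))   # 1
```

Note the relative-difference form assumes positive feature values; load data in this thesis satisfies that.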

5.2.3 Computing the Length of Candidate Periods

This section describes step 4 of the trend based algorithm. The algorithm above returns a set of candidate periods, where each candidate period is a sequence of ∪ shapes and ∩ shapes. The final step of our algorithm is to transform each such candidate period into a candidate period in terms of the length of time. Consider a candidate period p. Suppose that the shape sequence S has n shapes in total. For i = 1, 2, . . . , p, the i-th shape in the period p is expected to occur at the locations of all the j-th shapes in S, where j = i + k × p, 0 ≤ k ≤ (n − i)/p. The time length of the i-th shape in the period p is defined as the average length of these j-th shapes, and the time length of the period p is defined as the sum of the time lengths of the i-th shapes over i = 1, 2, . . . , p. In the presence of corrupted shapes (an example is shown in Figure 5.6), the median should be used instead of the average, as the average is more sensitive to the bias introduced by corrupted shapes.
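Step 4 can be sketched as follows (illustrative Python, using the median variant suggested above; the function name is ours):

```python
from statistics import median

def period_time_length(shape_lengths, p):
    """Time length of a candidate period of p shapes.
    shape_lengths[j] is the len field of the (j+1)-th shape in S;
    position i collects the shapes at indices i, i + p, i + 2p, ..."""
    total = 0
    for i in range(p):
        occurrences = shape_lengths[i::p]   # all j-th shapes, j = i + k*p
        total += median(occurrences)        # robust to corrupted shapes
    return total

# Eight shapes alternating ∩ (≈120 h) and ∪ (≈48 h), one corrupted ∩ (30 h):
lengths = [120, 48, 118, 50, 30, 48, 122, 46]
print(period_time_length(lengths, 2))  # median(∩) + median(∪) = 119 + 48 = 167.0
```

The corrupted 30-hour shape barely moves the median, whereas it would pull the average down noticeably.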

5.2.4 The Algorithm

Summarizing all the above steps, the trend based algorithm is presented in Algorithm 1.

Algorithm 1 Trend Based Algorithm

Input: A real valued time series T = e1e2 . . . en, and the confidence threshold τ.
Output: All candidate periods for T.
Method:
1. Generate the smoothing curve C for T using the kernel smoothing technique;
2. Extract ∪ shapes and ∩ shapes from C and construct the shape sequence S;
3. For p = 1, 2, . . . , m/2, where m is the number of shapes in S:
   a. compute d = DTW(S(p), S^(p));
   b. compute the confidence defined in Equation 5.3: conf = (m − p − d)/(m − p);
      if conf ≥ τ and d is a local minimum, add p to Cand;
4. For each p in Cand, output the time length of p.
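Step 3 of Algorithm 1 amounts to running the WARP search over the shape sequence with a pluggable shape distance. A hedged Python sketch (names shape_periods and dist are ours, not the thesis code):

```python
def dtw(A, B, dist):
    """DTW cost between two sequences under an arbitrary 0/1 distance."""
    INF = float("inf")
    m, n = len(A), len(B)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = dist(A[i - 1], B[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]

def shape_periods(S, dist, tau):
    """Return (p, confidence) for every shift p that clears threshold tau."""
    m = len(S)
    out = []
    for p in range(1, m // 2 + 1):
        d = dtw(S[:m - p], S[p:], dist)     # S(p) vs. S^(p)
        conf = (m - p - d) / (m - p)
        if conf >= tau:
            out.append((p, conf))
    return out

# With a perfectly alternating ∩/∪ sequence, p = 2 is found; its multiple
# p = 4 also qualifies and would be pruned by the local-minimum filter
# of Section 5.1.3.
S = ["cap", "cup"] * 4
eq = lambda a, b: 0 if a == b else 1
print(shape_periods(S, eq, 0.9))
```

In practice dist would be the ε-based shape distance of Section 5.2.2 rather than exact equality.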


5.3 Experiments

In this section we study the performance of the trend based algorithm by comparing it

with the WARP algorithm [28], which is known to outperform other algorithms. Section

5.3.1 explains our selection of data sets and Section 5.3.2 explains parameter settings used.

Section 5.3.3 studies the accuracy of periodicity detection. Section 5.3.4 examines the effect

of smoothness level on the trend based algorithm. Section 5.3.5 examines the effect of

discretization on the WARP algorithm. Section 5.3.6 examines the applicability of the

trend based algorithm for detecting multiple periodicities.

5.3.1 Data Selection

Two time series data sets from industrial load curves collected by BC Hydro were used in our

experiments. They are different from the data sets used in the experiments in Section 4.4 for X-outlier detection. Each covers one year of hourly electricity consumption, one from January 2008 to December 2008 and the other from December 2004 to November 2005. Both data sets have a weekly periodicity; therefore, the real periods are known: 24 × 7 × i hours, where i = 1, 2, . . . , 26. An example of one week's data is shown in

Figure 5.7 where the weekend has a lower consumption than weekdays and night time has a

lower consumption than daytime. In addition, these data sets have different levels of noise.

The first data set, the “Normal” data, has preserved almost every weekly pattern, with a

few exceptions where one or two days in the weekdays were corrupted into low values like

those of weekends in some weeks. The second data set, the “Noisy” data, has about 15% of

data corrupted into low values. Both data sets are presented to the trend based algorithm

and the WARP algorithm. For the WARP algorithm, we first discretized the consumption

values into four bins using equi-width binning. The effect of other binning choices will be

examined in Section 5.3.5.

5.3.2 Parameter Settings

The difference threshold ε for two shapes ai and bj (Section 5.2.2) is set to 30%. This was the best of the settings we tried: 20%, 25%, 30% and 35%. The window size

w used by the DTW distance (Section 5.1.2) is set to 24 × 2 (i.e., two days) for the hourly

based WARP algorithm and is set to 3 (i.e., three shapes) for the trend based method. The

confidence threshold τ (Section 5.1.3) ranges from 0.7 to 0.9. We divide the smoothness


Figure 5.7: One week's data with the weekly pattern

level of the smoothing parameter for the Nadaraya-Watson estimator into the following five levels:

h = 2^(i−1) / 1000,  i = 1, 2, . . . , 5  (5.4)

Level i = 1 corresponds to the roughest level and level i = 5 corresponds to the smoothest level.
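For concreteness, a generic Nadaraya-Watson smoother with the bandwidths of Equation 5.4 might look like this. This is a sketch under our own assumptions (Gaussian kernel, time rescaled to [0, 1]); the thesis implementation may differ:

```python
import math

def nw_smooth(y, h):
    """Nadaraya-Watson estimate at each observed time point: a weighted
    average of all y values, with Gaussian weights of bandwidth h on
    time rescaled to [0, 1]. O(n^2), matching the stated complexity."""
    n = len(y)
    x = [t / (n - 1) for t in range(n)]
    out = []
    for xi in x:
        w = [math.exp(-0.5 * ((xi - xj) / h) ** 2) for xj in x]
        out.append(sum(wj * yj for wj, yj in zip(w, y)) / sum(w))
    return out

h = 2 ** (3 - 1) / 1000            # smoothness level i = 3
smoothed = nw_smooth([1, 5, 2, 6, 1, 5, 2, 6], h)
```

Larger h averages over a wider time window, which is what flattens the curve at the higher smoothness levels.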

5.3.3 Accuracy

The first set of experiments evaluates the accuracy of the trend based algorithm compared

with the WARP algorithm. We say that a detected period pd is correct if there exists a

real period pi such that |pd − pi|/Max(pd, pi) ≤ η. In our experiments, η is set to 5%.

True positive (TP ) is the number of correctly detected periods; false positive (FP ) is the

number of wrongly detected periods; false negative (FN) is the number of periods that

are not detected. The precision (P ), recall (R) and F -measure (F ) are defined as follows:

P = TP / (TP + FP)  (5.5)

R = TP / (TP + FN)  (5.6)

F = (2 × P × R) / (P + R)  (5.7)
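The evaluation protocol can be sketched as follows (illustrative Python; the function name score is ours). A detected period pd counts as correct when some real period pi satisfies |pd − pi|/Max(pd, pi) ≤ η:

```python
def score(detected, real, eta=0.05):
    """Precision, recall and F-measure for detected periods against the
    known real periods, using the eta relative-error matching rule."""
    match = lambda pd, pi: abs(pd - pi) / max(pd, pi) <= eta
    tp = sum(1 for pd in detected if any(match(pd, pi) for pi in real))
    fp = len(detected) - tp
    fn = sum(1 for pi in real if not any(match(pd, pi) for pd in detected))
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

# Weekly ground truth (multiples of 168 h) against three detections;
# 170 and 333 match within 5%, 90 does not, and 504 is missed.
P, R, F = score([170, 333, 90], [168, 336, 504])
```

Here tp = 2, fp = 1 and fn = 1, so P = R = F = 2/3.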


Table 5.1: Accuracy comparison on "Noisy" data

Confidence       Trend based algorithm (level 3)        WARP
Threshold (%)    TP  FP  FN    P    R    F          TP  FP  FN    P     R    F
70               22   4   4   85%  85%  85%         26  51   0   34%  100%  50%
75               22   4   4   85%  85%  85%         26  51   0   34%  100%  50%
80               22   4   4   85%  85%  85%         21  49   5   30%   81%  44%
85               19   4   7   83%  73%  78%          3  46  23    6%   12%   8%
90                9   2  17   82%  35%  49%          0   0  26    –    0%    –

Table 5.2: Accuracy comparison on "Normal" data

Confidence       Trend based algorithm (level 3)          WARP
Threshold (%)    TP  FP  FN    P     R     F         TP  FP  FN    P     R    F
70               26   0   0  100%  100%  100%        26  39   0   40%  100%  57%
75               26   0   0  100%  100%  100%        26  39   0   40%  100%  57%
80               26   0   0  100%  100%  100%        26  39   0   40%  100%  57%
85               26   0   0  100%  100%  100%        26  39   0   40%  100%  57%
90               26   0   0  100%  100%  100%        15  12  11   56%   58%  57%

The accuracy comparison between the trend based algorithm (with smoothness level 3)

and the WARP algorithm on the "Noisy" and "Normal" data sets is shown in TABLE 5.1

and TABLE 5.2. For both data sets, FP of the WARP algorithm is much larger than that of

the trend based algorithm. This is because the WARP algorithm depends on discretization

to map continuous consumption values to a fixed number of bins on a point-by-point basis,

which is not sensitive to the trends in the data. Consequently, many false patterns not

representing the weekly pattern were generated. In contrast, the trend based algorithm

preserves the weekly pattern through the smoothing curve and the ∪ shapes and ∩ shapes

on the smoothing curve. For the "Normal" data set, the trend based algorithm achieves perfect periodicity detection across all confidence thresholds. For the "Noisy" data set, except at the very high confidence threshold of 90%, the trend based algorithm finds most periods correctly (i.e., 19 or 22 out of 26) while returning only a few false positives, yielding an F-measure of up to 0.85, significantly higher than that of the WARP algorithm. This study


suggests that the trend based algorithm is able to detect periodicity accurately.

5.3.4 Effect of Smoothness Levels

The smoothness level affects the level of detail modeled by the trend based algorithm.

We study this effect and summarize the results in TABLE 5.3 and TABLE 5.4. Consider

TABLE 5.4 for example. When the smoothness level is low, say level 1, TP and FP are extremely low because the smoothing curve models too much detail, which leads to many small ∪ shapes and ∩ shapes largely caused by the noise in the raw data. When DTW is applied to the shape sequence, most of such shapes are considered dissimilar.

When the smoothness increases to level 3, the smoothing curve correctly models a sequence of ∪ shapes and ∩ shapes corresponding to high usage during the weekdays and low usage at the weekend. With such a sequence of ∪ shapes and ∩ shapes, the trend based

algorithm finds all real periods with no false positives.

When the smoothness reaches level 5, the smoothing curve is rather flat and there are only a few large ∪ shapes and ∩ shapes, each covering more than one week's data, with the result that the weekly periods are not detected.

Table 5.3: Trend based algorithm on the "Noisy" data (confidence threshold set to 70%)

Smoothness Level   TP  FP  FN    P     R     F
1                   0   0  26    –     0%    –
2                   4   0  22  100%   15%   27%
3                  22   4   4   85%   85%   85%
4                  14   1  12   93%   54%   68%
5                   2   0  24  100%    8%   14%

Table 5.4: Trend based algorithm on the "Normal" data (confidence threshold set to 70%)

Smoothness Level   TP  FP  FN    P     R     F
1                   1   1  25   50%    4%    7%
2                  23   0   3  100%   88%   94%
3                  26   0   0  100%  100%  100%
4                  19   1   7   95%   73%   83%
5                   0   0  26    –     0%    –


Clearly, a proper choice of the smoothness level is crucial. In practice, the user does not

have to make such a choice in advance. As mentioned in Section 4.3, we have developed

a software tool with a user-friendly interface, which allows the user to slide a bar for the

smoothness level and displays the corresponding smoothing curve interactively. Based on

visual inspection of the fit between the smoothing curve and the time series, the user can

adjust the smoothness level using the sliding bar and immediately get a new smoothing

curve based on the adjusted smoothness level. Typically, after several trials the user is able

to converge to a desired smoothness level. In our case, five users have used this tool. At the beginning, they needed four to five trials on average to obtain a smoothing curve that gave good periodicity detection results; once they became more familiar with the tool, the number of trials was reduced to two or three.

5.3.5 Effect of Discretization on WARP

One of our observations is that the uniform binning assumed in previous works could distort

the trends in the data. To validate this observation, we vary the number of bins in the

WARP algorithm and examine whether any choice produces a better result. The findings are

reported in TABLE 5.5 for equi-width binning and in TABLE 5.6 for equi-depth binning.

The “Normal” data set was used in this experiment.

Table 5.5: WARP on "Normal" data (confidence threshold set to 70%, equi-width binning)

Number of Bins   TP  FP  FN    P     R     F
3                26  45   0   37%  100%   54%
4                26  39   0   40%  100%   57%
5                26  33   0   44%  100%   61%
6                22  24   4   48%   85%   61%
7                15  21  11   42%   58%   48%

Even though the "Normal" data set has a strong weekly periodicity, for all numbers of bins tested and for both binning methods, P (precision) is low due to many false positives. A larger confidence threshold did not help because it would reduce TP of the WARP algorithm, as shown in TABLE 5.2. In fact, we did not observe any "proper" number of bins. The reason is that the periodicity pattern does not follow equi-width or equi-depth binning. For example, in Figure 5.7 the consumption at daytime on weekdays tends


Table 5.6: WARP on "Normal" data (confidence threshold set to 70%, equi-depth binning)

Number of Bins   TP  FP  FN    P     R     F
3                26  47   0   36%  100%   53%
4                26  35   0   43%  100%   60%
5                26  34   0   43%  100%   60%
6                14  29  12   33%   54%   41%
7                 9  16  17   36%   35%   35%

to stay within the narrow range [4, 5], and in this case the equi-depth binning will divide

this range into several bins, which clearly destroys the underlying trends. The trend based

algorithm does not have this problem because it uses a smoothing curve to model the trends

in the data.
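The two binning schemes can be illustrated with a generic sketch (not the thesis code; function names are ours). Equi-width splits the value range evenly; equi-depth puts roughly the same number of points in each bin, which fragments a flat weekday band:

```python
def equi_width_bins(values, k):
    """Assign each value to one of k bins of equal value-range width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against a constant series
    return [min(int((v - lo) / width), k - 1) for v in values]

def equi_depth_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

# Weekday load hovering in [4, 5] plus two weekend hours near 1:
load = [4.1, 4.3, 4.2, 4.4, 4.2, 1.0, 1.1]
print(equi_width_bins(load, 4))   # [3, 3, 3, 3, 3, 0, 0] - band kept whole
print(equi_depth_bins(load, 4))   # [1, 2, 1, 3, 2, 0, 0] - band split up
```

Equi-width keeps the flat weekday trend in one symbol here, while equi-depth scatters it over three bins, mirroring the distortion discussed above.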

5.3.6 Multiple Periodicities

Time series data often exhibit multiple periodicities (e.g., daily and weekly periodicities) at the same time. Another advantage of the trend based algorithm is its ability to detect these different periodicities. This is done by using different smoothness levels to model the trends at different levels of detail. We explain this point using the five weeks' data set in

Figure 5.8 and Figure 5.9.

In Figure 5.8, the smoothing curve was generated with smoothness level 3 and the curve

models the daily trend by a ∪ shape followed by a ∩ shape. This pattern occurs approxi-

mately every day with the weekend having slightly lower values. With such a sequence of

∪ shapes and ∩ shapes, a daily periodicity will be found by the trend based algorithm.

In Figure 5.9, the smoothing curve was generated (on the same data) with smoothness

level 5 and the curve models the weekly trend by one ∩ shape for the weekdays and one ∪ shape for the weekend. Unlike the smoothing curve in Figure 5.8, the detailed change between

daytime and nighttime on each day is not modeled. With such a sequence of shapes, the

trend based algorithm will be able to find the weekly periodicity.

As discussed in Section 4.3, a user-friendly interface and visualization tool will help the

user to identify a proper smoothness level.


Figure 5.8: Five weeks' data with daily patterns, smoothness level 3

Figure 5.9: Five weeks' data with weekly patterns, smoothness level 5


Chapter 6

Conclusion and Future Work

Load curve data cleansing is an essential task in power systems. With good quality load curve data, high accuracy can be achieved in load forecasting, system analysis, operation modeling and planning studies, and the reliability of power systems can therefore be improved. In this thesis, a novel class of outliers, X-outliers, which are consequences of various random factors, is presented. We argue that traditional smoothing techniques, which take into account only local information, are not suitable for detecting X-outliers. A four-step approach is proposed to detect and repair X-outliers: smoothing the load curve, representing the smoothing curve by a sequence of ∪ shapes and ∩ shapes, identifying X-outliers, and repairing X-outliers.

Outlier detection in time series data usually involves periodicity detection. For periodic-

ity detection, previous work assumes that real valued data points can be properly discretized

into a small number of bins, and treats a time series as a sequence of discrete symbols. A

major drawback of this approach is that much information is lost because discretization does not preserve the trends in the data. Another drawback is that it is

difficult to specify a proper number of bins and the uniform binning scheme is not suitable

for a time series where different parts have different characteristics.

The trend based approach proposed in this thesis addresses these problems by modeling

the trends in the data by a sequence of ∪ shapes and ∩ shapes. These shapes represent

the most interesting information in the data and are extracted from a smoothing curve

that approximates the time series data. The periodic patterns are detected by finding the

re-occurrence of subsequences of ∪ shapes and ∩ shapes, taking into account the similarity

of such shapes. The proposed approach is trend preserving, noise resilient, and flexible for


detecting multiple periodicities.

Both the outlier detection and the periodicity detection algorithms proposed in this thesis face the challenge of determining the best smoothing parameter for time series of different lengths. In our application, this is addressed through user interaction: with a user-friendly interface, the user can easily find the best smoothing parameter after several trials.

In this thesis, the time complexity of the proposed outlier detection and periodicity detection algorithms is O(n²) because generating a smoothing curve takes O(n²) time. A future direction of research is to develop faster algorithms while maintaining high accuracy.


Bibliography

[1] E. Keogh, J. Lin, and A. Fu, HOT SAX: efficiently finding the most unusual time series subsequence, In ICDM, 2005.

[2] E. Keogh, J. Lin, S. H. Lee, and H. V. Herle, Finding the most unusual time series subsequence: algorithms and applications, Knowledge and Information Systems, 11(1):1-27, 2006.

[3] V. J. Hodge and J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev., Vol. 22, No. 2, pp. 85-126, Oct. 2004.

[4] V. Barnett and T. Lewis, Outliers in statistical data, 3rd ed. New York: Wiley, 1994, pp. 397-415.

[5] J. Chen, W. Li, A. Lau, J. Cao, and K. Wang, Automated load curve data cleansing in power systems, IEEE Transactions on Smart Grid, Vol. 1, No. 2, September 2010.

[6] A. Fallon and C. Spade, Detection and accommodation of outliers in normally distributed data sets. [Online]. Available: http://www.cee.vt.edu/ewr/environmental/teach/smprimer/outlier/outlier.html

[7] A. J. Fox, Outliers in time series, J. Roy. Stat. Soc. B, Methodological, Vol. 34, pp. 350-363, 1972.

[8] G. M. Ljung, On outlier detection in time series, J. Roy. Stat. Soc. B, Methodological, Vol. 55, pp. 559-567, 1993.

[9] B. Abraham and N. Yatawara, A score test for detection of time series outliers, J. Time Ser. Anal., Vol. 9, pp. 109-119, 1988.


[10] B. Abraham and A. Chuang, Outlier detection and time series modeling, Technometrics, Vol. 31, pp. 241-248, 1989.

[11] W. Schmid, The multiple outlier problem in time series analysis, Australian J. Stat., Vol. 28, pp. 400-413, 1986.

[12] I. Chang and G. C. Tiao, Estimation of time series parameters in the presence of outliers, Technometrics, Vol. 30, pp. 193-204, 1988.

[13] R. S. Tsay, Outliers, level shifts, and variance changes in time series, Journal of Forecasting, Vol. 7, pp. 1-20, 1988.

[14] SAS/ETS 9.22 user's guide [Online]. Available: http://support.sas.com/documentation/cdl/en/etsug/60372/PDF/default/etsug.pdf

[15] E. M. Knorr and R. T. Ng, Algorithms for mining distance-based outliers in large datasets, In VLDB, 1998.

[16] S. Ramaswamy, R. Rastogi, and K. Shim, Efficient algorithms for mining outliers from large data sets, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000.

[17] F. Angiulli and C. Pizzuti, Fast outlier detection in high dimensional spaces, In Proceedings of the Sixth European Conference on the Principles of Data Mining and Knowledge Discovery, pp. 15-26, 2002.

[18] D. Dasgupta and S. Forrest, Novelty detection in time series data using ideas from immunology, In Proceedings of the International Conference on Intelligent Systems, pp. 82-87, 1996.

[19] J. Ma and S. Perkins, Online novelty detection on temporal sequences, In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 157-166, 2003.

[20] L. Wei, N. Kumar, V. Lolla, E. Keogh, S. Lonardi, and C. Ratanamahatana, Assumption-free anomaly detection in time series, In SSDBM 2005: Proceedings of the 17th International Conference on Scientific and Statistical Database Management, pp. 237-240, 2005.


[21] W. Hardle, Applied nonparametric regression, Cambridge University Press, 1990.

[22] M. Vlachos, G. Kollios, and D. Gunopulos, Discovering similar multidimensional trajectories, In ICDE, 2002.

[23] D. M. Bourg, Excel scientific and engineering cookbook, O'Reilly, 2006.

[24] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, Dimensionality reduction for fast similarity search in large time series databases, Knowledge and Information Systems, 3(3), 2000.

[25] J. Laurikkala, M. Juhola, and E. Kentala, Informal identification of outliers in medical data, In Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP-2000), Berlin, 22 August 2000. Organized as a workshop of the 14th European Conference on Artificial Intelligence (ECAI-2000).

[26] P. Chan and M. Mahoney, Modeling multiple time series for anomaly detection, In ICDM, 2005.

[27] M. G. Elfeky, W. G. Aref, and A. K. Elmagarmid, Periodicity detection in time series databases, IEEE TKDE, 2005.

[28] M. G. Elfeky, W. G. Aref, and A. K. Elmagarmid, WARP: time warping for periodicity detection, In ICDM, 2005.

[29] P. Indyk, N. Koudas, and S. Muthukrishnan, Identifying representative trends in massive time series data sets using sketches, In VLDB, 2000.

[30] S. Ma and J. Hellerstein, Mining partially periodic event patterns with unknown periods, In ICDE, 2001.

[31] S. Papadimitriou, A. Brockwell, and C. Faloutsos, Adaptive, hands-off stream mining, In VLDB, 2003.

[32] C. Berberidis, W. Aref, M. Atallah, I. Vlahavas, and A. Elmagarmid, Multiple and partial periodicity mining in time series databases, In ECAI, 2002.

[33] J. Yang, W. Wang, and P. Yu, InfoMiner+: mining partial periodic patterns with gap penalties, In ICDM, 2002.


[34] A. Weigend and N. Gershenfeld, Time series prediction: forecasting the future and understanding the past, Addison-Wesley, Reading, Massachusetts, 1994.

[35] D. Berndt and J. Clifford, Using dynamic time warping to find patterns in time series, In KDD, 1994.

[36] F. Rasheed and R. Alhajj, STNR: a suffix tree based noise resilient algorithm for periodicity detection in time series databases, Applied Intelligence, 2010.

[37] W. Li, Risk assessment of power systems: models, methods, and applications, IEEE Press/Wiley, 2005.

[38] M. Ahdesmäki, H. Lähdesmäki, R. Pearson, H. Huttunen, and O. Yli-Harja, Robust detection of periodic time series measured from biological systems, BMC Bioinformatics, 2005.

[39] E. F. Glynn, J. Chen, and A. R. Mushegian, Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms, Bioinformatics, 2006.

[40] J. W. Taylor, An evaluation of methods for very short-term load forecasting, using minute-by-minute British data, International Journal of Forecasting, Vol. 24, pp. 645-658, 2008.

[41] R. Weron, Modeling and forecasting electricity loads and prices: a statistical approach, John Wiley & Sons, 2006.

[42] Smart Grid. Available via http://www.oe.energy.gov/smartgrid.htm.

[43] Hardle et al., Nonparametric and semiparametric models, Springer, 2004.

[44] B. W. Silverman, Density estimation for statistics and data analysis, Chapman & Hall, 1986.

[45] J. Durbin and S. J. Koopman, Time Series Analysis by State Space Methods, Oxford University Press, Oxford, UK, 2001.

[46] V. Dordonnat, State-Space Modelling for High Frequency Data: Three Applications to French National Electricity Load, Ph.D. thesis, VU University Amsterdam, 2009.


[47] V. Dordonnat, S. J. Koopman, M. Ooms, A. Dessertaine, and J. Collet, An Hourly Periodic State Space Model for Modelling French National Electricity Load, Tinbergen Institute, Paper No. 08-008/4.

[48] D. Ruppert, S. J. Sheather, and M. P. Wand, An effective bandwidth selector for local least squares regression, Journal of the American Statistical Association, Vol. 90, pp. 1257-1270, 1995.

[49] S. Salvador, P. Chan, and J. Brodie, Learning states and rules for time series anomaly detection, In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, 2004.