Download - 1 Data Mining: Concepts and Techniques — Chapter 8 — 8.2 Mining time-series data Jiawei Han and Micheline Kamber Department of Computer Science University.

1

Data Mining: Concepts and Techniques

— Chapter 8 —

8.2 Mining time-series data

Jiawei Han and Micheline Kamber

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj©2010 Jiawei Han and Micheline Kamber. All rights reserved.

http://www.cs.uiuc.edu/~hanj

April 18, 2023Data Mining: Concepts and

Techniques 2

3

Mining Time-Series Data A time series is a sequence of data points, measured

typically at successive times, spaced at (often uniform) time intervals

Time series analysis: A subfield of statistics, comprises methods that attempt to understand such time series, often either to understand the underlying context of the data points or to make forecasts (or predictions)

Methods for time series analyses Frequency-domain methods: Model-free analyses, well-

suited to exploratory investigations spectral analysis vs. wavelet analysis

Time-domain methods: Auto-correlation and cross-correlation analysis

Motif-based time-series analysis Applications

Financial: stock price, inflation Industry: power consumption

Scientific: experiment results Meteorological: precipitation

4

Mining Time-Series Data

Regression Analysis

Trend Analysis

Similarity Search in Time Series Data

Motif-Based Search and Mining in

Time Series Data

Summary

5

Time-Series Data Analysis: Prediction & Regression Analysis

(Numerical) prediction is similar to classification construct a model use model to predict continuous or ordered value for a given

input Prediction is different from classification

Classification refers to predict categorical class label Prediction models continuous-valued functions

Major method for prediction: regression model the relationship between one or more independent or

predictor variables and a dependent or response variable Regression analysis

Linear and multiple regression Non-linear regression Other regression methods: generalized linear model, Poisson

regression, log-linear models, regression trees

6

What is Regression?

Modeling the relationship between one response variable and one or more predictor variables

Analyzing the confidence of the model E.g, height v.s weight

7

Regression Yields Analytical Model

Discrete data points →Analytical model General relationship Easy calculation Further analysis

Application - Prediction

8

Application - Detrending

Obtain the trend for irregular data series Subtract trend Reveal oscillations

trend

9

Linear Regression - Single Predictor

Model is linear

y = w0 + w1 x

where w0 (y-intercept) and w1

(slope) are regression coefficients

Method of least squares:

y: response variable

x: predictor variable

w1

w0

| |

1| |

2

1

( )( )

1( )

D

i ii

D

ii

x x y y

x xw

xwyw

10

10

Training data is of the form (X1, y1), (X2, y2),…, (X|

D|, y|D|) E.g., for 2-D data or

y = w0 + w1 x1+ w2 x2

Solvable by Extension of least square method

(XTX ) W=Y →W = (XTX ) -1Y Commercial software (SAS, S-Plus)

x1

x2

y

Linear Regression – Multiple Predictor

11

Nonlinear Regression with Linear Method

Polynomial regression model E.g., y = w0 + w1 x + w2 x2 + w3 x3

Let x2 = x2, x3= x3

y = w0 + w1 x + w2 x2 + w3 x3

Log-linear regression model E. g., y = exp(w0 + w1 x + w2 x2 + w3 x3

)

Let y’=log(y) y’= w0 + w1 x + w2 x2 + w3 x3

12

Generalized Linear Regression

Response y Distribution function in the exponential family Variance of y depends on E( y), not a constant

E( y) = g-1( w0 + w1 x + w2 x2 + w3 x3 ) Examples

Logistic regression (binomial regression): probability of some event occurring

Poisson regression: number of customers …

References: Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989

13

Regression Tree (Breiman et al., 1984)

Figure source: http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf

Partition the domain space

Leaf: (1) a continuous-valued

prediction; (2) average value

14

Model Tree (Quinlan, 1992)

Leaf – a linear equation More general than regression tree

Figure source: http://datamining.ihe.nl/research/model-trees.htm

15

Regression Trees and Model Trees

Regression tree: proposed in CART system (Breiman et al.

1984) CART: Classification And Regression Trees

Each leaf stores a continuous-valued prediction

It is the average value of the predicted attribute for the training

tuples that reach the leaf

Model tree: proposed by Quinlan (1992) Each leaf holds a regression model—a multivariate linear

equation for the predicted attribute

A more general case than regression tree

Regression and model trees tend to be more accurate than

linear regression when the data cannot be represented well

by a simple linear model

16

Predictive Modeling in Multidimensional Databases

Predictive modeling: Predict data values or construct generalized linear models based on the database data

One can only predict value ranges or category distributions

Method outline Minimal generalization Attribute relevance analysis Generalized linear model construction Prediction

Determine the major factors which influence the prediction Data relevance analysis: uncertainty measurement,

entropy analysis, expert judgment, etc. Multi-level prediction: drill-down and roll-up analysis

17

Predictive modeling: Predict data values or construct generalized linear models based on the database data

One can only predict value ranges or category distributions

Method outline: Minimal generalization Attribute relevance analysis Generalized linear model construction Prediction

Determine the major factors which influence the prediction Data relevance analysis: uncertainty measurement,

entropy analysis, expert judgment, etc. Multi-level prediction: drill-down and roll-up analysis

Predictive Modeling in Multidimensional Databases

18

Prediction: Numerical Data

19

References

Nelder, J.A. and Wedderburn, R.W.M. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135, 370-384.

C. Chatfield. The Analysis of Time Series: An Introduction, 3rd ed. Chapman & Hall, 1984.

McCullagh, P. and Nelder, J.A. (1989). Generalized linear models, 2nd ed. Chapman and Hall, London.

Breiman L, Friedman JH, Olshen RA, Stone CJ. (1984). Classification and Regression Trees. Chapman &Hall (Wadsworth, Inc.): New York.

Quinlan, J. R. (1992). Learning with continuous classes. In: Adams, , Sterling, (Eds.), Proceedings of artificial intelligence'92, World Scientific, Singapore. pp. 343-348.

Acknowledgment This presentation integrates Xiaopeng Li’s slides in his CS 512

class presentation

20


Regression Analysis

Trend Analysis



Time Series Data

Summary

21

A time series can be illustrated as a time-series graph which describes a point moving with the passage of time

22

Categories of Time-Series Movements

Categories of Time-Series Movements Long-term or trend movements (trend curve): general

direction in which a time series is moving over a long interval of time

Cyclic movements or cycle variations: long term oscillations about a trend line or curve

e.g., business cycles, may or may not be periodic Seasonal movements or seasonal variations

i.e, almost identical patterns that a time series appears to follow during corresponding months of successive years.

Irregular or random movements Time series analysis: decomposition of a time series into these

four basic movements Additive Modal: TS = T + C + S + I Multiplicative Modal: TS = T C S I

23

Estimation of Trend Curve

The freehand method

Fit the curve by looking at the graph

Costly and barely reliable for large-scaled data

mining

The least-square method

Find the curve minimizing the sum of the

squares of the deviation of points on the curve

from the corresponding data points

The moving-average method

24

Moving Average

Moving average of order n

Smoothes the data

Eliminates cyclic, seasonal and irregular

movements

Loses the data at the beginning or end of a series

Sensitive to outliers (can be reduced by weighted

moving average)

25

Trend Discovery in Time-Series (1): Estimation of Seasonal

Variations

Seasonal index Set of numbers showing the relative values of a variable

during the months of the year E.g., if the sales during October, November, and

December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are seasonal index numbers for these months

Deseasonalized data Data adjusted for seasonal variations for better trend and

cyclic analysis Divide the original monthly data by the seasonal index

numbers for the corresponding months


Techniques 26

Seasonal Index

0

20

40

60

80

100

120

140

160

1 2 3 4 5 6 7 8 9 10 11 12

Month

Seasonal Index

Raw data from http://www.bbk.ac.uk/manop/man/docs/QII_2_2003%20Time%20series.pdf

27

Trend Discovery in Time-Series (2)

Estimation of cyclic variations If (approximate) periodicity of cycles occurs,

cyclic index can be constructed in much the same manner as seasonal indexes

Estimation of irregular variations By adjusting the data for trend, seasonal and

cyclic variations With the systematic analysis of the trend, cyclic,

seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality

28


Regression Analysis

Trend Analysis



Time Series Data

Summary

29

Similarity Search in Time-Series Analysis

Normal database query finds exact match Similarity search finds data sequences that differ

only slightly from the given query sequence Two categories of similarity queries

Whole matching: find a sequence that is similar to the query sequence

Subsequence matching: find all pairs of similar sequences

Typical Applications Financial market Market basket data analysis Scientific databases Medical diagnosis

30

Data Transformation

Many techniques for signal analysis require the data to be in the frequency domain

Usually data-independent transformations are used The transformation matrix is determined a priori

discrete Fourier transform (DFT) discrete wavelet transform (DWT)

The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

31

Discrete Fourier Transform

DFT does a good job of concentrating energy in the first few coefficients

If we keep only first a few coefficients in DFT, we can compute the lower bounds of the actual distance

Feature extraction: keep the first few coefficients (F-index) as representative of the sequence

32

DFT (continued)

Parseval’s Theorem

The Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain

Keep the first few (say, 3) coefficients underestimates the distance and there will be no false dismissals!

1

0

21

0

2 ||||n

ff

n

tt Xx

|])[(])[(||][][|3

0

2

0

2

f

n

t

fQFfSFtQtS

33

Multidimensional Indexing in Time-Series

Multidimensional index construction Constructed for efficient accessing using the first

few Fourier coefficients Similarity search

Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence

Perform post-processing by computing the actual distance between sequences in the time domain and discard any false matches

34

Subsequence Matching

Break each sequence into a set of pieces of window with length w

Extract the features of the subsequence inside the window

Map each sequence to a “trail” in the feature space

Divide the trail of each sequence into “subtrails” and represent each of them with minimum bounding rectangle

Use a multi-piece assembly algorithm to search for longer sequence matches

35

Analysis of Similar Time Series

36

Enhanced Similarity Search Methods

Allow for gaps within a sequence or differences in offsets or amplitudes

Normalize sequences with amplitude scaling and offset translation

Two subsequences are considered similar if one lies within an envelope of width around the other, ignoring outliers

Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences

Parameters specified by a user or expert: sliding window size, width of an envelope for similarity, maximum gap, and matching fraction

37

Steps for Performing a Similarity Search

Atomic matching Find all pairs of gap-free windows of a small

length that are similar Window stitching

Stitch similar windows to form pairs of large similar subsequences allowing gaps between atomic matches

Subsequence Ordering Linearly order the subsequence matches to

determine whether enough similar pieces exist

38

Similar Time Series Analysis

VanEck International Fund Fidelity Selective Precious Metal and Mineral Fund

Two similar mutual funds in the different fund group

39

Query Languages for Time Sequences

Time-sequence query language Should be able to specify sophisticated queries like

Find all of the sequences that are similar to some sequence in class A, but not similar to any sequence in class B

Should be able to support various kinds of queries: range queries, all-pair queries, and nearest neighbor queries

Shape definition language Allows users to define and query the overall shape of time

sequences Uses human readable series of sequence transitions or

macros Ignores the specific details

E.g., the pattern up, Up, UP can be used to describe increasing degrees of rising slopes

Macros: spike, valley, etc.

40


Regression Analysis

Trend Analysis



Time Series Data

Summary

41

Sequence Distance

A function that measures the differentness of two sequences (of possibly unequal length)

Example: Euclidean Distance between TS Q,C

n

i ii cqCQD1

2)(),(

42

Motif: Basic Concepts

What is a motif? A previously unknown, frequently

occurring sequential pattern

Match: Given subsequences Q,C ⊆ T,

C is a match for Q iff for some R

Non-Trivial Match: C = T[p..*], Q = T[q..*] and C match

Q. If p = q or non-match N = T[s..*]∄ such that s between

p,q then match is non-trivial.

(i.e. C,Q must be separated by a non-match)

1-Motif: the subsequence with most non-trivial matches

(least variance decides ties)

k-Motif: Ck such that D(Ck,Ci) > 2R i [1,k)∀ ∈

RCQD ),(

43

SAX: Symbolic Aggregate approXimation

Dim. Reduction/Compression

“Symbolic Aggregate approXimation”

SAX : ℝ → ∑

SAX : ccbaabbbabcbcb↦

Essentially an alphabet over the Piecewise Aggregate

Approximation (PAA) rank

Faster, simpler, more compression, yet on par with DFT,

DWT and other dim. reductions

44

SAX Illustration

45

SAX Algorithm

Parameters: alphabet size, word (segment) length (or output rate)

1. Select probability distribution for TS

2. z-Normalize TS

3. PAA: Within each time interval, calculate aggregated value (mean) of the segment

4. Partition TS range by equal-area partitioning the PDF into n partitions (eq. freq. binning)

5. Label each segment with arank ∑∈ for aggregate’s

corresponding partition rank

46

Finding Motifs in a Time Series

EMMA Algorithm: Finds 1-(k-)motif of fixed length n SAX Compression (Dim. Reduction)

Possible to store D(i,j) ∀(i,j) ∈ ∑∑ Allows use of various distance measures (Minkowski,

Dynamic Time Warping) Multiple Tiers

Tier 1: Uses sliding window to hash length-w SAX subsequences (aw addresses, total size O(m)).Bucket B with most collisions & buckets with MINDIST(B) < R form neighborhood of B.

Tier 2: Neighborhood is pruned using more precise ADM algorithm. Ni with max. matches is 1-motif. Early stop if |ADM matches| > maxk>i(|neighborhoodk|)

47

Hashing

c e c a b b c b a c c e c a b b c b a cc c c c b b c c d c

w

n

2 4 2 0 1 1 2 1 0 25

2 2 2 2 1 1 2 2 3 25

2 4 2 0 1 1 2 1 0 25

… …

… …

…… …

…

…

…

…

Classification in Time Series

Application: Finance, Medicine

1-Nearest Neighbor Pros: accurate, robust, simple Cons: time and space complexity (lazy

learning); results are not interpretable

0 200 400 600 800 1000 1200

false nettles

stinging nettles

false nettles

Shapelet

stinging nettles

false nettles stinging nettles

Leaf Decision Tree

Shapelet Dictionary

5.1

yes

no

I

I

0 1

Testing the utility of a candidate shapelet

Arrange the time series objects

based on the distance from candidate

Find the optimal split point (maximal information gain)

Pick the candidate achieving best utility as the shapelet

Split Point

0

candidate

Information gain

false nettles stinging nettles

Leaf Decision Tree

Shapelet Dictionary

5.1

yes

no

I

I

0 1

false nettles

stinging nettles

false nettles

false nettles

Shapelet

stinging nettles

ClassificationClassification

52


Regression Analysis

Trend Analysis



Time Series Data

Summary

53

Summary

Time series analysis is an important

research field in data mining

Regression Analysis

Trend Analysis


Motif-Based Search and Mining in Time

Series Data

54

References on Time-Series Similarity Search

R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. FODO’93 (Foundations of Data Organization and Algorithms).

R. Agrawal, K.-I. Lin, H.S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. VLDB'95.

R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. VLDB'95. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series

databases. SIGMOD'94.

J. Lin, E. Keogh, S. Lonardi, and B. Chiu, “A Symbolic Representation of Time Series, with

Implications for Streaming Algorithms”, Data Mining and Knowledge discovery, 2003

P. Patel, E. Keogh, J. Lin, and S. Lonardi, “Mining Motifs in Massive Time Series Databases”,

ICDM’02 D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. SIGMOD'97. Y. Moon, K. Whang, W. Loh. Duality Based Subsequence Matching in Time-Series Databases,

ICDE’02 B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time

warping. ICDE'98. B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining

for co-evolving time sequences. ICDE'00. D. Shasha and Y. Zhu. High Performance Discovery in Time Series: Techniques and Case Studies,

SPRINGER, 2004

L. Ye and E. Keogh, “Time Series Shapelets: A New Primitive for Data Mining”, KDD’09


Techniques 55