Pattern Mining in large time series databases


M.Tech Seminar Presentation

2014-15

Presented by

Jitesh Khandelwal

10211015

Guided by

Dr. Durga Toshniwal

IIT Roorkee

- Finance, e.g. stock prices
- Medical, e.g. electrocardiograms (ECG)
- Marketing, e.g. forecasting product/brand demand
- Operations, e.g. monitoring control infrastructure at the LHC
- Social networks, e.g. like count on a profile picture, based on gender

Almost everything is a time series!

Value Prediction: given 2.1, 9.3, 4.5, 3, 6.7, 4.0, 18.8, 9.2, 5.8, what's next? Based on mathematical models.

Pattern Identification: find structure in 2.1, 9.3, 4.5, 3, 6.7, 4.0, 18.8, 9.2, 5.8. Based on human perception.

Classification

Anomaly Detection

Motif Discovery

Clustering

(Figures illustrating these tasks, from [16].)

Raw time series data goes through three steps:

1. Similarity model selection: a mathematical formulation of the human perception of similarity.

2. Dimensionality reduction: high dimensionality makes distance calculation slow.

3. Index construction: enables efficient querying of big and fast-incoming time series data.


Symbolic Representation

Enables the use of text processing algorithms.

A double is 8 bytes, a char is 1 byte. Hence, lower memory footprint.

Lp Norms

โ„’๐‘ ๐‘ฅ โˆ’ ๐‘ฆ =

๐‘–=1

๐ฟ

๐‘ฅ๐‘– โˆ’ ๐‘ฆ๐‘–๐‘

1๐‘ โ„’1

โ„’2

- Manhattan distance

- Euclidean distance

Invariant to amplitude scaling when used with z-score normalization.
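A minimal sketch of this in Python (function names are my own). It shows that two amplitude-scaled copies of the same shape become identical under the Euclidean (L_2) distance once z-score normalized:

```python
import math

def znorm(x):
    """Z-score normalize a series: zero mean, unit variance."""
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [(v - mu) / sd for v in x]

def lp_dist(x, y, p=2):
    """L_p norm distance between two equal-length series."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

a = [1.0, 2.0, 3.0, 2.0]
b = [10.0, 20.0, 30.0, 20.0]   # a, amplitude-scaled by 10
print(lp_dist(znorm(a), znorm(b)))   # ~0.0 after normalization
```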


DTW Distance

๐ท ๐‘–, ๐‘— = ๐‘ฅ๐‘– โˆ’ ๐‘ฆ๐‘— +๐‘š๐‘–๐‘› ๐ท ๐‘– โˆ’ 1, ๐‘— , ๐ท ๐‘– โˆ’ 1, ๐‘— โˆ’ 1 , ๐ท ๐‘–, ๐‘— โˆ’ 1

DTW is Dynamic Time Warping. It allows comparison of variable-length time series.

Computationally expensive; can be optimized using warping-window constraints and early abandoning with lower bounds.
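The recurrence above can be sketched directly as a dynamic program (a minimal illustration, not an optimized implementation; the optional `window` argument is a Sakoe-Chiba-style warping-window constraint):

```python
def dtw(x, y, window=None):
    """DTW distance via the recurrence
    D(i,j) = |x_i - y_j| + min(D(i-1,j), D(i-1,j-1), D(i,j-1)).
    `window` optionally restricts |i - j| to limit warping."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo, hi = (1, m) if window is None else (max(1, i - window), min(m, i + window))
        for j in range(lo, hi + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 1, 2, 3]))  # 0.0: the repeated 1 is absorbed by warping
```

Note how the two series have different lengths, which Euclidean distance cannot handle at all.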


Longest Common Subsequence

Applicable only to symbolic representations of time series.

Non-metric, because it does not satisfy the triangle inequality.

Sim(x, y) = LCS(x, y)

A distance measure D is a metric if it satisfies the following properties:
1. Symmetry: D(X, Y) = D(Y, X)
2. Triangle inequality: D(X, Z) <= D(X, Y) + D(Y, Z)

Parameters of the LCSS model:
- Threshold: matching criterion for two points from x and y.
- Warping threshold: constraint on matching of points along the time axis.
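A minimal sketch of the LCSS model with both parameters (here named `eps` for the value threshold and `delta` for the warping constraint; the names are mine):

```python
def lcss(x, y, eps, delta):
    """Length of the longest common subsequence under the LCSS model:
    points x_i, y_j match if |x_i - y_j| < eps (value threshold) and
    |i - j| <= delta (warping constraint along the time axis)."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(x[i - 1] - y[j - 1]) < eps and abs(i - j) <= delta:
                L[i][j] = L[i - 1][j - 1] + 1   # points match: extend the subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

# The outlier 9.0 is simply skipped instead of dominating the distance.
print(lcss([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 9.0, 4.1], eps=0.5, delta=1))  # 3
```

Skipping unmatched points is what makes the measure robust to noise, and also what breaks the triangle inequality.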

Landmark Similarity

Works the same way humans remember patterns.

The definition of a landmark varies with the application domain, e.g. local minima, local maxima, inflection points.

Uses the MDPP (Minimum Distance/Percentage Principle) technique to eliminate noisy landmarks: MDPP(D, P) removes the landmarks at x_i and x_{i+1} if

x_{i+1} - x_i < D  and  |y_{i+1} - y_i| / ((|y_i| + |y_{i+1}|) / 2) < P


๐ท๐‘Ÿ๐‘’๐‘‘๐‘ข๐‘๐‘’๐‘‘ ๐‘ ๐‘๐‘Ž๐‘๐‘’ ๐ด, ๐ต โ‰ค ๐ท๐‘ก๐‘Ÿ๐‘ข๐‘’ ๐ด, ๐ต

False Alarms False Dismissals

Objects that appear close in index space are actually distant.

Objects appear distant in index space but are actually closer.

Removed in post-processing step. Unacceptable.
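The filter-and-refine pattern this enables can be sketched as follows (a toy example: the "reduced space" here is a 2-segment PAA, i.e. the means of the two halves, whose scaled Euclidean distance lower-bounds the true one; all names are mine):

```python
import math

def true_dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def reduced(x):
    """Toy 2-D reduction: means of the two halves of the series."""
    h = len(x) // 2
    return (sum(x[:h]) / h, sum(x[h:]) / (len(x) - h))

def lb_dist(rx, ry, n):
    """Lower bound in reduced space: sqrt(n/2) * Euclidean on the 2 means."""
    return math.sqrt(n / 2) * math.sqrt(sum((a - b) ** 2 for a, b in zip(rx, ry)))

def range_query(db, q, r):
    """Filter with the lower bound (no false dismissals), then verify the
    survivors with the true distance (removes the false alarms)."""
    candidates = [x for x in db if lb_dist(reduced(x), reduced(q), len(q)) <= r]
    return [x for x in candidates if true_dist(x, q) <= r]

db = [[0.0] * 4, [1.0] * 4, [0.0, 0.0, 10.0, 10.0]]
q = [0.0] * 4
print(range_query(db, q, 2.5))  # the distant series is pruned by the lower bound
```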


Discrete Fourier Transform

๐‘‹๐‘“ =1

๐‘›

๐‘ก=0

๐‘›โˆ’1

๐‘ฅ๐‘ก ๐‘’โˆ’๐‘—2๐œ‹๐‘ก๐‘“๐‘›

1. Choose the coefficients corresponding to a few low frequencies.

2. Choose the coefficients with the largest magnitudes.

Based on Parseval's Relation, the Euclidean distance is preserved in the frequency domain.
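Both properties can be checked with a naive unitary DFT (a sketch for illustration only; a real implementation would use an FFT):

```python
import cmath, math

def dft(x):
    """X_f = (1/sqrt(n)) * sum_t x_t * e^(-j*2*pi*t*f/n), the unitary DFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * t * f / n) for t in range(n))
            / math.sqrt(n) for f in range(n)]

x = [1.0, 3.0, 2.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0]
X, Y = dft(x), dft(y)

# Parseval: Euclidean distance is preserved in the frequency domain.
time_d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
freq_d = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(X, Y)))
print(round(time_d, 6) == round(freq_d, 6))  # True

# Keeping only the first few coefficients lower-bounds the true distance.
lb = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(X[:2], Y[:2])))
print(lb <= time_d)  # True
```

Dropping coefficients can only remove non-negative terms from the sum, which is exactly why the truncated representation never causes false dismissals.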

(Figures: examples of DFT-based approximation of a time series [6].)


Discrete Wavelet Transform

๐œ“๐‘—,๐‘˜ = 2๐‘—2 ๐œ“ 2๐‘—๐‘ก โˆ’ ๐‘˜

Used with Haar wavelets as the basis function. Applicable only to time series whose length is a power of 2.

Its lower bound is tighter than DFT's.
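A sketch of the full Haar decomposition in the orthonormal (norm-preserving) form, so distances in the wavelet domain behave like Euclidean distances (assumption: input length is a power of 2, as the slide notes):

```python
import math

def haar_dwt(x):
    """Full Haar decomposition via repeated averaging/differencing,
    scaled by 1/sqrt(2) at each level so the transform is orthonormal."""
    out = list(x)
    n = len(out)
    while n > 1:
        half = n // 2
        nxt = out[:]
        for i in range(half):
            a, b = out[2 * i], out[2 * i + 1]
            nxt[i] = (a + b) / math.sqrt(2)         # approximation coefficient
            nxt[half + i] = (a - b) / math.sqrt(2)  # detail coefficient
        out[:n] = nxt[:n]
        n = half
    return out

x = [4.0, 2.0, 6.0, 8.0]
w = haar_dwt(x)
# Orthonormal transform: energy (sum of squares) is unchanged.
print(round(sum(v * v for v in x), 6) == round(sum(v * v for v in w), 6))  # True
```

Truncating the trailing detail coefficients then gives a lower-bounding reduced representation, just as with DFT.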

(Figure: DWT approximation of a time series using the Haar wavelet as the basis function [6].)


Piecewise Aggregate Approximation

๐‘ฅ๐‘– =๐‘

๐‘›

๐‘—=๐‘›๐‘ ๐‘–โˆ’1 +1

๐‘›๐‘ ๐‘–

๐‘ฅ๐‘—

Reconstruction quality and estimated distance in index space are the same as for DWT with the Haar wavelet, but with no restriction on length.

[6]


Piecewise Aggregate Approximation

๐ท ๐‘‹, ๐‘Œ =๐‘›

๐‘ ๐‘–=1

๐‘

๐‘ฅ๐‘– โˆ’ ๐‘ฆ๐‘–2

A lower bound on the Euclidean distance in the PAA space.

N = actual number of pointsn = number of PAA segments
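Both the transform and its lower bound can be sketched in a few lines (assumption: N divides the series length evenly; names are mine):

```python
import math

def paa(x, N):
    """PAA: mean of each of N equal-length frames."""
    w = len(x) // N
    return [sum(x[i * w:(i + 1) * w]) / w for i in range(N)]

def paa_lb(xb, yb, n):
    """sqrt(n/N) * Euclidean distance on the PAA means:
    a lower bound on the true Euclidean distance."""
    N = len(xb)
    return math.sqrt(n / N) * math.sqrt(sum((a - b) ** 2 for a, b in zip(xb, yb)))

x = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]
y = [2.0, 2.0, 6.0, 6.0, 3.0, 3.0, 9.0, 9.0]
true = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
lb = paa_lb(paa(x, 4), paa(y, 4), len(x))
print(lb <= true)  # True: the PAA distance never overestimates
```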


Adaptive Piecewise Constant Approximation

Data adaptive. Shorter segments for areas of high activity.

An extension of PAA.

X = < (xbar_1, r_1), (xbar_2, r_2), ..., (xbar_M, r_M) >

xbar_i = mean(x_{r_{i-1}+1}, ..., x_{r_i})

[7]


Adaptive Piecewise Constant Approximation

๐ท ๐ถ, ๐‘„ = ๐‘–=1

๐‘€

๐‘๐‘Ÿ๐‘– โˆ’ ๐‘๐‘Ÿ๐‘–โˆ’1 ๐‘ž๐‘ฅ๐‘– โˆ’ ๐‘๐‘ฅ๐‘–2

M = number of APCA segments

A lower bound on the Euclidean distance in the APCA space.


Extended APCA

S = < (mu_1, sigma_1, r_1), ..., (mu_m, sigma_m, r_m) >

Also stores variance along with mean for the segments.

As a result, it gives both a lower and upper bound on the Euclidean distance.

Formulas are very ugly!


Symbolic Aggregate Approximation

Based on PAA. Static alphabet size.

"Desirable to have a discretization technique that produces symbols with equal probability."

Can leverage run-length-encoding compression.

(Figure: table of breakpoints for different alphabet sizes [9].)
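The pipeline (z-normalize, PAA, then map each mean to its symbol) can be sketched as follows. The breakpoints used here are the quartiles of the standard normal distribution, which split N(0,1) into 4 equiprobable regions for a 4-letter alphabet:

```python
import math

def znorm(x):
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd for v in x]

def paa(x, N):
    w = len(x) // N
    return [sum(x[i * w:(i + 1) * w]) / w for i in range(N)]

# Breakpoints dividing N(0,1) into 4 equiprobable regions.
BREAKPOINTS_4 = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax(x, N):
    """SAX word: z-normalize, PAA, then map each segment mean to the
    symbol of the region it falls into."""
    word = ""
    for m in paa(znorm(x), N):
        idx = sum(1 for b in BREAKPOINTS_4 if m > b)
        word += ALPHABET[idx]
    return word

print(sax([1.0, 1.0, 2.0, 2.0, 9.0, 9.0, 2.0, 2.0], 4))  # "abdb"
```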


Supports a distance measure that lower bounds the Euclidean distance.

๐ท๐‘†๐ด๐‘‹ โ‰ค ๐ท๐‘ƒ๐ด๐ด โ‰ค ๐ท๐‘ก๐‘Ÿ๐‘ข๐‘’

Can be calculated in a streaming fashion.

[9]


Indexable SAX

SAX symbols a, b, c, d correspond to iSAX bit strings 00, 01, 10, 11.

Promoting cardinality: the cardinality-2 symbol 0 covers {00, 01}, and 1 covers {10, 11}.

Fixed number of segments; dynamic alphabet size.

iSAX notation: iSAX(T, segment count, alphabet size), e.g. iSAX(T, 4, 8).


Comparing iSAX words with different alphabet sizes:

iSAX(A, 4, 8) = { 110, 110, 011, 000 }
iSAX(B, 4, 2) = { 0, 0, 1, 1 }

Replace B's first symbol 0 with whichever of { 000, 001, 010, 011 } is closest to 110, and similarly for the remaining segments:

{ 011, 011, 100, 100 } != iSAX(B, 4, 8)

This is just a lower bound estimate; we cannot undo lossy compression.
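The promotion step in that example can be sketched as bit arithmetic (names are mine; every valid promotion of a symbol shares it as a prefix):

```python
def promote(symbol, low_bits, high_bits):
    """All promotions of an iSAX symbol from cardinality 2^low_bits to
    2^high_bits, obtained by appending bits to the prefix."""
    extra = high_bits - low_bits
    base = int(symbol, 2) << extra
    return [format(base + k, "0{}b".format(high_bits)) for k in range(1 << extra)]

def closest_match(symbol, target):
    """Among all promotions of `symbol`, pick the one whose value is
    closest to the higher-cardinality `target` symbol."""
    cands = promote(symbol, len(symbol), len(target))
    return min(cands, key=lambda c: abs(int(c, 2) - int(target, 2)))

A = ["110", "110", "011", "000"]   # iSAX(A, 4, 8)
B = ["0", "0", "1", "1"]           # iSAX(B, 4, 2)
promoted = [closest_match(b, a) for b, a in zip(B, A)]
print(promoted)  # ['011', '011', '100', '100'] -- a lower bound estimate, not iSAX(B, 4, 8)
```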


Symbolic Fourier Approximation

Based on DFT. Uses MCB (Multiple Coefficient Binning) discretization.

SAX assumes a common distribution for all the coefficients of the reduced representation.

MCB builds a histogram for each coefficient and bins each coefficient separately.

Tighter lower bound than iSAX.

[12]


Every SFA symbol carries some global information, since it is based on DFT, so it cannot be computed in a streaming fashion.

Unlike iSAX: fixed alphabet size, dynamic segment count. The quality of the representation improves with the segment count.

Pruning Power

P = (number of data points examined to answer the query) / (total number of data points in the database)

Intuitively captures the quality of a representation; free from implementation bias. Lower is better.

Measured on a random walk dataset with query lengths 256 to 1024 and representation dimensionality 16 to 64 [7].

Tightness of Lower Bound

T = (lower bounding distance) / (true distance)

Higher is better.

Measured for various time series representations on the Koski ECG dataset [10].

R/R* Trees

R trees are multi-dimensional index structures. Close objects are enclosed in an MBR (Minimum Bounding Rectangle): individual objects sit at the leaves, and intermediate nodes are MBRs enclosing other MBRs or objects.

Used for indexing time series after dimensionality reduction with DFT, PAA, APCA, and other numeric representations.

R* trees improve on R trees by minimizing the overlap between MBRs, which makes them better suited to range queries as well as point queries.

iSAX Tree

Based on the dynamic alphabet size of the iSAX representation.

Given a segment count d, the root node has 2^d children.

When a leaf node overflows, it is converted to an intermediate node: a segment is selected and its cardinality is increased to produce two child leaf nodes that hold the iSAX representations of the time series.

iSAX 2.0 improves on iSAX by choosing the split segment based on the distribution of the time series, so that splits are balanced.

SFA Trie

Based on the SFA representation. Time series with a common SFA prefix lie in a common subtree.

SFA is computed for a larger number of Fourier coefficients, but not all of them are used in the index. Hence, small index size.

Example:

SFA( T1 ) = abaacde | SFA( T2 ) = abbadef

SFA( T1 ) = abaacde | SFA( T2 ) = abbadef | SFA( T3 ) = abaagef
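The prefix grouping in that example can be sketched as a toy trie (a simplification: a real SFA trie splits nodes lazily and keeps only as much of the prefix as is needed to separate entries):

```python
class SFATrie:
    """Toy prefix trie over SFA words: series sharing an SFA prefix
    end up in a common subtree."""
    def __init__(self):
        self.children = {}
        self.entries = []   # series ids stored at this node

    def insert(self, word, series_id, depth=0, max_depth=4):
        if depth == max_depth:
            self.entries.append(series_id)
            return
        child = self.children.setdefault(word[depth], SFATrie())
        child.insert(word, series_id, depth + 1, max_depth)

    def lookup_prefix(self, prefix):
        """Collect all series ids under a given SFA prefix."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out, stack = list(node.entries), list(node.children.values())
        while stack:
            n = stack.pop()
            out.extend(n.entries)
            stack.extend(n.children.values())
        return out

trie = SFATrie()
trie.insert("abaacde", "T1")
trie.insert("abbadef", "T2")
trie.insert("abaagef", "T3")
print(sorted(trie.lookup_prefix("aba")))  # ['T1', 'T3'] share the prefix "aba"
```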

DS Tree (Dynamic Segmentation Tree)

Based on the Extended APCA reduction method.

Intermediate nodes store (mu_i^min, mu_i^max, sigma_i^min, sigma_i^max) for all segments i = 1 to m, along with the splitting strategy chosen during the split.



Splitting strategies are of two types: horizontal (using mean and variance) and vertical (segment splitting).

The splitting strategy is chosen based on the value of a quality measure; the one with the maximum value is selected.

๐‘„ =

๐‘–=1

๐‘š

๐‘Ÿ๐‘– โˆ’ ๐‘Ÿ๐‘–โˆ’1 ๐œ‡๐‘–๐‘š๐‘Ž๐‘ฅ โˆ’ ๐œ‡๐‘–

๐‘š๐‘–๐‘› 2 + ๐œŽ๐‘–๐‘š๐‘Ž๐‘ฅ2

๐‘†๐‘๐‘™๐‘–๐‘ก๐‘ก๐‘–๐‘›๐‘” ๐ต๐‘’๐‘›๐‘’๐‘“๐‘–๐‘ก = ๐‘„๐‘๐‘Ž๐‘Ÿ๐‘’๐‘›๐‘ก โˆ’ ๐‘„๐‘™ + ๐‘„๐‘Ÿ 2


Apart from similarity search as supported by other indices, it also allows distance-histogram computation for a given query.

E.g., given a query Q, the list L = [ ([10, 20], 10), ([15, 30], 15), ([40, 50], 2) ] means there are 3 leaf nodes N1, N2, and N3; N1 contains 10 time series whose distance from Q lies in [10, 20], and similarly for N2 and N3.

This is possible due to the lower and upper bounds provided by eAPCA.


ADS Index (Adaptive Data Series Index)

Based on the iSAX representation. Delays the construction of leaf nodes to query time.

Leaf nodes contain only the iSAX representations; the actual data series remain on disk. Even during splits, only the iSAX representations are shuffled.

Trade-off: a small leaf size requires splits that cost disk I/O time, whereas a big leaf size increases query time for the linear scan.

So ADS+ uses an adaptive leaf size: a bigger build-time leaf size and a much smaller query-time leaf size.

[1] Agrawal, R., Faloutsos, C., Swami, A. (1993). "Efficient Similarity Search in Sequence Databases". Proceedings of the 4th Conference on Foundations of Data Organization and Algorithms.

[2] Guttman, A. (1984). "R-trees: A Dynamic Index Structure for Spatial Searching". Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data.

[3] Yi, B.K., Faloutsos, C. (2000). "Fast Time Sequence Indexing for Arbitrary Lp-Norms". Proceedings of the 26th International Conference on Very Large Data Bases.

[4] Keogh, E. (2002). "Exact Indexing of Dynamic Time Warping". Proceedings of the 28th International Conference on Very Large Data Bases.

[5] Perng, C., Wang, H., Zhang, S.R., Parker, D.S. (2000). "Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases". Proceedings of the 16th International Conference on Data Engineering.

[6] Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S. (2000). "Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases". Knowledge and Information Systems.

[7] Keogh, E., Chakrabarti, K., Mehrotra, S., Pazzani, M. (2002). "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases". ACM Transactions on Database Systems.

[8] Wang, Y., Wang, P., Pei, J., Wang, W., Huang, S. (2013). "A Data-adaptive and Dynamic Segmentation Index for Whole Matching on Time Series". Proceedings of the VLDB Endowment.

[9] Lin, J., Keogh, E., Lonardi, S., Chiu, B. (2003). "A Symbolic Representation of Time Series, with Implications for Streaming Algorithms". Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[10] Shieh, J., Keogh, E. (2008). "iSAX: Indexing and Mining Terabyte Sized Time Series". Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[11] Camerra, A., Palpanas, T., Shieh, J., Keogh, E. (2010). "iSAX 2.0: Indexing and Mining One Billion Time Series". Proceedings of the IEEE International Conference on Data Mining.

[12] Schäfer, P., Högqvist, M. (2012). "SFA: A Symbolic Fourier Approximation and Index for Similarity Search in High Dimensional Datasets". Proceedings of the 15th International Conference on Extending Database Technology.

[13] Beckmann, N., Kriegel, H., Schneider, R., Seeger, B. (1990). "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles". Proceedings of the ACM SIGMOD International Conference on Management of Data.

[14] Zoumpatianos, K., Idreos, S., Palpanas, T. (2014). "Indexing for Interactive Exploration of Big Data Series". Proceedings of the ACM SIGMOD International Conference on Data Management.

[15] Wu, Y.L., Agrawal, D., Abbadi, A.E. (2000). "A Comparison of DFT and DWT Based Similarity Search in Time-Series Databases". Proceedings of the Ninth International Conference on Information and Knowledge Management.

[16] Esling, P., Agon, C. (2012). "Time-Series Data Mining". ACM Computing Surveys (CSUR).