Learning Time-Series Shapelets - Semantic Scholar

Learning Time-Series Shapelets


Josif Grabocka, Nicolas Schilling, Martin Wistuba and LarsSchmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)University of Hildesheim, Germany

SIGKDD 2014, 26/08/2014

Grabocka et al., ISMLL, University of Hildesheim, Germany

SIGKDD 2014, 26/08/2014 1 / 1


What are Time-Series Shapelets? (I)

I Definition:

I Patterns whose minimum distances to time-series yielddiscriminative predictors [Ye and Keogh(2009),Lines et al.(2012)Lines, Davis, Hills, and Bagnall]

I Problem:

1. Learn K discriminative shapelets of length L (denoted as S ∈ RK×L).2. From a dataset that has I time-series instances of length M (denoted

as T ∈ RI×M), where each series is divided into J = M − L slidingwindow segments.


SIGKDD 2014, 26/08/2014 1 / 1


What are Time-Series Shapelets? (II)

The minimum distances M ∈ RI×K between shapelets and time-series:

Mi ,k = minj=1,...,J

1

L

L∑l=1

(Ti ,j+l−1 − Sk,l)2 (1)

... yield discriminative predictors:

0 30−1

0

1

S1

0 30−2

0

2

S2

0 100 200−2

0

2

T1

0 100 200−2

0

2

T2

0 100 200−2

0

2

T3

0 100 200−2

0

2

T4

0.1 0.14 0.18

0.03

0.06

0.09

||T∗ − S1||2

||T∗−S2||2

Shapelet−transformed data

T4

T3

T2

T1

Figure 1: Left: Two shapelets, Middle: Closest Segments, Right:’Shapelet-Transform’ed data


SIGKDD 2014, 26/08/2014 2 / 1


Related Shapelets Work

All the possible segments (exhaustively) of all time-series candidates arepotential shapelet candidates:

I Compute the prediction accuracy of minimum distances of eachcandidate and then rank the top-K by prediction accuracy (featureranking)

I Build a decision tree from the top-K features [Ye and Keogh(2009),Lines et al.(2012)Lines, Davis, Hills, and Bagnall]

I Alternatively, use the new K-dimensional feature representation anduse standard classifiers[Hills et al.(2013)Hills, Lines, Baranauskas, Mapp, and Bagnall].

Exhaustive search is expensive, speed-ups were proposed[Mueen et al.(2011)Mueen, Keogh, and Young,Rakthanmanon and Keogh(2013)].


SIGKDD 2014, 26/08/2014 3 / 1


Proposed Method (I)

A linear model of predictors M ∈ RI×K and weights W ∈ RK ,W0 ∈ R:can be used to estimate the target Y ∈ RI :

Yi = W0 +K∑

k=1

Mi ,kWk ∀i ∈ {1, . . . , I} (2)

The risk of estimating the true target Y ∈ {−1,+1}I from approximatedtarget Y ∈ RI is the logistic loss L(Y , Y ) ∈ RI :

L(Yi , Yi ) = −Yi lnσ(Yi )− (1− Yi ) ln(

1− σ(Yi )), ∀i ∈ {1, . . . , I} (3)


SIGKDD 2014, 26/08/2014 4 / 1


Proposed Method (II)

The objective function F ∈ R is a regularized loss function:

argminS,W

F(S ,W ) = argminS ,W

I∑i=1

L(Yi , Yi ) + λW ||W ||2 (4)

The objective function F can be decomposed into per-instance objectivesFi :

Fi = L(Yi , Yi ) +λWI

K∑k=1

Wk2, ∀i ∈ {1, . . . , I} (5)

Mission of This Paper: Learn S ,W that minimize F .


SIGKDD 2014, 26/08/2014 5 / 1


Differentiable Minimum Function (I)

Approximate the true minimum M with the soft-minimum version M:

Mi ,k ≈ Mi ,k =

∑Jj=1 Di ,k,j e

αDi,k,j∑Jj ′=1 e

αDi,k,j′,

α ∈ (−∞, 0] ∀i ∈ {1, . . . , I},∀k ∈ {1, . . . ,K} (6)

Di ,k,j :=1

L

L∑l=1

(Ti ,j+l−1 − Sk,l)2 ,

∀i ∈ {1, . . . , I},∀k ∈ {1, . . . ,K},∀j ∈ {1, . . . , J} (7)


SIGKDD 2014, 26/08/2014 6 / 1


Differentiable Minimum Function (II)The smooth approximation of the minimum function, allows only theminimum segment to contribute for α→ −∞.

50 100 150 200 250 300 350−5

−2.50

52.5

Series Shapelet

50 100 150 200 250 300 3500

0.5

1

Euclidean

50 100 150 200 250 300 3500246

x 10−3

Soft−min, alpha=−20

0 50 100 150 200 250 300 3500

1

2x 10

−3

Soft−min, alpha=−100

Figure 2: Illustration of the soft minimum between a shapelet (green) and all thesegments of a series (black) from the FaceFour dataset


SIGKDD 2014, 26/08/2014 7 / 1


Learning Algorithm (I)The partial derivative of the per-instance objective function Fi withrespect to the l-th point of the k-th shapelet Sk,l is computed using thechain rule of derivation:

∂Fi

∂Sk,l=

∂L(Yi , Yi )

∂Yi

∂Yi

∂Mi ,k

J∑j=1

∂Mi ,k

∂Di ,k,j

∂Di ,k,j

∂Sk,l(8)

∂Fi

∂Wk=

∂L(Yi , Yi )

∂Yi

∂Yi

∂Wk

+∂Reg(W )

∂Wk,

∂Fi

∂W0=∂L(Yi , Yi )

∂Yi

(9)

All the components of the partial derivative are computable as follows:

∂L(Yi , Yi )

∂Yi

= −(Yi − σ

(Yi

)),

∂Yi

∂Mi,k

= Wk ,∂Yi

∂Wk

= Mi,k ,∂Reg(W )

∂Wk=

2λW

IWk (10)

∂Mi,k

∂Di,k,j=

eαDi,k,j

(1 + α

(Di,k,j − Mi,k

))∑J

j′=1 eαDi,k,j′

,∂Di,k,j

∂Sk,l=

2

L

(Sk,l − Ti,j+l−1

)(11)


SIGKDD 2014, 26/08/2014 8 / 1


Learning Algorithm (II)

Require: T ∈ RI×Q , Number of Shapelets K , Length of a shapelet L, Regularization λW ,Learning Rate η, Number of iterations: maxIter

Ensure: Shapelets S ∈ RK×L, Classification weights W ∈ RK , Bias W0 ∈ R1: for iteration=1, . . . ,maxIter do2: for i = 1, . . . , I do3: W0 ←W0 − η ∂Fi

∂W0

4: for k = 1, . . . ,K do5: Wk ←Wk − η ∂Fi

∂Wk

6: for L = 1, . . . , L do7: Sk,l ← Sk,l − η ∂Fi

∂Sk,l

8: return S ,W ,W0


SIGKDD 2014, 26/08/2014 9 / 1


Illustration of Learning

0.2 0.4 0.6 0.80

0.2

0.4

0.6

M1

M2

a) Iteration 0

Y=0Y=1W

5 15 25 35

−0.3

0

0.3

S1 (It. 0)

Time5 15 25 35

−1.5

0

1.5

S2 (It. 0)

Time

0.2 0.4 0.6 0.80.1

0.3

0.5

0.7

M1

M2

b) Iteration 400

Y=0Y=1W

5 15 25 35

−2

−0.5

1

S1 (It. 400)

Time5 15 25 35

−2

0

2

S2 (It. 400)

Time

0.2 0.53 0.87 1.20.2

0.4

0.6

0.8

M1

M2

c) Iteration 800

Y=0Y=1W

5 15 25 35

−2

−0.5

1

S1 (It. 800)

Time5 15 25 35

−2

0

2

S2 (It. 800)

Time

Figure 3: Learning Two Shapelets on the Gun-Point Dataset,(L = 40, η = 0.01, λW = 0.01, α = −100 )


SIGKDD 2014, 26/08/2014 10 / 1


Advantage over Feature Ranking

1. Discover hidden/latent shapelets

2. Interactions of Variables

0 1 2 3 4 5

0

M1

0 1 2 3 4 5

0

M2 0 1 2 3 4 5

0

1

2

3

4

5

6

M1

M2

Pos. ClassNeg. Class

Figure 4: Interactions among Shapelets Enable Individually UnsuccessfulShapelets (left plots) to Excel in Cooperation (right plot)


SIGKDD 2014, 26/08/2014 11 / 1


Experimental Setup

Our method, denoted LS, is compared against 13 baselines using 28time-series dataset:

I Baselines: Quality Criteria: Information gain (IG), Kruskall-Wallis(KW), F-Stats (FT), Mood’s Median Criterion (MM); Usingshapelet-transformed data: Nearest Neighbors (1NN), Naive Bayes(NB), C4.5 tree (C4), Bayesian Networks (BN), Random Forest (RA),Rotation Forest (RO), Support Vector Machines (SV); OtherRelated: Fast Shapelets (FS), Dynamic Time Warping (DT).

I Datasets: 28 time-series datasets from the UCR and UEA collectionsfrom diverse domains; Provided train/test splits

I Hyper-parameters: are found using grid-search, by testing over avalidation split from the training data


SIGKDD 2014, 26/08/2014 12 / 1


Results

IG KW FT MM DT C4 NN NB BN RF RO SV FS LS

Total Wins 0.0 0.0 1.2 0.3 0.0 0.0 1.5 0.0 1.6 0.2 0.0 4.7 1.3 17.3

LS Wins 28 27 26 26 28 28 23 27 23 26 24 20 26 -LS Draws 0 0 1 1 0 0 2 0 2 1 1 2 1 -LS Losses 0 1 1 1 0 0 3 1 3 1 3 6 1 -

Rank Mean 9.8 9.9 9.1 9.5 9.8 10.0 6.2 7.7 5.5 6.3 5.4 4.6 9.2 1.9Rank C.I. 1.0 1.3 1.3 1.3 1.9 0.8 1.2 1.1 1.2 0.7 0.9 1.2 1.5 0.5Rank S.D. 2.7 3.4 3.6 3.4 5.0 2.1 3.2 2.9 3.1 2.0 2.4 3.2 4.1 1.4

W.t. p-val. 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -

Table 1: Comparative Figures of Classification Accuracies over 28 Time-seriesDatasets


SIGKDD 2014, 26/08/2014 13 / 1


Conclusions

1. Learning shapelets to directly optimize the classification objectiveimproves classification accuracy

2. Supervised Shapelet Learning more accurate than ExhaustiveShapelet Discovery

I Shapelets are not restricted to series segmentsI Considers interactions among minimum distances features

3. Results against 13 baselines using 28 datasets validate the claim


SIGKDD 2014, 26/08/2014 14 / 1


References

J. Hills, J. Lines, E. Baranauskas, J. Mapp, and A. Bagnall.Classification of time series by shapelet transformation.Data Mining and Knowledge Discovery, 2013.

J. Lines, L. Davis, J. Hills, and A. Bagnall.A shapelet transform for time series classification.In Proceedings of the 18th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, 2012.

A. Mueen, E. Keogh, and N. Young.Logical-shapelets: an expressive primitive for time series classification.In Proceedings of the 17th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, 2011.

T. Rakthanmanon and E. Keogh.Fast shapelets: A scalable algorithm for discovering time series shapelets.Proceedings of the 13th SIAM International Conference on Data Mining, 2013.

L. Ye and E. Keogh.Time series shapelets: a new primitive for data mining.In Proceedings of the 15th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, 2009.


SIGKDD 2014, 26/08/2014 15 / 1


Back-up Slide: Dependence on Initialization

S1:15

S16

:30

a) Train Loss

−1 0 1−1

−0.5

0

0.5

1

20

25

30

35

40

S1:15

S16

:30

b) Train Error

−1 0 1−1

−0.5

0

0.5

1

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

Figure 5: Sensitivity of Shapelet Initialization, Gun-Point dataset, Parameters:L = 30, η = 0.01, λW = 0.01, Iterations= 3000, α = −100

I We used K-Means centroids as initial shapelets.


SIGKDD 2014, 26/08/2014 16 / 1

Learning Time-Series Shapelets - Semantic Scholar

Documents

Transcript of Learning Time-Series Shapelets - Semantic Scholar