Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data Thanawin...

Time Series Epenthesis:Clustering Time Series StreamsRequires Ignoring Some Data

Thanawin RakthanmanonEamonn KeoghStefano Lonardi

Scott Evans

Subsequence Clustering Problem

• Given a time series, individual subsequences are extracted with a sliding window.

• Main task is to cluster those subsequences.

Sliding window

All subsequences

Average subsequence

Subsequence Clustering ProblemSliding window

Keogh and Lin in ICDM 2003. Subsequence clustering is meaningless.

Centers of 3 clusters

All data also contains ..

Transitions (the connections between words)• Some transitions has good meaning and worth

to be discovered– The connection inside a group of words

• Some transitions has no meaning/structure– ASL: hand movement between two words– Speech: (un)expected sound like um.., ah.., er..– Motion Capture: unexpected movement– Hand Writing: size of space between words

How to Deal with them?Possible approaches are• Learn it!

– Separate noise/unexpected data from the dataset.• Use a very clean dataset

– dataset contains only atomic words.• Simple approach (our choice)

– Just ignore some data.– Hope that we will ignore unimportant data.

Concepts in Our Algorithm

Our clustering algorithm ..• is a hierarchical clustering• is parameter-lite

– approx. length of subsequence (size of sliding window)• ignores some data

– the algorithm considers only non-overlapped data• uses MDL-based distance, bitsave• terminates if ..

– no choice can save any bit (bitsave ≤ 0)– all data has been used

Minimum Description Length (MDL)

• The shortest code to output the data by Jorma Rissanen in 1978

• Intractable complexity (Kolmogorov complexity)

Basic concepts of MDL which we use:• The better choice uses the smaller number of

bits to represent the data– Compare between different operators– Compare between different lengths

0 50 100 150 200 250

A' denoted as

A' is A given H

A' = A|H = A-H

How to use Description Length?

If > , we will store A as A' and H DL(A) DL(A' ) +

DL(A) is the number of bits to store A

Clustering Algorithm

Current Clusters

Add to cluster

Create new cluster

Merge clusters

Create a cluster by 2 subsequences which are the most similar.

Add the closet sub-sequence to a cluster.

Merge 2 closet clusters.

What is the best choice? bitsave = DL(Before) - DL(After)

1) Create bitsave = DL(A) + DL(B) - DLC(C')- a new cluster C' from subsequences A and B

2) Add bitsave = DL(A) + DLC(C) - DLC(C')- a subsequence A to an existing cluster C- C' is the cluster C after including subsequence A.

3) Merge bitsave = DLC(C1) + DLC(C2) - DLC(C')

- cluster C1 and C2 merge to a new cluster C'.

The bigger save, the better choice.

Clustering Algorithm

Current Clusters

Add to cluster

Create new cluster

Merge clusters

Create a cluster by 2 subsequences which are the most similar.

Add the closet sub-sequence to a cluster.

Merge 2 closet clusters.

Bird Calls

0.5 1 1.5 2 2.5 3 x 10 50

Two interwoven calls from the Elf Owl, and Pied-billed Grebe.

A time series extracted by using MFCC technique.

Current Clusters

Create

Add Add Merge

Create Motif Discovery

Final Clusters

Add Nearest Nighbor

Bird Calls: Clustering Result

Step 1: Create

Step 2: Create

Step 3: Add

Step 4: Merge

SubsequencesCenter of cluster (or Hypothesis)

1 2 3 4

Step of the clustering process

Clustering stops here

CreateAddMerge

Poem The Bells

In a sort of Runic rhyme,To the throbbing of the bells--Of the bells, bells, bells,To the sobbing of the bells;Keeping time, time, time,As he knells, knells, knells,In a happy Runic rhyme,To the rolling of the bells,--Of the bells, bells, bells--To the tolling of the bells,Of the bells, bells, bells, bells,Bells, bells, bells,--To the moaning and the groan-ing of the bells.

Edgar Allen Poe1809-1849

(Wikipedia)

The Bells: Clustering Result

== Group by Clusters ==bells, bells, bells, Bells, bells, bells,

Of the bells, bells, bells,Of the bells, bells, bells—

To the throbbing of the bells--To the sobbing of the bells;To the tolling of the bells,

To the rolling of the bells,--To the moaning and the groan-

time, time, time,knells, knells, knells,

sort of Runic rhyme,groaning of the bells.

== Original Order ==In a sort of Runic rhyme,To the throbbing of the bells--Of the bells, bells, bells,To the sobbing of the bells;Keeping time, time, time,As he knells, knells, knells,In a happy Runic rhyme,To the rolling of the bells,--Of the bells, bells, bells--To the tolling of the bells,Of the bells, bells, bells, bells,Bells, bells, bells,--To the moaning and the groan-ing of the bells.

Summary• Clustering time series algorithm using MDL.• Some data must be ignored or not appeared in

any cluster.• MDL is used to ..

– select the best choice among different operators.– select the best choice among the different lengths.

• Final clusters can contain subsequences of different length.

• To speed up, Euclidean is used instead of MDL in core modules, e.g., motif discovery.

Thank you for your attention

QUESTION?

Supplementary

How to calculate DL?

A is a subsequence.• DL(A) = entropy(A)

– Similar result if use Shanon-Fano or Huffman coding.

H is a hypothesis, which can be any subsequence .*DL(A) = DL(H) + DL(A - H)

– Compression idea; never use – directly in algorithm

Cluster C contains subsequence A and B• DLC(C) = DL(center) + min(DL(A-center), DL(B-center))

Running Time

5000 10000 15000 20000 25000 300001000

Size of time series

Scalability16000

motif lengths = 350

500 1000 1500 20000

Cluster plotted Stacked, Dithered

Koshi-ECG time series

O(m3/s)

ED vs MDL in Random Walk

min maxmin

maxED vs MDL

ED calculated in original continuous spaceMDL calculated in discrete space (64 cardinality)

Discretization vs Accuracy

Deceasing Cardinality

Classification Accuracy

• Classification Accuracy of 18 data sets. • The reduction from original continuous space to different

discretization does not hurt much, at least in classification accuracy.

Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data Thanawin...

Documents

Transcript of Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data Thanawin...

Group 1 Thanawin T. (5413085) Jirawat I. (5413341) Than I. (5413354)

Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL Bing Hu Thanawin Rakthanmanon Yuan Hao Scott Evans 1 Stefano Lonardi.

Multithreaded FPGA Acceleration of DNA Sequence Mapping · Multithreaded FPGA Acceleration of DNA Sequence Mapping Edward B. Fernandez, Walid A. Najjar, Stefano Lonardi University

HARDMAN, Francisco Foot e LONARDI, Victor[1982. 2ªed ......1 aped og5e1uasaJde ap e10N 01nEd) oeö!pa seJ01nV sop 01J?uxns equeuu.0N edeo sanEupou even sapuewag 01001uv omen oap!A

Mining Historical Archives for Near- Duplicate Figures Thanawin Rakthanmanon, Qiang Zhu, and Eamonn J. Keogh.

OMGS: Optical Map-based Genome Scaffolding · OMGS: Optical Map-based Genome Sca olding Weihua Pan 1, Tao Jiang , and Stefano Lonardi 1 1Department of Computer Science and Engineering,

1 nd Semester 2007 1 Module1 Introduction to Computer and Programming Thanawin Rakthanmanon Email: fengtwr@ku.ac.th Create by: Aphirak Jansang Computer.

CURRICULUM VITAE OF DR. SARA LONARDI PERSONAL DATAioveneto.it/wp-content/uploads/2017/06/Lonardi_extendedCV_english.pdf · CURRICULUM VITAE OF DR. SARA LONARDI PERSONAL DATA ... mitomicina

Human computation Gesture CAPTCHA Jaehoon Kim Committees : Eamonn Keogh, Stefano Lonardi.

SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate.

1 nd Semester 2007 1 Module7 Arrays Thanawin Rakthanmanon Email: fengtwr@ku.ac.th Create by: Aphirak Jansang Computer Engineering Department Kasetsart.

A compression-boosting transform for 2D data Qiaofeng Yang Stefano Lonardi University of California, Riverside.

Classificazione spettrale e calcolo delle distanze di stelle con riga Hα in emissione Polo di Verona Lonardi Fabio (1) Piccoli Michele (1) Manzati Leonardo.

Deconvoluting BAC-gene Relationships Using a Physical Map Y. Wu 1, L. Liu 1, T. Close 2, S. Lonardi 1 1 Department of Computer Science & Engineering 2.

1 nd Semester 2007 1 Module3 Selection Statement Thanawin Rakthanmanon Email: fengtwr@ku.ac.th Create by: Aphirak Jansang Computer Engineering Department.

1 Dot Plots For Time Series Analysis Dragomir Yankov, Eamonn Keogh, Stefano Lonardi Dept. of Computer Science & Eng. University of California Riverside.

1 nd Semester 2007 1 Module2 C# Basic Concept Thanawin Rakthanmanon Computer Engineering Department Kasetsart University, Bangkok.

UNIVERSITY OF CALIFORNIA RIVERSIDEUniversity of California, Riverside, September 2006 Dr. Stefano Lonardi, Dr. Tao Jiang, Co-Chairpersons Repetitive patterns, or ”repeats” for