Clustering Seasonality Patterns in the Presence of Errors

Advisor: Dr. Hsu
Graduate: You-Cheng Chen
Authors: Mahesh Kumar, Nitin R. Patel, Jonathan Woo

Outline

Motivation
Objective
Introduction
Seasonality Estimation
Distance Function
Experimental Results
Conclusions
Personal Opinion

Motivation

Most traditional clustering algorithms assume that the data is provided without measurement error

Objective

To present a clustering method that incorporates information contained in these error estimates and a new distance function that is based on the distribution of errors in data

Introduction

Defining a good distance or dissimilarity function is a critical step in any distance-based clustering method.

Problem: Most traditional clustering methods assume that the data contain no errors, but errors are natural in any data measurement.

Example: A sample average, which is only an estimate of the true mean and carries a standard error.

Introduction

This study and results are focused on time-series clustering in the retail industry

This study assumes that each point comes from a multidimensional Gaussian distribution.

Seasonality Estimation (1/4)

Seasonality is defined as the normalized underlying demand of a group of similar merchandise as a function of the time of year, after taking into account other factors that impact sales, such as discounts, inventory, promotions, and random effects.

$$\mathrm{Sale}_{it} = f_I(I_{it}) \cdot f_P(P_{it}) \cdot f_Q(Q_{it}) \cdot f_R(R_{it}) \cdot \mathrm{PLC}_i(t - t_i^0) \cdot \mathrm{Seas}_{it} \tag{1}$$

After the effects of all these nonseasonal factors are removed from (1):

$$\mathrm{Sale}_{it} = \mathrm{PLC}_i(t - t_i^0) \cdot \mathrm{Seas}_{it}$$


Let S be a set of items that follow a similar seasonality. The items in S have a variety of PLCs that differ in shape and duration.

Theorem 1:


If we take the average of weekly sales of all items in S then it would nullify the effect of PLCs as suggested by the following equations.
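A sketch of why averaging removes the PLC effect, under two assumptions that are not stated on the slide: every item in S shares the same seasonality, so that $\mathrm{Seas}_{it} \approx \mathrm{Seas}_t$, and the PLCs in S are varied enough that their average is roughly a constant across weeks. The constant $c$ is a symbol introduced here for illustration only.

```latex
\begin{align*}
\mathrm{Sale}_t
  &= \frac{1}{|S|} \sum_{i \in S} \mathrm{Sale}_{it}
   = \frac{1}{|S|} \sum_{i \in S} \mathrm{PLC}_i(t - t_i^0)\, \mathrm{Seas}_{it} \\
  &\approx \mathrm{Seas}_t \cdot \frac{1}{|S|} \sum_{i \in S} \mathrm{PLC}_i(t - t_i^0)
   \approx c \cdot \mathrm{Seas}_t
\end{align*}
```

Under these assumptions the weekly average is proportional to the seasonality, so only a scaling step remains.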


Seasonality values, $\mathrm{Seas}_t$, can be estimated by appropriate scaling of the weekly sales average, $\mathrm{Sale}_t$.

The above procedure provides us with a large number of seasonal patterns, one for each set S, along with estimates of associated errors.
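The estimation procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that nonseasonal effects have already been removed, that "appropriate scaling" means dividing by the overall mean so the pattern averages to 1, and that the error estimate is the standard error of the weekly average across items. The function name `estimate_seasonality` and the sample numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def estimate_seasonality(sales_by_item):
    """Estimate a seasonality pattern and its errors for one set S.

    sales_by_item: list of per-item weekly sales lists, all of length T.
    Returns (seas, sigma), two lists of length T.
    """
    n_items = len(sales_by_item)
    # Regroup the data so that each tuple holds all item sales for one week.
    weeks = list(zip(*sales_by_item))
    weekly_avg = [mean(week) for week in weeks]
    # Assumed error estimate: standard error of the weekly average.
    weekly_se = [stdev(week) / sqrt(n_items) for week in weeks]
    # Assumed scaling: divide by the overall mean so the pattern averages to 1.
    # The uncertainty of the scale factor itself is ignored in this sketch.
    scale = mean(weekly_avg)
    seas = [avg / scale for avg in weekly_avg]
    sigma = [se / scale for se in weekly_se]
    return seas, sigma


# Hypothetical sales for three items over four weeks.
sales = [
    [10.0, 20.0, 30.0, 20.0],
    [12.0, 18.0, 33.0, 17.0],
    [8.0, 22.0, 27.0, 23.0],
]
seas, sigma = estimate_seasonality(sales)
print(seas)
print(sigma)
```

Each set S yields one list of (value, error) pairs of this kind, which is the input the distance function works on.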

Distance Function (1/4)

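Consider two seasonalities:

$$A_i = \{(x_{i1}, \sigma_{i1}), (x_{i2}, \sigma_{i2}), \ldots, (x_{iT}, \sigma_{iT})\}$$

$$A_j = \{(x_{j1}, \sigma_{j1}), (x_{j2}, \sigma_{j2}), \ldots, (x_{jT}, \sigma_{jT})\}$$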

We define the similarity between two seasonalities as follows: under the null hypothesis $H_0: A_i \sim A_j$, the similarity between $A_i$ and $A_j$ is the probability of accepting the hypothesis.

The distance $d_{ij}$ between $A_i$ and $A_j$ is defined as (1 - similarity), which is the probability of rejecting $H_0$.


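Consider the t-th samples of both seasonalities, $A_{it} = (x_{it}, \sigma_{it})$ and $A_{jt} = (x_{jt}, \sigma_{jt})$.

$$x_{it} - x_{jt} \sim N\left(\mu_{it} - \mu_{jt},\ (\sigma_{it}^2 + \sigma_{jt}^2)^{1/2}\right) \tag{1}$$

If $A_i \sim A_j$ then $\mu_{it} = \mu_{jt}$, and consequently the statistic

$$\frac{x_{it} - x_{jt}}{\sqrt{\sigma_{it}^2 + \sigma_{jt}^2}}$$

follows a t-distribution.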


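Finally, the distance is

$$d_{ij} = \chi^2_{T-1}\left(\sum_{t=1}^{T} \frac{(x_{it} - x_{jt})^2}{\sigma_{it}^2 + \sigma_{jt}^2}\right)$$

where $\chi^2_{T-1}(\cdot)$ denotes the chi-square cumulative distribution function with $T-1$ degrees of freedom.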
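The distance can be sketched in Python using only the standard library. This is a hedged illustration, not the authors' code. It assumes the distance is the chi-square cumulative distribution function evaluated at the sum of squared, error-scaled differences, with T - 1 degrees of freedom by default. The names `chi2_cdf` and `error_distance` are hypothetical, and the series used for the CDF is only adequate for moderate values of the statistic.

```python
from math import exp, lgamma, log


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower
    incomplete gamma function P(df/2, x/2). Fine for moderate x."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0
    total = 1.0
    n = 0
    # Sum z^n / ((a+1)(a+2)...(a+n)) until the terms become negligible.
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return min(1.0, exp(-z + a * log(z) - lgamma(a + 1.0)) * total)


def error_distance(a_i, a_j, df=None):
    """Distance between two seasonalities given as lists of (x, sigma)."""
    if len(a_i) != len(a_j):
        raise ValueError("both seasonalities must have the same length T")
    stat = sum((xi - xj) ** 2 / (si ** 2 + sj ** 2)
               for (xi, si), (xj, sj) in zip(a_i, a_j))
    if df is None:
        df = len(a_i) - 1  # assumed: T - 1 degrees of freedom
    return chi2_cdf(stat, df)


# Hypothetical patterns with T = 3 weeks.
a_i = [(1.0, 0.1), (1.2, 0.1), (0.8, 0.1)]
a_j = [(1.1, 0.1), (1.0, 0.1), (0.9, 0.1)]
a_j_noisy = [(1.1, 0.3), (1.0, 0.3), (0.9, 0.3)]

d_precise = error_distance(a_i, a_j)
d_noisy = error_distance(a_i, a_j_noisy)
print(d_precise, d_noisy)
```

The two comparisons use the same x values. Only the errors of the second pattern differ, and the larger errors give a smaller distance because the same gap is less convincing evidence of a real difference.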

Comparison with Euclidean Distance

$d_{ij}$ is monotonically increasing with respect to

$$\sum_{t=1}^{T} \frac{(x_{it} - x_{jt})^2}{\sigma_{it}^2 + \sigma_{jt}^2}$$


Comparison with Euclidean Distance

If all σ’s were the same and equal to σ, the rank order of $d_{ij}$ would become the rank order of (1),

$$\frac{1}{2\sigma^2} \sum_{t=1}^{T} (x_{it} - x_{jt})^2 \tag{1}$$

which is the same as the rank order of the Euclidean distance (2):

$$\sum_{t=1}^{T} (x_{it} - x_{jt})^2 \tag{2}$$
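A small numerical check of this comparison, as a sketch with made-up numbers. The helper names `error_scaled_sum`, `squared_euclidean`, and `pattern` are hypothetical. With equal errors the error-scaled sum is a constant multiple of the squared Euclidean sum, so both rank neighbors identically. With unequal errors the two rankings can disagree.

```python
def error_scaled_sum(a_i, a_j):
    """Sum over t of (x_it - x_jt)^2 / (sigma_it^2 + sigma_jt^2)."""
    return sum((xi - xj) ** 2 / (si ** 2 + sj ** 2)
               for (xi, si), (xj, sj) in zip(a_i, a_j))


def squared_euclidean(a_i, a_j):
    """Sum over t of (x_it - x_jt)^2, ignoring the errors."""
    return sum((xi - xj) ** 2 for (xi, _), (xj, _) in zip(a_i, a_j))


def pattern(xs, sigma):
    """Attach the same error sigma to every value in xs."""
    return [(x, sigma) for x in xs]


ref = pattern([1.0, 1.0], 0.1)
near = pattern([1.2, 1.0], 0.1)
far_equal = pattern([1.4, 1.0], 0.1)   # farther away, same error as ref
far_noisy = pattern([1.4, 1.0], 0.5)   # farther away, much larger error

# Equal errors: the ratio is 1 / (2 * sigma^2), so rank order is preserved.
ratio = error_scaled_sum(ref, near) / squared_euclidean(ref, near)
same_order = (error_scaled_sum(ref, near) < error_scaled_sum(ref, far_equal)) == (
    squared_euclidean(ref, near) < squared_euclidean(ref, far_equal))

# Unequal errors: the noisy pattern is farther in Euclidean terms,
# yet closer once its larger error is taken into account.
euclid_prefers_near = squared_euclidean(ref, near) < squared_euclidean(ref, far_noisy)
scaled_prefers_noisy = error_scaled_sum(ref, far_noisy) < error_scaled_sum(ref, near)
print(ratio, same_order, euclid_prefers_near, scaled_prefers_noisy)
```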

Clustering

Clustering Algorithm

Experimental Results (1/6)

Simulated Data

Figure 5: Individual (prior to clustering) seasonality estimates with associated errors

Experimental Results (2/6)

Figure 6: Seasonalities obtained by hError


Figure 7: Seasonalities obtained by kmeans and Ward’s method using Euclidean distances


Clustering Method    Average # misclassifications    Average Estimation Error
hError               0.87                            2.0182
Ward’s method        2.63                            4.7021
kmeans               2.94                            5.0337

Table 1: Average # misclassifications and Average Estimation Error for different clustering methods


$$\mathrm{ForecastError} = \frac{\sum_{t=1}^{T} \left|\mathrm{ActualSale}_t - \mathrm{ForecastSale}_t\right|}{\sum_{t=1}^{T} \mathrm{ActualSale}_t}$$
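As a worked illustration of the forecast error measure, the sketch below assumes it is the sum of absolute weekly deviations divided by total actual sales. The function name `forecast_error` and the sales figures are hypothetical.

```python
def forecast_error(actual, forecast):
    """Sum of |actual - forecast| over all weeks, divided by total actual sales."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must cover the same weeks")
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("total actual sales must be nonzero")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / total_actual


# Hypothetical weekly sales and forecasts.
actual = [100.0, 120.0, 80.0, 100.0]
forecast = [90.0, 130.0, 85.0, 95.0]
err = forecast_error(actual, forecast)
print(f"{100 * err:.1f}%")
```

Here the absolute deviations add up to 30 against total sales of 400, so the forecast error is 7.5%.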

Clustering Method    Average Forecast Error %
hError               18.7
Ward’s               23.9
Kmeans               24.2
No clustering        31.5

Table 2: Average Forecast Error (Retailer Data)

Conclusions

The distance function $d_{ij}$ is invariant under different scales of the data, and the clustering method based on it obtains better clusters than the other methods compared.

Personal Opinion

The concept of incorporating information about errors in the distance function is very good and can be used in many other clustering applications.
