
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2016

Decision Tree Classification of Products Using C5.0 and Prediction of Workload Using Time Series Analysis

JOHAN JANSSON

KTH SCHOOL OF ELECTRICAL ENGINEERING


Abstract

This thesis covers the analysis of manually classified information, with a focus on classification using decision trees and prediction using time series analysis. In the classification part of the thesis, an existing manual classification is evaluated and compared to the classification obtained with a decision tree approach.

In this thesis, the classes are comparable to each other, i.e. each class can be assigned a numerical value. Thus, the manual classification can be compared to the decision tree classification with respect to the distance from the true class. The results show that decision tree classifications tend to fall into neighboring classes, with some exceptions.

Using time series analysis, the daily rate of items arriving to a repair workshop is evaluated and predicted. The result shows that it is possible to find a predictor for the arrival rate of workload. This is performed by implementing a classical decomposition model to forecast a general trend and seasonal changes, and improving the predictions by fitting a linear dynamical model driven by a white noise process. An automated algorithm to update this model is implemented to minimize the maintenance of forecasting.


Sammanfattning

This thesis covers an analysis of manually classified data, with a focus on classification using decision trees and forecasting using time series analysis. In the part of the thesis dealing with classification, a set of previously, manually classified data is analyzed and compared with a new classification performed using a decision tree method.

The classes used in this thesis are comparable, meaning that each class can be assigned a numerical value. In this way, a comparison of the distance between the manual classification and the classification based on decision trees can be carried out. The results show that the classes found with the help of decision trees tend to end up close to the true class, with some exceptions.

Using time series analysis, the daily number of items arriving at a repair workshop is analyzed and forecast. The results show that it is possible to predict this number to some extent. This is done by implementing a classical decomposition model, which is then used to forecast a general trend and seasonality. The remaining information in the data is then predicted with the help of a linear dynamical model driven by white noise. An automated algorithm for updating this model was also implemented to minimize the maintenance of the forecasting.


Contents

1 Introduction
  1.1 Background
  1.2 Problem Description
  1.3 Previous Work
  1.4 Outline of the Report

2 Review of Classification
  2.1 Choice of Classifier
  2.2 Decision Trees
  2.3 The Fundamental Methods in C5.0
    2.3.1 Entropy
    2.3.2 Information Gain
    2.3.3 An Example of a Decision Tree, Implementing Information Gain
    2.3.4 Overfitting
    2.3.5 Boosting
    2.3.6 Boosting in C5.0

3 Review of Time Series Analysis
  3.1 Stationary Processes
    3.1.1 White Gaussian Noise
  3.2 An Example of a Time Series
  3.3 Model Decomposition
    3.3.1 Estimating a Trend
    3.3.2 Estimating a Seasonal Component
  3.4 Autoregressive Moving-Average (ARMA)
  3.5 Estimating an ARMA(p, q) Process
    3.5.1 Estimating Order of p and q
    3.5.2 Estimating Coefficients for the ARMA(1, 1) Process
    3.5.3 Prediction Error Method

4 Results: Classification
  4.1 Origin of Data
  4.2 Data Management
  4.3 Implementation Tools
  4.4 Decision Tree Building and Prediction
    4.4.1 Prediction Result
    4.4.2 Prediction of Unknown Classes

5 Results: Time Series Analysis
  5.1 Origin of Data
  5.2 Implementation Tools
  5.3 Decomposition & Forecasting

6 Conclusions
  6.1 Future Work


Chapter 1

Introduction

Computers (especially laptops) are often divided into categories based on visual appearance. One such category can consist of computers with a range of different technical specifications. This means that high-end computers in one category might have very similar technical properties to high-end computers in another category, which leads to a scenario where the visual aspect can be the foremost delimiting property.

In this thesis, we focus on solving the problem of how to successively divide a number of computers, from one manufacturer, into smaller groups based on their technical characteristics. The idea is to examine how specific models of computers are classified when compared to the technical specifications of all other computers from this manufacturer. Secondly, we address the issue of predicting the number of computers arriving at a repair workshop based on prior records of such events. Finding a model which enables the computer manufacturer to predict repair times can help with scheduling and thus lessen costs.

1.1 Background

The idea of dividing groups of information into smaller groups is increasingly important as databases become larger over time. For example, the information associated with a computer is of course not limited to only its processor speed and size. Therefore, other attributes of computers in a class may overlap or differ in other aspects than those named above.

Consider a computer manufacturer with a collection of computers divided into classes, e.g. low, mid and high price range. Each of these classes can be divided into sub-classes, where the class is split into several other levels of prices. These sub-classes may share many properties within the boundary of their parent class and may even share properties with other classes as well.


Hence, there may be overlap both in classes and sub-classes when considering the properties of the information. One such property is the screen size, which can be shared among computers in all different price ranges.

When introducing new computers into an already existing class (category of models), care should be taken to make sure that the perceived value of the computer is justified in consideration of the previously defined classes. For example, if newly introduced computers were irrationally placed with respect to the predefined structure, this could lead to confusion regarding the value of previously classified computers, which in turn could change the perception of the structure and add extra labor in manually re-structuring the classes.

The issue of properly classifying computers gives rise to the question of how well new information is placed into predefined structures, in comparison to a machine learning approach. Machine learning is concerned with using computational methods to gain knowledge of structures and formalizing them in a manner more comprehensible to the human mind [1, p. 2]. Using this approach, new situations can be assessed by applying the patterns found. The process of creating a decision tree is one manner of finding this structure.

Another challenging part of the market for computers belonging to such classes as mentioned above is to find patterns in the rate at which computers are sent to a repair workshop. As computers are owned by individual people, the rate at which they are sent to a repair shop is not trivial to forecast without the use of a sophisticated analysis.

When scheduling the labor for those working with repairs, a timeline of when to assess each computer is important. To deal with task scheduling, a systematic method must be implemented, where the tasks are completed in a predefined manner. However, scheduling of new tasks can be difficult, as they might be perceived as unpredictable. This happens when there is little or no knowledge about the underlying events generating the data. Thus, data that looks unpredictable at a first glance might contain well defined structures if the right methods are used in the analysis of the data.

If an accurate prediction of future tasks is available, then the uncertainties associated with task scheduling can be lessened. Manual prediction can be cumbersome when the degrees of freedom in such systems become large. Thus, finding an automated process to build a predictor only dependent on the information in the database is an attractive solution. This can be produced by systematically finding trends and seasonal behavior in the historical data using time series analysis.


1.2 Problem Description

In the first part of this thesis, we want to evaluate how well the man-made classification of a group of objects compares to a systematic classification based on machine learning techniques. For the sake of confidentiality, these objects will be referred to as computers. Within a selection of computers of different prices, and considering products with overlapping properties, classification will be made to assess how well the existing classification was performed. More specifically, a strong connection between certain properties in a class and the median price (or value) of that class is expected and will thus be analyzed.

In the second part of this thesis, we address the question of how to methodically forecast database entries with respect to a historical record of the data. The data corresponds to the daily number of computers received at the repair workshop. This data then needs to be re-structured to simplify the implementation of time series analysis. The proposed model is then analyzed with respect to the underlying process in order to evaluate its suitability as a candidate for predicting future entries.

1.3 Previous Work

The previous work undertaken on the data sets used in this thesis is mainly the collection and processing of the existing data to allow for storage of previous events. Classification and prediction have not been implemented in any broader sense, as the focus has mostly been on creating and maintaining several databases. The reason for storing information was simply to show the data to employees who need to know the current situation, rather than to serve as the basis for more sophisticated calculations on the data. However, there has been a wish to gain insight into how predictions could help the decision making to be more proactive. Such predictions could produce a better knowledge of both the past and the future, leading to an improved decision making strategy.

The content of this thesis relies partly on [2], in which decision trees are used to classify unstructured data. Based on the classified data, time series analysis is performed on each class. The ultimate goal in doing so is to enable class prediction for new products. The results in [2] show that the information retrieved from the combination of classification and prediction leads to useful conclusions.


1.4 Outline of the Report

This thesis is organized as follows. Chapters 2 and 3 provide theoretical foundations concerning classification and time series analysis. In Chapters 4 and 5, the implementation and results for classification and time series analysis are discussed. Finally, Chapter 6 concludes this document and presents ideas for future work on the subject.


Chapter 2

Review of Classification

This chapter presents a description of the theory and implementation of the classification methods used in this thesis.

The objective of classification is to gain a better understanding of how a large amount of data can be categorized. The objective of the implementation in this thesis is to examine how well an already classified set of data is structured, by iteratively removing one class and reintroducing it to a classifier unaware of this class.

2.1 Choice of Classifier

Many methods to find a logical structure in a dataset have been proposed. The point in finding this structure is to make mathematical rules delimiting different sets of observations according to the state of some property. Some methods are Decision Trees [3, pp. 96-97], k-Nearest Neighbors [3, p. 97], Naïve Bayes [3, p. 97] and Support Vector Machines [3, pp. 97-98], among others. In [3] several classification methods are described and evaluated. In summary, the results in [3] show that C4.5 [4], Decision Trees [5, p. 150] and Support Vector Machines [6, pp. 156-164] are the most reliable classifiers, in terms of accuracy, amongst the tested set of methods.

For classifying the dataset of computers used in this thesis, there are no strict restrictions on time consumption or space usage. The only main concern is being able to correctly classify the data, combined with the fact that the data contains non-orderable factors, e.g., color and keyboard layout. Note that the target class, value, is an orderable factor. The data contains 339 points over 10 predefined classes and 31 variables. Because the data set has a considerable number of variables, several widely used classification algorithms become cumbersome to implement with regard to making a clear definition of how each class is delimited.


For example, a graphical representation of the k-Nearest Neighbors classifier would require a 31-dimensional space.

Since the data consists of several classes, multi-class support is required. Another restriction on the choice of classifier is to allow for the use of both numerical and categorical predictors. This means that the classifier must be capable of comparing predictors of both types. Furthermore, a classifier with an easily interpretable output is favored. The interpretability considers the level of abstraction in terms of visualizing delimitations between classes. Properties of well known classification algorithms are stated in the following list [7]:

• Decision Trees

– Multi-class Support: Yes.

– Mixed Predictor Support (numeric/categorical): Yes.

– Interpretability level: Easy.

• k-Nearest Neighbor

– Multi-class Support: Yes.

– Mixed Predictor Support (numeric/categorical): No.

– Interpretability level: Hard.

• Naïve Bayes

– Multi-class Support: Yes.

– Mixed Predictor Support (numeric/categorical): Yes.

– Interpretability level: Easy.

• Support Vector Machines

– Multi-class Support: No.

– Mixed Predictor Support (numeric/categorical): Yes.

– Interpretability level: Hard (if the kernel used is not linear).

The previous list encourages the use of Decision Trees or Naïve Bayes. Since Naïve Bayes is slower and uses more memory than Decision Trees, the latter is a better choice for the problem of classifying the data in this thesis. We follow the decision tree implementation used in [2], which employs C4.5 [4]. Since a successor to C4.5 has been released, namely C5.0 [8, pp. 396-397], this successor will be used for classifying the data in this thesis.


2.2 Decision Trees

The concept of decision trees is widely used for understanding the underlying relationships in large collections of data, i.e., multiple observations of variables whose structure is difficult to determine. The main objective of a decision tree is to classify an observation by posing a sequence of questions about its properties, and thus identify the observation as belonging to a specific class. The structure of such a decision tree is built by a training algorithm.

Figure 2.1: A decision tree with four questions (circular symbols) and two potential outcomes (“Make dinner”, “Do nothing”).

Figure 2.1 shows an example of a very trivial question solved with a decision tree. All terminal nodes, depicted as shaded boxes in the figure, hold answers to the question “Should I make dinner?”. This is a so-called “two-class problem”, where there are two potential outcomes (“Make dinner” and “Do nothing”), depicted as shaded squares in the figure. The circular symbols in the figure are decision nodes. In these nodes there is a question to be decided upon.

Once a decision tree is built, it can be used to evaluate other samples with varying success depending on how well it models the dataset. The success rate depends on several aspects, such as the size of the data set used for tree creation, class-wise overlapping of variable observations, the choice of algorithm for building the tree and the use of additional methods to aid the tree creation.


2.3 The Fundamental Methods in C5.0

The contents of this section are largely based on [9].

2.3.1 Entropy

Definition 2.3.1 ([10, p. 57]) Let $S$ be a random variable with outcomes

$$s_i, \quad i \in \{1, \ldots, n\}, \qquad (2.1)$$

and probability mass function $p$. The quantity

$$H(S) := \sum_{i=1}^{n} -p(s_i)\,\log_2(p(s_i)), \qquad (2.2)$$

is defined as the entropy of $S$, where $\log_2(\cdot)$ is the base-2 logarithm. For the purpose of entropy calculations, $\log_2(0) := 0$.

The entropy function $H(S)$ in Definition 2.3.1 fulfills [11, p. 2]:

• $H(S)$ is continuous in $p(s_i)$, $i \in \{1, \ldots, n\}$.

• If $p(s_i) = 1/n$ for $i = 1, \ldots, n$, then $H(S)$ is a monotonically increasing function of $n$.

• $H(S)$ is additive, i.e., $H(S, Q) = H(S) + H(Q)$ for any choice of two independent random variables $S$ and $Q$.

The entropy function in Definition 2.3.1 quantifies the information contained in a random variable with outcomes $\{s_i\}_{i=1}^{n}$ and probabilities $\{p(s_i)\}_{i=1}^{n}$. Information in this sense is associated with the probability mass function of the random variable.

As the distribution of the outcomes moves further from a uniform distribution, the information of the system decreases. This relates to the ability to make justified splits of the full ensemble of outcomes.

The term $-\log_2(p(s_i))$ in Equation (2.2) can be related to the number of levels in a decision tree needed to determine the outcome of realization $s_i$ of the random variable $S$. Since $\log_2\!\left(\frac{1}{p(s_i)}\right) = -\log_2(q(s_i))$, where $q(s_i)$ is the number of decision nodes in one particular level of the decision tree, $\log_2(q(s_i))$ is the number of levels.


2.3.2 Information Gain

Given a realization of a random variable $S$, we will assume that there are $N$ attributes $\{a_i\}_{i=1}^{N}$ associated with each observation, where $a_i = a_i(S)$ for all $i = 1, \ldots, N$. This gives observations of the form $(s_i, a_1(s_i), \ldots, a_N(s_i))$, where each attribute $a_i$ is in itself a realization of a random variable.

Definition 2.3.2 ([10, pp. 57-58]) Let $\{s_i\}_{i=1}^{n}$ be the set of possible outcomes of the random variable $S$ with entropy $H(S)$. Let $A$ be the set of all possible values of the $N$-tuple $(a_1(s_i), \ldots, a_N(s_i))$, $i = 1, \ldots, n$. For each function $a_i$, let $A_i$ denote the set of possible outcomes of $a_i$.

For a set of realizations $S$ of the random variable $S$ and attributes $A$, the quantity

$$IG(S, A_i) = H(S) - \sum_{\alpha \in A_i} \frac{|S_\alpha|}{|S|}\, H(S_\alpha), \qquad (2.3)$$

where $S_\alpha := \{s \in S \mid a_i(s) = \alpha\}$, is the information gain of a split of $S$ on attribute $a_i(s)$, and $|\cdot|$ denotes cardinality.

The notion of entropy, discussed in Section 2.3.1, is the main building block of Definition 2.3.2. The information gain describes the reduction of entropy when splitting a set of realizations into two separate sets based on an attribute. A reduction of entropy in this sense is when the total entropy of a set is higher than the total entropy of the two subsets created by a split. This is discussed in great detail in [10, p. 58], where the connection between information gain and decision trees is studied.

In Equation (2.3), the right hand side consists of a difference between the entropy $H(S)$, which describes the number of bits needed to encode a class, and a weighted sum of subset entropies, with weights based on the ratio between the number of observations in $S_\alpha$ and the number of observations in $S$. The elements in $S_\alpha$ fulfill the requirement of holding the value $\alpha$ from attribute $A_i$.

2.3.3 An Example of a Decision Tree, Implementing Information Gain

Using a set of realizations of a random variable, the information gain of possible choices on how to split the set of realizations can be calculated. By iteratively repeating such splits, a decision tree can be created. To illustrate the process of building a decision tree from a database, consider the five observations in Table 2.1.


Observation   Color     Keyboard Layout   SSD   Value (Class)
$S$           $a_1(S)$  $a_2(S)$          $a_3(S)$  $a_4(S)$
$s_1$         Brown     French            No    Low
$s_2$         Green     French            Yes   Low
$s_3$         Green     English           Yes   Low
$s_4$         Blue      English           Yes   High
$s_5$         Green     English           Yes   Mid

Table 2.1: Attributes for five training examples with given class associations.

Here $S$ corresponds to $\{s_i\}_{i=1}^{5}$. Each of these observations has values of four different attributes $(a_1(S), \ldots, a_4(S))$ associated with it.

Attribute $a_4(S)$ is the class attribute and thus the goal is to predict $a_4(S)$ using prior knowledge of $a_1(S)$, $a_2(S)$, $a_3(S)$. Since this is the attribute to be determined in new samples, the samples in $S$ need to be analyzed with the end goal of finding structures leading to homogeneous subsets with respect to attribute $a_4(S)$. The concept of homogeneous sets is introduced next.

Definition 2.3.3 Let $S$ be a subset of outcomes of a random variable $S$. If all observations in $S$ share the same properties for their attributes, $S$ is said to be homogeneous. Otherwise $S$ is said to be inhomogeneous.

Definition 2.3.3 implies that the subset $S$ in Table 2.1 is inhomogeneous with respect to the attribute “Value”. An example of a homogeneous subset of $S$ is $\{s_1, s_2, s_3\}$, as these observations are all of class Low.

The entropy of attribute $a_4(S)$ is calculated as

$$H(a_4(S)) = -p(\mathrm{Low})\log_2(p(\mathrm{Low})) - p(\mathrm{Mid})\log_2(p(\mathrm{Mid})) - p(\mathrm{High})\log_2(p(\mathrm{High})), \qquad (2.4)$$

where $p(\cdot)$ is the empirical probability of each occurrence. Inserting numerical values and assuming $s_1, \ldots, s_5$ to be equally likely yields

$$H(a_4(S)) = -\frac{3}{5}\log_2\!\left(\frac{3}{5}\right) - \frac{1}{5}\log_2\!\left(\frac{1}{5}\right) - \frac{1}{5}\log_2\!\left(\frac{1}{5}\right) = 1.371. \qquad (2.5)$$

To create a first split of the set $S$, attribute $a_1(S)$ can be chosen to create a homogeneous set, since $a_1(s_4) = \mathrm{Blue}$ and $a_4(s_4) = \mathrm{High}$. Splitting $S$ based on attribute $a_1(S)$ can thus create the subsets to the left in Figure 2.2.


The right plot in Figure 2.2 shows an example of a split that would not isolate one class from the ensemble of sets.

Figure 2.2: Left: A split on color (when High is the positive outcome). Right: A split on either Keyboard Layout or SSD (when Low is the positive outcome).

This observation is supported by the information gain, which can be written as

$$IG(S, A_1) = H(S) - \sum_{\alpha \in A_1} \frac{|S_\alpha|}{|S|}\, H(S_\alpha), \qquad (2.6)$$

where $A_1 = \{\mathrm{Brown}, \mathrm{Green}, \mathrm{Blue}\}$ is the set of possible outcomes of attribute $a_1(S)$ and $S_\alpha$ is the subset of $S$ with $a_1(s) = \alpha$.

In the following calculations of information gain, the target class High has been chosen to be regarded as the positive outcome and the ensemble {Low, Mid} regarded as the negative outcome. This is a way of “binarising” and thus introducing a distinction of the possible outcomes. The idea is to maximize the information gain and thus minimize the entropy in the subsets generated after the split on a specific attribute.

Expanding the sum in expression (2.6) gives

$$IG(S, A_1) = H(S) - \frac{|S_{\mathrm{Brown}}|}{|S|} H(S_{\mathrm{Brown}}) - \frac{|S_{\mathrm{Green}}|}{|S|} H(S_{\mathrm{Green}}) - \frac{|S_{\mathrm{Blue}}|}{|S|} H(S_{\mathrm{Blue}}) \qquad (2.7)$$

and with numerical values:

$$IG(S, A_1) = 0.72 - \frac{1}{5}\left(\frac{0}{1}\log_2\!\left(\frac{0}{1}\right) + \frac{1}{1}\log_2\!\left(\frac{1}{1}\right)\right) - \frac{3}{5}\left(\frac{0}{3}\log_2\!\left(\frac{0}{3}\right) + \frac{3}{3}\log_2\!\left(\frac{3}{3}\right)\right) - \frac{1}{5}\left(\frac{1}{1}\log_2\!\left(\frac{1}{1}\right) + \frac{0}{1}\log_2\!\left(\frac{0}{1}\right)\right) = 0.72 - 0 - 0 - 0 = 0.72, \qquad (2.8)$$


where we use $0\log_2(0) = 0$.

Analogously, for Keyboard Layout and SSD respectively, the resulting information gains are

$$IG(S, A_2) = 0.17, \qquad (2.9)$$
$$IG(S, A_3) = 0.07. \qquad (2.10)$$

The computations in (2.8)-(2.10) verify the intuition that an attribute providing a clean split between two classes must reduce the entropy significantly. This entropy reduction is related to the maximization of information gain, which such a split is based on. On the other hand, an attribute which provides a split with mixed classes must have a lower relative impact on the reduction of entropy.
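For reference, the values above can be reproduced with a few lines of R, the environment used for the classification work in Chapter 4. The following is a minimal sketch: the helper functions entropy and info_gain and the vectors encoding Table 2.1 are illustrative names of our own, not part of any package.

# Entropy and information gain for the toy data of Table 2.1 (base R).
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(ifelse(p > 0, p * log2(p), 0))          # with 0*log2(0) := 0
}
info_gain <- function(labels, attribute) {
  weights <- table(attribute) / length(attribute)
  entropy(labels) - sum(weights * sapply(split(labels, attribute), entropy))
}

value    <- c("Low", "Low", "Low", "High", "Mid")                    # a4(S)
color    <- c("Brown", "Green", "Green", "Blue", "Green")            # a1(S)
keyboard <- c("French", "French", "English", "English", "English")   # a2(S)
ssd      <- c("No", "Yes", "Yes", "Yes", "Yes")                      # a3(S)

entropy(value)                                 # 1.371, as in (2.5)

binary <- ifelse(value == "High", "High", "NotHigh")   # High vs {Low, Mid}
info_gain(binary, color)                       # 0.72, as in (2.8)
info_gain(binary, keyboard)                    # 0.17, as in (2.9)
info_gain(binary, ssd)                         # 0.07, as in (2.10)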

The desired outcome is one where homogeneity in the terminal nodes is achieved, which is when all leaves of a final tree contain samples stemming from one single class of a specific attribute. However, fulfilling this requirement may give rise to the problem of over-fitting, where outliers from a certain class are isolated within, or close to, a group of samples stemming from another specific class.

The split based on attributes is repeated, building a decision tree until each terminal node of the tree is homogeneous (if possible) with respect to the attribute, i.e. when there are no leaves containing a mix of classes.

Figure 2.3: The split over attribute $a_1(S)$.

In Figure 2.3 the first split is visualized. Here, a split based on attribute $a_1$ was made due to the calculations in (2.8)-(2.10) showing that this variable had the greatest information gain among the three variables.


In this figure, “Leaf 1” contains only observations in class High and thus this leaf is homogeneous. On the other hand, “Leaf 2” has observations in the classes Low and Mid, which means it is an inhomogeneous set. However, the four negative outcomes in “Leaf 2” actually stem from the ensemble {Low, Mid}, which means that the problem can be redefined as a two-class problem from this node on.

Recomputing the entropy for the set of samples remaining after the split, i.e., the samples in “Leaf 2”, gives the results in Table 2.2.

Color                 Keyboard Layout   SSD                   Value
$a_1(S)$              $a_2(S)$          $a_3(S)$              $a_4(S)$
$H(a_1(S)) = 0.811$   $H(a_2(S)) = 1$   $H(a_3(S)) = 0.811$   $H(a_4(S)) = 0.811$

Table 2.2: The entropy of each variable and the target class.

Figure 2.4: The split over attribute $a_2(S)$, which enables complete separation of class Low and thus homogeneity in one resulting leaf node.

Table 2.2 shows the entropies of observations 1, 2, 3 and 5, contained in “Node 2” in Figure 2.4. As there are only two classes represented in this subset, class Low is defined as the negative outcome and class Mid is defined as the positive outcome. Based on this assignment, we compute the information gains for this set, analogously to prior calculations, as

$$IG(S, A_1) = 0.811 - 0.689 = 0.122, \qquad (2.11)$$
$$IG(S, A_2) = 0.811 - 0.5 = 0.311, \qquad (2.12)$$
$$IG(S, A_3) = 0.811 - 0.689 = 0.122. \qquad (2.13)$$

Based on (2.11)-(2.13), Keyboard Layout provides the best split. This split is shown in Figure 2.4, where “Leaf 4” becomes homogeneous but “Leaf 3” remains mixed with respect to the target classes.


In Table 2.3 the remaining observations contained in “Leaf 3” are shown.

Leaf #   Ensemble of samples
Leaf 1   High: [Blue, English, Yes]
Leaf 3   Low: [Green, English, Yes]
         Mid: [Green, English, Yes]
Leaf 4   Low: [Brown, French, No]
         Low: [Green, French, Yes]

Table 2.3: A table of the samples found in each of the leaves in Figure 2.3.

In the first split, originating from “Node 1”, the assigned positive outcome was High, and in the split under “Node 2” the assigned positive outcome was Mid. The negative outcomes were {Low, Mid} and Low respectively. There are three resulting leaves in the final tree, as shown in Table 2.3. The observations $s_3$ and $s_5$ in Leaf 3 are identical except for the target attribute. Thus, there is no way of making a justified split of this ensemble.

2.3.4 Overfitting

The notion of overfitting is a common problem in classification, where an overconfidence in the training data results in a classifier which, due to its overly complex structure, misclassifies data points.

Figure 2.5: Left: Hypothesis hyperplane ($\omega_1$) defined by training samples. Right: Test samples evaluated in $\omega_1$.

In Figure 2.5, a two-class problem is shown. The left side of this figure defines a class-separating hyperplane based on the samples used for training.


Figure 2.6: Left: Hypothesis hyperplane and ellipse ($\omega_2$) defined by training samples. Right: Test samples evaluated in $\omega_2$.

The right side shows how a set of test samples is classified by this space. Note that the misclassification of two samples is disregarded during training in favor of the hyperplane $\omega_1$. Thus, the training samples are not successfully separated. However, the test samples are successfully classified using $\omega_1$.

Figure 2.6 shows a case where the hypothesis separating the two classes consists of a hyperplane and an ellipse, which is denoted by $\omega_2$. In training, this hypothesis successfully separates the two classes. Allowing the model to include the area covered by the ellipse as being true for the class denoted with crosses could produce an overfit. This overfit results in a probability of misclassification in certain areas which should not have been included in the classifier. Thus, if the probability of misclassification of new samples during testing is higher in a relatively complex hypothesis space, an overfit is probable.

Based on the test data, hypothesis $\omega_2$ misclassifies the circle, as seen in Figure 2.6. This suggests that $\omega_2$ overfits the data.

Definition 2.3.4 ([10, p. 67]) Let $\Omega$ be a hypothesis space with hypotheses $\omega_1 \in \Omega$ and $\omega_2 \in \Omega$. Then, $\omega_2$ is said to overfit the data if its error is smaller than that of $\omega_1$ in the training step of the algorithm, but greater when introducing new observations.

The overfitting described in Definition 2.3.4 often happens when algorithms classify data according to overly complex hypothesis spaces, i.e., when generalized models are disfavored in favor of models with low training error.
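The effect in Definition 2.3.4 can be illustrated numerically. The sketch below uses the rpart package (a CART-style tree, not the C5.0 classifier used later in this thesis) on simulated data with two overlapping classes; all names and parameter values are chosen purely for the illustration. The fully grown tree typically attains a lower training error but a higher test error than the simple one, which is exactly the situation described in the definition.

library(rpart)
set.seed(1)

make_data <- function(n) {                     # two overlapping Gaussian classes
  data.frame(x1 = c(rnorm(n, 0), rnorm(n, 1.5)),
             x2 = c(rnorm(n, 0), rnorm(n, 1.5)),
             y  = factor(rep(c("circle", "cross"), each = n)))
}
train <- make_data(100)
test  <- make_data(100)

err <- function(fit, data) mean(predict(fit, data, type = "class") != data$y)

simple  <- rpart(y ~ ., train, method = "class",            # analogue of omega_1
                 control = rpart.control(maxdepth = 1))
complex <- rpart(y ~ ., train, method = "class",            # analogue of omega_2
                 control = rpart.control(cp = 0, minsplit = 2))

c(train = err(simple, train),  test = err(simple, test))
c(train = err(complex, train), test = err(complex, test))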


2.3.5 Boosting

As a method to reduce errors in the predictions made by the decision trees, a boosting algorithm can be implemented. The fundamental idea of boosting algorithms is to [12, p. 150]:

1. Choose and implement a classifier.

2. Split the dataset into a training set (of size n) and a validation set (of size m). The samples in these sets are chosen randomly in the first iteration. Note that both sets need to contain samples from all classes.

3. Train the classifier using the training set. This classifier is said to be a weak classifier. Test the classifier by classifying the testing set and create weights to indicate the flaws in the weak classifier.

4. Repeat steps 2 and 3 for a given number of iterations, storing each weak classifier constructed this way. Note that, at each iteration, the selection of samples is reflected by the assigned weights.

5. Combine all weak classifiers into one strong classifier.

2.3.6 Boosting in C5.0

In C5.0, the particular boosting algorithm implemented is based on the idea of Adaptive Boosting, or AdaBoost for short, which was introduced in [13]. This work made a great impact in the field of machine learning due to the ability to combine several weak classifiers into a stronger one, while retaining the robustness to over-fitting of the weak classifiers.

The main idea of adaptive boosting is to weigh the data points in each successive boosting iteration during the construction of a classifier. The weights for each sample in the training data are distributed such that the algorithm will focus on correctly classifying the data points which were misclassified by the previous classifiers. Thus, as a cost of this focus, samples correctly classified in a previous iteration will be slightly more likely to be misclassified in the current iteration.

Algorithm 1 shows the AdaBoost algorithm for the binary classification case [14]. The implementation of Algorithm 1 considers all observations of a variable, $x_i$, including the true class association of that observation, $y_i$. The data set is then given by $(x_i, y_i)$, where $i = \{1, \ldots, n\}$. In the initialization of the procedure, all weights are set to be equal for all samples. These weights are then normalized such that their sum equals one. The constant $T$ affects the number of weak classifiers to use in the final boosted model. The main loop, beginning at line 6 in Algorithm 1, incrementally builds a weighted sum of weak classifiers to produce a stronger final classifier.


Algorithm 1 AdaBoost for binary classification
1: procedure AdaBoost
2:     $(x_1, y_1), \ldots, (x_n, y_n)$ ← $n$ observations, where $x_j$ is a vector of
3:         attributes and $y_j$ is the class.
4:     $w_t(i)$ ← weighting function. Initialized as $1/n$ for $i = 1, \ldots, n$.
5:     $T$ ← number of trials
6:     for $t$ ← 1 to $T$ do
7:         $f_t(x_i)$ ← weak classifier
8:         $\epsilon_t = \sum_{y_i \neq f_t(x_i)} w_t(i)$
9:         $\alpha_t = \frac{1}{2}\log\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
10:        for $i$ ← 1 to $n$ do
11:            $w_{t+1}(i) = w_t(i)\, e^{-\alpha_t}$ ← if correctly classified
12:            $w_{t+1}(i) = w_t(i)\, e^{\alpha_t}$ ← if incorrectly classified
13:        end
14:        $w_t(i) = \frac{w_t(i)}{\sum_{i=1}^{n} w_t(i)}$ ← normalize $w_t$
15:    end
16:    $F(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t f_t(x)\right)$ ← the final classifier
17: end procedure

17: end procedure

sum of weak classifiers to produce a stronger final classifier. This is achievedby summing the weights of all missclassified observations and construct-ing a modifier for the updated weights as is presented in lines 11-12. There-weighted observations are then processed in the successive iteration. Werefer to [15, p. 4] for the details about the derivation for –

t

.
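Algorithm 1 can be transcribed almost line by line into R. The sketch below is an illustration only: it uses one-level rpart trees (decision stumps) as the weak classifiers $f_t$ and codes the two classes as -1/+1 so that the final classifier $F(x) = \mathrm{sign}(\sum_t \alpha_t f_t(x))$ applies directly. It is not the boosting variant built into C5.0, which is described next; the function and variable names are our own.

library(rpart)

adaboost <- function(X, y, T = 20) {                 # y coded as -1 / +1
  n <- nrow(X); w <- rep(1 / n, n)                   # line 4: w_1(i) = 1/n
  stumps <- vector("list", T); alpha <- numeric(T)
  d <- data.frame(X, y = factor(y))
  for (t in 1:T) {                                   # line 6
    stumps[[t]] <- rpart(y ~ ., d, weights = w, method = "class",
                         control = rpart.control(maxdepth = 1))   # line 7: weak classifier
    pred <- as.numeric(as.character(predict(stumps[[t]], d, type = "class")))
    eps  <- sum(w[pred != y])                        # line 8: weighted error
    alpha[t] <- 0.5 * log((1 - eps) / eps)           # line 9
    w <- w * exp(ifelse(pred == y, -alpha[t], alpha[t]))   # lines 11-12
    w <- w / sum(w)                                  # line 14: normalize
  }
  function(Xnew) {                                   # line 16: F(x)
    votes <- sapply(1:T, function(t)
      alpha[t] * as.numeric(as.character(predict(stumps[[t]], data.frame(Xnew), type = "class"))))
    sign(rowSums(votes))
  }
}

# usage on simulated, overlapping two-class data
set.seed(2)
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y <- ifelse(X$x1 + X$x2 + rnorm(200, sd = 0.5) > 0, 1, -1)
clf <- adaboost(X, y, T = 25)
mean(clf(X) == y)                                    # training accuracy of the boosted classifier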

The differences between the algorithm used in C5.0 and AdaBoost are:

1. C5.0 tries to maintain a tree size similar to the initial one (which is generated without boosting taken into account). This is correlated with the number of terminal nodes, which increases as the tree grows.

2. C5.0 calculates class probabilities for all boosted models and, within these models, weighted averages are calculated. Then, from these models, C5.0 chooses the class having the maximum probability within the group.

3. The boosting procedure ends if

$$\sum_{\{i:\, y_i \neq F_t(x_i)\}} w_t(i) < 0.1 \qquad (2.14)$$

or

$$\frac{\sum_{\{i:\, y_i \neq F_t(x_i)\}} w_t(i)}{|W_m|} > 0.5, \qquad (2.15)$$

where $|W_m|$ is the cardinality of the set of weights associated with misclassified observations.

Algorithm 2 Pseudo code depicting the weighting procedure in C5.0
1: procedure
2:     $N$ ← number of samples in the training set.
3:     $N_-$ ← number of misclassified samples.
4:     $T$ ← number of boosting iterations.
5:     $w_{i,t}$ ← weight of the $i$-th sample during the $t$-th round of boosting.
6:     $S_+$ ← sum over all weights associated with correctly classified
7:         samples.
8:     $S_-$ ← sum over all weights associated with misclassified samples.
9:     for $t$ ← 1 to $T$ do
10:        Build a decision tree.
11:        for $i$ ← 1 to $N$ do
12:            $midpoint = \frac{1}{2}\left[\frac{1}{2}(S_+ + S_-) - S_-\right]$
13:            $w_{i,t} = w_{i,t-1}\,\frac{S_+ - midpoint}{S_+}$ ← weight if correctly classified
14:            $w_{i,t} = w_{i,t-1} + \frac{midpoint}{N_-}$ ← weight if misclassified
15:        end
16:    end
17: end procedure

The method employed in C5.0 is shown in Algorithm 2, which is described in [8, p. 398]. The following discussion of boosting in C5.0 is based on their work.

In Figure 2.7, the evolution of the weight for a single sample over boosting iterations is shown. The most drastic weight changes occur when the sample has been misclassified. This causes the procedure in Algorithm 2 to focus heavily on increasing the weight of this sample. The opposite event does not cause such a drastic action, but instead produces a much slower decline in weight. This method serves the purpose of biasing the weights: the attention is aimed towards the samples proven to be most difficult, as a proactive measure for the next iteration of boosting.


Figure 2.7: An example of typical behavior of the calculated weight for a single sample over 200 iterations of boosting. Figure from [8, p. 398].


Chapter 3

Review of Time Series Analysis

This chapter describes the theory of time series analysis used in this thesis.

Time series analysis is a discipline of statistics dealing with the prediction of future observations in a large number of applications. These applications include, to name a few, methods to distinguish between explosions and earthquakes in seismic readings, to evaluate stock market returns, and to find patterns in speech recordings [16]. The notion of time series analysis is often combined with classification, as time series analysis can be implemented to find differences in behavior with respect to a set of classes.

3.1 Stationary Processes

When performing time series analysis, the property of stationarity of a time series is important. Stationarity of a time series refers to a constancy of the statistical properties of the series [17, p. 1]. The reason to transform data to become stationary is to allow for modeling using methods designed specifically for stationary processes.

In order to define stationarity of a time series, a few fundamental mathematical concepts need to be defined. These include the expected value, variance, and autocovariance functions.

Definition 3.1.1 ([18, p. 203]) A stochastic process is defined as a collection of random variables $\{X_t\}_{t \in T}$ for some index set $T$ on a probability space $\Omega$.


Definition 3.1.2 Consider a continuous random variable $X$ with probability density function $f(x)$. The expected value of $X$ is defined as

$$E[X] := \int_{-\infty}^{\infty} x f(x)\,dx. \qquad (3.1)$$

For a stochastic process $\{X_t\}_{t \in T}$ we introduce the notation $\mu_X(t) := E[X_t]$.

Definition 3.1.3 Consider the stochastic processes $\{X_k\}_{k \in T}$, $\{Y_k\}_{k \in T}$. For given $t, s \in T$ the cross covariance function is defined as

$$\gamma_{X,Y}(t, s) := E\left[(X_t - \mu_X(t))(Y_s - \mu_Y(s))\right]. \qquad (3.2)$$

From Definition 3.1.3, we note that the variance of a stochastic process is

$$\sigma^2_{X_t} := \mathrm{Var}\{X_t\} = \gamma_{X,X}(t, t) = E\!\left[X_t^2\right] - \mu_X^2(t). \qquad (3.3)$$

Definition 3.1.4 ([19]) A stochastic process $\{X_k\}_{k \in T}$, with variance $\sigma^2_{X_k}$ and expected value $\mu_X(k)$, is weakly stationary if and only if for all $t, s \in T$

$$\mu_X(t) = \mu, \qquad (3.4)$$
$$\sigma^2_{X_t} < \infty, \qquad (3.5)$$
$$\gamma_{X,X}(t, s) = \gamma_X(|t - s|), \quad \text{with } \gamma_X : \mathbb{Z}^+_0 \to \mathbb{R}. \qquad (3.6)$$

Definition 3.1.4 implies that to claim weak stationarity for a time series, its mean must be time independent and its autocovariance function must be bounded and shift invariant.

3.1.1 White Gaussian Noise

A commonly used stochastic process in time series analysis is white noise, and more specifically Gaussian white noise.


Figure 3.1: Four sequences of white Gaussian noise. Left: Generated samples. Right: Histograms of the generated samples and the Gaussian probability density function.

Definition 3.1.5 ([20]) Consider the stochastic process $\{X_t\}$ with $E[X_t] = \mu$, $\mathrm{Var}\{X_t\} = \sigma^2$ and probability density function

$$f(x_t \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_t - \mu)^2}{2\sigma^2}}. \qquad (3.7)$$

If any pair of values in $\{X_t\}_{t \in T}$ are uncorrelated, i.e.

$$E[X_k X_s] = E[X_k]\,E[X_s] = \mu^2 \quad \text{for all } k \neq s, \qquad (3.8)$$

and $X_t$ comes from (3.7) for all $t \in T$, then $\{X_t\}_{t \in T}$ is a Gaussian white noise.

Figure 3.1 shows four series of random samples generated according to Equation (3.7). Each row in Figure 3.1 shows a number of samples generated by the Gaussian white noise process with probability density function $f(x \mid 0, 1)$. As the number of samples increases, the histogram becomes close in shape to the Gaussian probability density function.
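As an illustration, sequences like those in Figure 3.1 can be generated in R with rnorm and compared with the $N(0, 1)$ density; the sample sizes below are arbitrary choices for the example.

set.seed(1)
for (n in c(50, 500, 5000)) {
  x <- rnorm(n, mean = 0, sd = 1)            # white Gaussian noise, f(x | 0, 1)
  hist(x, freq = FALSE, breaks = 30, main = paste("n =", n), xlab = "x")
  curve(dnorm(x), add = TRUE)                # Gaussian density for comparison
}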


Figure 3.2: Temperatures in Stockholm, modified by a linear trend.

3.2 An Example of a Time Series

A typical time series $\{X_t\}$ is a set of observations which have been recorded at certain points in time $t$. To illustrate the concept of a time series, Figure 3.2 shows measurements of the daily temperature in the city of Stockholm, Sweden. Note that this data has been modified after being retrieved from the Swedish Meteorological and Hydrological Institute [21]. The modification made is a superposition of a linear function onto the original data points, i.e.,

$$X'_t = X_t + \frac{t}{100}, \qquad (3.9)$$

where $t \in \{1, \ldots, 3000\}$. The modified set, $\{X'_t\}_{t=1}^{3000}$, is thus a skewed version of the original set. The linear trend was introduced to provide a more complete description of what a time series can look like.

In the data of Figure 3.2 we observe both a linear trend and an oscillating behavior. The derivation of a model describing the process $\{X'_t\}$ is the objective of time series analysis. For the data in Figure 3.2, a simple model can be

$$X_t = Y_t + m_t + s_t, \qquad (3.10)$$

where $\{Y_t\}$ is a weakly stationary process, $\{m_t\}$ is a trend and $\{s_t\}$ is a seasonal component [22, p. 3]. Deriving a model of the form (3.10) may require isolation from, for example:


• Trends which show a general deterministic movement. These general movements can be linear, exponential, logarithmic, etc.

• Discontinuities (or jumps), where the measurements make sudden changes upwards or downwards.

• Seasonal components such as reoccurring patterns of movement.

If $m_t$ and $s_t$ in (3.10) are calculated in such a way that $Y_t = X_t - m_t - s_t$ has the characteristics of a weakly stationary process, the description of trends, seasonal changes and discontinuities can be considered as performed to a sufficient degree.

3.3 Model Decomposition

The model (3.10) is estimated in three steps:

1. Trend estimation to obtain $m_t$.

2. Seasonality component estimation to obtain $s_t$.

3. Fitting of a stochastic model to the remaining process after completing steps 1 and 2.

3.3.1 Estimating a Trend

A simple manner of removing trends in the data is to implement a moving average filter to find the general movements of the signal. The idea of the moving average filter is to let an average, spanning a whole period, be calculated for each data point in the series. If the signal contains several seasonal components, the window size of the averaging should equal the length of the slowest observed seasonal component, in order to exclude all seasonality from the trend. The Fourier transform can be implemented to find dominant frequencies in the signal, which in turn indicate the fundamental frequencies of the seasonal components.

Each average $m_t$ is calculated by centering the summation around the point to be estimated. When the period $d$ of the signal $X_t$ is an odd number, the trend can be estimated by [23, pp. 25-31]

$$m_t = \sum_{j=-q}^{q} \frac{X_{t-j}}{2q + 1}, \qquad q + 1 \leq t \leq n - q, \qquad (3.11)$$


where $n$ is the number of samples in the series. If the period is even, the trend can be estimated by

$$m_t = \frac{0.5\,X_{t-q} + X_{t-q+1} + \cdots + X_{t+q-1} + 0.5\,X_{t+q}}{d}, \qquad q + 1 \leq t \leq n - q, \qquad (3.12)$$

where $n$ is the number of samples in the series and $d$ is the period, e.g., $d = 2q$ or any other even number.
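Equations (3.11) and (3.12) correspond to a centered moving-average filter and can be written compactly with stats::filter in R. The sketch below assumes x is a numeric vector holding the series and d its (slowest) seasonal period; the function name ma_trend is our own.

ma_trend <- function(x, d) {
  if (d %% 2 == 1) {
    w <- rep(1, d) / d                        # odd period: Equation (3.11) with d = 2q + 1
  } else {
    w <- c(0.5, rep(1, d - 1), 0.5) / d       # even period: Equation (3.12), half weights at the ends
  }
  as.numeric(stats::filter(x, w, sides = 2))  # NA outside q + 1 <= t <= n - q
}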

3.3.2 Estimating a Seasonal Component

In the analysis and isolation of seasonal components, one method is to split the signal into segments with the same length as the period of an observed seasonal component. Thus, this method requires prior information about the present seasonality. A method of finding the fundamental frequencies to set the segment lengths is identical to that in Section 3.3.1.

A seasonal component, $s_t$, can be calculated as

$$s_t = \frac{1}{P} \sum_{k=0}^{P} \left(X_{k P_l + t} - m_{k P_l + t}\right), \qquad t \in \{1, \ldots, T_p\}, \qquad (3.13)$$

where $P$ is the number of full periods with length $P_l$ and $T_p$ is the number of samples contained in one period. Thus, at each index $t$, all $P_l$-tuples of the detrended data will sum to an averaged value. The idea of (3.13) is to estimate a seasonal component by calculating one averaged period. Constructing $\{s_t\}$ is simply a matter of repeating $\{s_p\}_{p=1}^{T_p}$ to span the full length of $\{X_t\}$.

3.4 Autoregressive Moving-Average (ARMA)

When a decomposition model has been produced, there might still be structures in the unexplained variable, $Y_t$. This happens, for example, when the autocovariance function is non-zero at non-zero time lags, i.e. if

$$\gamma_Y(h) \neq 0, \qquad \forall h \neq 0. \qquad (3.14)$$

A method used to describe a stationary time series is to introduce a stochastic process where the output is a linear combination of previous inputs and outputs. The ARMA process is a candidate model. For given positive integers $p, q$, the ARMA($p, q$) model is defined as

$$Y_t - \sum_{i=1}^{p} \phi_i Y_{t-i} = Z_t + \sum_{j=1}^{q} \theta_j Z_{t-j}, \qquad (3.15)$$


where $\{\phi_n\}_{n=1}^{p}$ and $\{\theta_n\}_{n=1}^{q}$ are real coefficients, and $p, q$ denote the number of autoregressive and moving average terms respectively. Furthermore, the term $\{Z_t\}$ is a Gaussian white noise process which is used to generate the output $Y_t$.

The sum on the left hand side of the equality in Equation (3.15) is the autoregressive part, AR($p$). Likewise, the term on the right hand side of the equality is the moving average part, MA($q$).
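As a small illustration of (3.15), the sketch below simulates an ARMA(1, 1) process driven by Gaussian white noise and fits it back with stats::arima; the coefficient values $\phi = 0.7$ and $\theta = 0.4$ are arbitrary choices for the example.

set.seed(3)
y <- arima.sim(model = list(ar = 0.7, ma = 0.4), n = 1000)   # Z_t standard Gaussian
fit <- arima(y, order = c(1, 0, 1), include.mean = FALSE)
coef(fit)          # estimates of phi and theta
fit$sigma2         # estimate of the white noise variance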

3.5 Estimating an ARMA(p, q) Process

In ARMA($p, q$) estimation, there are two key steps. First, the orders $p$ and $q$ are estimated, and secondly, the coefficients for each term are estimated.

3.5.1 Estimating Order of p and q

Order selection of $p$ and $q$ in the ARMA($p, q$) process is important in order to build a good predictive model for a time series. This can be done, for example, by manual examination of the autocorrelation function or by minimizing the model residual. The autocorrelation function in Definition 3.5.1 and the partial autocorrelation function in Definition 3.5.2 can be used for the analysis of weakly stationary processes, due to the property of visualizing how a time series is correlated with itself at different time shifts.

Definition 3.5.1 ([23, p. 16]) Let $\{X_t\}$ be a weakly stationary process. The autocorrelation function (ACF) is defined for all positive integers $h$ as

$$\rho_X(h) := \frac{\gamma_X(h)}{\gamma_X(0)}. \qquad (3.16)$$

Definition 3.5.2 ([23, pp. 94-95]) Let $\{X_t\}$ be a weakly stationary process. The partial autocorrelation function (PACF) $\alpha(\cdot)$ is defined as

$$\alpha(0) := 1, \qquad (3.17)$$
$$\alpha(h) := \phi_{hh}, \quad h \geq 1, \qquad (3.18)$$

where $\phi_{hh}$ is the last component of $\phi_h = \Gamma_h^{-1}\gamma_h$, with $\Gamma_h := \left[\gamma_X(|i - j|)\right]_{i,j=1}^{h}$ and $\gamma_h := \left[\gamma_X(1)\ \gamma_X(2)\ \cdots\ \gamma_X(h)\right]^T$.


By studying the autocorrelation function of the data, the order $q$ in MA($q$) is indicated by the last significant lag with respect to the 95% confidence bounds [23, p. 94]. Similarly, the order $p$ in AR($p$) is indicated by the index of the last significant lag of the partial autocorrelation function with respect to the 95% confidence bounds [23, p. 96].

Definition 3.5.3 ([24, p. 4]) Let $\{X_t\}_{t=1}^{N}$ be a time series and $V$ be the sample variance of the prediction error of the ARMA($p, q$) process with $n = p + q + 1$ parameters. Then the Akaike Information Criterion is defined by

$$\mathrm{AIC} := \log\!\left[V\left(1 + \frac{2n}{N}\right)\right]. \qquad (3.19)$$

For ARMA($p, q$) processes, order identification by inspection of the ACF and PACF is not trivial [24, p. 4]. Instead, the Akaike Information Criterion (AIC) can be implemented to estimate the model order. To implement the AIC, a number of models with unique pairs $(p, q)$ are evaluated using Definition 3.5.3. The model with the lowest AIC is the most preferable with regard to both model order and prediction error minimization.
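This order selection can be automated along the lines of Definition 3.5.3. The sketch below fits candidate ARMA($p, q$) models with stats::arima and evaluates (3.19) from the sample variance of the model residuals (the one-step prediction errors); note that R's built-in AIC() is the likelihood-based criterion, not (3.19). The function name select_order is our own.

select_order <- function(y, max_p = 3, max_q = 3) {
  N <- length(y)
  grid <- expand.grid(p = 0:max_p, q = 0:max_q)
  grid$aic <- apply(grid, 1, function(o) {
    fit <- tryCatch(arima(y, order = c(o["p"], 0, o["q"]), include.mean = FALSE),
                    error = function(e) NULL)
    if (is.null(fit)) return(NA)
    V <- var(residuals(fit))                  # prediction error variance
    n <- o["p"] + o["q"] + 1                  # number of parameters
    log(V * (1 + 2 * n / N))                  # Equation (3.19)
  })
  grid[which.min(grid$aic), ]
}
# e.g., select_order(y) for the simulated series above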

3.5.2 Estimating Coefficients for the ARMA(1, 1) Process

Once the number of autoregressive and moving average terms has been decided, the coefficients for these terms need to be determined. In the case when $p = q = 1$, a zero-mean, weakly stationary stochastic process $\{X_t\}$ can be constructed as

$$X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}, \qquad (3.20)$$

where $\{Z_t\}$ is a white noise process with mean $\mu = 0$ and variance $\sigma^2_z$. Finding $\phi$, $\theta$ and $\sigma^2_z$ is possible by constructing the autocorrelation function, $\rho_X(h)$, for lags $h = \{0, 1, 2\}$, giving three equations with three unknowns. From this it follows that $\gamma_X(0)$, $\gamma_X(1)$, $\gamma_X(2)$ must be found.

The autocovariance of $X_t$ in lag zero is

$$\gamma_X(0) = E\left[(X_t - E[X_t])(X_t - E[X_t])\right]. \qquad (3.21)$$

Since $E[X_t] = 0$ and $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$, Equation (3.21) can be rewritten as

$$\gamma_X(0) = \phi E[X_t X_{t-1}] + E[X_t Z_t] + \theta E[X_t Z_{t-1}]. \qquad (3.22)$$

The term $E[X_t X_{t-1}]$ in Equation (3.22) is the definition of the autocovariance of $X_t$ in lag 1, i.e. $\gamma_X(1)$. Replacing $X_t$ with Equation (3.20) in $E[X_t Z_t]$, the only non-zero term is the autocovariance of $Z_t$, since all other terms can be rewritten as autocovariances of $Z_t$ in non-zero lags. Analogously, the term

$$E[X_t Z_{t-1}] = E[\phi X_{t-1} Z_{t-1} + Z_t Z_{t-1} + \theta Z_{t-1} Z_{t-1}], \qquad (3.23)$$

which is only non-zero for $E[\phi X_{t-1} Z_{t-1}] = \phi\sigma^2_z$ and $E[\theta Z_{t-1} Z_{t-1}] = \theta\sigma^2_z$. Substituting (3.23) into (3.22) and simplifying gives

$$\gamma_X(0) = \phi\gamma_X(1) + \sigma^2_z + \theta\phi\sigma^2_z + \theta^2\sigma^2_z. \qquad (3.24)$$

Similarly, for lags 1 and 2 the autocovariances are

$$\gamma_X(1) = \phi\gamma_X(0) + \theta\sigma^2_z, \qquad (3.25)$$

and

$$\gamma_X(2) = \phi\gamma_X(1). \qquad (3.26)$$

Inserting the numerical values from the autocorrelation function of the samples into (3.24)-(3.26) results in a solvable set of equations. Such numerical values are found by inspection of the ACF plot of the residual to be estimated.
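The system (3.24)-(3.26) can also be solved programmatically from the sample autocovariances. In the sketch below, $\phi$ follows directly from (3.26); substituting (3.25) into (3.24) then gives the quadratic $d\theta^2 + (d\phi - c)\theta + d = 0$ with $d = \gamma(1) - \phi\gamma(0)$ and $c = \gamma(0) - \phi\gamma(1)$ (our own rearrangement of the equations above), from which $\theta$ (the invertible root) and $\sigma^2_z$ follow. The series y is the simulated ARMA(1, 1) from the earlier sketch.

g  <- as.numeric(acf(y, lag.max = 2, type = "covariance", plot = FALSE)$acf)
g0 <- g[1]; g1 <- g[2]; g2 <- g[3]              # sample gamma(0), gamma(1), gamma(2)

phi    <- g2 / g1                               # from (3.26)
d      <- g1 - phi * g0                         # = theta * sigma_z^2, from (3.25)
c0     <- g0 - phi * g1                         # = sigma_z^2 (1 + theta*phi + theta^2), from (3.24)
roots  <- polyroot(c(d, d * phi - c0, d))       # d*theta^2 + (d*phi - c0)*theta + d = 0
theta  <- Re(roots[which.min(Mod(roots))])      # pick the root with |theta| < 1
sigma2 <- d / theta                             # from (3.25)
c(phi = phi, theta = theta, sigma2 = sigma2)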

In models with higher orders of $p$ and $q$, this procedure can be used by calculating $\rho_X(h)$ for $h = \{0, \ldots, n - 1\}$, where $n$ is the number of unknown parameters. Thus, the number of equations to solve grows as the sum of $p$ and $q$ increases. This results in more complex systems of equations.

3.5.3 Prediction Error Method

Another way of estimating the ARMA(p, q) coefficients is by minimizing the mean squared error of the model over the vector of parameters (including the noise variance), $\theta = \left[\sigma_z^2, \phi_1, \dots, \phi_p, \theta_1, \dots, \theta_q\right]$ [25, pp. 6-7]. This minimization can be expressed as

$\theta_N := \arg\min_{\theta \in \Theta} \sum_{t=2}^{N} \left(X_t - X_{t|t-1}(\theta)\right)^2$,   (3.27)

where $\{X_t\}$ is the time series being modeled and $\{X_{t|t-1}(\theta)\}$ is the predictor at time $t$ given by the ARMA(p, q) model, the time series up to time $t-1$ and $\theta \in \Theta$.


Chapter 4

Results: Classification

This chapter implements the theory presented in Chapter 2. The results are discussed and compared with respect to the methods used.

4.1 Origin of Data

The collected datasets used for classification come from an online storage where they are stored as PDF and Excel files. Furthermore, these documents have multiple cases of inconsistent formatting, which makes reconstruction and management of the data an essential task. The main reason the data is inconsistent is that changes have been made to how the information is stored; for example, new properties have been introduced at certain points in time.

The data types represented are numerical and categorical. Numerical attributes take numerical values, while categorical attributes can be described as non-numerical elements. An example of numerical data is the size of a computer screen in inches, and an example of categorical data is the color of a computer. A numerical data type is inherently comparable by nature. In some cases even categorical data can be comparable; for example, a set of shades of gray can be considered orderable with respect to lighter and darker colors.

4.2 Data Management

The data was manually sorted to create a subset of reasonable size, then reformatted prior to importing it into the R environment [26]. R is free software, distributed under the Free Software Foundation's GNU General Public License [27]. R was primarily used for the classification itself, but was also used to format the data further prior to classifying.

4.3 Implementation Tools

To perform preparatory work, such as restructuring and selection of the data connected to the classification, the R software is used [26]. The C5.0 algorithm is used within R to build the decision tree as well as to predict class association. Furthermore, all visualization of the data set is produced using R.

4.4 Decision Tree Building and Prediction

A specific segment of the dataset, containing one class, is isolated and removed, and the decision tree is trained on the remaining elements, hence keeping the resulting decision tree unaware of samples of this class. This forces the samples of that class into misclassification, with the hope that the misclassifications will tend towards the neighboring classes. A pair of neighboring classes to class X is defined as {X − 1, X + 1}, assuming the classes are orderable. Thus, a neighboring class is a class with the closest (higher or lower) median value.

Figure 4.1 presents the box plots associated with two collections of data, a and b, where the collections a and b stem from two different years. This figure illustrates that both datasets behave in a similar manner in successive years.

As an example of how classification using C5.0 works on a subset of the data, the three leftmost classes in Figure 4.1 are isolated. Furthermore, the data from the two different years for these classes are combined, resulting in the box plot in Figure 4.2. Class {2a, 2b} is removed from the data set before training in C5.0, to provide an example of how the samples in this class are classified when there is no prior knowledge of this class. The training data consists of 314 samples with 31 variables. The test data (class {2a, 2b}) consists of 25 samples with 31 variables.
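A minimal R sketch of this experiment, using the C50 package, could look as follows; the data frame products, its class column label and the held-out class label "2" are hypothetical names standing in for the actual data, and the trials argument corresponds to the number of boosting trials discussed below.

```r
# Sketch: train a C5.0 tree without one class and predict that class's samples.
# Assumes a data frame 'products' with a factor column 'label' (hypothetical names).
library(C50)

held_out <- "2"
train <- subset(products, label != held_out)
test  <- subset(products, label == held_out)
train$label <- droplevels(train$label)     # the tree never sees the held-out class

model <- C5.0(label ~ ., data = train, trials = 5)   # 5 boosting trials
pred  <- predict(model, newdata = test)

table(pred)   # counts per predicted class, cf. Table 4.1
```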

When classifying the samples from class {2a, 2b}, the C5.0 model classifies the 25 samples as shown in Table 4.1. The highlighted rows are the neighboring classes, which intuitively are the classes that predictions of class 2 should fall into, since the median value of this class lies between the median values of the neighboring classes.


Figure 4.1: A boxplot of two sets of data, where Xa and Xb hold values from class X in two different years, a and b.

In the cases when tree construction is done without boosting and with five boosting trials, we see that 96% of the samples are classified in one of the neighboring classes.

For this specific sample set we observe that boosting the decision tree does not increase the accuracy, as the results for the decision tree using zero or five boosting trials are the same. Using 10 or 20 trials of boosting, on the other hand, produces a somewhat worse predictive result, which can stem from overfitting due to the unawareness of class 2 during training and boosting.

In addition, this trial shows that class number 2 can be correlated to class number 8 with respect to its attributes.


Figure 4.2: A boxplot where sample sets Xa and Xb have been pair-wise concatenated over X ∈ {1, 2, 3}.

4.4.1 Prediction Result

Algorithm 3 Averaged prediction
 1: procedure
 2:   N ← Number of iterations of tree building and predictions
 3:   Construct an empty matrix, M, for predicted class versus true class,
 4:   where row and column index represents true and predicted class
 5:   respectively.
 6:   for 1 to N do
 7:     Partition the data into two sets, A and B.
 8:     Construct the tree, using set A as training examples.
 9:     Predict the class, using the test samples in set B.
10:     Update M by adding the predictions.
11:   end
12:   Update M by calculating relative sample distribution for each row.
13:   Return M.
14: end

Class number    No boosting    5 trials    10 trials    20 trials
1 (Neighbor)        21            21           21           21
3 (Neighbor)         3             3            1            1
4                    0             0            0            0
5                    0             0            0            0
6                    0             0            0            0
7                    0             0            0            0
8                    1             1            3            3
9                    0             0            0            0
10                   0             0            0            0

Table 4.1: A count of predicted classes for the 25 samples used for validation of the decision tree created with no boosting, boosting with 5 trials, boosting with 10 trials and boosting with 20 trials.

Algorithm 3 shows the method employed to evaluate the performance of the decision tree. At every iteration, this algorithm produces trees and predictions on these trees. First, the full set of samples is divided into two parts, where one part is used as training data and the other is used to predict the class associations. The results of the predictions are then added to a matrix that stores the results.
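A minimal R sketch of Algorithm 3, again with hypothetical names (a data frame products with a factor column label, and an 80/20 split), is shown below; the row-normalized matrix corresponds to Table 4.2.

```r
# Sketch of Algorithm 3: repeated random splits, C5.0 training and prediction,
# accumulated into a (true class x predicted class) matrix, then row-normalized.
# Assumes a data frame 'products' with a factor column 'label' (hypothetical).
library(C50)

classes <- levels(products$label)
M <- matrix(0, nrow = length(classes), ncol = length(classes),
            dimnames = list(true = classes, predicted = classes))
N <- 20

for (i in seq_len(N)) {
  idx   <- sample(nrow(products), size = floor(0.8 * nrow(products)))
  train <- products[idx, ]
  test  <- products[-idx, ]
  model <- C5.0(label ~ ., data = train)
  pred  <- predict(model, newdata = test)
  M     <- M + as.matrix(table(test$label, pred))
}

M <- M / rowSums(M)   # relative sample distribution per true class
round(M, 2)
```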

True Class \ Predicted Class     1      2      3      4      5      6      7      8      9
 1                             0.81   0.12   0.07    -      -      -      -      -      -
 2                             0.18   0.76   0.04    -      -      -      -     0.02    -
 3                              -     0.05   0.77    -     0.14    -     0.04    -      -
 4                              -      -     0.04   0.76   0.02   0.05   0.12   0.01    -
 5                              -      -     0.22   0.03   0.74    -     0.03    -      -
 6                              -      -      -     0.12   0.02   0.73   0.01   0.12    -
 7                              -      -     0.16   0.14   0.01   0.02   0.63   0.04    -
 8                              -     0.02    -      -      -     0.06    -     0.92    -
 9                              -      -      -      -      -     0.03    -     0.04   0.93

Table 4.2: Averaged prediction over 20 iterations of the C5.0 algorithm. The numbers correspond to the fraction of samples classified in the specific class.

The outcome of Algorithm 3 with N = 20 and 9 classes is shown in Table 4.2. All classifications with a rate of less than 0.5% are suppressed and denoted by "-" in Table 4.2. The diagonal entries of this table hold the relative frequency of correctly predicted samples.

The class with the fewest correctly classified elements is class 7, with only 63% accuracy. On the other hand, class 9 obtains the highest accuracy, with a 93% success rate.

The classification shown in Table 4.2 indicates that the boosted C5.0 algorithm classifies the data set correctly with an average accuracy above 78%. This means that about 20% of the data has attribute combinations similar to those of other classes. On the other hand, the misclassified samples tend to fall into neighboring classes, which implies that the ordering of the classes is meaningful. In short, the misclassification rate is not ideally low, but the tendencies of these misclassifications indicate that the decision tree approach is a sufficiently good model to visualize the values and their overlap.

This result indicates that the classes are distinguishable from each other, as the majority of the samples are classified correctly. Furthermore, the samples that do get misclassified are generally close to the true class, showing that the ordering of the classes can be considered meaningful. If the majority of the misclassified samples were to lie in classes further from the diagonal, the ordering could have been considered meaningless, in the sense that there would be no clear reason why the specific ordering had been chosen.

4.4.2 Prediction of Unknown Classes

Algorithm 4 Prediction of unknown classes
 1: procedure
 2:   N ← Number of classes represented in the sample set.
 3:   Construct an empty matrix, M, for predicted class versus true class,
 4:   where row and column index represents true and predicted class
 5:   respectively.
 6:   for n ← 1 to N do
 7:     Construct a training set, A, with all observations except those
 8:     in class n. Samples from class n define set B.
 9:     Construct the tree, using set A.
10:     Predict classes, using the samples in set B.
11:     Update M by adding the predictions.
12:   end
13:   Update M by calculating relative frequency for each row.
14:   Return M.
15: end

Algorithm 4 is similar to Algorithm 3, with the exception that the class being predicted is kept unknown to the tree until the prediction occurs. Table 4.3 shows the relative frequency of predicted classes after running Algorithm 4. At every iteration, one class has been excluded to make the training of the decision tree unaware of that specific class. The excluded class is then introduced and used for prediction.


True Class \ Predicted Class     1      2      3      4      5      6      7      8      9
 1                              -     0.88   0.04    -      -      -      -     0.08    -
 2                             0.79    -     0.15    -      -      -      -     0.06    -
 3                             0.04   0.02    -     0.09   0.55    -     0.25   0.05    -
 4                              -      -     0.03    -     0.33   0.21   0.36   0.04   0.03
 5                             0.02   0.01   0.65   0.22    -      -     0.10    -      -
 6                              -      -     0.04   0.23    -      -     0.04   0.69    -
 7                              -      -     0.42   0.46   0.05    -      -     0.07    -
 8                              -     0.10   0.03   0.03    -     0.55   0.03    -     0.26
 9                              -      -      -      -      -     0.50    -     0.50    -

Table 4.3: Relative frequency of predicted classes. Every row indicates the iteration and the excluded class for that iteration. The shaded cells are to be regarded as neighbors to the diagonal cells.

Iterating this method over all classes, one at a time, results in the predictions in Table 4.3. If the manual classification were done with respect to the similarities of the attributes in each separate class, then an expected result of Algorithm 4 would be that each reintroduction of a class would cause predictions to fall into neighboring classes.

The classification of unknown classes, shown in Table 4.3, shows that the classifier produces worse results in terms of correct classification. Fewer samples are classified into the neighboring classes.

Ideally, the sum of the percentages in the true and neighboring classes in Table 4.2 should be accounted for in the neighboring classes in Table 4.3. The average accuracy in prediction of unknown classes is roughly 32%. However, the tendency to classify close to the diagonal is still present in this table.

This result contradicts Table 4.2 and suggests that the ordering is not the most meaningful, or possibly that the matching of dataset and classifier is not optimal.


Chapter 5

Results: Time Series Analysis

In this chapter, the theory of Chapter 3 is implemented for the available data and the results are presented.

5.1 Origin of Data

The data used in this chapter was originally stored in multiple related tables within the MSSQL Server environment [28]. Furthermore, the data in Chapter 4 is connected to the data in this chapter. Realizations of the classes in Chapter 4 constitute a subset of the data used in the time series analysis.

5.2 Implementation Tools

As in Chapter 4, the R software [26] is used to analyze and visualize the data sets.

5.3 Decomposition & Forecasting

Figure 5.1: Original time series, together with the linear trend and its confidence bounds.

In Figure 5.1, a subset of the data set is shown. By inspection, there are signs of both seasonality and a trend. The linear model in this figure shows the general movement, and the upper and lower prediction bounds indicate that the trend is in fact a multiplicative one. This is deduced from the cone-like shape of these bounds. Dividing the data by the linear model both results in a constant mean and suppresses the cone-like shape, and thus also the variance drift this shape is associated with. A time series $X_t$ with a multiplicative trend $m_t$ can be written as $X_t = Y_t m_t$, where $Y_t$ is a time series. The seasonality is observed as the repeating cyclic movements in the signal. Removing the multiplicative trend from the original data in Figure 5.1 and removing the remaining constant bias results in the series shown in Figure 5.2. This series has the seasonal components present in Figure 5.1, but with a more constant amplitude due to the removal of the multiplicative trend. By removing seasonal components from the data in Figure 5.2, we obtain the deseasonalized data and the residual analysis shown in Figure 5.3. The seasonal component is the result of successive averaging over the data set. These components come from equally sized splits made to the data set, where all splits are averaged together to cancel out the effect of the noise in the data. The list below shows the steps of how to build the seasonal components:

1. Choose window sizes related to the periodicity of possible seasonal components. Typical period lengths of data in practice are often weekly, monthly, quarterly and yearly, to name a few. An example of a weekly period length, l, with data recorded once per day is thus l = 1 × 7 = 7.

2. Split the data into n vectors of length l and compute the sample mean element-wise to obtain one single vector of length l. This results in a single approximated weekly seasonal component, $\{s_t\}_{t=1}^{l}$.

3. Steps 1 and 2 can be repeated for monthly, quarterly and yearly periods. All seasonal components found in this manner are then added together into a final seasonal component s. For example, to add weekly and yearly seasons, the weekly season has to be repeated 52 times to make the sum meaningful. (A minimal sketch of the averaging in steps 1 and 2 is given after this list.)

Figure 5.2: The data from Figure 5.1 with removed trend and mean.
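The sketch below illustrates steps 1 and 2 in R for a weekly period; the daily series x is a hypothetical name, and incomplete trailing weeks are simply dropped for clarity.

```r
# Sketch: estimate a weekly seasonal component by averaging the detrended
# series over consecutive windows of length l = 7 (assumes daily data in x).
l <- 7
n_full   <- floor(length(x) / l)                 # number of complete weeks
windows  <- matrix(x[1:(n_full * l)], nrow = l)  # one column per week
s_weekly <- rowMeans(windows)                    # averaged weekly pattern
s_weekly <- s_weekly - mean(s_weekly)            # center so the season has zero mean

# Repeat the pattern to cover the whole series, e.g. for removal or forecasting.
s_t <- rep(s_weekly, length.out = length(x))
```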

The correlation analysis in Figure 5.3 shows that the trend and seasonal estimates cannot explain the time series completely, as the ACF and PACF are distinguishable from the ones generated by a white noise process, since a white noise process is uncorrelated with itself in all nonzero time-lags. This can also be seen in the deseasonalized data in Figure 5.3, where a slight seasonality is still observed.

Due to significantly greater correlation at the first lag of both the ACF and PACF in Figure 5.3, an ARMA(1, 1) is chosen as a candidate model. The estimated model is

$X_t - 0.118 X_{t-1} = Z_t + 0.319 Z_{t-1}$,   (5.1)

where $Z_t$ is a white noise process with expected value $\mu = 0$ and variance $\sigma^2 = 0.035$.

Figure 5.3: Top: Deseasonalized data. Bottom: ACF and PACF of the deseasonalized data.

Algorithm 5 Time Series Analysis
 1: procedure Combined decomposition and forecasting
 2:   {X_t} ← observations.
 3:   Decompose the set into a residual {Y_t}, trend {m_t} and seasonal
 4:   component {s_t}.
 5:   Fit an ARMA(p, q) process to {Y_t} by analyzing its ACF
 6:   and PACF.
 7:   Compute y_n as the n step ahead predictor of Y_t, based on the
 8:   ARMA(p, q) model.
 9:   Compute m_n by n step(s) of extrapolation of the trend, {m_t}.
10:   Compute s_n by n step(s) of extrapolation of the seasonal component,
11:   {s_t}.
12:   Compute the final predicted series X_n = y_n + m_n + s_n.
13: end

Algorithm 5 is based on the decomposition methods described in Section 3.3. This algorithm takes the data set shown in Figure 5.1 as input. The data set is used to estimate a linear trend similar to that in Figure 5.1. Furthermore, seasonal components are constructed and removed from the data along with the linear model. The remaining residual, i.e. the detrended and deseasonalized data, is then approximated by an ARMA(p, q) model. Subsequently, the estimated components are used to predict n new data entries by

• Extrapolation of the trend to find the array of n step(s) ahead for the trend. Since the data in this chapter has a linear trend, a linear extrapolation is used. Other common trends are, for example, exponential or logarithmic trends.

• Finding the array of n step(s) ahead for the seasonal component. The position of the array to use is found by matching the phase of the last known value with the seasonal component.

• Generating the array of n step(s) ahead predictions for the ARMA(p, q) process.
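A minimal end-to-end R sketch in the spirit of Algorithm 5 and the steps above is given below; the daily series x, the weekly period and the 30-step horizon are assumptions based on this chapter, and the last line reverses the division by the trend, whereas Algorithm 5 is written for a purely additive decomposition.

```r
# Sketch: decomposition-based forecast in the spirit of Algorithm 5.
# Assumes a daily series x with a positive linear trend; names are illustrative.
n <- length(x); t <- seq_len(n); h <- 30

# 1. Linear trend; divide it out to suppress the multiplicative behaviour.
trend_fit <- lm(x ~ t)
m_t <- fitted(trend_fit)
y   <- x / m_t - 1                              # detrended, roughly zero-mean residual

# 2. Weekly seasonal component by averaging (as in the list above).
l <- 7
s_weekly <- rowMeans(matrix(y[1:(floor(n / l) * l)], nrow = l))
s_t <- rep(s_weekly, length.out = n)
y   <- y - s_t

# 3. ARMA(1,1) model for what remains, then h-step prediction of each part.
fit   <- arima(y, order = c(1, 0, 1), include.mean = FALSE)
y_hat <- predict(fit, n.ahead = h)$pred
m_hat <- predict(trend_fit, newdata = data.frame(t = n + 1:h))
s_hat <- s_weekly[((n + 1:h - 1) %% l) + 1]     # continue the weekly phase

x_hat <- (y_hat + s_hat + 1) * m_hat            # recombine on the original scale
```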

Figure 5.4: Forecasted data.

The prediction in Figure 5.4 is a 30-step ahead prediction produced using the method in Algorithm 5. Four of the 30 last true values in Figure 5.4 are outside the 95% confidence region. This means that the confidence bounds hold for 87% of the true data, implying that this model is not sufficiently good at predicting future values. However, 30 values is a relatively small sample size to evaluate by inspecting the confidence bounds. This result indicates that the data is under-modeled and that a better result is possible.

Figure 5.5: Residual analysis of the prediction errors.

Figure 5.5 shows the residual, i.e., the difference between the true and forecasted data in Figure 5.4. It also shows the autocorrelation function and partial autocorrelation function of the residual. Since the number of data points is relatively low, it is difficult to validate the residual analysis. However, there are signs of a significant partial autocorrelation at lag 2. Since there are no significant correlations at any multiple of 2, a seasonality of period 2 can be disregarded. Ideally there would be no significant partial autocorrelation at any lag $h \geq 1$, as this would imply an uncorrelated process with no distinguishable trend or seasonality. This agrees with the confidence region fit in Figure 5.4. Disregarding the significant partial autocorrelation, the residual analysis behaves similarly to that of a white noise process. Based on this, the model is a possible candidate for prediction of the data set.


Chapter 6

Conclusions

This thesis consists of two parts: the classification of the available data set and a forecasting problem using time series analysis. For the classification problem, several methods were reviewed to conclude which is the most satisfactory for the specific structure of the problem. This resulted in the choice of decision trees, and more specifically the algorithm C5.0. In the time series analysis, a combination of methods was implemented to produce predictions of future events. The approach considered first isolating both trend and seasonal components and then estimating an ARMA process over the resulting data set.

The decision tree approach to classification showed satisfactory results in classifying the predefined structure, with an average accuracy of 78% in supervised classification. When iteratively removing each class when training the decision tree and reintroducing these classes when validating the method, it is observed that a significant share of the elements in the removed class tend to be placed into neighboring classes. However, there are subsets of data which have a relatively large offset from the true class, which might suggest that these subsets should be re-evaluated.

When analyzing the time series model for a subset of the data, we observe that the model can predict the data with reasonable accuracy. The residual analysis of the prediction error showed a slight partial autocorrelation, which could stem from the small amount of data predicted upon. However, the true data is well situated within the confidence bounds of the predictions for about 87% of the samples, implying that the model is indeed a good candidate for prediction. Furthermore, a method to automate the predictions was proposed and implemented.


6.1 Future Work

All algorithms involving classification and time series analysis in this thesis have been implemented to work for the data provided by the company. However, for this work to be fully functional for the company, an automated restructuring of the data needs to be implemented, as the current structure is not directly usable as input to the algorithms. Along these lines, the algorithm presented in Chapter 5 assumes the restructured data, and thus a generalization of several algorithms could be made to allow for other data sets and structures. For the classification part, a detailed analysis of the misclassified data would be of high interest, as such an analysis was not possible due to constraints on the available information.


Bibliography

[1] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine learning, neural and statistical classification. 1994.

[2] E. Lundqvist. Decision tree classification and forecasting of pricing time series data. Master's thesis, KTH Royal Institute of Technology, July 2014.

[3] R. Entezari-Maleki, A. Rezaei, and B. Minaei-Bidgoli. Comparison of classification methods based on the type of attributes and sample size. Journal of Convergence Information Technology, 4(3):94-102, 2009.

[4] J. R. Quinlan. C4.5: programs for machine learning. Elsevier, 2014.

[5] P. Tan, M. Steinbach, and V. Kumar. Classification: basic concepts, decision trees, and model evaluation. Introduction to data mining, 1, 2006.

[6] V. Vapnik. The nature of statistical learning theory. Springer, 2013.

[7] The MathWorks. Supervised learning workflow and algorithms, 2016.

[8] M. Kuhn and K. Johnson. Applied predictive modeling. Springer, 2013.

[9] Khan Academy Labs. Modern information theory: Information Entropy, 2014. https://www.khanacademy.org/computing/computer-science/informationtheory/moderninfotheory/v/information-entropy.

[10] T. M. Mitchell. Machine Learning. McGraw-Hill Education, 1997.

[11] P. A. Bromiley, N. A. Thacker, and E. Bouhova-Thacker. Shannon entropy, Renyi entropy, and information. Technical report, School of Cancer and Imaging Sciences, University of Manchester, 93, 2004.

[12] R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149-171. Springer, 2003.


[13] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

[14] R. E. Schapire. Explaining AdaBoost. In Empirical inference, pages 37-52. Springer, 2013.

[15] R. Rojas. AdaBoost and the super bowl of classifiers: a tutorial introduction to adaptive boosting. Technical report, Freie Universität Berlin, 2009.

[16] R. H. Shumway and D. S. Stoffer. Time series analysis and its applications: with R examples. Springer, 2010.

[17] G. P. Nason. Stationary and non-stationary time series. Statistics in Volcanology. Special Publications of IAVCEI, 1, 2006.

[18] M. Kijima. Stochastic processes with applications to finance. CRC Press, 2013.

[19] L. H. Koopmans. The spectral analysis of time series. Academic Press, 1995.

[20] H. Ling. Lecture 1: Stationary Time Series. Santa Fe Institute, 2006. www.econ.ohio-state.edu/dejong/note1.pdf.

[21] Sveriges meteorologiska och hydrologiska institut. SMHI öppna data - meteorologiska observationer, 2016.

[22] J. Grandell. Lecture notes in Time Series Analysis. KTH Royal Institute of Technology, Division of Mathematical Statistics, 2016. https://www.math.kth.se/matstat/gru/sf2943/ts.pdf.

[23] P. J. Brockwell and R. A. Davis. Introduction to time series and fore-casting. Springer, 2006.

[24] D. Meko. Lecture notes: Autoregressive-Moving-Average Modeling. The University of Arizona, 2015. www.ltrr.arizona.edu/~dmeko/notes_5.pdf.

[25] R. A. Stine. Lecture notes: Estimating ARMA processes. The University of Pennsylvania. www-stat.wharton.upenn.edu/~stine/stat910/lectures/12_est_arma.pdf.

[26] RStudio Team. RStudio: Integrated Development Environment for R.RStudio, Inc., Boston, MA, 2016.

[27] Free Software Foundation Inc. GNU general public license, 2016.

[28] Microsoft Corporation. SQL server 2014, 2016.

