ACCTG 6910: Building Enterprise & Business Intelligence Systems (e.bis)
Classification
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
Outline
• Introduction
• Classic Methods
  – Decision Tree
  – Neural Network
Introduction
• Classification
  – Classifies objects into a set of pre-specified object classes, based on the values of relevant object attributes and the objects' class labels
[Figure: a classifier maps objects O1–O6 into classes X, Y and Z. Each object Oi carries relevant attribute values and a class label; classes X, Y and Z are pre-determined.]
Introduction
• When to use it?
  – Discovery (descriptive, explanatory)
  – Prediction (prescriptive, decision support)
  – When the relevant object data can be decided and is available
• Real-World Applications
  – Profiling/predicting customer purchases
  – Loan/credit approval
  – Fraud/intrusion detection
  – Diagnosis decision support
Example

Age  Income  Churn?
 70  20,000  Yes
 60  18,000  Yes
 75  36,000  Yes
 67  33,000  Yes
 60  36,000  Yes
 60  50,000  No
 50  12,000  Yes
 40  12,000  Yes
 30  12,000  No
 50  30,000  No
 40  16,000  Yes
 35  20,000  Yes
 48  36,000  No
 30  37,000  No
 22  50,000  No
 21  51,000  No
[Figure: the samples plotted with Age on the x-axis and Income on the y-axis; markers distinguish Churn from Not Churn.]
Notations
[Figure: the same sample table and Age–Income plot, annotated with the notation below.]
• Age and Income: classification attributes
• Churn?: class label attribute
• Yes/No: class labels
• The Age–Income plane: problem space
• The labeled records: classification samples; a new, unlabeled record: prediction object
Object Data Required
• Class Label Attribute:
  – Dependent variable, output attribute, prediction variable, …
  – Variable whose values label objects' classes
• Classification Attributes:
  – Independent variables, input attributes, or predictor variables
  – Object variables whose values affect objects' class labels
• Three Attribute Types:
  – numerical (age, income)
  – categorical (hair color, sex)
  – ordinal (severity of an injury)
Classification vs. Prediction
• View 1
  – Classification: discovery
  – Prediction: predictive, utilizing classification results (rules)
• View 2
  – Either discovery or predictive
  – Classification: categorical or ordinal class labels
  – Prediction: numerical (continuous) class labels
• Class lectures, assignments and exam: View 1
• Text: View 2
Classification & Prediction
• Main Function
  – Mappings from input attribute values to output attribute values
  – Methods affect how the mappings are derived and represented
• Process
  – Training (supervised): derives the mappings
  – Testing: evaluates the accuracy of the mappings
Classification & Prediction
• Classification samples: divided into training and testing sets
  – Often processed in batch mode
  – Include class labels
• Prediction objects
  – Often processed in online mode
  – No class labels
Classification Methods
• Comparative Criteria
  – Accuracy
  – Speed
  – Robustness
  – Scalability
  – Interpretability
  – Data types
• Classic methods
  – Decision Tree
  – Neural Network
  – Bayesian Network
Decision Tree
• Mapping Principle: Recursively partition the data set so that the subsets contain "pure" data
[Figure: the Age–Income plot partitioned into increasingly pure regions.]
Decision Tree
• Algorithm:
  Start from the whole data set;
  Do {
    Split the data set into two or more subsets by every possible split condition on the classification attributes;
    Choose the split that produces the "purest" subsets;
  } While (subsets not pure)
Decision Tree
• Key Question: How is purity (diversity) measured?
  – Gini Index of diversity: the ecologists' contribution
  – Example: 8 cats, 2 tigers
    • The probability of choosing a cat (p1) = 8/10 = 0.8
    • The probability of choosing a cat AGAIN = 0.8 * 0.8 = 0.64
    • The probability of choosing a tiger (p2) = 2/10 = 0.2
    • The probability of choosing a tiger AGAIN = 0.2 * 0.2 = 0.04
    • What is the probability of choosing two different animals?
Decision Tree
• P = 1 - p1*p1 - p2*p2 = 0.32
• When is P biggest? --> when # of cats = # of tigers (p1 = p2 = 0.5): P = 0.5
• When is P smallest? --> when there is only one kind of animal (p1 or p2 = 1): P = 0
• The Gini index thus measures the diversity of the data set
Decision Tree
• Gini Index: Suppose there are n different output classes and class i appears with probability $p_i$. The Gini Index is:

  $\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$

• When there are only two classes:
  1 - p1*p1 - p2*p2 = 1 - p1*p1 - (1 - p1)*(1 - p1) = 2 p1 (1 - p1)
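A minimal Python sketch of this formula (the function name gini and the class-count input are my own illustration, not from the course materials):

def gini(counts):
    # Gini index of diversity: 1 minus the sum of squared class probabilities.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini([8, 2])   # cats vs. tigers: 1 - 0.64 - 0.04 = 0.32
gini([5, 5])   # maximum diversity for two classes: 0.5
gini([10, 0])  # a pure set: 0.0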
Decision Tree
• Another Index: Entropy

  $E = -\sum_{i=1}^{n} p_i \log_2 p_i$

• When there are only two categories:
  E = -(p1 log2 p1 + p2 log2 p2)
• The bigger E is, the more diverse the data set is
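A matching Python sketch (again, the entropy helper is my own naming; 0 * log2(0) is treated as 0 by skipping empty classes):

import math

def entropy(counts):
    # E = -sum of p * log2(p) over the classes present in the data set.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)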
Decision Tree
• Practice
  – Question 1: if the chance of churn is 0.5 and the chance of not churn is 0.5, what is the entropy?
  – Answer: -2 * (0.5 * log2 0.5) = 1
  – Question 2: if the chance of churn is 0.25 and the chance of not churn is 0.75, what is the entropy?
  – Answer: -(0.25 * log2 0.25 + 0.75 * log2 0.75) ≈ 0.8113
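Both answers can be checked with the entropy helper sketched above:

entropy([1, 1])  # p1 = p2 = 0.5 -> 1.0
entropy([1, 3])  # p1 = 0.25, p2 = 0.75 -> about 0.8113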
Decision Tree
• What is the entropy of Set 1?
  – -[5/6 * log2 (5/6) + 1/6 * log2 (1/6)]
• Set 2? The whole set? The reduction?
[Figure: the Age–Income plot split into Set 1 and Set 2.]
Decision Tree
• Calculating the reduction in entropy
  – Original E: easy to get
  – E of the subsets: easy to get
  – How to combine the E's of the subsets?
    • Simply adding them together is not good, since their sizes differ
  – Use a weighted sum (see the sketch below):
    • w1 = # of records in subset 1 / total # of records
    • w2 = # of records in subset 2 / total # of records
    • E' = w1 * E1 + w2 * E2
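A short sketch of the weighted sum, reusing the entropy helper from above (representing each subset by its class counts is my own choice):

def weighted_entropy(subsets):
    # E' = sum over subsets of (subset size / total size) * subset entropy.
    total = sum(sum(s) for s in subsets)
    return sum((sum(s) / total) * entropy(s) for s in subsets)

weighted_entropy([[5, 1], [4, 6]])  # 6/16 * E1 + 10/16 * E2 for the churn example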
Decision Tree
• The algorithm (divide and conquer):
  – Select an attribute and partition the data set D into D1, D2, …, Dn
  – Calculate the entropy Ex for each subset Dx
  – Get $E' = \sum_{x=1}^{n} w_x E_x$
  – Get E of the data set before the partition
  – Get the reduction in entropy = E - E'
  – Divide the data set using the attribute with the largest entropy reduction; go to the next round (a sketch of one round follows)
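One round of this loop might look like the following Python sketch, reusing the entropy and weighted_entropy helpers above. Rows are assumed to be dicts whose "Churn?" key holds the class label; class_counts and best_split are hypothetical helper names:

from collections import Counter

def class_counts(rows):
    # Class-label frequencies of a partition, e.g. [9, 7] for 9 Yes / 7 No.
    return list(Counter(r["Churn?"] for r in rows).values())

def best_split(rows, attributes):
    # Try every attribute/threshold split; keep the largest entropy reduction.
    base = entropy(class_counts(rows))
    best = None
    for attr in attributes:
        for threshold in sorted({r[attr] for r in rows}):
            left = [r for r in rows if r[attr] <= threshold]
            right = [r for r in rows if r[attr] > threshold]
            if not left or not right:
                continue
            reduction = base - weighted_entropy(
                [class_counts(left), class_counts(right)])
            if best is None or reduction > best[0]:
                best = (reduction, attr, threshold)
    return best  # (largest reduction, attribute, threshold)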
Decision Tree
[Figure: the Age–Income plot with two candidate split lines: Income = 23K and Age = 55.]
Decision Tree
• Partition by Age = 55:
  – E = -[9/16 * log2 (9/16) + 7/16 * log2 (7/16)]
  – E1 = -[5/6 * log2 (5/6) + 1/6 * log2 (1/6)]
  – E2 = -[0.4 * log2 (0.4) + 0.6 * log2 (0.6)]
  – E' = 6/16 * E1 + 10/16 * E2
  – Reduction = E - E'
• Partition by Income = 23K: similar
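Plugging in the numbers with the entropy helper above (values rounded):

E  = entropy([9, 7])   # whole set, 9 Yes / 7 No: about 0.989
E1 = entropy([5, 1])   # Age > 55, 5 Yes / 1 No: about 0.650
E2 = entropy([4, 6])   # Age <= 55, 4 Yes / 6 No: about 0.971
E_prime = 6/16 * E1 + 10/16 * E2   # about 0.851
reduction = E - E_prime            # about 0.138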
Decision Tree

Age?
├── <= 55 → Income?
│            ├── <= 23K → Churn
│            └── > 23K  → Not Churn
└── > 55  → Churn
Extract Rules from the Model
• Each path from the root to a leaf node forms an IF-THEN rule.
• In each rule, the conditions at the root and internal nodes are conjoined to form the IF part.
• The leaf node denotes the THEN part of the rule.
• Example from the tree above: IF Age <= 55 AND Income <= 23K THEN Churn.
Pruning
• Noise: inconsistent class labels for the same attribute values
• Outliers: the # of samples with a given combination of class label and input attribute values is small
• Overfitting: tree branches are created to classify noise and outliers
• Problem: unreliable tree branches
Pruning
• Function: remove unreliable branches
• Pre-pruning
  – Halts the creation of unreliable branches by statistically determining the goodness of further tree splits
  – Less time-consuming but less effective
• Post-pruning
  – Removes unreliable branches from a full tree, minimizing error rates or required encoding bits
  – More time-consuming but more effective
Decision Tree
• Pros of Decision Trees:
  – Clear rules
  – Fast algorithm
  – Scalable
• Cons:
  – The accuracy may suffer with complex problems, e.g., a large number of class labels
Decision Tree
• Many trees out there!!
  – ID3
  – C4.5 (handles continuous predictor values)
  – CART
  – Forest
  – MDTI
  – …
Neural Networks
• What is it?
  – Another classification technique that maps from input attribute(s) to output attribute(s)
  – Most widely known but least understood
• Human brains: the root of neural networks
[Figure: a biological neuron with its + and - input signals.]
Neural Networks
[Figure: a neural network with input nodes i1 and i2, hidden nodes H1 and H2, and output nodes O1, O2 and O3.]
Neural Network
• Let's start with a simple example:
  z = 3x + 2y + 1
• Input attributes: x, y; output attribute: z
• How do we represent the mapping?
Neural Network
[Figure: a two-layer network for z = 3x + 2y + 1. Input nodes x and y in the input layer connect to the output node in the output layer through weights 3 and 2; the combination function is (SUM) and the transfer function adds the +1 constant.]
Two-Layer Neural Network
• Three major components:
  – Input layer
  – Output layer
  – A weight matrix in between
• Two functions (see the sketch below):
  – Combination function: usually SUM
  – Transfer or activation function: "squashes" (normalizes) the sum to a certain range
• Can represent ANY linear function between the input space and the output space
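A minimal sketch of the z = 3x + 2y + 1 example, treating the +1 constant as the transfer step as in the figure (the code is illustrative, not from the course materials):

import numpy as np

weights = np.array([3.0, 2.0])        # the weight matrix between the two layers

def feed_forward(x, y):
    s = weights @ np.array([x, y])    # combination function: SUM
    return s + 1.0                    # transfer step for this toy example

feed_forward(1.0, 1.0)                # 3 + 2 + 1 = 6.0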
Neural Network
[Figure: the same two-layer network, now with input nodes x and y, weights 3 and 2, a SUM combination function, and a sigmoid transfer function.]
Neural Networks
• How about non-linear relationships? Throw in another layer: the hidden layer
• Theoretically, a neural net with the structure below can represent ANY function between the input space and the output space
[Figure: a network with input nodes i1 and i2, hidden nodes H1 and H2, output nodes O1-O3, and labeled weights (such as w111 and w122) on the connections between layers.]
Neural Networks
• Data flow:
[Figure: Age and Income enter input nodes i1 and i2; each hidden node computes a weighted sum and then its transfer value S(H1), S(H2); each output node computes a weighted sum and then S(O1), S(O2), yielding the Churn and Not Churn outputs.]
Neural Network
• Feed-forward:
  – The above process, in which the input values are transformed through the network to produce the output values, is called FEED-FORWARD
• When we get new records, we feed them forward to get the prediction values (a minimal sketch follows)
• But how do we produce a network that can predict?
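A hedged sketch of feed-forward for the Age/Income network above; the 2x2 weight shapes are assumptions for illustration, and the random initialization range (-0.5, 0.5) follows the training slide further below:

import numpy as np

def sigmoid(s):
    # Transfer function: squashes the sum into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-s))

W1 = np.random.uniform(-0.5, 0.5, size=(2, 2))  # input -> hidden weights
W2 = np.random.uniform(-0.5, 0.5, size=(2, 2))  # hidden -> output weights

def feed_forward(record):
    # record = [age, income]; each layer computes SUM, then the transfer value.
    hidden = sigmoid(W1 @ record)     # S(H1), S(H2)
    return sigmoid(W2 @ hidden)       # S(O1), S(O2): Churn / Not Churn scores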
Neural Nets
[Figure: the data set is split into a training set and a testing set; an initial neural net goes through training to become a trained neural net, then through testing to become a trained net with a performance measurement.]
Neural Net
• Split the data set (sketched below):
  – Classifier and error:
    • 2/3 for training, 1/3 for testing
  – Ten-fold validation:
    • 9/10 for training, 1/10 for testing; repeat ten times
    • Use this when the sample size is small
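Both schemes in a short Python sketch (the function names are mine):

import random

def holdout_split(samples, train_frac=2/3):
    # Holdout: 2/3 of the samples for training, 1/3 for testing.
    shuffled = random.sample(samples, len(samples))
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def ten_fold(samples):
    # Ten-fold validation: each tenth serves as the test set exactly once.
    shuffled = random.sample(samples, len(samples))
    folds = [shuffled[i::10] for i in range(10)]
    for i in range(10):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]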
Training a Neural Net
• Step 1. Set up an initial neural net:
  – Choose the numbers of input, output, and hidden layer nodes (node values initialized to 0)
  – The weight matrices are often set to small random values in (-0.5, 0.5)
• Step 2. Feed-forward:
  – Use historical data: run the predictor values through the net and get the output
[Figure: the training loop: initialize, then feed-forward (guess), then back-propagation (learn), repeated.]
Neural Net
• Step 3: Back-propagation
  – Critical step: learning happens here
  – Compare the machine's result with the historical result:
    • error_i = o_i(real) - o_i(machine)
    • Based on this error, go BACK to the hidden-to-output weight matrix and change the weights so that the error becomes smaller
    • Requires calculus (derivatives of the error); just interpret it as looking for the minimum error on an error surface (see the sketch below)
    • Repeat the process until the error falls within an acceptable range
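A minimal sketch of one such weight change for a single hidden-to-output weight, assuming a sigmoid output and squared error (the derivation is simplified here, not the full back-propagation algorithm):

def update_weight(w, o_real, o_machine, hidden_value, learning_rate=0.1):
    # Gradient step on the error surface for E = 1/2 * (o_real - o_machine)**2.
    error = o_real - o_machine
    delta = error * o_machine * (1 - o_machine)   # sigmoid derivative: o * (1 - o)
    return w + learning_rate * delta * hidden_value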
Neural Net
• Tuning for the training phase
  – Topology: number of input, output, and hidden nodes
    • hidden = 1/2 (output + input) is a common rule of thumb
    • number of hidden layers: 1 is enough
  – Learning rate (0-1):
    • The rate at which weights can be modified from the previous weights
    • Very important for learning convergence and performance
  – Momentum:
    • An adjustment included when calculating weight modifications (see the sketch below)
    • Typically very small or zero; less important
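The learning rate and momentum combine in the weight change roughly as follows (a common formulation; the exact rule in any given tool may differ):

def weight_change(gradient_term, previous_change, learning_rate, momentum=0.0):
    # New change = learning rate * gradient term + momentum * previous change.
    return learning_rate * gradient_term + momentum * previous_change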
Neural Net
• Pros: very powerful (ANY function!)
• Cons:
  – Time-consuming
  – Black-box
• When and where to use it:
  – Complicated prediction problems
  – Visualization or understanding of the rules is not needed
  – Accuracy is very important
Summary
• Basics
  – Classification versus prediction
    • Mappings from input attributes to class labels
    • Data types of input attributes and class labels: numerical, categorical and ordinal
    • Data-type-based view and discovery-vs-predictive view
• Decision-tree induction method
  – Recursively partitions the data set to increase the purity (or information gain) level of class labels in individual partitions
  – Entropy function: a measure of diversity
  – Tree nodes correspond to partitions; links correspond to partitioning conditions
  – Pre-pruning or post-pruning removes unreliable tree branches caused by noise or outliers
Summary
• Neural Net
  – A neural net has the following components:
    • input layer, output layer, hidden layer
    • weight matrices
  – The input layer represents the input attributes
  – The output layer represents the output classes
  – The hidden layer and the matrices help to capture the mapping function
Summary
• Neural Net
  – To use a neural net, go through three steps:
    • Training: feed-forward, back-propagation
    • Testing: feed-forward only, used to measure the accuracy of the model built
    • Prediction: feed-forward without measuring performance
  – Most of the tuning occurs in the training phase:
    • number of hidden layer nodes
    • learning rate
    • momentum
• Readings: T2, Ch. 7.1 – 7.3.3 and Ch. 7.2