ACCTG 6910: Building Enterprise & Business Intelligence Systems (e.bis)
Classification
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
Outline
• Introduction
• Classic Methods
  – Decision Tree
  – Neural Network
Introduction
• Classification
  – Classifies objects into a set of pre-specified object classes, based on the values of relevant object attributes and the objects' class labels
[Figure: a classifier maps objects O1–O6 into classes X, Y and Z. Each object Oi carries relevant attribute values and a class label; classes X, Y and Z are pre-determined.]
Introduction
• When to use it?
  – Discovery (descriptive, explanatory)
  – Prediction (prescriptive, decision support)
  – When the relevant object data can be decided and is available
• Real-World Applications
  – Profiling/predicting customer purchases
  – Loan/credit approval
  – Fraud/intrusion detection
  – Diagnosis decision support
Example

Age  Income  Churn?
 70  20,000  Yes
 60  18,000  Yes
 75  36,000  Yes
 67  33,000  Yes
 60  36,000  Yes
 60  50,000  No
 50  12,000  Yes
 40  12,000  Yes
 30  12,000  No
 50  30,000  No
 40  16,000  Yes
 35  20,000  Yes
 48  36,000  No
 30  37,000  No
 22  50,000  No
 21  51,000  No
[Figure: the samples plotted with Age on the x-axis and Income on the y-axis; markers distinguish Churn from Not Churn.]
Notations
[Figure: the same sample table and Age–Income plot, annotated with the notation below.]
• Age and Income: classification attributes
• Churn?: class label attribute
• Yes/No: class labels
• The Age–Income plane: problem space
• The labeled records: classification samples; a new, unlabeled record: prediction object
Object Data Required
• Class Label Attribute:
  – Dependent variable, output attribute, prediction variable, …
  – Variable whose values label objects' classes
• Classification Attributes:
  – Independent variables, input attributes, or predictor variables
  – Object variables whose values affect objects' class labels
• Three Attribute Types:
  – numerical (age, income)
  – categorical (hair color, sex)
  – ordinal (severity of an injury)
Classification vs. Prediction
• View 1
  – Classification: discovery
  – Prediction: predictive, utilizing classification results (rules)
• View 2
  – Either discovery or predictive
  – Classification: categorical or ordinal class labels
  – Prediction: numerical (continuous) class labels
• Class lectures, assignments and exam: View 1
• Text: View 2
Classification & Prediction
• Main Function
  – Mappings from input attribute values to output attribute values
  – Methods affect how the mappings are derived and represented
• Process
  – Training (supervised): derives the mappings
  – Testing: evaluates the accuracy of the mappings
Classification & Prediction
• Classification samples: divided into training and testing sets
  – Often processed in batch mode
  – Include class labels
• Prediction objects
  – Often processed in online mode
  – No class labels
Classification Methods
• Comparative Criteria
  – Accuracy
  – Speed
  – Robustness
  – Scalability
  – Interpretability
  – Data types
• Classic methods
  – Decision Tree
  – Neural Network
  – Bayesian Network
Decision Tree
• Mapping Principle: Recursively partition the data set so that the subsets contain "pure" data
[Figure: the Age–Income plot partitioned into increasingly pure regions.]
Decision Tree
• Algorithm:
  Start from the whole data set;
  Do {
    Split the data set into two or more subsets by every possible split condition on the classification attributes;
    Choose the split that produces the "purest" subsets;
  } While (subsets not pure)
Decision Tree
• Key Question: How is purity (diversity) measured?
  – Gini Index of diversity: the ecologists' contribution
  – Example: 8 cats, 2 tigers
    • The probability of choosing a cat (p1) = 8/10 = 0.8
    • The probability of choosing a cat AGAIN = 0.8 * 0.8 = 0.64
    • The probability of choosing a tiger (p2) = 2/10 = 0.2
    • The probability of choosing a tiger AGAIN = 0.2 * 0.2 = 0.04
    • What is the probability of choosing two different animals?
Decision Tree
• P = 1 - p1*p1 - p2*p2 = 0.32
• When is P biggest? --> when # of cats = # of tigers (p1 = p2 = 0.5): P = 0.5
• When is P smallest? --> when there is only one kind of animal (p1 or p2 = 1): P = 0
• The Gini index thus measures the diversity of the data set
Decision Tree
• Gini Index: Suppose there are n different output classes and class i appears with probability $p_i$. The Gini Index is:

  $\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$

• When there are only two classes:
  1 - p1*p1 - p2*p2 = 1 - p1*p1 - (1 - p1)*(1 - p1) = 2 p1 (1 - p1)
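A minimal Python sketch of this formula (the function name gini and the class-count input are my own illustration, not from the course materials):

def gini(counts):
    # Gini index of diversity: 1 minus the sum of squared class probabilities.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini([8, 2])   # cats vs. tigers: 1 - 0.64 - 0.04 = 0.32
gini([5, 5])   # maximum diversity for two classes: 0.5
gini([10, 0])  # a pure set: 0.0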
Decision Tree
• Another Index: Entropy

  $E = -\sum_{i=1}^{n} p_i \log_2 p_i$

• When there are only two categories:
  E = -(p1 log2 p1 + p2 log2 p2)
• The bigger E is, the more diverse the data set is
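A matching Python sketch (again, the entropy helper is my own naming; 0 * log2(0) is treated as 0 by skipping empty classes):

import math

def entropy(counts):
    # E = -sum of p * log2(p) over the classes present in the data set.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)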
Decision Tree
• Practice
  – Question 1: if the chance of churn is 0.5 and the chance of not churn is 0.5, what is the entropy?
  – Answer: -2 * (0.5 * log2 0.5) = 1
  – Question 2: if the chance of churn is 0.25 and the chance of not churn is 0.75, what is the entropy?
  – Answer: -(0.25 * log2 0.25 + 0.75 * log2 0.75) ≈ 0.8113
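Both answers can be checked with the entropy helper sketched above:

entropy([1, 1])  # p1 = p2 = 0.5 -> 1.0
entropy([1, 3])  # p1 = 0.25, p2 = 0.75 -> about 0.8113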
Decision Tree
• What is the entropy of Set 1?
  – -[5/6 * log2 (5/6) + 1/6 * log2 (1/6)]
• Set 2? The whole set? The reduction?
[Figure: the Age–Income plot split into Set 1 and Set 2.]
Decision Tree
• Calculating the reduction in entropy
  – Original E: easy to get
  – E of the subsets: easy to get
  – How to combine the E's of the subsets?
    • Simply adding them together is not good, since their sizes differ
  – Use a weighted sum (see the sketch below):
    • w1 = # of records in subset 1 / total # of records
    • w2 = # of records in subset 2 / total # of records
    • E' = w1 * E1 + w2 * E2
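A short sketch of the weighted sum, reusing the entropy helper from above (representing each subset by its class counts is my own choice):

def weighted_entropy(subsets):
    # E' = sum over subsets of (subset size / total size) * subset entropy.
    total = sum(sum(s) for s in subsets)
    return sum((sum(s) / total) * entropy(s) for s in subsets)

weighted_entropy([[5, 1], [4, 6]])  # 6/16 * E1 + 10/16 * E2 for the churn example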
Decision Tree
• The algorithm (divide and conquer):
  – Select an attribute and partition the data set D into D1, D2, …, Dn
  – Calculate the entropy Ex for each subset Dx
  – Get $E' = \sum_{x=1}^{n} w_x E_x$
  – Get E of the data set before the partition
  – Get the reduction in entropy = E - E'
  – Divide the data set using the attribute with the largest entropy reduction; go to the next round (a sketch of one round follows)
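One round of this loop might look like the following Python sketch, reusing the entropy and weighted_entropy helpers above. Rows are assumed to be dicts whose "Churn?" key holds the class label; class_counts and best_split are hypothetical helper names:

from collections import Counter

def class_counts(rows):
    # Class-label frequencies of a partition, e.g. [9, 7] for 9 Yes / 7 No.
    return list(Counter(r["Churn?"] for r in rows).values())

def best_split(rows, attributes):
    # Try every attribute/threshold split; keep the largest entropy reduction.
    base = entropy(class_counts(rows))
    best = None
    for attr in attributes:
        for threshold in sorted({r[attr] for r in rows}):
            left = [r for r in rows if r[attr] <= threshold]
            right = [r for r in rows if r[attr] > threshold]
            if not left or not right:
                continue
            reduction = base - weighted_entropy(
                [class_counts(left), class_counts(right)])
            if best is None or reduction > best[0]:
                best = (reduction, attr, threshold)
    return best  # (largest reduction, attribute, threshold)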
Decision Tree
[Figure: the Age–Income plot with two candidate split lines: Income = 23K and Age = 55.]
Decision Tree
• Partition by Age = 55:
  – E = -[9/16 * log2 (9/16) + 7/16 * log2 (7/16)]
  – E1 = -[5/6 * log2 (5/6) + 1/6 * log2 (1/6)]
  – E2 = -[0.4 * log2 (0.4) + 0.6 * log2 (0.6)]
  – E' = 6/16 * E1 + 10/16 * E2
  – Reduction = E - E'
• Partition by Income = 23K: similar
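Plugging in the numbers with the entropy helper above (values rounded):

E  = entropy([9, 7])   # whole set, 9 Yes / 7 No: about 0.989
E1 = entropy([5, 1])   # Age > 55, 5 Yes / 1 No: about 0.650
E2 = entropy([4, 6])   # Age <= 55, 4 Yes / 6 No: about 0.971
E_prime = 6/16 * E1 + 10/16 * E2   # about 0.851
reduction = E - E_prime            # about 0.138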
Decision Tree

Age?
├── <= 55 → Income?
│            ├── <= 23K → Churn
│            └── > 23K  → Not Churn
└── > 55  → Churn
Extract Rules from the Model
• Each path from the root to a leaf node forms an IF-THEN rule.
• In each rule, the conditions at the root and internal nodes are conjoined to form the IF part.
• The leaf node denotes the THEN part of the rule.
• Example from the tree above: IF Age <= 55 AND Income <= 23K THEN Churn.
Pruning
• Noise: inconsistent class labels for the same attribute values
• Outliers: the # of samples with a given combination of class label and input attribute values is small
• Overfitting: tree branches are created to classify noise and outliers
• Problem: unreliable tree branches
Pruning
• Function: remove unreliable branches
• Pre-pruning
  – Halts the creation of unreliable branches by statistically determining the goodness of further tree splits
  – Less time-consuming but less effective
• Post-pruning
  – Removes unreliable branches from a full tree, minimizing error rates or required encoding bits
  – More time-consuming but more effective
Decision Tree
• Pros of Decision Trees:
  – Clear rules
  – Fast algorithm
  – Scalable
• Cons:
  – The accuracy may suffer with complex problems, e.g., a large number of class labels
Decision Tree
• Many trees out there!!
  – ID3
  – C4.5 (handles continuous predictor values)
  – CART
  – Forest
  – MDTI
  – …
Neural Networks
• What is it?
  – Another classification technique that maps from input attribute(s) to output attribute(s)
  – Most widely known but least understood
• Human brains: the root of neural networks
[Figure: a biological neuron with its + and - input signals.]
Neural Networks
[Figure: a neural network with input nodes i1 and i2, hidden nodes H1 and H2, and output nodes O1, O2 and O3.]
Neural Network
• Let's start with a simple example:
  z = 3x + 2y + 1
• Input attributes: x, y; output attribute: z
• How do we represent the mapping?
Neural Network
[Figure: a two-layer network for z = 3x + 2y + 1. Input nodes x and y in the input layer connect to the output node in the output layer through weights 3 and 2; the combination function is (SUM) and the transfer function adds the +1 constant.]
Two-Layer Neural Network
• Three major components:
  – Input layer
  – Output layer
  – A weight matrix in between
• Two functions (see the sketch below):
  – Combination function: usually SUM
  – Transfer or activation function: "squashes" (normalizes) the sum to a certain range
• Can represent ANY linear function between the input space and the output space
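A minimal sketch of the z = 3x + 2y + 1 example, treating the +1 constant as the transfer step as in the figure (the code is illustrative, not from the course materials):

import numpy as np

weights = np.array([3.0, 2.0])        # the weight matrix between the two layers

def feed_forward(x, y):
    s = weights @ np.array([x, y])    # combination function: SUM
    return s + 1.0                    # transfer step for this toy example

feed_forward(1.0, 1.0)                # 3 + 2 + 1 = 6.0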
Neural Network
[Figure: the same two-layer network, now with input nodes x and y, weights 3 and 2, a SUM combination function, and a sigmoid transfer function.]
Neural Networks
• How about non-linear relationships? Throw in another layer: the hidden layer
• Theoretically, a neural net with the structure below can represent ANY function between the input space and the output space
[Figure: a network with input nodes i1 and i2, hidden nodes H1 and H2, output nodes O1-O3, and labeled weights (such as w111 and w122) on the connections between layers.]
Neural Networks
• Data flow:
[Figure: Age and Income enter input nodes i1 and i2; each hidden node computes a weighted sum and then its transfer value S(H1), S(H2); each output node computes a weighted sum and then S(O1), S(O2), yielding the Churn and Not Churn outputs.]
Neural Network
• Feed-forward:
  – The above process, in which the input values are transformed through the network to produce the output values, is called FEED-FORWARD
• When we get new records, we feed them forward to get the prediction values (a minimal sketch follows)
• But how do we produce a network that can predict?
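A hedged sketch of feed-forward for the Age/Income network above; the 2x2 weight shapes are assumptions for illustration, and the random initialization range (-0.5, 0.5) follows the training slide further below:

import numpy as np

def sigmoid(s):
    # Transfer function: squashes the sum into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-s))

W1 = np.random.uniform(-0.5, 0.5, size=(2, 2))  # input -> hidden weights
W2 = np.random.uniform(-0.5, 0.5, size=(2, 2))  # hidden -> output weights

def feed_forward(record):
    # record = [age, income]; each layer computes SUM, then the transfer value.
    hidden = sigmoid(W1 @ record)     # S(H1), S(H2)
    return sigmoid(W2 @ hidden)       # S(O1), S(O2): Churn / Not Churn scores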
Neural Nets
[Figure: the data set is split into a training set and a testing set; an initial neural net goes through training to become a trained neural net, then through testing to become a trained net with a performance measurement.]
Neural Net
• Split the data set (sketched below):
  – Classifier and error:
    • 2/3 for training, 1/3 for testing
  – Ten-fold validation:
    • 9/10 for training, 1/10 for testing; repeat ten times
    • Use this when the sample size is small
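Both schemes in a short Python sketch (the function names are mine):

import random

def holdout_split(samples, train_frac=2/3):
    # Holdout: 2/3 of the samples for training, 1/3 for testing.
    shuffled = random.sample(samples, len(samples))
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def ten_fold(samples):
    # Ten-fold validation: each tenth serves as the test set exactly once.
    shuffled = random.sample(samples, len(samples))
    folds = [shuffled[i::10] for i in range(10)]
    for i in range(10):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]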
Training a Neural Net
• Step 1. Set up an initial neural net:
  – Choose the numbers of input, output, and hidden layer nodes (node values initialized to 0)
  – The weight matrices are often set to small random values in (-0.5, 0.5)
• Step 2. Feed-forward:
  – Use historical data: run the predictor values through the net and get the output
[Figure: the training loop: initialize, then feed-forward (guess), then back-propagation (learn), repeated.]
Neural Net
• Step 3: Back-propagation
  – Critical step: learning happens here
  – Compare the machine's result with the historical result:
    • error_i = o_i(real) - o_i(machine)
    • Based on this error, go BACK to the hidden-to-output weight matrix and change the weights so that the error becomes smaller
    • Requires calculus (derivatives of the error); just interpret it as looking for the minimum error on an error surface (see the sketch below)
    • Repeat the process until the error falls within an acceptable range
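A minimal sketch of one such weight change for a single hidden-to-output weight, assuming a sigmoid output and squared error (the derivation is simplified here, not the full back-propagation algorithm):

def update_weight(w, o_real, o_machine, hidden_value, learning_rate=0.1):
    # Gradient step on the error surface for E = 1/2 * (o_real - o_machine)**2.
    error = o_real - o_machine
    delta = error * o_machine * (1 - o_machine)   # sigmoid derivative: o * (1 - o)
    return w + learning_rate * delta * hidden_value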
Neural Net
• Tuning for the training phase
  – Topology: number of input, output, and hidden nodes
    • hidden = 1/2 (output + input) is a common rule of thumb
    • number of hidden layers: 1 is enough
  – Learning rate (0-1):
    • The rate at which weights can be modified from the previous weights
    • Very important for learning convergence and performance
  – Momentum:
    • An adjustment included when calculating weight modifications (see the sketch below)
    • Typically very small or zero; less important
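The learning rate and momentum combine in the weight change roughly as follows (a common formulation; the exact rule in any given tool may differ):

def weight_change(gradient_term, previous_change, learning_rate, momentum=0.0):
    # New change = learning rate * gradient term + momentum * previous change.
    return learning_rate * gradient_term + momentum * previous_change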
Neural Net
• Pros: very powerful (ANY function!)
• Cons:
  – Time-consuming
  – Black-box
• When and where to use it:
  – Complicated prediction problems
  – Visualization or understanding of the rules is not needed
  – Accuracy is very important
Summary
• Basics
  – Classification versus prediction
    • Mappings from input attributes to class labels
    • Data types of input attributes and class labels: numerical, categorical and ordinal
    • Data-type-based view and discovery-vs-predictive view
• Decision-tree induction method
  – Recursively partitions the data set to increase the purity (or information gain) level of class labels in individual partitions
  – Entropy function: a measure of diversity
  – Tree nodes correspond to partitions; links correspond to partitioning conditions
  – Pre-pruning or post-pruning removes unreliable tree branches caused by noise or outliers
Summary
• Neural Net
  – A neural net has the following components:
    • input layer, output layer, hidden layer
    • weight matrices
  – The input layer represents the input attributes
  – The output layer represents the output classes
  – The hidden layer and the matrices help to capture the mapping function
Summary
• Neural Net
  – To use a neural net, go through three steps:
    • Training: feed-forward, back-propagation
    • Testing: feed-forward only, used to measure the accuracy of the model built
    • Prediction: feed-forward without measuring performance
  – Most of the tuning occurs in the training phase:
    • number of hidden layer nodes
    • learning rate
    • momentum
• Readings: T2, Ch. 7.1 – 7.3.3 and Ch. 7.2