Predictive Analytics using Machine Learning
Praisan Padungweang, Ph.D.
Model evaluation
The Confusion Matrix
A confusion matrix shows the number of correct and incorrect decisions made by the model, compared to the actual labels (target) in the data.
For a problem involving n classes, it is an n × n matrix with the rows labeled with actual classes and the columns labeled with predicted classes.
[Figure: layouts of a 2 × 2 confusion matrix (classes T/F) and a 3 × 3 confusion matrix (classes a/b/c), with rows labeled by actual class and columns by predicted class]
The Confusion Matrix
The relationship between classes can be depicted as a 2 × 2 confusion matrix:
◦ True Positive (TP): Correctly classified as the class of interest
◦ True Negative (TN): Correctly classified as not the class of interest
◦ False Positive (FP): Incorrectly classified as the class of interest
◦ False Negative (FN): Incorrectly classified as not the class of interest
                 Predicted
                 T                                   F
Actual  T   True Positive (TP)                  False Negative (FN, Type II error)
        F   False Positive (FP, Type I error)   True Negative (TN)
Accuracy = (TP + TN) / (TP + FN + FP + TN)
Model evaluation for multiple classes

Confusion matrix for k classes:

                      Predicted status
                P1      P2      P3     …    Pk
Actual    A1   A1P1    A1P2    A1P3    …   A1Pk
status    A2   A2P1    A2P2    A2P3    …   A2Pk
          A3   A3P1    A3P2    A3P3    …   A3Pk
           ⋮
          Ak   AkP1    AkP2    AkP3    …   AkPk

where AiPj is the number of instances of actual class Ai predicted as class Pj; the diagonal cells AiPi are the correct decisions.

Accuracy = (A1P1 + A2P2 + A3P3 + … + AkPk) / n
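For a k-class matrix, accuracy is simply the diagonal sum over n. A minimal Python sketch (the 3 × 3 counts below are hypothetical, for illustration only):

```python
def multiclass_accuracy(cm):
    """Accuracy from a k x k confusion matrix: sum of the
    diagonal (correct decisions) divided by the total count n."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    n = sum(sum(row) for row in cm)
    return correct / n

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm = [[50, 3, 2],   # actual A1
      [4, 40, 6],   # actual A2
      [1, 5, 39]]   # actual A3
print(multiclass_accuracy(cm))  # 129/150 = 0.86
```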
Model evaluation for binary classes

    Customer   Actual churn   Predicted probability of churn
1   John       Yes            0.72
2   Sophie     No             0.56
3   David      Yes            0.44
4   Emma       No             0.18
5   Bob        No             0.36

With a 0.5 cutoff, the confusion matrix is:

                       Predicted
                       churn        no churn
Actual   churn        1 (John)      1 (David)
         no churn     1 (Sophie)    2 (Emma, Bob)

Accuracy = (TP + TN) / n = (1 + 2) / 5 = 0.6
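These counts can be reproduced in a few lines of Python (a sketch; the 0.5 cutoff is an assumption consistent with the matrix above):

```python
# Actual churn labels and predicted churn probabilities from the table
actual = {"John": 1, "Sophie": 0, "David": 1, "Emma": 0, "Bob": 0}
prob   = {"John": 0.72, "Sophie": 0.56, "David": 0.44, "Emma": 0.18, "Bob": 0.36}

# Apply a 0.5 cutoff to turn probabilities into class predictions
pred = {name: int(p >= 0.5) for name, p in prob.items()}

tp = sum(1 for n in actual if actual[n] == 1 and pred[n] == 1)  # John
fn = sum(1 for n in actual if actual[n] == 1 and pred[n] == 0)  # David
fp = sum(1 for n in actual if actual[n] == 0 and pred[n] == 1)  # Sophie
tn = sum(1 for n in actual if actual[n] == 0 and pred[n] == 0)  # Emma, Bob

accuracy = (tp + tn) / len(actual)
print(tp, fn, fp, tn, accuracy)  # 1 1 1 2 0.6
```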
Accuracy = (TP + TN) / N
Actual Class   Prob. of "1"       Actual Class   Prob. of "1"
1              0.996              1              0.506
1              0.988              0              0.471
1              0.984              0              0.337
1              0.980              1              0.218
1              0.948              0              0.199
1              0.889              0              0.149
1              0.848              0              0.048
0              0.762              0              0.038
1              0.707              0              0.025
1              0.681              0              0.022
1              0.656              0              0.016
0              0.622              0              0.004
[Blank 2 × 2 confusion matrix to fill in: rows = actual status (1/0), columns = predicted status (1/0)]
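Filling in this matrix for the 24 records above is mechanical (a sketch; the 0.5 cutoff is an assumption, and other cutoffs give a different matrix):

```python
# The 24 (actual class, predicted probability of "1") records from the table
records = [
    (1, 0.996), (1, 0.988), (1, 0.984), (1, 0.980), (1, 0.948), (1, 0.889),
    (1, 0.848), (0, 0.762), (1, 0.707), (1, 0.681), (1, 0.656), (0, 0.622),
    (1, 0.506), (0, 0.471), (0, 0.337), (1, 0.218), (0, 0.199), (0, 0.149),
    (0, 0.048), (0, 0.038), (0, 0.025), (0, 0.022), (0, 0.016), (0, 0.004),
]

cutoff = 0.5  # assumed cutoff
tp = sum(1 for y, p in records if y == 1 and p >= cutoff)
fn = sum(1 for y, p in records if y == 1 and p < cutoff)
fp = sum(1 for y, p in records if y == 0 and p >= cutoff)
tn = sum(1 for y, p in records if y == 0 and p < cutoff)

accuracy = (tp + tn) / len(records)
print(tp, fn, fp, tn, accuracy)  # 11 1 2 10 0.875
```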
Other Evaluation Metrics
There are other evaluation metrics that can be calculated from the confusion matrix:
◦ Sensitivity and specificity
◦ Precision and Recall
◦ F-measure
Sensitivity and specificity
                 Predicted
                 T                                   F
Actual  T   True Positive (TP)                  False Negative (FN, Type II error)
        F   False Positive (FP, Type I error)   True Negative (TN)

True positive rate, Sensitivity, Recall = TP / (TP + FN)
True negative rate, Specificity = TN / (FP + TN)
Positive predictive value, Precision = TP / (TP + FP)
Accuracy = (TP + TN) / (TP + FN + FP + TN)
F-score = (2 × Precision × Recall) / (Precision + Recall)
For example, in a spam filtering problem:
◦ a sensitivity of 0.842 implies that 84.2 percent of spam messages were correctly classified.
◦ a specificity of 0.996 implies that 99.6 percent of non-spam messages were correctly classified, or alternatively, that 0.4 percent of valid messages were rejected as spam.
The idea of rejecting 0.4 percent of valid email messages may be unacceptable.
Precision and recall

Positive predictive value, Precision = TP / (TP + FP)
◦ When a model predicts the positive class, how often is it correct?
◦ A precise model will only predict the positive class in cases very likely to be positive. It will be very trustworthy.

True positive rate, Sensitivity, Recall = TP / (TP + FN)
◦ A model with high recall captures a large portion of the positive examples. For example, a search engine with high recall returns a large number of documents pertinent to the search query.

Having both high precision and high recall at the same time is very challenging.
F-measure
A measure of model performance that combines precision and recall into a single number is known as the F-measure (also sometimes called the F1 score or the F-score):

F1 = (2 × Precision × Recall) / (Precision + Recall)

Since the F-measure reduces model performance to a single number, it provides a convenient way to compare several models side by side.
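Putting the formulas above together in a small helper, applied to the earlier five-customer churn example (TP = 1, FN = 1, FP = 1, TN = 2):

```python
def metrics(tp, fn, fp, tn):
    """Evaluation metrics derived from a 2 x 2 confusion matrix."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)   # sensitivity, true positive rate
    specificity = tn / (fp + tn)   # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

# Churn example from earlier: TP=1, FN=1, FP=1, TN=2
print(metrics(1, 1, 1, 2))  # (0.5, 0.5, ≈0.667, 0.5)
```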
Problems with Unequal Costs and Benefits
Accuracy makes no distinction between false positive and false negative errors.
◦ It makes the tacit assumption that both errors are equally important.
◦ In real-world domains, this is rarely the case.
These two errors are very different, should be counted separately, and should have different costs.
For example, in cancer screening:
◦ False positive (model: cancer; actual: not): the patient would be given further tests, which are expensive, inconvenient, and stressful.
◦ False negative (model: not; actual: cancer): do nothing!
A Key Analytical Framework
The general form of an expected value calculation:

EV = p(o1)·v(o1) + p(o2)·v(o2) + … = Σᵢ p(oᵢ)·v(oᵢ)

◦ oᵢ is a possible decision outcome;
◦ p(oᵢ) is its probability;
◦ v(oᵢ) is its business value.
The probabilities can often be estimated from the data.
The business values often need to be acquired from other sources:
◦ usually the values must come from external domain knowledge.
Expected Value for Model Evaluation
In targeted marketing, for example, each consumer needs to be classified as a likely responder or not, so that we can target the likely responders.
Cost/profit:
◦ A consumer buys the product for $200, and our product-related costs are $100.
◦ We mail some marketing materials, and the overall cost including postage is $1.
Yielding:
◦ a value (profit) of $99 if the consumer responds (buys the product);
◦ a cost of $1, or equivalently a benefit of -$1, if the consumer does not respond.
Cost-Benefit matrix

                Predicted
                R       N
Actual   R     99       0
         N     -1       0
Expected Value for Model Evaluation: Targeted marketing

Model confusion matrix (counts, n = 2,000):

                Predicted
                R       N
Actual   R     150     150
         N     200   1,500

Accuracy = (150 + 1,500) / 2,000 = 82.5%

Dividing each count by n = 2,000 gives the rate matrix:

                Predicted
                R       N
Actual   R    0.075   0.075
         N    0.1     0.75

Multiplying cell by cell with the cost-benefit matrix (R row: 99, 0; N row: -1, 0):

                Predicted
                R       N
Actual   R    7.425     0
         N   -0.1       0

Expected value = 7.425 - 0.1 = 7.325
Expected Value for Model Evaluation: Targeted marketing (baseline)

A second model predicts that nobody responds:

                Predicted
                R       N
Actual   R      0     300
         N      0   1,700

Accuracy = 1,700 / 2,000 = 85%

Rate matrix (divide by 2,000):

                Predicted
                R       N
Actual   R      0    0.15
         N      0    0.85

Multiplying by the same cost-benefit matrix gives all zeros, so the expected value = 0.
Despite its higher accuracy (85% vs. 82.5%), this model generates no profit.
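The expected-value calculation for both models can be sketched in Python; the function simply sums rate × value over the four cells:

```python
def expected_value(counts, cost_benefit):
    """Expected value per instance: sum over all confusion-matrix
    cells of rate(cell) * value(cell)."""
    n = sum(sum(row) for row in counts)
    return sum(counts[i][j] / n * cost_benefit[i][j]
               for i in range(2) for j in range(2))

cb = [[99, 0], [-1, 0]]            # rows: actual R, N; columns: predicted R, N
model    = [[150, 150], [200, 1500]]
baseline = [[0, 300], [0, 1700]]

print(expected_value(model, cb))     # ≈ 7.325
print(expected_value(baseline, cb))  # 0.0
```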
Expected Value for Model Evaluation: Churn prediction

Cost-Benefit matrix:

                 Predicted
                 churn    not
Actual  churn    -10    -100
        not      -10       0

Model 1 (n = 10,000):

                 Predicted
                 churn    not
Actual  churn     100      50
        not       150   9,700

Accuracy = 98%, Expected value = -0.75

Model 2 (n = 10,000):

                 Predicted
                 churn    not
Actual  churn       0     150
        not         0   9,850

Accuracy = 98.5%, Expected value = -1.5

Model 2 has the higher accuracy but the worse expected value.
Problems with Unbalanced Classes
Consider a domain where the classes appear in a 999:1 ratio.
◦ A simple rule (always choose the most prevalent class) gives 99.9% accuracy.
Skews of 1:100 are common in fraud detection.
In churn data, the baseline churn rate is approximately 10% per month.
◦ If we simply classify everyone as negative, we achieve an accuracy of 90%!
Problems with Unbalanced Classes

Model 1 (Accuracy = 98%):

                 Predicted
                 churn    not
Actual  churn     100      50
        not       150   9,700

Model 2 (Accuracy = 98.5%):

                 Predicted
                 churn    not
Actual  churn       0     150
        not         0   9,850
Other Machine Learning Models
Decision trees
Decision trees are recursive partitioning algorithms (RPAs) that produce a tree-like structure representing patterns in an underlying data set.
[Figure: example decision tree]
Decision trees
The top node is the root node.
◦ It specifies a test condition whose outcome corresponds to a branch leading to an internal node.
The terminal nodes of the tree assign the classifications and are also referred to as the leaf nodes.
[Figure: tree with parent node, child nodes, and leaf nodes labeled "Not Respond" and "Respond"]
Decision trees
Many algorithms have been suggested to construct decision trees. Amongst the most popular are C4.5, CART, and CHAID.
These algorithms differ in how they answer the key decisions in building a tree:
Splitting decision:
◦ Which variable to split, and at what value (e.g., age < 30 or not; income < 1,000 or not; marital status = married or not)?
Stopping decision:
◦ When to stop growing the tree?
Assignment decision:
◦ What class (e.g., good or bad customer) to assign to a leaf node?
Decision trees: Splitting decision
The splitting decision uses the concept of impurity.
Consider three nodes containing good (unfilled circles) and bad (filled circles) customers:
◦ Minimal impurity occurs when all customers are either good or bad.
◦ Maximal impurity occurs when a node has the same number of good and bad customers.
[Figure: three candidate nodes split on features X1, X2, and X3]
Decision trees: Splitting decision
Decision trees aim at minimizing the impurity in the data. The most popular measures are:
Entropy: E(S) = -pG·log2(pG) - pB·log2(pB) (C4.5)
Gini: Gini(S) = 2·pG·pB (CART)
with pG and pB being the proportions of class G (good) and class B (bad), respectively.
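Both impurity measures are one-liners; a small sketch in Python, checking the minimal- and maximal-impurity cases:

```python
import math

def entropy(p_good):
    """Entropy of a node: -pG*log2(pG) - pB*log2(pB)."""
    p_bad = 1.0 - p_good
    if p_good == 0.0 or p_bad == 0.0:
        return 0.0  # a pure node has zero entropy (0*log 0 is taken as 0)
    return -(p_good * math.log2(p_good) + p_bad * math.log2(p_bad))

def gini(p_good):
    """Gini impurity of a node: 2*pG*pB."""
    return 2 * p_good * (1 - p_good)

# Maximal impurity at a 50/50 node, minimal impurity at a pure node
print(entropy(0.5), gini(0.5))  # 1.0 0.5
print(entropy(1.0), gini(1.0))  # 0.0 0.0
```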
Decision trees: Stopping criterion
The tree can learn to fit the specifics or noise of the data, which is also referred to as overfitting.
The data should be split into a training sample and a validation sample:
◦ The training sample is used to make the splitting decisions.
◦ The validation sample is an independent sample used to monitor the misclassification error.
Stopping criteria: Spark parameters
◦ maxDepth: maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
◦ minInstancesPerNode: for a node to be split further, each of its children must receive at least this number of training instances.
◦ minInfoGain: for a node to be split further, the split must improve at least this much (in terms of information gain).
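A minimal sketch of how these three checks combine into a stop/split decision. This mirrors the parameter semantics described above, not Spark's actual implementation; the function name and the example numbers are illustrative:

```python
def can_split(depth, max_depth, child_counts, min_instances_per_node,
              info_gain, min_info_gain):
    """Return True only if a candidate split passes all three stopping checks."""
    if depth >= max_depth:                      # node already at maximum depth
        return False
    if any(c < min_instances_per_node for c in child_counts):
        return False                            # a child would be too small
    if info_gain < min_info_gain:               # split does not improve enough
        return False
    return True

# A split at depth 3 with children of 40 and 25 instances and gain 0.02:
print(can_split(3, 5, [40, 25], 10, 0.02, 0.01))  # True
# The same split fails if each child must hold at least 30 instances:
print(can_split(3, 5, [40, 25], 30, 0.02, 0.01))  # False
```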
Decision trees: Assignment decision
The assignment decision typically looks at the majority class within the leaf node.
[Figure: leaf nodes labeled "Bad" and "Good" by majority class]
Decision trees
Decision trees essentially model decision boundaries orthogonal to the axes.
[Figure: decision boundary of a decision tree]
Decision trees
Decision trees can be used for various purposes in analytics:
Input selection:
◦ Attributes that occur at the top of the tree are more predictive of the target.
Initial segmentation:
◦ Build a tree of two or three levels deep as the segmentation scheme, then use second-stage machine learning models for further refinement.
Final analytical model to be used directly in production:
◦ It gives a white-box model with a clear explanation of how it reaches its classifications.
Model decision boundaries
[Figure: decision boundaries of decision trees, logistic regression, and neural networks]
Neural networks
A mathematical representation inspired by the functioning of the human brain.
Another, more realistic perspective sees neural networks as generalizations of existing machine learning models.
[Figure: a single neuron combining inputs with weights w0, w1, w2 and applying a transformation f(·)]
Neural networks vs. linear regression
[Figure: neuron with a bias input and feature inputs feeding a linear activation f(·)]

f(z) = z
z = θ0 + θ1·Age + θ2·Income
Neural networks vs. logistic regression
[Figure: the same neuron with a sigmoid activation f(·)]

f(z) = 1 / (1 + e^(-z))
z = θ0 + θ1·Age + θ2·Income
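The two comparisons above can be condensed into a few lines of Python: the same weighted sum z becomes linear regression under an identity activation and logistic regression under a sigmoid (the weights below are made up purely for illustration):

```python
import math

def neuron(age, income, theta, activation):
    """A single neuron: weighted sum z followed by an activation f(z)."""
    z = theta[0] + theta[1] * age + theta[2] * income
    return activation(z)

identity = lambda z: z                       # -> linear regression
sigmoid  = lambda z: 1 / (1 + math.exp(-z))  # -> logistic regression

theta = (-2.0, 0.05, 0.001)  # hypothetical weights (bias, age, income)
print(neuron(30, 500, theta, identity))  # z = -2 + 1.5 + 0.5 ≈ 0.0
print(neuron(30, 500, theta, sigmoid))   # sigmoid(0) ≈ 0.5
```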
Neural networks: Single Layer Perceptron

Customer   Age (x1)   Income (x2)   Gender (x3)   Response   y
John       30         1,500         M             No         0
Sarah      31         800           F             Yes        1
Sophie     52         1,800         F             Yes        1
David      48         2,000         M             No         1
Peter      34         1,800         M             Yes        0

Weights:
w0 (bias, intercept) = 1.643
w1 (Age)             = 77.097
w2 (Income)          = -1.695
w3 (Gender)          = -2.996

[Figure: perceptron with constant input x0 = 1 and inputs Age, Income, Gender feeding the weighted sum]
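The forward pass can be reproduced directly (a sketch; the Gender encoding M = 1, F = 0 is an assumption, under which the thresholded perceptron output matches the final 0/1 column of the table):

```python
import math

# (Age, Income, Gender) rows from the table, with M = 1 and F = 0 (assumed)
X = [(30, 1500, 1), (31, 800, 0), (52, 1800, 0), (48, 2000, 1), (34, 1800, 1)]
w0, w1, w2, w3 = 1.643, 77.097, -1.695, -2.996  # weights from the slide

def forward(age, income, gender):
    z = w0 + w1 * age + w2 * income + w3 * gender
    p = 1 / (1 + math.exp(-z)) if z > -700 else 0.0  # sigmoid, underflow guard
    return int(p >= 0.5)  # threshold the probability at 0.5

print([forward(*x) for x in X])  # [0, 1, 1, 1, 0]
```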
Neural networks: Multilayer Perceptron (MLP)
[Figure: Layer 1 (input layer), Layer 2 (hidden layer), Layer 3 (output layer)]
Neural networks
Each node has a transformation function f(·), also called an activation function. The most popular activation functions are:
Linear, ranging between -∞ and +∞: f(z) = z
Sigmoid (logistic), ranging between 0 and 1: f(z) = 1 / (1 + e^(-z))
Hyperbolic tangent, ranging between -1 and +1: f(z) = (e^z - e^(-z)) / (e^z + e^(-z))
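The three activations and their ranges can be checked directly (a small sketch implementing the formulas above):

```python
import math

def linear(z):
    return z

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

print(linear(2.0))    # 2.0
print(sigmoid(0.0))   # 0.5: sigmoid is centred on 0.5
print(tanh(0.0))      # 0.0: tanh is centred on 0
print(sigmoid(10.0) < 1, tanh(10.0) < 1)  # True True: both are bounded above
```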
Selecting an activation function
Hidden layer: logistic, hyperbolic tangent, or linear.
Output layer:
◦ For classification (e.g., churn, response, fraud), it is common practice to adopt a logistic transformation in the output layer, since the outputs can then be interpreted as probabilities.
◦ For regression targets: linear; or linear, logistic, or hyperbolic tangent for a normalized target.
Model Comparison: Held-out test data
The data is divided into a training set and a test set.
◦ The training set is used for model creation (training and validation).
◦ The test set is held out for model selection.
[Figure: models are created on the training set; their performance on the held-out test set determines the selected model]
Model Comparison: Cross-validation
Cross-validation can also be used for model comparison:
◦ k-fold cross-validation
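k-fold cross-validation partitions the data into k folds; each fold serves once as the validation set while the remaining folds form the training set. A stdlib-only sketch of the index bookkeeping:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds; yield (train, validation)
    index lists, with each fold used once for validation."""
    folds = [list(range(n))[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in kfold_indices(10, 5):
    print(sorted(val), len(train))  # each fold: 2 validation, 8 training
```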
Demo
Hands-on machine learning using Spark, in class:
◦ Data preprocessing
◦ Model training
◦ Model evaluation
◦ Model deployment