Week 5: Evaluating Model Performance
Unit 1: Model Performance Metrics
© 2020 SAP SE or an SAP affiliate company. All rights reserved.
Model Performance Metrics
Introduction
From Rexer Analytics 3rd Annual Data Miner Survey by Karl Rexer, PhD, Heather N. Allen, PhD and Paul Gearan
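[Figure: bar chart of survey responses. Surviving category labels: model performance (lift, R², etc.), improve efficiency, customer service improvements, results were published. The bar values range from 13% to 58% and cannot be matched to individual categories here.]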
The following performance metrics are often used to assess classification model success:
− Type I (false positive) and Type II (false negative) errors
− Lift, area under the curve (AUC), and other metrics
− Predictive power and prediction confidence metrics
… value.
… are to the true value.
[Figure: unbiased versus biased estimates]
[Figure labels: − Benefit Value, + Benefit Value]
[Figure: ROC curve with baseline]
[Figure: lift chart; lift (0 to 3.5) plotted against 0 to 100 on the horizontal axis]
CUMULATIVE GAINS CHART
[Figure: cumulative gains chart built up over four slides, with Random, Perfect and Model curves and areas labeled Targets and Non-Targets. At 33% of the population, random selection finds 33% of the targets, the model finds 66%, and a perfect model finds 100%.]
SAP metrics – Predictive power and prediction confidence
Predictive Power Validation ≈ C/(A+B+C)
Predictive Power Estimation ≈ (B+C)/(A+B+C)
Prediction Confidence ≈ 1 − B/(A+B+C)
[Figure: gains chart with Wizard and Random curves. X-axis: population in %, ranked in descending order of their score value. Y-axis: % of target detected. A, B and C label areas in the chart.]
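The three ratios above are simple to evaluate once the areas have been measured. The following is a minimal sketch only: it assumes A, B and C are areas read from the chart, and the numeric values are hypothetical, chosen purely for illustration.

```python
def predictive_power_validation(a: float, b: float, c: float) -> float:
    """Predictive Power (validation) ≈ C / (A + B + C)."""
    return c / (a + b + c)


def predictive_power_estimation(a: float, b: float, c: float) -> float:
    """Predictive Power (estimation) ≈ (B + C) / (A + B + C)."""
    return (b + c) / (a + b + c)


def prediction_confidence(a: float, b: float, c: float) -> float:
    """Prediction Confidence ≈ 1 - B / (A + B + C)."""
    return 1.0 - b / (a + b + c)


# Hypothetical areas, not taken from the slides.
area_a, area_b, area_c = 0.10, 0.02, 0.28

power_validation = predictive_power_validation(area_a, area_b, area_c)
power_estimation = predictive_power_estimation(area_a, area_b, area_c)
confidence = prediction_confidence(area_a, area_b, area_c)

print(f"Predictive power (validation): {power_validation:.2f}")
print(f"Predictive power (estimation): {power_estimation:.2f}")
print(f"Prediction confidence:         {confidence:.2f}")
```

A small B relative to the total area gives a prediction confidence close to 1.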
[Figure: regression model line (the “fitted” values) plotted with the actual “observed” values]
After you have fit a model using regression analysis with a continuous target variable, you need to determine
how well the model fits the data.
Linear regression (ordinary least squares) calculates the equation that minimizes the sum of the squared vertical distances between the fitted line and the data points.
In general, a model fits the data well if the differences between the observed values and the model’s
predicted values are small and unbiased.
Residual = Observed value - Fitted value
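As an illustration of the residual definition, the sketch below fits a straight line by ordinary least squares in plain Python and computes one residual per observation. The data points and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Residual = Observed value - Fitted value
residuals = [y - f for y, f in zip(ys, fitted)]

for x, y, f, r in zip(xs, ys, fitted, residuals):
    print(f"x={x:.1f} observed={y:.2f} fitted={f:.2f} residual={r:+.2f}")
```

For a least-squares line with an intercept, the residuals sum to zero, which is a quick check that the fit is unbiased on the training data.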
The following performance metrics are often used to assess regression model (with a continuous target) success:
− Mean absolute error (MAE): mean of the absolute values of the differences between predictions and actual results (based on the city block distance, or Manhattan distance)
− Root mean square error (RMSE): square root of the mean of the squared errors (based on the Euclidean distance)
− Maximum error: maximum absolute difference between predicted and actual values (the Chebyshev distance)
− Error mean: mean of the difference between predictions and actual values
− … actual result
− R² (coefficient of determination): ratio between the variability of the prediction and the variability of the data.
The error indicators should be as close to zero as possible; R² should be high (its maximum is 1).
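The indicators listed above can be computed in a few lines. This is a sketch under stated assumptions: the sample values are hypothetical, and R² is computed with the common formulation 1 − SS_res/SS_tot, which coincides with the ratio definition for a least-squares fit with an intercept.

```python
import math


def regression_metrics(actual, predicted):
    """Return the regression indicators for paired actual and predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "max_error": max(abs(e) for e in errors),
        "error_mean": sum(errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical values.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The error mean is signed, so a value near zero can hide large errors that cancel out; reading it together with MAE or RMSE avoids that trap.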
Please see the appendix for more
information
Choose the performance metric that most closely matches the business objectives defined at the beginning of the project during the Business Understanding phase.
The choice of metric is of great importance, because the model selected based on one metric may not be a good model for a different metric.
Notation

[Table: confusion matrix notation. Surviving fragments: False Positive Rate; the proportion of instances classified …]
… assigned to the negative target):
A random selection (with no predictive model) would classify 40%
of the positive targets correctly as True Positive.
A perfect predictive model would classify 100% of the positive
targets as True Positive.
The predictive model created by Smart Predict (the validation
curve) would classify 96% of the positive targets as True Positive.
The ROC curve portrays how well a model discriminates in terms of the trade-off between sensitivity and specificity, or, in effect, between correct and mistaken detection, as the detection threshold is varied.
− Sensitivity is the proportion of CORRECTLY identified cases (true positives) out of all actual positives on the validation dataset.
− 1 − Specificity is the proportion of INCORRECT assignments made (false positives) out of all actual negatives on the validation dataset.
− Specificity is the proportion of CORRECT assignments to the non-target class (true negatives).
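To make the effect of varying the detection threshold concrete, the sketch below computes one ROC point per threshold and the area under the curve with the trapezoidal rule. The labels and scores are hypothetical, and `roc_points` and `auc` are illustrative helper names rather than any product API.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; every score at or above it is flagged positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical validation data: 1 = target, 0 = non-target.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points)
print(f"AUC = {area:.4f}")
```

A random model follows the diagonal (AUC of 0.5), while a perfect model reaches the top-left corner (AUC of 1).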
Sensitivity example calculation
Sensitivity = 50% and
Total Number of Customers = 30
Total Number of Churners = 10
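The figures in this example can be checked with a short calculation. With 10 churners, a sensitivity of 50% implies 5 correctly identified churners. The number of false positives is not given in this extract, so the value used below is a hypothetical assumption for illustrating specificity.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of actual positives that are correctly identified."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Proportion of actual negatives that are correctly identified."""
    return true_negatives / (true_negatives + false_positives)


total_customers = 30
churners = 10
non_churners = total_customers - churners

true_positives = 5                    # implied by a sensitivity of 50% on 10 churners
false_negatives = churners - true_positives
false_positives = 4                   # hypothetical, not given in the slides
true_negatives = non_churners - false_positives

sens = sensitivity(true_positives, false_negatives)
spec = specificity(true_negatives, false_positives)
print(f"Sensitivity = {sens:.0%}, Specificity = {spec:.0%}")
```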
Sensitivity and specificity extreme examples
Sensitivity example curve
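If we use the following notation (the symbols are assumed here, because the original notation is missing from this extract): $y_i$ is the actual value and $\hat{y}_i$ the predicted value for observation $i$, and $n$ is the number of observations.
Definition (mean absolute error): mean of the absolute values of the differences between predictions and actual results (city block distance or Manhattan distance)
Formula: $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$
Definition (root mean square error): square root of the mean of the quadratic errors (Euclidean distance or root mean squared error – RMSE)
Formula: $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$
Definition (maximum error): maximum absolute difference between predicted and actual values (upper bound) (Chebyshev distance)
Formula: $\mathrm{MaxError} = \max_{1 \le i \le n} |\hat{y}_i - y_i|$
Success criteria for regression models – Coefficient of determination (R²)
Definition: ratio between the variability (sum of squares) of the prediction and the variability (sum of squares) of the data.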
© 2020 SAP SE or an SAP affiliate company. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of
SAP SE or an SAP affiliate company.
The information contained herein may be changed without prior notice. Some software products marketed by SAP SE and its
distributors contain proprietary software components of other software vendors. National product specifications may vary.
These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or
warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials.
The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty
statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional
warranty.
In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business outlined in this document or
any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation,
and SAP SE’s or its affiliated companies’ strategy and possible future developments, products, and/or platforms, directions, and
functionality are all subject to change and may be changed by SAP SE or its affiliated companies at any time for any reason
without notice. The information in this document is not a commitment, promise, or legal obligation to deliver any material, code, or
functionality. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ
materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, and they
should not be relied upon in making purchasing decisions.
SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered
trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names
mentioned are the trademarks of their respective companies.
See www.sap.com/copyright for additional trademark information and notices.
www.sap.com/contactsap
Unit 2: Model Testing
Model Testing
To test the strength of classification models, many data
scientists use gains charts, lift charts and decile tables to
measure the performance of the model against random
guessing, or what the results would be if you didn’t use any
model.
A direct mailing was sent, without using a predictive model, to 10,000 customers.
The response rate was 20% (there were 2,000 positive responses).
http://www.dmstat1.com/res/DecileAnalysisPrimer.html
https://select-statistics.co.uk/blog/cumulative-gains-and-lift-curves-measuring-the-performance-of-a-marketing-campaign/
Decile   Customers Contacted   Positive Responses   Response Rate
…
Total    10000                 2000                 20.0%
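A decile table of this kind can be built by ranking customers by model score, splitting them into ten equal bins, and accumulating the responses. The sketch below is an illustration only: it uses 100 hypothetical customers with a 20% response rate instead of the 10,000 in the example, and the per-decile response counts are invented.

```python
def decile_table(scores, responses, bins=10):
    """Rank by score (highest first) and report cumulative gains and lift per bin."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_responders = sum(responses)
    rows = []
    cumulative = 0
    for d in range(bins):
        start = d * n // bins
        end = (d + 1) * n // bins
        responders = sum(r for _, r in ranked[start:end])
        cumulative += responders
        pct_contacted = 100.0 * end / n
        gain = 100.0 * cumulative / total_responders
        rows.append({
            "decile": d + 1,
            "customers": end - start,
            "responders": responders,
            "cum_gain_pct": gain,              # % of all responders reached so far
            "cum_lift": gain / pct_contacted,  # improvement over random selection
        })
    return rows


# Hypothetical data: responders concentrated in the highest-scoring deciles.
counts_per_decile = [8, 5, 3, 2, 1, 1, 0, 0, 0, 0]
scores, responses = [], []
for i in range(100):
    scores.append((100 - i) / 100)
    decile, position = divmod(i, 10)
    responses.append(1 if position < counts_per_decile[decile] else 0)

table = decile_table(scores, responses)
for row in table:
    print(row)
```

In this hypothetical data, contacting the top 40% of customers reaches 90% of the responders, a cumulative lift of 2.25 over random selection. The last row always shows a gain of 100% and a lift of 1.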
Direct mailing example – Cumulative gains chart
[Figure: cumulative gains chart with both axes from 0 to 100 and a Random baseline. An annotation marks the top 40% of customers with the highest model scores.]
[Figure: lift chart. X-axis: % Customers Contacted. Y-axis: Lift. A Random baseline is shown.]
In this unit you have been introduced to decile analysis
and how you can use it to analyze the power of the
classification models you build.
Unit 3: Improving Model Performance
Improving Model Performance
Add more data
More data generally leads to more accurate models.
Treat missing values and outliers
Missing and outlier values in training data can reduce accuracy.
To read more about feature engineering see:
https://en.wikipedia.org/wiki/Feature_engineering
https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
… that best explain the target
Using multiple algorithms
Multiple algorithms might increase accuracy
… parameter to improve accuracy
[Figure: several models combined into a final model]
… accuracy and robustness
[Figure: cross-validation with five folds (Fold 1 to Fold 5). The complete data is split into training and test partitions, a different partition is held out for testing in each fold, and performance metrics are collected across the folds.]
To learn more about cross-validation see:
https://en.wikipedia.org/wiki/Cross-validation_(statistics)#:~:text=Cross%2Dvalidation%2C%20
https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f
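The idea behind k-fold cross-validation can be sketched in plain Python. This is a simplified illustration, not a recommended implementation: the "model" merely predicts the mean of the training values, the data is hypothetical, and the folds are contiguous (in practice the data is usually shuffled first).

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate_mean_model(values, k=5):
    """Return the mean absolute error of a mean-predicting model on each fold."""
    fold_errors = []
    for train, test in k_fold_indices(len(values), k):
        prediction = sum(values[i] for i in train) / len(train)  # "fit" on training folds
        mae = sum(abs(values[i] - prediction) for i in test) / len(test)
        fold_errors.append(mae)
    return fold_errors


# Hypothetical target values 1..10.
values = [float(v) for v in range(1, 11)]

fold_errors = cross_validate_mean_model(values, k=5)
average_error = sum(fold_errors) / len(fold_errors)
print("Error per fold:", fold_errors)
print(f"Average error: {average_error:.2f}")
```

Each observation is used for testing exactly once, so the averaged metric is less dependent on one particular train/test split.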
This unit introduced you to some of the techniques that
are commonly used to improve the performance of your
predictive models.
You can:
− Add more data
− Treat missing values and outliers
− Use multiple algorithms
− Use ensemble methods
− Use cross-validation techniques
Of course, one of the most common ways of improving the accuracy of a forecast is to use a different algorithm!
Unit 4: Evaluation Phase – Overview
Evaluation Phase – Overview
[Figure: CRISP-DM phases: … Preparation, Modeling, Evaluation, Deployment]
… business objectives.
… budget constraints permit.
… respect to business success criteria
Output – Approved Model

… engagement to determine if there is any important factor or task that has somehow been overlooked.
− Identify any quality assurance issues.
Output – Review of Process
… activities that have been missed and/or should be repeated.
Task
Output – List of Possible Actions
− List the potential further actions along with the
reasons for and against each option.
Output – Decision
− Describe the decision on how to proceed.
of the CRISP-DM process.
There are three tasks:
Unit 5: Evaluating Model Performance
Evaluating Model Performance
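− Evaluate the model to assess the degree to which it meets the business objectives.
− Determine if there is some business reason why this model is deficient.
− Test the model in a production environment (if time and budget constraints permit).
… to perform this evaluation.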
When a model is operationalized, you might find that its behavior is different from your expectation. Therefore, you need to evaluate the model so that you can understand how its performance could deviate.
The deviation could be because of interactions with other models and systems, the variability of the real world, or even because of adversaries that are changing the behavior your data reflected when you initially trained the model.
One way to evaluate this performance is to use the decile analysis of the model build distribution as bin boundaries and then use the actual real-world results to estimate how many samples are in each bin.
Classification model example
You want to know if customers will buy your new product "P".
When training the model, the decile analysis will split the
data into 10 bins with 10% of customers in each bin, and you
can calculate the average probability per bin to purchase
product “P”:
When you apply the model and evaluate it on up-to-date real-world
data, you might find there is a change in the percentage of
customers in each bin:
[Table: one row per decile bin. The only surviving column shows the value 100 for each of the 10 bins.]
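The comparison described here can be sketched as follows: take the decile boundaries from the scores seen when the model was built, then count what share of the new data falls into each bin. The scores are hypothetical integers on a 0 to 999 scale, and the helper names are illustrative.

```python
from bisect import bisect_right


def decile_boundaries(train_scores):
    """Return the 9 cut points that split the training scores into 10 equal-sized bins."""
    ordered = sorted(train_scores)
    n = len(ordered)
    return [ordered[(i * n) // 10] for i in range(1, 10)]


def bin_shares(scores, boundaries):
    """Return the percentage of scores falling into each bin defined by the boundaries."""
    counts = [0] * (len(boundaries) + 1)
    for s in scores:
        counts[bisect_right(boundaries, s)] += 1
    return [100.0 * c / len(scores) for c in counts]


# Hypothetical scores at model build time: 1,000 customers, evenly spread.
train_scores = list(range(1000))
# Hypothetical up-to-date data: 800 customers whose scores have shifted upward.
new_scores = [200 + i for i in range(800)]

boundaries = decile_boundaries(train_scores)
train_shares = bin_shares(train_scores, boundaries)
new_shares = bin_shares(new_scores, boundaries)

for b, (before, after) in enumerate(zip(train_shares, new_shares), start=1):
    print(f"bin {b:2d}: build {before:5.1f}%   new data {after:5.1f}%")
```

By construction each bin holds 10% of the build data. Bins whose share on new data moves far from 10% (here the two lowest bins are empty) indicate that the population has shifted since the model was trained.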
Regression model example
You want to predict the deal values for the next quarter. Your dataset contains observations on 3,000
customers.
When building the model, the decile analysis will split the data into
10 bins with 10% of the customers in each bin and you can
calculate the average predicted deal value per bin:
When you apply the model and evaluate it on up-to-date real-world
data (here with observations on 800 customers) the percentage of
customers in each bin could change substantially:
− Evaluate the model to assess the degree to which
it meets the business objectives.
− Determine if there is some business reason why
this model is deficient.
− Test the model in a production environment with new, updated data (if time and budget constraints permit).
− You confirm if the model meets the business
success criteria.
− One way to do this is to use decile analysis and
compare the deciles when you trained the model to
those when you use the model on more up-to-date
data.