Analytics on Microsoft Excel
Alberto Guillén, 3 June 2008
Overview of solutions around the platform from a real world perspective
Capgemini is a leading company with long experience in technology services
We are one of the biggest actors in Business Intelligence in Norway
A major demand from our clients is for solutions delivered in Microsoft Excel, and we have continuously adapted our efforts to meet clients’ needs
Alberto Guillén
Consultant
Risk Management & Compliance
MSc. Mathematics, MSc. Statistics
Phone: +47 46444721
E-Mail: [email protected]
Excel is more than worksheet functions and tables
Excel is the industry standard for end-user calculations, and also serves as a front-end interface
Data Visualization and Data Mining
Desktop for analytical solutions
Front for Data Warehouse Collection (SQL Server)
GUI for in-house coded programs (VBA)
Reporting and Monitoring Tool (PerformancePoint)
Basic statistical environment
Analytical solutions can be created on different complexity layers beyond basic Excel
Excel
Analysis ToolPak
Solver
Data Mining Add-in
Third-party Add-ins
VBA
Statistical Programming Languages
The average user masters the standard Excel tools
Standard Excel allows direct interaction with the raw data
BASIC EXCEL
Charts
Worksheet functions
”What If” Analysis
Tables and Filters
Pivot Tables
ID | Marital Status | Gender | Income | Children | Education | Occupation | Home Owner | Cars | Commute Distance | Region | Age | High Value Customer
12496 | Married | Female | 40000 | 1 | Bachelors | Skilled Manual | Yes | 0 | 0-1 Miles | Europe | 42 | Yes
24107 | Married | Male | 30000 | 3 | Partial College | Clerical | Yes | 1 | 0-1 Miles | Europe | 43 | Yes
14177 | Married | Female | 80000 | 5 | Partial College | Professional | No | 2 | 2-5 Miles | Europe | 60 | Yes
24381 | Single | Male | 70000 | 0 | Bachelors | Professional | Yes | 1 | 5-10 Miles | Pacific | 41 | No
25597 | Single | Male | 30000 | 0 | Bachelors | Clerical | No | 0 | 0-1 Miles | Europe | 36 | Yes
13507 | Married | Female | 10000 | 2 | Partial College | Manual | Yes | 0 | 1-2 Miles | Europe | 50 | No
27974 | Single | Male | 160000 | 2 | High School | Management | Yes | 4 | 0-1 Miles | Pacific | 33 | No
19364 | Married | Male | 40000 | 1 | Bachelors | Skilled Manual | Yes | 0 | 0-1 Miles | Europe | 43 | Yes
22155 | Married | Male | 20000 | 2 | Partial High School | Clerical | Yes | 2 | 5-10 Miles | Pacific | 58 | No
19280 | Married | Male | 20000 | 2 | Partial College | Manual | Yes | 1 | 0-1 Miles | Europe | 48 | Yes
22173 | Married | Female | 30000 | 3 | High School | Skilled Manual | No | 2 | 1-2 Miles | Pacific | 54 |
12697 | Single | Female | 90000 | 0 | Bachelors | Professional | No | 4 | 10+ Miles | Pacific | 36 |
11434 | Married | Male | 170000 | 5 | Partial College | Professional | Yes | 4 | 0-1 Miles | Europe | 55 |
25323 | Married | Male | 40000 | 2 | Partial College | Clerical | Yes | 1 | 1-2 Miles | Europe | 35 |
23542 | Single | Male | 60000 | 1 | Partial College | Skilled Manual | No | 1 | 0-1 Miles | Pacific | 45 |
20870 | Single | Female | 10000 | 2 | High School | Manual | Yes | 1 | 0-1 Miles | Europe | 38 |
23316 | Single | Male | 30000 | 3 | Partial College | Clerical | No | 2 | 1-2 Miles | Pacific | 59 |
12610 | Married | Female | 30000 | 1 | Bachelors | Clerical | Yes | 0 | 0-1 Miles | Europe | 47 |
27183 | Single | Male | 40000 | 2 | Partial College | Clerical | Yes | 1 | 1-2 Miles | Europe | 35 |
25940 | Single | Male | 20000 | 2 | Partial High School | Clerical | Yes | 2 | 5-10 Miles | Pacific | 55 |
25598 | Married | Female | 40000 | 0 | Graduate Degree | Clerical | Yes | 0 | 0-1 Miles | Europe | 36 |
21564 | Single | Female | 80000 | 0 | Bachelors | Professional | Yes | 4 | 10+ Miles | Pacific | 35 |
19193 | Single | Male | 40000 | 2 | Partial College | Clerical | Yes | 0 | 1-2 Miles | Europe | 35 |
26412 | Married | Female | 80000 | 5 | High School | Management | No | 3 | 5-10 Miles | Europe | 56 |
27184 | Single | Male | 40000 | 2 | Partial College | Clerical | No | 1 | 0-1 Miles | Europe | 34 |
12590 | Single | Male | 30000 | 1 | Bachelors | Clerical | Yes | 0 | 0-1 Miles | Europe | 63 |
Sum of Income (Column Labels: Bachelors)
Row Labels | Clerical | Management | Manual | Professional | Skilled Manual | Bachelors Total
Female | 690000 | 3770000 | 10000 | 2890000 | 1960000 | 9320000
  Married | 540000 | 2270000 | 10000 | 1100000 | 850000 | 4770000
    Europe | 500000 | 300000 | 10000 | 40000 | 500000 | 1350000
    North America | | 1770000 | | 390000 | 350000 | 2510000
    Pacific | 40000 | 200000 | | 670000 | | 910000
  Single | 150000 | 1500000 | | 1790000 | 1110000 | 4550000
    Europe | 150000 | 190000 | | | | 340000
    North America | | 1130000 | | 380000 | 1110000 | 2620000
    Pacific | | 180000 | | 1410000 | | 1590000
Male | 740000 | 4200000 | 10000 | 3530000 | 1440000 | 9920000
  Married | 460000 | 2260000 | | 1560000 | 880000 | 5160000
    Europe | 420000 | 160000 | | 40000 | 340000 | 960000
    North America | | 1390000 | | 610000 | 540000 | 2540000
    Pacific | 40000 | 710000 | | 910000 | | 1660000
  Single | 280000 | 1940000 | 10000 | 1970000 | 560000 | 4760000
    Europe | 260000 | | 10000 | 80000 | | 350000
    North America | | 1580000 | | 360000 | 560000 | 2500000
    Pacific | 20000 | 360000 | | 1530000 | | 1910000
Grand Total | 1430000 | 7970000 | 20000 | 6420000 | 3400000 | 19240000
The Analysis ToolPak makes rigorous analysis possible
Basic statistical analyses are available through the Analysis ToolPak
ANALYSIS TOOLPAK
Solver leverages computational abilities
Free Microsoft download
Set of optimization and root-finding algorithms
Can be called in the background from VBA
Practical, but slow in heavy calculations
Convergence is sometimes not exact!
Can be tuned
Needs good seeds (starting values)
Solver implements standard algorithms for mathematical optimization problems
SOLVER
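The warning that an iterative solver needs good seeds can be illustrated outside Excel. The sketch below is not Solver's algorithm: it is a plain Newton iteration written in Python, and the function f(x) = x^3 - 2x + 2 is a hypothetical example chosen only because the outcome depends strongly on the starting value.

```python
def newton(f, df, seed, tol=1e-10, max_iter=50):
    """Newton's method from `seed`; returns (last x, converged flag)."""
    x = seed
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x, True
        slope = df(x)
        if slope == 0:
            return x, False  # flat tangent: the iteration cannot continue
        x = x - fx / slope
    return x, abs(f(x)) < tol


def f(x):
    return x ** 3 - 2 * x + 2   # hypothetical test function


def df(x):
    return 3 * x ** 2 - 2       # its derivative


# A poor seed: the iteration jumps between 0 and 1 and never settles.
bad_root, bad_ok = newton(f, df, seed=0.0)
# A better seed: the iteration converges to the real root near -1.769.
good_root, good_ok = newton(f, df, seed=-2.0)

print("seed  0.0 ->", bad_root, "converged:", bad_ok)
print("seed -2.0 ->", good_root, "converged:", good_ok)
```

The same function gives a usable answer or none at all depending only on the seed, which is why it can pay to try several starting values before trusting a result.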
Third-party add-ins easily provide new functionality
Cheap
Simple
Easy to use
No development effort
Many small software developers use Excel as their GUI
ADD-INS
There are several third-party add-ins offering solutions for quantitative analysis and Monte Carlo simulation
Hundreds of free or cheap add-ins offer various solutions in fields like Risk Management
ADD-INS
The Table Analysis Tools add-in brings data mining capabilities
Included in the data mining add-in
Scenario analysis differs from the standard tool
Brings the power of data mining to end-users
Data Mining is embedded into table functionality
DATA MINING
The Data Mining add-in makes data mining accessible to business analysts
What is Data Mining?
• Data mining is frequently described as "the process of extracting valid, authentic, and actionable information from large databases."
Microsoft’s approach to Data Mining:
• Business Intelligence with a user-friendly interface, accessible to end-users and developers
Software
• SQL Server 2005/2008 (Visual Studio BI)
• Excel/Visio add-ins
• DMX
• ADOMD.Net / AMO
Microsoft brings Data Mining to business users for the first time
DATA MINING
Microsoft will not compete with traditional DM vendors; it targets other users
Microsoft takes a different approach to Data Mining
Donald Farmer, Principal Program Manager for Microsoft’s Data Mining:
"We don't have all the functionality of something like a SAS or an SPSS, because that's just not our market. […] Our market just has to be a much larger market."
"We have a huge database marketing team who do classic customer analysis. These guys were all SAS users, but when they joined Microsoft, they started using our tools. […], they actually use the Excel data mining add-ins to do it. It's not that there's nothing they don't miss, it's that they are able to achieve the same business results using our tools."
"For a function such as 'Detect Categories,' what the add-in is doing is building a clustering model in the background […], but we don't expect the Excel user to understand that. We just call it 'Build Categories.'"
"We're seeing a lot of interest in the Excel-side data mining, for one thing, but we're also seeing interest in the embed-ability, too. The people who are actually pushing this are from the developer side."
DATA MINING
Data Mining assists in various business processes
Data Mining is being used in several business areas
Top Business Scenarios for DM
Cross-sell and up-sell
Campaign management
Customer acquisition
Budget and forecasting
Customer retention
New fields: manufacturing, retail and entertainment
Main DM tasks
Classification
Estimation
Prediction
Association
Clustering
DATA MINING
Data Mining is performed in SQL Server 2005 / 2008
SQL Server Business Intelligence Development Studio and DMX code are the natural environment
DATA MINING
Data Mining is also accessible through Excel 2007
Excel sends DM queries and data directly to SQL Server Analysis Services
The Excel add-in acts as a client to an instance of Analysis Services
Both Excel and SQL Server Analysis Services support the full DM Cycle:
Data understanding
Data preparation
Modeling
Validation
Deployment
DATA MINING
Data Mining is an iterative process
A mining model is part of a larger process that includes everything from defining the problem to deploying the model into a working environment
This process can be defined by using the following six basic steps:
Defining the problem
Preparing Data
Exploring Data
Building Models
Exploring and Validating Models
Deploying and Updating
[Figure: circular process diagram linking the problem, the working environment and deployment]
Although the process is illustrated as circular, creating a data mining model is a dynamic and iterative process
DATA MINING
There are 9 Data Mining algorithms available in Excel
Decision/Regression Trees
Clustering
Naïve Bayes
Association rules
Sequence clustering
Time series
Neural Networks
Logistic regression
Linear regression
Plug-in algorithms
• Third-party or self-programmed, implementing a set of COM interfaces
The 9 built-in algorithms can be tuned to obtain new ones
DATA MINING
Decision trees classify and find associations
Regression trees build segmented regressions
Example: identify potential buyers
Decision and Regression trees find natural splits
Decision trees give decision rules that are easy to understand in business terms
DATA MINING
Clustering finds homogeneous groups
Example: Find segments of similar clients
Clustering can find hidden classes and identify outliers
DATA MINING
[Figure: scatter plot of clients by age and income]
Example clusters in the age/income plot:
Young people, no children
Middle age, many cars and children
"Older" age, 2 cars, no children
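How segments of similar clients emerge from age and income can be sketched with a plain k-means loop. This is only an illustration in Python: it does not reproduce the Microsoft Clustering algorithm, and the client data (age, income in thousands) and the starting centroids are made up for the example.

```python
import math


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, centroids, iterations=20):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical clients: (age in years, income in thousands).
clients = [
    (22, 20), (25, 25), (27, 30),   # young
    (40, 60), (43, 70), (45, 65),   # middle age
    (60, 40), (63, 45), (65, 35),   # older
]
# Hypothetical starting centroids: one client taken from each region of the plot.
start = [(22, 20), (40, 60), (60, 40)]

centroids, clusters = kmeans(clients, start)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Age and income happen to be on similar scales here; with real data the attributes would normally be rescaled first so that no single one dominates the distance.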
Naïve Bayes provides probabilities of group membership
Example: marketing campaign
Naïve Bayes is an efficient method to assess the probability of classification
DATA MINING
Association rules unveil hidden logic
Example: Shopping Basket
Association rules visualize the logical rules that underlie your business
DATA MINING
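The shopping basket example rests on two simple measures, support and confidence. The sketch below computes them in Python for one rule; the baskets and item names are invented, and the code is a hand calculation rather than the Microsoft Association Rules algorithm.

```python
# Hypothetical shopping baskets, one set of items per purchase.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Share of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)


rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print("support(bread, butter) =", rule_support)
print("confidence(bread -> butter) =", rule_confidence)
```

A rule is usually reported only when both measures exceed chosen thresholds; which thresholds make business sense depends on the data.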
Sequence clustering finds event patterns in time
Example: Web navigation
Sequence clustering identifies clusters of similarly ordered events in a sequence
DATA MINING
Time series forecasts processes in time
ARTXP: Microsoft proprietary algorithm
ARIMA: available in SQL Server 2008
Example: forecast seasonal sales to keep suitable stock
DATA MINING
[Figure: time series plot showing historical and predicted values]
The past patterns that it discovers can be used to predict values for future time steps.
Neural networks discover predictive patterns by learning
Example: fraud detection
Neural networks learn in an uncontrolled manner
DATA MINING
Logistic regression predicts binary responses
Microsoft Logistic Regression is implemented as a trivial neural network (one with no hidden layer)
Example: Probability of credit default based on personal information
Logistic regression gives probabilities of "YES/NO" given some attributes
DATA MINING
Linear regression is of course also available
It is however extended by Regression Trees
(Linear Regression is implemented as a particular case)
Example: extrapolate the influence of oil price on house prices
The classical linear regression is also integrated in the add-in
DATA MINING
Chosen examples vs. real life problems
So far, we have seen chosen examples:
• Shopping basket
• Web navigation
• Market segmentation
• …
Unfortunately, it is not that easy: data mining is a creative and unclear process. Sometimes data mining gives no answer.
Books don’t show examples of when not to use the algorithms:
• Time series: long forecasts
• Classification trees: credit scoring
• Sequence clustering: non-Markovian processes
• …
Bottom line: understand the statistical models behind the icons
DATA MINING
Data Mining: using the Data Mining add-in to forecast Credit Default
Logistic regression
The algorithm uses Z-score transformations
Cut-offs should be calculated when deploying
After training the algorithm, probabilities of default can be predicted for new applicants
DATA MINING
[Figure: income, age, civil status and number of payments feed a score for the probability of default; scores are grouped into ranked classes A, B, C and D]
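The three steps named on this slide (Z-score transformation, probability of default, cut-offs into ranked classes) can be sketched as follows. Everything numeric here is hypothetical: the means, standard deviations, coefficients and cut-offs are invented for illustration and are not output of the add-in, and civil status is left out to keep the sketch to numeric attributes.

```python
import math

# Hypothetical training statistics per attribute: (mean, standard deviation).
STATS = {
    "income": (400_000.0, 150_000.0),
    "age": (40.0, 12.0),
    "payments": (24.0, 10.0),
}
# Hypothetical fitted coefficients on the z-scored attributes, plus an intercept.
COEF = {"income": -0.8, "age": -0.3, "payments": 0.5}
INTERCEPT = -2.0
# Hypothetical cut-offs on the probability of default, best class first.
CUTOFFS = [(0.05, "A"), (0.15, "B"), (0.40, "C")]


def z_score(value, mean, std):
    """Standardize a raw attribute value."""
    return (value - mean) / std


def probability_of_default(applicant):
    """Logistic model applied to the z-scored attributes."""
    linear = INTERCEPT + sum(
        COEF[name] * z_score(applicant[name], *STATS[name]) for name in COEF
    )
    return 1.0 / (1.0 + math.exp(-linear))


def ranked_class(probability):
    """Map a probability of default to a class, A (best) to D (worst)."""
    for limit, label in CUTOFFS:
        if probability < limit:
            return label
    return "D"


average = {"income": 400_000.0, "age": 40.0, "payments": 24.0}
wealthy = {"income": 700_000.0, "age": 40.0, "payments": 24.0}
for applicant in (average, wealthy):
    p = probability_of_default(applicant)
    print(round(p, 4), ranked_class(p))
```

Because the cut-offs are applied to probabilities, they can be recalculated at deployment time without retraining the model.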
Data Mining: using the Data Mining add-in to forecast Debt Recovery
Problem: the k-nearest neighbours algorithm is needed
• It can be implemented as a plug-in algorithm
Some problems require creative approaches
DATA MINING
[Figure: % recovered plotted against time t]
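For readers unfamiliar with k-nearest neighbours, the sketch below shows the core idea as a regression: the forecast for a new case is the average outcome of the k most similar known cases. It is written in Python for illustration only; a real plug-in algorithm would have to implement the COM interfaces mentioned earlier, and the attributes (period, debt age in days) and recovery figures are hypothetical.

```python
import math


def knn_predict(training, query, k=3):
    """Average the targets of the k training rows closest to `query`."""
    ranked = sorted(training, key=lambda row: math.dist(row[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical history: ((period, debt age in days), share of debt recovered).
history = [
    ((1, 30), 0.10), ((2, 30), 0.18), ((3, 30), 0.25),
    ((1, 60), 0.04), ((2, 60), 0.07), ((3, 60), 0.09),
]

# Forecast for a new case in period 2 with a 35-day-old debt.
forecast = knn_predict(history, (2, 35), k=3)
print("forecast share recovered:", round(forecast, 4))
```

The attributes are used unscaled, so debt age dominates the distance here; in practice each attribute would be standardized before distances are computed.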
Hybrid between time series and regression trees
[Figure: % recovered plotted against time t, by period]
Training…
Predicting…
One period older:
Period = Period + 1
Age = Age + …
VBA for Excel is the main tool for automated solutions
Communication with other software (COM Server)
Build algorithms not available in Excel
Automation of processes (“Macro programming”)
Easy and quick interaction with the solution through ActiveX Buttons and Userforms
Analytical solutions can be embedded in a simple front end for end users without specialist competence
VBA allows in-house built solutions
VBA
VBA: Building a statistical tool for analyzing and forecasting Debt Collection
With VBA it is possible to deliver customized solutions to end users
Problems: implementing statistical algorithms takes a lot of work, and Solver can get slow
VBA is the tool to use to provide end-users with an interactive workstation
VBA
There are no limits with statistical programming languages
Excel
Analysis ToolPak
Solver
Data Mining Add-in
Third-party Add-ins
VBA
Statistical Programming Languages (COM Server)
R
[Figure: timeline of Microsoft interoperability technologies, with the years 1991, 1996 and 1999 marked: OLE, ActiveX, COM+, DCOM, .NET, WCF, DDE]
R is becoming the standard in the scientific community
R is a statistical programming language with syntax similar to S-PLUS
• R is free (under the GNU license)
• R uses statistical libraries created by statisticians all over the world
R communicates with Excel through a COM server
• "COM" is a set of interfaces that covers OLE, ActiveX, DCOM, ...
R Excel add-in
• Background mode
• Small Ribbon toolbar
• Fast
• Code embedded in: worksheet functions, VBA, cells
R allows analysts to implement the most advanced mathematical models
R
Histograms are a dangerous tool for approximating empirical PDFs
R
With standard Excel:
With the R add-in, advanced semiparametric methods are available
R
With the R add-in for Excel:
Empirical probability distribution functions are easily approximated with the R add-in
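Why a histogram can mislead, and what a smoother estimate offers instead, can be shown with a few numbers. The sketch below is in Python rather than R, uses made-up data, and uses a Gaussian kernel density estimate as a stand-in; it is an assumption that this is the kind of smoothing the slides have in mind, since the specific methods shown in the figures are not recoverable.

```python
import math


def histogram(data, origin, width):
    """Counts per occupied bin; bin i covers [origin + i*width, origin + (i+1)*width)."""
    counts = {}
    for x in data:
        index = math.floor((x - origin) / width)
        counts[index] = counts.get(index, 0) + 1
    return [counts[i] for i in sorted(counts)]


def gaussian_kde(data, bandwidth):
    """Return a density function built as an average of Gaussian bumps."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density


# Hypothetical sample, symmetric around 2.5.
data = [1.0, 1.9, 2.1, 2.9, 3.1, 4.0]

# Same data, same bin width, two bin origins: the shape changes.
print(histogram(data, origin=0.0, width=1.0))   # [2, 2, 1, 1] looks skewed
print(histogram(data, origin=0.5, width=1.0))   # [1, 2, 2, 1] looks symmetric

# The kernel estimate has no bin origin; the bandwidth (0.5 here) is an assumption.
density = gaussian_kde(data, bandwidth=0.5)
print(round(density(1.5), 4), round(density(3.5), 4))
```

The kernel estimate still depends on one tuning choice, the bandwidth, so it removes the bin-origin problem rather than every source of arbitrariness.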
Statistical programming languages: problem case
Forecasting the Oslo Børs Hovedindeks (the Oslo Stock Exchange main index)
Some problems demand advanced statistical approaches
R
Statistical programming languages: problem case
Using Monte Carlo simulation to predict default in Specialized Lending
Complex multivariate Monte Carlo models can be developed quickly in R
R
[Figure: two series plotted against index: NIBOR (variable r) and oil price (variable olje)]
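The general shape of such a simulation can be sketched in a few lines. The example below is in Python rather than R and is purely illustrative: the starting levels, volatilities, correlation and the default rule (income below debt service) are all hypothetical and are not the model behind this slide.

```python
import math
import random


def simulate_default_probability(n_paths=5000, n_steps=12, rho=-0.3, seed=42):
    """Estimate the share of simulated paths on which income falls below debt service."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    defaults = 0
    for _ in range(n_paths):
        rate, oil = 0.05, 100.0   # hypothetical starting NIBOR level and oil price
        for _ in range(n_steps):
            # Two standard normal shocks with correlation rho.
            z1 = rng.gauss(0.0, 1.0)
            z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            rate = max(0.0, rate + 0.002 * z1)   # additive rate shock, floored at zero
            oil = oil * math.exp(0.08 * z2)      # multiplicative oil price shock
            income = 0.1 * oil                   # hypothetical: income follows oil price
            debt_service = 100.0 * rate + 4.0    # hypothetical: interest on 100 plus fixed costs
            if income < debt_service:
                defaults += 1
                break                            # count each path at most once
    return defaults / n_paths


estimate = simulate_default_probability()
print("estimated probability of default:", estimate)
```

The estimate is only as good as the assumed dynamics; in a real specialized lending case the factor model and the default rule would come from the contract and from fitted data.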
Industrial solutions are another alternative
Excel
Analysis ToolPak
Solver
Data Mining Add-in
Third-party Add-ins
VBA
Statistical Programming Languages (COM Server)
Industrialized Vendors
Industry vendors
Industrial solutions should be chosen only if the area requires it
When to consider Industrial Solutions:
Big companies (important deployment)
Special industrial subject area
Data warehouse integration
Highly competent staff within Analytics
Expensive investment: assess whether it is worthwhile
Whatever the vendor, check Excel compatibility (reporting, platform migrations, …)
The previously presented alternatives for Excel can do their job at end-user level
Industry vendors
Further references
Capgemini:
• www.no.capgemini.com ([email protected])
Microsoft Data Mining:
• http://www.sqlserverdatamining.com
• http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx
R:
• http://www.r-project.org
• http://sunsite.univie.ac.at/rcom/