Alteryx – Understanding and Utilising
Predictive Analytics
Table of Contents
Introduction
Linear Regression
Time Series Analysis
Different types of Time Series models
Using time series tools in Alteryx + Example
Covariate Time Series + Example
Example – adding more to your visualisations using Time Series + Linear Regression
Classification Problems
Logistic Regression
Decision Trees
Naïve Bayes Classifier
Association Rules/Market Basket Analysis
Clustering Analysis
Custom R Tools
Production Optimisation Examples
Delivery Route Optimisation
Stock Price analysis and portfolio allocation example
Introduction
The purpose of this workbook is to provide a step-by-step walkthrough of the main
predictive analytics tools within Alteryx, along with the data preparation/cleansing and data
investigation tools used in the examples. I will also show how Alteryx and Qlik
Sense can be combined to allow businesses to gain significant insights and a thorough
understanding of their data. The workbook is also intended to build an understanding of the
statistical models behind the tools. Although Alteryx has a very
simple and intuitive interface, it is important that you are able to determine and
understand:
- What models are suitable for different types of prediction cases/different types of
data.
- How to interpret the reports of the models to determine whether or not the model
you have fitted is suitable.
- How to extract useful information from the model, ultimately to apply the findings in
a business setting to achieve what should always be your goal - understanding and
insight.
The two main types of predictive goals – Quantitative Versus Categorical
data
There are two main types of data for which you will typically be trying to make a prediction,
and the models you use will differ depending on which type it is. First, we have quantitative
data – e.g. trying to predict sales based on how much was spent on advertising that month,
or predicting how many people will buy an ice cream based on the temperature. For
problems such as these you will most often use a prediction method called linear regression.
You may also use other tools such as Time Series Analysis, and possibly decision trees,
depending on the data which is available to you.
The second type of data is categorical – in this case you will be trying to predict what class
or category an observation will fall into based on other known variables. For example:
predicting whether a person will respond to a marketing campaign (two categories – Yes/No)
based on their age, previous products bought, etc.; in a medical setting, predicting whether
a person is at high risk of a certain disease (two categories – High Risk/Not High Risk);
or predicting whether a mortgage is likely to default (Default/No Default). In
this scenario, models will typically give you a probability between 0 and 1 of each
observation belonging to a certain category. Alteryx provides a number of tools for this type
of prediction, including Logistic Regression, Decision Trees, Forest Models and the Naïve
Bayes Classifier.
Predicting Continuous Data – Linear Regression
Linear regression is a tool for modelling the relationship between a target variable which we
want to predict and one or more predictor variables. For example, imagine we wanted to
predict ice cream sales at a beach based on one variable – temperature. The model
equation could look as follows: y = ax + b, where y is ice cream sales, x is the temperature,
‘a’ is a coefficient which indicates the strength of the relationship (e.g. if a = 4 it means that
if the temperature goes up 1 degree, then ice cream sales will go up by 4 on average), and b
is simply the number of ice cream sales if the temperature is equal to 0. Often you will use
more than one predictor variable (e.g. x1, x2 and x3) to predict your target variable (y). The
equation would then look like y = a*x1 + b*x2 + c*x3 + d. ‘a’, ‘b’ and ‘c’ are all simply
measuring the effect that changes in our predictor variables have on our target. All of the
predictors are taken into account, and if any of them change then the value predicted by
the model changes too.
Linear regression is also useful for inference – understanding the impact different variables
have and providing measures of their effects. This can be used to enhance visualisations
providing key information which will help in decision making within a company.
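Alteryx's predictive tools run on R under the hood, so the same model can be sketched directly in R. A minimal sketch with made-up temperature and sales values (both hypothetical):

# Hypothetical data: daily temperature and ice cream sales
sales_data <- data.frame(
  temperature = c(18, 21, 25, 28, 30, 33),
  sales       = c(70, 85, 105, 118, 125, 140)
)

# Fit the linear model: sales = a*temperature + b
model <- lm(sales ~ temperature, data = sales_data)

# The coefficient of temperature is 'a'; the intercept is 'b'
summary(model)

# Predict sales for new temperatures
predict(model, newdata = data.frame(temperature = c(22, 27)))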
Steps to creating a linear regression model in Alteryx
1. The very first thing you will usually want to do is some data investigation. This is
because you want to make sure the data is suitable for a linear regression model. A
scatterplot is appropriate here. Figure 1 shows a linear relationship (real world data
will rarely be this perfect but you should still try to choose the most appropriate
model). Figure 2 shows an exponential relationship and Figure 3 shows a logarithmic
relationship. I will show how to deal with these later. You can also perform an
association analysis to see how strongly correlated different variables are.
[Figure 1 – linear relationship]
[Figure 2 – exponential relationship]
[Figure 3 – logarithmic relationship]
2. Next, within Alteryx you will want to create the model using the linear regression
model tool. Remember you will need a certain amount of past data to train the
model; however, you should also keep some extra data to test it on. Using the ‘Create
Samples’ tool you can randomly split your data 90%/10% – train the model using 90%,
and at the end test how well it does on the remaining 10%. This shows how well
it performs on unseen data (very important). I show how to do all of this step by step
within the workflows.
3. Now you need to evaluate whether the model is appropriate. In the output report
from your linear regression model there will be 4 plots. The one in the top left corner
is important – the residuals versus fitted plot. This plot shows the difference
between the predicted values of your model and the actual values which were
observed – these differences can be positive or negative. Here, you basically want to
see randomly scattered points – if there is a discernible pattern then that means
your model is making systematic errors. For instance, if you tried to fit a linear model
to an exponential relationship, then the residuals plot would show a pattern in the
errors. This violates one of the main assumptions of a linear regression model which
is that the errors are randomly distributed. A badly fitted model with systematic error
is worse than no model. In such a case you would have to perform a very simple
transformation on the data so that you can correctly fit a linear model – I have an
example workflow where I do this in Alteryx which should make it clear.
4. In the report there are two other important measures. One is the residual standard
error – this measures the average error between observations and predictions made
by the model. There is no ‘right’ value, but obviously the smaller this is the more
accurate the model. The other value is the ‘R-squared’ value, which measures the
amount of variance (movements of the target variable, e.g. ice cream sales)
accounted for by your model. A value of 0.8 would indicate that the model accounts
for 80% of the variance in your target variable – the higher this value the
better.
5. The final step is to test out your model on the unseen data (the 10% we held back
from earlier). To do this you simply drag in the ‘Score’ tool, connect the linear
regression model object and the test data to the tool, and run the workflow. In the
output there will be your original data plus a new column – your predicted values
from the model. You can make predictions on any data using this model object as
long as you have values for the predictor variables. For example, if a company
wanted to predict how much they could expect to sell given different levels of
advertising spend, they would simply pass a data set with different levels of
advertising spend through the model and look at the predicted values. You can also
export your data and predicted values to a .qvx file and visualise the predictions in
Qlik Sense. I did this in my workflow for predicting movements in the price of gold,
and compared movements predicted by my model to actual changes in the price to
see if they moved together.
Note: Owing to the extremely high variance in real-world data, caused by the presence of so
many variables, linear regression models may not provide extremely accurate predictions.
What they really provide is a better understanding of the relationship between variables –
both its strength and its direction. For example, if advertising spend goes up or down you
can expect on average an ‘x’ amount of increase or decrease in your sales. If you change the
price of a product by a certain amount you can forecast the approximate effect this will have
on units sold. If interest rates go up a certain amount you can make a prediction of the
mean effect this will have on the price of gold, and so on.
Linear Regression Examples in Alteryx
I have attached a folder entitled ‘linear regression examples’ which contains workflows
where I document step by step the data preparation, data investigation, model creation and
model evaluation for 3 examples. The first is a basic linear regression between 2 variables.
The second shows an example where you have to slightly alter the data to fit a linear
model. The third is an example of multiple linear regression using 4 variables to predict
movements in the price of gold.
Example 1 – Examining the relationship between number of global
internet users and Google’s share price from 2004 – 2016.
This is a very simple example in which I take the number of internet users
each year from 2004 until 2016 and examine the relationship between these
values and Google’s share price for the same years.
There is a very short Alteryx workflow in the folder where you can look
through each step with documentation. I have screenshots from the report
on the next page with key values to look for and interpretations of the
statistics. At the end I also made predictions for Google’s share price for
2017/2018 given projected internet users for the corresponding years. (This
is just for demonstration purposes; the model is extremely simple and only
has 12 data points – you could increase the accuracy of the model by adding in
even more predictor variables and performing multiple linear regression, as
in the final example in this section.)
Example 1 – Output Report
In red I have highlighted the estimate for the coefficient of internet users. It can be
interpreted as follows: for every one extra internet user, the share price of Google increases
by 2.418e-07 on average. In blue, I have circled a value under the column Pr(>|t|). This
column indicates whether the relationship is deemed significant or not. If you had a lot of
predictor variables in your model, some might be significant whereas other
relationships could be due to random variation. You will have to decide, based on this value
and your intuition, whether or not to include variables in your final model. Here, the value is
a tiny number (3.55e-06), indicating that we are practically certain the relationship between
internet users and Google’s share price is significant (unsurprising). If this value was 0.1 or
above we would be less certain that the effect is significant and not due to random
variation.
Here we can read off the standard error of our model (Residual Standard Error) – its
predictions are 83.077 units off on average. Our Multiple R-squared is 0.8685 – 86.85% of
the variance in Google’s share price is accounted for by changes in the global number of
internet users.
Example 2 – Examining the relationship between adolescent
Fertility rates and GDP Per Capita (Example of having to transform
data which has an exponential relationship to fit a linear model).
You can open the Alteryx workflow, press run and allow it to run through (It will take about
a minute). I document the process whereby I prepped the data by combining two data sets I
downloaded from the World Bank website and carried out some quick data investigation. I
also explain why the linear model was not appropriate for the raw data, and how to get a
better model fit. Below is the end visualisation from Qlik Sense, comparing the actual values
of GDP/Capita against those predicted by the model. Of course, we could make a much
more accurate model by adding in many more predictor variables.
*In this example I transform the data to fit a linear regression model for an exponential
relationship. If you encountered a logarithmic relationship such as in Figure 3, the only
difference would be that instead of taking the log of your target variable, you would take
the exponent of your target variable (e^(Target Variable)).
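A minimal R sketch of the same transformation, using hypothetical fertility_rate and gdp_per_capita fields: fit the model on the log of the target, then transform the predictions back.

# Hypothetical data with an exponential relationship
df <- data.frame(
  fertility_rate = c(10, 30, 50, 70, 90, 110),
  gdp_per_capita = c(60000, 25000, 9000, 4000, 1500, 600)
)

# Taking the log of the target linearises the relationship
log_model <- lm(log(gdp_per_capita) ~ fertility_rate, data = df)

# Predictions come back on the log scale, so exponentiate to recover GDP per capita
pred_log <- predict(log_model, newdata = data.frame(fertility_rate = 40))
exp(pred_log)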
Example 3: Multiple Linear Regression – Predicting movements in
the price of gold and examining the relationship between Gold
Prices, Stock Market Volatility, the strength of the US dollar, and 10
year Treasury Rates.
This example demonstrates the process of building a multiple linear regression
model in Alteryx.
Each step works through data cleansing and joining multiple datasets together.
There is also a workflow demonstrating how to parse data in JSON form and
eventually merge this data with csv files to form a completely prepared dataset.
I also show how you can split continuous data up into categories or ‘bins’, and how to
implement these categories in your multiple regression model.
Below is the comparison between our predicted gold price values (first scatter plot)
versus actual gold price values for 3,000 data points. The model could be applied by
inputting projections for future Treasury rates, stock market volatility and US Dollar
strength in order to measure the effect they would have on the gold price, which
would help in guiding investment decisions.
Interpretations of the residuals plots in Gold price Example
This was the residual plot when I first looked at
fitting a linear model using the Volatility Index of
the stock market to predict Gold prices. The blue
line is what we want in an ideal world – no
systematic errors (i.e. not constantly over or under
estimating with predictions). However, the red line
shows us that there is a pattern in our residuals.
You can see that the model is underestimating the
price of gold at the start and underestimating the
price in the later data as well (highlighted). I tried
to reduce this by appending a new variable
splitting the data up into VIX < 40 and VIX > 40
(where the pattern changes).
This is the residuals plot after I applied the change.
As you can see it is not perfect, we would like the
red line to be perfectly horizontal. However, it is
vastly improved so I concluded that I would include
the change in the final model.
This is the residuals plot for our final model when we
have all 4 predictors included. The red line is almost
perfectly horizontal apart from a slight trend at the
very start of the data towards overestimation (red
line drifts above 0). However, it is a satisfactory
model fit overall.
Time Series Analysis
What is time series analysis?
Time series analysis is a method which aims to extract meaningful information from
a sequence of observations through time in order to better understand underlying
patterns, and ultimately to make forecasts for future observations.
Unlike other methods, which can typically work with independent observations, time
series analysis assumes that successive observations represent consecutive
measurements at equally spaced time intervals – for example, a company’s closing
share price each day, or total sales for each month.
Time series can help to understand the pattern of observations through time, which
can often be described in terms of trend, seasonality and random error.
Trend refers to the underlying systematic component of the pattern which does not
repeat itself over time – for example an airline’s average passenger growth of 4%
year on year would represent a general upwards trend in their times series data.
However, there could be significant differences from month to month. Even if there
is an underlying upwards trend in passengers each year, there may be spikes around
Christmas and summer months and dips in others. It is useful to be able to describe
this seasonal component and extract the underlying trend in order to measure
performance and also to make forecasts for the coming year.
The final component consists of random error which comes with each observation.
This will typically make it more difficult to extract the underlying pattern, and time
series tools will involve some method of filtering out random error in order to make
the pattern clearer.
Different Types of Time Series Models
ARIMA MODEL – Autoregressive Integrated Moving Average
*ARIMA models encapsulate a large number of different types of Time Series models, and
this is just a very short summary of how they work. If you wanted to learn more about them,
the following is a link to a useful resource http://people.duke.edu/~rnau/411arim.htm
ARIMA models are the most general class of models for time series forecasting. An ARIMA
model works to separate the ‘signal’ (the meaningful predictive patterns) from the random
error, and to apply these patterns to make forecasts for future observations.
You can think of an ARIMA model as a type of regression model, where the predictor
variables might include the last ‘n’ observations (autoregressive terms), or the errors
between your model’s previous predictions and the actual values (lagged forecast errors) –
when the errors of the model’s previous forecasts are used, the model can readjust when it
over- or underestimates a prediction.
Below are some examples of different ARIMA model equations for a time series:

Y(t) = a + b*Y(t-1)
(a is a constant, b is a coefficient, and Y(t-1) is the most recent observation)

Y(t) = a + b*Y(t-1) + c*Y(t-2)
(taking into account the last two observations)

Y(t) = a + Y(t-1) + b*(Y(t-1) - Y(t-2))
(now we are including the last difference as a predictor variable as well – how much the
value changed between the previous two observations helps to determine how much it will
change again; this is an example of introducing a ‘differencing’ term)

Y(Oct 96) = Y(Oct 95) + (Y(Sep 96) - Y(Sep 95))
(an example of a seasonal ARIMA model – the Oct 96 forecast is equal to last October’s
value plus the year-on-year change observed the previous month)

Y(t) = Ŷ(t-1) + α*e(t-1)
(e(t-1) is the error of the forecast for the previous observation – in this way the model
corrects itself for underestimations and overestimations in the previous forecast)
The main piece of information you can take away from this is simply that ARIMA models
work by predicting future values based on 1) previous observations, 2) how much the
values have changed between previous observations, and 3) previous errors in past
forecasts. There are many other variations of the ARIMA model. Although it is good to have
a general idea of how the models work, the good news is that Alteryx will automatically
calculate the values of these parameters.
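Alteryx's ARIMA tool is built on the R forecast package. A minimal sketch with a hypothetical monthly series, where auto.arima() searches for suitable parameter values automatically:

library(forecast)

# Hypothetical monthly sales, stored as a ts object with yearly seasonality
sales <- ts(c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
              115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140),
            frequency = 12, start = c(2015, 1))

# auto.arima() chooses the autoregressive, differencing and moving average orders
fit <- auto.arima(sales)
summary(fit)

# Forecast the next 12 months with 95% confidence intervals
forecast(fit, h = 12, level = 95)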
ETS MODEL – Error/Trend/Seasonality (Exponential Smoothing)
This model uses exponential smoothing to build a forecasting model for Time Series
analysis.
Exponential smoothing makes use of every single past observation to forecast the
next step ahead.
Although every single observation affects the forecast, the most recent have the
most influence. The very last observation will be the most influential, the second last
observation will carry slightly less weight, and as you go back further to the 50th
observation it will have negligible effect. The following is the basic equation for an
exponential smoothing model:
ŷ(t) (forecast value) = α*y(t-1) + (1 - α)*ŷ(t-1)
Note/ α is a value which lies between 0 and 1. ŷ(t-1) was the model forecast for the
previous observation; y(t-1) was the actual value of the previous observation.
The model forecast for the previous observation was in turn affected by the forecast for the
observation before that, ŷ(t-1) = α*y(t-2) + (1 - α)*ŷ(t-2), which was affected by the
observation before that, and so on. Thus, every forecast is affected by every single
previous observation to some degree. An interesting point about this model is that it is self-
correcting – if it overestimates a prediction it will take that into account for the next
prediction by adjusting for the positive or negative error.
If there is a trend in the data then another component is added to the equation
to account for this. This is also known as double exponential smoothing. The
equation would look something like this: ŷ(t) = α*y(t-1) + (1 - α)*(ŷ(t-1) + b(t-1)),
where b(t-1) is the trend component.
If there is a seasonal component to the data then this will also be taken into account
with a third component added to the equation – this is triple exponential smoothing.
*It’s not necessary to understand these equations – I am just including them in case you
would like to get a better feel for how the model works. One takeaway is that the ETS model
will make forecasts by taking into account all previous observations, the seasonal pattern in
the data, and the overall trend. As with ARIMA, Alteryx will automatically calculate the model
parameters. You can simply create both the ETS model and the ARIMA model and choose
whichever is more accurate, as shown in the example workflow.
Creating a Time Series Model in Alteryx
1. Read in your Time Series Data which will consist of recorded data for consecutive
observations at equally spaced time intervals. Split the data up into training and test
data by taking the first 90% of observations (You can test your model forecasts on
the last 10% or so).
2. Connect the ETS tool to your data, select your target field and the frequency of
observations (are they monthly/yearly/daily etc.)
3. Connect the ARIMA tool to your data, and select your target field and frequency of
observations. If you want, you can select a completely user-specified model, which
allows you to input values for the parameters I mentioned earlier. If you leave it
unticked, Alteryx will find its own solution.
4. Use the ‘Union’ tool to join the two model objects together, and pass them through
the ‘TS Compare’ tool. Also connect your test data so you can compare which model
predicts these values better.
5. In the output from the TS Compare tool you will be able to see the forecasts for each
model and how they compare to actual values. In the table, one key measure of
accuracy is the RMSE (root mean squared error) for both models. An easy way to
choose between the ETS/ARIMA model is to choose the one with the lower RMSE.
6. Finally, connect all of your original data to the model you have chosen and create the
model object. You can now make forecasts for the periods ahead by dragging in the
‘TS Forecast’ tool. You can specify the number of periods ahead you want to predict
and what level of confidence intervals you would like to include. Bear in mind that the
further into the future you forecast, the more uncertain the predictions will be.
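The same comparison can be sketched in R with the forecast package, using a simulated monthly series split into training and test data (all values hypothetical):

library(forecast)

# Simulated monthly series with trend and seasonality: 60 observations
set.seed(7)
y <- ts(50 + 0.5 * (1:60) + 10 * sin(2 * pi * (1:60) / 12) + rnorm(60, sd = 2),
        frequency = 12)

# Hold back the last 6 observations as test data
train <- ts(y[1:54], frequency = 12)
test  <- y[55:60]

# Fit both candidate models on the training data
fit_ets   <- ets(train)
fit_arima <- auto.arima(train)

# Forecast the 6 held-out periods
fc_ets   <- forecast(fit_ets, h = 6)
fc_arima <- forecast(fit_arima, h = 6)

# Compare RMSE on the test data and keep the lower one (as the TS Compare tool does)
accuracy(fc_ets, test)["Test set", "RMSE"]
accuracy(fc_arima, test)["Test set", "RMSE"]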
*I have included in the download folder an Alteryx workflow entitled Germany Trade data
(Example 4). It goes through an example of the above steps using monthly data on Germany
Commercial Trade data, resulting in a forecast for the next 12 months. Below is the forecast
with confidence intervals visualised in Qlik Sense.
Time Series Covariate forecast – Using other variables to improve
the predictive capabilities of your Time Series model!
When should you use the TS Covariate forecast tool?
The TS Covariate forecast tool can be used to improve your time series model by including
another variable which may have predictive capabilities for our target (similar to a
regression model). The model will still account for trend/seasonality/previous observations
as in the previous section, however we can also include another variable along with this to
give the model more information to make forecasts.
e.g. Imagine you are preparing a time series forecast for a large company who are looking to
make sales forecasts for the coming months (to improve handling of inventory etc.). You
could just use sales data from previous years and provide a standard Time Series model as
before. However, if you had historical data on money spent on advertising each month then
you could also include this as a predictive variable in the Time Series forecast.
Example 5 - Alteryx Work flow - Creating a covariate time series model for the FTSE100
using the Pound Sterling Index as another predictor variable
I have attached an Alteryx workflow with an example of covariate forecasting
to try and predict movements in the FTSE 100
You may only use an ARIMA model when creating a covariate time series
forecast.
Steps involved in Example Workflow
- Data Preparation in order to standardise date fields and then merge the two
datasets on these fields
- Data investigation involving an analysis of the time series plot as well as the
relationship between the FTSE100 and the Pound Sterling Index
- Splitting the data up into training data and test data. Connecting the training data to
an ARIMA tool and specifying to use covariates in the model estimation.
- Using the Covariate Time Series forecast tool to make predictions on the test data
using an ARIMA model with Sterling Index as a covariate predictor variable.
- The model could be applied to estimate future movements in the FTSE100 based on
expected Sterling Index values. For instance, you might expect Sterling to bounce
back after its heavy drop following Brexit; using all previous observations and your
forecasts for the strength of the Pound, you could make educated forecasts for the
FTSE100 for a few periods ahead.
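A sketch of the covariate idea in R's forecast package, where the covariate enters via the xreg argument (both series simulated here purely for illustration):

library(forecast)

# Simulated stand-ins for the FTSE 100 and the Pound Sterling Index
set.seed(1)
sterling <- 78 + cumsum(rnorm(200, sd = 0.3))
ftse     <- ts(7000 + 40 * (sterling - 78) + cumsum(rnorm(200, sd = 20)))

# Fit an ARIMA model with the Sterling Index as a covariate
fit <- auto.arima(ftse, xreg = sterling)

# Forecasting requires assumed future values of the covariate
future_sterling <- rep(tail(sterling, 1), 10)
forecast(fit, xreg = future_sterling)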
Monthly Sales and Advertising Expenditure – Enriching
Visualisations with Key information (Utilising Time Series
and Linear Regression tools)
In this example we have a data set with 36 consecutive months of sales and advertising
spend for a dietary weight control product. The aim is to find out whether our advertising is
having an effect on sales, and to measure this effect. Without statistical models, all
we could visualise here would be a trend graph of advertising spend against
sales. But what if we happened to increase marketing spend during a month in
which sales were usually higher anyway? Or if we increased advertising spend and sales
dropped, but sales usually dipped significantly during that time? We can use the models
within Alteryx to separate the effect of advertising spend, the general upward or
downward trend in sales, and the seasonal components, in order to better understand the
payoff on the investment in advertising.
1) Basic Visualisation in Qlik Sense – Trend in advertising and sales
2) The next step I took was to look at a decomposition plot of the sales data – in order to
extract the seasonal component. The output below shows the sales data
on top, the seasonal component in the middle, and the general underlying trend on the bottom.
3) I wanted to overlay the sales data with the seasonal component removed on
advertising spend in Qlik Sense – in order to do this I wrote a few lines of code in a
custom R tool to extract the seasonal component in table form. I then exported the
information on seasonal effects and trend into Qlik Sense. The first visualisation here
shows the average seasonal effect each month.
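The custom R tool amounts to only a few lines. A sketch of the idea (not the exact code from the workflow) with a hypothetical 36-month sales vector; inside Alteryx the data would arrive via read.Alteryx() and leave via write.Alteryx():

# Hypothetical monthly sales for 3 years
sales <- ts(c(12, 20, 18, 15, 30, 33, 25, 22, 19, 28, 35, 40,
              14, 23, 21, 17, 33, 36, 28, 25, 21, 31, 38, 44,
              16, 25, 24, 19, 36, 40, 31, 27, 24, 34, 42, 48),
            frequency = 12)

# Classical decomposition into trend, seasonal and random components
dec <- decompose(sales)

# Tabulate the seasonal effect and the seasonally adjusted sales
out <- data.frame(
  month               = cycle(sales),
  sales               = as.numeric(sales),
  seasonal_effect     = as.numeric(dec$seasonal),
  sales_less_seasonal = as.numeric(sales - dec$seasonal)
)
head(out)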
4.) The below trend graph shows the sales less the seasonal effect for each month, overlaid
with the advertising spend for each month. This gives us a better idea of the relationship
between the two, as it removes the varying seasonal effect which can distort your
interpretation of the relationship.
5.) Scatterplot of Sales less seasonal effect versus advertising expenditure.
6.) Finally, you can run a linear regression tool in Alteryx to measure the average effect that
advertising expenditure is having on your sales. This is more of an inference problem. We
are trying to figure out whether our advertising expenditure is affecting our sales as
opposed to making specific predictions for future values.
So we can see a few important points from the Alteryx output here. The correlation
between advertising spend and sales less seasonal effect is 0.37 (a moderately positive
correlation indicating that as advertising goes up, sales tend to go up as well – we need
linear regression to determine whether this is significant).
The linear regression output I screenshotted here tells us two things. The estimated
coefficient for advertising was 0.7861. The data is in thousands of US$, so this tells us that
for every 1000 dollars spent on advertising, you are getting on average $786.10 more in
sales. The p-value of 0.02624 indicates that this relationship is statistically significant. We
can conclude that the extra spend on advertising is increasing sales; however, the extra
sales are not covering the advertising expenditure on average.
This example demonstrates how predictive and inferential analytics can add a lot more
insight to your analysis as opposed to simply using visualisations. We have a much better
understanding of our data now than when we just had a simple trend graph.
Classification Problems
A new type of problem
You can consider this as a new section moving on from the purpose of predicting continuous
numerical data. As I mentioned earlier, another very important purpose of predictive
analytics is predicting classes/categories – e.g. the probability that a person will or won’t
respond to your marketing campaign/ classifying people as low or high risk of a certain
disease/ predicting whether a client is likely to default on their credit card bill/ predicting
whether a customer is likely to end their subscription etc. I will start off this section with an
explanation of our first model for this purpose – Logistic regression.
Logistic Regression – Predicting Probabilities
Logistic regression is a statistical tool which relates a binary target variable of interest
(e.g. Yes/No, Default/No Default) to one or more predictor variables (e.g. income, age,
gender, etc.). The output from a logistic regression model is a probability between 0
and 1 of each observation belonging to a certain class. The following is the basic
equation behind the logistic regression model:
Probability = 1 / (1 + e^(-a - b*x - b2*x2))
*In this equation x and x2 are the values of your predictor variables – so they will obviously
affect the probability that the observation belongs to the class (e.g. the probability they will
default). No matter what the values of the predictor variables, this equation will always give
a value between 0 and 1 (a probability). The model is made by calculating the values of
a, b, b2 (and so on, if you have more predictor variables) so as to optimise the
accuracy for your data set. Alteryx will optimise these values automatically.
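Underneath, this is the model R fits with glm(). A minimal sketch with hypothetical age and income fields predicting a Yes/No response:

# Hypothetical training data: did the client respond to the campaign?
clients <- data.frame(
  age      = c(23, 35, 41, 52, 29, 60, 33, 47, 55, 26),
  income   = c(28, 45, 52, 61, 33, 40, 48, 58, 36, 30),
  response = factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No"))
)

# Logistic regression: family = binomial gives the equation above
model <- glm(response ~ age + income, data = clients, family = binomial)
summary(model)

# Scoring a new client returns a probability between 0 and 1
predict(model, newdata = data.frame(age = 44, income = 50), type = "response")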
Logistic Regression in Alteryx – Examples (Visualisation in
Qlik Sense).
Example 6 – Classification problem: Predicting which clients will respond to a
bank marketing campaign (whether they subscribe to a long-term deposit) based
on 17 attributes including age, gender, education level, whether they have a
housing loan, and more.
The data set: The data set consists of about 45,000 instances of the outcome from being
contacted by the bank’s marketing campaign, along with attributes for each client. We want
to know which clients will subscribe to a long-term deposit after being contacted. This can
be used in future campaigns to target customers with the highest probability of responding
– ultimately to increase efficiency and response rates.
*The Alteryx workflow is contained in the folder for Logistic Regression Examples. Each step
is documented and explained. Note that the entire workflow will take about 15 minutes to
run depending on your computer.
The first step I carried out was data preparation, and subsequently writing the prepared data
into .qvx files in order to visualise it in Qlik Sense. I wanted to compare response rates to the
campaign across different categories to get a quick overview of the data. I have included
some of these visualisations here for illustration.
Note/ The average response rate is the proportion of each category who responded.
This graph indicates that people younger than about 35 years old and older than about 60
years old were more likely to respond.
This was an interactive bar chart splitting the clients up by job and education level. The
highest response rate here is for retired individuals with tertiary level education. Overall the
‘retired’ category seems to have the highest response rate.
This visualisation looks at response rates based off of whether the individual responded to a
previous campaign. Unsurprisingly, the highest response rate (65%) was for previous
responders.
This scatterplot looks at the response rates based on the number of days since the client
had previously been contacted. I created 15 bins for the number of days in Alteryx, so the x-
axis represents categories. The highest response rates seem to be for clients in bins 8-11,
which is roughly the 400-500 day range, so perhaps customers who had not been hassled by
campaigns recently were more likely to respond. The scatterplot doesn’t show the response
rate for clients who had never been contacted; however, seeing as almost all of
these response rates are above the overall average, those who had not been contacted
before must have had a very low response rate.
The construction of the logistic regression model is all carried out and documented in the
Alteryx workflow. The example includes:
- Data preparation for Qlik Sense and model development / creating categories
for numeric variables and appending these as a new column
- Interpreting the logistic regression output report
- How to use the stepwise regression tool in Alteryx
- How to use the nested test tool in Alteryx
- How to test the logistic regression model
- How to use the ‘Lift Chart’ tool in Alteryx to demonstrate the efficacy of your
model
Example 7 – Medical Example: Creating a Logistic Regression model to calculate
probabilities of people having heart disease and automatically flagging high-
risk patients based on 13 attributes including Age, Sex, chest pain type, serum
cholesterol, maximum measured heart rate and more.
Data Set
The data set was downloaded straight into the Alteryx interface from the UCI Machine
Learning Repository. The dataset contains information on about 300 patients with regards
to 13 different attributes and the target variable which was whether or not the patient was
determined to be suffering from heart disease.
Problem Definition: We want to create a logistic regression model which will be able to
determine whether a patient is at low, moderate, or high risk of suffering from heart disease
based off of the recorded attributes, and provide us with a list of patients who are at
moderate to high risk.
Result: The model was tested on 45 patients, and flagged 24 participants as being at either
Moderate or High Risk of Heart Disease. 15/24 of these patients did in fact have heart
disease, and of the 45 patients in total, 20 had heart disease. The model is effective – by
flagging only 53% of the participants it correctly found 75% of the patients with heart
disease. The model could be part of a screening process which could give a quick indicator of
the risk level of new patients.
*The Alteryx workflow for this example is contained in the folder ‘Logistic Regression
examples’ and provides a step by step explanation of the steps taken to achieve the
objective.
The workflow contains the following:
- How to directly download a dataset in csv format from a webpage into
Alteryx and prepare the data using the parsing tools.
- How to carry out some basic data investigation in Alteryx using the data
investigation tools.
- Creating samples of data, then using the logistic regression model tool, stepwise
regression tool and nested test tool to come up with a final model.
- Using the formula and filter tools to classify the test data into different
categories based on the probabilities assigned by the model, and finally to
output a list of patients at moderate to high risk of having heart disease (see
the sketch below).
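The final classification step can be sketched in R with hypothetical data and risk cut-offs (the workflow itself uses Alteryx's formula and filter tools for this logic):

# Hypothetical patient data and fitted model (names and cut-offs are illustrative)
patients <- data.frame(
  age     = c(54, 61, 45, 39, 67, 50, 42, 58),
  max_hr  = c(150, 120, 170, 180, 110, 140, 175, 125),
  disease = factor(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"))
)
model <- glm(disease ~ age + max_hr, data = patients, family = binomial)

# Probability of heart disease for each patient
patients$risk_prob <- predict(model, type = "response")

# Hypothetical cut-offs: below 0.3 low risk, 0.3-0.6 moderate, above 0.6 high
patients$risk <- cut(patients$risk_prob,
                     breaks = c(0, 0.3, 0.6, 1),
                     labels = c("Low", "Moderate", "High"),
                     include.lowest = TRUE)

# Final output: the list of patients flagged as moderate or high risk
subset(patients, risk %in% c("Moderate", "High"))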
The Decision Tree – Another Classification Tool
The decision tree tool is a statistical method which once again tries to predict a target
variable based on one or more predictor variables which are expected to have an influence
on the outcome. Decision trees can actually be used to predict continuous or categorical
data.
How does the decision tree work?
Decision trees build regression or classification models in the form of a tree structure using
IF-THEN rules to split the data at nodes. The end result will be a tree with decision nodes and
leaf nodes. The decision nodes contain two branches which will split the data based on
which branch the piece of data satisfies. E.g. in our bank marketing campaign example one
of the decision nodes could have been – gender? - with two branches – male and female.
Another split could be numeric: e.g. age > 45? – with two branches – yes/no. The
decision nodes break the data down into smaller and smaller subsets, which will eventually
be classified by a leaf node. A leaf node represents the final classification or decision on
the data. In the example below, the leaf nodes decide whether each person will be a
purchaser or non-purchaser.
How do you determine what attributes to split on?
This decision will be based on two factors – entropy and information gain.
Imagine your decision tree is trying to predict a category from two classes e.g.
Purchase/Non-Purchaser. If you had 20 data points and you had different attributes to split
on, imagine one decision left you with two groups of ten people. Within these two groups of
10 people, 5 are purchasers and 5 are non-purchasers. This would not be a good decision to
split on, as it tells us literally nothing about the likelihood of purchase given membership
of either group – it is 50:50. Take another decision which leaves you with two groups of
ten, with one group having 8 purchasers and 2 non-purchasers, and the other having 3
purchasers and 7 non-purchasers. This is a better decision, it increases the ‘purity’ of each
group. If we knew you were part of group 1 we could say there is a 0.8 probability that you
will purchase, whereas if you were part of group 2 we could say there is a 0.7 chance you
won’t purchase. At each decision, the algorithm will try to maximize the purity of the
groups, and this is the basis upon which the decision nodes are chosen – what will give us
the most information gain. The root node (the first decision) will be the most influential
factor in determining membership of a class. If you could only look at one attribute, this
would be it.
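Alteryx's Decision Tree tool is built on R's rpart package. A minimal sketch with hypothetical age and salary fields:

library(rpart)

# Hypothetical customer data
customers <- data.frame(
  age      = c(22, 45, 36, 51, 28, 60, 33, 48, 39, 55),
  salary   = c(25, 52, 38, 61, 30, 58, 44, 49, 35, 63),
  purchase = factor(c("No", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes"))
)

# Grow the tree; splits are chosen to maximise the purity of the resulting groups
tree <- rpart(purchase ~ age + salary, data = customers,
              method = "class", control = rpart.control(minsplit = 4))

# Text summary of the decision and leaf nodes
print(tree)

# Class probabilities for a new customer
predict(tree, newdata = data.frame(age = 40, salary = 50), type = "prob")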
Feature selection – One of the key processes of a decision tree is feature selection. This
involves selecting a subset of all the variables you have to identify the most important ones
with regards to your target variable. This can be advantageous in that you will be able to
quickly identify the most influential attributes in your dataset, however you may lose out on
making use of important information in some cases.
Should you use logistic regression or a decision tree?
There is no blanket rule for deciding which method to use. Decision trees
can be very good at quickly identifying rules amongst your data, i.e. if a
person is over 30 and has a salary > 40,000 they are more likely to purchase
your product, etc. In essence, it depends on the distribution of your data. If I
had to pick one I would go with logistic regression; however, if you have time
then it is worthwhile creating both a logistic regression model and a
decision tree model, and subsequently comparing them based on the
number of correct classifications and a lift chart.
Decision trees are probably easier to understand and also provide a visual aid
which can be more appealing to people (although accuracy should be the
priority). It is easy to follow along down a decision tree and identify which
characteristics are the most impactful.
Decision Tree (Example 8) – Using Census Data to predict whether a
person’s salary is >50,000 or <=50,000.
The data set:
This dataset was retrieved from the UCI Machine Learning Repository and consists of census
data from the 1994 database. We have information on each individual regarding things such
as age, gender, work class, marital status, race, hours worked per week, native country etc.,
as well as whether their salary is >50,000 or <=50,000.
Hypothetical Use Case/ Problem:
Imagine a watch company has developed a high-end watch, and they wish to develop a
targeted marketing campaign directed towards customers with higher earnings. They have
previously collected data on customers with regards to certain metrics such as where they
work, what age they are, nationality etc., and want to develop a model which will predict
whether these customers are in the higher salary bracket so that they can target this group
with their marketing campaign, maximizing the pay-off from their advertising expense. You
must develop a classification model to achieve this goal.
Method: The method to configure the decision tree tool is contained within the examples
folder under Decision Tree Example. It is a very short workflow and should only take about
1-2 minutes to run.
Decision Tree output:
On the next couple of pages I have included some screenshots of the output from the
decision tree model. The first is the decision tree diagram. Although it is somewhat unclear,
the first decision node is whether or not the person’s relationship variable was either Not in
Family, Unmarried, other relative or own child. If you go down to the second output which is
Variable Importance then you can see relationship is the most important variable in
deciding each person’s income. The decision tree splits generally start in order of
importance, with the latter splits being less influential. You can see at the end of branches
there are two options - <=50K and >50K. The decision tree makes a decision on each subset
based on which category the majority of the group falls into. Ideally you would have over
90% of a group in either of the categories, but you can see within the Alteryx output
workflow on the interactive chart that some of the subsets are more uncertain e.g.
60%/40%.
Decision Tree Interpretation
You can click on any node within the
decision tree and see the path to that
node, as well as the final decision
which was made on that subset of
data. As you can see here this node
consisted of people who were in one of
the relationship categories specified in
the first decision, and who also had
Capital Gains < 7074. 95% of this group
had salaries <= 50,000 so the model
will classify anyone satisfying these
conditions as having a salary which is
less than $50,000.
The decision node I clicked on here
consists of people who were not in
the relationship categories specified in
the first decision, whose education
was one of the categories listed in
the second decision, capital gain
< 5096, occupation not one of
those listed in the 4th decision, age
> 32.5, and capital loss < 1846. 88% of
this subset had salaries > 50,000 and
12% had salaries < 50,000, so the
model will classify people in this
category as having salaries > 50,000.
Result
Below is the output from a tool called a ‘lift chart’. It demonstrates how much more efficient
you would be if you used the model predictions as opposed to choosing people completely
at random. We want to find all the people who have incomes >$50,000. In total, 24.26% of
the test data had incomes >$50,000, which was equal to 395 people. If you had
randomly sampled 50% of the test data you would have found about 50% of these 395
people (as shown by the black base line in the chart). However, if you had taken the top 50%
of probabilities from the output of the decision tree, you would have captured about 90% of
the 395 people – an increase of 40 percentage points. The blue curved line indicates the
extra amount of our target group we would get using the model predictions as opposed to
random choice.
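The cumulative gains logic behind the lift chart can be sketched in R with simulated scores and outcomes (all values hypothetical):

# Simulated scored test data: model probability and actual outcome (1 = income >50K)
set.seed(1)
prob   <- runif(1000)
actual <- rbinom(1000, 1, prob)  # outcomes loosely follow the probabilities

# Sort by descending model probability and accumulate the positives captured
ord   <- order(prob, decreasing = TRUE)
gains <- cumsum(actual[ord]) / sum(actual)

# Proportion of the target group found in the top 50% of probabilities,
# versus the 50% expected from random selection
gains[500]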
Naïve Bayes Classifier – Classification Tool
The Naïve Bayes Classifier is another classification tool. It is based on applying Bayes’
theorem to training data. In general, the NB Classifier is outperformed by other models
such as logistic regression, forest models and decision trees. However, it can be beneficial to
use when the amount of data you have to train your model is limited. The basic equation
behind the model is based on conditional probabilities as follows:
P(Ck | X) = P(X | Ck) * P(Ck) / P(X)
A key feature of the NB classifier is that it assumes the features are independent – that is, if
you have three predictors for instance, having one particular feature does not affect
the value of another predictor. However, this is unfortunately often not true, and in these
cases the NB classifier will not perform as well as other classification models.
The NB classifier will predict the probability of an observation being a member of any
number of classes – e.g. is the customer likely to buy sportswear, formal wear, or casual wear.
Just remember that with any of the classification tools you can try different ones and
simply choose the most accurate. Measure the accuracy by how well the model does on the
test data, not on the training data.
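An equivalent model can be sketched with the naiveBayes() function from R's e1071 package (the field names here are hypothetical):

library(e1071)

# Hypothetical customer data
visits <- data.frame(
  city     = factor(c("Leeds", "York", "Leeds", "Hull", "York", "Hull")),
  n_visits = c(2, 8, 5, 1, 9, 3),
  spend    = c(40, 210, 120, 15, 260, 60),
  response = factor(c("No", "Yes", "Yes", "No", "Yes", "No"))
)

# Fit the classifier; features are assumed independent given the class
nb <- naiveBayes(response ~ city + n_visits + spend, data = visits)

# Probability of responding for a new customer
predict(nb,
        newdata = data.frame(city = factor("Leeds", levels = levels(visits$city)),
                             n_visits = 4, spend = 90),
        type = "raw")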
As an example of how to use the tool, I have included a workflow to predict which customers
will respond to a marketing campaign based on which store they shop at, which city they
are from, how many times they have visited the store and how much they have spent
there. The configuration is very simple and follows a similar process to previous models. I
also visualised the results in Qlik Sense to give a quick snapshot of likely responders
in the unseen data and a better understanding of our customers.
Probability of responding by customer location
Average probability of response for customers in each city
Probability of responding by visits/Spend – Probability indicated by size of the dot
Association Analysis / Market basket analysis to aid with
product placement, recommendation systems/ targeted
promotions
Association rule learning
In the previous sections such as linear regression/ classification tools we were intentionally
trying to predict a target variable using certain predictor variables – so we had a defined
objective when we started developing the models. However, association rule learning is a
method which can discover interesting relationships among your data without having
specifically defined predictor and target variables. For instance, one of the major uses of
association analysis is in market basket analysis – the objective of market basket analysis is
to identify patterns amongst consumer buying behaviour in order to identify relationships
amongst products bought. E.g. the most regularly occurring items in transactions and most
importantly – what items are typically bought together. For instance, if a customer buys
garden chairs and a sun umbrella are they likely to buy a barbecue set as well. This can help
a company with product recommendation systems, product placement in stores as well as
targeted advertising and promotions which can enhance the customer’s buying experience.
Recommender systems, such as the Netflix movie recommendation system and Amazon’s
product recommendation system, are another practical application of association analysis.
Using the large amounts of data these companies collect, they can identify what other
products a customer might like based on what they have already watched/bought, as
well as the patterns which have been previously identified in consumer behaviour. Ratings
systems can be incorporated as well – for instance, by looking at the ratings a
consumer has given to other movies you could identify other movies they would very likely
enjoy. Association analysis models allow you to quickly extract and gain value from large
volumes of transactional data.
Market Basket Analysis in Alteryx
Alteryx provides some great tools to carry out association analysis on
transactional data. The final output will be a table with an item/group of items on the left
hand side, and an item on the right hand side which the customer would also likely be
interested in, as identified by the model. There are only three key terms which you need to
understand in order to interpret the report output.
Support – Simply put, support is a measure of how often the item/item set occurs in a
transaction. If this value is 0.01, then the items occur in 1% of transactions; if it is 0.5,
then they occur in 50%. Ideally, it is nice to identify rules where the support is high, as
these rules or associations will be applicable to a larger number of transactions. However, it
depends on the business. A supermarket will have items with large support (e.g.
bread/milk), whereas an online retailer like Amazon will have items with very low support
simply due to the huge variety of products they have on sale.
Confidence – As I mentioned, the output from the market basket analysis will be an item or
group of items on the Left hand side and an item on the right hand side which is also likely
to be bought. The confidence measure is interpreted as follows – if it is = 0.8, then you can
be 80% confident that if the items on the LHS are bought, then the item on the RHS will also
be bought, if it is 0.9 then you are 90% confident.
Lift – The lift is calculated by the following formula:
Lift = Support(all items on LHS and RHS occurring in the same transaction) /
(Support(items on LHS) × Support(items on RHS))
This is interpreted as the probability that the group of items occur together divided by the
probability that the items on the LHS and RHS occur together if we are assuming they are
independent and have no relationship. If the value is = 1 then there is no relationship, if the
value is > 1 then each item is more likely to be bought given that the other one has been
bought. So we want values which are > 1, and the higher the better.
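Alteryx's market basket tools are built on the R arules package. A minimal sketch with a handful of hypothetical baskets:

library(arules)

# Hypothetical transactions: each element is one basket
baskets <- list(
  c("bread", "milk"),
  c("bread", "milk", "butter"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("bread", "milk", "jam")
)
trans <- as(baskets, "transactions")

# Mine rules above minimum support and confidence thresholds
rules <- apriori(trans,
                 parameter = list(supp = 0.2, conf = 0.6, minlen = 2))

# Inspect rules sorted by lift (values > 1 indicate a real association)
inspect(sort(rules, by = "lift"))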
Alteryx Example: Identifying Patterns amongst a dataset of
541,909 product sales of 4,224 different products –
Visualising the results in Qlik Sense
The data set: The dataset consists of approximately 540,000 lines of data, with each line
having an invoice number, a product description, a customer ID, date and other details
relating to each transaction. We are interested in looking at what items are typically bought
together, so we could use the invoice number as a representation of each transaction –
items with the same invoice number have been purchased at the same time. We could also
use the customer ID to identify patterns of items which customers are typically interested
in, even if they have not been bought at the same time. In this example I use the invoice
number.
The problem/goal: We want to identify the most significant patterns amongst consumer
purchases which we can easily analyse and visualise in Qlik Sense, in order to better
understand the customer’s buying behaviour so we can leverage this insight to improve
product recommendations and targeted marketing. *The data set I used here is mainly bulk
orders from other shops – so the patterns are quite obvious, however it is still a good
example of how to integrate this tool.
*The Alteryx workflow with step by step instructions is included in the Association rules
folder – including data preparation, selecting the appropriate fields, and tool configuration.
Main output report – Alteryx
Here is a snippet of the output report from the market basket analysis tools, with the data
which I exported into a .qvx file for visualisation in Qlik Sense. The item sets are sorted in
descending order of support. The first line can be interpreted as follows: the items occur in
3.22% of transactions; if a Jumbo Bag Pink Polkadot is purchased we can be 67.67%
confident that the Jumbo Bag Red Retrospot will also be purchased; and the lift of 8.21 is
significantly greater than 1, which tells us that the purchase of one item greatly increases
the probability that the other item will also be purchased.
Qlik Sense app for Visualisation
The visualisation consisted of an interactive app whereby you could click on any item set
and immediately retrieve information on the lift/support/confidence and compare it to
other item sets. The scatter plot on the top left has support on the x-axis and Lift on the y-
axis with the size of the bubbles indicating confidence levels. Item sets which are further to
the right of the graph and higher up should be prioritized and it provides a very easy way of
identifying the most significant patterns. I also included charts comparing item sets purely
by support, confidence and lift as well as looking at the top 10 sellers in quantity sold and
revenue as seen below.
Clustering Analysis
What is Cluster Analysis?
Cluster analysis is another type of ‘exploratory’ data mining, similar to the association rules
– in which we have no defined variable we want to predict but we are searching for patterns
in our data. For instance, imagine you have a large data set with data on each customer
such as geographic/location data, data from surveys, their age, amount spent on different
types of products and so on. You might use cluster analysis to identify groups of customers
which are ‘similar’ to one another based off of these attributes. Very often you can have
natural clusters in the data, and within these clusters customers may behave differently.
Often you can use cluster analysis as a pre-processing technique before other data mining
techniques. Instead of treating all of the data the same way you can split it up into natural
groupings first, and then examine behaviour within each group. For example, we could
tailor our market basket analysis to each cluster, thus increasing our understanding of
customer behaviour. Furthermore, cluster analysis can be used in other areas such as
investing. Clustering analysis could be used to identify clusters of stocks which are similar in
their week-to-week performance, which can help you to understand their behaviour and aid
in diversifying your portfolio to reduce unsystematic risk.
Example of cluster analysis on 2 numeric variables – identifying natural groupings
Example of cluster analysis on three numeric variables – algorithm identifies clusters based
on ‘closeness’ of measures
*It is important to note that the clustering tools in Alteryx work on numeric data and not
categorical data. The algorithm works by using measures of ‘closeness’ between data points
– easy to visualise when there are only two or three variables as in the above graphs. You
cannot measure how close two categorical variables are. However, if your categorical
variables have a natural order to them, e.g. primary school, secondary school,
undergraduate, graduate – then you could transform these to numeric variables by
assigning 1,2,3,4 respectively.
Carrying out a cluster analysis in Alteryx:
The first step you have to carry out is to determine the number of clusters
you want. This can be done using the K-Centroids diagnostics tool. You will
have to specify your minimum number of clusters and your maximum
number of clusters – depending on the business problem you will have to
decide, also using trial and error. Between 3 and 10 will cover most bases.
The above graphs will be included in your output report. For each number of clusters (3 – 8
here) you will have box plots for adjusted rand and Calinksi-Harabasz indices. You will be
looking for the cluster group with the highest in both these graphs. In this case cluster 4 has
the highest adjusted rand , and cluster 6 has the highest C-H index. Cluster 6 is also quite
high in the adjusted rand, so you could choose 6 clusters in the next step.
The adjusted Rand index measures how consistently observations end up grouped together
across repeated clustering runs (i.e. how stable the clusters are), whereas the Calinski-Harabasz
index compares the variance between groups to the variance within groups. Ideally, we want
high similarity within groups and high variance between groups for well-defined clusters.
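The Diagnostics tool computes these indices for you, but for intuition, here is a hedged sketch of how the Calinski-Harabasz comparison could be reproduced in plain R using the fpc package and simulated data – an illustration of the statistic, not the Alteryx tool's exact code:

R Code (sketch):
    library(fpc)   # provides calinhara() for the Calinski-Harabasz index

    set.seed(42)
    dat <- scale(matrix(rnorm(600), ncol = 3))   # simulated, standardised data

    # Compare k = 3..8, as in the diagnostic plots above
    ch <- sapply(3:8, function(k) {
      km <- kmeans(dat, centers = k, nstart = 25)
      calinhara(dat, km$cluster)
    })
    names(ch) <- 3:8
    ch   # higher values indicate more compact, better-separated clusters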
The next step will be to use the K-Centroids analysis tool. Select your
clustering method and the number of groups you want (as determined by the
previous step). This tool will separate all the observations into that number of
groups and provide a report with details on each group.
You can then use the Append Cluster tool to create a new column indicating
which cluster each observation is in. Finally, you could carry out other data
mining techniques to examine behaviour within each cluster – e.g. does a
marketing campaign have more of an impact on one group than another? Do
the different groups have different buying patterns, as determined by market
basket analysis?
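For intuition, a rough plain-R equivalent of the analysis and append steps might look like the following, using kmeans() as a stand-in for the K-Centroids tool and made-up spending data (a sketch, not the workflow's actual code):

R Code (sketch):
    set.seed(1)
    dat <- data.frame(fresh  = rexp(100, 1 / 5000),   # hypothetical spend columns
                      milk   = rexp(100, 1 / 3000),
                      frozen = rexp(100, 1 / 2000))

    # Standardise so that no single field dominates, then cluster
    km <- kmeans(scale(dat), centers = 5, nstart = 25)

    # Equivalent of the Append Cluster tool: add the cluster label as a new column
    clustered <- data.frame(dat, cluster = km$cluster)

    # Then examine behaviour within each group, e.g. mean spend per cluster
    aggregate(. ~ cluster, data = clustered, FUN = mean)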
Example in Alteryx – Grouping Clients of a Wholesale distributor by
spending amounts in 6 different product categories
The data set: the data comprises 440 clients of a wholesale distributor. For each client
we have their spending amounts in 6 different categories –
Fresh/Milk/Grocery/Frozen/Detergents/Delicatessen.
The aim: The aim is to identify ‘similar’ clients based on spending across categories. This will
give us a better understanding of our clients’ behaviour and help in further data mining
techniques. Clients with similar levels of spending across categories will cluster together.
*The Alteryx workflow is contained in the folder. I followed the steps above to carry out the
clustering. Run the workflow to see all of the output; I will highlight the most important
things to look out for here.
The Diagnostics tool indicated that 5 clusters resulted in the most stable groupings
(adjusted Rand) and one of the highest levels of separation between groups (C-H
index). We always want well-defined clusters – that is, members of one cluster are
dissimilar to members of other clusters, and members of each cluster are similar to
each other.
Below is my configuration for the K-Centroids analysis tool. I standardise
the fields so that one category does not dominate the algorithm, and I
choose 5 as the number of clusters based on the previous output.
Finally, I appended the cluster group to each client. Now the wholesaler has
a method of grouping similar customers together. If we were to go on and
carry out market basket analysis we could do it individually for each group,
and we would very likely observe different rules between clusters.
Furthermore, you could look at the effect of promotional offers on different
clients, which could help with targeted promotion and result in more
efficient campaigns in the future.
Alteryx Functionality – Creating your own
tools using R code
Optimisation Problems
In business, optimisation techniques can be used to maximise efficiency: identifying the
best use of limited resources, assigning the right resources to each task (e.g. delivery
routes), and minimising costs while maximising revenue. There are numerous instances
where optimisation can be used in business, and many of them can be modelled with
formulae. We can use the flexibility of Alteryx to show how to model these problems and
create tools to solve them using the functionality of R.
There will be functions which we want to minimise or maximise subject to some parameters
and constraints. R provides routines which we can use to build our tools and identify
solutions to these optimisation problems, including the functions optim(), constrOptim(),
nlm() and optimize() from the base stats package.
Example 1: An apartment complex has 500 apartments to rent. Based on current costs
and pricing, the profit function is: −8x² + 3200x − 80,000 (where x = the number of
apartments rented). How many apartments should they actually rent out in order to
maximize their profit?
- Maximise the function subject to x <= 500 (the total number of apartments).
1) R has a built-in function called optimize() which will solve these types of
one-dimensional problems.
I made a very simple tool in Alteryx to demonstrate how R code can be used
to create your own solutions. The tool allows the user to input the constraint
(i.e. how many apartments are available to rent, as the maximum may
change depending on this constraint). The output from the tool will be a
graph of the function, the point where the function reaches its maximum,
and the profit at that point.
R Code
Tool Output
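Since the screenshots are not reproduced here, this is a minimal sketch of the core call – my own version of the idea rather than the tool's exact code:

R Code (sketch):
    profit <- function(x) -8 * x^2 + 3200 * x - 80000

    max_apts <- 500   # the user-supplied constraint
    opt <- optimize(profit, interval = c(0, max_apts), maximum = TRUE)

    opt$maximum    # about 200 apartments rented
    opt$objective  # profit of about 240,000 at that point

    # A simple plot of the function, as in the tool output
    curve(profit, from = 0, to = max_apts)
    abline(v = opt$maximum, lty = 2)

Note that the unconstrained maximum (200 apartments) already satisfies x <= 500 here, so the constraint only changes the answer when the user enters a value below 200.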
Optimization Problem 2: more variables and a constraint function.
Imagine there is a company which produces two products – tables and chairs. Let x =
number of chairs produced and y = number of tables produced. Based on pricing, costs
and forecasted demand you calculate a profit function for the week as follows:
−2x² + 60x − 3y² + 72y + 100
However, you only have the necessary resources to produce 20 products in total, so x + y
has to be less than or equal to 20. Our constraint equation is therefore:
x + y <= 20
Make a tool to maximize the profit under this constraint.
How it works: the user inputs the constraint in this box, which allows some flexibility in
case the constraint value changes. Then simply run the workflow and the optimal
production levels will be output as follows.
R – Code
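Again as a hedged sketch of the idea (not necessarily the tool's exact code), base R's constrOptim() can solve this. It minimises by default, so we negate the profit, and constraints must be supplied in the form ui %*% par >= ci:

R Code (sketch):
    # Negate the profit so that minimising it maximises profit
    neg_profit <- function(p) {
      x <- p[1]; y <- p[2]
      -(-2 * x^2 + 60 * x - 3 * y^2 + 72 * y + 100)
    }

    max_units <- 20   # the user-supplied constraint

    # Constraints as ui %*% par >= ci:
    # -x - y >= -20 (i.e. x + y <= 20), plus x >= 0 and y >= 0
    ui <- rbind(c(-1, -1), c(1, 0), c(0, 1))
    ci <- c(-max_units, 0, 0)

    res <- constrOptim(theta = c(5, 5), f = neg_profit, grad = NULL,
                       ui = ui, ci = ci)
    res$par      # roughly 10.8 chairs and 9.2 tables (round to whole units in practice)
    -res$value   # maximised weekly profit, roughly 923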
Making a tool to optimise delivery routes
When you have numerous stores which deliver to a number of different locations it
becomes quite complicated to choose the optimal pick-up and drop-off pairings in order
to reduce travel time and costs. We can simulate a simple case here and create a tool in
Alteryx to solve this problem.
On the left is the location of each store/driver. Along the top is the delivery location of each
package. The corresponding cells show travel times in minutes between the two locations
(e.g. the top-left cell is the time between Sandymount and Ballsbridge). Each driver can carry
out only one delivery and you have to assign them optimally. The drivers can't do half trips,
so they either carry out the delivery or they do not – there are only two possibilities. This
makes it an assignment problem, which can be solved by integer linear programming.
What the tool does:
I configured the tool so that it will optimise the delivery assignments for any file of any size
in this format: you can have as many drop-off points and as many pick-up points as you
want. You simply go into an Excel spreadsheet, put your delivery destinations along the top
and the store/driver locations down the rows, with the corresponding travel times in the
cells (you could equally use petrol costs or total distance between the two places). The tool
provides an output with the minimised objective (total time) as well as all of the pick-up and
drop-off assignments.
R – Code
Tool Output
One of the outputs shows the minimised value of total travel time – in this case it is 56
minutes.
The next output is called an assignment matrix. From it you can work out which pick-up
points were assigned to which drop-off points. However, I coded an R tool which provides a
simple, readable table with each pick-up and drop-off point, and which will work for any file.
Final assignment table
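A minimal sketch of the underlying assignment step, using the lpSolve package as one way to do it – Sandymount and Ballsbridge appear in the example above, but the other location names and all travel times here are made up:

R Code (sketch):
    library(lpSolve)   # lp.assign() solves assignment problems by integer LP

    # Travel times in minutes: rows = store/driver locations, columns = drop-offs
    times <- matrix(c(14, 22, 30,
                      18, 16, 25,
                      27, 23, 12),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Sandymount", "Ranelagh", "Clontarf"),
                                    c("Ballsbridge", "Rathmines", "Howth")))

    sol <- lp.assign(times)
    sol$objval     # minimised total travel time

    # Turn the 0/1 assignment matrix into a readable pick-up / drop-off table
    idx <- which(sol$solution > 0.5, arr.ind = TRUE)
    data.frame(pickup  = rownames(times)[idx[, "row"]],
               dropoff = colnames(times)[idx[, "col"]])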
Stock Analysis, Return Forecasting, Risk Management and
Portfolio Optimisation in Alteryx using inbuilt Predictive
Forecasting tools and custom R code
The aim of this tool is to create a workflow which makes predictions on the returns of 11
stocks (more can be added easily) for one period ahead, and invests 1,000,000 in the top 5
stocks subject to certain risk constraints, so as to optimally achieve the highest expected
return under these restrictions.
We will use Alteryx to quickly analyse 11 different stocks. We want to predict returns using
a Time Series model to forecast the next day's return. The model will identify the top 5
stocks in terms of expected return.
The user can specify which stocks they want to analyse by simply entering the ticker symbol
of each stock in the input table, and can specify the date range for the amount of historical
data they want to use in order to measure risk and train a Time Series Model.
I included the following constraints in the optimisation of the final investment strategy for
the top 5 stocks (a sketch of this allocation step follows the list):
1. Max of 30% of the budget invested in the riskiest stock (as measured by the
standard deviation of the stock price).
2. Max of 40% of the budget in total allocated to the two most highly correlated
assets.
3. Minimum of 10% of the budget invested in each stock.
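All three constraints are linear, so the allocation step can be expressed as a linear program. Below is a hedged sketch using the lpSolve package with hypothetical expected returns; it assumes we already know which stock is the riskiest and which pair is most highly correlated:

R Code (sketch):
    library(lpSolve)

    # Hypothetical next-day expected returns for the top 5 stocks
    exp_ret <- c(0.0012, 0.0010, 0.0009, 0.0008, 0.0007)

    # Suppose stock 1 is the riskiest and stocks 2 and 3 the most correlated pair
    const_mat <- rbind(rep(1, 5),         # weights sum to 1
                       diag(5),           # each weight >= 0.10 (constraint 3)
                       c(1, 0, 0, 0, 0),  # riskiest stock <= 0.30 (constraint 1)
                       c(0, 1, 1, 0, 0))  # correlated pair <= 0.40 (constraint 2)
    const_dir <- c("=", rep(">=", 5), "<=", "<=")
    const_rhs <- c(1, rep(0.10, 5), 0.30, 0.40)

    sol <- lp("max", exp_ret, const_mat, const_dir, const_rhs)
    sol$solution * 1000000   # allocation of the 1,000,000 budget across the stocks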
The final goal will be to allow a user to simply input whatever stocks they would like, the
date range for the data, and the investment budget, and this workflow will come up with
an investment strategy automatically.
This workflow is quite long and I used a lot of custom R code to manipulate the data so
that it would work on any Ticker symbols input by the user.
This is a good example of how you can create a very repeatable process in Alteryx to
simplify some common processes. It also demonstrates how you can increase the
flexibility of Alteryx using Developer tools, as well as how to combine your own tools with
inbuilt Alteryx functions. I used the Time Series ARIMA model tool provided by Alteryx as
well as the Time Series forecast tool to come up with forecasts for the return of each
stock.
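For reference, the forecasting step for a single stock can be sketched with R's forecast package, which to the best of my knowledge underpins the Alteryx time series tools. The returns here are simulated as a stand-in for real price data:

R Code (sketch):
    library(forecast)

    # Simulated daily returns standing in for one ticker's real data
    set.seed(7)
    returns <- ts(rnorm(250, mean = 0.0005, sd = 0.01))

    fit <- auto.arima(returns)   # automatic ARIMA fit, as in the ARIMA tool
    fc  <- forecast(fit, h = 1)  # one-step-ahead forecast, as in the TS Forecast tool
    fc$mean                      # the expected next-day return used to rank stocks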
User Interface: The configuration for the workflow requires the user to enter their 11 stocks
in the first column, the start and end dates for the historical data, and the budget constraint.
Depending on the date range selected, the tool will take about 3–5 minutes to run.
Output: The user will get an output indicating the top 5 stocks and optimal allocation of
budget subject to the risk constraints. The output will also show the expected return on
the stocks for the next day.
*All of the code is documented within the R tools in the workflow.