Alteryx – Understanding and Utilising
Predictive Analytics
Table of Contents
Introduction
Linear Regression
Time Series Analysis
Different types of Time Series models
Using time series tools in Alteryx + Example
Covariate Time Series + Example
Example – adding more to your visualisations using Time Series + Linear Regression
Classification Problems
Logistic Regression
Decision Trees
Naïve Bayes Classifier
Association Rules/Market Basket Analysis
Clustering Analysis
Custom R Tools
Production Optimisation Examples
Delivery Route Optimisation
Stock Price analysis and portfolio allocation example
Introduction
The purpose of this workbook is to provide a step-by-step walkthrough of the main
predictive analytics tools within Alteryx, along with the data preparation/cleansing and data
investigation tools used in the examples. I will also show how Alteryx and Qlik
Sense can be combined to allow businesses to gain significant insights and a thorough
understanding of their data. The workbook is also intended to build an understanding of the
statistical models behind the tools. Although Alteryx has a very
simple and intuitive interface, it is important that you are able to determine and
understand:
- What models are suitable for different types of prediction cases/different types of
data.
- How to interpret the reports of the models to determine whether or not the model
you have fitted is suitable.
- How to extract useful information from the model, ultimately to apply the findings in
a business setting to achieve what should always be your goal - understanding and
insight.
The two main types of predictive goals – Quantitative Versus Categorical
data
There are two main types of data for which you will typically be trying to make a prediction,
and the models you use will differ depending on which type it is. First, we have quantitative
data – e.g. trying to predict sales based on how much was spent on advertising that month,
or predicting how many people will buy an ice cream based on the temperature. For
problems such as these you will most often use a prediction method called linear regression.
You may also use other tools such as Time Series Analysis, and possibly decision trees,
depending on the data which is available to you.
The second type of data is categorical – in this case you will be trying to predict what class
or category an observation will fall into based on other known variables. For example:
predicting whether a person will respond to a marketing campaign (two categories – Yes/No)
based on their age, previous products bought, etc.; in a medical setting, predicting whether
a person is at high risk of a certain disease (two categories – High Risk/Not High Risk);
or predicting whether a mortgage is likely to default (Default/No Default). In
this scenario, models will typically give you a probability between 0 and 1 of each
observation belonging to a certain category. Alteryx provides a number of tools for this type
of prediction, including Logistic Regression, Decision Trees, Forest Models and the Naïve
Bayes Classifier.
Predicting Continuous Data – Linear Regression
Linear regression is a tool for modelling the relationship between a target variable which we
want to predict and one or more predictor variables. For example, imagine we wanted to
predict ice cream sales at a beach based on one variable – temperature. The model
equation could look as follows: y = ax + b, where y is ice cream sales, x is the temperature,
‘a’ is a coefficient which indicates the strength of the relationship (e.g. if a = 4 it means that
if the temperature goes up 1 degree, then ice cream sales will go up by 4 on average), and b
is simply the number of ice cream sales if the temperature is equal to 0. Often you will use
more than one predictor variable (e.g. x1, x2 and x3) to predict your target variable (y). The
equation would then look like y = a*x1 + b*x2 + c*x3 + d. ‘a’, ‘b’ and ‘c’ are all simply
measuring the effect that changes in our predictor variables have on our target. All of the
predictors are taken into account, and if any of them change then the value predicted by
the model changes too.
Linear regression is also useful for inference – understanding the impact different variables
have and providing measures of their effects. This can be used to enhance visualisations
providing key information which will help in decision making within a company.
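Alteryx's predictive tools run on R under the hood, so the same model can be sketched directly in R. A minimal sketch with made-up temperature and sales values (both hypothetical):

# Hypothetical data: daily temperature and ice cream sales
sales_data <- data.frame(
  temperature = c(18, 21, 25, 28, 30, 33),
  sales       = c(70, 85, 105, 118, 125, 140)
)

# Fit the linear model: sales = a*temperature + b
model <- lm(sales ~ temperature, data = sales_data)

# The coefficient of temperature is 'a'; the intercept is 'b'
summary(model)

# Predict sales for new temperatures
predict(model, newdata = data.frame(temperature = c(22, 27)))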
Steps to creating a linear regression model in Alteryx
1. The very first thing you will usually want to do is some data investigation. This is
because you want to make sure the data is suitable for a linear regression model. A
scatterplot is appropriate here. Figure 1 shows a linear relationship (real world data
will rarely be this perfect but you should still try to choose the most appropriate
model). Figure 2 shows an exponential relationship and Figure 3 shows a logarithmic
relationship. I will show how to deal with these later. You can also perform an
association analysis to see how strongly correlated different variables are.
[Figure 1 – linear relationship]
[Figure 2 – exponential relationship]
[Figure 3 – logarithmic relationship]
2. Next, within Alteryx you will want to create the model using the linear regression
model tool. Remember you will need a certain amount of past data to train the
model; however, you should also keep some extra data to test it on. Using the ‘Create
Samples’ tool you can randomly split your data 90%/10% – train the model using 90%,
and at the end test how well it does on the remaining 10%. This shows how well
it performs on unseen data (very important). I show how to do all of this step by step
within the workflows.
3. Now you need to evaluate whether the model is appropriate. In the output report
from your linear regression model there will be 4 plots. The one in the top left corner
is important – the residuals versus fitted plot. This plot shows the difference
between the predicted values of your model and the actual values which were
observed – these differences can be positive or negative. Here, you basically want to
see randomly scattered points – if there is a discernible pattern then that means
your model is making systematic errors. For instance, if you tried to fit a linear model
to an exponential relationship, then the residuals plot would show a pattern in the
errors. This violates one of the main assumptions of a linear regression model which
is that the errors are randomly distributed. A badly fitted model with systematic error
is worse than no model. In such a case you would have to perform a very simple
transformation on the data so that you can correctly fit a linear model – I have an
example workflow where I do this in Alteryx which should make it clear.
4. In the report there are two other important measures. One is the residual standard
error – this measures the average error between observations and predictions made
by the model. There is no ‘right’ value, but obviously the smaller this is the more
accurate the model. The other value is the ‘R-squared’ value, which measures the
amount of variance (movements of the target variable, e.g. ice cream sales)
accounted for by your model. A value of 0.8 would indicate that the model accounts
for 80% of the variance in your target variable – the higher this value the
better.
5. The final step is to test out your model on the unseen data (the 10% we held back
from earlier). To do this you simply drag in the ‘Score’ tool, connect the linear
regression model object and the test data to the tool, and run the workflow. In the
output there will be your original data plus a new column – your predicted values
from the model. You can make predictions on any data using this model object as
long as you have values for the predictor variables. For example, if a company
wanted to predict how much they could expect to sell given different levels of
advertising spend, they would simply pass a data set with different levels of
advertising spend through the model and look at the predicted values. You can also
export your data and predicted values to a .qvx file and visualise the predictions in
Qlik Sense. I did this in my workflow for predicting movements in the price of gold,
and compared movements predicted by my model to actual changes in the price to
see if they moved together.
Note: Owing to the extremely high variance in real-world data, caused by the presence of so
many variables, linear regression models may not provide extremely accurate predictions.
What they really provide is a better understanding of the relationship between variables –
both its strength and its direction. For example, if advertising spend goes up or down you
can expect on average an ‘x’ amount of increase or decrease in your sales. If you change the
price of a product by a certain amount you can forecast the approximate effect this will have
on units sold. If interest rates go up a certain amount you can make a prediction of the
mean effect this will have on the price of gold, and so on.
Linear Regression Examples in Alteryx
I have attached a folder entitled ‘linear regression examples’ which contains workflows
where I document step by step the data preparation, data investigation, model creation and
model evaluation for 3 examples. The first is a basic linear regression between 2 variables.
The second shows an example where you have to slightly alter the data to fit a linear
model. The third is an example of multiple linear regression using 4 variables to predict
movements in the price of gold.
Example 1 – Examining the relationship between number of global
internet users and Google’s share price from 2004 – 2016.
This is a very simple example in which I take the number of internet users
each year from 2004 until 2016 and examine the relationship between these
values and Google’s share price for the same years.
There is a very short Alteryx workflow in the folder where you can look
through each step with documentation. I have screenshots from the report
on the next page with key values to look for and interpretations of the
statistics. At the end I also made predictions for Google’s share price for
2017/2018 given projected internet users for the corresponding years. (This
is just for demonstration purposes; the model is extremely simple and only
has 12 data points – you could increase the accuracy of the model by adding in
even more predictor variables and performing multiple linear regression, as
in the final example in this section.)
Example 1 – Output Report
In red I have highlighted the estimate for the coefficient of internet users. It can be
interpreted as follows: for every one extra internet user, the share price of Google increases
by 2.418e-07 on average. In blue, I have circled a value under the column Pr(>|t|). This
column indicates whether the relationship is deemed significant or not. If you had a lot of
predictor variables in your model, some might be significant whereas other
relationships could be due to random variation. You will have to decide, based on this value
and your intuition, whether or not to include variables in your final model. Here, the value is
a tiny number (3.55e-06), indicating that we are practically certain the relationship between
internet users and Google’s share price is significant (unsurprising). If this value was 0.1 or
above we would be less certain that the effect is significant and not due to random
variation.
Here we can read off the standard error of our model (Residual Standard Error) – its
predictions are 83.077 units off on average. Our Multiple R-squared is 0.8685 – 86.85% of
the variance in Google’s share price is accounted for by changes in the global number of
internet users.
Example 2 – Examining the relationship between adolescent
Fertility rates and GDP Per Capita (Example of having to transform
data which has an exponential relationship to fit a linear model).
You can open the Alteryx workflow, press run and allow it to run through (It will take about
a minute). I document the process whereby I prepped the data by combining two data sets I
downloaded from the World Bank website and carried out some quick data investigation. I
also explain why the linear model was not appropriate for the raw data, and how to get a
better model fit. Below is the end visualisation from Qlik Sense, comparing the actual values
of GDP/Capita against those predicted by the model. Of course, we could make a much
more accurate model by adding in many more predictor variables.
*In this example I transform the data to fit a linear regression model for an exponential
relationship. If you encountered a logarithmic relationship such as in Figure 3, the only
difference would be that instead of taking the log of your target variable, you would take
the exponent of your target variable (e^(Target Variable)).
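A minimal R sketch of the same transformation, using hypothetical fertility_rate and gdp_per_capita fields: fit the model on the log of the target, then transform the predictions back.

# Hypothetical data with an exponential relationship
df <- data.frame(
  fertility_rate = c(10, 30, 50, 70, 90, 110),
  gdp_per_capita = c(60000, 25000, 9000, 4000, 1500, 600)
)

# Taking the log of the target linearises the relationship
log_model <- lm(log(gdp_per_capita) ~ fertility_rate, data = df)

# Predictions come back on the log scale, so exponentiate to recover GDP per capita
pred_log <- predict(log_model, newdata = data.frame(fertility_rate = 40))
exp(pred_log)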
Example 3: Multiple Linear Regression – Predicting movements in
the price of gold and examining the relationship between Gold
Prices, Stock Market Volatility, the strength of the US dollar, and 10
year Treasury Rates.
This example demonstrates the process of building a multiple linear regression
model in Alteryx.
Each step works through data cleansing and joining multiple datasets together.
There is also a workflow demonstrating how to parse data in JSON form and
eventually merge this data with csv files to form a completely prepared dataset.
I also show how you can split continuous data up into categories or ‘bins’, and how to
implement these categories in your multiple regression model.
Below is the comparison between our predicted gold price values (first scatter plot)
versus actual gold price values for 3,000 data points. The model could be applied by
inputting projections for future Treasury rates, stock market volatility and US Dollar
strength in order to measure the effect they would have on the gold price, which
would help in guiding investment decisions.
Interpretations of the residuals plots in Gold price Example
This was the residual plot when I first looked at
fitting a linear model using the Volatility Index of
the stock market to predict Gold prices. The blue
line is what we want in an ideal world – no
systematic errors (i.e. not constantly over or under
estimating with predictions). However, the red line
shows us that there is a pattern in our residuals.
You can see that the model is underestimating the
price of gold at the start and underestimating the
price in the later data as well (highlighted). I tried
to reduce this by appending a new variable
splitting the data up into VIX < 40 and VIX > 40
(where the pattern changes).
This is the residuals plot after I applied the change.
As you can see it is not perfect, we would like the
red line to be perfectly horizontal. However, it is
vastly improved so I concluded that I would include
the change in the final model.
This is the residuals plot for our final model when we
have all 4 predictors included. The red line is almost
perfectly horizontal apart from a slight trend at the
very start of the data towards overestimation (red
line drifts above 0). However, it is a satisfactory
model fit overall.
Time Series Analysis
What is time series analysis?
Time series analysis is a method which aims to extract meaningful information from
a sequence of observations through time in order to better understand underlying
patterns, and ultimately to make forecasts for future observations.
Unlike other methods, which can typically work with independent observations, time
series analysis assumes that successive observations represent consecutive
measurements at equally spaced time intervals – for example, a company’s closing
share price each day, or total sales for each month.
Time series can help to understand the pattern of observations through time, which
can often be described in terms of trend, seasonality and random error.
Trend refers to the underlying systematic component of the pattern which does not
repeat itself over time – for example an airline’s average passenger growth of 4%
year on year would represent a general upwards trend in their times series data.
However, there could be significant differences from month to month. Even if there
is an underlying upwards trend in passengers each year, there may be spikes around
Christmas and summer months and dips in others. It is useful to be able to describe
this seasonal component and extract the underlying trend in order to measure
performance and also to make forecasts for the coming year.
The final component consists of random error which comes with each observation.
This will typically make it more difficult to extract the underlying pattern, and time
series tools will involve some method of filtering out random error in order to make
the pattern clearer.
Different Types of Time Series Models
ARIMA MODEL – Autoregressive Integrated Moving Average
*ARIMA models encapsulate a large number of different types of Time Series models, and
this is just a very short summary of how they work. If you wanted to learn more about them,
the following is a link to a useful resource http://people.duke.edu/~rnau/411arim.htm
ARIMA models are the most general class of models for time series forecasting. An ARIMA
model works to separate the ‘signal’ (the meaningful predictive patterns) from the random
error, and to apply these patterns to make forecasts for future observations.
You can think of an ARIMA model as a type of regression model, where the predictor
variables might include the last ‘n’ observations (autoregressive terms), or the errors
between your model’s previous predictions and the actual values (lagged forecast errors) –
when the errors of the model’s previous forecasts are used, the model can readjust when it
over- or underestimates a prediction.
Below are some examples of different ARIMA model equations for a time series:

Y(t) = a + b*Y(t-1)
(a is a constant, b is a coefficient, and Y(t-1) is the most recent observation)

Y(t) = a + b*Y(t-1) + c*Y(t-2)
(taking into account the last two observations)

Y(t) = a + Y(t-1) + b*(Y(t-1) - Y(t-2))
(now we are including the last difference as a predictor variable as well – how much the
value changed between the previous two observations helps to determine how much it will
change again; this is an example of introducing a ‘differencing’ term)

Y(Oct 96) = Y(Oct 95) + (Y(Sep 96) - Y(Sep 95))
(an example of a seasonal ARIMA model – the Oct 96 forecast is equal to last October’s
value plus the year-on-year change observed the previous month)

Y(t) = Ŷ(t-1) + α*e(t-1)
(e(t-1) is the error of the forecast for the previous observation – in this way the model
corrects itself for underestimations and overestimations in the previous forecast)
The main piece of information you can take away from this is simply that ARIMA models
work by predicting future values based on 1) previous observations, 2) how much the
values have changed between previous observations, and 3) previous errors in past
forecasts. There are many other variations of the ARIMA model. Although it is good to have
a general idea of how the models work, the good news is that Alteryx will automatically
calculate the values of these parameters.
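Alteryx's ARIMA tool is built on the R forecast package. A minimal sketch with a hypothetical monthly series, where auto.arima() searches for suitable parameter values automatically:

library(forecast)

# Hypothetical monthly sales, stored as a ts object with yearly seasonality
sales <- ts(c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
              115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140),
            frequency = 12, start = c(2015, 1))

# auto.arima() chooses the autoregressive, differencing and moving average orders
fit <- auto.arima(sales)
summary(fit)

# Forecast the next 12 months with 95% confidence intervals
forecast(fit, h = 12, level = 95)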
ETS MODEL – Error/Trend/Seasonality (Exponential Smoothing)
This model uses exponential smoothing to build a forecasting model for Time Series
analysis.
Exponential smoothing makes use of every single past observation to forecast the
next step ahead.
Although every single observation affects the forecast, the most recent have the
most influence. The very last observation will be the most influential, the second last
observation will carry slightly less weight, and as you go back further to the 50th
observation it will have negligible effect. The following is the basic equation for an
exponential smoothing model:
ŷ(t) (forecast value) = α*y(t-1) + (1 - α)*ŷ(t-1)
Note/ α is a value which lies between 0 and 1. ŷ(t-1) was the model forecast for the
previous observation; y(t-1) was the actual value of the previous observation.
The model forecast for the previous observation was in turn affected by the forecast for the
observation before that, ŷ(t-1) = α*y(t-2) + (1 - α)*ŷ(t-2), which was affected by the
observation before that, and so on. Thus, every forecast is affected by every single
previous observation to some degree. An interesting point about this model is that it is self-
correcting – if it overestimates a prediction it will take that into account for the next
prediction by adjusting for the positive or negative error.
If there is a trend in the data then another component is added to the equation
to account for this. This is also known as double exponential smoothing. The
equation would look something like this: ŷ(t) = α*y(t-1) + (1 - α)*(ŷ(t-1) + b(t-1)),
where b(t-1) is the trend component.
If there is a seasonal component to the data then this will also be taken into account
with a third component added to the equation – this is triple exponential smoothing.
*It’s not necessary to understand these equations – I am just including them in case you
would like to get a better feel for how the model works. One takeaway is that the ETS model
will make forecasts by taking into account all previous observations, the seasonal pattern in
the data, and the overall trend. As with ARIMA, Alteryx will automatically calculate the model
parameters. You can simply create both the ETS model and the ARIMA model and choose
whichever is more accurate, as shown in the example workflow.
Creating a Time Series Model in Alteryx
1. Read in your Time Series Data which will consist of recorded data for consecutive
observations at equally spaced time intervals. Split the data up into training and test
data by taking the first 90% of observations (You can test your model forecasts on
the last 10% or so).
2. Connect the ETS tool to your data, select your target field and the frequency of
observations (are they monthly/yearly/daily etc.)
3. Connect the ARIMA tool to your data, and select your target field and frequency of
observations. If you want, you can select a completely user-specified model, which
allows you to input values for the parameters I mentioned earlier. If you leave it
unticked, Alteryx will find its own solution.
4. Use the ‘Union’ tool to join the two model objects together, and pass them through
the ‘TS Compare’ tool. Also connect your test data so you can compare which model
predicts these values better.
5. In the output from the TS Compare tool you will be able to see the forecasts for each
model and how they compare to actual values. In the table, one key measure of
accuracy is the RMSE (root mean squared error) for both models. An easy way to
choose between the ETS/ARIMA model is to choose the one with the lower RMSE.
6. Finally, connect all of your original data to the model you have chosen and create the
model object. You can now make forecasts for the periods ahead by dragging in the
‘TS Forecast’ tool. You can specify the number of periods ahead you want to predict
and what level of confidence intervals you would like to include. Bear in mind that the
further into the future you forecast, the more uncertain the predictions will be.
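The same comparison can be sketched in R with the forecast package, using a simulated monthly series split into training and test data (all values hypothetical):

library(forecast)

# Simulated monthly series with trend and seasonality: 60 observations
set.seed(7)
y <- ts(50 + 0.5 * (1:60) + 10 * sin(2 * pi * (1:60) / 12) + rnorm(60, sd = 2),
        frequency = 12)

# Hold back the last 6 observations as test data
train <- ts(y[1:54], frequency = 12)
test  <- y[55:60]

# Fit both candidate models on the training data
fit_ets   <- ets(train)
fit_arima <- auto.arima(train)

# Forecast the 6 held-out periods
fc_ets   <- forecast(fit_ets, h = 6)
fc_arima <- forecast(fit_arima, h = 6)

# Compare RMSE on the test data and keep the lower one (as the TS Compare tool does)
accuracy(fc_ets, test)["Test set", "RMSE"]
accuracy(fc_arima, test)["Test set", "RMSE"]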
*I have included in the download folder an Alteryx workflow entitled Germany Trade data
(Example 4). It goes through an example of the above steps using monthly data on Germany
Commercial Trade data, resulting in a forecast for the next 12 months. Below is the forecast
with confidence intervals visualised in Qlik Sense.
Time Series Covariate forecast – Using other variables to improve
the predictive capabilities of your Time Series model!
When should you use the TS Covariate forecast tool?
The TS Covariate forecast tool can be used to improve your time series model by including
another variable which may have predictive capabilities for our target (similar to a
regression model). The model will still account for trend/seasonality/previous observations
as in the previous section, however we can also include another variable along with this to
give the model more information to make forecasts.
e.g. Imagine you are preparing a time series forecast for a large company who are looking to
make sales forecasts for the coming months (to improve handling of inventory etc.). You
could just use sales data from previous years and provide a standard Time Series model as
before. However, if you had historical data on money spent on advertising each month then
you could also include this as a predictive variable in the Time Series forecast.
Example 5 - Alteryx Work flow - Creating a covariate time series model for the FTSE100
using the Pound Sterling Index as another predictor variable
I have attached an Alteryx workflow with an example of covariate forecasting
to try and predict movements in the FTSE 100
You may only use an ARIMA model when creating a covariate time series
forecast.
Steps involved in Example Workflow
- Data Preparation in order to standardise date fields and then merge the two
datasets on these fields
- Data investigation involving an analysis of the time series plot as well as the
relationship between the FTSE100 and the Pound Sterling Index
- Splitting the data up into training data and test data. Connecting the training data to
an ARIMA tool and specifying to use covariates in the model estimation.
- Using the Covariate Time Series forecast tool to make predictions on the test data
using an ARIMA model with Sterling Index as a covariate predictor variable.
- The model could be applied to estimate future movements in the FTSE100 based on
expected Sterling Index values. For instance, you might expect Sterling to bounce
back after its heavy drop following Brexit; using all previous observations and your
forecasts for the strength of the Pound, you could make educated forecasts for the
FTSE100 for a few periods ahead.
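A sketch of the covariate idea in R's forecast package, where the covariate enters via the xreg argument (both series simulated here purely for illustration):

library(forecast)

# Simulated stand-ins for the FTSE 100 and the Pound Sterling Index
set.seed(1)
sterling <- 78 + cumsum(rnorm(200, sd = 0.3))
ftse     <- ts(7000 + 40 * (sterling - 78) + cumsum(rnorm(200, sd = 20)))

# Fit an ARIMA model with the Sterling Index as a covariate
fit <- auto.arima(ftse, xreg = sterling)

# Forecasting requires assumed future values of the covariate
future_sterling <- rep(tail(sterling, 1), 10)
forecast(fit, xreg = future_sterling)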
Monthly Sales and Advertising Expenditure – Enriching
Visualisations with Key information (Utilising Time Series
and Linear Regression tools)
In this example we have a data set with 36 consecutive months of sales and advertising
spend for a dietary weight control product. The aim is to find out whether our advertising is
having an effect on sales, and to measure this effect. Without statistical models, all
we could visualise here would be a trend graph of advertising spend against
sales. But what if we happened to increase marketing spend during a month in
which sales were usually higher anyway? Or if we increased advertising spend and sales
dropped, but sales usually dipped significantly during that time? We can use the models
within Alteryx to separate the effect of advertising spend, the general upward or
downward trend in sales, and the seasonal components, in order to better understand the
payoff on the investment in advertising.
1) Basic Visualisation in Qlik Sense – Trend in advertising and sales
2) The next step I took was to look at a decomposition plot of the sales data – in order to
extract the seasonal component. The output below shows the sales data
on top, the seasonal component in the middle, and the general underlying trend on the bottom.
3) I wanted to overlay the sales data with the seasonal component removed on
advertising spend in Qlik Sense – in order to do this I wrote a few lines of code in a
custom R tool to extract the seasonal component in table form. I then exported the
information on seasonal effects and trend into Qlik Sense. The first visualisation here
shows the average seasonal effect each month.
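The custom R tool amounts to only a few lines. A sketch of the idea (not the exact code from the workflow) with a hypothetical 36-month sales vector; inside Alteryx the data would arrive via read.Alteryx() and leave via write.Alteryx():

# Hypothetical monthly sales for 3 years
sales <- ts(c(12, 20, 18, 15, 30, 33, 25, 22, 19, 28, 35, 40,
              14, 23, 21, 17, 33, 36, 28, 25, 21, 31, 38, 44,
              16, 25, 24, 19, 36, 40, 31, 27, 24, 34, 42, 48),
            frequency = 12)

# Classical decomposition into trend, seasonal and random components
dec <- decompose(sales)

# Tabulate the seasonal effect and the seasonally adjusted sales
out <- data.frame(
  month               = cycle(sales),
  sales               = as.numeric(sales),
  seasonal_effect     = as.numeric(dec$seasonal),
  sales_less_seasonal = as.numeric(sales - dec$seasonal)
)
head(out)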
4.) The below trend graph shows the sales less the seasonal effect for each month, overlaid
with the advertising spend for each month. This gives us a better idea of the relationship
between the two, as it removes the varying seasonal effect which can distort your
interpretation of the relationship.
5.) Scatterplot of Sales less seasonal effect versus advertising expenditure.
6.) Finally, you can run a linear regression tool in Alteryx to measure the average effect that
advertising expenditure is having on your sales. This is more of an inference problem. We
are trying to figure out whether our advertising expenditure is affecting our sales as
opposed to making specific predictions for future values.
So we can see a few important points from the Alteryx output here. The correlation
between advertising spend and sales less seasonal effect is 0.37 (a moderately positive
correlation indicating that as advertising goes up, sales tend to go up as well – we need
linear regression to determine whether this is significant).
The linear regression output I screenshotted here tells us two things. The estimated
coefficient for advertising was 0.7861. The data is in thousands of US$, so this tells us that
for every 1000 dollars spent on advertising, you are getting on average $786.10 more in
sales. The p-value of 0.02624 indicates that this relationship is statistically significant. We
can conclude that the extra spend on advertising is increasing sales; however, the extra
sales are not covering the advertising expenditure on average.
This example demonstrates how predictive and inferential analytics can add a lot more
insight to your analysis as opposed to simply using visualisations. We have a much better
understanding of our data now than when we just had a simple trend graph.
Classification Problems
A new type of problem
You can consider this as a new section moving on from the purpose of predicting continuous
numerical data. As I mentioned earlier, another very important purpose of predictive
analytics is predicting classes/categories – e.g. the probability that a person will or won’t
respond to your marketing campaign/ classifying people as low or high risk of a certain
disease/ predicting whether a client is likely to default on their credit card bill/ predicting
whether a customer is likely to end their subscription etc. I will start off this section with an
explanation of our first model for this purpose – Logistic regression.
Logistic Regression – Predicting Probabilities
Logistic regression is a statistical tool which relates a binary target variable of interest
(e.g. Yes/No, Default/No Default) to one or more predictor variables (e.g. income, age,
gender, etc.). The output from a logistic regression model is a probability between 0
and 1 of each observation belonging to a certain class. The following is the basic
equation behind the logistic regression model:
Probability = 1 / (1 + e^(-a - b*x - b2*x2))
*In this equation x and x2 are the values of your predictor variables – so they will obviously
affect the probability that the observation belongs to the class (e.g. the probability they will
default). No matter what the values of the predictor variables, this equation will always give
a value between 0 and 1 (a probability). The model is made by calculating the values of
a, b, b2 (and so on, if you have more predictor variables) so as to optimise the
accuracy for your data set. Alteryx will optimise these values automatically.
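Underneath, this is the model R fits with glm(). A minimal sketch with hypothetical age and income fields predicting a Yes/No response:

# Hypothetical training data: did the client respond to the campaign?
clients <- data.frame(
  age      = c(23, 35, 41, 52, 29, 60, 33, 47, 55, 26),
  income   = c(28, 45, 52, 61, 33, 40, 48, 58, 36, 30),
  response = factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No"))
)

# Logistic regression: family = binomial gives the equation above
model <- glm(response ~ age + income, data = clients, family = binomial)
summary(model)

# Scoring a new client returns a probability between 0 and 1
predict(model, newdata = data.frame(age = 44, income = 50), type = "response")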
Logistic Regression in Alteryx – Examples (Visualisation in
Qlik Sense).
Example 6 – Classification problem: Predicting which clients will respond to a
bank marketing campaign (whether they subscribe to a long-term deposit) based
on 17 attributes including age, gender, education level, whether they have a
housing loan, and more.
The data set: The data set consists of about 45,000 instances of the outcome from being
contacted by the bank’s marketing campaign, along with attributes for each client. We want
to know which clients will subscribe to a long-term deposit after being contacted. This can
be used in future campaigns to target customers with the highest probability of responding
– ultimately to increase efficiency and response rates.
*The Alteryx workflow is contained in the folder for Logistic Regression Examples. Each step
is documented and explained. Note that the entire workflow will take about 15 minutes to
run depending on your computer.
The first step I carried out was data preparation, and subsequently writing the prepared data
into .qvx files in order to visualise it in Qlik Sense. I wanted to compare response rates to the
campaign across different categories to get a quick overview of the data. I have included
some of these visualisations here for illustration.
Note/ The average response rate is the proportion of each category who responded.
This graph indicates that people younger than about 35 years old and older than about 60
years old were more likely to respond.
This was an interactive bar chart splitting the clients up by job and education level. The
highest response rate here is for retired individuals with tertiary level education. Overall the
‘retired’ category seems to have the highest response rate.
This visualisation looks at response rates based off of whether the individual responded to a
previous campaign. Unsurprisingly, the highest response rate (65%) was for previous
responders.
This scatterplot looks at the response rates based on the number of days since the client
had previously been contacted. I created 15 bins for the number of days in Alteryx, so the x-
axis represents categories. The highest response rates seem to be for clients in bins 8-11,
which is roughly the 400-500 day range, so perhaps customers who had not been hassled by
campaigns recently were more likely to respond. The scatterplot doesn’t show the response
rate for clients who had never been contacted; however, seeing as almost all of
these response rates are above the overall average, those who had not been contacted
before must have had a very low response rate.
The construction of the logistic regression model is all carried out and documented in the
Alteryx workflow. The example includes:
- Data preparation for Qlik Sense and model development / creating categories
for numeric variables and appending these as a new column
- Interpreting the logistic regression output report
- How to use the stepwise regression tool in Alteryx
- How to use the nested test tool in Alteryx
- How to test the logistic regression model
- How to use the ‘Lift Chart’ tool in Alteryx to demonstrate the efficacy of your
model
Example 7 – Medical Example: Creating a Logistic Regression model to calculate
probabilities of people having heart disease and automatically flagging high-
risk patients based on 13 attributes including Age, Sex, chest pain type, serum
cholesterol, maximum measured heart rate and more.
Data Set
The data set was downloaded straight into the Alteryx interface from the UCI Machine
Learning Repository. The dataset contains information on about 300 patients with regards
to 13 different attributes and the target variable which was whether or not the patient was
determined to be suffering from heart disease.
Problem Definition: We want to create a logistic regression model which will be able to
determine whether a patient is at low, moderate, or high risk of suffering from heart disease
based off of the recorded attributes, and provide us with a list of patients who are at
moderate to high risk.
Result: The model was tested on 45 patients, and flagged 24 participants as being at either
Moderate or High Risk of Heart Disease. 15/24 of these patients did in fact have heart
disease, and of the 45 patients in total, 20 had heart disease. The model is effective – by
flagging only 53% of the participants it correctly found 75% of the patients with heart
disease. The model could be part of a screening process which could give a quick indicator of
the risk level of new patients.
*The Alteryx workflow for this example is contained in the folder ‘Logistic Regression
examples’ and provides a step by step explanation of the steps taken to achieve the
objective.
The workflow contains the following:
- How to directly download a dataset in csv format from a webpage into
Alteryx and prepare the data using the parsing tools.
- How to carry out some basic data investigation in Alteryx using the data
investigation tools.
- Creating samples of data, then using the logistic regression model tool, stepwise
regression tool and nested test tool to come up with a final model.
- Using the formula and filter tools to classify the test data into different
categories based on the probabilities assigned by the model, and finally to
output a list of patients at moderate to high risk of having heart disease (see
the sketch below).
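The final classification step can be sketched in R with hypothetical data and risk cut-offs (the workflow itself uses Alteryx's formula and filter tools for this logic):

# Hypothetical patient data and fitted model (names and cut-offs are illustrative)
patients <- data.frame(
  age     = c(54, 61, 45, 39, 67, 50, 42, 58),
  max_hr  = c(150, 120, 170, 180, 110, 140, 175, 125),
  disease = factor(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"))
)
model <- glm(disease ~ age + max_hr, data = patients, family = binomial)

# Probability of heart disease for each patient
patients$risk_prob <- predict(model, type = "response")

# Hypothetical cut-offs: below 0.3 low risk, 0.3-0.6 moderate, above 0.6 high
patients$risk <- cut(patients$risk_prob,
                     breaks = c(0, 0.3, 0.6, 1),
                     labels = c("Low", "Moderate", "High"),
                     include.lowest = TRUE)

# Final output: the list of patients flagged as moderate or high risk
subset(patients, risk %in% c("Moderate", "High"))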
The Decision Tree – Another Classification Tool
The decision tree tool is a statistical method which once again tries to predict a target
variable based on one or more predictor variables which are expected to have an influence
on the outcome. Decision trees can actually be used to predict continuous or categorical
data.
How does the decision tree work?
Decision trees build regression or classification models in the form of a tree structure using
IF-THEN rules to split the data at nodes. The end result will be a tree with decision nodes and
leaf nodes. The decision nodes contain two branches which will split the data based on
which branch the piece of data satisfies. E.g. in our bank marketing campaign example one
of the decision nodes could have been – gender? - with two branches – male and female.
Another split could be numeric: e.g. age > 45? – with two branches – yes/no. The
decision nodes break the data down into smaller and smaller subsets, which will eventually
be classified by a leaf node. A leaf node represents the final classification or decision on
the data. In the example below, the leaf nodes decide whether each person will be a
purchaser or non-purchaser.
How do you determine what attributes to split on?
This decision will be based on two factors – entropy and information gain.
Imagine your decision tree is trying to predict a category from two classes e.g.
Purchase/Non-Purchaser. If you had 20 data points and you had different attributes to split
on, imagine one decision left you with two groups of ten people. Within these two groups of
10 people, 5 are purchasers and 5 are non-purchasers. This would not be a good decision to
split on, as it tells us literally nothing about the likelihood of purchase given membership
of either group – it is 50:50. Take another decision which leaves you with two groups of
ten, with one group having 8 purchasers and 2 non-purchasers, and the other having 3
purchasers and 7 non-purchasers. This is a better decision, it increases the ‘purity’ of each
group. If we knew you were part of group 1 we could say there is a 0.8 probability that you
will purchase, whereas if you were part of group 2 we could say there is a 0.7 chance you
won’t purchase. At each decision, the algorithm will try to maximize the purity of the
groups, and this is the basis upon which the decision nodes are chosen – what will give us
the most information gain. The root node (the first decision) will be the most influential
factor in determining membership of a class. If you could only look at one attribute, this
would be it.
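Alteryx's Decision Tree tool is built on R's rpart package. A minimal sketch with hypothetical age and salary fields:

library(rpart)

# Hypothetical customer data
customers <- data.frame(
  age      = c(22, 45, 36, 51, 28, 60, 33, 48, 39, 55),
  salary   = c(25, 52, 38, 61, 30, 58, 44, 49, 35, 63),
  purchase = factor(c("No", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes"))
)

# Grow the tree; splits are chosen to maximise the purity of the resulting groups
tree <- rpart(purchase ~ age + salary, data = customers,
              method = "class", control = rpart.control(minsplit = 4))

# Text summary of the decision and leaf nodes
print(tree)

# Class probabilities for a new customer
predict(tree, newdata = data.frame(age = 40, salary = 50), type = "prob")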
Feature selection – One of the key processes of a decision tree is feature selection. This
involves selecting a subset of all the variables you have to identify the most important ones
with regards to your target variable. This can be advantageous in that you will be able to
quickly identify the most influential attributes in your dataset, however you may lose out on
making use of important information in some cases.
Should you use logistic regression or a decision tree?
There is no blanket rule for deciding which method to use. Decision trees
can be very good at quickly identifying rules amongst your data, i.e. if a
person is over 30 and has a salary > 40,000 they are more likely to purchase
your product, etc. In essence, it depends on the distribution of your data. If I
had to pick one I would go with logistic regression; however, if you have time
then it is worthwhile creating both a logistic regression model and a
decision tree model, and subsequently comparing them based on the
number of correct classifications and a lift chart.
Decision trees are probably easier to understand and also provide a visual aid
which can be more appealing to people (although accuracy should be the
priority). It is easy to follow along down a decision tree and identify which
characteristics are the most impactful.
Decision Tree (Example 8) – Using Census Data to predict whether a
person’s salary is >50,000 or <=50,000.
The data set:
This dataset was retrieved from the UCI Machine Learning Repository and consists of census
data from the 1994 database. We have information on each individual regarding things such
as age, gender, work class, marital status, race, hours worked per week, native country etc.,
as well as whether their salary is >50,000 or <=50,000.
Hypothetical Use Case/ Problem:
Imagine a watch company has developed a high-end watch, and they wish to develop a
targeted marketing campaign directed towards customers with higher earnings. They have
previously collected data on customers with regards to certain metrics such as where they
work, what age they are, nationality etc., and want to develop a model which will predict
whether these customers are in the higher salary bracket so that they can target this group
with their marketing campaign, maximizing the pay-off from their advertising expense. You
must develop a classification model to achieve this goal.
Method: The method to configure the decision tree tool is contained within the examples
folder under Decision Tree Example. It is a very short workflow and should only take about
1-2 minutes to run.
Decision Tree output:
On the next couple of pages I have included some screenshots of the output from the
decision tree model. The first is the decision tree diagram. Although it is somewhat unclear,
the first decision node is whether or not the person’s relationship variable was either Not in
Family, Unmarried, other relative or own child. If you go down to the second output which is
Variable Importance then you can see relationship is the most important variable in
deciding each person’s income. The decision tree splits generally start in order of
importance, with the latter splits being less influential. You can see at the end of branches
there are two options - <=50K and >50K. The decision tree makes a decision on each subset
based on which category the majority of the group falls into. Ideally you would have over
90% of a group in either of the categories, but you can see within the Alteryx output
workflow on the interactive chart that some of the subsets are more uncertain e.g.
60%/40%.
Decision Tree Interpretation
You can click on any node within the
decision tree and see the path to that
node, as well as the final decision
which was made on that subset of
data. As you can see here this node
consisted of people who were in one of
the relationship categories specified in
the first decision, and who also had
Capital Gains < 7074. 95% of this group
had salaries <= 50,000 so the model
will classify anyone satisfying these
conditions as having a salary which is
less than $50,000.
The decision node I clicked on here
consists of people who were not in
the relationship categories specified in
the first decision, whose education
was one of the categories listed in
the second decision, capital gain
< 5096, occupation not one of
those listed in the 4th decision, age
> 32.5, and capital loss < 1846. 88% of
this subset had salaries > 50,000 and
12% had salaries < 50,000, so the
model will classify people in this
category as having salaries > 50,000.
Result
Below is the output from a tool called a ‘lift chart’. It demonstrates how much more efficient
you would be if you used the model predictions as opposed to choosing people completely
at random. We want to find all the people who have incomes >$50,000. In total, 24.26% of
the test data had incomes >$50,000, which was equal to 395 people. If you had
randomly sampled 50% of the test data you would have found about 50% of these 395
people (as shown by the black base line in the chart). However, if you had taken the top 50%
of probabilities from the output of the decision tree, you would have captured about 90% of
the 395 people – an increase of 40 percentage points. The blue curved line indicates the
extra amount of our target group we would get using the model predictions as opposed to
random choice.
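The cumulative gains logic behind the lift chart can be sketched in R with simulated scores and outcomes (all values hypothetical):

# Simulated scored test data: model probability and actual outcome (1 = income >50K)
set.seed(1)
prob   <- runif(1000)
actual <- rbinom(1000, 1, prob)  # outcomes loosely follow the probabilities

# Sort by descending model probability and accumulate the positives captured
ord   <- order(prob, decreasing = TRUE)
gains <- cumsum(actual[ord]) / sum(actual)

# Proportion of the target group found in the top 50% of probabilities,
# versus the 50% expected from random selection
gains[500]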
Naïve Bayes Classifier – Classification Tool
The Naïve Bayes Classifier is another classification tool. It is based on applying Bayes’
theorem to training data. In general, the NB Classifier is outperformed by other models
such as logistic regression, forest models and decision trees. However, it can be beneficial to
use when the amount of data you have to train your model is limited. The basic equation
behind the model is based on conditional probabilities as follows:
P(Ck | X) = P(X | Ck) * P(Ck) / P(X)
A key feature of the NB classifier is that it assumes the features are independent – that is, if
you have three predictors for instance, having one particular feature does not affect
the value of another predictor. However, this is unfortunately often not true, and in these
cases the NB classifier will not perform as well as other classification models.
The NB classifier will predict the probability of an observation being a member of any
number of classes – e.g. is the customer likely to buy sportswear, formal wear, or casual wear.
Just remember that with any of the classification tools you can try different ones and
simply choose the most accurate. Measure the accuracy by how well the model does on the
test data, not on the training data.
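An equivalent model can be sketched with the naiveBayes() function from R's e1071 package (the field names here are hypothetical):

library(e1071)

# Hypothetical customer data
visits <- data.frame(
  city     = factor(c("Leeds", "York", "Leeds", "Hull", "York", "Hull")),
  n_visits = c(2, 8, 5, 1, 9, 3),
  spend    = c(40, 210, 120, 15, 260, 60),
  response = factor(c("No", "Yes", "Yes", "No", "Yes", "No"))
)

# Fit the classifier; features are assumed independent given the class
nb <- naiveBayes(response ~ city + n_visits + spend, data = visits)

# Probability of responding for a new customer
predict(nb,
        newdata = data.frame(city = factor("Leeds", levels = levels(visits$city)),
                             n_visits = 4, spend = 90),
        type = "raw")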
As an example of how to use the tool, I have included a workflow to predict which customers
will respond to a marketing campaign based on which store they shop at, which city they
are from, how many times they have visited the store and how much they have spent
there. The configuration is very simple and follows a similar process to previous models. I
also visualised the results in Qlik Sense to give a quick snapshot of likely responders
in the unseen data and a better understanding of our customers.
Probability of responding by customer location
Average probability of response for customers in each city
Probability of responding by visits/Spend – Probability indicated by size of the dot
Association Analysis / Market basket analysis to aid with
product placement, recommendation systems/ targeted
promotions
Association rule learning
In the previous sections such as linear regression/ classification tools we were intentionally
trying to predict a target variable using certain predictor variables – so we had a defined
objective when we started developing the models. However, association rule learning is a
method which can discover interesting relationships among your data without having
specifically defined predictor and target variables. For instance, one of the major uses of
association analysis is in market basket analysis – the objective of market basket analysis is
to identify patterns amongst consumer buying behaviour in order to identify relationships
amongst products bought. E.g. the most regularly occurring items in transactions and most
importantly – what items are typically bought together. For instance, if a customer buys
garden chairs and a sun umbrella are they likely to buy a barbecue set as well. This can help
a company with product recommendation systems, product placement in stores as well as
targeted advertising and promotions which can enhance the customer’s buying experience.
Recommender systems, such as the Netflix movie recommendation system and Amazon’s
product recommendation system, are another practical application of association analysis.
Using the large amounts of data these companies collect, they can identify what other
products a customer might like based on what they have already watched/bought, as
well as the patterns which have been previously identified in consumer behaviour. Ratings
systems can be incorporated as well – for instance, by looking at the ratings a
consumer has given to other movies you could identify other movies they would very likely
enjoy. Association analysis models allow you to quickly extract and gain value from large
volumes of transactional data.
Market Basket Analysis in Alteryx
Alteryx provides some great tools to carry out association analysis on
transactional data. The final output will be a table with an item/group of items on the left
hand side, and an item on the right hand side which the customer would also likely be
interested in, as identified by the model. There are only three key terms which you need to
understand in order to interpret the report output.
Support – Simply put, support is a measure of how often the item/item set occurs in a
transaction. If this value is 0.01, then the items occur in 1% of transactions; if it is 0.5,
then they occur in 50%. Ideally, it is nice to identify rules where the support is high, as
these rules or associations will be applicable to a larger number of transactions. However, it
depends on the business. A supermarket will have items with large support (e.g.
bread/milk), whereas an online retailer like Amazon will have items with very low support
simply due to the huge variety of products they have on sale.
Confidence – As I mentioned, the output from the market basket analysis will be an item or
group of items on the Left hand side and an item on the right hand side which is also likely
to be bought. The confidence measure is interpreted as follows – if it is = 0.8, then you can
be 80% confident that if the items on the LHS are bought, then the item on the RHS will also
be bought, if it is 0.9 then you are 90% confident.
Lift – The lift is calculated by the following formula:
Lift = Support(all items on LHS and RHS occurring in the same transaction) /
(Support(items on LHS) × Support(items on RHS))
This is interpreted as the probability that the group of items occur together divided by the
probability that the items on the LHS and RHS occur together if we are assuming they are
independent and have no relationship. If the value is = 1 then there is no relationship, if the
value is > 1 then each item is more likely to be bought given that the other one has been
bought. So we want values which are > 1, and the higher the better.
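Alteryx's market basket tools are built on the R arules package. A minimal sketch with a handful of hypothetical baskets:

library(arules)

# Hypothetical transactions: each element is one basket
baskets <- list(
  c("bread", "milk"),
  c("bread", "milk", "butter"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("bread", "milk", "jam")
)
trans <- as(baskets, "transactions")

# Mine rules above minimum support and confidence thresholds
rules <- apriori(trans,
                 parameter = list(supp = 0.2, conf = 0.6, minlen = 2))

# Inspect rules sorted by lift (values > 1 indicate a real association)
inspect(sort(rules, by = "lift"))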
Alteryx Example: Identifying Patterns amongst a dataset of
541,909 product sales of 4,224 different products –
Visualising the results in Qlik Sense
The data set: The dataset consists of approximately 540,000 lines of data, with each line
having an invoice number, a product description, a customer ID, date and other details
relating to each transaction. We are interested in looking at what items are typically bought
together, so we could use the invoice number as a representation of each transaction –
items with the same invoice number have been purchased at the same time. We could also
use the customer ID to identify patterns of items which customers are typically interested
in, even if they have not been bought at the same time. In this example I use the invoice
number.
The problem/goal: We want to identify the most significant patterns amongst consumer
purchases which we can easily analyse and visualise in Qlik Sense, in order to better
understand the customer’s buying behaviour so we can leverage this insight to improve
product recommendations and targeted marketing. *The data set I used here is mainly bulk
orders from other shops – so the patterns are quite obvious, however it is still a good
example of how to integrate this tool.
*The Alteryx workflow with step by step instructions is included in the Association rules
folder – including data preparation, selecting the appropriate fields, and tool configuration.
Main output report – Alteryx
Here is a snippet of the output report from the market basket analysis tools, with the data
which I exported into a .qvx file for visualisation in Qlik Sense. The item sets are sorted in
descending order of support. The first line can be interpreted as follows: the items occur in
3.22% of transactions; if a Jumbo Bag Pink Polkadot is purchased we can be 67.67%
confident that the Jumbo Bag Red Retrospot will also be purchased; and the lift of 8.21 is
significantly greater than 1, which tells us that the purchase of one item greatly increases
the probability that the other item will also be purchased.
Qlik Sense app for Visualisation
The visualisation consisted of an interactive app whereby you could click on any item set
and immediately retrieve information on the lift/support/confidence and compare it to
other item sets. The scatter plot on the top left has support on the x-axis and Lift on the y-
axis with the size of the bubbles indicating confidence levels. Item sets which are further to
the right of the graph and higher up should be prioritized and it provides a very easy way of
identifying the most significant patterns. I also included charts comparing item sets purely
by support, confidence and lift as well as looking at the top 10 sellers in quantity sold and
revenue as seen below.
Clustering Analysis
What is Cluster Analysis?
Cluster analysis is another type of ‘exploratory’ data mining, similar to the association rules
– in which we have no defined variable we want to predict but we are searching for patterns
in our data. For instance, imagine you have a large data set with data on each customer
such as geographic/location data, data from surveys, their age, amount spent on different
types of products and so on. You might use cluster analysis to identify groups of customers
which are ‘similar’ to one another based off of these attributes. Very often you can have
natural clusters in the data, and within these clusters customers may behave differently.
Often you can use cluster analysis as a pre-processing technique before other data mining
techniques. Instead of treating all of the data the same way you can split it up into natural
groupings first, and then examine behaviour within each group. For example, we could
tailor our market basket analysis to each cluster, thus increasing our understanding of
customer behaviour. Furthermore, cluster analysis can be used in other areas such as
investing. Clustering analysis could be used to identify clusters of stocks which are similar in
their week-to-week performance, which can help you to understand their behaviour and aid
in diversifying your portfolio to reduce unsystematic risk.
Example of cluster analysis on 2 numeric variables – identifying natural groupings
Example of cluster analysis on three numeric variables – algorithm identifies clusters based
on ‘closeness’ of measures
*It is important to note that the clustering tools in Alteryx work on numeric data and not
categorical data. The algorithm works by using measures of ‘closeness’ between data points
– easy to visualise when there are only two or three variables as in the above graphs. You
cannot measure how close two categorical variables are. However, if your categorical
variables have a natural order to them, e.g. primary school, secondary school,
undergraduate, graduate – then you could transform these to numeric variables by
assigning 1,2,3,4 respectively.
Carrying out a cluster analysis in Alteryx:
The first step you have to carry out is to determine the number of clusters
you want. This can be done using the K-Centroids diagnostics tool. You will
have to specify your minimum number of clusters and your maximum
number of clusters – depending on the business problem you will have to
decide, also using trial and error. Between 3 and 10 will cover most bases.
The above graphs will be included in your output report. For each number of clusters (3 – 8
here) you will have box plots for adjusted rand and Calinksi-Harabasz indices. You will be
looking for the cluster group with the highest in both these graphs. In this case cluster 4 has
the highest adjusted rand , and cluster 6 has the highest C-H index. Cluster 6 is also quite
high in the adjusted rand, so you could choose 6 clusters in the next step.
The adjusted Rand index measures how consistently observations end up grouped together
across repeated clustering runs (i.e. how stable the clusters are), whereas the Calinski-Harabasz
index compares the variance between groups to the variance within groups. Ideally, we want
high similarity within groups and high variance between groups for well-defined clusters.
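The Diagnostics tool computes these indices for you, but for intuition, here is a hedged sketch of how the Calinski-Harabasz comparison could be reproduced in plain R using the fpc package and simulated data – an illustration of the statistic, not the Alteryx tool's exact code:

R Code (sketch):
    library(fpc)   # provides calinhara() for the Calinski-Harabasz index

    set.seed(42)
    dat <- scale(matrix(rnorm(600), ncol = 3))   # simulated, standardised data

    # Compare k = 3..8, as in the diagnostic plots above
    ch <- sapply(3:8, function(k) {
      km <- kmeans(dat, centers = k, nstart = 25)
      calinhara(dat, km$cluster)
    })
    names(ch) <- 3:8
    ch   # higher values indicate more compact, better-separated clusters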
The next step will be to use the K-Centroids analysis tool. Select your
clustering method and the number of groups you want (as determined by the
previous step). This tool will separate all the observations into that number of
groups and provide a report with details on each group.
You can then use the Append Cluster tool to create a new column indicating
which cluster each observation is in. Finally, you could carry out other data
mining techniques to examine behaviour within each cluster – e.g. does a
marketing campaign have more of an impact on one group than another? Do
the different groups have different buying patterns, as determined by market
basket analysis?
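For intuition, a rough plain-R equivalent of the analysis and append steps might look like the following, using kmeans() as a stand-in for the K-Centroids tool and made-up spending data (a sketch, not the workflow's actual code):

R Code (sketch):
    set.seed(1)
    dat <- data.frame(fresh  = rexp(100, 1 / 5000),   # hypothetical spend columns
                      milk   = rexp(100, 1 / 3000),
                      frozen = rexp(100, 1 / 2000))

    # Standardise so that no single field dominates, then cluster
    km <- kmeans(scale(dat), centers = 5, nstart = 25)

    # Equivalent of the Append Cluster tool: add the cluster label as a new column
    clustered <- data.frame(dat, cluster = km$cluster)

    # Then examine behaviour within each group, e.g. mean spend per cluster
    aggregate(. ~ cluster, data = clustered, FUN = mean)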
Example in Alteryx – Grouping Clients of a Wholesale distributor by
spending amounts in 6 different product categories
The data set: the data comprises 440 clients of a wholesale distributor. For each client
we have their spending amounts in 6 different categories –
Fresh/Milk/Grocery/Frozen/Detergents/Delicatessen.
The aim: The aim is to identify ‘similar’ clients based on spending across categories. This will
give us a better understanding of our clients’ behaviour and help in further data mining
techniques. Clients with similar levels of spending across categories will cluster together.
*The Alteryx workflow is contained in the folder. I followed the steps above to carry out the
clustering. Run the workflow to see all of the output; I will highlight the most important
things to look out for here.
The Diagnostics tool indicated that 5 clusters resulted in the most stable groupings
(adjusted Rand) and one of the highest levels of separation between groups (C-H
index). We always want well-defined clusters – that is, members of one cluster are
dissimilar to members of other clusters, and members of each cluster are similar to
each other.
Below is my configuration for the K-Centroids analysis tool. I standardise
the fields so that one category does not dominate the algorithm, and I
choose 5 as the number of clusters based on the previous output.
Finally, I appended the cluster group to each client. Now the wholesaler has
a method of grouping similar customers together. If we were to go on and
carry out market basket analysis we could do it individually for each group,
and we would very likely observe different rules between clusters.
Furthermore, you could look at the effect of promotional offers on different
clients, which could help with targeted promotion and result in more
efficient campaigns in the future.
Alteryx Functionality – Creating your own
tools using R code
Optimisation Problems
In business, optimisation techniques can be used to maximise efficiency: identifying the
best use of limited resources, assigning the right resources to each task (e.g. delivery
routes), and minimising costs while maximising revenue. There are numerous instances
where optimisation can be used in business, and many of them can be modelled with
formulae. We can use the flexibility of Alteryx to show how to model these problems and
create tools to solve them using the functionality of R.
There will be functions which we want to minimise or maximise subject to some parameters
and constraints. R provides routines which we can use to build our tools and identify
solutions to these optimisation problems, including the functions optim(), constrOptim(),
nlm() and optimize() from the base stats package.
Example 1: An apartment complex has 500 apartments to rent. Based on current costs
and pricing, the profit function is: −8x² + 3200x − 80,000 (where x = the number of
apartments rented). How many apartments should they actually rent out in order to
maximize their profit?
- Maximise the function subject to x <= 500 (the total number of apartments).
1) R has a built-in function called optimize() which will solve these types of
one-dimensional problems.
I made a very simple tool in Alteryx to demonstrate how R code can be used
to create your own solutions. The tool allows the user to input the constraint
(i.e. how many apartments are available to rent, as the maximum may
change depending on this constraint). The output from the tool will be a
graph of the function, the point where the function reaches its maximum,
and the profit at that point.
R Code
Tool Output
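Since the screenshots are not reproduced here, this is a minimal sketch of the core call – my own version of the idea rather than the tool's exact code:

R Code (sketch):
    profit <- function(x) -8 * x^2 + 3200 * x - 80000

    max_apts <- 500   # the user-supplied constraint
    opt <- optimize(profit, interval = c(0, max_apts), maximum = TRUE)

    opt$maximum    # about 200 apartments rented
    opt$objective  # profit of about 240,000 at that point

    # A simple plot of the function, as in the tool output
    curve(profit, from = 0, to = max_apts)
    abline(v = opt$maximum, lty = 2)

Note that the unconstrained maximum (200 apartments) already satisfies x <= 500 here, so the constraint only changes the answer when the user enters a value below 200.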
Optimization Problem 2: more variables and a constraint function.
Imagine there is a company which produces two products – tables and chairs. Let x =
number of chairs produced and y = number of tables produced. Based on pricing, costs
and forecasted demand you calculate a profit function for the week as follows:
−2x² + 60x − 3y² + 72y + 100
However, you only have the necessary resources to produce 20 products in total, so x + y
has to be less than or equal to 20. Our constraint equation is therefore:
x + y <= 20
Make a tool to maximize the profit under this constraint.
How it works: the user inputs the constraint in this box, which allows some flexibility in
case the constraint value changes. Then simply run the workflow and the optimal
production levels will be output as follows.
R – Code
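Again as a hedged sketch of the idea (not necessarily the tool's exact code), base R's constrOptim() can solve this. It minimises by default, so we negate the profit, and constraints must be supplied in the form ui %*% par >= ci:

R Code (sketch):
    # Negate the profit so that minimising it maximises profit
    neg_profit <- function(p) {
      x <- p[1]; y <- p[2]
      -(-2 * x^2 + 60 * x - 3 * y^2 + 72 * y + 100)
    }

    max_units <- 20   # the user-supplied constraint

    # Constraints as ui %*% par >= ci:
    # -x - y >= -20 (i.e. x + y <= 20), plus x >= 0 and y >= 0
    ui <- rbind(c(-1, -1), c(1, 0), c(0, 1))
    ci <- c(-max_units, 0, 0)

    res <- constrOptim(theta = c(5, 5), f = neg_profit, grad = NULL,
                       ui = ui, ci = ci)
    res$par      # roughly 10.8 chairs and 9.2 tables (round to whole units in practice)
    -res$value   # maximised weekly profit, roughly 923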
Making a tool to optimise delivery routes
When you have numerous stores which deliver to a number of different locations it
becomes quite complicated to choose the optimal pick-up and drop-off pairings in order
to reduce travel time and costs. We can simulate a simple case here and create a tool in
Alteryx to solve this problem.
On the left is the location of each store/driver. Along the top is the delivery location of each
package. The corresponding cells show travel times in minutes between the two locations
(e.g. the top-left cell is the time between Sandymount and Ballsbridge). Each driver can carry
out only one delivery and you have to assign them optimally. The drivers can't do half trips,
so they either carry out the delivery or they do not – there are only two possibilities. This
makes it an assignment problem, which can be solved by integer linear programming.
What the tool does:
I configured the tool so that it will optimise the delivery assignments for any file of any size
in this format: you can have as many drop-off points and as many pick-up points as you
want. You simply go into an Excel spreadsheet, put your delivery destinations along the top
and the store/driver locations down the rows, with the corresponding travel times in the
cells (you could equally use petrol costs or total distance between the two places). The tool
provides an output with the minimised objective (total time) as well as all of the pick-up and
drop-off assignments.
R – Code
Tool Output
One of the outputs shows the minimised value of total travel time – in this case it is 56
minutes.
The next output is called an assignment matrix. From it you can work out which pick-up
points were assigned to which drop-off points. However, I coded an R tool which provides a
simple, readable table with each pick-up and drop-off point, and which will work for any file.
Final assignment table
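A minimal sketch of the underlying assignment step, using the lpSolve package as one way to do it – Sandymount and Ballsbridge appear in the example above, but the other location names and all travel times here are made up:

R Code (sketch):
    library(lpSolve)   # lp.assign() solves assignment problems by integer LP

    # Travel times in minutes: rows = store/driver locations, columns = drop-offs
    times <- matrix(c(14, 22, 30,
                      18, 16, 25,
                      27, 23, 12),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Sandymount", "Ranelagh", "Clontarf"),
                                    c("Ballsbridge", "Rathmines", "Howth")))

    sol <- lp.assign(times)
    sol$objval     # minimised total travel time

    # Turn the 0/1 assignment matrix into a readable pick-up / drop-off table
    idx <- which(sol$solution > 0.5, arr.ind = TRUE)
    data.frame(pickup  = rownames(times)[idx[, "row"]],
               dropoff = colnames(times)[idx[, "col"]])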
Stock Analysis, Return Forecasting, Risk Management and
Portfolio Optimisation in Alteryx using inbuilt Predictive
Forecasting tools and custom R code
The aim of this tool is to create a workflow which makes predictions on the returns of 11
stocks (more can be added easily) for one period ahead, and invests 1,000,000 in the top 5
stocks subject to certain risk constraints, so as to optimally achieve the highest expected
return under these restrictions.
We will use Alteryx to quickly analyse 11 different stocks. We want to predict returns using
a Time Series model to forecast the next day's return. The model will identify the top 5
stocks in terms of expected return.
The user can specify which stocks they want to analyse by simply entering the ticker symbol
of each stock in the input table, and can specify the date range for the amount of historical
data they want to use in order to measure risk and train a Time Series Model.
I included the following constraints in the optimisation of the final investment strategy for
the top 5 stocks (a sketch of this allocation step follows the list):
1. Max of 30% of the budget invested in the riskiest stock (as measured by the
standard deviation of the stock price).
2. Max of 40% of the budget in total allocated to the two most highly correlated
assets.
3. Minimum of 10% of the budget invested in each stock.
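All three constraints are linear, so the allocation step can be expressed as a linear program. Below is a hedged sketch using the lpSolve package with hypothetical expected returns; it assumes we already know which stock is the riskiest and which pair is most highly correlated:

R Code (sketch):
    library(lpSolve)

    # Hypothetical next-day expected returns for the top 5 stocks
    exp_ret <- c(0.0012, 0.0010, 0.0009, 0.0008, 0.0007)

    # Suppose stock 1 is the riskiest and stocks 2 and 3 the most correlated pair
    const_mat <- rbind(rep(1, 5),         # weights sum to 1
                       diag(5),           # each weight >= 0.10 (constraint 3)
                       c(1, 0, 0, 0, 0),  # riskiest stock <= 0.30 (constraint 1)
                       c(0, 1, 1, 0, 0))  # correlated pair <= 0.40 (constraint 2)
    const_dir <- c("=", rep(">=", 5), "<=", "<=")
    const_rhs <- c(1, rep(0.10, 5), 0.30, 0.40)

    sol <- lp("max", exp_ret, const_mat, const_dir, const_rhs)
    sol$solution * 1000000   # allocation of the 1,000,000 budget across the stocks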
The final goal will be to allow a user to simply input whatever stocks they would like, the
date range for the data, and the investment budget, and this workflow will come up with
an investment strategy automatically.
This workflow is quite long and I used a lot of custom R code to manipulate the data so
that it would work on any Ticker symbols input by the user.
This is a good example of how you can create a very repeatable process in Alteryx to
simplify some common processes. It also demonstrates how you can increase the
flexibility of Alteryx using Developer tools, as well as how to combine your own tools with
inbuilt Alteryx functions. I used the Time Series ARIMA model tool provided by Alteryx as
well as the Time Series forecast tool to come up with forecasts for the return of each
stock.
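For reference, the forecasting step for a single stock can be sketched with R's forecast package, which to the best of my knowledge underpins the Alteryx time series tools. The returns here are simulated as a stand-in for real price data:

R Code (sketch):
    library(forecast)

    # Simulated daily returns standing in for one ticker's real data
    set.seed(7)
    returns <- ts(rnorm(250, mean = 0.0005, sd = 0.01))

    fit <- auto.arima(returns)   # automatic ARIMA fit, as in the ARIMA tool
    fc  <- forecast(fit, h = 1)  # one-step-ahead forecast, as in the TS Forecast tool
    fc$mean                      # the expected next-day return used to rank stocks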
User Interface: The configuration for the workflow requires the user to enter their 11 stocks
in the first column, the start and end dates for the historical data, and the budget constraint.
Depending on the date range selected, the tool will take about 3–5 minutes to run.
Output: The user will get an output indicating the top 5 stocks and optimal allocation of
budget subject to the risk constraints. The output will also show the expected return on
the stocks for the next day.
*All of the code is documented within the R tools in the workflow.