A Guideline to Statistical and Machine Learning

Alexandre Alves, June/12/2014

Define your Goal

Are you interested in prediction or in inference?

Prediction is a black-box method: given values for the features X1, …, Xp, it predicts the value of the response Y.

Inference is a white-box method: it explains how the response Y is affected as the features X1, …, Xp change.

People tend to think they need to predict, but more often than not inference will give them more insight:

In an advertisement campaign, which media contributed most to sales?

Analyzing a business process failure, which attribute of the process contributes the most to a negative outcome?

Given an increase in height, what is the expected increase in weight?

You must have a goal in mind in the form of a Question to be answered by means of analyzing the Observations in your data.
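
As a rough illustration of the height and weight question above, the sketch below fits a straight line by ordinary least squares and reads the slope as the expected change in weight per unit of height. The `fit_line` helper and the five observations are made up for this example; they are not data from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations: height in cm, weight in kg.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 66, 73, 81]

intercept, slope = fit_line(heights, weights)
# Inference: the slope answers "how does Y change as X changes?"
print(f"Each extra cm of height is associated with about {slope:.2f} kg of weight")
# Prediction: plug in a new X and read off Y.
print(f"Predicted weight at 175 cm: {intercept + slope * 175:.1f} kg")
```

The same fitted line serves both goals: the slope is the inference, the fitted value at a new height is the prediction.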

Define the Model

Looking at the Observations, is the Response present in the data?

In a history of fraudulent transactions, the outcome of fraud or not fraud is specified in the transactions themselves.

If so, then you are looking at a Supervised model, and there is a Response variable.

Or is the Response not in the data?

In a financial market Exchange, which stocks are hot? The trade transactions do not include a variable specifying if the stock is hot or not hot!

In this case, you are looking at an Unsupervised model.

Supervised Models

Is the Response variable quantitative?

What’s the weight? What’s the price? What’s the income?

You are dealing with a Regression problem.

Or is the Response variable qualitative (categorical)?

Is it fraud? What’s the gender - male or female? What’s the brand - A, B, C?

You are looking into a Classification problem.

Regression Problems

Is there a somewhat linear relationship between the features and the response?

Gas consumption as a function of horsepower.

Fit a Linear model to your Observations.

Is there no clear relationship or form between the features and the response?

Gas consumption as a function of the car's model year.

Prefer a more flexible, non-linear method, such as Regression Splines or Generalized Additive Models.
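
To make the contrast above concrete, here is a minimal sketch comparing a straight-line fit with a flexible fit on data that has no linear form. A k-nearest-neighbour average stands in for splines and additive models, purely to keep the example short; it is not one of the methods named above. The data are synthetic (a U-shaped curve), and the comparison is in-sample only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def knn_average(xs, ys, x0, k=3):
    """Flexible fit: average the responses of the k observations closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Synthetic U-shaped data: no straight line describes it well.
xs = list(range(10))
ys = [(x - 4.5) ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
linear_mse = mse(ys, [intercept + slope * x for x in xs])
knn_mse = mse(ys, [knn_average(xs, ys, x) for x in xs])

print(f"linear fit MSE:   {linear_mse:.2f}")
print(f"flexible fit MSE: {knn_mse:.2f}")
```

On data like this the flexible fit follows the curve while the line cannot; on genuinely linear data the line would be the simpler and more interpretable choice.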

Classification Problems

Is the Response made of only two categories (e.g. yes/no)?

Fit a Logistic regression model to your Observations.

Is there a somewhat linear boundary between the categories of the Response?

Use Linear Discriminant Analysis.

Is there no clear form for the boundary between the categories, but the probability distribution of each category is known?

Use a Naive Bayes Classifier.

Otherwise, if there is no clear boundary and the distribution is not known:

Use K-Nearest Neighbors.
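
As an illustration of the last option above, the following is a minimal K-Nearest Neighbors classifier: a new observation gets the majority label of its k closest training observations. The transactions, feature choice and `knn_classify` helper are hypothetical, and the features are left unscaled for brevity, which a real analysis would probably not do.

```python
import math
from collections import Counter


def knn_classify(train, point, k=3):
    """Return the majority label among the k training rows closest to `point`.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical transactions: (amount in $1000s, hour of day) -> label.
train = [
    ((0.2, 14), "ok"), ((0.5, 10), "ok"), ((0.3, 16), "ok"),
    ((9.0, 3), "fraud"), ((7.5, 2), "fraud"), ((8.2, 4), "fraud"),
]

large_night = knn_classify(train, (8.0, 3))
small_midday = knn_classify(train, (0.4, 12))
print(large_night, small_midday)
```

No boundary shape or distribution is assumed here; the training observations themselves define the decision.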

Unsupervised Models

Unsupervised learning is a relatively new field.

Is there a desired number of groups or categories?

Hot stocks (financial derivatives) versus not-so-hot stocks.

Use K-Means Clustering.

Otherwise, if the number of groups is not known:

Stocks A and B trend together, stocks C and D trend together, stocks E and F…

Use Hierarchical Clustering.
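
For the case where the number of groups is chosen up front, the sketch below runs K-Means (Lloyd's algorithm) on one-dimensional data with two groups. The trade volumes and starting centroids are invented for the example, and only K-Means is shown; hierarchical clustering is not sketched here.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical daily trade volumes (millions of shares) for six stocks.
volumes = [1.0, 1.2, 0.8, 9.5, 10.1, 10.4]

# Two groups requested: "not-so-hot" and "hot". Start from the extremes.
centroids, clusters = kmeans_1d(volumes, [min(volumes), max(volumes)])
print("centroids:", centroids)
print("not-so-hot:", clusters[0])
print("hot:", clusters[1])
```

Note that no stock was labelled hot or not hot in the input; the grouping comes only from the structure of the data.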

Train (and Re-train) the Model

Assessing the Model

The model is created by fitting the Observations.

The Accuracy of the model must be assessed:

If a regression problem, then measure the mean squared error.

If a classification problem, then measure the error rate.
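
The two measures above can be written down in a few lines. This is a minimal sketch with made-up actual and predicted values; the function names are chosen for this example.

```python
def mean_squared_error(actual, predicted):
    """Regression: average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def error_rate(actual, predicted):
    """Classification: fraction of observations given the wrong label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Made-up values: squared errors are 1, 0 and 4.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])

# Made-up labels: one of the four predictions is wrong.
err = error_rate(["fraud", "ok", "ok", "ok"], ["fraud", "ok", "fraud", "ok"])

print(f"MSE = {mse:.3f}, error rate = {err:.2f}")
```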

Once the error can be measured, we can try different methods to estimate it reliably and to improve the model:

Hold k observations out of the training data and Cross-Validate.

Bootstrap by resampling the Observations with replacement.
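
Both resampling ideas above can be sketched with a simple linear model. The first function holds out part of the observations, fits on the rest, and measures the error on the held-out part; the second refits the model on samples drawn with replacement to see how much the slope moves. The data are synthetic (a line plus a fixed wiggle), and the helper names, fold count and round count are assumptions of this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_mse(xs, ys, folds=5):
    """Average squared error on held-out observations across `folds` splits."""
    n = len(xs)
    squared_errors = []
    for fold in range(folds):
        held_out = set(range(fold, n, folds))  # every folds-th observation
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        for i in held_out:
            squared_errors.append((ys[i] - (intercept + slope * xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)


def bootstrap_slopes(xs, ys, rounds=200, seed=0):
    """Refit the line on `rounds` samples drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(rounds):
        picks = [rng.randrange(n) for _ in range(n)]
        sample_x = [xs[i] for i in picks]
        sample_y = [ys[i] for i in picks]
        if len(set(sample_x)) < 2:
            continue  # a line cannot be fitted to a single distinct x
        slopes.append(fit_line(sample_x, sample_y)[1])
    return slopes


# Synthetic data: y = 2x + 1 with a fixed +/- 0.5 wiggle.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

cv_mse = k_fold_mse(xs, ys)
slopes = sorted(bootstrap_slopes(xs, ys))
print(f"cross-validated MSE: {cv_mse:.3f}")
print(f"bootstrap slopes range from {slopes[0]:.3f} to {slopes[-1]:.3f}")
```

Cross-validation estimates how the model does on data it has not seen; the spread of the bootstrap slopes gives a feel for how stable the fitted coefficient is.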

Improving the Model

The possible findings are:

Change the features used in the Model:

Car color has no correlation with gas consumption, thus remove it from the Model.

Change the form of the features (transformations and interactions):

The relationship between horsepower and gas consumption is not strictly linear, thus square the horsepower variable.

Change the model:

Low accuracy is a good indication that the selected Model is wrong.
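
The horsepower finding above can be checked by fitting the same linear model twice, once on horsepower and once on horsepower squared, and comparing the mean squared error. The numbers below are invented and deliberately curved, so they only illustrate the mechanics, not a real result about cars.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_and_score(feature, response):
    """Fit a line on `feature` and return its in-sample mean squared error."""
    intercept, slope = fit_line(feature, response)
    residuals = [y - (intercept + slope * x) for x, y in zip(feature, response)]
    return sum(r ** 2 for r in residuals) / len(residuals)


# Invented data: consumption grows with the square of horsepower, plus a wiggle.
horsepower = [50, 75, 100, 125, 150, 175, 200]
wiggle = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1]
consumption = [3 + 0.0002 * hp ** 2 + w for hp, w in zip(horsepower, wiggle)]

mse_raw = fit_and_score(horsepower, consumption)
mse_squared = fit_and_score([hp ** 2 for hp in horsepower], consumption)

print(f"MSE using horsepower:         {mse_raw:.4f}")
print(f"MSE using horsepower squared: {mse_squared:.4f}")
```

The model is still linear in its coefficients; only the feature fed into it has changed.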

Trade-offs

Models that tend to have high accuracy are often hard to interpret, and therefore less suitable for inference:

Linear regressions are easy to interpret, however they tend to have lower accuracy when the true relationship is not linear.

Support-Vector-Machines are very flexible, however can’t be easily interpreted.

Models that tend to be flexible are less biased, however they have higher variance: the fit changes a lot when the training data changes.

Linear regressions are biased towards a linear form, however they change little when the training data changes.

k-NN with a small k has very low bias, however it has high variance as the training data changes.

Flexibility versus Interpretability, Bias versus Variance
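
The bias versus variance trade-off above can be seen in a small simulation. Many training sets are drawn from the same curved relationship, a straight line and a 1-nearest-neighbour rule are fitted to each, and the predictions at one point are compared. The true function, noise level, evaluation point and number of rounds are all assumptions picked for this sketch.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def true_f(x):
    return x ** 2  # assumed true relationship, curved on purpose


def simulate(rounds=2000, noise_sd=0.3, x0=0.5, seed=1):
    """Collect predictions at x0 from models fitted to fresh noisy training sets."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]
    linear_preds, nn_preds = [], []
    for _ in range(rounds):
        ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
        intercept, slope = fit_line(xs, ys)
        linear_preds.append(intercept + slope * x0)
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
        nn_preds.append(ys[nearest])  # 1-NN: copy the closest observation
    return linear_preds, nn_preds


def bias_and_variance(preds, truth):
    mean_pred = sum(preds) / len(preds)
    variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
    return mean_pred - truth, variance


linear_preds, nn_preds = simulate()
lin_bias, lin_var = bias_and_variance(linear_preds, true_f(0.5))
nn_bias, nn_var = bias_and_variance(nn_preds, true_f(0.5))

print(f"linear: bias {lin_bias:+.3f}, variance {lin_var:.4f}")
print(f"1-NN:   bias {nn_bias:+.3f}, variance {nn_var:.4f}")
```

In this setup the line is consistently off at the chosen point but barely moves between training sets, while the 1-NN prediction is right on average but jumps around with the noise.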

“In God we trust, all others bring data.”

–W. Edwards Deming

“All models are wrong, some are useful.”

–George Box

“We are drowning in information and starving for knowledge.”

–Rutherford D. Rogers
