Machine Learning Demystified, Michelle Hardwick, Director, Data Science & Analytics, Salt Lake Community College

Transcript of Machine Learning Demystified

Page 1:

Machine Learning Demystified

Michelle Hardwick

Director, Data Science & Analytics

Salt Lake Community College

Page 2:

About Me

Director, Data Science & Analytics at Salt Lake Community College

Adjunct Professor at Utah State University

Oracle ACE Director, IOUG Executive Vice President, UTOUG President

Twitter: @datacheesehead

Page 3:

Agenda

What is Machine Learning?

Algorithms

Pop Quiz

Key Concepts

Machine Learning Process

Applications of Machine Learning

Page 4:

What is Machine Learning?

Page 5:

Machine Learning is the Process of Finding Insightful Patterns in your Data

Page 6:

Why is machine learning important?

Presenter
Presentation Notes
The volume of data is growing and will continue to do so. Data is being generated not only by people but also by devices.
Page 7:

How is machine learning being used?

Google search

Recommendation systems

Fraud detection

Tagging people in photos

Presenter
Presentation Notes
It’s all around us, but it might not be readily apparent. Example in Google search: what should it show you? If you are a developer and you search for "java", are you shown Java programming docs or coffee sites?
Page 8:

Algorithms

Page 9:

Anomaly Detection

Regression

Classification

Clustering

Association

Support Vector Machines

Decision Tree Learning

Instance-Based Learning

Generalized Linear Models

Centroid-Based Clustering

Hierarchical Clustering

Density-Based Clustering

Chart columns: Problem Type, Algorithm Family, Algorithm

One-Class SVM

Linear SVM

Non-Linear SVM

Classification/Regression Decision Tree

Random Forest

Isolation Forest

Radius Neighbors

K-Nearest Neighbors

Logistic Regression

Bayesian Naïve Classifier

Linear Regression

Bayesian Linear Regression

Feedforward ANN (Multilayer Perceptron)

K-Means Clustering

Complete-Linkage Clustering

Single-Linkage Clustering

Average-Linkage Clustering

DBSCAN

Association Rules

Artificial Neural Network

Page 10:

Unsupervised Machine Learning

When we do not know what the output values should be

Page 11:

Slide source: Towards Data Science blog

Used when we wish to learn the inherent structure of our data without using explicitly provided labels

Page 12:

Supervised Machine Learning

When we have prior knowledge to know what the output values should be

Page 13:

Dimensionality Reduction

Learn about the data to find the dimensions that interrelate the features. Used for eliminating redundant features to speed up data processing.

Types:

Regression

Supervised

Page 14:

Classification

Identifying to which category an object belongs

Types:

Decision Tree

Support Vector Machine

Logistic Regression

Neural Networks

Instance Based Learning

Supervised

Page 15:

Regression

Predicting a continuous-valued attribute associated with an object. Regression predicts how much something will happen.

Types:

Generalized Linear Model

Support Vector Machine

Neural Networks

Decision Tree

Supervised

Presenter
Presentation Notes
Regression is similar to classification, but classification predicts whether something will happen, while regression predicts how much of it will happen.
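To make the "how much" versus "whether" distinction concrete, here is a minimal sketch of a least-squares line fit in plain Python. The data (hours studied vs. exam score), the function name `fit_line`, and the pass mark of 60 are all hypothetical and chosen only for illustration; they are not from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]
slope, intercept = fit_line(hours, scores)

# Regression answers "how much": the predicted score for 6 hours of study.
predicted_score = slope * 6 + intercept

# A classification-style question answers "whether": pass or not
# (the pass mark of 60 is an assumption for this example).
will_pass = predicted_score >= 60
print(predicted_score, will_pass)
```

The same fitted model can feed either kind of question; what differs is whether the output is a number or a category.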
Page 16:

Naïve Bayes

Finds the probability of an event occurring given the probability of another event that has already occurred

Types:

Bayes Theorem

Supervised
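For reference, Bayes' theorem can be written as the first formula here, where A is the event of interest and B is the event already observed. The second formula is the usual Naïve Bayes scoring rule, which assumes the features x_1 through x_n are conditionally independent given the class y (that independence assumption is the "naïve" part). The symbols are standard notation, not taken from the slides.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
```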

Page 17:

Exploratory Analysis

Used to automatically identify structure in the data

Types:

Clustering

Unsupervised

Presenter
Presentation Notes
Clustering is dividing the data into smaller related subsets, or clusters
Page 18:

Association

Discover the rules that describe large portions of the data

Such as: People who buy X also tend to buy Y

Types:

Association Rules

Decision Tree

Unsupervised

Presenter
Presentation Notes
Sometimes referred to as market basket analysis
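A rule such as "people who buy X also tend to buy Y" is commonly scored with two measures: support (how often the items appear together) and confidence (how often Y appears in baskets that contain X). The sketch here computes both in plain Python; the baskets and the function names are made up for illustration.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in baskets) / len(baskets)

def confidence(baskets, x, y):
    """Of the baskets containing x, the fraction that also contain y."""
    return support(baskets, {x, y}) / support(baskets, {x})

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))    # bread and milk bought together
print(confidence(baskets, "bread", "milk"))   # milk, given bread was bought
```

Association rule algorithms search for rules whose support and confidence both exceed chosen thresholds, rather than checking one pair at a time as this example does.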
Page 19:

Anomaly Detection

Finds cases that are unusual or slightly different

Types:

Support Vector Machine

Decision Tree

Supervised or Unsupervised

Presenter
Presentation Notes
Sometimes referred to as outlier detection
Page 20:

How do you pick which algorithm to run?

Page 21:

Activity

What algorithm families would you run for the following ML problem?

What students are most likely to succeed at SLCC?

Page 22:

Anomaly Detection

Regression

Classification

Clustering

Association

Support Vector Machines

Decision Tree Learning

Instance-Based Learning

Generalized Linear Models

Centroid-Based Clustering

Hierarchical Clustering

Density-Based Clustering

Chart columns: Problem Type, Algorithm Family, Algorithm

One-Class SVM

Linear SVM

Non-Linear SVM

Classification/Regression Decision Tree

Random Forest

Isolation Forest

Radius Neighbors

K-Nearest Neighbors

Logistic Regression

Bayesian Naïve Classifier

Linear Regression

Bayesian Linear Regression

Feedforward ANN (Multilayer Perceptron)

K-Means Clustering

Complete-Linkage Clustering

Single-Linkage Clustering

Average-Linkage Clustering

DBSCAN

Association Rules

Artificial Neural Network

Page 23:

Key ML Concepts

Page 24:

Train vs Test

For Supervised Learning, we set aside a portion of our dataset for testing so that we can validate the accuracy of our predictions

The split should be done randomly

Typical splits are 70% Train and 30% Test, or 80%/20%

The test dataset will have the output field, but the model does not use it when making predictions; it is only compared against the predictions afterward

Presenter
Presentation Notes
Another method I’ve heard of is train on data up to last quarter then test on last quarter. If your business is cyclical, this is not a good idea as you may be introducing bias.
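A random 70/30 split can be sketched in a few lines of plain Python. The function name `train_test_split`, the fixed seed, and the stand-in data are assumptions for illustration; most ML tools provide their own equivalent.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Randomly split records into (train, test) lists."""
    shuffled = list(records)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))                   # stand-in for 100 rows of data
train, test = train_test_split(records, test_fraction=0.3)
print(len(train), len(test))                 # 70 30
```

Fixing the seed is a design choice: it keeps the split identical between runs, so changes in accuracy come from the model and not from a different shuffle.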
Page 25:

Feature Engineering

Feature engineering is the process of transforming the raw data into features that will better represent the underlying problem, resulting in improved model accuracy

The Attribute Importance model can be used for this (minimum description length algorithm)

Presenter
Presentation Notes
Examples:
• Flight delay data: if the field is a date and time, drop the date and look at the hour of the day when flights are delayed.
• Student ages: we know that traditional students behave differently than non-traditional students, so we can categorize them instead of using age as a raw number.
• City & State: we know that city is not useful without the state, so make these one field.
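The three examples in these notes can be sketched as one small function. The record layout, the field names, and the age cutoff of 25 for "traditional" students are hypothetical; a single row mixes flight and student fields only to keep the example short.

```python
from datetime import datetime

def engineer_features(row):
    """Turn one raw record into features closer to the underlying problem."""
    departure = datetime.strptime(row["departure"], "%Y-%m-%d %H:%M")
    return {
        # Keep only the hour of day; the calendar date is dropped.
        "departure_hour": departure.hour,
        # Replace raw age with a category (the cutoff of 25 is an assumption).
        "student_type": "traditional" if row["age"] < 25 else "non-traditional",
        # City alone is ambiguous, so combine it with the state.
        "city_state": f'{row["city"]}, {row["state"]}',
    }

raw = {
    "departure": "2019-03-14 17:45",
    "age": 31,
    "city": "Salt Lake City",
    "state": "UT",
}
features = engineer_features(raw)
print(features)
```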
Page 26:

Bias

When an algorithm produces results that are systematically prejudiced due to erroneous assumptions in the ML process

This is usually related to the gathering or usage of data

You should check your data, models and results for this bias

Presenter
Presentation Notes
You’ve probably heard about stereotyping & confirmation bias
Page 27:

The Machine Learning Process

Page 28:

The ML Process

Step 01: Gather Data

Step 02: Data Preparation

Step 03: Training

Step 04: Testing

Step 05: Evaluation

Page 29:

Step 1: Gather Data

To improve the accuracy of the predictions, the quantity and quality of the data are most important

Presenter
Presentation Notes
The size of your data and the quality of it can make or break your ML project. Most errors in results can be traced back to errors in the data.
Page 30:

Step 1: Gather Data

Profile data for quality

Reliable data avoids:

• Duplicated data

• Bad labels

• Bad values

• Omitted values

Presenter
Presentation Notes
• Duplicated data: the same log loaded twice
• Bad labels: someone mislabeled a picture of an oak tree as a maple
• Bad values: an extra digit, or a missing decimal point
• Omitted values: someone forgot to enter a value
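Three of these four problems can be counted mechanically; bad labels usually need a human or a second data source to catch. Here is a minimal profiling sketch in plain Python, where the function name, the field names, and the valid age range of 0 to 120 are assumptions for illustration.

```python
def profile(rows, required_fields, valid_ranges):
    """Count duplicated rows, omitted values, and out-of-range values."""
    seen = set()
    report = {"duplicates": 0, "omitted": 0, "bad_values": 0}
    for row in rows:
        key = tuple(sorted(row.items()))     # hashable fingerprint of the row
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        for field in required_fields:
            if row.get(field) in (None, ""):
                report["omitted"] += 1
        for field, (low, high) in valid_ranges.items():
            value = row.get(field)
            if value not in (None, "") and not (low <= value <= high):
                report["bad_values"] += 1
    return report

rows = [
    {"id": 1, "age": 20},
    {"id": 1, "age": 20},     # duplicated data: same record loaded twice
    {"id": 2, "age": 200},    # bad value: extra digit
    {"id": 3, "age": None},   # omitted value
]
report = profile(rows, required_fields=["id", "age"], valid_ranges={"age": (0, 120)})
print(report)
```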
Page 31:

Step 1: Gather Data

Identify Features

A feature is a measurable property of the object you are trying to analyze.

These are data points that describe the object, such as age, gender, or zip code.

Presenter
Presentation Notes
Features are sometimes referred to as variables or attributes
Page 32:

Step 1: Gather Data

Label Sources

If your training data is not classified with the outcome, it needs to be labeled

Presenter
Presentation Notes
"Label" is the term used for the final output, the value the model is trying to predict
Page 33:

Step 2: Data Preparation

Perhaps the most time-consuming task is preparing your data for the algorithm you’ll be running

Where should you transform? Prior to the training? Or in the model?

Presenter
Presentation Notes
Transforming prior to training

In this approach, we perform the transformation before training. This code lives separate from your machine learning model.

Pros:
• Computation is performed only once.
• Computation can look at the entire dataset to determine the transformation.

Cons:
• Transformations need to be reproduced at prediction time. Beware of skew!
• Any transformation changes require rerunning data generation, leading to slower iterations.

Skew is more dangerous for cases involving online serving. In offline serving, you might be able to reuse the code that generates your training data. In online serving, the code that creates your dataset and the code used to handle live traffic are almost necessarily different, which makes it easy to introduce skew.

Transforming within the model

For this approach, the transformation is part of the model code. The model takes in untransformed data as input and will transform it within the model.

Pros:
• Easy iterations. If you change the transformations, you can still use the same data files.
• You're guaranteed the same transformations at training and prediction time.

Cons:
• Expensive transforms can increase model latency.
• Transformations are per batch.

There are many considerations for transforming per batch. Suppose you want to normalize a feature by its average value (that is, you want to change the feature values to have mean 0 and standard deviation 1). When transforming inside the model, this normalization will have access to only one batch of data, not the full dataset. You can either normalize by the average value within a batch (dangerous if batches are highly variant), or precompute the average and fix it as a constant in the model. We'll explore normalization in the next section.
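The last option in these notes, precomputing the statistics once and fixing them as constants, can be sketched as follows. The names `fit_scaler` and `apply_scaler` and the sample ages are hypothetical; the point is only that the same stored mean and standard deviation are reused at prediction time, which avoids training/serving skew.

```python
from statistics import mean, pstdev

def fit_scaler(training_values):
    """Compute the normalization constants once, from the full training set."""
    return {"mean": mean(training_values), "std": pstdev(training_values)}

def apply_scaler(value, scaler):
    """Apply the stored constants; used identically for training and prediction."""
    return (value - scaler["mean"]) / scaler["std"]

training_ages = [18, 22, 26, 30, 34]
scaler = fit_scaler(training_ages)           # would be saved alongside the model

train_scaled = [apply_scaler(v, scaler) for v in training_ages]
new_record_scaled = apply_scaler(40, scaler) # prediction time: same constants
print(train_scaled, new_record_scaled)
```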
Page 34:

Step 2: Data Preparation

Numeric Transformations

Convert non-numeric data to numeric

Page 35:

Step 2: Data Preparation

Numeric Normalization

Transform features to be on the same scale

Methods:

• Scaling to a range

• Clipping

• Log scaling

• Z-score
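Each of the four methods listed is a one-line formula. The sketch here shows them in plain Python on a made-up list of incomes; the function names and the clipping limits are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def scale_to_range(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def clip(values, low, high):
    """Clipping: cap extreme values at fixed lower and upper limits."""
    return [min(max(v, low), high) for v in values]

def log_scale(values):
    """Log scaling: compress a wide range (values must be positive)."""
    return [math.log(v) for v in values]

def z_score(values):
    """Z-score: number of standard deviations each value is from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 35_000, 50_000, 65_000, 80_000]
print(scale_to_range(incomes))
print(clip(incomes, 30_000, 70_000))
print(log_scale(incomes))
print(z_score(incomes))
```

Which method fits depends on the feature: clipping and log scaling are typically chosen when a few extreme values would otherwise dominate the scale.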

Page 36:

Step 2: Data Preparation

Bucketing

When a numeric feature does not have a linear relationship with the output, you can bucket it

Two Types of Bucketing:

• Equally spaced boundaries

• Quantile boundaries

Images source: Google machine learning

Presenter
Presentation Notes
Quantile boundaries put roughly the same number of records in each bucket
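The difference between the two bucketing types shows up clearly on skewed data. This is a minimal sketch in plain Python; the function names and the list of ages are made up, and the quantile boundaries are picked by simple index arithmetic, which is one of several reasonable conventions.

```python
from collections import Counter

def equal_width_boundaries(values, n_buckets):
    """Boundaries that cut the value range into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    return [lo + width * i for i in range(1, n_buckets)]

def quantile_boundaries(values, n_buckets):
    """Boundaries chosen so each bucket holds about the same number of records."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_buckets] for i in range(1, n_buckets)]

def bucket_index(value, boundaries):
    """Index of the bucket a value falls into (0 is the lowest bucket)."""
    return sum(value >= b for b in boundaries)

ages = [18, 19, 19, 20, 21, 22, 25, 31, 44, 62]   # skewed toward younger ages

equal = equal_width_boundaries(ages, 3)
quant = quantile_boundaries(ages, 3)
equal_counts = Counter(bucket_index(a, equal) for a in ages)
quant_counts = Counter(bucket_index(a, quant) for a in ages)
print(dict(equal_counts))   # most records land in the first bucket
print(dict(quant_counts))   # records are spread nearly evenly
```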
Page 37:

Step 2: Data Preparation

Transforming Categorical Data

When the feature values have no inherent order

Two methods:

• Vocabulary

• Hashing

Presenter
Presentation Notes
Vocabulary creates a unique feature for each category value. Hashing is better for datasets that change, but it can cause collisions.
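Both methods can be sketched in plain Python. The function names, the state codes, and the choice of SHA-256 as the hash are assumptions for illustration; any stable hash function would serve the same purpose.

```python
import hashlib

def build_vocabulary(values):
    """Assign each distinct category its own index."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def one_hot(value, vocabulary):
    """One feature per category; raises KeyError for a category never seen."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[value]] = 1
    return vector

def hash_bucket(value, n_buckets):
    """Map any category, seen or unseen, to one of n_buckets slots."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

states = ["UT", "ID", "NV", "UT"]
vocabulary = build_vocabulary(states)
print(one_hot("UT", vocabulary))

# Hashing needs no vocabulary, so a new category still gets a slot,
# but two different categories may collide in the same bucket.
print(hash_bucket("UT", 8), hash_bucket("WY", 8))
```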
Page 38:

Step 3: Training

At this point you’ll put together your machine learning model

You can use many tools for this

Build the model, accepting defaults, then run it

Presenter
Presentation Notes
My fav tools that I’ve used are R and Oracle Data Miner
Page 39:

Step 4: Testing

Now run your model (with the same parameter settings) against your test dataset

Page 40:

Step 5: Evaluation

Check the accuracy of your model

How many of your test records did the model predict correctly?

Presenter
Presentation Notes
What percent of accuracy is good?
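Accuracy is the share of test records predicted correctly. One common way to judge whether a given percentage is good is to compare it with a simple baseline, such as always predicting the most frequent outcome. The sketch here uses made-up pass/fail labels, and the function names are hypothetical.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def majority_baseline(actual):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

actual =    ["pass", "fail", "pass", "pass", "fail"]
predicted = ["pass", "pass", "pass", "pass", "fail"]

print(accuracy(actual, predicted))   # 0.8
print(majority_baseline(actual))     # 0.6
```

A model that does not beat the baseline has learned little, however high its raw accuracy looks.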
Page 41:

Repeat and Repeat and Repeat

Page 42:

Applications of Machine Learning

Page 43:

Demo Time!

Page 44:

You can find me at: @datacheesehead

[email protected]

Any questions?

Thanks!