
Machine Learning Demystified

Michelle Hardwick

Director, Data Science & Analytics

Salt Lake Community College

Director, Data Science & Analytics at Salt Lake Community College

Adjunct Professor at Utah State University

Oracle ACE Director, IOUG Executive Vice President, UTOUG President

About Me

Twitter: @datacheesehead

Agenda

What is Machine Learning?

Algorithms

Pop Quiz

Key Concepts

Machine Learning Process

Applications of Machine Learning

What is Machine Learning?

Machine Learning

is the Process of Finding

Insightful Patterns

in your Data

Why is machine learning important?

Presenter

Presentation Notes

The volume of data is growing and will continue to do so. Data is being generated not only by people but also by devices.

How is machine learning being used?

Google search

Recommendation systems

Fraud detection

Tagging people in photos

Presenter

Presentation Notes

It’s all around us, but it might not be readily apparent. Example in Google search: what should it show you? If you are a developer and you search for "java", are you shown Java programming docs or coffee sites?

Algorithms

Anomaly Detection

Regression

Classification

Clustering

Association

Support Vector Machines

Decision Tree Learning

Instance-Based Learning

Generalized Linear Models

Centroid-Based Clustering

Hierarchical Clustering

Density-Based Clustering

Problem Type, Algorithm Family, Algorithm

One-Class SVM

Linear SVM

Non-Linear SVM

Classification/Regression Decision Tree

Random Forest

Isolation Forest

Radius Neighbors

K-Nearest Neighbors

Logistic Regression

Bayesian Naïve Classifier

Linear Regression

Bayesian Linear Regression

Feedforward ANN (Multilayer Perceptron)

K-Means Clustering

Complete-Linkage Clustering

Single-Linkage Clustering

Average-Linkage Clustering

DBSCAN

Association Rules

Artificial Neural Network

Unsupervised Machine Learning

When we do not know what the output values should be

Slide source: Towards Data Science blog

Used when we wish to learn the inherent structure of our data without using explicitly provided labels

https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Supervised Machine Learning

When we have prior knowledge of what the output values should be

Dimensionality Reduction

Learn about the data to find the dimensions that interrelate the features. Used for eliminating redundant features to speed up data processing.

Types:

Regression

Supervised

Classification

Identifying to which category an object belongs

Types:

Decision Tree

Support Vector Machine

Logistic Regression

Neural Networks

Instance Based Learning

Supervised

Regression

Predicting a continuous-valued attribute associated with an object. Regression predicts how much something will happen.

Types:

Generalized Linear Model


Neural Networks

Decision Tree

Supervised

Presenter

Presentation Notes

Regression is similar to classification but classification predicts whether something will happen. Regression predicts how much that something will happen.

Naïve Bayes

Finds the probability of an event occurring given the probability of another event that has already occurred

Types:

Bayes Theorem

Supervised
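As a small worked illustration of Bayes' theorem, P(A | B) = P(B | A) P(A) / P(B), the sketch below computes the probability that an email is spam given that it contains a particular word. The function name `bayes_posterior` and all of the probabilities are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) using Bayes' theorem."""
    # P(B): total probability of observing B, summed over A and not-A.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: 20% of emails are spam; the word "free" appears in
# 60% of spam emails and in 5% of non-spam emails.
p_spam_given_free = bayes_posterior(0.20, 0.60, 0.05)
print(round(p_spam_given_free, 3))  # 0.12 / (0.12 + 0.04) = 0.75
```

A Naïve Bayes classifier applies this same update once per feature, under the simplifying assumption that the features are independent of one another.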

Exploratory Analysis

Used to automatically identify structure in the data

Types:

Clustering

Unsupervised

Presenter

Presentation Notes

Clustering is dividing the data into smaller related subsets, or clusters

Association

Discover the rules that describe large portions of the data

Such as: People who buy X also tend to buy Y

Types:

Association Rules

Decision Tree

Unsupervised

Presenter

Presentation Notes

Sometimes referred to as market basket analysis
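A rule such as "people who buy X also tend to buy Y" is usually scored with two numbers: support (how often X and Y appear together) and confidence (how often Y appears when X does). The sketch below computes both on a made-up list of shopping baskets; the helper names and the data are illustrative assumptions, not part of the original slides.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions.

    Assumes the antecedent appears in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk", "butter"},
]

rule_support = support(baskets, {"bread", "butter"})        # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 2 of 3 bread baskets
print(rule_support, round(rule_confidence, 3))
```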

Anomaly Detection

Finds cases that are unusual or slightly different

Types:


Decision Tree

Supervised or Unsupervised

Presenter

Presentation Notes

Sometimes referred to as outlier detection

How do you pick which algorithm to run?

Activity

What algorithm families would you run for the following ML problem?

What students are most likely to succeed at SLCC?

Anomaly Detection

Regression

Classification

Clustering

Association

Support Vector Machines

Decision Tree Learning

Instance-Based Learning

Generalized Linear Models

Centroid-Based Clustering

Hierarchical Clustering

Density-Based Clustering

Problem Type, Algorithm Family, Algorithm

One-Class SVM

Linear SVM

Non-Linear SVM

Classification/Regression Decision Tree

Random Forest

Isolation Forest

Radius Neighbors

K-Nearest Neighbors

Logistic Regression

Bayesian Naïve Classifier

Linear Regression

Bayesian Linear Regression

Feedforward ANN (Multilayer Perceptron)

K-Means Clustering

Complete-Linkage Clustering

Single-Linkage Clustering

Average-Linkage Clustering

DBSCAN

Association Rules

Artificial Neural Network

Key ML Concepts

Train vs Test

For Supervised Learning, we want to split out a portion of our dataset to do testing on to validate the accuracy of our predictions

This should be done randomly

Typical splits are 70% Train and 30% Test or 80%/20%

Test dataset will have the output field but you will ignore it when running the model
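A random 70%/30% split can be sketched in a few lines of plain Python. The function name `train_test_split`, the fixed seed, and the toy records are assumptions made for illustration; most ML tools ship their own equivalent.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly split rows into (train, test) lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled records.
records = [{"id": i, "label": i % 2} for i in range(10)]
train, test = train_test_split(records, test_fraction=0.3)
print(len(train), len(test))  # 7 3
```

The test rows keep their `label` field; it is held back while the model predicts and only used afterwards to check the predictions.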

Presenter

Presentation Notes

Another method I’ve heard of is to train on data up to the last quarter, then test on the last quarter. If your business is cyclical, this is not a good idea, as you may be introducing bias.

Feature Engineering

Feature engineering is the process of transforming the raw data into features that will better represent the underlying problem, resulting in improved model accuracy

The Attribute Importance model can be used for this (minimum description length algorithm)

Presenter

Presentation Notes

Examples:

• Flight delay data. If the field is date & time, remove the date to look at the hour of the day that flights are delayed.

• Student ages. We know that traditional students behave differently than non-traditional students. We can categorize them instead of using age as a raw number.

• City & State. We know that city is not useful without the state. Make these one field.
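The three examples in these notes can be sketched as small helper functions. The function names, the timestamp format, and the age cutoff of 25 for "non-traditional" students are all assumptions for illustration; the right cutoff depends on how your institution defines the groups.

```python
from datetime import datetime


def hour_of_day(timestamp, fmt="%Y-%m-%d %H:%M"):
    """Keep only the hour from a date & time field."""
    return datetime.strptime(timestamp, fmt).hour


def student_type(age, nontraditional_age=25):
    """Replace a raw age with a category (the cutoff is a hypothetical choice)."""
    return "non-traditional" if age >= nontraditional_age else "traditional"


def city_state(city, state):
    """Combine city and state into one feature, since city alone is ambiguous."""
    return f"{city}, {state}"


print(hour_of_day("2019-03-14 17:45"))  # 17
print(student_type(31))                 # non-traditional
print(city_state("Springfield", "IL"))  # Springfield, IL
```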

Bias

When an algorithm produces results that are systematically prejudiced due to erroneous assumptions in the ML process

This is usually related to the gathering or usage of data

You should check your data, models and results for this bias

Presenter

Presentation Notes

You’ve probably heard about stereotyping & confirmation bias

The Machine Learning Process

Step 01: Gather Data

Step 02: Data Preparation

Step 03: Training

Step 04: Testing

Step 05: Evaluation

The ML Process

Step 1: Gather Data

To improve the accuracy of the predictions, the quantity and quality of the data are most important

Presenter

Presentation Notes

The size of your data and the quality of it can make or break your ML project. Most errors in results can be traced back to errors in the data.

Step 1: Gather Data

Profile data for quality

Reliable data avoids:

• Duplicated data

• Bad labels

• Bad values

• Omitted values

Presenter

Presentation Notes

• Duplicated data: Same log loaded twice

• Bad labels: Someone mislabeled a picture of an oak tree as a maple

• Bad values: Extra digit, decimal point missing

• Omitted values: Someone forgot to enter a value
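Two of these problems, duplicated rows and omitted values, are easy to count automatically. The sketch below is one minimal way to do it, assuming each record is a Python dict with hashable values; the `profile` function and the sample rows are hypothetical. Bad labels and bad values generally need domain-specific checks and are not covered here.

```python
from collections import Counter


def profile(rows, required_fields):
    """Count exact duplicate rows and missing values in required fields."""
    # Turn each dict into a sorted tuple of items so identical rows compare equal.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(c - 1 for c in counts.values())

    missing = Counter()
    for r in rows:
        for field in required_fields:
            if r.get(field) in (None, ""):
                missing[field] += 1
    return {"duplicates": duplicates, "missing": dict(missing)}


# Hypothetical tree-survey records.
rows = [
    {"id": 1, "species": "oak", "height_m": 12.5},
    {"id": 1, "species": "oak", "height_m": 12.5},    # same record loaded twice
    {"id": 2, "species": "maple", "height_m": None},  # omitted value
    {"id": 3, "species": "", "height_m": 9.0},        # omitted value
]

report = profile(rows, ["species", "height_m"])
print(report)
```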

Step 1: Gather Data

Identify Features

A feature is a measurable property of the object you are trying to analyze.

These are data points that describe the object, such as age, gender, and zip code.

Presenter

Presentation Notes

Features are sometimes referred to as variables or attributes

Step 1: Gather Data

Label Sources

If your training data is not classified with the outcome, it needs to be labeled

Presenter

Presentation Notes

Label is the term used for the final output definition

Step 2: Data Preparation

Perhaps the most time-consuming task is preparing your data for the algorithm you’ll be running

Where should you transform? Prior to the training? Or in the model?

Presenter

Presentation Notes

Transforming prior to training

In this approach, we perform the transformation before training. This code lives separate from your machine learning model.

Pros:

• Computation is performed only once.

• Computation can look at entire dataset to determine the transformation.

Cons:

• Transformations need to be reproduced at prediction time. Beware of skew!

• Any transformation changes require rerunning data generation, leading to slower iterations.

Skew is more dangerous for cases involving online serving. In offline serving, you might be able to reuse the code that generates your training data. In online serving, the code that creates your dataset and the code used to handle live traffic are almost necessarily different, which makes it easy to introduce skew.

Transforming within the model

For this approach, the transformation is part of the model code. The model takes in untransformed data as input and will transform it within the model.

Pros:

• Easy iterations. If you change the transformations, you can still use the same data files.

• You're guaranteed the same transformations at training and prediction time.

Cons:

• Expensive transforms can increase model latency.

• Transformations are per batch.

There are many considerations for transforming per batch. Suppose you want to normalize a feature by its average value, that is, you want to change the feature values to have mean 0 and standard deviation 1. When transforming inside the model, this normalization will have access to only one batch of data, not the full dataset. You can either normalize by the average value within a batch (dangerous if batches are highly variant), or precompute the average and fix it as a constant in the model. We'll explore normalization in the next section.
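One way to limit the skew described in these notes is to compute the normalization constants once, from the full training set, and then reuse exactly the same constants and code at prediction time. The sketch below illustrates that idea; `fit_zscore`, `apply_zscore`, and the sample ages are hypothetical names and data.

```python
import statistics


def fit_zscore(train_values):
    """Compute normalization constants once, from the full training set."""
    return {
        "mean": statistics.mean(train_values),
        "std": statistics.pstdev(train_values),  # population standard deviation
    }


def apply_zscore(value, params):
    """Apply the stored constants; the same function is used for training and serving."""
    return (value - params["mean"]) / params["std"]


# Hypothetical training data.
train_ages = [18, 22, 25, 31, 44]
params = fit_zscore(train_ages)

normalized_train = [apply_zscore(v, params) for v in train_ages]
live_value = apply_zscore(28, params)  # a new record at prediction time
print(params["mean"], live_value)      # the mean is 28, so this value maps to 0.0
```

Storing `params` alongside the model is what "fix it as a constant in the model" amounts to in practice.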


Numeric Transformations

Convert non-numeric data to numeric


Numeric Normalization

Transform features to be on the same scale

Methods:

• Scaling to a range

• Clipping

• Log scaling

• Z-score
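The four methods listed above can each be written in a line or two. This is a minimal sketch on a made-up list of values with one large outlier; the function names and the clipping bounds are illustrative assumptions.

```python
import math
import statistics


def scale_to_range(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Cap extreme values at fixed bounds."""
    return [min(max(v, lower), upper) for v in values]


def log_scale(values):
    """Compress a wide range; requires strictly positive values."""
    return [math.log(v) for v in values]


def z_score(values):
    """Center on the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


incomes = [10, 20, 30, 40, 1000]  # hypothetical values with one outlier
print(scale_to_range(incomes))
print(clip(incomes, 0, 100))
print(log_scale(incomes))
print(z_score(incomes))
```

Running it shows why the choice matters: scaling to a range squeezes the first four values close to 0 because of the outlier, while clipping or log scaling keeps them distinguishable.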


Bucketing

For numeric data where there is not a linear relationship, you can bucket it

Two Types of Bucketing:

• Equal spaced boundaries

• Quantile boundaries

Images source: Google machine learning

Presenter

Presentation Notes

Quantile boundaries have the same number of records in each bucket

https://developers.google.com/machine-learning/data-prep/transform/bucketing
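The difference between the two bucketing types shows up clearly on skewed data. The sketch below builds both kinds of boundaries for a small made-up list; the function names are hypothetical, and the quantile version is a simple approximation that splits evenly only when there are no ties and the record count divides by the number of buckets.

```python
from bisect import bisect_right
from collections import Counter


def equal_width_boundaries(values, n_buckets):
    """Boundaries spaced evenly between the min and max value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    return [lo + width * i for i in range(1, n_buckets)]


def quantile_boundaries(values, n_buckets):
    """Boundaries chosen so each bucket holds about the same number of records."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_buckets] for i in range(1, n_buckets)]


def bucket_index(value, boundaries):
    """Index of the bucket a value falls into (0 .. len(boundaries))."""
    return bisect_right(boundaries, value)


values = [1, 2, 3, 4, 5, 6, 7, 100]  # hypothetical skewed data

equal = equal_width_boundaries(values, 4)
quant = quantile_boundaries(values, 4)
equal_counts = Counter(bucket_index(v, equal) for v in values)
quant_counts = Counter(bucket_index(v, quant) for v in values)

print(equal, dict(equal_counts))  # almost every record lands in the first bucket
print(quant, dict(quant_counts))  # two records per bucket
```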


Transforming Categorical Data

When feature values have no ordered relationship

Two methods:

• Vocabulary

• Hashing

Presenter

Presentation Notes

Vocabulary creates a unique feature for each category value. Hashing is better for datasets that change but can cause collisions.
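Both methods can be sketched briefly. In the vocabulary approach each distinct category gets its own position in a one-hot vector; in the hashing approach a category is mapped into a fixed number of buckets, so new values never break the encoding but two categories may share a bucket. The function names, the choice of SHA-256, and the bucket count are assumptions for illustration.

```python
import hashlib


def build_vocabulary(values):
    """Assign each distinct category its own index, in first-seen order."""
    vocab = {}
    for v in values:
        vocab.setdefault(v, len(vocab))
    return vocab


def one_hot(value, vocab):
    """Encode a known category as a vector with a single 1."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec


def hash_bucket(value, n_buckets):
    """Map any string to one of n_buckets; unseen values work, collisions are possible."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


colors = ["red", "green", "blue", "green"]  # hypothetical category values
vocab = build_vocabulary(colors)
print(vocab)                    # {'red': 0, 'green': 1, 'blue': 2}
print(one_hot("green", vocab))  # [0, 1, 0]
print(hash_bucket("purple", 4))  # works even though "purple" was never seen
```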

Step 3: Training

At this point you’ll put together your machine learning model

You can use many tools for this

Build the model, accepting defaults, then run it

Presenter

Presentation Notes

My fav tools that I’ve used are R and Oracle Data Miner

Step 4: Testing

Now run your model (with the same parameter settings) against your test dataset

Step 5: Evaluation

Check the accuracy of your model

How many of your test records did the model predict correctly?
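Accuracy in this sense is simply the share of test records whose predicted label matches the actual label. A minimal sketch, with a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical test-set labels and the model's predictions for the same rows.
actual = ["pass", "fail", "pass", "pass", "fail"]
predicted = ["pass", "pass", "pass", "pass", "fail"]

test_accuracy = accuracy(actual, predicted)
print(test_accuracy)  # 4 of 5 correct = 0.8
```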

Presenter

Presentation Notes

What percent of accuracy is good?

Repeat and Repeat and Repeat

Applications of Machine Learning

Demo Time!

You can find me at: @datacheesehead


Any questions?

Thanks!
