Post on 07-Jun-2020
Machine Learning Demystified
Michelle Hardwick, Director, Data Science & Analytics
Salt Lake Community College
Director, Data Science & Analytics at Salt Lake Community College
Adjunct Professor at Utah State University
Oracle ACE Director, IOUG Executive Vice President, UTOUG President
About Me
Twitter: @datacheesehead
Agenda
What is Machine Learning?
Algorithms
Pop Quiz
Key Concepts
Machine Learning Process
Applications of Machine Learning
What is Machine Learning?
Machine Learning is the process of finding insightful patterns in your data
Why is machine learning important?
How is machine learning being used?
Google search
Recommendation systems
Fraud detection
Tagging people in photos
Algorithms
Problem Type:
• Anomaly Detection
• Regression
• Classification
• Clustering
• Association
Algorithm Family:
• Support Vector Machines
• Decision Tree Learning
• Instance-Based Learning
• Generalized Linear Models
• Artificial Neural Network
• Centroid-Based Clustering
• Hierarchical Clustering
• Density-Based Clustering
Algorithm:
• One-Class SVM
• Linear SVM
• Non-Linear SVM
• Classification/Regression Decision Tree
• Random Forest
• Isolation Forest
• Radius Neighbors
• K-Nearest Neighbors
• Logistic Regression
• Bayesian Naïve Classifier
• Linear Regression
• Bayesian Linear Regression
• Feedforward ANN (Multilayer Perceptron)
• K-Means Clustering
• Complete-Linkage Clustering
• Single-Linkage Clustering
• Average-Linkage Clustering
• DBSCAN
• Association Rules
Unsupervised Machine Learning
When we do not know what the output values should be
Used when we wish to learn the inherent structure of our data without using explicitly provided labels
Supervised Machine Learning
When we have prior knowledge of what the output values should be
Slide source: Towards Data Science blog
Dimensionality Reduction
Learn about the data to find the dimensions that interrelate the features. Used for
eliminating redundant features to speed up data processing.
Types:
Regression
Supervised
Classification
Identifying to which category an object belongs
Types:
Decision Tree
Support Vector Machine
Logistic Regression
Neural Networks
Instance-Based Learning
Supervised
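As one illustration of classification, here is a minimal sketch of K-Nearest Neighbors, an instance-based method from the algorithm list. The function name `knn_predict`, the points, and the labels "A" and "B" are made up for this example and are not from the slides.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a category by majority vote among the k closest training points."""
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k nearest neighbors and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data with known categories.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(points, labels, (1.1, 1.0))
near_b = knn_predict(points, labels, (5.0, 5.2))
print(near_a, near_b)
```

The model here is just the stored training data; the prediction work happens when a new object is compared against it.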
Regression
Predicting a continuous-valued attribute associated with an object. Regression predicts how much of something will happen.
Types:
Generalized Linear Model
Support Vector Machine
Neural Networks
Decision Tree
Supervised
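To make "how much" concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name `fit_line` and the hours/scores data are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied versus exam score.
hours = [1, 2, 3, 4]
scores = [52, 54, 56, 58]

slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5 + intercept  # predict a continuous value for 5 hours
print(slope, intercept, predicted_score)
```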
Naïve Bayes
Finds the probability of an event occurring given the probability of another event that has
already occurred
Types:
Bayes' Theorem
Supervised
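A small worked example of Bayes' theorem, P(A | B) = P(B | A) P(A) / P(B), may help here. The function name `posterior` and all of the probabilities below are invented for illustration.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) using Bayes' theorem."""
    # Total probability of observing B, whether or not A happened.
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical numbers: 1% of transactions are fraudulent (A).
# An alert (B) fires on 90% of fraudulent and 5% of legitimate transactions.
p_fraud_given_alert = posterior(0.01, 0.90, 0.05)
print(round(p_fraud_given_alert, 3))
```

Even with a fairly sensitive alert, the low prior keeps the posterior probability of fraud well under one half in this made-up case.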
Exploratory Analysis
Used to automatically identify structure in the data
Types:
Clustering
Unsupervised
Association
Discover the rules that describe large portions of the data
Such as: People who buy X also tend to buy Y
Types:
Association Rules
Decision Tree
Unsupervised
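A rule such as "people who buy X also tend to buy Y" is commonly scored with support and confidence. The sketch below assumes those two standard measures; the baskets and function names are hypothetical.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(x, y, baskets):
    """Of the baskets containing x, the fraction that also contain y."""
    return support(x | y, baskets) / support(x, baskets)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

bread_support = support({"bread"}, baskets)
bread_to_milk = confidence({"bread"}, {"milk"}, baskets)
print(bread_support, bread_to_milk)
```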
Anomaly Detection
Finds cases that are unusual or slightly different
Types:
Support Vector Machine
Decision Tree
Supervised or Unsupervised
How do you pick which algorithm to run?
Activity
What algorithm families would you run for the following ML problem?
Which students are most likely to succeed at SLCC?
Key ML Concepts
Train vs Test
For supervised learning, we set aside a portion of our dataset for testing, to validate the accuracy of our predictions
This should be done randomly
Typical splits are 70% Train and 30% Test or 80%/20%
Test dataset will have the output field but you will ignore it when running the model
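A random 70%/30% split can be sketched with the standard library alone. The function name `train_test_split`, the fixed seed, and the sample rows are assumptions made for this example, not part of the slides.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]                       # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled records; "label" is the output field.
rows = [{"id": i, "label": i % 2} for i in range(10)]

train, test = train_test_split(rows)
print(len(train), len(test))
```

The test rows keep their `label` field. It is ignored while the model makes predictions and only used afterwards to check them.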
Feature Engineering
Feature engineering is the process of transforming the raw data into features that will
better represent the underlying problem, resulting in improved model accuracy
The Attribute Importance model can be used for this (minimum description length
algorithm)
Bias
When an algorithm produces results that are systematically prejudiced due to erroneous
assumptions in the ML process
This is usually related to the gathering or usage of data
You should check your data, models and results for this bias
The Machine Learning Process
The ML Process
Step 01: Gather Data
Step 02: Data Preparation
Step 03: Training
Step 04: Testing
Step 05: Evaluation
Step 1: Gather Data
To improve the accuracy of the predictions, the quantity and quality of the data matter most
Step 1: Gather Data
Profile data for quality
Reliable data avoids:
• Duplicated data
• Bad labels
• Bad values
• Omitted values
Step 1: Gather Data
Identify Features
A feature is a measurable property of the object you are trying to analyze.
Features are data points that describe the object, such as age, gender, or zip code.
Step 1: Gather Data
Label Sources
If your training data is not classified with the outcome, it needs to be labeled
Step 2: Data Preparation
Preparing your data for the algorithm you'll be running is perhaps the most time-consuming task
Where should you transform? Prior to the training? Or in the model?
Step 2: Data Preparation
Numeric Transformations
Convert non-numeric data to numeric
Step 2: Data Preparation
Numeric Normalization
Transform features to be on the same scale
Methods:
• Scaling to a range
• Clipping
• Log scaling
• Z-score
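The four methods above can be sketched in a few lines each. The function names and the `ages` list are hypothetical, and the z-score version assumes the population standard deviation.

```python
import math
import statistics

def scale_to_range(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def clip(values, lower, upper):
    """Cap extreme values at chosen lower and upper limits."""
    return [min(max(v, lower), upper) for v in values]

def log_scale(values):
    """Compress a wide range of positive values with the natural log."""
    return [math.log(v) for v in values]

def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / stdev for v in values]

ages = [18, 22, 35, 60]  # hypothetical feature values

scaled = scale_to_range(ages)
clipped = clip([1, 50, 100], 10, 60)
logged = log_scale([1, math.e])
standardized = z_score(ages)
print(scaled, clipped, logged, standardized)
```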
Step 2: Data Preparation
Bucketing
When a numeric feature does not have a linear relationship with the outcome, you can bucket it
Two Types of Bucketing:
• Equally spaced boundaries
• Quantile boundaries
Images source: Google machine learning
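The difference between the two bucketing types shows up clearly on skewed data. In this sketch the function names and the `incomes` values are made up, and the quantile boundaries use a simple index-based rule rather than any particular library's definition.

```python
import bisect

def equal_spaced_boundaries(values, n_buckets):
    """Boundaries that cut the value range into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    return [lo + width * i for i in range(1, n_buckets)]

def quantile_boundaries(values, n_buckets):
    """Boundaries chosen so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_buckets] for i in range(1, n_buckets)]

def bucket_index(value, boundaries):
    """Index of the bucket a value falls into, given sorted boundaries."""
    return bisect.bisect_right(boundaries, value)

incomes = [10, 12, 15, 18, 20, 25, 30, 100]  # hypothetical, with one large outlier

equal = equal_spaced_boundaries(incomes, 4)
quant = quantile_boundaries(incomes, 4)
equal_buckets = [bucket_index(v, equal) for v in incomes]
quant_buckets = [bucket_index(v, quant) for v in incomes]
print(equal, equal_buckets)
print(quant, quant_buckets)
```

With equally spaced boundaries the outlier leaves most values crowded into the first bucket, while quantile boundaries spread them evenly.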
Step 2: Data Preparation
Transforming Categorical Data
Used when the feature values have no ordered relationship
Two methods:
• Vocabulary
• Hashing
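Both methods turn a category into an integer index. This is a minimal sketch: the function names and the `majors` values are hypothetical, and SHA-256 is just one stable hash that could be used.

```python
import hashlib

def build_vocabulary(values):
    """Vocabulary method: give each distinct category its own index."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

def hash_bucket(value, n_buckets):
    """Hashing method: map any category to one of n_buckets without a stored vocabulary."""
    # hashlib gives the same result on every run, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

majors = ["nursing", "biology", "nursing", "welding"]  # hypothetical categorical feature

vocab = build_vocabulary(majors)
encoded = [vocab[m] for m in majors]
hashed = [hash_bucket(m, 8) for m in majors]
print(vocab, encoded, hashed)
```

A vocabulary needs every category to be known up front. Hashing handles unseen categories but lets different categories collide in the same bucket.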
Step 3: Training
At this point you’ll put together your machine learning model
You can use many tools for this
Build the model, accepting defaults, then run it
Step 4: Testing
Now run your model (with the same parameter settings) against your test dataset
Step 5: Evaluation
Check the accuracy of your model
How many of your test records did the model predict correctly?
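The question above translates directly into an accuracy calculation. The function name `accuracy` and the two label lists are made up for this sketch.

```python
def accuracy(actual, predicted):
    """Fraction of test records whose predicted label matches the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical test-set labels and the model's predictions for the same records.
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]

score = accuracy(actual, predicted)
print(score)
```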
Repeat and Repeat and Repeat
Applications of Machine Learning
Demo Time!
You can find me at: @datacheesehead
Michelle.Hardwick@slcc.edu
Any questions?
Thanks!