SciKit Learn: How to Standardize Your Data

Page 1: SciKit Learn: How to Standardize Your Data

How to

Standardize Your Data:

An ML Recipe

Page 2: SciKit Learn: How to Standardize Your Data

DAMIAN MINGLE, CHIEF DATA SCIENTIST, WPC Healthcare

@DamianMingle

Page 3: SciKit Learn: How to Standardize Your Data

GET THE FULL STORY

bit.ly/UseSciKitNow

Page 4: SciKit Learn: How to Standardize Your Data

What’s Standardization Anyway?

• A kind of preprocessing, often described as “functions and transformers that change raw feature vectors into a representation that is more suitable for the downstream estimator”

• Shifting and rescaling each attribute so that it has a mean of 0 and a standard deviation of 1.
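As a rough sketch of what that definition means in practice, each value x in a column becomes z = (x - mean) / std, where mean and std are computed per column. The toy array below is hypothetical and only illustrates the arithmetic with NumPy; it is not part of the original recipe.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Per-column mean and (population) standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: subtract the column mean, divide by the column std.
Z = (X - mean) / std

print(Z.mean(axis=0))  # column means, approximately 0
print(Z.std(axis=0))   # column standard deviations, approximately 1
```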

Page 5: SciKit Learn: How to Standardize Your Data

Why Standardization Matters

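• It’s a common requirement of many machine learning estimators

• Models may behave badly if individual features are on very different scales or are far from zero mean and unit variance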

• It’s useful for models that rely on the distribution of attributes such as Gaussian processes.
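One way to see why models may behave badly, sketched under assumptions: with unscaled features, a Euclidean distance can be dominated by whichever column has the largest numbers. The height and income values below are made up for illustration, and the distance-based setting is an assumed example rather than something the slides specify.

```python
import numpy as np

# Hypothetical data: column 0 is height in metres, column 1 is income in dollars.
X = np.array([[1.60, 30000.0],
              [1.90, 30500.0],
              [1.61, 60000.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw distances: the income column dominates, height barely matters.
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])

# Standardize each column, then measure again.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_01 = dist(Z[0], Z[1])
std_02 = dist(Z[0], Z[2])

print("raw:", raw_01, raw_02)
print("standardized:", std_01, std_02)
```

After standardization both features contribute on a comparable scale, so the large height difference between rows 0 and 1 is no longer swamped by income.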

Page 6: SciKit Learn: How to Standardize Your Data

Power in SciKit Learn

• Preprocessing

• Clustering

• Regression

• Classification

• Dimensionality Reduction

• Model Selection

Page 7: SciKit Learn: How to Standardize Your Data

Let’s Look at an ML Recipe

Standardization

Page 8: SciKit Learn: How to Standardize Your Data

The Imports

from sklearn.datasets import load_iris

from sklearn import preprocessing

Page 9: SciKit Learn: How to Standardize Your Data

Separate Features from Target

iris = load_iris()

print(iris.data.shape)

X = iris.data

y = iris.target

Page 10: SciKit Learn: How to Standardize Your Data

Standardize the Features

standardized_X = preprocessing.scale(X)
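A quick sanity check, offered as a sketch: after preprocessing.scale, each Iris feature column should have a mean close to 0 and a standard deviation close to 1. The names X_scaled, col_means and col_stds are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing

# Load the Iris features and standardize them column by column.
X = load_iris().data
X_scaled = preprocessing.scale(X)

# Per-column statistics of the standardized data.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print(np.round(col_means, 6))  # each entry should be close to 0
print(np.round(col_stds, 6))   # each entry should be close to 1
```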

Page 11: SciKit Learn: How to Standardize Your Data

Standardization Recipe

# Standardize the data attributes for the Iris dataset.

from sklearn.datasets import load_iris

from sklearn import preprocessing

# load the iris dataset

iris = load_iris()

print(iris.data.shape)

# separate the data from the target attributes

X = iris.data

y = iris.target

# standardize the data attributes (zero mean, unit variance per column)

standardized_X = preprocessing.scale(X)
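The recipe scales the whole dataset in one call. When a held-out test set is involved, a common alternative (not shown in the slides, so treat this as an assumed variation) is StandardScaler, which learns the mean and standard deviation from the training split and reuses them on the test split. The split size and random_state below are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Arbitrary split for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the scaling statistics from the training data only.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data.
X_test_std = scaler.transform(X_test)

print(np.round(scaler.mean_, 3))   # per-feature means learned from X_train
print(np.round(scaler.scale_, 3))  # per-feature standard deviations from X_train
```

Fitting on the training split alone keeps information about the test set out of the scaling step, so the test columns will generally not have exactly zero mean.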

Page 15: SciKit Learn: How to Standardize Your Data

Resources

• Society of Data Scientists

• SciKit Learn

• Also:

• Scaling features to a range (MinMaxScaler or MaxAbsScaler)

• Scaling sparse data (MaxAbsScaler, or StandardScaler with with_mean=False)

• Scaling data with outliers (RobustScaler)
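To give a feel for how the scalers listed above differ, here is a minimal sketch on a hypothetical single-column array that contains one outlier. The data and variable names are illustrative assumptions; the scaler classes are from sklearn.preprocessing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

# Hypothetical single feature with one large outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler: maps the column onto the range [0, 1].
minmax = MinMaxScaler().fit_transform(X)

# MaxAbsScaler: divides by the largest absolute value, keeping zeros at zero.
maxabs = MaxAbsScaler().fit_transform(X)

# RobustScaler: subtracts the median and divides by the interquartile range,
# so the outlier has little influence on how the other values are scaled.
robust = RobustScaler().fit_transform(X)

print(minmax.ravel())
print(maxabs.ravel())
print(robust.ravel())
```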