Intro to machine learning with scikit learn

Page 1: Intro to machine learning with scikit learn

Yossi Cohen

Machine Learning with Scikit-learn

Page 2: Intro to machine learning with scikit learn

INTRO TO ML PROGRAMMING

Page 3: Intro to machine learning with scikit learn

ML Programming

1. Get data; get labels for supervised learning

2. Create a classifier

3. Train the classifier

4. Predict test data

5. Evaluate predictor accuracy

*Configure and improve by repeating 2-5
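The five steps above can be sketched end to end in scikit-learn. This is a minimal illustration, not the deck's own lab code: the Iris data set, the k-nearest-neighbours classifier, and the 80/20 split are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Get data and labels, and hold part of them back for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 2. Create a classifier (k-NN is an arbitrary choice for this sketch)
clf = KNeighborsClassifier(n_neighbors=5)

# 3. Train the classifier
clf.fit(X_train, y_train)

# 4. Predict test data
y_pred = clf.predict(X_test)

# 5. Evaluate predictor accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Repeating steps 2-5 would mean changing the classifier or its parameters (for example `n_neighbors`) and comparing the resulting accuracy.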

Page 4: Intro to machine learning with scikit learn

The ML Process

[Diagram: the ML process, with boxes labeled Filter, Outliers, Partition, Model (Regression or Classify), Validate, and Configure]

Page 5: Intro to machine learning with scikit learn

Get Data & Labels

• Sources
–Open data sources
–Collect on your own
• Verify data validity and correctness
• Wrangle data
–Make it readable by computer
–Filter it
• Remove outliers

The pandas Python library can assist with pre-processing and data manipulation before ML: http://pandas.pydata.org/

Page 6: Intro to machine learning with scikit learn

Pre-Processing

• Change formatting
• Remove redundant data
• Filter data (take partial data)
• Remove outliers
• Label
• Split for testing (10/90, 20/80)
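A rough sketch of these pre-processing steps with pandas follows. The table, its column names (`age`, `loves_nutella`, `comment`), and the age cutoff of 120 are all hypothetical, invented only to show each step once. Splitting for testing is left out here.

```python
import pandas as pd

# Hypothetical raw survey data; every value here is made up for illustration.
raw = pd.DataFrame({
    "age": ["3", "7", "76", "11", "22", "37", "56", "2", "2", "430"],
    "loves_nutella": ["T", "T", "F", "T", "F", "F", "F", "T", "T", "T"],
    "comment": ["", "yum", "", "", "no", "", "", "", "", "typo?"],
})

df = raw.copy()
df["age"] = df["age"].astype(int)         # change formatting: text -> integer
df = df.drop(columns=["comment"])         # remove redundant data (unused column)
df = df.drop_duplicates()                 # remove repeated rows
df = df[df["age"] <= 120].copy()          # remove outliers (assumed cutoff: 120 years)
df["label"] = df["loves_nutella"] == "T"  # label: T/F -> boolean

data = df[["age"]].to_numpy()             # 2D array: one row per sample
labels = df["label"].to_numpy()           # 1D array: one label per sample
print(data.ravel(), labels)
```

The result is a pair of arrays, one of samples and one of labels, which is the shape scikit-learn estimators expect.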

Page 7: Intro to machine learning with scikit learn

Data Partitioning

• Data and labels
–{[data], [labels]}
–{[3, 7, 76, 11, 22, 37, 56, 2], [T, T, F, T, F, F, F, T]}
–Data: [Age, Do you love Nutella?]

• Partitioning will create
–{[train data], [train labels], [test data], [test labels]}
–We usually split the data at a ratio of 9:1 (train:test)
–There is a tradeoff: more test data makes the evaluation more reliable, while more training data gives the classifier more to learn from

• We will look at a partitioning function later
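One partitioning function that scikit-learn provides is `train_test_split`. Whether it is the function the deck has in mind is an assumption. The sketch below applies it to the age/Nutella example. With only eight samples it uses a 75/25 split so the test set is not a single point; a 9:1 split would be `test_size=0.1`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The example from the slide: one feature (age) and a T/F label per person
data = np.array([[3], [7], [76], [11], [22], [37], [56], [2]])
labels = np.array([True, True, False, True, False, False, False, True])

# Shuffle, then hold back 25% of the samples for testing
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.25, random_state=0
)

print(train_data.shape, test_data.shape)
```

Fixing `random_state` makes the shuffle reproducible, which helps when comparing classifiers on the same split.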

Page 8: Intro to machine learning with scikit learn

Learn (The “Smart Part”)

Classification: the output is discrete, taking one of a limited number of classes (groups)

Regression: the output is continuous
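The difference shows up in what `predict` returns. In this sketch the ages and Nutella labels come from the partitioning slide, while the heights are made-up numbers and both estimators (a decision tree and a linear regression) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

ages = np.array([[3], [7], [76], [11], [22], [37], [56], [2]])

# Classification: the target is one of a fixed set of classes (1 = T, 0 = F)
loves_nutella = np.array([1, 1, 0, 1, 0, 0, 0, 1])
classifier = DecisionTreeClassifier(random_state=0).fit(ages, loves_nutella)
class_predictions = classifier.predict([[5], [40]])
print(class_predictions)    # each prediction is a class label

# Regression: the target is a real number (heights in cm, made-up values)
height_cm = np.array([95.0, 122.0, 168.0, 144.0, 175.0, 172.0, 170.0, 86.0])
regressor = LinearRegression().fit(ages, height_cm)
value_predictions = regressor.predict([[5], [40]])
print(value_predictions)    # each prediction is a float
```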

Page 9: Intro to machine learning with scikit learn

Learn Programming

Page 10: Intro to machine learning with scikit learn

Create Classifier

For most SUPERVISED LEARNING algorithms this would be

C = ClassifyAlg(Params)

It's up to us (ML guys) to set the best params. How?

1. We could develop a hunch for it
2. Perform an exhaustive search
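Both options can be sketched with scikit-learn. The example assumes an SVM classifier (`SVC`) on the Iris data; the parameter values in the grid are arbitrary. `GridSearchCV` is one way to run an exhaustive search: it tries every combination in the grid and scores each with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Option 1: "C = ClassifyAlg(Params)" with hand-picked parameters (a hunch)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")

# Option 2: exhaustive search over a small, arbitrary grid with 5-fold CV
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The search costs one training run per parameter combination per fold (here 16 x 5), so the grid is usually kept coarse at first.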

Page 11: Intro to machine learning with scikit learn

Train the classifier

We assigned

C = ClassifyAlg(Params)

This is a general algorithm with some initializer and configuration.

In this stage we train it using:

C.fit(Data, Labels)
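A small sketch of what changes when `fit` is called, assuming a decision tree on the Iris data (both are arbitrary choices). Before fitting, the estimator holds only its configuration; afterwards it holds learned state such as `classes_`.

```python
from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Configured, but it has not seen any data yet
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

raised_before_fit = False
try:
    clf.predict(X[:1])
except NotFittedError:
    raised_before_fit = True
    print("predict() before fit() raises NotFittedError")

# fit() learns from the data and returns the estimator itself
returned = clf.fit(X, y)
print(clf.classes_)    # the class labels seen during training
```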

Page 12: Intro to machine learning with scikit learn

Predict

Once we have a trained classifier C:

Predicted_Labels = C.predict(Data)
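As an illustration, the sketch below trains a k-nearest-neighbours classifier on the Iris data (an assumed choice) and predicts labels for two measurement vectors picked by hand for this example. `predict` takes a 2D array, one row per sample, and returns one label per row.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

# Two illustrative flowers: sepal length, sepal width, petal length, petal width (cm)
new_flowers = [[5.0, 3.4, 1.5, 0.2], [6.7, 3.0, 5.2, 2.3]]

predicted_labels = clf.predict(new_flowers)       # integer class indices
print(iris.target_names[predicted_labels])        # the matching species names

# Many classifiers can also report per-class probabilities
probabilities = clf.predict_proba(new_flowers)    # shape: (n_samples, n_classes)
print(probabilities)
```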

Page 13: Intro to machine learning with scikit learn

Predictor Evaluation

We are not done yet.

There is a need to evaluate the predictor's accuracy in comparison to other predictors and to the system requirements.

We will learn several methods for this.
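One common method is cross-validation; which methods the deck goes on to cover is not stated here, so treat this as one possible sketch. It compares two arbitrarily chosen predictors on the Iris data and checks them against a hypothetical requirement of 90% accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two candidate predictors to compare (arbitrary choices for this sketch)
candidates = {
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, clf in candidates.items():
    # 5-fold cross-validation: train on 4/5 of the data, score on the rest, 5 times
    scores = cross_val_score(clf, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

required_accuracy = 0.90    # hypothetical system requirement
acceptable = [name for name, score in results.items() if score >= required_accuracy]
print("Meets requirement:", acceptable)
```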

Page 14: Intro to machine learning with scikit learn

ENVIRONMENT

Page 15: Intro to machine learning with scikit learn

The Environment

• There are many existing environments and tools we could use
–Matlab with Machine learning toolbox
–Apache Mahout
–Python with Scikit-learn
• Additional tools
–Hadoop / Map-Reduce to accelerate and parallelize large data set processing
–Amazon ML tools
–NVIDIA Tools

Page 16: Intro to machine learning with scikit learn

Scikit-learn

• Installation instructions: http://scikit-learn.org/stable/install.html#install-official-release
• Depends on two other libraries
–numpy and scipy
• Easiest way to install on Windows:
–Install WinPython: http://sourceforge.net/projects/winpython/files/WinPython_2.7/2.7.9.4/
–Let's install this together
• For Linux / Mac computers, just install the 3 libs separately using pip
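A quick way to check that the installation worked is to import the three libraries and print their versions. This is only a sanity check, not part of the official instructions.

```python
import numpy
import scipy
import sklearn

# If any import above fails, that library is not installed in this environment
for module in (numpy, scipy, sklearn):
    print(module.__name__, module.__version__)
```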

Page 17: Intro to machine learning with scikit learn

THE DATA

Page 18: Intro to machine learning with scikit learn

Data sets

There are many data sets to work on.

One of them is the Iris data set, used for classification into three groups. It has an interesting story you could google later.

We'll work on the Iris data.
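A short look at what the data set contains, using the copy bundled with scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)            # (samples, features)
print(iris.feature_names)         # what each of the four columns measures
print(list(iris.target_names))    # the three groups (species)
print(np.bincount(iris.target))   # number of samples in each group
```

There are 150 samples with four measurements each, split evenly across the three species.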

Page 19: Intro to machine learning with scikit learn

Lab A – Plot the Iris data

Plot sepal length vs sepal width with labels ONLY

How? Google Iris data and the scikit-learn environment

Try to understand the second part of the program with the PCA

Page 20: Intro to machine learning with scikit learn

Iris Data

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # 3D plotting support for the PCA figure
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

# Axis limits with a 0.5 margin around the data
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

Page 21: Intro to machine learning with scikit learn

Plot Iris Data

plt.figure(2, figsize=(8, 6))
plt.clf()
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

Page 22: Intro to machine learning with scikit learn

Add PCA for better classification

from sklearn.decomposition import PCA  # PCA lives in sklearn.decomposition

fig = plt.figure(1, figsize=(8, 6))
# Create the 3D axes through the figure so they are attached to it
ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)

# Project all four features onto the first three principal components
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
           cmap=plt.cm.Paired)

ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])

plt.show()
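To help with reading the PCA part of the program, the sketch below runs the same decomposition without plotting and prints how much of the data's variance each principal component captures. It is an added illustration, not part of the lab code.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Same decomposition as in the plot: 4 original features -> 3 components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)
print(X_reduced.shape)

# Fraction of the total variance captured by each component, largest first
ratios = pca.explained_variance_ratio_
for i, ratio in enumerate(ratios, start=1):
    print(f"component {i}: {ratio:.3f}")
```

On Iris the first component alone carries most of the variance, which is why the classes already separate well along the first axis of the 3D plot.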

Page 23: Intro to machine learning with scikit learn

Iris Data Classified

Page 24: Intro to machine learning with scikit learn

Page 25: Intro to machine learning with scikit learn

Thank you!

More about me:

Yossi Cohen
[email protected]
+972-545-313092

Video compression and computer vision enthusiast & lecturer

Surfer