MACHINE LEARNING COURSE – EDOC
Instructor: Prof. Aude Billard, Assistants: Dr. Basilio Noris, Nicolas Sommer
==================================================
Practical 1 - PCA, kPCA (kernel PCA)
A. Getting started:
1. Open a Dataset in the Machine Learning Demonstration software:
1) Download latest mldemos software here: http://mldemos.epfl.ch/ and the zip file containing the
datasets on the ML class webpage: http://lasa.epfl.ch/teaching/lectures/ML_Phd/index.php.
2) Unzip mldemos on the desktop or on a local hard drive (for access rights issues) and ignore
warning messages. Unzip the datasets folder.
3) Open MlDemos.
4) Click on the menu File > Import Data (csv, txt), or drag the dataset file into the interface.
5) Choose the dataset you want to import.
6) Most of the datasets you are provided with are meant for classification. In this first practical, you
will ignore the class label and perform PCA/kPCA on the remaining dimensions.
To do this, you must first enter the number of the column that corresponds to the class label in the
Class column field, see below:
If you wish, for some reason, to exclude more dimensions, you can simply Ctrl + Click to select
the columns you wish to remove. Selected columns will be excluded from the input.
Each dataset differs in where it places the class-label column. The table below lists each dataset
and its associated class column.

Dataset                     Class Column
Breast Cancer Wisconsin     10
Dermatology                 35
House Votes                 17
Hayes-Roth                  5
Ionosphere                  33
Zoo                         17
7) Click Send Data and close the window
Datapoints will be displayed in different colors depending on their class number, see the example below:
You can display the color legend using the top menu icon Display Options > Legends.
The lower left menu allows you to change the display mode for your data. By default, data are
displayed in 2D mode. In this mode, data are projected onto the first two original dimensions of the
dataset; you can change the dimensions of projection in the bottom left corner of the window:
By scrolling through the lower left menu, you can select other display options, including:
Scatterplots (Visualizations/Samples: Scatterplot): displays all combinations of 2D projections.
Careful: this works well only for datasets of reasonable size (you may run out of RAM on your PC if
you try this with very high-dimensional datasets). You can zoom in/out using the scale scroller.
Class   Color
1       Red
2       Green
3       Blue
4       Yellow
5       Pink
6       Cyan
7       Orange
Bubble Plots (Visualizations/Samples: Bubbleplot): a 3-dimensional plot. Superimposed on the 2D
projection of the standard display, a 3rd dimension is shown by varying the size of the datapoints in
the 2D window (the higher the value along that dimension, the larger the bubble). This 3rd dimension
can be any of the dimensions of the original dataset and is set using the Size entry of the menu at the
bottom of the page.
2. Project the Data using PCA or kernel PCA:
1) Click on the Algorithms button (top menu) to display the Algorithms window.
You should see this window:
2) Click on the Projections tab to display the different projection possibilities.
3) Select PCA or Kernel PCA as the projection method.
4) Click Project to project your data, or Revert to go back to the data in the original space. When
Auto-fit is checked, the window resizes automatically to fit the data after projection (instead of
your manually clicking Fit in the Display Options window).
Once the data is projected, you can change which dimensions the points are displayed on using the
menu at the bottom of the page, just as before, see below.
You will notice that it now displays e1, e2, in place of x1, x2.
PCA:
The eigenvalues and the amount of variance explained by each eigenvector are displayed in the
middle of the algorithm window. The reconstruction error (computed cumulatively, adding one
eigenvector at a time) is shown in a graph.
You can choose the number of dimensions after projection that you wish to keep (this might be useful
to reduce the dimensionality of the dataset for further processing), using the Components Range box
and selecting the desired dimensions.
Clicking the Eigen button opens a separate window containing the coordinates of each eigenvector
in the original space. This is useful as a read-out of the results obtained by PCA: looking at the
relative importance of the original dimensions in the construction of each eigenvector gives a notion
of the correlations across dimensions, as extracted by PCA.
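MLDemos performs these computations internally, but the read-out is easier to interpret once you have seen the algebra. Below is a minimal NumPy sketch (illustrative, not MLDemos code; the toy data and names are our own) of what PCA computes: the eigenvalues of the covariance matrix, the fraction of variance each eigenvector explains, and the projection onto the leading components.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()     # variance explained per eigenvector
    return Xc @ eigvecs[:, :n_components], eigvals, explained

# correlated 2D toy data: most variance lies along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
proj, eigvals, explained = pca(X, n_components=1)
```

The Eigen window in MLDemos corresponds to the eigenvector coordinates (`eigvecs` inside the function): large entries indicate which original dimensions dominate each eigenvector.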
KPCA:
Choose a kernel type (RBF or Polynomial) and its corresponding hyperparameter.
Projected Dimensions sets the number of projected dimensions you want to keep (as Components
Range does in the PCA case).
Show Eigenvector Iso-Lines opens a new window that displays, for each dual eigenvector, the
isolines (i.e. level curves) superimposed on the datapoints in the original space. Recall that an
isoline is the region of space whose points all have the same projection onto the dual eigenvector.
You can select which dual eigenvector's projection to display using the scroll menu at the top left.
The horizontal and vertical axes here are the original dimensions of the data.
Notice that the thickness of the contours varies with the value of the isoline: the thicker the line, the
larger the value. You can display the contours of several feature dimensions by selecting « Multi »
instead of « Single » in the top banner. You can also select which original-space dimensions to
project onto, using the top menu.
The eigenvalues can be found in the Algorithm window once you have run the algorithm:
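As a companion to the description above, here is a hedged NumPy sketch (illustrative, not MLDemos code; the concentric-rings data is our own toy example) of kernel PCA with an RBF kernel: build the Gram matrix, center it in feature space, and eigendecompose it to obtain the dual eigenvectors and eigenvalues.

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Kernel PCA with an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))  # Gram matrix
    n = len(X)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    eigvals, alphas = np.linalg.eigh(Kc)         # dual eigenvectors, ascending
    order = np.argsort(eigvals)[::-1]
    eigvals, alphas = eigvals[order], alphas[:, order]
    # projections of the training points onto the leading components
    proj = alphas[:, :n_components] * np.sqrt(np.maximum(eigvals[:n_components], 0.0))
    return proj, eigvals

# two concentric rings: not linearly separable in the original space
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
r = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
proj, eigvals = kernel_pca(X, n_components=2, gamma=0.5)
```

Each column of `alphas` is one dual eigenvector; its isolines in the original space are what the Show Eigenvector Iso-Lines window draws.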
You can get additional information about the selected algorithm (the current open tab) by clicking the
information button in the main window:
B. Practical’s goals
PART I
You will first try to determine which type of kernel, and which parameters, would allow you to
cluster the following 2D examples:
You should ask yourself the following questions:
- What kind of projection can be achieved with an RBF kernel and with a polynomial kernel?
- How should we relate the kernel width to the data available?
- What is the influence of the degree of a polynomial kernel? Does it matter if the degree is even or
odd?
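To make the kernel-width and degree questions concrete, here is a small NumPy sketch (the toy Gaussian data and the homogeneous polynomial kernel (x·y)^d are our own illustrative assumptions). A common rule of thumb relates the RBF width to the median pairwise distance of the data, and a homogeneous even-degree polynomial kernel cannot distinguish a point from its mirror image.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

# Kernel width vs. the data: a common heuristic sets sigma near the median
# pairwise distance, so typical kernel values are neither ~0 nor ~1.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
sigma = np.median(dists[dists > 0])

# Even vs. odd degree, for a homogeneous polynomial kernel k(x, y) = (x.y)^d:
k = lambda u, v, deg: (u @ v) ** deg
x, y = X[0], X[1]
assert np.isclose(k(-x, y, 2), k(x, y, 2))    # even: blind to the sign of x
assert np.isclose(k(-x, y, 3), -k(x, y, 3))   # odd: flips sign with x
```

This sign-blindness is one reason even and odd degrees behave differently on data that is symmetric about the origin.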
Once you have answered these questions, draw the following examples in MLDemos (use the drawing
tools) and find a good projection in which the classes are easily separable.
[Figures: Example 1 and Example 2]
Once you have found good projections for these examples, try creating different shapes to build an
intuition about which type of structure each kernel can capture.
Now, try to create examples of data separable by both the polynomial kernel and the RBF kernel.
PART II
You will now compare kernel PCA and PCA on high-dimensional datasets. Choose the
hyperparameters carefully and study whether you can achieve a good projection with each method.
What counts as a good projection depends on the application:
when kPCA is used as a pre-processing step for classification or clustering, a good projection is one
on which the clustering/classification algorithms perform well. Since we have not yet seen any
classification or clustering method in class, estimate the quality of the projection visually with the
visualization tools provided by MLDemos, for instance by judging the separation between the
labeled classes after projection.
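Beyond eyeballing the plot, a crude numerical proxy can complement visual inspection. The sketch below (a simple heuristic of our own, not an MLDemos feature) scores a labeled projection by the ratio of between-class centroid distance to within-class spread:

```python
import numpy as np

def separation_score(proj, labels):
    """Ratio of mean between-class centroid distance to mean within-class
    spread; higher values suggest better-separated classes after projection."""
    classes = np.unique(labels)
    centroids = np.array([proj[labels == c].mean(axis=0) for c in classes])
    within = np.mean([np.linalg.norm(proj[labels == c] - centroids[i], axis=1).mean()
                      for i, c in enumerate(classes)])
    between = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                       for i in range(len(classes))
                       for j in range(i + 1, len(classes))])
    return between / within

# two toy classes: well separated, so the score should be clearly above 1
rng = np.random.default_rng(2)
proj = np.vstack([rng.normal(size=(100, 2)),
                  rng.normal(size=(100, 2)) + [8.0, 0.0]])
labels = np.repeat([0, 1], 100)
score = separation_score(proj, labels)
```

Comparing this score for the same dataset projected with PCA and with kPCA (various kernels) gives a rough ranking to back up what you see on screen.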
a) Ionosphere
This radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased
array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The
targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of
some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass
through the ionosphere.
The data is very noisy and seems to have many overlapping values. The dataset is already
normalized between -1 and 1.
Try to find a projection in which the data has few overlapping points from different classes.
b) Hayes-Roth
This dataset, designed to test classification algorithms, has a highly non-linear class distribution.
Find good kernels and projections that will ease the task of separating all three classes.