MACHINE LEARNING COURSE – EDOC
Instructor: Prof. Aude Billard, Assistants: Dr. Basilio Noris, Nicolas Sommer
==================================================
Practical 1 - PCA, kPCA (kernel PCA)
A. Getting started:
1. Open a Dataset in the Machine Learning Demonstration software:
1) Download latest mldemos software here: http://mldemos.epfl.ch/ and the zip file containing the
datasets on the ML class webpage: http://lasa.epfl.ch/teaching/lectures/ML_Phd/index.php.
2) Unzip mldemos on the desktop or on a local hard drive (for access rights issues) and ignore
warning messages. Unzip the datasets folder.
3) Open MlDemos.
4) Click on the menu File > Import Data (csv, txt), or drag the dataset file into the interface.
5) Choose the dataset you want to import.
6) Most of the datasets you are provided with are meant for classification. In this first practical, you
will ignore the class label and perform PCA/kPCA on the remaining dimensions.
To do this, you must first enter the number of the column that corresponds to the class label in the
Class column field, see below:
If you wish, for some reason, to exclude more dimensions, you can simply Ctrl + Click to select
the columns you wish to remove. Selected columns will be excluded from the input.
Each dataset differs in where it places the class-label column. The table below lists each dataset
and its associated class column.

Dataset                     Class Column
Breast Cancer Wisconsin     10
Dermatology                 35
House Votes                 17
Hayes-Roth                  5
Ionosphere                  33
Zoo                         17
7) Click Send Data and close the window
Datapoints will be displayed in different colors depending on their class number, see the example below:
You can display the color legend using the top menu icon Display Options > Legends.
The lower left menu allows you to change the display mode for your data. By default, data are
displayed in 2D mode. In this mode, data are projected onto the first two original dimensions of the
dataset; you can change the dimensions of projection in the bottom left corner of the window:
By scrolling through the lower left menu, you can select other display options, including:
Scatterplots (Visualizations/Samples: Scatterplot): displays all combinations of 2D projections.
Careful: this works well only for datasets of reasonable size (you may run out of RAM on your PC if
you try this with very high-dimensional datasets). You can zoom in/out using the scale scroller.
Class   Color
1       Red
2       Green
3       Blue
4       Yellow
5       Pink
6       Cyan
7       Orange
Bubble Plots (Visualizations/Samples: Bubbleplot): a 3-dimensional plot. Superimposed on the 2D
projection of the standard display, a 3rd dimension is shown by varying the size of the datapoints in
the 2D window (the higher the value along that dimension, the larger the bubble). This 3rd dimension
can be any of the dimensions of the original dataset and is set using the Size entry of the menu at the
bottom of the page.
2. Project the Data using PCA or kernel PCA:
1) Click on the Algorithms button (top menu) to display the Algorithms window.
You should see this window:
2) Click on the Projections tab to display the different projection possibilities.
3) Select PCA or Kernel PCA as the projection method.
4) Click Project to project your data, or Revert to go back to the data in the original space. When
Auto-fit is checked, the window resizes automatically to fit the data after projection (instead of
your manually clicking Fit in the Display Options window).
Once the data is projected, you can change which dimensions the points are displayed on using the
menu at the bottom of the page, just as before, see below.
You will notice that it now displays e1, e2, in place of x1, x2.
PCA:
The eigenvalues and the amount of variance explained by each eigenvector are displayed in the
middle of the algorithm window. The reconstruction error (computed cumulatively, adding one
eigenvector at a time) is shown in a graph.
You can choose the number of dimensions after projection that you wish to keep (this might be useful
to reduce the dimensionality of the dataset for further processing), using the Components Range box
and selecting the desired dimensions.
Clicking the Eigen button opens a separate window containing the coordinates of each eigenvector
in the original space. This is useful as a read-out of the results obtained by PCA: looking at the
relative importance of the original dimensions in the construction of each eigenvector gives a notion
of the correlations across dimensions, as extracted by PCA.
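MLDemos performs these computations internally, but the read-out is easier to interpret once you have seen the algebra. Below is a minimal NumPy sketch (illustrative, not MLDemos code; the toy data and names are our own) of what PCA computes: the eigenvalues of the covariance matrix, the fraction of variance each eigenvector explains, and the projection onto the leading components.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()     # variance explained per eigenvector
    return Xc @ eigvecs[:, :n_components], eigvals, explained

# correlated 2D toy data: most variance lies along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
proj, eigvals, explained = pca(X, n_components=1)
```

The Eigen window in MLDemos corresponds to the eigenvector coordinates (`eigvecs` inside the function): large entries indicate which original dimensions dominate each eigenvector.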
KPCA:
Choose a kernel type (RBF or Polynomial) and its corresponding hyperparameter.
Projected Dimensions sets the number of projected dimensions you want to keep (as Components
Range does in the PCA case).
Show Eigenvector Iso-Lines opens a new window that displays, for each dual eigenvector, the
isolines (i.e. level curves) superimposed on the datapoints in the original space. Recall that an
isoline is the region of space whose points all have the same projection onto the dual eigenvector.
You can select which dual eigenvector's projection to display using the scroll menu at the top left.
The horizontal and vertical axes here are the original dimensions of the data.
Notice that the thickness of the contours varies with the value of the isoline: the thicker the line, the
larger the value. You can display the contours of several feature dimensions by selecting « Multi »
instead of « Single » in the top banner. You can also select which original-space dimensions to
project onto, using the top menu.
The eigenvalues can be found in the Algorithm window once you have run the algorithm:
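As a companion to the description above, here is a hedged NumPy sketch (illustrative, not MLDemos code; the concentric-rings data is our own toy example) of kernel PCA with an RBF kernel: build the Gram matrix, center it in feature space, and eigendecompose it to obtain the dual eigenvectors and eigenvalues.

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Kernel PCA with an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))  # Gram matrix
    n = len(X)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    eigvals, alphas = np.linalg.eigh(Kc)         # dual eigenvectors, ascending
    order = np.argsort(eigvals)[::-1]
    eigvals, alphas = eigvals[order], alphas[:, order]
    # projections of the training points onto the leading components
    proj = alphas[:, :n_components] * np.sqrt(np.maximum(eigvals[:n_components], 0.0))
    return proj, eigvals

# two concentric rings: not linearly separable in the original space
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
r = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
proj, eigvals = kernel_pca(X, n_components=2, gamma=0.5)
```

Each column of `alphas` is one dual eigenvector; its isolines in the original space are what the Show Eigenvector Iso-Lines window draws.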
You can get additional information about the selected algorithm (the current open tab) by clicking the
information button in the main window:
B. Practical’s goals
PART I
You will first try to determine which type of kernel, and which parameters, would allow you to
cluster the following 2D examples:
You should ask yourself the following questions:
- What kind of projection can be achieved with an RBF kernel and with a polynomial kernel?
- How should we relate the kernel width to the data available?
- What is the influence of the degree of a polynomial kernel? Does it matter if the degree is even or
odd?
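To make the kernel-width and degree questions concrete, here is a small NumPy sketch (the toy Gaussian data and the homogeneous polynomial kernel (x·y)^d are our own illustrative assumptions). A common rule of thumb relates the RBF width to the median pairwise distance of the data, and a homogeneous even-degree polynomial kernel cannot distinguish a point from its mirror image.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

# Kernel width vs. the data: a common heuristic sets sigma near the median
# pairwise distance, so typical kernel values are neither ~0 nor ~1.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
sigma = np.median(dists[dists > 0])

# Even vs. odd degree, for a homogeneous polynomial kernel k(x, y) = (x.y)^d:
k = lambda u, v, deg: (u @ v) ** deg
x, y = X[0], X[1]
assert np.isclose(k(-x, y, 2), k(x, y, 2))    # even: blind to the sign of x
assert np.isclose(k(-x, y, 3), -k(x, y, 3))   # odd: flips sign with x
```

This sign-blindness is one reason even and odd degrees behave differently on data that is symmetric about the origin.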
Once you have answered these questions, draw the following examples in MLDemos (use the drawing
tools) and find a good projection in which the classes are easily separable.
[Figures: Example 1 and Example 2]
Once you have found good projections for these examples, try creating different shapes to build an
intuition about which type of structure each kernel can capture.
Now, try to create examples of data separable by both the polynomial kernel and the RBF kernel.
PART II
You will now compare kernel PCA and PCA on high-dimensional datasets. Choose the
hyperparameters carefully and study whether you can achieve a good projection with each method.
What counts as a good projection depends on the application:
when kPCA is used as a pre-processing step for classification or clustering, a good projection is one
on which the clustering/classification algorithms perform well. Since we have not yet seen any
classification or clustering method in class, estimate the quality of the projection visually with the
visualization tools provided by MLDemos, for instance by judging the separation between the
labeled classes after projection.
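Beyond eyeballing the plot, a crude numerical proxy can complement visual inspection. The sketch below (a simple heuristic of our own, not an MLDemos feature) scores a labeled projection by the ratio of between-class centroid distance to within-class spread:

```python
import numpy as np

def separation_score(proj, labels):
    """Ratio of mean between-class centroid distance to mean within-class
    spread; higher values suggest better-separated classes after projection."""
    classes = np.unique(labels)
    centroids = np.array([proj[labels == c].mean(axis=0) for c in classes])
    within = np.mean([np.linalg.norm(proj[labels == c] - centroids[i], axis=1).mean()
                      for i, c in enumerate(classes)])
    between = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                       for i in range(len(classes))
                       for j in range(i + 1, len(classes))])
    return between / within

# two toy classes: well separated, so the score should be clearly above 1
rng = np.random.default_rng(2)
proj = np.vstack([rng.normal(size=(100, 2)),
                  rng.normal(size=(100, 2)) + [8.0, 0.0]])
labels = np.repeat([0, 1], 100)
score = separation_score(proj, labels)
```

Comparing this score for the same dataset projected with PCA and with kPCA (various kernels) gives a rough ranking to back up what you see on screen.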
a) Ionosphere
This radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased
array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The
targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of
some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass
through the ionosphere.
The data is very noisy and seems to have many overlapping values. The dataset is already
normalized between -1 and 1.
Try to find a projection in which the data has few overlapping points from different classes.
b) Hayes-Roth
This dataset, designed to test classification algorithms, has a highly non-linear class distribution.
Find good kernels and projections that will ease the task of separating all three classes.