Dimensionality reduction & Visualization · 2019-01-28


Dimensionality reduction & Visualization


Outline

•  Definition and objectives
•  Methods of data reduction
•  Examples of applications
•  The curse of dimensionality
•  Classification of high-dimensional data
•  Techniques of dimension reduction


Why visualize data?

•  Better presentation of data => better understanding and analysis
•  “Goal ... is to communicate information clearly and effectively through graphical means.” – Friedman (2008)


Motivation

•  Increasing computational power of computers

•  Data availability
•  A lot of high-dimensional data
•  An additional component of clustering techniques


Motivation

A good visualization technique should remain applicable even if:
•  We have little or no a priori knowledge about the data
•  Exploratory goals are vague
•  The data are inhomogeneous and noisy


Motivation

•  Nowadays high-dimensional data are everywhere

•  They are associated with different Machine Learning tasks such as classification, clustering and regression


Results of the visualization

The results of a visualization can be represented in the form of:
•  Maps
•  Graphics
•  Dashboards


Why dimension reduction?

•  The number of features can be very large
    –  Genomic data: expressions of genes
        •  Thousands of variables
    –  Image data: each pixel of an image
        •  A 64×64 image = 4096 features
    –  Text categorization: frequency of phrases (or words) in a document or web page
        •  More than ten thousand features


Methods of data reduction


Methods of data reduction

There are many methods of visualization, depending on the type of data being treated. The data can be:
•  Univariate
•  Bivariate
•  Multivariate


Univariate data

•  Represent measurements of a single variable
•  Usually characterize a distribution
•  Are represented by the following methods:
    –  Histogram
    –  Pie chart (camembert)
    –  Box plot
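As a rough illustration of what a histogram and a box plot summarize, the sketch below computes the bin counts and the five-number summary of one variable using only the Python standard library. The function names and the sample values are assumptions made for this example; they do not come from the slides.

```python
import statistics

def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def five_number_summary(values):
    """Minimum, quartiles and maximum: the quantities a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Hypothetical measurements of a single variable (illustrative values only).
pulse_rates = [52, 62, 64, 70, 72, 83]
print(histogram_counts(pulse_rates, 3))
print(five_number_summary(pulse_rates))
```

A plotting library would normally draw these quantities directly; the point here is only that both representations reduce a univariate sample to a handful of numbers describing its distribution.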


Univariate data (1D)

•  Representation


Bivariate data (2D)

•  Are paired samples of two variables
•  The variables are related
•  Are represented by the following methods:
    –  Scatter plot
    –  Line graphs


Bivariate data (2D)

•  Representation


Multivariate data

•  A multidimensional representation of multivariate data
•  Are represented by the following methods:
    –  Icon-based methods
    –  Pixel-based methods
    –  Dynamic parallel-coordinates systems


What is visualization?


What is visualization?

•  Visualization is the process of visual interpretation or graphical representation of a dataset.


What is visualization?

•  Medical investigations on patients (raw listing; every record is dated 26/07/2010; each row gives an ID followed by its investigations as name = value)

ID 119: OE - pulse rate = 62; Diastolic blood pressure = 80; Systolic blood pressure = 120
ID 201: Free T4 level = 17.8; Serum TSH level = 4.55; Serum calcium = 2.37; Serum inorganic phosphate = 1.17; Serum total protein = 64; Serum albumin = 41; Serum globulin = 23; Serum bilirubin level = 12; Serum alkaline phosphatase = 237; ALTSGPT serum level = 3; AST - aspartate transamSGOT = 24; International normalised ratio = 2; International normalised ratio = 2
ID 580: Prostate specific antigen = 0.7
ID 631: OE - pulse rate = 83; Diastolic blood pressure = 80; Systolic blood pressure = 133
ID 634: Urine creatinine = 11.74; Urine microalbumin = 8.8; Urine albumin:creatinine ratio = 0.75; OE - weight = 84.35; OE - height = 176; Body Mass Index = 27.23; Diastolic blood pressure = 80; Systolic blood pressure = 135
ID 786: Free T4 level = 12.6; Serum TSH level = 2.58; Plasma C reactive protein = 32; Serum calcium = 2.48; Serum inorganic phosphate = 0.93; Serum total protein = 71; Serum albumin = 41; Serum globulin = 30; Serum bilirubin level = 6; Serum alkaline phosphatase = 233; ALTSGPT serum level = 9; AST - aspartate transamSGOT = 18; Serum sodium = 137; Serum potassium = 4.1; Serum chloride = 103; Serum urea level = 4.7; Serum creatinine = 83; Total white blood count = 8.8; Red blood cell RBC count = 4.32; Haemoglobin estimation = 13; Haematocrit - PCV = 38.2; Mean corpuscular volume MCV = 88.4; Mean corpusc haemoglobinMCH = 30; Mean corpusc Hb conc MCHC = 34; Platelet count = 364; Neutrophil count = 6.6; Lymphocyte count = 1.5; Monocyte count = 0.5; Eosinophil count = 0.1; Erythrocyte sedimentation rate = 21
ID 816: Serum vitamin B12 = 348; Serum folate = 3.8; Total white blood count = 3.9; Red blood cell RBC count = 3.83; Haemoglobin estimation = 13.3; Haematocrit - PCV = 39.4; Mean corpuscular volume MCV = 103; Mean corpusc haemoglobinMCH = 34.8; Mean corpusc Hb conc MCHC = 33.8; Platelet count = 137; Neutrophil count = 2.4; Lymphocyte count = 1; Monocyte count = 0.4; Eosinophil count = 0.1; Body Mass Index = 28.99; OE - weight = 92.05; OE - height = 178.2; OE - pulse rate = 52; Diastolic blood pressure = 54; Systolic blood pressure = 107
ID 856: Free T4 level = 16.4; Serum TSH level = 2.95; Serum calcium = 2.6; Serum inorganic phosphate = 0.82; Serum total protein = 72; Serum albumin = 47; Serum globulin = 25; Serum bilirubin level = 15; Serum alkaline phosphatase = 176; ALTSGPT serum level = 33; AST - aspartate transamSGOT = 23; Serum sodium = 141; Serum potassium = 4.8; Serum chloride = 102; Serum urea level = 6.3; Serum creatinine = 98; Total white blood count = 5.8; Red blood cell RBC count = 5.04; Haemoglobin estimation = 16; Haematocrit - PCV = 47.1; Mean corpuscular volume MCV = 93.4; Mean corpusc haemoglobinMCH = 31.9; Mean corpusc Hb conc MCHC = 34.1; Platelet count = 162; Neutrophil count = 3.2; Lymphocyte count = 2.1; Monocyte count = 0.3; Eosinophil count = 0.2; Erythrocyte sedimentation rate = 5; Diastolic blood pressure = 90; Systolic blood pressure = 150
ID 1005: International normalised ratio = 3
ID 1163: Serum sodium = 141; Serum potassium = 5; Serum chloride = 98; Serum urea level = 16.8; Serum creatinine = 159; Glomerular filtration rate = 28
ID 1818: Free T4 level = 13.6; Serum TSH level = 1.43; Total white blood count = 5.8; Red blood cell RBC count = 4.65; Haemoglobin estimation = 14.8; Haematocrit - PCV = 43.1; Mean corpuscular volume MCV = 92.6; Mean corpusc haemoglobinMCH = 31.9; Mean corpusc Hb conc MCHC = 34.5; Platelet count = 205; Neutrophil count = 4; Lymphocyte count = 1.4; Monocyte count = 0.3; Eosinophil count = 0.1; Erythrocyte sedimentation rate = 15; Blood glucose level = 5.4; Serum calcium = 2.48; Serum inorganic phosphate = 0.89; Serum total protein = 76; Serum albumin = 43; Serum globulin = 33; Serum bilirubin level = 11; Serum alkaline phosphatase = 287; ALTSGPT serum level = 24; AST - aspartate transamSGOT = 25; Serum sodium = 138; Serum potassium = 4.2; Serum chloride = 105; Serum urea level = 3.6; Serum creatinine = 83
ID 2714: Serum sodium = 134; Serum potassium = 4.4; Serum chloride = 104; Serum urea level = 9.8; Serum creatinine = 200; Glomerular filtration rate = 29
ID 3459: International normalised ratio = 2.5
ID 3735: OE - pulse rate = 72; Diastolic blood pressure = 96; Systolic blood pressure = 169
ID 4219: OE - pulse rate = 70; Diastolic blood pressure = 71; Systolic blood pressure = 108; International normalised ratio = 3.3
ID 4285: OE - pulse rate = 64; Diastolic blood pressure = 90; Systolic blood pressure = 150
ID 4355: Diastolic blood pressure = 80; Systolic blood pressure = 120
ID 4511: International normalised ratio = 2.5
ID 4763: International normalised ratio = 2.9
ID 5111: International normalised ratio = 1.5


What is visualization?

Snow’s Cholera Map, 1855


What is visualization?

•  Visualization is a branch of computer science involving the processing, analysis and graphical representation of data from diverse fields: social sciences, finance, medicine, entertainment, etc.
•  Two areas of expertise are particularly called upon in visualization: computer graphics and statistics.


What is visualization?

•  It is important to know how to distinguish the areas of image processing, computer graphics and visualization.

•  Image processing is the study of 2D images in order to extract information from them or to modify their characteristics.


What is visualization?

•  Computer graphics makes it possible to create images from scratch using a computer, whether 2D images drawn by an artist or complex 3D scenes.
•  Visualization allows the exploration of data represented in a visual form, to help our understanding of the phenomenon shown.


What is visualization?

•  One goal of visualization is to visually represent data that does not necessarily have a natural geometric interpretation.


Data Acquisition

The raw data can be acquired in different ways:
•  from simulations (computer calculations)
•  from statistical surveys
•  from historical databases
•  from sensors taking real measurements, etc.


Data Acquisition

Sources of errors
•  Is the sampling accurate enough for us to obtain the desired information? Useless data that would only increase the amount of computation should not be taken into account.
•  Is the quantization done with sufficient precision to bring out the desired characteristics?


Filtering Data

Sources of errors
•  Do we retain the important and meaningful data? Conversely, do we eliminate the data that are irrelevant to the extraction of the desired characteristics?
•  If we add data, are the added data representative of the rest?


High-dimensional data

•  The increase in computing power and storage space of computers has led to an increase in the size of data sets

•  Thus, several fields of science now rely on our ability to analyze and visualize high-dimensional data.


Curse of dimensionality


Curse of dimensionality

The curse of dimensionality:
•  A term introduced by Bellman in 1961
•  Refers to the problem of the explosive increase in data volume associated with adding extra dimensions to a mathematical space
•  We will illustrate this problem with a simple example


Toy problem

•  We have a 3-class pattern recognition problem
•  We have 9 one-dimensional observations available (along a single axis)
•  A simple approach would be to:
    –  Divide the feature space into uniform bins
    –  Compute the ratio of examples of each class in each bin
    –  For a new example, find its bin and choose the predominant class of that bin
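The three steps of this simple approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the interval [0, 3] and the nine labelled observations are invented for the example and are not the data shown on the original slide.

```python
from collections import Counter, defaultdict

def fit_bin_classifier(xs, labels, n_bins, lo, hi):
    """Count the class labels that fall in each uniform bin of [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = defaultdict(Counter)
    for x, label in zip(xs, labels):
        # Values equal to hi are placed in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx][label] += 1
    return counts, width

def predict(counts, width, lo, n_bins, x):
    """Return the predominant class of the bin containing x (None if the bin is empty)."""
    idx = min(int((x - lo) / width), n_bins - 1)
    if not counts[idx]:
        return None
    return counts[idx].most_common(1)[0][0]

# Nine hypothetical 1D observations from three classes (illustrative values only).
xs = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.1, 2.5, 2.9]
labels = ["a", "a", "b", "b", "b", "c", "c", "c", "a"]

counts, width = fit_bin_classifier(xs, labels, n_bins=3, lo=0.0, hi=3.0)
for x_new in (0.4, 1.5, 2.7):
    print(x_new, predict(counts, width, 0.0, 3, x_new))
```

With 3 bins and 9 observations, each bin holds 3 examples on average, and each bin in this made-up sample already mixes two classes.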


Toy problem

•  For our example we start with a single feature and divide the axis into 3 segments. We observe that we have an average of 3 examples per bin.
•  We then observe that there is too much overlap among the classes, so we decide to add a second feature to try to improve class separability.


Toy problem (2D)

•  If we add a 2nd dimension, we go from 3 bins (in 1D) to 3^2 = 9 bins (in 2D).
•  We now face another problem: do we maintain the density of examples per bin, or do we keep the same number of examples that was used in 1D?


Toy problem (2D)

•  Choosing to maintain the density increases the number of examples from 9 (in 1D) to 27 (in 2D)

•  Choosing to maintain the number of examples results in a 2D scatter plot that is very sparse


Toy problem (3D)

If we add a 3rd dimension, it gets worse:
•  The number of bins grows to 3^3 = 27
•  To keep the same density of examples, the number of examples needed becomes 81
•  For the same number of examples, the 3D scatter plot is almost empty
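The counts quoted in the toy problem can be reproduced with a short calculation. The sketch below assumes, as in the slides, 3 bins per axis and a target density of 3 examples per bin; the function name and its parameters are introduced here for illustration only.

```python
def bins_and_examples(n_dims, bins_per_axis=3, examples_per_bin=3):
    """Number of bins, and number of examples needed to keep the density constant."""
    n_bins = bins_per_axis ** n_dims
    return n_bins, n_bins * examples_per_bin

fixed_sample_size = 9  # the 9 observations available in the 1D problem

for n_dims in (1, 2, 3):
    n_bins, n_needed = bins_and_examples(n_dims)
    density = fixed_sample_size / n_bins
    print(f"D={n_dims}: {n_bins} bins, {n_needed} examples needed, "
          f"{density:.2f} examples/bin with only 9 examples")
```

Keeping the density requires 9, 27 and 81 examples for D = 1, 2, 3, while keeping the 9 original examples lets the density drop from 3 examples per bin to one third of an example per bin.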


Curse of dimensionality

The approach used on the toy example is inefficient:

•  There are other approaches that are less affected by the curse of dimensionality, but the problem still exists


Curse of dimensionality

How can we beat the "curse of dimensionality"?
•  By incorporating prior knowledge
•  By imposing increasing smoothness on the target function
•  By reducing the dimensionality


Curse of dimensionality

•  In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of variables beyond which the performance of our classifier will degrade rather than improve.


Curse of dimensionality

The curse of dimensionality gives rise to several phenomena, such as:
–  Concentration of measure
–  Emptiness (desertification) of the space
–  Depopulation of the center of hyper-volumes
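One common way to illustrate the depopulation of the center of hyper-volumes is to compare the volume of a hypercube with that of the ball inscribed in it. The sketch below uses the standard formula for the volume of a D-dimensional unit ball, π^(D/2) / Γ(D/2 + 1); treating this ratio as the illustration intended by the slide is an assumption, and the function name is hypothetical.

```python
import math

def inscribed_ball_fraction(n_dims):
    """Fraction of the hypercube [-1, 1]^D occupied by the inscribed unit ball."""
    ball_volume = math.pi ** (n_dims / 2) / math.gamma(n_dims / 2 + 1)
    cube_volume = 2.0 ** n_dims
    return ball_volume / cube_volume

for n_dims in (1, 2, 3, 5, 10, 20):
    print(n_dims, inscribed_ball_fraction(n_dims))
```

The fraction is π/4 in 2D and falls below 1% by D = 10: for uniformly distributed data, almost all of the volume, and therefore almost all of the points, lies in the corners of the cube rather than around its center.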


Consequences

The curse of dimensionality has many consequences:
1.  Exponential growth in the number of examples required to maintain a given sampling density (for a density of N examples/bin and D dimensions, the total number of examples is on the order of $N^D$)


Consequences

2.  Exponential growth in the complexity of the target function (a density estimate). To be learned well, a more complex target function requires denser sample points.


Consequences

3.  For one dimension, a wide variety of density functions can be found in the literature, but for high dimensions only the multivariate Gaussian density is available.

In addition, for large values of D, the Gaussian density can only be handled in a simplified form.


Curse of dimensionality

•  These findings suggest that handling high-dimensional data requires special treatment, different from that used for low-dimensional data
•  The same problems arise with other data distributions


Clustering High-Dimensional Data


Clustering high-dimensional data

Methods:

–  Subspace clustering: we look for clusters that exist in subspaces of the given high-dimensional data
    •  CLIQUE, ProClus and co-clustering approaches
–  Dimension reduction techniques: construct a much lower-dimensional space and search for clusters there (new dimensions may be constructed by combining some dimensions of the original data).


Subspace-clustering


Methods of Subspace Clustering

•  Subspace search methods: search various subspaces to find clusters

–  “Bottom-up” approaches

–  “Top-down” approaches

•  Clustering methods based on correlation

–  For instance : PCA based approaches

•  Co-clustering methods

–  Optimization based methods (Cheng and Church, ISMB’2000)

–  Enumeration methods (Pei et al., ICDM’2003)


Subspace search methods

•  “Bottom-up” approaches
    –  Start from low-dimensional subspaces and search higher-dimensional subspaces only when there may be clusters in them
    –  Various pruning techniques reduce the number of higher-dimensional subspaces to be searched
    –  Ex. CLIQUE (Agrawal et al. 1998)
•  “Top-down” approaches
    –  Start from the full space and search smaller and smaller subspaces recursively
    –  The subspace of a cluster can be determined from the local neighborhood
    –  Ex. PROCLUS (Aggarwal et al. 1999): a k-medoid-like method
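The bottom-up idea can be sketched as a grid search that examines 1D subspaces first and then only those 2D subspaces that could still contain dense regions. This is a greatly simplified illustration in the spirit of CLIQUE, not the published algorithm: it prunes whole subspaces rather than individual grid units, stops at 2D, and all names, thresholds and data points are assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, dims, n_bins, threshold):
    """Grid cells of the subspace `dims` that contain at least `threshold` points.

    Coordinates are assumed to lie in [0, 1).
    """
    counts = Counter(tuple(int(p[d] * n_bins) for d in dims) for p in points)
    return {cell for cell, count in counts.items() if count >= threshold}

def bottom_up_subspaces(points, n_dims, n_bins, threshold):
    """Search 1D subspaces first, then only 2D subspaces built from dense axes."""
    dense_1d = {}
    for d in range(n_dims):
        units = dense_units(points, (d,), n_bins, threshold)
        if units:
            dense_1d[d] = units
    dense_2d = {}
    # Pruning: a 2D cell can be dense only if its projection on each axis is dense,
    # so axes without any dense 1D unit are never combined.
    for d1, d2 in combinations(sorted(dense_1d), 2):
        units = dense_units(points, (d1, d2), n_bins, threshold)
        if units:
            dense_2d[(d1, d2)] = units
    return dense_1d, dense_2d

# Hypothetical 3D points: clustered in dimensions 0 and 1, spread out in dimension 2.
points = [
    (0.10, 0.10, 0.05), (0.12, 0.15, 0.30), (0.15, 0.12, 0.55),
    (0.20, 0.20, 0.80), (0.60, 0.90, 0.10), (0.90, 0.40, 0.60),
]
dense_1d, dense_2d = bottom_up_subspaces(points, n_dims=3, n_bins=4, threshold=3)
print(dense_1d)
print(dense_2d)
```

Dimension 2 has no dense unit, so the subspaces (0, 2) and (1, 2) are never searched; only (0, 1) is examined at the second level.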


Methods based on correlation

•  Clustering methods based on correlation rely on advanced correlation models
•  Ex: PCA-based approaches
    –  Apply PCA to generate a set of new, uncorrelated dimensions
    –  Then mine clusters in the new space or in its subspaces
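To make the word "uncorrelated" concrete, the sketch below applies PCA to a small 2D data set using the closed-form rotation angle for a symmetric 2×2 covariance matrix, and then checks that the covariance between the two new coordinates vanishes. It is restricted to two dimensions so that it needs no linear-algebra library; the function names and the data are assumptions made for this example.

```python
import math

def covariance_terms(centered):
    """Return (var_x, var_y, cov_xy) for points that are already centered."""
    n = len(centered)
    var_x = sum(x * x for x, _ in centered) / (n - 1)
    var_y = sum(y * y for _, y in centered) / (n - 1)
    cov_xy = sum(x * y for x, y in centered) / (n - 1)
    return var_x, var_y, cov_xy

def pca_2d(points):
    """Center 2D points and rotate them onto their principal axes."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    var_x, var_y, cov_xy = covariance_terms(centered)
    # Angle of the first principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]

# Hypothetical, strongly correlated 2D data (illustrative values only).
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
scores = pca_2d(data)
var_1, var_2, cov_12 = covariance_terms(scores)
print(var_1, var_2, cov_12)
```

The first new coordinate carries most of the variance and the covariance between the two is zero up to rounding, so a clustering algorithm could be run on the first coordinate alone.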


Methods of Co-clustering

•  Co-clustering: cluster both objects and attributes simultaneously (objects and variables are treated symmetrically)
•  Optimization-based methods
    –  Try to find the submatrix that achieves the best significance as a co-cluster
    –  Because of the computational cost, greedy search is employed to find locally optimal co-clusters
•  Enumeration methods
    –  Use a tolerance threshold to specify the degree of noise allowed in the co-clusters
    –  Then try to enumerate all submatrices that satisfy the requirements as co-clusters
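As an example of what "significance as a co-cluster" can mean, the sketch below computes the mean squared residue of a submatrix, the score used in Cheng and Church's biclustering work: the lower the score, the more coherent the submatrix. Assuming that this is the score the slides have in mind is a guess; the function name and the data matrix are invented for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix defined by `rows` and `cols`.

    The residue of a cell is a_ij - (row mean) - (column mean) + (overall mean).
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)

# Hypothetical objects x attributes matrix: the first three columns follow a
# perfectly additive pattern, the fourth column is noise.
matrix = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [5, 6, 7, 4],
]
print(mean_squared_residue(matrix, [0, 1, 2], [0, 1, 2]))
print(mean_squared_residue(matrix, [0, 1, 2], [0, 1, 2, 3]))
```

A greedy optimization-based method would repeatedly add or remove rows and columns to lower this score, which is why it only reaches a local optimum.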


Techniques of dimension reduction


Dimension Reduction

•  Data in a high-dimensional space are not uniformly distributed

•  Dimension reduction is a technique widely used to deal with the "curse of dimensionality"


Dimension Reduction

There is a variety of dimension reduction techniques:
•  Linear vs. non-linear
•  Deterministic vs. probabilistic
•  Supervised vs. unsupervised


Dimension Reduction

•  Dimension reduction: methodologies
    –  Feature selection: choosing a subset of all the features
    –  Feature extraction: creating a set of new features by combining the existing ones


Dimension reduction via feature selection


Feature selection

•  Definition: variable selection is a process for choosing an optimal subset of relevant variables from a set of variables, according to a performance criterion. We can ask three basic questions:

Q1: How do we measure the relevance of the variables?
Q2: How do we obtain the optimal subset?
Q3: Which optimality criterion should we use?


Feature selection

•  The answer to Q1 is to find a measure of relevance, or evaluation criterion J(X), that quantifies the importance of one variable or of a combination of variables.
•  Q2 refers to the choice of the search procedure used to build the optimal subset of relevant variables.
•  Q3 requires the definition of a stopping criterion for the search.


Feature selection

Evaluation criteria
•  For a classification problem, we test, for example, the discriminant quality of the system in the presence or absence of a variable.
•  For a regression problem, we test instead the quality of prediction with respect to the other variables.


Feature selection

An alternative is to use a search method of the Branch & Bound type. This search restricts the search space and yields the optimal subset of variables, under the hypothesis of monotonicity of the selection criterion J(X). The criterion is called monotonic if:

$$X_1 \subset X_2 \subset \cdots \subset X_m \;\Rightarrow\; J(X_1) \le J(X_2) \le \cdots \le J(X_m)$$

where $X_k$ is the subset containing the k selected variables.


Feature selection

Problem: most evaluation criteria are not monotonic.

Use of sub-optimal methods:
•  Sequential Forward Selection (SFS)
•  Sequential Backward Selection (SBS)
•  Bidirectional Selection (BS)


Feature selection
•  Sequential Forward Selection (SFS)
Let X be the set of variables. Initially the set of selected variables is empty. At each step k:
-  We select the variable $x_i$ that maximizes the evaluation criterion $J(X_k)$:

$$J(X_k) = \max_{x_i \in X \setminus X_{k-1}} J\left(X_{k-1} \cup \{x_i\}\right)$$

ü  The result is a list of variables ordered according to their importance.
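The forward selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular library: the function name `sequential_forward_selection`, the toy scores and the redundancy penalty in `toy_criterion` are all assumptions made for the example, and the search is run until every variable has been ranked.

```python
from typing import Callable, FrozenSet, List, Sequence


def sequential_forward_selection(
    variables: Sequence[str],
    criterion: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Return the variables in the order SFS selects them (most useful first)."""
    selected: List[str] = []
    remaining = list(variables)
    while remaining:
        # Pick the candidate x_i that maximizes J(X_{k-1} U {x_i}).
        best = max(remaining, key=lambda v: criterion(frozenset(selected + [v])))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy criterion: each variable has its own score,
# and "b" becomes largely redundant once "a" is present.
SCORES = {"a": 3.0, "b": 2.5, "c": 1.0}


def toy_criterion(subset: FrozenSet[str]) -> float:
    total = sum(SCORES[v] for v in subset)
    if "a" in subset and "b" in subset:
        total -= 2.0  # redundancy penalty (assumption for the example)
    return total


ranking = sequential_forward_selection(list(SCORES), toy_criterion)
print(ranking)
```

In this toy case "b" looks better than "c" in isolation but is selected last, which shows why the criterion is evaluated on subsets rather than on single variables.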

63

Page 64: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Feature selection
•  Sequential Backward Selection (SBS)
We start from the full set of variables X and proceed by elimination. At each step:
-  The variable $x_i$ that is least important according to the evaluation criterion $J(X_k)$ is removed:

$$J(X_k) = \max_{x_i \in X_{k+1}} J\left(X_{k+1} \setminus \{x_i\}\right)$$

ü  The result is a list of variables ordered according to their importance: the most relevant variables are found in the last positions of the list.
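Backward elimination can be sketched in the same spirit. The function name `sequential_backward_selection` and the toy criterion are hypothetical and only serve the illustration; the elimination is run until a single variable is left.

```python
def sequential_backward_selection(variables, criterion):
    """Return the variables in the order SBS removes them (least useful first)."""
    current = list(variables)
    removed = []
    while len(current) > 1:
        # Remove the x_i whose removal leaves the highest J(X_{k+1} - {x_i}).
        worst = max(current, key=lambda v: criterion(frozenset(current) - {v}))
        current.remove(worst)
        removed.append(worst)
    removed.extend(current)  # the last survivor is the most relevant variable
    return removed


# Hypothetical toy criterion: individual scores plus a redundancy penalty.
SCORES = {"a": 3.0, "b": 2.5, "c": 1.0}


def toy_criterion(subset):
    total = sum(SCORES[v] for v in subset)
    if "a" in subset and "b" in subset:
        total -= 2.0  # "b" adds little once "a" is present (assumption)
    return total


elimination_order = sequential_backward_selection(list(SCORES), toy_criterion)
print(elimination_order)
```

The most relevant variable ends up in the last position of the returned list, as stated on the slide.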

64

Page 65: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Feature selection
The BS procedure performs the search in both directions (forward and backward) in a competitive manner. The procedure stops in two cases:
(1) when one of the two directions has found the best subset of variables before reaching the middle of the search space;
(2) when the two branches meet in the middle.

The sets of selected variables found by SFS and SBS are generally not equal, because their selection principles differ. This method reduces the search time, since the search is performed in both directions and stops as soon as either direction reaches a solution.

65

Page 66: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Feature selection
•  Stopping criterion
ü  The optimal number of variables is unknown a priori. A rule that controls the selection/elimination of variables makes it possible to stop the search when no remaining variable is sufficiently informative.
ü  The stopping criterion is often defined as a combination of the search procedure and the evaluation criterion.
ü  A common heuristic is to estimate, by cross-validation, the generalization error for different subsets of selected variables.
ü  The subset of variables selected is the one that minimizes this generalization error.
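As a rough illustration of this heuristic, the sketch below estimates a leave-one-out error for two candidate subsets and keeps the one with the smallest error. The 1-nearest-neighbour classifier, the function name `loo_1nn_error` and the toy data are assumptions chosen to keep the example short; the slides do not prescribe a particular classifier or cross-validation scheme.

```python
def loo_1nn_error(points, labels, features):
    """Leave-one-out error rate of a 1-nearest-neighbour classifier
    restricted to the given feature indices."""
    errors = 0
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue
            d = sum((p[f] - q[f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        errors += labels[best_j] != labels[i]
    return errors / len(points)


# Toy data (hypothetical): feature 0 separates the classes,
# feature 1 is large-scale noise unrelated to the class.
points = [(0.0, 5.0), (0.2, -4.0), (0.1, 9.0), (1.0, 4.8), (1.2, -4.1), (1.1, 9.2)]
labels = [0, 0, 0, 1, 1, 1]

# Candidate subsets, for example the nested subsets visited by a forward search.
candidates = [(0,), (0, 1)]
errors = {subset: loo_1nn_error(points, labels, subset) for subset in candidates}
best_subset = min(candidates, key=lambda s: errors[s])
print(errors, best_subset)
```

Here adding the noisy feature raises the estimated error, so the search would stop at the one-variable subset.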

66

Page 67: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Dimension reduction via

Feature extraction

67

Page 68: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Dimension reduction via feature extraction

Two main types of methods:
•  Linear methods
–  Principal Components Analysis (PCA)
–  Linear Discriminant Analysis (LDA)
–  Multi-Dimensional Scaling (MDS)
–  …

•  Non-linear methods
–  Isometric feature mapping (Isomap)
–  Locally Linear Embedding (LLE)
–  Kernel PCA
–  Spectral clustering
–  Supervised methods (S-Isomap)
–  …

68

Page 69: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Visual data mining

69

Page 70: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Visual data mining

Visualization of information

Data mining Visual Data mining

70

Page 71: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Visual data mining: Diagram

[Figure: three schemes for combining visualization with a data mining (DM) algorithm: Preceding Visualization (PV), Subsequent Visualization (SV) and Tightly Integrated Visualization (TIV). Box labels: Data, Algorithm of DM (steps 1 to n in TIV), Result, Knowledge, Visualization of the data, Visualization of the result, Visualization + Interaction.]

71

Page 72: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Visualization software

•  Graphviz •  Tulip •  Knime •  Matlab •  R •  …
www.KDnuggets.com/software/visualization.html

72

Page 73: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Multi-Dimensional Scaling (MDS)

73

Page 74: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Outline

•  Introduction and definitions •  Problem Formulation •  Algorithm •  Example •  Conclusions

74

Page 76: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Introduction

•  Multi-dimensional scaling (MDS) (see Borg and Groenen, 1997)
-  A collection of dimension reduction techniques that map the distances between observations in a high-dimensional space into a lower-dimensional space
-  Find a configuration of points in a low-dimensional space whose inter-point distances correspond to the dissimilarities in the higher-dimensional space

76

Page 77: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Introduction

77

Source space Target space

-  Able to model intrinsic complex manifold structures and visualize them

Page 78: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Multi-Dimensional Scaling (MDS)

In many applications:
-  We know the distances between the points of a data set
-  We seek a representation of these points in a low-dimensional space

The method of multidimensional scaling (MDS) allows us to build this representation.

78

Page 79: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Multi-Dimensional Scaling (MDS)

-  Example:
Ø Recover the map of a country from the distances between each pair of cities.
-  Like PCA, the MDS algorithm is based on the computation of eigenvalues.
-  MDS builds a configuration of N points in $\mathbb{R}^d$ from the distances between N objects.

79

Page 80: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Multi-Dimensional Scaling (MDS)

•  So we have N(N-1)/2 distances. When these distances are Euclidean, it is always possible to place N points in at most N-1 dimensions so that the given distances are met exactly.

•  MDS computes an approximation in d < N dimensions.

80

Page 81: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Problem Formulation

•  Given
–  The points $x_1, \dots, x_n$ in k dimensions
–  We denote by $\delta_{ij}$ the distance between points $x_i$ and $x_j$

•  Find
–  The points $y_1, \dots, y_n$ in 2 (or 3) dimensions such that the distance $d_{ij}$ between $y_i$ and $y_j$ is close to $\delta_{ij}$

81

Source space Target space

Page 82: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Cost function

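•  We search for the points $y_i$ that minimize an objective function

•  We can define the cost function in a general manner:

$$\text{Cost\_function} = \sum_{i<j} \left(d_{ij} - \delta_{ij}\right)^2, \qquad \delta_{ij} = \lVert x_i - x_j \rVert_2, \qquad d_{ij} = \lVert y_i - y_j \rVert_2$$

82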

Page 83: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

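Examples of cost functions
•  Possible cost functions (stress)
–  $d_{ij}$ is a function of $y_i$ and $y_j$; given the data, the $\delta_{ij}$ are constant.

$$J_{aa} = \frac{\sum_{i<j} (d_{ij} - \delta_{ij})^2}{\sum_{i<j} \delta_{ij}^2} \qquad \text{(penalizes large absolute errors)}$$

$$J_{rr} = \sum_{i<j} \left(\frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\right)^2 \qquad \text{(penalizes large relative errors)}$$

$$J_{ar} = \frac{1}{\sum_{i<j} \delta_{ij}} \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}} \qquad \text{(a compromise between the two; Sammon criterion)}$$

Slide annotation: "Disparity"

83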
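To make the difference between the three stress functions concrete, the sketch below evaluates them on a tiny hand-made example. The function name `stress_values` and the dictionaries `delta` and `d` (pairs of indices mapped to distances) are assumptions for the illustration; only the formulas come from the slide.

```python
def stress_values(delta, d):
    """Return (J_aa, J_rr, J_ar) for source distances delta and target distances d.

    Both arguments map an index pair (i, j) with i < j to a positive distance.
    """
    pairs = list(delta)
    j_aa = sum((d[p] - delta[p]) ** 2 for p in pairs) / sum(delta[p] ** 2 for p in pairs)
    j_rr = sum(((d[p] - delta[p]) / delta[p]) ** 2 for p in pairs)
    j_ar = sum((d[p] - delta[p]) ** 2 / delta[p] for p in pairs) / sum(delta[p] for p in pairs)
    return j_aa, j_rr, j_ar


# Three points: only the shortest source distance (0, 1) is badly reproduced.
delta = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 2.0}
d = {(0, 1): 2.0, (0, 2): 2.0, (1, 2): 2.0}

j_aa, j_rr, j_ar = stress_values(delta, d)
print(j_aa, j_rr, j_ar)
```

The same absolute error of 1 on a short distance weighs little in J_aa (1/9) but fully in J_rr (1.0), with J_ar in between (0.2).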

Page 84: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

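Update rules

84

•  Update rules: gradients of the three cost functions with respect to a point $y_k$

$$\nabla_{y_k} J_{aa} = \frac{2}{\sum_{i<j} \delta_{ij}^2} \sum_{j \ne k} (d_{kj} - \delta_{kj}) \, \frac{y_k - y_j}{d_{kj}}$$

$$\nabla_{y_k} J_{rr} = 2 \sum_{j \ne k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}^2} \, \frac{y_k - y_j}{d_{kj}}$$

$$\nabla_{y_k} J_{ar} = \frac{2}{\sum_{i<j} \delta_{ij}} \sum_{j \ne k} \frac{d_{kj} - \delta_{kj}}{\delta_{kj}} \, \frac{y_k - y_j}{d_{kj}}$$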

Page 85: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Algorithm

•  Compute or obtain the distances $\delta_{ij}$
•  Initialize the points $y_1, \dots, y_n$ (e.g. randomly)
•  Until convergence, for all i:

$$y_i \leftarrow y_i - \eta \, \nabla J(y_i), \qquad 0 < \eta < 1$$

85
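The iterative algorithm can be sketched as plain gradient descent on the stress J_aa. This is a minimal illustration under stated assumptions: the helper names `stress_aa` and `gradient_step`, the toy square data set, the step size of 0.1 and the fixed budget of 500 iterations (instead of a convergence test) are all choices made for the example.

```python
import math
import random


def stress_aa(delta, y):
    """J_aa: squared errors normalized by the sum of squared source distances."""
    n = len(y)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    num = sum((math.dist(y[i], y[j]) - delta[i][j]) ** 2 for i, j in pairs)
    return num / sum(delta[i][j] ** 2 for i, j in pairs)


def gradient_step(delta, y, eta):
    """One simultaneous update y_k <- y_k - eta * grad_{y_k} J_aa for all k."""
    n, dim = len(y), len(y[0])
    norm = sum(delta[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
    new_y = []
    for k in range(n):
        grad = [0.0] * dim
        for j in range(n):
            if j == k:
                continue
            d_kj = max(math.dist(y[k], y[j]), 1e-12)  # guard against division by zero
            coef = 2.0 * (d_kj - delta[k][j]) / (d_kj * norm)
            for a in range(dim):
                grad[a] += coef * (y[k][a] - y[j][a])
        new_y.append([y[k][a] - eta * grad[a] for a in range(dim)])
    return new_y


# Source points: corners of a unit square lying in a 3-D space (toy data).
x = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
delta = [[math.dist(p, q) for q in x] for p in x]

rng = random.Random(0)
y = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in x]  # random initialization
initial_stress = stress_aa(delta, y)
for _ in range(500):
    y = gradient_step(delta, y, eta=0.1)
final_stress = stress_aa(delta, y)
print(initial_stress, final_stress)
```

Gradient descent only finds a local minimum of the stress, so in practice the result can depend on the initialization.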

Page 86: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Example •  Artificial data set: we pass from a 3-dimensional space to a 2-dimensional space

86

Page 87: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Example
•  The data set "Eurodist" gives the distances (in km) between 21 cities of Europe.
•  The source data set must be presented as a square matrix of dissimilarities between objects (here, cities).

87

Page 88: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Example

88

Page 89: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Conclusions
•  MDS algorithms differ in:
–  The distance used in the source space
–  The stress (objective) function; different stress functions lead to different results
–  The optimization procedure: linear MDS has an analytical solution but cannot model complex (nonlinear) low-dimensional manifolds well, while nonlinear MDS usually requires an iterative algorithm

89

Page 90: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Isometric feature mapping (Isomap)

90

Page 91: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Outline

•  Introduction and definitions •  Algorithm •  Example •  Conclusions

91

Page 93: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Introduction

•  Techniques such as LDA, PCA and their variants perform a global transformation of the data (rotation/translation/rescaling)
–  These techniques assume that most of the information in the data is contained in a linear subspace.
–  Which approach should we use when the data is actually embedded in a non-linear subspace (a low-dimensional manifold)?

93

Page 94: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Introduction

•  PCA cannot discover the structure of a data set arranged in a spiral

94

Page 95: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Euclidean Distance vs. Geodesic Distance

95

[Figure: Euclidean distance compared with geodesic distance.]

Page 96: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Introduction

•  The goal of Isomap is to find a non-linear manifold containing the data

•  We use the fact that for close points, the Euclidean distance is a good approximation to the geodesic distance on the manifold

•  We build a graph connecting each point to its k nearest neighbors

96

Page 97: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Introduction

•  The lengths of the geodesics are then estimated by searching the length of the shortest path between two points in the graph

•  We then apply MDS to the distances obtained, to determine the positions of the points in a space of reduced dimension

97

Page 98: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

ISOMAP

•  ISOMAP [Tenenbaum et al. 2000]
–  For neighboring samples, the Euclidean distance provides a good approximation to the geodesic distance
–  For distant points, the geodesic distance can be approximated by a sequence of steps between clusters of neighboring points

98

Page 99: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

ISOMAP
ISOMAP operates in three steps:
1.  Build the neighborhood graph G
2.  For each pair of points of G, compute the shortest path (the geodesic distance)
3.  Apply classical MDS to the geodesic distances
Euclidean Distance -> Geodesic Distance

99

Page 100: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

ISOMAP Algorithm

•  Step 1
–  Build the neighborhood graph based on the distances $d_X(i,j)$ in the input space X.
–  This can be performed in two different ways:
   •  Connect each point to all points within a fixed radius ε
   •  Connect each point to all of its k nearest neighbors
–  A weighted neighborhood graph G is obtained, where $d_X(i,j)$ is the weight of each edge between neighboring points
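Step 1 with the k-nearest-neighbor rule can be sketched as follows. The function name `knn_graph`, the dense matrix representation with infinity for missing edges, and the choice to make the graph symmetric are assumptions made for this illustration.

```python
import math


def knn_graph(points, k):
    """Weighted neighborhood graph as a dense matrix.

    graph[i][j] is the Euclidean distance d_X(i, j) if j is among the k nearest
    neighbors of i (or i among those of j), and infinity otherwise.
    """
    n = len(points)
    inf = float("inf")
    graph = [[inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in neighbours:
            weight = math.dist(points[i], points[j])
            graph[i][j] = graph[j][i] = weight  # keep the graph symmetric
    return graph


# Four points on a line (toy data): with k = 1 only consecutive points are linked.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
graph = knn_graph(points, k=1)
print(graph)
```

The brute-force neighbor search compares every pair of points, which is quadratic in the number of points.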

100

Page 101: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

ISOMAP Algorithm

•  Step 2
–  Estimate the geodesic distances $d_M(i,j)$ between all pairs of points on the manifold M by computing their shortest-path distances $d_G(i,j)$ in the graph G.
–  This can be done using Dijkstra's algorithm or the Floyd algorithm.
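A minimal sketch of Step 2 using the Floyd (Floyd-Warshall) algorithm is given below. The function name `floyd_warshall` and the small chain graph are assumptions for the example; the input is a dense weight matrix in which infinity marks the absence of an edge.

```python
def floyd_warshall(graph):
    """All-pairs shortest-path distances d_G(i, j) from a dense weight matrix."""
    n = len(graph)
    dist = [row[:] for row in graph]  # do not modify the input graph
    for m in range(n):  # allow paths that pass through intermediate node m
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist


INF = float("inf")
# Chain 0 - 1 - 2 - 3 with unit edge weights (toy neighborhood graph).
graph = [
    [0.0, 1.0, INF, INF],
    [1.0, 0.0, 1.0, INF],
    [INF, 1.0, 0.0, 1.0],
    [INF, INF, 1.0, 0.0],
]
geodesic = floyd_warshall(graph)
print(geodesic[0][3])
```

The triple loop costs O(n^3), so for large sparse neighborhood graphs running Dijkstra's algorithm from each node is usually preferred.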

101

Page 102: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

ISOMAP Algorithm •  Step 3

Ø  Apply classical MDS to the matrix of graph distances D.

102

Page 103: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Complexity of ISOMAP
•  For large data sets, ISOMAP can be quite slow:
-  Step 1: complexity of the k-nearest-neighbor search: $O(n^2 D)$
-  Step 2: complexity of Dijkstra's algorithm: $O(n^2 \log n + n^2 k)$
-  Step 3: complexity of MDS: $O(n^2 d)$

103

Page 104: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

The «Swiss roll» dataset

104

•  The «Swiss roll» data set contains 20000 points.
•  This figure shows a sample of 1000 points.
•  We use this example to illustrate the successive steps of the Isomap algorithm.

Page 105: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Construction of the neighborhood graph G

K-nearest neighbors (K = 7). $D_G$ is a 1000 x 1000 matrix of Euclidean distances between neighboring points (Figure A)

105

[Tenenbaum et al. 2000]

Page 106: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Calculation of shortest paths in G

106

$D_G$ is a matrix of geodesic distances between arbitrary pairs of points along the manifold M (Figure B)

[Tenenbaum et al. 2000]

Page 107: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Using MDS to represent the graph in Rd

107

Find a Euclidean d-dimensional space Y that preserves the pairwise distances (Figure C)

[Tenenbaum et al. 2000]

Page 108: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Example on images

108

For each image we have 64x64 = 4096 pixels

Page 109: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Results

109

Page 110: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Results

110

Page 111: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Results

111

Page 112: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Conclusions
•  Advantages
–  Non-linear
–  Non-iterative
–  Preserves the global properties of the data

•  Limitations
–  Sensitive to noise
–  Parameters to set: k or ε
–  Relatively slow for large data sets
–  k (or ε) must be small enough, relative to the sampling density, to avoid "linear shortcuts" near regions of high curvature of the surface

112

Page 113: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Locally Linear Embedding (LLE)

113

Page 115: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Locally Linear Embedding (LLE)
•  LLE ("Locally Linear Embedding") addresses the same problem as Isomap in a different way.
•  LLE preserves the local properties of the data by representing each point as a linear combination of its nearest neighbors.
•  LLE builds a projection to a low-dimensional linear space that preserves the neighborhoods.

115

Page 116: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

LLE Algorithm

116

[Roweis and Saul 2000]

Page 117: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

LLE Algorithm

117

LLE operates in 3 steps:
•  Compute the k nearest neighbors
•  Compute the weights needed to reconstruct each point as a linear combination of its neighbors
•  Project the results using the newly found coordinates

Page 118: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

LLE Algorithm
•  The local geometry is modeled by linear weights that reconstruct each data point as a linear combination of its neighbors
•  Reconstruction errors are measured with this cost function:

$$\varepsilon(W) = \sum_{i=1}^{N} \Big\lVert X_i - \sum_{j} W_{ij} X_j \Big\rVert^2$$

-  where the weights $W_{ij}$ measure the contribution of the j-th example to the reconstruction of the i-th example
•  The cost is minimized with respect to the weights $W_{ij}$ subject to two constraints:
1)  Each data point is reconstructed only from its neighbors
2)  The rows of the weight matrix sum to one:

$$\sum_{j} W_{ij} = 1$$

118

Page 119: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

LLE Algorithm

•  We seek d-dimensional coordinates $Y_i$ that minimize the following cost function:

$$\Phi(Y) = \sum_{i=1}^{N} \Big\lVert Y_i - \sum_{j} W_{ij} Y_j \Big\rVert^2$$

119

Page 120: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Parameter Estimation
•  Consider a particular sample x with k nearest neighbors $\eta_j$ and reconstruction weights $w_j$ (that sum to one, $\sum_j w_j = 1$). These weights can be found in three steps:
–  Step 1: Compute the neighborhood correlation matrix $C_{jk}$ and its inverse $C^{-1}$:

$$C_{jk} = \eta_j^T \eta_k$$

–  Step 2: Compute the Lagrange multiplier λ that enforces the constraint:

$$\lambda = \frac{1 - \sum_{jk} C^{-1}_{jk} \, (x^T \eta_k)}{\sum_{jk} C^{-1}_{jk}}$$

–  Step 3: Compute the reconstruction weights as:

$$w_j = \sum_{k} C^{-1}_{jk} \, (x^T \eta_k + \lambda)$$

120
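The three steps can be sketched in pure Python for a single sample. The helper names `solve`, `dot` and `lle_weights` are hypothetical, and the sketch assumes that the matrix C is invertible (the neighbors are linearly independent, which requires k to be at most the input dimension); handling a singular C, for example by regularization, is outside this example. Instead of forming the inverse explicitly, it solves two linear systems, which gives the same weights.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lle_weights(x, neighbours):
    """Reconstruction weights of x from its neighbours, constrained to sum to one."""
    k = len(neighbours)
    c = [[dot(nj, nk) for nk in neighbours] for nj in neighbours]  # C_jk = eta_j^T eta_k
    b = [dot(x, nk) for nk in neighbours]                          # x^T eta_k
    c_inv_b = solve(c, b)             # sum_k C^-1_jk (x^T eta_k)
    c_inv_1 = solve(c, [1.0] * k)     # sum_k C^-1_jk
    lam = (1.0 - sum(c_inv_b)) / sum(c_inv_1)                      # Lagrange multiplier
    return [u + lam * v for u, v in zip(c_inv_b, c_inv_1)]


x = (1.0, 1.0)
neighbours = [(2.0, 0.0), (0.0, 1.0)]
w = lle_weights(x, neighbours)
print(w)
```

In this toy case the weighted combination of the neighbors is the point of the line through them that is closest to x.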

Page 121: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

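Parameter Estimation
•  The embedding vectors $Y_i$ are found by minimizing the cost function:

$$\Phi(Y) = \sum_{i=1}^{N} \Big\lVert Y_i - \sum_{j} W_{ij} Y_j \Big\rVert^2$$

•  To make the optimization problem well-posed, we introduce 2 constraints:

$$\sum_{j} Y_j = 0, \qquad \frac{1}{N} \sum_{i} Y_i Y_i^T = I$$

•  This allows the cost function to be expressed as:

$$\Phi(Y) = \sum_{ij} M_{ij} \, (Y_i^T Y_j), \qquad \text{where} \quad M_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_{k} W_{ki} W_{kj}$$

•  $\delta_{ij}$ is equal to 1 if i = j and equal to 0 otherwise.
•  The optimal embedding is found by computing the bottom d+1 eigenvectors of the matrix M; the bottom eigenvector, which is constant and has eigenvalue 0, is discarded.

121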
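The matrix M can be built directly from the weights, as in the sketch below. The function name `embedding_matrix` and the 4-point chain of weights are assumptions for the illustration. The example checks numerically that every row of M sums to zero, which means the constant vector is an eigenvector of M with eigenvalue 0; the eigenvector computation itself is not shown.

```python
def embedding_matrix(w):
    """M_ij = delta_ij - W_ij - W_ji + sum_k W_ki W_kj, i.e. (I - W)^T (I - W)."""
    n = len(w)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            delta_ij = 1.0 if i == j else 0.0
            m[i][j] = delta_ij - w[i][j] - w[j][i] + sum(w[k][i] * w[k][j] for k in range(n))
    return m


# Toy weights for 4 points on a chain: each row sums to one and only
# involves the neighbors of the corresponding point.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
M = embedding_matrix(W)
row_sums = [sum(row) for row in M]
print(row_sums)
```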

Page 122: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Examples on LLE

122

Page 123: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Results

123

•  Initial points represent images of faces. •  In the 2-dimensional space, these images are grouped according to the position, lighting and expression.

•  Images placed in the bottom of the figure correspond to successive points encountered on the line at the top right, sweeping a continuum of facial expression.

[Roweis and Saul 2000]

Page 124: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Results

124 [Roweis and Saul 2000]

•  PCA vs LLE

Unlike PCA, LLE preserves local topology!

Page 125: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Results

125

•  PCA vs LLE

[Roweis and Saul 2000]

Page 126: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Conclusions
•  LLE is a nonlinear technique that preserves the local properties of the data by representing each point as a linear combination of its nearest neighbors in a new space of reduced dimension.

•  LLE operates in 3 steps:
–  Compute the k nearest neighbors
–  Compute the weights needed to reconstruct each point as a linear combination of its neighbors
–  Project the results using the newly found coordinates

126

Page 127: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Conclusions

•  Sensitive to noise •  Parameters to set: k •  Relatively slow for large data sets

127

Page 128: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

References

•  J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.

•  S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.

128

Page 129: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Self Organizing Map (SOM)

129

Page 130: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Outline

•  Introduction and definitions of neural networks

•  Introduction SOM •  Algorithm •  Example •  Conclusions

130

Page 131: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Neural networks
•  A set of interconnected neurons / information processing units
•  A technique designed to model how the brain performs a particular task
•  Used to extract patterns from data sets that are very large and contain hidden relations
•  Able to handle noisy data

131

Page 132: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Neural network learning

•  Learning allows the network to extract information
•  This information is stored on the links between the neurons (also called weights)
•  Neural networks can perform two types of learning:
-  Supervised
-  Unsupervised

132

Weights

Neural network

Input Output

Page 133: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Self-organisation
•  Brain cells organize themselves into groups according to incoming information.
•  This incoming information is not only received by a single neural cell, but also influences other cells in its neighbourhood.
•  The SOM mechanism is also based on this principle

133

Page 134: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

The principle of SOMs
•  SOM produces a similarity graph of the input data
•  It converts non-linear relationships between high-dimensional data into simple geometric relationships

134

[Figure: illustration of a SOM map of size 7x7, connecting inputs to outputs. Legend: input datum; weight; updated weight.]

Page 135: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

SOM

•  Introduced in 1982 by Teuvo Kohonen
•  Vector quantization + vector projection
•  Technique based on unsupervised learning
•  Used in classification and visualization of large data sets
•  Used in many areas

135

Page 136: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

Fields of application
•  Applications
–  Exploratory data analysis, clustering
–  Quantization, variable selection, outlier detection
–  Diagnosis, prediction, missing data

•  Domains
–  Socio-economic
–  Text mining
–  Telecommunications
–  Industrial processes

136

Page 137: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

SOM Architecture
•  Set of neurons / cluster units
•  Each neuron is assigned a prototype vector taken from the input data set
•  The neurons of the map can be arranged on either a rectangular or a hexagonal lattice

137

[Figure: hexagonal neighborhood and rectangular neighborhood.]

Page 138: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

SOM

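•  Cost function:

$$C(w) = \sum_{i} \sum_{k} h_{k, s(x_i)} \, \lVert x_i - w_k \rVert^2$$

where $s(x_i)$ is the index of the winning neuron for the datum $x_i$ and $h$ is the neighborhood function on the map.

138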

Page 139: Dimensionality reduction Visualization · 2019-01-28 · Dimensionality reduction! &! Visualization" 1" Outline" • Definition and objectives • Methods of data reduction ... 26/07/2010

SOMs Algorithm

1.  Initialisation step
–  Define the map topology
–  Randomly initialize the prototype of each neuron

2.  Competition step
–  Pick a datum x_i at random
–  Determine the winning neuron according to the rule:

$$s(x_i) = \arg\min_{1 \le k \le m} \lVert x_i - w_k \rVert^2$$

3.  Adaptation step
–  Adapt the prototypes according to the rule:

$$w_k(t+1) = w_k(t) + \varepsilon(t)\, h_{k,\,s(x_i)} \,\bigl(x_i - w_k(t)\bigr)$$

4.  Repeat steps 2 and 3 until the prototype updates become insignificant
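The four steps above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the lecture's reference implementation: the slides do not specify the neighborhood function h or the schedule ε(t), so a Gaussian neighborhood on a rectangular lattice and linearly decaying schedules are assumed here, and a fixed iteration count stands in for the "updates become insignificant" stopping test. The names `train_som` and `som_cost` are hypothetical.

```python
import numpy as np

def train_som(X, rows=4, cols=4, n_iter=1000, eps0=0.5, sigma0=1.5, seed=0):
    """Online SOM training on a rectangular lattice (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Lattice coordinates of the m = rows * cols neurons (row-major order).
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    # 1. Initialisation: each prototype w_k starts at a randomly drawn data point.
    W = X[rng.integers(0, len(X), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        eps = eps0 * (1.0 - frac)                 # assumed schedule for eps(t)
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # assumed neighborhood width
        # 2. Competition: draw x_i at random; winner s(x_i) = argmin_k ||x_i - w_k||^2.
        x = X[rng.integers(0, len(X))]
        winner = int(np.argmin(np.sum((W - x) ** 2, axis=1)))
        # 3. Adaptation: w_k <- w_k + eps * h_{k,s(x_i)} * (x_i - w_k).
        lattice_d2 = np.sum((grid - grid[winner]) ** 2, axis=1)
        h = np.exp(-lattice_d2 / (2.0 * sigma ** 2))   # assumed Gaussian neighborhood
        W += eps * h[:, None] * (x - W)
    return W, grid

def som_cost(X, W, grid, sigma=1.0):
    """C(w) = sum_i sum_k h_{k,s(x_i)} ||x_i - w_k||^2 with a Gaussian h."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)         # (n, m)
    winners = d2.argmin(axis=1)                                      # s(x_i)
    lattice_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-lattice_d2 / (2.0 * sigma ** 2))                     # (m, m), symmetric
    return float((H[winners] * d2).sum())
```

Because every update moves a prototype part of the way toward a data point (ε·h never exceeds 1 here), the prototypes always stay inside the range spanned by the data.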


Self organizing step

The map is considered organized when the following properties hold:

– Each neuron specializes in a portion of the input space

– Similar data are projected onto nearby neurons on the map


Data visualization using SOMs

•  Visualization of clusters and shape of the data (projections, U-matrices and other distance matrices)

•  Visualization of components/variables (scatter plots, component planes)

•  Visualization of data projections
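As an illustration of the U-matrix mentioned above, here is a minimal sketch of one common simplified variant: one value per neuron, equal to the mean distance between its prototype and those of its lattice neighbors. It assumes a rectangular lattice with 4-connected neighbors and prototypes stored in row-major order. The function name `u_matrix` is hypothetical.

```python
import numpy as np

def u_matrix(W, rows, cols):
    """Mean prototype distance from each neuron to its 4-connected lattice neighbors."""
    P = np.asarray(W, dtype=float).reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < rows and 0 <= b < cols:
                    dists.append(np.linalg.norm(P[i, j] - P[a, b]))
            U[i, j] = np.mean(dists)
    return U
```

High values mark neurons whose prototypes lie far from their neighbors' prototypes, which is how cluster borders show up when the matrix is drawn as a heat map.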


Example: Countries of the world


Example: Iris


Dimensionality Reduction:

Singular Value Decomposition


Recall (1)

•  Why dimensionality reduction?
•  Some features may be irrelevant
•  We want to visualize high dimensional data
•  “Intrinsic” dimensionality may be smaller than the number of features


Recall (2): Dimensionality Reduction Techniques

•  Visualize, categorize, or simplify large datasets.
•  Principal Component Analysis: finds the dimensions that capture the most variance.
•  MDS: finds data points in a lower-dimensional space that best preserve the inter-point distances.
•  Isomap: estimates the distance between two points on a manifold by following a chain of points with shorter distances between them. (More accurate than LLE in representing global distances; slower than LLE.)
•  LLE (Locally Linear Embedding): only concerned with representing the distances between local points. Faster than Isomap (no need to handle global distances).
•  Hessian Eigenmaps
•  Laplacian Eigenmaps
•  Charting
•  SOM (Kohonen’s Self-Organizing Map)


Dimensionality Reduction

•  The input space of many learning problems is of high dimensionality.

•  This has computational implications, and it makes finding the intrinsic information content difficult.

•  Dimensionality reduction methods usually try to address these two problems at the same time.

•  Context: unsupervised learning. Given a collection of data points in an n-dimensional space, find a good representation of the data in an r-dimensional space.


Choosing sets of features

•  Score each feature
•  Forward/backward elimination
–  Choose the feature with the highest/lowest score
–  Re-score the other features
–  Repeat

•  If you have lots of features (e.g., text mining)
–  Just select the top K scored features
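The two strategies above can be sketched as follows. The slides do not say how features are scored, so the scoring function is passed in by the caller. The toy score used here (a relevance value minus a penalty for overlapping with already selected features) is invented purely for illustration, as are the names `forward_select`, `top_k` and `toy_score`. Only the forward variant is shown; backward elimination would mirror it by repeatedly dropping the lowest-scoring feature.

```python
def forward_select(n_features, score, k):
    """Greedy forward selection: add the best remaining feature, re-score, repeat."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Re-score every remaining feature given what is already selected.
        best = max(remaining, key=lambda j: score(j, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

def top_k(scores, k):
    """One-shot ranking: keep the k highest-scoring features without re-scoring."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

# Hypothetical toy data: features 0 and 1 are both relevant but redundant with each other.
relevance = [0.9, 0.85, 0.3, 0.6]
redundant_with = {0: {1}, 1: {0}}

def toy_score(j, selected):
    overlap = len(redundant_with.get(j, set()) & set(selected))
    return relevance[j] - 0.5 * overlap

print(forward_select(4, toy_score, 2))   # re-scoring skips the redundant feature 1
print(top_k(relevance, 2))               # plain ranking keeps both redundant features
```

The contrast shows the trade-off: re-scoring after each choice can avoid redundant features but costs one scoring pass per selected feature, which is why the one-shot top K is the practical choice when there are very many features.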


Unsupervised feature selection

•  Differs from feature selection in two ways:
–  Instead of choosing a subset of features, create new features (dimensions) defined as functions over all features
–  Don’t consider class labels, just the data points


SVD: the problem

•  For high-dimensional data, can we use PCA?
–  for example: images (d ≥ 10^4), gene expression data, text data, …

•  The problem:
–  The covariance matrix has size d × d (d² entries), so computing and decomposing it is expensive.

•  Solution:
–  Singular Value Decomposition (SVD): efficient implementations are available in Matlab, R, SAS, …
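The idea can be illustrated with a minimal NumPy sketch (NumPy is an assumption here; the slides only name Matlab, R and SAS). The centered data matrix is decomposed directly with a thin SVD, so the d × d covariance matrix is never formed. The function name `pca_via_svd` is hypothetical.

```python
import numpy as np

def pca_via_svd(X, r):
    """Project the rows of X onto the top-r principal axes using a thin SVD."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:r]                            # principal axes, shape (r, d)
    scores = U[:, :r] * s[:r]                      # projected data, shape (n, r)
    explained_var = s[:r] ** 2 / (X.shape[0] - 1)  # top eigenvalues of the covariance
    return scores, components, explained_var

# Usage: 20 samples in 50 dimensions (d > n), reduced to 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
scores, components, explained_var = pca_via_svd(X, 3)
```

The squared singular values divided by n − 1 equal the leading eigenvalues of the covariance matrix, so the result matches covariance-based PCA up to the sign of each axis.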


Definition

A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T

•  A: n x m matrix (e.g., n documents, m terms)
•  U: n x r matrix (n documents, r concepts)
•  Λ: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)
•  V: m x r matrix (m terms, r concepts)
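A small NumPy sketch can make the shapes in this definition concrete. The document-term matrix below is invented for illustration (two groups of documents using disjoint sets of terms), and the numerical-rank threshold is an assumption. `np.linalg.svd` returns min(n, m) singular values, so the factors are truncated to the rank r to match the definition.

```python
import numpy as np

# Hypothetical toy matrix: n = 4 documents, m = 5 terms, two disjoint "topics".
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 2, 2]], dtype=float)

U_thin, s, Vt_thin = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))      # numerical rank (assumed tolerance)

U = U_thin[:, :r]                      # n x r: documents to concepts
Lam = np.diag(s[:r])                   # r x r: strength of each concept
V = Vt_thin[:r].T                      # m x r: terms to concepts

A_rebuilt = U @ Lam @ V.T              # equals A up to rounding error
```

Here r is 2, one concept per block of the matrix.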


SVD : schema


SVD - Interpretation

‘documents’, ‘terms’ and ‘concepts’:
•  U: document-to-concept similarity matrix
•  V: term-to-concept similarity matrix
•  Λ: its diagonal elements give the ‘strength’ of each concept

Projection:
•  best axis to project on (‘best’ = minimum sum of squares of projection errors)
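The "best axis" property can be checked numerically. The sketch below uses invented random data and assumes axes through the origin (no centering). It compares the first right singular vector against random directions, and the helper name `projection_error` is hypothetical.

```python
import numpy as np

def projection_error(A, v):
    """Sum of squared distances from the rows of A to the line spanned by v."""
    v = v / np.linalg.norm(v)
    residual = A - np.outer(A @ v, v)
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 4)) @ np.diag([5.0, 2.0, 1.0, 0.5])

_, s, Vt = np.linalg.svd(A, full_matrices=False)
best_axis = Vt[0]                                   # first right singular vector
best_err = projection_error(A, best_axis)
random_errs = [projection_error(A, rng.normal(size=4)) for _ in range(100)]
```

No random direction gives a smaller error than `best_axis`, and `best_err` equals the sum of the squared remaining singular values.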


In Matlab:
•  svd(A);
•  svds(A) – for sparse data.

•  The svd command computes the matrix singular value decomposition.
•  s = svd(X) returns a vector of singular values.
•  [U,S,V] = svd(X) produces a diagonal matrix S of the same dimension as X, with nonnegative diagonal elements in decreasing order, and unitary matrices U and V so that X = U*S*V'.
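For readers working in Python rather than Matlab, a rough NumPy counterpart of the two calls above might look like this (an illustration, not part of the original slides). The main differences are that `np.linalg.svd` returns the singular values as a vector and returns V already transposed.

```python
import numpy as np

X = np.array([[3.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])                     # small 3 x 2 example matrix

# Counterpart of s = svd(X): singular values only, in decreasing order.
s = np.linalg.svd(X, compute_uv=False)

# Counterpart of [U,S,V] = svd(X): build S with the same shape as X by hand.
U, s_full, Vh = np.linalg.svd(X, full_matrices=True)
S = np.zeros(X.shape)
S[:len(s_full), :len(s_full)] = np.diag(s_full)
V = Vh.T                                       # NumPy returns V transposed (real case)

X_rebuilt = U @ S @ V.T                        # X = U*S*V'
```

For sparse matrices, SciPy provides `scipy.sparse.linalg.svds`, which plays a role similar to Matlab's `svds`.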


Conclusions

•  Dimensionality reduction (DR) for high-dimensional data;
•  DR for visualization and analysis;
•  Can use feature selection methods (Course 8);
•  Noise detection and elimination (Course 8);
•  Choose a technique depending on the data type and structure;
•  For unbalanced data, use subsampling on the majority class
•  References: lectures/papers of Jure Leskovec, Christos Faloutsos and Tom Mitchell