Dimensionality reduction & Visualization
2019-01-28
1
Outline
• Definition and objectives • Methods of data reduction • Examples of applications • The curse of dimensionality • Classification of high-dimensional data • Techniques of dimensionality reduction
2
Why visualise data?
• Better presentation of data => better understanding / analysis
• “The goal … is to communicate information clearly and effectively through graphical means.”
– Friedman (2008)
3
Motivation
• Increasing computational power of computers
• Data availability • A lot of high-dimensional data • An additional component of clustering techniques
4
Motivation
A good visualization technique should apply even when: • We have little or no a priori knowledge about the data • Exploratory goals are vague • The data are inhomogeneous and noisy
5
Motivation
• Nowadays high-dimensional data are everywhere
• Associated with different Machine Learning tasks such as classification, clustering and regression
6
Results of the visualization
The results of the visualizations can be represented in the form of: Ø Maps Ø Graphics Ø Dashboards
7
Why dimension reduction?
• The number of features can be very large – Genomic data: expressions of genes
• Thousands of variables – Image data : each pixel of an image
• A 64×64 image = 4,096 features
– Text categorization: frequency of phrases (or words) in a document or web page
• More than ten thousand features
8
Methods of data reduction
9
Methods of data reduction
There are many visualization methods, depending on the type of data treated. The data can be: • Univariate • Bivariate • Multivariate
10
Univariate data���
• Represent measurements of a single variable • Usually characterize a distribution • Are represented by the following methods:
– Histogram – Pie chart – Box plot
11
Univariate data (1D)
• Representation
12
Bivariate data (2D)
• Are paired samples of two variables • The variables are related • Are represented by the following methods :
– Scatter plot – Line graphs
13
Bivariate data (2D)
• Representation
14
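As a small illustration of these standard univariate and bivariate plots, here is a minimal Python sketch (assuming matplotlib and numpy are available; the data are synthetic and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)    # univariate sample
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # related second variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)        # histogram: distribution of x
axes[0].set_title("Histogram")
axes[1].boxplot(x)              # box plot: median, quartiles, outliers
axes[1].set_title("Box plot")
axes[2].scatter(x, y, s=5)      # scatter plot: relation between x and y
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```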
Multivariate data
• A multidimensional representation of multivariate data
• Are represented by the following methods:
– Icon-based methods
– Pixel-based methods
– Dynamic parallel coordinates
15
16
What is visualization?
• Visualization is the process of visual interpretation or graphical representation of a dataset.
17
What is visualization? • Medical investigations on patients
[Table extract: raw patient records, one row per measurement, giving patient ID, test name (pulse rate, blood pressure, serum chemistry, full blood count, urine tests, INR, etc.), date (26/07/2010) and measured value; dozens of heterogeneous measurements spread across many patients.]
18
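Records like the extract above are hard to read directly; a common first step before any visualization or dimension reduction is to pivot them into a patient × test matrix. A hypothetical pandas sketch (column names are illustrative, not taken from the slide; the few values shown come from the extract above):

```python
import pandas as pd

# Long-format records: one row per (patient, test, date, value)
records = pd.DataFrame({
    "patient_id": [119, 119, 201, 201],
    "test": ["OE - pulse rate", "Systolic blood pressure",
             "Free T4 level", "Serum TSH level"],
    "date": ["26/07/2010"] * 4,
    "value": [62, 120, 17.8, 4.55],
})

# Pivot to one row per patient, one column per test: a high-dimensional feature matrix
matrix = records.pivot_table(index="patient_id", columns="test", values="value")
print(matrix)
```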
What is visualization?
Snow’s Cholera Map, 1855 19
What is visualization? (example figures, slides 20–22)
What is visualization? • Visualization is a branch of computer science involving the
processing, analysis and graphical representation of data from diverse fields: social sciences, finance, medicine, entertainment, etc.
• Two disciplines are particularly called upon in visualization: computer graphics and statistics.
23
What is visualization?
• It is important to know how to distinguish the areas of image processing, computer graphics and visualization.
• Image processing is the study of 2D images, to extract information from them or to modify their characteristics.
24
What is visualization? • Computer graphics is about creating images from scratch with a computer, whether 2D images drawn by an artist or complex 3D scenes.
• Visualization allows the exploration of data represented in a visual form, to help our understanding of the phenomenon shown.
25
What is visualization?
• One goal of visualization is to visually represent data that does not necessarily have a natural geometric interpretation.
26
Data Acquisition
The acquisition of the raw data can be done in different ways: • by simulation (computer calculations) • from statistical surveys • from historical databases • from sensors taking real measurements, etc.
27
Data Acquisition
Sources of errors • Is the sampling accurate enough for us to be able to extract the desired information? Useless data that would only inflate the computations should not be kept.
• Is the quantification done with sufficient precision to bring out the desired characteristics?
28
Filtering Data
Sources of errors • Do we retain the important and meaningful data? Conversely, do we eliminate data that is irrelevant to the extraction of the desired characteristics?
• If we add data, are the added data representative of the rest?
29
High-dimensional data
• The increase in computing power and storage space of computers has led to an increase in the size of data sets
• Thus, several fields of science now rely on our ability to analyze and visualize high-dimensional data.
30
Curse of dimensionality
31
Curse of dimensionality The curse of dimensionality • A term introduced by Bellman in 1961 • Refers to the problem of the explosive increase in
data volume associated with adding extra dimensions in a mathematical space.
• We will illustrate this problem with a simple example
32
Toy problem
• We have a 3-class pattern recognition problem • We have 9 one-dimensional observations available (along a single axis)
• A simple approach would be to:
– Divide the feature space into uniform bins – Compute the ratio of examples for each class at each bin and – For a new example, find its bin and choose the predominant class
of the bin
33
Toy problem • For our example we start with a single feature and divide the axis into 3 segments. We observe an average of 3 examples per bin.
• We then observe that there is too much overlap among the classes, so we decide to add a second feature to try to improve the class separability.
34
Toy problem (2D)
• If we add a 2nd dimension we go from 3 bins (in 1D) to 3² = 9 (in 2D).
• We face another problem: do we maintain the density of examples per bin, or do we keep the same number of examples that was used in 1D?
35
Toy problem (2D)
• Choosing to maintain the density increases the number of examples from 9 (in 1D) to 27 (in 2D)
• Choosing to maintain the number of examples results in a 2D scatter plot that is very sparse
36
Toy problem (3D) If we add a 3rd dimension it gets worse: Ø The number of bins grows to 3³ = 27 Ø To keep the same density of examples, the number of needed examples becomes 81 Ø For the same number of examples, the 3D scatter plot is almost empty
37
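A quick back-of-the-envelope sketch of the toy problem's growth (3 bins per axis and a constant density of 3 examples per bin, as in the slides):

```python
# Number of bins and examples needed to keep 3 examples per bin as dimension grows
bins_per_axis = 3
examples_per_bin = 3

for d in (1, 2, 3):
    n_bins = bins_per_axis ** d            # 3, 9, 27 bins
    n_needed = examples_per_bin * n_bins   # 9, 27, 81 examples for constant density
    print(f"D={d}: {n_bins} bins, {n_needed} examples needed")
```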
Curse of dimensionality
The approach used on the toy example is ineffective:
• There are other approaches less affected by the curse of dimensionality, but the problem still exists
38
Curse of dimensionality
How can we beat the “curse of dimensionality”? • By incorporating prior knowledge • By increasing the smoothness of the target function • By reducing the dimensionality
39
Curse of dimensionality
• In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of variables beyond which the performance of our classifier will degrade rather than improve.
40
Curse of dimensionality
The curse of dimensionality gives rise to several phenomena, such as: - concentration of measure - desertification (sparsity) of the space - depopulation of the center of hyper-volumes
41
Consequences
There are many consequences of the curse of dimensionality: 1. Exponential growth in the number of examples required to maintain a given sampling density (for a density of N examples/bin and D dimensions, the total number of examples is N^D)
42
Consequences
2. An exponential growth in the complexity of the target function (which estimates the density). To be learned well, the target function requires denser sample points.
43
Consequences
3. In one dimension a variety of density functions can be found in the literature, but in high dimensions essentially only the multivariate Gaussian density is practical.
Moreover, for large values of D, even the Gaussian can only be handled in simplified forms.
44
Curse of dimensionality
• These findings suggest that high-dimensional data need special treatment, different from that used for low-dimensional data
• The same problems occur with other data distributions
45
Clustering High-Dimensional Data
46
Clustering high-dimensional data Methods :
– Subspace clustering: we look for clusters that exist in subspaces of the given high-dimensional data
• CLIQUE, ProClus and co-clustering approaches
– Techniques of dimension reduction : Construct a much lower dimensional space and search for clusters there (we may construct new dimensions by combining some dimensions in the original data).
47
Subspace-clustering
48
Methods of Subspace Clustering • Subspace search methods: Search various subspaces to find clusters
– “Bottom-up” approaches
– “Top-down” approaches
• Clustering methods based on correlation
– For instance : PCA based approaches
• Co-clustering methods
– Optimization based methods (Cheng and Church, ISMB’2000)
– Enumeration methods (Pei et al., ICDM’2003)
49
Subspace search methods • “Bottom-up” approaches – We start from low-dimensional subspaces and search higher-dimensional subspaces only when they may contain clusters – Various pruning techniques reduce the number of higher-dimensional subspaces to be searched – Ex. CLIQUE (Agrawal et al. 1998)
• “Top-down” approaches – We start from the full space and search smaller subspaces recursively – The subspace of a cluster can be determined from the local neighborhood – Ex. PROCLUS (Aggarwal et al. 1999): a method similar to k-medoids
50
Methods based on correlation
• Clustering methods based on correlation : based on advanced correlation models
• Ex : PCA based approaches :
– We apply PCA to generate a set of new, uncorrelated dimensions
– Then mine clusters in the new space or its subspaces
51
Methods of Co-clustering • Co-clustering: cluster both objects and attributes simultaneously (objects and variables are treated in a symmetric way)
• Optimization-based methods
– We try to find the submatrix that achieves the best significance as a co-cluster
– Due to the computational cost, greedy search is employed to find locally optimal co-clusters
• Enumeration methods – We use a tolerance threshold to specify the degree of noise allowed in a co-cluster
– Then we enumerate all submatrices that satisfy the requirements as co-clusters
52
Techniques of dimension reduction
53
Dimension Reduction
• Data in a high-dimensional space are not uniformly distributed
• The reduction of dimension is a technique
widely used to treat the "curse of dimensionality"
54
Dimension Reduction
There are a variety of techniques of dimension reduction: • Linear vs. non-linear • Deterministic vs. probabilistic • Supervised vs. unsupervised
55
Dimension Reduction • Dimension reduction : Methodologies
– Feature selection: choosing a subset of all the features
– Feature extraction: creating a set of new features by combining the original ones
56
Dimension reduction via
Feature selection
57
Feature selection • Definition: variable selection is the process of choosing an optimal subset of relevant variables from a set of variables, according to a performance criterion. We can ask three basic questions:
Q1: How do we measure the relevance of the variables?
Q2: How do we obtain the optimal subset?
Q3: Which optimality criterion should be used?
58
Feature selection • The answer to Q1 is to find a measure of relevance, or evaluation criterion J(X), to quantify the importance of one variable or of a combination of variables. • Q2 refers to the choice of the search procedure used to build the optimal subset of relevant variables. • Q3 requires the definition of a stopping criterion for the search.
59
Feature selection Evaluation criteria Ø For a classification problem we test, for example, the discriminative quality of the system in the presence or absence of a variable.
Ø For a regression problem we instead test the quality of prediction with respect to the other variables.
60
Feature selection An alternative is to use a Branch & Bound search. This search restricts the exploration and yields the optimal subset of variables, under the hypothesis of monotonicity of the selection criterion J(X). The criterion is called monotonic if:
X_1 ⊂ X_2 ⊂ … ⊂ X_m ⇒ J(X_1) ≤ J(X_2) ≤ … ≤ J(X_m)
where X_k is the subset containing the k selected variables.
61
Feature selection Problem: most evaluation criteria are not monotonic. Use of sub-optimal methods: • Sequential Forward Selection (SFS) • Sequential Backward Selection (SBS) • Bidirectional Selection (BS)
62
Feature selection • Sequential Forward Selection (SFS) Let X be the set of variables. Initially the set of selected variables is empty. At each step k:
- We select the variable x_i that maximizes the evaluation criterion:
J(X_k) = max_{x_i ∈ X − X_{k−1}} J(X_{k−1} ∪ {x_i})
✓ gives an ordered list of variables according to their importance
63
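A minimal sketch of Sequential Forward Selection, assuming the evaluation criterion J is cross-validated accuracy of a classifier (scikit-learn is used here only for illustration; the data and classifier are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sfs(X, y, n_features):
    """Greedy Sequential Forward Selection using CV accuracy as J(X)."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            scores[j] = cross_val_score(
                LogisticRegression(max_iter=1000), X[:, cols], y, cv=5
            ).mean()
        best = max(scores, key=scores.get)   # variable maximizing J
        selected.append(best)
        remaining.remove(best)
    return selected  # variables ordered by importance

# Usage on random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(sfs(X, y, 3))
```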
Feature selection • Sequential Backward Selection (SBS) We start from the full set of variables X and proceed by elimination; at each step:
- The variable x_i that is least important according to the evaluation criterion is removed:
J(X_k) = max_{x_i ∈ X_{k+1}} J(X_{k+1} − {x_i})
✓ gives an ordered list of variables according to their importance: the most relevant variables are found in the last positions of the list.
64
Feature selection The BS procedure performs the search in both directions (forward and backward) in a competitive manner. The procedure stops in two cases:
(1) when one of the two directions has found the best subset of variables before reaching the middle of the search space
(2) when the two branches meet in the middle.
The sets of variables selected by SFS and SBS are generally not equal, because of their different selection principles. This method reduces the search time, since the search proceeds in both directions and stops as soon as a solution is found in either direction.
65
Feature selection • Stopping criterion ✓ Since the optimal number of variables is unknown a priori, a rule controlling the selection/elimination of variables is used to stop the search when no remaining variable is informative enough.
✓ The stopping criterion is often defined as a combination of the search and evaluation criteria.
✓ A frequently used heuristic is to compute, for different subsets of selected variables, an estimate of the generalization error by cross-validation.
✓ The subset of variables selected is the one that minimizes this generalization error.
66
Dimension reduction via
Feature extraction
67
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
68
Visual data mining
69
Visual data mining
Visualization of information
Data mining Visual Data mining
70
Visual data mining��� Diagram
Data
Knowledges
Algorithm of DM
Result
Visualization of data
Preceding Visualization (PV)
Subsequent Visualization (SV)
Tightly integrated Visualization (TIV)
Visualization of the result
Visualization of the result
Data
Knowledges
Reult
Algorithm of DM
Visualization + Interaction
Knowledges
Data
Result
Algorithm of DM, step 1
Algorithm of DM, step n
71
Visualization software
• Graphviz • Tulip • Knime • Matlab • R • … www.KDnuggets.com/software/visualization.html 72
Multi-Dimensional Scaling (MDS)
73
Outline
• Introduction and definitions • Problem Formulation • Algorithm • Example • Conclusions
74
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
75
Introduction
• Multi-dimensional scaling (MDS) (see Borg and Groenen, 1997) - A collection of dimension reduction techniques that map the distances between observations in a high-dimensional space into a lower-dimensional space
- Find a configuration of points in a low-dimensional space whose inter-point distances correspond to the dissimilarities in higher dimensions 76
Introduction
77
Source space Target space
- Able to model intrinsic complex manifold structures and visualize them
Multi-Dimensional Scaling (MDS)
In many applications: - We know the distances between the points of a data set - We seek a representation of these points in a low-dimensional space
The method of multidimensional scaling (MDS) allows us to build this representation
78
Multi-Dimensional Scaling (MDS)
- Example: Ø Get the map of a country starting from the knowledge of the distances between each pair of cities. - Like PCA, the MDS algorithm is based on the search for eigenvalues. - MDS builds a configuration of N points in R^d from the distances between objects.
79
Multi-Dimensional Scaling (MDS)
• So we have N(N−1)/2 distances. It is always possible to generate a configuration of N points in N dimensions that satisfies the given distances exactly.
• MDS computes an approximation in d < N dimensions.
80
Problem Formulation
• Given – The points x1, …, xn in k dimensions – We denote by δij the distance between points xi and xj
• Find – The points y1, …, yn in 2 (or 3) dimensions, such that the distance dij between yi and yj is close to δij
81
Source space Target space
Cost function
• We look for the configuration y_1, …, y_n whose distances d_ij minimize an objective function
• We can define the cost function in a general manner:
Cost_function = Σ_{i<j} (δ_ij − d_ij)²,  with δ_ij = ‖x_i − x_j‖ and d_ij = ‖y_i − y_j‖
82
Examples of cost functions • Possible cost functions (stress)
– d_ij is a function of y_i and y_j; given the data, the δ_ij are constant.
J_aa = Σ_{i<j} (d_ij − δ_ij)²   → penalizes large absolute errors
J_rr = Σ_{i<j} ((d_ij − δ_ij) / δ_ij)²   → penalizes large relative errors
J_ar = (1 / Σ_{i<j} δ_ij) Σ_{i<j} (d_ij − δ_ij)² / δ_ij   → a compromise between the two (the Sammon criterion)
83
Update rules
• The gradients of the three stress functions with respect to y_k are:
∇_{y_k} J_aa = 2 Σ_{j≠k} (d_kj − δ_kj) (y_k − y_j) / d_kj
∇_{y_k} J_rr = 2 Σ_{j≠k} [(d_kj − δ_kj) / (δ_kj² d_kj)] (y_k − y_j)
∇_{y_k} J_ar = (2 / Σ_{i<j} δ_ij) Σ_{j≠k} [(d_kj − δ_kj) / (δ_kj d_kj)] (y_k − y_j)
84
Algorithm
• Compute or obtain the distances δ_ij • Initialize the points y_1, …, y_n (e.g. randomly) • Until convergence,
85
∀i:  y_i ← y_i − η ∇_{y_i} J   (0 < η < 1)
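A minimal numpy sketch of this gradient-descent MDS, using the absolute stress J_aa; this is a simplified illustration, not the exact implementation behind the slides (learning rate and iteration count are arbitrary):

```python
import numpy as np

def mds_gradient_descent(delta, d=2, eta=0.05, n_iter=500, seed=0):
    """Gradient descent on J_aa = sum_{i<j} (d_ij - delta_ij)^2."""
    n = delta.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, d))                          # random initialization
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]             # pairwise differences y_i - y_j
        dist = np.linalg.norm(diff, axis=-1)             # current d_ij
        np.fill_diagonal(dist, 1.0)                      # avoid division by zero
        coef = 2.0 * (dist - delta) / dist               # 2 (d_ij - delta_ij) / d_ij
        np.fill_diagonal(coef, 0.0)
        grad = (coef[:, :, None] * diff).sum(axis=1)     # gradient of J_aa w.r.t. each y_i
        Y -= eta * grad                                  # y_i <- y_i - eta * grad
    return Y

# Toy example: 4 points whose true configuration is a unit square
X = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
delta = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(mds_gradient_descent(delta))
```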
Example • Artificial data set: we go from a 3-dimensional space to a 2-dimensional space
86
Example • The “eurodist” data set gives the distances (in km) between 21 European cities.
• The source data set must be presented as a square matrix of dissimilarities between the objects (here, the cities).
87
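In practice one would usually call a library implementation. A hedged sketch with scikit-learn on a precomputed dissimilarity matrix (the actual eurodist data ships with R, so a tiny made-up matrix of fictitious cities is used here as a stand-in):

```python
import numpy as np
from sklearn.manifold import MDS

# Small symmetric dissimilarity matrix between 4 fictitious cities (km)
D = np.array([
    [0, 300, 500, 700],
    [300, 0, 400, 600],
    [500, 400, 0, 250],
    [700, 600, 250, 0],
], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # 2D coordinates approximately preserving the distances
print(coords)
```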
Example
88
Conclusions • MDS algorithms differ in:
– The distance used in the source space – The stress (objective) function; using different stress functions leads to different results
– The optimization procedure; linear (classical) MDS has an analytic solution but cannot model complex (nonlinear) low-dimensional manifolds well, while nonlinear MDS usually needs an iterative algorithm
89
Isometric feature mapping (Isomap)
90
Outline
• Introduction and definitions • Algorithm • Example • Conclusions
91
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
92
Introduction
• Techniques such as LDA, PCA and their variants perform a global transformation of the data (rotation/translation/rescaling) – These techniques assume that most of the information in the data is contained in a linear subspace.
– Which approach should we use when the data are actually embedded in a non-linear subspace (a low-dimensional manifold)?
93
Introduction
• PCA cannot discover the structure of a data set with a spiral shape
94
Euclidean Distance vs. Geodesic Distance
95
Euclidean Distance   Geodesic Distance
Introduction
• The goal of Isomap is to find a non-linear manifold containing the data
• We use the fact that for close points, the Euclidean distance is a good approximation to the geodesic distance on the manifold
• We build a graph connecting each point to its k nearest neighbors
96
Introduction
• The lengths of the geodesics are then estimated by searching the length of the shortest path between two points in the graph
• Thereafter we apply MDS to obtained distances to determine a position of points in a space of reduced dimensions
97
ISOMAP
• ISOMAP [Tenenbaum et al. 2000] – For neighboring samples, Euclidean distance
provides a good approximation to geodesic distance
– For distant points, geodesic distance can be approximated with a sequence of steps between clusters of neighboring points
98
ISOMAP ISOMAP operates in three steps: 1. Build the neighborhood graph G 2. For each pair of points of G, compute the shortest path (the geodesic distance) 3. Apply classical MDS to the geodesic distances
Euclidean distance → geodesic distance
99
ISOMAP Algorithm
• Step 1 – Build the neighborhood graph based on the distances dX (i,j) in the input space X. – This can be performed in two different ways:
• Connect each point to all points within a fixed radius ε • Connect each point to all of its k nearest neighbors
– A weighted neighborhood graph G is obtained, where dX(i,j) is the weight of each edge between neighboring points
100
ISOMAP Algorithm
• Step 2 – Estimate the geodesic distances dM(i,j) between all pairs of points on the manifold M by computing their shortest-path distances dG(i,j) in the graph G.
– This can be done using Dijkstra's algorithm or the Floyd–Warshall algorithm.
101
ISOMAP Algorithm • Step 3
Ø Apply classical MDS to the matrix of graph distances D.
102
Complexity of ISOMAP • For large datasets ISOMAP can be quite slow: - Step 1: complexity of k-nearest neighbors, O(n²D) - Step 2: complexity of Dijkstra's algorithm, O(n² log n + n²k)
- Step 3: MDS complexity, O(n²d)
103
The «Swiss roll» dataset
104
• The “Swiss roll” data set contains 20,000 points. • This figure shows a sample of 1,000 points. • We will use this example to illustrate the steps of the Isomap algorithm.
Construction of the neighborhood graph G
K-nearest neighbors (K = 7). DG is the 1000 × 1000 matrix of Euclidean distances between neighboring points (Figure A)
105
[Tenenbaum et al. 2000]
Calculation of shortest paths in G
106
DG is the matrix of geodesic distances between arbitrary pairs of points along the manifold M (Figure B)
[Tenenbaum et al. 2000]
Using MDS to represent the graph in Rd
107
Find a Euclidean d-dimensional space Y that preserves the pairwise distances (Figure C)
[Tenenbaum et al. 2000]
Example on images
108
For each image we have 64×64 = 4,096 pixels
Results
109
Results
110
Results
111
Conclusions • Advantages
– Non-linear – Non-iterative – Preserves the global properties of the data
• Limitations – Sensitive to noise – Parameters to set: k or ε – Relatively slow for large data sets – the neighborhood size k (or ε) must be kept small enough to avoid “short-circuit” edges near regions of high curvature of the surface
112
Locally Linear Embedding (LLE)
113
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
114
Locally Linear Embedding (LLE) • LLE (“Locally Linear Embedding”) addresses the same problem as Isomap in a different way.
• LLE preserves the local properties of the data by representing each point as a linear combination of its nearest neighbors.
• LLE builds a projection to a low-dimensional linear space that preserves the neighborhoods. 115
LLE Algorithm
116
[Roweis and Saul 2000]
LLE Algorithm
117
LLE operates in 3 steps: • Compute the k nearest neighbors • Compute the weights needed to reconstruct each point as a linear combination of its neighbors • Project the points using the newly found coordinates.
LLE Algorithm • The local geometry is modeled by linear weights that
reconstruct each data point as a linear combination of its neighbors
• Reconstruction errors are measured with the cost function:
ε(W) = Σ_{i=1}^{N} ‖X_i − Σ_j W_ij X_j‖²
– where the weights W_ij measure the contribution of the j-th example to the reconstruction of the i-th example
• The cost is minimized over the weights W_ij subject to two constraints: 1) each data point is reconstructed only from its neighbors; 2) the rows of the weight matrix sum to one: Σ_j W_ij = 1
118
LLE Algorithm
• We seek to find d-dimensional coordinates Yi that minimize the following cost function:
Φ(Y) = Σ_{i=1}^{N} ‖Y_i − Σ_j W_ij Y_j‖²
119
Parameter Estimation • Consider a particular sample x with k nearest neighbors η_j and reconstruction weights w_j (which sum to one). These weights can be found in three steps:
– Step 1: Compute the neighborhood correlation matrix C_jk = η_j^T η_k and its inverse C⁻¹
– Step 2: Compute the Lagrange multiplier λ that enforces the constraint Σ_j w_j = 1:
λ = (1 − Σ_{jk} C_jk⁻¹ x^T η_k) / Σ_{jk} C_jk⁻¹
– Step 3: Compute the reconstruction weights as:
w_j = Σ_k C_jk⁻¹ (x^T η_k + λ)
120
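A numpy sketch of the weight computation for a single point. Instead of the explicit Lagrange-multiplier formula above, it uses the equivalent standard recipe (solve C w = 1 and normalize), with a small regularization term as suggested in Roweis and Saul's implementation notes; the parameter values are illustrative:

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights of point x from its k nearest neighbors.

    x:         (D,) data point
    neighbors: (k, D) matrix of its k nearest neighbors
    """
    Z = neighbors - x                                     # shift neighbors so x is at the origin
    C = Z @ Z.T                                           # local Gram (correlation) matrix, k x k
    C += reg * np.trace(C) * np.eye(len(neighbors))       # regularize in case k > D
    w = np.linalg.solve(C, np.ones(len(neighbors)))       # solve C w = 1
    return w / w.sum()                                    # enforce sum(w) = 1

# Tiny usage example
x = np.array([0.0, 0.0])
nbrs = np.array([[1.0, 0.0], [-1.0, 0.1], [0.0, 1.0]])
print(lle_weights(x, nbrs))
```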
Parameter Estimation • The embedding vectors Y_i are found by minimizing the cost function:
Φ(Y) = Σ_{i=1}^{N} ‖Y_i − Σ_j W_ij Y_j‖²
• To make the optimization problem well-posed we introduce two constraints:
Σ_i Y_i = 0   and   (1/N) Σ_i Y_i Y_i^T = I
• These allow the cost function to be expressed as:
Φ(Y) = Σ_{ij} M_ij (Y_i^T Y_j),  where M_ij = δ_ij − W_ij − W_ji + Σ_k W_ki W_kj
• δ_ij equals 1 if i = j and 0 otherwise.
• The optimal embedding is found by computing the bottom d+1 eigenvectors of the matrix M.
121
Examples on LLE
122
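For comparison with the Isomap sketch earlier, here is a short scikit-learn call applying LLE to the same kind of Swiss roll data (the number of neighbors is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Nearest neighbors, reconstruction weights and the eigenvector embedding
# correspond to the three LLE steps described above
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)
print(Y.shape)   # (1000, 2)
```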
Results
123
• The initial points are images of faces. • In the 2-dimensional space, these images are grouped according to pose, lighting and expression.
• The images at the bottom of the figure correspond to successive points along the line at the top right, sweeping out a continuum of facial expression.
[Roweis and Saul 2000]
Results
124 [Roweis and Saul 2000]
• PCA vs LLE
Unlike PCA, LLE preserves local topology!
Results
125
• PCA vs LLE
[Roweis and Saul 2000]
Conclusions • LLE is a nonlinear technique that preserves the local properties of the data by representing each point as a linear combination of its nearest neighbors in a new space of reduced dimension.
• LLE operates in 3 steps:
– Compute the k nearest neighbors – Compute the weights needed to reconstruct each point as a linear combination of its neighbors – Project the points using the newly found coordinates
126
Conclusions
• Sensitive to noise • Parameters to set: k • Relatively slow for large data sets
127
References ���
• J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
• Sam Roweis & Lawrence Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
128
Self Organizing Map (SOM)
129
Outline
• Introduction and definitions of neural networks
• Introduction SOM • Algorithm • Example • Conclusions
130
Neural networks • A set of interconnected neurons / information processing units
• A technique designed to model how the brain performs a particular task
• Used to extract patterns from data sets that are very large and contain hidden relations • Able to handle noisy data
131
Neural network learning
• By learning we can extract information • This information is stored in the links between the neurons (also called weights) • Neural networks can perform two types of learning: - Supervised - Unsupervised
132
Weights
Neural network
Input Output
Self-organisation • Brain cells organize themselves into groups according to incoming information.
• This incoming information is not only received by a single neural cell, but also influences other cells in its neighbourhood.
• The SOM mechanism is based on this same principle
133
The principle of SOMs • SOM produces the similarity graph of the input data • Converts non-linear relationships between high
dimensional data into simple geometric relationships
134 Illustration of a 7×7 SOM map
• Input datum
• Weight
• Updated weight
Inputs   Outputs
SOM
• Introduced in 1984 by Teuvo Kohonen • Vector quantization + vector projection • Technique based on unsupervised learning • Used in classification and visualization of
large data sets • Used in many areas
135
Fields of application • Applications
– Exploratory data analysis, clustering – Quantification, variable selection, outlier detection – Diagnosis, prediction, missing data
• Domains – Socio-economic – Text mining – Telecommunications – Industrial processes
136
SOM Architecture • Set of neurons / cluster units • Each neuron is assigned a prototype vector taken from the input data set
• The neurons of the map can be arranged on either a rectangular or a hexagonal lattice
137 Hexagonal neighborhood Rectangular neighborhood
SOM
• Cost function:
138
C(w) = Σ_i Σ_k h_{k,s(x_i)} ‖x_i − w_k‖²
SOM Algorithm 1. Initialisation step
– Define the map topology – Randomly initialize the prototype of each neuron
2. Competition step – Randomly choose a datum x_i – Determine the winning neuron according to the rule:
s(x_i) = argmin_{1≤k≤m} ‖x_i − w_k‖²
3. Adaptation step – Adapt the prototypes with the rule:
w_k(t+1) = w_k(t) + ε(t) h_{k,s(x_i)}(t) (x_i − w_k(t))
4. Repeat steps 2 and 3 until the prototype updates become insignificant
139
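A bare-bones numpy sketch of these competition and adaptation steps for a small rectangular map, with a Gaussian neighborhood function h and linearly decaying rates; all hyper-parameter choices here are illustrative assumptions, not values from the course:

```python
import numpy as np

def train_som(X, rows=7, cols=7, n_iter=2000, eps0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    W = X[rng.integers(0, len(X), rows * cols)].astype(float)  # prototypes taken from the data

    for t in range(n_iter):
        frac = t / n_iter
        eps, sigma = eps0 * (1 - frac), sigma0 * (1 - frac) + 0.5
        x = X[rng.integers(len(X))]                       # randomly chosen datum (competition)
        winner = np.argmin(((W - x) ** 2).sum(axis=1))    # s(x) = argmin_k ||x - w_k||^2
        d2 = ((grid - grid[winner]) ** 2).sum(axis=1)     # squared grid distance to the winner
        h = np.exp(-d2 / (2 * sigma ** 2))                # neighborhood function
        W += eps * h[:, None] * (x - W)                   # adaptation step
    return W.reshape(rows, cols, -1)

# Usage on random 2D data
X = np.random.default_rng(1).normal(size=(500, 2))
prototypes = train_som(X)
print(prototypes.shape)   # (7, 7, 2)
```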
Self organizing step
140
Self organizing step
The map is organized such that the following properties hold:
– Each neuron specializes in a portion of the input space
– Similar data have nearby projections on the map
141
Data visualization using SOMs
• Visualization of clusters and shape of the data (projections, U-matrices and other distance matrices)
• Visualization of components/variables (scatter plots, components planes)
• Visualization of data projections
142
Example : Countries of the world
143
Example: Iris
144
Example: Iris
145
Example: Iris
146
Example: Iris
147
Dimensionality Reduction:
Singular Value Decomposition
Recall (1)
• Why dimensionality reduction? • Some features may be irrelevant • We want to visualize high dimensional data • “Intrinsic” dimensionality may be smaller
than the number of features
Recall (2) ���Dimensionality Reduction
Techniques • Visualize, categorize, or simplify large datasets. • Principal Component Analysis: Finds the dimensions that capture the most
variance
• MDS: Finds data points in a lower-dimensional space that best preserve the inter-point distances.
• Isomap: Estimates the distance between two points on a manifold by following a chain of points with shorter distances between them. (More accurate in representing global distances than LLE; slower than LLE)
• LLE (Local Linear Embedding): Only worries about representing the distances between local points. Faster than Isomap (no worry about global distances).
• Hessian Eigenmaps • Laplacian Eigenmaps • Charting • SOM (Kohonen's Self-Organizing Map)
Dimensionality Reduction
• The input space of many learning problems is of high dimensionality.
• This has computational implications and makes it difficult to find the intrinsic information content.
• Dimensionality reduction methods usually try to address these two problems at the same time.
• Context: unsupervised learning. Given a collection of data points in an n-dimensional space, find a good representation of the data in an r-dimensional space (r < n)
Choosing sets of features
• Score each feature • Forward/Backward elimination
– Choose the feature with the highest/lowest score – Re-score other features – Repeat
• If you have lots of features (text mining for ex.) – Just select top K scored features
Unsupervised feature selection
• Differs from feature selection in two ways: – Instead of choosing a subset of the features, – Create new features (dimensions) defined as functions over all features – Don't consider class labels, just the data points
SVD: the problem • For high-dimensional data, can we use PCA? – for example: images (d ≥ 10⁴), gene expression data, text data, …
• The problem: – The covariance matrix has size d × d, so the computation is expensive.
• Solution: – Singular Value Decomposition (SVD) - an efficient algorithm available in Matlab, R, SAS…
Definition
A[n×m] = U[n×r] Λ[r×r] (V[m×r])^T
• A: n×m matrix (e.g., n documents, m terms) • U: n×r matrix (n documents, r concepts) • Λ: r×r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix) • V: m×r matrix (m terms, r concepts)
SVD : schema
SVD - Interpretation ‘documents’, ‘terms’ and ‘concepts’: • U: document-to-concept similarity matrix • V: term-to-concept similarity matrix • Λ: its diagonal elements: ‘strength’ of each
concept Projection: • best axis to project on: (‘best’ = min sum of
squares of projection errors)
In matlab: • svd(A); • svds(A) – for sparse data.
• The svd command computes the matrix singular value decomposition.
• s = svd(X) returns a vector of singular values. • [U,S,V] = svd(X) produces a diagonal matrix S of the same
dimension as X, with nonnegative diagonal elements in decreasing order, and unitary matrices U and V so that X = U*S*V'.
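An equivalent sketch in Python with numpy, showing the shapes described above on a hypothetical small document-term matrix (the data are made up for illustration):

```python
import numpy as np

# n = 4 documents, m = 3 terms
A = np.array([
    [1, 1, 0],
    [2, 2, 0],
    [0, 0, 3],
    [0, 0, 4],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)           # (4, 3) (3,) (3, 3)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # A = U * S * V^T reconstructs A
```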
Conclusions • Dimensionality reduction (DR) for high-dimensional data; • DR for visualization and analysis; • Can use feature selection methods (Course 8); • Noise detection and elimination (Course 8); • Choose a technique depending on the data type and structure;
• For unbalanced data, use subsampling of the majority class
• References: lectures/papers of Jure Leskovec, Christos Faloutsos and Tom Mitchell