Dimensionality reduction & Visualization
2019-01-28
1
Outline
• Definition and objectives • Methods of data reduction • Examples of applications • The curse of dimensionality • Classification of high-dimensional data • Techniques of dimensionality reduction
2
Why visualise data?
• Better presentation of data => better understanding / analysis
• “The goal … is to communicate information clearly and effectively through graphical means.”
– Friedman (2008)
3
Motivation
• Increasing computational power of computers
• Data availability • A lot of high-dimensional data • An additional component of clustering techniques
4
Motivation
A good visualization technique should apply even when: • We have little or no a priori knowledge about the data • Exploratory goals are vague • The data are inhomogeneous and noisy
5
Motivation
• Nowadays high-dimensional data are everywhere
• Associated with different Machine Learning tasks such as classification, clustering and regression
6
Results of the visualization
The results of the visualizations can be represented in the form of: Ø Maps Ø Graphics Ø Dashboards
7
Why dimension reduction?
• The number of features can be very large – Genomic data: expressions of genes
• Thousands of variables – Image data : each pixel of an image
• A 64×64 image = 4,096 features
– Text categorization: frequency of phrases (or words) in a document or web page
• More than ten thousand features
8
Methods of data reduction
9
Methods of data reduction
There are many visualization methods, depending on the type of data treated. The data can be: • Univariate • Bivariate • Multivariate
10
Univariate data���
• Represent measurements of a single variable • Usually characterize a distribution • Are represented by the following methods:
– Histogram – Pie chart – Box plot
11
Univariate data (1D)
• Representation
12
Bivariate data (2D)
• Are paired samples of two variables • The variables are related • Are represented by the following methods :
– Scatter plot – Line graphs
13
Bivariate data (2D)
• Representation
14
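As a small illustration of these standard univariate and bivariate plots, here is a minimal Python sketch (assuming matplotlib and numpy are available; the data are synthetic and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)    # univariate sample
y = 2.0 * x + rng.normal(scale=0.5, size=500)   # related second variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)        # histogram: distribution of x
axes[0].set_title("Histogram")
axes[1].boxplot(x)              # box plot: median, quartiles, outliers
axes[1].set_title("Box plot")
axes[2].scatter(x, y, s=5)      # scatter plot: relation between x and y
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()
```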
Multivariate data
• A multidimensional representation of multivariate data
• Are represented by the following methods:
– Icon-based methods
– Pixel-based methods
– Dynamic parallel coordinates
15
16
What is visualization?
• Visualization is the process of visual interpretation or graphical representation of a dataset.
17
What is visualization? • Medical investigations on patients
[Table extract: raw patient records, one row per measurement, giving patient ID, test name (pulse rate, blood pressure, serum chemistry, full blood count, urine tests, INR, etc.), date (26/07/2010) and measured value; dozens of heterogeneous measurements spread across many patients.]
18
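Records like the extract above are hard to read directly; a common first step before any visualization or dimension reduction is to pivot them into a patient × test matrix. A hypothetical pandas sketch (column names are illustrative, not taken from the slide; the few values shown come from the extract above):

```python
import pandas as pd

# Long-format records: one row per (patient, test, date, value)
records = pd.DataFrame({
    "patient_id": [119, 119, 201, 201],
    "test": ["OE - pulse rate", "Systolic blood pressure",
             "Free T4 level", "Serum TSH level"],
    "date": ["26/07/2010"] * 4,
    "value": [62, 120, 17.8, 4.55],
})

# Pivot to one row per patient, one column per test: a high-dimensional feature matrix
matrix = records.pivot_table(index="patient_id", columns="test", values="value")
print(matrix)
```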
What is visualization?
Snow’s Cholera Map, 1855 19
What is visualization? (example figures, slides 20–22)
What is visualization? • Visualization is a branch of computer science involving the
processing, analysis and graphical representation of data from diverse fields: social sciences, finance, medicine, entertainment, etc.
• Two disciplines are particularly called upon in visualization: computer graphics and statistics.
23
What is visualization?
• It is important to know how to distinguish the areas of image processing, computer graphics and visualization.
• Image processing is the study of 2D images, to extract information from them or to modify their characteristics.
24
What is visualization? • Computer graphics is about creating images from scratch with a computer, whether 2D images drawn by an artist or complex 3D scenes.
• Visualization allows the exploration of data represented in a visual form, to help our understanding of the phenomenon shown.
25
What is visualization?
• One goal of visualization is to visually represent data that does not necessarily have a natural geometric interpretation.
26
Data Acquisition
The acquisition of the raw data can be done in different ways: • by simulation (computer calculations) • from statistical surveys • from historical databases • from sensors taking real measurements, etc.
27
Data Acquisition
Sources of errors • Is the sampling accurate enough for us to be able to extract the desired information? Useless data that would only inflate the computations should not be kept.
• Is the quantification done with sufficient precision to bring out the desired characteristics?
28
Filtering Data
Sources of errors • Do we retain the important and meaningful data? Conversely, do we eliminate data that is irrelevant to the extraction of the desired characteristics?
• If we add data, are the added data representative of the rest?
29
High-dimensional data
• The increase in computing power and storage space of computers has led to an increase in the size of data sets
• Thus, several fields of science now rely on our ability to analyze and visualize high-dimensional data.
30
Curse of dimensionality
31
Curse of dimensionality The curse of dimensionality • A term introduced by Bellman in 1961 • Refers to the problem of the explosive increase in
data volume associated with adding extra dimensions in a mathematical space.
• We will illustrate this problem with a simple example
32
Toy problem
• We have a 3-class pattern recognition problem • We have 9 one-dimensional observations available (along a single axis)
• A simple approach would be to:
– Divide the feature space into uniform bins – Compute the ratio of examples for each class at each bin and – For a new example, find its bin and choose the predominant class
of the bin
33
Toy problem • For our example we start with a single feature and divide the axis into 3 segments. We observe an average of 3 examples per bin.
• We then observe that there is too much overlap among the classes, so we decide to add a second feature to try to improve the class separability.
34
Toy problem (2D)
• If we add a 2nd dimension we go from 3 bins (in 1D) to 3² = 9 (in 2D).
• We face another problem: do we maintain the density of examples per bin, or do we keep the same number of examples that was used in 1D?
35
Toy problem (2D)
• Choosing to maintain the density increases the number of examples from 9 (in 1D) to 27 (in 2D)
• Choosing to maintain the number of examples results in a 2D scatter plot that is very sparse
36
Toy problem (3D) If we add a 3rd dimension it gets worse: Ø The number of bins grows to 3³ = 27 Ø To keep the same density of examples, the number of needed examples becomes 81 Ø For the same number of examples, the 3D scatter plot is almost empty
37
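A quick back-of-the-envelope sketch of the toy problem's growth (3 bins per axis and a constant density of 3 examples per bin, as in the slides):

```python
# Number of bins and examples needed to keep 3 examples per bin as dimension grows
bins_per_axis = 3
examples_per_bin = 3

for d in (1, 2, 3):
    n_bins = bins_per_axis ** d            # 3, 9, 27 bins
    n_needed = examples_per_bin * n_bins   # 9, 27, 81 examples for constant density
    print(f"D={d}: {n_bins} bins, {n_needed} examples needed")
```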
Curse of dimensionality
The approach used on the toy example is ineffective:
• There are other approaches less affected by the curse of dimensionality, but the problem still exists
38
Curse of dimensionality
How can we beat the “curse of dimensionality”? • By incorporating prior knowledge • By increasing the smoothness of the target function • By reducing the dimensionality
39
Curse of dimensionality
• In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of variables beyond which the performance of our classifier will degrade rather than improve.
40
Curse of dimensionality
The curse of dimensionality gives rise to several phenomena, such as: - concentration of measure - desertification (sparsity) of the space - depopulation of the center of hyper-volumes
41
Consequences
There are many consequences of the curse of dimensionality: 1. Exponential growth in the number of examples required to maintain a given sampling density (for a density of N examples/bin and D dimensions, the total number of examples is N^D)
42
Consequences
2. An exponential growth in the complexity of the target function (which estimates the density). To be learned well, the target function requires denser sample points.
43
Consequences
3. In one dimension a variety of density functions can be found in the literature, but in high dimensions essentially only the multivariate Gaussian density is practical.
Moreover, for large values of D, even the Gaussian can only be handled in simplified forms.
44
Curse of dimensionality
• These findings suggest that high-dimensional data need special treatment, different from that used for low-dimensional data
• The same problems occur with other data distributions
45
Clustering High-Dimensional Data
46
Clustering high-dimensional data Methods :
– Subspace clustering: we look for clusters that exist in subspaces of the given high-dimensional data
• CLIQUE, ProClus and co-clustering approaches
– Techniques of dimension reduction : Construct a much lower dimensional space and search for clusters there (we may construct new dimensions by combining some dimensions in the original data).
47
Subspace-clustering
48
Methods of Subspace Clustering • Subspace search methods: Search various subspaces to find clusters
– “Bottom-up” approaches
– “Top-down” approaches
• Clustering methods based on correlation
– For instance : PCA based approaches
• Co-clustering methods
– Optimization based methods (Cheng and Church, ISMB’2000)
– Enumeration methods (Pei et al., ICDM’2003)
49
Subspace search methods • “Bottom-up” approaches – We start from low-dimensional subspaces and search higher-dimensional subspaces only when they may contain clusters – Various pruning techniques reduce the number of higher-dimensional subspaces to be searched – Ex. CLIQUE (Agrawal et al. 1998)
• “Top-down” approaches – We start from the full space and search smaller subspaces recursively – The subspace of a cluster can be determined from the local neighborhood – Ex. PROCLUS (Aggarwal et al. 1999): a method similar to k-medoids
50
Methods based on correlation
• Clustering methods based on correlation : based on advanced correlation models
• Ex : PCA based approaches :
– We apply PCA to generate a set of new, uncorrelated dimensions
– Then mine clusters in the new space or its subspaces
51
Methods of Co-clustering • Co-clustering: cluster both objects and attributes simultaneously (objects and variables are treated in a symmetric way)
• Optimization-based methods
– We try to find the submatrix that achieves the best significance as a co-cluster
– Due to the computational cost, greedy search is employed to find locally optimal co-clusters
• Enumeration methods – We use a tolerance threshold to specify the degree of noise allowed in a co-cluster
– Then we enumerate all submatrices that satisfy the requirements as co-clusters
52
Techniques of dimension reduction
53
Dimension Reduction
• Data in a high-dimensional space are not uniformly distributed
• The reduction of dimension is a technique
widely used to treat the "curse of dimensionality"
54
Dimension Reduction
There are a variety of techniques of dimension reduction: • Linear vs. non-linear • Deterministic vs. probabilistic • Supervised vs. unsupervised
55
Dimension Reduction • Dimension reduction : Methodologies
– Feature selection: choosing a subset of all the features
– Feature extraction: creating a set of new features by combining the original ones
56
Dimension reduction via
Feature selection
57
Feature selection • Definition: variable selection is the process of choosing an optimal subset of relevant variables from a set of variables, according to a performance criterion. We can ask three basic questions:
Q1: How do we measure the relevance of the variables?
Q2: How do we obtain the optimal subset?
Q3: Which optimality criterion should be used?
58
Feature selection • The answer to Q1 is to find a measure of relevance, or evaluation criterion J(X), to quantify the importance of one variable or of a combination of variables. • Q2 refers to the choice of the search procedure used to build the optimal subset of relevant variables. • Q3 requires the definition of a stopping criterion for the search.
59
Feature selection Evaluation criteria Ø For a classification problem we test, for example, the discriminative quality of the system in the presence or absence of a variable.
Ø For a regression problem we instead test the quality of prediction with respect to the other variables.
60
Feature selection An alternative is to use a Branch & Bound search. This search restricts the exploration and yields the optimal subset of variables, under the hypothesis of monotonicity of the selection criterion J(X). The criterion is called monotonic if:
X_1 ⊂ X_2 ⊂ … ⊂ X_m ⇒ J(X_1) ≤ J(X_2) ≤ … ≤ J(X_m)
where X_k is the subset containing the k selected variables.
61
Feature selection Problem: most evaluation criteria are not monotonic. Use of sub-optimal methods: • Sequential Forward Selection (SFS) • Sequential Backward Selection (SBS) • Bidirectional Selection (BS)
62
Feature selection • Sequential Forward Selection (SFS) Let X be the set of variables. Initially the set of selected variables is empty. At each step k:
- We select the variable x_i that maximizes the evaluation criterion:
J(X_k) = max_{x_i ∈ X − X_{k−1}} J(X_{k−1} ∪ {x_i})
✓ gives an ordered list of variables according to their importance
63
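A minimal sketch of Sequential Forward Selection, assuming the evaluation criterion J is cross-validated accuracy of a classifier (scikit-learn is used here only for illustration; the data and classifier are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def sfs(X, y, n_features):
    """Greedy Sequential Forward Selection using CV accuracy as J(X)."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            scores[j] = cross_val_score(
                LogisticRegression(max_iter=1000), X[:, cols], y, cv=5
            ).mean()
        best = max(scores, key=scores.get)   # variable maximizing J
        selected.append(best)
        remaining.remove(best)
    return selected  # variables ordered by importance

# Usage on random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(sfs(X, y, 3))
```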
Feature selection • Sequential Backward Selection (SBS) We start from the full set of variables X and proceed by elimination; at each step:
- The variable x_i that is least important according to the evaluation criterion is removed:
J(X_k) = max_{x_i ∈ X_{k+1}} J(X_{k+1} − {x_i})
✓ gives an ordered list of variables according to their importance: the most relevant variables are found in the last positions of the list.
64
Feature selection The BS procedure performs the search in both directions (forward and backward) in a competitive manner. The procedure stops in two cases:
(1) when one of the two directions has found the best subset of variables before reaching the middle of the search space
(2) when the two branches meet in the middle.
The sets of variables selected by SFS and SBS are generally not equal, because of their different selection principles. This method reduces the search time, since the search proceeds in both directions and stops as soon as a solution is found in either direction.
65
Feature selection • Stopping criterion ✓ Since the optimal number of variables is unknown a priori, a rule controlling the selection/elimination of variables is used to stop the search when no remaining variable is informative enough.
✓ The stopping criterion is often defined as a combination of the search and evaluation criteria.
✓ A frequently used heuristic is to compute, for different subsets of selected variables, an estimate of the generalization error by cross-validation.
✓ The subset of variables selected is the one that minimizes this generalization error.
66
Dimension reduction via
Feature extraction
67
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
68
Visual data mining
69
Visual data mining
Visualization of information
Data mining Visual Data mining
70
Visual data mining��� Diagram
Data
Knowledges
Algorithm of DM
Result
Visualization of data
Preceding Visualization (PV)
Subsequent Visualization (SV)
Tightly integrated Visualization (TIV)
Visualization of the result
Visualization of the result
Data
Knowledges
Reult
Algorithm of DM
Visualization + Interaction
Knowledges
Data
Result
Algorithm of DM, step 1
Algorithm of DM, step n
71
Visualization software
• Graphviz • Tulip • Knime • Matlab • R • … www.KDnuggets.com/software/visualization.html 72
Multi-Dimensional Scaling (MDS)
73
Outline
• Introduction and definitions • Problem Formulation • Algorithm • Example • Conclusions
74
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
75
Introduction
• Multi-dimensional scaling (MDS) (see Borg and Groenen, 1997) - A collection of dimension reduction techniques that map the distances between observations in a high-dimensional space into a lower-dimensional space
- Find a configuration of points in a low-dimensional space whose inter-point distances correspond to the dissimilarities in higher dimensions 76
Introduction
77
Source space Target space
- Able to model intrinsic complex manifold structures and visualize them
Multi-Dimensional Scaling (MDS)
In many applications: - We know the distances between the points of a data set - We seek a representation of these points in a low-dimensional space
The method of multidimensional scaling (MDS) allows us to build this representation
78
Multi-Dimensional Scaling (MDS)
- Example: Ø Get the map of a country starting from the knowledge of the distances between each pair of cities. - Like PCA, the MDS algorithm is based on the search for eigenvalues. - MDS builds a configuration of N points in R^d from the distances between objects.
79
Multi-Dimensional Scaling (MDS)
• So we have N(N−1)/2 distances. It is always possible to generate a configuration of N points in N dimensions that satisfies the given distances exactly.
• MDS computes an approximation in d < N dimensions.
80
Problem Formulation
• Given – The points x1, …, xn in k dimensions – We denote by δij the distance between points xi and xj
• Find – The points y1, …, yn in 2 (or 3) dimensions, such that the distance dij between yi and yj is close to δij
81
Source space Target space
Cost function
• We look for the configuration y_1, …, y_n whose distances d_ij minimize an objective function
• We can define the cost function in a general manner:
Cost_function = Σ_{i<j} (δ_ij − d_ij)²,  with δ_ij = ‖x_i − x_j‖ and d_ij = ‖y_i − y_j‖
82
Examples of cost functions • Possible cost functions (stress)
– d_ij is a function of y_i and y_j; given the data, the δ_ij are constant.
J_aa = Σ_{i<j} (d_ij − δ_ij)²   → penalizes large absolute errors
J_rr = Σ_{i<j} ((d_ij − δ_ij) / δ_ij)²   → penalizes large relative errors
J_ar = (1 / Σ_{i<j} δ_ij) Σ_{i<j} (d_ij − δ_ij)² / δ_ij   → a compromise between the two (the Sammon criterion)
83
Update rules
• The gradients of the three stress functions with respect to y_k are:
∇_{y_k} J_aa = 2 Σ_{j≠k} (d_kj − δ_kj) (y_k − y_j) / d_kj
∇_{y_k} J_rr = 2 Σ_{j≠k} [(d_kj − δ_kj) / (δ_kj² d_kj)] (y_k − y_j)
∇_{y_k} J_ar = (2 / Σ_{i<j} δ_ij) Σ_{j≠k} [(d_kj − δ_kj) / (δ_kj d_kj)] (y_k − y_j)
84
Algorithm
• Compute or obtain the distances δ_ij • Initialize the points y_1, …, y_n (e.g. randomly) • Until convergence,
85
∀i:  y_i ← y_i − η ∇_{y_i} J   (0 < η < 1)
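A minimal numpy sketch of this gradient-descent MDS, using the absolute stress J_aa; this is a simplified illustration, not the exact implementation behind the slides (learning rate and iteration count are arbitrary):

```python
import numpy as np

def mds_gradient_descent(delta, d=2, eta=0.05, n_iter=500, seed=0):
    """Gradient descent on J_aa = sum_{i<j} (d_ij - delta_ij)^2."""
    n = delta.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, d))                          # random initialization
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]             # pairwise differences y_i - y_j
        dist = np.linalg.norm(diff, axis=-1)             # current d_ij
        np.fill_diagonal(dist, 1.0)                      # avoid division by zero
        coef = 2.0 * (dist - delta) / dist               # 2 (d_ij - delta_ij) / d_ij
        np.fill_diagonal(coef, 0.0)
        grad = (coef[:, :, None] * diff).sum(axis=1)     # gradient of J_aa w.r.t. each y_i
        Y -= eta * grad                                  # y_i <- y_i - eta * grad
    return Y

# Toy example: 4 points whose true configuration is a unit square
X = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
delta = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(mds_gradient_descent(delta))
```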
Example • Artificial data set: we go from a 3-dimensional space to a 2-dimensional space
86
Example • The “eurodist” data set gives the distances (in km) between 21 European cities.
• The source data set must be presented as a square matrix of dissimilarities between the objects (here, the cities).
87
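In practice one would usually call a library implementation. A hedged sketch with scikit-learn on a precomputed dissimilarity matrix (the actual eurodist data ships with R, so a tiny made-up matrix of fictitious cities is used here as a stand-in):

```python
import numpy as np
from sklearn.manifold import MDS

# Small symmetric dissimilarity matrix between 4 fictitious cities (km)
D = np.array([
    [0, 300, 500, 700],
    [300, 0, 400, 600],
    [500, 400, 0, 250],
    [700, 600, 250, 0],
], dtype=float)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # 2D coordinates approximately preserving the distances
print(coords)
```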
Example
88
Conclusions • MDS algorithms differ in:
– The distance used in the source space – The stress (objective) function; using different stress functions leads to different results
– The optimization procedure; linear (classical) MDS has an analytic solution but cannot model complex (nonlinear) low-dimensional manifolds well, while nonlinear MDS usually needs an iterative algorithm
89
Isometric feature mapping (Isomap)
90
Outline
• Introduction and definitions • Algorithm • Example • Conclusions
91
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
92
Introduction
• Techniques such as LDA, PCA and their variants perform a global transformation of the data (rotation/translation/rescaling) – These techniques assume that most of the information in the data is contained in a linear subspace.
– Which approach should we use when the data are actually embedded in a non-linear subspace (a low-dimensional manifold)?
93
Introduction
• PCA cannot discover the structure of a data set with a spiral shape
94
Euclidean Distance vs. Geodesic Distance
95
Euclidean Distance   Geodesic Distance
Introduction
• The goal of Isomap is to find a non-linear manifold containing the data
• We use the fact that for close points, the Euclidean distance is a good approximation to the geodesic distance on the manifold
• We build a graph connecting each point to its k nearest neighbors
96
Introduction
• The lengths of the geodesics are then estimated by searching the length of the shortest path between two points in the graph
• Thereafter we apply MDS to obtained distances to determine a position of points in a space of reduced dimensions
97
ISOMAP
• ISOMAP [Tenenbaum et al. 2000] – For neighboring samples, Euclidean distance
provides a good approximation to geodesic distance
– For distant points, geodesic distance can be approximated with a sequence of steps between clusters of neighboring points
98
ISOMAP ISOMAP operates in three steps: 1. Build the neighborhood graph G 2. For each pair of points of G, compute the shortest path (the geodesic distance) 3. Apply classical MDS to the geodesic distances
Euclidean distance → geodesic distance
99
ISOMAP Algorithm
• Step 1 – Build the neighborhood graph based on the distances dX (i,j) in the input space X. – This can be performed in two different ways:
• Connect each point to all points within a fixed radius ε • Connect each point to all of its k nearest neighbors
– A weighted neighborhood graph G is obtained, where dX(i,j) is the weight of each edge between neighboring points
100
ISOMAP Algorithm
• Step 2 – Estimate the geodesic distances dM(i,j) between all pairs of points on the manifold M by computing their shortest-path distances dG(i,j) in the graph G.
– This can be done using Dijkstra's algorithm or the Floyd–Warshall algorithm.
101
ISOMAP Algorithm • Step 3
Ø Apply classical MDS to the matrix of graph distances D.
102
Complexity of ISOMAP • For large datasets ISOMAP can be quite slow: - Step 1: complexity of k-nearest neighbors, O(n²D) - Step 2: complexity of Dijkstra's algorithm, O(n² log n + n²k)
- Step 3: MDS complexity, O(n²d)
103
The «Swiss roll» dataset
104
• The “Swiss roll” data set contains 20,000 points. • This figure shows a sample of 1,000 points. • We will use this example to illustrate the steps of the Isomap algorithm.
Construction of the neighborhood graph G
K-nearest neighbors (K = 7). DG is the 1000 × 1000 matrix of Euclidean distances between neighboring points (Figure A)
105
[Tenenbaum et al. 2000]
Calculation of shortest paths in G
106
DG is the matrix of geodesic distances between arbitrary pairs of points along the manifold M (Figure B)
[Tenenbaum et al. 2000]
Using MDS to represent the graph in Rd
107
Find a Euclidean d-dimensional space Y that preserves the pairwise distances (Figure C)
[Tenenbaum et al. 2000]
Example on images
108
For each image we have 64×64 = 4,096 pixels
Results
109
Results
110
Results
111
Conclusions • Advantages
– Non-linear – Non-iterative – Preserves the global properties of the data
• Limitations – Sensitive to noise – Parameters to set: k or ε – Relatively slow for large data sets – the neighborhood size k (or ε) must be kept small enough to avoid “short-circuit” edges near regions of high curvature of the surface
112
Locally Linear Embedding (LLE)
113
Dimension reduction via feature extraction
Two main types of methods : • Linear Methods
• Principal Components Analysis (PCA) • Linear Discriminant Analysis (LDA) • Multi-Dimensional Scaling (MDS) • …
• Non-Linear Methods • Isometric feature mapping (Isomap) • Locally Linear Embedding (LLE) • Kernel PCA • Spectral clustering • Supervised methods (S-Isomap) • …
114
Locally Linear Embedding (LLE) • LLE (“Locally Linear Embedding”) addresses the same problem as Isomap in a different way.
• LLE preserves the local properties of the data by representing each point as a linear combination of its nearest neighbors.
• LLE builds a projection to a low-dimensional linear space that preserves the neighborhoods. 115
LLE Algorithm
116
[Roweis and Saul 2000]
LLE Algorithm
117
LLE operates in 3 steps: • Compute the k nearest neighbors • Compute the weights needed to reconstruct each point as a linear combination of its neighbors • Project the points using the newly found coordinates.
LLE Algorithm • The local geometry is modeled by linear weights that
reconstruct each data point as a linear combination of its neighbors
• Reconstruction errors are measured with the cost function:
ε(W) = Σ_{i=1}^{N} ‖X_i − Σ_j W_ij X_j‖²
– where the weights W_ij measure the contribution of the j-th example to the reconstruction of the i-th example
• The cost is minimized over the weights W_ij subject to two constraints: 1) each data point is reconstructed only from its neighbors; 2) the rows of the weight matrix sum to one: Σ_j W_ij = 1
118
LLE Algorithm
• We seek to find d-dimensional coordinates Yi that minimize the following cost function:
Φ(Y) = Σ_{i=1}^{N} ‖Y_i − Σ_j W_ij Y_j‖²
119
Parameter Estimation • Consider a particular sample x with k nearest neighbors η_j and reconstruction weights w_j (which sum to one). These weights can be found in three steps:
– Step 1: Compute the neighborhood correlation matrix C_jk = η_j^T η_k and its inverse C⁻¹
– Step 2: Compute the Lagrange multiplier λ that enforces the constraint Σ_j w_j = 1:
λ = (1 − Σ_{jk} C_jk⁻¹ x^T η_k) / Σ_{jk} C_jk⁻¹
– Step 3: Compute the reconstruction weights as:
w_j = Σ_k C_jk⁻¹ (x^T η_k + λ)
120
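A numpy sketch of the weight computation for a single point. Instead of the explicit Lagrange-multiplier formula above, it uses the equivalent standard recipe (solve C w = 1 and normalize), with a small regularization term as suggested in Roweis and Saul's implementation notes; the parameter values are illustrative:

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Reconstruction weights of point x from its k nearest neighbors.

    x:         (D,) data point
    neighbors: (k, D) matrix of its k nearest neighbors
    """
    Z = neighbors - x                                     # shift neighbors so x is at the origin
    C = Z @ Z.T                                           # local Gram (correlation) matrix, k x k
    C += reg * np.trace(C) * np.eye(len(neighbors))       # regularize in case k > D
    w = np.linalg.solve(C, np.ones(len(neighbors)))       # solve C w = 1
    return w / w.sum()                                    # enforce sum(w) = 1

# Tiny usage example
x = np.array([0.0, 0.0])
nbrs = np.array([[1.0, 0.0], [-1.0, 0.1], [0.0, 1.0]])
print(lle_weights(x, nbrs))
```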
Parameter Estimation • The embedding vectors Y_i are found by minimizing the cost function:
Φ(Y) = Σ_{i=1}^{N} ‖Y_i − Σ_j W_ij Y_j‖²
• To make the optimization problem well-posed we introduce two constraints:
Σ_i Y_i = 0   and   (1/N) Σ_i Y_i Y_i^T = I
• These allow the cost function to be expressed as:
Φ(Y) = Σ_{ij} M_ij (Y_i^T Y_j),  where M_ij = δ_ij − W_ij − W_ji + Σ_k W_ki W_kj
• δ_ij equals 1 if i = j and 0 otherwise.
• The optimal embedding is found by computing the bottom d+1 eigenvectors of the matrix M.
121
Examples on LLE
122
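For comparison with the Isomap sketch earlier, here is a short scikit-learn call applying LLE to the same kind of Swiss roll data (the number of neighbors is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Nearest neighbors, reconstruction weights and the eigenvector embedding
# correspond to the three LLE steps described above
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)
print(Y.shape)   # (1000, 2)
```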
Results
123
• The initial points are images of faces. • In the 2-dimensional space, these images are grouped according to pose, lighting and expression.
• The images at the bottom of the figure correspond to successive points along the line at the top right, sweeping out a continuum of facial expression.
[Roweis and Saul 2000]
Results
124 [Roweis and Saul 2000]
• PCA vs LLE
Unlike PCA, LLE preserves local topology!
Results
125
• PCA vs LLE
[Roweis and Saul 2000]
Conclusions • LLE is a nonlinear technique that preserves the local properties of the data by representing each point as a linear combination of its nearest neighbors in a new space of reduced dimension.
• LLE operates in 3 steps:
– Compute the k nearest neighbors – Compute the weights needed to reconstruct each point as a linear combination of its neighbors – Project the points using the newly found coordinates
126
Conclusions
• Sensitive to noise • Parameters to set: k • Relatively slow for large data sets
127
References ���
• J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
• Sam Roweis & Lawrence Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
128
Self Organizing Map (SOM)
129
Outline
• Introduction and definitions of neural networks
• Introduction SOM • Algorithm • Example • Conclusions
130
Neural networks • A set of interconnected neurons / information processing units
• A technique designed to model how the brain performs a particular task
• Used to extract patterns from data sets that are very large and contain hidden relations • Able to handle noisy data
131
Neural network learning
• By learning we can extract information • This information is stored in the links between the neurons (also called weights) • Neural networks can perform two types of learning: - Supervised - Unsupervised
132
Weights
Neural network
Input Output
Self-organisation • Brain cells organize themselves into groups according to incoming information.
• This incoming information is not only received by a single neural cell, but also influences other cells in its neighbourhood.
• The SOM mechanism is based on this same principle
133
The principle of SOMs • SOM produces the similarity graph of the input data • Converts non-linear relationships between high
dimensional data into simple geometric relationships
134 Illustration of a 7×7 SOM map
• Input datum
• Weight
• Updated weight
Inputs   Outputs
SOM
• Introduced in 1984 by Teuvo Kohonen • Vector quantization + vector projection • Technique based on unsupervised learning • Used in classification and visualization of
large data sets • Used in many areas
135
Fields of application • Applications
– Exploratory data analysis, clustering – Quantification, variable selection, outlier detection – Diagnosis, prediction, missing data
• Domains – Socio-economic – Text mining – Telecommunications – Industrial processes
136
SOM Architecture • Set of neurons / cluster units • Each neuron is assigned a prototype vector taken from the input data set
• The neurons of the map can be arranged on either a rectangular or a hexagonal lattice
137 Hexagonal neighborhood Rectangular neighborhood
SOM
• Cost function:
138
C(w) = Σ_i Σ_k h_{k,s(x_i)} ‖x_i − w_k‖²
SOM Algorithm 1. Initialisation step
– Define the map topology – Randomly initialize the prototype of each neuron
2. Competition step – Randomly choose a datum x_i – Determine the winning neuron according to the rule:
s(x_i) = argmin_{1≤k≤m} ‖x_i − w_k‖²
3. Adaptation step – Adapt the prototypes with the rule:
w_k(t+1) = w_k(t) + ε(t) h_{k,s(x_i)}(t) (x_i − w_k(t))
4. Repeat steps 2 and 3 until the prototype updates become insignificant
139
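A bare-bones numpy sketch of these competition and adaptation steps for a small rectangular map, with a Gaussian neighborhood function h and linearly decaying rates; all hyper-parameter choices here are illustrative assumptions, not values from the course:

```python
import numpy as np

def train_som(X, rows=7, cols=7, n_iter=2000, eps0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    W = X[rng.integers(0, len(X), rows * cols)].astype(float)  # prototypes taken from the data

    for t in range(n_iter):
        frac = t / n_iter
        eps, sigma = eps0 * (1 - frac), sigma0 * (1 - frac) + 0.5
        x = X[rng.integers(len(X))]                       # randomly chosen datum (competition)
        winner = np.argmin(((W - x) ** 2).sum(axis=1))    # s(x) = argmin_k ||x - w_k||^2
        d2 = ((grid - grid[winner]) ** 2).sum(axis=1)     # squared grid distance to the winner
        h = np.exp(-d2 / (2 * sigma ** 2))                # neighborhood function
        W += eps * h[:, None] * (x - W)                   # adaptation step
    return W.reshape(rows, cols, -1)

# Usage on random 2D data
X = np.random.default_rng(1).normal(size=(500, 2))
prototypes = train_som(X)
print(prototypes.shape)   # (7, 7, 2)
```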
Self organizing step
140
Self organizing step
The map is organized such that the following properties hold:
– Each neuron specializes in a portion of the input space
– Similar data have nearby projections on the map
141
Data visualization using SOMs
• Visualization of clusters and shape of the data (projections, U-matrices and other distance matrices)
• Visualization of components/variables (scatter plots, components planes)
• Visualization of data projections
142
Example : Countries of the world
143
Example: Iris
144
Example: Iris
145
Example: Iris
146
Example: Iris
147
Dimensionality Reduction:
Singular Value Decomposition
Recall (1)
• Why dimensionality reduction? • Some features may be irrelevant • We want to visualize high dimensional data • “Intrinsic” dimensionality may be smaller
than the number of features
Recall (2) ���Dimensionality Reduction
Techniques • Visualize, categorize, or simplify large datasets. • Principal Component Analysis: Finds the dimensions that capture the most
variance
• MDS: Finds data points in a lower-dimensional space that best preserve the inter-point distances.
• Isomap: Estimates the distance between two points on a manifold by following a chain of points with shorter distances between them. (More accurate in representing global distances than LLE; slower than LLE)
• LLE (Local Linear Embedding): Only worries about representing the distances between local points. Faster than Isomap (no worry about global distances).
• Hessian Eigenmaps • Laplacian Eigenmaps • Charting • SOM (Kohonen's Self-Organizing Map)
Dimensionality Reduction
• The input space of many learning problems is of high dimensionality.
• This has computational implications and makes it difficult to find the intrinsic information content.
• Dimensionality reduction methods usually try to address these two problems at the same time.
• Context: unsupervised learning. Given a collection of data points in an n-dimensional space, find a good representation of the data in an r-dimensional space (r < n)
Choosing sets of features
• Score each feature • Forward/Backward elimination
– Choose the feature with the highest/lowest score – Re-score other features – Repeat
• If you have lots of features (text mining for ex.) – Just select top K scored features
Unsupervised feature selection
• Differs from feature selection in two ways: – Instead of choosing a subset of the features, – Create new features (dimensions) defined as functions over all features – Don't consider class labels, just the data points
SVD: the problem • For high-dimensional data, can we use PCA? – for example: images (d ≥ 10⁴), gene expression data, text data, …
• The problem: – The covariance matrix has size d × d, so the computation is expensive.
• Solution: – Singular Value Decomposition (SVD) - an efficient algorithm available in Matlab, R, SAS…
Definition
A[n×m] = U[n×r] Λ[r×r] (V[m×r])^T
• A: n×m matrix (e.g., n documents, m terms) • U: n×r matrix (n documents, r concepts) • Λ: r×r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix) • V: m×r matrix (m terms, r concepts)
SVD : schema
SVD - Interpretation ‘documents’, ‘terms’ and ‘concepts’: • U: document-to-concept similarity matrix • V: term-to-concept similarity matrix • Λ: its diagonal elements: ‘strength’ of each
concept Projection: • best axis to project on: (‘best’ = min sum of
squares of projection errors)
In matlab: • svd(A); • svds(A) – for sparse data.
• The svd command computes the matrix singular value decomposition.
• s = svd(X) returns a vector of singular values. • [U,S,V] = svd(X) produces a diagonal matrix S of the same
dimension as X, with nonnegative diagonal elements in decreasing order, and unitary matrices U and V so that X = U*S*V'.
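An equivalent sketch in Python with numpy, showing the shapes described above on a hypothetical small document-term matrix (the data are made up for illustration):

```python
import numpy as np

# n = 4 documents, m = 3 terms
A = np.array([
    [1, 1, 0],
    [2, 2, 0],
    [0, 0, 3],
    [0, 0, 4],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)           # (4, 3) (3,) (3, 3)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # A = U * S * V^T reconstructs A
```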
Conclusions • Dimensionality reduction (DR) for high-dimensional data; • DR for visualization and analysis; • Can use feature selection methods (Course 8); • Noise detection and elimination (Course 8); • Choose a technique depending on the data type and structure;
• For unbalanced data, use subsampling of the majority class
• References: lectures/papers of Jure Leskovec, Christos Faloutsos and Tom Mitchell