
Page 1: Multivariate description Visualisation Reduction of dimensionalitybelanche/Docencia/mineria/English... · 2008-09-30 · Data exploration: Visualisation + “clustering” • Data

Multivariate description, Visualisation, Reduction of dimensionality

Data Mining course, Master in Information Technologies

Enginyeria Informàtica

Tomàs Aluja


Two types of datasets to analyze

Data in Data Mining: massive, secondary, not random, with errors and missing values.

• Data to explore: organised in topics (socio-economic, opinions, products).

• Data to model: inputs and output(s).

Course DM: Multivariate Visualisation. T. Aluja


Data exploration: Visualisation + “clustering”

• Data contain information about the generating phenomenon.

• Visualisation. The human eye …
  – We accept a loss of information in exchange for a gain in interpretability.

• Synthesis of reality (clustering)
  – Reality is complex; we make it operational by simplifying it into a limited number of clusters.

Snow’s Cholera Map, 1855


South and North Korea at night

South Korea: guess where Seoul is.

North Korea: notice how dark it is.


Graph visualisation

Ggobi project


Parallel coordinates of IRIS data


Iris versicolor

Iris virginica

Iris setosa


Visualisation of the table: Barcelona (BCN) quarters × profession of inhabitants


Spanish Inquisition 1567-1600: sentences and crimes


Visualisation of international cities according to their salaries. USB 1994.


Microarray data: 64 cancers, 6830 genes (chromatography)


M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1), 1991.

Reconstitution of images


Actual image


Reconstituted image


Monitoring of the inner temperatures of Lascaux cave (France): 


Multivariate Visualisation: selection of the active topic

• Exploratory situation (without a response variable but with illustrative variables).

[Figure: data matrix of n individuals (rows) by p variables (columns), with the variables split into active variables and illustrative variables.]


Active topic             Multivariate technique

Continuous variables     PCA - Principal Component Analysis

Count variables          CA - (Simple) Correspondence Analysis

Categorical variables    MCA - Multiple Correspondence Analysis


PCA, CA, MCA can be useful for …

• Visualisation of the information contained in a data matrix

• Detection of “outliers”

• Reduction of the dimensionality (feature selection)

• Image compression

• Extraction of new derived (latent) variables, “feature extraction”

• Smoothing of data (error reduction, avoiding collinearity)

• A first step in preparing the explanatory variables for modelling


Principal Component Analysis

• Cloud of points associated to the rows of the data matrix

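• Total information contained in the cloud of points: the inertia with respect to the centre of gravity G

[Figure: the data matrix X (n rows, p columns) and the cloud of its rows i, i' plotted in $R^p$ on the axes var1, var2, var3.]

Harold Hotelling (1895-1973), American statistician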
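
As a small numerical illustration of the inertia of a cloud of points, the sketch below computes it as the weighted mean squared distance to G and compares it with the sum of the variances of the variables. It assumes uniform weights 1/n; the toy matrix and the function names are made up for the example and are not part of the course material.

```python
def centre_of_gravity(X):
    """Mean of each column: the point G of the cloud of rows."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]


def total_inertia(X):
    """Inertia with respect to G, every individual weighted 1/n (assumption)."""
    n = len(X)
    g = centre_of_gravity(X)
    return sum(sum((x - gj) ** 2 for x, gj in zip(row, g)) for row in X) / n


def column_variances(X):
    """Variance of each variable, with divisor n."""
    n = len(X)
    g = centre_of_gravity(X)
    return [sum((row[j] - g[j]) ** 2 for row in X) / n for j in range(len(g))]


# Toy data (hypothetical): 3 individuals described by 2 variables.
X = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
print(total_inertia(X), sum(column_variances(X)))
```

With uniform weights the two printed numbers coincide: the total inertia is the sum of the variances, which is why a normed analysis on p standardized variables has total inertia p.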


• Purpose:
  – To project the cloud of points onto a subspace (a plane) that retains as much as possible of the information in the original cloud.


Principal Component Analysis

• Fit criterion
  – Find the subspace maximizing the projected inertia.

• Decomposition of the inertia along orthogonal directions (factorial axes):

$$I_{total} = I_1 + I_2 + \dots + I_p, \qquad I_1 > I_2 > \dots > I_p$$


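Fit in $R^p$

Let X be the data matrix: centered or standardized.

Let $u \in R^p$ be the unit vector defining the direction maximizing the projected inertia, and let $\psi = Xu$ be the projections of the individuals on it:

$$\max_u \sum_{i=1}^{n} p_i \psi_i^2 = \psi' N \psi = u' X' N X u, \qquad u'u = 1$$

where $N = \mathrm{diag}(p_i)$ holds the weights of the individuals and

$$X' N X = \begin{cases} \mathrm{Cov}(X) & \text{(centered data)} \\ \mathrm{Cor}(X) & \text{(standardized data)} \end{cases}$$

The solution is the diagonalization of the correlation (or covariance) matrix:

$$X' N X u = \lambda u \;\rightarrow\; \lambda_1, \dots, \lambda_r; \; u_1, \dots, u_r; \qquad r = \mathrm{rank}(X)$$

$$\max_{u_1' u_1 = 1} u_1' X' N X u_1 = \lambda_1$$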
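
The eigenproblem $X'NXu = \lambda u$ can be illustrated with a short power iteration. This is only a sketch under stated assumptions: uniform weights $N = I/n$, a toy data matrix invented for the example, and power iteration chosen for brevity rather than because the course prescribes it. The function names are hypothetical.

```python
import math


def covariance(X):
    """X'NX for the centered data, with uniform weights N = I/n (assumption)."""
    n, p = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(p)]
    Xc = [[r[j] - means[j] for j in range(p)] for r in X]
    return [[sum(r[a] * r[b] for r in Xc) / n for b in range(p)] for a in range(p)]


def first_axis(S, iters=500):
    """Power iteration on the symmetric matrix S: returns (lambda_1, u_1)."""
    p = len(S)
    u = [1.0] + [0.0] * (p - 1)  # arbitrary start, must not be orthogonal to u_1
    for _ in range(iters):
        w = [sum(S[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    # Rayleigh quotient u'Su gives the projected inertia on the axis.
    lam = sum(u[a] * sum(S[a][b] * u[b] for b in range(p)) for a in range(p))
    return lam, u


# Toy data (hypothetical): 5 individuals, 2 positively correlated variables.
X = [[2.0, 1.0], [0.0, 0.0], [-2.0, -1.0], [1.0, 2.0], [-1.0, -2.0]]
S = covariance(X)
lam1, u1 = first_axis(S)
print(lam1, u1)
```

The returned $\lambda_1$ is the inertia projected on the first axis, and the trace of S (the total inertia) bounds it from above.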


[Figure: multidimensional cloud in $R^p$ with factorial axes 1 and 2.]

Direction maximizing the projected inertia: $u_1$. Direction maximizing the projected inertia orthogonal to $u_1$: $u_2$, ...

Principal components (derived latent variables, factors, …):

$$\psi_\alpha = X u_\alpha$$

Inertia of a PC:

$$\psi_\alpha' N \psi_\alpha = \lambda_\alpha$$

Assessing the importance of the orthogonal directions: scree plot of the eigenvalues.

[Figure: "Scree Plot of Clarity-Quality"; x-axis: Component Number (1 to 6), y-axis: Eigenvalue (0 to 3).]


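Cloud of points associated with the columns of a data matrix, in $R^n$

[Figure: the data matrix X (n rows, p columns) and the cloud of the variables in $R^n$ on the axes ind1, ind2, ind3, shown for centered variables (var j, with $s_j$) and for standardized variables. The vectors x, y, z, w carry the annotations: highly correlated variables; orthogonal variable; very negative correlation with x and y.]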


Fit in $R^n$ (standardized data)

[Figure: the original cloud of the variables v1, v2, v3, v4 and its projection onto the first factorial plane (Axis 1, Axis 2).]

Optimal joint visualisation of the correlations between variables.


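Fit in $R^n$

Data matrix X: centered or standardized.

Let $v \in R^n$ be the unit vector defining the direction maximizing the inertia, and let $\varphi = X' N^{1/2} v$ be the projections of the variables on it:

$$\max_v \sum_{j=1}^{p} \varphi_j^2 = \varphi' \varphi = v' N^{1/2} X X' N^{1/2} v, \qquad v'v = 1$$

$$\Rightarrow \; N^{1/2} X X' N^{1/2} v = \lambda v$$

$$\varphi_\alpha = X' N^{1/2} v_\alpha, \qquad \varphi_\alpha' \varphi_\alpha = \lambda_\alpha$$

Transition relationships between both fits:

$$u_\alpha = \lambda_\alpha^{-1/2} X' N^{1/2} v_\alpha, \qquad v_\alpha = \lambda_\alpha^{-1/2} N^{1/2} X u_\alpha$$

Indirect projection formulas:

$$\varphi_\alpha = \lambda_\alpha^{1/2} u_\alpha, \qquad \psi_\alpha = \lambda_\alpha^{1/2} N^{-1/2} v_\alpha$$

Interpreting the projections:

$$\varphi_\alpha = \lambda_\alpha^{-1/2} X' N \psi_\alpha, \qquad \varphi_{j\alpha} = \begin{cases} \mathrm{cor}(x_j, \psi_\alpha) & \text{(standardized data)} \\ s_j \, \mathrm{cor}(x_j, \psi_\alpha) & \text{(centered data)} \end{cases}$$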


PCA is a device for finding artificial latent variables from the observed ones.

[Figure: PCA links the real world (observed variables Var. 1, Var. 2, …, Var. p) to the world of ideas, concepts, theories, … (factors Fac. 1, …, Fac. q).]

PCA:

$$\psi_\alpha = u_{1\alpha} x_1 + u_{2\alpha} x_2 + \dots + u_{p\alpha} x_p$$

$$\Psi_{(n,p)} = X_{(n,p)} \, U_{(p,p)}$$

But only the first q factors convey structural information; the remaining ones are noise.


PCA in practice

• Role of the variables: normed or non-normed analysis
  – Normed PCA gives all variables the same importance; we achieve this by standardizing the data (diagonalization of the correlation matrix).
  – Non-normed PCA gives each variable an importance proportional to its standard deviation. We achieve this by working with the merely centered data matrix (diagonalization of the covariance matrix).

• Which variables to analyze?
  – This is the most crucial decision. Often the information contained is obvious; then try to perform partial analyses. PCA is a device for exploration.
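
The difference between the normed and the non-normed analysis can be seen on the matrix that gets diagonalized. The sketch below is an illustration only: it assumes uniform weights 1/n, uses invented data in which one variable has a much larger scale, and the helper names are hypothetical.

```python
import math


def center(X):
    """Subtract the column means."""
    n, p = len(X), len(X[0])
    m = [sum(r[j] for r in X) / n for j in range(p)]
    return [[r[j] - m[j] for j in range(p)] for r in X]


def standardize(X):
    """Center and divide each column by its standard deviation (divisor n)."""
    Xc = center(X)
    n, p = len(Xc), len(Xc[0])
    s = [math.sqrt(sum(r[j] ** 2 for r in Xc) / n) for j in range(p)]
    return [[r[j] / s[j] for j in range(p)] for r in Xc]


def cross_product(X):
    """X'X / n: covariance matrix if X is centered, correlation matrix if standardized."""
    n, p = len(X), len(X[0])
    return [[sum(r[a] * r[b] for r in X) / n for b in range(p)] for a in range(p)]


# Toy data (hypothetical): the second variable is on a much larger scale.
X = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]
cov = cross_product(center(X))       # diagonalized by the non-normed PCA
cor = cross_product(standardize(X))  # diagonalized by the normed PCA
print(cov)
print(cor)
```

In the covariance matrix the large-scale variable dominates the diagonal, so it would drive the first axis of a non-normed analysis; in the correlation matrix both variables have unit variance.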


PCA in practice

• How many factorial directions are significant?
  – Difficult to assess. How many axes remain stable with independent data?
  – Use the scree plot.
  – Perform random perturbation of the data to assess stability.

• How to interpret the axes
  – The significant axes convey structural (deterministic) information about the phenomenon under study, and they can be interpreted and given a name (this is the most appealing outcome).
  – Interpretation is done on the basis of the correlations between the principal components (the new artificial latent variables) and the original variables; a PC is a mean variable of the variables most correlated with it.
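
When reading a scree plot it often helps to look at the share of the total inertia carried by each axis. The following sketch computes those shares; the eigenvalues are hypothetical numbers chosen for the example (they sum to 6, as they would in a normed PCA of 6 variables), and the function name is an assumption.

```python
def inertia_shares(eigenvalues):
    """Return (share, cumulative share) of the total inertia for each axis."""
    total = sum(eigenvalues)
    shares, cumulative, running = [], [], 0.0
    for lam in sorted(eigenvalues, reverse=True):
        running += lam
        shares.append(lam / total)
        cumulative.append(running / total)
    return shares, cumulative


# Hypothetical eigenvalues of a normed PCA on 6 variables.
eigenvalues = [3.0, 1.5, 0.75, 0.45, 0.2, 0.1]
shares, cumulative = inertia_shares(eigenvalues)
for k, (s, c) in enumerate(zip(shares, cumulative), start=1):
    print(f"axis {k}: {100 * s:5.1f}% of inertia, {100 * c:5.1f}% cumulated")
```

Such percentages only describe the decomposition; whether an axis is stable still has to be judged with the scree plot or with perturbation of the data, as the slide says.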


Projection of illustrative variables

• Continuous
  – We depict their correlations with the factorial axes.

• Categorical
  – We represent a categorical variable by the set of centres of gravity of the subclouds of individuals corresponding to each level of the categorical variable.

Very useful … It allows each illustrative variable to be related to the active topic as a whole.
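
The projection of a categorical illustrative variable can be sketched as a group-by average of the individuals' coordinates. The coordinates and the variable `region` below are invented for the example, uniform weights are assumed, and the function name is hypothetical.

```python
def level_centroids(scores, levels):
    """Centre of gravity, on each factorial axis, of the individuals of each level."""
    groups = {}
    for row, level in zip(scores, levels):
        groups.setdefault(level, []).append(row)
    return {
        level: [sum(col) / len(rows) for col in zip(*rows)]
        for level, rows in groups.items()
    }


# Hypothetical coordinates of 5 individuals on the first two factorial axes,
# and a hypothetical illustrative categorical variable.
scores = [[1.0, 0.5], [2.0, -0.5], [-1.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
region = ["north", "north", "south", "south", "south"]
centroids = level_centroids(scores, region)
print(centroids)
```

Each level is then plotted at its centroid on the factorial plane, without having taken part in the construction of the axes.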


Finding the PCA solution iteratively (NIPALS)

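Initialize X_1 ← X
For h = 1, ..., r = rank(X)
    ψ_h = mean column of X_h
    Repeat until convergence of u_h
        u_h = X_h' ψ_h
        u_h = u_h / |u_h|
        ψ_h = X_h u_h
    X_{h+1} = X_h − ψ_h u_h'

[Figure: X_h' maps ψ_h in $R^n$ to u_h in $R^p$, and X_h maps u_h back to ψ_h.]

At convergence: $X_h' X_h u_h \propto u_h$ and $X_h X_h' \psi_h \propto \psi_h$.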
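
A minimal runnable sketch of the NIPALS loop is given below. It assumes a centered data matrix with uniform weights; the tolerance, the iteration cap, the fallback start vector (used when the mean column is null) and the toy data are all choices made for this example rather than part of the slide.

```python
import math


def nipals(X, n_components, tol=1e-12, max_iter=1000):
    """NIPALS on a centered matrix X (list of rows); returns [(psi_h, u_h), ...]."""
    Xh = [row[:] for row in X]
    n, p = len(Xh), len(Xh[0])
    result = []
    for _ in range(n_components):
        # Start from the mean column of X_h (the row means);
        # fall back to the first column if that vector is null (assumption).
        psi = [sum(row) / p for row in Xh]
        if sum(v * v for v in psi) < tol:
            psi = [row[0] for row in Xh]
        u_old = [0.0] * p
        for _ in range(max_iter):
            u = [sum(Xh[i][j] * psi[i] for i in range(n)) for j in range(p)]
            norm = math.sqrt(sum(v * v for v in u))
            u = [v / norm for v in u]
            psi = [sum(Xh[i][j] * u[j] for j in range(p)) for i in range(n)]
            if sum((a - b) ** 2 for a, b in zip(u, u_old)) < tol:
                break
            u_old = u
        # Deflation: remove the part of X_h explained by this component.
        Xh = [[Xh[i][j] - psi[i] * u[j] for j in range(p)] for i in range(n)]
        result.append((psi, u))
    return result


# Toy centered data (hypothetical): 5 individuals, 2 variables.
X = [[2.0, 1.0], [0.0, 0.0], [-2.0, -1.0], [1.0, 2.0], [-1.0, -2.0]]
components = nipals(X, 2)
for psi, u in components:
    lam = sum(v * v for v in psi) / len(X)  # inertia of the component, weights 1/n
    print(u, lam)
```

The sketch does not guard against asking for more components than the rank of X, in which case the deflated matrix becomes null and the normalization would divide by zero.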


A relevant application: Google

• Google™ uses SVD to accelerate finding relevant web pages. Define a web site as an authority if many sites link to it. Define a web site as a hub if it links to many sites. We want to compute a ranking $x_1, \dots, x_N$ of authorities and $y_1, \dots, y_M$ of hubs.

As a first pass, we can compute the ranking scores as follows: $x_i^0$ is the number of links pointing to $i$ and $y_i^0$ is the number of links going out of $i$. But not all links should be weighted equally. For example, links from authorities (or hubs) should count more. So we can revise the rankings as follows:

$$x^1 = A' y^0, \qquad y^1 = A x^0$$

where $A$ is the adjacency matrix, with $a_{ij} = 1$ if $i$ links to $j$ (of order $10^9$).

But an authority also depends on the pages linking to the pages that link to it. Hence, iterating:

$$x^k = A' A \, x^{k-1}, \qquad y^k = A A' \, y^{k-1}$$
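
The iteration can be tried on a tiny graph. The sketch below is an illustration under assumptions: the 4-page web is invented, the scores are renormalized at each step (an addition, to keep the numbers bounded), and the hub update uses the freshly updated authority scores, which is a common variant of the scheme.

```python
import math


def hits(A, iterations=50):
    """Iterate x <- A'y, y <- Ax with normalization; A[i][j] = 1 if i links to j."""
    n = len(A)
    x = [1.0] * n  # authority scores
    y = [1.0] * n  # hub scores
    for _ in range(iterations):
        x = [sum(A[i][j] * y[i] for i in range(n)) for j in range(n)]
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        nx = math.sqrt(sum(v * v for v in x))
        ny = math.sqrt(sum(v * v for v in y))
        x = [v / nx for v in x]
        y = [v / ny for v in y]
    return x, y


# Hypothetical 4-page web: pages 0, 1 and 2 link to page 3; page 0 also links to page 1.
A = [
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
authority, hub = hits(A)
best_authority = max(range(len(A)), key=lambda i: authority[i])
best_hub = max(range(len(A)), key=lambda i: hub[i])
print(best_authority, best_hub)
```

Starting from vectors of ones, the first update gives exactly the in-degrees described in the text; the limits are the dominant eigenvectors of $A'A$ and $AA'$, that is, the first singular vectors of $A$.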


prcomp(x, retx = TRUE, center = TRUE, scale. = FALSE, tol = NULL, ...)

Arguments:

x: a numeric matrix (or data frame) which provides the data.

retx: a logical value indicating whether the rotated variables should be returned.

center: a logical value indicating whether the variables should be shifted to be zero centered.

scale.: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place.

tol: a value indicating the magnitude below which components should be omitted.

Attributes:

sdev: the standard deviations of the principal components (i.e., the square roots of the eigenvalues).

rotation: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).

x: if 'retx' is true the value of the rotated data (the centred (and scaled if requested) data multiplied by the 'rotation' matrix) is returned.


biplot(x, y, var.axes = TRUE, main = NULL, ...)

Arguments:

x: The first set of points (a two-column matrix), usually associated with observations.

y: The second set of points (a two-column matrix), usually associated with variables.

var.axes: If 'TRUE' the second set of points have arrows representing them as (unscaled) axes.


Beyond PCA ⇒ MCA

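• PCA only analyzes continuous variables through their correlations; hence it can only reveal linear relationships between variables.

• Thus, transform the original variables: recode them as ordinal categories to take non-linearities into account.

[Figure: variable j is cut into categories $a_{j1}, \dots, a_{jk}$; a value $x_{ij}$ is recoded as an indicator vector such as 001000, and the factors Ψ become a function f(X) of the original variables.]

Ludovic Lebart, French statistician, promoter of MCA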


MCA of hypercubes

• Dimensions (= categorical variables)

• Measured variables in cells (= responses; they may be continuous or categorical)

• The hypercube can be explicit or implicit in a relational DB.

Hypercube dimensions    Numerical coding (binning) (= Z)

                        A1 A2 A3 A4   B1 B2   C1 C2 C3
A1 B1 C1                1  0  0  0    1  0    1  0  0
A1 B2 C2                1  0  0  0    0  1    0  1  0
A3 B1 C3                0  0  1  0    1  0    0  0  1
A2 B2 C1                0  1  0  0    0  1    1  0  0
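
The numerical coding of the hypercube dimensions can be reproduced with a few lines of code. The sketch below builds the indicator table Z for the four cells listed above; the function name is an assumption, and the level sets are taken from the column headers of the coding.

```python
def indicator_matrix(rows, levels):
    """Complete disjunctive coding: one 0/1 column per level of each dimension."""
    Z = []
    for row in rows:
        coded = []
        for value, dim_levels in zip(row, levels):
            coded.extend(1 if value == lev else 0 for lev in dim_levels)
        Z.append(coded)
    return Z


# The four cells of the example, with the level sets of the three dimensions.
levels = [["A1", "A2", "A3", "A4"], ["B1", "B2"], ["C1", "C2", "C3"]]
rows = [
    ("A1", "B1", "C1"),
    ("A1", "B2", "C2"),
    ("A3", "B1", "C3"),
    ("A2", "B2", "C1"),
]
Z = indicator_matrix(rows, levels)
for row in Z:
    print("".join(map(str, row[:4])), "".join(map(str, row[4:6])), "".join(map(str, row[6:])))
```

Every row of Z sums to the number of dimensions (here 3), which is the row total p used in the MCA formulas.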


• Active variables: dimensions

• Illustrative variables: responses

[Figure: data matrix of n individuals by p variables, made of the dimensions (coded as indicators, e.g. 1000 10 100) followed by the response variables.]

We will visualize the responses on the grid provided by the dimensions.


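MCA: active grid

[Figure: three active categorical variables, Age (Edad), socio-professional category (CSP) and Income level (Nivel de ingresos). An individual coded 2 1 3 becomes the indicator row 010 100 001. The indicator table has n rows, column totals $n_j$, row totals p and grand total np. The factorial plane displays the modalities Ed1, Ed2, Ed3, CSP1, CSP2, CSP3, ing1, ing2, ing3.]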


MCA as a non-linear PCA


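$Z$: indicator matrix with rows $i = 1, \dots, n$ (individuals) and columns $j = 1, \dots, J$ (the modalities of the $p$ variables), e.g. the row 0010 01 010. Every row sums to $p$, column $j$ sums to $n_j$, and the grand total is $np$.

$$n_j = \sum_{i=1}^{n} z_{ij}, \qquad D = \mathrm{diag}(n_1, \dots, n_J), \qquad p_i = \frac{p}{np} = \frac{1}{n}$$

Row profile: $\frac{1}{p} Z$

Chi-square metric: $\left( \frac{D}{np} \right)^{-1}$

$$\psi = \frac{1}{p} Z \, np \, D^{-1} u$$

$$\max_u \sum_{i=1}^{n} p_i \psi_i^2 = n \, u' D^{-1} Z' Z D^{-1} u, \qquad u' \, np \, D^{-1} u = 1$$

$$\Rightarrow \; \frac{1}{p} Z' Z D^{-1} u = \lambda u$$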


What are the factors in MCA?

[Figure: in $R^n$, the factor Ψ and the quantified variables $z_1$, $z_2$, $z_3$ obtained from Age (Edad), CSP and Income level (Nivel de ingresos).]

$$\max_u \sum_{j=1}^{p} \mathrm{cor}^2(\Psi, z_j), \qquad z_j = \sum_k u_{jk} a_{jk}$$

⇒ Optimal quantification of the categorical variables.

MCA: original categorical data → equivalent continuous factors.

But we will work with more dimensions than in PCA.


Interesting properties of the MCA displays

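• Every individual is the centre of gravity (cdg) of the modalities they have chosen (apart from a multiplicative factor $1/\sqrt{\lambda_\alpha}$):

$$\psi_{i\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{j=1}^{J} \frac{z_{ij}}{p} \, \varphi_{j\alpha}$$

• A modality (= level) is the centre of gravity of the individuals having chosen it (apart from a multiplicative factor $1/\sqrt{\lambda_\alpha}$):

$$\varphi_{j\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} \frac{z_{ij}}{n_j} \, \psi_{i\alpha}$$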


MCA iterative algorithm

Initialize Y_0 ← Z; Z ← [Z_1, ..., Z_p]; D = diag(Z'Z); D_k = Z_k' Z_k
For h = 1, ..., rank(Z)
    ψ_h = row mean of Y_{h-1}
    Repeat until convergence of u_h
        u_h = D^{-1} Y_{h-1}' ψ_h
        u_h = u_h / |u_h|
        ψ_h = (1/p) Y_{h-1} u_h
    Y_h = Y_{h-1} − ψ_h u_h'
    u_k = D_k^{-1} Z_k' ψ_h; z_k = Z_k u_k, for k = 1, ..., p


Projection of the illustrative variables

• Continuous
  – From their correlations with the factorial axes.

• Categorical
  – As the set of centres of gravity (cdg) of the individuals having chosen each level of the categorical variable.


library(MASS)
mca(df, nf = 2, abbrev = FALSE)

Arguments:

df: A data frame containing only factors.

nf: The number of dimensions for the MCA.

Attributes:

rs: The coordinates of the rows, in 'nf' dimensions.

cs: The coordinates of the column vertices, one for each level of each factor.

fs: Weights for each row, used to interpolate additional factors in 'predict.mca'.

d: The singular values for the 'nf' dimensions.


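Simple Correspondence Analysis

Analysis of cross-tables

[Figure: the cross-table Quarters × CSP with counts $n_{kj}$, margins $n_k$ and $n_j$, and total n, obtained by crossing the two indicator blocks (Quarters and CSP, e.g. 00010000 00000010000) of the n individuals; $z_1$ and $z_2$ are the two quantified variables in $R^n$.]

$$\max \; \mathrm{cor}^2(z_1, z_2), \qquad z_1 = \sum_k u_k a_k, \qquad z_2 = \sum_j v_j b_j$$

Jean-Paul Benzécri, father of Analyse des Données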


library(MASS)
corresp(x, data, ...)

Arguments:

x: A two-way frequency table. Currently accepted forms are matrices, data frames ...

nf: The number of factors to be computed (max. value = min(nrow-1, ncol-1)).
