Multidimensional scaling (MDS)
Source: people.math.umass.edu/~anna/stat697F/Chapter10_part3.pdf
Multidimensional scaling (MDS)

Just like SOMs and principal curves or surfaces, MDS aims to map data points in R^p to a lower-dimensional coordinate system. However, MDS approaches the problem somewhat differently.
- Let x_1, ..., x_N ∈ R^p be observations and d_{ij} be the distance between observations i and j. MDS seeks values z_1, z_2, ..., z_N ∈ R^k to minimize the stress function

    S_M(z_1, z_2, ..., z_N) = \sum_{i \ne i'} (d_{ii'} - \|z_i - z_{i'}\|)^2.

  This is known as least squares or Kruskal-Shepard scaling.
- Sammon mapping:

    S_{Sm}(z_1, z_2, ..., z_N) = \sum_{i \ne i'} \frac{(d_{ii'} - \|z_i - z_{i'}\|)^2}{d_{ii'}},

  where more emphasis is put on preserving smaller pairwise distances.
- Classical scaling:

    S_C(z_1, z_2, ..., z_N) = \sum_{i, i'} (s_{ii'} - \langle z_i - \bar{z}, z_{i'} - \bar{z} \rangle)^2,

  where s_{ii'} is the similarity between x_i and x_{i'} and is usually defined as the centered inner product s_{ii'} = \langle x_i - \bar{x}, x_{i'} - \bar{x} \rangle.
- Shepard-Kruskal nonmetric scaling seeks to minimize

    S_{NM}(z_1, z_2, ..., z_N, \theta) = \frac{\sum_{i \ne i'} [\|z_i - z_{i'}\| - \theta(d_{ii'})]^2}{\sum_{i \ne i'} \|z_i - z_{i'}\|^2}

  over the z_i and an arbitrary increasing function \theta.
- Classical scaling with the centered inner product is equivalent to principal components. It is not equivalent to least squares scaling, in which the mapping can be nonlinear.
- Nonmetric scaling effectively uses only the ranks of the distances, rather than the actual dissimilarities or similarities.
- MDS tries to preserve all pairwise distances, while principal surfaces and SOMs do not.
- MDS requires only the dissimilarities d_{ij}, in contrast to the SOM and principal curves and surfaces, which need the data points x_i.
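The classical scaling criterion above has a closed-form solution via an eigendecomposition, which is also the sense in which it matches principal components. The sketch below (function name and the unit-square example are illustrative choices, assuming numpy is available) double-centers the squared distances to recover the centered inner products s_{ii'} and keeps the top-k eigenvectors:

```python
# Minimal sketch of classical scaling: double-center the squared distances,
# eigendecompose, and keep the k leading eigenvectors. For Euclidean input
# distances the recovered configuration reproduces them exactly.
import numpy as np

def classical_mds(D, k=2):
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # centered inner products s_{ii'}
    w, V = np.linalg.eigh(B)                   # eigenvalues in ascending order
    w, V = w[::-1][:k], V[:, ::-1][:, :k]      # keep the k largest
    return V * np.sqrt(np.maximum(w, 0.0))     # coordinates z_1, ..., z_N

# Distances among the four corners of the unit square are reproduced exactly:
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = classical_mds(D, k=2)
D_hat = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
```

The recovered z_i may be rotated or reflected relative to the original points; only the pairwise distances are determined.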
Finding latent variables of multivariate data
Multivariate data are often viewed as multiple indirect measurements arising from an underlying source, which typically cannot be directly measured. Examples include the following:
- Educational and psychological tests use the answers to questionnaires to measure the underlying intelligence and other mental abilities of subjects.
- EEG brain scans measure the neuronal activity in various parts of the brain indirectly via electromagnetic signals recorded at sensors placed at various positions on the head.
- The trading prices of stocks change constantly over time, and reflect various unmeasured factors such as market confidence, external influences, and other driving forces that may be hard to identify or measure.
PCA has a latent variable representation

- The correlated X_j are each represented as a linear expansion in the uncorrelated, unit-variance variables S_l:

    X_j = a_{j1} S_1 + a_{j2} S_2 + \cdots + a_{jp} S_p, \quad j = 1, ..., p.

- The problem with PCA latent variables is that they are not unique: any orthogonal transformation of S_1, ..., S_p is also uncorrelated with unit variance and satisfies the PCA expansion.
Factor analysis

Factor analysis models each X_j in terms of q < p latent variables:

    X_j = a_{j1} S_1 + \cdots + a_{jq} S_q + \varepsilon_j, \quad j = 1, ..., p.

The idea is that the latent variables S_l are common sources of variation amongst the X_j, and account for their correlation structure, while the uncorrelated \varepsilon_j are unique to each X_j and pick up the remaining unaccounted variation.

- Factor analysis faces the same problem as PCA, that is, any orthogonal transformation of S_1, ..., S_q is also uncorrelated with unit variance and satisfies the factorization equation.
- This leaves a certain subjectivity in the use of factor analysis, since the user can search for rotated versions of the factors that are more easily interpretable. This aspect has left many analysts skeptical of factor analysis and may account for its lack of popularity in contemporary statistics.
Differences between PCA and factor analysis
Because of the separate disturbances \varepsilon_j for each X_j, factor analysis can be seen as modeling the correlation structure of the X_j rather than the covariance structure, as PCA does.

Example (Exercise 14.15): Generate 200 observations of the three variates X_1, X_2, X_3 according to
    X_1 = Z_1
    X_2 = X_1 + 0.001 Z_2
    X_3 = 10 Z_3
where Z_1, Z_2, Z_3 are independent standard normal variates. It turns out the leading principal component aligns itself with the maximal-variance direction X_3, while the leading factor essentially ignores the uncorrelated component X_3 and picks up the correlated component X_2 + X_1.
Independent component analysis (ICA)
The ICA model has exactly the same form as PCA,

    X = A S,

except that the S_l are assumed to be statistically independent rather than uncorrelated.
- Since the multivariate Gaussian distribution is determined by its covariance matrix, any Gaussian independent components can be determined only up to a rotation. ICA therefore seeks S_l that are independent and non-Gaussian.
- ICA looks for a sequence of orthogonal projections such that the projected data look as far from Gaussian as possible.
Finding ICA
ICA finds an orthogonal matrix A such that the components of A^T X are as independent as possible. Let Y = A^T X and let I(Y) be the Kullback-Leibler distance between the density g(y) of Y and its independence version \prod_{j=1}^p g_j(y_j), where g_j(y_j) is the marginal density of Y_j:

    I(Y) = \sum_{j=1}^p H(Y_j) - H(Y),

where

    H(Y) = -\int g(y) \log g(y) \, dy

is the entropy of the random variable Y with density g(y).
It turns out that

    I(Y) = \sum_{j=1}^p H(Y_j) - H(X),

since entropy is unchanged by an orthogonal transformation, so H(Y) = H(X).
- Finding A is equivalent to minimizing the sum of the entropies of the separate components of Y.
- A well-known result in information theory says that among all random variables with equal variance, Gaussian variables have the maximum entropy.
- Therefore, finding A is equivalent to maximizing the departure of the components of A^T X from Gaussianity separately.
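The departure-from-Gaussianity search is what FastICA implements in practice. The sketch below is an illustration, not the notes' own derivation: it uses scikit-learn's `FastICA` (an assumed dependency) to unmix two non-Gaussian sources from linear mixtures, with made-up sources and mixing matrix:

```python
# Unmix two non-Gaussian sources from their linear mixtures with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 2000
S = np.column_stack([rng.uniform(-1, 1, n),          # uniform source (non-Gaussian)
                     np.sign(rng.normal(size=n))])   # binary source (non-Gaussian)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                           # arbitrary mixing matrix
X = S @ A.T                                          # observed mixtures

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)

# Each recovered component should match one true source up to order,
# sign, and scale; check via absolute cross-correlations.
C = np.abs(np.corrcoef(S.T, S_hat.T)[:2, 2:])
```

If the sources were Gaussian, no such recovery would be possible: any rotation of independent Gaussians is an equally valid solution, as noted above.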
Subjects wear a cap embedded with a lattice of 100 EEG electrodes, which record brain activity at different locations on the scalp. Figure 14.41 (top panel) shows 15 seconds of output from a subset of nine of these electrodes from a subject performing a standard "two-back" learning task over a 30 minute period. The subject is presented with a letter (B, H, J, C, F, or K) at roughly 1500-ms intervals, and responds by pressing one of two buttons to indicate whether the letter presented is the same as or different from that presented two steps back. Depending on the answer, the subject earns or loses points, and occasionally earns bonus or loses penalty points. The time-course data show spatial correlation in the EEG signals: the signals of nearby sensors look very similar.