What is Independent Component Analysis?

Alan Julian Izenman

Temple University

E-mail address: [email protected]

For David R. Brillinger

July 2003

Abstract

This article describes a relatively new research topic called independent component analysis (ICA), which is becoming very popular in the signal processing literature and amongst those working in machine learning and data mining. The primary focus of ICA is to resolve the classical problem of blind source separation (BSS), in which an unknown mixture of non-Gaussian signals is decomposed into its independent component signals. The classical example of BSS is the so-called cocktail-party problem, where the mixture consists of simultaneous speech signals recorded by a number of microphones. Important applications include biomedical signal processing (usually brain wave activity in the form of EEG and MEG tracings), audio signal separation (mixed speech and music signals), telecommunications (a confusion of signals transmitted by multiple users of mobile phones), financial time series (portfolios of stocks), and data mining (text document analysis). The ICA methodology has much in common with that of projection pursuit (PP).

KEY WORDS: Blind source separation; Brain imaging; Cumulants; FastICA algorithm; Financial time series; Independent factor analysis; Kernel ICA; Kurtosis; Maximum likelihood; Mutual information; Negentropy; Projection pursuit; Signal processing; Supergaussian and subgaussian components; Time series analysis.


1. INTRODUCTION

Independent component analysis (ICA) is a multivariate statistical technique which seeks

to uncover hidden variables in high-dimensional data. As such, it belongs to the class of latent

variable models, such as factor analysis (FA). Furthermore, because of its success in analyzing

signal processing data, ICA can also be regarded as a digital signal transform method.

Although the concept of ICA was introduced in 1982 in a neurophysiological context, its

name was coined by Herault and Jutten (1986). See Jutten (2000) for the early history.

Since then, theoretical insights, computational algorithms, and new applications have been

developed to enhance and understand the ICA technique. Several books (e.g., Cichocki and

Amari, 2003; Hyvarinen, Karhunen, and Oja, 2001; Lee, 1998) and edited volumes (e.g.,

Roberts and Everson, 2001; Girolami, 2000; Nandi, 1999) have appeared and a huge number

of articles have been published on the topic. There is also an international workshop on ICA

and related topics held annually in different countries. As an indication of the popularity

today of ICA, a Google search on "independent component analysis" resulted in almost 1.7

million hits. Yet, we see very little attention paid to ICA in the statistical literature; an

isolated reference is Hastie, Tibshirani, and Friedman (2001, sec. 14.6).

In its most basic form, the ICA model is assumed to be a linear mixture of a number of

unknown hidden source variables, where the mixing coefficients are also unknown. A totally

“blind” approach to determining both the hidden variables and the mixing coefficients solely

from the observed multivariate data fails because the problem as stated is not well-defined.

To build more structure into the problem, we require the hidden variables to be mutually in-

dependent and also (with at most one exception) non-Gaussian. ICA is actually an amalgam

of several related approaches to this problem, and these approaches are characterized by the

types of assumptions visited upon the distributions of the independent source variables and

whether or not a separate noise component should be included in the ICA model.

The signal processing problem of blind source separation (BSS), in which an unknown

mixture of non-Gaussian signals is to be decomposed into its independent component sig-

nals, is closely related to ICA. BSS is similar to the classical electrical engineering problem of

source separation, but in BSS there is no knowledge of the signals that make up the mixture.

The best-known example of BSS is the so-called “cocktail-party problem” (Cherry, 1953). In

this problem, m people are speaking simultaneously at a party, and each of r microphones

placed in the same room at different distances from each speaker records a different mixture of the speakers’ voices at n time points. The question is whether, based upon these

microphone recordings, we can separate out the individual speech signals of each of the m

speakers. Despite the fact that the cocktail-party problem assumes the speakers babble on

independently without considering the presence of other partygoers (who usually speak in

clustered groups), it does give a fairly simplistic explanation of how one can envision BSS

problems.

Amongst the cocktail-party-type problems, ICA has been extensively applied to the study

of the human brain, whose functions “provide the basis of perception and cognition and

underlie emotion and creative expression” (Pechura and Martin, 1991, p. 27). Patterns of

human brain-wave activity can be viewed through noninvasive recordings made by r (usually

around 20, sometimes a lot more) electrodes placed evenly around a subject’s head during

different periods of consciousness and sleep. The electrodes capture a mixture of brain

waves from different areas of the brain, and it is the job of ICA to separate them into

individual source signals. In particular, electroencephalographic (EEG) recordings make it

possible to relate certain types of behavior to changes in the electrical activity of the cerebral

cortex; event-related potential (ERP) recordings are finely-tuned EEGs resulting from the

stimulation of specific visual, auditory, or sensory systems; and magnetoencephalographic

(MEG) recordings measure the magnetic fields that are generated by cortical activity. ICA

applied to EEG, ERP, or MEG recordings assumes that the source signals are statistically

independent and stationary, and that the mixing process is linear and instantaneous.

ICA has also been found to be successful in analyzing the extremely large datasets ob-

tained from functional magnetic resonance imaging (fMRI) experiments (McKeown, Makeig,

Brown, Jung, Kindermann, Bell, and Sejnowski, 1998). Other applications of ICA include

extracting structure from financial stock returns (Back and Weigand, 1997), mapping the

cosmic microwave background anisotropy from satellite radiometric sky maps (Salerno, Be-

dini, Kuruoglu, and Tonazzini, 2002), separating out the effects of major volcanic eruptions

from climate and temperature data (Fodor and Kamath, 2003), Web image retrieval and clas-

sification, wireless communications and speech recognition systems, and agricultural remote

sensing images. Classification of microarray gene expression profiles using ICA methods has

also become a popular research issue.

The technical aspects of ICA in its basic formulation are remarkably similar to those of

exploratory projection pursuit (PP), which was developed over a decade earlier than ICA,


first by Kruskal (1969, 1972) and then by Friedman and Tukey (1974) (who named it). Af-

ter a brief hiatus, there followed a flurry of activity in which PP was studied by Tukey and

Tukey (1981), Friedman and Stuetzle (1982), Huber (1985), Friedman (1987), Jones and Sib-

son (1987), Hall (1989), and Cook, Buja, and Cabrera (1993). Because most low-dimensional

projections of high-dimensional data are approximately Gaussian-distributed (Diaconis and

Freedman, 1984), we should not expect such projections to show unusual patterns or struc-

ture. PP was, therefore, designed to seek out “interesting” low-dimensional (typically, one-

or two-dimensional) orthogonal projections of multivariate data, where the least-interesting

feature is defined to be Gaussianity. In its original incarnation, PP was driven by the desire to

expose specific features (e.g., local concentration, clustering into distinct groups, clumpiness,

or clottedness) which indicated non-Gaussianity of the data. Because an exhaustive search

for such features was clearly impossible, the search was automated. Indexes of interesting-

ness were created and optimized numerically in an attempt to imitate how users intuitively

(by eye) chose interesting projections (see Friedman’s discussion of Huber, 1985). This for-

mulation was later replaced by a search for projections that are as far from Gaussianity as

possible.

ICA and PP methodologies look at the same data in very different ways, yet they both use

the same (or similar) computational tool (numerically optimizing an objective function) to

achieve a common statistical goal of finding low-dimensional, non-Gaussian projections of the

data. Differences between ICA and PP derive from the different problems they were originally

built to solve. For example, ICA was introduced to resolve a separation problem, starting

with the estimation of independent components, while PP was designed as an exploratory

tool for visualization, focussing on dimensionality reduction of a high-dimensional space.

While much of the PP methodology has been incorporated into the ICA toolkit, there has

been little cross-pollination in the other direction. Recent enhancements of the ICA model

which take into account time structure and nonlinearity of the mixing coefficients further

distinguish ICA from PP.

This paper provides an expository account of ICA. In Section 2, the important step of

preprocessing the data using centering and sphering operations is described. Then, in Sec-

tion 3, we discuss the general type of problem for which ICA has been applied. In Section

4, ways of measuring non-Gaussianity are discussed, including skewness- and kurtosis-based

measures and relative entropy, which is a normalized version of entropy. Because entropy


is difficult to estimate directly, and because its major component is an unknown probabil-

ity density function, we, first, need to estimate the underlying source density. Two types

of density estimates are considered, an orthogonal polynomial approximation using a trun-

cated Gram-Charlier expansion, which leads to a moment-based index, and a nonpolynomial

approximation which is used in a FastICA algorithm.

In Section 5, we deal with the linear-mixing, noiseless, ICA model. Specifically, we de-

scribe a FastICA algorithm for extracting a single source component and two extensions

of that algorithm for extracting multiple independent source components. We also show

two methods of computing maximum-likelihood estimates of the independent source com-

ponents, one using the EM algorithm and another using the FastICA algorithm. In Section

6, we discuss the linear-mixing, noisy, ICA model. A special case of this model is the well-

known factor-analysis model, and we describe the principal components approach and the

maximum-likelihood approach using the EM algorithm. We also discuss the independent

factor analysis model, which is a hybrid between factor analysis and ICA. In Section 7, we

show the close relationship between ICA and projection pursuit.

2. CENTERING AND SPHERING

Suppose we observe a random r-vector, X = (X1, · · · , Xr)τ , of correlated measurements

with mean r-vector E(X) = µ and (r×r) covariance matrix cov(X) = ΣXX . Prior to carrying

out PP or ICA applications, we preprocess X so that its r components have commensurate

scales (see, e.g., Tukey and Tukey, 1981).

We do this by first centering X so that its components have zero mean, and then by

sphering (or whitening) the result so that its components are uncorrelated with unit variances.

Sphering is a linear transformation which removes all traces of scale and correlation structure

from X. Consider the spectral decomposition of the covariance matrix, Σ_XX = UΛU^τ, where

the columns of the orthogonal matrix U are the eigenvectors of ΣXX , and Λ is a diagonal

matrix with diagonal elements the eigenvalues of ΣXX . The columns of U and the diagonal

elements of Λ are ordered by the decreasing magnitudes of the eigenvalues of ΣXX . The

(centered and) sphered version of X is given by

X ← Σ_XX^{−1/2} (X − µ),  (1)

where Σ_XX^{−1/2} = UΛ^{−1/2}U^τ. This transformation is equivalent to computing the principal


components of X − µ and then rescaling the principal components to have unit variance.

In other words, we can write (1) as X ← Λ^{−1/2}U^τ(X − µ). If Σ_XX has less than full rank,

only those principal components having nonzero variance would be retained (and rescaled).

A benefit of sphering X is that it is now affine invariant, with µ = 0 and ΣXX = Ir.

In practice, µ and ΣXX will be unknown. Thus, we use n independent observations,

X_1, . . . , X_n, on X to compute X̄ = n^{−1} ∑_{i=1}^{n} X_i and Σ̂_XX = n^{−1} ∑_{i=1}^{n} (X_i − X̄)(X_i − X̄)^τ, respectively. Centering and sphering the data using X_i ← Σ̂_XX^{−1/2}(X_i − X̄), i = 1, 2, . . . , n, transforms an elliptically-shaped symmetric cloud of points into a spherically-shaped cloud.
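As an illustration of this preprocessing step, here is a minimal Python/numpy sketch (the helper name and example data are hypothetical, not from the paper) that centers a data matrix and spheres it via the spectral decomposition described above:

```python
import numpy as np

def center_and_sphere(X):
    """Center and sphere (whiten) an (n x r) data matrix X so that the
    sphered data have sample mean 0 and sample covariance I_r."""
    Xc = X - X.mean(axis=0)                           # centering
    Sigma = Xc.T @ Xc / X.shape[0]                    # sample covariance matrix
    lam, U = np.linalg.eigh(Sigma)                    # spectral decomposition
    keep = lam > 1e-12                                # drop zero-variance directions
    W = U[:, keep] @ np.diag(lam[keep] ** -0.5) @ U[:, keep].T   # Sigma^{-1/2}
    return Xc @ W.T

# Correlated Gaussian data become uncorrelated with unit variances:
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[4.0, 1.5], [1.5, 1.0]], size=2000)
print(np.round(np.cov(center_and_sphere(X), rowvar=False), 2))   # roughly the identity
```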

To reduce the dimensionality of the data, it is commonly advocated that only the first J < r

sphered variables be retained, where J is chosen to explain a certain (high) proportion of

the total variance (see, e.g., Friedman, 1987). If outliers are present, robust versions of the

sphering process are discussed in Tukey and Tukey (1981).

We note that the practice of sphering is somewhat controversial. Although sphering has

computational and interpretational advantages (see, e.g., Friedman, 1987), arguments have

been made that the act of sphering is too closely tied to underlying unimodal (and especially

Gaussian) distributions, an environment we wish to avoid (see, e.g., the comments of Gower,

and Hastie and Tibshirani in the discussion of Jones and Sibson, 1987). However, we follow

PP and ICA practice by assuming that the components of X have been preprocessed to be

mutually uncorrelated, each having zero mean and unit variance.

3. THE GENERAL ICA PROBLEM

In its most general form, the ICA model assumes that X is generated by

X = f(S) + e, (2)

where S = (S1, · · · , Sm)τ is an (unobservable) random m-vector variate of sources whose

components {Sj} are independent latent variables each having zero mean, f : ℝ^m → ℝ^r

is an unknown mixing function, and e is an additive r-vector-valued noise component with

zero mean. Independence of the sources means that each individual source signal is thought

to be generated by a process unrelated to any other source signal. In general, it suffices to

assume that E(S) = 0 and cov(S) = Im.

The BSS problem is to invert f and estimate S. As it stands, this problem is ill-posed

and needs some additional constraints or regularization on S, f , and e. If we take f to be a


linear function, f(S) = AS, where A is a “mixing” matrix, then (2) is described as a linear ICA model, while if f is assumed to be nonlinear, then (2) is described as a nonlinear ICA model. Most applications of ICA assume no additive noise e, and that all noise in the model is to be associated with the components of the random vector S. Such a model is referred to as noiseless ICA. If e is included in (2), the model is described as noisy ICA.

It turns out that the noiseless ICA model with linear mixing, X = AS, can only be solved

if the vector S with independent components is not Gaussian. We can see this by assuming

the contrary. Suppose that the sources, S_1, . . . , S_m, are independent and Gaussian, each with zero mean and unit variance. Their joint density is given by q_S(s) = ∏_{j=1}^{m} q_{S_j}(s_j) = (2π)^{−m/2} e^{−‖s‖²/2}, where ‖s‖² = ∑_j s_j². If the mixing matrix A is square (m = r) and, hence, orthogonal (I_r = Σ_XX = AA^τ, so that A^{−1} = A^τ), then one can show that the density of X = AS is given by p_X(x) = (2π)^{−m/2} e^{−‖A^τ x‖²/2} |det(A^τ)|. But A is orthogonal, and so ‖A^τ x‖² = ‖x‖² and |det(A^τ)| = 1. Thus, the density of X reduces to p_X(x) = (2π)^{−m/2} e^{−‖x‖²/2}, which is identical to the density of S, so that the orthogonal mixing matrix A cannot be identified for independent Gaussian sources. Thus, it makes sense to require

that, with the exception of one component, the remaining independent source components

cannot be Gaussian distributed.
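A small numerical illustration of this identifiability problem may help (a sketch with an arbitrary rotation chosen as the orthogonal mixing matrix; nothing here is specific to the paper): rotating independent Gaussian sources leaves the joint distribution unchanged, so the data carry no information about A.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((100000, 2))            # independent N(0,1) sources

theta = np.pi / 6                               # an arbitrary orthogonal (rotation) matrix
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = S @ A.T                                     # mixed data X = AS

# S and X both have (approximately) zero mean and identity covariance,
# and X is again a vector of independent N(0,1) variables: A is unrecoverable.
print(np.round(np.cov(S, rowvar=False), 2))
print(np.round(np.cov(X, rowvar=False), 2))
```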

There are a number of ways of estimating this type of ICA model while ensuring that the

components of S are as statistically independent and non-Gaussian as possible. Usually, we

are in possession of n repeated r-variate observations, Xi = (Xi1, · · · , Xir)τ , i = 1, 2, . . . , n,

on X, which constitute our data set. From this, our goal is to recover the m independent

sources, Si = (Si1, · · · , Sim)τ , i = 1, 2, . . . , n, which generated the data through Xi = ASi,

i = 1, 2, . . . , n. Several efficient computational algorithms have been created to reach this

goal.

In most ICA applications, X is regarded as an r-vector-valued stochastic process X(t) =

(X1(t), · · · , Xr(t))τ (e.g., audio or music signals, EEG tracings, seismic recordings), where

t is a time or index parameter. We usually assume that X(t) is an unknown non-Gaussian

process with zero mean. In the linear noiseless ICA model with temporally-structured sources

and static mixing, the model is written as X(t) = AS(t), where S(t) = (S1(t), · · · , Sm(t))τ

is assumed to be an m-vector of stationary sources with A static (i.e., instantaneous, non-

time-varying, without trends or delays), 1 ≤ t ≤ n. For example, in the cocktail-party

problem, Si(t) is the tth sound spoken by the ith speaker (i = 1, 2, . . . , m) and Xj(t) is the


tth acoustic recording made by the jth microphone (j = 1, 2, . . . , r). In this formulation, the

ICA problem is closely related to the deconvolution of time series; see, for example, Donoho

(1981), who discusses at length the single-channel deconvolution problem and its application

to seismology. Extensions to the multi-channel case have also been studied.

If the mixing matrix A = A(t) is allowed to depend upon the time parameter, then we

refer to the model as dynamic mixing. By incorporating the temporal structure of the sources

into the ICA model, there is a good chance that the separation properties of the analysis

can be improved. In our description of ICA models, we omit the explicit dependence of X

on t unless specifically needed in the exposition.

4. LINEAR MIXING: I. NOISELESS ICA

4.1 The Model

The simplest form of the ICA model is the linear mixing version with no additive noise,

usually called the noiseless (or classical) ICA model. In this scenario, X is modelled as

X = AS, (3)

where the source components {Sj} are assumed to be statistically independent and A is a

full-rank (r×m) mixing matrix with unknown coefficients. Usually, m ≤ r. For model (3), where the sources have mean zero, X has mean zero and covariance matrix Σ_XX = AA^τ. The BSS (and ICA) problem for model (3) is to estimate A and recover S. Note that the model (3) does not identify A and S uniquely, for if S* = T^τ S and A* = AT, where T is

an orthogonal (m×m)-matrix, then X∗ = A∗S∗ has unchanged mean and covariance matrix

(ΣX∗X∗ = A∗A∗τ = AAτ = ΣXX).

If the number of sources is unknown, it is generally assumed that m < r. In situations

where A is not square but of full-rank, there exists an inverse mapping W = (w_1, · · · , w_m)^τ, usually termed a separating or unmixing matrix, such that

Y = WX = (w_1^τ X, · · · , w_m^τ X)^τ = (Y_1, · · · , Y_m)^τ  (4)

approximates the source component vector S. Our goal is to determine W and, hence, Y. If A were known, then the solution would be given by Y = (A^τ A)^{−1} A^τ X.

An important special case of model (3) is the square mixing model, where the number of

independent sources is equal to the number of measurements (i.e., m = r), a simplification


studied by Bell and Sejnowski (1995). As we saw above, if X has been centered and sphered,

then the resulting square mixing matrix A in model (3) is orthogonal. In this case, the number of elements of A to be determined is reduced from r² to r(r − 1)/2. The goal is to determine an orthogonal A and recover S using Y = WX, where W = A^τ, and the elements, Y_1 = w_1^τ X, . . . , Y_m = w_m^τ X, of Y are taken to be independent and as non-Gaussian as possible.

as possible.

4.2 Objective Functions

The general strategy behind ICA is to set up an appropriate objective (or contrast) func-

tion (also called a projection index in PP) to judge the merit of a particular m-dimensional

projection of multivariate data, and then use an optimization algorithm to find the global

and local maxima of that objective function over all such m-dimensional projections of the

data. For a given m = 1, 2, or 3, the optimization step determines the most informative

m-dimensional projection of the data. For numerical optimization purposes, we want the

objective function to possess certain desirable computational and analytical properties. The

most desirable property is that of affine invariance (location and scale invariance); exam-

ples of affine invariant objective functions include absolute cumulants, standardized Fisher

information, and relative entropy.

4.3 Polynomial-Based Indexes.

4.3.1 One-Dimensional Indexes. First, we assume that m = 1, so that Y = wτX is a

single continuous random variable having probability density function qY (y). The projection

indexes which drive PP can also be used as objective functions for ICA. These indexes take

the general form of weighted versions of integrated squared error,

I(Y) = ∫ [φ(y) − q_Y(y)]² w(y) dy,  (5)

where w(y) is a given weight function on ℝ. The index I(Y) measures the extent of departure of the density q_Y(y) from the standard Gaussian density, φ(y) = (2π)^{−1/2} e^{−y²/2}, having zero mean and unit variance.

An index such as (5) can be expressed in terms of the coefficients of orthogonal poly-

nomial expansions of the density function qY (y). If qY (y) is a (square-integrable) density

function, then it can be represented as a convergent orthogonal series expansion, q_Y(y) = ∑_{k=0}^{∞} α_k P_k(y), y ∈ ℝ, where {P_k} is a complete orthonormal system of functions on the real line ℝ (or some subset thereof) (i.e., ∫ P_i(y) P_j(y) dy = δ_ij, the Kronecker delta), P_k is a polynomial of degree k, and the {α_k} are coefficients defined by α_k = E_q{P_k(Y)}. There are

many different types of {Pk}; see, e.g., Abramowitz and Stegun (1972, Chapter 22). For our

purposes, we need only mention two versions of Hermite polynomials:

• Chebyshev-Hermite polynomials: He_k(y) = (−1)^k e^{y²/2} D^k e^{−y²/2}, k = 0, 1, 2, . . .. The {He_k(y)} form a complete orthogonal basis on ℝ with respect to the weight function φ(y) in the sense that

∫ φ(y) He_i(y) He_j(y) dy = j! δ_ij.  (6)

In this case, P_k(y) = (k!)^{−1/2} He_k(y) [φ(y)]^{1/2}, k = 0, 1, 2, . . .. The first few Chebyshev-Hermite polynomials are given by He_0(y) = 1, He_1(y) = y, He_2(y) = y² − 1, He_3(y) = y³ − 3y, and He_4(y) = y⁴ − 6y² + 3.

• Hermite polynomials: H_k(y) = (−1)^k e^{y²} D^k e^{−y²}, k = 0, 1, 2, . . .. The {H_k(y)} form a complete orthogonal basis on ℝ with respect to the weight function [φ(y)]² in the sense that

∫ [φ(y)]² H_i(y) H_j(y) dy = δ_ij 2^{j−1} j! π^{−1/2}.  (7)

In this case, P_k(y) = (2^{k−1} k! π^{−1/2})^{−1/2} H_k(y) φ(y), k = 0, 1, 2, . . .. The first few Hermite polynomials are given by H_0(y) = 1, H_1(y) = 2y, H_2(y) = 4y² − 2, H_3(y) = 8y³ − 12y, and H_4(y) = 16y⁴ − 48y² + 12.

The symbol D^k represents the derivative d^k/dy^k of whatever immediately follows.

In devising a projection index, Friedman (1987) noted that Y is standard Gaussian with density φ(y) if and only if U = 2Φ(Y) − 1, where Φ(Y) = ∫_{−∞}^{Y} φ(y) dy, is uniformly distributed on the interval [−1, 1]. Thus, the density of U, q_U(u), say, could be compared to the uniform density using integrated squared error (ISE),

I_F(Y) = ∫_{−1}^{1} [q_U(u) − 1/2]² du = ∫_{−1}^{1} [q_U(u)]² du − 1/2.  (8)

The further qU(u) is from the uniform density, the further Y would be from Gaussianity, and

so IF (Y ) would, therefore, measure the extent of non-Gaussianity. Friedman approximated

I_F by expanding q_U(u) in (8) as a truncated sum of Legendre polynomials, where the number

of terms in the truncated expansion determines how much smoothing is allowed by the


approximation. A bivariate version of PP using an extension of the objective function (8) was

also derived and is publicly available from StatLib as the FORTRAN subroutine ppdeaux.

Hall (1989) (and later Cook, Buja, and Cabrera, 1993) showed that if Friedman’s index (8) is transformed back to the original scale, it can be reexpressed as

I_F(Y) = (1/2) ∫ [φ(y) − q_Y(y)]² [1/φ(y)] dy,  (9)

where q_Y(y)/[φ(y)]^{1/2} is assumed to be square-integrable. Based on the form of (9), Hall

noted that unless the tails of qY (y) decrease fast enough, IF (Y ) can be infinite; thus, for

heavy-tailed qY (y), IF (Y ) will not be very useful as a measure of departure from Gaussianity.

Friedman, however, specifically used the index IF (Y ) to search for “projected distributions

that exhibit clustering (multimodality) or other kinds of nonlinear associations,” rather than

use it to identify heavy-tailed departures from Gaussianity.

The Gram-Charlier expansion of q_Y(y) is given by

q_Y(y) = φ(y) ∑_{k=0}^{∞} (a_k / k!) He_k(y),  (10)

where a_k = E_q{He_k(Y)} and He_k(y) is the Chebyshev-Hermite polynomial of order k (Thisted, 1988, p. 285). Substitute (10) for q_Y(y) in (9), then expand the squared term in the integrand and use the orthogonality condition (6). From the definition of a_k, and because Y has zero mean and unit variance, it follows that a_0 = 1, a_1 = 0, and a_2 = 0.

Thus,

I_F(Y) = (1/2) ∑_{k=3}^{∞} a_k²/k!.  (11)

The index IF (Y ) can be approximated by truncating the sum to the first K terms,

I_F^K(Y) = (1/2) ∑_{k=3}^{K} a_k²/k!.  (12)

Given i.i.d. observations, Y1, . . . , Yn, on Y , we can estimate the {ak} by the sample averages,

â_k = n^{−1} ∑_{i=1}^{n} He_k(Y_i),  k = 3, 4, . . . , K,  (13)

and then substitute (13) into (12) to get the estimated index Î_F^K(Y).

The index I_F^K(Y) can also be expressed in terms of the cumulants of Y. If Y has zero mean, then the first four cumulants of Y are given by: κ_1 = 0, κ_2 = E(Y²), κ_3 = E(Y³), κ_4 = E(Y⁴) − 3[E(Y²)]². It follows that a_3 = κ_3 = κ_3(Y) is the skewness of Y and a_4 = κ_4 = κ_4(Y) is the kurtosis of Y. If κ_3 = 0, then the density of Y is symmetric;

otherwise, not. The fourth cumulant, κ4, measures the flatness vs. peakedness of the density

of Y . A zero-mean Gaussian Y has κ3 = κ4 = 0. Any Y with κ4 = 0 is called mesokurtic,

but examples of such densities (other than the Gaussian) are rare. If κ4 > 0, we say that

Y is super-Gaussian (or leptokurtic or approximately sparse) with a density which is highly

peaked at 0 and has heavier tails than the Gaussian (e.g., Laplacian or double-exponential

density), while if κ4 < 0, then Y is called sub-Gaussian (or platykurtic) with a density which

may be flat (or multimodal) over much of the range of Y and have very small values at the

extremes (e.g., uniform density). Setting K = 4, and estimating κ_3 and κ_4 by the sample estimates κ̂_3 = κ̂_3(Y) and κ̂_4 = κ̂_4(Y), respectively, (12) can be estimated by

Î_F^4(Y) = κ̂_3²(Y)/12 + κ̂_4²(Y)/48,  (14)

which is the moment-based projection index of Jones and Sibson (1987).
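A minimal sketch of the moment-based index (14) for a single sphered projection (the function name is illustrative; the Laplace example is just one convenient supergaussian density):

```python
import numpy as np

def moment_index(y):
    """Jones-Sibson index (14): kappa3^2/12 + kappa4^2/48 for a zero-mean,
    unit-variance sample y."""
    k3 = np.mean(y ** 3)                 # sample third cumulant (skewness)
    k4 = np.mean(y ** 4) - 3.0           # sample fourth cumulant (kurtosis)
    return k3 ** 2 / 12.0 + k4 ** 2 / 48.0

rng = np.random.default_rng(2)
print(moment_index(rng.standard_normal(100000)))             # near 0 for Gaussian data
print(moment_index(rng.laplace(size=100000) / np.sqrt(2.0))) # clearly positive (supergaussian)
```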

In practice, although the index (14) can be computed very quickly, the skewness and

kurtosis components are primarily influenced by tail structure (i.e., outliers) in the data. The

ironic feature of Friedman’s proposed index IF (Y ) is that rather than force attention away

from the tails as intended, it turns out to do exactly the opposite. Interestingly enough, it

turns out that outliers in the projected data are not at all unusual. In simulation experiments

using a moment-based index similar to (14) for PP (see Friedman and Johnstone’s discussions

of Jones and Sibson, 1987), outliers were observed to appear repeatedly in projections of even

well-behaved multivariate Gaussian data. Furthermore, there is no obvious way to robustify

the index (14).

Given the potential robustness problems inherent in using (14) as a projection index, Hall (1989) proposed a variation on the theme of I_F by studying the integrated squared error between q_Y(y) and the Gaussian density φ(y),

I_H(Y) = ∫ [φ(y) − q_Y(y)]² dy.  (15)

This index can also be expressed in terms of the coefficients of certain orthogonal functions.

Expanding q_Y(y) in terms of the Hermite polynomials {H_k(y)} yields

q_Y(y) = φ(y) ∑_{k=0}^{∞} b_k γ_k^{−1/2} H_k(y),  (16)

where b_k = b_k(Y) = γ_k^{−1/2} E_q{H_k(Y) φ(Y)} and γ_k = 2^{k−1} k! π^{−1/2}. Substituting (16) into (15), expanding the squared integrand, and then simplifying using (7) yields

I_H(Y) = (b_0(Y) − γ_0^{1/2})² + ∑_{k=1}^{∞} [b_k(Y)]²,  (17)

where γ_0 = (2π^{1/2})^{−1} ≈ 0.282.

An interesting result is obtained if we truncate (17) at k = 0. Then,

I_H^0(Y) = (b_0(Y) − γ_0^{1/2})² = γ_0^{−1} (E_q{φ(Y)} − E{φ(Z)})²,  (18)

where Z is standard Gaussian and, using (7), E{φ(Z)} = (2π^{1/2})^{−1}. We shall see a generalized form of this objective function (18) again in Section 4.4.2 (see (51)).

Given i.i.d. observations, Y_1, . . . , Y_n, on Y, the {b_k} can be estimated by the sample averages,

b̂_k = b̂_k(Y) = γ_k^{−1/2} n^{−1} ∑_{i=1}^{n} H_k(Y_i) φ(Y_i),  k = 0, 1, 2, . . . .  (19)

Substituting (19) into (17) and truncating the sum to the first K terms yields the estimate

Î_H^K(Y) = (b̂_0(Y) − γ_0^{1/2})² + ∑_{k=1}^{K} [b̂_k(Y)]².  (20)

Under certain regularity conditions, Hall showed that Î_H^K(Y) is a useful measure of departure from Gaussianity, with the most interesting projection direction maximizing Î_H^K(Y).
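The estimate (20) is easy to compute with numpy's physicists' Hermite polynomials; the following is a sketch under the definitions above (the function name and the choice K = 3 are illustrative):

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval    # physicists' Hermite polynomials H_k

def hall_index(y, K=3):
    """Estimate Hall's index (20) from a centered, unit-variance sample y,
    using b_k from (19) with gamma_k = 2^{k-1} k! / sqrt(pi)."""
    phi = np.exp(-y ** 2 / 2.0) / sqrt(2.0 * pi)          # standard Gaussian density
    b = []
    for k in range(K + 1):
        coeff = np.zeros(k + 1)
        coeff[k] = 1.0                                    # selects H_k in hermval
        gamma_k = 2.0 ** (k - 1) * factorial(k) / sqrt(pi)
        b.append(np.mean(hermval(y, coeff) * phi) / sqrt(gamma_k))
    gamma_0 = 1.0 / (2.0 * sqrt(pi))
    return (b[0] - sqrt(gamma_0)) ** 2 + sum(bk ** 2 for bk in b[1:])

rng = np.random.default_rng(3)
print(hall_index(rng.standard_normal(50000)))                      # near 0 for Gaussian data
print(hall_index(rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), 50000))) # larger for the uniform
```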

A further modification of Friedman’s and Hall’s proposed projection indexes was proposed by Cook, Buja, and Cabrera (1993),

I_CBC(Y) = ∫ [φ(y) − q_Y(y)]² φ(y) dy,  (21)

who put more weight around the center of the distribution, rather than at the tails. This time we apply the Chebyshev-Hermite expansion to both q_Y(y) and φ(y). We can write

q_Y(y) = ∑_{k=0}^{∞} (c_k / k!) He_k(y),   φ(y) = ∑_{k=0}^{∞} (d_k / k!) He_k(y),  (22)

where the coefficients {c_k} are defined by c_k = c_k(Y) = E_q{He_k(Y) φ(Y)}, the coefficients {d_k} are given by d_{2m} = (−1)^m √((2m)!) / (m! 2^{2m+1} √π) and d_{2m+1} = 0, m = 0, 1, 2, . . ., and He_k(y) is the Chebyshev-Hermite polynomial of order k. Substituting (22) into (21), expanding the squared integrand and using the orthogonality condition (6), I_CBC(Y) can be written as

I_CBC(Y) = ∑_{k=0}^{∞} (d_k − c_k(Y))²/k!.  (23)


It is not difficult to show that if we truncate (23) to the first term (k = 0), then I_CBC^0 can be expressed as

I_CBC^0(Y) = (d_0 − c_0(Y))² = (E_q{φ(Y)} − E{φ(Z)})²,  (24)

which is proportional to (18).

Given i.i.d. observations, Y_1, . . . , Y_n, on Y, we can estimate the unknown {c_k} by the sample averages,

ĉ_k = ĉ_k(Y) = n^{−1} ∑_{i=1}^{n} He_k(Y_i) φ(Y_i).  (25)

The index I_CBC(Y) is then estimated by substituting (25) for c_k in (23), truncating the sum to the first K terms, and setting

Î_CBC(Y) = ∑_{k=0}^{K} (d_k − ĉ_k(Y))²/k!.  (26)

Some attention has been focussed upon an appropriate choice of K, which also serves as a

smoothing parameter. Cook et al. (1993) surprisingly found that small values of K (K = 0

or K = 1) turned out to be the most interesting, especially in discovering projections with

a “hole” in the middle, or skewness when it exists.

4.3.2 Two-Dimensional Indexes. Next, we can obtain a projection index for m = 2 by

using the same ideas as for a one-dimensional index. Let (Y1, Y2) be a bivariate projection

of X, where Y_1 = w_1^τ X and Y_2 = w_2^τ X.  (27)

4.4 Relative Entropy

The entropy of a random variable was introduced by Claude E. Shannon in 1948 and

has since become a valuable concept in information theory. See, for example, Gray (1990),

Cover and Thomas (1991). The entropy of the random variable Y gives us a notion of

how much information is contained in Y . Essentially, entropy is largest when Y is most

unpredictable. If Y is a continuous random variable with probability density function qY (y),

then the (differential) entropy H(Y ) of Y is defined by

H(Y ) = −∫qY (y) log qY (y)dy. (28)


Amongst all random variables having equal variance, the largest value of H(Y ) occurs when

Y has a Gaussian distribution. Small values of H(Y ) occur when the distribution of Y is

concentrated on specific values. Jones and Sibson (1987) had the idea to use the concept of

entropy as a measure of non-Gaussianity.

If we normalize H(Y ) so that it has the value zero for a Gaussian variable and otherwise

is always nonnegative, we arrive at relative entropy (also called negentropy) defined by

J (Y ) = H(Z)−H(Y ), (29)

where Z is a Gaussian random variable having the same variance as Y (Cover and Thomas,

1991). If Z has mean 0 and variance 1, then,

H(Z) = (1/2)[1 + log 2π] ≈ 1.419.  (30)
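As a quick numerical check of (30) and of the maximality claim above, the closed-form differential entropies of three unit-variance densities can be compared (the uniform and Laplace values come from standard formulas, not from the paper):

```python
import numpy as np

H_gaussian = 0.5 * (1.0 + np.log(2.0 * np.pi))   # (30): about 1.419, the maximum
H_uniform  = np.log(2.0 * np.sqrt(3.0))          # uniform on [-sqrt(3), sqrt(3)]: about 1.242
H_laplace  = 1.0 + np.log(np.sqrt(2.0))          # Laplace with scale 1/sqrt(2): about 1.347

print(H_gaussian, H_uniform, H_laplace)          # the Gaussian entropy is largest
print(H_gaussian - H_laplace)                    # negentropy J of the unit-variance Laplace
```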

An important property of relative entropy (but not of differential entropy) is that it is

invariant under linear invertible transformations (Comon, 1994): if Y is an m-vector with

mean 0 and covariance matrix Σ, and if X is an r-vector such that X = AY, then J (X) =

J (Y).

Differential entropy turns out to be difficult to compute, due mainly to the fact that the

probability density function qY (y) is, in principle, unknown. Attempts have been made to

estimate functionals of a density, and especially entropy, using a nonparametric estimate of

qY (y), which then gets used as a “plug-in” estimator (Izenman, 1991), but such computations

can be notoriously slow. More efficient approximations to J (Y ) involve either higher-order

cumulants or nonpolynomial expansions of the density function qY (y) in (28).

4.4.1 Polynomial Approximation. From (10), the Gram-Charlier expansion of the den-

sity qY (y) can be written as

qY (y) = φ(y)(1 + ε(y)), (31)

where

ε(y) = ∑_{k=3}^{∞} (a_k / k!) He_k(y).  (32)

Assuming that q_Y(y) ≈ φ(y), then, expanding log(1 + ε) in a Taylor series,

log q_Y(y) = log φ(y) + log(1 + ε(y)) = log φ(y) + ε(y) − (1/2)[ε(y)]² + O([ε(y)]³).  (33)


Substituting (31) into (28), while using (33) and (6), we have

H(Y) = −∫ φ(y)(1 + ε(y)) (log φ(y) + ε(y) − (1/2)[ε(y)]² + O([ε(y)]³)) dy
     = −∫ φ(y) (1 + ∑_{k=3}^{∞} (a_k/k!) He_k(y)) (log φ(y) + ∑_{k=3}^{∞} (a_k/k!) He_k(y) − (1/2)(∑_{k=3}^{∞} (a_k/k!) He_k(y))² + O([ε(y)]³)) dy
     = H(Z) − (1/2) ∑_{k=3}^{∞} a_k²/k! + O([ε(y)]³).  (34)

If we truncate the series in (34) at k = 4, then we have the result that

J(Y) ≈ κ_3²(Y)/12 + κ_4²(Y)/48,  (35)

which again (see (14)) is the moment-based projection index of Jones and Sibson (1987).

4.4.2 Nonpolynomial Approximation. To overcome the data-sensitivity of the moment-based index (35), Hyvarinen (1998) used instead a nonpolynomial function to maximize the entropy H(Y) of Y. Suppose G_i(Y), i = 1, 2, . . . , N, are different nonpolynomial functions of Y which (like the Hermite polynomials) form an orthonormal system with respect to the standard Gaussian density φ,

∫ φ(y) G_i(y) G_j(y) dy = δ_ij,  (36)

and which also are orthogonal to all polynomials of up to second order,

∫ φ(y) G_i(y) y^k dy = 0,  k = 0, 1, 2.  (37)

The orthogonality constraints (36) and (37) can always be satisfied by using ordinary Gram-Schmidt orthonormalization. We further assume that the expectations of the first N of the G_i(Y) are given by the following values:

E(G_i(Y)) = ∫ q_Y(y) G_i(y) dy = c_i,  i = 1, 2, . . . , N.  (38)

Assuming also that Y has mean 0 and variance 1 yields two more constraints,

G_{N+1}(y) = y,  c_{N+1} = 0,  (39)
G_{N+2}(y) = y²,  c_{N+2} = 1.  (40)


It can be shown that the probability density, q_Y^0(y), which satisfies the constraints (36)–(40) and also has the largest entropy amongst all such densities is given by

q_Y^0(y) = A e^{∑_i a_i G_i(y)},  (41)

where A and the {a_i} are constants to be determined from (38). If we again assume that q_Y(y) ≈ φ(y), then for (41) to be close to e^{−y²/2}, the only substantial coefficient has to be a_{N+2} ≈ −1/2. We can rewrite (41) as follows:

q_Y^0(y) = A exp{−y²/2 + a_{N+1} y + (a_{N+2} + 1/2) y² + ∑_{i=1}^{N} a_i G_i(y)}
        = Ã φ(y) (1 + a_{N+1} y + (a_{N+2} + 1/2) y² + ∑_{i=1}^{N} a_i G_i(y)),  (42)

where Ã = (2π)^{1/2} A and where we used the approximation e^ε ≈ 1 + ε. Furthermore,

1 = ∫ q_Y^0(y) dy = Ã[1 + (a_{N+2} + 1/2)]  (43)
0 = E(Y) = ∫ q_Y^0(y) y dy = Ã a_{N+1}  (44)
1 = E(Y²) = ∫ q_Y^0(y) y² dy = Ã[1 + 3(a_{N+2} + 1/2)]  (45)
c_i = ∫ q_Y^0(y) G_i(y) dy = Ã a_i,  i = 1, 2, . . . , N.  (46)

These equations are easily solved to give a_i = c_i, i = 1, 2, . . . , N, a_{N+1} = 0, a_{N+2} = −1/2, and Ã = 1. Substituting these values into (42) yields

q_Y^0(y) = φ(y) (1 + ∑_{i=1}^{N} c_i G_i(y)),  (47)

which is referred to as the approximative maximum entropy density. Compare this representation with that given by (31). Hence, H(Y) can be approximated by

H(Y) ≈ −∫ q_Y^0(y) log q_Y^0(y) dy
     = −∫ φ(y) (1 + ∑_{i=1}^{N} c_i G_i(y)) [log φ(y) + log(1 + ∑_{i=1}^{N} c_i G_i(y))] dy
     ≈ −∫ φ(y) log φ(y) dy − ∑_{i=1}^{N} c_i ∫ φ(y) G_i(y) log φ(y) dy − ∫ φ(y) (1 + ∑_{i=1}^{N} c_i G_i(y)) log(1 + ∑_{i=1}^{N} c_i G_i(y)) dy
     = H(Z) − ∑_{i=1}^{N} c_i ∫ φ(y) G_i(y) log φ(y) dy − ∑_{i=1}^{N} c_i ∫ φ(y) G_i(y) dy − (1/2) ∑_{i=1}^{N} c_i² ∫ φ(y) G_i²(y) dy + o(∑_{i=1}^{N} c_i²)
     = H(Z) − 0 − 0 − (1/2) ∑_{i=1}^{N} c_i² + o(∑_{i=1}^{N} c_i²),  (48)

where we have used the conditions (36) and (37), the expansion (1 + ε) log(1 + ε) = ε + ε²/2 + o(ε²) for ε small, and where Z ∼ N(0, 1). From (48) and (29), we have that

J(Y) ≈ (1/2) ∑_{i=1}^{N} (E{G_i(Y)})².  (49)

All that remains now is to choose the functions {G_i(Y)}. The simplest choices of these functions have N = 1 or N = 2. Taking N = 2, first, we

can make G1 an odd function (G1(−y) = −G1(y), reflecting symmetry vs. asymmetry) and

G2 an even function (G2(−y) = G2(y), reflecting sub-Gaussian vs. super-Gaussian). One

can show that in this case the approximation (49) boils down to

J(Y) ≈ β_1 (E{G_1(Y)})² + β_2 (E{G_2(Y)} − E{G_2(Z)})²,  (50)

where β_1 and β_2 are positive constants. If we take N = 1, the approximation becomes

J(Y) ≈ β (E{G(Y)} − E{G(Z)})²,  β > 0,  (51)

for any nonquadratic contrast function G, where Z ∼ N (0, 1). Note that (51) generalizes

the objective functions (18) and (24), where G is given by the standard Gaussian density φ.

The approximation (51) to relative entropy (negentropy) is used in the R and C code

implementation (Marchini, Heaton, and Ripley, 2003) of the FastICA algorithm, where β =

1. Choices of functional form of the G function (fun) used in the approximation include:

• logcosh: G(y) = (1/α) log cosh(αy), 1 ≤ α ≤ 2 (usually, alpha = 1),

• exp: G(y) = −e^{−y²/2}.

The logcosh function has been found to be good for most types of ICA problems, while the

exp function is probably best for highly supergaussian source components where robustness

is a serious consideration. The logcosh function has also been used successfully as a flexible

family of Bayesian prior distributions, especially for the image reconstruction of photon

emission computed tomographic data (Green, 1990; Weir and Green, 1994; Weir, 1997).
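A minimal sketch of the one-function approximation (51), with β = 1 and the logcosh and exp choices of G (the Monte Carlo estimate of E{G(Z)} is an implementation shortcut, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.standard_normal(200000)                    # reference standard Gaussian sample

def negentropy_approx(y, G):
    """Approximation (51): J(Y) ~ (E{G(Y)} - E{G(Z)})^2, with beta = 1."""
    return (np.mean(G(y)) - np.mean(G(Z))) ** 2

logcosh = lambda y, alpha=1.0: np.log(np.cosh(alpha * y)) / alpha
exp_G   = lambda y: -np.exp(-y ** 2 / 2.0)

y_gauss   = rng.standard_normal(100000)
y_laplace = rng.laplace(size=100000) / np.sqrt(2.0)   # unit-variance supergaussian sample

print(negentropy_approx(y_gauss, logcosh), negentropy_approx(y_laplace, logcosh))
print(negentropy_approx(y_gauss, exp_G), negentropy_approx(y_laplace, exp_G))
```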


4.5 The FastICA Algorithm

4.5.1 Extracting a Single Source Component. First, we detail the case of a single

(m = 1) source component (or one-dimensional projection), Y = wτX, where w is an

r-vector. Consider finding the direction w which maximizes the approximation (51) to rel-

ative entropy subject to the sphering constraint E{(wτX)2} = ‖w‖2 = 1 on the projection.

In other words, we wish to find that w which makes the distance between the density of

the one-dimensional projection Y = wτX and the Gaussian density as large as possible,

where distance is measured by relative entropy. Because the maxima of the relative entropy

J (wτX) in (51) are typically obtained at certain maxima of E{G(wτX)}, we set

F(w) = E{G(w^τ X)} − (λ/2)(‖w‖² − 1),  (52)

where λ is the Lagrangian multiplier. To maximize (52), the Newton-Raphson iterative

method (see, e.g., Thisted, 1988, Section 4.2.2) yields the iteration

w ← w − (∂²F(w)/∂w²)^{−1} (∂F(w)/∂w).  (53)

We, thus, need to find the first and second partial derivatives of F (w) with respect to w.

Differentiating (52) with respect to w yields

∂F(w)/∂w = E(Xg(w^τ X)) − λw,  (54)

where g = G′ is the derivative of G. The stationary values of the function F are found by equating (54) to

zero. Premultiplying both sides of the resulting equation by wτ yields

λ = E(wτXg(wτX)). (55)

Differentiating (54) with respect to w gives the approximate second derivative of F,

∂²F(w)/∂w² = E(XX^τ g′(w^τ X)) − λI_r ≈ E(XX^τ) E(g′(w^τ X)) − λI_r = (E(g′(w^τ X)) − λ) I_r,  (56)

where we used the fact that X has been sphered. Substituting (54) and (56) into (53), the iteration reduces to

w ← w − [E(Xg(w^τ X)) − λw] / [E(g′(w^τ X)) − λ].  (57)

If we set E_1 = E(Xg(w_{k−1}^τ X)) and E_2 = E(g′(w_{k−1}^τ X)), then (57) can be written as w_k = w_{k−1} − (E_1 − λw_{k−1})/(E_2 − λ) for the kth iteration. Multiplying both sides by λ − E_2


Table 1. Nonquadratic density functions and their first and second derivatives to be used as input to the FastICA algorithm. Note that for the logcosh density, 1 ≤ α ≤ 2.

density    G(y)                   g(y) = G′(y)      g′(y) = G″(y)
logcosh    (1/α) log cosh(αy)     tanh(αy)          α(1 − tanh²(αy))
exp        −e^{−y²/2}             y e^{−y²/2}       (1 − y²) e^{−y²/2}

yields w_k(λ − E_2) = E_1 − w_{k−1}E_2. Because we divide w by its norm ‖w‖ at each step of

the iterative procedure, the factor λ − E2 can be ignored. The iteration (57) is, therefore,

equivalent to

w ← E(Xg(w^τ X)) − w E(g′(w^τ X)).  (58)

For the logcosh and exp densities, the functions g and g ′ are given in Table 1. Substituting

for g and g′ in (58) for either the logcosh or exp density as appropriate yields the FastICA

algorithm, which is given in Table 2.

The values of w can change substantially from iteration to iteration; this is because the

ICA model cannot determine the sign of w, so that −w and w become equivalent and define

the same direction. In light of this comment, “convergence” of the FastICA algorithm is

taken to have a different meaning than usual, and is taken here to mean that successive

iterative values of w (i.e., w_{k−1} and w_k for some k) are oriented in the same direction (i.e., w_k^τ w_{k−1} is very close to 1).
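The single-component update (58) and the convergence criterion just described fit in a few lines of code; the following is a sketch for sphered data with the logcosh nonlinearity (the function name and defaults are illustrative):

```python
import numpy as np

def fastica_one_unit(X, max_iter=200, tol=1e-6, alpha=1.0, seed=0):
    """One-unit FastICA on sphered data X (n x r) with G = logcosh,
    i.e., g = tanh(alpha*y) and g' = alpha*(1 - tanh^2(alpha*y))."""
    rng = np.random.default_rng(seed)
    n, r = X.shape
    w = rng.standard_normal(r)
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        y = X @ w                                        # projections w^T x_i
        g = np.tanh(alpha * y)
        g_prime = alpha * (1.0 - g ** 2)
        w_new = X.T @ g / n - w * g_prime.mean()         # update (58), expectations as sample means
        w_new /= np.linalg.norm(w_new)
        if abs(w_new @ w) > 1.0 - tol:                   # same direction up to sign: converged
            return w_new
        w = w_new
    return w
```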

4.5.2 Extracting Multiple Independent Source Components. The FastICA package (Hurri,

Gavert, Sarela, and Hyvarinen, 1998) includes two different ways of extracting more than one

independent source component. Both methods (termed “deflation” and “parallel” methods)

repeatedly call the single component extraction algorithm of Table 2. Essentially, at each

step in the algorithmic cycle:

deflation: the single component routine finds a new component, that new component is orthogonalized using the Gram-Schmidt method with respect to all previously-found components, and then the resulting new component is normalized.

parallel: the single component routine is carried out in parallel for each independent component to be extracted, and then a symmetric orthogonalization is carried out on all components simultaneously.

Table 2. FastICA algorithm for determining a single source component.

1. Center the data to make the mean zero, and then whiten the result to give X.

2. Choose an initial version of the r-vector w with unit norm.

3. Choose G to be any nonquadratic density with first and second partial derivatives g and g′, respectively. If the choice is either the logcosh or exp density, g and g′ are given in the text.

4. Let w ← E(Xg(w^τ X)) − w E(g′(w^τ X)). In practice, the expectations are estimated using sample averages.

5. Let w ← w/‖w‖.

6. Iterate between steps 4 and 5. Stop when convergence is attained.

The deflation method extracts independent components sequentially one-at-a-time, while

the parallel method extracts all the independent components at the same time. Both

algorithms are listed in Table 3.

4.6 ML Estimation

4.6.1 The EM Algorithm. Consider two vector-valued random variables X and S, where

we assume that X is observed while S is latent. For x ∈ ℝ^r and s ∈ ℝ^m, let the probability

density function of X be given by pθ(x) with model parameters θ, and let the prior density of

S be given by qη(s) with variational parameters η. The posterior density of S given X = x

is given by

p_θ(s|x) = p_θ(x, s) / p_θ(x),  (59)

where p_θ(x, s) is the joint density of X and S. Taking logarithms of (59) yields the log-

likelihood,

L(θ|x) ≡ log pθ(x) = log pθ(x, s)− log pθ(s|x). (60)


Table 3. Two FastICA algorithms for extracting multiple independent source components.

Deflation algorithm

1. Center the data to make its mean zero, and then whiten the result to give X.

2. Decide on the number, m, of independent components to be extracted.

3. For k = 1, 2, . . . , m,

• Initialize (e.g., randomly) the r-vector w_k to have unit norm.

• Let w_k ← E(Xg(w_k^τ X)) − w_k E(g′(w_k^τ X)) be the FastICA single-component update for w_k, where g and g′ are given in Table 1. In practice, the expectations are estimated using sample averages.

• Use the Gram-Schmidt process to orthogonalize w_k with respect to the previously chosen w_1, . . . , w_{k−1}: w_k ← w_k − ∑_{j=1}^{k−1} (w_k^τ w_j) w_j.

• Let w_k ← w_k/‖w_k‖.

• Iterate w_k until convergence.

4. Set k ← k + 1. If k ≤ m, return to step 3.

Parallel algorithm

1. Center the data to make its mean zero, and then whiten the result to give X.

2. Decide on the number, m, of independent components to be extracted.

3. Initialize (e.g., randomly) the r-vectors w_1, . . . , w_m, each to have unit norm. Let W = (w_1, · · · , w_m)^τ.

4. Carry out a symmetric orthogonalization of W by W ← (WW^τ)^{−1/2} W.

5. For each k = 1, 2, . . . , m, let w_k ← E(Xg(w_k^τ X)) − w_k E(g′(w_k^τ X)) be the FastICA single-component update for w_k, where g and g′ are given in Table 1. In practice, the expectations are estimated using sample averages.

6. Carry out another symmetric orthogonalization of W.

7. If convergence has not occurred, return to step 5.
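As an illustration, here is a compact sketch of the deflation algorithm of Table 3, reusing the one-unit update (58) and a Gram-Schmidt step (the example data, mixing matrix, and whitening shortcut are hypothetical):

```python
import numpy as np

def fastica_deflation(X, m, max_iter=200, tol=1e-6, seed=0):
    """Deflation FastICA (Table 3) on sphered data X (n x r): returns the
    (m x r) unmixing matrix W, one row at a time."""
    rng = np.random.default_rng(seed)
    n, r = X.shape
    W = np.zeros((m, r))
    for k in range(m):
        w = rng.standard_normal(r)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = X @ w
            g = np.tanh(y)                                    # logcosh nonlinearity, alpha = 1
            w_new = X.T @ g / n - w * np.mean(1.0 - g ** 2)   # one-unit update (58)
            w_new -= W[:k].T @ (W[:k] @ w_new)                # Gram-Schmidt against w_1..w_{k-1}
            w_new /= np.linalg.norm(w_new)
            if abs(w_new @ w) > 1.0 - tol:
                break
            w = w_new
        W[k] = w_new
    return W

# Example: two mixed Laplacian sources, whitened, then unmixed (up to sign and order)
rng = np.random.default_rng(1)
S = rng.laplace(size=(5000, 2)) / np.sqrt(2.0)
X = S @ np.array([[1.0, 0.6], [0.4, 1.0]]).T
Xc = X - X.mean(axis=0)
L = np.linalg.cholesky(np.cov(Xc, rowvar=False))
Xw = Xc @ np.linalg.inv(L).T                                  # whitened data
Y = Xw @ fastica_deflation(Xw, m=2).T                         # recovered sources
```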


The function L(θ|x) is to be maximized over the parameters θ. The expectation of L(θ|x)

in (60) with respect to the prior density qη(s) of S is given by

L(θ|x) = ∫ L(θ|x) q_η(s) ds = ∫ q_η(s) log p_θ(x, s) ds − ∫ q_η(s) log p_θ(s|x) ds
       = ∫ q_η(s) log p_θ(x, s) ds + ∫ q_η(s) log [q_η(s) / (p_θ(s|x) q_η(s))] ds
       = ∫ q_η(s) log [p_θ(x, s) / q_η(s)] ds + ∫ q_η(s) log [q_η(s) / p_θ(s|x)] ds
       = V(x|θ, η) + KL(q_η||p_θ),  (61)

where

V(x|θ, η) = ∫ q_η(s) log p_θ(x, s) ds − ∫ q_η(s) log q_η(s) ds  (62)

is the difference between the expected energy under qη and the entropy of qη (which does not

depend upon θ), and KL(q||p) is the Kullback-Leibler divergence between the prior density

qη(s) and the posterior density pθ(s|x). The negative of V is also known by those in statistical

physics as (variational) free energy. Note that

KL(q_η||p_θ) = ∫ q_η(s) log [q_η(s) / p_θ(s|x)] ds = E_η{− log [p_θ(s|x) / q_η(s)]}
            ≥ − log{E_η[p_θ(s|x) / q_η(s)]} = − log{∫ p_θ(s|x) ds} = 0,  (63)

where we used Jensen’s inequality E(f(x)) ≥ f(E(x)) for the convex function f(x) = − log(x)

(Loeve, 1963, Section 9.3e), and Eη indicates expectation taken over the density qη. Thus,

KL(qη||pθ) ≥ 0, (64)

so that

L(θ|x) ≥ V (x|θ, η), (65)

with equality if qη(s) = pθ(s|x), in which case, from (61), the log-likelihood (60) becomes

L(θ|x) = V (x|θ, η). To maximize the log-likelihood, we use the EM algorithm given in Table

4. The algorithm increases L(θ|x) at every iteration. The main drawback of the EM

algorithm is that it has a tendency to get captured at local maxima of the likelihood surface.


Table 4. EM algorithm for maximum likelihood ICA.

1. Set the model parameter estimate, θ, at an initial value θ_0.

2. For k = 1, 2, . . . , iterate between the following two steps:

• E-Step: Fix the model parameter estimate at θ_{k−1}. Update the variational parameter estimate η by maximizing V(x|θ_{k−1}, η) with respect to η:

η_k ← arg max_η V(x|θ_{k−1}, η).

The maximum occurs when q_{η_k}(s) = p_{θ_{k−1}}(s|x), at which point L(θ_{k−1}|x) = V(x|θ_{k−1}, η_k).

• M-Step: Fix the variational parameter estimate at η_k. Update the model parameter estimate θ by maximizing V(x|θ, η_k) with respect to θ:

θ_k ← arg max_θ V(x|θ, η_k) = arg max_θ ∫ q_{η_k}(s) log p_θ(x, s) ds.

3. Stop when convergence is attained.

4.6.2 Square Mixing and the FastICA Algorithm. If the density of the m-vector S =

(S1, · · · , Sm) is qS(s), then the density of the linear transformation X = AS, where A is

square and nonsingular, is p_X(x) = |det(W)| q_S(s), where W = A^{−1}. Statistical independence of the sources implies that the joint density, q_S(s), can be written as a product of its m component source densities, q_S(s) = ∏_{j=1}^{m} q_{S_j}(s_j), where q_{S_j}(s_j) is the density of S_j.

Hence, the joint density of X is

p_X(x) = |det(W)| ∏_{j=1}^{m} q_{S_j}(w_j^τ x),  (66)

where w_j^τ is the jth row of W. Now, suppose we are given n i.i.d. observations, x_1, . . . , x_n, on X. Then, the log-likelihood function (divided by n) is

L(W|{x_i}) = log |det(W)| + E[∑_{j=1}^{m} log q_{S_j}(w_j^τ x_i)],  (67)

where “E” represents sample average over the n observations. There are several ways of

maximizing the log-likelihood function (67), including the following FastICA-type algorithm.

The derivative of L(W) with respect to W is given by

∂L(W)/∂W = (W^τ)^{−1} + E(g(WX)X^τ),  (68)

where g(S) = (g_1(S_1), · · · , g_m(S_m))^τ and g_j = (log q_{S_j})′ = q′_{S_j}/q_{S_j}. This suggests the following gradient of the log-likelihood (67):

∆W ∝ (W^τ)^{−1} + E(g(WX)X^τ),  (69)

where ∆W is the difference between successive iterations of W. A stochastic version of

(69) was introduced by Bell and Sejnowski (1995), who derived it using different principles. We eliminate the matrix inversion of W^τ at each iteration step, which tends to slow down this algorithm, by postmultiplying both sides of (69) by W^τ W. This gives us a simple

formulation of the ML algorithm:

W←W + µ[Im + E(g(S)Sτ )]W, (70)

where S = WX and µ is the learning rate. Because this algorithm converges if E(g(S)Sτ ) =

I_m, this condition implies that, for i ≠ j, S_i is uncorrelated with g_j(S_j).

For a given value of m, let W = (w1, · · · ,wm)τ . Step 5 in the parallel FastICA algorithm

in Table 3 can be written in matrix form as follows:

W←W + diag{αi} [diag{λi} − E(g(S)Sτ )]W, (71)

where S = (S_1, · · · , S_m)^τ, S_i = w_i^τ X, λ_i = E(S_i g(S_i)), and α_i = 1/(E(g′(S_i)) − λ_i). The second term on the right-hand side of (71) can be rearranged to give

W ← W + diag{α_i λ_i}[I_m − diag{λ_i^{−1}} E(g(S)S^τ)] W.  (72)

Hyvarinen (1999) recognized that because the ML algorithm (70) is just a special case of the FastICA algorithm (71), the FastICA algorithm as given in Table 5 can be interpreted as maximizing the likelihood (67), thereby directly obtaining the ML estimate of W. The scalar

learning rate µ has now become a more flexible part of the iterative process. Furthermore,

it turns out through simulation studies that careful choice of {αi} and {λi} can speed up

convergence of the FastICA algorithm to be 10–100 times faster than the gradient approach

in deriving ML estimates.
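A sketch of the matrix-form update of Table 5 (equation (71)), together with the symmetric orthogonalization W ← (WW^τ)^{−1/2}W, again for sphered data and the logcosh nonlinearity (function names are illustrative):

```python
import numpy as np

def sym_orthogonalize(W):
    """Symmetric orthogonalization: W <- (W W^T)^{-1/2} W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

def fastica_ml(X, m, n_iter=100, seed=0):
    """FastICA in the maximum-likelihood form of Table 5 on sphered data X (n x r):
    lambda_i = E{S_i g(S_i)}, alpha_i = 1/(E{g'(S_i)} - lambda_i), and
    W <- W + diag{alpha_i}[diag{lambda_i} - E{g(S)S^T}]W."""
    rng = np.random.default_rng(seed)
    n, r = X.shape
    W = sym_orthogonalize(rng.standard_normal((m, r)))
    for _ in range(n_iter):
        S = X @ W.T                                  # current source estimates (n x m)
        g = np.tanh(S)
        lam = np.mean(S * g, axis=0)                 # lambda_i
        alpha = 1.0 / (np.mean(1.0 - g ** 2, axis=0) - lam)
        W = W + np.diag(alpha) @ (np.diag(lam) - g.T @ S / n) @ W
        W = sym_orthogonalize(W)
    return W
```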

Table 5. FastICA algorithm for obtaining maximum likelihood estimates.

1. Center the data to make its mean zero, and then whiten the result to give X.

2. Decide on the number, m, of independent components to be extracted.

3. Randomly initialize a separating matrix W.

4. Compute S = WX.

5. Compute λ_i = E(S_i g(S_i)), α_i = 1/(E(g′(S_i)) − λ_i), i = 1, 2, . . . , m.

6. Update W by W ← W + diag{α_i}[diag{λ_i} − E(g(S)S^τ)] W.

7. Carry out a symmetric orthogonalization of W by W ← (WW^τ)^{−1/2} W.

8. If convergence has not occurred, return to step 4.

5. LINEAR MIXING: II. NOISY ICA

The linear mixing version of noisy ICA,

X = AS + e,  (73)

where A is a full-rank (r×m) mixing matrix with unknown coefficients, has much in common

with factor analysis (Lawley and Maxwell, 1971; Harman, 1976). If we assume that the noise

component e has zero mean, a diagonal (r× r) covariance matrix, cov(e) = Ψ, with positive

diagonal entries, and that S and e are uncorrelated (E(Se^τ) = 0), then (73) reduces to the classical common factor analysis model (FA), where the sources are called factors. For the model (73), µ = 0 and Σ_XX = AA^τ + Ψ. The BSS (and ICA) problem for model (73) is to

estimate A and recover S.

5.1 Principal Components FA. Without making any distributional assumption (e.g., Gaussian) for the sources in (73), we can determine A using a least-squares formulation. In fact, premultiplying (73) by the Moore-Penrose generalized inverse, B = (A^τ A)^{−1}A^τ, of A, and then substituting the result in terms of S back into (73), we can re-express the model as

X = CX + E,  (74)

where C = AB has rank m, A and B are full-rank matrices each of rank m, E = (I − C)e, and X and E both have mean zero. The model (74) is the multivariate reduced-rank regression

model corresponding to principal components analysis (Izenman, 1975). The least-squares


criterion,

E{(X−ABX)τ (X−ABX)} (75)

is, therefore, minimized by setting

A = (v1, · · · ,vm) = Bτ , (76)

where vj is the eigenvector corresponding to the jth largest eigenvalue of ΣXX . The rows of

the matrix B give the coefficients of the m principal component scores, v_j^τ X, j = 1, 2, . . . , m,

and the eigenvalues of ΣXX , which are usually ordered from largest to smallest, measure the

variance (or power) of the m sources.

Because C = (AT)(T−1B) for any nonsingular (m×m)-matrix T, we can only determine

A (and, hence, also S) up to a rotation. In factor analysis, this is generally referred to as

the problem of factor indeterminancy. Although this solution does not depend on Ψ, an

adjustment to the analysis can be made by considering the matrix ΣXX − Ψ in place of

ΣXX . This approach, usually called the principal factor method, has sufficient computational

defects that it has generally been abandoned in favor of the maximum-likelihood (ML)

method.

5.2 Maximum-Likelihood FA. The ML method assumes a fully parametric model in which the m sources in (73) are distributed as multivariate Gaussian, S ∼ N_m(0, I_m),

independent of the noise, which is also multivariate Gaussian, e ∼ Nr(0,Ψ), where Ψ is

diagonal. In some formulations, Ψ = a2Ir, where a is an unknown constant.

Given n independent observations, x1, . . . ,xn, on X, we compute the sample covariance

matrix Σ̂_XX as before, which has a Wishart distribution: nΣ̂_XX ∼ W_r(n, Σ_XX). ML

estimators of A and Ψ are obtained by maximizing the logarithm of the likelihood function,

log_e L = −(n/2) log_e |AA^τ + Ψ| − (n/2) tr{Σ̂_XX (AA^τ + Ψ)^{−1}},  (77)

where we have ignored constants and terms which do not involve A or Ψ.

We apply the EM algorithm to maximize logeL with respect to A and Ψ (Rubin and

Thayer, 1982). See Table 6. The algorithm treats the unobservable source scores {si} as

if they were missing data. If the {si} were actually observed, the complete-data likelihood

would be given by the joint distribution of the {si} and the {ei = xi −Asi},

Lik = ∏_{i=1}^{n} {(2π)^{−r/2} |Ψ|^{−1/2} e^{−(1/2) e_i^τ Ψ^{−1} e_i} (2π)^{−m/2} e^{−(1/2) s_i^τ s_i}}
    = ((2π)^r ∏_{j=1}^{r} ψ_jj)^{−n/2} e^{−(1/2) ∑_{i=1}^{n} ∑_{j=1}^{r} (x_ij − A_j s_i)²/ψ_jj} × ((2π)^m)^{−n/2} e^{−(1/2) ∑_{i=1}^{n} s_i^τ s_i},  (78)

where x_ij is the jth component of x_i, A_j is the jth row of A, and ψ_jj is the jth diagonal element of the diagonal matrix Ψ. Given the observed data {x_ij} and the current estimated values of the parameters, the conditional expectation of the logarithm of (78), taken over the distribution of the missing data {s_i}, is the quantity computed in the E-step.

Table 6. EM algorithm for maximum likelihood factor analysis.

1. Let A_0 and Ψ_0 be initial guesses for the parameter matrices A and Ψ, respectively.

2. For k = 1, 2, . . . , iterate between the following two steps:

• E-Step: Compute

C_XX = n^{−1} ∑_{i=1}^{n} X_i X_i^τ,   C_XS^{(k−1)} = C_XX δ_{k−1}^τ,   C_SS^{(k−1)} = δ_{k−1} C_XX δ_{k−1}^τ + ∆_{k−1},

where δ_{k−1} = A_{k−1}^τ (A_{k−1} A_{k−1}^τ + Ψ_{k−1})^{−1} and ∆_{k−1} = I_m − δ_{k−1} A_{k−1}.

• M-Step: Update the parameter estimates,

A_k ← C_XS^{(k−1)} (C_SS^{(k−1)})^{−1},   Ψ_k ← diag{C_XX − C_XS^{(k−1)} (C_SS^{(k−1)})^{−1} C_XS^{(k−1)τ}}.

3. Stop when convergence has been attained.

The logarithm of (78) is, apart from an additive constant,

\[
\log_e(\mathrm{Lik}) = -\frac{n}{2}\sum_{j=1}^{r}\log_e(\psi_{jj}) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{r}\frac{(x_{ij} - A_j s_i)^2}{\psi_{jj}} - \frac{1}{2}\sum_{i=1}^{n} s_i^{\tau} s_i. \qquad (79)
\]

The E-step of the EM algorithm entails finding the conditional expectation of (79), given the observed data {x_i} and the current values of the parameters A and Ψ. Because the joint distribution of x_i and s_i, given A and Ψ, is (r + m)-variate Gaussian, the conditional distribution of s_i given x_i is

\[
(s_i \,|\, x_i, A, \Psi) \sim \mathcal{N}_m(\delta x_i, \Delta), \qquad (80)
\]

where

\[
\delta = A^{\tau}(A A^{\tau} + \Psi)^{-1}, \qquad (81)
\]
\[
\Delta = I_m - A^{\tau}(A A^{\tau} + \Psi)^{-1}A. \qquad (82)
\]
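The quantities δ and ∆ in (81)–(82) are simple matrix expressions; a minimal sketch in Python/NumPy (function name hypothetical) is:

```python
import numpy as np

def source_posterior(A, Psi_diag):
    """Posterior moments of s_i given x_i in the Gaussian FA model,
    following (80)-(82): E(s_i | x_i) = delta @ x_i, Cov(s_i | x_i) = Delta.

    A        : (r, m) mixing matrix; Psi_diag : length-r noise variances.
    """
    r, m = A.shape
    Sigma_model = A @ A.T + np.diag(Psi_diag)       # A A^T + Psi
    delta = A.T @ np.linalg.inv(Sigma_model)        # (81)
    Delta = np.eye(m) - delta @ A                   # (82)
    return delta, Delta
```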

To find the expectation of (79), we need the expectations of the following sufficient statistics,

\[
C_{XX} = n^{-1}\sum_{i=1}^{n} x_i x_i^{\tau}, \quad
C_{XS} = n^{-1}\sum_{i=1}^{n} x_i s_i^{\tau}, \quad
C_{SS} = n^{-1}\sum_{i=1}^{n} s_i s_i^{\tau}.
\]

Given the data {x_i} and parameters A and Ψ, the expectations are

\[
C^{*}_{XX} = E(C_{XX} \,|\, \{x_i\}, A, \Psi) = C_{XX}, \qquad (83)
\]
\[
C^{*}_{XS} = E(C_{XS} \,|\, \{x_i\}, A, \Psi) = C_{XX}\,\delta^{\tau}, \qquad (84)
\]
\[
C^{*}_{SS} = E(C_{SS} \,|\, \{x_i\}, A, \Psi) = \delta C_{XX}\,\delta^{\tau} + \Delta. \qquad (85)
\]

Equations (81) through (85) define the E-step based upon the observed data {x_i} and the current values of the parameter estimates A and Ψ.

The M-step provides the updated versions of the ML estimates by using the regression estimates,

\[
A = C^{*}_{XS}\, C^{*-1}_{SS}, \qquad (86)
\]
\[
\Psi = \mathrm{diag}\{ C^{*}_{XX} - C^{*}_{XS}\, C^{*-1}_{SS}\, C^{*\tau}_{XS} \}. \qquad (87)
\]

The current estimates (86) and (87) are substituted for A and Ψ, respectively, in (81) and (82) to obtain updated values of δ and ∆, which are then used to recompute C^{*}_{XS} and C^{*}_{SS} and to get new values of A and Ψ. The method is iterated until convergence.
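Putting the E-step quantities (81)–(85) and the M-step updates (86)–(87) together gives the iteration summarized in Table 6. The following Python/NumPy sketch is one way such a loop might be coded; the initialization, stopping rule, and names are my own choices, not those of Rubin and Thayer (1982):

```python
import numpy as np

def em_factor_analysis(X, m, n_iter=200, tol=1e-6, seed=0):
    """EM iteration of Table 6 for Gaussian ML factor analysis.

    X : (n, r) centered data matrix; m : number of sources (factors).
    Returns the estimated mixing matrix A (r x m) and noise variances Psi (length r).
    """
    rng = np.random.default_rng(seed)
    n, r = X.shape
    C_XX = (X.T @ X) / n                              # sample covariance
    A = rng.normal(scale=0.1, size=(r, m))            # A_0 (random start)
    Psi = np.diag(C_XX).copy()                        # Psi_0 (sample variances)
    for _ in range(n_iter):
        # E-step: delta, Delta and the expected sufficient statistics (83)-(85)
        Sigma_model = A @ A.T + np.diag(Psi)
        delta = A.T @ np.linalg.inv(Sigma_model)
        Delta = np.eye(m) - delta @ A
        C_XS = C_XX @ delta.T
        C_SS = delta @ C_XX @ delta.T + Delta
        # M-step: regression updates (86)-(87)
        A_new = C_XS @ np.linalg.inv(C_SS)
        Psi_new = np.diag(C_XX - C_XS @ np.linalg.inv(C_SS) @ C_XS.T)
        converged = np.max(np.abs(A_new - A)) < tol
        A, Psi = A_new, Psi_new
        if converged:
            break
    return A, Psi
```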

MLFA, however, cannot resolve the BSS problem, precisely because of these Gaussian assumptions: Gaussian variables which are mutually uncorrelated are automatically independent, and so MLFA only requires that the sources be uncorrelated. Furthermore, MLFA suffers from the same ailment as principal components FA: the likelihood function is rotationally invariant in factor space, and so the sources S and the mixing matrix A can be determined only up to an arbitrary rotation.


5.0.5 Independent Factor Analysis. As an alternative to the FA assumptions for dealing with the BSS problem, Attias (1999) introduced the technique of independent factor analysis (IFA), in which the model is still given by (2) with e ∼ N_r(0, Ψ), Ψ not necessarily diagonal, but now each unobserved source signal S_j is assumed to be independently distributed according to a non-Gaussian density. In particular, Attias modelled each source density by an arbitrary mixture of univariate Gaussian (MoG) densities,

\[
q_{S_j}(s_j) = \sum_{i=1}^{I_j} w_{ij}\, \phi_{\eta_{ij}}(s_j), \qquad (88)
\]

where \phi_{\eta_{ij}} is the N(\mu_{ij}, \sigma^2_{ij}) density, \eta_{ij} = (\mu_{ij}, \sigma^2_{ij}), and w_{ij} > 0 is the mixing proportion attached to the ith component of the jth source density, i = 1, 2, . . . , I_j, with \sum_{i=1}^{I_j} w_{ij} = 1, j = 1, 2, . . . , m (Attias, 1999).

The MoG density (88) can mimic both super-Gaussian and sub-Gaussian densities by using a large enough set of component densities, and this flexibility is a major reason why it has played such an important role in ICA modelling. MoG densities became widely used in statistics after Tukey (1960) showed how useful they were for modelling the presence of outliers and in robustness studies. He considered mixtures consisting of two components which have the same mean but different variances, and referred to the mixture density p(s) = (1 − w) p_{S_1}(s) + w p_{S_2}(s), with w small, as a contaminated density. Since then, MoG densities have been used in many different settings (Titterington, Smith, and Makov, 1985; McLachlan and Basford, 1988; Everitt and Hand, 1981). The main disadvantage of working with MoG densities is that the total number of parameters can grow very large.
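As a small numerical illustration of this flexibility (my own example, not drawn from the sources cited above), the Python/NumPy sketch below estimates the excess kurtosis of two two-component mixtures: a Tukey-style contaminated density (same mean, different variances) is heavy-tailed and hence super-Gaussian, while an equal-weight mixture of two well-separated means is sub-Gaussian:

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: E{(x - mu)^4} / sigma^4 - 3 (zero for a Gaussian)."""
    z = x - x.mean()
    return np.mean(z**4) / np.mean(z**2)**2 - 3.0

rng = np.random.default_rng(1)
n = 200_000

# Super-Gaussian: Tukey's contaminated density, same mean, different variances.
w = 0.05
mask = rng.random(n) < w
heavy = np.where(mask, rng.normal(0.0, 5.0, n), rng.normal(0.0, 1.0, n))

# Sub-Gaussian: equal-weight mixture of two well-separated means.
signs = np.where(rng.random(n) < 0.5, -3.0, 3.0)
flat = rng.normal(signs, 1.0, n)

print(excess_kurtosis(heavy))  # positive (heavy-tailed, super-Gaussian)
print(excess_kurtosis(flat))   # negative (flat-topped, sub-Gaussian)
```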

The joint source density, q_S(s), can be written as a product of its m component source densities,

\[
q_{S}(s) = \prod_{j=1}^{m} q_{S_j}(s_j) = \prod_{j=1}^{m} \sum_{i=1}^{I_j} w_{ij}\, \phi_{\eta_{ij}}(s_j) = \sum_{\mathbf{i}} w_{\mathbf{i}}\, \phi_{\eta}(s), \qquad (89)
\]

where the last sum extends over all combinations \mathbf{i} = (i_1, . . . , i_m) of component indices, η = {η_ij}, w_i is a product of the {w_ij}, and φ_η(s) is a product of the {φ_{η_ij}(s_j)}. The parameters of this mixture, the mixing matrix A, and the noise covariance matrix Ψ are estimated using an appropriate EM algorithm.
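To fix ideas, here is a minimal sketch of drawing data from the IFA generative model (Python/NumPy; the particular source settings, noise level, and dimensions are illustrative assumptions, not values from Attias, 1999): each source is sampled from its own MoG density (88), and the observations are formed as X = AS + e as in (2).

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, m = 5000, 4, 2                         # observations, sensors, sources

# MoG source densities (88): weights w_ij, means mu_ij, std devs sigma_ij.
sources = [
    {"w": [0.9, 0.1], "mu": [0.0, 0.0], "sd": [1.0, 4.0]},   # super-Gaussian
    {"w": [0.5, 0.5], "mu": [-2.0, 2.0], "sd": [1.0, 1.0]},  # sub-Gaussian
]

S = np.empty((n, m))
for j, spec in enumerate(sources):
    comp = rng.choice(len(spec["w"]), size=n, p=spec["w"])   # component labels
    S[:, j] = rng.normal(np.array(spec["mu"])[comp], np.array(spec["sd"])[comp])

A = rng.normal(size=(r, m))                  # mixing matrix
Psi = np.diag(np.full(r, 0.1))               # noise covariance (here diagonal)
E = rng.multivariate_normal(np.zeros(r), Psi, size=n)
X = S @ A.T + E                              # observed mixtures, model (2)
```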

6. REFERENCES

Attias, H. (1999), “Independent Factor Analysis,” Neural Computation, 11, 803–852.

Bach, F.R. and Jordan, M.I. (2002), “Kernel Independent Component Analysis,” Journal


of Machine Learning Research, 3, 1–48.

Back, A.D. and Weigend, A.S. (1997), “A First Application of Independent Component

Analysis to Extracting Structure from Stock Returns,” International Journal of Neural

Systems, 8, 473–484.

Bell, A.J. and Sejnowski, T.J. (1995), “An Information-Maximization Approach to Blind

Separation and Blind Deconvolution,” Neural Computation, 7, 1129–1159.

Cardoso, J.-F. and Pham, D.-T. (2001), “Separation of Non-Stationary Sources: Algo-

rithms and Performance,” In Independent Component Analysis: Principles and Prac-

tice, Roberts, S. and Everson, R. (eds.), Cambridge, U.K.: Cambridge University Press,

pp. 158–180.

Cherry, E.C. (1953), “Some Experiments in the Recognition of Speech, With One and Two

Ears,” Journal of the Acoustical Society of America, 25, 975–979.

Cichocki, A. and Amari, S. (2003), Adaptive Blind Signal and Image Processing, New York:

Wiley.

Comon, P. (1994), “Independent Component Analysis — A New Concept?” Signal Pro-

cessing, 36, 287–314.

Cook, D., Buja, A., and Cabrera, J. (1993), “Projection Pursuit Indexes Based on Or-

thogonal Function Expansions,” Journal of Computational and Graphical Statistics, 2,

225–250.

Cover, T. and Thomas, J. (1991), Elements of Information Theory, Volume 1, New York:

Wiley.

Diaconis, P. and Freedman, D. (1984), “Asymptotics of Graphical Projection Pursuit,”

Annals of Statistics, 12, 793–815.

Donoho, D. (1981), “On Minimum Entropy Deconvolution,” In Applied Time Series Anal-

ysis II, D.F. Findley (ed.), New York: Academic Press, pp. 565–608.

Everitt, B.S. and Hand, D.J. (1981), Finite Mixture Distributions, London: Chapman and

Hall.


Fodor, I.K. and Kamath, C. (2003), “Using Independent Component Analysis to Separate

Signals in Climate Data,” unpublished technical report, Lawrence Livermore National

Laboratories, Livermore, CA.

Friedman, J. (1987), “Exploratory Projection Pursuit,” Journal of the American Statistical

Association, 82, 249–266.

Friedman, J. and Tukey, J. (1974), “A Projection Pursuit Algorithm for Exploratory Data

Analysis,” IEEE Transactions on Computers, Series C, 23, 881–889.

Giannakopoulos, X., Karhunen, J., and Oja, E. (1999), “An Experimental Comparison

of Neural Algorithms for Independent Component Analysis and Blind Separation,”

International Journal of Neural Systems, 9, 99–114.

Girolami, M. (ed.) (2000), Advances in Independent Component Analysis, New York:

Springer-Verlag.

Gray, R.M. (1990), Entropy and Information Theory, New York: Springer.

Green, P.J. (1990), “Bayesian Reconstructions From Emission Tomography Data Using a

Modified EM Algorithm,” IEEE Transactions on Medical Imaging, 16(5), 516–526.

Hall, P. (1989), “On Polynomial-Based Projection Indices for Exploratory Projection Pur-

suit,” Annals of Statistics, 17, 589–605.

Harman, H.H. (1976), Modern Factor Analysis, Third Edition Revised, Chicago: The Uni-

versity of Chicago Press.

Hastie, T. and Tibshirani, R. (2003), “Independent Components Analysis Through Product

Density Estimation,” unpublished manuscript.

Hastie, T., Tibshirani, R., and Friedman, J. (2001), The Elements of Statistical Learning:

Data Mining, Inference, and Prediction, New York: Springer-Verlag.

Herault, J. and Jutten, C. (1986), “Space or Time Processing by Neural Network Mod-

els,” in Proceedings of the AIP Conference: Neural Networks for Computing (ed.:

J.S. Denker), 151, American Institute of Physics.

Huber, P. (1985), “Projection Pursuit,” Annals of Statistics, 13, 435–475.


Hurri, J., Gavert, H., Sarela, J., and Hyvarinen, A. (1998), “The FastICA Package for MATLAB,”

http://isp.imm.dtu.dk/toolbox/

Hyvarinen, A. (1998), “New Approximations of Differential Entropy for Independent Com-

ponent Analysis and Projection Pursuit,” In Advances in Neural Information Process-

ing Systems, 10, 273–279.

Hyvarinen, A. (1999), “The Fixed-Point Algorithm and Maximum Likelihood Estimation

for Independent Component Analysis,” Neural Processing Letters, 10, 1–5.

Hyvarinen, A., Karhunen, J. and Oja, E. (2001), Independent Component Analysis, New

York: Wiley.

Izenman, A.J. (1975), “Reduced-Rank Regression for the Multivariate Linear Model,” Jour-

nal of Multivariate Analysis, 5, 248–264.

Izenman, A.J. (1991), “Recent Developments in Nonparametric Density Estimation,” Jour-

nal of the American Statistical Association, 86, 205–224.

Joe, H. (1989), “Estimation of Entropy and Other Functionals of a Multivariate Density,”

Annals of the Institute of Statistical Mathematics, 41, 683–697.

Jones, M.C. and Sibson, R. (1987), “What is Projection Pursuit?” Journal of the Royal

Statistical Society, Series A, 150, 1–36.

Jutten, C. (2000), “Source Separation: From Dusk Till Dawn,” in Proceedings of the 2nd

International Workshop on Independent Component Analysis and Blind Source Sepa-

ration (ICA 2000), 15–26, Helsinki, Finland.

Kruskal, J.B. (1969), “Toward a Practical Method Which Helps Uncover the Structure

of a Set of Multivariate Observations by Finding the Linear Transformation Which

Optimizes a New ‘Index of Condensation’,” In Statistical Computation (R.C. Milton

and J.A. Nelder, eds.), pp. 427–440, New York: Academic Press.

Kruskal, J.B. (1972), “Linear Transformation of Multivariate Data to Reveal Clustering,”

In Multidimensional Scaling: Theory and Applications in the Behavioural Sciences,

Volume 1 (R.N. Shepard, A.K. Romney, and S.B. Nerlove, eds.), pp. 179–191, London:

Seminar Press.


Lawley, D.N. and Maxwell, A.E. (1971), Factor Analysis as a Statistical Method, Second

Edition, New York: American Elsevier Publishing Company.

Lee, T.-W. (1998), Independent Component Analysis — Theory and Applications, Kluwer.

Loeve, M. (1963), Probability Theory, New York: Van Nostrand.

Marchini, J.L., Heaton, C., and Ripley, B.D. (2003), “The fastICA Package, Version 1.1-3,”

http://www.stats.ox.ac.uk/~marchini/software.html

McKeown, M., Makeig, S., Brown, S., Jung, T.-P., Kindermann, S., Bell, A.J., and Se-

jnowski, T. (1998), “Analysis of fMRI Data by Blind Separation Into Independent

Spatial Components,” Human Brain Mapping, 6, 160–188.

McLachlan, G.J. and Basford, K.E. (1988), Mixture Models: Inference and Applications to

Clustering, New York: Dekker.

Nandi, A.K. (ed.) (1999), Blind Estimation Using Higher-Order Statistics, Kluwer.

Parra, L.C. and Spence, C.D. (2001), “Separation of Non-Stationary Natural Signals,” In

Independent Component Analysis: Principles and Practice, Roberts, S. and Everson,

R. (eds.), Cambridge, U.K.: Cambridge University Press, pp. 135–157.

Pechura, M. and Martin, J.B. (eds.) (1991), Mapping the Brain and Its Functions: Inte-

grating Enabling Technologies into Neuroscience Research, Washington, D.C.: National

Academy Press.

Roberts, S. and Everson, R. (eds.) (2001), Independent Component Analysis: Principles

and Practice, Cambridge, U.K.: Cambridge University Press.

Roweis, S. and Ghahramani, Z. (1999), “A Unifying Review of Linear Gaussian Models,”

Neural Computation, 11, 305–345.

Rubin, D.B. and Thayer, D.T. (1982), “EM Algorithms for ML Factor Analysis,” Psy-

chometrika, 47, 69–76.

Salerno, E., Bedini, L., Kuruoglu, E., and Tonazzini, A. (2002), “Blind Image Analysis

Helps Research in Cosmology,” ERCIM News, 49, April 2002.


Thisted, R.A. (1988), Elements of Statistical Computing: Numerical Computation, New

York: Chapman and Hall.

Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985), Statistical Analysis of Finite

Mixture Distributions, New York: Wiley.

Tukey, J.W. (1960), “A Survey of Sampling From Contaminated Distributions,” In Contributions

to Probability and Statistics (I. Olkin, ed.), Stanford, CA: Stanford University Press.

Tukey, P.A. and Tukey, J.W. (1981), “Graphical Display of Data Sets in 3 or More Dimen-

sions,” in Interpreting Multivariate Data (V. Barnett, ed.), pp. 187–275, New York:

Wiley.

Venables, W.N. and Ripley, B.D. (2002), Modern Applied Statistics with S, Fourth Edition,

New York: Springer-Verlag.

Weir, I.S. (1997), “Fully Bayesian SPECT Reconstructions,” Journal of the American Sta-

tistical Association, 92, 49–60.

Weir, I.S. and Green, P.J. (1994), “Modelling Data From Single Photon Emission Com-

puted Tomography,” In K.V. Mardia (ed.), Statistics and Images, 2, 313–338. Carfax,

Abingdon.