
Independent Component Analysis

— Lecture Notes —

Laurenz Wiskott
Institut für Neuroinformatik

Ruhr-Universität Bochum, Germany, EU

22 October 2019

Contents

1 Intuition
1.1 Mixing and unmixing
1.2 How to find the unmixing matrix?
1.3 Sources can only be recovered up to permutation and rescaling
1.4 Whiten the data first
1.5 A generic ICA algorithm

2 Formalism based on cumulants
2.1 Moments and cumulants
2.2 Cross-cumulants of statistically independent components are zero
2.3 Components with zero cross-cumulants are statistically independent
2.4 Rotated cumulants
2.5 Contrast function
2.6 Givens-rotations
2.7 Optimizing the contrast function
2.8 Visualization
2.9 Algorithm

© 2009, 2011–2013, 2016, 2018–2019 Laurenz Wiskott (ORCID http://orcid.org/0000-0001-6237-740X, homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License, see http://creativecommons.org/licenses/by-sa/4.0/. If figures are not included for copyright reasons, they are uni-colored, but the word 'Figure', 'Image', or the like in the reference is often linked to a freely available copy.

Core text and formulas are set in dark red, so one can review the lecture notes quickly by just reading these; important formulas or items worth remembering and learning for an exam are specially marked; ♦ marks less important formulas or items that I would usually also present in a lecture; + marks sections that I would usually skip in a lecture.

More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.


3 FastICA +
3.1 Contrast function +
3.2 Optimizing the contrast function +
3.3 Algorithm +

4 Applications
4.1 Removal of muscular artifacts from EEG signals

A Other resources
A.1 Written material
A.2 Videos
A.3 Software
A.4 Exercises

Requirements: I assume the student already ...

... can apply basic concepts from linear algebra, such as vector, matrix, matrix product, inverse matrix, orthogonal matrix.

... can apply basic concepts from probability theory and statistics, such as probability distribution, joint/conditional/marginal distribution, first and second moments (auto- as well as cross-), whitening matrix.

... can calculate the expectation value over a 2D probability distribution or samples thereof.

Learning objectives: The learning objective of this unit is that the student ...

... can reproduce the definition of the terms skewness, kurtosis, sub-Gaussian, super-Gaussian (all Sec. 2.1).

... can intuitively and formally explain the concepts statistical independence (Eq. 1), linear (un)mixing (Eqs. 3, 4), moments and cumulants (auto- as well as cross-) (Sec. 2.1), kurtosis (Sec. 2.1), ICA contrast function (Eq. 58), Givens rotation (Sec. 2.6).

... can illustrate in a 2D sketch the mixing and unmixing of two statistically independent sources with the help of distribution- and extraction-vectors (Fig. 3).

... can discuss the relationship between a 1D-distribution and its moments as well as cumulants, in particular if one shifts or scales the distribution (Sec. 2.1).

... can explain the relation between the cross-cumulants of the unmixed signal components and their Gaussianity (Eqs. 56–58), and can relate these two to the statistical independence of the components (Secs. 2.2, 2.3).

... can explain the cumulant based ICA algorithm (Sec. 2.9).

... can explain why whitening is a useful preprocessing step for any ICA algorithm (Sec. 1.4).

Please also work on the exercises and read the solutions to deepen these concepts.


Sources: These lecture notes are largely based on (Hyvärinen et al., 2001; Blaschke and Wiskott, 2004).

1 Intuition

1.1 Mixing and unmixing

In contrast to principal component analysis, which deals with the second-order moments of a data distribution, independent component analysis focuses on higher-order moments, which can, of course, be of very diverse and very complex nature. In (linear) independent component analysis (ICA) one assumes a very simple model of the data, namely that it is a linear mixture (D: lineare Mischung) x = (x_1, ..., x_N)^T of some statistically independent sources (D: statistisch unabhängige Quellen) combined in a source vector s = (s_1, ..., s_I)^T, and one often even assumes that the number of sources I is the same as the dimensionality N of the data. Each source is characterized by a probability density function (pdf) (D: Wahrscheinlichkeitsdichtefunktion) p_{s_i}(s_i), and the joint pdf of the sources is simply the product of its individual pdfs; for two sources we have

p_s(s_1, s_2) = p_{s_1}(s_1) p_{s_2}(s_2)    (1)
⟺  p_s(s_1 | s_2) = p_{s_1}(s_1) .    (2)

The second equation suggests the intuitive definition: knowing the value of one variable, here s_2, does not tell anything about the other variable, s_1. For an example see figure 1.

Figure 1: Individual and joint pdfs of two sources. © CC BY-SA 4.0

It is assumed that the data x is generated by linearly mixing (D: linear mischen) the sources like

x := M s ,    (3)

with an invertible square mixing matrix M. The task of independent component analysis then is to find a square matrix U that inverts this mixture, i.e. finds a linear unmixing (D: lineare Entmischung), so that

y := U x    (4)

recovers the sources. Consider first two examples where we know the mixing.
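To make the mixing model (3) and the unmixing (4) concrete, here is a minimal numpy sketch; the particular sources, the mixing matrix, and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two statistically independent, non-Gaussian, zero-mean sources:
# a uniform (sub-Gaussian) and a Laplacian (super-Gaussian) one, both with unit variance.
s = np.vstack([
    rng.uniform(-np.sqrt(3), np.sqrt(3), n_samples),
    rng.laplace(0.0, 1.0 / np.sqrt(2), n_samples),
])

# Linear mixing x = M s with an invertible (here non-orthogonal) mixing matrix.
M = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = M @ s

# If M is known, the unmixing matrix is simply U = M^{-1}, and y = U x recovers the sources.
U = np.linalg.inv(M)
y = U @ x
print(np.allclose(y, s))  # True
```

ICA addresses the harder problem where only x is given and M is unknown.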

In practice one does not deal with the random variables s and x and their pdfs, but one has concrete samples s^µ drawn from a distribution like (1) and x^µ calculated from that with (3). In the following we will drop the sample index µ for simplicity, with the understanding that averages 〈f(s)〉 can either be calculated theoretically with an integral 〈f(s)〉 := ∫ f(s) p(s) ds or in practice with a sum 〈f(s)〉 := (1/M) Σ_{µ=1}^{M} f(s^µ), where M is the number of data points.

Example 1: Figure 2 shows an example of the two-dimensional mixture

x := s_1 d_1 + s_2 d_2 = \underbrace{(d_1\ d_2)}_{=:M} \begin{pmatrix} s_1 \\ s_2 \end{pmatrix}    (5)

of the sources of figure 1, where the mixing is orthogonal, meaning that matrix M is orthogonal if we assume the vectors d_i to be normalized.

Figure 2: A two-dimensional orthogonal mixture of the two sources given above. © CC BY-SA 4.0

Notice that the vectors d_i indicate how a single source is distributed over the different data vector components. I therefore call them distribution vectors (D: Verteilungsvektoren), but this is not a standard term. They do not indicate the mixing of all sources on one component of the data vector, thus calling them mixing vectors would be misleading. Notice also that the fact that the data is concentrated around d_2 is a consequence of s_1 (not s_2!) being concentrated around zero.

Since the vectors d_i are orthogonal, extracting the sources from the mixture can be done by multiplying with these vectors, for instance

♦  y_1 := d_1^T x    (6)
♦      = d_1^T (s_1 d_1 + s_2 d_2)    (using (5))    (7)
♦      = s_1 \underbrace{d_1^T d_1}_{=1} + s_2 \underbrace{d_1^T d_2}_{=0}    (8)
♦      = s_1 ,    (9)

and likewise for y2.

The unmixing matrix therefore is

U := \begin{pmatrix} d_1^T \\ d_2^T \end{pmatrix} ,    (10)

since

U M = \begin{pmatrix} d_1^T \\ d_2^T \end{pmatrix} (d_1\ d_2)    (using (10), (5))    (11)
    = \begin{pmatrix} d_1^T d_1 & d_1^T d_2 \\ d_2^T d_1 & d_2^T d_2 \end{pmatrix}    (12)
    = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} .    (13)

Example 2: If the distribution vectors are not orthogonal, matters become a bit more complicated, see figure 3. It is somewhat counterintuitive that now the vectors e_i to extract the sources from the mixture do not have to point in the direction of the corresponding distribution vectors but rather must be orthogonal to all other distribution vectors, so that

e_i^T d_j = δ_{ij} .    (14)

Figure 3: A two-dimensional non-orthogonal mixture of the two sources given above. © CC BY-SA 4.0

Notice that in the figure e_1 has the same angle to d_1 as e_2 has to d_2, and that e_1 must be shorter than e_2, because d_1 is longer than d_2, to keep the inner products e_i^T d_i equal.

With the extraction vectors (D: Extraktionsvektoren) e_i we get the unmixing matrix

U := \begin{pmatrix} e_1^T \\ e_2^T \end{pmatrix}    (15)

and verify

U M = \begin{pmatrix} e_1^T \\ e_2^T \end{pmatrix} (d_1\ d_2)    (using (15), (5))    (16)
    = \begin{pmatrix} e_1^T d_1 & e_1^T d_2 \\ e_2^T d_1 & e_2^T d_2 \end{pmatrix}    (17)
    = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} .    (using (14))    (18)

1.2 How to find the unmixing matrix?

So far we have only derived unmixing matrices if the mixing matrix was known. But we have not said anything about how to find the unmixing matrix if only the data is given and nothing is known about the mixing (or the sources) apart from it being linear. So we need some statistical criteria to judge whether an unmixing matrix is good or not. There are two fundamental approaches: either one compares the joint pdf of the unmixed data with the product of its marginals, or one maximizes the non-Gaussianity of the marginals.

Make the output signal components statistically independent: A fundamental assumption of ICA is that the data is a linear mixture of statistically independent sources. If we unmix the data, the resulting output signal components should therefore again be statistically independent. Thus a possible criterion for whether the unmixing is good or not is whether

p_y(y_1, y_2) = p_{y_1}(y_1) p_{y_2}(y_2)    (19)

holds or not. In practice one measures the difference between the joint distribution and the product of marginals and tries to minimize it. For instance one can use the Kullback-Leibler divergence between p_y(y_1, y_2) and p_{y_1}(y_1) p_{y_2}(y_2). In this lecture I will focus on cross-cumulants as a measure of statistical independence.

This approach necessarily optimizes the unmixing matrix as a whole.

Make the output signal components non-Gaussian: Another approach is based on the observation that, if you mix two sources, the mixture is more Gaussian than the sources. Applied over and over again this culminates in the central limit theorem of statistics, which basically states that a mixture of infinitely many variables has a Gaussian distribution. Turning this argument around, it might be a good strategy to search for output signal components that are as different from a Gaussian as possible. These are then most likely the sources. To measure non-Gaussianity, one often uses kurtosis, see below.

This approach permits extracting one source after the other. One then typically first extracts the most non-Gaussian signal, eliminates the corresponding dimension from the data, and then finds the second non-Gaussian signal, etc.


1.3 Sources can only be recovered up to permutation and rescaling

It is clear that if y_1 and y_2 are statistically independent, 3 y_2 and −0.5 y_1 are also statistically independent. Thus, there is no way to tell the order of the sources and their scale or sign. A similar argument holds for the approach based on non-Gaussianity. Thus, the sources can in principle only be recovered up to a permutation and scaling (including sign).

1.4 Whiten the data first

The least one can expect from the estimated sources is that they are uncorrelated. To fix the arbitrary scaling factor, it is also common to require the output data components to have unit variance (zero mean is assumed in any case). It is therefore a good idea to whiten or sphere the data first, because then we know that the data projected onto any normalized vector has zero mean and unit variance and that the data projected onto two orthogonal vectors are uncorrelated. Thus, the unmixing matrix must be orthogonal and the unmixing problem reduces to finding the right rotation, which is still difficult enough. Intuitively, whitening brings us from the situation in figure 3 to that in figure 2.
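As an illustration of this preprocessing step, here is a minimal numpy sketch of whitening via an eigendecomposition of the covariance matrix; this particular symmetric whitening matrix is just one of several valid choices, and the function name is mine.

```python
import numpy as np

def whiten(x):
    """Whiten data x of shape (dims, samples): zero mean and unit covariance matrix."""
    x = x - x.mean(axis=1, keepdims=True)                      # remove the mean
    eigval, eigvec = np.linalg.eigh(np.cov(x))                 # eigendecomposition of the covariance
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T     # symmetric whitening matrix
    return W @ x, W

# After whitening, any orthogonal transformation leaves the data white,
# so the remaining ICA problem is to find the right rotation.
```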

1.5 A generic ICA algorithm

Summarizing what we have learned so far, a generic ICA algorithm is conceptually relatively simple and works as follows:

1. Remove the mean and whiten the data.

2. Rotate the data such that either

(a) the output signal components are as statistically independent as possible, or

(b) the output signal components are most non-Gaussian.

2 Formalism based on cumulants

There is a whole zoo of different ICA algorithms. I focus here on a method based on higher-order cumulants.

2.1 Moments and cumulants

Moments and cumulants are a convenient way of describing statistical distributions. If all moments or all cumulants are known, the distribution is uniquely determined. Thus, moments and cumulants are equivalent descriptions of a distribution, although one usually only uses the lower moments or cumulants. If 〈·〉 indicates the average over all data points, then the moments (D: Momente) of some vectorial data y written as single components are defined as

first moment     〈y_i〉    (20)
second moment    〈y_i y_j〉    (21)
third moment     〈y_i y_j y_k〉    (22)
fourth moment    〈y_i y_j y_k y_l〉    (23)
higher moments   ... .


A disadvantage of moments is that higher moments contain information that can already be expected from lower moments, for instance

♦  〈y_i y_j〉 = 〈y_i〉〈y_j〉 + ? ,    (24)

or for zero mean data

♦  〈y_i y_j y_k y_l〉 = 〈y_i y_j〉〈y_k y_l〉 + 〈y_i y_k〉〈y_j y_l〉 + 〈y_i y_l〉〈y_j y_k〉 + ? .    (25)

It would be nice to have a definition for only the part that is not known yet, much like the variance for the second moment of a scalar variable,

♦  〈(y − 〈y〉)^2〉 = 〈y y〉 − 〈y〉〈y〉 ,    (26)

where one has removed the influence of the mean from the second moment. This idea can be generalized to mixed and higher moments. If one subtracts off from the higher moments what one would expect from the lower moments already, one gets corrected moments that are called cumulants (D: Kumulanten). For simplicity we assume zero mean data, then

C_i := 〈y_i〉 = 0 ,    (27)
C_{ij} := 〈y_i y_j〉 ,    (28)
C_{ijk} := 〈y_i y_j y_k〉 ,    (29)
C_{ijkl} := 〈y_i y_j y_k y_l〉 − 〈y_i y_j〉〈y_k y_l〉 − 〈y_i y_k〉〈y_j y_l〉 − 〈y_i y_l〉〈y_j y_k〉 .    (30)

Notice that there are no terms to be subtracted off from C_{ij} and C_{ijk}, due to the zero-mean constraint: any term that one might subtract would include a factor of the form 〈y_i〉, which is zero. For non-zero mean data and for higher cumulants it is not trivial to determine the relationship between moments and cumulants, intuition fails quickly, but there are formalisms for that, see exercises.

For scalar variables these four cumulants have nice intuitive interpretations. We know already that C_i and C_{ii} are the mean and variance of the data, respectively. C_{iii} (often normalized by C_{ii}^{3/2}) is called the skewness (D: Schiefe) of the distribution, which indicates how much the distribution is tilted to the left or to the right. C_{iiii} (often normalized by C_{ii}^2) is called the kurtosis (D: Kurtosis) of the distribution, which indicates how peaky the distribution is. A uniform distribution has negative kurtosis and is called sub-Gaussian (D: sub-Gauß'sch); a peaky distribution with heavy tails has positive kurtosis and is called super-Gaussian (D: super-Gauß'sch). A Gaussian distribution has zero kurtosis, and all its cumulants of order higher than two vanish as well.

Cumulants have the important property that if you add two (or more) statistically independent variables, the cumulants of the sum equal the sum of the cumulants of the variables.
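As a quick numerical illustration of these sign conventions (a small numpy sketch; the particular distributions and the sample size are arbitrary choices), the empirical fourth cumulant is negative for a uniform, positive for a Laplacian, and approximately zero for a Gaussian distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def kurtosis(y):
    """Fourth cumulant C_iiii (eq. 30 with i=j=k=l) of approximately zero-mean data y."""
    y = y - y.mean()
    return np.mean(y**4) - 3 * np.mean(y**2)**2

print(kurtosis(rng.uniform(-np.sqrt(3), np.sqrt(3), n)))  # uniform, unit variance: ~ -1.2 (sub-Gaussian)
print(kurtosis(rng.laplace(0.0, 1.0 / np.sqrt(2), n)))    # Laplace, unit variance:  ~ +3.0 (super-Gaussian)
print(kurtosis(rng.normal(0.0, 1.0, n)))                   # Gaussian:                ~  0.0
```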

Figure 4: Illustration of the effect of the first four cumulants in one variable, i.e. mean (C_i), variance (C_{ii}), skewness (C_{iii}), and kurtosis (C_{iiii}); for each of them, distributions with a value below, at, and above the reference value (0 for mean, skewness, and kurtosis; 1 for the variance) are compared. © CC BY-SA 4.0

2.2 Cross-cumulants of statistically independent components are zero

First notice that if the random variables of a higher moment can be split into two statistically independent groups, then the moment can be written as a product of the two lower moments of the groups. For instance, if y_i and y_j are statistically independent of y_k, then

♦  〈y_i y_j y_k〉 = ∫∫∫ y_i y_j y_k p(y_i, y_j, y_k) dy_k dy_j dy_i    (31)
♦             = ∫∫∫ y_i y_j y_k p(y_i, y_j) p(y_k) dy_k dy_j dy_i    (due to statistical independence)    (32)
♦             = ∫∫ y_i y_j p(y_i, y_j) dy_j dy_i · ∫ y_k p(y_k) dy_k    (33)
♦             = 〈y_i y_j〉〈y_k〉 .    (34)

This has an important implication for cumulants of statistically independent variables. We have argued above that a cumulant can be interpreted as the corresponding moment of identical structure minus all that can be expected from lower-order moments (or cumulants) already. If the random variables can be split into two statistically independent groups, then the corresponding moment can be written as a product of two lower moments, which means that it can be completely predicted from these lower moments. As a consequence the cumulant is zero, because there is nothing left that could not be predicted. Thus we can state (without proof) that if a set of random variables is statistically independent, then all cross-cumulants vanish. For instance, for statistically independent random variables y_i and y_j with zero mean we get

♦  C_{iij} = 〈y_i y_i y_j〉    (35)
♦         = 〈y_i y_i〉 \underbrace{〈y_j〉}_{=0}    (using (34), since y_i and y_j are statistically independent)    (36)
♦         = 0    (since the data is zero-mean) ,    (37)

♦  C_{iijj} = 〈y_i y_i y_j y_j〉 − 〈y_i y_i〉〈y_j y_j〉 − 〈y_i y_j〉〈y_i y_j〉 − 〈y_i y_j〉〈y_i y_j〉    (38)
♦          = 〈y_i y_i〉〈y_j y_j〉 − 〈y_i y_i〉〈y_j y_j〉 − 2 \underbrace{〈y_i〉}_{=0}〈y_j〉〈y_i〉〈y_j〉    (using (34), since y_i and y_j are statistically independent)    (39)
♦          = 0    (since the data is zero-mean) .    (40)

But this also holds for non-zero mean data.

2.3 Components with zero cross-cumulants are statistically independent

The converse is also true (again without proof): if all cross-cumulants vanish, then the random variables are statistically independent. Notice that pairwise statistical independence does not generally suffice for the overall statistical independence of the variables. Consider, for instance, the three binary variables A, B, and C = A xor B.

Thus, the cross-cumulants can be used to measure the statistical dependence between the output signal components in ICA. Of course, one cannot use all cross-cumulants, but for instance one can

require  C_{ij} = δ_{ij}   (unit covariance matrix) ,    (41)

♦  and minimize  Σ_{ijk ≠ iii} C_{ijk}^2    (42)

or minimize  Σ_{ijkl ≠ iiii} C_{ijkl}^2    (43)

or minimize  Σ_{ijk ≠ iii} C_{ijk}^2 + Σ_{ijkl ≠ iiii} C_{ijkl}^2 .    (44)

Either one of (42–44) serves as an objective function or contrast function (D: Kontrastfunktion), as it is usually called in the ICA context, which has to be minimized under the unit covariance constraint (41).

Usually one uses the fourth-order cumulants, because signals are often symmetric and then the third-order cumulants vanish in any case. Assuming we disregard third-order cumulants, the optimization can be stated as follows: Given some multi-dimensional input data x, find the matrix U that produces output data

y = U x    (45)

that minimizes (43) under the constraint (41). The constraint is trivial to fulfill by whitening the data. Once the data has been whitened with some matrix W to obtain the whitened data x̂, the only transformation that is left to do is a rotation by a rotation matrix R, as we have discussed earlier. Thus, we have

y = R x̂    (46)
  = \underbrace{R W}_{=U} x .    (47)


2.4 Rotated cumulants

Let x̂ and y indicate the whitened and the rotated data, respectively, and C^x̂_{αβγε} and C^y_{ijkl} the corresponding cumulants. Then we can write the cumulants of y in terms of the cumulants of x̂:

♦  C^y_{ijkl} = 〈y_i y_j y_k y_l〉 − 〈y_i y_j〉〈y_k y_l〉 − 〈y_i y_k〉〈y_j y_l〉 − 〈y_i y_l〉〈y_j y_k〉    (48)
♦           = 〈Σ_α R_{iα} x̂_α Σ_β R_{jβ} x̂_β Σ_γ R_{kγ} x̂_γ Σ_ε R_{lε} x̂_ε〉
              − 〈Σ_α R_{iα} x̂_α Σ_β R_{jβ} x̂_β〉〈Σ_γ R_{kγ} x̂_γ Σ_ε R_{lε} x̂_ε〉 − ... − ...    (since (46) ⇔ y_i = Σ_α R_{iα} x̂_α)    (49)
♦           = Σ_{αβγε} R_{iα} R_{jβ} R_{kγ} R_{lε} \underbrace{(〈x̂_α x̂_β x̂_γ x̂_ε〉 − 〈x̂_α x̂_β〉〈x̂_γ x̂_ε〉 − ... − ...)}_{= C^x̂_{αβγε}}    (50)
♦           = Σ_{αβγε} R_{iα} R_{jβ} R_{kγ} R_{lε} C^x̂_{αβγε} .    (51)

The fact that the new cumulants are simply linear combinations of the old cumulants reflects the multilinearity of cumulants.
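Equation (51) is just a fourfold tensor contraction, and the rotation invariance (55) derived next can be checked numerically. A small sketch (a random tensor stands in for the cumulant tensor of x̂ here, since the identity only relies on R being orthogonal):

```python
import numpy as np

def rotate_cumulants(C_x, R):
    """Transform a fourth-order cumulant tensor according to eq. (51)."""
    return np.einsum('ia,jb,kc,ld,abcd->ijkl', R, R, R, R, C_x)

rng = np.random.default_rng(0)
C_x = rng.normal(size=(2, 2, 2, 2))                # stand-in for the cumulant tensor of x_hat
phi = rng.uniform(0.0, 2.0 * np.pi)
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])         # a random 2D rotation
C_y = rotate_cumulants(C_x, R)
print(np.allclose(np.sum(C_y**2), np.sum(C_x**2)))  # True: square sum is rotation invariant (eq. 55)
```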

2.5 Contrast function

It is interesting to note that the square sum over all cumulants of a given order does not change under a rotation, as can easily be verified for fourth order:

♦  Σ_{ijkl} (C^y_{ijkl})^2 = Σ_{ijkl} ( Σ_{αβγε} R_{iα} R_{jβ} R_{kγ} R_{lε} C^x̂_{αβγε} )^2    (using (51))    (52)
♦          = Σ_{ijkl} ( Σ_{αβγε} R_{iα} R_{jβ} R_{kγ} R_{lε} C^x̂_{αβγε} ) ( Σ_{κλµν} R_{iκ} R_{jλ} R_{kµ} R_{lν} C^x̂_{κλµν} )    (53)
♦          = Σ_{αβγε} Σ_{κλµν} C^x̂_{αβγε} C^x̂_{κλµν} \underbrace{Σ_i R_{iα} R_{iκ}}_{δ_{ακ}} \underbrace{Σ_j R_{jβ} R_{jλ}}_{δ_{βλ}} \underbrace{Σ_k R_{kγ} R_{kµ}}_{δ_{γµ}} \underbrace{Σ_l R_{lε} R_{lν}}_{δ_{εν}}    (54)
♦          = Σ_{αβγε} (C^x̂_{αβγε})^2 .    (55)

This implies

Σ_{ijkl ≠ iiii} C^2_{ijkl} + Σ_i C^2_{iiii} = const ,    (56)

which means that instead of minimizing the square sum over all cross-cumulants of order four it is equivalent to maximize the square sum over the kurtosis of all components, i.e.

♦  minimize  Σ_{ijkl ≠ iiii} C^2_{ijkl}    (57)
   ⟺  maximize  Ψ_4 := Σ_i C^2_{iiii} ,    (58)


since the sum of these two sums is constant. This is one way of formalizing our intuition that making all components statistically independent is equivalent to making them as non-Gaussian as possible. Maximizing Ψ_4 is obviously much easier than minimizing the square sum over all cross-cumulants. Thus, we will use that as our objective function or contrast function.

A corresponding relationship also holds for third-order cumulants. However, if the sources are symmetric, which they might well be, then their skewness is zero in any case and maximizing their square sum is of little use. One therefore usually considers fourth- rather than third-order cumulants for ICA. Considering both simultaneously, however, might even be better.

2.6 Givens-rotations

A rotation matrix in 2D is given by

♦  R = \begin{pmatrix} \cos φ & −\sin φ \\ \sin φ & \cos φ \end{pmatrix} .    (59)

In higher dimensions a rotation matrix can be quite complex. To keep things simple, we use so-called Givens rotations (D: Givens-Rotation), which are defined as a rotation within the 2D subspace spanned by two axes n and m. For instance, a rotation within the plane spanned by the second and the fourth axis (n = 2, m = 4) of a four-dimensional space is given by

R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos φ & 0 & −\sin φ \\ 0 & 0 & 1 & 0 \\ 0 & \sin φ & 0 & \cos φ \end{pmatrix} .    (60)

It can be shown that any general rotation, i.e. any orthogonal matrix with positive determinant, can be written as a product of Givens rotations. Thus, we can find the general rotation matrix of the ICA problem by applying a series of Givens rotations and each time improving the objective function (43) a bit.
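A small numpy helper that constructs such a Givens rotation (the function name and the 0-based indexing are my own conventions):

```python
import numpy as np

def givens_rotation(dim, n, m, phi):
    """Rotation by angle phi within the plane spanned by axes n and m (cf. eq. 60)."""
    R = np.eye(dim)
    R[n, n] = np.cos(phi); R[n, m] = -np.sin(phi)
    R[m, n] = np.sin(phi); R[m, m] = np.cos(phi)
    return R

# The example of eq. (60): axes n = 2 and m = 4 of a four-dimensional space
# correspond to indices 1 and 3 with 0-based indexing.
print(givens_rotation(4, 1, 3, 0.3))
```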

2.7 Optimizing the contrast function

We have argued above that ICA can be performed by repeatedly applying a Givens rotation to the whitened data. Each time, two axes (n and m) that define the rotation plane are selected at random and then the rotation angle is optimized to maximize the value of the contrast function. By applying enough such Givens rotations, the algorithm should eventually converge to the globally optimal solution (although we have no guarantee for that). Writing the contrast function as a function of the rotation angle φ yields

♦  Ψ_4(φ) := Σ_i (C^y_{iiii})^2    (61)
♦         = Σ_i ( Σ_{αβγε} R_{iα} R_{iβ} R_{iγ} R_{iε} C^x̂_{αβγε} )^2    (using (51))    (62)
♦         = K + Σ_{i ∈ {n,m}} ( Σ_{α,β,γ,ε ∈ {n,m}} R_{iα} R_{iβ} R_{iγ} R_{iε} C^x̂_{αβγε} )^2 .    (63)

For i ≠ n, m the entries of the rotation matrix are R_{iω} = δ_{iω} for ω = α, β, γ, ε and do not depend on φ. Thus, these terms are constant and are contained in K.


For i = n, m the entries of the rotation matrix are R_{iω} = 0 for ω ≠ n, m and R_{iω} ∈ {cos(φ), ±sin(φ)} for ω = n, m, with ω = α, β, γ, ε. Thus, the sums over α, β, γ, ε can be restricted to n, m, and each resulting term contains eight cosine or sine factors, because R_{iω} ∈ {cos(φ), ±sin(φ)} and due to the squaring. Thus, Ψ_4 can always be written as

Ψ_4(φ) = K + Σ_{p=0}^{8} k'_p cos(φ)^{8−p} sin(φ)^p .    (64)

The old cumulants C^x̂_{αβγε} are all contained in the constants k'_p.

We can simplify Ψ_4(φ) even further with the following two considerations. Firstly, a rotation by multiples of 90° should have no effect on the contrast function, because that would only flip or exchange the components of y. Thus, Ψ_4 must have a 90° periodicity and can always be written like

Ψ_4(φ) = A_0 + A_4 cos(4φ + φ_4) + A_8 cos(8φ + φ_8) + A_{12} cos(12φ + φ_{12}) + A_{16} ...    (65)

Secondly, products of two sin- and cos-functions produce constants and frequency doubled terms, e.g.

sin(φ)^2 = (1 − cos(2φ))/2    (66)
cos(φ)^2 = (1 + cos(2φ))/2    (67)
sin(φ) cos(φ) = sin(2φ)/2 .    (68)

Products of eight sin- and cos-functions produce at most eightfold frequencies (three times frequency doubling). This limits equation (65) to terms up to 8φ. Thus, finally we get the following contrast function for one Givens rotation:

Ψ_4(φ) = A_0 + A_4 cos(4φ + φ_4) + A_8 cos(8φ + φ_8) .    (69)

Again the constants contain the old cumulants C^x̂_{αβγε} in some complicated but computable form. One arrives at the same equation if one also includes third-order terms, i.e. if one uses contrast function (44) rather than (43), but the constants differ. Constants for the latter case are shown in Table 1.

Table 1: Constants for equation (69) if it also includes third-order terms (Blaschke, 2005). Constants for equation (69) without third-order terms are contained as a special case if one sets 0 = a_{30} = s_{34} = c_{34} = s_{38} = c_{38}.

It is relatively simple to find the maximum of this function once the constants are known.
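Instead of deriving the constants A_0, A_4, A_8 analytically from Table 1, one can also simply evaluate the contrast for a grid of angles and pick the best one. Here is a rough numpy sketch of that slower but simpler alternative; the function names and the grid resolution are my own choices:

```python
import numpy as np

def psi4(y):
    """Contrast Psi_4 (eq. 58): sum of the squared kurtoses of the components of y."""
    y = y - y.mean(axis=1, keepdims=True)
    kurt = np.mean(y**4, axis=1) - 3 * np.mean(y**2, axis=1)**2
    return np.sum(kurt**2)

def best_givens_angle(y, n, m, num_angles=180):
    """Grid search for the angle phi in [0, pi/2) that maximizes Psi_4 after a Givens
    rotation of components n and m, exploiting the 90-degree periodicity of eq. (69)."""
    best_phi, best_val = 0.0, -np.inf
    for phi in np.linspace(0.0, np.pi / 2, num_angles, endpoint=False):
        R2 = np.array([[np.cos(phi), -np.sin(phi)],
                       [np.sin(phi),  np.cos(phi)]])
        y_rot = y.copy()
        y_rot[[n, m], :] = R2 @ y[[n, m], :]
        val = psi4(y_rot)
        if val > best_val:
            best_phi, best_val = phi, val
    return best_phi
```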

2.8 Visualization

A nice way to visualize how the contrast function Ψ_4(φ) depends on the angle is to plot the surface of (C^y_{αααα})^2 as a function of angle for data that is already whitened and where the (normalized) sources are aligned with the main axes, see Figures 5 and 6. Remember that (C^y_{αααα})^2 is the squared fourth-order cumulant of y_α. In the figures the value of (C^y_{αααα})^2 is simply the distance of the surface from the origin in the direction of u_α, the corresponding row vector of the unmixing matrix U, which is the vector onto which x has to be projected in order to get y_α, see (4). As one rotates the unmixing vectors, i.e. as one varies the orthogonal unmixing matrix U, the rotated axes cut through the surface at different points. The sum of the distances of these intersection points from the origin corresponds to the value of the contrast function.

Figure 5: Visualization of (C^y_{αααα})^2 as a function of angle in 2D (Blaschke, 2005). The data is assumed to be whitened and the (normalized) sources are aligned with the axes. The value of (C^y_{αααα})^2 is the distance of the curved surface (or rather line in 2D) from the origin, as indicated by the gray solid line segments along the rotated axes. The value of the contrast function as a function of angle is simply the sum of the two values (C^y_{1111})^2 and (C^y_{2222})^2. The values along the main axes are those of the sources. It is obvious that the contrast function is maximized for angles φ ∈ {0, π/2, π, 3π/2}.

2.9 Algorithm

A cumulant-based algorithm for independent component analysis could now look as follows:

1. Whiten the data x with whitening matrix W and create y = x̂ = W x.

2. Select two axes/variables y_n and y_m randomly.

3. Rotate y in the plane spanned by y_n and y_m such that Ψ_4(φ) is maximized. This leads to a new y.

4. Go to step 2 unless a suitable convergence criterion is fulfilled, for instance, the rotation angle has been smaller than some threshold for the last 1000 iterations.

5. Stop.
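Putting the earlier sketches together (whiten from section 1.4, givens_rotation from section 2.6, and best_givens_angle from section 2.7, all of which are my own hedged helpers rather than the analytic solution via equation 69), the algorithm could be sketched like this; the parameter defaults are illustrative:

```python
import numpy as np

def cumulant_ica(x, tol=1e-3, patience=1000, seed=0):
    """Sketch of the cumulant-based ICA algorithm of Sec. 2.9.
    Assumes whiten(), givens_rotation(), and best_givens_angle() from the sketches above."""
    rng = np.random.default_rng(seed)
    y, W = whiten(x)                                     # step 1: remove the mean and whiten
    dim = y.shape[0]
    R_total = np.eye(dim)
    small_rotations = 0
    while small_rotations < patience:                    # step 4: convergence criterion
        n, m = rng.choice(dim, size=2, replace=False)    # step 2: pick two axes at random
        phi = best_givens_angle(y, n, m)                 # step 3: maximize Psi_4(phi)
        R = givens_rotation(dim, n, m, phi)
        y = R @ y
        R_total = R @ R_total
        # angles near 0 (or near pi/2, which only permutes/flips the pair) count as small rotations
        small_rotations = small_rotations + 1 if min(phi, np.pi / 2 - phi) < tol else 0
    return y, R_total @ W                                # unmixed components and U = R W
```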

Figure 6: Visualization of (C^y_{αααα})^2 as a function of angle in 3D. The left example shows a surface for three sources with super-Gaussian distribution. The third source has a much smaller (positive) kurtosis than the other two, which have a similar kurtosis. The right example shows a surface for two sources with super-Gaussian distribution and one source with sub-Gaussian distribution. The sources are not aligned with the main axes. In this case the surface actually touches the origin, while in the example on the left it does not. If there are sub-Gaussian as well as super-Gaussian sources, then there are always mixtures with zero kurtosis, resulting in (C^y_{αααα})^2 = 0 for some angles.

3 FastICA +

The algorithm described above finds a global solution, i.e. it extracts all components simultaneously. We have already argued that one could do that also one by one, first extracting the most non-Gaussian component, then the next one, etc. Furthermore, the algorithm above is based on third- and fourth-order cumulants, which are somewhat sensitive to outliers because of the relatively high exponent. There are many other ways of quantifying how non-Gaussian a distribution is, some of them more robust than the cumulants. FastICA is an algorithm that can extract the independent components one by one and is based on a different, more robust measure of non-Gaussianity (Hyvärinen, 1999; Hyvärinen and Oja, 2000).

3.1 Contrast function +

Assume y is an unknown scalar random variable and ν is a Gauss-distributed random variable, both with zero mean and unit variance. Now consider the general contrast function

♦  Ψ_G(y) = (〈G(y)〉_y − 〈G(ν)〉_ν)^2    (70)

with some scalar function G. It is quite obvious that this contrast is non-negative and zero if y is Gauss-distributed. The converse does not necessarily hold, i.e. Ψ_G(y) = 0 does not imply that y is Gauss-distributed, but it is very unlikely to get a value of zero for a random non-Gaussian distribution. For G(y) := y^2, both 〈G(y)〉_y as well as 〈G(ν)〉_ν measure the second moment and equal one, since we have assumed zero mean and unit variance data. Thus, choosing a quadratic function (or one that does not depend on y at all) is useless. For any other function, chances are high (although not 100%) that Ψ_G(y) increases as p_y(y) deviates from a Gaussian distribution.

For G(y) := y4 we find

ΨG(y) = (〈y4〉y − 〈ν4〉ν)2 (71)

= ((〈y4〉y − 3)− (〈ν4〉ν − 3))2 (72)

= (kurt(y)− kurt(ν))2 (since y and ν have unit variance) (73)

= (kurt(y))2 (since ν is Gauss-distributed) , (74)

where kurt indicates kurtosis. Thus, we have the same contrast function for one variable as we had above for several variables (58).


Due to the high exponent in the kurtosis, this contrast function is sensitive to outliers. Functions that yield a more robust contrast function are

G_t(y) := log cosh(y) ,   G_e(y) := −exp(−y^2/2) .    (75)

3.2 Optimizing the contrast function +

The data, i.e. the mixture, is denoted by x. As argued above, it is convenient to prewhiten the data, so that it is uncorrelated and the extraction vectors for the independent components are orthogonal to each other. The whitened data is denoted by x̂. FastICA in the form described here extracts only one component at a time, which can be written as y = e^T x̂, where e must be normalized to length one so that y has variance 1. The optimization problem can then be stated as follows:

maximize  Ψ_G(e) = (〈G(e^T x̂)〉 − 〈G(ν)〉_ν)^2    (76)
subject to  e^T e = 1 .    (77)

Since 〈G(ν)〉_ν is a constant, it is clear that the maxima of (76) are attained at extrema (minima or maxima) of 〈G(e^T x̂)〉. The optimization problem can thus be simplified to

extremize  Φ_G(e) = 〈G(e^T x̂)〉    (78)
subject to  e^T e = 1 ,    (79)

although now we get a minimum and a maximum, only one of which maximizes (76).

The Lagrange multiplier method integrates objective and constraint in one objective function,

♦  〈G(e^T x̂)〉 − λ e^T e ,    (80)

the extrema of which can be found by setting the gradient w.r.t. e to zero:

0 = ∇〈G(e^T x̂)〉 − (1/2) λ ∇(e^T e)    (81)
♦  = 〈x̂ G'(e^T x̂)〉 − λ e    (82)
♦  =: φ(e) ,    (83)

where we have introduced the factor (1/2) for notational convenience; it cancels against the factor 2 produced by the gradient of e^T e in the next step. λ is a Lagrange multiplier.

Assuming λ is known, a root of (82) can be found by Newton's method (Wikipedia contributors, 2018). In Newton's method the function φ is approximated linearly by a Taylor expansion and the root of that linear approximation is calculated analytically, which hopefully lies close to the true root. The method is usually iterated and converges quickly if it converges at all. Since φ is a map from R^N to R^N, we need the Jacobian matrix J for the Taylor expansion. And since we actually need the inverse of J, we use an approximation to reduce the computational costs. The Jacobian matrix depends on e and reads

♦  J(e) := ( ∂φ(e)/∂e_1, ..., ∂φ(e)/∂e_N )    (84)
        = 〈x̂ x̂^T G''(e^T x̂)〉 − λ I    (using (83))    (85)
♦       ≈ \underbrace{〈x̂ x̂^T〉}_{=I} 〈G''(e^T x̂)〉 − λ I    (86)
        = (〈G''(e^T x̂)〉 − λ) I    (87)
⟺  J^{-1}(e) = I / (〈G''(e^T x̂)〉 − λ) .    (88)


The linear Taylor expansion around e_0 is

♦  0 = φ(e) ≈ φ(e_0) + J(e_0) [e − e_0] .    (89)

Solving this for e, assuming the approximation is exact, yields

♦  e = e_0 − J^{-1}(e_0) φ(e_0)    (90)
♦    = e_0 − (〈x̂ G'(e_0^T x̂)〉 − λ e_0) / (〈G''(e_0^T x̂)〉 − λ)    (using (83), (88))    (91)

as a candidate for the root, which has to be improved by iterating this equation.

A problem is that we do not know λ; we only know that e should have norm 1, but (91) is not suitable to determine λ from this constraint. Thus, we simply normalize e and try to eliminate λ. Since we normalize e in any case and since the polarity of e does not matter, we can multiply (91) by −(〈G''(e_0^T x̂)〉 − λ) (assuming it to be non-zero), which yields

   e = e_0 − (〈x̂ G'(e_0^T x̂)〉 − λ e_0) / (〈G''(e_0^T x̂)〉 − λ)    (using (91))    (92)
♦  ⟺  \underbrace{−(〈G''(e_0^T x̂)〉 − λ) e}_{=: e^+} = −(〈G''(e_0^T x̂)〉 − λ) e_0 + 〈x̂ G'(e_0^T x̂)〉 − λ e_0    (93)
♦  ⟺  e^+ = 〈x̂ G'(e_0^T x̂)〉 − 〈G''(e_0^T x̂)〉 e_0 .    (94)

3.3 Algorithm +

The FastICA algorithm for a single component now works as follows:

1. Whiten the data x to produce whitened data x̂ with zero mean and unit covariance matrix.

2. Draw a random weight vector e with norm 1.

3. Iterate

e^+ = 〈x̂ G'(e^T x̂)〉 − 〈G''(e^T x̂)〉 e    (using (94))    (95)
e = e^+ / ‖e^+‖    (96)

until convergence, which is reached when the inner product between two successive weight vectors is either greater than 1 − ε or less than −1 + ε for some small positive constant ε.

4. Take the final weight vector as an extraction vector e1 := e.

The convergence criterion measures whether the direction of the weight vector has stabilized; its orientation might still flip back and forth, which is irrelevant for the result, because the sign of independent components is arbitrary.

We have made a lot of approximations to arrive at the relatively simple equation (95). Thus, it is not obvious that the algorithm actually works. However, FastICA is one of the most popular ICA algorithms due to its robustness and computational efficiency, which indicates that the approximations are OK in practice.

The averages are taken over all data points. If there are too many data points to be computationally practical, one can take subsets thereof, selected differently for each iteration. If, due to fluctuations in the selected data samples, the convergence criterion is not reached, one can use more samples in each iteration or increase ε, depending on whether accuracy or speed is more important, respectively.
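A compact numpy sketch of this single-component loop, using G(y) = log cosh(y) from equation (75) so that G'(y) = tanh(y) and G''(y) = 1 − tanh²(y); the function name, the iteration cap, and ε are illustrative choices:

```python
import numpy as np

def fastica_one_component(x_hat, max_iter=200, eps=1e-8, seed=0):
    """One-unit FastICA sketch (Sec. 3.3) on whitened data x_hat of shape (dims, samples)."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=x_hat.shape[0])
    e /= np.linalg.norm(e)                                    # step 2: random unit-norm start
    for _ in range(max_iter):
        y = e @ x_hat                                          # projections e^T x_hat
        g, g2 = np.tanh(y), 1.0 - np.tanh(y)**2                # G'(y) and G''(y)
        e_new = (x_hat * g).mean(axis=1) - g2.mean() * e       # fixed-point update, eq. (95)
        e_new /= np.linalg.norm(e_new)                         # normalization, eq. (96)
        if abs(e_new @ e) > 1.0 - eps:                         # direction has stabilized
            return e_new
        e = e_new
    return e
```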


Extracting more than just one independent component can be achieved by extracting one component after the other and orthogonalizing the weight vectors to the extraction vectors already determined. Thus, if extraction vectors e_1, ..., e_i have been found already, one replaces (96) by

e = (e^+ − Σ_{j=1}^{i} e_j e_j^T e^+) / ‖e^+ − Σ_{j=1}^{i} e_j e_j^T e^+‖ .    (97)

This would be a so-called deflation scheme, which is asymmetric insofar as later extraction vectors are constrained by earlier ones but not vice versa. It is also possible to formulate a more symmetric scheme, where all extraction vectors are determined simultaneously, like in the cumulant-based algorithm above, see (Hyvärinen, 1999; Hyvärinen and Oja, 2000).

Extracting multiple components also safeguards against the possibility that the first component extracted might not be the most non-Gaussian one, but another locally optimal solution.

4 Applications

4.1 Removal of muscular artifacts from EEG signals

Sebek et al. (2018) have used ICA to remove muscular artifacts from EEG recordings. Muscle activity, such as eye blinks or eye movements, produces electrical signals that interfere with the signals emitted by the brain. Since electrical signals interfere linearly to a first approximation, ICA can be used to unmix true EEG signals and muscular artifacts. The estimated sources are then classified by their signal characteristics as being either an artifact or an EEG source. Once the artifacts are known, they can easily be removed and a cleaned-up EEG signal can be reconstructed, see Figure 7. EEG analysis was one of the first successful application domains of ICA.

(Sebek et al., 2018, Fig. 1, © CC BY 4.0, URL) [1]

Figure 7: (A) Original EEG signal corrupted by muscular artifacts. Each trace shows the signal of one channel (electrode). (B) Signals separated by ICA and classified as muscular artifact (EMG) or EEG source. (C) Reconstructed EEG signal with muscular artifacts removed, same channels as in (A).

A Other resources

Numbers in square brackets indicate sections of these lecture notes to which the corresponding item is related.

A.1 Written material

• ICA on Wikipedia: https://en.wikipedia.org/wiki/Independent_component_analysis

A.2 Videos

• Abstract conceptual introduction to ICA from Georgia Tech

1. ICA objective: https://www.youtube.com/watch?v=2WY7wCghSVI (2:13)

2. Mixing and unmixing / blind source separation / cocktail party problem: https://www.youtube.com/watch?v=wIlrddNbXDo (4:17)

3. How to formulate this mathematically: https://www.youtube.com/watch?v=pSwRO5d266I (5:15)

4. Application to artificially mixed sound: https://www.youtube.com/watch?v=T0HP9cxri0A (3:25)

5. Quiz Questions PCA vs ICA: https://www.youtube.com/watch?v=TDW0vMz_3ag (0:32)

6. Quiz Answers PCA vs ICA with further comments: https://www.youtube.com/watch?v=SjM2Qm7N9CU (10:04)

7. More on differences between PCA and ICA: https://www.youtube.com/watch?v=e4woe8GRjEI (7:17)

– 01:23–01:34 I find this statement confusing. Transposing a matrix is a very different operation than rotating data, and I find it highly non-trivial that PCA works also on the transposed data matrix, see the section on singular value decomposition in the lecture notes on PCA.

– 02:49–02:59 Hm, this is not quite true. Either the mean has been removed from the data, which is normally the case; then the average face is at the origin and cannot be extracted with PCA. Or the mean has not been removed; then it is typically the first eigenvector (not the second) that goes through the mean, although not exactly, as one can see in one of the python exercises. There are marked differences around the eyes.

• Lecture on ICA based on cumulants by Santosh Vempala, Georgia Institute of Technology: https://www.youtube.com/watch?v=KSIA908KNiw (52:22)

– 00:45–06:31 Introducing the ICA problem definition

– 06:31–07:50 Would PCA solve the problem? No!

– 07:50–15:04 Formulation of a deflation algorithm based on cumulants

– 15:04–20:35 (Reformulation with tensor equations)

– 20:35– (Two problems in solving the ICA problem)

* 21:32– (What if the second and fourth order cumulants are zero?)

· 25:26– (Algorithm: Fourier PCA)

* ...

– ...

A.3 Software

• FastICA in scikit-learn, a Python library for machine learning: http://scikit-learn.org/stable/modules/decomposition.html#independent-component-analysis-ica

• Examples using FastICA in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html#sklearn.decomposition.FastICA
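A minimal usage sketch with scikit-learn (assuming a reasonably recent version, where the whiten="unit-variance" option exists); the toy sources and the mixing matrix are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 5000
s = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(0, 1, n)])  # two non-Gaussian sources
M = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = s @ M.T                                     # mixed data, shape (n_samples, n_features)

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
y = ica.fit_transform(x)                        # estimated sources (up to permutation, scaling, sign)
print(ica.mixing_)                              # estimated mixing matrix
```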

A.4 Exercises

• Analytical exercises by Laurenz Wiskott:
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/IndependentComponentAnalysis-ExercisesPublic.pdf
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/IndependentComponentAnalysis-SolutionsPublic.pdf

• Python exercises by Laurenz Wiskott:
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/IndependentComponentAnalysis-PythonExercisesPublic.zip
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/IndependentComponentAnalysis-PythonSolutionsPublic.zip


References

Blaschke, T. (2005). Independent Component Analysis and Slow Feature Analysis: Relations and Combination. PhD thesis, Institute for Physics, Humboldt University Berlin, Germany.

Blaschke, T. and Wiskott, L. (2004). CuBICA: Independent component analysis by simultaneous third- and fourth-order cumulant diagonalization. IEEE Transactions on Signal Processing, 52(5):1250–1256.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634.

Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley & Sons.

Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430.

Sebek, J., Bortel, R., and Sovka, P. (2018). Suppression of overlearning in independent component analysis used for removal of muscular artifacts from electroencephalographic records. PLoS ONE, 13(8):e0201900.

Wikipedia contributors (2018). Newton's method — Wikipedia, the free encyclopedia. [Online; accessed 19 November 2018].

Notes

[1] Sebek et al. (2018), PLoS ONE, Fig. 1, © CC BY 4.0, https://doi.org/10.1371/journal.pone.0201900