1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale...

38
1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

Transcript of 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale...

Page 1: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

1

CS/CBB 545 - Data MiningSpectral Methods (PCA,SVD)

#2 - Application

Mark Gerstein, Yale Universitygersteinlab.org/courses/545

(class 2007,03.08 14:30-15:45)

Page 2: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

2

Intuition on interpretation of SVD in terms of genes

and conditions

Page 3: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

3

SVD for microarray data(Alter et al, PNAS 2000)

Page 4: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

4

Page 5: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

5

Notation• m=1000 genes

– row-vectors

– 10 eigengene (vi) of dimension 10 conditions

• n=10 conditions (assays)– column vectors

– 10 eigenconditions (ui) of dimension 1000 genes

Page 6: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

6

Understanding Eigengenes (vi) in terms PCA on (large) gene-gene correlation matrix

Page 7: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

7

Understanding Eigenconditions (ui) in terms of PCA on (small) condition-condition correlation matrix

Bra - ket notation

Page 8: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

8

Plotting Experiments in Low Dimension Subspace

Page 9: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

9

Close up on Eigengenes

Page 10: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

10Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Genes sorted by correlation with top 2 eigengenes

Page 11: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

11Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Same thing different experiment: Genes sorted by relative correlation with first two eigengenes for alpha-factor experiment

Page 12: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

12Copyright ©2000 by the National Academy of Sciences

Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106

Normalized elutriation

expression in the subspace

associated with the cell cycle

Page 13: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

13

Biplot Applied to Genes and Conditions

See grouping of arrays and genes on same plot

Page 14: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

14

Spectral Biclustering

Page 15: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

15

Biclustering to associate particular genes with certain phenotypes

Conditions

Reo

rder

ed G

enes

(Sor

ted

acco

rdin

g to

a cl

assi

ficat

ion

vect

or)

?

Matrix of raw data

Gen

es

Reordered Conditions(Sorted according to

a classification vector)

Shuffled Matrix(containing checkerboard

“biclusters” of conditions with marker genes)

Page 16: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

16

Pomeroy et. al. , Nature 415 (2002) 436Prediction of central nervous system embryonal tumor outcome based on gene expression

5 types of brain tumors

Page 17: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

17

Intuition on Identification of Blocky Matrices

2

Page 18: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

18

6 6 6 6

6 6

4 4 4 4

4 4 4 4

5 5 5

5 5 56 6

6 6 6

8 8 8 8

8 8 8 8

8 8 8

5

7

5

3 3

54 4 4 4

7 7 7

7 7 7 7

7

3

3 3 3

37

6

8 37 37

E

E

E

a

a

aD

D

D

c

c

c

b

b

b

b

a

Ax yGene partition vector

tumor 1 tumor 2

Gen

e cl

uste

r 1

Gen

e cl

uste

r 2

tumor 3

Page 19: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

19

5

4 4 4

4 4 4

4

7 7 7 '

7 7 7 '

7 7 7 '

7

3 3 3

6 6 6

6 6 6

6 6 6

6 6 6

'

3 3 3

5 5

5 5 '

3 3 3

4 4

4 4 4

5

5 5 5

8

7

8 8 '

8 8 8 '

8 8 8 '

8 8 8 '

'

'

7

D

D

D

c

c

E

E

E

b

b

a

c

a

a

b

a

b

'TA y x

Tissue partition vector

Page 20: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

20

6 6 6 6

6 6

4 4 4 4

4 4 4 4

5 5 5

5 5 56 6

6 6 6

8 8 8 8

8 8 8 8

8 8 8

5

7

5

3 3

54 4 4 4

7 7 7

7 7 7 7

7

3

3 3 3

37

6

8 37 37

E

E

E

a

a

aD

D

D

c

c

c

b

b

b

b

a

5

4 4 4

4 4 4

4

7 7 7 '

7 7 7 '

7 7 7 '

7

3 3 3

6 6 6

6 6 6

6 6 6

6 6 6

'

3 3 3

5 5

5 5 '

3 3 3

4 4

4 4 4

5

5 5 5

8

7

8 8 '

8 8 8 '

8 8 8 '

8 8 8 '

'

'

7

D

D

D

c

c

E

E

E

b

b

a

c

a

a

b

a

b

'TA Ax x

Ax y 'TA y x

Biclustering by SVD

Page 21: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

21

6 6 6 6

6 6

4 4 4 4

4 4 4 4

5 5 5

5 5 56 6

6 6 6

8 8 8 8

8 8 8 8

8 8 8

5

7

5

3 3

54 4 4 4

7 7 7

7 7 7 7

7

3

3 3 3

37

6

8 37 37

E

E

E

a

a

aD

D

D

c

c

c

b

b

b

b

a

Identify checkerboard matrices by their action

on classification vectors: Formulation as “eigenproblem”

Checkerboard Matrix A

Condition Classification Vect. x

Conditions

Gen

es

Gene Classification Vector y

A A x = x’T

A A y = y’T

5

4 4 4

4 4 4

4

7 7 7 '

7 7 7 '

7 7 7 '

7

3 3 3

6 6 6

6 6 6

6 6 6

6 6 6

'

3 3 3

5 5

5 5 '

3 3 3

4 4

4 4 4

5

5 5 5

8

7

8 8 '

8 8 8 '

8 8 8 '

8 8 8 '

'

'

7

D

D

D

c

c

E

E

E

b

b

a

c

a

a

b

a

b

Genes

Con

ditio

ns

x’

yAT

Page 22: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

22

SVD to Solve Eigenproblem

[Botstein]

Page 23: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

23

Yuval Kluger et al. Genome Res. 2003; 13: 703-716

Figure 1. Overview of important parts of the biclustering process

Page 24: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

24

4 8 9 3

8 5

1 6 6 3

7 3 2 4

2 7 6

8 6 12 9

6 5 7

8 9 7 8

9 6 8 9

7 9 9

5

8

2

6 1

84 3 4 5

9 4 7

7 5 8 8

6

2

1 5 3

27

6

7 36 49

E

E

E

a

a

aD

D

D

c

c

c

b

b

b

b

a

Gene partition with noisy data

Page 25: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

25

.11

.12

.11

.12 .12

.11 .11

.10 .10 .10

.07 .07 .07

.12

.12 .12

.09

.10

.10 .10

.09 .0

.1

.12 .12

.

1 .11 .11

.10 .10

.10

.07

.07 .07

.04 .04 .04

.04 .04 .04

.04 .04 .04

9

.09 .09 .0.11

.11 .11 .11

12 .12 .1

.07 .07

.07 .07 .

2 .

07

.

.

10 .10 .1

9

.09 ..1 0907 .

02

1

1

09

D

D

D

c

E

E

b

b

b

bE

c

a

a

c

a

a

1R Ax y

Normalization Rescales Rows and Columns to Same Means

Page 26: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

26

.18 .09 .0.16 .32 .16

.16 .32 .16

9

.18 .09

.214 .107 .107

.21

.16 .32 .16

4 .107 .107

.214 .1

.09

.18 .09 .0

.16 .32 .1

.14 .28 .14

.14 .28 .14

.14 .28 .14

.14 .28 .14

.09 .18 .0 .312 .1

07 .10

56 .15

9

.18 .

9

.09 .

7

.214 .10

18 .0

6

.31

09

2 .

7

1

.0

56

6

.

9

1

.

9

107

56

.0 .312 .156 .

'

'

9 .18 .0

'

'

156

'

'

'

'

'

9 '

'

D

D

E

E

E

b

bD

c

c

b

c

a

a

a

b

a

1 'TC A y x Rescale columns

Page 27: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

27

Representative Cancer Data set

• Lymphoma Data from Dalla-Favera et al. at Columbia

• Informatics from Stolovitzky & Califano at IBM

• Supervised learning some identified characteristic genes associated with different types of lymphoma

Page 28: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

28

Patients (samples) sorted according to projection onto blocky classification eigenvector

(u2)

Gen

es s

orte

d ac

cord

ing

to

proj

ectio

n on

to b

lock

y cl

assi

ficat

ion

eige

nvec

tor

(v2)

Matrix values represent outer products of two blocky

classification eigenvectors

Results on Representative Cancer Data set

Page 29: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

29

Actual Data with Normalization and Sorting

Page 30: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

30

Actual Data just with Sorting

(no normalization)

Page 31: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

31

Actual Data (no normalization

or sorting)

Page 32: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

32

Actual Data just with Sorting

(no normalization)

Page 33: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

33

Actual Data with Normalization and Sorting

Page 34: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

34

Patients (samples) sorted according to projection onto blocky classification eigenvector

(u2)

Gen

es s

orte

d ac

cord

ing

to

proj

ectio

n on

to b

lock

y cl

assi

ficat

ion

eige

nvec

tor

(v2)

Matrix values represent outer products of two blocky

classification eigenvectors

Just signal from top classification

eigenvectors

Page 35: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

35

Low Dimension Representation

Page 36: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

36

Patients (samples) sorted according to projection onto blocky classification eigenvector

(u2)

Gen

es s

orte

d ac

cord

ing

to

proj

ectio

n on

to b

lock

y cl

assi

ficat

ion

eige

nvec

tor

(v2)

Actual Values of Projections onto

Classification Eigenvectors

Page 37: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

37

Classification of Cancers Based on Projection onto two top classification

eigenvectors: Better with Normalization

Normalized (“bistochastization”)

CLL DLCLFL DLCL

Straight SVD

Four types of Cancer in Della Favera dataset

Page 38: 1 CS/CBB 545 - Data Mining Spectral Methods (PCA,SVD) #2 - Application Mark Gerstein, Yale University gersteinlab.org/courses/545 (class 2007,03.08 14:30-15:45)

(

c) M

Ger

stei

n '0

6, g

erst

ein

.info

/tal

ks

38

Golub, TR et. al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999 286

biclustering bistochastization

SVD bi-normalization Normalized cuts

ALL (B) ALL (T)AML