Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf ·...

31
Probabilistic retrieval and visualization of relevant experiments Samuel Kaski Joint work with: José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma

Transcript of Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf ·...

Page 1: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Probabilistic retrieval and visualization of relevant

experiments

Samuel Kaski

Joint work with: José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma

Page 2: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Motivation

2

Page 3: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

How to best use collections of measurement data in biology?

Example: Microarray experiments

Data and experimental factors

Page 4: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Querying collections

“leukemia”“leukemia”“cancer”

“CML”“Crohnʼs disease”“CLL”

“leukemia”

(leukemia)

“CML”

“Crohnʼs disease”

“CLL”(leukemia)Data-driven query

Annotation-driven query:

Page 5: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

What is interesting/relevant?(i) Bring in covariates: Differential expression (treatment vs control). Why?- The experimenter designed the controls to

separate interesting variation- The differences are more comparable

across labs/situations

(ii) Bring in a modelof biology

Page 6: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Content-based Retrieval of EXperiments

REX v1.0 (Caldas et al, ISMB 2009): • Decompose data into binary comparisons• Find similar comparisons (model-based)• Include a biological model of the

comparisons

“CML”

“Crohnʼs disease”“CLL”(leukemia)

Page 7: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Methods needed

1. Modeling of an experiment2. Modeling of experiment collection3. Retrieval of relevant experiments4. Visualization of results

7

Page 8: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

1. Modeling of an experiment

Question: How do case and control differ?In biology: What biological processes are differentially active?

v0.1: Which genes are differentially expressed (t-test)

v1.0: Which gene sets are differentially expressed (and how much)

Page 9: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Gene Set Enrichment Analysis

Subramanian et al, PNAS 2005

Encode activity of the gene set by the count of active genes.

An experiment becomes encoded as a count vector. Dimension=gene set, count=number of active genes.

Gene set

Leading edge subset

Page 10: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

2. Modeling of an experiment collection

Task: Learn a decomposition of experiments into biological processes, given a database of experiments.

Solution v1.0:• Assume experiments are bags of gene set

activations (sets=biological constraints)• Probabilistic overlapping components by

topic models (data-driven modeling given the constraints)

Page 11: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

•Extensively used in bag-of-words text data.

Topic Model

“Species” “Global” “Climate” “Evolution”“Selection”

Topic 1 Topic 2

“Soccer”

Document 1 Document 2 Document 3

•Called Latent Dirichlet Allocation (LDA) or discrete PCA (dPCA)

Page 12: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

LDA and GSEA

Estimate Θ and Φ with collapsed Gibbs sampler.

GS 1 GS 2 GS 3 GS 5GS 4

Topic z2Topic z1 Topic z3

Comparison d

Θd

Φz

“leukemia” vs “healthy”

Page 13: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Components ofexperiments

GS

1G

S 2

GS

3G

S 5

GS

4

Top

ic z

2To

pic

z1

Top

ic z

3

Co

mp

ariso

n d

Page 14: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Components ofexperiments

Page 15: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

3. Retrieval of relevant experiments

Task: Find experiments in which the same biological processes are active.

≈ find experiments where the same components are active

Convenient given the probabilistic model. Rank the experiments by p(query|experiment)

Page 16: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

4. V

isua

lizat

ion

of re

sults

(a):

inte

rpre

tatio

n of

com

pone

nts

Page 17: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Visualization of results (b):nonlinear projection

Page 18: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Nonlinear projection

Task: Position each experiment on the plane such that relevant experiments are close to queries.

Solution:1. Use p(query|experiment) to define

relevance2. Ask the relative cost of misses and false

positives from the user3. Minimize total cost by NeRV

Page 19: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Neighbor retrieval visualizer NeRVOptimizes a user-defined tradeoff between precision and recall.

Input space

miss

Output space (visualization)

*

**

*

i

i

Q

Px

y

i

false

i

positiveVenna and Kaski, AISTATS 2007

Page 20: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Does really work

http://www.cis.hut.fi/projects/mi/software/dredviz/

Venna, Peltonen, Nybo, Aidos, and Kaski

0.994 0.996 0.998 1 1.002 1.0040.991

0.992

0.993

0.994

0.995

0.996

0.997

0.998

0.999NeRVLocalMDS

PCA

MDS

LLE

LE

HLLE

IsomapCCACDAMVULMVU

0.6 0.7 0.8 0.90.45

0.5

0.55

0.6

0.65

0.7

0.75NeRV

LocalMDS

PCA

MDS

LLE

LE

Isomap

CCA

CDA

LMVU

HLLE

0.85 0.9 0.95 1

0.82

0.84

0.86

0.88

0.9

0.92

0.94

0.96

!=0 NeRV

!=1

LocalMDS

!=0.9

PCA

MDS

LLE

LE

Isomap

CDA

MVULMVU

0.975 0.98 0.985 0.99 0.995 1

0.975

0.98

0.985

0.99

0.995 !=0

NeRV

!=1

LocalMDS!=0.9

PCA

MDS

LLE

LE IsomapCCA

MVU

LMVU

CDA

0.8 0.85 0.9 0.95

0.65

0.7

0.75

0.8!=0

NeRV

!=1

!=0

LocalMDS

!=0.9

PCA

MDS

LLE LE

HLLE

IsomapCCA

CDA

MVU

LMVU

0.95 0.96 0.97 0.98 0.99 1

0.95

0.96

0.97

0.98

0.99NeRV

!=0

LocalMDS

PCA

MDS

LLE

LE

HLLE

Isomap

CCACDA

MVU

LMVU

Plain s!curve

Mouse gene expressionNoisy s!curve

Faces Gene expression compendium

Mean rank!based smoothed precision (vertical axes) !

Mean rank!based smoothed recall (horizontal axes)

Seawater temperature time series

Figure 5: Mean rank-based smoothed precision–mean rank-based smoothed recall plotted forall six data sets. For clarity, only a few of the best performing methods are shownfor each data set. In each plot, the best performance is in the top right corner.

Experiment with a known underlying manifold. To further test how well the meth-ods are able to recover the neighborhood structure inherent in the data we studied a syn-thetic face data set where a known underlying manifold defines the relevant items (neigh-bors) of each point. The SculptFaces data contains 698 synthetic images of a face (sized64x64 pixels each). The pose and direction of lighting have been changed in a systematic way

16

Mean (rank-based, smoothed) recall

Mea

n (r

ank-

base

d, s

moo

thed

) pre

cisi

on

Page 21: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Results

21

Page 22: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Setup

•288 preprocessed experiments from ArrayExpress: 768 comparisons.

• Focussed on 105 comparisons of healthy vs disease.

•Ran GSEA on collection of manually curated gene sets from MSigDB (KEGG, Biocarta,...).

• LDA with 50 topics.

Page 23: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Top Gene Sets for Representative Topics

Topic 2 Topic 24 Topic 44 Topic 50

Cell Cycle (BIOCARTA) IL2RB Pathway mRNA Processing (REACTOME)

Oxidative Phosphorylation (KEGG)

Cell Cycle (KEGG) PDGF Pathway RNA Transcription (REACTOME)

Oxidative Phosphorylation (GENMAPP)

G1 to S Cell Cycle (REACTOME) EGF Pathway Translation Factors Glycolysis and

GluconeogenesisDNA Replication (REACTOME) Gleevec Pathway Folate Biosynthesis IL-7 Pathway

G2 Pathway IGF-1 Pathway Basal Transcription Factors γ-Hexachlorocyclohexane Degradation

Topics are coherent and cover a wide range of biological processes.

Some of the top gene sets have significant gene overlap.

Page 24: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Topic 2 (cell cycle gene sets) has “ATR BRCA Pathway” gene set at 8th position.

BRCA is involved in breast cancer and DNA repair.

Comparisons with highest probability for Topic 2?

More Detailed Analysis of Topic 2

Rank Comparison (... vs “normal”)

1 Sporadic basal-like breast cancer

2 Vulvar intraepithelial neoplasia

3 Breast carcinoma

4 Esophageal carcinoma

Page 25: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Topic 44 (transcription machinery) has “Folate Biosynthesis” gene set at 4th position.

Folate plays role in DNA and RNA synthesis.

Comparisons with highest probability for Topic 44?

Folate deficiency has been associated with both Crohn’s disease and leukemia.

More Detailed Analysis of Topic 44

Rank Comparison (... vs “normal)

1 Crohnʼs Disease

2 Chronic Lymphcytic Leukemia

3 Chronic Myelogenous Leukemia

Page 26: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Query with “malignant melanoma” vs “normal” comparison.

Querying the Model/Database

Rank Comparison (... vs “normal”)

1 Bladder Carcinoma

2 Vulvar Intraepithelial Neoplasia

3 Hyperparathyroidism

4 Lung (smoker)

5 Bladder Carcinoma

6 Bladder Carcinoma

7 Infiltrating Ductal Carcinoma

8 Prostate Cancer

9 Breast Carcinoma

10 Esophageal Adenocarcinoma

Page 27: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Retrieval results• 105 normal vs. disease comparisons:

‘cancer’ (27) or ‘not cancer’ (78)• Query with cancer comparisons• Compare to random baseline

Page 28: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Visualizationof retrievalresults

Page 29: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

Conclusions

Summary: Content-based query for experiments with an experiment:•Model of an experiment

• bring in controls / experimenter’s intent• bring in biology/task-dependent prior

knowledge•Model of an experiment collection

• data-driven machine learning•Retrieval: p(query|experiment)•Visualization of results: NeRV

Page 30: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

More information at

http://www.cis.hut.fi/projects/mi

Page 31: Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf · Probabilistic retrieval and visualization of relevant experiments Samuel Kaski

DEPARTMENT OF INFORMATIONAND COMPUTER SCIENCEADAPTIVE INFORMATICSRESEARCH CENTRE

!!

"#$#%&!'(!)!*+,&+-.+/!01!'202!!!!!34&&4561!748598:"#$#%&!'(!)!*+,&+-.+/!01!'202!!!!!34&&4561!748598:;+<4!*#--4&!=>8?+/+8=+!98:[email protected]&4>8!=+8&/+!!!!!!!!!;+<4!*#--4&!=>8?+/+8=+!98:[email protected]&4>8!=+8&/+!!!!!!!!!

!""#$%%&'(#)*+*,-./012,34!""#$%%&'(#)*+*,-./012,34

!"#$%&'()*+,,+-.&!%-/$0(%$

!"#$%&'( )&*( +,-+#*.( #%( '"/0+#( )( .%"/1*( 2%1"0,( 3)3*&( %4( "3( #%( '+5(3)6*'( "'+,6( #$*( *1*2#&%,+2( '"/0+''+%,( 3&%2*."&*( .*'2&+/*.( )#(122#344*5,#67879/-.:+;90<

!22*3#*.( 3)3*&'( 7+11( /*( 3"/1+'$*.( /8( 9:::( ;&*''( ),.( *1*2#&%,+2(3&%2**.+,6'(7+11(/*(.+'#&+/"#*.()#(#$*(<%&='$%3>

=>??&@AB&!>!CB'?$*( #7*,#+*#$( 7%&='$%3( +,( #$*( '*&+*'( %4( 7%&='$%3'( '3%,'%&*.( /8(9:::( @+6,)1( ;&%2*''+,6( @%2+*#8( 7+11( 3&*'*,#( #$*( 0%'#( &*2*,#( ),.(*52+#+,6( 2%,#&+/"#+%,'( +,( 0)2$+,*( 1*)&,+,6( 4%&( '+6,)1( 3&%2*''+,6(#$&%"6$( =*8,%#*( #)1='( )'( 7*11( )'( '3*2+)1( ),.( &*6"1)&( '+,61*A#&)2=('*''+%,'>(;)3*&'()&*('%1+2+#*.(#$)#(2%-*&(-)&+%"'()'3*2#'(%4(0)2$+,*(1*)&,+,6(4%&('+6,)1(3&%2*''+,6B()'(%"#1+,*.(+,(#$*(4%11%7+,6>

D"/1+.$&?$"%.+.E&F-%&'+E."5&!%-/$,,+.E

C)2$+,*( 1*)&,+,6( +,( 0"1#+A.+0*,'+%,)1( ),.( '#)#+'#+2)1( '+6,)1(3&%2*''+,6( +'( 2%,2*&,*.( 7+#$( #)'='( '"2$( )'( .*#*2#+%,B( *'#+0)#+%,B(3&*.+2#+%,B( 21)''+4+2)#+%,B( ),.( %3#+0+D)#+%,>( ?83+2)1( )33&%)2$*'( )&*(0%.*&,(+031*0*,#)#+%,'(%4('"3*&-+'*.B(",'"3*&-+'*.B(&*+,4%&2*0*,#(),.( '*0+A'"3*&-+'*.( 1*)&,+,6B( 4%&( +,'#),2*( "'+,6( 3&%/)/+1+'#+2(0%.*1+,6(),.(=*&,*1(0*#$%.'>(C)2$+,*(1*)&,+,6($)'()(7+.*(&),6*(%4()331+2)#+%,'E().)3#+-*(4+1#*&+,6B(#+0*A'*&+*'( ),)18'+'B( 3)##*&,( &*2%6,+#+%,B( +0)6*( 3&%2*''+,6B(2%03"#*&(-+'+%,B(.)#)(0+,+,6(),.(-+'")1+D)#+%,B( +,4%&0)#+%,(&*#&+*-)1B(&%/%#( 2%,#&%1B( .)#)( 4"'+%,B( /1+,.( '%"&2*( '*3)&)#+%,B( '3)&'*( ),.('#&"2#"&*.( &*3&*'*,#)#+%,'B( 2%,#*5#(0%.*1+,6B(0"1#+0%.)1( +,#*&4)2*'B(,*"&%+,4%&0)#+2'B( /+%+,4%&0)#+2'B( '*,'%&( ,*#7%&='B( 2%6,+#+-*( &).+%B(*#2>( !1'%B( +,( 0),8( )331+2)#+%,'B( $)&.7)&*( +031*0*,#)#+%,'( )&*(+03%&#),#(."*(#%(1)&6*()0%",#'(%4(.)#)(#%(/*(3&%2*''*.(),.(&*)1A#+0*(3&%2*''+,6(&*F"+&*0*,#'>

'/1$0(5$

@"/0+''+%,(%4(4"11(3)3*&'E(( ((>#%+5&8G%#+4+2)#+%,(%4()22*3#),2*E( ((D"G&6HH)0*&)A&*).8(3)3*&(),.()"#$%&(&*6+'#&)#+%,E ((I(.$&8H!.-),2*(&*6+'#&)#+%,(/*4%&*E((I(.$&6J

A%E".+;"2+-.

!"#"$%&'()%*$+(:&==+(IJ),$-.$%/'()%*$0+(

@)0"*1(K)'=+B(L)-+.(C+11*&12"(*%&'0"00*-#'()%*$0+(

C+==%(K"&+0%B(@)08(M*,6+%,34&*(*56'()%*$0+(

C)&2(N),(O"11*B(P))==%(;*1#%,*,7"4'%#8'234&*(%5*-#'()%*$0+(

!,##+(O%,=*1)B(P),(Q)&'*,9%5%'(-/2"5*5*-#+(K*,,*#$(O+1.B(

N+,2*(H)1$%",B(C+==%(K"&+0%:-(%&'%$$%#."/"#50'()%*$+(

?)3),+(R)+=%

K"2"&=-*#$2+2+-.

9,( 2%,J",2#+%,( 7+#$( #$*( 7%&='$%3B( )( .)#)( ),.( '+6,)1( ),)18'+'(2%03*#+#+%,(+'(/*+,6(%&6),+D*.>(<+,,*&'(7+11(3&*'*,#(#$*+&(7%&='(),.(&*2*+-*(#$*+&()7)&.(."&+,6(#$*(<%&='$%3>