Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf ·...
-
Upload
duongkhanh -
Category
Documents
-
view
215 -
download
0
Transcript of Probabilistic retrieval and visualization of relevant ...mmds.imm.dtu.dk/presentations/kaski.pdf ·...
Probabilistic retrieval and visualization of relevant
experiments
Samuel Kaski
Joint work with: José Caldas, Nils Gehlenborg, Ali Faisal, Alvis Brazma
Motivation
2
How to best use collections of measurement data in biology?
Example: Microarray experiments
Data and experimental factors
Querying collections
“leukemia”“leukemia”“cancer”
“CML”“Crohnʼs disease”“CLL”
“leukemia”
(leukemia)
“CML”
“Crohnʼs disease”
“CLL”(leukemia)Data-driven query
Annotation-driven query:
What is interesting/relevant?(i) Bring in covariates: Differential expression (treatment vs control). Why?- The experimenter designed the controls to
separate interesting variation- The differences are more comparable
across labs/situations
(ii) Bring in a modelof biology
Content-based Retrieval of EXperiments
REX v1.0 (Caldas et al, ISMB 2009): • Decompose data into binary comparisons• Find similar comparisons (model-based)• Include a biological model of the
comparisons
“CML”
“Crohnʼs disease”“CLL”(leukemia)
Methods needed
1. Modeling of an experiment2. Modeling of experiment collection3. Retrieval of relevant experiments4. Visualization of results
7
1. Modeling of an experiment
Question: How do case and control differ?In biology: What biological processes are differentially active?
v0.1: Which genes are differentially expressed (t-test)
v1.0: Which gene sets are differentially expressed (and how much)
Gene Set Enrichment Analysis
Subramanian et al, PNAS 2005
Encode activity of the gene set by the count of active genes.
An experiment becomes encoded as a count vector. Dimension=gene set, count=number of active genes.
Gene set
Leading edge subset
2. Modeling of an experiment collection
Task: Learn a decomposition of experiments into biological processes, given a database of experiments.
Solution v1.0:• Assume experiments are bags of gene set
activations (sets=biological constraints)• Probabilistic overlapping components by
topic models (data-driven modeling given the constraints)
•Extensively used in bag-of-words text data.
Topic Model
“Species” “Global” “Climate” “Evolution”“Selection”
Topic 1 Topic 2
“Soccer”
Document 1 Document 2 Document 3
•Called Latent Dirichlet Allocation (LDA) or discrete PCA (dPCA)
LDA and GSEA
Estimate Θ and Φ with collapsed Gibbs sampler.
GS 1 GS 2 GS 3 GS 5GS 4
Topic z2Topic z1 Topic z3
Comparison d
Θd
Φz
“leukemia” vs “healthy”
Components ofexperiments
GS
1G
S 2
GS
3G
S 5
GS
4
Top
ic z
2To
pic
z1
Top
ic z
3
Co
mp
ariso
n d
Components ofexperiments
3. Retrieval of relevant experiments
Task: Find experiments in which the same biological processes are active.
≈ find experiments where the same components are active
Convenient given the probabilistic model. Rank the experiments by p(query|experiment)
4. V
isua
lizat
ion
of re
sults
(a):
inte
rpre
tatio
n of
com
pone
nts
Visualization of results (b):nonlinear projection
Nonlinear projection
Task: Position each experiment on the plane such that relevant experiments are close to queries.
Solution:1. Use p(query|experiment) to define
relevance2. Ask the relative cost of misses and false
positives from the user3. Minimize total cost by NeRV
Neighbor retrieval visualizer NeRVOptimizes a user-defined tradeoff between precision and recall.
Input space
miss
Output space (visualization)
*
**
*
i
i
Q
Px
y
i
false
i
positiveVenna and Kaski, AISTATS 2007
Does really work
http://www.cis.hut.fi/projects/mi/software/dredviz/
Venna, Peltonen, Nybo, Aidos, and Kaski
0.994 0.996 0.998 1 1.002 1.0040.991
0.992
0.993
0.994
0.995
0.996
0.997
0.998
0.999NeRVLocalMDS
PCA
MDS
LLE
LE
HLLE
IsomapCCACDAMVULMVU
0.6 0.7 0.8 0.90.45
0.5
0.55
0.6
0.65
0.7
0.75NeRV
LocalMDS
PCA
MDS
LLE
LE
Isomap
CCA
CDA
LMVU
HLLE
0.85 0.9 0.95 1
0.82
0.84
0.86
0.88
0.9
0.92
0.94
0.96
!=0 NeRV
!=1
LocalMDS
!=0.9
PCA
MDS
LLE
LE
Isomap
CDA
MVULMVU
0.975 0.98 0.985 0.99 0.995 1
0.975
0.98
0.985
0.99
0.995 !=0
NeRV
!=1
LocalMDS!=0.9
PCA
MDS
LLE
LE IsomapCCA
MVU
LMVU
CDA
0.8 0.85 0.9 0.95
0.65
0.7
0.75
0.8!=0
NeRV
!=1
!=0
LocalMDS
!=0.9
PCA
MDS
LLE LE
HLLE
IsomapCCA
CDA
MVU
LMVU
0.95 0.96 0.97 0.98 0.99 1
0.95
0.96
0.97
0.98
0.99NeRV
!=0
LocalMDS
PCA
MDS
LLE
LE
HLLE
Isomap
CCACDA
MVU
LMVU
Plain s!curve
Mouse gene expressionNoisy s!curve
Faces Gene expression compendium
Mean rank!based smoothed precision (vertical axes) !
Mean rank!based smoothed recall (horizontal axes)
Seawater temperature time series
Figure 5: Mean rank-based smoothed precision–mean rank-based smoothed recall plotted forall six data sets. For clarity, only a few of the best performing methods are shownfor each data set. In each plot, the best performance is in the top right corner.
Experiment with a known underlying manifold. To further test how well the meth-ods are able to recover the neighborhood structure inherent in the data we studied a syn-thetic face data set where a known underlying manifold defines the relevant items (neigh-bors) of each point. The SculptFaces data contains 698 synthetic images of a face (sized64x64 pixels each). The pose and direction of lighting have been changed in a systematic way
16
Mean (rank-based, smoothed) recall
Mea
n (r
ank-
base
d, s
moo
thed
) pre
cisi
on
Results
21
Setup
•288 preprocessed experiments from ArrayExpress: 768 comparisons.
• Focussed on 105 comparisons of healthy vs disease.
•Ran GSEA on collection of manually curated gene sets from MSigDB (KEGG, Biocarta,...).
• LDA with 50 topics.
Top Gene Sets for Representative Topics
Topic 2 Topic 24 Topic 44 Topic 50
Cell Cycle (BIOCARTA) IL2RB Pathway mRNA Processing (REACTOME)
Oxidative Phosphorylation (KEGG)
Cell Cycle (KEGG) PDGF Pathway RNA Transcription (REACTOME)
Oxidative Phosphorylation (GENMAPP)
G1 to S Cell Cycle (REACTOME) EGF Pathway Translation Factors Glycolysis and
GluconeogenesisDNA Replication (REACTOME) Gleevec Pathway Folate Biosynthesis IL-7 Pathway
G2 Pathway IGF-1 Pathway Basal Transcription Factors γ-Hexachlorocyclohexane Degradation
Topics are coherent and cover a wide range of biological processes.
Some of the top gene sets have significant gene overlap.
Topic 2 (cell cycle gene sets) has “ATR BRCA Pathway” gene set at 8th position.
BRCA is involved in breast cancer and DNA repair.
Comparisons with highest probability for Topic 2?
More Detailed Analysis of Topic 2
Rank Comparison (... vs “normal”)
1 Sporadic basal-like breast cancer
2 Vulvar intraepithelial neoplasia
3 Breast carcinoma
4 Esophageal carcinoma
Topic 44 (transcription machinery) has “Folate Biosynthesis” gene set at 4th position.
Folate plays role in DNA and RNA synthesis.
Comparisons with highest probability for Topic 44?
Folate deficiency has been associated with both Crohn’s disease and leukemia.
More Detailed Analysis of Topic 44
Rank Comparison (... vs “normal)
1 Crohnʼs Disease
2 Chronic Lymphcytic Leukemia
3 Chronic Myelogenous Leukemia
Query with “malignant melanoma” vs “normal” comparison.
Querying the Model/Database
Rank Comparison (... vs “normal”)
1 Bladder Carcinoma
2 Vulvar Intraepithelial Neoplasia
3 Hyperparathyroidism
4 Lung (smoker)
5 Bladder Carcinoma
6 Bladder Carcinoma
7 Infiltrating Ductal Carcinoma
8 Prostate Cancer
9 Breast Carcinoma
10 Esophageal Adenocarcinoma
Retrieval results• 105 normal vs. disease comparisons:
‘cancer’ (27) or ‘not cancer’ (78)• Query with cancer comparisons• Compare to random baseline
Visualizationof retrievalresults
Conclusions
Summary: Content-based query for experiments with an experiment:•Model of an experiment
• bring in controls / experimenter’s intent• bring in biology/task-dependent prior
knowledge•Model of an experiment collection
• data-driven machine learning•Retrieval: p(query|experiment)•Visualization of results: NeRV
More information at
http://www.cis.hut.fi/projects/mi
DEPARTMENT OF INFORMATIONAND COMPUTER SCIENCEADAPTIVE INFORMATICSRESEARCH CENTRE
!!
"#$#%&!'(!)!*+,&+-.+/!01!'202!!!!!34&&4561!748598:"#$#%&!'(!)!*+,&+-.+/!01!'202!!!!!34&&4561!748598:;+<4!*#--4&!=>8?+/+8=+!98:[email protected]&4>8!=+8&/+!!!!!!!!!;+<4!*#--4&!=>8?+/+8=+!98:[email protected]&4>8!=+8&/+!!!!!!!!!
!""#$%%&'(#)*+*,-./012,34!""#$%%&'(#)*+*,-./012,34
!"#$%&'()*+,,+-.&!%-/$0(%$
!"#$%&'( )&*( +,-+#*.( #%( '"/0+#( )( .%"/1*( 2%1"0,( 3)3*&( %4( "3( #%( '+5(3)6*'( "'+,6( #$*( *1*2#&%,+2( '"/0+''+%,( 3&%2*."&*( .*'2&+/*.( )#(122#344*5,#67879/-.:+;90<
!22*3#*.( 3)3*&'( 7+11( /*( 3"/1+'$*.( /8( 9:::( ;&*''( ),.( *1*2#&%,+2(3&%2**.+,6'(7+11(/*(.+'#&+/"#*.()#(#$*(<%&='$%3>
=>??&@AB&!>!CB'?$*( #7*,#+*#$( 7%&='$%3( +,( #$*( '*&+*'( %4( 7%&='$%3'( '3%,'%&*.( /8(9:::( @+6,)1( ;&%2*''+,6( @%2+*#8( 7+11( 3&*'*,#( #$*( 0%'#( &*2*,#( ),.(*52+#+,6( 2%,#&+/"#+%,'( +,( 0)2$+,*( 1*)&,+,6( 4%&( '+6,)1( 3&%2*''+,6(#$&%"6$( =*8,%#*( #)1='( )'( 7*11( )'( '3*2+)1( ),.( &*6"1)&( '+,61*A#&)2=('*''+%,'>(;)3*&'()&*('%1+2+#*.(#$)#(2%-*&(-)&+%"'()'3*2#'(%4(0)2$+,*(1*)&,+,6(4%&('+6,)1(3&%2*''+,6B()'(%"#1+,*.(+,(#$*(4%11%7+,6>
D"/1+.$&?$"%.+.E&F-%&'+E."5&!%-/$,,+.E
C)2$+,*( 1*)&,+,6( +,( 0"1#+A.+0*,'+%,)1( ),.( '#)#+'#+2)1( '+6,)1(3&%2*''+,6( +'( 2%,2*&,*.( 7+#$( #)'='( '"2$( )'( .*#*2#+%,B( *'#+0)#+%,B(3&*.+2#+%,B( 21)''+4+2)#+%,B( ),.( %3#+0+D)#+%,>( ?83+2)1( )33&%)2$*'( )&*(0%.*&,(+031*0*,#)#+%,'(%4('"3*&-+'*.B(",'"3*&-+'*.B(&*+,4%&2*0*,#(),.( '*0+A'"3*&-+'*.( 1*)&,+,6B( 4%&( +,'#),2*( "'+,6( 3&%/)/+1+'#+2(0%.*1+,6(),.(=*&,*1(0*#$%.'>(C)2$+,*(1*)&,+,6($)'()(7+.*(&),6*(%4()331+2)#+%,'E().)3#+-*(4+1#*&+,6B(#+0*A'*&+*'( ),)18'+'B( 3)##*&,( &*2%6,+#+%,B( +0)6*( 3&%2*''+,6B(2%03"#*&(-+'+%,B(.)#)(0+,+,6(),.(-+'")1+D)#+%,B( +,4%&0)#+%,(&*#&+*-)1B(&%/%#( 2%,#&%1B( .)#)( 4"'+%,B( /1+,.( '%"&2*( '*3)&)#+%,B( '3)&'*( ),.('#&"2#"&*.( &*3&*'*,#)#+%,'B( 2%,#*5#(0%.*1+,6B(0"1#+0%.)1( +,#*&4)2*'B(,*"&%+,4%&0)#+2'B( /+%+,4%&0)#+2'B( '*,'%&( ,*#7%&='B( 2%6,+#+-*( &).+%B(*#2>( !1'%B( +,( 0),8( )331+2)#+%,'B( $)&.7)&*( +031*0*,#)#+%,'( )&*(+03%&#),#(."*(#%(1)&6*()0%",#'(%4(.)#)(#%(/*(3&%2*''*.(),.(&*)1A#+0*(3&%2*''+,6(&*F"+&*0*,#'>
'/1$0(5$
@"/0+''+%,(%4(4"11(3)3*&'E(( ((>#%+5&8G%#+4+2)#+%,(%4()22*3#),2*E( ((D"G&6HH)0*&)A&*).8(3)3*&(),.()"#$%&(&*6+'#&)#+%,E ((I(.$&8H!.-),2*(&*6+'#&)#+%,(/*4%&*E((I(.$&6J
A%E".+;"2+-.
!"#"$%&'()%*$+(:&==+(IJ),$-.$%/'()%*$0+(
@)0"*1(K)'=+B(L)-+.(C+11*&12"(*%&'0"00*-#'()%*$0+(
C+==%(K"&+0%B(@)08(M*,6+%,34&*(*56'()%*$0+(
C)&2(N),(O"11*B(P))==%(;*1#%,*,7"4'%#8'234&*(%5*-#'()%*$0+(
!,##+(O%,=*1)B(P),(Q)&'*,9%5%'(-/2"5*5*-#+(K*,,*#$(O+1.B(
N+,2*(H)1$%",B(C+==%(K"&+0%:-(%&'%$$%#."/"#50'()%*$+(
?)3),+(R)+=%
K"2"&=-*#$2+2+-.
9,( 2%,J",2#+%,( 7+#$( #$*( 7%&='$%3B( )( .)#)( ),.( '+6,)1( ),)18'+'(2%03*#+#+%,(+'(/*+,6(%&6),+D*.>(<+,,*&'(7+11(3&*'*,#(#$*+&(7%&='(),.(&*2*+-*(#$*+&()7)&.(."&+,6(#$*(<%&='$%3>