comp598 F2016 lecture 19 - cs.mcgill.cajeromew/docs/comp598_F2016_lecture_19.pdf · Reductionism vs....
16-11-09
1
COMP598: Advanced Computational Biology Methods & Research
System Biology
Jérôme Waldispühl School of Computer Science, McGill
(includes slides from M.Lavallée-Adam & B. Berger)
What is it?
“The science of integrating genetic, genomic, biochemical, cellular, physiological and clinical data to create a system network that can be used to predictively model a biological event(s).”
Morel, N.M., et al. Mayo Clin Proc, 2004. 79(5): p. 651-8.
Reductionism vs. Systems Biology
In the 20th century…
• Ludwig von Bertalanffy (1928): General Systems Theory, a “general science of wholeness”.
• Norbert Wiener (1948): Cybernetics, “the science of communications and automatic control systems in both machine and living things.”
• Hodgkin & Huxley (1952): math model explaining the action potential propagating along the axon of a neuronal cell.
• Denis Noble (1960): math model of cardiac cells.
• Mihajlo Mesarovic (1968): organized the 1st “System theory and Biology” symposium. Launch of a new scientific discipline!
• 1990s: “-omics revolution”!
“-omics revolution”
• Genomics
• Proteomics
• Metabolomics
• Transcriptomics
• Functional proteomics/genomics
SYSTEMS BIOLOGY
Morel, N.M., et al. Mayo Clin Proc, 2004. 79(5): p. 651-8.
Back to the car analogy
• How would you use a system approach to understand how a car functions?
1. Preliminary understanding -> formulate a simple model
2. Define all the components: mechanical, electrical, and control.
3. Perturb the car and compare to normal car
4. Integrate data and compare to your model
5. Discrepancies? -> new hypothesis -> repeat steps 3-5.
A test system: galactose utilization in S. cerevisiae
• 9 elements:
– 4 enzymes catalyze conversion of galactose (gal) to glucose-6-P
– 1 transporter molecule
  • Sets the state of the system
– 4 transcription factors (TFs)
  • Turn system on/off depending on galactose presence/absence
Ideker, T., et al., Science, 2001. 292(5518): p. 929-34.

Perturb the system and compare
• Yeast strains used:
– 9 knock-out (KO)
– 1 wild-type (WT)
• DNA Microarray
Ideker, T., et al., Science, 2001. 292(5518): p. 929-34.

Experimental data = model?
Ideker, T., et al., Science, 2001. 292(5518): p. 929-34.
DeContaminator: Modeling contaminants in AP-MS/MS experiments
Protein interactions obtained by Tandem Affinity Purification
[Figure labels: Bait, Tag, Preys, Contaminants, Background]
Pipeline steps: Cell Culture, TAP, SDS-PAGE, 2D-LC, MS/MS, Database search
False positive sources
• Non-specificity of TAG antibody
• Faulty purification
• Misidentification
• Carry-over
• Over-expression
• Gel contamination

Experimental improvements
• In-cell normal expression
• Additional purifications
• LC column washing
• Robot gel band cutting
Computational filtering
• Krogan et al., 2006; Chua et al., 2006; Ewing et al., 2007; Cloutier et al., 2009; Collins et al., 2007
• Peptide/ProteinProphet (Keller et al., Nesvizhskii et al.); Percolator (Käll et al.)
• DeContaminator (Lavallée-Adam et al., JPR)
Simple contaminant detection
Manually label all contaminants and systematically remove all interactions with these proteins.
Limitations:
• A contaminant for one bait might be a true interaction for another.
• Could not detect sporadic contamination.
[Figure: baits and their preys]
Alternate contaminant detection method
Two experiments for a given bait b:
• Induced experiment: expression of the bait vector is induced.
• Control experiment: expression of the bait vector is not induced.
M Ratio method: for a prey p:
If MS_Score(b_induced, p) < 5 * MS_Score(b_control, p)
    p is a contaminant
Else
    p is truly interacting with b
[Jeronimo et al., 2007]
Limitations:
• Expensive both in terms of time and resources
• One-to-one comparisons of noisy low abundance prey MS/MS results
• Control might show leaky expression of the bait
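As a minimal sketch of the M Ratio rule, assuming MS scores are plain non-negative numbers, the decision for one prey could be written as below. The function name, the `ratio` parameter and the example scores are hypothetical and only serve the illustration.

```python
def m_ratio_is_contaminant(induced_score, control_score, ratio=5.0):
    """Return True if the prey is called a contaminant by the M Ratio rule.

    The prey is a contaminant when its MS score in the induced experiment is
    less than `ratio` times its MS score in the matched control experiment.
    """
    return induced_score < ratio * control_score


# Made-up scores for three preys of one bait: (induced, control).
scores = {"prey_A": (40.0, 10.0), "prey_B": (60.0, 10.0), "prey_C": (25.0, 0.0)}
calls = {prey: m_ratio_is_contaminant(ind, ctl) for prey, (ind, ctl) in scores.items()}
```

Each call compares one induced score with one control score, so noise in either measurement can flip the decision; this is the one-to-one comparison limitation listed for this method.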
DeContaminator (Lavallée-Adam et al., Journal of Proteome Research)
• Goal: Use a limited number of controls for the proper identification of contaminants in TAP-MS/MS PPI data.
• Advantages:
– No one-to-one comparisons of MS scores have to be performed.
– Accurate modeling with limited resource usage.
– Using a limited number of high-quality controls avoids expression leakiness issues.
Objective: Compute the posterior distribution of $\mathrm{MCM}_p$ given all MS score observations $\mathrm{CM}_{b,p}$ $\forall b \in B$ (Lavallée-Adam et al., JPR)
[Figure: distributions $\Pr[\bar{M}^{NI}_p \mid M^{NI}_{b_1,p}, \ldots, M^{NI}_{b_{14},p}]$ and $\Pr[\bar{M}^{I}_{b,p} \mid M^{I}_{b,p}]$, with the observations $M^{NI}_{b_i,p}$ and $M^{I}_{b,p}$ and the p-value of $M^{I}_{b,p}$ marked; x-axis: Mascot Score]

Discussion
Our proposed Bayesian method shows an improvement in accuracy for the detection of contaminant PPIs in our dataset when compared to currently used alternate approaches. We expect that this decrease in false positive interactions will facilitate the analysis of PPI networks and ease the characterization of novel biological pathways. At the same time, our approach will greatly reduce experimental costs by cutting the number of most experimental manipulations almost in half. This expense reduction is due to the much smaller number of control experiments needed by our algorithm compared to the methods described in Jeronimo et al. [16], where each induced experiment requires a matched non-induced experiment for its interactions to be classified. It is also worth noting that in theory, the control experiments provided as input to the algorithm could all be performed with the same bait protein. However, we used non-induced experiments produced from different baits, by different experimentalists at different time periods. These biological and technical replicates allow us to factor in the noise resulting from the change of baits in TAP-MS/MS experiments and technical variation.

Advantages
PPIs are often viewed and studied as a network. Several algorithms (e.g. [13, 14, 16]) use the topology of this network to determine whether an interaction is a likely true or a false positive. The reasoning is based on the fact that if two putatively interacting proteins also share similar sets of interacting partners, they are more likely to form a complex and therefore to be truly interacting. However, this …
3. For each pair $(b, p) \in B \times P$, assign a p-value to $M^{I}_{b,p}$:
$$\mathrm{pvalue}(M^{I}_{b,p}) = \Pr[\bar{M}^{NI}_p \ge \bar{M}^{I}_{b,p} \mid M^{I}_{b,p}, M^{NI}_{b_1,p}, M^{NI}_{b_2,p}, \ldots, M^{NI}_{b_{14},p}]$$
4. Using the non-induced and full induced data sets, assign a false discovery rate to each p-value.
Each step is detailed further below and illustrated in Figure 2.

The same quantities in the slide notation ($\mathrm{CM} = M^{NI}$, $\mathrm{MCM} = \bar{M}^{NI}$, $\mathrm{IM} = M^{I}$, $\mathrm{MIM} = \bar{M}^{I}$):
• Control MS score mean distribution: $\Pr[\mathrm{MCM}_p \mid \mathrm{CM}_{b_1,p}, \ldots, \mathrm{CM}_{b_{14},p}]$, from the control observations $\mathrm{CM}_{b_i,p}$
• Induced MS score mean distribution: $\Pr[\mathrm{MIM}_{b,p} \mid \mathrm{IM}_{b,p}]$, from the induced observation $\mathrm{IM}_{b,p}$
$$\mathrm{pvalue}(\mathrm{IM}_{b,p}) = \Pr[\mathrm{MCM}_p \ge \mathrm{MIM}_{b,p} \mid \mathrm{IM}_{b,p}, \mathrm{CM}_{b_1,p}, \mathrm{CM}_{b_2,p}, \ldots, \mathrm{CM}_{b_{14},p}]$$
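The p-value compares the control and induced mean-score distributions. A minimal sketch, assuming both posteriors are already available as discrete distributions over the same integer score grid and are treated as independent (an assumption made here for illustration; the function and variable names are hypothetical):

```python
import numpy as np


def contaminant_pvalue(control_posterior, induced_posterior):
    """Pr[MCM >= MIM] for two independent discrete posteriors on one grid."""
    control = np.asarray(control_posterior, dtype=float)
    induced = np.asarray(induced_posterior, dtype=float)
    # tail[m] = Pr[MCM >= m]: reversed cumulative sum of the control posterior
    tail = np.cumsum(control[::-1])[::-1]
    # Sum over m of Pr[MIM = m] * Pr[MCM >= m]
    return float(np.dot(induced, tail))


# Made-up posteriors over three score levels (low, medium, high).
p_example = contaminant_pvalue([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

A small value means the induced mean score is unlikely to be matched by the control mean score, which argues against the prey being a contaminant for that bait.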
Modeling Contaminants
Non-induced results from the set of baits B are pooled. Using a weighted k-nearest neighbours smoothing of the frequency of each $\mathrm{MCM}_p$ value, conditional on $\mathrm{CM}_{b,p}$ values $\forall b \in B$, we obtain an estimate of:
$$\Pr[\mathrm{CM}_{b,p} = cm \mid \mathrm{MCM}_p = mcm]$$
After Bayes rule:
$$\Pr[\mathrm{MCM}_p = mcm \mid \mathrm{CM}_{b,p} = cm]$$
The posterior distribution of $\mathrm{MCM}_p$ scores is then:
$$\Pr[\mathrm{MCM}_p \mid \mathrm{CM}_{b_1,p}, \ldots, \mathrm{CM}_{b_{14},p}] = \Pr[\mathrm{MCM}_p] \prod_{i=1}^{14} \Pr[\mathrm{CM}_{b_i,p} \mid \mathrm{MCM}_p] \, / \, Z$$
where $Z$ is a normalizing constant.
The posterior distribution of $\mathrm{MIM}_{b,p}$ is computed in a similar fashion:
$$\Pr[\mathrm{MIM}_{b,p} \mid \mathrm{IM}_{b,p}]$$

Each step is detailed further below and illustrated in Figure 2.
Step 1: Building a noise model from non-induced experiments
The set of 14 TAP-MS/MS experiments where the bait’s expression was not induced can be seen as a set of biological replicates of the null condition. We use these measurements to assess the amount of noise in each replicate, i.e. to estimate $\Pr[M^{NI}_{b,p} \mid \bar{M}^{NI}_p]$, the probability of a given observation $M^{NI}_{b,p}$, given its true mean Mascot score $\bar{M}^{NI}_p$. This distribution is estimated using a leave-one-out cross-validation approach on the set of 14 non-induced experiments. Specifically, for each bait $b \in B$, we compare $M^{NI}_{b,p}$ to $\mu_{\neq b,p}$, the corrected average (see Supplementary Information) of the 13 Mascot scores of p in all non-induced experiments except where bait b was used. $\mu_{\neq b,p}$ provides a good estimate of $\bar{M}^{NI}_p$. Let $C(i, j)$ be the number of bait-prey pairs for which $\lfloor M^{NI}_{b,p} \rfloor = i$ and $\lfloor \mu_{\neq b,p} \rfloor = j$. Then, a straightforward estimator is
$$\Pr[M^{NI}_{b,p} = x \mid \bar{M}^{NI}_p = y] = C(x, y) / \sum_{y'} C(x, y').$$
Note that C is a fairly large matrix (the number of rows and columns is set to 1000; larger Mascot scores are culled to 1000). In addition, aside from the zero-th column $C(\cdot, 0)$, it is quite sparsely populated, as the sum of all entries is 40306. Thus, the above formula yields a very poor estimator. Matrix C therefore needs to be smoothed to matrix $C_s$ using a k-nearest neighbors smoothing algorithm. Specifically, let $N_\delta(i, j) = \{(i', j') : |i - i'| \le \delta, |j - j'| \le \delta\}$ be the set of neighboring matrix …
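The posterior over the mean control score is a prior multiplied by one likelihood term per control observation, then normalized. A minimal sketch on a small discretized score grid follows; the likelihood table, the prior and the observations are made-up toy values and `posterior_mean_score` is a hypothetical name. In the method described here the likelihood would instead come from the smoothed count matrix.

```python
import numpy as np


def posterior_mean_score(prior, likelihood, observations):
    """Posterior over the mean control score given independent observations.

    prior[m]         = Pr[MCM = m]
    likelihood[c, m] = Pr[CM = c | MCM = m]
    observations     = observed integer control scores, one per control bait
    """
    post = np.asarray(prior, dtype=float).copy()
    for c in observations:
        post *= likelihood[c, :]          # multiply in Pr[CM = c | MCM = m]
        total = post.sum()
        if total == 0.0:
            raise ValueError("observations have zero probability under the model")
        post /= total                     # renormalize so the result sums to 1
    return post


# Toy grid with 3 score levels; each column of `likelihood` sums to 1.
likelihood = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])
prior = np.full(3, 1.0 / 3.0)
posterior = posterior_mean_score(prior, likelihood, observations=[0, 0, 1])
```

Renormalizing after every observation gives the same result as dividing once by the overall constant, and it avoids underflow when many observations are multiplied together.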
Contaminants Assessment
p-value that prey p is a contaminant for bait b:
$$\mathrm{pvalue}(\mathrm{IM}_{b,p}) = \Pr[\mathrm{MCM}_p \ge \mathrm{MIM}_{b,p} \mid \mathrm{IM}_{b,p}, \mathrm{CM}_{b_1,p}, \mathrm{CM}_{b_2,p}, \ldots, \mathrm{CM}_{b_{14},p}]$$
False Discovery Rate (FDR) for an interaction with a given p-value:
$$\mathrm{FDR}(\text{p-value}) = \frac{\sum_{b \in B} |\{np \in NP_b \mid np \le \text{p-value}\}| \, / \, |NP_b|}{\sum_{b \in B} |\{ip \in IP_b \mid ip \le \text{p-value}\}| \, / \, |IP_b|}$$
$NP_b$ and $IP_b$ are the sets of non-induced and induced interaction p-values.
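The FDR estimate is a ratio of two fractions: how often non-induced p-values fall at or below the threshold, over how often induced p-values do. A minimal sketch, assuming the p-values are stored per bait in plain dictionaries with non-empty lists (the names `fdr`, `NP`, `IP` and the numbers are illustrative only):

```python
def fdr(p_value, non_induced, induced):
    """FDR estimate for a p-value threshold.

    non_induced, induced: dicts mapping a bait to its list of interaction
    p-values (the sets NP_b and IP_b); every list is assumed non-empty.
    """
    def fraction_at_or_below(pvals):
        return sum(1 for x in pvals if x <= p_value) / len(pvals)

    numerator = sum(fraction_at_or_below(v) for v in non_induced.values())
    denominator = sum(fraction_at_or_below(v) for v in induced.values())
    if denominator == 0:
        return float("nan")  # no induced interaction passes the threshold
    return numerator / denominator


# Made-up p-values for a single bait.
NP = {"bait1": [0.01, 0.5, 0.9, 0.8]}
IP = {"bait1": [0.001, 0.005, 0.2, 0.6]}
estimate = fdr(0.01, NP, IP)
```

With these toy values, one of four non-induced p-values and two of four induced p-values pass the 0.01 threshold, so the estimate is 0.25 / 0.5.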
Protein-Protein Interaction Network
• 89 baits and 11894 interactions [Fortier, Lacombe et al., 2010]
• Human cell line: HEK293
• Proteins in the network are mainly involved in transcription and RNA processing.
• 14 representative baits out of the 89 have been selected for control experiments.
False Discovery Rates
[Figure: number of predicted interactions (y-axis, 0 to 8000) against FDR (x-axis, 0 to 0.5) for DeContaminator and the Z-score approach]
FDR 1%: DeContaminator: 2430 interactions; Z-score approach: 1011 interactions

IsoRank: Comparison of PPI networks
Comparative Genomics
Look at the same kind of data across species with the hope that areas of high correlation correspond to functional parts or modules of the genome.
Why understanding function-level differences is important
• Increased complexity (function) is not explained simply by variations in gene (or protein) count
Estimated Number of Genes       6600   21000   14000   24500   23000
Estimated Number of Proteins    6600   27000   19000   32000   49000
Numbers from http://www.ensembl.org
Protein-Protein Interactions (PPIs)
• Often, proteins interact with other proteins to perform their functions
• Many cellular activities are a result of protein interactions
[Figure: MAPK Signaling Cascade. Image from: http://focosi.altervista.org/mapkmap2.html]
Modeling PPIs
• Traditional perspective: low-throughput, structural
• New perspective: high-throughput, network-based
[Figure: the G-protein complex (Gα, Gβ, Gγ, GDP) shown from the traditional perspective and from the new systems-level perspective. Image from www.rcsb.org]
Protein-Protein Interaction (PPI) Network
[Figure: Yeast PPI Network, http://internal.binf.ku.dk]
Cusick et al., Hum Med Gen, 05
Yeast 2-Hybrid method: X + Y = ?
Motivation behind Network Comparison
• Compare PPI networks at the species level
• Transfer annotation from one species to another
– More feasible, cheaper and easier than in humans
– Error detection
• Compute functional orthologs
– Functional orthologs: proteins which perform the same function across species
The Problem
Given two protein-protein interaction networks, find, for a piece of one network, something that has a comparable structure in the other network.
Our approach: match neighborhood topologies

Algorithm: IsoRank
[Figure: two input networks, with nodes a1 to a8 and b1 to b9]
Sequence similarity
3e-9    b6    a3
5e-4    b1    a3
1e-4    b9    a5
1e-7    b3    a5
…
2e-8    b1    a5
1e-2    b7    a5
Functional similarity for each possible node pairing
a5    b7    2.1
a5    b9    1.5
a3    b2    3.4
Functional Similarity Score: Intuition
• Compute pairwise scores Rij
• Goal: “high Rij” ⇒ “i and j are a good match”
• Intuition: i and j are a good match if their sequences align and their neighbors are a good match
[Figure: two example networks with nodes a1 to a5 and b1 to b5; R_{a5,b1} = ?]
Computing Rij
• Combine both sequence and network data
$$R_{ij} = (1-\alpha) E_{ij} + \alpha N_{ij}$$
where Rij is the functional similarity, Eij the sequence similarity and Nij the network similarity.
Simple case: α=1 (no Eij)
• Rij = Nij. Rij depends on the neighborhoods of i and j
• N(a) is the set of neighbors of a
$$R_{ij} = N_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$$
[Figure: example networks with nodes a1 to a5 and b1 to b5]
Example, where a1 is linked to a2 and b4 is linked to b3:
$$R_{a_1,b_4} = \frac{1}{2 \times 3} R_{a_2,b_3}$$
Example, where a2 is linked to a1 and a3, and b2 is linked to b1 and b3:
$$R_{a_2,b_2} = \frac{1}{1 \times 1} R_{a_1,b_1} + \frac{1}{1 \times 3} R_{a_1,b_3} + \frac{1}{3 \times 1} R_{a_3,b_1} + \frac{1}{3 \times 3} R_{a_3,b_3}$$
Example: Computed Rij values
      b1      b2      b3      b4      b5
a1    0.0312          0.0937
a2            0.1250          0.0625  0.0625
a3    0.0937          0.2813
a4            0.0625          0.0312  0.0312
a5            0.0625          0.0312  0.0312
Empty cell indicates Rij = 0

Capturing non-local effects?
• The algorithm can resolve between p-r vs. p-q
[Figure: nodes p, q and r] Rpr = 8.12e-3, Rpq = 8.64e-3
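The computed Rij values can be checked against the neighborhood formula. The sketch below assumes both example graphs have the edges 1-2, 2-3, 3-4 and 3-5 (reconstructed from the worked examples, not stated explicitly in the slides) and assumes the placement of the table values in columns b1 to b5. It applies the α=1 update once and measures how far the table moves; function names are hypothetical.

```python
import numpy as np

# Assumed example graph (0-based node indices): edges 1-2, 2-3, 3-4, 3-5.
EDGES = [(0, 1), (1, 2), (2, 3), (2, 4)]


def adjacency(n, edges):
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1.0
    return adj


def neighborhood_update(R, adj1, adj2):
    """R_ij <- sum over u in N(i), v in N(j) of R_uv / (|N(u)| |N(v)|)."""
    deg1 = adj1.sum(axis=0)              # |N(u)| for graph 1
    deg2 = adj2.sum(axis=0)              # |N(v)| for graph 2
    scaled = R / np.outer(deg1, deg2)    # R_uv / (|N(u)| |N(v)|)
    return adj1 @ scaled @ adj2.T        # sum over neighbors of i and of j


A1 = adjacency(5, EDGES)
A2 = adjacency(5, EDGES)
R_table = np.array([
    [0.0312, 0.0,    0.0937, 0.0,    0.0],
    [0.0,    0.1250, 0.0,    0.0625, 0.0625],
    [0.0937, 0.0,    0.2813, 0.0,    0.0],
    [0.0,    0.0625, 0.0,    0.0312, 0.0312],
    [0.0,    0.0625, 0.0,    0.0312, 0.0312],
])
max_change = np.abs(neighborhood_update(R_table, A1, A2) - R_table).max()
```

Under these assumptions the table is unchanged by the update up to the rounding of the printed values, which is what being a solution of the α=1 equations means.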
Computing R: an eigenvalue problem
• The equations for R describe an eigenvalue problem:
$$R_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv} \quad\Longleftrightarrow\quad R = AR$$
$$A[i,j][u,v] = \frac{1}{|N(u)|\,|N(v)|} \text{ if } u \in N(i) \text{ and } v \in N(j), \text{ and } 0 \text{ otherwise}$$
$$\mathrm{size}(A) = N_1 N_2 \times N_1 N_2$$
N1 = # nodes in Graph 1, N2 = # nodes in Graph 2
• R is the principal eigenvector of A
• A is about 10^8 x 10^8 when aligning yeast and fly networks
– However, both A and R are very sparse
– We use the Power method to efficiently compute R
• Extension to weighted edges is straightforward
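A Random Walk Interpretation
Tensor Product: G1 x G2
[Figure: graphs G1 and G2 and their tensor product, whose nodes are the pairs (r,s), (r,j), (r,v), (u,s), (u,j), (u,v), (i,s), (i,j), (i,v), …; edges are labeled with transition probabilities 1/(|N(u)||N(v)|) and 1/(|N(i)||N(j)|)]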
General Case: 0 ≤ α ≤ 1
• Let Bij = sequence similarity score between i (from graph #1) and j (from graph #2)
• Eij = Bij / |B|_1
$$R = AR \quad\longrightarrow\quad R = \alpha A R + (1-\alpha) E, \qquad 0 \le \alpha \le 1$$
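The general update can be iterated with the power method without ever building the large matrix A, because multiplying by A only requires sums over neighbor pairs. The sketch below is an illustration under stated assumptions, not the authors' implementation: dense NumPy arrays, no isolated nodes, |B|_1 taken as the sum of absolute entries, a fixed iteration count, and a uniform made-up B on a small assumed graph.

```python
import numpy as np


def isorank(adj1, adj2, B, alpha=0.6, iters=100):
    """Iterate R <- alpha * A R + (1 - alpha) * E, with E = B / |B|_1."""
    deg1 = adj1.sum(axis=0)                  # |N(u)| in graph 1 (must be > 0)
    deg2 = adj2.sum(axis=0)                  # |N(v)| in graph 2 (must be > 0)
    E = B / np.abs(B).sum()                  # normalized sequence similarity
    R = np.full(E.shape, 1.0 / E.size)       # uniform starting vector
    for _ in range(iters):
        # (A R)_ij = sum over u in N(i), v in N(j) of R_uv / (|N(u)| |N(v)|)
        AR = adj1 @ (R / np.outer(deg1, deg2)) @ adj2.T
        R = alpha * AR + (1.0 - alpha) * E
        R /= R.sum()                         # keep the scores summing to 1
    return R


# Assumed toy graph (0-based): edges 1-2, 2-3, 3-4, 3-5, aligned with itself.
EDGES = [(0, 1), (1, 2), (2, 3), (2, 4)]
adj = np.zeros((5, 5))
for u, v in EDGES:
    adj[u, v] = adj[v, u] = 1.0
R = isorank(adj, adj, B=np.ones((5, 5)), alpha=0.6)
```

The default α = 0.6 mirrors the value reported for the yeast-fly alignment. For real networks, sparse matrices would be needed in place of the dense arrays used here.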
Results: Yeast-Fly Global Alignment
• # of edges in the common subgraph: 1420
– Implies about 5% overlap! Why so low?
– PPI data currently is noisy and low-coverage
• # of edges in the largest component: 35
• The value of α used: 0.6
– Provided best overall agreement with previous gene correspondence predictions
Various Topologies Are Found
Existing local alignment methods (PathBlast; Kelley et al.) often find only specific topologies

Role of α: why the dip?

Robustness to Error in PPI data
[Figure: two copies of a network with nodes a1 to a11]
[Plot annotation: True curve somewhere around here]
Functional Orthologs • Genes that perform similar functions
– “functional orthologs” vs “plain old orthologs”
– distinguish between orthologs and paralogs
• Bandyopadhyay et al. [Genome Res. ’06]
– Use local network alignment results
– Then use an MRF to partially resolve ambiguities
• We compared our results with theirs
Functional Orthologs: IsoRank Pairwise Alignment Predictions
Protein    Functional Ortholog
           IsoRank    Bandyopadhyay et al.
Gid8       CG6617     CG6617 76%, CG18467 ---
Gpa1       Goα47a     Goα47a 41%, Giα65a ---
Kap104     Trn        Trn 41%, CG8219 47%
CG18617    Vph1       Vph1 43%, Stv1 48%
Egd1       Bic        Bcd 47%