Consistent probabilistic outputs for protein function prediction
William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
The problem
Given:
• protein sequence,
• knockout phenotype,
• gene expression profile,
• protein-protein interactions, and
• phylogenetic profile
Predict:
• a probability for every term in the Gene Ontology
Heterogeneous data
Missing data
Multiple labels per gene
Structured output
Consistent predictions
Cytoplasmic membrane-bound vesicle (GO:0016023) is a cytoplasmic vesicle (GO:0031410).
The probability that protein X is annotated with "cytoplasmic membrane-bound vesicle" must be less than or equal to the probability that protein X is annotated with "cytoplasmic vesicle."
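The constraint above can be checked mechanically. The following is a minimal sketch, assuming the ontology is given as a map from each term to its parents; the function name `find_violations` and the probability values are hypothetical and chosen only for illustration.

```python
def find_violations(probs, parents, tol=1e-12):
    """Return (child, parent) pairs where the child is more probable than the parent."""
    violations = []
    for child, parent_list in parents.items():
        for parent in parent_list:
            if probs[child] > probs[parent] + tol:
                violations.append((child, parent))
    return violations


# GO:0016023 (cytoplasmic membrane-bound vesicle) is a GO:0031410 (cytoplasmic vesicle).
parents = {"GO:0016023": ["GO:0031410"], "GO:0031410": []}

# Made-up probabilities for one protein.
inconsistent = {"GO:0016023": 0.7, "GO:0031410": 0.4}
consistent = {"GO:0016023": 0.3, "GO:0031410": 0.4}

print(find_violations(inconsistent, parents))
print(find_violations(consistent, parents))
```

A set of predictions is consistent in the sense of this slide exactly when the returned list is empty.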
SVM → Naïve Bayes
[Diagram. Box labels: Data 1 to Data 33; SVM/AL 1 to SVM/AL 33; Probability 1 to Probability 33; "Product, plus Bayes' rule"; Probability. Plot labels: Probability, Gaussian, Asymmetric Laplace.]
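The "product, plus Bayes' rule" step can be illustrated with a small sketch. It assumes one common parametrization of the asymmetric Laplace density (mode `theta`, left rate `beta`, right rate `gamma`); the slides do not say which parametrization was used, and all parameter values below are hypothetical.

```python
import math


def asym_laplace_logpdf(x, theta, beta, gamma):
    """Log density of an asymmetric Laplace with mode theta.

    Assumed form: rate beta to the left of the mode, rate gamma to the right,
    normalized by beta * gamma / (beta + gamma).
    """
    log_norm = math.log(beta * gamma / (beta + gamma))
    if x <= theta:
        return log_norm - beta * (theta - x)
    return log_norm - gamma * (x - theta)


def naive_bayes_posterior(svm_outputs, pos_params, neg_params, prior):
    """Combine per-data-set SVM outputs into P(y = 1 | outputs).

    Class-conditional densities are multiplied (summed in log space),
    then Bayes' rule is applied with the given prior.
    """
    log_pos = math.log(prior)
    log_neg = math.log(1.0 - prior)
    for x, pp, npar in zip(svm_outputs, pos_params, neg_params):
        log_pos += asym_laplace_logpdf(x, *pp)
        log_neg += asym_laplace_logpdf(x, *npar)
    return 1.0 / (1.0 + math.exp(log_neg - log_pos))


# One SVM output, hypothetical class-conditional parameters (theta, beta, gamma).
posterior = naive_bayes_posterior([1.0], [(1.0, 1.0, 1.0)], [(-1.0, 1.0, 1.0)], prior=0.5)
print(posterior)
```

Working in log space avoids underflow when many per-data-set densities are multiplied.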
SVM → logistic regression
[Diagram. Box labels: Data 1 to Data 33; SVM 1 to SVM 33; Logistic regressor 1 to Logistic regressor 11; Predict 1 to Predict 33; Probability.]
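As an illustration of turning SVM outputs into a probability with logistic regression, here is a minimal sketch. The fitting procedure (plain batch gradient descent), the function names, and the toy data are assumptions made for this example; the slides show only the diagram and do not describe how the regressors were fit.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wk * xk for wk, xk in zip(w, xi))) - yi
            for k in range(d):
                grad_w[k] += err * xi[k]
            grad_b += err
        w = [wk - lr * g / n for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Probability that the protein has the term, given its SVM outputs x."""
    return sigmoid(b + sum(wk * xk for wk, xk in zip(w, x)))


# Toy data: each row holds the outputs of two hypothetical SVMs for one protein.
X = [[-1.5, -0.5], [-0.8, -1.0], [0.9, 0.4], [1.2, 1.1]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
print(predict(w, b, [1.0, 1.0]), predict(w, b, [-1.0, -1.0]))
```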
Reconciliation Methods
• 3 heuristic methods
• 3 Bayesian networks
• 1 cascaded logistic regression
• 3 projection methods
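Heuristic methods
• Max: Report the maximum probability of self and all descendants: $p_i = \max_{j \in D_i} \hat{p}_j$
• And: Report the product of the probabilities of all ancestors and self: $p_i = \prod_{j \in A_i} \hat{p}_j$
• Or: Compute the probability that at least one descendant of the GO term is "on," assuming independence: $1 - p_i = \prod_{j \in D_i} (1 - \hat{p}_j)$
• All three methods use the probabilities $\hat{p}_j$ estimated by logistic regression.
Here $D_i$ is term $i$ together with its descendants, and $A_i$ is term $i$ together with its ancestors.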
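The three heuristics can be sketched in a few lines. This is an illustrative sketch, assuming the ontology is given as a map from each term to its children and that both the descendant set and the ancestor set include the term itself; the term names and probabilities are made up.

```python
import math


def closure(term, edges):
    """Return term plus everything reachable from it by following edges."""
    seen = {term}
    stack = [term]
    while stack:
        for nxt in edges.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def invert(children):
    """Turn a term -> children map into a term -> parents map."""
    parents = {term: [] for term in children}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents


def reconcile(p_hat, children):
    """Apply the Max, And and Or heuristics to unreconciled probabilities."""
    parents = invert(children)
    out = {"max": {}, "and": {}, "or": {}}
    for term in p_hat:
        desc = closure(term, children)  # term and its descendants
        anc = closure(term, parents)    # term and its ancestors
        out["max"][term] = max(p_hat[j] for j in desc)
        out["and"][term] = math.prod(p_hat[j] for j in anc)
        out["or"][term] = 1.0 - math.prod(1.0 - p_hat[j] for j in desc)
    return out


# Hypothetical four-term ontology: a is the root, b and c are its children, d is a child of b.
children = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
p_hat = {"a": 0.5, "b": 0.6, "c": 0.2, "d": 0.7}
result = reconcile(p_hat, children)
print(result)
```

In this toy input the child d is more probable than its ancestors; all three outputs restore the ordering, each in a different way.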
Bayesian network
• Belief propagation on a graphical model with the topology of the GO.
• Given Yi, the distribution of each SVM output Xi is modeled as an independent asymmetric Laplace distribution.
• Solved using a variational inference algorithm.
• "Flipped" variant: reverse the directionality of edges in the graph.
Cascaded logistic regression
• Fit a logistic regression to the SVM output only for those proteins that belong to all parent terms.
• Models the conditional distribution of the term, given all parents.
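• The final probability is the product of these conditionals: $p_i = \prod_{j \in A_i} \tilde{p}_j$, where $\tilde{p}_j$ denotes the conditional probability for term $j$ given its parents and $A_i$ is term $i$ together with its ancestors.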
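The two ingredients of the cascade, selecting training proteins and multiplying conditionals, can be sketched as follows. This is a sketch under stated assumptions: the ontology is a map from each term to its parents, the conditional probabilities are supplied directly rather than fit, and all names and values are hypothetical.

```python
import math


def training_proteins(term, parents, annotations):
    """Proteins eligible to train the regressor for term: those annotated with all its parents."""
    required = set(parents[term])
    return sorted(p for p, terms in annotations.items() if required <= terms)


def ancestors_and_self(term, parents):
    """Return term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def cascaded_probability(term, parents, cond):
    """Multiply the conditional probabilities of term and all of its ancestors."""
    return math.prod(cond[j] for j in ancestors_and_self(term, parents))


# Hypothetical three-term chain and made-up conditional probabilities.
parents = {"root": [], "vesicle": ["root"], "mb_vesicle": ["vesicle"]}
cond = {"root": 0.9, "vesicle": 0.5, "mb_vesicle": 0.4}
annotations = {"X": {"root", "vesicle"}, "Y": {"root"}}

print(training_proteins("mb_vesicle", parents, annotations))
print(cascaded_probability("mb_vesicle", parents, cond))
```

Because every conditional is at most 1, the product for a child can never exceed the product for any of its ancestors, so the output is consistent by construction.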
Isotonic regression
• Consider the squared Euclidean distance between two sets of probabilities.
• Find the closest set of probabilities to the logistic regression values that satisfy all the inequality constraints.
$\min_{p_i,\, i \in I} \sum_{i \in I} (p_i - \hat{p}_i)^2 \quad \text{s.t.} \quad p_j \le p_i,\ (i, j) \in E$
Here $I$ indexes the terms and $E$ is the set of ontology edges, with $(i, j)$ linking parent $i$ to child $j$.
The same constraints with a general distance $D$ in place of the squared difference:
$\min_{p_i,\, i \in I} \sum_{i \in I} D(p_i, \hat{p}_i) \quad \text{s.t.} \quad p_j \le p_i,\ (i, j) \in E$
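One way to solve the squared-distance problem on a small ontology is Dykstra's alternating projection algorithm, which cycles through the edge constraints and projects onto each in turn while carrying a correction term per constraint. This is a sketch of that generic technique, not necessarily the solver used in this work; the function name, the edge convention (parent index, child index), and the inputs are assumptions for illustration.

```python
def isotonic_projection(p_hat, edges, sweeps=1000, tol=1e-12):
    """Project p_hat onto {p : p[child] <= p[parent] for every (parent, child) edge}.

    Uses Dykstra's algorithm: each edge constraint is a half-space, and the
    projection onto it averages the two values when they are out of order.
    """
    p = list(p_hat)
    corr = [(0.0, 0.0)] * len(edges)  # one correction pair per constraint
    for _ in range(sweeps):
        biggest_change = 0.0
        for k, (i, j) in enumerate(edges):
            ci, cj = corr[k]
            yi, yj = p[i] + ci, p[j] + cj  # add back this constraint's correction
            if yj > yi:                    # child above parent: pool the two values
                xi = xj = 0.5 * (yi + yj)
            else:
                xi, xj = yi, yj
            corr[k] = (yi - xi, yj - xj)
            biggest_change = max(biggest_change, abs(xi - p[i]), abs(xj - p[j]))
            p[i], p[j] = xi, xj
        if biggest_change < tol:
            break
    return p


# Hypothetical example: term 0 is the parent of terms 1 and 2.
print(isotonic_projection([0.3, 0.5, 0.1], [(0, 1), (0, 2)]))
```

Inputs that already satisfy every constraint are returned unchanged, which matches the idea of finding the closest consistent set of probabilities.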
Kullback-Leibler projection
• Kullback-Leibler projection onto the set of distributions that factorize according to the ontology graph.
• Two variants, depending on the directions of the edges.
Hybrid method
• Replace the Bayesian log posterior for Yi by the marginal log posterior obtained from the logistic regression.
• Uses discriminative posteriors from logistic regression, but still uses a structural prior.
[Diagram labels: likelihood ratios obtained from logistic regression; BPAL, KLP, BPLR.]
Axes of evaluation
• Ontology
– biological process
– cellular compartment
– molecular function
• Term size
– 3-10 proteins
– 11-30 proteins
– 31-100 proteins
– 101-200 proteins
• Evaluation mode
– Joint evaluation
– Per protein
– Per term
• Recall
– 1%
– 10%
– 50%
– 80%
Legend
Belief propagation, asymmetric Laplace
Belief propagation, asymmetric Laplace, flipped
Belief propagation, logistic regression
Cascaded logistic regression
Isotonic regression
Logistic regression
Kullback-Leibler projection
Kullback-Leibler projection, flipped
Naïve Bayes, asymmetric Laplace
[Plot: precision, TP / (TP + FP), versus recall, TP / (TP + FN). Joint evaluation, biological process ontology, large terms (101-200).]
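Precision at a fixed recall level can be computed by ranking predictions and walking down the list until enough true positives have been found. The sketch below assumes that reading of the evaluation; the function name, the rounding of the required number of true positives, and the toy scores are hypothetical.

```python
import math


def precision_at_recall(scores, labels, recall):
    """Precision TP / (TP + FP) at the first rank where TP / (TP + FN) reaches recall."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    needed = max(1, math.ceil(recall * n_pos))  # true positives required
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        if tp >= needed:
            return tp / rank
    raise ValueError("recall must be between 0 and 1")


# Made-up predicted probabilities and true labels (1 = protein has the term).
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(precision_at_recall(scores, labels, 0.5))
print(precision_at_recall(scores, labels, 1.0))
```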
Conclusions: Joint evaluation
• Reconciliation does not always help.
• Isotonic regression performs well overall, especially for recall > 20%.
• For lower recall values, both Kullback-Leibler projection methods work well.
Conclusions: per protein
• Several methods perform well
– Unreconciled logistic regression
– Unreconciled naïve Bayes
– Isotonic regression
– Belief propagation with asymmetric Laplace
• For small terms
– For molecular function and biological process, we do not observe many significant differences.
– For cellular components, belief propagation with logistic regression works well.
Conclusions
• Reconciliation does not always help.
• Isotonic regression (IR) performs well overall.
• For small biological process and molecular function terms, it is less clear that IR is one of the best methods.
Acknowledgments
Guillaume Obozinski
Charles Grant
Gert Lanckriet
Michael Jordan
The mousefunc organizers
• Tim Hughes
• Lourdes Pena-Castillo
• Fritz Roth
• Gabriel Berriz
• Frank Gibbons