
Predicting protein function from heterogeneous data

Prof. William Stafford Noble
GENOME 541
Intro to Computational Molecular Biology

We can frame functional annotation as a classification task

• Many possible types of labels:
  – Biological process
  – Molecular function
  – Subcellular localization
• Many possible inputs:
  – Gene or protein sequence
  – Expression profile
  – Protein-protein interactions
  – Genetic associations

(Figure: a classifier answers the question "Is gene X a penicillin amidase?" with yes or no.)

Outline

• Bayesian networks
• Support vector machines
• Network diffusion / message passing

Annotation transfer

• Rule: If two proteins are linked with high confidence, and one protein's function is unknown, then transfer the annotation.

(Figure: annotation is transferred from a protein of known function to a protein of unknown function.)

Bayesian networks (Troyanskaya, PNAS 2003)

(Figure: the classic "alarm" Bayesian network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of John calls and Mary calls.)

P(B) = 0.001    P(E) = 0.002
P(A|B,E) = 0.95    P(A|B,¬E) = 0.94    P(A|¬B,E) = 0.29    P(A|¬B,¬E) = 0.001
P(J|A) = 0.90    P(J|¬A) = 0.05
P(M|A) = 0.70    P(M|¬A) = 0.01
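The conditional probability tables above fully specify the joint distribution, so any query can be answered by summing over the hidden variables. Below is a minimal sketch (not part of the lecture) that computes P(Burglary | John calls, Mary calls) by enumeration, using only the numbers from the slide.

```python
# Exact inference by enumeration in the alarm network,
# using the conditional probability tables from the slide.
import itertools

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(John calls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(Mary calls | Alarm)

def joint(b, e, a, j, m):
    """Probability of one complete assignment, factored along the network."""
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(Burglary = true | John calls, Mary calls), summing out Earthquake and Alarm.
num = sum(joint(True, e, a, True, True)
          for e, a in itertools.product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True)
          for b, e, a in itertools.product([True, False], repeat=3))
print(num / den)   # roughly 0.28
```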

Create one network per gene pair

(Figure: a Bayesian network combines data types 1, 2, and 3 for genes A and B, and outputs the probability that genes A and B are functionally linked.)

Conditional probability tables

• A pair of yeast proteins that have a physical association will have a positive affinity precipitation result 75% of the time and a negative result the remaining 25%.

• Two proteins that do not physically interact in vivo will have a positive affinity precipitation result in 5% of the experiments, and a negative one in 95%.

(A worked Bayes-rule example follows below.)
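These two conditional probabilities can be inverted with Bayes' rule to get the probability of interaction given a positive experiment. The sketch below illustrates this; the prior probability of interaction is a made-up placeholder, since the slide does not give one.

```python
# Bayes' rule with the affinity precipitation probabilities from the slide.
p_pos_given_int = 0.75      # P(positive result | physical interaction)
p_pos_given_noint = 0.05    # P(positive result | no interaction)
prior_int = 0.01            # hypothetical prior P(interaction), for illustration only

p_pos = p_pos_given_int * prior_int + p_pos_given_noint * (1 - prior_int)
posterior = p_pos_given_int * prior_int / p_pos
print(posterior)  # ~0.13 with this prior: one positive result raises the odds substantially
```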

Inputs

• Protein-protein interaction data from GRID.

• Transcription factor binding sites data from SGD.

• Stress-response microarray data set.

ROC analysis

Using Gene Ontology biological process annotation as the gold standard.

Pros and cons

+ Bayesian network framework is rigorous.
+ Exploits expert knowledge.
- Does not (yet) learn from data.
- Treats each gene pair independently.

The SVM is a hyperplane classifier

(Figure: positive and negative examples plotted as points in two dimensions, with a line separating most positives from most negatives.)

Locate a plane that separates positive from negative examples. Focus on the examples closest to the boundary.

Four key concepts

1. Separating hyperplane

2. Maximum margin hyperplane

3. Soft margin

4. Kernel function (input space → feature space)

Input space

           gene1   gene2
patient1   -1.7     2.1
patient2    0.3     0.5
patient3   -0.4     1.9
patient4   -1.3     0.2
patient5    0.9    -1.2

(Figure: the five patients plotted as points in the gene1/gene2 plane.)

• Each subject may be thought of as a point in an m-dimensional space.

Separating hyperplane

• Construct a hyperplane separating ALL from AML subjects.

Choosing a hyperplane

• For a given set of data, many possible separating hyperplanes exist.

Maximum margin hyperplane

• Choose the separating hyperplane that is farthest from any training example.

Support vectors

• The location of the hyperplane is specified via a weight associated with each training example.

• Examples near the hyperplane receive non-zero weights and are called support vectors.

Soft margin

• When no separating hyperplane exists, the SVM uses a soft margin hyperplane with minimal cost.

• A parameter C specifies the relative cost of a misclassification versus the size of the margin.

(Figures: with incorrectly measured or labeled data, either no separating hyperplane exists, or the separating hyperplane does not generalize well; a soft margin tolerates such points.)

The kernel function

• “The introduction of SVMs was very good for the most part, but I got confused when you began to talk about kernels.”

• “I found the discussion of kernel functions to be slightly tough to follow.”

• “I understood most of the lecture. The part that was more challenging was the kernel functions.”

• “Still a little unclear on how the kernel is used in the SVM.”

Why kernels?

Separating previously non-separable data

Input space to feature space

• SVMs first map the data from the input space to a higher-dimensional feature space.

Kernel function as dot product

• Consider two training examples A = (a1, a2) and B = (b1, b2).

• Define a mapping from input space to feature space: Φ(X) = (x1x1, x1x2, x2x1, x2x2)

• Let K(X, Y) = (X • Y)²

• Write Φ(A) • Φ(B) in terms of K:

Φ(A) • Φ(B)
  = (a1a1, a1a2, a2a1, a2a2) • (b1b1, b1b2, b2b1, b2b2)
  = a1a1b1b1 + a1a2b1b2 + a2a1b2b1 + a2a2b2b2
  = a1b1a1b1 + a1b1a2b2 + a2b2a1b1 + a2b2a2b2
  = (a1b1 + a2b2)(a1b1 + a2b2)
  = [(a1, a2) • (b1, b2)]²
  = (A • B)²
  = K(A, B)
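The identity above is easy to check numerically. A minimal sketch, with arbitrary example vectors:

```python
# Check that the quadratic kernel K(X, Y) = (X · Y)^2 equals the dot product
# of the explicit feature maps Φ(X) = (x1x1, x1x2, x2x1, x2x2).
import numpy as np

def phi(x):
    # Explicit map to the 4-dimensional feature space.
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

def k(x, y):
    # The same quantity computed in the 2-dimensional input space.
    return np.dot(x, y) ** 2

A = np.array([1.5, -0.3])
B = np.array([0.2, 2.0])
print(np.dot(phi(A), phi(B)), k(A, B))  # the two numbers agree
```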

Separating in 2D with a 4D kernel

“Kernelizing” Euclidean distance

Kernel function

• The kernel function plays the role of the dot product operation in the feature space.

• The mapping from input to feature space is implicit.

• Using a kernel function avoids representing the feature space vectors explicitly.

• Any continuous, positive semi-definite function can act as a kernel function.

Need for “positive semidefinite” for kernel function unclear.

Proof of Mercer’s Theorem: Intro to SVMs by Cristianini and Shawe-Taylor, 2000, pp. 33-35.

Example kernel functions:

  K(X, Y) = X • Y
  K(X, Y) = (X • Y + 1)³
  K(X, Y) = exp( -||X - Y||² / (2σ²) )
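For concreteness, here is a minimal sketch of the three kernels listed above; the test vectors and σ are arbitrary illustrative choices.

```python
# Linear, polynomial, and Gaussian (radial basis) kernel functions.
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, degree=3):
    return (np.dot(x, y) + 1) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0, -0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
```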

Overfitting with a Gaussian kernel

The SVM learning problem

• Input: training vectors x1, …, xn and labels y1, …, yn.
• Output: bias b, plus one weight wi per training example.
• The weights specify the location of the separating hyperplane.
• The optimization problem is a convex, quadratic optimization.
• It can be solved using standard packages such as MATLAB.

  f(x) = w^T x + b

  c(f(x), y) = max(0, 1 - y f(x))

  argmin over w, b of:  (1/2) ||w||² + C Σ_{i=1..n} c(f(x_i), y_i)
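In practice the optimization is handed to a standard solver. A minimal sketch using scikit-learn (one such package; the lecture mentions MATLAB) on synthetic data:

```python
# Train a soft-margin linear SVM on two synthetic Gaussian clouds.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(1, 1, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0)   # C trades off misclassification cost vs. margin size
clf.fit(X, y)

print(clf.support_)     # indices of the support vectors (examples with non-zero weight)
print(clf.dual_coef_)   # the learned per-example weights
print(clf.intercept_)   # the bias b
```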

SVM prediction architecture

(Figure: a query x is compared to each training example x1, x2, x3, …, xn via the kernel function k; the resulting kernel values are combined using the weights w1, w2, w3, …, wn to produce the prediction f(x) = Σ_i w_i k(x_i, x) + b.)

Kernel function

• The kernel function plays the role of the dot product operation in the feature space.

• The mapping from input to feature space is implicit.

• Using a kernel function avoids representing the feature space vectors explicitly.

• Any continuous, positive semi-definite function can act as a kernel function.

Proof of Mercer’s Theorem: Intro to SVMs by Cristianini and Shawe-Taylor, 2000, pp. 33-35.

Learning gene classes

(Figure: expression data from Eisen et al., 79 experiments each for a 2465-gene set and a 3500-gene set, are split into training and test sets; class labels from MYGD and the training data are given to a learner, which produces a model, and the predictor then assigns a class to each test gene.)

Class prediction

Class                          FP   FN   TP    TN
TCA                             4    9    8  2446
Respiration chain complexes     6    8   22  2431
Ribosome                        7    3  118  2339
Proteasome                      3    8   27  2429
Histone                         0    2    9  2456
Helix-turn-helix                0   16    0  2451

SVM outperforms other methods

Predictions of gene function

Fleischer et al. "Systematic identification and functional screens of uncharacterized proteins associated with eukaryotic ribosomal complexes." Genes Dev, 2006.

Overview

• 218 human tumor samples spanning 14 common tumor types.
• 90 normal samples.
• 16,063 "genes" measured per sample.
• Overall SVM classification accuracy: 78%.
• Random classification accuracy: 1/14 ≈ 7%.

Summary: Support vector machine learning

• The SVM learning algorithm finds a linear decision boundary.
• The hyperplane maximizes the margin, i.e., the distance to the nearest training example.
• The optimization is convex; the solution is sparse.
• A soft margin allows for noise in the training set.
• A complex decision surface can be learned by using a non-linear kernel function.

Cost/Benefits of SVMs

+ SVMs perform well in high-dimensional data sets with few examples.
+ Convex optimization implies that you get the same answer every time.
+ Kernel functions allow encoding of prior knowledge.
+ Kernel functions handle arbitrary data types.
– The hyperplane does not provide a good explanation, especially with a non-linear kernel function.

Vector representation

• Each matrix entry is an mRNA expression measurement.

• Each column is an experiment.

• Each row corresponds to a gene.

Similarity measurement

• Normalized scalar product:

  K(X, Y) = Σ_i X_i Y_i / sqrt( (Σ_i X_i X_i) (Σ_i Y_i Y_i) )

• Similar vectors receive high values, and vice versa.

(Figure: example expression profiles of a similar pair and a dissimilar pair of genes.)
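A minimal sketch of this similarity on toy expression profiles (the profiles are made up for illustration):

```python
# Normalized scalar product (cosine similarity) between expression profiles.
import numpy as np

def normalized_scalar_product(x, y):
    return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))

x = np.array([1.2, -0.4, 0.8, 2.0])
y = np.array([1.0, -0.2, 0.9, 1.7])    # similar profile: value near 1
z = np.array([-1.1, 0.5, -0.7, -2.2])  # anti-correlated profile: value near -1
print(normalized_scalar_product(x, y), normalized_scalar_product(x, z))
```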

Kernel matrix

>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

Sequence kernels

• We cannot compute a scalar product on a pair of variable-length, discrete strings.

Pairwise comparison kernel

Protein-protein interactions

• Pairwise interactions can be represented as a graph or a matrix (rows and columns indexed by proteins):

  1 0 0 1 0 1 0 1
  1 0 1 0 1 1 0 1
  0 0 0 0 1 1 0 0
  0 0 1 0 1 1 0 1
  0 0 1 0 1 0 0 1
  1 0 0 0 0 0 0 1
  0 0 1 0 1 0 0 0

Linear interaction kernel

• The simplest kernel counts, for each pair of proteins, the number of interaction partners they share: the scalar product of the corresponding rows of the interaction matrix.

  1 0 0 1 0 1 0 1
  1 0 1 0 1 1 0 1
  0 0 0 0 1 1 0 0
  0 0 1 0 1 1 0 1
  0 0 1 0 1 0 0 1
  1 0 0 0 0 0 0 1
  0 0 1 0 1 0 0 0

(Example: the scalar product of the first two rows is 3; see the sketch below.)
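A minimal sketch of the linear interaction kernel on the matrix above; as noted, the kernel value for the first two proteins comes out to 3:

```python
# Linear interaction kernel: the kernel value for two proteins is the scalar
# product of their rows in the binary interaction matrix, i.e., the number of
# interaction partners they share.
import numpy as np

interactions = np.array([
    [1, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0],
])

K = interactions @ interactions.T   # full kernel matrix
print(K[0, 1])                      # 3 shared partners for the first two proteins
```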

Diffusion kernel

• A general method for establishing similarities between nodes of a graph.

• Based upon a random walk.

• Efficiently accounts for all paths connecting two nodes, weighted by path lengths (a sketch follows below).
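A minimal sketch of one common construction, the matrix-exponential diffusion kernel K = expm(-βL) on the graph Laplacian L; the toy graph and β are arbitrary choices, and this is an illustration rather than the specific kernel used in the lecture.

```python
# Diffusion kernel on a small graph. The matrix exponential implicitly sums
# over all paths between nodes, down-weighting long paths.
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # adjacency matrix of a toy graph
D = np.diag(A.sum(axis=1))                  # degree matrix
L = D - A                                   # graph Laplacian

beta = 0.5                                  # diffusion parameter (an assumption)
K = expm(-beta * L)                         # diffusion kernel matrix
print(np.round(K, 3))
```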

Hydrophobicity profile

• Transmembrane regions are typically hydrophobic, and vice versa.

• The hydrophobicity profile of a membrane protein is evolutionarily conserved.

(Figure: example hydrophobicity profiles of a membrane protein and a non-membrane protein.)

Hydrophobicity kernel

• Generate a hydropathy profile from the amino acid sequence using the Kyte-Doolittle index.
• Prefilter the profiles.
• Compare two profiles by
  – computing the fast Fourier transform (FFT), and
  – applying a Gaussian kernel function.
• This kernel detects periodicities in the hydrophobicity profile (see the sketch below).
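A rough sketch of such a kernel, under assumptions: the smoothing window, the number of Fourier coefficients, σ, and the example sequences are all illustrative choices, not the published parameters.

```python
# Hydrophobicity-profile kernel sketch: build Kyte-Doolittle hydropathy
# profiles, smooth them, take FFT magnitudes (so periodicities rather than
# positions are compared), and apply a Gaussian kernel.
import numpy as np

KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}   # Kyte-Doolittle hydropathy index

def fft_features(seq, window=7, n_coeffs=64):
    profile = np.array([KD[aa] for aa in seq])
    # Prefilter: moving-average smoothing of the hydropathy profile.
    smoothed = np.convolve(profile, np.ones(window) / window, mode='valid')
    # Magnitudes of the first n_coeffs Fourier coefficients (zero-padded FFT).
    return np.abs(np.fft.rfft(smoothed, n=2 * n_coeffs))[:n_coeffs]

def hydrophobicity_kernel(seq1, seq2, sigma=30.0):
    f1, f2 = fft_features(seq1), fft_features(seq2)
    return np.exp(-np.sum((f1 - f2) ** 2) / (2 * sigma ** 2))

# Two made-up example sequences, for illustration only.
print(hydrophobicity_kernel("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                            "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRL"))
```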

Combining kernels

(Figure: two routes that give identical results. Left: compute kernels K(A) and K(B) on data sets A and B separately and add them, giving K(A)+K(B). Right: concatenate the data sets into A:B and compute K(A:B) directly.)
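For a linear kernel, the identity in the figure can be checked directly: summing the kernels of two data sets equals the kernel of their concatenation. A minimal sketch:

```python
# Summing per-data-set linear kernels equals the linear kernel on the
# concatenated representation.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 10))      # 5 genes, 10 features of one data type
B = rng.normal(size=(5, 20))      # the same 5 genes, 20 features of another type

K_A = A @ A.T
K_B = B @ B.T
K_concat = np.hstack([A, B]) @ np.hstack([A, B]).T   # kernel on A:B

print(np.allclose(K_A + K_B, K_concat))   # True
```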

Semidefinite programming

• Define a convex cost function to assess the quality of a kernel matrix.

• Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.

• Learn K from the convex cone of positive semidefinite matrices, or a convex subset of it, according to a convex quality measure.

(Figure: integrate the constructed kernels by learning a linear mix; the large-margin classifier (SVM) maximizes the margin; the combination is found by SDP.)

Semidefinite programming

  K = Σ_i μ_i K_i

(Figure: integrate the constructed kernels by learning a linear mix of kernel matrices K_i with weights μ_i; the large-margin classifier (SVM) maximizes the margin.)

Markov Random Field

• General Bayesian method, applied by Deng et al. to yeast functional classification.
• Used five different types of data.
• For their model, the input data must be binary.
• Reported improved accuracy compared to using any single data type.

Yeast functional classes

Category                        Size
Metabolism                      1048
Energy                           242
Cell cycle & DNA processing      600
Transcription                    753
Protein synthesis                335
Protein fate                     578
Cellular transport               479
Cell rescue, defense             264
Interaction w/ environment       193
Cell fate                        411
Cellular organization            192
Transport facilitation           306
Other classes                     81

Six types of data

• Presence of Pfam domains.
• Genetic interactions from CYGD.
• Physical interactions from CYGD.
• Protein-protein interaction by TAP.
• mRNA expression profiles.
• (Smith-Waterman scores).

Results

(Figure: performance comparison of MRF, SDP/SVM (binary), and SDP/SVM (enriched).)

Pros and cons

+ Learns relevance of data sets with respect to the problem at hand.

+ Accounts for redundancy among data sets, as well as noise and relevance.

+ Discriminative approach yields good performance.

- Kernel-by-kernel weighting is simplistic.
- In most cases, unweighted kernel combination works fine.
- Does not provide a good explanation.

Network diffusion: GeneMANIA

A rose by any other name …

• Network diffusion
• Random walk with restart
• Personalized PageRank
• Diffusion kernel
• Gaussian random field
• GeneMANIA

Top performing methods

GeneMANIA

• Normalize each network (divide each element by the square root of the product of its row and column sums).
• Learn a weight for each network via ridge regression. Essentially, learn how informative the network is with respect to the task at hand.
• Sum the weighted networks.
• Assign labels to the nodes. Use (n+ - n-)/n for unlabeled genes.
• Perform label propagation in the combined network. (A sketch follows below.)

Mostafavi et al. Genome Biology, 9:S4, 2008.
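A rough sketch of this pipeline, under assumptions: the two toy networks, the network weights (which GeneMANIA actually learns by ridge regression), and the propagation parameter α are placeholders.

```python
# GeneMANIA-style sketch: symmetric normalization of each network, a weighted
# sum of networks, label biases with (n+ - n-)/n for unlabeled genes, and
# iterative label propagation.
import numpy as np

def normalize(W):
    # Divide each element by sqrt(row sum * column sum).
    r, c = W.sum(axis=1), W.sum(axis=0)
    return W / np.sqrt(np.outer(r, c))

def propagate(W, y, alpha=0.9, n_iter=100):
    # Iterative label propagation: f <- alpha * W f + (1 - alpha) * y.
    f = y.copy()
    for _ in range(n_iter):
        f = alpha * W @ f + (1 - alpha) * y
    return f

# Two toy association networks over 5 genes.
rng = np.random.default_rng(2)
W1 = rng.random((5, 5)); W1 = (W1 + W1.T) / 2; np.fill_diagonal(W1, 0)
W2 = rng.random((5, 5)); W2 = (W2 + W2.T) / 2; np.fill_diagonal(W2, 0)

weights = [0.7, 0.3]                       # placeholder network weights
W = weights[0] * normalize(W1) + weights[1] * normalize(W2)

n_pos, n_neg, n = 1, 2, 5
y = np.array([1.0, -1.0, -1.0,             # labeled genes
              (n_pos - n_neg) / n,         # unlabeled genes get (n+ - n-)/n
              (n_pos - n_neg) / n])
print(propagate(W, y))                     # final scores for all genes
```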

Random walk with restart

(Figure sequence: a walker starts at the positive examples, moves along network edges, and occasionally restarts at the positives; node size indicates frequency of visit, which gives the final node scores.)

Label propagation is random walk with restart, except:

(a) You restart less often from nodes with many neighbours (i.e., the restart probability of a node is inversely related to its degree).

(b) Nodes with many neighbors have their final node scores scaled up.

(A sketch of the basic random walk with restart appears below.)
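A minimal sketch of the basic random walk with restart (without the degree-dependent adjustments in (a) and (b)); the toy network and restart probability are arbitrary choices.

```python
# Random walk with restart on a small protein network.
import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)   # adjacency matrix
P = A / A.sum(axis=0, keepdims=True)            # column-normalized transition matrix

restart = 0.3
p0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])        # restart distribution: the positive example

p = p0.copy()
for _ in range(200):
    p = (1 - restart) * P @ p + restart * p0    # walk step plus restart

print(np.round(p, 3))   # stationary visit frequencies = final node scores
```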

Label propagation vs. SVM

(Figure: performance of label propagation and of the SVM, averaged across 992 yeast Gene Ontology biological process categories.)