T. M. Murali Department of Computer Science Virginia Tech Slides prepared by Arjun Krishnan

[Figure: DNA, mRNA, protein, small molecules, regulatory RNA, and the environment]

How a cell is wired

The dynamics of such interactions emerge as cellular processes and functions

How do the genes and their products interact to collectively perform a function?

[Figure labels: A, B, Gene G, 35, RPM, Inhibitor, U2AF]

Molecular interaction networks

A network containing genes connected to each other whenever they physically or functionally interact

Proteins that interact/co-complex (ribosomal, polymerase, etc.)

Transcription factors and their targets

Enzymes catalyzing different steps in the same metabolic pathway

Genes with correlation in expression

Genes with similar phylogenetic profiles


Arabidopsis is the primary model organism for plants

Complex organization from molecular to whole organism level.

A key challenge …

Understanding the cellular machinery that sustains this complexity.

In the current post-genomic era, a main aspect of this challenge is ‘gene function prediction’:

Identification of the functions of all the (~30,000) genes in the genome.

Extent of gene annotations in Arabidopsis

Total of ~30,000 genes in the genome:

- ~15% with some experimental annotation

- ~8% with ‘expert’ annotation

- ~13% with annotations based on manually curated computational analysis

- ~14% with electronic annotations

- Leaving ~50% of the genome without any annotation

Ashburner et al. (2000) Nat. Gen.; Swarbreck et al. (2008) Nuc. Acids Res.

Exploit high-throughput data

Integrating functional genomic data could lead to

Network models of gene interactions that resemble the underlying cellular map.

Typically these networks contain gene functional interactions

Connecting pairs of genes that participate in the same biological processes.

In such a network, the position of a gene establishes the functional context of that gene.

‘Guilt-by-association’: genes of unknown function can be assigned the functions of their annotated neighbors.
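A minimal sketch of guilt-by-association on a weighted network. The gene names, edge weights, and the scoring rule (sum the weights of edges to neighbors annotated with a function) are illustrative assumptions, not the exact procedure used in the study.

```python
from collections import defaultdict

def neighbor_vote(edges, annotations, gene):
    """Score each function for `gene` by summing the weights of its
    edges to direct neighbors that are annotated with that function."""
    scores = defaultdict(float)
    for u, v, weight in edges:
        if gene not in (u, v) or u == v:
            continue  # edge does not touch `gene`, or is a self-loop
        neighbor = v if u == gene else u
        for function in annotations.get(neighbor, ()):
            scores[function] += weight
    return dict(scores)

# Hypothetical toy network: (gene, gene, edge weight).
edges = [
    ("g1", "g2", 0.9),
    ("g1", "g3", 0.4),
    ("g1", "g4", 0.2),
    ("g2", "g3", 0.7),
]
# Hypothetical annotations; g1 has no known function.
annotations = {
    "g2": {"photosynthesis"},
    "g3": {"photosynthesis", "cell cycle"},
    "g4": {"cell cycle"},
}

scores = neighbor_vote(edges, annotations, "g1")
best = max(scores, key=scores.get)
print(best, scores)
```

Here the unannotated gene g1 inherits the function that its strongest neighbors share.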

Functional interaction networks

Functional interaction network models have been developed for Arabidopsis.

Lee et al. (2010) Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana.

Very comprehensive in integrating datasets from other organisms for application in plants.

Integrated 24 datasets: 5 datasets from Arabidopsis and the rest from other model organisms.

AraNet: 19,647 genes, 1,062,222 interactions.

Goal of this study …

We examine the state of network-based gene function prediction in Arabidopsis.

Evaluate the performance of multiple prediction algorithms on AraNet.

Assess the influence of the number of genes annotated to a function and the source of annotation evidence.

Compute the correlation of prediction performance with network properties.

Evaluate prediction performance for plant-specific functions.

Network-based gene function prediction algorithms

Propagation of functional annotations across the network:

- Use positive and negative examples: SinkSource, Hopfield

- Use only positive examples: FunctionalFlow – multiple phases

Guilt-by-association using direct interactions:

- Use positive and negative examples: Local+

- Use only positive examples: Local, FunctionalFlow – 1 phase

Network-based gene function prediction

[Figure: each gene in the network, labeled with Function A or Function B]

[Figure: Sink and Source nodes]
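The slides do not give the SinkSource update rule, so the following is only a sketch under a commonly described formulation. It assumes that positive examples are sources fixed at score 1, negative examples are sinks fixed at score 0, and every other gene repeatedly takes the weighted average of its neighbors' scores. The function name, the fixed iteration count, and the toy network are hypothetical.

```python
def sinksource(neighbors, positives, negatives, iterations=100):
    """neighbors maps gene -> {neighbor: edge weight}; weights are assumed
    symmetric and positive. Returns a score in [0, 1] for every gene."""
    score = {gene: 0.0 for gene in neighbors}
    for gene in positives:
        score[gene] = 1.0           # sources stay at 1
    fixed = set(positives) | set(negatives)  # sinks stay at 0
    for _ in range(iterations):
        updated = dict(score)
        for gene, nbrs in neighbors.items():
            if gene in fixed or not nbrs:
                continue
            total = sum(nbrs.values())
            # Weighted average of the neighbors' previous scores.
            updated[gene] = sum(w * score[n] for n, w in nbrs.items()) / total
        score = updated
    return score

# Hypothetical path network: p - a - b - n, all edge weights 1.
neighbors = {
    "p": {"a": 1.0},
    "a": {"p": 1.0, "b": 1.0},
    "b": {"a": 1.0, "n": 1.0},
    "n": {"b": 1.0},
}
scores = sinksource(neighbors, positives={"p"}, negatives={"n"})
print(scores)
```

In this toy case the gene next to the source ends up with a higher score than the gene next to the sink, which is the ranking behavior the propagation is meant to produce.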

In this study …

Recall: fraction of known examples predicted correctly

Recall = TP / (TP + FN)

Precision: fraction of predictions that are correct

Precision = TP / (TP + FP)
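As a small worked example of the two definitions, the sketch below computes precision and recall from a set of predicted genes and a set of known genes. The gene identifiers and the function name are hypothetical.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for a set of predicted genes
    against the set of genes known to have the function."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # predicted and truly annotated
    fp = len(predicted - known)   # predicted but not annotated
    fn = len(known - predicted)   # annotated but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if known else 0.0
    return precision, recall

# Hypothetical example: 4 predictions, 3 known genes, 2 in common.
precision, recall = precision_recall(
    predicted={"g1", "g2", "g3", "g4"},
    known={"g1", "g2", "g5"},
)
print(precision, recall)
```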

Performance of different algorithms

Computational gene function prediction precedes and guides experimental validation

What we get is a ranked list of novel predictions

An experimenter would choose a manageable number of top-scoring predictions to pursue

Precision at the top of the prediction list

We choose precision at 20% recall (P20R) as the measure of performance
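One way to compute precision at 20% recall from a ranked prediction list is sketched below: walk down the list until 20% of the known genes have been recovered, then report the precision at that rank. The function name, the gene identifiers, and the choice to return 0.0 when the target recall is never reached are assumptions made for illustration.

```python
def precision_at_recall(ranked_genes, known, target_recall=0.2):
    """Precision at the first rank where recall reaches target_recall.
    ranked_genes is ordered from highest to lowest prediction score."""
    known = set(known)
    if not known:
        raise ValueError("need at least one known gene")
    tp = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in known:
            tp += 1
            if tp / len(known) >= target_recall:
                return tp / rank
    return 0.0  # target recall never reached (assumed convention)

# Hypothetical example: 10 known genes, so 20% recall needs 2 hits.
known = {f"g{i}" for i in range(1, 11)}
ranked = ["g1", "x1", "g2", "x2", "g3"]
p20r = precision_at_recall(ranked, known)
print(p20r)
```

In this example the second hit arrives at rank 3, so P20R is 2/3.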


SinkSource (SS) seems to be better than the other algorithms

What about the influence of the number of genes in a function?

[Figure legend: 3rd quartile, median, 1st quartile; using only annotations based on experimental/expert evidence]


[Figure: number of functions (y-axis) vs. number of genes annotated with a function (x-axis), divided into a first, second, and third group, each group containing ~125 functions]


For ‘small’ functions, the algorithm does not matter! And using just experimental annotations is better when you know little about a function.

For ‘medium’ functions, SS is a little better, and the benefit of using ‘electronic’ evidence is mixed.

For ‘large’ functions:

- SS is clearly the best

- Using all annotations is better


Wilcoxon test: SS vs. other algorithms

[Table columns: all evidence codes (All ECs); without IEA/ISS (Sans IEA/ISS)]

Overall, SinkSource appears to be the best algorithm.

Correlation of performance with network properties

Performance on a particular function might depend on how its genes are organized / connected among themselves in the network.

Number of nodes

Number of components

Fraction of nodes in the largest connected component

Total edge weight

Weighted density

Average weighted degree

Average segregation


Number of nodes = 9

Number of components = 3

Fraction of nodes in the largest connected component = 4/9

Total edge weight = 8

Weighted density = 8/36

Average weighted degree = 16/9
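The example values above can be reproduced with a short sketch. The definitions used here are inferred from the slide's numbers rather than stated on it: weighted density is taken as total edge weight divided by the number of node pairs (8/36 with 36 = 9·8/2), and average weighted degree as twice the total edge weight divided by the number of nodes (16/9). The 9-node graph is hypothetical, chosen only so that the counts match.

```python
from fractions import Fraction

def subnetwork_properties(nodes, edges):
    """Topological properties of the subnetwork induced by one function.
    Assumes at least two nodes; edges are (u, v, weight) tuples."""
    nodes = list(nodes)
    n = len(nodes)
    adjacency = {v: set() for v in nodes}
    total_weight = Fraction(0)
    for u, v, weight in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
        total_weight += Fraction(weight)
    # Connected components via depth-first search.
    seen, component_sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for nb in adjacency[v]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        component_sizes.append(size)
    return {
        "nodes": n,
        "components": len(component_sizes),
        "largest_component_fraction": Fraction(max(component_sizes), n),
        "total_edge_weight": total_weight,
        "weighted_density": total_weight / Fraction(n * (n - 1), 2),
        "average_weighted_degree": 2 * total_weight / n,
    }

# Hypothetical graph: a 4-cycle, a triangle, and a single edge (unit weights).
nodes = "abcdefghi"
edges = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "a", 1),
         ("e", "f", 1), ("f", "g", 1), ("e", "g", 1),
         ("h", "i", 1)]
props = subnetwork_properties(nodes, edges)
print(props)
```

Fractions are used so that the results can be compared directly with the ratios shown on the slide.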


Functional modularity: Average Segregation

[Figure: two example networks, one with Avg. seg = 8/22 and one with Avg. seg = 12/15]

We have …

Vector of SS P20R values for each function

Vector of values of a particular topological property for each function

Spearman rank correlation
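A minimal sketch of the Spearman rank correlation between the two vectors: rank each vector (averaging ranks over ties), then take the Pearson correlation of the ranks. The P20R and weighted-density values below are made up for illustration and are not results from the study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values, one entry per function.
p20r = [0.9, 0.4, 0.7, 0.2, 0.5]
weighted_density = [0.30, 0.10, 0.20, 0.05, 0.25]
rho = spearman(p20r, weighted_density)
print(round(rho, 3))
```

Because only ranks matter, the measure is insensitive to the very different scales of P20R and the topological properties.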


[Figure: axis labels: weighted density; Spearman rank correlation]

Performance on plant-specific functions

For ‘conserved’ functions:

- Performance is better than that for all functions

- Using all annotations is better

For ‘plant-specific’ functions:

- Performance is much worse compared to ‘conserved’ functions

- Using only experimental annotations is better

The underlying network is built based on data from multiple non-plant species

[Figure legend: 3rd quartile, median, 1st quartile; using only annotations based on experimental/expert evidence]

Most predictable ‘conserved’ functions

protein folding

nucleotide transport

innate immunity

cytoskeleton organization, and

cell cycle

Least predictable ‘conserved’ functions

regulation of …

Specialized functions

Most predictable ‘plant-specific’ functions

cell wall modification

auxin/cytokinin signaling, and

photosynthesis

Contribution from Arabidopsis datasets

Least predictable ‘plant-specific’ functions

development, morphogenesis

pattern formation

phase transitions of various tissues, organs / growth stages

Conclusions

Evaluated the performance of various prediction algorithms on AraNet.

SinkSource is the overall best prediction algorithm.

Measured the influence of the number of genes annotated to a function and the source of annotation evidence.

All algorithms perform poorly when only a small number of genes are ‘known’ or when annotating very specific functions.

When only a small number of genes are ‘known’, use only experimentally verified annotations to make new predictions.

When a considerable number of genes are ‘known’, use all annotations to make new predictions.

Conclusions

Measured the correlation of performance with network properties.

Several topological properties correlate well with performance.

‘Average segregation’ has the strongest correlation.

Conclusions

Assessed performance on conserved/plant-specific functions.

Performance on basic ‘conserved’ functions is better than that for all the functions.

Specialized ‘conserved’ functions are hard to predict.

Performance on ‘plant-specific’ functions is very poor.

This is also a consequence of the fact that ‘plant-specific’ functions generally have a small number of annotations.

Conclusions

Avenues for improvement in functional interaction networks

Build functional interaction networks that are based on a larger collection of plant datasets.

Rely as little as possible on data from other species.

Avenues for future experimental work

‘Plant-specific’ functions and

Specialized ‘conserved’ functions.

Acknowledgements

Arjun Krishnan

Brett Tyler

Andy Pereira
