T. M. Murali Department of Computer Science Virginia Tech Slides prepared by Arjun Krishnan


The State of Gene Function Prediction in Arabidopsis thaliana. T. M. Murali, Department of Computer Science, Virginia Tech. Slides prepared by Arjun Krishnan. Introduction to Computational Biology and Bioinformatics (CS 3824), October 11 and 13, 2011.


How a cell is wired

[Figure: DNA, mRNA, protein, regulatory RNA, small molecules, and the environment]

The dynamics of such interactions emerge as cellular processes and functions

Molecular interaction networks

How do the genes and their products interact to collectively perform a function?

[Figure labels: A, B, Gene G, 35, RPM, Inhibitor, U2AF]

Molecular interaction networks

A network containing genes connected to each other whenever they physically or functionally interact

Proteins that interact/co-complex (ribosomal, polymerase, etc.)

Transcription factors and their targets

Enzymes catalyzing different steps in the same metabolic pathway

Genes with correlation in expression

Genes with similar phylogenetic profiles

[Slide annotation: ‘Functional’, marked with a caret (^)]

Arabidopsis is the primary model organism for plants

Complex organization from molecular to whole organism level.

A key challenge …

Understanding the cellular machinery that sustains this complexity.

In the current post-genomic times, a main aspect of this challenge is ‘gene function prediction’:

Identification of functions of all the (~30,000) genes in the genome.

Extent of gene annotations in Arabidopsis

Total of ~30,000 genes in the genome

~15% with some experimental annotation

~8% with ‘expert’ annotation

~13% with annotations based on manually curated computational analysis

~14% with electronic annotations

Leaving ~50% of the genome without any annotation

Ashburner et al. (2000) Nat. Genet.; Swarbreck et al. (2008) Nucleic Acids Res.

Exploit high-throughput data

Integrating functional genomic data could lead to

Network models of gene interactions that resemble the underlying cellular map.

Typically these networks contain gene functional interactions

Connecting pairs of genes that participate in the same biological processes.

In such a network, the position of a gene establishes the functional context of that gene.

‘Guilt-by-association’ – genes of unknown function can be assigned the functions of their annotated neighbors.
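
The guilt-by-association idea can be illustrated with a minimal sketch. It assumes a simple scoring rule (the total weight of a gene's edges to already-annotated genes), which is one common way to implement the idea and not necessarily the rule used in the study. The toy network, gene names, and the function name `neighbor_scores` are hypothetical.

```python
from collections import defaultdict


def neighbor_scores(edges, annotated):
    """Score each unannotated gene by the total weight of its edges
    to genes already annotated with the function (assumed rule)."""
    scores = defaultdict(float)
    for u, v, w in edges:
        if u in annotated and v not in annotated:
            scores[v] += w
        elif v in annotated and u not in annotated:
            scores[u] += w
    return dict(scores)


# Hypothetical toy network: (gene, gene, edge weight)
edges = [
    ("g1", "g2", 1.0),
    ("g1", "g3", 0.5),
    ("g2", "g3", 2.0),
    ("g2", "g4", 0.5),
    ("g4", "g5", 1.0),
]
annotated = {"g1", "g2"}  # genes known to have the function

scores = neighbor_scores(edges, annotated)
# Rank candidate genes from strongest to weakest association
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

Genes with no annotated neighbor (here `g5`) receive no score at all, which is the main limitation of using only direct interactions.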

Functional interaction networks

Functional interaction network models have been developed for Arabidopsis.

Lee et al. (2010) Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana.

Very comprehensive in terms of using and integrating datasets in other organisms for application in plants.

Integrated 24 datasets: 5 from Arabidopsis and the rest from other model organisms.

AraNet: 19,647 genes, 1,062,222 interactions.

Goal of this study …

We examine the state of network-based gene function prediction in Arabidopsis.

Evaluate the performance of multiple prediction algorithms on AraNet.

Assess the influence of the number of genes annotated to a function and the source of annotation evidence.

Compute the correlation of prediction performance with network properties.

Evaluate prediction performance for plant-specific functions.

Network-based gene function prediction algorithms

[Figure: algorithms classified along two axes. Axis 1: propagation of functional annotations across the network vs. guilt-by-association using direct interactions. Axis 2: use positive and negative examples vs. use only positive examples. Algorithms shown: SinkSource, Hopfield, FunctionalFlow – multiple phases, FunctionalFlow – 1 phase, Local, Local+]

Each gene in the network
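
The slides name SinkSource but do not spell out its update rule. The sketch below is an assumption based on how the algorithm is usually described: positive examples are held at 1 (sources), negative examples at 0 (sinks), and every other gene repeatedly takes the weighted average of its neighbors' scores. The helper names, the fixed iteration count, and the toy network are all hypothetical.

```python
def build_adjacency(edges):
    """Build a symmetric weighted adjacency map from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def sinksource(adj, positives, negatives, iterations=100):
    """Assumed SinkSource-style propagation: sources fixed at 1, sinks at 0,
    all other genes set to the weighted average of their neighbors."""
    score = {g: 0.0 for g in adj}
    for g in positives:
        score[g] = 1.0
    for _ in range(iterations):
        new = dict(score)
        for g, nbrs in adj.items():
            if g in positives or g in negatives:
                continue  # labeled genes keep their fixed score
            total = sum(nbrs.values())
            if total > 0:
                new[g] = sum(w * score[n] for n, w in nbrs.items()) / total
        score = new
    return score


# Hypothetical path network: pos - a - b - neg, all weights 1
adj = build_adjacency([("pos", "a", 1.0), ("a", "b", 1.0), ("b", "neg", 1.0)])
score = sinksource(adj, positives={"pos"}, negatives={"neg"})
print(score)
```

On this path the scores of the unlabeled genes fall off with distance from the positive example, which is the behavior that distinguishes propagation from purely local voting.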

Network-based gene function prediction

[Figure: network with genes labeled Function A and Function B]

In this study …

[Figure labels: Sink, Source]

Recall: fraction of known examples predicted correctly

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$

Precision: fraction of predictions that are correct

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
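
The two definitions above can be computed directly from a set of predicted genes and a set of known genes. This is a minimal sketch; the gene names and the function name `precision_recall` are hypothetical.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for a set of predictions against known examples."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # predicted and known
    fp = len(predicted - known)   # predicted but not known
    fn = len(known - predicted)   # known but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example: 4 predictions, 5 known genes, 2 in common
predicted = {"g1", "g2", "g3", "g4"}
known = {"g1", "g2", "g5", "g6", "g7"}
precision, recall = precision_recall(predicted, known)
print(precision, recall)  # TP = 2, FP = 2, FN = 3
```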

Performance of different algorithms

Computational gene function prediction precedes and guides experimental validation

What we get is a ranked list of novel predictions

An experimenter would choose a manageable number of top-scoring predictions to pursue

Precision at the top of the prediction list

We choose precision at 20% recall (P20R) as the measure of performance
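
One way to compute precision at 20% recall is to walk down the ranked prediction list until 20% of the known positives have been recovered, then report the precision at that cutoff. The sketch below assumes this reading of P20R; the function name and the example labels are hypothetical.

```python
def precision_at_recall(ranked_labels, target_recall=0.2):
    """ranked_labels: 1 for a true positive, 0 otherwise, ordered from the
    highest-scoring prediction to the lowest. Returns the precision at the
    first cutoff where recall reaches target_recall."""
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        raise ValueError("no positive examples in the list")
    tp = 0
    for k, is_pos in enumerate(ranked_labels, start=1):
        tp += is_pos
        if tp / total_pos >= target_recall:
            return tp / k
    raise ValueError("target_recall must be at most 1")


# Hypothetical ranked list with 10 positives among 16 predictions
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1]
p20r = precision_at_recall(labels, 0.2)
print(p20r)  # 2 of the top 3 predictions are correct at 20% recall
```

Because only the top of the list matters, this measure reflects what an experimenter choosing a manageable number of top-scoring predictions would actually see.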

Performance of different algorithms

SinkSource (SS) seems to be better than the other algorithms

What about the influence of the number of genes in a function?

[Figure legend: 3rd quartile, median, 1st quartile; using only annotations based on experimental/expert evidence]

Performance of different algorithms

[Figure: number of functions (y-axis) against number of genes annotated with a function (x-axis), split into a first, second, and third group, each group containing ~125 functions]

Performance of different algorithms

For ‘small’ functions, the algorithm does not matter! And using just experimental annotations is better when you know little about a function.

For ‘medium’ functions, SS is a little better, and the effect of using ‘electronic’ evidence is mixed.

For ‘large’ functions
- SS is clearly the best
- Using all annotations is better

Performance of different algorithms

[Figure panels: All ECs (all evidence codes); Sans IEA/ISS (without IEA/ISS annotations)]

Wilcoxon test: SS vs. other algorithms

Overall, SinkSource appears to be the best algorithm.
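
The slide does not say which Wilcoxon test was used. The sketch below assumes a paired signed-rank test over per-function P20R values, which fits a comparison of two algorithms on the same set of functions. The P20R values (in percent) are invented for illustration, and the p-value uses a normal approximation without tie correction, which is crude for so few samples.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, approximate two-sided p-value) for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
    return w_plus, w_minus, p_two_sided


# Hypothetical per-function P20R values in percent
ss_p20r = [80, 65, 90, 55, 70, 60]
other_p20r = [70, 60, 75, 60, 50, 58]
w_plus, w_minus, p_value = wilcoxon_signed_rank(ss_p20r, other_p20r)
print(w_plus, w_minus, p_value)
```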

Correlation of performance with network properties

Performance on a particular function might depend on how its genes are organized / connected among themselves in the network.

Number of nodes

Number of components

Fraction of nodes in the largest connected component

Total edge weight

Weighted density

Average weighted degree

Average segregation

Correlation of performance with network properties

Number of nodes = 9

Number of components = 3

Fraction of nodes in the largest connected component = 4/9

Total edge weight = 8

Weighted density = 8/36

Average weighted degree = 16/9
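
The figure for this worked example did not survive, so the sketch below uses a hypothetical network chosen to reproduce the same numbers: 9 nodes in components of size 4, 3, and 2, with 8 edges of weight 1. It assumes weighted density is total edge weight divided by the number of possible node pairs, and average weighted degree is twice the total edge weight divided by the number of nodes, which matches 8/36 and 16/9 above.

```python
from fractions import Fraction


def network_properties(nodes, edges):
    """Compute the listed topological properties for an undirected weighted graph."""
    adj = {n: set() for n in nodes}
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Connected component sizes via depth-first search
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            n = stack.pop()
            size += 1
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        sizes.append(size)

    n = len(nodes)
    total_weight = sum(Fraction(w) for _, _, w in edges)
    return {
        "nodes": n,
        "components": len(sizes),
        "largest_fraction": Fraction(max(sizes), n),
        "total_edge_weight": total_weight,
        "weighted_density": total_weight / (n * (n - 1) // 2),
        "avg_weighted_degree": 2 * total_weight / n,
    }


# Hypothetical network: a 4-cycle, a triangle, and a single edge
nodes = list("abcdefghi")
edges = [
    ("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "a", 1),
    ("e", "f", 1), ("f", "g", 1), ("g", "e", 1),
    ("h", "i", 1),
]
props = network_properties(nodes, edges)
print(props)
```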

Correlation of performance with network properties

Functional modularity:

Average Segregation

Correlation of performance with network properties

Functional modularity: Average Segregation

[Figure: two example networks, one with Avg. seg = 8/22 and one with Avg. seg = 12/15]

Correlation of performance with network properties

We have …

Vector of SS P20R values for each function

Vector of values of a particular topological property for each function

Spearman rank correlation

[Figure label: Weighted density]
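
The Spearman rank correlation between the two vectors can be sketched as the Pearson correlation of their ranks. The P20R and weighted-density values below are invented for illustration, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values, one entry per function
ss_p20r = [0.9, 0.4, 0.7, 0.2, 0.6]
weighted_density = [0.30, 0.12, 0.25, 0.05, 0.10]
rho = spearman(ss_p20r, weighted_density)
print(rho)
```

Using ranks rather than raw values means the result depends only on whether functions with higher property values also tend to have higher P20R, not on the scale of either quantity.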

Correlation of performance with network properties

[Figure axis label: Spearman rank correlation]

Performance on plant-specific functions

For ‘conserved’ functions
- Performance is better than that for all functions
- Using all annotations is better

For ‘plant-specific’ functions
- Performance is much worse compared to ‘conserved’ functions
- Using only experimental annotations is better

The underlying network is built based on data from multiple non-plant species

[Figure legend: 3rd quartile, median, 1st quartile; using only annotations based on experimental/expert evidence]

Most predictable ‘conserved’ functions

protein folding

nucleotide transport

innate immunity

cytoskeleton organization, and

cell cycle

Least predictable ‘conserved’ functions

regulation of …

Specialized functions

Most predictable ‘plant-specific’ functions

cell wall modification

auxin/cytokinin signaling, and

photosynthesis

Contribution from Arabidopsis datasets

Least predictable ‘plant-specific’ functions

development, morphogenesis

pattern formation

phase transitions of various tissues, organs / growth stages

Conclusions

Evaluated the performance of various prediction algorithms on AraNet.

SinkSource is the overall best prediction algorithm.

Measured the influence of the number of genes annotated to a function and the source of annotation evidence.

All algorithms perform poorly when only a small number of genes are ‘known’ or when annotating very specific functions.

When only a small number of genes are ‘known’, use only experimentally verified annotations to make new predictions.

When a considerable number of genes are ‘known’, use all annotations to make new predictions.

Conclusions

Measured the correlation of performance with network properties

Several topological properties correlate well with performance.

‘Average segregation’ has the strongest correlation.

Conclusions

Assessed performance on conserved/plant-specific functions

Performance on basic ‘conserved’ functions is better than that for all the functions.

Specialized ‘conserved’ functions are hard to predict.

Performance on ‘plant-specific’ functions is very poor.

This is also a consequence of the fact that ‘plant-specific’ functions generally have a small number of annotations.

Conclusions

Avenues for improvement in functional interaction networks

Build functional interaction networks that are based on a larger collection of plant datasets.

If possible, rely as little as possible on data from other species.

Avenues for future experimental work

‘Plant-specific’ functions and

Specialized ‘conserved’ functions.

Acknowledgements

Arjun Krishnan

Brett Tyler

Andy Pereira