Protein function and bioinformatics

Neil Saunders, 76-455, [email protected], www.uq.edu.au/~uqnsaun1/

description

Talk for the BIOC6007 course at UQ; a lot of the material is similar to the presentation on genomics of cold-adapted microorganisms.

Transcript of Protein function and bioinformatics

Page 1: Protein function and bioinformatics

Outline of talk

● Why do we need bioinformatics?

● What tools do we need?

● Case study: The Methanococcoides burtonii genome

Neil Saunders
76-455
[email protected]
www.uq.edu.au/~uqnsaun1/

Page 2: Protein function and bioinformatics

Why do we need bioinformatics?

● Rapid increase in data due to genomics
● Too much data to characterise genes/proteins individually
● Bioinformatics = “smart use” of information
● Ideally, computational and experimental biology are partners

Page 3: Protein function and bioinformatics

The ideal computational – wet lab cycle

[Figure: a cycle linking Biological system, Biological objects, Computational objects, Analyses, Biological inferences and Experiments]

Bioinformatics is about helping biologists solve problems

Page 4: Protein function and bioinformatics

Introduction to genomics

Genomes Online database
● www.genomesonline.org

Published/complete      413
Bacteria in progress    977
Eukarya in progress     629
Archaea in progress      57
Metagenomes              56

10-50% of genes in a new genome may have no known function

Page 5: Protein function and bioinformatics

Computational skills for genomics

"So what new skills will postdocs need to ensure that they don't become science relics? The answer is math, statistics, and knowledge of a scripting language for computers."

- The Scientist, "Bioinformatics Knowledge Vital to Careers"
Volume 16 | Issue 17 | 53 | Sep. 2, 2002
www.the-scientist.com

Page 6: Protein function and bioinformatics

Using WWW resources

● The best web resources provide:
  - useful tools for analysis
  - integrated data from many sources

Good examples
● InterPro database http://www.ebi.ac.uk/interpro/
● Expasy http://au.expasy.org
● UniProt http://www.uniprot.org/
● CBS Prediction servers http://www.cbs.dtu.dk/services/
● IMG Database http://img.jgi.doe.gov/

But...
● Web services are no good for genome-scale analyses
● There are usually limits on data input (with good reason)

Nucleic Acids Research publishes annual database and web server issues: http://nar.oxfordjournals.org/

Page 7: Protein function and bioinformatics

Computational infrastructure for genomics

[Figure: diagram relating biological objects and computational objects (Genome, Assembly, Gene sequence, Protein sequence, Protein structure, Pathway) to Analysis (limitless): Comparative genomics, Pathway reconstruction, Phylogeny, Structural modeling, Sequence analysis, Regulatory motifs]

Key points
● Appropriate hardware: workstation v. cluster
● Linux Linux Linux!
● Freely-available, open source software is all you need
● Toolkits and libraries (e.g. BioPerl) to build your own solutions
● Philosophy of “many small tools plus glue” - scripting language
● Website + database skills - sharing

Page 8: Protein function and bioinformatics

BioPerl: a life sciences computational toolkit

● Website: http://www.bioperl.org
● A collection of Perl modules for biology
● Handles many common tasks in sequence/structure analysis, e.g.
  - read/write various sequence formats
  - run BLAST and parse the output
  - read/write/analyse sequence alignments
  - access local or remote databases
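
To make the "read/write various sequence formats" task concrete, here is a minimal sketch in Python using only the standard library. It is not BioPerl and not how the talk's analyses were done; the function names `read_fasta` and `write_fasta` and the example records are hypothetical, and only the simple FASTA format is handled.

```python
from io import StringIO
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_fasta(records: Iterable[Tuple[str, str]], handle: TextIO,
                width: int = 60) -> None:
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Hypothetical two-record input, for illustration only.
example = StringIO(">seq1 hypothetical protein\nMKT\nAYI\n>seq2\nMSE\n")
records = list(read_fasta(example))
```

A toolkit such as BioPerl covers many more formats and edge cases; the sketch only shows the shape of the "small tool" that a scripting language makes easy to write.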

Page 9: Protein function and bioinformatics

Annotation (or not) using BLAST

BLAST: Basic Local Alignment Search Tool
● Is useful for finding similar sequences quickly
● Not sensitive – less useful for weakly-similar sequences
● Not much good at all for annotation

Why not?
● “Hypothetical”: the database sequence is unique
● “Conserved hypothetical”: several hits but no known function
● Multi-domain proteins
● BLAST database contains incorrect annotations
● Annotation is at the whim of whoever deposited the sequence

Classic example: IMPDH
Wu et al. (2003) Comp. Biol. Chem. 27: 37-47

Page 10: Protein function and bioinformatics

A better annotation tool: InterProScan

● IPRScan is a tool to search the InterPro database
● It uses sequence signature profiles – more sensitive than BLAST
● Integrates the search results from multiple databases
● A good first step to characterise a new sequence
● Available as a standalone package and runs on clusters

Page 11: Protein function and bioinformatics

Structure prediction: threading and modelling

● The structure of a protein often explains how it functions
● However, structural determination is laborious, difficult and time-consuming
● Modelling can be useful in cases where the sequence is similar to that of a known structure

Threading: fit query sequence to fold database
Homology modelling: assume similar sequence = similar structure

Page 12: Protein function and bioinformatics

Some modelling tools and databases

● SwissModel: http://swissmodel.expasy.org/
● MODELLER: http://www.salilab.org/modeller/
● PROSPECT: http://compbio.ornl.gov/structure/prospect2/
● ModBase: http://modbase.compbio.ucsf.edu/

Page 13: Protein function and bioinformatics

Introduction to M. burtonii

Methanococcoides burtonii
● Isolated from Ace Lake, Antarctica (1-2 °C)
● Grows optimally at 23 °C
● Is an archaeon
● Is a psychrophilic methanogen

[Figures: M. burtonii; Ace Lake, Vestfold Hills; The Archaea]

Page 14: Protein function and bioinformatics

The M. burtonii genome

What features of this genome are related to cold adaptation?

Page 15: Protein function and bioinformatics

Discovery of CSP-like proteins in M. burtonii

● CSP = cold shock protein
● Expressed in bacteria at low temperature
● Functions as RNA chaperone to facilitate transcription at low temperature
● Present in some Archaea, including M. frigidum, but not M. burtonii

Page 16: Protein function and bioinformatics

Discovery of CSP-like proteins in M. burtonii

[Figure: workflow from protein sequences, to PROSPECT (thread v. CSD folds), to MODELLER (structural model); structures labelled d1sro__ and M. burtonii YP_564958]

● Both proteins are expressed (proteomics)
● Located in a putative exosome/proteasome superoperon
● This is consistent with their proposed function

Page 17: Protein function and bioinformatics

Integrating information: structural RNA study

[Figure: plot of tRNA % GC (stems; all bases) and OGT (°C)]

Is tRNA GC content related to OGT (optimum growth temperature)?
● tRNAscan-SE finds tRNAs in genomes
● GC content calculated using Perl scripts

Dihydrouridine in M. burtonii
● tRNA contains > 1 hU/tRNA
● Maintains flexibility at low temperature
● DUS gene identified using iprscan
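
The GC content calculation mentioned above was done with Perl scripts that are not shown in the talk. As a rough illustration only, the same idea can be sketched in Python. The function names are hypothetical, and the structure string (with "." marking unpaired bases and any other character marking a base in a stem) is an assumed input format, not something the slides specify.

```python
def gc_percent(seq: str) -> float:
    """Return % GC over the unambiguous bases (A, C, G, T, U) in seq."""
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def gc_percent_paired(seq: str, structure: str) -> float:
    """Return % GC over stem positions only.

    Assumption: structure has one character per base, "." for an
    unpaired base and anything else for a base-paired (stem) position.
    """
    paired = "".join(b for b, s in zip(seq, structure) if s != ".")
    return gc_percent(paired)


# Toy example, not real tRNA data.
all_bases = gc_percent("GCAUGC")
stems_only = gc_percent_paired("GCAUGC", ">>..<<")
```

Computing both values per tRNA, then averaging per genome, would give the two series (stems, all bases) that the plot compares against OGT.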

Page 18: Protein function and bioinformatics

Pyrrolysine: a problem for bioinformatics

● Proteomics used to identify expressed proteins
● One is trimethylamine methyltransferase (TMA-MT)
● It shows post-translational modification
● It also maps to 2 ORFs in the genome sequence

● The ORFs are actually one gene with a read-through UAG codon
● Pyrrolysine is incorporated at the UAG
● This is the 22nd genetically-encoded amino acid
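
Why this trips up gene finders can be shown with a small sketch, assuming the standard genetic code and a hypothetical `translate` function with an optional read-through switch. A conventional translation stops at UAG (TAG in DNA), splitting the gene into two ORFs; with read-through enabled, the codon is translated as pyrrolysine (one-letter code O). In real genomes only some UAG codons are read through, so the switch below is an illustration, not a prediction method.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, pyl_readthrough: bool = False) -> str:
    """Translate from the first base until the first stop codon.

    If pyl_readthrough is True, TAG is translated as pyrrolysine ("O")
    instead of terminating translation.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "TAG" and pyl_readthrough:
            protein.append("O")
            continue
        aa = CODON_TABLE.get(codon, "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy sequence, not the TMA-MT gene: ATG AAA TAG GGC TAA
toy_gene = "ATGAAATAGGGCTAA"
truncated = translate(toy_gene)
full_length = translate(toy_gene, pyl_readthrough=True)
```

With the toy sequence, the conventional translation ends before the TAG, while the read-through translation continues to the final stop codon.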

Page 19: Protein function and bioinformatics

Statistical analysis of protein properties

Archaea: 27 organisms, 62 338 ORFs
Bacteria: 52 organisms, 165 192 ORFs

Amino acid frequency (bioperl)
Data matrix: organisms (rows) x composition (columns)
PCA: principal components (R stats package)
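
The amino acid frequency step and the organisms-by-composition data matrix can be sketched as follows. The talk used BioPerl for the frequencies and R for the PCA; this Python version is only an illustration, and the function name `composition` and the tiny `proteomes` dictionary are hypothetical placeholders rather than real data.

```python
from collections import Counter
from typing import Dict, Iterable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(proteins: Iterable[str]) -> List[float]:
    """Return the fraction of each standard amino acid, pooled over
    all protein sequences of one organism."""
    counts = Counter()
    for seq in proteins:
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy input: organism name -> list of protein sequences.
proteomes: Dict[str, List[str]] = {
    "organism_A": ["MKTAYI", "MSE"],
    "organism_B": ["MLLW"],
}

# One row per organism, one column per amino acid.
matrix = {name: composition(seqs) for name, seqs in proteomes.items()}
```

Each row of `matrix` sums to 1, and a table of this shape is the kind of input a PCA routine expects.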

Page 20: Protein function and bioinformatics

Principal components analysis of composition

● 2 components explain most of the variation in amino acid composition
● PC1 correlates with genome GC content
● PC2 correlates with optimum growth temperature
● The psychrophilic archaea are distinguished by PC2 score
● Their proteins contain:
  - more Gln, Ser, Thr, His, Asp
  - less Leu, Trp and Glu

Page 21: Protein function and bioinformatics

Conclusions

● Computational biology and bioinformatics are essential to modern biology

● Many tools are available to annotate proteins:
  - web-based
  - standalone

● Without experiments, bioinformatics is just predictions

● Data integration is our biggest problem

www.uq.edu.au/~uqnsaun1/