Classification of protein and domain families


description

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Domain structures. Domain structure predictions. Classification of protein and domain families. Structure to function. Sequence to function.

Transcript of Classification of protein and domain families

Page 1: Classification of protein and domain families

Classification of protein and domain families

Sequence to function

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences

Domain structures

Domain structure predictions

Structure to function

Page 2: Classification of protein and domain families

CATH

Fold Group (1100)

Homologous Superfamily (2100)

Sequence Family

40,000 domain entries

~100,000 domains of known structure in CATH

Gene3D

~2 million sequences from genomes assigned to CATH superfamilies in Gene3D and functionally annotated

Page 3: Classification of protein and domain families

Gene3D: Domain structure annotations in genome sequences

~5 million protein sequences from 560 completed genomes and UniProt

scan against library of HMM models and sequences for CATH, Pfam and NewFam superfamilies

~2 million domain sequences assigned to CATH superfamilies

Page 4: Classification of protein and domain families

Gene3D

(1) Cluster ~5 million sequences into protein superfamilies

(2) Map domains onto the sequences using HMM technology (CATH & Pfam domains)

>200,000 protein superfamilies

~10,000 domain superfamilies (2100 of known structure)
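Step (2) can be illustrated with a small sketch. It assumes the HMM scan has already produced hits as (superfamily, start, end, E-value) records and keeps the most significant hits that do not overlap. The `Hit` layout, the `resolve_hits` name, the E-value cut-off and the greedy rule are all illustrative assumptions, not the Gene3D protocol.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    superfamily: str  # e.g. a CATH superfamily code
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    evalue: float     # lower is more significant

def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one residue position."""
    return a.start <= b.end and b.start <= a.end

def resolve_hits(hits: list[Hit], max_evalue: float = 1e-3) -> list[Hit]:
    """Greedily keep the most significant hits that do not overlap."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # all remaining hits are less significant still
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)

# Hypothetical hits on one protein sequence.
hits = [
    Hit("3.40.50.300", 10, 180, 1e-40),
    Hit("3.40.50.720", 150, 300, 1e-12),  # overlaps the first hit
    Hit("1.10.10.10", 200, 270, 1e-8),
    Hit("2.60.40.10", 310, 400, 0.5),     # fails the E-value cut-off
]
domains = resolve_hits(hits)
```

A greedy choice is the simplest option; a production pipeline would more likely optimise the whole set of domain assignments together.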

Page 5: Classification of protein and domain families

[Figure: bar chart of genes with structural annotation (axis 0 to 100) by organism: Arabidopsis, C. elegans, Drosophila, Human, Mouse, Yeast. Series: Gene3D (HMM prediction) and Genthreader (threading prediction).]

Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP

Page 6: Classification of protein and domain families

Annotation levels for an average genome

[Figure: bar scaled 0, 50%, 100% with the labels below.]

predicted to belong to structural superfamilies using HMM or threading techniques

many predicted to be transmembrane

many belonging to small species-specific families

Page 7: Classification of protein and domain families

Target selection strategy for PSI-2

[Figure: percentage of domain sequences (0 to 100) against families ordered by size (0 to 6000). Series: known structure (CATH - MEGA) and unknown structure (BIG - Pfam).]

Adam Godzik - JCSG, Andras Fiser - NYSGC, Burkhard Rost - NESG

Page 8: Classification of protein and domain families

Superfamily Variation: Structure/Sequence

[Figure: CATH superfamilies plotted with labelled points 3.40.50.720, 3.40.50.300, 3.40.50.150, 2.60.40.10, 1.10.10.10 and 2.40.50.140. Axis labels: Sequence Families, Structural Diversity, Population in genomes (x 1000); tick ranges 0 to 160 and 0 to 120. Legend: 0-25, 26-50, 51-100, 101-200 and 201+ GO terms.]

Correlation of sequence and structural variability of CATH families with the number of different functional groups

Page 9: Classification of protein and domain families

Structural diversity in the CATH Domain Superfamily P-loop hydrolases

Cutinase

Cocaine esterase

Acetylcholinesterase

Page 10: Classification of protein and domain families

Sequence to function

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences

Domain structures

Domain structure predictions

Page 11: Classification of protein and domain families

Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC Levels)

[Figure: number of domain relatives (0 to 4.5E+05) and number of CATH enzyme superfamilies (0 to 160) by sequence identity (%) in bins 11-20%, 21-30%, 31-40%, 41-50%, 51-60%, 61-70%, 71-80%, 81-90% and 91-100%. Annotation: highly variable families. Further axis labels: number of sequences and number of families against sequence identity threshold for 90% conservation.]
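A family-specific sequence identity threshold for 90% conservation of enzyme function could be derived along the following lines. The sketch assumes a list of pairwise comparisons within one superfamily, each giving the percent identity and the two EC numbers, and returns the lowest identity cut-off at which at least 90% of the pairs at or above that cut-off agree to three EC levels. The function names, the input layout and the counting rule are assumptions; the slides do not give the procedure.

```python
def ec3(ec: str) -> str:
    """First three levels of an EC number, e.g. '3.1.1.7' -> '3.1.1'."""
    return ".".join(ec.split(".")[:3])

def identity_threshold(pairs, min_conserved=0.9):
    """pairs: iterable of (percent_identity, ec_a, ec_b) for one family.

    Returns the lowest identity at which the fraction of pairs with
    identity >= that value and matching EC (3 levels) is >= min_conserved,
    or None if no cut-off qualifies.
    """
    ordered = sorted(pairs, key=lambda p: p[0], reverse=True)
    best = None
    same = 0
    for n, (identity, ec_a, ec_b) in enumerate(ordered, start=1):
        same += ec3(ec_a) == ec3(ec_b)
        # Only evaluate once all pairs sharing this identity are counted.
        last_at_identity = n == len(ordered) or ordered[n][0] < identity
        if last_at_identity and same / n >= min_conserved:
            best = identity
    return best

# Hypothetical pairwise comparisons within one enzyme superfamily.
pairs = [
    (95, "3.1.1.7", "3.1.1.8"),
    (80, "3.1.1.7", "3.1.1.7"),
    (60, "3.1.1.7", "3.1.1.1"),
    (45, "3.1.1.7", "3.1.1.74"),
    (30, "3.1.1.7", "3.4.21.4"),
    (25, "3.1.1.7", "2.7.1.1"),
]
threshold = identity_threshold(pairs)
```

In this toy family, function is conserved down to 45% identity, so annotations could be transferred well below a general-purpose cut-off; a highly variable family would yield a much higher threshold.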

Page 12: Classification of protein and domain families

N-Fold Increase in Functional Annotation for Sequences in Gene3D

[Figure: N-fold increase in coverage (0 to 8) for Gene3D (6.8%), H. sapiens (5%), A. thaliana (2.7%), C. elegans (1.1%) and B. anthracis (3.7%). Series: general thresholds (Domain - 50/80 and 40/80 cut-offs if identical MDA) and family specific thresholds (Domain - Family specific cut-off).]

Page 13: Classification of protein and domain families

Link to UniProt

Links to GO

Links to different levels in the Gene3D protein family

Link to InterPro

Links to CATH/Pfam

Links to KEGG

“S” - indicates you can search the term against Gene3D

Get an XML version of this page

Gene3D

Functional information from GO, COGS, KEGG, EC, FunCat, MINT, IntAct, ComplexDB

Page 14: Classification of protein and domain families

[Figure: Non-PSI PDBs and PSI PDBs, broken down by number of annotation terms: 0, 1, 2, 3 or 4.]

Functional annotation of structures using EC, GO, KEGG, FunCat resources

Page 15: Classification of protein and domain families

Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins

Tree Determinants - Valencia

Evolutionary Trace - Lichtarge

Funshift - Sonnhammer

SCI-PHY - Sjolander

Page 16: Classification of protein and domain families

Methods exploiting information on sequence conserved residue positions

Scorecons - Thornton

Protein Keys - Sander

multiple sequence alignment of relatives from functional group

Score conservation for each position in the alignment using an entropy measure

1 = highly conserved

0 = unconserved

Putative functional site

Structural model
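An entropy-based conservation score of this kind can be sketched as follows. The example uses plain Shannon entropy over the residues in each alignment column, normalised by the maximum entropy for 20 amino acids so that 1 means fully conserved and 0 means maximally variable. This is a generic illustration under those assumptions; Scorecons and Protein Keys use their own scoring schemes, which are not reproduced here.

```python
import math
from collections import Counter

def conservation_scores(alignment: list[str]) -> list[float]:
    """Return one score per column: 1 = fully conserved, 0 = maximally variable."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log(20)  # 20 standard amino acids
    scores = []
    for col in range(n_cols):
        residues = [row[col] for row in alignment if row[col] != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column: treat as unconserved
            continue
        counts = Counter(residues)
        entropy = -sum((c / len(residues)) * math.log(c / len(residues))
                       for c in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores

# Hypothetical alignment of four relatives from one functional group.
alignment = ["GDSAH", "GDTAH", "GESVH", "GD-LH"]
scores = conservation_scores(alignment)
```

Columns scoring close to 1 are candidates for a putative functional site once mapped onto the structural model.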

Page 17: Classification of protein and domain families

Superfamily of known structure (CATH)

GeMMA: Compares sequence profiles (HMMs) between subfamilies

sequence subfamily (80% seq. id)

putative structure-function group

clusters sequence relatives predicted to have similar structures/functions even at low levels of sequence identity
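The idea of merging subfamilies by profile comparison can be sketched as a simple agglomerative clustering. To stay self-contained, the example swaps in a per-column residue-frequency profile and an overlap score as a stand-in for HMM-HMM comparison, and it assumes pre-aligned sequences of equal length. The function names and the 0.6 threshold are hypothetical, not GeMMA's.

```python
from collections import Counter
from itertools import combinations

def profile(seqs):
    """Per-column residue frequencies for equal-length aligned sequences."""
    return [{r: c / len(seqs) for r, c in Counter(col).items()}
            for col in zip(*seqs)]

def similarity(p, q):
    """Mean per-column overlap of two profiles (1 = identical, 0 = disjoint)."""
    overlap = [sum(min(a.get(r, 0.0), b.get(r, 0.0)) for r in a)
               for a, b in zip(p, q)]
    return sum(overlap) / len(overlap)

def merge_subfamilies(subfamilies, threshold=0.6):
    """Repeatedly merge the most similar pair of clusters above threshold."""
    clusters = [list(s) for s in subfamilies]
    while len(clusters) > 1:
        score, i, j = max(
            (similarity(profile(clusters[i]), profile(clusters[j])), i, j)
            for i, j in combinations(range(len(clusters)), 2))
        if score < threshold:
            break
        merged = clusters.pop(j)      # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

# Hypothetical starting subfamilies (e.g. clusters at 80% seq. id).
subfamilies = [["GDSGH", "GDSGH"], ["GDSAH"], ["KLMNP"], ["KLMNQ"]]
groups = merge_subfamilies(subfamilies)
```

Profiles are rebuilt after every merge, so later comparisons reflect the combined subfamily rather than its original members.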

Page 18: Classification of protein and domain families

GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark

[Figure: four bar charts comparing SCI-PHY and GeMMA on the Amidohydrolase, Crotonase, Enolase, Haloacid dehalogenase and Vicinal oxygen chelate superfamilies. Panels: Purity (high is best); Edit distance (low is best); VI distance (low is best); Deviation from no. singletons (low is best).]
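Two of the benchmark measures, purity and VI distance, can be computed from a predicted clustering and a set of reference labels as sketched below. Purity is taken here as the percentage of predicted clusters whose members all share one reference label, and VI distance as the variation of information, H(P|R) + H(R|P). These are common definitions used as assumptions; the slides do not state the exact formulas used in the benchmark.

```python
import math
from collections import Counter

def purity(predicted, reference):
    """Percentage of predicted clusters whose members share one reference label."""
    members = {}
    for cluster, label in zip(predicted, reference):
        members.setdefault(cluster, set()).add(label)
    pure = sum(1 for labels in members.values() if len(labels) == 1)
    return 100.0 * pure / len(members)

def vi_distance(predicted, reference):
    """Variation of information, H(P|R) + H(R|P), in nats; 0 means identical."""
    n = len(predicted)
    joint = Counter(zip(predicted, reference))
    p_counts = Counter(predicted)
    r_counts = Counter(reference)
    vi = 0.0
    for (p, r), c in joint.items():
        vi -= (c / n) * (math.log(c / p_counts[p]) + math.log(c / r_counts[r]))
    return vi

# Hypothetical gold-standard functions and predicted subfamily assignments.
reference = ["A", "A", "A", "B", "B", "C"]
predicted = [1, 1, 2, 3, 3, 3]
pur = purity(predicted, reference)
vi = vi_distance(predicted, reference)
```

Here cluster 3 mixes functions B and C, so two of the three clusters are pure, and splitting function A across clusters 1 and 2 contributes to a non-zero VI distance.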

Page 19: Classification of protein and domain families

Annotation (EC number) coverage of MEGA family 3.90.1200.10

[Figure: coverage of family (%) (0 to 80) by source of annotation: database annotations (experimental annotations); annotations inherited within S60 clusters (inherit functions at 60% seq. id.); annotations inherited within GeMMA functional subfamilies (inherit functions by GeMMA).]

Functional annotation coverage using different strategies
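The coverage gain from inheriting annotations within clusters can be sketched as follows. The example assumes each cluster is a list of sequence identifiers, that experimental EC annotations are given per sequence, and that a cluster passes its annotation to all members only when its annotated members agree. The function name, the data layout and the agreement rule are illustrative assumptions.

```python
def coverage_after_inheritance(clusters, annotated):
    """clusters: dict of cluster id -> list of sequence ids.
    annotated: dict of sequence id -> EC number (experimental only).
    Returns (percent covered before, percent covered after inheritance).
    """
    total = sum(len(members) for members in clusters.values())
    before = sum(1 for members in clusters.values()
                 for seq in members if seq in annotated)
    after = 0
    for members in clusters.values():
        ecs = {annotated[seq] for seq in members if seq in annotated}
        if len(ecs) == 1:   # unambiguous: every member inherits it
            after += len(members)
        else:               # none or conflicting: keep originals only
            after += sum(1 for seq in members if seq in annotated)
    return 100.0 * before / total, 100.0 * after / total

# Hypothetical clusters within one superfamily.
clusters = {
    "c1": ["s1", "s2", "s3", "s4"],
    "c2": ["s5", "s6"],
    "c3": ["s7", "s8", "s9", "s10"],
}
annotated = {"s1": "2.7.1.1", "s7": "3.1.1.7", "s8": "3.1.1.1"}
before, after = coverage_after_inheritance(clusters, annotated)
```

Larger clusters that remain functionally coherent give a larger gain, which is the motivation for comparing fixed 60% identity clusters with functional subfamilies.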

Page 20: Classification of protein and domain families

Gene3D Biominer Methods

• Phylotuner: Correlation of domain occurrence profiles.

• GOSS: Semantic similarity calculation between protein pairs.

• CODA: Domain fusion analysis.

• HiPPI: Homology inheritance of protein-protein physical interaction data.

• GECO: Correlation of gene expression data.

Protein interactions and gene networks
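Correlating domain occurrence profiles, as listed for Phylotuner, can be illustrated with a Pearson correlation between the copy numbers of two domain superfamilies across a set of genomes. The profile values and superfamily names below are hypothetical, and the actual Phylotuner scoring may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / (sx * sy)

# Hypothetical copy numbers of three domain superfamilies in six genomes.
profiles = {
    "superfamily_A": [2, 0, 3, 0, 1, 0],
    "superfamily_B": [4, 0, 6, 0, 2, 0],
    "superfamily_C": [0, 1, 0, 2, 0, 1],
}
r_ab = pearson(profiles["superfamily_A"], profiles["superfamily_B"])
r_ac = pearson(profiles["superfamily_A"], profiles["superfamily_C"])
```

Superfamilies whose profiles rise and fall together across genomes, like A and B here, are candidates for a functional link; anti-correlated profiles such as A and C are not.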

Page 21: Classification of protein and domain families

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences

Domain structures

Domain structure predictions

Structure to function

Page 22: Classification of protein and domain families

Methods for Assessing Structural Novelty

CATHEDRAL - structure comparison

[Figure: proportion correct fold (0.86 to 1) against rank (1 to 20) for CATHEDRAL, CE, LSQMAN, DALI and STRUCTAL.]

Redfern et al., PLoS Comput. Biol. 2007

Page 23: Classification of protein and domain families

Structural clusters in the Aminoacyl tRNA synthetases – like family

Aminoacyl tRNA synthetases

DNA-binding, stress-related

Argininosuccinate lyases

Gln-hydrolyzing synthases

Nucleotidyl-transferases

[Figure axis: structure similarity score]

Page 24: Classification of protein and domain families

1bkzA00

2.60.120.200

1dypA00

Galectin binding superfamily

Page 25: Classification of protein and domain families

Identifying functional groups in domain superfamilies

Aminoacyl tRNA synthetases – like

1dnpA00: Deoxyribodipyrimidine photo-lyases

1ej2A00: Nucleotidylyl-transferases

1n3lA01: AA tRNA synthetase, Class I

1o97D01: Electron transfer flavoprotein

Page 26: Classification of protein and domain families

Exploiting 3D Templates to Represent Functional Relatives

JESS - Thornton

GASP - Babbitt

SPASM - Kleywegt

PINTS - Russell

DRESPAT - Sarawagi

pvSOAR - Joachimiak

Page 27: Classification of protein and domain families

SITESEER: Match 3-residue templates and assess relevance of hits by looking at residues within the local environment

green and purple – identical residues; orange and white – similar residues

Laskowski and Thornton

Page 28: Classification of protein and domain families

FLORA: 3D templates for functional groups

From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.

Page 29: Classification of protein and domain families

FLORA: 3D templates for functional groups

localFLORA: single site

globalFLORA: multiple sites

Page 30: Classification of protein and domain families

FLORA: Performance in recognising functionally related homologues

[Figure: coverage (axis labelled %, ticks 0.6 to 1) against rank (0 to 10) for Local FLORA and Global FLORA.]

Benchmark of 36 diverse enzyme groups (from 12 families)

Page 31: Classification of protein and domain families

Performance of FLORA

Benchmarked on 36 large enzyme families

Page 32: Classification of protein and domain families

FLORA: 3D Templates for Structure-Function Groups in Domain Families

1dnpA01: Deoxyribodipyrimidine photo-lyases

1ej2A00: Nucleotidylyl-transferases

1q77A00: Unknown function (MCSG)

1n3lA01: AA tRNA synthetases

1o97D01: Electron transfer flavoprotein

Page 33: Classification of protein and domain families

Fold and structural motifs

SSM fold search

Surface clefts

Residue conservation

DNA-binding HTH motifs

Nest analysis

Sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc.)

Sequence scans

Sequence search vs PDB

Sequence search vs UniProt

Superfamily HMM library

Gene neighbours

n-residue templates

Enzyme active sites

Ligand binding sites

DNA binding sites

Reverse templates

http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/

Page 34: Classification of protein and domain families

Function Prediction for Proteins of ‘Putative’ or Unknown Function

Class            Sequence Evidence   Structure Evidence   Sequence + Structure   Neither Successful

Putative (57)    53                  44                   41                     1

Unknown (132)    95*                 69*                  57*                    25

* Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’

structural data provides relatively more information for proteins about which there is less knowledge

these predictions need to be experimentally validated
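The counts in the table can be checked with a short calculation. The sketch assumes the "Sequence + Structure" column counts proteins with both kinds of evidence, which is consistent with each row summing to its class total, and it derives how many proteins have structure evidence only. Note that for the Unknown class the starred numbers cover only hits classed as 'Strong' or 'Moderate'.

```python
results = {
    # class: (total, sequence evidence, structure evidence, both, neither)
    "Putative": (57, 53, 44, 41, 1),
    "Unknown": (132, 95, 69, 57, 25),
}

structure_only = {}
for name, (total, seq, struct, both, neither) in results.items():
    # Inclusion-exclusion: the four groups must account for every protein.
    assert seq + struct - both + neither == total
    structure_only[name] = struct - both
    share = 100 * structure_only[name] / total
    print(f"{name}: {structure_only[name]} of {total} "
          f"({share:.1f}%) have structure evidence only")
```

On these figures, structure is the sole source of evidence for 3 of the 57 putative proteins and 12 of the 132 unknown ones, a larger share for the less characterised class.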