Percentage of Domain Sequences in Genomes

[Figure: Coverage per sequence. x-axis: Families, ordered by size (0 to 6000). y-axis: Percentage of sequences / Percentage of Domain Sequences in Genomes (0 to 90). Series: all; excluding singletons; excluding singletons and filtering.]

Genome Coverage by Domain Superfamilies

~50% of domain sequences in the genomes are contained in ~1000 CATH/SCOP domain superfamilies

A further ~20% of sequences belong to ~1400 Pfam superfamilies with no structural relative

PSI2 is currently targeting these ~1400 superfamilies
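The coverage figures above come from a cumulative curve over families ordered by size. A minimal sketch of that calculation follows, assuming the only input is a list of family sizes (number of domain sequences per family). The function names and the toy data are hypothetical, and the assumption that excluding singletons also removes them from the denominator is not confirmed by the slides.

```python
from itertools import accumulate


def cumulative_coverage(family_sizes, exclude_singletons=False):
    """Cumulative percentage of sequences covered, adding families largest first.

    Assumption: when singletons are excluded they are also dropped from the
    total, so the curve still ends at 100%.
    """
    sizes = sorted(family_sizes, reverse=True)
    if exclude_singletons:
        sizes = [s for s in sizes if s > 1]
    total = sum(sizes)
    if total == 0:
        return []
    return [100.0 * running / total for running in accumulate(sizes)]


def families_needed(family_sizes, target_percent, exclude_singletons=False):
    """Smallest number of largest families whose members reach target_percent."""
    curve = cumulative_coverage(family_sizes, exclude_singletons)
    for rank, percent in enumerate(curve, start=1):
        if percent >= target_percent:
            return rank
    return None  # target not reachable (empty input)


# Toy data only: seven hypothetical families, 100 sequences in total.
toy_sizes = [50, 30, 10, 5, 3, 1, 1]
print(families_needed(toy_sizes, 50))
print(families_needed(toy_sizes, 90))
```

Applied to real family sizes, the rank at which the curve crosses 50% would correspond to the "~1000 superfamilies" style of statement made on the slide.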

[Diagram labels: Pfam superfamily; close sequence family (30%), 'unique family']

PSI2 is targeting ~1400 LARGE superfamilies with no close structural relative

All sequence families

Near/distant PDB relative

no PDB relative

Targeting ~1400 Pfam superfamilies, but these contain tens of thousands of subfamilies

[Figure: x-axis: Subfamilies ordered by size (0 to 60000) / Unique families ordered by size. y-axis: Percentage of sequences (0 to 100).]

[Diagram labels: Pfam superfamily; close sequence family (30%), 'unique family']

Target ~1400 LARGE, common Pfam superfamilies with no structural relative

Target clusters of families predicted to have different functions

Gene3D annotations: COG, GO, EC, DIP, BIND, Y2H, microarray data, phylogenetic profiles

[Diagram label: functional group]

Model quality vs. sequence identity for 78,545 structural genomics homology models, built by Modeller 8v1 and assessed using ProSa II

Methods like ProSa and GA341 can identify reasonable models at low sequence identities.
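A minimal sketch of how such scores might be used to pick out reasonable models at low sequence identity. The record layout, the function names, and both cutoffs are assumptions for illustration: 0.7 is a commonly quoted GA341 cutoff, while the ProSa z-score cutoff of -2.0 is purely hypothetical (more negative z-scores are taken to indicate better models).

```python
from dataclasses import dataclass

GA341_CUTOFF = 0.7      # commonly quoted cutoff; an assumption here
PROSA_Z_CUTOFF = -2.0   # hypothetical cutoff chosen only for illustration


@dataclass
class ModelRecord:
    model_id: str
    seq_identity: float  # target/template sequence identity, in percent
    ga341: float         # GA341 score, between 0 and 1
    prosa_z: float       # ProSa z-score; more negative is better


def is_reasonable(model, ga341_cutoff=GA341_CUTOFF, prosa_z_cutoff=PROSA_Z_CUTOFF):
    """True when both assessment scores pass their (assumed) cutoffs."""
    return model.ga341 >= ga341_cutoff and model.prosa_z <= prosa_z_cutoff


def reasonable_low_identity_models(models, max_identity=30.0):
    """Models below max_identity percent identity that still pass both checks."""
    return [m for m in models if m.seq_identity < max_identity and is_reasonable(m)]


# Toy records only; these are not real models or scores.
toy_models = [
    ModelRecord("m1", 22.0, 0.95, -5.1),
    ModelRecord("m2", 18.0, 0.40, -1.0),
    ModelRecord("m3", 55.0, 0.99, -7.3),
]
kept = reasonable_low_identity_models(toy_models)
print([m.model_id for m in kept])
```

Requiring both scores to pass is a conservative choice; a real pipeline might combine them differently.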

Comparison of models built by different methods may help in identifying reliable regions
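One way to compare two models of the same target is to flag residues whose positions agree between them. The sketch below assumes both models are already superposed in the same coordinate frame, share residue numbering, and are reduced to one C-alpha coordinate per residue. The function name and the 2.0 Å cutoff are hypothetical, and this is not necessarily the comparison the authors had in mind.

```python
import math


def consistent_residues(coords_a, coords_b, max_distance=2.0):
    """Residue numbers whose C-alpha positions agree within max_distance (angstroms).

    coords_a and coords_b map residue number -> (x, y, z). Both models are
    assumed to be superposed already; no fitting is done here.
    """
    shared = coords_a.keys() & coords_b.keys()
    return sorted(
        res for res in shared
        if math.dist(coords_a[res], coords_b[res]) <= max_distance
    )


# Toy coordinates only: residues 1 and 2 agree, residue 3 diverges.
model_a = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (7.6, 0.0, 0.0)}
model_b = {1: (0.5, 0.0, 0.0), 2: (3.8, 1.0, 0.0), 3: (7.6, 5.0, 0.0)}
print(consistent_residues(model_a, model_b))
```

Residues that stay within the cutoff across models built by different methods would be candidates for "reliable regions" in the sense used above.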

In combination with analysis of other features, e.g. domain context, homology models may help in suggesting functional subgroups within a superfamily

Dissimilarity in electrostatic potential as an indicator of dissimilarity in function of PH domains.
Blomberg et al. (1999) Classification of Protein Sequences by Homology Modeling and Quantitative Analysis of Electrostatic Similarity. Proteins 37:379-387

Electrostatic potential tends to be conserved to relatively low sequence identity between target and template.
Chakravarty et al. (2005) Accuracy of structure-derived properties in simple comparative models of protein structures. Nucleic Acids Res. 33:244-259
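Electrostatic similarity between two models can be quantified once their potentials are sampled on the same grid points. The sketch below uses the Hodgkin similarity index, a common choice for comparing potentials. Whether it matches the exact measure used in the cited papers is an assumption, and defining dissimilarity as 1 minus the index is a hypothetical convention adopted only for this example.

```python
def hodgkin_similarity(potential_a, potential_b):
    """Hodgkin index: 2*sum(a*b) / (sum(a^2) + sum(b^2)), ranging from -1 to 1.

    Both inputs are potentials sampled at the same grid points, in the same
    order, after the two structures have been superposed.
    """
    if len(potential_a) != len(potential_b):
        raise ValueError("potentials must be sampled on the same grid points")
    numerator = 2.0 * sum(a * b for a, b in zip(potential_a, potential_b))
    denominator = sum(a * a for a in potential_a) + sum(b * b for b in potential_b)
    if denominator == 0:
        raise ValueError("both potentials are zero everywhere")
    return numerator / denominator


def electrostatic_dissimilarity(potential_a, potential_b):
    """Hypothetical dissimilarity: 0 for identical potentials, 2 for opposite."""
    return 1.0 - hodgkin_similarity(potential_a, potential_b)


# Toy potentials only (arbitrary units at three grid points).
p = [1.0, -2.0, 0.5]
q = [-1.0, 2.0, -0.5]
print(hodgkin_similarity(p, p), hodgkin_similarity(p, q))
```

In practice the potentials would come from a solver such as APBS, evaluated on a shared grid around the superposed domains.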

Human pleckstrin Human exo84 signalling complex

GeMMA http://www.biochem.ucl.ac.uk/~dlee/GeMMA

• Currently ~80,000 models built by Modeller

• Update requires ~ 1 month every 6 months

• Modelling alignments from SAM-T99 HMMs

• Residue conservation calculated by Scorecons

• Electrostatic potential calculated by APBS

• Model quality assessed by ProSa 2003 and GA341
