Percentage of Domain Sequences in Genomes
[Figure: Percentage of Domain Sequences in Genomes (coverage per sequence). x-axis: Families, ordered by size (0 to 6000); y-axis: Percentage of sequences (0 to 90). Series: all; excluding singletons; excluding singletons and filtering.]
Genome Coverage by Domain Superfamilies
~50% of domain sequences in the genomes are contained in ~1000 CATH/SCOP domain superfamilies
A further ~20% of sequences belong to ~1400 Pfam superfamilies with no structural relative
PSI2 is currently targeting these ~1400 superfamilies
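The coverage figures above come from ordering families by size and accumulating the share of sequences they contain. A minimal sketch of that calculation follows. The function names and the toy family sizes are hypothetical and chosen only for illustration; they are not the data behind the slide.

```python
from itertools import accumulate


def cumulative_coverage(family_sizes):
    """Cumulative percentage of sequences covered as families are
    added from largest to smallest."""
    ordered = sorted(family_sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return []
    return [100.0 * running / total for running in accumulate(ordered)]


def families_needed(family_sizes, target_percent):
    """Smallest number of largest families whose members reach
    target_percent of all sequences, or None if never reached."""
    for rank, covered in enumerate(cumulative_coverage(family_sizes), start=1):
        if covered >= target_percent:
            return rank
    return None


# Toy family sizes (made-up numbers, not taken from the genomes analysed).
toy_sizes = [50, 20, 10, 10, 5, 3, 1, 1]
curve = cumulative_coverage(toy_sizes)
print(curve)
print(families_needed(toy_sizes, 50))
```

Plotting `curve` against family rank gives the shape of coverage curve shown in the figure; excluding singletons would amount to dropping families of size 1 before calling the function.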
[Diagram labels: Pfam superfamily; close sequence family (30%), 'unique family']
PSI2 targeting ~1400 LARGE superfamilies with no close structural relative
[Figure labels: All sequence families; Near/distant PDB relative; no PDB relative]
Targeting ~1400 Pfam superfamilies, but these contain tens of thousands of subfamilies
[Figure: x-axis: Subfamilies ordered by size (0 to 60000); y-axis: Percentage of sequences (0 to 100). Panel label: Unique families ordered by size.]
target ~1400 LARGE Pfam superfamilies common with no structural relative
target clusters of families predicted to have different functions
Gene3D annotations: COG, GO, EC, DIP, BIND, Y2H, microarray data, phylogenetic profiles
[Diagram label: functional group]
Model quality vs. sequence identity for 78,545 structural genomics homology models, built by Modeller 8v1, assessed using ProSa II
Methods like ProSa and GA341 can identify reasonable models at low sequence identities.
Comparison of models built by different methods may help in identifying reliable regions.
In combination with analysis of other features, e.g. domain context, homology models may help in suggesting functional subgroups within a superfamily.
Dissimilarity in electrostatic potential as an indicator of dissimilarity in function of PH domains.
Blomberg et al. (1999) Classification of Protein Sequences by Homology Modeling and Quantitative Analysis of Electrostatic Similarity. Proteins 37:379-387
Electrostatic potential tends to be conserved to relatively low sequence identity between target and template.
Chakravarty et al. (2005) Accuracy of structure-derived properties in simple comparative models of protein structures. Nucleic Acids Res. 33:244-259
[Figure panels: Human pleckstrin; Human exo84 signalling complex]
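To make "dissimilarity in electrostatic potential" concrete, the sketch below compares two potentials sampled at the same grid points using the Hodgkin similarity index, SI = 2(a·b) / (a·a + b·b). This is one commonly used measure for such comparisons; that it matches the exact measure used in the work cited above is an assumption, and the function names, the `1 - SI` dissimilarity, and the toy values are hypothetical.

```python
def hodgkin_index(phi_a, phi_b):
    """Hodgkin similarity index between two potentials sampled on the
    same grid points: +1 identical, 0 uncorrelated, -1 anti-correlated."""
    if len(phi_a) != len(phi_b):
        raise ValueError("potentials must be sampled at the same points")
    dot_ab = sum(a * b for a, b in zip(phi_a, phi_b))
    norm = sum(a * a for a in phi_a) + sum(b * b for b in phi_b)
    if norm == 0:
        raise ValueError("both potentials are zero everywhere")
    return 2.0 * dot_ab / norm


def electrostatic_dissimilarity(phi_a, phi_b):
    """Illustrative dissimilarity (an assumption of this sketch):
    0 for identical potentials, 2 for exactly opposite ones."""
    return 1.0 - hodgkin_index(phi_a, phi_b)


# Toy potentials at four grid points (made-up values).
domain_1 = [1.0, -0.5, 0.25, 2.0]
domain_2 = [-1.0, 0.5, -0.25, -2.0]
print(hodgkin_index(domain_1, domain_1))
print(electrostatic_dissimilarity(domain_1, domain_2))
```

In practice the potentials would come from a solver such as APBS, evaluated on a shared grid after superimposing the models; a matrix of pairwise dissimilarities could then be clustered to suggest functional subgroups.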
GeMMA http://www.biochem.ucl.ac.uk/~dlee/GeMMA
• Currently ~80,000 models built by Modeller
• Update requires ~ 1 month every 6 months
• Modelling alignments from SAM-T99 HMMs
• Residue conservation calculated by Scorecons
• Electrostatic potential calculated by APBS
• Model quality assessed by ProSa 2003 and GA341