NIGMS Protein Structure Initiative: Target Selection Workshop
ADDA and remote homologue detection
Liisa Holm
Institute of Biotechnology
University of Helsinki
Definitions
• Nrdbxx = nrdb where no two sequences are more than xx % identical; redundant sequences are mapped to a representative
– Uniprot + Genpept + PIR + PDB + …
– Nrdb100 – Nrdb90 – … – Nrdb40 – Nrdb30 = “modeling family”
• PairsDB = database of all-against-all comparisons
– Blast in nrdb90, PSI-Blast in nrdb40
• BIG = family detected by profile comparison
– Profile needs seed set (alignment); automatic iterative profile construction has poor convergence
– Profiles → partially overlapping neighbour sets → need to cluster sequences → clustering artefacts when true cluster shape is non-spherical
(graph) covering ≠ clustering ≠ classification
• Incomplete detection of homologous set by profile models
• Example: Urease et al. superfamily
[Figure: two panels labelled IDEAL and REAL]
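The Nrdbxx definition above can be illustrated with a minimal sketch of greedy redundancy removal. This is not the pipeline used to build nrdb: the function names `percent_identity` and `build_nrdb` are hypothetical, and the toy identity measure compares positions without any alignment, where a real build would use an alignment program.

```python
from typing import Callable, Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy identity (assumption): matching positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def build_nrdb(
    sequences: Dict[str, str],
    xx: float,
    identity: Callable[[str, str], float] = percent_identity,
) -> Dict[str, str]:
    """Map each sequence id to a representative id so that no two
    representatives are more than xx % identical."""
    representatives: List[str] = []
    mapping: Dict[str, str] = {}
    # Longest sequences first, so a representative is never shorter
    # than the sequences mapped to it (ties broken by id).
    for seq_id in sorted(sequences, key=lambda s: (-len(sequences[s]), s)):
        for rep in representatives:
            if identity(sequences[seq_id], sequences[rep]) > xx:
                mapping[seq_id] = rep  # redundant: map to representative
                break
        else:
            representatives.append(seq_id)  # new representative
            mapping[seq_id] = seq_id
    return mapping
```

Lowering `xx` merges more sequences into each representative, which is why nrdb40 is much smaller than nrdb90.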
ADDA: clustering of domains into families
• ADDA = Automatic Domain Definition Algorithm
– Heger & Holm (2003) J Mol Biol 328, 749-767.
– Heger et al. (2005) Nucl. Acids Res. 33, Database Issue, D188-D191.
• Principles of ADDA
– Blast all-against-all comparison in nrdb90
– Domains are optimally covered by alignments
   • Complete domain coverage; every residue belongs to a domain
– Minimum/maximum spanning tree of domains
– Remove links where profile-profile score is below threshold
– Connected components are domain families
• Quality assessment
– Most ADDA families are pure, containing one PFAM family or SCOP superfamily (plus previously unclassified members)
– Occasionally members from different PFAM families are merged in one ADDA family (contamination or PFAM misclassification)
– Domain size distribution is reasonable
   • For example, much less over-fragmentation than by the Prodom algorithm
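The last three principles (spanning tree, link removal, connected components) can be sketched as follows. This is an illustration under stated assumptions, not the published ADDA implementation: link weights stand in for alignment scores, `profile_score` is a hypothetical caller-supplied function, and the tree is built with Kruskal's algorithm.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, float]  # (domain_a, domain_b, link_weight)


class UnionFind:
    """Disjoint sets with path halving."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def maximum_spanning_tree(edges: Iterable[Edge]) -> List[Edge]:
    """Kruskal's algorithm, taking the heaviest links first."""
    uf = UnionFind()
    tree = []
    for a, b, w in sorted(edges, key=lambda e: -e[2]):
        if uf.union(a, b):  # keep the link only if it joins two trees
            tree.append((a, b, w))
    return tree


def domain_families(
    domains: Iterable[str],
    edges: Iterable[Edge],
    profile_score: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Cut tree links scoring below threshold; return connected components."""
    domains = list(domains)
    uf = UnionFind()
    for a, b, _ in maximum_spanning_tree(edges):
        if profile_score(a, b) >= threshold:  # link survives
            uf.union(a, b)
    families: Dict[str, Set[str]] = {}
    for d in domains:
        families.setdefault(uf.find(d), set()).add(d)
    return sorted(sorted(f) for f in families.values())
```

Because only tree links are tested, the number of profile-profile comparisons grows linearly with the number of domains rather than quadratically.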
ADDA purity and domain size
[Figure: histograms with bins <=10, <=100, <=1000, <=10000; axis labels recovered: relative frequency / %, cumulative frequency / %, overlap or relative size / %. Two further panels, PFAM and SCOP, with axis labels absolute frequency, sensitivity / % and selectivity / %.]
Accuracy of domain boundaries
– Red: best possible in domain tree
– Black: actually selected
3D coverage of model proteomes
• PDB entries from May 2006
– Required greater than 80 % overlap between PDB sequence and ADDA domain to call family structurally covered
• ADDA domain families
– BIG families
   • 28429 families have more than ten members in nrdb100
      – 2383 structurally covered BIG families
   • 8820 families have more than ten members in nrdb40
      – 1869 structurally covered BIG families
• NCBI genome sets
– H sapiens, C elegans, D melanogaster, A thaliana, E coli, B anthracis, T maritima
– Mapped to ADDA families
   • 6770 BIG(nrdb40) families occur in model genome set
      – 1705 structurally covered
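The 80 % overlap criterion can be sketched as an interval test. The slide does not say which length the overlap is measured against; the sketch below assumes it is the fraction of the ADDA domain covered by the PDB sequence match, with inclusive residue coordinates. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive residue numbers


def overlap_fraction(domain: Interval, pdb_match: Interval) -> float:
    """Fraction of the domain's residues covered by the PDB match
    (assumption: overlap is normalised by domain length)."""
    d_start, d_end = domain
    p_start, p_end = pdb_match
    shared = min(d_end, p_end) - max(d_start, p_start) + 1
    return max(shared, 0) / (d_end - d_start + 1)


def family_is_structurally_covered(
    members: Iterable[Tuple[Interval, List[Interval]]],
    min_overlap: float = 0.8,
) -> bool:
    """A family counts as covered if any member domain overlaps any of
    its PDB matches by more than min_overlap."""
    return any(
        overlap_fraction(domain, match) > min_overlap
        for domain, matches in members
        for match in matches
    )
```

A single well-overlapping PDB match is enough to mark the whole family as covered, which is what makes the family-level counts above much smaller than the domain-level counts.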
Model genome coverage – 28429 BIG families in nrdb100
[Figures: bar charts per model genome (hs, ce, dm, at, ec, ba, tm)]
– ADDA domain coverage of model genomes: number of domains and percentage of domains (series: domains_small, domains_big_white, domains_big_3D)
– ADDA family coverage of model genomes: number of families and percentage of families (series: families_small, families_big_white, families_big_3D)
– ADDA family coverage of model genomes: percentage of BIG families (series: families_big_white, families_big_3D)
– ADDA domain coverage of model genomes: percentage of domains in BIG families (series: domains_big_white, domains_big_3D)
T. maritima would be covered by 1000 BIG families and is two thirds done
[Figures: bar charts per model genome (hs, ce, dm, at, ec, ba, tm)]
– ADDA domains: counts on an axis from 0 to 100000 (series: domains_small, domains_big_white, domains_big_3D) and percentage of BIG domains (nrdb40) (series: domains_big_white, domains_big_3D)
– ADDA families: number of families (series: families_small, families_big_white, families_big_3D) and percentage of BIG families (nrdb40) (series: families_big_white, families_big_3D)
6770 BIG families in nrdb40
Multigene families in eukaryotes
2.0-2.4 domains per euk. gene; 1.3 domains per prok. gene
Seven model genomes
[Venn diagram: Human; Worm, fly, plant; Prokaryotes (E coli, B anthracis, T maritima). Region counts in extraction order: 638, 2412, 29, 187, 802, 161, 836]
Human BIG target families are almost exclusively eukaryote-specific
[Venn diagram: Human; Worm, fly, plant; Prokaryotes. Region counts in extraction order: 57, 535, 18, 588, 88, 264, 155]
Universal BIG families are almost covered
5065 white BIG target families; 1705 structurally covered BIG families
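As a consistency check, the region counts of the two Venn diagrams can be summed and compared with the totals quoted here. The assignment of the first diagram to white families and the second to structurally covered families is an assumption, supported only by the sums matching; the variable names are hypothetical.

```python
# Region counts as they appear in the two Venn diagrams.
# Assumption: first diagram = white (no structure) BIG families,
# second diagram = structurally covered BIG families.
white_regions = [638, 2412, 29, 187, 802, 161, 836]
covered_regions = [57, 535, 18, 588, 88, 264, 155]

white_total = sum(white_regions)          # white BIG target families
covered_total = sum(covered_regions)      # structurally covered BIG families
big_total = white_total + covered_total   # BIG(nrdb40) families in the genome set

# Share of BIG families in the model genome set that already have a structure.
covered_fraction = covered_total / big_total

print(white_total, covered_total, big_total)
print(f"{100 * covered_fraction:.1f} % of BIG families structurally covered")
```

The totals reproduce the 5065, 1705 and 6770 quoted on the slides, and roughly a quarter of the BIG families in the model genome set are structurally covered.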
Covering all modelling families will have astronomical cost
• Nrdbxx updates; Nrdb30 = “modelling family”
[Figure: log-scale axis from 100000 to 10000000; categories v1, v2, v3, v4]
Fine-grained coverage
• MF: Structural core shrinks rapidly below 30 % sequence identity
   → Need less naïve modelling software capable of building ab initio those parts which are not covered by the template
– Misalignment is major source of error
   → Transitive alignment covers more of the structurally equivalent core
Average coverage of structural core (152 pairs in 11 superfamilies):
   Transitive 51 %
   Global alignment (HMMer) 43 %
   Local alignment (PSI-Blast) 34 %
Error        Rmsd/Å
Template     32
Misaligned   16
Loops        8
Backbone     4
Rotamers     2
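The idea behind transitive alignment can be sketched as composing pairwise alignments through an intermediate sequence and measuring how much of a structure-based reference is recovered. This is a simplified illustration, not the method evaluated on the slide: alignments are plain residue-index maps, conflicts are resolved naively in favour of the direct alignment, and all function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Alignment = Dict[int, int]  # residue index in one sequence -> index in the other


def compose(a_to_b: Alignment, b_to_c: Alignment) -> Alignment:
    """Chain A->B and B->C into A->C; residues lost at either step drop out."""
    return {i: b_to_c[j] for i, j in a_to_b.items() if j in b_to_c}


def transitive_alignment(
    direct: Alignment,
    paths: Iterable[Tuple[Alignment, Alignment]],
) -> Alignment:
    """Extend the direct A->C alignment with pairs reached via intermediates.
    Assumption: direct pairs take precedence over transitive ones."""
    merged: Alignment = {}
    for a_to_b, b_to_c in paths:
        for i, k in compose(a_to_b, b_to_c).items():
            merged.setdefault(i, k)
    merged.update(direct)
    return merged


def core_coverage(alignment: Alignment, core: Alignment) -> float:
    """Fraction of structure-based equivalences reproduced by the alignment."""
    if not core:
        return 0.0
    hits = sum(1 for i, k in core.items() if alignment.get(i) == k)
    return hits / len(core)
```

In this toy setting an intermediate sequence can supply residue pairs that the direct alignment misses, which is the sense in which coverage of the structural core can rise.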
Coarse-grained coverage
• BIG/BIGGER: Homology detection
– Difficulty of aligning remote homologues
   → Shared sequence motifs suggest conserved biochemical mechanism
   → Functional classification
– Sequence comparison only detects half of remote homologue pairs
   → Structure comparison reveals missing links
[Figure: reliability (axis 30 to 100) against coverage (axis 0 to 0.6); curves: our method, profile-profile]
Transitive search for conserved motifs detects more remote homologues than profile-profile comparison
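A comparison of this kind rests on measuring coverage at a fixed error rate from a ranked list of predicted pairs. The slides do not define the two quantities; the sketch below assumes coverage is the fraction of known true pairs recovered and error rate is the fraction of accepted pairs that are false. The function name and input format are hypothetical.

```python
from typing import Iterable, Tuple


def coverage_at_error_rate(
    scored_pairs: Iterable[Tuple[float, bool]],
    n_true_total: int,
    max_error: float = 0.05,
) -> float:
    """Walk down the list ranked by score and return the largest coverage
    reached while the error rate among accepted pairs stays <= max_error.

    scored_pairs: (score, is_true_homologue) for each predicted pair.
    n_true_total: number of true pairs in the reference set.
    """
    best = 0.0
    true_hits = 0
    false_hits = 0
    for _, is_true in sorted(scored_pairs, key=lambda p: -p[0]):
        if is_true:
            true_hits += 1
        else:
            false_hits += 1
        error = false_hits / (true_hits + false_hits)
        if error <= max_error:
            best = max(best, true_hits / n_true_total)
    return best
```

Evaluating two methods against the same reference set and the same `max_error` gives directly comparable coverage figures.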
Clustering PFAM families
• Comparison of ADDA to PFAM-A resulted in extension but no discovery of completely new large families
• PFAM-A v.19: 7340 families, 2451 covered according to PFAM’s assignments, 1396 families in 205 clans
• Our method achieved 30 % coverage of clan relationships at 5 % error rate compared to 23 % coverage at 5 % error rate by profile-profile comparison
– 1083 unclassified PFAMs linked to 205 known clans
   • 1219 white PFAMs linked to known structure in 155 clusters
– 1256 PFAMs clustered in 470 predicted clans
   • 336 white PFAMs linked to known structure in 222 clusters
– 3610 PFAMs remained singletons
   • 2352 white PFAMs
2451 covered, ~1555 fold assignments, ~3334 targets
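The summary line can be reproduced from the figures listed above. The relations used here are one reading of the numbers (white families = total minus covered; fold assignments = white PFAMs linked to known structure in known plus predicted clans), not something the slides state explicitly; the variable names are hypothetical.

```python
# Figures quoted on this slide for PFAM-A v.19.
pfam_total = 7340
pfam_covered = 2451
white_linked_in_known_clans = 1219
white_linked_in_predicted_clans = 336

# Assumed relations (see lead-in): these are an interpretation of the slide.
white = pfam_total - pfam_covered
fold_assignments = white_linked_in_known_clans + white_linked_in_predicted_clans
remaining_targets = white - fold_assignments

print(white, fold_assignments, remaining_targets)
```

Under this reading the arithmetic matches the approximate values of 1555 fold assignments and 3334 remaining targets.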
Conclusions
• ADDA: ~3000 human target families
– ~40 % reduction in number of PFAM target families by fold assignment (based on sequence only)
• Coarse-grained coverage yields information out of reach to sequence comparison
– Need to improve measures of sequence similarity to infer homology
   • Sequence motif-based functional classification
– Need to increase the radius of convergence in template-based structure prediction
• Protein complexes → hypothesis-driven research
– Large conformational changes
– Multigene receptor-ligand pair discrimination involves rotations in docking orientation
Acknowledgements
• Andreas Heger, Oxford University
• Swapan Mallick, Ashwin Sivakumar, Chris Wilton, Institute of Biotechnology
• Funding: Academy of Finland, Sigrid Juselius Foundation, EU