domain database
• The CATH domain database and associated resources - DHS, Gene3D
• How do we determine domain boundaries?
• How do we identify fold groups and evolutionary superfamilies?
• What is the distribution of the CATH domain families in the PDB and in the genomes?
CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily
Orengo & Thornton 1994
~20,000 chains from Protein Databank (PDB)
~50,000 domains in CATH structure database
~40% of the entries in CATH are multidomain
Multidomain proteins
Domains are important evolutionary units
Analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain
Carboxypeptidase G2 (1cg2A)
Carboxypeptidase A (2ctc)
~30% of multidomains in CATH are discontinuous
Algorithms for Recognising Domain Boundaries
DETECTIVE (Swindells, 1995): each domain should have a recognisable hydrophobic core
DOMAK (Siddiqui & Barton, 1995): residues comprising a domain make more internal contacts than external ones
PUU (Holm & Sander, 1994): parser for protein folding units: maximal interaction within domains and minimal interaction between domains
Consensus is sought between the three methods - on average this occurs about 20% of the time
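The consensus step can be illustrated with a minimal sketch. It assumes each method reports its domain boundaries as residue numbers, and that "consensus" means the same number of boundaries, each within a fixed tolerance. The function name, the tolerance of 10 residues and the boundary positions are all hypothetical; the slides do not state how agreement is actually measured.

```python
from typing import List


def boundaries_agree(predictions: List[List[int]], tolerance: int = 10) -> bool:
    """Return True if every method predicts the same number of boundaries
    and each boundary lies within `tolerance` residues of the boundary
    predicted by the first method (assumed criterion, not from the slides)."""
    reference = sorted(predictions[0])
    for other in predictions[1:]:
        other = sorted(other)
        if len(other) != len(reference):
            return False
        if any(abs(a - b) > tolerance for a, b in zip(reference, other)):
            return False
    return True


# Hypothetical boundary positions (residue numbers) from the three methods.
detective = [120, 245]
domak = [118, 250]
puu = [125, 243]

print(boundaries_agree([detective, domak, puu]))  # True
```

Cases where the methods disagree would presumably need manual inspection; that step is not modelled here.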
[Figure: categories of relatives: close homologues, twilight zone, midnight zone, homologues/analogues. Percentage labels: 74%, 29%, 21%, 4%, 11%]
Algorithms for Recognising Homologues
Sequence Based Methods
close homologues - BLAST (Altschul et al.)
 - SSEARCH (Smith & Waterman)
remote homologues - SAM-T99 (Karplus et al.)
Structure Based Methods
close & remote homologues - CATHEDRAL (Harrison, Thornton, Orengo)
 - SSAP (Taylor & Orengo)
 - CORA (Orengo)
[Figure: categories of relatives with the methods used for each: close homologues - SSEARCH; twilight zone - HMMs, SSAP; midnight zone - CATHEDRAL, SSAP; homologues/analogues - CATHEDRAL, SSAP. Percentage labels: 74%, 29%, 21%, 4%, 11%]
Hidden Markov Models (HMMs)
[Diagram: query sequence scanned against the non-redundant GenBank database to give hits]
These methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST)
SAM-T99 (Karplus Group), SAMOSA (Orengo Group)
Percentage of PDB structures classified in CATH by different methods over the last 2 years
[Pie chart. Categories: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP), 8.6; analogues (SSAP), 1.9; novel folds. Remaining slice values: 59.2, 20.7, 7.6, 2.0]
Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years
[Pie chart. Categories: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP); analogues (SSAP); novel folds. Slice values: 22.0, 8.0, 22.0, 28.4, 7.7, 11.8]
Structure Based Algorithms for Recognising Homologues
CATHEDRAL: pairwise alignment - secondary structure comparison
SSAP: pairwise alignment - residue comparison
CORA: multiple alignment - residue comparison
[Figure: categories of relatives with the methods used for each: close homologues - SSEARCH; twilight zone - HMMs; midnight zone - CATHEDRAL, SSAP; homologues/analogues - CATHEDRAL, SSAP. Percentage labels: 74%, 29%, 21%, 4%, 11%]
Structure is much more highly conserved than sequence
[Figure: comparison with heat labile enterotoxin]
cholera toxin: structure similarity (SSAP) score 97, sequence identity 79%
pertussis toxin: structure similarity (SSAP) score 81, sequence identity 12%
Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families
[Scatter plot: structure similarity (SSAP) score against sequence identity (%); points marked as same function or different function]
• Residue insertions in the loops connecting secondary structures
• Shifts in the orientations of secondary structures
[Figure panels: yeast elongation factor complex; yeast guanylate kinase; helicase domain of bacteriophage T7; ATP phosphorylase]
Structural variation in the P-loop Hydrolase Superfamily
Structural variation in the Galectin Binding Superfamily
Fast Structure Comparison Method (CATHEDRAL)
ignore the variable loop regions and only compare the secondary structures
derive vectors through secondary structure elements
compare closest approach distances and vector orientations using graph theory
Andrew Harrison et al., JMB, 2002
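[Figure: two secondary structure vectors a and b separated by closest approach distance d]
$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}| \cos\theta$
+ dihedral angle
+ chirality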
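As a rough illustration of the geometry, the sketch below computes the angle between two secondary structure vectors from the dot product, and a closest approach distance. It makes a simplifying assumption: each vector is treated as an infinite line, whereas real secondary structure elements are finite segments. The function names and coordinates are hypothetical, and the dihedral angle and chirality terms are not computed.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(u):
    return math.sqrt(dot(u, u))


def angle_deg(a, b):
    """Angle between vectors a and b, from a . b = |a||b| cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def line_distance(p1, a, p2, b):
    """Closest approach between the infinite lines p1 + s*a and p2 + t*b
    (a simplification: finite segments would need end-point handling)."""
    n = cross(a, b)
    w = sub(p2, p1)
    if norm(n) < 1e-9:  # parallel lines: distance from p2 to line 1
        return norm(cross(w, a)) / norm(a)
    return abs(dot(w, n)) / norm(n)


# Hypothetical example: two perpendicular elements, 5 units apart.
a, b = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(angle_deg(a, b), line_distance((0, 0, 0), a, (0, 0, 5), b))
```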
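Compares graphs of proteins
[Figure: protein graph in which each node is a secondary structure (here helices, H) and each edge is labelled with d, the angle between the vectors, the dihedral angle and the chirality]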
CATHEDRAL: CATH's Existing Domain Recognition ALgorithm
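[Figure: two protein graphs and their overlap graph. Labels: nodes A, B, C with edges I, II, III; matched node pairs A,a B,c C,d with edges I, II, III; nodes a, b, c, d with edges I to V]
Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif
The overlap graph has a structural motif of 3 secondary structures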
MCSG Site Visit, Argonne, January 30, 2003
Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph
In this example the common graph contains 5 nodes.
1000 times faster than residue based methods (e.g. SSAP)
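A minimal sketch of the basic Bron-Kerbosch algorithm (without pivoting) is given below. It assumes the common-graph search has been set up in the usual way as a clique search on a correspondence graph, where each node pairs one secondary structure from each protein and an edge means the two pairings are geometrically compatible. The node names and edges are hypothetical, and this is not the CATHEDRAL implementation.

```python
def bron_kerbosch(r, p, x, adj, cliques):
    """Collect all maximal cliques. r: current clique, p: candidate
    nodes, x: nodes already explored, adj: node -> set of neighbours."""
    if not p and not x:
        cliques.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}


def largest_clique(adj):
    """Return one largest clique of the graph described by adj."""
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return max(cliques, key=len)


# Hypothetical correspondence graph: "Aa" means element A of protein 1
# is paired with element a of protein 2.
adj = {
    "Aa": {"Bc", "Cd"},
    "Bc": {"Aa", "Cd"},
    "Cd": {"Aa", "Bc", "Ab"},
    "Ab": {"Cd"},
}
print(sorted(largest_clique(adj)))  # ['Aa', 'Bc', 'Cd']
```

The largest clique here pairs three secondary structures, which corresponds to a common structural motif of size 3.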
Performance
$\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size}_{\text{protein 1}} \cdot \text{size}_{\text{protein 2}})^{1/2}}$
statistical significance can be assessed by scanning a protein ‘graph’ against ‘graphs’ of all known structures
$F = A\,e^{-b \cdot \text{score}}$
$\log F = \log A - b \cdot \text{score}$
scores for unrelated structures exhibit an extreme value distribution
allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
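The scoring and significance relations can be sketched numerically. The sketch assumes that F can be read as the fraction of unrelated structures reaching a given score, and that an E-value is that fraction multiplied by the number of structures scanned. The parameters A and b would come from a fit to scores for unrelated structures; the values used here are hypothetical, as are the function names.

```python
import math


def normalised_score(common_size, size1, size2):
    """Common graph size divided by the geometric mean of the two sizes."""
    return common_size / math.sqrt(size1 * size2)


def chance_frequency(score, A, b):
    """F = A * exp(-b * score), so log F = log A - b * score."""
    return A * math.exp(-b * score)


def e_value(score, A, b, library_size):
    """Expected number of chance matches in a scan (assumed definition)."""
    return library_size * min(1.0, chance_frequency(score, A, b))


# Hypothetical example: a common graph of 5 nodes between proteins with
# 8 and 10 secondary structures, scanned against 1000 library graphs.
score = normalised_score(5, 8, 10)
print(score, e_value(score, A=2.0, b=12.0, library_size=1000))
```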
Using CATHEDRAL to Identify Domain Boundaries
Graph based secondary structure comparison is very fast - 1000 times faster than residue based methods
New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches.
85-90% of domains in new multi-domain structures have relatives in CATH
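One way to picture the boundary assignment is as a selection of significant, non-overlapping matches along the new chain. The sketch below is a greedy selection under assumed rules: an E-value cutoff of 1e-3 and no overlap allowed between accepted matches. The Hit record, the cutoff and the example hits are all hypothetical, and discontinuous domains are not handled.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    family: str     # matched CATH domain family (label only)
    start: int      # first residue of the match in the new structure
    end: int        # last residue of the match in the new structure
    e_value: float


def assign_domains(hits: List[Hit], max_e_value: float = 1e-3) -> List[Hit]:
    """Accept hits from most to least significant, skipping any hit that
    overlaps an accepted one or fails the (assumed) E-value cutoff."""
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.e_value):
        if hit.e_value > max_e_value:
            break
        overlaps = any(hit.start <= c.end and c.start <= hit.end for c in chosen)
        if not overlaps:
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


# Hypothetical scan results for a two-domain chain.
hits = [
    Hit("Fold A", 1, 150, 1e-12),
    Hit("Fold B", 160, 310, 1e-8),
    Hit("Fold A", 140, 300, 1e-4),
    Hit("Fold C", 20, 90, 0.5),
]
for domain in assign_domains(hits):
    print(domain.family, domain.start, domain.end)
```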
[Figure: CATHEDRAL workflow. A multi-domain structure is matched to Fold A and Fold B by secondary structure match by graph, followed by SSAP residue alignment. Plot axes: residues in new multi-domain structure against residues in CATH domain family 1 and residues in CATH domain family 2]
SSAP (Taylor & Orengo, J. Mol. Biol. 1989)
Residue based structure comparison method using dynamic programming
Scores range from 0-100
[Figure: alignment matrix with residues in protein A on one axis and residues in protein B on the other]
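The dynamic programming step can be sketched as a standard global alignment over a residue-by-residue similarity matrix. This is a generic illustration only: the similarity values are taken as given and are hypothetical, the gap penalty is an assumption, and the way SSAP derives its residue scores and its final 0-100 score is not reproduced.

```python
def align_score(sim, gap=-1.0):
    """Best global alignment score for a similarity matrix where
    sim[i][j] scores residue i of protein A against residue j of
    protein B; each unaligned residue costs `gap`."""
    n = len(sim)
    m = len(sim[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + sim[i - 1][j - 1],  # align i with j
                best[i - 1][j] + gap,                    # gap in protein B
                best[i][j - 1] + gap,                    # gap in protein A
            )
    return best[n][m]


# Hypothetical scores: protein A has 2 residues, protein B has 3, and
# the middle residue of B has no good partner.
sim = [[10.0, 0.0, 0.0],
       [0.0, 0.0, 10.0]]
print(align_score(sim))  # 19.0
```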
CATHEDRAL
One third of known multi-domain structures are discontinuous
Reasons for Structural Similarity
• Divergence - similarity arises due to divergent evolution from a common ancestor - structure much more highly conserved than sequence
• Convergence - similarity due to there being a limited number of ways of packing helices and strands in 3D space
~1500 domain superfamilies in CATH
~50,000 domains in PDB
Domain structure database
CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily
Orengo & Thornton 1994
CATH domain database: ~50,000 domains
Class: 3
Architecture: ~36
Topology or Fold Group: ~810
Homologous Superfamily (Domain Family): ~1500
Sequence Family (35%, 60%, 95%)
40,000 domain entries
~50,000 domain entries
DHS: Dictionary of Homologous Superfamilies
http://www.biochem.ucl.ac.uk/bsm/dhs
Description of structural and functional characteristics for each superfamily
Variation in Secondary Structures Across Superfamily
Functional annotations from GO, EC, COGs, KEGG
DHS: Dictionary of Homologous Superfamilies
http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D
Multiple structure alignments with conserved residues highlighted
Population of CATH Families and Structural Groups
~50,000 structural domains
~4000 sequence families (35%) (S): cluster proteins with similar sequences
~1,500 homologous superfamilies (H): cluster proteins with similar structures and functions
~810 fold groups (T): cluster proteins with similar structures
~36 architectures (A)
3 major protein classes (C)
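Clustering domains into sequence families at a given identity threshold can be sketched as follows. The sketch assumes single-linkage clustering, meaning a domain joins a family if it reaches the threshold with any member; the slides give the 35% level but not the linkage rule. The domain names and pairwise identities are hypothetical.

```python
def cluster_by_identity(names, identity, threshold=35.0):
    """Single-linkage clustering (assumed rule) using union-find.
    identity maps (name_a, name_b) to percentage sequence identity."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for (a, b), pct in identity.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise identities between four domains.
names = ["d1", "d2", "d3", "d4"]
identity = {
    ("d1", "d2"): 62.0,
    ("d2", "d3"): 40.0,
    ("d1", "d3"): 30.0,
    ("d3", "d4"): 12.0,
}
print(cluster_by_identity(names, identity))  # [['d1', 'd2', 'd3'], ['d4']]
```

Note that d1 and d3 end up in the same family even though their own identity is below 35%, because both are linked through d2. That is a property of single linkage, not a statement about how CATH builds its families.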
Rossmann Fold
Jelly Roll
Alpha/Beta Plaits
Arc repressor-like
OB Fold
CATH
[Chart labels: Rossmann, Alpha-beta plait, TIM barrel, Jelly Roll, Immunoglobulin, OB fold, SH3-like, Up-down, Arc repressor-like]
Nearly one third of the superfamilies belong to <10 fold groups
CATH numbering scheme
2.40.50.100
Class: 2. Mainly beta
Architecture: 40. Barrel
Topology: 50. OB Fold
Homology: 100. Heat labile enterotoxin superfamily
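The numbering scheme lends itself to a small parsing sketch. It splits a four-part code into its Class, Architecture, Topology and Homology numbers and reports the deepest level two codes share. The type and function names are hypothetical, and the second code in the example is made up purely to illustrate the comparison.

```python
from typing import NamedTuple, Optional


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homology: int


def parse_cath_code(code: str) -> CathCode:
    """Parse a code such as '2.40.50.100' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name of the deepest level at which two codes still agree."""
    levels = ["Class", "Architecture", "Topology", "Homology"]
    deepest = None
    for name, x, y in zip(levels, a, b):
        if x != y:
            break
        deepest = name
    return deepest


enterotoxin = parse_cath_code("2.40.50.100")
other = parse_cath_code("2.40.50.999")  # made-up code for illustration
print(enterotoxin)
print(shared_level(enterotoxin, other))  # Topology
```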
CATH
http://www.biochem.ucl.ac.uk/bsm/cath
CATH domain structure database
CATH class level
CATH architecture level
CATH Topology or fold group level
CATH homologous superfamilies in each fold group
CATH homologous superfamily level
CATH sequence families (>=35% identity) in each superfamily
CATH classification information for individual domains
CATH structural relatives listed for each domain
CATH server
http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl
structural matches and statistics listed for query domain
Library of HMMs built for representative sequences from each CATH domain superfamily
Expanding CATH with sequence relatives from genomes
Protein sequences from genomes are scanned against the CATH HMM library to assign domains to CATH superfamilies
[Figure: a homologous superfamily (H) with sequence families S1-S3 gains sequence families S4 and S5 when sequences are added from GenBank, genomes and SWPT-TrEMBL using CATH-HMMs]
Expanding CATH
~1400 Domain Structure Superfamilies
~50,000 sequences, ~4,000 sequence families
~600,000 sequences, ~24,000 sequence families
Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies
Rossmann Fold
Jelly Roll
Alpha/Beta Plaits
TIM Barrel
Immunoglobulin-like
Arc repressor-like
OB Fold
Four helix bundle
SH3-type barrel
Alpha horseshoe fold
Gene3D
[Chart labels: Rossmann, Alpha-beta plait, TIM barrel, Jelly Roll, Arc repressor-like, Up-down, SH3-like, OB fold, Immunoglobulin, Alpha horseshoe]
Gene3D
http://www.biochem.ucl.ac.uk/bsm/Gene3D
CATH domain structure annotations for complete genomes
Individual genome statistics
Assignment of sequences to Gene3D protein families
Functional annotations for individual sequences
Domain annotations for individual sequences
Summary
CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB
These domain families contain over 600,000 domain sequences from the genomes and sequence databases
Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading
Acknowledgements
Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee
Janet Thornton
Medical Research Council, Wellcome Trust, NIH
Biotechnology and Biological Sciences Research Council
http://www.biochem.ucl.ac.uk/bsm/cath