domain database

• The CATH domain database and associated resources - DHS, Gene3D

• How do we determine domain boundaries?

• How do we identify fold groups and evolutionary superfamilies?

• What is the distribution of the CATH domain families in the PDB and in the genomes?

CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily

Orengo & Thornton 1994

~20,000 chains from Protein Databank (PDB)

~50,000 domains in CATH structure database

~40% of the entries in CATH are multidomain

Multidomain proteins

Domains are important evolutionary units

Analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

Carboxypeptidase G2 (1cg2A)

Carboxypeptidase A (2ctc)

~30% of multidomains in CATH are discontinuous

Algorithms for Recognising Domain Boundaries

DETECTIVE (Swindells, 1995)

each domain should have a recognisable hydrophobic core

DOMAK (Siddiqui & Barton, 1995)

residues comprising a domain make more internal contacts than external ones

PUU (Holm & Sander, 1994)

parser for protein folding units: maximal interaction within domains and minimal interaction between domains

Consensus is sought between the three methods; on average this occurs about 20% of the time
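
The DOMAK criterion (more internal contacts than external ones) can be illustrated with a small contact count. This is only a sketch and not the published algorithm: the function name `contact_counts`, the use of one coordinate per residue, and the 8.0 distance cutoff are all assumptions made for illustration.

```python
import math
from itertools import combinations


def contact_counts(coords, domain_of, cutoff=8.0):
    """Count residue pairs in contact within and between domains.

    coords    : list of (x, y, z) tuples, one per residue (assumed representation)
    domain_of : list of domain labels, same length as coords
    cutoff    : distance threshold for a contact (an illustrative assumption)
    Returns (internal, external) contact counts.
    """
    internal = 0
    external = 0
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            if domain_of[i] == domain_of[j]:
                internal += 1
            else:
                external += 1
    return internal, external


# Toy example: two residues per proposed domain, laid out along a line.
coords = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (6, 0, 0)]
internal, external = contact_counts(coords, ["A", "A", "B", "B"], cutoff=4.5)
print(internal, external)
```

A proposed split into domains would be favoured when the internal count clearly exceeds the external count.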

[Figure: chart with regions labelled close homologues, twilight zone, midnight zone, and homologues/analogues; values shown: 74%, 29%, 21%, 4%, 11%]

Algorithms for Recognising Homologues

Sequence Based Methods

close homologues - BLAST (Altschul et al.)

- SSEARCH (Smith & Waterman)

remote homologues - SAM-T99 (Karplus et al.)

Structure Based Methods

close & remote homologues - CATHEDRAL (Harrison, Thornton, Orengo)

- SSAP (Taylor & Orengo)

- CORA (Orengo)

[Figure: chart (74%, 29%, 21%, 4%, 11%) annotated with methods: SSEARCH; HMMs, SSAP; CATHEDRAL, SSAP; CATHEDRAL, SSAP]

Hidden Markov Models (HMMs)

[Figure: query sequence scanned against the non redundant GenBank database to collect hits]

these methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST)

SAM-T99 (Karplus Group), SAMOSA (Orengo Group)

Percentage of PDB structures classified in CATH by different methods over the last 2 years

[Figure: chart. Segments: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP), 8.6; analogues (SSAP), 1.9; novel folds. Other values shown: 59.2, 20.7, 7.6, 2.0]

Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years

[Figure: chart. Segments: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP); analogues (SSAP); novel folds. Values shown: 22.0, 8.0, 22.0, 28.4, 7.7, 11.8]

Structure Based Algorithms for Recognising Homologues

CATHEDRAL: pairwise alignment - secondary structure comparison

SSAP: pairwise alignment - residue comparison

CORA: multiple alignment - residue comparison

[Figure: chart (74%, 29%, 21%, 4%, 11%) annotated with methods: SSEARCH; HMMs; CATHEDRAL, SSAP; CATHEDRAL, SSAP]

structure is much more highly conserved than sequence

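[Figure: cholera toxin, pertussis toxin and heat labile enterotoxin compared by structure similarity (SSAP) score and sequence identity; values shown: SSAP score 97 with 79% sequence identity, SSAP score 81 with 12% sequence identity]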

Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families

[Figure: plot of structure similarity (SSAP) score against sequence identity (%), with pairs marked as same function or different function]

• Residue insertions in the loops connecting secondary structures

• Shifts in the orientations of secondary structures

Structural variation in the P-loop Hydrolase Superfamily

[Figure: yeast elongation factor complex; yeast guanylate kinase; helicase domain of bacteriophage T7; ATP phosphorylase]

Structural variation in the Galectin Binding Superfamily

Fast Structure Comparison Method (CATHEDRAL)

ignore the variable loop regions and only compare the secondary structures

derive vectors through secondary structure elements

compare closest approach distances and vector orientations using graph theory

Andrew Harrison et al., JMB, 2002

d: closest approach distance between secondary structure vectors a and b

$a \cdot b = |a|\,|b| \cos\theta$

+ dihedral angle

+ chirality
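
The dot product relation above gives the angle between two secondary structure vectors directly. The following is a minimal sketch; the function name `angle_between` and the tuple representation of vectors are assumptions for illustration, and the closest approach distance, dihedral angle and chirality terms are not covered.

```python
import math


def angle_between(a, b):
    """Angle in degrees between two 3D vectors, from a . b = |a||b| cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))


# Two perpendicular helix axes (toy vectors).
print(angle_between((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```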

Compares graphs of proteins

[Figure: protein graph in which each node is a secondary structure (H) and each edge is labelled with d, the two angles and chirality]

CATHEDRAL: CATH's Existing Domain Recognition ALgorithm

[Figure: two protein graphs, one with nodes A, B, C and edges I, II, III, the other with nodes a, b, c, d and edges I to V; the overlap graph pairs A,a B,c C,d]

Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif

overlap graph has a structural motif of 3 secondary structures

MCSG Site Visit, Argonne, January 30, 2003

Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph

In this example the common graph contains 5 nodes.

1000 times faster than residue based methods (e.g. SSAP)
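
For readers unfamiliar with Bron-Kerbosch, here is a minimal sketch of the basic (non-pivoting) form, which enumerates maximal cliques. It assumes the usual correspondence-graph formulation, in which each node is a pair of matched secondary structures and an edge joins two pairs whose geometry is compatible; the toy graph, node names and function names are illustrative and are not taken from the CATHEDRAL implementation.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Collect every maximal clique of the graph described by adj into out.

    r: current clique, p: candidate nodes, x: already-processed nodes.
    """
    if not p and not x:
        out.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


def largest_clique(adj):
    """Return the largest clique found (ties broken arbitrarily)."""
    out = []
    bron_kerbosch(set(), set(adj), set(), adj, out)
    return max(out, key=len)


# Toy correspondence graph: nodes are (structure in protein 1, structure in protein 2).
Aa, Bc, Cd, Bb = ("A", "a"), ("B", "c"), ("C", "d"), ("B", "b")
adj = {
    Aa: {Bc, Cd, Bb},
    Bc: {Aa, Cd},
    Cd: {Aa, Bc},
    Bb: {Aa},
}
print(sorted(largest_clique(adj)))
```

In this toy graph the largest clique has three nodes, i.e. a common motif of three secondary structures.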

Performance

$\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size}_{\text{protein1}} \cdot \text{size}_{\text{protein2}})^{1/2}}$
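
The normalisation above divides the common graph size by the geometric mean of the two protein graph sizes. A minimal sketch follows, assuming sizes are counted in secondary structure elements (nodes); the function name is hypothetical.

```python
import math


def normalised_score(common_size, size1, size2):
    """Common graph size divided by the geometric mean of the two graph sizes."""
    if size1 <= 0 or size2 <= 0:
        raise ValueError("graph sizes must be positive")
    return common_size / math.sqrt(size1 * size2)


# A 5-node common graph between proteins of 5 and 5 nodes scores 1.0;
# a 3-node common graph between proteins of 4 and 9 nodes scores 0.5.
print(normalised_score(5, 5, 5), normalised_score(3, 4, 9))
```

Dividing by the geometric mean keeps the score from being dominated by the larger of the two proteins.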

statistical significance can be assessed by scanning a protein 'graph' against 'graphs' of all known structures

$F = A\, e^{-b \cdot \text{score}}$

$\log F = \log A - b \cdot \text{score}$

scores for unrelated structures exhibit an extreme value distribution

allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
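
Because log F is linear in the score, A and b can be estimated with an ordinary least-squares line through (score, log F) points. The sketch below assumes F is the observed frequency of unrelated comparisons at a given score; the function names and the toy data are illustrative and not taken from the CATHEDRAL code.

```python
import math


def fit_log_linear(scores, freqs):
    """Fit log F = log A - b * score by least squares; return (A, b)."""
    n = len(scores)
    ys = [math.log(f) for f in freqs]
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((s - mean_x) ** 2 for s in scores)
    sxy = sum((s - mean_x) * (y - mean_y) for s, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def expected_frequency(score, a, b):
    """F = A * exp(-b * score) for the fitted parameters."""
    return a * math.exp(-b * score)


# Toy data generated from A = 2.0, b = 0.5.
scores = [1.0, 2.0, 3.0, 4.0]
freqs = [2.0 * math.exp(-0.5 * s) for s in scores]
a, b = fit_log_linear(scores, freqs)
print(round(a, 6), round(b, 6))
```

With fitted A and b, the expected frequency of any new score among unrelated structures follows from `expected_frequency`.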

Using CATHEDRAL to Identify Domain Boundaries

Graph based secondary structure comparison is very fast - 1000 times faster than residue based methods

New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches.

85-90% of domains in new multi-domain structures have relatives in CATH

[Figure: CATHEDRAL workflow. A multi-domain structure is matched to Fold A and Fold B by secondary structure graph comparison, followed by SSAP residue alignment. Axes: residues in new multi-domain structure against residues in CATH domain family 1 and in CATH domain family 2]

SSAP (Taylor & Orengo, J. Mol. Biol. 1989)

residue based structure comparison method using dynamic programming

Scores range from 0-100

[Figure: dynamic programming matrix, residues in protein A against residues in protein B]
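
The dynamic programming step can be illustrated with a plain global alignment over a residue-by-residue similarity matrix. This is a generic sketch only: SSAP's own scoring scheme is more elaborate, and the similarity values, gap penalty and function name here are assumptions.

```python
def align_score(sim, gap=-1.0):
    """Best global alignment score for a similarity matrix sim[i][j].

    sim[i][j] scores pairing residue i of protein A with residue j of protein B.
    gap is the (negative) score for leaving a residue unpaired.
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim[i - 1][j - 1],  # pair residue i with j
                dp[i - 1][j] + gap,                    # residue i unpaired
                dp[i][j - 1] + gap,                    # residue j unpaired
            )
    return dp[n][m]


# Three residues in protein A, two in protein B (toy similarity values).
sim = [[1.0, 0.0],
       [0.0, 1.0],
       [0.0, 0.0]]
print(align_score(sim))
```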

CATHEDRAL

One third of known multi-domain structures are discontinuous

Reasons for Structural Similarity

• Divergence - similarity arises due to divergent evolution from a common ancestor - structure much more highly conserved than sequence

• Convergence - similarity due to there being a limited number of ways of packing helices and strands in 3D space

~1500 domain superfamilies in CATH

~50,000 domains in PDB

Domain structure database

Class: 3

Architecture: ~36

Topology or Fold: ~810

CATH domain database: ~50,000 domains

Topology or Fold Group: ~810

Homologous Superfamily (Domain Family): ~1500

Sequence Family (35%, 60%, 95%)

40,000 domain entries

~50,000 domain entries

DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/dhs

Description of structural and functional characteristics for each superfamily

Variation in Secondary Structures Across Superfamily

Functional annotations from GO, EC, COGs, KEGG

http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D

Multiple structure alignments with conserved residues highlighted

Population of CATH Families and Structural Groups

~50,000 structural domains

cluster proteins with similar sequences

S: ~4000 sequence families (35%)

cluster proteins with similar structures and functions

H: ~1,500 homologous superfamilies

cluster proteins with similar structures

T: ~810 fold groups

A: ~36 architectures

C: 3 major protein classes
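
Grouping domains into sequence families at a 35% identity threshold can be sketched as single-linkage clustering. This is an illustration under stated assumptions, not the CATH pipeline: it takes pre-aligned, equal-length sequences, and the choice of single linkage and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over two pre-aligned, equal-length sequences ('-' is a gap)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def single_linkage(names, identity, threshold=35.0):
    """Merge any two items whose identity reaches the threshold (union-find)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy aligned sequences (made up for illustration).
seqs = {"d1": "ACDEFGHIKL", "d2": "ACDEFGHIKV", "d3": "WWWWWWWWWW"}
families = single_linkage(
    list(seqs), lambda a, b: percent_identity(seqs[a], seqs[b])
)
print(families)
```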

CATH

[Figure: fold groups labelled: Rossmann, Alpha-beta plait, TIM barrel, Jelly Roll, Immunoglobulin, OB fold, SH3-like, Up-down, Arc repressor-like]

nearly one third of the superfamilies belong to <10 fold groups

CATH numbering scheme

2.40.50.100

Class: 2. Mainly beta

Architecture: 40. Barrel

Topology: 50. OB Fold

Homology: 100. Heat labile enterotoxin superfamily
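
A four-part code such as 2.40.50.100 maps directly onto the Class, Architecture, Topology and Homology levels. A minimal parsing sketch follows; the `CathCode` type and `parse_cath_code` function are hypothetical names, not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """The four numeric levels of a CATH superfamily code."""
    class_: int
    architecture: int
    topology: int
    homology: int


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted code like '2.40.50.100' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


code = parse_cath_code("2.40.50.100")
print(code.class_, code.architecture, code.topology, code.homology)
```

Two domains share a fold group when their first three levels match, and a homologous superfamily when all four match.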

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH domain structure database


CATH class level


CATH architecture level


CATH Topology or fold group level


CATH homologous superfamilies in each fold group


CATH homologous superfamily level


CATH sequence families (>=35% identity) in each superfamily


CATH classification information for individual domains


CATH structural relatives listed for each domain

CATH server

http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl

structural matches and statistics listed for query domain

Library of HMMs built for representative sequences from each CATH domain superfamily

Expanding CATH with sequence relatives from genomes

[Figure: protein sequences from genomes are scanned against the CATH HMM library to assign domains to CATH superfamilies]

[Figure: a homologous superfamily (H) with sequence families S1 to S3 grows to sequence families S1 to S5 as sequences are added from GenBank, genomes and SWPT-TrEMBL using CATH-HMMs]

Expanding CATH

~1400 Domain Structure Superfamilies

~50,000 sequences, ~4,000 sequence families

~600,000 sequences, ~24,000 sequence families

Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies

Gene3D

[Figure: fold groups in legend: Rossmann Fold, Jelly Roll, Alpha/Beta Plaits, TIM Barrel, Immunoglobulin-like, Arc repressor-like, OB Fold, Four helix bundle, SH3-type barrel, Alpha horseshoe fold]

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

CATH domain structure annotations for complete genomes


Individual genome statistics


Assignment of sequences to Gene3D protein families


Functional annotations for individual sequences


Domain annotations for individual sequences

Summary

CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB

These domain families contain over 600,000 domain sequences from the genomes and sequence databases

Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading

Acknowledgements

Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee

Janet Thornton

Medical Research Council, Wellcome Trust, NIH

Biotechnology and Biological Sciences Research Council

http://www.biochem.ucl.ac.uk/bsm/cath
