domain database


Page 1: domain database

domain database

• The CATH domain database and associated resources - DHS, Gene3D

• How do we determine domain boundaries?

• How do we identify fold groups and evolutionary superfamilies?

• What is the distribution of the CATH domain families in the PDB and in the genomes?

CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily

Orengo & Thornton 1994

Page 2: domain database

~20,000 chains from Protein Databank (PDB)

~50,000 domains in CATH structure database

~40% of the entries in CATH are multidomain

Multidomain proteins

Page 3: domain database

Domains are important evolutionary units

Analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

Page 4: domain database

Carboxypeptidase G2 (1cg2A)

Carboxypeptidase A (2ctc)

~30% of multidomains in CATH are discontinuous

Page 5: domain database

Algorithms for Recognising Domain Boundaries

DETECTIVE (Swindells, 1995)

each domain should have a recognisable hydrophobic core

DOMAK (Siddiqui & Barton, 1995)

residues comprising a domain make more internal contacts than external ones

PUU (Holm & Sander, 1994)

parser for protein folding units: maximal interaction within domains and minimal interaction between domains

Consensus is sought between the three methods; on average this occurs about 20% of the time

Page 6: domain database

[Figure: categories of structural relatives, with the percentages 74%, 29%, 21%, 4% and 11% shown on the slide]

Close homologues

Twilight zone

Midnight zone

Homologues/analogues

Page 7: domain database

Algorithms for Recognising Homologues

Sequence Based Methods

close homologues - BLAST (Altschul et al.)

close homologues - SSEARCH (Smith & Waterman)

remote homologues - SAM-T99 (Karplus et al.)

Structure Based Methods

close & remote homologues - CATHEDRAL (Harrison, Thornton, Orengo)

close & remote homologues - SSAP (Taylor & Orengo)

close & remote homologues - CORA (Orengo)

Page 8: domain database

[Figure: categories of structural relatives, with the percentages 74%, 29%, 21%, 4% and 11% shown on the slide, now labelled with the methods used for each]

Close homologues: SSEARCH

Twilight zone: HMMs, SSAP

Midnight zone: CATHEDRAL, SSAP

Homologues/analogues: CATHEDRAL, SSAP

Page 9: domain database

Hidden Markov Models (HMMs)

query sequence

Non redundant GenBank database

hits

these methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST)

SAM-T99 (Karplus Group)

SAMOSA (Orengo Group)

Page 10: domain database

Percentage of PDB structures classified in CATH by different methods over the last 2 years

[Pie chart. Values shown: 59.2, 20.7, 7.6, 8.6, 1.9, 2.0]

Near-identical (SSEARCH)

Close homologues, >30% (SSEARCH)

Remote homologues, <30% (HMMs)

Remote homologues, 8.6 (SSAP)

Analogues, 1.9 (SSAP)

Novel folds

Page 11: domain database

Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years

[Pie chart. Values shown: 22.0, 8.0, 22.0, 28.4, 7.7, 11.8]

Near-identical (SSEARCH)

Close homologues, >30% (SSEARCH)

Remote homologues, <30% (HMMs)

Remote homologues (SSAP)

Analogues (SSAP)

Novel folds

Page 12: domain database

Structure Based Algorithms for Recognising Homologues

CATHEDRAL: pairwise alignment - secondary structure comparison

SSAP: pairwise alignment - residue comparison

CORA: multiple alignment - residue comparison

Page 13: domain database

[Figure: categories of structural relatives, with the percentages 74%, 29%, 21%, 4% and 11% shown on the slide, labelled with the methods used for each]

Close homologues: SSEARCH

Twilight zone: HMMs

Midnight zone: CATHEDRAL, SSAP

Homologues/analogues: CATHEDRAL, SSAP

Page 14: domain database

structure is much more highly conserved than sequence

[Figure: cholera toxin, pertussis toxin and heat labile enterotoxin, annotated with structure similarity (SSAP) scores of 97 and 81 and sequence identities of 79% and 12%]

Page 15: domain database

Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families

[Plot: structure similarity (SSAP) score against sequence identity (%), with pairs marked as same function or different function]

Page 16: domain database

• Residue insertions in the loops connecting secondary structures

• Shifts in the orientations of secondary structures

Page 17: domain database

Yeast Elongation factor complex Yeast Guanylate kinase

Helicase domain of bacteriophage t7 ATP phosphorylase

Structural variation in the P-loop Hydrolase Superfamily

Page 18: domain database

Structural variation in the Galectin Binding Superfamily

Page 19: domain database

Fast Structure Comparison Method (CATHEDRAL)

ignore the variable loop regions and only compare the secondary structures

derive vectors through secondary structure elements

compare closest approach distances and vector orientations using graph theory

Andrew Harrison et al., JMB, 2002

Page 20: domain database

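[Figure: two secondary structure vectors a and b separated by the closest approach distance d]

$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}| \cos\theta$

+ dihedral angle

+ chirality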
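The geometry on this slide can be illustrated with a small sketch. It is not CATHEDRAL's code: the function names are invented for illustration, each secondary structure is assumed to be given as a point plus a direction vector, and the closest approach is computed between infinite lines rather than finite segments to keep the example short.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def norm(u):
    return math.sqrt(dot(u, u))

def angle_between(a, b):
    """Angle in degrees, from a . b = |a||b| cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))

def line_distance(p1, a, p2, b):
    """Closest approach between the lines p1 + s*a and p2 + t*b."""
    n = cross(a, b)
    w = sub(p2, p1)
    if norm(n) < 1e-12:
        # Parallel lines: distance from p2 to the first line.
        return norm(cross(w, a)) / norm(a)
    return abs(dot(w, n)) / norm(n)
```

For two perpendicular vectors stacked 5 units apart, `angle_between((1, 0, 0), (0, 1, 0))` gives 90 degrees and `line_distance((0, 0, 0), (1, 0, 0), (0, 0, 5), (0, 1, 0))` gives 5.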

Page 21: domain database

Compares graphs of proteins

[Figure: graph with helices (H) as nodes; each edge is labelled with d, two angles and chirality]

CATHEDRAL: CATH's Existing Domain Recognition ALgorithm

Page 22: domain database

[Figure: graphs of two proteins (secondary structures A, B, C and a, b, c, d) and their overlap graph, with matched pairs A,a; B,c; C,d]

Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif

overlap graph has a structural motif of 3 secondary structures

Page 23: domain database

MCSG Site Visit, Argonne, January 30, 2003

Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph

In this example the common graph contains 5 nodes.

1000 times faster than residue based methods (e.g. SSAP)
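A minimal sketch of the idea, assuming each protein is reduced to a list of secondary structure names and a table of pairwise distances. The function names, the tolerance and the toy data are all hypothetical; the slides say the real method also compares angles and chirality. Pairs of secondary structures become nodes of a correspondence graph, two nodes are joined when their distances agree, and Bron-Kerbosch (with pivoting) finds the largest clique, which is the largest common motif.

```python
from itertools import combinations

def correspondence_graph(nodes1, d1, nodes2, d2, tol=0.5):
    """Nodes are (sse1, sse2) pairs; edges join pairs with compatible distances."""
    pairs = [(u, x) for u in nodes1 for x in nodes2]
    adj = {p: set() for p in pairs}
    for (u, x), (v, y) in combinations(pairs, 2):
        if u == v or x == y:
            continue  # one secondary structure cannot match two partners
        if abs(d1[frozenset((u, v))] - d2[frozenset((x, y))]) <= tol:
            adj[(u, x)].add((v, y))
            adj[(v, y)].add((u, x))
    return adj

def largest_clique(adj):
    """Bron-Kerbosch with pivoting; returns one maximum clique."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best

# Toy data (hypothetical distances in Angstroms).
d1 = {frozenset("AB"): 5, frozenset("BC"): 6, frozenset("AC"): 9}
d2 = {frozenset("ac"): 5, frozenset("cd"): 6, frozenset("ad"): 9,
      frozenset("ab"): 20, frozenset("bc"): 21, frozenset("bd"): 22}
graph = correspondence_graph("ABC", d1, "abcd", d2)
motif = sorted(largest_clique(graph))
```

With this toy data the common motif pairs A with a, B with c and C with d, i.e. a motif of 3 secondary structures.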

Page 24: domain database

Performance

Page 25: domain database

$\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size protein1} \cdot \text{size protein2})^{1/2}}$

statistical significance can be assessed by scanning a protein ‘graph’ against ‘graphs’ of all known structures
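As a worked illustration of the normalisation, the sketch below divides the common graph size by the geometric mean of the two protein graph sizes. The slide gives the score only up to proportionality, so the function name and the absence of any scaling constant are assumptions.

```python
import math

def normalised_score(common_size, size1, size2):
    """Common graph size divided by sqrt(size1 * size2)."""
    return common_size / math.sqrt(size1 * size2)
```

A common graph of 5 nodes between two proteins of 10 secondary structures each gives 5 / 10 = 0.5, while two identical 3-node graphs give 1.0.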

Page 27: domain database

$F = A e^{-b \cdot \text{score}}$

$\log F = \log A - b \cdot \text{score}$

scores for unrelated structures exhibit an extreme value distribution

allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
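The log-linear relation above can be fitted by ordinary least squares. The sketch below is an assumption-laden illustration, not the published procedure: F is taken to be the fraction of background (unrelated) scores at or above a threshold, the thresholds are chosen by hand, and the background scores are synthetic values constructed to decay as exp(-2 * score).

```python
import math

def fit_tail(scores, thresholds):
    """Fit log F = log A - b * score, F = fraction of scores >= threshold."""
    xs, ys = [], []
    n = len(scores)
    for t in thresholds:
        count = sum(1 for s in scores if s >= t)
        if count > 0:
            xs.append(t)
            ys.append(math.log(count / n))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    log_a = mean_y - slope * mean_x
    return math.exp(log_a), -slope  # A, b

def p_value(score, a, b):
    """Estimated chance of an unrelated pair scoring at least `score`."""
    return min(1.0, a * math.exp(-b * score))

def e_value(score, a, b, library_size):
    """Expected number of chance hits when scanning `library_size` structures."""
    return library_size * p_value(score, a, b)

# Synthetic background: deterministic samples with tail F(s) ~ exp(-2 s).
background = [-math.log(1 - (i + 0.5) / 1000) / 2.0 for i in range(1000)]
a, b = fit_tail(background, [0.5, 1.0, 1.5, 2.0])
```

On this synthetic background the fit recovers values of A close to 1 and b close to 2, which can then be passed to `p_value` and `e_value` for a new score.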

Page 28: domain database

Using CATHEDRAL to Identify Domain Boundaries

Graph based secondary structure comparison is very fast - 1000 times faster than residue based methods

New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches.

85-90% of domains in new multi-domain structures have relatives in CATH

Page 29: domain database

[Figure: a new multi-domain structure is matched by CATHEDRAL to Fold A and Fold B. The secondary structure match by graph is followed by SSAP residue alignment; the alignment plots show residues in the new multi-domain structure against residues in CATH domain family 1 and in CATH domain family 2]

Page 30: domain database

SSAP

Taylor & Orengo, J. Mol. Biol. 1989

residue based structure comparison method using dynamic programming

Scores range from 0-100

[Figure: alignment matrix of residues in protein A against residues in protein B]
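To show what the dynamic programming step looks like, here is a generic global alignment over a precomputed residue-by-residue score matrix. This is a simplified stand-in, not SSAP itself: SSAP derives its residue scores from structural environments using a second level of dynamic programming, and the gap penalty and toy matrix here are assumptions.

```python
def align(score, gap=-1.0):
    """Global alignment (Needleman-Wunsch style) over score[i][j]."""
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j] + gap,
                             best[i][j - 1] + gap)
    # Trace back from the corner to recover the aligned residue pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

# Toy matrix: residues 0, 1, 2 of protein A resemble residues 0, 2, 3 of protein B.
toy = [[5, 0, 0, 0],
       [0, 0, 5, 0],
       [0, 0, 0, 5]]
total, pairs = align(toy)
```

The path through the matrix skips residue 1 of protein B at the cost of one gap, giving the pairs (0, 0), (1, 2), (2, 3).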

Page 31: domain database

CATHEDRAL

One third of known multi-domain structures are discontinuous

Page 32: domain database

Reasons for Structural Similarity

• Divergence - similarity arises due to divergent evolution from a common ancestor - structure much more highly conserved than sequence

• Convergence - similarity due to there being a limited number of ways of packing helices and strands in 3D space

Page 35: domain database

~1500 domain superfamilies in CATH

~50,000 domains in PDB

Domain structure database

CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily

Orengo & Thornton 1994

Page 36: domain database

CATH domain database: ~50,000 domains

Class: 3

Architecture: ~36

Topology or Fold: ~810

Page 37: domain database

CATH

Topology or Fold Group: ~810

Homologous Superfamily (Domain Family): ~1500

Sequence Family (35%, 60%, 95%)

40,000 domain entries

~50,000 domain entries

Page 38: domain database

DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/dhs

Description of structural and functional characteristics for each superfamily

Page 40: domain database

DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/dhs

Variation in Secondary Structures Across Superfamily

Page 41: domain database

Functional annotations from GO, EC, COGs, KEGG

DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/dhs

Page 42: domain database

DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D

Multiple structure alignments with conserved residues highlighted

Page 43: domain database

Population of CATH Families and Structural Groups

~50,000 structural domains

cluster proteins with similar sequences

S: ~4000 sequence families (35%)

cluster proteins with similar structures and functions

H: ~1,500 homologous superfamilies

cluster proteins with similar structures

T: ~810 fold groups

A: ~36 architectures

C: 3 major protein classes

Page 44: domain database

[Chart titled CATH, with fold groups labelled: Rossmann fold, Alpha-beta plait, TIM barrel, Jelly Roll, Immunoglobulin, OB fold, SH3-like, Up-down, Arc repressor-like]

nearly one third of the superfamilies belong to <10 fold groups

Page 45: domain database

CATH numbering scheme

Class: 2. Mainly beta

Architecture: 40. Barrel

Topology: 50. OB Fold

Homology: 100. Heat labile enterotoxin superfamily

2.40.50.100
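The numbering scheme maps directly onto a four-field record. The sketch below parses a code such as 2.40.50.100 into its class, architecture, topology and homology numbers; the type and function names are invented for illustration, and the second and third codes in the usage note are hypothetical examples rather than entries taken from the slides.

```python
from typing import NamedTuple

class CathCode(NamedTuple):
    class_: int        # C: class, e.g. 2 = mainly beta
    architecture: int  # A: architecture, e.g. 40 = barrel
    topology: int      # T: topology or fold group, e.g. 50 = OB fold
    homology: int      # H: homologous superfamily

def parse_cath_code(code):
    """Split a dotted C.A.T.H code into its four integer levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))

def same_fold(a, b):
    """True when two codes share class, architecture and topology."""
    return a[:3] == b[:3]
```

For example, `parse_cath_code("2.40.50.100")` and `parse_cath_code("2.40.50.140")` share a fold group under `same_fold`, while `parse_cath_code("2.40.10.10")` does not.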

Page 46: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH domain structure database

Page 47: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH class level

Page 48: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH architecture level

Page 49: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH Topology or fold group level

Page 50: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH homologous superfamilies in each fold group

Page 51: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH homologous superfamily level

Page 52: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH sequence families (>=35% identity) in each superfamily

Page 53: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH classification information for individual domains

Page 54: domain database

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH structural relatives listed for each domain

Page 55: domain database

CATH server

http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl

Page 57: domain database

CATH server

http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl

structural matches and statistics listed for query domain

Page 58: domain database

Expanding CATH with sequence relatives from genomes

Library of HMMs built for representative sequences from each CATH domain superfamily

protein sequences from genomes -> scan against CATH HMM library -> assign domains to CATH superfamilies
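The final "assign domains" step can be sketched as follows, assuming the HMM scan has already produced a list of hits. Everything here is hypothetical: the hit format (superfamily code, start, end, E-value), the E-value cut-off and the greedy non-overlap rule are illustrative choices, not a description of how Gene3D resolves hits.

```python
def assign_domains(hits, max_evalue=1e-3):
    """Greedily keep the most significant, non-overlapping hits.

    Each hit is (superfamily, start, end, evalue), with residue positions
    counted along the query protein sequence.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h[3]):
        _, start, end, evalue = hit
        if evalue > max_evalue:
            break  # remaining hits are even less significant
        if all(end < s or start > e for _, s, e, _ in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h[1])

# Illustrative hits for one genome sequence (codes and values are made up).
hits = [
    ("3.40.50.300", 1, 150, 1e-30),
    ("2.40.50.100", 140, 250, 1e-5),
    ("2.40.50.100", 160, 260, 1e-10),
    ("1.10.10.10", 300, 350, 0.5),
]
domains = assign_domains(hits)
```

Here the sequence is assigned two domains: residues 1-150 and residues 160-260. The weaker overlapping hit and the insignificant hit are discarded.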

Page 59: domain database

Expanding CATH

~1400 Domain Structure Superfamilies

[Figure: a homologous superfamily (H) with sequence families S1-S3 expands to S1-S5 when sequences are added from GenBank, genomes, SWPT-TrEMBL using CATH-HMMs]

~50,000 sequences, ~4,000 sequence families

~600,000 sequences, ~24,000 sequence families

Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies

Page 60: domain database

[Chart titled Gene3D, with fold groups labelled: Rossmann fold, Alpha-beta plait, TIM barrel, Jelly Roll, Immunoglobulin-like, Arc repressor-like, OB fold, Four helix bundle, SH3-type barrel, Alpha horseshoe fold, Up-down]

Page 61: domain database

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

CATH domain structure annotations for complete genomes

Page 62: domain database

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

Individual genome statistics

Page 63: domain database

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

Assignment of sequences to Gene3D protein families

Page 64: domain database

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

Functional annotations for individual sequences

Page 66: domain database

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

Domain annotations for individual sequences

Page 68: domain database

Summary

CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB

These domain families contain over 600,000 domain sequences from the genomes and sequence databases

Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading

Page 69: domain database

Acknowledgements

Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee

Janet Thornton

Medical Research Council, Wellcome Trust, NIH

Biotechnology and Biological Sciences Research Council

http://www.biochem.ucl.ac.uk/bsm/cath