domain database


• The CATH domain database and associated resources - DHS, Gene3D

• How do we determine domain boundaries?

• How do we identify fold groups and evolutionary superfamilies?

• What is the distribution of the CATH domain families in the PDB and in the genomes?

C - Class
A - Architecture
T - Topology or Fold Group
H - Homologous Superfamily

Orengo & Thornton 1994

~20,000 chains from Protein Databank (PDB)

~50,000 domains in CATH structure database

~40% of the entries in CATH are multidomain

Multidomain proteins

Domains are important evolutionary units

Analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

Carboxypeptidase G2 (1cg2A)

Carboxypeptidase A (2ctc)

~30% of multidomains in CATH are discontinuous

Algorithms for Recognising Domain Boundaries

DETECTIVE (Swindells, 1995)

each domain should have a recognisable hydrophobic core

DOMAK (Siddiqui & Barton, 1995)

residues comprising a domain make more internal contacts than external ones

PUU (Holm & Sander, 1994)

parser for protein folding units: maximal interaction within domains and minimal interaction between domains

Consensus is sought between the three methods; on average this occurs about 20% of the time
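The consensus step above can be illustrated with a small sketch. It assumes, purely for illustration, that each method reports its prediction as a list of residue positions at which the chain is cut, and that two predictions "agree" when they contain the same number of cuts and matching cuts lie within a tolerance `tol`. The function names, the data layout and the default tolerance are hypothetical and are not taken from the slides; this representation also only covers continuous domains.

```python
from itertools import combinations
from typing import List, Sequence


def boundaries_agree(a: Sequence[int], b: Sequence[int], tol: int = 10) -> bool:
    """True if both predictions have the same number of cut points and
    each pair of corresponding cut points is within `tol` residues."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= tol for x, y in zip(sorted(a), sorted(b)))


def consensus(predictions: List[Sequence[int]], tol: int = 10) -> bool:
    """True if every pair of predictions agrees (hypothetical criterion)."""
    return all(boundaries_agree(p, q, tol) for p, q in combinations(predictions, 2))


# Three hypothetical predictions for a two-domain chain (one cut point each).
print(consensus([[120], [125], [118]]))   # all within 10 residues of each other
print(consensus([[120], [125], [200]]))   # third method disagrees
```

Chains for which `consensus` returns False would, under this sketch, be the ones left for further inspection.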

[Figure labels: 29%, 21%; close homologues, twilight zone, midnight zone, homologues/analogues]

Algorithms for Recognising Homologues

Sequence Based Methods

close homologues - BLAST (Altschul et al.)
                 - SSEARCH (Smith & Waterman)

remote homologues - SAM-T99 (Karplus et al.)

Structure Based Methods

close & remote homologues - CATHEDRAL (Harrison, Thornton, Orengo)
                          - SSAP (Taylor & Orengo)
                          - CORA (Orengo)

[Figure labels: 29%, 21%; SSEARCH; HMMs, SSAP; CATHEDRAL, SSAP]

Hidden Markov Models (HMMs)

query sequence

Non redundant GenBank database

these methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST)

SAM-T99 (Karplus Group), SAMOSA (Orengo Group)

Percentage of PDB structures classified in CATH by different methods over the last 2 years

[Pie chart; values shown: 59.2, 20.7, 8.6, 1.9]

near-identical: SSEARCH

close homologues (>30%): SSEARCH

remote homologues (<30%): HMMs

remote homologues (8.6), analogues (1.9): SSAP

novel folds

Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years

[Pie chart; values shown: 7.7, 11.8]

near-identical: SSEARCH

close homologues (>30%): SSEARCH

remote homologues (<30%): HMMs

remote homologues: SSAP

analogues: SSAP

novel folds

Structure Based Algorithms for Recognising Homologues

CATHEDRAL: pairwise alignment - secondary structure comparison

SSAP: pairwise alignment - residue comparison

CORA: multiple alignment - residue comparison

[Figure labels: 29%, 21%; SSEARCH; CATHEDRAL, SSAP]

structure is much more highly conserved than sequence

cholera toxin pertussis toxin

Heat labile enterotoxin

Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families

[Plot: structure similarity (SSAP) score against sequence identity (%); points marked as same function or different function]

• Residue insertions in the loops connecting secondary structures

• Shifts in the orientations of secondary structures

Yeast Elongation factor complex Yeast Guanylate kinase

Helicase domain of bacteriophage T7 ATP phosphorylase

Structural variation in the P-loop Hydrolase Superfamily

Structural variation in the Galectin Binding Superfamily

Fast Structure Comparison Method (CATHEDRAL)

ignore the variable loop regions and only compare the secondary structures

derive vectors through secondary structure elements

compare closest approach distances and vector orientations using graph theory

Andrew Harrison et al., JMB, 2002

$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}| \cos\theta$

+ dihedral angle

+ chirality

Compares graphs of proteins

[Figure: graph edges labelled with distance d, angle, dihedral angle and chirality]
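As a minimal sketch of the dot-product relation above, the angle between two secondary structure vectors can be recovered by rearranging a · b = |a||b| cos θ. The function name and the example vectors are illustrative assumptions, not part of CATHEDRAL itself; the dihedral angle and chirality terms are not computed here.

```python
import math
from typing import Sequence


def angle_between(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in degrees between two 3D vectors, from a . b = |a||b| cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))


# Hypothetical axis vectors for two secondary structure elements.
helix_axis = (1.0, 0.0, 0.0)
strand_axis = (0.0, 2.0, 0.0)
print(angle_between(helix_axis, strand_axis))
```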

CATHEDRAL: CATH's Existing Domain Recognition ALgorithm

Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif

overlap graph has a structural motif of 3 secondary structures

MCSG Site Visit, Argonne, January 30, 2003

Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph

In this example the common graph contains 5 nodes.

1000 times faster than residue based methods (e.g. SSAP)
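For readers unfamiliar with Bron-Kerbosch, here is a minimal sketch of the basic (non-pivoting) form of the algorithm, which enumerates maximal cliques in an undirected graph. How CATHEDRAL builds the graph it searches is not described on the slides, so the adjacency dictionary below is a made-up toy input and `largest_clique` is a hypothetical helper name.

```python
from typing import Dict, List, Set


def bron_kerbosch(r: Set[int], p: Set[int], x: Set[int],
                  adj: Dict[int, Set[int]], out: List[Set[int]]) -> None:
    """Collect every maximal clique of the graph `adj` into `out`."""
    if not p and not x:
        out.append(set(r))  # r cannot be extended, so it is maximal
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}   # v has been fully explored
        x = x | {v}   # exclude v from later cliques at this level


def largest_clique(adj: Dict[int, Set[int]]) -> Set[int]:
    """Return one maximum-size clique (empty set for an empty graph)."""
    cliques: List[Set[int]] = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return max(cliques, key=len) if cliques else set()


# Toy graph: nodes 0-4 are mutually connected, node 5 hangs off node 4.
example = {i: set() for i in range(6)}
for i in range(5):
    for j in range(i + 1, 5):
        example[i].add(j)
        example[j].add(i)
example[4].add(5)
example[5].add(4)
print(len(largest_clique(example)))
```

The worst-case cost of clique enumeration grows quickly with graph size, which is one reason to work with a small number of secondary structure nodes rather than individual residues.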

Performance

$$\text{Score} \sim \frac{\text{common graph size}}{(\text{size}_{\text{protein1}} \cdot \text{size}_{\text{protein2}})^{1/2}}$$

statistical significance can be assessed by scanning a protein ‘graph’ against ‘graphs’ of all known structures

$$F = A e^{-b \cdot \text{score}}$$

$$\log F = \log A - b \cdot \text{score}$$

scores for unrelated structures exhibit an extreme value distribution

allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
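The scoring and significance relations above can be sketched as follows. The sketch assumes that F is the observed fraction of unrelated comparisons reaching a given score, and that A and b are obtained by an ordinary least-squares fit of the log-linear form using natural logarithms. Both are assumptions made for illustration; the function names and the numbers in the example are invented.

```python
import math
from typing import Sequence, Tuple


def normalised_score(common: int, size1: int, size2: int) -> float:
    """Common graph size divided by the geometric mean of the two graph sizes."""
    return common / math.sqrt(size1 * size2)


def fit_log_linear(scores: Sequence[float], freqs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of ln F = ln A - b * score; returns (A, b)."""
    n = len(scores)
    ys = [math.log(f) for f in freqs]
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((s - mean_x) ** 2 for s in scores)
    sxy = sum((s - mean_x) * (y - mean_y) for s, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def chance_frequency(score: float, a: float, b: float) -> float:
    """F = A * exp(-b * score) for fitted parameters a and b."""
    return a * math.exp(-b * score)


# Invented background data generated from A = 2.0, b = 3.0.
background_scores = [0.1, 0.2, 0.3, 0.4]
background_freqs = [2.0 * math.exp(-3.0 * s) for s in background_scores]
a_fit, b_fit = fit_log_linear(background_scores, background_freqs)
print(chance_frequency(normalised_score(5, 8, 10), a_fit, b_fit))
```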

Using CATHEDRAL to Identify Domain Boundaries

Graph based secondary structure comparison is very fast - 1000 times faster than residue based methods

New multi-domain structures can be rapidly scanned against the library of CATH domains. E-values can be used to identify significant matches.

85-90% of domains in new multi-domain structures have relatives in

[Figure: multi-domain structure matched to Fold A and Fold B by CATHEDRAL; secondary structure match by graph, then SSAP residue alignment; axis label: residues in new multi-domain]

SSAP (Taylor & Orengo, J. Mol. Biol. 1989)

residue based structure comparison method using dynamic programming

Scores range from 0-100

[Figure labels: Protein A, Protein B; residues in protein A; CATHEDRAL]
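To illustrate what "dynamic programming" means in a residue alignment, here is a generic sketch that finds the best-scoring global path through a residue-by-residue similarity matrix with a linear gap penalty. It is not SSAP's own scoring scheme, which the slides do not describe; the similarity values, the gap penalty and the function name are all illustrative assumptions.

```python
from typing import List


def best_path_score(sim: List[List[float]], gap: float = 1.0) -> float:
    """Best global alignment score through a similarity matrix.

    sim[i][j] is the similarity of residue i in protein A to residue j
    in protein B; each skipped residue costs `gap`.
    """
    n, m = len(sim), len(sim[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -gap * i
    for j in range(1, m + 1):
        dp[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with j
                dp[i - 1][j] - gap,                    # skip a residue in A
                dp[i][j - 1] - gap,                    # skip a residue in B
            )
    return dp[n][m]


# Hypothetical similarities: A has 2 residues, B has 3.
toy_sim = [[2.0, 0.0, 0.0],
           [0.0, 0.0, 2.0]]
print(best_path_score(toy_sim))
```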

One third of known multi-domain structures are discontinuous

Reasons for Structural Similarity

• Divergence - similarity arises due to divergent evolution from a common ancestor - structure much more highly conserved than sequence

• Convergence - similarity due to there being a limited number of ways of packing helices and strands in 3D space

~1500 domain superfamilies in CATH

~50,000 domains in PDB

Domain structure database

C - Class
A - Architecture
T - Topology or Fold Group
H - Homologous Superfamily (Domain Family): ~1500
Sequence Family (35%, 60%, 95%)

CATH domain database: ~50,000 domains (Orengo & Thornton 1994)

40,000 domain entries

~50,000 domain entries


DHS: Dictionary of Homologous Superfamilies

http://www.biochem.ucl.ac.uk/bsm/dhs

Description of structural and functional characteristics for each superfamily


Variation in Secondary Structures Across Superfamily

Functional annotations from GO, EC, COGs, KEGG

http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D

Multiple structure alignments with conserved residues highlighted

Population of CATH Families and Structural Groups

~50,000 structural domains

cluster proteins with similar sequences

~4000 sequence families (35%)

cluster proteins with similar structures and functions

~1,500 homologous superfamilies

cluster proteins with similar structures

~810 fold groups

~36 architectures

3 major protein classes

Rossmann Fold

Jelly Roll

Alpha/Beta Plaits

Arc repressor-like

OB Fold

CATH

Rossmann

Alpha-beta plait

TIM barrel

Jelly Roll

Immunoglobulin

OB fold

SH3-like

Up-down

Arc repressor-like

nearly one third of the superfamilies belong to <10 fold groups

CATH numbering scheme

Class: 2. Mainly beta

Architecture: 40. Barrel

Topology: 50. OB Fold

Homology: 100. Heat labile enterotoxin superfamily

2.40.50.100
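The numbering scheme above maps naturally onto a four-field record. The sketch below splits a dotted code such as 2.40.50.100 into its class, architecture, topology and homology numbers; `CathCode`, `parse_cath_code` and `same_fold` are hypothetical names chosen for this example, and the second code in the usage lines is only an illustrative value.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int        # C: class, e.g. 2 = mainly beta
    architecture: int  # A: architecture, e.g. 40 = barrel
    topology: int      # T: topology or fold group, e.g. 50 = OB fold
    homology: int      # H: homologous superfamily


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted C.A.T.H code into its four integer levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def same_fold(a: CathCode, b: CathCode) -> bool:
    """True if two domains share class, architecture and topology."""
    return a[:3] == b[:3]


enterotoxin = parse_cath_code("2.40.50.100")
other = parse_cath_code("2.40.50.140")  # illustrative second code
print(enterotoxin.topology, same_fold(enterotoxin, other))
```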

CATH

http://www.biochem.ucl.ac.uk/bsm/cath

CATH domain structure database

CATH class level

CATH architecture level

CATH Topology or fold group level

CATH homologous superfamilies in each fold group

CATH homologous superfamily level

CATH sequence families (>=35% identity) in each superfamily

CATH classification information for individual domains

CATH structural relatives listed for each domain

CATH server

http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl

structural matches and statistics listed for query domain

Library of HMMs built for representative sequences from each CATH domain superfamily

Expanding CATH with sequence relatives from genomes

protein sequences from genomes

scan against CATH HMM library

assign domains to CATH superfamilies

Homologous Superfamily

sequences added from GenBank, genomes, SWPT-TrEMBL

CATH-HMMs

Sequence family

Expanding CATH

~1400 Domain Structure Superfamilies

~50,000 sequences, ~4,000 sequence families

~600,000 sequences, ~24,000 sequence families

Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies

Rossmann Fold

Jelly Roll

Alpha/Beta Plaits

TIM Barrel

Immunoglobulin-like

Arc repressor-like

OB Fold

Four helix bundle

SH3-type barrel

Alpha horseshoe fold

Gene3D

Rossmann

Alpha-beta plait TIM barrel

Jelly Roll

Arc repressor-like

Up-down

SH3-like

OB fold

Immunoglobulin

Alpha horseshoe

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

CATH domain structure annotations for complete genomes

Individual genome statistics

Assignment of sequences to Gene3D protein families

Functional annotations for individual sequences

Domain annotations for individual sequences

Summary

CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB

These domain families contain over 600,000 domain sequences from the genomes and sequence databases

Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading

Acknowledgements

Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee

Janet Thornton

Medical Research Council, Wellcome Trust, NIH

Biotechnology and Biological Sciences Research Council

http://www.biochem.ucl.ac.uk/bsm/cath
