Genome Analysis to Select Targets which Probe Fold and Function Space

MCSG Site Visit, Argonne, January 30, 2003


How many protein superfamilies and families can we identify in the proteomes?

How many structures are needed to cover a high fraction of prokaryotic and eukaryotic families?

Targeting Universal Recurrent Superfamilies (SCOP/CATH/Pfam) to optimise coverage of fold and function space

Russell Marsden, Alastair Grant, David Lee, Annabel Todd, Janet Thornton, Andrzej Joachim

Midwest Consortium

Protein Families in Complete Genomes with Structural/Functional Annotations

800,000 protein sequences from 120 completed genomes

14 eukaryotic genomes including human, mouse, rat, plant, fly, worm, fugu

92 bacterial genomes

14 archaeal genomes

Gene3D

Buchan, Thornton, Orengo, Genome Research (2002)

BLAST all the sequences from 120 completed genomes against each other and cluster into protein families

For each sequence identify CATH and Pfam domains

Clustering Sequences into Protein Superfamilies of Known Domain Composition

PFscape - Protein Family Landscape

SAM-T99 - sequence mapping of CATH & Pfam (Karplus et al., NAR, 2000)

TRIBE-MCL - Markov Clustering (Enright & Ouzounis, Genome Research, 2002)

Clustering ~800,000 genes from 120 complete genomes

PFscape

Gene Superfamily 1

Gene Superfamily 2

Gene Superfamily 3

Gene Superfamily 4

~50,000 gene superfamilies of 2 or more sequences, 150,000 singletons
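The Markov clustering step can be illustrated with a minimal sketch. This is not the TRIBE-MCL implementation: the function name `mcl`, the inflation value, the iteration count and the toy similarity graph are all assumptions made for illustration, and the edge weights simply stand in for BLAST-derived similarity scores.

```python
def mcl(edges, n, inflation=2.0, iterations=50, eps=1e-9):
    """Cluster n nodes from weighted undirected edges by Markov clustering."""
    # Similarity matrix with self-loops on the diagonal.
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0
    for i, j, w in edges:
        m[i][j] = w
        m[j][i] = w

    def normalise(mat):
        # Rescale every column to sum to 1 (column-stochastic matrix).
        for c in range(n):
            total = sum(mat[r][c] for r in range(n))
            if total > 0:
                for r in range(n):
                    mat[r][c] /= total
        return mat

    m = normalise(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
             for r in range(n)]
        # Inflation: raise entries to a power, prune tiny values, renormalise.
        m = [[(v ** inflation if v > eps else 0.0) for v in row] for row in m]
        m = normalise(m)

    # Each non-empty row is an attractor; its non-zero columns form a cluster.
    clusters = {frozenset(c for c in range(n) if m[r][c] > eps)
                for r in range(n) if any(v > eps for v in m[r])}
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy graph: two unconnected groups of three sequences.
toy_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
             (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0)]
toy_clusters = mcl(toy_edges, 6)
```

On this toy graph the two unconnected groups come back as two separate clusters. A dense list-of-lists matrix is only practical for small examples; clustering hundreds of thousands of sequences would need a sparse representation.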

Library of HMMs built for representative sequences from each CATH and Pfam domain superfamily

Mapping CATH and Pfam Domains onto Genome Sequences

protein sequences from genomes -> scan against CATH & Pfam SAM-T99 HMM library -> assign domains to CATH and Pfam superfamilies

Performance of Sequence Mapping Method

1D-HMM (SAM-T99)

[Figure: Coverage vs Error rate (OHPS). x-axis: Error rate (%), 0 to 0.1; y-axis: Coverage, (%) of homologues found, 0% to 100%; series: Sreps.v2.5_Sreps.v2.5 and Sreps.v2.4_Sreps.v2.5]

Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99

Library of 1D-HMM models detects ~80% of remote homologues
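A coverage versus error curve of this kind can be computed from a ranked list of hits. The following is a minimal sketch, assuming each hit carries a score and a flag saying whether the pair is a structurally validated homologue. The slides do not define the error rate, so here it is taken to be the fraction of accepted hits that are false. `coverage_vs_error` and the example hits are hypothetical.

```python
def coverage_vs_error(hits, n_true_pairs):
    """Return (error_rate, coverage) points for a ranked list of hits.

    hits: list of (score, is_true_homologue); a higher score is more confident.
    n_true_pairs: number of true homologous pairs that could be found.
    """
    curve = []
    true_found = 0
    false_found = 0
    # Accept hits from most to least confident and track both rates.
    for score, is_true in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_true:
            true_found += 1
        else:
            false_found += 1
        accepted = true_found + false_found
        # Assumed definitions: error = false / accepted, coverage = true / all true.
        curve.append((false_found / accepted, true_found / n_true_pairs))
    return curve


# Hypothetical hits: three true homologues and one false match.
example_hits = [(9.0, True), (8.0, True), (7.0, False), (6.0, True)]
example_curve = coverage_vs_error(example_hits, n_true_pairs=4)
```

Reading the curve at a chosen error rate gives the coverage that can be quoted for the method at that level of stringency.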

Use HMMs to annotate Gene Superfamilies with CATH and Pfam domains

[Diagram: 50,000 Gene Superfamilies, shown as Gene Superfamily 1 to 4 and labelled CATH, Pfam and NewFam; a second panel shows Gene Superfamily 1, 2 and 3]

Merge superfamilies with the same domain combinations

Gene3D: 50,000 -> 36,000 Superfamilies
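The merging step can be sketched as grouping gene superfamilies by their domain annotation. This sketch makes three assumptions: "same domain combination" means the same ordered tuple of domain identifiers, superfamilies without any annotated domain stay separate, and the name `merge_by_domain_combination` and all labels are placeholders rather than Gene3D identifiers.

```python
from collections import defaultdict


def merge_by_domain_combination(superfamilies):
    """Group gene superfamilies that share the same domain combination.

    superfamilies: dict mapping a superfamily name to a tuple of domain labels.
    Returns a dict mapping each grouping key to a sorted list of names.
    """
    merged = defaultdict(list)
    for name, domains in superfamilies.items():
        if domains:
            key = tuple(domains)
        else:
            # Assumption: unannotated superfamilies are never merged together.
            key = ("unannotated", name)
        merged[key].append(name)
    return {key: sorted(names) for key, names in merged.items()}


# Placeholder labels, not real CATH or Pfam identifiers.
example = {
    "GS1": ("CATH_A", "Pfam_B"),
    "GS2": ("CATH_A", "Pfam_B"),
    "GS3": ("CATH_C",),
    "GS4": (),
    "GS5": (),
}
merged_example = merge_by_domain_combination(example)
```

In the example, GS1 and GS2 collapse into one superfamily while the two unannotated entries remain distinct.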

Superfamilies Further Classified into Families

[Diagram: a Superfamily split into Families (35% ID)]

Multi-linkage clustering: relatives in each sequence family have 35% or more sequence identity

For good homology models one structure is needed for each family within a superfamily
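The family definition can be illustrated with a greedy sketch. It assumes that multi-linkage means a sequence joins a family only if it shares at least 35% identity with every existing member; the slides do not spell out the exact procedure. The function name, the `identity` callback and the example values are hypothetical.

```python
def multi_linkage_families(sequences, identity, threshold=35.0):
    """Greedily group sequences so all pairs in a family meet the threshold.

    identity: function (seq_a, seq_b) -> percent sequence identity.
    """
    families = []
    for seq in sequences:
        for family in families:
            # Join only if the threshold holds against every current member.
            if all(identity(seq, member) >= threshold for member in family):
                family.append(seq)
                break
        else:
            # No existing family accepts this sequence: start a new one.
            families.append([seq])
    return families


# Hypothetical pairwise identities (percent); unlisted pairs default to 10.
table = {
    frozenset(("a", "b")): 80,
    frozenset(("a", "c")): 40,
    frozenset(("b", "c")): 30,
    frozenset(("c", "d")): 50,
}


def lookup(x, y):
    return table.get(frozenset((x, y)), 10)


example_families = multi_linkage_families(["a", "b", "c", "d"], lookup)
```

Here sequence c matches a at 40% but b at only 30%, so it cannot join their family and starts a new one, which d then joins. The result of a greedy pass depends on input order, so this is an illustration of the linkage criterion and not a reproduction of the clustering used for Gene3D.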

[Figure: Percent of Family with No Structure (close PDB match). Bars for Unassigned, CATH and Pfam families (% s35), split into pdb and no pdb; y-axis: Percentage of Families, 0% to 100%]

CATH (60,360) + Pfam (53,907) + NewFam (56,973) = 171,240 Families

Number of domain superfamilies and families with no close structural homologue

CATH (1400) + Pfam (4100) + NewFam (46,384) = 51,844 Superfamilies

[Figure: Percentage of Sequence Families with and without Close Structural Homologues (>35% identity), for NewFam, CATH and Pfam; y-axis 0 to 100; legend: No close PDB homologue]

[Figure: Number of Superfamilies containing a given number of Non-identical Relatives, as a percentage of the total, against Number of Non-identical Relatives; panels for CATH, Pfam and NewFam. Fitted power-laws (with gradients): CATH (-0.4), Pfam (-1.0), NewFam (-1.9)]

Preferentially Target Largest Superfamilies
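The gradient of a power-law in a superfamily size distribution can be estimated with a straight-line fit in log-log space. This is a minimal sketch, assuming ordinary least squares on log10 values; the slides do not state which fitting method was used, and `power_law_gradient` and the synthetic data are hypothetical.

```python
import math


def power_law_gradient(sizes, counts):
    """Least-squares slope of log10(count) against log10(size)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data that follow count = 1000 * size ** -1.0 exactly.
sizes = [1, 10, 100, 1000]
counts = [1000, 100, 10, 1]
gradient = power_law_gradient(sizes, counts)
```

A steeper (more negative) gradient means that large superfamilies are rarer relative to small ones, and that most members of the set have few relatives.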

[Figure: CATH, Pfam, Unassigned Hlevels vs s100. x-axis: #Hlevel targets, 0 to 35,000; y-axis: % Total s100, 0 to 100]

~70% of Proteomes are contained in < 2500 Largest CATH + Pfam + NewFam Target Superfamilies
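The coverage figure comes from ranking superfamilies by size and accumulating their members. This is a minimal sketch of that calculation, assuming each sequence is counted in exactly one superfamily; `targets_needed` and the example sizes are hypothetical and do not reproduce the Gene3D data.

```python
def targets_needed(superfamily_sizes, fraction=0.70):
    """Number of largest superfamilies needed to cover a fraction of sequences."""
    total = sum(superfamily_sizes)
    covered = 0
    # Walk through the superfamilies from largest to smallest.
    for rank, size in enumerate(sorted(superfamily_sizes, reverse=True), start=1):
        covered += size
        if covered >= fraction * total:
            return rank
    return len(superfamily_sizes)


# Hypothetical superfamily sizes (number of sequences in each).
example_sizes = [5, 10, 50, 25, 10]
needed_70 = targets_needed(example_sizes)
needed_90 = targets_needed(example_sizes, fraction=0.90)
```

Because the size distribution is heavily skewed, a small number of the largest superfamilies accounts for most sequences, which is why the curve rises steeply at first and then flattens.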

Proteome Coverage by Superfamilies

[Figure: Percentage of Proteomes (number of non-identical proteins in 120 completed genomes) against Superfamilies Ordered by Size; y-axis 0 to 100]

[Figure: Proteome Coverage by Superfamilies. Percentage of Proteomes (120 completed genomes) against Superfamilies Ordered by Size, for CATH (superfamilies of known fold), Pfam and NewFam]

[Figure: CATH, Pfam, Unassigned Hlevel vs s100 Comparison. x-axis: #Hlevel targets, 0 to 10,000; y-axis: % Total s100, 0 to 45; series: cath, pfam, unassigned]

What Fraction of the Proteomes is covered by Bacterial Family Targets?

[Figure: Percentage of Proteomes (120 completed genomes), 0 to 100, against Number of Target Families, 0 to 200,000; series: prokaryotes, eukaryotes, eukaryotes plus prokaryotes]

~100,000 prokaryotic targets cover nearly 60% of proteomes

How many family targets cover a significant proportion of the eukaryotes and/or prokaryotes?

[Figure: Percentage of Kingdom Proteomes (120 completed genomes) against Number of Target Families; series: prokaryotes, eukaryotes, eukaryotes plus prokaryotes; marked values: 25,000, 30,000 and 45,000]

25,000 - 45,000 family targets cover 70% of proteomes (< 2500 largest superfamily targets)


Target Selection Strategy

the largest < 2500 superfamily targets give 70% of proteomes

this corresponds to 25,000 - 45,000 family targets

accurate homology models are not needed for all families

target families of biological interest or containing human homologues with disease association

target families from functionally diverse superfamilies to understand how changes in the structure can modify function

For example, Universal, Highly Recurrent Superfamilies are an interesting biological subset with diverse functions

[Figure: stacked bars of Percent of Assigned Domains (0% to 100%) per Organism: Ape, Aae, Afu, Bsu, Bbu, Cel, Cje, Cpn, Ctr, Eco, Hin, Hpy, Mth, Mja, Mtu, Mge, Mpn, Nme, Pab, Rpr, Sce, Ssp, Tma, Uur, Vch, Xfa. Legend: Unique to Genome, Present in One Kingdom, Present in Two Kingdoms, Present in Three Kingdoms]

Universal CATH Domain Superfamilies

30 representative eukaryotic and prokaryotic organisms

[Figure axis: Proportion of CATH domain annotations, 0 to 100]

~60-70% of CATH domain annotations within each organism are from < 200 CATH universal superfamilies common to all kingdoms of life, some of which are very extensively duplicated
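Identifying universal superfamilies amounts to a set intersection across kingdoms, followed by a count of how many annotations in each genome fall inside that set. This is a minimal sketch, assuming a superfamily counts as universal when it occurs in at least one genome of every kingdom in the input. The function names and the superfamily labels `sfA` to `sfD` are placeholders.

```python
def universal_superfamilies(domains_by_genome, kingdom_of):
    """Superfamilies observed in at least one genome of every kingdom."""
    seen = {}
    for genome, superfamilies in domains_by_genome.items():
        # Pool the superfamilies of all genomes belonging to the same kingdom.
        seen.setdefault(kingdom_of[genome], set()).update(superfamilies)
    return set.intersection(*seen.values()) if seen else set()


def universal_fraction(annotations, universal):
    """Fraction of one genome's domain annotations from universal superfamilies."""
    return sum(1 for sf in annotations if sf in universal) / len(annotations)


# Placeholder annotations: one list entry per annotated domain (repeats allowed).
domains = {
    "Eco": ["sfA", "sfA", "sfB"],
    "Mja": ["sfA", "sfC"],
    "Sce": ["sfA", "sfB", "sfD"],
}
kingdoms = {"Eco": "bacteria", "Mja": "archaea", "Sce": "eukaryota"}
universal = universal_superfamilies(domains, kingdoms)
eco_fraction = universal_fraction(domains["Eco"], universal)
```

Counting annotations, not distinct superfamilies, is what lets extensively duplicated superfamilies dominate the fraction.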

Domain Recurrences in the Genomes

[Figure: number of superfamilies (0 to 100) against occurrences (1 to 219) for E.coli, M.jannaschii and S.cerevisiae; values 730 and 570 are marked]

Highly Recurrent, Extensively Duplicated Superfamilies

[Figure: bar chart, 0 to 2000, over COG categories S R Y V Z W O U T N M D A J L B P Q K I H E F G C, grouped as Poorly charac., Cellular processes and signalling, Information stor. & proce., Metabolism]

Analysis in bacterial genomes showed that 56 Universal Superfamilies recurred in proportion to the genome size and accounted for 45% of the CATH domain annotations

COG functional annotation (25 Functional Categories)

E (Amino acid metabolism)

J (Translation and protein biosynthesis)

K (Transcription)

T (Signal Transduction)

56 Universal and Highly Recurrent Superfamilies

15,000 bacterial family targets

[Figure: number of homologues above sequence identity threshold (0 to 120) against protein no (1-130); series: 35% and 60%]

Relative with most neighbours for which homology model can be built or function assigned

For >95% confidence when inheriting functional properties, homologues should have at least 60% identity (Todd, Valencia, Rost)
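Choosing the relative with the most neighbours can be sketched as a simple count over pairwise identities. This sketch assumes 35% as the threshold for building a homology model and 60% for inheriting function, as on the slides. The function `best_target`, the `identity` callback and the example values are hypothetical.

```python
def best_target(relatives, identity, threshold=35.0):
    """Relative with the most neighbours at or above the identity threshold."""
    def neighbours(seq):
        # Count the other relatives that meet the threshold against seq.
        return sum(1 for other in relatives
                   if other != seq and identity(seq, other) >= threshold)
    return max(relatives, key=neighbours)


# Hypothetical pairwise identities (percent); unlisted pairs default to 20.
pair_identity = {
    frozenset(("p1", "p2")): 70,
    frozenset(("p1", "p3")): 40,
    frozenset(("p1", "p4")): 36,
    frozenset(("p2", "p3")): 65,
}


def ident(x, y):
    return pair_identity.get(frozenset((x, y)), 20)


family = ["p1", "p2", "p3", "p4"]
model_target = best_target(family, ident, threshold=35.0)
function_target = best_target(family, ident, threshold=60.0)
```

In this example the best target differs between the two thresholds, which illustrates why functionally diverse superfamilies may need more targets than modelling coverage alone would suggest.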

In Functionally Diverse Superfamilies Select More Targets

functional clusters identified by sequence conservation

annotations (GO, KEGG, Pfam, EC, COGs, SWISS-PROT) stored in Gene3D

[Diagram: a Superfamily divided into functional clusters S60_1, S60_2, S60_3, S60_4 and S60_5]

Representative Structures for Superfamilies will help identify Functional Families


Target Selection Strategy

Targeting the 2500 largest superfamilies will cover a significant proportion (70%) of the proteomes

For good homology models, between 25,000 and 45,000 family targets are needed

Preferentially select targets from medically important and/or structurally and functionally diverse superfamilies

For example, targeting Universal and Recurrent superfamilies which exhibit significant structural and functional divergence will help to improve function prediction methods
