Post on 02-Jan-2016

Bioinformatic Analysis of Protein Families

Daniil G. Naumoff

Laboratory of Bioinformatics, State Institute for Genetics and Selection of Industrial Microorganisms

Moscow, Russia

Gos NII Genetika

The International Nucleotide Sequence Database Collaboration (INSDC)

• GenBank at NCBI: http://www.ncbi.nlm.nih.gov/Genbank/

• EMBL Nucleotide Sequence Database: http://www.ebi.ac.uk/embl/

• DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/

Corresponding protein databases: GenPept, UniProtKB/TrEMBL, and DDBJ

Curated protein database Swiss-Prot: http://au.expasy.org/sprot/

Three-dimensional (3D) structures of proteins

PDB: http://www.pdb.org/pdb/home/home.do (database)

SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/ (classification)

http://www.ebi.ac.uk/embl/Services/DBStats/

http://www.genomesonline.org/gold_statistics.htm

http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100

Search for homologues

BLOSUM-62 matrix

http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
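To illustrate how a substitution matrix such as BLOSUM-62 is used during a homologue search, here is a minimal sketch that scores an ungapped pairwise alignment. Only a small excerpt of the matrix is included, and the two peptide sequences are made up for the example; the function names are illustrative and do not correspond to any BLAST code.

```python
# Small excerpt of BLOSUM-62 (the full matrix covers all 20 amino acids).
# Only the pairs needed for the toy example below are listed.
BLOSUM62_EXCERPT = {
    ("W", "W"): 11,
    ("L", "L"): 4,
    ("K", "K"): 5,
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("K", "R"): 2,
}


def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a substitution score; the matrix is symmetric."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


def ungapped_score(seq1, seq2, matrix=BLOSUM62_EXCERPT):
    """Sum the substitution scores over the columns of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


# Hypothetical peptides: one identity (W/W) and three conservative substitutions.
print(ungapped_score("WLDK", "WIER"))  # 11 + 2 + 2 + 2 = 17
```

Conservative substitutions (I/L, D/E, K/R) receive positive scores, which is why distant homologues can still produce significant hits even when few positions are strictly identical.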

Overprediction is annotation of sequences at a greater level of functional specificity than available evidence supports.

A Protein Family Analysis (http://zbio.net/bio/001/003.html)

- Select a protein
- Determine the domain structure of the selected protein
- Select a domain to be analyzed
- Has the protein domain family been annotated in a database?
- Update the family list or search for homologous domains
- Check each "atypical" sequence (it will probably be edited or removed)
- Preliminary division into subfamilies
- Multiple sequence alignment (consensus?)
- Phylogenetic analysis
- Phylogenetic tree visualization
- Subfamily structure
- Interfamily relationships (superfamilies, clans, etc.)
- 2D and 3D analysis (prediction)

Number of annotated protein domain families

| Database | Address | Date | Number of families |
|---|---|---|---|
| HOMSTRAD | http://www-cryst.bioc.cam.ac.uk/homstrad/ | 4 Jan 2010 | 1032 |
| CATH 3.3 | http://www.cathdb.info/ | 4 Jan 2010 | 10019 |
| SCOP 1.75 | http://scop.mrc-lmb.cam.ac.uk/scop/ | June 2009 | 3902 |
| COG | http://www.ncbi.nlm.nih.gov/COG/grace/uni.html | | 4872 |
| KOG | http://www.ncbi.nlm.nih.gov/COG/grace/shokog.cgi | | 4852 |
| Pfam 24.0 | http://pfam.janelia.org/ | October 2009 | 11912 |
| PUMA2 | http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi | | 8575 |
| InterPro 24.0 | http://www.ebi.ac.uk/interpro/ | 16 December 2009 | 11082 |
| ADDA | http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb | | 15333 |

51,778 domain families (+ 158,798 singletons) according to Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA. Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res. 2006, 34(3):1066-1080.

13,511 SMOG domains according to Sadreyev & Grishin (BMC Struct Biol, 2006)

ADDA - Automatic Domain Decomposition Algorithm (http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb/form_browse)

33,879 domain families (79,965 if redundant sequences were used) according to Heger A, Holm L. Exhaustive enumeration of protein domain families. J Mol Biol. 2003, 328(3):749-767.

Let’s use this protein as a query sequence for BLAST

BLAST results (Descriptions)

E-value < 0.01 or 0.001
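The E-value cutoff can be applied programmatically to a saved hit table. The sketch below assumes the hits were exported in the standard 12-column tab-separated BLAST format (subject identifier in column 2, E-value in column 11); the sample hit names and numbers are invented for illustration.

```python
def filter_hits(tabular_text, max_evalue=0.001):
    """Return (subject_id, evalue) pairs whose E-value is below the cutoff.

    Assumes 12-column tab-separated BLAST output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        if evalue < max_evalue:
            hits.append((subject, evalue))
    return hits


# Hypothetical hit table (not real BLAST output).
SAMPLE = "\n".join([
    "query1\thitA\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-150\t520",
    "query1\thitB\t32.5\t280\t180\t5\t10\t289\t12\t290\t2e-05\t48.1",
    "query1\thitC\t25.0\t60\t45\t0\t100\t159\t40\t99\t0.005\t30.0",
    "query1\thitD\t24.0\t50\t38\t0\t5\t54\t200\t249\t3.2\t22.3",
])

strict = [name for name, _ in filter_hits(SAMPLE, max_evalue=0.001)]
relaxed = [name for name, _ in filter_hits(SAMPLE, max_evalue=0.01)]
print(strict)   # ['hitA', 'hitB']
print(relaxed)  # ['hitA', 'hitB', 'hitC']
```

Hits that pass only the relaxed cutoff (here `hitC`) are the natural candidates for manual inspection of the pairwise alignment.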

BLAST results (Graphic overview)

Domain I Domain II Domain III

Domain structure of proteins of the GH27 family, according to Naumoff D.G. Phylogenetic analysis of α-galactosidases of the GH27 family. Molecular Biology (Engl Transl), 2004, 38(3):388-399. PDF: http://bioinform.genetika.ru/members/Naumoff/MB2004E.pdf

Observed domain architectures (N- to C-terminus):

– GH27N GH27C
– GH27N
– GH27N GH27C CBM13
– GH27N GH27C CBM6
– GH27N GH27C CBM6 CBM13
– GH27N CBM13 GH27C
– NEW1 GH27N CBM13 GH27C
– NEW1 GH27N GH27C
– NEW2 NEW1 GH27N GH27C
– GH27N GH27C NEW3 NEW2
– GH27N GH27C NEW3
– GH27N GH27C Dockerin
– GH27N GH27C CBM1 CE1

Legend:

– GH27N: N-terminal domain of GH27 family
– GH27C: C-terminal domain of GH27 family
– CE1: CE1 domain of carbohydrate esterases
– CBM1: Carbohydrate-binding module CBM1
– CBM6: Carbohydrate-binding module CBM6
– CBM13: Carbohydrate-binding module CBM13
– Dockerin: Dockerin I domain
– NEW1: Uncharacterized domain
– NEW2: Uncharacterized domain (NPCBM)
– NEW3: Uncharacterized domain
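One simple way to work with the GH27 domain architectures shown on this slide is to store each architecture as an ordered list of domain names. The sketch below, using the thirteen architectures from the slide, counts in how many architectures each domain occurs and lists the accessory (non-catalytic) domains. Treating GH27N and GH27C as the "core" is an assumption made for this example.

```python
from collections import Counter

# Domain architectures of GH27 proteins (N- to C-terminus), as listed on the slide.
ARCHITECTURES = [
    ["GH27N", "GH27C"],
    ["GH27N"],
    ["GH27N", "GH27C", "CBM13"],
    ["GH27N", "GH27C", "CBM6"],
    ["GH27N", "GH27C", "CBM6", "CBM13"],
    ["GH27N", "CBM13", "GH27C"],
    ["NEW1", "GH27N", "CBM13", "GH27C"],
    ["NEW1", "GH27N", "GH27C"],
    ["NEW2", "NEW1", "GH27N", "GH27C"],
    ["GH27N", "GH27C", "NEW3", "NEW2"],
    ["GH27N", "GH27C", "NEW3"],
    ["GH27N", "GH27C", "Dockerin"],
    ["GH27N", "GH27C", "CBM1", "CE1"],
]


def domain_frequencies(architectures):
    """Count in how many architectures each domain occurs (once per architecture)."""
    counts = Counter()
    for arch in architectures:
        counts.update(set(arch))
    return counts


def accessory_domains(architectures, core=("GH27N", "GH27C")):
    """Return the sorted names of all domains that are not part of the core."""
    return sorted({d for arch in architectures for d in arch} - set(core))


freq = domain_frequencies(ARCHITECTURES)
print(freq["GH27N"], freq["GH27C"], freq["CBM13"])  # 13 12 4
print(accessory_domains(ARCHITECTURES))
```

Keeping the domain order in each list matters: the architectures `GH27N GH27C CBM13` and `GH27N CBM13 GH27C` contain the same domains but differ in arrangement.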

Universal Protein Domain Databases

ADDA, InterPro, PUMA2, Pfam, KOG, COG, SCOP, CATH, HOMSTRAD

Databases of individual protein families (http://www.oxfordjournals.org/nar/database/subcat/3/10)

Sequence Based Classification of the Carbohydrate-Active Enzymes at the CAZy server (www.cazy.org/)

• Glycoside Hydrolases (including transglycosidases) => 118 GH families (14 clans)

• Glycosyltransferases => 92 GT families

• Polysaccharide Lyases => 21 PL families

• Carbohydrate Esterases => 16 CE families

• Carbohydrate-Binding Modules => 59 CBM families

Family GH72 of Glycoside Hydrolases (http://www.cazy.org/GH72.html)

Multiple Sequence Alignment:

– Automatic (ClustalW or ClustalX): >50% sequence identity; only one domain; no protein fragments

– Manual (BioEdit; take the BLAST pairwise sequence alignments into account!): <30% sequence identity; long insertions/deletions; facultative N-terminal part
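The identity thresholds above can be checked on any pair of already-aligned sequences. The following sketch computes percent identity over the columns where neither sequence has a gap, and maps the result to the two strategies named on the slide. Ignoring gapped columns is only one of several common definitions of identity, and the "intermediate" label for the 30-50% range is an assumption, since the slide does not say how to treat it.

```python
def percent_identity(aln1, aln2, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aln1, aln2):
        if a == gap or b == gap:
            continue  # gapped columns are not counted (an assumption)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def suggested_strategy(identity):
    """Map percent identity to the alignment strategy from the slide."""
    if identity > 50:
        return "automatic"      # ClustalW / ClustalX
    if identity < 30:
        return "manual"         # BioEdit, guided by BLAST pairwise alignments
    return "intermediate"       # not covered by the slide; hypothetical label


# Hypothetical aligned fragments.
identity = percent_identity("ACDE-G", "ACQEFG")
print(identity, suggested_strategy(identity))  # 80.0 automatic
```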

Local dissimilarities of very similar sequences:

– Local frameshift
– Exon-intron structure
– Stop codon

BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html)

Phylip (http://evolution.gs.washington.edu/phylip.html)

Maximum Parsimony (ProtPars)

Distance program (Neighbor-Joining)

An infile for the Phylip package programs
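As a rough sketch of what such an input file contains: the PHYLIP sequence format starts with a line giving the number of sequences and the alignment length, followed by one record per sequence in which the name occupies exactly the first 10 characters. The helper below writes the simple sequential layout; the sequence names are taken from the GH97 trees in this presentation, but the short aligned fragments are invented for illustration.

```python
def to_phylip(alignment):
    """Format (name, aligned_sequence) pairs as a sequential PHYLIP infile."""
    lengths = {len(seq) for _, seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    n_sites = lengths.pop()
    lines = [f" {len(alignment)} {n_sites}"]
    for name, seq in alignment:
        # Names are truncated or space-padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical aligned fragments; only the names come from the slides.
toy_alignment = [
    ("97A1_BACTH", "MKT-LLVA"),
    ("97B1_BACTH", "MKSALLVA"),
    ("97C1_PRERU", "MRT-LIVG"),
]
print(to_phylip(toy_alignment), end="")
```

Saving the returned text to a file named `infile` in the program's working directory lets the PHYLIP programs pick it up by default. Because names are cut to 10 characters, longer identifiers should be shortened beforehand so that they remain unique.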

Maximum Parsimony (protpars.exe) from the Phylip package

Phylogenetic tree visualization: TreeView program (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)

Tree styles: slanted cladogram, radial, rectangular cladogram, phylogram

Subfamily criteria (for glycosidases)

1. Pairwise sequence similarity (>30% of identity)

2. Order of sequence appearance during BLAST search (members of the same subfamily always appear at the top of BLAST results)

3. Monophyletic status
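The third criterion can be tested mechanically once a tree is available. The sketch below represents a rooted tree as nested tuples with sequence names as leaves and checks whether a proposed subfamily corresponds exactly to one clade. The toy topology is hypothetical (only the sequence names are borrowed from the GH97 analysis), and the check assumes a rooted tree, whereas trees from maximum parsimony or neighbor-joining programs are typically unrooted and need to be rooted first.

```python
def leaf_set(node):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset()
    for child in node:
        leaves |= leaf_set(child)
    return leaves


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the given leaves."""
    target = frozenset(group)

    def search(node):
        if leaf_set(node) == target:
            return True
        if isinstance(node, str):
            return False
        return any(search(child) for child in node)

    return search(tree)


# Hypothetical rooted topology used only to demonstrate the check.
toy_tree = (
    ("97A1_BACTH", "97A1_BACFR"),
    ("97B1_BACTH", ("97B1_PRERU", "97B2_PRERU")),
)

print(is_monophyletic(toy_tree, ["97B1_BACTH", "97B1_PRERU", "97B2_PRERU"]))  # True
print(is_monophyletic(toy_tree, ["97A1_BACTH", "97B1_BACTH"]))                # False
```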

The maximum parsimony phylogenetic tree of family GH97

[Figure: maximum parsimony tree of the GH97 sequences with bootstrap support values at the nodes. The sequences form five subfamilies: 97a, 97b, 97c, 97d, and 97e. α-Glucosidase activity (EC 3.2.1.20) is marked on the tree.]

The neighbor-joining phylogenetic tree of family GH97

[Figure: neighbor-joining tree of the GH97 sequences with bootstrap support values at the nodes. EC 3.2.1.20 is marked on the tree. Sequences by subfamily, in tree order:]

– Subfamily 97e: 97E1_RHOBA, 97E1_BACTH
– Subfamily 97c: 97C1_LEIXY, 97C1_PRERU, 97C2_BACTH, 97C1_MICDE, 97C2_MICDE, 97C1_BACTH, 97C2_PRERU, 97C3_PRERU
– Subfamily 97d: 97D1_CAUCR, 97D1_XANCA, 97D1_XANAX
– Subfamily 97b: 97B1_MICDE, 97B1_BACTH, 97B4_BACTH, 97B1_PRERU, 97B2_PRERU, 97B1_BACFR, 97B3_BACTH, 97B2_BACFR, 97B2_BACTH
– Subfamily 97a: 97A1_HALMA, 97A1_PRERU, 97A1_PREIN, 97A1_TANFO, 97A1_BACTH, 97A1_BACFR, 97A1_UNBAC, 97A2_BACTH, 97A1_SALRU, 97A2_BACFR, 97A3_BACTH, 97A1_AZOVI, 97A8_ENSEQ, 97A5_ENSEQ, 97A4_ENSEQ, 97A3_ENSEQ, 97A7_ENSEQ, 97A6_ENSEQ, 97A1_ERYLI, 97A1_NOVAR, 97A1_XANCA, 97A1_XANAX, 97A1_MICDE, 97A1_SHEON, 97A2_ENSEQ, 97A1_ENSEQ

The neighbor-joining phylogenetic tree of the α-galactosidase superfamily

[Figure: neighbor-joining tree with bootstrap support values at the nodes. Labelled groups: GH31, GH27, GH36A, GH36B, GH36C, and GH36D.]

Families of the α-galactosidase superfamily and family GH97

[Table comparing families GH27, GH31, GH36A, GH36B, GH36C, GH36D, and GH97 by clan, COG/KOG, known enzymatic activities, molecular mechanism, and origin.]

– Clan: GH-D for GH27, GH31, GH36A, GH36B, GH36C, and GH36D; none for GH97
– COG/KOG entries listed: COG1501, COG3345, KOG1065, KOG2366
– Known enzymatic activities listed: EC 2.4.1.x, EC 2.4.1.67, EC 2.4.1.82, EC 3.2.1.10, EC 3.2.1.20, EC 3.2.1.22, EC 3.2.1.48, EC 3.2.1.49, EC 3.2.1.84, EC 3.2.1.88, EC 3.2.1.94, EC 4.2.2.13
– Molecular mechanisms listed: retaining; not known; retaining and inverting
– Origin, Eukaryota: Alveolata, Entamoebidae, Euglenozoa, Fungi, Metazoa, Mycetozoa, Rhodophyta, Viridiplantae
– Origin, Eubacteria: Acidobacteria, Actinobacteria, Bacteroidetes, Cyanobacteria, Deinococcus, Fibrobacteres, Firmicutes, Planctomycetes, Proteobacteria, Spirochaetes, Thermotogales, Thermus, Verrucomicrobia
– Origin, Archaea: Crenarchaeota, Euryarchaeota

Clans of Glycoside Hydrolases

| Clan | Families (GH) | Optical Configuration | Tertiary Structure |
|---|---|---|---|
| GH-A | 1, 2, 5, 10, 17, 26, 30, 35, 39, 42, 50, 51, 53, 59, 72, 79, 86, 113 | retention (equatorial orientation) | (β/α)8-barrel |
| GH-B | 7, 16 | retention (equatorial orientation) | β-jelly roll |
| GH-C | 11, 12 | retention (equatorial orientation) | β-jelly roll |
| GH-D | 27, 31, 36 | retention (axial orientation) | (β/α)8-barrel |
| GH-E | 33, 34, 83, 93 | retention (equatorial orientation) | 6-fold β-propeller |
| GH-F | 43, 62 | inversion (equatorial orientation) | 5-fold β-propeller |
| GH-G | 37, 63 | inversion (axial orientation) | (α/α)6 |
| GH-H | 13, 70, 77 | retention (axial orientation) | (β/α)8-barrel |
| GH-I | 24, 46, 80 | inversion (equatorial orientation) | α+β |
| GH-J | 32, 68 | retention (β-furanoside) | 5-fold β-propeller |
| GH-K | 18, 20, 85 | retention (equatorial orientation) | (β/α)8-barrel |
| GH-L | 15, 65 | inversion (axial orientation) | (α/α)6 |
| GH-M | 8, 48 | inversion (equatorial orientation) | (α/α)6 |
| GH-N | 28, 49 | inversion (axial orientation) | (β)3-solenoid |

Rigden DJ. Iterative database searches demonstrate that glycoside hydrolase families 27, 31, 36, and 66 share a common evolutionary origin with family 13. FEBS Lett. 2002, 523(1-3):17-22. (clans: GH-D, GH-H)

Nagano N, Porter CT, Thornton JM. The (β/α)8 glycosidases: sequence and structure analyses suggest distant evolutionary relationships. Protein Eng. 2001, 14(11):845-855. (clans: GH-H, GH-A, GH-K ?)

Screenshot of PSI Protein Classifier

D.G. Naumoff and M. Carreras. 2009. PSI Protein Classifier: a new program automating PSI-BLAST search results. Molecular Biology (Engl Transl). V.43. N.4. P.652-664.

A hierarchical classification of the (β/α)8-type glycosyl hydrolases

A hierarchical structure of the β-fructosidase (furanosidase) superfamily

furanosidase superfamily

– clan GH-J
    • GH32: GH32a, GH32b, GH32c, GH32d
    • GH68: GH68a, GH68b
– clan GH-F
    • GH43: GH43a, GH43b, GH43c, GH43d, GH43e, GH43f, GH43g
    • GH62
– GHLP

The Secondary Structure Prediction

– 3D-PSSM (http://www.sbg.bio.ic.ac.uk/~3dpssm/)
– GOR IV (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html)
– nnpredict (http://www.cmpharm.ucsf.edu/~nomi/nnpredict-instrucs.html)
– PredictProtein (http://www.embl-heidelberg.de/predictprotein/predictprotein.html)
– Hydrophobic cluster analysis (HCA)

The Tertiary Structure Prediction

– The SWISS-MODEL modeling server (http://swissmodel.expasy.org/)

Phylogenetic Analysis of a Protein Family

– The first stage of a study:
    • Prediction of the 3D structure and domain structure of the protein
    • Prediction of the active center and of residues for site-directed mutagenesis
    • Prediction of the enzymatic activities
– The only part of a study (bioinformatics)
– The final stage of a study (interpretation of the experimental results)

Comparing the phylogenetic trees of each domain of a given protein reveals the protein's evolutionary history, viz. the roles of gene duplication, loss, fusion, and horizontal transfer.