
Post on 31-Mar-2019


Alberto M. R. Dávila

Email: davila@fiocruz.br

Laboratório de Biologia Computacional e Sistemas

Instituto Oswaldo Cruz – FIOCRUZ, Rio de Janeiro, Brasil.

ProtozoaDB: comparative genomics approaches for the study of protozoan species

Pathogenic protozoan species are commonly studied because they are a major cause of mortality and morbidity in humans and domestic animals throughout the world.

We have developed an automatic methodology to reconstruct the phylogenetic species tree of Protozoa, integrating different phylogenetic algorithms and programs.

We demonstrate the utility of a supermatrix approach for constructing phylogenomic trees using 31 universal representative orthologs.

These universal orthologs are involved in metabolic pathways and may therefore provide extra information for understanding the phylogenetic relationships and evolutionary processes of Protozoa.

Background

Parasitic Protozoa are a major cause of disease worldwide, but because these illnesses are concentrated in low socio-economic parts of the world, they receive relatively little attention from the pharmaceutical industry and major scientific funding agencies.

Of the ten diseases targeted as research priorities by the World Health Organization’s Special Program for Research and Training in Tropical Diseases (http://www.who.int/tdr), four are caused by protozoan parasites (malaria, leishmaniasis, Chagas disease, African trypanosomiasis).

These diseases and other less dangerous ones (e.g. amoebiasis and trichomoniasis) are showing an alarming increase in cases refractory to the main treatment. Treatment failure potentially has a multifactorial origin, with drug resistance being one of the major concerns.

Phylogenomics is being used intensively in the “Post Genomic Era”. Many different phylogenetic trees have been published on the basis of different models of sequence evolution, different parameter settings and different algorithms.

Introduction

Trypanosoma cruzi

Trypanosoma brucei

Leishmania major

Plasmodium falciparum

Although rRNA and other single genes have been extremely valuable for phylogenetic studies, they have their limitations.

Chaudhary (2005) and Keeling (2005) showed results on available sequencing data.

Phylogenetic analysis defined five supergroups of eukaryotes:

(i) the plant and red/green algal lineage;

(ii) a clade comprised of animals, fungi, slime molds and amebozoans;

and three groups that are entirely protozoan:

(iii) chromalveolates,

(iv) excavates and

(v) rhizaria.

Introduction

Universal Orthologs:

Ciccarelli et al. (2006): UO for tree of life

Serra et al. (2008): OrthoSearch

Science. 2006 Mar 3;311(5765):1283-7.

Selection and preparation of marker gene families

We based our methodology for the selection of marker genes and the construction of the species tree on the study by Ciccarelli et al. (2006).

Thirty-one universal orthologous (UO) genes were downloaded in FASTA format from GenBank and then aligned using MAFFT (version 5.861).

Selection and preparation of marker gene families for protozoan species trees

Hidden Markov Model (HMM) profiles were constructed for the 31 UO (local database) set and queried against all Protozoa sequences available in RefSeq (RefSeq-release35.catalog-01/11/2009) and GenBank (NCBI-Flat File Release 172.0-06/15/2009).

HMM profiles were created (hmmbuild) and calibrated (hmmcalibrate) using HMMER version 2.3.2, and searches were done using hmmpfam with an e-value of “1e-5” as the cut-off.

The best hmmpfam hits of each of the 31 UO against protozoan genomes were used to construct individual multiple alignments (MAFFT v5.861).
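The best-hit selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hits are taken to be already parsed into records, the `Hit` type, the `best_hits` function and the toy identifiers and E-values are hypothetical, and treating the 1e-5 cut-off as inclusive is a guess, since the slides do not say.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    uo: str        # name of the universal ortholog (HMM profile)
    species: str   # species the matched sequence belongs to
    seq_id: str    # identifier of the matched sequence
    evalue: float  # E-value reported for the match


def best_hits(hits: Iterable[Hit], cutoff: float = 1e-5) -> Dict[Tuple[str, str], Hit]:
    """Keep, per (UO, species) pair, the lowest-E-value hit at or below the cutoff."""
    best: Dict[Tuple[str, str], Hit] = {}
    for hit in hits:
        if hit.evalue > cutoff:
            continue  # weaker than the cut-off: ignore
        key = (hit.uo, hit.species)
        if key not in best or hit.evalue < best[key].evalue:
            best[key] = hit
    return best


# Toy records with made-up identifiers and E-values (assumptions, not real data).
example = [
    Hit("COG0092", "T. cruzi", "seqA", 1e-40),
    Hit("COG0092", "T. cruzi", "seqB", 1e-12),
    Hit("COG0092", "L. major", "seqC", 1e-3),  # fails the 1e-5 cut-off
]
selected = best_hits(example)
```

In this toy run only `seqA` survives for T. cruzi, and L. major contributes no sequence for this UO. Species with many such gaps are the ones that end up with low UO coverage.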

Methods

Methodology 1 (M1):

The individual alignments were concatenated, resulting in a global supermatrix of 21,260 positions across a total of 74 species.

This supermatrix was used to generate a tree with PhyML using 100 bootstrap replicates and the JTT matrix.
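A minimal sketch of the concatenation step is given below, assuming each per-gene alignment is held as a mapping from species name to aligned sequence. The slides do not say how species lacking a given UO were handled; padding with gap characters is an assumption made here, and `build_supermatrix` is a hypothetical helper, not part of any tool named in the slides.

```python
from typing import Dict, List

Alignment = Dict[str, str]  # species name -> aligned sequence


def build_supermatrix(alignments: List[Alignment], missing: str = "-") -> Dict[str, str]:
    """Concatenate per-gene alignments, padding absent genes with `missing`."""
    species = sorted({name for aln in alignments for name in aln})
    rows: Dict[str, List[str]] = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must share a length")
        width = lengths.pop()
        for name in species:
            # Assumption: a species without this gene gets a block of gaps.
            rows[name].append(aln.get(name, missing * width))
    return {name: "".join(parts) for name, parts in rows.items()}


# Toy alignments (invented sequences) for two genes and three species.
gene1 = {"sp1": "MKV-L", "sp2": "MKIAL"}
gene2 = {"sp1": "GG", "sp3": "GA"}
matrix = build_supermatrix([gene1, gene2])
```

Every row of the result has the same length (the sum of the gene alignment widths), which is what a tree-building program expects from a single concatenated alignment.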

Methodology 2 (M2):

The individual alignments were trimmed using TrimAl v1.2 to eliminate their most variable regions. The remaining conserved blocks were then concatenated, resulting in a global supermatrix of 12,807 positions across a total of 74 species.

This supermatrix was used to generate a tree with PhyML using 100 bootstrap replicates and the JTT matrix.
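To make the idea of masking variable regions concrete, here is a toy column filter. It is not TrimAl's algorithm: the rule used (keep a column when its most common residue is present in at least a given fraction of the sequences) and the 0.6 threshold are assumptions chosen only for illustration. The last line compares the two supermatrix sizes reported above.

```python
from collections import Counter
from typing import Dict, List


def keep_conserved_columns(alignment: Dict[str, str], min_conservation: float = 0.6) -> Dict[str, str]:
    """Drop columns whose most common residue covers too few sequences."""
    names = list(alignment)
    width = len(alignment[names[0]])
    keep: List[int] = []
    for col in range(width):
        residues = [alignment[n][col] for n in names if alignment[n][col] != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        if top / len(names) >= min_conservation:
            keep.append(col)
    return {n: "".join(alignment[n][c] for c in keep) for n in names}


# Toy alignment with invented sequences; columns 3 and 4 are variable/gappy.
toy = {"sp1": "MK-VL", "sp2": "MK--L", "sp3": "MRA-L"}
trimmed = keep_conserved_columns(toy)

# Fraction of positions retained after trimming, from the sizes in the slides.
retained = 12807 / 21260
```

With the reported sizes, trimming kept roughly 60% of the 21,260 concatenated positions.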

Methods

Figure 2. Algorithm for Supermatrix versus Supertree construction.

[Flowchart labels: Individual genes aligned; Concatenated genes aligned; Complete; Gblocks; TrimAl; Supermatrix; Supertree.]

We reconstructed the protozoan species tree using several automated bioinformatics tools: we downloaded the 31 UO from GenBank, built HMM profiles from them, and searched those profiles against protozoan genomes (43 species) and related species (31 species) with different levels of completeness/coverage.

The species tree was formed by 3 major clades that were related to 3 groups of data:

(i) 26 species with at least 80% of the UO (25/31 or more), called G1,

(ii) 12 species with 50-79% of the UO (15-24/31), called G2, and

(iii) 36 species with less than 50% of the UO (1-14/31), called G3.
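The grouping rule above can be written as a small function. The count thresholds (25 and 15 out of 31) come from the slides; the function name and the error handling are illustrative assumptions. The three example counts are the values shown next to Tbr, Han and Len in the Figure 2 tip labels.

```python
TOTAL_UO = 31  # number of universal orthologs searched
G1_MIN = 25    # at least 25/31 UO -> G1
G2_MIN = 15    # 15-24/31 UO -> G2; 1-14/31 -> G3


def coverage_group(n_uo: int) -> str:
    """Assign a species to G1, G2 or G3 from the number of UO found in it."""
    if not 1 <= n_uo <= TOTAL_UO:
        raise ValueError(f"expected a count between 1 and {TOTAL_UO}, got {n_uo}")
    if n_uo >= G1_MIN:
        return "G1"
    if n_uo >= G2_MIN:
        return "G2"
    return "G3"


# UO counts as printed in the Figure 2 tip labels.
groups = {label: coverage_group(count) for label, count in
          {"Tbr": 31, "Han": 22, "Len": 2}.items()}
```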

Results


Figure 1. Phylogenomic supermatrix radiation tree of protozoan species using sequences masked with TrimAl. A supermatrix of 12,807 positions across all 74 species was used. The maximum likelihood tree was constructed with PhyML, using JTT as the evolutionary model and 100 bootstrap replicates (C1 in red, C2 in blue and C3 in black).

C1: 26 sp - at least 80% of UO

C2: 12 sp - between 50-79% of UO

C3: 36 sp - less than 50% of UO

[Figure 1 image: radiation tree of the 74 species; tip labels omitted in this transcript. Scale bar: 0.2.]

C1 was composed only of protozoan species; C2 was composed of Rhodophyta, Cryptophyta, Glaucocystophyceae and stramenopile species; and C3 comprised some species of the groups found in C1 and C2.

C1 showed monophyly of Protozoa, C2 showed monophyly of species related to Protozoa, and C3 showed a polytomy of some Protozoa and related species. The more UO included in the supermatrix, the more robust the tree was inside the 3 clades.

Figure 2 compares the trees built from unmasked (M1) and TrimAl-masked (M2) sequences. Both methodologies gave very similar topologies; however, the tree built from TrimAl-masked sequences showed higher bootstrap values. Figure 1 is the same TrimAl-masked tree presented as a radiation tree.

Results


Figure 2. Phylogenomic supermatrix trees of protozoan species built from total sequences and from conserved blocks (TrimAl). Supermatrices of 21,260 and 12,807 positions, respectively, were used for the 74 species. Maximum likelihood trees were constructed with PhyML, using JTT as the evolutionary model and 100 bootstrap replicates (C1 in red, C2 in blue and C3 in black).

[Figure 2 image, two panels: "Supermatrix tree built with total sequences" and "Supermatrix tree built with conserved blocks". Scale bar: 0.2. The bootstrap values printed at the nodes cannot be assigned to branches in this transcript and are omitted.]

Genus labels shown on the trees: Entamoeba, Trichomonas, Cryptosporidium, Giardia, Leishmania, Trypanosoma, Paramecium, Tetrahymena, Monosiga, Plasmodium, Theileria, Babesia, Hemiselmis, Bigelowiella, Dictyostelium, Porphyra, Gracilaria, Guillardia, Rhodomonas, Cyanophora, Emiliana, Odontella, Phaeodactylum, Heterosigma, Cyanidioschyzon, Cyanidium.

Tip labels, with the number of UO per species in parentheses: Edi (30), Ehi (30), Tva (30), Cho (29), Cpa (30), Gla (30), Lin (31), Lma (29), Lbr (31), Len (2), Tcr (30), Tbr (31), Pca (1), Pte (31), Tth (28), Mbr (31), Pvi (30), Pkn (31), Pfa (31), Pyo (30), Pbe (31), Pch (30), Tpa (31), Tan (30), Bbo (30), Han (22), Bna (25), Dfa (7), Ppa (7), Lam (1), Dci (6), Ddi (31), Ahe (1), Cro (4), Pau (2), Tpy (2), Tma (2), Tpi (2), Tpar (2), Pli (8), Ldi (7), Fve (6), Dvi (6), Ldo (2), Ddic (7), Apo (1), Aca (4), Ngr (3), Tgo (6), Ete (6), Pye (19), Ppu (20), Gte (22), Gth (25), Rsa (18), Cpar (17), Ehu (17), Tps (20), Osi (20), Ptr (20), Hak (18), Cme (19), Cca (18), Elo (6), Egr (10), Mja (10), Ram (12), Cri (1), Pin (7), Pso (6), Pra (6), Sfe (7), Csy (2), Oda (3).

Most variable regions trimmed (at least 80% of UO): Clade 1 (zoom)

[Tree image. Genus labels: Entamoeba, Monosiga, Dictyostelium, Tetrahymena, Paramecium, Leishmania, Trypanosoma, Trichomonas, Cryptosporidium, Giardia, Theileria, Babesia, Plasmodium, Bigelowiella, Guillardia. Tip labels: Edi, Ehi, Mbr, Ddi, Tth, Pte, Lin, Lma, Lbr, Tcr, Tbr, Tva, Cho, Cpa, Gla, Tpa, Tan, Bbo, Pvi, Pkn, Pfa, Pyo, Pbe, Pch, Bna, Gth. Scale bar: 0.2. Node bootstrap values omitted.]

We have presented a phylogenomic-based overview of Protozoa. The relationships between protozoan groups are in agreement with previous studies, showing monophyly for protozoans.

On the other hand, the phylogenetic information inferred from C3 is questionable because the genomes are incomplete, suggesting that using fewer than 15 UO for phylogenomic reconstruction is not reliable.

The inclusion of more data/genes is necessary to obtain a robust tree.

Our phylogenomic-based methodology using a supermatrix approach proved to be reliable with protozoan genome data, suggesting that (a) the more UO used the better, and (b) using the entire UO sequence or just a conserved block of it produces similarly reliable results.

Finally, we need to investigate further whether this methodology can be extrapolated to, or reproduced in, other taxonomic groups.

Conclusion

ProtozoaDB: http://protozoadb.biowebdb.org

Web-based system for running the genotyping pipeline and visualizing the results.

Web page built for running the genotyping pipeline over the web, where the user can choose to use the primer design pipeline, the sequence annotation pipeline

Methodology

DNA extraction

PCR using 33 pairs of designed primers

Agarose gel visualization

PCR product purification

Sequencing

Annotation

Phylogenetic analyses

Alignment of primers COG0092F and COG0092R in the ribosomal protein S3 gene

Forward primer (COG0092F) in the multiple alignment

Reverse primer (COG0092R) in the multiple alignment

Sequences shown in both alignment panels: L. braziliensis, L. infantum, L. major, T. brucei, T. cruzi, P. falciparum, P. vivax.
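As a rough illustration of what locating a primer pair in a sequence involves, the sketch below finds the forward primer on the given strand and the reverse complement of the reverse primer downstream of it. The template and primer sequences are invented for the example (the slides do not give the COG0092F/COG0092R sequences), exact matching is assumed, and `amplicon_bounds` is a hypothetical helper.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def amplicon_bounds(template: str, forward: str, reverse: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the amplicon on `template`, or None without an exact match."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    site_seq = reverse_complement(reverse)
    site = template.find(site_seq, start + len(forward))
    if site == -1:
        return None
    return start, site + len(site_seq)


# Invented toy sequences, for illustration only.
template = "TTACGGATCCAAGTTGCACCTGAAGCTTGG"
bounds = amplicon_bounds(template, "ACGGATCC", "CAAGCTTC")
```

Real primer placement usually has to tolerate mismatches across species, which is why the slides inspect the primers in a multiple alignment rather than by exact matching.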

Genotyping system visualization (multiple alignment) using Jalview

Phylogenetic tree inferred with Maximum Likelihood for KOG0400 (Ribosomal protein S15P/S13E), with 1000 bootstrap replications. Sequences from GenBank and sequences generated with our primers are shown.

Methodology: short version

380 sequences from 36 UO

220 sequences validated, corresponding to 20 UO

33 pairs of primers designed

8 pairs of primers with good potential for multiplex PCR

Sequencing of 4 genes

Genotyping: 2 genes showed good potential/variability

Genotyping: 2 genes showed low variability

OrthoSearch results

Protozoan Orthologs

(OrthoMCL-based)

* Some can be protozoa-specific

GRID: http://grid.biowebdb.org

http://arpa.biowebdb.org

http://stingray.biowebdb.org

Collaborators

Marta Mattoso

M. Luiza Machado Campos

Edmundo C. Grisard

“Yoko” Cavalcanti

Students

Sérgio Serra

Kary Soriano Ocaña

Fabio Coutinho

Edgar Sarmiento

Diogo Tschoeke

Rafael Cuadrat

Daniel Loureiro

Rodrigo Jardim

Luisa Pitaluga

Funding

CNPq

FINEP

CAPES

FAPERJ

European Union (EELA-2)

Recent collaborators:

Elisa Cupolillo

Claudia D. Levy