Integrating phylogenetic inference and metadata visualization for NGS data

João André Carriço, PhD Microbiology Institute/Institute for Molecular Medicine Faculty of Medicine, University of Lisbon Portugal Integrating phylogenetic inference and metadata visualization for NGS data http://im.fm.ul.pt http://imm.fm.ul.pt http://www.joaocarrico.in Workshop 20: Typing of Bacterial Pathogens in 2015: Expanding the scope of NGS

Transcript of Integrating phylogenetic inference and metadata visualization for NGS data

Page 1: Integrating phylogenetic inference and metadata visualization for NGS data

João André Carriço, PhD

Microbiology Institute/Institute for Molecular Medicine

Faculty of Medicine, University of Lisbon

Portugal

Integrating phylogenetic inference and metadata visualization for NGS data

http://im.fm.ul.pt http://imm.fm.ul.pt http://www.joaocarrico.info

Workshop 20: Typing of Bacterial Pathogens in 2015:

Expanding the scope of NGS

Page 2: Integrating phylogenetic inference and metadata visualization for NGS data

Conflicts of Interest

NOTHING TO DISCLOSE

Page 3: Integrating phylogenetic inference and metadata visualization for NGS data

Charles Darwin's “tree of life” in Notebook B, 1837-1838

Darwin and the tree of life

Page 4: Integrating phylogenetic inference and metadata visualization for NGS data

Phylogenetic methods aim to infer the relationships between taxa by identifying their common ancestors

Assumptions: the characters being compared are homologous and independent, i.e. they share a common ancestor and each character was subject to evolutionary forces individually

Phylogenetic Inference

ATTGGGG ATGGGGG

AT?GGGG
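The figure above can be read as a column-by-column comparison. A minimal sketch, assuming two aligned sequences of equal length: positions where both descendants agree are kept, and positions where they differ are marked '?' because the ancestral state is unresolved. The function name `putative_ancestor` is hypothetical, and real inference methods rely on an explicit evolutionary model rather than this simple comparison.

```python
def putative_ancestor(seq_a: str, seq_b: str) -> str:
    """Column-wise comparison of two aligned sequences.

    Positions where both sequences agree are kept; positions where
    they differ are marked '?' because the ancestral state is unknown.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(a if a == b else "?" for a, b in zip(seq_a, seq_b))


print(putative_ancestor("ATTGGGG", "ATGGGGG"))  # AT?GGGG
```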

Page 5: Integrating phylogenetic inference and metadata visualization for NGS data

Software for Phylogenetic trees: based on sequence alignments

• MEGA

• http://www.megasoftware.net/

• Splitstree

• http://www.splitstree.org/

• Geneious

• http://www.geneious.com/

• FastTree

• http://www.microbesonline.org/fasttree

• RAxML

• http://sco.h-its.org/exelixis/web/software/raxml/index.html

• PHYLIP

• http://evolution.genetics.washington.edu/phylip.html

• BEAST

• http://beast.bio.ed.ac.uk/

And many many others…

Page 6: Integrating phylogenetic inference and metadata visualization for NGS data

Sequence Alignment methods

Kos, V.N. et al., 2012. Comparative genomics of vancomycin-resistant Staphylococcus aureus strains and their positions within the clade most commonly associated with Methicillin-resistant S. aureus hospital-acquired infection in the United States. mBio, 3(3).

Maximum Likelihood tree of concatenated SICOs

Page 7: Integrating phylogenetic inference and metadata visualization for NGS data

Sequence Alignment methods

Maximum Likelihood tree of concatenated SICOs

Caveats:

• Computationally intensive: some methods cannot be applied to hundreds or thousands of strains

• Require specialized knowledge of the methods and software to define the parameters

• Some phenomena violate the assumptions (recombination, convergent evolution, etc.)

Page 8: Integrating phylogenetic inference and metadata visualization for NGS data

Sequence Based Typing Methods

Strain genomic information encoded as a numeric sequence

Sanger sequencing:

• MLST: Gene allele identifier

• MLVA: Number of repeats

NGS approaches:

• Gene-by-Gene / allele based: wgMLST (core + pan genome genes are represented); cgMLST (just core genome)

• SNP Typing: Polymorphism

Page 9: Integrating phylogenetic inference and metadata visualization for NGS data

Each unique gene sequence (allele) is assigned an integer ID by comparison with online databases

Allelic profile: 12 - 9 - 11 - 7 - 11 - 20 - 3

Each allelic profile, also known as a sequence type (ST), is unequivocally identified by an integer.

Single locus variant (SLV):  12 - 10 - 11 - 7 - 11 - 20 - 3

Double locus variant (DLV):  12 - 10 - 11 - 11 - 11 - 20 - 3

Triple locus variant (TLV):  10 - 10 - 11 - 11 - 11 - 20 - 3

Bacterial chromosome

MLST
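The SLV/DLV/TLV labels simply count the loci at which two allelic profiles differ. A minimal sketch, using the profile from the slide as reference; the function names are hypothetical and the label for more than three differences is an assumption.

```python
def locus_differences(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different alleles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def variant_label(profile_a, profile_b):
    """Label the relation between two profiles as SLV, DLV, TLV or other."""
    labels = {0: "identical", 1: "SLV", 2: "DLV", 3: "TLV"}
    return labels.get(locus_differences(profile_a, profile_b), "more distant")


reference = (12, 9, 11, 7, 11, 20, 3)  # allelic profile from the slide
print(variant_label(reference, (12, 10, 11, 7, 11, 20, 3)))   # SLV
print(variant_label(reference, (12, 10, 11, 11, 11, 20, 3)))  # DLV
```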

Page 10: Integrating phylogenetic inference and metadata visualization for NGS data

SNP NGS Approach

A good approach for monomorphic species. For non-monomorphic species, SNPs in genome regions where recombination was detected need to be removed to avoid confounding the phylogenetic signal.

Workflow: sample → NGS / WGS → reads → Mapping to reference → Fasta File with SNPs

Files: fastq files, BAM files, VCF files
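The last step of this workflow, a FASTA file that contains only the SNP positions, can be sketched as follows. This is a toy illustration, assuming each strain is already represented as a sequence aligned to the reference; the function name `variable_sites` and the example sequences are hypothetical, and the masking of recombinant regions mentioned above is not included.

```python
def variable_sites(alignment):
    """Keep only alignment columns where at least two strains differ.

    `alignment` maps strain name -> sequence aligned to the reference;
    all sequences must have the same length.
    Returns (positions, reduced) with 0-based column positions.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("sequences must have equal length")
    positions = [
        i for i in range(length)
        if len({alignment[name][i] for name in names}) > 1
    ]
    reduced = {
        name: "".join(alignment[name][i] for i in positions)
        for name in names
    }
    return positions, reduced


toy = {"ref": "ATTGGGG", "s1": "ATGGGGG", "s2": "ATTGGGA"}
positions, snps = variable_sites(toy)
print(positions)  # [2, 6]
print(snps)       # {'ref': 'TG', 's1': 'GG', 's2': 'TA'}
```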

Page 11: Integrating phylogenetic inference and metadata visualization for NGS data

Gene by Gene NGS Approach

Software currently available:

BIGSDB (Jolley, K.A. & Maiden, M.C.J., 2010. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics)

RIDOM™ SEQSPHERE+ (http://www.ridom.com/seqsphere/)

Central nomenclature server: Schemas, Allele definitions and identifiers

Workflow: sample → NGS / WGS → reads → assembly → contigs → Output: Allelic Profile
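The role of the nomenclature server can be illustrated with a toy in-memory stand-in. This is only a sketch under stated assumptions: the class `AlleleNomenclature`, the locus names and the sequences are hypothetical, new alleles simply receive the next free integer, and locating the genes in the contigs is outside the sketch. It is not how BIGSdb or SeqSphere+ are implemented.

```python
class AlleleNomenclature:
    """Toy in-memory stand-in for a central nomenclature server."""

    def __init__(self):
        self._ids = {}  # locus -> {sequence: allele ID}

    def allele_id(self, locus, sequence):
        """Return the integer ID of `sequence`, assigning the next free ID if new."""
        known = self._ids.setdefault(locus, {})
        key = sequence.upper()
        if key not in known:
            known[key] = len(known) + 1
        return known[key]

    def profile(self, loci, gene_sequences):
        """Allelic profile: one allele ID per locus, None when the locus is missing."""
        return [
            self.allele_id(locus, gene_sequences[locus])
            if locus in gene_sequences else None
            for locus in loci
        ]


db = AlleleNomenclature()
loci = ["geneA", "geneB"]
strain1 = {"geneA": "ATGAAA", "geneB": "ATGCCC"}
strain2 = {"geneA": "ATGAAA", "geneB": "ATGCCT"}
print(db.profile(loci, strain1))  # [1, 1]
print(db.profile(loci, strain2))  # [1, 2]
```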

Page 12: Integrating phylogenetic inference and metadata visualization for NGS data

Algorithms for Phylogenetic Inference

Based on the distance matrix:

• Hierarchical clustering methods: UPGMA, Single Linkage and Complete Linkage

• Neighbor-joining

• Minimum Spanning Trees

Maximum Parsimony methods

Based on rules (Graphic Matroids):

• goeBURST

Maximum Likelihood methods

Bayesian inference methods

Input data (labels on the slide): sequence alignments or allelic profiles, depending on the method
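For the distance-based methods applied to allelic profiles, the starting point is a matrix of pairwise distances. A minimal sketch, assuming the distance between two profiles is the number of loci with different alleles (Hamming distance); the three-locus profiles and the function names are illustrative only.

```python
def hamming(profile_a, profile_b):
    """Number of loci with different alleles."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def distance_matrix(profiles):
    """Symmetric matrix of pairwise Hamming distances between profiles."""
    n = len(profiles)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = hamming(profiles[i], profiles[j])
    return matrix


profiles = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 2, 3)]
for row in distance_matrix(profiles):
    print(row)
```

Clustering methods such as UPGMA or a Minimum Spanning Tree would then take this matrix as input.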

Page 13: Integrating phylogenetic inference and metadata visualization for NGS data

Inferring phylogeny from allelic profiles

Assume that you have only 3 genes and that each number corresponds to a different allele of each gene. The minimal assumption is that an SLV may correspond to a possible phylogenetic descent.

Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3

SLV links (six in the figure): 1-1-1/1-1-2, 1-1-1/1-2-1, 1-1-2/1-2-2, 1-2-1/1-2-2, 1-2-1/1-2-3, 1-2-2/1-2-3

11 possible trees….
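The count of 11 can be checked by brute force. The sketch below, with hypothetical function names and the five profiles labelled 1 to 5, lists every pair of profiles that differ at exactly one locus and then counts the sets of four SLV links that connect all five profiles without forming a cycle.

```python
from itertools import combinations

profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}


def slv_links(profiles):
    """All pairs of STs whose profiles differ at exactly one locus."""
    return [
        (a, b) for a, b in combinations(sorted(profiles), 2)
        if sum(x != y for x, y in zip(profiles[a], profiles[b])) == 1
    ]


def is_spanning_tree(nodes, edges):
    """True if `edges` (n - 1 of them) connect all nodes without a cycle."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:  # this edge would close a cycle
            return False
        parent[root_a] = root_b
    return True


def count_spanning_trees(profiles):
    """Count subsets of n - 1 SLV links that form a tree over all profiles."""
    links = slv_links(profiles)
    size = len(profiles) - 1
    return sum(is_spanning_tree(profiles, subset)
               for subset in combinations(links, size))


print(slv_links(profiles))             # the six SLV links
print(count_spanning_trees(profiles))  # 11
```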

Page 14: Integrating phylogenetic inference and metadata visualization for NGS data

eBURST model

More similar STs should denote strains that are closely related from an evolutionary point of view. STs with more SLVs can be regarded as common ancestors.

Links between STs depict descent relations.

With these assumptions, connected STs should share an evolutionary path.

Maynard Smith J., et al. 2000. Bioessays 22:1115-22

eBURST: Feil E. et al., J Bac 2004

Page 15: Integrating phylogenetic inference and metadata visualization for NGS data

1-1-1

1-1-2

1-2-1

1-2-2

1-2-3

goeBURST

#SLVs  #DLVs  #TLVs  Freq  STid
2      2      0      1     1
2      2      0      1     2
3      1      0      1     3
3      1      0      1     4
2      2      0      1     5

Implementing the eBURST rules as a graphic matroid problem allows for a globally optimal solution for the placement of the ST links.

Francisco et al, BMC Bioinf, 2009

More SLVs / lower ID

Connects to ST4 because #SLVs

Final goeBURST tree: unique solution guaranteed
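The way tie-breaking leads to a single tree can be sketched with Kruskal's algorithm and a total order on the links. This is a simplified illustration, not the published implementation: the exact ordering of the criteria (fewer differing loci, then more SLVs, DLVs and TLVs, then higher frequency, then lower ST ID) is an assumption that should be checked against Francisco et al., and the function names are hypothetical.

```python
from itertools import combinations


def differences(a, b):
    """Number of loci at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def variant_counts(profiles):
    """For each ST, count how many other STs are SLVs, DLVs and TLVs."""
    counts = {st: [0, 0, 0] for st in profiles}
    for a, b in combinations(profiles, 2):
        d = differences(profiles[a], profiles[b])
        if 1 <= d <= 3:
            counts[a][d - 1] += 1
            counts[b][d - 1] += 1
    return counts


def goeburst_like_tree(profiles, freq):
    """Kruskal's algorithm with eBURST-style tie-breaking (simplified)."""
    counts = variant_counts(profiles)

    def priority(edge):
        a, b = edge
        key = [differences(profiles[a], profiles[b])]  # fewer differences first
        for level in range(3):                          # then more SLVs, DLVs, TLVs
            hi, lo = sorted((counts[a][level], counts[b][level]), reverse=True)
            key += [-hi, -lo]
        hi, lo = sorted((freq[a], freq[b]), reverse=True)
        key += [-hi, -lo, min(a, b), max(a, b)]         # then frequency, lower ID
        return key

    parent = {st: st for st in profiles}

    def find(st):
        while parent[st] != st:
            st = parent[st]
        return st

    tree = []
    for a, b in sorted(combinations(sorted(profiles), 2), key=priority):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # link joins two separate components
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}
freq = {st: 1 for st in profiles}
print(variant_counts(profiles))
print(goeburst_like_tree(profiles, freq))
```

Because every link gets a distinct priority, the greedy choice never has to pick arbitrarily, which is why only one tree comes out. With these profiles, ST2 is linked to ST4 rather than ST1 because ST4 has more SLVs.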

Page 16: Integrating phylogenetic inference and metadata visualization for NGS data

Applying goeBURST

Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3

SLV links (six in the figure): 1-1-1/1-1-2, 1-1-1/1-2-1, 1-1-2/1-2-2, 1-2-1/1-2-2, 1-2-1/1-2-3, 1-2-2/1-2-3

11 possible trees….

All of these are valid goeBURST solutions. If all STs had the same frequency in the dataset, the tie-break would have to be the ST ID.

Page 17: Integrating phylogenetic inference and metadata visualization for NGS data

goeBURST output examples

Largest S. aureus MLST CC

1067 of 2650 STs total

2nd largest S. aureus CC

252 STs

Page 18: Integrating phylogenetic inference and metadata visualization for NGS data

goeBURST FULL MST

• The goeBURST rules can be extended to any number of loci while keeping the assumptions of the underlying evolutionary model

• Adds an evolutionary model to the basic Minimum Spanning Tree approach

• Advantage: very fast to calculate compared to phylogenetic analysis algorithms

• Advantage: if the strains are closely related, the internal nodes are actual strains, unlike in traditional phylogenetic methods

• Disadvantage: does not create internal nodes representing putative recent common ancestors

Page 19: Integrating phylogenetic inference and metadata visualization for NGS data

Allelic profiles

Accessory data (“metadata”): Antibiogram, Serotype, Origin info (patient), Other typing method, …

Analysis (goeBURST)

Present the data in a meaningful way

Integrating Data Analysis and Visualization
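One simple way to put analysis results and accessory data side by side is to count the values of a metadata field within each group produced by the analysis. The sketch below assumes isolates are identified by a shared ID; the function name, the group labels and all example values are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_groups(groups, metadata, field):
    """Count the values of one metadata field within each analysis group.

    `groups` maps isolate ID -> group label (e.g. a clonal complex);
    `metadata` maps isolate ID -> dict of accessory data.
    Isolates without the field are counted under "NA".
    """
    summary = defaultdict(Counter)
    for isolate, group in groups.items():
        value = metadata.get(isolate, {}).get(field, "NA")
        summary[group][value] += 1
    return {group: dict(counter) for group, counter in summary.items()}


groups = {"iso1": "CC1", "iso2": "CC1", "iso3": "CC2"}
metadata = {
    "iso1": {"serotype": "19A", "origin": "hospital"},
    "iso2": {"serotype": "19A", "origin": "community"},
    "iso3": {"serotype": "3"},
}
print(summarize_groups(groups, metadata, "serotype"))
print(summarize_groups(groups, metadata, "origin"))
```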

Page 20: Integrating phylogenetic inference and metadata visualization for NGS data

Using Phyloviz (http://www.phyloviz.net)

Page 21: Integrating phylogenetic inference and metadata visualization for NGS data

PHYLOViZ

Can be easily applied to:

• MLST

• MLVA

• SNP data*

• Gene Presence/absence

*Conversion of VCF to PHYLOViZ: https://github.com/nickloman/misc-genomics-tools/blob/master/scripts/vcf2phyloviz.py (Thanks Nick!)

Page 22: Integrating phylogenetic inference and metadata visualization for NGS data

PHYLOViZ

Example of visualization with MLST+ (core genome) data of VRSA and MRSA strains

Page 23: Integrating phylogenetic inference and metadata visualization for NGS data

Core genome comparison - Workflow

Core genome from all available fully sequenced S. aureus strains in NCBI

Using strain COL genes as reference

1866 target loci found for a cgMLST schema (RIDOM Seqsphere+)

Call alleles for strains under study

Removing loci with missing data in the strains under analysis

1542 target genes kept for whole genome comparison

goeBURST Minimum Spanning Tree of the resulting allelic profiles (PHYLOViZ software)
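The filtering step of this workflow, which reduced 1866 target loci to 1542, amounts to keeping only the loci that have an allele call in every strain. A minimal sketch, assuming missing calls are stored as `None`; the function name and the example data are hypothetical.

```python
def drop_incomplete_loci(profiles, loci, missing=None):
    """Keep only loci that have an allele call in every strain.

    `profiles` maps strain -> list of allele IDs in the order of `loci`;
    `missing` is the value that marks a locus without an allele call.
    """
    keep = [
        i for i in range(len(loci))
        if all(profile[i] != missing for profile in profiles.values())
    ]
    kept_loci = [loci[i] for i in keep]
    filtered = {
        strain: [profile[i] for i in keep]
        for strain, profile in profiles.items()
    }
    return kept_loci, filtered


loci = ["locus1", "locus2", "locus3"]
profiles = {
    "strainA": [1, 4, 2],
    "strainB": [1, None, 2],
    "strainC": [3, 5, 2],
}
kept, filtered = drop_incomplete_loci(profiles, loci)
print(kept)      # ['locus1', 'locus3']
print(filtered)  # {'strainA': [1, 2], 'strainB': [1, 2], 'strainC': [3, 2]}
```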

Page 24: Integrating phylogenetic inference and metadata visualization for NGS data

Core genome comparison

VRSA

NCBI strains

US VRSA strains (Kos et al)

HSM strains

MRSA srp

VRS5

MLST+: 1542 genes

Core genome genes found in all strains

65

Page 25: Integrating phylogenetic inference and metadata visualization for NGS data

“Live” Demonstration

Page 26: Integrating phylogenetic inference and metadata visualization for NGS data

PHYLOViZ

PROs:

• Handles thousands of profiles

• Fast calculation

• Easy to annotate and explore metadata

• Allows for basic statistics on profiles and metadata

• Allows for advanced statistics on MSTs (PLoS One. 2015 Mar 23;10(3):e0119315)

• Exports high quality graphical formats

• Allows plugin development

CONs:

• goeBURST and goeBURST MST only (Neighbour Joining and UPGMA soon)

• Java knowledge is needed to code new plugins

Page 27: Integrating phylogenetic inference and metadata visualization for NGS data

Final Remarks

Phylogenetic inference always has an underlying model. The choice of method depends on the data being analyzed and on the underlying question

With the increasing availability of bacterial genomes, the methods that allow their comparison need to be efficient and scalable

Metadata should always be used to evaluate the algorithm results

PHYLOViZ provides a visualization framework to analyze inferred patterns of descent based on goeBURST, including detailed statistics, and allows easy integration of metadata with the algorithm results

Any sequence-based typing method that generates allelic profiles can be analyzed with this framework, including any NGS-derived schema (e.g. cgMLST, SNPs)

Page 28: Integrating phylogenetic inference and metadata visualization for NGS data

Ongoing Phyloviz work

Modular plugin architecture

• Allows expansion and addition of new capabilities

• Other analysis algorithms / custom rules

• New visualization modules

• Allow the analysis of other data types

• Complementary statistics modules

Try to address users' needs…

• We need your feedback!

• Phyloviz is open-source freeware software

Page 29: Integrating phylogenetic inference and metadata visualization for NGS data

Alexandre Francisco Cátia Vaz

Pedro Monteiro

Mário Ramirez José Melo-Cristino

Acknowledgements

Initial funding from Fundação para a Ciência e Tecnologia

Page 30: Integrating phylogenetic inference and metadata visualization for NGS data

Draft Scientific Programme:

Plenaries:

1) Small Scale Microbial Epidemiology

2) Large Scale Microbial Epidemiology

3) Bioinformatics for Genome-based Microbial Epidemiology

4) Population Genetics: Pathogen Emergence

5) Population Dynamics: Transmission networks and surveillance

6) Molecular Epidemiology for Global Health and One Health

Parallel Sessions:

1) Food and Environmental pathogens

2) Microbial Forensics

3) Virus

4) Fungi and Yeasts

5) Novel Diagnostics methodologies

6) Novel Typing approaches

7) Phylogenetic Inference

8) Interactive Illustration Platforms

Save the date!

Page 31: Integrating phylogenetic inference and metadata visualization for NGS data

Phyloviz Visualization Examples

Page 32: Integrating phylogenetic inference and metadata visualization for NGS data

Phyloviz

Burkholderia pseudomallei

Page 33: Integrating phylogenetic inference and metadata visualization for NGS data

Legend: Clinical, animal, NA, community, Hospital, Surv/Outb

Enterococcus faecium

Page 34: Integrating phylogenetic inference and metadata visualization for NGS data

Streptococcus pneumoniae CC90

Coloured by country of origin

Page 35: Integrating phylogenetic inference and metadata visualization for NGS data

Streptococcus pneumoniae 10 largest clonal complexes coloured by serotype