Integrating phylogenetic inference and metadata visualization for NGS data

João André Carriço, PhD

Microbiology Institute/Institute for Molecular Medicine

Faculty of Medicine, University of Lisbon

Portugal

http://im.fm.ul.pt

http://imm.fm.ul.pt

http://www.joaocarrico.info

Workshop 20: Typing of Bacterial Pathogens in 2015: Expanding the scope of NGS

Conflicts of Interest

NOTHING TO DISCLOSE

Charles Darwin’s “tree of life” in Notebook B, 1837-1838

Darwin and the tree of life

Phylogenetic methods aim to infer the relationships between taxa by identifying the common ancestors they share

Assumptions: the characters being compared are homologous and independent, i.e. they derive from a common ancestor and each character was subject to evolutionary forces individually

Phylogenetic Inference

ATTGGGG ATGGGGG

AT?GGGG
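The toy example above (two descendants, one unresolved ancestral site) can be sketched in a few lines. This is only an illustration, assuming two sequences that are already aligned and of equal length; the function name `naive_ancestor` is hypothetical, and real methods use more taxa and an explicit model to resolve the ambiguous site.

```python
def naive_ancestor(seq_a, seq_b, unknown="?"):
    """Keep sites where both descendants agree; mark the others as unknown."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(a if a == b else unknown for a, b in zip(seq_a, seq_b))


print(naive_ancestor("ATTGGGG", "ATGGGGG"))  # AT?GGGG
```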

Software for Phylogenetic trees: based on sequence alignments

• MEGA: http://www.megasoftware.net/

• Splitstree: http://www.splitstree.org/

• Geneious: http://www.geneious.com/

• FastTree: http://www.microbesonline.org/fasttree

• RAxML: http://sco.h-its.org/exelixis/web/software/raxml/index.html

• PHYLIP: http://evolution.genetics.washington.edu/phylip.html

• BEAST: http://beast.bio.ed.ac.uk/

And many many others…

Sequence Alignment methods

Kos, V.N. et al., 2012. Comparative genomics of vancomycin-resistant Staphylococcus aureus strains and their positions within the clade most commonly associated with Methicillin-resistant S. aureus hospital-acquired infection in the United States. mBio, 3(3).

Maximum Likelihood tree of concatenated SICOs

Caveats:

• Computationally intensive: some methods can’t be applied to hundreds to thousands of strains

• Require specialized method and software knowledge for parameter definition

• Some phenomena violate the assumptions (recombination, convergent evolution, etc.)

Sequence Based Typing Methods

Strain genomic information encoded as a numeric sequence

Sanger sequencing:

• MLST: gene allele identifier

• MLVA: number of repeats

NGS approaches:

• Gene-by-gene / allele based. wgMLST: core + pan genome genes are represented; cgMLST: just the core genome

• SNP typing: polymorphisms

Each unique gene sequence (allele) is assigned an integer ID by comparison with online databases

Allelic profile: 12 - 9 - 11 - 7 - 11 - 20 - 3

Each allelic profile, also known as a sequence type (ST), is unequivocally identified by an integer.

Single locus variant (SLV): 12 - 10 - 11 - 7 - 11 - 20 - 3

Double locus variant (DLV): 12 - 10 - 11 - 11 - 11 - 20 - 3

Triple locus variant (TLV): 10 - 10 - 11 - 11 - 11 - 2 - 3
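The SLV/DLV/TLV labels simply count the loci at which two allelic profiles differ. A minimal sketch, assuming profiles are tuples of allele IDs in the same locus order (the function names are hypothetical), using the reference profile and the SLV and DLV examples from this slide:

```python
def locus_differences(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different alleles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def variant_level(profile_a, profile_b):
    """Name the relationship between two profiles by number of differing loci."""
    names = {0: "identical", 1: "SLV", 2: "DLV", 3: "TLV"}
    d = locus_differences(profile_a, profile_b)
    return names.get(d, f"{d} loci apart")


reference = (12, 9, 11, 7, 11, 20, 3)
print(variant_level(reference, (12, 10, 11, 7, 11, 20, 3)))   # SLV
print(variant_level(reference, (12, 10, 11, 11, 11, 20, 3)))  # DLV
```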

Bacterial chromosome

MLST

SNP NGS Approach

A good approach for monomorphic species. For non-monomorphic species, SNPs in genome regions where recombination was detected need to be removed to avoid confounding the phylogenetic signal.

Workflow: sample → NGS (WGS) → reads → mapping to reference → FASTA file with SNPs

Files produced along the way: fastq files, BAM files, VCF files
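The last step of this workflow, a FASTA file containing only the SNP positions, can be sketched as follows. This assumes the sequences are already expressed in reference coordinates and have equal length; the function names and the toy sequences are hypothetical, and filtering of recombinant regions is not shown.

```python
def variable_sites(aligned):
    """Return the variable column indices and the per-strain SNP strings."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    seqs = list(aligned.values())
    positions = [i for i in range(lengths.pop())
                 if len({seq[i] for seq in seqs}) > 1]
    snps = {name: "".join(seq[i] for i in positions)
            for name, seq in aligned.items()}
    return positions, snps


def to_fasta(seqs):
    """Serialize a {name: sequence} mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs.items())


toy = {"ref": "ATTGGGGA", "s1": "ATGGGGGA", "s2": "ATTGGGCA"}
positions, snps = variable_sites(toy)
print(positions)  # [2, 6]
print(to_fasta(snps))
```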

Gene by Gene NGS Approach

Software currently available:

BIGSDB (Jolley, K.A. & Maiden, M.C.J., 2010. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics)

RIDOM™ SEQSPHERE+ (http://www.ridom.com/seqsphere/)

Central nomenclature server: schemas, allele definitions and identifiers

Workflow: sample → NGS (WGS) → reads → assembly → contigs

Output: allelic profile
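The step from contigs to an allelic profile can be illustrated with a toy allele caller. This is a sketch under strong assumptions: an in-memory dictionary stands in for the central nomenclature server, the gene sequences are assumed to have been extracted from the contigs already, and only exact matches are considered. The names `call_alleles`, `schema` and `sample` are hypothetical and do not reflect how BIGSdb or SeqSphere+ work internally.

```python
def call_alleles(schema, gene_sequences):
    """Translate per-locus sequences into an allelic profile.

    schema: {locus: {allele_sequence: allele_id}}
    gene_sequences: {locus: sequence found in the assembly}
    Loci not found are reported as None; novel sequences get the next free ID.
    """
    profile = []
    for locus, known in schema.items():
        seq = gene_sequences.get(locus)
        if seq is None:
            profile.append(None)  # locus missing from the assembly
            continue
        if seq not in known:
            known[seq] = max(known.values(), default=0) + 1  # register new allele
        profile.append(known[seq])
    return profile


schema = {"geneA": {"ATGAAA": 1, "ATGAAG": 2}, "geneB": {"ATGCCC": 1}}
sample = {"geneA": "ATGAAG", "geneB": "ATGCCT"}
print(call_alleles(schema, sample))  # [2, 2]
```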

Algorithms for Phylogenetic Inference

Based on the distance matrix (input: sequence alignments or allelic profiles):

• Hierarchical clustering methods: UPGMA, Single Linkage and Complete Linkage

• Neighbor-joining

• Minimum Spanning Trees

Maximum Parsimony methods (input: sequence alignments)

Based on rules (Graphic Matroids) (input: allelic profiles):

• goeBURST

Maximum Likelihood methods (input: sequence alignments)

Bayesian inference methods (input: sequence alignments)

Inferring phylogeny from allelic profiles

Assume that you have only 3 genes and that each number corresponds to a different allele of each gene. The minimal assumption is that an SLV may correspond to a direct line of descent.

Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3

[Figure: the five profiles connected by six SLV links]

11 possible trees…
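The count of 11 can be checked by brute force: build every SLV link between the five profiles, then test each set of four links for whether it forms a tree. A small sketch (the helper names are hypothetical):

```python
from itertools import combinations

profiles = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 2, 3)]


def is_slv(a, b):
    """True when two profiles differ at exactly one locus."""
    return sum(x != y for x, y in zip(a, b)) == 1


slv_edges = [(i, j) for i, j in combinations(range(len(profiles)), 2)
             if is_slv(profiles[i], profiles[j])]


def is_spanning_tree(edges, n):
    """n - 1 edges with no cycle connect all n nodes (union-find check)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge would close a cycle
        parent[ru] = rv
    return True


n = len(profiles)
trees = [t for t in combinations(slv_edges, n - 1) if is_spanning_tree(t, n)]
print(len(slv_edges), len(trees))  # 6 11
```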

eBURST model

More similar STs should denote strains that are closely related from an evolutionary point of view. STs with more SLVs can be regarded as common ancestors.

Links between STs depict descent relations.

With these assumptions, connected STs should share an evolutionary path.

Maynard Smith J., et al. 2000. Bioessays 22:1115-22

eBURST

Feil E. et al., J Bacteriol 2004

1-1-1

1-1-2

1-2-1

1-2-2

1-2-3

goeBURST

#SLVs  #DLVs  #TLVs  Freq  STid

2      2      0      1     1

2      2      0      1     2

3      1      0      1     3

3      1      0      1     4

2      2      0      1     5
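The counts in this table can be reproduced directly from the five example profiles. The sketch below assumes that ST IDs 1 to 5 were assigned in the order in which the profiles are listed (1-1-1 through 1-2-3); the function names are hypothetical.

```python
profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}


def distance(a, b):
    """Number of loci at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def variant_counts(st_id):
    """Return (#SLVs, #DLVs, #TLVs) of one ST within the dataset."""
    counts = {1: 0, 2: 0, 3: 0}
    for other, profile in profiles.items():
        d = distance(profiles[st_id], profile)
        if other != st_id and d in counts:
            counts[d] += 1
    return counts[1], counts[2], counts[3]


print("#SLVs #DLVs #TLVs STid")
for st in sorted(profiles):
    slv, dlv, tlv = variant_counts(st)
    print(f"{slv:5d} {dlv:5d} {tlv:5d} {st:4d}")
```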

Implementing the eBURST rules as a graphic matroid problem allows for a globally optimal placement of the links between STs.

Francisco et al, BMC Bioinf, 2009

More SLVs / lower ID

Connects to ST4 because #SLVs

Final goeBURST tree: unique solution guaranteed
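Why the tree is unique can be seen in a Kruskal-style sketch: if every candidate link receives a total ordering, the greedy choice is deterministic. The ordering used below (link distance, then #SLVs, #DLVs, #TLVs, frequency and finally lower ST ID) is a simplified reading of the tie-break rules and an assumption of this example, not the PHYLOViZ implementation; all function names are hypothetical.

```python
from itertools import combinations

profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}
freq = {st: 1 for st in profiles}  # every ST seen once, as in the table


def distance(a, b):
    return sum(x != y for x, y in zip(a, b))


def level_counts(st, max_level=3):
    """[#SLVs, #DLVs, #TLVs] of one ST within the dataset."""
    counts = [0] * max_level
    for other, profile in profiles.items():
        d = distance(profiles[st], profile)
        if other != st and 1 <= d <= max_level:
            counts[d - 1] += 1
    return counts


def edge_key(u, v):
    """Total order on links: smaller key means the link is preferred."""
    key = [distance(profiles[u], profiles[v])]
    for a, b in zip(level_counts(u), level_counts(v)):  # SLVs, DLVs, TLVs
        key += [-max(a, b), -min(a, b)]
    key += [-max(freq[u], freq[v]), -min(freq[u], freq[v])]
    key += [min(u, v), max(u, v)]  # lower ST ID wins the final tie
    return key


def goeburst_like_tree(max_distance=1):
    """Greedy (Kruskal) selection of links in edge_key order."""
    parent = {st: st for st in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = [(u, v) for u, v in combinations(sorted(profiles), 2)
             if distance(profiles[u], profiles[v]) <= max_distance]
    tree = []
    for u, v in sorted(edges, key=lambda e: edge_key(*e)):
        ru, rv = find(u), find(v)
        if ru != rv:  # skip links that would close a cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree


print(goeburst_like_tree())  # [(3, 4), (1, 3), (2, 4), (3, 5)]
```

With this ordering ST2 links to ST4 rather than ST1 because ST4 has more SLVs, and ST5 links to ST3 rather than ST4 because ST3 has the lower ID.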

Applying goeBURST

Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3

[Figure: the five profiles connected by six SLV links]

11 possible trees…

All of these are valid goeBURST solutions. If all STs have the same frequency in the dataset, the tie-break has to be the ST ID.

goeBURST output examples

Largest S. aureus MLST CC

1067 of 2650 STs total

2nd largest S. aureus CC: 252 STs

goeBURST FULL MST

• The goeBURST rules can be extended to any number of loci while keeping the assumptions of the underlying evolutionary model

• Adds an evolutionary model to the basic Minimum Spanning Tree approach

• Advantage: very fast to calculate compared with phylogenetic analysis algorithms

• Advantage: if the strains are closely related, the internal nodes are actual strains, unlike in traditional phylogenetic methods

• Disadvantage: does not create internal nodes representing putative recent common ancestors

Allelic profiles

Accessory data (“metadata”): antibiogram, serotype, origin info (patient), other typing method, …

Analysis (goeBURST)

Present the data in a meaningful way

Integrating Data Analysis and Visualization

Using Phyloviz (http://www.phyloviz.net)

PHYLOViZ

Can be easily applied to:

- MLST

- MLVA

- SNP data*

- Gene presence/absence

*Conversion of VCF to PHYLOViZ: https://github.com/nickloman/misc-genomics-tools/blob/master/scripts/vcf2phyloviz.py (Thanks Nick!)

PHYLOViZ Example of visualization with MLST+ (core genome) data of VRSA and MRSA strains

Core genome comparison - Workflow

Core genome from all available fully sequenced S. aureus strains in NCBI

Using strain COL genes as reference

1866 target loci found for a cgMLST schema (RIDOM SeqSphere+)

Call alleles for strains under study

Remove loci with missing data in the strains under analysis

1542 target genes kept for whole genome comparison

goeBURST Minimum Spanning Tree of the resulting allelic profiles (PHYLOViZ software)
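The locus-filtering step of this workflow (1866 target loci reduced to 1542) amounts to keeping only loci that were called in every strain. A minimal sketch, assuming allele calls are stored per strain as dictionaries and that a missing call is either absent or `None`; the function name and the toy data are hypothetical.

```python
def drop_incomplete_loci(calls_by_strain):
    """Keep only loci with an allele call in every strain.

    calls_by_strain: {strain: {locus: allele_id or None}}
    Returns ({strain: [allele IDs in kept-locus order]}, kept_loci).
    """
    loci = set()
    for calls in calls_by_strain.values():
        loci.update(calls)
    kept = sorted(
        locus for locus in loci
        if all(calls.get(locus) is not None for calls in calls_by_strain.values())
    )
    profiles = {strain: [calls[locus] for locus in kept]
                for strain, calls in calls_by_strain.items()}
    return profiles, kept


toy = {
    "strainA": {"locus1": 1, "locus2": 4, "locus3": 2},
    "strainB": {"locus1": 1, "locus2": None, "locus3": 5},
    "strainC": {"locus1": 3, "locus3": 2},
}
filtered, kept = drop_incomplete_loci(toy)
print(kept)      # ['locus1', 'locus3']
print(filtered)  # {'strainA': [1, 2], 'strainB': [1, 5], 'strainC': [3, 2]}
```

The resulting complete profiles are what a goeBURST Minimum Spanning Tree would then be computed from.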

Core genome comparison

[Figure: goeBURST tree of VRSA and MRSA strains. Legend: NCBI strains, US VRSA strains (Kos et al.), HSM strains. Labelled strain: VRS5]

MLST+: 1542 core genome genes found in all strains

“Live” Demonstration

PHYLOViZ

PROs:

• Handles thousands of profiles

• Fast calculation

• Easy to annotate and explore metadata

• Allows for basic statistics on profiles and metadata

• Allows for advanced statistics on MSTs (PLoS One. 2015 Mar 23;10(3):e0119315)

• Exports high quality graphical formats

• Allows plugin development

CONs:

• goeBURST and goeBURST MST only (Neighbour Joining and UPGMA soon)

• Java knowledge is needed to code new plugins

Final Remarks

Phylogenetic inference always has an underlying model. The choice of method depends on the data being analyzed and on the underlying question.

With the increasing availability of bacterial genomes, the methods that allow their comparison need to be efficient and scalable

Metadata should always be used to evaluate the algorithm results

PHYLOViZ provides a visualization framework to analyze inferred patterns of descent based on goeBURST, including detailed statistics, and allows easy integration of metadata with the algorithm results

Any sequence-based typing method that generates allelic profiles can be analyzed with this framework, including any NGS-derived schema (e.g. cgMLST, SNPs)

Ongoing Phyloviz work

Modular plugin architecture: allows expansion and addition of new capabilities

• Other analysis algorithms / custom rules

• New visualization modules

• Allow the analysis of other data types

• Complementary statistics modules

Try to address user’s needs… We need your feedback!

Phyloviz is free, open-source software

Acknowledgements

Alexandre Francisco, Cátia Vaz, Pedro Monteiro, Mário Ramirez, José Melo-Cristino

Initial funding from Fundação para a Ciência e Tecnologia

Draft Scientific Programme:

Plenaries:

1) Small Scale Microbial Epidemiology

2) Large Scale Microbial Epidemiology

3) Bioinformatics for Genome-based Microbial Epidemiology

4) Population Genetics: Pathogen Emergence

5) Population Dynamics: Transmission networks and surveillance

6) Molecular Epidemiology for Global Health and One Health

Parallel Sessions:

1) Food and Environmental pathogens

2) Microbial Forensics

3) Virus

4) Fungi and Yeasts

5) Novel Diagnostics methodologies

6) Novel Typing approaches

7) Phylogenetic Inference

8) Interactive Illustration Platforms

Save the date !

Phyloviz Visualization Examples

Phyloviz

Burkholderia pseudomallei

Clinical

animal NA

community

HospitalSurv/Outb

Enterococcus faecium

Streptococcus pneumoniae CC90, coloured by country of origin

Streptococcus pneumoniae 10 largest clonal complexes coloured by serotype
