Integrating phylogenetic inference and metadata visualization for NGS data
João André Carriço, PhD
Microbiology Institute/Institute for Molecular Medicine
Faculty of Medicine, University of Lisbon
Portugal
Integrating phylogenetic inference and metadata visualization for NGS data
http://im.fm.ul.pt
http://imm.fm.ul.pt
http://www.joaocarrico.info
Workshop 20: Typing of Bacterial Pathogens in 2015: Expanding the scope of NGS
Conflicts of Interest
NOTHING TO DISCLOSE
Charles Darwin's “tree of life” in Notebook B, 1837-1838
Darwin and the tree of life
Phylogenetic methods aim to infer the relationships between taxa by identifying their common ancestors.
Assumptions: the characters being compared are homologous and independent, i.e. they share a common ancestor and each character was subject to evolutionary forces individually.
Phylogenetic Inference
ATTGGGG ATGGGGG
AT?GGGG
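The two-sequence example above can be reproduced with a few lines of Python. This is only a minimal sketch, assuming two already-aligned sequences of equal length; the function name `unresolved_ancestor` is made up for illustration. Sites where the descendants agree are kept, and sites where they differ are marked as unresolved.

```python
def unresolved_ancestor(seq_a: str, seq_b: str) -> str:
    """Keep sites where both descendants agree; mark the others with '?'."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(a if a == b else "?" for a, b in zip(seq_a, seq_b))


# The two descendant sequences shown on the slide.
ancestor = unresolved_ancestor("ATTGGGG", "ATGGGGG")
print(ancestor)  # AT?GGGG
```

With only two descendants the third site cannot be resolved; additional taxa or an explicit evolutionary model would be needed to choose between T and G.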
Software for Phylogenetic trees: based on sequence alignments
• MEGA
• http://www.megasoftware.net/
• Splitstree
• http://www.splitstree.org/
• Geneious
• http://www.geneious.com/
• FastTree
• http://www.microbesonline.org/fasttree
• RAxML
• http://sco.h-its.org/exelixis/web/software/raxml/index.html
• PHYLIP
• http://evolution.genetics.washington.edu/phylip.html
• BEAST
• http://beast.bio.ed.ac.uk/
And many many others…
Sequence Alignment methods
Kos, V.N. et al., 2012. Comparative genomics of vancomycin-resistant Staphylococcus aureus strains and their positions within the clade most commonly associated with Methicillin-resistant S. aureus hospital-acquired infection in the United States. mBio, 3(3).
Maximum Likelihood tree of concatenated SICOs
Caveats:
• Computationally intensive: some methods cannot be applied to hundreds or thousands of strains
• Require specialized knowledge of methods and software to define parameters
• Some phenomena violate the assumptions (recombination, convergent evolution, etc.)
Sequence Based Typing Methods
Strain genomic information encoded as a numeric sequence
Sanger sequencing:
• MLST: Gene allele identifier
• MLVA: Number of repeats
NGS approaches:
• Gene-by-Gene / allele based:
  wgMLST: core + pan genome genes are represented
  cgMLST: just core genome
• SNP Typing: Polymorphism
Each unique gene sequence (allele) is assigned an integer ID by comparison with online databases.
Allelic profile: 12 - 9 - 11 - 7 - 11 - 20 - 3
Each allelic profile, also known as a sequence type (ST), is unequivocally identified by an integer.
Single locus variant (SLV): 12 - 10 - 11 - 7 - 11 - 20 - 3
Double locus variant (DLV): 12 - 10 - 11 - 11 - 11 - 20 - 3
Triple locus variant (TLV): 10 - 10 - 11 - 11 - 11 - 2 - 3
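The SLV / DLV / TLV terminology amounts to counting the loci at which two allelic profiles differ. The sketch below assumes profiles are stored as equal-length tuples of integers; the reference profile is the one quoted above, while the three variant profiles and the helper names are illustrative only.

```python
REFERENCE = (12, 9, 11, 7, 11, 20, 3)  # allelic profile quoted on the slide

LABELS = {0: "identical", 1: "SLV", 2: "DLV", 3: "TLV"}


def locus_differences(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different alleles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def classify(profile_a, profile_b):
    """Name the relationship between two profiles (SLV, DLV, TLV, ...)."""
    diff = locus_differences(profile_a, profile_b)
    return LABELS.get(diff, f"{diff} loci differ")


# Illustrative variants of the reference profile (not real STs).
variants = [
    (12, 10, 11, 7, 11, 20, 3),   # one locus changed
    (12, 10, 11, 11, 11, 20, 3),  # two loci changed
    (12, 10, 11, 11, 11, 2, 3),   # three loci changed
]
for variant in variants:
    print(variant, classify(REFERENCE, variant))
```

Only the identity of the allele at each locus matters here, not how many nucleotides differ between the two alleles.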
Bacterial chromosome
MLST
SNP NGS Approach
Good approach for monomorphic species. For non-monomorphic species, SNPs in genome regions where recombination was detected need to be removed to avoid confounding the phylogenetic signal.
Workflow: sample → NGS / WGS → reads → mapping to reference → FASTA file with SNPs
Files produced along the way: fastq files, BAM files, VCF files
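Removing SNPs that fall in recombinant regions can be sketched as a simple interval filter. The example assumes the SNP positions and the recombinant regions (1-based, inclusive coordinates) have already been produced by earlier steps that are not shown; all positions and the function name are hypothetical.

```python
def drop_recombinant_snps(snp_positions, recombinant_regions):
    """Return the SNP positions that lie outside every (start, end) region."""
    kept = []
    for pos in snp_positions:
        in_region = any(start <= pos <= end for start, end in recombinant_regions)
        if not in_region:
            kept.append(pos)
    return kept


# Hypothetical SNP positions and recombinant regions on a reference genome.
snps = [120, 4510, 4730, 9800, 15002]
regions = [(4000, 5000), (15000, 15500)]
print(drop_recombinant_snps(snps, regions))  # [120, 9800]
```

Only the SNPs kept by such a filter would then be concatenated into the FASTA file used for tree building.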
Gene by Gene NGS Approach
Software currently available:
BIGSdb (Jolley, K.A. & Maiden, M.C.J., 2010. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics)
RIDOM™ SEQSPHERE+ (http://www.ridom.com/seqsphere/)
Central nomenclature server: schemas, allele definitions and identifiers
Workflow: sample → NGS / WGS → reads → assembly → contigs → Output: Allelic Profile
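The last step of the gene-by-gene workflow, turning gene sequences into an allelic profile, can be illustrated with an exact-match lookup. This is a deliberately simplified sketch: the loci, sequences and the `call_alleles` function are hypothetical, novel alleles are numbered locally rather than by a central nomenclature server, and the tools named above rely on sequence similarity searches and curated definitions rather than plain dictionary lookups.

```python
def call_alleles(genes, scheme):
    """Map each locus to an allele ID.

    genes:  {locus: sequence found in the assembled contigs}
    scheme: {locus: {sequence: allele_id}}, extended in place with novel alleles
    """
    profile = {}
    for locus, known in scheme.items():
        sequence = genes.get(locus)
        if sequence is None:
            profile[locus] = None  # locus not found in this assembly
            continue
        sequence = sequence.upper()
        if sequence not in known:
            # Assumption: a novel allele simply receives the next free integer.
            known[sequence] = max(known.values(), default=0) + 1
        profile[locus] = known[sequence]
    return profile


# Hypothetical three-locus scheme and one assembled strain.
scheme = {
    "locusA": {"ATGAAA": 1, "ATGAAG": 2},
    "locusB": {"ATGCCC": 1},
    "locusC": {"ATGTTT": 1},
}
strain_genes = {"locusA": "ATGAAG", "locusB": "atgccg"}
profile = call_alleles(strain_genes, scheme)
print(profile)  # {'locusA': 2, 'locusB': 2, 'locusC': None}
```

The missing locus is reported as `None` so that later steps can decide how to handle incomplete profiles.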
Algorithms for Phylogenetic Inference
Based on the distance matrix (input: sequence alignments or allelic profiles):
• Hierarchical clustering methods: UPGMA, Single Linkage and Complete Linkage
• Neighbor-joining
• Minimum Spanning Trees
Maximum Parsimony methods (input: sequence alignments)
Based on rules (Graphic Matroids) (input: allelic profiles):
• goeBURST
Maximum Likelihood methods (input: sequence alignments)
Bayesian inference methods (input: sequence alignments)
Inferring phylogeny from allelic profiles
Assume that you have only 3 genes and that each number corresponds to a different allele of each gene. The minimal assumption is that an SLV may correspond to a possible phylogenetic descent.
1-1-1 1-1-2 1-2-1 1-2-2 1-2-3
SLV SLV SLV
SLV SLV
SLV
11 possible trees….
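The figure of 11 trees can be checked by brute force: list every SLV link among the five profiles and count the subsets of four links that connect all STs without forming a cycle. The sketch below numbers the STs 1 to 5 in the order they appear on the slide; the function names are illustrative.

```python
from itertools import combinations

# The five three-locus profiles from the slide, numbered in order.
STS = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}


def slv_links(sts):
    """All pairs of STs whose profiles differ at exactly one locus."""
    return [
        (a, b)
        for a, b in combinations(sorted(sts), 2)
        if sum(x != y for x, y in zip(sts[a], sts[b])) == 1
    ]


def is_spanning_tree(nodes, edges):
    """True if the edges are acyclic (n - 1 acyclic edges span all n nodes)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # this edge would close a cycle
        parent[root_a] = root_b
    return True


def count_spanning_trees(sts):
    links = slv_links(sts)
    return sum(
        is_spanning_tree(sts, subset)
        for subset in combinations(links, len(sts) - 1)
    )


print(len(slv_links(STS)))        # 6 SLV links
print(count_spanning_trees(STS))  # 11 possible trees
```

Every one of these trees uses only SLV links, which is why additional rules are needed to pick a single one.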
eBURST model
More similar STs should denote closely related strains from an evolutionary point of view. STs with more SLVs can be regarded as common ancestors.
Links between STs depict descent relations.
With these assumptions, connected STs should share an evolutionary path.
Maynard Smith J., et al. 2000. Bioessays 22:1115-22
eBURST: Feil E. et al., J Bac 2004
1-1-1
1-1-2
1-2-1
1-2-2
1-2-3
goeBURST
#SLVs  #DLVs  #TLVs  Freq  STid
2      2      0      1     1
2      2      0      1     2
3      1      0      1     3
3      1      0      1     4
2      2      0      1     5
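The counts in this table follow directly from the five profiles. As a small check, assuming the STs are numbered 1 to 5 in the order 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3, the sketch below tallies how many other STs differ at one, two or three loci; the function name is illustrative.

```python
from collections import Counter

STS = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}


def variant_counts(sts):
    """For each ST, return (#SLVs, #DLVs, #TLVs) among the other STs."""
    table = {}
    for st, profile in sts.items():
        diffs = Counter(
            sum(x != y for x, y in zip(profile, other))
            for other_st, other in sts.items()
            if other_st != st
        )
        table[st] = (diffs[1], diffs[2], diffs[3])
    return table


for st, counts in variant_counts(STS).items():
    print(st, counts)
```

ST3 and ST4 have the most SLVs, which is what makes them preferred link targets under the tie-break rules.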
Implementing the eBURST rules as a graphic matroid problem allows for a globally optimal solution for the placement of the ST links.
Francisco et al, BMC Bioinf, 2009
More SLVs / lower ID
Connects to ST4 because of #SLVs
Final goeBURST tree: unique solution guaranteed
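One way to picture the graphic matroid formulation is a Kruskal-style greedy procedure in which links are sorted by a total order and added whenever they do not close a cycle. The sketch below is a simplified approximation, not the published implementation: the exact ordering of the tie-break key (locus differences, then #SLVs, #DLVs, #TLVs, frequency and finally ST ID) and all function names are assumptions made for illustration.

```python
from itertools import combinations

STS = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}
FREQ = {st: 1 for st in STS}  # every ST observed once, as in the slide's table


def distance(p, q):
    return sum(a != b for a, b in zip(p, q))


def level_counts(sts, st, max_level=3):
    """[#SLVs, #DLVs, #TLVs] of one ST among all the others."""
    counts = [0] * max_level
    for other, profile in sts.items():
        d = distance(sts[st], profile)
        if other != st and 1 <= d <= max_level:
            counts[d - 1] += 1
    return counts


def edge_key(sts, freq, u, v):
    """Assumed total order: fewer differences first, then more SLVs/DLVs/TLVs,
    then higher frequency, then lower ST IDs."""
    key = [distance(sts[u], sts[v])]
    for a, b in zip(level_counts(sts, u), level_counts(sts, v)):
        key += [-max(a, b), -min(a, b)]
    key += [-max(freq[u], freq[v]), -min(freq[u], freq[v])]
    key += [min(u, v), max(u, v)]
    return tuple(key)


def goeburst_like_tree(sts, freq):
    """Kruskal's algorithm over all ST pairs, using the order defined above."""
    parent = {st: st for st in sts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    pairs = combinations(sorted(sts), 2)
    for u, v in sorted(pairs, key=lambda e: edge_key(sts, freq, *e)):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the link only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


print(goeburst_like_tree(STS, FREQ))  # [(3, 4), (1, 3), (2, 4), (3, 5)]
```

Because the key ends with the ST IDs, no two links ever compare as equal, so the greedy procedure returns a single tree for a given dataset.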
Applying goeBURST
1-1-1 1-1-2 1-2-1 1-2-2 1-2-3
SLV SLV SLV
SLV SLV
SLV
11 possible trees….
All of these are valid goeBURST solutions. If all STs have the same frequency in the dataset, the tie has to be broken by ST ID.
goeBURST output examples
Largest S. aureus MLST CC
1067 of 2650 STs total
2nd largest S. aureus CC
252 STs
goeBURST FULL MST
• The goeBURST rules can be expanded to any number of loci while maintaining the assumptions of the underlying evolutionary model
• Adds an evolutionary model to the basic Minimum Spanning Tree approach
• Advantage: very fast to calculate compared to phylogenetic analysis algorithms
• Advantage: if the strains are closely related, the internal nodes are actual strains, unlike in traditional phylogenetic methods
• Disadvantage: does not create internal nodes representing putative recent common ancestors
Allelic profiles
Accessory data (“metadata”): antibiogram, serotype, origin info (patient), …
Analysis (goeBURST)
Other typing method
Present the data in a meaningful way
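A very small illustration of combining analysis results with accessory data is to tabulate one metadata field per group of isolates. The sketch assumes that a grouping (for example clonal complexes) is already available; the isolate names, group labels, serotypes and the function name are all hypothetical.

```python
from collections import Counter, defaultdict


def summarise_metadata(groups, metadata, field):
    """Count the values of one metadata field within each group of isolates."""
    summary = defaultdict(Counter)
    for isolate, group in groups.items():
        value = metadata.get(isolate, {}).get(field, "NA")  # NA = not recorded
        summary[group][value] += 1
    return {group: dict(counts) for group, counts in summary.items()}


# Hypothetical grouping and metadata; iso4 has no metadata record.
groups = {"iso1": "CC-A", "iso2": "CC-A", "iso3": "CC-B", "iso4": "CC-B"}
metadata = {
    "iso1": {"serotype": "19A"},
    "iso2": {"serotype": "19A"},
    "iso3": {"serotype": "14"},
}
print(summarise_metadata(groups, metadata, "serotype"))
# {'CC-A': {'19A': 2}, 'CC-B': {'14': 1, 'NA': 1}}
```

A table like this is the textual counterpart of colouring tree nodes by a metadata field.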
Integrating Data Analysis and Visualization
Using Phyloviz (http://www.phyloviz.net)
PHYLOViZ
Can be easily applied to:
- MLST
- MLVA
- SNP data*
- Gene presence/absence
*Conversion of VCF to PHYLOViZ: https://github.com/nickloman/misc-genomics-tools/blob/master/scripts/vcf2phyloviz.py (Thanks Nick!)
PHYLOViZ
Example of visualization with MLST+ (core genome) data of VRSA and MRSA strains
Core genome comparison - Workflow
Core genome from all available fully sequenced S. aureus strains in NCBI
Using strain COL genes as reference
1866 target loci found for a cgMLST schema (RIDOM SeqSphere+)
Call alleles for strains under study
Remove loci with missing data in the strains under analysis
1542 target genes kept for whole genome comparison
goeBURST Minimum Spanning Tree of the resulting allelic profiles (PHYLOViZ software)
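The step that reduces the target loci to those called in every strain can be sketched as follows. The example assumes allele calls are stored per strain as a dictionary, with `None` marking a missing call; the strains, loci and function name are hypothetical and much smaller than the 1866 and 1542 loci mentioned above.

```python
def drop_incomplete_loci(profiles):
    """Keep only the loci that have an allele call in every strain.

    profiles: {strain: {locus: allele_id or None}}
    Returns (kept_loci, {strain: [allele IDs in kept_loci order]}).
    """
    loci = sorted({locus for calls in profiles.values() for locus in calls})
    kept = [
        locus
        for locus in loci
        if all(calls.get(locus) is not None for calls in profiles.values())
    ]
    filtered = {
        strain: [calls[locus] for locus in kept]
        for strain, calls in profiles.items()
    }
    return kept, filtered


# Hypothetical allele calls; locus2 is missing in strainB.
profiles = {
    "strainA": {"locus1": 1, "locus2": 4, "locus3": 2},
    "strainB": {"locus1": 1, "locus2": None, "locus3": 5},
    "strainC": {"locus1": 3, "locus2": 4, "locus3": 2},
}
kept_loci, filtered = drop_incomplete_loci(profiles)
print(kept_loci)  # ['locus1', 'locus3']
print(filtered)   # {'strainA': [1, 2], 'strainB': [1, 5], 'strainC': [3, 2]}
```

The trade-off is that every strain added to the analysis can only shrink the set of shared loci.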
Core genome comparison
VRSA
NCBI strains
US VRSA strains (Kos et al)
HSM strains
MRSA srp
VRS5
MLST+: 1542 genes
Core genome genes found in all strains
65
“Live” Demonstration
PHYLOViZ
PROs:
• Handles thousands of profiles
• Fast calculation
• Easy to annotate and explore metadata
• Allows for basic statistics on profiles and metadata
• Allows for advanced statistics on MSTs (PLoS One. 2015 Mar 23;10(3):e0119315)
• Exports high quality graphical formats
• Allows plugin development
CONs:
• goeBURST and goeBURST MST only (Neighbour Joining and UPGMA soon)
• Java knowledge needed to code new plugins
Final Remarks
Phylogenetic inference always has an underlying model. The choice of method depends on the data being analyzed and the underlying question.
With the increasing availability of bacterial genomes, the methods that allow their comparison need to be efficient and scalable.
Metadata should always be used to evaluate the algorithm results.
PHYLOViZ provides a visualization framework to analyze inferred patterns of descent based on goeBURST, including detailed statistics, and allows easy integration of metadata with algorithm results.
Any sequence-based typing method that generates allelic profiles can be analyzed by this framework, including any NGS-derived schema (e.g. cgMLST, SNPs).
Ongoing Phyloviz work
Modular plugin architecture
• Allows expansion and addition of new capabilities
• Other analysis algorithms / custom rules
• New visualization modules
• Allow the analysis of other data types
• Complementary statistics modules
Try to address users' needs… We need your feedback!
Phyloviz is free, open-source software
Acknowledgements
Alexandre Francisco
Cátia Vaz
Pedro Monteiro
Mário Ramirez
José Melo-Cristino
Initial funding from Fundação para a Ciência e Tecnologia
Draft Scientific Programme:
Plenaries:
1) Small Scale Microbial Epidemiology
2) Large Scale Microbial Epidemiology
3) Bioinformatics for Genome-based Microbial Epidemiology
4) Population Genetics: Pathogen Emergence
5) Population Dynamics: Transmission networks and surveillance
6) Molecular Epidemiology for Global Health and One Health
Parallel Sessions:
1) Food and Environmental pathogens
2) Microbial Forensics
3) Virus
4) Fungi and Yeasts
5) Novel Diagnostics methodologies
6) Novel Typing approaches
7) Phylogenetic Inference
8) Interactive Illustration Platforms
Save the date!
Phyloviz Visualization Examples
Phyloviz
Burkholderia pseudomallei
Clinical
animal NA
community
Hospital
Surv/Outb
Enterococcus faecium
Streptococcus pneumoniae CC90, coloured by country of origin
Streptococcus pneumoniae, 10 largest clonal complexes, coloured by serotype