Toolbox for bacterial population analysis using NGS

INTRODUCTION OF BACTERIAL POPULATION GENOMICS AND EVOLUTION

MIRKO ROSSI

ASS. PROF. ENVIRONMENTAL HYGIENE, FACULTY OF VETERINARY MEDICINE

I’m a vet, not a bioinformatician... I’m a good example of an end user!

I do not want to teach population genetics today … just give you some tips on how to do it using NGS in bacteria

If you are interested in bacterial population analysis … we are organizing an ad hoc course in Spring ..

There are many more software tools and pipelines... These are the ones I like, know and use

If you want the slides send an Email to me [email protected]

If you are an MSc student in bioinformatics and interested in a thesis in applied bioinformatics in public health microbiology and pathogen surveillance, please contact me ..


Bacterial population

A group of individuals of the same species

POPULATIONS, not individuals, evolve

Population and community are two different concepts … WE ARE SPEAKING OF INDIVIDUALS OF THE SAME SPECIES!!!! … although the definition of species in bacteriology is quite vague

Population genomics attempts to understand a population through whole-genome analysis of a sample of it, investigating the variation among a subset of individual members of the population

“Sequence data is ideal for this, as the differences between individuals are often tiny (i.e. there is very little variation) since they belong to a single population, and DNA sequence data allows us to detect single nucleotide changes (i.e. provides high resolution)” (Kat Holt)

The sample is a subset of the population


Population (universe, reality, state of nature, truth): parameters

Noise, error, perturbation

Sample (finite, random): statistics

Statistical inference: Extract maximum information from sample in order to draw conclusions about population

Inductive not deductive

Source John Bunge

How many samples do I need to sequence?

It depends on your question! Accuracy is important.. but big numbers help! Draft genomes are enough. Closing a genome is a waste of time and money!

Good draft: 100 €/sample; closed: > 3000 €/sample

Include in your analysis as much diversity as possible (time, space, phenotypes, ...)

Sequence as much as you can … just stop before you go broke!!

1000 strains < 100 000 €
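The budget arithmetic behind these figures can be sketched as follows. The two prices are the ones quoted above; the function name and the example budget are only illustrative.

```python
def genomes_within_budget(budget_eur: int, cost_per_genome_eur: int) -> int:
    """Return how many genomes can be sequenced without exceeding the budget."""
    if cost_per_genome_eur <= 0:
        raise ValueError("cost per genome must be positive")
    return budget_eur // cost_per_genome_eur


DRAFT_COST_EUR = 100    # good draft genome, as quoted above
CLOSED_COST_EUR = 3000  # lower bound for a closed genome, as quoted above

budget = 100_000  # illustrative budget
drafts = genomes_within_budget(budget, DRAFT_COST_EUR)
closed = genomes_within_budget(budget, CLOSED_COST_EUR)
print(f"{budget} EUR: about {drafts} draft genomes, or at most {closed} closed genomes")
```

For population-level questions the larger number of draft genomes is usually the better use of the same money.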

Bacterial population… different levels

Population of H. pylori living in a single stomach

Population of H. pylori circulating globally

What do we want to measure?

Genetic drift
◦ The change in the gene pool of a small population due to chance

Natural selection
◦ Alleles that increase fitness accumulate in the population
◦ Causes ADAPTATION of populations

Gene flow
◦ Genetic exchange due to the migration of individuals between populations

How do we measure (using NGS)?

Identify variants:
◦ SNP approach
◦ Gene-by-gene approach

Define which part of the gene pool is common to all the individuals of the population (core) and which part is not (accessory)

Use phylogenetic frameworks for reconstructing genealogy and non-phylogenetic clustering methods for inferring population structure
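The core/accessory split can be sketched on a toy gene presence table. This is a minimal illustration, assuming genes have already been clustered into orthologs; the strain and gene names are hypothetical.

```python
from typing import Dict, Set, Tuple


def split_pangenome(genes_by_strain: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split the gene pool into core (present in every strain) and accessory (the rest)."""
    gene_sets = list(genes_by_strain.values())
    if not gene_sets:
        return set(), set()
    pangenome = set().union(*gene_sets)      # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes shared by all strains
    return core, pangenome - core


# Toy data: hypothetical strain and gene names
example = {
    "strain1": {"gyrA", "recA", "cdtB"},
    "strain2": {"gyrA", "recA"},
    "strain3": {"gyrA", "recA", "tetO"},
}
core, accessory = split_pangenome(example)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

In practice the core is often defined with a tolerance (for example, present in nearly all strains) because draft assemblies can miss genes.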

Applications
◦ Outbreak determination
◦ Pathogen transmission
◦ Understanding epidemics
◦ Pathogen surveillance
◦ Understanding evolution of bacteria
◦ ….

@jennifergardy

https://twitter.com/jennifergardy


Identifying variants: SNP approaches

sample -> NGS (WGS) -> reads -> mapping to reference -> VCF/FASTA file with SNPs

• Needs a reference strain
• Monomorphic (clonal) species
• Recombination/horizontal gene transfer is a problem
• Difficult to create a nomenclature

Source J. Carriço
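Once every sample has been mapped to the same reference, the SNP alignment is simply the set of variable columns. A minimal sketch, assuming the per-sample consensus sequences are already aligned to equal length; the sequences and names are toy data, and skipping columns with non-ACGT characters is a simplifying assumption.

```python
from typing import Dict, List, Tuple


def variable_sites(alignment: Dict[str, str]) -> Tuple[List[int], Dict[str, str]]:
    """Return 0-based variable positions and the SNP-only alignment.

    Columns containing anything other than A, C, G or T are skipped.
    """
    seqs = list(alignment.values())
    if not seqs:
        return [], {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    positions = []
    for i in range(length):
        column = {s[i].upper() for s in seqs}
        if column <= set("ACGT") and len(column) > 1:
            positions.append(i)
    snps = {name: "".join(seq[i] for i in positions) for name, seq in alignment.items()}
    return positions, snps


toy = {
    "ref":     "ACGTACGTAC",
    "strainA": "ACGTTCGTAC",
    "strainB": "ACGAACGTNC",
}
positions, snps = variable_sites(toy)
print(positions)
print(snps)
```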

Identifying variants: Gene-by-gene

sample -> NGS (WGS) -> reads -> assembly -> contigs -> output: allelic profile

Central nomenclature server: schemas, allele definitions and identifiers

• No need for a reference strain
• Buffers the effect of recombination
• Simpler to create a nomenclature
• Population structure of non-monomorphic species
• Multiple schemas can be defined for a single species

Source J. Carriço
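The idea of an allelic profile can be sketched as follows. Here a local dictionary stands in for the central nomenclature server, alleles are numbered in order of first appearance, and 0 marks a missing locus; all of these are assumptions for illustration, and the locus names and sequences are toy data.

```python
from typing import Dict


def call_alleles(loci_by_strain: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, int]]:
    """Turn per-locus sequences into allele numbers (0 = locus not found)."""
    allele_ids: Dict[str, Dict[str, int]] = {}  # stand-in for the nomenclature server
    loci = sorted({locus for seqs in loci_by_strain.values() for locus in seqs})
    profiles: Dict[str, Dict[str, int]] = {}
    for strain, seqs in loci_by_strain.items():
        profile = {}
        for locus in loci:
            seq = seqs.get(locus)
            if seq is None:
                profile[locus] = 0
                continue
            known = allele_ids.setdefault(locus, {})
            # Re-use the identifier of a known sequence, otherwise assign the next number
            profile[locus] = known.setdefault(seq.upper(), len(known) + 1)
        profiles[strain] = profile
    return profiles


toy = {
    "strain1": {"locusA": "ATGAAA", "locusB": "ATGCCC"},
    "strain2": {"locusA": "ATGAAA", "locusB": "ATGCCT"},
    "strain3": {"locusA": "ATGAAG"},
}
profiles = call_alleles(toy)
for strain, profile in profiles.items():
    print(strain, profile)
```

Because any sequence difference creates a new allele number, a recombination event that changes many bases in one locus counts as a single difference, which is why this approach buffers the effect of recombination.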

Sequence platforms

Loman et al., 2012, Nature Reviews Microbiology

… I’m just using Illumina

For both de novo and re-sequencing

At the moment Illumina gives the best benefit-cost ratio:
• High throughput
• Accuracy
• Possibility for multiplexing
• Reasonable workflow time
• Easily accessible

For small genomes (1 to 2 Mb) it is nowadays possible to sequence at ~90 euro/sample with a minimum of 40x coverage
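Coverage is the total number of sequenced bases divided by the genome size. A minimal sketch of that arithmetic; the 150 bp read length and the 1.6 Mb genome size are assumptions chosen for illustration, not values from the slides.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 1_600_000  # assumed genome size in bp (within the 1 to 2 Mb range)
READ_LENGTH = 150        # assumed read length in bp

n = reads_needed(40, READ_LENGTH, GENOME_SIZE)
print(n, "reads give", round(expected_coverage(n, READ_LENGTH, GENOME_SIZE), 2), "x")
```

This is an average; real depth varies along the genome, so aiming somewhat above the minimum is prudent.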

I have the reads for each strain.. OK, and now?

An overview of main programs, platforms and approaches … sometimes it is a question of style!

I want some results from reads…

You can always map your reads against a close reference genome using “classical” short-read aligners and extract SNPs: BWA, for example

Here is just a (long) list: http://omictools.com/read-alignment-c83-p1.html

Now you just need to decide on the reference genome

Note that you might need to select more than one reference genome to tune your analysis

… Be aware that software designed specifically for bacterial genomes is available


Assembly-free analyses

SNP CALLING AND CORE GENOME ALIGNMENTS - REFERENCE BASED MAPPING

Snippy
◦ Samples are processed one by one
◦ A set of results obtained with the same reference can be combined to generate a core SNP alignment
◦ A lot of output files
◦ Variants: SNPs, MNPs, INDELs, MIX

Input Requirements
◦ a reference genome in FASTA or GENBANK format (can be in multiple contigs)
◦ query sequence read files in FASTQ or FASTA format (can be .gz compressed)

Wombac
◦ Fast and “dirty”; several samples in a run
◦ Computations can be re-used for building new trees
◦ Looks for substitution SNPs, not indels, and it may miss some SNPs

Input Requirements
◦ a reference genome in FASTA or GENBANK format (can be in multiple contigs)
◦ query sequences in one of:
◦ a folder containing FASTQ short reads: e.g. R1.fq.gz R2.fq.gz
◦ a multi-FASTA file: e.g. contigs.fa or NC_273461.fna
◦ a .tar.gz file containing FASTA contig files: e.g. Ecoli_K12mut.contig.tar.gz (from EBI/NCBI)

https://github.com/tseemann/wombac https://github.com/tseemann/snippy

@torstenseemann


Assembly-free analyses

SHORT READ SEQUENCE TYPING

Srst2
◦ Designed specifically for bacterial genomes
◦ Queries Illumina sequence data against an MLST database and/or a database of gene sequences
◦ Reports the presence of STs (allele designation) and/or reference genes

Input Requirements
◦ Query: Illumina reads (fastq.gz format, but other options are available)
◦ A FASTA reference sequence database to match to:
◦ For MLST, this means a FASTA file of all allele sequences. If you want to assign STs, you also need a tab-delimited file which defines the ST profiles as a combination of alleles.
◦ For resistance/virulence genes, this means a FASTA file of all the resistance genes/alleles that you want to screen for, clustered into gene groups.
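How a tab-delimited profile table turns a combination of alleles into an ST can be sketched like this. The table below is toy data with hypothetical locus names, and the layout (ST in the first column, one column per locus) is an assumption; real profile files may carry extra columns. This is not srst2's own code.

```python
from typing import Dict, List, Optional, Tuple


def parse_profiles(text: str) -> Tuple[List[str], Dict[Tuple[int, ...], int]]:
    """Parse a tab-delimited table: header 'ST<TAB>locus...', then one ST per row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    loci = lines[0].split("\t")[1:]
    table: Dict[Tuple[int, ...], int] = {}
    for ln in lines[1:]:
        fields = ln.split("\t")
        table[tuple(int(x) for x in fields[1:])] = int(fields[0])
    return loci, table


def assign_st(calls: Dict[str, int], loci: List[str],
              table: Dict[Tuple[int, ...], int]) -> Optional[int]:
    """Return the ST matching the allele calls, or None for an unknown combination."""
    return table.get(tuple(calls.get(locus, 0) for locus in loci))


PROFILE_TABLE = "ST\tlocusA\tlocusB\tlocusC\n1\t1\t1\t1\n2\t1\t2\t1\n"  # toy table
loci, table = parse_profiles(PROFILE_TABLE)
known = assign_st({"locusA": 1, "locusB": 2, "locusC": 1}, loci, table)
novel = assign_st({"locusA": 3, "locusB": 2, "locusC": 1}, loci, table)
print(known, novel)
```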

https://github.com/katholt/srst2

@DrKatHolt


Stand-alone pipeline for SNP variants

Nullarbor
◦ Clean reads
◦ Species identification: k-mer analysis against a known genome database (Kraken)
◦ De novo assembly
◦ Annotation
◦ MLST
◦ Resistome
◦ SNP variants

https://github.com/tseemann/nullarbor

@torstenseemann



… or you might prefer to assemble your genome!

When you know little or nothing about your dataset (it is not possible to select a reference genome)

In case of deep comparative genomics, when you are also interested in the accessory genome (genes absent from your reference)

To extract the pangenome

Because having all your dataset assembled will facilitate downstream applications

To develop a common NOMENCLATURE

The never ending nomenclature story…

Source J. Carriço

Assembly of short reads

REFERENCE BASED ASSEMBLY

Mira (best assembler … for geeks since 1999)
◦ Multi-pass assembler/mapper for small genomes (up to 150 Mb)
◦ Has a full overview of the whole project at any time of the assembly, using all available data and learning from mistakes
◦ Marks places of interest with tags so that these can be found quickly in finishing programs
◦ Can also do de novo and hybrid assembly

Input Requirements
◦ Various formats (CAF, FASTA, FASTQ or PHD) from Sanger, 454, Ion Torrent, Illumina

DE NOVO ASSEMBLY

Spades (a very good assembler for lazy people)
◦ Is intended for both standard isolates and single-cell MDA bacteria assemblies
◦ It does its work, and very well
◦ Simple to run: spades.py --careful -1 R1.fastq.gz -2 R2.fastq.gz -o output_folder
◦ Can use Nanopore and PacBio reads for hybrid assembly

Andrey’s lecture from WBG2014

https://docs.google.com/presentation/d/1wjrJGKhQQEHDwHF5OhQQyKnj5_c7duTAQjcDsBHTkWQ/edit#slide=id.g47b5b1626_0793

http://sourceforge.net/projects/mira-assembler/ http://bioinf.spbau.ru/spades

@BaCh_mira




https://twitter.com/BaCh_mira


Pangenome alignment (up to 50 strains)

MUGSY

Genomes should be very similar

Mugsy (like Mauve) generates a multiple block local alignment

Alignment output is in MAF format

MAUVE

Large-scale evolutionary events

It can align more divergent strains than Mugsy: as little as 50% nucleotide identity

It aligns the pan-genome

Complete genome alignment in the eXtended Multi-FastA (XMFA) format

List groups of genes that are predicted to be positionally orthologous

GUI available

http://mugsy.sourceforge.net/

http://darlinglab.org/mauve/


Core genome alignment

PARSNP

Designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to a few hours

Strains must be very, very similar … it uses MUMi to select the nearest genomes: only the ones with distance <= 0.01 are included, all others are discarded.

Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multiple alignments

Results are visualized using a GUI

https://harvest.readthedocs.org/en/latest/content/parsnp.html


Gene-by-gene: pangenome, coregenome, accessory genome

assembly -> structural annotation (Prodigal, Prokka, RAST) -> ortholog clustering (OrthAgogue, Roary)

Structural annotation

PRODIGAL

Gene finder

Very fast: 3000 genomes in about a week (8 CPUs, 16 GB RAM)

Prodigal can be run in one step on a single genomic sequence or on a draft genome containing many sequences.

It does not need to be supplied with any knowledge of the organism, as it learns all the properties it needs to on its own.

PROKKA

Structural and functional annotation

Fast automatic annotation: < 15 min on multiple cores

Several dependencies, tedious to install (… I told you I’m very lazy!)

http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013?related=1

https://github.com/hyattpd/prodigal/wiki https://github.com/tseemann/prokka






Ortholog clustering

ORTHAGOGUE

High-speed estimation of homology relations within and between species in massive data sets

Easy to use and offers flexibility through a range of optional parameters

Input = all-against-all BLAST tabular output

Output = mcl file

-u -o XX: ignore e-value, use BLAST score, exclude proteins with overlap < XX

ROARY

high speed stand alone pan genome pipeline

128 samples can be analysed in under 1 hour using 1 GB of RAM and a single processor

Input = GFF3 format produced by Prokka

roary -e --mafft *.gff

FastTree -nt -gtr core_gene_alignment.aln > my_tree.newick

Output = several files

https://code.google.com/p/orthagogue/ http://sanger-pathogens.github.io/Roary/


Gene-by-gene: pangenome, coregenome, accessory genome

Ortholog clustering results -> ad hoc scripts -> core genome, accessory genome, pangenome

Everything included in Roary but not in OrthAgogue

Phylogeny: RAxML, FastTree, BEAST

Population structure: BAPS, STRUCTURE

Recombination: BRATNEXTGEN, GUBBINS

cgMLST and wgMLST

[Figure: loci L1 to L9 across Strain 1 to Strain 6]

Core genome -> cgMLST

Core genome + accessory genome = pangenome -> wgMLST

Source J. Carriço

@jacarrico

cgMLST and wgMLST: open source

BACTERIAL ISOLATE GENOME SEQUENCE DATABASE
◦ Jolley & Maiden 2010, BMC Bioinformatics 11:595 - http://pubmlst.org/software/database/bigsdb/
◦ PROs: Freely available, open-source, handles thousands of genomes, has several schemas implemented for MLST for several bacterial species, and some extended MLST and core genome MLST (mainly Neisseria sp. but soon to be expanded)
◦ CONs: Requires Perl knowledge to install and maintain

Source J. Carriço

@jacarrico


cgMLST and wgMLST: commercial software

RIDOM SEQSPHERE+
◦ http://www.ridom.com/seqsphere/
◦ Commercial software with client-server solutions from assembly to allele calling and visualization for core genome MLST (MLST+/cgMLST)

APPLIED MATHS - BIONUMERICS 7.5
◦ http://www.applied-maths.com/news/bionumerics-version-75-released
◦ Commercial software with client-server solutions from assembly to allele calling and visualization for whole genome MLST (wgMLST)

Source J. Carriço

@jacarrico


cgMLST with Genome Profiler

Indexes the alleles of the loci that are shared by the bacterial isolates, implementing both BLASTN and BLASTX

Transforms WGS data into allele profile data

Using a reference genome, it attempts to account for gene paralogy using conserved gene neighborhoods

http://jcm.asm.org/content/53/5/1765.abstract


cgMLST with Genome Profiler

Input files
◦ Reference genome in gbk format (even in multi-gbk format from RAST) or a multi-FASTA file of the allele sequences
◦ Query genomes in FASTA format (complete or draft, in contigs)

If you run the data for the first time, use one of the genomes as reference to build a new cgMLST scheme (ad hoc mode):
◦ perl GeP.pl -r NC_017282.gbk -g genome_list.txt

Data can be run with a cgMLST scheme created previously by GeP:
◦ perl GeP.pl -g genome_list.txt -o

Or you could use a multi-FASTA file of the allele sequences (nt) as reference (in this case all possible paralogs are excluded; a fixed value of 999999999 will be assigned to expect-d):
◦ perl GeP.pl -r NC_017282.ffn -g genome_list.txt -n

cgMLST with Genome Profiler

Output files:
◦ output.txt records the information of all the loci in each of the test genome sequences
◦ difference_matrix.html contains a summary of the analysis and a matrix of pairwise differences between the allelic profiles of the samples
◦ Splitstree.nex allele profile of the isolates in NEXUS format, which can be opened in Splitstree 4
◦ allele_profile.txt matrix of allele profile (input file of STRUCTURE and BAPS)
◦ core_genomes.fas alignment of the core genome in FASTA format
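A matrix of pairwise differences between allelic profiles can be sketched as below. This only illustrates the idea and is not GeP's code; treating 0 as a missing call and ignoring loci where either isolate is missing are assumptions, and the profiles are toy data.

```python
from typing import Dict, List


def allelic_distance(p: List[int], q: List[int], missing: int = 0) -> int:
    """Count loci where both isolates have a call and the allele numbers differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for a, b in zip(p, q) if a != missing and b != missing and a != b)


def distance_matrix(profiles: Dict[str, List[int]]) -> Dict[str, Dict[str, int]]:
    """Symmetric matrix of pairwise allelic differences, keyed by isolate name."""
    return {
        a: {b: allelic_distance(pa, pb) for b, pb in profiles.items()}
        for a, pa in profiles.items()
    }


toy_profiles = {
    "isolateA": [1, 1, 1, 1],
    "isolateB": [1, 2, 1, 0],  # last locus missing
    "isolateC": [2, 2, 1, 3],
}
matrix = distance_matrix(toy_profiles)
for name, row in matrix.items():
    print(name, row)
```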

https://www.dropbox.com/sh/02pt21410hla1rf/AADGNL7W6Uxsb5cAR0kffSaUa?dl=0



Inferring recombination events

GUBBINS

Iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions

Runs in only a few hours on alignments of hundreds of bacterial genome sequences.
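The intuition of "elevated density of base substitutions" can be illustrated with a naive sliding window. This is a deliberately simplified sketch and not the Gubbins algorithm, which works per branch of the phylogeny and uses a statistical test; the window size, the threshold factor and the SNP positions are all assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Tuple


def dense_windows(snp_positions: List[int], genome_length: int,
                  window: int = 1000, step: int = 500,
                  factor: float = 3.0) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for windows holding more than `factor` times
    the SNP count expected if SNPs were spread uniformly along the genome."""
    if genome_length <= 0 or window <= 0 or step <= 0:
        raise ValueError("genome_length, window and step must be positive")
    positions = sorted(snp_positions)
    expected = len(positions) * window / genome_length
    flagged = []
    for start in range(0, max(genome_length - window, 0) + 1, step):
        end = start + window
        # Number of SNPs with start <= position < end
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count > factor * expected:
            flagged.append((start, end, count))
    return flagged


toy_snps = [150, 2100, 4020, 4100, 4250, 4400, 4800, 7300, 9500]
hits = dense_windows(toy_snps, genome_length=10_000, window=1000, step=1000)
print(hits)
```

Regions flagged this way are candidates for recombination; SNPs outside them are the ones that would be used to build the phylogeny.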

BRATNEXTGEN

Bayesian analysis of recombinations in whole-genome DNA sequence data

Uses a GUI

Divides the genome into segments, then for each segment, detects genetically distinct clusters of isolates and estimates the probabilities of recombination events

Runs efficiently on a desktop computer .. I tested up to 100 .. Results overnight

https://github.com/sanger-pathogens/Gubbins

http://www.helsinki.fi/bsg/software/BRAT-NextGen/ https://www.dropbox.com/s/gppp5xs2pkw87ms/BratNextGen_manual.pdf?dl=0


Phylogeny (phylogeography) visualization

A directory for tree visualization

http://www.informatik.uni-rostock.de/~hs162/treeposter/poster.html

My favorite tree editor/viewer

http://itol.embl.de/

A very nice tool for phylogeography

http://microreact.org/showcase/
