ECCMID 2015 Meet-The-Expert: Bioinformatics Tools
What bioinformatic tools should I use for analysis of high-throughput sequencing data
for molecular diagnostics?
Nick Loman
Reference-based approach
• Read QC: FastQC, Qualimap
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology, Kraken, BLAST
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction & filter: Custom script, snippy, SnpEff, BRESEQ
• Recombination filtering: Gubbins, ClonalFrameML
• Tree building: FastTree, RAxML
• MLST/Antibiogram: SRST2
De novo approach
• Assembly: Velvet, SPADES
• MLST/Antibiogram: mlst, Abricate
• Annotation: Prokka
• Tree building: Harvest
• Population genomics: BigsDB, Phyloviz
• Pan-genome: LS-BSR
FastQC
• What: Analyse read-level sequence quality.
• Why: Determine serious errors in read quality that might affect downstream analysis.
• Where: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
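To make "read-level sequence quality" concrete, here is a minimal sketch of one of the summaries such a tool reports: the mean Phred score at each read position. It is not FastQC's code. It assumes four-line FASTQ records and Phred+33 encoding, and the function name and example reads are made up for illustration.

```python
from io import StringIO


def mean_quality_per_position(fastq_handle, phred_offset=33):
    """Return the mean Phred score at each read position.

    Assumes four-line FASTQ records and Phred+33 quality encoding.
    """
    totals, counts = [], []
    for i, line in enumerate(fastq_handle):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - phred_offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two made-up reads: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nI5!\n")
means = mean_quality_per_position(example)
print(means)
```

A mean that falls away steeply towards the 3' end is the kind of pattern that suggests quality trimming before downstream analysis.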
Qualimap
• What: Analyse insert size distribution
• Why: Determine whether sequencing has been effective, particularly for de novo assembly, and whether adaptor trimming is needed
• Where: http://qualimap.bioinfo.cipf.es/
Trimmomatic
• What: One of several million read trimmers
• Why: To remove sequencing adaptors, which may influence the results of de novo assembly
• Where: http://www.usadellab.org/cms/?page=trimmomatic
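As a rough illustration of the quality-trimming half of this step, the sketch below cuts a read at the first window whose mean quality drops below a threshold. It is a simplified stand-in for the sliding-window idea, not Trimmomatic's implementation. The function name, the window size and the threshold are illustrative assumptions.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=15):
    """Cut the read at the start of the first window whose mean quality
    is below min_mean. Windows near the 3' end are shorter than `window`.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(seq)):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_mean:
            return seq[:start], quals[:start]
    return seq, quals  # no low-quality window found, keep the whole read


# Made-up read whose quality collapses halfway through.
trimmed_seq, trimmed_quals = sliding_window_trim(
    "ACGTACGT", [30, 30, 30, 30, 10, 10, 10, 10]
)
print(trimmed_seq, trimmed_quals)
```

Adaptor removal is a separate problem (matching known adaptor sequences) and is not covered by this sketch.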
Species ID: BLAST
• What: Only the most famous bioinformatics algorithm ever made
• Why: A few random BLAST searches will reveal much important information about your data before you start on a pipeline analysis
• Where: http://ncbi.nlm.nih.gov/BLAST
Species ID: Metaphlan
• What: Designed for metagenomics, this algorithm will find “taxon-defining” genes to identify what species are in a sample
• Why: Check for extent of sample contamination, give an accurate species ID for unknown samples
• Where: https://bitbucket.org/biobakery/metaphlan2
Species ID: Kraken
• What: Similar to Metaphlan but even faster and with a more complete database
• Why: Check for extent of sample contamination, give an accurate species ID for unknown samples
• Where: https://ccb.jhu.edu/software/kraken/
Species ID: MOCAT
• What: Uses a phylogenetic approach to identify novel or divergent species by relying on distances in conserved marker genes
• Why: Sometimes you sequence something completely novel and want to know more about its relationships
• Where: http://vm-lux.embl.de/~kultima/MOCAT/
• Alternatives: Phylosift, rMLST
Sample QC: Blobology
• What: A simple method of plotting de novo assembly contigs by GC, coverage and taxon
• Why: Characterise contamination, plasmids, lytic phage in a sample
• Where: https://github.com/blaxterlab/blobology
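The plot described here needs two numbers per contig: GC fraction and coverage. The sketch below computes those coordinates. It is not Blobology's code. The function names, contig sequences and coverage values are hypothetical, and the mean coverage per contig is assumed to have been computed already (for example from read mapping).

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def blob_points(contigs, coverage):
    """Map each contig name to its (GC fraction, mean coverage) coordinate."""
    return {name: (gc_fraction(seq), coverage[name]) for name, seq in contigs.items()}


# Made-up contigs and coverage values for illustration only.
contigs = {"contig_1": "GGCC", "contig_2": "ATGC", "contig_3": "NNNN"}
coverage = {"contig_1": 50.0, "contig_2": 5.0, "contig_3": 1.0}
points = blob_points(contigs, coverage)
print(points)
```

Contigs that sit apart from the main cluster in GC or coverage are candidates for contamination, plasmids or phage. Adding a taxon label per contig (for example from BLAST hits) gives the third dimension of the plot.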
Alignment: BWA
• What: The standard method for aligning Illumina sequences to a reference. Use it in BWA-MEM mode, which works well with most read lengths
• Why: Finds the likely location of each sequence read in a reference genome
• Where: https://github.com/lh3/bwa
• Alternatives: SMALT, Bowtie2 (beware standard insert size parameters)
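A minimal sketch of how the alignment step might be scripted, assuming `bwa` is installed and on the PATH. The helper only builds the command lines and runs nothing. The file names, the helper name and the thread count are placeholders.

```python
import shlex


def bwa_mem_commands(reference, reads_1, reads_2, threads=4):
    """Return argv lists for indexing a reference and aligning paired reads."""
    index_cmd = ["bwa", "index", reference]
    # bwa mem writes SAM to standard output; redirect it to a file when running.
    align_cmd = ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2]
    return index_cmd, align_cmd


# Placeholder file names for illustration only.
index_cmd, align_cmd = bwa_mem_commands(
    "reference.fasta", "sample_R1.fastq.gz", "sample_R2.fastq.gz"
)
print(shlex.join(index_cmd))
print(shlex.join(align_cmd) + " > sample.sam")
```

The argv lists can be passed to `subprocess.run` once the paths point at real files.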
Variant calling: samtools & VarScan
• What: A way of calling SNPs against a reference in one or more samples
• Why: VarScan permits easy filtering of SNPs by allele frequency and strand, useful for getting a precise dataset
• Where: http://www.htslib.org/
• http://varscan.sourceforge.net/
• Alternatives: GATK, snippy, Nesoni
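The filtering by allele frequency and strand mentioned above can be sketched as follows. This is an illustration of the idea, not VarScan's code or output format. The record fields and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    pos: int      # position in the reference
    depth: int    # total reads covering the position
    alt_fwd: int  # variant-supporting reads on the forward strand
    alt_rev: int  # variant-supporting reads on the reverse strand


def passes_filter(call, min_freq=0.9, min_per_strand=2):
    """Keep a SNP only if the variant allele dominates and is seen on both strands."""
    if call.depth == 0:
        return False
    alt = call.alt_fwd + call.alt_rev
    return (
        alt / call.depth >= min_freq
        and call.alt_fwd >= min_per_strand
        and call.alt_rev >= min_per_strand
    )


# Made-up calls: a clean SNP, a strand-biased one, and a mixed-allele one.
calls = [SnpCall(100, 40, 20, 19), SnpCall(200, 40, 38, 0), SnpCall(300, 40, 10, 10)]
kept = [c.pos for c in calls if passes_filter(c)]
print(kept)
```

A high frequency threshold suits haploid bacterial isolates, where a true SNP should be supported by nearly all reads. Suitable cut-offs depend on the data.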
Recombination filtering: Gubbins
• What: Detects regions that have undergone recombination, which will confound phylogenetic reconstructions that assume clonality
• Why: Important when attempting phylogenetic reconstructions from recombining organisms
• Where: http://sanger-pathogens.github.io/gubbins/
• Alternatives: ClonalFrameML, BRATNextGen
Tree building: FastTree
• What: Phylogenetic reconstructions from SNP data
• Why: Tree reconstructions are an effective way of examining evolutionary relationships between isolates and testing whether they are from an outbreak
• Note: Ensure you don’t hit the double-precision bug! (http://darlinglab.org/blog/2015/03/23/not-so-fast-fasttree.html)
• Where: http://meta.microbesonline.org/fasttree/
• Alternatives: RAxML (more thorough, slower), REALPHY http://realphy.unibas.ch/fcgi/realphy
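Alongside a tree, a plain pairwise SNP distance table is a quick sanity check on how close isolates are. The sketch below is not how FastTree or RAxML work (they fit phylogenetic models). It only counts differing sites in an aligned SNP matrix, skipping ambiguous or gapped positions. The function name and sequences are made up.

```python
from itertools import combinations


def snp_distances(alignment):
    """Count differing sites for every pair of equal-length aligned sequences.

    Sites where either sequence has 'N' or '-' are ignored.
    """
    dists = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        if len(seq_a) != len(seq_b):
            raise ValueError("sequences must be aligned to the same length")
        dists[(name_a, name_b)] = sum(
            1
            for x, y in zip(seq_a.upper(), seq_b.upper())
            if x != y and x not in "N-" and y not in "N-"
        )
    return dists


# Made-up SNP alignment of three isolates.
alignment = {"iso_A": "ACGT", "iso_B": "ACGA", "iso_C": "TCNA"}
distances = snp_distances(alignment)
print(distances)
```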
MLST & Antibiogram: SRST2
• What: Aligns reads against MLST and antibiotic resistance databases
• Why: Permits MLST typing with genome data and a rough prediction of antibiotic resistance
• Where: http://katholt.github.io/srst2/
De novo assembly: SPADES
• What: A reliable de novo assembler which works well with multiple data types
• Why: Has an in-built error corrector so no need for read trimming; can use multiple values of k so less need for experimentation; consistently performs well in comparisons
• Where: http://bioinf.spbau.ru/spades
De novo assembly: Velvet
• What: The original short-read assembler
• Why: Extremely fast for draft assemblies, particularly if you just want to do MLST or antibiograms
• Where: https://www.ebi.ac.uk/~zerbino/velvet/
• Alternatives: MEGAHIT – even faster!
Annotation: Prokka
• What: Takes de novo assembly contig files and annotates them with coding sequences and non-coding features such as RNAs
• Why: A very sensible set of tools and reference databases in a single package, produces usable output for other software and database submission
• Where: http://www.vicbioinformatics.com/software.prokka.shtml
• Alternatives: xBASE annotation interface
Tree building: Harvest
• What: Takes de novo assembly contigs, performs whole-genome alignment and permits reconstruction of core genome phylogenies
• Why: Scalable to hundreds of genomes on a laptop and with an excellent viewer
• Where: http://harvest.readthedocs.org/en/latest/index.html
• Alternatives: Mauve
Population genomics: BIGSDB
• What: Takes de novo assembly contigs and applies MLST-like schemes working on hundreds or thousands of core genes
• Why: Scalable to >1000s of genomes for rapid population-level clustering
• Where: http://pubmlst.org/software/database/bigsdb/
• Alternatives: Bionumerics
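The "MLST-like scheme over many core genes" idea reduces each genome to a profile of allele numbers, and clustering then works on distances between profiles. The sketch below shows one simple distance: the number of loci with different allele calls. It is an illustration, not BIGSdb's code. The function name, locus names and allele numbers are hypothetical.

```python
def allele_distance(profile_a, profile_b):
    """Count loci where both isolates have an allele call and the calls differ.

    Loci missing from either profile, or called as None, are ignored.
    """
    shared = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


# Made-up profiles over four hypothetical core loci.
isolate_1 = {"locus_1": 1, "locus_2": 5, "locus_3": None, "locus_4": 2}
isolate_2 = {"locus_1": 1, "locus_2": 7, "locus_3": 3, "locus_4": 9}
distance = allele_distance(isolate_1, isolate_2)
print(distance)
```

Because each comparison is a pass over integer allele numbers rather than sequences, this kind of distance stays cheap across thousands of genomes.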
Pan/accessory genomes: LS-BSR
• What: Takes de novo assembly contigs or annotations and compares gene content
• Why: To determine differences in gene content across anything from 1 to 1000s of strains
• Where: https://github.com/jasonsahl/LS-BSR
• Alternatives: OrthoMCL
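The BSR in LS-BSR stands for BLAST score ratio: the bit score of a gene's best hit in a genome divided by the bit score of the gene aligned against itself. A ratio near 1 means the gene is present and conserved, and a ratio near 0 means it is absent. The sketch below computes the ratios from bit scores that are assumed to be available already. It is not LS-BSR's code. The gene names, scores and the 0.8 presence cut-off are illustrative assumptions.

```python
def blast_score_ratios(query_scores, self_scores):
    """Compute BSR per gene and genome.

    query_scores: {gene: {genome: best-hit bit score}}
    self_scores:  {gene: bit score of the gene against itself}
    """
    return {
        gene: {genome: score / self_scores[gene] for genome, score in per_genome.items()}
        for gene, per_genome in query_scores.items()
    }


# Made-up bit scores for two genes across two strains.
self_scores = {"gene_A": 200.0, "gene_B": 100.0}
query_scores = {
    "gene_A": {"strain_1": 200.0, "strain_2": 50.0},
    "gene_B": {"strain_1": 90.0, "strain_2": 0.0},
}
bsr = blast_score_ratios(query_scores, self_scores)

# Hypothetical presence cut-off, chosen only for this example.
present = {gene: sorted(g for g, r in ratios.items() if r >= 0.8) for gene, ratios in bsr.items()}
print(bsr)
print(present)
```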
MLST/Antibiogram: mlst and Abricate
• What: Works on de novo assemblies to give MLST prediction and antibiotic resistance prediction
• Why: A very fast method
• Where: https://github.com/tseemann/mlst
• https://github.com/tseemann/abricate
• Alternatives: SRST2
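Once the allele at each scheme locus has been identified in an assembly, assigning a sequence type is a table lookup. The sketch below shows that last step only. It is not the code of `mlst` or SRST2. The function name, locus names, allele numbers and ST numbers are invented for illustration.

```python
def lookup_st(allele_calls, loci, profiles):
    """Return the sequence type for a set of allele calls, or None if the
    combination is not in the profile table (a novel or incomplete profile).
    """
    key = tuple(allele_calls.get(locus) for locus in loci)
    return profiles.get(key)


# Hypothetical three-locus scheme and profile table.
loci = ("locus_A", "locus_B", "locus_C")
profiles = {(1, 1, 1): 1, (1, 2, 1): 2}

known = lookup_st({"locus_A": 1, "locus_B": 2, "locus_C": 1}, loci, profiles)
novel = lookup_st({"locus_A": 3, "locus_B": 2, "locus_C": 1}, loci, profiles)
print(known, novel)
```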
CLoud Infrastructure for Microbial Bioinformatics (CLIMB)
• MRC-funded project to develop cloud infrastructure for microbial bioinformatics
• £4M of hardware, capable of supporting >1000 individual virtual servers
• Amazon/Google cloud for Academics