Introduction to NGS Analysis Tools - APHL
6/27/2016
National Center for Emerging and Zoonotic Infectious Diseases
Introduction to NGS Analysis Tools
Heather Carleton, PhD, MPH
Team Lead, Enteric Diseases Bioinformatics, Enteric Diseases Laboratory Branch, DFWED, NCEZID, CDC
Next Generation Sequencing: From concept to reality at public health laboratories
June 6th, 2016
Objectives
Provide a basic overview of terminology surrounding next generation sequencing data
Discuss analysis terminology
Highlight NGS analysis tools
– Command‐line freely available tools
– On‐line/cloud based tools
– Commercially available analysis tools
Discuss advantages/disadvantages to the tools
Why do you need analysis tools: To translate WGS data
Consolidation of multiple workflows in the laboratory: Identification – serotyping – virulence profiling – antimicrobial resistance characterization – subtyping
Analysis Tools
Sequence QC
Assembly (de novo)
whole genome MLST Analysis
Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)
Read mapping
Reference-based assembly
hqSNP analysis
kmer (raw read/assembly)
SNP analysis
wgMLST analysis
What is an analysis/bioinformatics pipeline?
Pipeline refers to the series of tools used to go from raw sequence data to an answer
QC → de novo assembly → wgMLST → phylogenetic tree
Types of analysis pipelines
Freely available command-line
On-line cloud-based/fee for service
Commercial software
(Arranged by bioinformatics experience needed)
How to pick an analysis pipeline(s)
Pick the tool that fits your users
– If you do not have bioinformaticians in your lab then using command-line tools will be a challenge
– Make sure the tool delivers the output you need - if you need a phylogenetic tree then it needs to do read mapping, SNP detection, and phylogenetic inference
Must provide quality checks of raw sequence data and analysis steps so you can evaluate the success of the tool
Analysis Tools
Sequence QC
Assembly (de novo)
whole genome MLST Analysis
Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)
Basic QC analysis
Tools used to analyze the basic quality of a sequencing run or reads generated per isolate of a sequencing run
FastQC (also available in BaseSpace)
Torrent Server
Geneious
Qiagen/CLC workbench
BioNumerics v7
Sequence QC – Q‐score
Quality scores - likelihood the base call is correct
– Phred – part of the fastq file generated by the sequencer that scores base call quality
– Q30 – a base call with a 1 in 1,000 chance or less of being incorrect (Q20 – 1 incorrect in 100 base calls); usually reported as the percentage of base calls at or above Q30
• indicates whether a base call is trustworthy and can be used in a hqSNP analysis
95% ≥Q30
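The relationship between a Q-score and its error probability can be sketched as below. This is a minimal illustration, assuming the common Phred+33 ASCII encoding of fastq quality strings; the function names and the example quality string are hypothetical.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Turn a fastq quality string into Phred scores (assumes a Phred+33 offset)."""
    return [ord(ch) - offset for ch in quality_string]


def percent_at_or_above(scores, threshold=30):
    """Percentage of base calls whose score is at least the threshold."""
    if not scores:
        return 0.0
    return 100.0 * sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical quality string: 'I' decodes to Q40, '5' to Q20, '+' to Q10.
example_quality = "IIIIIIIIII5555+"
scores = decode_qualities(example_quality)
print("Error probability at Q30:", phred_to_error_probability(30))
print("Percent of bases >= Q30:", round(percent_at_or_above(scores), 1))
```

A run or isolate would be judged against a lab-chosen cutoff for the percentage of bases at or above Q30.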
Sequence QC – Read trimming
Assess quality over the entire read by looking at quality score by base position and % GC by base position
Most NGS machines have read trimming as part of machine workflow to remove indices and adaptors
Sequence Quality – Insert size
Insert size refers to the length of the piece of DNA you are sequencing
Generally want insert size to be larger than sequencing chemistry (i.e. if doing 2x250/500 cycle sequencing want insert size larger than 500bp)
[Figure: bad insert size vs. good insert size, 2x150 sequencing]
Sequence QC – Coverage
NGS generates 100,000 or more reads per genome sequenced
Any single location on the genome can have zero to hundreds of sequence reads covering it
Coverage at 40x Coverage at 5x
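Average coverage is commonly estimated as total sequenced bases divided by genome size. The sketch below shows that estimate plus a toy per-position depth count. The read counts, read length, and genome size are hypothetical, and real depth is normally computed from mapped reads by a dedicated tool.

```python
def estimate_coverage(num_reads, read_length, genome_size):
    """Average coverage = (number of reads x read length) / genome size."""
    return num_reads * read_length / genome_size


def per_position_depth(read_starts, read_length, genome_size):
    """Count how many reads cover each position of a (tiny) linear genome."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Hypothetical run: 500,000 reads of 250 bp on a 5 Mbp genome.
print(estimate_coverage(500_000, 250, 5_000_000))

# Toy genome of 6 positions with three 3 bp reads; some positions end up uncovered.
print(per_position_depth([0, 2, 2], 3, 6))
```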
Sequence Analysis – De Novo Assembly
Assemble raw sequence data from ~100k reads to 10‐500 contigs
Assemblers use different algorithms and are built to work with a specific NGS machine
SPAdes, Velvet, Newbler
BaseSpace/SPAdes plug‐in
Torrent Server
Geneious
Qiagen/CLC workbench
BioNumerics v7
Sequence Analysis – De novo assembly
Combine overlapping reads into a single contig
Sequence Analysis – de novo assembly quality
Assembly metrics can indicate sequence quality
Number of contigs the raw reads assemble into
– Good: E. coli <200, Salmonella < 100, Listeria < 30
N50 statistic
– Calculated by summing the lengths of the biggest contigs until you reach 50% of total combined contig length; the N50 is the length of the contig that gets you there
– Good: >200,000 bp
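Example: 3 million base pair genome (determined by sum of contig lengths)
Largest contigs: 750,000 bp, 500,000 bp, 350,000 bp
The cutoff for 50% of combined contig length is 1.5 million base pairs; the running total of the largest contigs (750,000 + 500,000 + 350,000 = 1,600,000 bp) passes it at the third contig
*N50 is 350,000 bp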
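The N50 calculation can be sketched in a few lines. The three largest contig lengths below come from the slide example; the smaller contigs are hypothetical values chosen so the assembly totals 3 million base pairs.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of the largest
    contigs first reaches 50% of the combined contig length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs supplied")


# Three largest contigs from the slide, plus hypothetical smaller contigs
# that bring the total to 3,000,000 bp.
contigs = [750_000, 500_000, 350_000,
           300_000, 250_000, 200_000, 200_000,
           150_000, 100_000, 100_000, 50_000, 50_000]
print(sum(contigs), n50(contigs))
```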
Sequence Analysis – Multi-locus sequence typing
Locus can be a gene or part of a gene
– any change (single nucleotide polymorphism, insertion, deletion, small inversion) is a new allele number
Loci can cover the whole genome of an isolate, the core (in common) genes of a species, or housekeeping genes of a genus (traditional MLST)
hq-SNP
cgMLST
Sequence Analysis - MLST
Comparing number (character) differences between isolates
Requires an already developed scheme for the analyzed organism
NCBI Pathogen pipeline (in development)
BigsDB (http://pubmlst.org/software/database/bigsdb/)
Ridom/SeqSphere (http://www.ridom.com/seqsphere/)
BioNumerics v7
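Comparing isolates by allele number differences can be sketched as follows. This is an illustration only: the locus names and allele numbers are hypothetical, and the choice to skip loci that are missing from either profile is an assumption, not a rule taken from any of the tools listed above.

```python
def allele_differences(profile_a, profile_b):
    """Count loci whose allele numbers differ between two isolates.
    Loci missing from either profile are skipped (an assumption)."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] != profile_b[locus])


# Hypothetical allele profiles: locus name -> allele number.
isolate_1 = {"locus_1": 1, "locus_2": 4, "locus_3": 7, "locus_4": 2}
isolate_2 = {"locus_1": 1, "locus_2": 5, "locus_3": 7, "locus_4": 3}
isolate_3 = {"locus_1": 1, "locus_2": 4, "locus_3": 8}  # locus_4 not called

print(allele_differences(isolate_1, isolate_2))
print(allele_differences(isolate_1, isolate_3))
```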
Sequence analysis – functional annotation
Predict isolate characteristics from WGS data (genus/species, serotype, antimicrobial resistance, virulence, etc.)
NCBI Pathogen pipeline (antimicrobial resistance)
Center for Genomic Epidemiology (CGE) (virulence, STEC/ Salmonella serotype, antimicrobial resistance)
BioNumerics v7 (genus/species (ANIm), virulence, STEC/Salmonella serotype, antimicrobial resistance)
Identifying Genus and Species from WGS data
Can use databases (MLST, ribosomal MLST, 16S) to identify to genus and occasionally to species level
Can use a WGS method analogous to the classic laboratory identification method, DNA-DNA hybridization: calculate Average Nucleotide Identity (ANI) between a query genome and a reference genome
E. coli: ACTAGAGGGAAA
S. enterica: GCATCCCCCGTT
Query genome: GCATCCCCCGTA (ANI score 98% for S. enterica)
Inferring serotype from WGS
Because the genes that encode the O and H antigens and determine serotype are known, a database can be built that translates sequence to serotype
Limitations
– Sometimes genes are not expressed (non‐motile isolates)
– There may be modifications to the antigen protein that are not encoded in the genes that originally make the protein
Virulence factors from WGS data
Virulence factors like Shiga toxin or other enterotoxins that are traditionally detected by serology, PCR, or real‐time PCR can be detected in WGS data using databases
Publicly available resources like the Center for Genomic Epidemiology VirulenceFinder can be used to find virulence genes in E. coli, Enterococcus, and S. aureus
http://www.genomicepidemiology.org/
Predicting antimicrobial resistance from WGS
Acquired resistance
– Usually resistance genes (200bp‐1,000bp)
– Highly conserved even between different genera (>98% identity)
– Usually located on mobile elements (plasmids, integrons, islands)
– Methods to detect
• assembled sequence, resistance databases (ResFinder, ARG-ANNOT, FDA/NCBI AR database)
Acquired Resistance
Genes associated with a particular AR phenotype
Phenotype | Genotype
Ampicillin, Amoxicillin/clavulanic acid, Cefoxitin, Ceftriaxone, Ceftiofur | blaCMY-2
Kanamycin | aph(3')-Ia
Gentamicin | aac(3)-VIa
Streptomycin | aadA2, strAB
Chloramphenicol | floR
Sulfisoxazole | sul1, sul2
Trimethoprim/sulphamethoxazole | dfrA12, sul1, sul2
Tetracycline | tetA
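A gene-to-phenotype lookup of this kind can be sketched as a small table in code. The pairing below is one reading of the slide's phenotype and genotype columns, and the function name is hypothetical. The sketch only reports which associated genes were found for each drug; it does not decide whether an isolate is resistant.

```python
# Phenotype -> associated acquired resistance genes (one reading of the slide table).
AR_TABLE = {
    "Ampicillin": ["blaCMY-2"],
    "Amoxicillin/clavulanic acid": ["blaCMY-2"],
    "Cefoxitin": ["blaCMY-2"],
    "Ceftriaxone": ["blaCMY-2"],
    "Ceftiofur": ["blaCMY-2"],
    "Kanamycin": ["aph(3')-Ia"],
    "Gentamicin": ["aac(3)-VIa"],
    "Streptomycin": ["aadA2", "strAB"],
    "Chloramphenicol": ["floR"],
    "Sulfisoxazole": ["sul1", "sul2"],
    "Trimethoprim/sulphamethoxazole": ["dfrA12", "sul1", "sul2"],
    "Tetracycline": ["tetA"],
}


def genes_by_phenotype(detected_genes):
    """For each phenotype, list the associated genes found in the isolate."""
    detected = set(detected_genes)
    hits = {}
    for phenotype, genes in AR_TABLE.items():
        found = [gene for gene in genes if gene in detected]
        if found:
            hits[phenotype] = found
    return hits


# Hypothetical gene hits from a resistance database search of an assembly.
for phenotype, genes in genes_by_phenotype(["tetA", "sul1", "floR"]).items():
    print(phenotype, "<-", ", ".join(genes))
```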
Predicting antimicrobial resistance from WGS
Mutational resistance
– Usually SNPs, but can be insertions/deletions
– Usually chromosomal
– Genera or species specific
– Methods
• no available databases
• assembled sequence, in silico PCR
• raw reads, SNP analysis
Analysis Tools
Sequence QC
Read mapping
Reference-based assembly
hqSNP analysis
Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)
Sequence Analysis – Read mapping/ hqSNP analysis
Map raw sequence data to a known reference genome
Pick mapper based on sequencing chemistry and organism (diploid/haploid)
Mapping used for downstream analysis including hqSNP
samtools, bowtie2, smalt (can wrap some of these in Galaxy)
BaseSpace (bacterial, viral, human, and cancer variant apps), Torrent Server
NCBI pathogen pipeline
BioNumerics v7, CLC Genome workbench, Geneious
Sequence Analysis – high quality single nucleotide polymorphisms (hqSNPs)
What makes a SNP high quality (hq)?
Apply a quality filter that excludes nucleotides in sequence reads from comparison based on sequence coverage, quality, and location
[Figure: sequence reads pass through the quality filter to give quality-filtered sequence reads ready for analysis]
What to call a SNP
SNPs called based on:
– Quality
– Coverage
– Base frequency
The differences between the reference and compared genome are extracted and used to determine relatedness
Reads mapped over the same region:
ATGTTACTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTTCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTGCTC
ATGTTGCTC reference
Is it a SNP?
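The coverage and base frequency checks can be sketched for a single position. The reads and reference below are the 9 bp sequences from the slide; the thresholds (10x coverage, 75% base frequency) are illustrative assumptions, not values the slide specifies, and base quality filtering is assumed to have happened upstream.

```python
from collections import Counter


def call_snp(ref_base, read_bases, min_coverage=10, min_frequency=0.75):
    """Return the variant base if the position passes the filters, else None."""
    coverage = len(read_bases)
    if coverage < min_coverage:
        return None  # too few reads to trust the call
    base, count = Counter(read_bases).most_common(1)[0]
    if base == ref_base or count / coverage < min_frequency:
        return None  # matches the reference, or support is too mixed
    return base


reference = "ATGTTGCTC"
reads = (["ATGTTACTC"] + ["ATGTTCCTC"] * 5 + ["ATGTTTCTC"]
         + ["ATGTTCCTC"] * 6 + ["ATGTTGCTC"])

position = 5  # zero-based index of the only column where reads disagree
column = [read[position] for read in reads]
print(Counter(column))
print(call_snp(reference[position], column))
```

With these thresholds the position is called as a SNP (reference G, isolate C) because 11 of the 14 reads agree; raising the frequency cutoff to 80% would reject it.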
Where to call a SNP?
Not all SNP pipelines are equal – where you call SNPs will affect the total SNP count
SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontal genetic elements like phages can be masked
Mask mobile elements: do not consider SNPs in these locations
Only call SNPs in genes
[Figure: raw reads mapped across genes and mobile elements]
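Masking can be sketched as dropping SNP positions that fall inside listed regions. The coordinates below are hypothetical; in practice the masked regions would come from phage or mobile element annotation of the reference genome.

```python
def mask_snps(snp_positions, masked_regions):
    """Keep only SNP positions that fall outside every masked region.
    Regions are (start, end) pairs, inclusive, in reference coordinates."""
    return [
        pos for pos in snp_positions
        if not any(start <= pos <= end for start, end in masked_regions)
    ]


# Hypothetical SNP positions and one hypothetical prophage region.
snps = [1_200, 45_310, 45_998, 46_020, 310_554]
prophage_regions = [(45_000, 47_500)]

kept = mask_snps(snps, prophage_regions)
print(len(snps), "SNPs before masking,", len(kept), "after:", kept)
```

The drop in SNP count after masking illustrates why two pipelines can report different totals for the same isolates.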
Where to call a SNP – pick the right reference
Choice of reference genome affects analysis – a more closely related reference is more likely to identify true SNP differences
How to interpret hqSNPs – phylogenetic trees
Use the differences you identified by hqSNP to infer the relatedness or phylogeny of isolates
[Figure: phylogenetic tree of Isolates A, B, C, and D built from aligned sequences; branch labels (11, 1, 6, 3, 5) show the amount of genetic change]
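Before a tree is drawn, the differences are usually summarized as pairwise counts between isolates. The sketch below counts differing positions between equal-length SNP strings. The isolate sequences are hypothetical, and building the tree itself would be left to a phylogenetics program.

```python
from itertools import combinations


def pairwise_snp_distances(alignment):
    """Count differing positions for every pair of aligned sequences."""
    distances = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        distances[(name_a, name_b)] = sum(a != b for a, b in zip(seq_a, seq_b))
    return distances


# Hypothetical concatenated SNP positions for four isolates.
alignment = {
    "Isolate A": "ACGTACGT",
    "Isolate B": "ACGTACGA",
    "Isolate C": "ACGAACGA",
    "Isolate D": "TCGAACGA",
}

for pair, distance in pairwise_snp_distances(alignment).items():
    print(pair, distance)
```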
NCBI Pathogen Detection Pipeline
NCBI Submission Portal
BioSamples
SRA
GenBank
BioProject
NCBI Pathogen Pipeline
Kmer analysis
Genome Assembly
Genome Annotation
Genome Placement
Clustering
SNP analysis
Tree Construction
Reports
QC
Automated Bacterial Assembly
SRA Reads sample 1
Trim reads (Ns, adaptor)
Reference Distance tree
Find closest reference genome(s)
ArgoCA (Combined Assembly)
De novo assembly panel
Argo (Reference assisted assembly)
SOAP denovo, GS-assembler (newbler), MaSuRCA, Celera Assembler, SPAdes
Reads remapped to combined assembly
Contig fasta, Read placements (bam), Quality profile
http://www.ncbi.nlm.nih.gov/pathogens/
Results Available Now
NCBI Pathogen Detection SNP Pipeline: example 1 - stone fruit outbreak
CDC SNP extraction tool – Lyve‐SET
Developed for analysis of raw sequence data from foodborne pathogens
Works with both Ion Torrent and Illumina data (need to use 2 different mappers)
Can filter based on quality and clustered SNPs and filter out phages automatically
https://github.com/lskatz/lyve-SET
Clean raw reads • cg-pipeline
Map reads to reference • SMALT
Identify SNPs • VarScan
Create phylogeny • RAxML
SNP matrix
pairwise differences
phylogenetic tree
FDA SNP pipeline – CFSAN SNP Pipeline
Developed for analysis of sequence data for foodborne pathogens
Excellent documentation online: http://snp-pipeline.readthedocs.io/en/latest/
https://github.com/CFSAN-Biostatistics/snp-pipeline
Map reads to reference • Bowtie2
Identify SNPs • VarScan
SNP matrix
pairwise differences
Output for phylogenetic analysis
Analysis Tools
Sequence QC
kmer (raw read based/assembly)
SNP analysis
wgMLST analysis
Sequence analysis – reference free raw read and assembly‐based approaches
Analysis does not require a reference
Can use kmer based analyses to measure relatedness between isolates
Can also be used to rapidly match against a known allele/reference
kSNP (https://sourceforge.net/projects/ksnp/)
MASH
NCBI pathogen pipeline (kmer tree)
Center for Genomic Epidemiology
CLC genome workbench
BioNumerics v7.5 wgMLST
Kmer-based analysis
Computer algorithms use a sliding window to chop up sequence reads into shorter lengths (k) of DNA - kmers
kmers are compared to identify differences
Isolate 1 read (15bp): ACTGAACTGACTCAA
K-mers (10bp): ACTGAACTGA, CTGAACTGAC, TGAACTGACT, AACTGACTCA, ACTGACTCAA
Isolate 2 read (15bp): ACTGAACTGACTCAC
K-mers (10bp): ACTGAACTGA, CTGAACTGAC, TGAACTGACT, AACTGACTCA, ACTGACTCAC
The first four K-mers shown are identical between the isolates; the last K-mer of each is unique
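The sliding window and the k-mer comparison can be sketched as below, using the two 15 bp reads from the slide and k = 10. Set overlap is one simple way to compare k-mers and is used here as an assumption; the tools listed above use their own, more elaborate methods.

```python
def kmers(sequence, k):
    """Slide a window of width k along the sequence, one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


isolate_1 = "ACTGAACTGACTCAA"
isolate_2 = "ACTGAACTGACTCAC"

kmers_1 = set(kmers(isolate_1, 10))
kmers_2 = set(kmers(isolate_2, 10))

shared = kmers_1 & kmers_2
print("Identical k-mers:", sorted(shared))
print("Unique to isolate 1:", sorted(kmers_1 - kmers_2))
print("Unique to isolate 2:", sorted(kmers_2 - kmers_1))
print("Fraction shared:", len(shared) / len(kmers_1 | kmers_2))
```

A 15 bp read yields six 10 bp k-mers when the window moves one base at a time; only the k-mer covering the final base differs between these two reads.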
KSNP-based analysis
Computer algorithms use a sliding window to chop up sequence reads into shorter lengths (k) of DNA – k is always an odd number
Compare base pair differences at the central position of the kmer
Isolate 1 raw read (15bp): ACTGAACTGACTCAA
K-mers (9bp): ACTGAACTG, CTGAACTGA, TGAACTGAC, AACTGACTC, ACTGACTCA
Isolate 2 raw read (15bp): ACTGCACTGACTCAA
K-mers (9bp): ACTGCACTG, CTGCACTGA, TGCACTGAC, CACTGACTC, ACTGACTCA
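The central position comparison can be sketched as follows: k-mers whose flanking bases match but whose middle base differs are reported as SNPs. This is a simplified illustration using the two reads from the slide and k = 9. It is not the kSNP program's actual implementation, and the function names are hypothetical.

```python
def kmers(sequence, k):
    """Slide a window of width k along the sequence, one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def index_by_flanks(sequence, k):
    """Map (left flank, right flank) -> set of central bases seen."""
    mid = k // 2
    table = {}
    for kmer in kmers(sequence, k):
        flanks = (kmer[:mid], kmer[mid + 1:])
        table.setdefault(flanks, set()).add(kmer[mid])
    return table


def central_snps(seq_a, seq_b, k=9):
    """Report k-mers with identical flanks but a different central base."""
    if k % 2 == 0:
        raise ValueError("k must be odd so the k-mer has a central position")
    table_a, table_b = index_by_flanks(seq_a, k), index_by_flanks(seq_b, k)
    snps = []
    for flanks in sorted(table_a.keys() & table_b.keys()):
        bases_a, bases_b = table_a[flanks], table_b[flanks]
        # Skip flanks with ambiguous (multiple) central bases in either isolate.
        if len(bases_a) == 1 and len(bases_b) == 1 and bases_a != bases_b:
            snps.append((flanks, next(iter(bases_a)), next(iter(bases_b))))
    return snps


isolate_1 = "ACTGAACTGACTCAA"
isolate_2 = "ACTGCACTGACTCAA"
print(central_snps(isolate_1, isolate_2))
```

An odd k guarantees a single central base with equal-length flanks on either side, which is why the slide notes that k is always odd.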
Kmer analysis – identifying organisms
End‐to‐End Analysis Tools
Sequence QC
Assembly (de novo)
whole genome MLST Analysis
Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)
Read mapping
Reference-based assembly
hqSNP analysis
kmer (raw read/assembly)
SNP analysis
wgMLST analysis
Tools that offer end‐to‐end solutions: BioNumerics v7.6
Tools for QC, assembly, wgMLST, hqSNP, and functional prediction, each as a single-button workflow
Functions as a database so the metadata needed to interpret the analysis is easily viewable
For bacteriology, virology, mycology, animals, and plants
Tools that offer end-to-end solutions: CLC Genomics
Has tools to handle haploid and diploid genomes
Nice graphics and reporting features
Can export workflows for others to use
Tools that offer end‐to‐end solutions: Illumina BaseSpace
Conclusions:
Pick the tool that fits your need
Think about whether you will be doing CLIA or CAP certified tests through the pipeline and what kind of control and customization you need
Make sure your laboratorians can use the tool and interpret the output
For more information, contact CDC: 1-800-CDC-INFO (232-4636), TTY: 1-888-232-6348, www.cdc.gov
The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.
Questions?
Use of trade names is for identification only and does not imply endorsement by the Centers for Disease Control and Prevention or the U.S. Department of Health and Human Services.
• Resources:
Program | What for? | Where to find it | Cost? | Platform
BioNumerics 7.5 | Assembly, wgMLST, SNP analysis | http://www.applied-maths.com/ | Yes | Windows
CLC Bio Genomics Workbench | Workflows, read metrics, assemblies, etc, SNP analyses | https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/ | Yes | Windows/Linux
Geneious | Assemblies, trees, SNP analysis | http://geneious.com/ | Yes | Windows
MEGA6 | Phylogenies | megasoftware.net/ | No | Windows
Lasergene | Assemblies, read metrics, analysis | http://www.dnastar.com/ | Yes | Windows
NCBI Genome Workbench | Viewing trees, analysis | http://www.ncbi.nlm.nih.gov/tools/gbench/ | No | Windows/Linux
CFSAN SNPpipeline | Assembly, read metrics, assembly metrics, read cleaning, etc | sourceforge.net/projects/cg-pipeline | No | Linux
Snp Extraction Tool | Read cleaning, Creating Phylogenies | github.com/lskatz/lyve-SET | No | Linux