
6/27/2016


National Center for Emerging and Zoonotic Infectious Diseases

Introduction to NGS Analysis Tools

Heather Carleton, PhD, MPH

Team Lead, Enteric Diseases Bioinformatics, Enteric Diseases Laboratory Branch, DFWED, NCEZID, CDC

Next Generation Sequencing: From concept to reality at public health laboratories

June 6th, 2016

Objectives

Provide a basic overview of terminology surrounding next generation sequencing data

Discuss analysis terminology

Highlight NGS analysis tools

– Command‐line freely available tools

– On‐line/cloud based tools

– Commercially available analysis tools

Discuss advantages/disadvantages to the tools


Why do you need analysis tools: To translate WGS data

Consolidation of multiple workflows in the laboratory: identification – serotyping – virulence profiling – antimicrobial resistance characterization – subtyping

Analysis Tools

Sequence QC

Assembly (de novo)

whole genome MLST Analysis

Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)

Read mapping

Reference-based assembly

hqSNP analysis

kmer (raw read/assembly)

SNP analysis

wgMLST analysis


What is an analysis/bioinformatics pipeline?

A pipeline is the series of tools used to go from raw sequence data to an answer

QC → de novo assembly → wgMLST → phylogenetic tree

Types of analysis pipelines

Freely available command-line/ on-line

cloud-based/fee for service

Commercial software

Bioinformatics Experience


How to pick an analysis pipeline(s)

Pick the tool that fits your users

– If you do not have bioinformaticians in your lab, then using command-line tools will be a challenge

– Make sure the tool delivers the output you need - if you need a phylogenetic tree, then it needs to do read mapping, SNP detection, and phylogenetic inference

Must provide quality checks of raw sequence data and analysis steps so you can evaluate success of tool

Analysis Tools

Sequence QC

Assembly (de novo)


Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)


Basic QC analysis

Tools used to analyze the basic quality of a sequencing run or reads generated per isolate of a sequencing run

FastQC (also available in BaseSpace)

Torrent Server

Geneious

Qiagen/CLC workbench

BioNumerics v7

Sequence QC – Q‐score

Quality scores - likelihood the base call is correct

– Phred – part of the fastq file generated by the sequencer that scores base call quality

– Q30 – a base call with a 1 in 1,000 chance or less of being incorrect (Q20 – 1 in 100); %Q30 is the percentage of base calls that score Q30 or better

• indicates whether a base call is trustworthy and can be used in a hqSNP analysis

95% ≥Q30
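The relationship between a Phred score and its error probability, and the "% ≥Q30" metric, can be sketched as below. This is a minimal illustration, assuming the common Phred+33 ASCII encoding of fastq quality strings; the function names are hypothetical and are not taken from any of the tools listed on these slides.

```python
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q to the probability the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality_string(quality: str, offset: int = 33) -> List[int]:
    """Decode a fastq quality line into Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]


def percent_at_least(quality: str, threshold: int = 30) -> float:
    """Percentage of base calls in one quality string at or above the threshold."""
    scores = decode_quality_string(quality)
    if not scores:
        return 0.0
    return 100.0 * sum(score >= threshold for score in scores) / len(scores)


if __name__ == "__main__":
    print(phred_to_error_probability(30))   # Q30: about 1 error in 1,000 calls
    print(phred_to_error_probability(20))   # Q20: about 1 error in 100 calls
    print(percent_at_least("IIII5555"))     # 'I' decodes to Q40, '5' to Q20
```

In practice a run-level %Q30 is computed over all reads in the run rather than one quality string; the same counting applies.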


Sequence QC – Read trimming

Assess quality over the entire read by looking at quality score by base position and % GC by base position

Most NGS machines have read trimming as part of machine workflow to remove indices and adaptors

Sequence Quality – Insert size

Insert size refers to the length of the piece of DNA you are sequencing

Generally want the insert size to be larger than the total read length of the sequencing chemistry (e.g. for 2x250/500-cycle sequencing, want an insert size larger than 500 bp)

[Figure: bad insert size vs. good insert size for 2x150 sequencing]


Sequence QC – Coverage

NGS generates 100,000 or more reads per genome sequenced

Any single location on the genome can be covered by zero to hundreds of sequence reads

[Figure: coverage at 40x vs. coverage at 5x]
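Average coverage is commonly estimated as (number of reads × read length) / genome size, while per-position depth counts how many reads overlap each base. The sketch below illustrates both under simplifying assumptions (fixed read length, reads given only by their start position); the numbers in the usage lines are hypothetical, not from the slides.

```python
from typing import List


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate average coverage as total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def depth_per_position(read_starts: List[int], read_length: int,
                       genome_size: int) -> List[int]:
    """Count how many reads cover each position (0-based read start coordinates)."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    # Hypothetical run: 800,000 reads of 250 bp over a 5 Mb genome.
    print(mean_coverage(800_000, 250, 5_000_000))
    # Toy genome of 10 bp with three 4 bp reads; some positions get zero reads.
    print(depth_per_position([0, 2, 2], 4, 10))
```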

Sequence Analysis – De Novo Assembly

Assemble raw sequence data from ~100k reads to 10‐500 contigs

Assemblers use different algorithms and are built to work with a specific NGS machine

SPAdes, Velvet, Newbler

BaseSpace/SPAdes plug‐in

Torrent Server

Geneious

Qiagen/CLC workbench

BioNumerics v7


Sequence Analysis – De novo assembly

Combine overlapping reads into a single contig

Sequence Analysis – de novo assembly quality

Assembly metrics can indicate sequence quality

Number of contigs the raw reads assemble into

– Good: E. coli <200, Salmonella <100, Listeria <30

N50 statistic

– Calculated by summing the lengths of the largest contigs, in descending order, until you reach 50% of the total combined contig length; the N50 is the length of the contig that reaches that point

– Good: >200,000 bp

Example: 3 million base pair genome (determined by sum of contig lengths)

Largest contigs: 750,000 bp, 500,000 bp, 350,000 bp

750,000 + 500,000 + 350,000 = 1,600,000 bp, which passes 1.5 million base pairs, the cutoff for 50% of the combined contig length

N50 is 350,000 bp
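The N50 calculation described above can be sketched in a few lines. The three largest contig lengths come from the slide's example; the smaller contigs are hypothetical values chosen only so the total is 3 million bp.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50: walk contigs from largest to smallest and report the
    length of the contig at which the running sum reaches half the total."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running sum always reaches half the total")


if __name__ == "__main__":
    # 750k, 500k, 350k are from the slide; the rest are made-up filler contigs.
    contigs = [750_000, 500_000, 350_000,
               300_000, 300_000, 250_000, 200_000, 150_000, 100_000, 100_000]
    print(sum(contigs))  # total assembly length
    print(n50(contigs))  # N50 for this contig set
```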


Sequence Analysis – Multi-locus sequence typing

Locus can be a gene or part of a gene

– any change (single nucleotide polymorphism, insertion, deletion, small inversion) is a new allele number

Loci can cover the whole genome of an isolate, the core (in common) genes of a species, or housekeeping genes of a genus (traditional MLST)

[Figure labels: hq-SNP, cgMLST]

Sequence Analysis – MLST

Comparing number (character) differences between isolates

Requires an already developed scheme for the analyzed organism

NCBI Pathogen pipeline (in development)

BIGSdb (http://pubmlst.org/software/database/bigsdb/)

Ridom/SeqSphere (http://www.ridom.com/seqsphere/)

BioNumerics v7
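The allele-numbering idea (any sequence change at a locus is a new allele number) and the comparison of allele profiles between isolates can be sketched as follows. This is a toy illustration with hypothetical locus names and sequences; real schemes such as those served by BIGSdb assign allele numbers from a curated database rather than on the fly.

```python
from typing import Dict, List


def assign_allele_numbers(sequences: List[str]) -> List[int]:
    """For one locus, give each distinct sequence its own allele number (1, 2, ...)."""
    seen: Dict[str, int] = {}
    numbers = []
    for seq in sequences:
        if seq not in seen:
            seen[seq] = len(seen) + 1
        numbers.append(seen[seq])
    return numbers


def allele_differences(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> int:
    """Count loci, among those present in both profiles, whose allele numbers differ."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] != profile_b[locus])


if __name__ == "__main__":
    # Hypothetical sequences at a single locus from four isolates.
    print(assign_allele_numbers(["ACGT", "ACGA", "ACGT", "ACT"]))
    # Hypothetical allele profiles for two isolates.
    isolate_1 = {"locus1": 1, "locus2": 4, "locus3": 2}
    isolate_2 = {"locus1": 1, "locus2": 5, "locus3": 3}
    print(allele_differences(isolate_1, isolate_2))
```

Because the comparison is on allele numbers (characters), a SNP and a large deletion in the same locus each count as a single difference.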


Sequence analysis – functional annotation

Predict isolate characteristics from WGS data (genus/species, serotype, antimicrobial resistance, virulence, etc.)

NCBI Pathogen pipeline (antimicrobial resistance)

Center for Genomic Epidemiology (CGE) (virulence, STEC/ Salmonella serotype, antimicrobial resistance)

BioNumerics v7 (genus/species (ANIm), virulence, STEC/Salmonella serotype, antimicrobial resistance)

Identifying Genus and Species from WGS data

Can use databases (MLST, ribosomal MLST, 16S) to identify genus and, occasionally, species

Can use WGS to calculate Average Nucleotide Identity (ANI) between a query genome and a reference genome, an in silico analogue of DNA-DNA hybridization, the classic laboratory method for identification

[Figure: example sequences]
E. coli: ACTAGAGGGAAA
S. enterica: GCATCCCCCGTT
Query genome: GCATCCCCCGTA
ANI score 98% for S. enterica


Inferring serotype from WGS

Since the genes that code the O and H antigens and determine serotype are known – can build a database that translates sequence to serotype

Limitations

– Sometimes genes are not expressed (non‐motile isolates)

– There may be modifications to the antigen protein that are not encoded in the genes that originally make the protein

Virulence factors from WGS data

Virulence factors like Shiga toxin or other enterotoxins that are traditionally detected by serology, PCR, or real‐time PCR can be detected in WGS data using databases

Publicly available resources like the Center for Genomic Epidemiology VirulenceFinder can be used to find virulence genes in E. coli, Enterococcus, and S. aureus

http://www.genomicepidemiology.org/


Predicting antimicrobial resistance from WGS

Acquired resistance

– Usually resistance genes (200bp‐1,000bp)

– Highly conserved even between different genera (>98% identity)

– Usually located on mobile elements (plasmids, integrons, islands)

– Methods to detect

• assembled sequence, resistance databases (ResFinder, ARG-ANNOT, FDA/NCBI AR database)

Acquired Resistance

Genes associated with a particular AR phenotype

Phenotype | Genotype
Ampicillin, Amoxicillin/clavulanic acid, Cefoxitin, Ceftriaxone, Ceftiofur | blaCMY-2
Kanamycin | aph(3')-Ia
Gentamicin | aac(3)-VIa
Streptomycin | aadA2, strAB
Chloramphenicol | floR
Sulfisoxazole | sul1, sul2
Trimethoprim/sulphamethoxazole | dfrA12, sul1, sul2
Tetracycline | tetA
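Once resistance genes have been detected in an assembly, predicting the phenotype is essentially a table lookup. The sketch below uses the gene and phenotype pairs from this slide as I read them from the flattened table (the grouping of the five beta-lactams under blaCMY-2 is my interpretation of the layout). The dictionary and function names are hypothetical; this is not how ResFinder or any other listed database is implemented.

```python
from typing import Dict, Iterable, List

# Gene -> phenotypes, taken from the slide's example table (interpretation, see lead-in).
GENE_TO_PHENOTYPES: Dict[str, List[str]] = {
    "blaCMY-2": ["Ampicillin", "Amoxicillin/clavulanic acid", "Cefoxitin",
                 "Ceftriaxone", "Ceftiofur"],
    "aph(3')-Ia": ["Kanamycin"],
    "aac(3)-VIa": ["Gentamicin"],
    "aadA2": ["Streptomycin"],
    "strAB": ["Streptomycin"],
    "floR": ["Chloramphenicol"],
    "sul1": ["Sulfisoxazole", "Trimethoprim/sulphamethoxazole"],
    "sul2": ["Sulfisoxazole", "Trimethoprim/sulphamethoxazole"],
    "dfrA12": ["Trimethoprim/sulphamethoxazole"],
    "tetA": ["Tetracycline"],
}


def predict_phenotypes(detected_genes: Iterable[str]) -> List[str]:
    """Return the sorted set of phenotypes associated with the detected genes.
    Genes missing from the lookup table are ignored."""
    phenotypes = set()
    for gene in detected_genes:
        phenotypes.update(GENE_TO_PHENOTYPES.get(gene, []))
    return sorted(phenotypes)


if __name__ == "__main__":
    print(predict_phenotypes(["tetA", "floR"]))
    print(predict_phenotypes(["blaCMY-2"]))
```

A simple lookup like this lists every phenotype a gene is associated with; it does not model cases where resistance needs several genes together.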


Predicting antimicrobial resistance from WGS

Mutational resistance

– Usually SNPs, but can be insertions/deletions

– Usually chromosomal

– Genera or species specific

– Methods

• no available databases

• assembled sequence, in silico PCR

• raw reads, SNP analysis

Analysis Tools

Sequence QC

Read mapping


hqSNP analysis

Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)


Sequence Analysis – Read mapping/ hqSNP analysis

Map raw sequence data to a known reference genome

Pick mapper based on sequencing chemistry and organism (diploid/haploid)

Mapping used for downstream analysis including hqSNP

samtools, bowtie2, smalt (can wrap some of these in Galaxy)

BaseSpace (bacterial, viral, human, and cancer variant apps), torrent server

NCBI pathogen pipeline

BioNumerics v7, CLC Genome workbench, Geneious

Sequence Analysis – high quality single nucleotide polymorphisms (hqSNPs)

What makes a SNP high quality (hq)?

Apply a quality filter that excludes nucleotides in sequence reads from the comparison based on sequence coverage, quality, and location

[Figure: sequence reads pass through the quality filter; quality-filtered sequence reads are ready for analysis]


What to call a SNP

SNPs called based on:

– Quality

– Coverage

– Base frequency

The differences between the reference and compared genome are extracted and used to determine relatedness

Reads:
ATGTTACTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTTCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTCCTC
ATGTTGCTC

Reference:
ATGTTGCTC

Is it a SNP?
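The three criteria (quality, coverage, base frequency) can be combined into a simple per-position decision, sketched below. The thresholds are illustrative assumptions, not values used by any pipeline in these slides. The example column is my reading of the sixth position in the slide's pileup (one A, eleven C, one T, one G against a reference G), and the quality scores are invented.

```python
from collections import Counter
from typing import List, Optional


def call_snp(read_bases: str, qualities: List[int], ref_base: str,
             min_quality: int = 20, min_coverage: int = 10,
             min_frequency: float = 0.75) -> Optional[str]:
    """Return the alternate base if the position passes all filters, else None.

    read_bases and qualities hold one entry per read covering the position.
    """
    # Quality: drop individual base calls below the quality threshold.
    kept = [b for b, q in zip(read_bases, qualities) if q >= min_quality]
    # Coverage: require enough remaining reads at this position.
    if len(kept) < min_coverage:
        return None
    # Base frequency: the most common base must differ from the reference
    # and be supported by a large enough fraction of the reads.
    base, count = Counter(kept).most_common(1)[0]
    if base != ref_base and count / len(kept) >= min_frequency:
        return base
    return None


if __name__ == "__main__":
    column = "A" + "C" * 11 + "T" + "G"   # 14 reads at one position
    quals = [30] * len(column)            # hypothetical quality scores
    print(call_snp(column, quals, ref_base="G"))
    print(call_snp(column, quals, ref_base="G", min_frequency=0.9))
```

With these assumed thresholds the answer to "Is it a SNP?" depends on the cutoff: 11 of 14 reads (about 79%) support C, which passes a 75% frequency filter but not a 90% one.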

Where to call a SNP?

Not all SNP pipelines are equal – where you call SNPs will affect the total SNP count

SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontal genetic elements like phages can be masked

Mask mobile elements - do not consider SNPs in this location

Mobile elements

genes

Only call SNPs in genes

Raw reads


Where to call a SNP – pick the right reference

Choice of reference genome affects analysis – a more closely related reference is more likely to identify true SNP differences

How to interpret hqSNPs – phylogenetic trees

Use the differences you identified by hqSNP to infer the relatedness or phylogeny of isolates
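Before a tree is built, the hqSNP differences are often summarized as a pairwise distance matrix. The sketch below counts differing positions between aligned SNP strings; the isolate names and sequences are hypothetical, and real pipelines pass the alignment to a dedicated phylogenetics program rather than stopping at raw counts.

```python
from itertools import combinations
from typing import Dict, Tuple


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def pairwise_distances(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return the SNP distance for every pair of isolates in the alignment."""
    return {
        (name_a, name_b): snp_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    # Hypothetical SNP alignment: one column per variable site.
    alignment = {"Isolate A": "ACGT", "Isolate B": "ACGA", "Isolate C": "TCGA"}
    for pair, distance in pairwise_distances(alignment).items():
        print(pair, distance)
```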

[Figure: phylogenetic tree of Isolates A, B, C, and D built from short example sequences; numeric labels 11, 1, 6, 3, 5; scale: genetic change]


NCBI Pathogen Detection Pipeline

NCBI Submission Portal

BioSamples

SRA

GenBank

BioProject

NCBI Pathogen Pipeline

Kmer analysis

Genome Assembly

Genome Annotation

Genome Placement

Clustering

SNP analysis

Tree Construction

Reports

QC

Automated Bacterial Assembly

SRA Reads sample 1

Trim reads (Ns, adaptor)

Reference Distance tree

Find closest reference genome(s)

ArgoCA (Combined Assembly)

De novo assembly panel

Argo (Reference assisted assembly), SOAP denovo, GS-assembler (newbler), MaSuRCA, Celera Assembler

Reads remapped to combined assembly

Contig fasta, Read placements (bam), Quality profile

SPAdes


http://www.ncbi.nlm.nih.gov/pathogens/

Results Available Now

NCBI Pathogen Detection SNP Pipeline: example 1 - stone fruit outbreak


CDC SNP extraction tool – Lyve‐SET

Developed for analysis of raw sequence data from foodborne pathogens

Works with both Ion Torrent and Illumina data (need to use 2 different mappers)

Can filter based on quality and clustered SNPs and filter out phages automatically

https://github.com/lskatz/lyve-SET

Clean raw reads • CG-Pipeline

Map reads to reference • SMALT

Identify SNPs • VarScan

Create phylogeny • RAxML

SNP matrix

pairwise differences

phylogenetic tree

FDA SNP pipeline – CFSAN SNP Pipeline

Developed for analysis of sequence data for foodborne pathogens

Excellent documentation online: http://snp-pipeline.readthedocs.io/en/latest/

https://github.com/CFSAN-Biostatistics/snp-pipeline

Map reads to reference • Bowtie2

Identify SNPs • VarScan

SNP matrix

pairwise differences

Output for phylogenetic analysis


Analysis Tools

Sequence QC

kmer (raw read based/assembly)

SNP analysis

wgMLST analysis

Sequence analysis – reference free raw read and assembly‐based approaches

Analysis does not require a reference

Can use kmer based analyses to measure relatedness between isolates

Can also be used to quickly match against a known allele or reference

kSNP (https://sourceforge.net/projects/ksnp/)

MASH

NCBI pathogen pipeline (kmer tree)

Center for genomic epidemiology

CLC genome workbench

BioNumerics v7.5 wgMLST


Kmer-based analysis

Computer algorithms use a sliding window to chop up sequence reads into shorter lengths (k) of DNA - kmers

kmers are compared to identify differences

Isolate 1
Read (15bp): ACTGAACTGACTCAA
K-mers (10bp): ACTGAACTGA, CTGAACTGAC, TGAACTGACT, AACTGACTCA, ACTGACTCAA

Isolate 2
Read (15bp): ACTGAACTGACTCAC
K-mers (10bp): ACTGAACTGA, CTGAACTGAC, TGAACTGACT, AACTGACTCA, ACTGACTCAC

Identical K-mers: those shared by both isolates

Unique K-mer: ACTGACTCAA (Isolate 1) vs. ACTGACTCAC (Isolate 2)
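The sliding-window idea can be sketched directly with the two 15 bp reads from this slide. This minimal version moves the window one base at a time and compares the resulting k-mer sets; tools such as MASH add steps (for example hashing and subsampling) that are not shown here.

```python
from typing import List, Set


def kmers(sequence: str, k: int) -> List[str]:
    """Slide a window of length k along the sequence, one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def compare_kmers(seq_a: str, seq_b: str, k: int):
    """Return (shared, unique_to_a, unique_to_b) k-mer sets for two sequences."""
    set_a: Set[str] = set(kmers(seq_a, k))
    set_b: Set[str] = set(kmers(seq_b, k))
    return set_a & set_b, set_a - set_b, set_b - set_a


if __name__ == "__main__":
    isolate_1 = "ACTGAACTGACTCAA"
    isolate_2 = "ACTGAACTGACTCAC"
    shared, only_1, only_2 = compare_kmers(isolate_1, isolate_2, k=10)
    print(sorted(shared))
    print(only_1, only_2)
```

With a step of one base, each 15 bp read yields six 10-mers; the single base difference at the end of the read changes only the last k-mer.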

KSNP-based analysis

Computer algorithms use a sliding window to chop up sequence reads into shorter lengths (k) of DNA – k is always an odd number

Compare base pair differences at central position of kmer

Isolate 1
Raw Read (15bp): ACTGAACTGACTCAA
K-mers (9bp): ACTGAACTG, CTGAACTGA, TGAACTGAC, AACTGACTC, ACTGACTCA

Isolate 2
Raw Read (15bp): ACTGCACTGACTCAA
K-mers (9bp): ACTGCACTG, CTGCACTGA, TGCACTGAC, CACTGACTC, ACTGACTCA
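The central-position comparison can be sketched as follows: each odd-length k-mer is split into its two flanks and its middle base, and a SNP is reported where two isolates share the same flanks but carry a different middle base. This is a simplified illustration using the reads from the slide, not the kSNP program's actual implementation (which, among other things, also has to handle reverse complements and many genomes at once).

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Flanks = Tuple[str, str]


def central_bases(sequence: str, k: int) -> Dict[Flanks, Set[str]]:
    """Map each (left flank, right flank) pair to the middle bases seen with it."""
    if k % 2 == 0:
        raise ValueError("k must be odd so the k-mer has a central position")
    half = k // 2
    table: Dict[Flanks, Set[str]] = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        table[(kmer[:half], kmer[half + 1:])].add(kmer[half])
    return table


def central_snps(seq_a: str, seq_b: str, k: int) -> Dict[Flanks, Tuple[str, str]]:
    """Report flank pairs where each isolate has one middle base and they differ."""
    bases_a, bases_b = central_bases(seq_a, k), central_bases(seq_b, k)
    snps = {}
    for flanks in bases_a.keys() & bases_b.keys():
        a, b = bases_a[flanks], bases_b[flanks]
        if len(a) == 1 and len(b) == 1 and a != b:
            snps[flanks] = (next(iter(a)), next(iter(b)))
    return snps


if __name__ == "__main__":
    print(central_snps("ACTGAACTGACTCAA", "ACTGCACTGACTCAA", k=9))
```

Requiring k to be odd is what guarantees a single central base with flanks of equal length on either side.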


Kmer analysis – identifying organisms

End‐to‐End Analysis Tools

Sequence QC

Assembly (de novo)


Functional analysis (ANI, Serotype, antimicrobial resistance profile, annotation)

Read mapping


hqSNP analysis

kmer (raw read/assembly)

SNP analysis

wgMLST analysis


Tools that offer end‐to‐end solutions: BioNumerics v7.6

Tools for QC, assembly, wgMLST, hqSNP, and functional prediction, each as a single-button workflow

Functions as a database so the metadata needed to interpret the analysis is easily viewable

For bacteriology, virology, mycology, animals, and plants

Tools that offer end-to-end solutions: CLC Genomics

Has tools to handle haploid and diploid genomes

Nice graphics and reporting features

Can export workflows for others to use


Tools that offer end‐to‐end solutions: Illumina BaseSpace

Conclusions:

Pick the tool that fits your need

Think about whether you will be doing CLIA or CAP certified tests through the pipeline and what kind of control and customization you need

Make sure your laboratorians can use the tool and interpret the output


For more information, contact CDC: 1-800-CDC-INFO (232-4636), TTY: 1-888-232-6348, www.cdc.gov

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Questions?

Use of trade names is for identification only and does not imply endorsement by the Centers for Disease Control and Prevention or the U.S. Department of Health and Human Services.

Resources:

Program | What for? | Where to find it | Cost? | Platform
BioNumerics 7.5 | Assembly, wgMLST, SNP analysis | http://www.applied-maths.com/ | Yes | Windows
CLC Bio Genomics Workbench | Workflows, read metrics, assemblies, etc, SNP analyses | https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/ | Yes | Windows/Linux
Geneious | Assemblies, trees, SNP analysis | http://geneious.com/ | Yes | Windows
MEGA6 | Phylogenies | megasoftware.net/ | No | Windows
Lasergene | Assemblies, read metrics, analysis | http://www.dnastar.com/ | Yes | Windows
NCBI Genome Workbench | Viewing trees, analysis | http://www.ncbi.nlm.nih.gov/tools/gbench/ | No | Windows/Linux
CFSAN SNPpipeline | Assembly, read metrics, assembly metrics, read cleaning, etc | sourceforge.net/projects/cg-pipeline | No | Linux
Snp Extraction Tool | Read cleaning, Creating Phylogenies | github.com/lskatz/lyve-SET | No | Linux
