Considerations for Analyzing Targeted NGS Data Introduction
Considerations for Analyzing Targeted NGS Data
Introduction
Tim Hague, CTO
Introduction
Many mapping, alignment and variant calling algorithms
Most of these have been developed for whole genome sequencing and to some extent population genetic studies.
Premise
In contrast, NGS based diagnostics deals with particular genes or mutations of an individual.
Different diagnostic targets present specific challenges.
Goal
Present analysis issues related to differences in:
Sequencing technologies
Targeting technologies
Target specifics
Pseudogenes and segmental duplication
NGS Sequencers
Illumina
Ion Torrent
Roche 454
(SOLiD)
Roche 454, Illumina, Ion Torrent
Moore B, Hu H, Singleton M, De La Vega, FM, Reese MG, Yandell M. Genet Med. 2011 Mar;13(3):210-7.
Sequencing Technology
Differences:
Homopolymer error rates
G/C content errors
Read length
Sequencing protocols (single vs paired reads)
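Two of the differences above, homopolymer error rates and G/C content errors, depend on properties that can be measured directly on a read or target sequence. The following is a minimal sketch, not part of the original slides; the function names, the example read and the `min_len` threshold of 4 are illustrative assumptions.

```python
from itertools import groupby


def gc_content(seq):
    """Return the fraction of G or C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of identical bases >= min_len.

    min_len=4 is an arbitrary illustrative threshold, not a recommended value.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Hypothetical read used only for illustration.
read = "ACGTTTTTTGCGCGGGGA"
print(round(gc_content(read), 3))
print(homopolymer_runs(read))
```

Regions flagged this way are the ones where platform-specific error profiles would be expected to matter most.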
Targeting Methods
PCR primers (e.g. amplicons)
Hybridization probes (e.g. exome kits)
Targeting Technology
Differences:
Exact matching regions vs regions with SNPs.
Results in:
Need for mapping against whole chromosomes to avoid false positives.
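Why mapping against only the target can create false positives can be shown with a toy example. This is a sketch under strong assumptions: the sequences are invented, the "mapper" is a brute-force ungapped search, and `best_hit` is a hypothetical helper, not how any real aligner works.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, references):
    """Return (name, offset, mismatches) of the lowest-mismatch ungapped placement.

    references maps a name to a sequence; ties keep the first placement found.
    """
    best = None
    for name, ref in references.items():
        for offset in range(len(ref) - len(read) + 1):
            mm = hamming(read, ref[offset:offset + len(read)])
            if best is None or mm < best[2]:
                best = (name, offset, mm)
    return best


# Invented sequences: the paralog differs from the target at two positions.
target = "GATTACAGGCTTACCGATAGC"
paralog = "GATTACAGGCTAACCGTTAGC"
read = "GGCTAACCGTTA"  # read that truly originates from the paralog

# Target-only reference: the read is forced onto the target with 2 mismatches,
# which a variant caller could report as two (false) SNPs.
print(best_hit(read, {"target": target}))

# Wider reference: the read finds its true, mismatch-free origin.
print(best_hit(read, {"target": target, "paralog": paralog}))
```

With the wider reference the read no longer contributes spurious mismatches to the target.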
Analysis Targets
Differences:
Rate of polymorphism
Repetitive structures
Mutation profiles
G/C content
Single genes vs multi gene complexes
BRCA1/2: 1/2000
HLA: 1/29
CFTR: 1/2000
Distributions of insertions and deletions
Distribution of repeat elements
Segmental Duplications
Sometimes called Low Copy Repeats (LCRs)
Highly homologous, >95% sequence identity
Rare in most mammals
Comprise a large portion of the human genome (and other primate genomes)
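The ">95% sequence identity" criterion can be made concrete with a small calculation. This is a minimal sketch assuming two already aligned, equal-length, ungapped sequences; real comparisons of duplicated regions need a proper alignment, and the sequences and names below are invented.

```python
def percent_identity(a, b):
    """Percent of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# Invented 40 bp copies that differ at a single position.
copy_a = "ACGT" * 10
copy_b = copy_a[:17] + "T" + copy_a[18:]

identity = percent_identity(copy_a, copy_b)
print(identity)          # 39 of 40 positions match
print(identity > 95.0)   # meets the >95% identity criterion
```

At this level of similarity, short reads from one copy will often place equally well on the other.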
Important for understanding HLA
Segmental Duplications
Many LCRs are concentrated in "hotspots"
Recombinations in these regions are responsible for a wide range of disorders, including:
Charcot-Marie-Tooth syndrome type 1A
Hereditary neuropathy with liability to pressure palsies
Smith-Magenis syndrome
Potocki-Lupski syndrome
Data Analysis Tools
Differences:
Detection rates of complex variants (sensitivity)
False positive rates (accuracy)
Speed
Ease of use
Data analysis shouldn’t be like this!
“Depending upon which tool you use, you can see pretty big differences between even the same genome called with different tools—nearly as big as the two Life Tech/Illumina genomes.”
Mark Yandell in BioIT-World.com, June 8, 2011
Examples
Missing variants
SNPs, a DNP and deletions
Identify more valid variants
Find homopolymer indels
Examples
Coverage differences
Four times exon coverage: [0-432] vs [0-96]
Higher exome coverage: [0-24] vs [0-10]
First conclusion
Read accuracy is not the limiting factor in accurate variant analysis.
Example Dense region of SNPs
www.omixon.com
Second conclusion
As variant density increases the performance of most tools goes down.
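One way to locate the dense regions this conclusion refers to is a sliding-window count over variant positions. This is a rough sketch; the window size of 50 bp, the threshold of 3 variants and the positions are illustrative assumptions, not values from the talk.

```python
from bisect import bisect_left


def dense_windows(positions, window=50, min_variants=3):
    """Return (start, count) for each variant that opens a dense window.

    A window covers positions start .. start + window - 1 on one chromosome.
    window and min_variants are arbitrary illustrative defaults.
    """
    positions = sorted(positions)
    result = []
    for i, start in enumerate(positions):
        # Index of the first variant at or beyond the end of the window.
        j = bisect_left(positions, start + window)
        count = j - i
        if count >= min_variants:
            result.append((start, count))
    return result


# Invented variant positions on a single chromosome.
variant_positions = [100, 110, 125, 140, 900, 2000, 2010]
print(dense_windows(variant_positions))
```

Calls inside the reported windows are the ones that deserve extra scrutiny or a second tool.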
Variant Calling
There are a few popular variant callers: GATK, SAMtools mpileup, VarScan
The most comprehensive (GATK) has a whole pipeline, including a quality recalibration step and an indel realignment step
Running these recalibration and realignment steps before any variant calling is highly recommended
Deduplication and removing non-primary alignments may also be required
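Deduplication and removal of non-primary alignments can be expressed as a filter on the SAM FLAG field. The sketch below works on plain SAM text lines and assumes duplicates were already marked by an upstream tool; treating supplementary alignments as non-primary is also an assumption, and the function names and records are invented.

```python
# FLAG bits as defined in the SAM specification.
SECONDARY = 0x100      # not the primary alignment of the read
DUPLICATE = 0x400      # marked as PCR or optical duplicate
SUPPLEMENTARY = 0x800  # supplementary (e.g. chimeric) alignment
EXCLUDE = SECONDARY | DUPLICATE | SUPPLEMENTARY


def keep_alignment(flag):
    """True if none of the excluded FLAG bits is set."""
    return (flag & EXCLUDE) == 0


def filter_sam_lines(lines):
    """Keep header lines and alignment records that pass keep_alignment."""
    kept = []
    for line in lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if keep_alignment(flag):
            kept.append(line)
    return kept


# Invented records: primary, secondary and duplicate-marked alignments.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr17\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t256\tchr17\t100\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t1123\tchr17\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
]
for line in filter_sam_lines(sam_lines):
    print(line.split("\t")[0])
```

In practice this filtering is usually done by existing tools on BAM files; the sketch only shows which records are dropped.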
Indel realigner problem
Variants that can be hard to find
DNPs
TNPs
Small indels next to SNPs
30+ bp indels
Homopolymer indels
Homopolymer indel and SNP together
Indels in palindromes
Dense regions of variants
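DNPs and TNPs are often reported as separate adjacent SNPs. A post-processing step that merges them could look like the sketch below. It assumes all calls are single-base substitutions on the same chromosome and the same haplotype (phasing is not checked), and the function name and example calls are invented.

```python
def merge_adjacent_snps(snps):
    """Merge runs of directly adjacent SNPs into multi-nucleotide variants.

    snps is a list of (pos, ref, alt) single-base substitutions. Two adjacent
    SNPs become a DNP, three a TNP. Phase is assumed, not verified.
    """
    merged = []
    for pos, ref, alt in sorted(snps):
        if merged and merged[-1][0] + len(merged[-1][1]) == pos:
            # This SNP starts right after the previous variant: extend it.
            prev_pos, prev_ref, prev_alt = merged[-1]
            merged[-1] = (prev_pos, prev_ref + ref, prev_alt + alt)
        else:
            merged.append((pos, ref, alt))
    return merged


# Invented calls: a DNP at 100-101, an isolated SNP, and a TNP at 300-302.
calls = [
    (100, "A", "G"), (101, "C", "T"),
    (250, "G", "A"),
    (300, "T", "C"), (301, "A", "G"), (302, "C", "A"),
]
print(merge_adjacent_snps(calls))
```

Merging only makes sense when the adjacent changes occur on the same reads, which this sketch does not verify.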