Post on 22-Feb-2016
Error Correction in High-Throughput Datasets
Dale Beach, Longwood University
Lisa Scheifele, Loyola University Maryland
Next-generation sequencing has revolutionized both biological research and clinical medicine, with sequencing of entire human genomes being used to predict drug responsiveness and to diagnose disease (for example Choi 2009).
The advent of next-gen sequencing requires students and researchers to deal with large datasets
Students must be able to address error in large datasets
http://www.pnas.org/content/106/45/19096/F3.expansion.html
http://www.pnas.org/content/106/45/19096.full.pdf+html
In contrast to traditional Sanger sequencing, next-generation sequencing datasets have shorter read lengths and higher error rates. This creates challenges for downstream analysis: because the number of reads is so large, even a small per-base error rate leaves many reads containing at least one error. For example, Illumina MiSeq reads have an error rate of about 0.1% per base (Glenn 2011), yet only ~86% of 150 bp reads (0.999^150) are then expected to be error-free.
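The arithmetic behind that estimate can be checked with a short sketch. It assumes that every base fails independently at the same rate, which is a simplification (real error rates tend to rise toward the end of a read). The function name `error_free_fraction` is made up for this illustration.

```python
def error_free_fraction(per_base_error: float, read_length: int) -> float:
    """Probability that a read contains no errors, assuming each base
    is miscalled independently with the same probability."""
    return (1.0 - per_base_error) ** read_length


# 0.1% per-base error rate, 150 bp reads (the values quoted in the text)
fraction = error_free_fraction(0.001, 150)
print(f"Expected error-free fraction: {fraction:.3f}")  # roughly 0.86

# The same formula shows how quickly the fraction drops for longer reads.
for length in (50, 150, 300):
    print(length, round(error_free_fraction(0.001, length), 3))
```

Students can change the error rate or read length to see how each one affects the share of reads that need correction.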
Sequencing error in read
Background
This module is designed for a genetics or molecular biology class. It will require 3 lecture/seminar class periods, with optional additional Linux-based lab activities.
Prior to beginning this module, students should be familiar with:
Sample preparation techniques for DNA sequencing
DNA replication and the enzymes that synthesize DNA
Nucleic acid and nucleotide structure
Research Goals
Initial evaluation of the quality of eukaryotic genome sequencing data
Implementation of error correction techniques
Comparison of the quality of sequencing data before and after error correction
Sequencing Requirements
Completed small eukaryotic genome data on Illumina platform
If students will not be performing command-line programming themselves, this data should be analyzed with:
Jellyfish to produce data on k-mer frequencies that students can use to generate a histogram in Excel
Quake to perform error correction so that students can be provided with pre- and post-error correction datasets
Student Learning Goals
At the completion of this module, students will be able to:
Describe the important differences between high-throughput and traditional (low-throughput) experiments
Explain the reasons for variations in the quality of high-throughput datasets
Utilize computational tools to quantify errors in sequencing data
Interpret the quality of a sequencing experiment and be able to implement effective quality control measures
Computer Requirements
Excel or another analytical package to create a k-mer frequency distribution
Galaxy to create a boxplot of PHRED33 scores
Optional: Quake and Jellyfish on Linux system to generate k-mer data and perform error correction
Vision and Change Competencies
This module will develop students’ abilities to:
Apply the process of science
▪ Design experiments, from methodological design through data analysis
▪ Analyze and interpret data
Use modeling and simulation
▪ Design experimental strategies and predict outcomes
Use quantitative reasoning
▪ Depict data using histograms and boxplots
▪ Interpret graphs and use the results of their analysis to modify error correction strategies
Timeline: Class 1 Introductory lecture and data upload
Intro to sequencing history and platforms
Discuss typical sources of error in sequencing reads
Discuss sequence output formats and PHRED33 scores
Upload raw data to Galaxy
Optional: Quake in Linux to manipulate parameters and improve quality
http://www.nimr.mrc.ac.uk/mill-hill-essays/bringing-it-all-back-home-next-generation-sequencing-technology-and-you#
Introduce software packages that can be used to assess data quality
Demonstrate breaking sequencing reads into k-mers
Use Excel or Jellyfish to create k-mer graph
Use Excel or Jellyfish to create k-mer graph following manipulation of error correction parameters (variations in k-mer size)
K-mer frequency distribution
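Breaking reads into k-mers and tallying how often each one occurs can be sketched in a few lines of Python. This is a toy illustration, not how Jellyfish is implemented: it holds everything in memory, does not merge a k-mer with its reverse complement, and the three reads are invented for the example. The function names `count_kmers` and `frequency_histogram` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Slide a window of width k along every read and count each k-mer."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def frequency_histogram(kmer_counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return dict(sorted(Counter(kmer_counts.values()).items()))


# Invented toy reads; the second and third each differ from the first near one end.
reads = ["ACGTACGT", "ACGTACGA", "CGTACGTT"]

kmer_counts = count_kmers(reads, k=4)
histogram = frequency_histogram(kmer_counts)

for multiplicity, n_kmers in histogram.items():
    print(multiplicity, n_kmers)
```

The two columns printed at the end are the kind of table students can paste into Excel to draw the k-mer frequency distribution. K-mers that occur only once or twice in a deeply sequenced genome are the usual candidates for sequencing errors, and rerunning with a different `k` shows how the distribution shifts.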
Timeline: Class 2 Setting up analysis and adjusting parameters
Discussion of using PHRED33 scores to assess data quality
Create boxplots of PHRED33 scores in Galaxy for raw data
Create boxplots of PHRED33 scores in Galaxy for data post Quake correction
Optional: have students compare outcomes following Quake correction with different parameters
Raw Data
Data post Quake correction
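For classes that want to see what lies behind the boxplots, the sketch below decodes Phred+33 quality strings and computes the five numbers a boxplot displays at each read position. It is an illustration only, not the Galaxy tool's implementation: the quality strings are invented, the function names are hypothetical, and reads are assumed to have equal length (`zip` stops at the shortest read). `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def phred33_scores(quality_string):
    """Decode a FASTQ quality line: score = ASCII code of the character minus 33."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(score):
    """Convert a Phred score Q to the probability of a wrong base call, 10^(-Q/10)."""
    return 10 ** (-score / 10)


def per_position_summary(quality_strings):
    """Return (min, Q1, median, Q3, max) of the scores at each read position."""
    score_rows = [phred33_scores(q) for q in quality_strings]
    summary = []
    for column in zip(*score_rows):  # one column per read position
        q1, median, q3 = statistics.quantiles(column, n=4)
        summary.append((min(column), q1, median, q3, max(column)))
    return summary


# Invented quality strings whose scores fall toward the 3' end.
qualities = ["IIII", "IIH5", "IHG+", "IIF#"]

summary = per_position_summary(qualities)
for position, stats in enumerate(summary, start=1):
    print(position, stats)

print(error_probability(20))  # a score of 20 means a 1-in-100 chance of error
```

Running the same summary on raw and corrected quality strings gives students a numerical counterpart to the two boxplots.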
Timeline: Class 3 Assessing quality
Discussion Topics
Why has next-generation sequencing technology led to a revolution in biology/medicine?
Discuss and predict how chemical and physical mechanisms lead to errors
Comparison of sequence improvement based on different parameters
How do software packages determine which base is in error and which is correct if sequencing reads conflict?
Why is it important to have a numerical measure of error in addition to the nucleotide sequence?
Assessment
This module will be performed as a team-based project, with students preparing and handing in a report at the end. Students will be able to:
Predict predominant types or sources of error based on experimental design and sequencing platform
Prepare a boxplot using Galaxy for an exemplary dataset and use the boxplot to evaluate the quality of the sequence data
Effectively improve the quality of any set of NGS reads prior to assembly
References
https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish
www.en.wikipedia.org/wiki/FASTQ_format
Kelley DR, Schatz MC, Salzberg SL. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11:R116. [Quake program]
Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27:764-770. [Jellyfish program]
http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791/pdf/ukmss-2586.pdf