
Lecture Topics:
• Types of samples and analyses
• Experimental design and analysis
• Data formats and conversion tools
• Alignment, de-novo assembly, and other analyses
• Computing needs and available resources
• Annotation
• Summarizing and visualizing results

Labs: Lab sessions meet in a computing lab and will provide students with hands-on experience in managing and analyzing datasets from Illumina and Roche/454 instruments, covering the same set of topics as the lectures. Example datasets will be available from both platforms, for both DNA and RNA samples; students who have their own datasets may contact the instructor prior to the course to discuss opportunities for analysis of their data during the lab sessions.

* See http://www.physics.ubc.ca/mbelab/computer/linux-intro/html/ for an overview.

Overview:
• This course will cover methods for analysis of data from Illumina and Roche/454 high-throughput sequencing, with or without a reference genome sequence, using free and open-source software tools, with an emphasis on the command-line Linux computing environment*

BIT 815: Analysis of Deep Sequencing Data


Introduction to the course and to each other
- background in biology, computing, and sequencing
- experiments of interest to participants

Course structure
- 3 two-hour blocks per week
  * ~45 min lecture/discussion
  * ~70 min lab exercises
- some assigned reading
- participation in classroom discussion is expected
- no exams

Course Objective - to teach you how to teach yourself

The sequencing rate is growing faster than Moore’s Law

Stein (2010) Genome Biology 11:207

Doubling times labeled in the figure: 19.8 months, 2 months, 7.3 months

An alternative perspective from an independent source

Sequence data analysis is changing rapidly
- relatively few methods are completely static
- much of the software is still under active development
- new methods and tools are reported every month
- staying on the learning curve is essential

Why use Linux for sequencing data analysis?
- it is well-suited to the task
  * preferred development platform for most tools
  * modular design
  * however … it’s built for speed, not for comfort

Modular design in Linux – a ‘toolbox’ approach

• Individual components of the Linux operating system are written as separate programs

• Different programs can have similar functions

• A Linux “distribution” is a collection of programs that work together as an operating system

• Users have the power to add new programs, or take away existing programs that are not being used, to optimize system performance

A map of the actual software components of the kernel

Why is modularity an advantage?
- adding new software is relatively straightforward
- the operating system can be continually upgraded
- adding tools to the toolbox is easy
- staying on the learning curve is essential

There is always more than one way to do it
- some sequence analysis tasks have matured to stability
- most have not, and are still changing
- ‘best practices’ are also changing, and subject to dispute

Linux distributions
- collections of ‘tools’ targeted to different user groups
- some are commercial, most are not
- five or six account for most of the users
- many dozens of variants available, mostly of minor interest

Which to use for sequencing data analysis?
- Ubuntu
  * widely-used distribution with good hardware support
  * base for Bio-Linux, with pre-installed bioinformatics packages
  * Bio-Linux is also available as an Amazon EC2 machine image for cloud computing

Amazon Web Services
- a commercial resource for computing infrastructure
- provides access to ‘virtual machines’, or VMs, for users
- a VM is an ‘instance’ of a ‘machine image’
- the underlying image is the same for every instance
- user-generated files are lost when the instance terminates

Using AWS for this course
- Cloudbiolinux is a machine image built on Ubuntu
- We will use laptops (your own or BIT-provided) as terminals
- Connection will be through Secure SHell (SSH), using PuTTY
- A graphical interface is available through NX Client

Sequencing technology overview

- Two different systems on campus: Illumina GAIIx, 454
- A similar overall strategy for highly-parallel sequencing
- Different approaches taken at virtually every step
- These different platforms produce data with different characteristics
- Other platforms are available off-campus, but are not a focus of the course

Similarities
- DNA molecules are fragmented and ligated to adaptors
- individual DNA molecules are immobilized on a surface
- a series of nucleotide addition reactions is carried out
- the nucleotide added is detected after each addition
- a data file is produced containing the DNA sequences of many fragments

DNA fragmentation – usually sonication

Adaptor oligonucleotide addition

Images from www.454.com

Sequencing technology overview - 454

A single molecule immobilized on a bead

PCR amplification in oil-water emulsion creates ~10 million copies per bead


DNA-containing beads deposited in wells of PicoTiterPlate, along with smaller beads with immobilized enzymes for light production

“Pyrosequencing” produces light when any nucleotide is incorporated, so only a single nucleotide is provided during a cycle, and light output is recorded during each cycle



TACG ‘key’ sequence

A ‘flowgram’ showing light output from each cycle of base addition. One flowgram is produced for each of the ~1 million wells in a PicoTiterPlate.
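A flowgram can be read as a sequence once each light signal is interpreted as the number of bases incorporated during that flow. The sketch below is a simplified illustration, not the 454 base-calling software: it assumes nucleotides are flowed in the repeating order T, A, C, G, that signals are already normalized so that 1.0 means one base, and the function name `call_bases` and the example values are made up for this example.

```python
def call_bases(flow_values, flow_order="TACG"):
    """Convert per-flow light signals into a base sequence.

    flow_values: normalized signals, one per nucleotide flow (assumed input).
    flow_order:  repeating order of nucleotide flows (an assumption here).
    """
    bases = []
    for i, signal in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporated bases; never negative.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical flowgram: the first four flows each give a signal near 1,
# followed by a signal near 2 in an A flow (a homopolymer of two A's).
flowgram = [1.02, 0.98, 1.05, 0.97, 0.10, 2.10, 0.05, 0.90]
print(call_bases(flowgram))
```

Because the homopolymer length is inferred from signal strength, a signal that falls between two whole numbers is the natural place for an insertion or deletion error in this kind of data.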

Sequencing technology overview – Illumina GAIIx

Illumina uses a glass ‘flowcell’, about the size of a microscope slide, with 8 separate ‘lanes’.

The GAIIx instrument focuses the laser and light detection system only on one of the two surfaces inside the flowcell; the new HiSeq instrument scans both surfaces and therefore doubles the yield of sequence data per lane. Additional improvements in scanning and increases in cluster density make the difference closer to 4x or 5x more data from a HiSeq.


Fragment DNA, ligate adaptor oligos

Single-stranded DNA binds to flowcell surface


Surface-bound primers are extended by DNA polymerase across annealed ssDNA molecules, the DNA is denatured back to single strands, and the free ends of immobilized strands anneal again to oligos bound on the surface of the flowcell. This ‘bridge PCR’ continues until a cluster of ~1000 molecules is produced on the surface of the flowcell, all descended from the single molecule that bound at that site. After PCR, the free ends of all DNA strands are blocked.


Another perspective of the amplification process, showing the clusters of products


GCTGACTTAG

AGCCGTAAGT

Although four different colors are used for the fluorescent nucleotides, only two lasers are used to excite the fluorescence. The fluorescent labels are grouped in pairs - labels on A and G are excited by one laser, and labels on C and T are excited by the other laser.

This means that distinguishing between the A signal and the G signal is more difficult for the instrument than A versus C or A versus T. Base substitution errors are the most common type of sequencing error for Illumina instruments.

Understanding FASTQ format
or “what do all these symbols mean?”

See http://en.wikipedia.org/wiki/FASTQ_format for more details

Labeled fields: instrument ID, lane, tile, X, Y, barcode, read#

• Quality scores are numbers that represent the probability that the given base call is an error.

• These probabilities are always less than 1, so the value is given as minus 10 times the base-10 logarithm of the probability: Q = -10 × log10(p)

• For example, an error probability of 0.001 (1 × 10^-3) is represented as a quality score of 30.

• The numbers are converted into text characters so they occupy less space – a single character carries the same information as two digits plus a space between adjacent values
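The conversion between error probability, quality score, and text character can be sketched in a few lines. This is an illustrative sketch: the function names are made up for this example, and the default offset of 33 is an assumption (it is one of several offsets that have been used in FASTQ files).

```python
import math


def phred_quality(error_prob):
    """Quality score Q = -10 * log10(p), rounded to the nearest integer."""
    if not 0.0 < error_prob <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return round(-10 * math.log10(error_prob))


def error_probability(quality):
    """Inverse conversion: p = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def encode_quality(quality, offset=33):
    """Store a quality score as a single text character (offset is assumed)."""
    return chr(quality + offset)


def decode_quality(char, offset=33):
    """Recover the quality score from a single text character."""
    return ord(char) - offset


print(phred_quality(0.001))   # quality score for p = 0.001
print(encode_quality(30))     # the character that stores Q30 at offset 33
print(decode_quality("I"))    # the score stored by the character 'I'
```

With an offset of 33, a score of 30 is stored as the character with code 63, so one character replaces the two digits "30" and the space that would otherwise separate it from the next value.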

Labeled parts of a FASTQ record: header lines, sequence, quality scores



Understanding FASTQ format

Illumina v1.8 header version: @HWI-EAS209:06:FC706VJ:5:58:5894:21141 1:N:ATCACG

Labeled fields: instrument/flowcell ID, lane, tile, X, Y, barcode, read#
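Splitting a header line into its fields is a common first step when checking what a dataset contains. The sketch below parses the example header shown above. The names `IlluminaHeader` and `parse_header` are made up for this example, and the field names `run` and `filtered` are assumptions based on the commonly documented v1.8 layout rather than on the labels in the slide.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run: str        # assumed meaning of the second field
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: str   # assumed meaning of the Y/N flag
    barcode: str


def parse_header(line):
    """Split a v1.8-style FASTQ header line into named fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    location, _, description = line[1:].strip().partition(" ")
    loc = location.split(":")
    desc = description.split(":")
    if len(loc) != 7 or len(desc) < 3:
        raise ValueError("unexpected header layout: " + line)
    instrument, run, flowcell, lane, tile, x, y = loc
    # The barcode is taken as the last field, so headers with an extra
    # field between the filter flag and the barcode are also accepted.
    return IlluminaHeader(instrument, run, flowcell, int(lane), int(tile),
                          int(x), int(y), int(desc[0]), desc[1], desc[-1])


header = parse_header("@HWI-EAS209:06:FC706VJ:5:58:5894:21141 1:N:ATCACG")
print(header.lane, header.tile, header.x, header.y, header.barcode)
```

Older header formats use different separators and field orders, so a parser like this should fail loudly, as it does here, rather than guess.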

Unfortunately, at least four different ways of converting numbers to characters have been used, and header line formats have also changed, so one aspect of data analysis is knowing what you have.
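One way to find out "what you have" is to look at the range of characters used in the quality lines. The sketch below is a heuristic, not a definitive test: the function name is made up for this example, and the cut-off character codes (59 and 64) reflect the commonly described Phred+33, Solexa+64, and Phred+64 conventions, which are assumptions not stated in the slides.

```python
def guess_quality_offset(quality_lines):
    """Guess the quality encoding from the lowest character code observed."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        # Characters below ';' only occur with an offset of 33.
        return "Phred+33"
    if lowest < 64:
        # Characters from ';' to '?' encode negative Solexa scores at offset 64.
        return "Solexa+64"
    return "Phred+64"


print(guess_quality_offset(["!''*((((***+))%%%++"]))
print(guess_quality_offset(["hhhhhhhhhhhhgggggBBBB"]))
```

The guess is only as good as the sample: a Phred+33 file in which every sampled base has a high quality score contains no low characters and would be misclassified, so it is worth sampling many reads.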

Labeled parts of a FASTQ record: header lines, sequence, quality scores

Illumina flowcell geometry (GAIIx)

Lanes: 1 2 3 4 5 6 7 8

A flowcell has 8 lanes, which are physically separated. Each lane is imaged during each cycle of sequencing in 120 separate images, called ‘tiles’, which are not physically separated.

Tiles within a lane are numbered from 1 to 60 down the length of the lane, then from 61 to 120 back up the other side.

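Diagram of tile numbering within a lane: tiles 1, 2, ..., 59, 60 run down one side; tiles 61, 62, ..., 119, 120 run back up the other.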
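The tile numbering described above can be turned into a position within the lane, which is useful when looking for spatial patterns in quality. This is a minimal sketch: the function name, the choice to call the two sides "column 1" and "column 2", and counting rows from the end where tile 1 sits are all conventions chosen for this example.

```python
TILES_PER_COLUMN = 60  # GAIIx: 120 tiles per lane, in two columns of 60


def tile_position(tile):
    """Return (column, row) for a tile number, with row 1 next to tile 1."""
    if not 1 <= tile <= 2 * TILES_PER_COLUMN:
        raise ValueError("tile number must be between 1 and 120")
    if tile <= TILES_PER_COLUMN:
        # Tiles 1 to 60 run down the first column.
        return 1, tile
    # Tiles 61 to 120 run back up the second column.
    return 2, 2 * TILES_PER_COLUMN + 1 - tile


for tile in (1, 60, 61, 120):
    print(tile, tile_position(tile))
```

With this layout, tiles 60 and 61 share a row at one end of the lane, and tiles 1 and 120 share a row at the other end.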

SolexaQA quality checking

SolexaQA is a program written in Perl (a programming language designed for manipulating text files) that samples a specified number of reads from each tile and calculates a range of summary statistics. Mean quality scores for each cycle are calculated by default; minimum/maximum values and variances can also be calculated if desired.

The program returns data in both matrix form and in more visually-informative graphical formats, including a heatmap showing the quality per cycle for every tile and a plot showing the mean quality per cycle for each tile and the global mean quality. A histogram is also returned, showing the distribution of lengths of the longest segment of each read that surpasses a user-specified quality score or error probability threshold (p = 0.05 by default).
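The per-tile, per-cycle summary behind the heatmap can be sketched as follows. This is not SolexaQA's code: the function name and the input format (pairs of tile number and quality string) are made up for this example, the offset of 33 is an assumption, and the sketch averages quality scores directly rather than error probabilities.

```python
from collections import defaultdict


def mean_quality_by_tile(records, offset=33):
    """Return {tile: [mean quality score at cycle 1, cycle 2, ...]}.

    records: iterable of (tile, quality_string) pairs (assumed input format).
    """
    sums = defaultdict(list)
    counts = defaultdict(list)
    for tile, quality in records:
        tile_sums, tile_counts = sums[tile], counts[tile]
        for cycle, ch in enumerate(quality):
            if cycle == len(tile_sums):
                # First read from this tile that reaches this cycle.
                tile_sums.append(0)
                tile_counts.append(0)
            tile_sums[cycle] += ord(ch) - offset
            tile_counts[cycle] += 1
    return {tile: [s / n for s, n in zip(sums[tile], counts[tile])]
            for tile in sums}


# Hypothetical sample: two reads from tile 1 and one read from tile 2.
sample = [(1, "II"), (1, "5I"), (2, "##")]
print(mean_quality_by_tile(sample))
```

Each list in the result corresponds to one row of a tile-by-cycle matrix, which is the kind of table a heatmap is drawn from.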

Another Perl program, DynamicTrim.pl, is provided that will trim the reads to leave only the longest contiguous segment that surpasses the quality threshold, and write the trimmed reads to a new FASTQ-format file for further use.
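The idea of keeping only the longest contiguous high-quality segment can be sketched in Python. This is an illustration of the idea described above, not the DynamicTrim.pl implementation: the function name is made up for this example, the quality offset of 33 is an assumption, and a base counts as acceptable when its error probability is below the cut-off.

```python
def longest_good_segment(sequence, quality, p_cutoff=0.05, offset=33):
    """Return the longest run of bases whose error probability is below p_cutoff."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    best_start, best_len = 0, 0
    run_start = None
    for i, ch in enumerate(quality):
        error_prob = 10 ** (-(ord(ch) - offset) / 10)
        if error_prob < p_cutoff:
            if run_start is None:
                run_start = i          # a new acceptable run begins here
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None           # a poor base ends the current run
    end = best_start + best_len
    return sequence[best_start:end], quality[best_start:end]


# Hypothetical read: 'I' is a high score and '#' a very low one at offset 33.
print(longest_good_segment("ACGTACGTAC", "II#IIIII#I"))
```

A read with no acceptable bases comes back as a pair of empty strings, so downstream steps need to decide whether to drop such reads.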


An example heatmap output from SolexaQA, from a sequencing run of particularly poor quality. This particular run consists of 75 cycles (shown left to right) and 100 tiles (shown top to bottom). Note that tile 75 failed to yield any sequences that passed the Solexa quality filtering software. The coloring of the heat map is shown as error probability, with the darkest shade indicating p=0.75, the same as random guessing. The quality values clearly differ widely across tiles at cycles after about 25, as well as decreasing within a tile as the cycle number increases. Occasional isolated cycle failures for individual tiles may be the result of a bubble or other problem with reagent flow in the flowcell.


An example plot from the same dataset, showing the dramatic differences in average quality across tiles (the dotted lines), and the increase in error probability with cycle number.


The distribution of lengths of longest segments with error probability < 0.05 from the same dataset. This is an unusually poor quality dataset, but illustrates the capabilities of SolexaQA well.
