
Vijayachitra Modhukur BIIT


Next generation sequencing (NGS)

Bioinformatics course 11/13/12 1

Sequencing


Microarrays vs NGS

• Sequences do not need to be known in advance
• Highly quantitative
• Lower noise levels; NGS does not suffer from cross-hybridization
• NGS provides increased sensitivity to detect rare sequences in complex genomic samples
• Accurate single-nucleotide resolution permits the discrimination between highly related sequences
• The lowered cost of NGS makes comprehensive mapping of multiple features possible

Paul J. Hurd et al.

Outline of NGS


Why sequencing?

• Genome architecture
• Disease diagnosis
• Variability studies
• Comparative genomics
• Gene regulation
• Drug design
• and many more……


Different generations (computers and sequencing)


First Generation – Sanger sequencing

• http://www.youtube.com/watch?v=aPN8LP4YxPo&feature=related

Application – Human Genome Project, 1990-2003


Human Genome Project key findings

• 1. There are approximately 23,000 genes in human beings, the same range as in mice and roundworms. Understanding how these genes express themselves will provide clues to how diseases are caused.

• 2. The human genome has significantly more segmental duplications (nearly identical, repeated sections of DNA) than other mammalian genomes. These sections may underlie the creation of new primate-specific genes.

• 3. At the time when the draft sequence was published, fewer than 7% of protein families appeared to be vertebrate specific.

http://en.wikipedia.org/wiki/Human_Genome_Project/

Second generation sequencing



http://sciblogs.co.nz/code-for-life/2012/03/22/the-world-in-dna-sequencers/

ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796

Breakthrough NGS technology


NGS platforms


Leading Platforms (specifications as of summer 2008)

               454           Solexa/Illumina       SOLiD (ABI)
Bp per run     400 Mb        2-3 Gb                3-6 Gb
Read length    250-400 bp    35-50 (70-100) bp     35-50 bp
Run time       10 hr         2.5 days              5 days
Download       20 min        27 hr (44 min)        ~1 day
Analysis       2-5 hr        2 days                2-3 days
Files          20-50 Gb      1 T                   1 T

For comparison: with 3730s (Sanger), ~60 Mb per year.

Massive amount of sequenced data


Sequencing projects


Application


Human Genome

Human genome


http://www.mdpi.com/journal/genes/special_issues/nextgen-sequencing/

1,000 genome project


1,000 genome project

• Small inter-individual differences in regulatory regions, found in all human populations
• Association of genetic variation with disease
• Discovery of novel genetic variants such as SNPs, CNVs, etc.
• Improvement of the human reference sequence
• Key result: each person carries 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders

Analysis


data to analysis

cpu/memory intensive

NGS pipeline



Name: Description

BLAT: BLAST-Like Alignment Tool. Can handle one mismatch in initial alignment step.

Bowtie: Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.

BWA: Uses a Burrows-Wheeler transform to create an index of the genome. It is a bit slower than Bowtie but allows indels in alignment.

ELAND: Implemented by Illumina. Includes ungapped alignment with a finite read length.

GMAP and GSNAP: Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.

MAQ: Ungapped alignment that takes into account quality scores for each base.

MOSAIK: Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.

RazerS: No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.

SHRiMP: Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.

SLIDER: An application for the Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as an input for alignment to a reference sequence or a set of reference sequences.

SOAP: Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT, uses a 12 letter hash table. SOAP2 is much faster than the first version.

SOCS: For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.

SSAHA: Fast for a small number of variants.

Taipan: De novo assembler for Illumina reads.

based on http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

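Quality scores

• Each base from a sequencer comes with a quality score
• Quality scores encode base-calling error probabilities
• Phred quality score: Q = -10 log10(P), where P is the probability that the base call is wrong
• A higher quality score indicates a smaller probability of error (Q = 20 means a 1 in 100 chance of error, Q = 30 means 1 in 1,000)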
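The relationship between quality score and error probability can be sketched in a few lines of Python, using the standard Phred definition Q = -10 log10(P). The function names are illustrative, and the ASCII offset of 33 used for decoding quality strings is an assumption (the Sanger convention); older Illumina pipelines used an offset of 64.

```python
import math

def phred_from_error_prob(p):
    """Phred quality score for a base-calling error probability p."""
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Error probability implied by a Phred quality score q."""
    return 10 ** (-q / 10)

def decode_quality_string(qual, offset=33):
    """Convert an ASCII-encoded quality string into integer Phred scores.

    offset=33 is assumed here (Sanger encoding); pass offset=64 for
    files written in the older Illumina encoding.
    """
    return [ord(c) - offset for c in qual]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {error_prob_from_phred(q):g}")
```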


http://www.illumina.com/truseq/quality_101/quality_scores.ilmn

Quality scores


http://www.illumina.com/truseq/quality_101/quality_scores.ilmn

File formats


FASTQ

Raw data

http://en.wikipedia.org/wiki/FASTQ_format

FASTQ to FASTA
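A FASTQ record has four lines: a header starting with @, the sequence, a separator line starting with +, and the quality string. Converting to FASTA keeps the header and sequence and drops the qualities. The sketch below assumes unwrapped four-line records; the function name fastq_to_fasta is illustrative and not taken from any particular tool.

```python
def fastq_to_fasta(fastq_text):
    """Convert FASTQ text to FASTA text, discarding quality strings.

    Assumes every record occupies exactly four lines
    (header, sequence, '+' separator, qualities).
    """
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # FASTA header uses '>' in place of '@'
        records.append(f">{header[1:]}\n{seq}")
    return "\n".join(records) + "\n"

if __name__ == "__main__":
    example = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGCA\n+\n55555\n"
    print(fastq_to_fasta(example), end="")
```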

SAM/BAM format

Thomas Keane, 9th European Conference on Computational Biology, 26th September 2010

SAM/BAM Format

Proliferation of alignment formats over the years: Cigar, psl, gff, xml, etc.

SAM (Sequence Alignment/Map) format

• Single unified format for storing read alignments to a reference genome

BAM (Binary Alignment/Map) format

• Binary equivalent of SAM
• Developed for fast processing/indexing

Advantages

• Can store alignments from most aligners
• Supports multiple sequencing technologies
• Supports indexing for quick retrieval/viewing
• Compact size (e.g. 112 Gbp Illumina = 116 Gbytes disk space)
• Reads can be grouped into logical groups, e.g. lanes, libraries, individuals/genotypes
• Supports second best base call/quality for hard to call bases

Possibility of storing raw sequencing data in BAM as replacement to SRF & fastq

SAM format


Each bit in SAM format
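As an illustration of how a SAM alignment line is laid out, the sketch below splits one tab-separated line into the eleven mandatory fields defined by the SAM specification and decodes the bitwise FLAG field. The short labels given to each flag bit and the example read are made up for illustration; the bit values themselves follow the specification.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# FLAG bit values from the SAM specification; labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def parse_sam_line(line):
    """Split one SAM alignment line into a dict of mandatory fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    record["TAGS"] = cols[11:]  # optional TAG:TYPE:VALUE fields
    return record

def decode_flag(flag):
    """List the informal names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

if __name__ == "__main__":
    line = "r001\t99\tchr1\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*"
    rec = parse_sam_line(line)
    print(rec["QNAME"], rec["RNAME"], rec["POS"], decode_flag(rec["FLAG"]))
```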


Sequence alignment

• Reference alignment
• De novo alignment


Spaced seed vs BWT


Burrows-Wheeler transform

• Original: WBWBWB#
• Transformed: WWW#BBB, which run-length encodes to 3W#3B


Burrows-Wheeler transform


The BWT algorithm must use a character that marks the end of the data, such as the # symbol. Then the BWT algorithm works in three steps. First, it rotates text through all possible combinations, as shown in the Rotate column of Table 4-1. Second, it sorts each line alphabetically, as shown in the Sort column of Table 4-1. Third, it outputs the final column of the sorted list, which groups identical characters together in the Output column of Table 4-1. In this example, the BWT algorithm transforms the string WBWBWB# into WWW#BBB.

Table 4-1 Rotating and Sorting Data

Rotate     Sort       Output
WBWBWB#    BWBWB#W    W
#WBWBWB    BWB#WBW    W
B#WBWBW    B#WBWBW    W
WB#WBWB    WBWBWB#    #
BWB#WBW    WBWB#WB    B
WBWB#WB    WB#WBWB    B
BWBWB#W    #WBWBWB    B

At this point, the BWT algorithm hasn’t compressed any data but merely rearranged the data to group identical characters together; the BWT algorithm has rearranged the data to make the run-length encoding algorithm more efficient. Run-length encoding can now convert the WWW#BBB string into 3W#3B, thus compressing the overall data.
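The three steps (rotate, sort, take the last column) followed by run-length encoding can be sketched in Python as below. One assumption is needed to reproduce the worked example: the end marker # must sort after the letters, which is the ordering implied by the Sort column of Table 4-1 (plain ASCII ordering would place # first and give a different output). The function names are illustrative.

```python
def _sort_key(s, end="#"):
    # Order characters alphabetically, but place the end marker last,
    # matching the ordering used in the worked example.
    return [(c == end, c) for c in s]

def bwt(text, end="#"):
    """Burrows-Wheeler transform of text, which must finish with `end`."""
    if text.count(end) != 1 or not text.endswith(end):
        raise ValueError("text must contain exactly one end marker, at the end")
    # Step 1: all rotations of the text.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Step 2: sort the rotations.
    rotations.sort(key=lambda s: _sort_key(s, end))
    # Step 3: output the last character of each sorted rotation.
    return "".join(r[-1] for r in rotations)

def run_length_encode(s):
    """Replace each run of n > 1 identical characters by 'n' + character."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        run = j - i
        out.append(f"{run}{s[i]}" if run > 1 else s[i])
        i = j
    return "".join(out)

if __name__ == "__main__":
    transformed = bwt("WBWBWB#")
    print(transformed, run_length_encode(transformed))
```

This naive version builds every rotation explicitly, so it is only suitable for short strings; genome-scale indexes such as those in Bowtie and BWA rely on far more memory-efficient constructions.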

After compressing data, you’ll eventually need to uncompress that same data. Uncompressing this data (3W#3B) creates the original BWT output of WWW#BBB, which contains all the characters of the original, uncompressed data but not in the right order. To retrieve the original order of the uncompressed data, the BWT algorithm repetitively goes through two steps, as shown in Figure 4-1.

The BWT algorithm works in reverse by adding the original BWT output (WWW#BBB) and then sorting the lines repetitively a number of times equal to the length of the string. So retrieving the original data from a 7-character string takes seven adding and sorting steps.

After the final add and sort step, the BWT algorithm looks for the only line that has the end of data character (#) as the last character, which identifies the original, uncompressed data. The BWT algorithm is both simple to understand and implement, which makes it easy to use for speeding up ordinary run-length encoding.
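The add-and-sort procedure described above can be sketched as follows. As with the forward example, this assumes the end marker # sorts after the letters, and it assumes the data itself contains no digits so that the run-length decoding is unambiguous. Function names are illustrative.

```python
import re

def run_length_decode(s):
    """Expand '3W#3B' back into 'WWW#BBB' (data must not contain digits)."""
    return "".join(ch * int(count or 1)
                   for count, ch in re.findall(r"(\d*)(\D)", s))

def inverse_bwt(last_column, end="#"):
    """Recover the original text from its BWT output by add-and-sort."""
    def key(s):
        # Same ordering as the forward transform: end marker sorts last.
        return [(c == end, c) for c in s]

    table = [""] * len(last_column)
    # One add step and one sort step per character of the string.
    for _ in range(len(last_column)):
        # Add: prepend the BWT output, one character per line.
        table = [c + row for c, row in zip(last_column, table)]
        # Sort: reorder the lines.
        table.sort(key=key)
    # The original text is the only line ending with the end marker.
    return next(row for row in table if row.endswith(end))

if __name__ == "__main__":
    restored = inverse_bwt(run_length_decode("3W#3B"))
    print(restored)
```

Each of the n passes sorts n strings, so this direct approach is quadratic or worse in the text length; it mirrors the description rather than the faster inversion methods used in practice.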


Sequence assembly: solving a jigsaw puzzle


Sequence assembly: repeating patterns


Greedy Assemblers

• Greedily join together the reads that are most similar to each other
• Examples: Phrap, Cap3, TIGR assembler

© 2009 SIB LF June 4, 2010

Greedy

• Greedy assemblers - The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other.

• An example is shown below, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp) thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that because local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
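The greedy strategy can be sketched in Python as below: at every step the two sequences with the longest exact suffix-prefix overlap are merged, until no overlap of at least min_len remains. This is a toy illustration rather than the algorithm of any named assembler; it uses exact matching only, ignores reverse complements and sequencing errors, and the reads in the usage example are made up.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs

if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

Because each merge looks only at the current best overlap, two reads that overlap inside a repeat can be joined even when they come from different genomic locations, which is the mis-assembly risk noted above.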


Overlap-layout-consensus

• Overlap-layout-consensus - The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent each of the reads and an edge connects two nodes if the corresponding reads overlap. The assembly problem thus becomes the problem of identifying a path through the graph that contains all the nodes - a Hamiltonian path (Figure below). This formulation allows researchers to use techniques developed in the field of graph theory in order to solve the assembly problem.

• An assembler following this paradigm starts with an overlap stage during which all overlaps between the reads are computed and the graph structure is computed. In a layout stage, the graph is simplified by removing redundant information. Graph algorithms are then used to determine a layout (relative placement) of the reads along the genome. In a final consensus stage, the assembler builds an alignment of all the reads covering the genome and infers, as a consensus of the aligned reads, the original sequence of the genome being assembled.

Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines).
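A minimal sketch of the overlap and layout stages, assuming exact suffix-prefix overlaps and a handful of made-up reads: it builds the overlap graph and then searches for a Hamiltonian path by brute force. Trying every ordering is only feasible for a few reads; real assemblers simplify the graph and use heuristics instead. All names here are illustrative.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def build_overlap_graph(reads, min_len=3):
    """Overlap stage: edge (i, j) -> overlap length, for all read pairs."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[(i, j)] = n
    return graph

def hamiltonian_path(num_reads, graph):
    """Layout stage: first ordering that visits every read exactly once."""
    for order in permutations(range(num_reads)):
        if all((u, v) in graph for u, v in zip(order, order[1:])):
            return order
    return None

def spell_path(reads, graph, order):
    """Join reads along the path, skipping the overlapping prefix each time."""
    sequence = reads[order[0]]
    for u, v in zip(order, order[1:]):
        sequence += reads[v][graph[(u, v)]:]
    return sequence

if __name__ == "__main__":
    reads = ["TGCAATC", "ATGGCGT", "GCGTGCA"]
    graph = build_overlap_graph(reads)
    order = hamiltonian_path(len(reads), graph)
    print(order, spell_path(reads, graph, order))
```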


Overlap layout consensus

Barbara Hutter, Assembly

● Overlap: based on all pairwise comparisons

● Construction of an overlap graph

• nodes = reads (sequences)

• edges = connections between overlapping reads

● Layout: look for paths in the overlap graph which are segments of the genome to assemble (contigs)

• goal: find Hamiltonian path = a path that contains all nodes exactly once

● Consensus: following the Hamiltonian path, combine the overlapping sequences in the nodes into the sequence of the genome

• in case of different nucleotides: majority vote considering base qualities

● Programs using the OLC:

• Arachne, Celera Assembler (CABOG), newbler, Minimus, Edena, CAP, PCAP
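The consensus step can be illustrated with a quality-weighted majority vote. In this sketch each read is given as an (offset, sequence, qualities) tuple describing where it sits in the layout, and each base votes with a weight equal to its Phred score. Weighting by the summed Phred score is an assumption chosen for simplicity, not the scheme of any listed program, and the example reads are made up.

```python
from collections import defaultdict

def consensus_base(bases, quals):
    """Pick the base with the largest total quality among the votes."""
    weight = defaultdict(int)
    for base, qual in zip(bases, quals):
        weight[base] += qual
    return max(weight, key=weight.get)

def consensus(aligned_reads):
    """Build a consensus from (offset, sequence, qualities) tuples.

    offset is the 0-based layout position of the read's first base;
    gaps inside reads are not handled in this sketch.
    """
    columns = defaultdict(list)
    for offset, seq, quals in aligned_reads:
        for k, (base, qual) in enumerate(zip(seq, quals)):
            columns[offset + k].append((base, qual))
    result = []
    for pos in sorted(columns):
        bases, quals = zip(*columns[pos])
        result.append(consensus_base(bases, quals))
    return "".join(result)

if __name__ == "__main__":
    reads = [
        (0, "ACGT", [30, 30, 30, 30]),
        (1, "CCTT", [30, 10, 30, 30]),  # low-quality mismatch at position 2
        (2, "GTTA", [30, 30, 30, 30]),
    ]
    print(consensus(reads))
```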

Overlap-Layout-Consensus

http://gepard.bioinformatik.uni-saarland.de/teaching/ws-2011-12/special-topic-lecture-bioinformatics-next


De Bruijn graph: Velvet
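For orientation, a de Bruijn graph can be built from reads by cutting them into k-mers: each k-mer contributes an edge from its first k-1 bases to its last k-1 bases. The sketch below shows only this basic construction under that textbook definition; it is not Velvet's implementation, which adds error correction and graph simplification, and the reads and value of k are made up.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Every k-mer found in a read adds one edge prefix -> suffix, so
    repeated k-mers give repeated edges (edge multiplicity).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

if __name__ == "__main__":
    for node, successors in de_bruijn_graph(["ACGTAC", "GTACG"], 3).items():
        print(node, "->", successors)
```

Unlike the overlap graph, this structure never compares reads pairwise; overlaps are implicit in shared (k-1)-mers, which is why the approach scales to very large numbers of short reads.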


Online resources

• NCBI-SRA
• NCBI-GEO
• The European Nucleotide Archive (ENA)
• ArrayExpress


Visualization tools

NATURE METHODS SUPPLEMENT | VOL.7 NO.3s | MARCH 2010 | S3


sequence similarity. A user can interactively explore the sequence relationships between different contigs and view the results of search operations such as ‘find repeats’. Consed’s assembly view can display the output of a sequence comparison utility called ‘cross_match’, using arcs to connect regions with sequence similarity between user-selected contigs. Different colors distinguish features such as directed repeats from inverted repeats. One advantage of viewing sequence similarity in ‘assembly view’ is that it can be integrated with a read coverage plot (Fig. 1a), which can reveal regions of unexpectedly high coverage often indicative of similar sequences that were erroneously collapsed by the assembler into one. The user can click to examine the sequence similarity at the base level, and click again to examine the underlying reads. There are also standalone tools with related functionality; for example, Miropeats [15], widely used for early genome sequencing projects, is a UNIX C-shell script that generates static images using arc representations to indicate different types of repeats.

Next-generation sequence viewers. As sequencing throughput increases and costs decrease, individual genome sequencing has become feasible and has led to initiatives such as the 1,000 Genomes project (http://www.1000genomes.org/). These data provide an unprecedented opportunity to characterize the landscape of human genotypes, and a new generation of computational methods has emerged as a result [16]. In some cases, visual inspection can facilitate the evaluation and interpretation of read alignment techniques and variation detection outputs.

Assembly visualization tools possess most of the necessary functionality, but they were built with Sanger data in mind and initially strained under the substantially higher read volume of NGS technologies. Several of these tools are being retrofitted to tackle larger data sets, including Consed and the updated Gap5, but a new wave of tools is also being designed with this purpose in mind: for example, EagleView [17], MapView [18] and IGV (Table 1). Unlike finishing software, these tools are primarily data viewers and do not provide direct editing functionality. Because of their emphasis on browsing, many provide more flexible zooming capabilities and enable a user to freely zoom out to higher-level views. The commercially available CLC Genomics Workbench (CLC bio) is particularly user friendly and includes its own read alignment programs, which can be launched through a GUI.

In the resequencing context, mate pairs provide valuable information about structural variation, such as insertions, deletions and inversions. As discussed in the previous section, mate pairs can also indicate misassemblies, and users performing variation detection on draft assemblies should be aware of these issues. LookSeq [19] and Gap5 use the vertical-axis position to indicate insertion size. This places inconsistent mate pairs at the extremes of the plot and visually separates large insert sizes, which are consistent with deletions, from small insert sizes, which suggest insertion events. When analyzing structural variations, it is important to consider gene annotations, for example, whether a single nucleotide variation leads to a synonymous or nonsynonymous amino acid change. For this reason, several of these visualization

Table 1 | Tools for visualizing sequencing data

Name | Cost | OS | Description | URL

Stand-alone tools

ABySS-Explorer [25] | Free | Win, Mac, Linux | Interactive assembly structure visualization tool | http://tinyurl.com/abyss-explorer/
CLC Genomics Workbench | $ | Win, Mac, Linux | Integrates NGS data visualization with analysis tools; user friendly | http://www.clcbio.com/
Consed [3]* | Free | Mac, Linux | Widely used; assembly finishing package; NGS compatible | http://www.phrap.org/
DNASTAR Lasergene [14] | $ | Win, Mac | Analysis suite with an assembly finishing package; NGS compatible | http://www.dnastar.com/
EagleView [17] | Free | Win, Mac, Linux | Assembly viewer; compatible with single-end NGS | http://tinyurl.com/eagleview/
Gap [12,13] | Free | Linux | Widely used; assembly finishing package; Gap5 is NGS compatible | http://staden.sourceforge.net/
Hawkeye [6] | Free | Win, Mac, Linux (S) | Sanger sequencing assembly viewer | http://amos.sourceforge.net/hawkeye/
Integrative Genomics Viewer (IGV)* | Free | Win, Mac, Linux | Genome browser with alignment view support (Table 2); NGS compatible | http://www.broadinstitute.org/igv/
MapView [18] | Free | Win, Linux | Read alignment viewer; custom file format for fast NGS data loading | http://evolution.sysu.edu.cn/mapview/
MaqView | Free | Mac, Linux | Read alignment viewer; fast NGS data loading from Maq alignment files | http://maq.sourceforge.net/
Orchid | Free | Linux (S) | Assembly viewer customized to display paired-end relationships | http://tinyurl.com/orchid-view/
Sequencher | $ | Win, Mac | Assembly finishing package | http://www.genecodes.com/
SAMtools tview [8] | Free | Win, Mac, Linux | Simple and fast text alignment viewer; NGS compatible | http://samtools.sourceforge.net/

Web-based tools

LookSeq [19] | Free | | Uses AJAX; y axis for insert size; user configures data resources; NGS compatible | http://lookseq.sourceforge.net/
NCBI Assembly Archive Viewer [7] | Free | | Graphical interface to contig and trace data in NCBI’s Assembly Archive | http://tinyurl.com/assmbrowser/

Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (S) indicates that compilation from source is required. “Assembly finishing package” enables interactive sequence editing and/or integration with tools for automated assembly improvement.
*Our recommendation

Dr. Ece Gamsiz

Next lectures

• RNA sequencing: method, applications, advantages over microarrays
• ChIP sequencing
• Epigenomics, DNA methylation, histone modification
• ……..
