Post on 06-Feb-2018
Processing of Raw Genome Data
Kristina Kirsten, Torben Meyer
Trends in Bioinformatics
Hasso-Plattner-Institut
Session Agenda
1. Why Sequence the Human Genome?
2. Problem Statement
■ Genetic Basics
■ Sanger Sequencing vs. NGS
3. Sequencing Pipeline
■ Base Calling
■ Alignment
■ Variant Calling
■ Data Annotation
4. Analysis Results / Use Cases
5. Outlook
6. Discussion
Kirsten, Kristina; Meyer, Torben; 27.01.2016
Processing of Raw Genome Data
Chart 2
Why Sequence the Human Genome?
Understand mutations
[1] http://evolution.berkeley.edu/evolibrary/images/evo/dna-mutation.gif
[2] http://informoverload.com/man-has-extra-fingers/
Why Sequence the Human Genome?
Identify markers for diseases
http://hmg.oxfordjournals.org/content/18/R1/R48/F1.large.jpg
Why Sequence the Human Genome?
Make personalized treatment decisions
http://www.alphagenomix.com/wp-content/uploads/2014/03/Figure21.png
Genetic Basics (1)
Genetic Basics (2)
• Base pairs:
Adenine (A) & Thymine (T)
Guanine (G) & Cytosine (C)
Genes = specific parts of DNA
Allele = one specific form of a gene
http://study.com/cimages/multimages/16/phenotype_v_genotype.png
1,000 Genomes Project
Reference Genome
Sanger Sequencing vs. NGS

                  Sanger Sequencing                        Next Generation Sequencing (NGS)
History           Developed in 1977                        Introduced in 2004
Parallelization   Small parallelization                    High parallelization
Reads             Few but long reads                       Many short reads
Read length       400 to 900 bases per read                50 to 300 bases per read
Error rate        Low error rate                           High error rate
                  1 Kb with per-base error rate < 0.001%   0.5 - 1.0% per-base error rate
Data per run      Low amount of data per run               High amount of data per run
                                                           Up to 6 billion reads
Speed             100 min for sequencing 1 Kb              0.002 min for sequencing 1 Kb
Cost              Expensive: $0.5 per Kb                   Affordable: $0.00005 per Kb
3. Sequencing Pipeline
Next-Generation Sequencing Bioinformatics Pipeline
This presentation
3.1 Base Calling
Base Calling
■ Receives initial raw data
□ Image filters, fluctuations in current, …
□ From Roche/454, Illumina, SOLiD, Helicos, …
■ Calls nucleotide bases (A, C, G, T) in short strings
□ A few giga base-pairs (Gbp) per machine-day
□ Usually 50 - 300 bases long
■ Each base call gets a quality score assigned
Excursion – Phred Quality Scores
1. Probabilities of base call errors are very small
■ They need to be mapped to values that are easier to compare
2. Mapping: $$Q = -10 \cdot \log_{10} P$$
■ Q: quality score, P: probability of a base call error
3. Quality scores are calculated in the base calling phase
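The mapping between error probability and quality score can be tried out with a few lines of Python. This is only a minimal sketch; the function names `phred_to_error_prob` and `error_prob_to_phred` are made up for illustration and do not come from any sequencing tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P): return the error probability P for score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P) to an error probability P in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_prob(q)
        print(f"Q={q}: P(error)={p:g}, accuracy={100 * (1 - p):.2f} %")
```

A score of 30, for example, corresponds to an error probability of 1 in 1,000.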
Quality Score   Probability of Base Call Error   Base Call Accuracy
5               approx. 1 in 3                   68 %
10              1 in 10                          90 %
20              1 in 100                         99 %
30              1 in 1,000                       99.9 %
40              1 in 10,000                      99.99 %
50              1 in 100,000                     99.999 %
60              1 in 1,000,000                   99.9999 %
Source: https://en.wikipedia.org/wiki/Phred_quality_score
3.2 Alignment
Alignment in the Sequencing Pipeline
Challenges
■ High error rates in NGS reads
■ High number of short reads
■ Mutations and variations can occur
■ Gaps, e.g. indels (insertions and deletions)
□ Bases could be missing / new bases could have been inserted
Alignment Algorithms
■ Using hash tables
□ BLAST
□ SeqMap
□ MAQ
■ Suffix or prefix tries (e.g. using an FM-index)
□ Bowtie and Bowtie 2
□ BWA, BWA-SW
□ SOAP, SOAP2, SOAP3
Bowtie 2 – Overview
■ Based on Bowtie
■ Allows
□ Ungapped alignment
□ Gapped alignment (containing indels)
□ Inexact matching
■ Technologies
□ FM-index, built on the Burrows-Wheeler transform (BWT)
□ SIMD-accelerated dynamic programming (Smith-Waterman)
■ Input: FASTQ file of reads
■ Output: SAM file of aligned reads
FASTQ-SANGER
■ Four lines per read
■ '@' title line
□ Record identifier
□ Other commentary
■ Sequence line
□ IUPAC single-letter, uppercase codes for DNA or RNA (G, T, C, A)
■ '+' line
□ Marks the end of the sequence line
□ Sometimes contains a copy of the '@' line
Example record (sequence and quality lines are wrapped for display):
@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGG
GTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAA
GCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:EA1EA
1EA5'9B:?:#9EA0D@2EA5':>5?:%A;A8A;?9B;D@
/=<?7=9<2A8==
■ Quality line
□ Allows PHRED quality scores from 0 to 93
□ Encoded as ASCII characters 33 to 126, one character per score
□ Same length as the sequence line
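To make the quality-line encoding concrete, the following sketch decodes a FASTQ-SANGER quality string into PHRED scores by subtracting the ASCII offset 33. The helper name `decode_quality` is hypothetical and chosen only for this illustration.

```python
from typing import List

SANGER_OFFSET = 33  # ASCII code that represents PHRED score 0 in FASTQ-SANGER
MAX_SCORE = 93      # ASCII 126 minus the offset


def decode_quality(quality_line: str) -> List[int]:
    """Convert each character of a quality line into its PHRED score."""
    scores = [ord(ch) - SANGER_OFFSET for ch in quality_line]
    if any(s < 0 or s > MAX_SCORE for s in scores):
        raise ValueError("character outside the ASCII range 33 to 126")
    return scores


if __name__ == "__main__":
    # First five quality characters of the example record shown on the slides.
    print(decode_quality('3+&$#'))
```

Because every score occupies exactly one character, the decoded list always has the same length as the sequence line.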
Burrows-Wheeler-Transformation (BWT)
■ Input
A A T T C G A
■ Append EOF char
A A T T C G A $
■ Shift right (all cyclic rotations)
A A T T C G A $
$ A A T T C G A
A $ A A T T C G
G A $ A A T T C
C G A $ A A T T
T C G A $ A A T
T T C G A $ A A
A T T C G A $ A
■ Sort lexicographically
$ A A T T C G A
A $ A A T T C G
A A T T C G A $
A T T C G A $ A
C G A $ A A T T
G A $ A A T T C
T C G A $ A A T
T T C G A $ A A
■ Return last column
AATTCGA$ becomes AG$ATCTA
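The construction steps above can be sketched directly in Python. This is a naive illustration that materializes all rotations, which is fine for the slide example but far too memory-hungry for a genome; real aligners build the transform from a suffix array instead. The function name `bwt` and the choice of `$` as end marker follow the slides; everything else is an assumption of this sketch.

```python
def bwt(text: str, eof: str = "$") -> str:
    """Return the Burrows-Wheeler transform of text using sorted rotations."""
    if eof in text:
        raise ValueError("input must not contain the EOF character")
    s = text + eof                                      # append EOF char
    rotations = [s[i:] + s[:i] for i in range(len(s))]  # all cyclic shifts
    rotations.sort()                                    # '$' sorts before A, C, G, T
    return "".join(row[-1] for row in rotations)        # last column


if __name__ == "__main__":
    print(bwt("AATTCGA"))
```

Run on the slide input AATTCGA, this yields the last column AG$ATCTA shown above.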
BWT - Unpermute
1. We only know the first column (the sorted letters) and the last column
2. Last-First property: the relative order of identical letters is the same in the last and in the first column
3. We start with the letter in the last column of the first row and apply that property to unpermute the sentence
4. This letter equals the last letter of the original sentence
5. Apply the LF-mapping property until $ is reached
6. The original sentence is reconstructed
$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A
Reconstructed sentence after each step:
A
GA
CGA
TCGA
TTCGA
ATTCGA
AATTCGA
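The unpermute procedure can be written as a short loop over the LF mapping. The sketch below assumes that the EOF character sorts before all other letters, so that the first row of the sorted matrix starts with `$`; the name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(last: str, eof: str = "$") -> str:
    """Reconstruct the original string from the last column via LF mapping."""
    first = sorted(last)  # first column = sorted letters of the last column

    # Row index at which each letter first appears in the first column.
    first_pos = {}
    for i, ch in enumerate(first):
        first_pos.setdefault(ch, i)

    # ranks[i] = how many times last[i] already occurred above row i.
    seen = {}
    ranks = []
    for ch in last:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    row = 0           # the row that starts with the EOF character
    collected = []    # letters are found from back to front
    while last[row] != eof:
        ch = last[row]
        collected.append(ch)
        row = first_pos[ch] + ranks[row]  # LF mapping: same occurrence in column F
    return "".join(reversed(collected))


if __name__ == "__main__":
    print(inverse_bwt("AG$ATCTA"))
```

The letters are collected in the order A, G, C, T, T, A, A, which matches the step-by-step reconstruction on the slides.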
BWT – Exact Matching
Let's search for ATT in the original string!
1. Write the output of the BWT to the last column
2. Sort the letters lexicographically and write them to the first column
3. Mark all occurrences of the last letter of the search string in the first column
4. Check for occurrences of the next letter (moving from right to left through the search string) in the last column of the marked rows
5. Count which occurrence of that letter was found (here: the 2nd T in the last column)
6. Mark that occurrence in the first column
7. Repeat (the A that follows is the 3rd A in the last column, so the 3rd A in the first column is marked)
$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A
FOUND! The marked row starts with ATT.
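The marking procedure above is usually implemented as a backward search that tracks a range of rows instead of individual marks. The following sketch assumes two helper tables that the slides do not name: `c_table` (first row of each letter in the first column) and `occ` (running letter counts in the last column). Both names, and the half-open row range it returns, are choices made for this illustration.

```python
from typing import Dict, List, Tuple


def build_tables(bwt_string: str) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Build the first-column offsets and the occurrence counts of the last column."""
    c_table: Dict[str, int] = {}
    for i, ch in enumerate(sorted(bwt_string)):
        c_table.setdefault(ch, i)
    # occ[ch][i] = number of times ch occurs in bwt_string[:i]
    occ = {ch: [0] * (len(bwt_string) + 1) for ch in c_table}
    for i, ch in enumerate(bwt_string):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return c_table, occ


def backward_search(bwt_string: str, pattern: str) -> Tuple[int, int]:
    """Return the half-open range [lo, hi) of sorted rows that start with pattern."""
    c_table, occ = build_tables(bwt_string)
    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):      # process the pattern from right to left
        if ch not in c_table:
            return (0, 0)
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:                  # no row left: pattern does not occur
            return (0, 0)
    return (lo, hi)


if __name__ == "__main__":
    print(backward_search("AG$ATCTA", "ATT"))
```

For ATT the range contains exactly one row, the fourth row of the sorted matrix (index 3 when counting from 0).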
BWT – Unwind
8. Now we need to unwind to find the position in the original string
9. When the end of the string is reached, the position of the searched string is known: its first letter is the 6th-last letter of the sentence, i.e. the 2nd letter of the original string
AATTCGA$
$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A
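One way to sketch the unwinding in code is to follow the LF mapping from the matching row and count the steps. Note that this sketch walks towards the start of the string (until `$` appears in the last column), whereas the slides walk to the end of the string; both directions identify the same position. The function name `locate` is hypothetical, and production tools typically store sampled suffix-array entries so that only a few steps are needed.

```python
def locate(bwt_string: str, row: int, eof: str = "$") -> int:
    """Return the 0-based text offset of the suffix that starts sorted row `row`."""
    first_pos = {}
    for i, ch in enumerate(sorted(bwt_string)):
        first_pos.setdefault(ch, i)

    steps = 0
    while bwt_string[row] != eof:
        ch = bwt_string[row]
        rank = bwt_string[:row].count(ch)  # occurrences of ch above this row
        row = first_pos[ch] + rank         # LF mapping: one letter to the left
        steps += 1
    return steps


if __name__ == "__main__":
    # Row 3 (counting from 0) is the row starting with ATT in the slide example.
    print(locate("AG$ATCTA", 3))
```

The result for the ATT row is offset 1 when counting from 0, which is the 2nd letter of the original string as stated above.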
BWT – Inexact Matching
But what about inexact matches?
What if we want to find TCAA in the original string?
Quality scores are assigned in the base calling phase:
Base:           T   C   A   A
Quality score:  65  47  10  50
1. Proceed as in exact matching
2. No match is found for the next letter, C. What now?
3. Look at the base calls and pick the position in the stack with the lowest quality score
■ The second A on the stack (the third base of the read) had a score of 10
4. Walk back in the stack and replace it with a different possible base (e.g. G)
5. Continue with exact matching and repeat until a match is found
TCAA becomes TCGA
$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A
We found an alignment for TCAA with one mismatch.
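The quality-guided substitution can be illustrated with a simplified sketch. Instead of backtracking inside one search as on the slides, it simply re-runs an exact backward search for every candidate read, trying the positions with the lowest quality score first. All function names are hypothetical, and this is not how Bowtie 2 implements the step.

```python
from typing import List, Optional


def bwt(text: str, eof: str = "$") -> str:
    """Naive BWT via sorted rotations (only suitable for tiny examples)."""
    s = text + eof
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_matches(bwt_string: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search."""
    first = sorted(bwt_string)
    lo, hi = 0, len(bwt_string)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        start = first.index(ch)
        lo = start + bwt_string[:lo].count(ch)
        hi = start + bwt_string[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


def match_with_one_substitution(text: str, read: str, qualities: List[int],
                                alphabet: str = "ACGT") -> Optional[str]:
    """Return the read, or a one-base variant of it, that occurs in text."""
    index = bwt(text)
    if count_matches(index, read):
        return read
    # Try the least trustworthy base calls first.
    for pos in sorted(range(len(read)), key=lambda i: qualities[i]):
        for base in alphabet:
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if count_matches(index, candidate):
                return candidate
    return None


if __name__ == "__main__":
    print(match_with_one_substitution("AATTCGA", "TCAA", [65, 47, 10, 50]))
```

With the slide example, the base with score 10 is replaced first, and the variant TCGA is found in AATTCGA.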
Gapped Alignment
What about gapped alignment (e.g. Indels)?
Let's look at the Bowtie 2 pipeline.
Bowtie 2 – 1. Extract Seeds
Extract seeds from each read
Follow a certain policy (e.g. 16 base substring every 10 bases along the read)
Upcoming process works with the seed strings
Read:
CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA
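The seed policy mentioned above (a 16-base substring every 10 bases) can be sketched as follows. The function name `extract_seeds` and its parameters are assumptions for illustration; the sketch also ignores the reverse complement of the read.

```python
from typing import List, Tuple


def extract_seeds(read: str, seed_length: int = 16,
                  interval: int = 10) -> List[Tuple[int, str]]:
    """Return (offset, seed) pairs taken every `interval` bases along the read."""
    last_start = len(read) - seed_length
    return [(offset, read[offset:offset + seed_length])
            for offset in range(0, last_start + 1, interval)]


if __name__ == "__main__":
    read = "CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA"
    for offset, seed in extract_seeds(read):
        print(offset, seed)
```

For the 37-base read from the slide this policy produces three seeds, starting at offsets 0, 10 and 20.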
Bowtie 2 – 2. Align with FM Index
■ Find alignments using the BWT as described earlier
□ There may be more than one possible alignment per seed!
■ Returns ranges of possible alignments within the FM-index
CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA
BWT, (In)exact matching
{ [211, 212];[212, 214] }
Bowtie 2 – 3. Prioritize, Resolve
■ Every alignment gets priority $1/r^2$
□ r = total number of alignments for the seed
□ Seeds with fewer alignments get higher priority
■ Randomly select seeds, weighted by priority
□ Run dynamic programming approach on these
□ Modified Smith-Waterman algorithm
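The weighting can be sketched with Python's standard `random.choices`. The input format (a dictionary from seed to its number of alignments r) and the function names are assumptions made for this illustration, not the data structures Bowtie 2 uses.

```python
import random
from typing import Dict


def seed_priorities(alignment_counts: Dict[str, int]) -> Dict[str, float]:
    """Assign priority 1 / r^2 to every seed with r > 0 alignments."""
    return {seed: 1.0 / (r * r)
            for seed, r in alignment_counts.items() if r > 0}


def pick_seed(alignment_counts: Dict[str, int], rng: random.Random) -> str:
    """Randomly select one seed, weighted by its priority."""
    priorities = seed_priorities(alignment_counts)
    seeds = list(priorities)
    weights = [priorities[s] for s in seeds]
    return rng.choices(seeds, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical counts: seed "s1" aligns once, seed "s2" aligns four times.
    counts = {"s1": 1, "s2": 4}
    print(seed_priorities(counts))
    print(pick_seed(counts, random.Random(0)))
```

A seed with a single alignment gets weight 1, while a seed with four alignments gets weight 1/16, so rarely occurring seeds are extended far more often.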
Smith-Waterman
■ Reference Genome:
CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA
■ Read:
GCTCAG
■ One exact match for this seed found!
Smith-Waterman
■ Uses a dynamic programming approach
□ Solve the larger problem by first solving smaller subproblems
□ Build the solution of the larger problem on the results of the smaller ones
□ Fill out a table to save all results
■ Give a reward if a match is found
■ Give a penalty if no match is found (a mismatch or a gap)
■ The table cell with the highest score marks the end of the best alignment
■ Let's align GCTCAG to GCTCTCAG!
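The table filling can be sketched in a few lines of Python. The sketch assumes a reward of +2 for a match and a penalty of -1 for a mismatch or a gap, and it only computes the score table and the best cell; it omits the traceback and the SIMD acceleration mentioned for Bowtie 2. The function name `smith_waterman` is chosen for illustration.

```python
from typing import List, Tuple


def smith_waterman(read: str, reference: str, match: int = 2,
                   penalty: int = -1) -> Tuple[List[List[int]], int, Tuple[int, int]]:
    """Fill the local-alignment score table; return (table, best score, best cell)."""
    rows, cols = len(read) + 1, len(reference) + 1
    table = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = read[i - 1] == reference[j - 1]
            diagonal = table[i - 1][j - 1] + (match if same else penalty)
            up = table[i - 1][j] + penalty      # gap in the reference
            left = table[i][j - 1] + penalty    # gap in the read
            table[i][j] = max(0, diagonal, up, left)
            if table[i][j] > best:
                best, best_cell = table[i][j], (i, j)
    return table, best, best_cell


if __name__ == "__main__":
    table, best, cell = smith_waterman("GCTCAG", "GCTCTCAG")
    for row in table:
        print(" ".join(f"{value:2d}" for value in row))
    print("best score", best, "at cell", cell)
```

Under these scores, aligning GCTCAG to GCTCTCAG gives a best score of 10 in the bottom-right cell: six matches and two gap positions.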
Smith-Waterman
$$H(i,j) = \max \begin{cases} 0 \\ H(i-1,j-1) + w(a_i,b_j) \\ H(i-1,j) + w(a_i,-) \\ H(i,j-1) + w(-,b_j) \end{cases}$$
$$w(a,b) = \begin{cases} +2, & a = b \\ -1, & a \neq b \text{ or } a = - \text{ or } b = - \end{cases}$$
■ Parallelizable! Cells on the same anti-diagonal do not depend on each other.
Score table for the read GCTCAG (rows) against GCTCTCAG (columns), filled step by step:
- G C T C T C A G
- 0 0 0 0 0 0 0 0 0
G 0 2 1 0 0 0 0 0 2
C 0 1 4 3 2 1 2 1 1
T 0 0 3 6 2 4 3 2 1
C 0 0 2 2 8 7 6 5
A 0 0 1 1 7 7 6
G 0 1 0 0 6 6
𝐻 𝑖, 𝑗 = 𝑚𝑎𝑥
0𝐻 𝑖 − 1, 𝑗 − 1 + 𝑤 𝑎𝑖 , 𝑏𝑗𝐻 𝑖 − 1, 𝑗 + 𝑤 𝑎𝑖 , −
𝐻 𝑖, 𝑗 − 1 + 𝑤 −, 𝑏𝑗
𝑤 𝑎, 𝑏 = −1, 𝑎 ≠ 𝑏 𝑜𝑟 𝑎 = − 𝑜𝑟 𝑏 = −
+2, 𝑎 = 𝑏
Smith-Waterman
Kirsten, KristinaMeyer, Torben27.01.2016
Processing of Raw Genome Data
Chart 111
- G C T C T C A G
- 0 0 0 0 0 0 0 0 0
G 0 2 1 0 0 0 0 0 2
C 0 1 4 3 2 1 2 1 1
T 0 0 3 6 2 4 3 2 1
C 0 0 2 2 8 7 6 5 4
A 0 0 1 1 7 7 6 8
G 0 1 0 0 6 6 6
𝐻 𝑖, 𝑗 = 𝑚𝑎𝑥
0𝐻 𝑖 − 1, 𝑗 − 1 + 𝑤 𝑎𝑖 , 𝑏𝑗𝐻 𝑖 − 1, 𝑗 + 𝑤 𝑎𝑖 , −
𝐻 𝑖, 𝑗 − 1 + 𝑤 −, 𝑏𝑗
𝑤 𝑎, 𝑏 = −1, 𝑎 ≠ 𝑏 𝑜𝑟 𝑎 = − 𝑜𝑟 𝑏 = −
+2, 𝑎 = 𝑏
Smith-Waterman
Kirsten, KristinaMeyer, Torben27.01.2016
Processing of Raw Genome Data
Chart 112
- G C T C T C A G
- 0 0 0 0 0 0 0 0 0
G 0 2 1 0 0 0 0 0 2
C 0 1 4 3 2 1 2 1 1
T 0 0 3 6 2 4 3 2 1
C 0 0 2 2 8 7 6 5 4
A 0 0 1 1 7 7 6 8 7
G 0 1 0 0 6 6 6 7
𝐻 𝑖, 𝑗 = 𝑚𝑎𝑥
0𝐻 𝑖 − 1, 𝑗 − 1 + 𝑤 𝑎𝑖 , 𝑏𝑗𝐻 𝑖 − 1, 𝑗 + 𝑤 𝑎𝑖 , −
𝐻 𝑖, 𝑗 − 1 + 𝑤 −, 𝑏𝑗
𝑤 𝑎, 𝑏 = −1, 𝑎 ≠ 𝑏 𝑜𝑟 𝑎 = − 𝑜𝑟 𝑏 = −
+2, 𝑎 = 𝑏
Smith-Waterman
Kirsten, KristinaMeyer, Torben27.01.2016
Processing of Raw Genome Data
Chart 113
- G C T C T C A G
- 0 0 0 0 0 0 0 0 0
G 0 2 1 0 0 0 0 0 2
C 0 1 4 3 2 1 2 1 1
T 0 0 3 6 2 4 3 2 1
C 0 0 2 2 8 7 6 5 4
A 0 0 1 1 7 7 6 8 7
G 0 1 0 0 6 6 6 7 10
𝐻 𝑖, 𝑗 = 𝑚𝑎𝑥
0𝐻 𝑖 − 1, 𝑗 − 1 + 𝑤 𝑎𝑖 , 𝑏𝑗𝐻 𝑖 − 1, 𝑗 + 𝑤 𝑎𝑖 , −
𝐻 𝑖, 𝑗 − 1 + 𝑤 −, 𝑏𝑗
𝑤 𝑎, 𝑏 = −1, 𝑎 ≠ 𝑏 𝑜𝑟 𝑎 = − 𝑜𝑟 𝑏 = −
+2, 𝑎 = 𝑏
Smith-Waterman
Kirsten, KristinaMeyer, Torben27.01.2016
Processing of Raw Genome Data
Chart 114
- G C T C T C A G
- 0 0 0 0 0 0 0 0 0
G 0 2 1 0 0 0 0 0 2
C 0 1 4 3 2 1 2 1 1
T 0 0 3 6 2 4 3 2 1
C 0 0 2 2 8 7 6 5 4
A 0 0 1 1 7 7 6 8 7
G 0 1 0 0 6 6 6 7 10
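The anti-diagonal fill order behind the "Parallelizable!" remark can be sketched in Python. This is a minimal illustration, assuming the scoring from the slides (match +2, mismatch and gap -1). The names `antidiagonals` and `fill_by_wavefront` are made up for this example, and the cells of one anti-diagonal are still computed one after another here, although they do not depend on each other.

```python
def antidiagonals(rows, cols):
    """Yield the cells (i, j) with i + j = d, for 1 <= i <= rows and 1 <= j <= cols."""
    for d in range(2, rows + cols + 1):
        yield [(i, d - i) for i in range(max(1, d - cols), min(rows, d - 1) + 1)]


def fill_by_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix one anti-diagonal at a time."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for cells in antidiagonals(len(a), len(b)):
        # Every cell on this anti-diagonal reads only from the two previous
        # anti-diagonals, so these values could be computed concurrently.
        values = []
        for i, j in cells:
            w = match if a[i - 1] == b[j - 1] else mismatch
            values.append(max(0,
                              H[i - 1][j - 1] + w,   # a_i aligned with b_j
                              H[i - 1][j] + gap,     # a_i against a gap
                              H[i][j - 1] + gap))    # b_j against a gap
        for (i, j), value in zip(cells, values):
            H[i][j] = value
    return H


H = fill_by_wavefront("GCTCAG", "GCTCTCAG")
for row in H:
    print(" ".join(f"{v:2d}" for v in row))
```

For the 6 x 8 example there are 13 anti-diagonals, so a parallel version would need 13 steps instead of 48 sequential cell updates.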
Smith-Waterman
      -   G   C   T   C   T   C   A   G
  -   0   0   0   0   0   0   0   0   0
  G   0   2   1   0   0   0   0   0   2
  C   0   1   4   3   2   1   2   1   1
  T   0   0   3   6   5   4   3   2   1
  C   0   0   2   5   8   7   6   5   4
  A   0   0   1   4   7   7   6   8   7
  G   0   2   1   3   6   6   6   7  10
■ Traceback from best value
□ We need to remember how we calculated the values for that!
■ Alignment:
G C T C T C A G
G C - - T C A G
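The traceback step can be sketched as follows: while filling the matrix, each cell remembers which neighbour produced its value, and the alignment is read backwards from the best cell until a cell with score 0 is reached. This is a minimal sketch, assuming the slide scoring (match +2, mismatch and gap -1); the function name and the tie-breaking order (diagonal first) are choices made for this example. Several tracebacks tie at the best score of 10. Because this sketch prefers diagonal moves it reports the ungapped match CTCAG / CTCAG, while the gapped alignment on the slide (6 matches, 2 gap positions: 6 · 2 - 2 = 10) scores the same.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (H, best_score, aligned_b, aligned_a) for a local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]   # remembered predecessor per cell
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            w = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (H[i - 1][j - 1] + w, "diag"),
                (H[i - 1][j] + gap, "up"),
                (H[i][j - 1] + gap, "left"),
            ]
            # max() keeps the first maximum, so ties prefer the diagonal move.
            score, step = max(candidates, key=lambda c: c[0])
            if score > 0:
                H[i][j], move[i][j] = score, step
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: follow the remembered moves until a cell with score 0.
    i, j = best_pos
    top, bottom = [], []
    while move[i][j] is not None:
        step = move[i][j]
        if step == "diag":
            top.append(b[j - 1]); bottom.append(a[i - 1]); i -= 1; j -= 1
        elif step == "up":
            top.append("-"); bottom.append(a[i - 1]); i -= 1
        else:
            top.append(b[j - 1]); bottom.append("-"); j -= 1
    return H, best, "".join(reversed(top)), "".join(reversed(bottom))


H, score, aligned_b, aligned_a = smith_waterman("GCTCAG", "GCTCTCAG")
print(score)
print(aligned_b)
print(aligned_a)
```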
Results
■ Smith-Waterman ends when:
□ All possible alignments are examined
□ Enough alignments are examined
□ Dynamic programming limit is reached
■ Pick the alignment with the highest score in the Smith-Waterman algorithm
Adjustments to Smith-Waterman in Bowtie2
■ Different gap penalties for starting and extending a gap
■ Restrictions on where gaps are allowed
■ Scoring function also takes quality score into account
■ Reseeding, if no proper matches were found
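The first and third adjustment can be illustrated with two small scoring helpers: an affine gap penalty that charges once for opening a gap plus a per-position extension cost, and a mismatch penalty that grows with the base quality. This is a sketch under stated assumptions: the function names are hypothetical, and the default numbers (gap open 5, gap extend 3, mismatch between 2 and 6, quality capped at 40) are assumed to follow the defaults described in the Bowtie 2 manual; check them against the version in use.

```python
def gap_penalty(length, gap_open=5, gap_extend=3):
    """Affine gap cost: one opening charge plus an extension charge per gap position."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


def mismatch_penalty(phred_quality, max_penalty=6, min_penalty=2):
    """Mismatches at high-quality bases cost more than mismatches at low-quality bases."""
    q = min(phred_quality, 40)   # assumption: qualities above 40 are treated as 40
    return min_penalty + (max_penalty - min_penalty) * q // 40


print(gap_penalty(2), 2 * gap_penalty(1))
print(mismatch_penalty(40), mismatch_penalty(10))
```

With an affine penalty, one gap of length 2 is cheaper than two separate gaps of length 1, which favours alignments with few, longer gaps.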
Sequence Alignment/Map format (SAM)
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAAGGATACTA *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 TAGCTTCAGC *
■ Leftmost position of the alignment (POS, column 4)
■ Quality of the alignment (Phred) (MAPQ, column 5)
■ Matching as CIGAR string (CIGAR, column 6)
■ Query Sequence (SEQ, column 10)
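The highlighted fields can be pulled out of an alignment line with a few lines of Python. This is a minimal sketch, not a full SAM parser: `parse_sam_line`, `reference_span` and `query_length` are names made up here, the line is split on any whitespace because the slide shows spaces (real SAM files are tab-separated), and optional tags are ignored.

```python
import re
from typing import NamedTuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int     # 1-based leftmost position of the alignment
    mapq: int    # Phred-scaled mapping quality
    cigar: str
    seq: str


def parse_sam_line(line):
    """Extract the mandatory fields discussed on the slide from one alignment line."""
    f = line.split()
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9])


def reference_span(cigar):
    """Reference bases covered by the alignment (M, D, N, = and X consume the reference)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")


def query_length(cigar):
    """Read bases described by the CIGAR string (M, I, S, = and X consume the read)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


record = parse_sam_line("r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *")
print(record.pos, record.mapq, reference_span(record.cigar), query_length(record.cigar))
```

For r002 the CIGAR string `3S6M1P1I4M` describes 14 read bases, which agrees with the length of the query sequence on that line.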
3.3 Variant Calling
Variant Calling in Pipeline
Genetic Variation vs. Mutation
“Genetic variation is what makes us all unique, whether in terms of hair colour, skin colour or even the shape of our faces.”
vs.
“A mutation is a change that occurs in our DNA sequence, either due to mistakes when the DNA is copied or as the result of environmental factors such as UV light and cigarette smoke.”
Mutations contribute to genetic variation within species
What Does Variant Calling Mean?
■ Single-Nucleotide-Polymorphism (SNP)
■ Insertions and Deletions (Indels)
■ Larger structural variants
□ Copy Number Variation (CNV)
□ Loss of one copy of a gene or of both
□ Movement of DNA sections from one location to another
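For the small variant types in this list, the distinction can be read off a pair of reference and alternative alleles. The sketch below assumes VCF-style allele strings; the function name `classify_variant` is made up for this example, and larger structural variants are not covered.

```python
def classify_variant(ref, alt):
    """Classify a small variant from its reference and alternative allele strings."""
    if ref == alt:
        return "no variant"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "multi-base substitution"


for ref, alt in [("A", "G"), ("A", "AGT"), ("ACG", "A")]:
    print(ref, alt, classify_variant(ref, alt))
```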
SNP Calling Methods
■ Early method: Counting abundance of high-quality nucleotides at a site
■ Recent approaches:
□ Integrate several sources of information
□ Use of prior probabilities for a SNP at a given position (e.g. dbSNP)
■ Further method: Heuristic approach
□ Specific features of different sequencing platforms and different read alignment methods (VarScan)
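The early counting method can be sketched as follows: discard low-quality base calls at a site, count the remaining nucleotides, and report a non-reference base if it is abundant enough. This is a toy illustration; the function name and both thresholds (`min_quality`, `min_fraction`) are hypothetical values chosen for the example, not settings of any particular tool.

```python
from collections import Counter


def call_by_counting(ref_base, pileup, min_quality=20, min_fraction=0.2):
    """pileup: list of (base, phred_quality) pairs observed at one site.

    Returns the most abundant non-reference base if it reaches min_fraction
    of the high-quality observations, otherwise None.
    """
    counts = Counter(base for base, quality in pileup if quality >= min_quality)
    total = sum(counts.values())
    if total == 0:
        return None
    alt, alt_count = max(
        ((base, n) for base, n in counts.items() if base != ref_base),
        key=lambda item: item[1],
        default=(None, 0),
    )
    if alt is not None and alt_count / total >= min_fraction:
        return alt
    return None


pileup = [("A", 30)] * 6 + [("G", 35)] * 4 + [("T", 5)] * 3
print(call_by_counting("A", pileup))
```

The three low-quality T observations are ignored, so the call rests on 6 reference and 4 alternative bases.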
Challenges with Variant Calling
■ Presence of Indels
■ Errors from library preparation
■ Variable quality scores with higher error rates
Identify "true" variants rather than alignment and/or sequencing errors
Minimize the number of false positives
Genome Analysis Toolkit
■ A software package for the analysis of HTS (high-throughput sequencing) data
■ Offers a variety of tools with a focus on variant discovery and genotyping
■ Reads-to-variants workflow
GATK's HaplotypeCaller
SAM / BAM File → HaplotypeCaller → VCF File
■ Is capable of calling SNPs and Indels simultaneously
HaplotypeCaller STEP 1: Identify Active Regions
■ Go through reference genome with a sliding window
■ Count Indels and mismatches
■ Memorize regions to operate on
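The sliding-window idea can be sketched in a few lines. This is a simplified illustration and not GATK's actual activity computation: it assumes the mismatches and indels have already been counted per reference position, and the function name, window size and threshold are made up for the example.

```python
def find_active_regions(event_counts, window=5, threshold=3):
    """event_counts[i] = mismatches and indels observed at reference position i.

    Returns merged half-open intervals (start, end) in which at least one
    window reaches the threshold.
    """
    regions = []
    last_start = max(0, len(event_counts) - window)
    for start in range(last_start + 1):
        end = min(start + window, len(event_counts))
        if sum(event_counts[start:end]) >= threshold:
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


counts = [0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(find_active_regions(counts))
```

Only the remembered regions are passed on to the more expensive re-assembly step, which keeps the overall run time manageable.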
HaplotypeCaller STEP 2: Assemble Plausible Haplotypes
■ Local re-assembly
■ Building a De Bruijn-like graph
■ Prune according to threshold
■ Traverse graph to collect most likely haplotypes
■ Align haplotypes to reference genome using the Smith-Waterman algorithm (SWA)
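The graph construction, pruning and traversal can be sketched on a toy example. This is a strongly simplified illustration: all function names are hypothetical, the traversal greedily follows the heaviest edge and returns a single path, whereas a real assembler collects and scores several candidate haplotypes.

```python
from collections import Counter


def build_debruijn(reads, k):
    """Edges connect consecutive k-mers; the weight counts how often a step was seen."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return edges


def prune(edges, min_weight):
    """Drop weakly supported edges, which are likely sequencing errors."""
    return {edge: w for edge, w in edges.items() if w >= min_weight}


def assemble(edges, start):
    """Follow the heaviest outgoing edge until a dead end or an already visited k-mer."""
    path, node, seen = start, start, {start}
    while True:
        outgoing = [(w, nxt) for (src, nxt), w in edges.items() if src == node]
        if not outgoing:
            break
        _, node = max(outgoing)
        if node in seen:
            break
        seen.add(node)
        path += node[-1]
    return path


reads = ["ACGTTGCA"] * 3 + ["ACGATGCA"]   # the last read carries one error
graph = build_debruijn(reads, k=3)
pruned = prune(graph, min_weight=2)
print(assemble(pruned, "ACG"))
```

The erroneous read contributes four edges with weight 1, which are removed by pruning before the traversal.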
HaplotypeCaller STEP 3: Determine per-read Likelihoods
■ Determine likelihoods of haplotypes given the read data
■ For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm
■ Produces a matrix of likelihoods of haplotypes given the read data
HaplotypeCaller Recap: Statistical Models
Markov Chain
$$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) = a_{x_{i-1} x_i}$$
Hidden Markov Model (HMM)
$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$
HaplotypeCaller STEP 3: Determine per-read Likelihoods
■ Match (M) = emitting an aligned pair with probability p
■ I_x, I_y = emitting symbol x_i or y_j against a gap with probability p
■ All transition probabilities leaving each state must sum to 1
■ Empirical gap penalties = derived from data by BQSR
■ Base mismatch penalties = base quality scores
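A pair HMM with a match state and two gap states can be evaluated with the forward algorithm, which sums over all alignments of a read against a haplotype. The sketch below is written in the spirit of this step and is not GATK's implementation: the function name is made up, the gap probabilities are fixed hypothetical constants (the slides state that the real ones are derived from the data by BQSR), and the mismatch probability is taken from the Phred base quality.

```python
def pair_hmm_likelihood(read, quals, haplotype, gap_open=0.001, gap_extend=0.1):
    """Forward algorithm over Match / Insertion / Deletion states.

    Returns an estimate of P(read | haplotype); haplotype must be non-empty.
    """
    n, m = len(read), len(haplotype)
    # Transition probabilities; those leaving each state sum to 1.
    t_mm = 1.0 - 2.0 * gap_open        # M -> M
    t_mi = t_md = gap_open             # M -> I, M -> D
    t_ii = t_dd = gap_extend           # I -> I, D -> D
    t_im = t_dm = 1.0 - gap_extend     # I -> M, D -> M

    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    I = [[0.0] * (m + 1) for _ in range(n + 1)]
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumption: the read may start at any haplotype position with equal probability.
    for j in range(m + 1):
        D[0][j] = 1.0 / m

    for i in range(1, n + 1):
        error = 10.0 ** (-quals[i - 1] / 10.0)   # Phred quality -> error probability
        for j in range(1, m + 1):
            emit = 1.0 - error if read[i - 1] == haplotype[j - 1] else error / 3.0
            M[i][j] = emit * (t_mm * M[i - 1][j - 1]
                              + t_im * I[i - 1][j - 1]
                              + t_dm * D[i - 1][j - 1])
            I[i][j] = t_mi * M[i - 1][j] + t_ii * I[i - 1][j]   # read base against a gap
            D[i][j] = t_md * M[i][j - 1] + t_dd * D[i][j - 1]   # haplotype base against a gap
    # The read must be consumed completely; it may end anywhere on the haplotype.
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))


matching = pair_hmm_likelihood("ACGT", [30] * 4, "TTACGTAA")
mismatching = pair_hmm_likelihood("ACGT", [30] * 4, "TTACCTAA")
print(matching, mismatching)
```

Running this for every read against every candidate haplotype yields the likelihood matrix that the genotyping step consumes.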
HaplotypeCaller STEP 4: Genotype Sample
■ Matrix with likelihoods of the haplotypes given the reads
■ Assign genotypes to individual samples based on the allele likelihoods
■ By applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely
HaplotypeCaller Recap: Probabilistic Model
Bayes Rule
$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}$$
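Applied to genotyping, the hypothesis is a genotype and the data are the reads. The following sketch computes posterior probabilities for a diploid sample from per-read allele likelihoods. It is an illustration under stated assumptions: the function name is made up, a flat prior is used unless one is passed in, and each read is assumed to come from either allele of the genotype with probability 0.5.

```python
from itertools import combinations_with_replacement
from math import prod


def genotype_posteriors(read_likelihoods, priors=None):
    """read_likelihoods: one dict per read mapping allele -> P(read | allele).

    Returns a dict mapping each diploid genotype (pair of alleles) to its posterior.
    """
    alleles = sorted(read_likelihoods[0])
    genotypes = list(combinations_with_replacement(alleles, 2))
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}   # assumption: flat prior
    unnormalized = {}
    for a1, a2 in genotypes:
        # P(data | genotype): each read stems from one of the two alleles.
        likelihood = prod(0.5 * r[a1] + 0.5 * r[a2] for r in read_likelihoods)
        unnormalized[(a1, a2)] = likelihood * priors[(a1, a2)]
    evidence = sum(unnormalized.values())                         # P(data)
    return {g: value / evidence for g, value in unnormalized.items()}


reads = [{"A": 0.99, "G": 0.01}] * 5 + [{"A": 0.01, "G": 0.99}] * 5
posteriors = genotype_posteriors(reads)
print(max(posteriors, key=posteriors.get))
```

With five reads supporting each allele, the heterozygous genotype receives almost all of the posterior probability.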
3.4 Data Annotation
Data Annotation in the pipeline
Get information from SNPs
After variant calling: we know where the variants are
Now find out: what are these variants?
Several thousand variants cannot be analyzed manually
Tools for automated variant annotation (e.g. ANNOVAR)
Already known SNPs can be filtered by using information from dbSNP or the 1,000 Genomes Project
dbSNP = Single Nucleotide Polymorphism Database
Free public archive for genetic variation
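Filtering already known SNPs amounts to a lookup of each called variant in a catalogue of known sites. The sketch below uses a small in-memory set as a hypothetical stand-in for such a catalogue; the function name and all coordinates are made up for illustration and do not come from dbSNP.

```python
def split_known_and_novel(variants, known_sites):
    """Separate called variants into those present in a catalogue and the rest.

    Variants are (chromosome, position, ref, alt) tuples.
    """
    known, novel = [], []
    for variant in variants:
        (known if variant in known_sites else novel).append(variant)
    return known, novel


# Made-up example data, not real database entries.
known_sites = {("chr1", 1000, "A", "G"), ("chr2", 2500, "C", "T")}
called = [("chr1", 1000, "A", "G"), ("chr1", 1200, "G", "T"), ("chr2", 2500, "C", "T")]

known, novel = split_known_and_novel(called, known_sites)
print(len(known), novel)
```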
4. Analysis Results / Use Cases
Analysis Results in the pipeline
Analysis Results
Different genome browsers (Ensembl Genome Browser, UCSC Genome Browser, VEGA Genome Browser)
Ensembl Variant Effect Predictor (Online Tool)
Use Cases
Personalized Medicine
Prescribe medicine that works for you
Detect genetic diseases early on
Learn how humans work
Why do we age? Can we switch it off?
Why are some smarter than others? Might it be genes?
[1] http://www.tedxvienna.at/blog/personalize-this/
5. Outlook
Think about these questions:
If genome sequencing were as easy and cheap as a blood test, would you:
digitize your genome?
want to know about biomarkers?
want to know about markers that indicate a disease to come?
digitize the genome of your unborn baby?
want to know the appearance of your unborn baby?
want to know that you will develop Alzheimer's disease before the age of 40?
6. Discussion
Do we really want every technological advance that becomes possible in the future?