Processing of Raw Genome Data

Kristina Kirsten, Torben Meyer

Trends in Bioinformatics

Hasso-Plattner-Institut

Session Agenda

1. Why Sequence the Human Genome?

2. Problem Statement

■ Genetic Basics

■ Sanger Sequencing vs. NGS

3. Sequencing Pipeline

■ Base Calling

■ Alignment

■ Variant Calling

■ Data Annotation

4. Analysis Results / Use Cases

5. Outlook

6. Discussion

Kristina Kirsten, Torben Meyer, 27.01.2016

Processing of Raw Genome Data

Chart 2

Why Sequence the Human Genome?

■ Understand mutations

□ Images: [1] http://evolution.berkeley.edu/evolibrary/images/evo/dna-mutation.gif, [2] http://informoverload.com/man-has-extra-fingers/

■ Identify markers for diseases

□ Image: http://hmg.oxfordjournals.org/content/18/R1/R48/F1.large.jpg

■ Make personalized treatment decisions

□ Image: http://www.alphagenomix.com/wp-content/uploads/2014/03/Figure21.png

Genetic Basics

■ Base pairs:

□ Adenine (A) & Thymine (T)

□ Guanine (G) & Cytosine (C)

■ Genes = specific parts of DNA

■ Allele = one specific form of a gene

Image: http://study.com/cimages/multimages/16/phenotype_v_genotype.png

1,000 Genomes Project

Reference Genome

Sanger Sequencing vs. NGS

Property         | Sanger Sequencing                           | Next Generation Sequencing (NGS)
Year             | Developed in 1977                           | Introduced in 2004
Parallelization  | Small                                       | High
Reads            | Few but long reads                          | Many short reads
Read length      | 400 to 900 bases per read                   | 50 to 300 bases per read
Error rate       | Low: 1 Kb with per-base error rate < 0.001% | High: 0.5 - 1.0% per-base error rate
Data per run     | Low                                         | High: up to 6 billion reads
Time per 1 Kb    | 100 min                                     | 0.002 min
Cost per Kb      | $ 0.5 (expensive)                           | $ 0.00005 (affordable)

3. Sequencing Pipeline

Next-Generation Sequencing Bioinformatics Pipeline

3.1 Base Calling

■ Receives initial raw data

□ Image Filters, fluctuations in current, …

□ From Roche/454, Illumina, SOLiD, Helicos, …

■ Calls nucleotide bases (A, C, G, T) in short strings

□ A few giga base-pairs (Gbp) per machine-day

□ Usually 50 - 300 bases long

■ Each base call is assigned a quality score

Excursion – Phred Quality Scores

1. Probabilities of base call errors are very small

■ They need to be mapped to values that are easier to compare

$Q = -10 \cdot \log_{10} P$

2. Q: quality score, P: probability of a base call error

3. Quality scores are calculated in the base calling phase

Quality Score | Probability of Base Call Error | Base Call Accuracy
5             | approx. 1 in 3                 | approx. 68 %
10            | 1 in 10                        | 90 %
20            | 1 in 100                       | 99 %
30            | 1 in 1,000                     | 99.9 %
40            | 1 in 10,000                    | 99.99 %
50            | 1 in 100,000                   | 99.999 %
60            | 1 in 1,000,000                 | 99.9999 %

Source: https://en.wikipedia.org/wiki/Phred_quality_score
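The mapping between quality score and error probability can be checked with a few lines of Python. This is a minimal sketch; the function names `phred_quality` and `error_probability` are chosen here for illustration and do not come from the slides.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse mapping: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


# Print error probability and accuracy for a few common scores.
for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q={q}: P={p:g}, accuracy={100 * (1 - p):.2f} %")
```

A score of 5 corresponds to an error probability of about 0.316, which is where the "approx. 1 in 3" entry in the table comes from.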

3.2 Alignment

Alignment in the Sequencing Pipeline

Challenges

■ High error rates in NGS reads

■ High number of short reads

■ Mutations and variations can occur

■ Gaps, e.g. indels (insertions and deletions)

□ Bases could be missing or new bases could have been inserted

Alignment Algorithms

■ Using hash tables

□ BLAST

□ SeqMap

□ MAQ

■ Suffix or prefix tries (e.g. using an FM-Index)

□ Bowtie and Bowtie 2

□ BWA, BWA-SW

□ SOAP, SOAP2, SOAP3

Bowtie 2 – Overview

■ Based on Bowtie

■ Allows

□ Ungapped alignment

□ Gapped alignment (containing indels)

□ Inexact matching

■ Technologies

□ FM-Index, built on the BWT

□ SIMD-accelerated dynamic programming (Smith-Waterman)

■ Input: FASTQ file of reads

■ Output: SAM file of aligned reads

FASTQ-SANGER

■ Four lines per read

■ '@' title line

□ Record identifier

□ Other commentary

■ Sequence line

□ IUPAC single-letter codes for DNA or RNA, uppercase (G, T, C, A)

■ '+' line

□ Marks the end of the sequence line

□ Sometimes contains a copy of the '@' line

■ Quality line

□ Allows PHRED quality scores from 0 to 93

□ Displayed as ASCII codes 33 – 126, one character per score

□ Same length as the sequence line

Example record (sequence and quality lines are wrapped for display):

@SRR014849.1 EIXKN4201CFU84 length=93

GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGG

GTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAA

GCAATGCCAATA

+SRR014849.1 EIXKN4201CFU84 length=93

3+&$#"""""""""""7F@71,'";C?,B;?6B;:EA1EA

1EA5'9B:?:#9EA0D@2EA5':>5?:%A;A8A;?9B;D@

/=<?7=9<2A8==
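The quality line can be decoded by subtracting the ASCII offset 33 from each character. The following is a minimal sketch that assumes an unwrapped four-line record; the function name `parse_fastq_record` and the short record used here are hypothetical and not taken from the slides.

```python
def parse_fastq_record(lines):
    """Parse one unwrapped four-line FASTQ record.

    Returns (title, sequence, qualities), where qualities holds one
    PHRED score per base (Sanger encoding, ASCII offset 33).
    """
    title_line, sequence, plus_line, quality_line = [
        line.rstrip("\n") for line in lines
    ]
    if not title_line.startswith("@") or not plus_line.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality_line):
        raise ValueError("sequence and quality line differ in length")
    qualities = [ord(ch) - 33 for ch in quality_line]
    return title_line[1:], sequence, qualities


# Hypothetical record, made up for illustration only.
record = ["@read1 hypothetical example", "GATTACA", "+", "!+5?IS~"]
title, seq, quals = parse_fastq_record(record)
print(title, seq, quals)
```

The characters `!` and `~` are ASCII 33 and 126, so they mark the two ends of the allowed score range 0 to 93.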

Burrows-Wheeler-Transformation (BWT)

■ Input: A A T T C G A

■ Append EOF char: A A T T C G A $

■ Shift right (all cyclic rotations):

A A T T C G A $
$ A A T T C G A
A $ A A T T C G
G A $ A A T T C
C G A $ A A T T
T C G A $ A A T
T T C G A $ A A
A T T C G A $ A

■ Sort lexicographically

■ Return last column: AATTCGA$ → AG$ATCTA

Sorted rotations:

$ A A T T C G A

A $ A A T T C G

A A T T C G A $

A T T C G A $ A

C G A $ A A T T

G A $ A A T T C

T C G A $ A A T

T T C G A $ A A
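The four steps above (append EOF, rotate, sort, take the last column) can be sketched directly in Python. This is a naive version for illustration only: it materializes all rotations, which is far too expensive for a real genome, and it assumes that `$` does not occur in the input. The function name `bwt` is chosen here.

```python
def bwt(text: str, eof: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted cyclic rotations."""
    if eof in text:
        raise ValueError("input must not contain the EOF character")
    s = text + eof
    # All cyclic rotations of the string with the EOF marker appended.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # '$' (ASCII 36) sorts before 'A', 'C', 'G', 'T' in Python's default order.
    rotations.sort()
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("AATTCGA"))
```

For the slide's input `AATTCGA` this yields `AG$ATCTA`, the last column of the sorted matrix shown above.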

BWT - Unpermute

1. We only know the first column (the sorted letters) and the last column

2. Last-First property: the order of identical letters is not mixed up between the last and the first column

3. We start with the first letter in the last column and apply that property to unpermute the sentence

4. This letter equals the last letter of the original sentence: A

5. Apply the LF-mapping property until $ is reached: A → GA → CGA → TCGA → TTCGA → ATTCGA → AATTCGA

6. Original sentence is reconstructed: AATTCGA

$ . . . . . . A

A . . . . . . G

A . . . . . . $

A . . . . . . A

C . . . . . . T

G . . . . . . C

T . . . . . . T

T . . . . . . A
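The unpermute walk can be written as a short loop over the LF mapping. This sketch assumes that the EOF marker `$` sorts before all other characters, so that row 0 of the sorted matrix starts with `$`; the function name `inverse_bwt` and the helper tables are chosen here for illustration.

```python
def inverse_bwt(last: str, eof: str = "$") -> str:
    """Reconstruct the original string from the BWT last column."""
    first = sorted(last)

    # Row of the first occurrence of each character in the first column.
    first_row = {}
    for row, ch in enumerate(first):
        first_row.setdefault(ch, row)

    # rank[i]: how many identical characters precede last[i] in the last column.
    seen = {}
    rank = []
    for ch in last:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1

    # Start in row 0 (the rotation beginning with the EOF marker) and
    # follow the LF mapping, collecting the text from back to front.
    row = 0
    collected = []
    while last[row] != eof:
        ch = last[row]
        collected.append(ch)
        row = first_row[ch] + rank[row]
    return "".join(reversed(collected))


print(inverse_bwt("AG$ATCTA"))
```

The loop visits rows 0, 1, 5, 4, 6, 7, 3, 2 for the slide's example, collecting A, G, C, T, T, A, A in that order.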

BWT – Exact Matching

Let's search for ATT in the original string!

1. Write the output of the BWT to the last column

2. Sort the letters lexicographically and write them to the first column

3. Mark all occurrences of the last letter of the query in the first column

4. Check for occurrences of the next letter in the last column

5. Count which occurrence of T was found

6. Mark that occurrence in the first column

7. Repeat for the remaining letters

ATT: FOUND!

$ . . . . . . A

A . . . . . . G

A . . . . . . $

A . . . . . . A

C . . . . . . T

G . . . . . . C

T . . . . . . T

T . . . . . . A
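The marking and counting steps correspond to the usual backward search over a row range. The sketch below counts occurrences with `str.count`, which is simple but slow; an FM-Index would precompute these counts. The function name `backward_search` and the half-open row range it returns are conventions chosen here, not taken from the slides.

```python
def backward_search(last: str, pattern: str):
    """Return the half-open row range [lo, hi) of sorted rotations that
    start with `pattern`, or None if the pattern does not occur."""
    first = sorted(last)

    # first_row[c]: row where character c starts in the first column.
    first_row = {}
    for row, ch in enumerate(first):
        first_row.setdefault(ch, row)

    lo, hi = 0, len(last)
    # Process the pattern from its last letter to its first letter.
    for ch in reversed(pattern):
        if ch not in first_row:
            return None
        # Number of occurrences of ch in the last column above lo / above hi.
        lo = first_row[ch] + last[:lo].count(ch)
        hi = first_row[ch] + last[:hi].count(ch)
        if lo >= hi:
            return None
    return lo, hi


print(backward_search("AG$ATCTA", "ATT"))
```

For `ATT` the range narrows from rows 6 to 7 (both T), to row 7 (second T), to row 3, which is the rotation `ATTCGA$A`.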

BWT – Unwind

8. Now we need to unwind to find the position in the original string

9. The end of the string is found. Position of the searched string: its first letter is the 6th last letter of the sentence ≜ 2nd letter of the original string

AATTCGA$

$ . . . . . . A

A . . . . . . G

A . . . . . . $

A . . . . . . A

C . . . . . . T

G . . . . . . C

T . . . . . . T

T . . . . . . A
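One way to turn a matrix row into a text position is to follow the LF mapping until `$` shows up in the last column and count the steps. Note that this sketch walks towards the start of the string and returns a 0-based offset, whereas the slides count from the end of the string; both give the same position. The function name `locate` is chosen here, and real aligners avoid the full walk by storing sampled positions.

```python
def locate(last: str, row: int, eof: str = "$") -> int:
    """Return the 0-based text position of the rotation in `row`."""
    first = sorted(last)
    first_row = {}
    for r, ch in enumerate(first):
        first_row.setdefault(ch, r)

    steps = 0
    # Each LF step moves one character to the left in the original text.
    while last[row] != eof:
        ch = last[row]
        row = first_row[ch] + last[:row].count(ch)
        steps += 1
    return steps


# Row 3 is the match for ATT found by exact matching.
print(locate("AG$ATCTA", 3))
```

Row 3 needs a single step to reach the row ending in `$`, so ATT starts at offset 1, the 2nd letter of AATTCGA.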

BWT – Inexact Matching

But what about inexact matches? What if we want to find TCAA in the original string?

Quality scores are assigned in the base calling phase:

Base          | T  | C  | A  | A
Quality score | 65 | 47 | 10 | 50

1. Proceed as in exact matching

2. No match found for the next letter C. What now?

3. Look at the base calls and pick the position in the stack with the lowest quality score

■ The second A processed (third base of the read) had a score of 10

4. Walk back in the stack and replace it with a different possible base (e.g. G)

5. Continue with exact matching and repeat until a match is found: TCGA

$ . . . . . . A

A . . . . . . G

A . . . . . . $

A . . . . . . A

C . . . . . . T

G . . . . . . C

T . . . . . . T

T . . . . . . A

We found an alignment for TCAA with one mismatch.
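The quality-guided substitution can be approximated as follows. This is a deliberately simplified sketch: instead of backtracking inside the search stack as on the slides, it restarts an exact search for each candidate, it allows only one mismatch, and it tries positions in order of increasing quality score. The names `backward_search` and `inexact_search` are chosen here for illustration.

```python
def backward_search(last: str, pattern: str):
    """Exact backward search; returns a row range [lo, hi) or None."""
    first = sorted(last)
    first_row = {}
    for row, ch in enumerate(first):
        first_row.setdefault(ch, row)
    lo, hi = 0, len(last)
    for ch in reversed(pattern):
        if ch not in first_row:
            return None
        lo = first_row[ch] + last[:lo].count(ch)
        hi = first_row[ch] + last[:hi].count(ch)
        if lo >= hi:
            return None
    return lo, hi


def inexact_search(last: str, read: str, qualities, alphabet: str = "ACGT"):
    """Find the read with at most one substituted base.

    Returns (matched_string, row_range) or None.
    """
    hit = backward_search(last, read)
    if hit is not None:
        return read, hit
    # Try the least trustworthy base calls first.
    for pos in sorted(range(len(read)), key=lambda i: qualities[i]):
        for base in alphabet:
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            hit = backward_search(last, candidate)
            if hit is not None:
                return candidate, hit
    return None


print(inexact_search("AG$ATCTA", "TCAA", [65, 47, 10, 50]))
```

With the slide's scores, the base with score 10 is tried first; replacing it by C fails and replacing it by G gives TCGA, which matches.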

Gapped Alignment

What about gapped alignment (e.g. indels)?

Let's look at the Bowtie 2 pipeline.

Bowtie 2 – 1. Extract Seeds

Extract seeds from each read

Follow a certain policy (e.g. 16 nt substring every 10 nt along the read)

Upcoming process works with the seed strings

Read:

CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA
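The example policy (16 nt substring every 10 nt) can be sketched as a simple slicing loop. This sketch only covers the forward strand of the read, and the function name `extract_seeds` and its parameters are chosen here; it is not Bowtie 2's actual implementation.

```python
def extract_seeds(read: str, seed_length: int = 16, interval: int = 10):
    """Return (offset, seed) pairs: one seed of `seed_length` bases
    every `interval` bases, as long as the seed fits into the read."""
    seeds = []
    for offset in range(0, len(read) - seed_length + 1, interval):
        seeds.append((offset, read[offset:offset + seed_length]))
    return seeds


read = "CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA"
for offset, seed in extract_seeds(read):
    print(offset, seed)
```

For the 37-base read on the slide this produces three seeds, at offsets 0, 10 and 20.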

Bowtie 2 – 2. Align with FM Index

■ Find alignments using the BWT as described earlier

□ There may be more than one possible alignment per seed!

■ Returns ranges of possible alignments within the FM Index

CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA → BWT, (in)exact matching → { [211, 212]; [212, 214] }

Bowtie 2 – 3. Prioritize, Resolve

■ Every alignment gets priority $1 / r^2$

□ r = total number of alignments for the seed

□ Seeds with fewer alignments get higher priority

■ Randomly select seeds, weighted by priority

□ Run dynamic programming approach on these

□ Modified Smith-Waterman algorithm
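The 1/r² weighting and the weighted random selection can be illustrated with Python's `random.choices`. This is a sketch under assumptions: the dictionary layout of `seed_hits`, the seed names, the positions and the function names are all hypothetical, and the sampling here is with replacement, which may differ from what Bowtie 2 does internally.

```python
import random


def alignment_priorities(seed_hits):
    """Return (seed, position, priority) triples with priority 1 / r**2,
    where r is the number of candidate alignments of that seed."""
    weighted = []
    for seed, positions in seed_hits.items():
        r = len(positions)
        for position in positions:
            weighted.append((seed, position, 1.0 / (r * r)))
    return weighted


def pick_candidates(seed_hits, k, rng):
    """Draw k candidates at random, weighted by priority (with replacement)."""
    weighted = alignment_priorities(seed_hits)
    weights = [priority for _, _, priority in weighted]
    return rng.choices(weighted, weights=weights, k=k)


# Hypothetical hits: seed "s1" aligns once, seed "s2" aligns twice.
hits = {"s1": [5], "s2": [7, 9]}
print(alignment_priorities(hits))
print(pick_candidates(hits, 3, random.Random(0)))
```

The unique hit of `s1` gets weight 1.0, while each of the two hits of `s2` gets 0.25, so seeds with fewer alignments are extended first more often.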

Smith-Waterman

■ Reference Genome:

CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA

■ Read:

GCTCAG

■ One exact match for this seed found!

■ Uses a dynamic programming approach

□ Solve a larger problem by first solving smaller problems

□ Then base the larger problem on the results of the smaller problems

□ Fill out a table to save all results

■ Give an "award" if a match is found

■ Give a "penalty" if no match was found (here is a gap!)

■ The table cell with the highest score is the best alignment

■ Let's align GCTCAG to GCTCTCAG!

Smith-Waterman

Kirsten, KristinaMeyer, Torben27.01.2016

Processing of Raw Genome Data

Chart 96

- G C T C T C A G

-

G

C

T

C

A

G

𝐻 𝑖, 𝑗 = 𝑚𝑎𝑥

0𝐻 𝑖 − 1, 𝑗 − 1 + 𝑤 𝑎𝑖 , 𝑏𝑗𝐻 𝑖 − 1, 𝑗 + 𝑤 𝑎𝑖 , −

𝐻 𝑖, 𝑗 − 1 + 𝑤 −, 𝑏𝑗

𝑤 𝑎, 𝑏 = −1, 𝑎 ≠ 𝑏 𝑜𝑟 𝑎 = − 𝑜𝑟 𝑏 = −

+2, 𝑎 = 𝑏

𝐻 𝑖, 𝑗 = 𝑚𝑎𝑥

0𝐻 𝑖 − 1, 𝑗 − 1 + 𝑤 𝑎𝑖 , 𝑏𝑗𝐻 𝑖 − 1, 𝑗 + 𝑤 𝑎𝑖 , −

𝐻 𝑖, 𝑗 − 1 + 𝑤 −, 𝑏𝑗

𝑤 𝑎, 𝑏 = −1, 𝑎 ≠ 𝑏 𝑜𝑟 𝑎 = − 𝑜𝑟 𝑏 = −

+2, 𝑎 = 𝑏

Initialization: the first row and the first column are set to 0. The first computed cell is H(1,1) = 0 + 2 = 2, because G matches G.

    -  G  C  T  C  T  C  A  G
-   0  0  0  0  0  0  0  0  0
G   0  2
C   0
T   0
C   0
A   0
G   0

Parallelizable! Cells on the same anti-diagonal do not depend on each other, so they can be computed at the same time.

Completed table (every cell follows the recurrence with +2 for a match and -1 for a mismatch or gap):

    -  G  C  T  C  T  C  A  G
-   0  0  0  0  0  0  0  0  0
G   0  2  1  0  0  0  0  0  2
C   0  1  4  3  2  1  2  1  1
T   0  0  3  6  5  4  3  2  1
C   0  0  2  5  8  7  6  5  4
A   0  0  1  4  7  7  6  8  7
G   0  2  1  3  6  6  6  7 10

■ Traceback from the best value (here 10, in the bottom-right cell)

□ We need to remember how we calculated each value for that!

■ Alignment:

G C T C T C A G

G C - - T C A G
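The table filling and the traceback can be sketched in a few lines of Python. This is a minimal illustration under the scoring used on the slides (+2 for a match, -1 for a mismatch or a gap), not the implementation used by Bowtie2; the function name `smith_waterman` and its parameters are chosen here for illustration. Instead of storing pointers, the traceback re-derives which neighbour produced each value, and it prefers the diagonal when several moves tie.

```python
def smith_waterman(read, ref, match=2, penalty=-1):
    """Fill the local-alignment table H and trace back one best alignment."""
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]      # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            w = match if read[i - 1] == ref[j - 1] else penalty
            H[i][j] = max(0,
                          H[i - 1][j - 1] + w,      # match or mismatch
                          H[i - 1][j] + penalty,    # read base against a gap
                          H[i][j - 1] + penalty)    # reference base against a gap
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    ref_aln, read_aln = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        w = match if read[i - 1] == ref[j - 1] else penalty
        if H[i][j] == H[i - 1][j - 1] + w:          # diagonal move (preferred on ties)
            ref_aln.append(ref[j - 1]); read_aln.append(read[i - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] + penalty:      # gap in the read
            ref_aln.append(ref[j - 1]); read_aln.append("-")
            j -= 1
        else:                                       # gap in the reference
            ref_aln.append("-"); read_aln.append(read[i - 1])
            i -= 1
    return H, best, "".join(reversed(ref_aln)), "".join(reversed(read_aln))


H, score, ref_aln, read_aln = smith_waterman("GCTCAG", "GCTCTCAG")
print(score, ref_aln, read_aln)
```

For this example the best score is 10. Several alignments reach it: the gapped alignment GC--TCAG against GCTCTCAG scores 6 · 2 - 2 · 1 = 10, and so does the gap-free alignment of CTCAG against the last five reference bases (5 · 2 = 10). With the diagonal-first tie-break used in this sketch, the traceback reports the gap-free one.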

Results

■ Smith-Waterman ends when ...

□ all possible alignments have been examined

□ enough alignments have been examined

□ the dynamic programming limit is reached

■ Pick the alignment with the highest score in the Smith-Waterman algorithm

Adjustments to Smith-Waterman in Bowtie2

■ Different gap penalties for opening and extending a gap

■ Restrictions on where gaps are allowed

■ Scoring function also takes the quality score into account

■ Reseeding if no proper matches were found
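The first adjustment, separate penalties for opening and extending a gap, is usually called an affine gap penalty. The sketch below shows the idea only. The formula `gap_open + length * gap_extend` and the values 5 and 3 are assumptions for illustration; consult the Bowtie 2 manual for the actual option semantics and defaults.

```python
def affine_gap_penalty(length, gap_open=5, gap_extend=3):
    """Penalty for one contiguous gap of `length` bases (assumed affine scheme)."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


one_long_gap = affine_gap_penalty(2)                             # one gap of two bases
two_short_gaps = affine_gap_penalty(1) + affine_gap_penalty(1)   # two separate gaps
print(one_long_gap, two_short_gaps)
```

Under such a scheme a single gap of two bases is cheaper than two separate one-base gaps, which matches the biological intuition that one insertion or deletion event often affects several adjacent bases.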

Sequence Alignment/Map format (SAM)

@SQ SN:ref LN:45
r001 163 ref  7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002   0 ref  9 30 3S6M1P1I4M *  0  0 AAAAGATAAGGATA    *
r003   0 ref  9 30 5H6M       *  0  0 AGCTAA            * NM:i:1
r004   0 ref 16 30 6M14N5M    *  0  0 ATAGCTTCAGC       *

■ Leftmost position of the alignment (POS)

■ Mapping quality of the alignment (MAPQ, Phred-scaled)

■ Alignment operations as a CIGAR string

■ Query sequence (SEQ)
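To make the fields listed above concrete, here is a small sketch that splits one alignment line into its mandatory columns and decodes the CIGAR string. It is not a full SAM parser: the names `SamRecord`, `parse_sam_line`, `query_length` and `reference_span` are hypothetical, header lines and optional tags are ignored, and whitespace splitting is used so that the space-separated slide text can be pasted directly (real SAM files are tab-separated).

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int        # 1-based leftmost position of the alignment
    mapq: int       # Phred-scaled mapping quality
    cigar: list     # list of (length, operation) pairs
    seq: str        # query sequence


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    fields = line.split()
    cigar = [(int(n), op) for n, op in CIGAR_OP.findall(fields[5])]
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), cigar, fields[9])


def query_length(cigar):
    """Number of read bases described by the CIGAR (M, I, S, =, X consume the query)."""
    return sum(n for n, op in cigar if op in "MIS=X")


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, =, X consume the reference)."""
    return sum(n for n, op in cigar if op in "MDN=X")


rec = parse_sam_line("r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *")
print(rec.pos, rec.mapq, rec.cigar, query_length(rec.cigar), reference_span(rec.cigar))
```

A useful sanity check is that the query length derived from the CIGAR string equals the length of the stored sequence, as it does for read r002.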

3.3 Variant Calling

Variant Calling in the Pipeline

Genetic Variation vs. Mutation

“Genetic variation is what makes us all unique, whether in terms of hair colour, skin colour or even the shape of our faces.”

vs.

“A mutation is a change that occurs in our DNA sequence, either due to mistakes when the DNA is copied or as the result of environmental factors such as UV light and cigarette smoke.”

Mutations contribute to genetic variation within species

What Does Variant Calling Mean?

■ Single-nucleotide polymorphisms (SNPs)

■ Insertions and deletions (indels)

■ Larger structural variants

□ Copy number variation (CNV)

□ Loss of one or both copies of a gene

□ Movement of DNA sections from one location to another

SNP Calling Methods

■ Early method: counting the abundance of high-quality nucleotides at a site

■ Recent approaches:

□ Integrate several sources of information

□ Use prior probabilities for a SNP at a given position (e.g. from dbSNP)

■ Further method: heuristic approach

□ Takes into account specific features of different sequencing platforms and different read alignment methods (VarScan)
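The early counting method can be sketched as follows. This is an illustration only, not the algorithm of any particular tool: the function name, the quality cutoff of 20 and the minimum allele fraction of 0.2 are assumptions chosen for the example.

```python
from collections import Counter


def call_snp_by_counting(ref_base, pileup, min_quality=20, min_fraction=0.2):
    """Return the most frequent non-reference base at a site, or None.

    pileup is a list of (base, phred_quality) pairs observed at the site;
    only bases with quality >= min_quality are counted.
    """
    counts = Counter(base for base, quality in pileup if quality >= min_quality)
    total = sum(counts.values())
    alternatives = [(n, base) for base, n in counts.items() if base != ref_base]
    if total == 0 or not alternatives:
        return None
    n, base = max(alternatives)
    return base if n / total >= min_fraction else None


# Hypothetical observations at one reference position with reference base A.
pileup = [("A", 35), ("A", 30), ("G", 38), ("G", 33), ("G", 12), ("A", 31), ("T", 5)]
call = call_snp_by_counting("A", pileup)
print(call)
```

Here the low-quality G and T observations are discarded, leaving three A and two G bases, so G is reported with an allele fraction of 2/5. The weakness of this approach is that fixed cutoffs ignore how reliable each individual base and alignment is, which is what the probabilistic methods address.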

Challenges with Variant Calling

■ Presence of indels

■ Errors from library preparation

■ Variable quality scores with higher error rates

Identify “true” variants, not alignment and/or sequencing errors

Minimize the number of false positives

Genome Analysis Toolkit

■ A software package for the analysis of high-throughput sequencing (HTS) data

■ Offers a variety of tools with a focus on variant discovery and genotyping

■ Reads-to-variants workflow: [figure]

GATK's HaplotypeCaller

SAM / BAM File → HaplotypeCaller → VCF File

■ Is capable of calling SNPs and indels simultaneously

HaplotypeCaller STEP 1: Identify Active Regions

■ Go through the reference genome with a sliding window

■ Count indels and mismatches

■ Memorize regions to operate on
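The sliding-window idea can be illustrated with a toy example. This is a simplified sketch and not GATK's actual activity-profile computation: the function name, the window size of 4 and the threshold of 3 are assumptions, and `evidence[i]` stands for the number of mismatches and indels observed at reference position i.

```python
def find_active_regions(evidence, window=4, threshold=3):
    """Return merged half-open intervals [start, end) whose window sum reaches the threshold."""
    regions = []
    for start in range(len(evidence) - window + 1):
        if sum(evidence[start:start + window]) >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlapping or adjacent window: extend the previous region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Hypothetical per-position counts of mismatches and indels.
evidence = [0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0]
regions = find_active_regions(evidence)
print(regions)
```

Only the memorized regions are passed on to the more expensive re-assembly step, so the rest of the genome, where reads agree with the reference, is skipped.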

HaplotypeCaller STEP 2: Assemble Plausible Haplotypes

■ Local re-assembly

■ Build a De Bruijn-like graph

■ Prune according to a threshold

■ Traverse the graph to collect the most likely haplotypes

■ Align the haplotypes to the reference genome using the Smith-Waterman algorithm (SWA)
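A De Bruijn-style graph and the pruning step can be sketched as follows. The sketch is an assumption-laden toy: the k-mer size of 4, the pruning threshold of 2 and the function name are chosen for illustration, and GATK's real assembler does considerably more (for example, it also threads the reference sequence through the graph).

```python
from collections import Counter


def de_bruijn_edges(reads, k=4, min_count=2):
    """Map each edge (prefix, suffix) of a k-mer to its count, dropping rare k-mers."""
    kmers = Counter(read[i:i + k]
                    for read in reads
                    for i in range(len(read) - k + 1))
    # Each k-mer is an edge from its first k-1 bases to its last k-1 bases.
    return {(kmer[:-1], kmer[1:]): n for kmer, n in kmers.items() if n >= min_count}


# Hypothetical reads; the last one contains a likely sequencing error (G instead of C).
reads = ["GCTCAG", "GCTCAG", "CTCAGT", "GCTGAG"]
edges = de_bruijn_edges(reads)
print(edges)
```

The k-mers that come only from the erroneous read are seen once and are pruned, so the remaining path GCT → CTC → TCA → CAG spells the supported haplotype GCTCAG.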

HaplotypeCaller STEP 3: Determine per-read Likelihoods

■ Determine the likelihoods of the haplotypes given the read data

■ For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm

■ Produces a matrix of likelihoods of the haplotypes given the read data

HaplotypeCaller Recap: Statistical Models

Markov Chain

$$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) = a_{x_{i-1} x_i}$$

Hidden Markov Model (HMM)

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$
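The Markov property means that the probability of a whole sequence is the start probability of its first symbol times a product of transition probabilities. A minimal sketch, with a hypothetical two-letter alphabet and made-up probabilities, working in log space to avoid underflow on long sequences:

```python
import math


def markov_log_probability(seq, start, trans):
    """log P(seq) for a first-order Markov chain with transitions trans[prev][cur]."""
    log_p = math.log(start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(trans[prev][cur])
    return log_p


# Hypothetical parameters; each row of `trans` sums to 1.
start = {"A": 0.5, "C": 0.5}
trans = {"A": {"A": 0.5, "C": 0.5},
         "C": {"A": 0.25, "C": 0.75}}

probability = math.exp(markov_log_probability("AAC", start, trans))
print(probability)
```

For the sequence AAC this evaluates to 0.5 · 0.5 · 0.5 = 0.125. In a hidden Markov model the same kind of transition table is applied to the hidden state path π instead of the observed symbols.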

HaplotypeCaller STEP 3: Determine per-read Likelihoods

■ Match state (M): emits an aligned pair of symbols with probability p

■ Gap states (Ix, Iy): emit a symbol x_i or y_j against a gap with probability p

■ All transition probabilities leaving each state must sum to 1

■ Empirical gap penalties: derived from the data by BQSR (Base Quality Score Recalibration)

■ Base mismatch penalties: base quality scores
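How base quality scores act as mismatch penalties can be shown with a deliberately simplified model. The sketch below is not the PairHMM: it assumes the read is aligned to the haplotype without gaps, so no gap states or transition probabilities are involved, and it assumes that an erroneous base is equally likely to be any of the three other bases. The function names are hypothetical.

```python
def phred_to_error(quality):
    """Convert a Phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def ungapped_read_likelihood(read, qualities, haplotype, offset=0):
    """P(read | haplotype) when the read starts at `offset` and contains no indels."""
    likelihood = 1.0
    for k, (base, quality) in enumerate(zip(read, qualities)):
        error = phred_to_error(quality)
        if haplotype[offset + k] == base:
            likelihood *= 1 - error      # base read correctly
        else:
            likelihood *= error / 3      # base misread as this specific base
    return likelihood


read, qualities = "ACG", [20, 20, 10]
like_hap1 = ungapped_read_likelihood(read, qualities, "ACG")
like_hap2 = ungapped_read_likelihood(read, qualities, "ACT")
print(like_hap1, like_hap2)
```

A mismatch at a low-quality base costs little, while a mismatch at a high-quality base strongly lowers the likelihood of that haplotype. Repeating this for every read and every haplotype yields a read-by-haplotype likelihood matrix of the kind described in this step.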

HaplotypeCaller STEP 4: Genotype Sample

■ Matrix with the likelihoods of the haplotypes given the reads

■ Assign genotypes to individual samples based on the allele likelihoods

■ Apply Bayes' theorem to calculate the likelihood of each possible genotype, and select the most likely one

HaplotypeCaller Recap: Probabilistic Model

Bayes' Rule

$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}$$
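Applied to genotyping, each hypothesis is a candidate genotype and the data are the reads. The sketch below applies Bayes' rule to three diploid genotypes; the likelihood and prior values are invented for illustration, and the function name is hypothetical.

```python
def genotype_posteriors(likelihoods, priors):
    """Posterior P(genotype | reads) from P(reads | genotype) and P(genotype)."""
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    evidence = sum(unnormalized.values())          # P(data), the normalizing constant
    return {g: value / evidence for g, value in unnormalized.items()}


# Hypothetical values: 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous variant.
likelihoods = {"0/0": 0.001, "0/1": 0.2, "1/1": 0.01}
priors = {"0/0": 0.9, "0/1": 0.09, "1/1": 0.01}

posteriors = genotype_posteriors(likelihoods, priors)
best_genotype = max(posteriors, key=posteriors.get)
print(best_genotype, posteriors[best_genotype])
```

Although the prior strongly favours the homozygous reference genotype, the read likelihoods outweigh it here and the heterozygous genotype is selected.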

3.4 Data Annotation

Data Annotation in the pipeline


Get Information from SNPs

After variant calling we know where the variants are

Now find out what these variants are

Several thousand variants cannot be analyzed manually

Tools for automated variant annotation (e.g. ANNOVAR)

Already known SNPs can be filtered by using information from dbSNP or the 1,000 Genomes Project

dbSNP = Single Nucleotide Polymorphism Database

Free public archive for genetic variation
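Filtering against a catalogue of known variants amounts to a set lookup. The sketch below uses made-up coordinates and an in-memory set; it does not use the real dbSNP data format or any annotation tool's API, and the function name is hypothetical.

```python
def split_known_novel(variants, known_sites):
    """Split called variants into already catalogued ones and novel ones."""
    known, novel = [], []
    for variant in variants:
        (known if variant in known_sites else novel).append(variant)
    return known, novel


# Hypothetical entries: (chromosome, position, reference allele, alternative allele).
known_sites = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
called = [("chr1", 1000, "A", "G"), ("chr1", 2000, "T", "C")]

known, novel = split_known_novel(called, known_sites)
print(known, novel)
```

Known variants can then be annotated from the catalogue, while the remaining novel variants are the ones that need closer inspection.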

4. Analysis Results / Use Cases

Analysis Results in the pipeline


Analysis Results

Different genome browsers (Ensembl Genome Browser, UCSC Genome Browser, VEGA Genome Browser)

Ensembl Variant Effect Predictor (online tool)

Use Cases

Personalized Medicine

Prescribe medicine that works for you

Detect genetic diseases early on

Learn how humans work

Why do we age? Can we switch it off?

Why are some people smarter than others? Might it be genes?

[1] http://www.tedxvienna.at/blog/personalize-this/

5. Outlook

Think about these questions ...

If genome sequencing were as easy and cheap as a blood test, would you:

digitize your genome?

want to know about biomarkers?

want to know about markers that indicate a disease to come?

digitize the genome of your unborn baby?

want to know the appearance of your unborn baby?

want to know that you will get Alzheimer's disease before you are 40 years old?

6. Discussion

Do we really want every technological advance that will become possible in the future?

Literature

■ B. Langmead, C. Trapnell, M. Pop, S. Salzberg: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009

■ B. Langmead, S. Salzberg: Fast gapped-read alignment with Bowtie 2. Nature Methods, Vol. 9, No. 4, 2012

■ P. J. A. Cock et al.: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, Vol. 38, No. 6, 2010

■ M. P. Dolled-Filhart et al.: Computational and bioinformatics frameworks for next-generation whole exome and genome sequencing. The Scientific World Journal, 2013

■ R. Nielsen et al.: Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, Vol. 12, No. 6, 2011, pp. 443-451