Processing of Raw Genome Data

Kristina Kirsten, Torben Meyer

Trends in Bioinformatics

Hasso-Plattner-Institut

Session Agenda

1. Why Sequence the Human Genome?

2. Problem Statement

■ Genetic Basics

■ Sanger Sequencing vs. NGS

3. Sequencing Pipeline

■ Base Calling

■ Alignment

■ Variant Calling

■ Data Annotation

4. Analysis Results / Use Cases

5. Outlook

6. Discussion

Kirsten, Kristina; Meyer, Torben; 27.01.2016

Chart 2

Why Sequence the Human Genome?

Understand mutations

Chart 3

[1] http://evolution.berkeley.edu/evolibrary/images/evo/dna-mutation.gif

[2] http://informoverload.com/man-has-extra-fingers/

Identify markers for diseases

Chart 4

http://hmg.oxfordjournals.org/content/18/R1/R48/F1.large.jpg

Take personalized treatment decisions

Chart 5

http://www.alphagenomix.com/wp-content/uploads/2014/03/Figure21.png

Take personalized treatment decisions

Chart 6

Genetic Basics (1)

Chart 7

Genetic Basics (2)

Chart 8

• Base pairs:

Adenine (A) & Thymine (T)

Guanine (G) & Cytosine (C)

Genes = specific parts of DNA

Allele = one specific form of a gene

http://study.com/cimages/multimages/16/phenotype_v_genotype.png

1,000 Genomes Project

Chart 9

Reference Genome

Chart 10

Sanger Sequencing vs. NGS

Chart 11

Sanger Sequencing

■ Developed in 1977
■ Little parallelization
■ Few but long reads
□ 400 to 900 bases per read
■ Low error rate
□ 1 Kb with a per-base error rate of <0.001%
■ Low amount of data per run
□ 100 min for sequencing 1 Kb
□ Expensive: $0.5 per Kb

Next Generation Sequencing (NGS)

■ Introduced in 2004
■ High parallelization
■ Many short reads
□ 50 to 300 bases per read
■ High error rate
□ 0.5 - 1.0% per-base error rate
■ High amount of data per run
□ Up to 6 billion reads
□ 0.002 min for sequencing 1 Kb
□ Affordable: $0.00005 per Kb

3. Sequencing Pipeline

Next-Generation Sequencing Bioinformatics Pipeline

Chart 13

This presentation

3.1 Base Calling

■ Receives initial raw data

□ Image Filters, fluctuations in current, …

□ From Roche/454, Illumina, SOLiD, Helicos, …

■ Calls nucleotide bases (A, C, G, T) in short strings

□ A few giga base-pairs (Gbp) per machine-day

□ Usually 50 - 300 bases long

■ Base calls get quality score assigned

Base Calling

Chart 16

Excursion – Phred Quality Scores

Chart 17

1. Probabilities of base call errors are very small

■ They need to be mapped to values that are easier to compare

$Q = -10 \cdot \log_{10} P$

2. Q: quality score, P: probability of base call error

3. Quality scores are calculated in the base calling phase

Excursion – Phred Quality Scores

Chart 18

Quality Score   Probability of Base Call Error   Base Call Accuracy
5               ca. 1 in 3                       ca. 68 %
10              1 in 10                          90 %
20              1 in 100                         99 %
30              1 in 1,000                       99.9 %
40              1 in 10,000                      99.99 %
50              1 in 100,000                     99.999 %
60              1 in 1,000,000                   99.9999 %

Source: https://en.wikipedia.org/wiki/Phred_quality_score
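The mapping between error probability and quality score can be checked with a few lines of Python. This is a minimal sketch; the function names are chosen here for illustration and do not come from any sequencing tool.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Quality score Q = -10 * log10(P) for an error probability P."""
    if not 0.0 < p <= 1.0:
        raise ValueError("P must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Inverse mapping: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


for q in (10, 20, 30, 40):
    p = error_prob_from_phred(q)
    print(f"Q={q}: P(error)={p:g}, accuracy={100 * (1 - p):.4g} %")
```

The loop reproduces rows of the table above, for example an error probability of 1 in 1,000 for Q = 30.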

3.2 Alignment

Alignment in the Sequencing Pipeline

Chart 20

■ High error rates in NGS reads

■ High number of short reads

■ Mutations and Variations can occur

■ Gaps, e.g. Indels (Insertions and Deletions)

■ Bases could be missing / new bases could have been inserted

Challenges

Chart 21

■ Using hash tables

□ BLAST

□ SeqMap

□ MAQ

■ Suffix or prefix tries (e.g. using an FM-Index)

□ Bowtie and Bowtie 2

□ BWA, BWA-SW

□ SOAP, SOAP2, SOAP3

Alignment Algorithms

Chart 22

■ Based on Bowtie

■ Allows

□ Ungapped alignment

□ Gapped alignment (containing indels)

□ Inexact matching

■ Technologies

□ FM-index, built on the BWT

□ SIMD accelerated dynamic programming (Smith-Waterman)

■ Input: FASTQ file of reads

■ Output: SAM file of aligned reads

Bowtie 2 – Overview

Chart 24

■ Four lines per read

FASTQ-SANGER

Chart 26

@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGG
GTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAA
GCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:EA1EA
1EA5'9B:?:#9EA0D@2EA5':>5?:%A;A8A;?9B;D@
/=<?7=9<2A8==

■ '@' title line

□ Record identifier

□ Other commentary

FASTQ-SANGER

Chart 27

■ Sequence line

□ IUPAC single-letter, uppercase codes for DNA or RNA (G, T, C, A)

FASTQ-SANGER

Chart 28

■ '+' line

□ Marks the end of the sequence line

□ Sometimes contains a copy of the '@' line

FASTQ-SANGER

Chart 29

■ Quality line

□ Allows PHRED quality scores from 0 to 93

□ Displayed as ASCII codes 33 to 126, one character per score

□ Same length as the sequence line
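The four-line layout and the ASCII encoding of the quality line can be sketched as a small parser. This is an illustrative sketch that assumes unwrapped records (exactly four lines per read); `FastqRecord`, `parse_fastq`, and the example read are hypothetical names and data, not part of Bowtie 2 or any FASTQ library.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    title: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from unwrapped FASTQ-SANGER text (four lines per read)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        title = next(stripped, None)
        if title is None:
            return
        if title == "":
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        plus = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or plus is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        # One character per score: PHRED score = ASCII code - 33.
        yield FastqRecord(title[1:], sequence, [ord(ch) - offset for ch in quality])


example = ["@read1 hypothetical", "GATTACA", "+", "I5+#!~?"]
for record in parse_fastq(example):
    print(record.title, record.sequence, record.qualities)
```

In the example, the character `!` (ASCII 33) decodes to score 0 and `~` (ASCII 126) to score 93, the two ends of the allowed range.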

FASTQ-SANGER

Chart 30

Burrows-Wheeler Transform (BWT)

Chart 31

■ Input: AATTCGA

■ Append EOF char: AATTCGA$

■ Shift right (form all cyclic rotations)

A A T T C G A $
$ A A T T C G A
A $ A A T T C G
G A $ A A T T C
C G A $ A A T T
T C G A $ A A T
T T C G A $ A A
A T T C G A $ A

■ Sort lexicographically

$ A A T T C G A
A $ A A T T C G
A A T T C G A $
A T T C G A $ A
C G A $ A A T T
G A $ A A T T C
T C G A $ A A T
T T C G A $ A A

■ Return last column: AG$ATCTA
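The steps above can be sketched directly in Python. This naive version materializes all rotations, which is fine for the slide example but far too memory-hungry for a genome; real aligners build the transform via suffix arrays. The function name `bwt` is chosen here for illustration.

```python
def bwt(text: str, eof: str = "$") -> str:
    """Naive Burrows-Wheeler transform: rotate, sort, take the last column.

    Assumes `eof` does not occur in `text` and sorts before every other
    character (true for '$' versus A, C, G, T in ASCII order).
    """
    s = text + eof
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort()
    return "".join(row[-1] for row in rotations)


print(bwt("AATTCGA"))  # AG$ATCTA
```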

BWT - Unpermute

Chart 36

1. We only know the first column (the sorted letters) and the last column (the BWT output)

$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A

2. Last-First property: occurrences of the same letter keep their relative order in the last and in the first column

3. We start with the first letter in the last column and apply that property to unpermute the sentence

4. This letter equals the last letter of the original sentence

5. Apply the LF-mapping property until $ is reached

6. The original sentence is reconstructed: AATTCGA
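The unpermute steps can be written as a short LF-mapping loop. This is a minimal sketch under the assumption that the EOF character is unique and sorts first, so that row 0 of the sorted matrix starts with `$`; the names `inverse_bwt`, `first_index`, and `rank` are illustrative.

```python
from collections import Counter


def inverse_bwt(last: str, eof: str = "$") -> str:
    """Reconstruct the original text from the last column via LF mapping."""
    counts = Counter(last)
    # first_index[c]: row where the block of letter c starts in the first column.
    first_index = {}
    total = 0
    for c in sorted(counts):
        first_index[c] = total
        total += counts[c]
    # rank[i]: how many times last[i] already occurred above row i.
    seen = Counter()
    rank = []
    for c in last:
        rank.append(seen[c])
        seen[c] += 1
    row = 0  # row 0 starts with eof, so its last letter ends the original text
    letters = []
    while last[row] != eof:
        c = last[row]
        letters.append(c)
        # Last-First property: the k-th c in the last column is the k-th c
        # in the first column.
        row = first_index[c] + rank[row]
    return "".join(reversed(letters))


print(inverse_bwt("AG$ATCTA"))  # AATTCGA
```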

BWT – Exact Matching

Chart 47

Let's search for ATT in the original string!

1. Write the output of the BWT to the last column

2. Sort the letters lexicographically and write them to the first column

$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A

3. Mark all occurrences of the last letter of the search string (T) in the first column

4. Check for occurrences of the next letter in the last column

5. Count which occurrence of T was found

6. Mark that occurrence in the first column

7. Repeat

FOUND!
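The search procedure above is usually called backward search. The following sketch implements it naively: the occurrence counts are recomputed with `str.count` instead of the precomputed tables an FM-index would use, and a plain suffix array translates matrix rows back to text positions. All function names are illustrative assumptions.

```python
from collections import Counter


def build_index(text: str, eof: str = "$"):
    """Return (suffix_array, last_column) for text + eof, built naively."""
    s = text + eof
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in suffix_array)
    return suffix_array, last


def backward_search(last: str, pattern: str):
    """Return the half-open row range [lo, hi) of rows starting with pattern."""
    counts = Counter(last)
    first_index, total = {}, 0
    for c in sorted(counts):
        first_index[c] = total
        total += counts[c]
    lo, hi = 0, len(last)
    for c in reversed(pattern):  # process the pattern from its last letter
        if c not in first_index:
            return None
        lo = first_index[c] + last.count(c, 0, lo)
        hi = first_index[c] + last.count(c, 0, hi)
        if lo >= hi:
            return None
    return lo, hi


def find(text: str, pattern: str):
    """Return all 0-based start positions of pattern in text."""
    suffix_array, last = build_index(text)
    rows = backward_search(last, pattern)
    if rows is None:
        return []
    return sorted(suffix_array[r] for r in range(*rows))


print(find("AATTCGA", "ATT"))   # [1], i.e. the 2nd letter of the string
print(find("AATTCGA", "TCAA"))  # [], no exact match
```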

BWT – Unwind

Chart 58

8. Now we need to unwind to find the position in the original string

AATTCGA$

9. The end of the string is found. Position of the searched string: its first letter is the 6th-last letter of the sentence ≜ 2nd letter of the original string

BWT – Inexact Matching

Chart 66

But what about inexact matches?

What if we want to find TCAA in the original string?

Quality scores are assigned in the base calling phase.

Base:            T   C   A   A
Quality score:   65  47  10  50

1. Proceed as in exact matching

$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A

2. No match is found for the next letter, C. What now?

3. Look at the base calls and pick the position in the stack with the lowest quality score

■ The second A processed (the third base of the read) had a score of 10

4. Walk back in the stack and replace it with a different possible base (e.g. G)

5. Continue with exact matching and repeat until a match is found

We found an alignment for TCAA with one mismatch.
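The idea of substituting the least trustworthy base first can be sketched as follows. This is a deliberately simplified illustration, not Bowtie's actual backtracking: it allows a single substitution, tries positions in order of increasing quality score, and uses Python's `str.find` as a stand-in for the FM-index exact search. The function name is hypothetical.

```python
def find_with_one_substitution(text, read, qualities, alphabet="ACGT"):
    """Return (matched_read, 0-based position) or None.

    Tries an exact match first, then single-base substitutions, starting
    at the position with the lowest quality score.
    """
    pos = text.find(read)
    if pos != -1:
        return read, pos
    by_quality = sorted(range(len(read)), key=lambda i: qualities[i])
    for idx in by_quality:
        for base in alphabet:
            if base == read[idx]:
                continue
            candidate = read[:idx] + base + read[idx + 1:]
            pos = text.find(candidate)
            if pos != -1:
                return candidate, pos
    return None


print(find_with_one_substitution("AATTCGA", "TCAA", [65, 47, 10, 50]))
# ('TCGA', 3): the low-quality third base A was replaced by G
```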

Gapped Alignment

Chart 85

What about gapped alignment (e.g. Indels)?

Let's look at the Bowtie 2 pipeline.

Bowtie 2 – 1. Extract Seeds

Chart 86

Extract seeds from each read

Follow a certain policy (e.g. 16 base substring every 10 bases along the read)

Upcoming process works with the seed strings

CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA
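The example policy (a 16-base substring every 10 bases) applied to the read shown above can be sketched like this. The function `extract_seeds` is an illustrative name; note that Bowtie 2 also takes seeds from the reverse complement of the read, which this sketch leaves out.

```python
def extract_seeds(read: str, seed_len: int = 16, interval: int = 10):
    """Return (offset, seed) pairs taken every `interval` bases along the read."""
    last_start = len(read) - seed_len
    return [(off, read[off:off + seed_len])
            for off in range(0, last_start + 1, interval)]


read = "CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA"
for offset, seed in extract_seeds(read):
    print(offset, seed)
```

For this 37-base read the policy yields three seeds, starting at offsets 0, 10, and 20.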

Bowtie 2 – 2. Align with FM Index

Chart 88

■ Find alignments using the BWT as described earlier

□ There may be more than one possible alignment per seed!

■ Returns a range of possible alignments within the FM index

BWT, (In)exact matching

{ [211, 212];[212, 214] }

■ Every alignment gets priority $1/r^2$

□ r = total number of alignments for the seed

□ Seeds with fewer alignments get higher priority

■ Randomly select seeds, weighted by priority

□ Run dynamic programming approach on these

□ Modified Smith-Waterman algorithm
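The weighting can be illustrated with the two ranges from the slide. This is a sketch under assumptions: the ranges are read as half-open row intervals of the FM index (so `[212, 214]` covers rows 212 and 213), every row in a range of size r gets weight 1/r², and the function names are hypothetical.

```python
import random


def seed_hit_weights(ranges):
    """Map each FM-index row to the weight 1 / r**2 of its range (size r)."""
    weights = {}
    for lo, hi in ranges:
        r = hi - lo
        for row in range(lo, hi):
            weights[row] = 1.0 / (r * r)
    return weights


def sample_rows(ranges, k, seed=0):
    """Draw k rows at random, weighted by priority, for the extension step."""
    weights = seed_hit_weights(ranges)
    rows = list(weights)
    rng = random.Random(seed)
    return rng.choices(rows, weights=[weights[row] for row in rows], k=k)


ranges = [(211, 212), (212, 214)]
print(seed_hit_weights(ranges))  # {211: 1.0, 212: 0.25, 213: 0.25}
print(sample_rows(ranges, k=5))
```

The unique hit in row 211 is four times as likely to be picked as either row of the two-element range.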

Bowtie 2 – 3. Prioritize, Resolve

Chart 89

■ Reference Genome:

Smith-Waterman

Chart 90

■ Read: GCTCAG

■ One exact match for this seed found!

Smith-Waterman

Chart 93

■ Uses a dynamic programming approach

□ Solve a larger problem by first solving smaller subproblems

□ Build the solution of the larger problem on the results of the smaller ones

□ Fill out a table to save all results

Smith-Waterman

Chart 94

■ Give a reward if a match is found

■ Give a penalty if no match is found (mismatch or gap)

■ The table cell with the highest score marks the end of the best alignment

■ Let's align GCTCAG to GCTCTCAG!

Smith-Waterman

Chart 95

Smith-Waterman

Chart 96

$$
H(i,j) = \max \begin{cases}
0 \\
H(i-1,j-1) + w(a_i, b_j) \\
H(i-1,j) + w(a_i, -) \\
H(i,j-1) + w(-, b_j)
\end{cases}
\qquad
w(a,b) = \begin{cases}
+2, & a = b \\
-1, & a \neq b \text{ or } a = - \text{ or } b = -
\end{cases}
$$

■ The first row and the first column are initialized with 0

■ Parallelizable! (cells on the same anti-diagonal do not depend on each other)

■ Filled table (columns: reference GCTCTCAG, rows: read GCTCAG)

    -  G  C  T  C  T  C  A  G
-   0  0  0  0  0  0  0  0  0
G   0  2  1  0  0  0  0  0  2
C   0  1  4  3  2  1  2  1  1
T   0  0  3  6  5  4  3  2  1
C   0  0  2  5  8  7  6  5  4
A   0  0  1  4  7  7  6  8  7
G   0  2  1  3  6  6  6  7 10

■ Traceback from the best value

□ We need to remember how we calculated the values for that!

■ Alignment:

G C T C T C A G
G C - - T C A G

■ Smith-Waterman ends when ...

□ All possible alignments are examined

□ Enough alignments are examined

□ Dynamic programming limit is reached

■ Pick the alignment with the highest score in the Smith-Waterman algorithm
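The table fill and traceback can be reproduced with a plain-Python sketch of the textbook algorithm, using the scores from the slides (+2 for a match, -1 for a mismatch or gap). This is not Bowtie 2's SIMD implementation. For this example several tracebacks reach the maximum score of 10; the sketch prefers diagonal moves and therefore reports the ungapped local alignment CTCAG/CTCAG, whereas the slide shows the equally scoring gapped alignment of the full read.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_ref, aligned_read, H) for a local alignment."""
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # read base aligned against a gap
            left = H[i][j - 1] + gap    # reference base aligned against a gap
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    aligned_read, aligned_ref = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if read[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            aligned_read.append(read[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i][j - 1] + gap:
            aligned_read.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1
        else:
            aligned_read.append(read[i - 1])
            aligned_ref.append("-")
            i -= 1
    return best, "".join(reversed(aligned_ref)), "".join(reversed(aligned_read)), H


score, aligned_ref, aligned_read, table = smith_waterman("GCTCAG", "GCTCTCAG")
print(score)
print(aligned_ref)
print(aligned_read)
for row in table:
    print(row)
```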

Results

Chart 117

■ Different gap penalties for Starting and Extending a gap

■ Restrictions on where gaps are allowed

■ Scoring function also takes quality score into account

■ Reseeding, if no proper matches were found

Adjustments to Smith-Waterman in Bowtie2

Chart 118

@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAAGGATACTA *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 TAGCTTCAGC *

Sequence Alignment/Map format (SAM)

Chart 119

■ Leftmost position of the alignment (POS, column 4)

■ Quality of the alignment, Phred-scaled (MAPQ, column 5)

■ Matching as CIGAR string (CIGAR, column 6)

■ Query sequence (SEQ, column 10)
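A CIGAR string is a run-length encoding of alignment operations. The following sketch parses the CIGAR strings from the example records and derives how many read bases and reference positions each one covers. The helper names are illustrative; the operation letters follow the SAM format (M, I, D, N, S, H, P, =, X).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str):
    """Split a CIGAR string such as '8M2I4M1D3M' into (length, op) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def query_length(ops) -> int:
    """Bases of the read covered by the CIGAR (M, I, S, =, X consume the read)."""
    return sum(length for length, op in ops if op in "MIS=X")


def reference_span(ops) -> int:
    """Reference positions covered (M, D, N, =, X consume the reference)."""
    return sum(length for length, op in ops if op in "MDN=X")


for cigar in ("8M2I4M1D3M", "3S6M1P1I4M", "6M14N5M"):
    ops = parse_cigar(cigar)
    print(cigar, ops, query_length(ops), reference_span(ops))
```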

3.3 Variant Calling

Variant Calling in Pipeline

Chart 121

Genetic Variation vs. Mutation

Chart 122

“Genetic variation is what makes us

all unique, whether in terms of hair

colour, skin colour or even the shape of

our faces.”

“A mutation is a change that occurs in our DNA sequence, either due to mistakes when

the DNA is copied or as the result of environmental factors such as UV light and

cigarette smoke.”

Genetic Variation vs. Mutation

Chart 123

Mutations contribute to genetic variation within species

What Does Variant Calling Mean?

Chart 124

■ Single-Nucleotide-Polymorphism (SNP)

Chart 125

■ Insertions and Deletions (Indels)

■ Larger structural variants

□ Copy Number Variation (CNV)

□ Loss of one copy of a gene or of both

□ Movement of DNA sections from one location to another

■ Early method: counting the abundance of high-quality nucleotides at a site

■ Recent approaches:

□ Integrate several sources of information

□ Use prior probabilities for a SNP at a given position (e.g. dbSNP)

■ Further method: heuristic approach

□ Specific features of different sequencing platforms and different read alignment methods (VarScan)

SNP Calling Methods

Chart 128

■ Presence of Indels

■ Errors from library preparation

■ Variable quality scores with higher error rates

Identify "true" variants rather than alignment and/or sequencing errors

Minimize the number of false positives

Challenges with Variant Calling

Chart 129

■ Is a software package for the analysis of high-throughput sequencing (HTS) data

■ Offers a variety of tools: Focus on variant discovery and genotyping

■ Reads-to-variants workflow:

Genome Analysis Toolkit

Chart 130

GATK's HaplotypeCaller

Chart 133

SAM / BAM File → HaplotypeCaller → VCF File

■ Is capable of calling SNPs and Indels simultaneously

Chart 137

■ Go through the reference genome with a sliding window

■ Count Indels and mismatches

■ Memorize regions to operate on

HaplotypeCaller STEP 1: Identify Active Regions

Chart 138

■ Local re-assembly

■ Build a De Bruijn-like graph

■ Prune according to a threshold

■ Traverse the graph to collect the most likely haplotypes

■ Align the haplotypes to the reference genome using the Smith-Waterman algorithm (SWA)

HaplotypeCaller STEP 2: Assemble Plausible Haplotypes

Chart 139

■ Determine the likelihoods of the haplotypes given the read data

■ For each active region, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm

■ Produces a matrix of likelihoods of the haplotypes given the read data

HaplotypeCaller STEP 3: Determine per-read Likelihoods

Chart 140

Markov Chain

$P(x_i \mid x_{i-1}, \dots, x_1) = P(x_i \mid x_{i-1}) = a_{x_{i-1} x_i}$

Hidden Markov Model (HMM)

$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
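The Markov property means that the probability of a whole sequence factorizes into an initial probability and a product of transition probabilities. A minimal sketch follows; the two-letter alphabet and all probabilities are made-up illustration values, not parameters of the PairHMM.

```python
def sequence_probability(seq, initial, transition):
    """P(x_1, ..., x_n) = P(x_1) * product of a[x_{i-1}][x_i] for i = 2..n."""
    probability = initial[seq[0]]
    for previous, current in zip(seq, seq[1:]):
        probability *= transition[previous][current]
    return probability


# Hypothetical first-order chain over the letters A and T.
initial = {"A": 0.5, "T": 0.5}
transition = {
    "A": {"A": 0.9, "T": 0.1},  # each row sums to 1
    "T": {"A": 0.2, "T": 0.8},
}
print(sequence_probability("AAT", initial, transition))  # 0.5 * 0.9 * 0.1
```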

HaplotypeCaller Recap: Statistical Models

Chart 141

Chart 142

Chart 143

Match (M) = emitting an aligned pair with probability p

Chart 144

$I_x$, $I_y$ = emitting symbol $x_i$ or $y_i$ against a gap with probability p

Chart 145

All transition probabilities leaving each state must sum to 1

Chart 146

Empirical gap penalties = derived from data by BQSR

Base mismatch penalties = base quality scores

Chart 147

■ Matrix with the likelihoods of the haplotypes given the reads

■ Assign genotypes to individual samples based on the allele likelihoods

■ Apply Bayes' theorem to calculate the likelihood of each possible genotype and select the most likely one

HaplotypeCaller STEP 4: Genotype Sample

Chart 148

Chart 149

Bayes Rule

HaplotypeCaller Recap: Probabilistic Model

Chart 150

$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}$$
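Applied to genotyping, the hypotheses are the candidate genotypes and the data are the reads. The sketch below normalizes likelihood times prior over three genotypes; all numbers are invented for illustration and the function name is hypothetical, so this shows the arithmetic of Bayes' rule rather than GATK's actual model.

```python
def genotype_posteriors(likelihoods, priors):
    """Posterior P(genotype | reads) for each genotype via Bayes' rule."""
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    evidence = sum(unnormalized.values())  # P(data), summed over all genotypes
    return {g: value / evidence for g, value in unnormalized.items()}


# Hypothetical values: P(reads | genotype) and prior P(genotype).
likelihoods = {"AA": 0.001, "AG": 0.2, "GG": 0.05}
priors = {"AA": 0.9, "AG": 0.09, "GG": 0.01}

posteriors = genotype_posteriors(likelihoods, priors)
best = max(posteriors, key=posteriors.get)
print(posteriors)
print("most likely genotype:", best)
```

Even with a strong prior for the reference genotype AA, the read likelihoods shift most of the posterior mass to the heterozygous genotype AG in this made-up example.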

Chart 151

3.4 Data Annotation

Data Annotation in the pipeline

Chart 153

After variant calling we know where the variants are

Now find out what these variants are

Several thousand variants cannot be analyzed manually

Tools for automated variant annotation (e.g. ANNOVAR)

Already known SNPs can be filtered by using information from dbSNP or the 1,000 Genomes Project

dbSNP = Single Nucleotide Polymorphism Database

Free public archive for genetic variation

Get information from SNPs

Chart 154

4. Analysis Results / Use Cases

Analysis Results in the pipeline

Chart 156

Different genome browsers (Ensembl Genome Browser, UCSC Genome Browser, VEGA Genome Browser)

Ensembl Variant Effect Predictor (Online Tool)

Analysis Results

Chart 157

Personalized Medicine

Prescribe medicine that works for you

Detect genetic diseases early on

Learn how humans work

Why do we age? Can we switch it off?

Why are some smarter than others? Might it be genes?

Use Cases

Chart 158

[1] http://www.tedxvienna.at/blog/personalize-this/

5. Outlook

Think about these questions ...

If genome sequencing were as easy and cheap as a blood test, would you:

digitize your genome?

want to know about biomarkers?

want to know about markers that indicate a disease to come?

digitize the genome of your unborn baby?

want to know the appearance of your unborn baby?

want to know that you will develop Alzheimer's disease before the age of 40?

Chart 160

6. Discussion

Do we really want every technological advance that will be possible in the future?

■ B. Langmead, C. Trapnell, M. Pop, S. Salzberg: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology

■ B. Langmead, S. Salzberg: Fast gapped-read alignment with Bowtie 2. Nature Methods, Vol. 9, No. 4, 2012

■ P. J. A. Cock et al.: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, Vol. 38, No. 6, 2010

■ M. P. Dolled-Filhart et al.: Computational and bioinformatics frameworks for next-generation whole exome and genome sequencing. The Scientific World Journal, 2013

■ R. Nielsen et al.: Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, Vol. 12, No. 6, 2011, pp. 443-451

Literature

Chart 162
