CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment...
Transcript of CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment...
![Page 1: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/1.jpg)
CS 466Introduction to Bioinformatics
Lecture 4
Mohammed El-KebirSeptember 6, 2019
![Page 2: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/2.jpg)
Outline1. Fitting alignment2. Local alignment3. Gapped alignment4. BLOSUM scoring matrix
Reading:• Jones and Pevzner. Chapters 6.6-6.9• Lecture notes
2
![Page 3: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/3.jpg)
Allow for inexact matches due to:• Sequencing errors• Polymorphisms/mutations in
reference genome
3
NGS Characterized by Short Reads
GenomeMillions -billions nucleotides
Next-generationDNA sequencing
10-100’s million short readsShort read: 100 nucleotides
… GGTAGTTAG …
… TATAATTAG …
… AGCCATTAG …
… CGTACCTAG …
… CATTCAGTAG …
… GGTAAACTAG …
Question: How to account for discrepancy between lengths of reference and short read?
Human reference genome is 3,300,000,000 nucleotides, while a short read is 100 nucleotides. Global sequence alignment will not work!
![Page 4: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/4.jpg)
Fitting Alignment
4
For short read alignment, we want to align complete short read 𝐯 ∈Σ$ to substring of reference genome𝐰 ∈ Σ&. Note that 𝑚 ≪ 𝑛.
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find a alignment of 𝐯 and a substring of 𝐰 with
maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
𝐯 ∈ Σ$𝐰 ∈ Σ&
![Page 5: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/5.jpg)
Fitting Alignment – Naive Approach
• Consider all contiguous non-empty substrings of 𝐰, defined by 1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑛• How many?
5
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
𝐯 ∈ Σ$𝐰 ∈ Σ&
![Page 6: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/6.jpg)
Fitting Alignment – Naive Approach
• Consider all contiguous non-empty substrings of 𝐰, defined by 1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑛• How many? Answer: 𝑛 + &
2• What are their total lengths?• What is the running time?
6
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
𝐯 ∈ Σ$𝐰 ∈ Σ&
![Page 7: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/7.jpg)
Fitting Alignment – Dynamic Programming
7
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
s[i, j] = max
8>>><
>>>:
0, if i = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if i > 0 and j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m,n]}A
G
G
0T A C G G C0𝐯\𝐰
![Page 8: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/8.jpg)
Fitting Alignment – Dynamic Programming
8
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
s[i, j] = max
8>>><
>>>:
0, if i = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if i > 0 and j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m,n]}A
G
G
0T A C G G C0𝐯\𝐰 Start anywhere on first row
End anywhere on last row
![Page 9: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/9.jpg)
Fitting Alignment – Dynamic Programming
9
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
s[i, j] = max
8>>><
>>>:
0, if i = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if i > 0 and j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m,n]}
Question: Let match score be 1, mismatch/indel score be -1. What is 𝑠∗?
Question: Same scores. What is optimal global alignment and score?
A
G
G
0T A C G G C0
- A - G G -
T A C G G C
𝐯𝐰
𝐯\𝐰 Start anywhere on first row
End anywhere on last row
![Page 10: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/10.jpg)
Fitting Alignment – Dynamic Programming• Online:
https://valiec.github.io/AlignmentVisualizer/index.html
10
Question: Let match score be 1, mismatch/indel score be -1. What is 𝑠∗?
Question: Same scores. What is optimal global alignment and score?
A
G
G
0T A C G G C0
- A - G G -
T A C G G C
𝐯𝐰
𝐯\𝐰s[i, j] = max
8>>><
>>>:
0, if i = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if i > 0 and j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
s⇤ = max{s[m, 0], . . . , s[m,n]}
![Page 11: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/11.jpg)
Outline1. Fitting alignment2. Local alignment3. Gapped alignment4. BLOSUM scoring matrix
Reading:• Jones and Pevzner. Chapters 6.6-6.9• Lecture notes
11
![Page 12: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/12.jpg)
Local Alignment – Biological Motivation
12
ABL1
SHKA
From Pfam database (http://pfam.sanger.ac.uk/)
Proteins are composed of functional units called domains. Such domains may occur in different proteins even across species.
Local Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find a substring of 𝐯 and a substring of 𝐰 whose alignment has
maximum global alignment score 𝑠∗ among all global alignments of all substrings of 𝐯 and 𝐰
![Page 13: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/13.jpg)
Global, Fitting and Local Alignment
13
Local Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find a substring of 𝐯 and a substring of 𝐰 whose alignment has
maximum global alignment score 𝑠∗ among all global alignments of all substrings of 𝐯 and 𝐰
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
Global Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find alignment of 𝐯 and 𝐰 with maximum score.
![Page 14: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/14.jpg)
Local Alignment – Naive Algorithm
Brute force:1. Generate all pairs (𝐯4, 𝐰4) of substrings of 𝐯 and 𝐰2. For each pair (𝐯4, 𝐰4), solve global alignment problem.
14
Question: There are $2
&2 pairs of substrings.
But they have different lengths. What is the running time?
Local Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find a substring of 𝐯 and a substring of 𝐰 whose alignment has
maximum global alignment score 𝑠∗ among all global alignments of all substrings of 𝐯 and 𝐰
![Page 15: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/15.jpg)
1/31/12 2:15 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=817a35…d341760cf452189f652a1139c9eb9617fb03c6a1ce6b29621e9a1b7161cd19e63
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 182.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=201Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Key Idea
15
Local alignment:• Start and end anywhere
Global alignment:• Start at (0,0) and end at (𝑚, 𝑛)
![Page 16: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/16.jpg)
1/31/12 2:15 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=817a35…d341760cf452189f652a1139c9eb9617fb03c6a1ce6b29621e9a1b7161cd19e63
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 182.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=201Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Local Alignment Recurrence
16
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
s⇤ = maxi,j
s[i, j]
Local Alignment problem: Given strings 𝐯 ∈ Σ$and 𝐰 ∈ Σ& and scoring function 𝛿, find a substring
of 𝐯 and a substring of 𝐰 whose alignment has maximum global alignment score 𝑠∗ among allglobal alignments of all substrings of 𝐯 and 𝐰
![Page 17: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/17.jpg)
1/31/12 2:15 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=817a35…d341760cf452189f652a1139c9eb9617fb03c6a1ce6b29621e9a1b7161cd19e63
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 182.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=201Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
Local Alignment Recurrence
17
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
s⇤ = maxi,j
s[i, j]
Local Alignment problem: Given strings 𝐯 ∈ Σ$and 𝐰 ∈ Σ& and scoring function 𝛿, find a substring
of 𝐯 and a substring of 𝐰 whose alignment has maximum global alignment score 𝑠∗ among allglobal alignments of all substrings of 𝐯 and 𝐰
Start anywhere
End anywhere
Running time: 𝑂(𝑚𝑛)
![Page 18: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/18.jpg)
Local Alignment – Dynamic Programming• Online:
https://valiec.github.io/AlignmentVisualizer/index.html
18
Question: Let match score be 2, mismatch score be -2 and indel be -4. What is 𝑠∗?
A
G
G
0T A C G G C0
G G
G G
𝐯𝐰
𝐯\𝐰
s[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s[i� 1, j] + �(vi,�), if i > 0,
s[i, j � 1] + �(�, wj), if j > 0,
s[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0.
s⇤ = maxi,j
s[i, j]
![Page 19: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/19.jpg)
Global, Fitting and Local Alignment
19
Local Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find a substring of 𝐯 and a substring of 𝐰 whose alignment has
maximum global alignment score 𝑠∗ among all global alignments of all substrings of 𝐯 and 𝐰
Fitting Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find an alignment of 𝐯 and a substring of 𝐰 with maximum global alignment score 𝑠∗ among all global alignments of 𝐯 and all substrings of 𝐰
Global Alignment problem: Given strings 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& and scoring function 𝛿, find alignment of 𝐯 and 𝐰 with maximum score.
![Page 20: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/20.jpg)
Outline1. Fitting alignment2. Local alignment3. Gapped alignment4. BLOSUM scoring matrix
Reading:• Jones and Pevzner. Chapters 6.6-6.9• Lecture notes
20
![Page 21: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/21.jpg)
Scoring Gaps
21
Let 𝐯 = AAC and 𝐰 = ACAGGC
Match 𝛿 𝑐, 𝑐 = 1; Mismatch 𝛿 𝑐, 𝑑 = −1 (where 𝑐 ≠ 𝑑); Indel 𝛿 𝑐, − = 𝛿 −, 𝑐 = −2
Both alignments have 3 matches and 2 indels. Score: 3 ∗ 1 + 2 ∗ −2 = −1
A - - A C
A C A A C
𝐯𝐰
A - A - C
A C A A C
𝐯𝐰
![Page 22: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/22.jpg)
Scoring Gaps
22
Let 𝐯 = AAC and 𝐰 = ACAGGC
Match 𝛿 𝑐, 𝑐 = 1; Mismatch 𝛿 𝑐, 𝑑 = −1 (where 𝑐 ≠ 𝑑); Indel 𝛿 𝑐, − = 𝛿 −, 𝑐 = −2
Question: Which alignment is better?
A - - A C
A C A A C
𝐯𝐰
A - A - C
A C A A C
𝐯𝐰
Both alignments have 3 matches and 2 indels. Score: 3 ∗ 1 + 2 ∗ −2 = −1
![Page 23: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/23.jpg)
Scoring Gaps – Affine Gap Penalties
23
Desired: Lower penalty for consecutive gaps than interspersed gaps.
Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc.
A - - A C
A C A A C
𝐯𝐰
A - A - C
A C A A C
𝐯𝐰
![Page 24: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/24.jpg)
Scoring Gaps – Affine Gap Penalties
24
Desired: Lower penalty for consecutive gaps than interspersed gaps.
Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc.
Affine gap penalty: Two penalties: (i) gap open penalty 𝜌 ≥ 0 and (ii) gap extension penalty 𝜎 ≥ 0. Stretch of 𝑘 consecutive gaps has score −(𝜌 + 𝜎𝑘).
A - - A C
A C A A C
𝐯𝐰
A - A - C
A C A A C
𝐯𝐰
![Page 25: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/25.jpg)
Scoring Gaps – Affine Gap Penalties
25
Desired: Lower penalty for consecutive gaps than interspersed gaps.
Why: Consecutive gaps are more likely due to slippage errors in DNA replication (2-3 nucleotides), codons for protein sequences, etc.
Affine gap penalty: Two penalties: (i) gap open penalty 𝜌 ≥ 0 and (ii) gap extension penalty 𝜎 ≥ 0. Stretch of 𝑘 consecutive gaps has score −(𝜌 + 𝜎𝑘).
Let 𝜌 = 10 and 𝜎 = 1. Left: 3 ∗ 1 − 10 + 1 ∗ 2 = −9. Right: 3 ∗ 1 − (10 + 1 ∗ 1) − 10 + 1 ∗ 1 = −19.
A - - A C
A C A A C
𝐯𝐰
A - A - C
A C A A C
𝐯𝐰
![Page 26: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/26.jpg)
Affine Gap Penalty Alignment – Naive Approach
26
Idea: Insert horizontal (deletion) and vertical (insertion) edges
spanning 𝑘 > 1 gaps with score − (𝜌 + 𝜎𝑘).
new edges
old edges
Affine gap penalty: Two penalties: (i) gap open penalty 𝜌 ≥ 0 and (ii) gap extension penalty 𝜎 ≥ 0. Stretch of 𝑘 consecutive gaps has score −(𝜌 + 𝜎𝑘).
... ... ... ... ... ...
...
...
...
![Page 27: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/27.jpg)
Affine Gap Penalty Alignment – Naive Approach
27
Idea: Insert horizontal (deletion) and vertical (insertion) edges
spanning 𝑘 > 1 gaps with score − (𝜌 + 𝜎𝑘).
new edges
old edges
Question: What’s the running time?Question: What’s the recurrence?
Affine gap penalty: Two penalties: (i) gap open penalty 𝜌 ≥ 0 and (ii) gap extension penalty 𝜎 ≥ 0. Stretch of 𝑘 consecutive gaps has score −(𝜌 + 𝜎𝑘).
... ... ... ... ... ...
...
...
...
![Page 28: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/28.jpg)
2/2/12 1:48 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=3154c40…5a012a18acf610ed0ed651c4da07b31eb9f79f017c327bb61f4a3391e10b7ca1c
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 186.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=205Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
28
Affine Gap Penalty AlignmentIdea: Three separate recurrences:
(i) Gap in first sequence 𝑠→ 𝑖, 𝑗(ii) Match/mismatch 𝑠↘[𝑖, 𝑗]
(iii) Gap in second sequence 𝑠↓[𝑖, 𝑗]
![Page 29: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/29.jpg)
2/2/12 1:48 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=3154c40…5a012a18acf610ed0ed651c4da07b31eb9f79f017c327bb61f4a3391e10b7ca1c
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 186.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=205Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
29
Affine Gap Penalty AlignmentIdea: Three separate recurrences:
(i) Gap in first sequence 𝑠→ 𝑖, 𝑗(ii) Match/mismatch 𝑠↘[𝑖, 𝑗]
(iii) Gap in second sequence 𝑠↓[𝑖, 𝑗]
s![i, j] = max
(s![i, j � 1]� �, if j > 1,
s&[i, j � 1]� (� + ⇢), if j > 0,
s&[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s![i, j], if j > 0,
s#[i, j], if i > 0,
s&[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0,
s#[i, j] = max
(s#[i� 1, j]� �, if i > 1,
s&[i� 1, j]� (� + ⇢), if i > 0.
![Page 30: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/30.jpg)
2/2/12 1:48 PMIntroduction to Bioinformatics Algorithms
Page 1 of 1http://site.ebrary.com/lib/brown/docPrint.action?encrypted=3154c40…5a012a18acf610ed0ed651c4da07b31eb9f79f017c327bb61f4a3391e10b7ca1c
Jones, Neil C.; Pevzner, Pavel. Introduction to Bioinformatics Algorithms.Cambridge, MA, USA: MIT Press, 2004. p 186.http://site.ebrary.com/lib/brown/Doc?id=10225303&ppg=205Copyright © 2004. MIT Press. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
30
Affine Gap Penalty AlignmentIdea: Three separate recurrences:
(i) Gap in first sequence 𝑠→ 𝑖, 𝑗(ii) Match/mismatch 𝑠↘[𝑖, 𝑗]
(iii) Gap in second sequence 𝑠↓[𝑖, 𝑗]
Running time: 𝑂(𝑚𝑛)
s![i, j] = max
(s![i, j � 1]� �, if j > 1,
s&[i, j � 1]� (� + ⇢), if j > 0,
s&[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s![i, j], if j > 0,
s#[i, j], if i > 0,
s&[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0,
s#[i, j] = max
(s#[i� 1, j]� �, if i > 1,
s&[i� 1, j]� (� + ⇢), if i > 0.
![Page 31: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/31.jpg)
Affine Gap Penalty Alignment – Example
31
𝐯 = AAC 𝐰 = ACAAC
Let 𝜌 = 10 and 𝜎 = 1. Match = 1. Mismatch = -1
s![i, j] = max
(s![i, j � 1]� �, if j > 1,
s&[i, j � 1]� (� + ⇢), if j > 0,
s&[i, j] = max
8>>><
>>>:
0, if i = 0 and j = 0,
s![i, j], if j > 0,
s#[i, j], if i > 0,
s&[i� 1, j � 1] + �(vi, wj), if i > 0 and j > 0,
s#[i, j] = max
(s#[i� 1, j]� �, if i > 1,
s&[i� 1, j]� (� + ⇢), if i > 0.
![Page 32: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/32.jpg)
Gapped Alignment – Additional Insights
•Naive approach supports arbitrary gap penalties given two sequences 𝐯 ∈ Σ$ and 𝐰 ∈ Σ&. This results in an 𝑂(𝑚𝑛 𝑚 + 𝑛 ) algorithm.
•Alignment with convex gap penalties given two sequences 𝐯 ∈ Σ$ and 𝐰 ∈ Σ& can be computed in 𝑂(𝑚𝑛 log𝑚) time.See: Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA.
32
![Page 33: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/33.jpg)
Outline1. Fitting alignment2. Local alignment3. Gapped alignment4. BLOSUM scoring matrix
Reading:• Jones and Pevzner. Chapters 6.6-6.9• Lecture notes
33
![Page 34: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/34.jpg)
Substitution Matrices• Given a pair (𝐯,𝐰) of aligned sequences, we want to assign a score that
measure the relative likelihood that the sequences are related as opposed to being unrelated• We need two models:
• Random model 𝑅: each letter 𝑎 ∈ Σ occurs independently with probability 𝑞U• Match model 𝑀: aligned pair 𝑎, 𝑏 ∈ Σ × Σ occur with joint probability 𝑝U,Z
34
Pr 𝐯,𝐰 𝑅 =]^
𝑞_` ⋅]^
𝑞b` Pr 𝐯,𝐰 𝑀 =]^
𝑝_`,b`
log cd 𝐯,𝐰 𝑀cd 𝐯,𝐰 𝑅 =∑^ 𝑠 𝑣^, 𝑤^ where 𝑠 𝑎, 𝑏 = log hi,j
kikj
![Page 35: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/35.jpg)
BLOSUM (Blocks Substitution Matrices)• Henikoff and Henikoff, 1992• Computed using ungapped alignments of protein
segments (blocks) from BLOCKS database• Thousands of such blocks go into computing a
single BLOSUM matrix• Example of a one such block (right):
• 31 positions (columns)• 61 sequences (rows)
• Given threshold 𝐿, block is pruned down to largest set 𝐶 of sequences that have at least 𝐿% sequence identity to another sequence in 𝐶• How to compute 𝐶?
35
![Page 36: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/36.jpg)
BLOSUM (Blocks Substitution Matrices)
36
• Null model frequencies 𝑞U𝑞Z of letters 𝑎 and 𝑏:• Count the number of occurrences of 𝑎 (𝑏) in all
blocks• Divide by sum of lengths of each block
(sequences * positions)
• Match model frequency 𝑝U,Z:• Count the number of pairs (𝑎, 𝑏) in all columns
of all blocks• Divide by the total number of pairs of columns:
• ∑n 𝑛(𝐶)$(n)2
• 𝑚(𝐶) is the number of sequences in block 𝐶• 𝑛(𝐶) is the number of positions in block 𝐶
log cd 𝐯,𝐰 𝑀cd 𝐯,𝐰 𝑅 =∑^ 𝑠 𝑣^, 𝑤^ where 𝑠 𝑎, 𝑏 = o
plog hi,j
kikj
![Page 37: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/37.jpg)
BLOSUM (Blocks Substitution Matrices)
37
log cd 𝐯,𝐰 𝑀cd 𝐯,𝐰 𝑅 =∑^ 𝑠 𝑣^, 𝑤^ where 𝑠 𝑎, 𝑏 = o
plog hi,j
kikj
• Null model frequencies 𝑞U𝑞Z of letters 𝑎 and 𝑏:• Count the number of occurrences of 𝑎 (𝑏) in all
blocks• Divide by sum of lengths of each block
(sequences * positions)
• Match model frequency 𝑝U,Z:• Count the number of pairs (𝑎, 𝑏) in all columns
of all blocks• Divide by the total number of pairs of columns:
• ∑n 𝑛(𝐶)$(n)2
• 𝑚(𝐶) is the number of sequences in block 𝐶• 𝑛(𝐶) is the number of positions in block 𝐶
Example: (𝜆 = 0.5)
A A TS A LT A LT A VA A L
𝑞s =715
𝑞u =315
𝑝s,u =430
𝑠 𝐴, 𝑇 = 2 ⋅ log430715 ⋅
315
≈ 0.3
![Page 38: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/38.jpg)
BLOSUM62
38
log cd 𝐯,𝐰 𝑀cd 𝐯,𝐰 𝑅 =∑^ 𝑠 𝑣^, 𝑤^ where 𝑠 𝑎, 𝑏 = o
plog hi,j
kikj
https://doi.org/10.1038/nbt0804-1035
![Page 39: CS 466 Introduction to Bioinformatics - El-Kebir · Outline 1.Fitting alignment 2.Local alignment 3.Gapped alignment 4.BLOSUM scoring matrix Reading: •Jones and Pevzner. Chapters](https://reader030.fdocuments.in/reader030/viewer/2022040515/5e7034111279f44e2370dc0e/html5/thumbnails/39.jpg)
Take Home Messages1. Edit distance2. Global alignment 3. Fitting alignment4. Local alignment5. Gapped alignment6. BLOSUM substitution matrix
Reading:• Jones and Pevzner. Chapters 6.6-6.9• Lecture notes
39
Global alignment is longest path in DAG
Small tweaks enable different extensions
Edit distance is shortest path in DAG