Pairwise Sequence alignment Basic Algorithmsbccg131/wiki.files/2 pariwise...3 Literature list •...

Pairwise Sequence alignment Basic Algorithms

Agenda - Previous Lesson: Minhala

- + Biological Story on Biomolecular Sequences

- + General Overview of Problems in Computational

Biology

Today:

- Reminder: Dynamic Programming

-Algorithms for Global and Local Sequence Alignment + variants

-Bioinformatic Motivation for Sequence Alignment

Literature list

• Alberts, B et al. Essential Cell Biology: An introducton to the Molecular Biology of the Cell.

• Mount, D.W. Bioinformatics: Sequence and Genome Analysis.

• Jones N.C & Pevzner P.A. An introduction to Bioinformatics algorithms.

• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.

• Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.

• Move to Slides On Dynamic Programming…

Sequence Comparison (cont)

• We seek the following similarities between sequences :

• Find similar proteins – Allows to predict function & structure

• Locate similar subsequences in DNA – Allows to identify (e.g) regulatory elements

• Locate DNA sequences that might overlap – Helps in sequence assembly

Sequence Modifications

• Three types of changes

– Substitution (point mutation)

– Insertion

– Deletion

TCAGT TCGAGT

Indel (replication slippage)

Choosing Alignments

There are many possible alignments For example, compare:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

to ------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Which one is better?

Another example

Given two sequences:

X: TGCATAT

Y: ATCCGAT

Question:

How can X be transformed into Y?

How did Y evolve from X?

TGCATAT

TGCATA

ATGCAT

ATCCAT

ATCCGAT

delete T

delete A

insert A

insert G

One possible transformation

Alignment:

-TGC-ATAT

ATCCGAT--

5 operations

-TGCATAT

ATCCG-AT

TGCATAT

ATGCATAT

ATGCAAT

ATGCGAT

ATCCGAT

insert A

delete T

Another possible transformation

Alignment:

4 operations

Which one is better?

In order to align two sequences we need a quantitive model to evaluate similarity between sequences.

How do we quantitate sequence similarity?

Scoring Similarity

• Assume independent mutation model

– Each site considered separately

• Score at each site

– Positive if the same

– Negative if different

• Sum to make final score

– Can be positive or negative

– Significance depends on sequence length

GTAGTC

CTAGCG

Pairwise Alignment - Identity

(HH) VLSPADKTNVKAAWGKVGAHAGYEG

||| | | || | |

(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

• Percent Identity: 36.000 (| only)

Human Hemoglobin (HH) vs Sperm Whale Myoglobin (SWM):

Pairwise Alignment - Similarity

(HH) VLSPADKTNVKAAWGKVGAHAGYEG

||| . | | || | |

(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

• Percent Similarity: 40.000 (| and .)

• Percent Identity: 36.000 (| only)

D and E are similar:

1. structure is similar.

2. both are acidic and hydrophilic

3. one mutation can separate them

from one to the other.

Pairwise Alignment – Gap insertion

(HH) VLSPADKTNVKAAWGKVGAH-AGYEG

(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G

• Gaps: 2

• Percent Similarity: 54.167

• Percent Identity: 45.833 (12/26)

Pairwise Alignment - Scoring

• The final score of the alignment is the sum of the positive scores and penalty scores:

+ Number of Identities

+ Number of Similarities

- Number of gap insertions

Alignment score

Pairwise Alignment - Scoring

(HH) VLSPADKTNVKAAWGKVGAH-AGYEG

||| . | | || || |

(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G

Final score:

(V,V) + (L,L) + (S,S) + (D,E) + … - (penalty for gap insertion)*(number of gaps) - (penalty for gap extension)*(extension length)

We are interested in both the score and the alignment trace.

Optimum Alignment

The score of an alignment is a measure of its

quality

Optimum alignment problem: Given a pair of

sequences X and Y, find an alignment (global or

local) with maximum score

The similarity between X and Y, denoted

sim(X,Y), is the maximum score of an alignment of X and Y

Computing Optimal Score • How can we compute the optimal score ?

– If |s| = n and |t| = m, the number A(m,n) of possible “legal” alignments is large!

• we perform dynamic programming to compute the optimal score efficiently.

222( , ) ( , )

nA m n A n n

Stirling’s formula: Exercise 1 2! 2 x xx x e m of the

order of n

Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Sink

Source

Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Sink

Source

Manhattan Tourist Problem: Formulation

Goal: Find the longest (highest scoring) path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”

Output: A longest path in G from “source” to “sink”

MTP: An Example

0 1 2 3

j coordinate i c

source

3 2 4 0

1 0 2 4 3

MTP: Greedy Algorithm Is Not Optimal 1 2 5

2 promising start, but leads to bad choices!

source

sink 18

source

S1,0 = 5

S0,1 = 1

• Calculate optimal path score for each vertex in the graph

• Each vertex’s score is the maximum of the prior vertices score plus the weight of the respective edge in between

MTP: Dynamic Programming j

MTP: Dynamic Programming

(cont’d)

source

S2,0 = 8

S1,1 = 4

S0,2 = 3 3

(cont’d)

0 1 2 3

source

S3,0 = 8

S2,1 = 9

S1,2 = 13

S3,0 = 8

MTP: Dynamic Programming (cont’d)

greedy alg. fails!

-5 1 -5

0 1 2 3

source

S3,1 = 9

S2,2 = 12

S1,3 = 8

(cont’d)

-5 1 -5

-5 3 3

0 1 2 3

source

S3,2 = 9

S2,3 = 15

(cont’d)

-5 1 -5

-5 3 3

0 1 2 3

source

16 S3,3 = 16

(showing all back-traces)

MTP: Recurrence

Computing the score for a point (i,j) by the recurrence relation:

si, j = max

si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)

The running time is n x m for a n by m grid

(n = # of rows, m = # of columns)

What about diagonals?

• The score at point B is given by:

sB = max of

sA1 + weight of the edge (A1, B)

Adding Diagonal Edges to the Grid

More generally, computing the score for point x is given by the recurrence relation:

sx = max

sy + weight of vertex (y, x) where

y є Predecessors(x)

• Predecessors (x) – set of vertices that have edges leading to x

Adding Diagonal Edges to the Grid

Traveling in the Grid •The only hitch is that one must decide on the order in which visit the vertices

•By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed – otherwise we are in trouble.

•We need to traverse the vertices in some order

•Try to find such order for a directed acyclic grid graph

Traversing the Manhattan Grid

• 3 different strategies:

• a) Column by column

• b) Row by row

• c) Along diagonals

Comparison methods

• Global alignment – Finds the best alignment across the whole two sequences.

• Local alignment – Finds regions of similarity in parts of the sequences. Global Local

_____ _______ __ ____

__ ____ ____ __ ____

Global Alignment

• Algorithm of Needleman and Wunsch (1970)

• Finds the alignment of two complete sequences: ADLGAVFALCDRYFQ

|||| |||| |

ADLGRTQN-CDRYYQ

• Some global alignment programs “trim ends”

Local Alignment

• Algorithm of Smith and Waterman (1981).

• Makes an optimal alignment of the best segment of

similarity between two sequences.

ADLG CDRYFQ

|||| |||| |

ADLG CDRYYQ

• Can return a number of highly aligned segments.

Global Alignment: Algorithm

1..j1..i T and S of alignment optimum of Cost),( jiC

T of jlength of Prefix

S of i length of Prefix

)1j,i(C

)j,1i(C

)T,S(w)1j,1i(C

max)j,i(Cji

j)j,0(Ci)0,i(C

Initial conditions:

Recurrence relation: For 1 i n, 1 j m:

Theorem. C(i,j) satisfies the following relationships:

Example

Case 1: Line up Si with Tj

S: C A T T C A C

T: C - T T C A G

i - 1 i

j j -1

S: C A T T C A - C

T: C - T T C A G -

Case 2: Line up Si with space i - 1 i

S: C A T T C A C -

T: C - T T C A - G

Case 3: Line up Tj with space i

j j -1

Justification: Optimal Substructure Property Followed

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj-1 Tj

C(i-1,j-1) + w(Si,Tj)

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj —

C(i-1,j)

S1 S2 . . . Si —

T1 T2 . . . Tj-1 Tj

C(i,j-1)

Computation Procedure

C(n,m)

C(0,0)

C(i,j)

)1j,i(C,)j,1i(C),T,S(w)1j,1i(Cmax)j,i(C ji

C(i-1,j) C(i-1,j-1)

C(i,j-1)

λ C T C G C A G C

+10 for match, -2 for mismatch, -5 for space

0 -5 -10 -15 -20 -25 -30 -35 -40

-5 10 5 0 -5 -10 -15 -20 -25

-10 5 8 3 -2 -7 0 -5 -10

-15 0 15 10 5 0 -5 -2 -7

-20 -5 10 13 8 3 -2 -7 -4

-25 -10 5 20 15 18 13 8 3

-30 -15 0 15 18 13 28 23 18

-35 -20 -5 10 13 28 23 26 33

λ C T C G C A G C

Traceback can yield both optimum alignments

Pairwise Sequence alignment Basic Algorithmsbccg131/wiki.files/2 pariwise...3 Literature list •...

Documents

Transcript of Pairwise Sequence alignment Basic Algorithmsbccg131/wiki.files/2 pariwise...3 Literature list •...

Grammatical Complexity Of Symbolic Sequences: A Brief Introducton Bailin Hao T-Life Research Center, Fudan University Institute of Theoretical Physics,

CHAPTER 1 INTRODUCTON - CORE

Introducton to Labanotation

A kind and gentle introducton to rac

INTRODUCTON TO WATER QUALITY IN AQUACULTURE Unit 4: Dissolved oxygen/aeration Auburn University Department of Fisheries and Allied Aquacultures.

Introducton to Labanotation Edward Chow Reference: state.edu/5_resources/labanlab/welcome.html state.edu/5_resources/labanlab/welcome.html

Introducton to Verification Validation and Accreditation

INTRODUCTON TO NDT

CHAPTER 1 INTRODUCTON Histopathology - … Training Modules/Lab. Tech/Histo... · 1 CHAPTER 1 INTRODUCTON Histopathology - Definition it is a branch of pathology which deals with

Introducton to COIN-OR Tools for Optimizationted/files/talks/COIN-Lehigh... · 2016. 3. 26. · Introducton to COIN-OR Tools for Optimization TED RALPHS ISE Department COR@L Lab Lehigh

Telecommunication Networks Architecture, …teln131/wiki.files/OSS-2013...Telecommunication Networks – Architecture, Operations and Innovation תונשדחו ,לועפת ,הרוטקטיכרא

DMRG Theory and Introducton -Manual for DMRG Code

Introducton to Assetwolf

D4.3 Application roadmap for the introducton of Virtual ...

Elevate introducton & overview winter 2014

P1-Introducton to FinMan (1)

CHAPTER 1. INTRODUCTON TO REINFORCED CONCRETE · CHAPTER 1. INTRODUCTON TO REINFORCED CONCRETE 1.1. INTRODUCTION Traditionally, the study of reinforced concrete design begins directly

Introducton to Psychology

1. Introducton · Tables of Contents 1. INTRODUCTION.....4

Introducton to GEM and Overview of OpenQuake software · 2018. 11. 15. · Introducton to GEM and Overview of OpenQuake software ... – Daily Scrum developer meetngs, 4-6 week Sprints.