Introduction to

61
1 Introduction to Bioinformatic s

description

Introduction to. Bioinformatics. Introduction to Bioinformatics. LECTURE 3: SEQUENCE ALIGNMENT * Chapter 3: All in the family. Introduction to Bioinformatics LECTURE 3: SEQUENCE ALIGNMENT. 3.1 Eye of the tiger - PowerPoint PPT Presentation

Transcript of Introduction to

Page 1: Introduction to

1

Introduction to

Bioinformatics

Page 2: Introduction to

2

Introduction to Bioinformatics.

LECTURE 3: SEQUENCE ALIGNMENT

* Chapter 3: All in the family

Page 3: Introduction to

3

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.1 Eye of the tiger

* In 1994 Walter Gehring et alum (Un. Basel) turn the gene “eyeless” on in various places on Drosophila melanogaster

* Result: on multiple places eyes are formed

* ‘eyeless’ is a master regulatory gene that controls +/- 2000 other genes

* ‘eyeless’ on induces formation of an eye

Page 4: Introduction to

4Eyeless Drosophila

Page 5: Introduction to

5

Mutant Drosophila melanogaster: gene ‘EYELESS’ turned on

Page 6: Introduction to

6

LECTURE 3: SEQUENCE ALIGNMENTHomeoboxes and Master regulatory genes

Page 7: Introduction to

7

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

HOMEO BOX

A homeobox is a DNA sequence found within genes that are involved in the regulation of development (morphogenesis) of animals, fungi and plants.

Page 8: Introduction to

8

LECTURE 3: SEQUENCE ALIGNMENTDrosophila melanogaster: HOX homeoboxes

Page 9: Introduction to

9

LECTURE 3: SEQUENCE ALIGNMENTDrosophila melanogaster: PAX homeoboxes

Page 10: Introduction to

10

LECTURE 3: SEQUENCE ALIGNMENTHomeoboxes and Master regulatory genes

Page 11: Introduction to

11

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.2 On sequence alignment

Sequence alignment is the most important task in bioinformatics!

Page 12: Introduction to

12

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.2 On sequence alignment

Sequence alignment is important for:

* prediction of function* database searching* gene finding* sequence divergence* sequence assembly

Page 13: Introduction to

13

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.3 On sequence similarity

Homology: genes that derive from a common ancestor-gene are called homologs

Orthologous genes are homologous genes in different organisms

Paralogous genes are homologous genes in one organism that derive from gene duplication

Gene duplication: one gene is duplicated in multiple copies that therefore free to evolve and assume new functions

Page 14: Introduction to

14

LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS

Page 15: Introduction to

15

LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS

Page 16: Introduction to

16

LECTURE 3: SEQUENCE ALIGNMENTHOMOLOGOUS and PARALOGOUS versus ANALOGOUS

Page 17: Introduction to

17

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT: sequence similarity

Causes for sequence (dis)similarity

mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)

insertion: at a certain location one new nucleotide is inserted inbetween two existing nucleotides (e.g.: AA → AGA)

deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)

indel: an insertion or a deletion

Page 18: Introduction to

18

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.4 Sequence alignment: global and local

Find the similarity between two (or more)DNA-sequences by finding a good alignment between them.

Page 19: Introduction to

19

The biological problem of sequence alignment

DNA-sequence-1

tcctctgcctctgccatcat---caaccccaaagt |||| ||| ||||| ||||| |||||||||||| tcctgtgcatctgcaatcatgggcaaccccaaagt

DNA-sequence-2 Alignment

Page 20: Introduction to

20

Sequence alignment - definition

Sequence alignment is an arrangement of two or more sequences, highlighting their similarity.

The sequences are padded with gaps (dashes) so that wherever possible, columns contain identical characters from the sequences involved

tcctctgcctctgccatcat---caaccccaaagt|||| ||| ||||| ||||| ||||||||||||tcctgtgcatctgcaatcatgggcaaccccaaagt

Page 21: Introduction to

21

Algorithms

Needleman-WunschPairwise global alignment only.

Smith-WatermanPairwise, local (or global) alignment.

BLASTPairwise heuristic local alignment

Page 22: Introduction to

22

Pairwise alignment

Pairwise sequence alignment methods are concerned with finding the best-matching piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid) sequences.

Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples.

This information is useful for answering a variety of biological questions:

1. The identification of sequences of unknown structure or function.

2. The study of molecular evolution.

Page 23: Introduction to

23

Global alignment

A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment.

Global alignments are useful mostly for finding closely-related sequences.

As these sequences are also easily identified by local alignment methods global alignment is now somewhat deprecated as a technique.

Further, there are several complications to molecular evolution (such as domain shuffling) which prevent these methods from being useful.

Page 24: Introduction to

24

Global Alignment

Find the global best fit between two sequences

Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:

A(s,t) = V I V A L A S V E G A S| | | | | | |V I V A D A - V - - I S

indels

Page 25: Introduction to

25

The Needleman-Wunsch algorithm

The Needleman-Wunsch algorithm (1970, J Mol Biol. 48(3):443-53) performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences.

The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score.

Page 26: Introduction to

26

The Needleman-Wunsch algorithm

Of course this works for both DNA-sequences as for protein-sequences.

Page 27: Introduction to

27

Alignment scoring function

The cost of aligning two symbols xi and yj is the scoring function σ(xi,yj )

Page 28: Introduction to

28

Alignment cost

The cost of the entire alignment:

c

iii yxM

1

),(

Page 29: Introduction to

29

A simple scoring function

σ(-,a) = σ(a,-) = -1

σ(a,b) = -1 if a ≠ b

σ(a,b) = 1 if a = b

Page 30: Introduction to

30

The substitution matrix

A more realistic scoring function is given by the biologically inspired substitution matrix :

- A G C T A 10 -1 -3 -4 G -1 7 -5 -3 C -3 -5 9 0 T -4 -3 0 8

Examples:

* PAM (Point Accepted Mutation) (Margaret Dayhoff)* BLOSUM (BLOck SUbstitution Matrix) (Henikoff and Henikoff)

Page 31: Introduction to

31

Scoring function

The cost for aligning the two sequences s = VIVALASVEGAS and t = VIVADAVIS :

A(s,t) =

is:

M(A) = 7 matches + 2 mismatches + 3 gaps= 7 – 2 – 3 = 2

V I V A L A S V E G A S| | | | | | |V I V A D A - V - - I S

Page 32: Introduction to

32

Optimal global alignment

The optimal global alignment A* between two sequences s and t is the alignment A(s,t) that maximizes the total alignment score M(A) over all possible alignments.

A* = argmax M(A)

Finding the optimal alignment A* looks a combinatorial optimization problem:

i. generate all possible allignmentsii. compute the score Miii. select the alignment A* with the maximum score M*

Page 33: Introduction to

33

Local alignment

Local alignment methods find related regions within sequences - they can consist of a subset of the characters within each sequence.

For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B.

This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related.

This is not possible with global alignment methods.

Page 34: Introduction to

34

The Smith Waterman algorithmThe Smith-Waterman algorithm (1981) is for determining similar regions between two nucleotide or protein sequences.

Smith-Waterman is also a dynamic programming algorithm and improves on Needleman-Wunsch. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme).

However, the Smith-Waterman algorithm is demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn) time and space are required.

As a result, it has largely been replaced in practical use by the BLAST algorithm; although not guaranteed to find optimal alignments, BLAST is much more efficient.

Page 35: Introduction to

35

Optimal local alignment

The optimal local alignment A* between two sequences s and t is the optimal global alignment A(s(i1:i2), t(j1:j2) ) of the sub-sequences s(i1:i2) and t(j1:j2) for some optimal choice of i1, i2, j1 and j2.

Page 36: Introduction to

36

Sequence alignment - meaning

Sequence alignment is used to study the evolution of the sequences from a common ancestor such as protein sequences or DNA sequences.

Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions.

Sequence alignment also refers to the process of constructing significant alignments in a database of potentially unrelated sequences.

Page 37: Introduction to

37

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.5 Statistical analysis of alignments

This works identical to gene finding:

* Generate randomized sequences based on the second string

* Determine the optimal alignments of the first sequence with these randomized sequences

* Compute a histogram and rank the observed score in this histogram

* The relative position defines the p-value.

Page 38: Introduction to

38

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT: statistical analysis

Histogram of scores of randomly generated strings using permutation of original sequence t

original sequence s

Page 39: Introduction to

39

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.6 BLAST: fast approximate alignment

• Fast but heuristic

• Most used algorithm in bioinformatics

• Verb: to blast

Page 40: Introduction to

40

Page 41: Introduction to

41

Page 42: Introduction to

42

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.7 Multiple sequence alignment:

Determine the best alignment between multiple (more than two) DNA-sequences.

Page 43: Introduction to

43

Multiple alignment

Multiple alignment is an extension of pairwise alignment to incorporate more than two sequences into an alignment.

Multiple alignment methods try to align all of the sequences in a specified set.

The most popular multiple alignment tool is CLUSTAL.

Multiple sequence alignment is computationally difficult and is classified as an NP-Hard problem.

Page 44: Introduction to

44

Multiple alignment

Page 45: Introduction to

45

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

3.8 Computing the alignments

* NW and SW are both based on Dynamic Programming (DP)

* A recursive relation breaks down the computation

Page 46: Introduction to

46

Dynamic Programming Approach to Sequence Alignment

The dynamic programming approach to sequence alignment always tries to follow the best prior-result so far.

Try to align two sequences by inserting some gaps at different locations, so as to maximize the score of this alignment.

Score measurement is determined by "match award", "mismatch penalty" and "gap penalty". The higher the score, the better the alignment.

If both penalties are set to 0, it aims to always find an alignment with maximum matches so far. Maximum match = largest number matches can have for one sequence by allowing all possible deletion of another sequence.

It is used to compare the similarity between two sequences of DNA or Protein, to predict similarity of their functionalities.Examples: Needleman-Wunsch(1970), Sellers(1974), Smith-Waterman(1981)

Page 47: Introduction to

47

The Needleman-Wunsch algorithm

The Needleman-Wunsch algorithm (1970, J Mol Biol. 48(3):443-53) performs a global alignment on two sequences (A and B) and is applied to align protein or nucleotide sequences.

The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score.

Scores for aligned characters are specified by the transition matrix σ (i,j) : the similarity of characters i and j.

Page 48: Introduction to

48

The Needleman-Wunsch algorithm

For example, if the substitution matrix was

- A G C T A 10 -1 -3 -4 G -1 7 -5 -3 C -3 -5 9 0 T -4 -3 0 8

with a gap penalty of -5, would have the following score...

then the alignment: AGACTAGTTAC CGA---GACGT

Page 49: Introduction to

49

The Needleman-Wunsch algorithm

1. Create a table of size (m+1)x(n+1) for sequences s and t of lengths m and n,

2. Fill table entries (m:1) and (1:n) with the values:

3. Starting from the top left, compute each entry using the recursive relation:

4. Perform the trace-back procedure from he bottom-right corner

j

kkj

i

kki MM

1,1

11, ),(,),( ts

),(

),(

),(

max

1,

,1

1,1

,

jji

iji

jiji

ji

M

M

M

M

t

s

ts

Page 50: Introduction to

50

The Needleman-Wunsch algorithm

Once the F matrix is computed, note that the bottom right hand corner of the matrix is the maximum score for any alignments. To compute which alignment actually gives this score, you can start from the bottom left cell, and compare the value with the three possible sources(Choice1, Choice2, and Choice3 above) to see which it came from. If it was Choice1, then A(i) and B(i) are aligned, if it was Choice2 then A(i) is aligned with a gap, and if it was Choice3, then B(i) is aligned with a gap.

Page 51: Introduction to

51

The Needleman-Wunsch algorithm

Page 52: Introduction to

52

Page 53: Introduction to

53

The Smith-Waterman algorithm

1. Create a table of size (m+1)x(n+1) for sequences s and t of lengths m and n,

2. Fill table entries (1,1:m+1) and (1:n+1,1) with zeros.

3. Starting from the top left, compute each entry using the recursive relation:

4. Perform the trace-back procedure from the maximum element in the table to the first zero element on the trace-back path.

0

),(

),(

),(

max1,

,1

1,1

,jji

iji

jiji

ji M

M

M

Mt

s

ts

Page 54: Introduction to

54

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

EXAMPLE: Eyeless Gene Homeobox

Compare the gene eyeless of Drosophila Melanoganster with the human gene

aniridia. They are master regulatory genes producing proteins that control large

cascade of other genes. Certain segments of genes eyeless of Drosophila

melanogaster and human aniridia are almost identical. The most important of

such segments encodes the PAX (paired-box) domain, a sequence of 128 amino

acids whose function is to bind specific sequences of DNA. Another common

segment is the HOX (homeobox) domain that is thougth to be part of more than

0.2% of the total nummber of vertebrate genes.

Page 55: Introduction to

55

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

Page 56: Introduction to

56

Introduction to BioinformaticsLECTURE 3: GLOBAL ALIGNMENT

Page 57: Introduction to

57

Introduction to BioinformaticsLECTURE 3: GLOBAL ALIGNMENT

Page 58: Introduction to

58

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

Page 59: Introduction to

59

END of LECTURE 3

Page 60: Introduction to

60

Introduction to BioinformaticsLECTURE 3: SEQUENCE ALIGNMENT

Page 61: Introduction to

61