Multiple sequence alignment

Lesson 3
  • date post

    21-Dec-2015

1. What is a multiple sequence alignment?

Multiple sequence alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Similar to pairwise alignment, but n sequences are aligned instead of just n = 2.

Multiple sequence alignment

MSA = Multiple Sequence Alignment
• Each row represents an individual sequence
• Each column represents the ‘same’ position
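To make the row/column view concrete, here is a minimal Python sketch that uses the eight aligned rows shown on the slide. The names `ROWS`, `columns` and `conserved` are illustrative only and are not part of the lesson.

```python
# The eight aligned rows from the slide; '-' marks a gap.
ROWS = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]

# Once gaps are included, every row of an MSA has the same length.
assert len({len(row) for row in ROWS}) == 1

# zip(*ROWS) walks the alignment column by column.
columns = ["".join(col) for col in zip(*ROWS)]

# 1-based positions where all rows carry the same residue (and no gap).
conserved = [
    position
    for position, col in enumerate(columns, start=1)
    if len(set(col)) == 1 and col[0] != "-"
]

print(len(ROWS), "rows x", len(columns), "columns")
print("fully conserved columns:", conserved)
```

In this alignment only columns 5 (all C) and 21 (all W) are identical in every row.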

Multiple sequence alignment

Homo sapiens
Pan troglodytes
Mus musculus
Canis familiaris
Gallus gallus
Anopheles gambiae
Drosophila melanogaster
Caenorhabditis elegans
Arabidopsis thaliana
Rattus norvegicus

Histone H4 protein

Multiple sequence alignment

NADH dehydrogenase subunit 4

Histone H4 protein 4

► Which is better: a pairwise alignment, or a pair of rows in an MSA?

2. How MSAs are computed

Alignment – Dynamic Programming

There is a dynamic programming algorithm for multiple sequences, similar to the one for pairwise alignment.

Complexity: O(n^|sequences|), where n is the sequence length and |sequences| is the number of sequences, so the cost grows exponentially with the number of sequences.
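A rough feel for this growth, as a Python sketch: it counts the order of magnitude of dynamic programming cells for sequences of equal length. The function name `dp_cells` and the length of 100 are assumptions chosen for illustration.

```python
def dp_cells(length, num_sequences):
    """Order of magnitude of DP cells: one table axis per sequence."""
    return length ** num_sequences

# Sequences of length 100: the table grows by a factor of 100
# with every additional sequence.
for k in range(2, 7):
    print(k, "sequences:", f"{dp_cells(100, k):,}", "cells")
```

Two sequences of length 100 need about ten thousand cells, while five need about ten billion, which is why exact dynamic programming is only used for very few sequences.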

Alignment methods

This complexity is not practical, so heuristics are used:

• Progressive/hierarchical alignment (Clustal)

• Iterative alignment (MAFFT, MUSCLE)

Progressive alignment

First step: compute the pairwise alignments for all against all (10 pairwise alignments for the 5 sequences A–E). The similarities are converted to distances and stored in a table.

     A    B    C    D    E
A
B    8
C   15   17
D   16   14   10
E   32   31   31   32

Second step: cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

[Figure: the distance table for A–E next to the guide tree built from it]

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
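The clustering step can be sketched in Python using the distance table for A–E from the slides. The slides do not say which clustering method builds the guide tree, so this sketch assumes simple average linkage (UPGMA-style); the names `DIST`, `cluster_distance` and `guide_tree_order` are illustrative.

```python
from itertools import combinations

# Pairwise distances from the table (symmetric, so each pair is stored once).
DIST = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}

def leaf_distance(a, b):
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]

def cluster_distance(c1, c2):
    """Average distance over all leaf pairs (assumed linkage rule)."""
    total = sum(leaf_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def guide_tree_order(leaves):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [(leaf,) for leaf in leaves]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = guide_tree_order("ABCDE")
for left, right, distance in merges:
    print("".join(left), "+", "".join(right), "at distance", distance)
```

Under this assumed rule the merge order is A with B, then C with D, then the two pairs, and finally E. That order mirrors the progressive scheme: neighboring sequences first, then pairs of pairs, then the most distant sequence.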

Third step:

1. Align the most similar (neighboring) pairs

2. Align pairs of pairs

[Figure: the guide tree over A–E shown at each stage, with the aligned items labeled “sequence” or “profile”]

Main disadvantages:

• Sub-optimal tree topology

• Misalignments resulting from globally aligning pairs of sequences.

Iterative alignment

[Figure: sequences A–E; pairwise distance table → guide tree → MSA, repeated]

Iterate until the MSA does not change (convergence)

3. MSA – What is it good for?

A. Conserved positions
B. Consensus
C. Patterns
D. Profiles
E. Much more…

Consensus sequence

ATCTTGT

AACTTGT

AACTTCT

AACTTGT

A consensus sequence holds the most frequent character of the alignment at each column

Consensus sequence – an example

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

The -10 region of six promoters. There are many variants of the “consensus.”

Consensus sequence – an example

1. Strict majority.*

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

Consensus: TATAAT

* In case of equal frequencies, choose one according to alphabetical order.
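The strict-majority rule, including the alphabetical tie-break, can be sketched in a few lines of Python. The function name `consensus` and the variable names are illustrative; the sequences are the ones from the slides.

```python
from collections import Counter

def consensus(seqs):
    """Most frequent character per column; ties go to the earlier letter."""
    result = []
    for column in zip(*seqs):
        counts = Counter(column)
        # Sort key: highest count first, then alphabetical order.
        best = min(counts, key=lambda ch: (-counts[ch], ch))
        result.append(best)
    return "".join(result)

PROMOTERS = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
print(consensus(PROMOTERS))
print(consensus(["ATCTTGT", "AACTTGT", "AACTTCT", "AACTTGT"]))
```

For the six promoters, column 4 holds three A's and three G's, so the tie-break picks A and the result is TATAAT. The four-sequence example from the earlier consensus slide gives AACTTGT.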

Consensus sequence – an example

Had we searched the region upstream of genes for this consensus (TATAAT), we would have identified only 2 out of the 6 sequences. So we will miss many cases.

By chance, we expect a “hit” every 4,096 bp (4^6 = 4,096).

Consensus sequence – an example

We can search while allowing 1 mismatch.

We would then have identified 3 out of the 6 sequences, so we will miss fewer cases.

By chance, we expect a “hit” every ~200 bp → more “noise”.

Consensus sequence – an example

We can search while allowing 2 mismatches.

We would then have identified all 6 sequences, so we won’t miss any.

By chance, we expect a “hit” every ~30 bp → A LOT OF “noise”.
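The counts and the chance-hit spacings for 0, 1 and 2 mismatches can be checked with a short Python sketch. It assumes that all four bases are equally likely and independent, which is the assumption behind the 4,096 bp figure; the function names are illustrative.

```python
from math import comb

PROMOTERS = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
CONSENSUS = "TATAAT"

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def hits(seqs, consensus, max_mismatches):
    return [s for s in seqs if mismatches(s, consensus) <= max_mismatches]

def expected_spacing(length, max_mismatches, alphabet=4):
    """Average distance between chance hits in random sequence."""
    # Number of distinct words within the allowed number of mismatches.
    words = sum(comb(length, k) * (alphabet - 1) ** k
                for k in range(max_mismatches + 1))
    return alphabet ** length / words

for k in range(3):
    found = len(hits(PROMOTERS, CONSENSUS, k))
    spacing = expected_spacing(len(CONSENSUS), k)
    print(f"{k} mismatches: {found}/6 found, one chance hit per ~{spacing:.0f} bp")
```

This gives 2, 3 and 6 sequences found, with chance hits roughly every 4,096, 216 and 27 bp, in line with the rounded figures on the slides.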

Consensus sequence – an example

2. Majority only when it is a clear case. In the remaining cases, use wildcards.

Y = Pyrimidine
R = Purine
N = Any nucleotide

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

Consensus: TATRNT

Reminder: Purines & Pyrimidines

Y = Pyrimidine (C or T)
R = Purine (A or G)
N = Any nucleotide

Consensus sequence – an example

Had we searched the region upstream of genes with the redundant consensus (TATRNT), we would have identified 4 of the 6 sequences.

By chance, we expect a “hit” every ~500 bp.
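A small Python sketch of searching with the redundant consensus: each wildcard becomes a character class in a regular expression. Only the codes used in this lesson (R, Y, N) are included, and equal, independent base frequencies are assumed for the chance estimate. The names `CODES`, `consensus_to_regex` and `expected_spacing` are illustrative.

```python
import re

# Nucleotide codes used in this lesson (a subset of the IUPAC codes).
CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

def consensus_to_regex(consensus):
    """Turn e.g. TATRNT into the regex [T][A][T][AG][ACGT][T]."""
    return re.compile("".join("[" + CODES[c] + "]" for c in consensus))

def expected_spacing(consensus):
    """Average distance between chance hits in random sequence."""
    probability = 1.0
    for c in consensus:
        probability *= len(CODES[c]) / 4
    return 1 / probability

PROMOTERS = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
pattern = consensus_to_regex("TATRNT")
matched = [s for s in PROMOTERS if pattern.fullmatch(s)]

print(len(matched), "of", len(PROMOTERS), "matched:", matched)
print("one chance hit per", expected_spacing("TATRNT"), "bp")
```

Four of the six promoters match, and the chance spacing is 4 × 4 × 4 × 2 × 1 × 4 = 512 bp, which the slide rounds to ~500 bp.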

Consensus sequence – an example

There is always a tradeoff between sensitivity and specificity.

Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN).

Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP).

TATRNT vs. TATAAT

Permissive consensus: higher sensitivity, lower specificity (more true positives, more false positives ↔ fewer true negatives, fewer false negatives).

Nonpermissive consensus: higher specificity, lower sensitivity (fewer true positives, fewer false positives ↔ more true negatives, more false negatives).
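The tradeoff can be put into numbers with a short Python sketch. The true-positive and false-negative counts come from the six promoters (2 of 6 found with TATAAT, 4 of 6 with TATRNT). The negative counts are hypothetical, invented only to illustrate the direction of the effect: 10,000 non-promoter sites, with 2 and 20 false hits respectively.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that are predicted negative."""
    return tn / (tn + fp)

# Nonpermissive consensus TATAAT: finds 2 of the 6 known promoters.
strict = (sensitivity(tp=2, fn=4), specificity(tn=9998, fp=2))      # negatives hypothetical

# Permissive consensus TATRNT: finds 4 of the 6 known promoters.
permissive = (sensitivity(tp=4, fn=2), specificity(tn=9980, fp=20))  # negatives hypothetical

print("TATAAT: sensitivity %.2f, specificity %.4f" % strict)
print("TATRNT: sensitivity %.2f, specificity %.4f" % permissive)
```

With these illustrative counts the permissive consensus gains sensitivity and loses specificity, as the slide describes.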

Patterns

TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

[TG]-A-[TC]-[GA]-[CTA]-T

Patterns are more informative than consensus sequences.

A pattern specifies, for each position, the characters that are possible at that position.
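A pattern of this kind can be derived mechanically from the alignment, as in the Python sketch below. The function name `alignment_to_pattern` is illustrative, and the characters inside each bracket are sorted alphabetically, so the brackets list the same sets as the slide in a different order.

```python
def alignment_to_pattern(seqs):
    """List the characters seen in each column, separated by '-'."""
    elements = []
    for column in zip(*seqs):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                  # conserved position
        else:
            elements.append("[" + "".join(residues) + "]")
    return "-".join(elements)

PROMOTERS = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
print(alignment_to_pattern(PROMOTERS))
```

For the six promoters this prints `[GT]-A-[CT]-[AG]-[ACT]-T`.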

Patterns - syntax

• The standard IUPAC one-letter codes.
• ‘x’ : any amino acid.
• ‘[]’ : residues allowed at the position.
• ‘{}’ : residues forbidden at the position.
• ‘()’ : repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions.
• ‘-’ : separates each pattern element.
• ‘<’ : indicates an N-terminal restriction of the pattern.
• ‘>’ : indicates a C-terminal restriction of the pattern.
• ‘.’ : the period ends the pattern.

Patterns

• W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]

x(9,11): any amino acid, between 9 and 11 times
[FYV]: F or Y or V

WOPLASDFGYVWPPPLAWS
ROPLASDFGYVWPPPLAWS
WOPLASDFGYVWPPPLSQQQ
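One way to try such a pattern is to translate it into a regular expression. The Python sketch below handles only the syntax elements listed in this lesson; `prosite_to_regex` is an illustrative helper, not a function from any existing library.

```python
import re

def prosite_to_regex(pattern):
    """Translate a pattern in the syntax above into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal restriction
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal restriction
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (9) or (9,11).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]")
SEQUENCES = ["WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"]
matches = [bool(re.search(regex, s)) for s in SEQUENCES]
print(regex)
print(matches)
```

Of the three example sequences, only the first matches: the second lacks a W followed by enough residues, and the third has no G, S, T, N or E at the required distance.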

Profile = PSSM = Position Specific Score Matrix

ACCCAA
AACCGG
AACCTT

     1     2     3     4     5     6
A    1    .67    0     0    .33   .33
C    0    .33    1     1     0     0
G    0     0     0     0    .33   .33
T    0     0     0     0    .33   .33
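The table is simply the per-column frequency of each nucleotide in the three aligned sequences. A minimal Python sketch, with `build_profile` as an illustrative name:

```python
from collections import Counter

def build_profile(seqs, alphabet="ACGT"):
    """Per-column relative frequency of each character."""
    profile = {base: [] for base in alphabet}
    for column in zip(*seqs):
        counts = Counter(column)
        for base in alphabet:
            profile[base].append(counts[base] / len(seqs))
    return profile

profile = build_profile(["ACCCAA", "AACCGG", "AACCTT"])
for base, row in profile.items():
    print(base, " ".join(f"{value:.2f}" for value in row))
```

The printed rows agree with the table up to rounding (2/3 is shown as .67 and 1/3 as .33 on the slide).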

Profiles / PSSMs

P(AACCAA) = 1 × 0.67 × 1 × 1 × 0.33 × 0.33 ≈ 0.073
P(GACCAA) = 0 (G never occurs at position 1)

Sequences with higher probabilities → higher chance of being related to the PSSM.

Searching with PSSM

Each n-mer is compared to the profile and its probability is computed. Sequences with probabilities > threshold are considered hits.

GACGGTACGTAGCGGAGCGACCAA

Compute the probability of the first 6-mer

Searching with PSSM

GACGGTACGTAGCGGAGCGACCAA

[Figure: a 6-mer window slides along the sequence, giving probabilities P1, P2, P3, P4, …]

6-mers with probabilities > threshold are considered hits.
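The sliding-window search can be sketched in Python as follows. The profile uses exact fractions (2/3 and 1/3) where the slides show .67 and .33, and the second test sequence, `TTAACCAATT`, is made up for illustration. The names `probability` and `scan` are illustrative.

```python
# Profile from the slides, with exact fractions instead of .67 / .33.
PROFILE = {
    "A": [1.0, 2 / 3, 0.0, 0.0, 1 / 3, 1 / 3],
    "C": [0.0, 1 / 3, 1.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 0.0, 0.0, 1 / 3, 1 / 3],
    "T": [0.0, 0.0, 0.0, 0.0, 1 / 3, 1 / 3],
}

def probability(kmer, profile):
    """Product of the per-position probabilities of the k-mer."""
    p = 1.0
    for position, base in enumerate(kmer):
        p *= profile[base][position]
    return p

def scan(sequence, profile, threshold):
    """Score every window; keep those scoring above the threshold."""
    width = len(next(iter(profile.values())))
    found = []
    for start in range(len(sequence) - width + 1):
        kmer = sequence[start:start + width]
        p = probability(kmer, profile)
        if p > threshold:
            found.append((start, kmer, p))
    return found

print(probability("AACCAA", PROFILE))   # 2/27, about 0.074
print(probability("GACCAA", PROFILE))   # 0.0, G is impossible at position 1
print(scan("GACGGTACGTAGCGGAGCGACCAA", PROFILE, threshold=0.01))
print(scan("TTAACCAATT", PROFILE, threshold=0.01))  # made-up sequence
```

With this profile every 6-mer window of the sequence from the slide scores 0, because each window contains at least one character the profile never saw at that position. The made-up sequence yields a single hit, AACCAA at offset 2. Zero probabilities of this kind are one reason practical profiles often add small pseudocounts, which the slides do not cover.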

Profile-pattern-consensus

Multiple alignment:
AACTTG
AAGTCG
CACTTC

Consensus: AACTTG
Consensus with wildcards: NANTNN
Pattern: [AC]-A-[GC]-T-[TC]-[GC]

Profile:
     1     2     3     4     5
A   0.66   1     0     0     .
T    0     0     0     1     .
C   0.33   0    0.66   0     .
G    0     0    0.33   0     .

4. HMM: Hidden Markov Models

Definitions & Uses

• A probabilistic model which deals with sequences of symbols. Uses: inferring hidden states.

• Originally used in speech recognition (the symbols being phonemes)

• Useful in biology – the sequence of symbols being the DNA/proteins.

Markov Chains

• A sequence of random variables X1, X2, … where each present state depends only on the previous state.

• Weather example: the weather on day x depends only on day x-1.

• We can easily compute the probability of: Sunny Sunny Rainy Sunny Sunny
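As a sketch of that computation in Python: the probability of the whole chain is the probability of the first state times one transition probability per step. The transition diagram from the slide did not survive, so the numbers in `INITIAL` and `TRANSITION` below are hypothetical placeholders.

```python
# Hypothetical values; the slide's own transition diagram is not available.
INITIAL = {"Sunny": 0.5, "Rainy": 0.5}
TRANSITION = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def chain_probability(states, initial, transition):
    """P(first state) times the product of the step-to-step transitions."""
    p = initial[states[0]]
    for previous, current in zip(states, states[1:]):
        p *= transition[previous][current]
    return p

days = ["Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]
print(chain_probability(days, INITIAL, TRANSITION))
```

With these placeholder numbers the result is 0.5 × 0.8 × 0.2 × 0.4 × 0.8 = 0.0256.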

Markov Chains

• Similarly we can assume a DNA sequence is Markovian
• ACGGTA… (vertical or horizontal!)
• These conditional probabilities can be illustrated as follows (in DNA):

[Figure: the four states A, T, C, G connected by arrows]

• Each arrow has a transition probability: $P_{CA} = P(x_i = A \mid x_{i-1} = C)$

• Thus the probability of a sequence x will be:

$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_1) \prod_{i=2}^{L} P_{x_{i-1} x_i}$$

Hidden Markov Models

• The state sequence itself follows a simple Markov chain. But:

• In an HMM it is no longer possible to know the state by looking at the symbols – the state is hidden.

[Figure: a chain of hidden states S1 … Si-1, Si, Si+1 … Sn linked by transitions P, each state Si emitting an observation Ki via B]

The weather HMM example

• In this weather example only the actions are observable and the weather is hidden:

HMM formalities

• {S, K, Π, P, B}

• S : {s1…sN} are the values for the hidden states

• K : {k1…kM} are the values for the observations

• The hidden states emit/generate the symbols (observations)

• Π = {Πi} are the initial state probabilities

• P = {Pij} are the state transition probabilities

• B = {bik} are the emission probabilities

Another HMM example – the dishonest casino

• In a casino, they use a fair die most of the time, but occasionally switch to an unfair die. The switch between dice can be represented by an HMM:

FAIR emissions: 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
UNFAIR emissions: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2

Transitions: FAIR → FAIR 0.95, FAIR → UNFAIR 0.05, UNFAIR → UNFAIR 0.9, UNFAIR → FAIR 0.1

Dishonest casino - continued

• The symbols (observations) are the sequence of rolls:

3 5 6 2 1 4 6 3 6…

• What is hidden? Whether the die is fair or unfair:

f f f f u u u f f

This is a Markov chain.

In addition, we have:

• Emission probabilities: given a state, we have 6 possible matching symbols, each with an emission probability.
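The casino model is small enough to write out in Python. The sketch below computes the joint probability of the rolls together with a proposed hidden path, and then uses the Viterbi algorithm to find the most probable path. Viterbi is a standard technique that these slides do not name. Two assumptions are made: the transition probabilities are read from the diagram as fair→fair 0.95 and unfair→unfair 0.9, and the initial state probabilities, which the slides do not give, are set to 0.5 each.

```python
from math import log

STATES = ("F", "U")  # fair, unfair
INITIAL = {"F": 0.5, "U": 0.5}  # assumption: not given on the slides
TRANSITION = {"F": {"F": 0.95, "U": 0.05},
              "U": {"F": 0.10, "U": 0.90}}
EMISSION = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "U": {**{roll: 1 / 10 for roll in range(1, 6)}, 6: 1 / 2},
}

def joint_probability(rolls, path):
    """P(rolls and hidden path): start, then transition x emission per step."""
    p = INITIAL[path[0]] * EMISSION[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        p *= TRANSITION[path[i - 1]][path[i]] * EMISSION[path[i]][rolls[i]]
    return p

def viterbi(rolls):
    """Most probable hidden path, computed in log space."""
    score = {s: log(INITIAL[s]) + log(EMISSION[s][rolls[0]]) for s in STATES}
    back = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: score[r] + log(TRANSITION[r][s]))
            pointers[s] = best
            new_score[s] = (score[best] + log(TRANSITION[best][s])
                            + log(EMISSION[s][roll]))
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):   # follow the back-pointers
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

ROLLS = [3, 5, 6, 2, 1, 4, 6, 3, 6]
SLIDE_PATH = "FFFFUUUFF"  # the hidden path shown on the slide

print("P(rolls, slide path) =", joint_probability(ROLLS, SLIDE_PATH))
best_path = viterbi(ROLLS)
print("most probable path:", best_path, joint_probability(ROLLS, best_path))
```

The path on the slide illustrates what a hidden state sequence looks like; it need not be the most probable one under these parameters.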

HMM of MSA

• MSA can be represented by an HMM

– Insertion of A/C/G/T

– Match or Mismatch

– Deletion

HMM of MSA can get more complex…

Questions where HMMs are used:

• Does this sequence belong to a particular family?

• Can we identify regions in a sequence (for instance – alpha helices, beta sheets)?

• Pairwise/multiple sequence alignment

• Searching databases for protein families (building profiles).