Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for...


Transcript of Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for...

Page 1: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Design and creation of multiple sequence alignments
Unit 13
BIOL221T: Advanced Bioinformatics for Biotechnology
Irene Gabashvili, PhD

Page 2:

Dot Plot (Matrix) for Sequence Comparison

Reminders from Previous Lectures

Page 3:

DOTPLOTS

[Figure: dot plot comparing DOROTHYHODGKIN (one axis) with DOROTHYCROWFOOTHODGKIN (other axis)]

Page 4:

Dot Matrix: Self Comparison

[Figure: dot matrix of ATGCAGTT against ATGCAGTT]

Page 5:

Dot Matrix: Self Comparison

[Figure: dot matrix of ATGCAGTT against ATGCAGTT. Labeled feature: identity diagonal]

Page 6:

Dot Matrix: Self Comparison

[Figure: dot matrix of ATGCATGC against ATGCATGC]

Page 7:

Dot Matrix: Self Comparison

[Figure: dot matrix of ATGCATGC against ATGCATGC. Labeled features: identity diagonal, direct repeat]

Page 8:

Dot Matrix: Point Mutation

[Figure: dot matrix of ATGGAGTT against ATGCAGTT]

Page 9:

Dot Matrix: Point Mutation

[Figure: dot matrix of ATGGAGTT against ATGCAGTT. Labeled features: main diagonal, point mutation]

Page 10:

Dot Matrix: Gap

[Figure: dot matrix of ATGAACAGTT against ATGCAGTT]

Page 11:

Dot Matrix: Gap

[Figure: dot matrix of ATGAACAGTT against ATGCAGTT. Labeled features: main diagonal, deletion/insertion]

Page 12:

Dot Matrix: Rearrangement

[Figure: dot matrix of AGTTATGC against ATGCAGTT]

Page 13:

Dot Matrix: Rearrangement

[Figure: dot matrix of AGTTATGC against ATGCAGTT. Labeled feature: main diagonal]

Page 14:

Dot Plot Analysis

Advantages
  Simple and fast.
  Can detect DNA rearrangement.

Disadvantages
  No numerical values produced.
  Subjective interpretation.
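The dot matrices on the preceding slides can be reproduced with a few lines of code. This is a minimal sketch, not the tool used in the lecture: the function name `dot_matrix` and the choice of `*` and `.` as plot characters are assumptions made for illustration.

```python
def dot_matrix(seq_x, seq_y):
    """Return one text row per residue of seq_y.

    A '*' is placed wherever the residue of seq_y (row) equals the
    residue of seq_x (column); every other cell is '.'.
    """
    return ["".join("*" if a == b else "." for b in seq_x) for a in seq_y]


if __name__ == "__main__":
    # Self comparison of the repeat-containing sequence from the slides.
    # The identity diagonal and the direct repeat show up as parallel
    # diagonal lines of '*'.
    for row in dot_matrix("ATGCATGC", "ATGCATGC"):
        print(row)
```

Swapping in ATGGAGTT vs ATGCAGTT, ATGAACAGTT vs ATGCAGTT, or AGTTATGC vs ATGCAGTT gives text versions of the point mutation, gap, and rearrangement plots.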

Page 15:

Problems of Sequence Alignment

How to score? Match, mismatch and gap.

Example: +1 for each match, 0 for each mismatch, -2 for each internal gap (gap penalty), and 0 for each terminal gap (similarity score).

Page 16:

Computational measures

Distance measure
  0 for a match
  1 for a mismatch or gap
  Lowest is best

Another measure
  2 for a match
  -1 for a mismatch, -2 for a gap
  Highest is best
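Both measures can be applied to the same pair of aligned rows with one small function. The sketch below is illustrative only: `score_alignment` is a hypothetical name, `-` is assumed to mark a gap, and every gap column is charged the same amount.

```python
def score_alignment(row_a, row_b, match, mismatch, gap):
    """Sum per-column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # column contains a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


if __name__ == "__main__":
    a, b = "ATGCAGTT", "ATGGAGTT"   # one point mutation, no gaps
    print("distance  :", score_alignment(a, b, match=0, mismatch=1, gap=1))
    print("similarity:", score_alignment(a, b, match=2, mismatch=-1, gap=-2))
```

The distance measure is minimized and the similarity measure is maximized, so the two numbers are not directly comparable.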

Page 17:

Gap Penalties

Linear score: f(g) = -gd
Affine score: f(g) = -d - (g-1)e

d = gap open penalty
e = gap extend penalty
g = gap length

Example gap penalty values used: d = 500, e = 50
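The two formulas translate directly into code. A minimal sketch follows; the function names are made up for illustration, and the values d = 500 and e = 50 are the example values from the slide.

```python
def linear_gap(g, d):
    """Linear score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine score f(g) = -d - (g-1)*e: open once, then extend."""
    return -d - (g - 1) * e


if __name__ == "__main__":
    for g in (1, 2, 3, 10):
        print(g, linear_gap(g, d=500), affine_gap(g, d=500, e=50))
```

With these values a gap of length 3 costs 1500 under the linear score but only 600 under the affine score, which is why affine penalties favor one long gap over several short ones.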

Page 18:

Example from Lab-Feb20:

-1 for each terminal gap, -2 for each internal gap (gap penalty)
Blosum(A,A) = 4; Blosum(A,P) = -1; Blosum(A,W) = -3; Blosum(P,P) = 7; Blosum(P,W) = -4

AWAP
-APP     -1-3-1+7 = 2 (one terminal gap, 2 mismatches)

AWAP-
--APP    -3+4+7 = 8 (3 terminal gaps, no mismatches); best if gap penalty (inside) is high

AWAP
A-PP     -2+4-1+7 = 8 (one internal gap, 1 mismatch); best if terminal gap is high
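The three lab scores can be checked mechanically. The sketch below is an illustration under stated assumptions: a gap counts as terminal when it lies outside the first-to-last residue span of its own row, only the five Blosum values quoted on the slide are included, and all function names are hypothetical.

```python
# Only the substitution scores quoted on the slide.
BLOSUM_SUBSET = {
    ("A", "A"): 4, ("A", "P"): -1, ("A", "W"): -3,
    ("P", "P"): 7, ("P", "W"): -4,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    if (a, b) in BLOSUM_SUBSET:
        return BLOSUM_SUBSET[(a, b)]
    return BLOSUM_SUBSET[(b, a)]


def residue_span(row):
    """Column indices of the first and last non-gap character."""
    cols = [i for i, c in enumerate(row) if c != "-"]
    return cols[0], cols[-1]


def score_alignment(row_a, row_b, terminal_gap=-1, internal_gap=-2):
    """Score two aligned rows, charging terminal and internal gaps differently."""
    first_a, last_a = residue_span(row_a)
    first_b, last_b = residue_span(row_b)
    total = 0
    for i, (a, b) in enumerate(zip(row_a, row_b)):
        if a == "-":
            total += internal_gap if first_a < i < last_a else terminal_gap
        elif b == "-":
            total += internal_gap if first_b < i < last_b else terminal_gap
        else:
            total += pair_score(a, b)
    return total


if __name__ == "__main__":
    for rows in (("AWAP", "-APP"), ("AWAP-", "--APP"), ("AWAP", "A-PP")):
        print(rows, score_alignment(*rows))
```

Raising the internal gap penalty lowers only the third score, and raising the terminal gap penalty lowers mainly the second, which matches the "best if" notes on the slide.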

Page 19:

How to find the alignment with the best score?

Page 20:

Finding alignment with best score

Brute force approach = calculating the scores of all possible alignments and selecting the best ones.

For two 1000 bp DNA sequences, the number of possible alignments is on the order of 10^600. The brute force approach is impossible.
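The size of that number can be checked directly. The sketch below assumes one common counting convention, in which two sequences of length n have C(2n, n) distinct alignments; other conventions give different but similarly huge counts. The function name is hypothetical.

```python
import math


def alignment_count(n):
    """Number of alignments of two length-n sequences, counted as C(2n, n)."""
    return math.comb(2 * n, n)


if __name__ == "__main__":
    count = alignment_count(1000)
    # Report only the number of decimal digits; the value itself is enormous.
    print("decimal digits:", len(str(count)))
```

Even at a billion alignments scored per second, enumerating a number with roughly 600 digits is out of reach, which motivates dynamic programming.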

Page 21:

Dynamic Programming Methods

Finding the best alignment without calculating all possible alignments.
The method is EXACT.
The original method by Needleman & Wunsch performs global alignment.
The modification by Smith & Waterman performs local alignment.

Page 22:

Needleman & Wunsch method (match = 1, mismatch = 0, gap = -2)

       A    T    G    A    A    C    A    G    T    T
  A    1   -1   -3   -5   -7   -9  -11  -13  -15  -17
  T   -1    2    0   -2   -4   -6   -8  -10  -12  -14
  G   -3    0    3    1   -1   -3   -5   -7   -9  -11
  C   -5   -2    1    3    1    0   -2   -4   -6   -8
  A   -7   -4   -1    2    4    2    1   -1   -3   -5
  G   -9   -6   -3    0    2    4    2    2    0   -2
  T  -11   -8   -5   -2    0    2    4    2    3    1
  T  -13  -10   -7   -4   -2    0    2    4    2    4
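A matrix of this kind can be filled with the standard Needleman-Wunsch recurrence. The sketch below is a minimal illustration, not the code used to produce the slide: the function name is an assumption, the gap penalty is linear, and the extra first row and column hold the cost of leading gaps (they are not shown on the slide).

```python
def needleman_wunsch_matrix(x, y, match=1, mismatch=0, gap=-2):
    """Fill the global alignment score matrix F for sequences x (rows) and y (columns)."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap              # x[:i] aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap              # y[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                F[i - 1][j] + gap,     # gap in y
                F[i][j - 1] + gap,     # gap in x
            )
    return F


if __name__ == "__main__":
    F = needleman_wunsch_matrix("ATGCAGTT", "ATGAACAGTT")
    for row in F[1:]:
        print(" ".join(f"{v:4d}" for v in row[1:]))
    print("global score:", F[-1][-1])
```

The bottom-right cell is the score of the best global alignment; for these two sequences it comes out as 4, in agreement with the slide. Recovering the alignment itself would need a traceback step, which is omitted here.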

Page 23:

Local Alignment with the Smith-Waterman Algorithm

Adding one modification: any negative score is changed to 0. That is, alignment will not be done unless the score is positive.

Page 24:

Smith-Waterman method (match = 1, mismatch = 0, gap = -2)

       A    T    G    A    A    C    A    G    T    T
  A    1    0    0    1    1    0    1    0    0    0
  T    0    2    0    0    1    1    0    1    1    1
  G    0    0    3    1    0    1    1    1    1    1
  C    0    0    1    3    1    1    1    1    1    1
  A    1    0    0    2    4    2    2    1    1    1
  G    0    1    0    0    2    4    2    3    1    1
  T    0    1    1    0    0    2    4    2    4    2
  T    0    1    1    1    0    0    2    4    3    5
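The single modification described for Smith-Waterman, flooring every cell at 0, is a one-line change to the global recurrence. This is a hedged sketch with assumed function names and a linear gap penalty; individual cells are not guaranteed to match a hand-filled table, so only the best score is reported.

```python
def smith_waterman_matrix(x, y, match=1, mismatch=0, gap=-2):
    """Fill the local alignment score matrix H; no cell may drop below 0."""
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                0,                     # restart the local alignment here
                H[i - 1][j - 1] + s,   # extend diagonally
                H[i - 1][j] + gap,     # gap in y
                H[i][j - 1] + gap,     # gap in x
            )
    return H


def best_local_score(H):
    """The local alignment score is the largest cell anywhere in H."""
    return max(max(row) for row in H)


if __name__ == "__main__":
    H = smith_waterman_matrix("ATGCAGTT", "ATGAACAGTT")
    print("best local score:", best_local_score(H))
```

For the two slide sequences the best local score is 5, the same maximum that appears in the bottom-right corner of the slide's matrix. The traceback would start from that cell and stop at the first 0.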

Page 25:

Smith-Waterman method (match = 1, mismatch = 0, gap = -2)

[Matrix repeated from Page 24]

Page 26:

Scoring schemes

Dynamic programming guarantees correct results for each scoring scheme, but the biological basis of the scoring schemes is weak, except for the fact that insertions/deletions are rarer than substitutions and are scored accordingly.

Page 27:

Match-Mismatch Score

DNA
  Transitions are more frequent than transversions (e.g., for M. tuberculosis SNPs, ~2:1) and can be scored accordingly.
  In practice, base transitions and transversions are usually scored equally.

Proteins
  Substitution matrix such as PAM or BLOSUM

Page 28:

Transitions & Transversions

Transition: a nucleotide substitution from one purine to another purine (e.g., A->G), or from one pyrimidine to another pyrimidine (e.g., T->C).

Transversion: a nucleotide substitution from a purine to a pyrimidine (e.g., A->C), or vice versa (e.g., T->G).
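The two definitions reduce to a set-membership test. A small sketch, with a hypothetical function name and DNA bases only (U is not handled):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify a pair of DNA bases as identical, transition or transversion."""
    a, b = a.upper(), b.upper()
    if not {a, b} <= PURINES | PYRIMIDINES:
        raise ValueError("expected bases from A, C, G, T")
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"      # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"        # purine<->pyrimidine


if __name__ == "__main__":
    for pair in ("AG", "TC", "AC", "TG"):
        print(pair, classify_substitution(*pair))
```

A scoring scheme that treats the two classes differently could call this function when choosing the mismatch score.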

Page 29:

Transitions & Transversions

[Figure: purines and pyrimidines]

Page 30:

Gap penalty

Linear model: penalty = (gap extension penalty) x k
Affine model: penalty = (gap opening penalty) + (gap extension penalty) x k
k = gap length

More biologically realistic models need gap penalty functions in which the cost of each added position decreases with gap length, such as (gap opening penalty) + log k. Computational complexity prohibits their common use.

Page 31:

More advanced scoring systems

Position-dependent scores: use a different matrix (and penalty) at different positions in proteins. The functional importance of protein regions affects divergence.
Structure-dependent scores.

Page 32:

Software providing ALIGNMENT tools

MATLAB: Bioinformatics toolbox
  [GlobalScore, GlobalAlignment] = nwalign(humanProtein, mouseProtein)
  … swalign
  showalignment(GlobalAlignment)

ORACLE 10g BLAST functions: blastn, blastp, blastx, etc.

Page 33:

Types of Algorithms

Heuristic
  A heuristic is an algorithm that will yield reasonable results, even if it is not provably optimal or lacks even a performance guarantee.
  In most cases, heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.

Dynamic Programming
  An algorithm for finding optimal alignments given an additive alignment score.
  These types of algorithms are guaranteed to find the optimal scoring alignment or set of alignments.

HMM
  Based on probability theory; very versatile.

Page 34:

http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

Page 35:

Hidden Markov Model (HMM)

Page 36:

Markov chain

A chain of events in which the probability of each event depends only on the preceding event.

Assumption: DNA can be viewed as a Markov chain. The probability of A, T, G, or C appearing in each position depends on the kind of nucleotide in the preceding position.

Page 37:

A Markov chain is defined by

P(A|A) = probability of a base being A if the preceding base is A.
P(T|G) = probability of a base being T if the preceding base is G.
And so on. So a DNA Markov chain is defined by 16 probabilities.
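The 16 probabilities can be estimated by counting adjacent base pairs in a sequence. The sketch below is one simple way to do it, not the lecture's method: the function names are hypothetical, and the pseudocount of 1 (added so that unseen pairs do not get probability 0) is an assumption.

```python
import math
from collections import Counter

BASES = "ACGT"


def estimate_transitions(seq, pseudocount=1.0):
    """Return {(prev, nxt): P(nxt | prev)} for all 16 base pairs."""
    counts = Counter(zip(seq, seq[1:]))       # adjacent (prev, nxt) pairs
    probs = {}
    for prev in BASES:
        total = sum(counts[(prev, nxt)] for nxt in BASES) + 4 * pseudocount
        for nxt in BASES:
            probs[(prev, nxt)] = (counts[(prev, nxt)] + pseudocount) / total
    return probs


def log_probability(seq, probs):
    """Log-probability of the transitions in seq (the first base is not scored)."""
    return sum(math.log(probs[pair]) for pair in zip(seq, seq[1:]))


if __name__ == "__main__":
    training = "ATTACTGGCGGCCGCGTCGATCTG"     # example sequence, upper-cased
    probs = estimate_transitions(training)
    print("P(A|A) =", round(probs[("A", "A")], 3))
    print("P(T|G) =", round(probs[("G", "T")], 3))
    print("log P  =", round(log_probability(training, probs), 3))
```

Note the key order: `probs[("G", "T")]` holds P(T|G), the probability of T given that the preceding base is G.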

Page 38:

Markov Chain Model of DNA. Each arrow is defined by a transition probability.

[Figure: four states A, G, T, C connected by arrows]

Page 39:

Hidden Markov Model

Hidden: state path, e.g.,
NNNNNNNNCCCCCCCCCCCNNNNN

Not hidden: DNA sequence, e.g.,
attactggcggccgcgtcgatctg

The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

Page 40:

Algorithm to find Most Probable State Path (Decoding)

If parameters are known:
  Viterbi algorithm
  Posterior decoding
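Viterbi decoding can be sketched for a two-state model with N (non-coding) and C (coding) states. Every number below is a made-up illustration, not a trained parameter set: the coding state is simply assumed to prefer g and c. The computation is done in log space to avoid underflow on long sequences.

```python
import math

# Hypothetical parameters, for illustration only.
STATES = "NC"
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"a": 0.4, "t": 0.4, "g": 0.1, "c": 0.1},
    "C": {"a": 0.1, "t": 0.1, "g": 0.4, "c": 0.4},
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observed sequence."""
    # scores[s] = best log-probability of any path ending in state s
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []                                  # back[t][s] = best previous state
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    dna = "attactggcggccgcgtcgatctg"
    print(dna)
    print(viterbi(dna, STATES, START_P, TRANS_P, EMIT_P))
```

Posterior decoding would instead use the forward and backward algorithms to pick the most probable state at each position separately; it is not shown here.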

Page 41:

Estimation of parameters

Usually a “training set” of sequences is required.

The “training set” may be
  sequences of known state, or
  sequences of unknown state. Parameters are arbitrarily set and reiterated until state changes are minimal.

Page 42:

HMM for identifying coding DNA sequences

[Figure: two four-state blocks (A, G, T, C), one labeled Coding (exon) and one labeled Non-Coding (intron)]

Page 43:

Hidden Markov Model for Coding Sequence predictions

Hidden: state path (I = intron, X = exon), e.g.,
IIIIIIIIXXXXXXXXXXXXIIIIIIIIIIIIIIIIIIIIIIIIXXXXXXXXXXXX

Not hidden: DNA sequence, e.g.,
attactggcggccgcgtcgatctgggtcttaggtadtgtacggcccctcgtaggca

The question is to find the most probable (hidden) state path when the (non-hidden) sequence is known.

Page 44:

Training sets for coding sequence prediction

Best come from experimental works
Best come from the same species

Page 45:

HMM for Spliced Alignment (between genomic and EST sequences)

[Figure: paired states A/A, G/G, T/T, C/C labeled Paired (exon); unpaired states A, G, T, C labeled Unpaired (intron)]

Page 46:

Selection of Alignment Programs

Global vs local
Pairwise (1-1), database searching (1-many), module searching (1-1 many loci), multiple
Distance between query and database
Number of queries, size of databases
Exact vs heuristic

Page 47:

Multiple sequence alignment

Multiple sequence alignment
  Dynamic programming: restricted to 3-4 sequences at most.
  Progressive sequence alignment: ClustalW, X.
  Divide and conquer methodology
  HMM
  Others

Constructing common patterns
  Consensus: TATAAT
  Weight matrix
  Input (from training set) for HMM methods
  Input for PSI-BLAST

Page 48:

Multiple Sequence Alignments: Creation and Analysis

Chapter 12, B&O - Protein Alignment
What is a Multiple Alignment?
Structural or Evolutionary? (do not necessarily correspond, not really possible)
How to multiply align?
How to generate alignments?
Tools

Page 49:

Significance of an Alignment Score

Statistical methods used to evaluate the significance of an alignment score: Z-score, P-value and E-value

Significance of score
  Z-score = (score - mean) / std. dev
    Measures how unusual our original match is. Z ≥ 5 are significant.
  P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores.)
    P < 10^-100: exact match.
  E-value is the expected number of sequences that give the same Z-score or better. (E = P x size of the database)
    E < 0.02: sequences probably homologous.
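The Z-score and E-value formulas are simple enough to compute by hand, but a short sketch makes the inputs explicit. The assumptions here: the mean and standard deviation come from a list of scores obtained against random (for example, shuffled) sequences, the population standard deviation is used, and the function names are hypothetical.

```python
import statistics


def z_score(score, random_scores):
    """Z = (score - mean) / std. dev of the random-score distribution."""
    mean = statistics.mean(random_scores)
    std_dev = statistics.pstdev(random_scores)   # population std. dev (assumption)
    return (score - mean) / std_dev


def e_value(p_value, database_size):
    """E = P x size of the database."""
    return p_value * database_size


if __name__ == "__main__":
    background = [2, 4, 4, 4, 5, 5, 7, 9]        # made-up random alignment scores
    print("Z =", z_score(10, background))
    print("E =", e_value(1e-6, 1000))
```

Turning a Z-score into a P-value requires a model of the score distribution, which the slide notes but does not specify, so that step is left out.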

Page 50:

Aligning more than 2 sequences

Sequences should not be very different in length.
Sequences should be edited down to the regions that are most similar (PSI-BLAST does it automatically, but not all tools do).
Random alignment of pairs of sequences helps in assessing similarities.

Page 51:

Multiple Sequence Alignment

The N-W or S-W algorithms can be generalized to >2 sequences. Their computational complexity precludes their use for >3 sequences.
Heuristic approaches, e.g. the progressive alignment method, are required.
These methods cannot guarantee the best multiple alignment, but in most cases give biologically meaningful results.

Page 52:

Progressive Alignment Method

Each pair of sequences is aligned (e.g. by the N-W method).
The similarity in each pair is used for constructing a dendrogram relating the sequences.
The most similar sequences are aligned first.
Then the next most similar sequences or clusters of sequences are sequentially aligned.
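The first two steps, pairwise scoring and building a joining order, can be sketched as follows. This is only an outline of the idea and not how Clustal works internally: the pairwise score is a plain Needleman-Wunsch score with match = 1, mismatch = 0, gap = -2, clusters are joined greedily by average pairwise score, and the actual profile alignment at each join is left out. All names are hypothetical.

```python
from itertools import combinations


def nw_score(x, y, match=1, mismatch=0, gap=-2):
    """Global alignment score, keeping only one row of the matrix at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def guide_order(seqs):
    """Return the joins in order, most similar first, as pairs of index tuples."""
    sim = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]])
           for p in combinations(range(len(seqs)), 2)}

    def average_similarity(c1, c2):
        return sum(sim[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(seqs))]
    joins = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: average_similarity(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)       # the joined cluster replaces its two parts
        joins.append((c1, c2))
    return joins


if __name__ == "__main__":
    sequences = ["ATGCAGTT", "ATGGAGTT", "AGTTATGC"]
    for step, (c1, c2) in enumerate(guide_order(sequences), start=1):
        print(f"step {step}: join {c1} with {c2}")
```

The list of joins plays the role of the dendrogram: each join is where a progressive aligner would align two sequences, or a sequence to an existing cluster.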

Page 53:

Progressive Alignment Method

A popular program is the Clustal series.
ClustalV: aligns up to 30 sequences; penalizes a left terminal gap but not a right terminal gap.
ClustalW: aligns up to 100 sequences; does not penalize terminal gaps.
ClustalX: Windows-based.

Page 54:

Comparing a sequence with a profile of a group of sequences

Testing arm of HMM.
Searching arm of PSI-BLAST: for a more sensitive search of homologous sequences.
With a profile of protein sequences, for comparative molecular modeling.

Page 55:

PSI-BLAST

A profile search method.
A query sequence is used to search for similar sequences.
The sequences are used to generate a sequence profile.
The profile is again searched against the databases.
The method increases the sensitivity of the search over normal BLAST. False positives can be a problem.

Page 56:

BLAST

Basic Local Alignment Search Tool
Altschul et al, 1990
Heuristic

Page 57:

Makes list of words

Fixed-length subsequences
  Default 3 for protein, or 11 for nucleotides
Keeps words that match the query with a score above some threshold
  See file “triples.ss” for some discussion of thresholds
Searches database for words in this set

Page 58:

When it finds a sequence containing a word in the set

Uses that as a “seed” for hit extension
  In both directions
  Extending the possible match as an ungapped alignment
After version 2.0, BLAST can handle gaps

Page 59:

FASTA Idea

Idea: a good alignment probably matches some identical ‘words’ (ktups)

Example:
Database record:
ACTTGTAGATACAAAATGTG
Aligned query sequence:
A-TTGTCG-TACAA-ATCTGT
Matching words of size 4

Page 60:

Dictionaries of Words

ACTTGTAGATAC is translated to the dictionary:
ACTT,
CTTG,
TTGT,
TGTA
…

Dictionaries of well aligned sequences share words.

Page 61:

FASTA Stage I

Prepare dictionary for db sequence (in advance)
Upon query:
  Prepare dictionary for query sequence
  For each DB record:
    Find matching words
    Search for long diagonal runs of matching words
    Init-1 score: longest run
    Discard record if low score

[Figure: matching words (*) plotted by position in query against position in DB record]
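Stage I can be sketched as a word dictionary plus a count of matches per diagonal. This is a simplification, not the FASTA implementation: it counts matching words on each diagonal (DB position minus query position) rather than scoring the best run, and the function names are hypothetical. The sequences are the database record and the ungapped query from the FASTA Idea slide.

```python
from collections import Counter, defaultdict


def build_dictionary(seq, k):
    """Map every word of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(db_seq, query, k):
    """Count matching words per diagonal, where diagonal = DB position - query position."""
    db_index = build_dictionary(db_seq, k)       # prepared in advance in FASTA
    hits = Counter()
    for q in range(len(query) - k + 1):
        for d in db_index.get(query[q:q + k], []):
            hits[d - q] += 1
    return hits


if __name__ == "__main__":
    db_record = "ACTTGTAGATACAAAATGTG"
    query = "ATTGTCGTACAAATCTGT"                 # slide query with gaps removed
    for diagonal, count in sorted(diagonal_hits(db_record, query, k=4).items()):
        print(f"diagonal {diagonal:+d}: {count} matching word(s)")
```

Matches that share a diagonal correspond to an ungapped stretch of the alignment; neighboring diagonals that both contain matches hint at a gap between the runs, which is what Stage II connects.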

Page 62: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

FASTA stage IIFASTA stage II

Good alignment – path Good alignment – path through many runs, withthrough many runs, withshort short connectionsconnections

Assign weights to runs(+)Assign weights to runs(+)and connections(-)and connections(-)

Find a path of max weightFind a path of max weight Init-n scoreInit-n score – total path – total path

weightweight Discard record if low scoreDiscard record if low score

Page 63: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

FASTA Stage III

Improve Init-1. Apply an exact algorithm around the Init-1 diagonal, within a band of a given width.

Init-1 Opt-score: the new weight.

Discard the record if the score is low.
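A rough sketch of a banded exact algorithm: local (Smith-Waterman style) dynamic programming that only fills cells within `width` of a chosen diagonal. The scoring values and the linear gap penalty are assumptions for illustration, not FASTA's actual parameters.

```python
def banded_local_score(a, b, diagonal, width, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(j - i) - diagonal| <= width.

    i indexes a and j indexes b (1-based inside the DP). Cells outside the
    band are treated as score 0, i.e. as places where an alignment may start.
    """
    prev = {}                                   # scores of the previous row
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i + diagonal - width)
        hi = min(len(b), i + diagonal + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,     # align a[i-1] with b[j-1]
                        prev.get(j, 0) + gap,       # a[i-1] against a gap
                        cur.get(j - 1, 0) + gap)    # b[j-1] against a gap
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best
```

With `width=0` only the diagonal itself is scored, so no gaps are possible; widening the band lets the alignment step onto neighbouring diagonals at the cost of more cells.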

Page 64: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

FASTA final stage

Apply an exact algorithm to the surviving records, computing the final alignment score.

Page 65: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

BLAST (Basic Local Alignment Search Tool): Approximate Matches

BLAST:

Words are allowed to match inexactly.

Example:

In the polypeptide sequence IHAVEADREAM, the 4-long word HAVE starting at position 2 may match

HAVE, RAVE, HIVE, HALE, …

Page 66: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Approximate Matches

For each word of length w from a database, generate all similar words.

"Similar" means: score(word, word') > T

Store all similar words in a look-up table.
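A small sketch of generating similar words and storing them in a look-up table. The scoring is a toy assumption (+5 for identical letters, -1 otherwise), standing in for a real substitution matrix, and the names `toy_score`, `neighborhood` and `lookup_table` are hypothetical. The generation is brute force, so it is only practical for short words.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(w1, w2, match=5, mismatch=-1):
    """Toy word score: sum of per-position match/mismatch values."""
    return sum(match if x == y else mismatch for x, y in zip(w1, w2))


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length with score(word, word') > threshold."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return {cand for cand in candidates if score(word, cand) > threshold}


def lookup_table(seq, w, threshold):
    """Map every similar word to the 0-based positions of the words it came from."""
    table = {}
    for i in range(len(seq) - w + 1):
        for similar in neighborhood(seq[i:i + w], threshold):
            table.setdefault(similar, []).append(i)
    return table


similar = neighborhood("HAVE", threshold=13)
print("RAVE" in similar, "HIVE" in similar, "HALE" in similar)   # True True True
```

With this toy scoring and T = 13, the neighborhood of HAVE is the word itself plus every single-letter substitution. A real matrix would admit only the conservative substitutions.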

Page 67: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

DB search

1) For each word of length w from the query sequence, generate all similar words.

2) Access the DB.

3) Extend each hit as much as possible -> High-scoring Segment Pair (HSP), with

score(HSP) > V

THEFIRSTLINIHAVEADREAMESIRPATRICKREAD

INVIEIAMDEADMEATTNAMHEWASNINETEEN
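Step 3 can be sketched as an ungapped extension in both directions from the seed word. The match/mismatch values and the `x_drop` stopping rule are assumptions for illustration, and `extend_hit` is a hypothetical name.

```python
def extend_hit(query, subject, qi, si, w, match=5, mismatch=-4, x_drop=10):
    """Extend a seed of length w at query[qi:], subject[si:] without gaps.

    Returns (score, q_start, q_end): the best segment score and the
    half-open query interval it covers. Extension in one direction stops
    once the running score falls more than x_drop below the best seen.
    """
    def pair(a, b):
        return match if a == b else mismatch

    seed = sum(pair(query[qi + k], subject[si + k]) for k in range(w))

    # Extend to the right of the seed.
    best_right = running = 0
    q_end = qi + w
    i, j = qi + w, si + w
    while i < len(query) and j < len(subject):
        running += pair(query[i], subject[j])
        if running > best_right:
            best_right, q_end = running, i + 1
        elif best_right - running > x_drop:
            break
        i, j = i + 1, j + 1

    # Extend to the left of the seed.
    best_left = running = 0
    q_start = qi
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0:
        running += pair(query[i], subject[j])
        if running > best_left:
            best_left, q_start = running, i
        elif best_left - running > x_drop:
            break
        i, j = i - 1, j - 1

    return seed + best_left + best_right, q_start, q_end


score, start, end = extend_hit("TTIHAVEADREAMGG", "CCCIHAVEADREAMKK", 3, 4, 4)
print(score, start, end)   # 55 2 13
```

The segment is reported as an HSP only if its score exceeds the cutoff V.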

Page 68: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

DB search

4) Around the HSP, perform dynamic programming (DP).

At each step the alignment score should be > T.

[Figure: DP region around the HSP, with axes s-query and s-db and the starting point (seed pair) marked.]

Page 69: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

B&O, chapter 12: HIERARCHICAL METHODS

Some of the most accurate practical methods.

They work by finding a guide tree and using it to build the alignment.

ClustalW is a hierarchical multiple alignment program. It uses a series of different pair-score matrices, biases the location of gaps, and allows already aligned sequences to be realigned.

T-Coffee builds a library of pairwise alignments.

PSI-BLAST is a "profile-based" method.

Page 70: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Why do we do multiple alignments?

Multiple nucleotide or amino acid sequence alignments are usually performed for one of the following purposes:

– To characterize protein families by identifying shared regions of homology in a multiple sequence alignment (this generally happens when a sequence search has revealed homologies to several sequences);

– To determine the consensus sequence of several aligned sequences.

Page 71: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Why do we do multiple alignments?

– To help predict the secondary and tertiary structures of new sequences;

– As a preliminary step in molecular evolution analysis, using phylogenetic methods to construct phylogenetic trees.

Page 72: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
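As an illustration of what such an alignment gives you, the sketch below computes a simple majority consensus and the fully conserved columns. It assumes the slide's alignment splits into eight rows of 27 columns, as typed in below. The function names are hypothetical and ties between residues are broken arbitrarily.

```python
from collections import Counter

ROWS = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def consensus(rows):
    """Most common non-gap residue in each column ('-' if the column is all gaps)."""
    result = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        result.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(result)


def conserved_columns(rows):
    """0-based indices of columns where every row has the same non-gap residue."""
    return [i for i, column in enumerate(zip(*rows))
            if len(set(column)) == 1 and column[0] != "-"]


print(conserved_columns(ROWS))   # [4, 20]
```

The two fully conserved columns hold the C and the W shared by all eight rows.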

Page 73: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Multiple Alignment Method

The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.

The principle is that a multiple alignment is achieved by successive application of pairwise methods.

Page 74: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Multiple Alignment Method

The steps are summarized as follows:

– Compare all sequences pairwise.

– Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.

– Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
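The first two steps can be sketched as follows. Note the substitution: instead of real pairwise alignment scores, this uses the similarity ratio from Python's `difflib.SequenceMatcher` as a stand-in. The clustering is plain average linkage, and the sequences and the name `guide_order` are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in for a pairwise alignment score (difflib ratio, 0.0 to 1.0)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_order(seqs):
    """Return the order in which clusters of sequences should be aligned.

    seqs: dict of name -> sequence. Each merge is a pair of name tuples;
    the most similar pair of clusters (by average similarity) merges first.
    """
    sim = {frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(seqs, 2)}

    def average(c1, c2):
        return sum(sim[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


example = {"A": "ACGTACGTACGT", "B": "TTGGCCTTGG",
           "C": "ACGTACGTACGA", "D": "TTGGCCTTGC"}
for left, right in guide_order(example):
    print(left, "+", right)
```

For these toy sequences the order is A with C, then B with D, then the two pairs together, mirroring the A, B, C, D example in the slide. The third step (aligning alignments with averaged column scores) is not shown.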

Page 75: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Choosing sequences for alignment

General considerations

The more sequences to align, the better.

Don't include similar (>80%) sequences.

Sub-groups should be pre-aligned separately, and one member of each subgroup should be included in the final multiple alignment.
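The 80% rule can be applied with a simple greedy filter. As a stand-in for percent identity from a real pairwise alignment, this sketch uses the `difflib.SequenceMatcher` ratio. The function name and the example sequences are invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for percent identity (difflib ratio, 0.0 to 1.0)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def drop_near_duplicates(seqs, cutoff=0.80):
    """Keep a sequence only if it is at most `cutoff` similar to every kept one.

    seqs: dict of name -> sequence, visited in insertion order, so the first
    member of a group of near-duplicates is the one that survives.
    """
    kept = {}
    for name, seq in seqs.items():
        if all(similarity(seq, other) <= cutoff for other in kept.values()):
            kept[name] = seq
    return kept


candidates = {"s1": "ACGTACGTACGT", "s2": "ACGTACGTACGA", "s3": "TTGGCCTTGGAA"}
print(list(drop_near_duplicates(candidates)))   # ['s1', 's3']
```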

Page 76: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Multiple alignment in GCG

The program available in GCG for multiple alignment is Pileup.

The input file for Pileup is a list of sequence file names or sequence codes in the database, created with a text editor.

Pileup creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment.

Please note that there is no one absolute alignment, even for a limited number of sequences.

Page 77: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Output of Pileup

//

          1
OATNFA1   ~~~~~~~~~~ ~~~~~~~~~~ ~GGCCAAGAG
OATNFAR   ~~~~~GGGAC ACCAGGGGAC CAGCCAAGAG
BSPTNFA   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
CEU14683  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
HSTNFR    ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~GCAGA
SYNTNFTRP AGCAGACGCT CCCTCAGCAA GGACAGCAGA
CATTNFAA  ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
CFTNFA    ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
RABTNFM   ~~~~AAGCTC CCTCAGTGAG GACACGGGCA
RNTNFAA   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~

Page 78: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ShadyBox Output

Page 79: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Multiple alignment programs

ClustalW / ClustalX

pileup

multalign

multal

saga

hmmt

DIALIGN

SBpima

MLpima

T-Coffee

...

Page 83: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Global methods (e.g., ClustalX) get into trouble when data is not globally related!!!

[Figure: ClustalX alignment]

Possible solutions:

(1) Cut out conserved regions of interest and THEN align them.

(2) Use a method that deals with local similarity (e.g., DIALIGN).

Page 84: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW - for multiple alignment

ClustalW is a general-purpose multiple alignment program for DNA or proteins.

ClustalW is produced by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.

ClustalW is cited as: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.

Page 85: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW - for multiple alignment

ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.

Alignment can be done by 2 methods:

- slow/accurate

- fast/approximate

Page 86: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Running ClustalW

[~]% clustalw

 **************************************************************
 ******** CLUSTAL W (1.7) Multiple Sequence Alignments ********
 **************************************************************

     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile / Structure Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)

Your choice:

Page 87: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Running ClustalW

The input file for ClustalW is a file containing all sequences in one of the following formats: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.

Page 88: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Using ClustalW

 ****** MULTIPLE ALIGNMENT MENU ******

    1. Do complete multiple alignment now (Slow/Accurate)
    2. Produce guide tree file only
    3. Do alignment using old guide tree file

    4. Toggle Slow/Fast pairwise alignments = SLOW

    5. Pairwise alignment parameters
    6. Multiple alignment parameters

    7. Reset gaps between alignments? = OFF
    8. Toggle screen display          = ON
    9. Output format options

    S. Execute a system command
    H. HELP
    or press [RETURN] to go back to main menu

Your choice:

Page 89: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Output of ClustalW

CLUSTAL W (1.7) multiple sequence alignment

HSTNFR    GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
SYNTNFTRP GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
CFTNFA    -------------------------------------------TGTCCAG------ACAG
CATTNFAA  GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
RABTNFM   AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
RNTNFAA   AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
OATNFA1   GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
OATNFAR   GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
BSPTNFA   GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
CEU14683  GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
                                                         **        *
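The line of asterisks under the alignment marks columns in which all sequences carry the same residue. A minimal sketch of that rule, applied to the last 17 columns of four of the sequences above (a subset, so more columns come out as conserved than in the full ten-sequence alignment); `conservation_line` is a hypothetical name.

```python
def conservation_line(rows):
    """Return '*' under every column where all rows share one non-gap residue."""
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*rows)
    )


rows = [
    "GGCCCAG------GCAG",   # HSTNFR, last 17 columns
    "TGTCCAG------ACAG",   # CFTNFA
    "GGCCCAGATGGTCACCC",   # RABTNFM
    "GGTTCAA------ACAC",   # BSPTNFA
]
for row in rows:
    print(row)
print(conservation_line(rows))
```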

Page 90: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW options

Your choice: 5

 ********* PAIRWISE ALIGNMENT PARAMETERS *********

     Slow/Accurate alignments:

     1. Gap Open Penalty       :15.00
     2. Gap Extension Penalty  :6.66
     3. Protein weight matrix  :BLOSUM30
     4. DNA weight matrix      :IUB

     Fast/Approximate alignments:

     5. Gap penalty            :5
     6. K-tuple (word) size    :2
     7. No. of top diagonals   :4
     8. Window size            :4

     9. Toggle Slow/Fast pairwise alignments = SLOW

     H. HELP

Enter number (or [RETURN] to exit):

Page 91: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalW options

Your choice: 6

 ********* MULTIPLE ALIGNMENT PARAMETERS *********

     1. Gap Opening Penalty              :15.00
     2. Gap Extension Penalty            :6.66
     3. Delay divergent sequences        :40 %

     4. DNA Transitions Weight           :0.50

     5. Protein weight matrix            :BLOSUM series
     6. DNA weight matrix                :IUB
     7. Use negative matrix              :OFF

     8. Protein Gap Parameters

     H. HELP

Enter number (or [RETURN] to exit):

Page 92: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalX - Multiple Sequence Alignment Program

ClustalX provides a new window-based user interface to the ClustalW program.

It uses the Vibrant multi-platform user interface development library, developed by the National Center for Biotechnology Information (Bldg 38A, NIH, 8600 Rockville Pike, Bethesda, MD 20894) as part of their NCBI SOFTWARE DEVELOPMENT TOOLKIT.

Page 93: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

ClustalX

[Slides 93 to 98: ClustalX screenshots.]

Page 99: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Blocks database and tools

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.

The Blocks web server tools are: Block Searcher, Get Blocks and Block Maker. These are aids to detection and verification of protein sequence homology.

They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively.

Page 100: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

The BLOCKS web server

At URL: http://blocks.fhcrc.org/

The BLOCKS WWW server can be used to create blocks from a group of sequences, or to compare a protein sequence to a database of blocks.

The Blocks Searcher tool should be used for multiple alignment of distantly related protein sequences.

Page 101: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

The Blocks Searcher tool

For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position.

This procedure is carried out exhaustively for all positions of the sequence for all blocks in the database, and the best alignments between a sequence and entries in the BLOCKS database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents.
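The sliding procedure can be sketched as below. The profile here is an assumption for illustration: each column simply counts how often a residue appears in that column of a toy block, whereas the real BLOCKS profiles use weighted scores. The names `block_profile` and `scan_block` are hypothetical.

```python
from collections import Counter


def block_profile(block_rows):
    """One Counter per block column: residue -> number of occurrences."""
    return [Counter(column) for column in zip(*block_rows)]


def scan_block(sequence, profile):
    """Slide the block along the sequence and sum the column scores at each start.

    Returns (best_score, start) for the best position, or None if the
    sequence is shorter than the block.
    """
    width = len(profile)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(column.get(residue, 0)
                    for column, residue in zip(profile, window))
        scores.append((total, start))
    return max(scores) if scores else None


profile = block_profile(["HAVE", "HIVE", "HALE"])     # toy block
print(scan_block("IHAVEADREAM", profile))             # (10, 1)
```

Running `scan_block` for every block in a database and keeping the top scores corresponds to the exhaustive search described above.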

Page 102: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

The Blocks Searcher tool

Typically, a group of proteins has more than one region in common, and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and it is further strengthened if a third block also scores highly, and so on.

Page 103: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

The BLOCKS Database

The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

Page 104: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

The Block Maker Tool

Block Maker finds conserved blocks in a group of two or more unaligned protein sequences, which are assumed to be related, using two different algorithms.

The input file must contain at least 2 sequences.

Input sequences must be in FastA format.

Results are returned by e-mail.

Page 105: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

vs Clustal – room for improvement

Gaps are consistent with the phylogenetic tree

Page 106: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

CLUSTALW-artifacts?

Gaps are largely inconsistent with the phylogenetic tree

Page 107: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Misc Links

http://hits.isb-sib.ch/cgi-bin/PFSCAN

http://www.soe.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

http://server1-kimlab.stanford.edu/cgi-bin/index.cgi?BigFigures+ZahnFig4

Page 108: Design and creation of multiple sequence alignments Unit 13 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Misc Links

CAMDA competition --- http://www.camda.duke.edu/

Baylor College Sequencing Center

http://www.hgsc.bcm.tmc.edu/projects/rmacaque/

http://www.hgsc.bcm.tmc.edu/projects/chimpanzee/