Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund...

Post on 11-Jan-2016


Pairwise Alignment

How do we tell whether two sequences are similar?

BIO520 Bioinformatics Jim Lund

Assigned reading: Ch 4.1-4.7, Ch 5.1; get what you can out of 5.2, 5.4

Pairwise alignment

• DNA:DNA

• polypeptide:polypeptide

The BASIC Sequence Analysis Operation

Alignments

• Pairwise sequence alignments

– One-to-One

– One-to-Database

• Multiple sequence alignments

– Many-to-Many

Origins of Sequence Similarity

• Homology

– Common evolutionary descent

• Chance

– Short similar segments are very common.

• Similarity in function

– Convergence (very rare)

Visual sequence comparison: Dotplot

Visual sequence comparison: Filtered dotplot

4 bp window, 75% identity cutoff

Visual sequence comparison: Dotplot

4 bp window, 75% identity cutoff
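The filtered dotplot can be sketched in a few lines of Python. This is a minimal illustration, not the program used to draw the slides; the names `filtered_dotplot` and `render` and their parameters are assumptions made for this example. A dot is placed at (i, j) when the windows starting at position i of one sequence and position j of the other agree in at least the cutoff fraction of positions.

```python
def filtered_dotplot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return (i, j) window start positions that match at >= min_identity."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(a == b for a, b in zip(seq_a[i:i + window],
                                                 seq_b[j:j + window]))
            if matches / window >= min_identity:
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    marked = set(dots)
    rows = ["  " + seq_a]
    for j, base in enumerate(seq_b):
        rows.append(base + " " + "".join(
            "*" if (i, j) in marked else "." for i in range(len(seq_a))))
    return "\n".join(rows)


print(render("GAACAAT", "GAATAAT", filtered_dotplot("GAACAAT", "GAATAAT")))
```

With a 4 bp window and a 75% cutoff, one mismatch per window is tolerated, so a single substitution does not break the diagonal.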

Dotplots of sequence rearrangements

Assessing similarity

GAACAAT
|||||||    7/7, or 100%
GAACAAT

GAACAAT
|          1/7, or 14% (the same two sequences, offset so that only one position matches)
GAACAAT

Which is BETTER? How do we SCORE?

Similarity

GAACAAT
|||||||    7/7, or 100%
GAACAAT

GAACAAT
||| |||    6/7, or 86%
GAATAAT

MISMATCH

Mismatches

GAACAAT
||| |||    6/7, or 86%
GAATAAT

GAACAAT
||| |||    6/7, or 86%
GAAGAAT

Terminal Mismatch

     GAACAATttttt
     ||| |||          6/7, or 86%
aaaccGAATAAT

INDELS

GAAgCAAT
||| ||||    7/7, or 100%
GAA*CAAT

Indels, cont’d

GAAgCAAT
||| ||||
GAA*CAAT

GAAggggCAAT
|||    ||||
GAA****CAAT
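The identity figures on the preceding slides count matches only over columns where both sequences have a base, so indel columns and terminal overhangs drop out of the denominator. A minimal sketch of that bookkeeping follows; the function name `percent_identity` is hypothetical, and padding terminal overhangs with "-" is an assumption of this example rather than something the slides specify.

```python
GAP_CHARS = "-*"  # '*' marks an indel on the slides; '-' is the usual gap symbol


def percent_identity(top, bottom):
    """Percent identity over columns where both rows contain a residue."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    compared = matches = 0
    for a, b in zip(top.upper(), bottom.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # indel or terminal overhang: not counted
        compared += 1
        matches += (a == b)
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


print(round(percent_identity("GAACAAT", "GAATAAT")))      # one mismatch
print(round(percent_identity("GAAgCAAT", "GAA*CAAT")))    # one indel
print(round(percent_identity("-----GAACAATttttt",
                             "aaaccGAATAAT-----")))       # terminal overhangs
```

Because gaps are simply skipped here, an alignment with many indels can still report 100%, which is exactly why a gap penalty is needed when scoring.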

Similarity Scoring

Common method:

• Terminal mismatches (0)

• Match score (1)

• Mismatch penalty (-3)

• Gap penalty (-1)

• Gap extension penalty (-1)

DNA Defaults

DNA Scoring

GGGGGGAGAA
|||||*|*||         8(1) + 2(-3) = 2
GGGGGAAAAAGGGGG

GGGGGGAGAA--GGG
|||||*|*||  |||    11(1) + 2(-3) + 1(-1) + 1(-1) = 3
GGGGGAAAAAGGGGG
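The two totals can be checked with a short scoring function. This is a sketch under stated assumptions: the function name `score_alignment` is made up for this example, gaps are written as "-", terminal overhangs are padded with "-" and score 0, and the first position of a gap is charged the gap penalty while each further position is charged the gap extension penalty, which is how the slide's 1(-1) + 1(-1) for a two-base gap is read here.

```python
def score_alignment(top, bottom, match=1, mismatch=-3,
                    gap_open=-1, gap_extend=-1):
    """Score two aligned rows of equal length; terminal gaps cost nothing."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")

    def residue_span(row):
        # Columns between the first and last residue of this row.
        return len(row) - len(row.lstrip("-")), len(row.rstrip("-"))

    start = max(residue_span(top)[0], residue_span(bottom)[0])
    end = min(residue_span(top)[1], residue_span(bottom)[1])

    score = 0
    in_gap = False
    for a, b in zip(top[start:end], bottom[start:end]):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# Ungapped: the five unpaired bases at the end are a terminal overhang.
print(score_alignment("GGGGGGAGAA-----", "GGGGGAAAAAGGGGG"))
# Gapped: the two-base gap buys three more matches.
print(score_alignment("GGGGGGAGAA--GGG", "GGGGGAAAAAGGGGG"))
```

With these defaults the gapped alignment scores one point higher than the ungapped one, matching the arithmetic on the slide.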

Absurdity of Low Gap Penalty

GATCGCTACGCTCAGC
 A.C.C..C..T

Perfect similarity, every time!
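The point of this slide is that when gaps cost nothing, a short sequence "matches perfectly" whenever its bases occur in the same order somewhere in the longer one. A minimal sketch of that test (the function name `aligns_with_free_gaps` is an assumption for illustration):

```python
def aligns_with_free_gaps(short, long):
    """True if every base of `short` can be matched, in order, within `long`."""
    remaining = iter(long)
    # `base in remaining` consumes the iterator up to the first match,
    # so later bases can only match further to the right.
    return all(base in remaining for base in short)


print(aligns_with_free_gaps("ACCCT", "GATCGCTACGCTCAGC"))
```

For short DNA queries against long targets this test succeeds almost every time, so a score built on it carries no information.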

Sequence alignment algorithms

• Local alignment

– Smith-Waterman

• Global alignment

– Needleman-Wunsch

Alignment Programs

• Local alignment (Smith-Waterman)

– BLAST (simplified Smith-Waterman)

– FASTA (simplified Smith-Waterman)

– BESTFIT (GCG program)

• Global alignment (Needleman-Wunsch)

– GAP

Local vs. global alignment

10 gaggc 15
   |||||
 3 gaggc 7

1 gggggaaaaagtggccccc 19
  || |||| ||
1 gggggttttttttgtggtttcc 22

Global alignment: alignment of the full length of the sequences

Local alignment: alignment of regions of substantial similarity
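The difference between the two definitions shows up directly in the dynamic-programming recurrences. The sketch below computes only the optimal score, not the alignment itself, and makes simplifying assumptions that are not from the slides: a single linear gap penalty per gapped position, and a hypothetical function name `alignment_score`. Global (Needleman-Wunsch) scoring charges for every unaligned position from end to end; local (Smith-Waterman) scoring lets any cell restart at 0 and reports the best cell anywhere in the table.

```python
def alignment_score(a, b, match=1, mismatch=-3, gap=-1, local=False):
    """Optimal global (default) or local alignment score of a and b."""
    # First row: aligning a prefix of b against nothing.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may start anywhere
                best = max(best, cell)   # ...and end anywhere
            cur.append(cell)
        prev = cur
    return best if local else prev[-1]


a, b = "CCCGAGGCTTT", "AAGAGGCAA"
print(alignment_score(a, b, local=True))   # scores only the shared GAGGC core
print(alignment_score(a, b))               # also pays for the unrelated flanks
```

Two sequences that share only a short core therefore get a good local score and a poor global one, which is the situation pictured on the slide.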

Local vs. global alignment

BLAST Algorithm

Look for a local alignment, a High-scoring Segment Pair (HSP)

• Find words of length W shared by the query and subject, with score > T.

• Extend the local alignment in both directions until the score falls X below the maximum reached.

• Keep High-scoring Segment Pairs (HSPs) with scores > S.

• Find multiple HSPs per query if present.

• Compute the expectation value (E value) using Karlin-Altschul statistics.
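The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not BLAST's implementation: seeds are exact word matches only (no neighborhood words or threshold T), extension is ungapped, and the function names `find_seeds`, `extend` and `ungapped_hsp` are hypothetical.

```python
def find_seeds(query, subject, w):
    """Return (i, j) pairs where a word of length w matches exactly."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend(query, subject, i, j, step, x_drop, match=1, mismatch=-3):
    """Extend from (i, j) in direction step; return (best score, length)."""
    score = best = best_len = 0
    k = 0
    while 0 <= i + k * step < len(query) and 0 <= j + k * step < len(subject):
        same = query[i + k * step] == subject[j + k * step]
        score += match if same else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # fell more than X below the best score seen: stop
    return best, best_len


def ungapped_hsp(query, subject, i, j, x_drop):
    """Grow a seed at (i, j) both ways; return (score, q_start, q_end)."""
    right_score, right_len = extend(query, subject, i, j, +1, x_drop)
    left_score, left_len = extend(query, subject, i - 1, j - 1, -1, x_drop)
    return right_score + left_score, i - left_len, i + right_len


query, subject = "TTGAACAATCC", "AAGAACAATGG"
seeds = find_seeds(query, subject, w=4)
score, start, end = ungapped_hsp(query, subject, *seeds[0], x_drop=5)
print(seeds[0], score, query[start:end])
```

Indexing the subject's words first is what makes the search fast: only positions that already share a word are ever extended.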

BLAST statistical significance: assessing the likelihood a match occurs by chance

Karlin-Altschul statistic: E = k m N exp(-Lambda S)

m = Size of query sequence

N = Size of database

k = Search space scaling parameter

Lambda = Scoring scaling parameter

S = BLAST HSP score

Low E -> good match
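A direct transcription of the formula shows how E behaves. The function name `expect_value` and the numeric values of k and Lambda below are placeholders chosen for illustration only; in practice both parameters depend on the scoring system in use.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """E = k * m * N * exp(-Lambda * S): chance hits expected at score >= S."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters, NOT values taken from the slides.
K, LAMBDA = 0.1, 0.3
for s in (20, 40, 60, 80):
    print(s, expect_value(s, query_len=300, db_len=1_000_000_000,
                          k=K, lam=LAMBDA))
```

E falls exponentially as the score rises and grows linearly with the database size, so the same HSP is less significant in a larger database.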

BLAST statistical significance:

Rule of thumb for a good match:

• Nucleotide match

– E < 1e-6

– Identity > 70%

• Protein match

– E < 1e-3

– Identity > 25%

Protein Similarity Scoring

• Identity - Easy

• WEAK Alignments

• Chemical Similarity

– L vs I, K vs R…

• Evolutionary Similarity

– How do proteins evolve?

– How do we infer similarities?

BLOSUM62

     C   S   T   P   A   G   N   D
C    9  -1  -1  -3   0  -3  -3  -3
S   -1   4   1  -1   1   0   1   0
T   -1   1   4   1  -1   1   0   1
P   -3  -1   1   7  -1  -2  -1  -1
A    0   1  -1  -1   4   0  -1  -2
G   -3   0   1  -2   0   6  -2  -1
N   -3   1   0  -2  -2   0   6   1
D   -3   0   1  -1  -2  -1   1   6

Single-base evolution changes the encoded AA

CAU=H

CAC=H  CGU=R  UAU=Y
CAA=Q  CCU=P  GAU=D
CAG=Q  CUU=L  AAU=N
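The nine neighbors of CAU can be enumerated mechanically. The sketch below builds the standard genetic code from the usual UCAG-ordered table and lists every codon that differs from a given codon at exactly one position; the function name `single_base_neighbors` is an assumption for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_base_neighbors(codon):
    """Map each codon one substitution away from `codon` to its amino acid."""
    neighbors = {}
    for position, original in enumerate(codon):
        for base in BASES:
            if base != original:
                changed = codon[:position] + base + codon[position + 1:]
                neighbors[changed] = CODON_TABLE[changed]
    return neighbors


for changed, aa in single_base_neighbors("CAU").items():
    print(f"{changed}={aa}")
```

Only one of the nine substitutions (CAC) leaves histidine unchanged, and it is a third-position change; the other eight each produce a different amino acid.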

Substitution Matrices

Two main classes:

• PAM-Dayhoff

• BLOSUM-Henikoff

PAM-Dayhoff

• Built from closely related proteins, substitutions constrained by evolution and function

• “accepted” by evolution (Point Accepted Mutation=PAM)

• 1 PAM corresponds to 1% divergence

• PAM120=closely related proteins

• PAM250=divergent proteins

BLOSUM-Henikoff & Henikoff

• Built from ungapped alignments in proteins: “BLOCKS”

• Within each block, merge sequences that are at least a given % similar into one sequence

• Calculate “target” frequencies

• BLOSUM62=62% similar blocks

– Good general purpose

• BLOSUM30

– Detects weak similarities, used for distantly related proteins


Gapped alignments

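• No general theory for significance of matches!!

• Gap penalty = G + L(n): a gap-opening penalty G plus a length penalty L for each of the n positions in the gap

– Indel mutations are rare

– Variation in gap length is “easy”, so G > L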

Real Alignments

Phylogeny

Cow-to-Pig Protein

  1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL 50
    ||||||||||||| |||||||||||||||||||| |||||||||||||||
  1 MGLSDGEWQLVLNVWGKVEADVAGHGQEVLIRLFKGHPETLEKFDKFKHL 50

 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI 100
    |.| ||||||||||||||||||||||||||||||||.  ||:||| ||||
 51 KSEDEMKASEDLKKHGNTVLTALGGILKKKGHHEAELTPLAQSHATKHKI 100

101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL 150
    |||||||||:||| || .||| ||||||| |||||||||||||||.|| |
101 PVKYLEFISEAIIQVLQSKHPGDFGADAQGAMSKALELFRNDMAAKYKEL 150

151 GFHG 154
    || |
151 GFQG 154

Cow-to-Pig cDNA

  1 CAGCTGTCGGAGACAGACACCCAGTCAGTCCCGCCCTTGTTCTTTTTCTC 50
           | ||| |||    ||    |  ||||| |||| ||| ||||||
  1 .......CAGAGCCAGGACACCCAGTACGCCCGCACTTGCTCTGTTTCTC 43

 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC 100
    |||| ||||||| |||||||||||||||||||||||||||||| ||||||
 44 TTCTGCAGACTGTGCCATGGGGCTCAGCGACGGGGAATGGCAGCTGGTGC 93

101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150
    |||| | |||||||||||||||||||||||||||||||||||||||||||
 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 143

151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA 200
    ||||||||||||||||| |  ||||| |||||||||||||||||||||||
144 GTCCTCATCAGGCTCTTTAAGGGTCACCCCGAGACCCTGGAGAAATTTGA 193

201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250
    |||||| |||||||||||| |||||| ||||||||||||||| |||||||
194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC 243

80% Identity (88% at aa!)

DNA similarity reflects polypeptide similarity

101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150
    |||| | |||||||||||||||||||||||||||||||||||||||||||
 94 TGAACGTCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 143

501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT 550
    | ||||||||| |||||||||||| ||||||||||| |    |  || |
494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA 543

Coding vs Non-coding Regions

451 CAGGCTGCCATGAGCAAGGCCCTGGAACTGTTCCGGAATGACATGGCTGC 500
    ||||  ||||||||||||||||||||||| |||||||| |||||||| ||
444 CAGGGAGCCATGAGCAAGGCCCTGGAACTCTTCCGGAACGACATGGCGGC 493

501 CCAGTACAAGGTGCTGGGCTTCCATGGCTAAGCCCCACCCCTGTGCCCCT 550
    | ||||||||| |||||||||||| ||||||||||| |    |  || |
494 CAAGTACAAGGAGCTGGGCTTCCAGGGCTAAGCCCCCCAGACGCCCCTCA 543

551 CAC.CCCACCCACCTGGG...........CAGGGTGGGCGGGGACTGAAT 588
    | | |||| |||| ||||           |  || ||| |||  |||||
544 CCCACCCATCCACTTGGGCCAGGGCCCCCCGCGGAGGGTGGGCGCTGAAG 593

589 CCCAAGTAGTTATAGGGTTTGCTTCTGAGTGTGTGCTTTGTTTAGGAGAG 638
    | |  |||| | |||||||||||||||||||| ||||||||| | |||||
594 CTCCTGTAGCTGTAGGGTTTGCTTCTGAGTGT.TGCTTTGTTCATGAGAG 642

639 GTGGGTGGAAGAGGTGGATGGGTTAGGGGTGGAGG............... 673
    |||||||| ||||||||| ||| | | ||||| ||
643 GTGGGTGGGAGAGGTGGAGGGGCTGGTGGTGGTGGTGGGGGGGTGTTCAG 692

90% in coding (70% in non-coding)

Third Base of Codon is Hypervariable

201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250
    ||||||*||||||||||||*||||||*|||||||||||||||*|||||||
194 CAAGTTTAAGCACCTGAAGTCAGAGGATGAGATGAAGGCCTCTGAGGACC 243

251 TGAAGAAGCATGGCAACACGGTGCTCACGGCCCTGGGGGGTATCCTGAAG 300
    ||||||||||*||||||||||||||*||*|||||||||||*|||||*|||
244 TGAAGAAGCACGGCAACACGGTGCTGACTGCCCTGGGGGGCATCCTTAAG 293

Cow-to-Fish Protein

  1 MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHL 50
          :. :||  || .||| | ||  || |||| |||||. | ||  :
  1 ....MADFDMVLKCWGPMEADHATHGSLVLTRLFTEHPETLKLFPKFAGI 46

 51 KTEAEMKASEDLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKI 100
        ::     .  || |||  || :|| :| | | .| |. ||| ||||
 47 .AHGDLAGDAGVSAHGATVLNKLGDLLKARGAHAALLKPLSSSHATKHKI 95

101 PVKYLEFISDAIIHVLHAKHPSDFGADAQAAMSKALELFRNDMAAQYKVL 150
    |:   . |.: |  |:  |   |  |  | |:   : :   || | || |
 96 PIINFKLIAEVIGKVMEEKAGLD..AAGQTALRNVMAIIITDMEADYKEL 143

151 GFHG 154
    ||
144 GFTE 147

42% identity, 51% similarity

Cow-to-Fish DNA

 32 .ACAGGACATTTTACTACTCTGCAGATAATGGCTGACTTTGACATGGTAC 80
      |   | | |   | |    ||  |     |  || |   |  |||| |
 51 TTCTTCAGACTGCGCCATGGGGCTCAGCGACGGGGAATGGCAGTTGGTGC 100

 81 TGAAGTGCTGGGGTCCAATGGAGGCGGACCACGCAACCCACGGGAGTCTG 130
    ||||   ||||||     ||||||| ||   ||||  ||| |||     |
101 TGAATGCCTGGGGGAAGGTGGAGGCTGATGTCGCAGGCCATGGGCAGGAG 150

131 GTGCTGACCCGTTTATTCACAGAGCACCCAGAAACCCTAAAGTTATTCCC 180
    || || | | |  | |||||||  || || || |||||  ||  |||
151 GTCCTCATCAGGCTCTTCACAGGTCATCCCGAGACCCTGGAGAAATTTGA 200

181 CAAGTTTGCTGGC...ATCGCCCATGGGGACCTGGCCGGGGATGCAGGTG 227
    ||||||      |   |   |  | |  ||  ||   |     |  |
201 CAAGTTCAAGCACCTGAAGACAGAGGCTGAGATGAAGGCCTCCGAGGACC 250

48% similarity

Protein vs. DNA Alignments

• Polypeptide similarity > DNA

• Coding DNA > Non-coding

• 3rd base of codon hypervariable

• Moderate distance: poor DNA similarity

Rules of Thumb

• DNA-DNA similarities

– 50% significant if “long”

– E < 1e-6, 70% identity

• Protein-protein similarities

– 80% end-end: same structure, same function

– 30% over domain: similar function, structure overall similar

– 15-30%: “twilight zone”

– Short, strong match…could be a “motif”

Basic BLAST Family

• BLASTN

– DNA to DNA database

• BLASTP

– protein to protein database

• TBLASTN

– protein to DNA database (translated)

• BLASTX

– DNA (translated) to protein database

• TBLASTX

– DNA (translated) to DNA database (translated)
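Choosing among the five programs comes down to the type of the query and the type of the database: BLASTN and TBLASTX take a DNA query against a DNA database, BLASTP takes protein against protein, TBLASTN takes a protein query against a translated DNA database, and BLASTX takes a translated DNA query against a protein database. A small lookup table makes that explicit; the names `BLAST_PROGRAMS` and `choose_blast` are hypothetical helpers for this example, not part of any BLAST distribution.

```python
# (query type, database type) -> candidate programs
BLAST_PROGRAMS = {
    ("dna", "dna"): ["blastn", "tblastx"],   # tblastx translates both sides
    ("protein", "protein"): ["blastp"],
    ("protein", "dna"): ["tblastn"],         # database is translated
    ("dna", "protein"): ["blastx"],          # query is translated
}


def choose_blast(query_type, db_type):
    """Return the BLAST programs that accept this query/database pairing."""
    key = (query_type.lower(), db_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unknown sequence types: {key}")
    return BLAST_PROGRAMS[key]


print(choose_blast("DNA", "protein"))
```

The translated searches compare at the protein level, which is the more sensitive choice when the DNA sequences have diverged.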

DNA Databases

• nr (non-redundantish merge of Genbank, EMBL, etc…)

– EXCLUDES HTGS0,1,2, EST, GSS, STS, PAT, WGS

• est (expressed sequence tags)

• htgs (high throughput genome seq.)

• gss (genome survey sequence)

• vector, yeast, ecoli, mito

• chromosome (complete genomes)

• And more

http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases

Protein Databases

• nr (non-redundant Swiss-prot, PIR, PRF, PDB, Genbank CDS)

• swissprot

• ecoli, yeast, fly

• month

• And more

BLAST Input

• Program

• Database

• Options - see more

• Sequence

– FASTA

– gi or accession#

BLAST Options

• Algorithm and output options

– # descriptions, # alignments returned

– Probability cutoff

– Strand

• Alignment parameters

– Scoring Matrix: PAM30, PAM70, BLOSUM45, BLOSUM62, BLOSUM80

– Filter (low complexity) PPPPP->XXXXX
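The effect of the low-complexity filter can be imitated with a toy masking function. This sketch only masks runs of a single repeated residue and is a stand-in for illustration; BLAST's own filters (SEG for protein, DUST for nucleotide) use more general complexity measures. The function name `mask_runs` and its parameters are assumptions.

```python
import re


def mask_runs(seq, min_run=5, mask_char="X"):
    """Replace runs of >= min_run identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_runs("MKPPPPPLV"))
```

Masked positions cannot seed or extend a hit, so repetitive stretches no longer produce large numbers of biologically meaningless matches.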

Extended BLAST Family

• Gapped Blast (default)

• PSI-Blast (Position-specific iterated blast)

– “self” generated scoring matrix

• PHI BLAST (motif plus BLAST)

• BLAST2 client (align two seqs)

• megablast (genomic sequence)

• rpsblast (search for domains)