Roadmap

The topics:
- basic concepts of molecular biology
- more on Perl
- overview of the field
- biological databases and database searching
- sequence alignments
- phylogenetics
- structure prediction
- microarray data analysis

Sequence alignments
- Introduction
  - What is an alignment?
  - Why do alignments?
  - A bit of history
- Dot matrix comparison
- Scoring alignments
- Alignment methods
- Significance of alignments

What is sequence alignment?

A sequence alignment is an arrangement of two or more sequences that highlights their similarity.

Why do alignments?

Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.

Over time, genes accumulate mutations
- Environmental factors
  - Radiation
  - Oxidation
- Mistakes in replication/repair
- Deletions, duplications
- Insertions
- Inversions
- Point mutations

Comparing two sequences

Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT

Insertions/deletions, must align:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
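When two sequences differ only by point mutations, they can be compared column by column with no alignment step. The sketch below illustrates this for the pair of sequences on the slide; the function name `point_differences` is made up for this example.

```python
def point_differences(seq_a, seq_b):
    """List (position, base_a, base_b) for every differing column (1-based)."""
    if len(seq_a) != len(seq_b):
        # Unequal lengths imply insertions/deletions, so an alignment is needed first.
        raise ValueError("sequences must have equal length; align them first")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


original = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"
diffs = point_differences(original, mutated)
print(diffs)  # three substitutions, at positions 10, 15 and 19
```

The same call fails on the second pair (27 versus 20 bases), which is why insertions and deletions require an alignment.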

Sequence Alignment

Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.

The sequence of platelet-derived growth factor (PDGF) from mammalian cells was virtually identical to the sequence of the retrovirus-encoded oncogene known as v-sis (a gene that causes cancer in animals).

The retrovirus had acquired the gene from the host cell in some kind of genetic exchange event, and had then produced a mutant that could alter the function of the normal protein when the virus infected another animal.

[Photo: Russell F. Doolittle]

Dot Matrix Comparison

A: T C A G A G G T C T G
B: T C A G A G C T G

[Figure: dot matrix of sequence A against sequence B; an X marks each cell where the two residues are identical]
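A dot matrix for sequences A and B can be rebuilt with a few lines of code. This is a minimal sketch, not the program used for the slide: the function names are hypothetical, and putting A along the columns and B along the rows is an assumption about the figure's orientation.

```python
def dot_matrix(seq_a, seq_b):
    """Row i, column j is True when seq_b[i] and seq_a[j] are the same residue."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text, with X for a match and . for a mismatch."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


A = "TCAGAGGTCTG"
B = "TCAGAGCTG"
matrix = dot_matrix(A, B)
print(render(A, B))
```

In the printed matrix the first six residues (TCAGAG) form a run on the main diagonal, and the final CTG of B matches A on a diagonal shifted two columns to the right.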

Interpretation of dot matrix
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to the main diagonal) indicate inversions
- Separate diagonals can be linked or "joined" to form an alignment with "gaps"

More on Dot Matrix

Detection of matching regions can be improved by filtering: a sliding window is used to compare the two sequences. For example, print a dot at a matrix position only if
- 7 out of the next 11 positions in the sequence are identical, or
- the similarity score of the next 11 positions in the sequence is greater than 5.
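The window filter can be sketched as follows. The function name and its parameters are hypothetical, and "7 out of the next 11" is read here as "at least 7"; only the identity-count criterion is shown, not the similarity-score variant.

```python
def filtered_dots(seq_a, seq_b, window=11, min_matches=7):
    """Return (i, j) cells where the windows starting at seq_b[i] and seq_a[j]
    agree in at least min_matches of their window positions."""
    dots = []
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


seq_1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq_2 = "ACGTCTGATTCGCCCTATCGTCTATCT"  # same sequence with three point mutations
dots = filtered_dots(seq_1, seq_2)
print(len(dots))
```

Because no 11-base window along the main diagonal contains more than three mismatches, every diagonal cell passes the 7-of-11 threshold, so the diagonal stays unbroken despite the mutations.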

Sequence repeats

Many sequences contain repetitive regions.

[Figure: dot matrix of a retrovirus vector sequence compared against itself, using a window size of 9 and a mismatch limit of 2 (http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)]

More on Dot Matrix
- A dot matrix graphically presents regions of identity or similarity between two sequences
- The use of windows and thresholds can reduce “noise” in a dot matrix
- Inversions and duplications have unique “signatures” in a dot matrix

Software
- Dotlet (Java applet): www.ch.embnet.org
- Dnadot: arbl.cvmbs.colostate.edu/molkit/dnadot/
- Dotter: www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
- Dottup: www.emboss.org

How to measure the similarity

Basically, three kinds of changes can occur at any given position within a sequence:
- Mutation (substitution)
- Insertion
- Deletion

Insertions and deletions have been found to occur in nature at a significantly lower frequency than mutations.

Scoring Matrices for Aligning DNA Sequences

Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).

Transversions: substitutions between a purine and a pyrimidine, (A/G) ↔ (C/T).

Identity matrix

      A    T    C    G
A     1    0    0    0
T     0    1    0    0
C     0    0    1    0
G     0    0    0    1

BLAST matrix

      A    T    C    G
A     5   -4   -4   -4
T    -4    5   -4   -4
C    -4   -4    5   -4
G    -4   -4   -4    5

Transition-Transversion matrix

      A    T    C    G
A     1   -5   -5   -1
T    -5    1   -1   -5
C    -5   -1    1   -5
G    -1   -5   -5    1
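All three DNA matrices follow one rule with different values for a match, a transition, and a transversion. The sketch below generates them from that rule; the helper names are hypothetical, and the dictionary-of-pairs layout is only one possible representation.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)


def build_matrix(match, transition, transversion):
    """Return a dict mapping (base, base) to a substitution score."""
    matrix = {}
    for x in "ATCG":
        for y in "ATCG":
            if x == y:
                matrix[x, y] = match
            elif is_transition(x, y):
                matrix[x, y] = transition
            else:
                matrix[x, y] = transversion
    return matrix


identity = build_matrix(match=1, transition=0, transversion=0)
blast = build_matrix(match=5, transition=-4, transversion=-4)
ts_tv = build_matrix(match=1, transition=-1, transversion=-5)
print(ts_tv["A", "G"], ts_tv["A", "T"])  # transition vs. transversion score
```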

Scoring a sequence alignment

Match score: +1
Mismatch score: +0
Gap penalty: -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (-1)

Score = +11
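The arithmetic above can be checked with a short scoring routine. This is a minimal sketch: the function name is hypothetical and the defaults are the slide's values (match +1, mismatch 0, gap -1).

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column with a fixed per-position gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


row_a = "ACGTCTGATACGCCGTATAGTCTATCT"
row_b = "----CTGATTCGC---ATCGTCTATCT"
linear_score = score_alignment(row_a, row_b)
print(linear_score)  # 18 matches + 2 mismatches * 0 - 7 gap positions = 11
```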

Gap opening and extension penalties

We want to find alignments that are evolutionarily likely.

Which of the following alignments seems more likely to you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

We can favor the more likely alignment by penalizing the opening of a new gap more than the extension of an existing gap.
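Both candidate alignments contain 7 gap positions, so a fixed per-position penalty cannot tell them apart; counting gap runs can. The sketch below uses illustrative penalties of -2 to open and -1 to extend, and assumes that the first position of a gap pays the opening penalty while each further position pays the extension penalty. The function names are hypothetical.

```python
from itertools import groupby


def gap_runs(aligned_row):
    """Lengths of each maximal run of '-' in one row of an alignment."""
    return [len(list(group)) for char, group in groupby(aligned_row) if char == "-"]


def gap_cost(aligned_row, gap_open=-2, gap_extend=-1):
    """Total gap penalty: one opening per run plus one extension per extra position."""
    return sum(gap_open + (length - 1) * gap_extend for length in gap_runs(aligned_row))


one_gap = "ACGTCTGAT-------ATAGTCTATCT"
many_gaps = "AC-T-TGA--CG-CGT-TA-TCTATCT"
print(gap_runs(one_gap), gap_cost(one_gap))      # one run of 7
print(gap_runs(many_gaps), gap_cost(many_gaps))  # six short runs
```

Under these assumed penalties the single long gap costs -8 and the six scattered gaps cost -13, so the first alignment is preferred.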

Scoring a sequence alignment

Match/mismatch score: +1/+0
Open/extension penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches: 18 × (+1)
Mismatches: 2 × 0
Open: 2 × (-2)
Extension: 5 × (-1)

Score = +9
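The score of +9 can be reproduced with a scorer that tracks whether the previous column was already a gap. This is a sketch with a hypothetical function name, and it makes one simplification: a gap in one row followed directly by a gap in the other row is treated as a single run.

```python
def score_affine(row_a, row_b, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score an alignment where the first gap position costs gap_open
    and every following position of the same gap costs gap_extend."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
            score += match if a == b else mismatch
    return score


row_a = "ACGTCTGATACGCCGTATAGTCTATCT"
row_b = "----CTGATTCGC---ATCGTCTATCT"
affine_score = score_affine(row_a, row_b)
print(affine_score)  # 18 - (2 + 3) - (2 + 2) = 9
```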

Amino Acid Substitution Matrices
- PAM (point accepted mutation): based on global alignments [evolutionary model]
- BLOSUM (blocks substitution matrix): based on local alignments [similarity among conserved sequences]

Part of PAM 250 Matrix

      C    S    T    P    A    G
C    12
S     0    2
T    -2    1    3
P    -3    1    0    6
A    -2    1    1    1    2
G    -3    1    0   -1    1    5

Log-odds = log( chance to see the pair in homologous proteins / chance to see the pair in unrelated proteins by chance )
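The log-odds formula can be tried out with made-up numbers. In the sketch below the probabilities are hypothetical, and the scaling (10 × log base 10, rounded to an integer) is an assumption chosen to mimic the PAM convention; other matrices use other bases and scale factors.

```python
import math


def log_odds(p_homologous, p_chance, scale=10, base=10):
    """Scaled, rounded log of the ratio between the two chances of seeing a pair."""
    ratio = p_homologous / p_chance
    return round(scale * math.log(ratio) / math.log(base))


# Hypothetical pair frequencies, for illustration only.
favoured = log_odds(0.04, 0.01)     # pair seen 4x more often than chance
neutral = log_odds(0.01, 0.01)      # pair seen exactly as often as chance
disfavoured = log_odds(0.005, 0.01) # pair seen half as often as chance
print(favoured, neutral, disfavoured)
```

A positive score means the pair is seen more often in homologous proteins than chance predicts, zero means no preference, and a negative score means the substitution is disfavoured.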

PAM matrices
- PAM 1: the matrix reflects an amount of evolution producing on average one mutation per hundred amino acids (1 unit of evolution).
- PAM 250: 250 units of evolution.

                         Probability
Amino acid change     PAM 1      PAM 250
Phe to Ala            0.0002     0.04
Phe to Arg            0.0001     0.01
Phe to Asn            0.0001     0.02
Phe to Asp            0.0000     0.01
...                   ...        ...
Phe to Cys            0.0000     0.01
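In the Dayhoff model, the probabilities for n units of evolution are obtained by multiplying the PAM 1 probability matrix by itself n times. The sketch below shows the effect on a toy three-letter alphabet. The matrix values are invented for illustration and are not real PAM data, and the rows-sum-to-one layout is an assumption of this example.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM 1": each residue is unchanged with probability 0.99 (hypothetical values).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the off-diagonal probabilities are far larger than in the one-step matrix, the same pattern as in the table, where Phe to Ala goes from 0.0002 to 0.04.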

Limitations of PAM Matrices
- Construction depends on phylogenetic relationships that must be established before mutations are scored;
- Ancestral relationships among sequences are difficult to determine;
- Based on a small set of closely related proteins;
- …

BLOSUM Matrices
- Based on the observed amino acid substitutions in a large set of ~2000 conserved amino acid patterns (blocks). The blocks were found in a database of protein sequences representing more than 500 families of related proteins, and they act as signatures of these protein families.
- The matrices are derived from the multiple alignments of the blocks.
- The entries of the matrices are computed on the same principle used in PAM: log(odds ratio).
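The path from a block to log-odds scores can be sketched on a toy example. The block below is invented, the function name is hypothetical, and the sketch leaves out the step of clustering similar sequences that the real BLOSUM procedure uses. Scores are expressed in half-bit units (2 × log base 2) and rounded.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Compute rounded half-bit log-odds scores from an ungapped aligned block."""
    # Count every pair of residues that share a column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        expected = background[x] ** 2 if x == y else 2 * background[x] * background[y]
        scores[x, y] = round(2 * math.log2(freq / expected))
    return scores


toy_block = ["ACS", "ACS", "ACS", "SCS"]  # hypothetical block of four sequences
scores = blosum_like_scores(toy_block)
print(scores)
```

Pairs that never share a column (A with C here) receive no score in this sketch, because their observed frequency is zero.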

Part of BLOSUM 62 Matrix

      C    S    T    P    A    G
C     9
S    -1    4
T    -1    1    5
P    -3   -1   -1    7
A     0    1    0   -1    4
G    -3    0   -2   -2    0    6

BLOSUM62 was derived from blocks in which sequences sharing at least 62% identical amino acids were clustered and counted as a single sequence.

Log-odds = log( chance to see the pair in homologous proteins / chance to see the pair in unrelated proteins by chance )

PAM vs. BLOSUM
- PAM
  - Based on a mutational model of evolution (Markov process)
  - PAM1 is based on sequences of 85% similarity
  - Designed to track evolutionary origins
- BLOSUM
  - Based on the multiple alignment of blocks
  - Well suited to comparing distantly related sequences
  - Designed to find conserved domains in proteins

Gap Penalty
- Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.
- When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty often give better results.
- When comparing closely related sequences, gaps should be penalized for both opening and extension.
