Roadmap
The topics:
  basic concepts of molecular biology
  more on Perl
  overview of the field
  biological databases and database searching
  sequence alignments
  phylogenetics
  structure prediction
  microarray data analysis
Sequence alignments
  Introduction
    What is an alignment?
    Why do alignments?
    A bit of history
  Dot matrix comparison
  Scoring alignments
  Alignment methods
  Significance of alignments
What is sequence alignment?
Sequence alignment is an arrangement of two or more sequences, highlighting their similarity.
Why do alignments?
Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences.
Over time, genes accumulate mutations
  Environmental factors
    Radiation
    Oxidation
  Mistakes in replication/repair
    Deletions, duplications
    Insertions
    Inversions
    Point mutations
Comparing two sequences
Point mutations, easy:
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGATTCGCCCTATCGTCTATCT
Insertions/deletions, must align:
    ACGTCTGATACGCCGTATAGTCTATCT
    CTGATTCGCATCGTCTATCT
Aligned:
    ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT
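The point-mutation case can be checked position by position because both sequences have the same length. The following is a minimal sketch of that comparison; the function name `point_differences` is hypothetical, and the two sequences are the pair from the point-mutation example above.

```python
def point_differences(seq_a, seq_b):
    """Return the 1-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        # Unequal lengths imply insertions/deletions: the sequences must be aligned first.
        raise ValueError("sequences must have equal length; align them first")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


original = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"
print(point_differences(original, mutated))  # [10, 15, 19]
```

A direct comparison like this fails for the insertion/deletion pair, which is why that pair has to be aligned with gaps before it can be compared.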
Sequence Alignment
Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.
A sequence for platelet-derived growth factor (PDGF) from mammalian cells was virtually identical to the sequence for the retrovirus-encoded oncogene known as v-sis (a gene that causes cancer in animals).
The retrovirus had acquired the gene from the host cell in some kind of genetic exchange event and had then produced a mutant that could alter the function of the normal protein when it infected another animal.
Russell F. Doolittle
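Dot Matrix Comparison
A: T C A G A G G T C T G
B: T C A G A G C T G
(columns: sequence A; rows: sequence B; X = identical letters, . = no match)
      T C A G A G G T C T G
    T X . . . . . . X . X .
    C . X . . . . . . X . .
    A . . X . X . . . . . .
    G . . . X . X X . . . X
    A . . X . X . . . . . .
    G . . . X . X X . . . X
    C . X . . . . . . X . .
    T X . . . . . . X . X .
    G . . . X . X X . . . X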
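A dot matrix is simple to compute: place a dot wherever the letter of one sequence equals the letter of the other. The sketch below assumes sequence A runs along the columns and sequence B along the rows; the helper names `dot_matrix` and `render` are hypothetical.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b, columns follow seq_a; True marks identical letters."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b, matrix):
    """Draw the matrix as text, using X for a dot and . for an empty cell."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, matrix):
        lines.append(b + " " + " ".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


A = "TCAGAGGTCTG"
B = "TCAGAGCTG"
print(render(A, B, dot_matrix(A, B)))
```

The run of dots down the main diagonal for the first six letters corresponds to the shared prefix TCAGAG.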
Interpretation of dot matrix
  Regions of similarity appear as diagonal runs of dots
  Reverse diagonals (perpendicular to the diagonal) indicate inversions
  Separate diagonals can be linked or "joined" to form an alignment with "gaps"
More on Dot Matrix
Detection of matching regions can be improved by filtering: use a sliding window to compare the two sequences. For example, print a dot at a matrix position only if
  7 out of the next 11 positions in the sequence are identical, or
  the similarity score of the next 11 positions in the sequence is greater than 5.
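The window filter can be sketched as follows, using the "7 out of the next 11 identical" rule. The function name `filtered_dots` is hypothetical, and skipping positions near the sequence ends where a full window does not fit is an assumption of this sketch, not something the slides specify.

```python
def filtered_dots(seq_a, seq_b, window=11, min_matches=7):
    """Return 0-based (row, column) pairs that pass the window filter.

    A dot is kept at (i, j) only if at least `min_matches` of the next
    `window` positions, starting at seq_b[i] and seq_a[j], are identical.
    """
    dots = []
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


original = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"
dots = filtered_dots(original, mutated)
print(len(dots), "dots survive the filter")
```

For the score-based variant, the count of identical letters would be replaced by a sum of substitution scores over the window, compared against the chosen threshold.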
Sequence repeats
Many sequences contain repetitive regions.
Figure: a retrovirus vector sequence compared against itself using a window size of 9 and a mismatch limit of 2 (http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)
More on Dot Matrix
  A dot matrix graphically presents regions of identity or similarity between two sequences
  The use of windows and thresholds can reduce “noise” in a dot matrix
  Inversions and duplications have unique “signatures” in a dot matrix
Software
  Dotlet (Java applet): www.ch.embnet.org
  Dnadot: arbl.cvmbs.colostate.edu/molkit/dnadot/
  Dotter: www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
  Dottup: www.emboss.org
How to measure the similarity
Basically three kinds of changes can occur at any given position within a sequence:
  Mutation (point substitution)
  Insertion
  Deletion
Insertions and deletions have been found to occur in nature at a significantly lower frequency than point mutations.
Scoring Matrices for Aligning DNA Sequences
Transitions: substitutions in which a purine (A/G) is replaced by another purine (A/G), or a pyrimidine (C/T) is replaced by another pyrimidine (C/T).
Transversions: substitutions in which a purine (A/G) is replaced by a pyrimidine (C/T), or vice versa.
Identity matrix
         A   T   C   G
    A    1   0   0   0
    T    0   1   0   0
    C    0   0   1   0
    G    0   0   0   1
BLAST matrix
         A   T   C   G
    A    5  -4  -4  -4
    T   -4   5  -4  -4
    C   -4  -4   5  -4
    G   -4  -4  -4   5
Transition-transversion matrix
         A   T   C   G
    A    1  -5  -5  -1
    T   -5   1  -1  -5
    C   -5  -1   1  -5
    G   -1  -5  -5   1
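The transition-transversion matrix follows directly from the purine/pyrimidine classes. The sketch below rebuilds it from that rule; the names `substitution_kind` and `transition_transversion_matrix` are hypothetical, and the scores (+1, -1, -5) are the ones from the matrix on the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
SCORES = {"match": 1, "transition": -1, "transversion": -5}


def substitution_kind(x, y):
    """Classify a pair of bases as a match, a transition, or a transversion."""
    if x == y:
        return "match"
    pair = {x, y}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def transition_transversion_matrix(alphabet="ATCG"):
    """Build the full scoring matrix as a dict keyed by (base, base)."""
    return {(x, y): SCORES[substitution_kind(x, y)]
            for x in alphabet for y in alphabet}


matrix = transition_transversion_matrix()
for x in "ATCG":
    print(x, [matrix[(x, y)] for y in "ATCG"])
```

Transitions are penalized less than transversions here, which reflects the scores in the matrix above rather than a universally fixed ratio.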
Scoring a sequence alignment
Match score: +1   Mismatch score: +0   Gap penalty: -1
    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)   Mismatches: 2 × 0   Gaps: 7 × (-1)
Score = +11
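The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch assuming a fixed penalty per gap position; the function name `score_alignment` is hypothetical, and the default scores are the ones from the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score a pairwise alignment column by column with a per-position gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 11
```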
Gap opening and extension penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to you?
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGAT-------ATAGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    AC-T-TGA--CG-CGT-TA-TCTATCT
We can favor the more likely alignment by penalizing a new gap more than the extension of an existing gap.
Scoring a sequence alignment
Match/mismatch score: +1/+0   Open/extension penalty: -2/-1
    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)   Mismatches: 2 × 0   Open: 2 × (-2)   Extension: 5 × (-1)
Score = +9
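The open/extension scheme can be sketched in code as well. The sketch assumes the convention used in the tally above: the first position of a gap costs the opening penalty only, and each further position of the same gap costs the extension penalty. The function name `affine_score` is hypothetical.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score an alignment with separate gap-opening and gap-extension penalties."""
    score = 0
    gap_side = None  # which sequence holds the current gap run, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            # Continuing a gap in the same sequence extends it; anything else opens one.
            score += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += match if a == b else mismatch
            gap_side = None
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(affine_score(top, "----CTGATTCGC---ATCGTCTATCT"))  # 9
print(affine_score(top, "ACGTCTGAT-------ATAGTCTATCT"))  # 12
print(affine_score(top, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # 7
```

The last two calls score the two candidate alignments from the gap opening and extension slide. Both have 20 matches and 7 gap positions, so a flat penalty of -1 per gap position gives them the same score (13); with separate penalties, the single long gap scores 12 and the six scattered gaps score 7.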
Amino Acid Substitution Matrices
  PAM (point accepted mutation): based on global alignments [evolutionary model]
  BLOSUM (blocks substitution matrix): based on local alignments [similarity among conserved sequences]
Part of PAM 250 Matrix
         C    S    T    P    A    G
    C   12
    S    0    2
    T   -2    1    3
    P   -3    1    0    6
    A   -2    1    1    1    2
    G   -3    1    0   -1    1    5
$$\text{Log-odds} = \log\left(\frac{\text{chance to see the pair in homologous proteins}}{\text{chance to see the pair in unrelated proteins by chance}}\right)$$
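The log-odds idea can be illustrated with a small calculation. The sketch below is only an illustration: the function name `log_odds`, the choice of base 2, and the probabilities are hypothetical, and published matrices additionally scale and round the values.

```python
import math


def log_odds(p_homologous, p_chance, base=2.0):
    """Log of the ratio between the two chances of seeing an aligned pair."""
    return math.log(p_homologous / p_chance, base)


# Hypothetical probabilities, for illustration only.
print(log_odds(0.02, 0.01))   # pair seen more often than chance: positive score
print(log_odds(0.01, 0.01))   # pair seen exactly as often as chance: zero
print(log_odds(0.005, 0.01))  # pair seen less often than chance: negative score
```

A positive entry therefore marks a substitution that is more common in related proteins than expected by chance, and a negative entry marks one that is less common.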
PAM matrices
  PAM 1: the matrix reflects an amount of evolution producing on average one mutation per hundred amino acids (1 unit of evolution).
  PAM 250: 250 units of evolution.
                          Probability
    Amino acid change     PAM 1     PAM 250
    Phe to Ala            0.0002    0.04
    Phe to Arg            0.0001    0.01
    Phe to Asn            0.0001    0.02
    Phe to Asp            0.0000    0.01
    …                     …         …
    Phe to Cys            0.0000    0.01
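Under the Markov model behind PAM, the probabilities for n units of evolution are obtained by multiplying the PAM 1 probability matrix by itself n times. The sketch below shows this on a hypothetical three-letter alphabet; `toy_pam1` and its values are invented for illustration and are not real PAM data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical alphabet of three letters. Row i holds the probabilities that
# letter i stays the same (diagonal, 0.99) or changes to another letter.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.007, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

As in the table above, a change that is rare after 1 unit of evolution becomes much more probable after 250 units, while each row still sums to 1.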
Limitations of PAM Matrices
  Constructed based on the phylogenetic relationships prior to scoring mutations;
  Difficulty of determining ancestral relationships among sequences;
  Based on a small set of closely related proteins;
  …
BLOSUM Matrices
  Based on the observed amino acid substitutions in a large set of ~2000 conserved amino acid patterns (blocks). The blocks are found in a database of protein sequences representing more than 500 families of related proteins and act as signatures of these protein families.
  The matrices are derived from the multiple alignments of the blocks.
  The entries of the matrices are computed based on the same principle used in PAM: log(odds ratio).
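The way scores are derived from block columns can be sketched as follows. This is a simplified illustration, not the full BLOSUM procedure: it omits the clustering of similar sequences, handles a single block, and scores only pairs that actually occur. The function name `block_log_odds` and the toy block are hypothetical; scaling to half-bit units (2 × log2) is an assumption of the sketch.

```python
import math
from collections import Counter
from itertools import combinations


def block_log_odds(block):
    """Rounded half-bit log-odds scores from the columns of one ungapped block."""
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Chance frequency of the unordered pair given the background frequencies.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


toy_block = ["AAC", "AAC", "ASC", "SAC"]  # hypothetical block of four sequences
print(block_log_odds(toy_block))
```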
Part of BLOSUM 62 Matrix
         C    S    T    P    A    G
    C    9
    S   -1    4
    T   -1    1    5
    P   -3   -1   -1    7
    A    0    1    0   -1    4
    G   -3    0   -2   -2    0    6
BLOSUM62 was derived from blocks in which sequences sharing 62% or more identical amino acids were clustered together and counted as a single sequence.
$$\text{Log-odds} = \log\left(\frac{\text{chance to see the pair in homologous proteins}}{\text{chance to see the pair in unrelated proteins by chance}}\right)$$
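A substitution matrix is used by looking up one score per aligned pair of residues and summing. The sketch below stores the part of the BLOSUM 62 matrix shown above as a lower triangle; the names `BLOSUM62_PART`, `pair_score`, and `ungapped_score` are hypothetical, and the two peptides are invented for illustration.

```python
# Lower triangle of the partial BLOSUM 62 matrix (residues C, S, T, P, A, G).
BLOSUM62_PART = {
    ("C", "C"): 9,
    ("S", "C"): -1, ("S", "S"): 4,
    ("T", "C"): -1, ("T", "S"): 1, ("T", "T"): 5,
    ("P", "C"): -3, ("P", "S"): -1, ("P", "T"): -1, ("P", "P"): 7,
    ("A", "C"): 0, ("A", "S"): 1, ("A", "T"): 0, ("A", "P"): -1, ("A", "A"): 4,
    ("G", "C"): -3, ("G", "S"): 0, ("G", "T"): -2, ("G", "P"): -2,
    ("G", "A"): 0, ("G", "G"): 6,
}


def pair_score(a, b, matrix=BLOSUM62_PART):
    """Look up a symmetric score that is stored only once per residue pair."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def ungapped_score(seq_a, seq_b):
    """Sum the substitution scores of two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


# Hypothetical peptides restricted to the six residues in the partial matrix.
print(ungapped_score("CSTPAG", "CTSPGG"))  # 9 + 1 + 1 + 7 + 0 + 6 = 24
```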
PAM vs. BLOSUM
  PAM
    Based on a mutational model of evolution (Markov process)
    PAM1 is based on sequences of at least 85% similarity
    Designed to track the evolutionary origins of proteins
  BLOSUM
    Based on the multiple alignment of blocks
    Well suited to comparing distant sequences
    Designed to find proteins’ conserved domains
Gap Penalty
  Optimal penalties vary from sequence to sequence, and finding the most adequate value is a matter of empirical trial and error.
  When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty often give better results.
  When comparing closely related sequences, both gap opening and gap extension should be penalized.