July 29, 2008, 7 to 10 PM Marine Biological Laboratory, Woods Hole, MA Workshop on Molecular...
July 29, 2008, 7 to 10 PM
Marine Biological Laboratory, Woods Hole, MA
Workshop on Molecular Evolution: multiple sequence analysis session
More data yields stronger analyses, if done carefully! The patterns of conservation become ever clearer by comparing the conserved portions of sequences amongst a larger and larger dataset. Mosaic ideas and evolutionary ‘importance.’
Multiple Sequence Alignment & Analysis with SeaView and MAFFT
Steven M. Thompson
Florida State University School of Computational Science (SCS)
But first a prelude: My definitions
Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, organisms, and populations, to complete ecologies.
Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available online biological databases.
Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.
Proteomics is a subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
And a ‘way’ to think about it: The reverse biochemistry analogy
from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.
Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.
The computer and molecular databases are an essential part of this process.
The exponential growth of molecular sequence databases & cpu power
Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
2004    44575745176    40604319
2005    56037734462    52016762
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Doubling time ~ 1 year!
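As a rough check on the doubling-time figure, here is a minimal Python sketch that uses only the first and last rows of the GenBank table and assumes steady exponential growth in between (that assumption belongs to this sketch, not to the slide). Averaged over 1982 to 2005 it comes out a little under a year and a half per doubling, the same order as the rule of thumb quoted above.

```python
import math

# First and last rows of the GenBank table (year, base pairs).
first_year, first_bp = 1982, 680338
last_year, last_bp = 2005, 56037734462


def doubling_time_years(y0, n0, y1, n1):
    """Average doubling time between two points, assuming exponential growth."""
    doublings = math.log2(n1 / n0)
    return (y1 - y0) / doublings


avg_doubling = doubling_time_years(first_year, first_bp, last_year, last_bp)
print(f"average doubling time: {avg_doubling:.2f} years")
```

Picking other pairs of rows gives shorter or longer estimates, since the growth was not perfectly steady.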
Now then, why even bother? Applicability:
Molecular evolutionary analysis; plus
Probe/primer, and motif/profile design;
Graphical illustrations; and
Comparative ‘homology’ inference.
OK, here are some examples.
Molecular evolution and phylogenetics
We all know multiple sequence alignments are necessary for phylogenetic inference, but does everybody here truly realize that the absolute positional homology of every column in a data matrix passed on to these programs is the most critical assumption that all the algorithms make (but see Bayesian coestimation)?
And what about this other stuff?
Multiple sequence alignments can be indispensable for primer design when you don’t have data on a particular taxon, yet data are available in related taxa. The conservation and variability within an alignment can help guide the design of universal or species-specific primers.
Here’s an HPV L1 example
The ellipses show areas where PCR primers could differentiate the Type 16 clade from its closest relatives: areas of high L1 conservation in the Type 16 clade (red line) that correspond to areas of much weaker conservation in the others (blue line).
Motif and profile definition
An alignment of human SRY/SOX proteins illustrates the conservation of the HMG box. Conserved regions can be visualized with a sliding window approach and appear as peaks. Motifs and (better yet) HMM profiles can be created of the region to be used as a search tool to find other HMG box proteins.
(Figure label: HMG box)
One picture’s worth . . .
The HMG-box domain is strikingly conserved amongst the otherwise nearly unalignable human DNA regulatory paralogous protein family.
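The sliding window idea can be sketched in a few lines of Python. This is only an illustration: the four toy sequences are invented for the example (they are not real SRY/SOX data), and the conservation measure, the fraction of sequences sharing the most common residue in a column, is one simple choice among many.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common non-gap residue, per column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / n_seqs)
    return scores


def sliding_window(scores, width):
    """Mean score of every window of the given width."""
    return [sum(scores[i:i + width]) / width
            for i in range(len(scores) - width + 1)]


# Hypothetical toy alignment, equal-length rows with '-' for gaps.
toy = ["MKRPMNAFMVWS",
       "AQRPMNAFIVWT",
       "-DRPMNAFMLWG",
       "GHKPLNAYMVWS"]

per_column = column_conservation(toy)
smoothed = sliding_window(per_column, 4)
# Start position of the best-conserved window: the "peak" in the plot.
peak = max(range(len(smoothed)), key=smoothed.__getitem__)
print(peak, smoothed[peak])
```

Plotting `smoothed` against window position gives the kind of peak profile described above; wider windows smooth out more noise but blur the domain boundaries.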
Structure/function homology inference
A Swiss-Model homology based model of Giardia EF1 superimposed over its eight most similar sequences with solved structure. Amazingly accurate inferences of both function and structure are possible using comparative methods.
On to aligning multiple sequences: dynamic programming’s complexity increases exponentially with the number of sequences being compared:
N-dimensional matrix . . . . complexity O( [sequence length]^[number of sequences] )
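To get a feel for that growth, here is a small Python sketch that counts the cells in the N-dimensional matrix. The `+ 1` on each axis is the usual extra row or column for the empty prefix; the sequence length of 300 is just an example value.

```python
def dp_cells(seq_length, n_seqs):
    """Cells in the N-dimensional dynamic programming matrix, one axis per sequence."""
    return (seq_length + 1) ** n_seqs


for n in (2, 3, 5, 10):
    print(f"{n:2d} sequences of length 300: {dp_cells(300, n):.3e} cells")
```

Two sequences need tens of thousands of cells; ten sequences of the same length need more cells than any computer can hold, which is why the heuristics that follow exist.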
A couple ‘global’ solutions using heuristic tricks
See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only). Both are available on the multiple alignment page at the Baylor College of Medicine’s Search Launcher, http://searchlauncher.bcm.tmc.edu/, but with severely limiting restrictions!
Therefore: pairwise, progressive dynamic programming . . .
. . . restricts the solution to the neighborhood of only two sequences at a time.
All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners represented as a consensus. Each group of partners is then aligned to finish the complete multiple sequence alignment.
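The pairwise step that progressive methods repeat is ordinary two-sequence dynamic programming. Below is a minimal Needleman-Wunsch sketch in Python; the match, mismatch, and linear gap values are toy assumptions, whereas real programs use substitution matrices and affine gap penalties, and align profiles as well as single sequences.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty (toy scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


print(needleman_wunsch("ACGT", "AGT"))
```

A progressive aligner runs this kind of comparison on every pair to rank similarity, then merges sequences and groups in that order.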
Enhancements on the theme
That pairwise, progressive scheme was pretty much the original ClustalV and GCG’s PileUp program . . . then . . .
First enhancements came from ClustalW: variable sequence weighting, dynamically varying gap penalties and substitution matrices, and a neighbor-joining guide-tree.
Since the year 2000 a slew of new programs have tried other heuristic variations, all in attempts to build faster, more accurate multiple sequence alignments. The devil’s in the details: Muscle, ProbCons, T-Coffee, MAFFT and many, many more.
Muscle
An iterative method that uses weighted log-expectation profile scoring along with a slew of optimizations. It proceeds in three stages: draft progressive using k-mer counting, improved progressive using a revised tree from the previous iteration, and refinement by sequential deletion of each tree edge with subsequent profile realignment.
ProbCons
Uses Hidden Markov Model (HMM) techniques and posterior probability matrices that compare random pairwise alignments to expected pairwise alignments. Probability consistency transformation is used to reestimate the scores, and a guide-tree is then constructed, which is used to compute the alignment, which is then iteratively refined. Incredibly accurate.
T-Coffee
Uses a preprocessed, weighted library of all the pairwise global alignments between your sequences, plus the ten best local alignments associated with each pair. This helps build the NJ guide-tree and the progressive alignment. The library is used to assure consistency and help prevent errors, by allowing ‘forward-thinking’ to see whether the overall alignment will be better one way or another after particular segments are aligned one way or another. The institutional schedule analogy . . . .
T-Coffee can even tie together multiple methods as external modules, making consistency libraries from the results of each, as long as all the specified methods are installed on your system. T-Coffee is one of the most accurate multiple sequence alignment methods available because of this consistency based rationale, but it is not the fastest. Regardless, I encourage you to check it out!
MAFFT (today’s example)
has many modes, among them: a couple of progressive, approximate modes, using a fast Fourier transformation (FFT); a couple of iteratively refined methods that add in weighted-sum-of-pairs (WSP) scoring; and several iterative methods that use WSP scoring combined with a T-Coffee-like consistency based scoring scheme. Speed and accuracy trade off against each other across these modes, which run from fast and rough to slow and accurate.
MAFFT provides command aliases for all of these, from fast to slow: FFTNS with or without retree, FFTNSI with or without maxiterate, and the three combined approaches EINSI, LINSI, and GINSI.
MAFFT’s basic algorithm
MAFFT’s fast Fourier transform provides a huge speedup over previous methods. Homologous regions are quickly identified by converting amino acid residues to vectors of volume and polarity, thus changing a twenty-character alphabet to six, rather than by using an amino acid similarity matrix. Similarly, nucleotide bases are converted to vectors of imaginary and complex numbers. The FFT trick then reduces the complexity of the subsequent comparison to O(N log N). FFT identifies potential similarities, though, without localizing them; a sliding window step using the BLOSUM62 matrix is used for this.
Then MAFFT constructs a distance matrix, and hence a progressive guide tree, on the number of shared six-tuples from this Fourier transform, rather than on a ranking based on full-length, pairwise sequence similarity. The user can specify how many times a new guide tree is subsequently recalculated from a previous alignment; the alignment is reconstructed using the Needleman-Wunsch algorithm each time.
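The FFT trick can be illustrated with a small, self-contained Python sketch. Everything specific here is an assumption made for the illustration: the one-number-per-residue `TOY_PROPERTY` table is invented (MAFFT's actual encoding uses volume and polarity values), the FFT is a plain recursive textbook version rather than an optimized library routine, and the two toy sequences simply share one motif at different offsets. The point is only that one pass of transforms scores every possible offset (lag) between two sequences at once.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def correlation_by_lag(x, y):
    """Return {lag k: sum over n of x[n] * y[n + k]} for every possible lag."""
    size = 1
    while size < len(x) + len(y):  # zero-pad so lags do not wrap into each other
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    product = [a.conjugate() * b for a, b in zip(fx, fy)]
    raw = fft(product, invert=True)
    corr = [v.real / size for v in raw]
    # Non-negative lags sit at the front of the list, negative lags at the end.
    return {k: corr[k % size] for k in range(-(len(x) - 1), len(y))}


# Hypothetical property values, for illustration only.
TOY_PROPERTY = {"G": 0.0, "W": 1.0, "K": -1.0, "L": 0.5, "D": -0.5}


def encode(seq):
    return [TOY_PROPERTY[c] for c in seq]


lags = correlation_by_lag(encode("GGWKLDGGGG"), encode("GGGGGWKLDG"))
best_lag = max(lags, key=lags.get)
print(best_lag, round(lags[best_lag], 6))
```

The peak says the shared motif lines up when the second sequence is shifted by three positions. As the text notes, the correlation finds the offset but not where along the sequences the similarity sits; that takes a separate windowed scan.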
Some of MAFFT’s many modes
And each mode has a bunch of additional options!
1) Most basic, fastest modes — just progressive.
a) FFTNS1 (fftns --retree 1)
b) FFTNS2 (fftns) (same as mafft --retree 2)
Suitable for 1,000’s of easily aligned sequences.
A rough distance matrix is built from the sequences using FFT and the shared number of six-mers.
A modified UPGMA guide tree is built from this matrix.
The sequences are aligned according to the rough, initial guide tree (as in ‘traditional’ methods).
FFTNS2 adds a recomputation of the guide tree (retree 2) from the original alignment, from which a new progressive alignment is built.
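The rough six-mer distance can be sketched as follows. Two things here are assumptions of the sketch: the six residue groups are a Dayhoff-style grouping chosen for illustration (MAFFT's own grouping may differ), and the normalization by the smaller self-comparison count is one simple way to turn shared counts into a distance.

```python
from collections import Counter

# Illustrative six-class grouping of the twenty amino acids.
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq):
    """Rewrite a protein in the six-letter alphabet ('X' for anything unknown)."""
    return "".join(TO_GROUP.get(aa, "X") for aa in seq.upper())


def kmer_counts(seq, k=6):
    reduced = compress(seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def shared_kmers(a, b, k=6):
    """Number of k-mers the two sequences have in common, counting multiplicity."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def rough_distance(a, b, k=6):
    """1 - shared / min(self-shared); 0.0 means identical k-mer content."""
    denom = min(shared_kmers(a, a, k), shared_kmers(b, b, k))
    if denom == 0:
        return 1.0
    return 1.0 - shared_kmers(a, b, k) / denom


# Conservative substitutions vanish in the reduced alphabet.
print(rough_distance("MKVLAAGIVG", "MRILSAGLVG"))
print(rough_distance("CCCCCCCC", "WWWWWWWW"))
```

No alignment is computed at any point, which is why this step scales to thousands of sequences; the resulting matrix feeds the guide tree.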
MAFFT’s iterative refinements
2) Intermediate modes: progressive + iterations to maximize the WSP objective function.
a) FFTNSI (fftnsi) default two cycles, or e.g.
fftnsi --maxiterate 1000
b) NWNSI (nwnsi) same as FFTNSI, but no FFT, Needleman Wunsch only.
Progressive alignment and retree as before, with or without FFT, and then . . . .
Iterative refinement is cycled twice (default), or repeatedly until there is no further improvement, or until you reach your specified limit number.
Suitable for 100’s through 1000’s of sequences.
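The objective that the iterations try to improve can be sketched as a weighted sum-of-pairs score. This is a simplified stand-in: the match, mismatch, and per-column gap values are toy assumptions (real scoring uses substitution matrices and affine gaps), and the weights default to 1.0 rather than being derived from a tree.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one column of one sequence pair (toy values)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def weighted_sum_of_pairs(alignment, weights=None):
    """Sum the pairwise scores over all sequence pairs, weighted per pair."""
    n = len(alignment)
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for i, j in combinations(range(n), 2):
        pair_total = sum(pair_score(x, y)
                         for x, y in zip(alignment[i], alignment[j]))
        total += weights[i] * weights[j] * pair_total
    return total


before = ["ACGT", "AG-T", "ACGA"]
after = ["ACGT", "A-GT", "ACGA"]
print(weighted_sum_of_pairs(before), weighted_sum_of_pairs(after))
```

An iterative refinement cycle splits the alignment, realigns the parts, and keeps the result only if this kind of score goes up, stopping when it no longer improves or the iteration limit is reached.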
MAFFT’s most accurate modes
3) Advanced modes: progressive + iterations to maximize the objective WSP and T-Coffee-like consistency functions. Options differ according to the way the pairwise alignments are calculated.
a) EINSI (einsi) most general of these.
Uses a Smith-Waterman style local algorithm with generalized affine gap costs for the pairwise step. Most appropriate for sequences with multiple shared, similarly ordered domains, in an otherwise nearly unalignable ‘mess,’ e.g.:
ooooooXXX------XXXX-----------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
------XXXXXXXXXXXXXooo--------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX----------
--ooooXXXXXX---XXXXooooooooooo------------XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
------XXXXX----XXXXoooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX----------
------XXXXX----XXXX-----------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----
MAFFT’s most accurate modes, cont.
b) LINSI (linsi) strictly local.
Uses a Smith-Waterman style local algorithm with affine gap costs for the pairwise step. Most appropriate for sequences with only one single, shared domain, in an otherwise nearly unalignable ‘mess,’ e.g.:
--------------XXXXXXXXXXX-XXXXXXXXXXXXXXXoooooooooo
--------------XXXXXXXXXXXXXXXXXX-XXXXXXXX----------
--------------XXXXX----XXXXXXXXXXXXXXXXXXoooooooooo
ooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX----------
--------------XXXXX---XXXXXXXXXX--XXXXXXXooooo-----
MAFFT’s most accurate modes, cont.
c) GINSI (ginsi) strictly global.
Uses a Needleman-Wunsch style global algorithm with affine gap costs for the pairwise step. Most appropriate for sequences where only one single, shared domain extends the full length of all of the sequences, e.g.:
XXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXooooXXXooXXX
-XXXXXXXXXXXXXXXXXX-XXXXXXXX--XXXXXXX---XXX
XX--XXXXX---XXXXXXXXXXXXXXXXXXXoooooXXoooXX
ooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXXX-
XXXXX---XXXXXXXXXX--XXXXXXXoooooXXXXXXXXX--
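For scripting, the mode names above can be mapped onto explicit `mafft` command lines. The sketch below only builds the argument lists and does not run anything. The flag combinations are the ones the MAFFT manual gives for each named strategy as far as I recall them, so treat them as assumptions to check against `mafft --help` on your own installation; the input file name is a placeholder.

```python
# Mode name -> mafft flags (assumed from the MAFFT manual; verify locally).
MODE_FLAGS = {
    "FFTNS1": ["--retree", "1"],
    "FFTNS2": ["--retree", "2"],
    "FFTNSI": ["--retree", "2", "--maxiterate", "2"],
    "NWNSI": ["--retree", "2", "--maxiterate", "2", "--nofft"],
    "EINSI": ["--genafpair", "--maxiterate", "1000"],
    "LINSI": ["--localpair", "--maxiterate", "1000"],
    "GINSI": ["--globalpair", "--maxiterate", "1000"],
}


def mafft_command(mode, fasta_path):
    """Build the argument list for one named mode; mafft writes to stdout."""
    return ["mafft", *MODE_FLAGS[mode.upper()], fasta_path]


# "my_sequences.fasta" is a hypothetical input file.
for mode in MODE_FLAGS:
    print(" ".join(mafft_command(mode, "my_sequences.fasta")))
```

To actually run one, pass the list to `subprocess.run(..., capture_output=True, text=True, check=True)` and save the captured standard output as the alignment.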
How to know when to use what
for MAFFT: see “tips,” 2, 3, and 4 pages;
for all of them, the take-home message:
For simple cases it doesn’t really matter what program you use. For complicated situations it may, and what you use will depend on the size of your dataset, personal preferences, time allotted, and how much hand editing you want to do.
Really nice, recent review: Edgar, R.C. and Batzoglou, S. (2006) Multiple sequence alignment. Current Opinion in Structural Biology 16, 368–373.
The rest of my references can be found in my tutorial manuscript.
You can do a lot of this stuff on the Web, if you need to. Some resources for multiple sequence alignment:
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!
If large datasets become intractable for analysis on the Web, what other resources are available?
Desktop software solutions: all of these programs are available in public domain open source, but . . . they can be complicated to install, configure, and maintain. User must be pretty computer savvy.
So, commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per machine, lack of most recent programs, underperformance, and Internet and/or CD database access all complicate matters!
Therefore, I argue for UNIX server-based solutions . . .
UNIX servers: pros and cons
Free/public domain solutions still available, but now a very cooperative systems manager needs to maintain everything for users. If you have such a person, then:
You end up with a more powerful, and usually faster computer, with larger storage capabilities. Plus, connections can be made from any networked terminal or workstation anywhere!
Operating system: UNIX command line operation hassles; communications software (telnet, ssh, and terminal emulation); X graphics; file transfer (ftp, and scp/sftp); and editors (vi, emacs, pico/nano, or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.
Reliability and the Comparative Approach
explicit homologous correspondence;
manual adjustments should be encouraged, based on knowledge, especially structural, regulatory, and functional sites.
Therefore, editors like SeaView and databases like the Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp
Coding DNA issues
Work with proteins! If at all possible.
Twenty match symbols versus four, plus similarity versus identity!
Way better signal to noise.
Also guarantees no indels are placed within codons. So translate, then align.
Nucleotide sequences will only reliably align if they are very similar to each other. And they will likely require extensive and carefully considered hand editing with an editor like SeaView.
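If you still need a nucleotide alignment after following the "translate, then align" advice, one common approach is to thread each original coding sequence back onto its aligned protein, three bases per residue. Here is a minimal Python sketch; it assumes the coding sequence is in frame, starts at the first aligned residue, and has no stop codon to trim, and the function name is made up for this example.

```python
def thread_codons(aligned_protein, coding_dna):
    """Lay an unaligned coding sequence onto its aligned protein, codon by codon."""
    n_residues = sum(1 for c in aligned_protein if c != "-")
    if len(coding_dna) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the protein requires")
    out, pos = [], 0
    for c in aligned_protein:
        if c == "-":
            out.append("---")  # a protein gap becomes a whole-codon gap
        else:
            out.append(coding_dna[pos:pos + 3])
            pos += 3
    return "".join(out)


print(thread_codons("M-KL", "ATGAAACTG"))
```

Because every protein gap becomes exactly three nucleotide gap characters, no indel can land inside a codon.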
Beware of aligning apples and oranges [and grapefruit]!
receptors and/or activators with their namesake proteins;
paralogous versus orthologous;
genomic versus cDNA;
mature versus precursor.
Mask out uncertain areas.
Complications
Order dependence.
Not that big of a deal.
Substitution matrices and gap penalties.
Can be a very big deal!
Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity. SeaView lets you do this!
Complications cont.
Format hassles!
Specialized format conversion tools such as GCG’s SeqConv+ program and PAUPSearch, and Don Gilbert’s public domain ReadSeq program.
Plus, some programs like SeaView can read and write several formats.
Still more complications
Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:
., -, ~, ?, N, or X
. . . . . Help!
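One way to tame the symbol mess before handing a file to the next program is to normalize it yourself. The Python sketch below is deliberately conservative, and its defaults are assumptions: it rewrites only `.` and `~` to `-`, and leaves `?`, `N`, and `X` alone because those usually mean missing or ambiguous data rather than an indel. Be aware that in some formats `.` means "same as the first sequence" and not a gap, so check what your source program intended.

```python
from collections import Counter


def normalize_gaps(seq, gap_chars=".~", gap="-"):
    """Rewrite the listed gap-like symbols to a single gap character."""
    table = str.maketrans({c: gap for c in gap_chars})
    return seq.translate(table)


def symbol_survey(seqs, symbols=".-~?NX"):
    """Count how often each potentially troublesome symbol occurs."""
    counts = Counter()
    for seq in seqs:
        counts.update(c for c in seq if c in symbols)
    return dict(counts)


rows = ["AC..GT~~N?", "ACGTGT--AC"]
print(symbol_survey(rows))
print([normalize_gaps(r) for r in rows])
```

Running the survey first shows which conventions a file actually uses, so the conversion is a decision you make and not a guess.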
FOR MORE INFO...
Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.
Contact me (stevet@[email protected]) for specific long-distance bioinformatics assistance and collaboration.
Conclusions
Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”
On to a demonstration of some of SeaView’s multiple sequence dataset capabilities:
The HPV L1 gene and complete genome . . . the tutorial:
How to use SeaView with MAFFT.