MCB 5472 Lecture #6: Sequence alignment March 27, 2014


Sequence alignment
As you have seen, sequence alignment is key to nearly all experiments in molecular evolution
Thus far we have discussed local alignment as implemented in BLAST
Global alignment:
  Aligns sequences over their entire length
  Assumes that sequences for alignment are homologous

Recall from BLAST lecture:
Sequence alignment is scored:
  According to a substitution matrix
    Some substitutions are more likely than others
  Using affine gaps
    Gap opening and extension are considered separately
    Reflects biological reality that
Alignment score is the sum of substitution and gap scores

Pairwise alignments

Needleman-Wunsch
Needleman and Wunsch (1970) J. Mol. Biol. 48:443-453
The first algorithm to compute the optimum alignment between two sequences using dynamic programming
  i.e., examines many possible solutions and picks the best
Implemented as the EMBOSS needle program
Global alignments: assumes sequences should be aligned over their entire lengths

Needleman-Wunsch
Works by scoring alignments sequentially and evaluating scoring for each alignment position based on previous scores
  Gaps in one position may force an unlikely substitution later on
Calculating these scores represents a dynamic programming sub-problem; computationally efficient
Evaluates all possibilities but also maps the best option
Guarantees finding the best path according to the parameters used

Smith-Waterman
Smith and Waterman (1981) J. Mol. Biol. 147:195-197
Local alignment version of Needleman-Wunsch
Guaranteed to find the best-scoring local alignment according to the parameters used
  BLAST only evaluates a subset of alignment possibilities
Implemented in FASTA search program

Multiple sequence alignment
Extending similar dynamic programming approaches to calculate all possible sequence alignments quickly becomes impossible
Various tools therefore use different heuristic approaches to align multiple sequences
  Different specializations and/or motivations
  Different computational efficiency

ClustalW
One of the first widely used multiple sequence alignment programs
Thompson et al. (1994) Nuc. Acids Res. 22:4673-4680
Larkin et al. (2007) Bioinformatics 23:2947-2948
ClustalX: Widely used version with a graphical interface
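
To make the pairwise dynamic programming idea from the Needleman-Wunsch and Smith-Waterman slides concrete, here is a minimal Python sketch. It is not the EMBOSS or FASTA implementation. The function name `align` and its parameters are hypothetical, and it makes two simplifying assumptions: a flat match/mismatch score stands in for a substitution matrix, and a single linear gap penalty stands in for affine gaps. Each cell of the table is scored from the three neighbouring cells already filled in, and the alignment is recovered by tracing back from the end (global) or from the best-scoring cell (local).

```python
def align(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return (score, aligned_a, aligned_b).

    local=False: global alignment in the style of Needleman-Wunsch.
    local=True:  local alignment in the style of Smith-Waterman.
    Assumption: match/mismatch scores and a linear gap penalty are
    simplifications of a substitution matrix and affine gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if local:
                cell = max(cell, 0)         # a local alignment may restart
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Traceback: global starts at the bottom-right corner,
    # local starts at the highest-scoring cell and stops at a zero.
    i, j = best_pos if local else (n, m)
    final_score = score[i][j]
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if local and score[i][j] == 0:
            break
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return final_score, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(align("MQTIF", "LHIW"))                            # global
    print(align("XXGATTACAYY", "ZZGATTACAWW", local=True))   # local
```

When several alignments tie for the best score, this sketch reports only one of them; which one depends on the order in which the traceback checks its three moves.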

ClustalW
Step #1a: Align all pairs of sequences separately
  Current default: count kmers conserved between sequences
  Can also be global alignments (original defaults)
Step #1b: Calculate distance matrix from pairwise comparisons
Step #2: Cluster distance matrix using neighbor-joining or UPGMA algorithms to create a guide tree
Step #3: Midpoint root tree and weight branches by sequence similarity

ClustalW guide trees
Guide trees are no substitute for full phylogenetic analysis!!!
  Not based on multiple sequence alignment
  Are only a rough approximation of the true relationships between sequences
Even though they can be produced by ClustalW they should not be used for detailed analysis!

ClustalW
Step #4: Progressive alignment
  Starting from most similar sequences on guide tree, align each to each other
  Uses full dynamic programming alignment methods (cf. N-W) including substitution and gap penalties
  Gap opening parameters vary based on sequence position to favor alignment to preexisting gaps
  Any gaps introduced are maintained during subsequent alignment iterations

Progressive alignment example
Gaps introduced earlier are propagated into later alignments

  M Q T I F        L Q S W
  L H I W          L S F

  M Q T I F        L Q S W
  L H - I W        L - S F

  M Q T I F
  L H - I W
  L Q S - W
  L - S - F

Edgar (2004) BMC Bioinformatics 5:133

ClustalW
[Figure: Step #1, Step #2, Step #3, Step #4. Thompson et al. (1994) Nuc. Acids Res. 22:4673-4680]
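
Steps #1 and #2 above (pairwise comparison, distance matrix, guide tree) can be sketched in a few lines of Python. This is an illustration only, not ClustalW's code. All function names are hypothetical, the k-mer distance formula (one minus the fraction of shared k-mers, relative to the smaller k-mer set) is an assumption chosen for simplicity, and the clustering is a plain average-linkage (UPGMA-style) procedure. Midpoint rooting and branch weighting from Step #3 are left out.

```python
from itertools import combinations


def kmers(seq, k):
    """Set of all distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Assumed distance: 1 - shared k-mers / size of the smaller k-mer set."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 1.0  # a sequence shorter than k shares nothing
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def distance_matrix(seqs, k=2):
    """Pairwise distances keyed by the unordered pair of sequence names."""
    return {
        frozenset((x, y)): kmer_distance(seqs[x], seqs[y], k)
        for x, y in combinations(seqs, 2)
    }


def guide_tree(seqs, k=2):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    dist = distance_matrix(seqs, k)
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    toy = {  # made-up sequences, for illustration only
        "s1": "ACGTACGT",
        "s2": "ACGTACGA",
        "s3": "TTTTGGGG",
        "s4": "TTTTGGGC",
    }
    print(guide_tree(toy))
```

The nested tuples give the order in which a progressive aligner would merge sequences: the innermost pairs are aligned first. As the guide tree slide stresses, a tree built this way is only a rough ordering device and not a phylogeny.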

Thompson et al. (1994) Nuc. Acids Res. 22:4673-4680

ClustalW
No guarantee of optimal alignment
  Early errors propagated to more divergent sequences
  This is true of all multiple sequence alignment programs
Reasonably accurate when all sequences are ~