Multiple Sequence Alignment An Introduction to Bioinformatics.
Date posted: 20-Dec-2015
Multiple Sequence Alignment
An Introduction to Bioinformatics
AIMS
OBJECTIVES
To introduce the different approaches to multiple sequence alignment
To identify criteria for selecting a multiple sequence alignment program
To select an appropriate multiple sequence alignment program
To carry out a multiple sequence alignment using CLUSTALX
The result of searching databases is the establishment of a list of sequences, either protein or nucleotide, which exhibit significant similarity and are inferred to be homologous
These sequences can then be subjected to multiple sequence alignment
Multiple sequence alignment is an attempt to place in the same column residues that derive from a common ancestral residue by substitution
The most successful alignment is the one that most closely represents the evolutionary history of the sequences
Why create multiple sequence alignments?
to attempt a phylogenetic analysis of the sequences so as to construct evolutionary trees
the identification of functional sites
the identification of modules in multimodular proteins
the detection of weak similarities in databases using profiles
the design of PCR primers for the identification of related genes
the identification of motifs
Global versus local alignments
Things would be much simpler if we only considered sequences that are homologous over their entire length and could be globally aligned
Homology is often restricted to certain regions of sequence
Many proteins are multi-modular and the shuffling of modules is part of the evolutionary process
An attempt to align, over their entire length, sequences that share some, but not all of their modules, would be bound to lead to errors
In such a case a series of multiple local sequence alignments of each of the modules would be appropriate
Substitutions and Gaps
In trying to establish the evolutionary trajectories of a group of related sequences, the same problem is encountered as in pairwise alignment
The solution is the same
How do you deal with substitutions and gaps?
Use of gap penalties, gap extension penalties and substitution matrices such as PAM and BLOSUM
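As a rough illustration of how substitution scores and gap penalties combine into a single alignment score, the Python sketch below scores two aligned rows with an affine gap penalty (a gap opening penalty plus a smaller gap extension penalty). It then extends this to many rows by summing over all pairs, which is one common convention and not necessarily the one a given program uses. The numeric values (match +2, mismatch -1, gap open -5, gap extend -1) are illustrative assumptions standing in for a real matrix such as PAM or BLOSUM.

```python
def substitution_score(a, b, match=2, mismatch=-1):
    """Toy substitution score; a real program would look up PAM or BLOSUM."""
    return match if a == b else mismatch


def score_pairwise(row_a, row_b, gap_open=-5, gap_extend=-1):
    """Score two aligned rows of equal length with an affine gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_in = None  # which row the currently open gap is in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column of two gaps carries no pairwise information
        if a == "-" or b == "-":
            which = "a" if a == "-" else "b"
            # Opening a new gap costs gap_open; extending it costs gap_extend.
            total += gap_extend if gap_in == which else gap_open
            gap_in = which
        else:
            total += substitution_score(a, b)
            gap_in = None
    return total


def sum_of_pairs(rows, **penalties):
    """Score a multiple alignment as the sum over all pairs of rows."""
    return sum(
        score_pairwise(rows[i], rows[j], **penalties)
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
    )
```

With these assumed values a two-residue gap costs -6 in total (-5 to open, -1 to extend), so one long gap is penalised less than two separate short ones.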
There are essentially four major approaches to multiple sequence alignment:
Optimal global sequence alignment
Progressive global alignment
Block-based global alignment
Motif-based local alignment
Optimal global sequence alignment
Attempts to align sequences along their entire length.
‘Optimal’ means that it will give the best alignment amongst all the possible solutions for a given scoring scheme
Whether the optimal alignment corresponds with the biologically correct alignment will depend on a variety of factors e.g. substitution matrix, the gap penalty and the scoring scheme
Optimal global sequence alignment programs are computationally very intensive, and the complexity of the task increases exponentially with the number of sequences
There are few programs which employ this approach - there is one available on the Web
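To make the exponential growth concrete, the sketch below assumes the optimal method is a direct extension of pairwise dynamic programming to N sequences (the slides do not name the algorithm). Under that assumption the scoring lattice has one cell for every combination of positions across the sequences, and each cell depends on every non-empty subset of sequences advancing by one residue. The function names are hypothetical.

```python
def dp_lattice_cells(n_sequences, length):
    """Cells in the dynamic programming lattice for sequences of equal length."""
    return (length + 1) ** n_sequences


def predecessors_per_cell(n_sequences):
    """Neighbouring cells that must be examined to fill in one cell."""
    return 2 ** n_sequences - 1


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, dp_lattice_cells(n, 100), predecessors_per_cell(n))
```

For sequences of 100 residues, each extra sequence multiplies the number of cells by 101, which is why only a very small number of sequences is practical.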
Progressive global alignment
employs multiple pairwise alignments in a series of three steps:
1. Estimate alignment scores between all possible pairwise combinations of sequences in the set
2. Build a ‘guide tree’ determined by the alignment scores
3. Align the sequences on the basis of the guide tree
Each step can be carried out in a number of ways designed to increase speed or accuracy
Progressive global alignment is the most commonly used method, and the best known programs employing this approach belong to the CLUSTAL family
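The first two of the three steps can be sketched in Python as follows. This is a simplified illustration and not the CLUSTAL implementation: the pairwise score is a plain Needleman-Wunsch global alignment score with assumed values (match +1, mismatch -1, gap -2), and the guide tree is built by average-linkage clustering, chosen here only because it is short to write. Step 3, aligning sequences and groups of sequences in the order given by the tree, is left out.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def guide_tree(seqs):
    """Return a nested tuple of names giving the order of alignment.

    `seqs` maps sequence names to sequences.
    """
    names = list(seqs)
    # Step 1: alignment score for every pairwise combination of sequences.
    score = {
        frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(names, 2)
    }

    # Each cluster is (tree, member_names); start with one per sequence.
    clusters = [(name, [name]) for name in names]

    def average_score(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(score[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly join the two most similar clusters.
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_score(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

The nested tuple reads from the inside out: the innermost pair is aligned first, and more distant sequences are added to the growing alignment later.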
Block-based global alignment
Divides the sequences into blocks which, depending on the program, may be exact (identical regions of sequence) or inexact, and uniform (found in every sequence) or non-uniform
Once the blocks have been defined other approaches are employed to align regions between the blocks
Examples of block-based global alignment programs available on the Web are DCA and DIALIGN2
Once blocks have been identified other programs (e.g. CLUSTAL X) can be used to multiply align individual modules
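The simplest kind of block described above, one that is both exact and uniform, can be found with a short Python sketch. It looks for substrings of a fixed length `k` that occur in every sequence. This is only an illustration of the definition; it is not how DCA or DIALIGN2 work internally, and the function name and the fixed-length restriction are assumptions made for brevity.

```python
def exact_uniform_blocks(seqs, k):
    """Return substrings of length k present in every sequence.

    The result maps each shared substring to the position of its first
    occurrence in each sequence, in input order.
    """
    if not seqs:
        return {}

    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}

    shared = set.intersection(*(kmers(s) for s in seqs))
    return {block: [s.find(block) for s in seqs] for block in sorted(shared)}
```

The regions between consecutive blocks are then what the other approaches are left to align.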
Motif-based local alignment
Most recent local alignment programs employ computationally efficient heuristics to solve optimization calculations for local alignments
The Gibbs iterative sampling approach is used to find blocks in programs such as the excellent MACAW
MACAW, although available as freeware, is not available as a Web-based application
MEME is Web-based
Which method to use
Optimal global alignment programs are rarely employed: they are computationally intensive and can only handle a very small number of sequences
When the sequences to be aligned are homologous over their entire length a progressive global alignment program should be used.
Where the sequences share conserved modules in a consistent order, block-based global alignment or motif-based local alignment is appropriate
Where the sequences share conserved modules, but the order of modules is not consistent, a motif-based local alignment is the approach of choice
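The selection rules above can be written down as a small Python function. The function and parameter names are hypothetical, and the function simply restates the criteria on this slide; optimal global alignment is left out because it is rarely a practical choice.

```python
def choose_method(homologous_over_full_length, shares_modules=False,
                  consistent_module_order=False):
    """Return the suggested alignment approaches for a set of sequences."""
    if homologous_over_full_length:
        return ["progressive global alignment"]
    if shares_modules and consistent_module_order:
        return ["block-based global alignment", "motif-based local alignment"]
    if shares_modules:
        return ["motif-based local alignment"]
    return []  # no shared regions: the slides give no recommendation
```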
Multiple sequence alignment file types
The various multiple sequence alignment programs will require different input file types and there are also a variety of output file types
The sequences to be aligned are usually placed in a single file commonly in the Fasta format
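For reference, a Fasta file is a series of records, each with a header line starting with `>` followed by one or more lines of sequence. The Python sketch below reads such a file into a dictionary; it assumes the first word after `>` is the sequence name, and the function name is hypothetical.

```python
def read_fasta(text):
    """Parse Fasta-formatted text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines are joined; a sequence may span several lines.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}
```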
The common output file formats are: NBRF/PIR, EMBL/SWISS-PROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file
Multiple sequence files can be interconverted using
Sequence formats that allow one or more sequences:
• IG/Stanford, used by Intelligenetics and others
•* GenBank/GB, genbank flatfile format
•* NBRF format
•* EMBL, EMBL flatfile format
•* DNAStrider, for common Mac program
•* Fitch format, limited use
•* Pearson/Fasta, a common format used by Fasta programs and others
•* Zuker format, limited use. Input only.
•* Olsen, format printed by Olsen VMS sequence editor. Input only.
•* Phylip3.2, sequential format for Phylip programs
•* Phylip, interleaved format for Phylip programs (v3.3, v3.4)
•+ MSF multi sequence format used by GCG software
•+ PAUP's multiple sequence (NEXUS) format
•+ PIR/CODATA format used by PIR
•+ ASN.1 format used by NCBI
Phylip
The first line of the input file contains the number of species and the number of characters separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species. Phylip format files can be interleaved, as in the example below, or sequential.

7 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL REDWWDANRE
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5      ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6      ------FSKN IX--QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7      ---------- ---------- ---------- ---------- ---------K

TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G---------
TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN-
TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN-
TVWKALTCSD KLSNASYFRA TC--SDGQSG AQANNYCRCN GDKPDDDKP-
TVWEALTCEA P-GNAQYFRN ACS----EGK TATKGKCRCI SGDP------
ELWEALTCSR P-KGANYFVY KLD-----RP KFSSDRCGHN YNGDP-----
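The layout described above can be read with a short Python sketch. It assumes the interleaved form, a name field of exactly ten characters in the first block, and later blocks that list the sequences in the same order without names. The function name is hypothetical, and the length check against the header is an added safeguard.

```python
def read_phylip_interleaved(text):
    """Parse interleaved Phylip text into a dict of name -> sequence."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_chars = (int(x) for x in lines[0].split())
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < n_species:
            # First block: ten-character name, then the characters.
            names.append(line[:10].strip())
            seqs.append(line[10:].replace(" ", ""))
        else:
            # Later blocks: characters only, same species order.
            seqs[i % n_species] += line.replace(" ", "")
    for name, seq in zip(names, seqs):
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
    return dict(zip(names, seqs))
```

Blanks inside the sequence are only there for readability, so they are removed when the blocks are joined.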
Clustal
Clustal format files begin with the word CLUSTAL. Sequences can be interleaved (as in the example below) or sequential. Note: the multiple sequence alignment programs CLUSTALW and CLUSTALX produce Clustal format files by default, but you can specify in "output format options" if you want your results in a different format.
CLUSTAL W (1.74) multiple sequence alignment

seq1 -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2 ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3 ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
seq4 ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5 --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6 ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7 -------------------------------------------------KELWEALTCSR

seq1 --GGGKYFRNTCDG--GQNPTETQNNCRCIG----------ATVPTYFDYVPQYLRWSDE
seq2 P-GDASYFHATCDSGDGRGGAQAPHKCRCDG---------ANVVPTYFDYVPQFLRWPEE
seq3 KLSNASYFRATC--SDGQSGAQANNYCRCNGDKPDDDKP-NTDPPTYFDYVPQYLRWSEE
seq4 DKGNA-YFRRTCNSADGKSQSQARNQCRC---KDENGKN-ADQVPTYFDYVPQYLRWSEE
seq5 DKGNA-YFRATCNSADGKSQSQARNQCRC---KDENGXN-ADQVPTYFDYVPQYLRWSEE
seq6 P-GNAQYFRNACS----EGKTATKGKCRCISGDP----------PTYFDYVPQYLRWSEE
seq7 P-KGANYFVYKLD-----RPKFSSDRCGHNYNGDP---------LTNLDYVPQYLRWSDE
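An interleaved Clustal file like the one above can be read back into whole sequences with a few lines of Python. This sketch assumes each alignment line is a name followed by whitespace and a chunk of sequence, and that any conservation line under a block starts with a space. The function name is hypothetical.

```python
def read_clustal(text):
    """Parse Clustal-format text into a dict of name -> aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: first line must start with CLUSTAL")
    seqs = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (which start with a space).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Chunks for the same name in later blocks are appended in order.
        seqs[name] = seqs.get(name, "") + chunk
    return seqs
```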