Multiple Sequence Alignment: An Introduction to Bioinformatics

AIMS

OBJECTIVES

To introduce the different approaches to multiple sequence alignment

To identify criteria for selecting a multiple sequence alignment program

To select an appropriate multiple sequence alignment program

To carry out a multiple sequence alignment using CLUSTALX

The result of searching databases is the establishment of a list of sequences, either protein or nucleotide, which exhibit significant similarity and are inferred to be homologous

These sequences can then be subjected to multiple sequence alignment

Multiple sequence alignment is the process of attempting to place in the same column residues that derive from a common ancestral residue by substitution

The most successful alignment is the one that most closely represents the evolutionary history of the sequences

Why create multiple sequence alignments?

to attempt a phylogenetic analysis of the sequences so as to construct evolutionary trees

the identification of functional sites

the identification of modules in multimodular proteins

the detection of weak similarities in databases using profiles

the design of PCR primers for the identification of related genes

the identification of motifs

Global versus local alignments

Things would be much simpler if we only considered sequences that are homologous over their entire length and could be globally aligned

Homology is often restricted to certain regions of sequence

Many proteins are multi-modular and the shuffling of modules is part of the evolutionary process

An attempt to align, over their entire length, sequences that share some, but not all, of their modules would be bound to lead to errors

In such a case a series of multiple local sequence alignments of each of the modules would be appropriate

Substitutions and Gaps

In trying to establish the evolutionary trajectories of a group of related sequences the same problem is encountered as met in pairwise alignment

The solution is the same

How do you deal with substitutions and gaps?

Use of gap penalties, gap extension penalties and substitution matrices such as PAM and BLOSUM
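As a rough illustration of how these three ingredients combine, the sketch below scores an already-aligned pair of rows. The numeric values and the names MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND, substitution_score and score_alignment are assumptions made for the example; a real program would look up each residue pair in a PAM or BLOSUM matrix rather than use a flat match/mismatch score.

```python
# Toy scoring scheme (assumed values, not PAM or BLOSUM entries).
MATCH, MISMATCH = 2, -1
GAP_OPEN, GAP_EXTEND = -5, -1  # first gap position, each further position


def substitution_score(a: str, b: str) -> int:
    """Stand-in for a lookup in a substitution matrix such as BLOSUM62."""
    return MATCH if a == b else MISMATCH


def score_alignment(row1: str, row2: str) -> int:
    """Score two aligned rows of equal length, with '-' marking a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_in = None  # which row (1 or 2) the currently open gap is in
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue  # a column made only of gaps carries no information
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            # Opening a gap is expensive; lengthening it is cheap.
            total += GAP_EXTEND if gap_in == row else GAP_OPEN
            gap_in = row
        else:
            total += substitution_score(a, b)
            gap_in = None
    return total


# One gap of length two versus two separate gaps of length one.
print(score_alignment("KSKERYK", "K--ERYQ"))
print(score_alignment("KSKER", "K-K-R"))
```

Because opening a gap costs more than extending one, this kind of scheme favours a few long gaps over many short ones.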

There are essentially four major approaches to multiple sequence alignment:

Optimal global sequence alignment

Progressive global alignment

Block-based global alignment

Motif-based local alignment

Optimal global sequence alignment

Attempts to align sequences along their entire length.

‘Optimal’ means that it will give the best alignment amongst all the possible solutions for a given scoring scheme

Whether the optimal alignment corresponds with the biologically correct alignment will depend on a variety of factors, e.g. the substitution matrix, the gap penalty and the scoring scheme

Optimal global sequence alignment programs are computationally very intensive, and the complexity of the task increases exponentially with the number of sequences
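To give a feel for this growth, the sketch below assumes the usual dynamic programming formulation, in which pairwise alignment is generalised to a lattice with one axis per sequence. Under that assumption the lattice has one cell for every combination of positions, and each cell can be reached from up to 2^N - 1 neighbouring cells. The function names are hypothetical and the sequence length of 100 is chosen only for illustration.

```python
from math import prod


def lattice_cells(lengths):
    """Number of cells in the dynamic programming lattice."""
    return prod(n + 1 for n in lengths)


def predecessors_per_cell(n_seqs):
    """Each cell is reached by advancing in any non-empty subset of sequences."""
    return 2 ** n_seqs - 1


for n in (2, 3, 5, 10):
    lengths = [100] * n  # n sequences, each 100 residues long
    print(n, lattice_cells(lengths), predecessors_per_cell(n))
```

Adding one more sequence of 100 residues multiplies the number of cells by 101, which is why this approach is only practical for a handful of sequences.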

There are few programs which employ this approach - there is one available on the Web

Progressive global alignment

Employs multiple pairwise alignments in a series of three steps:

1. Estimate alignment scores between all possible pairwise combinations of sequences in the set

2. Build a ‘guide tree’ determined by the alignment scores

3. Align the sequences on the basis of the guide tree

Each step can be carried out in a number of ways designed to increase speed or accuracy

Progressive global alignment is the most commonly used method and the best known programs employing this approach are the CLUSTAL family
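The first two steps can be sketched as follows. This is a simplified illustration and not the CLUSTAL implementation: the scores MATCH, MISMATCH and GAP are assumed toy values, the pairwise score is a plain Needleman-Wunsch score with a linear gap penalty, and the guide tree is built by repeatedly joining the two clusters with the highest average pairwise score. Step 3, the alignment itself, is left out.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 1, -1, -2  # toy scores, assumed for the example


def nw_score(a: str, b: str) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * GAP]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            curr.append(max(diag, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return prev[-1]


def guide_tree(seqs: dict):
    """Return a nested tuple giving the order in which to align sequences."""
    # Step 1: score every pairwise combination of sequences.
    score = {frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
             for pair in combinations(seqs, 2)}
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def average_score(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(score[frozenset(p)] for p in pairs) / len(pairs)

    # Step 2: repeatedly join the two most similar clusters.
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_score(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


seqs = {"a": "REDWWDANRE", "b": "REDWWTANRE", "c": "KLSNASYFRA"}
print(guide_tree(seqs))
```

The nested tuple reads from the inside out: the innermost pair is aligned first, and each enclosing level adds a further sequence or group to the growing alignment.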

Block-based global alignment

Divides the sequences into blocks which, depending on the program, are exact (identical regions of sequence) or not exact, and uniform (found in every sequence) or not uniform

Once the blocks have been defined other approaches are employed to align regions between the blocks

Examples of block-based global alignment programs available on the Web are DCA and DIALIGN2

Once blocks have been identified other programs (e.g. CLUSTAL X) can be used to multiply align individual modules

Motif-based local alignment

Most recent local alignment programs employ computationally efficient heuristics to solve optimization calculations for local alignments

The Gibbs iterative sampling approach is used to find blocks in programs such as the excellent MACAW

MACAW, although available as freeware, is not available as a Web-based application

MEME is Web-based

Which method to use

Optimal global alignment programs are rarely employed: because of their computationally intensive requirements they can only handle a very small number of sequences

When the sequences to be aligned are homologous over their entire length a progressive global alignment program should be used.

Where the sequences share conserved modules in a consistent order, block-based global alignment or motif-based local alignment is appropriate

Where the sequences share conserved modules, but the order of modules is not consistent, a motif-based local alignment is the approach of choice

Multiple sequence alignment file types

The various multiple sequence alignment programs require different input file types and there is also a variety of output file types

The sequences to be aligned are usually placed in a single file, commonly in the Fasta format
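A Fasta file holds each sequence as a header line beginning with ">" followed by one or more lines of residues. The sketch below is a minimal reader written for illustration; parse_fasta is a hypothetical name, and the example records are short excerpts with a made-up description.

```python
def parse_fasta(text: str) -> dict:
    """Read Fasta-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("header line without a sequence name")
            name = header[0]  # the name is the first word of the header
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    # Join the residue lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 hypothetical description
KSKERYKDENGGNYFQL
REDWWDANRE
>seq2
YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRE
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```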

The common output file formats are: NBRF/PIR, EMBL/SWISS-PROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file

Multiple sequence files can be interconverted using

Sequence formats that allow one or more sequences:

• IG/Stanford, used by Intelligenetics and others
•* GenBank/GB, genbank flatfile format
•* NBRF format
•* EMBL, EMBL flatfile format
•* DNAStrider, for common Mac program
•* Fitch format, limited use
•* Pearson/Fasta, a common format used by Fasta programs and others
•* Zuker format, limited use. Input only.
•* Olsen, format printed by Olsen VMS sequence editor. Input only.
•* Phylip3.2, sequential format for Phylip programs
•* Phylip, interleaved format for Phylip programs (v3.3, v3.4)
•+ MSF multi sequence format used by GCG software
•+ PAUP's multiple sequence (NEXUS) format
•+ PIR/CODATA format used by PIR
•+ ASN.1 format used by NCBI

Phylip

The first line of the input file contains the number of species and the number of characters, separated by blanks. The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species. Phylip format files can be interleaved, as in the example below, or sequential.

7 123
seq1      ---------- ---------- ---KSKERYK DENGGNYFQL REDWWDANRE
seq2      ---------- -----YEGLT TANGXKEYYQ DKNGGNFFKL REDWWTANRE
seq3      ---------- ---------- ----SQRHYK D-DGGNYFQL REDWWTANRH
seq4      ---------- ---------- NVAALKTRYE K-DGQNFYQL REDWWTANRA
seq5      ----KRIYKK IFKEIHSGLS TKNGVKDRYQ N-DGDNYFQL REDWWTANRS
seq6      ------FSKN IX--QIEELQ DEWLLEARYK D--TDNYYEL REHWWTENRH
seq7      ---------- ---------- ---------- ---------- ---------K

TVWKAITCNA --GGGKYFRN TCDG--GQNP TETQNNCRCI G---------
TVWKAITCGA P-GDASYFHA TCDSGDGRGG AQAPHKCRCD G---------
TVWEAITCSA DKGNA-YFRR TCNSADGKSQ SQARNQCRC- --KDENGKN-
TIWEAITCSA DKGNA-YFRA TCNSADGKSQ SQARNQCRC- --KDENGXN-
TVWKALTCSD KLSNASYFRA TC--SDGQSG AQANNYCRCN GDKPDDDKP-
TVWEALTCEA P-GNAQYFRN ACS----EGK TATKGKCRCI SGDP------
ELWEALTCSR P-KGANYFVY KLD-----RP KFSSDRCGHN YNGDP-----
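The layout rules described above (a header line with the two counts, a ten-character name on each line of the first block only, later blocks continuing in the same species order) can be turned into a small reader. The sketch below is an illustration under those rules; parse_phylip_interleaved is a hypothetical name and the two-sequence example is made up to keep it short.

```python
def parse_phylip_interleaved(text: str) -> dict:
    """Read interleaved Phylip text into a {name: sequence} dictionary."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    # Header: number of species, then number of characters.
    n_seqs, n_chars = (int(x) for x in lines[0].split()[:2])
    body = lines[1:]
    if n_seqs < 1 or not body or len(body) % n_seqs != 0:
        raise ValueError("line count does not match the number of species")
    names, chunks = [], []
    for i, line in enumerate(body):
        if i < n_seqs:
            # First block: columns 1-10 hold the species name.
            names.append(line[:10].strip())
            chunks.append([])
            line = line[10:]
        # Drop the blanks that separate groups of ten characters.
        chunks[i % n_seqs].append("".join(line.split()))
    seqs = {name: "".join(parts) for name, parts in zip(names, chunks)}
    for name, seq in seqs.items():
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, found {len(seq)}")
    return seqs


example = "\n".join([
    " 2 12",
    "seqA".ljust(10) + "ACGT-A",
    "seqB".ljust(10) + "ACGTTA",
    "",
    "CCGGTT",
    "CC--TT",
])
print(parse_phylip_interleaved(example))
```

Checking each sequence against the character count in the header is a cheap way to catch a file that has lost a block or been truncated.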

clustal

Clustal format files contain the word clustal at the beginning. Sequences can be interleaved (as in the example below) or sequential. Note: the multiple sequence alignment programs ClustalW and ClustalX produce Clustal format files by default, but you can specify a different format for your results under "output format options".

CLUSTAL W (1.74) multiple sequence alignment

seq1 -----------------------KSKERYKDENGGNYFQLREDWWDANRETVWKAITCNA
seq2 ---------------YEGLTTANGXKEYYQDKNGGNFFKLREDWWTANRETVWKAITCGA
seq3 ----KRIYKKIFKEIHSGLSTKNGVKDRYQN-DGDNYFQLREDWWTANRSTVWKALTCSD
seq4 ------------------------SQRHYKD-DGGNYFQLREDWWTANRHTVWEAITCSA
seq5 --------------------NVAALKTRYEK-DGQNFYQLREDWWTANRATIWEAITCSA
seq6 ------FSKNIX--QIEELQDEWLLEARYKD--TDNYYELREHWWTENRHTVWEALTCEA
seq7 -------------------------------------------------KELWEALTCSR

seq1 --GGGKYFRNTCDG--GQNPTETQNNCRCIG----------ATVPTYFDYVPQYLRWSDE
seq2 P-GDASYFHATCDSGDGRGGAQAPHKCRCDG---------ANVVPTYFDYVPQFLRWPEE
seq3 KLSNASYFRATC--SDGQSGAQANNYCRCNGDKPDDDKP-NTDPPTYFDYVPQYLRWSEE
seq4 DKGNA-YFRRTCNSADGKSQSQARNQCRC---KDENGKN-ADQVPTYFDYVPQYLRWSEE
seq5 DKGNA-YFRATCNSADGKSQSQARNQCRC---KDENGXN-ADQVPTYFDYVPQYLRWSEE
seq6 P-GNAQYFRNACS----EGKTATKGKCRCISGDP----------PTYFDYVPQYLRWSEE
seq7 P-KGANYFVYKLD-----RPKFSSDRCGHNYNGDP---------LTNLDYVPQYLRWSDE
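Interconverting formats comes down to reading one layout and writing another. The sketch below reads interleaved Clustal text and writes it back out as Fasta, keeping the gap characters. It is a minimal illustration: parse_clustal and to_fasta are hypothetical names, the example alignment is made up, and the reader assumes that sequence names contain no blanks and that the conservation line under each block starts with a blank.

```python
def parse_clustal(text: str) -> dict:
    """Read interleaved Clustal text into a {name: aligned sequence} dictionary."""
    lines = text.splitlines()
    if not lines or not lines[0].upper().startswith("CLUSTAL"):
        raise ValueError("not a Clustal file: header line missing")
    seqs = {}
    for line in lines[1:]:
        # Skip blank lines and the conservation line under each block.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # a name with no residues: nothing to add
        # fields[0] is the name, fields[1] the residues; any further
        # field (an optional residue count) is ignored.
        seqs.setdefault(fields[0], []).append(fields[1])
    return {name: "".join(parts) for name, parts in seqs.items()}


def to_fasta(seqs: dict, width: int = 60) -> str:
    """Write sequences as Fasta text, wrapping lines at the given width."""
    out = []
    for name, seq in seqs.items():
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


example = """CLUSTAL W (1.74) multiple sequence alignment

seq1 AC-GT
seq2 ACGGT
     ** **

seq1 TTA
seq2 TTC
     **
"""
print(to_fasta(parse_clustal(example)))
```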
