Chapter 9: Building a multiple sequence alignment
The man with two watches never knows the time, and the man with one watch only thinks he knows.
-- A man with multiple watches
Background knowledge
• Multiple alignment is a Swiss Army knife of bioinformatics:*
  – predicting protein structure,
  – predicting protein function,
  – phylogenetic analysis.
• It is both art and science.
• It is easy to generate a bad alignment that looks good.
* Perl is the Swiss Army knife for bioinformatics software programmers.
[Figure: multiple sequence alignment example]
Rules of evolution
• Important aa are NOT allowed to mutate.
• Less important residues change easily, sometimes randomly, sometimes to gain a new function.
• Conserved positions in an alignment are more important than non-conserved positions.
Criteria for multiple sequence alignment
• Similarity
  – structural: aa with the same role in the structure are in the same column (aligned),
  – evolutionary: aa related to the same predecessor are aligned,
  – functional: aa with the same function are aligned,
  – sequence: aa that yield the best alignment are aligned.
Applications of multiple sequence alignment
• Extrapolation
• Phylogenetic analysis
• Functional pattern identification
• Domain identification
• Identification of regulatory elements
• Structure prediction for proteins and RNA
• PCR analysis with the least degeneracy: http://blocks.fhcrc.org/codehop.html
Outline
• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable
Gathering the sequences
• Proteins are better than DNA (shorter, more informative); exceptions are primer design and non-coding targets.
• A protein family is usually too large to collect all of its sequences. Start with 10, remove troublemakers, and rerun with 50 sequences.
• Discard sequences with <30% or >90% identity.
• Read the DE (description) section of each sequence entry to check for transposons.
• Discard sequences that are shorter than the others.
• Extract (with Dotlet) sequences with repeated domains.
• Note: there are mistakes in databases.
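The identity cut-offs above can be sketched in a few lines. This is only an illustration under stated assumptions: each candidate is already aligned to a chosen reference sequence (equal length, "-" for gaps), and identity is counted over columns where neither sequence has a gap, which is one of several definitions in use. The function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def keep_by_identity(reference: str, candidates: dict, low: float = 30.0, high: float = 90.0) -> list:
    """Return the names of candidates whose identity to the reference lies in [low, high]."""
    return [name for name, seq in candidates.items()
            if low <= percent_identity(reference, seq) <= high]
```

Calling `keep_by_identity(reference, {"name": aligned_sequence, ...})` returns the names worth keeping; sequences that are near-duplicates or barely related to the reference drop out.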
Mistakes in databases
• Sequencing errors
  – revised Cambridge sequence of human mtDNA (Anderson 1981): http://www.mitomap.org/mitomap/mitoseq.html
• Taxonomic misidentification
  – Amanita pantherina from Japan can also be A. ibotengutake (in mixed forests with Fagaceae and Pinaceae).
  – Russula drimeia var. queletii = Russula flavovirens = Russula queletii var. flavovirens = Russula sardonia sensu auct. mult.
Phylogenetic analysis on DNAs
• Better on proteins.
• Translate DNA to protein.
• Perform multiple alignment.
• Thread DNA back onto the protein alignment framework.
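The last step, threading the DNA back onto the protein alignment, can be sketched as follows. The sketch assumes the DNA is exactly the coding sequence of the aligned protein (three nucleotides per residue, no stop codon, no frameshift) and that gaps are written as "-"; `thread_dna` is a hypothetical helper name.

```python
def thread_dna(aligned_protein: str, dna: str) -> str:
    """Insert a three-nucleotide gap into dna wherever aligned_protein has a gap."""
    residues = sum(1 for c in aligned_protein if c != "-")
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be three times the ungapped protein length")
    # Walk through the DNA one codon at a time, in step with the protein residues.
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if c == "-" else next(codons) for c in aligned_protein)
```

For example, the aligned protein `M-K` with the DNA `ATGAAA` gives `ATG---AAA`, so every codon stays in the column of the residue it encodes.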
Restrict number of sequences
Computing and building an accurate big alignment is difficult.
• Displaying is difficult – aim for printing and monitor size.
• Using is difficult – phylogenetic programs can NOT handle big alignments.
• Included bad sequences multiply mistakes – avoid long indels and mavericks.
• Compromise between similarity and new information (30-70 (90)% identity).
• Uncharacterized sequences are useful only if you want to see conserved (immutable) regions.
Naming your sequences
• Never_use_white_spaces:
  – Amanita phalloides.
• Never use ßšpeciál symbols.
• Shorten your name to 15 characters:
  – Amanita_phalloides
• Use different names for different sequences:
  – A.ph.261004_a
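These naming rules are easy to apply automatically. The following is a minimal sketch, not a standard tool: `clean_name` and `unique_names` are hypothetical helpers, the set of allowed characters (letters, digits, "_", "." and "-") is an assumption, and duplicates are resolved by appending a numeric suffix.

```python
import re
import unicodedata


def clean_name(name: str, max_len: int = 15) -> str:
    """Replace white space by underscores, drop non-ASCII symbols, truncate."""
    # Decompose accented letters, then drop everything that is not plain ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    ascii_name = re.sub(r"\s+", "_", ascii_name.strip())
    ascii_name = re.sub(r"[^A-Za-z0-9_.-]", "", ascii_name)
    return ascii_name[:max_len]


def unique_names(names, max_len: int = 15) -> list:
    """Clean every name and make sure no two cleaned names are identical."""
    used = set()
    result = []
    for name in names:
        base = clean_name(name, max_len)
        candidate, n = base, 0
        while candidate in used:
            n += 1
            suffix = f"_{n}"
            candidate = base[:max_len - len(suffix)] + suffix
        used.add(candidate)
        result.append(candidate)
    return result
```

Keep a table that maps the shortened names back to the full ones, because truncation discards information.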
Step 1
BLAST at the ExPASy server – avoid translated ESTs: http://www.expasy.ch/cgi-bin/BLASTEMBnet-CH.pl
Sequence selection
• NOT just the best ten.
• Similarity along the whole sequence – NOT just the bits.
Step 2
Import results from BLAST to alignment programs.
Step 3
Run multiple alignment using different methods, compare results.
Program selection
• FASTA and Swiss-Prot formats are good for saving.
• ClustalW handles more sequences.
• TCoffee is more accurate than ClustalW.
ClustalW
• http://www.ebi.ac.uk/clustalw/index.html
• Paulien Hogeweg described the algorithm, Des Higgins wrote the program Clustal.
• Progressive method – adds sequences one by one.
• Input gaps are NOT removed – use FASTA (SeqCheck) to clean the sequence.
• Order of input sometimes influences the alignment.
ClustalW principle
• Compares sequences two by two:
  – compares A and B, makes consensus AB,
  – compares C and D, makes consensus CD,
  – compares AB and CD.
• Does NOT use all the information at once.
• The W in ClustalW stands for weights: every sequence receives a weight proportional to the amount of new information it contributes.
ClustalW output
• Use the defaults, with output in input sequence order.
• Reformat using fmtseq if needed: http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html
• Results:
  – pairwise scores (ignore),
  – multiple alignment,
  – guide tree: the „dendrogram“ used by Clustal, in Newick format (text and parentheses), NOT a proper phylogenetic tree.
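To make the guide tree and its Newick notation concrete, here is a small sketch that derives a joining order from pairwise distances and prints it with parentheses. It is not ClustalW's own procedure: ClustalW builds its guide tree by neighbour-joining, whereas this sketch swaps in plain average-linkage clustering and omits branch lengths. The function name and the distance dictionary layout (keys are `frozenset` pairs of names) are assumptions made for the example.

```python
def guide_tree(names, dist) -> str:
    """Greedy average-linkage clustering; returns a Newick string without branch lengths."""
    # Each cluster is keyed by its Newick label and stores the sequences it contains.
    clusters = {name: [name] for name in names}

    def average_distance(c1, c2):
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        labels = list(clusters)
        # Pick the two closest clusters and merge them, as in "A + B -> AB".
        x, y = min(((a, b) for i, a in enumerate(labels) for b in labels[i + 1:]),
                   key=lambda pair: average_distance(*pair))
        merged = clusters.pop(x) + clusters.pop(y)
        clusters[f"({x},{y})"] = merged
    return next(iter(clusters)) + ";"
```

With A close to B and C close to D, the result is `((A,B),(C,D));`, which mirrors the order of comparisons in the ClustalW principle: AB first, then CD, then AB with CD.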
ClustalW parameters
• Change only if dissatisfied with the blocks and weakly conserved positions (if the alignment is difficult to interpret):
  – substitution matrix – change BLOSUM to PAM,
  – gap-opening penalty (GOP) / gap-extension penalty (GEP) – try empirically; they are automatically readjusted.
ClustalW mirrors
• Europe: http://www.ebi.ac.uk/clustalw/index.html
• USA: http://pir.georgetown.edu/pirwww/search/multaln.html
• Japan: http://clustalw.genome.ad.jp/
TCoffee
• http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• Slower, more precise.
• Uses progressive alignment like Clustal, but it compares segments across the entire sequence set.
• First, ClustalW (and Lalign) make a collection of local and global alignments – the „library“.
• Second, the sequences are progressively aligned to yield the highest possible agreement with all pairwise alignments in the library.
• New option – structural alignment.
TCoffee choices
[Screenshot of the TCoffee server choices. Labels: 3-D Coffee; appropriate method used for each pair; default]
TCoffee output
• aln – text file alignment, can be used as input for other programs;
• html – colour coded, red indicates high quality, blue low;
• pdf – html converted;
• dnd – dendrogram used by TCoffee, in Newick format;
• ph – true phylogenetic tree in Newick format (generated by the neighbour-joining method);
• png – picture of ph.
Evaluating quality of alignment with TCoffee
• Align with another program.
• Evaluate with TCoffee.
• There are NO E-values for multiple alignments.
• Yellow, orange, and red mean an 80% chance of being correctly aligned.
Looking at protein alignment
• * entirely conserved column
• : aa of roughly the same size and hydropathy
• . size or hydropathy preserved in the course of evolution
• A good block of 20 aa has 3 stars (*), 6 colons (:), and a few periods (.).
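The rule of thumb for a good block can be turned into a quick scan of the consensus line that Clustal prints under each alignment block. This is only a sketch: it reads "3 stars and 6 colons" as minimum counts within a 20-column window, which is an interpretation, and `good_blocks` is a hypothetical name.

```python
def good_blocks(consensus: str, length: int = 20, min_stars: int = 3, min_colons: int = 6) -> list:
    """Return the start positions of windows that meet the star and colon thresholds."""
    starts = []
    for i in range(len(consensus) - length + 1):
        window = consensus[i:i + length]
        if window.count("*") >= min_stars and window.count(":") >= min_colons:
            starts.append(i)
    return starts
```

The input is the consensus line joined across all blocks of the alignment, with spaces kept for unconserved columns so that positions match alignment columns.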
…looking at protein alignment
• Conserved columns in a multiple alignment are meaningful only when surroundings are NOT conserved.
Patterns of conservation
• Tryptophan – large hydrophobic aa, deep in the core, difficult to mutate. If it mutates, then to another aromatic aa (phenylalanine or tyrosine). Pattern: WYF
• Glycine and proline sit at the extremities of beta-strands or alpha-helices. Pattern: GP
• Cysteines form disulphide bridges; the distance between them is a signature of the domain. Pattern: CC
• Histidine and serine occur in catalytic sites (of proteases). Pattern: HS
…patterns of conservation
• The charged aa lysine, arginine, aspartic acid, and glutamic acid are involved in ligand binding or in a salt bridge inside the core. Pattern: KRDE
• Leucine is conserved only in protein-protein interactions such as the leucine zipper. Pattern: L
Refinement of alignment
• Adding distantly related sequences should enhance existing patterns rather than destroy blocks.
• Regions with indels are candidates for loops (flexible parts of the structure, acting as connectors).
[Figure: alignment in aln format before and after adding a new sequence]
Other multiple alignment methods
• Dialign – builds alignments by comparing whole segments of the sequences, NOT residues. No gap penalty is used. For globally unrelated sequences with local similarities (genomic DNA and protein families). http://dialign.gobics.de/chaos-dialign-submission
• DCA (Divide-and-Conquer Alignment) – heuristic approach to sum-of-pairs (SP) optimal alignment. http://bibiserv.techfak.uni-bielefeld.de/dca/submission.html
When NOT to use multiple sequence alignment: assembly
• Assembling sequence pieces in a genome sequencing project (use Phrap instead). http://www.phrap.org/
When NOT to use multiple sequence alignment: comparing unalignable
• Sequences without a common ancestor, sequences without a homologue.
• Look simultaneously at all sequences for short conserved gap-free segments (Gibbs sampler).
• Look simultaneously at all sequences for flexible patterns – segments with gaps, conserved at certain positions (Pratt).
Gibbs sampler
• http://bioweb.pasteur.fr/seqanal/interfaces/gibbs-simple.html
• Stochastic method – contains an element of chance, irreproducible.
• Needs at least 20 sequences to start with.
• Random alignment until a good solution appears.
• Good for identifying helix-turn-helix (HTH) domains and regulatory elements across a protein family.
Pratt
• http://www.ebi.ac.uk/pratt/index.html
• For motifs with different lengths.
• Allows flexible spacing between the conserved positions.
• Reports the motifs it finds as PROSITE patterns.
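A PROSITE pattern with flexible spacing can be searched for with an ordinary regular expression. The sketch below handles the common elements of the PROSITE notation (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` and `(n,m)` for repeats, `<` and `>` as sequence-end anchors at the ends of the pattern); anchors inside square brackets are not handled. The helper names are hypothetical, and this is not how Pratt itself is implemented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'C-x(2,4)-C.' into a regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # forbidden residues
        element = element.replace("(", "{").replace(")", "}")    # repeat counts
        element = element.replace("x", ".")                      # any residue
        element = element.replace("<", "^").replace(">", "$")    # sequence ends
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Return the 0-based start positions of all non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence.upper())]
```

For example, `C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.` becomes `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`, where the `{2,4}` and `{3,5}` ranges express the flexible spacing between the conserved positions.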
Chapter 10: Editing and publishing alignment
It looked so good that I thought it just had to be genuine.
-- Everyone's secret thought.
Outline
• Reformatting
• Jalview
• Boxshade
• Logos
Background knowledge
• Attempting to insert or delete a residue in a subgroup of sequences manually can drive you crazy.
Residues
• charged: KRDE,
• polar: NQST,
• aliphatic: ILMV,
• aromatic: FYW,
• others: APCGH.
Ask yourself before saving alignment
• Do most (of your) programs support this format?
• What about your collaborators?
• Is all needed info included?
• Is it easy to manipulate?
• Stick to one format.
Formats
• A format is an arrangement of characters, symbols and keywords that specifies what the sequence, ID name, comments, etc. look like in the sequence entry and where in the entry the program should look to find them.
• There are never any hidden, unprintable „control“ characters in any sequence format.
• MS Word is NOT a sequence format; simple ASCII is (e.g. the pico text editor).
• Interleaved – the alignment is chopped into blocks of 60 residues.
Frequent formats
• fasta – easy to manipulate by machines, NOT interleaved
• pir – fasta with annotation line
• msf – GCG package, multiple sequence format
• selex – extended msf
• aln – Clustal W format, simplified msf
• phylip – aln variant for phylogenetic analysis
• PostScript, pdf, html – graphics for printing; terminal format
• xml – extensible markup language, future standard.
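The difference between a non-interleaved format such as FASTA and an interleaved layout can be shown with a short sketch. It reads aligned FASTA text and prints it in blocks of 60 residues with the name repeated in front of each row. The output only illustrates interleaving; it is not a valid aln or msf file, since those formats need their own header lines. Both function names are hypothetical.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}; the name is the first word after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def interleave(records: dict, width: int = 60, name_width: int = 16) -> str:
    """Lay out aligned sequences in blocks of `width` columns, one row per sequence."""
    if not records:
        return ""
    length = len(next(iter(records.values())))
    blocks = []
    for start in range(0, length, width):
        rows = [f"{name:<{name_width}}{seq[start:start + width]}"
                for name, seq in records.items()]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks)
```

A name longer than `name_width - 1` characters would run into the sequence column, which is one reason to keep names short.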
[Figures: example alignments in FASTA, MSF, SELEX, ALN, and PHYLIP formats]
Conversion
• Shorten the name of each sequence to 15 characters, otherwise part of the name is included in the sequence.
• FMTSEQ – converts most formats. http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html
• http://www.ii.uib.no/~matthewb/tools/align_convert_in.cgi
• SEQCHECK – cleans FASTA sequence. http://darwin.nmsu.edu/bioinfo/seqcheck/seqcheck.php
• Some info can be lost, keep back-up copy of your sequence.
Possibly lost information
• sequence name (shortened, or included in the sequence),
• upper/lowercase,
• gap type (., -, ~ all turned to -),
• special aa and nucleotides (X and N).
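A quick way to see what a conversion has cost is to compare one sequence entry before and after. The sketch below checks for the kinds of loss listed above. It assumes each entry is a `(name, sequence)` pair and that `.`, `-` and `~` are the only gap characters; `conversion_losses` is a hypothetical helper.

```python
GAP_CHARS = ".-~"


def conversion_losses(before, after) -> list:
    """Compare (name, sequence) before and after conversion; list what changed."""
    (name_b, seq_b), (name_a, seq_a) = before, after
    losses = []
    if name_a != name_b:
        losses.append("name")
    # Upper/lowercase pattern of the letters, ignoring gaps.
    if [c.isupper() for c in seq_b if c.isalpha()] != [c.isupper() for c in seq_a if c.isalpha()]:
        losses.append("case")
    # Sequence of gap symbols, ignoring letters.
    if [c for c in seq_b if c in GAP_CHARS] != [c for c in seq_a if c in GAP_CHARS]:
        losses.append("gap type")
    # Residue letters themselves, e.g. X or N replaced by something else.
    if [c.upper() for c in seq_b if c.isalpha()] != [c.upper() for c in seq_a if c.isalpha()]:
        losses.append("residues")
    return losses
```

An empty list means the round trip preserved the entry; anything else is a reason to go back to the back-up copy.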
JalView
• http://www.jalview.org/
• Java applet: runs on your computer; your secret sequence does NOT travel through the Internet if you set File-Work Offline.
Jalview output
Graph showing a level of conservation
Colour scheme
• http://www.es.embnet.org/Doc/jalview/contents.html#colour
• ClustalX colours
• Zappo colours
• Taylor colours
• Hydrophobicity colours
• Colouring residues above a percentage identity threshold
• PID (Percentage Identity) colours
• BLOSUM62 colours
• User defined colour schemes.
Editing a group of sequences
• To modify the alignment collectively, define a group (click with the mouse while holding CTRL).
[Figures: defining a group; group editing mode]
Steps
• Click anywhere on a sequence.
• Drag to the left to insert gaps.
• Drag to the right to remove gaps.
• Save intermediate results; there is NO undo button.
Useful features of JalView
• Calculate-Autocalculate consensus: the graph below the alignment is updated.
• Calculate-Remove redundancy: removes sequences that are x% identical.
• Calculate-Neighbour joining tree using PID: computes and displays a phylogenetic tree, ready for group editing.
Saving output of JalView
• File-Output alignment via text box
• Select format, Apply
• Open MS Word
• CTRL-C and CTRL-V
• Save the Word document as .doc or .txt.
Other options
• BioEdit
• QAlign
• CLC Workbench
Boxshade
• http://www.ch.embnet.org/software/BOX_form.html
• Shades columns according to the level of conservation:
  – black – identical,
  – grey – similar.
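The idea behind the shading can be sketched per column. This is a simplification and not Boxshade's actual rule set: here a column is "black" only when every sequence has the same residue, "grey" when all residues fall into the same class from the Residues slide of Chapter 10 (charged, polar, aliphatic, aromatic), and unshaded otherwise or when it contains a gap. The "others" class is deliberately not treated as similar. All names are hypothetical.

```python
CLASSES = {"charged": "KRDE", "polar": "NQST", "aliphatic": "ILMV",
           "aromatic": "FYW", "other": "APCGH"}
CLASS_OF = {aa: name for name, letters in CLASSES.items() for aa in letters}


def shade_column(column: str) -> str:
    """Return 'black', 'grey' or 'none' for one alignment column."""
    residues = [c for c in column.upper() if c in CLASS_OF]
    if len(residues) < len(column):
        return "none"  # the column contains a gap or an unknown symbol
    if len(set(residues)) == 1:
        return "black"
    classes = {CLASS_OF[c] for c in residues}
    if len(classes) == 1 and "other" not in classes:
        return "grey"
    return "none"


def shade_alignment(rows) -> list:
    """Shade every column of an alignment given as equal-length strings."""
    return [shade_column("".join(col)) for col in zip(*rows)]
```

Real shading programs usually accept a column when only a fraction of the sequences agree; the all-or-nothing rule here keeps the example short.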
[Figures: Boxshade input, Boxshade intermediate output, Boxshade output]
Logos
• Each position corresponds to a column in the alignment.
• The total height depends on the level of conservation.
• The size of each letter in a logo position depends on the frequency of that letter in the column.
• The top letter is the most frequent in the column.
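The heights in a logo follow from the information content of each column. The sketch below uses the usual definition, total height = log2(alphabet size) minus the Shannon entropy of the column, with each letter's height equal to its frequency times the total. It leaves out the small-sample correction that logo programs normally apply, ignores gap characters, and uses a hypothetical function name.

```python
import math
from collections import Counter


def logo_column(column: str, alphabet_size: int = 20) -> dict:
    """Return {letter: height in bits} for one column, least frequent letter first."""
    residues = [c for c in column.upper() if c not in "-.~"]
    if not residues:
        return {}
    total = len(residues)
    freqs = {aa: n / total for aa, n in Counter(residues).items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy   # total height of the stack
    ordered = sorted(freqs.items(), key=lambda item: item[1])
    return {aa: f * information for aa, f in ordered}
```

The last key of the returned dictionary is the most frequent residue, which is the letter drawn on top of the stack. Use `alphabet_size=4` for DNA or RNA.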
Logos
• Tom Schneider: http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi
• Jan Gorodkin: http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
• FASTA format, delete the name:
>name
xxxx-xxxx
to
>
xxxx-xxxx
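Deleting the names by hand is tedious for a large alignment. A minimal sketch of the same edit, assuming the alignment is held as FASTA text in a string (`strip_fasta_names` is a hypothetical helper):

```python
def strip_fasta_names(text: str) -> str:
    """Replace every FASTA header line by a bare '>' and keep the sequence lines."""
    return "\n".join(">" if line.startswith(">") else line
                     for line in text.splitlines())
```

Keep the original file as well, because the names cannot be recovered from the stripped copy.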
Extracting info from multiple sequence alignment
• Blocks – identifies blocks
• Blockgap – measures blocks
• Lama – compares an alignment with the BLOCKS database
• Amas – identifies features in an alignment
Thank you for your attention!