
Chapter 9: Building a multiple sequence alignment

The man with two watches never knows the time, and the man with one watch only thinks he knows.

- A man with multiple watches

Background knowledge

• Multiple alignment is a Swiss Army knife of bioinformatics:*
– predicting protein structure
– predicting protein function
– phylogenetic analysis.
• It is both art and science.
• It is easy to generate a bad alignment that looks good.

*Perl is the Swiss Army knife for bioinformatics software programmers.

Multiple sequence alignment example

Rules of evolution

• Important amino acids (aa) are NOT allowed to mutate.
• Less important residues change easily, sometimes randomly, sometimes to gain a new function.
• Conserved positions in an alignment are more important than non-conserved positions.

Criteria for multiple sequence alignment

• Similarity
– structural: aa with the same role in the structure are in the same column (aligned),
– evolutionary: aa derived from the same ancestral residue are aligned,
– functional: aa with the same function are aligned,
– sequence: aa that yield the best alignment are aligned.

Applications of multiple sequence alignment

• Extrapolation
• Phylogenetic analysis
• Functional pattern identification
• Domain identification
• Identification of regulatory elements
• Structure prediction for proteins and RNA
• PCR analysis with the least degeneracy: http://blocks.fhcrc.org/codehop.html


Outline

• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable

Gathering the sequences

• Proteins are better than DNA (shorter, more informative); the exceptions are primer design and non-coding targets.
• A protein family is usually too large to collect all sequences. Start with 10, remove troublemakers, and rerun with 50 sequences.
• Discard sequences with <30% or >90% identity.
• Read the DE (description) section of the sequence entry to check for transposons.
• Discard sequences that are shorter than the others.
• Extract (with Dotlet) sequences with repeated domains.
• Note: There are mistakes in databases.
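The identity thresholds above can be sketched in Python. This is only a minimal sketch: it assumes each hit has already been pairwise-aligned to the query (equal lengths, "-" for gaps), the function names `percent_identity` and `keep_hit` are hypothetical, and identity is counted over columns where neither sequence has a gap, which is one of several conventions in use.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned sequences, ignoring columns with a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def keep_hit(query: str, hit: str, low: float = 30.0, high: float = 90.0) -> bool:
    """True if the hit is neither too distant (<low) nor nearly redundant (>high)."""
    return low <= percent_identity(query, hit) <= high
```

For example, `keep_hit("ACDE", "ACDF")` keeps a hit with 75% identity, while an identical hit is rejected as redundant.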

Mistakes in databases

• Sequencing errors
– revised Cambridge sequence of human mtDNA (Anderson 1981) http://www.mitomap.org/mitomap/mitoseq.html
• Taxonomic misidentification
– Amanita pantherina from Japan can also be A. ibotengutake (in mixed forests with Fagaceae and Pinaceae).
– Russula drimeia var. queletii = Russula flavovirens = Russula queletii var. flavovirens = Russula sardonia sensu auct. mult.

Phylogenetic analysis on DNAs

• Better on proteins.
• Translate DNA to protein.
• Perform multiple alignment.
• Thread DNA back onto the protein alignment framework.
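The last step, threading the DNA back onto the protein alignment, can be sketched as follows. This is a minimal illustration, assuming the coding sequence starts in frame at the first residue and that "-" is the gap character; the function name `thread_dna` is hypothetical.

```python
def thread_dna(aligned_protein: str, cds: str) -> str:
    """Map a coding DNA sequence onto a gapped protein sequence, codon by codon."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the protein")
    out = []
    pos = 0  # position of the next unused codon in the DNA
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")           # a protein gap becomes a three-base gap
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)
```

Applied to every sequence of the protein alignment, this yields a DNA alignment in which gaps never split a codon. Any bases left over after the last residue (for example a stop codon) are dropped in this sketch.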

Restrict number of sequences

Computing and building an accurate big alignment is difficult.

• Displaying is difficult – aim for printing and monitor size.
• Using is difficult – phylogenetic programs can NOT handle them.
• Included bad sequences multiply mistakes – avoid long indels and mavericks.
• Compromise between similarity and new information (30-70(90)% identity).
• Uncharacterized sequences are useful only if you want to see conserved, unmutable regions.

Naming your sequences

• Never_use_white_spaces:
– Amanita phalloides.
• Never use ßšpeciál symbols.
• Shorten your name to 15 characters:
– Amanita_phalloides
• Use different names for different sequences:
– A.ph.261004_a
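The naming rules above can be sketched as a small clean-up routine. This is a minimal sketch with hypothetical helper names (`clean_name`, `unique_names`); the set of allowed characters (letters, digits, underscore, dot) is an assumption, not a requirement of any particular program.

```python
import re
import unicodedata


def clean_name(name: str, max_len: int = 15) -> str:
    """Replace white space, drop non-ASCII symbols, and truncate the name."""
    # Decompose accented letters, then drop everything that is not ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    ascii_name = re.sub(r"\s+", "_", ascii_name.strip())
    ascii_name = re.sub(r"[^A-Za-z0-9_.]", "", ascii_name)
    return ascii_name[:max_len]


def unique_names(names, max_len: int = 15):
    """Clean all names and add a numeric suffix to repeated ones."""
    seen = {}
    result = []
    for name in names:
        base = clean_name(name, max_len)
        count = seen.get(base, 0)
        seen[base] = count + 1
        if count == 0:
            result.append(base)
        else:
            suffix = f"_{count}"
            # Shorten the base so that the suffixed name still fits max_len.
            result.append(base[:max_len - len(suffix)] + suffix)
    return result
```

The suffixing is deliberately simple: it does not check whether a suffixed name collides with another input name, so inspect the result before using it.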


Step 1

BLAST at the ExPASy server – avoid translated ESTs. http://www.expasy.ch/cgi-bin/BLASTEMBnet-CH.pl

Sequence selection

• NOT just the best ten.
• Similarity along the whole sequence – NOT just the bits.

Step 2

Import results from BLAST to alignment programs.

Step 3

Run multiple alignment using different methods, compare results.

Program selection

• FASTA and Swiss-Prot formats are good for saving.
• ClustalW handles more sequences.
• TCoffee is more accurate than ClustalW.

ClustalW

• http://www.ebi.ac.uk/clustalw/index.html
• Paulien Hogeweg described the algorithm, Des Higgins wrote the program Clustal.
• Progressive method – adding sequences one by one.
• Input gaps are NOT removed – use FASTA (SeqCheck) to clean the sequence.
• Order of input sometimes influences the alignment.

ClustalW principle

• Compares sequences two by two:
– compares A and B, makes consensus AB
– compares C and D, makes consensus CD
– compares AB and CD.
• Does NOT use all the information at once.
• Clustal W version – Clustal weights: every sequence receives a weight proportional to the amount of new info it contributes.

ClustalW output

• Use default, output in input sequence order.
• Reformat using fmtseq if needed: http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html
• Results:
– pairwise scores (ignore)
– multiple alignment
– guide tree – "dendrogram" used by Clustal, in Newick format (text and parentheses), NOT a proper phylogenetic tree.

ClustalW parameters

• Change only if dissatisfied with blocks and weakly conserved positions (if alignment is difficult to interpret):
– substitution matrix – change BLOSUM to PAM.
– gap-opening penalty (GOP) / gap-extension penalty (GEP) – try empirically, automatically readjusted.
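How GOP and GEP interact can be illustrated with a generic affine gap cost. This is a sketch under stated assumptions, not ClustalW's actual scoring: the function name `affine_gap_cost` is hypothetical, the penalty values are illustrative, every gap position is charged the extension penalty (some programs charge it only from the second position), and the automatic, position-specific readjustment mentioned above is not modelled.

```python
def affine_gap_cost(length: int, gop: float, gep: float) -> float:
    """Cost of one gap of the given length: one opening plus per-position extension."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gop + gep * length


# Illustrative values only.
one_long_gap = affine_gap_cost(5, gop=10.0, gep=0.2)       # a single gap of 5
five_short_gaps = 5 * affine_gap_cost(1, gop=10.0, gep=0.2)  # five gaps of 1
```

With a high GOP and a low GEP, one long gap is much cheaper than several short ones, so raising GOP tends to give fewer, longer gaps.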

ClustalW mirrors

• Europe: http://www.ebi.ac.uk/clustalw/index.html
• USA: http://pir.georgetown.edu/pirwww/search/multaln.html
• Japan: http://clustalw.genome.ad.jp/

TCoffee

• http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi

• Slower, more precise.
• Uses progressive alignment like Clustal but it compares segments across the entire sequence set.
• First, ClustalW (and Lalign) make a collection of local and global alignments – the "library".
• Second, the sequences are progressively aligned to yield the highest possible agreement with all pairwise alignments in the library.
• New option – structural alignment.

TCoffee choices

[Screenshot of the TCoffee server choices; labels: "default", "3-D Coffee", "Appropriate method used for each pair"]

TCoffee output

• aln – text file alignment, can be used as input for other programs;

• html – colour coded, red indicates high quality, blue low;

• pdf – html converted;
• dnd – dendrogram used by TCoffee, in Newick format;
• ph – true phylogenetic tree in Newick format (generated by neighbour-joining method);
• png – picture of ph.


Evaluating quality of alignment with TCoffee

• Align with another program.
• Evaluate with TCoffee.
• There are NO E-values for multiple alignments.
• Yellow, orange, and red mean an 80% chance of being correctly aligned.

Looking at protein alignment

• * – entirely conserved column.
• : – roughly the same size and hydropathy of aa.
• . – the size or hydropathy preserved in the course of evolution.
• A good block of 20 aa has 3 stars (*), 6 colons (:), and a few periods (.).
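A simplified version of such a conservation line can be computed as below. This is only a sketch: it assumes four coarse residue groups (charged KRDE, polar NQST, aliphatic ILMV, aromatic FYW) instead of the group tables the alignment programs actually use, it marks only "*" and ":" (no "."), and the names `column_symbol` and `conservation_line` are hypothetical.

```python
# Assumed coarse similarity groups; real programs use finer group tables.
GROUPS = ("KRDE", "NQST", "ILMV", "FYW")


def column_symbol(column: str) -> str:
    """Return '*' for an identical column, ':' for a same-group column, else ' '."""
    residues = set(column.upper())
    if "-" in residues:
        return " "            # columns with gaps are never marked
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in GROUPS):
        return ":"
    return " "


def conservation_line(alignment) -> str:
    """Build the symbol line for a list of equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    return "".join(column_symbol("".join(col)) for col in zip(*alignment))
```

For the three sequences `WKIA`, `WRLG`, `WKVP` the line is `*:: `: an identical tryptophan column, two same-group columns, and one unconserved column.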

…looking at protein alignment

• Conserved columns in a multiple alignment are meaningful only when surroundings are NOT conserved.

Patterns of conservation

• Tryptophan – large hydrophobic aa, deep in the core, difficult to mutate. If it mutates, then to another aromatic aa (phenylalanine or tyrosine). (WYF)

• Glycine and proline are at the extremities of beta-strands or alpha-helices. (GP)

• Cysteines form disulphide bridges; the distance between them is a signature of a domain. (CC)

• Histidine and serine are in catalytic sites (of proteases). (HS)

…patterns of conservation

• Charged aa (lysine, arginine, aspartic acid, and glutamic acid) are involved in ligand binding or in a salt bridge inside the core. (KRDE)

• Leucine is conserved only in protein-protein interactions such as the leucine zipper. (L)

Refinement of alignment

• Adding distantly related sequences should enhance existing patterns rather than destroy blocks.

• Regions with indels – candidates for loops (flexible parts of the structure, acting as connectors).

[Figures: alignment (aln format) before and after adding a new sequence]

Other multiple alignment methods

• Dialign – builds alignments by comparing whole segments of the sequences, NOT single residues. No gap penalty is used. For globally unrelated sequences with local similarities (genomic DNA and protein families). http://dialign.gobics.de/chaos-dialign-submission

• DCA Divide-and-Conquer Alignment - heuristic approach to sum-of-pairs (SP) optimal alignment. http://bibiserv.techfak.uni-bielefeld.de/dca/submission.html

When NOT to use multiple sequence alignment: assembly

• Assembling sequence pieces in a genome sequencing project (use Phrap instead). http://www.phrap.org/

When NOT to use multiple sequence alignment: comparing unalignable

• Sequences without common ancestor, sequence without homologue.

• Look simultaneously at all sequences for short conserved gap-free segments (Gibbs sampler).

• Look simultaneously at all sequences for flexible patterns – segments with gaps, conserved at certain positions (Pratt).

Gibbs sampler

• http://bioweb.pasteur.fr/seqanal/interfaces/gibbs-simple.html

• Stochastic method – contains an element of chance, irreproducible.

• Needs at least 20 sequences to start with.
• Random alignment until a good solution appears.
• Good for identifying helix-turn-helix (HTH) domains and regulatory elements across a protein family.
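The idea of a stochastic search for a gap-free motif can be sketched as a toy site sampler. This is a generic textbook-style illustration, not the code behind the server above: it assumes one motif occurrence per sequence, a fixed motif width, a uniform background, protein sequences made only of the 20 standard letters, and hypothetical names (`gibbs_sample`, `ALPHABET`).

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed: 20 standard amino acids only


def gibbs_sample(seqs, width, iterations=200, seed=None):
    """Return one motif start position per sequence, found by Gibbs sampling."""
    if any(len(s) < width for s in seqs):
        raise ValueError("every sequence must be at least as long as the motif")
    rng = random.Random(seed)
    # Start from a random alignment of the motif windows.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    total = (len(seqs) - 1) + len(ALPHABET)  # counts plus pseudocounts per column
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Profile built from all sequences except the one being resampled.
            counts = [{a: 1.0 for a in ALPHABET} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j != i:
                    for k in range(width):
                        counts[k][other[starts[j] + k]] += 1.0
            # Weight of each window in the left-out sequence under the profile.
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= counts[k][seq[p + k]] / total
                weights.append(w)
            # Sample (not maximise) the new position: the element of chance.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Without a fixed `seed`, two runs can return different positions, which is the irreproducibility noted above; running several times and comparing the results is a common precaution.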

Pratt

• http://www.ebi.ac.uk/pratt/index.html
• For motifs with different lengths.
• Allows flexible spacing between the conserved positions.
• Motifs are written in PROSITE pattern syntax.
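A pattern with flexible spacing can be tried on your own sequences by translating it into a regular expression. The sketch below handles the common elements of PROSITE syntax (`x`, `[...]`, `{...}`, repeat counts such as `x(2,4)`, and the `<` / `>` anchors outside brackets); the function name `prosite_to_regex` is hypothetical and the pattern in the usage note is made up for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":                    # any residue
            rx = "."
        elif core.startswith("{") and core.endswith("}"):
            rx = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            rx = core                      # a single residue or a [...] set
        if low is not None:                # repeat count, fixed or ranged
            rx += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(prefix + rx + suffix)
    return "".join(parts)
```

For the illustrative pattern `C-x(2,4)-C-x(3)-[LIVMFYWC]-{P}-H` this gives `C.{2,4}C.{3}[LIVMFYWC][^P]H`, which can be passed to `re.search` together with a protein sequence.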

Chapter 10: Editing and publishing alignment

It looked so good that I thought it just had to be genuine.

- Everyone's secret thought.

Outline

• Reformatting
• Jalview
• Boxshade
• Logos

Background knowledge

• Attempting to insert or delete a residue manually in a subgroup of sequences can drive you crazy.


Residues

• Charged KRDE,
• polar NQST,
• aliphatic ILMV,
• aromatic FYW,
• others APCGH.

Ask yourself before saving alignment

• Do most (of your) programs support this format?

• What about your collaborators?
• Is all needed info included?
• Is it easy to manipulate?

• Stick to one format.

Formats

• Arrangement of characters, symbols and keywords that specify what sequence, ID name, comments, etc. look like in the sequence entry and where in the entry the program should look to find them.

• There are never any hidden, unprintable „control“ characters in any sequence format.

• MS Word is NOT a sequence format; simple ASCII text is (e.g. from the pico text editor).

• Interleaved - chopped to blocks of 60 residues.
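What "interleaved" means can be shown with a short sketch that chops an alignment into blocks of 60 columns. The layout (name padded to 15 characters, one space, then the residues) is a generic illustration and not the exact specification of aln, msf, or phylip; the function name `interleave` is hypothetical.

```python
def interleave(names, seqs, block: int = 60, name_width: int = 15) -> str:
    """Write aligned sequences as blocks of `block` columns, one line per sequence."""
    if len(names) != len(seqs):
        raise ValueError("need exactly one name per sequence")
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences must have equal length")
    length = len(seqs[0]) if seqs else 0
    blocks = []
    for start in range(0, length, block):
        lines = [
            f"{name[:name_width]:<{name_width}} {seq[start:start + block]}"
            for name, seq in zip(names, seqs)
        ]
        blocks.append("\n".join(lines))
    # Blocks are separated by an empty line.
    return "\n\n".join(blocks)
```

In a non-interleaved format such as fasta, each sequence is instead written out completely before the next one starts.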

Frequent formats

• fasta – easy to manipulate by machines, NOT interleaved

• pir – fasta with annotation line
• msf – GCG package, multiple sequence format
• selex – extended msf
• aln – Clustal W format, simplified msf
• phylip – aln variant for phylogenetic analysis
• postscript, pdf, html – graphics for printing; terminal format
• xml – extensible markup language, future standard.

[Figures: the same alignment shown in FASTA, MSF, SELEX, ALN, and PHYLIP formats]

Conversion

• Shorten the name of each sequence to 15 characters, otherwise part of the name can end up in the sequence.

• FMTSEQ – converts most formats. http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html

• http://www.ii.uib.no/~matthewb/tools/align_convert_in.cgi

• SEQCHECK – cleans FASTA sequence. http://darwin.nmsu.edu/bioinfo/seqcheck/seqcheck.php

• Some info can be lost, keep back-up copy of your sequence.

Possibly lost information

• Sequence name (shortened, included in sequence),
• upper/lowercase,
• gap type (., -, ~ turned to -),
• special aa and nucleotides (X and N).
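Before converting, it can help to check what a sequence entry stands to lose. The sketch below is a hypothetical helper pair (`normalize_gaps`, `lost_information`), not part of any converter: it turns "." and "~" into "-" and lists which of the items above apply to an entry, assuming a 15-character name limit.

```python
def normalize_gaps(seq: str, gap_chars: str = ".~") -> str:
    """Turn alternative gap characters into '-'."""
    return seq.translate(str.maketrans({c: "-" for c in gap_chars}))


def lost_information(name: str, seq: str, max_name: int = 15):
    """List the kinds of information a format conversion may lose for this entry."""
    notes = []
    if len(name) > max_name:
        notes.append(f"name longer than {max_name} characters")
    if any(c in ".~" for c in seq):
        notes.append("non-standard gap characters")
    if any(c.islower() for c in seq):
        notes.append("lowercase residues")
    if any(c in "XN" for c in seq.upper()):
        notes.append("special residues (X or N)")
    return notes
```

Note that N is also the one-letter code for asparagine, so the last check is only meaningful for nucleotide sequences or for X in proteins.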

JalView

• http://www.jalview.org/
• Java applet: runs on your computer, your secret sequence does NOT travel through the Internet if you set File-Work Offline.

Jalview output

Graph showing a level of conservation

Colour scheme

• http://www.es.embnet.org/Doc/jalview/contents.html#colour
• ClustalX colours
• Zappo colours
• Taylor colours
• Hydrophobicity colours
• Colouring residues above a percentage identity threshold
• PID (Percentage Identity) colours
• BLOSUM62 colours
• User defined colour schemes.

Editing a group of sequences

• To modify the alignment collectively, define a group (mouse click while holding CTRL).

[Figures: defining a group; group editing mode]

Steps

• Click anywhere on a sequence.
• Drag to the left to insert gaps.
• Drag to the right to remove gaps.
• Save intermediate results; there is NO undo button.

Useful features of JalView

• Calculate - Autocalculate consensus: graph below alignment is updated.

• Calculate-Remove redundancy: removes x% identical sequences.

• Calculate-Neighbour joining tree using PID: computes and displays phylogenetic tree, ready for group editing.

Saving output of JalView

• File-Output alignment via text box
• Select format, Apply
• Open MS Word
• CTRL-C and CTRL-V
• Save Word document as .doc or .txt.

Other options

• BioEdit

• QAlign

• CLC Workbench

Boxshade

• http://www.ch.embnet.org/software/BOX_form.html

• Shades columns according to level of conservation

– Black – identical
– Grey – similar

[Figures: Boxshade input, intermediate output, and final output]

Logos

• Position corresponds to a column in the alignment.
• Total height depends on the level of conservation.
• Size of each letter in a logo position depends on the frequency of the letter in the column.
• Top letter is the most frequent in the column.
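The letter heights described above can be sketched with the usual information-content measure: total height = log2(alphabet size) minus the column entropy, and each letter gets a share proportional to its frequency. This is a minimal sketch that ignores gaps and omits the small-sample correction used by real logo programs; the function name `logo_column` is hypothetical.

```python
import math
from collections import Counter


def logo_column(column: str, alphabet_size: int = 20):
    """Return (letter, height in bits) pairs for one column, tallest letter last."""
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return []
    counts = Counter(residues)
    n = len(residues)
    freqs = {aa: count / n for aa, count in counts.items()}
    # Shannon entropy of the column, in bits.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # Total stack height: maximum possible information minus the entropy.
    info = math.log2(alphabet_size) - entropy
    letters = [(aa, f * info) for aa, f in freqs.items()]
    # Sort so that the most frequent letter is last, i.e. drawn on top.
    return sorted(letters, key=lambda item: item[1])
```

A fully conserved protein column reaches the maximum of log2(20), about 4.32 bits, while a column with all 20 residues equally frequent has height 0.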

Logos

• Tom Schneider: http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi
• Jan Gorodkin: http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
• FASTA format, delete the name:
>name
xxxx-xxxx
to
>
xxxx-xxxx

Extracting info from multiple sequence alignment

• Blocks – identifies blocks
• Blockgap – measures blocks
• Lama – compares alignment with the BLOCKS database
• Amas – identifies features in alignment

Thank you for your attention!
