Sequence alignment

School B&I TCD Bioinformatics

May 2010

What is an alignment?

• CENTRAL concept in bioinformatics
• Easy if straightforward, similar seqs
– THISTHESAME   or   THISTHESAME
– |  |||| |||        |   |||  ||
– TOSSTHEGAME        TRANTHELPME

• Hard and CPU-intensive if seqs very different
– THISTHESAME vs THATGAMETHE
– THISTHESAME---   or   THIS----THESAME
– ||      |||           ||      |||
– THAT---GAMETHE        THATGAMETHE

< better

Why align?

• Trying to establish homology by similarity

• Homology – having a common ancestor
– whale fin, bat wing, human hand (Cuvier)
– human beta globin, dog beta globin (orthologs)
– human beta globin, human alpha globin (paralogs)

• You can have % similarity, % identity

• Can’t have % homology
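The point that identity is a measurable percentage, while homology is a yes/no inference, can be made concrete with a small sketch. The function name `percent_identity` is hypothetical, and counting over all alignment columns (gaps included) is an assumption; programs differ on the denominator.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of alignment columns that hold the same residue in both rows."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both rows carry the same non-gap symbol.
    same = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return 100.0 * same / len(a)


# The easy example from the first slide: 8 of 11 columns are identical.
print(round(percent_identity("THISTHESAME", "TOSSTHEGAME"), 1))
```

Percent similarity would additionally need a rule for which non-identical pairs count as "similar", which is what the substitution matrices later in the lecture provide.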

Why homology?

• homologous structures/molecules have similar function.

• related by evolution.
– more similar seqs = more recent common ancestor = more likely similar function

• human hand not for locomotion

• bricolage – evolutionary tinkering

Define terms

• Indel
– Insertion or deletion
– May get a better alignment if you put a gap in one sequence

• Implies a mutation in one of the seqs
– Not clear if insert in one or delete in the other

Optimal alignment

• Best guess at evolutionary relationship
– Which residues/bases are homologous

• Depends on model of evolution and parameters of alignment
– Is a gap more likely than a substitution
– Is one substitution more likely than another
– Transition (Y-Y or R-R) vs transversion (R-Y)
– Similar shape amino acid or different

• No “correct” answer.

Global alignment

• Needleman & Wunsch
• Tries to align two sequences along their whole length, from 5’ to 3’ or N terminus to C terminus
• Assumes (only works well if) seqs are similar over their entire length
• So less good if there are large indels (but can identify such features)
• Assesses overall (functional) similarity

LARGGHYFGKISTGREFDN
L      FGKI T  E
LNAHILSFGKISTSLEDA

• Identify (and count) every difference/mutation

Local alignment

• Smith & Waterman

• Ignores the whole and focuses on a region or domain that has good similarity

• Use to make high quality alignments

----------FGKI----------
          ||||
----------FGKI----------

BLAST: Basic Local Alignment Search Tool
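The idea of scoring only the best-matching region can be sketched with the Smith & Waterman recurrence. This is a minimal illustration, not the code of any particular program: the function name `smith_waterman_score` and the scores (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            curr[j] = max(0,
                          prev[j - 1] + s,    # align a[i-1] with b[j-1]
                          prev[j] + gap,      # gap in b
                          curr[j - 1] + gap)  # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences (made up for illustration) that share only the FGKI region.
print(smith_waterman_score("AAAAFGKIAAAA", "CCCCFGKICCCC"))
```

The unrelated flanks contribute nothing because any cell that would go negative is reset to zero.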

Algorithm

• Both local and global alignment programs use “dynamic programming” (wikipedia that)

• … to make optimal alignment
– the alignment that tells evolutionary story
  • True story unknown without time-travel
– the alignment that has the highest score
  • Choose/change parameters to maximise score
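As a rough sketch of what "dynamic programming" means here, the following fills a score table for global alignment and traces back one highest-scoring alignment. The function name `needleman_wunsch` and the parameters (match = 1, mismatch = 0, linear gap = -1) are assumptions for illustration; real programs use substitution matrices and affine gaps.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = 0, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # pair the two residues
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner, preferring the diagonal on ties.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row1, row2 = needleman_wunsch("GARFIELDTHECAT", "GARFIELDACAT")
print(total)
print(row1)
print(row2)
```

Several alignments can share the top score, so the rows printed depend on the tie-breaking rule in the traceback; changing the parameters changes which alignment wins.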

2 sequence alignment

aligning GARFIELDTHECAT & GARFIELDTHERAT is easy

GARFIELDTHECAT

||||||||||| ||

GARFIELDTHERAT

Scoring systems DNA

• In an alignment add 1 if bases are identical, 0 if they are different

• Transition/transversion?
– AG purines, CUT pyrimidines

  A T C G
A 1 0 0 0
T 0 1 0 0
C 0 0 1 0
G 0 0 0 1

  A T C G
A 2 0 0 1
T 0 2 1 0
C 0 1 2 0
G 1 0 0 2

Scoring comparison DNA

• CTAGCGATGC
• CGAACGACAC
• 1010111001 1/0 score = 6/10
• 2021222112 Ts/Tv score = 15/20

• transitions 5x more common than transversions
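The two scoring schemes can be checked with a short sketch. The function names are hypothetical; the scores follow the two matrices in the slides (identity: 1/0; Ts/Tv: 2 for identical, 1 for a transition, 0 for a transversion).

```python
PURINES = set("AG")
PYRIMIDINES = set("CUT")


def identity_score(x: str, y: str) -> int:
    """1 if the bases are identical, 0 otherwise."""
    return 1 if x == y else 0


def ts_tv_score(x: str, y: str) -> int:
    """2 for identical bases, 1 for a transition, 0 for a transversion."""
    if x == y:
        return 2
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 1 if same_class else 0


def score_alignment(a: str, b: str, pair_score) -> int:
    """Sum a per-column score over an ungapped alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


seq1, seq2 = "CTAGCGATGC", "CGAACGACAC"
print(score_alignment(seq1, seq2, identity_score))  # 1/0 score
print(score_alignment(seq1, seq2, ts_tv_score))     # Ts/Tv score
```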

Insert gaps

Sometimes, you can get a better overall alignment if you insert gaps

GARFIELDTHECAT
||||||||   |||
GARFIELDA--CAT

is better (scores higher) than

GARFIELDTHECAT
||||||||
GARFIELDACAT

No gap penalty

But there must be some sort of a gap-penalty or you can align ANY two sequences:

G-R--E------AT
| |  |      ||
GARFIELDTHECAT

Gap penalty

• Could set a negative score for each indel
– Linear gap penalty

• But mutation could be point or deletion
– latter is a single event, however many residues it removes

• Advised to use affine (open + extend)
– Open -10, extend -0.05

• How to choose penalty?
– Start with program defaults
– Use good judgment - trial and error
– Investigate statistical distribution of indels
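The difference between the linear and affine schemes can be sketched as two small functions. The names are hypothetical, the default values are the ones quoted on the slide, and treating the opening penalty as covering the first gap position is an assumption (some programs charge open plus extend for every position).

```python
def linear_gap_penalty(length: int, per_position: float = -1.0) -> float:
    """Every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length: int, gap_open: float = -10.0,
                       gap_extend: float = -0.05) -> float:
    """Opening a gap is expensive; each further position is cheap.

    Assumes gap_open pays for the first position of the gap.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# One deletion of three residues vs three separate one-residue indels.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

Under the affine scheme a single three-residue gap is penalised far less than three scattered gaps, which matches the idea that a multi-residue deletion is one event.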

Scoring for similarities: proteins

• Gap penalty?
– Traded off against positive scores for matches in aligned residues

• Could, as with DNA, use
– match=1 mismatch=0

• Or …

Scoring system proteins

• When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence. Which of the following alignment pairs is better?:

FGDERTHHS
FGD--DHRS

FGDERTHHS
FGDD--HRS

FGDERTHHS
FGD-D-HRS

Where put the gap?
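A quick sketch shows why a plain match=1/mismatch=0 count cannot answer this question. It assumes each pair above is the query FGDERTHHS aligned against a gapped copy of FGDDHRS (an interpretation of the slide); the function names are hypothetical.

```python
from itertools import groupby


def identities(query: str, subject: str) -> int:
    """Number of columns with the same residue in both rows (gaps excluded)."""
    return sum(1 for q, s in zip(query, subject) if q == s and q != "-")


def gap_runs(row: str) -> int:
    """Number of separate gaps (runs of '-') in one aligned row."""
    return sum(1 for is_gap, _ in groupby(row, key=lambda c: c == "-") if is_gap)


query = "FGDERTHHS"
for subject in ("FGD--DHRS", "FGDD--HRS", "FGD-D-HRS"):
    print(subject, identities(query, subject), gap_runs(subject))
```

All three placements give the same identity count, so the choice has to come from the gap model (one gap opening or two) and from how the mismatched pairs score in a substitution matrix.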

3 Garfield relatives

GARFIELDTHECAT
||||   |||||||
GARFRIEDTHECAT

GARFIELDTHECAT
||| |||  |||||
GARWIELESHECAT

GARFIELDTHECAT
||  ||||||| ||
GAVGIELDTHEMAT

Substitution matrices

Top left part of a BLOSUM 90 matrix

    A   R   N   D   C   Q   E   G   H   I   L
A   5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
R  -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
N  -2  -1   7   1  -4   0  -1  -1   0  -4  -4
D  -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
C  -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
Q  -1   1   0  -1  -4   7   2  -3   1  -4  -3
E  -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
G   0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
H  -2   0   0  -2  -5   1  -1  -3   8  -4  -4
I  -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5

Symmetrical!

conservative subst.

Willie Taylor’s AA Venn Diagram

Substitution matrices

• Plenty of choice
– Identical = 1.0; similar (K/R, F/Y) = 0.5; rest 0.0
– PAM series, BLOSUM series, others

• Based on observations and counting in real seqs
• BLOSUM 90 made from aligned seqs, with those clustered at 90% identity counted together

• Main diagonal elements positive
– Some more positive than others
– More highly conserved (C, F etc.)

• Off-diagonal elements mostly negative
– Some more negative than others (less likely)
– Some positive score (K-R, D-E etc.)
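The properties listed above can be checked directly on the BLOSUM 90 fragment from the earlier slide. This is a sketch: the values are copied from that fragment (11 residues only), and the helper names are hypothetical.

```python
RESIDUES = "ARNDCQEGHIL"
# Top-left 11 x 11 corner of BLOSUM 90, as shown in the slide.
ROWS = [
    [5, -2, -2, -3, -1, -1, -1, 0, -2, -2, -2],    # A
    [-2, 6, -1, -3, -5, 1, -1, -3, 0, -4, -3],     # R
    [-2, -1, 7, 1, -4, 0, -1, -1, 0, -4, -4],      # N
    [-3, -3, 1, 7, -5, -1, 1, -2, -2, -5, -5],     # D
    [-1, -5, -4, -5, 9, -4, -6, -4, -5, -2, -2],   # C
    [-1, 1, 0, -1, -4, 7, 2, -3, 1, -4, -3],       # Q
    [-1, -1, -1, 1, -6, 2, 6, -3, -1, -4, -4],     # E
    [0, -3, -1, -2, -4, -3, -3, 6, -3, -5, -5],    # G
    [-2, 0, 0, -2, -5, 1, -1, -3, 8, -4, -4],      # H
    [-2, -4, -4, -5, -2, -4, -4, -5, -4, 5, 1],    # I
    [-2, -3, -4, -5, -2, -3, -4, -5, -4, 1, 5],    # L
]


def pair_score(x: str, y: str) -> int:
    """Look up the substitution score for residues x and y."""
    return ROWS[RESIDUES.index(x)][RESIDUES.index(y)]


def is_symmetric(rows) -> bool:
    """True if score(x, y) == score(y, x) for every pair."""
    n = len(rows)
    return all(rows[i][j] == rows[j][i] for i in range(n) for j in range(n))


def ungapped_score(a: str, b: str) -> int:
    """Sum of pair scores over an ungapped alignment of equal length."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(is_symmetric(ROWS))
print(pair_score("C", "C"), pair_score("A", "A"))  # diagonal: some more positive
print(pair_score("D", "E"), pair_score("C", "E"))  # conservative vs unlikely swap
```

Only residues present in the fragment can be looked up; a full matrix would cover all 20 amino acids.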

Dotplot theory

  A T G A T A T T C T T
A . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .

Task: align ATGATATTCTT and ATTGTTC

Another way of comparing 2 sequences

  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .

Go along the first seq inserting a + wherever 2/3 bases in a moving window match. The first seq is compared to ATT (the first 3 bases in the vertical sequence)

Windowsize = 3
Threshold = 2

  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .

Then go along the first seq inserting a + wherever 2/3 bases in a moving window match. The first seq is compared to TTG (the next 3 in the vertical sequence).

  A T G A T A T T C T T
A . . . . . . . . . . .
T . + . . + . + . . + .
T . + . . . . . + . . .
G . . + . . . . . + . .
T . . . + . . . . . + .
T . . . . . . . + . . .
C . . . . . . . . . . .

Iterate until

  A T G A T A T T C T T
A
T   +     +   +     +
T   +           +
G     +           +
T       +           +
T               +
C

The human eye is particularly good at picking up structure from the pattern of dots. You might see a hint of a duplicated region in the horizontal sequence that is not so clear from the sequence itself
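The window-and-threshold procedure can be sketched in a few lines. The function names are hypothetical, and placing each + at the centre of its window (so the first and last row and column stay empty for a window of 3) is an assumption about how the slide's grid is laid out. Cells are computed strictly from the rule, so individual cells may differ from a hand-drawn plot.

```python
def dotplot(horizontal: str, vertical: str, window: int = 3, threshold: int = 2):
    """Return a grid of '.' and '+' with one row per base of `vertical`.

    A '+' marks a cell where at least `threshold` of the `window` bases,
    centred on that cell, match between the two sequences.
    Assumes an odd window size.
    """
    half = window // 2
    grid = [["." for _ in horizontal] for _ in vertical]
    for i in range(half, len(vertical) - half):
        for j in range(half, len(horizontal) - half):
            matches = sum(
                1
                for k in range(-half, half + 1)
                if vertical[i + k] == horizontal[j + k]
            )
            if matches >= threshold:
                grid[i][j] = "+"
    return grid


def render(horizontal: str, vertical: str, grid) -> str:
    """Format the grid with sequence labels, like the slide."""
    lines = ["  " + " ".join(horizontal)]
    for base, row in zip(vertical, grid):
        lines.append(base + " " + " ".join(row))
    return "\n".join(lines)


plot = dotplot("ATGATATTCTT", "ATTGTTC")
print(render("ATGATATTCTT", "ATTGTTC", plot))
```

Raising the window size and threshold removes background noise at the cost of sensitivity, which is the "windowsize and stringency" trade-off mentioned in the protocol.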

Jurassic Dotplot

Mark Boguski

1st smartass

Dinosaur DNA 2

• New seq published in Jurassic Park II

• Search database with “dinosaur” DNA

• Top hit
– GAT1_CHICK sw:P17678 Erythroid Transcription Factor
– scoring matrix: BLOSUM50, gap penalties: -12/-2
– 95.6% identity; Global alignment score: 2144

But alignment not perfect – gaps inserted

TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT
:::::::::::::::::::::::::::::::::::::::::::::::::::    :::::
TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT

ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT
::::::::::::::::   :::::::::::::::::::::::::::::::::    ::::
ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT

STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
:::::::::::::::::   ::::::::::::::::::::::::::::::::::::::::
STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG

Dinosaur Boguski Alignment

Aligning the “dinosaur” DNA (upper) with the chicken (lower)

When global fails

[Figure: domain layout of the two proteins. F12: F2, E, F1, E, K, Catalytic. PLAT: F1, E, K, K, Catalytic.]

Two blood clotting genes, Factor 12 (F12) and Plasminogen Activator (PLAT), have F, E, K and Catalytic domains typical of the pathway

By aligning PLAT’s F1 domain with F12’s F2 domain, you miss a better alignment (in grey) between the two F1 domains

The alignment doesn’t recognise the second E domain in F12 but just puts a gap in the other sequence

The alignment doesn’t recognise the second K domain in PLAT but forces an alignment to the other sequence

Alignment protocol

• What should real biologists do?
1. Dotplot against self to identify internal repeats
2. Dotplot against other sequence
   • Alter windowsize and stringency
3. If similarity along whole seq do global alignment
   • Take default parameters
   • Then change parameters to check effect
4. If local/domain similarity only then do local alignment
5. If in doubt do local alignment
6. LOOK at the alignment and see if you can improve it: by hand – use good judgment

LALIGN

• Internal repeats really confuse global alignment
• Local alignment reports only BEST alignment
• What about sub-optimal, second best hits?
• If you do a dotplot repeats will be clear
• Use LALIGN to report not only the best alignment but also any other repeated elements
– And show you the aligned sequences there


Finally, some sequences are similar even if they have no recent common ancestor. Huntington's disease is caused by repeated CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein. If you do a similarity search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but a totally different function.


Huntingtin: MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA

Search against database hits:

>MM16_MOUSE MATRIX METALLOPROTEINASE-16
Score = 34.4 bits (78), Expect = 0.18
Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%)

FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
F Q  +    +      Q  Q+   PP   PPP   LP  PP     P  P+    P PP
FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP

But not because it is involved in microtubule mediated transport!

PRPs (proline-rich proteins) have the same problem
