Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago.

Alignment of Genomic Sequences

Wen-Hsiung Li

Ecology & Evolution

Univ. of Chicago

1 GTCTGTTCCAAGGGCCTTTGCGTCAGG-TGGGC-T
                    * # *   #       *
2 GTCTGTTCCAAGGGCCTTCGAGCCAGTCTGGGCCC

1 TT---------------CCAGGGTGGCTGGACCCC
  *                    * ** *
2 CTGCCCCACTCGGGGTTCCAGAGCAGTTGGACCCC

1 CCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
  *                   **
2 TCAGC---------GGGAGGGTGTGGCTGGGCTC-

Sequence Alignment

(1) pairs of matched bases

(2) pairs of mismatched bases

(3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence

TCAGA
** *
TC-GT

Alignment as an Evolutionary Hypothesis

A: TCAGACGATTG    L_A = 11

B: TCGGAGCTG      L_B = 9

Alignment I

TCAG-ACG-ATTG
|| | | |  | |
TC-GGA-GC-T-G
Matches = 7, Gaps = 6

Alignment II

TCAGACGATTG
|| ||
TCGGAGCTG--
Matches = 4, Gaps = 1

Alignment III

TCAG-ACGATTG
|| | | | |
TC-GGA-GCTG-
Matches = 6, Gaps = 4

Which alignment is best?

Gap and Mismatch Penalties

• Gap penalty - a factor by which gap values are multiplied to make the gaps equivalent to mismatches

• Mismatch penalty - an assessment of how frequently substitutions occur

Similarity Index

$S = x - \sum_k w_k z_k$

$x$ : number of matches

$z_k$ : number of gaps of length k

$w_k$ : positive number representing the penalty for gaps of length k

Distance (Dissimilarity) Index

$D = y + \sum_k w'_k z_k$

$y$ : number of mismatches

$z_k$ : number of gaps of length k

$w'_k$ : positive number representing the penalty for gaps of length k

Gap penalty systems

• Fixed - a constant cost per gap, with no gap extension penalty

• Affine or Linear - has two components, a gap opening penalty and a gap extension penalty

• Logarithmic - also has two components, but the cost increases more slowly with gap length, allowing longer gaps than the affine system
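The three systems can be written as functions of the gap length k. The sketch below is illustrative only: the function names, and the exact forms (opening + extension × (k − 1) for affine, opening + extension × ln k for logarithmic), are assumptions chosen to match the descriptions above, not formulas given in the slides.

```python
import math

def fixed_penalty(k, opening):
    """Fixed system: every gap costs the same, whatever its length k."""
    return opening

def affine_penalty(k, opening, extension):
    """Affine system (assumed form): opening cost plus a cost per extra position."""
    return opening + extension * (k - 1)

def logarithmic_penalty(k, opening, extension):
    """Logarithmic system (assumed form): the cost grows with ln(k)."""
    return opening + extension * math.log(k)

if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the three growth rates.
    for k in (1, 2, 5, 10, 50):
        print(k,
              fixed_penalty(k, 3),
              affine_penalty(k, 2, 1),
              round(logarithmic_penalty(k, 2, 1), 2))
```

For long gaps the logarithmic cost falls well below the affine one, which is why it tolerates longer gaps.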

Gap penalty systems

[Figure: gap penalty (vertical axis) plotted against gap length (horizontal axis) for the Linear, Logarithmic, and Fixed systems]

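Alignment             Gap opening cost = 2      Gap opening cost = 3
                      Gap extension cost = 6    Gap extension cost = 0

I    TCAG-ACG-ATTG    S = -5                    S = -11
     TC-GGA-GC-T-G

II   TCAGACGATTG      S = -4                    S = 1
     TCGGAGCTG--

III  TCAG-ACGATTG     S = -2                    S = -6
     TC-GGA-GCTG-

BEST: the highest score in each column, which is Alignment III (S = -2) under the first scheme and Alignment II (S = 1) under the second.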
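The scores on this slide can be reproduced from the alignment strings. The sketch below counts matches and gap runs and applies S = x − Σ w_k z_k. It assumes a gap of length k costs opening + extension × (k − 1), the convention that reproduces the numbers shown. The function names are hypothetical.

```python
from itertools import groupby

def count_matches_and_gaps(top, bottom):
    """Return (number of matches, list of gap-run lengths) for an alignment."""
    assert len(top) == len(bottom)
    matches = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    # Label each column: 'T' = gap in top row, 'B' = gap in bottom row, '.' = no gap.
    labels = ["T" if x == "-" else "B" if y == "-" else "."
              for x, y in zip(top, bottom)]
    gap_lengths = [len(list(run)) for label, run in groupby(labels) if label != "."]
    return matches, gap_lengths

def similarity(top, bottom, opening, extension):
    """S = x - sum of gap penalties, with w_k = opening + extension * (k - 1) (assumed)."""
    matches, gap_lengths = count_matches_and_gaps(top, bottom)
    return matches - sum(opening + extension * (k - 1) for k in gap_lengths)

alignments = {
    "I":   ("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"),
    "II":  ("TCAGACGATTG",   "TCGGAGCTG--"),
    "III": ("TCAG-ACGATTG",  "TC-GGA-GCTG-"),
}

if __name__ == "__main__":
    for name, (top, bottom) in alignments.items():
        print(name, similarity(top, bottom, 2, 6), similarity(top, bottom, 3, 0))
```

Changing the two cost parameters is enough to change which alignment ranks first.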

Dynamic programming

• Large searches are divided into a succession of small stages:

• the solution of the initial search stage is trivial

• each partial solution in a later stage can be calculated by reference to only a small number of solutions of the earlier stage

• the final stage contains the overall solution

     A  T  G  C  G
A    1  0  0  0  0
T    0  2  1  1  1
C    0  1  2  3  2
C    0  1  2  3  3
G    0  1  3  2  4
C    0  1  2  4  3

Pointer values and paths connecting the pointers

[Figure: the same matrix (columns A T G C G, rows A T C C G C), with pointers drawn between cells and the paths that connect them]

Traceback

ATGCG-
|| ||
ATCCGC

AT--GCG
||  ||
ATCCGC-
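The matrix values shown above are consistent with a forward fill in the style of the original Needleman-Wunsch procedure. In that fill, each cell scores 1 for a match and 0 for a mismatch, then adds the largest value found in the previous row (to the left) or the previous column (above), with no gap penalty. This rule is inferred from the numbers and is not stated in the slides. A minimal sketch that reproduces the matrix:

```python
def fill_matrix(cols, rows, match=1, mismatch=0):
    """Fill a similarity matrix (rows x cols) with no gap penalty.

    M[i][j] = s(i, j) + max(previous row up to column j-1,
                            previous column up to row i-1)
    """
    n, m = len(rows), len(cols)
    M = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = match if rows[i] == cols[j] else mismatch
            best = 0
            if i > 0 and j > 0:
                prev_row = max(M[i - 1][:j])                   # row i-1, columns 0..j-1
                prev_col = max(M[k][j - 1] for k in range(i))  # column j-1, rows 0..i-1
                best = max(prev_row, prev_col)
            M[i][j] = s + best
    return M

if __name__ == "__main__":
    for base, row in zip("ATCCGC", fill_matrix("ATGCG", "ATCCGC")):
        print(base, row)
```

The largest value in the matrix (4) is reached in two cells, which is why the traceback yields two equally good alignments, each with four matches.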

Similarity Index

$S = x - \sum_k w_k z_k$

$x$ - number of matches

$z_k$ - number of gaps of length k

$w_k$ - a positive number representing the penalty for gaps of length k

(I)
TCAGACGAGTG
|| ||    ||
TCGGA--GCTG
x = 6; a gap of 2 bp; S = 6 - (a + 2b)

(II)
TCAGACGAGTG
|| || |  ||
TCGGA-GC-TG
x = 7; 2 gaps of 1 bp; S = 7 - 2(a + b)

(III)
TCAGACGAGTG
|| || |  ||
TCGGA-G-CTG
x = 7; 2 gaps of 1 bp; S = 7 - 2(a + b)

(IV)
TCAGACGAG-TG
|| |  ||| ||
TC-G--GAGCTG
x = 8; 2 gaps of 1 bp and 1 gap of 2 bp; S = 8 - 2(a + b) - (a + 2b)
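Subtracting the scores pairwise shows how the values of a and b decide which of these alignments ranks highest. The derivation below uses only the four expressions above, in which a gap of length k costs a + kb. The subscripts I, II and IV refer to the alignment labels, and III has the same score as II.

```latex
\begin{align*}
S_{II} - S_{I}  &= \bigl[7 - 2(a + b)\bigr] - \bigl[6 - (a + 2b)\bigr] = 1 - a \\
S_{IV} - S_{II} &= \bigl[8 - 2(a + b) - (a + 2b)\bigr] - \bigl[7 - 2(a + b)\bigr] = 1 - (a + 2b)
\end{align*}
```

So alignments II and III beat alignment I only when a < 1, and alignment IV beats II only when a + 2b < 1. With a = 2 and b = 1, for example, the scores are S_I = 2, S_II = S_III = 1 and S_IV = -2, so the single 2-bp gap of alignment I is preferred.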

How to align two long genomic sequences?

Traditional Seq. Alignment

• The seqs. are usually known (coding or non-coding) and are homologous

• They are not very long, usually < 10,000 base pairs (bp)

• They contain no inversions

• Relies on dynamic programming: the time and space required are $O(N^2)$, where N is the sequence length.

The Human Genome

• Genome size: ~3.2 billion bp

• Only ~1.5% is coding.

• Contains numerous repetitive elements (more than 4 million).

• Introns are usually longer than exons.

• Non-coding regions evolve fast and are not well conserved.

Genomic Seq. Alignment

• The seqs. can be > one million bp (Mb); e.g., the genome size of Mycobacterium tuberculosis is about 4 Mb. Alignment takes a long time and requires a large amount of computer memory.

• May contain inversions and many tandem repeats.

• May contain non-alignable (too divergent) segments.
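To make the O(N²) cost concrete, the sketch below compares the size of a full dynamic programming table for two 10,000 bp sequences with that for two 4 Mb genomes. The 4 bytes per cell is an assumed value and the function name is hypothetical. Real implementations can reduce memory, so this is an order-of-magnitude illustration only.

```python
def dp_table_bytes(n, m, bytes_per_cell=4):
    """Memory needed to hold a full n x m dynamic programming table."""
    return n * m * bytes_per_cell

if __name__ == "__main__":
    for label, n in [("10,000 bp sequences", 10_000), ("4 Mb genomes", 4_000_000)]:
        cells = n * n
        gigabytes = dp_table_bytes(n, n) / 1e9
        print(f"{label}: {cells:.1e} cells, {gigabytes:,.1f} GB")
```

Under these assumptions the table grows from about 0.4 GB to about 64 TB, which is why the full matrix is not practical at genome scale.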

Genomic Seq. Alignment

Strategy: Search for anchors that can divide the sequences into subregions. The gaps between anchors can then be aligned by a local alignment algorithm.

The System of Delcher et al. (1999)

• Three ideas: (1) Suffix trees; (2) the Longest Increasing Subsequence (LIS); and (3) the local alignment method of Smith and Waterman (1981)

• Two closely homologous long sequences or genomes (A and B).

Step 1: Perform a Maximum Unique Match (MUM) decomposition of the two sequences

A MUM is a subsequence that occurs once in sequence A and once in sequence B, and is not contained in any longer such sequence.

Max. Unique Matches (MUMs)

MUM1

Seq. A tcgatcaAGCTCACTGATatgtaccat

Seq. B cgagcgAGCTCACTGATcctgcatca

MUM2

-acgctgaATCGACGTAGTCCATGtactgta

agtgc-agATCGACGTAGTCCATGatgaat
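The MUM definition can be checked directly on short sequences with a brute-force search. The sketch below is not the suffix-tree method of Delcher et al. It extends every left-maximal match as far as possible and keeps those whose matched string occurs exactly once in each sequence. The function names and the `min_len` parameter are hypothetical, and the search is far too slow for real genomes.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1

def find_mums(a, b, min_len=1):
    """Return sorted (pos_in_a, pos_in_b, length) triples, 0-based, for all MUMs."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left (not maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            s = a[i:i + k]
            if count_occurrences(a, s) == 1 and count_occurrences(b, s) == 1:
                mums.append((i, j, k))
    return sorted(mums)

if __name__ == "__main__":
    seq_a = "tcgatcaAGCTCACTGATatgtaccat".upper()
    seq_b = "cgagcgAGCTCACTGATcctgcatca".upper()
    for i, j, k in find_mums(seq_a, seq_b, min_len=10):
        print(i, j, seq_a[i:i + k])
```

Run on the first pair of sequences from the slide, the search reports the capitalized segment AGCTCACTGAT as a MUM.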

Suffix Trees

A suffix is a subseq. that begins at any position in the seq. & extends to the seq. end.

g a a c c g a c c t
1 2 3 4 5 6 7 8 9 10

A suffix: c c g a c c t

A suffix tree is a compact representation that stores all possible suffixes of a seq.
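The idea of storing every suffix in one tree can be sketched with an uncompressed suffix trie built from nested dictionaries. This is a simplification: a true suffix tree merges chains of single-child nodes into one edge and can be built in linear time, whereas this version takes quadratic time and space. The function names and the "$" leaf marker are assumptions made for illustration.

```python
def build_suffix_trie(seq):
    """Insert every suffix of seq into a trie of nested dicts.

    The key "$" at the end of a path stores the 1-based start position
    of the suffix that ends there.
    """
    root = {}
    for start in range(len(seq)):
        node = root
        for ch in seq[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start + 1
    return root

def contains(trie, pattern):
    """True if pattern occurs in the indexed sequence (every substring is a prefix of a suffix)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

if __name__ == "__main__":
    trie = build_suffix_trie("gaaccgacct")
    print(contains(trie, "ccgacct"), contains(trie, "acc"), contains(trie, "gg"))
```

Looking up a pattern takes time proportional to the pattern length, independent of the sequence length, which is what makes this family of structures useful for finding matches between long sequences.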

[Figure: suffix tree for the sequence g a a c c g a c c t (positions 1 to 10). Edges from the root are labeled with substrings, and each leaf is labeled with the start position of its suffix.]

[Figure: the suffix tree extended to hold two sequences, g a a c c g a c c t# and g a a c c t a c c t* (positions 1 to 10), with the end markers # and * distinguishing the suffixes of each sequence]

Step 2: Sort the MUMs

After finding the MUMs, we sort them according to their positions in genome A. See figure.

Longest Increasing Subsequence (LIS): If the order of B positions is given by the sequence [1,2,10,4,5,8,6,7,9,3], the LIS is [1,2,4,5,6,7,9].

The LIS gives a global MUM-alignment.
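The LIS in the example can be computed with the standard patience-sorting approach in O(n log n) time. The sketch below is a generic LIS routine, not necessarily the variant used by Delcher et al., and the function name is hypothetical.

```python
from bisect import bisect_left

def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq."""
    tail_vals = []            # tail_vals[k]: smallest tail of an increasing run of length k+1
    tail_idx = []             # index in seq of that tail
    prev = [None] * len(seq)  # back-pointers for reconstructing the subsequence
    for i, x in enumerate(seq):
        k = bisect_left(tail_vals, x)
        prev[i] = tail_idx[k - 1] if k > 0 else None
        if k == len(tail_vals):
            tail_vals.append(x)
            tail_idx.append(i)
        else:
            tail_vals[k] = x
            tail_idx[k] = i
    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]

if __name__ == "__main__":
    print(longest_increasing_subsequence([1, 2, 10, 4, 5, 8, 6, 7, 9, 3]))
    # prints [1, 2, 4, 5, 6, 7, 9]
```

MUMs whose B positions fall outside the LIS (10, 8 and 3 in this example) are out of order relative to genome A and are left out of the global MUM-alignment.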

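[Figure: MUMs 1 to 7 shown in order along Genome A, with the same MUMs in Genome B appearing in a different order. After the LIS step, MUMs 1, 2, 4, 6 and 7 remain, in the same order in Genome A and Genome B.]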

Step 3: Close the gaps between MUMs

Use the Smith-Waterman algorithm to close the gaps between MUMs.

Some regions may be very difficult to align. These regions are ignored and considered as non-alignable parts.

Default: If the gap between 2 MUMs is 10 kb, no local alignment is attempted.
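For reference, a minimal sketch of Smith-Waterman local alignment with a linear gap penalty. The scoring values (match = 2, mismatch = -1, gap = -2) and the function name are assumptions for illustration. They are not the parameters used by Delcher et al.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned segment of a, aligned segment of b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, so scores never drop below 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    print(smith_waterman("AAACGTAAA", "TTTCGTTTT"))
```

Because the table is quadratic in the length of the region, it is applied only to the comparatively short stretches between neighbouring MUMs.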
