Alignment of Genomic Sequences Wen-Hsiung Li Ecology & Evolution Univ. of Chicago.
Date posted: 22-Dec-2015
Alignment of Genomic Sequences
Wen-Hsiung Li
Ecology & Evolution
Univ. of Chicago
1 GTCTGTTCCAAGGGCCTTTGCGTCAGG-TGGGC-T
                    * # *   #       *
2 GTCTGTTCCAAGGGCCTTCGAGCCAGTCTGGGCCC

1 TT---------------CCAGGGTGGCTGGACCCC
  *                    * ** *
2 CTGCCCCACTCGGGGTTCCAGAGCAGTTGGACCCC

1 CCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
  *                   **
2 TCAGC---------GGGAGGGTGTGGCTGGGCTC-
(1) pairs of matched bases
(2) pairs of mismatched bases
(3) pairs consisting of a base from one sequence and a gap (null base) from the other sequence
Sequence Alignment
TCAGA
** *
TC-GT
Alignment as an Evolutionary
Hypothesis
A: TCAGACGATTG    L_A = 11
B: TCGGAGCTG      L_B = 9
Alignment I
TCAG-ACG-ATTG
|| | | |  | |
TC-GGA-GC-T-G
Matches = 7
Gaps = 6
Alignment II
TCAGACGATTG
|| ||
TCGGAGCTG--
Matches = 4
Gaps = 1
Alignment III
TCAG-ACGATTG
|| | | | |
TC-GGA-GCTG-
Matches = 6
Gaps = 4
Which alignment is best?
Gap and Mismatch Penalties
• Gap penalty - a factor by which gap values are multiplied to make the gaps equivalent to mismatches
• Mismatch penalty - an assessment of how frequently substitutions occur
Similarity Index
$S = x - \sum_k w_k z_k$
$x$ : number of matches
$z_k$ : number of gaps of length $k$
$w_k$ : positive number representing the penalty for gaps of length $k$
Distance (Dissimilarity) Index
$D = y + \sum_k w'_k z_k$
$y$ : number of mismatches
$z_k$ : number of gaps of length $k$
$w'_k$ : positive number representing the penalty for gaps of length $k$
Gap penalty systems
• Fixed - a constant penalty per gap, with no gap extension penalty
• Affine or Linear - has two components: a gap opening penalty and a gap extension penalty
• Logarithmic - also has two components, but the cost increases more slowly with gap length, allowing longer gaps than the affine system
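The three systems can be written as penalty functions of the gap length k. The sketch below is only illustrative: the slides do not give the formulas, so the function names, the default costs, and the convention that the opening cost covers the first gap position are all assumptions.

```python
import math


def fixed_penalty(k: int, opening: float = 3.0) -> float:
    """Fixed system: the same cost for any gap, whatever its length k."""
    return opening


def affine_penalty(k: int, opening: float = 2.0, extension: float = 6.0) -> float:
    """Affine (linear) system: opening cost plus a cost per extra position.

    Assumed convention: the opening cost covers the first gap position.
    """
    return opening + extension * (k - 1)


def log_penalty(k: int, opening: float = 2.0, extension: float = 6.0) -> float:
    """Logarithmic system: the extension term grows with log(k) (assumed form)."""
    return opening + extension * math.log(k)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, fixed_penalty(k), affine_penalty(k), round(log_penalty(k), 2))
```

For long gaps the logarithmic cost stays well below the affine cost, which is why it tolerates longer gaps.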
Gap penalty systems
[Figure: gap penalty plotted against gap length for the linear, logarithmic, and fixed systems.]
Alignment                             Gap opening cost = 2      Gap opening cost = 3
                                      Gap extension cost = 6    Gap extension cost = 0
I    TCAG-ACG-ATTG / TC-GGA-GC-T-G    S = -5                    S = -11
II   TCAGACGATTG / TCGGAGCTG--        S = -4                    S = 1 (BEST)
III  TCAG-ACGATTG / TC-GGA-GCTG-      S = -2 (BEST)             S = -6

Under each set of costs, the best alignment is the one with the highest S.
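The scores for the three alignments can be checked with a few lines of code. This is a sketch, assuming that a gap of length k costs opening + extension * (k - 1); the slide does not state that formula, but it reproduces every S value shown. The function name `similarity` is made up for the example.

```python
import re


def similarity(top: str, bottom: str, opening: float, extension: float) -> float:
    """S = (number of matches) - (sum of gap penalties) for one alignment.

    Assumed penalty for a gap of length k: opening + extension * (k - 1).
    """
    # A match is a column where both rows carry the same base (not a gap).
    matches = sum(1 for p, q in zip(top, bottom) if p == q and p != "-")
    # Each run of '-' in either row is one gap; its length is the run length.
    gap_lengths = [len(m.group()) for row in (top, bottom)
                   for m in re.finditer(r"-+", row)]
    penalty = sum(opening + extension * (k - 1) for k in gap_lengths)
    return matches - penalty


alignments = {
    "I": ("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"),
    "II": ("TCAGACGATTG", "TCGGAGCTG--"),
    "III": ("TCAG-ACGATTG", "TC-GGA-GCTG-"),
}

if __name__ == "__main__":
    for name, (top, bottom) in alignments.items():
        print(name, similarity(top, bottom, 2, 6), similarity(top, bottom, 3, 0))
```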
Dynamic programming
• Large searches are divided into a succession of small stages:
• the solution of the initial search stage is trivial
• each partial solution in a later stage can be calculated by reference to only a small number of solutions of the earlier stage
• the final stage contains the overall solution
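The stage-by-stage idea can be sketched for global alignment. This is not the scoring used for the matrix on the slides: it assumes a match scores 1, a mismatch 0, and every gap position costs `gap` (all three are assumptions), and the function name is hypothetical. Each cell is computed from only three earlier cells, so two rows of the matrix are enough to obtain the final score.

```python
def global_similarity(a: str, b: str, gap: int = 1) -> int:
    """Best global alignment score of a and b, keeping only two matrix rows."""
    # Initial stage (trivial): aligning an empty prefix of a costs only gaps.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]
        for j in range(1, len(b) + 1):
            # Each partial solution refers to just three earlier solutions.
            diag = prev[j - 1] + (1 if a[i - 1] == b[j - 1] else 0)
            up = prev[j] - gap        # a[i-1] aligned with a gap
            left = curr[j - 1] - gap  # b[j-1] aligned with a gap
            curr.append(max(diag, up, left))
        prev = curr
    # Final stage: the last cell holds the overall solution.
    return prev[-1]


if __name__ == "__main__":
    print(global_similarity("ATGCG", "ATCCGC"))
```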
    A  T  G  C  G
A   1  0  0  0  0
T   0  2  1  1  1
C   0  1  2  3  2
C   0  1  2  3  3
G   0  1  3  2  4
C   0  1  2  4  3
Pointer values and paths connecting the pointers
    A  T  G  C  G
A   1  0  0  0  0
T   0  2  1  1  1
C   0  1  2  3  2
C   0  1  2  3  3
G   0  1  3  2  4
C   0  1  2  4  3
Traceback
ATGCG-
|| ||
ATCCGC

AT--GCG
||  ||
ATCCGC-
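A traceback can be sketched by storing a pointer in every cell while the matrix is filled and then following the pointers back from the last cell. The scoring here (match 1, mismatch 0, `gap` per gap position) is an assumption and is not the scheme behind the matrix above, so the cell values differ. When several alignments tie, this version returns only one of them; the function name is hypothetical.

```python
def global_align(a: str, b: str, gap: int = 1):
    """Return (aligned_a, aligned_b, score) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    pointer = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], pointer[i][0] = -gap * i, "up"
    for j in range(1, m + 1):
        score[0][j], pointer[0][j] = -gap * j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else 0), "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            # max() keeps the first best candidate, so ties prefer "diag".
            score[i][j], pointer[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback: follow the pointers from the last cell back to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = pointer[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


if __name__ == "__main__":
    top, bottom, s = global_align("ATGCG", "ATCCGC")
    print(top, bottom, s, sep="\n")
```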
Similarity Index
$S = x - \sum_k w_k z_k$
$x$ - number of matches
$z_k$ - number of gaps of length $k$
$w_k$ - a positive number representing the penalty for gaps of length $k$
(I)
TCAGACGAGTG      x = 6
|| ||    ||      a gap of 2 bp
TCGGA--GCTG      S = 6 - (a + 2b)

(II)
TCAGACGAGTG      x = 7
|| || |  ||      2 gaps of 1 bp
TCGGA-GC-TG      S = 7 - 2(a + b)

(III)
TCAGACGAGTG      x = 7
|| || |  ||      2 gaps of 1 bp
TCGGA-G-CTG      S = 7 - 2(a + b)

(IV)
TCAGACGAG-TG     x = 8
|| |  ||| ||     2 1-bp gaps; 1 2-bp gap
TC-G--GAGCTG     S = 8 - 2(a + b) - (a + 2b)
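The four scores can be compared directly. The expressions imply that a gap of length k is penalized by w_k = a + bk; that reading is an inference from the S values above, not something the slides state. Expanding and subtracting gives:

```latex
\begin{align*}
S_{\mathrm{I}} &= 6 - (a + 2b) = 6 - a - 2b \\
S_{\mathrm{II}} = S_{\mathrm{III}} &= 7 - 2(a + b) = 7 - 2a - 2b \\
S_{\mathrm{IV}} &= 8 - 2(a + b) - (a + 2b) = 8 - 3a - 4b \\
S_{\mathrm{II}} - S_{\mathrm{I}} &= 1 - a \\
S_{\mathrm{IV}} - S_{\mathrm{II}} &= 1 - a - 2b
\end{align*}
```

Under this reading, alignments (II) and (III) score higher than (I) only when a < 1, and (IV) scores higher than (II) only when a + 2b < 1. Which alignment is best therefore depends on the gap penalties chosen.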
How to align two long genomic sequences?
Traditional Seq. Alignment
• The seqs. are usually known (coding or non-coding) and are homologous
• They are not very long, usually < 10,000 base pairs (bp)
• They contain no inversions
• Relies on dynamic programming: the time and space required are $O(N^2)$, where $N$ is the sequence length.
The Human Genome
• Genome size: ~3.2 billion bp
• Only ~1.5% is coding.
• Contains numerous repetitive elements (more than 4 million).
• Introns are usually longer than exons.
• Non-coding regions evolve fast and are not well conserved.
Genomic Seq. Alignment
• The seqs. can be > one million bp (Mb); e.g., the genome size of Mycobacterium tuberculosis is about 4 Mb. Alignment therefore takes a long time and requires a large amount of computer memory.
• May contain inversions and many tandem repeats.
• May contain non-alignable (too divergent) segments.
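A quick calculation shows why the quadratic cost matters. The sketch below simply counts the cells of a full dynamic programming matrix, assuming two sequences of equal length N; the helper name is made up for the example.

```python
def dp_cells(n: int) -> int:
    """Cells in the full DP matrix for two sequences of length n (O(N^2))."""
    return n * n


if __name__ == "__main__":
    for label, n in [("10,000 bp", 10_000), ("4 Mb", 4_000_000)]:
        print(f"{label}: {dp_cells(n):.1e} cells")
```

Going from 10,000 bp to 4 Mb raises the matrix from 10^8 cells to 1.6 x 10^13 cells, a factor of 160,000.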
Genomic Seq. Alignment
Strategy: Search for anchors that can divide the sequences into subregions. The gaps between anchors can then be aligned by a local alignment algorithm.
The System of Delcher et al. (1999)
• Three ideas: (1) Suffix trees; (2) the Longest Increasing Subsequence (LIS); and (3) the local alignment method of Smith and Waterman (1981)
• Two closely homologous long sequences or genomes (A and B).
Step 1: Perform a Maximal Unique Match (MUM) decomposition of the two sequences
A MUM is a subsequence that occurs exactly once in sequence A and once in sequence B, and is not contained in any longer such sequence.
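The definition can be illustrated with a brute-force search on short sequences. This is only a sketch of the definition, not the suffix-tree method of Delcher et al.: it takes far too long for genomes. The function names, the `min_len` threshold, and the 0-based positions are choices made for the example.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a: str, b: str, min_len: int = 8):
    """Return (pos_in_a, pos_in_b, length) for every MUM of at least min_len."""
    a, b = a.upper(), b.upper()
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could still be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    seq_a = "tcgatcaAGCTCACTGATatgtaccat"
    seq_b = "cgagcgAGCTCACTGATcctgcatca"
    print(find_mums(seq_a, seq_b))
```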
Max. Unique Matches (MUMs)
MUM1
Seq. A  tcgatcaAGCTCACTGATatgtaccat
Seq. B   cgagcgAGCTCACTGATcctgcatca
MUM2
        -acgctgaATCGACGTAGTCCATGtactgta
        agtgc-agATCGACGTAGTCCATGatgaat
Suffix Trees
A suffix is a subseq. that begins at any position in the seq. & extends to the seq. end.
g a a c c g a c c t
1 2 3 4 5 6 7 8 9 10
A suffix: c c g a c c t
A suffix tree is a compact representation that stores all possible suffixes of a seq.
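The idea of storing every suffix in one tree can be sketched with a simple, uncompressed suffix trie built from nested dictionaries. This is not the compact suffix tree itself: a real suffix tree merges single-child chains into one edge and can be built in linear time, while this version takes quadratic time and space. The "$" terminator and the "leaf" key are assumptions made for the example.

```python
def build_suffix_trie(seq: str, terminator: str = "$") -> dict:
    """Insert every suffix of seq (plus a terminator) into a nested-dict trie."""
    text = seq + terminator
    root = {}
    for start in range(len(seq)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root


def contains(trie: dict, pattern: str) -> bool:
    """True if pattern is a substring of the indexed sequence."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def leaf_positions(node: dict) -> list:
    """Collect the suffix start positions stored below a node."""
    found = []
    for key, child in node.items():
        if key == "leaf":
            found.append(child)
        else:
            found.extend(leaf_positions(child))
    return found


if __name__ == "__main__":
    trie = build_suffix_trie("gaaccgacct")
    print(sorted(leaf_positions(trie)))
    print(contains(trie, "ccgacct"), contains(trie, "gat"))
```

Every substring of the sequence is a prefix of some suffix, so a substring query only has to walk down from the root.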
[Figure: suffix tree, with root, edge labels, and numbered nodes, for the sequence below.]
g a a c c g a c c t
1 2 3 4 5 6 7 8 9 10
[Figure: suffix tree, with root, edge labels, and numbered nodes, for the two sequences below, which are terminated by # and *.]
g a a c c g a c c t#
g a a c c t a c c t*
1 2 3 4 5 6 7 8 9 10
Step 2: Sort the MUMs
After finding the MUMs, we sort them according to their positions in genome A. See figure.
Longest Increasing Subsequence (LIS): If the order of B positions is given by the sequence [1,2,10,4,5,8,6,7,9,3], the LIS is [1,2,4,5,6,7,9].
The LIS gives a global MUM-alignment.
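The LIS of the example order can be computed with the standard patience-sorting approach, which runs in O(n log n) time. This is a generic sketch rather than the code of Delcher et al.; the function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq."""
    tails = []      # tails[k]: smallest tail of an increasing run of length k+1
    tails_idx = []  # index in seq of each of those tails
    prev = [None] * len(seq)  # back-pointer to the previous element of the run
    for i, value in enumerate(seq):
        k = bisect_left(tails, value)
        prev[i] = tails_idx[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(value)
            tails_idx.append(i)
        else:
            tails[k] = value
            tails_idx[k] = i
    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]


if __name__ == "__main__":
    print(longest_increasing_subsequence([1, 2, 10, 4, 5, 8, 6, 7, 9, 3]))
```

The MUMs left out of the LIS (here those at B positions 10, 8, and 3) are the ones whose order in genome B conflicts with their order in genome A.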
[Figure: order of the MUMs in the two genomes, before and after selecting the LIS.]
Genome A: 1 2 3 4 5 6 7
Genome B: 1 3 2 4 6 7 5
Genome A: 1 2 4 6 7
Genome B: 1 2 4 6 7
Step 3: Close the gaps between MUMs
Use the Smith-Waterman algorithm to close the gaps between MUMs.
Some regions may be very difficult to align. These regions are ignored and considered as non-alignable parts.
Default: If the gap between 2 MUMs is 10 kb, no local alignment is attempted.
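A minimal Smith-Waterman sketch for a single gap region is given below. The scoring values (match 2, mismatch -1, gap -2 per position) and the function name are assumptions made for the example; the slides do not give the parameters used in the system of Delcher et al. The key difference from global alignment is that cell scores never drop below zero, so the best-scoring local segment can start and end anywhere.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (aligned_a, aligned_b, score) for one best local alignment."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Scores are floored at 0, which lets an alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


if __name__ == "__main__":
    print(smith_waterman("TTTACGTGGG", "CCCACGTAAA"))
```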