Sequencing & Sequence Alignment
![Page 1: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/1.jpg)
1 Lecture 2.4
Sequencing & Sequence Alignment
     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10
![Page 2: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/2.jpg)
2 Lecture 2.4
Objectives
• Understand how DNA sequence data is collected and prepared
• Be aware of the importance of sequence searching and sequence alignment in biology and medicine
• Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment
![Page 3: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/3.jpg)
3 Lecture 2.4
High Throughput DNA Sequencing
![Page 4: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/4.jpg)
4 Lecture 2.4
30,000
![Page 5: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/5.jpg)
5 Lecture 2.4
Shotgun Sequencing
Isolate Chromosome
Shear DNA into Fragments
Clone into Seq. Vectors
Sequence
![Page 6: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/6.jpg)
6 Lecture 2.4
Principles of DNA Sequencing
Primer
pBR322
Amp
Tet
Ori
DNA fragment
Denature with heat to produce ssDNA
Klenow + ddNTP + dNTP + primers
![Page 7: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/7.jpg)
7 Lecture 2.4
The Secret to Sanger Sequencing
![Page 8: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/8.jpg)
8 Lecture 2.4
Principles of DNA Sequencing
5’
5’ Primer
3’ Template G C A T G C
dATP dCTP dGTP dTTP ddATP
dATP dCTP dGTP dTTP ddCTP
dATP dCTP dGTP dTTP ddTTP
dATP dCTP dGTP dTTP
ddCTP
GddC
GCATGddC
GCddA GCAddT ddG
GCATddG
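The fragment ladder on this slide can be reproduced in a few lines. This is a minimal sketch, assuming the six bases G C A T G C are the newly synthesized strand read 5’ to 3’; the function name `chain_termination_fragments` is made up for illustration and does not come from the lecture or any library.

```python
def chain_termination_fragments(strand):
    """Group every prefix of `strand` by the base that ends it.

    Each prefix models a product whose final base is a chain-terminating
    ddNTP, so fragments["C"] lists what the ddCTP reaction would contain.
    """
    fragments = {base: [] for base in "ACGT"}
    for end in range(1, len(strand) + 1):
        prefix = strand[:end]
        fragments[prefix[-1]].append(prefix)
    return fragments


if __name__ == "__main__":
    for base, products in chain_termination_fragments("GCATGC").items():
        print(f"dd{base}TP reaction:", products)
```

Sorting all of the fragments by length and reading off the last base of each one recovers the sequence, which is what reading the gel lanes from shortest to longest band does.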
![Page 9: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/9.jpg)
9 Lecture 2.4
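Principles of DNA Sequencing
[Figure: sequencing gel with one lane each for G, C, T and A, run between the − and + electrodes; sequence read from the gel: G C A T G C]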
![Page 10: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/10.jpg)
10 Lecture 2.4
Capillary Electrophoresis
Separation by Electro-osmotic Flow
![Page 11: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/11.jpg)
11 Lecture 2.4
Multiplexed CE with Fluorescent detection
ABI 3700: 96 x 700 bases
![Page 12: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/12.jpg)
12 Lecture 2.4
Shotgun Sequencing
Sequence Chromatogram
Send to Computer
Assembled Sequence
![Page 13: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/13.jpg)
13 Lecture 2.4
Shotgun Sequencing
• Very efficient process for small-scale (~10 kb) sequencing (preferred method)
• First applied to whole genome sequencing in 1995 (H. influenzae)
• Now standard for all prokaryotic genome sequencing projects
• Successfully applied to D. melanogaster
• Moderately successful for H. sapiens
![Page 14: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/14.jpg)
14 Lecture 2.4
The Finished Product
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTAGAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGAT
![Page 15: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/15.jpg)
15 Lecture 2.4
Sequencing Successes
T7 bacteriophage: completed in 1983; 39,937 bp, 59 coded proteins
Escherichia coli: completed in 1998; 4,639,221 bp, 4293 ORFs
Saccharomyces cerevisiae: completed in 1996; 12,069,252 bp, 5800 genes
![Page 16: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/16.jpg)
16 Lecture 2.4
Sequencing Successes
Caenorhabditis elegans: completed in 1998; 95,078,296 bp, 19,099 genes
Drosophila melanogaster: completed in 2000; 116,117,226 bp, 13,601 genes
Homo sapiens: 1st draft completed in 2001; 3,160,079,000 bp, 31,780 genes
![Page 17: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/17.jpg)
17 Lecture 2.4
So what do we do with all this sequence data?
![Page 18: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/18.jpg)
18 Lecture 2.4
Sequence Alignment
     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10
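The score matrix on this slide compares GENESIS with GENETICS, and matrices of this kind are filled in by dynamic programming. The sketch below is a generic Needleman-Wunsch-style scorer, not the lecture's own procedure: the scoring values (match = 10, mismatch = 0, gap = 0) are assumptions chosen for illustration, and the function name `best_alignment_score` is hypothetical. It fills the table from the top-left corner, so individual cells need not agree with the slide, whose fill rule is not stated.

```python
def best_alignment_score(a, b, match=10, mismatch=0, gap=0):
    """Best global alignment score of `a` and `b` by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(best_alignment_score("GENESIS", "GENETICS"))  # prints 60
```

Under these assumed scores the best total for GENESIS against GENETICS is 60 (six matched letters: G, E, N, E, I, S), which coincides with the largest value in the slide's matrix.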
![Page 19: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/19.jpg)
19 Lecture 2.4
Alignments tell us about...
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
![Page 20: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/20.jpg)
20 Lecture 2.4
Factoid: Sequence comparisons lie at the heart of all bioinformatics
![Page 21: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/21.jpg)
21 Lecture 2.4
Similarity versus Homology
• Similarity refers to the likeness or % identity between 2 sequences
• Similarity means sharing a statistically significant number of bases or amino acids
• Similarity does not imply homology
• Homology refers to shared ancestry
• Two sequences are homologous if they are derived from a common ancestral sequence
• Homology usually implies similarity
![Page 22: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/22.jpg)
22 Lecture 2.4
Similarity versus Homology
• Similarity can be quantified
• It is correct to say that two sequences are X% identical
• It is correct to say that two sequences have a similarity score of Z
• It is generally incorrect to say that two sequences are X% similar
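Percent identity is the quantity these bullets treat as safe to report. A minimal sketch of how it might be computed from two already-aligned strings follows; the choice of denominator (only columns where neither sequence has a gap) is an assumption, since other conventions exist, and the name `percent_identity` is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped columns are skipped under this convention
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))  # prints 75.0
```

Whatever convention is used should be stated alongside the number, because the same alignment gives different percentages under different denominators.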
![Page 23: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/23.jpg)
23 Lecture 2.4
Similarity versus Homology
• Homology cannot be quantified
• If two sequences have a high % identity it is OK to say they are homologous
• It is incorrect to say two sequences have a homology score of Z
• It is incorrect to say two sequences are X% homologous
![Page 24: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/24.jpg)
24 Lecture 2.4
Sequence Complexity
MCDEFGHIKLAN…. High Complexity
ACTGTCACTGAT…. Mid Complexity
NNNNTTTTTNNN…. Low Complexity
Translate those DNA sequences!!!
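Translating DNA moves the comparison from a 4-letter alphabet to a 20-letter one, which is the point of this slide. The sketch below translates a single reading frame using the standard genetic code; the helper name `translate` and the use of "X" for codons containing ambiguous bases are choices made for this example only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate one reading frame; '*' marks stop codons, 'X' unknown codons."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ACTGTCACTGAT"))  # prints TVTD
```

A real search would usually translate all six frames (three per strand); only one frame is shown here to keep the sketch short.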
![Page 25: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/25.jpg)
25 Lecture 2.4
Assessing Sequence Similarity
Two Character Strings
THESTORYOFGENESIS
THISBOOKONGENETICS

Character Comparison
THESTORYOFGENESI-S
THISBOOKONGENETICS
** * *  * **** * *

Context Comparison
THE STORY OF GENESIS
THIS BOOK ON GENETICS
![Page 26: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/26.jpg)
26 Lecture 2.4
Assessing Sequence Similarity
Rbn KETAAAKFERQHMD
Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT

Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN

Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR

Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz NRCKGTDVQA WIRGCRL

Is this alignment significant?
![Page 27: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/27.jpg)
27 Lecture 2.4
Is This Alignment Significant?
Gelsolin 89 L G N E L S Q D E S G A A A I F T V Q L 108
Annexin 82 L P S A L K S A L S G H L E T V I L G L 101
154 L E K D I I S D T S G D F R K L M V A L 173
240 L E – S I K K E V K G D L E N A F L N L 258
314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333
Consensus L x P x x x P D x S G x h x x h x V L L
![Page 28: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/28.jpg)
28 Lecture 2.4
Some Simple Rules
• If two sequences are > 100 residues and > 25% identical, they are likely related
• If two sequences are 15-25% identical they may be related, but more tests are needed
• If two sequences are < 15% identical they are probably not related
• If you need more than 1 gap for every 20 residues the alignment is suspicious
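The four rules above can be written down as a small checker. This is only a sketch of the rules as stated on the slide: the function name `relatedness_hint`, its parameters and the wording of the returned notes are invented for illustration, and the thresholds are rules of thumb rather than statistical tests.

```python
def relatedness_hint(percent_identity, aligned_length, gaps):
    """Apply the slide's rules of thumb and return a list of notes."""
    notes = []
    if percent_identity > 25 and aligned_length > 100:
        notes.append("likely related")
    elif percent_identity < 15:
        notes.append("probably not related")
    elif percent_identity <= 25:
        notes.append("may be related; more tests needed")
    else:
        # Identity above 25% but 100 residues or fewer: the rules are silent.
        notes.append("no rule applies")
    if gaps * 20 > aligned_length:
        notes.append("suspicious: more than 1 gap per 20 residues")
    return notes


if __name__ == "__main__":
    print(relatedness_hint(20, 200, 15))
```

For example, a 200-residue alignment at 20% identity with 15 gaps lands in the "more tests needed" band and is also flagged for its gap density.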
![Page 29: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/29.jpg)
29 Lecture 2.4
Doolittle’s Rules of Thumb
[Figure: Evolutionary Distance vs. Percent Sequence Identity. x-axis: Number of Residues (0 to 400); y-axis: Sequence Identity (%) (0 to 120); one region of the plot is labeled “Twilight Zone”]
![Page 30: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/30.jpg)
30 Lecture 2.4
Sequence Alignment - Methods
• Dot Plots
• Dynamic Programming
• Heuristic (Fast) Local Alignment
• Multiple Sequence Alignment
• Contig Assembly
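Of the methods listed, the dot plot is the simplest to sketch in code: put one sequence along each axis and mark every cell where the two residues are identical. The example below is a bare-bones text version with no window or threshold filtering; the function name `dot_plot` and the "*" and "." symbols are choices made here, not taken from the lecture.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per residue of seq_a, with '*' where residues match."""
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


if __name__ == "__main__":
    print("  " + "GENETICS")
    for residue, row in zip("GENESIS", dot_plot("GENESIS", "GENETICS")):
        print(residue, row)
```

Runs of "*" along a diagonal correspond to stretches where the two sequences agree without gaps, which is the visual signal a dot plot is read for.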
![Page 31: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/31.jpg)
31 Lecture 2.4
PAM Matrices
• Developed by M.O. Dayhoff (1978)
• PAM = Point Accepted Mutation
• Matrix assembled by looking at patterns of substitutions in closely related proteins
• 1 PAM corresponds to 1 amino acid change per 100 residues
• 1 PAM = 1% divergence or 1 million years in evolutionary history
![Page 32: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/32.jpg)
32 Lecture 2.4
Fast Local Alignment Methods
• Developed by Lipman & Pearson (1985/88)
• Refined by Altschul et al. (1990/97)
• Ideal for large database comparisons
• Uses heuristics & statistical simplification
• Fast N-type algorithm (similar to Dot Plot)
• Cuts sequences into short words (k-tuples)
• Uses “Hash Tables” to speed comparison
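The k-tuple and hash-table idea on this slide can be illustrated with a Python dictionary standing in for the hash table. This is a sketch of the word-lookup step only, under the assumption of exact word matches; the function names `index_ktuples` and `shared_ktuples` are hypothetical and this is not the code of FASTA or BLAST.

```python
from collections import defaultdict


def index_ktuples(sequence, k):
    """Hash table mapping each k-tuple (word) to its start positions."""
    table = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        table[sequence[i:i + k]].append(i)
    return table


def shared_ktuples(query, subject, k=2):
    """List (query_pos, subject_pos) pairs where the same k-tuple occurs."""
    table = index_ktuples(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in table.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(shared_ktuples("GENESIS", "GENETICS", k=3))  # prints [(0, 0), (1, 1)]
```

Hits that share the same offset (query position minus subject position) lie on one diagonal of the dot plot, so counting hits per offset is a cheap way to spot candidate regions before any full alignment is attempted.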
![Page 33: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/33.jpg)
33 Lecture 2.4
FASTA
• Developed in 1985 and 1988 (W. Pearson)
• Looks for clusters of nearby or locally dense “identical” k-tuples
• init1 score = score for first set of k-tuples
• initn score = score for gapped k-tuples
• opt score = optimized alignment score
• Z-score = number of S.D. above random
• expect = expected # of random matches
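The Z-score bullet is a definition that can be stated directly in code. The sketch below assumes a list of scores obtained against random or unrelated sequences is already available; how FASTA itself estimates that background is not reproduced here, and the name `z_score` is illustrative.

```python
from statistics import mean, stdev


def z_score(observed, random_scores):
    """Number of standard deviations by which `observed` exceeds the random mean."""
    spread = stdev(random_scores)  # sample standard deviation; needs >= 2 scores
    if spread == 0:
        raise ValueError("random scores have zero spread")
    return (observed - mean(random_scores)) / spread


if __name__ == "__main__":
    background = [10, 12, 8, 10, 10]
    print(round(z_score(30, background), 2))
```

A large Z-score only says the observed score is unusual relative to the chosen background, so the result is as good as the random model behind `random_scores`.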
![Page 34: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/34.jpg)
34 Lecture 2.4
FASTA
gi|135775|sp|P08628|THIO_RABIT THIOREDOXIN (104 aa)
initn: 641 init1: 641 opt: 642 Z-score: 806.4 expect() 3.2e-38
Smith-Waterman score: 642; 86.538% identity in 104 aa overlap (2-105:1-104)

gi|135 2- 105: --------------------------------------------------------------------:

               10        20        30        40        50        60        70        80
thiore MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYSNVIFLEVDVDDCQDVASECEVKCTPTFQFF
        :::::::.::::.::.:::::::::::::::::::::.::::.::::..::.:.:::::::.:.:.:::::: ::::::
gi|135  VKQIESKSAFQEVLDSAGDKLVVVDFSATWCGPCKMIKPFFHALSEKFNNVVFIEVDVDDCKDIAAECEVKCMPTFQFF
                10        20        30        40        50        60        70

               90       100
thiore KKGQKVGEFSGANKEKLEATINELV
       ::::::::::::::::::::::::.
gi|135 KKGQKVGEFSGANKEKLEATINELL
      80        90       100
![Page 35: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/35.jpg)
35 Lecture 2.4
Multiple Sequence Alignment
Multiple alignment of Calcitonins
![Page 36: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/36.jpg)
36 Lecture 2.4
Multiple Alignment Algorithm
• Take all “n” sequences and perform all possible pairwise (n(n-1)/2) alignments
• Identify highest scoring pair, perform an alignment & create a consensus sequence
• Select next most similar sequence and align it to the initial consensus, regenerate a second consensus
• Repeat step 3 until finished
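The ordering logic of these four steps can be sketched without writing a full aligner. In the example below, `difflib.SequenceMatcher.ratio()` from the Python standard library is swapped in as a stand-in similarity score (it is not an alignment score), the consensus-building step is left out, and the function names `pairwise_scores` and `progressive_order` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_scores(sequences):
    """Score all n(n-1)/2 pairs; keys are (i, j) index pairs with i < j."""
    return {
        (i, j): SequenceMatcher(
            None, sequences[i], sequences[j], autojunk=False
        ).ratio()
        for i, j in combinations(range(len(sequences)), 2)
    }


def progressive_order(sequences):
    """Indices in the order they would be added: best pair first, then each
    remaining sequence by its best similarity to any sequence already chosen."""
    scores = pairwise_scores(sequences)

    def score(i, j):
        return scores[(min(i, j), max(i, j))]

    first, second = max(scores, key=scores.get)
    order = [first, second]
    remaining = set(range(len(sequences))) - set(order)
    while remaining:
        nxt = max(remaining, key=lambda r: max(score(r, c) for c in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    seqs = ["GGGGGGGG", "ACGTACGT", "ACGTTTTT", "ACGTACGA"]
    print(progressive_order(seqs))  # prints [1, 3, 2, 0]
```

With four sequences there are 4 x 3 / 2 = 6 pairs to score; the two most similar sequences are joined first and the least similar one is added last, which mirrors steps 2 to 4 above.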
![Page 37: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/37.jpg)
37 Lecture 2.4
Multiple Sequence Alignment
• Developed and refined by many (Doolittle, Barton, Corpet) through the 1980’s
• Used extensively for extracting hidden phylogenetic relationships and identifying sequence families
• Powerful tool for extracting new sequence motifs and signature sequences
![Page 38: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/38.jpg)
38 Lecture 2.4
Multiple Alignment
• Most commercial vendors offer good multiple alignment programs including:
• GCG (Accelrys)
• PepTool/GeneTool (BioTools Inc.)
• LaserGene (DNAStar)
• Popular web servers include T-COFFEE, MULTALIN and CLUSTALW
• Popular freeware includes PHYLIP & PAUP
![Page 39: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/39.jpg)
39 Lecture 2.4
Multi-Align Websites
• Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
• MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html
• T-Coffee http://www.ch.embnet.org/software/TCoffee.html
• MULTALIN http://www.toulouse.inra.fr/multalin.html
• CLUSTALW http://www.ebi.ac.uk/clustalw/
![Page 40: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/40.jpg)
40 Lecture 2.4
Multi-alignment & Contig Assembly
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
TAGCTACGCATCGTCTGATGGCAATGCTACGGAA..
ATCGAT
GCGTAG
CTAGCAGACTACCGTT
GTTACGATGCCTT
TAGCTACGCATCGT
![Page 41: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/41.jpg)
41 Lecture 2.4
Contig Assembly
• Read, edit & trim DNA chromatograms
• Remove overlaps & ambiguous calls
• Read in all sequence files (10-10,000)
• Reverse complement all sequences (doubles # of sequences to align)
• Remove vector sequences (vector trim)
• Remove regions of low complexity
• Perform multiple sequence alignment
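The reverse-complement step is the easiest of these to show in code. The sketch below handles only A, C, G, T and N (an assumption; other IUPAC codes pass through unchanged), and the helper names `reverse_complement` and `with_reverse_complements` are made up for this example.

```python
# Translation table for complementing bases; case is preserved.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def with_reverse_complements(reads):
    """Return each read followed by its reverse complement."""
    both = []
    for read in reads:
        both.append(read)
        both.append(reverse_complement(read))
    return both


if __name__ == "__main__":
    print(reverse_complement("GCGTAG"))  # prints CTACGC
```

Because the strand a read came from is not known in advance, every read is offered to the aligner in both orientations, which is why the slide notes that this doubles the number of sequences.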
![Page 42: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/42.jpg)
42 Lecture 2.4
Chromatogram Editing
![Page 43: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/43.jpg)
43 Lecture 2.4
Sequence Loading
![Page 44: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/44.jpg)
44 Lecture 2.4
Sequence Alignment
![Page 45: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/45.jpg)
45 Lecture 2.4
Contig Alignment - Process
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
ATCGATGCGTAGCTAGCAGACTACCGTT
GTTACGATGCCTT
CGATGCGTAGCA
ATCGATGCGTAGCTAGCAGACTACCGTTGTTACGATGCCTTTGCTACGCATCG CGATGCGTAGCA
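The merging of overlapping reads into one contig can be sketched with a greedy suffix-prefix overlap rule. This is a deliberately simplified illustration, assuming error-free reads that all come from the same strand; it is not how Phrap or any other named assembler works, and the function names `overlap` and `greedy_assemble` and the three example reads are invented here (the reads were cut from the first sequence shown on the slide).

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    reads = ["GTAGCAGACTACCGTT", "ATCGATGCGTAGCAG", "ACCGTTACGATGCCTT"]
    print(greedy_assemble(reads))
```

Real reads contain base-calling errors, repeats and both strand orientations, so production assemblers score approximate overlaps and use quality values rather than the exact string matching used in this sketch.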
![Page 46: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/46.jpg)
46 Lecture 2.4
Sequence Assembly Programs
• Phred - base calling program that does detailed statistical analysis (UNIX) http://www.phrap.org/
• Phrap - sequence assembly program (UNIX) http://www.phrap.org/
• TIGR Assembler - microbial genomes (UNIX) http://www.tigr.org/softlab/assembler/
• The Staden Package (UNIX) http://www.mrc-lmb.cam.ac.uk/pubseq/
• GeneTool/ChromaTool/Sequencher (PC/Mac)
![Page 47: Sequencing & Sequence Alignment](https://reader035.fdocuments.in/reader035/viewer/2022062315/5681584e550346895dc5a846/html5/thumbnails/47.jpg)
47 Lecture 2.4
Conclusions
• Sequence alignments and database searching are key to all of bioinformatics
• There are four different methods for doing sequence comparisons: 1) Dot Plots; 2) Dynamic Programming; 3) Fast Alignment; and 4) Multiple Alignment
• Understanding the significance of alignments requires an understanding of statistics and distributions