Using BLAST for Genomic Sequence Annotation
description
Transcript of Using BLAST for Genomic Sequence Annotation
![Page 1: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/1.jpg)
Using BLAST for Genomic Sequence Annotation
Jeremy BuhlerFor HHMI / BIO4342
Tutorial Workshop
![Page 2: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/2.jpg)
2
Overview
What is comparative annotation?
How to measure similarity between biosequences
How to decide whether two sequences are “similar enough”
![Page 3: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/3.jpg)
3
Comparative Annotation
Identify functional elements in DNA sequence
Uses comparison to databases of sequences with known function
BiosequenceDatabase
NewDNAsequence
human CFTR gene
mouse CFTR gene
dog CFTR geneProbably CFTR
![Page 4: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/4.jpg)
4
Why Does It Work?
Functional sequences are under
negative selection fewer mutations
More conservation greater similarity
BLAST software recognizes
similarity.
![Page 5: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/5.jpg)
5
Caveats w/Similarity Evidence Similarity without conservation
random chance
Conservation without selective pressure slow mutation recent divergence
Similar selective pressures, but seqs have two distinct functions
![Page 6: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/6.jpg)
6
Overview
What is comparative annotation?
How to measure similarity between biosequences
How to decide whether two sequences are “similar enough”
![Page 7: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/7.jpg)
7
What is Similarity?
How to measure similarity of two DNA seqs?
Mutations happen…
Measure should reflect desired evolutionary inference
![Page 8: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/8.jpg)
8
Mutational Model Sequences change by series of
events of (only) three types:
Substitution of one base
Insertion of one base
Deletion of one base
acg atg
acg acag
acg ag
![Page 9: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/9.jpg)
9
Sequence History (1/2)
Suppose seqs S, T diverged from a common ancestral sequence U…
![Page 10: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/10.jpg)
10
Sequence History (2/2)
Draw lines between bases of S and T that come from same base of U.
This is a “tree alignment” of S,T,U.
![Page 11: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/11.jpg)
11
Sequence Alignment
Now elide the ancestor…
Result is correspondence between bases of S, T – a sequence alignment
![Page 12: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/12.jpg)
12
Similarity Score of Alignment Fewer mutations more
conservation
Give bonus for matches, penalties for substitutions and gaps
![Page 13: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/13.jpg)
13
One Small Problem…
Do you own a time machine?
If not, how do you know ancestral sequence U? history of mutation?
Hence, how to get correct alignment?
![Page 14: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/14.jpg)
14
What We Do In Practice
Guess an alignment that minimizes # of hypothesized mutations
(more precisely, maximizes score)
![Page 15: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/15.jpg)
15
Overview
What is comparative annotation?
How to measure similarity between biosequences
How to decide whether two sequences are “similar enough”
![Page 16: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/16.jpg)
16
Deciding What to Report
Any two sequences can be aligned with some score.
Higher scores are better…
When is score high enough to be evidence of conservation?
![Page 17: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/17.jpg)
17
Idea: Test a Null Hypothesis Suppose two DNA seqs S, T are
completely unrelated.
What is probability that best alignment between S, T has score at least ?
If score(S,T) is unlikely to occur by chance, then report (S,T)
![Page 18: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/18.jpg)
18
Null Model Assumptions
Bases of seqs S, T generated independently at random
acagt
tccga
random process S
random process T
![Page 19: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/19.jpg)
19
P-values
Given random seqs S’, T’ with same base distributions as S, T
Karlin-Altschul theory tells us probability that S’, T’ align with score at least
If p() is small, report alignment of S,T
![Page 20: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/20.jpg)
20
E-values For computational reasons, BLAST
reports not p() but rather E()
E() = expected # times alignment with score at least happens by chance in current search
If E() < 1, then score is interesting
![Page 21: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/21.jpg)
21
Caveats about E-values
Model from which E-values are com-puted is too simple for real bioseqs
Large margin of safety is wise
Be very skeptical of “matches” with E > 10-5
![Page 22: Using BLAST for Genomic Sequence Annotation](https://reader034.fdocuments.in/reader034/viewer/2022051116/568150bb550346895dbed68e/html5/thumbnails/22.jpg)
22
Summary Comparative annotation with BLAST uses
similarity as evidence for conserved function.
Similarity score based on hypothesized evolutionary relations among sequences.
E-values indicate whether scores are high enough to be real biological conservation.