Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence...
![Page 1: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/1.jpg)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure
![Page 2: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/2.jpg)
Experimental origins of sequence data
Each color is one lane of an electrophoresis gel.
The Sanger dideoxynucleotide method
![Page 3: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/3.jpg)
New technology: Pyrosequencing
• http://www.youtube.com/watch?v=nFfgWGFe0aA&NR=1
• ...or search YouTube for “pyrosequencing”
• Whole genome sequencing in < 1 day!!
![Page 4: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/4.jpg)
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
Plant.
Bug.
Aligning two sequences tells us how they are related.
An alignment is a one-to-one association, or a set of one-to-one associations. Aligned sequences are assumed to be homologous (having a common ancestor). Furthermore, aligned positions within the sequences are assumed to have a common ancestor position.
![Page 5: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/5.jpg)
Positions that align in sequence usually align in space
[Figure: a Venn diagram of positions that have a common ancestor and positions that superimpose in space. Example sequences: TGCTA, TGCAA, TGCTA.]
![Page 6: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/6.jpg)
Simple alignment
• Simple similarity score:
  – Identity match = 1 point
  – Mismatch = 0 points
  – Gap = -1 point
• Optimal alignment = the highest-scoring alignment given the similarity score.
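The simple score above can be turned into an optimal alignment with dynamic programming. The following is a minimal sketch, assuming a global (end-to-end) alignment in the style of Needleman-Wunsch; the function name `needleman_wunsch` and its parameter names are illustrative, not something the slides define.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment.

    Scoring follows the simple similarity score: match = 1, mismatch = 0,
    gap = -1. Global alignment is an assumption of this sketch.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps: the first row and column cost one gap per position.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    result = needleman_wunsch("AAAGAGATTCTGCTAGCGGTCGG",
                              "AGAGATGCTGCAGCGAGTCGGCC")
    print(result[0])
    print(result[1])
    print(result[2])
```

Several alignments can share the optimal score; the traceback here returns only one of them.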
![Page 7: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/7.jpg)
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
Building an alignment starts with a scoring matrix. In its simplest form, a dot plot.
Everything aligned to everything.
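A dot plot can be sketched in a few lines: every position of one sequence is compared with every position of the other, and identities are marked. The function names `dot_plot` and `render` are illustrative choices for this sketch.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix with 1 where seq_a[i] == seq_b[j], else 0."""
    return [[1 if x == y else 0 for y in seq_b] for x in seq_a]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for letter, row in zip(seq_a, matrix):
        lines.append(letter + " " + "".join("*" if v else "." for v in row))
    return "\n".join(lines)


if __name__ == "__main__":
    a = "AAAGAGATTCTGCTAGCGGTCGG"
    b = "AGAGATGCTGCAGCGAGTCGGCC"
    print(render(dot_plot(a, b), a, b))
```

Runs of stars along a diagonal are stretches where the two sequences match without gaps.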
![Page 8: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/8.jpg)
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
An alignment is a path through the scoring matrix, always proceeding to the right and down. (No non-sequential alignments are allowed.)
![Page 9: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/9.jpg)
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
Unbroken diagonals represent “blocks” of sequence without indels.
![Page 10: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/10.jpg)
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
[Figure: alignment path through the matrix, with blocks and indels marked: an insertion of A, a deletion of T, and a mutation T->G.]
The path records, and scores, all mutational events, including insertions, deletions, and mutations.
![Page 11: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/11.jpg)
BLOSUM62: protein substitution matrix
![Page 12: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/12.jpg)
PAM250
![Page 13: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/13.jpg)
Protein versus DNA alignments
• Protein alphabet = 20, DNA alphabet = 4.
  – Protein alignment is more informative.
  – Less chance of homoplasy with proteins.
  – Homology is detectable at greater edit distance.
• Better Gold Standard (G.S.) alignments are available for proteins.
  – Better statistics from G.S. alignments.
• On the other hand, DNA alignments are more sensitive to short evolutionary distances.
Are protein alignments better?
![Page 14: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/14.jpg)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure
![Page 15: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/15.jpg)
Database searching
Why do a database search?
Mol. Bio: Determination of gene function. Primer design.
Pathology, epidemiology, ecology: Determination of species, strain, lineage, phylogeny.
Biophysics: Prediction of RNA or protein structure, effect of mutation.
[Figure: one sequence searched against GenBank, PIR, Swissprot, GenEMBL, DDBJ returns lots of sequences.]
![Page 16: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/16.jpg)
Searching millions of sequences
Given a protein or DNA sequence, we want to find all of the sequences in GenBank (over 17 million sequences!!) that have a good alignment score.
Each alignment score should be the optimal score (or a close approximation).
How do we do it?
![Page 17: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/17.jpg)
Fast Database Searching
BLAST (S. Altschul et al.)
First make a set of lookup tables for all 3-letter (protein) or 11-letter (DNA) matches.
Make another lookup table: the locations of all 3-letter words in the database.
Start with a match, extend to the left and right until the score no longer increases.
Very fast. Selective, but not as sensitive as slower search methods (SSEARCH). Reliable statistics. Heuristic, not optimal.
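The lookup-and-extend idea can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it indexes only exact k-letter words (no neighborhood words), extends without gaps, and stops at the first letter that fails to raise the score. Real BLAST tolerates a limited score drop during extension. The names `build_word_index`, `extend_seed`, and `find_hits` are hypothetical.

```python
from collections import defaultdict


def build_word_index(target, k):
    """Lookup table: every k-letter word in the target -> list of positions."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def extend_seed(query, target, q, t, k):
    """Extend an exact word match left and right while letters keep matching.

    Returns (query_start, target_start, length) of the ungapped segment.
    """
    left = 0
    while (q - left - 1 >= 0 and t - left - 1 >= 0
           and query[q - left - 1] == target[t - left - 1]):
        left += 1
    right = 0
    while (q + k + right < len(query) and t + k + right < len(target)
           and query[q + k + right] == target[t + k + right]):
        right += 1
    return q - left, t - left, left + k + right


def find_hits(query, target, k=3):
    """Find seeds via the lookup table, extend each, and sort longest first."""
    index = build_word_index(target, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.add(extend_seed(query, target, q, t, k))
    return sorted(hits, key=lambda hit: -hit[2])


if __name__ == "__main__":
    for hit in find_hits("GGCTGCAGCTT", "AACTGCAGCAA"):
        print(hit)
```

Building the index costs one pass over the database, after which each query word is a dictionary lookup rather than a scan. That is where the speed comes from.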
![Page 18: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/18.jpg)
BLAST, precalculations
[Figure: the 3-tuple PGQ is compared with all 8000 possible 3-tuples (PGQ, PGR, PGS, ..., PGT, PGV, PGW, PGY, PAQ, PCQ, PDQ, PEQ, PFQ, ...); the 50 high-scoring 3-tuples are kept.]
Each 3-tuple is scored against all 8000 possible 3-tuples using BLOSUM. The top-scoring 50 are kept as that 3-tuple’s “neighborhood words”.
![Page 19: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/19.jpg)
BLAST
[Figure: dot matrix of query sequence versus target sequence, showing a 3-tuple, the neighborhood words for that 3-tuple, identity matches (seeds), and HSPs.]
For every 3-residue window, we get the set of 50 nearest neighbors. Use each word to get identity matches (seeds). Then extend the seed alignments as long as the score increases.
![Page 20: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/20.jpg)
BLAST
[Figure: HSPs chained into an alignment.]
The best extended seeds are called HSPs (high-scoring segment pairs). The top-scoring HSP is picked first, then the second (as long as it falls "northwest" or "southeast" of the first), and so on.
![Page 21: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/21.jpg)
Other forms of BLAST
BLAST              query              database
blastn             nucleotide         nucleotide
blastp             protein            protein
tblastn            protein            translated DNA
blastx             translated DNA     protein
tblastx            translated DNA     translated DNA
psi-blast          protein, profile   protein
phi-blast          pattern            protein
transitive blast*  any                any
*Not really a blast. Just a way of using blast.
![Page 22: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/22.jpg)
Psi-BLAST: Blast with profiles
Psi-BLAST searches the database iteratively.
(Cycle 1) Normal BLAST (with gaps)
(Cycle 2) (a) Construct a profile from the results of Cycle 1.
(b) Search the database using the profile.
(Cycle 3) (a) Construct a profile from the results of Cycle 2.
(b) Search the database using the profile.
And so on... (the user sets the number of cycles)
Psi-BLAST is much more sensitive than BLAST.
It is also more vulnerable to low-complexity sequence.
![Page 23: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/23.jpg)
PHI-BLAST: Pattern-Hit Initiated BLAST
![Page 24: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/24.jpg)
DNA or Protein search?
• Advantages of searching DNA databases
Larger database. Does not assume a reading frame. Can find similarity in non-coding regions (introns, promoter regions). Can find frameshift mutations. Can find pseudogenes.
• Disadvantages
Slower. Not as sensitive. Ignores selective pressure at the protein level.
• Advantages of searching protein sequences
Faster. More sensitive. More biologically relevant.
• Disadvantages
Not applicable to non-coding DNA (promoters, introns, etc.)
![Page 25: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/25.jpg)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure
![Page 26: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/26.jpg)
How significant is that?
Please give me a number for...
...how likely the data would not have been the result of chance,...
...as opposed to...
...a specific inference. Thanks.
![Page 27: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/27.jpg)
Dayhoff's randomization experiment
Aligned scrambled Protein A versus scrambled Protein B
100 times (re-scrambling each time).
NOTE: scrambling does not change the AA composition!
Results: A normal distribution.
The significance of a score is measured as the probability of getting this score in a random alignment.
[Figure: histogram of frequency versus score.]
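The scrambling experiment can be sketched in code. This is a minimal sketch under stated assumptions: the alignment score is the simple global score (match = 1, mismatch = 0, gap = -1) rather than whatever scoring the original experiment used, and significance is summarized as a z-score against the scrambled scores. All function names are illustrative.

```python
import random
import statistics


def alignment_score(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score, keeping only one matrix row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Scramble the letters; the composition stays exactly the same."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def randomization_z_score(a, b, trials=100, seed=0):
    """Score the real pair, then score `trials` re-scrambled pairs."""
    rng = random.Random(seed)
    real = alignment_score(a, b)
    null = [alignment_score(shuffled(a, rng), shuffled(b, rng))
            for _ in range(trials)]
    mean = statistics.mean(null)
    sd = statistics.stdev(null)
    z = (real - mean) / sd if sd > 0 else float("inf")
    return real, mean, sd, z


if __name__ == "__main__":
    print(randomization_z_score("AAAGAGATTCTGCTAGCGGTCGG",
                                "AGAGATGCTGCAGCGAGTCGGCC"))
```

A z-score assumes the scrambled scores are normally distributed, which is exactly the assumption this experiment makes.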
![Page 28: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/28.jpg)
Lipman's randomization experiment
Aligned Protein A to 100 natural sequences, not scrambled.
Results: A wider normal distribution (std dev ~3 times larger).
WHY? Because natural sequences are different from random.
Even unrelated sequences have similar local patterns, and uneven amino acid composition.
Lipman got a similar result if he randomized the sequences by words instead of letters.
Was the significance over-estimated using Dayhoff's method?
[Figure: histogram of frequency versus score.]
![Page 29: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/29.jpg)
P(S > x)
E(M) gives us the expected length of the longest run of matches. But what we really want is the answer to this question:
How good is the score x? (i.e., how significant)
So, we need to model the whole distribution of chance scores, then ask how likely it is that my score or greater comes from that model.
[Figure: histogram of frequency versus score.]
![Page 30: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/30.jpg)
A normal distribution
Suppose you had a Gaussian distribution “dart-board”. You throw 1000 darts randomly. Score each dart according to the number on the X-axis where it lands. What is the probability distribution of scores? Answer: The same Gaussian distribution! (duh)
![Page 31: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/31.jpg)
Extreme values from a normal distribution
What if we throw 10 darts at a time and keep only the highest-scoring dart (extreme value)? What is the distribution of the extreme values?
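The question can be explored with a quick simulation. This is a sketch, assuming a standard normal dart-board (mean 0, standard deviation 1) and a fixed random seed; `sample_extremes` is an illustrative name.

```python
import random
import statistics


def sample_extremes(n_rounds=1000, darts_per_round=10, seed=0):
    """Return (single-dart scores, best-of-N scores) from a normal dart-board."""
    rng = random.Random(seed)
    # One dart per round: the plain Gaussian.
    singles = [rng.gauss(0.0, 1.0) for _ in range(n_rounds)]
    # Ten darts per round, keeping only the highest: the extreme values.
    extremes = [max(rng.gauss(0.0, 1.0) for _ in range(darts_per_round))
                for _ in range(n_rounds)]
    return singles, extremes


if __name__ == "__main__":
    singles, extremes = sample_extremes()
    print("single darts: mean", statistics.mean(singles),
          "stdev", statistics.stdev(singles))
    print("best of 10:   mean", statistics.mean(extremes),
          "stdev", statistics.stdev(extremes))
```

The best-of-10 scores are shifted to the right and are no longer symmetric, so a normal curve fitted to them misjudges the upper tail.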
![Page 32: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/32.jpg)
The Extreme Value Distribution
Normal distributions (Dayhoff, Lipman) overestimate significance when the scores are extreme values. The EVD is the correct null model.
![Page 33: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/33.jpg)
Fitting the EVD to random alignments
• Generate a large number of known false alignment scores S (all alignments with the same two lengths m and n).
Estimated P (integral of the EVD): $P(S \ge x) \approx Kmn\,e^{-\lambda x}$
Taking the log, $\log P(S \ge x) = \log(Kmn) - \lambda x$
• Plot log P(S≥x) versus x, fit to a line!
[Figure: scatter plot of log P(S≥x) versus x, with a fitted line.]
The slope is −λ, the intercept is log(Kmn). Now we can calculate P for any score x.
where K=constant, m=size of database, n=length of sequence, λ=constant
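The line fit can be sketched with ordinary least squares. This sketch assumes natural logarithms (so that the slope is −λ) and fits only the tail where the empirical P is at most `max_p`, a hypothetical cutoff chosen here because the exponential form is an approximation for the upper tail. All function names are illustrative.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys): returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def empirical_tail(scores):
    """Return (x, log P(S >= x)) for each distinct score x."""
    ordered = sorted(scores)
    n = len(ordered)
    points = []
    for rank, x in enumerate(ordered):
        if rank > 0 and x == ordered[rank - 1]:
            continue  # count each distinct score once
        points.append((x, math.log((n - rank) / n)))
    return points


def fit_evd(scores, max_p=0.5):
    """Fit log P(S >= x) = log(Kmn) - lambda * x; returns (lambda, log(Kmn))."""
    tail = [(x, lp) for x, lp in empirical_tail(scores)
            if lp <= math.log(max_p)]
    slope, intercept = fit_line([x for x, _ in tail], [lp for _, lp in tail])
    return -slope, intercept


def p_value(x, lam, log_kmn):
    """P(S >= x) from the fitted line, capped at 1."""
    return min(1.0, math.exp(log_kmn - lam * x))
```

With λ and log(Kmn) in hand, `p_value` gives P for any score x, including scores larger than any seen in the false alignments.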
![Page 34: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/34.jpg)
Pop-quiz
You did a BLAST search using a sequence that has absolutely no homologs in the database. Absolutely none.
The BLAST search gave you false “hits” with the top e-values ranging from 0 to 20. You look at them and you notice a pattern in the e-values.
How many of your hits have e-value ≤ 10?
![Page 35: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/35.jpg)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure
![Page 36: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/36.jpg)
Evolutionary time
[Figure: the same tree of taxa A, B, C, D drawn three ways. Cladogram: branch lengths have no meaning. Phylogram: branch lengths show genetic change. Ultrametric tree: branch lengths show time.]
(D:5,(A:1,(C:1,B:6):1):3)
Parenthesis notation can have both labels and distances.
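The parenthesis notation is easy to read by machine. The following is a minimal sketch of a recursive parser for strings of this shape; it handles only names, parentheses, commas, and `:distance` suffixes, and the function names `parse_tree` and `leaf_distances` are illustrative.

```python
def parse_tree(text):
    """Parse parenthesis notation into nested (name, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError("unexpected character at %d" % pos)
        # Optional label, then optional ":distance".
        start = pos
        while pos < len(text) and text[pos] not in ":,()":
            pos += 1
        name = text[start:pos].strip()
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return parse_node()


def leaf_distances(node, so_far=0.0):
    """Sum the branch lengths from the root down to each labeled leaf."""
    name, length, children = node
    total = so_far + length
    if not children:
        return {name: total}
    distances = {}
    for child in children:
        distances.update(leaf_distances(child, total))
    return distances


if __name__ == "__main__":
    tree = parse_tree("(D:5,(A:1,(C:1,B:6):1):3)")
    print(leaf_distances(tree))
```

In a phylogram these root-to-leaf sums differ between taxa; in an ultrametric tree they would all be equal.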
![Page 37: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/37.jpg)
A multiple sequence alignment is made using many pairwise sequence alignments
Multiple Sequence Alignment
![Page 38: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/38.jpg)
Multiple sequence alignment
1. align all pairs
2. pairwise align two most similar first
3. align next most similar
4. repeat until all sequences are aligned
A G H I . W W P F
A G H I I F W P Y
AWPY
S(P,[W,F]) = (1/2)(S(P,W) + S(P,F))
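The column score above generalizes to any number of already-aligned sequences by averaging. Here is a small sketch; the substitution values in `TOY_SCORES` are made up for illustration and are not taken from BLOSUM or any real matrix, and the sketch assumes the column contains no gap characters.

```python
# Hypothetical pair scores, for illustration only (not real matrix values).
TOY_SCORES = {("P", "W"): -4, ("P", "F"): -2, ("P", "P"): 7}


def toy_score(x, y):
    """Symmetric lookup in the toy table; unknown pairs score 0."""
    return TOY_SCORES.get((x, y), TOY_SCORES.get((y, x), 0))


def column_score(residue, column, pair_score):
    """Average score of one residue against every residue in an aligned column."""
    return sum(pair_score(residue, other) for other in column) / len(column)


if __name__ == "__main__":
    # S(P,[W,F]) = (1/2)(S(P,W) + S(P,F))
    print(column_score("P", ["W", "F"], toy_score))
```

Passing the scoring function as an argument keeps the averaging rule separate from the choice of substitution matrix.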
![Page 39: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/39.jpg)
Construct a distance-based tree
[Table: pairwise distances among taxa A, B, C, D, E, F. Values as extracted: 97, 81, 77, 82, 59, 32, 80, 55, 31, 90, 65, 40, 61, 42, 33.]
Draw tree here.
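One common way to build a tree from a distance matrix is UPGMA: repeatedly join the two closest clusters and average their distances to everything else. The slide does not say which method is intended, so UPGMA is an assumption here, and the distances in the usage example are hypothetical, not the values from the slide. Branch lengths are omitted to keep the sketch short.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa by UPGMA; returns the topology in parenthesis notation.

    `distances` maps a pair of names, e.g. ("A", "B"), to their distance.
    """
    size = {name: 1 for name in names}
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    active = list(names)
    while len(active) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merged = "(" + a + "," + b + ")"
        size[merged] = size[a] + size[b]
        # Distance to the merged cluster is the size-weighted average.
        for other in active:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / size[merged]
        active = [x for x in active if x not in (a, b)] + [merged]
    return active[0]


if __name__ == "__main__":
    # Hypothetical distances for four taxa.
    example = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
               ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10}
    print(upgma(["A", "B", "C", "D"], example))
```

UPGMA assumes a constant rate of change along all branches; neighbor joining relaxes that assumption.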
![Page 40: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/40.jpg)
CLUSTALW
• Start with unrooted tree, using neighbor joining.
• Choose root to get guide tree.
• Progressive alignment
  – matches are scored using sequence weights
  – gaps are position dependent
    • GOP (gap opening penalty) lower for polar residues
    • GOP zero where there is already a gap
JD Thompson, DG Higgins, TJ Gibson - Nucleic acids research, 1994
![Page 41: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/41.jpg)
CLUSTALW Position specific gap penalty
![Page 42: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/42.jpg)
Parsimony: Finding the tree with the minimum number of mutations
Given a tree and a set of taxa, one letter each:
(1) Choose optional characters for each ancestor.
(2) Select the root character that minimizes the number of mutations by selecting each and propagating it through the tree.
[Figure: two trees over leaves T, C, T, C, C, with ancestors labeled by their optional characters (T/C or C). One tree needs a minimum of 2 mutations, the other a minimum of 1 mutation.]
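Counting the minimum number of mutations on a fixed tree can be sketched with Fitch's algorithm, which computes the optional characters at each ancestor from the leaves upward. The sketch assumes binary trees written as nested pairs; the two topologies in the usage example are hypothetical and are not read off the slide's figure.

```python
def fitch(tree, leaf_chars):
    """Return (optional characters at this node, minimum mutations below it).

    `tree` is either a leaf name (str) or a pair (left, right).
    `leaf_chars` maps each leaf name to its one-letter character.
    """
    if isinstance(tree, str):
        return {leaf_chars[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_chars)
    right_set, right_cost = fitch(right, leaf_chars)
    common = left_set & right_set
    if common:
        # The children agree on at least one character: no new mutation.
        return common, left_cost + right_cost
    # No shared character: keep both options and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1


if __name__ == "__main__":
    chars = {"a": "T", "b": "C", "c": "T", "d": "C", "e": "C"}
    tree_1 = ((("a", "b"), "c"), ("d", "e"))
    tree_2 = (("a", "c"), (("b", "d"), "e"))
    print(fitch(tree_1, chars))
    print(fitch(tree_2, chars))
```

Parsimony then prefers the tree with the smaller count. Searching over all possible trees is the expensive part, which this sketch does not attempt.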
![Page 43: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/43.jpg)
Columns in a MSA have a common evolutionary history
By aligning the sequences, we assert that the aligned residues in each column had a common ancestor.
![Page 44: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/44.jpg)
Orthologs/paralogs
Orthologs: homologs originating from a speciation event.
Paralogs: homologs originating from a gene duplication event.
[Figure: a species tree (clam, duck, crab, fish), a sequence tree (clam A, duck A, crab B, fish A, duck B, fish B), and the reconciled trees, with duplication, speciation, and gene loss events marked.]
![Page 45: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/45.jpg)
How do I know it’s a paralog?
• If it’s a paralog, then at some point in evolutionary history, a species existed with two identical genes in it.
• One may have been lost since then. (Descendants are still paralogs!)
• Paralogs can be from different species.
• Paralogous genes have more than the expected sequence divergence.
  – Because they are more likely to have different functions
  – Because they diverged earlier than the speciation event.
• Without species information or functional information, it’s impossible to tell.
![Page 46: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/46.jpg)
Life is not strictly a tree -- horizontal gene transfer
BF Smets, T Barkay (2005) “Horizontal gene transfer: perspectives at a crossroads of scientific disciplines” Nature Reviews Microbiology.
Discrete Steps Needed for Stability of Gene Transfer
Stably incorporating horizontally transferred genes into a recipient genome involves five distinct steps (Fig. 1).
1. First, a particular segment of DNA or RNA is prepared for transfer from the donor strain through one of several processes, including excision and circularization of conjugative transposons, initiation of conjugal plasmid transfer by synthesis of a mating pair-formation protein complex, or packaging of nucleic acids into phage virions.
2. Next, the segment is transferred either by conjugation, which requires contact between the donor and recipient cells, or by transformation and transduction without direct contact.
3. During the third step, genetic material enters the recipient cell, where cell exclusion may abort the transfer.
4. Otherwise, during the fourth step, the incoming gene is integrated into the recipient genome by legitimate or site-specific recombination or by plasmid circularization and complementary strand synthesis. Barriers to transfer during this step come from restriction modification systems, failure to integrate and replicate within the new host genome, and incompatibility with resident plasmids.
5. In the final step, transferred genes are replicated as part of the recipient genome and transmitted to daughter cells in stable fashion over successive generations.
Researchers from different disciplines tend to focus on specific stages within this five-step sequence. Thus, evolutionary biologists who examine microbial genomes for evidence of past transfers tend to look at HGTs from the perspective of step five. Molecular biologists are more likely to examine the details of the transfer events, while microbial ecologists look more broadly when they describe the magnitude and diversity of the mobile gene pool, sometimes called the mobilome.
![Page 47: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/47.jpg)
“Bootstrap analysis”
• A method to validate a phylogenetic tree, branchpoint by branchpoint.
• Requires a means to generate independent trees. (For example trees generated from different regions of the mitochondrial genome.)
• Choose the representative tree as the ‘parent’. Calculate the following:
For each branchpoint in the parent tree,
  For each tree, ask: Is there a branchpoint having the same subclade contents (i.e. same taxa, any order)?
Bootstrap value = number of trees having the branchpoint / total trees.
![Page 48: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/48.jpg)
Comparing branchpoints
[Figure: eight trees over taxa A, B, C, D, E.]
P((A,B),C) = 5/8
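The branchpoint comparison can be sketched by representing each branchpoint as the set of taxa beneath it, so that order within a subclade does not matter. Trees are written as nested tuples of leaf names; the example trees are hypothetical and are not the eight trees from the figure. The names `clades` and `support` are illustrative.

```python
def clades(tree):
    """Return (leaves under this node, set of all clades in this subtree).

    A clade is a frozenset of leaf names, so (A,B) and (B,A) compare equal.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def support(parent, trees):
    """Bootstrap value per parent branchpoint: trees having it / total trees."""
    _, parent_clades = clades(parent)
    per_tree = [clades(tree)[1] for tree in trees]
    return {clade: sum(clade in found for found in per_tree) / len(trees)
            for clade in parent_clades}


if __name__ == "__main__":
    parent = ((("A", "B"), "C"), ("D", "E"))
    trees = [
        parent,
        ((("B", "A"), "C"), ("E", "D")),
        ((("A", "C"), "B"), ("D", "E")),
        (("A", "B"), ("C", ("D", "E"))),
    ]
    for clade, value in support(parent, trees).items():
        print(sorted(clade), value)
```

Using sets of taxa rather than tree shapes is what makes "same taxa, any order" a simple equality test.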
![Page 49: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/49.jpg)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure
![Page 50: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/50.jpg)
Ontology
• Ontologies relate facts to knowledge
• Facts
  – may be known/unknown/little known
  – not attached to knowers
  – unchanging
• Knowledge
  – attached to knower
  – may disappear
![Page 51: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/51.jpg)
Gene Ontology
- Gene annotation system
- Controlled vocabulary that can be applied to all organisms
- Used to describe gene products
![Page 52: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/52.jpg)
What is the Gene Ontology?
A (part of the) solution:
- A controlled vocabulary that can be applied to all organisms
- Used to describe gene products - proteins and RNA - in any organism
![Page 53: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/53.jpg)
GO: Three ontologies
For a gene product:
- Molecular Function: What does it do?
- Cellular Component: Where does it act?
- Biological Process: What processes is it involved in?
![Page 54: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/54.jpg)
Cellular Component
• where a gene product acts
![Page 55: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/55.jpg)
Cellular Component
Mitochondrial membrane
![Page 56: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/56.jpg)
Biological Process
![Page 57: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/57.jpg)
Biological Process
Gluconeogenesis
![Page 58: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/58.jpg)
Molecular Function
• A single reaction or activity, not a gene product
• A gene product may have several functions
• Sets of functions make up a biological process
![Page 59: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/59.jpg)
Molecular Function
hexose kinase
![Page 60: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/60.jpg)
Querying the GO
Search for GO terms or by gene symbol/name
Filter queries by organism, data source or evidence
![Page 61: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/61.jpg)
What can scientists do with GO?
• Access gene product functional information
• Find how much of a proteome is involved in a process/function/component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and …
  • gene expression profiles
  • proteomics data
![Page 62: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/62.jpg)
…analysis of high-throughput data according to GO
Microarray data analysis
[Figure: microarray expression data labeled "attacked", "control" and "time", with gene clusters annotated by GO term:]
• Puparial adhesion, Molting cycle, hemocyanin
• Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes
• Immune response, Toll regulated genes
• Amino acid catabolism, Lipid metabolism
• Peptidase activity, Protein catabolism, Immune response
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
![Page 63: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/63.jpg)
Analysis of Functional Annotation: Downregulated Genes
Figure modified from http://en.wikipedia.org/wiki/Image:Microarray-schema.jpg
No treatment / Treated cells
Selecting microarray subsets based on GO reveals drug target
Courtesy of Shabana Shabeer, Albert Einstein School of Medicine
![Page 64: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/64.jpg)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Correlation
• Protein structure
![Page 65: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/65.jpg)
Biology Grad Core course: Discussion Topic
Merck Smith-Kline was the author of a study of Trioxx, an anti-inflammatory drug used to treat arthritis, for which it was known to be effective. The study followed over 500 long-time Trioxx users and an equal number of control subjects who had never used the drug. Dr. Smith-Kline was looking for correlations between the use of Trioxx and the incidence of any disease other than arthritis, in any demographic group. He noted in the study that Tunisian Americans in the age range 45-55, male or female, who had been vegetarians for more than 6 months at any time in their lives, had a "strong negative correlation" between the use of Trioxx and the incidence of restless leg syndrome (RLS), and began touting Trioxx as an effective anti-RLS drug.
The numbers were as follows:
• Total Tunisian American vegetarians age 45-55: 32
• Total Tunisian American vegetarians age 45-55 Trioxx users: 16
• Total Tunisian American vegetarians age 45-55 Trioxx non-users: 16
• Total Tunisian American vegetarians age 45-55 who have RLS: 4
• Total Tunisian American vegetarians age 45-55 who do not have RLS: 28
RLS \ Trioxx   non-users   users
RLS            4           0
no RLS         12          16
Dr. Smith-Kline correctly calculated the correlation between Trioxx and RLS as follows:
$$\mathrm{Corr} = \frac{\sum_i (u_i - \bar{u})(r_i - \bar{r})}{\sqrt{\sum_i (u_i - \bar{u})^2 \, \sum_i (r_i - \bar{r})^2}}$$
where u_i = 1 if subject i is a user and 0 otherwise, and r_i = 1 if the subject has RLS and 0 otherwise.
The sums were carried out over all 32 subjects in the subset, and the resulting correlation was -0.378. The confidence level was cited as 99%, since the p-value for this correlation was 0.01. The sample size of 32 and the uneven distribution of subjects with RLS were taken into account.
The data itself was collected correctly and the calculations were correct, both for the correlation and its confidence. Yet Merck Smith-Kline did something dishonest in this study. What was it, and what specific question would you ask him to reveal his dishonesty?
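The quoted correlation can be checked directly from the four cell counts. The sketch below is only an illustration: the helper name `pearson` and the way the 32 subjects are expanded into 0/1 lists are choices made here, not part of the original study description.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# (is_user, has_rls, number of subjects), read from the 2x2 table on the slide
cells = [(1, 1, 0), (1, 0, 16), (0, 1, 4), (0, 0, 12)]
u = [is_user for is_user, has_rls, count in cells for _ in range(count)]
r = [has_rls for is_user, has_rls, count in cells for _ in range(count)]

corr = pearson(u, r)
print(len(u), round(corr, 3))  # 32 -0.378
```

This reproduces the value -0.378, so the arithmetic is indeed not where the problem lies.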
![Page 66: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/66.jpg)
Correlation
$$r = \frac{\sum_i (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sqrt{\sum_i (x_i - \langle x \rangle)^2 \, \sum_i (y_i - \langle y \rangle)^2}}$$
Pearson's correlation coefficient, or Pearson's product-moment correlation.
![Page 67: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/67.jpg)
Correlation using metric data
Correlation cancels baseline and scale.
Correlation is insensitive to non-linear relationships.
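Both statements can be tried out numerically. In this minimal sketch the data values, the shift and scale factors, and the helper name `pearson` are made up for illustration. The first comparison shifts and rescales both variables; the second uses y = x² on x values symmetric about zero, a perfect but non-linear dependence.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative (invented) measurements
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r_original = pearson(x, y)
# New baseline and scale for both variables (positive scale factors)
r_rescaled = pearson([10 * v + 100 for v in x], [0.5 * v - 3 for v in y])

# y depends on x exactly, but not linearly
x_sym = [-2, -1, 0, 1, 2]
r_curve = pearson(x_sym, [v * v for v in x_sym])

print(abs(r_original - r_rescaled) < 1e-9)  # True
print(r_curve)  # 0.0
```

A negative scale factor would flip the sign of r but leave its magnitude unchanged.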
![Page 68: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/68.jpg)
Non-linearity is not picked up by correlation
All of these examples have r=0.816.
Re-sampling will fix some of these.
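The value r = 0.816 is the one shared by the four data sets of Anscombe's quartet. Assuming those are the examples plotted on this slide (the transcript does not say so), the sketch below recomputes r for each set. The helper name `pearson` is illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Anscombe's quartet (assumed to be the data behind the slide)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in r_values.items():
    print(name, round(r, 3))
```

Set II is a smooth curve and set IV has a single point set apart from the rest, yet a single number r cannot tell them apart from the roughly linear set I.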
![Page 69: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/69.jpg)
Correlation Confidence by Resampling
• Start with paired data (x, y) and calculate r. Let's say r = 0.511.
• Randomly re-associate the x and y values.
• Calculate r_ran for the scrambled pairs.
• Repeat 10000 times.
• Significance is p = (number of times r_ran > 0.511) / 10000. (If r is negative, count r_ran < r.)
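The steps above can be sketched as a short permutation test. This is a minimal sketch under stated assumptions: the function names, the fixed random seed, and the example data are invented here, and ties are counted (>=) as in the worked example on the following slide rather than with a strict inequality.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def resampling_p_value(xs, ys, n_trials=10000, seed=0):
    """Return (r, p): observed r and the fraction of scrambles at least as extreme."""
    rng = random.Random(seed)        # seeded so the sketch is repeatable
    r_obs = pearson(xs, ys)
    scrambled = list(ys)
    hits = 0
    for _ in range(n_trials):
        rng.shuffle(scrambled)       # randomly re-associate x and y
        r_ran = pearson(xs, scrambled)
        if r_obs >= 0 and r_ran >= r_obs - 1e-12:
            hits += 1
        elif r_obs < 0 and r_ran <= r_obs + 1e-12:
            hits += 1
    return r_obs, hits / n_trials

# Invented, strongly correlated example data
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]
r_obs, p = resampling_p_value(x, y)
print(round(r_obs, 3), p)
```

Because only the pairing is scrambled, the distributions of x and y themselves are preserved in every trial, which is what lets the test account for outliers and uneven distributions.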
![Page 70: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/70.jpg)
Resampling example
The y-values have been randomly swapped 10000 times. 10% of the time, r is >= 0.816, therefore p = 0.10.
[Figure: scatter plots of the data and of scrambled samples; panel labels r = 0.816, 0.816, 0.101, -0.321, -0.267, 0.020, -0.199, -0.331, -0.204]
![Page 71: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/71.jpg)
Resampling example
[Figure: panels labeled r = 0.816 and r = -0.130]
Scrambling randomizes sampling in the gray shaded area. If there is one data point set apart from the others, correlation can be high by chance.
![Page 72: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/72.jpg)
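Correlation using Boolean data
True is assigned 1, False 0.
x: 0 0 0 0 1 1 0 0 0 1 0 1
y: 0 1 0 0 1 1 0 0 1 1 0 1
<x> = 4/12 = 0.33
<y> = 6/12 = 0.50
x \ y 1 0
1 4 0
0 2 6
$$r = \frac{\sum_i (x_i - \langle x \rangle)(y_i - \langle y \rangle)}{\sqrt{\sum_i (x_i - \langle x \rangle)^2 \, \sum_i (y_i - \langle y \rangle)^2}}$$
r = 0.71, p = 0.025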
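For 0/1 data, Pearson's r can be computed from the 2x2 table alone, where it is usually called the phi coefficient. The sketch below uses the twelve (x, y) pairs as read from this slide and checks that the table shortcut and the full sum give the same value. The variable and function names are illustrative, and the p-value on the slide is not recomputed here.

```python
from math import sqrt

# The twelve Boolean pairs as read from the slide
x = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1]
y = [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1]

# Cell counts of the 2x2 table: n_ab = number of pairs with x == a and y == b
n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)

# Phi coefficient: Pearson's r specialised to two 0/1 variables
phi = (n11 * n00 - n10 * n01) / sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

print((n11, n10, n01, n00))                 # (4, 0, 2, 6)
print(round(phi, 2), round(pearson(x, y), 2))  # 0.71 0.71
```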
![Page 73: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/73.jpg)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Correlation
• Protein structure
![Page 74: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/74.jpg)
Photosystem I: 1JB0
![Page 75: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/75.jpg)
Classes of membrane proteins
• Single transmembrane helix
• Several transmembrane helices
• Beta-barrel or channel
• Anchored by one (non-transmembrane) helix or a covalently attached fatty acid
![Page 76: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/76.jpg)
Photosystem I: Guided tour
Download and display 1JB0.pdb (one jay bee zero)
restrict not protein and not hoh
color cpk
Display -> ball and stick
select magnesium
label %r
set fontsize 12
set fontstroke 2
color labels yellow
Find the pseudo 2-fold axis
How many Mg are there?
What are the residue numbers of the “special pair” of chlorophylls?
![Page 77: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/77.jpg)
Photosystem I : Guided tour
(select the special pair using select XXX or YYY)
spacefill
select hetero and not hoh
labels off
color temperature
How are the B-factors distributed?
Was NCS 2-fold symmetry enforced during refinement?
Guess what: 2-fold symmetry was not enforced during evolution!
Which side is more ordered? Chain A or chain B?
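The question about chains A and B can also be approached numerically by averaging the B-factor (temperature factor) per chain. This is a rough sketch, not part of the original exercise: it relies on the fixed-column PDB format (chain ID in column 22, B-factor in columns 61-66), and the function name and the four demonstration records are synthetic.

```python
from collections import defaultdict

def mean_b_factor_by_chain(pdb_lines):
    """Average the temperature factor of ATOM/HETATM records for each chain ID."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            chain = line[21]              # column 22: chain identifier
            sums[chain] += float(line[60:66])  # columns 61-66: B-factor
            counts[chain] += 1
    return {chain: sums[chain] / counts[chain] for chain in sums}

def fake_atom(serial, chain, b_factor):
    """Build a synthetic, correctly aligned ATOM record (for demonstration only)."""
    return (f"ATOM  {serial:5d} {'CA':<4} {'ALA':>3} {chain}{serial:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b_factor:6.2f}"
            f"          {'C':>2}")

demo = [fake_atom(1, "A", 20.0), fake_atom(2, "A", 30.0),
        fake_atom(3, "B", 40.0), fake_atom(4, "B", 60.0)]
means = mean_b_factor_by_chain(demo)
print(means)  # {'A': 25.0, 'B': 50.0}
```

With the real file, the same function could be called as `mean_b_factor_by_chain(open("1JB0.pdb"))`; a lower mean B-factor suggests the more ordered chain.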
![Page 78: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/78.jpg)
Photosystem I : Guided tour
Find the name of the lipid that does not have a phosphate group.
Characterize the environment of the lipid. Could it have a role in the light-harvesting process?
Unix shortcut: use grep
grep "^HETNAM" 1JB0.pdb
select [LMG]
restrict selected
center selected
select within (11., [LMG]) and protein
Display -> ball_and_stick
color cpk
select within (11., [LMG]) and ligand
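The grep shortcut on this slide has a simple equivalent for anyone working outside a Unix shell. The sketch below collects HETNAM records into a dictionary keyed by the three-letter residue code, using the fixed PDB columns (code in columns 12-14, name text from column 16). The function name and the two demonstration records are hypothetical, and joining continuation lines with a space is a simplification.

```python
def hetnam_records(pdb_lines):
    """Map each het residue code to its chemical name from HETNAM records."""
    names = {}
    for line in pdb_lines:
        if line.startswith("HETNAM"):
            het_id = line[11:14].strip()   # columns 12-14: residue code
            text = line[15:70].strip()     # columns 16-70: name text
            if het_id in names:
                # continuation line: append (a space is assumed as separator)
                names[het_id] = names[het_id] + " " + text
            else:
                names[het_id] = text
    return names

def fake_hetnam(het_id, text, continuation=""):
    """Build a synthetic, correctly aligned HETNAM record (for demonstration only)."""
    return f"HETNAM  {continuation:>2} {het_id:>3} {text}"

demo = [
    fake_hetnam("AAA", "HYPOTHETICAL LIPID"),
    fake_hetnam("AAA", "SECOND LINE", continuation="2"),
    fake_hetnam("BBB", "HYPOTHETICAL COFACTOR"),
    "REMARK this line is ignored",
]
names = hetnam_records(demo)
print(names)
```

With the real file, `hetnam_records(open("1JB0.pdb"))` would list the names to scan for the lipid without a phosphate group.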
![Page 79: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/79.jpg)
Photosystem I : Guided tour
select within (11., [LMG]) and ligand
spacefill 1.5
color green
select within (11., [LMG]) and *.MG
spacefill 1.5
color white
select within (11., [LMG]) and [PQN]
color red
select within (11., [LMG]) and solvent
spacefill 1.0
color cyan
What is PQN?
How close is it to the nearest magnesium?
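The distance question can be answered by clicking atoms in the viewer, or computed from the coordinates. This minimal sketch assumes standard PDB columns (atom name 13-16, residue name 18-20, x/y/z in 31-54, element in 77-78). The function names and the three demonstration records are synthetic, and the coordinates are invented.

```python
from math import dist  # Python 3.8+

def parse_atoms(pdb_lines):
    """Yield (residue name, is_magnesium, (x, y, z)) for ATOM/HETATM records."""
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            atom_name = line[12:16].strip()
            res_name = line[17:20].strip()
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            element = line[76:78].strip().upper()
            # Fall back to the atom name if the element column is blank
            is_mg = element == "MG" or (not element and atom_name == "MG")
            yield res_name, is_mg, xyz

def nearest_mg_distance(pdb_lines, res_name="PQN"):
    """Shortest distance from any atom of res_name to any magnesium atom."""
    atoms = list(parse_atoms(pdb_lines))
    targets = [xyz for res, is_mg, xyz in atoms if res == res_name]
    magnesiums = [xyz for res, is_mg, xyz in atoms if is_mg]
    return min(dist(a, b) for a in targets for b in magnesiums)

def fake_hetatm(serial, atom, res, elem, x, y, z):
    """Build a synthetic, correctly aligned HETATM record (for demonstration only)."""
    return (f"HETATM{serial:5d} {atom:<4} {res:>3} A{serial:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
            f"          {elem:>2}")

demo = [
    fake_hetatm(1, "C1", "PQN", "C", 0.0, 0.0, 0.0),
    fake_hetatm(2, "MG", "CL1", "MG", 3.0, 4.0, 0.0),
    fake_hetatm(3, "MG", "CL1", "MG", 10.0, 0.0, 0.0),
]
nearest = nearest_mg_distance(demo)
print(nearest)  # 5.0
```

If the file holds more than one PQN residue, this returns the minimum over all of them; filtering on chain or residue number as well would treat each quinone separately.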
![Page 80: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/80.jpg)
Photosystem I : Guided tour
restrict ligand
wireframe
color cpk
Display -> ball and stick
select [CL1] or [CL2]
wireframe
color green
select [PQN]
color magenta
spacefill 1.0
select [BCR]
color orange
spacefill 1.5
select *.MG
spacefill 1.0
color white
Trace the path of the electrons from the special pair to the two quinones.
Light harvesting complex
Are any of the pigments connected to the special pair?
![Page 81: Bioinformatics - Rensselaer Polytechnic Institute · 2011. 3. 9. · Bioinformatics • Sequence alignment • Database searching • Significance, e-values • Trees • Gene ontology](https://reader033.fdocuments.in/reader033/viewer/2022052103/603d87615a687f5e444b5e6b/html5/thumbnails/81.jpg)
Photosystem I : Guided tour
restrict [PQN]
spacefill
color cpk
select within (11.,[PQN]) and protein
wireframe 0.5
color cpk
select within (11.,[PQN]) and ligand and not [PQN]
color green
wireframe 0.5
select within (11.,[PQN]) and solvent
spacefill 0.6
color cyan
Which quinone is more loosely bound?
Environment of the quinones
How does the electron get from one quinone to the other? What protein sidechain forms a bridge?