Analysis of Biological Sequences SPH 140.638
![Page 2: Analysis of Biological Sequences SPH 140 · Analysis of Biological Sequences SPH 140.638 swheelan@jhmi.edu. nuts and bolts • meet Tuesdays & Thursdays, 8:30-9:50 • no exam; grade](https://reader034.fdocuments.in/reader034/viewer/2022042710/5f5fddcbaf8f7b25785685e6/html5/thumbnails/2.jpg)
nuts and bolts
• meet Tuesdays & Thursdays, 8:30-9:50
• no exam; grade derived from 3-4 homework assignments plus a final project (open book, open note, collaborations allowed as long as work is not copied)
• no single recommended textbook. Website has a few recommendations with guidance for choosing a resource.
• I will try to keep it updated with upcoming lecture notes, a “daily dozen” for each lecture, and homework assignments
• “daily dozen” is just some questions (probably not always 12!) that you don’t have to turn in but that you should be able to answer easily after each lecture
Course objectives
• describe the algorithms used to estimate the function of biological sequences
• determine which methods are appropriate for analyzing sequences derived from different experiments
• design analysis pipelines that are biologically meaningful and mathematically rigorous
concepts covered
• algorithms, including
  • HMM
  • MCMC
  • dynamic programming
  • heuristic methods
  • enrichment of spatial associations
• experimental methods
  • ChIP
  • RNAseq
  • bisulfite, RRBS, MBDseq, MeDIP
  • variant calling
  • HiC & similar structural methods
waaaay back: prebiotic soup/primordial sandwich
early Earth was too hot for stable molecules, but as the atmosphere cooled, molecules formed at random
many hypotheses about what happened next . . .
but eventually molecules appeared that had catalytic capabilities
and could replicate themselves.
amazing property of nucleotides
RNA
• single stranded but self-complementary, so complex 3D structures with enzymatic capacity are possible
RNA
• amino acids were likely also present in the prebiotic soup
Next steps
• an RNA that gained a permanent function could out-reproduce other RNAs
• proteins are much more stable than RNA
• proteins are linear arrangements of information (like RNA)
RNA encodes proteins
the genetic code is a wobbling degenerate
protein synthesis (translation)
unsurprisingly, protein synthesis involves large RNA/protein complexes (ribosomes)
protein synthesis
• translation is energetically expensive
• highly regulated
• ribosomes have proofreading functions
• all components are recycled
becoming a useful protein
and then?
RNA and proteins were working well, and there were probably many “genetic codes” . . . but the RNA that won was the one that invented a stable version of itself
DNA
transcription
• a usually short-lived RNA copy of the DNA is created through transcription
• RNA is exported to the cytoplasm to encode proteins
• some types of RNA do not encode proteins
transcription: the cell knows where to start!
• transcription is expensive and potentially damaging, so it is highly regulated at many levels:
  • signal sequences (activating or repressive)
  • chromatin structure
  • polymerase control (elongation speed, etc.)
  • cleavage of nascent RNA
classical eukaryotic promoter
biological signals: how do we find these signal sequences in a big sequence?
motif finding
Simplest example: look for exact matches to a known motif
Next example: imperfect matches to a known motif
Finally: finding enriched motifs in a pile of sequences
additional questions: conservation throughout evolution, coordinated changes
exact match to a known motif: TATA
the TATA box is one of many signals in DNA sequences that mark the location for transcriptional initiation in a large percentage of eukaryotic genes.
does my sequence contain a TATA box?
ACGCTAGCGCATATAGCATGACTAGTATAGCTAGACGAGCTAGCATATCCGAT
finding an exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
TATA
(slide TATA along the sequence one position at a time, checking for a match at each position)
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
TATA
how many comparisons are needed?
(hint: 4 comparisons for each position x # positions to be compared)
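A minimal sketch of the brute-force search, assuming no early exit on a mismatch so that the count follows the hint (4 comparisons at every start position). The function name `naive_search` is made up for illustration; positions are 0-based.

```python
def naive_search(sequence, motif):
    """Slide the motif along the sequence; return match starts and comparison count."""
    hits = []
    comparisons = 0
    n_starts = len(sequence) - len(motif) + 1  # number of possible start positions
    for start in range(n_starts):
        matched = True
        for offset, base in enumerate(motif):
            comparisons += 1  # every motif position is compared, even after a mismatch
            if sequence[start + offset] != base:
                matched = False
        if matched:
            hits.append(start)
    return hits, comparisons


seq = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
hits, comparisons = naive_search(seq, "TATA")
print(hits, comparisons)
```

Stopping at the first mismatch would lower the count, but the worst case still grows as (motif length) x (number of start positions).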
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
TATA
are there ways to speed this up?
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
(each of the 12 Ts is flagged with a *)
reduce search space by flagging all Ts—how many comparisons are made?
find and catalog all 4mers in advance
how do we store that information?
lots of options:
• big text table
• hash table
• tree structure
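As one illustration of the hash table option, here is a sketch that catalogs every 4mer once and then answers lookups without rescanning. The name `build_kmer_index` is hypothetical, and a Python `dict` stands in for the hash table; positions are 0-based.

```python
from collections import defaultdict


def build_kmer_index(sequence, k=4):
    """Map each k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


seq = "ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT"
index = build_kmer_index(seq)
print(index.get("TATA", []))  # start positions of TATA
print(index.get("GGGG", []))  # a 4mer that does not occur gives an empty list
```

Building the index costs one pass over the sequence; after that each motif lookup is a single hash lookup, which pays off when many motifs are searched against the same sequence.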
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
I found a gene!! or did I?
p(TATA) = ?
exact match to a known motif: TATA
ACGCTAGCGCATATAGCATGACTAGTATCGCTAGACGAGCTAGCATATCCGAT
p(TATA at any site) = p(T)*p(A)*p(T)*p(A)
and, assuming that the nucleotides are equally represented (which isn’t true),
p(TATA) = 0.25^4 = 0.0039
our sequence is 53 nucleotides long, so we have 50 possible start sites.
expect 50 * 0.0039 occurrences of TATA = 0.2
so is our result surprising (do we have a gene)?
What if we’re working with a genome that is 80% AT?
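A small sketch of the same expectation for any base composition. It assumes bases are independent and, for the AT-rich case, that the 80% AT is split evenly between A and T (and the remaining 20% evenly between G and C); neither assumption comes from the slide, and the function names are made up.

```python
def motif_probability(motif, base_freqs):
    """Probability of the motif at one site, assuming independent bases."""
    p = 1.0
    for base in motif:
        p *= base_freqs[base]
    return p


def expected_occurrences(motif, seq_length, base_freqs):
    """Expected count = number of possible start sites x per-site probability."""
    n_starts = seq_length - len(motif) + 1
    return n_starts * motif_probability(motif, base_freqs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}  # assumed even split of 80% AT

print(expected_occurrences("TATA", 53, uniform))  # about 0.2
print(expected_occurrences("TATA", 53, at_rich))  # about 1.3
```

Under these assumptions a single TATA in 53 nucleotides goes from mildly unexpected to entirely ordinary once the genome is AT-rich.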
imperfect motifs
Many proteins bind DNA or RNA with less strict sequence preferences.
good example: splicing
Our understanding is still very incomplete . . . but a cell knows how to do it!
[Figure: gene structure, drawn 5’ to 3’. Labels: enhancer; promoter (CAAT, TATA); exon, intron, exon, intron, exon; polyA signal; 5’ UTR; 3’ UTR]
[Figure: gene structure, drawn 5’ to 3’. Labels: enhancer; promoter (CAAT, TATA); exon, intron, exon, intron, exon; polyA signal; 5’ UTR; 3’ UTR. Also marked: start point for transcription; start point for translation (AUG); terminator for translation (UGA, UAA, UAG)]
[Figure: gene structure, drawn 5’ to 3’. Labels: enhancer; promoter (CAAT, TATA); exon, intron, exon, intron, exon; polyA signal; 5’ UTR; 3’ UTR. Steps shown: transcription → pre-mRNA]
[Figure: gene structure, drawn 5’ to 3’. Labels: enhancer; promoter (CAAT, TATA); exon, intron, exon, intron, exon; polyA signal; 5’ UTR; 3’ UTR. Steps shown: transcription → pre-mRNA; splicing → mRNA]
[Figure: gene structure, drawn 5’ to 3’. Labels: enhancer; promoter (CAAT, TATA); exon, intron, exon, intron, exon; polyA signal; 5’ UTR; 3’ UTR. Steps shown: transcription → pre-mRNA; splicing → mRNA; translation]
Alternative splicing
Splice site and branch site consensus sequences
The problem:
Consensus 5' and 3' splice site sequences and branch site sequences occur frequently in any genome
what is the probability of finding a GT sequence?
More information is necessary to define bona fide exons
Splice site and branch site consensus sequences
UGACAUUACUGUGAGUAAAACUUGUUUUCAGGUACAGUAGUCGCAAGUCAUGGUAAGUCCUCUGACUUAACAGGUACUAUAUAUAAAGGAUUAGGUAUGUAUACCUUCAACACAGGUAACUGACUUGGGGCUGCAGGUACAGUCAUGAGUCAUGUCUGUAUCCUUUUGACCUUACAGUGUGAUGGGCAGAGAGGAUGAUGUAAGUAAUGGAUCAUUCGGGGUGAGUAUUUUCAAAAUGGGGGUAAGAAGACUUUCAACAAAGGUAAGACCAUUCAAAAAUAAGGUGAUUGGCACUAUGAAUUAGGUAAGAACUAUUGCGUAACAGGUGAGGCCCUUCGAGCAGAAGGUGAGAACUGACUGGAGCAAGGUAAUUGUGAGUAUGAUGAAGGUAAAUCUUUACAAACUGGAGGUACUUCAAUUUCUUUUUAGGGUUUCACUAAG
| base | -2 | -1 | 1 | 2 |
|---|---|---|---|---|
| A | 0.64 | 0 | 0 | 0 |
| C | 0.05 | 0 | 0 | 0 |
| G | 0.30 | 0.85 | 1 | 0 |
| U | 0.01 | 0.15 | 0 | 1 |
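One way to use a frequency table like this is to score every 4-nucleotide window of an RNA by the product of its per-position frequencies and compare that with a uniform background. The sketch below does that with the values from the table; the function names, the uniform 0.25 background, and the test string are assumptions for illustration.

```python
# Frequencies from the table; list order is positions -2, -1, 1, 2.
PFM = {
    "A": [0.64, 0.00, 0.0, 0.0],
    "C": [0.05, 0.00, 0.0, 0.0],
    "G": [0.30, 0.85, 1.0, 0.0],
    "U": [0.01, 0.15, 0.0, 1.0],
}
BACKGROUND = 0.25  # assumed uniform background frequency per base
WIDTH = 4


def site_probability(kmer):
    """Product of the per-position frequencies for one 4-nucleotide window."""
    p = 1.0
    for position, base in enumerate(kmer):
        p *= PFM[base][position]
    return p


def scan(rna):
    """Return (start, odds ratio vs. background) for windows with nonzero probability."""
    hits = []
    for start in range(len(rna) - WIDTH + 1):
        p = site_probability(rna[start:start + WIDTH])
        if p > 0:
            hits.append((start, p / BACKGROUND ** WIDTH))
    return hits


print(scan("CAGGUAAG"))
```

Because several table entries are exactly 0, any window without GU at positions 1 and 2 scores zero; real tools usually add pseudocounts so that rare variants are penalized rather than excluded.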
Splice site and branch site consensus sequences
Cartegni et al. 2002
interpretation of sequence logos: If the letters occupy the entire vertical space, the height of each letter is the proportion of sequences with that base at that position. If the letters do not occupy the entire vertical space, the height of each letter typically signifies information content.
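A common definition of the information content of a logo column is 2 bits minus the Shannon entropy of that column (four-letter alphabet, no small-sample correction). The sketch below applies it to four 5' splice-site columns (positions -2, -1, 1, 2) used here as illustrative input; the function names are hypothetical.

```python
import math

COLUMNS = [
    {"A": 0.64, "C": 0.05, "G": 0.30, "U": 0.01},  # position -2
    {"A": 0.00, "C": 0.00, "G": 0.85, "U": 0.15},  # position -1
    {"A": 0.00, "C": 0.00, "G": 1.00, "U": 0.00},  # position 1
    {"A": 0.00, "C": 0.00, "G": 0.00, "U": 1.00},  # position 2
]


def information_content(column):
    """Bits of information: log2(4) minus the Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


def letter_heights(column):
    """Height of each letter in an information-content logo: frequency x column bits."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}


for position, column in zip([-2, -1, 1, 2], COLUMNS):
    print(position, round(information_content(column), 2), letter_heights(column))
```

Invariant positions reach the full 2 bits, while a mixed position such as -2 carries well under 1 bit, which is why its letters look short in the logo.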
U1 U2 U2AF
splice signals . . . not much information
how many times would the sequence GT occur in a 3 Gb genome with 60% AT? how about the sequence CAGGTAAG?
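A rough worked answer, under assumptions that are not on the slide: bases are independent, p(A) = p(T) = 0.3 and p(G) = p(C) = 0.2 (60% AT split evenly), only one strand is counted, and the number of start sites is taken as N ≈ 3 × 10^9.

```latex
\begin{align*}
E[\text{GT}] &\approx N\,p(G)\,p(T)
  = 3\times10^{9}\times 0.2\times 0.3 = 1.8\times10^{8}\\
E[\text{CAGGTAAG}] &\approx N\,p(C)\,p(A)^{3}\,p(G)^{3}\,p(T)
  = 3\times10^{9}\times 0.2^{4}\times 0.3^{4} \approx 3.9\times10^{4}
\end{align*}
```

Even the 8-nucleotide consensus is expected tens of thousands of times by chance, so the splice signal alone cannot pinpoint real exons.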
so how does the cell know how to splice things?
as we’ll see, eukaryotic gene prediction is a tricky problem! but cells know where their exons are . . . what are some approaches that we can take?
finding signals in eukaryotic genes: additional tools and approaches
• evolutionary conservation: take advantage of history
• open reading frames (we know what a protein-coding gene should look like!)
evolutionary conservation: where are mutations tolerated?
this doesn’t look like a random coincidence . . .
evolutionary conservation
probability of seeing a mutation is the product of the probability of mutation occurrence (mutation rate) and the probability of retaining the mutation (selection)
surrogate for biological importance?
evolutionary conservation
to think about this in a meaningful way we need metrics. How do you define “surprisingly conserved”?
open reading frames
if you are looking for protein-coding genes, you also want to look for open reading frames.
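A minimal sketch of an ORF scan, under simplifying assumptions that are not from the slides: an ORF is taken to run from an ATG to the first in-frame stop codon, only the three forward frames are searched, and the reverse strand is ignored. The name `find_orfs` is made up; coordinates are 0-based with an exclusive end.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop, ATG included
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGGTAGCC"))
```

Raising `min_codons` is the usual way to filter out the short ORFs that appear by chance.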
open reading frames
There are 64 codons and 3 of them are stop codons. If the codons are equally likely to appear, how long would you expect an ORF to be, in random sequence?
frequency of stop codon = 3/64
expected probability of stop codon in random sequence = 3/64
open reading frames
frequency of stop codon = 3/64
expected probability of stop codon in random sequence p = 3/64
the length of ORFs in random sequence follows a geometric distribution (the negative binomial with r = 1), with mean 1/p (21.3 codons, or 64 nucleotides)
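A short sketch of the arithmetic, treating each codon as an independent draw that is a stop with probability p = 3/64. The function names are hypothetical, and the tail formula (1 - p)^k is the probability that k consecutive random codons contain no stop.

```python
P_STOP = 3 / 64  # three stop codons out of 64, assuming all codons equally likely


def mean_orf_codons(p=P_STOP):
    """Mean number of codons up to and including the first stop."""
    return 1 / p


def prob_no_stop_in(k, p=P_STOP):
    """Probability that a run of k random codons contains no stop codon."""
    return (1 - p) ** k


print(mean_orf_codons())      # about 21.3 codons
print(3 * mean_orf_codons())  # about 64 nucleotides
print(prob_no_stop_in(100))   # chance that 100 random codons in a row are all non-stop
```

The tail probability is the useful number for gene finding: it says how long an ORF has to be before chance stops being a comfortable explanation.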
put the signals together to make a gene:
| | intron | exon | intergenic |
|---|---|---|---|
| ORF | mean length 20aa | spans exon | mean length 20aa |
| splice site | random occurrence | at boundaries + random | random occurrence |
| conservation | low | high | low |
this is, of course, an immense oversimplification and ignores lots of biological entities (pseudogenes, conserved noncoding regions etc). Also, these are noisy signals!
highly conserved locus
finding signals in DNA
• lots of data available for various organisms, to infer conservation, binding sites, function
• look for known motifs, known sequence signals
• mine experimental data for overrepresented sequences and motifs
• laboratory approaches for exploration (e.g. mutagenesis) and confirmation
• gene finding algorithms may assess as many data modalities as possible, with weighting
• First generation sequencing
• Pairwise sequence alignment
• Dot plots
• Needleman-Wunsch and Smith-Waterman
• BLAST, gapped BLAST
• Phylogeny
• Multiple sequence alignment
• Next (and nextnext) generation sequencing
• Short read and not-so-short read alignment
• Hidden Markov Models
• ChIPseq
• variant calling
• Gene expression: approaches and statistics
• Functional analysis in genome space
• Alignment-free sequence comparison
• Metagenomics
• Visualizing big data: Circos, Hive plots