Sequence Analysis '18 lecture 18 - Rensselaer Polytechnic Institute · 2018-10-25 · Sequence...
Sequence Analysis '18 lecture 18
Repeat sequences
Hidden Markov Models
Repeats, Satellites & Transposable Elements
Transposable elements: junk dealers
Barbara McClintock
Transposase, transpososome
Transposable elements “jumping genes” lead to rapid germline variation.
“Out standing in her field”
Excision of transposon may leave a “scar”.
[Figure: cruciform structure; repaired DNA with copied TR and added IR]
TR = tandem repeat, IR = inverted repeat
Millions of years of accumulated TE “scars”
Some genomes contain a large accumulation of transposon scars.
Estimated Transposable element-associated DNA content in selected genomes
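[Figure: charts of TE-associated DNA versus everything else for H. sapiens, Z. mays, Drosophila, Arabidopsis, C. elegans, and S. cerevisiae. TE fractions labeled in the figure: 35%, >50%, 2%, 15%, 1.8%, 3.1%.]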
ACRONYMS for satellites and transposons
SSR  Short Sequence Repeat
STR  Short Tandem Repeat
VNTR  Variable Number Tandem Repeat
LTR  Long Terminal Repeat
LINE  Long Interspersed Nuclear Element
SINE  Short Interspersed Nuclear Element
MITE  Miniature Inverted repeat Transposable Element (class III TE)
TE  Transposable Element
IS  Insertion Sequence
IR  Inverted Repeat
RT  Reverse Transcriptase
TPase  Transposase
Alu  11% of primate genome (SINE)
LINE1  14.6% of human genome
Tn7, Tn3, Tn10, Mu, IS50  transposons or transposable bacteriophage
retroposon = retrotransposon
Class I TE, uses RT.
Class II TE, uses TPase.
Class III TE, MITEs*
*Class III TEs are now merged with Class II TEs.
fun with bioinformatics jargon
Types of repeat sequences
Satellites -- 1000+ bp, in heterochromatin: centromeres, telomeres
Simple Sequence Repeats (SSRs), in euchromatin:
Minisatellites -- ~15 bp (VNTR)
Microsatellites -- 2-6 bp
heterochromatin = compact, dark bands; euchromatin = loose, light bands.
microsatellite
541 gagccactag tgcttcattc tctcgctcct actagaatga acccaagatt gcccaggccc
601 aggtgtgtgt gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt gtatagcaga gatggtttcc
661 taaagtaggc agtcagtcaa cagtaagaac ttggtgccgg aggtttgggg tcctggccct
721 gccactggtt ggagagctga tccgcaagct gcaagacctc tctatgcttt ggttctctaa
781 ccgatcaaat aagcataagg tcttccaacc actagcattt ctgtcataaa atgagcactg
841 tcctatttcc aagctgtggg gtcttgagga gatcatttca ctggccggac cccatttcac
A microsatellite in a dog (Canis familiaris) gene.
Minisatellite
  1 tgattggtct ctctgccacc gggagatttc cttatttgga ggtgatggag gatttcagga
 61 tttgggggat tttaggatta taggattacg ggattttagg gttctaggat tttaggatta
121 tggtatttta ggatttactt gattttggga ttttaggatt gagggatttt agggtttcag
181 gatttcggga tttcaggatt ttaagttttc ttgattttat gattttaaga ttttaggatt
241 tacttgattt tgggatttta ggattacggg attttagggt ttcaggattt cgggatttca
301 ggattttaag ttttcttgat tttatgattt taagatttta ggatttactt gattttggga
361 ttttaggatt acgggatttt agggtgctca ctatttatag aactttcatg gtttaacata
421 ctgaatataa atgctctgct gctctcgctg atgtcattgt tctcataata cgttcctttg
This 8bp tandem repeat has a consensus sequence AGGATTTT,
but is almost never a perfect match to the consensus.
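One way to see how rarely the repeat matches its consensus is to slide the consensus along the sequence and count mismatches in each window. The sketch below is only an illustration: the function names and the mismatch threshold of 2 are assumptions, and the test fragment is one 60-base row (positions 61-120) of the minisatellite shown above.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def scan_consensus(seq, consensus, max_mismatch):
    """Return (position, mismatches) for every window within max_mismatch."""
    seq, consensus = seq.upper(), consensus.upper()
    width = len(consensus)
    hits = []
    for i in range(len(seq) - width + 1):
        d = hamming(seq[i:i + width], consensus)
        if d <= max_mismatch:
            hits.append((i, d))
    return hits


# One 60-base row of the minisatellite above (positions 61-120).
fragment = "tttgggggattttaggattataggattacgggattttagggttctaggattttaggatta"
perfect = scan_consensus(fragment, "AGGATTTT", 0)
near = scan_consensus(fragment, "AGGATTTT", 2)  # threshold 2 is an arbitrary choice
print(len(perfect), "perfect windows;", len(near), "windows within 2 mismatches")
```

Overlapping windows are all reported, so a real repeat finder would still need to chain the hits into tandem copies.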
How do you recognize a repeat sequence?
•High scoring self-alignments
•High dot plot density
•Compositional bias
A repeat region in a dot plot.
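Two of these signals, dot plot density and compositional bias, can be sketched in a few lines. This is a toy illustration, not a production repeat finder: the word size of 4, the function names, and the "unique" comparison sequence are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def self_dot_density(seq, word=4):
    """Fraction of off-diagonal window pairs (i < j) that match exactly."""
    windows = [seq[i:i + word] for i in range(len(seq) - word + 1)]
    pairs = list(combinations(range(len(windows)), 2))
    if not pairs:
        return 0.0
    dots = sum(windows[i] == windows[j] for i, j in pairs)
    return dots / len(pairs)


def shannon_entropy(seq):
    """Composition entropy in bits; low values indicate compositional bias."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


repeat = "TG" * 6         # like a (TG)n microsatellite
unique = "ACGTTGCAAGCT"   # hypothetical non-repetitive sequence
print(self_dot_density(repeat), self_dot_density(unique))
print(shannon_entropy(repeat), shannon_entropy(unique))
```

The repeat fills its self dot plot with off-diagonal dots and uses only two of the four bases, while the comparison sequence does neither.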
Is there an evolutionary advantage of repeat sequences?
Repeat sequences are prone to
(1) locally: errors in replication
(2) non-locally: homologous recombination
Errors in replication (polymerase slippage) can lead to a change in the reading frame, eliminating a STOP codon, adding one, or translating to a different sequence entirely.
Neisseria gonorrhoeae evades the human immune system by randomly and periodically (every few weeks) changing the reading frame of the pilin surface antigen protein, aided by a 4-base microrepeat.
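The frame change is simple arithmetic: codons are 3 bases long, so a tract of 4-base repeat units shifts the downstream frame whenever its total length is not a multiple of 3. A minimal sketch, assuming a hypothetical tract of n identical units and ignoring everything else in the gene:

```python
def downstream_frame(n_copies, unit_len=4):
    """Reading-frame offset (0, 1 or 2) contributed by a tandem repeat tract."""
    return (n_copies * unit_len) % 3


# Each slippage event adds or removes one copy and so changes the frame.
for n in range(1, 7):
    print(n, "copies -> frame offset", downstream_frame(n))
```

With a 4-base unit, only every third copy number restores the original frame, so most slippage events switch the downstream protein sequence.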
(How) do you align repeat sequences?
B: Dynamic programming with a special EVD. Align just like any other sequence, but use a special null model to assess the significance of the alignment score. Fit an extreme value distribution (EVD) to the scores of random alignments.
Remember: low-complexity sequences will have high-scoring alignments by chance. For example:
ATTTATATAATTAATATATAAATATAATAAATAT aligned to
TATTATATATATATATATATTATATATATATATA
A random alignment is likely to have >50% identity!
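The chance identity of low-complexity sequences can be checked numerically. The sketch below uses ungapped comparison only, which is a simplification: a gapped dynamic programming alignment can only push the identity higher. The function names, trial count, and seed are assumptions.

```python
import random


def expected_identity(freqs):
    """Chance that two independent residues match: sum of squared frequencies."""
    return sum(p * p for p in freqs.values())


def ungapped_identity(a, b):
    """Fraction of identical positions in an ungapped, equal-length comparison."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_random_identity(alphabet, length=34, trials=2000, seed=0):
    """Average ungapped identity between pairs of random sequences."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.choice(alphabet) for _ in range(length)]
        b = [rng.choice(alphabet) for _ in range(length)]
        total += ungapped_identity(a, b)
    return total / trials


print(expected_identity({"A": 0.5, "T": 0.5}))        # AT-only composition
print(expected_identity({b: 0.25 for b in "ACGT"}))   # uniform DNA
print(mean_random_identity("AT"), mean_random_identity("ACGT"))

# The two AT-rich sequences from the slide, compared without gaps.
s1 = "ATTTATATAATTAATATATAAATATAATAAATAT"
s2 = "TATTATATATATATATATATTATATATATATATA"
print(ungapped_identity(s1, s2))
```

An AT-only composition doubles the expected chance identity relative to uniform DNA, which is why the null model has to account for composition.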
A: Don’t align. Mask them out instead.
www.repeatmasker.org
Annotation Results
   SW  perc perc perc  query     position in query             matching repeat          position in repeat
score  div. del. ins.  sequence  begin   end     (left)        repeat   class/family    begin end (left)  ID  Overlap
  194  10.5  2.6  0.0  chr1      1031265 1031302 (244491545) + C-rich   Low_complexity      3  41    (0) 624  0
  238  26.4  0.7  0.7  chr1      1031638 1031782 (244491065) + (TG)n    Simple_repeat       1 145    (0) 625  0
  298  29.0  2.1  0.0  chr1      1031794 1031886 (244490961) + (CGTG)n  Simple_repeat       3  97    (0) 626  0
  255  23.1  1.8  1.8  chr1      1031900 1032062 (244490785) + (TG)n    Simple_repeat       1 163    (0) 627  0
 1864  13.8  0.0  0.7  chr1      1032330 1032614 (244490233) + AluJo    SINE/Alu            5 287   (25) 628  0
Compares your query sequence to a curated library of known repeats. Returns:
(1) Location and type of each repeat, and/or
(2) Query sequence with repeats masked (set to “N”)
Arian Smit, Phil Green
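Masking itself is straightforward once the repeat locations are known. The sketch below is not RepeatMasker code: `mask_repeats` is a hypothetical helper, the query is made up, and the begin/end coordinates are assumed to be 1-based and inclusive, as in the annotation table.

```python
def mask_repeats(seq, intervals, mask_char="N"):
    """Replace 1-based, inclusive [begin, end] intervals with mask_char."""
    chars = list(seq)
    for begin, end in intervals:
        # Convert to 0-based indices and clip to the sequence bounds.
        for i in range(max(begin, 1) - 1, min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


# Hypothetical query with a (TG)n run at positions 5-12.
query = "ACGATGTGTGTGACCA"
print(mask_repeats(query, [(5, 12)]))
```

The masked sequence keeps its length, so coordinates in downstream alignments still refer to the original query.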
What is a stochastic emitter?
[Figure: model + random numbers → sequence data]
Markov models
A Markov “chain” is a network of “states” (circles) connected by “transitions” (arrows)
The probability of the next Markov state depends only on the current state.
[Figure: Markov chain with states H, E, and L]
Each state has a different “state identifier”: H=helix, E=beta sheet, L=loop
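A Markov chain over these three states can be simulated with a table of transition probabilities. The numbers below are invented for illustration and are not from the lecture; the point is only that each step consults nothing but the current state.

```python
import random

# Hypothetical transition probabilities between secondary-structure states.
TRANSITIONS = {
    "H": {"H": 0.90, "E": 0.01, "L": 0.09},
    "E": {"H": 0.01, "E": 0.85, "L": 0.14},
    "L": {"H": 0.10, "E": 0.10, "L": 0.80},
}


def sample_chain(start, length, transitions, seed=None):
    """Generate a state path; each step looks only at the current state."""
    rng = random.Random(seed)
    path = [start]
    while len(path) < length:
        options = transitions[path[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(nxt)
    return "".join(path)


print(sample_chain("L", 40, TRANSITIONS, seed=1))
```

High self-transition probabilities give long runs of the same state, which is roughly how helices and strands appear along a protein.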
Markov model
how much wood would a wood chuck chuck if a wood chuck would chuck wood?
Exercise: MMs for tongue-twisters.
can you can a can as a canner can can a can?
I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I won’t wish the wish you wish to wish.
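One possible approach to the exercise is to treat each distinct word as a state and estimate transition probabilities from counts of adjacent word pairs. This is a sketch of that idea for the woodchuck sentence only; the function name and the choice of words as states are assumptions.

```python
from collections import Counter, defaultdict


def word_transitions(text):
    """Estimate P(next word | current word) from bigram counts."""
    words = text.lower().replace("?", "").split()
    counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return {
        w: {n: c / sum(nexts.values()) for n, c in nexts.items()}
        for w, nexts in counts.items()
    }


twister = "how much wood would a wood chuck chuck if a wood chuck would chuck wood?"
model = word_transitions(twister)
for word, nexts in model.items():
    print(word, "->", nexts)
```

Few states with many transitions between them is what makes the sentence hard to say: the chain keeps revisiting the same words by different routes.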
[Figure: states H, E, and L, each with a marble bag]
The marble bag represents a probability distribution (PD) of amino acids, b. (A PD of amino acids is a profile.)
b_H(i)
HMM states “emit” one character.
Hidden Markov models
The probability of the next Markov state depends on the current state and the sequence.
Each state has a unique “state identifier” and a PD (or multiple PDs)
Hidden Markov Model
Begin
End
HMMs consist of…
• Non-emitting source states (begin)
• Emitting states
• Non-emitting connector states
• Non-emitting sink states (end)
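These pieces can be put together in a small stochastic emitter. The sketch below is an assumption-laden toy: two emitting states (H and L), non-emitting begin and end states, no connector states, and made-up transition and emission probabilities. Note that the state name L (loop) and the residue L (leucine) are unrelated.

```python
import random

# Hypothetical parameters: two emitting states between non-emitting begin/end.
TRANS = {
    "begin": {"H": 0.5, "L": 0.5},
    "H": {"H": 0.8, "L": 0.15, "end": 0.05},
    "L": {"H": 0.2, "L": 0.7, "end": 0.1},
}
EMIT = {
    "H": {"A": 0.5, "L": 0.3, "E": 0.2},   # helix-like profile
    "L": {"G": 0.5, "P": 0.3, "S": 0.2},   # loop-like profile
}


def draw(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]


def emit_sequence(trans, emit, seed=None):
    """Walk from begin to end, emitting one character per emitting state."""
    rng = random.Random(seed)
    state, path, seq = "begin", [], []
    while True:
        state = draw(trans[state], rng)
        if state == "end":
            return "".join(path), "".join(seq)
        path.append(state)
        seq.append(draw(emit[state], rng))


states, residues = emit_sequence(TRANS, EMIT, seed=3)
print(states)    # hidden state path
print(residues)  # observed sequence
```

Only the second line would be visible in real data; the state path is the hidden part of the model.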
How to make a HMM from Blast results
                                                                    Score    E
Sequences producing significant alignments:                         (bits)  Value

gi|18977279|ref|NP_578636.1| (NC_003413) hypothetical protein [P...   136   5e-32
gi|14521217|ref|NP_126692.1| (NC_000868) hypothetical protein [P...    59   8e-09
gi|14591052|ref|NP_143127.1| (NC_000961) hypothetical protein [P...    56   8e-08
gi|18313751|ref|NP_560418.1| (NC_003364) translation elongation ...    42   9e-04
gi|729396|sp|P41203|EF1A_DESMO Elongation factor 1-alpha (EF-1-a...    40   0.007
gi|1361925|pir||S54734 translation elongation factor aEF-1 alpha...    39   0.008
gi|18312680|ref|NP_559347.1| (NC_003364) translation initiation ...    37   0.060

QUERY       3 GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V 59
18977279    2 GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V 58
14521217    1 -MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V 53
14591052    1 -MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V 52
18313751  243 --------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V 274
729396    236 --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV 268
1361925   239 --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV 271
18312680  487 -----------------------------------IVGV-KVL-AGTIKPGVT----L-V 504

QUERY      60 --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT 109
18977279   59 --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT 108
14521217   54 --RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF-- 100
14591052   53 --KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY-- 99
18313751  275 VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL----- 322
729396    269 --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------ 314
1361925   272 --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------ 317
18312680  505 --KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY-- 555
First run BLAST. Output to MSA.
GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V
-MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V
-MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V
--------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV
-----------------------------------IVGV-KVL-AGTIKPGVT----L-V

--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT
--RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF--
--KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY--
VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL-----
--FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------
--FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------
--KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY--
[Figure: profile HMM drawn along the parent sequence, with begin and end states and Match, Insert, and Delete states]
Assign parent sequence. Assign Insertion and Deletion states according to where gaps appear in the MSA.
Steps in making a profile HMM from a MSA
1. Choose a sequence to be the parent sequence, s (defines the number of positions).
2. Read each sequence in the MSA, i.
2.1. Create an Insert state if seq i has an insertion relative to seq s.
2.2. Create a Delete state if seq i has a deletion relative to seq s.
3. At this point every position in every sequence is assigned to one Markov state.
4. Sum the emission profile for each state.
5. Sum the transition probabilities.
Picking a parent sequence (does it matter?)
• The parent defines the number of Match states
• A Match state should conserve the chemical nature of the sidechain as much as possible.
• A Match state implies structural similarity.
Which sequence is the best parent?
Filling in state probabilities
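[Figure: parameter matrix. Columns: parent positions I E G K I G K V K K. Rows: DD, DM, II, Iself, IM, MD, MI, MM, then one row per amino acid A C D E F G H I K L M N P Q R S T V W Y.]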
M=match state, I=insert state, D=delete state, MM=match-to-match, MI=match-to-insertion, MD=match-to-deletion, IM=insertion-to-match, Iself=I-to-I non-advancing, II=insertion-to-insertion, DM=deletion-to-match, DD=deletion-to-deletion.
Each element in the matrix is a probability, either an emission probability (M state) or a transition probability.
[Figure: HMM with M, D, and I states, drawn with arrows against the parent sequence]
Pseudocode, setting HMM parameters based on MSA.
Choose parent sequence ==> parent
Assign match states ==> m, to non-gap columns ==> c, of parent.
m = 0
For each column, c
  If (parent(c) == “.”) {  \\ c is an insert state
    II(m) = II(m) + Count(.NOT.gap@c+1)  \\ next col is match
    ID(m) = Count(gap@m+1)  \\ next char is gap
    IM(m) = Count(.NOT.gap@m+1)  \\ next char is not gap
  } else {  \\ c is match state m
    m++  \\ next match state
    for each amino acid, aa { profile(m,aa) = Count(aa@c) }  \\ use unweighted counts
    MD(m) = Count(gap@m+1)  \\ gaps at next match state
    MI(m) = parent(c+1) == “.” ? Count(.NOT.gap@c+1) : 0  \\ if next col is not a match state, count insertions
    MM(m) = Count(.NOT.gap@m+1)  \\ non-gaps at next match state
    DD(m) = Count(gap@m+1)  \\ gaps at next match state
    DM(m) = Count(.NOT.gap@m+1)  \\ non-gaps at next match state
  }
Normalize (MD+MI+MM)=1, (DD+DM)=1, (II+ID+IM)=1.  \\ convert counts to probabilities
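A runnable sketch of the same idea follows. It is a variant, not a transcription of the pseudocode: instead of counting column by column, it traces each sequence through the M, I, and D states defined by the parent and counts the transitions along that path. The toy MSA is hypothetical (it is not the exercise alignment), counts are unweighted, and no pseudocounts are added.

```python
from collections import Counter, defaultdict


def profile_hmm_counts(msa, parent_index=0, gap="-"):
    """Count match emissions and state transitions implied by an MSA."""
    parent = msa[parent_index]
    # Match columns are those where the parent has a residue.
    is_match = [ch != gap for ch in parent]
    emissions = defaultdict(Counter)    # match position -> residue counts
    transitions = defaultdict(Counter)  # (state, position) -> next-state counts
    for seq in msa:
        prev = ("B", 0)  # begin state
        m = 0
        for col, ch in enumerate(seq):
            if is_match[col]:
                m += 1
                if ch != gap:
                    state = ("M", m)
                    emissions[m][ch] += 1
                else:
                    state = ("D", m)
            elif ch != gap:
                state = ("I", m)  # insertion after match position m
            else:
                continue          # gap in an insert column: no state visited
            transitions[prev][state[0]] += 1
            prev = state
        transitions[prev]["E"] += 1  # end state
    return emissions, transitions


def normalize(counter):
    """Convert counts to probabilities."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}


toy_msa = ["AC-GT", "ACTGT", "A--GT", "TC-GA"]  # hypothetical alignment
emissions, transitions = profile_hmm_counts(toy_msa)
print(normalize(emissions[1]))
print(normalize(transitions[("M", 1)]))
print(normalize(transitions[("M", 2)]))
```

With the first sequence as parent, column 3 becomes an insert column and the gap in the third sequence becomes a visit to a delete state. States that are never visited get no counts, which is why real profile HMMs add pseudocounts before normalizing.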
Exercise 10: make a profile HMM
AGF---PDG
AGGYL-PDG
AG----PNG
SGFFLIPNG
SGF--EPNG
•Use seq 1 as parent.
•Draw non-zero Match, Delete, Insert states and transitions.
•Assign probabilities to all non-zero transitions.
•Draw a Logo for the Match state profile.
Added information
With Profile HMMs we allow insertions and deletions to have position-specific probabilities, and insertions have lengths.
Many uses of HMMs
Weather prediction
Ecosystem modeling
Brain activity
Language structure
Econometrics
etc etc
HMMs can be applied to any dataset that can be represented as strings.
The expert input is the “topology”, or how the states are connected.
Homolog detection using a library of profile HMMs
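MYSEQUENCE
[Figure: the query sequence s is scored against each profile HMM in the library, λ1 to λ4, giving P(s|λ1), P(s|λ2), P(s|λ3), and P(s|λ4).]
Get P(s|λ) for each λ.
Pick the model with the max P.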
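P(s|λ) can be computed with the forward algorithm, which sums over all state paths that could have emitted s. The sketch below uses a plain HMM without begin, end, insert, or delete states, and a made-up two-model library; it assumes a non-empty sequence and works with raw probabilities, which would underflow on long sequences.

```python
def forward_probability(seq, init, trans, emit):
    """P(seq | model) by the forward algorithm, summing over all state paths."""
    states = list(init)
    # alpha[k] = probability of the prefix so far, ending in state k.
    alpha = {k: init[k] * emit[k].get(seq[0], 0.0) for k in states}
    for ch in seq[1:]:
        alpha = {
            k: emit[k].get(ch, 0.0)
            * sum(alpha[j] * trans[j].get(k, 0.0) for j in states)
            for k in states
        }
    return sum(alpha.values())


def best_model(seq, library):
    """Return the name of the model with the highest P(seq | model)."""
    return max(library, key=lambda name: forward_probability(seq, *library[name]))


# Hypothetical two-model library; each entry is (init, trans, emit).
LIBRARY = {
    "lambda1": ({"x": 1.0}, {"x": {"x": 1.0}},
                {"x": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}),
    "lambda2": ({"x": 1.0}, {"x": {"x": 1.0}},
                {"x": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}),
}
print(best_model("ATTATA", LIBRARY))
```

Practical tools typically work in log space and report scores relative to a null model rather than comparing raw probabilities.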
HMMER2.0 [2.3.2]
NAME  hmm_profile
LENG  209
ALPH  Amino
RF    no
CS    no
MAP   yes
NSEQ  10
DATE  Wed Oct 21 11:00:59 2015
CKSUM 3231
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD   -78.931824 0.205202
HMM   A C D E F G H I K L M N P Q R S T V W Y
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
      -778 * -1263
  1   115 4272 -1536 -1390 -774 -663 -853 -13 -1051 -600 -229 -921 -1200 -993 -1046 -212 -155 161 -1134 -839 73
  -   -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
  -   -49 -5474 -6516 -894 -1115 -701 -1378 -778 *
PFAM: Profile HMM libraries made by HMMER
[Figure: profile HMM with D, M, and I states]