Sequence Analysis '18 lecture 18 - Rensselaer Polytechnic Institute · 2018-10-25 · Sequence...

Sequence Analysis '18 lecture 18

Repeats sequences

Hidden Markov Models

Repeats, Satellites & Transposable

Elements

Transposable elements: junk dealers

Barbara McClintock Transposase, transposasome

Transposable elements “jumping genes” lead to rapid germline variation.

“Out standing in her field”

Excision of transposon may leave a “scar”.

TR TRIR IR

cruciform structure

repaired DNA with copied TR and added IR

TR=tandem repeatIR=inverted repeat

Millions of years of accumulated TE “scars”

Some genomes contain a large accumulation of transposon scars.

Estimated Transposable element-associated DNA content in selected genomes

H.sapiens Z. mays Drosophila Arabidopsis C. elegans S. cerevisiae

Everything else

TEs35%

>50%

2%15% 1.8% 3.1%

ACRONYMS for satellites and transposonsSSR Short Sequence RepeatSTR Short Tandem RepeatVNTR Variable Number Tandem RepeatLTR Long Terminal RepeatLINE Long Interspersed Nuclear ElementSINE Short Interspersed Nuclear ElementMITE Miniature Inverted repeat Transposable Element (class III TE)TE Transposable ElementIS Insertion SequenceIR Inverted RepeatRT Reverse TranscriptaseTPase TransposaseAlu 11% of primate genome (SINE)LINE1 14.6% of human genomeTn7,Tn3,Tn10,Mu,IS50 transposons or transposable bacteriophageretroposon=retrotransposon

Class I TE, uses RT.Class II TE, uses TPase.Class III TE, MITEs*

*Cl,ass III are now merged with Class II TEs.

fun with bioinformatics jargon

Types of repeat sequencesSatellites -- 1000+ bp in heterochromatin: centromeres, telomeres

Simple Sequence Repeats (SSRs), in euchromatin :

Minisatellites -- ~15bp (VNTR)

Microsatellites -- 2-6 bp

heterochromatin=compact, light bandseuchromatin=loose, dark bands.

microsatellite

541 gagccactag tgcttcattc tctcgctcct actagaatga acccaagatt gcccaggccc 601 aggtgtgtgt gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt gtatagcaga gatggtttcc 661 taaagtaggc agtcagtcaa cagtaagaac ttggtgccgg aggtttgggg tcctggccct 721 gccactggtt ggagagctga tccgcaagct gcaagacctc tctatgcttt ggttctctaa 781 ccgatcaaat aagcataagg tcttccaacc actagcattt ctgtcataaa atgagcactg 841 tcctatttcc aagctgtggg gtcttgagga gatcatttca ctggccggac cccatttcac

a microsatellite in a dog (canis familiaris) gene.

Minisatellite1 tgattggtct ctctgccacc gggagatttc cttatttgga ggtgatggag gatttcagga

61 tttgggggat tttaggatta taggattacg ggattttagg gttctaggat tttaggatta 121 tggtatttta ggatttactt gattttggga ttttaggatt gagggatttt agggtttcag 181 gatttcggga tttcaggatt ttaagttttc ttgattttat gattttaaga ttttaggatt 241 tacttgattt tgggatttta ggattacggg attttagggt ttcaggattt cgggatttca 301 ggattttaag ttttcttgat tttatgattt taagatttta ggatttactt gattttggga 361 ttttaggatt acgggatttt agggtgctca ctatttatag aactttcatg gtttaacata 421 ctgaatataa atgctctgct gctctcgctg atgtcattgt tctcataata cgttcctttg

This 8bp tandem repeat has a consensus sequence AGGATTTT,

but is almost never a perfect match to the consensus.

How do you recognize a repeat sequence?

•High scoring self-alignments

•High dot plot density

•Compositional bias

A repeat region in a dot plot.

Is there an evolutionary advantage of repeat sequences?

Repeat sequences are prone to

(1) locally: errors in replication

(2) non-locally: homologous recombination

Errors in replication (polymerase slippage) can lead to a change in the reading frame, eliminating a STOP codon, adding one, or translating to a different sequence entirely.

Neisseriae Gonorrheoae evades the human immune system by randomly, periodically (weeks) changing the reading frame of the pilin surface antigen protein, aided by a 4-base microrepeat.

(How) do you align repeat sequences?

B: Dynamic Programming with special EVD. Align just like any other sequence, but using a special null model to assess the significance of the alignment score. Use EVD to fit random scores.

Remember: Low complexity sequences will have high-scoring alignments randomly. For example:

ATTTATATAATTAATATATAAATATAATAAATAT aligned to

TATTATATATATATATATATTATATATATATATA

Random score is likely to have >50% identity!

A: Don’t align. Mask them out instead.

www.repeatmasker.org

Annotation Results SW perc perc perc query position in query matching repeat position in repeat score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID Overlap

194 10.5 2.6 0.0 chr1 1031265 1031302 (244491545) + C-rich Low_complexity 3 41 (0) 624 0 238 26.4 0.7 0.7 chr1 1031638 1031782 (244491065) + (TG)n Simple_repeat 1 145 (0) 625 0 298 29.0 2.1 0.0 chr1 1031794 1031886 (244490961) + (CGTG)n Simple_repeat 3 97 (0) 626 0 255 23.1 1.8 1.8 chr1 1031900 1032062 (244490785) + (TG)n Simple_repeat 1 163 (0) 627 0 1864 13.8 0.0 0.7 chr1 1032330 1032614 (244490233) + AluJo SINE/Alu 5 287 (25) 628 0

Compares your seqeunce to a curated library of known repeats to a query sequence: Returns: (1) Location and type of each repeat, and/or

(2) Query sequence with repeats masked (set to “N”)

Ariana Smit, Phil Green

What is a stochastic emitter?

modelrandom numbers

sequence data

Markov models

A Markov “chain” is a network of “states” (circles) connected by “transitions” (arrows)

The probability of the next Markov state depends only on the current state.

H E

L

Each state has a different “state identifier”: H=helix, E=beta sheet, L=loop

�17

Markov model

how much wood would a wood chuck chuck if a wood chuck would chuck wood?

Exercise: MMs for tongue-twisters.

can you can a can as a canner can can a can?

I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I won’t wish the wish you wish to wish.

H E

L

The marble bag represents a probability distribution (PD) of amino acids, b. ( A PD of amino acids is a profile )

bH(i)

HMM states “emit” one characterHidden Markov models

The probability of the next Markov state depends on the current state and the sequence.

Each state has a unique “state identifier” and a PD (or multiple PDs)

�20

Hidden Markov Model

Begin

End

HMMs consist of…

• Non-emitting source states (begin) • Emitting states • Non-emitting connector states • Non-emitting sink states (end)

How to make a HMM from Blast results Score ESequences producing significant alignments: (bits) Value

gi|18977279|ref|NP_578636.1| (NC_003413) hypothetical protein [P... 136 5e-32gi|14521217|ref|NP_126692.1| (NC_000868) hypothetical protein [P... 59 8e-09gi|14591052|ref|NP_143127.1| (NC_000961) hypothetical protein [P... 56 8e-08gi|18313751|ref|NP_560418.1| (NC_003364) translation elongation ... 42 9e-04gi|729396|sp|P41203|EF1A_DESMO Elongation factor 1-alpha (EF-1-a... 40 0.007gi|1361925|pir||S54734 translation elongation factor aEF-1 alpha... 39 0.008gi|18312680|ref|NP_559347.1| (NC_003364) translation initiation ... 37 0.060

QUERY 3 GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V 5918977279 2 GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V 5814521217 1 -MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V 5314591052 1 -MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V 5218313751 243 --------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V 274729396 236 --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV 2681361925 239 --------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV 27118312680 487 -----------------------------------IVGV-KVL-AGTIKPGVT----L-V 504

QUERY 60 --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT 10918977279 59 --KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT 10814521217 54 --RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF-- 10014591052 53 --KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY-- 9918313751 275 VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL----- 322729396 269 --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------ 3141361925 272 --FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV------ 31718312680 505 --KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY-- 555

First run BLAST. Output to MSA.

GLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-VGLFDFLKRKEVKEEEKIEILSKKPAGKVVVEEVVNIMGK-DVI-IGTVESGMIGVGFK-V-MLGFFRRKKKEEEEKI---TGKPVGKVKVENILIVGFK-TVI-ICEVLEGMVKVGYK-V-MFKFFKRKGEDEKD----VTGKPVGKVKVESILKVGFR-DVI-ICEVLEGIVKVGYK-V--------------------------RMPIQDVFTITGAGTVV-VGRVETGVLKVGDR-V--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV--------------------------RIPIQDVYNISGI-GVVPVGRVETGVLKVGDKLV-----------------------------------IVGV-KVL-AGTIKPGVT----L-V

--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT--KGPSGIGGIVR-IERNREKVEFAIAGDRIGISIEGKI---GK--VKKGDVLEIYQT--RKGKKVAGIVS-MEREHKKVEFAIPGDKIGIMLEKNI---G---AEKGDILEVF----KKGKKVAGIVS-MEREHKKIEFAIPGDRVGMMLEKNI----N--AEKDDILEVY--VIVPPAKVGDVRS-IETHHMKLEQAQPGDNIGVNVRG-I---AKEDVKRGDVL-------FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV--------FMPAGLVAEVKTIETHHTKIEKAEPGDNIGFNVKGVE---KKD-IKRGDV--------KDGREVGRIMQ-IQKTGRAINEAAAGDEVAISIHGDVIVGRQ--IKEGDILYVY--

begin

end

Match

Insert

Delete

parent

Assign parent sequence. Assign Insertion and Deletion states according to where gaps appear in the MSA.

�24

Steps in making a profile HMM from a MSA

1.Choose a sequence to be the parent sequence, s (defines number of positions)

2.Read each sequence in MSA, i2.1.Create an Insert state if seq i has an insertion relative

to seq s.2.2.Create a Delete state is seq i has a deletion relative to

seq s.3.At this point every position in every sequence is assigned

to one Markov state.

4.Sum the emmission profile for each state

5.Sum the transition probabilities.

Picking a parent sequence (does it matter?)

• The parent defines the number of Match states

• A Match state should conserve the chemical nature of the sidechain as much as possible.

• A Match state implies structural similarity.

Which sequence is the best parent?

�26

Filling in state probabilities

I E G K I G K V K KDDDMIIIselfIMMDMIMMACDEFGHIKLMNPQRSTVWY

M=match state, I=insert state, D=delete state, MM=match-to-match, MI=match-to-insertion, MD=match-to-deletion,IM=insertion-to-deletion, Iself=I-to-I non-advancing, II=insertion-to-insertion, DM=deletion-to-insertion, DD=deletion-to-deletion.

Each element in the matrix is a probability, either an emission probability (M state) or a transition probability.

M

arrowsD

M

I

parent

HMM

�27

Pseudocode, setting HMM parameters based on MSA.

Choose parent sequence==> parentAssign match states==> m, to non-gap columns==> c, of parent.m=0For each column, cIf (parent(c) == “.”) { \\ c is an insert stateII(m) = II(m) + Count(.NOT.gap@c+1) \\ next col is matchID(m) = Count(gap@m+1) \\ next char is gapIM(m) = Count(.NOT.gap@m+1) \\ next char is not gap

} else { \\ c is match state mm++ \\ next match statefor each amino acid, aa { profile(m,aa) = Count(aa@c) } \\ use unweighted countsMD(m) = Count(gap@m+1) \\ gaps at next match stateMI(m) = parent(c+1) == “.”? Count(.NOT.gap@c+1)| 0

\\ if next col is not a match state, count insertionsMM(m) = Count(.NOT.gap@m+1) \\ non-gaps at next match stateDD(m) = Count(gap@m+1) \\ gaps at next match state DM(m) = Count(.NOT.gap@m+1) \\ non-gaps at next match state

}Normalize(MD+MI+MM)=1,(DD+DM)=1, (II+ID+IM)=1. \\ convert counts to probabilities

Exercise 10: make a profile HMMAGF---PDGAGGYL-PDGAG----PNGSGFFLIPNGSGF--EPNG

•Use seq 1 as parent.

•Draw non-zero Match, Delete, Insert states and transitions.

•Assign probabilities to all non-zero transitions.

•Draw a Logo for the Match state profile.

Added information

With Profile HMMs we allow insertions and deletions to have position-specific probabilities, and insertions have lengths.

Many uses of HMMs

Weather prediction

Ecosystem modeling

Brain activity

Language structure

Econometrics

etc etc

HMMs can be applied to any dataset that can be represented as strings.

The expert input is the “topology”, or how the states are connected.

Homolog detection using a library of profile HMMs

MYSEQUENCE

2

1

3

4

P(s|λ2)

P(s|λ3)

P(s|λ4)

P(s|λ1)

Pick the model w

ith the max P

Get P(S|λ) for each λ

HMMER2.0 [2.3.2]NAME hmm_profileLENG 209ALPH AminoRF noCS noMAP yesNSEQ 10DATE Wed Oct 21 11:00:59 2015CKSUM 3231XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -78.931824 0.205202HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -778 * -1263 1 115 4272 -1536 -1390 -774 -663 -853 -13 -1051 -600 -229 -921 -1200 -993 -1046 -212 -155 161 -1134 -839 73 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -49 -5474 -6516 -894 -1115 -701 -1378 -778 *

PFAM: Profile HMM libraries made by HMMER

D

M

I

Sequence Analysis '18 lecture 18 - Rensselaer Polytechnic Institute · 2018-10-25 · Sequence...

Documents

Transcript of Sequence Analysis '18 lecture 18 - Rensselaer Polytechnic Institute · 2018-10-25 · Sequence...