Motif Finding

Post on 08-Jan-2016

Motif Finding

PSSMs

Expectation Maximization

Gibbs Sampling

Complexity of Transcription

Representing Binding Sites for a TF

A set of sites represented as a consensus VDRTWRWWSHD (IUPAC degenerate DNA)

A 14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C  3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G  4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T  0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

A matrix describing a set of sites

A single site AAGTTAATGA

Set of binding sites

AAGTTAATGA
CAGTTAATAA
GAGTTAAACA
CAGTTAATTA
GAGTTAATAA
CAGTTATTCA
GAGTTAATAA
CAGTTAATCA
AGATTAAAGA
AAGTTAACGA
AGGTTAACGA
ATGTTGATGA
AAGTTAATGA
AAGTTAACGA
AAATTAATGA
GAGTTAATGA
AAGTTAATCA
AAGTTGATGA
AAATTAATGA
ATGTTAATGA
AAGTAAATGA
AAGTTAATGA
AAGTTAATGA
AAATTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA
AAGTTAATGA

Nucleic acid codes

code description

A Adenine

C Cytosine

G Guanine

T Thymine

U Uracil

R Purine (A or G)

Y Pyrimidine (C, T, or U)

M C or A

K T, U, or G

W T, U, or A

S C or G

B C, T, U, or G (not A)

D A, T, U, or G (not C)

H A, T, U, or C (not G)

V A, C, or G (not T, not U)

N Any base (A, C, G, T, or U)
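The code table above is enough to turn an IUPAC consensus into a pattern that can be searched for. The following is a minimal Python sketch, not part of the original slides; the function names `consensus_to_regex` and `find_consensus` are made up for illustration, and only the DNA codes are included (U is left out).

```python
import re

# IUPAC degenerate DNA codes mapped to the bases they stand for (DNA only, so U is omitted).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "W": "AT", "S": "CG", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join("[" + IUPAC[code] + "]" for code in consensus)

def find_consensus(consensus, sequence):
    """Return the 0-based start of every match, including overlapping ones."""
    # The lookahead makes each match zero-width so overlapping sites are reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence)]
```

A call such as `find_consensus("VDRTWRWWSHD", genome)` would list every position compatible with the consensus shown earlier; a consensus search is all-or-nothing, which is the limitation the matrix representation addresses.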

From frequencies to log scores

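f matrix

A 5 0 1 0 0
C 0 2 2 4 0
G 0 3 1 0 4
T 0 0 1 1 1

w matrix

A  1.6 -1.7 -0.2 -1.7 -1.7
C -1.7  0.5  0.5  1.3 -1.7
G -1.7  1.0 -0.2 -1.7  1.3
T -1.7 -1.7 -0.2 -0.2 -0.2

$$w(b,i) = \log \frac{f(b,i) + s(N)}{p(b)}$$

where f(b,i) is the frequency of base b at position i, s(N) is a pseudocount that depends on the number of sites N, and p(b) is the background frequency of base b.

Example: TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9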
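The slide does not say which pseudocount function s(N) was used. As a hedged Python sketch: assuming a total pseudocount of sqrt(N) split according to the background, a uniform background of 0.25 and log base 2, the f matrix reproduces the w matrix to one decimal place. These choices are assumptions that happen to fit the numbers, not something the slide states.

```python
import math

# f matrix from the slide: rows are bases, columns are positions 1-5.
counts = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
background = {b: 0.25 for b in "ACGT"}  # assumed uniform background p(b)

def log_score_matrix(counts, background):
    """Convert counts to log2 scores using an assumed sqrt(N) total pseudocount."""
    n_sites = sum(row[0] for row in counts.values())  # N, taken from column 1
    total_pseudo = math.sqrt(n_sites)                 # assumption, see lead-in
    w = {}
    for base, row in counts.items():
        pseudo = total_pseudo * background[base]
        w[base] = [
            math.log2(((c + pseudo) / (n_sites + total_pseudo)) / background[base])
            for c in row
        ]
    return w

w = log_score_matrix(counts, background)
score = sum(w[base][i] for i, base in enumerate("TGCTG"))
print(round(w["A"][0], 1), round(score, 1))
```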

TFs do not act alone

http://www.bioinformatics.ca/

PSSMs for Liver TFs…

HNF1

C/EBP

HNF3

HNF4

PSSMs for Helix-Turn-Helix Motif

Promoter…

Promoter Weight Matrices (PWM)

E.Coli PWMs

Motif Logo

Motifs can mutate on less important bases. The five motifs below have mutations in positions 3 and 5.

Position: 1234567
          TGGGGGA
          TGAGAGA
          TGGGGGA
          TGAGAGA
          TGAGGGA

Representations called motif logos illustrate the conserved regions of a motif.

http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html

Example: Calmodulin-Binding Motif (calcium-binding proteins)

Sequence Motifs

• Motifs represent a short common sequence
– Regulatory motifs (TF binding sites)
– Functional site in proteins (DNA binding motif)

http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html

Regulatory Motifs

Transcription factors bind to regulatory motifs
– Motifs are 6 – 20 nucleotides long
– Activators and repressors
– Usually located near the target gene, mostly upstream

Challenges

How to recognize a regulatory motif?

Can we identify new occurrences of known motifs in genome sequences?

Can we discover new motifs within upstream sequences of genes?

Motif Representation

Exact motif: CGGATATA

Consensus: represents only the deterministic nucleotides. Example: HAP1 binding sites in 5 sequences give the consensus motif CGGNNNTANCGG, where N stands for any nucleotide.

Representing only the consensus loses information. How can this be avoided?

CGGATATACCGG

CGGTGATAGCGG

CGGTACTAACGG

CGGCGGTAACGG

CGGCCCTAACGG

------------

CGGNNNTANCGG
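The consensus line under the five HAP1 sites can be derived mechanically. A minimal Python sketch, assuming the strict rule used here (keep a base only where all sites agree, otherwise write N); the function name `strict_consensus` is illustrative.

```python
# The five aligned HAP1 binding sites from the slide.
sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

def strict_consensus(sites):
    """Keep a base only where every site agrees; write N elsewhere."""
    consensus = []
    for column in zip(*sites):  # one tuple of bases per alignment column
        consensus.append(column[0] if len(set(column)) == 1 else "N")
    return "".join(consensus)

print(strict_consensus(sites))  # CGGNNNTANCGG
```

Every N column discards the observed base distribution, which is the information loss that a position-specific matrix keeps.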

PSPM – Position Specific Probability Matrix

Represents a motif of length k (here k = 5). Count the number of occurrences of each nucleotide in each position:

   1  2  3  4  5
A 10 25  5 70 60
C 30 25 80 10 15
T 50 25  5 10  5
G 10 25 10 10 20

PSPM – Position Specific Probability Matrix

Defines Pi(n), n ∈ {A,C,G,T}, for i = 1,…,k. Pi(A) is the frequency of nucleotide A in position i.

  1   2    3    4   5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8  0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1  0.1 0.2

Identification of Known Motifs within Genomic Sequences

Motivation:
– Identification of new genes controlled by the same TF.
– Infer the function of these genes.
– Enable better understanding of the regulation mechanism.

Each k-mer is assigned a probability under the PSPM. Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014
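The two steps so far (counts to frequencies, then the probability of a k-mer) can be sketched in Python as follows. This is an illustration using the count table from the slides; the function names are hypothetical.

```python
import math

# Counts from the slide: rows are nucleotides, columns are motif positions 1-5.
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}

def counts_to_pspm(counts):
    """Divide every count by its column total to get per-position frequencies."""
    width = len(next(iter(counts.values())))
    totals = [sum(counts[b][i] for b in counts) for i in range(width)]
    return {b: [counts[b][i] / totals[i] for i in range(width)] for b in counts}

def kmer_probability(pspm, kmer):
    """Multiply the position-specific frequencies of each base in the k-mer."""
    return math.prod(pspm[base][i] for i, base in enumerate(kmer))

pspm = counts_to_pspm(counts)
print(kmer_probability(pspm, "TCCAG"))  # approximately 0.014
```

Multiplying the columns independently assumes that positions in the motif do not influence each other, which is the usual simplification behind this kind of matrix.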

Detecting a Known Motif within a Sequence using PSPM

The PSPM is moved along the query sequence. At each position the sub-sequence is scored for a match to the PSPM.

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4

Position 2: TGCAA 0.5*0.25*0.8*0.7*0.6 = 0.042
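The sliding-window scan described above can be sketched in Python as follows. It uses the PSPM from the slides and the example sequence ATGCAAGTCT; the function name `scan_with_pspm` is illustrative.

```python
import math

pspm = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def scan_with_pspm(pspm, sequence):
    """Score every window of motif width: (1-based position, window, probability)."""
    width = len(pspm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        prob = math.prod(pspm[base][i] for i, base in enumerate(window))
        hits.append((start + 1, window, prob))
    return hits

hits = scan_with_pspm(pspm, "ATGCAAGTCT")
best = max(hits, key=lambda hit: hit[2])
print(best[0], best[1])  # 2 TGCAA
```

A sequence of length L has L - k + 1 windows for a motif of length k, so the ten-base example yields six scores.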

Detecting a Known Motif within a Sequence using PSSM

Is it a random match, or is it indeed an occurrence of the motif?

PSPM -> PSSM (Position Specific Scoring Matrix). The odds score matrix Oi(n), where n ∈ {A,C,G,T} and i = 1,…,k, is defined as Pi(n)/P(n), where P(n) is the background frequency of n. As Oi(n) increases, so do the odds that n at position i is part of a real motif.

PSSM as Odds Score Matrix

Assumption: the background frequency of each nucleotide is 0.25. Row A is shown at each stage.

Original PSPM (Pi):

  1   2    3    4   5
A 0.1 0.25 0.05 0.7 0.6

Odds Matrix (Oi):

  1   2 3   4   5
A 0.4 1 0.2 2.8 2.4

Going to log scale we get an additive score. Log odds Matrix (log2 Oi):

  1      2 3      4     5
A -1.322 0 -2.322 1.485 1.263

1 2 3 4 5

A -1.32 0 -2.32 1.48 1.26

C 0.26 0 1.68 -1.32 -0.74

T 1 0 -2.32 -1.32 -2.32

G -1.32 0 -1.32 -1.32 -0.32

Calculating using Log Odds Matrix

A log-odds score of 0 implies a random match; a score > 0 implies a real match (?).

Example: sequence = ATGCAAGTCT…

Position 1: ATGCA -1.32+0-1.32-1.32+1.26 = -2.7, odds = 2^-2.7 = 0.15

Position 2: TGCAA 1+0+1.68+1.48+1.26 = 5.42, odds = 2^5.42 = 42.8

Calculating the probability of a match

ATGCAAG

Position 1 ATGCA = 0.15
Position 2 TGCAA = 42.8
Position 3 GCAAG = 0.18

P(i) = S_i / (∑ S), where S_i is the odds score at position i. Example: P(1) = 0.15/(0.15+42.8+0.18) = 0.003

P(1) = 0.003
P(2) = 0.992
P(3) = 0.004
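The chain from PSPM to log-odds scores, odds and normalized match probabilities can be sketched in Python as below, assuming the uniform background of 0.25 used on the slides. Because the code does not round the log-odds to two decimals first, its odds for position 2 come out near 43.0 rather than 42.8, and the probabilities differ from the slide values in the third decimal.

```python
import math

pspm = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}
BACKGROUND = 0.25  # assumed uniform background, as on the slide

# Log-odds matrix: log2 of (position-specific frequency / background frequency).
log_odds = {b: [math.log2(p / BACKGROUND) for p in row] for b, row in pspm.items()}

def window_log_odds(sequence, width=5):
    """Summed log-odds score of every window of the given width."""
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(log_odds[b][i] for i, b in enumerate(window)))
    return scores

scores = window_log_odds("ATGCAAG")             # three windows
odds = [2 ** s for s in scores]                 # back from log scale to odds
probabilities = [o / sum(odds) for o in odds]   # normalize so they sum to 1
print([round(p, 3) for p in probabilities])
```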

Building a PSSM

Collect all known sequences that bind a certain TF.

Align all sequences (using multiple sequence alignment).

Compute the frequency of each nucleotide in each position (PSPM).

Incorporate background frequency for each nucleotide (PSSM).

Finding new Motifs

We are given a group of genes, which presumably contain a common regulatory motif.

We know nothing of the TF that binds to the putative motif.

The problem: discover the motif.

Example

Predicting the cAMP Receptor Protein (CRP) binding site motif

GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA

Extract experimentally defined CRP Binding Sites

GGATAACAATTTCACATGTGAGCGGATAACAATGTGAGTTAGCTCACTTGTGATCTCTGTTACACGAGGATGAGAACACACTCGGTTTAGTTCACCTGTGACACAGTGCAAACCTGACGGAGTTCACAAGTGTCTATAATCACGTGGAATATCCATCACATGCAAAGGACGTCACGGGCGACCTGGGTCATGTGTGATGTGTATCGAATTTGAACCACATCGCAGGTGAGAGCCATCACATGTAAGCTGTGCCACGTTTATTCCATGTCACGTGTTATACACATCACTCGTGCTCCCACTCGCATGTGATTCGATTCACA

Create a Multiple Sequence Alignment

A C G T

1 -0.43 0.1 -0.46 0.55

2 1.37 0.12 -1.59 -11.2

3 1.69 -1.28 -11.2 -1.43

4 -1.28 0.12 -11.2 1.32

5 0.91 -11.2 -0.46 0.47

6 1.53 -1.38 -1.48 -1.43

7 0.9 -0.48 -11.2 0.12

8 -1.37 -1.28 -11.2 1.68

9 -11.2 -11.2 1.73 -0.56

10 -11.2 -0.51 -11.2 1.72

11 -0.48 -11.2 1.72 -11.2

12 1.56 -1.59 -11.2 -0.46

13 -0.51 -0.38 -0.55 0.88

14 -11.2 0.5 0.57 0.13

15 0.17 -0.51 0.12 0.12

16 0.9 -11.2 0.5 -0.48

17 0.17 0.16 0.06 -0.48

18 -0.4 -0.38 0.82 -0.48

19 -1.38 -1.28 -11.2 1.68

20 -1.48 1.7 -11.2 -1.38

21 1.5 -1.38 -1.43 -1.28

Generate a PSSM

Shannon Entropy

Expected variation per column can be calculated

Low entropy means higher conservation

Entropy

The entropy (H) for a column is:

$$H = -\sum_{a \in \text{residues}} f_a \log(p_a)$$

where a is a residue, f_a is the frequency of residue a in the column, and p_a is the probability of residue a in that column.

Entropy

Entropy measures can determine which evolutionary distance (PAM250, BLOSUM80, etc.) should be used

Entropy yields amount of information per column (discussed with sequence logos in a bit)
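A column's entropy is straightforward to compute. In this minimal Python sketch the observed frequency is used for both f_a and p_a and the logarithm is base 2, so the result is in bits; both choices are assumptions, since the slide leaves them open.

```python
import math

def column_entropy(freqs):
    """Shannon entropy in bits of one alignment column, given residue frequencies."""
    # Residues with frequency 0 contribute nothing (the limit of p*log p is 0).
    return -sum(p * math.log2(p) for p in freqs if p > 0)

print(column_entropy([1.0, 0.0, 0.0, 0.0]))      # fully conserved column
print(column_entropy([0.5, 0.5, 0.0, 0.0]))      # two equally likely bases
print(column_entropy([0.25, 0.25, 0.25, 0.25]))  # no conservation
```

The three calls illustrate the range for DNA: 0 bits for a fully conserved column, 1 bit for two equally likely bases, and 2 bits when all four bases are equally likely.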

Log-odds score

Profiles can also indicate log-odds score: Log2(observed:expected)

Result is a bit score

Matlab

multialign

1 Enter an array of sequences.

seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};

2 Promote terminations with gaps in the alignment.

multialign(seqs,'terminalGapAdjust',true)

ans =

--CACGTAACATCTC--
ACGACGTAACATCTTCT
-AAACGTAACATCTCGC

Matlab

3 Compare alignment without termination gap adjustment.

multialign(seqs)

ans =

CA--CGTAACATCT--C

ACGACGTAACATCTTCT

AA-ACGTAACATCTCGC

Matlab

>> a={'ATATAGGAG','AATTATAGA','TTAGAGAAA'}

a =

'ATATAGGAG' 'AATTATAGA' 'TTAGAGAAA'

Char function

>> cseq=char(a)

cseq =

ATATAGGAG

AATTATAGA

TTAGAGAAA

Double function

>> intseq=double(cseq)

intseq =

65 84 65 84 65 71 71 65 71

65 65 84 84 65 84 65 71 65

84 84 65 71 65 71 65 65 65

double

>> double('A')
ans = 65
>> double('C')
ans = 67
>> double('G')
ans = 71
>> double('T')
ans = 84

Initiate PSPM matrix

>> Pspm=zeros(4,size(intseq,2))   % rows: A, C, G, T; one column per alignment position

Pspm =

0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0

Use a for loop to count each nucleotide at each position

>> for i = 1:size(intseq,2)          % size(...,2) is the number of columns; length() returns the largest dimension
Pspm(1,i)=length(find(intseq(:,i)==65));   % A
Pspm(2,i)=length(find(intseq(:,i)==67));   % C
Pspm(3,i)=length(find(intseq(:,i)==71));   % G
Pspm(4,i)=length(find(intseq(:,i)==84));   % T
end
>> Pspm

Pspm =

2 1 2 0 3 0 2 2 2
0 0 0 0 0 0 0 0 0
0 0 0 1 0 2 1 1 1
1 2 1 2 0 1 0 0 0

Add pseudocounts

>> Pspmp=Pspm+1

Pspmp =

3 2 3 1 4 1 3 3 3

1 1 1 1 1 1 1 1 1

1 1 1 2 1 3 2 2 2

2 3 2 3 1 2 1 1 1

Normalize to get frequencies

>> Pspmnorm=Pspmp./repmat(sum(Pspmp),4,1)

Pspmnorm =

Columns 1 through 7

0.4286 0.2857 0.4286 0.1429 0.5714 0.1429 0.4286
0.1429 0.1429 0.1429 0.1429 0.1429 0.1429 0.1429
0.1429 0.1429 0.1429 0.2857 0.1429 0.4286 0.2857
0.2857 0.4286 0.2857 0.4286 0.1429 0.2857 0.1429

Columns 8 through 9

0.4286 0.4286
0.1429 0.1429
0.2857 0.2857
0.1429 0.1429

Calculate odds score

>> Pswm=Pspmnorm/0.25

Pswm =

Columns 1 through 7

1.7143 1.1429 1.7143 0.5714 2.2857 0.5714 1.7143
0.5714 0.5714 0.5714 0.5714 0.5714 0.5714 0.5714
0.5714 0.5714 0.5714 1.1429 0.5714 1.7143 1.1429
1.1429 1.7143 1.1429 1.7143 0.5714 1.1429 0.5714

Columns 8 through 9

1.7143 1.7143
0.5714 0.5714
1.1429 1.1429
0.5714 0.5714

Log odds ratio

>> logPswm=log2(Pswm)

logPswm =

Columns 1 through 7

 0.7776  0.1926  0.7776 -0.8074  1.1926 -0.8074  0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074  0.1926 -0.8074  0.7776  0.1926
 0.1926  0.7776  0.1926  0.7776 -0.8074  0.1926 -0.8074

Columns 8 through 9

 0.7776  0.7776
-0.8074 -0.8074
 0.1926  0.1926
-0.8074 -0.8074

Estimate the probability of the given sequence to belong to the defined PSWM

>> Unknown='TTAAGAAGG'

Unknown =

TTAAGAAGG

>> intunknown=double(Unknown)

intunknown =

84 84 65 65 71 65 65 71 71

Get the index of the PSWM for the unknown sequence

>> intunknown(intunknown==65)=1;   % A -> row 1
>> intunknown(intunknown==67)=2;   % C -> row 2
>> intunknown(intunknown==71)=3;   % G -> row 3
>> intunknown(intunknown==84)=4;   % T -> row 4
>> intunknown

intunknown =

4 4 1 1 3 1 1 3 3

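Calculate the log odds-ratio of the Unknown 'TTAAGAAGG'

>> % Pair each row index with its column (position). logPswm(intunknown) alone
>> % would use linear indexing and read every value from column 1.
>> idx=sub2ind(size(logPswm),intunknown,1:length(intunknown));
>> logunknown=logPswm(idx)

logunknown =

Columns 1 through 7

0.1926 0.7776 0.7776 -0.8074 -0.8074 -0.8074 0.7776

Columns 8 through 9

0.1926 0.1926

>> Punknown=sum(logunknown)

Punknown =

0.4887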

Is this significant score or just random similarity?

>> cseq

cseq =

ATATAGGAG
AATTATAGA
TTAGAGAAA

>> Unknown

Unknown =

TTAAGAAGG

What would be the maximum score?

>> logPswm

logPswm =

Columns 1 through 7

 0.7776  0.1926  0.7776 -0.8074  1.1926 -0.8074  0.7776
-0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
-0.8074 -0.8074 -0.8074  0.1926 -0.8074  0.7776  0.1926
 0.1926  0.7776  0.1926  0.7776 -0.8074  0.1926 -0.8074

Columns 8 through 9

 0.7776  0.7776
-0.8074 -0.8074
 0.1926  0.1926
-0.8074 -0.8074

>> maxscore=max(logPswm)

maxscore =

Columns 1 through 7

0.7776 0.7776 0.7776 0.7776 1.1926 0.7776 0.7776

Columns 8 through 9

0.7776 0.7776

>> totalmaxscore=sum(maxscore)

totalmaxscore =

7.4135

Write a function using the above statements to scan a sequence

Write a function named ‘logodds’ that calculates the log-odds ratio of a given alignment.

Write a function named ‘scanmotif’ that calls the ‘logodds’ to search through a sequence using a sliding window to calculate the logodds of a subsequence and store these scores. The function should allow for selection of a maximum number of locations that are likely to contain the motif based on the scores obtained.
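One possible shape for these two functions is sketched below in Python rather than Matlab, so it is an outline of the logic and not a drop-in answer to the exercise. It assumes a pseudocount of 1 per base and a uniform background of 0.25, as in the session above; the parameter names `pseudocount`, `background` and `max_hits` are illustrative.

```python
import math

BASES = "ACGT"

def logodds(alignment, pseudocount=1.0, background=0.25):
    """Log2-odds matrix (one list per base) for equal-length aligned sequences."""
    width = len(alignment[0])
    matrix = {b: [0.0] * width for b in BASES}
    for i in range(width):
        column = [seq[i] for seq in alignment]
        total = len(column) + pseudocount * len(BASES)
        for b in BASES:
            freq = (column.count(b) + pseudocount) / total
            matrix[b][i] = math.log2(freq / background)
    return matrix

def scanmotif(alignment, sequence, max_hits=3):
    """Slide a window over the sequence; return the best (score, 0-based start) pairs."""
    matrix = logodds(alignment)
    width = len(alignment[0])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[b][i] for i, b in enumerate(window))
        hits.append((score, start))
    hits.sort(reverse=True)  # highest score first
    return hits[:max_hits]
```

For example, `scanmotif(['ATATAGGAG', 'AATTATAGA', 'TTAGAGAAA'], long_sequence, max_hits=5)` would return the five highest-scoring window starts in a hypothetical `long_sequence`.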

Position Specific Scoring Matrices (PSSMs) incorporate information theory to indicate the information contained within each column of a multiple alignment.

Information is a logarithmic transformation of the frequency of each residue in the motif.

PSSMs and Pseudocounts

Problem: PSSMs are only as good as the initial MSA
– Some residues may be underrepresented
– Other columns may be too conserved

Solution: introduce pseudocounts to get a better indication

Pseudocounts

New estimated probability:

$$P_{ca} = \frac{n_{ca} + b_{ca}}{N_c + B_c}$$

P_ca: probability of residue a in column c
n_ca: count of a's in column c
b_ca: pseudocount of a's in column c
N_c: total count in column c
B_c: total pseudocount in column c
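As a small worked instance of the pseudocount estimate, the Python sketch below applies it to the first column of the three-sequence Matlab example (A, A, T) with a pseudocount of 1 per base. The function name `estimated_probability` is illustrative.

```python
def estimated_probability(n_ca, b_ca, N_c, B_c):
    """P_ca = (n_ca + b_ca) / (N_c + B_c)."""
    return (n_ca + b_ca) / (N_c + B_c)

# First column of the alignment ATATAGGAG / AATTATAGA / TTAGAGAAA.
column_counts = {"A": 2, "C": 0, "G": 0, "T": 1}
N_c = sum(column_counts.values())  # 3 observed residues
b_ca = 1                           # pseudocount per residue
B_c = b_ca * len(column_counts)    # 4 pseudocounts in total

probs = {a: estimated_probability(n, b_ca, N_c, B_c) for a, n in column_counts.items()}
print({a: round(p, 4) for a, p in probs.items()})
```

Without pseudocounts C and G would get probability 0 in this column and a log-odds score of minus infinity; with them every residue keeps a small non-zero probability.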

PSSMs and pseudocounts

Probabilities are converted into a log-odds form (usually log2, so the information can be reported in bits) and placed in the PSSM.

Searching PSSMs

The value for the first residue in the sequence is looked up in the first column of the PSSM

The value for the residue occurring in each subsequent column is looked up in the same way

Searching PSSMs

Values are added (since they are logarithms) to produce a summed log-odds score, S

S can be converted to an odds score using the formula 2^S

Odds scores for each position can be summed together and normalized to produce a probability of the motif occurring at each location.

Information in PSSMs

Information theory: amount of information contained within each sequence.

No information: the amount of uncertainty can be measured as log2 20 = 4.32 for amino acids, since there are 20 amino acids. For nucleic acid sequences, the amount of uncertainty can be measured as log2 4 = 2.

Information in PSSMs

If a column is completely conserved then the uncertainty is 0: there is only one choice.

If two residues occur with equal probability, there is uncertainty in deciding which residue it is (log2 2 = 1 bit).

Measure of Uncertainty

Measured as the entropy

$$H_c = -\sum_{a \in \text{residues}} f_{ca} \log(p_{ca})$$

Relative Entropy

Relative entropy takes into account the overall composition of the organism being studied

$$R_c = \sum_{a \in \text{residues}} f_{ca} \log_2(p_{ca} / b_a)$$

b_a is the background frequency of residue a in the organism
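A short Python sketch of relative entropy for one column follows. It uses the observed frequency for both f_ca and p_ca, which is an assumption, and the AT-rich background in the second call is a made-up illustration rather than data from the slides.

```python
import math

def relative_entropy(freqs, background):
    """Sum over residues of f * log2(f / b), in bits, for one alignment column."""
    return sum(
        p * math.log2(p / background[a])
        for a, p in freqs.items()
        if p > 0  # residues that never occur contribute nothing
    )

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}  # hypothetical organism

conserved_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
print(relative_entropy(conserved_a, uniform))  # 2.0 bits
print(relative_entropy(conserved_a, at_rich))  # about 1.32 bits
```

A conserved A is less surprising in an AT-rich genome, so the same column carries less information there than against a uniform background.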

PSSM Uncertainty

Uncertainty for the whole model is summed over all columns:

$$H = \sum_{c \in \text{all columns}} H_c$$

Sequence Logos

Information in PSSMs can be viewed visually

Sequence logos illustrate information in each column of a motif

The height of each logo column is calculated as the amount by which uncertainty has been decreased
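The column heights of a DNA logo can be sketched in Python as the maximum uncertainty (log2 4 = 2 bits) minus the observed column entropy. This sketch uses the five seven-base motifs from the motif logo slide and ignores the small-sample correction that logo software may apply, which is a simplification.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=4):
    """Bits of information in a column: log2(alphabet size) minus its entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

sites = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
heights = [information_content(col) for col in zip(*sites)]
print([round(h, 2) for h in heights])
```

Fully conserved columns reach the full 2 bits, while positions 3 and 5, where G and A both occur, come out at roughly 1 bit.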

Sequence Logos

Statistical Methods

Commonly used methods for locating motifs:

Expectation-Maximization (EM) Gibbs Sampling

Expectation-Maximization

Begin with a set of sequences with an unknown signal in common
– Signal may be subtle
– Approximate length of signal must be given

Randomly assign locations of this motif in each sequence

Expectation-Maximization

Two steps:
– Expectation Step
– Maximization Step

Expectation-Maximization

Expectation step

Residue frequencies for each position are calculated

Residues not in a motif are background

Frequencies are used to determine the probability of finding a site at any position in a sequence that fits the motif model

Maximization Step

Determine location for each sequence that maximally aligns to the motif pattern

Once a new motif location is found for each sequence, the motif pattern is revised in the expectation step

E-M continues until solution converges
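The loop described above can be sketched in Python for the simplest case of one motif occurrence per sequence. This is an illustrative EM-style sketch, not MEME's implementation: it keeps a weight for every candidate start instead of a single best location, assumes a uniform background of 0.25, adds a pseudocount so no frequency is zero, and runs a fixed number of iterations instead of testing convergence. All names and defaults are hypothetical.

```python
import random

BASES = "ACGT"

def em_motif(seqs, width, n_iter=50, pseudo=0.5, seed=0):
    """Return (pspm, weights): per-column base frequencies and per-sequence site weights."""
    rng = random.Random(seed)
    # Start from one random site per sequence (weight 1 on that start, 0 elsewhere).
    weights = []
    for s in seqs:
        z = [0.0] * (len(s) - width + 1)
        z[rng.randrange(len(z))] = 1.0
        weights.append(z)
    for _ in range(n_iter):
        # Re-estimate the motif frequencies from the weighted candidate sites.
        counts = [{b: pseudo for b in BASES} for _ in range(width)]
        for s, z in zip(seqs, weights):
            for start, w in enumerate(z):
                if w > 0:
                    for i in range(width):
                        counts[i][s[start + i]] += w
        pspm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
        # Re-weight every start position by its motif:background odds.
        weights = []
        for s in seqs:
            odds = []
            for start in range(len(s) - width + 1):
                o = 1.0
                for i in range(width):
                    o *= pspm[i][s[start + i]] / 0.25  # assumed uniform background
                odds.append(o)
            total = sum(odds)
            weights.append([o / total for o in odds])
    return pspm, weights
```

The most likely site in each sequence is the start with the largest weight, for example `max(range(len(z)), key=z.__getitem__)` for a weight list `z`. Different seeds can converge to different motifs, so several restarts are usually worth comparing.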

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCTCCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTGTCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAGAAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTCGGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGCAGCCTAGAGTGATGACTCCTATCTGGGTCCCCAGCAGGAGCCTCAGGATCCAGCACACATTATCACAAACTTAGTGTCCACATTATCACAAACTTAGTGTCCATCCATCACTGCTGACCCTTCGGAACAAGGCAAAGGCTATAAAAAAAATTAAGCAGCGCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTTCCCATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGGGATTGGTCACAGCATTTCAAGGGAGAGACCTCATTGTAAGTCCCCAACTCCCAACTGACCTTATCTGTGGGGGAGGCTTTTGACCTTATCTGTGGGGGAGGCTTTTGAAAAGTAATTAGGTTTAGCATTATTTTCCTTATCAGAAGCAGAGAGACAAGCCATTTCTCTTTCCTCCCGGTAGGCTATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTCCCAGCACACACACTTATCCAGTGGTAAATACACATCATTCAAATAGGTACGGATAAGTAGATATTGAAGTAAGGATACTTGGGGTTCCAGTTTGATAAGAAAAGACTTCCTGTGGATGGCCGCAGGAAGGTGGGCCTGGAAGATAACAGCTAGTAGGCTAAGGCCAGCAACCACAACCTCTGTATCCGGTAGTGGCAGATGGAAACTGTATCCGGTAGTGGCAGATGGAAAGAGAAACGGTTAGAAGAAAAAAAATAAATGAAGTCTGCCTATCTCCGGGCCAGAGCCCCTTGCCTTGTCTGTTGTAGATAATGAATCTATCCTCCAGTGACTGGCCAGGCTGATGGGCCTTATCTCTTTACCCACCTGGCTGTCAACAGCAGGTCCTACTATCGCCTCCCTCTAGTCTCTGCCAACCGTTAATGCTAGAGTTATCACTTTCTGTTATCAAGTGGCTTCAGCTATGCAGGGAGGGTGGGGCCCCTATCTCTCCTAGACTCTGTGCTTTGTCACTGGATCTGATAAGAAACACCACCCCTGC

Residue Counts

Given motif alignment, count for each location is calculated:

Residue Frequencies

The counts are then converted to frequencies:

Example Maximization Step

Consider the first sequence:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT

There are 41 residues and the motif width is 6, so there are 41-6+1 = 36 sites to consider

MEME Software

One of three motif models:

OOPS: One expected occurrence per sequence

ZOOPS: Zero or one expected occurrence per sequence

TCM: Any number of occurrences of the motif

Gibbs Sampling

Similar to E-M algorithm

Combines E-M and simulated annealing

Goal: find the most probable pattern by sampling from motif probabilities to maximize the ratio of model:background probabilities

Predictive Update Step

random motif start position chosen for all sequences except one

Initial alignment used to calculate residue frequencies for motif and background

similar to the Expectation Step of EM

Sampling Step

ratio of model:background probabilities normalized and weighted

motif start position chosen based on a random sampling with the given weights

Different than E-M algorithm

Gibbs Sampling

process repeated until residue frequencies in each column do not change

The sampling step is then repeated for a different initial random alignment

Sampling allows escape from local maxima

Gibbs Sampling

Dirichlet priors (pseudocounts) are added into the nucleotide counts to improve performance

A shifting routine shifts the motif a few bases to the left or the right

A range of motif sizes is checked
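The predictive update and sampling steps can be sketched in Python as a basic site sampler. This is a simplified illustration, not the Wadsworth Gibbs sampler: it assumes a uniform background of 0.25 instead of estimating it from non-motif residues, uses a fixed motif width and a fixed number of sweeps, and leaves out the shifting routine. All names and defaults are hypothetical.

```python
import random

BASES = "ACGT"

def gibbs_motif(seqs, width, n_sweeps=200, pseudo=0.5, seed=0):
    """Return one sampled motif start (0-based) per sequence."""
    rng = random.Random(seed)
    # Random initial motif start in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_sweeps):
        for held_out, seq in enumerate(seqs):
            # Predictive update: motif frequencies from all sequences except one.
            counts = [{b: pseudo for b in BASES} for _ in range(width)]
            for j, (s, st) in enumerate(zip(seqs, starts)):
                if j != held_out:
                    for i in range(width):
                        counts[i][s[st + i]] += 1
            pspm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
            # Sampling step: weight each start in the held-out sequence by its
            # model:background ratio, then draw one start at random with those weights.
            ratios = []
            for st in range(len(seq) - width + 1):
                r = 1.0
                for i in range(width):
                    r *= pspm[i][seq[st + i]] / 0.25  # assumed uniform background
                ratios.append(r)
            starts[held_out] = rng.choices(range(len(ratios)), weights=ratios)[0]
    return starts
```

Because the new start is drawn at random rather than taken as the maximum, a poor alignment can be abandoned, which is how sampling allows escape from local maxima. Running the function with several seeds corresponds to repeating the procedure from different initial random alignments.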

Gibbs Sampler Web Interface

http://bayesweb.wadsworth.org/gibbs/gibbs.html