Cbio course, spring 2005, Hebrew University Class: Motif Finding CS-67693, Spring 2005 School of...

cbio course, spring 2005, Hebrew University

Class: Motif Finding

CS-67693, Spring 2005

School of Computer Science & Engineering

Hebrew University, Jerusalem

*A few slides were adapted and edited from www.cs.ucsb.edu/~ambuj/Courses/bioinformatics/motif%20finding.ppt


Background

Basic dogma: information is coded in the genome.

Information includes:

• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing


Eukaryotic Gene

Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/


Background



Functional units in proteins


Proteins Local structure motifs

diverging type-2 turn

Serine hairpin Type-I hairpin

Frayed helix

Proline helix C-cap

alpha-alpha corner

glycine helix N-cap

I-sites Library = a catalog of local sequence-structure correlations


Background



• Functional units in proteins
• RNA family structure


RNA – Multiple Align. + structure

Biological Sequence Analysis; Durbin, Eddy, Krogh, Mitchison; Cambridge press, 1998


Background



• Functional units in proteins
• RNA family structure
• How to control which gene to turn on/off and when


Background

In many cases, we can relate such functions to recurring “motifs” in the genome:

• Splice/start/end site signals in coding genes
• Binding sites of regulatory elements controlling transcription of nearby genes
• A certain function of a protein “domain”

The definition of what is a sequence “motif” depends on the context !


Background



• Functional units in proteins
• RNA family structure
• How to control which gene to turn on/off and when

Future

Classes


Regulation of Gene Expression

Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.

[Figure: a regulatory protein binding to the site next to a gene; the gene is shown in its off and on states.]


Regulatory Sites

Regulatory sites are sometimes divided into 2 types:

• Promoter sites: usually upstream of a gene, in non-translated (non-coding) regions. In some cases, these sites can be in exonic or intronic regions.
• Enhancer sites: can be very far away (either upstream or downstream).

Regulatory proteins recognize sites by conserved DNA patterns, which consist of a short stretch of “partially specific” nucleotide sequences.


lac operon in E. coli

Figure 13.16 The lac Operon of E. coli


Promoter…


Transcription Factor Binding Sites

Non-coding regions gene regulation

We want to describe this site


Difficulty of Finding Regulatory Elements

• Regulatory sites are short (up to 30 nucleotides).
• Non-coding regions are very long (they include all regions that are not translated into proteins).
• Experiments to find regulatory sites are tedious and time-consuming. One approach is to mutate different combinations of nucleotides until functionality changes.
• We don’t have a good understanding of what makes a site active, or how active, in terms of chemical/physical constraints.


Why Not Use Multiple Alignment?

• The motif is short and may appear at different locations in different sequences. Most other areas are random.
• Not all positions within a binding site should be treated in the same way, and usually we don’t know in advance how. Therefore a general scoring matrix is not adequate.
• The problem is made more complicated since not every sequence contains a motif, due to:
  • The upstream region used may not be long enough to include a regulatory site in every sequence.
  • Usually, potentially co-regulated genes are used to construct the sample, which means that we don’t know for sure whether all these genes are really co-regulated.


Computational Approach

Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).

Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.

Find some way to identify “conserved” elements in these sequences, resulting in a list of potential regulatory sites.


How to Find Regulatory Sites

[Figure: the sample, one upstream sequence per gene, each containing a site next to its gene.]


Formulating Motif Finding Task

Given a set of sequences, find a common motif shared by these sequences. Steps:

Construct a model of what we mean by common motif.

Solve the problem within the model on simulated samples.

Evaluate performance on real life biological samples.


Formulating Motif Finding Task (2)

This means we need to define:

• Input of the algorithm: this implicitly defines various assumptions we have on the problem (e.g., do we have a different belief for each sequence that it belongs to the group?)
• Type of “motif” class
• Search algorithm: how do we search the space of possible motifs?
• Scoring function: how do we score putative motifs?
• Output of the algorithm: should it give us just putative sites, or maybe a binding site model to predict sites?
• Evaluation technique: how do we test our algorithm?


Task Definition Example

Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?

Input: a set of sequences, each one with an unknown pattern at an unknown position.

Output: a set of starting positions of the pattern in each sequence.


Pattern == Subsequence

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Subsequence = AAAAAAAAGGGGGGG


Pattern == (l,d)

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG

All variants of AAAAAAAAGGGGGGG

• First formulated by Pevzner (ISMB 2000)
• Pattern = subsequence of length l with exactly d random mismatches in it
• All other sequence is assumed random
• Assumes exactly one “true” occurrence of the motif in each sequence
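The (l,d) formulation can be made concrete with a small brute-force sketch. It is only an illustration, not the algorithm from the cited work: the function names are hypothetical, it accepts occurrences with at most d mismatches (the slide says exactly d), and it enumerates all 4^l patterns, so it is usable only for very small l.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(pattern: str, seq: str) -> int:
    """Smallest Hamming distance between pattern and any window of seq."""
    l = len(pattern)
    return min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))


def find_ld_motifs(seqs, l, d, alphabet="ACGT"):
    """List every length-l pattern that occurs in each sequence with <= d mismatches."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        pattern = "".join(letters)
        if all(min_distance(pattern, s) <= d for s in seqs):
            motifs.append(pattern)
    return motifs


# The two instances shown on the slide are each 4 mismatches away from the
# planted pattern, i.e. this example is a (15,4) problem.
print(hamming("AgAAgAAAGGttGGG".upper(), "AAAAAAAAGGGGGGG"))
```

For realistic values such as l = 15 the pattern space has 4^15 (about 10^9) members, which is why the search strategy matters as much as the motif definition.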


Formulating Motif Finding Task (2)

We need to define:

• Input of the algorithm: this implicitly defines various assumptions we have on the problem (e.g., do we have a different belief for each sequence that it belongs to the group?)
• Type of “motif” class
• Search algorithm: how do we search the space of possible motifs?
• Scoring function: how do we score putative motifs?
• Output of the algorithm: should it give us just putative sites, or maybe a binding site model to predict sites?
• Evaluation technique: how do we test our algorithm?

Think:
• How does the (l,d) problem define these?
• How does it relate to “real” biology?


How to Define Motif Class?

• Subsequences: ACTCTT
• IUPAC alphabet: {A, C, G, T, R, Y, M, K, S, W, B, D, H, V, N} = all non-empty subsets of {A,C,G,T}
• PSSM / PWM (Position Specific Score Matrix or Position Weight Matrix)
• More general probabilistic/other models: e.g. using the Bayesian Networks modeling language
• Refined definition based on prior knowledge:
  • Homo/hetero dimers
  • Variable gaps
  • Bias to some characteristic information profile (Van, 2003)
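To make the IUPAC motif class concrete, here is a minimal sketch that expands each IUPAC letter into the set of bases it stands for and scans a sequence for matches. The helper names are hypothetical; the code table itself is the standard IUPAC nucleotide code.

```python
import re

# Standard IUPAC nucleotide codes: each letter names a non-empty subset of {A,C,G,T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into a regular expression over A, C, G, T."""
    return "".join("[" + IUPAC[c] + "]" for c in motif.upper())


def find_sites(motif: str, seq: str):
    """Start positions of all (possibly overlapping) matches of motif in seq."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]


print(find_sites("ACNYTT", "GGACTCTTACGTTTA"))
```

An IUPAC motif is an all-or-nothing description: a k-mer either matches or it does not, which is the main contrast with the graded scores of a PSSM.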


NOTE:

• Independence assumption between binding site positions!

• The score used in a probabilistic setting is the log odds score.

• In many cases the BG is a simple, fixed background distribution (Q) over {A,C,G,T}.

• The entries in the matrix can be Pi(a), log(Pi(a)) or log(Pi(a)/Q(a)), depending on the context of its usage!

PSSM Representation of Binding Sites

Position Specific Score Matrix: each possible k-mer gets a “score” for being a binding site, which is:

$$\mathrm{Score}(s_1,\ldots,s_k) = \sum_{i=1}^{k} w[i, s_i]$$

w[i,c]: weight of letter c at position i (a matrix with rows A, C, G, T and columns 1, 2, …, k)

Probabilistic interpretation:

$$\mathrm{Score}(s_1,\ldots,s_k) = \log \frac{P(s_1,\ldots,s_k \mid M)}{P(s_1,\ldots,s_k \mid BG)}$$

$$P(s_1,\ldots,s_k \mid M) = \prod_{i=1}^{k} P_i(s_i)$$
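The log-odds scoring described above can be sketched in a few lines. This is an illustrative sketch under assumptions: the function names are hypothetical, the logarithm is taken base 2 (the slides do not fix a base), the background Q is a fixed distribution over {A,C,G,T}, and every P_i(a) is assumed strictly positive.

```python
import math


def log_odds_matrix(probs, background):
    """Build w[i][a] = log2(P_i(a) / Q(a)) from per-position probabilities."""
    return [{a: math.log2(p[a] / background[a]) for a in background} for p in probs]


def score(w, kmer):
    """Score(s_1..s_k) = sum over positions i of w[i][s_i]."""
    return sum(w[i][c] for i, c in enumerate(kmer))


def scan(w, seq):
    """Score every window of length k in seq; returns (start, score) pairs."""
    k = len(w)
    return [(i, score(w, seq[i:i + k])) for i in range(len(seq) - k + 1)]


# Toy two-position motif that prefers "AG", with a uniform background.
probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {a: 0.25 for a in "ACGT"}
w = log_odds_matrix(probs, uniform)
print(max(scan(w, "TTAGC"), key=lambda hit: hit[1]))
```

Because the positions are assumed independent, the log of the product of per-position ratios becomes the sum of matrix entries, which is what makes genome-wide scanning cheap.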


• PSSM:
  + Enables representing low/high affinity in different positions
  + Trade off sensitivity and specificity in genome-wide scans
  - Huge search space, how to cover it efficiently?

ABF1 Example – (Targets by Lee et al., 2002)

>YAL011W: CGTGTTAGATGA

√

?

PSSM vs. IUPAC


How to Learn PSSM Motif?

Easier task: we have aligned samples to learn from.

• We have a set of known BS, all of length k (e.g. verified by some biological experiment).
• Compute counts for each base in each position, and normalize == ML estimator. With N the number of sequences and N_a the number of “a”s in position i:

$$P_i(a) = \frac{N_a}{N}$$

Note: This is the ML solution. As in many other cases, this might be problematic when we have very few samples to learn from (e.g., we can get probability 0 for base A in position i simply because we did not see enough examples).

Solution: use pseudo counts or some prior (e.g. a Dirichlet prior).
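The counting estimator and the pseudo-count fix can be sketched as follows. The function name and the choice of adding the same pseudo count to every base are assumptions made for illustration; with a pseudo count of 0 the function returns the plain ML estimate.

```python
def estimate_pssm(sites, pseudocount=0.0, alphabet="ACGT"):
    """Per-position base probabilities from aligned sites of equal length.

    With pseudocount=0 this is the ML estimate N_a / N; a positive value is
    added to every base count before normalizing.
    """
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for i in range(k):
        counts = {a: pseudocount for a in alphabet}
        for s in sites:
            counts[s[i]] += 1
        matrix.append({a: counts[a] / total for a in alphabet})
    return matrix


sites = ["ACG", "ACG", "ATG", "CCG"]
ml = estimate_pssm(sites)
smoothed = estimate_pssm(sites, pseudocount=1.0)
print(ml[0]["G"], smoothed[0]["G"])
```

In this toy sample G never appears in the first position, so the ML estimate assigns it probability 0 and any site starting with G would get an infinitely bad log score; the smoothed estimate keeps it small but positive.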


How to Learn PSSM Motif ? (2)

[Figure: a BS model (a PSSM with rows A, C, G, T and columns 1 to 7) is learned from an input sequence and used to produce predictions. Dark blue marks the BS positions, which are hidden from us and which we are trying to learn.]

Remember: in the motif finding problem we have a much harder task.

The input: a set of (long) sequences suspected to contain a common motif (a PSSM, according to our current model assumption), but we don’t know where!

The output: prediction of new BS based on our learned PSSM motif.


How to Learn PSSM Motif ? (3) MEME Algorithm (Bailey T.L. and Elkan C.P. 1995 )

(Still) one of the most commonly used tools for motif (PSSM) search:


How to Learn PSSM Motif ? (3) MEME Algorithm (Bailey T.L. and Elkan C.P. 1995 )

The basic probabilistic framework used by MEME:

• Input: N sequences.
• Assume a generative model: a sequence is generated either by the BS model M (PSSM) or from a fixed background distribution BG.
• Assume each sequence has exactly 1 BS in it.
• Scoring function: P(Seq | M, BG).
• Try to maximize the likelihood scoring function by adjusting M’s (PSSM) parameters.


How to Learn PSSM Motif ? (4) What’s the problem? Why is it hard?

Think of the positions of the BS in each sequence as H, where H is a vector of dimension N.

Given H we have complete data. Inferring M’s ML parameters is then as easy as in the aligned case.

Problem 1: We don’t have H; we are trying to learn it too. If H is not given, the ML parameters of M for the different positions become dependent, and we have no closed form to compute them analytically. Going over all possible H assignments is not feasible, so we need to resort to some method to search the space of possible assignments to M’s parameters.

Problem 2: The landscape of the likelihood function is typically far from convex → many local optima.


How to Learn PSSM Motif ? (5) MEME Algorithm

MEME uses a technique called EM to search the space of model M’s parameters.

EM = Expectation Maximization. We review how EM is used in the MEME algorithm in class…
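Since the EM details are left for class, the following is only a rough sketch of EM under the assumptions listed earlier: exactly one BS per sequence, a PSSM motif model and a fixed background. It is not MEME's actual implementation. The uniform background, the seed-based initialization (0.7 on the seed letter), the pseudo count and the function name are all assumptions made for this example.

```python
import math

ALPHABET = "ACGT"


def em_oops(seqs, seed, iterations=20, pseudocount=0.1):
    """EM for a one-occurrence-per-sequence motif model with a uniform background.

    Returns the learned PSSM and, per sequence, the posterior over start positions.
    """
    k = len(seed)
    background = 1.0 / len(ALPHABET)
    # Initialize the PSSM around a seed k-mer (EM is sensitive to this choice).
    pssm = [{a: (0.7 if a == c else 0.1) for a in ALPHABET} for c in seed]
    posteriors = []
    for _ in range(iterations):
        counts = [{a: pseudocount for a in ALPHABET} for _ in range(k)]
        posteriors = []
        for s in seqs:
            # E-step: posterior of each start position. Letters outside the
            # window are background under every hypothesis, so only the
            # motif/background ratio inside the window matters.
            ratios = [
                math.prod(pssm[i][s[j + i]] / background for i in range(k))
                for j in range(len(s) - k + 1)
            ]
            total = sum(ratios)
            post = [r / total for r in ratios]
            posteriors.append(post)
            # Accumulate expected base counts for the M-step.
            for j, p in enumerate(post):
                for i in range(k):
                    counts[i][s[j + i]] += p
        # M-step: re-estimate the PSSM from the expected counts.
        pssm = [{a: c[a] / sum(c.values()) for a in ALPHABET} for c in counts]
    return pssm, posteriors


seqs = ["TTTTACGTACTTTT", "GGACGTACGGGGGG", "CCCCCCCACGTACC", "ACGTACTGTGTGTG"]
pssm, posteriors = em_oops(seqs, seed="ACGTAC")
print([max(range(len(p)), key=p.__getitem__) for p in posteriors])
```

The E-step replaces the unknown vector H with a distribution over start positions, and the M-step is the aligned-case counting estimator applied to expected counts instead of observed ones.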


Problems with the MEME & other Models

Think: In light of what we discussed, what assumptions are made in this model? What might cause us problems in “real” life data?

• MEME also has other variants we did not discuss here (oops, zoops, etc.)
• Also: EM is very sensitive to the starting point → need a good way to find good ones


Other Algorithmic Techniques for Motif Finding

• MEME (Expectation Maximization)
• GibbsDNA, AlignACE (Gibbs Sampling)
• CONSENSUS (greedy multiple alignment)
• WINNOWER (Clique finding in graphs)
• SP-STAR (Sum of pairs scoring)
• MITRA (Mismatch trees to prune exhaustive search space)

More than one way to skin a cat…


How to find Binding Sites- Revisited

“Classical” solutions: find a common motif in a gene set (CONSENSUS, MITRA, MEME, AlignACE…)

[Figure: a gene set and the promoters of its genes.]

Main problem: in many cases the motif is common not just to the subset of sequences we have, but to many others as well → not a good candidate to explain regulation.

Discriminative solutions: find a common & unique motif in the genes. Extract the relevant bit from the sequences:

$$P_{hyper}(k \mid n, K, M) = \sum_{x=k}^{n} \frac{\binom{K}{x}\binom{M-K}{n-x}}{\binom{M}{n}}$$

“A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01
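A hypergeometric tail probability of this kind can be computed directly with exact integer arithmetic. The reading of the symbols is an assumption, since the slide does not define them: M is taken to be the total number of sequences, K the number of those that contain the motif, n the size of the selected gene set, and k the number of selected sequences that contain the motif. The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(k: int, n: int, K: int, M: int) -> float:
    """P(X >= k) when n of M sequences are drawn and K of the M contain the motif."""
    upper = min(n, K)
    hits = sum(comb(K, x) * comb(M - K, n - x) for x in range(k, upper + 1))
    return hits / comb(M, n)


# Toy numbers: 10 sequences in total, 3 contain the motif, and all 3 of them
# fall inside a selected set of 4.
print(hypergeom_tail(3, 4, 3, 10))
```

A small tail probability says the motif is enriched in the selected set relative to the rest of the genome, which is exactly what a motif that is merely common everywhere fails to achieve.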


Finding Discriminative Motifs

Define space of motifs: “mimic” motifs with a simpler class for efficient search

Search the space, evaluate motifs using discriminative scoring

Choose significant motifs

Correct for multiple hypotheses: Bonferroni or FDR criteria

Step1

Step2: “A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01Refine Motifs
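The multiple-hypothesis correction step can be illustrated with a short sketch. Bonferroni is as named on the slide; for the FDR criterion the sketch uses the Benjamini-Hochberg step-up procedure, which is an assumption, since the slide does not say which FDR procedure is meant. Function names are hypothetical.

```python
def bonferroni(pvalues, alpha=0.05):
    """Accept motif i if p_i <= alpha / m, where m is the number of tested motifs."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: accept the r smallest p-values, where r is the
    largest rank with p_(r) <= alpha * r / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    accepted = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            accepted[i] = True
    return accepted


pvals = [0.001, 0.012, 0.025, 0.045, 0.5]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Because the search evaluates a very large number of candidate motifs, an uncorrected threshold would accept many motifs by chance alone; Bonferroni is the stricter of the two criteria.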


Binding Sites - Revisited

→ independence assumption

Two relevant questions:

• Are there dependencies in binding sites?
• Do we gain an edge in computational tasks if we model such dependencies?

[Figure: a promoter upstream of a gene, with a binding site A?C?T.]

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03


How to model binding sites ?

Goal: represent a distribution of binding sites, P(X_1, X_2, X_3, X_4, X_5) = ?

Profile (independency model):

$$P(X_1 \ldots X_5) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)$$

Tree (direct dependencies):

$$P(X_1 \ldots X_5) = P(X_1)\,P(X_2 \mid X_3)\,P(X_3 \mid X_1)\,P(X_4)\,P(X_5 \mid X_3)$$

Mixture of Profiles (global dependencies), with a hidden variable T:

$$P(X_1 \ldots X_5) = \sum_{T} P(T)\,P(X_1 \mid T)\,P(X_2 \mid T)\,P(X_3 \mid T)\,P(X_4 \mid T)\,P(X_5 \mid T)$$

Mixture of Trees (both types of dependencies):

$$P(X_1 \ldots X_5) = \sum_{T} P(T)\,P(X_1 \mid T)\,P(X_2 \mid X_3, T)\,P(X_3 \mid X_1, T)\,P(X_4 \mid T)\,P(X_5 \mid X_3, T)$$
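A toy example shows what the independence assumption loses. The sketch below evaluates a profile and a mixture of profiles on two-letter sites; the data (half the sites are "AA", half are "CC") and the function names are made up for illustration.

```python
import math


def profile_prob(profile, site):
    """Independent positions: product of P_i(x_i)."""
    return math.prod(profile[i][c] for i, c in enumerate(site))


def mixture_prob(weights, profiles, site):
    """Mixture of profiles: sum over components T of P(T) * prod_i P(x_i | T)."""
    return sum(w * profile_prob(p, site) for w, p in zip(weights, profiles))


# Suppose half of the known sites are "AA" and half are "CC".
single = [{"A": 0.5, "C": 0.5}, {"A": 0.5, "C": 0.5}]
components = [
    [{"A": 1.0, "C": 0.0}, {"A": 1.0, "C": 0.0}],  # component generating "AA"
    [{"A": 0.0, "C": 1.0}, {"A": 0.0, "C": 1.0}],  # component generating "CC"
]
weights = [0.5, 0.5]

for site in ("AA", "AC"):
    print(site, profile_prob(single, site), mixture_prob(weights, components, site))
```

The single profile gives the never-observed site "AC" the same probability as the observed "AA", while the mixture keeps the two positions coupled through the hidden variable T.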



Learning models: Aligned binding sites

Learning procedure for Bayesian networks

GCGGGGCCGGGCTGGGGGCGGGGTAGGGGGCGGGGGTAGGGGCCGGGCTGGGGGCGGGGTAAAGGGCCGGGCGGGAGGCCGGGAGCGGGGCGGGGCGAGGGGACGAGTCCGGGGCGGTCCATGGGGCGGGGC

[Figure: the aligned binding sites are fed to a learning machinery over the candidate models (profile, tree, mixture of profiles, mixture of trees), which selects the maximum likelihood model.]



Arabidopsis ABA binding factor 1 (49 examples)

Profile: test LL per instance -19.93

Mixture of Profiles (components weighted 76% and 24%): test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)

Tree (over X4 X5 X6 X7 X8 X9 X10 X11 X12): test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)



Rap1 Example (Harbison et al. 04) (171 examples)

Profile Mixture of Profiles

X4 X5 X6 X7 X8 X9 X10 X11 X12

Tree


[Figure: Likelihood improvement over profiles. x-axis: datasets of binding sites (1 to 91); y-axis: fold change in likelihood on held-out test data (¼ to 8); legend: Significant, Non sig.; annotation: 67.]

Significant improvement in generalization

Data often exhibits dependencies



EM algorithm

Learning models: unaligned data

Use the EM algorithm to simultaneously:

• Identify binding site positions
• Learn a dependency model

[Figure: unaligned data feeds a loop that alternates between learning a model and identifying binding sites, over the same model classes (profile, tree, mixture of profiles, mixture of trees).]



Evaluating Performance

Detect target genes on a genomic scale:

ACGTAT…………….………………….AGGGATGCGAGC (upstream region from -1000 to 0; position -473 marked)

Scoring rule:

$$\mathrm{Score}(X_1,\ldots,X_K) = \log \frac{P(X_1,\ldots,X_K)}{P_0(X_1,\ldots,X_K)}$$

where P is the probability by the binding site model and P_0 is the background model (order-3 Markov chain).

Crucial issue: p-value of scores

“CIS: Compound Importance Sampling Method for Protein-DNA Binding Site p-value Estimation” Bioinformatics, 2004, ISMB 04
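The background term of the scoring rule can be sketched as an order-3 Markov chain estimated from sequence counts. This is an illustrative sketch only: the function names, the pseudo count of 1, the base-2 logarithm and the toy training data are assumptions, and sequences are assumed to contain only A, C, G and T.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=3, alphabet="ACGT", pseudocount=1.0):
    """Estimate P0(letter | previous `order` letters) from training sequences."""
    counts = defaultdict(lambda: {a: pseudocount for a in alphabet})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def cond_prob(context, letter):
        row = counts[context]  # an unseen context falls back to a uniform row
        return row[letter] / sum(row.values())

    return cond_prob


def background_logprob(cond_prob, seq, start, k, order=3):
    """log2 P0 of the window seq[start:start+k]; requires start >= order."""
    return sum(
        math.log2(cond_prob(seq[i - order:i], seq[i])) for i in range(start, start + k)
    )


def log_odds(site_logprob, cond_prob, seq, start, k, order=3):
    """Score = log2 P(window | binding site model) - log2 P0(window)."""
    window = seq[start:start + k]
    return site_logprob(window) - background_logprob(cond_prob, seq, start, k, order)


cond_prob = train_markov(["ACGT" * 50])
print(cond_prob("ACG", "T"), cond_prob("TTT", "A"))
```

Here `site_logprob` stands for any binding site model (profile, tree or mixture) that returns a log2 probability for a window, so the same scoring code can compare the different model classes.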


[Figure: Example: ROC curve of HSF1. x-axis: False Positive Rate (0% to 5%); y-axis: True Positive Rate (Sensitivity), 0% to 90%; curves: Profile, Mixture of Profiles, Tree, Mixture of Trees; annotation: ~60 FP.]



Evaluation: Localization Data, 5-fold Cross Validation [Lee et al 2002]

[Figure: Improvement by Mix of Trees over PSSM. x-axis: Δ sensitivity (TP/True), -20 to 60; y-axis: Δ specificity (TP/Predicted), -25 to 20. Inset: overlapping sets “True” and Predicted, with TP as their intersection.]
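The two measures on the plot axes are simple set ratios. The sketch below uses the definitions given on the slide, sensitivity = TP/True and specificity = TP/Predicted (the latter is what is elsewhere often called precision); the gene names and the function name are made up.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Return (TP/True, TP/Predicted) for two collections of gene identifiers."""
    true_set, predicted_set = set(true_targets), set(predicted_targets)
    tp = len(true_set & predicted_set)
    sensitivity = tp / len(true_set) if true_set else 0.0
    specificity = tp / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


true_targets = {"g1", "g2", "g3", "g4"}
predicted_targets = {"g3", "g4", "g5"}
print(sensitivity_specificity(true_targets, predicted_targets))
```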



Motif Finding - Evaluation

• Still an open problem
• We have seen several examples of how performance can be evaluated in different ways
• There is (still) no absolute solution for this
• Main problems:
  • No large data sets of known sites
  • No real annotation of negative samples
  • How to define a success measure?
  • Difference in input/output assumptions
  • …

A recent effort in this direction: “Assessing computational tools for the discovery of transcription factor binding sites” (Nat. Biotech. Jan 05) compared publicly available tools on the web on (small) data sets of known binding sites based on the Transfac DB.
