cbio course, spring 2005, Hebrew University
Class: Motif Finding
CS-67693, Spring 2005
School of Computer Science & Engineering
Hebrew University, Jerusalem
*A few slides were adapted and edited from www.cs.ucsb.edu/~ambuj/Courses/bioinformatics/motif%20finding.ppt
Background
Basic dogma: information is coded in the genome.
Information includes:
• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
Eukaryotic Gene
Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/
Background
Information coded in the genome also includes:
• Functional units in proteins
Proteins: Local structure motifs
[Figure: example motifs: diverging type-2 turn, serine hairpin, type-I hairpin, frayed helix, proline helix C-cap, alpha-alpha corner, glycine helix N-cap]
I-sites Library = a catalog of local sequence-structure correlations
Background
Information coded in the genome also includes:
• RNA family structure
RNA: multiple alignment + structure
Biological Sequence Analysis; Durbin, Eddy, Krogh, Mitchison; Cambridge press, 1998
Background
Information coded in the genome also includes:
• How to control which gene to turn on/off, and when
Background
In many cases, we can relate such functions to recurring “motifs” in the genome:
• Splice/start/end site signals in coding genes
• Binding sites of regulatory elements controlling transcription of nearby genes
• A certain function of a protein “domain”
The definition of a sequence “motif” depends on the context!
Background
Basic dogma: information is coded in the genome.
Information includes:
• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
• Functional units in proteins
• RNA family structure
• How to control which gene to turn on/off, and when
[Slide annotation: Future classes]
Regulation of Gene Expression
Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.
[Figure: a regulatory protein and a regulatory site next to a gene, with the gene shown off and on]
Regulatory Sites
Regulatory sites are sometimes divided into two types:
• Promoter sites: usually upstream of a gene, in non-translated (non-coding) regions. In some cases, these sites can be in exonic or intronic regions.
• Enhancer sites: can be very far away (either upstream or downstream).
Regulatory proteins recognize sites by conserved DNA patterns, which consist of a short stretch of “partially specific” nucleotide sequences.
lac operon in E. coli
Figure 13.16 The lac Operon of E. coli
Promoter…
Transcription Factor Binding Sites
Non-coding regions gene regulation
We want to describe this site
Difficulty of Finding Regulatory Elements
• Regulatory sites are short (up to 30 nucleotides).
• Non-coding regions are very long (they include all regions that are not translated into proteins).
• Experiments to find regulatory sites are tedious and time-consuming. One approach is to mutate different combinations of nucleotides until functionality changes.
• We don’t have a good understanding of what makes a site active, or how active, in terms of chemical/physical constraints.
Why Not Use Multiple Alignment?
• The motif is short and may appear at a different location in each sequence. Most other areas are random.
• Not all positions within a binding site should be treated in the same way, and usually we don’t know in advance how. Therefore a general scoring matrix is not adequate.
• The problem is further complicated because not every sequence contains a motif:
  • The upstream region used may not be long enough to include a regulatory site in every sequence.
  • Usually, potentially co-regulated genes are used to construct the sample, which means that we don’t know for sure whether all these genes are really co-regulated.
Computational Approach
Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).
Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.
Find some way to identify “conserved” elements in these sequences, resulting in a list of potential regulatory sites.
How to Find Regulatory Sites
[Figure: a sample of sequences, each containing a site next to its gene]
Formulating Motif Finding Task
Given a set of sequences, find a common motif shared by these sequences.
Steps:
Construct a model of what we mean by common motif.
Solve the problem within the model on simulated samples.
Evaluate performance on real life biological samples.
Formulating Motif Finding Task (2)
This means we need to define:
• Input of the algorithm: this implicitly defines the assumptions we make about the problem (e.g. do we hold a different belief for each sequence that it belongs to the group?)
• Type of “motif” class
• Search algorithm: how do we search the space of possible motifs?
• Scoring function: how do we score putative motifs?
• Output of the algorithm: should it give us just putative sites, or a binding site model that can predict sites?
• Evaluation technique: how do we test our algorithm?
Task Definition Example
Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?
Input: a set of sequences, each one with an unknown pattern at an unknown position.
Output: a set of starting positions of the pattern in each sequence.
Pattern == Subsequence
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Subsequence = AAAAAAAAGGGGGGG
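When the pattern is an exact subsequence, the task can be solved by intersecting the sets of length-l windows of all sequences. The following is a minimal sketch under that assumption; the function names and the toy sequences are hypothetical and are not taken from the slides.

```python
def common_substrings(seqs, l):
    """Return all length-l substrings present in every sequence (case-insensitive)."""
    seqs = [s.upper() for s in seqs]
    # Start with every l-mer of the first sequence as a candidate.
    shared = {seqs[0][i:i + l] for i in range(len(seqs[0]) - l + 1)}
    for s in seqs[1:]:
        kmers = {s[i:i + l] for i in range(len(s) - l + 1)}
        shared &= kmers  # keep only candidates seen in this sequence too
    return shared


def start_positions(seqs, pattern):
    """First start position of the pattern in each sequence (-1 if absent)."""
    return [s.upper().find(pattern.upper()) for s in seqs]


# Toy sample (hypothetical): the pattern AAAGGG is planted in each sequence.
toy = ["ttgAAAGGGca", "cAAAGGGtttg", "gcatAAAGGGt"]
print(common_substrings(toy, 6))
print(start_positions(toy, "AAAGGG"))
```

This only works because every occurrence is identical; a single mismatch in one sequence removes the pattern from the intersection.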
Pattern == (l,d)
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
All variants of AAAAAAAAGGGGGGG
• First formulated by Pevzner (ISMB 2000)
• Pattern = subsequence of length l with exactly d random mismatches in it
• All other sequence is assumed random
• Assumes exactly one “true” occurrence of the motif in each sequence
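For small l, the (l,d) formulation can be explored by brute force: enumerate every candidate pattern and keep those that have an occurrence within d mismatches in every sequence. The sketch below is illustrative only. It relaxes "exactly d" to "at most d" (an assumption made here for simplicity), its running time grows as 4^l, and all names and toy sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(pattern, seq):
    """(distance, start) of the window of seq closest to pattern."""
    l = len(pattern)
    return min((hamming(pattern, seq[i:i + l]), i)
               for i in range(len(seq) - l + 1))


def planted_motifs(seqs, l, d):
    """Exhaustively list patterns that occur with at most d mismatches in every sequence."""
    hits = []
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        matches = [best_match(pattern, s) for s in seqs]
        if all(dist <= d for dist, _ in matches):
            hits.append((pattern, [start for _, start in matches]))
    return hits


# Toy sample (hypothetical): ACGTA planted with at most one mismatch.
toy = ["TTACGTATT", "GGACCTAGG", "CAGGTACC"]
for pattern, starts in planted_motifs(toy, 5, 1):
    print(pattern, starts)
```

The first instance on the slide, AgAAgAAAGGttGGG, is at Hamming distance 4 from AAAAAAAAGGGGGGG, which `hamming` confirms after upper-casing.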
Formulating Motif Finding Task (2)
We need to define:
• Input of the algorithm: this implicitly defines the assumptions we make about the problem (e.g. do we hold a different belief for each sequence that it belongs to the group?)
• Type of “motif” class
• Search algorithm: how do we search the space of possible motifs?
• Scoring function: how do we score putative motifs?
• Output of the algorithm: should it give us just putative sites, or a binding site model that can predict sites?
• Evaluation technique: how do we test our algorithm?
Think:
• How does the (l,d) problem define these?
• How does it relate to “real” biology?
How to Define Motif Class?
• Subsequences: ACTCTT
• IUPAC alphabet: {A, C, G, T, R, Y, M, K, S, W, B, D, H, V, N} = all non-empty subsets of {A,C,G,T}
• PSSM / PWM (Position Specific Score Matrix or Position Weight Matrix)
• More general probabilistic/other models: e.g. using the Bayesian networks modeling language
• Refined definition based on prior knowledge:
  • Homo-/heterodimers
  • Variable gaps
  • Bias toward some characteristic information profile (Van, 2003)
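To make the IUPAC motif class concrete, the sketch below maps each of the 15 IUPAC nucleotide codes to the set of bases it allows and scans a sequence for windows that match a pattern. The helper names and example sequences are hypothetical.

```python
# Standard IUPAC nucleotide codes: each letter stands for a non-empty subset of {A,C,G,T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_matches(pattern, kmer):
    """True if every base of kmer is allowed by the pattern letter at that position."""
    return len(pattern) == len(kmer) and all(
        base in IUPAC[code] for code, base in zip(pattern, kmer.upper()))


def find_sites(pattern, seq):
    """Start positions of all windows of seq that match the IUPAC pattern."""
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1)
            if iupac_matches(pattern, seq[i:i + k])]


print(find_sites("ACTCTT", "ggACTCTTaa"))  # plain subsequence as a special case
print(find_sites("RCGY", "ACGTGCGC"))      # R = A or G, Y = C or T
```

A plain subsequence is the special case in which every pattern letter is one of A, C, G, T.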
PSSM Representation of Binding Sites
Position Specific Score Matrix: each possible k-mer gets a “score” for being a binding site:
$$\mathrm{Score}(s_1,\dots,s_k) = \sum_{i=1}^{k} w[i, s_i]$$
w[i,c] – weight of letter c at position i (a matrix with rows A, C, G, T and columns 1, 2, …, k)
Probabilistic interpretation:
$$\mathrm{Score}(s_1,\dots,s_k) = \log \frac{P(s_1,\dots,s_k \mid M)}{P(s_1,\dots,s_k \mid BG)}$$
$$P(s_1,\dots,s_k \mid M) = \prod_{i=1}^{k} P_i(s_i)$$
NOTE:
• Independence assumption between binding site positions!
• The score used in a probabilistic setting is the log-odds score
• In many cases the BG is a simple, fixed background distribution (Q) over {A,C,G,T}.
• The entries in the matrix can be P_i(a), log P_i(a) or log(P_i(a)/Q(a)), depending on the context of its usage!
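The log-odds form of the score can be sketched in a few lines. This example assumes a fixed background Q, strictly positive position probabilities, and base-2 logarithms; the slides do not fix the base, and the function names and numbers are hypothetical.

```python
import math


def log_odds_pssm(probs, background):
    """Build w[i][a] = log2(P_i(a) / Q(a)) from per-position probabilities.

    probs: list of dicts, one per position, mapping base -> P_i(a) (all > 0).
    background: dict mapping base -> Q(a).
    """
    return [{a: math.log2(col[a] / background[a]) for a in "ACGT"} for col in probs]


def score(w, kmer):
    """Score(s_1..s_k) = sum_i w[i, s_i]; positions contribute independently."""
    return sum(w[i][c] for i, c in enumerate(kmer))


# Toy 2-position motif (hypothetical numbers) against a uniform background.
probs = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
uniform = {a: 0.25 for a in "ACGT"}
w = log_odds_pssm(probs, uniform)
print(score(w, "AC"), score(w, "GT"))
```

A positive score means the k-mer is more likely under the motif model M than under the background; a zero probability in `probs` would make the logarithm undefined.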
PSSM vs. IUPAC
• PSSM:
  + Enables representing low/high affinity at different positions
  + Allows trading off sensitivity and specificity in genome-wide scans
  - Huge search space: how to cover it efficiently?
ABF1 example (targets by Lee et al., 2002)
>YAL011W: CGTGTTAGATGA
[Figure: sequence annotations marked √ and ?]
How to Learn PSSM Motif?
Easier task: we have aligned samples to learn from:
• We have a set of known BS, all of length k (e.g. verified by some biological experiment)
• Compute counts for each base in each position and normalize == ML estimator. With N the number of sequences and N_a the number of “a”s in position i:
$$P_i(a) = \frac{N_a}{N}$$
Note: This is the ML solution. As in many other cases, it can be problematic when we have very few samples to learn from (e.g. we can get probability 0 for base A in position i simply because we did not see enough examples).
Solution: use pseudo-counts or some prior (e.g. a Dirichlet prior)
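The aligned case can be sketched directly: count bases per column and normalize, optionally adding the same pseudo-count to every base. The function name, the pseudo-count value, and the toy sites are hypothetical choices for illustration.

```python
from collections import Counter


def estimate_pssm(sites, pseudocount=0.0, alphabet="ACGT"):
    """Per-position base probabilities from aligned sites of equal length.

    pseudocount=0 gives the ML estimate N_a / N; a positive value is added
    to every base count before normalizing.
    """
    k = len(sites[0])
    n = len(sites)
    total = n + pseudocount * len(alphabet)
    model = []
    for i in range(k):
        counts = Counter(site[i] for site in sites)  # missing bases count as 0
        model.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return model


# Toy aligned sites (hypothetical).
sites = ["ACG", "ACT", "AGG", "TCG"]
ml = estimate_pssm(sites)
smoothed = estimate_pssm(sites, pseudocount=1.0)
print(ml[0])        # C and G get probability 0 at position 1
print(smoothed[0])  # every base now has non-zero probability
```

With only four sites the ML estimate assigns probability 0 to unseen bases, which is the problem the note describes; the smoothed estimate avoids it.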
How to Learn PSSM Motif? (2)
Remember: in the motif finding problem we have a much harder task:
• The input is a set of (long) sequences suspected to contain a common motif (a PSSM according to our current model assumption), but we don’t know where!
• The output is a prediction of new BS based on our learned PSSM motif.
[Figure: input sequences, a BS model (PSSM with rows A, C, G, T and columns 1 to 7), and predictions. Dark blue marks the BS positions, which are hidden from us and which we are trying to learn.]
How to Learn PSSM Motif? (3)
MEME Algorithm (Bailey T.L. and Elkan C.P., 1995)
(Still) one of the most commonly used tools for motif (PSSM) search:
How to Learn PSSM Motif? (3)
MEME Algorithm (Bailey T.L. and Elkan C.P., 1995)
The basic probabilistic framework used by MEME:
• Input: N sequences
• Assume a generative model: sequence is generated either by the BS model M (PSSM) or from a fixed background distribution BG
• Assume each sequence has exactly 1 BS in it
• Scoring function: P(Seq | M, BG)
• Try to maximize the likelihood scoring function by adjusting M’s (PSSM) parameters
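To illustrate the scoring function, the sketch below computes P(Seq | M, BG) for one sequence under the one-site-per-sequence assumption by summing over every possible site position. It assumes a uniform prior over start positions and a position-independent background; both are simplifying assumptions made here, and the names are hypothetical rather than MEME's own.

```python
import math


def oops_likelihood(seq, motif, bg):
    """P(seq | M, BG) with exactly one site, its start drawn uniformly (assumption).

    motif: list of dicts P_i(a); bg: dict Q(a) with Q(a) > 0.
    """
    k = len(motif)
    n_starts = len(seq) - k + 1
    bg_all = math.prod(bg[c] for c in seq)  # probability if everything were background
    total = 0.0
    for j in range(n_starts):
        # Swap background terms for motif terms inside the window starting at j.
        ratio = math.prod(motif[i][seq[j + i]] / bg[seq[j + i]] for i in range(k))
        total += ratio / n_starts
    return bg_all * total


def log_likelihood(seqs, motif, bg):
    """Sum of per-sequence log-likelihoods: the quantity to maximize over M."""
    return sum(math.log(oops_likelihood(s, motif, bg)) for s in seqs)


uniform = {a: 0.25 for a in "ACGT"}
only_a = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]  # toy 1-position motif
print(oops_likelihood("AC", only_a, uniform))
```

For the toy call, the site can only be the A, so the result is 1/2 · P(A) · Q(C) = 0.125.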
How to Learn PSSM Motif? (4)
What’s the problem? Why is it hard?
• Think of the positions of the BS in each sequence as H, where H is a vector of dimension N.
• Given H we have complete data. Inferring the ML parameters of M is then as easy as in the aligned case.
• Problem 1: We don’t have H; we are trying to learn it too. If H is not given, the ML parameters of M for each position become dependent → we have no closed form to compute them analytically, and going over all possible H assignments is not feasible → we need to resort to some method for searching the space of possible assignments to M’s parameters.
• Problem 2: The landscape of the likelihood function is typically far from convex → many local optima.
How to Learn PSSM Motif? (5) MEME Algorithm
• MEME uses a technique called EM to search the space of model M’s parameters
• EM = Expectation Maximization
• We review how EM is used in the MEME algorithm in class…
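Since the details are left for class, here is a minimal EM sketch for the one-site-per-sequence setting. It is not MEME itself: it assumes a fixed background, a uniform prior over site positions, a fixed pseudo-count, and a starting point seeded from a k-mer. All names and parameter values are hypothetical.

```python
import math

ALPHABET = "ACGT"


def seed_motif(kmer, strength=0.7):
    """Hypothetical starting point: most of the mass on each seed letter."""
    rest = (1.0 - strength) / (len(ALPHABET) - 1)
    return [{a: (strength if a == c else rest) for a in ALPHABET} for c in kmer]


def em_oops(seqs, motif, bg, n_iter=20, pseudocount=0.1):
    """Alternate between inferring site positions (E) and re-estimating the PSSM (M)."""
    k = len(motif)
    for _ in range(n_iter):
        counts = [{a: pseudocount for a in ALPHABET} for _ in range(k)]
        for s in seqs:
            # E-step: posterior over the hidden start position in this sequence.
            odds = [math.prod(motif[i][s[j + i]] / bg[s[j + i]] for i in range(k))
                    for j in range(len(s) - k + 1)]
            z = sum(odds)
            for j, o in enumerate(odds):
                for i in range(k):
                    counts[i][s[j + i]] += o / z  # expected (fractional) count
        # M-step: normalize expected counts into per-position probabilities.
        motif = [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]
    return motif


def consensus(motif):
    """Most probable base at each position."""
    return "".join(max(col, key=col.get) for col in motif)


# Toy sample (hypothetical): ACGT planted once in each sequence.
toy = ["TTTTACGTTTT", "GGACGTGGGGG", "CCCCCCACGTC"]
uniform = {a: 0.25 for a in ALPHABET}
learned = em_oops(toy, seed_motif("ACGT"), uniform)
print(consensus(learned))
```

The result depends on the seed: starting from a different k-mer can lead EM to a different local optimum of the likelihood.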
Problems with the MEME & other Models
• Think: In light of what we discussed, what assumptions are made in this model? What might cause us problems with “real”-life data?
• MEME also has other variants that we did not discuss here (OOPS, ZOOPS, etc.)
• Also: EM is very sensitive to its starting point → we need a good way to find good starting points
Other Algorithmic Techniques for Motif Finding
• MEME (Expectation Maximization)
• GibbsDNA, AlignACE (Gibbs Sampling)
• CONSENSUS (greedy multiple alignment)
• WINNOWER (clique finding in graphs)
• SP-STAR (sum-of-pairs scoring)
• MITRA (mismatch trees to prune the exhaustive search space)
More than one way to skin a cat…
How to Find Binding Sites - Revisited
“Classical” solutions: find a common motif in a gene set (CONSENSUS, MITRA, MEME, AlignACE…)
Main problem: in many cases the motif is common not just to the subset of sequences we have, but to many others as well → not a good candidate to explain regulation
Discriminative solutions: find a common & unique motif in the genes, i.e. extract the relevant bit from the sequences
[Figure: gene set and promoters]
$$P_{hyper}(k \mid n, K, M) = \sum_{x=k}^{n} \frac{\binom{K}{x}\binom{M-K}{n-x}}{\binom{M}{n}}$$
“A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01
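The hypergeometric tail can be evaluated exactly with integer binomials. The reading of the symbols used below is an assumption, since the slide does not define them: M genes in total, K of which contain the motif, and a selected set of n genes of which k contain it. The function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, M):
    """P(X >= k) when n genes are drawn without replacement from M, K of them with the motif."""
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out automatically.
    favorable = sum(comb(K, x) * comb(M - K, n - x) for x in range(k, upper + 1))
    return favorable / comb(M, n)


# Toy numbers (hypothetical): 10 genes, 5 with the motif, a set of 5 that all contain it.
print(hypergeom_tail(5, 5, 5, 10))   # 1/252
print(hypergeom_tail(0, 5, 5, 10))   # 1.0
```

A small value says the motif is enriched in the selected set relative to the rest of the genes, which is the discriminative criterion described above.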
Finding Discriminative Motifs
Step 1:
• Define the space of motifs: “mimic” motifs with a simpler class for efficient search
• Search the space and evaluate motifs using discriminative scoring
• Choose significant motifs: correct for multiple hypotheses (Bonferroni or FDR criteria)
Step 2: Refine motifs
“A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01
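Because many candidate motifs are scored, their p-values need a multiple-hypothesis correction. The sketch below shows the Bonferroni rule and, as one common FDR criterion, the Benjamini-Hochberg step-up procedure; the slide does not say which FDR procedure was used, so that choice is an assumption, as are the function names and toy p-values.

```python
def bonferroni(pvalues, alpha=0.05):
    """Accept motif i only if p_i <= alpha / (number of tested motifs)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up FDR control: accept the r smallest p-values, where r is the
    largest rank with p_(r) <= alpha * r / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    accepted = set(order[:cutoff])
    return [i in accepted for i in range(m)]


# Toy p-values for four candidate motifs (hypothetical).
pvals = [0.001, 0.02, 0.03, 0.5]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

On the toy values Bonferroni keeps one motif and Benjamini-Hochberg keeps three, showing how much more conservative the former is.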
Binding Sites - Revisited
→ independence assumption
Two relevant questions:
• Are there dependencies in binding sites?
• Do we gain an edge in computational tasks if we model such dependencies?
[Figure: a binding site (A?C?T) in the promoter of a gene]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
How to model binding sites?
How to represent a distribution $P(X_1, X_2, X_3, X_4, X_5)$ of binding sites?
Profile (independence model):
$$P(X_1 \dots X_5) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)$$
Tree (direct dependencies):
$$P(X_1 \dots X_5) = P(X_1)\,P(X_2 \mid X_3)\,P(X_3 \mid X_1)\,P(X_4)\,P(X_5 \mid X_3)$$
Mixture of profiles (global dependencies):
$$P(X_1 \dots X_5) = \sum_{T} P(T)\,P(X_1 \mid T)\,P(X_2 \mid T)\,P(X_3 \mid T)\,P(X_4 \mid T)\,P(X_5 \mid T)$$
Mixture of trees (both types of dependencies):
$$P(X_1 \dots X_5) = \sum_{T} P(T)\,P(X_1 \mid T)\,P(X_2 \mid T, X_3)\,P(X_3 \mid T, X_1)\,P(X_4 \mid T)\,P(X_5 \mid T, X_3)$$
[Figure: graphical models over X1 … X5 for the four model classes, with a mixture variable T in the two mixture models]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
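A two-position toy case shows what the independence model cannot express. In the sketch below, a mixture of two profiles generates only AA or CC, while a single profile with the same per-position marginals also gives weight to AC and CA. The numbers and names are hypothetical and are not taken from the cited paper.

```python
import math


def profile_prob(site, profile):
    """Independence model: P(X_1..X_k) = prod_i P_i(X_i)."""
    return math.prod(profile[i][c] for i, c in enumerate(site))


def mixture_prob(site, weights, profiles):
    """Mixture of profiles: P(X_1..X_k) = sum_T P(T) prod_i P(X_i | T)."""
    return sum(w * profile_prob(site, p) for w, p in zip(weights, profiles))


def column(**probs):
    """Helper: a full A/C/G/T column with unspecified bases at probability 0."""
    return {a: probs.get(a, 0.0) for a in "ACGT"}


# Two components: one always emits AA, the other always CC.
comp_a = [column(A=1.0), column(A=1.0)]
comp_c = [column(C=1.0), column(C=1.0)]
weights = [0.5, 0.5]

# A single profile with the same per-position marginals.
marginal = [column(A=0.5, C=0.5), column(A=0.5, C=0.5)]

for site in ("AA", "AC"):
    print(site, mixture_prob(site, weights, [comp_a, comp_c]), profile_prob(site, marginal))
```

The mixture assigns AA probability 0.5 and AC probability 0, whereas the profile assigns 0.25 to both.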
Learning models: Aligned binding sites
Learning procedure for Bayesian networks
GCGGGGCCGGGCTGGGGGCGGGGTAGGGGGCGGGGGTAGGGGCCGGGCTGGGGGCGGGGTAAAGGGCCGGGCGGGAGGCCGGGAGCGGGGCGGGGCGAGGGGACGAGTCCGGGGCGGTCCATGGGGCGGGGC
[Figure: aligned binding sites are passed to the learning machinery, which fits the candidate models over X1 … X5 (with a mixture variable T in the mixture models) and selects the maximum likelihood model]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
Arabidopsis ABA binding factor 1 (49 examples)
• Profile: test LL per instance -19.93
• Mixture of Profiles (components: 76%, 24%): test LL per instance -18.70 (+1.23; improvement in likelihood > 2-fold)
• Tree (over X4 … X12): test LL per instance -18.47 (+1.46; improvement in likelihood > 2.5-fold)
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
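The quoted fold improvements follow from the log-likelihood differences if the log-likelihoods are in base 2. The base is an assumption here, made because it reproduces the "> 2-fold" and "> 2.5-fold" figures:

```latex
\begin{align*}
\Delta_{\text{mixture}} &= -18.70 - (-19.93) = 1.23, & 2^{1.23} &\approx 2.35 > 2 \\
\Delta_{\text{tree}}    &= -18.47 - (-19.93) = 1.46, & 2^{1.46} &\approx 2.75 > 2.5
\end{align*}
```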
Rap1 Example (Harbison et al. 04) (171 examples)
[Figure: learned Profile, Mixture of Profiles, and Tree (over X4 … X12) models]
Likelihood improvement over profiles
[Figure: fold change in likelihood on held-out test data (y-axis, log scale from ¼ to 8) across datasets of binding sites (x-axis, 1 to 91), marked as significant or non-significant; the label “67” appears in the plot]
Significant improvement in generalization
Data often exhibits dependencies
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
Learning models: unaligned data
Use the EM algorithm to simultaneously:
• Identify binding site positions
• Learn a dependency model
[Figure: EM algorithm loop over unaligned data, alternating between learning a model and identifying binding sites; the candidate models over X1 … X5 are the same four classes, with a mixture variable T in the mixture models]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
Evaluating Performance
Detect target genes on a genomic scale:
[Figure: upstream region from -1000 to 0 (ACGTAT … AGGGATGCGAGC), with a site marked at -473]
Scoring rule:
$$\mathrm{Score}(X_1,\dots,X_K) = \log \frac{P_1(X_1,\dots,X_K)}{P_0(X_1,\dots,X_K)}$$
where P_1 is the probability by the binding site model and P_0 is the background model (order-3 Markov chain).
Crucial issue: p-value of scores
“CIS: Compound Importance Sampling Method for Protein-DNA Binding Site p-value Estimation” Bioinformatics, 2004, ISMB 04
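The scoring rule can be sketched as a scan over an upstream region, with P_0 given by an order-3 Markov chain estimated from background sequences. For brevity this example uses a plain profile as the binding site model P_1 and natural logarithms; both are assumptions, as are the function names and the pseudo-count.

```python
import math
from collections import defaultdict


def train_background(seqs, order=3, pseudocount=1.0):
    """Count next-base occurrences after each length-`order` context."""
    counts = defaultdict(lambda: {a: pseudocount for a in "ACGT"})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def background_prob(counts, context, letter):
    """P_0(letter | previous `order` bases); unseen contexts fall back to pseudo-counts."""
    row = counts[context]
    return row[letter] / sum(row.values())


def scan(seq, site_model, counts, order=3):
    """(start, log-odds score) for every window that has a full background context."""
    k = len(site_model)
    scores = []
    for j in range(order, len(seq) - k + 1):
        log_p1 = sum(math.log(site_model[i][seq[j + i]]) for i in range(k))
        log_p0 = sum(math.log(background_prob(counts, seq[j + i - order:j + i], seq[j + i]))
                     for i in range(k))
        scores.append((j, log_p1 - log_p0))
    return scores


# Toy usage (hypothetical): an untrained background is uniform via pseudo-counts.
flat_bg = train_background([])
site = [{"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}]
print(scan("TTTAC", site, flat_bg))
```

Turning a score threshold into a list of predicted target genes requires the p-value of the score, which is the issue the cited CIS method addresses and which this sketch does not cover.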
Example: ROC curve of HSF1
[Figure: true positive rate (sensitivity, 0% to 90%) against false positive rate (0% to 5%) for the Profile, Mixture of Profiles, Tree, and Mixture of Trees models; an annotation marks ~60 FP]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
Evaluation – Localization Data
5-fold cross validation [Lee et al 2002]
Improvement by Mix of Trees over PSSM
[Figure: scatter of Δ specificity (TP/Predicted, y-axis from -25 to 20) against Δ sensitivity (TP/True, x-axis from -20 to 60); an inset shows TP as the overlap of the “True” and Predicted sets]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
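The two measures on the plot axes can be computed from the sets of true and predicted targets. The sketch follows the definitions given on the slide (sensitivity = TP/True, specificity = TP/Predicted; the latter is what is often called precision elsewhere). The function name and gene identifiers are hypothetical.

```python
def sensitivity_specificity(true_targets, predicted):
    """Return (TP/True, TP/Predicted) for two collections of gene identifiers."""
    true_targets, predicted = set(true_targets), set(predicted)
    tp = len(true_targets & predicted)  # predicted genes that are real targets
    sensitivity = tp / len(true_targets) if true_targets else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy example (hypothetical gene names).
true_set = {"g1", "g2", "g3", "g4"}
pred_set = {"g3", "g4", "g5"}
print(sensitivity_specificity(true_set, pred_set))
```

The Δ values in the figure would then be differences of these quantities between two models evaluated on the same held-out fold.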
Motif Finding - Evaluation
• Still an open problem
• We have seen several examples of how performance can be evaluated in different ways
• There is (still) no absolute solution for this
• Main problems:
  • No large data sets of known sites
  • No real annotation of negative samples
  • How to define a success measure?
  • Differences in input/output assumptions
  • …
• A recent effort in this direction: “Assessing computational tools for the discovery of transcription factor binding sites” (Nat. Biotech. Jan 05) compared publicly available tools on the web on (small) data sets of known binding sites based on the Transfac D.B.