Introduction to Bioinformatics.
LECTURE 10: Identification of regulatory sequences
* Chapter 10: A bed-time story
10.1 The circadian clock
* All living beings have a biological clock (think of jet lag), called the circadian rhythm or circadian clock
* Mismatches between the circadian rhythm and the natural day-night cycle lead to various health problems
* The internal clock synchronizes numerous functions such as metabolism, activity/awareness level, and body temperature
* This is especially true for plants, which depend on daylight for photosynthesis
* Plants lead a stressful life; they have many needs (water, sun, nutrients) but are unable to move
* Rather than moving, plants react to external stress by changing their internal condition
* Herbivore? → Chemical repellent! (e.g. nicotine)
* Falling temperature? → Anti-freeze proteins!
* Plants that can ‘anticipate’ changes have a competitive advantage → this is the importance of a circadian clock
Arabidopsis thaliana (from Wikipedia, the free encyclopedia)
Scientific classification:
Kingdom: Plantae
Division: Magnoliophyta
Class: Magnoliopsida
Order: Brassicales
Family: Brassicaceae
Subfamily: Brassicoideae
Genus: Arabidopsis
Species: A. thaliana
Arabidopsis thaliana, commonly called arabidopsis, thale cress, or mouse-ear cress, a small flowering plant related to cabbage
and mustard, is one of the model organisms for studying plant sciences, including genetics and plant development. It plays
the role for agricultural sciences that mice and fruit flies (Drosophila) play in human biology.
Arabidopsis thaliana genome: 120 Mbp, 5 chromosomes, 29,000 genes
* Arabidopsis thaliana has a cell-autonomous circadian clock: every single cell keeps track of the day-night cycle independently
[Figure: clock state over the day-night cycle, alternating between "awake" (+) and "asleep" (-)]
* If you remove the day-night stimulus and keep A. thaliana in constant light (or constant dark), then within days the clock loses its periodicity.
* In contrast, mammals kept in constant light keep the circadian clock running for months.
* How do A. thaliana and other organisms run their circadian clocks?
* Three proteins are the key players: LHY, CCA1, and TOC1
* Transcription factor (TF): a protein that regulates transcription. TFs regulate the binding of RNA polymerase and the initiation of transcription. A TF binds upstream or downstream of a gene and either enhances or represses its transcription by assisting or blocking RNA polymerase binding.
* Transcription factor binding site (TFBS): the location on the DNA molecule where a TF can physically attach.
[Figure: TFs attached to TFBSs along the DNA]
* A TFBS has a specific nucleotide sequence to which the TF attaches. This sequence is called the motif
* Motifs are short (5-15 bp) sequences of nucleotides, e.g. TATAA, TAAAAAAAAAATCTA, TATCTG, …
* Different TFs have different TFBS motifs
* However, there is some freedom in the motif sequence: a given TF may lock to TATACT, but also to close variants such as TATAACT
10.2 Basic mechanisms of gene expression
* Statistical and algorithmic issues in finding TFBS motifs
* Protein control: gene transcription control, mRNA control, post-translational control of proteins, …
[Figure: gene expression, gene → mRNA → protein, with TFs acting on the gene]
Genetic signposts
* A gene embedded in random DNA is totally inert
* Promoter = regulatory DNA = cis-regulatory DNA
* The promoter is a region of the DNA just before (= upstream of) the gene that indicates where transcription starts …
[Figure: Promoter locking (simplified)]
[Figure: Promoter locking (realistic)]
* A major TFBS is the RNA polymerase binding site
* Eubacteria: rigid motifs at -10: TATAAT, at -35: TTGACA
* Eukaryota: a different RNA polymerase and therefore different motifs; TATA-box (= TATAA[A/T]) at ~ -40
* Other docking sites at around -1000, but also at many other places up to -250,000 (and possibly further)
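The bracket notation TATAA[A/T] maps directly onto a regular expression. The following is a minimal sketch, not part of the lecture: the helper name `find_tata_boxes` and the promoter fragment are invented for illustration, and positions are 0-based offsets into the string rather than coordinates relative to the transcription start.

```python
import re

# TATA-box pattern from the slide: TATAA followed by A or T.
# The lookahead makes overlapping occurrences visible as well.
TATA_BOX = re.compile(r"(?=(TATAA[AT]))")


def find_tata_boxes(sequence):
    """Return (start, matched_text) for every TATA-box hit in the sequence."""
    return [(m.start(), m.group(1)) for m in TATA_BOX.finditer(sequence.upper())]


# Hypothetical promoter fragment, invented for illustration only.
promoter = "GGCTATAAAGGCCGTATAATCC"
hits = find_tata_boxes(promoter)
print(hits)  # [(3, 'TATAAA'), (14, 'TATAAT')]
```

A real scan would also have to search the reverse strand and translate string offsets into positions relative to the gene.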
Computational challenges in finding TFBS
Finding TFBS motifs is complex:
1. TFBSs are very short, so matching sequences will also appear by chance alone
2. There is a high variability (ATAATC, ATAATT, ATACTC, …)
3. We know neither the TFBS motif nor the TFBS location
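Point 1 can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent, equiprobable bases (a simplification that real genomes do not satisfy) and counts one strand only; the function name `expected_chance_hits` is made up for this example. The figures of 1000 bp upstream and 29,000 genes are taken from the slides.

```python
def expected_chance_hits(word_length, sequence_length, n_sequences=1):
    """Expected number of exact occurrences of one fixed word in random DNA.

    Assumes independent bases with probability 1/4 each, so a given word of
    length k matches at any one position with probability (1/4)**k.
    """
    positions = max(sequence_length - word_length + 1, 0)
    return n_sequences * positions / 4 ** word_length


# One fixed 6-mer in a single 1000-bp upstream region
print(expected_chance_hits(6, 1000))
# The same 6-mer across the upstream regions of 29,000 genes
print(expected_chance_hits(6, 1000, 29000))
# A fixed 15-mer across the same 29,000 regions
print(expected_chance_hits(15, 1000, 29000))
```

Under these assumptions a specific 6-mer is expected thousands of times across all upstream regions, while a specific 15-mer is expected far less than once, which is why short motifs need a statistical comparison against a background model.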
* Trick 1: look at regions of the gene with high conservation
* Trick 2: co-regulated genes (sharing the same TF): look for shared motifs upstream
* For Arabidopsis thaliana: look for upstream motifs bound by LHY and CCA1:
* [i] cluster genes with the same day-night oscillatory pattern
* [ii] look in this cluster for shared motifs upstream, up to -1000
10.3 Motif-finding strategies
* Where to look for a TFBS? Look up to roughly 1000 bp upstream
* gapped/ungapped motif
* fixed/variable length motif
* TFBS motifs are variable but highly similar
* Consensus sequence: most probable sequence
Consensus motif: a useful notation
PSSM / PSWM / profile
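* Position-Specific Scoring Matrix (PSSM), also called Position-Specific Weight Matrix (PSWM) or profile
* PSSM: a multinomial model of a sequence of length L
* PSSM: the multinomial distribution depends on the position in the sequence: P[position, symbol], with symbol in {C, T, G, A} for DNA, or one of the 20 amino acids for proteins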
Example 10.1: fixed length – ungapped motif
A T G C T G A A T G T A C T A T A T A G T A A T C T G T C A A T A T G T A A C C T A A T T G T T C A G A T T T C C C A C C T C G A C A A A T T T A C T C A G A T T C T C
Note: we know neither the place (*) nor the length (6) of the TFBS
A T G *C T G A A T G T A *C T A T A T A G T A A T C T G T *C A A T A T G T A A C *C T A A T T G T T *C A G A T T T C C C A C C T C G A *C A A A T T T A C T *C A G A T T C T C
Alignment:
C T G A A T
C T A T A T
C A A T A T
C T A A T T
C A G A T T
C A A A T T
C A G A T T

PSSM (counts per motif position):
   1 2 3 4 5 6
A  0 5 5 5 4 0
C  7 0 0 0 0 0
G  1 0 3 0 0 0
T  0 3 0 3 4 8

Consensus motif: C A A A T T
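How a count matrix and a consensus are derived from aligned sites can be sketched in a few lines. This is an illustration rather than the lecture's own code: the function names are invented, and the count matrix is copied from the slide as printed. Column 5 is a tie between A and T, so the sketch reports it in bracket notation; the slide resolves the tie in favour of T.

```python
# Count matrix from Example 10.1 (rows: symbols; columns: motif positions 1-6).
COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}


def count_matrix(sites):
    """Build a count matrix from equal-length, already aligned sites."""
    length = len(sites[0])
    counts = {symbol: [0] * length for symbol in "ACGT"}
    for site in sites:
        for i, symbol in enumerate(site):
            counts[symbol][i] += 1
    return counts


def consensus(counts):
    """Most frequent symbol per column; ties are written in brackets, e.g. [AT]."""
    length = len(next(iter(counts.values())))
    parts = []
    for i in range(length):
        best = max(counts[symbol][i] for symbol in counts)
        winners = "".join(sorted(s for s in counts if counts[s][i] == best))
        parts.append(winners if len(winners) == 1 else f"[{winners}]")
    return "".join(parts)


print(consensus(COUNTS))  # CAAA[AT]T
print(count_matrix(["CTGAAT", "CAAATT"])["A"])  # [0, 1, 1, 2, 1, 0]
```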
Identifying motifs
* Start position, motif sequence and motif length are unknown
* PSSM = scoring matrix derived from a multiple alignment
* What counts as a significant result? Compare the motif with the background model, i.e. the probability, estimated from the current set of sequences, that the motif occurs by pure chance
Identifying motifs [2]
* Finding the motif sequence algorithmically by optimizing a scoring function is computationally extremely expensive
* Therefore heuristics have been proposed
* Gibbs sampling is an example of a randomized heuristic
* We now focus on ungapped motifs of fixed length, as is the case for the circadian rhythm in A. thaliana
Identifying motifs [3]
Ungapped motif of fixed length, as is the case for the circadian rhythm in A. thaliana
PSSM from Example 10.1 (relative frequencies per motif position):
   1    2    3    4    5    6
A  0    5/8  5/8  5/8  4/8  0
C  7/8  0    0    0    0    0
G  1/8  0    3/8  0    0    0
T  0    3/8  0    3/8  4/8  8/8
Identifying motifs [4]
A motif is interesting if it is unlikely under the background distribution: column 6 is more unbalanced than column 1
Scoring function for imbalance: Kullback–Leibler divergence (KL divergence):
$$S_{KL} = \sum_{i \,\in\, \text{positions}} \; \sum_{k \,\in\, \text{letters}} p_i[k] \, \log \frac{p_i[k]}{q_i[k]}$$
where p_i[k] is the probability of observing symbol k at position i, and q_i[k] is the multinomial background model for symbol k at position i.
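The claim that column 6 is more unbalanced than column 1 can be checked numerically. The sketch below evaluates the KL score per column for the relative frequencies of Example 10.1. Two assumptions are mine, not the lecture's: the background is taken to be uniform (1/4 per base) at every position, and terms with zero probability are skipped, following the usual convention 0 log 0 = 0. Function names are illustrative.

```python
import math

# Relative frequencies from Example 10.1 (columns: motif positions 1-6).
FREQS = {
    "A": [0, 5 / 8, 5 / 8, 5 / 8, 4 / 8, 0],
    "C": [7 / 8, 0, 0, 0, 0, 0],
    "G": [1 / 8, 0, 3 / 8, 0, 0, 0],
    "T": [0, 3 / 8, 0, 3 / 8, 4 / 8, 8 / 8],
}
# Assumption: uniform, position-independent background model.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column_kl(freqs, background, i):
    """KL divergence (natural log) of motif column i from the background."""
    total = 0.0
    for symbol, q in background.items():
        p = freqs[symbol][i]
        if p > 0:  # a zero-probability term contributes nothing
            total += p * math.log(p / q)
    return total


def motif_kl(freqs, background):
    """Sum of the column-wise KL divergences over all motif positions."""
    length = len(next(iter(freqs.values())))
    return sum(column_kl(freqs, background, i) for i in range(length))


per_column = [column_kl(FREQS, BACKGROUND, i) for i in range(6)]
for position, score in enumerate(per_column, start=1):
    print(f"column {position}: {score:.3f}")
print(f"whole motif: {motif_kl(FREQS, BACKGROUND):.3f}")
```

Column 6 contains only T, so its score equals log 4, the largest value a column can reach against a uniform four-letter background.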
Identifying motifs [5]
To avoid zero entries and the resulting divergences (log 0), a statistical trick is to add pseudocounts: add 1 to each entry

PSSM (counts):
   1 2 3 4 5 6
A  0 5 5 5 4 0
C  7 0 0 0 0 0
G  1 0 3 0 0 0
T  0 3 0 3 4 8

PSSM + pseudocounts:
   1 2 3 4 5 6
A  1 6 6 6 5 1
C  8 1 1 1 1 1
G  2 1 4 1 1 1
T  1 4 1 4 5 9
Identifying motifs [6]
PSSM with pseudocounts has no zeros:
A  1/12  6/12  …
C  8/12  1/12  …
G  2/12  1/12  …
T  1/12  4/12  …
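The pseudocount step is simple enough to write out in full. The sketch below uses exact fractions so that the entries can be compared with the slide; note that Python reduces fractions, so 8/12 is shown as 2/3. The function names are invented for this example.

```python
from fractions import Fraction

# Raw counts from Example 10.1 (columns: motif positions 1-6).
COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}


def add_pseudocounts(counts, pseudo=1):
    """Add the same pseudocount to every entry of the count matrix."""
    return {symbol: [c + pseudo for c in row] for symbol, row in counts.items()}


def to_probabilities(counts):
    """Normalize each column so that its entries sum to 1."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[s][i] for s in counts) for i in range(length)]
    return {
        symbol: [Fraction(c, totals[i]) for i, c in enumerate(row)]
        for symbol, row in counts.items()
    }


probs = to_probabilities(add_pseudocounts(COUNTS))
for symbol, row in probs.items():
    print(symbol, [str(p) for p in row])
```

Each column total rises from 8 to 12 because one pseudocount is added for each of the four symbols.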
Finding high-scoring motifs
* Sequence s of length n (> L = length of the PSSM)
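* Slide the PSSM along the sequence and compute the likelihood of the window starting at position j:
$$L(j) = \prod_{i=1}^{L} p_i[s_{j+i-1}], \qquad j = 1, \dots, n - L + 1$$
(on the left-hand side L(j) denotes the likelihood; in the product bound L is the length of the PSSM)
* Use this to find the starting position (the j with the highest value) and the most probable motif (the argmax of L(j)).
NOTE: in practice use the log-likelihood l(j) = log L(j)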
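The sliding-window scan can be sketched as follows, using the pseudocounted PSSM and the sequence of Example 10.1. This is an illustration under my own naming (`log_likelihoods`, `best_start`); start positions are 0-based string offsets, and only the forward strand is scanned.

```python
import math

# Pseudocounted counts from Example 10.1, normalized by the column total of 12.
PSEUDO_COUNTS = {
    "A": [1, 6, 6, 6, 5, 1],
    "C": [8, 1, 1, 1, 1, 1],
    "G": [2, 1, 4, 1, 1, 1],
    "T": [1, 4, 1, 4, 5, 9],
}
PSSM = {symbol: [c / 12 for c in row] for symbol, row in PSEUDO_COUNTS.items()}
MOTIF_LEN = 6

# Sequence of Example 10.1 with the spaces removed.
EXAMPLE_SEQ = (
    "ATGCTGAATGTACTATATAGTAATCTGTCAATATGTAACCTAATTGTTCAGATTTCCCACC"
    "TCGACAAATTTACTCAGATTCTC"
)


def log_likelihoods(sequence, pssm, motif_len):
    """Log-likelihood l(j) of every window of length motif_len (0-based j)."""
    scores = []
    for j in range(len(sequence) - motif_len + 1):
        window = sequence[j:j + motif_len]
        scores.append(sum(math.log(pssm[s][i]) for i, s in enumerate(window)))
    return scores


def best_start(sequence, pssm, motif_len):
    """Start position with the highest log-likelihood (first one on ties)."""
    scores = log_likelihoods(sequence, pssm, motif_len)
    return max(range(len(scores)), key=scores.__getitem__)


j = best_start(EXAMPLE_SEQ, PSSM, MOTIF_LEN)
print(j, EXAMPLE_SEQ[j:j + MOTIF_LEN])
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow for longer motifs, which is the practical reason behind the NOTE on the slide.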
Finding high-scoring motifs [2]
ALGORITHM FOR FINDING TFBS MOTIFS:
0. Start with random location j and random PSSM
Iteration:
1. With fixed j, optimize the PSSM
2. With fixed PSSM, optimize j
Until the result has converged
This is the EM algorithm
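The alternation between the two steps can be sketched as below. Several simplifications are assumptions of this sketch and not statements from the lecture: the input is a set of sequences with exactly one site each, the start positions are initialized at random and the first PSSM is derived from them, and each step picks the single best location (a "hard" assignment) instead of the probability-weighted average used by a full EM algorithm. All names and the toy sequences are invented.

```python
import math
import random

ALPHABET = "ACGT"


def build_pssm(sequences, starts, motif_len, pseudo=1):
    """Step 1: with fixed start positions, re-estimate the PSSM (with pseudocounts)."""
    counts = [{s: pseudo for s in ALPHABET} for _ in range(motif_len)]
    for seq, j in zip(sequences, starts):
        for i in range(motif_len):
            counts[i][seq[j + i]] += 1
    total = len(sequences) + pseudo * len(ALPHABET)
    return [{s: c / total for s, c in column.items()} for column in counts]


def window_loglik(seq, j, pssm):
    """Log-likelihood of the window of seq that starts at position j."""
    return sum(math.log(column[seq[j + i]]) for i, column in enumerate(pssm))


def best_starts(sequences, pssm):
    """Step 2: with a fixed PSSM, pick the highest-scoring start in each sequence."""
    motif_len = len(pssm)
    return [
        max(range(len(seq) - motif_len + 1), key=lambda j: window_loglik(seq, j, pssm))
        for seq in sequences
    ]


def alternate(sequences, motif_len, seed=0, max_iter=100):
    """Alternate steps 1 and 2 until the start positions stop changing."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(seq) - motif_len + 1) for seq in sequences]
    for _ in range(max_iter):
        pssm = build_pssm(sequences, starts, motif_len)
        new_starts = best_starts(sequences, pssm)
        if new_starts == starts:
            break
        starts = new_starts
    return starts, build_pssm(sequences, starts, motif_len)


# Toy input, invented for illustration.
toy = ["GGCAAATTGG", "TTCAGATTAC", "ACCTATATGG"]
starts, pssm = alternate(toy, motif_len=6)
print(starts)
```

Because every step only moves to the best option given the current state, the result depends on the random start and may be a local optimum, which motivates the randomized variant discussed next in the lecture.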
Finding high-scoring motifs [3]
Gibbs sampling to avoid local optima:
Choose the location at random, with probability proportional to its score, as an alternative to always using the location with the highest score
Use a simple assumption: e.g. there is no variation, so look for a fixed sequence
Figure 10-1 shows the log-likelihood score along the sequence, and therefore the optimal locations of the motifs
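The difference between the greedy step and the Gibbs sampling step is a single line: instead of taking the argmax, a start position is drawn at random with probability proportional to its likelihood. The sketch below shows only this sampling step; the function name and the log-likelihood values are hypothetical.

```python
import math
import random


def sample_start(log_likelihoods, rng):
    """Draw a start position with probability proportional to its likelihood."""
    top = max(log_likelihoods)
    # Subtracting the maximum before exponentiating avoids numerical underflow.
    weights = [math.exp(value - top) for value in log_likelihoods]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
scores = [-12.0, -3.0, -11.5, -4.0]  # hypothetical log-likelihoods l(j)
draws = [sample_start(scores, rng) for _ in range(1000)]
print({j: draws.count(j) for j in range(len(scores))})
```

Positions with a high score are still chosen most often, but lower-scoring positions are occasionally selected, which lets the search leave a local optimum.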
10.4 Case study: the circadian rhythm
* Harmer et al. (2000): the clock-regulated elements of A. thaliana are activated in the evening, hence the name Evening Element (EE)
* Cluster the expression profiles and consider the clusters with appropriate periodicity: they are candidates for containing the EE
METHOD:
* Look only at motifs of fixed length (9): consider all words of length 9 whose frequency in the evening cluster is very different from their frequency in the rest of the data.
* Therefore examine all words of fixed length 9 in both sequences seq1 and seq2 (also considering the reverse complement).
* The motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in clusters 1 and 3). The top 10 9-mers are computed and shown.
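The margin computation can be sketched as follows. The exact counting and normalization used in the lecture are not given, so this sketch makes its own choices: every word and its reverse complement are both counted, and frequencies are relative to the total number of counted words in each group. Function names and the toy sequences are invented for illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def word_frequencies(sequences, k):
    """Relative frequency of each k-mer, counting both strands."""
    counts = Counter()
    for seq in sequences:
        for j in range(len(seq) - k + 1):
            word = seq[j:j + k]
            counts[word] += 1
            counts[reverse_complement(word)] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def top_margins(cluster, rest, k=9, top=10):
    """k-mers sorted by margin = frequency in cluster minus frequency in rest."""
    f_cluster = word_frequencies(cluster, k)
    f_rest = word_frequencies(rest, k)
    margins = {w: f - f_rest.get(w, 0.0) for w, f in f_cluster.items()}
    return sorted(margins.items(), key=lambda item: item[1], reverse=True)[:top]


# Toy upstream regions, invented for illustration only.
evening_cluster = ["GGAAAATATCTGG", "CCAAAATATCTCC"]
other_genes = ["GGGCCCGGGCCCGG", "ACGTACGTACGTAC"]
for word, margin in top_margins(evening_cluster, other_genes, top=3):
    print(word, round(margin, 3))
```

In the toy data the shared word AAAATATCT and its reverse complement AGATATTTT come out on top, because they are the only 9-mers present in both sequences of the cluster.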
METHOD [2]:
* The obtained set of motifs contains a lot of repeats (either of single letters or of 2-mers). They likely have no biological significance and must be filtered out.
* After eliminating the repeats, we observe that the most significant motif is AAAATATCT.
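One possible way to filter such repeats is shown below. The lecture does not state its exact filtering rule, so the criterion here is an assumption: a word is discarded if it consists entirely of one repeated letter or one repeated 2-mer. The function name and the candidate list are illustrative.

```python
def is_simple_repeat(word):
    """True if the word is a pure repetition of a single letter or of a 2-mer."""
    for period in (1, 2):
        if all(word[i] == word[i % period] for i in range(len(word))):
            return True
    return False


candidates = ["AAAAAAAAA", "ATATATATA", "AAAATATCT", "TCTCTCTCT"]
kept = [word for word in candidates if not is_simple_repeat(word)]
print(kept)  # ['AAAATATCT']
```

A stricter filter could also reject words that are mostly, but not entirely, repetitive; that would need a threshold, which this sketch deliberately avoids.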
METHOD [3]:
* The evening element (EE) is the motif AAAATATCT.
* We know from the study of Harmer et al. that it corresponds to the evening element (a word of 9 bases found upstream of genes turned on in the evening). Its margin is 0.00014. We notice that 2 of the other 3 top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC).
* To assess the significance of the margin found for the evening element, we perform 100 random splits of the data and measure the margin of the highest-scoring element in each split.
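The random-split test can be sketched as a small permutation experiment: shuffle the sequences, cut them into two groups of the original sizes, and record the best margin in each split. The version below is a simplified illustration with invented names and toy data; for brevity it counts the forward strand only, which differs from the method described above.

```python
import random
from collections import Counter


def kmer_frequencies(sequences, k):
    """Relative frequency of each k-mer on the forward strand."""
    counts = Counter(
        seq[j:j + k] for seq in sequences for j in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def top_margin(group_a, group_b, k):
    """Largest difference between a word's frequency in group_a and in group_b."""
    fa, fb = kmer_frequencies(group_a, k), kmer_frequencies(group_b, k)
    return max((f - fb.get(w, 0.0) for w, f in fa.items()), default=0.0)


def random_split_margins(sequences, size_a, k, n_splits=100, seed=0):
    """Top margin for each of n_splits random partitions of the sequences."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_splits):
        shuffled = sequences[:]
        rng.shuffle(shuffled)
        results.append(top_margin(shuffled[:size_a], shuffled[size_a:], k))
    return results


# Toy data, invented for illustration; k is kept small to match the toy size.
cluster = ["GGAAATATCTGG", "CCAAATATCTCC", "TTAAATATCTAA"]
rest = ["GGGCCCGGGCCC", "ACGTACGTACGT", "TTGGCCAATTGG", "CATGCATGCATG"]
observed = top_margin(cluster, rest, k=6)
null_margins = random_split_margins(cluster + rest, size_a=len(cluster), k=6)
p_value = sum(m >= observed for m in null_margins) / len(null_margins)
print(observed, max(null_margins), p_value)
```

The fraction of random splits whose top margin reaches the observed one serves as an empirical p-value; with only 100 splits it cannot resolve values below 0.01.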
METHOD [4]:
* In 100 trials we never observe a margin larger than 0.000147462.
* We can look in detail at the frequency of the evening element among all the clock-regulated genes:
EE-count and circadian rhythm in genes
Circadian time:    0    4    8   12   16   20
Number of genes:  78   45  124   67   30   93
EE count:          5    6   49   27    8    8
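The table is easier to read as fractions. The short sketch below divides the EE count by the number of genes for each circadian time point; the array names follow the slides (Ngenes, EEcount), and the pairing of columns assumes the values are listed in the same order as the time points.

```python
circadian_time = [0, 4, 8, 12, 16, 20]
Ngenes = [78, 45, 124, 67, 30, 93]
EEcount = [5, 6, 49, 27, 8, 8]

# Fraction of genes at each time point that carry the evening element.
fractions = [ee / n for ee, n in zip(EEcount, Ngenes)]
for t, frac in zip(circadian_time, fractions):
    print(f"CT {t:2d}: {frac:.1%} of genes carry the EE")

peak_time = circadian_time[fractions.index(max(fractions))]
print("highest EE fraction at circadian time", peak_time)
```

The fractions at circadian times 8 and 12 are close to 40%, several times higher than at times 0 and 20, yet the element is neither present in all evening genes nor absent elsewhere.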
METHOD [6]:
* The arrays EEcount and Ngenes show that not all genes of the second cluster have the evening element, nor is this motif limited to these genes.
Finding the motif-length
Compare the log-likelihood score relative to the background model for motif length L:
Biological validation
Biological confirmation: compare the EE motif (and other motifs found) with standard TFBS databases such as cisRED:
Biological validation [2]
Biological confirmation: perform biological experiments to test the hypothesis. Harmer et al. attached a fluorescent molecule-complex to the TFBS and could thus follow its activity with a scintillation counter.
The experiments confirmed the findings from the computational analysis.
This (in 2000) was the first time that bioinformatics explicitly produced surprising, new, and unanticipated findings.
Biological validation [3]
The fluorescent scintillation experiments fully confirmed the findings from the computational analysis.
With the EE we can now look for other locations on the DNA with the same or similar motifs.
Biological validation [4]
Because not all genes are directly regulated by the first few TFs in the circadian regulatory cascade, the presence or absence of the EE helps to reveal the sequence of events that occur during circadian control.
References
S. L. Harmer, J. B. Hogenesch, M. Straume, H.-S. Chang, B. Han, T. Zhu, X. Wang, J. A. Kreps, and S. A. Kay, Orchestrated Transcription of Key Pathways in Arabidopsis by the Circadian Clock, Science, Vol. 290, no. 5499, pp. 2110-2113, Dec 2000.
END of COURSE BIOINFORMATICS
Thanks for your attention …