Introduction to Bioinformatics

LECTURE 10: Identification of regulatory sequences

* Chapter 10: A bed-time story


10.1 The circadian clock

* All living beings have a biological clock (remember jet lag), called the circadian clock (or circadian rhythm)

* A mismatch between the circadian rhythm and the natural day-night cycle leads to various health problems

* The internal clock synchronizes numerous functions such as metabolism, activity/awareness level, and body temperature

* This is especially true for plants, which depend on daylight for photosynthesis

* Plants lead a stressful life; they have many needs (water, sun, nutrients) but are unable to move

* Rather than moving, plants react to external stress by changing their internal condition

* Herbivore? → Chemical repellent! (e.g. nicotine)

* Falling temperature? → Anti-freeze proteins!

* Plants that can ‘anticipate’ changes have a competitive advantage → this is the importance of a circadian clock

Arabidopsis thaliana (from Wikipedia, the free encyclopedia)

Scientific classification:

* Kingdom: Plantae

* Division: Magnoliophyta

* Class: Magnoliopsida

* Order: Brassicales

* Family: Brassicaceae

* Subfamily: Brassicoideae

* Genus: Arabidopsis

* Species: A. thaliana

Arabidopsis thaliana, commonly called arabidopsis, thale cress, or mouse-ear cress, is a small flowering plant related to cabbage and mustard. It is one of the model organisms for studying plant sciences, including genetics and plant development. It plays the role for agricultural sciences that mice and fruit flies (Drosophila) play in human biology.




Arabidopsis thaliana: 120 Mbp, 5 chromosomes, 29,000 genes

* Arabidopsis thaliana has a cell-autonomous circadian clock: every single cell keeps track of the day-night cycle independently

[Figure: individual cells marked + or -, labelled "awake" and "asleep"]


* If you remove the day-night stimulus and keep A. thaliana in constant light (or constant dark), the clock loses its periodicity within days.

* In contrast, mammals kept in constant light keep the circadian clock running for months.

* How do A. thaliana and other organisms run their circadian clocks?

* Three proteins are the key players: LHY, CCA1, and TOC1.


* Transcription factor (TF): a protein that regulates transcription. TFs regulate the binding of RNA polymerase and the initiation of transcription. A TF binds upstream or downstream of a gene and either enhances or represses its transcription by assisting or blocking RNA polymerase binding.

* Transcription factor binding site (TFBS): the location on the DNA molecule where a TF can physically attach.

[Figure: DNA strand with several TFs attached at TFBSs]


* A TFBS has a specific sequence of nucleotides to which the TF attaches. This sequence is called the motif.

* Motifs are short (5-15 bp) sequences of nucleotides, e.g. TATAA, TAAAAAAAAAATCTA, TATCTG, …

* Different TFs have different TFBS motifs

* However, there is some freedom in the motif sequence: a given TF may bind TATACT, but also a close variant such as TATAACT


10.2 Basic mechanisms of gene expression

* Statistical and algorithmic issues in finding TFBS motifs

* Protein control: gene transcription control, mRNA control, post-translational control of proteins, …

[Figure: gene, expressed gene, mRNA, protein, with TFs marked at the control points]

Genetic signposts

* A gene embedded in random DNA is totally inert

* Promoter = regulatory DNA = cis-regulatory DNA

* The promoter is a region on the DNA just before (= upstream of) the gene that indicates where transcription starts …

[Figure: promoter locking, simplified]

[Figure: promoter locking, realistic]


* A major TFBS is the RNA polymerase binding site

* Eubacteria: rigid motifs at -10 (TATAAT) and at -35 (TTGACA)

* Eukaryota: a different RNA polymerase → different motifs; TATA box (= TATAA[A/T]) at ~ -40

* Other docking sites lie at around -1000, but also at many other places, up to -250,000 (and possibly further)


Computational challenges in finding TFBS

Finding TFBS motifs is complex:

1. TFBSs are very short, so matching sequences will also appear by chance alone

2. There is high variability (ATAATC, ATAATT, ATACTC, …)

3. We know neither the TFBS motif nor the TFBS location
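To make point 1 concrete, here is a minimal sketch of how often one fixed word is expected to turn up by chance. It assumes independent, uniformly distributed bases (each with probability 1/4), which real genomes do not satisfy exactly; the function name and the example sizes (1000 upstream regions of 1000 bp) are illustrative assumptions, not values from the lecture.

```python
def expected_chance_hits(seq_len: int, motif_len: int, n_seqs: int = 1) -> float:
    """Expected number of exact matches of one fixed word in random DNA.

    Assumption: bases are independent and uniform (probability 0.25 each).
    """
    positions = max(seq_len - motif_len + 1, 0)  # possible start positions
    return n_seqs * positions * 0.25 ** motif_len


# Hypothetical setting: 1000 upstream regions of 1000 bp each.
for k in (6, 9, 15):
    print(k, expected_chance_hits(1000, k, n_seqs=1000))
```

Under these assumptions a 6-mer is expected to occur hundreds of times by chance in such a data set, while a 15-mer is expected far less than once, which is why short motifs need a statistical test rather than a simple search.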



* Trick 1: look for areas on the gene with high conservation

* Trick 2: co-regulated genes (which share the same TF): look for shared motifs upstream

* For Arabidopsis thaliana: look for upstream motifs bound by LHY and CCA1:

* [i] cluster genes with the same day-night oscillatory pattern

* [ii] look in this cluster for shared motifs upstream, up to -1000


10.3 Motif-finding strategies

* Where to look for a TFBS? Look up to about 1000 bp upstream

* gapped/ungapped motif

* fixed/variable length motif

* TFBS motifs are variable but highly similar

* Consensus sequence: the most probable sequence

Consensus motif: a useful notation


PSSM / PSWM / profile

* Position Specific Scoring Matrix: PSSM (also called a profile)

* PSSM: multinomial model of a sequence of length L

* PSSM: the multinomial distribution depends on the position in the sequence: P[position, symbol], where symbol is one of {C, T, G, A} or one of the 20 amino acids


Example 10.1: fixed length – ungapped motif

ATGCTGAATGTA
CTATATAGTAAT
CTGTCAATATGT
AACCTAATTGTT
CAGATTTCCCAC
CTCGACAAATTT
ACTCAGATTCTC

Note: we know neither the place (*) nor the length (6) of the TFBS

The same sequences with the start of each TFBS marked (*):

ATG*CTGAATGTA
*CTATATAGTAAT
CTGT*CAATATGT
AAC*CTAATTGTT
*CAGATTTCCCAC
CTCGA*CAAATTT
ACT*CAGATTCTC



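Alignment:

CTGAAT
CTATAT
CAATAT
CTAATT
CAGATT
CAAATT
CAGATT

PSSM (counts per position):

     1  2  3  4  5  6
A    0  5  5  5  4  0
C    7  0  0  0  0  0
G    1  0  3  0  0  0
T    0  3  0  3  4  8

Consensus motif: C A A A T T

Note: each PSSM column sums to 8, whereas seven aligned instances are listed above; the counts imply an eighth instance that is not shown.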
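As an illustration of how an alignment turns into a count matrix and a consensus, here is a minimal sketch using the seven aligned instances listed above. Because the slide's count matrix sums to 8 per column, the counts computed here differ slightly from the slide's; the function names are illustrative, and ties are broken by alphabet order (an arbitrary choice).

```python
from collections import Counter

ALPHABET = "ACGT"

# The seven aligned motif instances from Example 10.1.
INSTANCES = ["CTGAAT", "CTATAT", "CAATAT", "CTAATT", "CAGATT", "CAAATT", "CAGATT"]


def count_matrix(instances):
    """Count each base at each aligned position."""
    length = len(instances[0])
    assert all(len(word) == length for word in instances)
    columns = [Counter(word[i] for word in instances) for i in range(length)]
    return {base: [col[base] for col in columns] for base in ALPHABET}


def consensus(counts):
    """Most frequent base per position (ties go to the first base in ALPHABET)."""
    length = len(counts["A"])
    return "".join(max(ALPHABET, key=lambda b: counts[b][i]) for i in range(length))


COUNTS = count_matrix(INSTANCES)
for base in ALPHABET:
    print(base, COUNTS[base])
print(consensus(COUNTS))
```

The consensus obtained from these seven instances agrees with the consensus motif on the slide.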



Identifying motifs

* Start position, motif sequence and motif length are unknown

* PSSM = scoring from multiple alignment

* What counts as a significant result? Compare the motif with the background model, i.e. the probability, estimated from the current data set, that the motif occurs by pure chance


Identifying motifs [2]

* Finding the motif sequence algorithmically, by optimizing a scoring function, is computationally extremely expensive

* Therefore heuristics have been proposed

* An example of a randomized and greedy heuristic is Gibbs sampling

* We now focus on ungapped motifs of fixed length, as is the case for the circadian rhythm in A. thaliana



Ungapped motif of fixed length, as is the case for the circadian rhythm in A. thaliana

The PSSM from Example 10.1, as relative frequencies:

     1    2    3    4    5    6
A    0    5/8  5/8  5/8  4/8  0
C    7/8  0    0    0    0    0
G    1/8  0    3/8  0    0    0
T    0    3/8  0    3/8  4/8  8/8



A motif is interesting if it is unlikely under the background distribution: column 6 is more unbalanced than column 1

Scoring function for imbalance: Kullback–Leibler divergence (KL divergence):

$$S_{KL} = \sum_{i \in \text{position}} \sum_{k \in \text{letter}} p_i[k] \log \frac{p_i[k]}{q_i[k]}$$

p_i[k] is the probability of observing symbol k at position i

q_i[k] is the multinomial background model for symbol k at position i
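The KL score can be evaluated column by column. The sketch below uses the counts of Example 10.1 and assumes a uniform background (q = 1/4 for every base) and the natural logarithm; both are assumptions, since the slides do not fix them. Zero entries are skipped using the convention 0 · log 0 = 0.

```python
import math

# Count matrix from Example 10.1 (each column sums to 8).
COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}
# Assumption: uniform background model.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def column_kl(counts, background, i):
    """KL divergence of column i from the background (0 * log 0 treated as 0)."""
    total = sum(counts[b][i] for b in counts)
    score = 0.0
    for b in counts:
        p = counts[b][i] / total
        if p > 0:
            score += p * math.log(p / background[b])
    return score


per_column = [column_kl(COUNTS, BACKGROUND, i) for i in range(6)]
s_kl = sum(per_column)  # total score S_KL over all positions
print([round(x, 3) for x in per_column], round(s_kl, 3))
```

With these assumptions column 6 (all T) scores log 4, the maximum possible for a four-letter alphabet, and column 1 scores lower, matching the remark that column 6 is more unbalanced than column 1.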




To avoid zero entries and the resulting divergences (log 0), a statistical trick is to add pseudocounts: add 1 to each entry.

PSSM:

     1  2  3  4  5  6
A    0  5  5  5  4  0
C    7  0  0  0  0  0
G    1  0  3  0  0  0
T    0  3  0  3  4  8

PSSM + pseudocounts:

     1  2  3  4  5  6
A    1  6  6  6  5  1
C    8  1  1  1  1  1
G    2  1  4  1  1  1
T    1  4  1  4  5  9

PSSM with pseudocounts has no zeros:

     1     2     …
A    1/12  6/12  …
C    8/12  1/12  …
G    2/12  1/12  …
T    1/12  4/12  …
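The pseudocount step is simple enough to sketch in a few lines. The example below adds 1 to every entry of the count matrix and normalizes each column, using exact fractions so the result can be compared with the table above; the function name is illustrative.

```python
from fractions import Fraction

COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}


def with_pseudocounts(counts, pseudo=1):
    """Add `pseudo` to every entry, then normalize each column to sum to 1."""
    width = len(next(iter(counts.values())))
    probs = {base: [] for base in counts}
    for i in range(width):
        total = sum(counts[b][i] + pseudo for b in counts)  # 8 + 4 = 12 here
        for b in counts:
            probs[b].append(Fraction(counts[b][i] + pseudo, total))
    return probs


PSSM = with_pseudocounts(COUNTS)
print(PSSM["A"][0], PSSM["C"][0])  # Fraction reduces 8/12 to 2/3
```

Adding 1 is one common choice; smaller pseudocounts distort the observed frequencies less, at the price of more extreme log values.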



Finding high-scoring motifs

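* Sequence s of length n (> L = length of the PSSM)

* Slide the PSSM along the sequence and compute the likelihood of the window starting at each position j = 1, …, n − L + 1:

$$L(j) = \prod_{i=j}^{j+L-1} p_{i-j+1}[s[i]]$$

where s[i] is the symbol at position i of s, p_m[k] is the PSSM probability of symbol k at motif position m, and the L in the upper limit is the PSSM length.

* This gives the starting position (the j with the highest value of L(j)) and the most probable motif (the window at the argmax of L(j)).

NOTE: in practice use the log-likelihood l(j) = log L(j)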
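The sliding-window scan can be sketched as follows. The example scores the first sequence of Example 10.1 with the pseudocounted PSSM and reports the best start position; positions are 0-based here (the text counts from 1), and the function names are illustrative.

```python
import math

COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}


def build_log_pssm(counts, pseudo=1):
    """One dict of log-probabilities per motif position, with pseudocounts."""
    width = len(next(iter(counts.values())))
    log_pssm = []
    for i in range(width):
        total = sum(counts[b][i] + pseudo for b in counts)
        log_pssm.append({b: math.log((counts[b][i] + pseudo) / total) for b in counts})
    return log_pssm


def scan(seq, log_pssm):
    """Log-likelihood l(j) of the window starting at each position j (0-based)."""
    width = len(log_pssm)
    return [
        sum(log_pssm[i][seq[j + i]] for i in range(width))
        for j in range(len(seq) - width + 1)
    ]


LOG_PSSM = build_log_pssm(COUNTS)
scores = scan("ATGCTGAATGTA", LOG_PSSM)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, "ATGCTGAATGTA"[best:best + 6])
```

Summing logs instead of multiplying probabilities avoids numerical underflow for longer motifs, which is the practical reason for the note on the log-likelihood.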



Finding high-scoring motifs [2]

ALGORITHM FOR FINDING TFBS MOTIFS:

0. Start with random location j and random PSSM

Iteration:

1. With fixed j optimize PSSM

2. With fixed PSSM optimize j

Until the result has converged

This is the EM algorithm
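The alternating scheme above can be sketched as below, on the seven sequences of Example 10.1. This is a simplified "hard" variant under stated assumptions: it starts from random locations only (the initial PSSM is estimated from them rather than drawn at random), it uses pseudocounts of 1, and it picks the single best location per sequence instead of weighting all locations. All function names are illustrative.

```python
import math
import random

ALPHABET = "ACGT"
SEQS = ["ATGCTGAATGTA", "CTATATAGTAAT", "CTGTCAATATGT", "AACCTAATTGTT",
        "CAGATTTCCCAC", "CTCGACAAATTT", "ACTCAGATTCTC"]


def estimate_pssm(seqs, starts, width, pseudo=1.0):
    """Step 1: with fixed locations, estimate the PSSM (with pseudocounts)."""
    pssm = []
    for i in range(width):
        counts = {b: pseudo for b in ALPHABET}
        for seq, start in zip(seqs, starts):
            counts[seq[start + i]] += 1
        total = sum(counts.values())
        pssm.append({b: c / total for b, c in counts.items()})
    return pssm


def best_start(seq, pssm):
    """Step 2: with fixed PSSM, pick the location with the highest log-likelihood."""
    width = len(pssm)

    def loglik(j):
        return sum(math.log(pssm[i][seq[j + i]]) for i in range(width))

    return max(range(len(seq) - width + 1), key=loglik)


def alternate(seqs, width, seed=0, max_iter=100):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]  # step 0
    for _ in range(max_iter):
        pssm = estimate_pssm(seqs, starts, width)
        new_starts = [best_start(s, pssm) for s in seqs]
        if new_starts == starts:  # converged: nothing moved
            return starts, pssm, True
        starts = new_starts
    return starts, estimate_pssm(seqs, starts, width), False


starts, pssm, converged = alternate(SEQS, width=6)
print(starts, converged)
```

The result depends on the random start: the procedure can settle in a local optimum, so in practice it is rerun from several seeds and the best-scoring result is kept.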



Finding high-scoring motifs [3]

Gibbs sampling to avoid local optima:

Instead of always using the location with the highest score, choose the location at random (weighted by its score)

Use a simple assumption: e.g. there is no variation, so look for a fixed sequence

Figure 10-1 shows the log-likelihood score along the sequences, and hence the optimal locations of the motifs
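A minimal sketch of a Gibbs motif sampler, under stated assumptions, is given below. One sequence at a time is held out, a pseudocounted PSSM is built from the current windows in the other sequences, and a new location for the held-out sequence is drawn with probability proportional to its likelihood. A uniform background is assumed (so it cancels out of the weights), and the number of sweeps and all names are illustrative.

```python
import random

ALPHABET = "ACGT"
SEQS = ["ATGCTGAATGTA", "CTATATAGTAAT", "CTGTCAATATGT", "AACCTAATTGTT",
        "CAGATTTCCCAC", "CTCGACAAATTT", "ACTCAGATTCTC"]


def estimate_pssm(seqs, starts, width, pseudo=1.0):
    """Column-wise base probabilities from the current windows, with pseudocounts."""
    pssm = []
    for i in range(width):
        counts = {b: pseudo for b in ALPHABET}
        for seq, start in zip(seqs, starts):
            counts[seq[start + i]] += 1
        total = sum(counts.values())
        pssm.append({b: c / total for b, c in counts.items()})
    return pssm


def window_likelihood(seq, j, pssm):
    """Probability of the window starting at j under the PSSM."""
    prob = 1.0
    for i, column in enumerate(pssm):
        prob *= column[seq[j + i]]
    return prob


def gibbs(seqs, width, sweeps=200, seed=0):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for held_out, seq in enumerate(seqs):
            others = [s for n, s in enumerate(seqs) if n != held_out]
            other_starts = [p for n, p in enumerate(starts) if n != held_out]
            pssm = estimate_pssm(others, other_starts, width)
            candidates = list(range(len(seq) - width + 1))
            weights = [window_likelihood(seq, j, pssm) for j in candidates]
            # Sample a location instead of taking the argmax.
            starts[held_out] = rng.choices(candidates, weights=weights, k=1)[0]
    return starts


starts = gibbs(SEQS, width=6)
print(starts)
```

Because locations are sampled rather than maximized, the sampler can leave a poor configuration that the deterministic alternation would be stuck in; the price is that individual runs fluctuate and do not converge to a single fixed answer.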



10.4 Case study: the circadian rhythm

* Harmer et al. (2000): the clock-regulated elements of A. thaliana are activated in the evening, hence the name Evening Element (EE)

* Cluster the expression profiles and consider the clusters with appropriate periodicity: they are candidates for containing the EE


METHOD:

* Look only at motifs of fixed length (9): consider all words of length 9 whose frequency in the evening cluster is very different from their frequency in the rest of the data.

* To do so, examine all words of length 9 in both sequences, seq1 and seq2 (also considering the reverse complement).

* The motifs found are scored and sorted in descending order of margin (the difference between their frequency in cluster 2 and that in clusters 1 and 3). The top ten 9-mers are computed and shown.
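The margin computation described above might look like the following sketch. It counts every k-mer on both strands, converts counts to relative frequencies, and ranks words by the difference between the two groups. The toy sequences are invented for illustration (they are not the A. thaliana data), and the function names are assumptions.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seqs, k):
    """Relative frequency of every k-mer, counted on both strands."""
    counts = Counter()
    for seq in seqs:
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


def top_margins(cluster_seqs, rest_seqs, k=9, top=10):
    """Words ranked by (frequency in cluster) minus (frequency in the rest)."""
    f_in = kmer_frequencies(cluster_seqs, k)
    f_out = kmer_frequencies(rest_seqs, k)
    margins = {word: f - f_out.get(word, 0.0) for word, f in f_in.items()}
    return sorted(margins.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy data (hypothetical): the word AAAATATCT is planted in the "cluster".
cluster = ["GGAAAATATCTGG", "CCAAAATATCTCC"]
rest = ["GGCCGGCCGGCCG", "CCGGCCGGCCGGC"]
for word, margin in top_margins(cluster, rest, k=9, top=2):
    print(word, round(margin, 3))
```

Since both strands are counted, a word and its reverse complement receive the same margin, so they appear together at the top of the ranking.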



METHOD [2]:

* The obtained set of motifs contains many repeats (of single letters or of 2-mers). These are unlikely to have biological significance and must be filtered out.

* After eliminating the repeats, the most significant motif is AAAATATCT.
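One way to filter such repeats is sketched below: a word is discarded if it is an exact repetition of a single letter or of a 2-mer. The exact filtering rule used in the study is not stated on the slides, so this criterion and the candidate list are illustrative assumptions.

```python
def is_simple_repeat(word, max_unit=2):
    """True if `word` is an exact repetition of a unit of length 1..max_unit."""
    for unit_len in range(1, max_unit + 1):
        unit = word[:unit_len]
        # Repeat the unit past the word length, then trim to compare.
        if (unit * len(word))[:len(word)] == word:
            return True
    return False


# Hypothetical candidate list for illustration.
candidates = ["AAAAAAAAA", "ATATATATA", "AAAATATCT", "TCTCTCTCT", "AAATATCTT"]
kept = [w for w in candidates if not is_simple_repeat(w)]
print(kept)
```

This exact-match rule is strict; a repeat with a single mismatch would pass, so a real analysis may want a more tolerant low-complexity filter.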



METHOD [3]:

* The evening element (EE) is the motif AAAATATCT.

* We know from the study of Harmer et al. that it corresponds to the evening element (a word of 9 bases found upstream of genes turned on in the evening). Its margin is 0.00014. Two of the other three top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC).

* To assess the significance of the margin found for the evening element, we perform 100 random splits of the data and measure the margin of the highest-scoring element in each.
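The random-split test can be sketched as follows: the sequences are shuffled, split into two groups of the original sizes, and the largest margin is recorded for each split. The toy sequences are invented for illustration, reverse complements are left out for brevity, and all names are assumptions.

```python
import random
from collections import Counter


def kmer_freqs(seqs, k):
    """Relative frequency of every k-mer (forward strand only, for brevity)."""
    counts = Counter(seq[i:i + k] for seq in seqs for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def best_margin(inside, outside, k):
    """Largest (frequency inside) minus (frequency outside) over all words."""
    f_in, f_out = kmer_freqs(inside, k), kmer_freqs(outside, k)
    return max(f - f_out.get(word, 0.0) for word, f in f_in.items())


def random_split_margins(seqs, n_inside, k, trials=100, seed=0):
    """Best margin for each of `trials` random splits of the data."""
    rng = random.Random(seed)
    margins = []
    for _ in range(trials):
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        margins.append(best_margin(shuffled[:n_inside], shuffled[n_inside:], k))
    return margins


# Toy data (hypothetical), with AAAATATCT planted in the "evening" group.
EVENING = ["GGAAAATATCTGG", "CCAAAATATCTCC", "TTAAAATATCTAA"]
REST = ["GGCCGTACGTTAG", "CATGCATTGACCA", "TTGACGGATCCAT", "AGCTTAGGCTAAC"]

observed = best_margin(EVENING, REST, k=9)
null = random_split_margins(EVENING + REST, len(EVENING), k=9)
p_value = sum(m >= observed for m in null) / len(null)
print(observed, max(null), p_value)
```

The fraction of random splits that reach the observed margin serves as an empirical p-value; with only 100 trials it cannot resolve values below 0.01.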



METHOD [4]:

* In 100 trials we never observe a margin larger than 0.000147462.

* We can look in detail at the frequency of the evening element among all the clock regulated genes:



EE-count and circadian rhythm in genes

Circadian time:    0    4    8   12   16   20
Number of genes:  78   45  124   67   30   93
EE count:          5    6   49   27    8    8
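The table is easier to read as fractions of genes carrying the EE per time bin. The short sketch below computes them from the numbers above; the variable names are illustrative.

```python
circadian_time = [0, 4, 8, 12, 16, 20]
n_genes = [78, 45, 124, 67, 30, 93]
ee_count = [5, 6, 49, 27, 8, 8]

# Fraction of genes in each time bin that carry the evening element.
fraction = [ee / n for ee, n in zip(ee_count, n_genes)]
for t, f in zip(circadian_time, fraction):
    print(f"time {t:2d}: {f:.2f}")

total_genes = sum(n_genes)
total_ee = sum(ee_count)
peak_time = circadian_time[fraction.index(max(fraction))]
```

Roughly 40% of the genes in the bins at times 8 and 12 carry the EE, against under 10% at times 0 and 20, so the motif is enriched in these bins without being exclusive to them.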



METHOD [6]:

* The arrays EEcount and Ngenes show that not all genes of the second cluster contain the evening element, nor is this motif limited to these genes.


Finding the motif-length

Compare the log-likelihood score, relative to the background model, for different motif lengths L.


Biological validation

Biological confirmation: compare the EE motif (and the other motifs found) with standard TFBS databases such as cisRED.


Biological validation [2]

Biological confirmation: perform biological experiments to test the hypothesis. Harmer et al. attached a fluorescent molecule complex to the TFBS, whose signal could then be measured with a scintillation counter.

The fluorescence experiments fully confirmed the findings of the computational analysis.

This (in 2000) was the first time that bioinformatics explicitly produced surprising, new and unanticipated findings.

With the EE we can now look for other locations on the DNA with the same or similar motifs.

Because not all genes are directly regulated by the first few TFs in the circadian regulatory cascade, the presence or absence of the EE helps to reveal the exact sequence of events that occur during circadian control.


References

S. L. Harmer, J. B. Hogenesch, M. Straume, H.-S. Chang, B. Han, T. Zhu, X. Wang, J. A. Kreps, and S. A. Kay, Orchestrated Transcription of Key Pathways in Arabidopsis by the Circadian Clock, Science, Vol. 290, no. 5499, pp. 2110-2113, Dec 2000.

END of COURSE BIOINFORMATICS

Thanks for your attention …
