Introduction to

Post on 11-Jan-2016


1

Introduction to

Bioinformatics

2

Introduction to Bioinformatics.

LECTURE 10: Identification of regulatory sequences

* Chapter 10: A bed-time story

3


10.1 The circadian clock

* All living beings have a biological clock (remember jet lag) called the Circadian Rhythm/Clock

* Disruptions between the circadian rhythm and the natural day-night cycle lead to various health problems

* The internal clock synchronizes numerous functions such as metabolism, activity/awareness level, and body temperature

* This is especially true for plants, whose photosynthesis depends on daylight

4

* Plants lead a stressful life; they have many needs (water, sun, nutrients) but are unable to move

* Rather than moving, plants react to external stress by changing their internal condition

* Herbivore? → Chemical repellent! (e.g. nicotine)

* Falling temperature? → Anti-freeze proteins!

* Plants that can ‘anticipate’ changes have a competitive advantage → this is the importance of a circadian clock

5

Arabidopsis thaliana (from Wikipedia, the free encyclopedia)

Scientific classification: Kingdom: Plantae; Division: Magnoliophyta; Class: Magnoliopsida; Order: Brassicales; Family: Brassicaceae; Subfamily: Brassicoideae; Genus: Arabidopsis; Species: A. thaliana

Arabidopsis thaliana, commonly called arabidopsis, thale cress, or mouse-ear cress, a small flowering plant related to cabbage and mustard, is one of the model organisms for studying plant sciences, including genetics and plant development. It plays the role for agricultural sciences that mice and fruit flies (Drosophila) play in human biology.

6

Arabidopsis thaliana

7

Arabidopsis thaliana

* Genome size: 120 Mbp

* 5 chromosomes

* 29,000 genes

8


* Arabidopsis thaliana has a cell-autonomous circadian clock: each single cell keeps track of day-night cycle independently

[Figure: day-night cycle annotated with + and - signs and the states "awake" and "asleep"]

9

* If you remove the day-night stimulus and keep A. thaliana in constant light (or constant dark), the clock loses its periodicity within days.

* In contrast, mammals kept in constant light keep the circadian clock running for months.

* How do A. thaliana and other organisms run their circadian clocks?

* Three proteins are the key players: LHY, CCA1, and TOC1.


11

* Transcription factor (TF): a protein that regulates transcription. TFs regulate the binding of RNA polymerase and the initiation of transcription. A TF binds upstream or downstream of a gene and either enhances or represses its transcription by assisting or blocking RNA polymerase binding.

* Transcription factor binding site (TFBS): the location on the DNA molecule where a TF can physically attach.

12

[Figure: transcription factors (TF) attached to transcription factor binding sites (TFBS) on the DNA]


15

* A TFBS has a specific sequence of nucleotides to which the TF attaches. This is called the motif

* Motifs are short (5-15 bp) sequences of nucleotides, e.g. TATAA, TAAAAAAAAAATCTA, TATCTG, …

* Different TFs have different TFBS motifs

* However, there is some freedom in the motif sequence: a given TF may bind to TATACT, but also to close variants such as TATAACT

16


10.2 Basic mechanisms of gene expression

* Statistical and algorithmic issues in finding TFBS motifs

* Protein control: gene transcription control, mRNA control, post-translational control of proteins, …

[Figure: flow from gene to mRNA to protein (gene expressed), with TFs regulating the gene]

17


Genetic signposts

* A gene embedded in random DNA is totally inert

* Promoter = regulatory DNA = cis-regulatory DNA

* The promoter is a region of the DNA just before (= upstream of) the gene that indicates where transcription starts …

18

Promoter locking (simplified)

19

Promoter locking (realistic)

20

* A major TFBS is the RNA polymerase binding site

* Eubacteria: rigid motifs at -10 (TATAAT) and at -35 (TTGACA)

* Eukaryota: a different RNA polymerase → different motifs; TATA-box (= TATAA[A/T]) at ~ -40

* Other docking sites lie around -1000, but also at many other places up to -250,000 (and possibly further)

21


Computational challenges in finding TFBS

Finding TFBS motifs is complex:

1. TFBS are very short and will therefore appear by chance alone

2. There is a high variability (ATAATC, ATAATT, ATACTC, …)

3. We know neither the TFBS motif nor the TFBS location
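Point 1 can be made concrete with a small calculation. The sketch below is not from the lecture: it assumes random DNA with independent, equally likely bases, and the function name `expected_chance_hits` is made up for this illustration.

```python
def expected_chance_hits(seq_len: int, motif_len: int, n_strands: int = 1) -> float:
    """Expected number of exact matches of one fixed word in random DNA.

    Assumption: bases are independent and equiprobable (p = 1/4 each).
    """
    n_windows = max(seq_len - motif_len + 1, 0)  # possible start positions
    return n_strands * n_windows / 4 ** motif_len


# One fixed 6-mer in a single 1000 bp upstream region, one strand only.
per_region = expected_chance_hits(1000, 6)
print(per_region)
```

Under these assumptions a given 6-mer is expected about 0.24 times per 1000 bp region (995 / 4096), so across the upstream regions of some 29,000 genes it would turn up thousands of times by chance alone.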

22

* Trick 1: look for areas around the gene that are highly conserved

* Trick 2: co-regulated genes (have same TF): look for shared motifs upstream

* For Arabidopsis thaliana: look for motifs upstream bound by LHY and CCA1:

* [i] cluster genes with same day-night oscillatory pattern

* [ii] look in this cluster for shared motifs upstream up to -1000

23


10.3 Motif-finding strategies

* Where to look for a TFBS? Look up to about 1000 bp upstream

* gapped/ungapped motif

* fixed/variable length motif

* TFBS motifs are variable but highly similar

* Consensus sequence: most probable sequence

24


Consensus motif: a useful notation

25


PSSM / PSWM / profile

* Position Specific Scoring Matrix: PSSM (also profile)

* PSSM: multinomial model of sequence of length L

* PSSM: the multinomial distribution depends on the position in the sequence: P[position, symbol], with symbol in {C, T, G, A} for DNA or one of the 20 amino acids for proteins

26


Example 10.1: fixed length – ungapped motif

ATGCTGAATGTA
CTATATAGTAAT
CTGTCAATATGT
AACCTAATTGTT
CAGATTTCCCAC
CTCGACAAATTT
ACTCAGATTCTC

Note: we know neither the place (*) nor the length (6) of the TFBS

27

Example 10.1: fixed length – ungapped motif

ATG*CTGAATGTA
*CTATATAGTAAT
CTGT*CAATATGT
AAC*CTAATTGTT
*CAGATTTCCCAC
CTCGA*CAAATTT
ACT*CAGATTCTC

Note: we know neither the place (*) nor the length (6) of the TFBS

28


Example 10.1: fixed length – ungapped motif

Alignment:

*CTGAAT
*CTATAT
*CAATAT
*CTAATT
*CAGATT
*CAAATT
*CAGATT

PSSM (counts):

Position   1  2  3  4  5  6
A          0  5  5  5  4  0
C          7  0  0  0  0  0
G          1  0  3  0  0  0
T          0  3  0  3  4  8

Consensus motif: C A A A T T
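A minimal sketch of how a count matrix and a consensus can be derived from an ungapped alignment. It uses the seven aligned sites legible in this transcript; the matrix printed on the slide sums to 8 per column, so one site appears to be missing here and the counts below differ slightly from the slide. The function names are illustrative, not from the lecture.

```python
from collections import Counter

# The seven aligned sites that are legible in this transcript.
SITES = ["CTGAAT", "CTATAT", "CAATAT", "CTAATT", "CAGATT", "CAAATT", "CAGATT"]
ALPHABET = "ACGT"


def count_matrix(sites):
    """Return {base: [count at position 1, 2, ...]} for equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    columns = [Counter(col) for col in zip(*sites)]  # one Counter per position
    return {b: [c[b] for c in columns] for b in ALPHABET}


def consensus(counts):
    """Most frequent base per column (ties broken by ALPHABET order)."""
    length = len(next(iter(counts.values())))
    return "".join(max(ALPHABET, key=lambda b: counts[b][i]) for i in range(length))


counts = count_matrix(SITES)
for base in ALPHABET:
    print(base, counts[base])
print("consensus:", consensus(counts))
```

Even with one site fewer, the consensus comes out as CAAATT, the same as on the slide.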

29


Identifying motifs

* Start position, motif sequence and motif length are unknown

* PSSM = scoring from multiple alignment

* What is a significant result? Compare the motif's score with the background model, i.e. the probability, estimated from the current set of sequences, that the motif occurs by pure chance

30

Identifying motifs [2]

* Finding the motif by algorithmically optimizing a scoring function is computationally very expensive

* Therefore heuristics have been proposed

* An example of a randomized and greedy heuristic is Gibbs sampling

* Now focus on ungapped fixed sequence motif with fixed length as is the case in circadian rhythm in A. thaliana

31


Identifying motifs [3]

Ungapped fixed sequence motif with fixed length as is the case in circadian rhythm in A. thaliana

From example 10-1 the PSSM:

Position   1     2     3     4     5     6
A          0     5/8   5/8   5/8   4/8   0
C          7/8   0     0     0     0     0
G          1/8   0     3/8   0     0     0
T          0     3/8   0     3/8   4/8   8/8

32


Identifying motifs [4]

A motif is interesting if it is unlikely under the background distribution: column 6 is more unbalanced than column 1

Scoring function for imbalance: Kullback–Leibler divergence (KL divergence):

$$S_{KL} = \sum_{i} \sum_{k} p_i[k] \log \frac{p_i[k]}{q_i[k]}$$

where i runs over the positions of the motif and k over the letters; p_i[k] is the probability of observing symbol k at position i, and q_i[k] is the multinomial background model for symbol k at position i.
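The comparison between column 6 and column 1 can be checked numerically. This sketch uses the count matrix printed for Example 10.1 and assumes a uniform background (q = 1/4 for every base), which the slides do not specify; terms with p = 0 are taken as 0 by the usual convention.

```python
import math

# Count matrix as printed for Example 10.1 (columns = positions 1..6).
COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumption: uniform q


def column_kl(counts, background, i):
    """KL divergence of column i (0-based) from the background."""
    total = sum(counts[b][i] for b in counts)
    kl = 0.0
    for b in counts:
        p = counts[b][i] / total
        if p > 0:  # convention: 0 * log 0 = 0
            kl += p * math.log(p / background[b])
    return kl


def motif_kl(counts, background):
    """Sum of the column divergences over all motif positions."""
    length = len(next(iter(counts.values())))
    return sum(column_kl(counts, background, i) for i in range(length))


for i in range(6):
    print(f"position {i + 1}: {column_kl(COUNTS, BACKGROUND, i):.3f}")
print(f"whole motif: {motif_kl(COUNTS, BACKGROUND):.3f}")
```

Position 6, where every site has a T, reaches the maximum possible value log 4 for a four-letter alphabet, while position 1 (seven C, one G) scores lower.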

33


Identifying motifs [5]

To avoid zero entries and resulting divergences (log 0), a statistical trick is to add pseudocounts: add 1 at each entry

PSSM (counts):

Position   1  2  3  4  5  6
A          0  5  5  5  4  0
C          7  0  0  0  0  0
G          1  0  3  0  0  0
T          0  3  0  3  4  8

PSSM + pseudocounts:

Position   1  2  3  4  5  6
A          1  6  6  6  5  1
C          8  1  1  1  1  1
G          2  1  4  1  1  1
T          1  4  1  4  5  9

34


Identifying motifs [6]

PSSM with pseudocounts has no zeros:

Position   1      2      …
A          1/12   6/12   …
C          8/12   1/12   …
G          2/12   1/12   …
T          1/12   4/12   …
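The pseudocount step is simple arithmetic; a short sketch with exact fractions reproduces the entries above (note that `Fraction` reduces 8/12 to 2/3 when printed). The helper names are illustrative.

```python
from fractions import Fraction

# Count matrix as printed for Example 10.1 (columns = positions 1..6).
COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}


def add_pseudocounts(counts, pseudo=1):
    """Add the same pseudocount to every entry of the count matrix."""
    return {b: [c + pseudo for c in row] for b, row in counts.items()}


def to_probabilities(counts):
    """Divide each entry by its column total."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[b][i] for b in counts) for i in range(length)]
    return {b: [Fraction(c, t) for c, t in zip(row, totals)] for b, row in counts.items()}


pssm = to_probabilities(add_pseudocounts(COUNTS))
for base, row in pssm.items():
    print(base, [str(p) for p in row])
```

Each column now sums to 8 + 4 = 12 counts, which is where the denominator 12 comes from.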

35


Finding high-scoring motifs

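* Sequence s of length n (> L = length of the PSSM)

* Slide the PSSM along the sequence and compute, for each start position j, the likelihood:

$$\mathcal{L}(j) = \prod_{i=1}^{L} p_i[s_{j+i-1}], \qquad j = 1, \ldots, n-L+1$$

* The start position of the motif is the j with the highest likelihood, and the most probable motif is the window of s that starts there (argmax of the likelihood)

NOTE: in practice use the log-likelihood $l(j) = \log \mathcal{L}(j)$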
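A sketch of the sliding-window scan, using the Example 10.1 count matrix with a pseudocount of 1 and the first sequence of that example. Function names are illustrative, and start positions are 0-based in the code (so index 3 corresponds to position 4 on the slide).

```python
import math

# Count matrix as printed for Example 10.1 (columns = positions 1..6).
COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}


def log_pssm(counts, pseudo=1):
    """Log-probabilities per base and position, after adding pseudocounts."""
    length = len(next(iter(counts.values())))
    totals = [sum(counts[b][i] + pseudo for b in counts) for i in range(length)]
    return {
        b: [math.log((counts[b][i] + pseudo) / totals[i]) for i in range(length)]
        for b in counts
    }


def scan(seq, logp):
    """Log-likelihood l(j) for every start position j (0-based)."""
    width = len(next(iter(logp.values())))
    return [
        sum(logp[seq[j + i]][i] for i in range(width))
        for j in range(len(seq) - width + 1)
    ]


def best_start(seq, logp):
    """Start position with the highest log-likelihood, and that score."""
    scores = scan(seq, logp)
    j = max(range(len(scores)), key=scores.__getitem__)
    return j, scores[j]


logp = log_pssm(COUNTS)
j, score = best_start("ATGCTGAATGTA", logp)
print(j, "ATGCTGAATGTA"[j:j + 6], round(score, 3))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow for longer motifs.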

36


Finding high-scoring motifs [2]

ALGORITHM FOR FINDING TFBS MOTIFS:

0. Start with random location j and random PSSM

Iteration:

1. With fixed j optimize PSSM

2. With fixed PSSM optimize j

Until the result has converged

This is the EM-algorithm
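The alternation can be sketched as follows on the seven sequences of Example 10.1. This is a simplified, hard-assignment variant written for illustration: it starts from random locations only (the first PSSM is estimated from them rather than drawn at random), uses a pseudocount of 1, and stops when the locations no longer change. All names are hypothetical.

```python
import math
import random

SEQS = ["ATGCTGAATGTA", "CTATATAGTAAT", "CTGTCAATATGT", "AACCTAATTGTT",
        "CAGATTTCCCAC", "CTCGACAAATTT", "ACTCAGATTCTC"]
ALPHABET = "ACGT"


def estimate_logp(seqs, starts, width, pseudo=1.0):
    """Step 1: with the start positions fixed, re-estimate the PSSM (as logs)."""
    logp = []
    for i in range(width):
        col = [s[j + i] for s, j in zip(seqs, starts)]
        total = len(col) + pseudo * len(ALPHABET)
        logp.append({b: math.log((col.count(b) + pseudo) / total) for b in ALPHABET})
    return logp


def best_starts(seqs, logp):
    """Step 2: with the PSSM fixed, pick the best-scoring start in each sequence."""
    width = len(logp)

    def score(s, j):
        return sum(logp[i][s[j + i]] for i in range(width))

    return [max(range(len(s) - width + 1), key=lambda j: score(s, j)) for s in seqs]


def alternate(seqs, width, rng, max_iter=100):
    """Alternate steps 1 and 2 from random start positions until nothing changes."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]  # step 0
    for _ in range(max_iter):
        new_starts = best_starts(seqs, estimate_logp(seqs, starts, width))
        if new_starts == starts:  # converged
            break
        starts = new_starts
    return starts


starts = alternate(SEQS, width=6, rng=random.Random(0))
print(starts, [s[j:j + 6] for s, j in zip(SEQS, starts)])
```

Because each step only improves on the current guess, the result depends on the random start and may be a local optimum; running it from several seeds and keeping the best-scoring solution is a common remedy.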

37


Finding high-scoring motifs [3]

Gibbs sampling to avoid local optima:

Instead of always taking the location with the highest score, draw the location at random, with high-scoring locations being more likely

Use a simple assumption: e.g. there is no variation, so look for a fixed sequence

Figure 10-1 shows the log-likelihood score along the sequence, and therefore the optimal locations of the motifs
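A minimal sketch of the sampling idea, assuming a uniform background (so sampling in proportion to the PSSM likelihood is enough) and a pseudocount of 1. One sequence at a time is held out, a PSSM is built from the current sites in the others, and a new start is drawn for the held-out sequence. The names and the number of sweeps are illustrative choices, not from the lecture.

```python
import math
import random

SEQS = ["ATGCTGAATGTA", "CTATATAGTAAT", "CTGTCAATATGT", "AACCTAATTGTT",
        "CAGATTTCCCAC", "CTCGACAAATTT", "ACTCAGATTCTC"]
ALPHABET = "ACGT"


def window_weights(seq, others, other_starts, width, pseudo=1.0):
    """Likelihood of every window of seq under a PSSM built from the other sites."""
    total = len(others) + pseudo * len(ALPHABET)
    probs = []
    for i in range(width):
        col = [s[j + i] for s, j in zip(others, other_starts)]
        probs.append({b: (col.count(b) + pseudo) / total for b in ALPHABET})
    return [
        math.prod(probs[i][seq[j + i]] for i in range(width))
        for j in range(len(seq) - width + 1)
    ]


def gibbs(seqs, width, rng, sweeps=200):
    """Resample each sequence's start position in turn, for a number of sweeps."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            others = seqs[:k] + seqs[k + 1:]
            other_starts = starts[:k] + starts[k + 1:]
            weights = window_weights(seq, others, other_starts, width)
            # Draw the new start in proportion to its likelihood (not the argmax).
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


starts = gibbs(SEQS, width=6, rng=random.Random(1))
print(starts, [s[j:j + 6] for s, j in zip(SEQS, starts)])
```

The random draw is what lets the sampler leave a poor local optimum that a purely greedy update would stay in.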

38


10.4 Case study: the circadian rhythm

* Harmer et al. (2000): the clock-regulated elements of A. thaliana are activated in the evening, hence the name Evening Element (EE)

* Cluster the expression profiles and consider the clusters with appropriate periodicity: they are candidates for containing the EE


44


METHOD:

* Look only at motifs of fixed length (9): consider all words of length 9 whose frequency in the evening cluster is very different from their frequency in the rest of the data.

* Therefore examine all words of fixed length 9 in both sequences seq1 and seq2 (considering also the reverse complement).

* The motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in cluster 1-3). The top 10 9-mers are computed and shown.
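The word-counting method can be sketched as below. How the lecture normalizes frequencies is not stated; here a word's frequency is assumed to be its share of all k-mer occurrences on both strands, and the function names are made up for the illustration.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def word_frequencies(seqs, k=9):
    """Relative frequency of each k-mer, counted on both strands."""
    counts = Counter()
    for seq in seqs:
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def top_margins(cluster, rest, k=9, n=10):
    """Words ranked by (frequency in cluster) - (frequency in rest), descending."""
    f_in, f_out = word_frequencies(cluster, k), word_frequencies(rest, k)
    margins = {w: f - f_out.get(w, 0.0) for w, f in f_in.items()}
    return sorted(margins.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy input for illustration only; real input would be upstream sequences.
for word, margin in top_margins(["GGAAAATATCTTCC"], ["GGCCGGCCGGCCGG"]):
    print(word, round(margin, 4))
```

In a real run, `cluster` would hold the upstream regions of the evening-cluster genes and `rest` those of the other clock-regulated genes.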

45


METHOD [2]:

* The set of motifs obtained contains many repeats (either of single letters or of 2-mers). They probably have no biological significance and must be filtered out.

* After eliminating the repeats, we observe that the most significant element is the motif AAAATATCT.
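One possible reading of the repeat filter, as a sketch: a word is discarded if it is a pure repetition of a unit of length 1 or 2. The exact criterion used in the lecture is not given, and apart from AAAATATCT the candidate words below are invented for the example.

```python
def is_simple_repeat(word, max_period=2):
    """True if the word is a repetition of a unit of length <= max_period."""
    return any(
        all(word[i] == word[i % p] for i in range(len(word)))
        for p in range(1, max_period + 1)
    )


# Hypothetical candidate list; only AAAATATCT comes from the lecture.
candidates = ["AAAAAAAAA", "ATATATATA", "AAAATATCT", "TCTCTCTCT"]
kept = [w for w in candidates if not is_simple_repeat(w)]
print(kept)
```

A stricter filter could also reject words that are repeats apart from one or two letters, at the risk of discarding genuine AT-rich motifs.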

46


METHOD [3]:

* The EE element is the motif AAAATATCT.

* We know from the study of Harmer et al. that it corresponds to the evening element (a word of 9 bases found upstream of genes turned on in the evening). Its margin is 0.00014. Two of the other three top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC).

* To assess the significance of the value found for the margin of the evening element we perform 100 random splits of the data and measure the margin of the highest-scoring element.
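The random-split test can be sketched as follows. For brevity this version counts words on one strand only and uses short toy sequences with k = 3; the names, the toy data and the seed are assumptions made for the illustration.

```python
import random
from collections import Counter


def kmer_freq(seqs, k):
    """Relative frequency of each k-mer (single strand, for brevity)."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def max_margin(group, rest, k):
    """Largest (frequency in group) - (frequency in rest) over all words."""
    f_in, f_out = kmer_freq(group, k), kmer_freq(rest, k)
    return max((f - f_out.get(w, 0.0) for w, f in f_in.items()), default=0.0)


def null_margins(seqs, group_size, k=9, n_splits=100, seed=0):
    """Margin of the top word for random splits of the same sequences."""
    rng = random.Random(seed)
    margins = []
    for _ in range(n_splits):
        shuffled = seqs[:]
        rng.shuffle(shuffled)
        margins.append(max_margin(shuffled[:group_size], shuffled[group_size:], k))
    return margins


# Toy sequences for illustration only.
toy = ["ACGTACGTAC", "TTGACATTGA", "GGCATGCATG", "ACGTTTGACA", "CATGCATGGG", "TGACATACGT"]
null = null_margins(toy, group_size=2, k=3, n_splits=10)
print(max(null))
```

The observed margin is then judged against the largest value in `null`: if no random split reaches it, the observed margin is unlikely to be a chance effect of the split.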

47


METHOD [4]:

* In 100 trials we never observe a margin larger than 0.000147462.

* We can look in detail at the frequency of the evening element among all the clock regulated genes:

48


EE-count and circadian rhythm in genes

Circadian time:    0    4    8   12   16   20
Number of genes:  78   45  124   67   30   93
EE count:          5    6   49   27    8    8
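Dividing the EE count by the number of genes at each circadian time makes the pattern easier to see. The sketch assumes the row reading used here (the trailing 93 belongs to the "Number of genes" row) and treats the result as EE occurrences per gene, which is not necessarily the fraction of genes carrying an EE.

```python
circadian_time = [0, 4, 8, 12, 16, 20]
n_genes = [78, 45, 124, 67, 30, 93]
ee_count = [5, 6, 49, 27, 8, 8]

# EE occurrences per gene at each circadian time point.
ratios = [ee / n for ee, n in zip(ee_count, n_genes)]

for t, n, ee, r in zip(circadian_time, n_genes, ee_count, ratios):
    print(f"CT {t:2d}: {ee:2d} / {n:3d} = {r:.2f}")
```

The ratio is about 0.40 at circadian times 8 and 12 and clearly lower elsewhere, so the element is concentrated in those genes without being exclusive to them.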

49

METHOD [6]:

* The arrays EEcount and Ngenes show that not all the genes of the second cluster have the evening element, nor is this motif limited to these genes.

50


Finding the motif-length

Compare the log-likelihood score relative to the background model for motif length L:


52


Biological validation

Biological confirmation: compare the EE motif (and the other motifs found) with standard TFBS databases such as cisRED:


54


Biological validation [2]

Biological confirmation: perform biological experiments to test the hypothesis. Harmer et al. attached a fluorescent molecule-complex to the TFBS and could thus with a scintillation counter …

The experiments confirmed the findings of the computational analysis.

This (in 2000) was the first time that bioinformatics explicitly produced surprising, unanticipated findings


57


Biological validation [3]

The fluorescent scintillation experiments fully confirmed the findings of the computational analysis.

With the EE we can now look for other locations on the DNA with the same or similar motifs.
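Searching for the same or similar motifs can be sketched as a scan that tolerates a small number of mismatches. The mismatch threshold, the function names and the test sequence are assumptions for this illustration; only the EE word itself comes from the lecture, and the scan covers one strand only.

```python
EE = "AAAATATCT"


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def find_similar(seq, motif=EE, max_mismatches=1):
    """Return (start, word, mismatches) for every window close to the motif."""
    k = len(motif)
    hits = []
    for i in range(len(seq) - k + 1):
        d = hamming(seq[i:i + k], motif)
        if d <= max_mismatches:
            hits.append((i, seq[i:i + k], d))
    return hits


# Hypothetical upstream fragment containing one exact copy of the EE.
print(find_similar("GGAAAATATCTTCC"))
```

To cover both strands, the same scan can be repeated on the reverse complement of the sequence.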

58


Biological validation [4]

Because not all genes are directly regulated by the first few TFs in the circadian regulatory cascade, the presence or absence of the EE helps to reveal the sequence of events that occur during circadian control.

59


References

S. L. Harmer, J. B. Hogenesch, M. Straume, H.-S. Chang, B. Han, T. Zhu, X. Wang, J. A. Kreps, and S. A. Kay, Orchestrated Transcription of Key Pathways in Arabidopsis by the Circadian Clock, Science, Vol. 290, no. 5499, pp. 2110-2113, Dec 2000.

60

END of COURSE BIOINFORMATICS

Thanks for your attention …