
Introduction to Bioinformatics


LECTURE 10: Identification of regulatory sequences

* Chapter 10: A bed-time story


10.1 The circadian clock

* All living beings have a biological clock (remember jet lag) called the Circadian Rhythm/Clock

* Mismatches between the circadian rhythm and the natural day-night cycle lead to various health problems

* The internal clock synchronizes numerous functions such as metabolism, activity/awareness level, and body temperature

* For plants this is especially true because of photosynthesis


* Plants lead a stressful life; they have many needs (water, sun, nutrients) but are unable to move

* Rather than moving, plants react to external stress by changing their internal condition

* Herbivore? → Chemical repellent! (e.g. nicotine)

* Falling temperature? → Anti-freeze proteins!

* Plants that can ‘anticipate’ changes have a competitive advantage → this is the importance of a circadian clock


Arabidopsis thaliana (from Wikipedia, the free encyclopedia)

Scientific classification
Kingdom: Plantae
Division: Magnoliophyta
Class: Magnoliopsida
Order: Brassicales
Family: Brassicaceae
Subfamily: Brassicoideae
Genus: Arabidopsis
Species: A. thaliana

Arabidopsis thaliana, commonly called arabidopsis, thale cress, or mouse-ear cress, a small flowering plant related to cabbage and mustard, is one of the model organisms for studying plant sciences, including genetics and plant development. It plays the role for agricultural sciences that mice and fruit flies (Drosophila) play in human biology.


Arabidopsis thaliana

120 Mbp, 5 chromosomes, 29,000 genes


* Arabidopsis thaliana has a cell-autonomous circadian clock: each single cell keeps track of the day-night cycle independently

[Figure: cell states marked + and -, labelled "awake" and "asleep"]


* If you remove the day-night stimulus and keep A. thaliana in constant light (or constant dark), then within days the clock loses its periodicity.

* In contrast, mammals kept in constant light keep the circadian clock running for months.

* How do A. thaliana (and other organisms) run their circadian clocks?

* Three proteins are the key players: LHY, CCA1, and TOC1.


* Transcription factor (TF): a protein that regulates transcription. TFs regulate the binding of RNA polymerase and the initiation of transcription. A TF binds upstream or downstream of a gene and either enhances or represses its transcription by assisting or blocking RNA polymerase binding.

* Transcription factor binding site (TFBS): the location on the DNA molecule where a TF can physically attach.


[Figure: TFs attached to TFBSs on the DNA]


* A TFBS has a specific sequence of nucleotides for the TF to attach to. This is called the motif.

* Motifs are short (5-15 bp) sequences of nucleotides, e.g. TATAA, TAAAAAAAAAATCTA, TATCTG, …

* Different TFs have different TFBS motifs

* However, there is some freedom in the motif sequence: a given TF may lock to TATACT, but also to TATAACT


10.2 Basic mechanisms of gene expression

* Statistical and algorithmic issues in finding TFBS motifs

* Protein control: gene transcription control, mRNA control, post-translational control of proteins, …

[Figure: gene → mRNA → protein (gene expressed), with TFs marked]


Genetic signposts

* A gene embedded in random DNA is totally inert

* Promoter = regulatory DNA = cis-regulatory DNA

* The promoter is a region of the DNA just before (= upstream of) the gene that indicates where transcription starts …

[Figure: Promoter locking (simplified)]

[Figure: Promoter locking (realistic)]


* A major TFBS is the RNA polymerase binding site

* Eubacteria: rigid motifs at -10: TATAAT, at -35: TTGACA

* Eukaryota: have a different RNA polymerase → different motifs; TATA box (= TATAA[A/T]) at about -40

* Other docking sites at around -1000, but also at many other places up to -250,000 (… and further?)


Computational challenges in finding TFBS

Finding TFBS motifs is complex:

1. TFBSs are very short and will therefore also appear by chance alone

2. There is high variability (ATAATC, ATAATT, ATACTC, …)

3. We know neither the TFBS motif nor the TFBS location
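To make the first point concrete, the following is a minimal sketch of how often one fixed word is expected to turn up by chance. It assumes a uniform, independent background (each nucleotide with probability 1/4), which is a simplification that the slide does not state; the function name and the example numbers are illustrative only.

```python
def expected_chance_hits(seq_len: int, motif_len: int, alphabet_size: int = 4) -> float:
    """Expected number of exact occurrences of one fixed word in a random sequence.

    Assumes independent, uniformly distributed symbols (an assumption).
    """
    n_positions = seq_len - motif_len + 1
    if n_positions <= 0:
        return 0.0
    return n_positions / alphabet_size ** motif_len


# One fixed 6-mer in a single 1000 bp upstream region (one strand only)
print(expected_chance_hits(1000, 6))
# The same 6-mer summed over 1000 such regions
print(1000 * expected_chance_hits(1000, 6))
```

Under these assumptions a given 6-mer is expected roughly once in every four 1000 bp regions, so an exact match on its own says little about function.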


* Trick 1: look at areas on the gene with high conservation

* Trick 2: co-regulated genes (have same TF): look for shared motifs upstream

* For Arabidopsis thaliana: look for motifs upstream bound by LHY and CCA1:

* [i] cluster genes with same day-night oscillatory pattern

* [ii] look in this cluster for shared motifs upstream up to -1000


10.3 Motif-finding strategies

* Where to look for a TFBS? Look up to about 1000 bp upstream

* gapped/ungapped motif

* fixed/variable length motif

* TFBS motifs are variable but highly similar

* Consensus sequence: most probable sequence


Consensus motif: a useful notation


PSSM / PSWM / profile

* Position Specific Scoring Matrix: PSSM (also profile)

* PSSM: multinomial model of sequence of length L

* PSSM: the multinomial distribution depends on the position in the sequence: P[position, symbol], with symbol ∈ {A, C, G, T} for DNA or one of the 20 amino acids for proteins


Example 10.1: fixed length – ungapped motif

A T G C T G A A T G T A
C T A T A T A G T A A T
C T G T C A A T A T G T
A A C C T A A T T G T T
C A G A T T T C C C A C
C T C G A C A A A T T T
A C T C A G A T T C T C

Note: we know neither the place (*) nor the length (6) of the TFBS

The same sequences with the start of the TFBS marked (*):

A T G *C T G A A T G T A
*C T A T A T A G T A A T
C T G T *C A A T A T G T
A A C *C T A A T T G T T
*C A G A T T T C C C A C
C T C G A *C A A A T T T
A C T *C A G A T T C T C


Alignment:

*C T G A A T
*C T A T A T
*C A A T A T
*C T A A T T
*C A G A T T
*C A A A T T
*C A G A T T

PSSM (counts per motif position):

     1  2  3  4  5  6
A    0  5  5  5  4  0
C    7  0  0  0  0  0
G    1  0  3  0  0  0
T    0  3  0  3  4  8

Consensus motif: C A A A T T


Identifying motifs

* Start position, motif sequence and motif length are unknown

* PSSM = scoring from multiple alignment

* What is a significant result? Compare the motif with the background model, i.e. the probability, estimated from the current set of sequences, that the motif occurs by pure chance


Identifying motifs [2]

* Algorithmically finding the motif sequence by optimization of a scoring function is computationally extremely expensive

* Therefore heuristics have been proposed

* An example of a randomized and greedy heuristic is Gibbs sampling

* We now focus on ungapped motifs of fixed length, as is the case for the circadian rhythm in A. thaliana


Identifying motifs [3]

Ungapped motif of fixed length, as is the case for the circadian rhythm in A. thaliana

From Example 10.1, the PSSM (relative frequencies):

     1    2    3    4    5    6
A    0    5/8  5/8  5/8  4/8  0
C    7/8  0    0    0    0    0
G    1/8  0    3/8  0    0    0
T    0    3/8  0    3/8  4/8  8/8


Identifying motifs [4]

A motif is interesting if it is unlikely under the background distribution: column 6 is more unbalanced than column 1

Scoring function for imbalance: Kullback–Leibler divergence (KL divergence):

p_i[k] is the probability of observing symbol k at position i
q_i[k] is the multinomial background model for symbol k at position i

$$S_{KL} = \sum_{i \in \text{position}} \; \sum_{k \in \text{letter}} p_i[k] \, \log \frac{p_i[k]}{q_i[k]}$$


Identifying motifs [5]

To avoid zero entries and resulting divergences (log 0), a statistical trick is to add pseudocounts: add 1 at each entry

PSSM:

     1  2  3  4  5  6
A    0  5  5  5  4  0
C    7  0  0  0  0  0
G    1  0  3  0  0  0
T    0  3  0  3  4  8

PSSM + pseudocounts:

     1  2  3  4  5  6
A    1  6  6  6  5  1
C    8  1  1  1  1  1
G    2  1  4  1  1  1
T    1  4  1  4  5  9


Identifying motifs [6]

PSSM with pseudocounts has no zeros:

     1     2     …
A    1/12  6/12  …
C    8/12  1/12  …
G    2/12  1/12  …
T    1/12  4/12  …
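The pseudocount step and the KL score can be checked numerically. The sketch below uses the count matrix of Example 10.1 and, as an assumption not taken from the slides, a uniform background q[k] = 1/4 that is the same at every position; the function and variable names are illustrative.

```python
import math

ALPHABET = "ACGT"

# Count matrix from Example 10.1 (one list per letter, one entry per motif position)
COUNTS = {
    "A": [0, 5, 5, 5, 4, 0],
    "C": [7, 0, 0, 0, 0, 0],
    "G": [1, 0, 3, 0, 0, 0],
    "T": [0, 3, 0, 3, 4, 8],
}


def pssm_with_pseudocounts(counts, pseudocount=1):
    """Add a pseudocount to every entry, then normalise each column to sum to 1."""
    length = len(counts["A"])
    pssm = []
    for i in range(length):
        column = {k: counts[k][i] + pseudocount for k in ALPHABET}
        total = sum(column.values())
        pssm.append({k: v / total for k, v in column.items()})
    return pssm


def kl_score(pssm, background):
    """Sum over positions i and letters k of p_i[k] * log(p_i[k] / q[k])."""
    return sum(
        p_i[k] * math.log(p_i[k] / background[k])
        for p_i in pssm
        for k in ALPHABET
    )


uniform = {k: 0.25 for k in ALPHABET}  # assumed background model
pssm = pssm_with_pseudocounts(COUNTS)
per_column = [kl_score([column], uniform) for column in pssm]
print(per_column)               # contribution of each motif position
print(kl_score(pssm, uniform))  # total score of the motif
```

With this background, the contribution of column 6 comes out larger than that of column 1, in line with the remark that column 6 is the more unbalanced of the two.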


Finding high-scoring motifs

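* Sequence s of length n (> L = length of the PSSM)

* Slide the PSSM along the sequence and compute the likelihood of the window that starts at position j:

$$\mathcal{L}(j) = \prod_{i=j}^{j+L-1} p_{i-j+1}\big[s[i]\big], \qquad j = 1, \dots, n-L+1$$

* With this algorithm we try to find the starting position (the j with the highest value, i.e. the argmax of $\mathcal{L}(j)$) and hence the most probable motif.

NOTE: in practice use the log-likelihood $\ell(j) = \log \mathcal{L}(j)$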
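As an illustration of sliding the PSSM along a sequence, here is a minimal sketch that scores every window by its log-likelihood and reports the best start. It uses the pseudocount matrix of Example 10.1 and one of the example sequences; positions are 0-based in the code, and the function names are illustrative rather than taken from the course.

```python
import math

ALPHABET = "ACGT"
MOTIF_LEN = 6

# Example 10.1 counts plus one pseudocount per entry (each column sums to 12)
COUNTS_PLUS_ONE = {
    "A": [1, 6, 6, 6, 5, 1],
    "C": [8, 1, 1, 1, 1, 1],
    "G": [2, 1, 4, 1, 1, 1],
    "T": [1, 4, 1, 4, 5, 9],
}
LOG_PSSM = [
    {k: math.log(COUNTS_PLUS_ONE[k][i] / 12) for k in ALPHABET}
    for i in range(MOTIF_LEN)
]


def log_likelihoods(seq):
    """Log-likelihood of every window of length MOTIF_LEN, indexed by 0-based start."""
    return [
        sum(LOG_PSSM[i][seq[j + i]] for i in range(MOTIF_LEN))
        for j in range(len(seq) - MOTIF_LEN + 1)
    ]


def best_start(seq):
    """0-based start position with the highest log-likelihood."""
    scores = log_likelihoods(seq)
    return max(range(len(scores)), key=scores.__getitem__)


seq = "CTCGACAAATTT"  # sixth sequence of Example 10.1
j = best_start(seq)
print(j, seq[j:j + MOTIF_LEN])  # 5 CAAATT
```

Working in log space turns the product into a sum, which avoids numerical underflow for longer motifs and is the reason for the note about using the log-likelihood.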


Finding high-scoring motifs [2]

ALGORITHM FOR FINDING TFBS MOTIFS:

0. Start with random location j and random PSSM

Iteration:

1. With fixed j optimize PSSM

2. With fixed PSSM optimize j

Until the result has converged

This is the EM-algorithm
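The alternating scheme can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the course's implementation: it starts from random locations only (the initial PSSM is derived from them), it assumes exactly one site per sequence, and in step 2 it takes the single best location per sequence, whereas a full EM algorithm would weight all locations by their probability. All names are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"


def build_pssm(seqs, starts, length, pseudocount=1.0):
    """Step 1: with the start positions fixed, re-estimate the PSSM."""
    pssm = []
    for i in range(length):
        counts = {k: pseudocount for k in ALPHABET}
        for seq, j in zip(seqs, starts):
            counts[seq[j + i]] += 1
        total = sum(counts.values())
        pssm.append({k: c / total for k, c in counts.items()})
    return pssm


def best_start(seq, pssm):
    """Step 2: with the PSSM fixed, pick the start with the highest log-likelihood."""
    length = len(pssm)

    def score(j):
        return sum(math.log(pssm[i][seq[j + i]]) for i in range(length))

    return max(range(len(seq) - length + 1), key=score)


def find_motif(seqs, length, seed=0, max_iter=100):
    """Alternate steps 1 and 2 until the start positions stop changing."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - length + 1) for s in seqs]
    for _ in range(max_iter):
        pssm = build_pssm(seqs, starts, length)
        new_starts = [best_start(s, pssm) for s in seqs]
        if new_starts == starts:
            break
        starts = new_starts
    return starts, build_pssm(seqs, starts, length)


# The seven sequences of Example 10.1
seqs = [
    "ATGCTGAATGTA", "CTATATAGTAAT", "CTGTCAATATGT", "AACCTAATTGTT",
    "CAGATTTCCCAC", "CTCGACAAATTT", "ACTCAGATTCTC",
]
starts, pssm = find_motif(seqs, length=6, seed=1)
print(starts)
```

The result depends on the random starting point, and the procedure may settle in a local optimum; running it from several seeds and keeping the best-scoring result is a common way to reduce that risk.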


Finding high-scoring motifs [3]

Gibbs sampling to avoid local optima:

Use randomization: instead of always taking the location with the highest score, sample the location at random

Use a simple assumption: e.g. there is no variation, so look for a fixed sequence

Figure 10-1 shows the log-likelihood score and hence the optimal locations of the motifs
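The difference between taking the best location and sampling one can be shown with a short sketch of the sampling step alone. It assumes that locations are drawn with probability proportional to their likelihood, which is the usual choice in Gibbs sampling; the log-scores below are made-up values for illustration and the function name is hypothetical.

```python
import math
import random


def sample_start(log_scores, rng):
    """Draw a start position with probability proportional to exp(log-score)."""
    top = max(log_scores)
    # Subtracting the maximum keeps the exponentials in a safe numerical range
    weights = [math.exp(s - top) for s in log_scores]
    return rng.choices(range(len(log_scores)), weights=weights, k=1)[0]


rng = random.Random(0)
log_scores = [-9.0, -4.0, -3.5, -8.0]  # made-up log-likelihoods per start position
draws = [sample_start(log_scores, rng) for _ in range(1000)]
print({j: draws.count(j) for j in range(len(log_scores))})
```

The best location is still chosen most often, but lower-scoring locations are visited occasionally, which gives the search a chance to leave a local optimum.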


10.4 Case study: the circadian rhythm

* Harmer et al. (2000): the clock-regulated elements of A. thaliana are activated in the evening, hence the name Evening Element (EE)

* Cluster the expression profiles and consider the clusters with appropriate periodicity : they are candidates for containing the EE


METHOD:

* Look only at motifs of fixed length (9): consider all words of length 9 whose frequency in the evening cluster is very different from their frequency in the rest of the data.

* Therefore examine all words of fixed length 9 in both sequences seq1 and seq2 (considering also the reverse complement).

* The motifs found are scored and sorted in descending order by margin (the difference between their frequency in cluster 2 and that in cluster 1-3). The top ten 9-mers are computed and shown.
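The margin computation described above can be sketched as follows. This is an illustration under assumptions: frequencies are taken as relative frequencies over all words on both strands, and the toy sequences are invented for the example; the original analysis may have normalised differently. The function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seqs, k):
    """Relative frequency of every k-mer, counting both strands."""
    counts = Counter()
    for seq in seqs:
        for strand in (seq, reverse_complement(seq)):
            for j in range(len(strand) - k + 1):
                counts[strand[j:j + k]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def top_margins(cluster_seqs, rest_seqs, k=9, top=10):
    """Rank k-mers by margin = frequency in the cluster minus frequency in the rest."""
    f_in = kmer_frequencies(cluster_seqs, k)
    f_out = kmer_frequencies(rest_seqs, k)
    margins = {w: f - f_out.get(w, 0.0) for w, f in f_in.items()}
    return sorted(margins.items(), key=lambda item: item[1], reverse=True)[:top]


# Invented toy data: two "evening" upstream regions and one other region
cluster = ["AAAATATCTGG", "CCAAAATATCT"]
rest = ["GGGCCCGGGCCC"]
for word, margin in top_margins(cluster, rest, k=9, top=3):
    print(word, round(margin, 4))
```

Because both strands are counted, a word and its reverse complement always receive the same count, so they appear as a pair in the ranking.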


METHOD [2]:

* The obtained set of motifs contains a lot of repeats (either of single letters or of 2-mers). They likely have no biological significance and must be filtered out.

* After eliminating the repeats, we observe that the most significant element is the motif AAAATATCT.
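A filter for such repeats might look like the sketch below. The criterion used here (the whole word is one letter, or one 2-mer, repeated) is an assumption; the slides do not say exactly how repeats were recognised, and a real filter might also need to catch words that are only mostly repetitive. The function name is hypothetical.

```python
def is_simple_repeat(word):
    """True if the word is one letter, or one 2-mer, repeated along its whole length."""
    for period in (1, 2):
        unit = word[:period]
        repeated = (unit * (len(word) // period + 1))[:len(word)]
        if word == repeated:
            return True
    return False


candidates = ["AAAAAAAAA", "ATATATATA", "AAAATATCT", "TCTCTCTCT"]
print([w for w in candidates if not is_simple_repeat(w)])  # ['AAAATATCT']
```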


METHOD [3]:

* The EE is the motif AAAATATCT.

* We know from the study of Harmer et al. that it corresponds to the evening element (a word of 9 bases found upstream of genes turned on in the evening). Its margin is 0.00014. We notice that 2 of the other 3 top motifs are simply variants of the evening element (AAATATCTT and AAAAATATC).

* To assess the significance of the value found for the margin of the evening element we perform 100 random splits of the data and measure the margin of the highest-scoring element.
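The random-split test can be sketched as below. It is a simplified illustration: the data are split at random into a group of the same size as the cluster and a rest, the largest margin is recorded for each split, and the observed margin is compared with that collection. For brevity this sketch counts one strand only, and all names are hypothetical.

```python
import random
from collections import Counter


def kmer_freq(seqs, k):
    """Relative frequency of every k-mer (single strand, for brevity)."""
    counts = Counter(s[j:j + k] for s in seqs for j in range(len(s) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def max_margin(group, rest, k):
    """Largest difference between a k-mer's frequency in the group and in the rest."""
    f_in, f_out = kmer_freq(group, k), kmer_freq(rest, k)
    return max((f - f_out.get(w, 0.0) for w, f in f_in.items()), default=0.0)


def random_split_margins(seqs, group_size, k, trials=100, seed=0):
    """Largest margin obtained in each of `trials` random splits of the data."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        shuffled = rng.sample(seqs, len(seqs))
        results.append(max_margin(shuffled[:group_size], shuffled[group_size:], k))
    return results


def empirical_p_value(observed, null_margins):
    """Fraction of random splits whose best margin reaches the observed one."""
    return sum(m >= observed for m in null_margins) / len(null_margins)


# Invented toy data: 20 random sequences of length 30
toy_rng = random.Random(1)
toy = ["".join(toy_rng.choice("ACGT") for _ in range(30)) for _ in range(20)]
null = random_split_margins(toy, group_size=5, k=3, trials=50)
print(max(null))
```

If no random split reaches the observed margin, the empirical p-value is 0 out of the number of trials, which with 100 trials only bounds the true p-value at roughly 0.01.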


METHOD [4]:

* In 100 trials we never observe a margin larger than 0.000147462.

* We can look in detail at the frequency of the evening element among all the clock regulated genes:


EE-count and circadian rhythm in genes

Circadian time:     0    4    8   12   16   20
Number of genes:   78   45  124   67   30   93
EE count:           5    6   49   27    8    8
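To compare the time points it helps to divide the EE count by the number of genes. The short sketch below does this arithmetic; it reads the stray 93 as the sixth gene count, and treats "EE count" as a count per group of genes, both of which are assumptions about the table.

```python
circadian_time = [0, 4, 8, 12, 16, 20]
n_genes = [78, 45, 124, 67, 30, 93]
ee_count = [5, 6, 49, 27, 8, 8]

# EE occurrences per gene at each circadian time
ratios = [ee / n for ee, n in zip(ee_count, n_genes)]
for t, r in zip(circadian_time, ratios):
    print(f"CT {t:2d}: {r:.2f} EE per gene")
```

On these numbers the ratio is about 0.40 at circadian times 8 and 12 and below 0.10 at times 0 and 20, so the element is concentrated in, but not exclusive to, the evening-phased genes.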


METHOD [6]:

* The arrays EEcount and Ngenes show that not all the genes of the second cluster have the evening element, nor is this motif limited to these genes.


Finding the motif-length

Compare the log-likelihood score relative to the background model for motif length L:


Biological validation

Biological confirmation: compare the EE motif (and other motifs found) with standard TFBS databases such as cisRED.


Biological validation [2]

Biological confirmation: perform biological experiments to test the hypothesis: Harmer et al. attached a fluorescent molecule-complex to the TFBS and could thus with a scintillation counter …

The experiments confirmed the findings from the computational analysis.

This (in 2000) was the first time that bioinformatics explicitly produced surprising, new and unanticipated findings.


Biological validation [3]

The fluorescent scintillation experiments fully confirmed the findings from the computational analysis.

With the EE we can now look for other locations on the DNA with the same or similar motifs.


Biological validation [4]

Because not all genes are directly regulated by the first few TFs in the circadian regulatory cascade, the presence or absence of the EE helps to reveal the sequence of events that occur during circadian control.


References

S. L. Harmer, J. B. Hogenesch, M. Straume, H.-S. Chang, B. Han, T. Zhu, X. Wang, J. A. Kreps, and S. A. Kay, Orchestrated Transcription of Key Pathways in Arabidopsis by the Circadian Clock, Science, Vol. 290, no. 5499, pp. 2110-2113, Dec 2000.


END of COURSE BIOINFORMATICS

Thanks for your attention …