Many similarly expressed genes are coregulated by the same transcription factor(s) …


Page 1: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Many similarly expressed genes are coregulated by the same transcription factor(s).

Therefore, we can search the promoters of coregulated genes for binding sites.

Example: genes induced by carbon starvation

Page 2: Many similarly expressed genes are coregulated by the same transcription factor(s) …

[Figure: ORFs and their upstream regions for the genes induced by carbon starvation.]

Page 3: Many similarly expressed genes are coregulated by the same transcription factor(s) …

A similar sequence is found in most of the upstream regions of the genes induced by carbon starvation (here CCAAT, the Hap4p binding site).

Page 4: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Finding sequence motifs common to a group of ‘similar’ sequences

[Figure: ORFs and their upstream regions; a similar sequence is found in most upstream regions.]

How do you identify motifs in sequence data?

How can you tell whether an identified motif is 'significant'?

How do you find genomic examples of the identified motif?

Page 5: Many similarly expressed genes are coregulated by the same transcription factor(s) …

First, representation of motifs: position-specific weight matrices (PWMs; also known as position-specific scoring matrices, PSSMs)

Site 1   A G A T G G A T G G
Site 2   T G A T T G A T G T
Site 3   T G A T G G A T G G
Site 4   A G A T T G A T C G
Site 5   T G A T G G A T T G
Site 6   T G A T G G A T T G
Site 7   A G A T G G A T T G

IUPAC consensus: W G A T G G A T N G (where W = A or T, and N = any base)
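As a rough illustration of what a consensus captures, the Python sketch below turns the IUPAC consensus into a regular expression and checks it against the seven sites. The helper name `consensus_to_regex` and the reduced `IUPAC` table (only the symbols used here) are assumptions made for this example, not part of the slides.

```python
import re

# The seven aligned sites, transcribed from the slide.
SITES = [
    "AGATGGATGG",
    "TGATTGATGT",
    "TGATGGATGG",
    "AGATTGATCG",
    "TGATGGATTG",
    "TGATGGATTG",
    "AGATGGATTG",
]

# Only the IUPAC symbols needed for this consensus (the full table is larger).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "N": "[ACGT]"}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC[symbol] for symbol in consensus))


pattern = consensus_to_regex("WGATGGATNG")
matching_sites = [site for site in SITES if pattern.fullmatch(site)]
print(len(matching_sites), "of", len(SITES), "sites match the consensus")
```

Because the consensus keeps only the majority base at positions 5 and 10, the sites carrying a T at those positions (sites 2 and 4) do not match it. A consensus string is an all-or-nothing description, which is one motivation for a matrix representation.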

Page 6: Many similarly expressed genes are coregulated by the same transcription factor(s) …

From the seven aligned sites (Site 1 to Site 7), the PWM is:

Position   1    2    3    4    5    6    7    8    9    10
G          0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A          0.4  0    1.0  0    0    0    1.0  0    0    0
T          0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C          0    0    0    0    0    0    0    0    0.2  0

The PWM represents the frequency of each base at each position in the motif.*

* These days, PWM/PSSM can refer to either the frequency matrix or a likelihood matrix.
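The counting step behind the matrix can be sketched in a few lines of Python. The function name `frequency_matrix` and the dictionary layout are illustrative assumptions. The exact frequencies are multiples of 1/7 (for example 3/7 ≈ 0.43), so the slide's values appear to be rounded.

```python
from collections import Counter

BASES = "GATC"  # row order used on the slide

# The seven aligned sites, transcribed from the slide.
SITES = [
    "AGATGGATGG",
    "TGATTGATGT",
    "TGATGGATGG",
    "AGATTGATCG",
    "TGATGGATTG",
    "TGATGGATTG",
    "AGATGGATTG",
]


def frequency_matrix(sites):
    """Return {base: [frequency at each position]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = {base: [] for base in BASES}
    for column in zip(*sites):  # one alignment column per motif position
        counts = Counter(column)
        for base in BASES:
            matrix[base].append(counts[base] / len(sites))
    return matrix


pwm = frequency_matrix(SITES)
for base in BASES:
    print(base, " ".join(f"{p:.2f}" for p in pwm[base]))
```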

Page 7: Many similarly expressed genes are coregulated by the same transcription factor(s) …

WebLogo: a graphical representation of PWMs

http://weblogo.berkeley.edu/

The height of each base is proportional to its frequency at that position, scaled by the position's "information content", which is measured in "bits" and derived from the "entropy" of the position.

Page 8: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Information content (IC)

The least variable positions are likely to be important for specifying the protein-DNA interaction. Therefore, high information content = low sequence variation at that position.

Information content at position i:

$$IC_i = 2 + \sum_{b \in \{G,A,T,C\}} P_b(i) \log_2 P_b(i)$$

where P_b(i) is the probability of base b at position i. With log2, the information content is in 'bits'.

Page 9: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Information content (IC): bounds

For $IC_i = 2 + \sum_{b} P_b(i) \log_2 P_b(i)$, taking $0 \cdot \log_2 0 = 0$:

Maximum IC, if P of some base is 1.0: IC = 2 + [ (1.0 × 0) + 0 + 0 + 0 ] = 2

Minimum IC, if P is 0.25 for all bases: IC = 2 + 4 × [ 0.25 × (−2) ] = 0
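A minimal Python sketch of the information content formula, applied to the frequency matrix shown on the slides. The function name `information_content` and the dictionary layout are illustrative assumptions. Bases with zero probability are skipped, which follows the usual convention that p · log2(p) is treated as 0 when p = 0.

```python
import math

# Frequency matrix as shown on the slides (columns = motif positions 1-10).
PWM = {
    "G": [0.0, 1.0, 0.0, 0.0, 0.7, 1.0, 0.0, 0.0, 0.4, 0.8],
    "A": [0.4, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "T": [0.6, 0.0, 0.0, 1.0, 0.3, 0.0, 0.0, 1.0, 0.4, 0.2],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
}


def information_content(pwm):
    """Return IC_i = 2 + sum_b P_b(i) * log2(P_b(i)) for every position i."""
    length = len(next(iter(pwm.values())))
    profile = []
    for i in range(length):
        # Skip zero entries: p * log2(p) tends to 0 as p tends to 0.
        total = sum(row[i] * math.log2(row[i]) for row in pwm.values() if row[i] > 0)
        profile.append(2.0 + total)
    return profile


ic_profile = information_content(PWM)
print([round(ic, 1) for ic in ic_profile])
print("total bits:", round(sum(ic_profile), 1))
```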

Page 10: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Information content (IC)

Position   1    2    3    4    5    6    7    8    9    10
G          0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A          0.4  0    1.0  0    0    0    1.0  0    0    0
T          0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C          0    0    0    0    0    0    0    0    0.2  0
IC (bits)  1.0  2.0  2.0  2.0  1.1  2.0  2.0  2.0  0.5  1.3

Summed over all positions: bit score of 15.9

[Figure: information profile; x-axis: position, y-axis: bits.]

Page 11: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Often for protein-DNA interactions, the IC profile is smooth.

[Figure: information profiles (bits vs. position) for a real motif and for randomized data.]

Page 12: Many similarly expressed genes are coregulated by the same transcription factor(s) …

One limitation of PWMs: each position is considered independently (a PWM does not represent interdependencies across motif positions).

Page 13: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Gary Stormo, Nat Biotech 2011

Morris et al., Nat Biotech 2011

Page 14: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Finding matches to (instances of) a PWM

Is the sequence A G A T T G A T C T a match to this matrix?

Position   1    2    3    4    5    6    7    8    9    10
G          0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A          0.4  0    1.0  0    0    0    1.0  0    0    0
T          0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C          0    0    0    0    0    0    0    0    0.2  0

Joint probability, assuming each position is independent:

$$P(\text{sequence} \mid \text{model}) = \prod_{i} P_{b_i}(i)$$

where b_i ∈ {G,A,T,C} is the base at position i of the sequence.

P(sequence | matrix model) = (0.4)(1.0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0.0048

Page 15: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Finding matches to (instances of) a PWM

Is the sequence A G A T T G A T C T a match to this matrix?

P(sequence | matrix model) = (0.4)(1.0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0.0048

Background model: P(G) = P(A) = P(T) = P(C) = 0.25

P(sequence | background model) = (0.25)^10 ≈ 9.5e-7

Page 16: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Log-likelihood ratio (LLR)

LLR = log ( P(sequence | matrix model) / P(sequence | background model) )

A measure of how different the likelihood of the sequence is given the motif model vs. the background model.

In our example:

LLR = log10 ( 0.0048 / 9.5e-7 ) ≈ 3.7 (about 8.5 with the natural log, or 12.3 bits with log2)

The larger the LLR, the more strongly the sequence favors the motif model over the background model. To select motif instances in real life, you can define an LLR cutoff (often defined by sampling).
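The scoring steps above can be sketched in Python as follows. The function names (`likelihood`, `log_likelihood_ratio`, `scan`), the choice of log base 10, and the example cutoff of 3.0 are assumptions made for illustration; in practice the cutoff would be calibrated, for example by sampling as noted above.

```python
import math

# Frequency matrix as shown on the slides (columns = motif positions 1-10).
PWM = {
    "G": [0.0, 1.0, 0.0, 0.0, 0.7, 1.0, 0.0, 0.0, 0.4, 0.8],
    "A": [0.4, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "T": [0.6, 0.0, 0.0, 1.0, 0.3, 0.0, 0.0, 1.0, 0.4, 0.2],
    "C": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
}
BACKGROUND = 0.25  # uniform background: P(G) = P(A) = P(T) = P(C)


def likelihood(sequence, pwm):
    """P(sequence | matrix model), treating positions as independent."""
    if len(sequence) != len(pwm["A"]):
        raise ValueError("sequence length must equal motif width")
    return math.prod(pwm[base][i] for i, base in enumerate(sequence))


def log_likelihood_ratio(sequence, pwm, background=BACKGROUND):
    """log10 of P(sequence | matrix) / P(sequence | background)."""
    p_motif = likelihood(sequence, pwm)
    if p_motif == 0:
        return float("-inf")  # a zero entry in the matrix rules the sequence out
    return math.log10(p_motif / background ** len(sequence))


def scan(sequence, pwm, cutoff):
    """Yield (start, window, LLR) for each window scoring at or above the cutoff."""
    width = len(pwm["A"])
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        llr = log_likelihood_ratio(window, pwm)
        if llr >= cutoff:
            yield start, window, llr


print(round(log_likelihood_ratio("AGATTGATCT", PWM), 2))
print(list(scan("CCAGATTGATCTCC", PWM, cutoff=3.0)))  # cutoff is an arbitrary example value
```

Scanning a promoter with a sliding window is one simple way to find genomic examples of a motif once the matrix is known.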

Page 17: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Finding matches to (instances of) a PWM

Is the sequence A A A T T G A T C T a match to this matrix?

P(sequence | matrix model) = (0.4)(0)(1.0)(1.0)(0.3)(1.0)(1.0)(1.0)(0.2)(0.2) = 0

A single base never seen in the training sites (A at position 2) drives the whole probability to zero.

** If your PWM was trained on a small sample set, you might have missed some examples: this is overfitting of the matrix (i.e. it is too specific).

Page 18: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Pseudo-counts: protecting against overfitting due to small sample sizes

Without pseudo-counts (from Site 1 to Site 7):

Position   1    2    3    4    5    6    7    8    9    10
G          0    1.0  0    0    0.7  1.0  0    0    0.4  0.8
A          0.4  0    1.0  0    0    0    1.0  0    0    0
T          0.6  0    0    1.0  0.3  0    0    1.0  0.4  0.2
C          0    0    0    0    0    0    0    0    0.2  0

With pseudo-counts: add 1 count to each base at each position, then divide by n + 4, where n is the number of sites (here n = 7, so each entry becomes (count + 1) / 11 and no entry is zero).
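A small Python sketch of the add-one rule, assuming the seven sites from the earlier slides. The function names and the `pseudocount` parameter are illustrative, not from the slides. With pseudo-counts, a sequence containing a base that was never observed at some position receives a small but nonzero probability.

```python
from collections import Counter

BASES = "GATC"

# The seven aligned sites, transcribed from the slides.
SITES = [
    "AGATGGATGG",
    "TGATTGATGT",
    "TGATGGATGG",
    "AGATTGATCG",
    "TGATGGATTG",
    "TGATGGATTG",
    "AGATGGATTG",
]


def pwm_with_pseudocounts(sites, pseudocount=1):
    """Frequency matrix with `pseudocount` added to every base at every position."""
    n = len(sites)
    matrix = {base: [] for base in BASES}
    for column in zip(*sites):
        counts = Counter(column)
        for base in BASES:
            # (count + 1) / (n + 4) when pseudocount is 1, as on the slide.
            matrix[base].append((counts[base] + pseudocount) / (n + 4 * pseudocount))
    return matrix


def likelihood(sequence, pwm):
    """P(sequence | matrix model), treating positions as independent."""
    p = 1.0
    for i, base in enumerate(sequence):
        p *= pwm[base][i]
    return p


raw = pwm_with_pseudocounts(SITES, pseudocount=0)
smoothed = pwm_with_pseudocounts(SITES)
print(likelihood("AAATTGATCT", raw))       # zero: A was never seen at position 2
print(likelihood("AAATTGATCT", smoothed))  # small but nonzero
```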

Page 19: Many similarly expressed genes are coregulated by the same transcription factor(s) …

Motif finding methods and algorithms

Given a set of n promoters of n coregulated genes, find a motif common to the promoters. Both the PWM and the motif sequences are unknown.

Common methods:

1. Enumeration: in the simplest case, look at the frequency of all n-mers (words of a fixed length). Finds the global optimum, since the entire space can be searched.

2. EM algorithms (MEME): iteratively home in on the most likely motif model; can simultaneously identify the motif and find examples of the motif.

3. Gibbs sampling methods (AlignACE, BioProspector): iteratively replace ('sample') sites to retrain the matrix.
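The enumeration idea can be sketched in Python for the simplest case of exact words. The function name `count_kmers` and the toy promoter sequences are hypothetical, invented only for this example. Real enumeration methods typically also compare the counts with what a background model would predict, which this sketch omits.

```python
from collections import Counter


def count_kmers(promoters, k):
    """Count, for each k-mer, the number of promoters containing it at least once."""
    presence = Counter()
    for promoter in promoters:
        # A set, so a k-mer repeated within one promoter is counted only once.
        kmers = {promoter[i:i + k] for i in range(len(promoter) - k + 1)}
        presence.update(kmers)
    return presence


# Hypothetical toy promoters, each built to contain CCAAT somewhere.
PROMOTERS = [
    "TTGACCAATGA",
    "ACCAATCGGTT",
    "GGTACGCCAAT",
    "CCAATTTAGCA",
]

presence = count_kmers(PROMOTERS, k=5)
top_kmer, n_promoters = presence.most_common(1)[0]
print(top_kmer, "occurs in", n_promoters, "of", len(PROMOTERS), "promoters")
```

Counting presence per promoter rather than total occurrences is a design choice made here so that a word repeated many times in a single promoter does not dominate the ranking.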