Http://cs173.stanford.edu [BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano...

Post on 14-Dec-2015


http://cs173.stanford.edu [BejeranoWinter12/13] 1

MW 11:00-12:15 in Beckman B302
Prof: Gill Bejerano
TAs: Jim Notwell & Harendra Guturu

CS173

Lecture 15: TF Motifs (Harendra)

Announcements

• Project milestones due today

Review: Transcriptional regulation of genes

Thousands of transcription factor-CRM interactions that control gene expression in each cell type

[Figure labels: Enhancer (CRM); Transcription Start Site (TSS)]

Last Time: ChIP-seq, a first glimpse of the regulatory genome in action

[Figure labels: Cis-regulatory peak; Peak calling]

Last Time: Infer functions of ChIP-seq binding profile using GREAT

GREAT = Genomic Regions Enrichment of Annotations Tool

[Figure legend: gene transcription start site; SRF binding ChIP-seq peak; π: ontology term (e.g. ‘actin cytoskeleton’)]

p = 0.33 of genome annotated with π
n = 6 genomic regions
k = 5 genomic regions hit annotation π

P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
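The binomial P-value from the GREAT example (n = 6, k ≥ 5, p = 0.33) can be checked by hand. The snippet below is a minimal sketch using only the standard library; the function name `binomial_tail` is ours, not part of GREAT.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Numbers from the slide: 6 genomic regions, at least 5 of them hit the
# annotation, and the annotation covers 0.33 of the genome.
p_value = binomial_tail(k=5, n=6, p=0.33)
print(f"P = {p_value:.4f}")
```

With these inputs the tail probability is about 0.017, so 5 of 6 hits is unlikely under the null for a term covering a third of the genome.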

GREAT gives you a table of functions

Top GREAT enrichments of SRF

Ontology         Term                     # Genes  Binomial P-value  Experimental support*
Gene Ontology    actin cytoskeleton       30       7x10^-9           Miano et al. 2007
                 actin binding            31       5x10^-5           Miano et al. 2007
Pathway Commons  TRAIL signaling          32       5x10^-7           Bertolotto et al. 2000
                 Class I PI3K signaling   26       2x10^-6           Poser et al. 2000
TreeFam          FOS gene family          5        1x10^-8           Chai & Tarnawski 2002
TF Targets       Targets of SRF           84       5x10^-76          Positive control
                 Targets of GABP          28       4x10^-9           ChIP-seq support
                 Targets of YY1           44       1x10^-6           Natesan & Gilman 1995
                 Targets of EGR1          23       2x10^-4

* Known from literature – as in function is known, SOME of the genes are known, and the binding sites highlighted are NOT.

Last Time: Infer functions of ChIP-seq binding profile using GREAT

GREAT = Genomic Regions Enrichment of Annotations Tool

[Figure legend: gene transcription start site; SRF binding ChIP-seq peak; π: ontology term (e.g. ‘actin binding’)]

p = 0.5 of genome annotated with π
n = 6 genomic regions
k = 4 genomic regions hit annotation π

P = Pr_binom(k ≥ 4 | n = 6, p = 0.5)


But doing the experiment is the hard part!

• Hard or impossible to get the required cells
  • Some cells don’t occur in enough quantity to ChIP
  • Others are hard to dissect
  • Certain human tissues are hard to obtain
• Hard to get a good antibody
  • Ex: We have ChIP results for a factor in brain
  • We have not been able to repeat it since we can’t find the same antibody
• Lots of time and money to do one experiment
• Only information for one context – cell type or time

Can we computationally predict the binding sites for many contexts and factors?

Recall: TFBS Position Weight Matrix (PWM)

Experimentally determined sites

A T G G C A T G
A G G G T G C G
A T C G C A T G
T T G C C A C G
A T G G T A T T
A T T C G A C G
A G G G C G T T
A T G A C A T G
A T G G C A T G
A C T G G A T G

Alignment (count) Matrix
A  9 0 0 1 0 8 0 0
C  0 1 1 1 7 0 3 0
G  0 2 7 8 1 2 0 8
T  1 7 2 0 2 0 7 2

Frequency Weight Matrix
A  0.9 0.0 0.0 0.1 0.0 0.8 0.0 0.0
C  0.0 0.1 0.1 0.1 0.7 0.0 0.3 0.0
G  0.0 0.2 0.7 0.8 0.1 0.2 0.0 0.8
T  0.1 0.7 0.2 0.0 0.2 0.0 0.7 0.2

Cons A T G G C A T G

Can we use a PWM to predict where the TF will bind in the genome (without doing ChIP-seq)?

Binding Site Prediction using Match

Problem: High number of false positives.
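Scanning a sequence with a PWM can be sketched as follows. This is a generic log-odds scan, not the Match tool itself. The pseudocount of 0.5, the uniform background of 0.25 per base, the score threshold of 5.0, and the toy sequence are all assumptions made for illustration; the count matrix is the one from the PWM slide.

```python
import math

BASES = "ACGT"

# Count matrix copied from the PWM slide (8 columns, 10 sites).
COUNTS = {
    "A": [9, 0, 0, 1, 0, 8, 0, 0],
    "C": [0, 1, 1, 1, 7, 0, 3, 0],
    "G": [0, 2, 7, 8, 1, 2, 0, 8],
    "T": [1, 7, 2, 0, 2, 0, 7, 2],
}


def log_odds(counts, pseudocount=0.5, background=0.25):
    """Turn counts into per-column log2(frequency / background) scores."""
    width = len(counts["A"])
    pwm = []
    for j in range(width):
        total = sum(counts[b][j] for b in BASES) + 4 * pseudocount
        pwm.append(
            {b: math.log2((counts[b][j] + pseudocount) / total / background) for b in BASES}
        )
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold.

    The sequence is assumed to contain only A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


pwm = log_odds(COUNTS)
# Toy sequence (hypothetical) with the consensus ATGGCATG embedded at offset 2.
hits = scan("CCATGGCATGTTAGGGTGCGA", pwm, threshold=5.0)
print(hits)
```

Lowering the threshold returns more windows, which is exactly how the false positives pile up on a genome-sized sequence.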

Recall: TFBS Position Weight Matrix (PWM)

Information content of each column (same motif as on the previous PWM slide, Cons A T G G C A T G):

1.2 0.7 0.7 0.7 0.6 1.0 0.8 1.0

Information content of a motif = sum of all columns
= 1.2 + 0.7 + 0.7 + 0.7 + 0.6 + 1.0 + 0.8 + 1.0 = 6.7
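Per-column information content can be sketched with the plain Shannon formula, IC = 2 - H bits for DNA, assuming a uniform background and no small-sample correction. The per-column values on the slide were presumably computed with a different correction, so the numbers printed here will not match them exactly; the frequency matrix is the one from the PWM slide.

```python
import math

# Frequency matrix copied from the PWM slide (rows A, C, G, T).
FREQS = {
    "A": [0.9, 0.0, 0.0, 0.1, 0.0, 0.8, 0.0, 0.0],
    "C": [0.0, 0.1, 0.1, 0.1, 0.7, 0.0, 0.3, 0.0],
    "G": [0.0, 0.2, 0.7, 0.8, 0.1, 0.2, 0.0, 0.8],
    "T": [0.1, 0.7, 0.2, 0.0, 0.2, 0.0, 0.7, 0.2],
}


def column_information(probs):
    """Information content in bits: 2 - entropy (uniform background assumed)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy


def motif_information(freqs):
    """List of per-column information content values for a frequency matrix."""
    width = len(freqs["A"])
    return [column_information([freqs[b][j] for b in "ACGT"]) for j in range(width)]


per_column = motif_information(FREQS)
print([round(x, 2) for x in per_column], round(sum(per_column), 2))
```

A column that is always the same base scores 2 bits and a uniform column scores 0, so the motif total grows with both length and per-position specificity.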

Information content is a measure of motif specificity

[Figure: motif logos for SRF (IC ~ 12), SPIB (IC ~ 5) and REST (IC ~ 25)]

How do these compare to a library of many PWMs?

PWMs have a range of information content

• Measure of motif specificity

[Figure: distribution of information content across a PWM library, with SRF, REST and SPIB marked]

Information content determines how accurately we can predict the binding site

[Figure: genome-wide matches to the SRF motif]

2 million matches to the SRF motif, but ChIP-seq and other estimates suggest ≈ 10,000 actual binding sites
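A quick back-of-the-envelope check of what these two numbers imply, assuming the best case in which every real site is among the motif matches:

```python
motif_matches = 2_000_000  # genome-wide matches to the SRF motif (from the slide)
true_sites = 10_000        # approximate number of actual binding sites (from the slide)

# Best case: every true site is one of the matches, so this is an upper bound.
max_precision = true_sites / motif_matches
print(f"at most {max_precision:.1%} of motif matches can be real sites")
```

So even in the best case, roughly 199 of every 200 matches are false positives.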


Can we do better?


Use excess conservation to improve prediction accuracy

Aaron Shoa

Wenger et al., PRISM offers a comprehensive genomic approach to transcription factor function prediction. 2013

Use shuffled motifs to calculate confidence of excess conservation binding site prediction

Transcription factor motif → genome-wide binding site predictions
10 shuffled transcription factor motifs → genome-wide binding site predictions

[Figure: fraction conserved (y-axis) vs. branch length in subst / site (x-axis), for the real and the shuffled motifs]

Confidence is the fraction conserved in excess.

excess = 0.12
total = 0.32
confidence = excess / total = 0.12 / 0.32 ≈ 0.38
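The slides do not say how the 10 shuffled motifs are built. One simple option, shown here purely as an assumption, is to permute the columns of the matrix, which keeps the information content of each column but destroys the binding pattern. The helper names are hypothetical.

```python
import random

# Count matrix from the PWM slide, used only as an example input.
COUNTS = {
    "A": [9, 0, 0, 1, 0, 8, 0, 0],
    "C": [0, 1, 1, 1, 7, 0, 3, 0],
    "G": [0, 2, 7, 8, 1, 2, 0, 8],
    "T": [1, 7, 2, 0, 2, 0, 7, 2],
}


def shuffle_columns(matrix, rng):
    """Return a copy of the matrix with its columns in a random order."""
    width = len(matrix["A"])
    order = list(range(width))
    rng.shuffle(order)
    return {base: [row[j] for j in order] for base, row in matrix.items()}


def shuffled_controls(matrix, n=10, seed=0):
    """Build n column-shuffled control motifs (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [shuffle_columns(matrix, rng) for _ in range(n)]


controls = shuffled_controls(COUNTS)
print(len(controls), "control motifs")
```

A permutation can occasionally reproduce the original column order; a production version would probably reject such draws.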

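Probabilistic interpretation

• Confidence is the probability that a motif instance is functional given its observed conservation.

R: real motif
S: average shuffled motif
F: functional
C: conservation of a motif instance, in branch length (subst / site)

[Figure: Pr_R(C ≥ c) and Pr_S(C ≥ c) plotted against branch length (subst / site); example point: Pr_R(C ≥ 1.5) = 0.2]

\[
\begin{aligned}
\Pr\nolimits_R(F \mid C \ge c)
&= 1 - \Pr\nolimits_R(\text{not } F \mid C \ge c) \\
&= 1 - \frac{\Pr_R(C \ge c \mid \text{not } F)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&= 1 - \frac{\Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&= \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&\approx \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)}{\Pr_R(C \ge c)}
= \frac{\text{excess}}{\text{total}}
\end{aligned}
\]

The second line is Bayes' rule. The third line models non-functional instances of the real motif as being conserved like instances of a shuffled motif. The approximation takes Pr_R(not F) ≈ 1, i.e. most motif matches are not functional.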
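The confidence = excess / total computation can be sketched from per-instance conservation scores. The scores below are made-up toy values, and clamping negative confidence to 0 is our choice, not something stated on the slides.

```python
def fraction_at_least(scores, c):
    """Empirical Pr(C >= c): fraction of motif instances with conservation >= c."""
    return sum(s >= c for s in scores) / len(scores)


def confidence(real_scores, shuffled_score_sets, c):
    """(Pr_R(C >= c) - Pr_S(C >= c)) / Pr_R(C >= c), averaged over shuffles."""
    total = fraction_at_least(real_scores, c)  # Pr_R(C >= c)
    background = sum(
        fraction_at_least(s, c) for s in shuffled_score_sets
    ) / len(shuffled_score_sets)               # Pr_S(C >= c), average shuffled motif
    if total == 0:
        return 0.0
    # Clamp at 0 when shuffles are at least as conserved as the real motif.
    return max(0.0, (total - background) / total)


# Hypothetical conservation scores (branch length, subst / site).
real = [0.2, 0.5, 1.6, 2.0, 0.9, 1.7, 0.1, 1.5, 0.3, 2.2]
shuffles = [
    [0.1, 0.4, 1.6, 0.2, 0.3, 0.8, 1.9, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 1.5, 0.6, 0.7, 0.2, 0.9, 0.4, 0.1],
]
print(round(confidence(real, shuffles, c=1.5), 3))
```

In this toy case half of the real instances reach c = 1.5 against 15% of the shuffled ones, giving a confidence of 0.7.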

Excess conservation score defined by genomic background


Excess conservation score also defined by motif


ARE THE PREDICTIONS ANY GOOD?

Perform genome-wide binding site predictions…


Use ChIP-seq overlap as a measure of sensitivity

Genome-wide binding site predictions for one factor (Ex: E2F4)
ChIP-seq for same factor (Ex: E2F4)

Sensitivity = Overlapping ChIP-peaks / Total ChIP-peaks

But how do you assess if your overlap is good?
Compare to the best tool out there (or all the tools, if there is no “best”)
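The sensitivity formula above can be sketched with plain interval overlap. The coordinates are hypothetical toy values, intervals are treated as half-open (chrom, start, end) tuples, and the nested loop is only meant for small inputs.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def sensitivity(chip_peaks, predictions):
    """Fraction of ChIP-seq peaks overlapped by at least one predicted site."""
    hit = sum(any(overlaps(peak, site) for site in predictions) for peak in chip_peaks)
    return hit / len(chip_peaks)


# Hypothetical toy data: four ChIP-seq peaks and three predicted binding sites.
chip_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250), ("chr2", 900, 1100)]
predictions = [("chr1", 150, 158), ("chr2", 240, 248), ("chr2", 5000, 5008)]
print(sensitivity(chip_peaks, predictions))
```

Two of the four toy peaks contain a prediction, so the sensitivity is 0.5. Running the same function on another tool's predictions gives the comparison the slide asks for.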

Excess conservation binding site prediction is more accurate than existing methods

[Figure label: prior state of the art]

Excess conservation captures binding site profile similar to ChIP-seq

[Figure tracks: ChIP-seq; MotifMap; PRISM; conservation (% identity)]

Submit predictions to GREAT

• Now we have good genome-wide binding site predictions for many factors

Let's submit them to GREAT and find out what they are doing…

Comparing binding site prediction to ChIP-seq

Columns: Ontology | Top-ranked biological context | GREAT rank for ChIP-seq | Experimental support

GABPA
  GO Biological Process | translation | 2 | (Genuario and Perry, 1996)
  GO Cellular Component | membrane coat | 14 | Novel
  GO Molecular Function | translation initiation factor activity | 4 | (Genuario and Perry, 1996)
  Mouse Phenotypes | increased single-positive T cell number | None | (Yu et al., 2010)
  PANTHER Pathway | general transcription by RNA polymerase I | 1 | (Hauck et al., 2002)
  Pathway Commons | transcription | 3 | (Hauck et al., 2002)

REST (NRSF)
  GO Biological Process | neurotransmitter transport | 1 | (Schoenherr et al., 1996)
  GO Cellular Component | neuronal cell body | None | (Schoenherr et al., 1996)
  GO Molecular Function | cation channel activity | 1 | (Schoenherr et al., 1996)
  Mouse Phenotypes | abnormal synaptic transmission | 1 | (Schoenherr et al., 1996)
  PANTHER Pathway | synaptic vesicle trafficking | 2 | (Schoenherr et al., 1996)
  Pathway Commons | transmission across chemical synapses | 3 | (Schoenherr et al., 1996)

SRF (in Jurkat)
  GO Biological Process | muscle structure development | None | (Miano et al., 2007)
  GO Cellular Component | actin cytoskeleton | 1 | (Miano et al., 2007)
  GO Molecular Function | structural constituent of muscle | None | (Miano et al., 2007)
  Mouse Phenotypes | dilated heart ventricles | None | (Parlakian et al., 2004)
  PANTHER Pathway | cytoskeletal regulation by Rho GTPase | None | (Hill et al., 1995)
  Pathway Commons | regulation of insulin secretion by acetylcholine | None | Novel

STAT3 (in mESC)
  GO Biological Process | negative regulation of signal transduction | None | (Naka et al., 1997)
  GO Molecular Function | transforming growth factor beta binding | None | (Kinjyo et al., 2006)
  Mouse Phenotypes | abnormal spleen B cell follicle morphology | None | (Schmidlin et al., 2009)
  Pathway Commons | Signaling events mediated by TCPTP | None | (Yamamoto et al., 2002)

PRISM re-discovers known functions

TF    function                            p-value      target genes
SRF   muscle structure development        7.43×10^-41  157
GLI2  skeletal system development         7.07×10^-48  192
CRX   retinal photoreceptor degeneration  1.30×10^-10  34
AR    abnormal spermiogenesis             1.19×10^-6   26

Is the number of re-discovered known functions impressive?

Evaluate re-discovery of known function using “closed loops”

How can we assess if the functional associations predicted by PRISM for a particular TF are reasonable without reading a lot of papers?

One way is to check if the TFs are annotated with the function (form a closed loop).

[Figure: SRF → genes involved in “muscle structure development”]

Is SRF itself annotated with the term “muscle structure development”?

YES – a “closed loop”
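The closed-loop check is a set lookup: does the TF's own annotation include the term predicted for its targets? A minimal sketch follows; the annotation dictionary is a hypothetical stand-in for a real ontology lookup, and the function names are ours.

```python
def is_closed_loop(tf, term, tf_annotations):
    """True if the TF itself is annotated with the predicted term."""
    return term in tf_annotations.get(tf, set())


def closed_loop_fraction(predictions, tf_annotations):
    """Fraction of (TF, term) predictions that form a closed loop."""
    if not predictions:
        return 0.0
    closed = sum(is_closed_loop(tf, term, tf_annotations) for tf, term in predictions)
    return closed / len(predictions)


# Hypothetical annotations: only the SRF entry reflects the slide's example.
tf_annotations = {"SRF": {"muscle structure development"}}
predictions = [
    ("SRF", "muscle structure development"),  # closed loop
    ("SRF", "actin cytoskeleton"),            # not in the toy annotation set
    ("GLI2", "skeletal system development"),  # TF missing from the toy annotation set
]
print(closed_loop_fraction(predictions, tf_annotations))
```

Here one of three predictions closes the loop. The check needs no manual reading, but it can only undercount when the annotation is incomplete.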

PRISM predictions are consistent with known transcription factor biology

Null Model: How many closed loops using 50,000 random shuffled PWM libraries?
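Comparing the observed number of closed loops to a null distribution can be sketched as an empirical P-value. Every number below is a placeholder: the null counts are random integers standing in for closed-loop counts from the shuffled libraries, and the observed count of 180 is invented. The add-one correction is a common convention, not something stated on the slides.

```python
import random


def empirical_p_value(observed, null_counts):
    """Share of null draws with at least `observed` closed loops (add-one corrected)."""
    at_least = sum(n >= observed for n in null_counts)
    return (at_least + 1) / (len(null_counts) + 1)


rng = random.Random(0)
# Placeholder null: closed-loop counts for 50,000 shuffled libraries (made up).
null_counts = [rng.randint(20, 60) for _ in range(50_000)]
observed = 180  # hypothetical closed-loop count for the real library

p = empirical_p_value(observed, null_counts)
print(f"empirical P = {p:.1e}")
```

With 50,000 shuffled libraries the smallest reportable P-value is 1 / 50,001, which is what this toy example returns because no null draw reaches the observed count.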

Many non-closed loops are still true

TF     function                       p-value      target genes
GATA6  abnormal pancreas development  5.69×10^-13  23
SRF    actin cytoskeleton             4.84×10^-58  142

1. Incomplete annotation
   GATA6: Nature Genetics, December 2011.

2. “Regulation of” annotation
   SRF acts in the nucleus, where it regulates actin cytoskeleton genes.

• Now we have good genome-wide binding site predictions for many factors

• AND we have functional predictions without ChIP-seq

Was it as easy as creating binding sites and submitting the results to GREAT?

…not quite…

Raw GREAT results need cleaning for conserved TFBS

Shuffled motifs also give GREAT enrichments


[Pipeline figure]
Transcription factor motif → genome-wide binding site predictions → run GREAT and observe biological function
10 shuffled transcription factor motifs → genome-wide binding site predictions → run GREAT and observe biological function
Examine closely → Filter → PRISM

Shuffled motifs are used to create an “E-value” metric to blacklist enrichments that show up for shuffles

                             Stage 1               Stage 2               Stage 3
                             GREAT on binding      Top significant       PRISM terms
                             site prediction       GREAT terms           (via blacklisting)
                             (Obtained = GREAT)    (Kept)                (Kept = PRISM)
# TF-term associations       31,946                7,529                 1,658
TF-term FDR (from shuffles)  50.5%                 49.5%                 16.4%
closed loop %                3.3%                  5.3%                  10.9%

PRISM vs. GREAT on b.s. prediction
GREAT predictions kept       5.2%
FDR improvement              308%
fraction loops improvement   329%
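The summary figures in the table follow from its Stage 1 and Stage 3 columns. The short check below uses only the rounded numbers shown on the slide, which is why the closed-loop ratio comes out at about 330% rather than the slide's 329% (presumably computed from unrounded values).

```python
# Values copied from the table (Stage 1 = raw GREAT, Stage 3 = PRISM).
stage1 = {"associations": 31_946, "fdr": 50.5, "closed_loop": 3.3}
stage3 = {"associations": 1_658, "fdr": 16.4, "closed_loop": 10.9}

kept = stage3["associations"] / stage1["associations"]
fdr_improvement = stage1["fdr"] / stage3["fdr"]
loop_improvement = stage3["closed_loop"] / stage1["closed_loop"]

print(f"GREAT predictions kept: {kept:.1%}")
print(f"FDR improvement: {fdr_improvement:.0%}")
print(f"closed-loop improvement: {loop_improvement:.0%}")
```

Read this way, "improvement" is a ratio between stages: the FDR falls about threefold while the closed-loop fraction rises about threefold.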

What are all the terms we are throwing away?


GREAT enrichments from shuffles are due to conservation bias

• Create 10,000 random sets of random conserved non-coding regions
• Run GREAT
• How do the enrichments compare to those from shuffled motifs?

[Venn diagram: Shuffles (2488) vs. CNEs (2279): 755 in shuffles only, 1733 shared, 546 in CNEs only]

Pro: E-value helps us get more accurate predictions by removing false predictions
Con: Conservation bias filter causes us to lose potentially real enrichments in systems that are more often conserved

So far…

• “Excess Conservation”
  • advanced the state of the art for binding site prediction
• “PRISM pipeline”
  • combined accurate binding site prediction with GREAT
• Publicly offered as a web application
  • bejerano.stanford.edu/prism

The rest of the talk includes pre-publication work