Motif-based analysis of ChIP-seq data - Timothy Bailey

Motif-based analysis of ChIP-seq data Timothy L. Bailey AMATA October 16, 2013

description

ChIP-seq experiments are the current method of choice for surveying the targets of DNA-binding transcription factors (TFs). Motif-based sequence analysis of ChIP-seq data can provide extremely valuable insights into the biological mechanisms underlying transcriptional regulation. For example, it can determine the DNA-binding affinity (motif) of the factor, distinguish regions bound directly or indirectly by the factor, and suggest the identities of TFs that regulate cooperatively. Motif analysis can also be employed in differential mode to identify explanatory differences in the motif content of genomic regions bound by the factor under different cellular contexts. I will describe several types of motif-based analysis and give concrete examples of biological insights they can yield.

Transcript of Motif-based analysis of ChIP-seq data - Timothy Bailey

Page 1: Motif-based analysis of ChIP-seq data - Timothy Bailey

Motif-based analysis of ChIP-seq data

Timothy L. Bailey AMATA

October 16, 2013

Page 2: Motif-based analysis of ChIP-seq data - Timothy Bailey

Overview of Talk

• ChIP-seq data analysis

– Why do motif-based analysis?

– MEME-ChIP

• Two case studies

– KLF1 in mouse fetal liver cells

– NFI in mouse neural stem cells

Page 3: Motif-based analysis of ChIP-seq data - Timothy Bailey

Steps in ChIP-seq

• Cross-link proteins to DNA

• Fragment chromatin

• Immunoprecipitate with antibody to protein

• Size-select and ligate

• Amplify

• Sequence

Page 4: Motif-based analysis of ChIP-seq data - Timothy Bailey

What can we learn from Transcription Factor ChIP-seq data?

• Where is the TF bound?

• What is its DNA-binding affinity?

• What genes might it regulate?

• What are its partners?

Page 5: Motif-based analysis of ChIP-seq data - Timothy Bailey

ChIP-seq Data Analysis

1. Mapping: Align the reads with the reference genome.

2. Peak Calling: Identify regions with significant read coverage.

3. Motif-based sequence analysis: Identify DNA sequence patterns in the peaks.

“Practical guidelines for the comprehensive analysis of ChIP-seq data”, Bailey et al., PLoS Comp Bio (in press).

Page 6: Motif-based analysis of ChIP-seq data - Timothy Bailey

Why do motif-based analysis?

• Quality control

• Understanding DNA-binding affinity

• Understanding regulatory mechanisms

Page 7: Motif-based analysis of ChIP-seq data - Timothy Bailey

[Figure labels: PWM-based; Word-based; Known motifs]

Page 8: Motif-based analysis of ChIP-seq data - Timothy Bailey

Motif Discovery: MEME

• Searches for novel PWM motif with most significant information content (IC).

• Null model: the distribution of the IC of a set of sites of a given width in random sequences of a given length.

[Figure: sites from 100-bp ChIP-seq regions are aligned to give a motif logo; motif IC = total height of letters]
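The relationship "motif IC = total height of letters" can be illustrated with a minimal sketch. It assumes a uniform background over A, C, G and T and applies no small-sample correction, which is a simplification and not necessarily how MEME scores motifs. The function name `motif_information_content` is hypothetical.

```python
import math
from collections import Counter


def motif_information_content(sites, alphabet="ACGT"):
    """Total information content (bits) of a set of aligned, equal-width sites.

    Assumption: uniform background, so each column can contribute at most
    log2(len(alphabet)) bits (2 bits for DNA).
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")

    max_bits = math.log2(len(alphabet))
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        # Shannon entropy of the letter frequencies in this column
        entropy = 0.0
        for letter in alphabet:
            p = counts[letter] / len(sites)
            if p > 0:
                entropy -= p * math.log2(p)
        # Column IC corresponds to the stack height in the logo
        total += max_bits - entropy
    return total


if __name__ == "__main__":
    aligned_sites = ["CCACGCCC", "CCCCACCC", "CCACACCC", "CCCCGCCC"]
    print(round(motif_information_content(aligned_sites), 2))
```

A perfectly conserved column contributes 2 bits and a column with all four letters equally frequent contributes 0 bits, so the sum tracks the total letter height in the logo.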

Page 9: Motif-based analysis of ChIP-seq data - Timothy Bailey

Discriminative Motif Discovery: DREME

• Searches for novel regular expression motif with most significant enrichment of sites in positive sequences.

• Null model: the probability of a site is the same in the two sets of sequences.

• Test: Fisher’s Exact Test on P and N (number of sequences with ≥ 1 site)

[Figure: 100-bp ChIP-seq regions (P=5) compared with 100-bp shuffled regions (N=3)]

Motif Regular Expression: CCMRCCC
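The DREME-style test can be sketched as follows: expand the IUPAC motif into a regular expression, count the sequences with at least one site in each set, and apply a one-sided Fisher's Exact Test. This is an illustrative sketch, not DREME's implementation. It scans only the given strand, and the set sizes used in the example (6 positive and 6 shuffled sequences) are hypothetical, since the slide gives only P=5 and N=3.

```python
import math
import re

# Standard IUPAC nucleotide codes expanded to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile an IUPAC motif such as CCMRCCC into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in motif))


def count_sequences_with_site(pattern, sequences):
    """Number of sequences containing at least one match (given strand only)."""
    return sum(1 for seq in sequences if pattern.search(seq))


def fisher_exact_enrichment(p_hits, n_pos, n_hits, n_neg):
    """One-sided Fisher's Exact Test p-value for enrichment in the positive set.

    p_hits / n_hits: sequences with >= 1 site in the positive / negative set.
    n_pos / n_neg:   total sequences in the positive / negative set.
    """
    total_hits = p_hits + n_hits
    denominator = math.comb(n_pos + n_neg, total_hits)
    # Hypergeometric tail: p_hits or more of the hits fall in the positive set
    tail = sum(
        math.comb(n_pos, k) * math.comb(n_neg, total_hits - k)
        for k in range(p_hits, min(n_pos, total_hits) + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    pattern = iupac_to_regex("CCMRCCC")
    print(pattern.pattern)
    # Hypothetical set sizes; only P=5 and N=3 come from the slide
    print(fisher_exact_enrichment(p_hits=5, n_pos=6, n_hits=3, n_neg=6))
```

With such small hypothetical sets the p-value is far from significant; real ChIP-seq inputs contain hundreds or thousands of regions.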

Page 10: Motif-based analysis of ChIP-seq data - Timothy Bailey

Central Motif Enrichment Analysis: CentriMo

• Searches for known motif whose best sites are most centrally enriched in the ChIP-seq regions.

• Null model: best sites are uniformly distributed within the regions.

• Test: Binomial(S, T, W/L)

[Figure: “site-probability” curve (x-axis: Position of Best Site; y-axis: Probability) for 500-bp ChIP-seq regions, with W=120 and L=500]

S = number of “successes” = 4; T = number of “trials” = 5
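The binomial test on this slide can be worked through with the numbers shown (S=4, T=5, a 120-bp central window in 500-bp regions). The sketch below follows the slide's simplification that the null probability of a best site landing in the window is the window width divided by the region length. It does not search over window widths or correct for multiple tests, so it is not a reproduction of CentriMo itself. The function names and the example positions are hypothetical.

```python
import math


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def central_enrichment(best_site_positions, region_length, window_width):
    """Count best sites in the central window and test against uniformity.

    best_site_positions: one position (0..region_length) per region.
    Returns (successes, p_value).
    """
    center = region_length / 2
    half_window = window_width / 2
    successes = sum(
        1 for pos in best_site_positions if abs(pos - center) <= half_window
    )
    p_null = window_width / region_length  # uniform null model
    return successes, binomial_tail(successes, len(best_site_positions), p_null)


if __name__ == "__main__":
    # Hypothetical best-site positions chosen to give S=4 out of T=5
    positions = [240, 260, 200, 300, 450]
    print(central_enrichment(positions, region_length=500, window_width=120))
```

With only five regions the p-value is about 0.013; the very small p-values reported for real data come from thousands of regions.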

Page 11: Motif-based analysis of ChIP-seq data - Timothy Bailey

Motif Spacing Analysis: SpaMo

• Searches for known motifs whose best sites have a preferred spacing with the primary motif.

1. Align regions on best primary site.

2. Predict best secondary site.

3. Compute enrichment at each possible spacing.

• Null model: uniform

• Test: Binomial

[Figure: 500-bp ChIP-seq regions; 300-bp centered on primary]
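The spacing test can be sketched under simplifying assumptions: each region contributes one integer gap between the primary site and the best secondary site, every gap from 0 to a maximum is equally likely under the null, and each observed gap is tested with a binomial tail. The real tool also distinguishes strand and side and corrects for multiple tests, none of which is modelled here. The function names are hypothetical.

```python
import math
from collections import Counter


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def spacing_enrichment(spacings, max_spacing):
    """Binomial p-value for each observed primary-to-secondary spacing.

    spacings: one integer gap per region, each in 0..max_spacing.
    Null model (assumption): all max_spacing + 1 gaps are equally likely.
    """
    if any(gap < 0 or gap > max_spacing for gap in spacings):
        raise ValueError("spacing outside 0..max_spacing")
    trials = len(spacings)
    p_uniform = 1 / (max_spacing + 1)
    counts = Counter(spacings)
    return {
        gap: binomial_tail(count, trials, p_uniform)
        for gap, count in sorted(counts.items())
    }


if __name__ == "__main__":
    # Hypothetical gaps: spacing 3 is over-represented
    for gap, p_value in spacing_enrichment([3, 3, 3, 7], max_spacing=9).items():
        print(gap, p_value)
```

Uncorrected p-values like these would need adjusting for the number of spacings tested before being interpreted.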

Page 12: Motif-based analysis of ChIP-seq data - Timothy Bailey

Case Study 1: KLF1

Did my ChIP-seq work?

Page 13: Motif-based analysis of ChIP-seq data - Timothy Bailey

Knowing when TF ChIP-seq fails

1) KLF1 ChIP: Tallack et al, Genome Research, 2010.

• The top MEME and DREME motifs confirm the in vitro KLF-family motif.

2) KLF1 ChIP: Other published data.

• The best DREME motif only approximates the KLF-family motif. MEME finds no similar motif.

[Figure: UniPROBE Klf7_primary motif compared with the MEME KLF1 motif and the DREME KLF1 motif]

Page 14: Motif-based analysis of ChIP-seq data - Timothy Bailey

TF motif databases are an invaluable resource

Page 15: Motif-based analysis of ChIP-seq data - Timothy Bailey

KLF-family motifs are nearly identical

Page 16: Motif-based analysis of ChIP-seq data - Timothy Bailey

Strong Evidence of Failure: Central Motif Enrichment

1) Tallack KLF1 data: p = 10⁻⁶⁶

2) Other KLF1 data: p = 0.7

[Figure: central enrichment of the KLF7 in vitro motif in each dataset]

Page 17: Motif-based analysis of ChIP-seq data - Timothy Bailey

CentriMo Analysis of Tallack KLF1 data

Are KLF1, GATA and GATA/SCL motifs the most centrally enriched motifs in KLF1 peak regions?

1. Tallack KLF1 data – yes.

Top four centrally enriched motifs in JASPAR+UniPROBE (862 motifs):

KLF4: W=111, P = 10⁻⁶⁶

KLF7: W=103, P = 10⁻⁵⁴

GATA1/SCL: W=177, P = 10⁻⁴⁸

GATA1: W=194, P = 10⁻⁴⁰

[Figure: site-probability curves labelled Klf4, Klf7, GATA/SCL and GATA]

Page 18: Motif-based analysis of ChIP-seq data - Timothy Bailey

Are KLF1, GATA and GATA/SCL motifs the most centrally enriched motifs in KLF1 peak regions?

1. Tallack KLF1 data – yes.

2. Other KLF1 data – no.

[Figure panels: 1. Tallack KLF1 data; 2. Other KLF1 data (curve labelled KLF4)]

Page 19: Motif-based analysis of ChIP-seq data - Timothy Bailey

KLF1 summary

• Enrichment of the known KLF-family motif(s), as well as of the motifs of known co-factors, is strong evidence of a successful TF ChIP-seq experiment.

• Perform motif-based analysis on TF ChIP-seq before publishing!

Page 20: Motif-based analysis of ChIP-seq data - Timothy Bailey

Case Study 2: NFI

How does my TF bind?

Page 21: Motif-based analysis of ChIP-seq data - Timothy Bailey

Nuclear Factor I

• Martynoga et al. (2013) ChIP-ed NFI in proliferating and quiescent mouse neural stem cells.

• The NFI family: NFIA, NFIB, NFIC and NFIX.

• NFI proteins are thought to bind as dimers.

Page 22: Motif-based analysis of ChIP-seq data - Timothy Bailey

Enriched motifs in NFI peaks in proliferating neural stem cells

Page 23: Motif-based analysis of ChIP-seq data - Timothy Bailey

Does NFI bind as a monomer in neural stem cells?

Page 24: Motif-based analysis of ChIP-seq data - Timothy Bailey

Enrichment suggests NFIX binds often as a monomer

[Figure: two motif logos, each labelled MEME]

Page 25: Motif-based analysis of ChIP-seq data - Timothy Bailey

Half-site spacing enriched at multiples of 10 bp

Page 26: Motif-based analysis of ChIP-seq data - Timothy Bailey

Dimeric sites are twice as common in embryonic fibroblasts

[Figure panels: mNSC ChIP; MEF ChIP]

Page 27: Motif-based analysis of ChIP-seq data - Timothy Bailey

E-boxes are highly enriched near NFI peaks

Page 28: Motif-based analysis of ChIP-seq data - Timothy Bailey

Most enriched E-box is also enriched in quiescent neural stem cells

[Figure panels: proliferating; quiescent]

Page 29: Motif-based analysis of ChIP-seq data - Timothy Bailey

One E-box not enriched in quiescent neural stem cells

[Figure panels: proliferating; quiescent]

Page 30: Motif-based analysis of ChIP-seq data - Timothy Bailey

Differentially enriched motif could bind OLIG or NEUROG/D

Page 31: Motif-based analysis of ChIP-seq data - Timothy Bailey

NFI summary

• Motif-based analysis sheds light on how TFs bind.

– Unlike other NFIs, NFIX often binds as a monomer.

– NFI binding is less associated with binding of OLIG or NEUROG/D factors in quiescent than in proliferating neural stem cells.

Page 32: Motif-based analysis of ChIP-seq data - Timothy Bailey

Conclusions

• Motif-based TF ChIP-seq analysis is highly useful for:

– Quality control

– Understanding DNA-binding affinity

– Understanding regulatory mechanisms

Page 33: Motif-based analysis of ChIP-seq data - Timothy Bailey

Acknowledgements

The MEME Suite: William Noble, James Johnson, Charles Grant, Martin Frith, Philip Machanick, Tom Whitington, Shobhit Gupta, Tom Lesluyes, Benjamin Dartigues

KLF Project: Michael Tallack, Tom Whitington, Andrew Perkins, Sean Grimmond, Brooke Gardiner, Ehsan Nourbakhsh, Nicole Cloonan, Elanor Wainwright, Janelle Keys, Wai Shan Yuen

Page 34: Motif-based analysis of ChIP-seq data - Timothy Bailey

http://meme.nbcr.net

Page 35: Motif-based analysis of ChIP-seq data - Timothy Bailey

Transcription Factors

• Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins.

• These proteins control transcription in two main ways:

– Directly, by promoting (or preventing) the assembly of the pre-initiation complex.

– Indirectly, by modifying chromatin.

Page 36: Motif-based analysis of ChIP-seq data - Timothy Bailey

ChIP-seq

• Chromatin ImmunoPrecipitation followed by high-throughput sequencing.

• TF binding sites (“punctate peaks”)

• Chromatin mods (“broad peaks”)

Page 37: Motif-based analysis of ChIP-seq data - Timothy Bailey

KLF1 is a key transcription factor in blood cell development

• We performed KLF1 ChIP-seq in mouse fetal liver cells and analyzed the resulting 945 peak regions using the MEME Suite [Tallack et al, Genome Research, 2010].

• We confirmed

– the in vitro binding motif of KLF1,

– several co-factor TFs, and

– a co-factor complex.

Pooled 4 Livers (~80×10⁶ cells)

Positive: ChIP (αKLF1 Rabbit Polyclonal Ab)

Control: Input DNA

Page 38: Motif-based analysis of ChIP-seq data - Timothy Bailey

A second KLF1 ChIP-seq experiment

• Pilon et al. (Blood, 2011) also performed KLF1 ChIP-seq in mouse fetal liver cells.

• They predicted over 13,000 peak regions.

• We reanalyzed their data using the MEME Suite.

• This second ChIP-seq dataset gives very different results.

Page 39: Motif-based analysis of ChIP-seq data - Timothy Bailey

Do the Pilon KLF1 ChIP-seq regions contain KLF1 co-factor sites?

Page 40: Motif-based analysis of ChIP-seq data - Timothy Bailey

GATA1 and SCL are important KLF1 regulatory co-factors

MEME GATA-SCL motif found in Tallack KLF1 data

Known GATA-SCL motif (JASPAR database)

• GATA1 and SCL bind DNA in a protein complex [Wadman et al, EMBO Journal, 1997].

1. Tallack KLF1 data – MEME finds complex motif

2. Pilon KLF1 data – MEME does not find the motif

Page 41: Motif-based analysis of ChIP-seq data - Timothy Bailey

Is motif discovery failing?

• To check this, we use CentriMo to search for any motifs in the JASPAR+UniPROBE motif database that are centrally enriched in the two KLF1 ChIP-seq datasets.

Page 42: Motif-based analysis of ChIP-seq data - Timothy Bailey

Caveats in ChIP-seq Motif Analysis

• Peak regions may contain other TF motifs due to looping.

• The binding of the ChIP-ed factor “X” may be indirect.

• ChIP-ed motif might be weak due to assisted binding.

Farnham, Nature Reviews Genetics, 2009

Page 43: Motif-based analysis of ChIP-seq data - Timothy Bailey

MEME motif is E-box with adjacent NFI half-site

NFI half-site

Page 44: Motif-based analysis of ChIP-seq data - Timothy Bailey

Differential central enrichment