Motif-based analysis of ChIP-seq data - Timothy Bailey

Motif-based analysis of ChIP-seq data

Timothy L. Bailey AMATA

October 16, 2013

Overview of Talk

• ChIP-seq data analysis

– Why do motif-based analysis?

– MEME-ChIP

• Two case studies

– KLF1 in mouse fetal liver cells

– NFI in mouse neural stem cells

Steps in ChIP-seq

• Cross-link proteins to DNA

• Fragment chromatin

• Immunoprecipitate with antibody to protein

• Size-select and ligate

• Amplify

• Sequence


What can we learn from Transcription Factor ChIP-seq data?

• Where is the TF bound?

• What is its DNA-binding affinity?

• What genes might it regulate?

• What are its partners?

ChIP-seq Data Analysis

1. Mapping: Align the reads with the reference genome.

2. Peak Calling: Identify regions with significant read coverage.

3. Motif-based sequence analysis: Identify DNA sequence patterns in the peaks.

…

“Practical guidelines for the comprehensive analysis of ChIP-seq data”, Bailey et al., PLoS Comp Bio (in press).

Why do motif-based analysis?

• Quality control

• Understanding DNA-binding affinity

• Understanding regulatory mechanisms

[Figure labels: PWM-based; Word-based; Known motifs]

Motif Discovery: MEME

• Searches for novel PWM motif with most significant information content (IC).

• Null model: the distribution of the IC of a set of sites of a given width in random sequences of a given length.

[Figure: sites in 100-bp ChIP-seq regions are aligned to give the motif logo; motif IC = total height of letters]
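The statement that motif IC equals the total height of the letters in the logo can be made concrete with a small calculation. The following is a minimal sketch, assuming a uniform background (0.25 per base) and no small-sample correction; the function name `information_content` and the toy sites are hypothetical, and this is not MEME's search procedure or its significance calculation.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def information_content(sites, background=None):
    """Total information content (bits) of aligned, equal-width sites.

    Each column contributes sum_b f_b * log2(f_b / bg_b), which is the
    stack height of that column in a sequence logo (uniform background,
    no small-sample correction; both are assumptions of this sketch).
    """
    if background is None:
        background = {b: 0.25 for b in ALPHABET}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for base in ALPHABET:
            freq = counts[base] / len(sites)
            if freq > 0:  # 0 * log(0) is taken as 0
                total += freq * math.log2(freq / background[base])
    return total

# Toy aligned sites (made up for illustration, not taken from the talk).
sites = ["CCACACCC", "CCCCACCC", "CCACGCCC", "CCCCGCCC"]
ic = information_content(sites)
print(f"motif IC = {ic:.1f} bits")
```

Six columns here are fully conserved (2 bits each) and two columns are split evenly between two bases (1 bit each), so the total is 14 bits.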

Discriminative Motif Discovery: DREME

• Searches for novel regular expression motif with most significant enrichment of sites in positive sequences.

• Null model: the probability of a site is the same in the two sets of sequences.

• Test: Fisher’s Exact Test on P and N (number of sequences with ≥ 1 site)


[Figure: 100-bp shuffled regions; P = 5, N = 3; motif regular expression: CCMRCCC]
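The Fisher's Exact Test on P and N can be sketched as follows. This is an illustrative sketch only, not DREME's implementation: it scans a single strand, the helper names (`iupac_to_regex`, `count_hits`, `fisher_exact_greater`) are hypothetical, and the set sizes of 6 positive and 6 negative sequences are assumed, since the talk gives only P = 5 and N = 3.

```python
import math
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Compile an IUPAC motif such as CCMRCCC into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in motif))

def count_hits(pattern, seqs):
    """Number of sequences with at least one site (single strand only)."""
    return sum(1 for s in seqs if pattern.search(s))

def fisher_exact_greater(p_hits, p_total, n_hits, n_total):
    """One-sided Fisher's exact p-value via the hypergeometric tail:
    probability of seeing >= p_hits positive sequences with a site,
    given the row and column totals of the 2x2 table."""
    hits = p_hits + n_hits
    total = p_total + n_total
    denom = math.comb(total, p_total)
    upper = min(hits, p_total)
    tail = sum(
        math.comb(hits, k) * math.comb(total - hits, p_total - k)
        for k in range(p_hits, upper + 1)
    )
    return tail / denom

pattern = iupac_to_regex("CCMRCCC")

# Toy sequences (made up) to show the counting step.
toy_pos = ["TTCCAACCCTT", "GGCCCGCCCAA", "ACGTACGTACG", "CCAGCCCTTTT"]
toy_neg = ["TTTTTTTTTTT", "ACGTACGTACG", "GGCCCACCCGG"]
toy_P = count_hits(pattern, toy_pos)
toy_N = count_hits(pattern, toy_neg)

# P = 5 and N = 3 as in the figure; totals of 6 and 6 are assumed.
p_value = fisher_exact_greater(5, 6, 3, 6)
print(pattern.pattern, toy_P, toy_N, round(p_value, 4))
```

With these assumed totals the difference between 5 of 6 and 3 of 6 is far from significant. An enrichment of this size only becomes significant with many more sequences.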

Central Motif Enrichment Analysis: CentriMo

• Searches for the known motif whose best sites are most centrally enriched in the ChIP-seq regions.

• Null model: best sites are uniformly distributed within the regions.

• Test: Binomial(S, T, w/L), where S = number of “successes”, T = number of “trials”, w = width of the central window and L = length of each region.

[Figure: “site-probability” curve; x-axis: position of best site; y-axis: probability. Example shown: w = 120, L = 500, S = 4, T = 5]
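The Binomial(S, T, w/L) test can be worked through with the numbers from the figure (w = 120, L = 500, S = 4, T = 5). This is a simplified sketch, not CentriMo's code: the function names and the toy best-site positions are hypothetical, and it uses one fixed window width, whereas the widths reported later in the talk differ from motif to motif.

```python
import math

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def central_enrichment(best_site_positions, region_length, window_width):
    """Count best sites inside the central window and test the count
    against the uniform null, where the per-site success probability
    is window_width / region_length."""
    center = region_length / 2
    half = window_width / 2
    successes = sum(
        1 for pos in best_site_positions if abs(pos - center) <= half
    )
    trials = len(best_site_positions)
    p_null = window_width / region_length
    return successes, trials, binomial_tail(successes, trials, p_null)

# Toy positions of the best site in five 500-bp regions (made up so that
# four of them fall inside the central 120-bp window, as in the figure).
positions = [250, 230, 300, 195, 40]
S, T, p_value = central_enrichment(positions, region_length=500, window_width=120)
print(S, T, round(p_value, 4))
```

With only five regions even 4 of 5 central sites give a modest p-value of about 0.013. The very small p-values reported for real datasets come from hundreds or thousands of regions.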

Motif Spacing Analysis: SpaMo

• Searches for known motifs whose best sites have a preferred spacing with the primary motif.

1. Align regions on best primary site.

2. Predict best secondary site.

3. Compute enrichment at each possible spacing.

• Null model: uniform

• Test: Binomial

[Figure: 300-bp regions centered on the primary site]
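The spacing test can be sketched as follows, assuming the regions have already been aligned on the best primary site and the best secondary site has been located in each. This is a rough illustration, not SpaMo's implementation: the function names and the toy spacings are hypothetical, and strand, orientation and multiple-testing correction are all ignored.

```python
import math
from collections import Counter

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

def spacing_enrichment(spacings, max_spacing):
    """Binomial p-value for each observed spacing.

    spacings: gap in bp between the primary site and the best secondary
    site in each region (one value per region, 0..max_spacing).
    Under the uniform null every spacing is equally likely, so the
    success probability for any one spacing is 1 / (max_spacing + 1).
    """
    counts = Counter(spacings)
    trials = len(spacings)
    p_null = 1 / (max_spacing + 1)
    return {
        gap: binomial_tail(count, trials, p_null)
        for gap, count in counts.items()
    }

# Toy spacings from eight regions (made up): a gap of 3 bp recurs.
spacings = [3, 3, 3, 3, 7, 12, 3, 20]
pvalues = spacing_enrichment(spacings, max_spacing=29)
best_gap = min(pvalues, key=pvalues.get)
print(best_gap, pvalues[best_gap])
```

Because every possible spacing is tested, a real analysis would also need to correct the resulting p-values for multiple testing.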

Case Study 1: KLF1

Did my ChIP-seq work?

Knowing when TF ChIP-seq fails

1) KLF1 ChIP: Tallack et al., Genome Research, 2010.

• The top MEME and DREME motifs confirm the in vitro KLF-family motif.

[Figure labels: UniPROBE Klf7_primary motif; MEME KLF1 motif; DREME KLF1 motif]

2) KLF1 ChIP: Other published data.

• The best DREME motif only approximates the KLF-family motif. MEME finds no similar motif.

[Figure label: UniPROBE Klf7_primary motif]

TF motif databases are an invaluable resource

KLF-family motifs are nearly identical

Strong Evidence of Failure: Central Motif Enrichment

1) Tallack KLF1 data: p = 10^-66

2) Other KLF1 data: p = 0.7

[Figure: central enrichment of the KLF7 in vitro motif in each dataset]

CentriMo Analysis of Tallack KLF1 data

KLF4: W = 111, P = 10^-66

KLF7: W = 103, P = 10^-54

GATA1/SCL: W = 177, P = 10^-48

GATA1: W = 194, P = 10^-40

Are KLF1, GATA and GATA/SCL motifs the most centrally enriched motifs in KLF1 peak regions?

1. Tallack KLF1 data – yes.

Top four centrally enriched motifs in JASPAR+UniPROBE (862 motifs): Klf4, Klf7, GATA/SCL, GATA

2. Other KLF1 data – no.

[Figure panels: 1. Tallack KLF1 data; 2. Other KLF1 data. Labeled motif: KLF4]

KLF1 summary

• Enrichment of the known KLF-family motif(s), and of the motifs of known co-factors, is strong evidence of a successful TF ChIP-seq experiment.

• Perform motif-based analysis on TF ChIP-seq before publishing!

Case Study 2: NFI

How does my TF bind?

Nuclear Factor I

• Martynoga et al. (2013) ChIP-ed NFI in proliferating and quiescent mouse neural stem cells.

• NFIA, NFIB, NFIC and NFIX.

• NFI is thought to bind as a dimer.

Enriched motifs in NFI peaks in proliferating neural stem cells

Does NFI bind as a monomer in neural stem cells?

Enrichment suggests NFIX often binds as a monomer

MEME

MEME

Half-site spacing enriched at multiples of 10 bp

Dimeric sites are twice as common in embryonic fibroblasts

[Figure panels: mNSC ChIP; MEF ChIP]

E-boxes are highly enriched near NFI peaks

The most enriched E-box is also enriched in quiescent neural stem cells

[Figure panels: proliferating; quiescent]

One E-box is not enriched in quiescent neural stem cells

[Figure panels: proliferating; quiescent]

Differentially enriched motif could bind OLIG or NEUROG/D

NFI summary

• Motif-based analysis sheds light on how TFs bind.

– Unlike other NFIs, NFIX often binds as a monomer.

– NFI binding is less associated with binding of OLIG or NEUROG/D factors in quiescent than in proliferating neural stem cells.

Conclusions

• Motif-based TF ChIP-seq analysis is highly useful for:

– Quality control

– Understanding DNA-binding affinity

– Understanding regulatory mechanisms

Acknowledgements

The MEME Suite: William Noble, James Johnson, Charles Grant, Martin Frith, Philip Machanick, Tom Whitington, Shobhit Gupta, Tom Lesluyes, Benjamin Dartigues

KLF Project: Michael Tallack, Tom Whitington, Andrew Perkins, Sean Grimmond, Brooke Gardiner, Ehsan Nourbakhsh, Nicole Cloonan, Elanor Wainwright, Janelle Keys, Wai Shan Yuen

http://meme.nbcr.net

Transcription Factors

• Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins.

• These proteins control transcription in two main ways:

– Directly, by promoting (or preventing) the assembly of the pre-initiation complex.

– Indirectly, by modifying chromatin.

ChIP-seq

• Chromatin ImmunoPrecipitation followed by high-throughput sequencing.

• TF binding sites (“punctate peaks”)

• Chromatin mods (“broad peaks”)

KLF1 is a key transcription factor in blood cell development

• We performed KLF1 ChIP-seq in mouse fetal liver cells and analyzed the resulting 945 peak regions using the MEME Suite [Tallack et al., Genome Research, 2010].

• We confirmed

– the in vitro binding motif of KLF1,

– several co-factor TFs, and

– a co-factor complex.

Pooled 4 livers (~80 × 10^6 cells)

Positive: ChIP (αKLF1 Rabbit Polyclonal Ab)

Control: Input DNA

A second KLF1 ChIP-seq experiment

• Pilon et al. (Blood, 2011) also performed KLF1 ChIP-seq in mouse fetal liver cells.

• They predicted over 13,000 peak regions.

• We reanalyzed their data using the MEME Suite.

• This second ChIP-seq dataset gives very different results.

Do the Pilon KLF1 ChIP-seq regions contain KLF1 co-factor sites?

GATA1 and SCL are important KLF1 regulatory co-factors

MEME GATA-SCL motif found in Tallack KLF1 data

Known GATA-SCL motif (JASPAR database)

• GATA1 and SCL bind DNA in a protein complex [Wadman et al., EMBO Journal, 1997].

1. Tallack KLF1 data – MEME finds the complex motif.

2. Pilon KLF1 data – MEME does not find the motif.

Is motif discovery failing?

• To check this, we use CentriMo to search for any motifs in the JASPAR+UniPROBE motif database that are centrally enriched in the two KLF1 ChIP-seq datasets.

Caveats in ChIP-seq Motif Analysis

• Peak regions may contain other TF motifs due to looping.

• The binding of the ChIP-ed factor “X” may be indirect.

• ChIP-ed motif might be weak due to assisted binding.

Farnham, Nature Reviews Genetics, 2009

MEME motif is E-box with adjacent NFI half-site

NFI half-site

Differential central enrichment
