Motif-based analysis of ChIP-seq data - Timothy Bailey
Motif-based analysis of ChIP-seq data
Timothy L. Bailey AMATA
October 16, 2013
Overview of Talk
• ChIP-seq data analysis
– Why do motif-based analysis?
– MEME-ChIP
• Two case studies
– KLF1 in mouse fetal liver cells
– NFI in mouse neural stem cells
Steps in ChIP-seq
• Cross-link proteins to DNA
• Fragment chromatin
• Immunoprecipitate with antibody to protein
• Size-select and ligate
• Amplify
• Sequence
What can we learn from Transcription Factor ChIP-seq data?
• Where is the TF bound?
• What is its DNA-binding affinity?
• What genes might it regulate?
• What are its partners?
ChIP-seq Data Analysis
1. Mapping: Align the reads with the reference genome.
2. Peak Calling: Identify regions with significant read coverage.
3. Motif-based sequence analysis: Identify DNA sequence patterns in the peaks.
…
“Practical guidelines for the comprehensive analysis of ChIP-seq data”, Bailey et al., PLoS Comp Bio (in press).
Why do motif-based analysis?
• Quality control
• Understanding DNA-binding affinity
• Understanding regulatory mechanisms
PWM-based Word-based
Known motifs
Motif Discovery: MEME
• Searches for novel PWM motif with most significant information content (IC).
• Null model: the distribution of the IC of a set of sites of a given width in random sequences of a given length.
Figure: sites from 100-bp ChIP-seq regions are aligned to give the motif logo; motif IC = total height of letters.
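As a rough illustration of "motif IC = total height of letters", the sketch below aligns a few made-up sites and sums the per-column information content in bits. It assumes a uniform background (0.25 per base) and applies no pseudocounts or small-sample correction; the site sequences and function names are hypothetical, and MEME's own objective function and significance calculation are more involved than this.

```python
import math
from collections import Counter

def column_ic(column, alphabet="ACGT"):
    """Information content (bits) of one alignment column, uniform background."""
    counts = Counter(column)
    n = len(column)
    ic = math.log2(len(alphabet))  # maximum is 2 bits for DNA
    for base in alphabet:
        f = counts[base] / n
        if f > 0:
            ic += f * math.log2(f)  # subtracts the column entropy
    return ic

def motif_ic(sites):
    """Total IC of a set of aligned, equal-width sites (sum over columns)."""
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must be aligned"
    return sum(column_ic([s[i] for s in sites]) for i in range(width))

# Hypothetical aligned sites, for illustration only.
sites = ["CCACACCC", "CCCGCCCC", "CCAACCCC", "CCCACCCC"]
total_ic = motif_ic(sites)
print(f"motif IC = {total_ic:.2f} bits")
```

A fully conserved column contributes 2 bits and a column with all four bases at equal frequency contributes 0, which is what the letter heights in a logo depict.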
Discriminative Motif Discovery: DREME
• Searches for novel regular expression motif with most significant enrichment of sites in positive sequences.
• Null model: the probability of a site is the same in the two sets of sequences.
• Test: Fisher’s Exact Test on P and N (number of sequences with ≥ 1 site)
Positive sequences: 100-bp ChIP-seq regions (P = 5)
Negative sequences: 100-bp shuffled regions (N = 3)
Motif regular expression: CCMRCCC
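The counting and testing step can be sketched as follows. This is a minimal illustration, not DREME itself: it scans only the given strand, does not search for motifs, and the set sizes (8 positive and 8 negative sequences) are assumptions, since the slide gives only P = 5 and N = 3. The helper names are hypothetical.

```python
import re
from math import comb

# Standard IUPAC nucleotide codes as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Compile an IUPAC motif such as CCMRCCC into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in motif))

def count_hits(pattern, seqs):
    """Number of sequences containing at least one site."""
    return sum(1 for s in seqs if pattern.search(s))

def fisher_one_sided(p_hits, p_total, n_hits, n_total):
    """One-sided Fisher's Exact Test: probability of p_hits or more
    positive sequences with a site, given the table margins."""
    hits = p_hits + n_hits
    total = p_total + n_total
    upper = min(hits, p_total)
    tail = sum(comb(hits, k) * comb(total - hits, p_total - k)
               for k in range(p_hits, upper + 1))
    return tail / comb(total, p_total)

pattern = iupac_to_regex("CCMRCCC")
toy_seqs = ["TTCCAACCCTT", "GGCCCGCCCAAA", "ACGTACGTACGT"]  # made-up sequences
print(count_hits(pattern, toy_seqs))

# P = 5 and N = 3 come from the slide; the totals of 8 are assumed.
p_value = fisher_one_sided(5, 8, 3, 8)
print(f"p = {p_value:.3f}")
```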
“Site-probability” curve: probability (y-axis) against position of best site (x-axis)
Central Motif Enrichment Analysis: CentriMo
• Searches for known motif whose best sites are most centrally enriched in the ChIP-seq regions.
• Null model: best sites are uniformly distributed within the regions.
• Test: Binomial(S, T, W/L)
500-bp ChIP-seq regions
W = 120
L = 500
S = number of “successes” = 4
T = number of “trials” = 5
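With the numbers on this slide, the binomial test can be worked through directly. The sketch below takes the success probability to be W/L exactly as written on the slide and computes the upper tail, P(X ≥ S); it is a simplification, since CentriMo itself considers many window widths and adjusts for multiple testing, which is not shown here. The function name is hypothetical.

```python
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

W, L = 120, 500   # central window width and region length (from the slide)
S, T = 4, 5       # best sites inside the window, and regions with a best site

p_value = binomial_tail(S, T, W / L)
print(f"p = {p_value:.4f}")
```

Under the uniform null, each best site lands in the window with probability 120/500 = 0.24, so four or more of five doing so has probability of about 0.013.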
Motif Spacing Analysis: SpaMo
• Searches for known motifs whose best sites have a preferred spacing with the primary motif.
1. Align regions on best primary site.
2. Predict best secondary site.
3. Compute enrichment at each possible spacing.
• Null model: uniform
• Test: Binomial
500-bp ChIP-seq regions
300-bp centered on primary
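Step 3 above can be sketched as a per-spacing binomial test against a uniform null. This is an illustrative simplification: the spacings are made-up values, the number of possible spacings (100) is an assumption, and SpaMo's handling of strand, side and binning is not modelled. The function names are hypothetical.

```python
from collections import Counter
from math import comb

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

def spacing_enrichment(spacings, n_possible):
    """Binomial p-value for each observed spacing, assuming every one of
    n_possible spacings is equally likely under the null."""
    trials = len(spacings)
    p = 1.0 / n_possible
    counts = Counter(spacings)
    return {gap: binomial_tail(c, trials, p) for gap, c in counts.items()}

# Hypothetical spacing (bp) between the primary site and the best
# secondary site, one value per region.
spacings = [4, 4, 4, 17, 4, 52, 4, 30, 4, 88]
pvals = spacing_enrichment(spacings, n_possible=100)
best_gap = min(pvals, key=pvals.get)
print(best_gap, pvals[best_gap])
```

Because one test is run per spacing, a real analysis would also need a multiple-testing correction before calling any spacing significant.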
Case Study 1: KLF1
Did my ChIP-seq work?
Knowing when TF ChIP-seq fails
1) KLF1 ChIP: Tallack et al., Genome Research, 2010.
• The top MEME and DREME motifs confirm the in vitro KLF-family motif.
Figure: MEME KLF1 motif and DREME KLF1 motif compared with the UniPROBE Klf7_primary motif
2) KLF1 ChIP: Other published data.
• The best DREME motif only approximates the KLF-family motif. MEME finds no similar motif.
Figure: UniPROBE Klf7_primary motif
TF motif databases are an invaluable resource
KLF-family motifs are nearly identical
Strong Evidence of Failure: Central Motif Enrichment
1) Tallack KLF1 data: p = 10^-66
2) Other KLF1 data: p = 0.7
KLF7 in vitro motif
CentriMo Analysis of Tallack KLF1 data
KLF4: W = 111, p = 10^-66
GATA1/SCL: W = 177, p = 10^-48
KLF7: W = 103, p = 10^-54
GATA1: W = 194, p = 10^-40
Are KLF1, GATA and GATA/SCL motifs the most centrally enriched motifs in KLF1 peak regions?
1. Tallack KLF1 data – yes.
Top four centrally enriched motifs in JASPAR+UniPROBE (862 motifs): Klf4, Klf7, GATA/SCL, GATA
2. Other KLF1 data – no.
KLF4
KLF1 summary
• Enrichment of the known KLF-family motif(s), and of the motifs of known co-factors, is strong evidence of a successful TF ChIP-seq experiment.
• Perform motif-based analysis on TF ChIP-seq before publishing!
Case Study 2: NFI
How does my TF bind?
Nuclear Factor I
• Martynoga et al. (2013) ChIP-ed NFI in proliferating and quiescent mouse neural stem cells.
• The NFI family: NFIA, NFIB, NFIC and NFIX.
• NFI proteins are thought to bind as dimers.
Enriched motifs in NFI peaks in proliferating neural stem cells
Does NFI bind as a monomer in neural stem cells?
Enrichment suggests NFIX often binds as a monomer
MEME
Half-site spacing enriched at multiples of 10 bp
Dimeric sites are twice as common in embryonic fibroblasts
Figure panels: mNSC ChIP, MEF ChIP
E-boxes are highly enriched near NFI peaks
Most enriched E-box is also enriched in quiescent neural stem cells
Figure panels: proliferating, quiescent
One E-box not enriched in quiescent neural stem cells
Figure panels: proliferating, quiescent
Differentially enriched motif could bind OLIG or NEUROG/D
NFI summary
• Motif-based analysis sheds light on how TFs bind.
– Unlike other NFIs, NFIX often binds as a monomer.
– NFI binding is less associated with binding of OLIG or NEUROG/D factors in quiescent than in proliferating neural stem cells.
Conclusions
• Motif-based TF ChIP-seq analysis is highly useful for:
– Quality control
– Understanding DNA-binding affinity
– Understanding regulatory mechanisms
Acknowledgements
The MEME Suite
• William Noble
• James Johnson
• Charles Grant
• Martin Frith
• Philip Machanick
• Tom Whitington
• Shobhit Gupta
• Tom Lesluyes
• Benjamin Dartigues
KLF Project
• Michael Tallack
• Tom Whitington
• Andrew Perkins
• Sean Grimmond
• Brooke Gardiner
• Ehsan Nourbakhsh
• Nicole Cloonan
• Elanor Wainwright
• Janelle Keys
• Wai Shan Yuen
http://meme.nbcr.net
Transcription Factors
• Mammalian transcription is controlled (in part) by about 1400 DNA-binding transcription factor (TF) proteins.
• These proteins control transcription in two main ways:
– Directly, by promoting (or preventing) the assembly of the pre-initiation complex.
– Indirectly, by modifying chromatin.
ChIP-seq
• Chromatin ImmunoPrecipitation followed by high-throughput sequencing.
• TF binding sites (“punctate peaks”)
• Chromatin mods (“broad peaks”)
KLF1 is a key transcription factor in blood cell development
• We performed KLF1 ChIP-seq in mouse fetal liver cells and analyzed the resulting 945 peak regions using the MEME Suite [Tallack et al., Genome Research, 2010].
• We confirmed
– the in vitro binding motif of KLF1,
– several co-factor TFs, and
– a co-factor complex.
Pooled 4 livers (~80 × 10^6 cells)
Positive: ChIP (αKLF1 Rabbit Polyclonal Ab)
Control: Input DNA
A second KLF1 ChIP-seq experiment
• Pilon et al. (Blood, 2011) also performed KLF1 ChIP-seq in mouse fetal liver cells.
• They predicted over 13,000 peak regions.
• We reanalyzed their data using the MEME Suite.
• This second ChIP-seq dataset gives very different results.
Do the Pilon KLF1 ChIP-seq regions contain KLF1 co-factor sites?
GATA1 and SCL are important KLF1 regulatory co-factors
MEME GATA-SCL motif found in Tallack KLF1 data
Known GATA-SCL motif (JASPAR database)
• GATA1 and SCL bind DNA in a protein complex [Wadman et al., EMBO Journal, 1997].
1. Tallack KLF1 data – MEME finds the complex motif.
2. Pilon KLF1 data – MEME does not find the motif.
Is motif discovery failing?
• To check this, we use CentriMo to search for any motifs in the JASPAR+UniPROBE motif database that are centrally enriched in the two KLF1 ChIP-seq datasets.
Caveats in ChIP-seq Motif Analysis
• Peak regions may contain other TF motifs due to looping.
• The binding of the ChIP-ed factor “X” may be indirect.
• ChIP-ed motif might be weak due to assisted binding.
Farnham, Nature Reviews Genetics, 2009
MEME motif is E-box with adjacent NFI half-site
NFI half-site
Differential central enrichment