Post on 12-Jan-2016
Analyzing ChIP-seq data
Wing-Kin Sung
National University of Singapore
Transcriptional Control (I)
Transcriptional Control (II)
Protein-DNA binding sites
• Binding sites usually consist of 5-12 bases (up to 30 bp)
• The binding-site sequence preferences of protein factors are not exact. They may be represented as a weight matrix
AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA
Transcription factor
Binding site
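A minimal sketch of how a weight matrix can score candidate binding sites. The 4-bp frequency matrix, the uniform background and the function names are illustrative assumptions, not values from the slides.

```python
import math

# Hypothetical position frequency matrix for a 4-bp site (one dict per position).
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base composition


def log_odds(pfm, background=BACKGROUND):
    """Convert base frequencies into log2 odds against the background."""
    return [{b: math.log2(p / background) for b, p in col.items()} for col in pfm]


def scan(sequence, pwm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][sequence[i + j]] for j in range(width)))
        for i in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


if __name__ == "__main__":
    pwm = log_odds(PFM)
    print(best_hit("AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA", pwm))
```

Because the matrix stores preferences rather than one exact string, sites that differ from the consensus at one position still receive a positive, but lower, score.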
Question
• Can we identify where the transcription factors bind on the genome?
• Can we identify the binding motifs of the transcription factors?
Technology: ChIP experiment
• Chromatin immunoprecipitation experiment
– Detects the interaction between protein (transcription factor) and DNA.
Technology: ChIP-seq
Sonication + ChIP
ChIP-sequencing + mapping to reference genome
Peak detection
Noise
ChIP-seq data
Tag Mapping
Peak calling (CCAT)
Motif scanning (CentDist)
CCAT: A peak finding method
ChIP-seq peak finders
• ChIP-Seq is becoming the mainstream approach for genome-wide study of protein-DNA interactions, histone modifications and DNA methylation patterns.
• Many tools have been proposed for ChIP-Seq analysis (e.g., PeakFinder, MACS, SISSRs, PeakSeq, CisGenome)
Aim
• Contribution of CCAT:
– How to estimate noise in a ChIP-seq library?
– How to perform a more accurate FDR estimation?
• Aim:
– We hope to show that CCAT can identify weak binding sites which cannot be discovered by existing methods.
ChIP-seq model (linear signal-noise model)
Noise:
Signal:
Our sample library:
Binding regions
How to identify binding sites with the help of control library?
Our sample library (N=27):
Control library (M=14):
The sample library has 3-fold more reads. Hence, we predict that this is a binding site.
What happens if we cannot correctly estimate the noise?
Our sample library (N=27):
Control library (M=28):
When the control library has almost the same size as the sample library, we fail to identify this binding site.
How to estimate noise? (I)
• If we know the list of background regions R, the noise rate can be estimated as (#sample_reads in R / #ctrl_reads in R) x M/N
Our sample library (N=27):
Control library (M=28):
In this example, we estimate the noise rate as 7/14 x 28/27.
How to estimate noise? (II)
• Given some initial guess of the noise rate, we can predict the list of background regions R as the regions with #sample_reads < noise rate x N/M x #ctrl_reads
Our sample library (N=27):
Control library (M=28):
In this example, if the noise rate is 1, the predicted background regions are the regions with #sample_reads < 27/28 x #ctrl_reads.
How to estimate noise? (III)
• Input: ChIP library and control library
1. Set the noise rate to 1;
2. Iterate until the noise rate has stabilized:
– Estimate the background regions R;
– Re-estimate the noise rate from the regions R;
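The iteration above can be sketched as follows. This is an illustration under stated assumptions, not the CCAT implementation: the function name and arguments are hypothetical, and the sketch assumes that the counts used to choose the background regions (`sel`) are independent of the counts used to re-estimate the noise rate (`est`), for example from a random split of each library. Without such a separation, selecting regions by a threshold and averaging over the same regions would bias the estimate downward.

```python
def estimate_noise_rate(sel, est, n_sample, n_control, max_iter=50, tol=1e-9):
    """Iteratively estimate the noise rate of a ChIP library.

    sel, est: lists of (sample_reads, ctrl_reads) per region, taken from two
              independent halves of the data (an assumption of this sketch).
    n_sample, n_control: library sizes N and M.
    """
    scale = n_sample / n_control  # N/M
    rate = 1.0  # step 1: initial guess
    for _ in range(max_iter):
        # Background regions: fewer sample reads than noise alone would explain.
        background = [i for i, (s, c) in enumerate(sel) if s < rate * scale * c]
        s_bg = sum(est[i][0] for i in background)
        c_bg = sum(est[i][1] for i in background)
        if c_bg == 0:
            break  # no usable background regions; keep the current estimate
        new_rate = (s_bg / c_bg) / scale  # (#sample / #ctrl in R) x M/N
        converged = abs(new_rate - rate) < tol
        rate = new_rate
        if converged:
            break
    return rate


if __name__ == "__main__":
    # Toy regions chosen so that the background holds 7 sample and 14 control
    # reads, as in the slide example with N = 27 and M = 28.
    sel = [(1, 5), (1, 5), (1, 4), (20, 14)]
    est = [(2, 5), (3, 5), (2, 4), (20, 14)]
    print(estimate_noise_rate(sel, est, n_sample=27, n_control=28))
```

With these toy counts the background set is the same in the first two rounds, so the estimate settles at 7/14 x 28/27.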
Spike-in Simulation
Convergence is fast! The noise rate converges in about 5 iterations!
The noise rate estimation is accurate. Relative error < 5%!
FDR estimation
• Given a list of candidate sites ranked by some scoring function,
– our aim is to determine the cutoff threshold such that FDR < 0.05.
• If the threshold is too loose,
– we get more noise.
• If the threshold is too stringent,
– we miss the weak peaks.
• To identify the weak peaks, we need an accurate FDR estimation
Methods for estimating FDR
• A number of methods exist for determining the cutoff.
– Binomial p-value, e.g.,
  • Benjamini-Hochberg (B-H) correction (Benjamini & Hochberg, 1995; Rozowsky et al., 2009)
  • Storey’s method (Storey, 2002; Nix et al., 2008)
– Empirical p-value, e.g.,
  • eFDR (Nix et al., 2008)
  • Library swapping (Zhang et al., 2008)
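As a rough illustration of the binomial route, the sketch below computes a one-sided binomial p-value for a region and applies the Benjamini-Hochberg adjustment. The null model used here (each read in a region falls in the sample library with a fixed probability `p_sample`) and the function names are assumptions made for the example.

```python
from math import comb


def binomial_pvalue(sample_reads, ctrl_reads, p_sample):
    """P(X >= sample_reads) for X ~ Binomial(sample_reads + ctrl_reads, p_sample)."""
    n = sample_reads + ctrl_reads
    return sum(
        comb(n, k) * p_sample**k * (1 - p_sample) ** (n - k)
        for k in range(sample_reads, n + 1)
    )


def benjamini_hochberg(pvalues):
    """Return B-H adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    regions = [(12, 2), (5, 4), (9, 1), (3, 3)]  # (sample_reads, ctrl_reads)
    pvals = [binomial_pvalue(s, c, p_sample=0.5) for s, c in regions]
    print(benjamini_hochberg(pvals))
```

Regions whose adjusted p-value falls below 0.05 would be reported at FDR < 0.05 under this model.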
Is binomial p-value good?
• The observed background variation differs from that estimated by the binomial model.
• Reason:
– The wet lab noise is not uniformly distributed in the genome.
• Binomial p-value is not good enough!
Library swapping
[Figure: library swapping. N reads from the ChIP library and N reads from the control library are compared to give ChIP sites and control sites, from which the empirical cutoff is determined.]
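A minimal sketch of how an empirical cutoff could be derived from the two site lists. It assumes that sites called with the libraries swapped (control treated as the sample) approximate the false positives at each score threshold; the function names and the choice of the loosest passing threshold are illustrative.

```python
def empirical_fdr(chip_scores, swapped_scores, threshold):
    """Estimated FDR at a threshold: control-site count / ChIP-site count."""
    n_chip = sum(s >= threshold for s in chip_scores)
    n_swapped = sum(s >= threshold for s in swapped_scores)
    return n_swapped / n_chip if n_chip else 0.0


def loosest_cutoff(chip_scores, swapped_scores, max_fdr=0.05):
    """Lowest ChIP score whose estimated FDR stays below max_fdr, or None."""
    for t in sorted(set(chip_scores)):
        if empirical_fdr(chip_scores, swapped_scores, t) < max_fdr:
            return t
    return None


if __name__ == "__main__":
    chip = list(range(1, 41))   # scores of sites called in the ChIP library
    swapped = [1, 2, 3, 5]      # scores of sites called after swapping
    cutoff = loosest_cutoff(chip, swapped)
    print(cutoff, empirical_fdr(chip, swapped, cutoff))
```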
More on library swapping
• Library swapping works well for most cases.
• However, as mentioned by Zhang et al., the estimated FDR would be biased for some cases.
• We found that the bias is due to the fact that they did not consider the noise rate.
Modified library swapping
[Figure: modified library swapping. N reads from the ChIP library and N reads from the control library are compared to give ChIP sites and control sites, from which the empirical cutoff is determined.]
Spike-in Simulation
[Figure panels: FDR estimation for Nanog; FDR estimation for H3K4me3]
Library swapping has the best FDR estimation!
Application to mESC H3K4me3 data
• ChIP library: Mikkelson et al., 2007, Science. Control library: Chen et al., 2008, Cell.
• Normalized difference score (Nix et al., 2008, BMC Bioinfo.)
qPCR validation
Distinct chromatin features associated with strong and weak H3K4me3 sites.
FDR: CCAT 0.02; PeakSeq 0.05
Application to mESC H3K36me3 data
Comparison of 8176 novel regions to RefSeq, Ensembl, and MGC gene annotation.
Motif scanning for ChIP-seq data
Advantages of ChIP-seq
• ChIP-seq allows us to precisely map global binding sites for any TF with a validated antibody.
• It offers two advantages:
– More candidate binding sites (known as peaks)
– Higher resolution (usually the main motif is located within +/- 100bp of the peak)
How to find motifs in ChIP-seq data?
• Input: a set of peaks
• Select high-intensity peaks.
• For every selected peak, extract the DNA sequence in, say, the +/-200bp region around the peak.
• Perform motif finding on the selected DNA sequences.
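The first two steps can be sketched as below. The peak tuple layout, the in-memory genome dictionary and the function names are assumptions for illustration; a real pipeline would read peaks and sequences from files.

```python
def select_peak_windows(peaks, top_n, flank=200):
    """Pick the top_n peaks by intensity and return their flanking windows.

    peaks: list of (chrom, summit, intensity) tuples.
    Returns half-open (chrom, start, end) windows, clipped at position 0.
    """
    ranked = sorted(peaks, key=lambda p: p[2], reverse=True)[:top_n]
    return [
        (chrom, max(0, summit - flank), summit + flank + 1)
        for chrom, summit, _ in ranked
    ]


def extract_sequences(genome, windows):
    """genome: dict mapping chromosome name to its sequence string."""
    return [genome[chrom][start:end] for chrom, start, end in windows]


if __name__ == "__main__":
    genome = {"chr1": "ACGT" * 5}  # toy 20-bp chromosome
    peaks = [("chr1", 10, 5.0), ("chr1", 2, 9.0), ("chr1", 15, 1.0)]
    windows = select_peak_windows(peaks, top_n=2, flank=3)
    print(extract_sequences(genome, windows))
```

The extracted sequences are then the input to whichever motif finder is used.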
Applying this approach to the AR dataset in the LNCaP cell line
• LNCaP cell line, DHT treated for 2hr, ChIP-ed with AR antibody
– MACS reports 58788 binding sites
• Using 600 vertebrate PWMs (145 clusters) from TRANSFAC.
• Perform CEAS and Core-TF using the top 10000 sites.
– Window size: 200, 400, 1000
– For Core-TF: we try random background and promoter background.
• There are 7 known co-TFs of AR.
Motif Scanning Result (top 20 results)
[Figure: top 20 results for CEAS (400bp) and CoreTF (200bp); labelled motifs: GATA, CEBP, NKX, OCT, AR, FOX, NF1, ETS]
Detail of motif scanning result
Columns (left to right): CORE_TF prombg 200 | CORE_TF prombg 400 | CORE_TF prombg 1000 | CORE_TF randbg 200 | CORE_TF randbg 400 | CORE_TF randbg 1000 | CEAS 200 | CEAS 400 | CEAS 1000
AR 2 2 6 1 1 1 2 2 1
CEBP 12 16 25 20 15 7
ETS 64 61 66 37 37 47
FOX 1 1 1 2 2 2 1 1 2
GATA 10 13 12 16 12 14
NF1 40 60 70 10 21 31 3 3
NKX 11 5 2 12 4 3
OCT 4 8 5 15 19 26
AP4 65 70
AUC 0.91 0.8917 0.8742 0.9358 0.9375 0.9208 0.6854 0.6875 0.625
ChIPed motif shows center enrichment around AR peaks
• Due to the ChIP-seq protocol, we expect the correct motif to show center enrichment in the frequency graph.
• We assume noise like CG bias is uniformly distributed. If the motif is not real, its 1st derivative will be near zero.
• The frequency graph below shows that AR has center enrichment, while the velocity graph shows that AR is not noise.
AR motif distribution around AR peak Velocity distribution for AR motif
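A small sketch of the two graphs, assuming the frequency graph is a histogram of motif hits by absolute distance from the peak center and the velocity graph is its first difference. The bin size and the function names are illustrative choices.

```python
def frequency_graph(distances, bin_size, max_dist):
    """Count motif hits per bin of absolute distance from the peak center."""
    n_bins = max_dist // bin_size
    counts = [0] * n_bins
    for d in distances:
        b = abs(d) // bin_size
        if b < n_bins:
            counts[b] += 1
    return counts


def velocity_graph(freq):
    """First difference of the frequency graph (a discrete 1st derivative)."""
    return [freq[i + 1] - freq[i] for i in range(len(freq) - 1)]


if __name__ == "__main__":
    centered = [-5, 3, 8, -12, 15, 40, -70, 95]  # hits clustered at the center
    uniform = [10, 35, 60, 85]                   # hits spread evenly
    for hits in (centered, uniform):
        freq = frequency_graph(hits, bin_size=25, max_dist=100)
        print(freq, velocity_graph(freq))
```

A center-enriched motif gives a strongly negative velocity near the center, whereas evenly spread hits give a velocity near zero everywhere.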
Co-motifs show center enrichment around peaks
• Since co-regulating factors are expected to co-occur in close proximity,
– we expect co-motifs to also show center enrichment around peaks.
• For example, NF1 is a known co-motif of AR.
• We observe center enrichment of the NF1 motif around the AR peaks.
Center distribution score
• We define a score function based on the frequency graph and the velocity graph.
• Features:
– We don’t require a background model.
– We learn the window size automatically.
– We learn the PWM score cutoff.
Automatically learn the parameter of the frequency graph
• V$AR_02
[Figure: frequency graph for V$AR_02, annotated with the initial window size and the cutoff velocity]
[Figure: bar chart of Z-score (axis 0 to 180) for RandomBG, PromoterBG and CENTDIST]
CENTDIST workflow
ChIP-seq peaks -> Extracting sequences -> Motif scanning (using Transfac DB) -> Distribution analysis -> Output (ranked by center distribution score)
CENTDIST
• Based on the center enrichment of the TFs relative to the peak, we derive a method, CENTDIST.
• CENTDIST measures the center enrichment based on a Z-score.
• Then, the ranked TFs are reported.
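One simple way to turn center enrichment into a Z-score is sketched below. This is not the CENTDIST score function, which the slides do not spell out: the sketch assumes a fixed window and a uniform null in which each hit within +/- `max_dist` falls inside the window with probability `window / max_dist`, and uses the normal approximation to the binomial. All names are hypothetical.

```python
import math


def center_enrichment_z(distances, window, max_dist):
    """Z-score of the number of motif hits within `window` of the peak center.

    distances: signed distances of motif hits from the peak center.
    Requires 0 < window < max_dist.
    """
    hits = [d for d in distances if abs(d) <= max_dist]
    n = len(hits)
    if n == 0:
        return 0.0
    k = sum(abs(d) <= window for d in hits)
    p = window / max_dist  # expected fraction inside the window if uniform
    return (k - n * p) / math.sqrt(n * p * (1 - p))


def rank_motifs(hits_by_motif, window, max_dist):
    """Rank motif names by decreasing center-enrichment Z-score."""
    scores = {
        name: center_enrichment_z(d, window, max_dist)
        for name, d in hits_by_motif.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    hits = {
        "enriched": [10] * 40 + [60] * 60,  # toy motif, 40% of hits near center
        "flat": [10] * 25 + [60] * 75,      # toy motif matching the null
    }
    print(rank_motifs(hits, window=25, max_dist=100))
```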
Can CENTDIST find known co-motifs of AR?
• All known co-motifs of AR show good center enrichment.
CENTDIST vs CEAS vs CORE_TF
Columns (left to right): CENTDIST | CORE_TF prombg 200 | CORE_TF prombg 400 | CORE_TF prombg 1000 | CORE_TF randbg 200 | CORE_TF randbg 400 | CORE_TF randbg 1000 | CEAS 200 | CEAS 400 | CEAS 1000
AR 1 2 2 6 1 1 1 2 2 1
CEBP 14 12 16 25 20 15 7
ETS 9 64 61 66 37 37 47
FOX 2 1 1 1 2 2 2 1 1 2
GATA 10 10 13 12 16 12 14
NF1 11 40 60 70 10 21 31 3 3
NKX 8 11 5 2 12 4 3
OCT 19 4 8 5 15 19 26
AP4 21 65 70
AUC 0.9683 0.91 0.8917 0.8742 0.9358 0.9375 0.9208 0.6854 0.6875 0.625
Can CENTDIST identify a novel factor?
• AP4 is ranked 21 by CENTDIST.
• Core-TF and CEAS rank AP4 low, since AP4 is not highly enriched around the peaks.
Validation of AP4
[Figure: bar chart (y-axis 0 to 0.45) comparing ETH and DHT treatment for CONTROL, APBB2, GREB1, CAMKK2, TM2D1, SLC38A9, BANP, SLC43A1, PALM2, EEPP1, MAPK6, HIPK2, ACSM2A, RALYL, RGS2, ID2B, PXDN, CRADD, KLHL3, FAM153B, AFF3, DDAH1, YPEL1]
Validation of AP4
• To be unbiased, we performed an AP4 ChIP-seq experiment.
• 38% of AP4 peaks overlap with AR peaks.
[Venn diagram: AR only 62768; AR and AP4 overlap 2296; AP4 only 3786]
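A minimal sketch of how such an overlap fraction could be computed from two peak lists. The interval representation and the function name are assumptions; the last lines only redo the arithmetic with the counts shown on the slide.

```python
def count_overlapping(query, reference):
    """Number of query intervals that overlap at least one reference interval.

    Intervals are (chrom, start, end) tuples with half-open coordinates.
    """
    by_chrom = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))
    return sum(
        any(s < end and start < e for s, e in by_chrom.get(chrom, []))
        for chrom, start, end in query
    )


if __name__ == "__main__":
    ap4 = [("chr1", 10, 20), ("chr1", 50, 60), ("chr2", 10, 20)]  # toy peaks
    ar = [("chr1", 15, 25), ("chr2", 100, 200)]                   # toy peaks
    print(count_overlapping(ap4, ar) / len(ap4))

    # Counts from the slide: 2296 shared AP4 peaks, 3786 AP4-only peaks.
    print(round(2296 / (2296 + 3786), 2))
```

The pairwise comparison is quadratic in the number of peaks; for genome-scale lists a sorted sweep or an interval tree would be the usual choice.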
Validation of AP4
• We also check the microarray expression.
• The result suggests that AP4 may co-localize with AR to directly up-regulate the transcription of androgen target genes.
[Figure: bar chart of % genes up-regulated (axis 0 to 6) for genes containing AR+AP4 peaks, genes containing AR peaks only, and others]
Validation using ChIP-seq from ES cell
• CENTDIST performs better than CEAS and Core-TF for most cases.
           CENTDIST   CEAS       Core-TF
Nanog      0.9647     0.7346     0.7549
Oct4       0.9133     0.825      0.7508
Sox2       0.9499     0.8765     0.6939
Stat3      0.9309     0.7492     0.7308
Smad1      0.8483     0.8803     0.7048
P300       0.9234     0.8098     0.719
KLF4       0.8432     0.6864     0.8015
ESRRB      0.8622     0.9744     0.9295
Cmyc       0.9776     0.8401     0.9237
Nmyc       0.9334     0.5235     0.9107
ZFX        0.9545     0.5373     0.9221
E2F1       0.9529     0.5349     0.9351
AVG AUC    0.921192   0.747667   0.814733
Discussion
• CentDist can find motifs which are marginally over-represented.
• CentDist can detect the window size
• CentDist doesn’t require a background model
Acknowledgement
• Bioinformatics
– Guoliang Li
– Pramila
– Charlie Lee
– Han Xu
– Fabi
– Kuan Hon Loh
– Chang Cheng Wei
– Gao Song
– Chandana
– Rikky
– Zhang Zhi Zhou
• Sequencing
– Wei Chialin
– Handoko Lusy
– Sequencing team
• Cancer Biology
– Edwin Cheung
– Pau You Fu