Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.


Transcriptional Control (I)

Transcriptional Control (II)

Protein-DNA binding sites

• Binding sites usually consist of 5-12 bases (up to 30 bp)

• The binding-site sequence preference of a protein factor is not exact. It may be represented as a weight matrix

[Figure: transcription factors bound to a binding site within the sequence AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA]
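To make the weight-matrix idea concrete, here is a minimal sketch that scores every window of the example sequence against a small matrix. The count matrix `COUNTS`, the pseudocount, and the uniform background are all hypothetical values chosen for illustration; they are not taken from the slides or from any real factor.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
COUNTS = [
    {"A": 1, "C": 8, "G": 0, "T": 1},
    {"A": 9, "C": 0, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]

def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Turn counts into log2 odds against a uniform background (an assumption)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix

def best_hit(sequence, matrix):
    """Return (start, score) of the highest-scoring window on the given strand."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

SEQUENCE = "AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA"
PWM = log_odds_matrix(COUNTS)
start, score = best_hit(SEQUENCE, PWM)
print(start, SEQUENCE[start:start + len(PWM)], round(score, 2))
```

Because the preference is not exact, windows that differ from the consensus still receive a score, only a lower one. A real scan would also score the reverse strand.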

Question

• Can we identify where the transcription factors bind on the genome?

• Can we identify the binding motifs of the transcription factors?

Technology: ChIP experiment

• Chromatin immunoprecipitation (ChIP) experiment
– Detects the interaction between a protein (transcription factor) and DNA.

Technology: ChIP-seq

Sonication + ChIP

ChIP-sequencing + mapping to reference genome

Peak detection

Noise

ChIP-seq data

Tag mapping

Peak calling (CCAT)

Motif scanning (CentDist)

CCAT: A peak finding method

ChIP-seq peak finders

• ChIP-seq is becoming the mainstream approach for genome-wide studies of protein-DNA interactions, histone modifications and DNA methylation patterns.

• Many tools have been proposed for ChIP-seq analysis (e.g., PeakFinder, MACS, SISSRs, PeakSeq, CisGenome)

Aim

• Contribution of CCAT:
– How to estimate noise in a ChIP-seq library?
– How to perform a more accurate FDR estimation?

• Aim:
– Show that CCAT can identify weak binding sites that cannot be discovered by existing methods.

ChIP-seq model (linear signal-noise model)

[Figure: read distributions for the noise, the signal, and our sample library, with the binding regions marked]

How to identify binding sites with the help of a control library?

Our sample library (N=27):

Control library (M=14):

The sample library has 3-fold more reads. Hence, we predict this is a binding site.

What happens if we cannot correctly estimate the noise?

We fail to identify this binding site when the control library has almost the same size as the sample library!

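How to estimate noise? (I)

• If we know the list of background regions R, the noise rate α can be estimated as

$$\alpha = \frac{\sum_{r \in R} s_r}{\sum_{r \in R} c_r} \times \frac{M}{N}$$

where s_r and c_r are the numbers of sample and control reads in region r, and N and M are the sizes of the sample and control libraries.

In this example, we estimate α = 7/14 × 28/27.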

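How to estimate noise? (II)

• Given some initial guess of the noise rate α, we can predict the list of background regions R by

$$R = \left\{ r : s_r < \alpha \, \frac{N}{M} \, c_r \right\}$$

where s_r and c_r are the numbers of sample and control reads in region r, and N and M are the sizes of the sample and control libraries.

In this example, if α = 1, the predicted background regions are the regions with #sample_reads < 27/28 × #ctrl_reads.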

How to estimate noise? (III)

• Input: ChIP library and control library

1. Set α = 1;

2. Iterate until α is stabilized
– Estimate the background regions R using the current α;
– Predict α from the regions R;
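The iteration can be sketched in a few lines. This is a sketch under stated assumptions, not CCAT's implementation: the function name, the per-region count lists, and the stopping rule (stop when the background set no longer changes) are hypothetical. One further assumption is made loudly here: regions are selected on one set of counts and α is estimated on a second, independent set (for example two halves of each library), because selecting and estimating on the same counts would push α down on every pass.

```python
def estimate_noise_rate(sample_sel, control_sel, sample_est, control_est,
                        max_iter=50):
    """Iteratively estimate the noise rate alpha from per-region read counts.

    *_sel counts pick the background regions; *_est counts (an independent
    split, which is an assumption of this sketch) estimate alpha.
    """
    n_sel, m_sel = sum(sample_sel), sum(control_sel)
    n_est, m_est = sum(sample_est), sum(control_est)
    alpha = 1.0            # step 1: initial guess
    previous = None
    for _ in range(max_iter):
        # Background regions: sample reads below alpha * (N / M) * control reads.
        background = [
            i for i, (s, c) in enumerate(zip(sample_sel, control_sel))
            if s < alpha * n_sel / m_sel * c
        ]
        if background == previous:
            break          # the region set, and hence alpha, has stabilized
        ctrl_bg = sum(control_est[i] for i in background)
        if ctrl_bg == 0:
            break          # no usable background; keep the current alpha
        samp_bg = sum(sample_est[i] for i in background)
        alpha = (samp_bg / ctrl_bg) * (m_est / n_est)
        previous = background
    return alpha

# Toy counts for four regions; the first two mimic binding sites.
alpha = estimate_noise_rate(
    sample_sel=[11, 9, 2, 3], control_sel=[6, 8, 7, 7],
    sample_est=[10, 10, 3, 4], control_est=[7, 7, 6, 8],
)
print(round(alpha, 4))
```

The toy estimation counts are chosen so that N = 27, M = 28 and the background regions hold 7 sample reads and 14 control reads, which gives the same 7/14 × 28/27 as the slide example.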

Spike-in Simulation

Convergence is fast! The noise rate converges in about 5 iterations!

The noise rate estimation is accurate. Relative error < 5%!

FDR estimation

• Given a list of candidate sites ranked by some scoring function,
– our aim is to determine the cutoff threshold such that FDR < 0.05;

• If the threshold is too loose,
– we get more noise.

• If the threshold is too stringent,
– we miss the weak peaks.

• To identify the weak peaks, we need an accurate FDR estimation

Methods for estimating FDR

• A number of methods exist for determining the cutoff.
– Binomial p-value, e.g.,
    • Benjamini-Hochberg (B-H) correction (Benjamini & Hochberg, 1995; Rozowsky et al., 2009)
    • Storey's method (Storey, 2002; Nix et al., 2008)
– Empirical p-value, e.g.,
    • eFDR (Nix et al., 2008)
    • Library swapping (Zhang et al., 2008)
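As an illustration of the binomial route, the sketch below computes a one-sided binomial p-value per region and then applies the Benjamini-Hochberg adjustment. The null model used here (each read in a region comes from the sample library with probability `p_null`) and the toy counts are assumptions made for the example; the cited tools may define their tests differently.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def benjamini_hochberg(p_values):
    """Return B-H adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy regions as (sample reads, control reads); equal library sizes assumed.
regions = [(8, 2), (5, 5), (9, 1), (3, 7)]
p_null = 0.5
p_values = [binomial_tail(s, s + c, p_null) for s, c in regions]
q_values = benjamini_hochberg(p_values)
for region, p, q in zip(regions, p_values, q_values):
    print(region, round(p, 4), round(q, 4))
```

Regions whose adjusted value falls below the chosen FDR level (0.05 in these slides) would be reported as sites.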

Is binomial p-value good?

• The observed background variation differs from that estimated by the binomial model.

• Reason:
– The wet-lab noise is not uniformly distributed in the genome.

• Binomial p-value is not good enough!

Library swapping

[Figure: N reads from the ChIP library and N reads from the control library are compared in both directions, giving a list of ChIP sites and a list of control sites; the two lists are used to determine the empirical cutoff.]
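A minimal sketch of how the empirical cutoff can be read off the two site lists, assuming each site carries a score and the empirical FDR at a cutoff is the number of control sites at or above it divided by the number of ChIP sites at or above it. The function name and the toy scores are hypothetical.

```python
def empirical_fdr_cutoff(chip_scores, control_scores, target_fdr=0.05):
    """Return (cutoff, n_chip_sites, fdr) for the loosest passing cutoff.

    chip_scores: scores of sites called with ChIP as sample, control as control.
    control_scores: scores of sites called with the two libraries swapped.
    """
    best = None
    for cutoff in sorted(set(chip_scores), reverse=True):
        n_chip = sum(1 for s in chip_scores if s >= cutoff)
        n_ctrl = sum(1 for s in control_scores if s >= cutoff)
        fdr = n_ctrl / n_chip
        if fdr < target_fdr:
            best = (cutoff, n_chip, fdr)   # keep the loosest cutoff that passes
    return best

# Toy scores: 40 ChIP sites and 7 control sites.
chip_scores = list(range(1, 41))
control_scores = [1, 2, 3, 4, 5, 6, 12]
cutoff, n_sites, fdr = empirical_fdr_cutoff(chip_scores, control_scores)
print(cutoff, n_sites, round(fdr, 3))
```

Lowering the cutoff below the returned value would admit weak peaks, but also enough control sites to push the estimated FDR past the target.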

More on library swapping

• Library swapping works well for most cases.

• However, as mentioned by Zhang et al., the estimated FDR would be biased for some cases.

• We found that the bias is due to the fact that they did not consider the noise rate.

Modified library swapping



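[Figure: as in library swapping, N reads from the ChIP library and N reads from the control library give ChIP sites and control sites, from which the empirical cutoff is determined; the modified version takes the noise rate into account.]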

Spike-in Simulation

[Figure panels: FDR estimation for Nanog; FDR estimation for H3K4me3]

Library swapping has the best FDR estimation!

Application to mESC H3K4me3 data

• ChIP library: Mikkelsen et al., 2007, Science. Control library: Chen et al., 2008, Cell.

• Normalized difference score (Nix et al., 2008, BMC Bioinfo.)

qPCR validation

Distinct chromatin features associated with strong and weak H3K4me3 sites.

FDR: CCAT 0.02; PeakSeq 0.05

Application to mESC H3K36me3 data

Comparison of 8176 novel regions to RefSeq, Ensembl, and MGC gene annotation.

Motif scanning for ChIP-seq data

Advantages of ChIP-seq

• ChIP-seq allows us to precisely map global binding sites for any TF with a validated antibody.

• It offers two advantages:
– More candidate binding sites (known as peaks)
– Higher resolution (usually the main motif is located within +/- 100 bp of the peaks)

How to find motifs in ChIP-seq data?

• Input: a set of peaks

• Select high-intensity peaks.

• For every selected peak, extract the DNA sequence in, say, the +/- 200 bp region around the peak.

• Perform motif finding on the selected DNA sequences.
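The selection and extraction steps can be sketched as follows. The peak tuple layout `(chrom, summit, intensity)`, the in-memory `genome` dictionary, and the function name are assumptions made for illustration; a real pipeline would read peaks and sequences from files.

```python
def peak_windows(peaks, genome, top_n=2, flank=200):
    """Extract +/- flank bp around the summits of the top_n strongest peaks.

    peaks: list of (chrom, summit, intensity) with 0-based summit positions.
    genome: dict mapping chromosome name to its sequence string.
    Returns a list of (chrom, start, end, sequence) with end exclusive.
    """
    strongest = sorted(peaks, key=lambda p: p[2], reverse=True)[:top_n]
    windows = []
    for chrom, summit, _intensity in strongest:
        seq = genome[chrom]
        start = max(0, summit - flank)            # clip at chromosome start
        end = min(len(seq), summit + flank + 1)   # clip at chromosome end
        windows.append((chrom, start, end, seq[start:end]))
    return windows

# Toy genome and peaks.
genome = {"chr1": "ACGT" * 300}                   # 1200 bp
peaks = [("chr1", 100, 5.0), ("chr1", 600, 50.0), ("chr1", 1150, 20.0)]
windows = peak_windows(peaks, genome)
for chrom, start, end, seq in windows:
    print(chrom, start, end, len(seq))
```

The extracted sequences are what a motif finder would then take as input.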

Applying this approach to the AR dataset in the LNCaP cell line

• LNCaP cell line, DHT-treated for 2 hr, ChIP-ed with AR antibody
– MACS reports 58788 binding sites

• Using 600 vertebrate PWMs (145 clusters) from TRANSFAC.

• Perform CEAS and Core-TF using the top 10000 sites.
– Window size: 200, 400, 1000
– For Core-TF: we try a random background and a promoter background.

• There are 7 known co-TFs of AR.

Motif Scanning Result (top 20 results)

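[Figure: known factors found in the top 20 results. CoreTF 200bp only: GATA, CEBP, NKX, OCT. Both CEAS 400bp and CoreTF 200bp: AR, FOX, NF1. Neither: ETS.]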

Detail of motif scanning result

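Rank of each motif (a dash marks a run with no reported rank):

Motif  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CEAS    CEAS    CEAS
       prombg   prombg   prombg   randbg   randbg   randbg
       200      400      1000     200      400      1000     200     400     1000
AR     2        2        6        1        1        1        2       2       1
CEBP   12       16       25       20       15       7        -       -       -
ETS    64       61       66       37       37       47       -       -       -
FOX    1        1        1        2        2        2        1       1       2
GATA   10       13       12       16       12       14       -       -       -
NF1    40       60       70       10       21       31       3       3       -
NKX    11       5        2        12       4        3        -       -       -
OCT    4        8        5        15       19       26       -       -       -
AUC    0.91     0.8917   0.8742   0.9358   0.9375   0.9208   0.6854  0.6875  0.625

AP4: ranks 65 and 70 are reported for two of these runs (the columns cannot be recovered from the slide).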

ChIPed motif shows center enrichment around AR peaks

• Due to the ChIP-seq protocol, we expect the correct motif to show a center enrichment in the frequency graph.

• We assume noise like CG bias is uniformly distributed. If the motif is not real, the 1st derivative of its frequency graph (the velocity graph) will be near zero.

• The frequency graph below shows that AR has center enrichment, while the velocity graph shows that AR is not noise.

[Figure panels: AR motif distribution around AR peaks; velocity distribution for the AR motif]
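A small sketch of the two graphs, assuming motif hits are given as signed distances (in bp) from the peak summit. The binning by absolute distance, the bin size, and the toy offsets are choices made for this example only; the slides do not specify how the graphs were computed.

```python
def frequency_graph(offsets, max_dist=500, bin_size=50):
    """Count motif hits in bins of absolute distance from the peak summit."""
    n_bins = max_dist // bin_size
    counts = [0] * n_bins
    for d in offsets:
        b = abs(d) // bin_size
        if b < n_bins:
            counts[b] += 1
    return counts

def velocity(counts):
    """First difference of the frequency graph (change between adjacent bins)."""
    return [b - a for a, b in zip(counts, counts[1:])]

# Toy data: hits concentrated near the summit versus evenly spread hits.
enriched = [-5, 3, 10, -20, 15, 30, -40, 8, 60, -75, 120, -210, 330, 480]
uniform = list(range(-495, 500, 10))

print(frequency_graph(enriched), velocity(frequency_graph(enriched)))
print(frequency_graph(uniform), velocity(frequency_graph(uniform)))
```

For the evenly spread hits every bin holds the same count, so the velocity is zero everywhere, while the center-enriched hits give a strongly negative velocity in the first bin.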

Co-motifs show center enrichment around peaks

• Since co-regulating factors are expected to co-occur in close proximity,
– we expect co-motifs to also show center enrichment around peaks.

• For example, NF1 is a known co-motif of AR.

• We observe center enrichment of the NF1 motif around the AR peaks.

Center distribution score

• We define a score function based on the frequency graph and the velocity graph.

• Features:
– We don't require a background model.
– We learn the window size automatically.
– We learn the PWM score cutoff.

Automatically learn the parameters of the frequency graph

• V$AR_02

[Figure: frequency graph for V$AR_02 with the initial window size and the cutoff velocity marked; bar chart of Z-score (axis 0 to 180) for RandomBG, PromoterBG, and CENTDIST]

CENTDIST workflow

ChIP-seq peaks

Extracting sequences

Transfac DB

Motif scanning

Distribution analysis

Output: ranked by center distribution score

CENTDIST

• Based on the center enrichment of the TFs relative to the peak, we derive a method, CENTDIST.

• CENTDIST measures the center enrichment based on a Z-score.

• Then, the ranked TFs are reported.
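The slides do not give the CENTDIST score itself, so the sketch below is only a stand-in for the idea of a Z-score for center enrichment: it compares the number of motif hits inside a central window with what a uniform spread of hits would give, using a binomial approximation. The function name, the window sizes, and the toy offsets are all hypothetical.

```python
import math

def center_enrichment_z(offsets, window, max_dist):
    """Z-score of hits within +/- window bp, against a uniform null.

    offsets: signed distances of motif hits from the peak summit, all
    assumed to lie within +/- max_dist.
    """
    n = len(offsets)
    inside = sum(1 for d in offsets if abs(d) <= window)
    p = window / max_dist                   # expected fraction if uniform
    expected = n * p
    sd = math.sqrt(n * p * (1 - p))
    return (inside - expected) / sd

# Toy motifs: one center-enriched, one spread evenly over +/- 500 bp.
hits = {
    "center_enriched": [0] * 60 + [300] * 40,
    "evenly_spread": list(range(-495, 500, 10)),
}
scores = {name: center_enrichment_z(d, window=100, max_dist=500)
          for name, d in hits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, {name: round(z, 2) for name, z in scores.items()})
```

No background sequence set is needed for this kind of score, since the null comes from the positions of the hits themselves.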

Can CENTDIST find known co-motifs of AR?

• All known co-motifs of AR show good center enrichment.

CENTDIST vs CEAS vs CORE_TF

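Rank of each motif (a dash marks a run with no reported rank):

Motif  CENTDIST  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CEAS    CEAS    CEAS
                 prombg   prombg   prombg   randbg   randbg   randbg
                 200      400      1000     200      400      1000     200     400     1000
AR     1         2        2        6        1        1        1        2       2       1
CEBP   14        12       16       25       20       15       7        -       -       -
ETS    9         64       61       66       37       37       47       -       -       -
FOX    2         1        1        1        2        2        2        1       1       2
GATA   10        10       13       12       16       12       14       -       -       -
NF1    11        40       60       70       10       21       31       3       3       -
NKX    8         11       5        2        12       4        3        -       -       -
OCT    19        4        8        5        15       19       26       -       -       -
AUC    0.9683    0.91     0.8917   0.8742   0.9358   0.9375   0.9208   0.6854  0.6875  0.625

AP4: rank 21 in CENTDIST; ranks 65 and 70 are reported for two of the other runs (the columns cannot be recovered from the slide).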
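One way to turn a ranked list into an AUC is the rank-sum (Mann-Whitney) form sketched below. Two things here are assumptions rather than facts from the slides: that AR and its seven known co-factors are the positives, and that there are 150 other candidates (`n_negatives=150`). Under those assumptions the CENTDIST ranks listed above give a value that agrees with the AUC reported for CENTDIST.

```python
def auc_from_ranks(positive_ranks, n_negatives):
    """AUC from the 1-based ranks of the positives in a combined ranking."""
    ranks = sorted(positive_ranks)
    n_pos = len(ranks)
    # For the i-th best positive (0-based i), rank - (i + 1) negatives
    # were ranked ahead of it.
    misordered = sum(r - (i + 1) for i, r in enumerate(ranks))
    return 1 - misordered / (n_pos * n_negatives)

# CENTDIST ranks of AR and its known co-factors, as listed above.
CENTDIST_RANKS = {"AR": 1, "CEBP": 14, "ETS": 9, "FOX": 2,
                  "GATA": 10, "NF1": 11, "NKX": 8, "OCT": 19}

# n_negatives=150 is an assumption; the slides do not state the total.
auc = auc_from_ranks(CENTDIST_RANKS.values(), n_negatives=150)
print(round(auc, 4))
```

An AUC near 1 means the known factors sit close to the top of the ranking; a value near 0.5 means they are placed no better than at random.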

Can CENTDIST identify novel factors?

• AP4 is ranked 21 by CENTDIST.

• Core-TF and CEAS rank AP4 low, since AP4 is not highly enriched around the peaks.

Validation of AP4

[Figure: bar chart (y-axis 0 to 0.45) comparing ETH and DHT treatment for CONTROL, APBB2, GREB1, CAMKK2, TM2D1, SLC38A9, BANP, SLC43A1, PALM2, EEPP1, MAPK6, HIPK2, ACSM2A, RALYL, RGS2, ID2B, PXDN, CRADD, KLHL3, FAM153B, AFF3, DDAH1, YPEL1]

Validation of AP4

• To be unbiased, we performed an AP4 ChIP-seq experiment.

• 38% of AP4 peaks overlap with AR peaks.

[Venn diagram: 62768 AR-only peaks, 2296 shared peaks, 3786 AP4-only peaks]
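The overlap percentage can be computed from two peak lists as sketched below. The interval layout `(chrom, start, end)` with an exclusive end, the function name, and the toy peaks are assumptions for illustration. The last lines check the slide's figure, reading 2296 as the shared peaks and 3786 as the AP4-only peaks, which is also an assumption about how the three numbers map onto the diagram.

```python
def fraction_overlapping(query_peaks, reference_peaks):
    """Fraction of query peaks that overlap at least one reference peak.

    Peaks are (chrom, start, end) with end exclusive. This quadratic scan
    is fine for a sketch; large peak sets would need sorting or an index.
    """
    def overlaps(a, b):
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    hits = sum(1 for q in query_peaks
               if any(overlaps(q, r) for r in reference_peaks))
    return hits / len(query_peaks)

# Toy peak lists.
ap4_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
ar_peaks = [("chr1", 150, 300), ("chr2", 400, 500)]
toy_fraction = fraction_overlapping(ap4_peaks, ar_peaks)

# Numbers from the slide: shared peaks over all AP4 peaks.
slide_fraction = 2296 / (2296 + 3786)
print(round(toy_fraction, 3), round(slide_fraction, 3))
```

With that reading of the diagram, the slide's numbers give about 37.8%, consistent with the stated 38%.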

Validation of AP4

• We also check the microarray expression.

• The result suggests that AP4 may co-localize with AR to directly up-regulate the transcription of androgen target genes.

[Figure: bar chart of % genes up-regulated (axis 0 to 6) for genes containing AR+AP4 peaks, genes containing AR peaks only, and others]

Validation using ChIP-seq from ES cell

• CENTDIST performs better than CEAS and Core-TF in most cases (10 of the 12 factors; the exceptions are Smad1 and ESRRB).

AUC       CENTDIST   CEAS       Core-TF
Nanog     0.9647     0.7346     0.7549
Oct4      0.9133     0.825      0.7508
Sox2      0.9499     0.8765     0.6939
Stat3     0.9309     0.7492     0.7308
Smad1     0.8483     0.8803     0.7048
P300      0.9234     0.8098     0.719
KLF4      0.8432     0.6864     0.8015
ESRRB     0.8622     0.9744     0.9295
Cmyc      0.9776     0.8401     0.9237
Nmyc      0.9334     0.5235     0.9107
ZFX       0.9545     0.5373     0.9221
E2F1      0.9529     0.5349     0.9351
AVG AUC   0.921192   0.747667   0.814733

Discussion

• CentDist can find motifs which are marginally over-represented.

• CentDist can detect the window size.

• CentDist doesn't require a background model.

Acknowledgement

• Bioinformatics
– Guoliang Li
– Pramila
– Charlie Lee
– Han Xu
– Fabi
– Kuan Hon Loh
– Chang Cheng Wei
– Gao Song
– Chandana
– Rikky
– Zhang Zhi Zhou

• Sequencing
– Wei Chialin
– Handoko Lusy
– Sequencing team

• Cancer Biology
– Edwin Cheung
– Pau You Fu
