Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.


Page 1: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Analyzing ChIP-seq data

Wing-Kin Sung

National University of Singapore

Page 2: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Transcriptional Control (I)

Page 3: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Transcriptional Control (II)

Page 4: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Protein-DNA binding sites

• Binding sites usually consist of 5-12 bases (up to 30 bp)

• The binding site sequence preferences of protein factors are not exact. They may be represented as a weight matrix

AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA

[Figure: transcription factors bound to binding sites on the DNA sequence above]
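A weight matrix of the kind mentioned on this slide can be sketched as follows. This is only an illustration: the aligned sites, the pseudocount and the uniform background of 0.25 per base are assumptions, not values from the slides.

```python
import math
from collections import Counter

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        # Log-odds of each base at position i against a uniform background.
        pwm.append({b: math.log2((counts[b] + pseudocount) / total / background)
                    for b in "ACGT"})
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds for a sequence as long as the matrix."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Hypothetical aligned binding sites, for illustration only.
sites = ["CACGTG", "CACGTG", "CACGTT", "CATGTG"]
pwm = build_pwm(sites)
print(score(pwm, "CACGTG"), score(pwm, "TTTTTT"))
```

A sequence close to the aligned sites scores higher than an unrelated one, which is how inexact sequence preferences can be captured.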

Page 5: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Question

• Can we identify where the transcription factors bind on the genome?

• Can we identify the binding motifs of the transcription factors?

Page 6: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Technology: ChIP experiment

• Chromatin immunoprecipitation experiment
– Detect the interaction between protein (transcription factor) and DNA.

Page 7: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Technology: ChIP-seq

Sonication + ChIP

ChIP-sequencing + mapping to reference genome

Peak detection

Noise

Page 8: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

ChIP-seq data

Tag Mapping

Peak calling (CCAT)

Motif scanning (CentDist)

Page 9: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

CCAT: A peak finding method

Page 10: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

ChIP-seq peak finders

• ChIP-seq is becoming the mainstream approach for genome-wide study of protein-DNA interactions, histone modifications and DNA methylation patterns.

• Many tools have been proposed for ChIP-seq analysis (e.g., PeakFinder, MACS, SISSRs, PeakSeq, CisGenome)

Page 11: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Aim

• Contribution of CCAT:
– How to estimate noise in a ChIP-seq library?
– How to perform a more accurate FDR estimation?

• Aim:
– Show that CCAT can identify weak binding sites which cannot be discovered by existing methods.

Page 12: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

ChIP-seq model (Linear signal-noise model)

Noise:

Signal:

Our sample library:

Binding regions

Page 13: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

How to identify binding sites with the help of control library?

Our sample library (N=27):

Control library (M=14):

The sample library has 3-fold more reads. Hence, we predict this is a binding site.

Page 14: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

What happens if we cannot correctly estimate the noise?

Our sample library (N=27):

Control library (M=28):

We fail to identify this binding site when the control library has almost the same size as the sample library!

Page 15: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

How to estimate noise? (I)

• If we know the list of background regions R, the noise rate can be estimated as (#sample_reads in R / #ctrl_reads in R) x (M / N).

Our sample library (N=27):

Control library (M=28):

In this example, we estimate the noise rate = 7/14 x 28/27.

Page 16: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

How to estimate noise? (II)

• Given some initial guess of the noise rate, we can predict the list of background regions R by selecting the regions with #sample_reads < noise rate x (N / M) x #ctrl_reads.

Our sample library (N=27):

Control library (M=28):

In this example, if the noise rate = 1, the predicted background regions are the regions with #sample_reads < 27/28 #ctrl_reads.
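The two steps in parts (I) and (II) can be sketched as below. The formulas are inferred from the worked examples on the slides (7/14 x 28/27 and 27/28 #ctrl_reads), and the per-region read counts are hypothetical values chosen only so that the totals match N=27 and M=28. This is a sketch of one pass, not the CCAT implementation.

```python
from fractions import Fraction

def predict_background(sample, control, noise_rate):
    """Regions with #sample_reads < noise_rate * (N / M) * #ctrl_reads."""
    n, m = sum(sample), sum(control)
    scale = Fraction(noise_rate) * Fraction(n, m)
    return [i for i, (s, c) in enumerate(zip(sample, control)) if s < scale * c]

def estimate_noise_rate(sample, control, background):
    """(#sample_reads in R / #ctrl_reads in R) * (M / N) over regions R."""
    n, m = sum(sample), sum(control)
    s_bg = sum(sample[i] for i in background)
    c_bg = sum(control[i] for i in background)
    if c_bg == 0:
        raise ValueError("no control reads in the background regions")
    return Fraction(s_bg, c_bg) * Fraction(m, n)

# Hypothetical read counts per region (N = 27, M = 28).
sample = [10, 10, 3, 2, 2]
control = [7, 7, 6, 4, 4]

background = predict_background(sample, control, noise_rate=1)
noise_rate = estimate_noise_rate(sample, control, background)
print(background, noise_rate)  # [2, 3, 4] 14/27
```

Exact fractions are used so that the comparison against the threshold is not affected by floating-point rounding; 14/27 equals 7/14 x 28/27.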

Page 17: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

How to estimate noise? (III)

• Input: ChIP library and control library

1. Set the noise rate = 1;

2. Iterate until the noise rate has stabilized
– Estimate the background regions R;
– Predict the noise rate from the regions R;

Page 18: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Spike-in Simulation

Convergence is fast! The noise rate converges in about 5 iterations!

The noise rate estimation is accurate. Relative error < 5%!

Page 19: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

FDR estimation

• Given a list of candidate sites ranked by some scoring function,
– our aim is to determine the cutoff threshold such that FDR < 0.05;

• If the threshold is too loose,
– we get more noise.

• If the threshold is too stringent,
– we miss the weak peaks.

• To identify the weak peaks, we need an accurate FDR estimation

Page 20: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Methods for estimating FDR

• A number of methods for determining the cutoff.

– Binomial p-value, e.g.,
   • Benjamini-Hochberg (B-H) correction (Benjamini & Hochberg, 1995; Rozowsky et al., 2009)
   • Storey’s method (Storey, 2002; Nix et al., 2008)

– Empirical p-value, e.g.,
   • eFDR (Nix et al., 2008)
   • Library swapping (Zhang et al., 2008)
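As an illustration of the binomial p-value route, the sketch below computes an upper-tail binomial p-value per region and applies the Benjamini-Hochberg step-up procedure. The region counts are hypothetical, and the null probability of 0.5 assumes equally sized sample and control libraries; neither comes from the slides.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def benjamini_hochberg(pvalues, fdr=0.05):
    """Indices of hypotheses rejected by the B-H step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its B-H threshold.
        if pvalues[i] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical (sample reads, control reads) per candidate region.
regions = [(20, 2), (9, 8), (15, 3), (5, 6)]
pvalues = [binomial_upper_tail(s, s + c, 0.5) for s, c in regions]
significant = benjamini_hochberg(pvalues, fdr=0.05)
print(significant)  # [0, 2]
```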

Page 21: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Is binomial p-value good?

• Observed background variation is different from the estimation from the binomial model.

• Reason:
– The wet lab noise is not uniformly distributed in the genome.

• Binomial p-value is not good enough!

Page 22: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Library swapping

[Figure: library swapping. N reads from the ChIP library and N reads from the control library are used to call ChIP sites and control sites, from which the empirical cutoff is determined.]
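One common way to turn ChIP sites and control sites into an empirical cutoff is sketched below: at a given score threshold, the empirical FDR is taken as the number of control sites divided by the number of ChIP sites passing that threshold. The scores are hypothetical, and the function names are illustrative rather than taken from any tool.

```python
def empirical_fdr(chip_scores, control_scores, threshold):
    """Control sites passing the threshold divided by ChIP sites passing it."""
    chip_hits = sum(s >= threshold for s in chip_scores)
    control_hits = sum(s >= threshold for s in control_scores)
    return control_hits / chip_hits if chip_hits else 0.0

def loosest_cutoff(chip_scores, control_scores, max_fdr=0.05):
    """Lowest observed ChIP score whose empirical FDR is below max_fdr."""
    passing = [t for t in set(chip_scores)
               if empirical_fdr(chip_scores, control_scores, t) < max_fdr]
    return min(passing) if passing else None

# Hypothetical site scores: ChIP sites come from calling sample against
# control, control sites from calling control against sample (swapped).
chip_scores = [10, 9, 9, 8, 8, 8, 7, 7, 6, 5, 4, 3]
control_scores = [5, 4, 3, 3]

cutoff = loosest_cutoff(chip_scores, control_scores)
print(cutoff)  # 6
```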

Page 23: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

More on library swapping

• Library swapping works well for most cases.

• However, as mentioned by Zhang et al., the estimated FDR would be biased for some cases.

• We found that the bias is due to the fact that they did not consider the noise rate.

Page 24: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Modified library swapping

[Figure: modified library swapping. N reads from the ChIP library and N reads from the control library are used to call ChIP sites and control sites, from which the empirical cutoff is determined.]

Page 25: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Spike-in Simulation

[Figure panels: FDR estimation for Nanog; FDR estimation for H3K4me3]

Library swapping has the best FDR estimation!

Page 26: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Application to mESC H3K4me3 data

• ChIP library: Mikkelsen et al., 2007, Science. Control library: Chen et al., 2008, Cell.

• Normalized difference score (Nix et al., 2008, BMC Bioinfo.)

qPCR validation

Distinct chromatin features associated with strong and weak H3K4me3 sites.

FDR: CCAT 0.02, PeakSeq 0.05

Page 27: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Application to mESC H3K36me3 data

Comparison of 8176 novel regions to RefSeq, Ensembl, and MGC gene annotation.

Page 28: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Motif scanning for ChIP-seq data

Page 29: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Advantages of ChIP-seq

• ChIP-seq allows us to precisely map global binding sites for any TF with a validated antibody.

• It offers two advantages:
– More candidate binding sites (known as peaks)
– Higher resolution (usually the main motif is located within +/- 100bp of the peak)

Page 30: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

How to find motifs in ChIP-seq data?

• Input: a set of peaks

• Select high intensity peaks.

• For every selected peak, extract the DNA sequence in, say, the +/-200bp region around the peak.

• Perform motif finding on those selected DNA sequences.
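The selection and extraction steps above can be sketched as follows. The genome dictionary, the peak records and their field names (`chrom`, `summit`, `intensity`) are hypothetical; the extracted sequences would then be handed to whichever motif finder is used.

```python
def select_top_peaks(peaks, top_n):
    """Keep the top_n peaks with the highest intensity."""
    return sorted(peaks, key=lambda p: p["intensity"], reverse=True)[:top_n]

def extract_window(genome, chrom, summit, flank=200):
    """Return the sequence within +/- flank bp of the peak position."""
    seq = genome[chrom]
    start = max(0, summit - flank)             # clip at the chromosome start
    end = min(len(seq), summit + flank + 1)    # clip at the chromosome end
    return seq[start:end]

# Hypothetical toy genome and peak list (0-based positions).
genome = {"chr1": "A" * 300 + "CACGTG" + "T" * 300}
peaks = [
    {"chrom": "chr1", "summit": 302, "intensity": 50},
    {"chrom": "chr1", "summit": 10, "intensity": 5},
]

top = select_top_peaks(peaks, top_n=1)
sequences = [extract_window(genome, p["chrom"], p["summit"]) for p in top]
print(len(sequences[0]))  # 401
```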

Page 31: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Applying this approach to the AR dataset in the LNCaP cell line

• LNCaP cell line, DHT treated for 2hr, ChIP-ed with AR antibody
– MACS reports 58788 binding sites

• Using 600 vertebrate PWMs (145 clusters) from TRANSFAC.

• Perform CEAS and Core-TF using the top 10000 sites.
– Window size: 200, 400, 1000
– For Core-TF: we try random background and promoter background.

Page 32: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Motif Scanning Result (top 20 results)

• There are 7 known co-TFs of AR.

[Figure: top 20 motif scanning results for CEAS 400bp and CoreTF 200bp, with labels GATA, CEBP, NKX, OCT, AR, FOX, NF1, ETS]

Page 33: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Detail of motif scanning result

      CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CEAS     CEAS     CEAS
      prombg   prombg   prombg   randbg   randbg   randbg
      200      400      1000     200      400      1000     200      400      1000
AR    2        2        6        1        1        1        2        2        1
CEBP  12       16       25       20       15       7        -        -        -
ETS   64       61       66       37       37       47       -        -        -
FOX   1        1        1        2        2        2        1        1        2
GATA  10       13       12       16       12       14       -        -        -
NF1   40       60       70       10       21       31       3        3        -
NKX   11       5        2        12       4        3        -        -        -
OCT   4        8        5        15       19       26       -        -        -
AP4   -        -        -        65       70       -        -        -        -
AUC   0.91     0.8917   0.8742   0.9358   0.9375   0.9208   0.6854   0.6875   0.625

Page 34: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

ChIPed motif shows center enrichment around AR peaks

• Due to the ChIP-seq protocol, we expect the correct motif to show center enrichment in the frequency graph.

• We assume noise like CG bias is uniformly distributed. If the motif is not real, its 1st derivative will be near zero.

• The frequency graph below shows that AR has center enrichment, while the velocity graph shows that AR is not noise.

[Figure panels: AR motif distribution around AR peak; Velocity distribution for AR motif]
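A frequency graph and its first derivative can be sketched as below. The binning by absolute distance from the peak, the bin size and the toy offsets are assumptions made for illustration; the slides do not specify how the graphs are computed.

```python
def frequency_graph(offsets, max_dist=200, bin_size=20):
    """Count motif hits per bin of absolute distance from the peak centre."""
    bins = [0] * (max_dist // bin_size)
    for offset in offsets:
        distance = abs(offset)
        if distance < max_dist:
            bins[distance // bin_size] += 1
    return bins

def velocity_graph(freq):
    """Discrete first derivative: difference between neighbouring bins."""
    return [b - a for a, b in zip(freq, freq[1:])]

# Hypothetical motif positions relative to the peak centre (in bp).
centred = [0, 3, -5, 8, -12, 15, 25, -30, 60, -110, 170]
uniform = list(range(-190, 200, 20))

print(frequency_graph(centred))                   # [6, 2, 0, 1, 0, 1, 0, 0, 1, 0]
print(velocity_graph(frequency_graph(uniform)))   # [0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Hits spread uniformly give a flat frequency graph and a zero velocity, whereas a centre-enriched motif gives a steep drop in the first bins.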

Page 35: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Co-motifs show center enrichment around peaks

• Since co-regulating factors are expected to co-occur in close proximity,
– we expect co-motifs to also show center enrichment around peaks.

• For example, NF1 is a known co-motif of AR.

• We observe center enrichment of the NF1 motif around the AR peaks.

Page 36: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Center distribution score

• We define a score function based on the frequency graph and the velocity graph.

• Features:
– We don’t require a background model.
– We will learn the window size automatically
– We will learn the PWM score cutoff

Page 37: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Automatically learn the parameters of the frequency graph

• V$AR_02

[Figure: frequency graph for V$AR_02 annotated with initial window size, cutoff and velocity; bar chart of Z-score (axis 0 to 180) for RandomBG, PromoterBG and CENTDIST]

Page 38: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

CENTDIST workflow

ChIP-seq Peaks

Extracting Sequences

Transfac DB

Motif Scanning

Distribution Analysis

Output

Ranked By Center Distribution Score

Page 39: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

CENTDIST

• Based on the center enrichment of the TFs relative to the peak, we derive a method CENTDIST.

• CENTDIST measures the center enrichment based on Z-score.

• Then, the ranked TFs are reported.
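The slides do not give the CENTDIST score itself. As a stand-in, the sketch below computes a simple Z-score for center enrichment: the number of motif hits within a central window is compared with what a uniform distribution of hits would give. The window size, maximum distance and null model are all assumptions.

```python
from math import sqrt

def center_enrichment_z(offsets, window=50, max_dist=200):
    """Z-score of hits within `window` bp of the peak under a uniform null."""
    distances = [abs(o) for o in offsets if abs(o) < max_dist]
    n = len(distances)
    if n == 0:
        return 0.0
    k = sum(d < window for d in distances)   # hits in the central window
    p = window / max_dist                    # expected share if uniform
    return (k - n * p) / sqrt(n * p * (1 - p))

# Hypothetical hit positions relative to the peak centre (in bp).
z_centred = center_enrichment_z([0] * 16)
z_uniform = center_enrichment_z(list(range(-199, 200)))
print(round(z_centred, 2), round(z_uniform, 2))
```

Motifs could then be ranked by such a score, with centre-enriched motifs ranked ahead of uniformly distributed ones.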

Page 40: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Can CENTDIST find known co-motifs of AR?

• All known co-motifs of AR show good center enrichment.

Page 41: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

CENTDIST vs CEAS vs CORE_TF

      CENTDIST  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CORE_TF  CEAS     CEAS     CEAS
                prombg   prombg   prombg   randbg   randbg   randbg
                200      400      1000     200      400      1000     200      400      1000
AR    1         2        2        6        1        1        1        2        2        1
CEBP  14        12       16       25       20       15       7        -        -        -
ETS   9         64       61       66       37       37       47       -        -        -
FOX   2         1        1        1        2        2        2        1        1        2
GATA  10        10       13       12       16       12       14       -        -        -
NF1   11        40       60       70       10       21       31       3        3        -
NKX   8         11       5        2        12       4        3        -        -        -
OCT   19        4        8        5        15       19       26       -        -        -
AP4   21        -        -        -        65       70       -        -        -        -
AUC   0.9683    0.91     0.8917   0.8742   0.9358   0.9375   0.9208   0.6854   0.6875   0.625

Page 42: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Can CENTDIST identify a novel factor?

• AP4 is ranked 21 in CENTDIST.

• Core-TF and CEAS rank AP4 low, since AP4 is not highly enriched around the peaks.

Page 43: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Validation of AP4

[Figure: bar chart (y-axis 0 to 0.45) comparing ETH and DHT for CONTROL, APBB2, GREB1, CAMKK2, TM2D1, SLC38A9, BANP, SLC43A1, PALM2, EEPP1, MAPK6, HIPK2, ACSM2A, RALYL, RGS2, ID2B, PXDN, CRADD, KLHL3, FAM153B, AFF3, DDAH1, YPEL1]

Page 44: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Validation of AP4

• To be unbiased, we performed an AP4 ChIP-seq experiment.

• 38% of AP4 peaks overlap with AR peaks.

[Venn diagram of AR and AP4 peaks: 62768 AR only, 2296 shared, 3786 AP4 only]
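The 38% figure can be checked from the three counts on the slide, assuming they are read as AR-only peaks, shared peaks and AP4-only peaks in that order (the assignment is inferred from the Venn diagram labels, not stated explicitly).

```python
# Counts from the slide; their assignment to regions is an assumption.
ar_only, shared, ap4_only = 62768, 2296, 3786

ap4_total = shared + ap4_only
overlap_fraction = shared / ap4_total
print(f"{overlap_fraction:.1%}")  # 37.8%
```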

Page 45: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Validation of AP4

• We also check the microarray expression.

• The result suggests that AP4 may co-localize with AR to directly up-regulate the transcription of androgen target genes.

[Figure: bar chart of % genes up-regulated (y-axis 0 to 6) for genes containing AR+AP4 peaks, genes containing AR peaks only, and others]

Page 46: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Validation using ChIP-seq from ES cell

• CENTDIST performs better than CEAS and Core-TF for most cases.

         CENTDIST  CEAS      Core-TF
Nanog    0.9647    0.7346    0.7549
Oct4     0.9133    0.825     0.7508
Sox2     0.9499    0.8765    0.6939
Stat3    0.9309    0.7492    0.7308
Smad1    0.8483    0.8803    0.7048
P300     0.9234    0.8098    0.719
KLF4     0.8432    0.6864    0.8015
ESRRB    0.8622    0.9744    0.9295
Cmyc     0.9776    0.8401    0.9237
Nmyc     0.9334    0.5235    0.9107
ZFX      0.9545    0.5373    0.9221
E2F1     0.9529    0.5349    0.9351
AVG AUC  0.921192  0.747667  0.814733
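The averages and the "most cases" claim can be checked directly from the AUC values on this slide. The sketch below only re-computes what the table already states; the variable names are illustrative.

```python
# AUC per factor as (CENTDIST, CEAS, Core-TF), copied from the slide.
auc = {
    "Nanog": (0.9647, 0.7346, 0.7549), "Oct4": (0.9133, 0.825, 0.7508),
    "Sox2": (0.9499, 0.8765, 0.6939), "Stat3": (0.9309, 0.7492, 0.7308),
    "Smad1": (0.8483, 0.8803, 0.7048), "P300": (0.9234, 0.8098, 0.719),
    "KLF4": (0.8432, 0.6864, 0.8015), "ESRRB": (0.8622, 0.9744, 0.9295),
    "Cmyc": (0.9776, 0.8401, 0.9237), "Nmyc": (0.9334, 0.5235, 0.9107),
    "ZFX": (0.9545, 0.5373, 0.9221), "E2F1": (0.9529, 0.5349, 0.9351),
}

# Average AUC per method, in the same column order as the table.
averages = [round(sum(row[i] for row in auc.values()) / len(auc), 6)
            for i in range(3)]

# Factors for which CENTDIST has the strictly highest AUC.
centdist_best = [tf for tf, (cd, ceas, core) in auc.items()
                 if cd > ceas and cd > core]

print(averages)            # [0.921192, 0.747667, 0.814733]
print(len(centdist_best))  # 10
```

CENTDIST has the highest AUC for 10 of the 12 factors; the exceptions are Smad1 and ESRRB.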

Page 47: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Discussion

• CentDist can find motifs which are marginally over-represented.

• CentDist can detect the window size

• CentDist doesn’t require a background model

Page 48: Analyzing ChIP-seq data Wing-Kin Sung National University of Singapore.

Acknowledgement

• Bioinformatics
– Guoliang Li
– Pramila
– Charlie Lee
– Han Xu
– Fabi
– Kuan Hon Loh
– Chang Cheng Wei
– Gao Song
– Chandana
– Rikky
– Zhang Zhi Zhou

• Sequencing
– Wei Chialin
– Handoko Lusy
– Sequencing team

• Cancer Biology
– Edwin Cheung
– Pau You Fu