More on TF Motif Finding ChIP-chip / seq
Xiaole Shirley Liu
STAT115, STAT215, BIO298, BIST520
De novo Sequence Motif Finding
• Goal: look for common sequence patterns enriched in the input data (compared to the genome background)
• Regular expression enumeration – Pattern driven approach
– Enumerate patterns, check significance in dataset
– Oligonucleotide analysis, MobyDick
• Position weight matrix update – Data driven approach, use data to refine motifs
– Consensus, EM & Gibbs sampling
– Motif score and Markov background
Position Weight Matrix Update
• Advantage
– Can look for motifs of any width
– Flexible with base substitutions
• Disadvantage
– EM and Gibbs sampling: no guaranteed convergence time
– No guaranteed global optimum
Motif Finding in Bacteria
• Promoter sequences are short (200-300 bp)
• Motifs are usually long (10-20 bases)
– Some have two blocks with a gap, some are palindromes
– Long motifs are usually very degenerate
• Single microarray experiment sometimes already provides enough information to search for TF motifs
Motif Finding in Lower Eukaryotes
• Upstream sequences longer (500-1000 bp), with some simple repeats
• Motif width varies (5 – 17 bases)
• Expression clusters provide input sequences of decent quality for TF motif finding
• Motif combination and redundancy appear, although single motifs are usually significant enough for identification
Yeast Promoter Architecture
• Co-occurring regulators suggest physical interaction between the regulators
Motif Finding in Higher Eukaryotes
• Upstream sequences very long (3-20 kb) with repeats; TF motifs can also appear downstream
• Motifs can be short or long (6-20 bases), and appear in combination and clusters
• Gene expression clusters are not good enough input
• Need:
– Comparative genomics: PhastCons score
– Motif modules: motif clusters
– ChIP-chip/seq
Yeast Regulatory Sequence Conservation
UCSC PhastCons Conservation
• Functional regulatory sequences are under stronger evolutionary constraint
• Align orthologous sequences together
• PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC
Conserved Motif Clusters
• First find conserved regions in the genome
• Then look for repeated transcription factor (TF) binding sites
• They form transcription factor modules
Outline
• ChIP-chip on yeast
– Technology and data analysis: MDscan motif finding, regulatory network
• ChIP-X on human
– Tiling microarrays and peak finding
– High throughput sequencing and peak finding
– Data analysis and examples
• Analysis: peak finding, gene expression analysis, sequence motif finding, regulatory network
– Holistic picture of gene regulation
Motivation
• Motif finding works well in bacteria, OK in yeast, marginal in worm/fly, and almost never in mammals
• Cistrome: Genome-wide in vivo binding sites of DNA-binding proteins
• ChIP-chip and ChIP-seq give cistrome results
ChIP-chip Technology
• Chromatin ImmunoPrecipitation + microarray
– ChIP-on-chip or ChIP-chip
– Also known as Genome Scale Location Analysis
• Detect genome-wide in vivo location of TF and other DNA-binding proteins
– Find all the DNA sequences bound by TF-X?
– Cook all the dishes with cinnamon
• Can learn the regulatory mechanism of a transcription factor or DNA-binding protein much better and faster
Chromatin ImmunoPrecipitation (ChIP)
TF/DNA Crosslinking in vivo
Sonication (~500bp)
TF-specific Antibody
Immunoprecipitation
Reverse Crosslink and DNA Purification
Promoter Array Hybridization
[Figure labels: Genes, Intergenic, ChIP]
ChIP-DNA chip Detection
• Started in yeast, using promoter cDNA microarrays
– ~6000 spots, each 800-1000 bp
• Two-color assay
– Control: no antibody, or chromatin (a little bit of everything)
– Need triplicates to cancel noise
• Applied to all yeast TFs
– TF modified to contain a tag
– Tag can be precipitated with immunoglobulin
ChIP-chip Motif Finding
• ChIP-chip gives 10-5000 binding regions, each ~600-1000 bp long. What is the precise binding motif?
– Raw data is like perfect clustering, plus enrichment values
• MDscan
– High ChIP ranking => true targets, contain more sites
– Search for the TF motif in the highest ranking targets first (high signal / background ratio)
– Refine candidate motifs with all targets
– Used successfully in ChIP-chip motif finding
Similarity Defined by m-match
For a given w-mer and any other random w-mer
m-matches for the 8-mer TGTAACGT:
TGTAACGT matched 8
AGTAACGT matched 7
TGCAACAT matched 6
TGACACGG matched 5
AATAACAG matched 4
Pick a reasonable m to call two w-mers similar
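The m-match count above is simply the number of positions at which two w-mers of equal width agree. A minimal sketch follows; the function names `count_matches` and `is_m_match` are illustrative and do not come from MDscan.

```python
def count_matches(a: str, b: str) -> int:
    """Number of positions at which two equal-width w-mers agree."""
    if len(a) != len(b):
        raise ValueError("w-mers must have the same width")
    return sum(x == y for x, y in zip(a, b))


def is_m_match(a: str, b: str, m: int) -> bool:
    """True when the two w-mers agree at m or more positions."""
    return count_matches(a, b) >= m


# Reproduce the slide's table for the 8-mer TGTAACGT.
seed = "TGTAACGT"
for other in ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]:
    print(other, "matched", count_matches(seed, other))
```

With m = 6, for instance, the first three 8-mers in the table would be called similar to the seed and the last two would not.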
MDscan Seeds
[Figure: ChIP-chip selected upstream sequences, ranked with higher enrichment at the top. Each 9-mer in the top sequences serves as a seed (for example ATTGCAAAT, TTGCAAATC, TGCAAATCC), and the m-matches to a seed form a seed motif pattern, e.g. ATTGCAAAT, TTTGCGAAT, TTTGCAAAT.]
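The seeding step can be sketched as follows: take one w-mer from the top-ranked sequences as a seed, collect every w-mer in those sequences that is an m-match to it, and summarize the collected sites as a per-position count matrix. This is only an illustration under stated assumptions: the three toy sequences, the helper names, and m = 7 are made up, and the sketch leaves out the motif scoring against background and the later update and refinement steps.

```python
from collections import Counter


def wmers(seq, w):
    """All overlapping w-mers of a sequence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def m_matches(seed, sequences, m):
    """All w-mers in `sequences` that agree with `seed` at m or more positions."""
    w = len(seed)
    return [
        kmer
        for seq in sequences
        for kmer in wmers(seq, w)
        if sum(x == y for x, y in zip(seed, kmer)) >= m
    ]


def count_matrix(sites):
    """One {A, C, G, T: count} dict per motif position."""
    width = len(sites[0])
    matrix = []
    for j in range(width):
        column = Counter(site[j] for site in sites)
        matrix.append({base: column.get(base, 0) for base in "ACGT"})
    return matrix


# Hypothetical top-ranked sequences (not real data).
top = ["GGATTGCAAATCC", "CCTTTGCGAATGG", "AATTTGCAAATTT"]
sites = m_matches("ATTGCAAAT", top, m=7)
print(sites)
print(count_matrix(sites)[0])
```

In a full run every w-mer in the top sequences would be tried as a seed, and the resulting candidate motifs would be ranked before being updated with the remaining sequences.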
Update Motifs With Remaining Seqs
[Figure labels: Seed1 m-matches; Extreme High Rank; All ChIP-selected targets]
Refine the Motifs
[Figure labels: Seed1 m-matches; Extreme High Rank; All ChIP-selected targets]
Yeast TF Regulatory Network
[Figure legend: Protein, Gene; Regulate, Transcribe]
ChIP-chip Better Explains Expression
[Figure panels: Ndt80 regulated genes; Sum1 regulated genes; Ndt80 & Sum1 regulated genes]
Genome Tiling Microarrays
• Promoter arrays do not work for human ChIP-chip
• Binding can occur in distant intergenic sequences, introns, exons, or downstream sequences
[Figure: tiling probes along genomic DNA on the chromosome]
DNA Purification
ChIP-chip on Tiling Microarray
[Figure labels: ChIP-DNA, Noise; ChIP and Ctrl tracks along the chromosome]
ChIP-chip
• Detect genome-wide location of transcription and epigenetic factors
• Affymetrix genome tiling arrays are cheaper
• $2000 7 arrays * 6 million probes * (3 ChIP + 3 Ctrl)
• But data is noisier and less informative
[Figure: log probe intensity versus chromosome coordinates for ChIP and Ctrl. Two peaks? How about ChIP alone? Over 42M probes?]
ChIP-chip Analysis: Mann-Whitney U-test
• Affy TAS, Cawley et al (Cell 2004):
– Assign 1 to all probe pairs with MM > PM
– Each probe: rank probes within [-500bp, +500bp] window
– Check whether sum of ChIP ranks is much smaller
– Consider all probes equally
– Half of the probes have MM > PM
[Figure: histogram of (PM – MM)]
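The windowed rank-sum idea can be sketched in plain Python. Several assumptions are made here and none of this is the TAS implementation: "assign 1" is read as setting PM – MM to 1 when MM > PM, rank 1 is given to the highest value so that an enriched ChIP sample has a small rank sum, and the p-value uses a normal approximation without tie correction. All function names are illustrative.

```python
import math


def floor_pm_minus_mm(pm, mm):
    """PM - MM, set to 1 when MM > PM (assumed reading of the slide)."""
    return pm - mm if pm > mm else 1


def window_values(positions, values, center, half_width=500):
    """Values of probes located within center +/- half_width bp."""
    return [v for p, v in zip(positions, values) if abs(p - center) <= half_width]


def chip_rank_sum(chip, ctrl):
    """Sum of ChIP ranks in the pooled sample; rank 1 = highest value, ties averaged."""
    pooled = [(v, True) for v in chip] + [(v, False) for v in ctrl]
    pooled.sort(key=lambda pair: pair[0], reverse=True)
    total, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        total += average_rank * sum(1 for k in range(i, j) if pooled[k][1])
        i = j
    return total


def rank_sum_p_value(chip, ctrl):
    """One-sided p-value that the ChIP rank sum is small (normal approximation)."""
    n1, n2 = len(chip), len(ctrl)
    n = n1 + n2
    mean = n1 * (n + 1) / 2.0
    variance = n1 * n2 * (n + 1) / 12.0  # no tie correction in this sketch
    z = (chip_rank_sum(chip, ctrl) - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z


print(chip_rank_sum([9, 8, 7], [3, 2, 1]), rank_sum_p_value([9, 8, 7], [3, 2, 1]))
```

For each probe, the values passed to `rank_sum_p_value` would be the ChIP and Ctrl values returned by `window_values` for the window centered on that probe.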
Affymetrix Tiling Array Peak Finding
• Challenges:
– Massive data, probe values noisy
– Only 1/3 of researchers get it to work the first time
– Previous algorithms only work by comparing 3 ChIP with 3 Ctrl
• Model-based Analysis of Tiling arrays (MAT)
– Work with single ChIP (no rep, no ctrl)
– Find individual failed samples
– More sensitive, specific, and quantitative with 3 ChIP & 3 Ctrl
MAT: Johnson et al, PNAS 2006
MAT
• Most of the probes in ChIP-chip measure non-specific hybridization and background noise
• Estimate probe behavior by checking other probes with similar sequence on the same array
• Probe sequence plays a big role in signal value
Model Sequence-Specific Probe Effect
• First detailed model of probe sequence on probe signal
• Example 25-mer probe: AATGC ACTGT GCACA GATCG GCCAT (7 A, 7 C, 6 G, 5 T), mapping to 2 places in the genome
• Use all the probes on the array to estimate the parameters
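• Model for the example probe:
$$\log(PM) = 5\alpha + \beta_{1,A} + \beta_{2,A} + \beta_{4,G} + \beta_{5,C} + \dots + 49\gamma_A + 49\gamma_C + 36\gamma_G + 25\gamma_T + \log(2)\,\delta + \varepsilon$$
– $\log(PM)$: probe signal
– $5\alpha$: # of T's (intercept)
– $\beta_{j,k}$: position-specific A, C, G effect
– $\gamma_k$ terms: A, C, G, T count squared
– $\log(2)\,\delta$: 25-mer copy number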
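To make the terms concrete, the sketch below builds the covariates of the probe model for one 25-mer: the T count, position-specific indicators for A, C and G (T acts as the baseline), the squared base counts, and the log copy number. The function name and feature keys are illustrative, the natural log is an assumption, and fitting the coefficients (for example by least squares over all probes on the array) is not shown.

```python
import math


def mat_features(probe: str, copy_number: int) -> dict:
    """Covariates of the probe-sequence model for a single probe."""
    counts = {base: probe.count(base) for base in "ACGT"}
    features = {"nT": counts["T"]}  # multiplied by alpha
    for j, base in enumerate(probe, start=1):
        if base in "ACG":  # T is the baseline and gets no indicator
            features[f"pos{j}_{base}"] = 1  # multiplied by beta_{j,base}
    for base in "ACGT":
        features[f"n{base}_squared"] = counts[base] ** 2  # multiplied by gamma_base
    features["log_copy"] = math.log(copy_number)  # multiplied by delta
    return features


# The example probe from the slide, which maps to 2 places in the genome.
probe = "AATGCACTGTGCACAGATCGGCCAT"
features = mat_features(probe, copy_number=2)
print(features["nT"], features["nA_squared"], features["nG_squared"])
```

For the example probe this yields the coefficients seen in the slide's equation: 5 for the T count, and 49, 49, 36, 25 for the squared A, C, G, T counts.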
Probe Standardization
• Fit the probe model array by array
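• Group the 6M probes into 2K affinity bins and standardize each probe:
$$t_i = \frac{\log(PM_i) - \hat{m}_i}{s_{i,\,\text{affinity bin}}}$$
– $\hat{m}_i$: model predicted probe intensity
– $PM_i$: observed probe intensity
– $s_{i,\,\text{affinity bin}}$: observed probe variation within each bin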
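A small sketch of the standardization step, under assumptions that are not stated on the slide: probes are sorted by model-predicted intensity and cut into consecutive bins of equal size, and the within-bin spread is taken as the standard deviation of the observed log intensities in that bin. The function name and the toy numbers are illustrative.

```python
import statistics


def standardize(log_pm, predicted, bin_size):
    """t_i = (log PM_i - predicted_i) / spread of the affinity bin holding probe i."""
    # Affinity bins: probes sorted by predicted intensity, cut into equal chunks.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    t_values = [0.0] * len(log_pm)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        spread = statistics.pstdev([log_pm[i] for i in members])
        for i in members:
            # Guard against a zero spread in tiny or constant bins.
            t_values[i] = (log_pm[i] - predicted[i]) / spread if spread > 0 else 0.0
    return t_values


# Toy data: four probes, two affinity bins of two probes each.
observed = [1.0, 3.0, 10.0, 14.0]
predicted = [2.0, 2.0, 12.0, 12.0]
print(standardize(observed, predicted, bin_size=2))
```

With 6M probes and 2K bins, `bin_size` would be about 3000 probes per bin.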
[Figure: raw probe values at two spike-in regions with concentration 2X (ChIP, Ctrl); sequence-based probe behavior standardization (ChIP standardized, Ctrl standardized); window-based neighboring probe combination for ChIP-region detection (ChIP window, ChIP – Ctrl, 3 ChIP – 3 Ctrl)]
MA2C: Model-based Analysis of 2-Color Arrays
• Normalize probes by GC bins within each array
– How much variance is observed in the GC bin
– Give high confidence probes more weight
• Running window average or median for peak finding
MA2C: Song et al, Genome Biol 2007
Is a ChIP experiment working?
• MAT window scores ~ normal with long tails
• Estimate p-value of normal from left half of data
• FDR = A / B (Ctrl/ChIP peaks are all FPs)
• Spike-in shows MAT FDR estimate is accurate
• Can find individual failed replicate
MAT: Quality Control
[Figure: distribution of window scores, showing background and enriched DNA (<1% enriched), with regions labeled A and B]
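One way to read the bullets above: estimate the null distribution of window scores from the left half of the data, where enriched regions should not contribute, and estimate the FDR at a cutoff as the number of mirrored negative-side windows (A) over the number of positive-side windows (B). The sketch assumes the null center is the median and that A is counted at the cutoff mirrored around that center. These are simplifying assumptions for illustration, not the MAT code.

```python
import math
import statistics


def left_half_null(scores):
    """Null center (median) and sigma estimated from scores at or below the center."""
    center = statistics.median(scores)
    left = [s for s in scores if s <= center]
    sigma = math.sqrt(sum((s - center) ** 2 for s in left) / len(left))
    return center, sigma


def p_value(score, center, sigma):
    """Upper-tail p-value of a score under the fitted normal null."""
    z = (score - center) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


def empirical_fdr(scores, cutoff, center):
    """A / B: windows beyond the mirrored cutoff over windows beyond the cutoff."""
    b = sum(s >= cutoff for s in scores)
    a = sum(s <= 2 * center - cutoff for s in scores)
    return a / b if b else 0.0  # 0.0 when no window passes the cutoff


# Toy window scores: mostly background plus a few enriched windows.
scores = [-3, -1, -1, 0, 0, 0, 1, 1, 5, 6, 7]
center, sigma = left_half_null(scores)
print(center, round(sigma, 3), empirical_fdr(scores, cutoff=3, center=center))
```

A replicate whose FDR stays high at every cutoff would be a candidate failed experiment.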
ChIP-Seq
[Figure labels: ChIP-DNA, Noise]
Sequence millions of 30-mer ends of fragments
Map 30-mers back to the genome
MACS: Model-based Analysis of ChIP-Seq
• Use confident peaks to model shift size
[Figure label: Binding]
Peak Calls
• Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size)
• ChIP-Seq shows local biases in the genome
– Chromatin and sequencing bias
– 200-300bp control windows have too few tags
– But can look further
Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k)
[Figure: ChIP and Control tracks with 300bp, 1kb, 5kb, and 10kb windows]
http://liulab.dfci.harvard.edu/MACS/
Zhang et al, Genome Biol, 2008
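The dynamic background can be sketched as follows: scale each control-window tag count to the width of the candidate window, take the maximum together with the genome-wide rate, and compute a Poisson upper-tail p-value for the observed ChIP tag count. The function names, the dictionary layout, and the numbers are illustrative assumptions, and the scaling between ChIP and control sequencing depth is left out.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def dynamic_lambda(window_bp, total_tags, genome_size, ctrl_counts):
    """max(lambda_BG, scaled control rates).

    ctrl_counts maps a control window width in bp (e.g. 1000, 5000, 10000)
    to the number of control tags found in that window.
    """
    lambda_bg = total_tags * window_bp / genome_size
    scaled = [count * window_bp / width for width, count in ctrl_counts.items()]
    return max([lambda_bg] + scaled)


# Hypothetical numbers: a 300 bp candidate window holding 8 ChIP tags.
lam = dynamic_lambda(300, total_tags=1_000_000, genome_size=3_000_000_000,
                     ctrl_counts={5000: 10, 10000: 40})
print(lam, poisson_upper_tail(8, lam))
```

Taking the maximum means a region with locally high control signal needs more ChIP tags to be called a peak than the genome-wide rate alone would require.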
CEAS: Cis-regulatory Element Annotation System
• Data Analysis Button for Biologists
http://ceas.cbi.pku.edu.cn
Estrogen Receptor
• Carroll et al, Cell 2005
• Overactive in > 70% of breast cancers
• Where does it go in the genome?
• ChIP-chip on chr21/22, motif and expression analysis found its partner FoxA1
[Figure labels: TF??, ER]
Estrogen Receptor (ER) Cistrome in Breast Cancer
• Carroll et al, Nat Genet 2006
• ER may function far away (100-200 kb) from genes
• Only 20% of ER sites have PhastCons > 0.2
• ER has different effects depending on its collaborators
[Figure labels: AP1, ER, NRIP]