MW 12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano
including ChIP-Seq, Epi markers, GWAS...
Transcript of including ChIP-Seq, Epi markers, GWAS...
Using GREAT.stanford.edu to interpret cis-regulatory rich datasets
including ChIP-Seq, Epi markers, GWAS etc.
1
Gill BejeranoDept. of Developmental Biology &
Dept. of Computer ScienceStanford University
http://bejerano.stanford.edu
http://bejerano.stanford.edu/ 2
Human Gene Regulation
All these cells have the same Genome.
Gene
Gene
Gene
Gene
20,000 Genes encode how to make proteins.
1,000,000 Genomic “switches” determinewhich and how much proteins to make.
1013 different cells in an adult human.
Hundreds of different cell types.
http://bejerano.stanford.edu/ 3
Most Non-Coding Elements likely work in cis…
9Mb
“IRX1 is a member of the Iroquois homeobox gene family. Members of this family appear to play multiple rolesduring pattern formation of vertebrate embryos.”
gene deserts
regulatory jungles
Every orange tick mark is roughly 100-1,000bp long, each evolves under purifying selection, and does not code for protein.
http://bejerano.stanford.edu/ 4
Many non-coding elements tested are cis-regulatory
http://bejerano.stanford.edu/ 5
Bejerano Lab : Human Cis-RegulationCIS REGULATION DEVELOPMENT EVOLUTION
DISEASE
We build tools, predict and test
in house.
http://bejerano.stanford.edu/ 6
Combinatorial Regulatory Code
Gene
2,000 different proteins can bind specific DNA sequences.
A regulatory region encodes 3-10 such protein binding sites.When all are bound by proteins the regulatory region turns “on”,
and the nearby gene is activated to produce protein.
Proteins
DNA
DNA
Protein binding site
ChIP-Seq: first glimpses of the regulatory genome in action
Cis-regulatory peak77http://bejerano.stanford.edu/
Peak Calling
Gene transcription start site
What is the transcription factor I just assayed doing?
Cis-regulatory peak88http://bejerano.stanford.edu/
• Collect known literature of the form• Function A: Gene1, Gene2, Gene3, ...• Function B: Gene1, Gene2, Gene3, ...• Function C: ...
• Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above.• Form hypothesis and perform further experiments.
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
9
Gene transcription start site
SRF binding ChIP-seq peak
• ChIP-seq identified 2,429 SRF binding peaksin human Jurkat cells1
• SRF is known as a “master regulator of the actin cytoskeleton”
• In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.
[1] Valouev A. et al., Nat. Methods, 2008http://bejerano.stanford.edu/
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
10
Existing, gene-based method to analyze enrichment:
• Ignore distal binding events.
• Count affected genes.
• Rank by enrichment hypergeometric p-value.
Gene transcription start site
SRF binding ChIP-seq peakOntology term (e.g. ‘actin cytoskeleton’)
N = 8 genes in genomeK = 3 genes annotated withn = 2 genes selected by proximal peaksk = 1 selected gene annotated with
P = Pr(k ≥1 | n=2, K =3, N=8)
http://bejerano.stanford.edu/
We have (reduced ChIP-Seq into) a gene list!What is the gene list enriched for?
11
Microarray tool
Microarray data
Microarray data
Deep sequencing
data
http://bejerano.stanford.edu/
Pro: A lot of tools out there for the analysis of gene lists.Cons: These tools are built for microarray analysis.Does it matter ??
SRF Gene-based enrichment results
12
• Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched1
[1] Valouev A. et al., Nat. Methods, 2008
SRF
SRF
SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding
12http://bejerano.stanford.edu/
Where’s the signal?Top “actin” term is ranked #28 in the list.
Associating only proximal peaks loses a lot of information
13
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0-2 2-5 5-50 50-500 > 500
Frac
tion
of a
ll el
emen
ts
Distance to nearest transcription start site (kb)
SRF (H: Jurkat) NRSF (H: Jurkat) GABP (H: Jurkat)
Stat3 (M: ESC) p300 (M: ESC) p300 (M: limb)
p300 (M: forebrain) p300 (M: midbrain)
Relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets
Restricting to proximal peaks often leads to complete loss of key enrichments
http://bejerano.stanford.edu/
Bad Solution: Associating distal peaks brings in many false enrichments
14
Why bad? 14% of human genes tagged ‘multicellular organismal development’. But 33% of base pairs have such a gene nearest upstream/downstream.
http://bejerano.stanford.edu/
Term Bonferronicorrected p-value
nervous system development 5x10-9
system development 8x10-9
anatomical structure development 7x10-8
multicellular organismal development 1x10-7
developmental process 2x10-6
SRF ChIP-seq set has >2,000 binding events.Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Regulatory jungles are oftennext to key developmental genes
Real Solution: Do not convert to gene list.Analyze the set of genomic regions
15
Gene transcription start siteOntology term ( ‘actin cytoskeleton’)
P = Prbinom(k ≥5 | n=6, p =0.33)
p = 0.33 of genome annotated withn = 6 genomic regionsk = 5 genomic regions hit annotation
http://bejerano.stanford.edu/
Gene regulatory domainGenomic region (ChIP-seq peak)
Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at genome, get NO (false) enrichments.
GREAT = Genomic RegionsEnrichment of Annotations Tool
How does GREAT know how to assign distal binding peaks to genes?
16
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms
Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.
• Default: each gene has a “basal regulatory domain” of 5 kb up- and 1kb downstream of transcription start site, extends to basal domain of nearest genes within 1 Mb
• Though some associations may be missed or incorrect, in general signal richness and robustness is greatly improved by associating distal peaks
http://bejerano.stanford.edu/
GREAT infers many specific functions of SRF from its binding profile
17
Ontology Term # Genes Binomial ExperimentalP-value support*
Gene Ontology actin cytoskeletonactin binding
7x10-9
5x10-5Miano et al. 2007
Miano et al. 2007
* Known from literature – as in function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
3031
PathwayCommons
TRAIL signalingClass I PI3K signaling
5x10-7
2x10-6Bertolotto et al. 2000
Poser et al. 20003226
TreeFam 1x10-85 Chai & Tarnawski 2002
TF Targets Targets of SRFTargets of GABPTargets of YY1Targets of EGR1
5x10-76
4x10-9
1x10-6
2x10-4
Positive control
ChIp-Seq support
Natesan & Gilman 1995
84284423
Top gene-basedenrichments of SRF
Top GREAT enrichments of SRF
(top actin-related term 28th in list)
FOS gene family
http://bejerano.stanford.edu/
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq
[McLean et al., Nat Biotechnol., 2010]
GREAT data integrated
18Michael Hiller
• Twenty ontologies spanning broad categories of biology• 44,832 total ontology terms tested in each GREAT run
(2,800 terms)(5,215)(834)
(5,781)(427)(456)
(150)(1,253)(288)(706)
(6,700)(3,079)(911)
(615)(19)(222)(9)
(6,857)(8,272)(238)
http://bejerano.stanford.edu/
GREAT implementation
• Can handle datasets of hundreds of thousands of genomic regions• Testing a single ontology term takes ~1 ms• Enables real-time calculation of enrichment results for all ontologies
19http://bejerano.stanford.edu/ Cory McLean
20
GREAT web app: input page
Dave Bristor
Pick a genome assembly
Input BED regions of interest
http://great.stanford.edu
http://bejerano.stanford.edu/
As of February ‘11: Added Zebrafish
http://bejerano.stanford.edu/ 21
22
GREAT web app:(Optional): alter association rules
http://great.stanford.edu
Three association rule choices
Literature-curated domains for a small subset of genes
Lnp Evx2 HoxD cluster
[adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]http://bejerano.stanford.edu/
23
Additional ontologies,term statistics,multiple hypothesis corrections, etc.
GREAT web app: output summary
Ontology-specific enrichments
http://bejerano.stanford.edu/
24
GREAT web app: term details page
Genes annotated as “actin binding” with associated genomic regions
Genomic regions annotated with “actin binding”
Drill down to explore how a particular peak regulates Plectin and its role in actin binding
http://bejerano.stanford.edu/
You can also submit any trackstraight from the UCSC Table Browser
25http://bejerano.stanford.edu/
A simple, well documentedprogrammatic interface allowsany tool to submit directly to GREAT.(See our Help / Inquiries welcome!)
GREAT web app: export data
26
HTML output displays all user selected rows and columns
Tab-separated values also available for additional postprocessing
http://bejerano.stanford.edu/
GREAT Web Stats: 40 jobs/day x 300 days up
27http://bejerano.stanford.edu/
500 entries
GREAT can be used with any cis-reg set
http://bejerano.stanford.edu/ 28
Top 119 SNPs associated with diabetesfrom NIH GWAS Catalog
Human specific loss of regulatory seq.
http://bejerano.stanford.edu 30
Human specific deletionsappear significantly oftennext to:
•Steroid hormone receptors•Neural tumor suppressors
[McLean et al., Nature, 2011]
• Human genome chock-full of regulatory sequences• GREAT accurately assesses functional enrichments of validated or putative cis-regulatory sequences using a novel genomic region-based approach [McLean et al., Nature Biotechnol., 2010]
• Online tool available at http://great.stanford.eduhas been embraced by the biomedical research community
31
Summary
http://bejerano.stanford.edu
http://bejerano.stanford.edu/ 32
Bejerano Lab: Developmental Genomics & Evolutionary Developmental Genomics
Bejerano LabBruce Schaar, Ph.DGeetu Tuteja, Ph.D (Dean’s Fellow,)Michael Hiller, Ph.D (HFSP Fellow)Andrew Doxey, Ph.D. (NSERC Fellow)Cory McLean (Bio-X Fellow)Shoa Clarke (HHMI Gilliam Fellow)Aaron Wenger (SiGF Fellow)Harendra Guturu (NSF Fellow)Jim Notwell (NSF Fellow)Tisha ChungJose SoltrenSaatvik AgarwalJenny ChenSushant Shankar
StanfordDavid Kingsley & lab
Funding:NIH / NICHD, NHGRI NSF / STCPackard FoundationHFSP Young Investigator AwardSearle Scholar Network, Microsoft ResearchMallinckrodt Foundation, A.P. Sloan FoundationOkawa Foundation, Borroughs-Wellcome
wetdry
http://bejerano.stanford.edu 33