including ChIP-Seq, Epi markers, GWAS...

33
Using GREAT.stanford.edu to interpret cis-regulatory rich datasets including ChIP-Seq, Epi markers, GWAS etc. 1 Gill Bejerano Dept. of Developmental Biology & Dept. of Computer Science Stanford University http://bejerano.stanford.edu

Transcript of including ChIP-Seq, Epi markers, GWAS...

Page 1: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Using GREAT.stanford.edu to interpret cis-regulatory rich datasets

including ChIP-Seq, Epi markers, GWAS etc.

1

Gill BejeranoDept. of Developmental Biology &

Dept. of Computer ScienceStanford University

http://bejerano.stanford.edu

Page 2: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

http://bejerano.stanford.edu/ 2

Human Gene Regulation

All these cells have the same Genome.

Gene

Gene

Gene

Gene

20,000 Genes encode how to make proteins.

1,000,000 Genomic “switches” determinewhich and how much proteins to make.

1013 different cells in an adult human.

Hundreds of different cell types.

Page 3: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

http://bejerano.stanford.edu/ 3

Most Non-Coding Elements likely work in cis…

9Mb

“IRX1 is a member of the Iroquois homeobox gene family. Members of this family appear to play multiple rolesduring pattern formation of vertebrate embryos.”

gene deserts

regulatory jungles

Every orange tick mark is roughly 100-1,000bp long, each evolves under purifying selection, and does not code for protein.

Page 4: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

http://bejerano.stanford.edu/ 4

Many non-coding elements tested are cis-regulatory

Page 5: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

http://bejerano.stanford.edu/ 5

Bejerano Lab : Human Cis-RegulationCIS REGULATION DEVELOPMENT EVOLUTION

DISEASE

We build tools, predict and test

in house.

Page 6: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

http://bejerano.stanford.edu/ 6

Combinatorial Regulatory Code

Gene

2,000 different proteins can bind specific DNA sequences.

A regulatory region encodes 3-10 such protein binding sites.When all are bound by proteins the regulatory region turns “on”,

and the nearby gene is activated to produce protein.

Proteins

DNA

DNA

Protein binding site

Page 7: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

ChIP-Seq: first glimpses of the regulatory genome in action

Cis-regulatory peak77http://bejerano.stanford.edu/

Peak Calling

Page 8: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Gene transcription start site

What is the transcription factor I just assayed doing?

Cis-regulatory peak88http://bejerano.stanford.edu/

• Collect known literature of the form• Function A: Gene1, Gene2, Gene3, ...• Function B: Gene1, Gene2, Gene3, ...• Function C: ...

• Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above.• Form hypothesis and perform further experiments.

Page 9: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile

9

Gene transcription start site

SRF binding ChIP-seq peak

• ChIP-seq identified 2,429 SRF binding peaksin human Jurkat cells1

• SRF is known as a “master regulator of the actin cytoskeleton”

• In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.

[1] Valouev A. et al., Nat. Methods, 2008http://bejerano.stanford.edu/

Page 10: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile

10

Existing, gene-based method to analyze enrichment:

• Ignore distal binding events.

• Count affected genes.

• Rank by enrichment hypergeometric p-value.

Gene transcription start site

SRF binding ChIP-seq peakOntology term (e.g. ‘actin cytoskeleton’)

N = 8 genes in genomeK = 3 genes annotated withn = 2 genes selected by proximal peaksk = 1 selected gene annotated with

P = Pr(k ≥1 | n=2, K =3, N=8)

http://bejerano.stanford.edu/

Page 11: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

We have (reduced ChIP-Seq into) a gene list!What is the gene list enriched for?

11

Microarray tool

Microarray data

Microarray data

Deep sequencing

data

http://bejerano.stanford.edu/

Pro: A lot of tools out there for the analysis of gene lists.Cons: These tools are built for microarray analysis.Does it matter ??

Page 12: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

SRF Gene-based enrichment results

12

• Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched1

[1] Valouev A. et al., Nat. Methods, 2008

SRF

SRF

SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding

12http://bejerano.stanford.edu/

Where’s the signal?Top “actin” term is ranked #28 in the list.

Page 13: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Associating only proximal peaks loses a lot of information

13

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0-2 2-5 5-50 50-500 > 500

Frac

tion

of a

ll el

emen

ts

Distance to nearest transcription start site (kb)

SRF (H: Jurkat) NRSF (H: Jurkat) GABP (H: Jurkat)

Stat3 (M: ESC) p300 (M: ESC) p300 (M: limb)

p300 (M: forebrain) p300 (M: midbrain)

Relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets

Restricting to proximal peaks often leads to complete loss of key enrichments

http://bejerano.stanford.edu/

Page 14: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Bad Solution: Associating distal peaks brings in many false enrichments

14

Why bad? 14% of human genes tagged ‘multicellular organismal development’. But 33% of base pairs have such a gene nearest upstream/downstream.

http://bejerano.stanford.edu/

Term Bonferronicorrected p-value

nervous system development 5x10-9

system development 8x10-9

anatomical structure development 7x10-8

multicellular organismal development 1x10-7

developmental process 2x10-6

SRF ChIP-seq set has >2,000 binding events.Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?

Regulatory jungles are oftennext to key developmental genes

Page 15: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Real Solution: Do not convert to gene list.Analyze the set of genomic regions

15

Gene transcription start siteOntology term ( ‘actin cytoskeleton’)

P = Prbinom(k ≥5 | n=6, p =0.33)

p = 0.33 of genome annotated withn = 6 genomic regionsk = 5 genomic regions hit annotation

http://bejerano.stanford.edu/

Gene regulatory domainGenomic region (ChIP-seq peak)

Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at genome, get NO (false) enrichments.

GREAT = Genomic RegionsEnrichment of Annotations Tool

Page 16: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

How does GREAT know how to assign distal binding peaks to genes?

16

Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms

Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.

• Default: each gene has a “basal regulatory domain” of 5 kb up- and 1kb downstream of transcription start site, extends to basal domain of nearest genes within 1 Mb

• Though some associations may be missed or incorrect, in general signal richness and robustness is greatly improved by associating distal peaks

http://bejerano.stanford.edu/

Page 17: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

GREAT infers many specific functions of SRF from its binding profile

17

Ontology Term # Genes Binomial ExperimentalP-value support*

Gene Ontology actin cytoskeletonactin binding

7x10-9

5x10-5Miano et al. 2007

Miano et al. 2007

* Known from literature – as in function is known, SOME of the genes are known, and the binding sites highlighted are NOT.

3031

PathwayCommons

TRAIL signalingClass I PI3K signaling

5x10-7

2x10-6Bertolotto et al. 2000

Poser et al. 20003226

TreeFam 1x10-85 Chai & Tarnawski 2002

TF Targets Targets of SRFTargets of GABPTargets of YY1Targets of EGR1

5x10-76

4x10-9

1x10-6

2x10-4

Positive control

ChIp-Seq support

Natesan & Gilman 1995

84284423

Top gene-basedenrichments of SRF

Top GREAT enrichments of SRF

(top actin-related term 28th in list)

FOS gene family

http://bejerano.stanford.edu/

Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq

[McLean et al., Nat Biotechnol., 2010]

Page 18: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

GREAT data integrated

18Michael Hiller

• Twenty ontologies spanning broad categories of biology• 44,832 total ontology terms tested in each GREAT run

(2,800 terms)(5,215)(834)

(5,781)(427)(456)

(150)(1,253)(288)(706)

(6,700)(3,079)(911)

(615)(19)(222)(9)

(6,857)(8,272)(238)

http://bejerano.stanford.edu/

Page 19: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

GREAT implementation

• Can handle datasets of hundreds of thousands of genomic regions• Testing a single ontology term takes ~1 ms• Enables real-time calculation of enrichment results for all ontologies

19http://bejerano.stanford.edu/ Cory McLean

Page 20: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

20

GREAT web app: input page

Dave Bristor

Pick a genome assembly

Input BED regions of interest

http://great.stanford.edu

http://bejerano.stanford.edu/

Page 21: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

As of February ‘11: Added Zebrafish

http://bejerano.stanford.edu/ 21

Page 22: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

22

GREAT web app:(Optional): alter association rules

http://great.stanford.edu

Three association rule choices

Literature-curated domains for a small subset of genes

Lnp Evx2 HoxD cluster

[adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]http://bejerano.stanford.edu/

Page 23: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

23

Additional ontologies,term statistics,multiple hypothesis corrections, etc.

GREAT web app: output summary

Ontology-specific enrichments

http://bejerano.stanford.edu/

Page 24: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

24

GREAT web app: term details page

Genes annotated as “actin binding” with associated genomic regions

Genomic regions annotated with “actin binding”

Drill down to explore how a particular peak regulates Plectin and its role in actin binding

http://bejerano.stanford.edu/

Page 25: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

You can also submit any trackstraight from the UCSC Table Browser

25http://bejerano.stanford.edu/

A simple, well documentedprogrammatic interface allowsany tool to submit directly to GREAT.(See our Help / Inquiries welcome!)

Page 26: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

GREAT web app: export data

26

HTML output displays all user selected rows and columns

Tab-separated values also available for additional postprocessing

http://bejerano.stanford.edu/

Page 27: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

GREAT Web Stats: 40 jobs/day x 300 days up

27http://bejerano.stanford.edu/

500 entries

Page 28: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

GREAT can be used with any cis-reg set

http://bejerano.stanford.edu/ 28

Page 29: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Top 119 SNPs associated with diabetesfrom NIH GWAS Catalog

Page 30: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

Human specific loss of regulatory seq.

http://bejerano.stanford.edu 30

Human specific deletionsappear significantly oftennext to:

•Steroid hormone receptors•Neural tumor suppressors

[McLean et al., Nature, 2011]

Page 31: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

• Human genome chock-full of regulatory sequences• GREAT accurately assesses functional enrichments of validated or putative cis-regulatory sequences using a novel genomic region-based approach [McLean et al., Nature Biotechnol., 2010]

• Online tool available at http://great.stanford.eduhas been embraced by the biomedical research community

31

Summary

http://bejerano.stanford.edu

Page 32: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

http://bejerano.stanford.edu/ 32

Bejerano Lab: Developmental Genomics & Evolutionary Developmental Genomics

Bejerano LabBruce Schaar, Ph.DGeetu Tuteja, Ph.D (Dean’s Fellow,)Michael Hiller, Ph.D (HFSP Fellow)Andrew Doxey, Ph.D. (NSERC Fellow)Cory McLean (Bio-X Fellow)Shoa Clarke (HHMI Gilliam Fellow)Aaron Wenger (SiGF Fellow)Harendra Guturu (NSF Fellow)Jim Notwell (NSF Fellow)Tisha ChungJose SoltrenSaatvik AgarwalJenny ChenSushant Shankar

StanfordDavid Kingsley & lab

Funding:NIH / NICHD, NHGRI NSF / STCPackard FoundationHFSP Young Investigator AwardSearle Scholar Network, Microsoft ResearchMallinckrodt Foundation, A.P. Sloan FoundationOkawa Foundation, Borroughs-Wellcome

wetdry

Page 33: including ChIP-Seq, Epi markers, GWAS etc.dldcc-web.brc.bcm.edu/lilab/xgen/genom_targets/Bejerano... · 2011. 3. 22. · 1,000,000 Genomic “switches” determine which and how much

http://bejerano.stanford.edu 33