Gene Regulatory Elements Discovered by Vertebrate Genome Comparisons


Page 1

Gene Regulatory Elements Discovered by Vertebrate Genome Comparisons

Laboratory Heads

Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Ross Hardison, Anton Nekrutenko

University of California at Santa Cruz: David Haussler (HHMI), Jim Kent

Institute for Systems Biology: Arian Smit

University of Pennsylvania, Children’s Hospital of Philadelphia: Mitchell Weiss

Consortia for sequence and analysis of: Mouse, Rat, Chicken

Page 2

DNA sequences of mammalian genomes

• Human: 2.9 billion bp, “finished”
– High quality, comprehensive sequence, very few gaps
• Mouse, rat, dog, opossum, chicken, frog, etc.
• About 40% of the human genome aligns with mouse
– This is conserved, but not all is under selection.
• About 5-6% of the human genome has been under purifying selection since the rodent-primate divergence
• About 1.5% codes for protein
• The 4.5% of the human genome that is under selection but does not code for protein should have:
– Regulatory sequences
– Non-protein coding genes
– Other important sequences
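As a rough check on the percentages above, the fractions can be converted into base pairs. This is a minimal sketch using only the numbers on the slide; the function and variable names are illustrative. Note that the 4.5% figure corresponds to the upper end of the 5-6% range minus the 1.5% that codes for protein.

```python
GENOME_MBP = 2900  # human genome size from the slide: 2.9 billion bp

def percent_to_mbp(percent, genome_mbp=GENOME_MBP):
    """Convert a percentage of the genome into million base pairs (Mbp)."""
    return genome_mbp * percent / 100

aligned_with_mouse = percent_to_mbp(40)
selected_low = percent_to_mbp(5)           # low end of the 5-6% range
selected_high = percent_to_mbp(6)          # high end of the 5-6% range
coding = percent_to_mbp(1.5)
noncoding_selected = percent_to_mbp(6 - 1.5)  # the 4.5% on the slide

print(f"aligns with mouse:       {aligned_with_mouse:.0f} Mbp")
print(f"under selection:         {selected_low:.0f} to {selected_high:.0f} Mbp")
print(f"protein coding:          {coding:.1f} Mbp")
print(f"selected but non-coding: {noncoding_selected:.1f} Mbp")
```

The low end of the range (5% of 2,900 Mbp) gives 145 Mbp of sequence under selection.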

Page 3

Alignment of vertebrate genomes

• blastZ for pairwise alignments
• multiZ for multiple alignment
– Human, chimp, mouse, rat, chicken, dog
– Organize local alignments: chains and nets
• All-against-all comparisons
– High sensitivity and specificity
• Computer cluster at UC Santa Cruz
– 1024 Pentium III CPUs
– Job takes about half a day
• Results available at
– UCSC Genome Browser: http://genome.ucsc.edu
– GALA database: http://www.bx.psu.edu

Scott Schwartz, Webb Miller, David Haussler, Jim Kent

Schwartz et al., 2003, blastZ, Genome Research
Blanchette et al., 2004, TBA and multiZ, Genome Research

Page 4

Genome-wide local alignment chains

H: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.

blastZ: Each segment of human is given the opportunity to align with all mouse sequences.

Run blastZ in parallel for all human segments. Collect all local alignments above threshold.

Organize local alignments into a set of chains based on position in assembly and orientation.

[Figure: human and mouse assemblies with a level 1 chain and a level 2 chain, labeled "Net".]
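The segmentation step can be sketched as follows. This is an illustration only, not the pipeline's actual code: it treats the assembly as one contiguous 2.9 Gb coordinate space, which yields 290 windows of 10 Mb. The slide's figure of 300 is presumably rounded or reflects splitting chromosome by chromosome (an assumption). Each window could then be submitted as an independent alignment job.

```python
def split_into_segments(assembly_length, segment_length):
    """Yield half-open (start, end) windows that tile the assembly."""
    for start in range(0, assembly_length, segment_length):
        yield start, min(start + segment_length, assembly_length)

HUMAN_BP = 2_900_000_000   # 2.9 Gb assembly, from the slide
SEGMENT_BP = 10_000_000    # 10 Mb segments, from the slide

segments = list(split_into_segments(HUMAN_BP, SEGMENT_BP))
print(len(segments))       # number of independent alignment jobs
print(segments[0], segments[-1])
```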

Page 5

Comparative genomics to find functional sequences

[Figure: genome sizes in million base pairs (Mbp). Values shown: 2,900; 2,400; 2,500; 1,200. Species labels: Human, Mouse, Rat.]

• Find common sequences: all mammals, 1000 Mbp
• Identify functional sequences: ~145 Mbp
• Also birds: 72 Mb

Papers in Nature from rat and chicken genome consortia, 2004

Page 6

Conservation by type of function

Human-mouse-rat

Human-chicken

For several reference sets of human known functional DNA segments, what fraction aligns?

Chicken Genome Sequencing Consortium, 2004, submitted

Page 7

Score alignments for level of conservation

• Multiple alignment scores (Margulies et al., 2003)
• PhyloHMMcons - PhastCons (Siepel and Haussler, 2003; Siepel et al., 2005)
– Phylogenetic Hidden Markov Model
– Posterior probability that a site is among the 10% most highly conserved sites
– Allows for variation in rates and autocorrelation in rates

Page 8

Example of PhastCons output: UCSC Genome Browser

Available at http://genome.ucsc.edu/

Page 9

Other ways to use alignments to find functional sequences

• Score alignments by frequency of matches to patterns distinctive for CRMs
– Regulatory potential (Elnitski et al., 2003; Kolbe et al., 2004)
• Factor binding sites conserved in human, mouse and rat
– tffind (from M. Weirauch, Schwartz et al., 2003)

Page 10

Evaluate patterns in alignments to discriminate functional classes of DNA

1. Collapse the alignment to a small alphabet, e.g.

Match involving G or C = S
Match involving A or T = W
Transition = I
Transversion = V
Gap = G

Alignment seq1:     G T A C C T A C T A C G C A
Alignment seq2:     G T G T C G - - A G C C C A
Collapsed alphabet: S W I I S V G G V I S V S W

2. Is a pattern, e.g., SWIIS followed by V, found more frequently in alignments of known cis-regulatory modules (set of 93) or neutral DNA (200 ancestral repeats)?

3. The regulatory potential for any alignment measures the extent to which its patterns are more like those in regulatory regions than in neutral DNA.

Example ratios shown on the slide: (5/10)/(1/6) = 3; (1/4)/(2/8) = 1; (1/4)/(3/6) = 0.5
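The collapsing step in item 1 can be sketched in a few lines. The function names are illustrative; the symbol definitions and the example alignment are taken from the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def collapse_column(a, b):
    """Map one pairwise alignment column to the 5-symbol alphabet."""
    if a == "-" or b == "-":
        return "G"                          # gap
    if a == b:
        return "S" if a in "GC" else "W"    # match involving G/C or A/T
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "I" if same_class else "V"       # transition or transversion

def collapse_alignment(seq1, seq2):
    """Collapse two aligned sequences of equal length, column by column."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return "".join(collapse_column(a, b)
                   for a, b in zip(seq1.upper(), seq2.upper()))

seq1 = "GTACCTACTACGCA"
seq2 = "GTGTCG--AGCCCA"
print(collapse_alignment(seq1, seq2))  # SWIISVGGVISVSW
```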

Page 11

Regulatory Potential Score

… A A G C C C G - A T A A C G G G C G C G C C C C C T T T A T A T A C C C …
… T A G C C G G A A T A A C G G G G C G C G C C C C T T T A T A T A C A C …
… s1 s2 s3 s4 s5 s6 s7 … sW …

$$RP = \sum_{t=1}^{W} \log\left(\frac{p_{REG}(s_t \mid s_{t-1} \ldots s_{t-T})}{p_{AR}(s_t \mid s_{t-1} \ldots s_{t-T})}\right)$$

where W is the sliding window, T is the order, and each s_t ∈ A (the collapsed alphabet).

Order T Markov Model on A: transition probabilities estimated on REG training data

$$p_{REG}(s_t \mid s_{t-1} \ldots s_{t-T}) = \frac{f_{REG}(s_{t-T} \ldots s_{t-1}\, s_t)}{f_{REG}(s_{t-T} \ldots s_{t-1})}$$

Order T Markov Model on A: transition probabilities estimated on AR training data

$$p_{AR}(s_t \mid s_{t-1} \ldots s_{t-T}) = \frac{f_{AR}(s_{t-T} \ldots s_{t-1}\, s_t)}{f_{AR}(s_{t-T} \ldots s_{t-1})}$$

[Panel labels: Training; Genome-wide computation]
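A minimal sketch of this scoring scheme follows: two order-T Markov models are estimated from counts on collapsed-alphabet strings, and a window is scored by summing the log-odds. The training strings, the pseudocount (added so unseen patterns do not give zero probabilities) and the function names are assumptions for illustration, not the published implementation.

```python
import math
from collections import Counter

ALPHABET = "SWIVG"  # collapsed alphabet from the previous slide

def train(sequences, order, pseudocount=1.0):
    """Estimate p(symbol | previous `order` symbols) from counts."""
    context_counts = Counter()
    full_counts = Counter()
    for seq in sequences:
        for t in range(order, len(seq)):
            context = seq[t - order:t]
            context_counts[context] += 1
            full_counts[context + seq[t]] += 1

    def prob(symbol, context):
        # Pseudocounts are an assumption added to avoid zero probabilities.
        num = full_counts[context + symbol] + pseudocount
        den = context_counts[context] + pseudocount * len(ALPHABET)
        return num / den

    return prob

def rp_score(window, p_reg, p_ar, order):
    """Sum of log-odds over window positions that have a full context."""
    return sum(
        math.log(p_reg(window[t], window[t - order:t])
                 / p_ar(window[t], window[t - order:t]))
        for t in range(order, len(window))
    )

# Made-up training strings standing in for regulatory (REG) and
# ancestral-repeat (AR) alignments.
ORDER = 2
p_reg = train(["SSWSSWSSISSWSS", "SWSSSWSSWS"], ORDER)
p_ar = train(["WVIGGWVIWVGGIV", "VIWGGVIWVI"], ORDER)

print(rp_score("SSWSSWSS", p_reg, p_ar, ORDER))   # REG-like window
print(rp_score("WVIGGWVI", p_reg, p_ar, ORDER))   # AR-like window
```

A positive score means the window's patterns look more like the regulatory training set; a negative score means they look more like the neutral set.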

Page 12

RP scores have good discriminatory power

Kolbe et al., 2004, Genome Research

Page 13

Alignment-based scores can find some but not all known CRMs in the HBB complex

King et al., submitted

Page 14

RP has better performance than phastCons or MCS

King et al., submitted

Page 15

Other CRMs are easier to identify than those in the HBB complex

King et al., submitted

Page 16

Binding sites conserved between species

• tffind: Identify high-quality matches to a weight matrix in one sequence (e.g. human) that also aligns with other sequences (e.g. mouse and rat)

• Look for matches to weight matrix in 2nd and 3rd sequences, in the part of the alignment that aligns to match to weight matrix in first species

• GALA records these matches
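The idea of a conserved binding-site match can be sketched as below. This is not the tffind program: the weight matrix is a made-up toy for a GATA core, the threshold is arbitrary, and windows containing alignment gaps are simply skipped, which is a simplification.

```python
# Toy weight matrix for the core "GATA"; all scores are hypothetical.
TOY_PWM = [{"G": 2.0}, {"A": 2.0}, {"T": 2.0}, {"A": 2.0}]
MISMATCH = -2.0

def score(site, pwm=TOY_PWM):
    """Sum the per-position weights for one candidate site."""
    return sum(pwm[i].get(base, MISMATCH) for i, base in enumerate(site))

def conserved_matches(aligned_rows, pwm=TOY_PWM, threshold=8.0):
    """Return alignment columns where every row matches the matrix.

    aligned_rows: equal-length aligned sequences, first row = reference.
    """
    width = len(pwm)
    hits = []
    for start in range(len(aligned_rows[0]) - width + 1):
        slices = [row[start:start + width] for row in aligned_rows]
        if any("-" in s for s in slices):
            continue  # simplification: ignore gapped windows
        if all(score(s, pwm) >= threshold for s in slices):
            hits.append(start)
    return hits

human = "CCAGATAAGG"
mouse = "CCTGATAAGC"
rat   = "CCTGATTAGC"
print(conserved_matches([human, mouse]))       # [3]
print(conserved_matches([human, mouse, rat]))  # []
```

In the second call the rat sequence has GATT at the aligned position, so the human match is not counted as conserved in all three species.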

[Figure labels: HMR; not]

Matt Weirauch

Page 17

Genes co-expressed in late erythroid maturation

• Two somatic cell model systems:
– Murine erythroleukemia (MEL) cells: mature into late erythroblasts when induced with small organic compounds.
– G1E-ER cells: proerythroblast line from mice lacking the transcription factor GATA-1. The activity of GATA-1 can be restored by expressing an estrogen-responsive form of GATA-1.
• Use microarray analysis of each to find genes that increase or decrease expression upon induction. Many of the genes respond similarly in the two systems.

Walsh et al., (2004) BLOOD; Image from k-means cluster, GEO:

Page 18

Genes whose expression increases during maturation, confirmed by RT-PCR

Page 19

Predicting cis-regulatory modules (preCRMs)

Identify a genomic region with a regulated gene.

Find all intervals whose RP score exceeds an empirical threshold.

Subtract exons

Find all matches to GATA-1 binding sites that are conserved (cGATA-1_BS)

Intervals with RP scores above the threshold and with a cGATA-1_BS within 50bp are preCRMs.
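The steps above can be sketched as simple interval arithmetic. The coordinates are invented, the function names are hypothetical, and "within 50 bp" is interpreted here as a gap of at most 50 bp between the interval and the binding site (or an overlap), which is one reading of the slide.

```python
def subtract(interval, blockers):
    """Return the parts of a half-open interval not covered by blockers."""
    start, end = interval
    pieces = []
    for b_start, b_end in sorted(blockers):
        if b_end <= start or b_start >= end:
            continue                      # blocker does not touch interval
        if b_start > start:
            pieces.append((start, b_start))
        start = max(start, b_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces

def near(interval, site, max_gap=50):
    """True if the site overlaps or lies within max_gap bp of the interval."""
    start, end = interval
    s_start, s_end = site
    return s_start <= end + max_gap and s_end >= start - max_gap

def predict_precrms(rp_intervals, exons, cgata1_sites, max_gap=50):
    """High-RP intervals, minus exons, with a conserved GATA-1 site nearby."""
    precrms = []
    for interval in rp_intervals:
        for piece in subtract(interval, exons):
            if any(near(piece, site, max_gap) for site in cgata1_sites):
                precrms.append(piece)
    return precrms

# Invented coordinates for one locus.
rp_intervals = [(100, 400), (1000, 1200)]   # RP score above threshold
exons = [(200, 250)]
cgata1_sites = [(420, 426)]                 # conserved GATA-1 binding sites

print(predict_precrms(rp_intervals, exons, cgata1_sites))  # [(250, 400)]
```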

Page 20

Test predicted cis-regulatory modules (preCRMs)

• Amplify the preCRMs and test them by
– (1) Enhancement in transient transfections of erythroid cells
– (2) Activation and induction of reporter genes after site-directed, stable integration in erythroid cells
– (3) Chromatin immunoprecipitation (ChIP) for GATA-1

Page 21

Transient transfection assay for enhancers

Dual luciferase assay in K562 cells

• Test construct: test fragment, HBG promoter, FF luciferase; with tk promoter, Ren luciferase
• Compare to: HBG promoter, FF luciferase; with tk promoter, Ren luciferase

[Figure: bar chart, y-axis 0 to 14, with bars labeled MCS, HS2 and Alas2 pCRM1 (I, II). Positive control: 30-fold.]
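The slide does not spell out how the readout is computed. A common convention for dual luciferase assays, assumed here, is to normalize firefly (FF) activity to the co-transfected Renilla (Ren) control and then divide by the same ratio for the parent construct. The readings below are invented.

```python
def normalized_activity(firefly, renilla):
    """Firefly signal normalized to the Renilla transfection control."""
    return firefly / renilla

def fold_enhancement(test, parent):
    """Fold change of a test construct relative to the parent construct.

    test, parent: (firefly, renilla) luminescence readings.
    """
    return normalized_activity(*test) / normalized_activity(*parent)

# Invented readings: (firefly, renilla)
test_construct = (12000.0, 400.0)
parent_construct = (3000.0, 500.0)
print(fold_enhancement(test_construct, parent_construct))  # 5.0
```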

Page 22

Negative controls do not enhance transient expression

[Figure: bar chart of fold change (y-axis 0 to 7) for parentLuc, Fog1N1, Fog1N2, Hipk2N2, Gata2N2, Alas2N1, HS2N1, HS2N2, Alas2N2, Vav2N1, Vav2N2, CdmN1, Coro2aN1 and Gata2r.2N1.]

Negative controls are segments of mouse DNA that align with rat and human but have low RP scores and do not have a match to a GATA-1 binding site. They have almost no effect on the level of expression of the reporter gene in erythroid cells.

Page 23

7 of 24 Zfpm1 preCRMs enhance transient expression

Page 24

Site-directed recombination to stably integrate expression cassettes

Recombinase-mediated cassette exchange, Bouhassira et al.

Page 25

9 of 24 Zfpm1 preCRMs enhance after stable integration at RL5

Page 26

PreCRMs in Fog1 bind GATA-1 in vivo

Chromatin immunoprecipitation assay

Page 27

13 of 24 Zfpm1 preCRMs are validated in at least one assay

Validated

Not validated

Page 28

Validation of preCRM in Alas2

Page 29

All preCRMs in Gata2 are functional in at least one assay

ChIP data are from publications from E. Bresnick’s lab.

Page 30

Frequent validation at 3 other loci

Page 31

Infrequent validation of preCRMs in Hipk2

Page 32

About half of the preCRMs are validated as functional

Assay                     Number tested  Number positive  % validated
GATA-1 ChIPs                          5                5          100
Transient transfections              64               18           28
Site-directed integrants             54               24           44
All assays                           64               33           52

Page 33

Omitting Hipk2, validation rate increases to 67%

Assay                     Number tested  Number positive  % validated
GATA-1 ChIPs                          5                5          100
Transient transfections              45               17           38
Site-directed integrants             43               23           53
All assays                           45               30           67
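The percentages in the two validation tables can be recomputed from the counts, and subtracting the second table from the first gives the implied Hipk2-only numbers. The counts are from the slides; the Hipk2 figures are derived here by subtraction and are not stated in the source.

```python
# (number tested, number positive) per assay, from the two tables.
all_loci = {
    "GATA-1 ChIPs": (5, 5),
    "Transient transfections": (64, 18),
    "Site-directed integrants": (54, 24),
    "All assays": (64, 33),
}
omitting_hipk2 = {
    "GATA-1 ChIPs": (5, 5),
    "Transient transfections": (45, 17),
    "Site-directed integrants": (43, 23),
    "All assays": (45, 30),
}

def percent_validated(tested, positive):
    """Percentage of tested preCRMs that were positive, rounded."""
    return round(100 * positive / tested)

for label, table in (("All loci", all_loci), ("Omitting Hipk2", omitting_hipk2)):
    print(label)
    for assay, (tested, positive) in table.items():
        print(f"  {assay}: {percent_validated(tested, positive)}%")

# Derived by subtraction: preCRMs tested and validated in Hipk2 alone.
hipk2_tested = all_loci["All assays"][0] - omitting_hipk2["All assays"][0]
hipk2_positive = all_loci["All assays"][1] - omitting_hipk2["All assays"][1]
print(f"Hipk2 only: {hipk2_positive} of {hipk2_tested} validated")
```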

Page 34

%G+C is higher in validated preCRMs

                        N   Mean %G+C  StDev
preCRMs not validated  31       50.53   8.54
preCRMs validated      32       54.87   6.46

Difference = mu (“false positive”) - mu (verified) = -4.35 %G+C
t-Test of difference = 0 (vs not =): T-Value = -2.27, P-Value = 0.027, DF = 55

Page 35

Average scores for RP are significantly higher in validated preCRMs

                        N   Mean RP score  StDev  Mean phastCons  StDev
preCRMs not validated  31           2.020  0.381           0.511  0.229
preCRMs validated      32           2.232  0.456           0.571  0.210
Difference                         -0.212                 -0.061

t-Test of difference = 0 (vs not =):
RP scores: t-value = -2.01, p-value = 0.049, df = 59
phastCons: t-value = -1.10, p-value = 0.277, df = 60
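The t-values and degrees of freedom in the %G+C, RP and phastCons comparisons are consistent with an unequal-variance (Welch) two-sample t-test computed from the summary statistics. That the slides used this test is an inference from the reported DF values, not something they state. A minimal sketch:

```python
import math

def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# (n, mean, sd) for not-validated vs validated preCRMs, from the slides.
comparisons = {
    "%G+C":      (31, 50.53, 8.54, 32, 54.87, 6.46),
    "RP score":  (31, 2.020, 0.381, 32, 2.232, 0.456),
    "phastCons": (31, 0.511, 0.229, 32, 0.571, 0.210),
}

for name, stats in comparisons.items():
    t, df = welch_t(*stats)
    print(f"{name}: t = {t:.2f}, df = {df:.1f}")
```

Small differences in the last digit relative to the slides come from working with rounded means and standard deviations. The p-values additionally require the t-distribution CDF, which is not computed here.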

Page 36

Lab Folks

Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King

Page 37

GALA: database of Genome ALignments and Annotation
http://www.bx.psu.edu/

• Database for human, chimp, mouse, rat, and chicken genomes
• Whole-genome sequence alignments
– 16 million alignments for human-mouse-rat
– Probabilities of sequences being under selection (200 million)
– Goodness of fit to models of alignments in known regulatory regions (RP-scores) (200 million)
• Annotations
– Known and predicted genes (39,000)
– Microarray data from GNF (14,000 genes, multiple tissues)
– Transcription factor binding sites (190 million)
– Conserved factor binding sites (4 million, HMR)
• Integrate information
• Simple or complex queries

Yi Zhang, Cathy Riemer, Belinda Giardine

Page 38

Galaxy metaserver and data sources

www.bx.psu.edu

Page 39

Galaxy Portal page

Page 40

UCSC Bioinformatics Table Browser

Page 41

Galaxy History Page

Page 42

Operations: Intersection, Clustering

Page 43

Output to UCSC Genome Browser

Page 44

Conclusions

• Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).

• Alignments can be used to predict certain functional regions, such as coding exons and some cis-regulatory elements.

• The predictions of cis-regulatory elements for erythroid genes have a good validation rate.

• Databases such as the UCSC Table Browser, GALA and Galaxy provide access to these data.