Gene Regulatory Elements Discovered by Vertebrate Genome Comparisons Laboratory Heads Penn State...

Post on 02-Jan-2016


Gene Regulatory Elements Discovered by Vertebrate Genome Comparisons

Laboratory Heads

Penn State University: Center for Comparative Genomics and Bioinformatics

Webb Miller, Francesca Chiaromonte, Ross Hardison, Anton Nekrutenko

University of California at Santa Cruz:

David Haussler (HHMI), Jim Kent

Institute for Systems Biology

Arian Smit

University of Pennsylvania, Children’s Hospital of Philadelphia

Mitchell Weiss

Consortia for sequence and analysis of: Mouse, Rat, Chicken

DNA sequences of mammalian genomes

• Human: 2.9 billion bp, “finished”
– High quality, comprehensive sequence, very few gaps
• Mouse, rat, dog, opossum, chicken, frog, etc.
• About 40% of the human genome aligns with mouse
– This is conserved, but not all is under selection.
• About 5-6% of the human genome is under purifying selection since the rodent-primate divergence
• About 1.5% codes for protein
• The 4.5% of the human genome that is under selection but does not code for protein should have:
– Regulatory sequences
– Non-protein-coding genes
– Other important sequences
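The percentages above can be converted into approximate amounts of sequence. The following is a minimal sketch, assuming the 2.9 billion bp human genome size quoted on the slide; the helper name `mbp` is hypothetical and the results are rounded estimates, not measured values.

```python
# Human genome size from the slide, in million base pairs (Mbp).
GENOME_MBP = 2900

def mbp(percent):
    """Convert a percentage of the human genome into Mbp (rounded to 0.1)."""
    return round(GENOME_MBP * percent / 100, 1)

aligned_with_mouse = mbp(40)      # portion that aligns with mouse
under_selection_low = mbp(5)      # lower bound of the 5-6% estimate
under_selection_high = mbp(6)     # upper bound of the 5-6% estimate
protein_coding = mbp(1.5)         # protein-coding portion
noncoding_selected = mbp(4.5)     # selected but non-coding portion

for label, value in [
    ("aligns with mouse", aligned_with_mouse),
    ("under selection (5%)", under_selection_low),
    ("under selection (6%)", under_selection_high),
    ("codes for protein", protein_coding),
    ("selected, non-coding", noncoding_selected),
]:
    print(f"{label}: ~{value} Mbp")
```

With the 5% figure this gives about 145 Mbp under purifying selection.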

Alignment of vertebrate genomes

• blastZ for pairwise alignments
• multiZ for multiple alignment
– Human, chimp, mouse, rat, chicken, dog
– Organize local alignments
– Chains and nets
• All against all comparisons
– High sensitivity and specificity
• Computer cluster at UC Santa Cruz
– 1024 cpus Pentium III
– Job takes about half a day
• Results available at
– UCSC Genome Browser http://genome.ucsc.edu
– GALA database: http://www.bx.psu.edu

Scott Schwartz, Webb Miller, David Haussler, Jim Kent

Schwartz et al., 2003, blastZ, Genome Research

Blanchette et al., 2004, TBA and multiZ, Genome Research

Genome-wide local alignment chains

H: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.

blastZ: Each segment of human is given the opportunity to align with all mouse sequences.

Run blastZ in parallel for all human segments. Collect all local alignments above threshold.

Organize local alignments into a set of chains based on position in assembly and orientation.

[Figure labels: Human, Mouse, Level 1 chain, Level 2 chain, Net]
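The chaining step can be illustrated with a small dynamic program. This is a simplified sketch, not the algorithm used to build the UCSC chains and nets: it ignores gap penalties, assumes half-open coordinates already expressed on the aligned strand, and the `LocalAlignment` type and `best_chain` function are hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LocalAlignment:
    h_start: int   # start in human (half-open coordinates)
    h_end: int
    m_start: int   # start in mouse, on the aligned strand
    m_end: int
    strand: str    # "+" or "-"
    score: int

def best_chain(alignments, strand="+"):
    """Return the highest-scoring set of local alignments on one strand
    that are in the same order in both human and mouse."""
    blocks = sorted(
        (a for a in alignments if a.strand == strand),
        key=lambda a: (a.h_start, a.m_start),
    )
    if not blocks:
        return []
    best = [b.score for b in blocks]   # best chain score ending at block i
    prev = [None] * len(blocks)        # back-pointer for traceback
    for i, cur in enumerate(blocks):
        for j in range(i):
            p = blocks[j]
            # p may precede cur only if it ends before cur in both genomes
            if p.h_end <= cur.h_start and p.m_end <= cur.m_start:
                if best[j] + cur.score > best[i]:
                    best[i] = best[j] + cur.score
                    prev[i] = j
    i = max(range(len(blocks)), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(blocks[i])
        i = prev[i]
    return chain[::-1]

a = LocalAlignment(0, 100, 0, 100, "+", 50)
b = LocalAlignment(150, 250, 160, 260, "+", 40)
c = LocalAlignment(300, 400, 50, 120, "+", 30)   # out of order in mouse
d = LocalAlignment(120, 140, 500, 600, "+", 10)  # far away in mouse
chain = best_chain([a, b, c, d])
print([(x.h_start, x.h_end) for x in chain])
```

In this toy input, blocks `a` and `b` are colinear in both genomes and form the top-scoring chain; `c` and `d` would have to go into other chains.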

Comparative genomics to find functional sequences

[Figure: genome size in million base pairs (Mbp). Labels: Human, Mouse, Rat. Values shown: 2,900; 2,400; 2,500; 1,200]

Find common sequences: all mammals, 1000 Mbp; also birds, 72 Mbp

Identify functional sequences: ~ 145 Mbp

Papers in Nature from rat and chicken genome consortia, 2004

Conservation by type of function

Human-mouse-rat

Human-chicken

For several reference sets of human known functional DNA segments, what fraction aligns?

Chicken Genome Sequencing Consortium, 2004, submitted

Score alignments for level of conservation

• Multiple alignment scores (Margulies et al., 2003)
• PhyloHMMcons - PhastCons (Siepel and Haussler, 2003; Siepel et al. 2005)
– Phylogenetic Hidden Markov Model
– Posterior probability that a site is among the 10% most highly conserved sites
– Allows for variation in rates and autocorrelation in rates

Example of PhastCons output: UCSC Genome Browser

Available at http://genome.ucsc.edu/

Other ways to use alignments to find functional sequences

• Score alignments by frequency of matches to patterns distinctive for CRMs
– Regulatory potential (Elnitski et al., 2003; Kolbe et al., 2004)
• Factor binding sites conserved in human, mouse and rat
– tffind (from M. Weirauch, Schwartz et al., 2003)

Evaluate patterns in alignments to discriminate functional classes of DNA

1. Collapse the alignment to a small alphabet, e.g.
Match involving G or C = S
Match involving A or T = W
Transition = I
Transversion = V
Gap = G

Alignment
seq1 G T A C C T A C T A C G C A
seq2 G T G T C G - - A G C C C A
Collapsed alphabet S W I I S V G G V I S V S W

2. Is a pattern, e.g., SWIIS followed by V, found more frequently in alignments of known cis-regulatory modules (set of 93) or neutral DNA (200 ancestral repeats)?

3. The regulatory potential for any alignment measures the extent to which its patterns are more like those in regulatory regions than in neutral DNA.
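Step 1 can be sketched as a column-by-column translation of a pairwise alignment. This is a minimal illustration, assuming the two rows contain only A, C, G, T and "-" and have equal length; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def collapse_column(a, b):
    """Map one alignment column to the collapsed alphabet S, W, I, V, G."""
    if a == "-" or b == "-":
        return "G"                          # gap
    if a == b:
        return "S" if a in "GC" else "W"    # match on G/C or on A/T
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "I"                          # transition
    return "V"                              # transversion

def collapse(seq1, seq2):
    """Collapse two aligned rows of equal length into one symbol string."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        collapse_column(a, b) for a, b in zip(seq1.upper(), seq2.upper())
    )

# The example alignment from the slide.
print(collapse("GTACCTACTACGCA", "GTGTCG--AGCCCA"))  # SWIISVGGVISVSW
```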

$\frac{5/10}{1/6} = 3$

$\frac{1/4}{2/8} = 1$

$\frac{1/4}{3/6} = 0.5$

… A A G C C C G — A T A A C G G G C G C G C C C C C T T T A T A T A C C C …

… T A G C C G G A A T A A C G G G G C G C G C C C C T T T A T A T A C A C …

……………………………………s1 s2 s3 s4 s5 s6 s7 ………………………..sW ………….

$$RP = \sum_{t=1}^{W} \log\left(\frac{p_{REG}(s_t \mid s_{t-1} \ldots s_{t-T})}{p_{AR}(s_t \mid s_{t-1} \ldots s_{t-T})}\right)$$

W: sliding window; T: order; $s \in A$ (collapsed alphabet)

Order T Markov Model on A: transition probabilities estimated on REG training data

$$p_{REG}(s_t \mid s_{t-1} \ldots s_{t-T}) = \frac{f_{REG}(s_{t-T} \ldots s_{t-1} s_t)}{f_{REG}(s_{t-T} \ldots s_{t-1})}$$

Order T Markov Model on A: transition probabilities estimated on AR training data

$$p_{AR}(s_t \mid s_{t-1} \ldots s_{t-T}) = \frac{f_{AR}(s_{t-T} \ldots s_{t-1} s_t)}{f_{AR}(s_{t-T} \ldots s_{t-1})}$$
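The two Markov models and the log-ratio score can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: f is treated as a count of symbol strings in the training alignments, a pseudocount is added to avoid division by zero (an assumption not shown on the slides), and scoring starts at the first position in the window that has a full context of T symbols.

```python
import math
from collections import Counter

ALPHABET = "SWIVG"  # collapsed alphabet used in the slides

def train(sequences, order, pseudocount=1.0):
    """Estimate order-T transition probabilities from collapsed strings.
    Returns a function prob(symbol, context)."""
    context_counts = Counter()   # counts of s_{t-T}...s_{t-1}
    full_counts = Counter()      # counts of s_{t-T}...s_{t-1} s_t
    for seq in sequences:
        for t in range(order, len(seq)):
            context = seq[t - order:t]
            context_counts[context] += 1
            full_counts[context + seq[t]] += 1

    def prob(symbol, context):
        # Smoothed ratio of counts; the pseudocount is an assumption.
        numerator = full_counts[context + symbol] + pseudocount
        denominator = context_counts[context] + pseudocount * len(ALPHABET)
        return numerator / denominator

    return prob

def rp_score(window, p_reg, p_ar, order):
    """Sum of log ratios of REG to AR transition probabilities in a window."""
    total = 0.0
    for t in range(order, len(window)):
        context = window[t - order:t]
        total += math.log(p_reg(window[t], context) / p_ar(window[t], context))
    return total

# Toy training data (hypothetical): REG rich in S, AR rich in W.
p_reg = train(["SSSSSS"], order=1)
p_ar = train(["WWWWWW"], order=1)
print(rp_score("SSS", p_reg, p_ar, order=1))  # positive: REG-like
print(rp_score("WWW", p_reg, p_ar, order=1))  # negative: AR-like
```

A positive score means the window's patterns are more frequent in the regulatory training set than in ancestral repeats; a negative score means the opposite.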

Regulatory Potential Score

Training

Genome-wide computation

RP scores have good discriminatory power

Kolbe et al., 2004, Genome Research

Alignment-based scores can find some but not all known CRMs in the HBB complex

King et al., submitted

RP has better performance than phastCons or MCS

King et al., submitted

Other CRMs are easier to identify than those in the HBB complex

King et al., submitted

Binding sites conserved between species

• tffind: Identify high-quality matches to a weight matrix in one sequence (e.g. human) that also aligns with other sequences (e.g. mouse and rat)

• Look for matches to the weight matrix in the 2nd and 3rd sequences, in the part of the alignment that aligns with the weight-matrix match in the first species

• GALA records these matches
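The idea of a conserved binding-site match can be sketched on a small multiple alignment. This is not the tffind program: the toy matrix, the threshold, the function names and the sequences are all hypothetical, only the forward strand is scanned, and a real weight matrix would hold log-odds scores rather than 0/1 values.

```python
# Toy 6-column matrix loosely shaped like a GATA motif (hypothetical).
TOY_GATA_MATRIX = [
    {"A": 1.0, "T": 1.0},
    {"G": 1.0},
    {"A": 1.0},
    {"T": 1.0},
    {"A": 1.0},
    {"A": 1.0, "G": 1.0},
]

def score(site, matrix):
    """Sum the matrix scores for each base of a candidate site."""
    return sum(column.get(base, 0.0) for column, base in zip(matrix, site))

def conserved_matches(rows, matrix, threshold, ref="human"):
    """rows maps species name to a gapped alignment row (equal lengths).
    Returns ungapped start positions in the reference where the reference
    matches the matrix and every other species matches in the aligned part."""
    width = len(matrix)
    ref_row = rows[ref]
    columns = [i for i, base in enumerate(ref_row) if base != "-"]
    ref_seq = ref_row.replace("-", "")
    hits = []
    for start in range(len(ref_seq) - width + 1):
        if score(ref_seq[start:start + width], matrix) < threshold:
            continue
        first, last = columns[start], columns[start + width - 1]
        conserved = True
        for name, row in rows.items():
            if name == ref:
                continue
            aligned = row[first:last + 1].replace("-", "")
            if len(aligned) != width or score(aligned, matrix) < threshold:
                conserved = False
                break
        if conserved:
            hits.append(start)
    return hits

alignment = {
    "human": "CCAGATAAGG-T",
    "mouse": "CCTGATAGGGAT",
    "rat":   "CCAGCTAAGG-T",
}
hm_only = {k: alignment[k] for k in ("human", "mouse")}
print(conserved_matches(hm_only, TOY_GATA_MATRIX, threshold=6.0))    # [2]
print(conserved_matches(alignment, TOY_GATA_MATRIX, threshold=6.0))  # []
```

In the toy alignment the human site AGATAA is matched by mouse (TGATAG) but not by rat (AGCTAA), so it counts as conserved only when rat is left out.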

HMR

not

Matt Weirauch

Genes co-expressed in late erythroid maturation

• Two somatic cell model systems:
– Murine erythroleukemia (MEL) cells: mature into late erythroblasts when induced with small organic compounds.
– G1E-ER cells: proerythroblast line from mice lacking the transcription factor GATA-1. Can restore the activity of GATA-1 by expressing an estrogen-responsive form of GATA-1.
• Use microarray analysis of each to find genes that increase or decrease expression upon induction. Many of the genes respond similarly in the two systems.

Walsh et al., (2004) BLOOD; Image from k-means cluster, GEO:

Genes whose expression increases during maturation, confirmed by RT-PCR

Predicting cis-regulatory modules (preCRMs)

Identify a genomic region with a regulated gene.

Find all intervals whose RP score exceeds an empirical threshold.

Subtract exons

Find all matches to GATA-1 binding sites that are conserved (cGATA-1_BS)

Intervals with RP scores above the threshold and with a cGATA-1_BS within 50bp are preCRMs.
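The prediction steps above can be sketched with simple interval arithmetic. This is a minimal illustration, assuming half-open coordinates on a single chromosome; the threshold value, the coordinates and the function names are hypothetical, since the slides give only an "empirical threshold".

```python
def subtract_exons(interval, exons):
    """Remove exon coordinates from one (start, end) interval."""
    start, end = interval
    pieces = []
    for ex_start, ex_end in sorted(exons):
        if ex_end <= start or ex_start >= end:
            continue                      # exon does not overlap
        if ex_start > start:
            pieces.append((start, ex_start))
        start = max(start, ex_end)
    if start < end:
        pieces.append((start, end))
    return pieces

def is_near(piece, site, max_gap=50):
    """True if the site overlaps the piece or lies within max_gap bp of it."""
    gap = max(site[0] - piece[1], piece[0] - site[1], 0)
    return gap <= max_gap

def predict_precrms(rp_intervals, exons, sites, rp_threshold, max_gap=50):
    """rp_intervals: (start, end, rp_score); sites: conserved GATA-1 sites."""
    precrms = []
    for start, end, rp in rp_intervals:
        if rp <= rp_threshold:
            continue                      # keep only high-RP intervals
        for piece in subtract_exons((start, end), exons):
            if any(is_near(piece, site, max_gap) for site in sites):
                precrms.append(piece)
    return precrms

rp_intervals = [(100, 400, 2.3), (1000, 1200, 2.5), (2000, 2100, 1.0)]
exons = [(200, 250)]
cgata1_sites = [(420, 426)]
result = predict_precrms(rp_intervals, exons, cgata1_sites, rp_threshold=2.0)
print(result)  # [(250, 400)]
```

Here the first interval is split around the exon, and only the piece within 50 bp of the conserved site is kept.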

Test predicted cis-regulatory modules (preCRMs)

• Amplify the preCRMs and test them by
– (1) Enhancement in transient transfections of erythroid cells
– (2) Activation and induction of reporter genes after site-directed, stable integration in erythroid cells
– (3) Chromatin immunoprecipitation (ChIP) for GATA-1

Transient transfection assay for enhancers

Dual luciferase assay in K562 cells

Test: [test fragment][HBG prom][FF luciferase] with [tk prom][Ren luciferase]

Compare to: [HBG prom][FF luciferase] with [tk prom][Ren luciferase]

[Figure: bar chart, vertical axis 0 to 14. Labels: MCS, HS2, Alas2 pCRM1, I, II. Positive control: 30-fold]
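A fold change from a dual luciferase assay is commonly computed by normalizing firefly (FF) to Renilla (Ren) activity and comparing the test construct with the parent construct. The sketch below assumes that convention; the slides do not spell out the calculation, and the readings are made-up numbers.

```python
def normalized(firefly, renilla):
    """Firefly signal corrected for transfection efficiency via Renilla."""
    return firefly / renilla

def fold_change(test, parent):
    """test and parent are (firefly, Renilla) readings for the construct
    with the test fragment and for the parent construct without it."""
    return normalized(*test) / normalized(*parent)

# Hypothetical luminometer readings, for illustration only.
test_reading = (3000.0, 100.0)
parent_reading = (500.0, 50.0)
print(fold_change(test_reading, parent_reading))  # 3.0
```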

Negative controls do not enhance transient expression

[Figure: bar chart of fold change, axis 0 to 7, for parentLuc and the negative-control fragments Fog1N1, Fog1N2, Hipk2N2, Gata2N2, Alas2N1, HS2N1, HS2N2, Alas2N2, Vav2N1, Vav2N2, CdmN1, Coro2aN1, Gata2r.2N1]

Negative controls are segments of mouse DNA that align with rat and human but have low RP scores and do not have a match to a GATA-1 binding site. They have almost no effect on the level of expression of the reporter gene in erythroid cells.

7 of 24 Zfpm1 preCRMs enhance transient expression

Site-directed recombination to stably integrate expression cassettes

Recombinase-mediated cassette exchange, Bouhassira et al.

9 of 24 Zfpm1 preCRMs enhance after stable integration at RL5

PreCRMs in Fog1 bind GATA-1 in vivo

Chromatin immunoprecipitation assay

13 of 24 Zfpm1 preCRMs are validated in at least one assay

Validated

Not validated

Validation of preCRM in Alas2

All preCRMs in Gata2 are functional in at least one assay

ChIP data are from publications from E. Bresnick’s lab.

Frequent validation at 3 other loci

Infrequent validation of preCRMs in Hipk2

Assay                      Number tested   Number positive   % validated
GATA-1 ChIPs               5               5                 100
Transient transfections    64              18                28
Site-directed integrants   54              24                44
All assays                 64              33                52

About half of the preCRMs are validated as functional

Omitting Hipk2, validation rate increases to 67%

Assay                      Number tested   Number positive   % validated
GATA-1 ChIPs               5               5                 100
Transient transfections    45              17                38
Site-directed integrants   43              23                53
All assays                 45              30                67
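The percentages in the two tables follow directly from the counts. The sketch below recomputes them and, assuming the second table is simply the first with the Hipk2 preCRMs removed, derives the Hipk2-only counts by subtraction; that derived row is an inference, not a figure from the slides.

```python
def percent_validated(tested, positive):
    """Validation rate as a whole-number percentage."""
    return round(100 * positive / tested)

# (number tested, number positive) taken from the two tables.
ALL_LOCI = {
    "GATA-1 ChIPs": (5, 5),
    "Transient transfections": (64, 18),
    "Site-directed integrants": (54, 24),
    "All assays": (64, 33),
}
WITHOUT_HIPK2 = {
    "GATA-1 ChIPs": (5, 5),
    "Transient transfections": (45, 17),
    "Site-directed integrants": (43, 23),
    "All assays": (45, 30),
}

for assay in ALL_LOCI:
    all_rate = percent_validated(*ALL_LOCI[assay])
    sub_rate = percent_validated(*WITHOUT_HIPK2[assay])
    print(f"{assay}: {all_rate}% overall, {sub_rate}% without Hipk2")

# Inferred Hipk2-only counts (difference between the two tables).
hipk2_tested = ALL_LOCI["All assays"][0] - WITHOUT_HIPK2["All assays"][0]
hipk2_positive = ALL_LOCI["All assays"][1] - WITHOUT_HIPK2["All assays"][1]
print(f"Hipk2 alone: {hipk2_positive} of {hipk2_tested} validated")
```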

                        N    Mean %G+C   StDev
preCRMs not validated   31   50.53       8.54
preCRMs validated       32   54.87       6.46

Difference = mu (“false positive”) - mu (verified) = -4.35 %G+C
t-Test of difference = 0 (vs not =): T-Value = -2.27, P-Value = 0.027, DF = 55

%G+C is higher in validated preCRMs
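The reported t-value and degrees of freedom can be reproduced from the summary statistics. The sketch below assumes an unequal-variance (Welch) two-sample t-test with the degrees of freedom truncated to a whole number; the slides do not name the test, but these assumptions appear consistent with the values shown. The p-value is not computed here because it needs a t-distribution routine from outside the standard library.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# %G+C summary statistics: not validated versus validated preCRMs.
t, df = welch_t(50.53, 8.54, 31, 54.87, 6.46, 32)
print(f"t = {t:.2f}, df = {math.floor(df)}")  # t = -2.27, df = 55
```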

                        N    Mean RP score   StDev   Mean phastCons   StDev
preCRMs not validated   31   2.020           0.381   0.511            0.229
preCRMs validated       32   2.232           0.456   0.571            0.210
Difference                   -0.212                  -0.061

t-Test of difference = 0 (vs not =):
RP score: t-value = -2.01, p-value = 0.049, df = 59
phastCons: t-value = -1.10, p-value = 0.277, df = 60

Average scores for RP are significantly higher in validated preCRMs

Lab Folks

Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King

GALA: database of Genome ALignments and Annotation http://www.bx.psu.edu/

• Database for human, chimp, mouse, rat, and chicken genomes
• Whole-genome sequence alignments
– 16 million alignments for human-mouse-rat
– Probabilities of sequences being under selection (200 million)
– Goodness of fit to models of alignments in known regulatory regions (RP-scores) (200 million)
• Annotations
– Known and predicted genes (39,000)
– Microarray data from GNF (14,000 genes, multiple tissues)
– Transcription factor binding sites (190 million)
– Conserved factor binding sites (4 million, HMR)
• Integrate information
• Simple or complex queries

Yi Zhang, Cathy Riemer, Belinda Giardine

Galaxy metaserver and data sources

www.bx.psu.edu

Galaxy Portal page

UCSC Bioinformatics Table Browser

Galaxy History Page

Operations: Intersection, Clustering

Output to UCSC Genome Browser

Conclusions

• Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).

• Alignments can be used to predict certain functional regions, such as coding exons and some cis-regulatory elements.

• The predictions of cis-regulatory elements for erythroid genes have a good validation rate.

• Databases such as the UCSC Table Browser, GALA and Galaxy provide access to these data.