Page 1:

ENCODE: understanding our genome

Ewan Birney

The ENCODE Project Consortium

Biosapiens Network of Excellence

Page 2:

Page 3:

ENCODE experiments

Page 4:

Area                     Assay                       Groups
Proteins                 Manual annotation, RT-PCR   Guigo, Harrow+Hubbard, Reymond
Transcripts              Tiling arrays               Gingeras, Snyder
Transcripts              Tag seq.                    Yijun, Riken
General chromatin marks  Tiling arrays, ChIP         Dunham, Reng
Sequence sp. factors     Tiling arrays, ChIP         Snyder, Gingeras, Farnham, Dunham
DNaseI sens.             PCR, tiling arrays          Stam., Crawford
Replication              Tiling arrays               Dutta
Conservation             Comparative sequence        Green, Sidow, Miller
DNA structure            Hydroxyl radical            Leib
Promoter                 Reporter assays             Myers

Page 5:

ENCODE Pilot

• A genome-wide project was considered too expensive, and it was too risky to decide on winning technologies up front (started in 2004)

• 1% of the genome (30 Mb) chosen; all experiments run on the same 1%

• Pilot phase ended
  – Analysis and publication
  – Scale-up to genome-wide now funded

Page 6:

A lot of ChIP-chip

Page 7:

Nowadays, a lot of ChIP-seq

Page 8:

Transcription

Page 9:

Transcription

• Lots of it
  – And not all of it genes
  – And even when it is inside a gene, not all of it with open reading frames
  – And even when it has an open reading frame, not all of it making sense (evolutionarily or structurally)!

• Not technical false positives

Page 10:

Protein-coding loci are far more complex than we think

• On average 5 transcripts per locus

• Many do not encode proteins (as far as we can see)

• Even among the ones which do encode proteins, many of the proteins look “weird”

Page 11:

Implausible structures

Page 12:

Many effects on potential function

Page 13:

Signal peptides, TM helices

• 1097 protein transcripts from 487 loci
  – 219 have signal peptides (107 loci)
  – 12 loci have an isoform without the signal peptide
  – 41 transcripts have a gain or loss of a transmembrane helix (sometimes up to 8!)

Page 14:

The Clade B Serpins: potential missing fragments

[Figure: panel (a) inactive, "stressed"; panel (b) active (beta inserted); panels (c) to (f) unlabelled]

Page 15:

Transcription Start Sites

Page 16:

Technologies on TSS

Gencode manual ann.
Unbiased TxFrag
Ditag data
CAGE data
Histone mod.
DNaseI sens.
Sequence sp. factors (e.g. Myc)

Page 17:

Integration Strategy

• Anchor on 5’ ends: GenCode 5’ and CAGE/DiTag
  – 16,051 unique TSS
  – 8,587 TSS “tight clusters”

• Categorise and assess using transcript-based evidence: exons, TxFrags, CpG islands
  – 5 different classes; first 4 have low P-values

• Assess categories with histone and TF data
  – First 4 categories have biological signals: 4,491 TSS
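The slides do not say how 5’ end tags were grouped into “tight clusters”. As a minimal sketch only, assuming a simple single-linkage rule on one strand, the idea can be illustrated as follows. The function name `cluster_positions`, the `max_gap` value and the coordinates are all hypothetical and are not taken from the ENCODE analysis.

```python
from typing import List


def cluster_positions(positions: List[int], max_gap: int = 60) -> List[List[int]]:
    """Group 5' end coordinates into clusters.

    A position joins the current cluster when it lies within max_gap bases
    of the previous (sorted) position; otherwise it starts a new cluster.
    max_gap is an illustrative assumption, not a value from the slides.
    """
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Toy coordinates, not ENCODE data.
tags = [1000, 1010, 1045, 5000, 5020, 9000]
for cluster in cluster_positions(tags):
    # Print cluster start, cluster end and number of supporting tags.
    print(cluster[0], cluster[-1], len(cluster))
```

A real pipeline would also have to treat strands separately and weigh tag counts, which this sketch leaves out.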

Page 18:

TSS Categories

Category      Number (non-redundant)  P-value of overlap
GenCode 5’    1730                    2e-70
Exon (sense)  1437                    6e-39
Exon (anti)   521                     3e-8
TxFrag        639                     7e-63
CpG           164                     4e-90
No support    2666
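The counts in this table can be tallied directly. The short check below uses only the numbers on this slide: the supported categories sum to the 4,491 TSS quoted on the Integration Strategy slide, and the GenCode 5’ share comes out at roughly 38%. The variable names are illustrative.

```python
# Counts copied from the TSS category table on this slide.
tss_categories = {
    "GenCode 5'": 1730,
    "Exon (sense)": 1437,
    "Exon (anti)": 521,
    "TxFrag": 639,
    "CpG": 164,
}

# Total number of TSS with some line of supporting evidence.
supported_total = sum(tss_categories.values())

# Share of supported TSS that coincide with an annotated GenCode 5' end.
gencode_fraction = tss_categories["GenCode 5'"] / supported_total

print(f"supported TSS: {supported_total}")
print(f"GenCode 5' share of supported TSS: {gencode_fraction:.1%}")
```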

Page 19:

GenCode 5’ ends

Page 20:

Unsupported tags

Page 21:

Novel TSSs

Page 22:

Conclusion

• There are 4,491 TSS with multiple lines of evidence supporting them

• This is ~10-fold more than the number of genes

• Only 38% would be traditionally classified as TSS (less if one took Ensembl or RefSeq)

Page 23:

Implications of many more TSSs

• Consistent with considerable diversity of transcripts

• Independently integrating ChIP-chip data suggested ~1,000 “Regulatory Clusters”
  – 25% proximal considering Ensembl/RefSeq
  – 65% when this TSS catalog is considered

Page 24:

More subtle conclusions

• Sequence-specific factors are distributed symmetrically around the TSS
  – Should we only be taking upstream regions for reporter genes?

• Histone information is highly correlated with gene on/off status
  – Generalising many locus-specific studies

Page 25:

Gene On/Off

Page 26:

Gene status prediction

Page 27:

Distal sites

Page 28:

Finding distal sites

• ChIP-chip not “great”
  – Most look close to one of these new TSSs
  – Factor bias?

• DNaseI hypersensitive sites (DHSs)
  – All factors give a DHS signal
  – 55% of DHSs are distal to any TSS
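A statement such as “55% of DHSs are distal to any TSS” depends on a distance cutoff that the slides do not give. The sketch below shows one way such a fraction could be computed, assuming a site counts as distal when its nearest TSS is more than `cutoff` bases away. The function name, the 2,500 bp default and the coordinates are assumptions for illustration, not values from the ENCODE analysis.

```python
import bisect
from typing import List


def distal_fraction(sites: List[int], tss: List[int], cutoff: int = 2500) -> float:
    """Return the fraction of sites whose nearest TSS is farther than cutoff.

    Assumes both lists are non-empty and refer to the same chromosome.
    The cutoff default is a hypothetical choice.
    """
    tss_sorted = sorted(tss)
    distal = 0
    for site in sites:
        # Index of the first TSS at or after the site.
        i = bisect.bisect_left(tss_sorted, site)
        # Only the TSS just before and just after the site can be nearest.
        neighbours = tss_sorted[max(i - 1, 0): i + 1]
        nearest = min(abs(site - t) for t in neighbours)
        if nearest > cutoff:
            distal += 1
    return distal / len(sites)


# Toy coordinates, not ENCODE data.
toy_tss = [10_000, 50_000]
toy_sites = [9_000, 12_000, 20_000, 49_000, 80_000]
print(distal_fraction(toy_sites, toy_tss))
```

The result is sensitive to both the cutoff and the TSS catalog used, which is the point the slide makes: a larger TSS catalog turns many apparently distal sites into proximal ones.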

Page 29:

Distal DHS

Page 30:

Most surveyed factors are proximal

Page 31:

Replication

Page 32:

H3K27me3 is correlated

Page 33:

Evolutionary conservation and ENCODE

Page 34:

Evolutionary conservation

Page 35:

…but not everything is constrained

Page 36:

Why is there a discrepancy?

• False positives in the experiments
  – But experiments validate at >80% and cross-validate each other

• False negatives in the constraint detection
  – But can detect elements as short as 8 bp, and within the “neutral” zone of alignability

• Neutral turnover model

Page 37:

Neutral biochemical events

Time

Page 38:

Lineage specific

Time

Page 39:

“Functional” conservation

Human / Mouse

Page 40:

Special case: Transcription

GeneRegulatory Information

Constrained sequence

Constrained sequence

Pre-miRNAs

Page 41:

What should we learn from ENCODE?

• “Whacky” transcription is real (but god knows what it does)
  – Unconventional transcripts

• Lots more TSSs than we understand
  – Many “distal” regions are actually close to promoters

• Broad specificity marks are more useful
  – DNaseI sites, histone marks

Page 42:

Neutral model for biochemical events on the genome

• That things happen reproducibly in multiple tissues does not imply selection

• (This is not the same as experimental variance)

• Could imply “functional” conservation outside of orthologous bases
  – Comparative genomics sequencing not enough (but a great starting point!)
  – Comparative functional investigation

Page 43:

Consortia work

• ENCODE
  – Experimentally led consortium
  – Needs a lot of computational collaboration

• Biosapiens
  – Computationally led consortium
  – Needs experimental collaboration (!)

• DNA: ENCODE
• Protein: Biosapiens

Page 44:

What happens next?

Page 45:

Ensembl Regulatory Build

[Figure labels: elements; Chr 14,5677077-567896; Status; GM06990 cells, Myc bound]

Page 46:

Initial Regulatory Build

• DNaseI hypersensitive sites, 6 histone modifications, CTCF binding

• ~110,000 elements, ~2 Mb of DNA

• 6,000 “promoter associated” by inherent pattern (DNaseI + H3K36me3)

• Available now

• This year: mouse, more classification

Page 47:

Regulatory build

Page 48:

Ensembl - at your service

• Web browser: www.ensembl.org
• MySQL DB access
• BioMart

• “Geek for a week”
  – You send someone to us for a week

• Xose for a day
  – We send someone to you for a day

Page 49:

The ENCODE Project Consortium

Damian Keefe, Yutao Fu, Zhiping Weng, Mike Snyder, Elliott Margulies, John Stam., Manolis Dermitzakis, Tom Gingeras, Roderic Guigo, Ian Dunham, Christophe Koch, Anindya Dutta, Paul Flicek and 293 others…

The Biosapiens Network of Excellence

Michael Tress, Alfonso Valencia, Janet Thornton, Roderic Guigo, Soren Brunak, David Jones, Martin Vingron, Anna Tramontano, Jacques van Helden and 57 others…