
Disease epigenomics: Interpreting non-coding variants using chromatin and activity signatures
Jason Ernst
Broad Institute of MIT and Harvard
MIT Computer Science & Artificial Intelligence Laboratory

Challenge: interpreting disease-associated variants
[Figure labels: disease-associated variant (SNP/CNV/), CATGACTG vs. CATGCCTG; gene annotation (coding, 5'/3' UTR, RNAs); evolutionary signatures; roles in gene/chromatin regulation; activator/repressor signatures; non-coding annotation; chromatin signatures; other evidence of function; signatures of selection (sp/pop)]
GWAS, case-control, reveal disease-associated variants
Molecular mechanism, cell-type specificity, drug targets

Challenges towards interpreting disease variants
Find true causative SNP among many candidates in LD
Use causal variant: predict function, pathway, drug targets
Non-coding variant: type of function, cell type of activity
Regulatory variant: upstream regulators, downstream targets
This talk: genomics tools for addressing these challenges

The good news: ever-expanding dimensions
[Figure: each point represents a genome-wide dataset; axes: chromatin marks, cell types]
Additional dimensions: environment, genotype, disease, gender, stage, age
Now: cell-type and chromatin-mark dimensions
Next: references for each background
All clearly needed, and increasingly available

Difficulty of interpreting increasing # tracks
Challenge: simplify
Learn combinations
Interpret function
Prioritize marks
Study dynamics

Challenge of data integration in many marks/cells
Epigenetic modifications (DNA/histone/nucleosome) encode epigenetic state
Histone code hypothesis: distinct function for distinct combinations of marks?
Hundreds of histone marks
Astronomical number of histone mark combinations
How do we find biologically relevant ones?
Unsupervised approach
Probabilistic model
Explicit combinatorics
Epigenomic information retains genome state in differentiation and development
Genome-wide modification maps
Hundreds of histone tail modifications already known
Two types: DNA methylation, histone marks
DNA packaged into chromatin around histone proteins

Genomic tools for disease SNP interpretation
Chromatin states: regulatory region annotation
  Combinatorial patterns of marks: chromatin states
  Distinct classes of promoter/enhancer/transcribed/repressed/repetitive states
  Reveal new genes, lincRNAs, enhancers, GWAS/SNP
Activity signatures: linking enhancer networks
  Correlated changes in expression, chromatin, motifs
  Link TFs to enhancers and enhancers to targets
  Predict causal cell-type specific activators/repressors
Interpreting disease variants
  Predicting SNP chromatin states and cell-type specificity
  Specific mechanistic predictions for disease SNPs
  Measuring selective pressures within human populations

ChromHMM: learning hidden chromatin states
[Figure: a stretch of DNA with a transcription start site, an enhancer, and a transcribed region, divided into 200bp intervals. Rows: observed chromatin marks (K4me3, K36me3, K4me1, K27ac), called based on a Poisson distribution; most likely hidden state (states 1 to 6); high-probability chromatin marks in each state (probabilities shown: 0.8, 0.9, 0.7)]
All probabilities are learned de novo from chromatin data alone (Baum-Welch, a.k.a. EM)
Each state: vector of emissions, vector of transitions
Ernst and Kellis, Nature Biotech 2010

Chromatin states for genome annotation
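The two steps named on the ChromHMM slide above (calling marks per 200bp interval against a Poisson background, then reading off the most likely hidden state) can be sketched in a few lines. This is a minimal illustration, not the ChromHMM implementation: the p-value cutoff, the three-state model, and all emission/transition values are made-up assumptions, and the parameters are fixed by hand rather than learned with Baum-Welch.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)          # pmf(0)
    cdf = term
    for i in range(1, k):          # accumulate pmf(1) .. pmf(k-1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def binarize(counts, p_threshold=1e-4):
    """Call a mark present in a bin when its read count is unlikely under a
    Poisson background whose mean is the average count per bin.
    The threshold is an illustrative assumption."""
    lam = sum(counts) / len(counts)
    return [1 if poisson_sf(c, lam) < p_threshold else 0 for c in counts]

def log_emission(obs, emit_probs):
    """Log-probability of a binary mark vector, treating marks as
    independent Bernoulli variables given the state."""
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, emit_probs))

def viterbi(observations, init, trans, emit):
    """Most likely state path for a sequence of binary mark vectors."""
    n = len(init)
    score = [math.log(init[s]) + log_emission(observations[0], emit[s])
             for s in range(n)]
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + log_emission(obs, emit[s]))
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical 3-state model over (K4me3, K4me1, K36me3):
# state 0 ~ promoter-like, 1 ~ enhancer-like, 2 ~ transcribed-like.
emit = [[0.9, 0.3, 0.05], [0.05, 0.8, 0.05], [0.02, 0.05, 0.8]]
trans = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
init = [1 / 3, 1 / 3, 1 / 3]
bins = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1)]
print(viterbi(bins, init, trans, emit))
```

In a real run the emission and transition vectors would be estimated from the binarized data; here they are fixed only so that the decoding step can be followed by hand.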
Learn de novo significant combinations of chromatin marks
Reveal functional elements, even without looking at sequence
Use for genome annotation
Use for studying regulation dynamics in different cell types
[Figure labels: promoter states, transcribed states, active intergenic, repressed]

Emerging large-scale genomic/epigenomic datasets
Multiple cell types
Diverse experiments
Developmental time-course
Reference Epigenome Mapping Centers
Used to study many disease epigenomes

ENCODE Chromatin Group (PI: Bernstein)
15-state model learned jointly
State classes: insulator, enhancer, promoter, transcribed, repressed, repetitive
9 chromatin marks + WCE: H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, H3K27me3, H4K20me1, H3K36me3, CTCF (+WCE, +RNA)
9 human cell types:
  HUVEC     Umbilical vein endothelial
  NHEK      Keratinocytes
  GM12878   Lymphoblastoid
  K562      Myelogenous leukemia
  HepG2     Liver carcinoma
  NHLF      Normal human lung fibroblast
  HMEC      Mammary epithelial cell
  HSMM      Skeletal muscle myoblasts
  H1        Embryonic
Cell type concatenation approach (figure labels: NHEK, HUVEC, H1)
Ensures common emission parameters
Verified with independent learning

Chromatin states capture coordinated mark changes
State definitions are cell-type invariant
Same combinations consistently found
State locations are cell-type specific
Can study pair-wise or multi-way changes

Chromatin states correlation with gene expression
[Figure axes: -50kb, TSS, +50kb; lower expression to higher expression]

Pair-wise changes reveal cell-type specific functions
Gene functional enrichments match cell function
Distinguish On, Off, and Poised promoter states

Introducing multi-cell activity profiles
[Figure: activity profiles across nine cell types (HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1). Panels: gene expression, chromatin states, active TF motif enrichment, TF regulator expression, dip-aligned motif biases. Legend: ON/OFF, active enhancer/repressed, motif enrichment/motif depletion, TF on/TF off, motif aligned/flat profile]

Enhancer vs. promoter dynamics
Promoters typically active in many cells
Enhancers exquisitely cell-type specific

Linking candidate enhancers to correlated target genes
1. Search for coherent changes between:
   gene expression
   chromatin marks at distant loci (10kb)
2. Combine two vectors:
   expression vector for each gene
   vector of mark intensities at distal locus (combine marks based on enhancer emissions)
3. High correlation: enhancer/target link
[Figure labels: 10kb, candidate enhancer, TM4SF1]

Predictive power of distal enhancer regions
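The three linking steps above can be sketched as follows. This is a minimal illustration under stated assumptions: Pearson correlation is used as the similarity measure, the per-mark weights stand in for the enhancer-state emissions, the 0.8 cutoff echoes the ">80% correlation" figure quoted in the talk, and the function names and toy values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def enhancer_signal(mark_levels, weights):
    """Collapse the per-mark intensities at one locus into a single value per
    cell type, weighting each mark (assumed: by its enhancer-state emission)."""
    return [sum(w * m for w, m in zip(weights, marks)) for marks in mark_levels]

def link_enhancers(expression, loci, weights, min_corr=0.8):
    """Return (locus, correlation) pairs whose combined mark signal tracks the
    gene's expression across cell types, best first."""
    links = []
    for locus_id, mark_levels in loci.items():
        r = pearson(expression, enhancer_signal(mark_levels, weights))
        if r >= min_corr:
            links.append((locus_id, r))
    return sorted(links, key=lambda pair: -pair[1])

# Toy data: one gene measured in five cell types, two candidate distal loci,
# two marks per locus. All numbers are invented for illustration.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]
loci = {
    "enh_A": [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0)],   # tracks expression
    "enh_B": [(5, 5), (4, 4), (3, 3), (2, 2), (1, 1)],   # anti-correlated
}
weights = (0.5, 0.5)
print(link_enhancers(expression, loci, weights))
```

Only the locus whose signal rises and falls with expression is retained; the anti-correlated locus is dropped by the cutoff.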
[Figure: correlation of individual regions, sorted by rank; y-axis: mark intensity correlation with expression; series: 10kb upstream, 100kb upstream, 10kb/100kb controls]
At least 100 regions with >80% correlation

Coordinated activity reveals enhancer links
[Figure panels: enhancer activity, gene activity, predicted regulators, activity signatures for each TF]
Distal enhancers are hard to integrate in regulatory models
Linked to target genes based on coordinated activity
Linked to upstream regulators using TF expression & motifs

Nucleosome Positioning Footprints Support Transcription Factor Cell-Type Predictions
[Figure: tag enrichment for H3K27ac]

Enhancer annotation revisits disease SNPs
Previously unlinked phenotypes enriched for cell-type specific enhancers

Application 1: Pinpoint disease SNPs in enhancers
Much smaller fraction of genome considered
Strong enhancers 1.9%, weak 2.8%, promoter 1.4%

Application 2: Make much more precise predictions
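Because each state covers only a small fraction of the genome, even a handful of disease SNPs landing in it can be a strong enrichment. The sketch below shows one simple way to quantify that, assuming a fold-enrichment ratio and a one-sided binomial test; the talk does not say which statistic was used. The 1.9% coverage comes from the slide, while the SNP counts are invented for illustration.

```python
import math

def fold_enrichment(n_in_state, n_total, genome_fraction):
    """Observed fraction of SNPs in a state divided by the fraction of the
    genome that the state covers."""
    return (n_in_state / n_total) / genome_fraction

def binomial_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k SNPs
    in the state if SNPs fell uniformly at random across the genome."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical study: 100 associated SNPs, 8 of them in strong enhancers,
# which cover 1.9% of the genome.
n_snps, n_hits, coverage = 100, 8, 0.019
print("fold enrichment:", fold_enrichment(n_hits, n_snps, coverage))
print("binomial p-value:", binomial_sf(n_hits, n_snps, coverage))
```

A uniform background is a simplification: SNP arrays and linkage disequilibrium are not uniform across the genome, so a matched-control background would be a more careful null.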
Use:
* Cell-type specificity of chromatin states
* Predicted activators/repressors of these states
* Predicted motif instances across the genome

Ex1: Systemic lupus erythematosus intergenic SNP
SNP in lymphoblastoid GM-specific enhancer state
Disrupts Ets-1 motif instance, predicted GM regulator
Model: Disease SNP abolishes GM-specific enhancer

Ets-1 is a predicted activator of GM/HUVEC enhancers
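Whether a SNP disrupts or creates a motif instance can be checked by scoring both alleles against a position weight matrix. The sketch below is an illustration under stated assumptions: the 4bp matrix is a toy, not the Ets-1 or Gfi-1 motif, the score threshold is arbitrary, and the two sequences are the CATGACTG/CATGCCTG pair from the opening figure, used here only as example alleles.

```python
import math

BASES = "ACGT"

def log_odds_score(seq, pwm, background=0.25):
    """Sum of per-position log2(probability / background) for one window."""
    return sum(math.log2(pwm[i][BASES.index(b)] / background)
               for i, b in enumerate(seq))

def best_window(seq, pwm):
    """Best motif score over all windows of the motif's width."""
    w = len(pwm)
    return max(log_odds_score(seq[i:i + w], pwm)
               for i in range(len(seq) - w + 1))

def motif_effect(ref, alt, pwm, threshold):
    """Classify a variant by whether a motif match is lost or gained."""
    ref_score, alt_score = best_window(ref, pwm), best_window(alt, pwm)
    if ref_score >= threshold > alt_score:
        return "disrupted"
    if alt_score >= threshold > ref_score:
        return "created"
    return "unchanged"

# Toy matrix preferring G-A-C-T; rows are positions, columns are A, C, G, T.
toy_pwm = [
    [0.05, 0.05, 0.85, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.85],
]
print(motif_effect("CATGACTG", "CATGCCTG", toy_pwm, threshold=5.0))
print(motif_effect("CATGCCTG", "CATGACTG", toy_pwm, threshold=5.0))
```

The same comparison covers both examples in the talk: a lost match corresponds to the disrupted activator motif, and a gained match to the created repressor motif.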
[Figure panels: enhancer activity, gene activity, predicted regulators, activity signatures for each TF]
Enhancer class specific to GM and HUVEC cell types
Ets expression
Ets-1 motif enrichment in enhancers
Model: Ets-1 disruption would abolish enhancer state

Ex2: Erythrocyte phenotype study intronic SNP
K562: erythroleukaemia cell type
Disease SNP creates motif instance for Gfi-1 repressor
Gfi-1 predicted repressor for K562-specific enhancers
Creation of repressive motif abolishes K562 enhancer

Gfi-1 is a predicted repressor of non-K562 enhancers
[Figure panels: enhancer activity, gene activity, predicted regulators, activity signatures for each TF]
Gfi expression
Gfi-1 motif depletion in enhancers
Prediction: Gfi-1 large-scale repression of non-K562
Motif created, Gfi-1 recruited, enhancer repressed

More generally: eQTLs in specific chromatin states
Dixon 2007: all eQTLs, lymphoblasts, 400 individuals
Schadt 2008: trans-eQTLs, liver cells, 427 individuals
Nucleotide-resolution genome-wide expression predictors
Strong enrichment for promoter and enhancer states
Trans-eQTLs select for cell-type specific enhancers