Genome-wide analysis of noncoding regulatory mutations in cancer

Nils Weinhold, Anders Jacobsen, Nikolaus Schultz, Chris Sander & William Lee

1 Computational Biology Program, Memorial Sloan Kettering Cancer Center, New York, New York, USA. 2 Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Copenhagen, Denmark. 3 Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, New York, USA.

June 03, 2015

Ka-Kyung Kim

Postdoctoral Researcher

Yonsei Biomedical Science Institute

Yonsei University College of Medicine

Genomic Structure of Gene

• Numbers of Functional Genomic Elements

activate transcription of a gene or transcription

regulate gene transcription

removed by RNA splicing

Consist of coding region of mature RNA transcripts

influence gene expression post-transcriptionally

1.7%

Pathogenic variants in non-coding regions

Makrythanasis P and Antonarakis SE. Clin Genet 2013;84:422-428

Non-coding variations

• Noncoding variations act through transcription control

– Nature 473:43-49 (2011) Mapping and analysis of chromatin state dynamics in nine human cell types

– Science 337:1190-1195 (2012) Systematic localization of common disease-associated variation in regulatory DNA

– Cell 152:633-641 (2013) Integrative eQTL-based analyses reveal the biology of breast cancer risk loci

– Cell 155:934-947 (2013) Super-enhancers in the control of cell identity and disease

• Majority of disease SNPs are in noncoding regions

Non-coding variations

• TERT promoter mutations generate de novo consensus binding motifs for E-twenty-six (ETS) transcription factors; they occur in 50 of 70 (71%) melanomas and in 24 cases (16%) of bladder and hepatocellular cancer cells.

• The obesity-associated noncoding sequences within FTO are functionally connected, at megabase distances, with the homeobox gene IRX3

Enhancer

: integrated method for predicting enhancer targets

Non-coding variation

• Maturation of sequencing technologies

– Computational approaches and limitations of previous studies of non-coding variation: nucleotide conservation, sample size

– This study: comprehensive analysis of somatic mutations from whole-genome sequences (WGS) of 863 cancer patients, collected from The Cancer Genome Atlas (TCGA) and other public sources

Large-scale Genomics Projects

• Cancer genomics projects: The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC)

– Focus on genomic variation in the coding sequences of tumor genomes

– Most studies rely heavily on targeted exome sequencing => understanding of somatic variation in coding regions has improved significantly.

– The protein-coding component of the genome accounts for less than 2% of the total sequence => very little information on how non-coding variation affects cancer development.

– Even well-studied cancer types such as non-small-cell lung cancer still have significant sub-populations with no observable “driver” mutation.

• The Encyclopedia of DNA Elements (ENCODE) project

– Estimates that roughly 80% of the human genome has some sort of biochemical functionality

– Somatic mutations in non-coding regions are frequent

– Disease-associated genomic variation is commonly located in regulatory elements

• Khurana E, et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science. 2013;342:1235587.

• Maurano MT, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–5

• Huang FW, et al. Highly recurrent TERT promoter mutations in human melanoma. Science. 2013;339:957–9.

• Horn S, et al. TERT promoter mutations in familial and sporadic melanoma. Science. 2013;339:959–61

• The GENCODE project

– High-quality reference gene annotation and experimental validation for human and mouse genomes (http://www.gencodegenes.org)

– Gene/Transcript Biotypes in GENCODE & Ensembl: http://www.gencodegenes.org/gencode_biotypes.html, http://vega.sanger.ac.uk/info/about/gene_and_transcript_types.html

Large-scale Genomics Projects

• The FANTOM5 project (http://fantom.gsc.riken.jp/5/)

– Finds general rules for how cells change from one cell type to another

– FANTOM5 Data Hub on the UCSC Genome Browser

• GTEx Project (http://gtexportal.org)

– Provides a comprehensive atlas of gene expression and regulation across multiple human tissues.

– Enables studies of expression quantitative trait loci (eQTLs), alternative splicing, and the tissue specificity of gene regulatory mechanisms, and aids in the interpretation of Genome-Wide Association Studies (GWAS)

– Also available through the database of Genotypes and Phenotypes (dbGaP)

Methods

• Calling mutations

– Intersection of the somatic mutation calls made by MuTect and Strelka

– ≥2 mutant alleles for whole-exome sequence data

– Excluded samples with >500,000 mutations

– Focused on single-nucleotide substitutions → no consideration for structural variants
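The call-set filtering above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout of a call and the function names `consensus_calls` and `filter_samples` are hypothetical, and the real MuTect and Strelka outputs would first have to be parsed into this form.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation of one somatic call:
# (chromosome, 1-based position, reference allele, alternate allele).
Call = Tuple[str, int, str, str]

MAX_MUTATIONS_PER_SAMPLE = 500_000  # exclusion threshold quoted on the slide


def consensus_calls(mutect: Set[Call], strelka: Set[Call]) -> Set[Call]:
    """Keep single-nucleotide substitutions reported by both callers."""
    shared = mutect & strelka
    # Restrict to substitutions: one reference base replaced by one alternate base.
    return {c for c in shared if len(c[2]) == 1 and len(c[3]) == 1}


def filter_samples(calls_by_sample: Dict[str, Set[Call]],
                   max_mutations: int = MAX_MUTATIONS_PER_SAMPLE) -> Dict[str, Set[Call]]:
    """Drop samples whose mutation count exceeds the threshold."""
    return {sample: calls for sample, calls in calls_by_sample.items()
            if len(calls) <= max_mutations}
```

Taking the intersection trades sensitivity for specificity: a call missed by either tool is discarded.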

• Defining noncoding regions of interest

– Promoter regions were defined as the genomic intervals ranging from 2,000 bp upstream to 200 bp downstream of all transcription start sites.

– 66,944 enhancer-to-gene associations (27,493 unique regions) were taken from a study (Nature 507, 455–461 (2014)); each enhancer region was defined as its inferred middle position ±200 bp.

– Regions overlapping ORFs (±5 bp) were removed from the collection of regions of interest to avoid mutation bias from protein-coding regions.

– Removed the regions corresponding to 429 annotated immunoglobulin loci (± 50 kb) to avoid bias from immune system-coupled somatic hypermutation
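A minimal sketch of how the promoter and enhancer intervals defined above could be derived from coordinates. The `TSS` record and both function names are hypothetical, and the strand handling (upstream meaning lower coordinates on "+" and higher coordinates on "-") is an assumption the slide does not spell out.

```python
from typing import NamedTuple, Tuple


class TSS(NamedTuple):
    chrom: str
    pos: int      # 1-based transcription start site
    strand: str   # "+" or "-"


def promoter_interval(tss: TSS, upstream: int = 2000,
                      downstream: int = 200) -> Tuple[str, int, int]:
    """Interval from `upstream` bp before to `downstream` bp after the TSS."""
    if tss.strand == "+":
        start, end = tss.pos - upstream, tss.pos + downstream
    else:
        # On the minus strand, "upstream" lies at higher coordinates (assumption).
        start, end = tss.pos - downstream, tss.pos + upstream
    return tss.chrom, max(1, start), end


def enhancer_interval(chrom: str, midpoint: int,
                      flank: int = 200) -> Tuple[str, int, int]:
    """Interval of +/- `flank` bp around the inferred enhancer midpoint."""
    return chrom, max(1, midpoint - flank), midpoint + flank
```

For example, `promoter_interval(TSS("chr1", 10000, "+"))` gives `("chr1", 8000, 10200)`. Removing ORF and immunoglobulin overlaps would be a separate interval-subtraction step, not shown here.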

• Identification of hotspot mutations

– All mutations within 50 bp of each other were merged using BEDTools into hotspot clusters

– Clusters with only 1 or 2 mutations were removed

– P value was calculated for each cluster using the negative binomial distribution, taking into account the length of the candidate hotspot, the number of mutations in the cluster and a background mutation rate for the cluster

– negative binomial distribution: discrete probability distribution of the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified number of failures occurs (for example, repeated die rolls with "1" as failure and all non-"1"s as successes)
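The clustering step and the distribution named above can be illustrated with the following sketch. It is not the authors' pipeline: `cluster_positions` is a hypothetical pure-Python stand-in for the BEDTools merge (the exact BEDTools distance semantics are not reproduced), and `neg_binomial_pmf` only evaluates the textbook probability mass function. How the hotspot length and background rate enter the actual P value is not given on the slide, so that step is left out.

```python
from math import comb
from typing import List


def cluster_positions(positions: List[int], max_gap: int = 50,
                      min_size: int = 3) -> List[List[int]]:
    """Group mutation positions on one chromosome into candidate hotspots.

    Consecutive positions at most `max_gap` bp apart join the same cluster;
    clusters with fewer than `min_size` mutations are dropped.
    """
    clusters: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_size]


def neg_binomial_pmf(k: int, r: int, p_success: float) -> float:
    """P(exactly k successes are seen before the r-th failure)."""
    return comb(k + r - 1, k) * p_success ** k * (1 - p_success) ** r
```

In the die example, `neg_binomial_pmf(k, 1, 5/6)` is the probability of rolling exactly k non-"1"s before the first "1".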

• URLs

– CGHub, https://cghub.ucsc.edu/; Broad Genome Data Analysis Center (GDAC) Firehose, http://gdac.broadinstitute.org/; data from Alexandrov et al. (ref. 15), ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl.

Methods

• Testing regions of interest for mutation recurrence

– Local approach: extracted 10-kb flanking regions upstream and downstream of the region of interest, excluding ORFs to reduce mutation bias from nearby protein-coding regions.

– Global approach: nucleotide mutation frequencies from other regions of the same category of regions

– For each region or gene, the maximum FDR of the individual global and local tests was taken.

– Binomial test on k, the number of mutated samples, against a binomial distribution (n, p_i):

n: the total number of samples with mutation data

p_i: estimated sample mutation rate for region of interest i under the null hypothesis that the region was not recurrently mutated

p_i depends on the effective length L_i of the region (with ORF overlap subtracted) and on the estimated nucleotide mutation rate q_i for the region under the null hypothesis.
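A minimal sketch of the recurrence test described above. The slide does not reproduce the formula linking p_i to q_i and L_i; the code assumes p_i = 1 - (1 - q_i)^L_i, i.e. the chance that a sample carries at least one mutation in the region if sites mutate independently. That formula, and all function names, are assumptions for illustration.

```python
from math import exp, lgamma, log, log1p


def sample_mutation_rate(q: float, effective_length: int) -> float:
    """Assumed p_i: probability of at least one mutation in L_i independent sites."""
    return 1.0 - (1.0 - q) ** effective_length


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for j in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                    + j * log(p) + (n - j) * log1p(-p))
        total += exp(log_term)
    return min(1.0, total)


def recurrence_p_value(k: int, n: int, q: float, effective_length: int) -> float:
    """P value for seeing k or more mutated samples out of n in one region."""
    return binomial_tail(k, n, sample_mutation_rate(q, effective_length))


def combined_significance(local_value: float, global_value: float) -> float:
    """Conservative combination: keep the larger (less significant) of the two."""
    return max(local_value, global_value)
```

The local and global approaches would differ only in how q is estimated (flanking 10-kb regions versus other regions of the same category); multiple-testing correction to obtain the FDR is not shown.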

• Transcription factor analysis

– Mutations were annotated as creating ETS transcription factor binding sites if the nucleotide substitution created a novel ETS transcription factor core response element (TGCC>TTCC)

– Mutations were annotated as disrupting ETS transcription factor binding sites if they altered an existing ETS core response element (TTCC>TGCC).

– For each region of interest that contained more than one mutation in an ETS binding site, an empirical P value was computed by comparing the observed count statistic to a reference distribution of count statistics.

– → Could be extended to other TFBSs beyond ETS
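The create/disrupt annotation above can be sketched as a simple string check. This is a simplification under stated assumptions: only the 4-bp core TTCC and its reverse complement GGAA are considered, the function names are hypothetical, and a substitution that destroys one core while creating another is reported as "neutral". The empirical P value step is not shown.

```python
ETS_CORES = ("TTCC", "GGAA")  # core element and its reverse complement


def has_ets_core(seq: str) -> bool:
    """True if the sequence contains an ETS core on either strand."""
    seq = seq.upper()
    return any(core in seq for core in ETS_CORES)


def classify_substitution(window: str, offset: int, alt: str) -> str:
    """Classify a single-base substitution as 'creates', 'disrupts' or 'neutral'.

    window: reference sequence around the mutation
    offset: 0-based index of the mutated base within `window`
    alt:    the substituted base
    """
    ref_seq = window.upper()
    mut_seq = ref_seq[:offset] + alt.upper() + ref_seq[offset + 1:]
    # Only 4-mers overlapping the mutated base can change: look at +/- 3 bp.
    lo, hi = max(0, offset - 3), offset + 4
    before = has_ets_core(ref_seq[lo:hi])
    after = has_ets_core(mut_seq[lo:hi])
    if after and not before:
        return "creates"
    if before and not after:
        return "disrupts"
    return "neutral"
```

For instance, `classify_substitution("CCCCTCCGG", 3, "T")` returns `"creates"`, because the mutated sequence contains TTCC and the reference does not.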

• Expression analysis

– Expression analysis was performed using RNA sequencing raw counts from TCGA. P values are reported using a negative binomial test from the edgeR package.

– In-depth analyses of SDHD promoter mutations with a read depth of ≥15 on a set of melanoma samples from TCGA.

Results: Assessing the genomic landscape of non-coding mutations

• Mutations in transcribed regions, including coding sequences (CDS), introns, and 3′ and 5′ UTRs, were observed at similar frequencies (Figure 1b) => a role for transcription-coupled repair (refs. 8, 17).

• Intergenic regions, less implicated in gene regulation and possibly under weaker selective constraint, carried the highest mutational burden across all regions investigated here (Mann-Whitney P < 2.2e-16).

• The genome-wide mutation burden varied between different cancer types => consistent with previous observations in exome sequencing studies

Figure 1 Summary of data and methods. (a) Tumor samples by disease type. Boldface: TCGA. ALL, acute lymphoblastic leukemia; AML, acute myeloid leukemia; CLL, chronic lymphocytic leukemia. (b) Mean mutation frequency and 95% confidence interval across samples (n = 858) by type of genomic region. CDS, coding sequence.

(c) Workflow for the identification of recurrent noncoding mutations in regulatory regions of interest. Our approach integrates mutation calls from 863 tumor-normal pairs and regulatory regions of interest, which are tested for noncoding mutations using 3 distinct analyses. Hotspot analysis detects recurrent mutations that are often very focal. Regional recurrence analysis identifies annotated regions of interest that are enriched for mutation throughout the entire region. Transcription factor analysis searches for regions that contain recurrent mutations within transcription factor binding sites.

Hotspot analysis: small regions that frequently contained mutations

Regional recurrence analysis: annotated regions that contained numerous mutations

Transcription factor analysis: ETS transcription factor binding sites disrupted or created by mutation

Results: Mutation hotspot

• TERT promoter mutations: TERT encodes the catalytic subunit of telomerase; the most significant hotspot (P = 1.1 × 10^-127), with 2 highly recurrent C->T substitutions at chr5:1295228 and chr5:1295250 in multiple samples across cancer types, as previously reported

• PLEKHS1 promoter mutations: an uncharacterized gene that has not previously been linked to tumorigenesis; its pleckstrin homology domain suggests a role for the protein in intracellular signaling. Significant mutations (P = 4.6 × 10^-80) at chr10:115511590 and chr10:115511593 are C->T transitions located at the center of a palindromic sequence

• Several significant hotspots linked to STAG3, BCL2, TCL1A, AGAP5, TRMT10C, TNK2, WDR74 => many of these genes have been associated with cancer previously

• Hotspots in the promoter and 5′ UTR of BCL2 are significant as clusters of several mutations within the same sample (average 2.2 mutations per mutated sample) => these are all in B-cell lymphoma samples and are likely a result of targeted somatic hypermutation at hypervariable regions.

Figure 2 Hotspot analysis. (a) Significance of mutation hotspots in noncoding regulatory regions (labeled P values: 4.6 × 10^-80 for PLEKHS1, 1.1 × 10^-127 for TERT). (b) Mutation hotspot in the promoter region of PLEKHS1, including 2 highly recurrent sites (with 11 and 12 mutations) located at the center of a palindromic sequence. Gray curve: mutation density across the region. Bar chart: the frequency of the hotspot mutation in individual cancer types.

Results: Regional Recurrence Analysis

• Somatic mutations are frequently distributed across the entire open reading frame

• Local approach: compares regional mutation rates to the overall mutation frequency in the genomic neighborhood

• Global approach: compares mutation rates for regions in the same category (promoter or 3′ UTR) and with similar DNA replication timing

• Together => identified larger, more frequently mutated genomic regions

• The 5′ UTR (P < 5.1 × 10^-8) and promoter (P < 3.6 × 10^-9) of WDR74 were highly enriched for mutations clustered across numerous positions (Figure 3b); not significantly different in mutated samples.

• WDR74 contains a WD40 repeat having enzymatic activity and is involved in a variety of biological processes, including cell cycle control and apoptosis. Mutations in this region are more common than previously known.

• Other frequently mutated noncoding regions were found in genes such as SGK1, DHX16 and SDHD (Supplementary Tables 9-12).

• The 5′ end of the SDHD gene, which encodes subunit D of the succinate dehydrogenase complex, contained multiple mutations in putative ETS (E26 transformation-specific) family transcription factor binding sites.

Figure 3 Regional recurrence analysis. (a) Significance of recurrent mutations in regulatory regions of interest. (b) Strong enrichment of mutations in the promoter region of WDR74 in contrast to the remainder of the gene sequence.

more often affected by mutation

Results: Transcription factor analysis

• Mutations in the regulatory regions of TERT, ANKRD53, TAF11, ERLIN2, MEF2C, KRT4 and SDHD create novel binding sites for ETS transcription factors (CTCC>TTCC)

• Promoter mutations in ETS binding sites alter regulation of SDHD

– SDHD mutations can cause paraganglioma, a benign tumor of the head and neck.

– Recurrent mutations in the TERT promoter create a novel ETS binding site, and mutations in the SDHD promoter damage existing ETS binding sites.

– Tumors with SDHD promoter mutation showed significantly reduced expression of the SDHD gene (P = 0.004, Figure 4b).

– ETS family transcription factors with binding activity in the SDHD promoter: EHF, ELF1, and ETS1

– Only ELF1 expression exhibited significant positive correlation with the SDHD expression data in the subset of 42 SDHD proficient samples without promoter mutation (Figure 4c, P < 0.0035)

– Tumor samples with SDHD promoter mutation do not exhibit a correlation between SDHD and ELF1 mRNA levels (P = 0.35) => adverse effect of SDHD promoter mutation on transcriptional regulation by ELF1 (Figure 4c).

– Samples with SDHD mutation had a significantly shorter overall survival compared to a reference group of 88 melanoma samples (P = 0.005, Figure 4d).

Figure 4 Transcription factor analysis. Mutations in the promoter region of SDHD disrupt ETS transcription factor binding sites in melanoma cancer genomes. (a) Three recurrently mutated sites in the promoter region of SDHD, each one altering a separate ETS recognition site, which are highly conserved. (b) SDHD mRNA expression is lower in melanoma samples with SDHD promoter mutations (n = 13) in comparison to tumor samples with wild-type (WT) SDHD (n = 42) (negative binomial test). (c) mRNA expression for ELF1 (ETS transcription factor) and SDHD is positively correlated in samples without SDHD promoter mutation (n = 42; blue) and is not correlated in samples with SDHD promoter mutation (n = 13; red). (d) Survival analysis shows that overall survival is significantly lower for samples with SDHD promoter mutation (n = 12) than in the reference group (n = 88).

The box plot displays the first and third quartiles (top and bottom of the boxes), the median (band inside the boxes), and the lowest and highest point within 1.5 times the interquartile range of the lower and higher quartile (whiskers).

Discussion

• Comprehensive analysis of whole-genome sequencing data from 863 individuals with cancer to characterize the landscape of noncoding mutations in cancer.

• Intergenic regions are more often affected by mutation than transcribed regions in close proximity to the coding sequence, such as introns, promoters, enhancers and UTRs.

• Distinct types of analysis to identify regions of interest significantly affected by mutation

• Hotspot analysis focused on small regions that frequently contained mutations

• Regional recurrence analysis identified annotated regions that contained numerous mutations

• Transcription factor analysis nominated regions with ETS transcription factor binding sites that were disrupted or created by mutation.

• Significant findings identified by multiple methods

• Promoter mutations in the TERT gene were found by all three methods.

• Hotspot analysis identified highly recurrent mutations in PLEKHS1. The mutations occur at the center of a perfectly palindromic sequence.

• SDHD promoter mutation was moderately significant in regional recurrence analysis but was subsequently substantiated by transcription factor binding site analysis.

• Recurrent mutations in three distinct ETS response elements were associated with loss of correlation with ETS transcription factor (ELF1) expression at the mRNA level and with shorter survival times for the affected individuals => Need to apply to all known conserved binding sites

• Multiple cancer types with fewer than 50 samples => limitation to detecting regions that are mutated at high frequency in individual tumor types or across several different tumor types => similar analyses on larger sets of samples in individual tumor types will provide additional insights

• Interrogation and interpretation of noncoding mutation will become more accurate and more important as the availability of whole-genome sequencing data increases

• Not considered: cell-type and developmental-stage selectivity / Network (Pathway) / CNV / Non-coding RNA / Methylation

Method for annotating and prioritizing noncoding mutations

non-coding categories

Figure S7 Broad and high-resolution categories. The numbers of sub-categories within each category are shown in brackets.

Filtering variants against 1000 Genomes Phase I data. FunSeq filters SNVs against the 1000 Genomes Phase I database using a user-defined MAF (minor allele frequency) threshold.

Scoring scheme for non-coding variants. For non-coding SNVs, FunSeq utilizes the results of this paper to score variants. A variant is assigned an additional score of 1 for each of the following categories that are applicable to the variant:

1. ENCODE annotation: Variant is in a region annotated by ENCODE.
2. In sensitive region: Variant is in a sensitive region.
3. In ultrasensitive region: Variant is in an ultrasensitive region.
4. Motif-breaking: Variant breaks a known TF motif.
5. Target gene known: Variant is in a gene promoter, or the target gene of the enhancer in which it occurs is known.
6. Target is hub: The assigned target is a hub.
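The additive scoring scheme above amounts to counting how many of the six categories apply. The sketch below shows that arithmetic only; the `NoncodingVariant` class and its field names are hypothetical and do not correspond to FunSeq's actual input or output format.

```python
from dataclasses import dataclass


@dataclass
class NoncodingVariant:
    # One flag per scoring category listed on the slide (names are illustrative).
    in_encode_annotation: bool = False
    in_sensitive_region: bool = False
    in_ultrasensitive_region: bool = False
    breaks_tf_motif: bool = False
    target_gene_known: bool = False
    target_is_hub: bool = False


def additive_score(variant: NoncodingVariant) -> int:
    """Add 1 for every category that applies to the variant (range 0 to 6)."""
    return sum([
        variant.in_encode_annotation,
        variant.in_sensitive_region,
        variant.in_ultrasensitive_region,
        variant.breaks_tf_motif,
        variant.target_gene_known,
        variant.target_is_hub,
    ])
```

A variant in an ENCODE-annotated region that also breaks a known TF motif would score 2 under this scheme.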

Demo

Output screen

Output screen: results table

Output screen

Output screen: results table