Detection of structural variants and copy number alterations in cancer: from computational...



Detection of structural variants and copy number alterations in cancer: from computational strategies to the discovery of chromothripsis in neuroblastoma

Introduction

CNA & LOH detection (FREEC)

Discovery of chromothripsis in neuroblastoma

• Detection of CNA regions
• Detection of LOH regions
• Possibility to work without control sample
• Possibility to set tumor ploidy
• Automatic window selection
• Use of mappability information
• Evaluation of and adjustment of contamination of tumor samples by normal cells
• Possibility to work with exome data
• Possibility to cross the output with the output of SVDetect


1 Inserm U900, 75248 Paris, France
2 Mines ParisTech, Fontainebleau, F-77300 France
3 Institut Curie, 26, rue d’Ulm, 75248 Paris, France
4 Inserm U830, 75248 Paris, France

To find the best-fitting polynomial (shown in black in panels A-D), we first initialize the polynomial's parameters (median value of RC for a given GC-content). We then optimize the parameters by iteratively selecting the data points that belong to P-copy regions (P being the ploidy) and performing a least-squares fit on them.
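The iterative fit described above can be sketched as follows. This is a minimal illustration and not the Control-FREEC implementation: the function name `fit_rc_vs_gc`, the flat initialization at the global median RC, the polynomial degree, and the selection tolerance (half the distance to the neighbouring copy number level) are all assumptions made for the example.

```python
import numpy as np

def fit_rc_vs_gc(gc, rc, ploidy=2, degree=3, n_iter=5):
    """Iteratively fit a polynomial RC = f(GC) on windows presumed to be P-copy.

    gc, rc : per-window GC-content and read count (same length).
    ploidy : main ploidy P of the sample (assumed known).
    Returns polynomial coefficients, highest degree first (numpy convention).
    """
    gc = np.asarray(gc, dtype=float)
    rc = np.asarray(rc, dtype=float)
    # Assumed initialization: a flat curve at the global median read count.
    coeffs = np.zeros(degree + 1)
    coeffs[-1] = np.median(rc)
    # A window with P+1 or P-1 copies deviates by 1/P from the curve,
    # so accept only windows closer than half that distance.
    tol = 1.0 / (2.0 * ploidy)
    for _ in range(n_iter):
        expected = np.polyval(coeffs, gc)
        keep = np.abs(rc - expected) < tol * np.abs(expected)
        if keep.sum() <= degree:
            break  # too few points left for a stable least-squares fit
        coeffs = np.polyfit(gc[keep], rc[keep], degree)
    return coeffs
```

Dividing each window's read count by the fitted value at its GC-content then gives a normalized profile centred on 1 for P-copy regions.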

In many studies that apply deep sequencing to cancer genomes, one has to calculate copy number profiles (CNPs) and predict regions of gain and loss. Two obstacles are frequent in the analysis of cancer genomes: the absence of an appropriate control sample from normal tissue, and possible polyploidy. We therefore developed Control-FREEC [1,2], which automatically detects Copy Number Alterations (CNAs), with or without a control dataset, and Loss of Heterozygosity (LOH) regions. For mate-paired/paired-end mapping (PEM) data, one can complement the information about CNAs (i.e., the output of Control-FREEC) with the predictions of Structural Variants (SVs) made by another tool that we developed, SVDetect [3]. Here we used a combination of Control-FREEC and SVDetect (http://bioinfo-out.curie.fr/projects/freec/sv.html) on neuroblastoma samples to (1) refine the coordinates of CNAs using PEM data and (2) improve confidence in calling true positive rearrangements (particularly in ambiguous satellite/repetitive regions).

For mate-paired/paired-end mapping (PEM) data, one can complement the information about copy number changes (i.e., the output of Control-FREEC) with the predictions of structural variants (SVs) made by SVDetect [3]. Automatic intersection of Control-FREEC and SVDetect outputs allows one to:

[1] Boeva, V., et al. Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics, 2011, 27(2):268-9. http://bioinfo-out.curie.fr/projects/freec/
[2] Boeva, V., et al. Control-FREEC: a tool for assessing copy number and allelic content using next generation sequencing data. Bioinformatics, 2012, 28(3):423-5.
[3] Zeitouni, B., et al. SVDetect - a bioinformatic tool to identify genomic structural variations from paired-end next-generation sequencing data. Bioinformatics, 2010, 26:1895-1896. http://svdetect.sourceforge.net

Calculation of dependency function “RC vs GC-content” or “RC sample vs RC control”

Window size selection

$W = \dfrac{L}{T \cdot CV^{2}}$, where $L$ = genome length, $T$ = total number of reads, $CV$ = user-defined Coefficient of Variation.
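One way to read this formula: if the number of reads in a window is roughly Poisson-distributed with mean N = T·W/L, its coefficient of variation is 1/√N, so requiring a given CV fixes N = 1/CV² and hence W. A small sketch of the calculation, where the function name `window_size` and the example numbers are illustrative assumptions:

```python
def window_size(genome_length, total_reads, cv):
    """Window size W = L / (T * CV^2), rounded to a whole number of base pairs."""
    if genome_length <= 0 or total_reads <= 0 or cv <= 0:
        raise ValueError("genome_length, total_reads and cv must be positive")
    return max(1, round(genome_length / (total_reads * cv ** 2)))

# Hypothetical example: a 3 Gb genome, 300 million reads, CV = 0.05.
w = window_size(3_000_000_000, 300_000_000, 0.05)
```

With these assumed inputs each window is expected to hold about 1/0.05² = 400 reads.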

• Refine coordinates of CNAs using PEMs
• Filter out false predictions of SVDetect (often in ambiguous satellite/repetitive regions)
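The two uses listed above can be illustrated with a simple breakpoint-matching sketch. It is not the intersection script distributed with Control-FREEC: the function names, the single-chromosome sorted-list representation, and the distance threshold `max_dist` are assumptions chosen for the example.

```python
from bisect import bisect_left

def refine_boundary(boundary, sv_breakpoints, max_dist):
    """Snap a window-resolution CNA boundary to the nearest SV breakpoint.

    sv_breakpoints : sorted positions on the same chromosome.
    Returns (position, supported): the SV breakpoint if one lies within
    max_dist of the boundary, otherwise the unchanged boundary.
    """
    i = bisect_left(sv_breakpoints, boundary)
    candidates = sv_breakpoints[max(0, i - 1): i + 1]  # neighbours on each side
    if candidates:
        nearest = min(candidates, key=lambda b: abs(b - boundary))
        if abs(nearest - boundary) <= max_dist:
            return nearest, True
    return boundary, False

def sv_is_supported(sv_breakpoint, cna_boundaries, max_dist):
    """True if a predicted SV breakpoint coincides with a copy number change."""
    return any(abs(sv_breakpoint - b) <= max_dist for b in cna_boundaries)
```

Filtering by CNA support is only meaningful for unbalanced rearrangements; balanced events such as inversions leave the copy number profile unchanged and would need separate treatment.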

Valentina Boeva (1,2,3), Bruno Zeitouni (1,2,3), Tatiana Popova (1,2,3), Kevin Bleakley (1,2,3), Andrei Zinovyev (1,2,3), Jean-Philippe Vert (1,2,3), Isabelle Janoueix-Lerosey (3,4), Olivier Delattre (3,4) and Emmanuel Barillot (1,2,3)
E-mail: [email protected]

Segmentation

Segmentation is done by a LASSO-based algorithm suggested by Harchaoui and Lévy-Leduc (2008).
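For orientation, LASSO-based change-point methods of this kind are usually written as a least-squares problem with an L1 penalty on jumps between consecutive windows. The formulation below is the generic form, given under the assumption that $y_i$ denotes the normalized read count in window $i$, $u_i$ the fitted piecewise-constant level, and $\lambda$ a tuning parameter; the exact variant and parameter choice used in Control-FREEC may differ.

```latex
\hat{u} \;=\; \arg\min_{u \in \mathbb{R}^{n}}
  \; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - u_i \right)^2
  \;+\; \lambda \sum_{i=1}^{n-1} \left| u_{i+1} - u_i \right|
```

Breakpoints are the positions $i$ where $\hat{u}_{i+1} \neq \hat{u}_i$; a larger $\lambda$ yields fewer, longer segments.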

Adjustment for a possible contamination by normal cells

Control-FREEC uses the following formula to evaluate the fraction of contaminating normal cells p, and then correct copy number profiles:

$NRC_i \approx E_i + (1 - E_i)\,p$,

where $NRC_i$ is the normalized read count in window $i$ and $E_i$ is the expected ratio in window $i$.
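The relation above can be inverted in two ways: solving for p in each window gives p ≈ (NRC_i − E_i)/(1 − E_i), and solving for E_i gives the corrected ratio (NRC_i − p)/(1 − p). The sketch below applies both. It assumes the expected ratios E_i are already available (for example from a first round of copy number calls) and uses a median over windows as the estimator; both choices, and the function names, are assumptions for illustration rather than the tool's actual procedure.

```python
import numpy as np

def estimate_contamination(nrc, expected):
    """Estimate the normal-cell fraction p from NRC_i ~ E_i + (1 - E_i) * p."""
    nrc = np.asarray(nrc, dtype=float)
    expected = np.asarray(expected, dtype=float)
    # Windows with E_i = 1 carry no information about p (the term vanishes).
    informative = np.abs(expected - 1.0) > 1e-9
    if not informative.any():
        raise ValueError("no window with expected ratio different from 1")
    per_window = (nrc[informative] - expected[informative]) / (1.0 - expected[informative])
    return float(np.clip(np.median(per_window), 0.0, 1.0))

def correct_profile(nrc, p):
    """Remove the contribution of normal cells: E_i = (NRC_i - p) / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return (np.asarray(nrc, dtype=float) - p) / (1.0 - p)
```

For instance, a one-copy loss in a diploid tumor (E_i = 0.5) observed at NRC_i = 0.6 corresponds to p = 0.2 under this model.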

Results and graphical visualization

1. List of gains and losses with assigned copy numbers
2. Visualization in R
3. Creation of different file format outputs for graphical visualization: Circos, UCSC Genome Browser (BedGraph)

SVDetect [3] is a tool that allows the user to:
• identify candidate SVs using the clustering of discordant PEMs,
• predict the type of an SV using the PEM signature,
• filter out PEMs inconsistent with the main signature of the predicted SV,
• compare SVs predicted for different samples,
• create different file format outputs for graphical visualization of predicted SVs.

Illustrations of read signatures for SV type prediction (implemented in SVDetect [3]). Panels: intra-chromosomal SVs; inter-chromosomal SVs.
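As a rough illustration of how a read-pair signature maps to an SV type, the sketch below applies the commonly used textbook rules to a single pair. It is far simpler than SVDetect, which works on clusters of pairs and distinguishes more signatures. The function name, the labels, and the convention that strands have been normalized so that a concordant pair reads "+" then "-" in coordinate order are all assumptions of this example.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  mean_insert, sd_insert, n_sd=3.0):
    """Classify one mapped read pair by orientation and mate distance.

    Assumes strands are normalized so that a concordant pair is '+' (left
    mate) followed by '-' (right mate) at about mean_insert distance.
    """
    if chrom1 != chrom2:
        return "inter-chromosomal"
    if pos1 > pos2:  # order the mates by coordinate
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    if strand1 == strand2:
        return "inversion"            # both mates on the same strand
    if (strand1, strand2) == ("-", "+"):
        return "tandem duplication"   # everted orientation
    distance = pos2 - pos1
    if distance > mean_insert + n_sd * sd_insert:
        return "deletion"             # mates map too far apart
    if distance < mean_insert - n_sd * sd_insert:
        return "insertion"            # mates map too close together
    return "concordant"
```

In practice a call would be made only when many pairs with the same signature cluster at the same location, which is what removes most single-pair mapping artefacts.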

Circos representation of SVs predicted by SVDetect and confirmed by the CNAs identified by Control-FREEC. (A-C) NB1141, (D-F) NB1142. (A, D) whole-genome view; (B, E) zoom on chromothripsis; (C, F) copy number profile for chr1 of NB1141 and chr6 of NB1142.

Calculation of BAF profiles

(Figure, panels F and G. Axes: normalized copy number; B allele frequency.)

Annotation of B allele frequency profiles using Gaussian mixture model fit
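To make the idea concrete: for a segment with n copies, of which b carry the B allele, SNPs that are heterozygous in the normal genome are expected to show BAF values near b/n and 1 − b/n. The sketch below compares candidate genotypes by the likelihood of an equal-weight Gaussian mixture with those fixed means. This is a simplification for illustration only; a full mixture fit would also estimate weights and variances, and the function name, the fixed standard deviation `sd`, and the restriction to germline-heterozygous SNPs are assumptions, not a description of Control-FREEC's procedure.

```python
import numpy as np

def annotate_genotype(baf, copy_number, sd=0.05):
    """Pick the genotype (e.g. 'AAB') whose expected BAF peaks best explain baf.

    baf         : BAF values of germline-heterozygous SNPs in one segment.
    copy_number : integer copy number of the segment (>= 1).
    sd          : assumed common standard deviation of each mixture component.
    """
    baf = np.asarray(baf, dtype=float)
    best_loglik, best_genotype = -np.inf, None
    for b in range(copy_number // 2 + 1):  # b = number of B-allele copies
        means = np.unique([b / copy_number, 1.0 - b / copy_number])
        # Density of each SNP under each component, then equal-weight mixture.
        z = (baf[:, None] - means[None, :]) / sd
        dens = np.exp(-0.5 * z ** 2) / (sd * np.sqrt(2.0 * np.pi))
        loglik = np.log(dens.mean(axis=1) + 1e-300).sum()
        if loglik > best_loglik:
            best_loglik = loglik
            best_genotype = "A" * (copy_number - b) + "B" * b
    return best_genotype
```

A genotype without any B copy (such as "AA" or "AAA") corresponds to a segment with LOH, since the heterozygous signal has collapsed to 0 and 1.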

Primary neuroblastoma tumors with chromothripsis

Neuroblastoma cell lines

CLB-GA

CLB-RE

Detection of SVs (SVDetect)

We investigated somatic rearrangements in two neuroblastoma cell lines and two primary tumors using paired-end sequencing of mate-pair libraries.