Genome-wide Copy Number Analysis Qunyuan Zhang,Ph.D. Division of Statistical Genomics Department of...


Page 1: Genome-wide Copy Number Analysis Qunyuan Zhang,Ph.D. Division of Statistical Genomics Department of Genetics & Center for Genome Sciences Washington University.

Genome-wide Copy Number Analysis

Qunyuan Zhang, Ph.D.

Division of Statistical Genomics

Department of Genetics & Center for Genome Sciences

Washington University School of Medicine

02-08-2006

Course: M 21-621 Computational Statistical Genetics

Page 2:

Four Questions

What is Copy Number?

What can Copy Number tell us?

How to measure/quantify Copy Number?

How to analyze Copy Number?

Page 3:

What is Copy Number?

Gene Copy Number

The gene copy number (also "copy number variants" or CNVs) is the amount of copies of a particular gene in the genotype of an individual. Recent evidence shows that the gene copy number can be elevated in cancer cells. For instance, the EGFR copy number can be higher than normal in Non-small cell lung cancer. …Elevating the gene copy number of a particular gene can increase the expression of the protein that it encodes.

From Wikipedia www.wikipedia.org

Page 4:

DNA Copy Number

A Copy Number Variant (CNV) represents a copy number change involving a DNA fragment that is ~1 kilobase or larger. (From Nature Reviews Genetics, Feuk et al. 2006)

DNA Copy Number ≠ DNA Tandem Repeat Number (e.g., microsatellites, <10 bases)

DNA Copy Number ≠ RNA Copy Number; RNA Copy Number = Gene Expression Level

DNA → (transcription) → mRNA

Copy number is the number of copies of a particular fragment of a nucleic acid molecule. In most publications it refers to DNA copy number.

Page 5:

What can Copy Number tell us?

Genetic Diversity/Polymorphisms

- restriction fragment length polymorphism (RFLP)
- amplified fragment length polymorphism (AFLP)
- random amplification of polymorphic DNA (RAPD)
- variable number of tandem repeat (VNTR; e.g., mini- and microsatellite)
- single nucleotide polymorphism (SNP)
- presence/absence of transposable elements
- …
- structural alterations (e.g., deletions, duplications, inversions …)
- DNA copy number variant (CNV)

Association with phenotypes/diseases genes/genetic factors

Page 6:

Genetic Alterations in Tumor Cells (DNA Copy Number Changes)

Homologous repeats

Segmental duplications

Chromosomal rearrangements

Duplicative transpositions

Non-allelic recombinations

……

[Figure: a normal cell (CN=2) compared with tumor cells showing deletion (CN=0, CN=1) and amplification (CN=3, CN=4)]

Page 7:

How to measure/quantify Copy Number?

Quantitative Polymerase Chain Reaction (Q-PCR): DNA Amplification

PCR (dNTPs, primers, Taq polymerase, fluorescent dye), one fragment each time

- less CN → amplification → less DNA → low fluorescent intensity
- more CN → amplification → more DNA → high fluorescent intensity

Microarray: DNA Hybridization

PCR (dNTPs, primers, Taq polymerase, fluorescent dye) on multiple/different fragments in a mixed pool, followed by hybridization

- less CN → amplification → less DNA → arrayed probes → low intensities
- more CN → amplification → more DNA → arrayed probes → high intensities

Page 8:

Microarray: From Image to Copy Number

Affymetrix Mapping 250K Sty-I chip: ~250K probe sets (24 probes per probe set), ~250K SNPs

[Figure: tumor and normal chip images. The normal sample is labeled CN=2 at each of three example probe sets; the tumor sample is labeled CN=1 (deletion), CN=0 (deletion), and CN>2 (amplification).]

more DNA copy number → more DNA hybridization → higher intensity

Page 9:

How to Analyze Copy Number? A Real Example

- ~400 cancer patients
- Normal tissue & tumor tissue (~400 pairs, ~800 DNA samples)
- Affymetrix 250K Sty-I Human Mapping SNP Array
- DNA hybridization signals (intensities on chip images)
- Genotype calling → SNP genotypes → LOH analysis (genotypic changes)
- ? → DNA copy number analysis (DNA copy number changes)

Page 10:

General Procedures for Copy Number Analysis

Finished chips → (scanner) → Raw image data [.DAT files] (experiment info [.EXP]) → (image processing software) → Probe level raw intensity data [.CEL files]

Preprocessing (with chip description file [.CDF]): Background adjustment, Normalization, Summarization → Summarized intensity data

Raw copy number (CN) data [log ratio of tumor/normal intensities]

- Significance test of CN changes
- Estimation of CN
- Smoothing and boundary determination
- Concurrent regions among population
- Amplification and deletion frequencies among populations
- Association analysis

Page 11:

Background Adjustment/Correction

- Reduces unevenness of a single chip
- Makes intensities of different positions on a chip comparable

[Figure: chip image before adjustment and after adjustment]

Corrected Intensity (S’) = Observed Intensity (S) – Background Intensity (B)

For each region i, B(i) = Mean of the lowest 2% intensities in region i

(Affymetrix MAS 5.0)
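The region-wise correction S’ = S – B can be sketched as below. This is a simplified illustration, not the MAS 5.0 implementation: the function name `background_adjust`, the rectangular grid of regions, and the floor applied to corrected values are all assumptions made for the example.

```python
def background_adjust(chip, n_rows=2, n_cols=2, low_frac=0.02, floor=1.0):
    """Subtract a per-region background from a 2-D grid of intensities.

    chip: list of equal-length lists of intensities (hypothetical layout).
    The grid is split into n_rows x n_cols rectangular regions. For each
    region i, B(i) is the mean of the lowest `low_frac` of its intensities,
    and S' = S - B(i). Corrected values are floored at `floor` (an
    assumption) so that a later log transform stays defined.
    """
    height, width = len(chip), len(chip[0])
    out = [row[:] for row in chip]
    for r in range(n_rows):
        r0, r1 = r * height // n_rows, (r + 1) * height // n_rows
        for c in range(n_cols):
            c0, c1 = c * width // n_cols, (c + 1) * width // n_cols
            values = sorted(chip[i][j] for i in range(r0, r1) for j in range(c0, c1))
            k = max(1, int(len(values) * low_frac))  # at least one value
            background = sum(values[:k]) / k
            for i in range(r0, r1):
                for j in range(c0, c1):
                    out[i][j] = max(chip[i][j] - background, floor)
    return out
```

Computing B separately per region is what lets the correction follow uneven brightness across a single chip.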

Page 12:

Background Adjustment/Correction

- Eliminates non-specific hybridization signal
- Obtains accurate intensity values for specific hybridization

PM (perfect match) only, PM-MM (perfect match minus mismatch), Ideal MM, etc.

[Figure labels: quartet probe set; sense or antisense strands; 25 oligonucleotide probes]

Page 13:

Normalization

- Reduces technical variation between chips
- Makes intensities from different chips comparable

[Figure: before normalization and after normalization]

Base Line Array (linear); Quantile Normalization; Contrast Normalization; etc.

$S' = \dfrac{S - \text{Mean of } S}{\text{STD of } S}$, so that $S' \sim N(0, 1)$
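Two of the options named on this slide can be sketched in a few lines: the standardization S' = (S - mean) / STD, and quantile normalization. The function names and the tie handling are assumptions for illustration. Real packages handle ties and missing values more carefully.

```python
import statistics


def standardize(intensities):
    """Return S' = (S - mean of S) / (STD of S) for one chip."""
    mean = statistics.fmean(intensities)
    sd = statistics.stdev(intensities)  # sample SD; population SD is another option
    return [(s - mean) / sd for s in intensities]


def quantile_normalize(chips):
    """Give every chip the same empirical intensity distribution.

    chips: list of equal-length intensity lists, one list per chip.
    Each value is replaced by the mean, across chips, of the values that
    hold the same rank. Ties are broken by position (a simplification).
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [statistics.fmean(sc[k] for sc in sorted_chips) for k in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda idx: chip[idx])
        result = [0.0] * n
        for rank, idx in enumerate(order):
            result[idx] = rank_means[rank]
        normalized.append(result)
    return normalized
```

Standardization only matches the mean and spread of each chip, while quantile normalization forces the whole distribution to agree across chips.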

Page 14:

Summarization

Combines the multiple probe intensities for each probe set to produce a summarized value for subsequent analyses.

Average methods: PM only or PM-MM, allele specific or non-specific

Model-based method: Li & Wong, 2001 (Gene Expression Index)
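A minimal sketch of the average-based summarization, assuming each probe set is given as a list of PM intensities with an optional matching list of MM intensities. The function name is hypothetical, and the model-based method of Li & Wong is not shown here.

```python
import statistics


def summarize_probe_set(pm, mm=None):
    """Average-based summary value for one probe set.

    pm: perfect-match intensities of the probes in the set.
    mm: optional mismatch intensities in the same probe order.
    With mm given, the summary is the mean of PM - MM; otherwise it is
    the mean of PM only.
    """
    if mm is None:
        return statistics.fmean(pm)
    return statistics.fmean(p - m for p, m in zip(pm, mm))
```

For an allele-specific summary, the same function could be applied separately to the probes of each allele.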

Page 15:

Raw Copy Number Data

S: Summarized raw intensity

S’: Log transformation, S’ = log2(S)

Raw CN: Log ratio of tumor / normal intensities

CN = S’tumor - S’normal = log2(Stumor/Snormal)

Pair design: Snormal = S of the paired normal sample

Group design: Snormal = average S of the group of normal samples

[Figure: S before log transformation, Log(S) after log transformation, and raw CN]
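The raw CN definition can be sketched for both designs as follows. The function names are hypothetical, and the inputs are assumed to be summarized intensities listed in the same SNP order.

```python
import math
import statistics


def raw_cn_paired(s_tumor, s_normal):
    """Pair design: raw CN per SNP = log2(S_tumor / S_normal)."""
    return [math.log2(t / n) for t, n in zip(s_tumor, s_normal)]


def raw_cn_group(s_tumor, normal_group):
    """Group design: S_normal is the average S over a group of normals.

    normal_group: list of normal samples, each a list of summarized
    intensities in the same SNP order as s_tumor.
    """
    s_normal = [statistics.fmean(values) for values in zip(*normal_group)]
    return raw_cn_paired(s_tumor, s_normal)
```

Under this definition a raw CN of 0 means no change relative to normal, positive values suggest amplification, and negative values suggest deletion.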

Page 16:

Individual Level Analysis

Analysis for each individual sample (or each sample pair)

Significance test of CN amplification and deletion

Boundary finding (smoothing and segmentation)

CN estimation

Page 17:

Intensities and Raw CNs, Chr. 1 (Pair#101)

Black: Normal, Red: Tumor, Green: Tumor - Normal

Page 18:

Significance Test for Copy Number Changes: -log(p) values, chr. 1, pair#101

Window-based t test

Window size = 0.5 Mbp (~30 SNPs); N = SNP number in window

$t = \dfrac{\text{Mean CN of window}}{\text{SD of window}} \times \sqrt{N} \sim t(df = N - 1)$

[Figure: -log(p) plotted against window position (Mbp)]
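A minimal sketch of the window-based one-sample t test (null hypothesis: mean raw CN of the window is 0), assuming SNP positions in base pairs and non-overlapping 0.5 Mbp windows. The function name and the rule for skipping windows are assumptions. Only the t statistic and its degrees of freedom are computed here. The p value would then come from the t distribution with N - 1 degrees of freedom, for example via `scipy.stats.t.sf` if SciPy is available.

```python
import math
import statistics
from collections import defaultdict


def window_t_statistics(positions_bp, raw_cn, window_bp=500_000):
    """t statistic per fixed-width window along one chromosome.

    Returns {window_start_bp: (t, df)} where
    t = mean(CN) / SD(CN) * sqrt(N) and df = N - 1.
    Windows with fewer than 2 SNPs or zero SD are skipped (an assumption).
    """
    windows = defaultdict(list)
    for pos, cn in zip(positions_bp, raw_cn):
        windows[(pos // window_bp) * window_bp].append(cn)
    results = {}
    for start, values in sorted(windows.items()):
        n = len(values)
        if n < 2:
            continue
        sd = statistics.stdev(values)
        if sd == 0:
            continue
        t = statistics.fmean(values) / sd * math.sqrt(n)
        results[start] = (t, n - 1)
    return results
```

A large positive t points to amplification in the window and a large negative t to deletion.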

Page 19:

Genome-wide Raw CN Changes (Pair#105)

Page 20:

Genome-wide Window-based Test of CN Changes (Pair#105)

[Figure: -log(p) across the genome]

Page 21:

Segmentation

Bioconductor R Packages (www.bioconductor.org)

- GLAD package, adaptive weights smoothing (AWS) method
- DNAcopy package, circular binary segmentation method
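To illustrate what segmentation does, here is a sketch of plain recursive binary segmentation on raw CN values. It is deliberately simpler than both methods named above: it is neither the circular binary segmentation of DNAcopy nor the AWS method of GLAD. The fixed threshold stands in for the permutation-based tests those packages use. All names and default values are assumptions.

```python
import math
import statistics


def best_split(values):
    """Return (index, statistic) of the split that maximizes the
    standardized difference between left and right means. The index is
    the first position of the right-hand segment."""
    n = len(values)
    total = sum(values)
    sd = statistics.pstdev(values) or 1e-12  # guard against zero SD
    best_idx, best_stat = None, 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        diff = left_sum / i - (total - left_sum) / (n - i)
        stat = abs(diff) / (sd * math.sqrt(1 / i + 1 / (n - i)))
        if stat > best_stat:
            best_idx, best_stat = i, stat
    return best_idx, best_stat


def segment(values, threshold=3.0, min_len=3, offset=0):
    """Recursively split values into segments of roughly constant mean.

    Returns a list of (start, end, mean) tuples with end exclusive.
    A split is accepted only if its statistic exceeds `threshold` and
    both sides keep at least `min_len` points.
    """
    if len(values) >= 2 * min_len:
        idx, stat = best_split(values)
        if (idx is not None and stat > threshold
                and min_len <= idx <= len(values) - min_len):
            return (segment(values[:idx], threshold, min_len, offset)
                    + segment(values[idx:], threshold, min_len, offset + idx))
    return [(offset, offset + len(values), statistics.fmean(values))]
```

The boundaries between returned segments are the estimated breakpoints, and each segment mean is a smoothed raw CN for that region.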

Page 22:

CN Estimation: Hidden Markov Model (HMM)

CNAT (www.affymetrix.com); dChip (www.dchip.org); CNAG (www.genome.umin.jp)

[Figure: HMM along chromosome position. Hidden status (unknown CN): CN=? at each of … SNP_i, SNP_i+1, SNP_i+2, SNP_i+3, SNP_i+4 … Observed status (raw CN = log ratio of intensities): one log ratio per SNP.]

CN estimation: finding a sequence of CN values which maximizes the likelihood of observed raw CN.

Algorithm: Viterbi algorithm (can be Iterative)

Information/assumptions below are needed

Background probabilities: Overall probabilities of possible CN values.

P(CN=x); x=-2,-1,0,1,2,3,…, n (usually,n<10)

Transition probabilities: Probabilities of CN values of each SNP conditional on the previous one.

P(CN_i+1=x|CN_i=y); x=-2,-1,0,1,2,3,…, or n; y=-2,-1,0,1,2,3, …, or n

Emission probabilities: Probabilities of observed raw CN values of each SNP conditional on the hidden/unknown/true CN status.

P(log ratio<x|CN=y)=f(x|CN=y); x=one of real numbers; y=-2,-1,0,1,2,3, …, or n
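The three ingredients above (background, transition, and emission probabilities) are what the Viterbi algorithm needs. The sketch below assumes Gaussian emissions with one shared standard deviation. The function names, the emission form, and all numeric values in the usage note are assumptions for illustration, and this is not how CNAT, dChip, or CNAG are implemented.

```python
import math


def gaussian_log_pdf(x, mean, sd):
    """Log density of Normal(mean, sd) at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_cn(log_ratios, states, background, transition, emission_mean, emission_sd):
    """Most likely hidden CN sequence for the observed log ratios.

    states: list of possible CN values.
    background[s]: P(CN = s).
    transition[a][b]: P(CN_{i+1} = b | CN_i = a).
    Emission (assumed): log ratio | CN = s ~ Normal(emission_mean[s], emission_sd).
    """
    score = {
        s: math.log(background[s])
        + gaussian_log_pdf(log_ratios[0], emission_mean[s], emission_sd)
        for s in states
    }
    back = []  # back[i][b] = best previous state when SNP i+1 is in state b
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for b in states:
            prev = max(states, key=lambda a: score[a] + math.log(transition[a][b]))
            pointers[b] = prev
            new_score[b] = (score[prev] + math.log(transition[prev][b])
                            + gaussian_log_pdf(x, emission_mean[b], emission_sd))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

One possible setup, again only as an assumption, uses states 1, 2, 3 with emission means log2(CN/2), a high probability of staying in the same state, and a small probability of switching. The high self-transition probability is what smooths isolated noisy SNPs into contiguous CN regions.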

Page 23:

HMM Estimation of CN for Chr. 1 (Pair#101)

Black: Normal Intensities, Red: Tumor Intensities, Green: Tumor - Normal, Blue: HMM estimated CNs in Tumor Tissue

[Figure: regions labeled CN=2, CN=1, CN=4, CN=3]

Page 24:

Population Level Analysis

Analysis for the whole group (or sub-group) of samples

Overall significance test

Amplification and deletion frequencies summarization

Common/concurrent region finding

Associations (with mutations, LOHs, clinical variables …)

Page 25:

Genome-wide Raw CN Changes (average over ~400 pairs)

Page 26:

Raw CN Changes of Chr. 14 (average over ~400 pairs)

Page 27:

Sliding Window Analysis

[Figure: overlapping windows (Window 1, Window 2, …, Window k, …, Window N) sliding along a row of SNPs]

Each window (k) contains 30 consecutive SNPs (k, k+1, k+2, k+3, …, k+29)
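The window definition can be sketched directly: window k averages the raw CN of SNPs k through k+29. The function name is hypothetical, and the input is assumed to be raw CN values ordered by chromosomal position.

```python
import statistics


def sliding_window_means(raw_cn, size=30):
    """Mean raw CN for each window of `size` consecutive SNPs.

    Window k covers SNPs k .. k+size-1, so a chromosome with N SNPs
    yields N - size + 1 overlapping windows (none if N < size).
    """
    return [statistics.fmean(raw_cn[k:k + size])
            for k in range(len(raw_cn) - size + 1)]
```

Because consecutive windows share 29 of their 30 SNPs, the resulting curve is much smoother than the per-SNP raw CN.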

Page 28:

Genome-wide Raw Copy Number Changes (sliding window plot, averaged over ~400 pairs)

Page 29:

Sliding Window Test of Significance of CN Changes: -log(p) values, based on ~400 pairs

Page 30:

CN Change Frequencies in Population (Chr. 14, ~400 pairs)

Black: Freq.(CN>0); Red: Freq.(CN>0, significant amplification at 0.01 level); Green: Freq.(CN<0, significant deletion at 0.01 level)
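The three plotted frequencies can be computed per locus (or per window) as sketched below. The function name is hypothetical, and the inputs are assumed to be one raw CN value and one test p value per sample pair.

```python
def cn_change_frequencies(raw_cn, p_values, alpha=0.01):
    """Frequencies across sample pairs at one locus or window.

    raw_cn[i], p_values[i]: raw CN and significance-test p value for pair i.
    Returns a tuple:
      (freq of CN > 0,
       freq of CN > 0 with p < alpha, i.e. significant amplification,
       freq of CN < 0 with p < alpha, i.e. significant deletion)
    """
    n = len(raw_cn)
    gain = sum(cn > 0 for cn in raw_cn) / n
    amp = sum(cn > 0 and p < alpha for cn, p in zip(raw_cn, p_values)) / n
    dele = sum(cn < 0 and p < alpha for cn, p in zip(raw_cn, p_values)) / n
    return gain, amp, dele
```

Repeating this for every window along a chromosome gives the three curves described in the legend.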

Page 31:

Population Level Segmentation Analysis (~400 pairs)

Circular Binary Segmentation approach, Bioconductor Package DNAcopy

Page 32:

Segmentation of Chr. 14 (average result of ~400 pairs)

Page 33:

Visualization of Concurrent Regions of Chr. 14 (~400 pairs)

[Figure: axes are positions and samples]

Page 34:

Group-specific Analysis

Black: non-smokers, Red: non-smokers

Page 35:

Separate Tumor Samples from Normal Samples Using Six Chromosomal Peaks with Significant CN Changes

(Classification Based on RAW CN)

Tumor

Normal

Page 36:

Mapping Known Cancer-related Genes onto the Copy Number Map

Page 37:

Software

Affymetrix Chips (www.affymetrix.com)

Illumina Chips (www.illumina.com)

CNAT (www.affymetrix.com); dChip (www.dchip.org); CNAG (www.genome.umin.jp)

GenePattern (www.broad.mit.edu/cancer/software/genepattern/)

Bioconductor R Packages (www.bioconductor.org)

- GLAD package, adaptive weights smoothing (AWS) method
- DNAcopy package, circular binary segmentation method

Windows? Unix? Parallel Computation?

Page 38:

References

• R Gentleman et al. Bioinformatics and computational biology solutions using R and Bioconductor. Springer, 2005

• JL Freeman et al. Genome Research 2006; 16:949-961

• J Huang et al. Hum Genomics. 2004;1(4):287-99

• X Zhao et al. Cancer Research 2004; 64:3060-3071

• Y Nannya et al. Cancer Research 2005, 65: 6071-6079

• … see google …

Page 39:

Acknowledgements

Division of Statistical Genomics: Aldi Kraja, Ingrid Borecki, Michael Province

Medical Sequencing Group: Li Ding, John Osborne, Ken Chen

Center for Genome Sciences

Washington University School of Medicine