Extensive genetic variation in somatic human tissues · Extensive genetic variation in somatic...

6
Extensive genetic variation in somatic human tissues Maeve OHuallachain a,b , Konrad J. Karczewski a,c , Sherman M. Weissman d,1 , Alexander Eckehart Urban a,e , and Michael P. Snyder a,1 a Department of Genetics, c Biomedical Informatics Training Program, and e Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA 94305; and Departments of b Molecular, Cellular and Developmental Biology and d Genetics, Yale University, New Haven, CT 06520 Contributed by Sherman M. Weissman, August 15, 2012 (sent for review June 22, 2012) Genetic variation between individuals has been extensively investi- gated, but differences between tissues within individuals are far less understood. It is commonly assumed that all healthy cells that arise from the same zygote possess the same genomic content, with a few known exceptions in the immune system and germ line. However, a growing body of evidence shows that genomic variation exists between differentiated tissues. We investigated the scope of somatic genomic variation between tissues within humans. Analysis of copy number variation by high-resolution array-comparative ge- nomic hybridization in diverse tissues from six unrelated subjects reveals a signicant number of intraindividual genomic changes between tissues. Many (79%) of these events affect genes. Our re- sults have important consequences for understanding normal ge- netic and phenotypic variation within individuals, and they have signicant implications for both the etiology of genetic diseases such as cancer and for immortalized cell lines that might be used in research and therapeutics. somatic mosiacism | genomic rearrangement G enetic variation between individuals has been extensively explored (1, 2), but far less effort has been made to ascertain somatic genetic variation in healthy individuals. This information is important for understanding how somatic mutations arise de novo that lead to both phenotypic variation and somatic genetic human diseases such as cancer. There is presently considerable interest in generation of induced pluripotent stem cells (iPS cells) from the cells of adult tissues because these cells have the po- tential to be used for therapeutic purposes. Evidence of genomic alterations has been found in human embryonic stem cells (3) and human iPS cells (47). Recent ndings suggest that immortalized cell lines (8) contain signicant copy number differences, al- though it is unclear whether these variations arise in vitro during cell culturing or originate in vivo in somatic cell populations. On the basis of results from the 1,000 Genomes Project and other analyses, it is estimated that there are upward of 2,500 structural variations (duplications, insertions, and inversions) be- tween any two individuals, 7001,000 of which are greater than 1 kb (9, 10). The 1,000 Genomes Project analyses estimate upward of 3 million single nucleotide polymorphisms between any two humans (2). Sequence analysis of parentoffspring trios and quartets has revealed a mutation rate corresponding to 70 de novo single nucleotide variations that occur between parents and offspring, presumably through germline mutations (11, 12). Structural var- iations between parents and offspring were not measured. Somatic mosaicism is the presence of genetically distinct populations of somatic cells within an organism and even within the same tissue. Somatic rearrangements in various cancer cell types are well documented, and more than 30 Mendelian dis- eases have been associated with somatic mosaicism (13). Somatic mosaicism is well known to occur in normal germline cells as the process of meiotic recombination and in cells of the immune system within Ig and T-cell receptor genes as a result of V(D)J recombination that provides diversity in immune responses. Other studies have revealed age-related structural variation in human blood cells (14), aneuploidy and retrotransposition in the human brain (15, 16), and aneuploidy in human preimplantation embryos (17). Evidence of copy number variation (CNV) in monozygotic twins (18) also reveals the presence of somatic mosaicism in tissues arising from the same zygote. High-resolution genomic arrays enable the monitoring of ge- netic variation with an accuracy not previously applied to somatic tissues. Somatic mosaicism has been studied in cultured cells (8), but only one study has documented copy number variation be- tween unxed human tissues within individuals using BAC arrays for array-comparative genomic hybridization (aCGH) (19). These arrays produce low-resolution data, and they have high error rates, leaving uncertainty. Although deep whole-genome se- quencing provides the highest resolution and the ability to resolve rearrangements to the base pair, mapping sequences from mosaic samples remains a challenge using available sequencing tech- nologies. aCGH using arrays that can resolve rearrangements as small as 2 kb provides a sensitive and well-established technol- ogy for this investigation and can detect mosaic events in tissues. Results Identication of Copy Number Variations (CNVs) in Somatic Tissues. The presence of somatic CNVs was investigated using organ tissues from subjects obtained during routine autopsy. In total, tissues from six individuals (age range, 4585 y) not known to be affected by any disease with a genetic component (e.g., heredi- tary disorders or cancer) were analyzed using Nimblegen 2.1M oligonucleotide whole-genome arrays (20). Between three and 11 tissues were tested for each of the six subjects. The tissue that yielded the most abundant and high-quality DNA for each subject was selected as the reference. The remaining tissues, referred to as test tissues, were each hybridized against the reference for each subject. DNA from the test tissues was labeled with Cy3 uo- rescent dye and hybridized to DNA from the reference tissue labeled with Cy5 uorescent dye. For each tissue comparison, dye swapexperiments were also performed in which the ref- erence DNA was labeled with Cy3 and the test DNA was labeled with Cy5. We required that an event be observed in both dye swapexperiments to be included in the list of total somatic CNVs (Fig. 1A and Table S1). As a control, the reference DNA of one subject was labeled with Cy3 and hybridized to the same DNA labeled with Cy5. Positive events called using Nexus Copy Number software were detected when different tissues types were hybridized, and the number varied with the threshold used (Table S2). CNVs above the threshold chosen for calling somatic CNVs were not detected when the reference DNA was hybrid- ized to itself. Author contributions: M.O., S.M.W., A.E.U., and M.P.S. designed research; M.O. per- formed research; A.E.U. contributed new reagents/analytic tools; M.O. and K.J.K. ana- lyzed data; and M.O., S.M.W., A.E.U., and M.P.S. wrote the paper. The authors declare no conict of interest. Data deposition: The data reported in this paper have been deposited in the Gene Ex- pression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. GSE40159). 1 To whom correspondence may be addressed. E-mail: [email protected] or [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1213736109/-/DCSupplemental. 1801818023 | PNAS | October 30, 2012 | vol. 109 | no. 44 www.pnas.org/cgi/doi/10.1073/pnas.1213736109 Downloaded by guest on June 15, 2021

Transcript of Extensive genetic variation in somatic human tissues · Extensive genetic variation in somatic...

  • Extensive genetic variation in somatic human tissuesMaeve O’Huallachaina,b, Konrad J. Karczewskia,c, Sherman M. Weissmand,1, Alexander Eckehart Urbana,e,and Michael P. Snydera,1

    aDepartment of Genetics, cBiomedical Informatics Training Program, and eDepartment of Psychiatry and Behavioral Sciences, Stanford University School ofMedicine, Stanford, CA 94305; and Departments of bMolecular, Cellular and Developmental Biology and dGenetics, Yale University, New Haven, CT 06520

    Contributed by Sherman M. Weissman, August 15, 2012 (sent for review June 22, 2012)

    Genetic variation between individuals has been extensively investi-gated, but differences between tissues within individuals are far lessunderstood. It is commonly assumed that all healthy cells that arisefrom the same zygote possess the same genomic content, with a fewknown exceptions in the immune system and germ line. However,a growing body of evidence shows that genomic variation existsbetween differentiated tissues. We investigated the scope ofsomatic genomic variation between tissues within humans. Analysisof copy number variation by high-resolution array-comparative ge-nomic hybridization in diverse tissues from six unrelated subjectsreveals a significant number of intraindividual genomic changesbetween tissues. Many (79%) of these events affect genes. Our re-sults have important consequences for understanding normal ge-netic and phenotypic variation within individuals, and they havesignificant implications for both the etiology of genetic diseases suchas cancer and for immortalized cell lines that might be used inresearch and therapeutics.

    somatic mosiacism | genomic rearrangement

    Genetic variation between individuals has been extensivelyexplored (1, 2), but far less effort has been made to ascertainsomatic genetic variation in healthy individuals. This informationis important for understanding how somatic mutations arise denovo that lead to both phenotypic variation and somatic genetichuman diseases such as cancer. There is presently considerableinterest in generation of induced pluripotent stem cells (iPS cells)from the cells of adult tissues because these cells have the po-tential to be used for therapeutic purposes. Evidence of genomicalterations has been found in human embryonic stem cells (3) andhuman iPS cells (4–7). Recent findings suggest that immortalizedcell lines (8) contain significant copy number differences, al-though it is unclear whether these variations arise in vitro duringcell culturing or originate in vivo in somatic cell populations.On the basis of results from the 1,000 Genomes Project and

    other analyses, it is estimated that there are upward of 2,500structural variations (duplications, insertions, and inversions) be-tween any two individuals, 700–1,000 of which are greater than 1 kb(9, 10). The 1,000 Genomes Project analyses estimate upward of 3million single nucleotide polymorphisms between any two humans(2). Sequence analysis of parent–offspring trios and quartets hasrevealed a mutation rate corresponding to ∼70 de novo singlenucleotide variations that occur between parents and offspring,presumably through germline mutations (11, 12). Structural var-iations between parents and offspring were not measured.Somatic mosaicism is the presence of genetically distinct

    populations of somatic cells within an organism and even withinthe same tissue. Somatic rearrangements in various cancer celltypes are well documented, and more than 30 Mendelian dis-eases have been associated with somatic mosaicism (13). Somaticmosaicism is well known to occur in normal germline cells as theprocess of meiotic recombination and in cells of the immunesystem within Ig and T-cell receptor genes as a result of V(D)Jrecombination that provides diversity in immune responses.Other studies have revealed age-related structural variation inhuman blood cells (14), aneuploidy and retrotransposition in thehuman brain (15, 16), and aneuploidy in human preimplantation

    embryos (17). Evidence of copy number variation (CNV) inmonozygotic twins (18) also reveals the presence of somaticmosaicism in tissues arising from the same zygote.High-resolution genomic arrays enable the monitoring of ge-

    netic variation with an accuracy not previously applied to somatictissues. Somatic mosaicism has been studied in cultured cells (8),but only one study has documented copy number variation be-tween unfixed human tissues within individuals using BAC arraysfor array-comparative genomic hybridization (aCGH) (19). Thesearrays produce low-resolution data, and they have high errorrates, leaving uncertainty. Although deep whole-genome se-quencing provides the highest resolution and the ability to resolverearrangements to the base pair, mapping sequences from mosaicsamples remains a challenge using available sequencing tech-nologies. aCGH using arrays that can resolve rearrangements assmall as ∼2 kb provides a sensitive and well-established technol-ogy for this investigation and can detect mosaic events in tissues.

    ResultsIdentification of Copy Number Variations (CNVs) in Somatic Tissues.The presence of somatic CNVs was investigated using organtissues from subjects obtained during routine autopsy. In total,tissues from six individuals (age range, 45–85 y) not known to beaffected by any disease with a genetic component (e.g., heredi-tary disorders or cancer) were analyzed using Nimblegen 2.1Moligonucleotide whole-genome arrays (20). Between three and11 tissues were tested for each of the six subjects. The tissue thatyielded the most abundant and high-quality DNA for each subjectwas selected as the reference. The remaining tissues, referred to astest tissues, were each hybridized against the reference for eachsubject. DNA from the test tissues was labeled with Cy3 fluo-rescent dye and hybridized to DNA from the reference tissuelabeled with Cy5 fluorescent dye. For each tissue comparison,“dye swap” experiments were also performed in which the ref-erence DNA was labeled with Cy3 and the test DNA was labeledwith Cy5. We required that an event be observed in both “dyeswap” experiments to be included in the list of total somaticCNVs (Fig. 1A and Table S1). As a control, the reference DNAof one subject was labeled with Cy3 and hybridized to the sameDNA labeled with Cy5. Positive events called using Nexus CopyNumber software were detected when different tissues typeswere hybridized, and the number varied with the threshold used(Table S2). CNVs above the threshold chosen for calling somaticCNVs were not detected when the reference DNA was hybrid-ized to itself.

    Author contributions: M.O., S.M.W., A.E.U., and M.P.S. designed research; M.O. per-formed research; A.E.U. contributed new reagents/analytic tools; M.O. and K.J.K. ana-lyzed data; and M.O., S.M.W., A.E.U., and M.P.S. wrote the paper.

    The authors declare no conflict of interest.

    Data deposition: The data reported in this paper have been deposited in the Gene Ex-pression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. GSE40159).1To whom correspondence may be addressed. E-mail: [email protected] [email protected].

    This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental.

    18018–18023 | PNAS | October 30, 2012 | vol. 109 | no. 44 www.pnas.org/cgi/doi/10.1073/pnas.1213736109

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    15,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/st01.docxhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/st02.docxhttp://www.ncbi.nlm.nih.gov/geohttp://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40159mailto:[email protected]:[email protected]://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplementalhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplementalwww.pnas.org/cgi/doi/10.1073/pnas.1213736109

  • To set a threshold for calling CNVs, we analyzed a number ofCNVs in various tissues using an initial set of 50 NanoStringprobes. A threshold was established that minimized the falsediscovery rate (10%; Materials and Methods). Two hundredtwenty-nine CNVs passed the threshold. CNVs from telomericand centromeric regions were removed, because these regionsare prone to cause cross-hybridization signals on aCGH. CNVsthat appeared in the same location in multiple test tissues fromthe same individual were ascribed to the reference tissue. Thisresulted in a set of 178 candidate somatic CNVs. Of the 178

    regions, 168 were amenable to NanoString probe design andsubjected to a second round of validation. Of the 168 regionstested by NanoString technology, 73 were validated across sixpeople and 36 tissue comparisons (Fig. 1A and Table S1). Werefer to this list of 73 events as the high-confidence set.The frequency of high-confidence events varied from 0 to 45

    per individual, with most events detected within the two individ-uals with the most tissues tested. Reference tissues often yieldedthe greatest numbers of CNVs per individual. The increasednumber of tests performed with the reference tissues presumablyincreased detection of CNVs in those samples. It is plausible thatall 178 events that were reproducibly detected by the arrays arebona fide events but that many were not confirmed using theNanoString technology owing to inadequate probe design or theywere at a level below the stringent threshold used in our valida-tion process. Regardless, these results indicate that there are alarge number of somatic CNVs in the tissues of adults.

    CNVs Occur in Many Tissue Types and Can Be Large. Somatic CNVswere detected by aCGH in a variety of tissues (Fig. 1A, Table S1,and Table S3). The greatest number of somatic CNVs wasdetected in the pancreas of subject 6 (43 events), followed bythose observed in pancreas of subject 3 (35 events) and kidney ofsubject 1 (21 events). Events in brain tissues were detected byaCGH in the leptomeningeal tissue (two events) of subject 1, aswell as in tissues of the inferior parietal lobe (10 events) andmiddle frontal gyrus (one event) of subject 6. There were fiveevents in subject 6 that appeared in more than one brain tissueand likely affect the entire brain. Events in the small intestinetissues were observed in subjects 1 and 5 (seven and six events,respectively). Somatic CNVs were detected by aCGH in all of theliver tissues tested (four subjects). Tissues of the reproductivesystem were obtained for the two female subjects analyzed. Tenovary-specific events were detected in subject 3, and eight uterus-specific events were detected in subject 4. Examples of CNVs dis-covered by NimbleGen and validated by NanoString are shown inFig. 1B and Fig. 2. The validated events in Fig. 1B demonstrate thatsomatic CNVs have signals less than 1.5 (a 3:2 ratio for a dupli-cation) or greater than 0.5 (a 1:2 ratio for a deletion). In part, this isto be expected because the tissues and organs contain a substantialfraction of nonparenchymal cells from blood vessels and connec-tive tissue. However, the signals suggest that the events may not behomogeneous throughout the parenchymal cells of the tissue.The size distribution of somatic CNVs was analyzed (Fig. 3A).

    Somatic CNVs with sizes smaller than 10 kb were binned into1-kb intervals, somatic CNVs with sizes between 10 kb and 100 kbwere binned into 10-kb intervals, and somatic CNVs with sizeslarger than 100 kb were binned into 100-kb intervals. The sizes ofthe total somatic CNVs discovered by aCGH ranged from 2.0 kbto 184 kb, and those of the NanoString-validated set ranged from3.2 kb to 184 kb. Thus, larger events were more likely to bevalidated by NanoString: 30 of 32 CNVs (93.8%) larger than 20kb were validated, whereas only 43 of 146 CNVs (29.5%) smallerthan 20 kb were validated. It is possible that the smaller CNVswere either false positive, or the validation probes were notoptimal for the more limited target regions.

    Many Somatic CNVs Affect Genes Involved in Regulation. We de-termined the number of high-confidence CNVs that affect genesand found 58 (79%) intersect genes, including introns, exons, orupstream sequence; 51 of these (70% of the total) directly affectexons (Fig. 3B). Fig. 2 shows examples of CNVs that affect theACOXL, BCL2L11, TMEM132D, NFIL3, and NR4A2 genes.Fig. 2A shows a ∼13.7-kb increase in signal in DNA from livertissue when hybridized against DNA from reference kidney tis-sue, suggesting a duplication in liver of a region overlapping boththe BCL2L11 gene and the ACOXL gene. BCL2L11 mRNA isexpressed in several human tissues, including liver (21). Fig. 2C

    Fig. 1. (A) Locations of somatic CNVs for each of six subjects. (B) Ratios ofselected somatic CNVs by NimbleGen aCGH and NanoString validation. Acontrol region with no CNV between somatic tissues is also included. HC,hippocampus; IPL, inferior parietal lobe; LM, leptomeninges; MFG, middlefrontal gyrus; SI, small intestine.

    O’Huallachain et al. PNAS | October 30, 2012 | vol. 109 | no. 44 | 18019

    GEN

    ETICS

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    15,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/st01.docxhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/st01.docxhttp://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/st03.docx

  • shows an increase in signal of a ∼21.2-kb region on chromosome9 of subject 6 in liver tissue DNA when hybridized against pan-creas reference tissue DNA, suggesting a duplication of a regionencompassing the NFIL3 gene in liver tissue. Expression ofNFIL3 was previously reported in human liver tissue (22). Fig. 2D and E show examples of somatic CNVs observed in the DNAof reference tissues. Fig. 2E shows an increase in signal for DNAfrom tissues of three brain regions and liver when hybridizedagainst pancreas tissue DNA of subject 6 in a region encom-passing the NR4A2 gene. Because this event is observed in allhybridizations against the pancreas reference tissue, the event islikely a deletion in the pancreas. More examples are shown inFig. S1. These results indicate that most somatic CNVs affectgenes. Some of these genes have previously been reported to beexpressed in the affected tissues.Investigation of the types of genes affected by high-confidence

    somatic CNVs was performed by Gene Ontology (GO) andKyoto Encyclopedia of Genes and Genomes (KEGG) pathwayanalyses (23, 24) (Fig. 4 and Table S4). GO analysis revealedmodest enrichments for genes within somatic CNVs in ∼30 GO

    categories. Many of the GO categories relate to regulation ofcellular processes such as regulation of phosphorylation, regu-lation of primary metabolic processes, and regulation of geneexpression. Sequence-specific DNA binding and protein bindingwere also enriched in genes within somatic CNVs. KEGG pathwayanalysis revealed modest enrichments in the Wnt signaling path-way and the MAPK signaling pathway. Thus, many of the CNVsare likely to affect regulatory processes in the cell.Further analysis was performed on genes within the somatic

    CNVs discovered to investigate whether any of the affected geneswithin the same subject or tissue are known to interact. Deletionevents affecting the MAP2K3 gene on chromosome 17 and theSMAD7 gene on chromosome 18 were observed in pancreastissue of subject 6. MAP2K3 and SMAD7 were previously shownto be expressed in pancreas tissue (25), and SMAD7 has beenshown to be overexpressed in pancreatic cancer (26). The SMAD7and MAP2K3 genes were previously reported (27) to interact aspart of the TGF-β–dependent activation of p38 MAPK pathwayinducing apoptosis in prostate cancer cells. Thus, in at least onecase the CNVs affect interacting components.

    Fig. 2. aCGH signal for somatic CNVs. (A) Increased signal in liver tissue when hybridized against kidney reference tissue of subject 1 corresponding to aduplication of chromosome 2 DNA in liver. (B) Increased signal in small intestine tissue when hybridized against kidney reference tissue of subject 1 cor-responding to a duplication of chromosome 12 DNA in small intestine. (C) Increased signal in liver tissue when hybridized against pancreas reference tissueof subject 6 corresponding to a duplication of chromosome 9 DNA in liver. (D) Increased signal in small intestine and liver tissues when hybridized againstkidney reference tissue of subject 1 corresponding to a deletion in the kidney reference tissue on chromosome 12. (E ) Increased signal in MFG, hippocampus,IPL, and liver tissues hybridized against pancreas tissue in subject 6 corresponding to a deletion in the pancreas reference tissue on chromosome 2. Controlsshow no signal change between hybridized tissues. DGV, database of genomic variants; IPL, inferior parietal lobe; LM, leptomeninges; MFG, middle frontalgyrus; Seg Dups, segmental duplications. Blue and red dots represent dye-swap experiments.

    18020 | www.pnas.org/cgi/doi/10.1073/pnas.1213736109 O’Huallachain et al.

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    15,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/pnas.201213736SI.pdf?targetid=nameddest=SF1http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/st04.docxwww.pnas.org/cgi/doi/10.1073/pnas.1213736109

  • Several Somatic CNVs Occur in the Same Location in Multiple Individuals.Nearly all events were unique; only seven regions overlap betweentwo individuals, three of which were validated by NanoString. Fig. 5A and B show regions on chromosomes 12 and 14, respectively,where CNVs were observed in both subjects 1 and 6. Fig. 5 A and Bshow increased signal in DNA from liver tissue when hybridizedagainst DNA from kidney tissue of subject 1. An increased signal isobserved for subject 6 in the same regions in DNA from liver tissueand tissue from brain regions when hybridized against DNA fromreference pancreas tissue. Increased signal in all of the test tissuesof subject 6 correspond to deletions in the reference pancreas tis-sue for both chromosomal regions. The events on chromosome 12occur in a region encompassing the GABARAPL1 gene (Fig. 5A).The events on chromosome 14 overlap both the ANG gene and theRNASE4 gene (Fig. 5B). ANG and RNASE4 were previouslyshown to be expressed in pancreas and liver (28, 29). Thus, a fewCNVs (4%) lie in the same regions and contain genes.

    Somatic CNV Breakpoints Analysis. Analysis of the breakpointregions of 73 high-confidence somatic CNVs was undertaken toassess the frequency of repetitive elements near somatic CNVbreakpoints. aCGH does not provide base pair resolution that isnecessary for determining exact CNV formation mechanisms(e.g., nonallelic homologous recombination, nonhomologous end-joining, and transpositions of mobile genomic elements). How-ever, repetitive sequence is known to be integral in many of theCNV formation mechanisms. NimbleGen 2.1M oligonucleotidemicroarrays can achieve breakpoint resolution to ∼2 kb based onprobe spacing of ∼1 kb. Therefore, a 2-kb window was analyzedaround each breakpoint for the presence of six classes of

    repetitive sequence: L1, L2, Alu, LTR, segmental duplications,and microsatellites (Table 1). To test for significant enrichment ordepletion for each class of repetitive element, P values were cal-culated by performing permutations of 10,000 sets of randomintervals. Within each set of random intervals the number and sizeof the intervals was kept equal to those of the validated somaticCNVs being analyzed. The locations of the intervals were ran-domized within the regions probed by the NimbleGen 2.1Mmicroarray. The number of somatic CNVs that intersect a re-petitive element class was compared with the average number ofrandom intervals that intersect the element to calculate an en-richment/depletion ratio. Thirty-nine of 73 somatic CNVs have abreakpoint that intersects a microsatellite, an enrichment of ∼1.29over the genomic background (P < 3.8e−2). Microsatellites havebeen suggested to play a role in chromosomal rearrangement (30),and a colocalization of microsatellites with CNVs was previouslyreported (31). Twenty-five of 73 somatic CNVs have a breakpointthat overlaps an L1, a significant depletion of ∼0.53 comparedwith the genomic background (P < 1.2e−7). Twenty-nine of 73somatic CNVs have a breakpoint that intersects an LTR, a sig-nificant depletion of ∼0.77 compared with random intervals (P <4.8e−2). Segmental duplications, Alu, and L2 were not significantlyenriched or depleted near somatic CNV breakpoints comparedwith average random intervals.In summary, these results show that somatic CNVs occur in

    humans in a variety of tissues. Although the sample size is small,the number of events detected was often higher in dividing tis-sues (e.g., liver, small intestine, and pancreas) relative to non-dividing tissues (e.g., brain).

    DiscussionThis study confirms that humans have extensive genetic variationin somatic tissues. aCGH and the validation method used candetect heterogeneous cell populations within tissue samples.Many of the events detected have aCGH signals correspondingto less than a full-copy duplication or deletion, corroboratingthat there are indeed heterogeneous cell populations within tis-sues. To detect these events, they must be present in a substantialfraction of the parenchymal cells. In proliferating tissues, such asin the cases of p53 mutations in UV-irradiated skin (32) andsome of the CNVs seen in normal blood cells (14), it seems prob-able that the somatic mutation conferred a growth advantage to

    Fig. 3. Analysis of somatic CNVs. (A) Frequency vs. size of all somatic CNVsdiscovered by aCGH and frequency vs. size of all high-confidence somaticCNVs validated by secondary methods. (B) Genomic elements affected byvalidated high-confidence CNVs.

    Fig. 4. Selected GO terms and KEGG pathways enriched in genes withinhigh-confidence somatic CNVs.

    O’Huallachain et al. PNAS | October 30, 2012 | vol. 109 | no. 44 | 18021

    GEN

    ETICS

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    15,

    202

    1

  • the affected cells. In organs that undergo little or no proliferationin the adult, one possibility is that the observed CNV may havegiven a growth or differentiation advantage to the affected pre-cursor cells during development. It is also likely that many moreCNVs exist in small numbers of parenchymal cells but do not giveany growth advantage to the cells.These results have important implications for the detection of

    disease formation and treatment. A stepwise theory of cancerhas been proposed to explain the origin of tumor populations(33–36). This theory posits that genetic instability in pre-cancerous cells leads to an accumulation of mutations thateventually result in cancer. Our results suggest that CNVs mayoccur commonly during somatic cell growth and division. Insome cases, these might result in predilection for diseases such ascancer. Detection of such events in cells of healthy individualsmay have prognostic value for the risk of malignancy. Theseresults also indicate that choosing the relevant tissue for com-parison of normal and disease tissues is important. Use of tissues

    from distant lineages may find events that are not relevant toformation or maintenance of the disease state and that mightsimply be “background” events in another tissue.In developmental biology, there is great interest in understanding

    how common progenitor cells differentiate into distinct tissue types.Our results show that somatic tissues can genetically vary. Althoughour sample size is small, several of the tissues with the highestnumbers of observed events are dividing tissues (e.g., liver andintestine). The pancreas, liver, and small intestine are all derivedfrom the endoderm germ layer. Studies of differentiation haverevealed that the ventral endoderm can give rise to each of thesetissues, depending on signaling from various pathways (37–39).We observed seven events that occurred in the same locations intwo individuals, suggesting that there are potential hotspots forsomatic genomic variation. These types of events were observedin several different tissues, including pancreas, liver, and smallintestine. It remains to be seen whether the somatic genetic dif-ferences observed here play a role in the differentiation process.

    Fig. 5. aCGH signal for somatic CNVs that appear in more than one subject. (A) Increased signal in liver tissue hybridized against kidney tissue of subject 1corresponding to a duplication of chromosome 12 DNA in liver. Increased signal in liver tissue and superior temporal gyrus–middle temporal gyrus (STG-MTG)tissue hybridized against pancreas tissue of subject 6 corresponding to a deletion in the pancreas reference tissue on chromosome 12. (B) Increased signal inliver tissue hybridized against kidney tissue of subject 1 corresponding to a duplication of chromosome 14 DNA in liver. Increased signal in liver tissue andMFG tissue hybridized against pancreas tissue of subject 6 corresponding to a deletion in pancreas reference tissue on chromosome 14. Controls show nosignal change between hybridized tissues. Blue and red dots represent dye-swap experiments.

    Table 1. Analysis for enrichment or depletion of several classes of repetitive elements at somatic CNV breakpoints

    Repeat typeNo. of high-confidence somatic CNVsthat intersect repeat type (out of 73)

    Size range of high-confidence somaticCNVs that intersect repeat type (kb) Enrichment P

    Microsatellite 39 3.23–184 1.291 3.8e−2

    LTR 29 3.23–73.6 0.773 4.8e−2

    L1 25 3.47–39.2 0.534 1.2e−7

    L2 32 3.23–184 1.150 3.1e−1

    Alu 55 3.23–184 1.056 4.5e−1

    Segmental duplication 5 5.33–22.8 0.743 4.8e−1

    18022 | www.pnas.org/cgi/doi/10.1073/pnas.1213736109 O’Huallachain et al.

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    15,

    202

    1

    www.pnas.org/cgi/doi/10.1073/pnas.1213736109

  • Finally, this work is particularly important for therapies thatuse iPS cells. iPS cells are typically derived from somatic tissuesand may contain somatically altered DNA. Indeed, CNVs havebeen observed in iPS cells relative to the starting material fromwhich they were derived (4–7). Our findings indicate that many ofthese events (and perhaps all) are not induced during iPS cellformation but perhaps exist in somatic cells in vivo. Althoughgenetic alterations are generally viewed as negative, somaticvariation could be beneficial. For example, in tissues that fre-quently encounter pathogens, CNVs that eliminate viral receptorsmight enhance host survival. Somatic genomic variation may havewide-reaching implications for diverse aspects of human biologyand health, making it an important area of future research.

    Materials and MethodsaCGH. Tissues that were identified as noncancerous were obtained fromroutine autopsy. DNA was extracted from tissues using the Qiagen DNeasyBlood & Tissue Kit (Qiagen). For each subject, the tissue that yielded thegreatest amount of DNA was selected as the reference tissue. Each of theother tissues from a subject was compared against the reference tissue.aCGH was performed using NimbleGen (Roche NimbleGen) 2.1M micro-arrays according to the NimbleGen protocol. Dye swap replicates were car-ried out for each tissue comparison to control for dye-specific bias. Arrayswere scanned using the Roche MS 200 Microarray Scanner. Images wereanalyzed using NimbleScan 2.6 software. CNV calls were made with NexusCopy Number version 6.0 (BioDiscovery) using the Fast Adaptive States Seg-mentation Technique 2 (FASST2) segmentation algorithm, a hidden Markovmodel-based approach.

    Validation. NanoString probes were designed to 50 regions of possible copynumber change. Half of the regions had two probes designed, whereas theother half had three probes designed. Hybridization, sample prep, andscanning were performed according to the NanoString protocol. Dataanalysis was performed using the nCounter CNV Collector tool. Copy numberwas determined by averaging over two to three probes per region. To val-idate the NimbleGen CNV calls, test/reference NanoString ratios were com-pared with NimbleGen ratios. For ratios below 1 (i.e., deletions), the ratioswere inverted for a direct comparison. Events with NanoString (validation)ratios of 1.25 or greater were considered to be valid CNVs. A false-positiverate of 10%was used as a cutoff, which corresponded to a NimbleGen ratio of1.3 for subjects 1–5 and 1.35 for subject 6 (Figs. S2 and S3). All thresholdinganalyses were performed using the ROCR package for the R statisticalcomputing package. These ratios were used as the thresholds for callingCNVs from the NimbleGen data for each tissue comparison using the FASSTsegmentation algorithm of the Nexus Copy Number (BioDiscovery) software.A second round of NanoString probes was designed to the candidate CNVsthat had not been previously validated. One probe was used for each of 168regions. Candidate events with NanoString ratios of 1.25 or greater wereconsidered validated.

    Note Added In Proof. Our speculation that many of these events are notinduced during iPS cell formation but perhaps exist in somatic cells in vivoreceived support from complementary data in a recent paper (40).

    ACKNOWLEDGMENTS. We thank Alexander Vortmeyer and colleagues atthe Yale Pathology Department for providing tissue samples. This work wassupported by National Institutes of Health Grants 5R01MH094740 (to M.P.S.and A.E.U.) and 5P50HG002357 (to M.P.S.).

    1. Altshuler DM, et al; International HapMap 3 Consortium (2010) Integrating commonand rare genetic variation in diverse human populations. Nature 467:52–58.

    2. Durbin RM, et al.; 1000 Genomes Project Consortium (2010) A map of human genomevariation from population-scale sequencing. Nature 467:1061–1073.

    3. Funk WD, et al. (2012) Evaluating the genomic and sequence integrity of human EScell lines; comparison to normal genomes. Stem Cell Res (Amst) 8:154–164.

    4. Mayshar Y, et al. (2010) Identification and classification of chromosomal aberrationsin human induced pluripotent stem cells. Cell Stem Cell 7:521–531.

    5. Hussein SM, et al. (2011) Copy number variation and selection during reprogrammingto pluripotency. Nature 471:58–62.

    6. Laurent LC, et al. (2011) Dynamic changes in the copy number of pluripotency and cellproliferation genes in human ESCs and iPSCs during reprogramming and time inculture. Cell Stem Cell 8:106–118.

    7. Martins-Taylor K, et al. (2011) Recurrent copy number variations in human inducedpluripotent stem cells. Nat Biotechnol 29:488–491.

    8. Mkrtchyan H, et al. (2010) Early embryonic chromosome instability results in stablemosaic pattern in human tissues. PLoS ONE 5:e9591.

    9. Mills RE, et al.1000 Genomes Project (2011) Mapping copy number variation bypopulation-scale genome sequencing. Nature 470:59–65.

    10. Abyzov A, Urban AE, Snyder M, Gerstein M (2011) CNVnator: An approach to dis-cover, genotype, and characterize typical and atypical CNVs from family and pop-ulation genome sequencing. Genome Res 21:974–984.

    11. Roach JC, et al. (2010) Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328:636–639.

    12. Conrad DF, et al.1000 Genomes Project (2011) Variation in genome-wide mutationrates within and between human families. Nat Genet 43:712–714.

    13. Wada T, Candotti F (2008) Somatic mosaicism in primary immune deficiencies. CurrOpin Allergy Clin Immunol 8:510–514.

    14. Forsberg LA, et al. (2012) Age-related somatic structural changes in the nuclear ge-nome of human blood cells. Am J Hum Genet 90:217–228.

    15. Yurov YB, et al. (2005) The variation of aneuploidy frequency in the developing andadult human brain revealed by an interphase FISH study. J Histochem Cytochem 53(3):385–390.

    16. Baillie JK, et al. (2011) Somatic retrotransposition alters the genetic landscape of thehuman brain. Nature 479:534–537.

    17. Wells D, Delhanty JD (2000) Comprehensive chromosomal analysis of human pre-implantation embryos using whole genome amplification and single cell comparativegenomic hybridization. Mol Hum Reprod 6:1055–1062.

    18. Bruder CE, et al. (2008) Phenotypically concordant and discordant monozygotic twinsdisplay different DNA copy-number-variation profiles. Am J Hum Genet 82:763–771.

    19. Piotrowski A, et al. (2008) Somatic mosaicism for copy number variation in differ-entiated human tissues. Hum Mutat 29:1118–1124.

    20. Nuwaysir EF, et al. (2002) Gene expression analysis using oligonucleotide arraysproduced by maskless photolithography. Genome Res 12:1749–1755.

    21. Gaviraghi M, et al. (2008) Identification of a candidate alternative promoter region ofthe human Bcl2L11 (Bim) gene. BMC Mol Biol 9:56.

    22. Zhang W, et al. (1995) Molecular cloning and characterization of NF-IL3A, a transcrip-tional activator of the human interleukin-3 promoter. Mol Cell Biol 15:6055–6063.

    23. Zhang B, Kirov S, Snoddy J (2005) WebGestalt: An integrated system for exploringgene sets in various biological contexts. Nucleic Acids Res 33(Web Server issue):W741–W748.

    24. Duncan D, Prodduturi N, Zhang B (2010) WebGestalt2: An updated and expandedversion of the Web-based Gene Set Analysis Toolkit. BMC Bioinformatics 11(Suppl 4):P10.

    25. Marselli L, et al. (2008) Gene expression of purified beta-cell tissue obtained fromhuman pancreas with laser capture microdissection. J Clin Endocrinol Metab 93:1046–1053.

    26. Kleeff J, et al. (1999) The TGF-beta signaling inhibitor Smad7 enhances tumorigenicityin pancreatic cancer. Oncogene 18:5363–5372.

    27. Edlund S, et al. (2003) Transforming growth factor-beta1 (TGF-beta)-induced apo-ptosis of prostate cancer cells involves Smad7-dependent activation of p38 by TGF-beta-activated kinase 1 and mitogen-activated protein kinase kinase 3. Mol Biol Cell14:529–544.

    28. Rosenberg HF, Dyer KD (1995) Human ribonuclease 4 (RNase 4): Coding sequence,chromosomal localization and identification of two distinct transcripts in human so-matic tissues. Nucleic Acids Res 23:4290–4295.

    29. Futami J, et al. (1997) Tissue-specific expression of pancreatic-type RNases and RNaseinhibitor in humans. DNA Cell Biol 16:413–419.

    30. Ugarkovi�c D, Plohl M (2002) Variation in satellite DNA profiles—causes and effects.EMBO J 21:5955–5959.

    31. Kim PM, et al. (2008) Analysis of copy number variants and segmental duplications inthe human genome: Evidence for a change in the process of formation in recentevolutionary history. Genome Res 18:1865–1874.

    32. Klein AM, Brash DE, Jones PH, Simons BD (2010) Stochastic fate of p53-mutant epi-dermal progenitor cells is tilted toward proliferation by UV B during preneoplasia.Proc Natl Acad Sci USA 107:270–275.

    33. Nowell PC (1976) The clonal evolution of tumor cell populations. Science 194:23–28.34. Fearon ER, Vogelstein B (1990) A genetic model for colorectal tumorigenesis. Cell 61:

    759–767.35. Lengauer C, Kinzler KW, Vogelstein B (1998) Genetic instabilities in human cancers.

    Nature 396:643–649.36. Jackson AL, Loeb LA (1998) On the origin of multiple mutations in human cancers.

    Semin Cancer Biol 8:421–429.37. Deutsch G, Jung J, Zheng M, Lóra J, Zaret KS (2001) A bipotential precursor pop-

    ulation for pancreas and liver within the embryonic endoderm. Development 128:871–881.

    38. Rossi JM, Dunn NR, Hogan BL, Zaret KS (2001) Distinct mesodermal signals, includingBMPs from the septum transversum mesenchyme, are required in combination forhepatogenesis from the endoderm. Genes Dev 15:1998–2009.

    39. Chung WS, Shin CH, Stainier DYR (2008) Bmp2 signaling regulates the hepatic versuspancreatic fate decision. Dev Cell 15:738–748.

    40. Abyzov A, Mariani J, Palejev D, Zhang Y, Haney MS, Tomasini L, Ferrandino A,Rosenberg Belmaker L, Szekely A, Wilson M, Kocabas A, Calixto NE, Grigorenko EL,Huttner A, Chawarska K, Weissman S, Urban AE, Gerstein M, Vaccarino FM: Somaticcopy-number mosaicism in human skin revealed by induced pluripotent stem cells.Nature, in press. DOI 10.1038/nature11629.

    O’Huallachain et al. PNAS | October 30, 2012 | vol. 109 | no. 44 | 18023

    GEN

    ETICS

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    15,

    202

    1

    http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/pnas.201213736SI.pdf?targetid=nameddest=SF2http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1213736109/-/DCSupplemental/pnas.201213736SI.pdf?targetid=nameddest=SF3