
Large-scale sequence and structural comparisons of human naive and antigen-experienced antibody repertoires

Brandon J. DeKosky a,1, Oana I. Lungu a,b,1, Daechan Park a,b, Erik L. Johnson a, Wissam Charab a, Constantine Chrysostomou a, Daisuke Kuroda c, Andrew D. Ellington d, Gregory C. Ippolito b, Jeffrey J. Gray c, and George Georgiou a,b,e,f,2

a Department of Chemical Engineering, University of Texas at Austin, Austin, TX 78712; b Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712; c Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218; d Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX 78712; e Institute for Cell and Molecular Biology, University of Texas at Austin, Austin, TX 78712; and f Department of Biomedical Engineering, University of Texas at Austin, Austin, TX 78712

Edited by James A. Wells, University of California, San Francisco, CA, and approved March 30, 2016 (received for review December 24, 2015)

Elucidating how antigen exposure and selection shape the human antibody repertoire is fundamental to our understanding of B-cell immunity. We sequenced the paired heavy- and light-chain variable regions (VH and VL, respectively) from large populations of single B cells combined with computational modeling of antibody structures to evaluate sequence and structural features of human antibody repertoires at unprecedented depth. Analysis of a dataset comprising 55,000 antibody clusters from CD19+CD20+CD27− IgM-naive B cells, >120,000 antibody clusters from CD19+CD20+CD27+ antigen-experienced B cells, and >2,000 RosettaAntibody-predicted structural models across three healthy donors led to a number of key findings: (i) VH and VL gene sequences pair in a combinatorial fashion without detectable pairing restrictions at the population level; (ii) certain VH:VL gene pairs were significantly enriched or depleted in the antigen-experienced repertoire relative to the naive repertoire; (iii) antigen selection increased antibody paratope net charge and solvent-accessible surface area; and (iv) public heavy-chain third complementarity-determining region (CDR-H3) antibodies in the antigen-experienced repertoire showed signs of convergent paired light-chain genetic signatures, including shared light-chain third complementarity-determining region (CDR-L3) amino acid sequences and/or Vκ,λ–Jκ,λ genes. The data reported here address several longstanding questions regarding antibody repertoire selection and development and provide a benchmark for future repertoire-scale analyses of antibody responses to vaccination and disease.

antibody | B cell | immunology | high-throughput sequencing | computational modeling

Significance

We applied a very recently developed experimental strategy for high-throughput sequencing of paired antibody heavy and light chains along with large-scale computational structural modeling to delineate features of the human antibody repertoire at unprecedented scale. Comparison of antibody repertoires encoded by peripheral naive and memory B cells revealed (i) preferential enrichment or depletion of specific germline gene combinations for heavy- and light-chain variable regions and (ii) enhanced positive charges, higher solvent-accessible surface area, and greater hydrophobicity at antigen-binding regions of mature antibodies. The data presented in this report provide fundamental new insights regarding the biological features of antibody selection and maturation and establish a benchmark for future studies of antibody responses to disease or to vaccination.

Author contributions: B.J.D., O.I.L., G.C.I., J.J.G., and G.G. designed the experiments; B.J.D., O.I.L., W.C., and J.J.G. performed the experiments; B.J.D., O.I.L., D.P., E.L.J., C.C., D.K., A.D.E., and J.J.G. analyzed data; and B.J.D., O.I.L., G.C.I., and G.G. wrote the paper.

Conflict of interest statement: G.G., B.J.D., and A.D.E. declare competing financial interests in the form of a patent filed by the University of Texas at Austin.

This article is a PNAS Direct Submission.

Data deposition: The sequence reported in this paper has been deposited in the National Center for Biotechnology Information Short Read Archive (accession codes PRJNA315079, SRX709625, and SRX709626). The Bioinformatic source code is shared on GitHub (https://github.com/bdekosky/PNAS_2015-25510) (accession code PNAS_2015-25510).

1 B.J.D. and O.I.L. contributed equally to this work.

2 To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1525510113/-/DCSupplemental.

E2636–E2645 | PNAS | Published online April 25, 2016 | www.pnas.org/cgi/doi/10.1073/pnas.1525510113

Effective antigen recognition by the humoral immune system is predicated on the somatic generation of a large antibody repertoire that encompasses the sequence and conformational diversity to respond to a highly diversified set of antigens (1–3). Upon antigen challenge, naive B cells (NBCs) expressing unmutated antibodies capable of binding antigen with an affinity sufficient to initiate B-cell receptor (BCR) signaling may be stimulated to undergo somatic hypermutation (SHM) of the antibody genes. B cells expressing higher-affinity BCRs are better equipped to compete for antigen and thus receive signals that enable their preferential proliferation and further antibody sequence diversification in additional rounds of SHM. This process generates a repertoire of somatically mutated antibodies that, at the structural level, generally display decreased conformational flexibility (4, 5), slower antigen dissociation rates, and increased binding selectivity relative to the germline repertoire.

Understanding the salient features of the human antibody repertoire is critical for immunology research (6, 7). Specifically, additional information is needed regarding how a history of pathogen and environmental exposure modulates the sequence and conformational properties of naive antibodies to yield a mature antibody repertoire that confers effective protection. High-throughput DNA sequencing of the heavy-chain variable (VH) gene repertoire from human donors has begun to delineate repertoire-wide differences among antibodies encoded by different B-cell subsets (8–14). For example, antigen experience has been reported to result in decreased length and hydrophobicity of the heavy-chain third complementarity-determining region (CDR-H3) and to alter CDR-H3 amino acid content relative to the naive repertoire (8, 9). Analysis of the VH repertoire in identical twins provided evidence that germline gene use is genetically predetermined (10), but other studies demonstrated that VH gene use for a particular B-cell subset among different donors is more closely related than are different B-cell subsets within an individual (8, 9, 11). However, because of technical limitations (13), these and other studies were confined to analysis of the heavy and light chain repertoires separately. Earlier studies using limiting dilution and single-cell cloning to determine complete antibody gene sequences [i.e., both VH and light-chain variable (VL) or VH:VL] in small numbers of B cells have provided invaluable insights into aspects of B-cell and antibody repertoire development (15–20). However, because of the low throughput of B-cell cloning, it has not been possible to address a number of key questions that require more comprehensive coverage of the VH:VL repertoire. For example, whether heavy and light chains pair in a combinatorial fashion or instead whether certain germline heavy-chain genes are conformationally predisposed to pair with particular light-chain genes remains controversial (15, 21, 22).

Although repertoire-wide crystallographic studies are not feasible at present, advances in computational protein modeling have enabled the structure of antibodies having moderate-length CDR loops to be predicted accurately (23–26). In particular, computational models generated using RosettaAntibody predicted X-ray structures to within 1–1.5 Å rmsd in framework and canonical loops and 2 Å in CDR-H3 loops (27). However, the use of RosettaAntibody for repertoire analysis requires, first and foremost, the availability of high-quality paired VH and VL sequence data for each antibody; second, significant computational resources for efficient execution of large-scale computational modeling of antibody repertoires; and, finally, an appropriate suite of informatic tools and statistical metrics suitable for repertoire-level comparisons. Such a pipeline enables the study of the physicochemical properties of antibody repertoires important for disease and autoimmunity (9, 18, 28) on a 3D level at the site of antigen contact.

Here we deployed a recently developed technology designed for massive determination of the natively paired VH:VL repertoire from single B cells (29, 30) coupled with high-throughput computational modeling to obtain the first (to our knowledge) comprehensive sequence and structural examination of human naive and antigen-experienced antibody repertoires circulating in peripheral blood. Our analysis of >170,000 high-quality antibody sequence clusters and >2,000 antibody models of naive and antigen-experienced repertoires from three human donors provided comprehensive data that have enabled us to address the questions stated above in addition to several other longstanding issues regarding antibody repertoire maturation and development.

[Fig. 1 image. Panel titles: (A) B-cell Subset Isolation From Healthy Human Donors (FACS plot with CD20 and CD27 axes; gates: CD27+ Ag-Exp 32.2%, CD27− Naive 64.4%); (B) Emulsion-Based VH:VL Paired Sequencing; (C) VH:VL Sequence Quality Filtering; (D) Structural Template Quality Filtering; (E) RosettaAntibody in silico Computational Modeling (Template Grafting and Refinement, CDR-H3 de novo loop modeling, VH-VL Orientation Docking; ×1000 models); (F) Putative Paratope Metrics.]

Fig. 1. High-throughput pipeline for antibody repertoire sequencing, modeling, and analysis. (A) Peripheral B-cell repertoires from healthy human donors were fractionated into naive (CD3−CD19+CD20+CD27−) and antigen-experienced (CD3−CD19+CD20+CD27+) samples via FACS. (B) B cells were isolated as single cells in emulsion droplets for single-cell mRNA capture, overlap extension linkage RT-PCR, and high-throughput sequencing as previously reported (29, 30). (C) VH:VL sequences were quality filtered for read quality, two or more reads, and 96% CDR-H3 identity clustering to remove sequence errors. Additional quality controls for the naive antibody dataset included filtering for IgM expression and SHM load to enhance data purity beyond the limitations of FACS. (D) Antibodies were selected for modeling from among sequences with the highest read counts and were filtered for CDR-H3 length and the availability of high-quality, high-sequence-similarity structural templates in the PDB. (E) Antibody repertoires were modeled using RosettaAntibody 3.0. (F) CR-paratopes were identified from antibody models and analyzed for charge (Upper), hydrophobicity (Middle), and SASA (Lower). Sequence and structural metrics were analyzed using a variety of statistical approaches to gain new biological understanding from high-throughput antibody repertoire data.
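The filtering in Fig. 1C (two or more reads, 96% CDR-H3 identity clustering) can be illustrated with a minimal sketch. This is not the authors' pipeline: the greedy, abundance-ordered clustering, the position-wise identity measure restricted to equal-length sequences, and the function names `identity` and `cluster_cdrh3` are all assumptions made for illustration.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two sequences.

    Assumption: sequences of unequal length are treated as non-identical
    (score 0.0) rather than aligned.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_cdrh3(reads, min_identity=0.96, min_reads=2):
    """Greedily cluster CDR-H3 reads and drop low-support clusters.

    The most abundant sequences seed clusters first; each later sequence
    joins the first seed it matches at >= min_identity. Clusters supported
    by fewer than min_reads reads are discarded as likely sequence errors.
    Returns a list of (seed_sequence, total_reads) tuples.
    """
    counts = Counter(reads)
    clusters = []  # each entry is [seed_sequence, total_reads]
    for seq, n in counts.most_common():
        for cluster in clusters:
            if identity(seq, cluster[0]) >= min_identity:
                cluster[1] += n
                break
        else:
            clusters.append([seq, n])
    return [(seed, n) for seed, n in clusters if n >= min_reads]
```

With a 50-nt seed observed three times, a single-mismatch variant (98% identity) is absorbed into the seed cluster, whereas an unrelated sequence seen once is removed by the read-count filter.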


Results

Determination of Antibody Sequence and Structural Repertoires. CD3−CD19+CD20+CD27− NBCs were isolated from peripheral blood mononuclear cells (PBMCs) of three healthy human donors (Fig. 1A and SI Appendix, Fig. S1). IgM amplicons encoding natively paired VH and VL sequences were generated and sequenced using a single-cell emulsion-based methodology (Fig. 1B). Briefly, paired heavy- and light-chain sequencing was achieved via single-cell isolation in emulsion droplets, cell lysis, and mRNA capture, overlap extension RT-PCR to link heavy and light chains, and Illumina MiSeq sequencing of the VH:VL cDNA product (29, 30). We obtained 55,355 antibody sequences from NBCs following filtering for read quality, functional gene segment identification, and CDR-H3 clustering. Sequences in the naive group were filtered for >98% germline nucleotide identity in the framework 3 region (FR3), similar to previous reports (10), to eliminate antibody sequences caused by FACS artifacts (typically <1% of non-naive cells) or by low-frequency CD27− antigen-experienced B cells (AEBCs) falling within the sort gates (Fig. 1C; also see SI Appendix, SI Methods) (31, 32). For two of these donors we determined the IgM/IgG/IgA VH:VL paired repertoire of CD3−CD19+CD20+CD27+ AEBCs from the same blood draw. A total of 123,941 distinct CDR-H3:light-chain third complementarity-determining region (CDR-L3) clusters corresponding to antigen-experienced VH:VL pairs were thus obtained (SI Appendix, Table S1). MiSeq 2 × 300 technology permits sequencing 600 bp of the ∼850-bp VH:VL amplicon per sample, and therefore full-length VH and VL genes for AEBCs were generated via bioinformatic assembly of the separate VH, VL, and VH:VL sequencing samples as described previously (29, 30, 33, 34). No gene assembly was necessary for NBC sequences because naive BCRs do not exhibit SHM and can be assumed to correspond to germline FR1–FR3 VH and VL gene segments.

Structural antibody repertoire modeling was performed using the RosettaAntibody 3.0 antibody modeling protocol implemented on a high-performance computing cluster (Fig. 1 D and E). Briefly, structures were generated by (i) grafting homologous template framework and CDR loop regions and (ii) de novo modeling of the CDR-H3 while refining the surrounding loops and the relative orientation of heavy- and light-chain variable (VH and VL, respectively) domains. Top-scoring models were selected from thousands of candidate model structures. From the two donors, 1,014 naive and 1,015 antigen-experienced antibody models were generated (SI Appendix, Table S1) at an estimated computational expense of 570,000 CPU hours. Paratopes of antibodies are the regions that directly recognize and bind to antigens. We estimated the paratope using contact-residue regions observed in antibody–antigen crystal structures (35). We calculated the net charge, solvent-accessible surface area (SASA), and hydrophobic solvent-accessible surface area (hSASA) over the computationally estimated contact-region paratopes (CR-paratopes), and we analyzed framework regions (FR) for conformational similarities and differences at the repertoire level (Fig. 1F).
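As a rough illustration of the net-charge metric computed over CR-paratopes, the sketch below sums formal side-chain charges over the residues of an estimated paratope and averages the result across a repertoire. The charge model (Lys and Arg as +1, Asp and Glu as −1, His and all other residues as neutral) and the names `net_charge` and `mean_net_charge` are assumptions for illustration; the text does not specify the charge model used, and SASA and hSASA require 3D models and are not covered here.

```python
from statistics import mean

# Assumed formal charges at neutral pH; histidine is treated as neutral.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge(paratope_residues: str) -> int:
    """Sum of formal side-chain charges over the paratope residues
    (one-letter amino acid codes)."""
    return sum(CHARGE.get(res, 0) for res in paratope_residues)


def mean_net_charge(paratopes) -> float:
    """Mean paratope net charge across a repertoire of antibodies."""
    return mean(net_charge(p) for p in paratopes)
```

For example, a paratope string "ARDYK" contains two positive residues (R, K) and one negative residue (D), giving a net charge of +1 under these assumptions. Comparing `mean_net_charge` between naive and antigen-experienced sets would give a repertoire-level summary of the kind described above.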

V-Gene Use and Public VH:VL Characteristics Differ in Naive and Antigen-Experienced Antibody Repertoires. We sought to understand how paired heavy:light V-gene use differed across individuals and across B-cell subsets (Fig. 2A and SI Appendix, Fig. S2). Pearson hierarchical clustering of VH:VL repertoire V-gene use revealed that naive antibody repertoires from different individuals clustered together, whereas antigen-experienced repertoires from multiple donors clustered separately from naive repertoires (Fig. 2B, Left). In other words, full antibody (i.e., paired VH:VL) V-gene use in naive repertoires was more similar to the naive repertoires of other donors than to a donor-matched antigen-experienced repertoire. Further, subclassification of the antigen-experienced repertoire by heavy-chain isotype revealed that antigen-experienced IgM repertoires clustered as one group, whereas class-switched repertoires (IgG, IgA) clustered separately (Fig. 2B, Right; principal component analysis is reported in SI Appendix, Fig. S3). Antigen-experienced class-switched IgG and IgA subsets were more similar within an individual than across the two individuals (indicated by the height separating groups in Fig. 2B, Right, and also observed in principal component analysis and pair-wise Pearson correlation coefficients in SI Appendix, Figs. S3 and S4).

We next performed statistical analysis to understand shifting heavy:light V-gene use across B-cell subsets using linear model-based t tests with adjustment for multiple comparisons (36). We found 28 statistically significant enriched/depleted VH:VL gene pairs in AEBCs compared with the NBC repertoire, with an adjusted P value less than 0.05, comprising 3.2% of the 872 germline VH:VL gene combinations present in all experimental samples (Fig. 2C). As one salient example, previous studies provided conflicting reports on IGHV6-1 prevalence across B-cell subsets [IGHV6-1 being reported as both enriched (9, 12) and depleted (10) in antigen-experienced subsets]. Interestingly, we found IGHV6-1 was significantly depleted in antigen-experienced subsets when paired with certain light-chain V-genes (KV1-33, LV3-19, LV1-40, and LV3-1); however, IGHV6-1 was enriched when paired with other light-chain genes (e.g., KV4-1, LV2-11, and LV1-44). Differentially enriched/depleted VH:VL pairs and a comprehensive list of VH:VL genes and P values are provided in SI Appendix, Table S2 and other data in SI Appendix. We performed several statistical analyses after subdividing the repertoire into SHM bins to understand the role of SHM in gene use; no clear correlations between gene use and SHM load were identified.

A longstanding question in studies of the antibody repertoire is whether certain VH:VL V-gene combinations are favored or disfavored relative to random, or combinatorial, VH:VL gene pairings (15, 21, 22). Given p VH genes represented at x fraction among all antibodies, and q VL genes represented at y fraction among all antibodies, then p × q different Ig VH:VL gene pairings can be formed at an expected x_i × y_j fraction for each i-j VH:VL gene pair in the repertoire (defined as the VH:VL expectation value). A reduced frequency of observation for a particular VH:VL gene pair relative to its expected frequency would suggest negative selection for that VH:VL; negative VH:VL selection could arise from structural incompatibility of VH and VL domains. We applied several statistical techniques including linear model-based t tests (36), DESeq (37), and Student's t test to identify reduced-frequency "holes" and increased-frequency "peaks" compared with the x_i × y_j expectation value (null) hypothesis that could be observed repeatedly within each B-cell subset in multiple donors. No statistically significant holes or peaks could be identified across donors among the B-cell subsets analyzed, consistent with prior small datasets (15, 21) and indicating that any disproportional pairings of human heavy- and light-chain V-genes that were observed in a single B-cell dataset were not replicated in other donors.

Analysis of a small set (<200) of antibody sequences obtained by single B-cell cloning (15) and high-throughput sequencing of VH repertoire subsets (8) had suggested that the CDR-H3 region is shorter, has a more basic pI, and is more hydrophilic in AEBCs than in NBCs. We observed that the reduction in CDR-H3 length in antigen-experienced repertoires compared with naive repertoires was very slight, although the difference between distributions was statistically significant [naive CDR-H3: 15.23 ± 3.69 amino acids, antigen-experienced: 15.08 ± 3.64 amino acids, mean ± SD, International Immunogenetics Information System (IMGT) CDR3 length definitions, P < 10^−14 by the Kolmogorov–Smirnov (K–S) test, which compares the equality of distributions] (SI Appendix, Fig. S5A). We found no strong correlation in paired heavy- and light-chain CDR3 loop lengths (SI Appendix, Fig. S5 C and D), as has been previously inferred from limited antibody sequence data (15, 21).
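The VH:VL expectation value defined earlier in this section (marginal VH frequency multiplied by marginal VL frequency) can be made concrete with a short sketch. It computes, for every VH:VL combination, the ratio of observed to expected pairing frequency: ratios well below 1 would correspond to candidate "holes" and ratios well above 1 to candidate "peaks". The function name `pairing_ratios` and the input layout (one `(vh, vl)` tuple per antibody cluster) are assumptions for illustration, and the sketch omits the statistical testing across donors (linear model-based t tests, DESeq, Student's t test) that the analysis relied on.

```python
from collections import Counter


def pairing_ratios(pairs):
    """Observed/expected frequency ratio for every VH:VL gene combination.

    pairs: list of (vh_gene, vl_gene) tuples, one per antibody cluster.
    The expected fraction for pair (i, j) is x_i * y_j, where x_i and y_j
    are the marginal fractions of VH gene i and VL gene j.
    """
    n = len(pairs)
    vh_counts = Counter(vh for vh, _ in pairs)
    vl_counts = Counter(vl for _, vl in pairs)
    observed = Counter(pairs)
    ratios = {}
    for vh, x in vh_counts.items():
        for vl, y in vl_counts.items():
            expected = (x / n) * (y / n)
            ratios[(vh, vl)] = (observed[(vh, vl)] / n) / expected
    return ratios
```

In a toy repertoire where two VH genes and two VL genes each occur at 50% and all four combinations are equally common, every ratio equals 1, which is the combinatorial (null) pairing outcome. If instead each VH gene were found with only one VL gene, the two observed pairs would have a ratio of 2 and the two missing pairs a ratio of 0.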


Promiscuous light-chain sequences (i.e., light chains paired with two or more VH sequences) have been observed previously and result from the lower diversity of light-chain Vκ,λ–Jκ,λ junctions (6, 29). We found widespread evidence of promiscuous light-chain junctions, i.e., Vκ,λ–Jκ,λ junctions observed in multiple BCRs within the same donor. In NBCs, promiscuous CDR-L3 junctions comprised 68.4 ± 4.5% of the repertoire at the nucleotide level (78.5 ± 4.0% based on protein sequence). The fraction of promiscuous Vκ,λ–Jκ,λ junctions was reduced in AEBCs (30.2 ± 3.9% nucleotide basis, 46.9 ± 5.7% amino acid basis) (29). Promiscuous Vκ,λ–Jκ,λ junctions paired with a diverse set of heavy-chain genes (29), consistent with the hypothesis that promiscuous light-chain CDR3 sequences emerge as a consequence of high-probability light-chain Vκ,λ–Jκ,λ recombination events.

Public Vκ,λ–Jκ,λ nucleotide and amino acid junctions (i.e., junctions observed in more than one individual) comprised a significant fraction of NBC CDR-L3s (64.6 ± 6.1% by nucleotide, 75.7 ± 6.7% by amino acid) and AEBC CDR-L3s (16.6 ± 0.3% by nucleotide, 33.6 ± 1.9% by amino acid) (6, 29, 38, 39). In contrast to the high prevalence of public Vκ,λ–Jκ,λ nucleotide sequences, we observed very few public CDR-H3 nucleotide sequences across individuals (three among naive groups, one between antigen-experienced groups), as was consistent with prior reports (10, 13, 40). At the amino acid level we observed 23 CDR-H3 sequences that were shared between two donors in the naive repertoire (0.083% frequency; no CDR-H3 was shared among all three donors) and 38 CDR-H3 amino acid sequences that were shared between two donors in the antigen-experienced repertoire (0.061% frequency). Public CDR-H3 lengths were significantly shorter than CDR-H3 lengths in the overall repertoires, presumably reflecting the lower sequence diversity inherent to shorter CDR-H3s, as is consistent with a higher probability of the same short junction occurring in different individuals (SI Appendix, Fig. S6A) (40, 41). We detected five antigen-experienced public CDR-H3 amino acid sequences that were paired with identical CDR-L3 amino acid sequences; all five public CDR-H3:CDR-L3 antibodies were encoded by different nucleotide sequences with distinct patterns of N/P addition, SHM, gene composition, and isotype use, indicating that they derived from distinct V(D)J recombinations (SI Appendix, Table S3). Convergence in light-chain gene pairing was markedly higher for antigen-experienced public CDR-H3 antibodies that had undergone antigen

[Fig. 2 image. (A) Surface maps titled "Naïve Donor 1" and "Antigen-Experienced Donor 1"; axes: VH gene (HV1-2 through HV5-51) and VL gene (KV1-5 through LV9-49); vertical axis: percent of repertoire (0 to 2, in bins of 0.2). (B) Dendrograms with a "Height" axis. Left leaves: Donor 2 Ag-Exp, Donor 1 Ag-Exp, Donor 2 Naïve, Donor 1 Naïve, Donor 3 Naïve. Right leaves: Donor 1 Naïve, Donor 3 Naïve, Donor 2 Naïve, Donor 1 Ag-Exp IgA, Donor 1 Ag-Exp IgG, Donor 2 Ag-Exp IgA, Donor 2 Ag-Exp IgG, Donor 2 Ag-Exp IgM, Donor 1 Ag-Exp IgM. (C) Volcano plot; x axis: log2(fold change), −6 to 6; y axis: −log10(p-value), 0 to 5. Labeled gene pairs: HV3-74:KV4-1, HV3-74:KV2-28, HV3-74:LV2-8, HV3-7:KV4-1, HV3-15:KV2-28, HV3-74:LV1-51, HV1-3:KV4-1, HV4-61:KV3-20, HV3-33:KV1-8, HV6-1:KV1-33, HV4-34:KV1-8, HV6-1:LV3-19, HV1-69:KV1-8, HV1-18:KV1-8, HV4-59:KV1-8, HV6-1:LV1-40, HV1-24:LV2-14, HV1-58:KV1-33, HV4-34:LV3-1, HV5-51:KV1-8, HV1-58:LV3-1, HV3-30:KV1-8, HV6-1:LV3-1, HV1-58:KV1-39, HV1-69:LV3-1, HV3-21:KV1-8, HV2-26:KV1-33, HV1-46:KV1-33.]

Fig. 2. Paired V-gene use in naive and antigen-experienced B-cell repertoires. (A) Paired heavy:light V-gene use surface maps of antibody sequence repertoires from naive (CD3⁻CD19⁺CD20⁺CD27⁻ IgM) (Left) and antigen-experienced (CD3⁻CD19⁺CD20⁺CD27⁺ IgG/IgA/IgM) (Right) repertoires of donor 1 (n = 13,780 and 34,692, respectively). Donor-matched NBCs and AEBCs were isolated from the same time point and blood draw. V-genes are plotted in alphanumeric order, with heights indicating percentage representation among VH:VL clusters. (B) Clustergrams resulting from Pearson hierarchical cluster analysis of paired heavy:light V-gene use across donors; relative distance is indicated by line heights connecting different groups. (Left) Clustering of donor and B-cell subset repertoires. (Right) Clustering of heavy-chain isotype repertoires (naive and antigen-experienced IgM, IgA, and IgG). (C) Volcano plot representation of differences in VH:VL gene use in the NBC and AEBC repertoires. Positive fold-change values denote VH:VL gene pairs that were more frequent in antigen-experienced datasets. Gene pairs with adjusted P values below 0.05 are displayed in red and are listed in SI Appendix, Table S2. Gene pairs with a log2(fold-change) absolute value of 1 or more and with an adjusted P value greater than 0.05 are displayed in orange. Other gene pairs are displayed in black. A total of 872 VH:VL gene pair combinations was present in all donors and are included in this analysis.

DeKosky et al. PNAS | Published online April 25, 2016 | E2639


selection than for public CDR-H3 antibodies from NBCs (SI Appendix, Fig. S6B), reinforcing the hypothesis that public antigen-experienced CDR-H3 antibodies can be functionally selected for binding to common antigens (42–45).
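The tabulation of public CDR-H3 sequences and of convergent CDR-H3:CDR-L3 pairs across donors can be illustrated with a minimal Python sketch. The function name `public_pairs`, the input layout, and the toy sequences are assumptions made for illustration; they are not taken from the analysis pipeline used in this study.

```python
from collections import defaultdict

def public_pairs(repertoires):
    """Find public CDR-H3s and public CDR-H3:CDR-L3 pairs.

    repertoires: dict mapping donor id -> list of (cdr_h3_aa, cdr_l3_aa).
    A sequence (or pair) is called public here if it occurs in >= 2 donors.
    """
    h3_donors = defaultdict(set)
    pair_donors = defaultdict(set)
    for donor, pairs in repertoires.items():
        for h3, l3 in pairs:
            h3_donors[h3].add(donor)
            pair_donors[(h3, l3)].add(donor)
    public_h3 = {h3 for h3, donors in h3_donors.items() if len(donors) >= 2}
    public_h3_l3 = {p for p, donors in pair_donors.items() if len(donors) >= 2}
    return public_h3, public_h3_l3

# Hypothetical toy repertoires (sequences invented for the example).
toy = {
    "donor1": [("ARDYW", "QQYNSYT"), ("ARGGW", "QQSYSTP")],
    "donor2": [("ARDYW", "QQYNSYT"), ("ARGGW", "QSYDSSL")],
    "donor3": [("AKDLW", "QQYGSSP")],
}
public_h3, public_h3_l3 = public_pairs(toy)
```

In this toy input two CDR-H3s are shared between donors, but only one of them is also paired with the same CDR-L3 in both donors, which is the kind of light-chain convergence described above.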

Structural Differences in Repertoire Charge, Surface Area, and Hydrophobicity. We first analyzed the predictive accuracy of RosettaAntibody for naive antibodies. RosettaAntibody models have been reported to have an FR and canonical loop accuracy of 0.45- to 1.0-Å rmsd relative to X-ray structures of antigen-experienced antibodies and an average 2.1-Å rmsd in CDR-H3 loops (27). For the seven human germline antibodies in the Protein Data Bank (PDB) (100% sequence identity to germline V-gene segments), our RosettaAntibody models displayed a 0.8- to 1.4-Å rmsd in FR and canonical CDR loops and a <2.4-Å rmsd in CDR-H3 relative to X-ray structures. We produced antibody models for sequences that were unique among all repertoires and for which high-sequence-identity templates of FR and canonical CDRs were available in the PDB, as required for homology modeling. Because computational structural prediction of long CDR-H3 loops is problematic, we considered only the antibody repertoire subset with CDR-H3 lengths of <16 amino acids, by the Chothia definition (76% of total sequenced antibodies).

We sought to characterize the repertoire-wide physicochemical properties of computationally modeled CR-paratopes, including charge, hydrophobicity, SASA, and hSASA; these properties have been shown to be important in tolerance and for the avoidance of self-reactivity (8, 12, 18). We calculated physicochemical metrics holistically over all six regions of each antibody that comprise the most likely antigen-binding contacts, i.e., the CR-paratope. The CR-paratope overlaps but does not perfectly coincide with the six CDRs of an antibody, which are classically defined by hypervariable amino acid sequence (see SI Appendix, SI Methods for a detailed description). The CR-paratope charge in naive and antigen-experienced repertoires was found to be heavily influenced by V-gene use (Fig. 3A). Naive antibodies exhibited a slightly more negative CR-paratope charge than antigen-experienced antibodies, with distribution mean charges of −1.1 and −0.68, respectively (median charges −1 and 0; differences in charge distribution were statistically significant by the K–S test, P = 2.8 × 10⁻³). Antibodies using the gene segment IGKV1-33 exhibited a strong negative charge over the CR-paratope because of a −3 charge in the IGKV1-33 germline. Reanalysis of CR-paratope charge distribution data when antibodies using IGKV1-33 were excluded rendered the differences in CR-paratope charge distribution between the repertoires nonsignificant (P = 0.096 by K–S test, n = 859 for naive and n = 958 for antigen-experienced repertoires). IGKV1-33 is a common gene segment, representing 15% of V-gene use in paired VH:VL naive repertoire models but only 3.1% in antigen-experienced models (8.9% and 3.1%, respectively, in donor sequences). Indeed, IGKV1-33 is also a member of four VH:VL V-gene pairs which have statistically significant decreased expression in antigen-experienced repertoires (SI Appendix, Table S2) (10).

We compared charge distributions in computationally modeled antibody CR-paratopes with those of CDR-H3:CDR-L3 amino acid sequences. CDR-H3:CDR-L3 sequences from both naive and antigen-experienced repertoires exhibited neutral charge distributions, with 90% of the repertoire falling between +2 and −2 by total CDR3 loop charge (Fig. 3B). In agreement

[Fig. 3 graphic omitted. Panels A–D; horizontal axis: paratope charge (−12 to +5); vertical axis: fraction of repertoire (0–0.25).]

Fig. 3. Charge distributions in naive and antigen-experienced repertoires. (A) CR-paratope charge. (B) Total CDR-H3 and CDR-L3 charge. (C) CDR-H3 charge. (D) CDR-L3 charge for naive and antigen-experienced BCR repertoires. In all panels, differences in charge distribution between naive and antigen-experienced repertoires were statistically significant by the K–S test (P = 3.5 × 10⁻³ for A; P < 10⁻¹⁵ for B–D). The number in each group is provided in SI Appendix, Table S1; error bars represent SD.

E2640 | www.pnas.org/cgi/doi/10.1073/pnas.1525510113 DeKosky et al.


with our CR-paratope charge analysis, the antigen-experienced CDR-H3:CDR-L3 amino acid sequences displayed a slightly reduced negative charge as compared with the naive sequences (mean charges of −0.47 and −0.099, respectively) (Fig. 3B). Antigen-experienced CDR-H3 and CDR-L3 repertoires showed statistically significant increases in net charge distributions as compared with naive repertoires [Fig. 3 C (8, 40) and D and SI Appendix, Table S4]. Increased CDR3 charge was observed in antigen-experienced CDR-H3 and CDR-L3 concurrently and was evident even when antibodies were binned for gene use (SI Appendix, Figs. S7 and S8), indicating that the shift toward a more positive charge was driven by functional BCR selection rather than by changing gene use. We also observed distinct distributions of CDR-L3 charge across kappa vs. lambda light-chain repertoires, with kappa CDR-L3 repertoires being significantly more positively charged than lambda CDR-L3 repertoires (SI Appendix, Fig. S9 and Table S5). Importantly, distinct Igκ vs. Igλ light-chain charge distributions also were evident in light-chain CR-paratopes. Kappa CR-paratopes had a median charge of 0, whereas lambda CR-paratopes had a median charge of −1 (P = 1.8 × 10⁻⁹ for naive kappa vs. lambda charge distribution and P < 10⁻¹⁵ for antigen-experienced kappa vs. lambda charge distribution by K–S test) (SI Appendix, Fig. S10).

BCRs with positive charge extremes were slightly more prevalent in antigen-experienced repertoires than in naive repertoires. Antibodies having highly positively charged CDR3s are more likely to be autoreactive, and B cells encoding positively charged CDR3s are known to be eliminated at distinct developmental checkpoints in bone marrow (18). Nonetheless, our results showed that a small but significant fraction of highly positively charged antibodies is found in NBCs in the periphery. Further, the frequency of such antibodies was statistically significantly increased by antigen exposure (+5 or +6 CDR-H3:CDR-L3: average 0.24% of naive vs. 0.33% of antigen-experienced antibodies, P = 2.7 × 10⁻⁴ by Z test).

We next analyzed the hydrophobicity of paired CDR-H3 and CDR-L3 loops by calculating the hydrophobic index (H-index) (46). Although distributions in average H-index did not show significant differences between naive and antigen-experienced repertoires, we observed major differences in the CDR-L3 average H-index between kappa and lambda repertoires (SI Appendix, Fig. S9). Lambda light-chain CDR3s were more hydrophobic than kappa CDR3s, and these patterns were consistent across donors (P < 10⁻¹⁴). Kappa CDR-L3s were under stronger H-index positive selection pressure than lambda light chains (the mean H-index from naive to antigen-experienced increased by 0.037 in kappa light chains compared with 0.016 in lambda light chains) (SI Appendix, Fig. S9 and Table S5), perhaps because of the much lower hydrophobicity of naive kappa light chains.

Fig. 4. Distribution of SASA in naive and antigen-experienced repertoires. (A) CR-paratope SASA (Upper) and hSASA (Lower) of a naive antibody (Left) and antigen-experienced antibody (Right) with SASA or hSASA at the median of the respective distributions. (Upper) VH CR-paratope SASA is shown in blue; VL CR-paratope SASA is shown in green. (Lower) hSASA of the CR-paratope is rendered with each residue colored according to the Eisenberg hydrophobicity scale, from most hydrophobic (red) to least (white). (B–D) Total CR-paratope SASA (B), CDR-H1 SASA (C), and fraction of hSASA (D) for naive (blue) and antigen-experienced (red) BCR repertoires. In B–D differences between naive and antigen-experienced repertoires were statistically significant by the K–S test (P = 6.5 × 10⁻⁵ for B; P = 2.5 × 10⁻¹² for C; and P = 5.4 × 10⁻¹⁰ for D). The number for each group is provided in SI Appendix, Table S1; error bars represent SD.


The availability of structural models also enabled us to characterize the SASA and the hSASA across the antibody repertoire CR-paratopes (Fig. 4A). There was a very slight increase in median SASA, from 2,600 Å² in antibodies from NBCs to 2,650 Å² in antibodies from AEBCs (mean SASA 2,611 vs. 2,633 Å², respectively, P = 6.5 × 10⁻⁵ by K–S test to compare SASA distributions) (Fig. 4B). We observed that IGHV4-59 and IGHV4-34 had a strong impact on naive and antigen-experienced antibody SASA. For example, antibodies using IGHV4-34 were characterized by a smaller-than-average SASA (2,527 Å² for antibodies using IGHV4-34 vs. 2,637 Å² average for all modeled antibodies in this study) and comprised 8% of the naive repertoire but only 3% of the antigen-experienced repertoire.

Additionally we found that the median SASA contributed by CDR-H1 was slightly smaller, at 400 Å² for naive antibodies (Fig. 4C), than the median of 425 Å² in the antigen-experienced antibodies (means: 407 Å² and 430 Å², respectively; adjusted P value = 2.5 × 10⁻¹² by K–S test to compare H1-SASA distributions). Although distribution means differed only slightly, the overall probability distributions of NBC vs. AEBC repertoire CR-paratope SASA and CDR-H1 SASA were significantly different, trending toward greater SASA in the CR-paratope of antibodies over the process of antigen experience.

The fraction of hSASA (i.e., the hSASA/SASA ratio) in antigen-experienced antibodies displayed a slight but statistically significant median increase, from 0.57 to 0.58 (means 0.57 and 0.58, respectively; adjusted P value = 2.5 × 10⁻⁴ for donor 1 and 7.6 × 10⁻⁸ for donor 2) (Fig. 4D). Antibodies using IGHV4-59, IGHV1-18, and IGHV3-33, along with IGHV1-3, IGHV3-30, and IGKV3-20 experienced small but statistically significant

[Fig. 5 graphic omitted. Panels A and B: superimposed models grouped as same gene, same family, and different family. Panels C and D: donor 1 and donor 2.]

Fig. 5. Average rmsd of VH FR1–3 backbone atoms. (A) Superimposed models of two naive antibodies sharing the same IGHV gene segment (Left) or the same IGHV gene family (Center) or from two different IGHV gene families (Right). FR1–3 residues used to calculate pairwise rmsd values are highlighted in blue. (B) Superimposed models of two antigen-experienced antibodies sharing the same IGHV gene segment (Left) or the same IGHV gene family (Center) or from two different IGHV gene families (Right). FR1–3 residues used to calculate pairwise rmsd values are highlighted in red. (C and D) FR average pairwise rmsd for donor 1 (C) and donor 2 (D). Average pairwise rmsds are plotted for antibodies using the same V-gene segment, using the same V-gene family, and using two different V-gene families; naive repertoires are shown in blue, and antigen-experienced repertoires are shown in red; error bars indicate SD. Distribution differences between naive and antigen-experienced repertoires were statistically significant for all comparisons by the K–S test (P < 10⁻¹⁵ in C and D). The number for each group is given in SI Appendix, Table S1; error bars represent SD.


increases in the hSASA fraction in antigen-experienced antibody CR-paratopes of 1–3% (P = 9.3 × 10⁻⁶, 1.9 × 10⁻³, 2.5 × 10⁻², 2.6 × 10⁻², and 4.9 × 10⁻², respectively). CDR-L3 CR-paratope hydrophobicity correlated well with sequence-based hydrophobicity metrics, with lambda CR-paratopes exhibiting a 10% higher hSASA fraction than kappa CR-paratopes over the CDR-L3.

Antibody FRs dictate the conformational space accessible to the CDRs. Conserved VH:VL interactions at FR positions are critical determinants of variable domain stability (47–50). To explore the relationship between V-gene use and antibody structural similarity within FRs, we calculated the rmsd of antibody models to each other over the FR1–3 backbone within each naive or antigen-experienced computationally predicted antibody model repertoire (VH: 8–25, 36–51, and 57–94 and VL: 10–23, 35–49, and 57–88 backbone residue atoms by the Chothia definition; see SI Appendix, SI Methods). Pairwise antibody heavy-chain FR rmsds were calculated over the heavy-chain V-gene segment and compared across gene subsets in each donor (Fig. 5 A and B). We found that antibody structures were remarkably similar to each other over the VH FR regions, with a median overall rmsd of 1.02–1.09 Å in naive repertoires and 1.03–1.04 Å in antigen-experienced repertoires (Fig. 5). In naive repertoires, the FRs of antibodies having the same VH gene segment use were indistinguishable (same median rmsd of 0.13 in both donor 1 and donor 2) and slightly less so in antigen-experienced repertoires (median rmsds of 0.41 and 0.48 Å in donors 1 and 2, respectively, P < 10⁻¹⁵ by K–S test). A similar trend was observed for antibodies using VH gene segments from the same family (i.e., IGHV1, IGHV2, and so forth), with naive antibodies having a closer structural similarity to each other than antigen-experienced antibodies (naive antibodies: median rmsds of 0.51 Å in both donors; antigen-experienced antibodies: median rmsds of 0.56 and 0.57 Å in donors 1 and 2, respectively, P < 10⁻¹⁵ for both donors by K–S test). Interestingly, pairwise rmsds were not substantially increased for antibody pairs using VH gene segments from different VH gene families, with a median rmsd of 1.1 Å in all repertoires across both donors (P < 10⁻¹⁵ by K–S test for naive and antigen-experienced repertoires in both donors). We compared the FR1–3 rmsds among antigen-experienced predicted models in each donor with 141 nonredundant human antibody structures in the PDB (SI Appendix, SI Methods). Overall rmsd for the set of crystal structures from the PDB was 1.0 ± 0.29 Å; among antibodies using different V-gene segments the median rmsd was 1.1 ± 0.22 Å (SI Appendix, Fig. S11), in excellent agreement with our RosettaAntibody predictions. Of note, antibody sequences in the PDB were more somatically hypermutated than the antibodies modeled here, with antibodies in the PDB having 83% average germline sequence identity vs. the 88% average in our antigen-experienced repertoires. Thus, we conclude that at the repertoire level SHM does not appear to exert a significant impact on FR structural diversity.
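The distribution comparisons reported throughout this section rely on the two-sample K–S test. As a rough illustration of what that test measures, the sketch below computes the K–S statistic (the largest gap between two empirical cumulative distributions) and an asymptotic p-value in pure Python. The function name and the rmsd values are hypothetical, and the asymptotic series is only an approximation; the study itself used the `ks.test` function in R.

```python
import bisect
import math

def ks_two_sample(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D and asymptotic p-value."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    d = 0.0
    # The ECDF gap can only change at observed values, so check each one.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / n
        cdf_b = bisect.bisect_right(b, x) / m
        d = max(d, abs(cdf_a - cdf_b))
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.3:
        # The series below converges poorly here; the p-value is ~1.
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))

# Hypothetical pairwise FR rmsd values (in angstroms) for two groups.
naive_rmsd = [0.13, 0.12, 0.14, 0.13, 0.15]
experienced_rmsd = [0.41, 0.45, 0.48, 0.40, 0.52]
d_stat, p_value = ks_two_sample(naive_rmsd, experienced_rmsd)
```

Because the test compares whole distributions, it can report a significant difference even when medians differ only slightly, as is the case for several of the comparisons above.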

Discussion

The results presented here constitute by far the most in-depth analysis to date of naive and antigen-experienced human antibody repertoires, both at the sequence level and with respect to structural paratope features. Although the technical capabilities for tackling larger numbers of donors now exist, we focused on the analysis of a small set of donors in this study because scaling the scope of such analysis must be tempered by cost considerations, particularly regarding the computational capacity for data analysis and structural modeling (e.g., >250,000 CPU hours per 1,000 antibodies). We note that biological sampling limitations restrict our ability to analyze more than a very small fraction of the 10¹⁰ B cells in an individual; however the 1.8 × 10⁵ antibody sequences that we recovered enabled highly sensitive analyses of the genetic, physicochemical, and structural properties of human antibody repertoires. The very rich data presented here highlight a number of key features of the antibody repertoire that either had not been observed previously or had been inferred from limited data.

First, the question of whether heavy- and light-chain gene pairing occurs in a purely combinatorial fashion or if, instead, germline sequences impose conformational constraints on the association of certain heavy- and light-chain genes has been hotly debated for many years. Although earlier studies relying on single-cell sequencing of tens to hundreds of B cells had argued primarily for random pairing (15, 21), this conclusion has been challenged by recent data meta-analyses (22). Here, examination of >175,000 high-quality antibody sequences from three donors provided compelling evidence that there are no observable biases in VH and VL gene pairing at a population level. Instead, we found that population-scale frequencies of VH:VL V-gene combinations are proportional to the relative representation of the VH and VL genes among the population. Further we observed no correlation between CDR-H3 and CDR-L3 loop lengths in either naive or antigen-experienced repertoires (SI Appendix, Fig. S5), as also had been inferred from limited prior data but had not yet been evaluated at the repertoire scale (15, 21).

Second, we present evidence that combined VH:VL gene use in the naive and antigen-experienced repertoire subsets correlates much better across individuals than within a single donor (i.e., naive and antigen-experienced repertoire within the same donor). Earlier reports on the VH-only or VL-only repertoires in different B-cell subsets also had revealed interdonor concordance (9–12), and our data demonstrate that these trends also are evident when the complete antibody sequence is examined (Fig. 2 and SI Appendix, Table S2). Further, the clear intradonor segregation of VH:VL gene use showed that the CD27⁺ IgM repertoire was distinct from class-switched CD27⁺ IgG/IgA (Fig. 2 and SI Appendix, Fig. S3), indirectly supporting prior hypotheses that distinct developmental pathways (possibly in response to different classes of antigen, adjuvant, and/or route of exposure) drive the development of these distinct B-cell subsets, as suggested previously (51–53).

Third, we sought to identify public sequences, i.e., sequences shared among multiple individuals, that encompassed paired heavy- and light-chain information. Public VH sequences have been reported to arise in individuals infected with certain pathogens and also at a low frequency in healthy donors (10, 40, 42–44, 54). However, the question of whether the light chains paired with public heavy chains are also shared among individuals had not been addressed. Although we did not find any perfect nucleotide sequence matches among full-length antibodies in our datasets, at the amino acid level we identified five public CDR-H3:CDR-L3 pairs (out of 179,296 total antibody clusters), a finding that underscores the very low frequency at which public antibodies typically arise. We observed signatures of convergent light-chain genes among public antibodies expressing the same CDR-H3 amino acid sequence (SI Appendix, Fig. S6B), contributing additional evidence to the hypothesis that public antibodies can be elicited in response to common immune stimuli (SI Appendix, Table S3) (40, 42–45, 54).

Fourth, repertoire-wide examination of CR-paratope features revealed a slight shift toward less negatively charged CR-paratopes, which also was observed in CDR-H3:CDR-L3 amino acid sequences (Fig. 3). Shifts in net charge were observed even when antibodies were binned by gene use (SI Appendix, Fig. S8), providing evidence to support a functional selection mechanism for more positively charged CDR3s in both heavy (40) and light chains. This enhancement of positive charge may be driven by selection for binding to negatively charged bacterial membrane surfaces or by the generation of salt bridges between antibody CR-paratopes and negatively charged antigens. In addition to repertoire-wide differences in charge between naive and antigen-experienced antibodies, we found a slight increase in SASA that


was reflected in gene use at the repertoire scale, particularly in the use of segment IGHV4-34. This gene segment may be selected against in AEBC repertoires because of its potential ability to bind mannose-binding lectins, thus explaining its implication in disease states such as follicular lymphoma (8).

Fifth, we noted key differences in CDR-L3 charge and hydrophobicity between kappa and lambda light chains that had not been previously reported (e.g., SI Appendix, Fig. S9). Recent reports delineated functional differences between kappa and lambda antibody responses (55), which may derive from the physicochemical differences between kappa and lambda isotypes reported here. The distinct physicochemical profiles of kappa and lambda light chains may serve important functional purposes in receptor editing and allelic inclusion. For example, receptor editing is known to mitigate self-targeting of autoimmunogenic antibodies (56–58). The presence of two different light-chain gene sets, each with a distinct distribution of biochemical properties, may greatly enhance the probability of altering binding specificity by editing the light-chain isotype. Differences between kappa and lambda repertoires also may have functional importance in allelic inclusion, because kappa and lambda allelically included antibodies are more likely to show different binding specificities (29, 56–59).

Finally, the close agreement between RosettaAntibody models and crystal structures in the PDB for FR rmsd analysis indicated that computational modeling is a powerful tool for predicting the conformation of FR regions. The structural FR similarities across antibodies indicated that a limited set of available conformational space is used for VH FR domains, which does not expand significantly because of SHM, and similarities between antibody VH domains within families were especially pronounced. FR conformational similarities across V-genes may serve as a structural mechanism to enable heavy and light V-genes to pair productively with each other in a combinatorial fashion according to their overall frequencies in the repertoire and to express successfully the tremendous diversity required for effective humoral adaptive immunity.
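The notion of combinatorial pairing "according to overall frequencies" can be made concrete with a small sketch: under independent pairing, the expected frequency of a VH:VL pair is the product of the marginal VH and VL frequencies. The Python below compares observed and expected pair frequencies on hypothetical counts; the function name and data are illustrative assumptions, and this is not the statistical procedure used in the study.

```python
from collections import Counter

def observed_vs_expected(pairs):
    """Compare observed VH:VL pair frequencies with independence expectation.

    pairs: list of (vh_gene, vl_gene), one entry per antibody cluster.
    Returns dict: pair -> (observed_freq, expected_freq), where
    expected_freq = freq(VH) * freq(VL).
    """
    n = len(pairs)
    vh_counts = Counter(vh for vh, _ in pairs)
    vl_counts = Counter(vl for _, vl in pairs)
    pair_counts = Counter(pairs)
    return {
        pair: (count / n, (vh_counts[pair[0]] / n) * (vl_counts[pair[1]] / n))
        for pair, count in pair_counts.items()
    }

# Hypothetical repertoire in which pairing is exactly independent.
toy_pairs = (
    [("IGHV3-23", "IGKV1-39")] * 3 + [("IGHV3-23", "IGLV2-14")] * 3
    + [("IGHV1-69", "IGKV1-39")] * 1 + [("IGHV1-69", "IGLV2-14")] * 1
)
freqs = observed_vs_expected(toy_pairs)
```

A ratio of observed to expected frequency near 1 for most pairs is what purely combinatorial pairing would predict; systematic departures would suggest pairing restrictions.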

Methods

Sample Collection and VH:VL Sequencing. Informed consent was obtained from anonymous donors by the Gulf Coast Regional Blood Center (Houston, TX). This study was approved by the University of Texas at Austin Institutional Biosafety Committee (2010-06-0084). PBMCs from whole blood were isolated into B-cell subsets via flow cytometric sorting. CD3⁻CD19⁺CD20⁺CD27⁻ NBCs and CD3⁻CD19⁺CD20⁺CD27⁺ AEBCs were analyzed for VH:VL sequences as reported previously (29). Briefly, cells were isolated into emulsion droplets along with poly(dT) magnetic beads for mRNA capture using a flow-focusing nozzle apparatus. Droplets contained lithium dodecyl sulfate and DTT to lyse cells and inactivate proteins, and mRNA released from lysed cells was captured by the poly(dT) sequences on magnetic beads. The emulsion was broken chemically as described (29), and beads were collected, washed, and used as template for emulsion overlap extension RT-PCR, which linked heavy- and light-chain transcripts into a single, linked cDNA construct for high-throughput sequencing via Illumina MiSeq 2 × 250 or 2 × 300 technology. See SI Appendix, SI Methods for further details regarding cell isolation, sorting, and antibody RT-PCR and sequencing.

Bioinformatic Sequence Analysis. Illumina sequences were quality-filtered and annotated using both the IMGT (60) and National Center for Biotechnology Information IgBlast software (61) with a CDR3 motif identification algorithm (62). CDR-H3 junction nucleotide sequences were extracted and clustered to 96% nucleotide identity with terminal gaps ignored [USEARCH v. 5.2.32 (63)], with a minimum of one nucleotide mismatch permitted during CDR-H3 junction clustering regardless of sequence length; the most abundant CDR-L3 corresponding to each CDR-H3 cluster seed was chosen as an H3:L3 pair. Resulting CDR-H3:L3 pairs with two or more reads comprised the preliminary list of VH:VL clusters for each dataset. Naive antibody sequences were additionally filtered to include only sequences with >98% germline identity in the FR3 region, similar to previous reports (10). Additional details of bioinformatics analysis are provided in SI Appendix, SI Methods.
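The clustering and pairing logic described above can be sketched in simplified form. The Python below is a stand-in for the USEARCH step and not a reimplementation of it: it assumes greedy seed-based clustering in order of abundance, compares only equal-length junctions by Hamming distance (no alignment or terminal-gap handling), and picks the most abundant CDR-L3 among all reads in a cluster. All names and toy sequences are hypothetical.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def cluster_pairs(reads, identity_pct=96, min_reads=2):
    """Greedy CDR-H3 clustering followed by CDR-L3 assignment.

    reads: list of (cdr_h3_nt, cdr_l3_nt), one entry per paired read.
    Returns dict: seed CDR-H3 -> (most abundant CDR-L3, read count).
    """
    h3_counts = Counter(h3 for h3, _ in reads)
    seeds, assignment = [], {}
    for h3, _ in h3_counts.most_common():  # most abundant sequences seed first
        # At least one mismatch is tolerated regardless of length.
        allowed = max(1, len(h3) * (100 - identity_pct) // 100)
        for seed in seeds:
            if len(seed) == len(h3) and hamming(seed, h3) <= allowed:
                assignment[h3] = seed
                break
        else:
            seeds.append(h3)
            assignment[h3] = h3
    l3_by_seed = defaultdict(list)
    for h3, l3 in reads:
        l3_by_seed[assignment[h3]].append(l3)
    return {
        seed: (Counter(l3s).most_common(1)[0][0], len(l3s))
        for seed, l3s in l3_by_seed.items()
        if len(l3s) >= min_reads  # keep clusters with two or more reads
    }

# Hypothetical reads: a seed, a one-mismatch variant, and an unrelated singleton.
toy_reads = (
    [("GCGAGAGATTACTGG", "CAGCAGTAT")] * 3
    + [("GCGAGAGATTACTGC", "CAGCAGTAT")]
    + [("GCGAAAGGGGGGTGG", "CAGTCTTAT")]
)
clusters = cluster_pairs(toy_reads)
```

The singleton is dropped by the two-read minimum, which mirrors the role of that filter in removing likely sequencing or pairing errors.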

Structural Modeling and Analysis. Structural modeling and analyses are described in SI Appendix, SI Methods. Briefly, antibody sequences represented by the most reads from donor 1 and donor 2 (all selected antibodies were observed at >50 reads per sequence from the respective repertoire) for naive and antigen-experienced sets were analyzed. Antibody sequences were tested for uniqueness in and across repertoires, so that no sequence was modeled more than once. Antibodies with a CDR-H3 length of ≥16 amino acids (Chothia numbering) were excluded from modeling. All sequences were subsequently filtered to ensure that each FR and CDR was identifiable by the modified Chothia definitions in RosettaAntibody. Antibodies for which high-sequence-identity templates were available for CDR-H1, CDR-H2, CDR-L1, CDR-L2, and CDR-L3 were input through the RosettaAntibody 3.0 antibody modeling protocol as described (27). A total of 1,000 trajectories were modeled per antibody; the lowest-scoring models, as evaluated by the Rosetta scoring function, were chosen for visual inspection and further analysis.
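The selection criteria above (read count, CDR-H3 length, uniqueness) and the choice of the lowest-scoring model can be expressed as a short filter. The sketch below is illustrative only: the record fields, function names, and toy values are assumptions, and the template-availability filter is omitted because it depends on a structure database lookup.

```python
def select_for_modeling(antibodies, min_reads=50, max_h3_len=16):
    """Apply the read-count, CDR-H3 length, and uniqueness filters.

    antibodies: list of dicts with keys 'vh_aa', 'vl_aa', 'cdr_h3', 'reads'.
    Keeps antibodies with more than min_reads reads and a CDR-H3 shorter
    than max_h3_len residues, modeling each VH:VL sequence only once.
    """
    seen, selected = set(), []
    for ab in sorted(antibodies, key=lambda a: a["reads"], reverse=True):
        key = (ab["vh_aa"], ab["vl_aa"])
        if ab["reads"] <= min_reads:
            continue  # not observed at >50 reads
        if len(ab["cdr_h3"]) >= max_h3_len:
            continue  # CDR-H3 of 16 or more residues is excluded
        if key in seen:
            continue  # already selected from this or another repertoire
        seen.add(key)
        selected.append(ab)
    return selected

def best_model(models):
    """Pick the lowest-scoring model from (model_id, score) tuples."""
    return min(models, key=lambda m: m[1])

# Hypothetical records; sequence fields are placeholders, not real sequences.
toy_abs = [
    {"vh_aa": "VH_A", "vl_aa": "VL_A", "cdr_h3": "ARDYGMDV", "reads": 120},
    {"vh_aa": "VH_A", "vl_aa": "VL_A", "cdr_h3": "ARDYGMDV", "reads": 80},
    {"vh_aa": "VH_B", "vl_aa": "VL_B", "cdr_h3": "A" * 18, "reads": 300},
    {"vh_aa": "VH_C", "vl_aa": "VL_C", "cdr_h3": "ARGGW", "reads": 50},
]
selected = select_for_modeling(toy_abs)
best = best_model([("m1", -310.2), ("m2", -325.7), ("m3", -301.0)])
```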

The CR-paratope comprised residues that were part of the contact region of each antibody as defined by Stave et al. (35). These consisted of VH residues numbered 26–33 (CDR-H1), 50–58 (CDR-H2), and 94–101 (CDR-H3) and VL residues 27–32 (CDR-L1), 49–56 (CDR-L2), and 91–96 (CDR-L3) in the Chothia numbering scheme.
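For readers who want to apply the CR-paratope definition programmatically, the residue ranges listed above can be encoded as a lookup table. The sketch below is a minimal Python illustration; the names `CR_PARATOPE` and `paratope_region` are hypothetical, and Chothia insertion codes (e.g., 100A) are assumed to have been reduced to their integer position before lookup.

```python
# Inclusive Chothia ranges from the text, stored as Python ranges
# (the upper bound of range() is exclusive, hence the +1).
CR_PARATOPE = {
    "H": {
        "CDR-H1": range(26, 34),   # VH 26-33
        "CDR-H2": range(50, 59),   # VH 50-58
        "CDR-H3": range(94, 102),  # VH 94-101
    },
    "L": {
        "CDR-L1": range(27, 33),   # VL 27-32
        "CDR-L2": range(49, 57),   # VL 49-56
        "CDR-L3": range(91, 97),   # VL 91-96
    },
}

def paratope_region(chain, chothia_number):
    """Return the CR-paratope region for a residue, or None if outside it.

    chain: 'H' or 'L'; chothia_number: integer Chothia position.
    """
    for name, positions in CR_PARATOPE[chain].items():
        if chothia_number in positions:
            return name
    return None
```

For example, `paratope_region("H", 33)` falls in CDR-H1, whereas heavy-chain position 34 lies outside the CR-paratope under this definition.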

Similarities between FRs (FR1–3) of antibodies were calculated by determining the rmsd over the backbone atoms (C, Cα, N, O) of each antibody FR1–3 region to all other antibodies in a repertoire using the McLachlan algorithm (64) as implemented in the ProFit software (A. C. R. Martin and C. T. Porter, University College London, London) (www.bioinf.org.uk/software/profit/). Antibodies then were grouped by IGHV gene use (same gene, same family, or different family), and median rmsd values, SDs, and statistical significance of distributions were determined using R version 3.1.1.
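The grouping step described above (same gene, same family, or different family) can be sketched as follows. This Python example assumes that pairwise rmsd values have already been computed by a structure-fitting tool and that IGHV gene names carry no allele suffix; the function names and rmsd values are hypothetical.

```python
from statistics import median

def ighv_family(gene):
    """Family prefix of an IGHV gene name, e.g. 'IGHV4-34' -> 'IGHV4'."""
    return gene.split("-")[0]

def relation(gene_a, gene_b):
    """Classify a pair of antibodies by their IGHV gene use."""
    if gene_a == gene_b:
        return "same gene"
    if ighv_family(gene_a) == ighv_family(gene_b):
        return "same family"
    return "different family"

def median_rmsd_by_relation(pairwise):
    """pairwise: iterable of (gene_a, gene_b, rmsd). Returns group medians."""
    groups = {}
    for gene_a, gene_b, rmsd in pairwise:
        groups.setdefault(relation(gene_a, gene_b), []).append(rmsd)
    return {name: median(values) for name, values in groups.items()}

# Hypothetical precomputed pairwise FR rmsd values (angstroms).
toy_pairwise = [
    ("IGHV4-34", "IGHV4-34", 0.13),
    ("IGHV4-34", "IGHV4-59", 0.50),
    ("IGHV4-34", "IGHV4-59", 0.51),
    ("IGHV4-59", "IGHV4-61", 0.52),
    ("IGHV1-69", "IGHV3-23", 1.10),
]
medians = median_rmsd_by_relation(toy_pairwise)
```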

Statistical Analysis. R version 3.1.1 was used for Pearson hierarchical clustering (function “hclust”). Distance between samples was measured by Pearson correlation with complete-linkage as the agglomerative method. Principal component analysis (the “princomp” function in MATLAB R2012b) was applied to processed Pearson hierarchical clustering data. R version 3.1.1 was used for the identification of differentially paired genes (package “limma” version 3.14.4) (36, 65). Before running limma, gene pairs with zero use were removed, and quantile normalization was performed to normalize the difference in distribution of values among samples. P values for multiple comparisons were corrected with the Benjamini–Hochberg procedure. Differentially paired gene cut-offs were established at a fold-change of 2 and an adjusted P value of 0.05. R version 3.1.1 was used for the K–S test (function “ks.test”). Raw values such as charge, length, and hydrophobicity were used to compare probability distributions across experimental groups. The Z score was used to compare two proportions of amino acid charges. Further details regarding statistical tests are described in SI Appendix, SI Methods.
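The multiple-testing correction and the differential-pairing cut-offs described above can be illustrated with a short Python sketch. It implements the standard Benjamini–Hochberg adjustment and applies the stated thresholds (fold-change of 2, adjusted P of 0.05); it does not reproduce the limma moderated statistics that generated the P values in the study, and the function names and example values are hypothetical.

```python
import math

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def differentially_paired(log2_fc, adj_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Apply the fold-change and adjusted-P cut-offs to one gene pair."""
    return abs(log2_fc) >= math.log2(fc_cutoff) and adj_p < p_cutoff

# Hypothetical raw p-values for four gene pairs.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
```

Under these cut-offs a gene pair with a large fold-change but an adjusted P value above 0.05 is not called differentially paired, which corresponds to the orange points in the volcano plot of Fig. 2C.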

ACKNOWLEDGMENTS. We thank Jeliazko Jeliazkov and Brian Weitzner for aid in implementing the antibody structure prediction protocol, Jessica Wheeler and Scott Hunicke-Smith for Illumina MiSeq sequencing, and the Texas Advanced Computing Center for computational resources. This work was supported by NIH Grant R56 AI106006, by Defense Threat Reduction Agency Grant HDTRA1-12-C-0105, and by a grant from the Clayton Foundation. B.J.D. was funded by graduate fellowships from the Hertz Foundation, the University of Texas Donald D. Harrington Foundation, and the National Science Foundation. O.I.L. was funded by NIH Collaborative Opportunities for Research Educators Fellowship 1 K12 GM102745. G.C.I. was funded by the World Health Organization. D.K. and J.J.G. were funded by NIH Grant 5 R01 GM078221.

1. Murphy K, Travers P, Walport M, Janeway C (2012) Janeway’s Immunobiology (Garland Science, New York), 8th Ed.

2. Kirkham PM, Schroeder HW, Jr (1994) Antibody structure and the evolution of immunoglobulin V gene segments. Semin Immunol 6(6):347–360.

3. Manser T (1989) Evolution of antibody structure during the immune response. The differentiative potential of a single B lymphocyte. J Exp Med 170(4):1211–1230.

4. Schmidt AG, et al. (2013) Preconfiguration of the antigen-binding site during affinity maturation of a broadly neutralizing influenza virus antibody. Proc Natl Acad Sci USA 110(1):264–269.

5. Li T, et al. (2014) Redistribution of flexibility in stabilizing antibody fragment mutants follows Le Châtelier’s principle. PLoS One 9(3):e92870.

6. Jackson KJL, Kidd MJ, Wang Y, Collins AM (2013) The shape of the lymphocyte receptor repertoire: lessons from the B cell receptor. Front Immunol 4(263):263.

7. Yaari G, Benichou JIC, Heiden JAV, Kleinstein SH, Louzoun Y (2015) The mutation patterns in B-cell immunoglobulin receptors reflect the influence of selection acting at multiple time-scales. Phil Trans R Soc B 370(1676):20140242.

8. Wu YC, et al. (2010) High-throughput immunoglobulin repertoire analysis distinguishes between human IgM memory and switched memory B-cell populations. Blood 116(7):1070–1078.

E2644 | www.pnas.org/cgi/doi/10.1073/pnas.1525510113 DeKosky et al.


9. Wu Y-CB, Kipling D, Dunn-Walters DK (2011) The relationship between CD27 negative and positive B cell populations in human peripheral blood. Front Immunol 2:81.

10. Glanville J, et al. (2011) Naive antibody gene-segment frequencies are heritable and unaltered by chronic lymphocyte ablation. Proc Natl Acad Sci USA 108(50):20066–20071.

11. Briney BS, Willis JR, McKinney BA, Crowe JE, Jr (2012) High-throughput antibody sequencing reveals genetic evidence of global regulation of the naïve and memory repertoires that extends across individuals. Genes Immun 13(6):469–473.

12. Mroczek ES, et al. (2014) Differences in the composition of the human antibody repertoire by B cell subsets in the blood. Front Immunol 5:96.

13. Georgiou G, et al. (2014) The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol 32(2):158–168.

14. Robinson WH (2015) Sequencing the functional antibody repertoire–diagnostic and therapeutic discovery. Nat Rev Rheumatol 11(3):171–182.

15. Brezinschek H-P, Foster SJ, Dörner T, Brezinschek RI, Lipsky PE (1998) Pairing of variable heavy and variable κ chains in individual naive and memory B cells. J Immunol 160(10):4762–4767.

16. Bräuninger A, Goossens T, Rajewsky K, Küppers R (2001) Regulation of immunoglobulin light chain gene rearrangements during early B cell development in the human. Eur J Immunol 31(12):3631–3637.

17. Meffre E, et al. (2001) Immunoglobulin heavy chain expression shapes the B cell receptor repertoire in human B cell development. J Clin Invest 108(6):879–886.

18. Wardemann H, et al. (2003) Predominant autoantibody production by early human B cell precursors. Science 301(5638):1374–1377.

19. Tian C, et al. (2007) Evidence for preferential Ig gene usage and differential TdT and exonuclease activities in human naïve and memory B cells. Mol Immunol 44(9):2173–2183.

20. Wardemann H, Nussenzweig MC (2007) B-cell self-tolerance in humans. Adv Immunol 95(95):83–110.

21. de Wildt RM, Hoet RM, van Venrooij WJ, Tomlinson IM, Winter G (1999) Analysis of heavy and light chain pairings indicates that receptor editing shapes the human antibody repertoire. J Mol Biol 285(3):895–901.

22. Jayaram N, Bhowmick P, Martin ACR (2012) Germline VH/VL pairing in antibodies. Protein Eng Des Sel 25(10):523–529.

23. Zhu K, et al. (2014) Antibody structure determination using a combination of homology modeling, energy-based refinement, and loop prediction. Proteins 82(8):1646–1655.

24. Berrondo M, Kaufmann S, Berrondo M (2014) Automated Aufbau of antibody structures from given sequences using Macromoltek’s SmrtMolAntibody. Proteins 82(8):1636–1645.

25. Shirai H, et al. (2014) High-resolution modeling of antibody structures by a combination of bioinformatics, expert knowledge, and molecular simulations. Proteins 82(8):1624–1635.

26. Teplyakov A, et al. (2014) Antibody modeling assessment II. Structures and models. Proteins 82(8):1563–1582.

27. Weitzner BD, Kuroda D, Marze N, Xu J, Gray JJ (2014) Blind prediction performance of RosettaAntibody 3.0: grafting, relaxation, kinematic loop modeling, and full CDR optimization. Proteins 82(8):1611–1623.

28. Marcatili P, et al. (2013) Igs expressed by chronic lymphocytic leukemia B cells show limited binding-site structure variability. J Immunol 190(11):5771–5778.

29. DeKosky BJ, et al. (2015) In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nat Med 21(1):86–91.

30. McDaniel JR, DeKosky BJ, Tanno H, Ellington AD, Georgiou G (2016) Ultra-high-throughput sequencing of the immune receptor repertoire from millions of lymphocytes. Nat Protoc 11(3):429–442.

31. Wang J, et al. (2013) High frequencies of activated B cells and T follicular helper cells are correlated with disease activity in patients with new-onset rheumatoid arthritis. Clin Exp Immunol 174(2):212–220.

32. Kaminski DA, Wei C, Qian Y, Rosenberg AF, Sanz I (2012) Advances in human B cell phenotypic profiling. Front Immunol 3:302.

33. DeKosky BJ, et al. (2013) High-throughput sequencing of the paired human immunoglobulin heavy and light chain repertoire. Nat Biotechnol 31(2):166–169.

34. Lavinder JJ, et al. (2014) Identification and characterization of the constituent human serum antibodies elicited by vaccination. Proc Natl Acad Sci USA 111(6):2259–2264.

35. Stave JW, Lindpaintner K (2013) Antibody and antigen contact residues define epitope and paratope size and structure. J Immunol 191(3):1428–1435.

36. Smyth GK (2005) limma: Linear models for microarray data. Bioinformatics and Computational Biology Solutions Using R and Bioconductor, Statistics for Biology and Health, eds Gentleman R, Carey VJ, Huber W, Irizarry RA, Dudoit S (Springer, New York), pp 397–420.

37. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106.

38. Jackson KJL, et al. (2012) Divergent human populations show extensive shared IGK rearrangements in peripheral blood B cells. Immunogenetics 64(1):3–14.

39. Hoi KH, Ippolito GC (2013) Intrinsic bias and public rearrangements in the human immunoglobulin Vλ light chain repertoire. Genes Immun 14(4):271–276.

40. Arnaout R, et al. (2011) High-resolution description of antibody heavy-chain repertoires in humans. PLoS One 6(8):e22365.

41. Galson JD, et al. (2015) In-depth assessment of within-individual and inter-individual variation in the B cell receptor repertoire. Front Immunol 6:531.

42. Smith K, et al. (2013) Fully human monoclonal antibodies from antibody secreting cells after vaccination with Pneumovax®23 are serotype specific and facilitate opsonophagocytosis. Immunobiology 218(5):745–754.

43. Parameswaran P, et al. (2013) Convergent antibody signatures in human dengue. Cell Host Microbe 13(6):691–700.

44. Jackson KJL, et al. (2014) Human responses to influenza vaccination show seroconversion signatures and convergent antibody rearrangements. Cell Host Microbe 16(1):105–114.

45. Galson JD, et al. (2015) Analysis of B cell repertoire dynamics following hepatitis B vaccination in humans, and enrichment of vaccine-specific antibody sequences. EBioMedicine 2(12):2070–2079.

46. Eisenberg D (1984) Three-dimensional structure of membrane and surface proteins. Annu Rev Biochem 53(1):595–623.

47. Tan PH, Sandmaier BM, Stayton PS (1998) Contributions of a highly conserved VH/VL hydrogen bonding interaction to scFv folding stability and refolding efficiency. Biophys J 75(3):1473–1482.

48. Ewert S, Honegger A, Plückthun A (2004) Stability improvement of antibodies for extracellular and intracellular applications: CDR grafting to stable frameworks and structure-based framework engineering. Methods 34(2):184–199.

49. Honegger A, Malebranche AD, Röthlisberger D, Plückthun A (2009) The influence of the framework core residues on the biophysical properties of immunoglobulin heavy chain variable domains. Protein Eng Des Sel 22(3):121–134.

50. Wang N, et al. (2009) Conserved amino acid networks involved in antibody variable domain interactions. Proteins 76(1):99–114.

51. Dunn-Walters DK, Isaacson PG, Spencer J (1995) Analysis of mutations in immunoglobulin heavy chain variable region genes of microdissected marginal zone (MGZ) B cells suggests that the MGZ of human spleen is a reservoir of memory B cells. J Exp Med 182(2):559–566.

52. Weller S, et al. (2004) Human blood IgM “memory” B cells are circulating splenic marginal zone B cells harboring a prediversified immunoglobulin repertoire. Blood 104(12):3647–3654.

53. Reynaud CA, et al. (2012) IgM memory B cells: a mouse/human paradox. Cell Mol Life Sci 69(10):1625–1634.

54. Dunand CJH, Wilson PC (2015) Restricted, canonical, stereotyped and convergent immunoglobulin responses. Philos Trans R Soc Lond B Biol Sci 370(1676):20140238.

55. Sajadi MM, et al. (2016) λ light chain bias associated with enhanced binding and function of anti-HIV env glycoprotein antibodies. J Infect Dis 213(1):156–164.

56. Liu S, et al. (2005) Receptor editing can lead to allelic inclusion and development of B cells that retain antibodies reacting with high avidity autoantigens. J Immunol 175(8):5067–5076.

57. Casellas R, et al. (2007) Igkappa allelic inclusion is a consequence of receptor editing. J Exp Med 204(1):153–160.

58. Andrews SF, et al. (2013) Global analysis of B cell selection using an immunoglobulin light chain-mediated model of autoreactivity. J Exp Med 210(1):125–142.

59. Giachino C, Padovan E, Lanzavecchia A (1995) kappa+lambda+ dual receptor B cells are present in the human peripheral repertoire. J Exp Med 181(3):1245–1250.

60. Brochet X, Lefranc M-P, Giudicelli V (2008) IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res 36(Web Server issue, suppl 2):W503–8.

61. Ye J, Ma N, Madden TL, Ostell JM (2013) IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res 41(Web Server issue, W1):W34–40.

62. Ippolito GC, et al. (2012) Antibody repertoires in humanized NOD-scid-IL2Rγ(null) mice and human B cells reveals human-like diversification and tolerance checkpoints in the mouse. PLoS One 7(4):e35497.

63. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460–2461.

64. McLachlan AD (1982) Rapid comparison of protein structures. Acta Crystallogr A 38(6):871–873.

65. Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments: statistical applications in genetics and molecular biology. Stat Appl Genet Mol Biol 3(1):3.

DeKosky et al. PNAS | Published online April 25, 2016 | E2645
