Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability...

30
Supplementary Figure 1 Pipeline overview. (1) MiXCR accepts raw paired- or single-end sequencing data as input. RNA-seq data contain only a tiny fraction of target Ig/TCR reads (one per 10 5 -10 7 reads). (2) Alignment of raw sequencing reads to genomic sequences of V, D, J and C genes. Non-aligned reads are filtered out. (3) Reads containing only fragments of CDR3 are assembled into contigs with full or near-full coverage of CDR3. Assembly is performed only for reads with large overlap involving a substantial part of the hypervariable N region. (4) TCR sequences that contain defined V and J genes but do not fully cover the ends of the CDR3 are extended using germline sequence. This procedure is not performed for Igs because of possible hypermutations in the extended sequence. (5-6) The resulting sequences are clustered into clonotypes based on their CDR3 sequence. This step includes correction of artificial diversity from PCR and sequencing errors. (7) The primary output is a list of clonotypes with comprehensive information on their abundance, V/D/J genes composition, antibody isotype, CDR3 sequence topology, etc. Nature Biotechnology: doi:10.1038/nbt.3979

Transcript of Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability...

Page 1: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Figure 1

Pipeline overview.

(1) MiXCR accepts raw paired- or single-end sequencing data as input. RNA-seq data contain only a tiny fraction of target Ig/TCR reads (one per 105-107 reads). (2) Alignment of raw sequencing reads to genomic sequences of V, D, J and C genes. Non-aligned reads are filtered out. (3) Reads containing only fragments of CDR3 are assembled into contigs with full or near-full coverage of CDR3. Assembly is performed only for reads with large overlap involving a substantial part of the hypervariable N region. (4) TCR sequences that contain defined V and J genes but do not fully cover the ends of the CDR3 are extended using germline sequence. This procedure is not performed for Igs because of possible hypermutations in the extended sequence. (5-6) The resulting sequences are clustered into clonotypes based on their CDR3 sequence. This step includes correction of artificial diversity from PCR and sequencing errors. (7) The primary output is a list of clonotypes with comprehensive information on their abundance, V/D/J genes composition, antibody isotype, CDR3 sequence topology, etc.

Nature Biotechnology: doi:10.1038/nbt.3979

Page 2: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Figure 2

Sensitivity and specificity of TCR and Ig repertoire extraction from tumor RNA-seq data, SPX6730-2, SPX8151-1 and SPX8151-2 tumor samples.

a. Dependence of total number of TCR beta CDR3 clonotypes, number of TCR-seq-confirmed clonotypes (green), number of canonical unconfirmed clonotypes (blue) and number of non-canonical unconfirmed clonotypes (red) on the paired-end sequencing reads length. In the case of TRUST green band states for definitely confirmed while orange for partially confirmed (allowing 6 nucleotides to be truncated) clonotypes. b. Dependence of the share of TCR-seq-confirmed TRA and TRB clonotypes extracted from RNA-seq data on the clonotype abundance (estimated by TCR-seq). The x-axis corresponds to clonotype frequency A in TCR-seq data. The y-axis corresponds to the fraction of identified clonotypes in the total number of control clonotypes with the frequency greater than A. c. Correlation of TCR clonotypes frequency in MiXCR-extracted repertoires from TCR-seq and RNA-seq data.

Nature Biotechnology: doi:10.1038/nbt.3979

Page 3: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Figure 3

MiXCR and TRUST performance comparison on in silico generated data.

a. Comparison on paired-end data. b. Single-end data. Dark green color corresponds to fully matched CDR3s (without any mismatches or indels), lighter shades of green denotes CDR3s matched with mismatches or indels (up to 3 mutations); red color denotes false-positive CDR3s (missing in the original set of synthetic clones).  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 4: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Figure 4

Validation of IGH clonality, expression and isotype ratio effects for TCGA SKCM samples.

a,d. Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (<cutoff) metrics groups for IGG, IGK, IGL (for samples with >500 CDR3 reads of corresponding gene loci) and TRA, TRB (for samples with >50 CDR3 reads) repertoire clonality and TRB expressions (a) and isotype coverage ratios (d). Cutoff values were determined in terms of median values. p values of log-rank test for survival difference between low and high metrics groups are shown. n, number of patients. b. Exemplary lineage trees of hypermutating IGH CDR3 variants, sample SKCM193 (primary tumor). Adjacent nodes differ by exactly one nucleotide mismatch. The nodes are colored according to antibody isotype: IgG1 (yellow) or undetermined (white). Size of nodes is proportional to the clonotypes frequencies. c. Plot depicts the proportions of Ig isotypes among IGH. e. Analysis of 1,000 subsets of the original SKCM cohort, each representing a randomly chosen 50% of samples. For each iteration, we evaluated log-rank test χ2 statistics for groups split by a given metric’s median. Box plots report the distribution of the log-rank test score for the sampled subsets. p value is for the median log-rank test score over 1,000 simulations. f. TRB and IGH CDR3 reads extraction efficiency from lymph-node containing and non-lymph node samples. g,h. Kaplan–Meier survival plots for high (>=cutoff) and low (<cutoff) metrics groups for TRB expression, TRB clonality (for samples with >50 TRB CDR3 reads), IGH expression, IGH clonality (for samples with >500 IGH CDR3 reads), and IgG1/IGH proportion for lymph node-containing and non-lymph node-containing samples (f), and melanoma Stage III samples only (g). i. Inverse hazard ratios of the major covariates (IGH expression, IGH clonality and IgG1/IGH proportion) for all 458 samples. The hazard ratios are calculated by the proportional hazards regression model. Hazard ratios were inverted to reflect a favorable prognostic function of the covariates in the chart.  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 5: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Note 1. Extracting more but extracting truth

RNA-seq data contain only a tiny fraction of TCR/Ig reads covering CDR3. This fraction varies from sample to

sample depending on the degree of immune cell infiltration, ranging from 10-5 to 10-7 for TCRs and from 10-4 to

10-7 for Igs. Samples having no target reads are also common. Additionally, due to the limited length of

sequencing (typically paired-end 50-100 bp), successful detection of target V(D)J junctions implies alignment

with very short fragments of germline V and J genes (12-15 bp). The main challenge in the analysis of such

short sequences is the high probability of false-positive alignments. Thus, the primary objective of a robust

CDR3 extraction procedure is to extract as many true CDR3 sequences as possible with nearly zero amount of

CDR3-like false-positives.

To meet these challenges, we have developed and implemented a set of new algorithms (Supplementary Fig.

1):

(i) Sensitive and highly selective aligner. Since there is no prior knowledge of true TCR/Ig sequences in real

data and because some sources of false-positive sequences are not random (Online Methods), the

implementation of a fully automated procedure for optimization of the alignment algorithm is substantially

complicated. Manual analysis of the structure of false- and true-positive alignments on real and artificial

datasets has helped us to establish a semi-automated pipeline for the optimization of our built-in MiXCR

aligner. This has resulted in high-efficiency extraction of target V(D)J rearrangements from bulk RNA-seq data

with a zero detected false-positive rate. Optimized aligner efficiently filters out even the fragments of mRNA

sequences that are homologous to antibody or TCR hypervariable regions, which comprise the most prominent

source of reproducible false-positives in RNA-seq data analysis (Online Methods). We further enhanced our

aligner by targeting ambiguous cases, which recurrently arise in short reads that have either V or J segment

alignment but for which the fast alignment algorithm has failed to detect a J or V gene segment, respectively. In

such cases MiXCR switches to a more sensitive algorithm—a modified version of the Smith-

Waterman/Needleman-Wunsch algorithm— in order to reconstruct the full V(D)J junction if possible. In this

way, we achieve high alignment sensitivity even on short reads, maintaining high overall performance

(Supplementary Table 1).

(ii) Partial alignments assembler. For short-read data, there is little to no chance that a long CDR3 will be

fully covered by a single read. To extract such sequences, we introduced an additional analysis step that

assembles contigs from several initial alignments, originating from different paired- or single-end reads. The

partial assembler merges left-half reads (LR), defined as reads which cover only the left boundary of CDR3

(the conserved Cys in the V gene) while not covering the right boundary (the conserved Phe/Trp in the J gene),

with right-half reads (RR), which do not cover the left CDR3 boundary and cover part of the J gene. To protect

the algorithm from artificial diversity generation, LR and RR are merged if and only if the following criteria are

satisfied:

Nature Biotechnology: doi:10.1038/nbt.3979

Page 6: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

1. Minimal length of the overlap region is 12 nt.

2. The overlap covers at least 7 non-germline-derived letters (N region). Boundaries of N regions are

determined after the realignment of fused contigs against V, D and J gene sequences, and P-segments

are also aligned.

3. Sequences are 100% identical inside the overlap.

These default thresholds showed the best extraction efficiency while keeping negligible rate of observed false-

positive overlaps (as verified with in silico generated data, Online Methods, doi.org/10.5281/zenodo.804326).

All parameters can be adjusted by the user.

(iii) CDR3 extension. To safely utilize even those reads that partially cover CDR3 and were not fully

reconstructed using previous step, we added an optional CDR3 extension step for TCRs (but not Igs, due to

possible presence of hypermutations). This step fills in the edges of the CDR3 based on known information on

the relevant germline gene segments. A substantial fraction of extracted TCR reads fully cover the N-D-N

region and are assigned with definite V and J genes (often supported by companion paired-end reads), but at

the same time do not cover the CDR3 sequence end-to-end, missing several germline nucleotides. Since

TCRs do not undergo hypermutations and their germline sequences are quite well-conserved, it is reasonable

to artificially extend the CDR3 for such junctions with existing data from reference germline genes. This allows

us to make use of these sequences for clonotype assembly and further comparative analysis. Using in silico

generated data, we estimated a false extension rate of ~10^-5 for this procedure (Online Methods).

Importantly, the resulting RNA-seq analysis pipeline employs the same MiXCR modules, the same error-

correction algorithms, and has the same output format as for targeted TCR or Ig profiling. This allows unified

processing and comparison of immune repertoires obtained from different types of raw sequencing data.

To verify the efficiency and specificity of TCR CDR3 repertoires extraction from RNA-seq data, we performed

both deep targeted profiling of TCR alpha (TRA) and beta (TRB) chains repertoires (TCR-seq) as described

previously1 and 100+100 paired-end RNA-seq analysis for the same split RNA samples, which were obtained

from surgically resected melanoma specimens from two patients, SPX6730 (ileocecal lymph node metastasis)

and SPX8151 (small intestine resection).

We further used the deep TRA and TRB CDR3 repertoires extracted from TCR-seq data using the standard

MiXCR analysis pipeline2 (Online Methods) as control data. Analysis of RNA-seq data was performed with

MiXCR or TRUST, a recently published software tool for extracting TCR repertoires from RNA-seq3, 4.

For MiXCR, we compared the RNA-seq CDR3 sequences with the TCR-seq control to directly assess the

equivalence of the identified CDR3 nucleotide sequences. Since TRUST does not group clonal CDR3

sequences, we used unique CDR3s from the TRUST output for comparison. Furthermore, the majority of

CDR3s reported by TRUST are truncated in a nondeterministic way, such that strict equality gives almost no

Nature Biotechnology: doi:10.1038/nbt.3979

Page 7: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

matches between the TRUST results and the control data. Since it was still possible that TRUST-reported

CDR3s may represent immunologically useful information, we allowed subsequence matching between the

control data and TRUST-reported CDR3s (Online Methods).

We assessed the dependence of the number of confirmed RNA-seq clonotypes on their abundance estimated

from TCR-seq data (Fig. 1a and Supplementary Fig. 2). MiXCR was able to extract all relatively abundant

TRB CDR3 clonotypes (frequency in repertoire > 0.15%) from the SPX6730 sample RNA-seq, even with the

short paired-end reads (50+50-bp, trimmed in silico from the 100+100-bp paired-end data). In contrast, TRUST

failed to extract a considerable proportion of high-frequency clonotypes. As expected, performance of RNA-seq

analysis degraded at shorter read lengths. Our data also indicate that paired-end RNA-seq >100-bp would be

beneficial for immune repertoires profiling.

Next, we divided extracted clonotypes into several categories: clonotypes that are also present in the TCR-seq

control (verified CDR3s, considered true positives), clonotypes that are absent in the TCR-seq control but have

canonical amino acid sequences (potentially true low-frequency clonotypes, Online Methods), and clonotypes

that are absent in the TCR-seq control and have non-canonical amino acid sequences (probable false

positives). Fig. 1b shows the dependence of the number of found CDR3 clonotypes on the read length in

paired-end analysis for these various categories. Most MiXCR-reported clonotypes were confirmed by the

control data; ~20% of all clonotypes were unique to RNA-seq samples and had canonical amino acid

sequences, and there was only a tiny fraction of unconfirmed clonotypes with non-canonical CDR3 sequences,

which was less than the fraction of non-canonical CDR3s observed in control TCR-seq data. It should be noted

that we did not apply a filter for CDR3 canonical sequences in the MiXCR pipeline. On the other hand, nearly

none of CDR3s reported by TRUST could be confirmed by the TCR-seq control; ~20% of CDR3s were partially

confirmed on the basis of subsequence matches, with truncations of up to 6 nucleotides allowed, and >50% of

unconfirmed CDR3s did not match canonical pattern. Only ~16% of the CDR3s reported by TRUST had both V

and J genes annotated, while in the case of MiXCR, all clones had both V and J segments assigned.

Software testing with in silico-generated data confirmed the high extraction efficiency of MiXCR, with zero

false-positive clones observed. In contrast, TRUST efficiency was an order of magnitude lower, and the

software generated a substantial number of false clonotypes, including those of non-TCR origin (Online

Methods, Supplementary Fig. 3).

The frequencies of clonotypes in the TCR repertoires extracted by MiXCR from the SPX6730 sample

correlated between the TCR-seq and RNA-seq data (Fig. 1c). This demonstrates that RNA-seq-based TCR

profiling can be relatively quantitative for the abundant clonotypes that occupy >0.1% of the overall T-cell

repertoire for those samples that harbor a substantial number of T-cells. It should be noted that the SPX6730

sample was an ileocecal lymph node metastasis that was enriched with T-cells. The second sample, SPX8151,

was a small intestine resection that contained lower proportion of T-cells and correspondingly yielded a lower

number of TCR CDR3 reads (Fig. 1d), resulting in poor quantification of observed clonotypes (Supplementary

Nature Biotechnology: doi:10.1038/nbt.3979

Page 8: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Fig. 2). Roughly, the number of TRB CDR3-containing sequencing reads extractable from an RNA-seq dataset

was proportional to TRBC coverage, and was estimated as approximately 46 TRB CDR3 reads per 1,000

TRBC reads for 50+50-bp, and 128 TRB CDR3 reads per 1,000 TRBC reads for 100+100-bp paired-end

sequencing (Online Methods, Supplementary Table 2).

We also compared MiXCR performance with the recently reported V’DJer software5, which was designed for

the extraction of Ig repertoires from RNA-seq data. MiXCR successfully extracted repertoires for all immune

receptor types from both melanoma samples (Fig. 1d), while for these large RNA-Seq datasets V’DJer failed to

extract IGH and IGK repertoires within four days using 8 threads on a Xeon E5-2683 CPU with 50 GB of RAM.

Additionally, we used several representative samples analyzed in refs. 3, 4 from TCGA and SRA databases in

order to compare MiXCR performance for TCR and Ig repertoires relative to the TRUST and V’DJer packages,

respectively. In all comparisons, MiXCR demonstrated superior sensitivity (Fig. 1e). Both alternative software

packages require substantially more hands-on time and implementation of third-party alignment tools with

particular versions of еру human genome and particular analysis settings, which are not clearly defined in the

documentation and require laborious optimization. Additionally, the output from both tools lacks useful

biological information; some of this information may be recovered by additional post-processing, but other

important information is irretrievably lost during analysis (Supplementary Table 1).

In single T-cell transcriptome analysis, MiXCR outperformed TraCeR6 in efficiency of TRA and TRB chains

detection (Supplementary Table 3).

References:

1. Britanova, O.V. et al. J Immunol 196, 5005-5013 (2016).

2. Bolotin, D.A. et al. Nat Methods 12, 380-381 (2015).

3. Li, B. et al. Nature genetics 48, 725-732 (2016).

4. Li, B. et al. Nature genetics 49, 482-483 (2017).

5. Mose, L.E. et al. Bioinformatics 32, 3729-3734 (2016).

6. Stubbington, M.J. et al. Nat Methods 13, 329-332 (2016).

Nature Biotechnology: doi:10.1038/nbt.3979

Page 9: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Note 2. Intratumoral Ig repertoire derived from RNA-seq

We employed MiXCR to extract immune repertoires from the TCGA 48+48-bp paired-end RNA-seq data for

458 patients with cutaneous melanoma (SKCM). In terms of functional CDR3 clonotypes/CDR3-covering reads

per sample, MiXCR yielded an average of 52/69 for TRA, 54/86 for TRB, 2.4/3 for TCR gamma (TRG),

0.15/0.2 for TCR delta (TRD), 395/3924 for IGH, 620/7595 for IGK, and 414/4939 for IGL (see

doi:10.6084/m9.figshare.4620739 for clonesets). Notably, the extracted Ig repertoires were an order of

magnitude larger than for the TCRs, indicating the presence of intratumoral Ig-producing plasma cells.

Furthermore, we noted that high intratumoral IGH expression levels as well as high levels of IGH clonality

(calculated according to ref. 1) were associated with longer survival (Fig. 2a,b, Supplementary Fig. 4a), and

the two parameters had strong cumulative value for patient stratification (Fig. 2c).

In many patients, a single dominant intratumoral Ig clonotype occupied 30–80% of all Ig CDR3 sequences in

both heavy and light chains repertoires. Hypermutating IGH CDR3 variants could be observed even in primary

tumor samples (Supplementary Table 4, Supplementary Fig. 4b), which could reflect the presence of

intratumoral germinal centers2, 3, or the co-infiltration of extra-tumorally hypermutated B cells encoding

homologous Igs.

Higher TRB expression levels, reflecting greater tumor infiltration by T-cells, were also associated with longer

survival (Supplementary Fig. 4a). TCR CDR3 repertoire clonality was not significantly associated with

survival, but this could be attributable to insufficient information on TCR repertoires extracted from tumor RNA-

seq data.

Analysis of isotype composition revealed that the IgG1 isotype was dominant among intratumorally-produced

Igs (Supplementary Fig. 4c). A high proportion of IgG1 among IGH was associated with longer survival, while

a high IgA/IGH proportion was associated with a negative prognosis (Fig. 2d). Proportions of IgD and IgE also

tend to correlate negatively with survival, while proportions of IgG2, IgG3, IgG4 and IgM of IGH had no clear

association with prognosis (Supplementary Fig. 4d).

To validate the observed effects, we analyzed 1,000 subsets of the original SKCM cohort by randomly

sampling 50% of the total number of samples. For each iteration, we evaluated log-rank test χ2 statistics for

groups split by the given metric’s median. This analysis demonstrated that the positive correlations of IGH

clonality, high IGH expression, and high IgG1/IGH ratio with survival remains significant even in these smaller

subsets of the SKCM cohort (Supplementary Fig. 4e).

It should be noted that there was a significant difference in the number of TCR and Ig CDR3 reads extracted

from the samples from regional lymph nodes versus other tissue sites (Supplementary Fig. 4f). However, the

association of high IGH clonality, high IGH expression, and high IgG1/IGH ratio with survival remained

significant for the separately analyzed samples with or without lymph node tissue (Supplementary Fig. 4g).

Nature Biotechnology: doi:10.1038/nbt.3979

Page 10: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

To exclude the influence of the disease stage, we separately analyzed melanoma stage III samples, which are

the most abundant stage in the TCGA data. Again, all effects remained significant (Supplementary Fig. 4h).

The proportional hazard model suggested comparable regression coefficients for all three covariates

(Supplementary Fig. 4i). Collectively, these results indicate that intratumorally produced Igs represent a

critical component of efficient anti-tumor response.

References:

1. Tumeh, P.C. et al. Nature 515, 568-571 (2014).

2. Gottlin, E.B. et al. Journal of thoracic oncology : official publication of the International Association for the Study of Lung Cancer 6, 1687-1690 (2011).

3. Willis, S.N. et al. J Immunol 182, 3310-3317 (2009).

Nature Biotechnology: doi:10.1038/nbt.3979

Page 11: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Note 3. Functional characterization of TCR repertoires based on RNA-seq of sorted T-cells

RNA-seq data from tissues that contain relatively low T-cell counts, as is the case with many tumor samples,

allow to extract information only for the most abundant TCR clonotypes. However, melanoma samples with

high T-cell infiltration allowed us to extract relatively rich TCR repertoires, comparable to the shallow targeted

TCR profiling (Fig. 1c). Therefore, we predicted that for the RNA-seq data obtained from pure T-cell samples,

such as sorted T-cells, efficient extraction and comparative analysis of TCR repertoires should be possible.

To test the feasibility of such an approach, we performed 50+50-bp paired-end RNA-seq analysis for the

sorted effector (Teff) and regulatory (Treg) CD4 T-cells from the spleen and central nervous system (CNS) of six

individual Foxp3yfpcre mice1 with induced experimental autoimmune encephalomyelitis (EAE). The near-100%

abundance of T-cells in these samples allowed MiXCR to extract high-quality TCR repertoires comprising at

average 1330/3295 TRA and 1489/3933 TRB unique functional CDR3 clonotypes/CDR3 sequencing reads per

sample (Supplementary Table 5, see doi:10.6084/m9.figshare.4620739 for clonesets).

We determined an efficiency of about 90–100 TRB CDR3 reads per million unique reads. For small samples of

500 sorted T-cells covered by 3x10^7 reads, we could identify about 350–450 distinct clonotypes, indicating

almost complete repertoire extraction (taking clonality into consideration), similar to the single cell RNA-seq.

For large samples, the total number of reads remains the limiting factor. 50 million reads yield approximately

5,000 TRB CDR3 reads, which means up to 5,000 TRB clonotypes in theory, although in reality the number is

usually lower due to the natural clonality of the repertoire. Roughly speaking, 50 million unique 50+50-bp

paired-end RNA-seq reads could reveal the TRB CDR3 repertoire for approximately 5,000 T-cells randomly

chosen from a large T-cell sample.

Extracted repertoires were suitable for the routine post-analysis using the VDJtools software2. First, we

compared the diversity of repertoires, a challenging task in TCR profiling that preferably requires unique

molecular barcoding for normalization of multiple samples3, 4. However, in paired-end RNA-seq, each

sequencing read usually covers a unique starting RNA molecule, characterized with the unique starting

nucleotide positions. PCR and optical duplicates are relatively rare and were excluded from the raw data. This

allowed us to normalize samples for the accurate comparison of diversity metrics by extracting 500 random

unique CDR3-containing reads, representing unique fragments of template RNA molecules from each sample.

The diversity correlated well between the TRA and TRB repertoires (R>0.95, Fig. 2e), was similar between the

effector and regulatory CD4 T-cell subsets, and was significantly lower in the CNS compared to spleen

samples (Fig. 2f), reflecting the narrowed TCR repertoire in the CNS.

Next, we analyzed CDR3 characteristics for the full extracted repertoires, weighted for clonotypes frequency.

Treg cells were characterized by shorter TRB CDR3 lengths (Fig. 2g). The functional characteristics of the

amino acids comprising the middle portion of CDR3 differed between Treg and Teff cells TRB repertoires (Fig.

Nature Biotechnology: doi:10.1038/nbt.3979

Page 12: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

2h). The higher interaction ―strength‖5 of Treg CDR3s is in keeping with the previously observed higher TCR

affinity of Tregs for self-peptide:MHC complexes, which may enable thymic Treg precursors to compete more

efficiently for the limited amount of antigens found on thymic antigen-presenting cells6-8. Analysis of amino acid

TRB CDR3 repertoire overlaps revealed separate clustering of Teff and Treg cells, indicating functional similarity

of subset repertoires across mice (Fig. 2i). Thus, we conclude that detailed and highly informative insights into

the structure of TCR repertoires can be obtained by using RNA-seq data from sorted T-cell subsets.

References:

1. Rubtsov, Y.P. et al. Immunity 28, 546-558 (2008).

2. Shugay, M. et al. PLoS computational biology 11, e1004503 (2015).

3. Best, K., Oakes, T., Heather, J.M., Shawe-Taylor, J. & Chain, B. Scientific reports 5, 14629 (2015).

4. Britanova, O.V. et al. J Immunol 196, 5005-5013 (2016).

5. Kosmrlj, A., Jha, A.K., Huseby, E.S., Kardar, M. & Chakraborty, A.K. Proc Natl Acad Sci U S A 105, 16671-16676 (2008).

6. Hsieh, C.S., Zheng, Y., Liang, Y., Fontenot, J.D. & Rudensky, A.Y. Nature immunology 7, 401-410 (2006).

7. Jordan, M.S. et al. Nature immunology 2, 301-306 (2001).

8. Feng, Y. et al. Nature 528, 132-136 (2015).

Nature Biotechnology: doi:10.1038/nbt.3979

Page 13: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Table 1. Comparison of the key software characteristics.

TRUST V’DJer MiXCR

Analysis of T-cell receptors ✔ ✘ ✔

Analysis of B-cell receptors ✘ ✔ ✔

Paired-end analysis ✔ ✔ ✔

Single-end analysis ✔ ✘ ✔

Species Human Human Human, Mouse, Rat

Analysis of TCR-seq and/or IG-seq1) ✘ ✘ ✔

Assemble CDR3 clonotypes2) ✘ ✘ ✔

Reports full CDR3 sequence ✘ 3) ✔ ✔

Reports clonal abundances2) ✘ ✘ ✔

Builds full-length sequences4) ✘ ✔ ✘

Annotates V/J genes ✘ /✔ 5) ✘ ✔

Annotates D gene ✘ ✘ ✔

Annotates C gene / antibody isotype ✘ ✘ ✔

Annotates V/D/J gene positions6) ✘ ✘ ✔

Failed on some samples Yes7)/No Yes8) No

Median % of V/J annotated CDR3’s 16% N/A 100%

Median % of canonical CDR3’s9) 18% 100% 95%

RAM requirement ~2Gb10) ~30+ Gb11) ~2 Gb

Average time of analysis of 108 reads, h 20h 28h (9 hours per chain)12) 4h

Average sample analysis cost, USD12) 0.46$ 1.51$ (0.5$ per chain) 0.09$

Total analysis cost (all 140 samples, including truncated), USD13)

64$ 106$ (35$ per chain) 12$

Operating system Cross-platform Linux Cross-platform

Depends on external software TopHat 14) STAR 14) No

Source-code available ✘ ✔ ✔

Nature Biotechnology: doi:10.1038/nbt.3979

Page 14: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

1) MiXCR allows for homogeneous analysis of TCR-seq, IG-seq, and RNA-seq datasets producing output in the same format, making possible further comparative analysis of outputs from both sources of information. Neither TRUST nor V’DJer support analysis of TCR-seq and IG-Seq data.

2) Assembled clonotypes and information about their abundances is the standard representation of TCR/Ig repertoire by MiXCR. Clonotype abundances are highly important for further data interpretation.

3) CDR3s reported by TRUST are truncated in nondeterministic way.

4) From the beginning of Framework 1 (FR1) till the end of Framework 4 (FR4)

5) In majority of cases TRUST has annotated either V or J gene but not both simultaneously

6) Positions of boundaries of V/D/J genes. This information allows to calculate number of 5’/3’ truncated nucleotides of V/D/J genes and number of inserted N nucleotides. Such information is crucial for statistical inference of clonotype assembly probability and estimation of statistical significance of co-occurrence of same clonotype in several datasets.

7) On some samples (e.g. analysis of TRA chain for paired-end SPX6730 with >80bp length) TRUST reported zero number of CDR3s which is seemed to be a bug.

8) On some samples V’DJer failed to finish execution within 4 days running in 8 computer threads on Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz and occupying more than 50 Gb of RAM.

9) For TCR analysis canonical amino acid CDR3 is considered as matching regex ^C[^_]*(?:[FW]|[FW]G.G)$. Since V’DJer provides no amino acid translation, for Ig analysis canonical nucleotide CDR3 is considered as matching ^TG[TC].*(?:TT[TC]|TGG)$ pattern, not matching (?:...)*(?:TAA|TAG|TGA)(?:...)* pattern and with the sequence length multiple of 3.

10) TopHat aligner is required to align raw sequencing reads before TRUST can assemble V-J contigs. TRUST itself consumes less than 1Gb RAM, while TopHat consumed more memory in all cases. Thus amount of RAM consumed by TopHat was used as an overall memory consumption value.

11) STAR aligner is required to align raw sequencing reads before V’DJer can assemble V-J contigs. STAR aligner consumes at least 30 Gb of RAM (depending on the reference human genome used). RAM requirement for V’DJer itself highly varies from sample to sample (e.g. analysis of SPX6730 required 50Gb of RAM).

12) To analyze all IGH, IGK and IGL chains, one has to run V’DJer three times (independently for each chain).

13) Bulk analysis of a large set of data samples requires substantial computation power. To analyze all 140 samples by three software tools in our benchmark we rented computer instances at Amazon Web Services (EC2). To minimize total analysis cost we used only spot EC2 instances during weekends, when price is the lowest. Instances with minimal required amount of RAM were used for each software. In case of V’DJer and STAR which require a huge amount of RAM we used r3.2xlarge instances ($0.083/hour). For running TopHat, TRUST and MiXCR c3.xlarge instances were used ($0.035/hour). We believe that this estimation reflects the real costs of organization hosting required computational power (spent on hardware, electricity, administration, etc).

14) Both TRUST and V’DJer require reads aligned to human genome in BAM format as input. TRUST requires raw reads to be aligned with TopHat aligner, while V’DJer requires STAR aligner. We found that when using TopHat alignments as input for V’DJer its performance degrades. When using STAR alignments as input for TRUST, it produces almost no results.

Nature Biotechnology: doi:10.1038/nbt.3979

Page 15: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Table 2. Estimating the number of extractable TRB CDR3 reads.

Sequencing length TRB CDR3-containing reads per 1,000 TRBC sequencing reads, CI 95%

50 bp 20.2 ± 0.3

75 bp 50.0 ± 1.0

100 bp 81.4 ± 3.8

50+50 bp 46.2 ± 0.6

75+75 bp 80.7 ± 21.1

100+100 bp 128.2 ± 6.4

Nature Biotechnology: doi:10.1038/nbt.3979

Page 16: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Sequencing length, bp 100+100 bp 100 bp 50+50 bp 50 bp

Software tool MiXCR TraCeR MiXCR TraCeR MiXCR TraCeR MiXCR TraCeR

Functional TRA 231(85%) 223(82%) 227(83%) 220(81%) 226(83%) 215(79%) 215(79%) 206(76%)

Functional TRB 252(93%) 247(91%) 250(92%) 245(90%) 249(91%) 242(89%) 239(88%) 235(86%)

Functional pairs 221(81%) 209(77%) 215(79%) 207(76%) 214(79%) 198(73%) 196(72%) 186(68%)

Supplementary Table 3. TRA and TRB CDR3 extraction from 272 single cells RNA-seq data. Single-cell

transcriptome analysis is a rapidly developing methodology for characterization of cellular phenotypes and

heterogeneity1-3. As MiXCR showed high efficiency for conventional bulk RNA-seq data, we tested its

performance for previously published single-cell RNA-seq data from 272 CD4+ T-cells, and compared the

efficiency of MiXCR versus the TraCeR software that was originally used for analysis4. MiXCR showed higher

efficiency for both functional chains, detecting at least one TRA and TRB chain in 85% and 93% of individual

cells, respectively, and detecting productive pairs in 81% of cells (versus 77% for TraCeR). MiXCR also

outperformed TraCeR with 50-bp single- and paired-end and 100-bp single-end reads that were in silico-

generated from original 100-bp paired-end data.

References:

1. Macosko, E.Z. et al. Cell 161, 1202-1214 (2015).

2. Fan, H.C., Fu, G.K. & Fodor, S.P. Science 347, 1258367 (2015).

3. Klein, A.M. et al. Cell 161, 1187-1201 (2015).

4. Stubbington, M.J. et al. Nat Methods 13, 329-332 (2016).

Nature Biotechnology: doi:10.1038/nbt.3979

Page 17: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary Table 5. TCR repertoires extraction form sorted mice T cell RNA-seq.

Sample ID Approximate

number of sorted T cells

RNA amount used for

RNA-seq, ng

Total number of unique

paired-end sequencing

reads

Number of reads

containing functional (in frame) TCR alpha CDR3

Number of reads

containing functional TCR beta

CDR3

Number of functional TCR alpha

CDR3 reads per million

reads

Number of functional TCR beta

CDR3 reads per million

reads

Number of extracted functional TCR alpha clonotypes

Number of extracted functional TCR beta clonotypes

CNS_Teff_1 8900 2 35587184 2543 2666 71 75 720 703

CNS_Teff_2 4100 all (1.3) 33431485 2789 3164 83 95 864 728

CNS_Teff_3 24000 2 32804755 2547 2589 78 79 700 611

CNS_Teff_4 6900 2 34756703 2826 3016 81 87 997 936

CNS_Teff_5 4000 all (1.0) 30220582 2396 2750 79 91 695 658

CNS_Teff_6 4000 all (1.0) 27260264 2464 2710 90 99 639 591

CNS_Treg_1 2000 all (1.0) 30444580 1767 1895 58 62 642 617

CNS_Treg_2 1300 all (0.9) 46112231 4476 4376 97 95 666 582

CNS_Treg_3 2400 all (1.3) 27457454 1793 1801 65 66 793 736

CNS_Treg_4 1900 all (0.8) 33091302 2418 2803 73 85 819 787

CNS_Treg_5 500 all (0.3) 35420782 2839 2942 80 83 462 408

CNS_Treg_6 500 all (0.3) 30011308 2622 2730 87 91 395 367

SP_Teff_1 100000 10 51204555 3439 4117 67 80 1571 1707

SP_Teff_2 100000 10 50886232 3605 4445 71 87 1970 2250

SP_Teff_3 100000 all (3.1) 52677819 3659 4648 69 88 1506 1684

SP_Teff_4 100000 10 57358961 3789 5147 66 90 1781 2006

SP_Teff_5 100000 10 48550774 2916 4127 60 85 1762 2088

SP_Teff_6 100000 10 57360066 5175 7092 90 124 2603 3352

SP_Treg_1 100000 10 49361408 3187 3549 65 72 1582 1726

SP_Treg_2 100000 10 55310370 4360 5600 79 101 2397 2977

SP_Treg_3 100000 10 54781194 4966 6156 91 112 2038 2390

SP_Treg_4 100000 10 51283048 4194 5749 82 112 1972 2398

SP_Treg_5 100000 10 53589946 4694 5596 88 104 2066 2565

SP_Treg_6 100000 10 51560460 3639 4736 71 92 2289 2881

Nature Biotechnology: doi:10.1038/nbt.3979

Page 18: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Supplementary  Methods  

Back-­‐to-­‐back  RNA-­‐seq  versus  TCR-­‐seq  comparison  

Tumor   samples   SPX6730   and   SPX8151   were   collected   from   two   patients   with   advanced  melanoma,   following   signed   informed   consent   and   after   approval   of   the   medical   ethical  committee  at  NKI-­‐AVL.  Material  was  a   resection   taken   from   the   ileocecum.  Morphological  analysis  showed  that  SPX6730  was  an  ileocecal  lymph  node  metastasis,  while  SPX8151  was  a  small  intestine  resection.  

Total   RNA   was   extracted   using   TRIzol   reagent   (15596-­‐018,   Ambion   life   technologies)  according   to   the   manufacturer’s   protocol.   Each   tissue   sample   in   TRIzol   was   split   on   3  portions.  Extracted   total  RNA  was  purified  using   the  MinElute  Cleanup  Kit   (74204,  Qiagen)  according   to   the   manufacturer’s   instructions.   Quality   and   quantity   of   the   total   RNA   was  assessed   by   the   2100   Bioanalyzer   using   a  Nano   chip   (Agilent,   Santa   Clara,   CA).   Total   RNA  samples   having   RIN>8   were   subjected   to   library   generation.   For   the   RNA-­‐seq   analysis,  strand-­‐specific   libraries  were   generated   from   total   RNA   using   the   TruSeq   Stranded  mRNA  sample   preparation   kit   (Illumina   Inc.,   San   Diego,   RS-­‐122-­‐2101/2)   according   to   the  manufacturer's  instructions  (Illumina,  Part  #  15031047  Rev.  E),  in  two  replicas.  The  resulting  libraries  were  sequenced  with  100  bp  paired-­‐end  reads  on  a  HiSeq2500  using  V4  chemistry  (Illumina  Inc.,  San  Diego).  The  second  portion  of  total  RNA  of  each  sample  was  used  for  the  targeted  TRA  and  TRB  profiling  as  described20,  using  Illumina  MiSeq  paired  end  150+150  bp  sequencing.    

The  TCR-­‐seq  control  data  were  analysed  using  standard  MiXCR  analysis  pipeline  for  targeted  TCR  profiling  data19.  The  resulting  assembled  clonal  CDR3  sequences  for  TRA  and  TRB  chains  were  used  as  control  data.    

Performance  of  extraction  of  repertoire  data  from  RNA-­‐seq  heavily  depends  on  read  length  and  reads  topology  (paired-­‐end  or  single-­‐end).  Each  read  from  initial  RNA-­‐seq  samples  was  in  silico  cut  to  several  lengths  from  40  bp  to  100  bp  with  5  bp  increment  to  simulate  RNA-­‐seq  samples  with  shorter  reads.  This  way  based  on  each  initial  100+100  bp  RNA-­‐seq  sample  we  generated  26  samples  simulating  different  sequencing  setups  (40  bp,  45  bp,  …  95  bp,  100  bp,    single-­‐  /  paired-­‐end).  

We  compared  MiXCR  with  TRUST  software17.  The  prototype  of  TRUST  was  used  in  one  of  the  widest14  studies  on  repertoire  extraction  from  RNA-­‐seq  known  to  date.  In  the  case  of  MiXCR  we  have  directly   tested   the  equality  of  CDR3  nucleotide   sequences  between  RNA-­‐seq  and  TCR-­‐seq.   TRUST   does   not   assemble   reported   contigs   into   full   CDR3   clonal   sequences,   so  unique  CDR3s  reported  by  TRUST  were  used  for  comparison.  The  majority  of  CDR3s  reported  by   TRUST   are   truncated   in   a   nondeterministic   way   (dramatically   complicating   any   further  analysis),  so  strict  CDR3  equality-­‐based  matching  with  control  data  gives  almost  no  results.  Thus,   we   used   the   following   workaround   to   describe   the   performance   of   underlying   TCR  extraction   algorithms:   we   allowed   a   partial  match   between   CDR3s   from   control   data   and  TRUST   reported   CDR3s.   Namely   we   allowed   N   maximal   possibly   truncated   nucleotides.  TRUST   CDR3   is   considered   to   match   some   control   CDR3   if   it’s   nucleotide   sequence   is  contained  in  the  control  nt  sequence  (of  course  it  can  be  equal)  and  the  control  sequence  is  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 19: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

not  more  than  N  nucleotides   longer  than  the  TRUST  sequence.  For  example,   for  N  =  6   the  following  TRUST  sequences  are  considered  as  confirmed  by  the  control:  

Control CDR3: TGTGCCAGTAGTATAGAACATAGCAATCAGCCCCAGCATTTT  

TRUST matched CDR3: TGTGCCAGTAGTATAGAACATAGCAATCAGCCCCAGCATTTT  

TRUST matched CDR3: ...GCCAGTAGTATAGAACATAGCAATCAGCCCCAGCAT...  

TRUST matched CDR3: ....CCAGTAGTATAGAACATAGCAATCAGCCCCAGCATTT.  

TRUST matched CDR3: TGTGCCAGTAGTATAGAACATAGCAATCAGCCCCAGCATT..  

The  value  N  =  6  was  used  for  all  calculations  in  this  paper,  this  value  was  chosen  because  it  gave  biggest  intersection  with  control  data,  while  greater  values  showed  substantial  number  of  false-­‐matches  (matches  of  exclusively  germline-­‐derived  regions).  

The  primary  characteristic  of  repertoire  extraction  software  for  RNA-­‐seq  data  is  it’s  ability  to  extract  as  many  true  CDR3  sequences  as  possible  while  preserving  nearly  zero  false-­‐positive  CDR3  calls.  For  further  verification  of  the  efficiency  and  specificity  of  TCR  CDR3  repertoires  extraction  on  the  RNA-­‐seq  samples  with  TCR-­‐seq  control  we  introduced  the  following  metrics:  

1. Total  number  of  reported  RNA-­‐seq  CDR3  clonotypes  

2. Number  of  reported  RNA-­‐seq  CDR3s  that  are  also  present  in  TCR-­‐seq  control  Such  verified  CDR3’s  are  considered  as  definitely  true-­‐positive  ones.  

3. Number   of   reported   non-­‐canonical   RNA-­‐seq   CDR3s   that   are   absent   in   TCR-­‐seq  control.  CDR3s  that  found  in  RNA-­‐seq  sample  but  not  found  in  TCR-­‐seq  control  can  be  either  true-­‐positive  or  false-­‐positive.  It  is  natural  to  differentiate  such  clones  in  two  groups:  with  canonical  and  with  non-­‐canonical  amino  acid  sequences.  The  former  are  more  likely   true-­‐positive,   while   the   latter   are   more   likely   false-­‐positive   ones.   This  assumption  is  also  supported  by  the  fact  that  over  85%  of  clones  in  control  TCR-­‐seq  samples  have  canonical  CDR3  a.a.  sequences:    

Control  sample   TRA  clonotypes   TRB  clonotypes  %  non-­‐canonical  

TRA  %  non-­‐canonical  

TRB  

SPX6730   168,572   269,648   13.2%   3.3%  

SPX8151   16,427   30,451   14.7%   6.7%  

Canonical   CDR3   amino   acid   sequence   was   defined   as   matching   regex  ^C[^_]*(?:[FW]|[FW]G.G)$,  where  [FW]G.G  in  the  end  of  the  sequence  are  allowed  to  match  CDR3s  from  TRUST  output  with  displaced  right  CDR3  boundary  actually  being  canonical.  No  CDR3s  with  displaced  right  boundary  were  detected  in  MiXCR  output.  

 

TCR/Ig  repertoires  extraction  from  samples  without  control

Apart   from   the   data   generated   specifically   for   this   paper,  we   used   several   representative  samples   from   TCGA   (CDC   file   IDs:   9986414f-­‐26b7-­‐4c3b-­‐90cd-­‐feb59dc033a2   (old   id:    b6c7b112-­‐81cb-­‐4ae6-­‐b604-­‐1564eabc7aec),   0d674012-­‐2baa-­‐4b6e-­‐afe5-­‐735f19990a52  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 20: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

(old   id:   fed55b2d-­‐2c04-­‐4a5b-­‐b714-­‐38962c061f72),   92e1bb3e-­‐7f6d-­‐4128-­‐bb6f-­‐8cbbab1444a6  (old   id  3d44d6c3-­‐5dcf-­‐42ee-­‐8ebe-­‐dcdaad30493c))  and  SRA  (SRR1813898,  SRR1813883  и  SRR2314045),  which  underwent   in  silico  cut  procedure  from  40  bp  to  50  bp  with   5   bp   increment.   Additionally,   each   sample  was   analyzed   in   two  ways:   (i)   taking   both  mates  from  paired-­‐end  reads  (R1  and  R2  fastq  files;  paired-­‐end  analysis)  and  (ii)  taking  only  first  mate  (R1  fastq  file;    single-­‐end  analysis).  This  procedure  gave  us  final  set  of  140  samples  for  analysis  (including  SPX6730  and  SPX8141):  

1. SPX6730   and   SPX8141   both   have   two   replicas   (SPX6730-­‐1,   SPX6730-­‐2,   SPX8141-­‐1,  SPX8141-­‐2),   in  silico   truncated  from  40bp  to  100bp    with  5bp   increment   in  paired-­‐end  and  single-­‐end  modes  -­‐-­‐-­‐  104  samples  

2. 3  TCGA  samples,  in  silico  truncated  from  40bp  to  50bp  in  paired-­‐end  and  single-­‐end  modes  -­‐-­‐-­‐  18  samples  

3. 3  SRA  samples,   in  silico   truncated  from  40bp  to  50bp   in  paired-­‐end  and  single-­‐end  modes  -­‐-­‐-­‐  18  samples  

To  test  the  performance  of  the  software  on  samples  without  control  data  we  calculated  two  metrics:  

1. Total  number  of  reported  CDR3  clonotypes  2. Number  of  reported  CDR3s  with  non-­‐canonical  sequence  

In  the  case  of  MiXCR  and  TRUST  which  both  can  report  CDR3  amino  acid  translation    we  considered  canonical  amino  acid  sequence  as  matching  ^C[^_]*(?:[FW]|[FW]G.G)$,  where  [FW]G.G  in  the  end  of  the  sequence  are  allowed  to  match  CDR3s  from  TRUST  output  with  displaced  right  CDR3  boundary  actually  being  canonical  (no  CDR3s  with  displaced  right  boundary  detected  in  MiXCR  output).    

In  the  case  of  V’DJer  which  reports  only  CDR3  nucleotide  sequences  we  used  the  following  criteria  of  canonicity.  Canonical  CDR3  nucleotide  sequence  defined  as:     a)  matching  regex    ^TG[TC].*(?:TT[TC]|TGG)$    pattern     b)  not  matching  (?:...)*(?:TAA|TAG|TGA)(?:...)*  pattern     c)  sequence  length  is  multiple  of  3  

For   IGH,   IGK,   and   IGL   MiXCR   extracted   at   average   2   orders   of   magnitude   more   CDR3  clonotypes   than   V’DJer,   preserving   small   fraction   of   non-­‐canonical   clonotypes  (https://doi.org/10.6084/m9.figshare.4620739)  

 

MiXCR  aligner  optimization  for  RNA-­‐seq  data Since  there  is  no  prior  knowledge  on  true  TCR/Ig  sequences  in  real  data  and  because  some  sources  of  false-­‐positive  sequences  are  not  random  (see  section  Reproducible  false  positives  below),   implementation   of   fully   automated   procedure   for   optimization   of   alignment  algorithm   is   substantially   complicated.   To   find   the   best   aligner   parameters   for   analysis   of  RNA-­‐seq  data  we  performed  the  following  steps:  

1. Manual  analysis  of  results  obtained  on  real  RNA-­‐seq  samples  showed  that   in  order  to  find  scoring  values  that  better  discriminate  between  false-­‐aligned  non-­‐TCR/Ig  and  true-­‐aligned   TCR/Ig   sequences,     V-­‐gene   alignment  must   be   forcefully   extended   to  the  5’  end  of  the  sequence  and  J-­‐gene  alignment  must  be  forcefully  extended  to  the  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 21: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

3’  end  of  the  sequence  (even  if  it  lowers  total  score  of  the  alignment;  in  other  words  alignment  must  be  global  on  the  one  side   in  respect  to  read  sequence  while  being  local  on  another  side).  Built-­‐in  MiXCR  aligner  (KAligner)  can  be  setup  to  work  in  this  regime  by  setting  floatingLeftBound  or  floatingRightBound  parameters  to  false.  As  a  result,   two   modifications   to   the   default   parameters   were   made:   -­‐OvParameters.floatingLeftBound=false,   -­‐OjParameters.floatingRightBound=false.  Similar   parameter   is   also   present   in   software   like   STAR33   (-­‐-­‐alignEndsType  parameter),   but   there   it   is   not   possible   to   specify   different   values   for   different  genes.    

2. Multi-­‐target   optimization   using   NSGAII34   algorithm   implemented   in   MOEA  Framework   (http://moeaframework.org/   )   was   performed   on   all   numeric  parameters  of  V  and  J  aligners  as  well  as  minimal  scoring  thresholds  to  find  Pareto  optimal   set   of   parameters   in   terms   of   number   of   false-­‐alignments   calculated   on  RNA-­‐seq   samples   known   to   have   zero   TCR/Ig-­‐content   (while   actually   containing  trace   amounts   of   target   molecules   and   non-­‐canonical   transcripts   from   non-­‐rearranged   TCR/Ig   loci)   and   true-­‐positive   TCR/Ig   sequences   generated   using  procedure   and   software   reported   in   ref.   19.   From   this   set,   solution   having   best  performance  on  true  TCR/Ig  data  while  having  no  real-­‐false-­‐positive  alignments  on  control  samples  was  manually  selected.  

3. Further   analysis   of   results   obtained   on   real   RNA-­‐seq   samples   revealed   rare   false-­‐positive   alignments   for   several   non-­‐canonical   transcripts   from   TCR/Ig   loci.   Scoring  thresholds   for   V   and   J   alignments   as   well   as   threshold   on   the   total   V,D,J   and   C  alignments  were   adjusted   to   reliably   exclude   these   false-­‐positives.  Adjustment  did  not  substantially  change  overall  sensitivity  of  the  aligner.  

Resulting   parameter   values   can   be   found   in   “RNA-­‐seq”   section   of   parameter   preset   file  (https://github.com/milaboratory/mixcr/blob/develop/src/main/resources/parameters/vdjcaligner_parameters.json).  

Results   of   false-­‐alignment   rate   evaluation   during   our   initial   parameters   optimization   (not  shown  here)  and  results  provided  in  section  “Performance  of  MiXCR  and  TRUST  on  in  silico  data”  below,  had  not   reveal   any   false-­‐alignments  on  a  wide   range  of   input  dataset   types,  suggesting  zero  observed  false-­‐positive  rate  for  MiXCR.  

 

In  silico  generated  data  We  used   in   silico  modeled  RNA-­‐seq  data   to   estimate   several   characteristics   of  MiXCR   and  other   software   packages   under   consideration   in   maximally   controlled   setup.   VDJ  recombination  model  from  Ref.  1  was  used  to  generate  TRB  VDJ  recombinations.  Generator  is   available   as   easy   to   use   software  util   –   repseqio   (https://github.com/repseqio/repseqio,  doi:10.5281/zenodo.804326).   We   used   well-­‐known   ART   software   package35   to   model  Illumina  reads  from  generated  TRB  sequences.  To  introduce  negative  control  sequences  into  our   data   we   modeled   reads   from   the   collection   of   Gencode   Human   transcripts  (ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.transcripts.fa.gz,  Ref.  36).  The  following  procedures  were  used  in  each  case:  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 22: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

To  simulate  RNA-­‐seq  reads  from  rearranged  TRB  genes,  fasta  file  containing  first  three  exons  of  mature  TRB  gene  transcript  sequences  were  generated  using  repseqio  tool  according  to  model   from   ref.   1.   Transcripts   were   repeated   several   times   to   simulate   power-­‐law  distribution  of  clone  abundances  (Pareto  distribution  with  parameters  Xm  =  0.9,  alpha  =  1.3).  art_illumina  tool  was  used  to  simulate  RNA-­‐seq  Illumina  reads  from  transcripts  generated  on  the  previous  step.  

Negative  control  (non-­‐TCR)  RNA-­‐seq  reads  were  simulated  from  human  transcript  sequence  collection  downloaded  from  Gencode  using  art_illumina  tool.  

Results   from   this   chapter   can   be   easily   reproduced   using   publically   available   Docker  (https://www.docker.com/community-­‐edition)   image   (see   section   Data,   software   and  source  code  below).  

Performance  of  MiXCR  and  TRUST  on  in  silico  data

Using   the  procedure  described   above,  we   generated  RNA-­‐seq   reads   for   10^3   in   silico   TRB  clones   and   randomly   picked   10^3   reads   (results   are   invariant   to   the   number   of   clones   in  initial   dataset   due   to   the   scale   invariance   of   Pareto   distribution   used   to  model   clonotype  abundances).  

For  negative  control  we  generated  10^8  non-­‐TCR  RNA-­‐seq  reads.  Finally,  we  have  prepared  two  datasets  for  analysis  by  MiXCR  and  TRUST:  

1. 10^8  non-­‐TCR  reads  (negative  control  sample)  2. Dataset   obtained   by   mixing   10^8   of   non-­‐TCR   reads   with   10^3   TRB   reads  

(approximately  reproducing  average  RNA-­‐seq  TCR  content)  

The   analysis   was   repeated   for   50-­‐bp,   75-­‐bp   and   100-­‐bp   read   lengths.   Each   analysis   was  performed  for  both  paired-­‐end  and  single-­‐end  (using  only  R1)  modes.  

STAR  aligner33  was  used  to  produce  BAM  files  for  TRUST.  For  the  script  which  runs  the  whole  analysis   (from   in  silico  data  generation  to  runnig  of  MiXCR  and  TRUST)  as  well  as  a  Docker  container  see  section  Data,  software  and  source  code  below).  

CDR3s   reported   by   MiXCR   and   TRUST   were   compared   against   the   originally   in   silico  generated  clones  allowing  up  to  3  mismatches  or  indels  (mutations  are  allowed  to  account  for  possible  mutations  in  CDR3s  introduced  by  art_illumina).  

Supplementary   Fig.   3   shows  the  results  of  MiXCR  and  TRUST  performance  comparison   for  datasets   described   above.   MiXCR   shows   consistently   higher   extraction   efficiency.   On   a  paired-­‐end  data  MiXCR  showed  order  of  magnitude  higher  efficiency.  Most  CDR3  confirmed  sequences  for  TRUST  have  three  mismatches  or  indels  differences  from  true  CDR3  sequence,  while   nearly   all   MiXCR   clonotypes   are   supported   by   the   exact   match.   The   major   part   of  CDR3s  reported  by  TRUST  were  not  found  in  original  set  of  synthetic  clones,  even  with  three  mutations  allowed.  Moreover,  TRUST  have  reported  CDR3  records  even  for  negative  control  data,  supporting  observation  that  it  is  prone  to  extract  false  CDR3  sequences  from  non-­‐TCR  targets  (see  section  Reproducible  false  positives  below).    

 

Nature Biotechnology: doi:10.1038/nbt.3979

Page 23: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Estimation  of  rates  for  false-­‐overlaps  and  false-­‐extensions  on  in  silico  generated  data In   silico   generated   data   was   used   to   estimate   potential   false-­‐diversity   that   may   be  introduced  during  partial  alignments  assembly  (assemblePartial)  and  incomplete  alignments  extension  for  TCR  reads  (extendAlignments)  steps  on  different  types  of  input  data.  Individual  overlaps  and  extensions  performed  by  the  software  were  analysed  to  calculate  the  values.  

Datasets  consisting  entirely  of  TRB  reads  was  generated  using  procedure  described  above.  We  tested  all  possible  combinations  of  the  following  dataset  parameters:  

● clonotype  count:  10^2,  10^3,  10^4  ● number  of  reads:  10^3,  10^4,  10^5  ● read  length:  50,  75,  100  ● art_illumina  sequencing  errors  profile:  HS20,  HS25,  NS50  

Overlap  events  where  emerged  CDR3  was  different  from  any  of  true  known  CDR3s  of  reads  participated   in   overlap   were   defined   as   false-­‐overlaps.   Because   further   analysis   performs  quality-­‐aware  error-­‐correction,  and  does  not  form  new  clones  from  CDR3  sequences  having  quality   score   less   than   20   in   some   position,   we   also   calculated   number   of   false-­‐overlaps  having  enough  quality  scores  in  CDR3  region  to  form  new  clonotypes  during  further  pipeline  steps.  

For  datasets  with  number  of  modelled  clonotypes  10^2  and  10^3,  which   represent   typical  RNA-­‐seq  datasets,  false-­‐overlap  rate  was  less  than  0.04%  and  potentially  introduced  artificial  diversity  was   no  more   than   0.006%   across   all   combinations   of   other   parameters,  while   in  most  cases  we  observed  zero  values  for  both  characteristics.  Worst  observed  value  for  false-­‐overlap   rate   across   all   parameters   including   untypically   diverse   samples  with   10^4   clones  was  0.25%  and  maximal  potentially  introduced  artificial  diversity  was  less  than  0.09%.  

The  worst  overall  artificial  diversity  introduced  by  false-­‐overlaps  was  nearly  two  orders  of  magnitude  less  than  typical  artificial  diversity  introduced  by  sequencing  and  PCR  errors,  making  it’s  effect  negligible  for  further  analysis.  

The   same   datasets   were   used   for   estimation   of   false-­‐extension   rate.   Analysis   of   each  individual   extension   event  was   performed.   Extension  was   defined   as   false   if   subsequence  introduced  by  the  procedure  was  different  from  corresponding  subsequence  from  originally  generated  CDR3  sequence  for  current  alignment.  Total  195461  extensions  were  performed  by  the  software,  and  only  two  of  them  were  inconsistent  with  the  sequences  expected  from  original  CDR3.  This  gives  approximately  one  in  10^5  false  extension  rate.  

Another  potential  source  of  false-­‐extensions  is  allelic  variants  of  certain  V/J  segments,  which  are  not  present  in  the  reference  library.  Unfortunately  extent  to  which  this  phenomenon  can  affect   performance   of   the   procedure   is   very   hard   to   estimate.   Still,   there   is   a   possible  solution  for  the  issue,  consisting  of  learning  of  allelic  variants  right  from  the  target  sample.  This  functionality  is  planned  for  implementation  in  the  described  software.  

 

Calculation  of  potential  repertoire  information  content  for  RNA-­‐seq  datasets Because  MiXCR  RNA-­‐seq  mode  showed  best  CDR3  extraction  efficiency  from  RNA-­‐seq  data  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 24: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

among   software   packages   for   this   task   known   to   date,   we   used   it   to   estimate   current  technical  limit  for  this  procedure.  

Analysis  was  performed  on  (i)  melanoma  samples  SPX6730  and  SPX8151  truncated   in  silico  to   100,   75   and   50   bp,   (ii)   all   samples   for   the   functional   characterization   of   TRA   and   TRB  repertoires  based  on  sorted  T-­‐cells  RNA-­‐seq,  (iii)  all  the  samples  from  ref.  18,  and  (iv)  three  samples   from   TCGA   used   in   our   comparison   with   TRUST.   All   samples   were   analysed   by  MiXCR   in   both   paired-­‐   (PE)   and   single-­‐end   (SE)   variants   (by   utilizing   only   R1   fastq   file).    Following  characteristics  were  calculated:  

-­‐ sum  of  TRBC1  and  TRBC2  expression  levels  in  terms  of  reads  count  (calculated  using  Kallisto  software  package37).  This  characteristic  was  chosen  instead  of  transcripts  per  million  (TMP)    because  it  is  proportional  to  the  total  number  of  reads  in  the  sample,  which   is   also   the   case   for   the   number   of   CDR3   calls,   removing   the   need   for   any  additional  normalization.  

-­‐ number  of  MiXCR  TRB  CDR3  calls  that  left  after  all  processing  steps  and  participated  in  clonotype  formation.  

PE   and   SE   samples   with   the   same   read   length   showed   good   correlation   between   two  described   characteristics.   Supplementary   Table   2   shows   relation   between   TRBC   coverage  and  CDR3  sequence  yield  for  different  sequencing  setups.    

 

 

Reproducible  false  positives  

One  of  the  main  sources  of  false-­‐positive  alignment  calls  are  fragments  of  mRNA  molecules  homologous   to   antibody   or   TCR   hypervariable   regions.   Such   false-­‐positives   are   especially  dangerous,   because   they   could   be   reproduced   in   different   samples,   thereby   providing  misleading   evidence   about   connection   between   TCR/Ig   repertoires.   In   our   experience,  substantial   part   of   such   sequences   came   from   unprocessed   or   wrongly   spliced   TCR/Ig   or  non-­‐TCR/Ig  mRNA  molecules.  Due  to  this  fact,  given  that  specificity  of  extraction  software  is  low,   absence  or   existence  of   certain   false   sequences   in   the  output  may  e.g.   be   correlated  with   efficiency   of   nonsense-­‐mediated   mRNA   decay,   which   in   turn   may   lead   to   false  conclusion   on   correlation   between   certain   physiological   conditions   and   TCR/IG   repertoire  composition.  Many   other   similar   sources   of   reproducible   false-­‐positives  may   also   exist.   In  view   of   this   we   believe   that   high   specificity   of   RNA-­‐seq   repertoire   extraction   software   is  crucial  for  obtaining  reliable  biological  results.  

In   the  case  of  TRUST  a  substantial  part  of   reported  contigs  can  be  directly  mapped  to  raw  genome  (i.e.  clearly  are  false-­‐positives)  which  shows  low  software  specificity.  

We   have   found   TRUST   assembled   contigs   for   most   of   analysed   datasets   annotated,   with  corresponding  false-­‐positive  CDR3,  which  entirely  maps  onto  certain  genomic  regions  in  TRB  and  TRA  loci.  Such  sequences  presumably  came  from  immature  mRNA  molecules,  not  fully  processed  by  nonsense  mediated  decay  and  splicing.  This  observation  suggest  low  software  specificity  and  potential  risk  of  observing  false  intersections  between  samples  due  to  other  than  immunological  physiological  reasons.  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 25: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Example  of  TRUST  false-­‐positive  fasta  record  that  was  reproduced  in  two  different  patients  (samples  RNASeq_SPX6730-­‐2  and  RNASeq_SPX8151-­‐1,  50  bp,  paired-­‐end)  and  fully  maps  to  genomic   intron   region  near   TRBC1   gene,  which   shows   that   low   specificity   of   the   software  may  lead  to  false  matches  between  repertoires:  >RNASeq_SPX8151-­‐1_cut50_PAIREDEND.bam+est_clonal_freq=0.0243902439024+seq_length=50+est_lib_size=11717+TRBV5-­‐4*04_TRBV4-­‐3*01_TRBV14*01_TRBV5-­‐5*02_TRBV4-­‐3*03_TRBV2*03_TRBV13*02_TRBV7-­‐8*01_TRBV19*02_TRBV7-­‐2*04_TRBV7-­‐2*01_TRBV5-­‐8*01_TRBV10-­‐2*01_TRBV6-­‐6*01_TRBV7-­‐9*02_TRBV19*03_TRBV4-­‐2*01_TRBV5-­‐5*03_TRBV5-­‐8*02_TRBV6-­‐4*01_TRBV7-­‐9*05_TRBV6-­‐6*02_TRBV7-­‐2*02_TRBV10-­‐1*02_TRBV11-­‐3*02_TRBV6-­‐5*01_TRBV6-­‐6*03_TRBV7-­‐9*04_TRBV11-­‐2*01_TRBV11-­‐2*03_TRBV7-­‐8*03_TRBV11-­‐3*01_TRBV6-­‐1*01_TRBV7-­‐4*01_TRBV7-­‐7*02_TRBV16*03_TRBV4-­‐1*02_TRBV16*01_TRBV3-­‐1*01_TRBV9*01_TRBV27*01_TRBV7-­‐9*07_TRBV25-­‐1*01_TRBV14*02_TRBV5-­‐4*01_TRBV9*03_TRBV12-­‐5*01_TRBV7-­‐3*01_TRBV5-­‐4*03_TRBV5-­‐4*02_TRBV10-­‐1*01_TRBV4-­‐3*02_TRBV6-­‐9*01_TRBV12-­‐4*02_TRBV12-­‐4*01_TRBV4-­‐2*02_TRBV7-­‐7*01_TRBV5-­‐6*01_TRBV6-­‐8*01_TRBV7-­‐8*02_TRBV6-­‐3*01_TRBV12-­‐3*01_TRBV7-­‐6*01_TRBV2*02_TRBV4-­‐3*04_TRBV7-­‐6*02_TRBV7-­‐3*05_TRBV9*02_TRBV6-­‐6*05_TRBV11-­‐3*03_TRBV11-­‐1*01_TRBV5-­‐5*01_TRBV5-­‐1*01_TRBV6-­‐6*04_TRBV3-­‐1*02_TRBV2*01_TRBV4-­‐1*01_TRBV7-­‐9*06_TRBV6-­‐4*02_TRBV7-­‐9*03_TRBV5-­‐1*02_TRBV11-­‐2*02_TRBV28*01_TRBV6-­‐2*01_TRBV7-­‐9*01_TRBV7-­‐3*04_TRBV13*01_TRBV19*01_TRBV18*01++TRBC2|chr7:142498725-­‐142499111+CASLRPIPRALSH+minus_log_Eval=1.90138771133+TGTGCTTCATTACGGCCCATTCCCAGGGCTCTCTCTCACAC    GGACATAGTTGTGCTTCATTACGGCCCATTCCCAGGGCTCTCTCTCACAC  

Another  example  of  TRUST  false-­‐positive  fasta  record  that  was  reproduced  in  two  different  samples   and   fully   maps   to   genomic   intron   region,   which   proofs   that   situation   described  above  takes  place  for  the  TRUST  software:  

Sample#1:  

SRR2314045,  paired-­‐end,  45bp  

Sample#2:  

fed55b2d-­‐2c04-­‐4a5b-­‐b714-­‐38962c061f72,  paired-­‐end,  45bp  

Record#1:  >SRR2314045_cut45_PAIREDEND.bam+est_clonal_freq=0.0238095238095+seq_length=45+est_lib_size=6832+TRAV25*01_TRAV35*01_TRAV27*01++TRBC2|chr7:142498725-­‐142499111+CAGQRPCVDGHGQ+minus_log_Eval=3.0+TGTGCTGGTCAGCGCCCTTGTGTTGATGGCCATGGTCAAGAG  CTCTTGACCATGGCCATCAACACAAGGGCGCTGACCAGCACAGCA  

Record#2:  >fed55b2d-­‐2c04-­‐4a5b-­‐b714-­‐38962c061f72_cut45_PAIREDEND.bam+est_clonal_freq=0.0238095238095+seq_length=45+est_lib_size=53346+TRAV25*01_TRAV35*01_TRAV27*01++TRBC2|chr7:142498725-­‐

Nature Biotechnology: doi:10.1038/nbt.3979

Page 26: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

142499111+CAGQRPCVDGHGQ+minus_log_Eval=3.0+TGTGCTGGTCAGCGCCCTTGTGTTGATGGCCATGGTCAAGAG  CTCTTGACCATGGCCATCAACACAAGGGCGCTGACCAGCACAGCA  

Several   other   examples   of   such   false-­‐intersection   between   samples   were   found   in   TRUST  results.  

 

Analysis  of  Ig  isotype  proportions    

We   run   Kallisto37   with   default   parameters   and   calculated   estimated   counts   for   all   the  transcripts   of   human   transcriptome   GRCh37.75  (http://www.ensembl.org/info/website/tutorials/grch37.html).   Gene   expressions   were  calculated   as   sum  of   estimated   counts   over   its   transcripts.   Coverages   of   IgA   isotype  were  taken   from   IGHA1   and   IGHA2   genes.  We   calculated   relative   coverages   dividing   estimated  count   of   each   isotype   by   sum   of   estimated   counts   over   all   the   isoforms.   Pseudogenes  IGHEP1,  IGHEP2,  IGHGP  and  IGHMBP2  demonstrated  relatively  low  expression  and  were  not  taken  into  account.  

 

RNA-­‐seq  from  the  sorted  murine  T-­‐cell  functional  subsets  

Six  male  Foxp3yfpcre  mice27   (B6.129(Cg)-­‐Foxp3   tm4(YFP/icre)Ayr/J)  mice   that  were  10  weeks  of  age  were  used  for  EAE  induction.  Treatment  of  mice  was  performed  according  to  protocol  08-­‐10-­‐023  approved  by  the  Memorial  Sloan  Kettering  Institute  Institutional  Animal  Care  and  Use  Committee.  All  mice  were  maintained  in  the  Memorial  Sloan  Kettering  Institute  animal  facility  under   specific  pathogen-­‐free   conditions   in  accordance  with   institutional   guidelines.  Mice   were   immunized   s.c.   with   100   ug   MOG35-­‐55   peptide   (GenScript)   emulsified   in   CFA  (Sigma)   and   injected   i.p.  with   200  ng  pertussis   toxin   (EMD  chemicals)   on  day  0   and  day  2  post   immunization.  Spleen  and  pooled  brain  and  spinal  chord  (CNS)  were  harvested  at  day  16  (peak  of  disease).  CNS  tissue  was  minced  and  digested  for  45  minutes  in  the  presence  of  Collagenase   A   and   DNaseI   (Sigma)   and   lymphocytes   were   enriched   on   a   Percoll   gradient  (30%/37%/70%)(GE  healthcare).   CD4+  T-­‐cells  were   isolated  with   the  Dynabeads   FlowComp  Mouse   CD4   kit   (Life   Technologies).   Splenic   and   CNS   derived   effector   T-­‐cells   (Teff:  TCRb+CD4+Foxp3-­‐(YFP-­‐)CD62L-­‐CD44+)   or   effector   regulatory   T-­‐cells   (Treg:  TCRb+CD4+Foxp3+(YFP+)CD62L-­‐CD44+)  were  sorted  into  Trizol  LS  (Life  Technologies)  using  a  BD  FACS-­‐AriaII   cell   sorter.   For   splenic  Treg  and  Teff   subsets,  we  sorted  1x105  cells,   for   the  CNS  we  sorted  4,000-­‐24,000  Teff  cells  and  500-­‐2,400  Treg  cells  per  sample  (Supplementary  Table   5).   RNA   was   extracted,   amplified   using   SMARTer   technology   and   libraries   were  prepared  using  Illumina  TruSeq  paired-­‐end  library  preparation.  Samples  were  sequenced  on  the  Illumina  HiSeq  2500  platform  to  an  average  depth  of  25  x  106  50  bp  paired-­‐end  reads  per  sample.   PCR   duplicates   were   removed   from   fastq   files   using   clumpify   utility   from   BBMap  (https://sourceforge.net/projects/bbmap/),   in   order   to   count   unique   RNA   fragments.  Comparative  post-­‐analysis  was  performed  using  VDJtools  software28.  

 

Extraction  of  TCR  and  Ig  repertoires  from  TCGA  SKCM  data    

Nature Biotechnology: doi:10.1038/nbt.3979

Page 27: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

PCR   duplicates   were   removed   using   FastUniq   software38.   Extraction   was   performed   by  MiXCR  as  described  in  the  next  section.  

 

Running  software  

On  each   sample  a   standard  analysis  pipeline   for  MiXCR,  TRUST  and  V’DJer  was  performed  (since  V’DJer  works  only  with  paired-­‐end  data,  it  was  executed  only  on  paired-­‐end  samples).  The  results  of  each  run  were  analyzed  for  all  available  immunological  chains  (all  TCR  and  Ig  for  MiXCR,  TCR  only  for  TRUST  and  Ig  only  for  V’DJer).  The  following  sections  describe  exact  commands  executed  for  each  software:  

 

MiXCR    

Standard  alignment  procedure   for  RNA-­‐seq  data  onto  all   IG/TCR   loci  preserving   reads   that  not  fully  cover  CDR3  (-­‐OallowPartialAlignments=true).  

Single-­‐end  analysis:  

>   mixcr   align   -­‐p   RNA-­‐seq   -­‐OallowPartialAlignments=true   sample_R1.fastq.gz  alignments.vdjca  

Paired-­‐end  analysis:  

>  mixcr  align  -­‐p  RNA-­‐seq  -­‐OallowPartialAlignments=true  sample_R1.fastq.gz  sample_R2.fastq.gz  alignments.vdjca  

 

The  following  commands  were  the  same  for  single-­‐  and  paired-­‐  end  analysis.    

Performing  two  rounds  of  contig  assembly  from  alignments  not  covering  full  CDR3  sequence  

>  mixcr  assemblePartial  -­‐p  alignments.vdjca  alignments_rescued_1.vdjca  

>  mixcr  assemblePartial  -­‐p  alignments_rescued_1.vdjca  alignments_rescued_2.vdjca    

Extend  V/J   junctions  of  TCRs  with  not   fully   covered  sequence  of  CDR3  and  unambiguously  assigned  with  V  and  J  genes,  up  to  complete  CDR3  sequence:  

>   mixcr   extendAlignments   alignments_rescued_2.vdjca  alignments_rescued_2_extended.vdjca  

 

Assemble  clonotypes  for  the  final  vdjca  files:  

>  mixcr  assemble  -­‐ObadQualityThreshold=0  alignments_rescued_2_extended.vdjca  alignments_rescued_2_extended.clns  

 

Exporting  clonotypes  

>  mixcr  exportClones  -­‐c  ${CHAIN}  \  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 28: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

   alignments_rescued_2_extended.clns  \  

   alignments_rescued_2_extended.clns.${CHAIN}.txt  

Where  ${CHAIN}  should  be  replaced  by  a  particular  immunological  chain  (TRA/TRB/TRG/TRD  for  specific  TCR  chain,  TCR  for  all  TCR  chains,  IGH/IGK/IGL  for  specific  IG  chain,  IG  for  all   IG  chains,   ALL   for   all   TCR   and   IG   chains),   or   can   be   skipped   in   order   to   export   clones   for   all  possible  immunological  chains  (both  TCR  and  IG).  

 

STAR

Running  STAR  on  input  samples:  

>  ./STAR  -­‐-­‐readFilesCommand  zcat  -­‐-­‐runThreadN  8  -­‐-­‐genomeDir  /path/to/star/hg38  -­‐-­‐outSAMtype  BAM  SortedByCoordinate  -­‐-­‐outSAMunmapped  Within  -­‐-­‐outStd  BAM_SortedByCoordinate  -­‐-­‐readFilesIn  sample_R1.fastq.gz  [sample_R2.fastq.gz]  >  sample.sort.bam  

 

Preparing  BAM  for  analysis:  

>  samtools  index  sample.sort.bam  

 

TopHat

Running  TopHat  on  input  samples  

>  tophat  -­‐p  8  /path/to/hg19  sample_R1.fastq.gz  [sample_R2.fastq.gz]  

Adding  all  unmapped  reads  to  final  bam  file  

>  samtools  merge  sample.unsorted.bam  accepted_hits.bam  unmapped.bam  

Prepare  bam  file  for  analysis  

>  samtools  sort  -­‐-­‐threads  8  -­‐o  sample.sort.bam  sample.unsorted.bam  

>  samtools  index  sample.sort.bam    

TRUST  

TopHat  aligner  was  used  to  produce  bam  files  in  all  tests  except  tests  with  in  silico  data  (described  in  section  ”Performance  of  MiXCR  and  TRUST  on  in  silico  data”)  where  STAR  aligner  was  used.  

Running  TRUST  

>  python  TRUST.py  -­‐f  sample.sort.bam  -­‐a  -­‐H    

V’DJer

STAR  aligner  was  applied  as  described  above  to  produce  bam  files  for  V’DJer.  

Nature Biotechnology: doi:10.1038/nbt.3979

Page 29: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

Running  V’DJer:    

>  ./vdjer  -­‐-­‐in  sample.sort.bam  -­‐-­‐t  8  -­‐-­‐ins  175  -­‐-­‐chain  ${CHAIN}  -­‐-­‐ref-­‐dir  vdjer_human_references/$(echo  ${CHAIN}  |  tr  '[:upper:]'  '[:lower:]')"  >  sample.sam  2>  sample.log  

In  the  last  command  ${CHAIN}  should  be  replaced  with  one  of  IGH,  IGK,  IGL  (capital  letters).  V’DJer  requires  three  separate  runs  to  analyze  all  IG  chains.  Since  the  execution  time  of  V’DJer  on  some  samples  is  extremely  large  (for  some  samples  V’DJer  was  running  for  a  week  in  8  threads  occupying  50Gb  of  RAM  and  still  did  not  succeed)  we  constrained  V’DJer  execution  time  by  11  hours  (running  in  8  computing  threads).  V’DJer  analysis  which  took  more  than  11  hours  we  considered  as  failed  due  to  timeout.    

In  order  to  build  STAR  reference  for  V’DJer  we  used  recent  hg38  from  http://www.gencodegenes.org  and  the  following  command:  

>  ./STAR  -­‐-­‐runMode  genomeGenerate  -­‐-­‐runThreadN  52  -­‐-­‐genomeDir  /path/to/star/hg38/  -­‐-­‐genomeFastaFiles  GRCh38.p7.genome.fa  -­‐-­‐sjdbGTFfile  gencode.v25.chr_patch_hapl_scaff.annotation.gtf  

The  software  versions  were  as  follows:  

MiXCR:  2.1.3  (doi:10.5281/zenodo.804046)  

Samtools:  1.3.1  

TopHat:  2.1.1  

TRUST17  

STAR:  2.4.2a  

V’DJer:   0.12  

All   execution   commands   and   software   versions   were   approved   by   the   authors   of  corresponding  software  (we  have  contacted  to  both  TRUST  and  V’DJer  developers).  

 

Statistical  tests  used  

On   the   Kaplan-­‐Meier   plots   showing   survival   probability   relative   to   IgG1/IGH   and   IgA/IGH  proportions   (Fig.   2a-­‐d),    p   values   reflect   log-­‐rank   test   for   survival   difference   between   low  and   high   metrics   groups,   divided   based   on   the   median   for   the   relevant   metric.   For   the  comparison  of  TCR  CDR3  repertoires  characteristics  (Fig.  2f-­‐h),  paired  two-­‐tailed  t-­‐test  was  used.  

 

Data  availability  and  accession  code  availability  

Raw   sequencing   data   generated   for   the   paper   is   deposited   in   SRA   under   BioProject   id  PRJNA371303.   Analysis   results   using   MiXCR,   TRUST   and   V’DJer   software   are   available   at  https://doi.org/10.6084/m9.figshare.4620739.  

Software  and  source  code  described  in  this  paper:

Nature Biotechnology: doi:10.1038/nbt.3979

Page 30: Nature Biotechnology: doi:10.1038/nbt · Kaplan–Meier plots depicting the survival probability over time for high (>=cutoff) and low (

• MiXCR:  2.1.3  (doi:10.5281/zenodo.804046;  https://github.com/milaboratory/mixcr/releases/tag/v2.1.3)  

• RepSeq.IO  util,  version  1.2.8,  including  artificial  clones  generator  (doi:  10.5281/zenodo.804326;  https://github.com/repseqio/repseqio/releases/tag/v1.2.8)  

• Docker  image  and  scripts  to  reproduce  in-­‐silico  experiments  (published  in  DockerHub  uder  id:  milaboratory/RNA-­‐seq-­‐paper:v1.0;  source  code  https://github.com/milaboratory/mixcr-­‐RNA-­‐seq-­‐paper/releases/tag/v1.0).  See  README.md  from  source  code  package  for  usage  instruction.  

Other  software  used  in  the  paper: • Samtools  version  1.3.1  (https://github.com/samtools/samtools/releases/tag/1.3.1)  • TopHat  version  2.1.1  (https://ccb.jhu.edu/software/tophat/downloads/tophat-­‐

2.1.1.Linux_x86_64.tar.gz  )  • TRUST,  version  published  with17  • STAR  version  2.5.3a  (https://github.com/alexdobin/STAR/releases/tag/2.5.3a)  • V’DJer  version  0.12  (https://github.com/mozack/vdjer/releases/tag/v0.12)  

• Kalisto  version  0.43.1  (https://github.com/pachterlab/kallisto/releases/tag/v0.43.1)  

 

References  

33.   Dobin,  A.  et  al.  Bioinformatics  29,  15-­‐21  (2013).  

34.   Deb,  K.,  Anand,  A.  &  Joshi,  D.  Evolutionary  computation  10,  371-­‐395  (2002).  

35.   Huang,  W.,  Li,  L.,  Myers,  J.R.  &  Marth,  G.T.  Bioinformatics  28,  593-­‐594  (2012).  

36.   Harrow,  J.  et  al.  Genome  Res  22,  1760-­‐1774  (2012).  

37.   Bray,  N.L.,  Pimentel,  H.,  Melsted,  P.  &  Pachter,  L.  Nat  Biotechnol  34,  525-­‐527  (2016).  

38.   Xu,  H.  et  al.  PLoS  One  7,  e  

Nature Biotechnology: doi:10.1038/nbt.3979