More than 1,100 cancer treatments are currently in clinical testing. Of these, 295 are immune-oncology medicines and vaccines. For Checkpoint Immunotherapy alone, 52,539 patients were enrolled in 2017. Of these treatments, 108 are for breast cancer, the leading cancer diagnosed in women. Although checkpoint immunotherapy has yet to be proven beneficial in most types of breast cancer, current observations of which specific T cell go into the tumor, and how these cells are connected to the response to treatment, have lead to a new understating of the role of the immune system. Yet, and in spite of this intense research into the role of T cells in tumors, very little is known about how clones of T cells behave during tumor formation. The paper we present here studies this clonal behavior over time. We study it during mammary tumor development in mice. We see that a set of T cell clones, which comes from a diversity of different nucleotide sequences, stands out in its behavior compared with other T cell clones. Importantly, we show that these mouse T cell clones have a strong connection to human breast cancer. The specific identification of a subset of T cell clones, together with the behavior of these clones over time, is critical to our understanding of the effect of modifications to the immune system, such as the modifications made by immunotherapy treatments, and to our understanding of how to better control the behavior of the immune system.
The CDR3 region of the TCR binds the peptide epitopes presented by MHC molecules to antigen-specific T cells; the CDR3 region thus directs T-cell antigen specificity . Despite the astronomical numbers of possible CDR3 nucleotide (NT) recombinations , more than 10^18 different TCR sequences, healthy mice and humans manifest shared, Public TCR CDR3 amino acid (AA) sequences that seem to be enriched for associations with self-reactivity, allo-reactivity and tumors [3,4]. These Public TCR repertoires seem to emerge as a consequence of selection and are rich in convergent recombination–different (NT) recombinations encode identical (AA) CDR3 sequences. Notably, prevalent CDR3 regions appear across species in mice and humans .
The present study was done to test whether the spontaneous development of a tissue-specific tumor might be associated with detectible Public TCR CDR3 sequences; such tumor-associated sequences could function in T-cells that might mediate tumor immunity or, alternatively, might protect tumors by immune suppression . The first question, however, was whether a developing tumor, in the absence of premeditated immunization, might induce or enhance the expression of tumor-associated CDR3 TCR sequences. The use of multiple sequences across individual animals, different phenotypes groups and across species, led us to use the term “Public” in the context of this manuscript, to indicate CDR3 sequences that appear in two animals, or in two subjects. Since mice and humans have been found to share prevalent Public TCR sequences, we carried out this study in two stages: In the first stage we carried out a longitudinal study of TCR repertoires in mice undergoing the development of spontaneous, transgene-engineered breast cancer; the results uncovered multiple CDR3 regions sequences that were Public and Exclusive to mice developing tumors; these sequences were undetectable in mice lacking the breast-cancer transgene. In the second stage, we developed a metric to learn whether women with breast cancer manifested specific TCR sequences also present in mice developing breast cancer.
We sequenced CDR3 repertoires in the peripheral blood of 10 FVB/N-Tg(MMTVneu), a mouse model of HER2 human breast cancer mice, and from 5 FVB/NJ Control mice (Fig 1). We will use the term "Transgenic" for FVB/N-Tg(MMTVneu) 202 Mul/J and "Control" for the FVB/NJ strain. The group of Transgenic mice developed tumors (see Fig 1). In the text, we refer to these mice as tumor-developing or, alternatively, as Transgenic mice, according to context. We focused on the largest recoverable subset of T cells, those bearing the markers CD4+CD62LhiCD44lo , which comprises 65% to 85% of the T cells in a blood samples. These T cells are of the naïve compartment. While some literature discusses subsets of this compartment, which include various sub populations, including some that have been referred to as atypical naïve–like memory/effector cells , the vast majority of these cells are naïve. Some recent works show that the unique TCR reactivity to immunogenic pathogen-derived epitopes is present already in the naíve repertoire prior to immune expansion in B6 mice  and that Naïve T cells can manifest shifting to Teffs, which can effectively kill tumor cells . Interestingly, recent evidence has shown that CD4+ naive T cells significantly more heterogeneous than was previously thought, and that these sub populations of T cells are diverse in phenotypes, differentiation stages, persistence, functions, and anatomic localizations .
To minimize the possible bias that accompanies the use of multiple groups and diverse sample sizes, we combined samples from all the time-points of a mouse and subsampled equal numbers of sequences in 5 mice each of the test and Control sets: 843,326 CDR3 β chain sequences and 116,751 α chain sequences (see Methods). We first filtered for Public AA sequences, which are CDR3 AA sequences appearing in at least two different mice. Public CDR3 types are termed Inclusive if they appear in both Control mice and tumor-bearing Transgenic mice (FVB/N-Tg(MMTVneu) 202 Mul/J). Exclusive CDR3 AA sequences are shared only by mice bearing tumors or only by tumor-free Control mice. We found that Exclusive Public CDR3 types in both the α and β chains are more prevalent in the tumor-developing group than in the Control group (α chain: 14% and 5% respectively; Figs 2A, S1A and S1B; t-test: p <7.4E-08; S1 Table; β chain: 8% and 5% respectively; S1C–S1F Fig; t-test: p < 0.0097; S2 Table). In other words, tumor-associated Public sequences mark the tumor-developing mice. Interestingly, Public CDR3 types occupy between 35–40% of the sequences in all the mice. This level of sharing is consistent with the sharing levels reported in . Note that we found no significant differences between tumor-developing and Control groups in repertoire unique clones, diversity, clonality or CDR3 length distributions (S2 Fig).
The metric by which we are looking at the (dis)similarities first aims to see if the complete set of mice share higher similarities during the early time points. We calculated overlaps between samples, by following these steps: (1) We defined samples from young mice as samples from time points 1–4, and samples from old mice as samples from time points 5–8. (2) We calculated Morisita’s overlap index for each pair of samples. (3) For each combination of groups, we averaged all the pairs related to this group. For example, for the combination of ‘Control Young’ and ‘Transgenic Young’, we averaged over all pairs of samples where one sample belongs to the ‘Control Young’ group, and the other sample belongs to the ‘Transgenic Young’ group. As shown in S3A Fig, the ‘Control Young’ and ‘Transgenic Young’ groups indeed display higher similarities compared to the ‘Control Old’ and ‘Transgenic Old’ groups.
Although we screened relatively small numbers of mice, those developing tumors clearly manifested Exclusive Public CDR3 types; a smaller proportion of Control mice manifested Exclusive Public CDR3 types not found in the mice developing tumors.
We measured CR levels over time in the tumor and Control mice. For each CDR3 AA sequence, we calculated the number of different NT sequences encoding that AA sequence in the individual mice, that is, each AA sequence has a “CR level” value, which is the number of different NT sequences encoded to that sequence in our subsampled data. CDR3 sequences may end up identical despite different VJ origins. That is, in our workflow, a single CDR3 could be the result of multiple, different, VJ pairs. We found a somewhat higher frequency of CR in mice developing tumors compared to Control mice in both the α chain (47% and 40%, respectively; t-test; p < 0.016; S4A Fig) and the β chain (37% and 33%, respectively; t-test; p < 0.011; S4B Fig). We also found that the α chain manifests higher CR levels compared to the β chain.
These results confirm that Public CDR3 AA sequences manifest greater CR levels than do Private sequences . Fig 2A and 2B show that CR in the Public-Inclusive groups is double that of the Private CDR3 types. Fig 2C and 2D show that the Public tumor-associated CDR3 (publicness is determined according to AA sequence) types originate from a smaller portion of the Private types (determined according to NT sequence) in the Control group compared to the Tumor group (5% and 11%, respectively). Moreover, non-CR CDR3 types dominate the Private repertoires (S4C and S4D Fig). In addition, we identified a significant contribution of CR to the Public TCR repertoire– 98% of the CR types in the α chain are Public (Fig 2E) compared to 22%-40% of the non-CR types (S4C Fig). The same phenomenon was observed in β chain– 96%-97% of the CR CDR3 types were Public (Fig 2F), in comparison to 16%-28% of the non-CR types (S4D Fig). We compared the tumor-Exclusive frequency to the Control-Exclusive frequency. We calculated the levels of CR for each group of Exclusive clones. As shown in S4E Fig, the averaged CR level in the tumor-developing group is significantly higher than the averaged CR level found in the Control group (t-test: p<0.0001).
Finally, we observed evolution of CR and publicness over time. We quantified CR levels linked to publicness in the following groups of mice: Controls, Transgenic mice before palpable tumors, Transgenic mice with early-tumors and Transgenic mice with large tumors. Fig 2G and 2H show that CR levels increase in tandem with the degree of the averaged publicness in all the groups. However, the connection between CR and Public sharing is specific to the group. In the tumor-developing group, compared to the Control group, Public sharing increased as CR increased (lr test: p<2.2e-16; Fig 2G). This suggests that selection of specific CDR3 types is dominant in Transgenic mice. Separating the Transgenic group into early-tumor and cancer samples, we observed that sharing was greater in the cancer group than in the early-tumor group (lr test: p<2.2e-16; Fig 2H); thus, high degrees of CR and highly shared between mice, Public sequences develop early, even before the emergence of palpable tumors.
We studied the TCRβ-CDR3 repertoire of three human sources: 1) TCR sequences that we extracted from RNA-seq data of breast cancer patients obtained from The Cancer Genome Atlas (TCGA) ; 2) bulk TCR-seq data from previously published human TCR breast cancer datasets: Wang et al , Page et al  and Beausang et al ; and 3) Single-cell RNA-seq TCR data of 3 breast cancer patients obtained from Azizi et al  (Fig 3). This analysis, detailed below, supported a number of conclusions: 1) humans with breast cancer share Public CDR3 AA sequences with mice developing spontaneous breast cancer; 2) cross-species CDR3 AA sequences rank similarly in mice and human patients, detected by our ranking metric; and 3) Public, cross-species CDR3 AA sequences arise from NT recombinations that do not overlap in humans and mice–in other words, mice and humans manifest identical CDR3 AA sequences, but the NT recombinations that each species uses to generate these shared AA sequences are completely different.
As we detail below, mice and humans with breast cancer share TCR CDR3 AA sequences. We compared the mouse data we produce in this work with data from human samples and extracted different subsets of CDR3 sequences detailed in the Methods section. The first subset of cross-species sequences included the set of CDR3s from this work and the set of CDR3s from the 3 human-data papers mentioned above. We found 7513 cross-species sequences in this subset. S5 Fig shows a comparison of mouse-specific NT and AA sequences (inner circle) with cross-species mouse sequences shared with human patients (outer circle); the groups were divided into Private, Public Inclusive and Public Exclusive as previously described. The lower two wheels in the figure describe NT sequences. We can see that the tumor-developing mice express more Public Exclusive and less Public Inclusive sequences than the Control mice. AA sequences, in contrast, show a significantly greater proportion of Public Inclusive sequences than Public Exclusive sequences. Nevertheless, tumor-developing mice do express relatively more Public Exclusive sequences. Thus, the NT view and the AA view portray different perspectives, and mice developing tumors show more Public Exclusive sequences of both NT and AA types.
We compared the mouse data to samples from early-stage breast cancer patients from Beausang et al., and found 258 sequences that were shared between these human samples and the collection of time points across our mouse data (see Methods). To quantify these 258 CDR3 cross-species sequences, we produced the bar-chart shown in Fig 4A. Every time-point in the bar-chart contains a stack of 258 bars. The area each bar occupies in the stack is inversely proportional to an area defined by the ranking of the sequence at the specific time-point, multiplied by the ranking of the sequence in the human samples, such that similarly ranked sequences occupy a larger area. According to Fig 4A, time-point 5 (which is associated with early-tumors in mice) is the time-point most similar to early-tumors in patients. We can also see that a small set of sequences (blue (CASSLSYEQYF), orange (CASSLGYEQYF), and green (CASSLGDTQYF) in the bar-chart) dominate the chart. The CASSLSYEQYF and CASSLGYEQYF CDR3 sequences are ubiquitous to TCR repertoire studies, and have been associated with Melanoma and Influenza (Table 1). Recent findings from other studies show associations to these phenotypes [15,16]. Additionally, 829 of the 14,349 (6%) tumor-associated CDR3 sequences found previously, were also tumor-associated in human samples; that is, they were found in tumor samples but not in normal samples.
To characterize the differences between the specific groups of cross-species sequences and the general population of CDR3 sequences, we utilized IGoR (Inference and Generation Of Repertoires) , which is a tool that probabilistically annotates sequences. It takes B- or T-cell receptor sequence reads and quantitatively characterizes the statistics of receptor generation. Fig 4B shows the results over the three groups discussed here (see Methods): the collection of all sequences, the collection of all cross-species sequences (7513 AA sequences that were found in both mouse data and the 3 human datasets), and the collection of the unique set of 258 cross-species sequences that were inherent to all samples, together with human samples. As the figure shows, the three curves are significantly different from each other. Moreover, cross-species sequences and the set of 258 sequences were associated both with a distribution of relatively low probabilities and with high CR distributions, displayed at the top of the panel. These findings indicate that, even though the sequences were not produced at higher probabilities, they were selected to appear as such. Of special interest, and highlighted in the figure, are two sequences–CASSLGYEQYF and CASSLSYEQYF–which, in spite of their single amino acid difference, appear with very different IGoR scores and with very high CR values (notice that we did not include the third sequence CASSLGDTQYF as part of the analyses, since it did not produce as meaningful findings as the other two sequences).
When we examined the tumor-associated sequences that were shared between our mouse data and the human data subset of 7513 cross-species sequences, we identified 639 tumor-associated cross-species sequences. To examine the gene usage origin of these sequences, we analyzed V, J gene usage. We show the heatmaps in S6 Fig, which represent the correlations between the different V and J usage in tumor-associated cross-species clones. As shown, there are multiple Vs that encode these sequences, some of them are close across species, such as TRBV31 in mouse and TCRBV30 in human, and some are close within species, such as TRBV5 and TRBV2 in mouse. As for J gene usage, there are closer orthologues across species.
To see if the TCR sequences we singled out as tumor associated through our mouse-human workflow are associated with human tumors, we analyzed the 258 cross-species tumor sequences over a set of tumor-peptides, from the McPAS database , that are known (from the literature used to build the dataset) to be associated with human tumors. To perform this analysis, we used ERGO, a deep-learning based tool for the prediction of TCR-peptide binding . From MCPAS, we selected for the peptides that are tagged with a cancer classification (n = 257). We then created all possible CDR3-peptide pairs of these cancer-peptides and our 258 cross-species tumor sequences (n = 258x257 = 66,306 pairs). We used ERGO to find the predictive binding score for each CDR3-peptide pair. To see the significance of these scores, we compared them with a set of randomly chosen 258 CDR3 sequences from our data. We applied the same steps to these randomly selected TCR sequences as we did for the 258 cross-species TCR tumor associated sequences. To see the significance of these findings, we repeated the process (random pooling followed by ERGO) five times. Fig 5A provides the results of the process and the average predictive binding scores. As the figure shows, the cross-species tumor sequences pairing with human tumor peptides is significantly higher than the other pairs of CDR3-tumor peptides.
To further examine the significance of the tumor-exclusive clones, we compared the sharing between control-exclusive and tumor-exclusive clones in mouse to human breast cancer dataset (Beausang et al.) and to non-breast cancer dataset (Robert et al.; Melanoma). We subsampled an equal number of sequences from all of the examined samples, both mouse and human samples (n = 27776), and calculated the sharing between every 2 mouse-human samples. The sharing is defined as the Jaccard overlap index with a modification that takes into account the frequency of clones . Fig 5B shows that the tumor-exclusive clones have a significantly higher overlap with human cancer patients compared to control-exclusive in both breast cancer human patients (p>3.2E-05) and human Melanoma patients (p>1.7E-04). Additionally, when we compared the overlap of tumor-exclusive with the 2 types of cancer, we found significantly higher mouse-human sharing with breast cancer compared to melanoma (p>1.6E-06).
To look at the difference between ‘Transgenic Young’ and ‘Transgenic Old’ samples, we compared the highly abundant sequences from these two groups with sequences we obtained from two human cohorts. First, we identified a set of highly abundant sequences in mouse samples by collecting the CDR3 sequences that appeared both in at least 10 “Transgenic Young” samples and in at least 10 “Transgenic Old” samples. This initial filtering resulted in a set of 345 sequences.
To see which of these sequenced is highly abundant in the “Transgenic Old” samples, we used a t-test, with which we identified 13 such sequences with significant differences in their copy numbers between the two groups. We then used these sequences to see their copy numbers in a cohort of TCR sequences from breast cancer patients and in a cohort of TCR sequences from Melanoma patients. For a comparison with other sequences, we repeated the above steps 1000 times using random choices of 13 sequences (out of the 345 sequences). As can be seen in Fig 5C, the frequency of the clones that were highly abundant in the “Transgenic Old’ is significantly higher compared to the random clones both in samples from breast cancer patients (p<0.013) and in samples from melanoma patients (p<0.003). Interestingly, the frequency of the ‘Transgenic old’ highly abundant clones is higher in the breast cancer samples compared to melanoma samples.
To learn about the nucleotide source of the specific sequences we identified to be simultaneously high ranking in mouse and human samples (blue and orange in Fig 4A and 4B), we compared the NT sequences that generated these AA sequences. The results are shown in Fig 4C and 4D, where we connected each NT sequence to its end-result AA sequence. Edges are blue if they originate from a mouse NT sequence and red if they originate from a human NT sequence. As the figure shows, the same AA sequence originates from multiple NT sequences. However, the NT sequences show no overlap between mouse and human samples. The AA sequence in Fig 4C is the result of 27 different NT sequence recombinations in mice and 29 different NT sequences in humans. The AA sequence in Fig 4D originates from 29 different NT sequences in mice and 27 different NT sequences in humans. Whatever the upstream NT recombination, the CR process strongly selects for a specific AA sequence. S7 Fig shows the NT display of the AA compositions presented in Fig 4C. The figure reveals the recombination effect: the same AA sequences were built on a combination of 9 different TRBVs (-2,3,4,5,19,21,24,26,29), while TRBJ2-7, both for mouse and for human, was used. The colored bars represent the NT sequences encoding to these AA sequences.
To further research the importance of the two AA sequences CASSLGYEQYF and CASSLSYEQYF, we calculated the numbers of other sequences very similar to them in a network configuration [27,28]. Here, we define similar sequences as AA sequences with an edit distance (Levenshtein distance) of up to 2. We created a pool of all the sequences from the tumor-developing group for each time point, and then only used the top 10,000 sequences–in order to avoid a bias stemming from different sample sizes. We then counted the number of sequences close to the two given AA sequences, and compared these results with 1000 random sequences. Fig 4D shows that the number of sequences differing from CASSLGYEQYF and to CASSLSYEQYF by 1 or 2 AA is significantly higher than the average number of sequences equally similar to any of 1000 random sequences. We can also see that this organization does not significantly change over time. Similar behavior is observable in the tumor-associated sequences (Fig 4E). Fig 4F shows the differences between the distance networks of CASSLGYEQYF and CASSLSYEQYF and two random sequences. Each node represents a similar sequence to these two sequences. We can see that the networks of CASSLGYEQYF and CASSLSYEQYF are significantly larger than those of the random sequences.
To learn more about the connection between the CDR3 types we identified in the mouse model and those from human patients, we analyzed RNA-seq samples from breast cancer patients that were part of the TCGA cohort. Using data from genome alignment and TCR alignment on the 1,256 breast cancer samples, we extracted roughly 60 α/β sequences per sample with a total number of 38,068 α NT sequences and 30,524 β NT sequences. Out of these sequences, 874 AA β sequences and 300 AA α sequences were also found in our mouse subsampled data.
When examining the different Public groups of the sequences shared with our mouse samples and TCGA samples, we found that the sequences from the tumor-developing group share significantly more Public Exclusive clones compared to the Control group in both the α chain (20% and 3%, respectively; Fig 6A; t-test: p< 0.0003) and β chain (9% and 2%, respectively; Fig 6B; t-test–p< 0.0005). In addition, the percent of CR clones in this subset of cross-species CDR3 types is significantly higher than that observed in our data for both α and β chains (71%-81% and 37%-63%, respectively; Figs 6C, 6D, S4A and S4B).
TCGA breast cancer data includes clinical information about each patient, including the stage of the disease. We compared TCGA disease stage data to our longitudinal time point data (see Methods). For each time point we calculated the similarity between sample and stage using the following formula: (Shared clones between sample and stage) / (# of clones in a sample x # of total cases of a stage). S3 Table shows the number of mouse-human-shared clones between each sample and each stage. The analysis indicated that across all mouse time points in the α chain, the greatest similarity is between mouse sequences and stage 1 breast cancer. Least similar is stage 4 breast cancer (Fig 6E). The same is seen in β chain data (S8 Fig). These results might be influenced by the source of T cells: The TCGA data come from the tumor sample itself, but the mouse data come from blood samples. The findings might suggest that T cells from later stages of tumor development are not present in peripheral blood or, that the similarity between mouse and human CDR3 sequences is limited to early stages of breast cancer. It is also conceivable that cancer treatments might lead to major modifications of the T-cell repertoire, absent from the mouse population.
Single-cell data from T cells can provide a more refined resolution than bulk populations; it is possible to match α and β sequences and to study whole transcriptomes. To gain a better view of the TCR repertoire in human breast-cancer patients and its relevance to the longitudinal mouse study, we analyzed single-cell data available from Azizi et al. , who sequenced tumor tissue from 3 patients with breast cancer using 10x single-cell RNA-seq sequencing. We identified the set of α and β sequences that overlapped with our mouse data and found the α-β pair of each cell. This resulted in 3168 unique cells with 2110 different α-β pairs. Next, we searched for these α-β pairs in our data. An α-β pair was defined as shared with a mouse sample if both the α CDR3 sequence and the β CDR3 sequence were present in that sample. We found 582 α-β (single-cell) pairs whose alpha and beta chains appeared in the same sample in our mouse data.
We found that the tumor-developing mice shared a significantly higher number of human α-β pairs in comparison to the Control group (27 α-β pairs per sample and 11 α-β pairs per sample, respectively; Fig 7A; t-test: p< 0.024). To learn whether CR is dominant in this subset, we averaged the CR levels of the α and β sequences. Indeed, this subset of the repertoire showed a much higher rate of CR in comparison to the total repertoire (Figs 7B and S4A–S4B). The Public repertoire presented the same effect as the CR, and was expended in both the Control and the tumor-developing mice (9% and 30%, respectively, in comparison to 7% and 12% in the total repertoire). However, the Public Exclusive repertoire was significantly higher in the tumor-developing mice compared to the Control mice (Fig 7C and 7D; 30% and 9% respectively; t-test: p < 0.0009), and compared to the total repertoire of α and β chains in tumor-developing mice (16% and 12%, respectively; Figs 2B and S1D).
Finally, we mapped single cells to the different groups. For each cell, we checked whether its α-β pair appeared in only or in both of the groups. For each group (Control, tumor-developing and shared between these 2 groups), we calculated the Levenshtein distance between the subsets of α-β pairs. In this analysis, we allowed up to 3 different AA between two pairs. As can be seen in Fig 7D, the tumor-developing network (red circles) was significantly larger than the Control network (blue circles) or the shared network (green circles); this suggests that the tumor-developing mice share significantly higher similarity with human breast-cancer patients than do the Control mice. Interestingly, we also found that the tumor-developing network can be divided into 2 main subnetworks. Combining all the human datasets, we found that the Control group dominates the subset of CDR3 types that are Public Inclusive, and that the tumor-developing group dominates those that are Public Exclusive (S9 Fig).
The objective of this study was to investigate whether the TCR repertoires of mice with breast cancer develop representative CDR3 sequences. We focused on the CDR3 regions of the alpha and beta chains of the TCR because the CDR3 domain binds the processed antigen epitope presented to the responding T cells ; thus, the CDR3 region of the TCR is responsible for antigen-specific T-cell recognition.
We initiated the study with a longitudinal survey of TCR repertoires monitored in the peripheral blood of individual mice that were in the process of spontaneously developing breast tumors; We recovered a subset of T cells based on their CD4+CD62LhiCD44lo markers. CD4+ T cells are a key component of tumor immunity via their communication with several types of immune cells, direct tumor killing and by providing support to CD8+ T cells . Based on molecular markers, these traits in human immunity are usually attributed to cells with an effector phenotype (CD62LloCD44hi). In our study, the CD4+CD62Lhicd44lo subset was the largest recoverable subset out of peripheral blood.
Sorting to the CD4+CD62LhiCD44lo group still leaves multiple cell types that should be separately discussed. These include conventional CD4+ T cells that will recognize antigen peptides presented by MHC class I. These cell types, in human, might include different stages of T cell development and differentiation of naïve CD4+ cells from unprimed quiescent to effector and central memory cells [6,31] and Stem cell memory T cells . Several other prominent populations of ‘unconventional’ T cells might also be included, and were analyzed and found at small frequencies in our data. We further studied data from these cell types, even though they are CD44hi , to exclude the possibility that they are responsible for some of the trends we saw. These T cells recognize non-peptide antigens presented by specialized monomorphic MHC class I–like molecules  and have unique and conserved TCR repertoires. These include CD1-restricted natural killer T cells (NKT cells) [34–36], MR1-restricted mucosal associated invariant T cells (MAIT cells) [37,38]. We found only 4 CDR3 sequences that can be classified as MAIT cells. We also identified 96 sequences in the Control group, and 201 sequences in the Transgenic group, that could be tagged as iNKT cells. However, there was no statistically significant difference in the frequency of these cells between the groups.
Another aspect of the compartments we sorted using the markers is that these are some of the T cells previously addressed as naïve, while in fact this is not a homogeneous pool of cells awaiting activation. The compartment previously addressed as naïve T cells is actually a complex mixture of cells at different developmental stages and of diverse ages . Further, there are inflows of new cells from the naive pool that strongly impact memory cell .
In mice there are differences between RTEs (recent thymic emigrants) and mature naive T cells and between naive populations that are not replaceable and high-rate renewed cells .
High-throughput biology, high-resolution biology and a host of technologies are quickly adding changes to current immunological theory. The nature of memory T cell differentiation signals, the extent of their fate diversity, the lineage relationship between different fates, and the degree of plasticity at different stages of differentiation  are projecting on the previous stages of these cells maturation. Together with the various sub populations we mentioned earlier, we see many degree of freedom in T cell populations in spite of phenotype characteristic.
We here show that the TCR repertoires of these T cells do reflect the tumor-bearing state. Most importantly, this mouse model provided two advantages in the quest for tumor-associated CDR3 sequences: sampling inbred mice enabled us to circumvent the genomic variability inherent in outbred humans, and the longitudinal monitoring of T-cell repertoires during actual cancer development supported a level of confidence in the association of repertoire changes with the actual development of breast cancer; in contrast, a collection of repertoire snapshots from a mixed collection of women sampled at different stages of the disease would not match the precision of the controlled longitudinal mouse experiment. As one of the reviewers to this manuscript suggested, it would have been extremely useful to compare the timeline of TCR repertoire changes in these mice with an equivalent set from another Transgenic mouse. Such a set could have allowed a comparison between the changes that were caused by the tumor with changes that are caused by the transgene itself. However, we were unable to identify such a dataset, and the limit of this comparison remains. It is our opinion that the significant findings we identified with human tumor samples eliminate much of the inherent randomness that may be associated with the comparison between the repertoires.
Although the numbers of mice in the mouse study were necessarily small, the information yield was meaningful; we were able to detect and characterize 7513 CDR3 AA sequences that appeared both in the mice and in the various human populations of breast cancer patients. Most importantly, 829 of these Public sequences appeared to be Exclusive for breast-tumors in both mice and humans. The results of the present study stimulate us to think anew about the host-tumor relationship.
Our most notable finding is that the development of breast cancer selects for CDR3 regions exclusive to tumor-bearing mice, shared by both mice and humans (Fig 7D). This raises the possibility that the development of breast cancer involves the expression of particular antigens; antigen selection is the simplest way to explain cross-species CDR3 AA sequences despite the fact that each species mobilizes a different set of NT recombinations to generate exactly the same AA sequence; there was no NT overlap between the species in any of the NT recombinations leading to marked convergent recombination (Fig 4C). Thus, mouse-human-shared CDR3 AA selection must come about by way of similar antigens expressed by both mice and humans in the process of tumor development–this immune similarity takes place despite the genetic differences between inbred mice and natural human populations.
It has been taught that the TCR repertoire and tumor development are stochastic events: TCR molecules are thought to be generated by a huge number of possible NT recombinations between V-D-J gene segments augmented by NT deletions and additions at the joining segments. This process is thought to be able to produce more than 1018 different TCR sequences . Moreover, an individual TCR antigen receptor has been reported to recognize with significant avidity upwards of a million or so peptide epitopes . Consequently, the chances are vanishingly small that different mice and humans would randomly select for T cells bearing identical CDR3 binding sites, even in response to the same immunogenic epitope. It is likely that certain TCR gene segments are located in positions that favor recombination events relative to other TCR gene segments . However, the effects of chromosomal positioning would not be expected to determine identical repertoire recombinations in mice and humans, which show no overlap in convergent recombination (Fig 4C).
Likewise, the development of a malignant tumor is believed to result from random combinations of many mutations in various genes–including so-called driver mutations that drive tumor development, which differs markedly from the outcome of the multitude of mutations often found in so-called normal or healthy cells . The observation that mice and humans bearing diverse MHC genes can express identical TCR CDR3 amino acid sequences has a precedent in the study by (Madi, Poran et al. 2017) (Madi, Shifrut et al. 2014) [47,48] showing that different strains of mice bearing diverse MHC genes expressed identical CD-4 TCR CDR3 amino acid sequences known to recognize a single defined peptide epitope; apparently, different MHC class II molecules can present the same processed peptide to their TCR CDR3 regions. The present finding of Public, tumor-associated TCR repertoires in both mice and humans developing breast cancer hints at a uniformity in the tumorogenic process as it develops in different individuals in at least two different species. Understanding the generation of immunogenic markers shared across species is likely to deepen our understanding of pathophysiological uniformities inherent in tumor development itself. Mutations may turn a normal cell into a potentially malignant transformed cell, but a single aberrant cell is not a tumor; a growing tumor is like an organ that requires the support of a microenvironment that includes blood vessels, stromal cells, a supply of growth factors, and particular metabolic energy sources, all dynamically integrated . It is conceivable that breast cancer development might require similar microenvironmental interactions in both mice and humans, which are highly structured and, therefore, quite similar. Requirements for the development of breast cancer in mice and humans, across species, may impose shared processes that stimulate shared TCR repertoires. The present investigation needs to be followed up by identification of the key immunogenic entities that drive the evolution of Public, tumor-associated TCR repertoires. At present it is not possible with any degree of certainty to infer the selecting antigen from the AA sequence of the responding CDR3 TCR; new technologies are needed. Moreover, the possible candidate antigens are many and varied: normal developmental antigens expressed aberrantly by dedifferentiated tumor cells; neo-antigens generated by mutations; regulatory molecules expressed by the activated immune system itself–ergotypic activation markers, for example; tumor-associated bacteria or viruses; epitopes of stress proteins; the list goes on and on. Some Public CDR3 specificities may be annotated by their appearance in other situations . In any case, Public, tumor-associated TCR repertoires might arise from common factors needed for tumor development; the elucidation of the antigenic molecules that drive Public, tumor-associated TCR repertoires could help identify commonalities in tumor development. The challenges underlined by the present study are difficult but promising.
Approval number: 39-10-2012, Dr. Benny Motro, the committee chairman, Bar Ilan University.
We purchased from Jackson Laboratories transgenic mice, at the age of 1 month, expressing the inactivated rat neu (Erbb2) oncogene under the transcriptional Control of the mouse mammary tumor virus promoter (FVB/N-Tg(MMTVneu) 202 Mul/J). Female mice of this strain serve as a mouse model of mammary tumor in humans–the HER2/ Erbb2 / Neu human breast cancer model ; this test group included 10 female mice. The transgenic mouse model we used, FVB/N-Tg(MMTVneu)202Mul/J (Jackson laboratories) is homozygous for the MMTVneu (rat) transgene. The mice are viable and fertile. There is no phenotypic effect in males. The transgene is expressed at low levels in normal mammary epithelium, salivary gland, and lung. Higher expression is detected in tumor tissue. Focal mammary tumors first appear at 4 months, with a median incidence of 205 days (6.8 months). Both virgin and breeder mice develop tumors. Tumors arise as foci in hyperplastic, dysplastic mammary glands. Seventy-two percent of tumor-bearing mice that live to 8 months or longer develop metastatic disease to the lung.
FVB/NJ strain mice, bearing the same genetic background as the transgenic mice, but without the breast-cancer transgene, served as Controls; the Control group included 5 mice. The mice were housed under specific pathogen-free conditions in accordance with applicable laws and regulations, and approved by the responsible animal care and ethical committee. Mice were monitored by palpation for tumor development and bled monthly for up to 9 months, or until they had to be sacrificed because of tumor growth.
Peripheral blood was sampled from the retro-orbital sinus of each of 15 mice once a month for 8 time points (a total of 120 samples). Mononuclear cells were isolated by density gradient centrifugation using Ficoll (Ficoll Paque plus, GE Health Care–according to the manufacturer’s instructions). At the termination of the experiment, single-cell suspensions were prepared from the thymus and spleen of each mouse. For cell sorting, the cells were stained with the following fluorescently labeled monoclonal antibodies: anti-CD4 Pacific Blue (BD), anti-CD25 PE (eBioscience), anti-CD44 APC (BD) and anti-CD62L PE-Cy7 (eBioscience); viability was determined using the Fixable Viability stain 450 (BD Horizon). Cell sorting was performed using a FACS ARIA III sorter; CD4+CD62LhiCD44lo T cells, the most prevalent T-cell population, was used for CDR3 sequencing. After sorting, the cells were pelleted and resuspended with 300μl of RNAprotect cell reagent (Qiagen). The cells were stored at minus 80°C until RNA extraction. RNA was purified from RNAprotect-stabilized cells using the RNeasy Plus Mini Kit. After RNA extraction, samples were run on TapeStation to estimate quality.
Cells were FACS sorted (by ARIA III sorter) to obtain CD4+CD62LhiCD44lo T cells. RNA was prepared using Qiagen’s RNeasy kit. Isolated RNA was reverse transcribed using a biotinylated oligo dT primer. An adaptor sequence was added to the 3' end of all cDNAs, which contains the Illumina P7 universal priming site and a 17-nucleotide unique molecular identifier (UMI). Products were purified using streptavidin-coated magnetic beads followed by a primary PCR reaction using a set of the following primers:
The primers were designed to target the constant regions of mouse-TCRalpha and mouse-TCRbeta. The TCRalpha/beta-specific primers contained tails corresponding to the Illumina P5 sequence. PCR products were then purified using AMPure XP beads. A secondary PCR was then performed to add the Illumina C5 clustering sequence to the end of the molecule containing the constant region. The number of secondary PCR cycles was tailored to each sample to avoid entering plateau phase, as judged by a prior quantitative PCR analysis. Final products were purified, quantified with Agilent Tapestation and pooled in equimolar proportions, followed by high-throughput paired-end sequencing on the Illumina MiSeq platform. For sequencing, the Illumina 600 cycle kit was used with the modifications that 325 cycles were used for read 1, 6 cycles for the index reads, 300 cycles for read 2 and a 20% PhiX spike-in to increase sequence diversity.
RNA-seq raw reads were aligned and mapped to V-D-J sequences using MiXCR, a computational tool that enables the processing of big immunome data from raw sequences and output quantitated clonotypes . The method performs paired-end read merging and extracts human or animal TCR clonotypes providing corrections for erroneous sequences introduced by NGS [53–55]. Using MiXCR, we obtained CDR3 sequences of α and β chains. The counts of TCRα and TCRβ sequences are presented in S4 Table.
To minimize any bias that might stem from diverse sample sizes, we initially combined all time points of each mouse. Next, we used the set of samples from the 5 mice of the Control group, and 5 mice from the tumor-developing group, who had the highest sequences coverage. We then subsampled equal numbers of sequences for each of these 10 mice, as the lowest number of sequences obtained for each chain: 843,326 CDR3 β chain sequences and 116,751 α chain sequences, allowing us to keep as much of the information as possible while still normalizing the data.
We studied the TCRβ-CDR3 repertoire of three human sources: 1) TCR sequences that we extracted from RNA-seq data of breast cancer patients obtained from The Cancer Genome Atlas (TCGA) ; 2) bulk TCR-seq data from previously published human TCR breast cancer datasets: Wang et al , Page et al  and Beausang et al ; and 3) Single-cell RNA-seq TCR data of 3 breast cancer patients obtained from Azizi et al .
We downloaded RNA-seq samples publicly available from The Cancer Genome Atlas (TCGA)– 1256 tumor and normal tissues of breast cancer patients. We applied the following procedures to extract TCR repertoires: 1) We mapped reads to the human reference genome using STAR. 2) We designated as somatic recombinations only those reads that at least one of the mates did not map to the reference whole genome; a recombined V-D-J sequence cannot map completely to the genome. 3) We utilized MiXCR on all unmapped reads.
For the comparison of TCGA disease stage data to our longitudinal time point data, we used our full dataset (the entire set of clones within the sample, without subsampling). The different subtypes of stages obtained from the clinical information of each breast cancer patient was aggregated to the following stages: stage i, stage ii, stage iii and stage iv. We calculated the similarity between a sample and a stage using the following formula: (Shared clones between sample and stage) / (# of clones in a sample x # of total cases of a stage). Next, we averaged all the samples of a specific time point and a stage.
We used RNA-seq data from the following published articles: Wang et al. , Page et al.  and Beausang et al. . In contrast to the bulk TCGA data, these authors carried out directed TCR repertoire sequencing.
To be able to compare the mouse data produced here and the above mentioned human data, we studied multiple subsets of sequences. These are the subsets:
We developed a computational model for the assembly of CDR3 regions using single-cell RNA-seq data published by Azizi et al. . The model was built using the following steps: 1) We applied MiXCR on the samples and extracted the CDR3 sequences for each cell in each patient. 2) We filtered out CDR3 sequences that did not appear in our mouse data. 3) We constructed an α-β pair of each cell barcode. 4) We filtered out all cells with more than one α or β sequence. And, 5) we searched for these α-β pairs in our mouse sequence data (after subsampling). An α-β pair was defined as shared when both the α CDR3 sequence and the β CDR3 sequence were found in our mouse samples.
We developed an additional metric to compare mouse sequences with sequences obtained from the early-stage breast cancer patients reported by Beausang et al . First, we identified the sequences that appeared at all the mouse time-points and in the human samples. At each time-point, we ranked the sequences according to their copy number–the highest copy number was ranked 1, the second highest ranked 2 and so on. We then ranked the human samples in the same manner as the mouse samples. The size of each clone is inversely proportional to an area defined by the ranking of the sequence at the specific time-point, multiplied by the ranking of the sequence in the human samples, such that similarly ranked sequences will have a larger size. For example, the largest area would potentially belong to a shared CDR3 sequence that is ranked #1 in both species (since 1 / ((sequence ranking in mouse = 1) x (sequence ranking in human = 1)) = 1). The smallest potentially belong to a shared CDR3 sequence ranked last in both species (1 / ((sequence ranking in mouse = 258) x (sequence ranking in human = 258)) = 0.000015).
IGOR: Version: IGoR v1.2.0; Non-default parameters: -species mouse -chain beta—ntCDR3.
MiXCR: Version: MiXCR v2.1.5; Non-default parameters: -p rna-seq -s MusMusculus -OallowPartialAlignments = true.
TCGA analysis—MiXCR: MiXCR v3.0.3; STAR: STAR_2.7.0b_0206.
Single cell analysis: MiXCR: MiXCR v3.0.3.
ERGO: Parameters: Model Type = LSTM based model; Training Database = McPAS.
For the significance tests that quantify differences between means of 2 populations, we used the t-test; For the significance tests that quantify differences between 2 curves we used the lrtest function from the R package lmtest–a function that uses a likelihood ratio test to compare generalized linear models; we used the Kolmogorov–Smirnov test when we compared distributions.
Values for the error bars were calculated using the standard definition: the standard deviation divided by the square root of the number of measurements that generated the mean.