-
Probability, Entropy, and Adaptive
Immune System Repertoires
Zachary Michael Sethna
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Physics
Adviser: Professor Curtis Callan
September 2018
-
© Copyright by Zachary Michael Sethna, 2018.
All rights reserved.
-
Abstract
The adaptive immune system, composed of white blood cells called lymphocytes (B
and T cells) that circulate in the lymph and blood, is a precision tool that tags
and removes foreign peptides. Such peptides, also called antigens or epitopes, are
identified by a specific binding to elements of a library or repertoire of unique proteins
called receptors (e.g. antibodies or T cell receptors). A repertoire must be large and
diverse enough so that at least one receptor will be able to recognize any pathogen
epitope the organism is likely to encounter. This diversity is achieved by stochastic
rearrangement of the germline DNA to create novel complementarity determining
region sequences (CDR3) in a process called V(D)J recombination.
In this thesis we utilize previously developed generative models of V(D)J recombi-
nation events, and infer the model parameters from large datasets of DNA sequences.
The generation probability (Pgen) of a nucleotide or amino acid CDR3 is the sum
of all model probabilities of V(D)J recombination events that generate the sequence.
While previously it was only feasible to compute Pgen of nucleotide sequences, we
introduce a novel dynamic programming algorithm that efficiently computes Pgen of
amino acid sequences. We use this Pgen for several applications. First we examine
how the diversity of a repertoire, characterized by the model entropy, scales with the
number of insertions in the V(D)J process. This is used to describe the maturation
of the T cell repertoire of mice from embryos to young adults. Next, we introduce a
statistical model of hypermutation in B cells and infer the parameters from a human
repertoire, providing a principled quantification of the biases in hypermutation rates.
Lastly, we examine the statistics of the receptors shared amongst a cohort of more
than 600 individual humans and show that the statistics and identities of so-called
‘public’ sequences are determined directly from Pgen.
We highlight possible clinical applications and attempt to place this work in the
context of a full theory of the adaptive immune system.
iii
-
Acknowledgements
I don’t have the words to express my thanks to my advisor Curt Callan. Curt has been
a consummate advisor, providing support, advice, direction, and countless opportu-
nities. I came into grad school with somewhat scattered interests, yet Curt showed
me, by example, how to find a path forward through dedication, collaboration, and
boundless curiosity. Curt has always been willing to entertain my crazy, inchoate
ideas, and with only a few incisive questions give them shape (though it often takes
me days to catch up and realize this). Curt, thank you for all of your time and effort,
thank you for being my mentor. Thank you.
I also thank my collaborators on both sides of the pond. I have learned so much
from the insights and clarity of Aleksandra Walczak and Thierry Mora. Their ability
to parse the underlying science, translate it into math, and then communicate this
effectively is something I hope to one day be able to emulate. Yuval Elhanati has
made my time here much more productive and enjoyable. Not only did Yuval provide
crucial assistance with every step of the research, but he provided a sympathetic ear
and was willing to talk about whatever the topic of the day was. Quentin Marcou is
not only a wonderful collaborator, but a welcoming friend.
Thanks to Ben Greenbaum and Vinod Balachandran for great discussions, data,
and continuing collaboration.
I would also like to thank Anand Murugan, whom I have never met, but whose
code I’ve spent uncounted hours working with.
Biophysics
The professors in biophysics have been hugely influential on my perspective on science
and life, and I would like to thank them. I must start by thanking Bill Bialek, and not
only for being on my committee. His vision, instant understanding of any topic, and
personality have made his conversations something to be sought after. I would like to
iv
-
thank Bob Austin, not only for being a reader of this thesis, but for the many crazy
conversations and a shared appreciation of scotch. I also want to thank Josh Shaevitz
for efficiently cutting to the bone of any issue, Thomas Gregor for teaching me much
during my time as a TA for ISC, and Ned Wingreen for somehow always knowing
everything about any biological system. You all have made Princeton biophysics not
only a superb place to do research, but a friendly and welcoming environment.
The biophysics community also has had several postdocs and graduate students
over the years that I would like to thank for teaching me much and making my
time here so much fun. Andreas Mayer for great discussions on immunology. I’ve
immensely enjoyed speculating about Information Geometry with Ben Machta. I’d
also like to thank Leenoy Mushulam, Henry Mattingly, Dima Krotov, Ashley Linder,
Ugne Klibaite, Ben Bratton, Gordon Berman, Michael Tikhonov, Xiaowen Chen,
Guannan Liu, Mochi Liu, Alex Song, Sagar Setru, Mark Ioffe, and Jeff Nyugen.
Physics
The greater physics community has made Jadwin Hall a second home for these years.
I’d like to thank Herman Verlinde for all of his work in organizing the grad program.
A special thanks to Suzanne Staggs for being on my committee. Thanks to Jessica
Heslin, Barbara Mooring, and Kate Brosowksy for the invaluable administrative as-
sistance – without you we grad students would be helpless. Sumit Saluja has been a
lifesaver with helping me get my code running on the server. Also, a shoutout to the
softball team – especially the impressive Ed Groth.
Friends
Naturally, I must thank my fellow grad students who’ve been through the wringer with
me and yet made my time here enjoyable. There are too many people to name, so
undoubtedly I have accidentally forgotten some people: I must beg your forgiveness! I’d
v
-
like to thank Aitor Lewkowycz for the science, fun, keen insight, and advice. Aaron
Levy for the innumerable discussions about life, politics, and science. Will Coulton
for always being a good sport and a positive influence in every scenario. Josh Hard-
enbrook for always calling me out when he thinks I am wrong. Dave Zajac for helping
me ‘study’ for prelims with uncounted games of pool. Christian Jepsen for his impec-
cable taste. Joaquin Turiaci and Debayan Mitra for the many fun nights of beer and
foosball. Shai Chester for the fun and ridiculous stories, but NOT for any ‘help’ in my
work. Farzan Beroz for the many philosophical and science discussions. Lauren McGough
for the many discussions about stat mech, information theory, and life.
Kenan Diab also understands the important things in a grad student’s life: softball,
starcraft, MTG, and beer. Ilya Belopolski for doing many prelim problems together
while DJ’ing with some select music. DJ Strouse for our annual run-ins at APS and
the many good conversations about information theory and machine learning. Bin Xu
for his always cheerful demeanor and great scientific discussions. Mallika Randeria
for her friendship and advice. Tom Hazard, softball captain extraordinaire. Many
thanks to Shawn Westerdale, Anne Gambrel, Guangyong Koh, Ed Young, Matt Her-
nandez, Lee Gunderson, Sarthak Parikh, Grisha Tarnoplskiy, Vlad Kirilin, Matteo
Ippolti, Luca Iliesiu, and Trithep Devakul. Thanks to everyone.
Family
Lastly, I must thank the whole of my family for being so supportive of me since
before I can remember. I come from a unique family, filled with medical doctors and
physicists, such that when I go home I am frequently grilled on my research. Coming
from such a background, it is no surprise that I’ve effectively split the difference
between physics and medicine in this thesis.
It would be hard to overstate the influence my uncle, Jim Sethna, has had on
me: I’ve quite literally followed in his footsteps in getting a PhD in physics from
vi
-
Princeton. Thank you Uncle Jim for all of your advice, support, and even academic
mentorship. I cannot tell you how much it means to me.
My grandparents, Patarasp Sethna, Shirley Sethna, Marjory Sethna, Joshua Lynfield,
and Yelva Lynfield, have always been examples to me, both in their achievements
and morality. Sadly, not all of my grandparents will see me graduate; however,
I am confident that all of them would both be proud and approve of my time here.
I also thank my sisters, Julia and Sharon Sethna, for always providing a ready
distraction when needed.
Finally, I would like to thank my parents Ruth Lynfield and Michael Sethna, with-
out whom not only would this thesis not have been possible but I never would have
been in the position in the first place. Your love, support, direction, and parenting
have got me to this point. Mom, your talents and commitment to helping people are
inspiring. Your work in infectious diseases and epidemiology has clearly colored my
interests. And Dad, your elevation of science and logic above all else has shaped the
way I think. You have frequently ‘joked’, that studying math, physics, and science is
‘holy work’ – a sentiment I certainly share. Thank you both for everything.
vii
-
“The idea is like grass. It craves light, likes crowds, thrives
on crossbreeding, grows better for being stepped on.”
- Ursula K. Le Guin, The Dispossessed
viii
-
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
1 Introduction 1
1.1 Adaptive immune system . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 B cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 T cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 The DNA problem . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 V(D)J recombination . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Repertoire sequencing and analysis . . . . . . . . . . . . . . . . . . . 7
1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Generative Model 9
2.1 V(D)J recombination models . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 VDJ generative model . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 VJ generative model . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.4 Pgen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Model Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
ix
-
2.2.1 Entropy of Precomb . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Entropy of Pgen . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 The Pgen distribution . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Errors and Mismatches . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Expectation Maximization algorithm . . . . . . . . . . . . . . 24
2.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 V(D)J recombination to sequences: Precomb → Pgen 28
3.0.1 Probability Spaces (mathematical aside) . . . . . . . . . . . . 29
3.1 Too many states! The free energy problem . . . . . . . . . . . . . . . 29
3.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 OLGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Notation, 3′ and 5′ vectors . . . . . . . . . . . . . . . . . . . . 34
3.3.2 VDJ recombination: V, M, D, N, and J . . . . . . . . . . . . 37
3.3.3 VJ recombination: V, M, and J . . . . . . . . . . . . . . . . . 43
3.3.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.5 Comparison to existing methods . . . . . . . . . . . . . . . . . 46
3.4 Some applications of OLGA computed Pgen . . . . . . . . . . . . . . . 48
3.4.1 Pgen distributions and diversity . . . . . . . . . . . . . . . . . 48
3.4.2 Generation probability of epitope-specific TCRs . . . . . . . . 49
3.4.3 Predicting the frequencies . . . . . . . . . . . . . . . . . . . . 51
3.4.4 Generation probability of sequence motifs . . . . . . . . . . . 53
4 The repertoires ‘Of Mice and Men’ 55
4.1 Of Mice... (mouse TRB) . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.1 Generative model . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Changing insertion profile → Increasing diversity . . . . . . . 58
x
-
4.1.3 Mixture mode . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.4 Toy model of mouse repertoire maturation . . . . . . . . . . . 64
4.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 ...and Men (human IGH) . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.1 Analysis approach . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Generative Model, Allele identification . . . . . . . . . . . . . 68
4.2.3 Hypermutation . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Sharing 74
5.1 The Sharing Distribution . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1.1 Analytical calculation of the sharing distribution from the Pgen
distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.1.2 Sharing modified by selection . . . . . . . . . . . . . . . . . . 81
5.2 Extrapolation to full repertoires and beyond . . . . . . . . . . . . . . 83
5.3 Predicting the publicness of sequences . . . . . . . . . . . . . . . . . 86
5.3.1 Sharing and TCR generation probability . . . . . . . . . . . . 86
5.3.2 PUBLIC: Classifier of public vs. private TCRs based on gener-
ation probability . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Conclusion 93
A Information Theory 96
A.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
A.2 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
A.3 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . . . . . 99
B Probabilistic vs Deterministic inference 100
xi
-
C Proof of Expectation Maximization algorithm 103
D Mouse Appendix 105
D.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
D.2 Model parameters and validation . . . . . . . . . . . . . . . . . . . . 106
E Human B cells Appendix 113
E.1 Repertoire entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
E.2 Inference of alleles and their chromosome distribution . . . . . . . . . 114
E.3 Model parameters and validation . . . . . . . . . . . . . . . . . . . . 116
F Sharing Appendix 122
F.1 Sampling effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
F.2 Monte Carlo simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 124
F.2.1 Sequence data . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Bibliography 126
xii
-
List of Tables
3.1 Distance metrics for OLGA VDJ validation . . . . . . . . . . . . . . 45
3.2 Time performance and scaling of possible methods. . . . . . . . . . . 47
3.3 Pfuncgen of TCR motifs . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Pgen of invariant T cell (iNKT and MAIT cells) TRA motifs . . . . . 54
4.1 Breakdown of B cell sequences and models . . . . . . . . . . . . . . . 67
D.1 Mouse dataset summary . . . . . . . . . . . . . . . . . . . . . . . . . 106
E.1 Heterozygous V allele information (Individual A) . . . . . . . . . . . 116
E.2 Heterozygous D and J allele information (Individual A) . . . . . . . . 116
F.1 Mice dataset sample sizes . . . . . . . . . . . . . . . . . . . . . . . . 125
xiii
-
List of Figures
1.1 Schematic of VDJ recombination . . . . . . . . . . . . . . . . . . . . 5
2.1 Distribution functions: P (−E = log Pgen) . . . . . . . . . . . . . . . . 19
3.1 CDR3 indexing cartoon . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Validation of OLGA VDJ algorithm . . . . . . . . . . . . . . . . . . . 44
3.3 Validation of OLGA VJ algorithm . . . . . . . . . . . . . . . . . . . . 46
3.4 Precomb and Pgen distributions . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Pgen of human TRB sequences for hepatitis C and influenza A epitopes. 50
3.6 Pgen distributions for virus specific TRB sequences . . . . . . . . . . . 51
3.7 Scatter of mean occurrence frequencies vs Pgen . . . . . . . . . . . . . 52
4.1 Age-dependent insertion length distributions . . . . . . . . . . . . . . 56
4.2 Sequence entropy for thymic repertoires . . . . . . . . . . . . . . . . . 59
4.3 Repertoire maturation schematic . . . . . . . . . . . . . . . . . . . . 61
4.4 Mean effective TdT level ᾱ and entropy vs age . . . . . . . . . . . . . 63
4.5 Amount of mixing: variance of α vs age . . . . . . . . . . . . . . . . . 64
4.6 Allele organization on chromosomes . . . . . . . . . . . . . . . . . . . 69
4.7 Sequence dependence of somatic hypermutations . . . . . . . . . . . . 71
5.1 Pipeline for computing the distribution of shared sequences . . . . . . 76
5.2 Sharing distribution for 14 mice . . . . . . . . . . . . . . . . . . . . . 78
5.3 Sharing distribution for 658 humans . . . . . . . . . . . . . . . . . . . 79
xiv
-
5.4 Number of unique CDR3s in pooled repertoires . . . . . . . . . . . . 84
5.5 Fraction of total repertoire composed of ‘public’ sequences . . . . . . 85
5.6 Mouse Pgen distributions by sharing number . . . . . . . . . . . . . . 87
5.7 Human Pgen distributions by sharing number . . . . . . . . . . . . . . 88
5.8 PUBLIC classifier schematic . . . . . . . . . . . . . . . . . . . . . . . 89
5.9 Performance of the PUBLIC classifier . . . . . . . . . . . . . . . . . . 90
B.1 Probabilistic vs Deterministic marginal distributions . . . . . . . . . . 101
D.1 Gene usages by mouse age . . . . . . . . . . . . . . . . . . . . . . . . 107
D.2 Deletion profiles by mouse age . . . . . . . . . . . . . . . . . . . . . . 108
D.3 Frequencies of non-templated insertions . . . . . . . . . . . . . . . . . 109
D.4 Mouse model MI validation . . . . . . . . . . . . . . . . . . . . . . . 110
D.5 Variation of V and J gene usage across biological replicates . . . . . . 111
D.6 Variation of deletion profiles across biological replicates . . . . . . . . 112
E.1 Entropy of B cell model . . . . . . . . . . . . . . . . . . . . . . . . . 113
E.2 B cell gene usages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
E.3 B cell deletion profiles . . . . . . . . . . . . . . . . . . . . . . . . . . 118
E.4 B cell non-templated nucleotide frequencies . . . . . . . . . . . . . . . 119
E.5 PinsVD and PinsDJ over replicates . . . . . . . . . . . . . . . . . . . . . 120
E.6 B cell model MI validation . . . . . . . . . . . . . . . . . . . . . . . . 120
E.7 B cell model insertion Markov model validation . . . . . . . . . . . . 121
F.1 Downsampling in sharing analyses . . . . . . . . . . . . . . . . . . . . 123
xv
-
Chapter 1
Introduction
1.1 Adaptive immune system
The adaptive immune system evolved to provide animals with a precision tool to
identify and remove anything ‘foreign’ to the animal. This is done by having a
large library, or repertoire, of proteins called receptors that bind specifically to some
small fragment of a protein called an epitope or antigen. This binding or affinity
is determined by physical properties such as electrostatics, hydrophobicity, van der
Waals forces, steric effects, etc. By specificity we mean that this receptor will only
bind to a very limited number of epitopes and have only limited affinity for other
epitopes1. Crucially, this specificity allows the adaptive immune system to weed
out any receptors which recognize self peptides which would trigger an autoimmune
response. However, this repertoire must be large and diverse enough to be able to
identify any foreign peptide to ensure that microbes and cancerous cells are quickly
identified and dealt with. In this thesis we will characterize just how staggeringly
diverse these adaptive immune system repertoires are.
In order to generate and regulate these receptors, the adaptive immune system
has a special class of cells called lymphocytes, of which there are two main subtypes:
1Frequently the amount of ‘cross-reactivity’ is assumed to be negligible
1
-
B cells and T cells. Each lymphocyte has a single receptor, of which it expresses many
copies, in order to recognize epitopes. These lymphocyte receptors are protein com-
plexes composed of two amino acid chains, a larger one and a smaller one. Each chain
has largely conserved portions (in order to standardize the way the adaptive immune
system uses these receptors) along with highly variable regions that provide the spe-
cific binding to epitopes. The most highly variable region, and the one that largely
determines the affinity of a receptor to an epitope, is called the complementarity-
determining region 3 or CDR32. We will often be a little sloppy and refer to the
‘receptor’ and the CDR3 of a single chain interchangeably. Once a ‘naive’ lympho-
cyte is activated by specifically binding to an epitope, it will proliferate and some of
these cells will be archived as ‘memory’ cells to quickly reactivate and eliminate the
antigen if the organism is ever exposed to it again.
1.1.1 B cells
B cells are lymphocytes that produce, and secrete, receptors called antibodies. Anti-
bodies are composed of a heavy chain (IGH) and a light chain (IGL). These receptors
can either be free in the plasma or expressed on the membrane of B cells3. These
antibodies bind specifically to antigens. An antibody bound to an antigen serves as a
tag for the rest of the immune system to attack the antigen. Furthermore, antibodies
can directly neutralize microbes by binding to surface proteins and ‘gumming up’ their
operation. Foreign peptides in solution can also be made to precipitate by antibodies
coagulating many of the peptides together.
2There are two other variable loops, CDR1 and CDR2, that are determined by the V germline templates. As a result the variation of these loops is limited. While the CDR1 and CDR2 loops are important biologically, particularly for major histocompatibility complex (MHC) recognition of T cells, we focus exclusively on the CDR3 region in this thesis. Unlike the CDR1 and CDR2 loops, the CDR3 region spans the region of the receptor sequence where the DNA editing process called V(D)J recombination occurs (1.2). We define the boundaries of the CDR3 region to be the conserved amino acid residues cysteine (C) on the 5′ end and a phenylalanine (F) or tryptophan (W) on the 3′ end. These conserved residues are important to ensure the receptor folds and works properly.
3If expressed on a membrane an antibody is frequently referred to as a B cell receptor (BCR). We are sometimes sloppy and will refer to antibodies in general as BCRs to parallel TCRs.
2
-
The amazing specificity of antibodies is generated through a process called hy-
permutation [Teng and Papavasiliou, 2007]. Following the successful recognition of
an antigen, a B cell proliferates and its receptor sequence undergoes random point
mutations. These cells are then selected for affinity to the epitope. The result is an
evolutionary process within a single individual, producing receptors with dramatically
increased affinity to the epitope. We will present a quantitative model of hypermu-
tation in chapter 4.
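The logic of this affinity-maturation loop (proliferate with random point mutations, then select for binding) can be sketched with a toy simulation. Everything below is invented for illustration: affinity is crudely proxied by similarity to a fixed ‘optimal binder’ string, and the mutation rate and selection sizes are arbitrary.

```python
import random

random.seed(0)
ALPHABET = "ACGT"
L = 30  # toy receptor length

def mutate(seq, rate=0.05):
    # each position independently has a small chance of being replaced
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in seq)

def affinity(seq, target):
    # toy affinity: number of positions matching a fixed 'optimal binder'
    return sum(a == b for a, b in zip(seq, target))

target = "".join(random.choice(ALPHABET) for _ in range(L))
pool = ["".join(random.choice(ALPHABET) for _ in range(L))]  # one naive cell

for _ in range(20):  # rounds of proliferation + selection
    offspring = [mutate(s) for s in pool for _ in range(20)]
    pool = sorted(offspring, key=lambda s: affinity(s, target), reverse=True)[:5]

print(affinity(pool[0], target))  # best affinity after selection
```

Even this crude loop exhibits the qualitative behavior described above: repeated mutation and selection within a single ‘individual’ drives the pool toward much higher affinity than the starting random sequence.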
1.1.2 T cells
Although antibodies bind directly to epitopes in solution, T cells have their epitope
recognition mediated by other cells. In animals with adaptive immune systems, cells
display a protein complex called major histocompatibility complex (MHC) on their
membrane. This protein complex can then be ‘loaded up’ with a peptide fragment by
the cell, and a T cell receptor (TCR) can then recognize the peptide - MHC complex
(pMHC)4. Cells load up the MHC complex with chopped up peptides internal to the
cell, giving the T cell a snapshot of the current protein synthesis of the cell. This
provides an excellent mechanism for the T cell to be able to identify if a cell was
infected by a virus or has become cancerous. Also, if a cell is infected by a virus it is
possible that peptides internal to the viral capsid (and thus not an accessible epitope
to an antibody/BCR) could be loaded up into pMHC, providing additional epitopes
for the adaptive immune system to tag.
Similar to antibodies, TCRs are composed of two chains, an α chain (TRA) and
a β chain (TRB). Ideally we would analyze the full receptor composed of TRA-TRB
pairs; however, it is experimentally difficult to perform high-throughput sequencing that
4This is the interaction between cytotoxic or CD8+ T cells and the MHC I complex. There is an additional MHC complex (MHC II) that is expressed by a class of cells called antigen presenting cells (APC) that actively uptake and present peptides. There are also several other classes of T cells, which perform a variety of roles. For the purposes of this thesis we focus on CD8+ T cells and the MHC I complex.
3
-
accurately pairs TRA and TRB chains. Instead, many sequencing analyses focus on
only one chain. For much of this thesis we will focus on TRBs in both humans and
mice, as the TRB chain is not only much more diverse than the TRA chain, but is also
the chain that determines much of the receptor-epitope specificity.
1.1.3 The DNA problem
The massive diversity of receptors needed for a functioning repertoire poses a very
interesting problem. These receptors are proteins, coded for by DNA sequences. Each
unique receptor demands a unique DNA sequence. The number of unique receptors
in a repertoire utterly dwarfs the number of coding genes in a genome. For example,
a human TRB repertoire might have 10^8−10^10 unique receptors, whereas the number
of coding genes in the human genome is approximated to be of the order of 10^4−10^5.
Clearly the human genome cannot directly store the DNA sequences of every receptor
in a repertoire. This prompts the question of how such a diversity of receptors can be
generated from limited DNA.
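A back-of-the-envelope count shows why a recombination process can resolve this mismatch. The gene counts and event ranges below are rough illustrative assumptions, not the exact values inferred later in the thesis.

```python
# Order-of-magnitude count of VDJ recombination scenarios for a TRB-like
# locus. All numbers are illustrative assumptions.
n_V, n_D, n_J = 50, 2, 13   # rough germline gene counts
n_del = 15                  # deletion options per cut site (~4 cut sites)
n_ins = 10                  # typical insertions per junction (~2 junctions)

gene_choices = n_V * n_D * n_J
deletion_choices = n_del ** 4
insertion_choices = (4 ** n_ins) ** 2  # 4 possible nucleotides per inserted position

total = gene_choices * deletion_choices * insertion_choices
print(f"~{total:.1e} recombination scenarios")  # far more than 10^10 receptors

# compared against ~10^4-10^5 coding genes in the genome:
print(total / 1e5)
```

A handful of germline templates, combined multiplicatively with random deletions and insertions, easily exceeds the 10^8−10^10 unique receptors a repertoire needs.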
1.2 V(D)J recombination
The solution to the apparent conundrum laid out in the previous section is a process
called V(D)J recombination wherein the actual DNA sequences of developing B cells
and T cells get recombined, generating novel genes that translate to unique CDR3
amino acid sequences. While highly regulated, this process allows the adaptive im-
mune repertoire to generate the necessary diversity to specifically recognize foreign
antigens/epitopes. This discovery led to Susumu Tonegawa’s 1987 Nobel Prize in
Medicine [Hozumi and Tonegawa, 1976]. The rest of the thesis will involve proba-
bilistically modeling this V(D)J recombination.
4
-
Figure 1.1: Schematic of VDJ recombination
Simplification of the stages of VDJ recombination for TRB. Shows the arrangement
of example V, D, and J genes on the chromosome, along with the RSS regions (orange
stripes). For the TRB gene locus the D and J genes are arranged as above, which
implies the topological constraint that D2 and J1-∗ genes are never jointly used.
Non-templated nucleotides, indicated by N1 and N2, are inserted at the VD and DJ
junctions by the TdT complex.
V(D)J recombination has become an extremely well studied process over the past
40 years and the critical enzymes have been identified and studied. Of particular
interest to this thesis will be the enzymes recombination activating genes (RAG) 1
and 2, and terminal deoxynucleotidyl transferase (TdT), both of which are uniquely
expressed in lymphocytes. VDJ recombination leads to the generation of sequences
that produce IGH and TRB chains, while VJ recombination produces IGL and TRA
chains.
Before recombination, the germline chromosome has two or three types of genetic
templates: variable (V), diversity (D), and joining (J). For each type of template,
there are multiple genes (e.g. there are 35 TRBV genes in mice) which are identi-
5
-
fied by immediately adjacent, highly stereotyped, 7-mer nucleotide sequences called
recombination signal sequences (RSS). During VDJ recombination5, RAG enzymes
bind specifically to the RSS of a J gene and of a D gene and make an incision that cuts
out the intervening DNA. This cutting of the DNA can be messy, possibly deleting
away parts of the D and J genes, or leaving some single-stranded DNA hanging, which
will get repaired by inserting reverse complementary palindromic nucleotides. The
D and J genes are then spliced together, possibly with non-templated nucleotide in-
sertions from the TdT enzyme. A similar slicing and splicing process then happens
at the V-D junction.
To remove the biology, and make this clear on an abstract level, VDJ recombination
acts by choosing a particular gene (a string of nucleotides) for each of the V, D,
and J segments, deleting away some of the nucleotides of those genes (or inserting
reverse palindromic nucleotides), and then inserting random nucleotides at the VD
and DJ junctions as the sequence is spliced together to read (from 5′ to 3′) VDJ.
This provides a new DNA sequence, where all of the edits (splicing, deleting, and
inserting) correspond to the CDR3 region of the receptor.
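Viewed this way, a toy VDJ generator is only a few lines of string manipulation. The ‘germline’ segments and the deletion/insertion ranges below are made-up placeholders, not real TRB genes.

```python
import random

random.seed(1)

# Toy germline segments (placeholders, not real TRB sequences)
V_GENES = ["TGTGCCAGCAGC", "TGTGCCACCAGT"]
D_GENES = ["GGGACAGGG", "GGGACTAGC"]
J_GENES = ["AACACTGAAGCTTTCTTT", "TATGGCTACACCTTC"]

def random_insert(max_len=6):
    """Non-templated nucleotides added at a junction (TdT's role)."""
    return "".join(random.choice("ACGT") for _ in range(random.randint(0, max_len)))

def recombine():
    """One toy VDJ event: choose genes, delete from the joining ends,
    insert random nucleotides, and splice 5' to 3' as V-D-J."""
    v = random.choice(V_GENES)
    d = random.choice(D_GENES)
    j = random.choice(J_GENES)
    v = v[:len(v) - random.randint(0, 3)]                      # delete 3' end of V
    d = d[random.randint(0, 2):len(d) - random.randint(0, 2)]  # delete both ends of D
    j = j[random.randint(0, 3):]                               # delete 5' end of J
    return v + random_insert() + d + random_insert() + j

for _ in range(3):
    print(recombine())
```

Each call produces a different CDR3-like nucleotide string from the same small set of templates, which is exactly the combinatorial mechanism the text describes.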
This V(D)J recombination process has no guarantee of success, or of producing
a DNA sequence that can translate to a functional protein. As there are random
numbers of deletions and insertions, the DNA sequence may have frame shifts or stop
codons in it. If this happens, and a V(D)J recombination event on a chromosome
leads to a nonproductive sequence, the cell may try again on the second chromosome.
If this second recombination leads to a functional receptor, the cell will have two
rearranged chromosomes: one functional and expressed, and the nonfunctional one
silenced by allelic exclusion. This fortunate quirk will prove crucial later in this thesis.
Once a T cell or B cell has a functional receptor there is some quality control that
occurs. The cell undergoes both positive selection (e.g. checking a TCR interacts
5In VJ recombination, there is no D gene, and the V and J genes are directly spliced together
6
-
well with MHC) and negative selection (i.e. removing cells with high affinity to self
epitopes). This somatic selection process is crucial to ensure both useful receptors
and to prevent autoimmune responses and skews the repertoire on a statistical level.
Models characterizing the statistics of this selection process have been introduced by
my collaborators, particularly Yuval Elhanati [Elhanati et al., 2014], and are discussed
in the papers that are referenced in chapter 4 [Elhanati et al., 2015, Sethna et al.,
2017].
1.3 Repertoire sequencing and analysis
Advances in high throughput sequencing [Robins et al., 2010a] have allowed for large
scale sequencing of lymphocytes in a blood or tissue sample: the sample is broken
down, the DNA extracted, and specialized primers amplify the DNA sequence of the
CDR3 region before sequencing. Such experiments are now becoming so routine that
there is interest in using them for medical diagnostic and immunotherapy purposes.
Almost all of the data discussed in this thesis was sequenced using a protocol pio-
neered by Harlan Robins [Robins et al., 2010a], who has started a company, Adaptive
Biotechnologies, to provide repertoire sequencing services.
These experiments can successfully sequence millions of cells (or more), producing
datasets of ∼10⁴–10⁶ unique DNA sequences. The availability of datasets of
such size and quality allows for serious statistical analyses to quantify the underly-
ing biology as well as the possibility to explore more theoretical questions. Being
physicists, the approach we will take in this thesis is to construct a statistical model,
i.e. a parameterized probability distribution, of V(D)J recombination that reflects
the underlying biological processes. These large datasets are then used to infer the
model parameters. The model parameters will provide quantitative descriptions of the
V(D)J recombination machinery, and the model itself provides a distribution of the
probability of generating any receptor (Pgen) that can be used to answer theoretical
questions like characterizing the diversity of a repertoire.
1.4 Organization of thesis
This thesis is broken into two main parts. The first covers chapters 2 and 3 and
provides the mathematical framework for the rest of the thesis. The class of generative
models used to analyze the generation probability (Pgen) of adaptive immune system
repertoires (first introduced in Murugan et al. [2012]) is described, and the inference
process, expectation maximization (EM), used to fit the model parameters is laid out.
We also show how one of the main metrics we use, the entropy of a model, can be
computed and broken down into different components. In addition, the computational
challenges associated with computing the Pgen of sequences are discussed, in particular the
exponential explosion of the number of recombination events that generate amino acid
CDR3 sequences. We then demonstrate the novel dynamic programming algorithm,
OLGA [Sethna et al., 2018], that we developed to efficiently solve this problem and
make the computation of Pgen for amino acid CDR3 sequences not only tractable, but
fast.
The second part, spanning chapters 4 and 5, dives into the applications of the
modeling framework defined in the first part. The first part of chapter 4 describes
the work from Sethna et al. [2017] analyzing the maturation of mouse repertoires from
embryo to young adult. The second half of chapter 4 lays out a model quantifying
hypermutation in B cells [Elhanati et al., 2015]. Finally, chapter 5 demonstrates how
Pgen explains the curious observation of so-called ‘public’ sequences.
Chapter 2
Generative Model
2.1 V(D)J recombination models
The definition, selection, and inference of a generative model of V(D)J recombination
is the foundation for all of the work that comes later. Such a generative model defines
a probability measure over the state space of V(D)J recombination events, which can
be extended to define probabilities of particular receptors or collections of receptors.
We begin by introducing a general model framework by requiring that the model
respects the biology of the V(D)J recombination process. To do this we define the
state (sample) space of V(D)J recombination events by combinations of the stochastic
events in the DNA splicing itself (i.e. gene choice, deletions/palindromic insertions,
and insertions). For example, we can describe the state (sample) space of VDJ
recombination events as:
\Omega_e = \{(V, D, J, d_V, d_D, d'_D, d_J, \{m_i\}, \{n_i\})\} \qquad (2.1)
where V, D, and J are the gene choices; d_V, d_D (5′/left), d′_D (3′/right), and d_J
are the deletions (including palindromic insertions); and {m_i} and {n_i} are the specific
nucleotide sequences inserted at the VD and DJ junctions, respectively.
This also allows us to define a fully general model family for the recombination event
e ∈ Ωe:
P_{recomb}(e) = P(V, d_V, \{m_i\}, d_D, D, d'_D, \{n_i\}, d_J, J) \qquad (2.2)
We cannot use the fully general model above, which defines a unique probability for
each combination of recombination events, due to the exponential explosion of param-
eters. The challenge is to construct sub-models which have few enough parameters
to be inferred, yet still sufficiently describe the observed sequences. In general this is
done by positing the independence and dependence of the various splicing events and
then checking if the factorization captures the necessary correlations. The specific
models used are factorized to reflect the spatial correlations along the chromosome.
For VDJ recombination, these models assume that the V choice is independent
of the D/J choice (the latter two being correlated by virtue of the order in which
the genes are laid out on the chromosome, see Fig. 1.1), the deletion profiles depend
only on the gene choice, and lastly that the insertions are independent of the genomic
contributions and each other. There is still an exponential blowup of parameters
unless a simpler (fewer parameters) model for the inserted sequences is introduced.
We use a model that is a product of a length distribution and a dinucleotide Markov
model. This model factorization and dinucleotide Markov model is first introduced
and validated in Murugan et al. [2012], however important exceptions to this factor-
ization will be discussed in Chapter 4 in the contexts of mouse T cells [Sethna et al.,
2017] and human B cells [Elhanati et al., 2015]. For VJ recombination, these models
assume the V and J choices are correlated, the deletion profiles depend only on the gene choice,
and lastly the insertion region is independent of the genomic contribution.
Footnote 1: The subscript index i is read from 5′ to 3′.
2.1.1 VDJ generative model
The VDJ recombination model is defined as:
\begin{aligned}
P_{recomb}(e) ={} & P_{V}(V)\,P_{DJ}(D,J)\,P_{delV}(d_V|V)\,P_{delJ}(d_J|J)\,P_{delD}(d_D,d'_D|D) \\
& \times P_{insVD}(\ell_{VD})\,p_0(m_1)\left[\prod_{i=2}^{\ell_{VD}} S_{VD}(m_i|m_{i-1})\right] \\
& \times P_{insDJ}(\ell_{DJ})\,q_0(n_{\ell_{DJ}})\left[\prod_{i=1}^{\ell_{DJ}-1} S_{DJ}(n_i|n_{i+1})\right] \qquad (2.3)
\end{aligned}
where the inserted nucleotide sequences {m_i} and {n_i} have lengths ℓ_VD and ℓ_DJ with
insertion length distributions P_insVD(ℓ_VD) and P_insDJ(ℓ_DJ), S_VD and S_DJ are the respective
dinucleotide Markov transition matrices, and finally, p_0 and q_0 are the nucleotide
biases for the first insertion at each junction. Note that an inserted sequence of
length 0 (i.e. no insertions at a junction) is also allowed and has probability P_insVD(0)
or P_insDJ(0), depending on the splicing junction.
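To make the factorized structure concrete, here is a minimal Monte Carlo sketch that draws gene choices and a VD insertion string from a toy model of this form. The marginal values are illustrative, not inferred parameters, and the deletion factors (which sample the same way) are omitted for brevity:

```python
import random

# Toy marginals for a miniature factorized VDJ model (illustrative numbers).
P_V   = {"V1": 0.6, "V2": 0.4}
P_DJ  = {("D1", "J1"): 0.5, ("D1", "J2"): 0.3, ("D2", "J2"): 0.2}
P_ins = {0: 0.2, 1: 0.5, 2: 0.3}                      # P_insVD(l)
p0    = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # first-nucleotide bias
S     = {m: {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}  # dinucleotide S(n|m)
         for m in "ACGT"}

def draw(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point round-off

def sample_event(rng):
    """Sample (V, D, J, VD insertion string) from the factorized model."""
    V = draw(P_V, rng)
    D, J = draw(P_DJ, rng)
    length = draw(P_ins, rng)
    ins = ""
    for _ in range(length):
        ins += draw(p0 if not ins else S[ins[-1]], rng)
    return V, D, J, ins

rng = random.Random(0)
events = [sample_event(rng) for _ in range(10000)]
frac_V1 = sum(e[0] == "V1" for e in events) / len(events)
```

Because the factors are independent, each marginal can be sampled on its own; the empirical V1 frequency converges to P_V(V1).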
2.1.2 Model Validation
As mentioned above, it is important to check that the factorization of the model
structure is correct. To address this issue, we examine the correlations between
various marginal variables of the model (i.e. the stochastic recombination events: V,
delV, J, insVD, etc) by examining the mutual information of each pair.
To determine if we have captured the correct correlations in the data, we compare
the precise mutual information computed directly from the model, to the estimated
mutual information determined by the expectation over the data (using the Treves-
Panzeri correction [Treves et al., 1998] to account for finite sample size).
The generative model has zero mutual information, by construction, between in-
dependent marginal pairs, e.g. the number of VD insertions and the choice of J gene.
Footnote 2: Note, we often make the further approximation that the insertion Markov model is at steady state, i.e. we set p_0 and q_0 to be the steady-state distributions of S_VD and S_DJ respectively.
Variables that correlate with each other either directly or indirectly, e.g. between D
and J gene choice, or between D choice and number of D deletions may have non-zero
mutual information. In order to quickly gauge if a model is consistent (or inconsis-
tent) with the model factorization, we use plots like Fig. D.4, where the MI computed
from the model is below the diagonal and the expectation over the data is above the
diagonal. If the plot is symmetric about the diagonal, then the model is self consis-
tent with the data. Indeed, the total missed mutual information is, to leading order,
precisely the amount of information our factorized model missed due to its structure.
To validate the dinucleotide Markov model for insertions, we compare the expected
trinucleotide frequencies to the observed trinucleotide frequencies.
We will perform these checks in Chapter 4 when we look at mouse T cells [Sethna
et al., 2017] and human B cells [Elhanati et al., 2015].
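The check itself is simple to sketch. Below is a plug-in mutual information estimator over paired samples of two marginal variables; the thesis additionally applies the Treves-Panzeri finite-sample correction, which is omitted here for brevity:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts
    px = Counter(x for x, _ in pairs)    # marginal counts
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        # (c/n) * log2( p(x,y) / (p(x) p(y)) ), with counts substituted in
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Sanity checks: independent pair -> 0 bits; fully correlated pair -> H(X).
indep = [(x, y) for x in "AB" for y in "AB" for _ in range(25)]
corr  = [(x, x) for x in "AB" for _ in range(50)]
```

Applied once to Monte Carlo draws from the model and once to annotated data, a symmetric model/data comparison of these estimates is exactly the consistency check described above.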
2.1.3 VJ generative model
Analogous to the VDJ model, we define the model factorization for the generative
model of VJ recombination. The primary distinction is that there is no D gene, nor
is there an N2 insertion region (DJ junction). Also, as there is evidence of repeated
splicing attempts for the TCRα chain, the V and J gene usages are allowed to be
correlated [Elhanati et al., 2016].
P_{recomb}(e) = P_{VJ}(V,J)\,P_{delV}(d_V|V)\,P_{delJ}(d_J|J)\,P_{insVJ}(\ell_{VJ})\,p_0(m_1)\left[\prod_{i=2}^{\ell_{VJ}} S_{VJ}(m_i|m_{i-1})\right] \qquad (2.4)
2.1.4 Pgen
Our model, Precomb, defines a probability measure over the state (sample) space Ωe of
recombination events. However, this model can, in theory, be extended to other state
(sample) spaces of much more interest scientifically and biologically. In particular,
we examine the state spaces of DNA nucleotide sequence reads, CDR3 nucleotide
sequences, and CDR3 amino acid sequences (or collections/motifs of amino acid CDR3
sequences). This is done by summing over all recombination events that generate one
of the ‘coarse grained’ states to give the probability of generating a particular CDR3
sequence or receptor.
P_{gen}(seq) = \sum_{e|seq} P_{recomb}(e) \qquad (2.5)
This generation probability, or ‘Pgen’, of a sequence or receptor will be used con-
tinuously throughout this thesis. We will return to this idea of extending or ‘coarse
graining’ the probability space in greater detail in Chapter 3.
2.2 Model Entropy
Before we introduce our method for inferring the model parameters, we first introduce
a concept that we will return to repeatedly: the entropy of a model. One of the
advantages of having a probabilistic model of V(D)J recombination is that we can
use the (Shannon) entropy (Appendix A) of the distribution as a well defined measure
of the ‘diversity’ of a repertoire. We examine the entropy of both Precomb and Pgen.
First we show how to compute the entropy S(Precomb) directly from the model, and
how it decomposes into contributions from the gene choice, the deletions, and the
insertions. We also show explicitly how changing the insertion length distribution
has an outsized impact on the entropy. Then we discuss how to approximate S(Pgen)
by Monte Carlo simulation. Throughout this section we do not specify what units we
want to express the entropy in, however we will most frequently talk about entropy
in units of bits (so the base of the log is 2, i.e. log₂).
Footnote 3: Personally, I think everything should be done in nats (log base e); however, for most people it is easier to parse bits (log base 2) or dits (log base 10).
2.2.1 Entropy of Precomb
The entropy of a VDJ recombination model is:

H(P_{recomb}) = -\langle\log(P_{recomb})\rangle_{\Omega_e} = -\left\langle\log\left(P_V\,P_{DJ}\,P_{delV}\,P_{delJ}\,P_{delD}\,P_{\{m_i\}}\,P_{\{n_i\}}\right)\right\rangle_{\Omega_e} \qquad (2.6)
Now, we can break the total entropy expression into independent components, and
compute the entropy of each of the components independently.
Genes/Deletions entropic contribution
The gene/deletion contributions are fairly straightforward to compute. Examining
the V templates:
\begin{aligned}
H(P_V(V)\,P_{delV}(d_V|V)) &= -\sum_{V,d_V} P_V(V)\,P_{delV}(d_V|V)\left[\log(P_V(V)) + \log(P_{delV}(d_V|V))\right] \\
&= -\sum_V P_V(V)\log(P_V(V)) - \sum_{V,d_V} P_V(V)\,P_{delV}(d_V|V)\log(P_{delV}(d_V|V)) \\
&= H(P_V) + \sum_V P_V(V)\,H(P_{delV}(d_V|V)) \\
&= H(P_V) + \langle H(P_{delV})\rangle_V \qquad (2.7)
\end{aligned}
In an analogous fashion we can determine H(P_DJ), ⟨H(P_delD)⟩_D, and ⟨H(P_delJ)⟩_J. We
say that the entropy contribution from the choice of germline template is H(P_V) +
H(P_DJ), while the deletion entropic contribution is ⟨H(P_delV)⟩_V + ⟨H(P_delD)⟩_D +
⟨H(P_delJ)⟩_J.

Footnote 4: We indicate entropy by H rather than S in this section so as not to confuse notation with the dinucleotide transition matrices S_VD and S_DJ.
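The decomposition in Eq. 2.7 is easy to verify numerically on a toy model: the gene-choice entropy plus the usage-weighted deletion entropy reproduces the entropy of the joint distribution (the numbers below are illustrative):

```python
from math import log2

def H(dist):
    """Shannon entropy in bits of a {outcome: probability} dict."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Toy V usage and per-gene deletion profiles (illustrative numbers).
P_V    = {"V1": 0.5, "V2": 0.5}
P_delV = {"V1": {0: 0.5, 1: 0.5},   # 1 bit of deletion entropy given V1
          "V2": {0: 1.0}}           # deterministic given V2: 0 bits

# Eq. 2.7: H(P_V) + <H(P_delV)>_V
gene_term = H(P_V)
del_term  = sum(P_V[v] * H(P_delV[v]) for v in P_V)
total     = gene_term + del_term

# Direct check against the entropy of the joint distribution P(V, dV).
joint = {(v, d): P_V[v] * p for v in P_V for d, p in P_delV[v].items()}
```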
Insertion entropic contribution
The entropy of the insertions is much trickier to compute as we will have to sum
the Markov model probabilities over all possible insertion sequences. We drop the
VD/DJ subscripts as the computations are identical.
\begin{aligned}
H(P_{\{m_i\}}) &= -\sum_{\{m_i\}} P_{\{m_i\}}(\{m_i\})\,\log(P_{\{m_i\}}(\{m_i\})) \\
&= -\sum_{\ell}\,\sum_{\{m_i\}|\ell} P_{ins}(\ell)\,P_{\{m_i\}|\ell}(\{m_i\})\left[\log(P_{ins}(\ell)) + \log(P_{\{m_i\}|\ell}(\{m_i\}))\right] \\
&= -\sum_{\ell} P_{ins}(\ell)\log(P_{ins}(\ell)) - \sum_{\ell} P_{ins}(\ell)\sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\})\log(P_{\{m_i\}|\ell}(\{m_i\})) \\
&= H(P_{ins}) - \sum_{\ell} P_{ins}(\ell)\sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\})\log(P_{\{m_i\}|\ell}(\{m_i\})) \qquad (2.8)
\end{aligned}

where,

P_{\{m_i\}|\ell}(\{m_i\}) = p_0(m_1)\left[\prod_{i=2}^{\ell} S(m_i|m_{i-1})\right]. \qquad (2.9)
In order to make the dependence of this entropy on the average insertion length
(〈`〉) more explicit we will make the approximation that the Markov model is at
steady-state (i.e. p0 = pss, the steady-state distribution of S).
We will now prove inductively that for ` ≥ 1:
\begin{aligned}
H(P_{\{m_i\}|\ell}) &= -\sum_{\{m_i\}|\ell} P_{\{m_i\}|\ell}(\{m_i\})\log(P_{\{m_i\}|\ell}(\{m_i\})) \\
&= H(p_{ss}) - (\ell - 1)\sum_m p_{ss}(m)\sum_n S(n|m)\log(S(n|m)) \qquad (2.10)
\end{aligned}
Initial Step: ` = 1
This is trivial, as P_{\{m_i\}|\ell=1}(\{m_i\}=m) = p_0(m) = p_{ss}(m), so by direct computation:

-\sum_{\{m_i\}|\ell=1} P_{\{m_i\}|\ell}(m)\log(P_{\{m_i\}|\ell}(m)) = -\sum_m p_{ss}(m)\log(p_{ss}(m)) = H(p_{ss}) \qquad (2.11)
Inductive step
Assuming we have shown that Eq. 2.10 is true for ℓ ≤ k, we prove it holds for
ℓ = k + 1.
\begin{aligned}
&-\sum_{\{m_i\}|\ell=k+1} P_{\{m_i\}|\ell}(\{m_i\})\log(P_{\{m_i\}|\ell}(\{m_i\})) \\
&= -\sum_{m_{k+1}}\sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\,P_{\{m_i\}|k}(\{m_{i\le k}\})\left[\log(S(m_{k+1}|m_k)) + \log(P_{\{m_i\}|k}(\{m_{i\le k}\}))\right] \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}}\sum_{\{m_{i\le k}\}} S(m_{k+1}|m_k)\,P_{\{m_i\}|k}(\{m_{i\le k}\})\log(S(m_{k+1}|m_k)) \\
&= H(P_{\{m_i\}|k}) - \sum_{m_{k+1}}\sum_{m_k} S(m_{k+1}|m_k)\log(S(m_{k+1}|m_k))\sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\,P(\{m_{i\le k-1}\}|k-1) \qquad (2.12)
\end{aligned}
Now, in order to do the summation in the second term, we observe that the conditional
terms depend only on the last two nucleotides, m_{k+1} and m_k, so we
would like the marginal distribution

p_k(m_k) = \sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\,P(\{m_{i\le k-1}\}|k-1) \qquad (2.13)
But, we recall our previous assumption that the Markov process is in its steady state
to know that the marginal distribution is the same as the steady-state distribution
(i.e. p_k = p_ss). Plugging this back in shows
\begin{aligned}
&-\sum_{m_{k+1}}\sum_{m_k} S(m_{k+1}|m_k)\log(S(m_{k+1}|m_k))\sum_{\{m_{i\le k-1}\}} S(m_k|m_{k-1})\,P(\{m_{i\le k-1}\}|k-1) \\
&\qquad = -\sum_m p_{ss}(m)\sum_n S(n|m)\log(S(n|m)) \qquad (2.14)
\end{aligned}
which shows the inductive step holds for k + 1 and completes the proof.
Putting everything together, the entropy contributed by a single insertion junction is

H(P_{ins}) + H(p_{ss}) - (\langle\ell\rangle - 1)\sum_m p_{ss}(m)\sum_n S(n|m)\log(S(n|m)) \qquad (2.15)
Note the dependence of this expression on the average number of insertions ⟨ℓ⟩.
We will return to this in chapter 4 when we see that the way a repertoire scales its
diversity is by changing the insertion length distribution.
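Eq. 2.15 is cheap to evaluate once the model marginals are in hand. The sketch below uses a toy two-letter insertion model (illustrative numbers, not an inferred model) and computes the three terms: the length entropy, the first-nucleotide entropy, and the conditional entropy rate of the dinucleotide chain scaled by ⟨ℓ⟩ − 1:

```python
from math import log2

# Toy insertion model over a 2-letter alphabet (illustrative numbers).
P_ins = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}     # length distribution, <l> = 1.5
S     = {"A": {"A": 0.9, "C": 0.1},              # dinucleotide S(n|m)
         "C": {"A": 0.1, "C": 0.9}}
p_ss  = {"A": 0.5, "C": 0.5}                     # steady state of this symmetric S

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Conditional entropy rate: -sum_m p_ss(m) sum_n S(n|m) log2 S(n|m)
h_rate = -sum(p_ss[m] * S[m][n] * log2(S[m][n]) for m in S for n in S[m])

mean_l = sum(l * p for l, p in P_ins.items())
# Eq. 2.15 (steady-state approximation), in bits:
H_junction = H(P_ins) + H(p_ss) + (mean_l - 1) * h_rate
```

Note the sign bookkeeping: h_rate is defined with the minus sign absorbed, so it enters with a plus. Each extra average insertion adds one entropy rate's worth of diversity, which is why the insertion length distribution has such an outsized impact.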
Total entropy of Precomb
\begin{aligned}
H(P_{recomb}) ={}& H(P_V) + H(P_{DJ}) + \langle H(P_{delV})\rangle_V + \langle H(P_{delD})\rangle_D + \langle H(P_{delJ})\rangle_J \\
&+ H(P_{insVD}) + H(p_{ss}) - (\langle\ell_{VD}\rangle - 1)\sum_m p_{ss}(m)\sum_n S_{VD}(n|m)\log(S_{VD}(n|m)) \\
&+ H(P_{insDJ}) + H(q_{ss}) - (\langle\ell_{DJ}\rangle - 1)\sum_m q_{ss}(m)\sum_n S_{DJ}(n|m)\log(S_{DJ}(n|m)) \qquad (2.16)
\end{aligned}
5If we didn’t want to make the steady-state assumption, it is easy to see how using this marginaldistribution would change Eq. 2.10 to:H(P{mi}|`) = H(p0)−
∑`k=2
∑m pk(m)
∑n S(n|m) log(S(n|m))
2.2.2 Entropy of Pgen
The probability distribution of Pgen no longer factorizes after the summation. As a
result we cannot break down the entropy into independent pieces. Instead a different
tack is taken, to estimate the entropy of Pgen.
We recall that the entropy of a distribution is just −〈logP 〉. This means that we
can estimate the entropy of Pgen by taking the expectation value over Monte Carlo
simulated sequences:
S(P_{gen}) \approx -\langle \log(P_{gen}(s))\rangle_{s\in \text{MC sample}} \qquad (2.17)
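A sketch of this estimator on a small toy distribution whose entropy is known exactly (for the real Pgen one would instead draw sequences from the generative model and evaluate their Pgen):

```python
import random
from math import log2

# Stand-in for Pgen: a toy distribution with known entropy of 1.75 bits.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
exact_H = -sum(p * log2(p) for p in P.values())

# Monte Carlo: draw from P, then average -log2 P(s) (Eq. 2.17).
rng = random.Random(1)
outcomes, weights = zip(*P.items())
sample = rng.choices(outcomes, weights=weights, k=200_000)
mc_H = -sum(log2(P[s]) for s in sample) / len(sample)
```

The statistical error shrinks as 1/√N, so a modest sample pins the entropy to a few thousandths of a bit here; the same estimator converges well for repertoire models despite their enormous state spaces.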
2.2.3 The Pgen distribution
Another extremely effective way of visualizing the diversity of a repertoire is to exam-
ine the probability density of the log Pgen of sequences. If a large number of sequences
(or recombination events) are drawn from a model distribution (i.e. Monte Carlo sam-
pling), they can be histogrammed by the log of their generation probabilities. If we
define an energy as E ∼ − log Pgen, this distribution is the probability density P (−E),
and is closely related to the density of states (a connection we will return to in chapter
5). An example of one of these plots is shown for a human TRB model in Fig. 2.1,
demonstrating the massive range of generation probabilities, spanning ∼20 orders of
magnitude. Another very useful aspect of these plots is that the mean of each dis-
tribution is the entropy of the distribution (up to a minus sign), and is indicated as
the dotted lines in Fig 2.1. We frequently use such plots as a way of characterizing
the data visually. It is easy to see shifts to more or less entropic distributions, and
to see any impact on the tails. Furthermore, these plots can be made from the data
directly by histogramming their generation probabilities and the entropy of such a
distribution will again be the mean.
Footnote 6: Please note, when using data sequences, the 'entropy' computed as the mean of the distribution is technically a cross entropy. For the nonproductive sequences we largely focus on in this thesis this is a negligible distinction. However, for in-frame productive
Figure 2.1: Distribution functions P(−E = log Pgen).
Shows the distribution of generation probabilities over 3 different state spaces of the same human TRB model, highlighting the 'coarse graining' of the model from recombination events, to nucleotide sequences, and finally to amino acid sequences/receptors. The dotted lines indicate the mean of each distribution, which is mathematically equivalent to the negative of the entropy of each distribution. The entropy of the distributions decreases as they become more coarse grained.
2.3 Inference
The data which is used to infer these models comes from high-throughput Illumina
sequencing [Robins et al., 2010a] and is organized as a collection of DNA sequences
of around 60-200 base pairs. We will want to infer the parameters of the generative
model that most accurately reflect the sequences observed in the experiment. Without
a principled prior that significantly biases the distribution (note, the Jeffreys prior is
remarkably flat for these generative models), the parameters are inferred by way of
sequences this is not an irrelevant concern, as the distributions are noticeably skewed towards higher generation probabilities due to somatic selection. See Elhanati et al. [2014] for a discussion of somatic selection and its statistical effects on the distribution. We are a little sloppy and always refer to this quantity as the entropy of the distribution, even if it is technically a cross entropy at times.
maximum likelihood estimation. Given a collection of observed DNA sequences S and
a generative model determined by parameters θ ∈ Θ, we want to infer the estimated
parameters θ̂:
\hat{\theta} = \underset{\theta}{\arg\max}\; L(\theta; S) = \underset{\theta}{\arg\max}\; p(S|\theta) = \underset{\theta}{\arg\max} \prod_{seq\in S} P_{gen}(seq|\theta) \qquad (2.18)
as the sequences in S are assumed to be independently generated.
In order to properly infer the parameters of a V(D)J model we must be careful to
only use sequences that are statistically representative of the V(D)J recombination
machinery itself and are not skewed by any selective process or somatic population
dynamics. This is a real worry as not only could clonal expansion overrepresent
specific sequences, but functional receptors are systematically biased away from the
underlying V(D)J generative distribution due to their involvement in the immune
system function (this is explored in Elhanati et al. [2014]). Fortunately, as discussed
in section 1.2, V(D)J recombination does not always produce inframe, productive
sequences with each recombination event. As a result, the DNA sequence datasets
we analyze contain a significant fraction of sequences we know must be nonproduc-
tive/nonfunctional because they are frame shifted (out of frame) or contain a stop
codon. These sequences can never be expressed and therefore should experience no
selective pressures. Thus, to ensure a statistically unbiased sample, we filter our sam-
ple for only unique, nonproductive sequences. Filtering for unique sequences removes
the influence of clonal dynamics and expansion, whereas filtering for nonproductive
sequences removes any selection effects.
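The filter itself is simple. A minimal sketch follows (real pipelines determine the reading frame from the conserved V-gene anchor; here the frame is taken from the read itself, purely for illustration):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_productive(cdr3_nt):
    """In frame (length a multiple of 3) and free of stop codons."""
    if len(cdr3_nt) % 3 != 0:
        return False
    codons = (cdr3_nt[i:i + 3] for i in range(0, len(cdr3_nt), 3))
    return not any(c in STOP_CODONS for c in codons)

def inference_sample(reads):
    """Unique + nonproductive: the statistically unbiased subset."""
    return sorted({r for r in reads if not is_productive(r)})
```

Deduplication removes clonal-expansion bias; the productivity test removes everything that could have felt selection.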
The generative models described (Eq. 2.3, Eq. 2.4) are defined over the space
of recombination events, which are 'hidden' in the sense that there are many, many
recombination events that can lead to a particular DNA sequence and there is no way
to determine which one actually occurred. In order to infer the parameters of such a
model, a classic iterative learning algorithm, expectation maximization (EM), is used
which ensures that a local maximum in likelihood is achieved (proof in Appendix C).
2.3.1 Errors and Mismatches
Each recombination event e = (V, D, J, d_V, d_D, d′_D, d_J, {m_i}, {n_i}) generates a specific
DNA sequence. However, it is possible that when this gene was sequenced that the
recorded nucleotides do not match up perfectly with the sequence generated by e.
This mismatch could indicate a sequencing error in the experiment or, in the case of
B cells, could be the result of hypermutations (this will be discussed in much greater
detail in Section 4.2). We will need to account for such mismatches or errors in order
to properly infer the parameters of the generative model. To do this we introduce
an error/mismatch model. Formally, we define the observed probabilities, given an
observed/measured sequence seqo as:
\begin{aligned}
P^o_{recomb}(e, seq^o) &= P_{recomb}(e)\,P_{mis}(seq^o|e) \\
P^o_{gen}(seq^o) &= \sum_{e\in E} P^o_{recomb}(e, seq^o) \qquad (2.19)
\end{aligned}
where P_mis(seq^o|e) is the error/mismatch model whose parameters will be inferred
during the EM inference. There are several Pmis(seqo|e) models used over the course
of this work.
No error model
It is useful to first consider a model where no errors or mismatches are allowed. To
do this, define Pmis(seqo|e) = I[e generates seqo]. Then,
P^o_{recomb}(e, seq^o) = \begin{cases} P_{recomb}(e) & \text{if } e \text{ generates } seq^o \\ 0 & \text{otherwise} \end{cases} \qquad (2.20)
and
P^o_{gen}(seq^o) = \sum_{e\in E} P^o_{recomb}(e, seq^o) = \sum_{e|seq^o} P_{recomb}(e) = P_{gen}(seq^o) \qquad (2.21)

showing that we recover Pgen from P^o_gen.
Flat error rate
This model assumes that the probability of a mismatched nucleotide between the
observed sequence seq^o = {s^o_i} and the sequence generated by recombination event e,
seq^e = {s^e_i}, is a flat probability p_m.

P_{mis}(seq^o|e) = \prod_i \left( p_m\,\mathbb{I}[s^o_i \neq s^e_i] + (1-p_m)\,\mathbb{I}[s^o_i = s^e_i] \right) \qquad (2.22)
Flat error rate, restricted to genomic templates
In practice, it doesn’t make much sense to examine mismatches outside of the region
of the sequence that is determined by a germline V, D, or J template. Define the set
of positions, Pos_gene, where the nucleotides {s^e_i} come from a germline template, and
its complement, Pos_ins, where the nucleotides come from non-templated insertions.
We define a new error model that applies the flat error model to positions Posgene and
the no error model to positions Posins:
P_{mis}(seq^o|e) = \begin{cases} 0, & \text{if } \exists\, i \in Pos_{ins} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i\in Pos_{gene}} \left(p_m\,\mathbb{I}[s^o_i \neq s^e_i] + (1-p_m)\,\mathbb{I}[s^o_i = s^e_i]\right), & \text{otherwise} \end{cases} \qquad (2.23)
This is the model that is used most frequently and unless otherwise stated is the
model that is used for inference purposes.
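Eq. 2.23 translates directly into code; the index set of germline-templated positions is supplied by the alignment of the event's V/D/J templates to the read (here it is simply passed in):

```python
def p_mis(seq_o, seq_e, pos_gene, p_m):
    """Eq. 2.23: flat error rate p_m at germline-templated positions,
    zero tolerance for mismatches at inserted (non-templated) positions.

    seq_o, seq_e -- observed read and event-generated sequence (equal length)
    pos_gene     -- set of indices covered by a germline V/D/J template
    """
    prob = 1.0
    for i, (so, se) in enumerate(zip(seq_o, seq_e)):
        if i in pos_gene:
            prob *= p_m if so != se else 1.0 - p_m
        elif so != se:
            return 0.0   # mismatch inside the insertion region: forbidden
    return prob
```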
N-mer context dependent error model
In order to study hypermutations in Section 4.2 we use a mismatch model where
the mismatch rate is modulated depending on the 7-mer nucleotide sequence around
the mismatch site. Here we define a general N-mer context model where there are
independent energies at each site (i.e. a one point model).
p_h(i|seq) = \frac{1}{Z}\,p_{bg}\!\left(s_{i-\lfloor N/2\rfloor},\, s_{i-\lfloor N/2\rfloor+1},\, \ldots,\, s_{i+\lfloor N/2\rfloor}\right)\exp\!\left(\sum_{k=-\lfloor N/2\rfloor}^{\lfloor N/2\rfloor} -E_k(s_{i+k})\right) \qquad (2.24)

where p_bg(σ) is the background frequency of the N-mer nucleotide sequence σ and
the proportionality constant Z is determined by matching the overall mismatch rate
(i.e. ⟨p_h⟩ = p_m). As we have the freedom to define the zero of energy for each of the E_k,
it is convenient to set ∑_{σ∈{A,C,G,T}} E_k(σ) = 0 to make it transparent whether the nucleotide
identity at position k in the N-mer makes a hypermutation mismatch more or less
likely.
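A toy 3-mer version of the context model (the thesis uses 7-mers; the background p_bg is taken to be flat here and absorbed into Z, an illustrative simplification) shows both conventions at work: the zero-sum energies and the normalization ⟨p_h⟩ = p_m:

```python
from math import exp

# Toy 3-mer position energies E_k(sigma), each obeying sum_sigma E_k = 0.
# Negative energy => that nucleotide at that offset raises the mismatch rate.
E = {-1: {"A": -0.3, "C": 0.1, "G": 0.1, "T": 0.1},
      0: {"A":  0.0, "C": 0.0, "G": 0.0, "T": 0.0},
      1: {"A":  0.1, "C": 0.1, "G": 0.1, "T": -0.3}}

def unnorm_rate(seq, i):
    """exp(sum_k -E_k(s_{i+k})) for an interior site i."""
    return exp(sum(-E[k][seq[i + k]] for k in E))

def hyper_rates(seq, p_m):
    """Per-site rates p_h(i|seq), with Z fixed so the mean rate is p_m."""
    sites = range(1, len(seq) - 1)            # skip sites whose window overhangs
    raw = {i: unnorm_rate(seq, i) for i in sites}
    Z = sum(raw.values()) / (p_m * len(raw))  # enforce <p_h> = p_m
    return {i: r / Z for i, r in raw.items()}

rates = hyper_rates("ACGTA", p_m=0.01)
```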
One may also notice that we did not specify whether seq is seqo or seqe. Ideally we
would want seq to be the sequence immediately before the hypermutation occurred
(e.g. if we were constructing an evolutionary tree from hypermutations we should use
the current node’s sequence as seq). However, for inference purposes this ambiguity
is functionally irrelevant as choosing either seqo or seqe to be seq will result in a
negligible difference.
Again, we will want to restrict to mismatches with the germline sequences (to
ensure we have identified a hypermutation), so we define:
P_{mis}(seq^o|e) = \begin{cases} 0, & \text{if } \exists\, i\in Pos_{ins} \text{ s.t. } s^o_i \neq s^e_i \\ \prod_{i\in Pos_{gene}} \left(p_h(i|seq)\,\mathbb{I}[s^o_i \neq s^e_i] + (1-p_h(i|seq))\,\mathbb{I}[s^o_i = s^e_i]\right), & \text{otherwise} \end{cases} \qquad (2.25)
2.3.2 Expectation Maximization algorithm
Expectation maximization is implemented by taking an initial guess (generally ran-
domized) for the parameters and then iterating two different steps. The first step,
expectation, defines a function which is the expected log-likelihood over the distribu-
tion of data and hidden variables determined by the data and the current guess of the
parameters. Explicitly, if θ′ is the current estimation of the parameters, we define:
Q(\theta|\theta') = \langle \log L(\theta;\, X, Z)\rangle_{Z|X,\theta'} \qquad (2.26)
Note, Q(θ|θ′) is still a function of some undetermined parameters θ. This leads
to the second step: maximization. To determine the next iteration’s parameter esti-
mation we maximize the estimation function:
\theta^{(i+1)} = \underset{\theta}{\arg\max}\; Q(\theta|\theta^{(i)}) \qquad (2.27)
Repeatedly iterating these steps will monotonically increase both Q and the full
likelihood function (proof below). Let us be explicit in how this translates into the
specific scenario of a VDJ generative model. Say we have (nonproductive) sequences
S, the set of possible recombination events Ω_e = {(V, D, J, d_V, d_D, d′_D, d_J, {m_i}, {n_i})},
and the model structure from Eq. 2.3. Then θ is the collection of parameters defining
P_V, P_DJ, P_delV, etc. The expectation step is defined as:
Q(\theta|\theta') = \langle \log L(\theta;\, S, E)\rangle_{E|S,\theta'} = \sum_{seq\in S}\sum_{e\in E} P^o_{recomb}(e|seq,\theta')\,\log P^o_{recomb}(e, seq|\theta) \qquad (2.28)
Now,
P^o_{recomb}(e|seq,\theta') = \frac{P^o_{recomb}(e, seq|\theta')}{\sum_{e'\in E} P^o_{recomb}(e', seq|\theta')} = \frac{P^o_{recomb}(e, seq|\theta')}{P^o_{gen}(seq|\theta')} \qquad (2.29)
is the fractional contribution of the particular event to the total Pgen of that sequence.
Plugging P^o_recomb(e|seq, θ′) back in and expanding P^o_recomb(e, seq|θ) we get:
\begin{aligned}
Q(\theta|\theta') = \sum_{seq\in S}\sum_{e\in E}\; & \frac{P^o_{recomb}(e,seq|\theta')}{P^o_{gen}(seq|\theta')} \\
\times\Big[ & \log P_{V}(V(e)) + \log P_{DJ}(D(e),J(e)) \\
& + \log P_{delV}(d_V(e)|V(e)) + \log P_{delD}(d_D(e),d'_D(e)|D(e)) + \log P_{delJ}(d_J(e)|J(e)) \\
& + \log P_{insVD}(\ell_{VD}(e)) + \log p_0(m_1(e)) + \sum_{i=2}^{\ell_{VD}} \log S_{VD}(m_i(e)|m_{i-1}(e)) \\
& + \log P_{insDJ}(\ell_{DJ}(e)) + \log q_0(n_{\ell_{DJ}}(e)) + \sum_{i=1}^{\ell_{DJ}-1} \log S_{DJ}(n_i(e)|n_{i+1}(e)) \\
& + \log P_{mis}(seq|e)\Big] \qquad (2.30)
\end{aligned}
We now need to evaluate arg max_θ Q(θ|θ′). As the expansion breaks up into independent
pieces, we can deal with them one at a time. First examine the parameters in P_V.
We want to maximize f(P_V) = Q(θ|θ′) subject to the constraint g(P_V) = ∑_V P_V(V) − 1 = 0.
Naturally, this is done with Lagrange multipliers (∇f = λ∇g). ∇f is readily computed:
\begin{aligned}
\frac{\partial f}{\partial P_V(V_i)} = \frac{\partial Q(\theta|\theta')}{\partial P_V(V_i)} &= \frac{\partial}{\partial P_V(V_i)} \sum_{seq\in S}\sum_{e\in E} \frac{P_{recomb}(e,seq|\theta')}{P_{gen}(seq|\theta')}\,\log P_V(V(e)) \\
&= \sum_{seq\in S}\sum_{e\in E} \frac{P_{recomb}(e,seq|\theta')}{P_{gen}(seq|\theta')}\,\frac{\mathbb{I}[V_i = V(e)]}{P_V(V_i)} \qquad (2.31)
\end{aligned}
λ∇g is even more straightforward:
\lambda\frac{\partial g}{\partial P_V(V_i)} = \lambda\frac{\partial}{\partial P_V(V_i)}\left[\sum_V P_V(V) - 1\right] = \lambda \qquad (2.32)
So,
P_V(V_i) = \frac{1}{\lambda}\sum_{seq\in S}\sum_{e\in E}\frac{P_{recomb}(e, seq|\theta')}{P_{gen}(seq|\theta')}\,\mathbb{I}[V_i = V(e)] \qquad (2.33)
To solve for λ, plug back into our normalization condition g(P_V) = ∑_V P_V(V) − 1 = 0:
\begin{aligned}
g(P_V) = 0 = \sum_V P_V(V) - 1 &= -1 + \frac{1}{\lambda}\sum_{seq\in S}\sum_{e\in E}\frac{P_{recomb}(e,seq|\theta')}{P_{gen}(seq|\theta')}\sum_{V_i}\mathbb{I}[V_i = V(e)] \\
&= -1 + \frac{1}{\lambda}\sum_{seq\in S}\sum_{e\in E}\frac{P_{recomb}(e,seq|\theta')}{P_{gen}(seq|\theta')} \\
&= -1 + \frac{1}{\lambda}\sum_{seq\in S}\frac{P_{gen}(seq|\theta')}{P_{gen}(seq|\theta')} \\
&= -1 + \frac{1}{\lambda}\sum_{seq\in S} 1 \;=\; -1 + \frac{|S|}{\lambda} \\
\Rightarrow \lambda &= |S| \qquad (2.34)
\end{aligned}
Finally this gives us the expression for the parameters of PV for the next iteration:
P_V(V_i) = \frac{1}{|S|}\sum_{seq\in S}\sum_{e\in E}\frac{P_{recomb}(e, seq|\theta')}{P_{gen}(seq|\theta')}\,\mathbb{I}[V_i = V(e)] \qquad (2.35)
which is just the expectation of that marginal, V gene usage in this case, over the
data sequences and using the previous iteration’s parameters. It is easy to show
that the remaining parameters are inferred in an analogous fashion with the only
caveat being that in the derivation for conditional distributions you need to use a
normalization condition (and thus another Lagrange multiplier) for each variable
that the distribution is conditioned on (or do the inference as a joint distribution).
For example:
g(P_{delV}|V_i) = 0 = \sum_{d'_V} P_{delV}(d'_V|V_i) - P_V(V_i) \qquad (2.36)
Also note that, as the insertion dinucleotide Markov models break up into a
similar form, their parameters are inferred in an identical manner (except that
each recombination event e can contribute more than one term to the sum).
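The update rule of Eq. 2.35 is, in the end, just a posterior-weighted count. A sketch, with hypothetical per-sequence event weights standing in for Precomb(e, seq|θ′):

```python
def em_update_PV(seq_events):
    """One M-step for the V marginal (Eq. 2.35).

    seq_events: one dict per sequence, mapping candidate events keyed as
    (V, ...) to their current-model weights Precomb(e, seq | theta').
    """
    new_PV = {}
    for events in seq_events:
        pgen = sum(events.values())            # sum over events = Pgen(seq)
        for (v, *_), w in events.items():
            # posterior responsibility of this event for this sequence
            new_PV[v] = new_PV.get(v, 0.0) + w / pgen
    n_seqs = len(seq_events)                   # the Lagrange multiplier |S|
    return {v: c / n_seqs for v, c in new_PV.items()}

# Two sequences; the first is ambiguous between V1 (weight 0.03) and V2 (0.01).
data = [{("V1", "e1"): 0.03, ("V2", "e2"): 0.01},
        {("V2", "e3"): 0.02}]
new_PV = em_update_PV(data)   # V1: 0.75/2 = 0.375, V2: (0.25 + 1)/2 = 0.625
```

The per-sequence normalization by Pgen is the expectation step; the division by |S| is the maximization step's Lagrange multiplier.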
2.3.3 Implementation
Implementation of the EM algorithm for these V(D)J generative models is quite
tricky, and requires a large amount of computational power. As model parameters
are learned from large datasets of ∼10⁴–10⁵ sequences, there is a premium on efficient
parallelized code. Sequence alignment, efficient enumeration of recombination events,
and intelligent organization of data structures are only some of the challenges. The
story of developing software to infer these parameters belongs to others and so won’t
be a focus of this thesis. However, I do want to take a moment to describe and
highlight the work done to make this difficult inference process possible.
My predecessor, Anand Murugan, was the first to code up and implement a VDJ
generative model of the form Eq. 2.3 and this was the basis of the first paper de-
scribing these V(D)J generative models in Murugan et al. [2012]. His MATLAB code
was then later adapted by me to define and infer the models discussed in Chapter 4.
Despite the success of this MATLAB code, it does require some expertise to use and
any changes to the model structure must be hard coded.
Recently a collaborator, Quentin Marcou, developed a software package called
IGoR (Inference and Generation Of Repertoires) in C++ [Marcou et al., 2018]. IGoR
is constructed in a way that allows the user to easily define the model structure (i.e.
the factorization) and runs smoothly and quickly. This software was used to infer the
models discussed/used in chapters 3 and 5. IGoR is publicly available on GitHub:
https://github.com/qmarcou/IGoR.
Chapter 3
V(D)J recombination to sequences:
Precomb → Pgen
The previous chapter laid out how a generative V(D)J model can be constructed and
inferred. However, the generative model is defined over a state (sample) space of re-
combination events, Ωe, whereas the scientific interest is over the state (sample) space
of sequences or receptors (both nucleotide and amino acid), and biological/physical
effects can only take place on the level of the physical protein structure of the re-
ceptor, i.e. the amino acid sequence (or possibly some coarse grained version of the
amino acid sequence). As briefly discussed in Section 2.1.4, the V(D)J model does define the
probability of generating a particular nucleotide or amino acid sequence by summing
over all recombination events that generate the sequence. This was summarized in
Eq. 2.5, which we repeat here:
P_{gen}(seq) = \sum_{e|seq} P_{recomb}(e) \qquad (3.1)
This summation over recombination events is, in some sense, ‘coarse graining’ the
state (sample) space as we are aggregating many states (recombination events) into
a new state (a nucleotide or amino acid sequence).
3.0.1 Probability Spaces (mathematical aside)
Formally, this ‘coarse graining’ is just extending probability spaces. First we define
the sample space of recombination events (Ωe, with σ-algebra Be), the sample space of
nucleotide CDR3 sequences (Ωnt, with σ-algebra Bnt), and the sample space of amino
acid CDR3 sequences (Ωaa, with σ-algebra Baa). Note that as each recombination
event generates a specific nucleotide sequence through the physical process of V(D)J
recombination, we have the surjective map πv(d)j : Ωe → Ωnt. Furthermore, as each
(in-frame) nucleotide sequence translates to an amino acid sequence, we can define the
translation mapping πnt2aa : Ωnt → Ωaa (if we wished to be pedantic we could keep the
out of frame sequences in Ωaa to ensure that πnt2aa is a function over the whole sample
space and to maintain the total measure of 1 over Ωaa). In this notation it is easy to see
that the mapping πv(d)j extends the probability space of V(D)J recombination events,
(Ωe,Be,Precomb) to the probability space of nucleotide sequences, (Ωnt,Bnt,Pgennt),
while the mapping πnt2aa extends the probability space of nucleotide sequences to the
probability space of amino acid sequences $(\Omega_{aa}, \mathcal{B}_{aa}, P_{\mathrm{gen}}^{aa})$. Our sloppy notation of $e \mid \mathrm{seq}$ can now be understood as either $\pi_{v(d)j}^{-1}(\mathrm{ntseq})$ or $\pi_{v(d)j}^{-1}\big(\pi_{nt2aa}^{-1}(\mathrm{aaseq})\big)$.
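This extension of probability spaces is simply a pushforward of the event-level measure along the surjective map. A minimal sketch, with entirely made-up events, probabilities, and sequences:

```python
from collections import defaultdict

# Toy pushforward: Pgen(s) = sum of P_recomb(e) over all events e with pi(e) = s.
# The events, probabilities, and CDR3 strings below are hypothetical.
P_recomb = {"e1": 0.5, "e2": 0.25, "e3": 0.25}    # measure on Omega_e
pi = {"e1": "CASS", "e2": "CASS", "e3": "CAST"}   # surjective map Omega_e -> Omega_aa

Pgen = defaultdict(float)
for e, p in P_recomb.items():
    Pgen[pi[e]] += p                              # aggregate events into one state

print(dict(Pgen))  # {'CASS': 0.75, 'CAST': 0.25}
```

Note that the total measure of 1 is preserved by construction, since every event maps to exactly one sequence.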
3.1 Too many states! The free energy problem
Despite Eq 2.5’s seeming simplicity, it can prove to be computationally very problem-
atic because of the number of recombination events that could generate a particular
sequence. This is the exact same problem that plagues much of statistical physics –
summing over all states to determine the partition function or a free energy can prove
to be computationally prohibitive if the only method of doing the summation is by
enumerating the states. Indeed, log(Pgen), a quantity we will look at repeatedly, can
even be thought of as a free energy. The reader may remember that this quantity, Pgen,
was required for the EM inference in Sec. 2.3.2, so to do any sort
of inference or to construct any sort of probabilistic model of V(D)J recombination
the problem of enumerating all possible recombination events must be addressed.
In previous work, and in the inference procedures of Murugan et al. [2012] and
Marcou et al. [2018], the number of states to be summed over is controlled through
regularization. By regularization we mean that some procedure is used to
limit the number of recombination events that are considered to a manageable num-
ber. Fortunately, this is quite possible for nucleotide sequences. By only considering
gene templates V, (D), and J that have a sufficiently good alignment (e.g. Smith-
Waterman alignment), capping the number of deletions/insertions, and having cutoffs
for fractional probabilities and errors, it is feasible to reduce the number of recom-
bination events that correspond to a nucleotide sequence (i.e. the notation e|seq) to
the order of thousands or less. This makes it tractable, if still very computationally
intensive, to compute Pgen for nucleotide sequences. It must be noted that for software attempting to infer V(D)J models of arbitrary structure, this enumeration of
recombination events is very useful as there are no restrictions on the correlations it
can consider.
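The regularization step above can be sketched very simply. This is not IGoR's actual procedure; the gene names, scores, and cutoffs are made up for illustration:

```python
# Toy regularization of the event sum: keep only germline V candidates whose
# alignment score to the read clears a cutoff, then cap the number of
# deletions considered per surviving candidate.
reads_aligned = [  # hypothetical (V gene, Smith-Waterman-style score) pairs
    ("TRBV1", 55), ("TRBV2", 12), ("TRBV3", 48),
]
SCORE_CUTOFF = 40
MAX_DELETIONS = 3

candidates = [(v, d)
              for v, score in reads_aligned if score >= SCORE_CUTOFF
              for d in range(MAX_DELETIONS + 1)]
print(len(candidates))  # 2 genes pass the cutoff x 4 deletion choices = 8 events
```

The same pruning applied to the D and J candidates, and to insertion lengths, is what keeps the full event list for a nucleotide sequence down to the order of thousands.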
However, this approach of exhaustive enumeration with some regularization is
computationally intractable for amino acid CDR3 sequences, let alone any kind of
coarse grained alphabet of amino acids that might be more interesting functionally.
This can easily be seen from the fact that the number of possible nucleotide sequences
that translate to a particular amino acid sequence will explode exponentially with the
number of amino acids in the CDR3 region:
\[
\big|\{\sigma \;\text{s.t.}\; nt2aa(\sigma) = a\}\big| = \prod_{a_i \in a} \#\mathrm{codons}\big|_{a_i} \tag{3.2}
\]
To put some perspective on these numbers, the average number of nucleotide
sequences that code for a mouse TRB CDR3 amino acid sequence is ∼ 2 billion
— and mouse TRB CDR3 sequences are significantly shorter than human TRB or
IGH. Even the heavily optimized and efficient IGoR software developed to do V(D)J
generative model inference [Marcou et al., 2018], which can compute the Pgen of
around 60 nucleotide sequences per CPU second, would take around 8500 CPU hrs to
compute the Pgen of a single mouse TRB amino acid sequence. This is prohibitively
long if there is interest in analyzing repertoire datasets that can easily be of the order
of $10^5$ unique sequences or larger. For this reason, much of the early work in this
field, and in this thesis, was restricted to the analysis of nucleotide sequences.
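The codon-degeneracy product of Eq. 3.2 is easy to evaluate directly. A sketch using the degeneracies of the standard genetic code; the CDR3 string below is purely illustrative:

```python
# Degeneracy of each amino acid (number of codons) in the standard genetic code.
N_CODONS = {'L': 6, 'R': 6, 'S': 6, 'A': 4, 'G': 4, 'P': 4, 'T': 4, 'V': 4,
            'I': 3, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'H': 2, 'K': 2, 'N': 2,
            'Q': 2, 'Y': 2, 'M': 1, 'W': 1}

def num_nt_sequences(aa_seq):
    """Eq. 3.2: |{sigma s.t. nt2aa(sigma) = a}| = prod_i #codons(a_i)."""
    n = 1
    for aa in aa_seq:
        n *= N_CODONS[aa]
    return n

print(num_nt_sequences("CASSLGETQYF"))  # -> 442368
```

Even this short illustrative CDR3 has close to half a million coding nucleotide sequences, which makes the exponential blowup with CDR3 length concrete.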
While computing Pgen for amino acid sequences by way of enumerating recombi-
nation events is computationally intractable, this is not to say that the summation
is impossible. In this chapter we present a dynamic programming algorithm and
software, OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid se-
quences, available at https://github.com/zsethna/OLGA), that efficiently computes
Pgen not only for amino acid CDR3 sequences, but also for in-frame nucleotide sequences as
well as sequences composed of coarse grained/ambiguous amino acid alphabets and
motifs. Indeed, OLGA can sum over all possible recombination events of a mouse
TRB model in seconds (and can compute Pgen for around 50 mouse TRB amino acid
sequences per CPU second). This work is detailed in the paper Sethna et al. [2018].
This algorithm, however, requires V(D)J generative models of the form of Eq. 2.3 or
2.4, and so loses the flexibility of being able to consider arbitrary model correlations.
The ability to compute Pgen on an amino acid and functional receptor level will
likely prove to be extremely useful, and we explore some example applications.
3.2 Dynamic Programming
OLGA is an algorithm that leverages ‘dynamic programming’ to avoid enumerating an
exponentially large number of states. Rather than give a formal definition of dynamic
programming, we show an example. Fortunately, physicists are already familiar with
one of the cleanest examples of dynamic programming, and one that truly shows the
computational effectiveness of such a technique: the discretized path integral. If we
have position x with N possible locations, discretized time t, and a Markov transition
matrix Rt(xi → xj) (which may depend on time), we can ask what is the probability
of starting at position x0 and ending at position xT at time T . If we define the
function
\[
P_t(x_0, x_i) = \sum_{\{x_0, x(1), x(2), \ldots, x(t-1), x_i\}} \;\prod_{t'=0}^{t-1} R_{t'}\big(x(t') \to x(t'+1)\big) \tag{3.3}
\]
we want PT (x0, xT ). Now, one could list out all the paths that start at x0 and end
at xT , compute their weights, and sum. However, the number of paths increases
exponentially with t, so the computation time would explode exponentially as O(T ×
NT−1) (T operations on each of NT−1 paths). Instead, it is computationally much
more efficient to sum up all the path weights to each position, at each time step and
then update. In other words, we notice this recursion relation:
\[
\begin{aligned}
P_{t+1}(x_0, x_i) &= \sum_{\{x_0, x(1), x(2), \ldots, x(t-1), x(t), x_i\}} \;\prod_{t'=0}^{t} R_{t'}\big(x(t') \to x(t'+1)\big) \\
&= \sum_{x(t)} R_t\big(x(t) \to x_i\big) \sum_{\{x_0, x(1), x(2), \ldots, x(t-1), x(t)\}} \;\prod_{t'=0}^{t-1} R_{t'}\big(x(t') \to x(t'+1)\big) \\
&= \sum_{x(t)} R_t\big(x(t) \to x_i\big)\, P_t(x_0, x(t))
\end{aligned}
\tag{3.4}
\]
This can be written in a vectorized notation by writing $P_t(x_0, \mathbf{x})$ as a column vector
with elements $P_t(x_0, x_i)$:
\[
P_{t+1}(x_0, \mathbf{x}) = R_t P_t(x_0, \mathbf{x}) \;\Rightarrow\; P_T(x_0, \mathbf{x}) = R_{T-1} R_{T-2} \cdots R_1 R_0 P_0(x_0, \mathbf{x}) \tag{3.5}
\]
where $P_0(x_0, \mathbf{x}) = \mathbb{I}(x_0)$. Thus, solving for $P_T(x_0, x_T)$ by using dynamic programming
would require $O(T \times N^2)$ operations, a massive speedup from the $O(T \times N^{T-1})$ operations of the exhaustive enumeration of the paths. We have turned the summation
over all individual microstates (i.e. the paths) into a matrix expression with steps in
time. The algorithm, OLGA, that we developed to compute Pgen of nucleotide and
amino acid sequences from a generative model will analogously reduce the exponen-
tial blowup of exhaustive enumeration of recombination events down to polynomial
time by summing over matrix expressions based on positions in the sequence read.
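The path-integral example can be checked numerically. A minimal sketch with made-up transition matrices, comparing exhaustive path enumeration against the matrix recursion of Eq. 3.5:

```python
import itertools
import numpy as np

# Brute force over all N**(T-1) paths vs. the O(T * N**2) recursion
# P_{t+1} = R_t P_t, on random (normalized) Markov transition matrices.
rng = np.random.default_rng(0)
N, T = 4, 6
Rs = [rng.random((N, N)) for _ in range(T)]          # R[t][i, j] = R_t(x_i -> x_j)
Rs = [R / R.sum(axis=1, keepdims=True) for R in Rs]  # rows sum to 1

x0, xT = 0, 2

# Exhaustive enumeration: every intermediate path x(1), ..., x(T-1).
brute = 0.0
for path in itertools.product(range(N), repeat=T - 1):
    states = (x0,) + path + (xT,)
    w = 1.0
    for t in range(T):
        w *= Rs[t][states[t], states[t + 1]]
    brute += w

# Dynamic programming: carry the vector of cumulated weights forward in time.
P = np.zeros(N)
P[x0] = 1.0                                          # P_0(x_0, x) = indicator
for t in range(T):
    P = P @ Rs[t]                                    # P_{t+1}(x_j) = sum_i P_t(x_i) R_t(i -> j)
dp = P[xT]

assert np.isclose(brute, dp)
```

Already at $N = 4$, $T = 6$ the brute-force sum visits 1024 paths while the recursion performs only 6 small matrix-vector products; the gap widens exponentially with $T$.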
3.3 OLGA
We now describe how OLGA computes Eq. 2.5 without summing over exhaustively
enumerated recombination events, using dynamic programming. This algorithm re-
quires specific tailoring to the model structure as the correlations have to be built
in explicitly, so the algorithm is slightly different for generative models of VDJ
(TCRβ/IGH, Eq. 2.3) and VJ (TCRα/IGL, Eq. 2.4) recombination. We will first
present the VDJ algorithm, and give the simpler algorithm for generative models of
VJ recombination afterwards.
Each recombination event implies an annotation of the amino acid CDR3 sequence,
(a1, . . . , aL), assigning a different origin to each nucleotide position (one of V, N1, D,
N2, or J, where N1 and N2 are the non-templated VD and DJ insertion segments,
respectively) that parses the sequence into 5 contiguous segments (see schematic in Fig. 3.1).
The core principle of the method is to sum over possible nucleotide locations of
the 4 boundaries between the 5 segments, $x_1$, $x_2$, $x_3$, and $x_4$, but in a recursive way
using matrix operations. This can be summarized into a compact matrix expression:
\[
P_{\mathrm{gen}}(a_1, \ldots, a_L) = \sum_{x_1, x_2, x_3, x_4} V_{x_1} M_{x_1}^{x_2} \sum_{D} \left[ D(D)_{x_2}^{x_3}\, N_{x_3}^{x_4}\, J(D)^{x_4} \right]. \tag{3.6}
\]
Figure 3.1: CDR3 indexing cartoon. Boxes correspond to nucleotides and are indexed by integers. Each group of three boxes (identified by heavier boundary lines) corresponds to an amino acid. The nucleotide positions $x_1, \ldots, x_4$ identify the boundaries between different elements of the partition. The $V$, $M$, $D(D)$, $N$ and $J(D)$ matrices define cumulated weights corresponding to each of the 5 elements.
However, to do this, we will need to define objects that accumulate the probabil-
ities of events from the left of a position x (i.e. up to x) and the right of x (i.e. from
x+ 1 on) which will require some notation.
3.3.1 Notation: 3′ and 5′ vectors
Suppose we have a CDR3 ‘amino acid’ sequence a = (a1, . . . , aL). By ‘amino acid’
sequence, we mean that each of the ‘amino acids’, ai, correspond to some collection of
nucleotide triplets, or codons. We allow this mapping between ‘amino acids’, a, and
codons to be arbitrary at this point, and use the notation σ ∼ a if the codons in the
nucleotide sequence σ correspond to the codons allowed by the amino acid sequence
34
-
a. This will allow us not only to recover the standard nucleotide translation map-
ping, πnt2aa, when using the standard amino acid alphabet (e.g. TGTGCCAGCAGT
∼ πnt2aa(TGTGCCAGCAGT) = CASS), but also provides a trivial extension to in-
clude in-frame nucleotide sequences (define an ‘amino acid’ symbol for each individual
codon) as well as coarser grained collections of amino acids. For example, all codons
that code for amino acids with a common chemical property, e.g. hydrophobicity or
charge, could be grouped into a single ‘amino acid’. In that formulation, (a1, . . . , aL)
would correspond to a sequence of symbols denoting that property. This could prove
to be very useful in constructing and assessing future coarse grained models of recep-
tor - epitope affinities.
It will simplify the later expressions to be able to refer to a position x not only
by its nucleotide index, but by the corresponding amino acid index i as well as what
position x is in the codon reading from 5′ to 3′ (u) and what position x + 1 is in a
codon reading from 3′ to 5′ (u∗). This is shown graphically in Fig. 3.1. Explicitly, for
position $x_j$:
\[
i_j = \left\lceil \frac{x_j}{3} \right\rceil, \qquad u_j = x_j - 3(i_j - 1), \qquad u_j^* = 3 - \mathrm{mod}(u_j, 3) \tag{3.7}
\]
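The index conversions of Eq. 3.7 can be sketched directly; the example position x = 11 matches the labels in Fig. 3.1:

```python
import math

# Eq. 3.7: recover the amino acid index i, the intra-codon position u
# (reading 5' -> 3'), and u* (the position of x+1 reading 3' -> 5')
# from a nucleotide index x.
def codon_position(x):
    i = math.ceil(x / 3)          # amino acid index containing position x
    u = x - 3 * (i - 1)           # position of x within its codon, 1..3
    u_star = 3 - (u % 3)          # position of x+1 in its codon, read 3'->5'
    return i, u, u_star

# x = 11 sits in codon i = 4 at u = 2, so (u, u*) = (2, 1), consistent with
# the allowed pairings (1, 2), (2, 1), and (3, 3).
print(codon_position(11))  # -> (4, 2, 1)
```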
It is also crucial to introduce what we will call ‘5′ vectors’ and ‘3′ vectors’. A 5′ vector,
denoted with a subscript (e.g. $X_x$), accumulates weights for the sequence to the 5′ (left)
side of x (including the nucleotide position x), whereas a 3′ vector, denoted with a
superscript (e.g. $Y^x$), reflects the weights for the sequence to the 3′ (right) side of
x (excluding the nucleotide position x). Because we are dealing with amino acids,
which are encoded with codons made of 3 nucleotides, we need to keep track of
weights by the identity of the nucleotides at the beginning or the end of the codon.
This requires the definition of a 5′ vector (3′ vector) to depend on the value of u (u∗).
For the first nucleotide position in a codon, u = 1 (u∗ = 1), $X_x$ ($Y^x$) must be
interpreted as a row (column) vector of 4 numbers indexed by σ = A, T, G, or C,
corresponding to the cumulated probability weight from the 5′/left (3′/right) side
that the nucleotide at position x (x + 1) takes value σ. If u = 2 (u∗ = 2), then $X_x$ ($Y^x$)
is also a row (column) vector of 4 numbers indexed by nucleotide σ = A, T, G, or
C, but with a different interpretation: it corresponds to the cumulated probability
up to position x from the 5′/left side (x + 1 from the 3′/right), with the additional
constraint that the nucleotide at the last position in the codon, x + 1 (x), can take
value σ (the value is 0 otherwise). Lastly, if x (x + 1) is the last position in a codon,
i.e. u = 3 (u∗ = 3), the cumulative sequence terminates at the end of a codon and
we do not keep nucleotide information, so $X_x$ ($Y^x$) is a scalar.
If we have a 5′ vector $X_x$ that contains the accumulated weights up to position
x, and a 3′ vector $Y^x$ that contains the weights from position x + 1 onwards, we will
want to ‘glue’ these sequence contributions together to get the total probability of
the sequence. This is indicated by the expression¹ $X_x Y^x$, which has a very convenient
structure. As the combinations of u and u∗ are (1, 2), (2, 1), or (3, 3), we see that
the matrix multiplication $X_x Y^x$ is one of two situations. First, if u = u∗ = 3, $X_x Y^x$ is
just scalar multiplication of the aggregate weights for the 5′ and 3′ sides. If (u, u∗)
= (1, 2) or (2, 1), then $X_x Y^x$ is the dot product between a vector of weights indexed
by nucleotides needed to complete the codon and the vector of weights indexed by the
completing nucleotide on the other side. In either case, the result is the total aggregate
weight of the sequence conditioned on the partition x, accurately reflecting the weight
of ‘gluing’ the possible sequences from the 5′/left side to the 3′/right side.
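The two gluing situations can be sketched numerically; the weight values below are made up for illustration:

```python
import numpy as np

# 'Gluing' a 5' vector X_x to a 3' vector Y^x. For (u, u*) = (1, 2) or (2, 1)
# both are length-4 vectors indexed by nucleotide sigma in {A, T, G, C} and the
# glue is a dot product over the shared codon-completing nucleotide; for
# (u, u*) = (3, 3) both are scalars and the glue is ordinary multiplication.
X = np.array([0.5, 0.0, 0.25, 0.0])   # 5' weights ending mid-codon (u = 1)
Y = np.array([0.5, 0.25, 0.0, 0.0])   # 3' weights for each codon completion (u* = 2)

glued = X @ Y                          # total weight of sequences glued at x
print(glued)  # 0.5*0.5 + 0.25*0.0 = 0.25
```

Only nucleotide identities that carry weight on both sides contribute, which is exactly the codon-compatibility constraint the vectors were built to track.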
This notion of sequence gluing also allows for the definition and interpretation of
matrices (e.g. $R_x^y$) with both 5′ and 3′ indices. A matrix $R_x^y$ can be thought of as
‘gluing’ a new sequence segment (x to y) to what an existing 5′ or 3′ vector describes.
For example:
\[
X_x R_x^y = H_y, \qquad R_x^y Y^y = G^x \tag{3.8}
\]
¹Please note that the resemblance of the expression $X_x Y^x$ to a contraction over the position x in Einstein notation should not be misinterpreted. The ‘contraction’ is over possible nucleotide identity indices, not over the position index x.
The matrix $R_x^y$ can map from any value of u to any other (or any value of
u∗ to any other), and so has 9 possible combinations/interpretations based on the u
mapping, and can be a 4×4, 4×1, 1×4, or 1×1 matrix as a result.
3.3.2 VDJ recombination: V, M, D, N, and J
Eq. 3.6 shows the summation over positions of a matrix expression, with the vectors/matrices corresponding to different VDJ contributions. The 5′ vector $V_{x_1}$ corresponds to a cumulated probability of the V segment finishing at position $x_1$; matrix
$M_{x_1}^{x_2}$ is the probability of the VD insertion extending from $x_1 + 1$ to $x_2$; $N_{x_3}^{x_4}$ is the
same for DJ insertions; matrix $D(D)_{x_2}^{x_3}$ corresponds to weights of the D segment
extending from $x_2 + 1$ to $x_3$, conditioned on the D germline choice being D; 3′ vector
$J(D)^{x_4}$ gives the weight of J segments starting at position $x_4 + 1$, conditioned on the
D germline being D. This D dependency is necessary to account for the dependence
between the D and J germline segment choices [Murugan et al., 2012]. All the defined
vectors and matrices depend on the amino acid sequence $(a_1, \ldots, a_L)$, but we leave
this dependency implicit to avoid making the notation too cumbersome.
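The structure of Eq. 3.6 can be sketched with toy scalar weights. This collapses each block to a single number per position pair (real entries are the 1×1, 1×4, or 4×4 blocks described above, and are built from the model parameters); all values here are random placeholders:

```python
import numpy as np

# Toy numerical sketch of the nested sum in Eq. 3.6 with scalar stand-ins
# for V, M, D(D), N, and J(D), and two hypothetical D germline choices.
rng = np.random.default_rng(1)
L_nt, n_D = 9, 2                           # toy CDR3 length and D gene count
V = rng.random(L_nt)                       # V weight ending at x1
M = rng.random((L_nt, L_nt))               # VD insertions x1+1 .. x2
D = rng.random((n_D, L_nt, L_nt))          # D segment x2+1 .. x3, per D
N = rng.random((L_nt, L_nt))               # DJ insertions x3+1 .. x4
J = rng.random((n_D, L_nt))                # J weight from x4+1, per D

pgen_toy = sum(V[x1] * M[x1, x2]
               * sum(D[d, x2, x3] * N[x3, x4] * J[d, x4] for d in range(n_D))
               for x1 in range(L_nt) for x2 in range(x1, L_nt)
               for x3 in range(x2, L_nt) for x4 in range(x3, L_nt))
print(pgen_toy)
```

In OLGA the four position sums are not evaluated as explicit nested loops; they are folded into successive matrix products, exactly as in the path-integral example of Sec. 3.2.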
The entries of the vectors/matrices corresponding to the germline segments, V,
D(D), and J(D), can be calculated by simply summing over the probabilities of
different germline segments compatible with the sequence (a1, . . . , aL) with conditions
on deletions to achieve the required segment length. The ∼ sign is generalized to
incomplete codons so that it returns a true value if there exists a codon completion
that agrees with the sequence a.
V contribution: $V_{x_1}$
The 5′ vector, $V_{x_1}$, aggregates the weights ($P_V$ and $P_{\mathrm{delV}}$) from sequences originating
from the templated V genes, from the start of the CDR3 region up to position $x_1$. As a
5′ vector, $V_{x_1}$ can be a 1×1 or 1×4 matrix depending on $u_1$. $s^V$ is the sequence of the V
germline gene (read 5′ to 3′) from the conserved residue (generally the cysteine C) to
the end of the gene, plus the maximum number of reverse complementary palindromic
insertions appended to the 3′ end. $l_V$ is the length of $s^V$.
\[
\begin{aligned}
V_{x_1}(\sigma) &= \sum_V P_V(V)\, P_{\mathrm{delV}}(l_V - x_1 \mid V)\, I(s^V_{x_1} = \sigma)\, I(s^V_{1:x_1} \sim a_{1:i_1}) && \text{if } u_1 = 1,\\
V_{x_1}(\sigma) &= \sum_V P_V(V)\, P_{\mathrm{delV}}(l_V - x_1 \mid V)\, I\big((s^V_{1:x_1}, \sigma) \sim a_{1:i_1}\big) && \text{if } u_1 = 2,\\
V_{x_1} &= \sum_V P_V(V)\, P_{\mathrm{delV}}(l_V - x_1 \mid V)\, I(s^V_{1:x_1} \sim a_{1:i_1}) && \text{if } u_1 = 3.
\end{aligned}
\tag{3.9}
\]
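The scalar case of Eq. 3.9 ($u_1 = 3$, where $x_1$ ends a codon) can be sketched in a highly simplified form. The gene names, sequences, probabilities, and deletion profile below are all made up, and the codon table is truncated to two amino acids:

```python
# Toy V-contribution sum: P_V * P_delV over hypothetical germline V genes
# whose sequence is codon-compatible with the amino acids a_{1:i1}.
CODONS = {'C': {'TGT', 'TGC'}, 'A': {'GCT', 'GCC', 'GCA', 'GCG'}}

def matches(nt, aa_seq):
    """The ~ relation: every codon of nt is an allowed codon of the amino acid."""
    return all(nt[3*k:3*k+3] in CODONS[aa] for k, aa in enumerate(aa_seq))

germline = {  # hypothetical V genes: name -> (sequence s_V, P_V)
    'V1': ('TGTGCC', 0.5),
    'V2': ('TGTACC', 0.5),
}

def P_delV(n):                 # toy deletion probability profile
    return 0.5 if n == 0 else 0.25

x1, aa = 6, 'CA'               # u1 = 3: x1 falls on a codon boundary
V_x1 = sum(p * P_delV(len(s) - x1)
           for s, p in germline.values() if matches(s[:x1], aa))
print(V_x1)  # only V1 is codon-compatible with 'CA': 0.5 * 0.5 = 0.25
```

The $u_1 = 1$ and $u_1 = 2$ cases add the nucleotide index $\sigma$, turning this scalar into the 4-component vector that later gets glued to the insertion matrix $M_{x_1}^{x_2}$.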
N1 contribution: $M_{x_1}^{x_2}$
This matrix includes the weights ($P_{\mathrm{insVD}}$, $p_0$, and $\prod S_{\mathrm{VD}}(m_i \mid m_{i-1})$) from the glu