linearham
A Phylo-HMM implementation for B cell receptor sequence analysis.
|
linearham
is a phylogenetic hidden Markov model (phylo-HMM), which simultaneously models the VDJ recombination process (with an HMM) and sequence evolution (with a phylogenetic tree).
This document is the reference for understanding how linearham
works. In the following sections, we provide an overview of the abstractions used in the linearham
codebase. See the links above for detailed information about classes and methods. To fully understand how linearham
works, please see the linearham
tests, which provide examples describing these core concepts.
We use the term "site positions" to refer to positions in the aligned sequences and "germline positions" to positions in the germline gene.
To allow for more tractable inference on the HMM, we utilize Smith-Waterman (S-W) alignment information between the clonal family sequences and the germline genes. We fix the relative positions of the germline genes to their S-W site positions. Of course, the S-W algorithm aligns two sequences at a time, but we use heuristics to determine the final alignment between all the relevant germline genes and the equal-length clonal sequences. For more information on these heuristics, see the add_linearham_info()
function in python/utils.py
of partis
(https://github.com/psathyrella/partis/tree/dev).
The diagram shown below provides an example alignment between a single-sequence clonal family and a V/D/J gene. In the code, we have two important S-W information objects: flexbounds_
and relpos_
. flexbounds_
is a map holding ([vdj]_[lr]
, pair<int, int>
) pairs, which represent the possible V/D/J starting/ending match positions. During inference, these starting/ending match position ranges are found by computing the min
/max
of the corresponding S-W starting/ending positions for all germline gene matches in a given gene type (i.e. V/D/J). In addition, the [vd]_r
([dj]_l
) bounds are shifted to the left (right) by a certain amount to allow for more flexible naive sequence inference in the CDR3 region.
This alignment graphic is a simple example to demonstrate how these data structures work. Because the S-W match positions for a given germline gene are specified using Python 0-based right-exclusive range conventions, the [vdj]_r
bounds in flexbounds_
denote the range of site positions immediately after the last possible S-W match positions in a given gene type. In this example,
These ranges are marked in the diagram between vertical dashed lines. Note that the last V gene starting position is 2 and the first V gene ending position is 4, which means only positions 2 and 3 are guaranteed to be in the V gene. This logic will be important when we describe how the HMM hidden state space is constructed. relpos_
is a map specifying the starting positions of the germline genes. In this example, relpos_ = {{"V", 1}, {"D", 5}, {"J", 10}}
.
In the diagram above, the germline genes are not properly aligned to the clonal family sequence as the V (J) gene should align to the start (end) of the sequence. To account for this, we pad germline genes with N bases until the V (J) gene aligns to the start (end) of the clonal sequence. These padding bases represent fully ambiguous (i.e. any possible) germline bases. The padding transition/emission probability information is stored in the NPadding
class. Note that while germline padding is handled by linearham
, sequence padding is performed in partis
.
We construct three germline gene classes, one for each type of germline gene (i.e. V/D/J). These classes VGermline
, DGermline
, and JGermline
are inherited from three classes that store partis
HMM germline parameter information (i.e. Germline
, NTInsertion
, and NPadding
). The corresponding inheritance diagram is shown below.
Looking at this diagram, you can see that we account for non-templated insertions to the left of D and J germline genes. In the code, we parse the parameter files (in YAML format) and store these [VDJ]Germline
objects in a common GermlineGene
class, which is conceptually similar to a tagged union class.
Our hidden state space is similar to that of ham
. ham
uses states of the form \((V, j)\), \((D, j)\), \((D, N)\), \((J, j)\), and \((J, N)\) for a V gene \(V\), D gene \(D\), J gene \(J\), germline position \(j\), and non-templated insertion base \(N \in \{N_A, N_C, N_G, N_T\}\). Again, non-templated insertion bases appear to the left of a germline gene and thus are associated with the gene to their right. In linearham
(versus ham
), we use the S-W alignment information to collapse the germline state space.
As described in the above S-W alignment diagram, positions 2 and 3 represent the sites "guaranteed" to be in a hidden V germline state according to S-W. In this "germline" region, we do not need to care about the different possibilities of hidden states beyond which V gene is entered because we know that the HMM must march along the gene until it exits that gene. Thus, the only possible hidden state in the V "germline" region is \(V\). It is helpful to think about the V "germline" region as a single site position in the HMM.
To make this more concrete, we provide another S-W alignment diagram below with more than one V/D/J germline gene. In this new example, the hidden V "germline" state can be either \(V_{01}\) or \(V_{99}\). The V-D "junction" region (i.e. sites 4 and 5) can have hidden states associated with V genes, non-templated insertions, and D genes. Specifically, the hidden state space consists of \((V_{01}, 3)\), \((V_{01}, 4)\), \((V_{99}, 3)\), \((V_{99}, 4)\), \((D_{01}, N_A)\), \((D_{01}, N_C)\), \((D_{01}, N_G)\), \((D_{01}, N_T)\), \((D_{01}, 0)\), \((D_{99}, N_A)\), \((D_{99}, N_C)\), \((D_{99}, N_G)\), \((D_{99}, N_T)\), \((D_{99}, 1)\), \((D_{99}, 2)\). In general, the hidden state space in the V-D "junction" region comprises the matched V germline states, the non-templated insertion states associated with D gene matches, and the matched D germline states from site position flexbounds_["v_r"].first
to site position flexbounds_["d_l"].second - 1
; the site position flexbounds_["d_l"].second
is the last D gene match starting position so it is considered part of the D "germline" region (and not part of the V-D "junction" region). Note that this discussion about the V "germline" and V-D "junction" states applies to the D/J "germline" and D-J "junction" states as well.
This "germline" and "junction" state space decomposition allows us to reduce the space-time complexity of the forward algorithm, which is quadratic in the number of hidden states. By collapsing the "germline" state space using S-W alignment information, the forward algorithm scales approximately linearly in the number of ham
hidden states (hence the name linearham
).
Because the linearham
HMM has different hidden state spaces for the "germline" and "junction" regions, we need to specify "germline"-to-"junction", "junction", and "junction"-to-"germline" hidden state transition probability matrices. To help illustrate how these matrices are structured, we revisit the S-W example alignment first discussed in Smith-Waterman alignment information. Specifically, we focus our attention on the V "germline" region, V-D "junction" region, and D "germline" region; the other "germline" and "junction" regions can be treated analogously.
The hidden state spaces in that first example for the V "germline" region, V-D "junction" region, and D "germline" region are \(\{V\}\), \(\{(V, 3)\), \((V, 4)\), \((D, N_A)\), \((D, N_C)\), \((D, N_G)\), \((D, N_T)\), \((D, 0)\), \((D, 1)\), \((D, 2)\}\), and \(\{D\}\), respectively. Therefore, the transition probability matrix between the V "germline" region and V-D "junction" region has 1 row and 9 columns.
partis
-inferred transition probability going from position 2 to position 3 in germline gene \(\{V\}\).Note that the \(\text{init}\) and \(\text{end}\) states in a germline gene represent the initial and ending states used in partis
HMM germline parameter files. The other matrix entries, the ( \(9 \times 9\)) V-D "junction" transition probability matrix, and the ( \(9 \times 1\)) [V-D "junction"]-to-[D "germline"] transition probability matrix are filled using analogous logic. However, when computing the transition probabilities between the V-D "junction" region and D "germline" region, we must account for the transitions within the D "germline" region as well.
In each "germline" region, we have an emission probability row vector that stores an entry for every "germline" state in that region. The S-W example alignment shown in HMM hidden state space has a V "germline" state space equal to \(\{V_{01}, V_{99}\}\) and thus the corresponding emission probability row vector has 2 entries. Because a "germline" region spans at least one site position in the alignment, each "germline" emission probability is defined as the product over all site-wise emission probabilities in the region. In the aforementioned example, the "germline" emission probabilities for the \(V_{01}\) and \(V_{99}\) states are products of the site-specific emission probabilities from position 0 to position 3.
In each "junction" region, we use an emission probability matrix with entries for each site position and "junction" state in that region. For the same example used above, the V-D "junction" emission probability matrix has 2 rows and 15 columns. Some "junction" states do not occur at particular site positions (i.e. \((D_{01}, 0)\) at position 4) and these cases are assigned emission probabilities of 0.
In phylo-HMMs, per-column phylogenetic likelihoods take the place of emission probabilities in classical HMMs. Here, we describe how the notion of an "expanded" multiple sequence alignment (xMSA) makes phylo-HMM computation efficient. Suppose we have an S-W alignment as shown in Smith-Waterman alignment information, but instead of the single input sequence ACAGTACCCTGTTNN
, we have a clonal family with 3 sequences (see diagram below).
The phylogenetic likelihoods depend on the observed bases at each site and the associated hidden naive bases as well. We map each MSA site position and possible hidden naive base to an xMSA site position and compute the phylogenetic log-likelihood at each xMSA site only once. The important takeaway is that even though many pairs of MSA site positions and hidden naive bases occur more than once, we never have to compute the corresponding phylogenetic log-likelihoods more than once. The xMSA for the aforementioned example is displayed in the graphic below, where the top row of the alignment cycles through all of the possible bases of the "expanded" naive sequence.
In this example,
TCC
, AAG
, ACT
, AAA
) are guaranteed to match with the first 4 bases of \(V\) (NATG
) so there is no MSA expansion.CCG
) is in the D "germline" region and is guaranteed to match with the germline base A
.Note that the V-D "junction" states \((D, 0)\) and \((D, N_G)\) at MSA site position 5 both map to the same xMSA site position (13) because they represent the same likelihood calculation.
Given that there can be a potentially large number of sequences in a clonal family, we must account for possible numerical underflow in the forward algorithm. We define a large numeric constant SCALE_FACTOR
and check that the computed forward probabilities are all above 1.0 / SCALE_FACTOR
. If not, we multiply the forward probabilities by SCALE_FACTOR
enough times until no forward probability is less than 1.0 / SCALE_FACTOR
and record the number of times we multiplied by SCALE_FACTOR
. We keep a running total of these scaler counts as the forward algorithm progresses along the site positions of the HMM and undo the scaling according to the final scaler count to compute the HMM log-likelihood.