1 Introduction

It is of interest to evolutionary biologists to determine which biological and chemical mechanisms contribute to the evolution of genomes. By comparing rates of substitution of amino acids, changes in the functionality of a genome can be studied. Alternatively, a comparison of rates of substitution of DNA nucleotides allows observation of the underlying random processes of genome evolution. Both the study of functionality and the study of underlying processes are of interest to biologists, as their simultaneous consideration allows for more accurate modelling of evolutionary data. The analysis of codons (triplets of DNA, each of which codes for an amino acid) allows one to analyse factors of both DNA rates of change and amino acid rates of change at once.

A typical characteristic of codon models is the inclusion of a non-synonymous/synonymous relative rate, \(\omega \), as a model parameter. A synonymous mutation is one between two codons that code for the same amino acid, and hence the functionality of the gene does not change under such a mutation. For example, codons AAA and AAG both code for the amino acid lysine, so a mutation from AAA to AAG is synonymous. A non-synonymous mutation is one where the amino acid does change. For example, AAA and AAC code for the amino acids lysine and asparagine respectively, so a mutation between these codons would change the functionality of the gene. The parameter \( \omega \) is the relative rate of non-synonymous to synonymous mutations. Values of \(\omega \) are characterised as follows:

$$\begin{aligned}{\left\{ \begin{array}{ll} \omega <1 &{} \text { purifying selection} \\ \omega =1 &{} \text { neutral evolution} \\ \omega >1 &{} \text { positive selection}. \end{array}\right. }\end{aligned}$$

In practice, purifying selection (that is, synonymous mutations happen more frequently than non-synonymous mutations) is most often observed. The observation of neutral evolution can indicate that the gene in question is unimportant, as a mutation that changes the protein produced is as likely as one that does not. There are cases of positive selection occurring; this is most often observed in viruses (Bennett et al. 2006; Shen et al. 2009; Yang 1998).

As there are four DNA nucleotides, there are 64 (\( = 4 \times 4 \times 4 \)) codons, which may lead one to think that such a model would be computationally expensive to use, given that the maximum number of free rate parameters for a codon model is 4032 (\( = 64 \times 64 - 64 \)). Despite this, codon models currently in use can simultaneously model aspects of both the functionality of a gene and the underlying DNA process with as few as two free parameters. The branch-site codon models described in Yang (1997), for example, can have as few as a single parameter, namely the proportion of codon sites whose \( \omega \) values fall in a specified range. The Muse–Gaut codon model (Muse and Gaut 1994) takes into account \( \omega \) and the frequencies of the DNA nucleotides; a total of four free parameters.

A common way to represent models of evolution is with matrices, where an off-diagonal \( (i,j)^{th} \) entry of a rate matrix Q represents the rate at which state j changes to state i, and the diagonal entries are chosen to give Q zero column sums. Note that this differs from the similarly common convention in which an off-diagonal \( (i,j)^{th} \) entry of such a matrix represents the rate of state i changing to state j and the diagonal entries are chosen to give zero row sums. Mathematically, a model which contains free parameters can be represented by a set which here we generically denote as \( {\mathcal {Q}} \), so \(Q \in {\mathcal {Q}}\). In this paper, we assume that the matrix set \({\mathcal {Q}}\) is always determined by polynomial constraints (on the matrix entries). Further, we note that in most cases the constraints are homogeneous. In these cases, for any \(Q \in {\mathcal {Q}}\), scalar multiples \(\lambda Q\) are also in \({\mathcal {Q}}\). For any given rate matrix Q it is possible to generate a corresponding transition matrix M, where an off-diagonal \((i,j)^{th}\) entry of M represents the probability of state j changing to state i in a given time period and the diagonal entries are chosen to give unit column sums. This is done through the exponential map: \( M = e^{Qt}\), where t represents time elapsed. It should be noted here that the \(n \times n\) zero matrix (where n is the number of states) is always assumed to be contained in \({\mathcal {Q}}\) and, as a consequence of taking the matrix exponential of the zero matrix, the \(n \times n\) identity matrix is always contained in the corresponding set of Markov matrices.
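As a minimal illustration of this convention (a sketch using NumPy/SciPy, with made-up rates rather than estimates from data), the following builds a rate matrix with zero column sums and exponentiates it to obtain a Markov matrix with unit column sums:

```python
import numpy as np
from scipy.linalg import expm

def rate_matrix(off_diag_rates):
    """Place the off-diagonal rates and choose diagonal entries to give zero column sums."""
    Q = np.array(off_diag_rates, dtype=float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=0))
    return Q

Q = rate_matrix(np.ones((4, 4)))   # illustrative: all off-diagonal rates equal to 1
t = 0.5                            # time elapsed
M = expm(Q * t)                    # the exponential map M = e^{Qt}

print(np.allclose(Q.sum(axis=0), 0.0))   # True: zero column sums
print(np.allclose(M.sum(axis=0), 1.0))   # True: unit column sums
```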

In the case of DNA models, research by Sumner et al. (2012a) found practical merit in having a set of Markov matrices which is closed under matrix multiplication. If there are two Markov matrices, \(M_{1}\) and \(M_{2}\), acting on different segments of the same branch of a phylogenetic tree, then to find the overall process, \( {\hat{M}} \), for that branch it is necessary to multiply \( M_{1} \) and \( M_{2} \) together. Therefore, if a set of Markov matrices, \( {\mathcal {M}} \), is closed under matrix multiplication and \( M_{1}, M_{2} \in {\mathcal {M}} \), then \( {\hat{M}} \) in this scenario would also belong to \( {\mathcal {M}} \). For a set of rate matrices \({\mathcal {Q}}\) and its corresponding set of transition matrices \({\mathcal {M}}\), it has been shown that \({\mathcal {M}}\) is closed under matrix multiplication if and only if \({\mathcal {Q}}\) forms a Lie algebra (Sumner 2017; Sumner et al. 2012a). Therefore, demanding that \( {\mathcal {Q}} \) forms a Lie algebra will ensure \( {\hat{M}} \in {\mathcal {M}} \). Further studies by Sumner et al. (2012a) found that the general time reversible model (Tavaré 1986) (from now on referred to as GTR) does not have this property, i.e. if \( M_{1} \) and \( M_{2} \) are of GTR form then it is not always the case that \( {\hat{M}} \) is also of GTR form.

Woodhams et al. (2017) conducted similar research on codon models. Their simulations involved selecting a DNA model and generating two sets of parameters from it to produce two distinct codon Markov matrices (in Sect. 2 we describe this process in detail). They then demonstrated that if these two phylogenetic processes, placed on the two branches of a two-taxon phylogenetic tree, have the same underlying value of \( \omega \) and the average process over the tree is estimated, the resulting process does not necessarily have the same value of \( \omega \). This result was consistent for different initial DNA models selected from both time-reversible models and Lie-Markov models (LMM) (Sumner et al. 2012a; Fernández-Sánchez et al. 2015). Given previous research on the mis-estimation of substitution probabilities in DNA models, it seems sensible to assume that a codon model which forms a Lie algebra would be less prone to mis-estimation of \( \omega \). It is currently an open problem to construct such a model. The purpose of this paper is to explore the inherent obstructions to doing so.

In this paper, we explore two methods of finding a Lie algebra to represent a codon model. Both methods compute the smallest Lie algebra containing a given linear space; they differ in how they generate a linear space from a codon model. The first method is to find the “linear closure”, which is the smallest linear space containing the model (in this case the Lie algebra generated by the method is the smallest Lie algebra which contains the model). The second method is to find the “linear version”, which is found by changing the operations in a codon model’s formulation in such a way that the resulting space is linear. We found that both methods produce models with inherent difficulties concerning the large number of parameters. The method of finding the smallest Lie algebra which contains a model has the additional difficulty that there is no unambiguous definition of the \( \omega \) parameter.

As it has been shown that multiplicative closure of a set of Markov matrices occurs if and only if the rate matrices form a Lie algebra (Sumner 2017), our results show that there is no practical way to have a multiplicatively closed codon model. This tells us that there is a fundamental conflict between codon models, the genetic code, and multiplicative closure.

However, although our attempt to find a codon model which is a Lie algebra gave impractical results, the linear spaces generated during the process of this analysis (the linear closures and linear versions) potentially offer a partial solution to the initial problem. These linear codon models do not have the problems associated with codon models that are Lie algebras: they have a sensible number of parameters and, unlike other partial solutions we discuss, have positive substitution rates. Linear closures and linear versions of codon models closely resemble the initial codon models from which they are generated and in the context of multiplicative closure it has been shown that in the DNA case a linear model mis-estimates parameters less than a non-linear model (Kaine 2011).

Addressing the issue of multiplicative closure of Markov models is not the first application of Lie theory in the field of genetics. There have been several studies which find Lie algebras and similar structures which resemble different aspects of the genetic code, such as the relationship between codons and amino acids (Hornos and Hornos 1993; Bashford et al. 1998; Bashford and Jarvis 2000; Sánchez et al. 2006). These works, however, bear little relevance to our study because we are aiming to use the structure of Lie algebras to build a Markov model which represents mutation rates between codons, not to represent the genetic code itself with an algebraic structure.

In Sect. 2 of this paper we define commonly used codon models and explore their structure. Next, in Sect. 3, the mathematical tools used to find linear spaces to represent codon models are defined and examples are given for finding linear spaces associated with sets of matrices, some of which are Markov chains which represent DNA nucleotide substitutions. Also in Sect. 3, the procedure of finding a Lie algebra from a linear space is defined and examples are given. Our analysis and its results are then discussed. In Sect. 4 we use a toy model to illustrate why the size of the Lie algebras of codon models tend to be so large. Finally in Sect. 5, the results of this study are summarised and possible further research topics are explored.

2 Defining the codon model

The Muse–Gaut codon model (from now referred to as MG) (Muse and Gaut 1994) defines the rate of change from codon \(J = (j_{1},j_{2}, j_{3}) \) to codon \(I = (i_{1},i_{2},i_{3})\) as follows:

$$\begin{aligned} Q_{IJ} = {\left\{ \begin{array}{ll} \pi _{i_{k}} &{} \text {synonymous} \\ \omega \pi _{i_{k}} &{} \text {non-synonymous} \\ 0 &{} \text {multiple nucleotide substitutions needed} \end{array}\right. } \end{aligned}$$

where \(I \not = J\), k is the codon position that is undergoing a mutation and \( \pi _{i_{k}} \) is the frequency of nucleotide \( i_{k} \). The diagonal entries of Q are chosen to give zero column sums.

We will be studying MG “style” codon models, which are based on the original MG model, as described presently.

Following the derivation given in Woodhams et al. (2017), we first consider DNA substitution matrices, \(M_{1}\), \(M_{2}\) and \(M_{3}\), whose entries give the probabilities of DNA substitutions at codon positions 1, 2 and 3 respectively. With the assumption of independence of mutations at codon sites, it follows that the probability of transition from the triplet J to I is given by the product

$$\begin{aligned} \text {prob}(j_{1}j_{2}j_{3} \rightarrow i_{1}i_{2}i_{3}) = M_{1}(i_{1},j_{1})M_{2}(i_{2},j_{2})M_{3}(i_{3},j_{3}). \end{aligned}$$

In the construction of this set of models, we use \( \otimes \) to signify the Kronecker product of matrices. As an example of this operation, consider the two matrices:

$$\begin{aligned} A = \left( \begin{array}{cc} 3 &{} 1 \\ 0 &{} -2 \end{array} \right) , B = \left( \begin{array}{cc} 4 &{} 2 \\ -1 &{} 1 \end{array} \right) . \end{aligned}$$

The Kronecker product of A and B is as follows:

$$\begin{aligned} A \otimes B = \left( \begin{array}{cc} 3B &{}\quad 1B \\ 0B &{}\quad -2B \\ \end{array} \right) = \left( \begin{array}{cccc} 12 &{}\quad 6 &{}\quad 4 &{}\quad 2 \\ -3 &{}\quad 3 &{}\quad -1 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad -8 &{}\quad -4 \\ 0 &{}\quad 0 &{}\quad 2 &{}\quad -2 \\ \end{array} \right) . \end{aligned}$$

The Kronecker product can then be used to represent the transition probabilities from codon J to I. Specifically, the matrix \( M_{triplet} = M_{1} \otimes M_{2} \otimes M_{3} \) is \( 64 \times 64\) and contains the transition probabilities of codons based on the underlying DNA processes.
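The following snippet (a small illustration using NumPy; the transition matrices are arbitrary stochastic matrices chosen only for the example) reproduces the Kronecker product above and builds \( M_{triplet} \) from three DNA transition matrices:

```python
import numpy as np

A = np.array([[3, 1], [0, -2]])
B = np.array([[4, 2], [-1, 1]])
print(np.kron(A, B))               # reproduces the 4 x 4 matrix computed above

# Illustrative 4 x 4 DNA transition matrices (columns sum to 1); here M1 = M2 = M3.
M1 = M2 = M3 = np.full((4, 4), 0.1) + 0.6 * np.eye(4)
M_triplet = np.kron(M1, np.kron(M2, M3))

print(M_triplet.shape)                             # (64, 64)
print(np.allclose(M_triplet.sum(axis=0), 1.0))     # columns still sum to 1
```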

Now, assuming a continuous time Markov chain, we may recover a rate matrix, Q, from a transition matrix, M(t), by taking the derivative of M(t) and evaluating it at \(t=0\). Applying this operation to \(M_{triplet}\) gives the result:

$$\begin{aligned} Q_{triplet} = Q_{1} \otimes I \otimes I + I \otimes Q_{2} \otimes I + I \otimes I \otimes Q_{3}, \end{aligned}$$
(1)

where I is the \( 4 \times 4 \) identity matrix. Here, the \(Q_{k}\) describe the underlying DNA rates of change processes of different codon positions: \(Q_{1}\) for position 1, \(Q_{2}\) for position 2 and \(Q_{3}\) for position 3. Typically one takes the constraint that \(Q_{1} = Q_{2} = Q_{3}\), i.e. the underlying DNA rate of change process is the same at all three codon positions. We recall that an assumption almost always made in codon models is that the rate of multiple substitutions is zero, e.g. the rate of \( AAA \rightarrow AGG =0 \) as two DNA mutations are required. This property is automatically present in the \( Q_{triplet} \) matrix as given: the zero matrix entries of the matrix I ensure that entries of \(Q_{triplet}\) are zero for multiple substitutions.
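A short numerical sketch of Eq. (1), assuming the common constraint \(Q_{1} = Q_{2} = Q_{3}\) and using an illustrative JC-like DNA rate matrix, confirms the zero-rate pattern for multiple substitutions:

```python
import numpy as np

I = np.eye(4)
Q = np.ones((4, 4)) - 4 * np.eye(4)          # unit off-diagonal rates, zero column sums
Q_triplet = (np.kron(Q, np.kron(I, I))
             + np.kron(I, np.kron(Q, I))
             + np.kron(I, np.kron(I, Q)))

print(Q_triplet.shape)                              # (64, 64)
print(np.allclose(Q_triplet.sum(axis=0), 0.0))      # zero column sums
# Each codon has only 9 single-substitution neighbours with non-zero rate;
# all entries corresponding to multiple substitutions are automatically zero.
print(np.count_nonzero(Q_triplet) == 64 + 64 * 9)   # True
```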

We define a \( 64 \times 64 \) matrix G to contain information about the rates of change of codons with respect to their amino acid counterparts. It contains \( \omega \) for non-synonymous substitutions, 1 for synonymous substitutions and 0 for prohibited substitutions (to and from stop codons). We then have:

Definition 1

MG-style codon models are given by

$$\begin{aligned} Q_{codon} = Q_{triplet} \circ G, \end{aligned}$$

where the \( \circ \) operation first computes the element-wise product of the two matrices and then resets the diagonal entries to ensure zero column sum.
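As a hedged sketch of this operation (assuming \(Q_{triplet}\) and G are already available as \(64 \times 64\) arrays; building G itself requires a table of the genetic code, which is omitted here):

```python
import numpy as np

def circ(Q_triplet, G):
    """Element-wise product followed by resetting the diagonal to restore zero column sums."""
    Q = Q_triplet * G                     # Hadamard (element-wise) product
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=0))   # diagonal chosen to give zero column sums
    return Q
```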

In the original MG paper, the F81 model (Felsenstein 1981) was assumed for the underlying DNA substitution rate process. However, any DNA rate process can be used to produce a codon model in the above formulation. In the following analysis, we illustrate our discussion with the Kimura 2 parameter model (from now on referred to as K2ST) (Kimura 1980) and the Jukes–Cantor model (from now on referred to as JC) (Jukes et al. 1969) as our underlying DNA rate processes. An MG process with an underlying model of JC is denoted JC-MG, and similarly, an MG process with K2ST as the underlying model is denoted K2ST-MG.

As noted above, for any DNA or codon rate matrix, we can find the corresponding transition matrix (that is, a matrix whose entries are the probability of the change of state for a given time period) by taking the matrix exponential of that rate matrix multiplied by time elapsed as per Markov chain standard practice.

3 Finding the linear and Lie closures of a model

In this section, both the general process for finding the Lie closure of an arbitrary set of matrices and the specific results of finding the Lie closures for models JC-MG and K2ST-MG are discussed.

For a given codon model, we consider two methods for finding a linear space associated to it. Firstly, we look at the smallest linear space that contains the model. This is a fairly straightforward process as we presently illustrate. We define \(\hbox {Mat}_{n \times m}({\mathbb {R}}) \) to be the set of all \( n \times m \) matrices with real entries. Note that \(\hbox {Mat}_{n \times m}({\mathbb {R}}) \) forms a linear space under matrix addition.

Definition 2

The linear closure of a set of \( n \times m \) matrices with polynomial constraints on the matrix entries, \( {\mathcal {A}} \), is the intersection of all linear sub-spaces of \(\hbox {Mat}_{n \times m}({\mathbb {R}}) \) which contain \( {\mathcal {A}} \). More simply, this can be described as the smallest linear space which contains \( {\mathcal {A}} \).

For the examples we consider in this paper, to obtain the linear closure of \({\mathcal {A}}\subseteq \text {Mat}_{n\times n}({\mathbb {R}})\) it is sufficient to take the set, \({\mathcal {X}}\), of polynomial constraints on the matrix entries of members of \({\mathcal {A}}\), and remove the non-linear polynomials in \({\mathcal {X}}\). Caution is required, however, since in general there exist cases where this process gives the incorrect answer for the linear closure. Characterising the cases where our procedure will produce the linear closure of a matrix set is, in general, a difficult topic that we do not address in this paper.

Example 1

As an example of the process of finding the linear closure of a matrix set, we find the linear closure for the following set of matrices defined using two parameters

$$\begin{aligned} {\mathcal {A}} = \left\{ \left( \begin{array}{cc} x &{} xy \\ y &{} y \end{array} \right) : x,y \in {\mathbb {R}} \right\} . \end{aligned}$$

Equivalently we may define this set in terms of constraints on the matrix entries: \( {\mathcal {X}} = \{ A_{11}A_{22} = A_{12} , A_{21} = A_{22} \} \). Applying our method, the first constraint is non-linear and is discarded. The second constraint is linear so this remains as a constraint. Our method therefore gives the set,

$$\begin{aligned} {\mathcal {A}}'&= \left\{ A \in \text {Mat}_{2 \times 2}: A_{21} = A_{22} \right\} \\&= \left\{ \left( \begin{array}{cc} x &{} z \\ y &{} y \end{array} \right) : x,y,z \in {\mathbb {R}} \right\} \\&= \text {span}_{{\mathbb {R}}} \left\{ \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \end{array} \right) , \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 0 \end{array} \right) , \left( \begin{array}{cc} 0 &{} 0 \\ 1 &{} 1 \end{array} \right) \right\} \end{aligned}$$

which is a three dimensional linear space.

As stated above, there are cases where our method of removing all non-linear constraints does not result in the linear closure. To show that \({\mathcal {A}}'\) is in fact the linear closure of \({\mathcal {A}}\), we note that the linearly independent matrices

$$\begin{aligned} \left( \begin{array}{cc} 1 &{} 1 \\ 1 &{} 1 \end{array} \right) , \left( \begin{array}{cc} 1 &{} 2 \\ 2 &{} 2 \end{array} \right) \text { and } \left( \begin{array}{cc} 2 &{} 2 \\ 1 &{} 1 \end{array} \right) \end{aligned}$$

are all contained in \( {\mathcal {A}} \), which tells us that the linear closure must be dimension 3 or more. The dimension of \( {\mathcal {A}}' \) is 3 and it is clear that \( {\mathcal {A}} \subseteq {\mathcal {A}}' \), so the linear closure of \({\mathcal {A}}\) must have dimension 3 or less. Hence we conclude that \({\mathcal {A}}'\) is indeed the linear closure of \({\mathcal {A}}\).
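A quick numerical cross-check of this conclusion (not a proof) is to sample matrices from \({\mathcal {A}}\), vectorise them and compute the rank of their span, which should be 3:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = []
for _ in range(20):
    x, y = rng.standard_normal(2)
    samples.append(np.array([[x, x * y], [y, y]]).ravel())   # a member of A, flattened

print(np.linalg.matrix_rank(np.array(samples)))   # 3, the dimension of the linear closure
```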

An alternative to this process is, instead of considering the linear closure of a space, to explore the “linear version”, as described presently.

Definition 3

For a set \({\mathcal {A}} \subseteq \hbox {Mat}_{ n \times m}({\mathbb {R}}) \) of matrices defined using polynomial constraints, a fixed non-zero \(n \times m\) matrix \(B \in {\mathcal {A}}\), and smooth paths \(A(t) \in {\mathcal {A}}\) which start at B, i.e. \(A(0)=B\), we define \( T_{B}({\mathcal {A}})\) to be the tangent space of \({\mathcal {A}}\) at B. Specifically, this is the span of the derivatives of these paths evaluated at zero: \( T_{B}({\mathcal {A}}):= \text {span}_{{\mathbb {R}}}\{A'(0): A(t) \in {\mathcal {A}} \text { } \forall t, A(0)=B\} \).

In the language of algebraic geometry, we note that \(T_{B}({\mathcal {A}}) \) is in fact the linear space associated to the linear variety through B. To clarify, we explore this concept in Example 2.

We also note there are some cases where the only smooth path in a matrix set from some matrix B is the trivial one, i.e. \( A(t) = B \) \(\forall t\). The derivative of such a path is 0.

Example 2

Consider the unit circle, \(x^2 + y^2 = 1\), in \( {\mathbb {R}}^2 \). The tangent line to the circle at (1, 0) gives us the set of points \( {\mathcal {M}} = \{ (1,a): a \in {\mathbb {R}} \} \), which all sit on the line \(x=1\). We note that \( {\mathcal {M}} \) does not form a linear space as, in particular, it does not contain (0, 0).

Now we consider a path on the unit circle, \(x(t)^2 + y(t)^2 = 1\), where \(x(0)=1\) and \(y(0)=0\). Differentiating with respect to t gives us \( 2(x(t)x'(t) + y(t)y'(t)) = 0 \). When we set \(t=0\), the condition becomes \(2x'(0) = 0\). Hence the tangent space is \( {\mathcal {N}} = \{ (0,a): a \in {\mathbb {R}} \} \). Unlike \({\mathcal {M}}\), \({\mathcal {N}}\) is a linear space.

In this example, \({\mathcal {M}}\) is the linear variety through (1, 0) whereas \({\mathcal {N}}\) is the tangent space at (1, 0). Under our terminology, the point (1, 0) is not in the tangent space at (1, 0).

Definition 4

Given a set, \( {\mathcal {A}} \), of \( n \times m \) matrices defined using polynomial constraints whose terms have degree 1 or greater, and a fixed \( n \times m \) matrix, B, the linear version of \( {\mathcal {A}} \) at B is \( T_{B}({\mathcal {A}}) \).

Lemma 1

Given a set, \( {\mathcal {A}} \), of \( n \times m \) matrices with polynomial constraints of positive degree on the matrix entries:

  1. (i)

    For any \(B\in {\mathcal {A}}\), \( T_{B}({\mathcal {A}}) \) is a subspace of the linear closure of \({\mathcal {A}}\).

  2. (ii)

    In the situation that \({\mathcal {A}}\) is defined using only homogeneous polynomial constraints of positive degree, \( T_{0}({\mathcal {A}}) \) is equal to the linear closure of \({\mathcal {A}}\).

Proof

(i) By the definition of the derivative, members of the linear version of \( {\mathcal {A}} \) at B are limits of members of the linear closure of \( {\mathcal {A}} \) and real linear spaces are closed under limits.

(ii) Setting \(B=0\), part (i) tells us that \(T_0({\mathcal {A}})\) is a subspace of the linear closure of \({\mathcal {A}}\).

Now consider \(C\in {\mathcal {A}}\). Then, by the homogeneity assumption, each \(A(t):=tC \) is a smooth path in \({\mathcal {A}}\). This immediately implies \(C=A'(0)\in T_0({\mathcal {A}})\) and hence \(T_0({\mathcal {A}})\) contains \({\mathcal {A}}\). But then, since \(T_0({\mathcal {A}})\) is a subspace of \(\hbox {Mat}_{n\times m}({\mathbb {R}})\), it must contain the linear closure of \({\mathcal {A}}\).

The result follows. \(\square \)

We note that in the situation that \({\mathcal {A}}\) is already a linear space, the linear version of \({\mathcal {A}}\) at (any) \(B \in {\mathcal {A}}\) and linear closure of \( {\mathcal {A}} \) are both equal to \({\mathcal {A}}\).

Lemma 2

To find the linear version at B of a set of matrices with polynomial constraints, where B has unit entries everywhere, any polynomial constraint of degree \( \ge 2 \) on the original matrix set is replaced by changing each multiplication operation into an addition operation, and any non-zero constant terms in the polynomial constraints are replaced by 0.

Proof

For a matrix set \( {\mathcal {A}} \) with polynomial constraints \({\mathcal {X}}\) on the matrix entries, consider arbitrary \( C \in {\mathcal {A}} \) and \( f \in {\mathcal {X}} \):

$$\begin{aligned} f(C) = a_{1}c_{11}^{k_{11;1}}c_{12}^{k_{12;1}}...c_{nm}^{k_{nm;1}} + ... + a_{p}c_{11}^{k_{11;p}} c_{12}^{k_{12;p}}...c_{nm}^{k_{nm;p}} = 0. \end{aligned}$$

where \( a_{i} \in {\mathbb {R}} \), \( c_{ij} \) are the matrix entries of C and \( k_{ij;l} \in {\mathbb {N}} \cup \{ 0 \} \). To find the tangent space of \( {\mathcal {A}} \) at B, we first consider paths \( C(t) \in {\mathcal {A}} \) with matrix entries \( c_{ij}(t) \) where \( c_{ij}(0) = 1 \) (i.e. \( C(0) = B \)). Therefore \( f \in {\mathcal {X}} \) becomes

$$\begin{aligned} f(C(t))&= a_{1}c_{11}(t)^{k_{11;1}}c_{12}(t)^{k_{12;1}}...c_{nm}(t)^{k_{nm;1}} + ... \\&\quad + a_{p}c_{11}(t)^{k_{11;p}} c_{12}(t)^{k_{12;p}}...c_{nm}(t)^{k_{nm;p}} \\&= 0 . \end{aligned}$$

To find the tangent space, we differentiate the constraints on \( {\mathcal {A}} \) with respect to t and set \( t=0 \) (as \( C'(0) \in \) tangent space at B):

$$\begin{aligned} \left. \frac{d}{dt}\right| _{t=0 }&(a_{1}c_{11}(t)^{k_{11;1}}c_{12}(t)^{k_{12;1}}...c_{nm}(t)^{k_{nm;1}} + ... \\&+ a_{p}c_{11}(t)^{k_{11;p}} c_{12}(t)^{k_{12;p}}...c_{nm}(t)^{k_{nm;p}}) \\ = \left. \frac{d}{dt} \right| _{t=0}&0 =0, \end{aligned}$$

then using the chain rule and product rule, we differentiate the first term only:

$$\begin{aligned}&\left. \frac{d}{dt} \right| _{t=0} a_{1}c_{11}(t)^{k_{11;1}}c_{12}(t)^{k_{12;1}}...c_{nm}(t)^{k_{nm;1}}\\&= a_{1}( k_{11;1} c_{11}'(0) c_{11}(0)^{k_{11;1}-1} c_{12}(0)^{k_{12;1}}...c_{nm}(0)^{k_{nm;1}} \\&\quad + k_{12;1} c_{12}'(0) c_{11}(0)^{k_{11;1}} c_{12}(0)^{k_{12;1}-1}...c_{nm}(0)^{k_{nm;1}} + ... \\&\quad + k_{nm;1} c_{nm}'(0) c_{11}(0)^{k_{11;1}} c_{12}(0)^{k_{12;1}}...c_{nm}(0)^{k_{nm;1}-1}) \\&= a_{1} ( k_{11;1} c_{11}'(0) + k_{12;1}c_{12}'(0) + ... + k_{nm;1}c_{nm}'(0) ) \end{aligned}$$

as \( c_{ij}(0) = 1\) \( \forall i,j \). Note that this expression contains no constant terms as the process of differentiation sends all constant terms to zero. A similar procedure can be used for the other terms of f, so that the tangent of the constraint at B is

$$\begin{aligned} a_{1}({k_{11;1}}c_{11}'(0) + {k_{12;1}}c_{12}'(0) +...+ {k_{nm;1}}c_{nm}'(0)) + ...&\\ + a_{p}({k_{11;p}}c_{11}'(0) + {k_{12;p}}c_{12}'(0) +... + {k_{nm;p}}c_{nm}'(0) )&= 0 \end{aligned}$$

which is equivalent to changing every multiplication operation between \( c_{ij} \) entries to an addition. As f was an arbitrary element of \({\mathcal {X}}\), this applies to all \( f \in {\mathcal {X}} \). \(\square \)

In the case of sets of DNA rate matrices, \( {\mathcal {Q}} \), whose constraints on matrix entries are polynomial, we take B (the point at which we take the tangents) to be \(B=J_{DNA}\), which we define as the JC matrix with DNA rate of change equal to 1, i.e. the \(4 \times 4\) matrix which has unit value off-diagonal entries and whose diagonal entries ensure zero column sums.

For MG-style codon models, we set \(B = J_{codon}\) where

$$\begin{aligned} J_{codon}(i,j) = {\left\{ \begin{array}{ll} 0 &{} i \not =j \text { and is a STOP codon row or column} \\ 0 &{} i \not = j \text { and multiple DNA substitutions are required} \\ 1 &{} i \not = j \text { and one DNA substitution is required} \\ a_{i} &{} i=j \text { and } a_{i} = -\sum _{k=1, k \not =i}^{64} B_{ik}, \\ \end{array}\right. } \end{aligned}$$

i.e. \(J_{codon}\) has unit entries wherever we expect to see a non-zero rate of change in a codon model. Here we note that, defining \(G'\) to be G with \(\omega = 1\), we have

$$\begin{aligned} J_{codon} =(J_{DNA} \otimes I \otimes I + I \otimes J_{DNA} \otimes I + I \otimes I \otimes J_{DNA} ) \circ G'. \end{aligned}$$

When it is clear from the context, both \(J_{DNA}\) and \(J_{codon}\) will be abbreviated to J.

Note that whenever scalar multiples of the matrix B lie within the model under consideration (which is certainly the case for \(J_{codon}\) in the MG-style models considered here), the matrix B is guaranteed to also be in the linear version of the model: B occurs as the tangent vector to smooth paths on the line that goes from 0 to B. One obvious such path is \(A(t):=(1+t)B\), which gives \(A(0)=B=A'(0)\). We also note that finding the tangent space of a matrix space \({\mathcal {A}}\) at 0, as opposed to at B, would give a different space as a result. In particular, Lemma 1 states that in the case of homogeneous constraints, \( T_{0}({\mathcal {A}}) \) and the linear closure are identical.

To find the linear version of \( {\mathcal {Q}} \), we treat constraints on off-diagonal entries in the same way as in Lemma 2 (\(c_{ij}(0) = 1\)) and take each diagonal entry to be whatever is required to ensure zero column sums.

Example 3

Using the same matrix set from Example 1, we take the linear version at B being the \( 2 \times 2 \) matrix with unit entries. The constraint of \( A_{11}A_{22} = A_{12} \) is changed to \( A_{11} + A_{22} = A_{12} \) so that the linear version of the set would be

$$\begin{aligned} \left\{ \left( \begin{array}{cc} x &{} xy \\ y &{} y \end{array} \right) : x,y \in {\mathbb {R}} \right\} \rightarrow \left\{ \left( \begin{array}{cc} x &{} x+y \\ y &{} y \end{array} \right) : x,y \in {\mathbb {R}} \right\} = \text {span}_{{\mathbb {R}}} \left\{ \left( \begin{array}{cc} 1 &{} 1 \\ 0 &{} 0 \\ \end{array} \right) , \left( \begin{array}{cc} 0 &{} 1 \\ 1 &{} 1 \\ \end{array} \right) \right\} \end{aligned}$$

which has two free parameters, hence is a two dimensional matrix linear space. Note that the number of free parameters in the linear version is the same as the original.
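The same calculation can be cross-checked numerically (again, not a proof): differentiating the entries of the path through B gives tangent matrices with entries \((x', x'+y', y', y')\), whose span is two dimensional:

```python
import numpy as np

rng = np.random.default_rng(1)
tangents = []
for _ in range(10):
    dx, dy = rng.standard_normal(2)          # x'(0) and y'(0) for a random smooth path
    tangents.append(np.array([[dx, dx + dy], [dy, dy]]).ravel())

print(np.linalg.matrix_rank(np.array(tangents)))   # 2, the dimension of the linear version
```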

Example 4

As a more practical example of these processes, we examine a variation of the HKY model (Hasegawa et al. 1985) by adding the constraints of \( \pi _{A} = \pi _{G} \) and \( \pi _{C} = \pi _{T} \). The matrix representation of this model has three free parameters:

$$\begin{aligned} {\mathcal {Q}}_{HKY'} = \left\{ \left( \begin{array}{cccc} * &{} \pi _{A} \kappa &{} \pi _{A} &{} \pi _{A} \\ \pi _{A} \kappa &{} * &{} \pi _{A} &{} \pi _{A} \\ \pi _{C} &{} \pi _{C} &{} * &{} \pi _{C} \kappa \\ \pi _{C} &{} \pi _{C} &{} \pi _{C} \kappa &{} * \\ \end{array} \right) : \pi _{A}, \pi _{C}, \kappa \in {\mathbb {R}} \right\} \end{aligned}$$

where * denotes that the entry is chosen to give zero column sum.

To calculate the linear closure of this model, linear constraints such as \( A_{14} = A_{24} \) are kept but non-linear constraints such as \( A_{12}A_{41} = A_{13}A_{43} \) are discarded. Thus the linear closure of this model is

$$\begin{aligned} \left\{ \left( \begin{array}{cccc} * &{} \gamma &{} \alpha &{} \alpha \\ \gamma &{} * &{} \alpha &{} \alpha \\ \beta &{} \beta &{} * &{} \delta \\ \beta &{} \beta &{} \delta &{} * \\ \end{array} \right) : \alpha , \beta , \gamma , \delta \in {\mathbb {R}} \right\} \end{aligned}$$

which is Model 4.4b (a four dimensional matrix linear space) in the Lie-Markov model (LMM) hierarchy described in Fernández-Sánchez et al. (2015). This is different to the linear version, with respect to \(B=J\) as defined above, which would change the non-linear constraints such as \( A_{12}A_{41} = A_{13}A_{43} \) to linear constraints such as \( A_{12} + A_{41} = A_{13} + A_{43} \) so that the linear version of the model is

$$\begin{aligned} \left\{ \left( \begin{array}{cccc} * &{} \alpha + \kappa &{} \alpha &{} \alpha \\ \alpha + \kappa &{} * &{} \alpha &{} \alpha \\ \beta &{} \beta &{} * &{} \beta + \kappa \\ \beta &{} \beta &{} \beta + \kappa &{} * \\ \end{array} \right) : \alpha , \beta , \kappa \in {\mathbb {R}} \right\} . \end{aligned}$$

This is Model 3.4 (a three dimensional matrix linear space) in the LMM hierarchy (Fernández-Sánchez et al. 2015).

One might notice that in these examples of finding linear versions of DNA models, instances where parameters are multiplied together are changed to the parameters being added together. In the case of the last example, we can imagine that if transitions were less likely to occur than transversions we would expect \( \kappa <1 \) for the model but \( \kappa <0 \) for the linear version (this is assuming a DNA ordering of AGCT). Having negative additive parameters in a model can potentially bring about negative substitution rates if not handled carefully. There are many examples, however, of this issue being overcome computationally by using appropriate constraints on the parameters. For the Lie–Markov models, a sensible approach is described in Woodhams et al. (2015). For this example, the constraints required to maintain stochasticity are \( \alpha >-\kappa \) and \( \beta >-\kappa \).

In our analysis, we will be using both Lie closures (which are the Lie closures of the linear closures) of codon models and Lie closures of linear versions of codon models.

Definition 5

A Lie algebra, \( {\mathcal {L}} \), is a linear space over a field, \( {\mathbb {F}} \), with an additional operation of the Lie bracket \( [\cdot ,\cdot ]: {\mathcal {L}} \times {\mathcal {L}} \rightarrow {\mathcal {L}} \), which, for \( x,y,z \in {\mathcal {L}} \) and \( \lambda \in {\mathbb {F}} \) satisfies:

  1. (i)

    \( [x,y] = -[y,x], \)

  2. (ii)

    \( [ \lambda x,y ] = \lambda [x,y], \)

  3. (iii)

    \( [x, [y,z] ] + [ y, [z,x] ] + [z, [x,y] ] = 0. \)

For matrices, we define the Lie bracket operation as \( [A,B] = AB-BA\). We note that with this choice of bracket the third condition (the Jacobi identity) is automatically satisfied, as a consequence of the associativity of matrix multiplication, and hence plays no role in finding the Lie closure.

Definition 6

The Lie closure of a set of matrices, \( {\mathcal {A}} \), is the intersection of all matrix Lie algebras which contain \( {\mathcal {A}}\). More simply, this can be described as the smallest matrix Lie algebra which contains \( {\mathcal {A}} \).

Once the linear closure of a set of matrices is found, Lie brackets of the linear space’s basis elements are calculated: it is sufficient to only work with the basis elements of the linear closure due to the space being linear and the Lie bracket operation being bi-linear. A matrix linear space that is closed under the Lie bracket operation is a Lie algebra. If a Lie bracket is found to be in the existing linear space, the Lie bracket is ignored and another is tried. If a Lie bracket is not in the existing linear space, it is added to the basis. The stop condition is when all Lie brackets of the basis elements are in the linear space. At this point, we have found the Lie closure of the linear space.

Example 5

To find the Lie closure of the matrix set described in Example 1, we take Lie brackets of the basis elements in the linear closure. We see that

$$\begin{aligned} \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \end{array} \right) \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 0 \end{array} \right) - \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 0 \end{array} \right) \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \end{array} \right) = \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 0 \end{array} \right) \end{aligned}$$

which is in the linear closure so there is no further action to be taken. On the other hand

$$\begin{aligned} \left( \begin{array}{cc} 0 &{} 0 \\ 1 &{} 1 \end{array} \right) \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \end{array} \right) - \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \end{array} \right) \left( \begin{array}{cc} 0 &{} 0 \\ 1 &{} 1 \end{array} \right) = \left( \begin{array}{cc} 0 &{} 0 \\ 1 &{} 0 \end{array} \right) \end{aligned}$$

which is not in the linear closure so it is added to the basis. Therefore our new space is

$$\begin{aligned} \text {span}_{{\mathbb {R}}} \left\{ \left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 0 \end{array} \right) , \left( \begin{array}{cc} 0 &{} 1 \\ 0 &{} 0 \end{array} \right) , \left( \begin{array}{cc} 0 &{} 0 \\ 1 &{} 1 \end{array} \right) , \left( \begin{array}{cc} 0 &{} 0 \\ 1 &{} 0 \end{array} \right) \right\} \end{aligned}$$

which is a four dimensional matrix linear space. As this space now spans all \( ( 2 \times 2 ) \) matrices, we know that any Lie bracket in this space will be contained in the space so we now have the Lie closure of the matrix set.
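The procedure just carried out by hand can be written as a short routine (a sketch; membership of the span is tested numerically via matrix rank, whereas in practice exact rational arithmetic may be preferable):

```python
import numpy as np
from itertools import combinations

def bracket(A, B):
    return A @ B - B @ A

def lie_closure(basis):
    """Grow a basis by Lie brackets until the span is closed under the bracket."""
    basis = [np.asarray(b, dtype=float) for b in basis]
    changed = True
    while changed:
        changed = False
        for A, B in combinations(list(basis), 2):
            C = bracket(A, B)
            stacked = np.array([b.ravel() for b in basis])
            if np.linalg.matrix_rank(np.vstack([stacked, C.ravel()])) > len(basis):
                basis.append(C)          # bracket lies outside the current span
                changed = True
    return basis

# Basis of the linear closure from Example 1:
E1 = np.array([[1., 0.], [0., 0.]])
E2 = np.array([[0., 1.], [0., 0.]])
E3 = np.array([[0., 0.], [1., 1.]])
print(len(lie_closure([E1, E2, E3])))    # 4, agreeing with the calculation above
```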

Example 6

In Example 4, the linear closure is in the LMM hierarchy (Fernández-Sánchez et al. 2015) and hence is a Lie algebra so no further computation is necessary to find the Lie closure of that matrix set (so the Lie closure of \( {\mathcal {Q}}_{HKY'} \) is a four dimensional matrix linear space). The linear version of \( {\mathcal {Q}}_{HKY'} \) also forms a Lie algebra so the Lie closure of the linear version of \( {\mathcal {Q}}_{HKY'} \) is equal to the linear version (a four dimensional matrix linear space).

Fig. 1

If we begin with a model, the other matrix sets generated by the model are nested as shown. For example, the linear closure sitting inside the Lie closure represents the fact that, when the two are not equal, the Lie closure is the larger space and the linear closure is a linear subspace of it. The Lie closure of the linear version does not fit into this diagram easily as it can lie either inside or outside the linear closure; however, it will be inside the Lie closure as a simple consequence of Lemma 1

For the general case, to find the Lie closure of a model, first the linear closure is found and then Algorithm 1 is used. An alternative method for finding a Lie algebra associated with a model is to find the Lie closure of its linear version. The nesting of the spaces that can be generated from a model is depicted in Fig. 1. It was found in general practice that Algorithm 1 could be computationally expensive to complete: it would “settle” on a result in less than a day (i.e. the basis did not increase in size for several thousand iterations) but would have taken months to confirm the result, so in these cases we applied Algorithm 2 to check whether the resulting space most likely formed a Lie algebra or whether there was evidence to suggest that it did not. Algorithm 2 generates Lie brackets of elements in the existing space and then tries to solve for the resulting matrix in terms of the space’s basis. The idea behind it is that it would be very unlikely for the Lie bracket of two matrices which are linear combinations of all the basis elements (whose coefficients are randomly generated rational numbers) to be a linear combination of the basis elements if the space were not a Lie algebra. For the purposes of our study, Algorithm 1 produces a lower bound on the size of the Lie closure, which is all we need to conclude that it is not going to produce a useful model.

Algorithm 1 (figure a)
Algorithm 2 (figure b)
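A sketch of the idea behind Algorithm 2 (the precise algorithm is given above; this version tests span membership numerically rather than solving for rational coefficients):

```python
import numpy as np

def probably_lie_algebra(basis, trials=100, seed=0):
    """Randomised check: repeatedly bracket random linear combinations of the basis
    and test whether the result stays in the span. Success is evidence, not proof."""
    rng = np.random.default_rng(seed)
    basis = [np.asarray(b, dtype=float) for b in basis]
    stacked = np.array([b.ravel() for b in basis])
    rank = np.linalg.matrix_rank(stacked)
    for _ in range(trials):
        X = sum(c * b for c, b in zip(rng.standard_normal(len(basis)), basis))
        Y = sum(c * b for c, b in zip(rng.standard_normal(len(basis)), basis))
        C = X @ Y - Y @ X
        if np.linalg.matrix_rank(np.vstack([stacked, C.ravel()])) > rank:
            return False    # found a bracket outside the span
    return True             # no counterexample found
```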

3.1 Examples of linear closures and linear versions in DNA models

To better illustrate the different ways of closing matrix sets, and to demonstrate the differences between Lie closures and Lie closures of linear versions, we now further explore finding closures of some popular DNA rate substitution models.

Example 7

The symmetric DNA model (GTR (Tavaré 1986) with uniform base distribution) has matrix form

$$\begin{aligned} {\mathcal {Q}}_{SYM} = \left\{ \left( \begin{array}{cccc} *&{} a &{} b &{} c \\ a &{}*&{} d &{} e \\ b &{} d &{}*&{} f \\ c &{} e &{} f &{}*\\ \end{array} \right) : a,b,c,d,e,f \in {\mathbb {R}} \right\} . \end{aligned}$$

Clearly, \( {\mathcal {Q}}_{SYM} \) is a linear space as all constraints on it are linear, e.g. \( Q_{12} = Q_{21} \).

We define the symmetric matrix \( S_{ij} \) as the \((4 \times 4)\) matrix with unit entries in positions (i, j) and (j, i), zeros in all other off-diagonal entries, and diagonal entries set to ensure zero column sums. We also define an anti-symmetric matrix \( T_{ij} \) as the \( (4 \times 4) \) matrix with unit entry in position (i, j), \( -1 \) in position (j, i), zeros in all other off-diagonal entries, and diagonal entries set to give zero column sums. Clearly the set \( \{ S_{12}, S_{13}, S_{14}, S_{23}, S_{24}, S_{34} \} \) is a basis for \( {\mathcal {Q}}_{SYM} \).

Each time a Lie bracket is taken of two symmetric matrices, an anti-symmetric matrix is produced. For example consider the Lie bracket \( [S_{12},S_{13}] \):

$$\begin{aligned}&\left( {\begin{matrix} -1 &{} 1 &{} 0 &{} 0 \\ 1 &{} -1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \end{matrix}} \right) \left( {\begin{matrix} -1 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} -1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \end{matrix}} \right) - \left( {\begin{matrix} -1 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ 1 &{} 0 &{} -1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \end{matrix}} \right) \left( {\begin{matrix} -1 &{} 1 &{} 0 &{} 0 \\ 1 &{} -1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \end{matrix}} \right) \\&\quad = \left( {\begin{matrix} 0 &{} 1 &{} -1 &{} 0 \\ -1 &{} 0 &{} 1 &{} 0 \\ 1 &{} -1 &{} 0 &{} 0 \\ 0 &{} 0 &{} 0 &{} 0 \end{matrix}} \right) \\&\quad = T_{12} - T_{13} + T_{23}. \end{aligned}$$

For distinct a, b, c, d in \( \{ 1,2,3,4 \} \), the Lie brackets of the basis elements of \( {\mathcal {Q}}_{SYM} \) can be summarised as follows:

$$\begin{aligned} {[}S_{ab},S_{cd}]&= 0 \\ {[}S_{ab},S_{ac}]&= T_{ab} - T_{ac} + T_{bc}.\\ \end{aligned}$$

Hence we have

$$\begin{aligned} {[}S_{23},S_{24}]&= T_{23} - T_{24} + T_{34}= U_{1}\\ {[}S_{32},S_{34}]&= T_{32} - T_{34} + T_{24}= U_{2}\\ {[}S_{12},S_{14}]&= T_{12} - T_{14} + T_{24}= U_{3}\\ {[}S_{12},S_{13}]&= T_{12} - T_{13} + T_{23} = U_{4},\\ \end{aligned}$$

where each \( U_{i} \) is a \(4 \times 4\) matrix which is both anti-symmetric and doubly stochastic and has zero entries throughout its \( i^{th} \) row and column. We see that \( U_{4} = U_{1} - U_{2} + U_{3} \) and hence \( \{ U_{1}, U_{2}, U_{3} \} \) is a linearly independent set which spans the space of doubly stochastic anti-symmetric matrices. The set \( \text {span}_{{\mathbb {R}}} \{ S_{12}, S_{13}, S_{14}, S_{23}, S_{24}, S_{34}, U_{1}, U_{2}, U_{3} \} \) is the set of doubly stochastic \( 4 \times 4 \) matrices, which is known as the doubly stochastic model (DS) and is Model 9.20b in the LMM hierarchy (Fernández-Sánchez et al. 2015). It forms a Lie algebra and hence is the Lie closure of SYM.
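This example can also be cross-checked numerically with the same bracket-and-rank procedure as before (restated here so the snippet is self-contained): starting from the six \(S_{ij}\), the basis stabilises at dimension 9 and every element has zero row and column sums.

```python
import numpy as np
from itertools import combinations

def S(i, j, n=4):
    """Symmetric generator S_ij: unit entries at (i, j) and (j, i), zero column sums."""
    M = np.zeros((n, n))
    M[i, j] = M[j, i] = 1.0
    np.fill_diagonal(M, -M.sum(axis=0))
    return M

basis = [S(i, j) for i, j in combinations(range(4), 2)]    # the six generators of SYM

changed = True
while changed:        # grow the basis by brackets until no new directions appear
    changed = False
    for A, B in combinations(list(basis), 2):
        C = A @ B - B @ A
        stacked = np.array([b.ravel() for b in basis])
        if np.linalg.matrix_rank(np.vstack([stacked, C.ravel()])) > len(basis):
            basis.append(C)
            changed = True

print(len(basis))                                           # 9, the dimension of DS
print(all(np.allclose(b.sum(axis=0), 0) and
          np.allclose(b.sum(axis=1), 0) for b in basis))    # True: zero row and column sums
```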

Example 8

Consider the GTR model (Tavaré 1986) which has matrix form

$$\begin{aligned} {\mathcal {Q}}_{GTR} = \left\{ \left( \begin{array}{cccc} * &{} \alpha \pi _{A} &{} \beta \pi _{A} &{} \gamma \pi _{A} \\ \alpha \pi _{G} &{} * &{} \delta \pi _{G} &{} \varepsilon \pi _{G} \\ \beta \pi _{C} &{} \delta \pi _{C} &{} * &{} \eta \pi _{C} \\ \gamma \pi _{T} &{} \varepsilon \pi _{T} &{} \eta \pi _{T} &{} * \end{array} \right) : \alpha , \beta , \gamma , \delta , \varepsilon , \eta , \pi _{A}, \pi _{G}, \pi _{C}, \pi _{T} \in {\mathbb {R}} \right\} . \end{aligned}$$

Its linear version is

$$\begin{aligned} \left\{ \left( \begin{array}{cccc} * &{} \alpha + \pi _{A} &{} \beta + \pi _{A} &{} \gamma + \pi _{A} \\ \alpha + \pi _{G} &{} * &{} \delta + \pi _{G} &{} \varepsilon + \pi _{G} \\ \beta + \pi _{C} &{} \delta + \pi _{C} &{} * &{} \eta + \pi _{C} \\ \gamma + \pi _{T} &{} \varepsilon + \pi _{T} &{} \eta + \pi _{T} &{} * \end{array} \right) : \alpha , \beta , \gamma , \delta , \varepsilon , \eta , \pi _{A}, \pi _{G}, \pi _{C}, \pi _{T} \in {\mathbb {R}} \right\} \end{aligned}$$

which, although presented using ten free parameters, is a nine dimensional linear space. We define \(L_{i}\) to be the matrix generated when each parameter except i is set to be zero, and i is set to be 1. We can then assert that the set \( \{ L_{\alpha }, L_{\beta }, L_{\gamma }, L_{\delta }, L_{\varepsilon }, L_{\eta }, L_{\pi _{A}},L_{\pi _{G}}, L_{\pi _{C}},L_{\pi _{T}} \} \) is not linearly independent as \( L_{\alpha }+L_{\beta }+L_{\gamma }+L_{\delta }+L_{\varepsilon }+L_{\eta } - L_{\pi _{A}}- L_{\pi _{G}}- L_{\pi _{T}} = L_{\pi _{C}} \).

In its own right, this set of matrices does not form a Lie algebra and therefore the Lie closure of this space is not trivial. The linear version of \({\mathcal {Q}}_{GTR}\) is not contained in the DS model and hence must be contained in an LMM of dimension higher than 9. It is not contained in Models 10.12 or 10.34 of the LMM hierarchy (Fernández-Sánchez et al. 2015) and hence we conclude that the Lie closure of the linear version of \({\mathcal {Q}}_{GTR}\) must contain the set of \( 4 \times 4 \) matrices which have zero column sums, known as the General Markov Model (GMM) (Barry and Hartigan 1987). As GMM is the largest \( 4 \times 4 \) rate matrix set, we conclude that the Lie closure of the linear version of \({\mathcal {Q}}_{GTR}\) cannot be bigger than GMM and is therefore equal to GMM.

Example 9

We take the GTR model (Tavaré 1986) and assume that \( \pi _{A} = \pi _{G} \) and \( \pi _{C} = \pi _{T} \). This model has matrix form

$$\begin{aligned} {\mathcal {Q}}_{GTR'} = \left\{ \left( \begin{array}{cccc} * &{} \alpha \pi _{A} &{} \beta \pi _{A} &{} \gamma \pi _{A} \\ \alpha \pi _{A} &{} * &{} \delta \pi _{A} &{} \varepsilon \pi _{A} \\ \beta \pi _{C} &{} \delta \pi _{C} &{} * &{} \eta \pi _{C} \\ \gamma \pi _{C} &{} \varepsilon \pi _{C} &{} \eta \pi _{C} &{} * \end{array} \right) : \alpha , \beta , \gamma , \delta , \varepsilon , \eta , \pi _{A}, \pi _{C} \in {\mathbb {R}} \right\} . \end{aligned}$$

Its linear version has the form

$$\begin{aligned} \left\{ \left( \begin{array}{cccc} * &{} \alpha +\pi _{A} &{} \beta + \pi _{A} &{} \gamma + \pi _{A} \\ \alpha + \pi _{A} &{} * &{} \delta +\pi _{A} &{} \varepsilon + \pi _{A} \\ \beta + \pi _{C} &{} \delta +\pi _{C} &{} * &{} \eta + \pi _{C} \\ \gamma + \pi _{C} &{} \varepsilon + \pi _{C} &{} \eta + \pi _{C} &{} * \end{array} \right) : \alpha , \beta , \gamma , \delta , \varepsilon , \eta , \pi _{A}, \pi _{C} \in {\mathbb {R}} \right\} . \end{aligned}$$

We see that \( {\mathcal {Q}}_{SYM} \) is contained in this set so the Lie closure of the linear version must contain the DS model. We also note that \( L_{\pi _{A}} \) and \( L_{\pi _{C}} \) are not doubly stochastic so the Lie closure of the linear version of \( {\mathcal {Q}}_{GTR'} \) must have dimension 11 at minimum. As there are no 11 dimensional Lie-Markov models (Fernández-Sánchez et al. 2015), the Lie closure of the linear version of \( {\mathcal {Q}}_{GTR'} \) must be, again, GMM.

We note here that as the Lie closure of the linear version is contained in or equal to the Lie closure of a set, the Lie closures of both \( {\mathcal {Q}}_{GTR} \) and \( {\mathcal {Q}}_{GTR'} \) are also GMM.

Example 10

The model proposed by Tamura and Nei (1993), often referred to as the Tamura–Nei model (TN), has the matrix form:

$$\begin{aligned} {\mathcal {Q}}_{TN} = \left\{ \left( \begin{array}{cccc} * &{} \pi _{A} \kappa _{1} &{} \pi _{A} &{} \pi _{A} \\ \pi _{G} \kappa _{1} &{} * &{} \pi _{G} &{} \pi _{G} \\ \pi _{C} &{} \pi _{C} &{} * &{} \pi _{C} \kappa _{2} \\ \pi _{T} &{} \pi _{T} &{} \pi _{T} \kappa _{2} &{} * \end{array} \right) : \kappa _{1}, \kappa _{2}, \pi _{A}, \pi _{G}, \pi _{C}, \pi _{T} \in {\mathbb {R}} \right\} . \end{aligned}$$

This is an interesting example as both its linear closure and linear version form Lie algebras. Since these models also have purine/pyrimidine symmetries, they are in the LMM hierarchy given in Fernández-Sánchez et al. (2015). The linear version has the form

$$\begin{aligned} \left\{ \left( \begin{array}{cccc} * &{} \pi _{A} + \kappa _{1} &{} \pi _{A} &{} \pi _{A} \\ \pi _{G}+ \kappa _{1} &{} * &{} \pi _{G} &{} \pi _{G} \\ \pi _{C} &{} \pi _{C} &{} * &{} \pi _{C}+ \kappa _{2} \\ \pi _{T} &{} \pi _{T} &{} \pi _{T}+ \kappa _{2} &{} * \end{array} \right) : \kappa _{1}, \kappa _{2}, \pi _{A}, \pi _{G}, \pi _{C}, \pi _{T} \in {\mathbb {R}} \right\} \end{aligned}$$

which is Model 6.8a of the LMMs (Fernández-Sánchez et al. 2015). The linear closure on the other hand is

$$\begin{aligned} \left\{ \left( \begin{array}{cccc} * &{} \alpha &{} \pi _{A} &{} \pi _{A} \\ \beta &{} * &{} \pi _{G} &{} \pi _{G} \\ \pi _{C} &{} \pi _{C} &{} * &{} \gamma \\ \pi _{T} &{} \pi _{T} &{} \delta &{} * \end{array} \right) : \alpha , \beta , \gamma , \delta , \pi _{A}, \pi _{G}, \pi _{C}, \pi _{T} \in {\mathbb {R}} \right\} \end{aligned}$$

which is Model 8.8 of the LMMs (Fernández-Sánchez et al. 2015).

In summary we have:

Result 1

The Lie closure of \({\mathcal {Q}}_{SYM}\) is the DS model. The Lie closures, and Lie closures of the linear versions, of both GTR and GTR’ are GMM. The Lie closure of TN is Model 8.8 of the LMMs and the Lie closure of the linear version of TN is Model 6.8a of the LMMs.

Proof

As established in Examples 7, 8, 9 and 10 above. \(\square \)

3.2 Incorporating the \( \omega \) parameter into Lie closures of codon models

The use of linear versions (which leads to the use of Lie closures of linear versions) changes the \( \omega \) parameter from a multiplicative one to an additive one (for examples of the unique rates present in Markov models of interest, see Table 1). Adding scalar multiples of the matrix G to an existing rate matrix would have undesirable consequences, for example, non-zero entries where multiple nucleotide substitutions are required. We are therefore required to define a new matrix, \(G^*\), with the parameter \(\omega \) being the coefficient of this matrix in linear versions. Off the diagonal, the matrix \(G^*\) is defined to have unit entries in the positions representing non-synonymous mutations which require only one nucleotide mutation and are not to or from stop codons, and zero entries everywhere else. Its diagonal entries are chosen to give zero column sums. Additionally, because we are not multiplying \( Q_{triplet} \) (1) by G, we are required to add the extra constraint on \( Q_{triplet} \) of zero values for entries that represent mutations to or from stop codons. When the linear version is found for an MG-style codon model, \(G^*\) is automatically in the basis, and when the matrix set is written as a linear combination of its basis elements, \(G^*\) has the coefficient \(\omega \). This accounts for all “\( + \omega \)” terms in the off-diagonal matrix entries (for examples of matrix entries in linear versions of MG-style codon models, see Table 1).

Defining an \( \omega \) parameter for the linear closure (and hence Lie closure) of an MG-style codon model is less clear. When the linear closure of the codon model is found, for example for K2ST-MG, we start with parameters \( \{ a, b, \omega \}\) (which result in matrix entries \(\{ a, b, a \omega , b \omega \} \)) and the linear closure has parameters \( \{ a, b, a \omega , b \omega \} \rightarrow \{ c_{1}, c_{2}, c_{3}, c_{4} \} \) (i.e. there are now 4 independent parameters), which means that there is no longer a clear \( \omega \) parameter. As for the linear version, it would seem logical for \( \omega \) to be the coefficient of the \( G^* \) matrix. Therefore, in practice, the basis for the linear closure should be defined in a way that includes \( G^* \). This still leaves the question of how \( \omega \) itself should be calculated.

One possible way to calculate \( \omega \) is \( \frac{1}{2} (\frac{c_{3}}{c_{1}} + \frac{c_{4}}{c_{2}}) \); an average of the non-synonymous/synonymous rate ratios. Another method proposes that \( \omega _{1} = \frac{c_{3}}{c_{1}} \), \( \omega _{2} = \frac{c_{4}}{c_{2}} \) and hence \( \omega = \omega _{1} \frac{c_{1}}{c_{1} + c_{2}} + \omega _{2} \frac{c_{2}}{c_{1} + c_{2}} \); a weighted average where the weights are the frequencies of the types of substitutions. A third method would be to take the geometric average of \(\frac{c_{3}}{c_{1}}\) and \(\frac{c_{4}}{c_{2}}\), giving \(\omega = \sqrt{\frac{c_{3}c_{4}}{c_{1}c_{2}}}\). It is currently an open question as to how \( \omega \) should be calculated or interpreted, especially as the situation is more complicated in MG-style codon models whose linear closures have more than 4 parameters.
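For concreteness, here is a small worked example of the three candidate definitions, with illustrative (made-up) values of the four independent parameters:

```python
c1, c2, c3, c4 = 2.0, 1.0, 0.5, 0.4          # illustrative values only

omega_average   = 0.5 * (c3 / c1 + c4 / c2)                                # 0.325
omega_weighted  = (c3 / c1) * c1 / (c1 + c2) + (c4 / c2) * c2 / (c1 + c2)  # 0.3
omega_geometric = (c3 * c4 / (c1 * c2)) ** 0.5                             # ~0.316

print(omega_average, omega_weighted, omega_geometric)
```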

Table 1 Interesting cases of Markov models, codon and DNA, with their unique off-diagonal entries and the number of free parameters listed

3.3 Lie closures of codon models

In our analysis of codon models, first the codon model was defined in the way we have discussed in Sect. 2. Recall that under this model structure the rates of change between codons under JC-MG and K2ST-MG are as follows:

$$\begin{aligned} Q_{\text {JC-MG}}= & {} {\left\{ \begin{array}{ll} 0 &{} \text {mutations that are to or from stop codons} \\ 0 &{} \text {multiple nucleotide substitutions required} \\ \alpha &{} \text {synonymous substitution} \\ \alpha \omega &{} \text {non-synonymous substitution} \\ \end{array}\right. }\\ Q_{\text {K2ST-MG}}= & {} {\left\{ \begin{array}{ll} 0 &{} \text {mutations that are to or from stop codons} \\ 0 &{} \text {multiple nucleotide substitutions required} \\ \alpha &{} \text {synonymous transition} \\ \alpha \omega &{} \text {non-synonymous transition} \\ \beta &{} \text {synonymous transversion} \\ \beta \omega &{} \text {non-synonymous transversion} \\ \end{array}\right. } \end{aligned}$$

where \(\alpha \) and \(\beta \) represent substitution behaviour at a DNA level.

The linear closure was found for both JC-MG and K2ST-MG, the linear version of K2ST-MG was found, and then Algorithm 1 was applied to these linear spaces. It should be noted here that the linear closure and linear version of JC-MG are the same linear space, only with different bases, and hence have the same Lie closure. This is not the case for K2ST-MG: for this codon model the linear closure and linear version are different linear spaces because the model has non-linear constraints. We waited for the algorithm to “settle” and then, as the algorithm would have taken months to terminate, used Algorithm 2 to confirm that the resulting spaces were Lie algebras.

We found that the dimensions of both the Lie closure of K2ST-MG and the Lie closure of the linear version of K2ST-MG are 2106 (curiously, these were found to be the same linear space, which is a topic for further exploration). The dimension of the Lie closure of JC-MG was 1996. It is therefore clear that such models are far too big to be of practical use in phylogenetic applications. In order for the Lie closure or the Lie closure of the linear version to be smaller, the starting model would have to be simpler, but the only way we can make a codon model simpler than JC-MG is to set \( \omega \) to a constant value, which would defeat the whole point of the exercise as we are trying to reduce mis-estimation of \( \omega \).

3.4 Further analysis: partial Lie closures

Given that finding a full Lie closure of an MG-style codon model was not practical, it was thought that we could instead create a partial Lie closure; that is, begin to close the Lie algebra but not completely do so. We now define more precisely what we mean by a partial Lie closure.

In Algorithm 1 above, we can see that any element added to the basis L can be represented as a Lie bracket of the original n matrices from the linear closure. We define the generation of an element of L as the number of Lie brackets necessary to build that element from the elements of the linear closure plus one. For example, we would say that \( B = [A_{1},[[A_{2},A_{3}],A_{1}]] \) (where \( A_{1}, A_{2}, A_{3} \) are in the linear closure of the original matrix set) would belong to generation 4 as there are 3 Lie bracket operations required to build this element from the elements of the linear closure. When building a partial Lie closure, we will calculate elements up to a fixed generation. For example, if one was interested in a Lie closure up to generation 4 then first generation 1 elements would be calculated followed by generations 2, 3 and 4. This process is really the same as the algorithm for finding the Lie closure (described in Algorithm 1) apart from the stop condition and the order in which Lie brackets are calculated (and possibly appended to \( {\mathcal {L}} \)).
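The following is an illustrative reading of this procedure (a sketch, not the implementation used in our analysis): elements of generation g are brackets of elements from generations p and q with \(p+q=g\), and closing “up to generation G” bounds how many rounds of brackets are taken.

```python
import numpy as np

def partial_lie_closure(gen1, max_generation):
    """Grow the basis by brackets, but only up to a fixed generation."""
    by_gen = {1: [np.asarray(b, dtype=float) for b in gen1]}
    basis = list(by_gen[1])
    for g in range(2, max_generation + 1):
        new = []
        for p in range(1, g // 2 + 1):       # generation g = bracket of generations p and g - p
            q = g - p
            for A in by_gen.get(p, []):
                for B in by_gen.get(q, []):
                    C = A @ B - B @ A
                    stacked = np.array([b.ravel() for b in basis + new])
                    if np.linalg.matrix_rank(np.vstack([stacked, C.ravel()])) > len(stacked):
                        new.append(C)        # a genuinely new direction of generation g
        by_gen[g] = new
        basis.extend(new)
        if not new:
            break                            # already closed: this is the full Lie closure
    return basis
```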

It was hoped that a model that was partially Lie closed would have similar enough properties to Lie algebras that the mis-estimation of \( \omega \) could be reduced. We tried to find a partial closure of the JC-MG model. Unfortunately, problems arose regarding “stochasticity.”

We say a zero column sum matrix is stochastic when it is a rate matrix, i.e. a matrix is stochastic when its off-diagonal entries are non-negative. This is a requirement of rate matrices as it does not make sense to have a negative rate of one state changing to another. Sometimes, given a matrix linear space, constraints must be put on the basis coefficients in order to achieve stochasticity. For example, for the linear space of matrices

$$\begin{aligned} {\mathcal {A}} = \text {span}_{{\mathbb {R}}} \left\{ A_{1} = \left( \begin{array}{ccc} -2 &{} 1 &{} 1 \\ 1 &{} -2 &{} 1 \\ 1 &{} 1 &{} -2 \end{array} \right) , A_{2} = \left( \begin{array}{ccc} 0 &{} -1 &{} -1 \\ -1 &{} 0 &{} 1 \\ 1 &{} 1 &{} 0 \end{array} \right) \right\} \end{aligned}$$

with a typical element \( a_{1}A_{1} + a_{2}A_{2}: a_{1},a_{2} \in {\mathbb {R}} \), we must place constraints on \( a_{1} \) and \( a_{2} \) in order for matrices in \( {\mathcal {A}} \) to be stochastic. One possible set of constraints is \( a_{1} \ge 0 \) and \( a_{1} \ge |a_{2}| \). Sometimes, however, no choice of constraints will ensure non-trivial stochasticity; for example, consider the set

$$\begin{aligned} {\mathcal {B}} = \text {span}_{{\mathbb {R}}} \left\{ B_{1} = \left( \begin{array}{ccc} -2 &{} 1 &{} 1 \\ 1 &{} -2 &{} -1 \\ 1 &{} 1 &{} 0 \end{array} \right) , B_{2} = \left( \begin{array}{ccc} -2 &{} 1 &{} -2 \\ 1 &{} -2 &{} 1 \\ 1 &{} 1 &{} 1 \end{array} \right) \right\} \end{aligned}$$

with the typical element \( b_{1}B_{1} + b_{2}B_{2}: b_{1},b_{2} \in {\mathbb {R}} \). We see that the only way an element of \( {\mathcal {B}} \) can be stochastic is if we set \( b_{1} = b_{2} = 0 \).
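Both examples can be verified numerically. The sketch below (the specific coefficient values are illustrative choices, not taken from the text) checks that coefficients satisfying \( a_{1} \ge 0 \) and \( a_{1} \ge |a_{2}| \) give stochastic elements of \( {\mathcal {A}} \), and that a coarse grid search over coefficients for \( {\mathcal {B}} \) finds only the trivial stochastic element:

```python
import numpy as np

def is_rate_matrix(Q, tol=1e-12):
    """A zero-column-sum matrix is stochastic (a valid rate matrix) when
    all of its off-diagonal entries are non-negative."""
    off_diag = Q - np.diag(np.diag(Q))
    return bool(np.all(off_diag >= -tol))

A1 = np.array([[-2.,  1.,  1.],
               [ 1., -2.,  1.],
               [ 1.,  1., -2.]])
A2 = np.array([[ 0., -1., -1.],
               [-1.,  0.,  1.],
               [ 1.,  1.,  0.]])

# Coefficients satisfying a1 >= 0 and a1 >= |a2| keep span{A1, A2} stochastic:
print(is_rate_matrix(2.0 * A1 + 1.5 * A2))   # True  (a1 >= |a2|)
print(is_rate_matrix(1.0 * A1 - 2.0 * A2))   # False (a1 < |a2|)

B1 = np.array([[-2.,  1.,  1.],
               [ 1., -2., -1.],
               [ 1.,  1.,  0.]])
B2 = np.array([[-2.,  1., -2.],
               [ 1., -2.,  1.],
               [ 1.,  1.,  1.]])

# A coarse grid search over (b1, b2) finds only the trivial stochastic element:
grid = np.arange(-30, 31) / 10.0
stochastic = [(b1, b2) for b1 in grid for b2 in grid
              if is_rate_matrix(b1 * B1 + b2 * B2)]
print(stochastic)   # only the trivial combination (0.0, 0.0) survives
```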

We found that any non-trivial partial Lie closure of the JC-MG codon model (i.e. a partial Lie closure strictly larger than the linear closure, with non-zero coefficients allowed on the basis elements outside the linear closure) with dimension less than 227 violated stochasticity; the dimension 227 corresponds to the partial Lie closure up to generation 10. In other words, for a non-trivial partial Lie closure to be stochastic we would need a dimension \( \ge 227 \), and a space of that size is still too big to be practical.

4 Toy model: an interesting case of symmetries

It is interesting that a codon model which begins as a linear space of dimension 2 can have a Lie closure of such large dimension. Studying this further has proven difficult given the computational expense of the problem. We therefore created a toy model in an attempt to better understand the features that can lead to the Lie closure of a linear space being so large.

We kept the codon length of 3 but reduced the number of states per position from 4 (A, G, C and T) to 2 (R and Y), giving an \( (8 \times 8) \) codon model. As in the MG-style codon models, we assumed that two positions of a codon cannot change simultaneously, so, for example, the rate of \( RRR \rightarrow RYY \) is 0. We defined our basis model as

$$\begin{aligned} Q_{triplet} = Q_{2} \otimes I \otimes I + I \otimes Q_{2} \otimes I + I \otimes I \otimes Q_{1} \end{aligned}$$

where

$$\begin{aligned} Q_{1} = \left( \begin{array}{rr} -a &{} a \\ a &{} -a \end{array} \right) \text { and } Q_{2} = \left( \begin{array}{rr} -b &{} b \\ b &{} -b \end{array} \right) . \end{aligned}$$

This results in

$$\begin{aligned} Q_{triplet} = \left( \begin{array}{cccccccc} * &{}\quad a &{}\quad b &{}\quad 0 &{}\quad b &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ a &{}\quad * &{}\quad 0 &{}\quad b &{}\quad 0 &{}\quad b &{}\quad 0 &{}\quad 0 \\ b &{}\quad 0 &{}\quad * &{}\quad a &{}\quad 0 &{}\quad 0 &{}\quad b &{}\quad 0 \\ 0 &{}\quad b &{}\quad a &{}\quad * &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad b \\ b &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad * &{}\quad a &{}\quad b &{}\quad 0 \\ 0 &{}\quad b &{}\quad 0 &{}\quad 0 &{}\quad a &{}\quad * &{}\quad 0 &{}\quad b \\ 0 &{}\quad 0 &{}\quad b &{}\quad 0 &{}\quad b &{}\quad 0 &{}\quad * &{}\quad a \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad b &{}\quad 0 &{}\quad b &{}\quad a &{}\quad * \\ \end{array} \right) . \end{aligned}$$

This matrix is equivalent to a full codon model in which a synonymous change corresponds to a mutation at the third codon position and a non-synonymous change corresponds to a mutation at the first or second codon position. As it stands, \( Q_{triplet} \) forms an abelian Lie algebra, so its Lie closure has dimension 2. What we want to test is: if we make minor adjustments to \( Q_{triplet} \), what happens to the size of the Lie closure? A construction of \( Q_{triplet} \) is sketched below.
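For reference, \( Q_{triplet} \) can be assembled directly from the \( (2 \times 2) \) blocks with a few lines of NumPy; the rate values below are arbitrary placeholders used only to display the pattern:

```python
import numpy as np

def build_Q_triplet(a, b):
    """Assemble the 8x8 toy-model rate matrix from 2x2 blocks via Kronecker
    products: the first two codon positions mutate at rate b and the third
    at rate a.  Columns sum to zero by construction."""
    I = np.eye(2)
    Q1 = np.array([[-a,  a],
                   [ a, -a]])
    Q2 = np.array([[-b,  b],
                   [ b, -b]])
    return (np.kron(np.kron(Q2, I), I)
            + np.kron(I, np.kron(Q2, I))
            + np.kron(I, np.kron(I, Q1)))

# With generic placeholder rates, the matrix matches the pattern displayed above.
print(build_Q_triplet(a=0.3, b=0.7))
```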

If matrix entry (7, 5) of \( Q_{triplet} \) is changed from b to a, then the dimension of the Lie closure is 25. When similar changes were made to \( Q_{triplet} \) (swapping b values to a values and vice versa), the dimensions of the Lie closures ranged from 5 to 56. (Note that the maximum possible dimension for the Lie closure of a model of this form is \( 8 \times 8 - 8 = 56 \).) There was an apparent trend that when the adjusted \( Q_{triplet} \) remained symmetric after being altered (i.e. if we changed entry (7, 5) from b to a then we also changed entry (5, 7) from b to a), the Lie closure tended to be smaller, but such symmetries were not sufficient to obtain a Lie closure of dimension less than 5.
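As an illustration of this experiment, the sketch below splits \( Q_{triplet} \) into its a-part and b-part (its two-dimensional linear closure), swaps the roles of a and b at entry (7, 5), and then iterates Lie brackets until the span stops growing. It reuses build_Q_triplet from the previous sketch; the closure routine is a generic numerical one written for illustration, not the algorithm used in the paper.

```python
import numpy as np
from itertools import combinations

def closure_dimension(generators, tol=1e-9, max_passes=100):
    """Add independent Lie brackets of basis elements until the span stops
    growing, then return the dimension of the resulting Lie algebra."""
    basis = []
    def try_add(M):
        stacked = np.array([X.flatten() for X in basis + [M]])
        if np.linalg.matrix_rank(stacked, tol=tol) == len(basis) + 1:
            basis.append(M)
            return True
        return False
    for G in generators:
        try_add(G)
    for _ in range(max_passes):
        grew = False
        for A, B in combinations(list(basis), 2):
            grew = try_add(A @ B - B @ A) or grew
        if not grew:
            break
    return len(basis)

# a-part and b-part of Q_triplet (its two-dimensional linear closure)
Qa = build_Q_triplet(a=1.0, b=0.0)
Qb = build_Q_triplet(a=0.0, b=1.0)
# swap the roles of a and b at entry (7, 5) (0-indexed: (6, 4)) ...
Qa[6, 4], Qb[6, 4] = Qb[6, 4], Qa[6, 4]
# ... and restore zero column sums by resetting the diagonals
for Q in (Qa, Qb):
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=0))

print(closure_dimension([Qa, Qb]))   # the text above reports a dimension of 25
```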

For a fixed number of changes, we found great variety in the size of the resulting Lie closure. Table 2 gives the sizes of the Lie closures generated after making 4 symmetric changes (2 pairs of changes) in 10 different ways, and after making 2 asymmetric changes in 10 different ways. For the range of Lie closure sizes after making a particular number of changes in 10 different ways, see Table 3. Generally, the more changes we make, the larger the resulting Lie closure. It can also be seen that, for the 10 possibilities we tried, making four or more asymmetric changes resulted in a Lie closure of the maximum possible dimension.

Table 2 This “stem and leaf” plot displays the sizes of Lie closures after making various symmetric and asymmetric changes to \( Q_{triplet} \) (four changes were made in the symmetric case and two in the asymmetric case)
Table 3 The observed range of Lie closure sizes for \( Q_{triplet} \) with a particular number of modifications, each tried in 10 different ways

This is interesting as it shows that, as soon as the model deviates from \( Q_{triplet} \), the Lie closure no longer has a simple description. The \( (64 \times 64) \) matrices in the linear closures of JC-MG and K2ST-MG are far from being as simple as the \( Q_{triplet} \) matrix, so it is not surprising that their Lie closures are so large.

5 Discussion

Our initial aim in this work was to find a multiplicatively closed codon model which incorporated \( \omega \) as a parameter. We have shown that there is no practical way to do this.

This negative result is interesting in a mathematical context. In the case of the JC-MG codon model, it was surprising that a model which begins with as few as 2 parameters has a Lie closure of dimension 1996. Our investigation of the toy model demonstrated that this is intrinsic to the problem: if the starting matrix set deviates too far from being perfectly symmetric, then the Lie closure tends to be close to its maximum possible size.

From here, the natural way forward is further analysis of linear closures and linear versions of codon models. These do not violate stochasticity and are of reasonable size. In our analysis of the JC-MG model, the linear closure (which is the same space as the linear version, only with a different basis) is trivial because JC-MG has only two parameters to begin with. But when the underlying DNA substitution process has more parameters, for example HKY, the linear closure is not trivial and the structure of the linear version is quite different from the original model. It has not yet been tested whether linear closures of codon models mis-estimate \( \omega \) as badly as the models themselves, but it is plausible that this could help, as previous exploration (Kaine 2011; Sumner et al. 2012b) found that parameters are mis-estimated less in DNA models which form linear spaces.

This approach also opens up the opportunity to study an additive \( \omega \) further. As previously discussed, it is not immediately clear how this parameter should be defined in the model-building process. Making the parameter additive introduces the possibility of stochasticity violations in the model, a mathematical problem yet to be explored. How to interpret the parameter biologically when it enters additively rather than multiplicatively is also an open problem, and one which suggests further research topics.