1 Introduction

The standard genetic code is a template according to which 64 codons are mapped into 20 amino acids and a stop coding signal. This set of rules, with rare exceptions, is nearly universal for all domains of life and is responsible for transmitting genetic information stored in the DNA molecule into proteins. The questions about the origin and also the present structure of the standard genetic code have been puzzling biologists since the first codon assignments were discovered in the sixties of the last century (Khorana et al. 1966; Nirenberg et al. 1966). In particular, the question about the way of the standard genetic code’s degeneracy appears to be intriguing because if we assume that potential theoretical genetic code must encode 20 amino acids and stop coding signals then we get around \(10^{84}\) possible variations (Schönauer and Clote 1997). Therefore, the standard genetic code is just one potential solution out of extremely many different scenarios. This fact gives a motivation for studying which features are playing a decisive role in the process of genetic code emerging.

There are three main hypotheses concerning origin of the standard genetic code (Di Giulio 2005). These are stereochemical, coevolution, and adaptive. The first claims that the genetic code evolves as a result of a high affinity between amino acids and respective codons/anti-codons or other aptamers and oligomers (Dunnill 1966; Pelc and Welton 1966; Yarus et al. 2005). However, some evidences to support this evolutionary scenario were found only in very few cases. The coevolution hypothesis posits that the present structure of the standard genetic code evolved from its ancestral version including small number of simple amino acids. This code has been changed simultaneously with the development of metabolic pathways, i.e. the newly synthesized amino acids took over the codons of their precursors (Wong 1975; Di Giulio 2017). This process required the simultaneous evolution of the genetic code and biosynthetic pathways. In the framework of this hypothesis, the physicochemical properties of amino acid played only a subsidiary role in the standard genetic code evolution. The adaptive hypothesis postulates that the code was created to minimize the effect of amino acid replacements and errors which occur during the translation process (Epstein 1966; Freeland and Hurst 1998a, b). The main argument for this explanations follows from the present structure of the standard genetic code where there is observed a tendency to group similar amino acids in the same column of the code table. However, using several optimization methods it was shown that the standard genetic code can be significantly improved under some criteria to minimize genetic errors and is rather a suboptimal solution in the vast space of possible genetic codes (Di Giulio 1989; Santos and Monteagudo 2010; Blazej et al. 2016, 2018a). It should be noted that proposed explanations are still not satisfactory, and none of these hypotheses became a comprehensive explanatory theory. On the other hand, they are not to be mutually exclusive because the main drivers of the standard genetic code evolution postulated by these hypotheses could have a significant impact on the evolution at different stages.

It is interesting that many questions concerning the properties of the standard genetic code can be formulated as an interesting problem from mathematical and also computational point of view. Many authors used some optimization methods, such as single or multi-objective evolutionary algorithms, to test the quality of the standard genetic code under selected criteria. Moreover, they used many different and—at the same time—interesting mathematical approaches to describe properties or more generally to develop some rules for generating theoretical genetic codes. These techniques mainly follow from graph theory, coding theory, and group theory (Fimmel et al. 2016, 2017, 2018, 2014; José et al. 2017; Tlusty 2010).

A broader class of models of the genetic code, the so-called BDA-generated models (binary dichotomic algorithms) (Gumbel et al. 2015), is based on overlappings of the so-called dichotomic partitions of the set of codons. The most known dichotomic partition is the Rumer’s degeneracy dichotomy (Fimmel and Strüngmann 2016; Rumer 2016a, b, c) which decomposes codons into two disjoint equal-sized classes: the first Rumer class identifies amino acids with degeneracy 4 (for which the first two bases of the triplet are sufficient to define unambiguously the amino acid), while the second one specifies amino acids with degeneracy non-4 (i.e. 1, 2 or 3). A generalization of Rumer’s algorithm (Fimmel et al. 2013) has led to a family of binary dichotomic algorithms (BDAs) which derive their decision for classifying a codon from biochemical properties of the bases involved. These algorithms distinguish whether a base is of type

  • purine (denoted as R = {A, G}) or pyrimidine (Y = {C, T})

  • keto (K = {T, G} or amino (Am = (C, A)

  • strong (S = {C, G}) or weak (W = {A, T}).

and classify the codons into two disjoint classes of equal size. Dichotomic partitions seem to contribute to frame retrieval and error detection properties, sustaining a robustness of the code (Giannerini et al. 2012).

The other approach involved to investigate the properties of the standard genetic code follows on graph theory. Many authors (Tlusty 2010) used this methodology to develop several genetic code representations. They applied this approach to their studies about the structure and evolution of the standard genetic code. Among them, the set conductance approach presented in Blazej et al. (2018b) appears to be especially interesting in testing the quality of codon blocks structure generated by BDA-algorithms. Moreover, this methodology has an interesting biological interpretation as the level of robustness of the genetic code structure against single point mutation.

In the present work, we are developing the conductance approach for studying the genetic code further, applying it for a measurement of the quality of BDA-generated models and comparing them with randomly generated models. This enables us to construct variants of the genetic code with a good set conductance.

2 Methods

2.1 BDA-Models

In the sequel \({\mathcal {B}}=\{A,C,G,T(U)\}\) will denote the set of four nucleotide bases Uracil (Thymine), Cytosine, Adenine, and Guanine, in short T(U), CAG. A codon is an element of \({\mathcal {B}}^{3}\), e.g. ACU.

The alphabet \({\mathcal {B}}\) can be decomposed in three different ways into two disjoint equal-sized subsets. Each of these decompositions has a biochemical meaning:

$$\begin{aligned} {\mathcal {B}}= & {} \{C,G\}\cup \{A,T\}\quad \hbox {(strong/weak)},\\ {\mathcal {B}}= & {} \{C,A\}\cup \{G,T\}\quad \hbox {(amino/keto)},\\ {\mathcal {B}}= & {} \{C,T\}\cup \{A,G\}\quad \hbox {(pyrimidine/purine)}. \end{aligned}$$

Based on these biochemical properties of nucleotides, we can classify the set of codons into two disjoint equal-sized subsets, establishing a dichotomic partition of \({\mathcal {B}}^3\). Let us first give a precise definition of how we understand a dichotomic partition:

Definition 2.1

An ordered pair \((H_0,H_1)\) of subsets \(H_0,H_1 \subseteq {\mathcal {B}}^3\) is called a dichotomic partition of \({\mathcal {B}}^3\) if \(H_0 \cap H_1=\emptyset \), \(H_0 \cup H_1 = {\mathcal {B}}^3\) and \(\mid H_0 \mid = \mid H_1 \mid \).

In other words: the set of 64 codons is divided into two disjoint subsets of equal size as, for instance, the so-called Rumer partition does, which separates the codons, where two first bases are enough to determine the encoded amino acid, from the codons, where the third base for the decision is needed.

In Fimmel et al. (2013), the notion of binary dichotomic algorithms was introduced for sequences of nucleotide bases of arbitrary length, i.e. for classification of n-nucleotides \(c\in {\mathcal {B}}^n, n\in {\mathbb {N}}\). For the purposes of the present work, it is sufficient to consider only the set of codons, i.e. \(c\in {\mathcal {B}}^3.\) Let us recall the definition from Fimmel et al. (2013) in this special case:

Definition 2.2

Let \((H_0,H_1)\) be a dichotomic partition of \({\mathcal {B}}^3\). We call an algorithm \({\mathcal {A}}\) a binary dichotomic algorithm (BDA) with dichotomic partition\((H_0,H_1)\) if it follows the following scheme: \({\mathcal {A}}\) chooses two indices \(i_1, i_2\in \{1,2,3\}\) with \(i_1\ne i_2\), an ordered pair of different nucleotide bases \(Q_1=(B_1,B_2)\) and a subset \(Q_2\subset {\mathcal {B}}\) with \(|Q_2|=2\). Now \({\mathcal {A}}\) classifies \(c=(b_1, b_2, b_3) \in {\mathcal {B}}^3\) as follows:

  1. (A)

    if \(b_{i_1}\in \{B_1, B_2\}\), then

    $$\begin{aligned} (c\in H_0\quad \text{ if }\quad b_{i_1}=B_1,)\quad \text{ and }\quad (c\in H_1\quad \text{ if }\quad b_{i_1}=B_2,) \end{aligned}$$
  2. (B)

    if \(b_{i_1}\notin \{B_1, B_2\}\), then

    $$\begin{aligned} (c\in H_0\quad \text{ if }\quad b_{i_2}\in Q_2,)\quad \text{ and }\quad (c\in H_1\quad \text{ if }\quad b_{i_2}\notin Q_2.) \end{aligned}$$

We will call \(Q_1\) and \(Q_2\) the questions of \({\mathcal {A}}\), \(i_1,i_2\in \{1,2,3\}\) the indices of \({\mathcal {A}}\), and the pair \((H_0,H_1)\) a dichotomic partition of\({\mathcal {B}}^3\)generated by the binary dichotomic algorithm\({\mathcal {A}}\).

Remark 2.3

We will call in short a binary dichotomic algorithm BDA and will write for all \(c\in H_0\)\({\mathcal {A}}(c)=0\) and for all \(c\in H_1\)\({\mathcal {A}}(c)=1\).

Figure 1 depicts an example of how a BDA works in order to get Rumer’s degeneracy dichotomy:

Fig. 1
figure 1

Algorithmic way to define Rumer’s dichotomy (Colour figure online)

If we apply several BDAs successively, we ‘cut’ the set of codons \({\mathcal {B}}^3\) into disjoint subsets. For example, we obtain four subsets labelled as (0, 0), (1, 0), (0, 1), (1, 1) (or in short 00, 10, 01, 11) when two different BDAs are applied. In the subset (1, 0) we have, for instance, codons which are classified by the first BDA into the class 1 and by the second BDA into the class 0. In Gumbel et al. (2015), based on this notion, a class of models of the genetic code was introduced:

Definition 2.4

Let \(k\in {\mathbb {N}}\). We will call a bijective mapping

$$\begin{aligned} M:{\mathcal {B}}^3\rightarrow \{0,1\}^k \end{aligned}$$

a BDA-generated model of the genetic code of grade k if there exist k different BDAs

$$\begin{aligned} {\mathcal {A}}_1, {\mathcal {A}}_2, {\mathcal {A}}_3, ..., {\mathcal {A}}_k \end{aligned}$$

such that for all \(c\in {\mathcal {B}}^3\) the following equation holds:

$$\begin{aligned} M(c)=({\mathcal {A}}_1(c),{\mathcal {A}}_2(c),...,{\mathcal {A}}_k(c)). \end{aligned}$$

Remark 2.5

It means for a codon \(c\in {\mathcal {B}}^3\) that the mapping M assigns c in the jth coordinate 1 if c is classified by \({\mathcal {A}}_j\) for the dichotomic class \(H_1\) and 0 if c is classified by \({\mathcal {A}}_j\) for the dichotomic class \(H_0\).

Table 1 shows the standard genetic code and the Rumer classification as an example for a BDA.

Table 1 Standard genetic code where ! represents a stop codon

2.2 Conductance

In this section, we introduce a methodology which comes from graph theory. Using these characteristics, we can describe some new features of BDA-algorithms in terms of the graph partition quality. We start our investigation by giving a definition of specific graph, which includes all information about single point mutations occurred in protein-coding sequences.

Definition 2.6

Let G(VE) be a graph in which V is the set of vertices (nodes) representing all possible 64 codons, whereas E is the set of edges connecting these vertices. All connections are defined in such a way that two vertices \(c, c' \in V \), i.e. respective codons, are connected by the edge \(e(c; c') \in E \) if and only if the codon c differs from \(c'\) in exactly one position.

According to Definition 2.6, the graph G is unweighted, undirected and also regular because the degree of each node is equal to nine (compare Fig. 2).

Fig. 2
figure 2

The example of AAA codon neighbourhood generated by single nucleotide substitution. There are exactly nine codons which differ from AAA in exactly one nucleotide (Colour figure online)

Fig. 3
figure 3

Graphical representation of the graph defined in 2.6 (Colour figure online)

Furthermore, G has a nice interpretation from biological point of view because the set of edges E represents all possible single point mutations, i.e. single nucleotide substitutions, which can occur between codons in protein-coding sequences. In this work, we would like to investigate properties of the selected partitions of the vertices of the graph represented in Fig. 3 into fixed number \( 1<k\le 64\) of disjoint and non-empty subsets \({{\mathcal {C}}}_{k}\), i.e. k codon groups. The \({{\mathcal {C}}}_{k}\) partition is defined in the following way:

$$\begin{aligned} {{\mathcal {C}}}_{k}= \{ S_{1},S_{2},\ldots , S_{k}:\ S_{i}\cap S_{j}=\emptyset ,\ S_{1}\cup S_{2}\cup \ldots \cup S_{k}=V \}. \end{aligned}$$

It is easy to see that for \(k=21\), we get \({{\mathcal {C}}}_{21}\) which is a representation of the genetic code as a partition of the set of vertices V into 21 disjoint and non-empty subsets. Therefore, it is interesting to test some characteristics of the \({{\mathcal {C}}}_k\) following graph theory. Particularly, we considered properties of \({{\mathcal {C}}}_k\) in terms of the optimal graph partitioning. Generally, the problem of finding optimal, in some sense, partition of G can be investigated from different perspectives. However, the idea presented in Blazej et al. (2018b), using the conductance property, appears to be promising for further research around the standard genetic code. The central role in this approach plays the set conductance measure which is used to calculate the quality of a given genetic code but clearly this method is used in more general clustering problem. This characteristic is defined in the following way:

Definition 2.7

For a given graph \(G=(V,E)\) let S be a subset of V . The conductance of S is defined as:

$$\begin{aligned} \phi (S)=\frac{E(S,{\overline{S}})}{9|S|} \end{aligned}$$

where \(E(S, {\bar{S}})\) is the number of edges of G crossing from S to its complement \({\bar{S}}\).

The set conductance has many applications, for example, in the theory of random walks, theory of Markov processes (Levin et al. 2009) and also in social networks (Bollobàs 1998). Moreover, \(\phi (S)\) has also a very interesting biological interpretation. Assuming that all codons belonging to S encode the same label, i.e. amino acids or stop coding signal, \(\phi (S)\) is the ratio of the total number of non-synonymous single nucleotide substitution to all possible nucleotide substitution generated by all codons from S. Moreover, the Definition 2.7 allowed us as to define the conductance of the partition \({{\mathcal {C}}}_k\).

Definition 2.8

The conductance of a partition \({{\mathcal {C}}}_k\) is defined as

$$\begin{aligned} \Phi ({{\mathcal {C}}}_k)=\max _{S\in {{\mathcal {C}}}_k}\phi (S). \end{aligned}$$

Therefore, the \(\Phi \) measure gives us a characterization of the quality of a given partition \({{\mathcal {C}}}_k\) as the set conductance of the worst, in this term, codon group. What is more, \(\Phi \) measure involves a question about the structure of the optimal graph G partition \({{\mathcal {C}}}_k\) for a fixed k. In this context, the best partition \({{\mathcal {C}}}_k\) of the graph G in terms of \(\Phi \) follows in a natural way and is given by the formula:

$$\begin{aligned} \Phi _{\min }(k)=\min _{{{\mathcal {C}}}_k}\Phi ({{\mathcal {C}}}_k). \end{aligned}$$

The definition of \(\Phi _{\min }\) is identical with the definition of the kth-order graph conductance presented in Lee et al. (2014) and has an interpretation as lower boundary of a set robustness against changes which cause transitions between codon groups. It should be noted that in the case of \(k=2\), the minimum partition conductance \(\Phi _\mathrm{min}\) is equivalent to the definition of the graph conductance (Lee et al. 2014).

3 Results and Discussion

3.1 Conductance of BDA-Partitions

We consider in this paper BDA-generated models of the genetic code from the viewpoint of their conductance. The next proposition shows that the conductance of only one BDA-partition is independent on the algorithm applied:

Proposition 3.1

Let \({\mathcal {A}}\) be a BDA with the indices \(i_1, i_2\in \{1,2,3\}, i_1\ne i_2\) and the questions \(Q_1 = (B_1, B_2) (B_1 \ne B_2)\), and \(Q_2 = \{B_3, B_4\}\) with \(B_3 \ne B_4\), \({\mathcal {C}}= (H_0, H_1)\) the corresponding BDA-partition of \({\mathcal {B}}^3\). Then, the following formula holds:

$$\begin{aligned} \phi (H_0)=\phi (H_1)=\Phi ({\mathcal {C}})=\frac{80}{9\cdot 32}=0.2(7). \end{aligned}$$

Proof

Since \(|H_0|=|H_1|\) and \(E(H_1,H_2)=E(H_2,H_1)\), we get immediately \(\phi (H_0)=\phi (H_1)=\Phi ({\mathcal {C}}_{2})\). Therefore, it is sufficient to show \(\phi (H_0)=0.2(7)\). Let us consider \(c=(b_1,b_2,b_3)\in H_0\) and assume also without loss of generality that \(i_1=1, i_2=2\). Since all codons of the form \(c=(B_2,b_2,b_3)\) are in \(H_1\), we have to take into account the following two cases:

  1. Case 1:

    Let \(b_{1}=B_1\). There are 16 codons fulfilling this condition in total. We have two cases in which the edge could go outside the set \(H_0\), namely if the base in the first position in codon is substituted, i.e. \(b_{1}\rightarrow B_2\), we have one possible edge going outside \(H_0\). Moreover, when codon c fulfils additional condition, i.e. for 8 codons out of 16: \( b_2\notin Q_2\), we obtain two additional edges, i.e. \(b_1\rightarrow \overline{ \{B_1 , B_2\}}\). Therefore, the total number of crossing edges calculated for all \((B_1,b_2,b_3)\) type codons is equal to 32.

  2. Case 2:

    Let \(b_{1}\notin \{B_1,B_2\}\). In this case, all codons c belonging to the set \(H_0\), 16 in total, have the following form: \(c=(b_1,[B_3|B_4],b_3)\). In this case, each c has one crossing edge generated by substitution in the first position in codon i.e. \(b_1\rightarrow B_2\). Moreover, they all have two possible crossing edges generated by nucleotide substitution in the second position in codon, i.e. \(b_2 \rightarrow \overline{Q_2}\). As a result, the total number of crossing edges is equal, in this case, to 48.

To sum up, we have \(48+32=80\) edges crossing from \(H_0\) to \(H_1\) and, thus,

$$\begin{aligned} \phi (H_0)=\frac{80}{9\cdot 32} \end{aligned}$$

what completes the proof. \(\square \)

The following Theorem helps to calculate the minimal possible conductances for subsets of \({\mathcal {B}}^3\) of arbitrary size:

Theorem 3.2

Let \(G=(V;E)\) be the graph according to the Definition 2.6, \(N_1<N_2<N_3<N_4\) a linear order over the alphabet \({\mathcal {B}}=\{A;C;G;T(U)\}\), e.g. \(C<G<A<T\), and \(S_k\subseteq V\) the collection of the first \(k=1,2,\ldots , 64\) vertices of the graph G in the lexicographic order. Then, the following recursive formula for the number of edges of G crossing from \(S_k\) to its complement \({\bar{S}}_k\) holds:

$$\begin{aligned} E(S_{k+1}, \overline{S_{k+1}})=E(S_{k}, \overline{S_{k}})+9-2\cdot (k_1+k_2+k_3), \quad E(S_{1}, \overline{S_{1}})=9 \end{aligned}$$

where \((k_1,k_2,k_3)_4, k_i\in \{0,1,2,3\}\)Footnote 1 is the representation of k to base 4, i.e.

$$\begin{aligned} k=k_1\cdot 4^2+k_2\cdot 4^1+k_3\cdot 4^0. \end{aligned}$$

The conductance of \(S_k\) is accordingly equal to

$$\begin{aligned} \phi (S_k)=\frac{E(S_{k}, \overline{S_{k}})}{9\cdot k}. \end{aligned}$$

Proof

It is clear that \(E(S_{1}, \overline{S_{1}})=9\) since the graph 2.6 is 9-regular and for only one codon in \(S_1\) all edges are crossing edges between \(S_1\) and \({\bar{S}}_1\).

Let us assume now that we already have calculated \(E(S_{k}, \overline{S_{k}})\) for \(k\ge 1\) and we are inserting now the next codon \(c\in {\mathcal {B}}^3\) in the lexicographic order. It is easy to see that all codons ordered in lexicographic order can be rewritten as a sequence of consecutive three-digit numbers to the base 4 if we assign, for example, \(N_1\rightarrow 0, N_2\rightarrow 1, N_3\rightarrow 2, N_4\rightarrow 3\). Therefore, newly included codon c has a numeric representation \(c= (k_1, k_2, k_3)_4\), where \(k_i=0,1,2,3\). What is more, \(k_i,\ i=1,2,3\) is the number of codons that differ from c at the position i which are smaller than c in the lexicographic order and the total number of edges crossing \(S_k\) and c, i.e. \(E(S_k,\{c\})\), is, consequently, equal to \(k_1+k_2+k_3\). As a result, the total number of edges between \(S_{k+1}\) to its complement fulfils the equation:

$$\begin{aligned} E(S_{k+1}, \overline{S_{k+1}})= & {} E(S_k, \overline{S_k})-E(S_k,\{c\}) +9-E(S_k,\{c\})\\= & {} E(S_{k}, \overline{S_{k}})+9-2\cdot (k_1+k_2+k_3) . \end{aligned}$$

That completes the proof. \(\square \)

With Table 2, we calculate conductances for all \(S_k\) from Theorem 3.2 for \(1\le k\le 32\). It suffices if we calculate it for \(1\le k\le 32\) since in the case of at least one partitioning of \({\mathcal {B}}^3\) into at least 2 subsets, the size of one of them will be at most 32:

Table 2 The Table shows exact values for conductances of all \(S_k\) from Theorem 3.2 and \(1\le k\le 32\)

Following the approach from Proposition 3.2, we can calculate the minimal conductances for arbitrary partitions:

Proposition 3.3

Let \({\mathcal {C}}_n\) be a partition of graph G where \(n\in {\mathbb {N}}\), i.e. with n classes. Then, we have the following lower boundary for the conductance of the partition \({\mathcal {C}}_n\) :

  1. (1)

    For \(n\ne 12,n\ne 3\)

    $$\begin{aligned} \Phi ({\mathcal {C}}_n)\ge \phi (S_k)\;\text{ with }\;k=\left\lfloor \frac{64}{n}\right\rfloor , \end{aligned}$$
  2. (2)

    For \(n=12\)

    $$\begin{aligned} \Phi ({\mathcal {C}}_{12})\ge \phi (S_6), \end{aligned}$$
  3. (3)

    For \(n=3\)

    $$\begin{aligned} \Phi ({\mathcal {C}}_{3})\ge \phi (S_{23}), \end{aligned}$$

where \(\phi (S_k)\) is the entry corresponding to k from Table 2.

Proof

According to Proposition 3 from Blazej et al. (2018b), the collection of the first k vertices taken in lexicographic order of a graph as defined in 2.6 has the minimal conductance among all subsets of the same size k.

  1. (1)

    For \(n=2\) we have two subsets, the size of one of them has to be at most 32. Since

    $$\begin{aligned} \phi (S_{64})<\phi (S_k)\quad \text{ for } \text{ all }\quad k<32 \end{aligned}$$

    takes place, a partition with the minimal conductance has to consist of two equal-sized subsets. Thus, we get a partition with the minimum conductance if we classify in the lexicographic order the first 32 codons into one class and the remaining 32 into the other one. Corresponding to Table 2, the conductance of such partition is equal to

    $$\begin{aligned} \Phi ({\mathcal {C}})=\frac{64}{9\cdot 32}=0.(2) \end{aligned}$$

    Let \(n\ge 4, n\ne 12\). In this case, \(k=\lfloor \frac{64}{n}\rfloor \le 16 \) represents the average size of a subset in a partition and \(\phi (S_k)\) is decreasing with increasing of k with only one exception \(\phi (S_4)<\phi (S_5)\). Since \(\Phi ({\mathcal {C}}_n)\) is defined as the maximal value of conductances of all subsets from \({\mathcal {C}}\), it is equal to the conductance of the ‘worst’, in this sense, subset. Since, on the one hand, the following inequality takes place

    $$\begin{aligned} \phi (S_{i})>\phi (S_{15})\ge \phi (S_{j}),\quad 1\le i\le 14,\quad 17\le j\le 32 \quad \text{(compare } \text{ table }\;2) \end{aligned}$$

    and, on the other hand, increasing the size of one subset leads to decreasing it for another subset of the partition, we have that it is not favourable to have bigger than 16 codons subsets in the partition. Let us now assume that

    $$\begin{aligned} \Phi ({\mathcal {C}}_n)<\phi (S_k). \end{aligned}$$

    It means that for all subsets \(C_i\in {\mathcal {C}}_n\), we have

    $$\begin{aligned} \phi (C_i)<\phi (S_k). \end{aligned}$$

    According to the behaviour of the function \(\phi (S_k)\) (compare Table 2), it means that for all \(i=1,...,n\;\; |C_i|>k\) or \(k=5\). In the first case, we obtain immediately a contradiction since then we have

    $$\begin{aligned} \sum _{i=1}^n |C_i|\ge n\cdot (k+1)> 64. \end{aligned}$$

    In the second case (\(k=5\)), we have \(n=12\) what is excluded.

  2. (2)

    Let \(n=12\) and \(\Phi ({\mathcal {C}}_{12})<\phi (S_6)\). It means that for all subsets \(C_i\in {\mathcal {C}}_{12}\), we have

    $$\begin{aligned} \phi (C_i)<\phi (S_6) \end{aligned}$$

    and, thus, for all \(i=1,...,12\;\; |C_i|>6\). Consequently,

    $$\begin{aligned} \sum _{i=1}^{12} |C_i|\ge 12\cdot 7=84> 64. \end{aligned}$$

    This is a contradiction.

  3. (3)

    Let \(n=3\) and \(\Phi ({\mathcal {C}}_{3})<\phi (S_{23})\). It means that for all subsets \(C_i\in {\mathcal {C}}_{3}\), we have

    $$\begin{aligned} \phi (C_i)<\phi (S_{23}) \end{aligned}$$

    and, thus, for all \(i=1,2,3\;\; |C_i|>23\). Consequently,

    $$\begin{aligned} \sum _{i=1}^{3} |C_i|\ge 3\cdot 23=69> 64. \end{aligned}$$

    This is a contradiction.

\(\square \)

Applying the Proposition 3.3 in the special case of BDA-partitions, we obtain:

Corollary 3.4

Let M be a BDA-model of \({\mathcal {B}}^3\) with \(n\in {\mathbb {N}}\) classes and \({\mathcal {C}}\) the corresponding BDA-partition. Then, we have for the conductance of \({\mathcal {C}}\) the following lower boundary:

  1. (1)

    For \(n\ne 12\)

    $$\begin{aligned} \Phi ({\mathcal {C}})\ge \phi (S_k)\;\text{ with }\;k=\left\lfloor \frac{64}{n}\right\rfloor \end{aligned}$$

    and \(\phi (S_k)\) the entry corresponding k from Table 2.

  2. (2)

    For \(n=12\)

    $$\begin{aligned} \Phi ({\mathcal {C}})\ge 0.(6). \end{aligned}$$

Proof

According to Proposition 6 in Gumbel et al. (2015), it is not possible for a BDA-generated model that \(n=3\). The remaining part follows immediately from 3.3. \(\square \)

3.2 Best Conductance BDAs

In the previous section, we found lower boundaries for BDA-generated partitions. However, it is not clear yet whether these boundaries are sharp. We have adapted the algorithm described in Gumbel et al. (2015) and searched for models of the genetic code based on BDAs with the best conductance \(\Phi _\mathrm{min}\). The following examples in Tables 345 and 6 show that we can indeed obtain partitions generated with BDAs with the best possible conductance for the class sizes 8, 12, 16, and 24. All these code tables contain the Rumer-BDA, the code table for 24 classes additionally uses the Complementary-BDA. These partitions are achieved with the minimum number of BDAs required, i.e. 3 for 8 classes, 4 for 12 and 16 classes and 5 for 24 classes.

Table 3 \(|M|=8\) classes generated by three BDAs including Rumer. (A) Code table, (B) list of BDAs. Conductance of \({\mathcal {C}}_8\) is \(\Phi ({\mathcal {C}}_8)=5/9=\Phi _\mathrm{min}(8)\). ! represents a stop codon (Colour figure online)

According to the corollary 3.3, we have an exception if the number of generated classes is equal to 12. Table 4 shows that we can also obtain in this case a partition with the best possible conductance with a BDA-model.

Table 4 \(|M|=12\) classes generated by three BDAs including Rumer. (A) Code table, (B) list of BDAs. Conduction of \({\mathcal {C}}_{12}\) is \(\Phi ({\mathcal {C}}_{12})=2/3=0.(6)=\Phi _\mathrm{min}(12)\) (Colour figure online)
Table 5 \(|M|=16\) classes generated by four BDAs including Rumer. (A) Code table, (B) list of BDAs. Conductance of \({\mathcal {C}}_{16}\) is \(\Phi ({\mathcal {C}}_{16})=2/3=\Phi _\mathrm{min}(16)\) (Colour figure online)

Next, it was analysed whether a code table for the standard genetic code could be created by means of BDAs, i.e. if we can classify 21 classes (20 for amino acids plus 1 for stop codons). If we consider codons of degeneracy 6 (like those for Serine) as codons belonging to two groups of size 4 and 2 each—like in the Rumer transformation—we get three extra classes, and thus 24 classes in total. Table 6 shows such a model of the genetic code with 24 classes and optimal conductance (\(\Phi ({\mathcal {C}}_{24})=8/9\)). It is striking that again the Rumer-BDA but this time also the Complementary-BDA is included. Moreover, the code with optimal conductance is also to some extend compatible with the standard universal code. In Gumbel et al. (2015), an error metric E, (\(0\le E\le 1\)) was introduced to indicate how “good” a code is compatible with the standard genetic code, where an error of \(E=0\) represents a perfect match. It is known that the standard genetic code does not have an optimal conductance as there are codons coding only for one amino acid, e.g. AUG for Methionine (Blazej et al. 2018b). However, the code in Table 6 with optimal conductance has only a compatibility error of \(E=12/64\). That is, only 12 changes in the assignments of codons to amino acids are required to get a perfect standard universal code. When the mitochondrial vertebrate code is considered (table not shown), this comparability error could even be reduced to \(E=10/64=0.15625\).

Table 6 \(|M|=24\) classes generated by six BDAs including Rumer and Complementary. (A) Code table, (B) list of BDAs. Conductance of \({\mathcal {C}}_{24}\) is \(\Phi ({\mathcal {C}}_{24})=8/9=\Phi _\mathrm{min}(24)\). The compatibility error is \(E = 0.1875=12/64\) (Colour figure online)

For the sake of completeness and to ensure that the best conductance BDAs are not a multiple of four for the number of classes, other class sizes ranging from 4 to 23 have been tested. BDA-generated models with best possible conductance values could be indeed obtained for

$$\begin{aligned} |M|=4,7,8,10,11,12,13,14,15,16,22,23,24. \end{aligned}$$

However, it is not possible, for example, for \(|M|=21\) as it was proven with a comprehensive search as explained in Gumbel et al. (2015). In this case, the best possible partitioning has the conductance value equal to 7 / 9 (Blazej et al. 2018b) and cannot be obtained using a BDA-generated model of the genetic code. For the remaining class sizes (\(|M|=5,6,9,17,18,19,20\)), a sample of 10,000 partitions for each class size (compare Fig. 4) did not show any best BDA-model and it remains to be shown, whether there are any BDA-models; however, this is very unlikely.

3.3 Distribution of Conductance

Lower boundaries for the optimal conductance of BDA-generated models were derived, and it was proven that there are BDA-models which are optimal. This section shows that those models have a much better conductance compared to random partitions with the same number of classes. Figure 4a–c shows the distribution of the set conductance for code tables (1) generated by random BDAs and (2) random partitions with 8, 16, and 24 classes. The number of BDAs of a random model ranges from the minimum number required, e.g. 3 for 8 classes to four extra BDAs, i.e. 7. for 8 classes. Those redundant BDAs were included because it was shown in Gumbel et al. (2015) that some partitions can only be achieved with more than the minimum number of BDAs.

In any case, random BDAs create code tables with a better conductance and some of them are optimal. All random partitions for 24 classes have the worst conductance of 1. BDA-generated partitions, however, have either the best conductance (\(\Phi ({\mathcal {C}}_{24})=8/9\)) or the worst, too.

Fig. 4
figure 4

Distribution of conduction for different partitions \({\mathcal {C}}_k\). Green bars show partitions generated by BDAs, and red bars show random partitions. Blue dashed line indicates the best conductance of \({\mathcal {C}}_k\). No random partition has an optimal conductance. d, e zoom in and show only a fraction of the y-axis as the scale in a and b is not sufficient to see the details. This is not required in c as all frequencies are visible. Sample size is 10,000. (a, d) Eight classes: there are about 0.2 % BDA-partitions with optimal conductance (\(\Phi ({\mathcal {C}}_8)=5/9\)). Numbers of BDAs per model (\({\mathcal {A}}_1,\ldots , {\mathcal {A}}_l\)) range from \(3 \le l \le 7\). b, e 16 classes: again there are BDA-partitions (about 0.4 %) with optimal conductance (\(\Phi ({\mathcal {C}}_{16})=2/3\)). Numbers of BDAs per model range from 4 to 8. c 24 classes: also BDA-partitions (8 %) with optimal conductance (\(\Phi ({\mathcal {C}}_{24})=8/9\)). Numbers of BDAs per model range from 5 to 9 (Color figure online)

Even if there exist no BDA-generated models with the best possible conductance for some class sizes (compare the previous section), the BDA-generated models perform in terms of average conductance significantly better than the randomly generated ones.

4 Conclusions

In this work, we are discussing more deeply the properties of BDA-generated models of the genetic code. To do so, we are incorporating a new methodology following on graph theory. According to this approach, each BDA algorithm and, more generally, genetic code induces its own partition of graph vertices into disjoint and non-empty subsets, corresponding to amino acids to be encoded. The quality of a given partition was calculated by using the maximum partition conductance measure. This measure gives us a general overview of the quality of each set of codons belonging to its partition because it is based on calculating a proportion of a number of all possible non-synonymous substitution to all nucleotide changes for every codon group. Therefore, the maximum conductance has a biological interpretation as a measure of robustness of partition sets against point mutations. Moreover, the maximum partition conductance can be used in general for evaluating quality of theoretical genetic codes which encode different number of amino acids. In this context, we found a formula for the lower boundary of the maximum conductance for graph partitions with a given number of classes corresponding to amino acids. We also compared it to a large number of randomly generated partitions. It should be noted that, taking a single BDA, none of the dichotomic partitions obtained has the minimal conductance; however, applying overlappings of BDA-partitions, i.e. BDA-generated models, we can reach the minimal possible conductance values, i.e. create the most robust against point mutations models of the genetic code. Moreover, the BDA-models have a substantially better quality in comparison with randomly generated partitions. This result indicates that the quality of models generated by BDA-algorithms can not easily be overcome by just a random process of amino acids’ assignment to codons. The results obtained can, for instance, be useful for understanding the evolution of the genetic code.