1 Introduction

The use of genome rearrangements to estimate evolutionary distance dates back as far as Watterson et al. (1982). The novelty of this approach is that it ignores single nucleotide polymorphisms (SNPs) and takes genes and their relative positions (and orientation) as the fundamental unit of DNA for the purposes of distance calculations. This is particularly valuable in the case of bacterial DNA, which undergoes relatively frequent rearrangement (relative to eukaryotes, that is), but which also experiences significant horizontal, or lateral, gene transfer (HGT). The use of large-scale rearrangements to establish distance is thought to be less vulnerable to the effects of HGT than the use of SNPs (Darling et al. 2008).

The double cut and join (DCJ) operator, introduced by Yancopoulos et al. (2005) [see also Bergeron et al. (2006)], provided a significant breakthrough by treating a much larger family of operations acting on a more general, multichromosomal genome, and showing how distance can be expressed in a very simple formula based on features of a graph derived directly from the genome arrangements. While the DCJ operator treats all operations in its remit as equally likely, it is possible that this operator may provide a valuable base for operators that account for differences in frequency of different operations, and may potentially be specialised to genomes with specific chromosomal structure (such as a single circular chromosome).

With this potential in mind, in this paper we translate the DCJ operator into a group-theoretic setting. We show that by expressing the multi-chromosomal genome with \(n\) oriented regions as a permutation of \(\{1,\dots ,2n\}\), a DCJ operator can be defined as an action on the genome. Hence, the set of double cut and join operators generates a group acting on the entire genome space. The DCJ distance is then a path distance on the Cayley graph of this group [as described in Egri-Nagy et al. (2014)]. We show how the DCJ distance between two genomes can be obtained in a very simple way from the permutation encoding of the genomes. We obtain a formulation for the DCJ distance that is analogous to the distance formula found by Yancopoulos et al. (2005), but is expressed in terms of features of a permutation. This is derived independently of the established DCJ theory.

Over the last decade, there have been several examples of algebraic approaches to modeling biological phenomena, particularly in the genomic distance literature. While the traditional approach to the problem of finding distance between genomes is to cast them as permutations, limited use has been made of the powerful machinery that algebra provides to deal with permutations. Adopting an algebraic viewpoint might in fact reveal deep insights and lead to simplification.

A recent example of this is the work of Lu et al. (2006) who used the theory of symmetric groups to give an algorithm that gives a sorting sequence between circular genomes using fission, fusion and block interchanges. More recently, group theory has been used by the authors’ group to calculate the inversion distance between circular genomes under one model (Egri-Nagy et al. 2014), and a wider algebraic framework has been proposed that includes DNA knotting (Francis 2014).

A circular genome is modeled as a cyclic permutation by Meidanis and Dias (2000), treating a genome as a permutation of the genes \(a_1,a_2, \ldots , a_n\). Writing the circular genome in cycle notation as \((a_1,a_2,\ldots ,a_n)\) denotes that gene \(a_{i}\) is adjacent on the genome to gene \(a_{i+1}\), with gene \(a_n\) being adjacent to \(a_1\). This approach allowed them to derive many important properties of the breakpoint graph in terms of permutation products, and to give a lower bound on the transposition distance. This work was later extended in Feijão and Meidanis (2013) to include linear chromosomes. Feijão and Meidanis (2013) model a genome as a product of disjoint 2-cycles, and present a formulation of a \(k\)-break operation as a permutation. The double cut and join model then becomes a special case of the \(k\)-break operation with \(k=2\).

Like Feijão and Meidanis (2013), we model the genome as a product of disjoint 2-cycles and the double cut and join operation as conjugation. The novelty of our work lies in the development of this model in a completely algebraic framework, without making use of the existing theory. This allows us to present new proofs for existing results, sometimes leading to a considerable simplification of arguments, such as in the result on counting sorting scenarios (Theorem 7.3).

This paper is organized as follows. In Sect. 2 we introduce the double cut and join model, following the exposition of Bergeron et al. (2006). Section 3 explains how we can encode a genome as a product of 2-cycles, essentially extending the concept of “adjacencies” and “telomeres” from the established theory. In Sect. 4 we give the major construction of this paper, namely the definition of the DCJ operator as a group action on the genome space. While the standard definition has several cases depending on the arrangement of the genome, this action requires just two cases. In order to find a distance formula in this model (Sect. 6), we first need to establish some results about products of involutions, covered in Sect. 5. The main result, establishing distance in this model, is given in the following theorem:

Main Theorem

(Theorem 6.11) Let \(G_1\) and \(G_2\) be genomes on \(n\) regions with corresponding genomic permutations \(\pi _1\) and \(\pi _2\). The DCJ distance between \(G_1\) and \(G_2\) is given by

$$\begin{aligned} d_{DCJ}\left( \pi _1,\pi _2\right) =\frac{1}{2}\left( \ell _t (\pi _2\pi _1)+ n_c\right) \end{aligned}$$

where \(n_c\) is the number of cycles in the product \(\pi _2\pi _1\) which contain two fixed points of \({\pi _1}\) or \({\pi _2}\), and \(\ell _t\) is the transposition length.

Finally, in Sect. 7, we derive a formula for the number of optimal sorting scenarios between two genomes, that is, the number of minimal length paths in the Cayley graph of the group generated by the DCJ operators. As our work utilizes many well-known results about permutations, we have collected them in Appendix A for ease of reference. These results are stated without proof; a complete treatment may be found in an abstract algebra text such as Herstein (2006) or Fraleigh (2003).

2 The double cut and join model

In this section we follow the notation of Bergeron et al. (2006). For a more complete introduction to the model and results, see that paper as well as Yancopoulos et al. (2005).

2.1 The genome graph

Before presenting the double cut and join operator, we first explain how multichromosomal genomes are modeled. In this model, a gene is essentially an oriented section of the DNA and its two ends are called its extremities. This of course differs from the biological meaning of the word gene and is closer to what are referred to as “conserved blocks” in the rearrangement literature [for example see Hannenhalli and Pevzner (1995), Lin and Tang (2006)]. However, for convenience we will use the words gene and region interchangeably. The extremities of the gene \(a\) are denoted \(a_t\) and \(a_h\) where the subscripts stand for tail and head respectively.

To represent a genome, considered as an arrangement of oriented genes, it is sufficient to note which extremities are adjacent on the genome. An extremity that is not adjacent to any other is the end point of a linear section of the genome and is called a telomere. An (unordered) pair of extremities that are adjacent on the genome is referred to as an adjacency. For instance, the adjacency \(\{a_t,b_h\}\) indicates that the tail of gene \(a\) is adjacent to the head of gene \(b\) on the genome. Note that an extremity can be adjacent to at most one other extremity.

Thus, in this model a genome is represented by a partition of the set of extremities of the genes into subsets of cardinality 1 (telomeres) or 2 (adjacencies). Equivalently, the genome can be viewed as a graph whose vertex set is the set of all adjacencies and telomeres and whose edges are drawn between the extremities of the same gene. Thus every vertex of a genome graph has degree one or two. Figure 1 illustrates a genome graph.

Fig. 1

The genome graph of a genome with one linear chromosome containing genes numbered 1, 2, 3 and 4, and one circular chromosome containing genes numbered 5 and 6. The vertex set of the graph is \(\{\{1_t\}, \{1_h,3_t\},\{3_h,2_t\},\{2_h,4_t\},\{4_h\},\{5_h,6_t\},\{6_h,5_t\}\}\). Edges are drawn between extremities of the same gene. The map \(\phi \) (Definition 3.2) maps the set of extremities \(\{1_t,1_h,2_t,2_h, \ldots , 6_h\}\) into \( \{1,2,3,4, \ldots , 12 \}\), with \(\phi (1_t)= 1\), \(\phi (1_h)=2\) and so on. \(1_t\) and \(4_h\) are telomeres, hence 1 and 8 are fixed points of the permutation \(\pi \) (Definition 3.3). \(1_h\) is connected to \(3_t\) which is captured by the 2-cycle (2, 5) in the permutation encoding. The other 2-cycles can be similarly interpreted. The above genome is thus encoded as the permutation \((2,5)(3,6)(4,7)(10,11)(9,12)\)

2.2 The double cut and join operator

The double cut and join (DCJ) operator acts on a pair of vertices of a genome graph in one of the following ways:

  1. \(\{p,q\},\{r,s\}\) may be changed to \(\{p,r\},\{q,s\}\) or \(\{p,s\},\{q,r\}\),

  2. \(\{p,q\},\{r\}\) may be changed to \(\{p,r\},\{q\}\) or \(\{q,r\},\{p\}\),

  3. \(\{p,q\}\) may be changed to \(\{p\},\{q\}\) or \(\{p\},\{q\}\) changed to \(\{p,q\}\).

Depending on the vertices that it acts on, the double cut and join operator can simulate the inversion, excision and translocation of a section of the genome as well as fusion and fission of chromosomes. Figure 2 presents some examples.

Fig. 2

(a) \(\{1_h,3_t\},\{3_h,2_t\}\) is changed to \(\{1_h,3_h\},\{3_t,2_t\}\), leading to an inversion. (b) \(\{1_t\}, \{1_h,3_t\}\) is changed to \(\{1_h\},\{1_t,3_t\}\), another inversion. (c) \(\{2_h\},\{4_h\}\) is changed to \(\{2_h,4_h\}\), a fusion

2.3 The double cut and join distance

The DCJ distance between genomes \(G_1\) and \(G_2\) is the minimal number of DCJ operations required to change one genome into the other.

Bergeron et al. (2006) make use of a graph construct called the “adjacency graph” to determine the DCJ distance between two genomes. An adjacency graph \(AG(G_1,G_2)\) can be drawn for any pair of genomes \(G_1\) and \(G_2\) defined on the same set of \(n\) genes. The vertex set of the graph is the set of all adjacencies and telomeres in \(G_1\) and \(G_2\). For an adjacency (or telomere) \(u \in G_1\) and adjacency (or telomere) \(v \in G_2\), there is an edge in \(AG(G_1,G_2)\) between \(u\) and \(v\) for each gene extremity they have in common.

The vertices of the adjacency graph then have degree either one (at a telomere) or two (at an adjacency), so the graph consists of a set of cycles and a set of paths. Let \(c\) be the number of cycles and \(p\) be the number of paths of odd length in \(AG(G_1,G_2)\). Bergeron et al. (2006) established that the DCJ distance between two genomes can be given in terms of these adjacency graph statistics as follows:

$$\begin{aligned} d_{DCJ}\left( G_1,G_2\right) =n-(c+p/2). \end{aligned}$$
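The formula above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `dcj_distance_adjacency_graph` and the input format (each genome as a list of tuples of integer extremity labels, a 1-tuple for a telomere and a 2-tuple for an adjacency) are choices made here for illustration and do not come from Bergeron et al. (2006). Connected components of the adjacency graph are found with a small union-find structure; a component with no telomere is counted as a cycle, and a component with a telomere is a path whose length is its number of edges.

```python
def dcj_distance_adjacency_graph(g1, g2, n):
    """Evaluate n - (c + p/2) on the adjacency graph AG(g1, g2).

    g1, g2: lists of tuples of extremity labels (hypothetical input format);
    a tuple of length 1 is a telomere, a tuple of length 2 is an adjacency.
    """
    # Map each extremity to the vertex (tagged by genome) that contains it.
    where1 = {e: ("G1", frozenset(v)) for v in g1 for e in v}
    where2 = {e: ("G2", frozenset(v)) for v in g2 for e in v}
    assert where1.keys() == where2.keys() and len(where1) == 2 * n

    parent = {v: v for v in list(where1.values()) + list(where2.values())}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # One edge per shared extremity, joining a G1 vertex to a G2 vertex.
    for e in where1:
        parent[find(where1[e])] = find(where2[e])

    edges, has_telomere = {}, {}
    for e in where1:
        root = find(where1[e])
        edges[root] = edges.get(root, 0) + 1
    for v in parent:
        root = find(v)
        has_telomere[root] = has_telomere.get(root, False) or len(v[1]) == 1

    cycles = sum(1 for r in edges if not has_telomere[r])
    odd_paths = sum(1 for r in edges if has_telomere[r] and edges[r] % 2 == 1)
    return n - (cycles + odd_paths // 2)


# The genome of Fig. 1, written with integer labels 2i-1 (tail) and 2i (head).
fig1 = [(1,), (2, 5), (3, 6), (4, 7), (8,), (10, 11), (9, 12)]
# The same genome after the inversion of gene 3 shown in Fig. 2a.
fig2a = [(1,), (2, 6), (3, 5), (4, 7), (8,), (10, 11), (9, 12)]
print(dcj_distance_adjacency_graph(fig1, fig1, 6))
print(dcj_distance_adjacency_graph(fig1, fig2a, 6))
```

In this sketch the number of odd paths is always even, because every odd path has exactly one endpoint in each genome and each genome has an even number of telomeres, so the integer division by two loses nothing.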

3 Genomes as permutations

We now present our reformulation of the double cut and join model. We first formalize the notion of a genome on \(n\) regions. Let \(\{h,t\}\) be the extremities of a gene where \(h\) and \(t\) denote the head and tail respectively. Let \({\mathbf {n}}\) be the set \(\{1,2,\ldots ,n\}\) enumerating the \(n\) regions.

Definition 3.1

(Extremities) The Cartesian product \(E={\mathbf {n}} \times \{h,t\}\) is the set of all extremities of \(n\) regions.

To conform to the notation used earlier in this paper and in previous literature, we will use \(i_h\) and \(i_t\) to denote the extremities \((i,h)\) and \((i,t)\) giving the head and tail of gene \(i\) respectively.

We define a map that assigns numeric labels to the elements of \(E\).

Definition 3.2

(Assignment map) Let \(\phi :E \rightarrow {\mathbf {2n}}\) be defined as follows:

$$\begin{aligned} \phi (i_t)&= 2i-1,\\ \phi (i_h)&= 2i. \end{aligned}$$

Definition 3.3

(Genome) A genome on \(n\) regions is a permutation \(\pi \) on the set \(E\) such that

$$\begin{aligned} \pi (i)=j \iff \pi (j)=i. \end{aligned}$$

The above definition implies that a genomic permutation is a product of disjoint 2-cycles. The restriction in the definition of a genome captures the notion of pairing of gene extremities on a genomic strand. Therefore a 2-cycle in this formulation is an adjacency, and similarly, fixed points of a permutation are telomeres. It is important to note that at this point we use the permutation \(\pi \) as a static description of the genome, not as an operation, so that the 2-cycles can be considered as synonyms for unordered pairs. Furthermore, this construction with 2-cycles representing adjacencies means that the identity permutation will only arise in the trivial case in which all chromosomes in the genome contain just a single region.

As mentioned in Sect. 2, the vertex set of the genome graph consists of the adjacencies and telomeres. For every gene \(i\), an edge is drawn between the vertex containing \(i_h\) and the vertex containing \(i_t\). In writing the genome as a permutation, the 2-cycles and the fixed points are the adjacencies and the telomeres. The assignment map \(\phi \) tells us the correspondence between the gene extremities and the set \({\mathbf {2n}}\). Hence the assignment map \(\phi \) and the genomic permutation \(\pi \) contain all the information that is needed to construct the genome.

Bafna and Pevzner (1993) introduced the notion of a breakpoint graph for an unsigned permutation. To extend this concept to signed permutations, they transform a signed permutation \(\pi \) on \(n\) elements to an unsigned permutation \(\pi ^{\prime }\) on \(2n\) elements. This is done by replacing a positive integer \(i\) in \(\pi \) by \(2i-1\) followed by \(2i\) in \(\pi ^{\prime }\) and by replacing a negative integer \(-i\) in \(\pi \) by \(2i\) followed by \(2i-1\). This transformation is precisely the labeling map \(\phi \). Reverse orientation of a gene on a chromosome means that the tail of the gene is present after the head of the gene. Thus in the permutation representation, a negative integer \(-i\) is replaced by \(2i\) (label of \(i_h\)) followed by \(2i-1\) (label of \(i_t\)).
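As a small illustration of this transformation, the following Python sketch doubles a signed permutation into its unsigned image. The function name `unsigned_image` and the list-of-integers input are assumptions made here for illustration only.

```python
def unsigned_image(signed):
    """Replace +i by (2i-1, 2i) and -i by (2i, 2i-1), as described above."""
    out = []
    for g in signed:
        i = abs(g)
        if g > 0:
            out.extend([2 * i - 1, 2 * i])   # tail label, then head label
        else:
            out.extend([2 * i, 2 * i - 1])   # head label, then tail label
    return out


# Gene 3 is reversed, so its head label 6 precedes its tail label 5.
print(unsigned_image([1, -3, 2]))  # [1, 2, 6, 5, 3, 4]
```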

We use cycle notation to write permutations. Thus the cycle \((i_1, i_2, i_3, \ldots , i_n)\) in a permutation \(\alpha \) means that \(\alpha (i_1)=i_2, \alpha (i_2)=i_3\) etc. and \(\alpha (i_n)=i_1\) (see Appendix A). Figure 1 illustrates an example of permutation encoding of a genome on \(6\) regions.
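The encoding of Fig. 1 can be reproduced with a minimal Python sketch. The names `phi`, `encode_genome` and `two_cycles`, and the representation of an extremity as a `(gene, end)` pair with `end` equal to `"t"` or `"h"`, are assumptions made for this example; a permutation is stored as a dictionary mapping each label to its image.

```python
def phi(gene, end):
    """Assignment map of Definition 3.2: tail -> 2i-1, head -> 2i."""
    return 2 * gene - 1 if end == "t" else 2 * gene


def encode_genome(vertices, n):
    """Turn a list of adjacencies and telomeres into a permutation of 1..2n."""
    perm = {x: x for x in range(1, 2 * n + 1)}  # telomeres remain fixed points
    for v in vertices:
        if len(v) == 2:                          # an adjacency becomes a 2-cycle
            a, b = (phi(*e) for e in v)
            perm[a], perm[b] = b, a
    return perm


def two_cycles(perm):
    """List the 2-cycles (i, j) with i < j."""
    return sorted((i, j) for i, j in perm.items() if i < j)


fig1 = [
    ((1, "t"),),
    ((1, "h"), (3, "t")),
    ((3, "h"), (2, "t")),
    ((2, "h"), (4, "t")),
    ((4, "h"),),
    ((5, "h"), (6, "t")),
    ((6, "h"), (5, "t")),
]
pi = encode_genome(fig1, 6)
print(two_cycles(pi))                      # [(2, 5), (3, 6), (4, 7), (9, 12), (10, 11)]
print([x for x in pi if pi[x] == x])       # [1, 8]
```

The printed 2-cycles and fixed points agree with the permutation and the telomeres given in the caption of Fig. 1.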

A genome on \(n\) regions is a permutation of the set \({\mathbf {2n}}\) satisfying the constraints in Definition 3.3. Lemma 3.4 gives an expression for the number of permutations in \(S_{2n}\) satisfying this definition.

Lemma 3.4

The number of genomes on \(n\) regions is given by

$$\begin{aligned} \sum _{t=0}^n{\left( {\begin{array}{c}2n\\ 2t\end{array}}\right) (2n-2t-1)!!}=\sum _{t=0}^n{\left( {\begin{array}{c}2n\\ 2t\end{array}}\right) (2t-1)!!}. \end{aligned}$$

Proof

Let the set of all genomes on \(n\) regions be \({\varGamma }_n\). Each genome is a permutation of the \(2n\) extremities \(E\), and hence \({\varGamma }_n\) is a subset of the symmetric group \(S_{2n}\). The cardinality of \({\varGamma }_n\) can be determined as follows.

Each genomic permutation must have an even number of fixed points, since it is a product of disjoint 2-cycles and 1-cycles acting on a set of even cardinality (\(2n\)). This also follows from the fact that fixed points are telomeres, and a genome must have an even number of telomeres.

Let the number of fixed points be \(2t\). The remaining \(2n-2t\) elements must be paired off with each other. Each such pairing of the \(2n-2t\) elements defines an involution in the symmetric group \(S_{2n-2t}\) that does not have any fixed points. An involution is an element of order 2 i.e., \(\pi \) is an involution if \(\pi ^2\) is the identity permutation.

The number of such involutions is \((2n-2t-1)!!\) (Stanley 1999, pp. 15–16) where the double factorial function is the product of odd numbers i.e. \((2k-1)!!=\prod _{i=1}^k(2i-1)\). Therefore the cardinality of \({\varGamma }_n\) is given by

$$\begin{aligned} \left| {\varGamma }_n \right| = \sum _{t=0}^n{\left( {\begin{array}{c}2n\\ 2t\end{array}}\right) (2n-2t-1)!!}=\sum _{t=0}^n{\left( {\begin{array}{c}2n\\ 2t\end{array}}\right) (2t-1)!!}. \end{aligned}$$

\(\square \)

The number of genomes is already almost a billion for 9 regions. The astute observer will note that this number is also the number of tableaux on \(2n\) elements, with a correspondence given by the Robinson–Schensted algorithm [see for instance Fulton (1997)]. The first nine numbers in the sequence are shown in Table 1.

Table 1 The number of genomes on \(n\) regions (also the number of tableaux on \(2n\) elements)
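The count in Lemma 3.4 is easy to evaluate numerically. The sketch below, with function names chosen here for illustration, computes the sum directly and cross-checks it against the standard recurrence \(a(m)=a(m-1)+(m-1)\,a(m-2)\) for the number of involutions on \(m\) points, which is an independent way of counting the same permutations.

```python
from math import comb


def double_factorial_odd(k):
    """(2k-1)!! = 1 * 3 * ... * (2k-1); the empty product (k = 0) is 1."""
    result = 1
    for i in range(1, k + 1):
        result *= 2 * i - 1
    return result


def count_genomes(n):
    """Sum over 2t fixed points of C(2n, 2t) * (2n - 2t - 1)!!."""
    return sum(comb(2 * n, 2 * t) * double_factorial_odd(n - t)
               for t in range(n + 1))


def count_involutions(m):
    """Number of permutations of m points whose square is the identity."""
    a, b = 1, 1  # a(0), a(1)
    for k in range(2, m + 1):
        a, b = b, b + (k - 1) * a
    return b if m >= 1 else a


for n in range(1, 10):
    assert count_genomes(n) == count_involutions(2 * n)
    print(n, count_genomes(n))
```

For \(n=1,2,3\) the sketch gives 2, 10 and 76, and for \(n=9\) it gives 997313824, in line with the remark that the number of genomes is almost a billion for 9 regions.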

4 The DCJ operator as an action on a permutation

In this section we define an algebraic version of the DCJ operator acting on the set \({\varGamma }_n\) of genomes on \(n\) regions, and show that it is an involution. Appendix A contains a summary of some results on symmetric groups that may be useful for reference in this section and the next.

As explained in the previous section, the genome is modeled as a set of unordered pairs of gene extremities (adjacencies) and single gene extremities (telomeres). A DCJ operation as defined in Bergeron et al. (2006) swaps gene extremities between two pairs (i.e. adjacencies) or a pair and a singleton, as described in Sect. 2.2.

Hence, the possible scenarios are that the two gene extremities being swapped are: adjacent to each other on the genome; both involved in different adjacencies; one of them is a telomere and the other in an adjacency; or both of them are telomeres. When a DCJ operation acts on a pair of gene extremities that form an adjacency, it splits them, producing two telomeres, and conversely when it acts on two telomeres, it combines them into an adjacency. In the two other cases, an extremity is swapped.

Thus, in the permutation representation, the DCJ operation swapping \(i\) and \(j\) changes:

$$\begin{aligned} (i,k)(j,l)&\longrightarrow (j,k)(i,l),\\ (i,k)(j)&\longrightarrow (j,k)(i),\\ (i,j)&\longrightarrow (i)(j),\quad \text { and}\\ (i)(j)&\longrightarrow (i,j). \end{aligned}$$

With this in mind, for \(i,j \in \mathbf {2n}\), and \(i \ne j\) we define the double cut and join operator \(D_{ij}\) acting on the set of genomes \({\varGamma }_n\), \(D_{ij}: {\varGamma }_n \rightarrow {\varGamma }_n \) as follows:

Definition 4.1

(Set of fixed points) Let \(\pi \) be a permutation of \({\mathbf {2n}}\), then the set of fixed points of \(\pi \) is defined by

$$\begin{aligned} F_{\pi }:=\{i \mid i \in \mathbf {2n}, \pi (i)=i\}. \end{aligned}$$

Definition 4.2

(Algebraic double cut and join operator) For a permutation \(\pi \) representing a genome, set

$$\begin{aligned} D_{ij}(\pi ): = {\left\{ \begin{array}{ll} (i,j)\pi &{} \text {if }i,j \in F_{\pi }\text { or }\pi (i)=j,\text { and}\\ (i,j)\pi (i,j) &{} \text {otherwise. } \end{array}\right. } \end{aligned}$$

Therefore, in algebraic terms, the double cut and join operators as defined above are conjugations or left actions by 2-cycle involutions. To distinguish this formulation from the standard, we will call \(D_{ij}\) the algebraic double cut and join operator.
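A minimal Python sketch of Definition 4.2 follows. Permutations are stored as dictionaries on \(\{1,\ldots ,2n\}\) and products are composed right to left; the helper names `from_cycles`, `two_cycles` and `dcj` are assumptions introduced for this example.

```python
def from_cycles(n, pairs):
    """Build a genomic permutation on 1..2n from a list of 2-cycles."""
    perm = {x: x for x in range(1, 2 * n + 1)}
    for a, b in pairs:
        perm[a], perm[b] = b, a
    return perm


def two_cycles(perm):
    """List the 2-cycles (x, y) with x < y."""
    return sorted((x, y) for x, y in perm.items() if x < y)


def dcj(perm, i, j):
    """Algebraic DCJ operator D_ij acting on a genomic permutation."""
    if i == j:
        raise ValueError("i and j must be distinct")

    def swap(x):  # the transposition (i, j)
        return j if x == i else i if x == j else x

    both_fixed = perm[i] == i and perm[j] == j
    if both_fixed or perm[i] == j:
        # Left multiplication (i, j) * pi: apply pi first, then (i, j).
        return {x: swap(perm[x]) for x in perm}
    # Conjugation (i, j) * pi * (i, j).
    return {x: swap(perm[swap(x)]) for x in perm}


pi = from_cycles(6, [(2, 5), (3, 6), (4, 7), (10, 11), (9, 12)])  # Fig. 1
print(two_cycles(dcj(pi, 5, 6)))  # inversion of Fig. 2a: (2,5)(3,6) -> (2,6)(3,5)
print(two_cycles(dcj(pi, 1, 8)))  # joins the telomeres 1 and 8 into (1,8)
print(two_cycles(dcj(pi, 2, 5)))  # cuts the adjacency (2,5) into two telomeres
```

The first branch covers the two cases in which an adjacency is created or destroyed, and the second branch covers the two cases in which an extremity is swapped, mirroring the two cases of the definition.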

The vertex set of the genome graph consists of unordered 2-tuples and singletons from the set of gene extremities \(E\). The map \(\phi \) simply relabels elements of the set \(E\) with the labels from the set \({\mathbf {2n}}\). Therefore we can consider the vertex set of the genome graph to consist of unordered 2-tuples and singletons from \({\mathbf {2n}}\). A genomic permutation \(\pi \) is a permutation on the set \(\mathbf {2n}\) satisfying the constraint \(\pi (i)=j \iff \pi (j)=i\).

Let \(\rho \) be the map that writes the vertex set of the genome graph \(G\) as permutation \(\pi \).

$$\begin{aligned} \rho (\{i,j\})=(i,j) \quad \text {and} \quad \rho (\{i\})=(i). \end{aligned}$$

Let \({\mathcal {D}}_{\{i,j\}}(G)\) be the DCJ operator acting on the extremities \(i\) and \(j\) of genome \(G\) and let \(D_{ij}(\pi )\) be the operator acting on the permutation \(\pi \). The remarks motivating the definition of algebraic DCJ operator informally explain why we can expect the graph-theoretic and the algebraic operators to be equivalent. That is, the diagram in Fig. 3 commutes.

Fig. 3

Rewriting genome \(G\) as the permutation \(\pi \), and acting on \(\pi \) by \(D_{ij}\) gives the same result as acting on \(G\) by the DCJ operator and then rewriting the result as permutation \(\pi ^{\prime }\)

We now prove this statement formally.

Lemma 4.3

For all genomes \(G\),

$$\begin{aligned} \rho \left( {\mathcal {D}}_{\{i,j\}}(G)\right) = D_{ij} \left( \rho (G)\right) . \end{aligned}$$

Proof

We prove this by considering the four cases that arise from the definition of the operator \({\mathcal {D}}\) (Sect. 2.2).

Case 1. \(i\) and \(j\) are in separate 2-tuples \(\{i,k\},\{j,l\} \in G\).

$$\begin{aligned}&\rho \left( {\mathcal {D}}_{\{i,j\}}(G)\right) =\rho \left( {\mathcal {D}}(\{i,k\},\{j,l\})\right) =\rho \left( \{j,k\},\{i,l\}\right) =(j,k)(i,l).\\&D_{ij} \left( \rho (\{i,k\},\{j,l\}) \right) = D_{ij} \left( (i,k)(j,l)\right) =(j,k)(i,l). \end{aligned}$$

Case 2. Exactly one of \(i,j\) is in a 2-tuple: \(\{i,k\},\{j\} \in G\).

$$\begin{aligned}&\rho \left( {\mathcal {D}}_{\{i,j\}}(G)\right) =\rho \left( {\mathcal {D}}(\{i,k\},\{j\})\right) =\rho (\{j,k\},\{i\})=(j,k)(i).\\&D_{ij} \left( \rho (\{i,k\},\{j\}) \right) = D_{ij} \left( (i,k)(j)\right) =(j,k)(i). \end{aligned}$$

Case 3. Neither \(i\) nor \(j\) is in a 2-tuple: \(\{i\},\{j\} \in G\).

$$\begin{aligned}&\rho \left( {\mathcal {D}}_{\{i,j\}}(G)\right) =\rho \left( {\mathcal {D}}_{\{i,j\}} (\{i\},\{j\})\right) =\rho (\{i,j\})=(i,j).\\&D_{ij} \left( \rho (\{i\},\{j\})\right) = D_{ij} \left( (i)(j)\right) =(i,j). \end{aligned}$$

Case 4. \(i,j\) are in the same 2-tuple: \(\{i,j\} \in G\).

$$\begin{aligned}&\rho \left( {\mathcal {D}}_{\{i,j\}}(G)\right) =\rho \left( {\mathcal {D}}_{\{i,j\}} (\{i,j\})\right) =\rho (\{i\},\{j\})=(i)(j).\\&D_{ij} \left( \rho (\{i,j\})\right) = D_{ij} \left( (i,j)\right) =(i)(j). \end{aligned}$$

Thus in all cases, \(\rho \left( {\mathcal {D}}_{\{i,j\}}(G)\right) = D_{ij} \left( \rho (G)\right) \). \(\square \)

The following lemma shows that the algebraic DCJ operator is an involution.

Lemma 4.4

\(D_{ij}^2(\pi )=\pi \) for all \(\pi \in {\varGamma }_n\) and \(i,j \in {\mathbf {2n}}\).

Proof

For any permutation \(\pi \in {\varGamma }_n\), if \(i,j\) are not both telomeres and do not form an adjacency of \(\pi \), then the same holds in \(D_{ij}(\pi )=(i,j)\pi (i,j)\). Similarly if \(i\) and \(j\) are both telomeres in \(\pi \) then they form an adjacency in \(D_{ij}(\pi )\), and if they are in an adjacency in \(\pi \), they will both be telomeres in \(D_{ij}(\pi )\). Thus acting by \(D_{ij}\) on \(D_{ij}(\pi )\) will cause the same condition in the definition of \(D_{ij}\) to be invoked which was invoked when \(D_{ij}\) acted on \(\pi \).

The operation in each case is an involution, hence \(D_{ij}^2(\pi )=\pi \) for all \(\pi \in {\varGamma }_n\). \(\square \)
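Lemma 4.4 can also be checked exhaustively for small \(n\). The sketch below is an illustration only: it enumerates every involution on \(\{1,\ldots ,2n\}\), applies a dictionary-based version of the operator twice for every pair \(i \ne j\), and confirms that the original permutation is recovered. The function names `all_involutions`, `dcj` and `lemma_4_4_holds` are assumptions made for this example.

```python
def all_involutions(points):
    """Yield every involution on the given points as a dict."""
    points = list(points)
    if not points:
        yield {}
        return
    first, rest = points[0], points[1:]
    # Either `first` is a fixed point (telomere) ...
    for sub in all_involutions(rest):
        yield {first: first, **sub}
    # ... or it is paired with some partner (adjacency).
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for sub in all_involutions(remaining):
            yield {first: partner, partner: first, **sub}


def dcj(perm, i, j):
    """Algebraic DCJ operator of Definition 4.2 (composition right to left)."""
    def swap(x):
        return j if x == i else i if x == j else x

    if (perm[i] == i and perm[j] == j) or perm[i] == j:
        return {x: swap(perm[x]) for x in perm}      # (i, j) * pi
    return {x: swap(perm[swap(x)]) for x in perm}    # (i, j) * pi * (i, j)


def lemma_4_4_holds(n):
    labels = range(1, 2 * n + 1)
    for pi in all_involutions(labels):
        for i in labels:
            for j in labels:
                if i == j:
                    continue
                image = dcj(pi, i, j)
                # The image is again a genome, and D_ij undoes itself.
                if any(image[image[x]] != x for x in image):
                    return False
                if dcj(image, i, j) != pi:
                    return False
    return True


print(lemma_4_4_holds(2), lemma_4_4_holds(3))
```

Such a brute-force check is of course no substitute for the proof; it only guards against slips in a particular implementation of the two cases.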

At this point, we make the following note about the notation employed in the remainder of this paper. Permutations are functions where the operand is written on the right. So for example, \(\pi (i)\) is the permutation \(\pi \) acting on \(i\). In line with this, permutation multiplication is done from right to left.

A 2-cycle is written as \((i,j)\). We will not in general write cycles of length 1, except for emphasis.

5 Products of involutions

In this section we prove some results about products of involutions. We make use of these results in Sect. 6 to determine the DCJ distance between genomic permutations \(\pi _1\) and \(\pi _2\).

Lemma 5.1

Let \(\alpha \) and \(\beta \) be involutions acting on the set \({\mathbf {2n}}=\{1,2,\ldots ,2n\}\). If \(F_\alpha =F_\beta =\varnothing \), then

  1. For any \(i \in \mathbf {2n}\), \(i\) and \(\alpha (i)\) are in different cycles of \(\beta \alpha \). Similarly \(i\) and \(\beta (i)\) are in different cycles of \(\beta \alpha \).

  2. \(\beta \alpha \) has an even number of cycles of length \(k\) for any \(k \in \mathbb {N}\).

Proof

(1) The cycle in \(\beta \alpha \) containing \(1\) is of the form

$$\begin{aligned} \left( 1,\beta \alpha (1),\beta \alpha \beta \alpha (1),\ldots ,(\beta \alpha )^k(1)\right) \!, \end{aligned}$$

where \(k\) is the smallest positive integer such that \((\beta \alpha )^{k+1}(1)=1\), therefore the length of this cycle is \(k+1\). We claim that \(\alpha (1) \notin \{1,\beta \alpha (1),\ldots , (\beta \alpha )^k(1)\}\).

Suppose that \(\alpha (1)=(\beta \alpha )^r(1)\) for some \(r\). If \(r\) is even then

$$\begin{aligned} \alpha (1)&=(\beta \alpha )^r(1) \\&= (\beta \alpha )^{(r/2)-1}\beta \alpha (\beta \alpha )^{r/2}(1). \end{aligned}$$

By multiplying on the left both sides by \((\alpha \beta )^{(r/2)-1}\), (the inverse of \((\beta \alpha )^{(r/2)-1}\)), we get

$$\begin{aligned} (\alpha \beta )^{(r/2)-1}\alpha (1)=\beta \alpha (\beta \alpha )^{r/2}(1), \end{aligned}$$

and multiplying by \(\beta \) yields

$$\begin{aligned} \beta (\alpha \beta )^{(r/2)-1}\alpha (1)&=\alpha (\beta \alpha )^{r/2}(1)\\ (\beta \alpha )^{r/2}(1)&=\alpha \left( (\beta \alpha )^{r/2}(1)\right) \!. \end{aligned}$$

In other words, \((\beta \alpha )^{r/2}(1)\in F_\alpha \), contradicting the assumption that \(F_\alpha =\varnothing \). Similarly, if \(r\) is odd, we find that \(\alpha (1)=(\beta \alpha )^{r}(1)\) implies that \((\beta \alpha )^{(r+1)/2}(1)\) is a fixed point of \(\beta \), another contradiction.

Thus \(\alpha (1) \notin \{1,\beta \alpha (1),\ldots , (\beta \alpha )^k(1)\}\).

(2) Write the cycle in \(\beta \alpha \) containing \(\alpha (1)\) as \(\left( \alpha (1),\beta (1),\beta \alpha \beta (1),\ldots ,(\beta \alpha )^s \beta (1)\right) \), where \(s\) is the smallest positive integer such that \((\beta \alpha )^{s+1}\beta (1)=\alpha (1)\). Then multiplying both sides of the equation by \(\beta (\alpha \beta )^{s+1}\) (the inverse of \((\beta \alpha )^{s+1}\beta \)) we obtain

$$\begin{aligned} 1&= \beta (\alpha \beta )^{s+1} \alpha (1) \\&= (\beta \alpha )^{s+2}(1). \end{aligned}$$

That is, \((\beta \alpha )^{s+2}(1)=1\). But \((\beta \alpha )^{k+1}(1)=1\), and minimality of \(k\) and \(s\) implies \(s=k-1\). Thus, the length of the cycle containing \(\alpha (1)\) is \(k+1\), which is the same as the length of the cycle containing \(1\). Since the same argument holds for any \(i \in {\mathbf {2n}}\), there will be an even number of cycles of any given length in \(\beta \alpha \). \(\square \)
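Both parts of Lemma 5.1 can be observed on random examples. The following Python sketch draws fixed-point-free involutions at random, forms the product \(\beta \alpha \) (composed right to left), and tests the two claims. All function names are assumptions made here, and the random sampling is an illustration rather than a proof.

```python
import random
from collections import Counter


def random_fixed_point_free_involution(m, rng):
    """Pair off the points 1..m (m even) uniformly at random."""
    pts = list(range(1, m + 1))
    rng.shuffle(pts)
    perm = {}
    for a, b in zip(pts[0::2], pts[1::2]):
        perm[a], perm[b] = b, a
    return perm


def compose(beta, alpha):
    """The product beta * alpha, i.e. x -> beta(alpha(x))."""
    return {x: beta[alpha[x]] for x in alpha}


def cycles(perm):
    """Return the cycles of a permutation as lists of points."""
    seen, out = set(), []
    for start in perm:
        if start in seen:
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = perm[x]
        out.append(cyc)
    return out


def check_lemma_5_1(alpha, beta):
    prod_cycles = cycles(compose(beta, alpha))
    cycle_of = {x: idx for idx, c in enumerate(prod_cycles) for x in c}
    # Part 1: i is separated from alpha(i) and from beta(i).
    part1 = all(cycle_of[i] != cycle_of[alpha[i]] and
                cycle_of[i] != cycle_of[beta[i]] for i in alpha)
    # Part 2: every cycle length occurs an even number of times.
    lengths = Counter(len(c) for c in prod_cycles)
    part2 = all(count % 2 == 0 for count in lengths.values())
    return part1 and part2


rng = random.Random(0)
results = [check_lemma_5_1(random_fixed_point_free_involution(10, rng),
                           random_fixed_point_free_involution(10, rng))
           for _ in range(200)]
print(all(results))
```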

Lemma 5.2

Let \(\alpha \) and \(\beta \) be permutations on the set \({\mathbf {2n}}\) such that \(\alpha \) and \(\beta \) are involutions. Then a cycle in \(\beta \alpha \) contains at most 2 points from \(F_{\alpha } \cup F_{\beta }\).

Proof

If \(F_{\alpha } \cup F_{\beta } = \varnothing \), then any cycle of \(\beta \alpha \) contains \(0\) elements of \(F_{\alpha } \cup F_{\beta }\) and the statement is trivially true.

Suppose then that \(F_{\alpha } \cup F_{\beta } \ne \varnothing \). Suppose \(1 \in F_{\alpha }\); that is, \(\alpha (1)=1\). If \(1 \in F_{\beta }\), a similar argument would apply.

The cycle containing \(1\) in \(\beta \alpha \) is of the form

$$\begin{aligned} \left( 1,\beta \alpha (1),\beta \alpha \beta \alpha (1),\ldots , (\beta \alpha )^k(1)\right) \end{aligned}$$

where \(k\) is the smallest nonnegative integer for which \((\beta \alpha )^{k+1}(1)=1\). As in the proof of Lemma 5.1, this cycle contains \(\alpha (1)=1=(\beta \alpha )^{k+1}(1)\). We have argued in the proof of Lemma 5.1 that if \(k+1\) is odd then

$$\begin{aligned} (\beta \alpha )^{k+1}(1)=1=\alpha (1) \implies (\beta \alpha )^{(k+2)/2}(1) \in F_{\beta } \end{aligned}$$

and if \(k+1\) is even then

$$\begin{aligned} (\beta \alpha )^{k+1}(1)=1=\alpha (1) \implies (\beta \alpha )^{(k+1)/2}(1) \in F_{\alpha }. \end{aligned}$$

That is, if \(1 \in F_{\alpha }\) then the cycle containing 1 contains at least one other point from \(F_{\alpha } \cup F_{\beta }\), namely \((\beta \alpha )^{(k+2)/2}(1)\) if the length of the cycle is odd, and \((\beta \alpha )^{(k+1)/2}(1)\) if it is even.

Suppose that this cycle contains another point \(i \in F_{\alpha } \cup F_{\beta }.\) We claim that \(i\) must be one of the points identified above. Since \(i\) is in the cycle, for some positive integer \(s\), \(i=(\beta \alpha )^s(1)\). Let \(s\) be the smallest such integer.

If \(i \in F_{\alpha }\); that is, \(\alpha ((\beta \alpha )^s(1))=(\beta \alpha )^s(1)\). We then have

$$\begin{aligned} 1&=(\alpha \beta )^s\alpha (\beta \alpha )^s(1) \quad \text {since }((\beta \alpha )^s)^{-1}=(\alpha \beta )^s \\&=\alpha (\beta \alpha )^{2s}(1). \end{aligned}$$

But \(\alpha (1)=1\), so acting on both sides by \(\alpha \) we have that \((\beta \alpha )^{2s}(1)=1\). If \(i \in F_{\beta }\), so that \(\beta ((\beta \alpha )^s(1)) =(\beta \alpha )^s(1)\), we obtain \((\beta \alpha )^{2s-1}(1)=1\).

But since \(k+1\) is the minimal integer for which \((\beta \alpha )^{k+1}(1)=1\), and \(s\) is also minimal, it follows \(2s=k+1\) or \(2s-1=k+1\) according to whether \(i \in F_{\alpha }\) or \(i \in F_{\beta }\). That is,

$$\begin{aligned} s={\left\{ \begin{array}{ll} (k+1)/2 &{} \text { if } i \in F_{\alpha }, \\ (k+2)/2 &{} \text { if } i \in F_{\beta }. \end{array}\right. } \end{aligned}$$

Hence \(i\) is one of the points identified.

The two points are the same if

$$\begin{aligned} k+1=\frac{k+1}{2} \quad \text { or }\quad k+1=\frac{k+2}{2}. \end{aligned}$$

The first equation has no nonnegative solution. The only nonnegative integer satisfying the second equation is \(k=0\). If \(k=0\), that is, \((\beta \alpha )(1)=1\), then since \(1 \in F_{\alpha }\) it follows that \(\beta (1)=1\), and hence \(1\) is a fixed point of \(\beta \) as well. In this case, the cycle of \(\beta \alpha \) containing 1 has length 1 and hence contains a single point from \(F_{\alpha } \cup F_{\beta }\). Thus, a cycle of \(\beta \alpha \) contains no fixed points if there are no fixed points in \(\beta \) and \(\alpha \).

A cycle of length 1 contains a point \(i\) from \(F_{\alpha } \cup F_{\beta }\) if \(i\) is fixed in both \(\alpha \) and \(\beta \), that is if \(i \in F_{\alpha } \cap F_{\beta }\).

If \(F_{\alpha } \cap F_{\beta } = \varnothing \), then a cycle of \(\beta \alpha \) that contains a point from \(F_{\alpha } \cup F_{\beta }\) contains exactly two such points. \(\square \)
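For small sets the statement of Lemma 5.2 can be verified by enumeration. The sketch below, whose function names are assumptions made for this example, lists all involutions on a set of points, forms \(\beta \alpha \) for every ordered pair, and records how many points of \(F_{\alpha } \cup F_{\beta }\) each cycle contains.

```python
def all_involutions(points):
    """Yield every involution on the given points as a dict."""
    points = list(points)
    if not points:
        yield {}
        return
    first, rest = points[0], points[1:]
    for sub in all_involutions(rest):              # first is fixed
        yield {first: first, **sub}
    for k, partner in enumerate(rest):             # first is paired
        for sub in all_involutions(rest[:k] + rest[k + 1:]):
            yield {first: partner, partner: first, **sub}


def cycles(perm):
    """Return the cycles of a permutation as lists of points."""
    seen, out = set(), []
    for start in perm:
        if start in seen:
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = perm[x]
        out.append(cyc)
    return out


def fixed_point_counts(alpha, beta):
    """For each cycle of beta*alpha, count its points in F_alpha union F_beta."""
    fixed = {x for x in alpha if alpha[x] == x or beta[x] == x}
    product = {x: beta[alpha[x]] for x in alpha}   # right-to-left composition
    return [sum(1 for x in c if x in fixed) for c in cycles(product)]


def check_lemma_5_2(m):
    invs = list(all_involutions(range(1, m + 1)))
    for alpha in invs:
        for beta in invs:
            counts = fixed_point_counts(alpha, beta)
            if max(counts) > 2:
                return False
            disjoint = not any(alpha[x] == x and beta[x] == x for x in alpha)
            # With no common fixed point, each cycle holds 0 or 2 such points.
            if disjoint and any(c == 1 for c in counts):
                return False
    return True


print(check_lemma_5_2(4), check_lemma_5_2(6))
```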

Petersen and Tenner (2013) also investigate the nature of the product of involutions and prove similar results. Their results are stated in terms of the structure of an involution product graph.

6 Determining the DCJ distance

6.1 Subpermutations and the link to transposition distance

We define a binary relation on \({\mathbf {2n}}\) which will allow us to separate out the different components of a pair of genomic permutations, each of which we will then be able to sort independently of the others.

Definition 6.1

Let \(\pi _1\) and \(\pi _2\) be genomic permutations on \(n\) regions. That is, \(\pi _1\) and \(\pi _2\) are involutions on the set \({\mathbf {2n}}\). Define the binary relation \(\sim \) on \(\mathbf {2n}\) by setting

$$\begin{aligned} i \sim j \iff (\pi _2 \pi _1)^k(i)=j \text { or } \pi _1(\pi _2 \pi _1)^k(i)=j \text { for some } k \in {\mathbb {Z}}. \end{aligned}$$

The cycles of \(\pi _1 \pi _2\) and \(\pi _2 \pi _1\) are the same as sets (they are inverses of each other in \(S_{2n}\)), hence the binary relation \(\sim \) defined for the pair \(\pi _1,\pi _2\) would be the same as defined for the pair \(\pi _2,\pi _1\). Therefore, without any ambiguity \(\sim \) can be defined for an unordered pair of genomic permutations.

Lemma 6.2

\(\sim \) is an equivalence relation on \(\mathbf {2n}\).

Proof

It is easy to verify that \(\sim \) is reflexive and symmetric. To establish that \(\sim \) is transitive, note that since \(i\) could be related to \(j\) through either of the two relations \(\pi _1(\pi _2 \pi _1)^k(i)=j\) or \((\pi _2 \pi _1)^k(i)=j\), there are four possible cases to be checked. For example, suppose that \(\pi _1(\pi _2\pi _1)^p(i)=j\) and \((\pi _2\pi _1)^q(j)=k\). Noting that \((\pi _2\pi _1)^q\pi _1=\pi _1(\pi _2\pi _1)^{-q}\), since \(\pi _i\) are involutions, we have

$$\begin{aligned} k&=(\pi _2\pi _1)^q\pi _1(\pi _2\pi _1)^p(i) \\&=\pi _1(\pi _2\pi _1)^{p-q}(i), \end{aligned}$$

and hence \(i \sim k\). The other cases can be checked similarly. \(\square \)

For any \(i,j \in {\mathbf {2n}}\), if \(i\) and \(j\) are in the same cycle of \(\pi _2\pi _1\) then \(j=(\pi _2\pi _1)^s(i)\) for some \(s\) and hence \(i \sim j\). Also, \(i \sim \pi _1(i)\) and \(\pi _1(i)\) is related to all the elements in the cycle of \(\pi _2\pi _1\) that contains \(\pi _1(i)\). Therefore an equivalence class under \(\sim \) will be the union of the cycles of \(\pi _1\pi _2\) containing \(i\) and \(\pi _1(i)\).

Observe that \(i \sim \pi _1(i)\) and \(i \sim \pi _2(i)\). Hence the partition of \(\mathbf {2n}\) under \(\sim \) will also partition the 2-cycles of \(\pi _1\) and \(\pi _2\). In what follows we would like to talk about the sub-permutations of \(\pi _1\) and \(\pi _2\) thus induced. Formally,

Definition 6.3

Let \(\pi _1\) and \(\pi _2\) be genomic permutations, and let \({\mathcal {C}}_1,{\mathcal {C}}_2, \ldots , {\mathcal {C}}_r\subseteq \mathbf {2n}\) be the equivalence classes under \(\sim \) defined by \(\pi _1,\pi _2\). For \(1\le s\le r\) and \(i=1,2\), the sub-permutation \(\pi _i^{(s)}\) of \(\pi _i\) induced by \(\sim \) is defined to be the restriction of \(\pi _i\) to \({\mathcal {C}}_s\), that is,

$$\begin{aligned} \pi _i^{(s)}:=\pi _i \big |_{{\mathcal {C}}_s}. \end{aligned}$$

Intuitively, we have collected in a sub-permutation all the \(2\)-cycles that are relevant for sorting \(\pi _1^{(s)}\) into \(\pi _2^{(s)}\).

An example will help illustrate these definitions. Let \(\pi _1\) and \(\pi _2\) be the following genomic permutations on \(4\) regions:

$$\begin{aligned} \pi _1=(1,6)(2,3)(4,5)(7,8),\quad \pi _2=(1,2)(3,4)(5,6). \end{aligned}$$

The partitions of the set \(\{1,2,\ldots ,8\}\) under \(\sim \) are \({\mathcal {C}}_1=\{1,2,3,4,5,6\}, \ {\mathcal {C}}_2=\{7,8\}\).

The sub-permutations \(\pi _1^{(1)}\) and \(\pi _1^{(2)}\) are then \(\pi _1^{(1)}=(1,6)(2,3)(4,5)\), \(\pi _1^{(2)}=(7,8)\). Similarly, the sub-permutations \(\pi _2^{(1)}\) and \(\pi _2^{(2)}\) are \(\pi _2^{(1)}=(1,2)(3,4)(5,6)\), \(\pi _2^{(2)}=(7)(8)\) where the cycles of length 1 are written for clarity.
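The example can be reproduced with a short Python sketch. It assumes a representation that is not part of the paper: a genomic permutation is stored as a dictionary mapping each point to its image, and the helper names `involution`, `equivalence_classes` and `restrict` are hypothetical. It also uses the observation that \(i \sim \pi _1(i)\) and \(i \sim \pi _2(i)\), so that each class can be grown by repeatedly applying \(\pi _1\) and \(\pi _2\).

```python
def involution(pairs, size):
    """Build an involution on {1, ..., size} (as a dict) from disjoint 2-cycles."""
    perm = {i: i for i in range(1, size + 1)}
    for a, b in pairs:
        perm[a], perm[b] = b, a
    return perm


def equivalence_classes(p1, p2):
    """Classes of ~, grown by closing each point under p1 and p2."""
    seen, classes = set(), []
    for start in sorted(p1):
        if start in seen:
            continue
        block, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in (p1[i], p2[i]):
                if j not in block:
                    block.add(j)
                    stack.append(j)
        seen |= block
        classes.append(block)
    return classes


def restrict(perm, block):
    """Sub-permutation induced on one equivalence class."""
    return {i: perm[i] for i in sorted(block)}


pi1 = involution([(1, 6), (2, 3), (4, 5), (7, 8)], 8)
pi2 = involution([(1, 2), (3, 4), (5, 6)], 8)
classes = equivalence_classes(pi1, pi2)
sub1 = [restrict(pi1, c) for c in classes]
sub2 = [restrict(pi2, c) for c in classes]
print(classes)
```

Here `sub2[1]` is the identity on \(\{7,8\}\), matching \(\pi _2^{(2)}=(7)(8)\).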

As remarked earlier, an equivalence class \({\mathcal {C}}_s\) under \(\sim \) contains precisely those elements of \({\mathbf {2n}}\) that are contained in the cycle of \(\pi _1\pi _2\) containing \(i\) and \(\pi _1(i)\). Therefore as proved in Lemmas 5.1 and 5.2, the product of the sub-permutations \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) is either a single cycle containing one or two points from \(F_{\pi _1} \cup F_{\pi _2}\) or a product of (exactly) two disjoint cycles.

Suppose the sub-permutations \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are distinct. If \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are conjugate in \(S_{2n}\), then they have the same cycle type. Hence either \({\mathcal {C}}_s\) contains no points from \(F_{\pi _1} \cup F_{\pi _2}\), or it contains one point each from \(F_{\pi _1}\) and \(F_{\pi _2}\). In the first case, it follows from Lemma 5.1 that \({\mathcal {C}}_s\) contains an even number of points. In the latter case, the cardinality of \({\mathcal {C}}_s\) is odd.

These observations are summarised in Corollary 6.4.

Corollary 6.4

Let \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) be distinct sub-permutations induced by an equivalence class \({\mathcal {C}}_s\) such that \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are conjugate in \(S_{2n}\). Let \(i \in {\mathcal {C}}_s\). The product \(\pi _2^{(s)}\pi _1^{(s)}\) is given by

$$\begin{aligned} \pi _2^{(s)}\pi _1^{(s)}= {\left\{ \begin{array}{ll} \left( i,\pi _2\pi _1(i), \ldots , (\pi _2\pi _1)^u(i)\right) \left( \pi _1(i),\pi _2(i), \ldots , (\pi _2\pi _1)^{u-1}\pi _2(i)\right) &{} \text { if } F_{\pi _1} \cup F_{\pi _2}= \emptyset , \\ \left( i,\pi _2\pi _1(i), \ldots , (\pi _2\pi _1)^{2u}(i)\right) &{}\text { if } F_{\pi _1} \cup F_{\pi _2} \ne \emptyset , i \in F_{\pi _1}. \end{array}\right. } \end{aligned}$$

The sum of lengths of the cycles in the product is the cardinality of \({\mathcal {C}}_s\).

Continuing our example above, we see \(\pi _1^{(1)} \pi _2^{(1)}=(1,3,5)(2,6,4)\), \(\pi _1^{(2)} \pi _2^{(2)}=(7,8)\).

Observe that \(\pi _1^{(1)}\pi _2^{(1)}\) is a product of two disjoint cycles. \(\pi _1^{(2)}\pi _2^{(2)}\) contains two points from \(F_{\pi _2}\) namely 7 and 8.
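The products in this example can be checked mechanically. The sketch below assumes that permutations are composed from right to left, which is the convention consistent with the products displayed above; the dictionary representation and the helper names `compose` and `cycles` are illustrative choices rather than the paper's notation.

```python
def compose(p, q):
    """Product p q, read right to left: apply q first, then p."""
    return {i: p[q[i]] for i in q}


def cycles(perm):
    """Cycle decomposition; each cycle starts at its smallest point."""
    seen, out = set(), []
    for start in sorted(perm):
        if start in seen:
            continue
        cyc, i = [], start
        while i not in seen:
            seen.add(i)
            cyc.append(i)
            i = perm[i]
        out.append(tuple(cyc))
    return out


# pi1 = (1,6)(2,3)(4,5)(7,8) and pi2 = (1,2)(3,4)(5,6) on {1, ..., 8}
pi1 = {1: 6, 6: 1, 2: 3, 3: 2, 4: 5, 5: 4, 7: 8, 8: 7}
pi2 = {1: 2, 2: 1, 3: 4, 4: 3, 5: 6, 6: 5, 7: 7, 8: 8}

product_cycles = cycles(compose(pi1, pi2))
print(product_cycles)  # [(1, 3, 5), (2, 6, 4), (7, 8)]
```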

If a partition \({\mathcal {C}}_t\) consists of a single point say \(i\) then \(i\) is a fixed point of both \(\pi _1\) and \(\pi _2\), hence the DCJ distance between sub-permutations induced by \({\mathcal {C}}_t\) is 0.

We will determine the DCJ distance between \(\pi _1\) and \(\pi _2\) by determining the DCJ distance between \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) for \(s \in \{1,2, \ldots , r\}\).

Definition 6.5

For any permutation \(\pi \), the transposition length of \(\pi \) denoted by \(\ell _t(\pi )\) is the minimal number of transpositions needed to express \(\pi \).

Since the \(D_{ij}\) operation involves multiplying a permutation with transpositions, we are interested in how multiplication by a transposition affects the transposition length of a permutation. In fact this effect is easily stated: multiplication by a transposition changes the transposition length of a permutation by \(\pm 1\).

That is,

$$\begin{aligned} \ell _t\left( (i,j) \pi \right)&= \ell _t(\pi ) \pm 1,\nonumber \\ \ell _t\left( \pi (i,j)\right)&= \ell _t(\pi ) \pm 1. \end{aligned}$$
(1)

This can be observed by noting first that a permutation can be expressed as a product of either an odd or an even number of transpositions, but not both. That is, the parity of the number of transpositions needed to write a permutation as a product is unique.

Let the transposition length of a permutation \(\pi \) be \(r\). That is,

$$\begin{aligned} \pi =t_1 t_2 \ldots t_r, \end{aligned}$$

where the \(t_i\) are transpositions.

Suppose \(r\) is odd. Then the parity of \((i,j)\pi \) is even since \((i,j) t_1 t_2 \ldots t_r\) is one expression of the result as a product of transpositions, although it may not be minimal. The transposition length of \((i,j)\pi \) is also therefore even, and hence it is either \(r+1\) or \(r-1\). A similar argument follows if \(r\) is even.
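The \(\pm 1\) behaviour can be observed numerically. The sketch below relies on the standard fact that a permutation of \(N\) points with \(c\) cycles (fixed points counted as cycles of length 1) has transposition length \(N-c\); the function names are hypothetical and the permutation used is the product \((1,3,5)(2,6,4)(7,8)\) from the example of Section 6.1.

```python
def num_cycles(perm):
    """Number of cycles of perm, counting fixed points as 1-cycles."""
    seen, count = set(), 0
    for start in perm:
        if start not in seen:
            count += 1
            i = start
            while i not in seen:
                seen.add(i)
                i = perm[i]
    return count


def transposition_length(perm):
    """Minimal number of transpositions: (number of points) - (number of cycles)."""
    return len(perm) - num_cycles(perm)


def left_multiply(a, b, perm):
    """Return the product (a,b) perm, that is, apply perm first and then (a,b)."""
    swap = {a: b, b: a}
    return {x: swap.get(y, y) for x, y in perm.items()}


# (1,3,5)(2,6,4)(7,8) as a dictionary
pi = {1: 3, 3: 5, 5: 1, 2: 6, 6: 4, 4: 2, 7: 8, 8: 7}

base = transposition_length(pi)                        # 8 points, 3 cycles
split = transposition_length(left_multiply(1, 3, pi))  # 1 and 3 share a cycle
merge = transposition_length(left_multiply(1, 2, pi))  # 1 and 2 are in different cycles
print(base, split, merge)  # 5 4 6
```

Multiplying by a transposition whose points lie in the same cycle splits that cycle (length decreases by 1), while points in different cycles cause a merge (length increases by 1).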

In the remaining part of this section, we will show that the DCJ distance between \(\pi _1\) and \(\pi _2\) can be determined in terms of the transposition length of the permutation product \(\pi _2\pi _1\). First of all, note that if \(\pi _1=\pi _2\) then \(\pi _2\pi _1=()\) where \(()\) is the identity permutation, hence \(\ell _t\left( \pi _2\pi _1\right) =0\).

We make the following claim regarding a lower bound on the DCJ distance between permutations \(\pi _1\) and \(\pi _2\).

Lemma 6.6

Let \(\pi _1\) and \(\pi _2\) be genomic permutations. Then

$$\begin{aligned} d_{DCJ}\left( \pi _1,\pi _2 \right) \ge \frac{\ell _t\left( \pi _1 \pi _2\right) }{2}. \end{aligned}$$

Proof

A single DCJ operation \(D_{ij}\) acts either by conjugation of \(\pi _1\) by the transposition \((i,j)\), or by multiplication of \(\pi _1\) by \((i,j)\).

If \(D_{ij}(\pi _1)=(i,j)\pi _1\), then \(D_{ij}(\pi _1)\pi _2=(i,j)\pi _1\pi _2\). Hence by Eq. (1)

$$\begin{aligned} \ell _t\left( D_{ij}(\pi _1) \pi _2\right) =\ell _t \left( \pi _1 \pi _2\right) \pm 1. \end{aligned}$$

If \(D_{ij}(\pi _1)=(i,j)\pi _1(i,j)\) then

$$\begin{aligned} D_{ij}(\pi _1)\pi _2=(i,j)(\pi _1(i),\pi _1(j))\pi _1\pi _2. \end{aligned}$$

By applying Eq. (1) twice,

$$\begin{aligned} \ell _t\left( D_{ij}(\pi _1)\pi _2\right) =\ell _t\left( \pi _1 \pi _2\right) \text { or }\ell _t \left( \pi _1 \pi _2\right) \pm 2. \end{aligned}$$

Thus a single DCJ operation on \(\pi _1\) can reduce the transposition length of \(\pi _1 \pi _2\) by at most 2. Since \(\ell _t(\pi _2\pi _1)=0\) when \(\pi _1=\pi _2\), the DCJ distance between \(\pi _1\) and \(\pi _2\) must be at least \(\ell _t(\pi _1 \pi _2)/2\). \(\square \)

6.2 DCJ distance between conjugate sub-permutations

Let the sub-permutations \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) be conjugate in \(S_{2n}\). That is, there exists a \(g\in S_{2n}\) such that

$$\begin{aligned} g\pi _1^{(s)} g^{-1}=\pi _2^{(s)}. \end{aligned}$$

By writing \(g\) as a product of transpositions, we obtain a sequence of DCJ operations that transforms \(\pi _1^{(s)}\) to \(\pi _2^{(s)}\), each of which is conjugation by a transposition. Let \(d^c_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) \) be the minimal number of DCJ conjugation operations needed to transform \(\pi _1^{(s)}\) into \(\pi _2^{(s)}\). Then clearly

$$\begin{aligned} d_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) \le d^c_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) . \end{aligned}$$

Theorem 6.7

Let \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) be sub-permutations of genomic permutations \(\pi _{1}\) and \(\pi _{2}\) on \(n\) regions such that \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are conjugate in \(S_{2n}\). Then the conjugation distance \(d^c_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) \) is half the transposition length of \(\pi _2^{(s)} \pi _1^{(s)}\), that is,

$$\begin{aligned} d^c_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) =\frac{1}{2}\ell _t \left( \pi _2^{(s)} \pi _1^{(s)}\right) . \end{aligned}$$

Proof

We prove the claim by induction on \(r:=d^c_{DCJ}\left( \pi _1^{(s)}, \pi _2^{(s)}\right) \).

Suppose first that \(d^c_{DCJ}\left( \pi _1^{(s)}, \pi _2^{(s)}\right) =1\). Because \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are conjugate, \(\pi _2^{(s)}\pi _1^{(s)}\) is an even permutation, so \(\ell _t \left( \pi _2^{(s)} \pi _1^{(s)} \right) \) is at least \(2\). Since \(r=1\), there exists a transposition \((i,j) \in S_{2n}\) such that \((i,j)\pi _1^{(s)}(i,j)=\pi _2^{(s)}\). Hence,

$$\begin{aligned} \pi _2^{(s)} \pi _1^{(s)} = (i,j)\pi _1^{(s)}(i,j)\pi _1^{(s)} =(i,j)(\pi _1^{(s)}(i),\pi _1^{(s)}(j)). \end{aligned}$$

Therefore the transposition length of \(\pi _2^{(s)} \pi _1^{(s)}\) is 2.

Assume that the hypothesis is true for all \(r \in \mathbb {N}\), with \(r < u \). That is, for \(r<u\),

$$\begin{aligned} r=d^c_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) =\frac{1}{2} \ell _t\left( \pi _2^{(s)}\pi _1^{(s)}\right) . \end{aligned}$$

Next suppose that \(d^c_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) =u\). That is,

$$\begin{aligned} w_u w_{u-1}\ldots w_1 (\pi _1^{(s)}) w_1 \ldots w_{u-1} w_{u}=\pi _2^{(s)}, \end{aligned}$$

for some transpositions \(w_i \in S_{2n}\) with \(u\) minimal. Let \(\pi _1'=w_{u-1}\ldots w_1 (\pi _1^{(s)}) w_1 \ldots w_{u-1}\).

Since the conjugation distance between \(\pi _1'\) and \(\pi _1^{(s)}\) is \(u-1 < u\), from the induction hypothesis we know that \(\ell _t\left( \pi _1'\pi _1^{(s)}\right) =2(u-1)\). Write

$$\begin{aligned} \pi _1'\pi _1^{(s)}=t_1 t_2 \ldots t_{2(u-1)} \end{aligned}$$

for transpositions \(t_i \in S_{2n}\). Since \(w_u \pi _1' w_u=\pi _2^{(s)}\), we have \(d^c_{DCJ}\left( \pi _1',\pi _2^{(s)}\right) =1\), and hence \(\ell _t\left( \pi _2^{(s)}\pi _1' \right) =2\). That is, \(\pi _2^{(s)}\pi _1'=v_1v_2\) where \(v_1\) and \(v_2\) are transpositions in \(S_{2n}\).

Then

$$\begin{aligned} \pi _2^{(s)}\pi _1^{(s)}=\pi _2^{(s)}\pi _1'\pi _1'\pi _1^{(s)}=v_1v_2t_1 t_2 \ldots t_{2(u-1)}. \end{aligned}$$

This is a product of a permutation of transposition length \(2(u-1)\) with two transpositions. The transposition length of the result will be \(\ell _t\left( \pi _2^{(s)}\pi _1^{(s)}\right) \in \{2u-4,2u-2,2u\}\). However, if \(\ell _t \left( \pi _2^{(s)}\pi _1^{(s)} \right) <2u \) then \(d^c_{DCJ}(\pi _1^{(s)},\pi _2^{(s)}) < u\), contrary to our assumption. Hence \(\ell _t(\pi _2^{(s)}\pi _1^{(s)})=2u\). That is,

$$\begin{aligned} d^c_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) =u \implies \ell _t \left( \pi _2^{(s)}\pi _1^{(s)}\right) =2u. \end{aligned}$$

\(\square \)

Putting together the lower bound for DCJ distance from Lemma 6.6 with the upper bound from Theorem 6.7, we have the following.

Corollary 6.8

If \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are conjugate sub-permutations of the genomic permutations \(\pi _{1}\) and \(\pi _{2}\) on \(n\) regions, then

$$\begin{aligned} d_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) =\frac{\ell _t \left( \pi _2^{(s)}\pi _1^{(s)}\right) }{2}. \end{aligned}$$

6.3 Constructing a sorting element for conjugate sub-permutations

When the sub-permutations \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) induced by \({\mathcal {C}}_s\) are conjugate, an element of minimal length sorting \(\pi _1^{(s)}\) into \(\pi _2^{(s)}\) can easily be constructed as follows. Corollary 6.4 gives the structure of the product \(\pi _2^{(s)}\pi _1^{(s)}\).

If \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) have no fixed points then

$$\begin{aligned} \pi _2^{(s)}\pi _1^{(s)}&= \left( 1,\pi _2\pi _1(1),(\pi _2\pi _1)^2(1), \ldots , (\pi _2\pi _1)^u(1)\right) \nonumber \\&\times \left( \pi _1(1), \pi _2(1),\pi _2 \pi _1\pi _2(1),\ldots , (\pi _2\pi _1)^{u-1}\pi _2(1)\right) . \end{aligned}$$
(2)

If \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) contain fixed points then \(\pi _2^{(s)} \pi _1^{(s)}\) is a single cycle

$$\begin{aligned} \pi _2^{(s)}\pi _1^{(s)}=\left( 1,\pi _2\pi _1(1),(\pi _2\pi _1)^2(1), \ldots , (\pi _2\pi _1)^{2u}(1)\right) . \end{aligned}$$
(3)

Let \(g=\left( 1,\pi _2\pi _1(1),(\pi _2\pi _1)^2(1) \ldots , (\pi _2\pi _1)^u(1)\right) \). We claim that

$$\begin{aligned} g\pi _1^{(s)}g^{-1}=\pi _2^{(s)}. \end{aligned}$$

If \(i\) is moved by \(g\), then \(g(i)=\pi _2\pi _1(i)\). For any \(2\)-cycle \(\left( i,\pi _1(i)\right) \) in \(\pi _1^{(s)}\), either \(i\) or \(\pi _1(i)\) is moved by \(g\). If \(\pi _2^{(s)}\pi _1^{(s)}\) is a product of two cycles then, as proved in Lemma 5.2, \(i\) and \(\pi _1(i)\) are in different cycles of \(\pi _2^{(s)}\pi _1^{(s)}\). Since \(g\) is precisely one of the two cycles of \(\pi _2^{(s)}\pi _1^{(s)}\), it moves exactly one of \(i\) and \(\pi _1(i)\).

On the other hand, if \(\pi _2^{(s)}\pi _1^{(s)}\) is a single cycle as in Eq. (3), then suppose \(\pi _1(1)=1\) (if instead \(\pi _2(1)=1\), a similar argument would apply). Then,

$$\begin{aligned} 1&=\left( \pi _2^{(s)}\pi _1^{(s)}\right) ^{2u+1}(1) \\&=\pi _2^{(s)}\left( \pi _1^{(s)}\pi _2^{(s)}\right) ^{2u}(1). \end{aligned}$$

This implies that \(\left( \pi _1^{(s)}\pi _2^{(s)}\right) ^{2u}(1)=\pi _2^{(s)}(1)\). Now suppose \(i=\left( \pi _2^{(s)}\pi _1^{(s)}\right) ^b(1)\) for some \(b\). Then \(\pi _1^{(s)}(i)=\left( \pi _2^{(s)}\pi _1^{(s)}\right) ^a(1)\) for some \(a\), since \(i\) and \(\pi _1^{(s)}(i)\) are in the same cycle. Also,

$$\begin{aligned} \pi _1^{(s)}(i)&=\pi _1^{(s)}\left( \pi _2^{(s)}\pi _1^{(s)} \right) ^b(1)\\&=\left( \pi _1^{(s)}\pi _2^{(s)}\right) ^b(1) \\&=\left( \pi _1^{(s)}\pi _2^{(s)}\right) ^{-2u+b}\left( \pi _1^{(s)} \pi _2^{(s)}\right) ^{2u}(1) \\&=\left( \pi _1^{(s)}\pi _2^{(s)}\right) ^{-2u+b}\pi _2^{(s)}(1)= \left( \pi _1^{(s)}\pi _2^{(s)}\right) ^{-2u+b-1}(1)\\&=\left( \pi _2^{(s)}\pi _1^{(s)}\right) ^{2u-b+1}(1). \end{aligned}$$

If \(i\) is moved by \(g\), that is if \(i\) is in the cycle \(\left( 1,\pi _2\pi _1(1),\ldots ,(\pi _2\pi _1)^u(1)\right) \), then \(b \le u \) which means that \(a=2u-b+1 > u\) and \(\pi _1^{(s)}(i)\) is not in this cycle and hence not moved by \(g\).

If \(\pi _1^{(s)}(i)\) is moved by \(g\), that is if \(\pi _1^{(s)}(i)\) is in the cycle \(\left( 1,\pi _2\pi _1(1),\ldots , (\pi _2\pi _1)^u(1)\right) \), then \(a= 2u-b+1 \le u \) which means that \(u+1 \le b\) and \(i\) is not in this cycle and hence not moved by \(g\).

Thus we have established that whether the product \(\pi _2^{(s)}\pi _1^{(s)}\) is given by Eq. (2) or Eq. (3), for any \(i \in {\mathcal {C}}_s\), \(g\) moves either \(i\) or \(\pi _1^{(s)}(i)\).

So if \(i\) is moved by \(g\), then \(g(i)=\pi _2\pi _1(i)\), and since \(\pi _1(i)\) is then not moved by \(g\), \(g(\pi _1(i))=\pi _1(i)\). Consider the product \(g \pi _1^{(s)} g^{-1}\). For any \(2\)-cycle \((i,\pi _1(i))\) in \(\pi _1^{(s)}\) suppose \(i\) is moved by \(g\). We have

$$\begin{aligned} g\left( i,\pi _1(i)\right) g^{-1}=\left( g(i),g\left( \pi _1(i) \right) \right) =\left( \pi _2\left( \pi _1(i)\right) ,\pi _1(i)\right) \end{aligned}$$

which is the \(2\)-cycle in \(\pi _2^{(s)}\) containing \(\pi _1(i)\). Thus

$$\begin{aligned} g \pi _1^{(s)} g^{-1}=\pi _2^{(s)}. \end{aligned}$$

Since the transposition length of \(g\) is \(u\) (that is, we require \(u\) transpositions to express \(g\) as a product), and conjugation by a \(2\)-cycle \((i,j)\) is one \(D_{ij}\) operation, we require \(u\) DCJ operations to sort \(\pi _1^{(s)}\) into \(\pi _2^{(s)}\). This is exactly the DCJ distance between them. Hence \(g\) is an optimal sorting element, that is, an element of minimal transposition length satisfying \(g \pi _1^{(s)} g^{-1}=\pi _2^{(s)}\).

As there is nothing special about the choice of \(1\) in this argument, the cycle

$$\begin{aligned} \left( \pi _1(1), \pi _2(1),\pi _2\pi _1\pi _2(1),\ldots ,(\pi _2 \pi _1)^{u-1}\pi _2(1)\right) \end{aligned}$$

is also an optimal sorting element.

The above construction might be better understood through an example. Let \(\pi _1\) and \(\pi _2\) be the following genomic permutations on 3 regions:

$$\begin{aligned} \pi _1=(1,6)(2,3)(4,5) ,\quad \pi _2=(1,2)(3,4)(5,6). \end{aligned}$$

Since \(\pi _1\) and \(\pi _2\) have the same cycle structure, we know that they are conjugate in \(S_{2n}\). Consider the product \(\pi _1\pi _2\).

$$\begin{aligned} \pi _1\pi _2=(1,6)(2,3)(4,5)(1,2)(3,4)(5,6)=(1,3,5)(2,6,4). \end{aligned}$$

Let \(g=(1,3,5)\). Then \(g^{-1}=(1,5,3)\).

$$\begin{aligned} g\pi _2g^{-1}=(1,3,5)(1,2)(3,4)(5,6)(1,5,3)=(1,6)(2,3)(4,5)=\pi _1. \end{aligned}$$
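This computation can be verified with a few lines of Python. The sketch assumes right-to-left composition, which is the convention under which \(\pi _1\pi _2=(1,3,5)(2,6,4)\); the helpers `from_cycles` and `compose` are hypothetical names introduced only for this check.

```python
def from_cycles(*cycles):
    """Build a permutation (dict) from disjoint cycles given as tuples."""
    perm = {}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[:1]):
            perm[a] = b
    return perm


def compose(*perms):
    """Right-to-left product: compose(p, q, r) maps x to p(q(r(x)))."""
    def image(x):
        for perm in reversed(perms):
            x = perm.get(x, x)  # points absent from a dict are fixed
        return x
    points = set().union(*perms)
    return {x: image(x) for x in sorted(points)}


pi1 = from_cycles((1, 6), (2, 3), (4, 5))
pi2 = from_cycles((1, 2), (3, 4), (5, 6))
g = from_cycles((1, 3, 5))
g_inv = from_cycles((1, 5, 3))

product = compose(pi1, pi2)        # expected (1,3,5)(2,6,4)
conjugate = compose(g, pi2, g_inv)  # expected pi1
print(product == from_cycles((1, 3, 5), (2, 6, 4)), conjugate == pi1)
```

Note that in this example \(g\) is taken from \(\pi _1\pi _2\) and therefore sorts \(\pi _2\) into \(\pi _1\); taking the cycle \((1,5,3)\) of \(\pi _2\pi _1\) instead sorts \(\pi _1\) into \(\pi _2\).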

The above discussion is summarised in Lemma 6.9.

Lemma 6.9

Let \(\pi _1^{(s)},\pi _2^{(s)}\) and \(u\) be as in Corollary 6.4. An element \(g\) such that \(\ell _t(g)=d_{DCJ}(\pi _1^{(s)},\pi _2^{(s)})\) that sorts \(\pi _1^{(s)}\) into \(\pi _2^{(s)}\) can be constructed from the product \(\pi _2^{(s)}\pi _1^{(s)}\) as

$$\begin{aligned} g=\left( 1,\pi _2\pi _1(1), (\pi _2\pi _1)^2(1),\ldots , (\pi _2 \pi _1)^u(1)\right) \,. \end{aligned}$$
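The following Python sketch illustrates how such a \(g\) yields an explicit sorting scenario. It is restricted, as an assumption of this illustration, to the case without fixed points, where \(g\) is a full cycle of \(\pi _2\pi _1\); it uses the factorisation \((x_0,x_1,\ldots ,x_u)=(x_0,x_u)\cdots (x_0,x_2)(x_0,x_1)\), which is one of several minimal factorisations, and the function names are hypothetical.

```python
def orbit(perm, start):
    """Cycle of perm through start, listed as start, perm(start), ..."""
    out, i = [start], perm[start]
    while i != start:
        out.append(i)
        i = perm[i]
    return out


def conjugate(perm, a, b):
    """Return (a,b) perm (a,b): swap the labels a and b throughout perm."""
    def swap(x):
        return b if x == a else a if x == b else x
    return {swap(x): swap(y) for x, y in perm.items()}


# pi1 = (1,6)(2,3)(4,5) and pi2 = (1,2)(3,4)(5,6), no fixed points
pi1 = {1: 6, 6: 1, 2: 3, 3: 2, 4: 5, 5: 4}
pi2 = {1: 2, 2: 1, 3: 4, 4: 3, 5: 6, 6: 5}

pi2_pi1 = {i: pi2[pi1[i]] for i in pi1}  # right-to-left product pi2 pi1
g_cycle = orbit(pi2_pi1, 1)              # the cycle g through the point 1

# Conjugate by (x0,x1), then (x0,x2), ...; together this is conjugation by g.
scenario = [pi1]
for x in g_cycle[1:]:
    scenario.append(conjugate(scenario[-1], g_cycle[0], x))

print(g_cycle, len(scenario) - 1)  # [1, 5, 3] 2
```

The final entry of `scenario` equals \(\pi _2\), reached in \(u=2\) conjugation steps.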

A similar construction has been given in Feijão and Meidanis (2013), where they construct the sorting element by establishing a correspondence between the connected components of the adjacency graph and the permutation product.

6.4 DCJ distance between non-conjugate sub-permutations

We will now consider the case where \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are not conjugate in \(S_{2n}\).

Theorem 6.10

Let \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) be non-conjugate sub-permutations of the genomic permutations \(\pi _{1}\) and \(\pi _{2}\) on \(n\) regions. Then

$$\begin{aligned} d_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) =\frac{1}{2}\left( \ell _t (\pi _2^{(s)}\pi _1^{(s)})+1\right) \,. \end{aligned}$$

Proof

We have remarked earlier that for any \(i\) in the equivalence class \({\mathcal {C}}_s\), \({\mathcal {C}}_s\) contains only the elements contained in the cycles of \(\pi _1\pi _2\) containing \(i\) and \(\pi _1(i)\). Hence, as proved in Lemma 5.2, if the induced sub-permutations \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) have fixed points and are not identical, their product contains exactly two points of \(F_{\pi _1} \cup F_{\pi _2}\). Since \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are not conjugate, both the fixed points belong to the same sub-permutation.

Suppose the fixed points are \(i_1\) and \(i_2\) and that they belong to \(\pi _1^{(s)}\). Recall that if \(i,j\) are both fixed in \(\pi \) then \(D_{ij}(\pi )=(i,j)\pi \), and hence

$$\begin{aligned} D_{i_1i_2}\left( \pi _1^{(s)}\right) =(i_1,i_2)\pi _1^{(s)}\,. \end{aligned}$$

While \(\pi _1^{(s)}\) and \(\pi _2^{(s)}\) are both products of \(2\)-cycles, the number of \(2\)-cycles in \(\pi _1^{(s)}\) is one less than the number of \(2\)-cycles of \(\pi _2^{(s)}\), since \(\pi _1^{(s)}\) has two fixed points. Therefore \(D_{i_1i_2}(\pi _1^{(s)})\) is conjugate to \(\pi _2^{(s)}\). Similarly, if the two fixed points belong to \(\pi _2^{(s)}\) then \(D_{i_1i_2}(\pi _2^{(s)})\) is conjugate to \(\pi _1^{(s)}\). As the DCJ distance is symmetric we can assume without loss of generality that the fixed points belong to \(\pi _1^{(s)}\).

Let \(\pi '=D_{i_1i_2}(\pi _1^{(s)})\). From Corollary 6.8 we have that

$$\begin{aligned} d_{DCJ}\left( \pi ',\pi _2^{(s)}\right) =\frac{1}{2}\ell _t\left( \pi _2^{(s)} \pi '\right) = \frac{1}{2}\ell _t\left( \pi '\pi _2^{(s)}\right) =\frac{1}{2} \ell _t\left( (i_1,i_2)\pi _1^{(s)}\pi _2^{(s)}\right) . \end{aligned}$$

Since \(i_1\) and \(i_2\) are in the same cycle of \(\pi _1^{(s)}\pi _2^{(s)}\), multiplication by \((i_1,i_2)\) will split this cycle into two cycles, reducing the transposition length of the product by 1. Hence

$$\begin{aligned} d_{DCJ}\left( \pi ',\pi _2^{(s)}\right) =\frac{1}{2}\ell _t\left( (i_1, i_2)\pi _1^{(s)}\pi _2^{(s)}\right) =\frac{1}{2}\left( \ell _t \left( \pi _1^{(s)}\pi _2^{(s)}\right) -1\right) . \end{aligned}$$

The DCJ distance between \(\pi _1^{(s)}\) and \(\pi '\) is 1. Thus the triangle inequality gives

$$\begin{aligned} d_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right)&\le d_{DCJ} \left( \pi _1^{(s)},\pi '\right) +d_{DCJ}\left( \pi ',\pi _2^{(s)}\right) \\&=\frac{1}{2}\left( \ell _t\left( \pi _1^{(s)}\pi _2^{(s)}\right) -1\right) +1\\&=\frac{1}{2}\left( \ell _t\left( \pi _1^{(s)}\pi _2^{(s)}\right) +1\right) \!. \end{aligned}$$

At the same time we have a lower bound on the distance (Lemma 6.6), so that

$$\begin{aligned} \frac{1}{2}\ell _t\left( \pi _1^{(s)}\pi _2^{(s)}\right) \le d_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) \le \frac{1}{2}\left( \ell _t\left( \pi _1^{(s)}\pi _2^{(s)}\right) +1\right) . \end{aligned}$$

Since the DCJ distance is an integer (the number of DCJ operations), and the transposition length of \(\pi _1^{(s)}\pi _2^{(s)}\) is odd, we have

$$\begin{aligned} d_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) =\frac{1}{2} \left( \ell _t\left( \pi _1^{(s)}\pi _2^{(s)}\right) +1\right) \end{aligned}$$

as required. \(\square \)
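As a small worked instance of Theorem 6.10, consider the class \({\mathcal {C}}_2=\{7,8\}\) from the example of Section 6.1, for which

$$\begin{aligned} \pi _1^{(2)}=(7,8),\quad \pi _2^{(2)}=(7)(8),\quad \pi _2^{(2)}\pi _1^{(2)}=(7,8),\quad \ell _t\left( \pi _2^{(2)}\pi _1^{(2)}\right) =1. \end{aligned}$$

The two sub-permutations have different cycle types, so they are not conjugate, and the theorem gives

$$\begin{aligned} d_{DCJ}\left( \pi _1^{(2)},\pi _2^{(2)}\right) =\frac{1}{2}\left( 1+1\right) =1. \end{aligned}$$

This agrees with a direct check: both fixed points \(7\) and \(8\) belong to \(\pi _2^{(2)}\), and \(D_{78}\left( \pi _2^{(2)}\right) =(7,8)\pi _2^{(2)}=\pi _1^{(2)}\), so by symmetry of the distance a single DCJ operation suffices.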

From Theorems 6.7 and 6.10, it is clear that the sub-permutations induced by the partitions \({\mathcal {C}}_1,{\mathcal {C}}_2, \ldots , {\mathcal {C}}_r\) under \(\sim \) can be sorted independently. Therefore

$$\begin{aligned} d_{DCJ}(\pi _1,\pi _2) \le \sum _s{d_{DCJ} \left( \pi _1^{(s)}, \pi _2^{(s)}\right) }. \end{aligned}$$

We claim that a sorting sequence that involves a DCJ operation \(D_{ij}\), where \(i,j\) are in different partitions \({\mathcal {C}}_s\), cannot be shorter than a sequence that sorts each partition independently.

Let \({\mathcal {C}}_r\) and \({\mathcal {C}}_s\) be distinct equivalence classes of \(\mathbf {2n}\) under \(\sim \). Let \(\ell _t\left( \pi _1^{(r)}\pi _2^{(r)}\right) \) be \(l_1\) and \(\ell _t\left( \pi _1^{(s)}\pi _2^{(s)}\right) \) be \(l_2\). Then

$$\begin{aligned} \frac{1}{2}\left( l_1\right) \le d_{DCJ}\left( \pi _1^{(r)}, \pi _2^{(r)}\right) \le \frac{1}{2}\left( l_1+1\right) \!. \end{aligned}$$

Similarly

$$\begin{aligned} \frac{1}{2}(l_2) \le d_{DCJ}\left( \pi _1^{(s)},\pi _2^{(s)}\right) \le \frac{1}{2}\left( l_2+1\right) \!. \end{aligned}$$

Combining these, we have

$$\begin{aligned} \frac{1}{2}\left( l_1+l_2\right) \le d_{DCJ}({\mathcal {C}}_r) + d_{DCJ}({\mathcal {C}}_s) \le \frac{1}{2}\left( l_1+l_2+2\right) \!. \end{aligned}$$

The action of \(D_{ij}\) on \(\pi _1\) may combine the two partitions \({\mathcal {C}}_r\) and \({\mathcal {C}}_s\) into a single partition \({\mathcal {C}}_{t}\) or it may change each of the partitions \({\mathcal {C}}_r\) and \({\mathcal {C}}_s\). In the latter case, by an abuse of notation we use \({\mathcal {C}}_{t}\) to denote the union of the partitions changed by the action of \(D_{ij}\). We wish to determine the transposition length of \(\pi _1^{(t)}\pi _2^{(t)}\) in order to find the DCJ distance.

The action of \(D_{ij}\) on \(\pi _1\) may be left multiplication by \((i,j)\) or conjugation by \((i,j)\). If it acts by left multiplication, so that \(D_{ij}(\pi _1)=(i,j)\pi _1\), then since \(i\) and \(j\) are in different partitions and hence in different cycles of \(\pi _1^{(r)}\pi _2^{(r)}\pi _1^{(s)}\pi _2^{(s)}\), multiplication by \((i,j)\) will combine the two cycles that contain \(i\) and \(j\). The transposition length of the product will therefore increase by 1. That is, in this case,

$$\begin{aligned} \ell _t\left( \pi _1^{(t)}\pi _2^{(t)}\right) =\ell _t\left( \pi _1^{(r)} \pi _2^{(r)}\pi _1^{(s)}\pi _2^{(s)}\right) +1. \end{aligned}$$

On the other hand, if \(D_{ij}\) acts by conjugation then \(D_{ij}(\pi _1)=(i,j)\pi _1(i,j)\), and we have that

$$\begin{aligned} (i,j)\pi _1(i,j)=(i,j)\left( \pi _1(i),\pi _1(j)\right) \pi _1. \end{aligned}$$

The images \(\pi _1(i)\) and \(\pi _1(j)\) are in different cycles of \(\pi _1^{(r)}\pi _2^{(r)}\pi _1^{(s)}\pi _2^{(s)}\) since \(i\) and \(j\) are in different partitions. In \(x=\left( \pi _1(i),\pi _1(j)\right) \pi _1\pi _2\), the cycles containing \(\pi _1(i)\) and \(\pi _1(j)\) will combine into a single cycle, increasing the transposition length of the product by 1. Then the cycle of \(x\) containing \(\pi _1(i)\) and \(\pi _1(j)\) will contain either both, one, or neither of \(i\) and \(j\).

Accordingly, multiplication by \((i,j)\) will either split this cycle into two cycles or combine two different cycles into one cycle. Thus the transposition length may increase or decrease by 1 (from the previous step). Hence

$$\begin{aligned} \ell _t\left( \pi _1^{(t)}\pi _2^{(t)}\right) =\ell _t\left( \pi _1^{(r)} \pi _2^{(r)}\pi _1^{(s)}\pi _2^{(s)}\right) \!, \end{aligned}$$

or

$$\begin{aligned} \ell _t\left( \pi _1^{(t)}\pi _2^{(t)}\right) =\ell _t\left( \pi _1^{(r)} \pi _2^{(r)}\pi _1^{(s)}\pi _2^{(s)}\right) +2. \end{aligned}$$

In all cases \(l_1+l_2 \le \ell _t\left( \pi _1^{(t)}\pi _2^{(t)}\right) \) and hence

$$\begin{aligned} \frac{1}{2}\left( l_1+l_2\right) \le d_{DCJ} \left( \pi _1^{(t)},\pi _2^{(t)}\right) \!. \end{aligned}$$

Since one DCJ operation was needed to change the partitions \({\mathcal {C}}_r\) and \({\mathcal {C}}_s\) into \({\mathcal {C}}_{t}\), and the DCJ distance of the sub-permutations induced by \({\mathcal {C}}_{t}\) is at least \(\frac{1}{2}(l_1+l_2)\), any sorting scenario for \({\mathcal {C}}_r\) and \({\mathcal {C}}_s\) that steps through \({\mathcal {C}}_{t}\) is of length at least \(\frac{1}{2}(l_1+l_2)+1\). At the same time, the sum of the distances of the partitions \({\mathcal {C}}_r\) and \({\mathcal {C}}_s\) is bounded above by:

$$\begin{aligned} d_{DCJ}({\mathcal {C}}_r)+d_{DCJ}({\mathcal {C}}_s) \le \frac{1}{2}(l_1+l_2)+1. \end{aligned}$$

Therefore we conclude that no sequence of DCJ operations sorting \(\pi _1\) into \(\pi _2\) can be shorter than a sequence that sorts the sub-permutations independently.

Theorem 6.11

Let \(\pi _1\) and \(\pi _2\) be genomic permutations on \(n\) regions. The DCJ distance between \(\pi _1\) and \(\pi _2\) is given by

$$\begin{aligned} d_{DCJ}(\pi _1,\pi _2)=\frac{1}{2}\left( \ell _t(\pi _2\pi _1)+ n_c\right) \end{aligned}$$

where \(n_c\) is the number of cycles in the product \(\pi _2\pi _1\) which contain either two points of \(F_{\pi _1}\) or two points of \(F_{\pi _2}\).
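The formula is straightforward to evaluate, and for small \(n\) it can be compared against a brute-force search. The Python sketch below is illustrative only. The operator `dcj` encodes an assumed reading of Definition 4.2 (left multiplication by \((i,j)\) when \(i,j\) are both fixed or form a \(2\)-cycle, conjugation by \((i,j)\) otherwise), and the tuple encoding with an unused index 0 and all function names are choices made for this example.

```python
from collections import deque
from itertools import combinations


def from_pairs(pairs, size):
    """Involution on {1..size} as a tuple of images; index 0 is unused padding."""
    img = list(range(size + 1))
    for a, b in pairs:
        img[a], img[b] = b, a
    return tuple(img)


def cycles(perm):
    seen, out = set(), []
    for s in range(1, len(perm)):
        if s not in seen:
            cyc, i = set(), s
            while i not in seen:
                seen.add(i)
                cyc.add(i)
                i = perm[i]
            out.append(cyc)
    return out


def dcj_distance_formula(p1, p2):
    """(l_t(pi2 pi1) + n_c) / 2, with l_t = (number of points) - (number of cycles)."""
    prod = tuple(p2[p1[x]] for x in range(len(p1)))
    cyc = cycles(prod)
    ell = (len(p1) - 1) - len(cyc)
    fix1 = {x for x in range(1, len(p1)) if p1[x] == x}
    fix2 = {x for x in range(1, len(p2)) if p2[x] == x}
    n_c = sum(1 for c in cyc if len(c & fix1) == 2 or len(c & fix2) == 2)
    return (ell + n_c) // 2


def dcj(perm, i, j):
    """Assumed D_ij: multiply if i,j are both fixed or paired, else conjugate."""
    def swap(x):
        return j if x == i else i if x == j else x
    if perm[i] == j or (perm[i] == i and perm[j] == j):
        return tuple(swap(perm[x]) for x in range(len(perm)))  # (i,j) perm
    img = [0] * len(perm)
    for x in range(len(perm)):
        img[swap(x)] = swap(perm[x])                           # (i,j) perm (i,j)
    return tuple(img)


def dcj_distance_bfs(p1, p2):
    """Breadth-first search over genomic permutations; feasible only for small n."""
    dist, queue = {p1: 0}, deque([p1])
    while queue:
        cur = queue.popleft()
        if cur == p2:
            return dist[cur]
        for i, j in combinations(range(1, len(p1)), 2):
            nxt = dcj(cur, i, j)
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)


pi1 = from_pairs([(1, 6), (2, 3), (4, 5), (7, 8)], 8)
pi2 = from_pairs([(1, 2), (3, 4), (5, 6)], 8)
by_formula = dcj_distance_formula(pi1, pi2)
by_search = dcj_distance_bfs(pi1, pi2)
print(by_formula, by_search)
```

For the running example, \(\pi _2\pi _1=(1,5,3)(2,4,6)(7,8)\) has \(\ell _t=5\) and \(n_c=1\), so the formula gives \(\frac{1}{2}(5+1)=3\), which is the sum of the distances \(2\) and \(1\) of the two classes.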

7 Counting the optimal sorting scenarios

To sort a permutation \(\pi _a\) into \(\pi _b\) means to transform \(\pi _a\) into \(\pi _b\) through a sequence of allowed operations (in this case the DCJ operation). A sorting scenario is defined as follows.

Definition 7.1

A sorting scenario of length \(k\) that sorts genomic permutation \(\pi _a\) into genomic permutation \(\pi _b\) is a sequence of genomic permutations

$$\begin{aligned} \{(\pi _a=)\pi _0,\pi _1,\pi _2, \ldots , \pi _{k-1} ,\pi _k(=\pi _b)\}, \end{aligned}$$

such that each element of the sequence is obtained from the previous element through a single DCJ operation.

That is, a sorting scenario is the sequence of permutations we step through as \(\pi _a\) is sorted into \(\pi _b\) through DCJ operations. If the DCJ distance is \(d\), then an optimal sorting scenario (a scenario of minimal length) is a sequence of \(d+1\) permutations. Two optimal sorting scenarios are equal if they are equal as sequences, i.e., corresponding terms are equal.

The total number of optimal sorting scenarios between a pair of genomes is an interesting and important question. In constructing a phylogenetic history, the minimal distance with respect to some mutational operation is used. However such a minimal path is seldom unique. Hence as Miklós and Darling (2009) and Siepel (2002) point out, it would be more appropriate to account for and average over all possible evolutionary paths to draw meaningful statistical inferences.

Braga and Stoye (2010) extended their earlier work (Braga and Stoye 2009) to give a closed formula for the number of optimal DCJ sorting scenarios for certain instances of the problem. Ouangraoua and Bergeron (2010) also present similar results for the number of optimal sorting scenarios and establish connections between the number of sorting scenarios and other combinatorial objects such as parking functions.

Considering genomes as permutations and the DCJ as an action on a permutation allows us to count the number of optimal sorting scenarios for a subset of genomes in a straightforward manner. The subset of genomes we consider are those that are conjugate in \(S_{2n}\). Our result is equivalent to the results obtained by the previous papers.

Let \(\pi \) be a genomic permutation on \(n\) regions and let \(i,j \in {\mathbf {2n}}\). The restriction of \(\pi \) to the cycles containing \(i\) and \(j\) is \(\left( i,\pi (i)\right) \left( j,\pi (j)\right) \). As \(D_{ij}\) acting on \(\pi \) only affects the cycles containing \(i\) and \(j\),

$$\begin{aligned} D_{ij}(\pi )&=(i,j)\left( i,\pi (i)\right) \left( j,\pi (j)\right) (i,j)\\&=\left( i,\pi (j)\right) \left( j,\pi (i)\right) \\&=\left( \pi (i),\pi (j)\right) \left( i,\pi (i)\right) \left( j,\pi (j)\right) \left( \pi (i),\pi (j)\right) \\&=D_{\pi (i)\pi (j)}(\pi ). \end{aligned}$$

If the restriction of \(\pi \) to the cycles containing \(i\) and \(j\) is \(\left( i\right) \left( j,\pi (j)\right) \), then

$$\begin{aligned} D_{ij}(\pi )=(i,j)\left( i\right) \left( j,\pi (j)\right) (i,j)=\left( j\right) \left( i,\pi (j)\right) . \end{aligned}$$

In this case there is no \(D_{kl}\) such that \(D_{kl}(\pi )=D_{ij}(\pi )\). As these are the two cases where the algebraic DCJ operator acts via conjugation (see Definition 4.2), we have the following lemma.

Lemma 7.2

Let \(\pi \) be a genomic permutation on \(n\) regions. Let \(D_{ij}\) and \(D_{kl}\) act on \(\pi \) via conjugation. If

$$\begin{aligned} D_{ij}\left( \pi \right) =D_{kl}\left( \pi \right) , \end{aligned}$$

then either \((i,j)=(k,l)\) or \((i,j)\) and \((k,l)\) are disjoint transpositions such that \((k,l)=\left( \pi (i),\pi (j)\right) \).
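Lemma 7.2 can be illustrated on a small case. The sketch below takes \(\pi =(1,6)(2,3)(4,5)\) and \((i,j)=(1,2)\), and searches all transpositions \((k,l)\) whose conjugation action on \(\pi \) agrees with that of \((1,2)\); the dictionary representation and the helper `conjugate` are assumptions of this illustration.

```python
from itertools import combinations


def conjugate(perm, a, b):
    """Return (a,b) perm (a,b): swap the labels a and b throughout perm."""
    def swap(x):
        return b if x == a else a if x == b else x
    return {swap(x): swap(y) for x, y in perm.items()}


pi = {1: 6, 6: 1, 2: 3, 3: 2, 4: 5, 5: 4}  # (1,6)(2,3)(4,5)
i, j = 1, 2

left = conjugate(pi, i, j)            # D_ij(pi), acting by conjugation
right = conjugate(pi, pi[i], pi[j])   # D_{pi(i) pi(j)}(pi)

# All transpositions (k,l) that transform pi in the same way as (i,j).
matches = [(k, l) for k, l in combinations(sorted(pi), 2)
           if conjugate(pi, k, l) == left]
print(left == right, matches)
```

Here `matches` contains only \((1,2)\) and \((3,6)=\left( \pi (2),\pi (1)\right) \), in line with the lemma.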

Based on the characterization of the DCJ operators that act in the same way on a genomic permutation, we can easily enumerate the sorting scenarios.

Theorem 7.3

Let \(\pi _a\) and \(\pi _b\) be genomic permutations on \(n\) regions such that \(\pi _a\) and \(\pi _b\) are conjugate in \(S_{2n}\). If the DCJ distance \(d_{DCJ}\left( \pi _a,\pi _b\right) =d\) then the number of optimal DCJ sorting scenarios sorting \(\pi _a\) into \(\pi _b\) is \((d+1)^{d-1}\).

Proof

As we have seen in the proof of Theorem 6.7, if the DCJ distance is \(d\) then we can construct an element \(g \in S_{2n}\) such that \(g\) is a cycle of length \(d+1\) (and consequently of transposition length \(d\)) and

$$\begin{aligned} g\pi _ag^{-1}=\pi _b. \end{aligned}$$

Lemma 6.9 gives the construction of a cycle \(g\) such that \(g\pi _ag^{-1}=\pi _b\). Let \(g\) be as in the statement of Lemma 6.9 i.e., \(g=\left( 1, \pi _b\pi _a(1), \left( \pi _b\pi _a\right) ^2(1), \ldots , \left( \pi _b\pi _a\right) ^d(1)\right) \). The number of ways to represent a cycle of length \(n\) as the product of \(n-1\) transpositions (i.e., as a minimal product) is \(n^{n-2}\) (Dénes 1959). Hence \(g\) can be written as a product of transpositions in \((d+1)^{d-1}\) ways. It remains to show that

  1. Each expression of \(g\) as a minimal product of transpositions corresponds to a distinct sorting scenario, and

  2. There can be no other sorting scenarios. That is, if there is \(h \in S_{2n}\) such that \(\ell _t(h)=\ell _t(g)\) and \(h \pi _a h^{-1}=\pi _b\), then any sorting scenario produced by \(h\) is identical to some sorting scenario produced by \(g\).
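The cycle \(g\) quoted in the proof can be computed directly from the two permutations. The sketch below is illustrative only: it assumes permutations are dicts, that products are composed right to left (so \(\pi _b\pi _a\) applies \(\pi _a\) first), and it has been checked only on the small example shown, where \(\pi _b\pi _a\) has a single pair of nontrivial cycles. The helper names are hypothetical.

```python
def compose(p, q):
    """Return the product p q, applying q first (right-to-left convention)."""
    return {x: p[q[x]] for x in q}

def orbit_cycle(perm, start):
    """Return the cycle of perm through start as a tuple."""
    cycle = [start]
    x = perm[start]
    while x != start:
        cycle.append(x)
        x = perm[x]
    return tuple(cycle)

def cycle_to_map(cycle, points):
    """Extend a cycle to a permutation of points, fixing everything else."""
    g = {x: x for x in points}
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        g[a] = b
    return g

def conjugate(g, pi):
    """Return g pi g^{-1}, which sends g(x) to g(pi(x))."""
    return {g[x]: g[pi[x]] for x in pi}

# Example with n = 3: pi_a = (1,2)(3,4)(5,6), pi_b = (2,3)(4,5)(6,1).
pi_a = {1: 2, 2: 1, 3: 4, 4: 3, 5: 6, 6: 5}
pi_b = {2: 3, 3: 2, 4: 5, 5: 4, 6: 1, 1: 6}

cycle = orbit_cycle(compose(pi_b, pi_a), 1)   # (1, 3, 5), so d = 2
g = cycle_to_map(cycle, pi_a)
assert conjugate(g, pi_a) == pi_b

# g moves i or pi_a(i), but not both, for every point i.
support = {x for x in g if g[x] != x}
assert all((i in support) != (pi_a[i] in support) for i in pi_a)
```

Here the cycle has length \(d+1=3\), so the theorem predicts \(3^{1}=3\) optimal sorting scenarios for this pair.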

Let \(S(g)\) be the set of all expressions for \(g\) as a minimal product of transpositions. For example, if \(g=(1,3,5)\) then \(S(g)=\{(1,5)(1,3), (1,3)(3,5), (3,5)(1,5)\}\).
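For small cycles, \(S(g)\) can be enumerated by exhaustive search, which also gives a sanity check of the count \(n^{n-2}\). This is a brute-force sketch under two assumptions: products are read right to left, and only transpositions on the support of \(g\) are tried (the proof below argues that no others can occur). The function names are hypothetical.

```python
from itertools import combinations, product

def apply_word(word, x):
    """Apply the product word[0] word[1] ... word[-1] to x, rightmost first."""
    for a, b in reversed(word):
        x = b if x == a else a if x == b else x
    return x

def minimal_factorizations(cycle):
    """List all words of len(cycle) - 1 transpositions whose product is cycle."""
    points = sorted(cycle)
    target = dict(zip(cycle, cycle[1:] + cycle[:1]))
    transpositions = list(combinations(points, 2))
    d = len(cycle) - 1
    return [
        word
        for word in product(transpositions, repeat=d)
        if all(apply_word(word, x) == target[x] for x in points)
    ]

# The example from the text: g = (1,3,5) has three minimal factorizations.
for word in minimal_factorizations((1, 3, 5)):
    print(word)

# Counts agree with n^(n-2) for cycle lengths n = 3, 4, 5.
for n in (3, 4, 5):
    assert len(minimal_factorizations(tuple(range(1, n + 1)))) == n ** (n - 2)
```

The search is exponential in the cycle length, so it is useful only as a check on very small cases.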

1. Claim: each expression of \(g\) as a minimal product of transpositions corresponds to a distinct sorting scenario.

From the construction of \(g\) preceding Lemma 6.9, we know that for any \(i \in \mathbf {2n}\), \(g\) moves either \(i\) or \(\pi _a(i)\) but not both. Suppose \(g\) moves \(i\). Then since \(g\) fixes \(\pi _a(i)\), in a minimal factorization of \(g\) as a product of transpositions, no transposition moves \(\pi _a(i)\).

To observe this, note that the number of trees on \(d+1\) labeled vertices is \((d+1)^{d-1}\) (given by Cayley’s formula). Thus there is a bijection between \(S(g)\) and the set of trees on \(d+1\) labeled vertices. A minimal factorization of \(g\) into transpositions can be associated with a tree by considering a transposition \((i,j)\) to correspond to the edge between vertices \(i\) and \(j\). If a point \(\pi _a(i)\) fixed by \(g\) is moved by some transposition in a minimal factorization of \(g\), then the factorization must contain a cycle that would move \(\pi _a(i)\) back to itself. But such a cycle would correspond to a closed path in the graph corresponding to the factorization, which cannot occur as the graph is a tree. Hence in a minimal factorization of \(g\) as a product of transpositions, no transposition moves \(\pi _a(i)\).
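The association between factorizations and trees can be illustrated with a small acyclicity check. The sketch below uses a standard union-find structure, which is our choice for illustration and not something the paper prescribes. It treats each transposition as an edge and tests whether the edges form a tree on the given vertex set. Cayley's formula, \(n^{n-2}\) labeled trees on \(n\) vertices, is included for comparison.

```python
def is_tree(edges, vertices):
    """Return True if edges form a spanning tree on vertices (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:      # this edge would close a cycle in the graph
            return False
        parent[root_a] = root_b
    # Acyclic with |V| - 1 edges means connected, hence a tree.
    return len(edges) == len(vertices) - 1

def cayley_count(n):
    """Number of labeled trees on n >= 2 vertices, by Cayley's formula."""
    return n ** (n - 2)

# The three minimal factorizations of g = (1,3,5) from the text.
support = {1, 3, 5}
for word in [((1, 5), (1, 3)), ((1, 3), (3, 5)), ((3, 5), (1, 5))]:
    assert is_tree(word, support)

# Three transpositions on three points close a cycle, so they are not a tree.
assert not is_tree([(1, 3), (3, 5), (1, 5)], support)

# With d = 2 the support has d + 1 = 3 points and there are 3 labeled trees.
assert cayley_count(3) == 3
```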

Suppose \(u_du_{d-1}\ldots u_1\) and \(w_d w_{d-1}\ldots w_1\) are distinct elements of \(S(g)\) that produce the same sorting scenarios. We will derive a contradiction.

Let \(k\) be the lowest index such that \(u_k \ne w_k\). Let \(w_{k-1}\ldots w_1=u_{k-1}\ldots u_1 =g^{\prime }\) and let \(g^{\prime } \pi _a g^{\prime -1}=\pi _{k-1}\).

Let \(u_k=(i,j)\). By Lemma 7.2, \(u_k\) and \(w_k\) are disjoint and \(w_k=\left( \pi _{k-1}(i),\pi _{k-1}(j)\right) \).

Now, \(u_k\) is a transposition in the minimal expression for \(g\). Since \(u_k\) moves \(i\), \(i\) is in the support of \(g\), and \(\pi _a(i)\) is not in the support of \(g\). The support of \(g^{\prime }\) is a subset of the support of \(g\), hence \(\pi _a(i)\) is not in the support of \(g^{\prime }\).

If \(u_k\) is disjoint from \(g^{\prime }\), then

$$\begin{aligned} \pi _{k-1}(i)=g^{\prime } \pi _a g^{\prime -1}(i) =\pi _a(i). \end{aligned}$$

Hence \(\pi _{k-1}(i)\) is not in the support of \(g\).

If, on the other hand, \(u_k\) is not disjoint from \(g^{\prime }\), then let \(i\) be in the support of \(g^{\prime }\).

Clearly, \(g^{\prime -1}(i)\) is in the support of \(g^{\prime }\) (it gets mapped to \(i\) by \(g^{\prime }\)) and hence in the support of \(g\). Therefore, \(\pi _a\left( (g^{\prime }) ^{-1}(i)\right) \) is not in the support of \(g\) (and hence not in the support of \(g^{\prime }\)) since for any \(i \in \mathbf {2n}\), \(g\) moves either \(i\) or \(\pi _a(i)\). Hence

$$\begin{aligned} \pi _{k-1}(i)=g^{\prime } \pi _a g^{\prime -1}(i)= \pi _a\left( g^{\prime -1} (i)\right) . \end{aligned}$$

That is, \(\pi _{k-1}(i)\) is not in the support of \(g\).

Thus, in both cases (whether \(u_k\) is disjoint from \(g^{\prime }\) or not), \(w_k\) moves an element that is not in the support of \(g\). This contradicts the assertion that \(w_d w_{d-1} \ldots w_1=g\). Thus either \(w_d w_{d-1}\ldots w_1 \notin S(g)\) or the sorting scenarios produced by distinct elements are distinct. Each expression of \(g\) as a minimal product of transpositions therefore gives a unique sorting scenario and the number of sorting scenarios is at least the cardinality of \(S(g)\).

2. Claim: there are no additional sorting scenarios

Let \(h\in S_{2n}\) such that \(\ell _t(h)=\ell _t(g)=d\) and \(h \pi _a h^{-1}=\pi _b\). Let \(h=w_dw_{d-1}\ldots w_1\) be a factorization of \(h\) into transpositions. We claim that \(h\) produces the same sorting scenario as some element in \(S(g)\). This will establish that the number of sorting scenarios is equal to the cardinality of \(S(g)\). To prove this assertion, we first prove that there is some element \(u_d \ldots u_1 \in S(g)\) such that \(u_1 \pi _a u_1=w_1 \pi _a w_1\).

Suppose that this is not the case. That is, no element in \(S(g)\) produces a sorting scenario that has \(w_1 \pi _a w_1\) as the second term (the first term in all sorting scenarios is \(\pi _a\)). Consider the element

$$\begin{aligned} h^{\prime }=u_d u_{d-1} \ldots u_1 w_1. \end{aligned}$$

Let \(w_1 \pi _a w_1 = \pi _a^{\prime }\). Then

$$\begin{aligned} h^{\prime } \pi _a^{\prime } h^{\prime -1}&= \left( u_d u_{d-1} \ldots u_1 w_1\right) \left( w_1 \pi _a w_1\right) \left( w_1u_1 \ldots u_{d-1}u_d\right) \\&= \left( u_d u_{d-1} \ldots u_1\right) \left( \pi _a\right) \left( u_1 \ldots u_{d-1}u_d\right) \\&=\pi _b. \end{aligned}$$

At the same time, the DCJ distance between \(\pi _a^{\prime }\) and \(\pi _b\) is \(d-1\), since \((w_d \ldots w_2) \pi _a^{\prime } (w_2 \ldots w_d)=\pi _b\). Hence \(u_d u_{d-1} \ldots u_1 w_1\) can be written as a product of \(d-1\) transpositions, say \(v_{d-1} \ldots v_1\). Then \( v_{d-1} \ldots v_1w_1=g\) is an expression of length \(d\) equal to \(g\) that is not in \(S(g)\), because we have assumed that there is no element in \(S(g)\) such that \(u_1 \pi _a u_1 = w_1 \pi _a w_1\). This is a contradiction since \(S(g)\) by definition contains all factorizations of \(g\) of length \(d\). Thus there is some element \(u_d \ldots u_1 \in S(g)\) such that \(w_1 \pi _a w_1 = u_1 \pi _a u_1\).

Let \(S_1(g)=\{u_d \ldots u_1 \in S(g) \mid u_1 \pi _a u_1 =w_1 \pi _a w_1\}\).

By a similar argument we can prove that there exists some element in \(S_1(g)\) such that \(u_2 (u_1 \pi _a u_1) u_2 = w_2 (u_1\pi _au_1) w_2\). In general, let

$$\begin{aligned} S_k(g)&= \{u_d \ldots u_1 \in S_{k-1}(g) \mid u_k(u_{k-1} \ldots u_1\pi _a u_1 \ldots u_{k-1})u_k\\&\qquad = w_k(w_{k-1} \ldots w_1\pi _a w_1 \ldots w_{k-1})w_k\}. \end{aligned}$$

Suppose that there does not exist any element in \(S_k(g)\) such that

$$\begin{aligned} u_{k+1}(u_{k} \ldots u_1\pi _a u_1 \ldots u_k)u_{k+1}=w_{k+1}(u_k \ldots u_1 \pi _a u_1 \ldots u_k)w_{k+1}. \end{aligned}$$

Let \(u_d \ldots u_1 \in S_k(g)\) and let \(u_{k} \ldots u_1\pi _a u_1 \ldots u_k=\pi _a^{\prime }\). Then,

$$\begin{aligned} \left( u_d \ldots u_{k+1} w_{k+1}\right) \left( w_{k+1} \pi _a^{\prime } w_{k+1}\right) \left( w_{k+1} u_{k+1} \ldots u_d\right) =\pi _b. \end{aligned}$$

The DCJ distance between \(\left( w_{k+1} \pi _a^{\prime } w_{k+1}\right) \) and \(\pi _b\) is \(d-k\). Therefore \(u_d \ldots u_{k+1} w_{k+1}\) can be rewritten as a product of \(d-k\) transpositions, say \(v_{d-k} \ldots v_1\). Now \(v_{d-k} \ldots v_1 w_{k+1} u_k \ldots u_1\) is an expression of length \(d\) equal to \(g\), with prefix \(u_k \ldots u_1\), that is not in \(S_k(g)\). This contradicts the definition of \(S_k(g)\).

By repeating this argument, we can conclude that there exists some element in \(S(g)\) that gives the same sorting scenario as \(h\).

Thus the number of sorting scenarios is equal to \(\left| S(g)\right| = (d+1)^{d-1}\). \(\square \)

8 Conclusions and future work

The double cut and join operator has been a major step forward for the study of genome rearrangements, because it is very general, allowing numerous operations on multi-chromosomal genomes, and in addition has a very simple length formula with which to calculate genomic distance. In this paper, we have shown how this operator may be described group-theoretically, and derived a correspondingly simple length formula independently of prior results. The length formula given in Theorem 6.11 requires only the ability to write each genome as a permutation and to multiply such permutations. We are also able to provide a simple construction for an optimal sorting scenario in particular instances of the problem. Translating the model into algebra allows us to exploit established results in group theory, a field with over a century of development. The proof of Theorem 7.3 is an example of this, relying as it does on a combinatorial group theory result from the 1950s. The use of group theory to model rearrangements provides a natural context in which to study alternative assumptions about the rearrangement processes. As pointed out in Egri-Nagy et al. (2014) and Francis (2014), different assumptions about the processes give rise to questions about length functions in different groups, or length functions with respect to different generating sets. A group-theoretic model may also provide an avenue for investigating additional operations that are not captured by the DCJ.