1 Introduction

Phylogenetic (evolutionary) trees have been widely used in the study of evolutionary biology to represent the tree-like evolution of a collection of species. Given the same set of species, different data sets and different building methods may result in the construction of different trees. In order to facilitate the comparison of these different phylogenetic trees, several distance metrics have been proposed, such as the Robinson–Foulds distance [17], the Nearest Neighbor Interchange (NNI) distance [15], the Tree-Bisection and Reconnection (TBR) distance, and the Subtree-Prune and Regraft (SPR) distance [7, 23]. In particular, SPR and TBR distances have been commonly used in phylogenetic inference [12], and SPR operations have been applied to investigate lateral genetic transfer [3, 26] and MCMC search [27].

A graph theoretical model, the maximum agreement forest (maf) of two phylogenetic trees, has been formulated for the TBR distance and for the SPR distance [14]. Define the order of a forest to be the number of connected components in the forest. Allen and Steel [2] proved that the TBR distance between two unrooted binary phylogenetic trees is equal to the order of their maf minus 1, and Bordewich and Semple [6] proved that the rSPR distance between two rooted binary phylogenetic trees is equal to the order of their rooted version of maf minus 1. In terms of computational complexity, it is known that the Maximum Agreement Forest problem (Maf), i.e., constructing an maf, is NP-hard and MAX SNP-hard for two unrooted binary phylogenetic trees [14], as well as for two rooted binary phylogenetic trees [6].

Approximation algorithms have been studied for the Maf problem, mainly on two trees. For the Maf problem on two rooted binary phylogenetic trees, Hein et al. [14] proposed an approximation algorithm and claimed that the ratio of the algorithm was 3. Later Rodrigues et al. [18] found a subtle error in [14], showed that the algorithm in [14] has ratio at least 4, and presented a new approximation algorithm which they claimed had ratio 3. Bonet et al. [4] provided a counterexample and showed that both the algorithms in [14] and [18] compute a 5-approximation of the rSPR distance between two rooted binary trees. The approximation ratio was improved to 3 by Bordewich et al. [5], but at the expense of an increased running time of \(O(n^5)\). A second 3-approximation algorithm presented in [19] achieves a running time of \(O(n^2)\). Whidden and Zeh [24] presented the third 3-approximation algorithm, whose running time is linear. Recently, Shi et al. [21] presented an approximation algorithm of ratio 2.5, which is the best known approximation algorithm for the Maf problem on two rooted binary trees. For the Maf problem on two unrooted binary phylogenetic trees, the best approximation algorithm is due to Whidden and Zeh [24]; it runs in linear time and has ratio 3.

There are also a couple of approximation algorithms for the Maf problem on two general (i.e., binary and non-binary) phylogenetic trees. Rodrigues et al. [19] developed an approximation algorithm of ratio \(d+1\) for the Maf problem on two rooted general trees, where d is the maximum number of children a vertex in the input trees may have. Chen et al. [9] developed an approximation algorithm of ratio 3 recently, which is the first constant-ratio approximation algorithm for the Maf problem on two unrooted general trees.

The Maf problem on multiple phylogenetic trees has not been studied as extensively as that on two trees. To the best of our knowledge, there is currently no known approximation algorithm for the Maf problem on multiple unrooted binary phylogenetic trees. The only approximation algorithm for the problem on multiple phylogenetic trees is an 8-approximation algorithm developed by Chataigner [8], which is for the problem on two or more rooted binary trees.

We remark that it makes perfect sense to investigate the Maf problem on more than two phylogenetic trees: we may construct two or more different phylogenetic trees for the same collection of species according to different data sets and different building methods. An maf of order k for a set of phylogenetic trees means that for any two phylogenetic trees \(T_i\) and \(T_j\) in the given set, the TBR distance (if the trees are unrooted) or the rSPR distance (if the trees are rooted) between \(T_i\) and \(T_j\) is not larger than \(k-1\). Moreover, the order of an maf of a collection of rooted trees is a lower bound on their hybridization number as it is a lower bound on the order of a maximum acyclic agreement forest of the trees, which is equal to the minimum number of hybridization nodes (nodes with multiple incoming edges) over all hybridization networks displaying the given collection of trees [28]. On the other hand, it seems much more difficult to construct an maf for more than two trees than for two trees. For example, while there have been several polynomial-time approximation algorithms of ratio 3 for the Maf problem on two rooted binary phylogenetic trees [5, 19, 24], the best polynomial-time approximation algorithm [8] for the Maf problem on more than two rooted binary phylogenetic trees has ratio 8. Also, to the best of our knowledge, there are currently no known polynomial-time approximation algorithms for the Maf problem on multiple unrooted binary phylogenetic trees.

In the current paper, we are focused on polynomial-time approximation algorithms for the Maf problem on multiple (i.e., two or more) binary phylogenetic trees, for both the version of rooted trees and the version of unrooted trees. We propose a very general framework for approximation algorithms for the Maf problem, which is valid for both rooted trees and unrooted trees. Our major contribution is the introduction of the concept of “edge-removal meta-steps” (or simply “meta-steps”) and of the metric that evaluates the quality of the meta-steps. Roughly speaking, each meta-step is a sequence of consecutive edge removal operations, and the metric measures the ratio of the number of “essential edges” over the number of “correct edges” removed by the meta-step. A subtle issue is how to define and identify essential edges and correct edges in the entire set of edges removed by a meta-step. Our framework consists of meta-steps. We formally prove that as long as the meta-steps meet certain given conditions in terms of their metric, the corresponding algorithm based on our framework is an approximation algorithm with a specific approximation ratio. We then work on the careful development of the meta-steps, for rooted trees and then for unrooted trees, focusing on achieving meta-steps that are good in terms of the proposed metric. This development results in a polynomial-time 3-approximation algorithm for the Maf problem on multiple rooted binary phylogenetic trees, which is an improvement over the previous best 8-approximation algorithm for the problem, and whose ratio matches the best known approximation ratio for the problem on two rooted binary trees. We also present a polynomial-time 4-approximation algorithm for the Maf problem on multiple unrooted binary phylogenetic trees, giving the first constant-ratio approximation algorithm for the problem.

2 Problem Formulations

We assume that readers are familiar with the general terminology of graph theory [11]. Our definitions for the study of maximum agreement forests are consistent with those used in the literature [6, 13, 14, 24]. A single-vertex tree is a tree that consists of a single vertex, and a single-edge tree is a tree that consists of a single edge. A tree is binary if either it is a single-vertex tree or each of its vertices has degree either 1 or 3. The degree-1 vertices are leaves and the degree-3 vertices are non-leaves of the tree. For a subset \(E^{\prime }\) of edges in a graph G, we will denote by \(G \setminus E^{\prime }\) the graph G with the edges in \(E^{\prime }\) removed (so \(G \setminus E^{\prime }\) and G have the same vertex set).

The problem in our discussion has two versions, one on unrooted trees and the other on rooted trees. We first give the terminology for the unrooted version, then remark on the differences for the rooted version. Let X be a fixed label-set.

2.1 X-Trees and X-Forests: The Unrooted Version

A binary tree is unrooted if no root is specified in the tree—in this case no ancestor–descendant relation is defined in the tree. For a label-set X, an unrooted binary phylogenetic X-tree, or simply an unrooted X-tree, is an unrooted binary tree whose leaves are labeled bijectively by the label-set X (and all non-leaves are unlabeled). An unrooted X-tree will also be called an (unrooted) leaf-labeled tree when there is no need to specify the label-set X. An unrooted X-forest F is a collection of disjoint leaf-labeled trees whose label-sets are disjoint such that the union of the label-sets is equal to X.

A subtree \(T^{\prime }\) of an unrooted X-tree may contain unlabeled vertices of degree \({<}3\). In this case we apply the forced contraction operation on \(T^{\prime }\), which, repeatedly, replaces each degree-2 vertex v and its incident edges with an edge connecting the two neighbors of v, and removes all unlabeled vertices of degree smaller than 2. An X-forest F is irreducible if forced contraction is not applicable to F. When we want to emphasize that forced contraction has been applied on a graph G, we add a subscript “fc” and write it as \((G)_{\text{ fc }}\), which is irreducible. After forced contraction, an unlabeled vertex in an unrooted X-forest is always of degree 3.
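The forced contraction operation for the unrooted version can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `forced_contraction` and the representation of a forest as a list of vertex pairs plus a set of labeled vertices are assumptions made for this example, and the rooted variant (which keeps a degree-2 root) is not handled.

```python
def forced_contraction(edges, labeled):
    """Return (vertices, edges) of (G)_fc for an unrooted leaf-labeled forest G.

    edges   : iterable of (u, v) pairs describing a forest (no cycles)
    labeled : set of labeled vertices (the leaves carrying labels from X)
    """
    # Build adjacency sets; labeled vertices are kept even when isolated.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for v in labeled:
        adj.setdefault(v, set())

    changed = True
    while changed:  # repeat until no rule applies, i.e. the forest is irreducible
        changed = False
        for v in list(adj):
            if v in labeled:
                continue
            deg = len(adj[v])
            if deg < 2:
                # Remove an unlabeled vertex of degree 0 or 1.
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
            elif deg == 2:
                # Replace a degree-2 vertex and its two incident edges
                # by a single edge joining its two neighbors.
                a, b = adj[v]
                adj[a].discard(v)
                adj[b].discard(v)
                adj[a].add(b)
                adj[b].add(a)
                del adj[v]
                changed = True

    out_edges = {frozenset((u, w)) for u in adj for w in adj[u]}
    return set(adj), out_edges
```

For instance, on the four-leaf tree with non-leaves `u`, `v` and edges `a-u`, `b-u`, `u-v`, `v-c`, `v-d`, removing the edge `u-v` and applying the sketch yields the two single-edge trees `a-b` and `c-d`.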

Two leaf-labeled forests \(F_1\) and \(F_2\) are isomorphic if there is a graph isomorphism between \(F_1\) and \(F_2\) in which each leaf of \(F_1\) is mapped to a leaf of \(F_2\) with the same label. We will simply say that a leaf-labeled forest \(F^{\prime }\) is a subgraph of another leaf-labeled forest F if \((F^{\prime })_{\text{ fc }}\) is isomorphic to \((F^{\prime \prime })_{\text{ fc }}\) for some subgraph \(F^{\prime \prime }\) of F.

2.2 X-Trees and X-Forests: The Rooted Version

A binary tree is rooted if a particular leaf is designated as the root (so it is both a root and a leaf), which specifies a well-defined ancestor–descendant relation in the tree. A rooted X-tree is a rooted binary tree whose leaves are labeled bijectively by the label-set X. The root of an X-tree will always be labeled by a special label \(\rho \) in X. A subtree \(T^{\prime }\) of a rooted X-tree T is a connected subgraph of T that contains at least one leaf in T. In order to preserve the ancestor–descendant relation in the rooted tree T, we should define the root of the subtree \(T^{\prime }\). If \(T^{\prime }\) contains the leaf labeled \(\rho \), then, certainly, \(\rho \) is the root of the subtree \(T^{\prime }\); otherwise, the vertex in \(T^{\prime }\) that is in T the least common ancestor of all the labeled leaves in \(T^{\prime }\) is defined to be the root of \(T^{\prime }\). A rooted X-forest F is a subgraph of a rooted X-tree T that contains all leaves of T. Thus, the X-forest F is a collection of disjoint (rooted) subtrees of the rooted X-tree T with disjoint leaf label-sets whose union is equal to X. In particular, one of the subtrees in a rooted X-forest F must have the leaf labeled \(\rho \) as its root.

We again have the forced contraction operation applied on a subtree \(T'\) of a rooted X-tree. However, if the root r of the subtree \(T'\) is of degree 2, then the forced contraction operation will not be applied on r, in order to preserve the ancestor–descendant relation in \(T^{\prime }\). Therefore, after forced contraction, the root of a subtree \(T^{\prime }\) of a rooted X-forest is either an unlabeled vertex of degree 2, or the vertex labeled \(\rho \) of degree 1, or a labeled vertex of degree 0. Every unlabeled vertex in the subtree \(T^{\prime }\) that is not the root of \(T^{\prime }\) has degree 3.

2.3 Agreement Forests

The following terminologies are used for both rooted and unrooted versions. The order of an X-forest F, denoted \(\text{ Ord }(F)\), is the number of connected components of F that contain at least one leaf of F, or equivalently, \(\text{ Ord }(F)\) is equal to the number of connected components of \((F)_{\text{ fc }}\).

An agreement forest for a collection \(\{F_1, F_2, \ldots , F_m\}\) of X-forests is an X-forest that is a subgraph of \(F_i\), for all \(1 \le i \le m\). Note that since the concept of “subgraph” in X-forests is defined based on the forced contracted versions of the X-forests, forced contraction on any related X-forest will not affect the construction of an agreement forest for a given collection of X-forests. This fact has been well observed and used in the research on the Maf problems, see, for example, [2, 4–6, 13, 14].

A maximum agreement forest (abbr. maf) for the collection \(\{F_1, F_2, \ldots , F_m\}\) of X-forests is an agreement forest for \(\{F_1, F_2, \ldots , F_m\}\) of the minimum order over all agreement forests for \(\{F_1, F_2, \ldots , F_m\}\).

The problems we are focused on in this paper are formally described as follows.

The Rooted Maximum Agreement Forest problem (rooted Maf)

Input: A set \(\{F_1, \ldots , F_m\}\) of rooted X-forests

Output: an maf, i.e., an agreement forest of the minimum order for \(\{F_1, \ldots , F_m\}\)

The Unrooted Maximum Agreement Forest problem (unrooted Maf)

Input: A set \(\{F_1, \ldots , F_m\}\) of unrooted X-forests

Output: an maf, i.e., an agreement forest of the minimum order for \(\{F_1, \ldots , F_m\}\)

When each of the X-forests \(F_1\), \(\ldots \), \(F_m\) is an X-tree, the above problems become the standard Maximum Agreement Forest problems on multiple binary phylogenetic trees, for the rooted version and for the unrooted version, respectively.

3 Approximating Maf: A General Framework

We now present a general framework for approximation algorithms for the Maf problems. The discussion is valid for both rooted and unrooted versions of the problem.

In this section, we will assume that the forced contraction operation is not applied unless we explicitly require it. Therefore, a subgraph \(F^{\prime }\) of an X-forest F may contain vertices of degree \({<}3\) that are non-leaves in the original X-forest F. We will call the vertices in \(F^{\prime }\) “labeled vertices” and “unlabeled vertices” to refer to the leaves and non-leaves in the X-forest F, respectively. We will relax our definition and call such a forest \(F^{\prime }\) an X-forest if there is a one-to-one mapping between the label-set X and the labeled vertices of \(F^{\prime }\). A connected component of \(F^{\prime }\) is an l-component if it contains at least one labeled vertex. The order Ord\((F^{\prime })\) of \(F^{\prime }\) is the number of l-components of \(F^{\prime }\).
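To make the relaxed notion of order concrete, the following Python sketch counts l-components with a small union-find. The function name `order` and the vertex/edge/label representation are assumptions chosen for this illustration only.

```python
def order(vertices, edges, labeled):
    """Ord(F'): the number of connected components of F' that contain
    at least one labeled vertex (the l-components)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Find the component representative, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)

    # Only components that contain a labeled vertex are counted.
    return len({find(v) for v in labeled})


# A tree with one unlabeled vertex "c" and three labeled leaves.
star_vertices = {"c", "x", "y", "z"}
star_edges = [("c", "x"), ("c", "y"), ("c", "z")]
star_labeled = {"x", "y", "z"}
```

With all three edges present the order is 1. With all three edges removed the order is 3 rather than 4, because the component consisting of the unlabeled vertex `c` alone is not an l-component.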

For any edge set \(E^{\prime }\) in an X-forest F, we have \(\text{ Ord }(F \setminus E^{\prime }) \le \text{ Ord }(F) + |E^{\prime }|\). An edge subset \(E^{\prime }\) of an X-forest F is an essential edge-subset (abbr. ee-set) if \(\text{ Ord }(F \setminus E^{\prime }) = \text{ Ord }(F) + |E^{\prime }|\). Note that every subset of an ee-set for F is an ee-set: if a subset \(E^{\prime \prime }\) of an ee-set \(E^{\prime }\) for F is not an ee-set for F, then the forest \(F \setminus E^{\prime \prime }\) has its order smaller than \(\text{ Ord }(F) + |E^{\prime \prime }|\) so the order of the forest \(F \setminus E^{\prime } = (F \setminus E^{\prime \prime }) \setminus (E^{\prime } \setminus E^{\prime \prime })\) is smaller than \(\text{ Ord }(F) + |E^{\prime \prime }| + |E^{\prime } \setminus E^{\prime \prime }| = \text{ Ord }(F) + |E^{\prime }|\), contradicting the fact that \(E^{\prime }\) is an ee-set for F. On the other hand, the union of ee-sets for F may not be an ee-set: for example, in an unrooted tree with a single non-leaf and three labeled leaves, every edge makes an ee-set but the union of the three edges is not an ee-set. Nevertheless, we have the following result.

Lemma 1

Let F be an X-forest and let \(E_1\) be an edge subset in F. Then for every ee-set \(E_1^{\prime } \subseteq E_1\) for F and for every ee-set \(E_2\) for \(F \setminus E_1\), \(E_1^{\prime } \cup E_2\) is an ee-set for F.

Proof

Let \(E_1^{\prime \prime }\) be a largest ee-set for F that is a subset of \(E_1\) and contains \(E_1^{\prime }\). Thus, \(\text{ Ord }(F \setminus E_1) = \text{ Ord }(F) + |E_1^{\prime \prime }|\). We first show that \(E_1^{\prime \prime } \cup E_2\) is an ee-set for F. Let \(F_1 = F \setminus (E_1^{\prime \prime } \cup E_2)\).

Claim \(F_1\) and \(F_1 \setminus (E_1 \setminus E_1^{\prime \prime })\) have the same order.

To prove the claim, assume the contrary that the order of \(F_1 \setminus (E_1 \setminus E_1^{\prime \prime })\) is larger than that of \(F_1\). Then removing the edges of \(E_1 \setminus E_1^{\prime \prime }\) from \(F_1\) would split some l-component of \(F_1\) into at least two l-components. Since each l-component of \(F_1 = F \setminus (E_1^{\prime \prime } \cup E_2)\) is a subgraph of an l-component of \(F \setminus E_1^{\prime \prime }\), removing the edges of \(E_1 \setminus E_1^{\prime \prime }\) from \(F \setminus E_1^{\prime \prime }\) would also split some l-component of \(F \setminus E_1^{\prime \prime }\) into at least two l-components (note that all these l-components are trees). This implies that the order of \((F \setminus E_1^{\prime \prime }) \setminus (E_1 \setminus E_1^{\prime \prime }) = F \setminus E_1\) is larger than the order of \(F \setminus E_1^{\prime \prime }\). But this contradicts the assumption that \(E_1^{\prime \prime }\) is an ee-set for F and that \(\text{ Ord }(F \setminus E_1)=\text{ Ord }(F)+|E_1^{\prime \prime }|\). This contradiction proves the claim that \(F_1\) and \(F_1 \setminus (E_1 \setminus E_1^{\prime \prime })\) have the same order.

Since \(E_2\) is an ee-set for \(F \setminus E_1\), the order of \((F \setminus E_1) \setminus E_2 = (F \setminus (E_1^{\prime \prime } \cup E_2)) \setminus (E_1 \setminus E_1^{\prime \prime }) = F_1 \setminus (E_1 \setminus E_1^{\prime \prime })\) is equal to \(\text{ Ord }(F \setminus E_1) + |E_2| = \text{ Ord }(F) + |E_1^{\prime \prime }| + |E_2|\). By the above claim, the order of \(F_1 = F \setminus (E_1^{\prime \prime } \cup E_2)\) is also \(\text{ Ord }(F) + |E_1^{\prime \prime }| + |E_2| = \text{ Ord }(F) + |E_1^{\prime \prime } \cup E_2|\), which implies that \(E_1^{\prime \prime } \cup E_2\) is an ee-set for F.

Since every subset of an ee-set for F is also an ee-set, and since \(E_1^{\prime } \cup E_2\) is a subset of the ee-set \(E_1^{\prime \prime } \cup E_2\) for F, we conclude that \(E_1^{\prime } \cup E_2\) is an ee-set for F. \(\square \)
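As a sanity check rather than a substitute for the proof, Lemma 1 can be verified exhaustively on small forests. The Python sketch below does this by brute force; the helper names (`order`, `is_ee_set`, `subsets`, `lemma1_holds`) and the edge-tuple representation are assumptions made for this example.

```python
from itertools import combinations


def order(vertices, edges, labeled):
    """Number of connected components containing a labeled vertex."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in labeled})


def is_ee_set(vertices, edges, labeled, removed):
    """True iff Ord(F \\ removed) = Ord(F) + |removed|; removed must be a subset of edges."""
    remaining = [e for e in edges if e not in removed]
    return (order(vertices, remaining, labeled)
            == order(vertices, edges, labeled) + len(removed))


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def lemma1_holds(vertices, edges, labeled):
    """Check Lemma 1 for every E1, every ee-set E1' inside E1 (for F),
    and every ee-set E2 for F \\ E1."""
    edges = frozenset(edges)
    for e1 in subsets(edges):
        rest = edges - e1
        for e1_prime in subsets(e1):
            if not is_ee_set(vertices, edges, labeled, e1_prime):
                continue
            for e2 in subsets(rest):
                if not is_ee_set(vertices, rest, labeled, e2):
                    continue
                if not is_ee_set(vertices, edges, labeled, e1_prime | e2):
                    return False
    return True


# The three-leaf example from the text: each single edge is an ee-set,
# but the union of all three edges is not.
star_v = {"c", "x", "y", "z"}
star_e = [("c", "x"), ("c", "y"), ("c", "z")]
star_l = {"x", "y", "z"}

# A four-leaf unrooted binary tree with non-leaves u and v.
quartet_v = {"a", "b", "c", "d", "u", "v"}
quartet_e = [("a", "u"), ("b", "u"), ("u", "v"), ("v", "c"), ("v", "d")]
quartet_l = {"a", "b", "c", "d"}
```

The enumeration is exponential in the number of edges, so it is only meant for toy inputs such as `star_e` and `quartet_e`.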

It is easy to see that for any X-subforest \(F^{\prime }\) of an X-forest F, there is an ee-set \(E^{\prime }\) of \(\text{ Ord }(F^{\prime }) - \text{ Ord }(F)\) edges in F such that \((F^{\prime })_{\text{ fc }}= (F \setminus E^{\prime })_{\text{ fc }}\).

Up to forced contraction, every irreducible agreement forest \(F^{\prime }\) for an instance \(\{F_1, \ldots , F_m\}\) of Maf corresponds to a unique subgraph \(F_i^{\prime }\) of \(F_i\), for each i. Thus, without any confusion, we can simply say that an edge e in \(F_i\) is in or is not in the agreement forest \(F^{\prime }\), as long as the edge e is in or is not in the corresponding unique subforest \(F_i^{\prime }\) of \(F_i\), respectively.

Our approximation algorithms for Maf consist of a sequence of “meta-steps”. An edge-removal meta-step (or simply a meta-step) in an algorithm for Maf is a collection of consecutive computational steps in the algorithm that on an instance \(\{F_1, \ldots , F_m\}\) of Maf removes certain edges in the forests in \(\{F_1, \ldots , F_m\}\) (and then applies forced contraction). Our approximation algorithms have the following general framework (for both rooted and unrooted versions).

The performance of the algorithm Apx-MAF heavily depends on the quality of the meta-steps we employ in step 2 of the algorithm. For this, we introduce the following concept that measures the quality of a meta-step, where \(r \ge 1\) is an arbitrary real number.

Definition-R

A meta-step \(\sigma \), which removes a set \(E^{\sigma }\) of edges in the forests in \({\mathcal {I}} = \{F_1, \ldots , F_m\}\), keeps a ratio r, where \(r \ge 1\), if \(E^{\sigma }\) contains a subset \(E_1^{\sigma }\) of edges in \(F_1\) such that no edge in \(E^{\sigma } \setminus E_1^{\sigma }\) is in any agreement forest for \(\{F_1 \setminus E_1^{\sigma }, F_2, \ldots , F_m\}\), and for each agreement forest \(F^{\prime }\) for \(\mathcal {I}\), there is an ee-set \(E_{1,F^{\prime }}^{\sigma }\) for \(F_1\), \(E_{1,F^{\prime }}^{\sigma } \subseteq E_1^{\sigma }\), \(|E_{1,F^{\prime }}^{\sigma }| \ge |E_1^{\sigma }|/r\), and no edge in \(E_{1,F^{\prime }}^{\sigma }\) is in \(F^{\prime }\).

Remark 1

The meta-step \(\sigma \) above may also remove other edges in the forest \(F_1\) that are not in the subset \(E_1^{\sigma }\), as long as these edges are not in any agreement forest for \(\{F_1 \setminus E_1^{\sigma }, F_2, \ldots , F_m\}\).

Remark 2

By definition, adding to the meta-step \(\sigma \) more edge removals that remove edges not in any agreement forest for \(\{F_1 \setminus E_1^{\sigma }, F_2, \ldots , F_m\}\) does not change the ratio of the meta-step \(\sigma \). In particular, if the meta-step \(\sigma \) removes only edges not in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\), then we can let \(E_1^{\sigma } = \emptyset \), and the meta-step \(\sigma \) keeps a ratio 1.

Remark 3

Definition-R looks rather complicated, for which we give some intuitive explanations. Our algorithm operates on an instance \(\{F_1, F_2, \ldots , F_m\}\) by deleting edges in \(F_1\), \(F_2\), \(\ldots \), \(F_m\) to eventually make \(F_1\), \(F_2\), \(\ldots \), \(F_m\) identical, which thus gives an agreement forest for the original \(\{F_1, F_2, \ldots , F_m\}\). Therefore, allowing the edge set \(E^{\sigma }\) removed by the meta-step \(\sigma \) to contain edges not only in \(F_1\) but also in \(F_2\), \(\ldots \), \(F_m\) seems necessary. However, we require that the edge set \(E^{\sigma }\) contain an “important” subset \(E_1^{\sigma }\) in \(F_1\) such that removing \(E^{\sigma }\) is not worse than removing \(E_1^{\sigma }\) (this is the condition that no edge in \(E^{\sigma } \setminus E_1^{\sigma }\) is in any agreement forest for \(\{F_1 \setminus E_1^{\sigma }, F_2, \ldots , F_m\}\)). Moreover, for each agreement forest \(F^{\prime }\) for \(\{F_1, F_2, \ldots , F_m\}\), we require that there be a “correct” subset \(E_{1,F^{\prime }}^{\sigma }\) of \(E_1^{\sigma }\) (this is given by the conditions that \(E_{1,F^{\prime }}^{\sigma }\) is an ee-set for \(F_1\) and that no edge in \(E_{1,F^{\prime }}^{\sigma }\) is in \(F^{\prime }\)) such that removing \(E_1^{\sigma }\) is not worse than r times removing \(E_{1,F^{\prime }}^{\sigma }\) (this is given by the condition \(E_{1,F^{\prime }}^{\sigma } \subseteq E_1^{\sigma }\), \(|E_{1,F^{\prime }}^{\sigma }| \ge |E_1^{\sigma }|/r\)). Combining these observations, we get the condition that for any agreement forest \(F^{\prime }\) for \(\{F_1, F_2, \ldots , F_m\}\), removing \(E^{\sigma }\) is not worse than r times removing a correct edge subset \(E_{1,F^{\prime }}^{\sigma }\). From this, the reason why we call \(\sigma \) a meta-step keeping a ratio r becomes obvious.
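A small numeric illustration of Definition-R may help; the numbers are hypothetical and do not describe any particular meta-step of this paper. Suppose a meta-step \(\sigma \) removes only edges of \(F_1\), with \(E^{\sigma } = E_1^{\sigma }\) and \(|E_1^{\sigma }| = 3\), so that the condition on \(E^{\sigma } \setminus E_1^{\sigma }\) holds vacuously. Suppose further that for every agreement forest \(F^{\prime }\) for \(\mathcal {I}\), some edge \(e \in E_1^{\sigma }\) is not in \(F^{\prime }\) and \(\{e\}\) is an ee-set for \(F_1\). Taking \(E_{1,F^{\prime }}^{\sigma } = \{e\}\), the size condition reads

$$\begin{aligned} \left| E_{1,F^{\prime }}^{\sigma }\right| = 1 \ge \frac{\left| E_1^{\sigma }\right| }{r} = \frac{3}{r} \quad \Longleftrightarrow \quad r \ge 3, \end{aligned}$$

so under these assumptions \(\sigma \) keeps a ratio 3.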

The optimal value for an instance \(\mathcal {I}\) of the Maf problem is the order of an maf for \(\mathcal {I}\).

Theorem 1

Let \(r \ge 1\) be a real number. Suppose that each meta-step in step 2 of the algorithm Apx-MAF keeps a ratio not larger than r and that the algorithm Apx-MAF halts on an instance \({\mathcal {I}}_0\) of Maf. Then the output of the algorithm Apx-MAF is an agreement forest for \({\mathcal {I}}_0\) whose order is at most r times the optimal value for \({\mathcal {I}}_0\).

Proof

Let \({\mathcal {I}}_0 = \{F_1^{(0)}, \ldots , F_m^{(0)}\}\). First note that each execution of step 3 of the algorithm Apx-MAF can also be regarded as a meta-step. To find the ratio of this meta-step, suppose that step 3 is applied on an instance \(\{F_1^{\prime \prime }, \ldots , F_m^{\prime \prime }\}\), which is obtained by the i-th execution of step 2 on an instance \(\{F_1^{\prime }, \ldots , F_m^{\prime }\}\), where i is any integer between 1 and \(m-1\). Because of the i-th execution of step 2, \(F_1^{\prime \prime }\) and \(F_{i+1}^{\prime \prime }\) are X-subforests of \(F_1^{\prime }\) and \(F_{i+1}^{\prime }\), respectively, and \(F_j^{\prime \prime } = F_j^{\prime }\) for \(j \ne 1, i+1\). By induction, it is easy to see that \(F_1^{\prime } = F_2^{\prime } = \cdots = F_i^{\prime }\). Therefore, \(F_1^{\prime \prime }\) is also an X-subforest of \(F_j^{\prime \prime }\) for \(j = 2, \ldots , i\). In particular, for each j, \(2 \le j \le i\), any edge in \(F_j^{\prime \prime }\) but not in \(F_1^{\prime \prime }\) cannot be in any agreement forest for \(\{F_1^{\prime \prime }, \ldots , F_m^{\prime \prime }\}\). Therefore, the meta-step made by step 3 on \(\{F_1^{\prime \prime }, \ldots , F_m^{\prime \prime }\}\) removes no edge in any agreement forest for \(\{F_1^{\prime \prime }, \ldots , F_m^{\prime \prime }\}\). By Remark 2, this meta-step keeps a ratio 1, which is bounded by r. Moreover, by the condition given in the theorem, each meta-step in step 2 of the algorithm keeps a ratio bounded by r.

Therefore, the algorithm Apx-MAF applies a sequence of meta-steps \({\sigma }_1\), \(\sigma _2\), \(\ldots \), \({\sigma }_t\), where t is a finite number because we assume that the algorithm halts on \({\mathcal {I}}_0\). By the above discussion, each meta-step \(\sigma _i\) keeps a ratio bounded by r. By Definition-R, for each i, \(1 \le i \le t\), the meta-step \({\sigma }_i\) removes a set \(E^{\sigma _i}\) of edges in the forests in \({\mathcal {I}}_{i-1} = \{F_1^{(i-1)}, \ldots , F_m^{(i-1)}\}\) and produces an instance \({\mathcal {I}}_i = \{F_1^{(i)}, \ldots , F_m^{(i)}\}\), where the set \(E^{\sigma _i}\) contains a subset \(E_1^{\sigma _i}\) of edges in \(F_1^{(i-1)}\) such that no edge in \(E^{\sigma _i} \setminus E_1^{\sigma _i}\) is in any agreement forest for \(\{F_1^{(i-1)} \setminus E_1^{\sigma _i}, F_2^{(i-1)}, \ldots , F_m^{(i-1)}\}\), and that for each agreement forest \(F^{\prime }\) for \({\mathcal {I}}_{i-1}\), there is an ee-set \(E_{1,F^{\prime }}^{\sigma _i}\) for \(F_1^{(i-1)}\), \(E_{1,F^{\prime }}^{\sigma _i} \subseteq E_1^{\sigma _i}\), \(|E_{1,F^{\prime }}^{\sigma _i}| \ge |E_1^{\sigma _i}|/r\), and no edge in \(E_{1,F^{\prime }}^{\sigma _i}\) is in \(F^{\prime }\). Since the meta-step \(\sigma _i\) only removes edges in the forests in \({\mathcal {I}}_{i-1}\), for each j, \(F_j^{(i)}\) is an X-subforest of \(F_j^{(i-1)}\). In particular, \(F_j^{(t)}\) is an X-subforest of \(F_j^{(0)}\). Since at the end of the algorithm, we have \(F_1^{(t)} = F_2^{(t)} = \cdots = F_m^{(t)}\), the output \(F_1^{(t)}\) of the algorithm Apx-MAF is an X-subforest of \(F_j^{(0)}\) for all j, \(1 \le j \le m\). This proves that the output \(F_1^{(t)}\) of the algorithm Apx-MAF is an agreement forest for the input \({\mathcal {I}}_0\) of the algorithm.

Now consider the order of the X-forest \(F_1^{(t)}\). Fix an maf \(F_0\) for \({\mathcal {I}}_0\). Inductively, for a given \(i \ge 0\), suppose that we have an agreement forest \(F_i\) for \({\mathcal {I}}_i=\{F_1^{(i)},F_2^{(i)},\ldots ,F_m^{(i)}\}\) with \(\text{ Ord }(F_i)\le \text{ Ord }(F_0)+\frac{r-1}{r}\sum _{h=1}^i|E_1^{\sigma _h}|\) (this certainly holds true for the case \(i = 0\)). Because the meta-step \(\sigma _{i+1}\) keeps a ratio bounded by r, for the agreement forest \(F_i\) for \({\mathcal {I}}_i\), there is an ee-set \(E_{1,F_i}^{\sigma _{i+1}}\) for \(F_1^{(i)}\), \(E_{1,F_i}^{\sigma _{i+1}} \subseteq E_1^{\sigma _{i+1}}\), \(|E_{1,F_i}^{\sigma _{i+1}}| \ge |E_1^{\sigma _{i+1}}|/r\), and no edge in \(E_{1,F_i}^{\sigma _{i+1}}\) is in \(F_i\). Thus, \(E_1^{\sigma _{i+1}}\) contains at least \(|E_1^{\sigma _{i+1}}|/r\) edges not in \(F_i\) (recall that \(F_i\) can be treated as a subgraph of \(F_1^{(i)}\)), so \(E_1^{\sigma _{i+1}}\) contains at most \(\frac{r-1}{r}|E_1^{\sigma _{i+1}}|\) edges in \(F_i\). Therefore, the order of \(F_i \setminus E_1^{\sigma _{i+1}}\) is bounded by \(\text{ Ord }(F_i) + \frac{r-1}{r}|E_1^{\sigma _{i+1}}|\). Let \(F_{i+1} = F_i \setminus E_1^{\sigma _{i+1}}\). Then \(F_{i+1}\) is an agreement forest for \(\{F_1^{(i)} \setminus E_1^{\sigma _{i+1}}, F_2^{(i)}, \ldots , F_m^{(i)} \}\). By the properties of \(E_1^{\sigma _{i+1}}\), no edge in \(E^{\sigma _{i+1}} \setminus E_1^{\sigma _{i+1}}\) is in \(F_{i+1}\). Thus, \(F_{i+1}\) is also an agreement forest for \({\mathcal {I}}_{i+1} = \{F_1^{(i+1)}, F_2^{(i+1)}, \ldots , F_m^{(i+1)} \}\), which is obtained from \({\mathcal {I}}_i = \{F_1^{(i)}, F_2^{(i)}, \ldots , F_m^{(i)} \}\) with the edges in \(E^{\sigma _{i+1}}\) removed by the meta-step \(\sigma _{i+1}\).

Thus, \(F_{i+1} = F_i \setminus E_1^{\sigma _{i+1}}\) makes the induction go through: \(F_{i+1}\) is an agreement forest for \({\mathcal {I}}_{i+1}=\{F_1^{(i+1)},F_2^{(i+1)},\ldots ,F_m^{(i+1)} \}\), and the order of \(F_{i+1}\), by the inductive hypothesis, satisfies

$$\begin{aligned} \text{ Ord }\left( F_{i+1}\right) \le \text{ Ord }\left( F_i\right) + \frac{r-1}{r}\left| E_1^{\sigma _{i+1}}\right| \le \text{ Ord }(F_0) + \frac{r-1}{r} \sum _{h=1}^{i+1} \left| E_1^{\sigma _h}\right| . \end{aligned}$$

This gives an agreement forest \(F_t\) for \({\mathcal {I}}_t=\{F_1^{(t)},F_2^{(t)},\ldots , F_m^{(t)}\}\) whose order satisfies \(\text{ Ord }(F_t) \le \text{ Ord }(F_0) + \frac{r-1}{r} \sum _{h=1}^t |E_1^{\sigma _h}|\). Since \(F_t\) is an X-subforest of the X-forest \(F_1^{(t)}\), we also have

$$\begin{aligned} \text{ Ord }\left( F_1^{(t)}\right) \le \text{ Ord }\left( F_t\right) \le \text{ Ord }\left( F_0\right) + \frac{r-1}{r} \sum _{h=1}^t \left| E_1^{\sigma _h}\right| . \end{aligned} \tag{1}$$

To complete the proof, we need to compare \(\text{ Ord }(F_1^{(t)})\) with the optimal value \(\text{ Ord }(F_0)\). For this, we introduce one more notation. For each \(i \ge 1\), let \(E_{1+}^{\sigma _i}\) be the set of edges in \(F_1^{(i-1)}\) that are removed by the meta-step \(\sigma _i\). Thus, \(E_{1,F_{i-1}}^{\sigma _i} \subseteq E_1^{\sigma _i} \subseteq E_{1+}^{\sigma _i} \subseteq E^{\sigma _i}\), and \(F_1^{(i)} = F_1^{(i-1)} \setminus E_{1+}^{\sigma _i}\), where \(E_{1,F_{i-1}}^{\sigma _i}\) is an ee-set for \(F_1^{(i-1)}\).

It is easy to see that for \(i \ne j\), the sets \(E_1^{\sigma _i}\) and \(E_1^{\sigma _j}\) are disjoint: suppose \(i < j\), then \(E_1^{\sigma _i} \subseteq E_{1+}^{\sigma _i}\) while the edges in \(E_{1+}^{\sigma _i}\) are removed from \(F_1^{(i-1)}\) by \(\sigma _i\), so they cannot be in \(F_1^{(h)}\) for any \(h \ge i\). On the other hand, the edges in \(E_1^{\sigma _j}\) are in \(F_1^{(j-1)}\).

Inductively, suppose that for an integer \(i \ge 0\) we have proved that the set \(E_i = \bigcup _{h=1}^i E_{1,F_{h-1}}^{\sigma _h}\) is an ee-set for \(F_1^{(0)}\), and that no edge in \(E_i\) is in \(F_0\) (this holds trivially for \(i=0\), where \(E_0 = \emptyset \), and for \(i=1\) by the definition of the set \(E_{1,F_0}^{\sigma _1}\)). Now consider the set \(E_{1,F_i}^{\sigma _{i+1}}\) in \(F_1^{(i)}\). By its properties, no edge in \(E_{1,F_i}^{\sigma _{i+1}}\) is in \(F_i\). Since \(F_i = F_0 \setminus (\bigcup _{h=1}^i E_1^{\sigma _h})\), and \(E_{1,F_i}^{\sigma _{i+1}}\) is disjoint from \(E_1^{\sigma _h}\) for all \(1 \le h \le i\) (note \(E_{1,F_i}^{\sigma _{i+1}} \subseteq E_1^{\sigma _{i+1}}\)), we derive that no edge in \(E_{1,F_i}^{\sigma _{i+1}}\) is in \(F_0\). Thus, no edge in the edge set \(E_{i+1} = \bigcup _{h=1}^{i+1} E_{1,F_{h-1}}^{\sigma _h}\) is in \(F_0\). Moreover, since \(E_i\) is an ee-set for \(F_1^{(0)}\), \(F_1^{(i)} = F_1^{(0)} \setminus (\bigcup _{h=1}^i E_{1+}^{\sigma _h})\), \(E_i \subseteq \bigcup _{h=1}^i E_{1+}^{\sigma _h}\), and \(E_{1,F_i}^{\sigma _{i+1}}\) is an ee-set for \(F_1^{(i)}\), by Lemma 1, \(E_i \cup E_{1,F_i}^{\sigma _{i+1}} = E_{i+1}\) is an ee-set for \(F_1^{(0)}\). So the induction goes through. In particular, we derive that \(E_t = \bigcup _{h=1}^t E_{1,F_{h-1}}^{\sigma _h}\) is an ee-set for \(F_1^{(0)}\), and that no edge in \(E_t\) is in \(F_0\). Since \(E_t\) is an ee-set for \(F_1^{(0)}\), we have

$$\begin{aligned} \text{ Ord }\left( F_1^{(0)} \setminus E_t\right) = \text{ Ord }\left( F_1^{(0)}\right) + |E_t|= & {} \text{ Ord }\left( F_1^{(0)}\right) + \sum _{h=1}^t \left| E_{1,F_{h-1}}^{\sigma _h}\right| \\\ge & {} \text{ Ord }\left( F_1^{(0)}\right) + \sum _{h=1}^t \left| E_1^{\sigma _h}\right| /r. \end{aligned}$$

The last equality is from the disjointness of the sets \(E_{1,F_{h-1}}^{\sigma _h}\), which follows directly from the disjointness of the sets \(E_1^{\sigma _h}\). Since no edge in \(E_t\) is in \(F_0\), \(F_0\) is an X-subforest of \(F_1^{(0)} \setminus E_t\), so,

$$\begin{aligned} \text{ Ord }(F_0) \ge \text{ Ord }\left( F_1^{(0)} \setminus E_t\right) \ge \sum _{h=1}^t \left| E_1^{\sigma _h}\right| /r. \end{aligned}$$
(2)

Combining (1) and (2), we get \(\text{ Ord }(F_1^{(t)}) \le r \cdot \text{ Ord }(F_0)\). Since \(F_1^{(t)}\) is the output of the algorithm Apx-MAF and \(F_0\) is an maf for \({\mathcal {I}}_0\), this inequality proves the theorem. \(\square \)

Let \(r \ge 1\) be a real number. An algorithm for the Maf problem is an r-approximation algorithm if on any instance \(\mathcal {I}\) of Maf, the algorithm produces an agreement forest \(F_{\mathcal {I}}\) for \(\mathcal {I}\) such that the order of \(F_{\mathcal {I}}\) is at most r times the optimal value for \(\mathcal {I}\). By Theorem 1, if the meta-steps in step 2 can be constructed and keep ratios bounded by r, and if they guarantee that the algorithm halts on every instance of the Maf problem, then the algorithm Apx-MAF will be an r-approximation algorithm for the Maf problem. In the next two sections, we present such meta-steps for the rooted version and for the unrooted version of the Maf problem, respectively, which thus lead to the desired approximation algorithms for these problems.

4 Meta-steps for Rooted X-Forests

We develop meta-steps for rooted Maf in this section. Thus, all leaf-labeled forests considered in this section are rooted. Because of the bijection between the leaves in an X-forest F and the elements in the label-set X, sometimes we will use, without confusion, an element in X to refer to the corresponding leaf in F, or vice versa.

Fig. 1 Approximation algorithm for the Maf problem

As described in the algorithm Apx-MAF (see Fig. 1), for each execution of step 2 of the algorithm, we are given a fixed integer \(i > 1\) and an instance \({\mathcal {I}} = \{F_1, F_2, \ldots , F_m\}\) of the rooted Maf problem, which is a collection of rooted X-forests, with \(F_1 = F_2 = \cdots = F_{i-1}\), and, as long as \(F_1 \ne F_i\), meta-steps are applied on \(F_1\) and \(F_i\) (see Footnote 2). In the following, we show how these meta-steps are constructed based on different structures of \(F_1\) and \(F_i\) so that they can keep a ratio bounded by 3. Suppose, without loss of generality, that both \(F_1\) and \(F_i\) are irreducible.

Two leaves of a rooted leaf-labeled forest are siblings if they have a common parent. Note that by definition, the root \(\rho \), which is also a leaf, has no sibling.

Suppose that there are two elements a and b in the label-set X that are sibling leaves in both \(F_1\) and \(F_i\). Because our objective is to make \(F_1 = F_i\), and the local structure consisting of a, b and their parent will not distinguish \(F_1\) and \(F_i\), we can treat the local structure as an un-decomposable unit. To implement this, we can simply replace, in both \(F_1\) and \(F_i\), the subtree rooted at the parent of a and b by a single leaf with a new label \(\underline{ab}\). We will call such an operation “shrinking a and b into a single leaf,” and denote it by \(\sigma _1\). In the further processing of \(F_1\) and \(F_i\), we will simply treat \(\underline{ab}\) as a leaf in the forests \(F_1\) and \(F_i\).
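The shrinking operation can be illustrated with a minimal Python sketch. It assumes a representation that is not part of the paper: a rooted forest is a dictionary mapping each vertex to its parent (`None` for a root), and the composed label \(\underline{ab}\) is encoded as the hypothetical string `"(a,b)"` with the two labels in sorted order, so that both forests receive the same new label.

```python
from typing import Dict, Optional

# Assumed representation: vertex -> parent (None for the root of a component).
Parent = Dict[str, Optional[str]]


def is_leaf(parent: Parent, v: str) -> bool:
    """A vertex is a leaf here if no vertex names it as parent."""
    return all(p != v for p in parent.values())


def are_sibling_leaves(parent: Parent, a: str, b: str) -> bool:
    """True if a and b are distinct leaves with a common parent."""
    if a == b or parent[a] is None or parent[a] != parent[b]:
        return False
    return is_leaf(parent, a) and is_leaf(parent, b)


def shrink(parent: Parent, a: str, b: str) -> Parent:
    """Replace the subtree rooted at the parent of a and b by one leaf."""
    if not are_sibling_leaves(parent, a, b):
        raise ValueError("a and b must be sibling leaves")
    p = parent[a]
    label = "(" + ",".join(sorted((a, b))) + ")"  # hypothetical encoding
    shrunk = {v: q for v, q in parent.items() if v not in (a, b, p)}
    shrunk[label] = parent[p]  # the new leaf hangs where p used to hang
    return shrunk


forest = {"r": None, "u": "r", "c": "r", "a": "u", "b": "u"}
shrunk_forest = shrink(forest, "a", "b")
```

Under this encoding no edge is removed by the operation; three vertices are merged into one, which matches the count used later in the running-time analysis.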

The operation \(\sigma _1\) changes the label-set for \(F_1\) and \(F_i\) from X to \(X^{\prime } = X \setminus \{a, b\} \cup \{\underline{ab}\}\), which introduces certain subtle issues when we consider agreement forests for \(\{F_1, F_2, \ldots , F_m\}\). In particular, the leaves a and b might not be siblings in some forests \(F_j\) with \(j \ne 1, i\), so it might be impossible to shrink a and b in these X-forests. Moreover, because the operation \(\sigma _1\) may be applied recursively, the labels a and b may already be composed labels. Therefore, in the general form for our discussion of the meta-steps in step 2 of the algorithm Apx-MAF, the leaf-labeled forests \(F_1\) and \(F_i\) are \(X^{\prime }\)-forests for some label-set \(X^{\prime }\), while \(F_j\), with \(j \ne 1, i\), are X-forests. Each leaf in \(F_1\) (resp. \(F_i\)) corresponds to a subtree of the original X-forest \(F_1\) (resp. \(F_i\)) and its label in \(X^{\prime }\) is given by a collection of the elements in X structured in the form to uniquely describe the subtree. To indicate these differences, we will use \(F_1^{\prime }\) and \(F_i^{\prime }\), instead of \(F_1\) and \(F_i\), in our description of the meta-steps in step 2 of the algorithm Apx-MAF. In particular, with the new label-set \(X^{\prime }\), from the \(X^{\prime }\)-forests \(F_1^{\prime }\) and \(F_i^{\prime }\) we can easily reconstruct the corresponding X-forests \(F_1\) and \(F_i\): \(F_1^{\prime }\) and \(F_i^{\prime }\) are just \(F_1\) and \(F_i\) with certain subtrees shrunk into single leaves. Thus, if \(F_1^{\prime }\), \(F_i^{\prime }\), \(F_1\), and \(F_i\) are all irreducible, then

  (1) an edge in \(F_1^{\prime }\) (resp. \(F_i^{\prime }\)) is an edge in \(F_1\) (resp. \(F_i\));

  (2) a non-leaf vertex in \(F_1^{\prime }\) (resp. \(F_i^{\prime }\)) is a non-leaf vertex in \(F_1\) (resp. \(F_i\));

  (3) an ee-set in \(F_1^{\prime }\) (resp. \(F_i^{\prime }\)) is an ee-set in \(F_1\) (resp. \(F_i\));

  (4) \(F_1^{\prime } = F_i^{\prime }\) as \(X^{\prime }\)-forests if and only if \(F_1 = F_i\) as X-forests; and

  (5) edge-removal meta-steps on \(F_1^{\prime }\) and \(F_i^{\prime }\) are also edge-removal meta-steps on \(F_1\) and \(F_i\).

In the following discussions on cases 1–3, we assume that the irreducible \(X^{\prime }\)-forest \(F_i^{\prime }\) has two leaves a and b that are siblings. Let \(\tau _a\) and \(\tau _b\) be the subtrees in both \(F_1\) and \(F_i\) that correspond to the leaves a and b in \(F_1^{\prime }\) and \(F_i^{\prime }\), respectively. Let \(e_a^{\prime }\) and \(e_b^{\prime }\) be the two edges in \(F_i^{\prime }\) that are incident to a and b, respectively. Thus, \(e_a^{\prime }\) and \(e_b^{\prime }\) are also edges in \(F_i\) that are not in \(\tau _a \cup \tau _b\) but are incident to the roots of \(\tau _a\) and \(\tau _b\), respectively.

Our first meta-step \(\sigma _1\) now can be described as follows.

Case 1 The elements a and b are also siblings in \(F_1^{\prime }\).

Meta-step \(\sigma _1\): In both \(F_1^{\prime }\) and \(F_i^{\prime }\), shrink a and b into a single leaf labeled \(\underline{ab}\).

Meta-step \(\sigma _1\) is a special meta-step that removes no edges in a given instance \(\{F_1, F_2, \ldots , F_m\}\). Instead, it groups certain structures in \(F_1^{\prime }\) and \(F_i^{\prime }\) (thus in \(F_1\) and \(F_i\)) into un-decomposable units. Using the notation in Definition-R, we have \(E^{\sigma _1}=\emptyset \). Thus, we can let \(E_1^{\sigma _1}=\emptyset \), and for all agreement forests \(F^{\prime }\) for \(\{F_1, F_2, \ldots , F_m\}\), let \(E_{1,F^{\prime }}^{\sigma _1}=\emptyset \). By Definition-R, we have

Lemma 2

Meta-step \(\sigma _1\) keeps a ratio 1 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Case 2 The elements a and b are in different connected components in \(F_1^{\prime }\).

Meta-step \(\sigma _2\): In case 2, if at least one of a and b is a single-vertex tree in \(F_1^{\prime }\), then remove the edge(s) in \(F_i^{\prime }\) that are incident to the corresponding leaves (a or b or both) that are single-vertex trees in \(F_1^{\prime }\); otherwise, remove in both \(F_1^{\prime }\) and \(F_i^{\prime }\) the edges incident to a and b.

Lemma 3

Meta-step \(\sigma _2\) keeps a ratio 2 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Proof

We consider the first subcase. Suppose that a is a single-vertex tree in \(F_1^{\prime }\), then \(\tau _a\) is a connected component of \(F_1\). Therefore, no agreement forest for \(\{F_1, F_2, \ldots , F_m\}\) can have a connected component that contains both leaves in \(\tau _a\) and leaves not in \(\tau _a\). This means that the edge \(e_a^{\prime }\) in \(F_i\) cannot be in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). The same argument also holds true for the leaf b. Therefore, if at least one of a and b is a single-vertex tree in \(F_1^{\prime }\), then the edge set \(E^{\sigma _2}\) removed by \(\sigma _2\) in \(F_1^{\prime }\) and \(F_i^{\prime }\) (thus in \(F_1\) and \(F_i\)) is entirely in \(F_i\), and contains no edge in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). Thus, for this subcase, we can let \(E_1^{\sigma _2} = \emptyset \) and for every agreement forest \(F^{\prime }\) for \(\{F_1, F_2, \ldots , F_m\}\) let \(E_{1,F^{\prime }}^{\sigma _2} = \emptyset \). It is easy to verify that these sets \(E^{\sigma _2}\), \(E_1^{\sigma _2}\) and \(E_{1,F^{\prime }}^{\sigma _2}\) satisfy Definition-R with a ratio \(r = 1\). Thus, in this subcase, the meta-step \(\sigma _2\) keeps a ratio 1, which is \({<}2\).

Now consider the subcase where neither of a and b is a single-vertex tree in \(F_1^{\prime }\). Let \(e_a\) and \(e_b\) be the edges incident to a and b in \(F_1^{\prime }\) (thus in \(F_1\)), respectively. We have \(E^{\sigma _2} = \{e_a, e_b, e_a^{\prime }, e_b^{\prime }\}\). Let \(E_1^{\sigma _2} = \{e_a, e_b\}\); we show that \(E_1^{\sigma _2}\) satisfies all conditions in Definition-R so that the meta-step \(\sigma _2\) keeps a ratio 2. Obviously, \(E_1^{\sigma _2} \subseteq F_1\). In the forest \(F_1 \setminus E_1^{\sigma _2}\), \(\tau _a\) and \(\tau _b\) are by themselves two connected components. Therefore, no agreement forest for \(\{F_1 \setminus E_1^{\sigma _2}, F_2, \ldots , F_m\}\) can contain an edge in \(E^{\sigma _2} \setminus E_1^{\sigma _2} = \{e_a^{\prime }, e_b^{\prime }\}\). Now let \(F^{\prime }\) be an arbitrary agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). If both \(e_a^{\prime }\) and \(e_b^{\prime }\) in \(F_i\) were in \(F^{\prime }\), then some leaf in \(\tau _a\) and some leaf in \(\tau _b\) would be in the same connected component in \(F^{\prime }\). However, this is impossible because \(\tau _a\) and \(\tau _b\) belong to different connected components in \(F_1\). Therefore, for the agreement forest \(F^{\prime }\) for \(\{F_1, F_2, \ldots , F_m\}\), at least one of \(e_a^{\prime }\) and \(e_b^{\prime }\) is not in \(F^{\prime }\). As a consequence, at least one of the edges \(e_a\) and \(e_b\) in \(F_1\) is not in \(F^{\prime }\). Let \(E_{1,F^{\prime }}^{\sigma _2}\) be the set of edges in \(E_1^{\sigma _2} = \{e_a, e_b\}\) that are not in \(F^{\prime }\); then \(|E_{1,F^{\prime }}^{\sigma _2}| \ge 1 = |E_1^{\sigma _2}|/2\). Finally, since a and b belong to different connected components and are not single-vertex trees in \(F_1^{\prime }\), it is easy to verify that \(E_{1,F^{\prime }}^{\sigma _2}\) is an ee-set for \(F_1^{\prime }\), thus is also an ee-set for \(F_1\). This shows that in this subcase, meta-step \(\sigma _2\) keeps a ratio 2. \(\square \)
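The two subcases of meta-step \(\sigma _2\) can be sketched in Python as follows. This is only an illustration under assumed conventions that do not come from the paper: \(F_1^{\prime }\) is a dictionary mapping each vertex to its parent (`None` for a root), and an edge is named by its child endpoint, so `"a"` stands for the edge incident to the leaf a.

```python
from typing import Dict, Optional, Set, Tuple

Parent = Dict[str, Optional[str]]  # assumed: vertex -> parent, None for a root


def component_root(parent: Parent, v: str) -> str:
    """Follow parent pointers up to the root of v's connected component."""
    while parent[v] is not None:
        v = parent[v]
    return v


def sigma2(f1: Parent, a: str, b: str) -> Tuple[Set[str], Set[str]]:
    """Return (edges removed in F1', edges removed in Fi') for case 2.

    a and b are assumed to be sibling leaves in Fi'. Edges are named by
    the leaf they are incident to.
    """
    if component_root(f1, a) == component_root(f1, b):
        raise ValueError("case 2 needs a and b in different components of F1'")
    # Leaves that are single-vertex trees in F1' have no parent there.
    isolated = {x for x in (a, b) if f1[x] is None}
    if isolated:
        # First subcase: only edges of Fi' are removed.
        return set(), isolated
    # Second subcase: e_a, e_b in F1' and e_a', e_b' in Fi'.
    return {a, b}, {a, b}


f1_isolated = {"a": None, "s": None, "b": "s", "y": "s"}
f1_general = {"r": None, "a": "r", "x": "r", "s": None, "b": "s", "y": "s"}
```

In the first subcase the set removed from \(F_1^{\prime }\) is empty, which corresponds to the choice \(E_1^{\sigma _2} = \emptyset \) in the proof; in the second it has two edges, of which at least one lies outside every agreement forest.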

Case 3 The elements a and b are in the same connected component but are not siblings in \(F_1^{\prime }\).

Let \(P=\{a,c_1,c_2,\ldots ,c_r,b\}\) be the unique path in \(F_1^{\prime }\) that connects a and b, in which \(c_h\) is the least common ancestor of a and b in \(F_1^{\prime }\), \(1 \le h \le r\). Since a and b are not siblings in \(F_1^{\prime }\), \(r \ge 2\). Let \(c_q\) be any non-leaf vertex on the path P with \(c_q \ne c_h\), and let \(e_q\) be the edge incident to \(c_q\) but not on the path P (note that \(F_1^{\prime }\) is binary and irreducible). Let \(e_a\) and \(e_b\) be the edges incident to a and b in \(F_1^{\prime }\), respectively. See Fig. 2 for an illustration.

Fig. 2 The path connecting the labels a and b in \(F_1^{\prime }\)

Meta-step \(\sigma _3\): In case 3, remove the edges \(e_a\), \(e_b\), \(e_q\) in \(F_1^{\prime }\) (thus in \(F_1\)), and remove the edges \(e_a^{\prime }\) and \(e_b^{\prime }\) in \(F_i^{\prime }\) (thus in \(F_i\)) (see Fig. 2).

Lemma 4

Meta-step \(\sigma _3\) keeps a ratio 3 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Proof

First note that the path P is also a path in the X-forest \(F_1\), with the two ends a and b replaced by the roots of the two subtrees \(\tau _a\) and \(\tau _b\). Thus, all edges removed by \(\sigma _3\) are also edges in \(F_1\) and \(F_i\). Again we use the notation in Definition-R. Thus, \(E^{\sigma _3} = \{e_a, e_b, e_q, e_a^{\prime }, e_b^{\prime }\}\). We let \(E_1^{\sigma _3} = \{e_a, e_b, e_q\}\) and show that \(E_1^{\sigma _3}\) satisfies all conditions in Definition-R so that the meta-step \(\sigma _3\) keeps a ratio 3.

Both \(\tau _a\) and \(\tau _b\) by themselves become connected components in the X-forest \(F_1 \setminus E_1^{\sigma _3}\). Thus, a connected component of an agreement forest for \(\{ F_1 \setminus E_1^{\sigma _3}, F_2, \ldots , F_m\}\) either contains leaves only in \(\tau _a\), or contains leaves only in \(\tau _b\), or contains no leaves in \(\tau _a \cup \tau _b\). Therefore, no edge in \(E^{\sigma _3} \setminus E_1^{\sigma _3} = \{e_a^{\prime }, e_b^{\prime }\}\) can be in any agreement forest for \(\{ F_1 \setminus E_1^{\sigma _3}, F_2, \ldots , F_m\}\).

Let \(F^{\prime }\) be any agreement forest for \(\{ F_1, F_2, \ldots , F_m\}\). We have three possible cases:

  (1) The edge \(e_a^{\prime }\) of \(F_i\) is not in \(F^{\prime }\). Then, a connected component of \(F^{\prime }\) either contains only leaves in \(\tau _a\) or contains no leaves in \(\tau _a\). In this case, we can pick \(\{e_a\}\) as the set \(E_{1,F^{\prime }}^{\sigma _3}\), which satisfies: \(E_{1,F^{\prime }}^{\sigma _3} \subseteq E_1^{\sigma _3}\), \(|E_{1,F^{\prime }}^{\sigma _3}| = 1 \ge |E_1^{\sigma _3}|/3\), and the edge \(e_a\) in \(E_{1,F^{\prime }}^{\sigma _3}\) is not in \(F^{\prime }\). Moreover, since \(F_1^{\prime }\) is irreducible and a is not a single-vertex tree in \(F_1^{\prime }\), the set \(E_{1,F^{\prime }}^{\sigma _3}\) is an ee-set for \(F_1^{\prime }\), thus also an ee-set for \(F_1\). Thus, for the agreement forest \(F^{\prime }\) not containing \(e_a^{\prime }\), the set \(E_{1,F^{\prime }}^{\sigma _3} = \{e_a\}\) satisfies all conditions in Definition-R so that the meta-step \(\sigma _3\) keeps a ratio 3.

  (2) The edge \(e_b^{\prime }\) is not in \(F^{\prime }\). Then similarly we let \(E_{1,F^{\prime }}^{\sigma _3} = \{e_b\}\), and can verify that for the agreement forest \(F^{\prime }\) not containing \(e_b^{\prime }\), the set \(E_{1,F^{\prime }}^{\sigma _3} = \{e_b\}\) satisfies all conditions in Definition-R so that the meta-step \(\sigma _3\) keeps a ratio 3.

  (3) Both edges \(e_a^{\prime }\) and \(e_b^{\prime }\) are in \(F^{\prime }\). Since a and b are siblings in \(F_i^{\prime }\), the roots of the subtrees \(\tau _a\) and \(\tau _b\) in \(F_i\) must have a common parent p in \(F^{\prime }\). Since \(F^{\prime }\) is a subgraph of \(F_1\) that must preserve the ancestor–descendant relations in \(F_1\), the vertex \(c_h\) in \(F_1\) must correspond to the vertex p in \(F^{\prime }\). As a consequence, no edge in \(F_1\) that is incident to a vertex \(c_j\) on the path P with \(c_j \ne c_h\) but is not itself on the path P can be in \(F^{\prime }\) (see Fig. 2). In particular, the edge \(e_q\) is not in \(F^{\prime }\). So in this case, we let \(E_{1,F^{\prime }}^{\sigma _3} = \{e_q\}\), and can verify easily that for the agreement forest \(F^{\prime }\) containing both \(e_a^{\prime }\) and \(e_b^{\prime }\), the set \(E_{1,F^{\prime }}^{\sigma _3} = \{e_q\}\) satisfies all conditions in Definition-R so that the meta-step \(\sigma _3\) keeps a ratio 3. Note that the fact that \(E_{1,F^{\prime }}^{\sigma _3}\) is an ee-set for \(F_1\) follows from the irreducibility of the \(X^{\prime }\)-forest \(F_1^{\prime }\) and of the X-forest \(F_1\).

This verifies that the set \(E_1^{\sigma _3}\) satisfies all conditions in Definition-R with \(r = 3\). Thus, the meta-step \(\sigma _3\) keeps a ratio 3. \(\square \)
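How the three edges \(e_a\), \(e_b\), \(e_q\) of meta-step \(\sigma _3\) might be located can be sketched in Python. The representation is an assumption made for illustration only: \(F_1^{\prime }\) is a dictionary mapping each vertex to its parent (`None` for a root), the forest is binary and irreducible, and an edge is named by its child endpoint. The helper `sigma3_edges` is hypothetical, and it picks \(c_q\) as the path vertex adjacent to a when one exists, which is one admissible choice among possibly several.

```python
from typing import Dict, List, Optional, Set

Parent = Dict[str, Optional[str]]  # assumed: vertex -> parent, None for a root


def ancestors(parent: Parent, v: str) -> List[str]:
    """Proper ancestors of v, nearest first."""
    chain = []
    v = parent[v]
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain


def sigma3_edges(f1: Parent, a: str, b: str) -> Set[str]:
    """Edges {e_a, e_b, e_q} removed from F1' in case 3 (child-endpoint names)."""
    up_a, up_b = ancestors(f1, a), ancestors(f1, b)
    in_up_b = set(up_b)
    common = [v for v in up_a if v in in_up_b]
    if not common:
        raise ValueError("a and b are in different components (case 2)")
    lca = common[0]  # the vertex c_h of the path P
    side_a = up_a[:up_a.index(lca)]  # path vertices strictly between a and c_h
    side_b = up_b[:up_b.index(lca)]
    if not side_a and not side_b:
        raise ValueError("a and b are siblings (case 1)")
    # Choose c_q != c_h together with its child that lies on the path P.
    c_q, path_child = (side_a[0], a) if side_a else (side_b[0], b)
    # In a binary forest c_q has exactly one other child; that edge is e_q.
    off_path = [c for c, p in f1.items() if p == c_q and c != path_child]
    return {a, b, off_path[0]}


f1 = {"r": None, "u": "r", "b": "r", "a": "u", "c": "u"}
removed_in_f1 = sigma3_edges(f1, "a", "b")
```

The edges \(e_a^{\prime }\) and \(e_b^{\prime }\) removed from \(F_i^{\prime }\) are simply the edges incident to a and b there, so they are not computed by this sketch.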

Cases 1–3 cover all cases in which a and b are sibling leaves in \(F_i^{\prime }\). If the \(X^{\prime }\)-forest \(F_i^{\prime }\) has no sibling leaves, then it must be in one of the following two cases: (1) \(F_i^{\prime }\) contains no edges; and (2) all connected components of \(F_i^{\prime }\) are single-vertex trees, except one that is a single-edge tree, and the single-edge tree has a root labeled \(\rho \) and a leaf labeled \(a \in X^{\prime }\), \(a \ne \rho \).

Case 4 \(F_i^{\prime }\) contains no edges.

Meta-step \(\sigma _4\): In case 4, remove all edges in \(F_1^{\prime }\).

Lemma 5

Meta-step \(\sigma _4\) keeps a ratio 1 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Proof

In case 4, it is obvious that no edges in \(F_1^{\prime }\), which are also edges in \(F_1\), can be in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). By Remark 2, the meta-step \(\sigma _4\) keeps a ratio 1. \(\square \)

Now we come to our last case.

Case 5 \(F_i^{\prime }\) has a single edge, which makes a single-edge tree rooted at \(\rho \) with a leaf a, \(a \ne \rho \).

Meta-step \(\sigma _5\): In case 5, remove all edges in \(F_1^{\prime }\) except those that are on the path between \(\rho \) and a. If \(F_1^{\prime }\) becomes a collection of single-vertex trees, then also remove the edge \([\rho , a]\) in \(F_i^{\prime }\).

Lemma 6

Meta-step \(\sigma _5\) keeps a ratio 1 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Proof

Let \(\rho \), a, \(b_1\), \(\ldots \), \(b_h\) be the leaves in \(F_i^{\prime }\), where each \(b_i\) is a single-vertex tree in \(F_i^{\prime }\). Let \(\tau _a\) and \(\tau _{b_i}\), \(1 \le i \le h\), be the subtrees in \(F_i\) that correspond to the leaves a and \(b_i\) in \(F_i^{\prime }\), respectively. Then, for each subtree \(\tau _{b_i}\), a connected component of an agreement forest \(F^{\prime }\) for \(\{F_1, F_2, \ldots , F_m\}\) either contains only leaves in \(\tau _{b_i}\) or contains no leaves in \(\tau _{b_i}\). Therefore, if there is an edge \(e_{b_i}\) incident to \(b_i\) in \(F_1^{\prime }\), which is also the edge between the root of \(\tau _{b_i}\) and its parent in \(F_1\), then the edge \(e_{b_i}\) cannot be in \(F^{\prime }\). This observation plus the forced contraction operation shows that the edges that are not on the path between \(\rho \) and a in \(F_1^{\prime }\) cannot be in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). Thus, all edges in \(F_1^{\prime }\) (thus also in \(F_1\)) that are removed by the meta-step \(\sigma _5\) are not in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). Finally, if \(F_1^{\prime }\) becomes a collection of single-vertex trees after \(\sigma _5\) removes edges in \(F_1^{\prime }\) (this is the case when \(\rho \) and a are not in the same connected component in \(F_1^{\prime }\)), then no agreement forest for \(\{F_1, F_2, \ldots , F_m\}\) can have a connected component containing both \(\rho \) and a leaf in \(\tau _a\). Therefore, in this case, the edge \([\rho , a]\) in \(F_i^{\prime }\) (thus in \(F_i\)) cannot be in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). In summary, no edge in the edge set \(E^{\sigma _5}\) removed by the meta-step \(\sigma _5\) can be in an agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). By Remark 2, the meta-step \(\sigma _5\) keeps a ratio 1. \(\square \)
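Meta-step \(\sigma _5\) can be sketched in Python under assumptions that are ours rather than the paper's: \(F_1^{\prime }\) is a dictionary mapping each vertex to its parent (`None` for a root), the vertex \(\rho \) is stored as the root of its component, and an edge is named by its child endpoint. The function returns the edges removed from \(F_1^{\prime }\) together with a flag saying whether the edge \([\rho , a]\) of \(F_i^{\prime }\) must be removed as well.

```python
from typing import Dict, Optional, Set, Tuple

Parent = Dict[str, Optional[str]]  # assumed: vertex -> parent, None for a root


def sigma5(f1: Parent, rho: str, a: str) -> Tuple[Set[str], bool]:
    """Remove every edge of F1' that is not on the path between rho and a."""
    chain = [a]
    while f1[chain[-1]] is not None:
        chain.append(f1[chain[-1]])
    # The path exists only if walking up from a ends at rho.
    keep = set(chain[:-1]) if chain[-1] == rho else set()
    removed = {v for v, p in f1.items() if p is not None and v not in keep}
    # If nothing is kept, F1' becomes a collection of single-vertex trees,
    # and the edge [rho, a] in Fi' is removed too.
    return removed, not keep


same_component = {"rho": None, "u": "rho", "a": "u", "b": "u"}
different_components = {"rho": None, "x": "rho", "s": None, "a": "s", "b": "s"}
```

The sketch does not apply the forced contraction that would follow the edge removals.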

Now we are ready for our main theorem in this section. Suppose \(|X| = n\). Each X-forest has a size (i.e., the number of vertices plus the number of edges) O(n). Therefore, the size of an instance \(\{F_1, F_2, \ldots , F_m\}\) of the rooted Maf problem is \(n_0 = O(n m)\).

Theorem 2

If step 2 of the algorithm Apx-MAF uses the meta-steps \(\sigma _1\)–\(\sigma _5\), then the algorithm Apx-MAF is a 3-approximation algorithm for the rooted Maf problem that runs in time \(O(n_0 \log n_0)\) on an instance of size \(n_0\).

Proof

By Lemmas 2–6, each of the meta-steps \(\sigma _1\)–\(\sigma _5\) keeps a ratio bounded by 3. By Theorem 1, if the algorithm Apx-MAF uses these meta-steps in step 2, and halts on an instance \(\mathcal {I}\) of rooted Maf, then the algorithm produces an agreement forest for the instance \(\mathcal {I}\) whose order is bounded by three times the optimal value for \(\mathcal {I}\). Therefore, to show that the algorithm Apx-MAF is a 3-approximation algorithm for the rooted Maf problem, it suffices to show that on any instance \(\mathcal {I}\) of size \(n_0\), the algorithm Apx-MAF runs in time \(O(n_0 \log n_0)\).

Let the instance \(\mathcal {I}\) be \(\{F_1, F_2, \ldots , F_m\}\). Thus, \(n_0 = O(nm)\). Now fix an i, and consider the processing of \(F_1\) and \(F_i\) in step 2 of the algorithm Apx-MAF. By the algorithm, a meta-step in step 2 is applied on \(F_1\) and \(F_i\) only when \(F_1 \ne F_i\). Under the condition \(F_1 \ne F_i\), it is easy to verify that each of the meta-steps \(\sigma _2\)–\(\sigma _5\) removes at least one edge in \(F_1 \cup F_i\). Therefore, the total number of times these meta-steps can be applied is bounded by O(n). Moreover, each application of the meta-step \(\sigma _1\) shrinks three vertices into a single vertex, in each of \(F_1\) and \(F_i\) (recall that we are operating on \(F_1^{\prime }\) and \(F_i^{\prime }\)). Therefore, the meta-step \(\sigma _1\) can be applied at most O(n) times. Summarizing, we conclude that if the algorithm Apx-MAF uses the meta-steps \(\sigma _1\)–\(\sigma _5\) in step 2, then the total number of times the meta-steps are applied for processing \(F_1\) and \(F_i\) for each i in the execution of step 2 is O(n).

It is easy to see that each meta-step can be implemented to run in time O(n), which then directly gives a simple \(O(n_0^2)\)-time implementation of the algorithm Apx-MAF. In the following, we explain how the running time of the algorithm can be further improved to \(O(n_0 \log n_0)\).

Each of the meta-steps \(\sigma _4\) and \(\sigma _5\) is applied at most once in step 2 for processing \(F_1\) and \(F_i\). Each of the meta-steps \(\sigma _1\)–\(\sigma _3\) is called on two sibling leaves in the forest \(F_i\). Therefore, step 2 of the algorithm can be implemented by a depth-first search on the forest \(F_i\), which continuously presents siblings in \(F_i\) for possible applications of the meta-steps \(\sigma _1\)–\(\sigma _3\), until the meta-steps \(\sigma _4\) and \(\sigma _5\) become applicable. This depth-first search process, without counting the complexity of the calls to the meta-steps, runs in time O(n).

The meta-steps \(\sigma _1\)–\(\sigma _3\) also require an efficient way to determine whether two leaves are in the same connected component in \(F_1\). Note that the connected component structure of \(F_1\) is dynamically changing, in particular when the meta-step \(\sigma _3\) removes the edge \(e_q\) that can be connected to a non-trivial subtree (see Fig. 2). For this, we can organize the leaves in \(F_1\) in depth-first search order so that all leaves in a subtree appear in a consecutive segment. Such a sequence then can be stored in a 2-3 tree that supports logarithmic-time insertion, deletion, splice, and split [1]. Based on this data structure, the connected component structure of \(F_1\) can be dynamically updated in time \(O(\log n)\) for each application of the meta-steps \(\sigma _1\)–\(\sigma _3\).
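The property that the leaf ordering relies on can be checked with a short Python sketch. It assumes a representation chosen for illustration only: a tree is a dictionary mapping each non-leaf vertex to the ordered list of its children. A plain Python list stands in for the 2-3 tree here, so cutting out a segment costs linear time in this sketch, whereas the 2-3 tree described in the text performs the corresponding split in logarithmic time.

```python
from typing import Dict, List, Tuple

Children = Dict[str, List[str]]  # assumed: non-leaf vertex -> ordered children


def leaf_order(children: Children, root: str) -> List[str]:
    """Leaves below root in depth-first (left-to-right) order."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if kids:
            stack.extend(reversed(kids))  # leftmost child is visited first
        else:
            order.append(v)
    return order


def subtree_segment(children: Children, root: str, v: str) -> Tuple[int, int]:
    """Half-open index range [start, end) of the leaves below v."""
    order = leaf_order(children, root)
    below = leaf_order(children, v)
    start = order.index(below[0])
    return start, start + len(below)


tree = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
order = leaf_order(tree, "r")
start, end = subtree_segment(tree, "r", "w")
# Removing the edge above w splits this segment off as a new component.
component_leaves = order[start:end]
```

Because the leaves of every subtree occupy a contiguous range, removing an edge such as \(e_q\) corresponds to splitting one range out of the sequence, which is the operation the balanced search tree is used for.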

With the above implementations, we conclude that each of the meta-steps \(\sigma _1\)–\(\sigma _3\) takes time \(O(\log n)\), thus, for a given i, the running time of step 2 of the algorithm is \(O(n \log n)\). Also note that step 3 of the algorithm Apx-MAF is actually “virtual”, for which we can, without doing any real computation, simply record that \(F_1 = F_j\) for all \(1 \le j \le i\). As a consequence, the total running time of the algorithm Apx-MAF is bounded by \(O( n \log n \cdot m) = O(n_0 \log n_0)\), where \(n_0 = O(nm)\) is the size of the input instance \(\mathcal {I}\). \(\square \)

If the original input of our algorithm is a collection of X-trees, then the algorithm Apx-MAF will return an agreement forest for the trees. Thus, the algorithm Apx-MAF is a 3-approximation algorithm for the standard Maximum Agreement Forest problem on multiple rooted binary phylogenetic trees.

5 Meta-steps for Unrooted X-Forests

For the unrooted Maf problem, the meta-steps used in step 2 of the algorithm Apx-MAF and their analysis proceed in a manner similar to those for rooted Maf. However, since an unrooted X-tree enforces no ancestor–descendant relation in the tree, subforests in the X-tree have no requirement of preserving such a relation. This fact induces certain subtle differences. As a starting point, note that in an irreducible unrooted X-forest, every non-leaf has degree 3.

Again, for each execution of step 2 of the algorithm Apx-MAF, we are given a fixed integer \(i > 1\) and an instance \({\mathcal {I}} = \{F_1, F_2, \ldots , F_m\}\) of the unrooted Maf problem, which is a collection of unrooted X-forests, with \(F_1 = F_2 = \cdots = F_{i-1}\), and, as long as \(F_1 \ne F_i\), meta-steps are applied on \(F_1\) and \(F_i\). We present the meta-steps and show that these meta-steps keep a ratio bounded by 4. Suppose, without loss of generality, that both \(F_1\) and \(F_i\) are irreducible.

Two leaves a and b of an unrooted X-forest F are edge-siblings if they are the two leaves of a single-edge tree in F, and are vertex-siblings if they are adjacent to the same non-leaf vertex p in F (in this case, the vertex p is called the “parent” of a and b). The leaves a and b are siblings if they are either vertex-siblings or edge-siblings.
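The two kinds of siblings can be told apart with a small Python sketch. The adjacency-set representation and the function name `sibling_kind` are assumptions introduced for illustration; they are not part of the paper.

```python
from typing import Dict, Optional, Set

Adjacency = Dict[str, Set[str]]  # assumed: vertex -> set of neighbours


def sibling_kind(adj: Adjacency, a: str, b: str) -> Optional[str]:
    """Return "edge", "vertex", or None for two leaves of an unrooted forest."""
    if a == b:
        return None
    if adj[a] == {b} and adj[b] == {a}:
        return "edge"  # a and b form a single-edge tree
    if len(adj[a]) == 1 and adj[a] == adj[b]:
        return "vertex"  # both are adjacent to the same non-leaf "parent"
    return None


forest = {
    "a": {"p"}, "b": {"p"}, "c": {"p"}, "p": {"a", "b", "c"},
    "x": {"y"}, "y": {"x"},
}
```

In the second test, the common neighbour necessarily has degree at least 2, so it is a non-leaf vertex as the definition requires.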

Again for two leaves a and b that are vertex-siblings in both \(F_1\) and \(F_i\), we can replace the subtree consisting of a, b, and their parent with a single leaf labeled \(\underline{ab}\). Similarly, for two leaves a and b that are edge-siblings in both \(F_1\) and \(F_i\), we can replace the single-edge tree [ab] with a single-vertex tree labeled \(\underline{ab}\). We will call the above operations on vertex-siblings and edge-siblings “shrinking the siblings a and b into a single leaf \(\underline{ab}\).” Again because of this, we will use two \(X^{\prime }\)-forests \(F_1^{\prime }\) and \(F_i^{\prime }\) for some label-set \(X^{\prime }\), instead of the X-forests \(F_1\) and \(F_i\), in the description of our meta-steps in step 2 of the algorithm Apx-MAF, where each element in the label-set \(X^{\prime }\) is a collection of elements in the label-set X structured to represent a subtree of \(F_1\) and \(F_i\). In other words, the \(X^{\prime }\)-forests \(F_1^{\prime }\) and \(F_i^{\prime }\) are just the X-forests \(F_1\) and \(F_i\) with certain subtrees shrunk into single leaves. In particular, edges and non-leaf vertices of \(F_1^{\prime }\) and \(F_i^{\prime }\), respectively, are also edges and non-leaf vertices of \(F_1\) and \(F_i\).

In the discussions on cases 1–3 below, we assume that the irreducible \(X^{\prime }\)-forest \(F_i^{\prime }\) has two leaves a and b that are siblings (either edge-siblings or vertex-siblings). Let \(\tau _a\) and \(\tau _b\) be the subtrees in both \(F_1\) and \(F_i\) that correspond to the leaves a and b in \(F_1^{\prime }\) and \(F_i^{\prime }\), respectively. Let \(e_a^{\prime }\) and \(e_b^{\prime }\) be the edges in \(F_i^{\prime }\) that are incident to a and b, respectively (if a and b are edge-siblings, then \(e_a^{\prime } = e_b^{\prime }\)). Note that \(e_a^{\prime }\) and \(e_b^{\prime }\) are the edges in \(F_i\) that are not in \(\tau _a \cup \tau _b\) but are incident to the roots of \(\tau _a\) and \(\tau _b\), respectively.

Case 1 The elements a and b are also siblings in \(F_1^{\prime }\).

Meta-step \(\omega _1\): In case 1, shrink a and b into a single leaf \(\underline{ab}\) in both \(F_1^{\prime }\) and \(F_i^{\prime }\). If \(\underline{ab}\) is a single-vertex tree in exactly one of \(F_1^{\prime }\) and \(F_i^{\prime }\), then remove the edge incident to \(\underline{ab}\) in the other.

Lemma 7

Meta-step \(\omega _1\) keeps a ratio 1 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Proof

If a and b are either edge-siblings in both \(F_1^{\prime }\) and \(F_i^{\prime }\), or vertex-siblings in both \(F_1^{\prime }\) and \(F_i^{\prime }\), then after shrinking a and b into \(\underline{ab}\), the vertex \(\underline{ab}\) either is a single-vertex tree in both \(F_1^{\prime }\) and \(F_i^{\prime }\), or is a single-vertex tree in neither of \(F_1^{\prime }\) and \(F_i^{\prime }\). Therefore, in this case, the meta-step \(\omega _1\) removes no edges in \(F_1^{\prime }\) and \(F_i^{\prime }\), thus also removes no edges in \(F_1\) and \(F_i\). As we discussed for the meta-step \(\sigma _1\) in Lemma 2, in this case, the meta-step \(\omega _1\) keeps a ratio 1.

Now suppose that a and b are edge-siblings in exactly one of \(F_1^{\prime }\) and \(F_i^{\prime }\). Without loss of generality, suppose that a and b are edge-siblings in \(F_1^{\prime }\) but are vertex-siblings in \(F_i^{\prime }\). Let \(e_0^{\prime }\) be the edge incident to the parent of a and b in \(F_i^{\prime }\) such that \(e_0^{\prime } \ne e_a^{\prime }\) and \(e_0^{\prime } \ne e_b^{\prime }\). Because of the single-edge tree [ab] in \(F_1^{\prime }\), no connected component of an agreement forest \(F^{\prime }\) for \(\{F_1, F_2, \ldots , F_m\}\) can contain both leaves in \(\tau _a \cup \tau _b\) and leaves not in \(\tau _a \cup \tau _b\). Therefore, the edge \(e_0^{\prime }\) in \(F_i^{\prime }\) cannot be in \(F^{\prime }\) and can be removed. After removing \(e_0^{\prime }\) from \(F_i^{\prime }\) and applying a forced contraction, a and b become edge-siblings in \(F_i^{\prime }\), thus we can shrink a and b in both \(F_1^{\prime }\) and \(F_i^{\prime }\). Note that this is equivalent to the meta-step \(\omega _1\) that first shrinks the edge-siblings a and b in \(F_1^{\prime }\) and the vertex-siblings a and b in \(F_i^{\prime }\), then removes the edge incident to \(\underline{ab}\) in \(F_i^{\prime }\) (which is just \(e_0^{\prime }\)). In fact, in this case, the meta-step \(\omega _1\) only removes an edge (i.e., \(e_0^{\prime }\)) that is not in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). By Remark 2, in this case, the meta-step \(\omega _1\) also keeps a ratio 1. \(\square \)

Case 2 The elements a and b are in different connected components in \(F_1^{\prime }\).

Meta-step \(\omega _2\): In case 2, if at least one of a and b is a single-vertex tree in \(F_1^{\prime }\), then remove the edge(s) in \(F_i^{\prime }\) that are incident to the corresponding leaves (a or b or both) that are single-vertex trees in \(F_1^{\prime }\); otherwise, remove in both \(F_1^{\prime }\) and \(F_i^{\prime }\) the edges incident to a and b.

Lemma 8

Meta-step \(\omega _2\) keeps a ratio 2 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Proof

The proof for this lemma is similar to that for Lemma 3. If an element in \(\{a, b\}\) is a single-vertex tree in \(F_1^{\prime }\), then the edge incident to that element in \(F_i^{\prime }\) cannot be in any agreement forest for \(\{F_1, \ldots , F_m\}\). Thus, by Remark 2, in this subcase, the meta-step \(\omega _2\) keeps a ratio 1.

Now assume that neither of a and b is a single-vertex tree in \(F_1^{\prime }\). Let \(e_a\) and \(e_b\) be the two edges in \(F_1^{\prime }\) that are incident to a and b, respectively. Note that even though it is possible that \(e_a^{\prime } = e_b^{\prime }\), we must have \(e_a \ne e_b\) because a and b are in different connected components in \(F_1^{\prime }\).

Using the notations in Definition-R, \(E^{\omega _2} = \{e_a, e_b, e_a^{\prime }, e_b^{\prime }\}\). Let \(E_1^{\omega _2} = \{e_a, e_b\}\). The proof that the set \(E_1^{\omega _2}\) satisfies all conditions in Definition-R for the meta-step \(\omega _2\) to keep a ratio 2 is exactly the same as that for the corresponding subcase in the proof of Lemma 3, regardless of whether \(e_a^{\prime } = e_b^{\prime }\). Therefore, we conclude that the meta-step \(\omega _2\) keeps a ratio 2. \(\square \)

Case 3 The elements a and b are in the same connected component in \(F_1^{\prime }\), but are not siblings.

This case is the one that is most different from its corresponding case for the rooted version. Let \(P = \{a, c_1, c_2, \ldots , c_r, b\}\) be the unique path in the unrooted \(X^{\prime }\)-forest \(F_1^{\prime }\) that connects a and b, where \(r \ge 2\) because we assume that a and b are neither edge-siblings nor vertex-siblings. Let \(e_a\) and \(e_b\) be the edges in \(F_1^{\prime }\) that are incident to a and b, respectively. Moreover, let \(e_1\) be the edge in \(F_1^{\prime }\) that is incident to \(c_1\) but not on the path P, and let \(e_r\) be the edge in \(F_1^{\prime }\) that is incident to \(c_r\) but not on the path P. See Fig. 3 for an illustration. Note that since \(F_1^{\prime }\) is irreducible, \(e_a\), \(e_b\), \(e_1\), and \(e_r\) are four well-defined distinct edges in \(F_1^{\prime }\) (thus also in \(F_1\)). Moreover, the path P is also a path in \(F_1\) in which the two ends a and b are replaced by the roots of the two subtrees \(\tau _a\) and \(\tau _b\) of \(F_1\), respectively.

Fig. 3 The path connecting the elements a and b in \(F_1^{\prime }\)

Meta-step \(\omega _3\): In case 3, remove the edges \(e_a\), \(e_b\), \(e_1\), \(e_r\) in \(F_1^{\prime }\), and the edges \(e_a^{\prime }\), \(e_b^{\prime }\) in \(F_i^{\prime }\).
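As an illustration of how the four edges \(e_a\), \(e_b\), \(e_1\), \(e_r\) can be located, the following sketch finds the path P by breadth-first search and reads the edges off its two ends. The adjacency-dictionary representation and the names `path_between` and `omega3_edges_f1` are assumptions made for this example only; the sketch covers the edges in \(F_1^{\prime }\) and leaves out the removal of \(e_a^{\prime }\), \(e_b^{\prime }\) in \(F_i^{\prime }\).

```python
from collections import deque


def path_between(forest, a, b):
    """Return the unique path from a to b as a list of vertices, or None."""
    parent = {a: None}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            break
        for v in forest[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if b not in parent:
        return None
    path = [b]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def omega3_edges_f1(f1, a, b):
    """Edges e_a, e_b, e_1, e_r of F_1' for leaves a, b in case 3.

    Assumes f1 is an irreducible binary forest, so c_1 and c_r have degree 3.
    """
    path = path_between(f1, a, b)
    if path is None or len(path) < 4:
        raise ValueError("case 3 needs a and b in one component with r >= 2")
    c1, cr = path[1], path[-2]
    # The third neighbour of c_1 (resp. c_r) is the one not on the path.
    (x1,) = f1[c1] - {path[0], path[2]}
    (xr,) = f1[cr] - {path[-1], path[-3]}
    return {frozenset((a, c1)), frozenset((b, cr)),
            frozenset((c1, x1)), frozenset((cr, xr))}


# Example with r = 2: the path is a, c1, c2, b, with x hanging off c1 and y off c2.
f1 = {"a": {"c1"}, "c1": {"a", "c2", "x"}, "c2": {"c1", "b", "y"},
      "b": {"c2"}, "x": {"c1"}, "y": {"c2"}}
edges = omega3_edges_f1(f1, "a", "b")
```

The length check mirrors the requirement \(r \ge 2\): with fewer than two internal vertices on the path, a and b would be siblings and case 1 would apply instead.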

Lemma 9

Meta-step \(\omega _3\) keeps a ratio 4 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).

Proof

Using the notations in Definition-R, we have (note that it is possible that \(e_a^{\prime } = e_b^{\prime }\)) \(E^{\omega _3} = \{e_a, e_b, e_1, e_r, e_a^{\prime }, e_b^{\prime }\}\). Let \(E_1^{\omega _3} = \{e_a, e_b, e_1, e_r\}\). We show that the set \(E_1^{\omega _3}\) satisfies all conditions in Definition-R for the meta-step \(\omega _3\) to keep a ratio 4. First note that \(|E_1^{\omega _3}| = 4\) regardless of whether \(e_a^{\prime } = e_b^{\prime }\).

In the \(X^{\prime }\)-forest \(F_1^{\prime } \setminus E_1^{\omega _3}\), both a and b become single-vertex trees. Thus, both subtrees \(\tau _a\) and \(\tau _b\) by themselves become connected components in the X-forest \(F_1 \setminus E_1^{\omega _3}\). As a consequence, a connected component of an agreement forest for \(\{F_1 \setminus E_1^{\omega _3}, F_2, \ldots , F_m\}\) either contains leaves only in \(\tau _a\), or contains leaves only in \(\tau _b\), or contains no leaves in \(\tau _a \cup \tau _b\). Therefore, no edge in \(E^{\omega _3} \setminus E_1^{\omega _3} = \{e_a^{\prime }, e_b^{\prime }\}\) can be in any agreement forest for \(\{ F_1 \setminus E_1^{\omega _3}, F_2, \ldots , F_m\}\). Note that this holds true no matter whether \(e_a^{\prime } = e_b^{\prime }\).

Let \(F^{\prime }\) be any agreement forest for \(\{ F_1, F_2, \ldots , F_m\}\). We have three possible cases:

  (1) The edge \(e_a^{\prime }\) is not in \(F^{\prime }\). In this case, no connected component in \(F^{\prime }\) can contain both leaves in \(\tau _a\) and leaves not in \(\tau _a\). So we can pick \(\{e_a\}\) as the set \(E_{1,F^{\prime }}^{\omega _3}\), which satisfies: \(E_{1,F^{\prime }}^{\omega _3} \subseteq E_1^{\omega _3}\), \(|E_{1,F^{\prime }}^{\omega _3}| = 1 \ge |E_1^{\omega _3}|/4\), and the edge \(e_a\) in \(E_{1,F^{\prime }}^{\omega _3}\) is not in \(F^{\prime }\). Moreover, since \(F_1^{\prime }\) is irreducible, the set \(E_{1,F^{\prime }}^{\omega _3}\) is an ee-set for \(F_1^{\prime }\), and thus also an ee-set for \(F_1\). Thus, for the agreement forest \(F^{\prime }\) for \(\{ F_1, F_2, \ldots , F_m\}\) that does not contain \(e_a^{\prime }\), the set \(E_{1,F^{\prime }}^{\omega _3} = \{e_a\}\) satisfies all conditions in Definition-R for the meta-step \(\omega _3\) to keep a ratio 4.

  (2) The edge \(e_b^{\prime }\) is not in \(F^{\prime }\). Then similarly we can let \(E_{1,F^{\prime }}^{\omega _3} = \{e_b\}\), and verify that for the agreement forest \(F^{\prime }\) for \(\{ F_1, F_2, \ldots , F_m\}\) that does not contain \(e_b^{\prime }\), the set \(E_{1,F^{\prime }}^{\omega _3} = \{e_b\}\) satisfies all conditions in Definition-R for the meta-step \(\omega _3\) to keep a ratio 4.

    If \(e_a^{\prime } = e_b^{\prime }\) and the edge is not in \(F^{\prime }\), then we can apply either (1) or (2) above to obtain a set \(E_{1,F^{\prime }}^{\omega _3}\) that satisfies all conditions in Definition-R for the meta-step \(\omega _3\) to keep a ratio 4.

  (3) The agreement forest \(F^{\prime }\) contains both edges \(e_a^{\prime }\) and \(e_b^{\prime }\), which includes the case where \(F^{\prime }\) contains \(e_a^{\prime } = e_b^{\prime }\). In this case, because a and b are siblings in \(F_i^{\prime }\), in the X-forest \(F_i\) there must be a leaf \(l_a\) in the subtree \(\tau _a\) and a leaf \(l_b\) in the subtree \(\tau _b\), where \(l_a, l_b \in X\), such that \(l_a\) and \(l_b\) are in the same connected component in \(F_i\). Observe that because a and b are siblings in \(F_i^{\prime }\), the path between the roots of the two subtrees \(\tau _a\) and \(\tau _b\) in \(F_i\) can contain at most one non-leaf vertex in \(F_i\). Since \(F^{\prime }\) is a subgraph of \(F_i\), the path between \(l_a\) and \(l_b\) in \(F^{\prime }\) contains at most one non-leaf vertex that is not in \(\tau _a \cup \tau _b\). Since both vertices \(c_1\) and \(c_r\) in \(F_1\) are on the path between \(l_a\) and \(l_b\) and are not in \(\tau _a \cup \tau _b\), at most one of \(c_1\) and \(c_r\) can become a non-leaf vertex in \(F^{\prime }\). As a consequence, at most one of the two edges \(e_1\) and \(e_r\) in \(F_1^{\prime }\) can be in \(F^{\prime }\) (see Fig. 3). Note that \(e_1\) and \(e_r\) are also edges in \(F_1\). Thus, if we let \(E_{1,F^{\prime }}^{\omega _3}\) be the set of edges in \(\{e_1, e_r\}\) that are not in \(F^{\prime }\), then the set \(E_{1,F^{\prime }}^{\omega _3}\) satisfies: \(E_{1,F^{\prime }}^{\omega _3} \subseteq E_1^{\omega _3}\), \(|E_{1,F^{\prime }}^{\omega _3}| \ge 1 = |E_1^{\omega _3}|/4\), and the edges in \(E_{1,F^{\prime }}^{\omega _3}\) are not in \(F^{\prime }\). Moreover, it is not difficult to verify that \(\{e_1, e_r\}\) is an ee-set for \(F_1^{\prime }\) (thus an ee-set for \(F_1\)). Since a subset of an ee-set is also an ee-set, the set \(E_{1,F^{\prime }}^{\omega _3}\) is also an ee-set for \(F_1\). Thus, in this case, the set \(E_{1,F^{\prime }}^{\omega _3}\) defined in this way satisfies all conditions in Definition-R for the meta-step \(\omega _3\) to keep a ratio 4.

This verifies that the set \(E_1^{\omega _3}\) satisfies all conditions in Definition-R for the meta-step \(\omega _3\) to keep a ratio 4. Thus, the meta-step \(\omega _3\) keeps a ratio 4. \(\square \)

Cases 1–3 cover all cases in which the \(X^{\prime }\)-forest \(F_i^{\prime }\) contains siblings. If \(F_i^{\prime }\) contains no siblings, then \(F_i^{\prime }\) contains no edges. This case is handled by the following meta-step.

Case 4 \(F_i^{\prime }\) contains no edges.

Meta-step \(\omega _4\): In case 4, remove all edges in \(F_1^{\prime }\).

It is rather easy to see that the meta-step \(\omega _4\) removes no edge in any agreement forest for \(\{F_1, F_2, \ldots , F_m\}\). By Remark 2, we have

Lemma 10

Meta-step \(\omega _4\) keeps a ratio 1 on a given instance \(\{F_1, F_2, \ldots , F_m\}\).
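The way Cases 1–4 partition the possible situations can be summarized in a short dispatch sketch. It assumes the same hypothetical adjacency-dictionary representation of a forest, and it takes two leaves to be edge-siblings when they are adjacent and vertex-siblings when they share a neighbour; the function names are illustrative and are not taken from the paper.

```python
def component_of(forest, start):
    """Return the vertex set of the connected component containing start."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in forest[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def are_siblings(forest, a, b):
    """True if leaves a and b are edge-siblings or vertex-siblings."""
    return b in forest[a] or bool(forest[a] & forest[b])


def which_case(f1, fi, a=None, b=None):
    """Return 1, 2, 3 or 4 for the case that applies.

    a and b are assumed to be siblings in fi whenever fi has an edge.
    """
    if not any(fi.values()):
        return 4  # F_i' contains no edges
    if are_siblings(f1, a, b):
        return 1  # siblings in F_1' as well
    if b not in component_of(f1, a):
        return 2  # different connected components in F_1'
    return 3      # same component in F_1', but not siblings


# Example: a and b are vertex-siblings in F_i' but three edges apart in F_1'.
fi = {"a": {"p"}, "b": {"p"}, "p": {"a", "b", "z"}, "z": {"p"}}
f1 = {"a": {"c1"}, "c1": {"a", "c2", "x"}, "c2": {"c1", "b", "y"},
      "b": {"c2"}, "x": {"c1"}, "y": {"c2"}}
case = which_case(f1, fi, "a", "b")
```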

Now we are ready for our main theorem in this section.

Theorem 3

If step 2 of the algorithm Apx-MAF uses the meta-steps \(\omega _1\)–\(\omega _4\), then the algorithm Apx-MAF is a 4-approximation algorithm for the unrooted Maf problem with running time \(O(n_0 \log n_0)\), where \(n_0\) is the size of the input instance.

Proof

By Lemmas 7–10, each of the meta-steps \(\omega _1\)–\(\omega _4\) keeps a ratio bounded by 4. Thus, by Theorem 1, in order to prove that Apx-MAF is a 4-approximation algorithm for the unrooted Maf problem, it suffices to prove that the algorithm, when using the meta-steps \(\omega _1\)–\(\omega _4\) in its step 2, runs in time \(O(n_0 \log n_0)\) on an instance of size \(n_0\) of the unrooted Maf problem.

Suppose that the given instance of the unrooted Maf problem is \(\{F_1, F_2, \ldots , F_m\}\), where each \(F_h\) is an unrooted X-forest, and \(|X| = n\). Thus, \(n_0 = O(n m)\). According to the algorithm Apx-MAF, meta-steps in the i-th execution of step 2 are applied on \(F_1^{\prime }\) and \(F_i^{\prime }\) (thus on \(F_1\) and \(F_i\)) only when \(F_1^{\prime } \ne F_i^{\prime }\). When \(F_1^{\prime } \ne F_i^{\prime }\), each of the meta-steps \(\omega _2\)–\(\omega _4\) removes at least one edge in \(F_1^{\prime } \cup F_i^{\prime }\) (thus in \(F_1 \cup F_i\)). Therefore, the total number of times these meta-steps are applied is bounded by O(n). Moreover, similar to our discussion in Theorem 2, each application of the meta-step \(\omega _1\) reduces the number of vertices in \(F_1^{\prime }\) and \(F_i^{\prime }\) by at least 2. Therefore, the meta-step \(\omega _1\) can be applied at most O(n) times.

In summary, each execution of step 2 of the algorithm Apx-MAF, when using the meta-steps \(\omega _1\)–\(\omega _4\), applies at most O(n) meta-steps. Using data structures similar to those we used in Theorem 2, each of the meta-steps \(\omega _1\)–\(\omega _4\) can be implemented to run in time \(O(\log n)\). This then leads to an \(O(n_0 \log n_0)\)-time implementation of the algorithm Apx-MAF for the unrooted Maf problem, where \(n_0\) is the size of the input instance. \(\square \)
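The bound can be read off as a product of three factors. The display below is a sketch under two assumptions that are not restated in this section: that step 2 is executed once for each \(i = 2, \ldots , m\), and that every forest \(F_h\) stores all n labels, so that \(nm = O(n_0)\) holds in addition to \(n_0 = O(nm)\).

```latex
\[
  \underbrace{(m-1)}_{\text{executions of step 2}} \cdot
  \underbrace{O(n)}_{\text{meta-steps per execution}} \cdot
  \underbrace{O(\log n)}_{\text{time per meta-step}}
  \;=\; O(nm \log n) \;=\; O(n_0 \log n_0).
\]
```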

If the original input of the algorithm is a collection of unrooted X-trees, then the algorithm Apx-MAF will return an agreement forest for the trees. In this case, the algorithm Apx-MAF is a 4-approximation algorithm for the standard Maximum Agreement Forest problem on multiple unrooted binary phylogenetic trees.

6 Conclusion and Future Research

In this paper, we presented two polynomial-time approximation algorithms for the Maf problem on multiple binary phylogenetic trees: one for rooted trees with a ratio 3 and the other for unrooted trees with a ratio 4. The 3-approximation algorithm for rooted trees is an improvement over the previous best approximation algorithm for the problem due to Chataigner [8], which has a ratio 8 (see Footnote 3), and the 4-approximation algorithm for unrooted trees is, to the best of our knowledge, the first constant ratio approximation algorithm for the problem.

As suggested by Whidden et al. [25] in their recent publication in SIAM J. Comput., “The most important open problem is extending our approach to computing MAFs and MAAFs for multifurcating trees and for more than two trees.” Our result is a response to this call and makes an important step in this direction. We believe that our general framework, the algorithm Apx-MAF, will have further applications in the study of approximation algorithms for the Maf problem. Indeed, by Theorem 1, any further improvement on the ratio of the meta-steps will directly lead to improvements in the corresponding approximation algorithms. Moreover, by combining our general framework and related techniques presented in the current paper with the techniques developed recently by Chen et al. [9] for multifurcating trees (i.e., general trees, instead of only binary trees), we believe that we should also be able to develop approximation algorithms for the Maf problem on multiple multifurcating trees.

Further improvements on the approximation ratio of polynomial-time approximation algorithms for the Maf problem, either for two binary or multifurcating phylogenetic trees, or more generally for multiple binary or multifurcating phylogenetic trees, are certainly desired. In particular, the best approximation algorithm for the Maf problem on two unrooted binary X-trees has a ratio 3 [24], while our approximation algorithm for the problem on multiple unrooted binary X-trees has a ratio 4 (Theorem 3). The disparity appears because our meta-step \(\omega _3\) has a ratio 4 (Lemma 9), while in handling the same situation for two unrooted binary X-trees, the algorithm proposed in [24] is able to limit the number of removed edges to 3, instead of 4 (see Theorem 6 in [24]). Unfortunately, the operation described in [24] cannot be easily translated into an efficient meta-step. In fact, a direct translation of the operation given in [24] will result in a meta-step that does not guarantee any positive ratio. This is also the main reason why our algorithm has an extra \(\log n\) factor in its time complexity. It will be interesting to see how this gap can be closed, either by strengthening the definition of the meta-step metric or by developing new algorithmic techniques.

Approximation algorithms for the Maf problem, in particular approximation algorithms for the SPR distance, have been used in a branch-and-bound fashion to quickly compute the exact SPR distance for two phylogenetic trees [26, 27]. Our methods and formulations presented in the current paper may be used in the same fashion for the multiple tree problem.

Alongside the research in approximation algorithms for the Maf problem, there is also an active line of research on parameterized algorithms for the problem [2, 9, 10, 13, 20, 22, 24, 25]. In particular, work has been done on parameterized algorithms for the Maf problem on multiple binary trees [10, 20]. The parameterized algorithms in [10, 20] and the approximation algorithms presented in the current paper share a common idea of using sibling pairs in one tree to identify edges in the other trees that may potentially cause inconsistency. The parameterized algorithms, which are based on a branch-and-search process, then branch on removing each of these potentially inconsistent edges, while the approximation algorithms simply remove all these potentially inconsistent edges. However, it seems that the analysis for the parameterized algorithms based on this idea is much easier: the algorithms only need to ensure that at least one branch in the branch-and-search process traces an optimal solution. On the other hand, after removing all potentially inconsistent edges, it becomes much more difficult to characterize the optimal solutions in the resulting instance. Thus, among the removed edges, we have to identify “irrelevant edges”, and find a more accurate way to compute the ratio of the number of “essential edges” over the number of “correct edges”. In particular, simply counting the number of removed edges and the number of “correctly” removed edges might give a very loose estimate of the resulting approximation ratio of the algorithms. This difference has forced us to build a very different model to enable a more precise analysis of approximation algorithms for the Maf problem on multiple trees.