Introduction

Different phylogenetic studies derive different tree topologies for a given set of Operational Taxonomic Units (OTUs). Hence, measuring the numerical dissimilarity of a derived phylogenetic tree with respect to the known hypotheses is one of the most fundamental tasks in phylogenetic research (Bogdanowicz and Giaro 2011; Goluch et al. 2020; Sul et al. 2009). Several methods have already been proposed to compute the accuracy of derived phylogenetic trees. These methods assess the correctness of a tree by comparing it with a reference tree topology, which they strictly require. Popular methods include the Robinson–Foulds (RF) distance (Robinson and Foulds 1981), Matching Split (MS) distance (Bogdanowicz and Giaro 2011), Nodal Split (NS) distance (Cardona et al. 2010), Triples (TT) distance (Critchlow et al. 1996), and Geodesic Treepath (GTP) distance (Billera et al. 2001; Owen and Provan 2010). However, all these methods share significant limitations, which arise in the following situations:

Case 1 (unresolved relations) A full reference tree for the selected OTUs may not be available because the precise relationships among some of them are not known. In such cases, we assume multifurcation among the OTUs whose evolutionary relationships are unclear. Existing tree comparison methods are very rigid about the furcation of the internal nodes in both the reference tree and the target tree. As a result, they perform poorly when the reference tree has a multifurcated node whereas the same set of species shows bifurcation in the target tree (for details, please refer to Supplementary Sec. 1).

Case 2 (unavailability of an absolute reference tree) Expansion of the phylogenetic study by introducing newly discovered species is necessary for the success of the Tree of Life Web Project (Goldstein 2010). Moreover, the limited availability of sequence data is a significant obstacle to obtaining an absolute reference tree for a target set of OTUs. Nevertheless, the existing tree comparison measures are applicable only when the reference and the target trees have the same set of OTUs. To ensure that, the conventional methods either prune the OTUs present in one tree but not in the other, or expand the tree(s) by inserting the missing leaf (or leaves). Hence, the existing methods may give inappropriate results when the target tree has a few more species than the reference tree (for details, please refer to Supplementary Sec. 1).

Due to these limitations of the existing measures, most of the recent phylogenetic studies discarded the conventional tree comparison methods and measured the accuracy of their derived trees using the following methods:

  a. Manual inspection: The correctness of the derived tree is checked manually by validating each clade against existing biological knowledge (Li et al. 2017). However, this technique does not generate any quantitative measure, and manual inspection is very subjective.

  b. Comparison-based method: In this approach, a phylogenetic tree is built by applying widely accepted, established methods to the same set of OTUs. This tree is then treated as the reference tree and compared with the target tree derived from the corresponding dataset (Yin and Yau 2015). However, this approach treats the established methods as the gold standard, although they are not beyond contention.

  c. Indirect method: This approach computes different errors, such as the root mean square error of distance matrices (Xie et al. 2018; Zheng et al. 2019). It therefore overlooks the accuracy of the final tree construction steps and ignores the correctness of the final derived tree. Moreover, some methods, such as maximum likelihood and Bayesian estimation, derive trees directly from the sequences without generating any distance matrix. For these cases, the indirect method is not applicable.

In this regard, we propose a novel measure, called the deformity index, to quantify the dissimilarity of a phylogenetic tree based on the presence of different clades in the reference tree. The computation of the deformity index only depends on the clades acquired from the reference tree. So the precise relationships among the species within a clade do not affect its computation. Hence, the presence of both bifurcation and multifurcation in the target tree does not influence the measure. Deformity index also handles the existence of extra species or OTUs in the target tree appropriately. Additionally, this technique generates a single quantitative measure to represent the correctness of a target tree with respect to the reference tree. This measure is applicable for all types of phylogenetic trees. Hence in this manuscript, we use the term OTU and species interchangeably.

Methodology

Our proposed measure computes the dissimilarity of a tree with respect to a reference tree by utilizing the clade(s) of the reference tree. Our objective is to compute the degree of deformation for each clade in the target phylogenetic tree with respect to the clades of the reference tree. Next, we define a few terms and notations used to describe the methodology for computing the deformity index (DI). Throughout this paper, we use deformity index and DI interchangeably.

Let us consider a phylogenetic tree \(T\) having the set of leaves \(L\) and the set of internal nodes \({V}_{I}\). Thus, the set of all nodes of \(T\) is \(V={V}_{I}\cup L\). The level of a node \(\eta\) is represented as \(h\left(\eta \right);\forall \, \eta \in V\). The parent node of \(\eta\) is represented as \(p\left(\eta \right);\forall \, \eta \in V\setminus \left\{{r}_{T}\right\}\), where \({r}_{T}\) is the root node of \(T\). For a set of leaf nodes \(S\), the most recent common ancestor is represented as \(\mathcal{M}\left(S\right)\). The clade at a particular node \(\eta\) is the list of all leaves present under \(\eta\) and is denoted as \(\lambda \left(\eta \right), \forall \, \eta \in {V}_{I}\). Please refer to Supplementary Sec. 2 for more details about the definitions and the notations used here.

Definition 1 Transfer In

The minimum distance required to shift an OTU from its parent node to a selected clade that does not contain the species is called transfer in.

In this study, we consider unweighted trees; hence the distance between two nodes is the number of edges between them. This movement is applicable only to the leaves of a tree. The transfer in of an OTU is computed when we need to shift the corresponding species from its assigned position to its correct clade. To add an OTU \(A\) at node \(\eta\), we need to remove \(A\) from its parent node, \(p(A)\), and insert it in \(\eta\). For this case, the transfer in is denoted as \(T{L}_{\text{in}}(A, \eta )\). Hence,

$$\begin{array}{c}\begin{array}{c}T{L}_{\text{in}}\left(A, \eta \right)=\left(h\left(p\left(A\right)\right)-h\left(\mathcal{M}\left({L}_{A}\right)\right)\right)+\left(h\left(\eta \right)-h\left(\mathcal{M}\left({L}_{A}\right)\right)\right)\\ =h\left(\eta \right)+h\left(p\left(A\right)\right)-2h\left(\mathcal{M}\left({L}_{A}\right)\right)\end{array}\end{array}$$
(1)

where \({L}_{A}=\lambda \left(\eta \right)\cup \left\{A\right\}\), for all \(A\in L\) and \(\eta \in {V}_{I}\).

A typical example is shown in Fig. 1a. We need to traverse a minimum distance of two edges to shift the species \(4\) of the tree \(T\) from its current position to \(\eta\). So, \(T{L}_{\text{in}}\left(4, \eta \right)=2\).
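As a minimal sketch of Eq. 1, the following Python snippet computes the transfer in on a toy tree. The child-to-parent dictionary `PARENT`, the node names, the helper names, and the convention that the root has level 0 (with levels increasing toward the leaves) are assumptions made for illustration. The toy topology is hypothetical and is not the tree of Fig. 1.

```python
# Hypothetical toy tree, stored as child -> parent ("r" is the root).
# Leaves are integers; internal nodes are strings.
PARENT = {"x": "r", "eta": "r", 4: "x", 3: "x", 5: "eta", 6: "eta"}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def level(node):
    """Level h(node): number of edges from the root (assumed root level 0)."""
    return len(path_to_root(node)) - 1


def transfer_in(leaf, eta):
    """Eq. 1: edges traversed to move `leaf` from p(leaf) into node `eta`.

    Assumes `leaf` is not already under `eta`, so the most recent common
    ancestor of lambda(eta) and the leaf is the deepest node shared by the
    root paths of p(leaf) and eta.
    """
    above_leaf = set(path_to_root(PARENT[leaf]))
    mrca = next(n for n in path_to_root(eta) if n in above_leaf)
    return level(eta) + level(PARENT[leaf]) - 2 * level(mrca)


print(transfer_in(4, "eta"))  # 2, along the path x -> r -> eta
```

Storing only parent pointers keeps the sketch short; a real implementation would more likely precompute levels once instead of walking to the root on every call.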

Fig. 1

Let us consider a clade of the reference tree as \({\Lambda }_{i}=\left\{4, 5, 6, 7, 9, 10\right\}\). Its members are shown in red. a Transfer in: to make the clade \(\eta\) monophyletic with respect to \({\Lambda }_{i}\), OTU \(4\) should be placed within the clade \(\eta\). Hence, OTU \(4\) should be moved two levels (marked by the blue arrow). So \(T{L}_{\text{in}}\left(4,\eta \right)=2\). b The modified tree after adding the ingroup OTU \(4\) to the clade \(\eta\). c Transfer out: to make the clade \(\eta\) monophyletic with respect to \({\Lambda }_{i}\), OTU \(8\) should be placed outside the clade by shifting it two levels from its current position (marked by the blue arrow). So \(T{L}_{\text{out}}\left(8,\eta \right)=2\). d The modified tree after removing the outgroup OTU \(8\) from the clade \(\eta\). e The node deformity is the sum of the transfer in and the transfer out. Hence, \(Dn\left(\eta ,{\Lambda }_{i}\right)=4\). The deformation is \(d\left(\eta ,{\Lambda }_{i}\right)=\frac{6}{6}(2+2)=4\). f Transformed tree after applying \(Dn(\eta ,{\Lambda }_{i})\) (Color figure online)

Definition 2 Transfer Out

The minimum distance required to shift an OTU from its parent node to an ancestor of a selected clade consisting of the OTU is called transfer out.

Like the transfer in, only a leaf node can be shifted from a particular clade to an ancestral clade. The transfer out of an OTU is computed when we need to remove the OTU from a selected clade consisting of this species. For removing an OTU \(B\) from a clade \(\eta\), we need to remove \(B\) from its parent node, \(p(B)\) and insert it to the parent node of \(\eta\), \(p(\eta )\). For this case, the transfer out is denoted as \(T{L}_{\text{out}}(B,\eta )\). Hence,

$$\begin{array}{c}T{L}_{\text{out}}\left(B, \eta \right)=h\left(p\left(B\right)\right)-h\left(\eta \right)+1\end{array}$$
(2)

where \(\forall \, B\in L\) and \(\forall \, \eta \in {V}_{I}\).

In the example shown in Fig. 1c, we need to traverse a minimum distance of two edges to move the OTU \(8\) of the tree \(T\), from its allocated position to the parent of a mentioned node, \(\eta\). So, the transfer out, \(T{L}_{\text{out}}\left(8,\eta \right)=2\).
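The transfer out can be sketched by counting edges directly, following the wording of Definition 2 rather than going through node levels. The child-to-parent dictionary `PARENT`, the node names, and the helper names below are assumptions for illustration, and the toy topology is hypothetical (it is not the tree of Fig. 1).

```python
# Hypothetical toy tree, stored as child -> parent ("r" is the root).
# Leaves are integers; internal nodes are strings.
PARENT = {1: "r", "eta": "r", 5: "eta", "y": "eta", 8: "y", 6: "y"}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def transfer_out(leaf, eta):
    """Edges traversed to move `leaf` from p(leaf) up to p(eta)."""
    path = path_to_root(PARENT[leaf])
    if eta not in path:
        raise ValueError("leaf does not lie under eta")
    if eta not in PARENT:
        raise ValueError("eta is the root, so it has no parent to move to")
    # Position of p(eta) on the upward path equals the number of edges.
    return path.index(PARENT[eta])


print(transfer_out(8, "eta"))  # 2, along the path y -> eta -> r
print(transfer_out(5, "eta"))  # 1, along the path eta -> r
```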

We have a reference tree \({T}_{R}\) with a list of clades, say \(\Lambda \left({T}_{R}\right)\), to compute the DI of a target tree, \(T\). Let us consider a reference clade \({\Lambda }_{i}\in \Lambda \left({T}_{R}\right)\). At any particular node of the target tree, \(\eta \in {V}_{I}\), we may need to add a list of OTUs, \(P\), which are present in the reference clade, \({\Lambda }_{i}\), but not in the clade \(\lambda \left(\eta \right)\) of \(T\). Hence, \(P={\Lambda }_{i}\setminus \lambda \left(\eta \right)\). Also, we may need to remove a list of OTUs, \(Q\), which are present in \(\lambda \left(\eta \right)\) but not in \({\Lambda }_{i}\). So, \(Q=\lambda \left(\eta \right)\setminus {\Lambda }_{i}\). These operations are necessary to make the clade \(\lambda \left(\eta \right)\) of \(T\) concordant with the reference clade, \({\Lambda }_{i}\). For inserting the list of OTUs \(P\), we compute the transfer in individually for each \(p\in P\). The sum of these individual transfer ins is called the total transfer in. Similarly, for deleting the list of OTUs \(Q\), we compute the transfer out individually for each \(q\in Q\). We call the sum of these transfer outs the total transfer out. The total transfer in and total transfer out at an internal node \(\eta\) with respect to the reference clade \({\Lambda }_{i}\) are represented as \(TT{L}_{\text{in}}\left(\eta , {\Lambda }_{i}\right)\) and \(TT{L}_{\text{out}}\left(\eta , {\Lambda }_{i}\right)\), respectively. Their sum is the node deformity at \(\eta\) with respect to \({\Lambda }_{i}\) and is denoted as \(Dn(\eta ,{\Lambda }_{i})\). Hence,

$$\begin{array}{c}Dn\left(\eta ,{\Lambda }_{i}\right)={\sum }_{p}T{L}_{\text{in}}\left(p,\eta \right)+{\sum }_{q}T{L}_{\text{out}}\left(q,\eta \right);\ \forall \, p\in P\ \text{and}\ \forall \, q\in Q\end{array}$$
(3)

Definition 3 Deformation

The deformation at an internal node of the target tree with respect to a reference clade is defined as the normalized node deformity at the same node with respect to the same reference clade.

To compute the deformation at \(\eta\) with respect to \({\Lambda }_{i}\), \(Dn\left(\eta ,{\Lambda }_{i}\right)\) is normalized by the ratio of the size of the reference clade, \({\Lambda }_{i}\), to the size of the target clade, \(\lambda \left(\eta \right)\). The deformation at \(\eta\) with respect to \({\Lambda }_{i}\) is denoted as \(d\left(\eta ,{\Lambda }_{i}\right)\) and is given by the following expression.

$$\begin{array}{c}d\left(\eta ,{\Lambda }_{i}\right)=\frac{\left|{\Lambda }_{i}\right|}{\left|\lambda \left(\eta \right)\right|} Dn\left(\eta ,{\Lambda }_{i}\right)\end{array}$$
(4)

where \(|{\Lambda }_{i}|\) and \(|\lambda \left(\eta \right)|\) represent the number of members present under the clades \({\Lambda }_{i}\) and \(\lambda \left(\eta \right)\), respectively.
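The node deformity (Eq. 3) and the deformation (Eq. 4) at a single node can be sketched as follows. The tree below is a hypothetical topology chosen only so that it reproduces the values quoted in the caption of Fig. 1 (transfer in of 2 for OTU 4, transfer out of 2 for OTU 8); it is not claimed to be the figure's actual tree. The child-to-parent dictionary, the helper names, and the root-level-0 convention are likewise assumptions, and the sketch assumes every member of the reference clade is a leaf of the target tree.

```python
# Hypothetical target tree, stored as child -> parent ("r" is the root).
PARENT = {
    "x": "r", "eta": "r",
    4: "x", 3: "x",
    "a": "eta", "b": "eta",
    5: "a", 6: "a", 7: "a",
    8: "b", 9: "b", 10: "b",
}


def path_to_root(node):
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def level(node):
    return len(path_to_root(node)) - 1  # assumed: root has level 0


def leaves_under(eta):
    """lambda(eta): the set of leaves below the internal node eta."""
    internal = set(PARENT.values())
    return {n for n in PARENT if n not in internal and eta in path_to_root(n)}


def transfer_in(leaf, eta):
    above_leaf = set(path_to_root(PARENT[leaf]))
    mrca = next(n for n in path_to_root(eta) if n in above_leaf)
    return level(eta) + level(PARENT[leaf]) - 2 * level(mrca)


def transfer_out(leaf, eta):
    # Number of edges from p(leaf) up to p(eta).
    return path_to_root(PARENT[leaf]).index(PARENT[eta])


def node_deformity(eta, ref_clade):
    clade = leaves_under(eta)
    to_add = ref_clade - clade      # P in the text
    to_remove = clade - ref_clade   # Q in the text
    return (sum(transfer_in(x, eta) for x in to_add)
            + sum(transfer_out(x, eta) for x in to_remove))


def deformation(eta, ref_clade):
    scale = len(ref_clade) / len(leaves_under(eta))
    return scale * node_deformity(eta, ref_clade)


reference_clade = {4, 5, 6, 7, 9, 10}
print(node_deformity("eta", reference_clade))  # 4
print(deformation("eta", reference_clade))     # 4.0
```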

We can compute the deformation at every internal node of the target tree with respect to the reference clade. Now, our objective is to compute the minimum deformation among all the internal nodes of the target tree, \(T\), with respect to the reference clade, \({\Lambda }_{i}\).

Definition 4 Clade Deformation

The clade deformation of a target tree is the minimum deformation among all its internal nodes with respect to a reference clade.

Clade deformation with respect to \({\Lambda }_{i}\) is denoted as \(Dc\left({\Lambda }_{i}\right)\). Hence,

$$\begin{array}{c}Dc\left({\Lambda }_{i}\right)=\underset{\eta }{\mathrm{min}}\left\{d\left(\eta , {\Lambda }_{i}\right)\right\};\forall \, \, \eta \, \in {V}_{I}\end{array}$$
(5)

For each \({\Lambda }_{i}\in \Lambda \left({T}_{R}\right)\), we compute \(Dc\left({\Lambda }_{i}\right)\). The deformity index (DI) of a tree \(T\) is the average of the clade deformations over all the reference clades. The deformity index of the tree \(T\) is denoted as \(D(T)\) and is expressed as follows:

$$\begin{array}{c}D\left(T\right)=\frac{1}{\left|\Lambda \left({T}_{R}\right)\right|}{\sum }_{i}Dc\left({\Lambda }_{i}\right); \, \forall \, \, {\Lambda }_{i}\in \Lambda \left({T}_{R}\right)\end{array}$$
(6)
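Putting Eqs. 1 to 6 together, the whole computation can be sketched in a few functions. This is an illustrative sketch, not the authors' implementation. It assumes a child-to-parent dictionary as the tree representation, root level 0, reference clades given as Python sets whose members all occur as leaves of the target tree, and it skips the root when minimizing because the root has no parent to receive transferred-out leaves. All function names and the toy tree are hypothetical.

```python
def path_to_root(parent, node):
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def clades(parent):
    """Map each internal node to lambda(node), the set of leaves under it."""
    internal = set(parent.values())
    result = {node: set() for node in internal}
    for leaf in parent:
        if leaf not in internal:
            for ancestor in path_to_root(parent, leaf)[1:]:
                result[ancestor].add(leaf)
    return result


def deformation(parent, eta, clade, ref_clade):
    """Eq. 4 at node eta, where `clade` is lambda(eta)."""
    def level(node):
        return len(path_to_root(parent, node)) - 1

    total = 0
    for leaf in ref_clade - clade:   # P: leaves to transfer in (Eq. 1)
        above = set(path_to_root(parent, parent[leaf]))
        mrca = next(n for n in path_to_root(parent, eta) if n in above)
        total += level(eta) + level(parent[leaf]) - 2 * level(mrca)
    for leaf in clade - ref_clade:   # Q: leaves to transfer out to p(eta)
        total += path_to_root(parent, parent[leaf]).index(parent[eta])
    return len(ref_clade) / len(clade) * total


def clade_deformation(parent, ref_clade):
    """Eq. 5: minimum deformation over internal nodes (root skipped)."""
    return min(
        deformation(parent, eta, clade, ref_clade)
        for eta, clade in clades(parent).items()
        if eta in parent  # assumption: the root is not a candidate node
    )


def deformity_index(parent, ref_clades):
    """Eq. 6: mean clade deformation over all reference clades."""
    return sum(clade_deformation(parent, c) for c in ref_clades) / len(ref_clades)


# Hypothetical target tree (child -> parent) and two reference clades.
TARGET = {
    "x": "r", "eta": "r", 4: "x", 3: "x", "a": "eta", "b": "eta",
    5: "a", 6: "a", 7: "a", 8: "b", 9: "b", 10: "b",
}
print(deformity_index(TARGET, [{4, 5, 6, 7, 9, 10}, {5, 6, 7}]))  # 2.0
```

In this toy case the clade {5, 6, 7} is recovered exactly (clade deformation 0), while {4, 5, 6, 7, 9, 10} is best matched at node `eta` with deformation 4, giving a mean of 2.0.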

The minimum possible value of DI is zero which occurs when all the clades of the target tree are consistent with the reference clades. The maximum value is achieved when the tree is a caterpillar tree, and the members of the reference clade are attached at their highest possible level in the caterpillar tree. The formulations, justifications, and detailed explanations on the range of DI are provided in Supplementary Sec. 3. A flow diagram of our proposed methodology for computing the deformity index is shown in Fig. 2.

Fig. 2

Flow diagram for computing the deformity index of the target tree \(T\) with respect to a reference tree \({T}_{R}\). Here, the set of all the clades present in \({T}_{R}\) is represented as \(\Lambda \left({T}_{R}\right)\)

Results and Discussion

In this section, we first compute DI and the other existing measures for a simulated dataset. The performance analysis shows the association of DI with the conventional measures. Then, we perform different statistical tests to characterize the distribution of DI. Finally, we compute DI by considering the biological information of clades and illustrate that DI outperforms the traditional techniques of computing the correctness of a tree.

Simulated Dataset

We randomly generate 10,000 trees having the same set of OTUs under both the Yule model (Harding 1971) and the uniform model (Semple and Steel 2003). However, trees estimated from sequence data do not follow the Yule model or the uniform model in reality. Hence, we also generate 10,000 phylogenetic trees with the same set of OTUs and consider them as the reference trees. For each reference tree, we simulate the DNA sequences of its OTUs using the tool INDELible (Fletcher and Yang 2009). Finally, we apply the maximum likelihood method to reconstruct a tree from each set of simulated sequences. In the following sections, we refer to this procedure as the simulated model. The details of the data simulation are provided in Supplementary Sec. 4. We run the simulations under the three models separately for different numbers of OTUs (from 10 to 100). Thus, for each number of OTUs, the simulated data contain 10,000 reference and target trees. We then compute the RF, MS, NS, TT, GTP, and DI scores of these target trees with respect to the corresponding reference trees. We first use the means of DI of each set of trees to compute the correlation between DI and the traditional methods.
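For orientation, a random topology under the Yule (pure-birth) model can be generated by repeatedly splitting a uniformly chosen tip and then labeling the tips at random. The sketch below illustrates that standard construction only; the function name, node names, and child-to-parent output format are assumptions, and it is not the simulation pipeline of Supplementary Sec. 4.

```python
import random


def yule_tree(n_leaves, seed=None):
    """Random binary topology under the Yule (pure-birth) process.

    Returns a child -> parent dict. Leaves are the integers 1..n_leaves and
    internal nodes are strings "n0", "n1", ... with "n0" as the root.
    """
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    rng = random.Random(seed)
    parent = {}
    tips = ["n0"]  # current tips of the growing tree
    next_id = 1
    while len(tips) < n_leaves:
        # Pick one current tip uniformly at random and split it in two.
        split = tips.pop(rng.randrange(len(tips)))
        for _ in range(2):
            child = f"n{next_id}"
            next_id += 1
            parent[child] = split
            tips.append(child)
    # Assign the OTU labels 1..n to the tips in random order.
    rng.shuffle(tips)
    relabel = {tip: i for i, tip in enumerate(tips, start=1)}
    return {relabel.get(child, child): par for child, par in parent.items()}


tree = yule_tree(10, seed=1)
internal_nodes = set(tree.values())
print(len(internal_nodes))  # 9 internal nodes for 10 leaves
```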

Correlation Coefficients

Here, we compute the Pearson correlation coefficients (PCC) (Stigler 1989) between DI and each of the existing measures: RF, MS, NS, TT, and GTP. The PCC values are very high in all the cases (please refer to Table 1), which indicates a strong association of DI with the existing measures.
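The correlation step can be sketched with a plain implementation of the Pearson coefficient. The two input lists below are placeholder numbers invented for illustration only; they are not values from Table 1, and the variable names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-size means (NOT data from the paper): one entry per
# number of leaves, e.g. 10, 20, 30, 40.
mean_di = [1.0, 2.1, 2.9, 4.2]
mean_rf = [3.0, 6.5, 9.1, 12.8]
print(round(pearson(mean_di, mean_rf), 3))
```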

Table 1 Pearson correlation coefficients between deformity index and RF, MS, NS, TT, and GTP for the random trees with different number of leaves generated under the Yule model, uniform model, and simulated model

Additionally, our measure handles the cases where the reference tree and the target tree contain different sets of species, as well as the cases where the reference tree is not completely known. In these cases, our proposed method outperforms the existing methods, as discussed in Sect. Biological Dataset.

Distribution of Deformity Index

As DI is linearly related to clade deformation (Eq. 6), the distribution of clade deformation also reflects the distribution of DI. The primary components for computing the clade deformation are \(TT{L}_{\text{in}}\) and \(TT{L}_{\text{out}}\) (please refer to Eq. 3), which depend on the levels of internal vertices; these levels follow the geometric distribution (Steel and McKenzie 2001). Hence, the node deformity also follows the geometric distribution. To compute the deformation, the node deformity is normalized by the ratio of the size of the reference clade to the number of leaf nodes under a clade of the target tree (Eq. 4). The number of leaves under a clade follows the beta distribution (Mahmoud and Smythe 1991). Hence, the distribution of clade deformation is a combination of the geometric and beta distributions. However, modeling it is challenging and remains an open problem at this stage.

Test for Goodness-of-Fit

The goodness-of-fit test is used to determine whether the observed sample distribution of a given phenomenon differs significantly from an expected probability distribution. Since it is challenging to derive the distribution of DI mathematically, we employ two widely used goodness-of-fit techniques, the Chi-square test (Pearson 1992) and the Kolmogorov–Smirnov (K–S) test (Marsaglia et al. 2003), to examine whether this distribution is normal. In our test procedure, the null hypothesis is that the distribution of clade deformation is normal. We first compute the distributions of DI for 10,000 random trees generated under both the Yule model and the uniform model. For these trees, the Chi-square test does not reject the null hypothesis (at a confidence level of more than 95%), whereas the K–S test rejects it. We then perform the goodness-of-fit tests on the trees generated from the simulated model described above. Again, the Chi-square test does not reject normality of the DI distribution (at a confidence level of more than 95%), whereas the K–S test rejects the null hypothesis. Figure 3 shows the histograms of DI for such random trees with 100 leaves. From Fig. 3, we infer that the distribution of clade deformation does not follow a normal distribution exactly but is very close to one.
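To make the K–S step concrete, the sketch below computes the one-sample K–S statistic of a sample against a normal distribution fitted to that sample, using only the Python standard library. It is an assumption-laden illustration, not the procedure used for the reported results: the function names are hypothetical, the threshold 1.36/sqrt(n) is the usual asymptotic 5% critical value for a fully specified distribution, and estimating the mean and standard deviation from the same sample makes that threshold conservative.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        gap = max(gap, (i + 1) / n - cdf, cdf - i / n)
    return gap


def rejects_normality(sample, coefficient=1.36):
    """Compare the statistic with the asymptotic 5% critical value."""
    return ks_statistic_normal(sample) > coefficient / math.sqrt(len(sample))


rng = random.Random(0)
bell_shaped = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed = [rng.expovariate(1.0) for _ in range(2000)]
print(rejects_normality(bell_shaped), rejects_normality(skewed))
```

With 10,000 observations the K–S threshold becomes very tight, so even small departures from normality lead to rejection, which is consistent with a histogram that looks close to normal but still fails the test.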

Fig. 3

Histograms of the deformity index for 10,000 random trees with 100 leaves: a generated under the Yule model, b generated under the uniform model, c derived from the simulated sequences by the maximum likelihood method (Color figure online)

We compute the means and the standard deviation intervals of DI for 10,000 random trees with different numbers of leaves generated under both the Yule model and the uniform model (please refer to Fig. 4). Trees generated under the uniform model are known to be more distinct than those generated under the Yule model. The means and the standard deviation intervals of DI under the uniform model are larger and grow faster than those under the Yule model (shown in Fig. 4). Thus, DI shows more distinct values for the uniform model than for the Yule model, which agrees with this fact. We perform the same analysis on the trees derived from the simulated model. From Fig. 4, the means and the standard deviation intervals of DI for this case grow more slowly than those of the Yule and the uniform models. These observations also indicate that DI is very sensitive to the tree topology.

Fig. 4

Means and standard deviation intervals of the deformity index computed for 10,000 random trees with different numbers of leaves generated under three different models. In the simulated model, we simulate sequences from a reference tree and derive the tree by the maximum likelihood method. Finally, we compute the DI of the derived tree with respect to the corresponding reference tree. The solid line corresponding to each model shows how the mean DI changes with the number of leaves. The ribbons show the double standard deviation intervals \(\bar{x}\pm 2\sigma\) around the mean for the respective models (Color figure online)

Biological Dataset

Various studies have determined the phylogenetic relationships of different sets of OTUs. Hence, there are many hypotheses on the phylogenetic relationships of a selected set of OTUs. In this section, we consider several such hypotheses to analyze the performance of both DI and the conventional measures for each hypothesis as well as for a selected reference tree. Here, we consider two datasets, one of fishes (from the order Gadiformes) and one of mammals. We examine our proposed method on the trees derived by 14 different methods, which include both alignment-based and alignment-free methods. This section demonstrates the power of DI when the reference tree and the target tree contain different sets of OTUs. Since GTP (Owen and Provan 2010) requires both trees to have the same set of OTUs, we consider only the RF, MS, NS, and TT methods in this section.

Gadiformes

We choose 19 species from eight different taxonomic groups at the subfamily level of the order Gadiformes and one outgroup species from the order Clupeiformes. The details of the species are given in Supplementary Table 5.1.

To demonstrate the power of DI, we consider one tree as the reference tree (shown in Fig. 5a) and a phylogenetic tree of Gadiformes derived from the Euclidean-based dissimilarity measure of k-mers followed by the UPGMA method (Lu et al. 2017) as the target tree (shown in Fig. 5b). The two trees have different sets of species. Although these trees have visible differences, all the conventional methods prune the uncommon species from both trees and return zero as the comparison score, which suggests that the trees are equivalent. For this case, DI gives a non-zero value, implying that the target tree differs from the reference tree. However, a single reference tree may not be sufficient, as it may not include complete information about the relationships among the selected species. Considering multiple reference trees can resolve this issue, but the conventional methods consider only a single reference tree while computing the score. In contrast, DI takes account of the knowledge of the relationships among the species from multiple hypotheses and computes a score based on this cumulative knowledge. Here, we first demonstrate how to accumulate knowledge of the clades from different hypotheses and use it to compute the DI. Then we present the results showing that DI outperforms the other conventional measures.

Fig. 5

Phylogenetic trees of Gadiformes. a Reference tree proposed by Teletchea et al. (2006). b Tree generated by utilizing the Euclidean distance of k-mers followed by the UPGMA method. The taxonomic groups are colored as follows: Gadinae (pink), Lotinae (orange), Merlucciidae (red), Trachyrincinae (brown), Macrouroidinae (blue), Bathygadinae (purple), Macrourinae (green), Bregmacerotidae (light blue), Phycidae (cyan), Macruronidae (grey), Steindachneridae (light green), and Moridae (dark green). The outgroup is colored black (Color figure online)

There are many conflicts among the various hypotheses of the phylogenetic relationships among the species of Gadiformes. Hence, the true reference tree of Gadiformes is still unknown. Here we consider the widely accepted hypotheses of the phylogenetic relationships of these families for scoring the derived trees constructed by the different methods. Brief descriptions of the most accepted hypotheses are given below. Please refer to Supplementary Sec. 5 for the details related to these hypotheses.

HG1: The Gadinae subfamily is monophyletic (Roa-Varón and Ortí 2009; Shi et al. 2016; Teletchea et al. 2006; von der Heyden and Matthee 2008).

HG2: Subfamily Lotinae is the sister group of Gadinae. So, they form a monophyletic clade (Fahay 1984; Nelson 1984).

HG3: Bregmacerotidae belongs to the sister group of either higher “gadoids” or higher “gadoids” excluding “macruronids” (Nelson 2006; Roa-Varón and Ortí 2009; Shi et al. 2016). Most of the recent studies also accept the monophyletic relation of Merlucciidae with Gadinae and Lotinae (Endo 2002; Shi et al. 2016; Teletchea et al. 2006; von der Heyden and Matthee 2008). Hence, Bregmacerotidae and Merlucciidae both form a monophyletic clade with Gadinae and Lotinae.

HG4: Macrourinae is monophyletic (Gaither et al. 2016).

HG5: Macrouroidinae and Trachyrincinae together form a monophyletic clade (Kriwet and Hecht 2008).

HG6: Macrourinae, Macrouroidinae, Trachyrincinae, and Bathygadinae together form a monophyletic clade (Kriwet and Hecht 2008).

According to these hypotheses, we construct the list of the reference clades. Then we compute different quality metrics such as, DI, RF, MS, NS, and TT of the target trees. We also visually judge the derived trees as “complete”, “partial”, and “marginal”, based on the reference clades.

  • “complete” denotes that the tree agrees with the corresponding hypothesis completely.

  • “partial” denotes that the tree misses a few relations based on the corresponding hypothesis.

  • “marginal” denotes that very few relations of the hypothesis exist in the tree.

Considering the six hypotheses, we compute DI, RF, MS, NS, and TT for the trees derived from all 14 methods. The trees are provided in Supplementary Sec. 6. The visual judgments and the scores for each of the hypotheses are summarized in Supplementary Fig. 7.1, and some typical examples are shown in Fig. 6a–d. When the visual judgment is “complete”, the corresponding scores should be zero, as the tree completely supports the hypothesis. In the other cases, the comparison scores should be positive.

For example, the tree derived by applying the Euclidean distance measure on k-mers followed by the UPGMA method (please refer to Fig. 5b) has a monophyletic clade of the Gadinae subfamily. Hence, it satisfies HG1 “completely”. DI gives a zero score for this case, which is represented by the absence of the pink bar for HG1 in Fig. 6a. However, all the other methods give non-zero values because the reference clade contains multifurcation, which is not present in the derived tree (explained in Sect. Introduction). This is represented by the presence of the other four colored bars for HG1 in Fig. 6a. The same happens for HG2 (please refer to HG2 of Fig. 6a).

Again, in the same tree (shown in Fig. 5b), the members of subfamily Macrourinae form a paraphyletic clade with the other members of Gadiformes, so the tree satisfies HG4 only “marginally”. Hence, according to HG4, a high comparison score is expected. However, only DI gives a non-zero score, shown as the pink bar for HG4 in Fig. 6a. All the other scores are zero because these methods prune all the species that do not belong to the subfamily Macrourinae (explained in Sect. Introduction). This is represented by the absence of the other four bars in HG4 of Fig. 6a.

When the visual judgment is “partial”, the tree misses a few relations of the corresponding hypothesis. Hence, for a particular target tree, the comparison scores in these cases should be lower than those obtained when the tree supports the hypothesis only “marginally”. For the same tree (shown in Fig. 5b), HG3 is “partially” supported, whereas both HG4 and HG6 are supported “marginally”. The DI score is lower for HG3 than for both HG4 and HG6 (refer to Fig. 6a). Similarly, for all the 14 derived trees, DI gives more logical scores, which agree better with the visual inspections than those of the other measures. The detailed scores for each of the cases are provided in Supplementary Table 7.1.

Fig. 6

Comparative scores of the trees derived from different methods for different hypotheses. a–d The scores for Gadiformes are computed by considering the reference clades or the hypotheses described as HG1–HG6, and e–h the scores for mammals are computed by considering the reference clades or the hypotheses described as HM1–HM6. The scores shown here are deformity index (pink), RF (green), MS (blue), NS (yellow), and TT (purple), respectively. The absence of a bar denotes a zero value of the corresponding score. For the visual judgment “complete”, the scores should be zero, while for the “partial” and “marginal” cases the scores should be non-zero positive values. Deformity index (pink) consistently satisfies this property, while the other scoring methods underperform in many cases (Color figure online)

Mammals

We consider 40 mammals from seven orders of class Mammalia. Among the existing hypotheses, we consider some of the widely accepted ones which are listed as follows (details are provided in Supplementary Table 8.1):

HM1: The order Primates forms a monophyletic clade (Hallström et al. 2007; Kriegs et al. 2006; Li et al. 2017; Murphy et al. 2001a, b).

HM2: Human is the sibling of chimpanzee, and gorilla is the sister of (human + chimpanzee) (Hallström et al. 2007; Prasad et al. 2008).

HM3: The order Carnivora forms a monophyletic clade (Li et al. 2017; Murphy et al. 2001a, b; Springer et al. 2003; Waddell and Shelley 2003).

HM4: The order Rodentia is the sister of Lagomorpha (Murphy et al. 2001a, b; Springer et al. 2003; Waddell and Shelley 2003).

HM5: Rodentia and Lagomorpha are sisters of Primates (Murphy et al. 2001a, b; Springer et al. 2003).

HM6: The order Perissodactyla is the sister group of the order Carnivora (Murphy et al. 2001a, b; Springer et al. 2003).

As with the Gadiformes, we use these reference clades to compute the DI, RF, MS, NS, and TT measures of the trees derived from different methods. The derived trees are provided in Supplementary Sec. 9. We judge these trees visually as “complete”, “partial”, and “marginal”, based on the reference clades, as we have done for the Gadiformes; the judgments are summarized in Supplementary Fig. 10.1, and some typical examples are shown in Fig. 6e–h. To show that DI outperforms the other methods for this dataset, we consider the tree derived from the maximum parsimony-based method [as shown in Supplementary Fig. 9.1(d)]. As this tree supports HM3 “completely”, a zero comparison score is expected. However, except for DI, all the scoring methods give non-zero values because the reference clade depicts multifurcation relationships that are not present in the derived tree (explained in Sect. Introduction). This is shown in HM3 of Fig. 6e. Again, as HM4 is “partially” supported by the same tree, a non-zero score is expected, but except for DI all the methods give a zero score for this case (shown in HM4 of Fig. 6e). The same tree supports HM2 “marginally” and both HM4 and HM6 “partially”. DI gives lower scores for both HM4 and HM6 than for HM2, whereas the other methods show discrepancies in scoring the correctness for these cases. Similarly, based on the other derived trees, we observe that DI outperforms the conventional methods. The detailed scores for all the measures are provided in Supplementary Table 10.1.

Complexity Analysis

The deformation is the sum of \(TT{L}_{\text{in}}\) and \(TT{L}_{\text{out}}\). Its computational cost depends on the number of species placed under the wrong clade. For a reference clade with \(c\) leaves and a tree with \(n\) leaves, the number of transfer-in operations required lies in \([0,c]\), and the number of transfer-out operations required lies in \([0,n-c]\). Hence, the total number of \(TT{L}_{\text{in}}\) and \(TT{L}_{\text{out}}\) operations lies in \([0,n]\), and the average time complexity of computing the deformation of a clade is \(\mathcal{O}\left(n\right)\). A tree with \(n\) leaves has at most \((n-1)\) internal nodes (attained for binary trees). Thus, for each reference clade, the deformation is computed at most \((n-1)\) times, and the time complexity of computing the clade deformation of a reference clade is \(\mathcal{O}\left({n}^{2}\right)\). With \(R\) reference clades, the time complexity of computing the deformity index of the tree is \(\mathcal{O}\left(R{n}^{2}\right)\).
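The loop structure behind this bound can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several simplifying assumptions: every transfer costs one unit; transfer-in counts the reference-clade leaves missing from a subtree; transfer-out counts the subtree leaves that do not belong to the reference clade; and the clade deformation is taken as the minimum over all internal nodes. The nested-tuple tree encoding and all function names are hypothetical, and the sketch returns per-clade scores only, without aggregating them into the deformity index.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of children.
Tree = Union[str, Tuple["Tree", ...]]


def internal_leaf_sets(tree: Tree) -> List[FrozenSet[str]]:
    """Collect the leaf set under every internal node (post-order)."""
    sets: List[FrozenSet[str]] = []

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        sets.append(leaves)
        return leaves

    walk(tree)
    return sets


def deformation(subtree_leaves: FrozenSet[str], clade: FrozenSet[str]) -> int:
    """Unit-cost deformation of one subtree against one reference clade: O(n)."""
    transfer_in = len(clade - subtree_leaves)    # lies in [0, c]
    transfer_out = len(subtree_leaves - clade)   # lies in [0, n - c]
    return transfer_in + transfer_out            # lies in [0, n]


def clade_deformation(tree: Tree, clade: Iterable[str]) -> int:
    """Assumed here to be the minimum over at most n - 1 internal nodes: O(n^2)."""
    clade_set = frozenset(clade)
    return min(deformation(s, clade_set) for s in internal_leaf_sets(tree))


def deformity_scores(tree: Tree, reference_clades: Iterable[Set[str]]) -> List[int]:
    """One score per reference clade; R clades give O(R n^2) overall."""
    return [clade_deformation(tree, clade) for clade in reference_clades]


if __name__ == "__main__":
    toy_tree = ((("A", "B"), "C"), ("D", "E"))
    # {A, B} is an exact clade of the toy tree; {A, C} and {C, D} are not.
    print(deformity_scores(toy_tree, [{"A", "B"}, {"A", "C"}, {"C", "D"}]))  # [0, 1, 2]
```

In this toy tree, the clade {A, C} is best matched by the subtree over {A, B, C}, which needs one transfer-out (B), while {C, D} is best matched by the subtree over {D, E}, which needs one transfer-in (C) and one transfer-out (E).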

Conclusion

In this paper, we propose a novel semi-reference method to measure the quality of a tree using a list of clades. The deformity index (DI) of a tree indicates the correctness of the clades within it. Because the method depends only on the clades of a reference tree, DI can readily adapt to present biological knowledge and provide a quality metric in that context. DI also adapts to scenarios in which conventional tree comparison methods do not provide a meaningful score. Although we propose the measure for unweighted trees in this study, it can also be extended to weighted trees. We inspect the distributions of the different modules of DI and perform statistical goodness-of-fit tests, such as the Chi-square test and the K–S test, to understand the distribution of DI. From these tests, we observe that DI is very sensitive to the tree topology. We give extremal results as well as experimental results for different biological models to characterize the proposed method. Using two datasets, of fishes and of mammals, we apply the measure to score the biological trees generated by different state-of-the-art methods. The more closely a tree adheres to the widely accepted phylogenetic hypotheses, the lower its DI score. Finally, we conclude that the deformity index, a flexible, versatile, and scalable tool, outperforms the traditional measures for computing the correctness of a tree and will be useful in the phylogeny research community.