Introduction

Similarity and distance measures are cornerstones of a vast range of methodologies in the fields of molecular modeling, drug design and cheminformatics [1, 2]. In some common examples, their binary implementations are used to quantify the similarity of binary molecular fingerprints (with the Tanimoto coefficient unquestionably being the most popular one) [3], while their continuous implementations constitute the basics of clustering algorithms [4]. The applications of molecular similarity (as expressed by pairwise similarity calculations between binary fingerprints) in ligand-based virtual screening were thoroughly explored by the groups of Jürgen Bajorath [5, 6], Peter Willett [7, 8], and many others, with a large body of works from the latter group dedicated to data fusion practices [9, 10]. Binary similarity measures from many sub-fields were collected by Todeschini and colleagues [11], and further analyzed by our group to select ideal candidates for specific applications in metabolomics [12] and molecular design [13] studies. We have also shown that two similarity measures can be consistent with each other in a surprisingly high percentage of cases, even when they are poorly correlated [14].

Recently, we have introduced several methodological frameworks to extend the usage of similarity measures beyond the common cases mentioned above. Most importantly, we have demonstrated that the mathematical expansion of the core concepts of similarity measures can provide a way to quantify the similarity of an arbitrary number of objects at the same time. We first showed this on binary (molecular) fingerprints: the resulting similarity measures were termed extended (or n-ary) similarity measures [15]. They employ the core concept of similarity and dissimilarity counters, which replace the a, b, c and d terms that are commonly applied in the well-known, pairwise definitions of the similarity measures to describe the number of bit positions where two fingerprints have co-occurring one (a) or zero (d) bits, or a one bit that is exclusive to either of the fingerprints (b and c). In our framework, the 1-similarity, 0-similarity, and dissimilarity counters express the number of bit positions where the number of co-occurring one (or zero) bits is above, or below, a pre-defined coincidence threshold, respectively. For pairwise comparisons, these generalizations naturally revert to the well-known definitions of the classical, pairwise similarity measures. We have shown that the new methodology is not only computationally efficient, scaling as O(n) with the number of compared objects n, but can also be successfully applied to tasks such as diversity selection, clustering, and the visualization of large sections of chemical space [16, 17, 18, 19]. A further generalization involved the extension of this framework to allow for more than the two possible characters of the binary case (t = 2) in an object (vector), opening the possibility to apply the extended similarity measures in bioinformatics, for the comparison of nucleotide (t = 4) or protein sequences (t = 20) [20]. We have termed these even further generalized definitions extended many-item, or (t, n), similarity measures, to distinguish them from the above-mentioned extended binary, or (2, n), similarity measures.

In this study, realizing the potential of further possible generalizations to extended similarity measures, we introduce extended continuous, or (ℝ, n) similarity measures, to provide a way to compare an arbitrary number of vectors with real values. This generalization will employ the same concepts as mentioned above, with novel formulas for determining the number of similarity and dissimilarity counters. As we will show in Sect. 2.1, there are at least three ways to generalize the extended indices so they can handle continuous-valued vectors. All of these variants were implemented and compared in the Results section.

To demonstrate the utility of the new class of similarity metrics, we use them in descriptor selection. Quantitative Structure–Activity Relationships (QSAR) are one of the earliest and most important concepts in molecular design [21]. QSAR establishes a linear or non-linear regression between numerical descriptors of compound structure and experimentally determined or calculated physicochemical parameters and bioactivity. While multiple linear regression (MLR) has ruled the QSAR field for a long time as a classical regression algorithm, the last decades have seen the emergence of several new algorithms, many of them based on Machine Learning [22], including some interesting examples that are adapted from other fields [23]. In the meantime, new families of molecular descriptors were introduced: contemporary descriptor calculator software (such as Dragon) can generate thousands of continuous (as well as discrete and binary) descriptors. Also, public bioactivity repositories such as ChEMBL [24] or PubChem Bioassay [25] allow access to large molecular datasets for the more thorough training of QSAR models. The increasing number of descriptors, more complex algorithms and larger training datasets are factors that drive up the computational demand of QSAR modeling: to mitigate this, it is common practice to apply one or more descriptor (feature) selection algorithms to reduce the input dataset of the modeling algorithm by pre-selecting the most meaningful descriptors to work with [26]. In turn, descriptor selection has its own computational cost as a limiting factor: less sophisticated (less demanding) algorithms will sample the descriptor space only superficially, while more sophisticated options, such as genetic algorithms, will be more time-consuming [27]. A thorough review of descriptor selection methods is given by Goodarzi et al. [28]. While we do not necessarily gain prediction accuracy from descriptor selection [29], a smaller number of descriptors will convey a significant speedup to QSAR modeling in most cases, especially if the descriptor selection approach is not laborious either.

Here, we apply the new class of extended continuous similarity metrics in a simple descriptor selection scenario. Using a large and relevant ADME-related dataset of cytochrome P450 (CYP) 2C9 inhibitors (actives) and inactive species, we calculate group-wise similarities based on several descriptor families to find the best ones at discriminating the group of actives from the total dataset. Therefore, we provide a novel, simple variable selection tool for QSAR/QSPR analyses. This idea can constitute the basis of more complex descriptor selection approaches with a more thorough exploration of the descriptor space to yield the set of descriptors that can optimally distinguish the actives from the total set of ligands.

Materials and methods

Extended continuous similarity indices

There are several ways to extend the domain of definition of our n-ary indices such that they can be applied to quantify the similarity of an arbitrary number of vectors with continuous components. The strategies that we will consider here all start from a common point: scaling the input data between 0 and 1. In other words, we will work with vectors:

$$\begin{gathered} V_{1} = \left( x_{11}, x_{12}, \ldots, x_{1m} \right) \hfill \\ V_{2} = \left( x_{21}, x_{22}, \ldots, x_{2m} \right) \hfill \\ \vdots \hfill \\ V_{n} = \left( x_{n1}, x_{n2}, \ldots, x_{nm} \right), \hfill \\ \end{gathered}$$
(1)

where \(\forall i,j:0 \le x_{ij} \le 1\).

Variant 1

The first way of quantifying the similarity of these vectors is to see how different the components are from the average of their column (i.e., how distant a feature is from its average value). Hence, the first step is to calculate the vector of column-wise averages:

$$A = \left( a_{1}, a_{2}, \ldots, a_{m} \right),$$
(2)

where

$$a_{i} = \frac{1}{n}\sum\limits_{j = 1}^{j = n} {x_{ji} }$$
(3)

We now have to subtract this average from the corresponding normalized elements (i.e., centering) and take the absolute values of these differences:

$$\begin{gathered} \left( \left| x_{11} - a_{1} \right|, \left| x_{12} - a_{2} \right|, \ldots, \left| x_{1m} - a_{m} \right| \right) \hfill \\ \left( \left| x_{21} - a_{1} \right|, \left| x_{22} - a_{2} \right|, \ldots, \left| x_{2m} - a_{m} \right| \right) \hfill \\ \vdots \hfill \\ \left( \left| x_{n1} - a_{1} \right|, \left| x_{n2} - a_{2} \right|, \ldots, \left| x_{nm} - a_{m} \right| \right) \hfill \\ \end{gathered}$$
(4)

The next step is to sum all these differences across a given column and form a new vector with the results:

$$S\, = \,\left( {s_{1} ,s_{2} ,...,s_{m} } \right),$$
(5)

where

$$s_{i} = \sum\limits_{j = 1}^{j = n} {\left| {x_{ji} - a_{i} } \right|}$$
(6)

Now, analogously to the original binary extended similarity indices, we need to define a new vector of “coincidences”:

$$D\, = \,\left( {\delta_{1} ,\delta_{2} ,...,\delta_{m} } \right),$$
(7)

where

$$\delta_{i} = \sqrt {n^{2} - 2ns_{i} }$$
(8)

This is directly related to our previous works [15, 16]. The key insight is that if the ith column of the normalized data has k 1’s and \(n - k\) 0’s, then \(\delta_{i} = \Delta_{n\left( k \right)} = \left| {2k - n} \right|\), that is, the indicator we use in our original paper to quantify the coincidence. The main difference is that the simpler \(\left| {2k - n} \right|\) expression is only defined over strings of 1’s and 0’s, while Eq. (8) is defined over real numbers in the [0, 1] interval.
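The connection between Eq. (8) and the binary indicator can be checked numerically. The following is a minimal sketch in plain Python (the function name `coincidence_variant1` and the toy data are illustrative assumptions, not part of the original implementation); it computes the column averages, the summed absolute deviations and the coincidences for a small binary matrix and compares them with \(\left| {2k - n} \right|\).

```python
import math

def coincidence_variant1(rows):
    """Return (a, s, delta) per column for rows scaled to [0, 1] (Eqs. 2-8).

    Hypothetical helper written for illustration only.
    """
    n = len(rows)        # number of compared objects
    m = len(rows[0])     # number of features (columns)
    a, s, delta = [], [], []
    for i in range(m):
        column = [row[i] for row in rows]
        a_i = sum(column) / n                       # Eq. (3): column average
        s_i = sum(abs(x - a_i) for x in column)     # Eq. (6): summed deviations
        # Eq. (8); max() only guards against tiny negative round-off
        d_i = math.sqrt(max(0.0, n * n - 2.0 * n * s_i))
        a.append(a_i)
        s.append(s_i)
        delta.append(d_i)
    return a, s, delta

# Binary toy data with n = 4: column 0 has k = 3 ones, column 1 has k = 2 ones
binary_rows = [[1, 1], [1, 1], [1, 0], [0, 0]]
a, s, delta = coincidence_variant1(binary_rows)
for k, d in zip((3, 2), delta):
    print(k, d, abs(2 * k - 4))   # delta_i should match |2k - n|
```

For this toy input both columns reproduce the binary indicator (2 and 0, respectively), which follows from \(s_{i} = 2k\left( {n - k} \right)/n\) for a binary column.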

Having established this connection, we can now follow a similar route as in the binary case. First, we define a coincidence threshold, \(\gamma\): if \(\delta_{i} > \gamma\), we use \(f_{s} \left( {\delta_{i} } \right)\) to estimate the similarity, and if \(\delta_{i} \le \gamma\), we use \(f_{d} \left( {\delta_{i} } \right)\) to calculate the dissimilarity. By analogy with the 1- and 0-similarities of the binary case, we can distinguish between “high-content similarities” (where the column average is higher) and “low-content similarities” (where the column average is lower):

If n is odd:

\(\delta_{i}\) will be a “high-content similarity” if \(\delta_{i} > \gamma\) and \(a_{i} \, \ge \,\frac{{\left( {n\, - \,n\bmod 2} \right)/2\, + \,n\bmod 2}}{n}\), moreover, \(\delta_{i}\) will be a “low-content similarity” if \(\delta_{i} > \gamma\) and \(a_{i} \, \le \,\frac{{\left( {n\, - \,n\bmod 2} \right)/2}}{n}\).

If n is even:

\(\delta_{i}\) will be a “high-content similarity” if \(\delta_{i} > \gamma\) and \(a_{i} \ge 0.5\), and \(\delta_{i}\) will be a “low-content similarity” if \(\delta_{i} > \gamma\) and \(a_{i} < 0.5\).
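The classification rules above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name and the string labels are hypothetical, and the label `"unassigned"` is our own placeholder for the case (possible for odd n and continuous data) in which \(\delta_{i} > \gamma\) but the column average falls strictly between the two cut-offs, which the rules above do not cover.

```python
def classify_variant1(a_i, delta_i, n, gamma):
    """Classify one column from its average a_i and coincidence delta_i.

    Hypothetical helper; the label names are assumptions for illustration.
    """
    if delta_i <= gamma:
        return "dissimilarity"
    if n % 2 == 1:
        # Odd n: cut-offs (n + 1) / (2n) and (n - 1) / (2n)
        high_cut = ((n - n % 2) / 2 + n % 2) / n
        low_cut = ((n - n % 2) / 2) / n
        if a_i >= high_cut:
            return "high-content"
        if a_i <= low_cut:
            return "low-content"
        # Not covered by the rules in the text (assumption: flag it)
        return "unassigned"
    # Even n: a single cut-off at 0.5
    return "high-content" if a_i >= 0.5 else "low-content"

print(classify_variant1(0.75, 2.0, 4, 1.0))  # even n, average above 0.5
print(classify_variant1(0.20, 2.5, 3, 1.0))  # odd n, average below 1/3
print(classify_variant1(0.50, 0.0, 4, 0.0))  # delta does not exceed gamma
```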

This procedure extends many of the notions of the binary case in a natural way. However, while quantifying similarity by measuring the distance to the mean is a common approach, we should be aware that a high similarity in this case implies that the components of vector S (Eq. 5) can be very close to zero. This means that if we do not use a high enough coincidence threshold, we run the risk of identifying all the columns as corresponding to low-content similarities. The problem with this is that several indices will then be ill-defined (i.e., they will involve division by zero), because their denominators only include high-content similarities and dissimilarity counters. For instance, taking the more common case of the traditional binary similarity indices (and the standard convention that a, b + c, and d represent the number of common “on” bits, the mismatches, and the number of common “off” bits, respectively), this situation is equivalent to saying that a = b + c = 0. Hence, indices without d in their denominator (like Jaccard–Tanimoto, Baroni–Urbani–Buser, etc., see Table 1) could not be calculated. Once again, this would not be a problem if we selected a large enough coincidence threshold. Nonetheless, the potential issues that could be caused by this prevalence of low-content similarities (the continuous analogue of the 0-similarities) motivate us to explore another variant of extended continuous indices.

Table 1 Formulae and notations of the extended continuous similarity indices

Variant 2

As noted above, the raison d’être for this new approach is to increase the number of high-content (as opposed to low-content) similarities. Here we also measure similarity according to the distance from the mean, so we also need to calculate the column-average vector (Eq. 2), and we need to form the matrix given in Eq. (4). The key difference is that now we carry out an additional transformation of this matrix before calculating the similarities, namely, we work instead with a new matrix defined by:

$$\begin{gathered} \left( 1 - \left| x_{11} - a_{1} \right|, 1 - \left| x_{12} - a_{2} \right|, \ldots, 1 - \left| x_{1m} - a_{m} \right| \right) \hfill \\ \left( 1 - \left| x_{21} - a_{1} \right|, 1 - \left| x_{22} - a_{2} \right|, \ldots, 1 - \left| x_{2m} - a_{m} \right| \right) \hfill \\ \vdots \hfill \\ \left( 1 - \left| x_{n1} - a_{1} \right|, 1 - \left| x_{n2} - a_{2} \right|, \ldots, 1 - \left| x_{nm} - a_{m} \right| \right) \hfill \\ \end{gathered}$$
(9)

The rationale behind this is quite simple: in Eq. (4) a high similarity will correspond to a very small element, while in Eq. (9) a high similarity will correspond to an element that is close to 1.

From here we proceed as usual, first calculating the vector of column sums:

$$\overline{S} = \left( {\overline{s}_{1} ,\overline{s}_{2} ,...,\overline{s}_{m} } \right)$$
(10)

with:

$$\overline{s}_{i} = \sum\limits_{j = 1}^{j = n} {\left( {1 - \left| {x_{ji} - a_{i} } \right|} \right)} = n - \sum\limits_{j = 1}^{j = n} {\left| {x_{ji} - a_{i} } \right|}$$
(11)

In this case, we will follow a simpler recipe to determine the character of the counters:

$$\begin{gathered} 2\overline{s}_{i} - n > \gamma \to \text{high-content} \hfill \\ n - 2\overline{s}_{i} > \gamma \to \text{low-content} \hfill \\ \left| {2\overline{s}_{i} - n} \right| \le \gamma \to \text{dissimilarity} \hfill \\ \end{gathered}$$
(12)
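Equations (11) and (12) translate into a few lines of code. The sketch below is a plain-Python illustration (the function name, labels and toy data are assumptions made for this example), applied to three objects with two features each.

```python
def classify_variant2(rows, gamma):
    """Label each column of a [0, 1]-scaled matrix following Eqs. (11)-(12).

    Hypothetical helper written for illustration only.
    """
    n = len(rows)
    labels = []
    for i in range(len(rows[0])):
        column = [row[i] for row in rows]
        a_i = sum(column) / n                            # Eq. (3)
        s_bar = n - sum(abs(x - a_i) for x in column)    # Eq. (11)
        score = 2.0 * s_bar - n
        if score > gamma:
            labels.append("high-content")
        elif -score > gamma:
            labels.append("low-content")
        else:
            labels.append("dissimilarity")
    return labels

# Three objects, two features; column 0 is tightly clustered, column 1 is spread out
rows = [[0.90, 0.1],
        [0.80, 0.9],
        [0.85, 0.5]]
print(classify_variant2(rows, gamma=2.0))
```

Note that for data in the [0, 1] interval the summed absolute deviation from the mean cannot exceed n/2, so \(2\overline{s}_{i} - n\) is never negative; in this sketch the low-content branch is therefore never reached for a non-negative threshold, in line with the stated aim of favoring high-content similarities.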

From purely theoretical arguments, we should expect this variant to be better than the previous one, if only because it will lead to indices that can be calculated regardless of the coincidence threshold selected. Nonetheless, it still measures similarity by taking the mean as a reference, so it seems desirable to explore yet another option, which measures similarity directly from the normalized values.

Variant 3

Starting from the scaled data (Eq. 1), we only need to calculate the sums along each column:

$$\sigma_{i} = \sum\limits_{j} {x_{ji} }$$
(13)

Then, we use these numbers to assign the type of counters, analogously to what we did in variant 2:

$$\begin{gathered} 2\sigma_{i} - n > \gamma \to \text{high-content} \hfill \\ n - 2\sigma_{i} > \gamma \to \text{low-content} \hfill \\ \left| {2\sigma_{i} - n} \right| \le \gamma \to \text{dissimilarity} \hfill \\ \end{gathered}$$
(14)

Notice that this method is essentially equivalent to the original binary case (the analogy is clear if we notice that if all the \(x_{ij}\) are either 0 or 1, then \(\sigma_{i} = k_{i}\) in our original notation).
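A minimal sketch of Eqs. (13) and (14) in plain Python is given below; the function name, the labels and the toy data are assumptions made for illustration. Unlike the previous variants, only the column sums of the scaled data are needed.

```python
def classify_variant3(rows, gamma):
    """Label each column of a [0, 1]-scaled matrix following Eqs. (13)-(14).

    Hypothetical helper written for illustration only.
    """
    n = len(rows)
    labels = []
    for i in range(len(rows[0])):
        sigma_i = sum(row[i] for row in rows)   # Eq. (13): column sum
        score = 2.0 * sigma_i - n
        if score > gamma:
            labels.append("high-content")
        elif -score > gamma:
            labels.append("low-content")
        else:
            labels.append("dissimilarity")
    return labels

# Three objects, three features: mostly high, mostly low, and mixed values
rows = [[0.90, 0.10, 0.2],
        [0.80, 0.20, 0.9],
        [0.95, 0.05, 0.5]]
print(classify_variant3(rows, gamma=1.0))
```

With binary input, `sigma_i` is simply the number of 1's in the column, so the same function reproduces the counters of the original binary framework.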

This variant has two potential advantages: its simplicity, and the ability to look at the data from a different point of view (since it does not rely on the calculation of the average). However, the latter can bring a potential problem: by not referring to an average and using the raw normalized values to directly calculate the similarity, this variant should be more prone to depend on the scaling (normalization) procedure. This can lead to pathological behavior, since a normalization method that gives very small values for the \(x_{ij}\) will lead to an input that suffers from the overabundance of low-content similarities seen in variant 1. This will once again imply that we will not be able to calculate all indices, unless we use a very high coincidence threshold.

Having reached this point in any of the previous variants, we can easily classify each column as contributing to the high-content similarity, the low-content similarity, or the dissimilarity between the compared objects. Notice that, as in the binary case, the minimum possible value for \(\gamma\) in all these cases is also equal to \(n\bmod 2\). Once we have classified all the counters, the process to calculate the similarity indices is exactly the same as in the binary case (see Table 1 for the list of all the expressions). Notice that here we can also decide whether or not to include weight functions in the denominator of the indices, leading to the weighted (w) or non-weighted (nw) alternatives, respectively.

The formulae of n-ary continuous indices are enumerated in Table 1 (notice that the cJa and cJa0 differ in that in the latter we do not differentiate between high-content and low-content similarities). Notice how the original formulation of some of these indices (e.g., the asymmetric indices, like Gle, Ja, etc.) distinguished between the high- and low-content similarities, assigning a more important role to the former. As we showed in our original paper, we can generalize these indices by replacing every occurrence of the high-content similarity by the sum of the high- and low-content similarities, which leads to more symmetrical expressions (and novel ways to quantify similarity).

Dataset and descriptors

A large dataset of cytochrome P450 (CYP) 2C9 ligands from PubChem Bioassay (AID 1851) was used as a case study to highlight the applicability of the n-ary indices for continuous variables [30]. Cytochrome P450 enzymes are important mediators of drug metabolism; therefore, they are widely studied in the field of QSAR/QSPR: many compounds have been evaluated against this enzyme family, and its members are recurring targets in machine learning classification studies as well [31]. In total, 12,161 molecules were used after the data curation and preparation step. The dataset contained 4016 inhibitors with a potency of 10 µM or better (actives) and 8145 inactive species. Dragon 7 software was used for the calculation of 2D descriptors [32, 33]. Table 2 shows the 19 different 2D descriptor groups that were calculated in the study (the groups are predefined by the software). We have applied the same numbering system for the descriptor sets as is used in the Dragon software. The excluded numbers (13–20, 26–27) correspond to 3D descriptors. Highly correlated variables (above 0.997) and constant variables were excluded from the sets [34]. The details and descriptions of the different descriptor sets can be found in the Dragon software manual.

Table 2 The applied 2D descriptor packages with the number of descriptors

Statistical analysis

First, we had to normalize the descriptor sets before the calculation of the continuous n-ary indices. Two different methods were used for this step: mean scaling (Eq. 15, which has the form of a min–max scaling) and rank transformation (Eq. 16):

$$y_{mean} \left( {x_{i} } \right) = \frac{{x_{i} - \min X}}{\max X - \min X}$$
(15)

and

$$y_{rank} \left( {x_{i} } \right) = \frac{{rank\left( {x_{i} } \right) - 1}}{\max \,rank\left( X \right) - 1}$$
(16)
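The two normalizations can be sketched in plain Python as follows. The function names are hypothetical, and the treatment of ties in the rank transformation (average ranks) is an assumption, since Eq. (16) does not specify it; constant descriptors are rejected because they were excluded from the sets anyway.

```python
def min_max_scale(values):
    """Eq. (15): map a descriptor column onto [0, 1] via its minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant descriptor cannot be scaled")
    return [(x - lo) / (hi - lo) for x in values]

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1          # average rank of the tie block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def rank_scale(values):
    """Eq. (16): (rank - 1) / (max rank - 1)."""
    ranks = average_ranks(values)
    top = max(ranks)
    if top == 1:
        raise ValueError("need at least two values to rank-scale")
    return [(r - 1) / (top - 1) for r in ranks]

descriptor = [2, 4, 10]
print(min_max_scale(descriptor))   # [0.0, 0.25, 1.0]
print(rank_scale(descriptor))      # [0.0, 0.5, 1.0]
```

The toy column illustrates the practical difference: an extreme value compresses the remaining points toward zero under Eq. (15), whereas the rank transformation spaces them evenly.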

After the normalization of the dataset, 16 different continuous n-ary indices were calculated for the 19 descriptor sets. We have calculated the n-ary indices for the active and inactive groups, as well as the total dataset, corresponding to three different levels of similarity. We assume the active group to be more coherent, based on earlier examples from our research group, where a small number of descriptors was sufficient to define simple multicriteria optimization rules for kinase [35] and GPCR ligands [36], distinguishing them from a larger set of commercially available compounds. By comparison, the inactive set should display a lesser degree of similarity, while the total dataset (containing both the active and inactive sets) should be the most diffuse, i.e. the least similar overall. A further level of comparison was introduced by calculating the absolute differences between the similarities of the active group and the total dataset (from here on, denoted as |active-total| values). Here, a larger difference corresponds to more discriminatory power of the given descriptor set and similarity metric. The datasets (with the 16 continuous n-ary metrics in the columns and the descriptor sets in the rows) were evaluated and compared with factorial ANOVA and the multicriteria decision making tool, sum of ranking differences (SRD) [37]. The SRD procedure is not a simple extension of the Spearman footrule to equal numbers (ties) in the input matrix [38], but contains two validation steps: (i) comparison of ranks with random numbers (CRRN) [39], and (ii) cross-validation [40]. It is a generally applicable multicriteria decision making tool [41], whose applications were demonstrated in a wide range of fields from food chemistry [42] to medical applications [43], as well as politics [44] and sports [45]. The SRD is calculated as the city block (Manhattan) distance (\(d_{kj}\)) between the rank values of the gold standard and the rank values of the original data. In the calculation process, the columns of the dataset are always compared to the reference column. SRD helped to compare and rank the descriptor sets and the n-ary continuous indices. SRD was carried out separately for the similarities of the active and inactive sets, as well as the absolute differences between the actives and the total dataset (|active-total| values). In all cases, the maximum values were used as the reference column. When the novel similarity metrics were compared, the dataset contained those in the columns and the descriptor sets in the rows, while in the comparison of the descriptor sets, this dataset was transposed. It is important to note that in every SRD calculation, the variables with smaller SRD values are the better ones (these are closer to the reference). The scaled (between 0 and 100) and cross-validated SRD values were applied for the final evaluation by ANOVA.
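The core of the SRD calculation described above (without the CRRN and cross-validation steps) can be sketched in plain Python. The function names are hypothetical. Two points are assumptions rather than details taken from the text: tied values receive average ranks, and the scaling to 0–100 divides by the largest possible sum of rank differences for untied ranks, which is the integer part of half the squared number of rows.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def scaled_srd_to_row_max(table):
    """Scaled SRD [%] of every column against the row-wise maximum reference.

    table[r][c]: rows are the ranked objects, columns are the compared variables.
    """
    n_rows = len(table)
    reference = [max(row) for row in table]          # maximum as gold standard
    ref_ranks = average_ranks(reference)
    max_srd = (n_rows * n_rows) // 2                 # assumed normalization constant
    result = []
    for c in range(len(table[0])):
        col_ranks = average_ranks([row[c] for row in table])
        # City block (Manhattan) distance between the two rank vectors
        srd = sum(abs(r - q) for r, q in zip(col_ranks, ref_ranks))
        result.append(100.0 * srd / max_srd)
    return result

# Toy table: 3 rows (e.g., descriptor sets) x 2 columns (e.g., similarity indices)
table = [[0.9, 0.1],
         [0.5, 0.4],
         [0.2, 0.8]]
print(scaled_srd_to_row_max(table))   # [50.0, 100.0]
```

In this toy example the first column is closer to the reference ranking than the second, so it receives the smaller (better) scaled SRD value.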

Factorial ANOVA compares the group averages according to the different factors. For the original datasets (containing the 16 similarity metrics for the active, inactive and the complete dataset of molecules), we used the following factors: the n-ary indices (16 levels), the molecular descriptor sets (19 levels), the different threshold limits (0.05–0.95 as a fraction of the total size of the set, in steps of 0.05) and the groups of molecules (active, inactive, total). For the final comparison of the descriptors based on their SRD values, we used (i) the descriptor sets, and (ii) the actives, inactives and |active-total| groups as factors.

Results and discussion

The calculated n-ary continuous indices were used for descriptor (variable) selection in the case study of a large dataset of CYP 2C9 inhibitors and inactives. Moreover, the 16 different continuous similarity measures (weighted and non-weighted versions) were compared and ranked to find the optimal ones for the task. We have calculated the similarity measures for three different subsets of the dataset: actives, inactives and the complete dataset (total). As the optimal coincidence threshold limit (γ) is case-dependent, a coincidence threshold analysis was carried out in each variant (1, 2, 3) of the similarity calculation to select the best threshold limit. In the next step, the most important descriptor sets and the optimal similarity measures were selected based on the continuous similarity values for the “active”, “inactive” and “total” groups. In the optimal situation, the best similarity measures should return larger similarity values for the group of active ligands, somewhat smaller similarities for the inactive ligands, and the lowest similarity for the most diffuse “total” group. An additional parameter, the absolute difference between the similarities of the “active” and “total” groups, was calculated to select and rank the examined descriptors and similarity indices with the SRD analysis, based on their ability to distinguish the active group within the total dataset. The whole process was carried out for the three different continuous n-ary similarity calculation variants; thus, we could compare their efficiencies for the task and finally select the most applicable one. Figure 1 illustrates the workflow of the study.

Fig. 1 The applied workflow of the study, emphasizing the most important aspects of the analysis

Variant 1

As we have two different normalization procedures for the descriptors (rank and mean normalization), as well as weighted and non-weighted versions of the continuous similarity calculations, the coincidence thresholds were compared for all four cases. Figure 2 shows the dependence of the similarity values on the applied threshold limits (x axis). The similarities of the groups of molecules (“actives”, “inactives” and “total”) are compared in the factorial ANOVA plot. It is clear that the weighted and non-weighted versions have the same shape or pattern, but the ranges of the similarity values are different. In the optimal case, the groups are separated, especially the actives from the total. For the non-weighted version, this separation is slightly better based on the covered similarity range, while the use of rank normalization of the descriptors clearly gives us better results. We have selected 0.70 as the coincidence threshold limit for the further SRD analysis, based on the non-weighted and rank-scaled version of the plots. In this case the active, inactive, and total groups are the farthest from each other.

Fig. 2 Threshold dependence for similarity values in the weighted and non-weighted variants of the continuous indices. First row: non-weighted version; second row: weighted version; first column: mean normalization; second column: rank normalization. Active molecules: blue line; inactive molecules: red line; total dataset: green line. Similarity values are plotted against the different coincidence threshold limits

The continuous n-ary similarity measures were also compared with factorial ANOVA. Molecule groups were selected as the second factor in this case as well. Figure S1 in the Supplementary Information shows the result of the factorial ANOVA, where the similarity values are plotted against the different similarity measures. The same pattern can be noticed as in the case of the threshold limit selection. Again, the rank-normalized version coupled with the non-weighted similarity calculation provides a much better result. Since it would be hard to select the most suitable measures based on the ANOVA plot, the rank-normalized and non-weighted results were used for the SRD analysis for further evaluation. Figure 3 shows the result of the SRD analysis, where the scaled SRD values were used for factorial ANOVA, instead of the original similarity values. The SRD analyses were carried out for the active, inactive and the additionally calculated |active-total| similarity values separately. This latter parameter is relevant because the bigger the difference between the similarities of the active group and the total dataset, the better the final model could be. In the SRD analyses, the maximum values were used as the reference. This means that those similarity measures that had higher values for the different groups of molecules (or for the difference between the actives and the total set) are ranked better. In other words, the best similarity indices should be the most sensitive in finding the similarities amongst the actives and providing bigger differences in similarity between the actives and the total dataset. The results of the three SRD analyses were merged for the final ANOVA. This is justified because the SRD values are scaled to the same range in each case. The smaller the SRD values, the better the similarity measure. We must note a difference between the use of the original similarity values and the calculated SRD values in the ANOVA. For the original similarity values, all of the results with the various coincidence thresholds were used. For the SRD analysis, only the optimal threshold limit with non-weighted similarities and rank normalization was used, based on the conclusions from the ANOVA of the similarity values.

Fig. 3 SRD values [%] to the gold standard for the active and inactive sets, and |active-total| values. The coincidence threshold was determined by variant 1. The continuous extended, individual similarity measures are plotted on the x axis (for their formulae, see Table 1). The similarity of the active group is marked with a blue line, the similarity of the inactive group is marked with a green line and the absolute difference between the active and the total group is marked with a red line

The active, inactive and the |active-total| versions had different behaviors. As cCT1 has the best ranks in the active and inactive cases and still good SRD values for the |active-total| case, we can recommend that measure as the best one for variant 1. The cRR similarity measure can be considered the worst one based on the SRD values of the three cases.

Similarly, the molecular descriptor sets have been compared based on the original similarity values and the SRD values as well. The factorial ANOVA of the original values can be found in the Supplementary Information as Fig. S2. The original similarity values showed that the first half of the descriptor sets have better discrimination between the similarities of the groups. The SRD analysis of the active, inactive and |active-total| similarity sets provided extra information about the best descriptor sets. Figure 4 shows the factorial ANOVA result of the three cases. Descriptor sets 3 and 8 have the smallest SRD values in all the three cases together, although the results are not consistent: where the difference between the active and total is bigger (thus the SRD value is smaller), the inactive group has a worse result. (Descriptor set 3 contains the topological indices, while descriptor set 8 contains the 2D autocorrelation descriptors.) Many descriptor sets cannot rank the |active-total| better than random, e.g., No. 1 and 21–25. Some of the descriptor sets evaluate the active and |active-total| very similarly, e.g., No. 3, 8, 11 and 28. The actives are found to be the most similar according to sets No. 1, 21, 22, 23.

Fig. 4 Factorial ANOVA of the SRD values [%] as a function of descriptor sets (Table 2) in the case of variant 1. The different descriptor sets are plotted on the x axis. The similarity of the active group is marked with a blue line, the similarity of the inactive group is marked with a green line and the absolute difference between the active and the total group is marked with a red line

Variant 2

In the case of variant 2, the same process was carried out as for variant 1. First, we compared the coincidence thresholds with the different pretreatments (rank/mean, weighted/non-weighted). Figure 5 shows the factorial ANOVA of the original similarity values. With the mean transformation, the curve has a long plateau and then drops quickly, while in the case of rank normalization, the curve has an inflection point. At this point, at 0.5–0.55, the similarities of the three groups (active, inactive, total) are the farthest apart: large similarity for the active set, and small similarity for the inactive and total sets. Thus, we have selected 0.50 as the coincidence threshold limit for the further SRD analysis.

Fig. 5 The factorial ANOVA for the four different scenarios of the similarity calculations in the case of variant 2. First row: non-weighted version; second row: weighted version; first column: mean normalization; second column: rank normalization. Active molecules: blue line; inactive molecules: red line; total dataset: green line. Similarity values are plotted against the different coincidence threshold limits

The continuous n-ary similarity measures are compared in Fig. S3 in the Supplementary Information for the four pretreatment scenarios (with all the threshold limits), but we have also tested how the optimal threshold limit affects the result. In Fig. 6A, where we use only the 0.50 threshold limit data, the lines of the different groups are further from each other compared to Fig. S3. However, it would still be hard to find the best similarity measure based on this figure, because most of them are in the same range and the lines are at about the same distance from each other. In the optimal case, the similarity metric should provide higher similarity within the group of actives and smaller similarity within the total dataset: this holds for all metrics. Figure 6B shows that SRD values can more easily select the most prominent continuous measures. Since a smaller SRD value indicates a better measure, we can highlight the cCT1, cCT3 and cCT4 metrics, because they consistently have the smallest SRD values in all three cases (active, inactive, |active-total|).
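For readers less familiar with SRD, the core calculation can be sketched as follows. This is a generic illustration of sum of ranking differences, not the exact pipeline used here: the reference vector is passed in as a parameter because its choice is defined elsewhere in the workflow, the validation steps of the method are omitted, and the function names are hypothetical.

```python
def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def srd(column, reference):
    """Sum of absolute differences between the ranks of a column and of the reference."""
    return sum(abs(a - b) for a, b in zip(ranks(column), ranks(reference)))


def srd_max(n):
    """Largest possible SRD for n objects (fully reversed ranking)."""
    return (n * n) // 2  # n^2/2 for even n, (n^2 - 1)/2 for odd n


def scaled_srd(column, reference):
    """SRD expressed as a percentage of its maximum, so 0 is ideal and 100 is reversed."""
    return 100.0 * srd(column, reference) / srd_max(len(column))
```

A column that orders the objects exactly like the reference gets a scaled SRD of 0 %, and a fully reversed ordering gets 100 %, which is why smaller values indicate better measures.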

Fig. 6

The factorial ANOVA of the original similarity values (A) and the scaled SRD values (B) with the continuous similarity measures and molecule groups as factors. (For the formulae of the similarity metrics, see Table 1.) The similarity of the active group is marked with a blue line, the similarity of the inactive group is marked with a green line and the absolute difference between the active and the total group is marked with a red line

In the case of descriptor set selection, the same analyses were carried out. Figure S4 in the Supplementary Information shows the factorial ANOVA with the descriptor sets and molecule groups as factors for the four preprocessing scenarios. Figure 7 presents the results for the optimal threshold limit (0.50) only, based on the original data and the scaled SRD values.

Fig. 7

The factorial ANOVA of the original similarity values (A) and the scaled SRD values (B) with the descriptor sets and molecule groups as factors. The similarity of the active group is marked with a blue line, the similarity of the inactive group is marked with a green line and the absolute difference between the active and the total group is marked with a red line

The line of the active group is further away from the others with the use of the optimal threshold limit, but we can safely say (based on Fig. 7A) that descriptor sets 21–25 have no discriminative power between the active and the other groups, which is not advantageous for their use in QSAR models. In the descriptor selection phase, those descriptor sets can be more important that are capable of identifying the active molecules as more similar to each other than the whole dataset. Based on Fig. 7B, descriptor sets 3 and 4 have remarkably good SRD values in all three cases (active, inactive and |active-total|). These descriptor sets are the topological indices (3) and the walk and path counts (4), which together contain 118 descriptors. Moreover, all the mean SRD values of the descriptor sets tend to be closer to zero than in the variant 1 results, which is a favorable feature of variant 2.

Variant 3

In the case of variant 3, the calculation is much simpler and less robust than in the other two variants. This resulted in different plots compared to the others, for example in the determination of the optimal threshold limit. Figure 8 shows that mean normalization is not able to select any threshold limit, while rank normalization separates the group similarities better, but only at the beginning of the plot. The “typical” curve shape that we had in the other two cases is now missing. In the case of mean normalization, a linearly or slightly convex decreasing curve can be seen, while in the case of rank transformation the curve plateaus at the end. Therefore, we decided to use the “min” threshold limit, which is the minimum possible coincidence threshold, calculated as n mod 2. In this case, based on the rank-transformed data, the three group similarities can be separated better. The SRD analyses were carried out with the “min” coincidence threshold data.
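A short sketch may help to see where the value n mod 2 comes from. It assumes the convention of the binary n-ary formalism, in which a column with k coinciding one bits among n objects is characterized by |2k − n| and this quantity is compared against the coincidence threshold; that this convention carries over unchanged to the continuous case is an assumption here, and the function names are hypothetical.

```python
def min_coincidence_threshold(n):
    """Minimum coincidence threshold for n compared objects: n mod 2."""
    return n % 2


def possible_deltas(n):
    """All values |2k - n| can take when k of n objects share a one bit
    (binary formalism; assumed here only for illustration)."""
    return sorted({abs(2 * k - n) for k in range(n + 1)})


# |2k - n| always has the same parity as n, so its smallest value is n mod 2
for n in (4, 5):
    assert min(possible_deltas(n)) == min_coincidence_threshold(n)
```

For an even number of objects the smallest attainable value is 0 (a perfectly split column), while for an odd number it is 1, which is why the lowest meaningful threshold depends on the parity of n.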

Fig. 8

The factorial ANOVA for the four different scenarios of the similarity calculations in the case of variant 3. First row: non-weighted version; second row: weighted version; first column: mean normalization; second column: rank normalization. Active molecules: blue line; Inactive molecules: red line; total dataset: green line. Similarity values are plotted against the different coincidence threshold limits

The continuous n-ary similarity measures were also compared. The Supplementary Information contains the factorial ANOVA for the four cases together as Fig. S5. Here we show the result of the factorial ANOVA based on the scaled SRD values in Fig. 9. It is still true that the original similarity values cannot be used for the selection of the best similarity measure. SRD analysis with the selected coincidence threshold limit (“min”) data could extend the results and provide a more consistent picture of the comparison of the indices. Figure 9 clearly shows that cCT1 can be selected as the best continuous similarity measure. On the other hand, all the cCTi measures are somewhat better than the others, especially in returning higher similarities for the active set.

Fig. 9

The result of factorial ANOVA based on the SRD values [%] in the case of variant 3. The continuous similarity measures are plotted on the x axis (for their formulae, see Table 1). The similarity of the active group is marked with a blue line, the similarity of the inactive group is marked with a green line and the absolute difference between the active and the total group is marked with a red line

The molecular descriptor sets were compared with the same workflow as in the case of variants 1 and 2. The results of the factorial ANOVA based on the original similarity values with the four different pretreatment scenarios are shown in the Supplementary Information as Fig. S6. Finally, the results of the factorial ANOVA based on the scaled SRD values are shown in Fig. 10. In Fig. S6, the similarity values calculated with this variant have no discriminative power. Even the SRD analysis could not select the best sets properly, because it was not sensitive enough. However, the inactive molecules can be ranked worse for almost all descriptor sets, with two definite but diverse exceptions (No. 12 and 29). As variant 3 is a simpler and less robust version of the calculation, it could not provide a definite selection for the task.

Fig. 10

Factorial ANOVA of SRD values [%] in the case of variant 3. The different descriptor sets are plotted on the x axis. The similarity of the active group is marked with a blue line, the similarity of the inactive group is marked with a green line and the absolute difference between the active and the total group is marked with a red line

Conclusion

We have generalized our recently introduced n-ary similarity indices for vectors with continuous components. This greatly expands the domain of applicability of the extended similarity framework, which can now be applied to the selection of molecular descriptors for QSAR/QSPR modeling. We proposed three ways to calculate the extended (or n-ary) continuous similarity indices, depending on the way of defining the similarity between the different elements to be compared. We also considered how different factors impact the characteristics of these indices, including the way of normalizing the data, and the inclusion or omission of weight factors in the denominators of the similarity indices. A case study of a publicly available dataset of CYP 2C9 inhibitors (actives) and inactives was used for comparing the various possible similarity metrics and coincidence thresholds (cutoff values to determine whether a certain variable/descriptor contributes to the similarity or dissimilarity of the given dataset).

The first variant for the calculation of extended continuous similarities is based on how different the elements of an array (in this case, a column vector) are from their average. This is an intuitive measure that can be easily related to the original n-ary formalism for binary fingerprints, but it has some important disadvantages. For instance, indices without low-content similarity counters in the denominator could be ill-defined for relatively small values of the coincidence threshold. Overall, for the descriptor selection case study, cCT1 showed the best ranks in the active and inactive cases and still good SRD values for the |active-total| case, so we can recommend this measure as the best one.

The second variant attempts to remedy the issues of Variant 1 by converting the low-content similarities to high-content similarities, but it also quantifies similarity by measuring how distant the different components are from the corresponding column average. Here, the cCT1, cCT3 and cCT4 indices consistently have the smallest SRD values in all three cases (active, inactive, |active-total|). Descriptor sets 21–25 have no discriminative power between the active and the other groups. On the other hand, descriptor sets 3 (topological indices) and 4 (walk and path counts) have remarkably good SRD values in all three cases (together, these sets contain 118 descriptors).

Finally, the third variant takes a different approach to measuring the similarity between the elements of a set, by directly assessing how related the components are in each column of the normalized matrix (just as is done to calculate the counters in the binary case). Here, cCT1 can once again be selected as the best continuous similarity measure. More generally, all the cCTi measures (i = 1, …, 4) are somewhat better than the others, especially in returning higher similarities for the active set. However, this variant places almost all descriptor sets in the same position, so it is harder to give a precise indication of the best conditions for this option.

Overall, this work fills a gap in the applicability of extended similarity indices, which can now handle more general types of input. While we have shown here different ways in which one can handle continuous inputs, Variant 2 seems to be the most robust of these options, mainly because the original similarity indices used in cheminformatics tend to favor high-content similarities (1-similarity in the binary case). This means that with this variant we have access to a more diverse toolkit of extended similarity measures. We have shown that the extended similarity measures, combined with ANOVA and SRD methods, can be successfully applied for the selection of continuous molecular descriptor sets, but this formalism also opens the way for other applications, including the analysis of three-dimensional structures and conformations of biological ensembles, since we could directly represent them via their coordinates in real space. We are currently exploring this line of research by studying the different conformations obtained via molecular dynamics simulations. These results will be presented elsewhere in due course.