Introduction

A natural outcome of observing two objects is a judgment of how similar they are [1]. Although we deal with this elusive concept daily, similarity is hard to define or quantify properly because of its multidimensional character. The similarity between molecules plays a key role in cheminformatics [2, 3]. Furthermore, it is also significant in modern drug design and the material industry [4,5,6]. For example, Krasowski and coworkers have used the similarity approach to assess the structural similarity of a wide spectrum of clinically important drugs to the target molecules of DOA/Tox screening tests [7]. Also, the authors in [8] have shown that compounds producing cross-reactivity in steroid hormone immunoassays have a high degree of structural similarity to the target hormone. Additionally, a similarity approach was used to design novel zeolite materials for separations based on adsorption [9]. This strong interest in molecular similarity can be attributed largely to one renowned postulate, the similarity property principle (SPP) [10]. This view of the structure–property relationship implies that compounds with high structural similarity are likely to have similar physico-chemical properties. The SPP therefore rests on a simple and logical foundation. However, the relationship between structural features and a physico-chemical property (or some type of activity) of the corresponding molecule is complex and not always apparent. Consequently, the SPP has stumbling blocks, such as activity cliffs [11,12,13,14].

To measure the similarity of two molecules, the molecular structure must first be quantified. For this reason, diverse ways of encoding structural information have been proposed. Presently, an enormous number of molecular descriptors is available, and this number keeps growing [15, 16]. Descriptors range from simple ones, such as counting descriptors, up to complex quantum-chemical molecular descriptors [17]. Molecular descriptors are usually categorized according to the dimensionality of the molecular representation needed to calculate them. Thus, there are 1D, 2D, 3D, and higher-dimension descriptors. Besides similarity investigations, molecular descriptors have also found important applications in QSPR/QSAR studies [18, 19]. A special place in similarity-related calculations is reserved for structural fingerprints [20, 21]. In their simplest form, these are numerical strings composed of zeros and ones. More precisely, a one in a bit-string represents the presence of a certain structural feature in a molecule, while a zero signifies its absence, in this way producing molecule-specific linear bit patterns. The molecular structure is thereby converted into a binary vector that can be manipulated further. This type of reasoning was used to create a novel method for representing and analyzing 3D protein–ligand binding interactions. In other words, a binary fingerprint was constructed to model intermolecular connections, where 1 denotes a certain bond between protein and ligand and 0 denotes the lack of a bond. These are called interaction fingerprints [22, 23]. One of the most popular structural fingerprints is the extended-connectivity circular fingerprint [24]. Even though this descriptor was not originally developed as a binary sequence, it can be transformed into a binary analog if necessary.

The similarity of two molecules is usually perceived as the degree of coherence between their structural features. However, there is another aspect of molecular similarity, usually referred to as chemical similarity. The most obvious manifestation of chemical similarity is the case of compounds that exert similar activities but are structurally quite different. The opposite situation is even more frequent, i.e., structurally similar molecules that do not show similar physico-chemical properties or activities [25, 26]. For example, Boström and others have found a significant probability that minor modifications to one ligand, in a pair of structurally similar ligands, will produce large changes in the binding sites and, hence, changes in their activities [27]. The main obstacle regarding chemical similarity is its quantification. While structural similarity has been heavily studied, chemical similarity is poorly understood. Several attempts have been made to examine the chemical similarity of molecules. One of them was made by Xenides and coworkers, who applied an information theory approach to generate clusters of chemically similar compounds [28].

In this study, a novel method is introduced to quantitatively determine the chemical similarity of molecules. More precisely, a plain binary fingerprint of a molecule was developed by encoding its physico-chemical properties. Within the present paper, we examine the chemical similarity of the compounds depicted in Fig. 1. This set consists of 13 molecules that induce diverse physiological responses and was also studied in [28]. These compounds cause pleasant, euphoric, and analgesic effects. Several published papers have shown that some of these molecules produce similar effects, or that they act as antagonists [29,30,31,32]. Since these molecules are well-known and widespread, and some of them are consumed daily, it is of great interest to examine the relationships among them, i.e., to quantify their chemical similarity. In the following section, we present a procedure that enables this.

Fig. 1 The compounds with physiological effects

Computational methodology

Construction of a fingerprint

The very first step in constructing binary fingerprints for assessing the chemical similarity of compounds is to provide several physico-chemical properties for the underlying molecules. The more diverse the supplied properties, the better the chemical description of a molecule. Because the available experimental data are limited, this can be a difficult task. In such cases, experimental values may be replaced with properties provided by, e.g., quantum chemistry computations at a sufficiently high level of theory. For our set of compounds, experimental values of the melting point (MP), logP, and pKa were used for this purpose and are collected in Table 1. These values were retrieved from the PubChem [33] and DrugBank [34] chemistry data repositories. The reason for using this set of physico-chemical properties is that experimentally determined values were available for all thirteen molecules under consideration. Moreover, both logP and pKa are known as high-quality indicators of physiological effects. To avoid a dependence of the fingerprint on dataset size, we do not standardize the physico-chemical properties in this approach. The advantage of this is twofold: the resulting molecular fingerprint remains the same within any dataset of compounds, provided the same physico-chemical properties are used and arranged in the same order, and the developed fingerprint is highly informative. The latter means that the obtained fingerprint is not sparse, which might be the case when, for example, min–max scaling is applied. Therefore, we have developed fingerprints based on the physico-chemical properties in the following manner.

Table 1 The experimental physico-chemical properties that are used to construct binary fingerprints

To encode as much chemical information as possible, values from Table 1 are rounded to two decimals and then multiplied by 10^2 (i.e., by 100). In this way, experimental values are converted into integers without losing valuable information. Then, for each property, five digits are reserved for encoding into a string, since none of our values has more than five digits. If the obtained integer has fewer than five digits, one or more zeros are added at the beginning to get a string of five digits. By completing this step, all used values are encoded into numerical strings of five digits. Further construction of molecular fingerprints based on physico-chemical properties requires the transformation of these integer strings into binary strings. This step is done using the binary coded decimal (BCD) system. Namely, in this type of encoding every digit in a five-digit string (even zero) is replaced by the corresponding 4-bit binary code (see Table 2). This conversion transforms the experimental value into a binary string with a length of 20 bits.

Table 2 The BCD system is used to encode digits

An additional bit is added at the beginning of a string to encode the sign of an experimental value. Zero denotes positive, while one stands for the negative sign. In this way, every physico-chemical property from Table 1 is transformed into a 21-bits-long binary code. Finally, by merging obtained strings for MP, logP, and pKa, in this order, the molecular fingerprint based on the physico-chemical properties, with the length of 63 bits, is constructed. In Fig. 2 this procedure is depicted in the case of the molecule of cocaine.

Fig. 2 The procedure of constructing the binary 63-bit fingerprint based on physico-chemical properties of the cocaine molecule
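The encoding steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: the function names `encode_property` and `property_fingerprint` are hypothetical, half-up rounding is an assumption (the text only says that values are rounded to two decimals), and the numbers in the usage note are illustrative placeholders rather than entries of Table 1.

```python
from decimal import Decimal, ROUND_HALF_UP


def encode_property(value, digits=5):
    """Encode one property value as a sign bit followed by BCD digits.

    Returns a string of 1 + 4 * digits bits (21 bits for five digits).
    """
    # Round to two decimals (half-up rounding is an assumption) and
    # multiply by 10^2 so that the value becomes an integer.
    rounded = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    scaled = int(rounded * 100)

    # Sign bit: 0 for positive values, 1 for negative values.
    sign_bit = "1" if scaled < 0 else "0"

    # Pad with leading zeros to the reserved number of digits.
    padded = str(abs(scaled)).zfill(digits)
    if len(padded) > digits:
        raise ValueError(f"{value} does not fit into {digits} digits")

    # Replace every decimal digit by its 4-bit BCD code.
    bcd = "".join(format(int(digit), "04b") for digit in padded)
    return sign_bit + bcd


def property_fingerprint(mp, logp, pka):
    """Concatenate the codes for MP, logP, and pKa, in this order (63 bits)."""
    return "".join(encode_property(v) for v in (mp, logp, pka))
```

For example, `encode_property(98.0)` turns 98.0 into the integer 9800, pads it to `09800`, and returns the sign bit `0` followed by the BCD codes `0000 1001 1000 0000 0000`. Calling `property_fingerprint(98.0, 2.3, 8.61)` concatenates three such 21-bit codes into a 63-bit string.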

The authors of the paper [35] have studied similarity coefficients that are usually utilized in cheminformatics investigations. They have found that some of the examined metrics exhibit better characteristics than others. Namely, out of all coefficients that yield the similarity results within [0,1]-range, Jaccard (Ja) [36], Jaccard-Tanimoto (JT) [37], Gleason (Gle) [38], Sokal-Sneath (SS) [39], and Consonni-Todeschini (CT) [40] have shown satisfactory performance in similarity calculations related to real and simulated cheminformatics binary data. They are defined as follows:

$$Ja=\frac{3a}{3a+b+c}$$
(1)
$$JT=\frac{a}{a+b+c}$$
(2)
$$Gle=\frac{2a}{2a+b+c}$$
(3)
$$SS=\frac{a}{a+2b+2c}$$
(4)
$$CT=\frac{\ln\left(1+a\right)}{\ln\left(1+a+b+c\right)}$$
(5)

In Eqs. (1)–(5), a is the number of 1 bits that the fingerprints of molecules A and B have in common, b is the number of 1 bits present in fingerprint A but not in B, and c is the number of 1 bits found in fingerprint B but not in A. As can be seen, most of these indices differ only in the weights that they give to some parts of the fingerprints during comparative analysis. Namely, the Ja and Gle coefficients emphasize the features that the fingerprints share, while SS emphasizes their differences. To analyze the chemical similarity of the compounds depicted in Fig. 1, we have employed the Ja, JT, Gle, SS, and CT asymmetric similarity indices to measure the coherence between fingerprints based on physico-chemical properties. For all these computations, a Python script was written using the RDKit cheminformatics package [41]. In addition, we have calculated the extended versions of our similarity indices. The extended similarity indices allow the simultaneous comparison of more than two molecules. These similarity coefficients are entirely general, and they do not depend on the fingerprints used [42, 43]. In the original paper, their features were investigated by sum of ranking differences (a statistical method that we also use here to get better insight into our results) and ANOVA. The definitions of the extended forms of Ja, JT, Gle, SS, and CT and the corresponding Python scripts for their calculation are freely available at: https://github.com/ramirandaq/MultipleComparisons.
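As an illustration of Eqs. (1)–(5), the sketch below counts a, b, and c for two bit strings and evaluates the five pairwise coefficients. It is written in plain Python and is not the RDKit-based script mentioned above; the function names are hypothetical, and raising an error for two all-zero fingerprints is an assumption, since the equations are undefined in that case.

```python
import math


def bit_counts(fp_a, fp_b):
    """Return (a, b, c) for two equally long strings of '0' and '1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(x == "1" and y == "1" for x, y in zip(fp_a, fp_b))  # 1 in both
    b = sum(x == "1" and y == "0" for x, y in zip(fp_a, fp_b))  # 1 only in A
    c = sum(x == "0" and y == "1" for x, y in zip(fp_a, fp_b))  # 1 only in B
    return a, b, c


def similarity_indices(fp_a, fp_b):
    """Evaluate Ja, JT, Gle, SS, and CT as defined in Eqs. (1)-(5)."""
    a, b, c = bit_counts(fp_a, fp_b)
    if a + b + c == 0:
        # Assumption: treat two all-zero fingerprints as an error.
        raise ValueError("indices are undefined when neither fingerprint has a 1 bit")
    return {
        "Ja": 3 * a / (3 * a + b + c),
        "JT": a / (a + b + c),
        "Gle": 2 * a / (2 * a + b + c),
        "SS": a / (a + 2 * b + 2 * c),
        "CT": math.log(1 + a) / math.log(1 + a + b + c),
    }
```

For the toy fingerprints `"110100"` and `"100110"`, the counts are a = 2, b = 1, and c = 1, so JT = 2/4 and Ja = 6/8.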

Sum of ranking differences (SRD)

The SRD is a novel general-purpose statistical procedure to compare models, methods, analytical techniques, etc. [44]. So far, it has been successfully used on many different problems, e.g., for the correct split of training and test sets in QSAR, column selection in supercritical fluid chromatography, and analysis of chromatographic retention data [45,46,47]. Here, we use this tool to compare the similarity results obtained by different metrics. This technique is available as an MS Office Excel macro at http://aki.ttk.hu/srd/. In the input matrix, the objects (in the present case molecules) are arranged in rows and the variables (models or methods, in the present case similarity coefficients) are arranged in columns. For each method (similarity coefficient), the ranking of the objects is compared with the ranking given by experimental or reference values. If a standard value is not available, as in this case, then the mean value over all methods (similarity indices) may be used. Scaling the SRD values between 0 and 100 makes it possible to compare these values across different methods/models. The full description of the SRD calculation and its validation may be found elsewhere [44, 48]. In general, the closer the SRD value is to zero (i.e., the closer the ranking is to the gold standard), the better the method. The proximity of SRD values indicates the similarity of the methods and thus, in our case, the similar performance of the tested similarity coefficients.
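The ranking step can be sketched as follows. This is a simplified illustration and not the Excel macro referenced above: the function names are hypothetical, tied values receive the mean of their ranks (an assumption), the row mean serves as the reference, and the scaling divides by the largest possible sum of rank differences for n objects, which is floor(n^2/2).

```python
def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def scaled_srd(matrix, names):
    """Scaled SRD (0-100) per column; rows are objects, columns are methods."""
    n = len(matrix)
    # No reference values are available, so use the row mean over all methods.
    reference = [sum(row) / len(row) for row in matrix]
    ref_ranks = average_ranks(reference)
    srd_max = n * n // 2  # largest possible sum of absolute rank differences
    result = {}
    for col, name in enumerate(names):
        col_ranks = average_ranks([row[col] for row in matrix])
        srd = sum(abs(r - q) for r, q in zip(col_ranks, ref_ranks))
        result[name] = 100 * srd / srd_max
    return result
```

With the toy matrix `[[0.1, 0.9], [0.2, 0.7], [0.3, 0.8], [0.4, 0.2]]` and the column names `["A", "B"]`, column B ranks the four objects more like the row mean does than column A, so it receives the smaller scaled SRD.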

Results and discussion

The developed plain binary fingerprints based on physico-chemical properties of the compounds depicted in Fig. 1 have been mutually compared using the Ja, JT, Gle, SS, and CT similarity coefficients and their extended versions. The calculated similarities, in percentages, are given as heatmaps in Figs. 1S-5S in the Supporting Information, while Table 3 summarizes the obtained results. As one may see, all five applied metrics yielded comparable results, that is, the same trends have been identified, especially in the case of the extended indices. This is expected considering the closeness of their definitions. With an average value of 57%, the highest similarities are obtained by the CT coefficient, while the SS index gives the lowest mean value (14%). In the case of the extended indices, four indices give a chemical similarity of 50%, while the extended CT index shows a similarity of 63%. The Ja and CT indices yielded comparable results by both approaches, standard pairwise and extended, while for the other measures higher similarities are obtained by their extended versions. As for the data variation, most of the similarity values demonstrate comparable scattering. The highest standard deviation is obtained for the Ja index, while the SS coefficient gives the lowest data dispersion. It was found that, on average, the compound chemically most similar to the other molecules is adrenaline. Its mean chemical similarities by Ja, JT, Gle, SS, and CT are 58%, 32.7%, 48.5%, 19.8%, and 66.5%, respectively. On the other hand, with 39.8%, 18.9%, 30.9%, 10.7%, and 50.7% of similarity, LSD is the least similar to the other compounds. A low similarity, comparable to that of LSD, is also obtained for THC by the SS and CT indices.

Table 3 The results of chemical similarity analysis of compounds from Fig. 1 by five different metrics

All five similarity coefficients have found that morphine and methadone are chemically the most similar compounds, while cocaine and caffeine show the lowest chemical similarity. Namely, the obtained values of Ja, JT, Gle, SS, and CT for the morphine-methadone similarity are 77%, 53%, 69%, 36%, and 80%, respectively, while these values for the cocaine-caffeine similarity amount to 14%, 5%, 10%, 3%, and 23%, respectively. Such a high chemical similarity between methadone and morphine is in accordance with the experimental findings, where these two opioids are found to have similar physiological responses. Moreover, both have an analgesic effect, and they are used as substitution painkillers [49]. This finding supports the assumption that molecules with similar physico-chemical properties should stimulate similar physiological reactions. On the other hand, such similarity between methadone and morphine, within this set of molecules, is quite surprising, considering the high structural similarity of morphine and codeine. Namely, it is reasonable to expect that morphine and codeine show the highest chemical similarity since their structures differ in only one methyl group. However, the chemical similarity of these two compounds ranks as the fifth highest among all similarities, by all coefficients, and it amounts to 68%, 41%, 58%, 26%, and 72% according to the Ja, JT, Gle, SS, and CT index, respectively. Such a result may be attributed to the big differences in MP and logP, while their pKa values differ by only 0.01. Even though both molecules cause analgesic effects in the human body, it was found that the magnitude and duration of these effects differ significantly [50].

The other two molecules that also exhibited high chemical similarity (ranked as the second highest) are adrenaline and nicotine. Both compounds are known as euphoric feeling inducers and their similarities calculated by Ja, JT, Gle, SS, and CT are 73%, 48%, 65%, 31%, and 78%, respectively. Even though these two molecules share some structural features, like an aromatic ring and a nitrogen atom, structural coherence between adrenaline and mescaline, for example, is more noticeable. However, their chemical similarity is ~ 9% lower on average, compared to adrenaline-nicotine similarity. On the other hand, the effects of nicotine on the heart and systemic blood pressure are almost identical to those of adrenaline [51].

The lowest chemical similarity was detected for cocaine and caffeine by all five metrics, as stated above. For example, the SS gives only 3% of similarity between these two molecules. Such a finding is expected considering big differences in all three physico-chemical properties (Table 1). Also, they belong to different types of drugs regarding the sensation they cause in our body, i.e., cocaine is classified as a “hard” drug, while caffeine is marked as a “soft” drug.

As we previously mentioned, the same trends are identified by all coefficients and the differences come only from the amount of computed similarity. Since the Jaccard-Tanimoto coefficient is the most used index in the cheminformatics community, we decided to present the similarity assessments obtained by this metric. In Fig. 3 the chemical similarity of our compounds is depicted. The graph is constructed to reflect the similarity of molecules in the cases where it amounts to ≥ 40%. Namely, an edge is established between two nodes (molecules) only if their chemical similarity is ≥ 40%. This high threshold is chosen to show molecules with a high chemical similarity since the average similarity calculated by JT is 25% (Table 3). At the level of 40%, around 50% of the molecules are connected, and two clusters of similar molecules are observed. The first group consists of methadone, morphine, and codeine, which exhibit a high chemical similarity among themselves. The second, loosely connected group includes adrenaline, nicotine, and fentanyl. These two clusters of molecules are related through the connection between methadone and nicotine.

Fig. 3 The chemical similarity of compounds calculated by the Jaccard-Tanimoto similarity coefficient. Note that the graph is constructed to reflect the similarity of molecules in the cases where it amounts to ≥ 40%. The edges are labeled with the percentage of similarity between two compounds
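The thresholding that underlies such a graph can be sketched as below. The function names are hypothetical, and the usage note relies only on the four JT values quoted in the text (53%, 41%, 48%, and 5%). The full similarity matrix is not reproduced here, so the result covers only part of Fig. 3 and, for instance, does not contain the methadone-nicotine edge.

```python
from collections import defaultdict


def similarity_graph(similarities, threshold=0.40):
    """Build an adjacency map with an edge for every pair at or above the threshold.

    similarities maps (molecule_a, molecule_b) to a similarity in [0, 1].
    """
    graph = defaultdict(set)
    for (mol_a, mol_b), value in similarities.items():
        if value >= threshold:
            graph[mol_a].add(mol_b)
            graph[mol_b].add(mol_a)
    return dict(graph)


def connected_components(graph):
    """Return the groups of molecules linked by at least one chain of edges."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components
```

Passing `{("morphine", "methadone"): 0.53, ("morphine", "codeine"): 0.41, ("adrenaline", "nicotine"): 0.48, ("cocaine", "caffeine"): 0.05}` to `similarity_graph` keeps three edges and drops the cocaine-caffeine pair. `connected_components` then returns one group with morphine, methadone, and codeine and another with adrenaline and nicotine.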

In Table 4 the correlation coefficients (R) between similarity results calculated by Ja, JT, Gle, SS, and CT are presented. The similarities obtained by these coefficients are highly correlated. The highest linear correlation is observed between Ja and Gle with R = 0.9981, while the lowest correlation, R = 0.9545, is between SS and CT. Such a good correlation between Ja and Gle comes from the fact that these indices differ only by the weights they set on the same features, i.e., on the bits 1 present in the same place in two fingerprints.

Table 4 The Pearson correlation coefficients between similarity values computed by five different metrics
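For reference, a Pearson correlation coefficient of the kind reported in Table 4 can be computed as sketched below. The function name is hypothetical. The usage note uses only the four JT and Gle values quoted in the text, so its result illustrates the calculation and is not an entry of Table 4, which is based on all pairwise similarities.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Calling `pearson_r([0.05, 0.41, 0.48, 0.53], [0.10, 0.58, 0.65, 0.69])` compares the quoted JT and Gle values for the cocaine-caffeine, morphine-codeine, adrenaline-nicotine, and morphine-methadone pairs. The result is close to 1, in line with the high correlations reported for the complete data.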

To get better insight into the obtained similarity results, we have employed the SRD statistical procedure. As described in Sect. 2.2, this tool enables us to compare similarity coefficients. Since a reference value is not available, the average value of all five indices has been used as an “ideal” standard for each molecule, for ranking purposes. The calculated SRD values of every similarity coefficient are presented in Table 3. The Ja, JT, Gle, and SS indices yielded the same SRD values, while the SRD for the CT was 92. This finding shows that our coefficients are useful for the similarity assessment of fingerprints based on physico-chemical properties, especially the first four indices since they produced SRD values that are close to zero. Also, these results reveal that Ja, JT, Gle, and SS show similar performance, compared to CT, which is in accordance with their definitions but cannot be concluded from the previous results. It is interesting to note that the Ja, JT, Gle, and SS indices produce different rankings in the case of interaction fingerprints in virtual screening scenarios [23]. The validation of the SRD procedure has been carried out by performing the comparison of ranks with 78 random numbers (CRRN). This is a randomization test that gives a distribution of SRD values with randomized ranks. Based on this validation, it can be concluded whether the SRD value characterizing a coefficient overlaps with those obtained with random numbers (in that case, the coefficient is statistically not distinguishable from randomly assigned ranks). The obtained results are depicted in Fig. 4 with a magnified view.
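One way to see why Ja, JT, Gle, and SS give identical SRD values is to rewrite Eqs. (1)–(4) in terms of a single ratio. The symbol r is introduced here only for illustration and is defined for pairs with b + c > 0:

$$r=\frac{a}{b+c}$$

$$JT=\frac{r}{1+r},\quad Ja=\frac{3r}{1+3r},\quad Gle=\frac{2r}{1+2r},\quad SS=\frac{r}{2+r}$$

Each of these expressions increases strictly with r, so the four indices order any set of fingerprint pairs in the same way. Their ranks therefore coincide, and with a common reference ranking their SRD values are equal. The CT index in Eq. (5) depends on a and on a + b + c separately and cannot be written as a function of r alone, so its ranking may differ.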

Fig. 4 X and left Y axes: The percentage of scaled SRD for different similarity coefficients (scaled between 0 and 100, i.e., put on the same scale as the random numbers). The scaled SRD for Ja, JT, Gle, and SS is 0.1972% (red) and for CT is 3.0243% (blue). Right Y-axis: The frequencies of random SRD are plotted (the black curve corresponds to random SRD distribution)

As can be seen, the scaled SRD of similarity coefficients is extremely low, compared to random SRD. Most importantly, there is no overlap between the left side (real numbers) and the right side (random numbers). The location of the scaled SRD for similarity coefficients (located between 0 and 4) is far from the SRD for random numbers (located between 50 and 81). It can be concluded that the probability that the real variables are random is negligible.
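The idea behind the randomization test can be illustrated with a short simulation. This is a sketch and not the validated CRRN procedure of [44, 48]: the function name, the number of trials, and the fixed seed are assumptions, and ties are ignored. The number of objects, 78, follows the text above.

```python
import random


def random_scaled_srd(n_objects, n_trials=2000, seed=0):
    """Scaled SRD values (0-100) of randomly permuted ranks against a fixed reference."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    reference = list(range(1, n_objects + 1))
    srd_max = n_objects * n_objects // 2  # largest possible sum of rank differences
    values = []
    for _ in range(n_trials):
        shuffled = reference[:]
        rng.shuffle(shuffled)
        srd = sum(abs(r - s) for r, s in zip(reference, shuffled))
        values.append(100 * srd / srd_max)
    return values
```

For `random_scaled_srd(78)`, the simulated values scatter around two thirds of the maximum, which follows from the expected sum of rank differences of a random permutation, (n^2 - 1)/3. This agrees with the position of the random distribution described above and lies far from the scaled SRD values of the five coefficients.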

Conclusion

Chemical similarity is an important aspect of the similarity between two molecules. Here, we have examined the chemical similarity of 13 compounds that elicit physiological responses. To do so, we have developed plain binary fingerprints based on physico-chemical properties, i.e., on melting point, logP, and pKa. The Jaccard, Jaccard-Tanimoto, Gleason, Sokal-Sneath, and Consonni-Todeschini similarity coefficients have been used to calculate the similarity of the fingerprints. It was found that adrenaline is, on average, the most similar to the other molecules, while LSD and THC are the least similar to the other compounds. All five similarity coefficients have found that morphine and methadone are chemically the most similar compounds, while cocaine and caffeine show the lowest chemical similarity. The sum of ranking differences statistical procedure has shown that the applied similarity indices can be successfully used for the similarity analysis of the developed binary fingerprints. The advantage of the applied methodology is that it summarizes the information on the physico-chemical features in a simple and straightforward way, which enables the calculation of the chemical similarity of the compounds. Therefore, this approach is a useful tool that can provide information on the chemical similarity of molecules using only a few available physico-chemical properties.