1 Introduction

Quantitatively comparing two protein tertiary structures to evaluate their similarity is a major challenge. A successful comparison can answer important questions in structural biology, cell biology, and biochemistry [1]. In particular, it is believed that functional similarity can be predicted from the structural similarity between proteins. The 3D structure of a protein is obtained by experimental methods such as X-ray or electron crystallography and, in some cases, NMR [2]. If no experimental structure of a protein is available, computational structure prediction methods that rely on sequence similarity can be used. One such technique, homology modeling, uses the structure of a known protein as a template to predict the structure of an unknown protein [3]. If structural information on the proteins exists, methods have been developed to compare the structures [4–13].

Examples of methods based on numerical techniques to predict structural information are: SCOP (Structural Classification of Proteins) [9, 10], CATH (Class, Architecture, Topology and Homologous superfamily) [13], TM-SCORE [14, 15], STRUCTAL software [16], FSSP (Families of Structurally Similar Proteins) [17] and DALILITE [18].

Recently, in Ref. [19], the similarity value (SV), as a geometrical-based structural property, was introduced as a new protein similarity measure. The SV is an alternative measure to the root-mean-square deviation (RMSD). ‘SV’ is defined as a normalized RMSD of the protein distances in reciprocal space so that the protein’s atomic coordinates are mapped into the corresponding Fourier space. There are theorems in mathematics that allow us to perform this task by using Wigner-D functions and then by using crystallography concepts to arrive at the structure factor [19]. The advantage of defining SV in the reciprocal space is that it solves the known problem of different sizes of two compared proteins. Thus, there is no need to use partial or local similarity tests. An example of using a method that relies on the partial RMSD for computing the similarity value is STRUCTAL software [16]. SV is also sensitive to protein topology (for a brief explanation regarding the differences between SV and other methods see [19]).

In this paper, we propose an improved SV definition, called the ‘weighted similarity value’ (WSV) in order to add some important physical properties required to adequately compare any two proteins. We define WSV by adding a lower limit on the reciprocal space dimension for the two proteins that are being compared. This constraint ensures that we do not lose any information in mapping from the protein’s spatial space to the reciprocal space. We also consider the masses of the atoms as a physical property in the protein shape function. The importance of adding mass to the shape function comes from the structure factors in X-ray scattering data [20]. Thus, adding the atomic masses to the shape function provides more reliable computed structure factors as we show later in this paper.

We compare the results regarding protein similarity obtained by WSV with NRMSD, DALILITE, and TM-SCORE, and we show that our results are in good agreement with these methods. DALILITE is a multiple alignment method, which is based on the alignment of the amino acid sequences and the secondary structure states (helix, sheet, coil) of the two proteins being compared [18]. Since DALILITE is a multiple alignment method, the results given by DALILITE have multi-valued z-scores and corresponding similarity values between two proteins [18].

The template modeling score (TM-SCORE) is a global fold similarity measure between two protein structures with different tertiary structures, and it is independent of protein size. The TM-SCORE is a normalized measure with a value in the [0,1] range; when it is equal to 1, the two proteins are similar [21].

2 Methods

RMSD is defined as a dissimilarity parameter between two proteins as follows:

$$ RMS{D}^2=\frac{2}{N\left(N-1\right)}\sum_{i<j}^{N}\sum_{j=2}^{N}{\left({d}_{ij}-{d}_{ij}^{\prime}\right)}^2=\frac{2}{N\left(N-1\right)}\sum_{i<j}^{N}\sum_{j=2}^{N}\left({d}_{ij}^2+{d}_{ij}^{\prime 2}-2\,{d}_{ij}{d}_{ij}^{\prime}\right) $$
(1)

where N is the number of atoms in each protein, and \( d_{ij} \) and \( d_{ij}^{\prime} \) are the elements of the distance matrices between the atomic positions of the first and the second protein, respectively. Here, we assume that the two proteins in question have the same number of atoms. If the numbers of atoms of the two proteins are not equal, a partial RMSD definition should be used. RMSD is a semi-bounded parameter (between zero and infinity). We now define the ‘normalized RMSD’ (NRMSD) as a bounded similarity parameter between two proteins. First, we introduce the following auxiliary parameter:

$$ {D}^2=N\left(N-1\right)\times RMS{D}^2=2\sum_{i<j}^{N}\sum_{j=2}^{N}\left({d}_{ij}^2+{d}_{ij}^{\prime 2}-2\,{d}_{ij}{d}_{ij}^{\prime}\right) $$
(2)

and define:

$$ NRMSD = \frac{1}{2}\ \left(1 - \frac{D^2}{d^2+{d}^{\prime 2}}\right) $$
(3)

where \( d^2=2\sum_{i<j}^{N}\sum_{j=1}^{N}d_{ij}^2 \) is the squared vector length (sum of the squares of the array elements), as is the case for \( d^{\prime 2} \). If the two proteins are not correlated, we have \( D^2=d^2+d^{\prime 2} \) and NRMSD = 0. If the two proteins are maximally correlated (i.e., the two proteins are the same), then \( D^2=0 \) and NRMSD = 1/2. In the next step, we define WSV.
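
As an illustration of Eqs. 1 to 3, the following Python sketch computes the distance-matrix RMSD and the NRMSD for two structures with the same number of atoms. The function names (`pairwise_distances`, `rmsd_and_nrmsd`) and the use of NumPy are assumptions made here for illustration and are not taken from the original work; coordinates are expected as N × 3 arrays.

```python
import numpy as np


def pairwise_distances(coords):
    """Return the N x N matrix of Euclidean distances between atom positions."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def rmsd_and_nrmsd(coords_a, coords_b):
    """Distance-matrix RMSD (Eq. 1) and NRMSD (Eq. 3) for two equal-size structures."""
    d_a = pairwise_distances(coords_a)
    d_b = pairwise_distances(coords_b)
    if d_a.shape != d_b.shape:
        raise ValueError("both structures must have the same number of atoms")
    n = d_a.shape[0]
    upper = np.triu_indices(n, k=1)            # index pairs with i < j
    a, b = d_a[upper], d_b[upper]
    big_d2 = 2.0 * np.sum((a - b) ** 2)        # D^2 of Eq. 2
    rmsd = np.sqrt(big_d2 / (n * (n - 1)))     # Eq. 1
    norm = 2.0 * np.sum(a ** 2) + 2.0 * np.sum(b ** 2)  # normalisation of Eq. 3
    nrmsd = 0.5 * (1.0 - big_d2 / norm)        # Eq. 3
    return rmsd, nrmsd
```

Because only intramolecular distances enter, a rigidly translated or rotated copy of a structure gives RMSD = 0 and NRMSD = 1/2 without any superposition step.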

The SV was defined by using the Wigner-D functions in conjunction with a series expansion of the protein’s shape function [19]. The Wigner-D functions [22] describe the surface of a 4-sphere and are an extension of the spherical harmonics. The surface of a 4-sphere is a three-dimensional manifold, which can be explored by using a set of three angles, defined as Euler angles. On the other hand, Euler angles describe a motion in three-dimensional Euclidean space. Thus, we can project a three-dimensional Euclidean space onto the three-dimensional manifold (4-sphere surface); that is, we project a body onto the surface of a 4-sphere. Adding the atomic masses, \( M_{\mathrm{atom}} \), to the point coordinates gives gravitational attraction for a given projected point. Thus, we define the protein shape function as:

$$ f\left({\alpha}_i,{\beta}_j,{\gamma}_k\right)=\begin{cases}{M}_{\mathrm{atom}}, & \text{if there is an atom with mass } {M}_{\mathrm{atom}}\\ 0, & \text{elsewhere}\end{cases} $$
(4)

where i, j, k = 1, 2, …, N (N is the number of atoms in the protein) and \( M_{\mathrm{atom}} \) is the atomic mass in atomic mass units (in the definition of SV, \( M_{\mathrm{atom}}=1 \) for all atoms). Here, \( ({\alpha}_i,{\beta}_j,{\gamma}_k) \) are the three Euler angles corresponding to the position \( (x_i,y_j,z_k) \) of the atom in the PDB (Protein Data Bank) entry. We now expand the protein shape function in terms of the Wigner-D functions, \( D_{lmn}(\alpha,\beta,\gamma) \), which span a basis set, as follows:

$$ f\left(\alpha, \beta, \gamma \right)={\displaystyle \sum_{l=0}^{\infty }}{\displaystyle \sum_{m=-l}^l}{\displaystyle \sum_{n=-l}^l}{C}_{lmn}{D}_{lmn}\left(\alpha, \beta, \gamma \right) $$
(5)

where the \( C_{lmn} \) are the coefficients of the series expansion, which are unique for a given function f(α, β, γ). Some theorems in mathematics allow us to treat the coefficients of the expansion of a function in Wigner-D functions as the three-dimensional Fourier transform of this function [23, 24]. Thus, in the above expansion, the \( C_{lmn} \) correspond to the elements of the three-dimensional Fourier transform of f(α, β, γ). From crystallography considerations, it is readily recognized that these are the coefficients of the crystal shape function, i.e., the structure factors [25]. Thus, the \( C_{lmn} \) are the protein structure factors. This is why adding the masses of the atoms is so important: in X-ray scattering, the atomic masses play an important role in determining the corresponding structure factors [26]. The \( C_{lmn} \) can be obtained from the following relation:

$$ {C}_{lmn}=\frac{\left(2l+1\right)}{8{\pi}^2}{\displaystyle \int }{\displaystyle \int }{\displaystyle \int }f\left(\alpha, \beta, \gamma \right)\ {D_{lmn}}^{*}\left(\alpha, \beta, \gamma \right)\ \sin \beta\ d\beta\ d\alpha\ d\gamma $$
(6)

where we have used the orthogonality relation between the Wigner-D functions as follows:

$$ {\displaystyle \int }{\displaystyle \int }{\displaystyle \int }{D_{l^{\mathit{\hbox{'}}}{m}^{\mathit{\hbox{'}}}{n}^{\mathit{\hbox{'}}}}}^{*}\left(\alpha, \beta, \gamma \right)\ {D}_{lmn}\left(\alpha, \beta, \gamma \right)\ \sin \beta\ d\beta\ d\alpha\ d\gamma =\frac{8{\pi}^2}{\left(2l+1\right)}\ {\delta}_{l{l}^{\mathit{\hbox{'}}}}{\delta}_{m{m}^{\mathit{\hbox{'}}}}{\delta}_{n{n}^{\mathit{\hbox{'}}}} $$
(7)
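
The step from Eq. 5 to Eq. 6 can be written out explicitly. The following sketch assumes that the sum and the integral may be interchanged and uses the shorthand \( d\Omega =\sin \beta\ d\beta\ d\alpha\ d\gamma \), which is introduced here only for brevity. Multiplying Eq. 5 by \( D_{l'm'n'}^{*} \), integrating, and applying Eq. 7 gives:

$$ \int f\, D_{l'm'n'}^{*}\, d\Omega =\sum_{l=0}^{\infty}\sum_{m=-l}^{l}\sum_{n=-l}^{l} C_{lmn}\int D_{l'm'n'}^{*}\, D_{lmn}\, d\Omega =\sum_{l=0}^{\infty}\sum_{m=-l}^{l}\sum_{n=-l}^{l} C_{lmn}\,\frac{8{\pi}^2}{2l+1}\,{\delta}_{ll'}{\delta}_{mm'}{\delta}_{nn'}=\frac{8{\pi}^2}{2l'+1}\, C_{l'm'n'} $$

Solving for \( C_{l'm'n'} \) and dropping the primes recovers Eq. 6.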

Now, in the reciprocal space, the two shapes (proteins) are described with the same dimensions [19], even though they have different numbers of atoms. This is due to the use of Wigner-D functions. The dimension of the reciprocal space, \( N_R \), is given by:

$$ {N}_R={\displaystyle \sum_{l=0}^{L_{max}}}{\left(2l+1\right)}^2 = \frac{1}{3}\ \left({L}_{max}+1\right)\left(2{L}_{max}+1\right)\left(2{L}_{max}+3\right) $$
(8)

where \( L_{max} \) is an arbitrary maximum value of l chosen in the computation of the \( C_{lmn} \).
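
As a quick check of Eq. 8, the short Python sketch below counts the expansion coefficients directly and compares the result with the closed form. The function names are hypothetical and chosen only for this illustration.

```python
def reciprocal_dimension(l_max):
    """N_R of Eq. 8: number of (l, m, n) index triples with l <= l_max."""
    if l_max < 0:
        raise ValueError("l_max must be non-negative")
    return sum((2 * l + 1) ** 2 for l in range(l_max + 1))


def reciprocal_dimension_closed_form(l_max):
    """Closed form on the right-hand side of Eq. 8 (the product is always divisible by 3)."""
    return (l_max + 1) * (2 * l_max + 1) * (2 * l_max + 3) // 3


for l_max in (0, 1, 5, 10):
    # Each line prints l_max followed by the direct sum and the closed form.
    print(l_max, reciprocal_dimension(l_max), reciprocal_dimension_closed_form(l_max))
```

The count grows roughly as the cube of the cutoff, so a modest increase in the cutoff adds many coefficients.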

The coefficients \( C_{lmn} \) are complex, and we can embed them in the \( (N_R\times 2) \)-dimensional Euclidean space such that \( S\equiv \left(\mathrm{Real}(C_{lmn}),\ \mathrm{Imaginary}(C_{lmn})\right) \), where \( S\equiv \{S_{ij}\} \) (i = 1, 2, ⋯, \( N_R \) and j = 1, 2) is a matrix of structure factors. In this step, we can define an \( (N_R\times N_R) \) distance matrix for S, and then we define the SD parameter between two proteins as follows:

$$ S{D}^2=2{\displaystyle \sum_{i<j}^{N_R}}{\displaystyle \sum_{j=2}^{N_R}}{\left(s{d}_{ij}-s{d}_{ij}^{\mathit{\hbox{'}}}\right)}^2=2{\displaystyle \sum_{i<j}^{N_R}}{\displaystyle \sum_{j=2}^{N_R}}\left(s{d}_{ij}^2+s{d}_{ij}^{\mathit{\hbox{'}}2}-2\ s{d}_{ij}s{d}_{ij}^{\mathit{\hbox{'}}}\right) $$
(9)

where \( sd_{ij} \) and \( sd_{ij}^{\prime} \) are the elements of the distance matrices in the reciprocal space of the two proteins, each of which is defined by:

$$ s{d}^2=\left(\begin{array}{cc}\hfill {S}_{11}\hfill & \hfill {S}_{12}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {S}_{N_R1}\hfill & \hfill {S}_{N_R2}\hfill \end{array}\right)\left(\begin{array}{ccc}\hfill {S}_{11}\hfill & \hfill \cdots \hfill & \hfill {S}_{N_R1}\hfill \\ {}\hfill {S}_{12}\hfill & \hfill \cdots \hfill & \hfill {S}_{N_R2}\hfill \end{array}\right)=\left(\begin{array}{ccc}\hfill {S}_{11}^2+{S}_{12}^2\hfill & \hfill \cdots \hfill & \hfill {S}_{11}{S}_{N_R1}+{S}_{12}{S}_{N_R2}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {S}_{11}{S}_{N_R1}+{S}_{12}{S}_{N_R2}\hfill & \hfill \cdots \hfill & \hfill {S}_{N_R1}^2+{S}_{N_R2}^2\hfill \end{array}\right) $$
(10)

Here, we add a constraint to the definition of WSV by always making sure that \( N_R\ge \max \left(N_1,N_2\right) \), where \( N_1 \) and \( N_2 \) are the numbers of atoms of the two compared proteins. Then, we define \( {L}_{max}=\left\lfloor {N}_R^{1/3}-1\right\rfloor \), where ⌊·⌋ indicates the integer part of the number in the brackets. Now, we introduce a direct measure to characterize the similarity between two proteins, which depends on their geometries and physical properties (masses and positions of their atoms). Thus, we define the weighted similarity value, WSV, as:

$$ WSV = \frac{1}{2}\left(1-\frac{S{D}^2}{s{d}^2+s{d}^{\mathit{\hbox{'}}2}}\right) $$
(11)

where \( s{d}^2=2{\sum}_{i<j}^{N_R}{\sum}_{j=1}^{N_R}s{d}_{ij}^2 \) is the squared vector length (sum of the squares of the array elements), as is the case for \( s{d}^{\prime 2} \). If the two proteins are not correlated, we have \( S{D}^2=s{d}^2+s{d}^{\prime 2} \) and then WSV = 0. If the two proteins are maximally correlated (i.e., the two proteins are the same), then \( S{D}^2=0 \) and WSV = 1/2.
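
Equations 9 to 11 can be sketched in Python as follows, taking the already computed complex coefficients of two proteins as input. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the coefficients are assumed to be flattened in the same (l, m, n) order for both proteins, and the entries of the matrix in Eq. 10 are assumed to be used directly as the reciprocal-space matrix elements.

```python
import numpy as np


def gram_matrix(coeffs):
    """Eq. 10: embed the complex coefficients as rows (Re, Im) and form S S^T."""
    c = np.asarray(coeffs, dtype=complex).ravel()
    s = np.column_stack([c.real, c.imag])      # S, shape (N_R, 2)
    return s @ s.T                             # shape (N_R, N_R)


def wsv(coeffs_a, coeffs_b):
    """Eq. 11 evaluated over the i < j entries of the two matrices.

    Assumption: the entries of S S^T are taken as the matrix elements of Eq. 9.
    """
    sd_a = gram_matrix(coeffs_a)
    sd_b = gram_matrix(coeffs_b)
    if sd_a.shape != sd_b.shape:
        raise ValueError("both proteins need the same L_max (same N_R)")
    upper = np.triu_indices(sd_a.shape[0], k=1)    # index pairs with i < j
    a, b = sd_a[upper], sd_b[upper]
    sd2_diff = 2.0 * np.sum((a - b) ** 2)                # SD^2 of Eq. 9
    norm = 2.0 * np.sum(a ** 2) + 2.0 * np.sum(b ** 2)   # normalisation of Eq. 11
    return 0.5 * (1.0 - sd2_diff / norm)
```

Because both proteins are expanded with the same cutoff, the two matrices always have the same shape, whatever the numbers of atoms; identical coefficient sets give WSV = 1/2.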

The range of atomic masses in the proteins is as follows. The heaviest atom in a protein is typically sulfur, with a mass of about 32.065 a.m.u., and the lightest is hydrogen, with a mass of about 1.00794 a.m.u. We have also considered the atomic masses of some metal atoms in the liganded proteins.

We also wish to compare WSV with the other measures of protein similarity, namely NRMSD, TM-SCORE, and DALILITE. We use each of these methods separately as a target and observe that WSV predicts similarity or dissimilarity in close agreement with their predictions. To analyze it in this way, we compute the ‘sensitivity’ (the fraction of pairs judged similar by the target that WSV also predicts to be similar), ‘specificity’ (the fraction of pairs judged dissimilar by the target that WSV also predicts to be dissimilar), ‘accuracy’ (the fraction of all pairs on which WSV and the target agree), ‘precision’ (the fraction of pairs predicted to be similar by WSV that the target also judges similar), and ‘F-score’ (the harmonic mean of precision and sensitivity) [27], as explained below.

To compute sensitivity, specificity, etc., we first normalize TM-SCORE and DALILITE (referred to as the M-score) to a maximum of 0.5. Thus, when M = 0.5, the two proteins are completely similar, and when M = 0, they are completely dissimilar. Then, we assume that a measure predicts similarity between two proteins for any value greater than 0.25 and dissimilarity for any value less than 0.25. We have a true positive (TP) result when both measures predict similarity, a true negative (TN) when both measures predict dissimilarity, a false positive (FP) when WSV predicts similarity and the M-score predicts dissimilarity, and a false negative (FN) when WSV predicts dissimilarity and the M-score predicts similarity. The definitions of sensitivity, specificity, etc., are given in Table 4.
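
The counting scheme above can be sketched in Python as follows. Since the formulas of Table 4 are not reproduced in the text, the sketch assumes the standard confusion-matrix definitions; under that assumption, the F-scores reported in the Results (88.2% and 83.1%) equal the harmonic mean of the reported precision and sensitivity. The function name is hypothetical, and a score exactly equal to 0.25, which the text leaves undefined, is treated here as dissimilar.

```python
def agreement_metrics(wsv_scores, target_scores, threshold=0.25):
    """Count TP/TN/FP/FN with the target measure as reference and derive the usual ratios."""
    tp = tn = fp = fn = 0
    for w, m in zip(wsv_scores, target_scores):
        wsv_similar = w > threshold        # assumption: exactly 0.25 counts as dissimilar
        target_similar = m > threshold
        if wsv_similar and target_similar:
            tp += 1
        elif not wsv_similar and not target_similar:
            tn += 1
        elif wsv_similar:
            fp += 1
        else:
            fn += 1

    def ratio(num, den):
        # Return NaN instead of raising when a class is empty.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    f_score = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "sensitivity": sensitivity, "specificity": specificity,
        "accuracy": accuracy, "precision": precision, "F-score": f_score,
    }
```

Both score lists are expected on the same 0 to 0.5 scale, with one entry per protein pair.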

We also compare WSV with the other scores by introducing a relative difference between WSV and M-score as:

$$ dif=\frac{\left|WSV-M\right|}{\left(WSV+M\right)} $$
(12)

When dif = 0, the WSV and M-score have the same prediction values and when dif = 1, this means the WSV and M-score have totally different prediction values. In other words, one predicts that the two proteins are similar and the other predicts that they are totally dissimilar.
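
A minimal Python sketch of Eq. 12, together with a helper for summarizing the resulting values as a share of pairs below a given cutoff, is given here. The function names and the three toy score pairs are illustrative assumptions and are not data from this study; pairs for which both scores are zero are undefined in Eq. 12 and are not handled.

```python
import numpy as np


def relative_difference(wsv_scores, m_scores):
    """Eq. 12, element-wise: |WSV - M| / (WSV + M)."""
    w = np.asarray(wsv_scores, dtype=float)
    m = np.asarray(m_scores, dtype=float)
    return np.abs(w - m) / (w + m)


def fraction_below(dif, cutoff):
    """Share of protein pairs whose dif is below the cutoff."""
    return float(np.mean(np.asarray(dif) < cutoff))


# Toy values only: equal scores, differing scores, and one score equal to zero.
dif = relative_difference([0.5, 0.4, 0.0], [0.5, 0.1, 0.3])
print(dif)
print(fraction_below(dif, 0.10))
```

The toy pairs cover the two limiting cases of the text: equal scores give dif = 0, and a pair in which one score is zero gives dif = 1.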

3 Results and discussion

In this paper, we defined WSV as a development of SV by including some physical properties of proteins in its definition and a constraint on the dimension of the reciprocal space. In Tables 1 and 2, we show a comparison of WSV with SV [19], RMSD, NRMSD, TM-SCORE [14, 15], and DALILITE [18] values for the 48 and 86 datasets, respectively, whose liganded and unliganded proteins are listed in the supplementary material of Li et al. [1] (these sets are reported at http://dragon.bio.purdue.edu/visgrid_suppl). For DALILITE, we report only the minimum and maximum similarity values between two proteins. The TM-SCORE values [14, 15] were obtained from the Zhang Lab server (http://zhanglab.ccmb.med.umich.edu/TM-SCORE/) for the 48 and 86 datasets (only 84 entries of the 86 dataset and 47 entries of the 48 dataset were used because no TM-SCORE values are available for the others), and the DALILITE values [18] were obtained from the Holm Lab server (http://ekhidna.biocenter.helsinki.fi/dali_lite/start) for the 48 and 86 datasets (only 85 entries of the 86 dataset and 47 entries of the 48 dataset were used because no DALILITE values are available for the others).

Table 1 A set of 48 protein structures with WSV, SV [19], and RMSD from Li et al. [1], NRMSD, TM-SCORE [14, 15], and DALILITE [18]
Table 2 A set of 86 protein structures with WSV, SV [19], and RMSD from Li et al. [1], NRMSD, TM-SCORE [14, 15], and DALILITE [18]

One way to see how the atomic masses and the restriction on the space dimension affect the similarity criterion between two proteins is to compute the correlation of WSV and of SV with RMSD. The correlation between WSV and RMSD is 0.45 for the 48 dataset and 0.55 for the 86 dataset, which is better than the correlation between SV and RMSD for these two datasets, i.e., 0.32 and 0.36, respectively. In Ref. [19], a complete discussion is given to explain why we do not expect to see a high correlation between RMSD and SV (WSV). This is why we defined NRMSD for comparison with WSV. NRMSD is a bounded parameter that removes the inconvenience of the semi-bounded RMSD. Also, both WSV and NRMSD are similarity criteria.

Figures 1 and 2 show the histograms of dif between WSV and NRMSD for the 48 and 86 datasets. For the 86 dataset, 60% of the WSV and NRMSD predictions differ by less than 10% and 70% differ by less than 20%. For the 48 dataset, 60% of the predictions likewise differ by less than 10% and 67% differ by less than 20%. The disagreement between WSV and NRMSD by an 80% difference of prediction values is equal to 4.5% for the 48 dataset and 5.8% for the 86 dataset (a summary of these results is given in Table 3). Figures 3, 4, 5, and 6 show the histograms of dif computed between WSV and TM-SCORE and between WSV and DALILITE. These figures and Table 3 show good agreement between WSV and these methods.

Fig. 1 dif histogram between WSV and NRMSD for 48 dataset
Fig. 2 dif histogram between WSV and NRMSD for 86 dataset
Table 3 Differences between WSV and the other methods by using dif
Fig. 3 dif histogram between WSV and TM-SCORE [14, 15] for 48 dataset
Fig. 4 dif histogram between WSV and TM-SCORE [14, 15] for 86 dataset
Fig. 5 dif histogram between WSV and DALILITE [18] for 48 dataset
Fig. 6 dif histogram between WSV and DALILITE [18] for 86 dataset

Table 4 shows the sensitivity, specificity, accuracy, and precision of WSV compared with the other scores (NRMSD, DALILITE, TM-SCORE) as targets. In summary, comparing WSV with NRMSD, the ‘sensitivity’, or the probability that two proteins are determined to be similar by WSV, is about 85.7% (80.0%) for the 86 (48) dataset, and the ‘specificity’, or the probability that two proteins are determined to be dissimilar by WSV, is 62.5% (37.5%). The accuracy of WSV for the 86 (48) dataset is 81.4% (72.9%) and the precision is 90.9% (86.5%); both measures indicate good agreement between the WSV and NRMSD predictions. The F-score for the 86 (48) dataset shows that the performance in giving a similarity prediction is about 88.2% (83.1%), which is an expected result because the proteins in the two datasets examined here are closely similar. These results show that WSV could be a good alternative to RMSD (or NRMSD); it does not suffer from the protein size issue and provides a normalized similarity criterion between any two proteins. The comparisons between WSV and TM-SCORE [14, 15] and between WSV and DALILITE [18], carried out in the same manner and reported in Table 4, also show good agreement between WSV and these methods’ predictions, as well as good precision of WSV.

Table 4 The computation of sensitivity, specificity, accuracy, precision, and F-score for the 48 and 86 datasets

All of the above results show that WSV appears to be a reliable alternative parameter for RMSD (or NRMSD). WSV is a geometrical criterion while it also includes physical properties. Moreover, it does not suffer from the protein size problem and it provides a similarity criterion between two proteins as well as other criteria.

For computing WSV, we used an i7 laptop with 8 GB RAM. The time required to complete this computation depends on the proteins’ sizes and on average it takes about 3 min (for small proteins it takes about 1 min and for large proteins the computation takes about 6 min). In Tables 1 and 2, we also show the \( L_{max} \) used for each pair of proteins.

4 Conclusions

In this paper, we introduced WSV, which differs from SV in two major ways. First, we weighted the shape function by the atomic masses, which stresses the importance of the individual atoms in the computation. Second, we extended the dimension of the reciprocal space to at least the size of the larger of the two compared proteins (measured by the number of atoms). This condition ensures that we do not lose any information about the proteins when we map them onto the reciprocal space. As discussed in the Results and discussion section, these two changes improve the correlation of WSV with RMSD relative to that of SV. We compared WSV with NRMSD, TM-SCORE, and DALILITE by using statistical concepts such as sensitivity, specificity, etc. The results show good accuracy and precision for WSV. We also computed a relative difference (dif) between WSV and the other methods, which likewise shows good agreement between the WSV predictions and the other scores. Our results confirm the reliability and usefulness of our method and show that WSV can be used as an alternative to RMSD in helping to find protein similarity in various areas of protein science and in drug discovery.

WSV is now defined as a geometrical structural score. As future work, we suggest defining a score based on both WSV and domain-based structural methods. We also wish to emphasize that WSV is a geometry-based method that is sensitive to the positions of the protein's atoms and to their masses. Thus, if either of these changes, WSV will also change. Apparently, for two structurally similar proteins with dissimilar sequences, WSV does not give structural homologues as a result. This hypothesis will be examined in our future research and, if it is indeed verified, this could present an advantage of WSV relative to SV or RMSD.