Introduction

One of the most important physical properties of a pure compound is the enthalpy of fusion at the normal melting point (\( \Updelta_{\text{fus}} H_{\text{tm}} \)). It is defined as the enthalpy change on transition from the most stable solid form to the liquid state at the normal melting point.

The \( \Updelta_{\text{fus}} H_{\text{tm}} \) has many important applications. It is used in energy balance computations whenever a solid–liquid phase change occurs in the chemical or petrochemical process under study. It is also related to the molecular packing in crystals and, combined with other thermodynamic properties, can be used to correct thermochemical data to a standard state [1]. It is further used to calibrate commercially manufactured testing equipment such as differential scanning calorimeters, which determine the temperature and the amount of energy transferred during phase changes [2]. Another important application of \( \Updelta_{\text{fus}} H_{\text{tm}} \) is the estimation of other physical properties. For example, several reliable methods estimate the solubility of compounds in various solvents from \( \Updelta_{\text{fus}} H_{\text{tm}} \) [3].

Several methods are used to measure \( \Updelta_{\text{fus}} H_{\text{tm}} \) experimentally. They can be categorized into two main groups: calorimetric methods and non-calorimetric methods. The calorimetric methods include adiabatic [4–10], isoperibol [11–16], isothermal [17–20], heat conduction [21–27], drop [2], and differential scanning calorimetry [2], as well as differential thermal analysis [2]. The non-calorimetric methods include cryoscopic [2], vapor pressure, and enthalpy of solution methods [2].

There are several methods for the estimation of \( \Updelta_{\text{fus}} H_{\text{tm}} \). The first model for estimating \( \Updelta_{\text{fus}} H_{\text{tm}} \) was proposed by Bondi [28]. He used the relation between the enthalpy of fusion (\( \Updelta_{\text{fus}} H_{\text{tm}} \)) and the entropy of fusion at the normal melting point (\( \Updelta_{\text{fus}} S_{\text{tm}} \)), where \( t_{\text{m}} \) is the normal melting point. These two quantities are related by Eq. 1.

$$ \Updelta_{\text{fus}} H_{\text{tm}} = t_{\text{m}} \Updelta_{\text{fus}} S_{\text{tm}} $$
(1)
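As a quick numerical illustration of Eq. 1, the sketch below converts between the enthalpy and the entropy of fusion. The function names are hypothetical, and the water values (about 6.01 kJ/mol at 273.15 K) are common literature values used only as an example; they are not taken from the data set of this study.

```python
def enthalpy_of_fusion(t_m, entropy_of_fusion_tm):
    """Eq. 1: enthalpy of fusion (J/mol) from t_m (K) and entropy of fusion (J/(mol K))."""
    return t_m * entropy_of_fusion_tm


def entropy_of_fusion(t_m, enthalpy_of_fusion_tm):
    """Eq. 1 rearranged: entropy of fusion (J/(mol K)) from t_m (K) and enthalpy (J/mol)."""
    return enthalpy_of_fusion_tm / t_m


# Illustrative values for water (assumed literature values, not from this study).
t_m_water = 273.15        # K
dh_fus_water = 6010.0     # J/mol

ds_fus_water = entropy_of_fusion(t_m_water, dh_fus_water)
print(round(ds_fus_water, 2))  # about 22.0 J/(mol K)
```

The round trip `enthalpy_of_fusion(t_m_water, ds_fus_water)` recovers the original enthalpy, which is all that Eq. 1 states.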

Bondi [28] proposed using the total entropy of fusion at 0 K (\( \Updelta_{\text{fus}} S_{ 0}^{\text{tot}} \)) instead of \( \Updelta_{\text{fus}} S_{\text{tm}} \) in Eq. 1. The equality of \( \Updelta_{\text{fus}} S_{ 0}^{\text{tot}} \) and \( \Updelta_{\text{fus}} S_{\text{tm}} \) holds only for compounds that have no solid–solid transitions. For these compounds, Eq. 1 gives a good estimate of \( \Updelta_{\text{fus}} H_{\text{tm}} \). However, \( \Updelta_{\text{fus}} S_{ 0}^{\text{tot}} \) is much greater than \( \Updelta_{\text{fus}} S_{\text{tm}} \) for compounds that have solid–solid transitions. This idea has recently been applied to estimate the total phase change enthalpy of more than 1,000 pure compounds [1]. In another attempt, Marrero and Gani [29] developed several group contribution (GC) methods. They proposed first-order, second-order, and third-order GC methods to estimate \( \Updelta_{\text{fus}} H_{\text{tm}} \). Their third-order GC method gave the best results over the 741 compounds they studied, with a standard deviation, average absolute error, and average absolute deviation of 3.65 kJ/mol, 2.17 kJ/mol, and 15.7%, respectively.

The quantitative structure–property relationship (QSPR) method has also been applied to predict \( \Updelta_{\text{fus}} H_{\text{tm}} \). The QSPR-based methods were mostly used to predict the \( \Updelta_{\text{fus}} H_{\text{tm}} \) of particular chemical families of compounds [30–34]. These methods are not reviewed in this study because they were proposed for special purposes and cannot be applied to general compounds.

The GC methods have been used for the determination of various physical properties [35–49]. Recently, one of the authors of this paper proposed a new GC-type method for the determination of \( \Updelta_{\text{fus}} H_{\text{tm}} \) [48]. The method is comprehensive and accurate; however, it needs a large number of parameters to estimate \( \Updelta_{\text{fus}} H_{\text{tm}} \) [48].

In this study, QSPR is applied to develop a comprehensive model for estimating the enthalpy of fusion of pure compounds at their normal melting points. QSPR uses chemical-structure-based parameters, called molecular descriptors, to develop a model.

Materials and methods

Materials

To develop a comprehensive model, a large experimental data set is required. The accuracy and reliability of models for the estimation of physical properties, especially those dealing with a large number of experimental data, depend directly on the quality and comprehensiveness of the data set used for their development. Comprehensiveness here covers both the diversity of the investigated chemical families and the number of pure compounds available in the data set. In this study, the database prepared by Yaws [50] was used, which is one of the most comprehensive sources of physical property data, e.g., \( \Updelta_{\text{fus}} H_{\text{tm}} \), for chemical species. The \( \Updelta_{\text{fus}} H_{\text{tm}} \) values of 3,864 compounds were found in the database and used as the main data set in this study.

Computation of molecular descriptors

In QSPR theory, the chemical structure of a compound is encoded into parameters called “molecular descriptors.” The molecular descriptors are basic molecular properties of a compound [49, 51–72]. Each type of molecular descriptor is related to a specific type of interaction between chemical groups in a particular molecule [49, 51–72]. Many software packages are available for computing the molecular descriptors of any desired chemical structure. A review of these software packages can be found elsewhere [51]. In this study, one of the most widely used packages, “Dragon,” is used [73]. This software can calculate more than 3,000 molecular descriptors for any desired chemical structure. Since the values of many descriptors depend on the bond lengths, bond angles, etc., each chemical structure is optimized before its molecular descriptors are calculated. For this purpose, the chemical structures of all 3,864 pure compounds were drawn in the Hyperchem software [74] and optimized using the MM+ molecular mechanics force field. Finally, the molecular descriptors were determined using the Dragon software [73].

Generating model

Once the molecular descriptors have been calculated from the optimized chemical structures of all investigated compounds, the aim is to find a linear equation that represents/predicts the desired property with the fewest variables and the highest accuracy [49, 52–72]. In other words, the problem is to find an optimal subset of variables (the molecular descriptors with the greatest statistical effect on \( \Updelta_{\text{fus}} H_{\text{tm}} \)) from all available variables (all molecular descriptors) that can calculate \( \Updelta_{\text{fus}} H_{\text{tm}} \) with the least possible deviation from the experimental values. A generally accepted method for this problem is genetic algorithm-based multivariate linear regression (GA-MLR) [49, 52–72, 75–77]. In this method, the genetic algorithm selects the best subset of variables according to an objective function, as first performed by Leardi et al. [76]. Fitness functions such as \( R^{2} \), adjusted \( R^{2} \), \( Q^{2} \), and the Akaike information criterion (a measure of the goodness of fit of an estimated statistical model) are generally applied as the objective function in the GA-MLR technique [75, 77]. The “RQK” fitness function is a more recent one for model searching, proposed to avoid undesired model properties such as chance correlation, the presence of noisy variables in the models, and other model pathologies that cause a lack of prediction capability [49, 52–72, 75, 77]. RQK is a constrained fitness function based on the \( Q_{\text{LOO}}^{2} \) statistic (leave-one-out cross-validated variance) and four other tests that must be fulfilled simultaneously. The \( Q_{\text{LOO}}^{2} \) statistic is defined as follows [77]:

$$ Q_{\text{LOO}}^{ 2} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {(y_{i} - \hat{y}_{ic} )^{2} } }}{{\sum\nolimits_{i = 1}^{n} {(y_{i} - \bar{y})^{2} } }} $$
(2)

where \( y_{i} \) is the \( \Updelta_{\text{fus}} H_{\text{tm}} \) of the \( i \)th compound, \( \bar{y} \) is the mean value of \( \Updelta_{\text{fus}} H_{\text{tm}} \) over all investigated compounds, and \( \hat{y}_{ic} \) is the response of the \( i \)th object estimated by a model obtained without using that object. Todeschini et al. [77] proposed that the preceding equation should be subjected to the following constraints:

$$ \Updelta K = K_{XY} - K_{X} > 0\;({\text{Quick rule}}) $$
(3)
$$ \Updelta Q = Q_{\text{LOO}}^{ 2} - Q_{\text{ASYM}}^{ 2} > 0\;({\text{Asymptotic }}Q^{2} {\text{ rule}}) $$
(4)
$$ R^{P} > 0\;({\text{Redundancy RP rule}}) $$
(5)
$$ R^{N} > 0\;({\text{Overfitting RN rule}}) $$
(6)

The RQK function is used in this study as the fitness function. GA-MLR with the RQK fitness function has given satisfactory results in previous studies [43, 49, 52–72].
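The leave-one-out statistic of Eq. 2 can be sketched in a few lines of plain Python. The code below is only an illustration under stated assumptions: `fit_ols`, `predict`, and `q2_loo` are hypothetical helper names, the regression is solved through the normal equations with Gauss-Jordan elimination, and the four RQK constraints (Eqs. 3–6) are not implemented.

```python
def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ...] for y ~ b0 + sum(bj * xj)."""
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    p = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(M[k][c]))
        M[c], M[piv] = M[piv], M[c]
        for k in range(p):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict(coef, x):
    """Evaluate the linear model for one descriptor vector x."""
    return coef[0] + sum(b * v for b, v in zip(coef[1:], x))


def q2_loo(X, y):
    """Eq. 2: 1 - PRESS / total sum of squares, refitting without object i each time."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        coef = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    return 1.0 - press / sum((yi - y_mean) ** 2 for yi in y)


# Toy data (hypothetical): two descriptors, exactly linear response.
X_toy = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0], [6.0, 2.0]]
y_toy = [1.0 + 2.0 * a - 0.5 * b for a, b in X_toy]
q2_toy = q2_loo(X_toy, y_toy)   # very close to 1 for noise-free data
```

For a data set of several thousand compounds, refitting the model n times is wasteful; closed-form leave-one-out residuals based on the hat matrix are the usual shortcut, but the explicit loop mirrors the definition more directly.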

Before the GA-MLR computational steps are performed, the main data set should be divided into two sub-data sets: the “Training” set and the “Test (prediction)” set. In this article, these sets are defined as follows: the “Training” set is used to generate the model, and the “Test” set is used to test the prediction capability of the obtained model. The division of the main data set into the two sub-data sets is performed randomly. For this purpose, about 80 and 20% of the main data set are randomly selected for the “Training” set (3,092 compounds) and the “Test” set (772 compounds), respectively. The effect of the allocation percentages of the two sub-data sets on the accuracy of the model has already been discussed in previous studies [53].
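The random division described above can be sketched as follows. The function name `random_split` and the seed are assumptions made for reproducibility of the illustration; the source does not state which random generator or seed was used.

```python
import random


def random_split(n_total, n_test, seed=0):
    """Randomly partition the indices 0..n_total-1 into training and test index lists."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)     # seeded shuffle (seed is an assumption)
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


# Sizes taken from the text: 3,864 compounds, 772 of them in the test set.
train_idx, test_idx = random_split(3864, 772)
print(len(train_idx), len(test_idx))  # 3092 772
```

Passing the test-set size explicitly, rather than a fraction, avoids rounding ambiguity: 20% of 3,864 is 772.8, so a fraction alone would not pin down the 3,092/772 allocation.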

Several validation techniques are generally used in QSPR methods to obtain a valid and reliable model. The most widely used techniques have been presented by Todeschini et al. [75]. The bootstrapping, y-scrambling, and external validation techniques are used in this study. In the bootstrapping technique, the original size of the data set (n) is preserved for the “Training” set by selecting n objects with repetition. In this procedure, the training set usually contains repeated objects, and the evaluation set consists of the objects left out [49, 52–72, 75, 77]. The model is calculated on the “Training” set, and responses are predicted on the evaluation set [43, 49, 52–72, 75, 77]. All the squared differences between the true response and the predicted response of the objects of the evaluation set are collected in a quantity denoted “PRESS.” This procedure of building “Training” sets and evaluation sets is repeated thousands of times. “PRESS” is summed over all repetitions, and the predictive capability is calculated from it [43, 49, 52–72, 75, 77].
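The bootstrap loop described above can be sketched as follows. To keep the example short, it fits a single-descriptor least-squares line rather than a full multivariate model, and it normalizes the accumulated PRESS by the squared deviations of the left-out responses from the training-sample mean. Both choices, and the names `fit_line` and `q2_boot`, are assumptions of this sketch; the exact formula used by the software in the study is not given in the text.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def q2_boot(x, y, n_rounds=1000, seed=0):
    """Bootstrap estimate of predictive ability: 1 - PRESS / reference sum of squares."""
    rng = random.Random(seed)
    n = len(y)
    press = 0.0
    ss_ref = 0.0
    for _ in range(n_rounds):
        picked = [rng.randrange(n) for _ in range(n)]    # n objects drawn with repetition
        left_out = set(range(n)) - set(picked)           # evaluation set
        xs = [x[i] for i in picked]
        ys = [y[i] for i in picked]
        if not left_out or len(set(xs)) < 2:
            continue                                     # nothing to evaluate or no fit possible
        slope, intercept = fit_line(xs, ys)
        y_train_mean = sum(ys) / n
        for i in left_out:
            press += (y[i] - (slope * x[i] + intercept)) ** 2
            ss_ref += (y[i] - y_train_mean) ** 2
    return 1.0 - press / ss_ref


# Toy data (hypothetical): one descriptor, exactly linear response.
x_toy = [float(v) for v in range(10)]
y_toy = [2.0 * v + 1.0 for v in x_toy]
q2_boot_toy = q2_boot(x_toy, y_toy, n_rounds=200)   # very close to 1 for noise-free data
```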

The y-scrambling technique is adopted to check the obtained models for chance correlation. The test is performed by calculating the quality of the model (usually \( Q^{2} \)) after modifying the sequence of the response vector, i.e., by assigning to each object a response randomly selected from the true responses. If the original model has no chance correlation, there is a significant difference between the quality of the original model and that of a model obtained with random responses. The procedure is repeated several hundred times [43, 49, 52–72, 75, 77].
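A minimal sketch of y-scrambling is given below. As simplifications, it uses a single descriptor and \( R^{2} \) as the quality measure instead of \( Q^{2} \); the function names and the seed are hypothetical.

```python
import random


def r_squared(x, y):
    """R^2 of the least-squares line y ~ a + b*x (equals the squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def y_scrambling(x, y, n_rounds=300, seed=0):
    """Return the true-model R^2 and the R^2 values obtained with permuted responses."""
    rng = random.Random(seed)
    scrambled_scores = []
    for _ in range(n_rounds):
        y_perm = list(y)
        rng.shuffle(y_perm)                  # each object gets a randomly chosen true response
        scrambled_scores.append(r_squared(x, y_perm))
    return r_squared(x, y), scrambled_scores


# Toy data (hypothetical): one descriptor, exactly linear response.
x_toy = [float(v) for v in range(30)]
y_toy = [2.0 * v + 1.0 for v in x_toy]
r2_true, r2_scrambled = y_scrambling(x_toy, y_toy, n_rounds=100)
mean_scrambled = sum(r2_scrambled) / len(r2_scrambled)
```

A model free of chance correlation should show `r2_true` well above `mean_scrambled`; a scrambled score close to the true one would be a warning sign.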

External validation is a technique in which a test set is retained to perform a further check on the predictive capability of a model obtained from a “Training” set and optimized by an evaluation set [49, 52–72, 75, 77].

Results and discussion

The most accurate multivariate linear equation is obtained by following the procedure presented above. First, the best one-descriptor model is obtained. Next, the best two-descriptor model is determined [49, 52–72, 75, 77]. This procedure is repeated to obtain the most accurate models with three, four, five, etc., molecular descriptors. The most accurate multivariate linear model is found to have seven molecular descriptors, because a further increase in the number of molecular descriptors has no considerable effect on the accuracy of the model. The final equation and its statistical parameters are as follows:

$$ \Updelta_{\text{fus}} H_{\text{tm}} = - 3.04644( \pm 0.13585) + 1.53111( \pm 0.00444)\,{\text{Sp}} - 2.3318( \pm 0.05299)\,{\text{GGI1}} + 7.88627( \pm 0.24321)\,{\text{SEige}} - 0.48091( \pm 0.03191)\,{\text{RDF030v}} + 9.41134( \pm 0.3597)\,{\text{nRNH2}} + 5.5341( \pm 0.21958)\,{\text{O-057}} + 0.04041( \pm 0.00413)\,{\text{TPSA(Tot)}} $$
(7)
$$ n_{\text{training}} = 3092; \ n_{\text{test}} = 772; \ R_{\text{training}}^{ 2} = 0.9883; \ R_{\text{test}}^{ 2} = 0.9854; \ Q_{\text{LOO}}^{ 2} = 0.9877; \ Q_{\text{EXT}}^{2} =0.9887; \ Q_{\text{LTO}}^{2} = 0.9804; \ s = 2.57; \ Q_{\text{BOOT}}^{2} = 0.9877; \ a = - 0.012; \ F =35772.55 $$

RQK function parameters

$$ (\Updelta K = 0.088;\;\Updelta Q = 0.000;\;R^{P} = 0.016;\;R^{N} = 0.000) $$

where \( n_{\text{training}} \) and \( n_{\text{test}} \) are the numbers of compounds in the training set and the test set, respectively. “Sp” is a constitutional descriptor defined as the sum of atomic polarizabilities (scaled on the carbon atom). It is a measure of the polarity of a molecule. As expected, when it increases, \( \Updelta_{\text{fus}} H_{\text{tm}} \) increases. “GGI1” is the topological charge index of order 1. Topological charge indices were proposed to evaluate the charge transfer between pairs of atoms. As stated by Todeschini et al. [75], GGI1 is a measure of molecular branching, so an increase in molecular branching results in a decrease in \( \Updelta_{\text{fus}} H_{\text{tm}} \). “RDF030v” is defined as Radial Distribution Function-3.0/weighted by atomic van der Waals volumes. It is a measure of the sphericity of a molecule. The more spherical a molecule, the lower its \( \Updelta_{\text{fus}} H_{\text{tm}} \). “nRNH2” is the number of primary amine groups (aliphatic amines). It is a group count descriptor. “O-057” denotes a phenol, enol, or carboxyl OH group. It is an atomic fragment. “TPSA(Tot)” is defined as the topological polar surface area using N, O, S, and P polar contributions. In general, the latter three molecular descriptors (“nRNH2,” “O-057,” and “TPSA(Tot)”) reflect hydrogen bonding effects in a molecule. The model shows an increase in \( \Updelta_{\text{fus}} H_{\text{tm}} \) when these three descriptors increase.

For more information on how these molecular descriptors are calculated from the chemical structure of a compound, please refer to the Dragon software user’s guide [73].
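Once the seven descriptor values of a compound are available, applying Eq. 7 is a single weighted sum. The sketch below hard-codes the coefficients of Eq. 7; the function name and the dictionary layout are hypothetical, and the descriptor values in the example are made up purely to show the call, not computed for any real compound.

```python
# Coefficients of Eq. 7, keyed by descriptor name.
EQ7_INTERCEPT = -3.04644
EQ7_COEFFICIENTS = {
    "Sp": 1.53111,
    "GGI1": -2.3318,
    "SEige": 7.88627,
    "RDF030v": -0.48091,
    "nRNH2": 9.41134,
    "O-057": 5.5341,
    "TPSA(Tot)": 0.04041,
}


def predict_enthalpy_of_fusion(descriptors):
    """Evaluate Eq. 7 for a dict mapping each of the seven descriptor names to its value."""
    missing = set(EQ7_COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return EQ7_INTERCEPT + sum(
        coef * descriptors[name] for name, coef in EQ7_COEFFICIENTS.items()
    )


# Made-up descriptor values, for illustration only.
example = {name: 0.0 for name in EQ7_COEFFICIENTS}
example["Sp"] = 10.0
estimate = predict_enthalpy_of_fusion(example)   # -3.04644 + 1.53111 * 10
```

The result is expressed in the same units as the experimental \( \Updelta_{\text{fus}} H_{\text{tm}} \) data used to fit Eq. 7, and the descriptor values must come from the same optimization and descriptor software for the coefficients to be meaningful.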

To test the validity of the model, the bootstrapping, y-scrambling, and external validation techniques are used [49, 52–72, 75, 77]. The bootstrapping is repeated 5,000 times, and the y-scrambling is repeated 300 times. As can be seen, the small differences among \( Q_{\text{LOO}}^{ 2} \), \( Q_{\text{BOOT}}^{ 2} \), \( Q_{\text{EXT}}^{ 2} \), and \( R^{2} \) demonstrate the predictive capability and accuracy of the proposed model. The intercept of the y-scrambling technique has a low value (\( a = - 0.011 \)), which supports the validity of the model. In addition, the values of the four constraints of the model are equal to or greater than zero, which shows that the model is valid and is not the result of chance correlation.

The \( \Updelta_{\text{fus}} H_{\text{tm}} \) values predicted by Eq. 7 are compared with the experimental values [50] in Fig. 1. The predicted \( \Updelta_{\text{fus}} H_{\text{tm}} \) values for the investigated chemical compounds, the calculated descriptors, and the status of each compound (“Training” or “Test” set) are presented as supplementary information.

Fig. 1 Comparison between the predicted and experimental \( \Updelta_{\text{fus}} H_{\text{tm}} \)

Conclusions

In this study, a QSPR model was presented for the determination of the enthalpy of fusion of chemical compounds at their normal melting points (\( \Updelta_{\text{fus}} H_{\text{tm}} \)). The proposed model is a multivariate linear one consisting of seven variables (molecular descriptors) and was developed from the experimental data of 3,864 chemical compounds. The molecular descriptors were selected using the GA-MLR technique [49, 52–72, 75–77] and are calculated from the chemical structure of the molecules. The obtained results show that the presented model is simple, comprehensive, and accurate. Because the model was developed using the largest database of experimental values [50] of \( \Updelta_{\text{fus}} H_{\text{tm}} \), its range of application is wide, and it may also be used for compounds from chemical families other than those investigated in this study.