Introduction

The speed of sound (u) is an important parameter in both acoustics and thermodynamics. Speed of sound and density stand out among thermodynamic properties because they can be measured with an accuracy at least one order of magnitude higher than that of other quantities. Density has conventionally been used for modeling; recently, however, attention has shifted to the speed of sound, owing to the development of rigorous measurement protocols covering a wide range of temperatures and pressures in the fluid state. Fast and highly accurate measurement protocols make the speed of sound a reliable quantity from which other thermodynamic properties can be estimated with high precision. All observable thermodynamic properties of a fluid phase can be obtained from the speed of sound by integrating the partial differential equations that relate it to the other thermodynamic properties. Because acoustic data are so accurate, this procedure offers an advantage over conventional direct approaches.

In liquids, combining speed of sound data with (p, ρ, T) data offers an alternative to calorimetry for determining heat capacities:

$$ u^{2} = \frac{1}{M}\left[ {\left( {\frac{{\partial \rho_{\text{n}} }}{\partial p} } \right)_{\text{T}} - \frac{T}{{\rho_{\text{n}}^{2} C_{\text{p,m}} }} \left( {\frac{{\partial \rho_{\text{n}} }}{\partial T}} \right)_{\text{p}}^{2} } \right]^{ - 1} . $$
(1)

In Eq. 1, M is the molar mass, ρ_n is the molar density, and C_p,m is the molar isobaric heat capacity. Combining the speed of sound with (p, ρ, T) data is also a promising experimental route to the heat-capacity ratio γ and the isentropic compressibility κ_s of pure liquids:

$$ u^{2} = \frac{1}{{\rho \kappa_{\text{s}} }}, $$
(2)
$$ u^{2} = \frac{\gamma }{{\rho \kappa_{\text{T}} }}, $$
(3)

where

$$ \kappa_{\text{s}} = \frac{1}{\rho } \left( {\frac{\partial \rho }{\partial p} } \right)_{\text{S}} , $$
(4)
$$ \kappa_{\text{T}} = \frac{1}{\rho } \left( {\frac{\partial \rho }{\partial p} } \right)_{\text{T}} , $$
(5)
$$ \gamma = \frac{{C_{\text{p}} }}{{C_{\text{V}} }}. $$
(6)
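
As a minimal numerical sketch of how Eqs. 2 and 3 are used, the snippet below rearranges them to obtain the isentropic compressibility and the heat-capacity ratio from a measured speed of sound. The function names and the round-number inputs are illustrative assumptions (roughly water-like magnitudes), not values taken from the dataset of this study.

```python
def isentropic_compressibility(u, rho):
    """Eq. 2 rearranged: kappa_s = 1 / (rho * u**2), in 1/Pa."""
    return 1.0 / (rho * u ** 2)


def heat_capacity_ratio(u, rho, kappa_t):
    """Eq. 3 rearranged: gamma = rho * kappa_T * u**2 (dimensionless)."""
    return rho * kappa_t * u ** 2


# Illustrative inputs only (assumed, not from the paper's dataset).
u = 1500.0         # speed of sound, m/s
rho = 1000.0       # mass density, kg/m^3
kappa_t = 4.5e-10  # isothermal compressibility, 1/Pa

kappa_s = isentropic_compressibility(u, rho)
gamma = heat_capacity_ratio(u, rho, kappa_t)

# Dividing Eq. 3 by Eq. 2 gives gamma = kappa_T / kappa_s, a quick consistency check.
print(kappa_s, gamma, kappa_t / kappa_s)
```

Only u and ρ are needed for κ_s, which is why acoustic data are attractive where full (p, ρ, T) surfaces are hard to measure.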

At higher pressures, (p, ρ, T) measurements are much more difficult and in this region sound speed measurements in liquids are probably of the greatest value [1].

In this communication, the quantitative structure-property relationship (QSPR) methodology [2–8] is applied to predict u for a wide array of liquids over a broad range of temperatures.

Methodology

Data preparation

In this study, a comprehensive speed of sound dataset comprising 1,470 data points for 73 liquids over a wide temperature range (58–646.47 K) was extracted from ThermoData Engine [9]. ThermoData Engine was chosen because its experimental data are critically evaluated, which makes it one of the most reliable sources from which to collect experimental data.

Training and test set selection

Typically, in QSPR modeling, the compiled experimental database is split into two subsets: a training set, which is used for model development, and a test set, which is used to assess how well the model learned from the training set generalizes to compounds it has not seen. In this study, K-means clustering is applied to select the training and test sets. K-means clustering is a method of cluster analysis that partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean. As a rule of thumb, 20 % of the collected data were retained to test the model and the remainder was used for model derivation [10]. For the least-squares support vector machine (LSSVM) model, the data points were split 80 %/10 %/10 % into training, validation, and test sets, respectively. This selection, like the previous one, was performed by K-means clustering.
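
The cluster-based selection described above can be sketched as follows. This is only one plausible reading, under the assumption that a fixed fraction of every cluster is drawn into the test set so that both sets cover the same regions of descriptor space; the text does not specify the exact sampling rule. The function names, the number of clusters, and the toy two-dimensional data are hypothetical.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_split(points, k, test_fraction=0.2, seed=0):
    """Draw roughly `test_fraction` of every cluster into the test set."""
    labels = kmeans(points, k, seed=seed)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        rng.shuffle(members)
        n_test = round(test_fraction * len(members))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data standing in for (scaled) descriptor vectors.
data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(100)]
train_idx, test_idx = cluster_split(points, k=5)
print(len(train_idx), len(test_idx))
```

Sampling inside each cluster, rather than from the pooled data, keeps sparsely populated regions represented in both subsets.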

Calculation of descriptors

Prior to descriptor calculation, the 3D structures of the compounds must be optimized. The well-known Dreiding force field [11], as implemented in ChemAxon's JChem software, was used to optimize the 3D structures in this study. About 3,000 descriptors from 22 diverse classes were then calculated with the Dragon software [12]. These 22 classes are constitutional descriptors, topological indices, walk and path counts, connectivity indices, information indices, 2D autocorrelations, Burden eigenvalues, edge-adjacency indices, functional group counts, atom-centered fragments, molecular properties, topological charge indices, eigenvalue-based indices, Randic molecular profiles, geometrical descriptors, RDF descriptors, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, charge descriptors, 2D binary fingerprints, and 2D frequency fingerprints.

Descriptors that could not be calculated for every compound were removed from the list entirely. Next, the pairwise correlation between every pair of descriptors was computed. Whenever the correlation coefficient of a pair exceeded 0.9, one of the two descriptors was discarded at random.
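
The pairwise-correlation filter can be illustrated with a short sketch. It assumes the Pearson correlation coefficient and an absolute-value threshold, which the text does not state explicitly; the function names and the three toy descriptors A, B, and C are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(descriptors, threshold=0.9, seed=0):
    """descriptors: dict name -> list of values; returns the names kept."""
    rng = random.Random(seed)
    names = list(descriptors)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(descriptors[a], descriptors[b])) > threshold:
                # Discard one member of the correlated pair at random.
                dropped.add(rng.choice([a, b]))
    return [name for name in names if name not in dropped]


# Toy descriptor table: A and B are nearly collinear, C is not.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.1],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(toy)
print(kept)
```

Which of A or B survives depends on the random seed; C is always kept because its correlation with both is weak.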

Subset variable selection

Genetic function approximation (GFA) was used for subset variable selection in this study. GFA, originally developed by Rogers and Hopfinger [13], fuses two seemingly distinct algorithms: the multivariate adaptive regression splines algorithm [14] and the genetic algorithm [15]. One attractive feature of GFA is that it evolves a population of models rather than a single model. In addition, because GFA scores models with Friedman's lack-of-fit (LOF) function, the derived models are less prone to overfitting and give better predictions. In this study, the population size and the maximum number of generations were set to 100 and 5,000, respectively, and the mutation probability was set to 1.5. To study the nonlinear nature of the speed of sound, LSSVM was also used for model derivation.
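
For orientation, Friedman's LOF score is commonly written in the GFA literature in the following form. It is quoted here as a hedged reminder rather than as the exact expression used in this work, and the symbols LSE, c, d, p, and M do not appear elsewhere in this text:

$$ \text{LOF} = \frac{\text{LSE}}{\left( 1 - \dfrac{c + d\,p}{M} \right)^{2}} , $$

where LSE is the least-squares error of the model, c is the number of basis functions, d is a user-chosen smoothing parameter, p is the total number of features contained in all basis functions, and M is the number of training samples. Because the denominator shrinks as terms are added, a larger model must reduce LSE substantially to obtain a better score, which is how the measure discourages overfitting.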

Results and discussion

Linear model

The final linear model derived by GFA for estimation of speed of sound in liquids contains 12 descriptors as follows:

$$ \begin{aligned} c & = 1564.5671 \left( { \pm 15.2855} \right) - 6.5480 \left( { \pm 0.1143} \right)T + 0.0034\left( { \pm 0.0001} \right)T^{2} \\ & \quad + 438.7373\left( { \pm 12.9701} \right) AAC + 12.8962\left( { \pm 2.2292} \right)Y_{\text{index}} + 455.4282 \left( { \pm 12.5541} \right)SPH \\ & \quad - 336.8196\left( { \pm 7.5202} \right)Mor13m + 432.8956 \left( { \pm 22.0591} \right)E2v - 504.0771\left( { \pm 16.3251} \right)Ds \\ & \quad - 837.9355 \left( { \pm 25.5316} \right)HATS1m + 69.1189 \left( { \pm 1.6293} \right)RTp + 635.6081 \left( { \pm 20.0058} \right)NRCN \\ & \quad + 193.9080 \left( { \pm 6.4533} \right)nHDon \\ \end{aligned} $$
(7)
$$ \begin{gathered} R^{2} = 0.949;\quad R_{\text{adj}}^{2} = 0.948;\quad n_{\text{Training}} = 1176;\quad n_{\text{Test}} = 294; \hfill \\ F = 21652.59;\quad Q^{2} = 0.938;\quad Q_{\text{boot}}^{2} = 0.948;\quad Q_{\text{ext}}^{2} = 0.952 \hfill \\ a(R^{2} ) = - 0.019;\quad \Delta K = 0.974;\quad \Delta Q = 0;\quad R^{\text{p}} = 0;\quad R^{\text{N}} = 0.996 \hfill \\ \end{gathered} $$
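
To make the use of Eq. 7 concrete, the sketch below transcribes its coefficients (uncertainties omitted) into a small function. The function name and the dictionary layout are assumptions made for illustration, and the all-zero descriptor vector is a placeholder that only demonstrates the call; real descriptor values would have to come from the Dragon calculation for the compound of interest.

```python
# Coefficients transcribed from Eq. 7 (uncertainty terms omitted).
INTERCEPT = 1564.5671
COEFFS = {
    "T": -6.5480,
    "T2": 0.0034,
    "AAC": 438.7373,
    "Yindex": 12.8962,
    "SPH": 455.4282,
    "Mor13m": -336.8196,
    "E2v": 432.8956,
    "Ds": -504.0771,
    "HATS1m": -837.9355,
    "RTp": 69.1189,
    "NRCN": 635.6081,
    "nHDon": 193.9080,
}


def predict_speed_of_sound(temperature, descriptors):
    """Evaluate Eq. 7: temperature in K, result in m/s.

    `descriptors` maps the ten molecular descriptor names to their values.
    """
    features = dict(descriptors, T=temperature, T2=temperature ** 2)
    missing = set(COEFFS) - set(features)
    if missing:
        raise KeyError("missing descriptors: %s" % sorted(missing))
    return INTERCEPT + sum(COEFFS[name] * features[name] for name in COEFFS)


# Placeholder input (NOT real descriptor values), used only to show the call.
placeholder = {name: 0.0 for name in COEFFS if name not in ("T", "T2")}
value = predict_speed_of_sound(300.0, placeholder)
```

With real descriptor values the same call returns the GFA estimate for that compound at the chosen temperature.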

Table 1 lists the GFA-predicted speeds of sound for the studied liquids (the speed of sound, written c in Eq. 7, is in m s−1). The molecular descriptors in Eq. 7 are defined in Table 2. Figure 1 plots the predicted speed of sound values against the experimental data. Most points lie close to the diagonal, which indicates reasonable agreement between the GFA predictions and the experimental values. The relative deviations of the GFA predictions from the experimental data are shown in Fig. 2.

Table 1 Predicted speed of sound values in studied liquids by GFA model
Table 2 Model’s descriptors
Fig. 1 Predicted speed of sound values by GFA model versus the experimental ones
Fig. 2 Deviation of the predicted speed of sound by GFA model from experimental data

Nonlinear model

For nonlinear modeling, LSSVM was used in this study. LSSVM belongs to the support vector machine (SVM) family of machine-learning methods, which is based on finding an optimal separating hyperplane in feature space that minimizes the expected generalization error. The detailed mathematical formulation of the optimization problem solved by LSSVM is not given here and can be found in the references [16–18]. To handle nonlinear problems, the radial basis function was chosen as the kernel function. The kernel function implicitly maps the data into a higher-dimensional feature space, so that a linear model in that space can represent a nonlinear relationship in the original descriptors. Simulated annealing was used to find the best combination of the LSSVM parameters (γ, σ²), taking the minimum mean squared error of leave-one-out (LOO) cross-validation on the training set as the optimality criterion; here γ is the LSSVM regularization parameter, not the heat-capacity ratio of Eq. 3. The twelve descriptors selected by GFA for the linear model were used as inputs to LSSVM for the nonlinear model. The parameters of the final model are γ = 14649.98 and σ² = 0.5602.
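
For readers unfamiliar with LSSVM regression, the following is a minimal sketch of the standard dual formulation, in which training reduces to solving one linear system in the bias b and the coefficients α. It is not the authors' implementation: the kernel convention exp(−‖x − z‖²/σ²), the helper names, the toy one-dimensional data, and the round hyperparameter values are all assumptions chosen for illustration.

```python
import math


def rbf(x, z, sigma2):
    """RBF kernel, assumed convention: exp(-||x - z||^2 / sigma2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def lssvm_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma2) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]


def lssvm_predict(x, X, b, alpha, sigma2):
    """Prediction: b + sum_i alpha_i * K(x, x_i)."""
    return b + sum(a * rbf(x, xi, sigma2) for a, xi in zip(alpha, X))


# Toy problem: learn y = x^2 on [0, 1] (illustrative data and hyperparameters).
X_train = [(i / 10.0,) for i in range(11)]
y_train = [x[0] ** 2 for x in X_train]
b, alpha = lssvm_fit(X_train, y_train, gamma=1.0e4, sigma2=0.5)
y_hat = [lssvm_predict(x, X_train, b, alpha, sigma2=0.5) for x in X_train]
```

A large γ weights the fit to the data heavily relative to smoothness, while σ² sets the kernel width; this is why the two are tuned jointly against a cross-validation error rather than the training error.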

Figure 3 shows the LSSVM-predicted values versus the experimental speeds of sound. The figure shows that LSSVM gives a marked improvement in prediction over GFA. The deviations of the LSSVM predictions from the experimental values are plotted in Fig. 4, where the significant reduction in deviation relative to the GFA model is apparent. Table 3 lists the LSSVM-predicted speeds of sound for the studied liquids, and the statistical parameters of the LSSVM model are listed in Table 4.

Fig. 3 Predicted speed of sound values by LSSVM model versus the experimental ones
Fig. 4 Deviation of the predicted speed of sound by LSSVM model from experimental data

Table 3 Predicted speed of sound values in studied liquids by LSSVM model
Table 4 Statistical parameters of LSSVM model

Applicability domain (AD) [19]

To test the reliability of the predicted responses, the AD of the derived model is investigated. The AD is a theoretical region of descriptor space defined by the molecular descriptors and by both the training and test sets. The purpose of the AD analysis is to check whether the test and training sets share the same domain. This is crucial because predictions outside the AD may be erroneous.

In this study, a Williams graph based on hat values is used to investigate the AD. Hat values are calculated from the hat matrix (H), which is defined as follows:

$$ H = X(X^{\text{T}} X)^{ - 1} X^{\text{T}} , $$
(8)

where X is a two-dimensional matrix comprising n compounds (rows) and k descriptors (columns). The diagonal elements of H are the leverages, or hat values (h_i), of the chemicals in the descriptor space.

The Williams graph plots the standardized cross-validated residuals (R) against the hat values. The warning leverage (h* = 0.0337, blue vertical line) is generally fixed at 3p/n, where n is the number of training compounds and p is the number of model variables plus one. A standardized residual of 3 is taken as the cutoff: points lying within ±3 standard deviations of the mean (between the two horizontal red lines) are accepted, which covers 99 % of normally distributed data.
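
The leverage calculation of Eq. 8 can be sketched as below. The snippet assumes the conventional warning leverage h* = 3p/n and a toy design matrix with an intercept column plus one descriptor; the function names and the data are hypothetical, and the numbers are not those of the studied model.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def leverages(X):
    """Diagonal of H = X (X^T X)^-1 X^T, computed row by row."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    # h_i = x_i^T (X^T X)^-1 x_i, obtained by solving (X^T X) z = x_i.
    return [sum(xi * zi for xi, zi in zip(row, solve(XtX, list(row)))) for row in X]


def warning_leverage(n, p):
    """Conventional threshold h* = 3p/n (assumed definition)."""
    return 3.0 * p / n


# Toy design matrix: intercept column plus one descriptor, with one remote point.
X = [[1.0, float(x)] for x in list(range(19)) + [100]]
h = leverages(X)
h_star = warning_leverage(n=len(X), p=len(X[0]))
flagged = [i for i, hi in enumerate(h) if hi > h_star]
print(h_star, flagged)
```

The hat values always sum to the number of columns of X, so a threshold of three times the average leverage p/n singles out compounds that are unusually far from the bulk of the training data.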

The AD is the region 0 ≤ h ≤ 0.0337 and −3 ≤ R ≤ +3, and predictions within this region are considered valid. Figure 5 shows the Williams graph of the studied model; the majority of the test and training compounds lie in this region. There are 24 points that are poorly predicted by the model (R > 3 or R < −3) although their hat values lie within the AD. These erroneous predictions can probably be attributed to faulty experimental data rather than to molecular structure [20].

Fig. 5 Williams graph of the developed model

The absolute relative deviation is defined as follows:

$$ {\text{ARD}}\% = \left( {\frac{100}{{N_{{\text{p}}} }}} \right)\sum\limits_{{i = 1}}^{{N_{{\text{p}}} }} {\left| {\frac{{c_{{\exp }} - c_{{{\text{calc}}}} }}{{c_{{\exp }} }}} \right|} , $$
(9)

where N_p is the total number of data points. For the GFA-derived model, the mean ARD % over the 72 studied liquids is 10.4 %, with a maximum deviation of 34.2 %. The highest error of the GFA model belongs to water at T = 452.57 K. However, this point lies in the poorly predicted region of the Williams graph, so there is a high chance that the experimental value itself is faulty. Of the estimated speeds of sound, 32.7 % were within an absolute deviation of 0.00–3.00 %, 19.9 % within 3.001–6.00 %, 16.5 % within 6.001–10.00 %, 8.2 % within 10.001–13.00 %, 11.1 % within 13.001–20 %, and 11.6 % within 20.1–34.2 %. For the nonlinear model, 98.4 % of the estimated speeds of sound were within an absolute deviation of 0.00–3.00 %, and merely 1.6 % of the predicted values had an error higher than 3 %.
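
A short sketch of how such deviation statistics can be computed is given below. It assumes that the deviation of Eq. 9 is expressed in percent, that is, multiplied by 100; the function names and the four experimental and calculated values are made up for illustration and do not come from Table 1.

```python
def point_ard_percent(c_exp, c_calc):
    """Absolute relative deviation of each point, in percent."""
    return [100.0 * abs((e - c) / e) for e, c in zip(c_exp, c_calc)]


def mean_ard_percent(c_exp, c_calc):
    """Mean ARD % over all points (Eq. 9, expressed in percent)."""
    devs = point_ard_percent(c_exp, c_calc)
    return sum(devs) / len(devs)


def share_within(c_exp, c_calc, low, high):
    """Percentage of points whose ARD % lies in the interval [low, high]."""
    devs = point_ard_percent(c_exp, c_calc)
    return 100.0 * sum(1 for d in devs if low <= d <= high) / len(devs)


# Illustrative values in m/s (not data from this study).
c_exp = [1000.0, 1200.0, 1500.0, 800.0]
c_calc = [990.0, 1260.0, 1500.0, 880.0]

mean_ard = mean_ard_percent(c_exp, c_calc)
max_ard = max(point_ard_percent(c_exp, c_calc))
share_0_3 = share_within(c_exp, c_calc, 0.0, 3.0)
print(mean_ard, max_ard, share_0_3)
```

Reporting the share of points per deviation band, as done above, complements the mean because a few large outliers can dominate the average.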

The applied validation techniques and their interpretations are shown in Table 5. Detailed statistical procedures for these techniques can be found in previous works of the authors [2, 7, 8, 18, 21–24]. The results of the validation techniques indicate that the derived model is not only accurate but also not prone to chance correlation.

Table 5 Validation techniques

Conclusions

Thanks to highly accurate measurement protocols, the use of the speed of sound to correlate thermodynamic properties of liquids has received much attention in the past decade. Its ease of measurement and high precision make the speed of sound a reliable alternative to arduous (p, ρ, T) measurements at high pressures. Despite its broad applications, no study has been conducted on the prediction of the speed of sound in liquids.

In this communication, a robust twelve-parameter QSPR model is introduced for the first time to estimate the speed of sound of 73 liquids over a wide range of temperatures. GFA is applied for subset variable selection as well as linear model derivation. For more accurate modeling, and to study the nonlinearity of the speed of sound, the LSSVM approach is also used to develop a nonlinear model. The LSSVM results reveal a significant improvement in predictive power and a substantial reduction in the deviation of the predicted values from the experimental ones.

To investigate the reliability of the model, its AD is also studied. The presence of the majority of both training and test set data within the AD shown by the Williams graph supports the validity of the predictions. In addition, analysis based on the AD of the derived model and LSSVM (0 ≤ h ≤ 0.0337 and R > 3 or R < −3) implies that the reported experimental data for 24 data points are questionable and need to be re-examined. With the derived model parameters and its AD, the reliability of experimental data can be analyzed to find flawed data points. Moreover, the reliability and predictive capability of the model are scrutinized by various statistical validation techniques. The results of these techniques show that the model is stable, accurate, and free of chance correlation. The speeds of sound predicted by both the GFA and LSSVM models for the studied data points, together with the corresponding model descriptor values, are provided as supplementary information.