Abstract
Quantitative structure–property relationship (QSPR) models for the speed of sound in liquids are developed from molecular descriptors. A dataset of 1,470 experimental speed-of-sound values for 73 liquids is used to derive the models. Twelve descriptors are selected by genetic function approximation (GFA) to relate the speed of sound in liquids to their chemical structures. To capture the nonlinear behavior of the speed of sound in liquids, a model based on the least-squares support vector machine (LSSVM) is also developed. The derived models are assessed with several statistical validation techniques.
Introduction
Speed of sound (u) is an important parameter in both acoustics and thermodynamics. Speed of sound and density stand out among thermodynamic properties because they can be measured with an accuracy at least one order of magnitude higher than that of other quantities. Density has conventionally been used for modeling; recently, however, attention has shifted to the speed of sound, owing to the development of rigorous measurement protocols over a wide range of temperatures and pressures in the fluid state. Because it can be measured quickly and very accurately, the speed of sound is a reliable basis for estimating other thermodynamic properties with high precision. All observable thermodynamic properties of a fluid phase can be obtained from the speed of sound by integrating the partial differential equations that relate it to the other thermodynamic properties. Owing to the high accuracy of acoustic data, this procedure can give better predictions than conventional direct approaches.
In liquids, combining speed of sound data with (p, ρ, T) data offers an alternative to calorimetry for determining heat capacities:
Also, combining the speed of sound with (p, ρ, T) data is a promising experimental route to the heat-capacity ratio γ and the isentropic compressibility κ_s of pure liquids:
where
At higher pressures, (p, ρ, T) measurements are much more difficult and in this region sound speed measurements in liquids are probably of the greatest value [1].
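For orientation, the standard textbook identities that link u to the quantities named above can be written as follows. They are given here as general thermodynamic relations and may differ in notation and numbering from the article's own equations; c_p denotes the specific (per unit mass) isobaric heat capacity, and α_p and κ_T are assumed symbols for the isobaric expansivity and the isothermal compressibility, both obtainable from (p, ρ, T) data.

```latex
\begin{align}
  \kappa_s &= \frac{1}{\rho u^{2}}, \\
  \gamma   &= \frac{c_p}{c_v} = \frac{\kappa_T}{\kappa_s}, \\
  c_p      &= \frac{T \alpha_p^{2}}{\rho \left( \kappa_T - \kappa_s \right)},
\end{align}
where
\begin{align}
  \alpha_p &= -\frac{1}{\rho} \left( \frac{\partial \rho}{\partial T} \right)_{p}, &
  \kappa_T &= \frac{1}{\rho} \left( \frac{\partial \rho}{\partial p} \right)_{T}.
\end{align}
```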
In this communication, the quantitative structure–property relationship (QSPR) methodology [2–8] is applied to predict u for a wide range of liquids over a broad range of temperatures.
Methodology
Data preparation
In this study, a comprehensive dataset comprising 1,470 speed-of-sound data points for 73 liquids over a wide temperature range (58–646.47 K) was extracted from ThermoData Engine [9]. Because its experimental data are critically evaluated, ThermoData Engine is one of the most reliable sources from which to collect experimental data.
Training and test set selection
Typically, in QSPR modeling, the compiled experimental database is split into two subsets: a training set, which is used for model development, and a test set, which is used to assess how well the model learned from the training set produces reliable results for compounds it has not seen. In this study, K-means clustering is applied to select the training and test sets. K-means clustering is a method of cluster analysis that partitions n observations into k clusters such that each observation belongs to the cluster with the nearest mean. As a rule of thumb, 20 % of the collected data was retained to test the model and the remainder was used for model derivation [10]. For the LSSVM model, the data points were split 80-10-10 % into training, validation, and test sets, respectively. This selection, like the previous one, is performed by K-means clustering.
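The cluster-based split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not state the number of clusters or how points are drawn from each cluster, so the function names, `k=5`, and the per-cluster random draw are all assumptions.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for every row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # Assign every point to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mean; an empty cluster keeps its previous centre.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def cluster_split(X, k=5, test_fraction=0.2, seed=0):
    """Hold out roughly test_fraction of every cluster as the test set."""
    rng = np.random.default_rng(seed)
    labels = kmeans_labels(X, k, seed=seed)
    test_idx = []
    for j in range(k):
        members = rng.permutation(np.flatnonzero(labels == j))
        n_test = int(round(test_fraction * len(members)))
        test_idx.extend(members[:n_test].tolist())
    test_idx = np.array(sorted(test_idx), dtype=int)
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    return train_idx, test_idx
```

Drawing the held-out points cluster by cluster keeps the test set spread over the same regions of descriptor space as the training set, which is the reason for clustering before splitting.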
Calculation of descriptor
Prior to the descriptor calculation, the 3D structures of the compounds must be optimized. The well-known Dreiding force field [11], as implemented in Chemaxon’s JChem software, was applied to optimize the 3D structures in this study. About 3,000 descriptors from 22 diverse classes are calculated by Dragon software [12]. These 22 classes are constitutional descriptors, topological indices, walk and path counts, connectivity indices, information indices, 2D autocorrelations, Burden eigenvalues, edge-adjacency indices, functional group counts, atom-centered fragments, molecular properties, topological charge indices, eigenvalue-based indices, Randic molecular profiles, geometrical descriptors, RDF descriptors, 3D-MoRSE descriptors, WHIM descriptors, GETAWAY descriptors, charge descriptors, 2D binary fingerprints, and 2D frequency fingerprints.
Descriptors that could not be calculated for a compound are excluded completely from the list. Next, the pairwise correlation between descriptors is computed. For each pair with a correlation greater than 0.9, one of the two descriptors is omitted at random.
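The pairwise-correlation filter can be sketched as below. This is an illustrative assumption about the details: the function name is hypothetical, the absolute value of the Pearson coefficient is compared with the threshold, the random choice is seeded for reproducibility, and constant columns are assumed to have been removed beforehand because their correlation is undefined.

```python
import numpy as np

def drop_correlated(D, threshold=0.9, seed=0):
    """Return the sorted column indices of D that survive the pruning.

    D is an (n_samples, n_descriptors) array. For every pair of columns whose
    absolute Pearson correlation exceeds `threshold`, one of the two is
    dropped at random.
    """
    rng = np.random.default_rng(seed)
    corr = np.abs(np.corrcoef(D, rowvar=False))
    kept = set(range(D.shape[1]))
    for i in range(D.shape[1]):
        for j in range(i + 1, D.shape[1]):
            # Only pairs in which both columns are still present are tested.
            if i in kept and j in kept and corr[i, j] > threshold:
                kept.discard(int(rng.choice([i, j])))
    return sorted(kept)
```

Because a dropped column is never restored, no two surviving columns have a correlation above the threshold once the loop finishes.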
Subset variable selection
Genetic function approximation (GFA) was used for subset variable selection in this study. GFA, originally developed by Rogers and Hopfinger [13], is the fusion of two seemingly distinct algorithms: the multivariate adaptive regression splines algorithm [14] and the genetic algorithm [15]. A useful feature of GFA is that it evolves a series of models instead of a single one. In addition, by using Friedman’s lack-of-fit (LOF) scoring function, GFA yields models that are less prone to overfitting and give better predictions. In this study, the population size and the maximum number of generations are set to 100 and 5,000, respectively. The value of mutation probability is set to 1.5 in this study. To study the nonlinear nature of the speed of sound, the LSSVM is also used for model derivation.
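As a rough sketch of how the LOF score penalizes model size, the form commonly quoted for GFA is LOF = LSE / (1 − (c + d·p)/M)², where LSE is the least-squares error, c the number of basis functions, p the number of features in the model, M the number of samples, and d a user-set smoothing parameter. The function and argument names below are hypothetical, and the value of d used in this study is not stated in the text.

```python
def friedman_lof(lse, n_basis, n_features, n_samples, smoothing=1.0):
    """Friedman's lack-of-fit score: the least-squares error inflated by a
    penalty that grows with the number of terms in the model."""
    penalty = 1.0 - (n_basis + smoothing * n_features) / n_samples
    if penalty <= 0.0:
        raise ValueError("model has too many terms for this many samples")
    return lse / penalty ** 2
```

For a fixed least-squares error, adding terms always raises the score, so a larger model is preferred only when it reduces the error by more than the penalty costs.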
Result and discussion
Linear model
The final linear model derived by GFA for estimation of speed of sound in liquids contains 12 descriptors as follows:
Table 1 lists the GFA-predicted values of the speed of sound in the studied liquids (u is in m s−1). The definitions of the molecular descriptors in Eq. 7 are listed in Table 2. Figure 1 plots the predicted speed of sound values against the experimental data. As can be seen, the majority of points lie close to the diagonal of the graph, which indicates reasonable agreement between the GFA-predicted values and the experimental ones. Relative deviations of the GFA-predicted values from the experimental data are depicted in Fig. 2.
Nonlinear model
For nonlinear modeling, LSSVM was used in this study. LSSVM is a member of the support vector machine (SVM) family of machine-learning methods, which is based on seeking an optimal separating hyperplane in the feature space that minimizes the expected generalization error. The detailed mathematical treatment of the optimization problem solved by the LSSVM approach is not provided here and can be found in the cited references [16–18]. To let the SVM algorithm handle nonlinear problems, the radial basis function is used as the kernel function. The kernel function maps the data into a higher-dimensional feature space, in which a linear model can capture relationships that are nonlinear in the original descriptors. The simulated annealing optimization method is used to find the proper combination of the LSSVM parameters (γ, σ²), taking the minimum mean squared error of leave-one-out (LOO) cross-validation on the training set as the optimality criterion. The twelve descriptors selected for the linear model by GFA were used as inputs to the LSSVM for the nonlinear model derivation. The parameters obtained for the final model are γ = 14649.98 and σ² = 0.5602.
Figure 3 shows the LSSVM-predicted values versus the experimental speeds of sound. As is clear from this figure, employing LSSVM instead of GFA greatly improves the predictions. The deviations of the LSSVM-predicted values from the experimental ones are depicted in Fig. 4, where the significant reduction in deviation relative to the GFA model is apparent. Table 3 lists the LSSVM-predicted values of the speed of sound in the studied liquids. Statistical parameters of the LSSVM model are listed in Table 4.
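A minimal sketch of LSSVM regression with an RBF kernel is given below, following the standard formulation in which training reduces to solving one linear system for the bias b and the multipliers α. It is not the authors' implementation: the function names are hypothetical, the kernel is assumed to be exp(−‖x − z‖²/σ²) (some packages use 2σ² in the denominator), and the simulated-annealing search over (γ, σ²) is not shown.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """K[i, j] = exp(-||A_i - B_j||^2 / sigma2) for the rows of A and B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvm_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.asarray(y, dtype=float)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]  # bias b, multipliers alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma2):
    """Prediction f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b
```

In this formulation γ weights the squared training errors against model smoothness, so a large γ fits the training data more closely, while σ² sets the width of the kernel.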
Applicability domain (AD) [19]
To test the reliability of the predicted responses, the AD of the derived model is investigated. AD is a theoretical spatial domain defined by molecular descriptors as well as by both training and test sets. The AD objective is to investigate whether the test and training sets share the same domain or not. This is crucial since prediction outside of the AD might be erroneous.
In this study, a Williams graph generated from hat indices is used to investigate the AD. Hat indices are calculated from the hat matrix (H), which is defined as follows:
$$H = X\left(X^{T}X\right)^{-1}X^{T}$$
where X is a two-dimensional matrix comprising n compounds (rows) and k descriptors (columns). The diagonal elements of H are the leverages or hat values (h_i) of the chemicals in the descriptor space.
The Williams graph plots the standardized cross-validated residuals (R) against the hat values. A warning leverage (h* = 0.0337), shown as the blue vertical line, is generally fixed at 3p/n, where n is the number of training compounds and p is the number of model variables plus one. A standardized residual of 3 is taken as the cutoff value: points lying within ±3 standard deviations of the mean (two horizontal red lines) are accepted, which covers 99 % of normally distributed data.
The AD is located in the region of 0 ≤ h ≤ 0.0337 and −3 ≤ R ≤ +3. Predictions within this region are considered valid. As Fig. 5, the Williams graph of the studied model, clearly illustrates, the majority of test and training compounds are located in this region. There are 24 points that are wrongly predicted by the model (R > 3 or R < −3), although their hat values lie within the AD. These erroneous predictions can probably be attributed to wrong experimental data rather than to the molecular structure [20].
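The leverage calculation behind a Williams graph can be sketched as follows, using the standard hat matrix H = X(XᵀX)⁻¹Xᵀ and the usual convention h* = 3p/n for the warning leverage. The function names are hypothetical, and whether X carries an intercept column is an assumption left to the caller.

```python
import numpy as np

def leverages(X):
    """Diagonal of H = X (X^T X)^-1 X^T, computed row by row so that the
    full n x n hat matrix is never formed."""
    xtx_inv = np.linalg.pinv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, xtx_inv, X)

def warning_leverage(n_train, n_descriptors):
    """h* = 3p/n, with p the number of model variables plus one."""
    return 3.0 * (n_descriptors + 1) / n_train

def in_domain(h, std_residuals, h_star, cutoff=3.0):
    """True where a point lies inside the AD: h <= h* and |R| <= cutoff."""
    return (np.asarray(h) <= h_star) & (np.abs(std_residuals) <= cutoff)
```

A point with |R| > 3 but h ≤ h* is structurally similar to the training compounds yet poorly predicted, which is the pattern the text attributes to questionable experimental values.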
The absolute relative deviation is defined as follows:
where N_p is the total number of points. For the GFA-derived model, the mean ARD % over the 72 studied liquids is 10.4 %, with a maximum deviation of 34.2 %. The highest error of the GFA model belongs to water at T = 452.57 K. However, this point lies in the wrongly predicted region, so there is a high chance that the experimental value itself is wrong. Of the estimated speeds of sound, 32.7 % were within an absolute deviation of 0.00–3.00 %, 19.9 % within 3.001–6.00 %, 16.5 % within 6.001–10.00 %, 8.2 % within 10.001–13.00 %, 11.1 % within 13.001–20 %, and 11.6 % within 20.1–34.2 %. For the nonlinear model, 98.4 % of the estimated speeds of sound were within an absolute deviation of 0.00–3.00 %, and merely 1.6 % of the predicted values have an error higher than 3 %.
The applied validation techniques and their interpretations are shown in Table 5. Readers can find the detailed statistical procedures for these techniques in previous works of the authors [2, 7, 8, 18, 21–24]. The results of the validation techniques indicate that the derived model is not only accurate but also not prone to chance correlation.
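The deviation statistics quoted above can be reproduced from predicted and experimental values with a short script. The sketch below assumes the usual definition, ARD % = 100·|u_pred − u_exp|/u_exp for each point, averaged over N_p points for the mean; the function names are hypothetical.

```python
def ard_percent(predicted, experimental):
    """Absolute relative deviation of every point, in percent."""
    return [100.0 * abs(p - e) / abs(e) for p, e in zip(predicted, experimental)]

def mean_ard_percent(predicted, experimental):
    """Mean ARD % over all N_p points."""
    ards = ard_percent(predicted, experimental)
    return sum(ards) / len(ards)

def share_within(ards, low, high):
    """Percentage of points whose ARD % lies in the closed interval [low, high]."""
    return 100.0 * sum(low <= a <= high for a in ards) / len(ards)
```

For instance, predictions of 101, 95, and 110 m s−1 against experimental values of 100 m s−1 each give deviations of 1 %, 5 %, and 10 %, so one third of the points fall in the 0–3 % band.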
Conclusions
In light of highly accurate measurement protocols, the use of speed of sound to correlate thermodynamic properties of liquids has received much attention in the recent decade. The ease and high precision of its measurement make speed of sound a reliable alternative to arduous (p, ρ, T) measurements at high pressures. Despite its broad applications, no study has been conducted on the prediction of the speed of sound in liquids.
In this communication, a robust twelve-parameter QSPR model is introduced to estimate the speed of sound of 73 liquids over a wide range of temperatures. GFA is applied for subset variable selection as well as linear model derivation. For more accurate modeling and to study the nonlinearity of the speed of sound, the LSSVM approach is also used to develop a nonlinear model. The results of LSSVM modeling reveal a significant improvement in predictive power as well as a substantial reduction in the deviation of predicted values from experimental ones.
To investigate the reliability of the model, its AD is also studied. The presence of the majority of both training and test set data in the AD generated by the Williams graph supports the validity of the predictions. In addition, analysis based on the AD of the derived model and LSSVM (0 ≤ h ≤ 0.0337 and R > 3 or R < −3) implies that the reported experimental data for 24 data points are questionable and need to be re-examined. With the aid of the derived model parameters and its AD, the reliability of experimental data can be analyzed to find flawed data points. Moreover, the reliability and predictive capability of the model are scrutinized by various statistical validation techniques. The results of these techniques show that the model is stable, accurate, and free of chance correlation. The speeds of sound predicted by both the GFA and LSSVM models for the studied data points, as well as the corresponding model descriptor values, are provided as supplementary information.
References
Goodwin ARH, Trusler JPM. Speed of sound. In: Goodwin ARH, Marsh KN, Wakeham WA, editors. Experimental thermodynamics. Amsterdam: Elsevier; 2003. p. 237–323.
Gharagheizi F, Eslamimanesh A, Ilani-Kashkouli P, Mohammadi AH, Richon D. QSPR molecular approach for representation/prediction of very large vapor pressure dataset. Chem Eng Sci. 2012;76:99–107.
Gharagheizi F, Eslamimanesh A, Sattari M, Mohammadi AH, Richon D. Corresponding states method for evaluation of the solubility parameters of chemical compounds. Ind Eng Chem Res. 2012;51:3826–31.
Gharagheizi F, Eslamimanesh A, Sattari M, Tirandazi B, Mohammadi AH, Richon D. Evaluation of thermal conductivity of gases at atmospheric pressure through a corresponding states method. Ind Eng Chem Res. 2012;51:3844–9.
Gharagheizi F, Gohar MRS, Vayeghan MG. A quantitative structure-property relationship for determination of enthalpy of fusion of pure compounds. J Therm Anal Calorim. 2012;109:501–6.
Gharagheizi F, Ilani-Kashkouli P, Mohammadi AH. Computation of normal melting temperature of ionic liquids using a group contribution method. Fluid Phase Equilibria. 2012;329:1–7.
Mirkhani SA, Gharagheizi F, Ilani-Kashkouli P, Farahani N. Determination of the glass transition temperature of ionic liquids: a molecular approach. Thermochim Acta. 2012;543:88–95.
Mirkhani SA, Gharagheizi F, Ilani-Kashkouli P, Farahani N. An accurate model for the prediction of the glass transition temperature of ammonium based ionic liquids: a QSPR approach. Fluid Phase Equilibria. 2012;324:50–63.
Frenkel M, Chirico RD, Diky V, Yan X, Dong Q, Muzny C. ThermoData Engine (TDE): software implementation of the dynamic data evaluation concept. J Chem Inf Model. 2005;45:816–38.
Gharagheizi F. QSPR analysis for intrinsic viscosity of polymer solutions by means of GA-MLR and RBFNN. Comput Mater Sci. 2007;40:159–67.
Mayo SL, Olafson BD, Goddard WA. DREIDING: a generic force field for molecular simulations. J Phys Chem. 1990;94:8897–909.
Talete S. Dragon for windows (Software for Molecular Descriptor Calculations), Version 5.5. 2007. http://www.talete.mi.it/.
Rogers D, Hopfinger AJ. Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci. 1994;34:854–66.
Friedman JH. Multivariate adaptive regression splines. Ann Stat. 1991;19:1–67.
Holland JH. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. 1st MIT Press ed. MIT Press; 1992.
Eslamimanesh A, Gharagheizi F, Illbeigi M, Mohammadi AH, Fazlali A, Richon D. Phase equilibrium modeling of clathrate hydrates of methane, carbon dioxide, nitrogen, and hydrogen + water soluble organic promoters using support vector machine algorithm. Fluid Phase Equilibria. 2012;316:34–45.
Gharagheizi F, Eslamimanesh A, Farjood F, Mohammadi AH, Richon D. Solubility parameters of nonelectrolyte organic compounds: determination using quantitative structure-property relationship strategy. Ind Eng Chem Res. 2011;50:11382–95.
Mousavisafavi SM, Gharagheizi F, Mirkhani SA, Akbari J. A predictive quantitative structure-property relationship for glass transition temperature of 1,3-dialkyl imidazolium ionic liquids—Part 2. The nonlinear approach. J Therm Anal Calorim. 2013;111:1639–48.
Gramatica P. Modelling chemicals in the environment. In: Livingstone DJ, Davis AM, editors. Drug design strategies: quantitative approaches. London: The Royal Society of Chemistry; 2012. p. 458–78.
Gramatica P. Principles of QSAR models validation: internal and external. QSAR Comb Sci. 2007;26:694–701.
Gharagheizi F, Ilani-Kashkouli P, Mirkhani SA, Farahani N, Mohammadi AH. QSPR molecular approach for estimating Henry’s law constants of pure compounds in water at ambient conditions. Ind Eng Chem Res. 2012;51:4764–7.
Mirkhani SA, Gharagheizi F. Predictive quantitative structure-property relationship model for the estimation of ionic liquid viscosity. Ind Eng Chem Res. 2012;51:2470–7.
Mirkhani SA, Gharagheizi F, Sattari M. A QSPR model for prediction of diffusion coefficient of non-electrolyte organic compounds in air at ambient condition. Chemosphere. 2012;86:959–66.
Mousavisafavi SM, Mirkhani SA, Gharagheizi F, Akbari J. A predictive quantitative structure-property relationship for glass transition temperature of 1,3-dialkyl imidazolium ionic liquids—Part 1. The linear approach. J Therm Anal Calorim. 2013;111:235–46.
Bonchev D. Information theoretic indices for characterization of chemical structures. Chichester: Research Studies Press; 1983.
Balaban AT, Balaban T-S. New vertex invariants and topological indices of chemical graphs based on information on distances. J Math Chem. 1991;8:383–97.
Mekenyan O, Peitchev D, Bonchev D, Trinajstic N. Arzneim Forsch. 1986;36:176–83.
Gasteiger JE. Software-Entwicklung in der Chemie 10 = Software development in chemistry: GDCh; 1995.
Todeschini R, Bettiol C, Giurin G, Gramatica P, Miana P, Argese E. Modeling and prediction by using WHIM descriptors in QSAR studies: submitochondrial particles (SMP) as toxicity biosensors of chlorophenols. Chemosphere. 1996;33:71–9.
Consonni V, Todeschini R, Pavan M. Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 1. Theory of the novel 3D molecular descriptors. J Chem Inf Comput Sci. 2002;42:682–92.
Krzanowski WJ. Principles of multivariate analysis: a user’s perspective. Rev ed. Oxford: Oxford University Press; 2000.
Todeschini R, Consonni V. Molecular descriptors for chemoinformatics. 2nd ed., rev. and enl. Weinheim: Wiley-VCH; 2009.
Gharagheizi F, Eslamimanesh A, Mohammadi AH, Richon D. QSPR approach for determination of parachor of non-electrolyte organic compounds. Chem Eng Sci. 2011;66:2959–67.
Gharagheizi F, Eslamimanesh A, Mohammadi AH, Richon D. Representation/prediction of solubilities of pure compounds in water using artificial neural network-group contribution method. J Chem Eng Data. 2011;56:720–6.
Gharagheizi F, Eslamimanesh A, Mohammadi AH, Richon D. Use of artificial neural network-group contribution method to determine surface tension of pure compounds. J Chem Eng Data. 2011;56:2587–601.
Gharagheizi F, Gohar MRS, Vayeghan MG. A quantitative structure-property relationship for determination of enthalpy of fusion of pure compounds. J Therm Anal Calorim. 2011;27:1–6.
Efron B. Better bootstrap confidence intervals. J Am Stat Assoc. 1987;82:171–85.
Lindgren F, Hansen B, Karcher W, Sjöström M, Eriksson L. Model validation by permutation tests: applications to variable selection. J Chemom. 1996;10:521–32.
Chiou J. Hybrid method of evolutionary algorithms for static and dynamic optimization problems with application to a fed-batch fermentation process. Comput Chem Eng. 1999;23:1277–91.
Cite this article
Bagheri-Chokami, Y., Farahani, N., Mirkhani, S.A. et al. A chemical structure-based model for estimating speed of sound in liquids. J Therm Anal Calorim 116, 529–538 (2014). https://doi.org/10.1007/s10973-013-3465-9
DOI: https://doi.org/10.1007/s10973-013-3465-9