Abstract
In this study, we present a new model for the prediction of θ (lower critical solution temperature), developed using a database of 169 data points that includes 12 polymers and 67 solvents. For the characterization of polymer and solvent molecules, a number of molecular descriptors (topological, physicochemical, steric and electronic) were examined. The best subset of descriptors was selected using the elimination selection-stepwise regression method. Multiple linear regression (MLR) served as the statistical tool to explore the potential correlation between the molecular descriptors and the experimental data. The prediction accuracy of the MLR model was tested using the leave-one-out cross-validation procedure, validation through an external test set and the Y-randomization evaluation technique. Finally, the domain of applicability was determined to identify the reliable predictions.
Introduction
The phase behavior of polymer solutions is an important property involved in the development and design of most polymer-related processes. Partially miscible polymer solutions often exhibit two solubility boundaries, the upper critical solution temperature (UCST) and the lower critical solution temperature (LCST), both of which depend on the molar mass and the pressure. At temperatures below the LCST, the system is completely miscible in all proportions, whereas above the LCST partial liquid miscibility occurs [1, 2]. θ(LCST) is the LCST at infinite chain length, which is not affected by polymer molar mass. θ(LCST) is often regarded as less important than θ(UCST) because it is usually located at high temperature, where the polymer degrades. Nevertheless, θ(LCST) can serve as an upper temperature limit for polymer processing. Solvent systems that exhibit this LCST behavior have been suggested for applications where partial or complete miscibility above and below the LCST offers advantages [3, 4].
There are three groups of methods for correlating and predicting LCSTs. The first group proposes models that are based on a solid theoretical background using liquid–liquid or vapor–liquid experimental data. These methods require experimental data to adjust the unknown parameters, resulting in limited predictive ability [5–7]. Another approach uses empirical equations that correlate θ(LCST) with physicochemical properties such as density and critical properties, but suffers from the disadvantage that these properties are not always available [8–10]. A new approach proposed by Liu and Zhong develops linear models for the prediction of θ(LCST) using molecular connectivity indices, which depend only on the solvent and polymer structures [11, 12]. The latter approach has proven to be a very useful technique in quantitative structure–activity/property relationships (QSAR/QSPR) research for polymers and polymer solutions. QSAR/QSPR studies constitute an attempt to reduce the trial-and-error element in the design of compounds with desired activity/properties by establishing mathematical relationships between the activity/property of interest and measurable or computable parameters, such as topological, physicochemical, stereochemical, or electronic indices [13–18].
The present study follows the third approach and its goal is to develop a new, more efficient QSPR model for the prediction of θ(LCST) in polymer solutions, using a newly introduced set of molecular descriptors. The model is developed using a rigorous variable-selection technique and the multiple linear regression (MLR) modeling methodology. The accuracy of the produced model is illustrated using numerous model validation techniques.
Materials and methods
A set of 169 experimental θ(LCST) data points, comprising 12 polymers and 67 solvents, was collected from the literature [12]. The polymers and solvents are shown in Tables 1 and 2. A total of 37 topological, physicochemical, steric and electronic descriptors were considered as potential descriptors (inputs) to the QSPR model. The descriptors were calculated from the structure of the monomer compound using ChemSAR, which is included in the ChemOffice (CambridgeSoft Corporation) suite of programs [19]. All structures were fully optimized using MOPAC (included in ChemOffice) and, more specifically, the AM1 Hamiltonian, which provides a balance between speed and accuracy. Eight of the descriptors were the topological descriptors calculated and used by Liu and Zhong [12]. Table 3 shows the remaining 29 descriptors for both the polymers and the solvents.
Stepwise multiple regression
Among the aforementioned indices, the best combinations were determined using a rigorous elimination selection-stepwise regression (ES-SWR) algorithm that was programmed in Matlab. The aim of variable subset selection is to reach optimal model complexity in predicting a response variable using a reduced set of descriptors that are not highly intercorrelated. In particular, the objective of this work was to select the subset of variables that produces the most significant linear QSPR model as far as prediction of θ(LCST) is concerned.
ES-SWR is a popular stepwise technique that combines forward selection (FS)-SWR and backward elimination (BE)-SWR. It is basically a forward-selection approach but, at each step, it considers the possibility of deleting a variable, as in the backward-elimination approach, provided that the number of model variables is greater than two. The two basic elements of the ES-SWR method are described next in more detail.
Forward selection
The variable considered for inclusion at any step is the one yielding the largest single degree of freedom F-ratio among those eligible for inclusion. The variable is included only if this value is larger than a fixed value \(F_{\text{in}}\). At each step, the jth variable is consequently added to a k-size model if
\[ F_{j} = \max_{j} \left( \frac{\text{RSS}_{k} - \text{RSS}_{k+j}}{s^{2}_{k+j}} \right) > F_{\text{in}} \tag{1} \]
In the above inequality, RSS is the residual sum of squares and \(s^{2}\) is the mean square error. The subscript k+j refers to quantities computed when the jth variable is added to the k variables that are already included in the model.
Backward elimination
The variable considered for elimination at any step is the one yielding the minimum single degree of freedom F-ratio among the variables included in the model. The variable is eliminated only if this value does not exceed a specified value \(F_{\text{out}}\). At each step, the jth variable is eliminated from a k-size model if
\[ F_{j} = \min_{j} \left( \frac{\text{RSS}_{k-j} - \text{RSS}_{k}}{s^{2}_{k}} \right) \le F_{\text{out}} \tag{2} \]
The subscript k−j refers to quantities computed when the jth variable is eliminated from the k variables included in the model so far.
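The forward and backward steps described above can be sketched in a few dozen lines. The following is a minimal illustration, not the authors' Matlab implementation: the function names (`solve`, `rss`, `es_swr`), the thresholds `f_in = f_out = 4.0`, the use of the usual partial F-ratio with the mean square error of the larger model, and the toy data are all assumptions made for the example. Only the standard library is used so that the sketch runs anywhere; in practice a routine such as `numpy.linalg.lstsq` would normally do the fitting.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    Z = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    coef = solve(A, b)
    return sum((yi - sum(c * v for c, v in zip(coef, z))) ** 2 for z, yi in zip(Z, y))


def es_swr(X, y, f_in=4.0, f_out=4.0, max_steps=100):
    """Forward selection with a backward-elimination check at every step."""
    n, m = len(y), len(X[0])
    selected = []
    for _ in range(max_steps):
        changed = False
        # Forward step: add the candidate with the largest partial F-ratio.
        k = len(selected)
        dof = n - (k + 1) - 1  # residual degrees of freedom of the (k+1)-variable model
        if dof > 0:
            rss_k = rss(X, y, selected)
            best_f, best_j = None, None
            for j in range(m):
                if j in selected:
                    continue
                rss_kj = rss(X, y, selected + [j])
                s2 = max(rss_kj / dof, 1e-12)  # guard against a perfect fit
                f = (rss_k - rss_kj) / s2
                if best_f is None or f > best_f:
                    best_f, best_j = f, j
            if best_f is not None and best_f > f_in:
                selected.append(best_j)
                changed = True
        # Backward step: only considered when more than two variables are present.
        if len(selected) > 2:
            k = len(selected)
            rss_k = rss(X, y, selected)
            s2 = max(rss_k / (n - k - 1), 1e-12)
            f_min, j_min = min(
                ((rss(X, y, [c for c in selected if c != j]) - rss_k) / s2, j)
                for j in selected
            )
            if f_min <= f_out:
                selected.remove(j_min)
                changed = True
        if not changed:
            break
    return selected


# Hypothetical toy data: the response depends on columns 0 and 2 only.
x0 = list(range(1, 13))
x1 = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
x2 = [1, 0] * 6
x3 = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5]
noise = [0.02, -0.01, 0.03, -0.02, 0.01, -0.03, 0.02, 0.01, -0.02, 0.03, -0.01, -0.03]
X_demo = [list(row) for row in zip(x0, x1, x2, x3)]
y_demo = [5 * a + 3 * c + e for a, c, e in zip(x0, x2, noise)]
print("selected columns:", es_swr(X_demo, y_demo))
```

Because a variable enters with the same partial F-ratio that would later be used to test its removal, choosing `f_in >= f_out` keeps the procedure from cycling; `max_steps` is an extra safeguard.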
Model validation
A reliable and predictive QSPR model should (1) be statistically significant and robust, (2) provide accurate predictions for external data sets not used during the model development, and (3) have its application boundaries defined. The approaches used in this work to ensure the significance and predictive power of the QSPR model are described below.
Cross-validation technique
To explore the reliability of the proposed method, we used the cross-validation method. Based on this technique, a number of modified data sets are created by deleting, in each case, one or a small group (leave-some-out) of objects [20–22]. For each data set, an input–output model is developed, based on the modeling technique used. Each model is evaluated by measuring its accuracy in predicting the responses of the remaining data (the ones not used to develop the model). In particular, the leave-one-out (LOO) procedure was used in this study. It produces a number of models by deleting one object at a time from the training set. The number of models produced by the LOO procedure is therefore equal to the number of available examples n. Prediction error sum of squares (PRESS) is a standard index to measure the accuracy of a modeling method based on the cross-validation technique. Based on the PRESS and SSY (sum of squares of deviations of the experimental values from their mean) statistics, the \(R^{2}_{\text{CV}}\) and \(S_{\text{PRESS}}\) values can be calculated easily. The formulae used to calculate all the aforementioned statistics are presented below (Eqs. 3 and 4):
\[ R^{2}_{\text{CV}} = 1 - \frac{\text{PRESS}}{\text{SSY}} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \overline{y})^{2}} \tag{3} \]
\[ S_{\text{PRESS}} = \sqrt{\frac{\text{PRESS}}{n}} \tag{4} \]
where \(y_{i}\) is the experimental value of object i, \(\hat{y}_{i}\) is the value predicted for object i by the model developed without it, and \(\overline{y}\) is the mean of the experimental values.
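The LOO procedure can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the authors' code: the helper names (`solve`, `ols_fit`, `ols_predict`, `loo_statistics`) and the toy data are hypothetical, the model is an ordinary least-squares fit with an intercept, and \(S_{\text{PRESS}}\) is taken as the square root of PRESS/n, which is one common convention.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def ols_fit(X, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(A, b)


def ols_predict(coef, row):
    return coef[0] + sum(c * x for c, x in zip(coef[1:], row))


def loo_statistics(X, y):
    """Refit the model n times, each time predicting the single left-out object."""
    n = len(y)
    press = 0.0
    for i in range(n):
        coef = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - ols_predict(coef, X[i])) ** 2
    y_mean = sum(y) / n
    ssy = sum((yi - y_mean) ** 2 for yi in y)
    return {"PRESS": press, "R2_CV": 1.0 - press / ssy, "S_PRESS": math.sqrt(press / n)}


# Hypothetical toy data (two descriptors, six systems); not taken from the paper.
X_demo = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
y_demo = [2.1, 4.9, 1.2, 3.8, 7.1, 2.0]
print(loo_statistics(X_demo, y_demo))
```

Each left-out object is predicted by a model that never saw it, which is why \(R^{2}_{\text{CV}}\) is usually lower than the fitted \(R^{2}\) and can even become negative for a poor model.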
Y-randomization test
This technique is used to check the robustness of a QSPR model [23, 24]. The dependent variable vector (property) is randomly shuffled and a new QSPR model is developed using the original independent variable matrix. The new QSPR models (after several repetitions) are expected to have low \(R^{2}\) and \(R^{2}_{\text{CV}}\) values. If the opposite happens, then an acceptable QSPR model cannot be obtained for the specific modeling method and data.
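A minimal sketch of the Y-randomization test follows. The function names, the number of shuffles, the seed and the toy data are hypothetical choices for illustration, and only the fitted \(R^{2}\) is computed here; a cross-validated statistic could be evaluated on each shuffled vector in the same loop.

```python
import random


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def r_squared(X, y):
    """Coefficient of determination of an OLS fit with intercept."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    coef = solve(A, b)
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - sum(c * v for c, v in zip(coef, z))) ** 2 for z, yi in zip(Z, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def y_randomization(X, y, n_shuffles=10, seed=0):
    """Refit the model on shuffled copies of y; X is left untouched."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        y_shuffled = list(y)  # copy, so the caller's data are not modified
        rng.shuffle(y_shuffled)
        scores.append(r_squared(X, y_shuffled))
    return scores


# Hypothetical toy data; not taken from the paper.
X_demo = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]]
y_demo = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1]
print("original R2:", r_squared(X_demo, y_demo))
print("shuffled R2:", y_randomization(X_demo, y_demo, n_shuffles=5))
```

If the shuffled scores are routinely close to the original one, the original fit is likely a chance correlation.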
Estimation of the predictive ability of a QSPR model
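According to Tropsha et al. [24], the predictive power of a QSPR model can be estimated conveniently by an external \(R^{2}_{\text{CV,ext}}\) (Eq. 5):
\[ R^{2}_{\text{CV,ext}} = 1 - \frac{\sum_{i=1}^{n_{\text{test}}} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n_{\text{test}}} (y_{i} - \overline{y}_{\text{tr}})^{2}} \tag{5} \]
where \(y_{i}\) and \(\hat{y}_{i}\) are the experimental and predicted values for the test-set systems and \(\overline{y}_{\text{tr}}\) is the mean value of the dependent variable for the training set.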
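Furthermore, the same group [24, 25] considered a QSPR model predictive if the following conditions are satisfied:
\[ R^{2}_{\text{CV,ext}} > 0.5 \tag{6} \]
\[ R^{2} > 0.6 \tag{7} \]
\[ \frac{R^{2} - R^{2}_{\text{o}}}{R^{2}} < 0.1 \quad \text{or} \quad \frac{R^{2} - R^{\prime 2}_{\text{o}}}{R^{2}} < 0.1 \tag{8} \]
\[ 0.85 \le k \le 1.15 \quad \text{or} \quad 0.85 \le k^{\prime} \le 1.15 \tag{9} \]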
The mathematical definitions of \(R^{2}_{{\text{o}}} ,\) \(R^{{\prime 2}}_{{\text{o}}} ,\) k and k′ are based on regression of the observed activities against predicted activities and the opposite (regression of the predicted activities against observed activities). The definitions are presented clearly in [25] and are not repeated here for brevity.
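The external-validation checks can be illustrated as follows. This is a sketch under explicit assumptions: the function name `external_validation` and the example numbers are hypothetical, the thresholds (0.5, 0.6, 0.1, 0.85 to 1.15) are the values commonly quoted for these criteria, and the through-origin quantities follow one common reading of the definitions in [25], which should be consulted for the authoritative forms.

```python
def external_validation(y_obs, y_pred, y_train_mean):
    """External R2 and through-origin regression checks for a test set."""
    n = len(y_obs)
    mean_o = sum(y_obs) / n
    mean_p = sum(y_pred) / n
    s_oo = sum((o - mean_o) ** 2 for o in y_obs)
    s_pp = sum((p - mean_p) ** 2 for p in y_pred)
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))

    # External R2: deviations are measured from the TRAINING-set mean.
    r2_ext = 1.0 - (
        sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
        / sum((o - y_train_mean) ** 2 for o in y_obs)
    )
    r2 = s_op ** 2 / (s_oo * s_pp)  # squared Pearson correlation

    # Slopes of the regressions through the origin (assumed reading of [25]).
    sum_op = sum(o * p for o, p in zip(y_obs, y_pred))
    k = sum_op / sum(p * p for p in y_pred)        # observed vs predicted
    k_prime = sum_op / sum(o * o for o in y_obs)   # predicted vs observed
    r0_sq = 1.0 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / s_oo
    r0_prime_sq = 1.0 - sum((p - k_prime * o) ** 2 for o, p in zip(y_obs, y_pred)) / s_pp

    checks = {
        "r2_ext > 0.5": r2_ext > 0.5,
        "r2 > 0.6": r2 > 0.6,
        "r0 closeness": (r2 - r0_sq) / r2 < 0.1 or (r2 - r0_prime_sq) / r2 < 0.1,
        "slope": 0.85 <= k <= 1.15 or 0.85 <= k_prime <= 1.15,
    }
    stats = {"r2_ext": r2_ext, "r2": r2, "k": k, "k_prime": k_prime,
             "r0_sq": r0_sq, "r0_prime_sq": r0_prime_sq}
    return stats, checks


# Hypothetical test-set values in kelvin; not taken from the paper.
y_obs_demo = [420.0, 455.0, 510.0, 388.0, 602.0]
y_pred_demo = [431.0, 447.0, 498.0, 402.0, 590.0]
stats, checks = external_validation(y_obs_demo, y_pred_demo, y_train_mean=470.0)
print(stats)
print("passes all checks:", all(checks.values()))
```

Using the training-set mean in the denominator of the external statistic means the test set is judged against the baseline the model actually had available when it was built.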
Defining the model-applicability domain
For a QSPR model to be used for the prediction of θ(LCST) of new systems, its domain of application [24, 26] must be defined, and only predictions for compounds that fall into this domain may be considered reliable. Extent of extrapolation [24] is one simple approach to defining the applicability domain. It is based on the calculation of the leverage \(h_{i}\) [27] for each chemical for which the QSPR model is used to predict the activity:
\[ h_{i} = x_{i} (X^{T} X)^{-1} x_{i}^{T} \tag{10} \]
In Eq. 10, \(x_{i}\) is the descriptor row vector of the query compound and X is the n×k matrix containing the k descriptor values for each one of the n training compounds. A leverage value greater than 3k/n is considered large and means that the predicted response is the result of a substantial extrapolation of the model and may not be reliable.
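The leverage check can be sketched as below. The function names and the toy matrix are hypothetical. The sketch assumes that the training matrix holds one row per compound, that any intercept column is supplied by the caller, and that k in the 3k/n limit is the number of columns of that matrix.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def leverage(X_train, x_query):
    """h = x (X^T X)^-1 x^T, computed by solving (X^T X) z = x^T instead of inverting."""
    k = len(x_query)
    xtx = [[sum(row[i] * row[j] for row in X_train) for j in range(k)] for i in range(k)]
    z = solve(xtx, x_query)
    return sum(xi * zi for xi, zi in zip(x_query, z))


def in_domain(X_train, x_query):
    """Return the leverage and whether it stays at or below the 3k/n warning limit."""
    n, k = len(X_train), len(X_train[0])
    h = leverage(X_train, x_query)
    return h, h <= 3.0 * k / n


# Hypothetical training matrix: an intercept column plus one descriptor.
X_demo = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
print(in_domain(X_demo, [1.0, 1.5]))   # query near the centre of the training data
print(in_domain(X_demo, [1.0, 10.0]))  # query far outside the training range
```

For the training compounds themselves the leverages sum to k, so the warning limit 3k/n is three times the average training leverage.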
Results and discussion
The ES-SWR variable-selection technique was used to select a subset of the available chemical descriptors that is most meaningful and statistically significant in terms of correlation with the θ(LCST). The variable-selection procedure identified nine descriptors that significantly influence θ(LCST) and characterize both the polymer and the solvent.
In particular, for the polymer, the following descriptors were selected: HOMO energy, shape coefficient (ShpC), dipole length (DPLL), radius (Rad), and the polymer third-order connectivity index contributed by the side groups (\({}^{3}\chi_{\text{SG}}\), a topological descriptor). For the solvent, four descriptors were chosen: sum of degrees (SDeg), electronic energy, dipole length, and the solvent third-order connectivity index (\({}^{3}\chi_{p}\), a topological descriptor).
The above descriptors are defined as follows. HOMO energy (HOMO) is the energy of the highest occupied molecular orbital; according to frontier-orbital theory, the shape and symmetry of this orbital are crucial in predicting the molecule's reactivity. The electronic energy (ElcE) is the total electronic energy, given in electron volts at 0 K. Dipole length is the electric dipole moment divided by the elementary charge; the electric dipole is a vector quantity that encodes the displacement between the centers of gravity of the positive and negative charges in a molecule. The radius, diameter and shape coefficient are based on the eccentricity of an atom, that is, the largest topological distance (number of bonds along the shortest path) from that atom to any other atom in the molecule. The radius (Rad) is the minimum eccentricity and is held by the most central atom(s), whereas the diameter (D) is the maximum eccentricity and is held by the most outlying atom(s). The shape coefficient is given by ShpC = (D − Rad)/Rad. The sum of degrees (SDeg) is the sum of the degrees of all atoms. The polymer third-order connectivity index contributed by the side groups (\({}^{3}\chi_{\text{SG}}\)) and the solvent third-order connectivity index (\({}^{3}\chi_{p}\)) are described in [12].
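The graph-based descriptors (Rad, D, ShpC and SDeg) can be illustrated with a short sketch that works on a hydrogen-suppressed molecular graph given as an adjacency list. The function names and the two example graphs are hypothetical, and the values produced by the descriptor software used in the study may follow slightly different conventions (for example, in how hydrogens are treated).

```python
from collections import deque


def eccentricities(adj):
    """Eccentricity of each atom: its largest shortest-path distance to any other atom."""
    ecc = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search over bonds
            atom = queue.popleft()
            for neighbour in adj[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        ecc[start] = max(dist.values())
    return ecc


def topological_descriptors(adj):
    """Radius, diameter, shape coefficient and sum of degrees of a connected graph."""
    ecc = eccentricities(adj)
    rad = min(ecc.values())
    diameter = max(ecc.values())
    return {
        "Rad": rad,
        "D": diameter,
        "ShpC": (diameter - rad) / rad,
        "SDeg": sum(len(neighbours) for neighbours in adj.values()),
    }


# Heavy-atom graphs only (hydrogens suppressed); atom labels are arbitrary.
n_butane = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
cyclohexane = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print("n-butane:   ", topological_descriptors(n_butane))
print("cyclohexane:", topological_descriptors(cyclohexane))
```

In the chain the terminal atoms are more eccentric than the central ones, so ShpC is positive; in the ring every atom is equivalent, so the radius equals the diameter and ShpC is zero.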
The full linear equation for the prediction of θ(LCST) for all systems in the data set is the following:
To evaluate the performance of the QSPR model presented in this work, the data set was split randomly into a training and a validation set in a ratio of approximately 65:35% (112 and 57 systems, respectively) according to [12]. The training and validation compounds are clearly indicated in Tables 4 and 5. The validation set was not involved in any way during the training phase. The full linear equation that was developed using only the 112 training data is the following:
The results are shown in Tables 4 and 5, where the prediction of the QSPR model is shown for both the training and the external examples. The corresponding absolute and relative errors are also indicated in Tables 4 and 5. The experimental vs predicted values for the training and test sets are shown graphically in Fig. 1. Figures 2, 3, 4, and 5 show the distributions of absolute and relative errors for the training and the test sets. The average absolute (relative) error in predicting θ(LCST) for the training set is 17.57 K (3.73%) and the corresponding value for the validation set is 16.59 K (3.82%). There is a clear improvement compared to the model described in [12], where the corresponding average relative errors were 5.44 and 5.63% for the training and the validation sets, respectively.
The min/max values of the absolute errors (in K) are 0/82.55 and 1.20/110.43 for the training and validation sets, respectively. As far as the relative error (in %) is concerned, the corresponding statistics are 0/17.93 and 0.23/33.56 for the training and validation sets, respectively. Among the systems in the training set, 41 out of 112 have an absolute error of more than 20 K. System D37 (PIB-cyclooctane) has the maximum absolute error (82.54 K). This can be explained by the fact that the solvent was not included in any other system in the training set. Among the systems in the test set, 18 out of 57 have an absolute error of more than 20 K. System H51 (PS-tert-butyl acetate) has the maximum absolute error (110.42 K). The large error is due to the fact that this particular solvent was not included in any system during the training procedure. Only six systems in the training set have relative errors greater than 10% and there are no systems with a relative error greater than 20%. In the test set, only two systems have a relative error greater than 10%, and only one of them (PS-tert-butyl acetate) exceeds 20%, with a relative error of 33.5%.
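The error statistics discussed above (average, minimum and maximum absolute and relative errors, and counts above a threshold) follow directly from the experimental and predicted values. A minimal sketch is given below; the function name `error_summary`, the 20 K threshold argument and the example temperatures are hypothetical and are not taken from Tables 4 and 5.

```python
def error_summary(y_exp, y_pred, abs_threshold=20.0):
    """Absolute errors (same unit as the data) and relative errors in percent."""
    abs_err = [abs(e - p) for e, p in zip(y_exp, y_pred)]
    rel_err = [100.0 * a / abs(e) for a, e in zip(abs_err, y_exp)]
    n = len(abs_err)
    return {
        "mean_abs": sum(abs_err) / n,
        "mean_rel_pct": sum(rel_err) / n,
        "min_abs": min(abs_err),
        "max_abs": max(abs_err),
        "min_rel_pct": min(rel_err),
        "max_rel_pct": max(rel_err),
        "n_abs_above_threshold": sum(1 for a in abs_err if a > abs_threshold),
    }


# Hypothetical theta(LCST) values in kelvin.
y_exp_demo = [400.0, 500.0, 450.0]
y_pred_demo = [410.0, 475.0, 452.0]
print(error_summary(y_exp_demo, y_pred_demo))
```

Because the relative error is taken with respect to the experimental temperature in kelvin, an absolute error of 20 K at about 500 K corresponds to roughly 4%.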
We should point out that the experimental data were collected from different sources that might have used different measuring methods (with different measurement accuracies and systematic errors). As stated by Liu and Zhong [12] the experimental uncertainties in θ(LCST) can be large, depending on the measurement methods, polymer samples used and the techniques followed by different researchers. Taking these facts into account, we can conclude that our approach models θ(LCST) successfully and has a significant predictive potential.
Several other statistics calculated based on Eq. 12 illustrate the efficiency of the QSPR model. The coefficients of determination (\(R^{2}\) values) given above indicate a high correlation between experimental and predicted values. \(R^{2}_{\text{CV}}\) (the result of the LOO cross-validation procedure) is particularly high (\(R^{2}_{\text{CV}}\) = 0.8546 > 0.5), showing that the model has a high predictive ability and is also robust. As mentioned before, the calculation of this statistic is based on a number of modified data sets created by deleting, in each case, one object from the data. An MLR model is developed based on the remaining data and is validated using the deleted object. For our particular training set, 112 MLR models were therefore built by deleting one compound at a time from the training set.
The proposed model also passed all the tests defined by Eqs. 6, 7, 8, and 9:
The model was validated further by applying the Y-randomization of response test (in this work, the θ(LCST) values). It consists of repeating the calculation procedure several times after shuffling the Y vector randomly. If all models obtained by the Y-randomization test have relatively high values for both \(R^{2}\) and \(R^{2}_{\text{CV}}\) statistics, this is due to a chance correlation and implies that the current modeling method cannot lead to an acceptable model using the available data set. This was not the case for the data set and methodology used in this work. Several random shuffles of the Y vector were performed and the results are shown in Table 6. The low \(R^{2}\) and \(R^{2}_{\text{CV}}\) values show that the good results in our original model are not due to a chance correlation or structural dependency of the training set.
It needs to be emphasized that, no matter how robust and accurate a QSPR model proves to be, it cannot be expected to predict the modeled property reliably for the entire universe of chemicals. Therefore, the domain of applicability of the QSPR model must be defined, and only predictions for chemicals that fall within this domain can be considered reliable. The leverage-based method was applied to the compounds that constitute the test set. The leverages for all 57 test systems were computed (Table 7). Two systems (G65 and J51) were found to fall slightly outside the domain of the model (warning leverage limit 0.2679, which matches 3k/n for the n = 112 training systems if k = 10, i.e., the nine descriptors plus the intercept).
Conclusions
In this work, we have presented a novel MLR model to predict θ(LCST) using nine molecular descriptors. For the development and the validation of the model, 169 polymer–solvent systems were used. The methodologies used in this work illustrated the accuracy of the model, not only by calculating its goodness of fit on the training data but also by testing its predictive ability. In terms of various validation techniques and statistical indicators, the MLR model proved to have significant predictive potential. Using the proposed model, experimental time and effort can be reduced significantly, as reliable estimates of θ(LCST) for polymer solutions can be obtained before the systems are actually prepared in the laboratory.
References
1. (a) Charlet G, Delmas G (1981) Polymer 22:1181–1189; (b) Charlet G, Ducasse R, Delmas G (1981) Polymer 22:1190–1198
2. Christensen SP, Donate FA, Frank TC, LaTulip RJ, Wilson LC (2005) J Chem Eng Data 50:869–877
3. Kavanagh CA, Rochev YA, Gallagher WM, Dawson KA, Keenan AK (2004) Pharmacol Ther 102:1–15
4. Kopecek J (2003) Eur J Pharm Sci 20:1–16
5. Chang BH, Bae YC (1998) Polymer 39:6449–6454
6. Pappa GD, Voutsas EC, Tassios DP (2001) Ind Eng Chem Res 40:4654–4663
7. Bogdanic G, Vidal J (2000) Fluid Phase Equilib 173:241–252
8. Wang F, Saeki S, Yamaguchi T (1999) Polymer 40:2779–2785
9. Vetere A (1998) Ind Eng Chem Res 37:4463–4469
10. Imre AR, Bae YC, Chang BH, Kraska Th (2004) Ind Eng Chem Res 43:237–242
11. Liu H, Zhong C (2005) Eur Polym J 41:139–147
12. Liu H, Zhong C (2005) Ind Eng Chem Res 44:634–638
13. Melagraki G, Afantitis A, Sarimveis H, Igglessi-Markopoulou O, Supuran CT (2006) Bioorg Med Chem 14:1108–1114
14. Afantitis A, Melagraki G, Sarimveis H, Koutentis PA, Markopoulos J, Igglessi-Markopoulou O (2005) Mol Divers (In press) DOI: 10.1007/s11030-005-9012-2
15. Afantitis A, Melagraki G, Makridima K, Alexandridis A, Sarimveis H, Igglessi-Markopoulou O (2005) J Mol Struct Theochem 716:193–198
16. Melagraki G, Afantitis A, Makridima K, Sarimveis H, Igglessi-Markopoulou O (2005) J Mol Model 12:297–305
17. Al-Fahemi JH, Cooper DL, Allan NL (2005) J Mol Struct Theochem 727:57–61
18. Villanueva-García M, Gutiérrez-Parra RN, Martínez-Richa A, Robles J (2005) J Mol Struct Theochem 727:63–69
19. CambridgeSoft Corporation (http://www.cambridgesoft.com)
20. Efron B (1983) J Am Stat Assoc 78:316–331
21. Efroymson MA (1960) Multiple regression analysis. In: Ralston A, Wilf HS (eds) Mathematical methods for digital computers. Wiley, New York, pp 191–203
22. Osten DW (1988) J Chemom 2:39–48
23. Wold S, Eriksson L (1995) Statistical validation of QSAR results. In: van de Waterbeemd H (ed) Chemometrics methods in molecular design. VCH, Weinheim, pp 309–318
24. Tropsha A, Gramatica P, Gombar VK (2003) Quant Struct-Act Relatsh 22:1–9
25. Golbraikh A, Tropsha A (2002) J Mol Graph Model 20:269–276
26. Shen M, Beguin C, Golbraikh A, Stables J, Kohn H, Tropsha A (2004) J Med Chem 47:2356–2364
27. Atkinson A (1985) Plots, transformations and regression. Clarendon, Oxford, p 282
Acknowledgements
G.M. wishes to thank the Greek State Scholarship Foundation for a doctoral assistantship. A.A. wishes to thank Cyprus Research Promotion Foundation (grant no. PENEK/ENISX/0603/05) and A.G. Leventis Foundation for their financial support.
Cite this article
Melagraki, G., Afantitis, A., Sarimveis, H. et al. A novel QSPR model for predicting θ (lower critical solution temperature) in polymer solutions using molecular descriptors. J Mol Model 13, 55–64 (2007). https://doi.org/10.1007/s00894-006-0125-z