Introduction

The phase behavior of polymer solutions is an important property in the development and design of most polymer-related processes. Partially miscible polymer solutions often exhibit two solubility boundaries, the upper critical solution temperature (UCST) and the lower critical solution temperature (LCST), both of which depend on the molar mass and the pressure. At temperatures below the LCST the system is completely miscible in all proportions, whereas above the LCST partial liquid miscibility occurs [1, 2]. θ(LCST) is the LCST at infinite chain length and is therefore not affected by polymer molar mass. θ(LCST) is often regarded as less important than θ(UCST) because it is usually located at high temperatures, where the polymer degrades. Nevertheless, θ(LCST) can serve as an upper temperature limit for polymer processing. Solvent systems that exhibit LCST behavior have been suggested for applications where partial or complete miscibility above and below the LCST offers advantages [3, 4].

There are three groups of methods for correlating and predicting LCSTs. The first group proposes models based on a solid theoretical background, using liquid–liquid or vapor–liquid experimental data. These methods require experimental data to adjust the unknown parameters, which limits their predictive ability [5–7]. Another approach uses empirical equations that correlate θ(LCST) with physicochemical properties such as density and critical properties, but suffers from the disadvantage that these properties are not always available [8–10]. A third approach, proposed by Liu and Zhong, develops linear models for the prediction of θ(LCST) using molecular connectivity indices, which depend only on the solvent and polymer structures [11, 12]. The latter approach has proven to be a very useful technique in quantitative structure–activity/property relationship (QSAR/QSPR) research for polymers and polymer solutions. QSAR/QSPR studies attempt to reduce the trial-and-error element in the design of compounds with desired activities or properties by establishing mathematical relationships between the activity or property of interest and measurable or computable parameters, such as topological, physicochemical, steric, or electronic indices [13–18].

The present study follows the third approach and its goal is to develop a new, more efficient QSPR model for the prediction of θ(LCST) in polymer solutions, using a newly introduced set of molecular descriptors. The model is developed using a rigorous variable-selection technique and the multiple linear regression (MLR) modeling methodology. The accuracy of the produced model is illustrated using numerous model validation techniques.

Materials and methods

A large set of 169 experimental θ(LCST) data points was collected from the literature [12], comprising 12 polymers and 67 solvents. The polymers and solvents are shown in Tables 1 and 2. Thirty-seven topological, physicochemical, steric, and electronic descriptors were considered as potential inputs to the QSPR model. The descriptors were calculated from the structure of the monomer compound using ChemSar, which is included in the ChemOffice (CambridgeSoft Corporation) suite of programs [19]. All structures were fully optimized using MOPAC (included in ChemOffice), specifically the AM1 Hamiltonian, which provides a balance between speed and accuracy. Eight of the descriptors were the topological descriptors calculated and used by Liu and Zhong [12]. Table 3 shows the remaining 29 descriptors for both the polymers and the solvents.

Table 1 Polymers used in this work
Table 2 Solvents used in this work
Table 3 Descriptors

Stepwise multiple regression

Among the aforementioned indices, the best combinations were determined using a rigorous elimination selection-stepwise regression (ES-SWR) algorithm that was programmed in Matlab. The aim of variable subset selection is to reach optimal model complexity in predicting a response variable using a reduced set of descriptors that are not highly intercorrelated. In particular, the objective of this work was to select the subset of variables that produces the most significant linear QSPR model as far as prediction of θ(LCST) is concerned.

ES-SWR is a popular stepwise technique that combines forward selection (FS)-SWR and backward elimination (BE)-SWR. It is basically a forward-selection approach but, at each step, it considers the possibility of deleting a variable, as in the backward-elimination approach, provided that the number of model variables is greater than two. The two basic elements of the ES-SWR method are described next in more detail.

Forward selection

The variable considered for inclusion at any step is the one yielding the largest single degree of freedom F-ratio among those eligible for inclusion. The variable is included only if this value is larger than a fixed value \(F_{in}\). At each step, the jth variable is consequently added to a k-size model if

$$ F_{j} = {\text{max}}_{j} {\left( {\frac{{RSS_{k} - RSS_{{k + j}} }} {{s^{2}_{{k + j}} }}} \right)} > F_{{in}} $$
(1)

In the above inequality, RSS is the residual sum of squares and \(s^{2}\) is the mean square error. The subscript k+j refers to quantities computed when the jth variable is added to the k variables that are already included in the model.

Backward elimination

The variable considered for elimination at any step is the one yielding the minimum single degree of freedom F-ratio among the variables included in the model. The variable is eliminated only if this value does not exceed a specified value \(F_{out}\). At each step, the jth variable is eliminated from a k-size model if

$$ F_{j} = \min _{j} {\left( {\frac{{RSS_{{k - j}} - RSS_{k} }} {{s^{2}_{k} }}} \right)} < F_{{out}} $$
(2)

The subscript k−j refers to quantities computed when the jth variable is eliminated from the k variables included in the model so far.
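The forward and backward steps above can be sketched in a few lines. The sketch below is a minimal illustration assuming ordinary least squares with an intercept; the thresholds `f_in` and `f_out` are hypothetical defaults (the source does not report the values used), and this is not the authors' Matlab implementation.

```python
import numpy as np

def fit_rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def es_swr(X, y, f_in=4.0, f_out=3.9):
    """Elimination-selection stepwise regression (ES-SWR) sketch.

    X: (n, p) candidate-descriptor matrix; y: (n,) response.
    f_in / f_out are illustrative thresholds (f_out < f_in avoids cycling).
    Returns the indices of the selected descriptors.
    """
    n, p = X.shape
    selected = []
    for _ in range(2 * p + 2):          # hard iteration cap as a safety net
        changed = False
        # Forward step: add the variable with the largest F-ratio (Eq. 1).
        remaining = [j for j in range(p) if j not in selected]
        if remaining:
            rss_k = fit_rss(X[:, selected], y)
            best_j, best_f = None, -np.inf
            for j in remaining:
                cols = selected + [j]
                rss_kj = fit_rss(X[:, cols], y)
                s2 = rss_kj / (n - len(cols) - 1)   # mean square error
                f = (rss_k - rss_kj) / s2
                if f > best_f:
                    best_j, best_f = j, f
            if best_f > f_in:
                selected.append(best_j)
                changed = True
        # Backward step (Eq. 2), only when more than two variables are in.
        if len(selected) > 2:
            rss_full = fit_rss(X[:, selected], y)
            s2 = rss_full / (n - len(selected) - 1)
            worst_j, worst_f = None, np.inf
            for j in selected:
                cols = [c for c in selected if c != j]
                f = (fit_rss(X[:, cols], y) - rss_full) / s2
                if f < worst_f:
                    worst_j, worst_f = j, f
            if worst_f < f_out:
                selected.remove(worst_j)
                changed = True
        if not changed:
            break
    return selected
```

On synthetic data where only a few columns carry signal, the procedure recovers those columns while the noise columns fail the \(F_{in}\) test.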

Model validation

A reliable and predictive QSPR model should (1) be statistically significant and robust, (2) provide accurate predictions for external data sets not used during the model development, and (3) have its application boundaries defined. The approaches used in this work to ensure the significance and predictive power of the QSPR model are described below.

Cross-validation technique

To explore the reliability of the proposed method, we used cross-validation. In this technique, a number of modified data sets are created by deleting, in each case, one object or a small group of objects (leave-some-out) [20–22]. For each data set, an input–output model is developed using the chosen modeling technique and is evaluated by measuring its accuracy in predicting the responses of the deleted data (those not used in developing the model). In this study we used the leave-one-out (LOO) procedure, which produces a number of models by deleting one object at a time from the training set; the number of models is therefore equal to the number of available examples, n. The prediction error sum of squares (PRESS) is a standard index of the accuracy of a modeling method under cross-validation. From the PRESS and SSY (the sum of squared deviations of the experimental values from their mean) statistics, the \(R^{2}_{{{\text{CV}}}}\) and \(S_{PRESS}\) values can be calculated easily (Eqs. 3 and 4):

$$ R^{2}_{{CV}} = 1 - \frac{{PRESS}} {{SSY}} = 1 - \frac{{{\sum\limits_{i = 1}^n {{\left( {y_{{\exp }} - y_{{pred}} } \right)}^{2} } }}} {{{\sum\limits_{i = 1}^n {{\left( {y_{{\exp }} - \overline{y} } \right)}^{2} } }}} $$
(3)
$$ S_{{PRESS}} = {\sqrt {\frac{{PRESS}} {n}} } $$
(4)
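Eqs. 3 and 4 can be computed directly. The sketch below assumes an MLR model with an intercept fitted by ordinary least squares, and uses the naive loop over held-out observations rather than any closed-form shortcut.

```python
import numpy as np

def loo_press_stats(X, y):
    """LOO cross-validation statistics for an MLR model (Eqs. 3 and 4).

    Each of the n models is refitted with one observation held out and
    then used to predict that observation.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                # leave observation i out
        beta, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        press += (y[i] - A[i] @ beta) ** 2
    ssy = float(np.sum((y - y.mean()) ** 2))    # deviations from the mean
    r2_cv = 1.0 - press / ssy                   # Eq. 3
    s_press = float(np.sqrt(press / n))         # Eq. 4
    return r2_cv, s_press
```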

Y-randomization test

This technique ensures the robustness of a QSPR model [23, 24]. The dependent-variable vector (the property) is randomly shuffled and a new QSPR model is developed using the original independent-variable matrix. After several repetitions, the new QSPR models are expected to have low \(R^{2}\) and \(R^{2}_{{{\text{CV}}}}\) values. If the opposite happens, an acceptable QSPR model cannot be obtained for the specific modeling method and data.
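The shuffling procedure is straightforward to sketch. The version below refits an ordinary least-squares model on each shuffled response vector and records the training \(R^{2}\); for a model that is not a chance correlation, these values should all be low.

```python
import numpy as np

def y_randomization(X, y, n_shuffles=10, seed=0):
    """Y-randomization test: refit the MLR model on shuffled responses.

    Returns the training R^2 of each shuffled model; low values indicate
    the original correlation is not due to chance.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X])     # design matrix with intercept
    r2s = []
    for _ in range(n_shuffles):
        y_shuf = rng.permutation(y)          # shuffle the property vector
        beta, *_ = np.linalg.lstsq(A, y_shuf, rcond=None)
        resid = y_shuf - A @ beta
        ssy = float(np.sum((y_shuf - y_shuf.mean()) ** 2))
        r2s.append(1.0 - float(resid @ resid) / ssy)
    return r2s
```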

Estimation of the predictive ability of a QSPR model

According to Tropsha et al. [24], the predictive power of a QSPR model can be estimated conveniently by an external \( R^{2}_{{CV,ext}} \) (Eq. 5).

$$ R^{2}_{{CV,ext}} = 1 - \frac{{{\sum\limits_{i = 1}^{test} {{\left( {y_{{\exp }} - y_{{pred}} } \right)}^{2} } }}} {{{\sum\limits_{i = 1}^{test} {{\left( {y_{{\exp }} - \overline{y} _{{tr}} } \right)}^{2} } }}} $$
(5)

where \(\overline{y} _{{{\text{tr}}}}\) is the averaged value for the dependent variable for the training set.

Furthermore, the same group [24, 25] considered a QSPR model predictive if the following conditions are satisfied:

$$ R^{2}_{{CV,ext}} > 0.5 $$
(6)
$$R^{2}_{{pred}} > 0.6$$
(7)
$$\frac{{{\left( {R^{2} - R^{2}_{{\text{o}}} } \right)}}} {{R^{2} }} < 0.1{\text{ or }}\frac{{{\left( {R^{2} - R^{{\prime 2}}_{{\text{o}}} } \right)}}} {{R^{2} }} < 0.1$$
(8)
$$ 0.85 \leqslant k \leqslant 1.15{\text{ or }}0.85 \leqslant k^{\prime } \leqslant 1.15 $$
(9)

The mathematical definitions of \(R^{2}_{{\text{o}}} ,\) \(R^{{\prime 2}}_{{\text{o}}} ,\) k and k′ are based on regression of the observed activities against predicted activities and the opposite (regression of the predicted activities against observed activities). The definitions are presented clearly in [25] and are not repeated here for brevity.
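Since the definitions are deferred to [25], the sketch below follows the usual Golbraikh–Tropsha formulation: k and k′ are the slopes of the through-origin regressions of observed on predicted values and vice versa, and \(R^{2}_{o}\), \(R^{\prime 2}_{o}\) are the corresponding through-origin coefficients of determination. Treating \(R^{2}_{pred}\) as the squared observed-predicted correlation is our assumption.

```python
import numpy as np

def tropsha_criteria(y_obs, y_pred, y_train_mean):
    """External-validation checks in the spirit of Eqs. 5-9.

    R2_o, R2_o', k and k' follow the through-origin regressions
    (Golbraikh & Tropsha); R2 is the squared observed-predicted
    correlation, used here for the R2_pred > 0.6 condition.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Eq. 5: external cross-validated R^2 against the training-set mean.
    r2_cv_ext = 1 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - y_train_mean) ** 2)
    r2 = float(np.corrcoef(y_obs, y_pred)[0, 1] ** 2)
    # Slopes of the regressions through the origin (Eq. 9).
    k = float(np.sum(y_obs * y_pred) / np.sum(y_pred ** 2))
    k_prime = float(np.sum(y_obs * y_pred) / np.sum(y_obs ** 2))
    # R^2 of the through-origin regressions (used in Eq. 8).
    r2_o = 1 - np.sum((y_obs - k * y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    r2_o_prime = 1 - np.sum((y_pred - k_prime * y_obs) ** 2) / np.sum((y_pred - y_pred.mean()) ** 2)
    ok = (r2_cv_ext > 0.5 and r2 > 0.6
          and ((r2 - r2_o) / r2 < 0.1 or (r2 - r2_o_prime) / r2 < 0.1)
          and (0.85 <= k <= 1.15 or 0.85 <= k_prime <= 1.15))
    return {"R2_cv_ext": float(r2_cv_ext), "R2": r2,
            "k": k, "k_prime": k_prime, "ok": ok}
```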

Defining the model-applicability domain

In order for a QSPR model to be used for the prediction of θ(LCST) of new systems, its domain of application [24, 26] must be defined, and only predictions for compounds that fall into this domain may be considered reliable. Extent of extrapolation [24] is one simple approach to defining the applicability domain. It is based on the calculation of the leverage \(h_{i}\) [27] for each chemical for which the QSPR model is used to predict its activity:

$$h_{i} = x^{T}_{i} {\left( {X^{T} X} \right)}^{ - 1} x_{i} $$
(10)

In Eq. 10, \(x_{i}\) is the descriptor-row vector of the query compound and X is the n×k matrix containing the k descriptor values for each of the n training compounds. A leverage value greater than 3k/n is considered large; it means that the predicted response is the result of substantial extrapolation of the model and may not be reliable.
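Eq. 10 can be sketched as follows. Note that if the k = 10 columns comprise the nine descriptors plus a constant column (an assumption on our part), 3k/n with n = 112 reproduces the warning limit 0.2679 quoted later in the Results.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (X^T X)^(-1) x_i for each query compound.

    X_train: (n, k) training descriptor matrix; X_query: (m, k) matrix of
    query compounds. Values above the warning limit 3k/n flag predictions
    obtained by substantial extrapolation.
    """
    xtx_inv = np.linalg.inv(X_train.T @ X_train)
    # Quadratic form x_i^T (X^T X)^(-1) x_i for every row of X_query.
    h = np.einsum('ij,jk,ik->i', X_query, xtx_inv, X_query)
    warning = 3 * X_train.shape[1] / X_train.shape[0]
    return h, warning
```

A useful sanity check: the leverages of the training points themselves sum to k, the trace of the hat matrix.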

Results and discussion

The ES-SWR variable-selection technique was used to select a subset of the available chemical descriptors that is most meaningful and statistically significant in terms of correlation with the θ(LCST). The variable-selection procedure identified nine descriptors that significantly influence θ(LCST) and characterize both the polymer and the solvent.

In particular, for the polymer, the following descriptors were selected: HOMO energy, shape coefficient (ShpC), dipole length (DPLL), radius (Rad), and the polymer third-order connectivity index contributed by the side groups (3 χ SG-topological). For the solvent, four descriptors were chosen: sum of degrees (SDeg), electronic energy, dipole length, and solvent third-order connectivity index (3 χ p-topological).

The above descriptors are defined as follows. The HOMO energy (HOMO) is the energy of the highest occupied molecular orbital; according to frontier-orbital theory, the shape and symmetry of the HOMO are crucial in predicting a molecule's reactivity. The electronic energy (ElcE) is the total electronic energy, given in electron volts at 0 °C. The dipole length (DPLL) is the electric dipole moment divided by the elementary charge; the electric dipole moment is a vector quantity that encodes the displacement between the centers of gravity of the positive and negative charges in a molecule. The eccentricity of an atom is its maximum topological distance from any other atom: the radius (Rad) is the minimum eccentricity, held by the most central atom(s), and the diameter (D) is the maximum eccentricity, held by the most outlying atom(s). The shape coefficient is given by ShpC = (D − Rad)/Rad. The sum of degrees (SDeg) is the sum of the degrees of all atoms. The polymer third-order connectivity index contributed by the side groups (3χSG) and the solvent third-order connectivity index (3χp) are described in [12].

The full linear equation for the prediction of θ(LCST) for all systems in the data set is the following:

$$ \begin{aligned} \theta(LCST)/K = {} & 31.5(\pm 3.7) \times DPLL(solvent) \\ & - 38.7(\pm 6.6) \times {}^{3}\chi^{SG} \\ & + 49.3(\pm 4.9) \times Rad \\ & - 92.0(\pm 8.0) \times DPLL(polymer) \\ & + 99.9(\pm 7.3) \times ShpC \\ & + 46.7(\pm 5.9) \times {}^{3}\chi_{P} \\ & + 0.0351(\pm 0.00332) \times ElcE \\ & - 105.6(\pm 6.4) \times HOMO \\ & + 30.6(\pm 2.3) \times SDeg \\ & - 931.6(\pm 68.88) \end{aligned} $$
(11)
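For concreteness, Eq. 11 can be transcribed into a small function using the central coefficient values only (the quoted standard errors are ignored). The descriptor values themselves must be computed from the optimized structures as described in Materials and methods; the function merely evaluates the linear combination.

```python
def theta_lcst(dpll_solvent, chi3_sg, rad, dpll_polymer, shpc,
               chi3_p, elce, homo, sdeg):
    """Evaluate Eq. 11: theta(LCST) in kelvin from the nine descriptors.

    Central coefficient values only; quoted uncertainties are ignored.
    """
    return (31.5 * dpll_solvent    # solvent dipole length
            - 38.7 * chi3_sg       # polymer side-group connectivity index
            + 49.3 * rad           # polymer radius
            - 92.0 * dpll_polymer  # polymer dipole length
            + 99.9 * shpc          # polymer shape coefficient
            + 46.7 * chi3_p        # solvent connectivity index
            + 0.0351 * elce        # solvent electronic energy (eV)
            - 105.6 * homo         # polymer HOMO energy
            + 30.6 * sdeg          # solvent sum of degrees
            - 931.6)               # intercept
```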

To evaluate the performance of the QSPR model presented in this work, the data set was split randomly into a training and a validation set in a ratio of approximately 65:35 (112 and 57 systems, respectively), following [12]. The training and validation compounds are clearly indicated in Tables 4 and 5. The validation set was not involved in any way during the training phase. The full linear equation developed using only the 112 training data points is the following:

$$ \begin{aligned} \theta(LCST)/K = {} & 33.0(\pm 4.8) \times DPLL(solvent) \\ & - 39.8(\pm 8.4) \times {}^{3}\chi^{SG} \\ & + 47.3(\pm 6.3) \times Rad \\ & - 92.1(\pm 10.4) \times DPLL(polymer) \\ & + 95.7(\pm 9.4) \times ShpC \\ & + 41.4(\pm 7.5) \times {}^{3}\chi_{P} \\ & + 0.0371(\pm 0.0041) \times ElcE \\ & - 99.1(\pm 6.4) \times HOMO \\ & + 32.2(\pm 2.7) \times SDeg \\ & - 860.7(\pm 89.9) \\ & n = 112,\; R^{2}_{CV} = 0.8546,\; R^{2}_{tr} = 0.8860, \\ & R^{2}_{pred} = 0.8738,\; F = 88.04, \\ & RMS_{tr} = 23.4806,\; RMS_{pred} = 23.7893, \\ & S_{PRESS} = 25.5095 \end{aligned} $$
(12)
Table 4 Experimental and predicted values, absolute/relative errors for the training set
Table 5 Experimental and predicted values, absolute/relative errors for the test set

The results are shown in Tables 4 and 5, where the predictions of the QSPR model are given for both the training and the external examples, together with the corresponding absolute and relative errors. Experimental vs predicted values for the training and test sets are shown graphically in Fig. 1. Figures 2, 3, 4, and 5 show the distributions of absolute and relative errors for the training and test sets. The average absolute (relative) error in predicting θ(LCST) is 17.57 K (3.73%) for the training set and 16.59 K (3.82%) for the validation set. This is a clear improvement over the model described in [12], where the corresponding average relative errors were 5.44 and 5.63% for the training and validation sets, respectively.

Fig. 1
figure 1

Predicted vs experimental values for the training and test sets

Fig. 2
figure 2

Distribution of absolute errors for the training set

Fig. 3
figure 3

Distribution of percent of relative errors for the training set

Fig. 4
figure 4

Distribution of absolute errors for the test set

Fig. 5
figure 5

Distribution of percent of relative errors for the test set

The minimum/maximum absolute errors are 0/82.55 K for the training set and 1.20/110.43 K for the validation set. For the relative error, the corresponding statistics are 0/17.93% and 0.23/33.56%, respectively. Among the systems in the training set, 41 out of 112 have an absolute error of more than 20 K. System D37 (PIB–cyclooctane) has the maximum absolute error (82.55 K); this can be explained by the fact that the solvent is not included in any other system in the training set. Among the systems in the test set, 18 out of 57 have an absolute error of more than 20 K. System H51 (PS–tert-butyl acetate) has the maximum absolute error (110.43 K); the large error is due to the fact that this solvent was not included in any system during the training procedure. Only six systems in the training set have relative errors greater than 10%, and none has a relative error greater than 20%. In the test set, only two systems have a relative error greater than 10%, and only one system (PS–tert-butyl acetate) has a relative error of 33.5%.

We should point out that the experimental data were collected from different sources that might have used different measuring methods (with different measurement accuracies and systematic errors). As stated by Liu and Zhong [12] the experimental uncertainties in θ(LCST) can be large, depending on the measurement methods, polymer samples used and the techniques followed by different researchers. Taking these facts into account, we can conclude that our approach models θ(LCST) successfully and has a significant predictive potential.

Several other statistics calculated for Eq. 12 illustrate the efficiency of the QSPR model. The coefficients of determination (\(R^{2}\) values) given above indicate a high correlation between experimental and predicted values. \(R^{2}_{{{\text{CV}}}}\) (the result of the LOO cross-validation procedure) is particularly high (\(R^{2}_{{{\text{CV}}}} = 0.8546 > 0.5\)), showing that the model has high predictive ability and is also robust. As mentioned before, the calculation of this statistic is based on a number of modified data sets created by deleting, in each case, one object from the data. An MLR model is developed from the remaining data and validated on the deleted object. For our training set, 112 MLR models were built by deleting one compound at a time.

The proposed model also passed all the tests defined by Eqs. 6–9:

$$ \begin{aligned} & R^{2}_{{CV,ext}} = 0.8634 > 0.5 \\ & R^{2}_{{pred}} = 0.8738 > 0.6 \\ & \frac{{{\left( {R^{2} - R^{2}_{{\text{o}}} } \right)}}} {{R^{2} }} = {\text{ - 0}}{\text{.2834}} < 0.1{\text{ or }} \\ & \frac{{{\left( {R^{2} - R^{{\prime 2}}_{{\text{o}}} } \right)}}} {{R^{2} }} = {\text{ - 0}}{\text{.2969}} < 0.1 \\ & k = 0.9885{\text{ and }}k^{\prime } {\text{ = 1}}{\text{.0093}} \\ \end{aligned} $$

The model was validated further by applying the Y-randomization test to the response (in this work, the θ(LCST) values). It consists of repeating the calculation procedure several times after randomly shuffling the Y vector. If the models obtained after Y-randomization have relatively high values for both the \(R^{2}\) and \(R^{2}_{{{\text{CV}}}}\) statistics, the original correlation is likely due to chance, implying that the current modeling method cannot produce an acceptable model from the available data set. This was not the case for the data set and methodology used in this work. Several random shuffles of the Y vector were performed and the results are shown in Table 6. The low \(R^{2}\) and \(R^{2}_{{CV}}\) values show that the good results of our original model are not due to a chance correlation or a structural dependency of the training set.

Table 6 Results of the Y-randomization test

It needs to be emphasized that, no matter how robust and accurate a QSPR model proves to be, it cannot be expected to predict the modeled property reliably for the entire universe of chemicals. Therefore, the domain of applicability of the QSPR model must be defined, and only predictions for chemicals that fall within this domain can be considered reliable. The leverage method was applied to the compounds that constitute the test set; the leverages for all 57 test systems were computed (Table 7). Two systems (G65 and J51) were found to fall slightly outside the domain of the model (warning leverage limit 0.2679).

Table 7 Leverages for the test set

Conclusions

In this work, we have presented a novel MLR model that predicts θ(LCST) using nine molecular descriptors. For the development and validation of the model, 169 polymer–solvent systems were used. The methodologies employed illustrate the accuracy of the model, not only by measuring its fit to the training data but also by testing its predictive ability on external data. In terms of various validation techniques and statistical indicators, the MLR model proved to have significant predictive potential. Using the proposed model, experimental time and effort can be reduced significantly, since reliable estimates of θ(LCST) can be obtained for polymer solutions before they are actually prepared in the laboratory.