1 Introduction

Many key properties of surfactant solutions, such as conductivity, emulsification, surface tension, detergency and foam stability, are important for numerous industrial and biological systems [1,2,3,4]. In addition, it has been established that the values of these physicochemical properties change sharply as soon as the molecules begin to aggregate into micelles [5]. The surfactant concentration at which this occurs is defined as the critical micelle concentration (CMC). The CMC is therefore an important and extremely useful parameter for characterizing surfactants and, given its impact on surfactant behavior, it deserves special attention [4].

Anionic surfactants constitute the largest class of surfactants. They are the most widely used of all surfactant types and account for about 70% of surfactant consumption [6]. Various industrial processes rely on the addition of anionic surfactants, which are used as wetting agents, emulsifiers, dispersants and foaming agents. They play a major role in modern cleaning products (laundry detergents, automatic dishwashing detergents, and some shampoos) owing to their superior detergency [7]. In addition, these compounds are often added to a variety of products, such as pharmaceuticals, antimicrobials and corrosion inhibitors [4]. Because the micellization process is important from both a technological and an environmental point of view, many researchers use quantitative structure–property relationship (QSPR) models [8], both to provide early CMC estimates and to design new surfactants with targeted properties. The QSPR principle consists in finding a correlation between a property (or activity) of a substance (such as the CMC) and its structural characteristics (molecular descriptors), which reflect topological, geometric and electrostatic aspects of the molecule. QSPRs are practical methods for the early assessment of physicochemical and biological parameters of substances that have not been studied experimentally.

Indeed, founded over 50 years ago by Corwin Hansch, quantitative structure–activity relationship (QSAR) models are widely used in academic research, in industry, and by several governmental and regulatory bodies and frameworks (OECD, REACH, etc.). The abundance of experimental databases covering millions of products, together with pressure from several organizations to minimize the use of animals, has encouraged the use of QSAR/QSPR as a promising alternative in drug design, toxicology and ecotoxicological risk assessment [9, 10]. Reference works dealing with the fundamental concepts of QSAR modeling and their application in risk assessment are currently available in the literature [11, 12].

Recently, many QSPR approaches have been developed to estimate the physicochemical parameters of a large group of compounds important for industrial applications [13,14,15,16]. In the past two decades, a few studies have been published on QSPR models for predicting the CMC of anionic surfactants. A QSPR approach was designed to correlate the molecular structure of 119 anionic surfactants with their CMC [17]. A regression model was developed with three descriptors containing information on the size, structure and hydrophobic properties of the surfactants studied (n = 68). A very good value of the coefficient of determination (R2 = 0.988) was obtained. Recently, a set of 31 anionic surfactants was used to develop a QSPR model linking molecular structural parameters and log CMC [4]. The statistics of the resulting models (R2 = 0.964 and R2 = 0.982), together with their cross-validation performance, confirm the ability of both models to predict the CMC of anionic surfactants. In conclusion, the authors suggest that the branching and the polarity of the compounds studied contribute significantly to the micellization process. However, they claim that the polarity contributes less to this process.

A QSPR study was developed to quantify the CMC from logP (octanol/water partition coefficient) for a wide range of anionic surfactants [18]. Acceptable modeling was obtained by using two parameters, πh and L. In the same year, 37 anionic surfactants of the sodium alkyl sulfate type and 3 descriptors were used to develop a QSPR model for the prediction of CMC [19]. Given the results obtained, the authors suggest the use of this model for predicting the CMC of anionic surfactants.

In a report published in 2004 [20], 98 anionic surfactants and three descriptors were used to develop a QSPR model for predicting the CMC using multiple linear regression techniques. The anionic surfactants used include a wide variety of hydrophobic structures.

The results obtained (R2 = 0.980 and \( R_{\text{cv}}^{2} = 0.978 \)) indicate the robustness and reliability of the QSPR approach. A good correlation between observed and predicted CMC was noted. The authors suggested that the contribution of three parameters (total atom number, dipole moment, and maximum net atomic charge on the C atom) is very important. Interesting QSPR models have also been generated to predict the CMC of 37 anionic surfactants using two categories of descriptors [21]. The internal performance and the predictivity of these models are satisfactory. The descriptors used highlighted the impact of branching, hydrophobicity, and electronic properties of the surfactants on the micellization process.

In this context, the main objective of this work is to establish new robust QSPR models for predicting the CMC of a wide variety of surfactants (classical and extended anionic surfactants) from their molecular structure. The models, developed using multiple linear regression and artificial neural networks, satisfy the guidelines of the Organization for Economic Cooperation and Development (OECD) and are based on different types of descriptors so as to yield physically meaningful models. In addition, the developed QSPR models can be useful in the design of new anionic surfactants.

2 Methodology

2.1 Data Collection and Dataset Division

To establish high-performance QSPR models, experimental data must be of high quality [22]. In the present work, the experimental critical micelle concentration (CMC) data of 50 anionic surfactants (36 conventional anionic surfactants and 14 extended anionic surfactants) were extracted from the literature (Table 1). A wide variety of surfactant structures were included. The CMC values were measured at 25 °C in purified water without any added ingredient. The data were carefully checked to avoid errors. The CMC values were converted to a negative logarithmic scale [pCMC = − log10CMC (µmol/L)] to ensure a more linear distribution of the data. Normality was checked using different statistical tests, and the distribution plots are presented in Fig. 1a, b. The complete dataset (50 anionic surfactants) was split into a training set and a test set [23, 24] by the Kennard–Stone division method using the ‘Dataset division GUI 1.2’ tool (DTC Lab Software Tools). The ratio retained was about 75:25 (ntraining = 38 anionic surfactants and ntest = 12 anionic surfactants).
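
As an illustration of the dataset division step, the following is a minimal sketch of the Kennard–Stone selection, assuming Euclidean distances in descriptor space and no prior scaling. The function name `kennard_stone` and the synthetic 50 × 4 matrix are hypothetical; the 'Dataset division GUI 1.2' tool may differ in its details.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Return (train_idx, test_idx) chosen by a basic Kennard-Stone selection."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two most distant samples.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining sample to its nearest selected sample.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the remaining sample that is farthest from the selected set.
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Synthetic stand-in for a 50 x 4 descriptor matrix (not the paper's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
train_idx, test_idx = kennard_stone(X_demo, n_train=38)
```

With 50 samples and `n_train=38`, the remaining 12 indices form the test set, which mirrors the 38/12 split used here.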

Table 1 List of 50 anionic surfactants and their experimental pCMC values along with predicted pCMC values
Fig. 1 Normality distribution plot of CMC data: a before transformation, b after log transformation (pCMC)

2.2 Molecular Descriptor Calculation

There are more than 11145 usable molecular descriptors [25]. All descriptors considered in this study were computed using the PaDEL-Descriptor program (ver. 2.21). The anionic surfactant structures were saved in SMILES (Simplified Molecular Input Line Entry System) notation, which is the recommended input format for PaDEL-Descriptor [26]. In this work, 1543 molecular descriptors were calculated for each surfactant.

2.3 Molecular Descriptor Selection

One of the important steps in QSPR modeling is the reduction of the number of descriptors. This reduction has a twofold purpose: to avoid overfitting and to reduce the risk of obtaining a model by chance [27]. To keep only the most important descriptors, the selection procedure described in one of our previous articles was used [22]. The number of descriptors remaining after this selection was 868. A genetic algorithm (GA) approach was then applied (http://teqip.jdvu.ac.in/QSAR_Tools/).

2.4 QSPR Model’s Development and Validation

In the present study, models were developed using three statistical methods: (1) multiple linear regression (MLR); (2) partial least squares (PLS); and (3) multilayer perceptron artificial neural networks (MLP/ANN) with the BFGS (Broyden–Fletcher–Goldfarb–Shanno) learning algorithm. For the MLP/ANN approach, we used STATISTICA software (STATISTICA 8.0, Tulsa; StatSoft, Inc.). For the MLR and PLS methods, we used the MLR Plus Validation GUI 1.3 and Partial Least Squares v1.0 tools, respectively (DTC Lab Software Tools). The theory and applications of MLP/ANN have been reported in the literature [28, 29].

Validation (internal and external) is a necessary step in assessing a model’s quality and predictive ability. For internal validation, the traditional validation metrics recommended by leading research groups [30, 31] were checked: the determination coefficient (R2) and the cross-validated correlation coefficient (\( Q_{\text{LOO}}^{2} \)), along with novel validation parameters (\( r_{m}^{2} \); \( \Delta r_{m}^{2} \)). External validation was performed on the test set by calculating the following parameters: \( Q_{F1}^{2} \), \( Q_{F2}^{2} \), average \( r_{m}^{2} \), and \( \Delta r_{m}^{2} \). The equations of these validation parameters are provided in the literature [32,33,34,35]. In a recent work, Roy and his collaborators [36] suggested an additional criterion for external validation, which consists in establishing a threshold for the mean absolute error (MAE). Thus, to assess the predictive performance of the QSAR models with a higher degree of confidence, we calculated and verified the MAE-based criteria suggested by Roy et al. [36]. A tool (XternalValidationPlus) for computing the suggested MAE-based criteria for external validation is accessible online [37].
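
For orientation, the external validation metrics can be sketched as follows. This is a minimal illustration that assumes the commonly used definitions (\( Q_{F1}^{2} \) referenced to the training set mean, \( Q_{F2}^{2} \) referenced to the test set mean); the exact equations used in this work are those of refs. [32,33,34,35]. The function names and the small numeric example are hypothetical.

```python
import numpy as np

def q2_f1(y_test, y_pred, y_train_mean):
    """Q2_F1: test-set residuals relative to deviations from the TRAINING mean."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_train_mean) ** 2)

def q2_f2(y_test, y_pred):
    """Q2_F2: test-set residuals relative to deviations from the TEST mean."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

def mae(y_test, y_pred):
    """Mean absolute error of the test-set predictions."""
    return float(np.mean(np.abs(np.asarray(y_test, float) - np.asarray(y_pred, float))))

# Hypothetical observed and predicted values (not taken from Table 1).
y_obs = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
f1 = q2_f1(y_obs, y_hat, y_train_mean=2.0)
f2 = q2_f2(y_obs, y_hat)
err = mae(y_obs, y_hat)
```

The two Q2 variants coincide only when the training and test means are equal, which is why both are usually reported.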

Further, XternalValidationPlus was used to check for high systematic errors (bias) in the ANN model. If such bias is present, the model should be discarded, since performing any external validation test on a biased model is useless [38].

2.5 QSPR Model’s Applicability Domain

In addition to internal and external validation, the determination of the applicability domain (AD) is of great importance [39]. The AD is another validation measure used to check the reliability of QSPR models. A QSPR model cannot be applied outside the chemical space covered by its training set; extrapolation is therefore not allowed, as it can lead to erroneous predictions [40]. To investigate the AD of the anionic surfactants used in this study, the Williams plot was established [22, 41].

3 Results and Discussion

3.1 Molecular Descriptor Selection

To select the optimal number of descriptors, the effect of the number of descriptors on the statistical parameters (R2, Q2, \( \bar{r}_{m}^{2} \), \( \Delta r_{m}^{2} \)) was investigated for models with 2–7 descriptors. The results are shown as plots of these parameters for the training set as a function of the number of descriptors (Fig. 2). From Fig. 2 and Table 2 (prediction quality), the best numbers of descriptors are 4 and 5. However, moving from 4 to 5 descriptors brings no significant improvement in the statistical parameters. For this reason, we chose the following 4 descriptors: ATSC7v, ATSC5e, nAtomLAC and ETA_Epsilon_3 (Table S1 in supplementary files).

Fig. 2 Influence of the number of descriptors on the statistical parameters

Table 2 Calculated parameters for selection of optimal descriptors

The correlation matrix of the four relevant descriptors is presented in Table S2 (supplementary files). According to this table, the 4 relevant variables (descriptors) can be considered independent, since each pair of descriptors has a correlation coefficient of less than 0.57.
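
The pairwise check described above can be sketched as follows, assuming Pearson correlation coefficients compared in absolute value. The function name and the small demonstration matrix are hypothetical; the actual values are those of Table S2.

```python
import numpy as np

def max_abs_pairwise_correlation(X):
    """Largest absolute Pearson correlation between any two distinct columns of X."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    # Ignore the diagonal, which is always 1.
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.max(np.abs(off_diag)))

# Hypothetical descriptor matrix (rows = compounds, columns = descriptors).
X_demo = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 0.1],
    [3.0, 4.0, 0.9],
    [4.0, 3.0, 0.2],
])
worst = max_abs_pairwise_correlation(X_demo)
descriptors_ok = worst < 0.57  # threshold quoted in the text
```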

3.2 MLR Model

As mentioned in the methodology section, the MLR model was developed based on 4 relevant molecular descriptors. The final equation of the MLR model is as follows (Eq. 1):

$$ \begin{aligned} pCMC & = 76.89501\left( { \pm \,15.94778} \right) + 0.00041\left( { \pm \,0.00007} \right) \, ATSC7v \, - 0.13615\left( { \pm \,0.02508} \right) \, ATSC5e \, \\ & \quad - 165.43679\left( { \pm \,37.07955} \right) \, ETA\_Epsilon\_3 \, - 0.18446\left( { \pm \,0.01934} \right) \, nAtomLAC \\ \end{aligned} $$
(1)
  • ntrain = 38 R2 = 0.80 \( Q_{\text{LOO}}^{2} \) = 0.72 \( \overline{{r_{{m\left( {\text{scaled}} \right)}}^{2} }} \) = 0.62 \( \Delta r_{{m\left( {\text{scaled}} \right)}}^{2} \) = 0.16

  • ntest = 12 \( Q_{F1}^{2} \) = 0.87 \( Q_{F2}^{2} \) = 0.87 \( \overline{{r_{{m\left( {\text{scaled}} \right)}}^{2} }} \) = 0.64 \( \Delta r_{{m\left( {\text{scaled}} \right)}}^{2} \) = 0.15

  • Q2 = 0.72 (threshold value Q2 > 0.5), Passed

  • r2 = 0.92 (threshold value r2 > 0.6), Passed

  • \( |r_{0}^{2} - r_{0}^{\prime 2}| \) = 0.11 (threshold value \( |r_{0}^{2} - r_{0}^{\prime 2}| \) < 0.3), Passed

  • \( (r^{2} - r_{0}^{2})/r^{2} \) = 0.05 < 0.1 or \( (r^{2} - r_{0}^{\prime 2})/r^{2} \) = 0.16 (threshold value < 0.1; the first condition is met), Passed

  • 0.85 ≤ k = 1.00 ≤ 1.15 or 0.85 ≤ k′ = 0.99 ≤ 1.15, Passed

The values of the statistical parameters indicate the robustness and reliability of the MLR model. The predicted pCMC values of the surfactants studied, as well as the values of the model descriptors, are presented in Table S3 (supplementary files).
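
To make the use of Eq. (1) concrete, the sketch below evaluates the MLR equation for one compound. The coefficients are copied from Eq. (1); the function name and the descriptor values in the example call are hypothetical placeholders and are not taken from Table S3.

```python
# Point estimates of the regression coefficients in Eq. (1).
COEF = {
    "intercept": 76.89501,
    "ATSC7v": 0.00041,
    "ATSC5e": -0.13615,
    "ETA_Epsilon_3": -165.43679,
    "nAtomLAC": -0.18446,
}

def predict_pcmc_mlr(atsc7v, atsc5e, eta_epsilon_3, n_atom_lac):
    """Evaluate Eq. (1) for a single set of descriptor values."""
    return (
        COEF["intercept"]
        + COEF["ATSC7v"] * atsc7v
        + COEF["ATSC5e"] * atsc5e
        + COEF["ETA_Epsilon_3"] * eta_epsilon_3
        + COEF["nAtomLAC"] * n_atom_lac
    )

# Hypothetical descriptor values, for illustration only.
pcmc_example = predict_pcmc_mlr(atsc7v=100.0, atsc5e=1.0, eta_epsilon_3=0.43, n_atom_lac=12)
```

Because the model is linear, increasing nAtomLAC by one atom changes the predicted pCMC by exactly the nAtomLAC coefficient, all other descriptors being equal.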

According to the recommendation of Tropsha and Golbraikh [42, 43], a model is free of overfitting if the difference between R2 and \( Q_{\text{LOO}}^{2} \) is less than 0.3. In the present study, R2 − \( Q_{\text{LOO}}^{2} \) = 0.08, indicating no overfitting in the MLR model. Moreover, the actual prediction error of the model is estimated by the PRESS parameter [44]. For a credible QSPR model, the PRESS/SSY ratio should be smaller than 0.4. In this study, the PRESS/SSY ratio was 0.25 (PRESS = 3.32 and SSY = 13.19), which shows that the developed model predicts better than chance. In addition, to confirm the absence of chance correlation in the development of the MLR model, a Y-randomization analysis was performed by generating 50 random models. The average values of R2 and Q2 obtained (0.12 and −0.17) are well below the acceptable limit of 0.5 for both parameters.
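
The checks in this paragraph can be reproduced with a few lines of arithmetic, and the idea behind Y-randomization can be sketched on synthetic data. In the sketch below, the PRESS, SSY, R2 and \( Q_{\text{LOO}}^{2} \) values are those quoted in the text; the helper names, the ordinary least-squares fit and the synthetic 38 × 4 dataset are assumptions made for illustration and do not reproduce the tool actually used.

```python
import numpy as np

def r2_ols(X, y):
    """R2 of an ordinary least-squares fit with an intercept term."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_models=50, seed=0):
    """Mean R2 over models refitted on randomly permuted responses."""
    rng = np.random.default_rng(seed)
    return float(np.mean([r2_ols(X, rng.permutation(y)) for _ in range(n_models)]))

# Values quoted in the text.
press, ssy = 3.32, 13.19
press_ratio = press / ssy      # about 0.25, below the 0.4 limit
r2, q2_loo = 0.80, 0.72
overfit_gap = r2 - q2_loo      # about 0.08, below the 0.3 limit

# Synthetic data with a real linear signal (not the paper's dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(38, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.8, 0.3]) + rng.normal(scale=0.1, size=38)
r2_true = r2_ols(X_demo, y_demo)
r2_random = y_randomization(X_demo, y_demo)
```

When the descriptors carry real information, the R2 of the original fit should be far above the average R2 of the permuted fits, which is the behavior reported here for the MLR model.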

The direction of a descriptor's effect in a model is given by the sign of its coefficient in the model equation. According to the regression coefficients of Eq. (1), the ETA_Epsilon_3 descriptor was the main contributor to the CMC of anionic surfactants. This descriptor has the largest coefficient, with a negative sign, and therefore has a negative impact on CMC. Thus, for the anionic surfactants considered, low values of the ETA_Epsilon_3 descriptor would help in improving the CMC. In addition, the regression coefficients of the descriptors ATSC5e and nAtomLAC had negative signs, and thus a negative impact on CMC. In contrast, ATSC7v contributes positively to the CMC, so high values of this descriptor are conducive to improving the CMC of anionic surfactants.

3.3 PLS Model

The results obtained with PLS model for the prediction of CMC of anionic surfactants, using 50 compounds, are summarized below:

3.3.1 Internal Validation Parameters

  • R2 (Train): 0.79896

  • Q2 (LOO): 0.72238

  • Scaled average R2m (train; LOO): 0.6165

  • Scaled Delta R2m (train; LOO): 0.16275

  • Mean absolute errors (MAE; train): 0.2673

  • Standard deviation of absolute errors (SD; train): 0.2247

  • Training set prediction quality (based on MAE-based criteria*): MODERATE

3.3.2 External Validation Parameters

  • \( Q_{F1}^{2} \): 0.87167

  • \( Q_{F2}^{2} \): 0.87034

  • Scaled average R2m (test): 0.63903

  • Scaled Delta R2m (test): 0.15178

  • CCC (test): 0.91628

  • Standard deviation of absolute errors (SD; test): 0.0952

  • Test set prediction quality (based on MAE-based criteria*): good

3.4 MLP/ANN Model

In this investigation, the BFGS learning algorithm was used to develop a nonlinear MLP/ANN model for predicting the critical micelle concentration (CMC) of anionic surfactants. The database was divided into a training set (75%) and a test set (25%). The ANN selected for this study is the multilayer perceptron (with an input layer, a hidden layer and an output layer). Several studies have shown that this type of network is able to model any activity (or property) of a substance, whatever its complexity [28]. One output neuron was used to represent the predicted pCMC. The two transfer functions used in this study are the hyperbolic tangent (tanh) and the identity function. Furthermore, the following rule was taken into account to optimize the number of neurons in the hidden layer:

$$ \big[ \left( {{\text{Number of input neurons}} \times {\text{number of hidden neurons}}} \right) + \left( {\text{number of hidden neurons}} \times {\text{number of output neurons}} \right) \big] \le \left( {\text{size of database}} \right) $$

To obtain the best possible model, many trials were carried out, some involving more than 800 iterations. The model with the lowest RMSE value was selected [28]; it has the MLP/ANN architecture {4-3-1}.
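
The rule on the number of hidden neurons is simple arithmetic and can be checked directly for the selected {4-3-1} architecture. In this sketch the function names are hypothetical, and since the text does not say whether "size of database" means the full dataset (50) or the training set (38), both are checked.

```python
def n_connection_weights(n_in, n_hidden, n_out):
    """Left-hand side of the rule: input-hidden plus hidden-output connections."""
    return n_in * n_hidden + n_hidden * n_out

def satisfies_rule(n_in, n_hidden, n_out, n_samples):
    """True if the number of connection weights does not exceed the database size."""
    return n_connection_weights(n_in, n_hidden, n_out) <= n_samples

# Selected architecture {4-3-1}: 4 descriptors, 3 hidden neurons, 1 output (pCMC).
weights_431 = n_connection_weights(4, 3, 1)   # 4*3 + 3*1 = 15
ok_full_dataset = satisfies_rule(4, 3, 1, 50)
ok_training_set = satisfies_rule(4, 3, 1, 38)

# Largest hidden layer allowed by the rule for 4 inputs and 1 output.
max_hidden_50 = max(h for h in range(1, 51) if satisfies_rule(4, h, 1, 50))
max_hidden_38 = max(h for h in range(1, 51) if satisfies_rule(4, h, 1, 38))
```

With 15 connection weights, the {4-3-1} network satisfies the rule under either reading of the database size.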

The pCMC values predicted by the MLP/ANN model for the 50 anionic surfactants are given in Table 1. The observed versus predicted pCMC values for the training and test sets are shown in Fig. 3, which reveals a close correlation between the predicted and observed values of pCMC. The values of the validation (internal and external) statistical parameters reported in Table 3 comply with the acceptability criteria [29], suggesting that the MLP/ANN model is robust and provides excellent predictive quality.

Fig. 3 Scatter plot of the predicted versus experimental values of logCMC obtained with the ANN model for the training and test sets

Table 3 Statistical quality of all developed QSPR models

Because of the complexity of the relationship between the predicted property and the descriptors (variables) in an ANN model [45], the effect of the variables on the micellization process is less straightforward to interpret than in the case of linear regression, where it is relatively easy. The relative contribution [46] of each descriptor in the MLP/ANN model was therefore calculated and is shown in Fig. 4. The importance of these descriptors decreases in the order: ATSC7v > nAtomLAC > ATSC5e > ETA_Epsilon_3.

Fig. 4 Plot of the fractional contribution of the descriptors to the pCMC of anionic surfactants

ATSC7v (weighted by van der Waals volume) belongs to the 2D autocorrelation descriptors. This descriptor describes the distribution of van der Waals volume at a lag of 7 along the topological structure of the anionic surfactants. The physicochemical significance of ATSC7v relates to the volume of the molecule: an increase in the volume of a molecule leads to an increase in the value of ATSC7v. The second descriptor in the MLP/ANN model is nAtomLAC, the number of atoms in the longest aliphatic chain. ATSC5e (weighted by Sanderson electronegativity) and ETA_Epsilon_3 (an Extended Topochemical Atom descriptor) are the third and fourth descriptors in the MLP/ANN model. All these quantities are well defined in the literature.

As can be seen in Fig. 4, the CMC depends largely on the two descriptors ATSC7v and nAtomLAC, which account for 40.85% and 30.18% of the total contribution, respectively. The remaining 28.97% comes from ATSC5e (16.68%) and ETA_Epsilon_3 (12.29%). In summary, atomic electronegativity, molecular size, and the number of atoms in the longest aliphatic chain all play an important role in the micellization of anionic surfactants.

3.5 Statistical Comparison of the QSPR Models

Comparative statistics of the MLR, PLS, and MLP/ANN regression models are presented in Table 3. For each model, we used the same type and number of descriptors, as well as the same composition of the training and test sets. Table 3 shows that all the reported models (MLP/ANN, MLR, and PLS) are of acceptable quality. Among the three regression models, the MLP/ANN model shows the highest values of the quality parameters, i.e., R2 (0.94), \( R_{\text{adjusted}}^{2} \) (0.93), \( Q_{\text{LOO}}^{2} \) (0.93), and \( Q_{F1}^{2} \) (0.95). In addition, the MLP/ANN model improves on the PLS and MLR models in terms of external statistics.

According to the MAE-based metrics (MAE and MAE + 3 × σ), computed after omitting the 5% of data points with the highest prediction residuals, the predictions of the ANN model are classified as ‘good’ (see results in supplementary file), in agreement with the judgment provided by the classical metrics for external validation (Table 3). In addition, the output file of the XternalValidationPlus tool (see results in supplementary file) indicated the absence of systematic error (bias) in the ANN model.

3.6 Applicability Domain Investigation

After the validation of a model, its applicability domain (third OECD principle) must be established. In this study, the applicability domain of the MLP/ANN model was determined from the Williams plot (Fig. 5). The computed threshold leverage (h*) is 0.34. As shown in Fig. 5, none of the 50 surfactants lies outside the range of ± 3 standard deviation units. However, compound 24 [C4H9CH(C2H5)CH2OOCCH2 CH(SO3Na+)COOCH2CH(C2H5)C4H9] and compound 25 [C8H17 OOCCH2 CH(SO3Na+)COOC8H17] are outside the applicability domain (h > h*). Thus, 96% of the surfactants belong to the applicability domain and are therefore covered by the MLP/ANN model. In this work, the MLP/ANN predictions for these two compounds are nevertheless good; they are therefore “good leverage” chemicals, that is, compounds that are very influential on the model and that can stabilize the QSPR model and make it more precise. Consequently, the Williams plot supports the acceptance of the MLP/ANN model for predicting the CMC. In conclusion, we can assert that the MLP/ANN model adheres to the third OECD principle.
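
The leverage values plotted on the horizontal axis of a Williams plot can be sketched as follows. This is a minimal illustration that assumes the usual hat-matrix definition, h_i = x_i^T (X^T X)^{-1} x_i with an intercept column, and the commonly used warning leverage h* = 3(p + 1)/n. Both conventions are assumptions: the value h* = 0.34 reported here may have been obtained with a different convention, so this sketch is not expected to reproduce it. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage of each query compound with respect to the training descriptors."""
    Xt = np.column_stack([np.ones(len(X_train)), np.asarray(X_train, dtype=float)])
    Xq = np.column_stack([np.ones(len(X_query)), np.asarray(X_query, dtype=float)])
    xtx_inv = np.linalg.pinv(Xt.T @ Xt)
    # Row-wise quadratic form x_i^T (X^T X)^-1 x_i.
    return np.einsum("ij,jk,ik->i", Xq, xtx_inv, Xq)

def warning_leverage(n_train, n_descriptors):
    """Common convention h* = 3(p + 1)/n (assumed, not confirmed by the text)."""
    return 3.0 * (n_descriptors + 1) / n_train

# Synthetic stand-in for a 38 x 4 training descriptor matrix.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(38, 4))
h_train = leverages(X_train_demo, X_train_demo)
h_star = warning_leverage(n_train=38, n_descriptors=4)
high_leverage = np.where(h_train > h_star)[0]  # candidates outside the domain
```

In a full Williams plot, these leverages are plotted against standardized residuals, and compounds with h > h* but small residuals correspond to the "good leverage" case discussed above.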

Fig. 5 Projection of the training and test sets of anionic surfactants in the Williams plot

3.7 Comparison with Previously Reported Models

The statistical results of the MLP/ANN model were compared with those of some previously developed QSPR models (Table 4). Table 4 shows that, with the exception of the model developed by Roy and Kabir [21], none of the other models includes an applicability domain determined according to the OECD guidelines or an external validation. Also, unlike the other models, the MLP/ANN model and that of Roy and Kabir [21] offer better predictive power. Although the internal validation statistics are almost identical, our model slightly outperforms that of Roy and Kabir [21] in terms of external validation, since the MAE-based criteria were not verified for the latter. It should be noted, however, that the results provided by Roy and Kabir were obtained with linear regression-based techniques. We can conclude that the MLP/ANN model developed in this work is encouraging and can therefore be used to determine the CMC of new surfactants, thus saving substantial amounts of money and time.

Table 4 Comparison of the results of internal and external validation of our best model (MLP/ANN) with previously published models

4 Conclusions

For the prediction of the CMC values of anionic surfactants, three regression methods (MLR, PLS and MLP/ANN) were used to develop robust predictive models. The proposed models, trained and validated using a dataset of 50 anionic surfactants, were based on four molecular descriptors. By applying the internal and external validation strategies described above, we were able to conclude that the models were robust with respect to both internal and external validation parameters. The multilayer perceptron–artificial neural network (MLP/ANN) model trained with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm gave better CMC predictions, with higher \( Q_{\text{ext}}^{2} \) and \( \overline{{r_{m}^{2} }} \) values (0.95 and 0.87) and an acceptable ∆r2m value (0.15) for the test set, than previously reported models. The MAE-based metrics indicated that the MLP/ANN model gives ‘GOOD’ predictions (after removing the 5% of test set objects with the highest residual values).

By studying the properties of the four descriptors used to develop QSPR models, it appears that the length of the aliphatic chain, the electronic properties (electronegativity), and the structure of the molecules play a crucial role in the micellization process. In conclusion, the QSPR model developed in this work is in line with OECD principles and is useful to provide early CMC estimations on the one hand and to design new surfactants with a special property on the other hand.