1 Introduction

Eosinophils are a type of leukocyte (white blood cell) and are part of the immune system components responsible for combating multicellular parasites and certain infections in vertebrates [5, 20]. A blood eosinophil count above 600/cmm is defined as eosinophilia [23, 29], which was originally observed in patients treated with tryptophan-containing commercial products in the USA in 1989 [3, 15, 22]. The clinical manifestations, known as eosinophilia–myalgia syndrome (EMS), mainly include severe skin eruption, fever, hematologic abnormalities, and organ system dysfunction [1, 18, 24]. Various factors have since been implicated as causes of eosinophilia, and exposure to drugs, such as antipsychotic, antibacterial, antiviral, antithyroid, anticancer, and other medications, is considered the most common cause [8, 11, 23]. Unfortunately, drug-induced adverse effects are usually detected only in phase III clinical trials or after the drug has been introduced into the market. These experimental processes are complicated, time-consuming, and costly [14, 34]. In particular, the experimental evaluation of drug-induced eosinophilia can harm human health, causing autoimmune diseases and end-organ failure, and can even lead to mortality [8, 15]. Thus, cheaper, faster, and accurate computational prediction methods would be a good alternative for evaluating the safety of candidate compounds prior to their synthesis. This may help medicinal chemists rationally select the chemicals with the best prospects of being effective and safe, and withdraw suspected culprit chemicals in the early stage of drug development.

Many academic institutions and pharmaceutical companies have recognized the advantages of computational techniques, which are now widely employed to assess pharmacokinetic properties and preclinical safety in the early stage of drug development [10, 13, 19, 21, 28]. Among these computational methods, statistical and machine learning methods have been widely used to predict adverse drug reactions (ADRs), and some have shown good performance in forecasting possible ADRs [6, 31, 32]. However, there have been few reports of computational models of drug-induced eosinophilia. As far as we know, only González-Díaz et al. [8] have constructed a computational prediction model of drug-induced eosinophilia, using the linear discriminant analysis (LDA) method. Thus, in this work, two statistical and machine learning methods, a modified support vector machine (SVM) method [26] and the naïve Bayesian approach [2, 4], were used to assess drug-induced eosinophilia. For the modified SVM, a genetic algorithm (GA) is used for feature selection [16], and the conjugate gradient (CG) method is employed for parameter optimization [12]. The naïve Bayesian classifier is a popular and mature machine learning method that is based on Bayes’ theorem and judges the plausibility of different candidate classes for a system [2, 4]; it has been widely applied in the pharmaceutical industry [7, 17, 33].

The purpose of this investigation was to develop computational prediction models for drug-induced eosinophilia by using the SVM and naïve Bayesian approaches, and to identify important molecular descriptors and substructures associated with compounds inducing eosinophilia. The generated prediction models were validated by fivefold cross-validation and an external test set. We hope that the established computational models can be employed to predict the drug-induced eosinophilia adverse effect in the early stage of drug development, and that the molecular descriptors and substructures associated with drug-induced eosinophilia will be taken into consideration in the design of new candidate compounds, helping medicinal chemists rationally select the chemicals with the best prospects of being effective and safe.

2 Materials and methods

2.1 Dataset collection

The biological activity and chemical structure of each compound were extracted from the literature [8]. Some compounds were deleted in this research: Benzen was a duplicate, Nitroprusside is an inorganic compound, and the structures of Mustar vacilic and Nafaline could not be found. The remaining 148 compounds were used in this investigation. To allow comparison with the previous study, the same training set (107 agents) and test set (41 compounds) as those used in the literature [8] were applied. The structures of the training set (TrainingSet_107.sd) and test set (TestSet_41.sd) molecules are listed in the Supplementary Data.

2.2 Support vector machines (SVM)

The optimized SVM method, namely GA-CG-SVM, is a modified SVM modeling approach. A detailed description of the GA-CG-SVM method can be found in our previous paper [3032]. Here, we only briefly summarize the basic ideas of SVM and GA-CG-SVM.

In SVM, each object is described by a feature vector \(x_{i}\), and its class index is represented by \(y_{i}\), which takes the value +1 or −1. In linearly separable cases, the two classes of feature vectors can be correctly classified by

$$w \times x_{i} + b \ge + 1,\quad {\text{for}}\quad y_{i} = + 1$$
(1)
$$w \times x_{i} + b \le - 1,\quad {\text{for}}\quad y_{i} = - 1$$
(2)

Here, \(w\) is a vector normal to the hyperplane, and \(b\) is a scalar quantity. The SVM attempts to find an optimal separating hyperplane with the maximum margin by solving the following optimization problem:

$$\mathop {\text{Max}}\limits_{w,b} \frac{2}{\left\| w \right\|}\quad {\text{Subject to}}\quad y_{i} (w \times x_{i} + b) - 1 \ge 0$$
(3)

However, in linearly non-separable cases, no hyperplane can perfectly separate the two sets of points. In this case, nonnegative slack variables \(\xi_{i} \ge 0\), \(i = 1, \ldots ,m\), are introduced such that

$$w \times x_{i} + b \ge + 1 - \xi_{i} ,\quad {\text{for}}\quad y_{i} = + 1$$
(4)
$$w \times x_{i} + b \le - 1 + \xi_{i} ,\quad {\text{for}}\quad y_{i} = - 1$$
(5)

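In order to find a hyperplane that keeps the margin wide while minimizing the training errors, and noting that maximizing \(2/\left\| w \right\|\) is equivalent to minimizing \(\frac{1}{2}\left\| w \right\|^{2}\), the problem to be solved becomes:

$$\mathop {\text{Min}}\limits_{w,b,\xi } \frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{m} {\xi_{i} } \quad {\text{Subject to}}\quad y_{i} (w \times x_{i} + b) - 1 + \xi_{i} \ge 0,\quad \xi_{i} \ge 0$$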
(6)

Here, C is the penalty parameter, which controls the trade-off between a wide margin and few training errors and must be set by the user in advance.
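
The role of the slack variables and of C can be illustrated with a minimal sketch. It assumes the standard soft-margin form, in which each slack takes its smallest feasible value, \(\xi_{i} = \max (0, 1 - y_{i} (w \times x_{i} + b))\), and the quantity to be minimized is \(\frac{1}{2}\left\| w \right\|^{2} + C\sum \xi_{i}\). The toy hyperplane and data points are hypothetical and are not taken from the eosinophilia dataset.

```python
def slack(w, b, x, y):
    """Smallest slack xi that satisfies y * (w . x + b) >= 1 - xi, with xi >= 0."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


def soft_margin_objective(w, b, X, Y, C):
    """Value of 0.5 * ||w||^2 + C * sum(xi) for a given hyperplane (w, b)."""
    norm_sq = sum(wi * wi for wi in w)
    return 0.5 * norm_sq + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))


# Hypothetical toy hyperplane and points (illustration only).
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [0.5, 1.0], [-2.0, 0.0], [0.5, -1.0]]
Y = [1, 1, -1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
objective = soft_margin_objective(w, b, X, Y, C=10.0)
```

Points outside the margin on the correct side get zero slack, a point inside the margin gets a slack between 0 and 1, and a misclassified point gets a slack above 1. A larger C makes each unit of slack more expensive.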

Nonlinearly separable cases can be transformed into linear cases by projecting the input variables into a new high-dimensional feature space by means of a kernel function \(K(x_{i} ,x_{j} )\). The radial basis function (RBF) is the most widely used kernel function and performs very well in most cases:

$$K(x_{i} ,x_{j} ) = \exp \left( { - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} } \right)$$
(7)

Here, \(\gamma\) is a kernel parameter that must be specified by the user in advance.
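
As a small illustration of Eq. (7), the sketch below evaluates the RBF kernel for two pairs of descriptor vectors. The vectors are hypothetical; the value of \(\gamma\) is the one reported for the optimized model in Sect. 2.3.1.

```python
import math


def rbf_kernel(xi, xj, gamma):
    """RBF kernel exp(-gamma * ||xi - xj||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)


GAMMA = 0.025620  # gamma reported for the optimized GA-CG-SVM model

# Hypothetical descriptor vectors scaled to [-1, +1] (illustration only).
k_same = rbf_kernel([0.2, -0.5], [0.2, -0.5], GAMMA)   # identical vectors
k_far = rbf_kernel([1.0, 1.0], [-1.0, -1.0], GAMMA)    # opposite corners
```

The kernel equals 1 for identical vectors and decays towards 0 as the vectors move apart; a larger \(\gamma\) makes that decay faster.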

Well-chosen values of C and γ can significantly improve the accuracy of SVM classification. Furthermore, feature selection and the parameter setting (\(C,\gamma\)) influence each other in SVM modeling. Thus, a combined scheme was used to handle the two problems: a genetic algorithm (GA)-based method is used for feature selection, and a conjugate gradient (CG) method is used for optimizing the (\(C,\gamma\)) parameters.

2.3 Modeling details by GA-CG-SVM

The structures of all compounds were generated and then geometrically optimized by using the Accelrys Discovery Studio program package (Accelrys, San Diego, CA). The optimized 3D structure of each compound was manually inspected to ensure that each molecule was properly represented and consistent with the structure reported in the literature [8]. Molecular descriptors were calculated by using the online program PCLIENT [27].

The initial features were preprocessed to eliminate redundant and overlapping descriptors. The following descriptors were removed: (1) descriptors with too many zero values, (2) descriptors with very small standard deviation values (<0.5 %), and (3) descriptors that are highly correlated with others (correlation coefficients >95 %). After preprocessing, the descriptor values were scaled to a range of −1 to +1, which is necessary because differing ranges of descriptor values influence the quality of the SVM model generated.
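
The three filters and the scaling step can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the threshold for "too many zero values" is not given above, so `max_zero_frac` is an assumed value, the standard deviation cutoff is read here as an absolute value of 0.005, and the descriptor names and values are hypothetical.

```python
import statistics


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def filter_descriptors(columns, max_zero_frac=0.9, min_std=0.005, max_corr=0.95):
    """Keep descriptors that pass the zero-count, spread, and correlation filters.

    max_zero_frac is an assumed threshold; the other two follow the text.
    """
    kept = {}
    for name, values in columns.items():
        if sum(v == 0 for v in values) / len(values) > max_zero_frac:
            continue  # (1) too many zero values
        if statistics.pstdev(values) < min_std:
            continue  # (2) almost constant
        if any(abs(pearson(values, other)) > max_corr for other in kept.values()):
            continue  # (3) highly correlated with a descriptor already kept
        kept[name] = values
    return kept


def scale_to_unit_range(values):
    """Linearly rescale values so that the minimum maps to -1 and the maximum to +1."""
    lo, hi = min(values), max(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


# Hypothetical descriptor table: one list of values per descriptor.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.1],      # nearly proportional to d1
    "d3": [0.0, 0.0, 0.0, 0.0],      # all zeros
    "d4": [5.0, 5.0, 5.0, 5.001],    # almost constant
    "d5": [4.0, 1.0, 3.0, 1.5],
}
kept = filter_descriptors(columns)
scaled = {name: scale_to_unit_range(values) for name, values in kept.items()}
```

In this toy table only `d1` and `d5` survive. Note that a correlation filter of this greedy kind depends on the order in which descriptors are visited.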

2.3.1 Construction of the GA-CG-SVM model of drug-induced eosinophilia

A total of 107 compounds, comprising 71 toxic compounds and 36 non-toxic agents, were used as the training set for the SVM classification model of drug-induced eosinophilia. The following molecular properties were initially calculated: 48 constitutional descriptors, 21 topological charge indices descriptors, 99 WHIM descriptors, 154 functional group counts descriptors, 119 topological descriptors, 150 RDF descriptors, 74 geometrical descriptors, and 31 molecular descriptors. These descriptors were first preprocessed to remove redundant and unrelated properties, which left 186 molecular descriptors. These were further reduced by using the GA-CG method. Finally, eight molecular descriptors were selected (Table 1), and the optimized parameters \((C,\gamma )\) are (4435.096191, 0.025620).

Table 1 Molecular descriptors used in the SVM modeling for the prediction model of drug-induced eosinophilia adverse effect

2.4 Naïve Bayesian model

An introduction to naïve Bayes classification theory has been given in the literature [4, 26]. The naïve Bayesian classification approach is a popular and mature machine learning method that uses molecular descriptors to distinguish between positive and negative compounds. In this investigation, the naïve Bayesian model was developed by using Discovery Studio (DS) version 3.1 (Accelrys Inc., San Diego, CA). The default physical property descriptors were used, including ALogP, molecular weight, number of H donors, number of H acceptors, number of rings, number of aromatic rings, and molecular fractional polar surface area. Fivefold cross-validation was used for the training set. The “Model Domain Fingerprint” was set to ECFP-6 [extended connectivity fingerprints, with a diameter of 6, generated in Pipeline Pilot (SciTegic, Inc.)], because it gave the highest ROC curve. The ROC curve charts the true-positive rate (sensitivity) versus the false-positive rate (100 % − specificity). Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The other parameters were kept at their default values.

2.5 Statistical analysis

The predictive performances of statistical and machine learning models were assessed by overall prediction accuracy (\(Q\)); sensitivity (SE), the prediction accuracy for positive compounds; and specificity (SP), the prediction accuracy for negative compounds.

$$Q = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$$
(8)
$${\text{SE}} = \frac{\text{TP}}{\text{TP + FN}}$$
(9)
$${\text{SP}} = \frac{\text{TN}}{\text{TN + FP}}$$
(10)

where TP, true positives, is the number of positive instances which are correctly identified; TN, true negatives, is the number of negative instances which are correctly recognized; FP, false positives, is the number of the negative instances which are wrongly predicted as positives; FN, false negatives, is the number of positive instances which are wrongly predicted as negatives.
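
Equations (8)–(10) translate directly into code. The short sketch below is only an illustration; as a worked example it uses the confusion counts reported in Sect. 3.1 for the SVM model on the training set (67 of 71 toxic and 31 of 36 non-toxic compounds correctly predicted).

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (Q, SE, SP) as fractions, following Eqs. (8)-(10)."""
    q = (tp + tn) / (tp + tn + fp + fn)   # overall prediction accuracy
    se = tp / (tp + fn)                   # sensitivity: accuracy on positives
    sp = tn / (tn + fp)                   # specificity: accuracy on negatives
    return q, se, sp


# SVM, training set: 71 toxic (67 correct) and 36 non-toxic (31 correct).
q, se, sp = classification_metrics(tp=67, tn=31, fp=5, fn=4)
percentages = tuple(round(100 * value, 1) for value in (q, se, sp))
```

Expressed as percentages rounded to one decimal place, these counts give Q = 91.6 %, SE = 94.4 %, and SP = 86.1 %.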

3 Results

3.1 SVM classification model of drug-induced eosinophilia

In this research, fivefold cross-validation on the training set was employed to evaluate the stability and capacity of the established SVM model, and an external test set of 41 unique drugs was used to further assess the model’s predictive power. For the training set, the overall prediction accuracy (Q, Table 2) is 91.6 %. Of the 71 toxic compounds, 67 were correctly predicted, giving a sensitivity (SE, Table 2) of 94.4 %. Of the 36 non-toxic agents, 31 were correctly identified, giving a specificity (SP, Table 2) of 86.1 %. To evaluate whether the established SVM model could correctly recognize external compounds as toxic or non-toxic agents, the external test set containing 41 compounds was applied. Table 3 shows the prediction results for the test set; of these 41 compounds, 34 were correctly classified, so the overall prediction accuracy (Q, Table 3) for the test set is 82.9 %. Of the 24 toxic compounds, 20 were correctly recognized, giving a sensitivity (SE, Table 3) of 83.3 %. Of the 17 non-toxic compounds, 14 were correctly predicted, giving a specificity (SP, Table 3) of 82.4 %. These results indicate that the established SVM prediction model of drug-induced eosinophilia can successfully discriminate these agents as positives (toxic compounds) or negatives (non-toxic compounds).

Table 2 Fivefold cross-validation results of SVM and Bayesian models for the training set
Table 3 Prediction results of the external test set

3.2 The naïve Bayesian classification model of drug-induced eosinophilia

A Bayesian prediction model of drug-induced eosinophilia was developed on the same training set by using the default physical property descriptors together with the extended connectivity fingerprint descriptor (ECFP-6) (Bayesian model: descriptors + ECFP-6). The established naïve Bayesian prediction model was also evaluated by fivefold cross-validation and an external test set. The best cutoff for this model is 0.014. The area under the ROC curve (AUC), or ROC score, is widely used as a measure of a model’s discriminatory power. The maximum ROC score of 1 indicates that the model has a perfect prediction performance (100 % true-positive (TP) rate and 0 % false-positive (FP) rate). A ROC score of 0.5 indicates that the model has no discriminative ability (i.e., 50 % true-positive (TP) rate and 50 % false-positive (FP) rate). In this work, the ROC score for the fivefold cross-validation on the training set is 0.858, which indicates that the established model has good predictive power.
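
The ROC score can also be read as the probability that a randomly chosen toxic compound receives a higher model score than a randomly chosen non-toxic one. The sketch below computes it by direct pairwise comparison. This is a generic illustration of that interpretation, not the routine used by Discovery Studio, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical model scores; label 1 = toxic, label 0 = non-toxic.
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])   # every toxic ranked first
mixed = roc_auc([0.9, 0.3, 0.6, 0.1], [1, 1, 0, 0])     # one pair out of order
```

In the second example three of the four toxic/non-toxic pairs are ordered correctly, so the ROC score is 0.75.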

The fivefold cross-validation results of the training set for the model (Bayesian model: descriptors + ECFP-6) are given in Table 2. The prediction accuracy for the toxic compounds (SE, Table 2) is 88.7 %, the specificity for the non-toxic compounds (SP, Table 2) is 100 %, and the overall prediction accuracy (Q, Table 2) for the training set is 92.5 %. For the external test set, the ROC score is 0.973, and the detailed prediction results are shown in Table 3. The overall prediction accuracy (Q, Table 3) is 85.4 %. The model recognizes 75.0 % of the toxic compounds (SE, Table 3), that is, 18 chemicals out of 24. Moreover, the model correctly classifies 100 % of the non-toxic chemicals (SP, Table 3), that is, 17 agents out of 17. These results indicate that the established naïve Bayesian prediction model (Bayesian model: descriptors + ECFP-6) of drug-induced eosinophilia can successfully recognize internal and external agents as positives or negatives. Furthermore, a second naïve Bayesian model was established on the default physical property descriptors alone (Bayesian model: simple descriptors), with the ECFP-6 fingerprint descriptor removed. As shown in Tables 2 and 3, the prediction performance of this model decreased markedly, especially for non-toxic compounds. Its prediction accuracy for the training set and for the test set is 85.0 and 80.5 %, respectively. Figure 1 shows some fragments produced by the ECFP-6 descriptors, listing the top 20 toxic and non-toxic fragments. The Bayesian score of a feature measures how much the proportion of toxic compounds among the compounds containing that feature differs from the hit rate as a whole (the ratio that would be expected if the feature occurred randomly across the toxic and non-toxic agents); it represents the final contribution of the feature to the model prediction. The results suggest that including these fragments can markedly increase the overall accuracy of drug-induced eosinophilia prediction.

Fig. 1

a ECFP-6 descriptors: some substructures that are important for drug-induced eosinophilia. Each panel shows the naming convention for each fragment, the numbers of molecules it is present in that are toxic agents, and the Bayesian score for the fragment. b ECFP-6 descriptors: some substructures that are absent from drug-induced eosinophilia compounds. Each panel shows the naming convention for each fragment, the numbers of molecules it is present in that are toxic compounds, and the Bayesian score for the fragment

4 Discussion

In this investigation, the prediction models of drug-induced eosinophilia have been successfully developed by using the optimal SVM and naïve Bayesian approaches. For the SVM modeling, the overall prediction accuracy for the training set and for the test set is 91.6 and 82.9 %, respectively. For the naïve Bayesian modeling, the overall prediction accuracy for the training set and for the external test set is 92.5 and 85.4 %, respectively. All of these indicate the constructed SVM and naïve Bayesian models are suitable for predicting the drug-induced eosinophilia adverse effect and could be used as tools for screening compounds with eosinophilia adverse effect and reducing late-stage attrition rates in drug development process.

4.1 Molecular features important for drug-induced eosinophilia

The pathogenesis of drug-induced eosinophilia is very complex, and different mechanisms have been implicated in its development [9, 25]. Thus, it is necessary to investigate the important molecular descriptors of compounds inducing eosinophilia. An advantage of statistical and machine learning methods, such as the SVM and naïve Bayesian approaches used here, is that they can relate chemical agents to their bioactivities by using simple descriptors of their physicochemical properties. In this research, the GA-CG method was used to select important descriptors for drug-induced eosinophilia. Eight kinds of molecular descriptors, comprising 696 descriptors in total, were initially calculated. After the redundant and unrelated properties were removed, 186 descriptors remained. Finally, eight important molecular descriptors were selected from the 186 descriptors. Table 1 lists the selected descriptors and their definitions. The results of this work show that the molecular descriptors selected by GA-CG can discriminate between compounds that cause eosinophilia and those that do not. The selected descriptors can be roughly grouped into several categories: hydrogen-bonding descriptors (nHDon), molecular electronic property-related descriptors (E3s), molecular structural information-related descriptors (ZM2V, L3u, E2u), lipophilicity-related descriptors (BLTF96), and molecular weight-related descriptors (RDF015m, RDF035m).

4.2 Analysis of the toxic/non-toxic fragments produced by the ECFP-6 fingerprint descriptors

The molecular features considered important for drug-induced eosinophilia have been identified by the GA-CG method. In order to better understand the structures of compounds inducing and not inducing eosinophilia, the ECFP-6 fingerprint descriptors were applied in the naïve Bayesian model to produce substructures of toxic and non-toxic compounds. Figure 1 shows some of the toxic and non-toxic fragments generated by the ECFP-6 fingerprints: substructures that contribute to toxic compounds (Fig. 1a) and substructures found in compounds that do not induce eosinophilia (Fig. 1b). A compound having any of the fragments in Fig. 1a was considered a toxic agent. Each panel gives the naming convention for the fragment, the number of toxic molecules in which it is present, and the Bayesian score for the fragment. The Bayesian score takes account of the total number of occurrences of the feature, so that more weight is placed on features that occur often and little weight on those with very few occurrences. On further analysis of the fragments generated from toxic and non-toxic compounds, we found that some fragments appeared only in compounds inducing eosinophilia, such as dimethylsulfane (G1), N-methylcyclobutanamine (G2), chlorobenzene (G7, G16), N-methylenemethanamine (G10, G11, G14, G15), and tetrahydrothiophene (G17, G18). Thus, these substructures of toxic compounds might be associated with the drug-induced eosinophilia adverse effect and should be taken into consideration in the design of new candidate drugs, helping medicinal chemists rationally select the chemicals with the best prospects of being effective and safe.
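
The weighting behaviour described above can be illustrated with a short sketch. The formula for the Bayesian score is not given in this paper; the code assumes a Laplacian-corrected estimate of the kind commonly described for fingerprint-based naïve Bayesian classifiers, score = ln[(A + 1)/(T × P + 1)], where A is the number of toxic compounds containing the fragment, T is the total number of compounds containing it, and P is the overall fraction of toxic compounds. The function name and the fragment counts are hypothetical; only the base rate (71 toxic compounds out of 107) comes from the training set used here.

```python
import math


def fragment_score(n_toxic_with_fragment, n_with_fragment, base_rate):
    """Assumed Laplacian-corrected score: ln((A + 1) / (T * P + 1))."""
    return math.log(
        (n_toxic_with_fragment + 1.0) / (n_with_fragment * base_rate + 1.0)
    )


BASE_RATE = 71 / 107  # fraction of toxic compounds in the training set

# Hypothetical fragment counts (illustration only).
rare = fragment_score(1, 1, BASE_RATE)      # seen once, in a toxic compound
common = fragment_score(10, 10, BASE_RATE)  # seen ten times, always toxic
absent = fragment_score(0, 5, BASE_RATE)    # seen five times, never toxic
```

Both `rare` and `common` occur only in toxic compounds, but the fragment with ten occurrences receives the larger positive score, while the fragment seen only in non-toxic compounds receives a negative score. Under this assumption, a fragment that occurs at exactly the base rate scores close to zero.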

4.3 Comparison with previous prediction model of drug-induced eosinophilia

Although a number of prediction models of pharmacokinetic properties and toxicity have been developed and used in drug development, there have been few reports of computational models for drug-induced eosinophilia. Only González-Díaz et al. [8] have built a prediction model of drug-induced eosinophilia, using the linear discriminant analysis (LDA) method, which gave a good classification of 91.82 % for the training series and 88.1 % for the external validation series. In this study, the GA-CG-SVM model gives 91.6 % for the training set and 82.9 % for the test set. The naïve Bayesian model correctly classifies 92.5 % of the training set compounds and 85.4 % of the test set agents. The prediction accuracies of the GA-CG-SVM and naïve Bayesian models established in this work are therefore comparable to those of the LDA model built by González-Díaz et al. [8]. In addition, the GA-CG-SVM model selects critical molecular descriptors for drug-induced eosinophilia, and the naïve Bayesian model identifies fragments that contribute to eosinophilia inducers and fragments that do not.

5 Conclusions

In this investigation, prediction models of the drug-induced eosinophilia adverse effect have been developed by using the SVM and naïve Bayesian approaches. A set of 107 compounds was used as the training set, and 41 agents were used as the external test set. For the SVM modeling, the overall prediction accuracy for the training set by means of fivefold cross-validation is 91.6 % and for the external test set is 82.9 %. For the naïve Bayesian modeling, the overall prediction accuracy for the training set and for the external test set is 92.5 and 85.4 %, respectively. Moreover, some molecular descriptors and substructures associated with the toxicity of eosinophilia-inducing compounds were identified. We hope that the prediction models of drug-induced eosinophilia built in this work can be applied to filter early-stage molecules for this potential adverse effect, and that the selected molecular descriptors and substructures of toxic compounds will be taken into consideration in the design of new candidate drugs, ultimately reducing the attrition rate in later stages of drug development.