Introduction

Which drugs are safe to take during lactation is an important and complicated question. Nearly all drugs taken by nursing mothers are excreted into breast milk and are bioavailable to their infants (1). Generally, most drugs do not pose a significant problem to nursing infants, because the dose ingested by the infant is proportionally small, usually less than 1% of the maternal dose. However, considering the small size of infants and the differences in metabolism between infants and adults, physicians need more accurate information to decide which drugs can be used safely and which should be used with caution during lactation (2–4).

A direct assessment of a drug's risk to infants entails measuring drug concentrations in breast milk and extrapolating to other doses/patients by using the milk/plasma (M/P) drug concentration ratio, which is the drug concentration in breast milk divided by that in maternal plasma. However, it is difficult to enroll nursing women in a trial solely to assess the pharmacokinetics of a compound in milk and plasma (5). Accurate determination of the M/P ratio also requires careful attention to experimental details and the collection of serial milk and plasma samples. Given these difficulties, combinations of in vitro/in vivo experiments and modeling approaches may suffice to determine whether a particular drug will show a high M/P ratio in human milk, and thus pose a potential risk to nursing infants.
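For concreteness, the AUC-based form of this calculation can be sketched as follows; the serial concentration values below are hypothetical and serve only to illustrate the arithmetic:

```python
import numpy as np

# Hypothetical paired milk and plasma concentrations (ug/L) from serial sampling
times = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])    # hours post-dose
milk = np.array([12.0, 30.0, 42.0, 35.0, 18.0, 7.0])
plasma = np.array([20.0, 55.0, 60.0, 40.0, 15.0, 5.0])

def auc(t, c):
    """Area under the concentration-time curve by the trapezoidal rule."""
    return float(np.sum(np.diff(t) * (c[:-1] + c[1:]) / 2.0))

# AUC-based M/P ratio: total milk exposure divided by total plasma exposure
mp_ratio = auc(times, milk) / auc(times, plasma)
print(f"M/P ratio: {mp_ratio:.2f}")  # -> M/P ratio: 0.86
```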

Linear regression methods relating the M/P ratio to a drug's physicochemical properties have been developed. Meskin and Lien (6) were the first to include the physicochemical properties of a drug in the prediction of M/P ratios, and Agatonovic-Kustrin et al. (5,7) extended this approach by using artificial neural networks (ANNs). It should be noted that, for a given drug, M/P ratios measured under different conditions often varied markedly. For instance, the maximum reported M/P ratio of acyclovir was 4.1, whereas the minimum was only 0.6, a nearly sevenfold difference. Regression models built on such data can therefore be misleading and may not generalize to new data. Instead, M/P values can be used to assign drugs to classes. Compared with regression, classification offers the advantage of better handling data that are often noisy. Thus, in this paper, classification models were constructed to distinguish the potential risk of a drug to nursing infants. Many algorithms can be used for classification, e.g., the nearest mean classifier (NMC) (8), linear discriminant analysis (LDA) (9), K-nearest neighbors (KNN) (10), and classification and regression trees (CART) (11). More recently, the support vector machine (SVM) was developed by the machine learning community (12,13). Because of its remarkable generalization performance, SVM has attracted attention and found extensive applications (14–19).

In this paper, SVM models were developed to distinguish the potential risk of drugs to nursing infants. Each compound was encoded with 400 molecular structure descriptors, which can be calculated directly from the drug's structure. The constructed models can evaluate the risk of drugs whose experimental M/P ratios have not been determined. Furthermore, they provide some insight into the degree to which drugs can transfer into milk.

Materials and Methods

Data Set

The compounds under study consisted of 126 commonly used drugs whose experimental M/P values were taken from the literature (5–7,20–25). For drugs with more than one reported M/P value, the average was taken as the M/P ratio. As shown in Table I, the maximum ratio in the data set was 5.545 and the minimum was 0. Next, the M/P ratios were scaled to the interval [0, 1] and two classes were defined: Class 1, indicated as “−” (49 drugs, 0 ≤ M/P ≤ 0.1), and Class 2, indicated as “+” (77 drugs, 0.1 < M/P ≤ 1).
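The averaging and class-assignment rule described above can be sketched as follows; the replicate values shown are hypothetical:

```python
def mp_class(ratios):
    """Average replicate M/P measurements (already scaled to [0, 1]) and
    assign the binary label used in this study:
    '-' (Class 1) if mean M/P <= 0.1, else '+' (Class 2)."""
    mean_ratio = sum(ratios) / len(ratios)
    return "-" if mean_ratio <= 0.1 else "+"

# Hypothetical drugs with replicate M/P measurements
print(mp_class([0.02, 0.05]))  # low transfer  -> "-"
print(mp_class([0.45, 0.60]))  # high transfer -> "+"
```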

Table I Compounds and Their Corresponding Classification

Model Validation

The data set was randomly divided into a training set (96 compounds) for model development/calibration and an independent test set (30 compounds) for prediction. In addition, to test the classification ability of the predictive models on new drugs, an external data set of nine drugs whose M/P values are currently unavailable was applied to the LDA and SVM models. Bootstrapping validation, which provides a quantitative assessment of model robustness and predictive power, was applied to the training set (26). Boosting is a related technique that attempts to drive the test set error rapidly to zero: it produces a series of predictors, where the training set for each member of the series is based on the performance of the preceding predictor(s). New training sets are created by choosing patterns that the previous predictors predicted badly more frequently than those they predicted well, so that each new predictor concentrates on the patterns for which the current ensemble performs poorly; the selection probability assigned to each sample depends on its prediction error under the existing ensemble. In ordinary bootstrapping, by contrast, each training pattern is selected with probability 1/N at each draw, where N is the total number of patterns in the original sample. Many of the original examples may be repeated in the resampled training set, whereas others may be left out. The method constructs a proxy universe equal in size to the original sample and is, in effect, equivalent to sampling without replacement from an infinitely large replicated universe. Typically, 200 bootstrap replications are sufficient to estimate confidence intervals (27); in this study, 1000 replications were used.
Goodness of fit of the models was assessed by the classification accuracy and the number of misclassified cases.
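A minimal sketch of such a bootstrap assessment, using synthetic stand-in data (the actual study used the 96-compound training set with its five selected descriptors) and scikit-learn's LDA implementation:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for the 96-compound training set: 5 descriptors, 2 classes
X = rng.normal(size=(96, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=96) > 0).astype(int)

n_boot = 1000  # the study used 1000 bootstrap replications
accs = []
for _ in range(n_boot):
    # Each pattern is drawn with probability 1/N, with replacement, so some
    # compounds repeat in the replicate while others are left out
    idx = rng.integers(0, len(y), size=len(y))
    oob = np.setdiff1d(np.arange(len(y)), idx)  # out-of-bag compounds
    if oob.size == 0:
        continue
    clf = LinearDiscriminantAnalysis().fit(X[idx], y[idx])
    accs.append(clf.score(X[oob], y[oob]))  # accuracy on left-out compounds

print(f"bootstrap mean accuracy: {np.mean(accs):.3f}")
```

The spread of the out-of-bag accuracies across replications is what quantifies the robustness of the classifier.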

Descriptor Generation and Selection

The structures of these compounds were drawn with the ISIS DRAW 2.3 program (28). The final geometries were obtained with the semiempirical PM3 method in the HYPERCHEM 6.03 program (29). All calculations were carried out at the restricted Hartree–Fock level with no configuration interaction. The molecular structures were optimized using the Polak–Ribiere algorithm until the root-mean-square gradient reached 0.001. Then the resulting geometry was transferred into the CODESSA software, developed by Katritzky et al. (30,31), which can calculate constitutional, topological, geometrical, electrostatic, and quantum chemical descriptors. For this study, 400 descriptors were calculated for each compound, including 38 constitutional descriptors, 38 topological descriptors, 12 geometrical descriptors, 71 electrostatic descriptors, and 241 quantum chemical descriptors. Constitutional descriptors are related to the number of atoms and bonds in each molecule, e.g., number of C atoms, number of bonds, and number of rings. Topological descriptors describe the atomic connectivity in the molecule, including valence and nonvalence molecular connectivity indices calculated from the hydrogen-suppressed formula of the molecule, encoding information about the size, composition, and degree of branching of a molecule, e.g., Wiener index, Randic index, and structural information content. Geometrical descriptors describe the size of the molecule and require 3D coordinates of the atoms in the given molecule, e.g., XY Shadow, molecular volume, and molecular surface area. Electrostatic descriptors reflect the characteristics of the charge distribution of the molecule, e.g., max/min partial charge, count of H acceptor sites, and topographic electronic index. Quantum chemical descriptors provide information about binding and formation energies, partial atom charges, dipole moment, and molecular orbital energy levels, e.g., HOMO/LUMO energy, max/min e–e repulsion, and max/min e–n attraction.

Next, stepwise variable selection was applied to choose proper features from the large set of descriptors. The F value was used as the criterion for whether a descriptor should be entered into or removed from the discriminant function; the entry F value was 3.84 and the removal F value 2.71. The five descriptors thus obtained, together with their coefficients in the function and their F values, are listed in Table II. The correlation matrix of the five selected descriptors is shown in Table III. The linear correlation coefficient between any two descriptors is less than 0.85, indicating that the descriptors are largely independent.
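The pairwise-correlation check used to verify descriptor independence can be sketched as follows; the descriptor matrix and the descriptor names here are hypothetical:

```python
import numpy as np

def check_independence(X, names, threshold=0.85):
    """Flag descriptor pairs whose absolute linear correlation exceeds the
    threshold; below it, descriptors are treated as largely independent."""
    corr = np.corrcoef(X, rowvar=False)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j], float(corr[i, j])))
    return flagged

# Hypothetical descriptor matrix: 50 compounds x 3 descriptors,
# with the third descriptor made nearly collinear with the first
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[:, 2] = 0.99 * X[:, 0] + rng.normal(scale=0.01, size=50)
print(check_independence(X, ["logP", "chi2", "PCmax(C)"]))
```

A pair that exceeds the threshold would prompt dropping one of the two descriptors before fitting the discriminant function.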

Table II Five Descriptors and the Coefficients for LDAa
Table III Correlation Matrix of the Five Selected Descriptors

LDA Method

Discriminant analysis is useful for situations where a user wants to build a predictive model of group membership based on the observed characteristics of each case. The procedure generates a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases with measurements for the predictor variables but unknown group membership.

The basic theory of linear discriminant analysis (LDA) is to classify the dependent term by dividing an n-dimensional descriptor space into two regions separated by a hyperplane defined by a linear discriminant function (32,33) as follows:

$$Y = b_{0} + b_{1} X_{1} + b_{2} X_{2} + \cdots + b_{n} X_{n}$$

where Y is the discriminant score (the dependent variable); \(X_{1}, \ldots, X_{n}\) represent the specific descriptors; and \(b_{0}, \ldots, b_{n}\) are the weights associated with the respective descriptors. The linear classification was performed in a stepwise manner: at each step, the variable adding the most to the separation of the groups is entered into (or the variable adding the least is removed from) the discriminant function. The selection criteria were comparison with the tabulated F value, the percentage of molecules correctly classified, and the predicted classification of molecules not included in the training process. Once the optimal discrimination conditions were obtained, the final LDA equation was applied to new compounds, which were classified by evaluating the discriminant function.
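This workflow can be sketched with scikit-learn's LinearDiscriminantAnalysis, using synthetic descriptors in place of the five selected ones:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)

# Synthetic training data standing in for the 96 compounds x 5 descriptors
X_train = rng.normal(size=(96, 5))
y_train = (X_train @ np.array([1.0, 0.8, -0.5, 0.3, 0.2]) > 0).astype(int)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# The fitted discriminant function Y = b0 + b1*X1 + ... + b5*X5
print("intercept b0:", lda.intercept_)
print("weights b1..b5:", lda.coef_)

# Classify a new, unseen compound from its descriptor vector
x_new = rng.normal(size=(1, 5))
print("predicted class:", lda.predict(x_new))
```

Note that scikit-learn fits the full discriminant in one step; the stepwise entry/removal of variables by F value described above would be an additional selection loop around this fit.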

SVM Method

The following is a brief description of the SVM algorithm. A more detailed description can be found in Vapnik's (12) and Burges's articles (13).

For a binary classification problem, assume that we have a set of samples, i.e., a series of input vectors \(\bar{x}_{i} \in R^{d}\) (i = 1, 2, ..., N), with corresponding labels \(y_{i} \in \{-1, +1\}\), where +1 and −1 indicate the two classes. The goal is to construct a binary classifier, or to derive a decision function from the available samples, that has a small probability of misclassifying a future sample. Both the basic linearly separable case and the more practically relevant linearly nonseparable case (which covers most real-life problems) are considered here.

For the linearly separable case, there exists a separating hyperplane \(\bar{w} \cdot \bar{x} + b = 0\), which implies the following:

$$y_{i} \left( \bar{w} \cdot \bar{x}_{i} + b \right) \geqslant 1, \quad i = 1, 2, \ldots, N$$

where \(y_{i}\) is the class label, \(\bar{w}\) is a vector normal to the hyperplane, \(|b| / \|\bar{w}\|\) is the perpendicular distance from the hyperplane to the origin, and \(\|\bar{w}\|\) is the Euclidean norm of \(\bar{w}\).

By minimizing \(\frac{1}{2}\|\bar{w}\|^{2}\) subject to this constraint, the SVM approach finds a unique separating hyperplane that maximizes the distance between the hyperplane and the nearest data points of each class. The resulting classifier is called the largest margin classifier.

By introducing Lagrange multipliers \(a_{i}\), the SVM training procedure amounts to solving a convex quadratic programming (QP) problem. The solution is unique and globally optimal, and can be written as:

$$\bar{w} = \sum\limits_{i = 1}^{N} y_{i} a_{i} \bar{x}_{i}$$

The training vectors \(\bar{x}_{i}\) for which \(a_{i} > 0\) are called support vectors. Once the SVM is trained, the decision function can be written as:

$$f\left( \bar{x} \right) = \operatorname{sign} \left( \sum\limits_{i = 1}^{N} y_{i} a_{i} \left( \bar{x} \cdot \bar{x}_{i} \right) + b \right)$$

where sign(·) denotes the sign function. For the linearly nonseparable case, training errors are allowed by introducing positive slack variables \(\xi_{i}\) (i = 1, 2, ..., N) into the constraints, which then become:

$$y_{i} \left( \bar{w} \cdot \bar{x}_{i} + b \right) \geqslant 1 - \xi_{i}, \quad \xi_{i} \geqslant 0,\ i = 1, 2, \ldots, N$$

We want to simultaneously maximize the margin and minimize the number of misclassifications. This can be achieved by changing the objective function from \(\frac{1}{2}\|\bar{w}\|^{2}\) to \(\frac{1}{2}\|\bar{w}\|^{2} + C\sum_{i=1}^{N} \xi_{i}^{k}\), giving the problem:

$$\text{minimize}\quad \frac{1}{2}\|\bar{w}\|^{2} + C\sum\limits_{i = 1}^{N} \xi_{i}^{k} \qquad \text{subject to}\quad y_{i}\left( \bar{w} \cdot \bar{x}_{i} + b \right) - 1 + \xi_{i} \geqslant 0, \quad \xi_{i} \geqslant 0, \quad i = 1, 2, \ldots, N$$

Error weight C is a regularization parameter to be chosen by the user. It controls the size of penalties assigned to errors. The optimization problem is convex for any positive integer k. For k = 1 and k = 2, it is also a quadratic programming problem.

For a binary nonlinear classification problem, SVM performs a nonlinear mapping Φ(·) of the input vector \(\bar{x}_{i}\) from the input space \(R^{d}\) into a higher-dimensional Hilbert space H, and constructs an optimal separating hyperplane there. In the linearly separable case, the algorithm depends only on inner products between training and test examples, which allows generalization to the nonlinear case: the inner products are substituted by a kernel function \(k\left( \bar{x}_{i}, \bar{x}_{j} \right) = \Phi\left( \bar{x}_{i} \right) \cdot \Phi\left( \bar{x}_{j} \right)\) evaluated in the input space. The decision function implemented by the SVM can then be written as:

$$f\left( \bar{x} \right) = \operatorname{sign} \left( \sum\limits_{i = 1}^{N} y_{i} a_{i} k\left( \bar{x}, \bar{x}_{i} \right) + b \right)$$

Two typical kernel functions are listed below:

$$\begin{array}{ll} \text{Polynomial function} & k\left( \bar{x}_{i}, \bar{x}_{j} \right) = \left( \bar{x}_{i} \cdot \bar{x}_{j} + 1 \right)^{d} \\ \text{Gaussian radial basis function} & k\left( \bar{x}_{i}, \bar{x}_{j} \right) = \exp\left( -\gamma \left\| \bar{x}_{i} - \bar{x}_{j} \right\|^{2} \right) \end{array}$$
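These two kernels are straightforward to compute directly; a small sketch (the default γ below uses the value selected later in this study, and the inputs are illustrative):

```python
import numpy as np

def polynomial_kernel(xi, xj, d=3):
    """k(xi, xj) = (xi . xj + 1)^d"""
    return (np.dot(xi, xj) + 1.0) ** d

def rbf_kernel(xi, xj, gamma=0.215):
    """k(xi, xj) = exp(-gamma * ||xi - xj||^2)"""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

xi = np.array([1.0, 0.0])
xj = np.array([0.0, 1.0])
print(polynomial_kernel(xi, xj))  # (0 + 1)^3 = 1.0
print(rbf_kernel(xi, xi))         # identical inputs -> 1.0
```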

Results

Results of LDA

Results of the LDA model using bootstrapping validation are listed in Table I, with misclassified samples marked by double asterisks. The model gave an accuracy of 78.85% for Class 1, 77.03% for Class 2, 79.17% for the training set, 73.33% for the test set, and 77.78% for the whole data set (see Fig. 1). In addition, the predictions for the nine external drugs are also listed in Table I.

Fig. 1

Comparison of the results obtained from the LDA and SVM method.

Results of SVM

SVM was used to develop a nonlinear model. As with other multivariate statistical models, the performance of SVM for classification depends on the combination of several parameters.

First, the kernel function must be chosen, because it determines the sample distribution in the mapping space. There are a number of kernel functions, including linear, polynomial, spline, and radial basis functions. For classification tasks, a commonly used choice is the Gaussian radial basis function because of its good general performance (34). Next is the capacity parameter C, a regularization parameter controlling the tradeoff between maximizing the margin and minimizing the training error. If C is too small, insufficient stress is placed on fitting the training data; if C is too large, the algorithm overfits the training data. To make the learning process stable, a large value should be set for C (35); in this study, the initial value was 100. The third parameter is γ, which greatly affects the number of support vectors (SVs) and hence the training time: too many support vectors can produce overfitting and lengthen training. Parameter γ also controls the amplitude of the Gaussian function and, therefore, the generalization ability of the SVM. In our experience, these two parameters interact strongly, so a grid search (GS), which has been used either formally or informally for SVM parameter selection, was performed (36). Parameter γ was varied from 0.005 to 0.9 in increments of 0.005, and C from 10 to 1000 in increments of 10. The values of γ and C, the number of support vectors, and the accuracy on the training set are shown in Table IV. Three combinations of γ and C led to the same training-set accuracy; the lowest number of support vectors prompted the selection of γ = 0.215 and C = 50 as the optimal values. The predictions of the resulting optimal model for the training set, test set, and external set are all listed in Table I.
It gave an accuracy of 92.30% for Class 1, 89.19% for Class 2, 90.63% for the training set, 90.00% for the test set, and 90.48% for the whole data set (see Fig. 1).
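The grid search described above can be reproduced in outline with scikit-learn's GridSearchCV; the data are synthetic and the grid is deliberately much coarser than the study's (γ from 0.005 to 0.9 in steps of 0.005, C from 10 to 1000 in steps of 10) to keep the sketch fast:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic stand-in data with a mildly nonlinear class boundary
X = rng.normal(size=(96, 5))
y = (np.sin(X[:, 0]) + X[:, 1] > 0).astype(int)

# Coarse illustrative grid over the RBF kernel width and the error weight C
param_grid = {"gamma": np.arange(0.05, 0.95, 0.1),
              "C": [10, 50, 100, 500, 1000]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

The study broke ties between equally accurate (γ, C) combinations by preferring the model with the fewest support vectors; with a fitted `SVC`, that count is available as `len(svc.support_)`.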

Table IV Optimal γ, C , Number of Support Vectors, and Accuracy of Training Set

Discussion

Figure 1 compares the predictive results obtained from the LDA and SVM models. SVM gave better results than LDA, which implies that, using the same descriptors, the SVM method is capable of recognizing nonlinear relationships between the descriptors and the M/P class, whereas LDA can capture only linear relationships. Thus, the SVM method can be used as an effective tool for distinguishing the potential risk of drugs to nursing infants.

In addition, it is worthwhile to gain some insight into the factors likely to be involved in the transfer of drug compounds by interpreting the descriptors used in the study. Log P has been widely used as a measure of hydrophobicity or lipophilicity; it is the ratio of a chemical's concentration in the n-octanol phase to its concentration in the aqueous phase of a two-phase system at equilibrium (37). Log P represents the hydrophobicity of molecules and reflects their ability to penetrate biomembranes and reach interacting sites. In most cases, log P terms are used to assess biological properties relevant to drug action, such as cellular uptake, metabolism, bioavailability, and toxicity. For the transfer process, lipophilicity is approximately correlated with passive transport across cell membranes and the ability of a compound to partition through a membrane. Drugs are expected to partition into milk in accordance with their lipid characteristics. High lipid solubility favors protein binding, reducing the amount of drug available for diffusion into milk; therefore, as log P increases, log M/P decreases. In summary, log P plays an important role, and little real progress can be made in understanding the drug transfer process without considering hydrophobicity. The Randic index (order 2) (\(^{2}\chi\)), a topological descriptor, encodes the size, shape, and degree of branching of the compound, and also relates to dispersion interactions among molecules (38). It is often used to predict whether a molecular cavity can be filled by a candidate molecule. With the application of molecular graphics, the docking or intercalation of molecules into cavities in macromolecular simulations became an important consideration for a drug's protein binding.
The maximum partial charge on a C atom [PCmax(C)] and the minimum coulombic interaction for a C–C bond [CImin(CC)] are electrostatic descriptors, which reflect the characteristics of the charge distribution in the molecule. The empirical partial charges are calculated using the approach proposed by Zefirov (39), which is based on the Sanderson electronegativity scale and represents molecular electronegativity as the geometric mean of the atomic electronegativities. The minimum e–e repulsion for a C–C bond [Ree(CC)] is a quantum chemical descriptor used to characterize conformational stability, chemical reactivity, and intermolecular interactions. It describes electron-repulsion-driven processes in the molecule and may be related to conformational (rotational, inversional) changes or atomic reactivity. Energies were calculated for the optimized conformation with the most stable geometry (the minimum-energy structure), using molecular or quantum mechanics to determine bond strengths, atomic hybridizations, partial charges, and orbitals from the positions of the atoms and the net charge.

From the discussion above, it can be seen that steric and electronic factors are likely the two major components of the drug transfer process, and all the descriptors involved in the model, which have explicit physical meaning, may account for the structural features responsible for the M/P ratio of drug compounds.

Conclusion

In this work, linear discriminant analysis (LDA) and the support vector machine (SVM) were used to classify the milk/plasma (M/P) concentration ratios of a set of 126 drug compounds using descriptors calculated from molecular structure alone. The LDA model provides some insight into which structural features best describe the process of drug transfer. The SVM method proved to be a highly effective classification tool because of its structural risk minimization principle, which minimizes an upper bound on the generalization error rather than the training error. This eventually leads to better generalization than neural networks, which implement the empirical risk minimization (ERM) principle and do not always converge to global solutions.

Although the classification of M/P ratios is only one component of the complex process by which a drug transfers into human milk, it probably forms the most selective filter. We expect such computational methods to be similarly useful in screening other drugs whose M/P ratios are not currently available.