Introduction

The human immunodeficiency virus (HIV) is a retrovirus from the Retroviridae family and is responsible for the acquired immunodeficiency syndrome (AIDS), which was first documented in 1981 [1]. The infection is controlled with antiretroviral drugs, which reduce mortality and morbidity and increase patient lifespans [2]. However, many patients present or develop resistance to some of the available drugs, which is a major factor limiting the effectiveness of HIV therapy.

Different statistical techniques and machine learning algorithms have been developed to predict HIV resistance. Such studies have used statistical modeling [3–5], neural networks [6–8], support vector machines [9] and decision trees [10].

Most of these studies used data provided by genotyping, a test that identifies genetic mutations associated with resistance to antiretroviral drugs. Although it is not considered the gold standard test, genotyping is faster and cheaper than phenotyping, which provides a direct quantitative measure of the susceptibility of HIV strains to drugs [11].

Several authors recognize that classifiers perform poorly on imbalanced data sets [12, 13]. A data set is imbalanced when some classes have many more instances than others. Classifiers developed on imbalanced data may show an apparently good overall performance while in fact overfitting the majority class [14].

In this study, we present a classifier for predicting resistance to lopinavir, an HIV protease inhibitor, from genotypic information. Because of the large number of resistance-associated mutations, we used logistic regression (LR) to select the features and a probabilistic neural network (PNN), trained on data balanced with respect to the resistance classes, to build the classifiers. In addition, the performance of the PNN models was compared with that of three well-known HIV-1 genotyping interpretation systems: HIVdb, Rega, and ANRS.

Methods

Genotype dataset

The data were made available by the Molecular Virology Laboratory, Health Science Center, Federal University of Rio de Janeiro (UFRJ/CCS/Brazil), a member of the Brazilian Network for HIV-1 Genotyping (RENAGENO), which is responsible for performing and analyzing genotyping tests for all HIV-infected patients within the public health system. The Brazilian HIV data are accessible to RENAGENO laboratory members, and general data are publicly available at www2.aids.gov.br/final/dados/dados_aids.asp. For this study, 625 amino acid sequences of the protease enzyme of the pol gene of HIV-1 subtype B from infected patients were analyzed.

Modeling

Outcome variable

The outcome was a binary variable indicating whether the patient was resistant to lopinavir. For patients who were susceptible or had intermediate resistance to this drug in the last regimen of therapy, the variable was coded as 0 (non-resistant), whereas patients who developed resistance to lopinavir were coded as 1 (resistant). Patient classification was obtained with the HIV Genotyping Test—Brazilian Interpretation Algorithm (version 05:12) [15], which uses a set of predefined rules to determine whether resistance to a particular drug is present.

Explanatory variables

The explanatory variables were obtained from a set of positions in the HIV-1 protease gene (PR) known to influence drug resistance. The initial positions included here were those in an updated list of mutations associated with resistance to antiretroviral drugs provided by the International Antiviral Society (IAS-USA) [16]. The PR positions, each preceded by the one-letter amino acid code of the reference sequence, are: L10, K20, L24, V32, L33, M46, I47, I50, F53, I54, L63, A71, G73, L76, V82, I84 and L90.

Training and test sets

The set of 625 available amino acid sequences was divided into a training set of 500 sequences and a test set of 125 sequences. In the training group, 400 patients had no resistance to lopinavir, whereas 100 were resistant. In the test group, 30 patients were resistant and 95 showed no resistance. The training set was used for feature selection and to obtain an optimal smoothing parameter for the PNN; the test set was used only to evaluate the performance of the final models.

Feature selection

The selection of input variables is an important step to enhance the classification ability of the models and to reduce training and testing time. Because no feature selection method designed for the PNN was found in the literature, we proposed using the bootstrap method and LR to select the amino acid mutation positions used as inputs to the PNN model. Bootstrapping is a resampling technique with replacement proposed by Efron [17]; each resample has the same size as the original data.

From the training set, 1000 bootstrap samples of equal size (n = 100) were drawn from the 100 resistant patients (those with therapeutic failure). Each bootstrap sample was combined with 100 patients randomly selected from the 400 non-resistant individuals, generating a balanced set used for model estimation.

One LR model was fitted to each bootstrap sample. The variables of each of the 1000 models were selected by the stepwise method using the Akaike information criterion (AIC) [18]. In this method, a sequence of regression models is obtained by adding or removing variables at each step; variables that do not improve the model are excluded, and the procedure is repeated until no further change improves it [19]. The AIC penalizes models with many variables, so lower AIC values are preferred. The final variables used as input to the PNN were those selected in at least 50 % of the LR models.
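
A minimal R sketch of this selection procedure is given below. It assumes a data frame `train` with a 0/1 outcome `resistant` and one indicator column per IAS position (names such as `L10`, `K20`); all object names are illustrative, not the authors' code.

```r
# Bootstrap + stepwise-AIC feature selection (sketch; names are assumptions).
library(MASS)  # provides stepAIC

set.seed(42)
res  <- train[train$resistant == 1, ]             # 100 resistant patients
nres <- train[train$resistant == 0, ]             # 400 non-resistant patients
positions <- setdiff(names(train), "resistant")
counts <- setNames(numeric(length(positions)), positions)

for (b in 1:1000) {
  boot_res  <- res[sample(nrow(res), replace = TRUE), ]   # bootstrap resample
  rand_nres <- nres[sample(nrow(nres), nrow(res)), ]      # random undersample
  balanced  <- rbind(boot_res, rand_nres)                 # balanced set (n = 200)
  full <- glm(resistant ~ ., data = balanced, family = binomial)
  fit  <- stepAIC(full, direction = "both", trace = FALSE)
  kept <- intersect(positions, all.vars(formula(fit))[-1])
  counts[kept] <- counts[kept] + 1
}
selected <- names(counts)[counts / 1000 >= 0.5]   # kept in >= 50 % of the models
```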

PNN modeling

A PNN is an artificial neural network (ANN) that has been used in a variety of classification problems [20–24]. This ANN, proposed by Specht [25], trains faster than the multilayer perceptron network, generates accurate predicted target probability scores, and is relatively insensitive to outliers.

A PNN classifies input vectors using a Bayesian decision rule. The optimal decision rule that minimizes the average cost of misclassification is called the Bayes optimal decision rule [26]. The architecture of a classic PNN is shown in Fig. 1.

Fig. 1

Basic architecture of a probabilistic neural network

The input layer has as many neurons as explanatory variables and simply distributes the input to the neurons in the pattern layer. The pattern layer contains one neuron for each case in the training data set, and its neurons are divided into groups, one per class. Each neuron receives the input vector and estimates its probability density function (PDF) using the Parzen window method [27]. In this study, the Gaussian function was used as the Parzen window; the ith kernel node in the jth group (class) is defined as a Gaussian basis function:

$$ p_{i,j}(x) = \frac{1}{\left(2\pi\sigma^2\right)^{d/2}} \exp\left(-\frac{\left\lVert x - x_{i,j}\right\rVert^2}{2\sigma^2}\right) $$
(1)

where $x_{i,j}$ is the ith training vector stored in a pattern unit of class $j$ (the center of the kernel), $d$ is the number of input variables and $\sigma$ is a smoothing (spread) parameter.

The summation layer sums the outputs of the pattern units associated with a given class. This layer has as many processing units as there are classes to be recognized. Each unit estimates a class-conditional PDF as a mixture of Gaussian kernels, according to Eq. 2:

$$ f_j(x) = \sum_{i=1}^{N_j} \alpha_{i,j}\, p_{i,j}(x), \quad 1 \le j \le n $$
(2)

where $N_j$ is the number of pattern units (training cases) of class $j$, $n$ is the number of classes, and the weights $\alpha_{i,j}$ satisfy:

$$ \sum_{i=1}^{N_j} \alpha_{i,j} = 1, \quad 1 \le j \le n $$
(3)

The output layer makes the decision based on the maximum of the class probabilities, following Bayes' rule. A competitive transfer function on the output neurons selects the node with the highest output, assigning a 1 (positive identification) to that class and a 0 (negative identification) to the non-targeted classes.
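
To make Eqs. (1)–(3) concrete, the sketch below implements the forward pass of a two-class PNN in R with equal kernel weights ($\alpha_{i,j} = 1/N_j$) and equal class priors, consistent with the balanced training sets used here; the code is illustrative, not the authors' implementation.

```r
# Minimal PNN sketch for Eqs. (1)-(3): equal weights alpha_{i,j} = 1/N_j and
# equal class priors. X: training matrix, y: class labels, x: new input
# vector, sigma: smoothing parameter.
pnn_classify <- function(X, y, x, sigma) {
  d <- ncol(X)
  classes <- unique(y)
  scores <- sapply(classes, function(cls) {
    Xc <- X[y == cls, , drop = FALSE]
    # Pattern layer: Gaussian kernel of Eq. (1) at each stored pattern
    k <- apply(Xc, 1, function(xi)
      exp(-sum((x - xi)^2) / (2 * sigma^2)) / (2 * pi * sigma^2)^(d / 2))
    # Summation layer: class-conditional PDF of Eq. (2) with alpha = 1/N_j
    mean(k)
  })
  classes[which.max(scores)]   # output layer: Bayes decision
}
```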

After the feature selection step, we used a combination of bootstrapping and cross-validation to choose the smoothing parameter of the PNN. Following a procedure similar to that described for feature selection, 100 balanced subsets of size 200 were obtained, and a PNN model was implemented for each subset. The smoothing parameter should be chosen to maximize the accuracy of the classifier; we therefore varied it from 0.1 to 1 in steps of 0.1 and applied 10-fold cross-validation to evaluate the model for each value.

The data were partitioned into 10 equal sub-samples. For each smoothing value, a PNN was trained on 90 % of the data and evaluated on the remaining sub-sample. The area under the receiver operating characteristic (ROC) curve (AUC) was the accuracy criterion. The ten AUCs computed over the folds were averaged to produce a single estimate for that value of the smoothing constant, and the value with the best average AUC was selected. This procedure was repeated for each of the 100 balanced subsets, and the final smoothing parameter was defined as the average of the 100 selected values.
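
The sketch below illustrates this grid search for one balanced subset `bal` (200 rows, input columns `selected`, 0/1 outcome `resistant`); the helper names are assumptions, and in the study the search is repeated over the 100 subsets and the selected values averaged.

```r
# Grid search for sigma with 10-fold CV and the AUC criterion (sketch).
pnn_score <- function(X, y, x, sigma) {   # resistant-class posterior, equal priors
  dens <- function(cls) {
    Xc <- X[y == cls, , drop = FALSE]
    mean(apply(Xc, 1, function(xi) exp(-sum((x - xi)^2) / (2 * sigma^2))))
  }
  dens(1) / (dens(1) + dens(0))           # the (2*pi*sigma^2)^(d/2) factor cancels
}
auc <- function(score, label) {           # Mann-Whitney estimate of the AUC
  r <- rank(score); n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

sigmas <- seq(0.1, 1, by = 0.1)
folds  <- sample(rep(1:10, length.out = nrow(bal)))
cv_auc <- sapply(sigmas, function(s) {
  mean(sapply(1:10, function(f) {
    tr <- bal[folds != f, ]; te <- bal[folds == f, ]
    s_hat <- apply(as.matrix(te[, selected]), 1, function(x)
      pnn_score(as.matrix(tr[, selected]), tr$resistant, x, s))
    auc(s_hat, te$resistant)
  }))
})
best_sigma <- sigmas[which.max(cv_auc)]   # one of the 100 values later averaged
```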

The variables selected by LR and the estimated smoothing parameter were then used to develop four PNN models over four balanced training sets, which were later used in the validation step. The four balanced data sets were obtained by dividing the 400 non-resistant samples into four sub-samples of size 100 and combining each with the 100 resistant samples available in the training set.
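
For example (reusing `res` and `nres` from the earlier sketch), the four balanced training sets can be built as follows:

```r
# Split the 400 non-resistant patients into four disjoint groups of 100 and
# pair each group with the same 100 resistant patients (sketch).
grp <- sample(rep(1:4, each = 100))                       # random 4-way split
balanced_sets <- lapply(1:4, function(g) rbind(res, nres[grp == g, ]))
```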

Evaluation of the PNN models

The performance of the four final models was evaluated using ROC curve analysis, AUC, accuracy, sensitivity, and specificity. The models were applied to the test set with 125 samples, which was not used at any other stage of the analysis.

An ROC curve characterizes the performance of a binary classification model across all possible cut-offs and depicts the tradeoff between sensitivity and the false-positive rate. The AUC represents the expected performance as a single scalar.

Accuracy (Acc) is defined as the proportion of correct classifications made by the model over the total sample. This measure is given by the following formula:

$$ \mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN} $$
(4)

where TP, FP, TN, and FN are true positives, false positives, true negatives and false negatives, respectively.

Sensitivity (Se) measures the proportion of true positives among all actual positives, and specificity (Sp) is the proportion of true negatives among all actual negatives.

$$ \mathrm{Se} = \frac{TP}{TP + FN} $$
(5)
$$ \mathrm{Sp} = \frac{TN}{TN + FP} $$
(6)
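
A small R helper computing Eqs. (4)–(6) from 0/1 predictions and true labels (illustrative):

```r
# Confusion-matrix metrics of Eqs. (4)-(6).
perf <- function(pred, truth) {
  TP <- sum(pred == 1 & truth == 1); TN <- sum(pred == 0 & truth == 0)
  FP <- sum(pred == 1 & truth == 0); FN <- sum(pred == 0 & truth == 1)
  c(Acc = (TP + TN) / (TP + FP + TN + FN),
    Se  = TP / (TP + FN),
    Sp  = TN / (TN + FP))
}
```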

Classification algorithms for comparison

The classifiers were compared with three rule-based genotypic interpretation systems: HIVdb (version 7.0) [28], Rega (version 9.1.0) [29] and ANRS (Agence Nationale de Recherches sur le SIDA) (version 2013.09) [30].

In addition to the PNN, the k-nearest neighbors (k-NN) algorithm, a non-probabilistic classifier, was implemented to provide a comparison of diagnostic performance. The k-NN algorithm classifies each test case by a majority vote of its neighbors, assigning the case to the class most common among its k nearest neighbors as measured by Euclidean distance. The data set was the same as that used for the PNN, and the input variables were those selected by LR.
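
A minimal version using the standard R `class` package is shown below; the paper does not report the value of k or the implementation used, so both are assumptions here.

```r
# k-NN baseline on the LR-selected inputs (sketch; k is an illustrative choice).
library(class)
pred_knn <- knn(train = train[, selected],
                test  = test[, selected],
                cl    = factor(train$resistant),
                k     = 3)
perf(as.integer(as.character(pred_knn)), test$resistant)  # helper defined above
```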

Software

The PNN classifiers were implemented in the MATLAB® software package (version R2009b with the Neural Network Toolbox) [31]. Statistical analysis and LR were performed with the open-source R software, version 3.0.1 [32].

Results

We initially selected the 17 positions of lopinavir resistance provided by the IAS-USA as input to the variable selection approach using bootstrap and stepwise LR. The percentage of times each input variable was selected over the 1000 bootstrap samples is shown in Fig. 2. The final selected features, those chosen in at least 50 % of the models, were the following ten mutation positions: A71, I54, I84, K20, L10, L24, L33, L90, M46, and V82. The PNN smoothing parameter was set to 0.63, the average of the 100 smoothing parameters associated with the best AUC of each subset, as described previously.

Fig. 2

Frequency of variables selected in the 1000 bootstrap samples in logistic regression

The four PNN classifiers, each developed from the same 100 resistant samples combined with a random sample of the same size from the 400 non-resistant patients, were evaluated on the test set. Because the test set emulates a real situation, in which resistant and non-resistant patients arrive at random, it was left imbalanced: 30 resistant and 95 non-resistant patients. Table 1 shows the performance of the PNN classifiers: the mean AUC was 0.96, the accuracy 0.91, and the sensitivity and specificity 0.98 and 0.89, respectively. The ROC curves for the four classifiers are shown in Fig. 3.

Table 1 Test set evaluation results with 95 % confidence interval (CI) of PNN classifiers
Fig. 3

ROC curves for the PNN classifiers. The black points mark the 0.5 threshold used to predict the class

The k-NN algorithm yielded classifiers with a mean AUC of 0.93, an accuracy of 0.91, and a sensitivity and specificity of 0.98 and 0.88, respectively. Table 2 summarizes the performance of the k-NN classifiers.

Table 2 Test set evaluation results with 95 % confidence interval (CI) of k-NN classifiers

The HIVdb, Rega and ANRS algorithms classify the data at three levels of resistance: susceptible, intermediate resistance and high-level resistance. To compare these algorithms with our models, their outputs were dichotomized (see the snippet below): susceptible and intermediate resistance were coded as non-resistant, and high-level resistance as resistant. Table 3 summarizes the performance of the three algorithms.
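
The dichotomization is a one-line recode in R (the label strings are illustrative):

```r
# Susceptible or intermediate -> 0 (non-resistant); high-level -> 1 (resistant).
calls  <- c("susceptible", "intermediate", "high")
binary <- as.integer(calls == "high")
```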

Table 3 Test set evaluation results with 95 % confidence interval (CI) of HIVdb, Rega and ANRS algorithms

Discussion

In the present study, we combined bootstrap resampling with stepwise LR variable selection, followed by prediction of resistance to the antiretroviral lopinavir with a PNN. Only variables that appeared in at least 50 % of the logistic regression models were used in the final models. If the cutoff were increased to 60 %, only position 33 (selected in 50.1 % of the models) would be dropped from the PNN models, and an increase to 70 % would also exclude position 71 (64.2 %). With a cutoff of 70 %, the classifiers had a lower overall performance, with a mean AUC of 0.65, an accuracy of 0.46, and a sensitivity and specificity of 0.99 and 0.27, respectively. On the test data, the PNN classifiers showed predictive performance comparable or superior to three well-known interpretation systems.

In this study, feature selection and model development used balanced data sets. Most classification procedures assume balanced training data in their learning stage [14]. When these methods are trained on highly imbalanced data sets, they tend to overpredict the majority class [33]. For example, if the data contain a large number of negative cases, the classifier may show a much higher specificity than sensitivity, resulting in an overestimated accuracy.

The available data set had fewer instances of the resistant class than of the susceptible (non-resistant) class. We addressed this problem by random undersampling of the majority class. Accuracy alone is not a good measure of classifier performance because it is strongly biased in favor of the majority class and treats different classification errors as equally important. It is therefore preferable to use performance measures that separate the errors occurring in each class. In addition to global metrics such as the AUC or Acc, parameters such as sensitivity and specificity should be considered when evaluating classifiers; their absence may impair a proper evaluation of the model and lead to misinterpretation of the results.

Several studies report only accuracy, which limits the interpretation of their results. In a recent study, Pasomsub et al. [8], using a feed-forward artificial neural network, reported a classifier with an AUC of 0.92 (95 % CI: 0.88–0.95) for lopinavir. However, they did not report other indices, such as sensitivity and specificity, and there is no indication of whether their data set was balanced.

In a study by Rhee et al. [34], a feed-forward network was used to develop models with both the complete set of 70 positions in the HIV protease and the set of positions selected from the IAS list. For lopinavir, the accuracy was 0.76 for the full set of positions and 0.73 for the IAS list, lower than the accuracies found in our study.

The four classifiers showed very similar performance, with accuracies ranging from 0.89 to 0.94 and an average AUC of 0.96. With the variables selected by the approach proposed in this study, k-nearest neighbors produced results similar to the PNN models, indicating that this feature selection method can be applied to both probabilistic and non-probabilistic algorithms. In all cases the classifiers were comparable or superior, on at least some metrics, to the HIVdb, Rega, and ANRS algorithms, three well-known rule-based genotypic interpretation systems used by many clinicians to predict resistance to specific antiretrovirals. Compared with these algorithms, our approach requires fewer features: 10 positions as input to the PNN model for classifying lopinavir resistance, in contrast to the 17 positions proposed by the IAS. Additionally, the feature selection can be revised and the PNNs retrained without difficulty when new data become available or new resistance positions are reported.

The limitations of this approach for predicting HIV resistance deserve consideration. First, the approach can only predict resistance to drugs included in the training set, which in our case was lopinavir. The method can be trained with available data for other drugs, but we did not have enough samples here to properly develop such models. Second, the choice of features and of the PNN smoothing parameter requires some computational effort; once this stage is accomplished, however, prediction is very fast.

With a sensitivity and specificity of 0.98 and 0.89, respectively, the PNN classifier developed here may serve as a useful tool to support decision making in the prediction of resistance, assisting physicians in the treatment of HIV-positive patients. Further applications of this approach to other antiretroviral drugs in therapeutic practice are needed to better evaluate the impact and usefulness of the proposed PNN model.