1 Introduction

In machine learning, classification is normally known as a supervised learning mechanism in which the initial input data is analyzed and used to teach the system how to classify future data or observations (Brownlee 2016). Therefore, the classification process is considered an important task in data mining. In simple terms, classification identifies a set of data objects that resemble each other in the same dataset and are different from the objects in other datasets. In order to tackle a classification problem, a training set (input dataset) is required that contains example records with a number of attributes. The classification algorithm uses the training dataset to construct a model, and that model can then be used to allocate unclassified records to one of the defined classes (Alweshah 2018; Alweshah and Abdullah 2015). Examples of classification problems are speech recognition (Juang et al. 1997), handwriting recognition (LeCun et al. 1990), biometric identification (Yampolskiy and Govindaraju 2008), document classification (Alweshah et al. 2017b; Yan et al. 2018), image and video classification (Yan et al. 2019a, b), and many others, as reported in the latest comprehensive survey (Alweshah et al. 2017a).

Many techniques have been used to solve classification problems. Examples include the support vector machine (SVM) (Adankon and Cheriet 2009), the neural network (NN) (Specht 1991), the radial basis function (RBF) network (Craddock and Warwick 1996), naïve Bayes (NB) (Friedman et al. 1997) and many others (Alshareef et al. 2015b; Faris et al. 2016; Aljarah et al. 2018; Khan et al. 2019; Anagaw and Chang 2019; Boveiri et al. 2019). The NN is a common technique for solving classification problems, and there are different types of NN, including the feed-forward neural network and the multilayer perceptron (MLP), as well as the radial basis function (RBF) network, modular networks and the probabilistic neural network (PNN) (Specht 1990).

The NN is exceptionally effective and has been utilized to solve many classification problems. It was created in the late 1950s by Rosenblatt, as reported in Alweshah (2014). The NN consists of two fundamental components: the neuron and the link. The neuron is a type of processing element, while the link connects two neurons together and carries an individual weight. Each neuron receives stimulating signals from its neighboring neurons, processes these signals and produces an output result (Alshareef et al. 2015a).

The PNN is a powerful data mining tool that can model a large number of complex input/output relationships. The PNN is a special form of NN model that applies a statistical Bayesian decision rule. The general structure of a PNN consists of four layers: (1) an input layer, whose dimension is determined by the dimension of the input vector; (2) a pattern layer (second layer), whose dimension is equal to the number of examples in the training set; (3) a summation layer (third layer), which contains one neuron per class in the set of examples; and (4) a decision layer (fourth layer), which classifies the test example into one of the predefined classes.

The success of the PNN depends on the proper selection of the weights during the classification process (Iliou and Anagnostopoulos 2010). Selecting the optimal weight vector for the PNN has recently been modelled as an optimization problem (Iliou and Anagnostopoulos 2010). Several optimization techniques, such as metaheuristics, have recently been utilized to tackle this problem. Metaheuristic-based optimization approaches are categorized into either single-solution-based or population-based searches (Blum and Roli 2003).

Population-based approaches focus on maintaining and improving multiple candidate solutions. They normally manipulate several solutions at a time using recombination, mutation and selection operators. Although they are very efficient at handling several search space regions at the same time, they may fail to find the local optimal solution in each of the regions they handle. Two main families of population-based metaheuristics are widely known: evolutionary algorithms and swarm intelligence. Evolutionary algorithms include genetic algorithms (GAs) (Hu et al. 2016) and the harmony search algorithm (Geem et al. 2001). Examples of swarm intelligence include ant colony optimization (ACO) (Dorigo and Stützle 2010), PSO, social cognitive optimization, the penguin search optimization algorithm (PeSOA), the artificial bee colony (ABC) (Zhu and Kwong 2010) and the Grey Wolf Optimiser (GWO) algorithm (Al Nsour et al. 2019).

Single-solution approaches are initiated with a single provisional solution. Iteratively, that solution is improved by moving to a neighbouring solution until a local optimum is reached. Examples of single-solution-based methods used for NNs include simulated annealing (SA) (Ren and Qu 2014), iterated local search (ILS) (El-Bouri 2012), variable neighborhood search (VNS) (Xie et al. 2012), guided local search (GLS) (Tairan and Zhang 2010) and \(\beta\)-hill climbing (Al-Betar 2017).

\(\beta\)-Hill climbing is a recent local-search-based algorithm (Al-Betar 2017). It is able to escape the local optima trap using the \(\beta\)-operator. It has several attractive features: it has few parameters to set before the search, it is a very simple optimization framework, and it can be flexibly adapted to many kinds of optimization problems, such as economic load dispatch (Al-Betar et al. 2018), denoising electrocardiogram (ECG) signals (Alyasseri et al. 2018, 2017b, a), multiple-reservoir scheduling (Alsukni et al. 2017), the Sudoku game (Al-Betar et al. 2017), substitution boxes (Alzaidi et al. 2018), gene selection (Alomari et al. 2018) and feature selection (Abualigah et al. 2017). Furthermore, the theoretical concepts of \(\beta\)-hill climbing related to adaptive parameter settings have also been improved (Al-Betar et al. 2019). In addition, \(\beta\)-hill climbing has been hybridized with population-based algorithms to improve their exploitation capabilities (Abed-alguni and Alkhateeb 2018).

In this paper, the \(\beta\)-hill-climbing (\(\beta\)-HC) algorithm is adapted to seek the optimal weights for the PNN so as to improve its performance (accuracy). The weighting of the PNN is modelled as an optimization problem to be solved by \(\beta\)-HC. In order to evaluate the proposed method, 11 benchmark classification problems are used. A comparative evaluation of the original PNN (without \(\beta\)-HC), the PNN with hill climbing, and the PNN with \(\beta\)-HC shows that the PNN with \(\beta\)-HC significantly outperforms the other two. It can find the optimal weights for the PNN and strike the right balance between exploration and exploitation during the search.

The rest of this paper is organized as follows: In Sect. 2, the background of the PNN algorithm and \(\beta\)-HC is described. Then, in Sect. 3, the proposed \(\beta\)-hill-climbing-based PNN is discussed in detail. In Sect. 4, the experimental results are presented and analyzed. Finally, Sect. 5 concludes the paper and provides some possible directions for future improvements.

2 Research background

2.1 Probabilistic neural network (PNN)

The PNN, introduced by Specht (1990), has been shown to deliver effective classification performance across a wide variety of problems (Berrar et al. 2002; Wasserman 1993; Specht 1990; Sweeney et al. 1994; Mao et al. 2000; Kwigizile et al. 2004; Manimala and Selvi 2007; Araghi et al. 2009). Training a PNN does not require heuristic searches, apart from obtaining the best smoothing factor, which is itself an optimization problem. A PNN uses a statistical algorithm known as kernel discriminant analysis (Mika et al. 1999), in which the operations are organized into a four-layered feed-forward network consisting of an input layer, hidden layer, pattern/summation layer and decision/output layer, as shown in Fig. 1.

Fig. 1 Architecture of a typical probabilistic neural network

Figure 1 illustrates a typical PNN model. Each input neuron represents a distinct feature in the training/test data sets, so the number of inputs is equal to the number of features in the data set. A PNN trains considerably faster than a back-propagation (BP) network. Its structure ensures convergence to an optimal classifier as the size of the training set increases, and training samples can be added or removed without the need for substantial retraining (Wasserman 1993). The four layers of the PNN network are described below:

Input layer:

This contains one neuron for each indicator variable. For categorical factors, N − 1 neurons are used, where N is the number of categories. The input neurons standardize the value range by subtracting the median and dividing by the interquartile range. The input neurons then feed these standardized values to each neuron in the hidden layer.

Hidden layer:

This contains a single neuron for each case in the training dataset. The neurons store the indicator variable values together with the target values for their cases. When the vector x of input values is presented by the input layer, each hidden neuron computes the Euclidean distance between x and the neuron's centre point and then, using the sigma value, applies the RBF kernel function. The resultant value is passed to the neurons in the pattern layer.

Pattern/summation layer:

For each class of target variables, a single pattern neuron is available. The target class of each training sample is stored alongside each hidden neuron, and the weight value emerging from a hidden neuron is passed to the pattern neuron that corresponds to that hidden neuron's class. Each pattern neuron then adds up the values for the class it represents (hence, these values act as weighted votes for that particular category). In practice, this layer performs a dot product (Z) of the input vector \({{\varvec{X}}}=(x_1,x_2,\ldots ,x_n)\) and the weight vector \({{\varvec{W}}}=(w_1,w_2,\ldots ,w_n)\), such that \(Z=X\cdot W\). Thereafter, the non-linear activation function is applied, as shown in Eq. (1)

$$\begin{aligned} f(X)= \exp \left( -\frac{(w_i-x)^t (w_i-x)}{2\sigma ^2}\right) \end{aligned}$$
(1)

where \(\sigma\) is the smoothing parameter that should be set in the initial training process.

Decision/output layer:

This final layer compares the weighted votes gathered in the pattern layer for each target category and then uses the highest vote to predict the target group.

For the mathematical formulation, the PNN is interpreted as a multilayer feed-forward neural network (FFNN) that implements the Bayes decision rule shown in Eq. (2).

$$\begin{aligned} h_i\times c_i \times f_i > h_j\times c_j \times f_j \quad \forall i \ne j \end{aligned}$$
(2)

where \(h_i\) is the prior probability that the new object belongs to class i, \(c_i\) is the cost of misclassifying an object that belongs to class i, and \(f_i\) is the probability density function (PDF) of class i. Here, \({{\varvec{X}}}\) is the input vector that has to be classified. In general, the prior probabilities and the misclassification costs are treated as equal when dealing with PNNs; therefore the Bayes optimal decision rule can be reduced to Eq. (3).

$$\begin{aligned} g_i( {{\varvec{X}}})> g_j( {{\varvec{X}}}) \quad \forall i \ne j. \end{aligned}$$
(3)

This rule shows that the optimal classification decision can be made simply by performing straightforward evaluations once the PDFs of the various classes have been identified. In particular, the PDF for a single class can be estimated based on Eq. (4):

$$\begin{aligned} g_i( {{\varvec{X}}})=\frac{1}{n_i\times \sigma } \sum _{k=1}^{n_i} W\left( \frac{X-x_{ik}}{\sigma }\right) \end{aligned}$$
(4)

where \(\sigma\) is a smoothing parameter, W is the weighting function, X is the unidentified input sample that has to be classified, \(x_{ik}\) is the kth training input from the ith class, \(n_i\) is the number of training inputs for class i, and \(g_i( {{\varvec{X}}})\) is the PDF approximation for class i.

Generally, the Euclidean distance is used to determine the closeness between a training sample and the unidentified input. The Gaussian function is the typically used weighting function because it exhibits the desired attributes and can be computed quickly. Supervised learning provides an input pattern and changes the network parameters (weights) to reduce the distance between the calculated output and the expected output. Setting the weighting function to the Gaussian function gives Eq. (5):

$$\begin{aligned} g_i( {{\varvec{X}}})=\frac{1}{n_i\times \sigma } \sum _{k=1}^{n_i} \exp \left( -\frac{(X-x_{ik})^2}{\sigma ^2}\right) . \end{aligned}$$
(5)
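To make the classifier described by Eqs. (3)–(5) concrete, the following Python sketch estimates a per-class density with a Gaussian kernel (pattern and summation layers) and then picks the class with the largest estimate (decision layer). It is a minimal illustration using the \(2\sigma^2\) scaling of Eq. (1), not the exact implementation used in this paper; the function name `pnn_classify`, the parameter values and the toy data are our own assumptions.

```python
import numpy as np

def pnn_classify(X_train, y_train, x, sigma=0.5):
    """Minimal PNN decision rule: Gaussian kernel density per class
    (pattern + summation layers), then the class with the largest
    estimate wins (decision layer, Eq. (3))."""
    scores = {}
    for c in np.unique(y_train):
        members = X_train[y_train == c]               # pattern neurons of class c
        sq_dist = np.sum((members - x) ** 2, axis=1)  # squared Euclidean distances
        kernel = np.exp(-sq_dist / (2.0 * sigma ** 2))
        scores[c] = kernel.sum() / len(members)       # summation layer, cf. Eq. (5)
    return max(scores, key=scores.get)                # decision layer, Eq. (3)

# Toy usage with two Gaussian blobs
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(30, 2))
X1 = rng.normal(3.0, 1.0, size=(30, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 30 + [1] * 30)
print(pnn_classify(X_train, y_train, np.array([2.8, 3.1])))  # expected output: 1
```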

2.2 \(\beta\)-Hill climbing optimizer

\(\beta\)-Hill climbing is a recent local search-based algorithm introduced in Al-Betar et al. (2017). It is initialized with a random solution \({{\varvec{x}}}=(x_1,x_2, \ldots , x_n)\). Iteratively, that solution undergoes changes using three main operators: (1) the \(\mathcal {N}\)-operator, (2) the \(\beta\)-operator, and (3) the \(\mathcal {S}\)-operator. \(\beta\)-Hill climbing normally stops when the maximum number of iterations is reached, and the feasibility of the solution is preserved during the search. Note that the hill climbing (HC) algorithm is the same as \(\beta\)-hill climbing but without the \(\beta\)-operator. Algorithm 1 presents the pseudo-code of the basic version of \(\beta\)-hill climbing. The three operators are described as follows (Fig. 2):

Fig. 2 \(\beta\)-Hill climbing optimizer

\(\mathcal {N}\)-operator:

This operator moves the current solution \({{\varvec{x}}}\) to a neighboring solution \({{\varvec{x}}}'\) using a movement strategy with a probability of \(\mathcal {N}\), in which feasibility is maintained. Line 5 in Algorithm 1 is the pseudo-code for the \(\mathcal {N}\)-operator. For clarification, let \(x_i\) be assigned the value \(\nu _i(k)\) at the \(k^{th}\) location; the value of \(x'_i\) will then be assigned as follows:

$$\begin{aligned} x'_i= \nu _{i, k} \pm m \end{aligned}$$
(6)

where \(\nu _{i, k} \pm m\) is a neighboring value of \(\nu _i(k)\). It is worth mentioning that the \(\mathcal {N}\)-operator is invoked once at each iteration using a random walk mechanism. This means that the effect of moving to the neighbouring solution is not checked until the \(\mathcal {S}\)-operator is invoked.

\(\beta\)-operator:

The \(\beta\)-operator is responsible for the exploration process by applying a uniform mutation to the current solution. \(\forall i \in (1,2,\ldots ,n)\), the decision variable \(x_i\) is randomly selected to be changed using the \(\beta\) parameter, as shown below. Lines 6 to 10 in Algorithm 1 are the pseudo-code for the \(\beta\)-operator.

$$\begin{aligned} x'_{i}\leftarrow {\left\{ \begin{array}{ll} \nu _{r} &{} \qquad U[0,1] \le \beta \\ x'_i &{} \qquad otherwise. \end{array}\right. } \end{aligned}$$

It is worth mentioning that \(\nu _{r}\) is a random value drawn from the feasible range of the decision variable \(x_i\), and that \(\beta\) determines how often the uniform mutation is applied.

\(\mathcal {S}\)-operator:

The improved solution \({{\varvec{x}}}'\) is evaluated using \(f({{\varvec{x}}}')\). The current solution \({{\varvec{x}}}\) is replaced by the improved solution \({{\varvec{x}}}'\) if the latter is better (i.e., \(f({{\varvec{x}}}')\le f({{\varvec{x}}})\)). Lines 11 to 13 in Algorithm 1 are the pseudo-code for the \(\mathcal {S}\)-operator.

Algorithm 1 Pseudo-code of the \(\beta\)-hill climbing optimizer
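Because Algorithm 1 is reproduced here only as a figure, the following Python sketch illustrates how the three operators described above can be combined for a continuous decision vector. It is a hedged reconstruction, not the authors' code; the neighbourhood width `n_width`, the bounds and the toy objective are illustrative assumptions.

```python
import numpy as np

def beta_hill_climbing(f, lower, upper, n, n_width=0.1, beta=0.3,
                       max_iters=1000, seed=0):
    """Minimal beta-hill climbing for minimizing f over [lower, upper]^n."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lower, upper, n)                  # random initial solution
    fx = f(x)
    for _ in range(max_iters):
        # N-operator: random-walk move to a neighbouring solution (Eq. 6)
        x_new = x + rng.uniform(-n_width, n_width, n)
        # beta-operator: uniform mutation of each variable with probability beta
        mutate = rng.uniform(0.0, 1.0, n) <= beta
        x_new[mutate] = rng.uniform(lower, upper, mutate.sum())
        x_new = np.clip(x_new, lower, upper)          # keep the solution feasible
        # S-operator: greedy acceptance of the improved solution
        f_new = f(x_new)
        if f_new <= fx:
            x, fx = x_new, f_new
    return x, fx

# Toy usage: minimize the sphere function
best_x, best_f = beta_hill_climbing(lambda v: np.sum(v ** 2), -5.0, 5.0, n=10)
print(best_f)
```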

3 \(\beta\)-Hill climbing optimizer for PNN classifier

In this paper, the \(\beta\)-HC algorithm is utilized to find the optimal weights that can be efficiently used in the PNN algorithm, with the aim of increasing the accuracy of the classification process. Initially, a random weight vector x is generated by the PNN algorithm, and the input values are multiplied by the corresponding weights according to the model determined by the PNN. Figure 3 shows an example of the solution representation.

Fig. 3 Representation of initial weights

As shown in Fig. 4, the proposed method consists of two parts: the first represents the PNN algorithm and the second represents the \(\beta\)-HC algorithm. In the first part, the training data is used to build the classifier and the test data is subjected to the classification process. Adjusting the weights of the PNN is the responsibility of the \(\beta\)-HC algorithm; the accuracy of the classified data is then calculated and the process is repeated until the stopping criterion is met.

Fig. 4 Flowchart of proposed \(\beta\)-HC-PNN
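A minimal sketch of the loop in Fig. 4, building on the `pnn_classify` and `beta_hill_climbing` sketches above and reusing their toy data: \(\beta\)-HC searches for a weight vector (the solution representation of Fig. 3) that maximizes the classification accuracy of the PNN. Treating the weights as per-feature scaling factors and evaluating fitness on the training data are illustrative assumptions, not the exact pipeline of the paper.

```python
# Builds on the pnn_classify and beta_hill_climbing sketches above,
# reusing the toy X_train / y_train data.

def weighted_pnn_accuracy(weights, X_train, y_train, X_eval, y_eval, sigma=0.5):
    """Fitness of a candidate weight vector: accuracy of a PNN whose
    features are scaled by `weights` (one weight per feature)."""
    correct = 0
    for x, y in zip(X_eval, y_eval):
        pred = pnn_classify(X_train * weights, y_train, x * weights, sigma)
        correct += int(pred == y)
    return correct / len(y_eval)

# beta-HC minimizes, so use (1 - accuracy) as the objective
objective = lambda w: 1.0 - weighted_pnn_accuracy(w, X_train, y_train, X_train, y_train)
best_w, best_err = beta_hill_climbing(objective, 0.0, 1.0, n=X_train.shape[1], max_iters=200)
print("best training accuracy:", 1.0 - best_err)
```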

The proposed method was assessed by computing four counts. True positives (TP) refer to the number of records correctly assigned to a class; true negatives (TN) refer to the number of instances correctly identified as not belonging to the class; false positives (FP) refer to the number of instances incorrectly assigned to the class; and false negatives (FN) refer to the positive tuples that are labelled incorrectly.

Classification accuracy, computed in Eq. (7), is a statistical measure of how well the classifier correctly assigns objects to their labelled classes. The error rate, computed in Eq. (8), measures the proportion of objects that are incorrectly recognized. Furthermore, specificity and sensitivity are calculated in Eqs. (9) and (10), respectively.

$$\begin{aligned} Accuracy= & {} \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(7)
$$\begin{aligned} Error Rate= & {} 1-\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(8)
$$\begin{aligned} Specificity= & {} \frac{TN}{TN+FP} \end{aligned}$$
(9)
$$\begin{aligned} Sensitivity= & {} \frac{TP}{TP+FN}. \end{aligned}$$
(10)
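For reference, the four measures in Eqs. (7)–(10) can be computed directly from the confusion-matrix counts, as in the short sketch below (note that sensitivity uses FN in its denominator); the counts in the usage line are made-up numbers for illustration.

```python
def classification_measures(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (7)
    error_rate = 1.0 - accuracy                  # Eq. (8)
    specificity = tn / (tn + fp)                 # Eq. (9)
    sensitivity = tp / (tp + fn)                 # Eq. (10)
    return accuracy, error_rate, specificity, sensitivity

print(classification_measures(tp=80, tn=90, fp=10, fn=20))
```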

4 Experiments and results

In this section, the results produced by the proposed method are evaluated using a set of experiments. The proposed algorithm was programmed in Matlab R2010 and run on an Intel® Xeon® CPU E5-1630 v3 @ 3.70 GHz computer. Table 1 shows the parameter settings used for the proposed algorithm, which were determined after intensive preliminary experiments.

Table 1 Parameter settings

The aim of conducting these experiments is to check whether the classification accuracy is maximized when the PNN utilizes \(\beta\)-HC.

The quality of the solutions is given in terms of the best accuracy value (in percent) over 30 independent runs for each of the 11 datasets. In this research, four performance measures are used: accuracy, error rate, sensitivity and specificity, as formulated in the previous section. Table 2 presents the results for the HC-based PNN, the \(\beta\)-HC-based PNN and the original PNN classification techniques when applied to the 11 benchmark datasets, with the best outcomes highlighted in bold. The results summarized in the table show that the proposed \(\beta\)-HC-based PNN outperformed the original PNN algorithm on all datasets and mostly obtained better results than the HC-based PNN.

Table 2 Results obtained by PNN, HC and \(\beta\)-HC on the test datasets

Figures 5 and 6 present the convergence characteristics of the \(\beta\)-HC-based PNN and the HC-based PNN in comparison with the original PNN on the 11 datasets. Each algorithm was run for 200 iterations. These results show that the \(\beta\)-HC-based PNN converges faster than the HC-based PNN except when applied to GCD and ACA. Moreover, \(\beta\)-HC obtains a similar convergence trend for the Heart and API datasets. Interestingly, when applied to the Four class dataset, \(\beta\)-HC achieved 100% accuracy in all iterations.

Fig. 5 Convergence characteristics of \(\beta\)-HC-PNN and HC-PNN

Fig. 6 Convergence characteristics of \(\beta\)-HC-PNN and HC-PNN (continued)

We also investigate whether \(\beta\)-HC-PNN is statistically significantly different from HC-PNN and the firefly algorithm (FA) in terms of classification accuracy, sensitivity, and specificity. Accordingly, a T test with a 95% confidence level (\(\alpha\) = 0.05) is used. The results show that there is a significant difference between the comparative algorithms in favour of \(\beta\)-HC-PNN. To elaborate, with regard to accuracy, \(\beta\)-HC-PNN is statistically different from HC-PNN and FA (p value \(<0.05\)). As for sensitivity and specificity, \(\beta\)-HC is also statistically different from HC-PNN and FA. Figure 7 presents box plots of the distribution of solution quality obtained by \(\beta\)-HC, HC-PNN and FA for the six tested datasets used in Gorunescu (2011).

Fig. 7 Box plots for \(\beta\)-HC, HC and FA
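As an illustration of this kind of significance test, the snippet below applies SciPy's two-sample t test to the per-run accuracies of two methods. The accuracy arrays are placeholders rather than the actual results reported in this paper, and the paper's exact test configuration may differ.

```python
import numpy as np
from scipy import stats

# Accuracy over 30 independent runs for two methods (placeholder values)
acc_bhc_pnn = np.array([0.91, 0.93, 0.92, 0.94, 0.92, 0.93] * 5)
acc_hc_pnn  = np.array([0.88, 0.90, 0.89, 0.91, 0.90, 0.89] * 5)

t_stat, p_value = stats.ttest_ind(acc_bhc_pnn, acc_hc_pnn)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> significant at alpha = 0.05
```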

We also assessed all the computational results (with regards to classification accuracy) of the proposed \(\beta\)-HC against those of six popular approaches in the literature. The abbreviations of the comparative approaches are listed in Table 3.

Table 3 Acronyms of the compared methods

Notably, none of the comparative approaches listed in Table 4 were applied to all of the datasets used in our study, except for FA. Therefore, a different combination of these approaches was compared for each dataset, as described in Table 4. The best results are highlighted in bold font.

Table 4 Comparison of the results of \(\beta\)-HC and the state-of-the-art

From the table, it can be seen that \(\beta\)-HC ranked first on all datasets except ACA and GCD, thereby achieving the best overall performance with regard to classification accuracy, with HC ranked second. Moreover, \(\beta\)-HC and HC were able to classify the Four class dataset with no misclassifications (100% accuracy). In a nutshell, \(\beta\)-HC outperformed the other comparative approaches on almost all of the datasets used.

5 Conclusion and future work

The key goal of this research is to improve the classification accuracy of the PNN by (1) modifying the PNN's inner weights to discover high-quality solutions, thus improving classification accuracy, and (2) attaining a high convergence speed. The approach proposed in this paper combines the \(\beta\)-HC metaheuristic algorithm with the PNN algorithm, where \(\beta\)-HC is utilized to optimize the weights of the PNN. The rationale for utilizing \(\beta\)-HC, a recent local search-based algorithm, is its capacity to explore the search space through the \(\beta\)-operator, its straightforwardness, and its potential for solving various complicated classification problems.

In order to evaluate the proposed method, 11 benchmark datasets that are widely used to evaluate new classification methods were employed in the experiments. Initially, the three approaches, namely PNN, HC-PNN, and \(\beta\)-HC-PNN, were compared in terms of accuracy, sensitivity, specificity, and error rate. Interestingly, for all measures, the proposed \(\beta\)-HC-PNN achieved the best results.

Furthermore, the proposed \(\beta\)-HC-PNN was also compared with six well-established approaches using the same datasets. The results produced by \(\beta\)-HC-PNN are better than those produced by the others for nine out of the eleven datasets.

Overall, the proposed approach was evaluated against 11 benchmark datasets, and the experimental results showed that the proposed \(\beta\)-HC with PNN approach performed better in terms of classification accuracy than the original PNN, HC-PNN and six other well-established approaches on the same benchmarks. In a nutshell, hybridizing \(\beta\)-HC with PNN improves the classification accuracy of the PNN.

In general, the results obtained affirm the strong performance of the proposed algorithm. In the future, the following improvements can be investigated:

  • The \(\beta\)-HCO can be hybridized with another classification method instead of PNN.

  • Other classification datasets with combinatorial features can be used for further validation.

  • Other local search methods can be investigated to show the strength of \(\beta\)-HCO.