
1 Introduction

Determining the structure of a Multi-layer Perceptron is a critical issue in the design of a neural network [1]. To date, there is no general equation defining the structure of a Multi-layer Perceptron that can deal with the different kinds of problems a neural network may be asked to solve: each problem needs a particular structure that responds to its requirements. Currently used methods do not rely on the complexity of the problem to be solved by the Multi-layer Perceptron, and most of them are very limited, time-consuming, and supervised [2], such as growing and pruning algorithms, exhaustive search, and evolutionary algorithms. In this paper, a novel method to determine the optimal Multi-layer Perceptron structure using linear regression is introduced. The idea is to cluster the dataset used to train the Multi-layer Perceptron with conventional methods of pattern recognition [3, 4], according to specific criteria, until we obtain a set of parameters useful for the design of the Multi-layer Perceptron structure. The results obtained from clustering the training dataset are used as independent variables of linear regression models [5] that determine the Multi-layer Perceptron structure. The equation defined by the linear regression minimizes the distance between a fitted line and the data points. The regression models aim to achieve maximum accuracy in determining the number of hidden layers and the number of neurons in these layers.

2 Related Work

The design of the structure of a neural network is an extremely active area of research and does not yet have any definitive guiding theoretical principles. The currently used methods are very limited and time-consuming, such as growing and pruning algorithms [6], exhaustive search, and evolutionary algorithms [7]. Some widely used methods for determining the number of hidden neurons are reviewed below.

Many researchers use rules of thumb, such as: the number of hidden neurons should be between the size of the input layer and the size of the output layer; the number of hidden neurons should be (number of inputs + number of outputs) * (2/3); the number of hidden neurons should be less than twice the number of input layer neurons [8]. These rules provide a starting point, but the best architecture is reached only after a number of trial-and-error tests. The trial-and-error approach does not yield good results except by accident; its systematic variant is exhaustive search [9], which searches through all possible topologies and selects the one with the least generalization error. The disadvantage of this method is that it is time-consuming.

The growing neural network algorithm was initially proposed by Vinod et al. [10]. Growing algorithms also search through possible topologies and select the one with the least generalization error, but unlike exhaustive search, the search stops once the generalization error shows no remarkable change.

Pruning algorithms train an oversized network and then determine the relative importance of the weights by analyzing them. The weights with the least importance are pruned, and the task is repeated. The disadvantage of this method is that the analysis of the weights is time-consuming.

In this paper, we propose a method to determine the structure of a Multi-layer Perceptron based on the complexity level of the considered problem, making it more flexible with different dataset types than classical methods. In addition, this method makes the design of the Multi-layer Perceptron unsupervised.

3 Multiple Linear Regression Method Used

Following a set of criteria in the analysis of the clusters obtained through hierarchical clustering of the dataset used to train the neural network yields a number of parameters that can be used to define linear regression models for determining the structure of the Multi-layer Perceptron [11]. The parameters obtained from clustering are evaluated using statistical hypothesis testing [12] to identify whether dependencies exist between these parameters and the number of hidden layers and hidden neurons. The parameters selected through this evaluation are used as independent variables of the regression models [13].

Figure 1 presents a framework of the regression model. It shows how the results obtained from clustering the training dataset are used to determine the regression models that generate the optimal number of hidden layers and the number of neurons in those layers.

Fig. 1. The framework of the regression model

3.1 Regression Analysis

Regression analysis is a statistical technique for predicting a quantitative relationship between a dependent variable and a set of independent variables [14]. The defined regression equation depends on the assumption made about the relationship between the dependent variable and the independent variables [15]. The linear regression equation seeks to minimize the errors when fitting the data points to a straight regression line representing the equation; the equation is defined using information obtained by observations or measurements. The multiple coefficient of determination R2 is required to assess the relationship between the independent variables and the dependent variable: R2 expresses how much of the variation of the dependent variable is explained by the variation of the independent variables. The coefficient of determination is especially informative when supported by other statistical indicators [16].
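In its standard form, R2 compares the residual sum of squares of the fitted model with the total sum of squares of the observations:

$$ R^{2} = 1 - \frac{{\sum\nolimits_{i} {\left( {y_{i} - \hat{y}_{i} } \right)^{2} } }}{{\sum\nolimits_{i} {\left( {y_{i} - \bar{y}} \right)^{2} } }} $$

where $\hat{y}_{i}$ denotes the fitted value for observation i and $\bar{y}$ the mean of the observed dependent variable.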

Mathematically, the Multiple Linear Regression model has the following form:

$$ f:X \to Y $$
$$ f\left( X \right) = w_{0} + \sum\nolimits_{j = 1}^{n} {w_{j} x_{j} } $$
(1)

The regression model consists of the unknown parameters wj, the dependent variable Y, and the independent variables X.
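As an illustration, a model of the form (1) can be estimated by ordinary least squares. The sketch below uses scikit-learn; the factor matrix X and target y are synthetic stand-ins for the clustering-derived parameters and the structure parameter, and the variable names are ours, not from the paper.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # One row per training configuration, one column per clustering-derived factor.
    # These values are illustrative only.
    X = np.array([[1.0, 8.0, 93.2, 98.38],
                  [3.0, 5.0, 91.0, 97.10],
                  [2.0, 6.5, 95.4, 96.80],
                  [4.0, 4.0, 90.5, 95.60]])
    y = np.array([1.0, 2.0, 1.0, 2.0])      # observed structure parameter

    model = LinearRegression().fit(X, y)    # estimates w0 and w1..wn of Eq. (1)
    print(model.intercept_, model.coef_)    # w0, then the slope parameters w1..wn
    print(model.score(X, y))                # R^2 of the fit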

3.2 Statistical Hypothesis Testing

Statistical hypothesis testing [17] is used to examine the parameters obtained through hierarchical clustering of the training dataset and to select the parameters that determine the regression model. Hypothesis testing is also used to establish that the regression model is significant. It relies on the null hypothesis H0, which assumes no significant relationship between the independent variables X and the dependent variable Y.

  • H0: There is no relationship between the clustering results and the structure of Multi-layer Perceptron

  • Ha: There is a relationship between the clustering results and the structure of Multi-layer Perceptron

A parameter is proposed as an independent variable of the regression model only if the probability coefficient of its independent variable (p-value) is less than 0.05.

The F-test [18], used for the analysis of variance, is taken as evidence that the structure of the Multi-layer Perceptron depends on the selected factors.
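A minimal sketch of this screening step, assuming Python with statsmodels (the candidate factors here are synthetic, and the 0.05 threshold follows the text):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.random((30, 4))                      # 4 candidate factors from clustering (synthetic)
    y = X @ np.array([0.5, 1.2, 0.0, 0.8]) + 0.1 * rng.standard_normal(30)

    results = sm.OLS(y, sm.add_constant(X)).fit()
    print(results.pvalues)                       # per-coefficient p-values (H0: coefficient is 0)
    print(results.fvalue, results.f_pvalue)      # overall F-test of the regression
    kept = [j for j, p in enumerate(results.pvalues[1:]) if p < 0.05]
    print("factors kept:", kept)                 # retain factors with p < 0.05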

3.3 The Independent Factors Selected

Based on statistical hypothesis testing and the F-test, a set of factors is proposed to determine the regression equation; it has also been shown that all factors in the models are linked, with relatively small positive and negative correlations between them [19]. The effectiveness and efficiency of the proposed model with the selected factors are supported by the multiple coefficient of determination [20] and the multiple correlation coefficient [21], both of which are close to 1.

The proposed factors, obtained by clustering the training dataset, are used as independent variables of the regression models:

  • The number of obtained clusters

  • The percentage of grouped items

  • The reference distance

  • The number of training forms

  • The number of features in the input

Moreover, the quality measure of the network structure is considered as an additional independent factor; it takes into account the configuration of the layers and their interconnections [22].

The proposed regression approach consists of two models: the first determines the number of hidden layers and the second the number of hidden neurons.

Regression Model to Determine the Number of Hidden Layers.

A set of factors has proved able to influence the dependent variable y, which represents the number of hidden layers of a Multi-layer Perceptron. Using the statistical hypothesis testing mentioned previously and experimental results, it turns out that the dependent variable y depends on changes of the following independent variables:

  • X1: The number of obtained clusters multiplied by the reference distance

  • X2: The reference distance

  • X3: The percentage of grouped items

  • X4: The quality measure of the network structure

The independent variables X1, X2, and X3 are obtained through clustering of the training dataset. In addition, the quality measure of the network structure X4 is taken as a fourth independent factor; it depends on the reference distance and the structure of the Multi-layer Perceptron.

The Multiple Linear Regression model representing the equation to determine the number of hidden layers of the Multi-layer Perceptron has the following mathematical form:

$$ y = a_{0} + \sum\nolimits_{j = 1}^{4} {a_{j} x_{j} } $$
(2)

The dependent variable y represents the number of hidden layers, and a0, a1, a2, a3, and a4 are the constants used to predict it: a0 is the intercept parameter and a1, a2, a3, and a4 are the slope parameters.

The percentage contribution of each independent variable in the regression model and the absolute values of the partial correlation coefficients let us conclude that the dependent variable y is influenced by several factors. Among them, the Reference Distance has a particularly important influence on the number of hidden layers.
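Once the coefficients have been estimated, applying Eq. (2) is a direct evaluation. The coefficient values in the sketch below are hypothetical, chosen only to reproduce the Waveform example of Sect. 4.2; the paper does not report the fitted coefficients.

    # Hypothetical coefficients a0..a4 for Eq. (2); the real values come from the fit.
    a = [0.40, 0.05, 0.02, -0.01, 0.005]

    def hidden_layers(x1, x2, x3, x4):
        """Evaluate y = a0 + a1*x1 + ... + a4*x4 and round to a whole layer count."""
        y = a[0] + a[1] * x1 + a[2] * x2 + a[3] * x3 + a[4] * x4
        return max(1, round(y))

    # Waveform factors from Sect. 4.2: X1 = 8, X2 = 8, X3 = 93.2, X4 = 98.38
    print(hidden_layers(8, 8, 93.2, 98.38))   # -> 1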

Regression Model to Determine the Number of Hidden Neurons.

The regression model used to calculate the number of hidden neurons is determined using a set of factors selected, in accordance with the above considerations, from the results obtained through clustering of the training dataset. The number of hidden neurons depends on changes in the following independent variables:

  • X1: The number of features in the input

  • X2: The number of obtained clusters

  • X3: The reference distance

  • X4: The quality measure of the network structure

The independent variable X1 represents the number of features of the training dataset. X2 and X3 represent the obtained number of clusters and the reference distance, respectively, both obtained by clustering the training dataset.

The Multiple Linear Regression model representing the equation to determine the number of hidden neurons of the Multi-layer Perceptron has the following mathematical form:

$$ y = a_{0} + \sum\nolimits_{j = 1}^{4} {a_{j} x_{j} } $$
(3)

The dependent variable y represents the number of hidden neurons. If the number of hidden layers exceeds one, the obtained number of hidden neurons is evenly distributed over the hidden layers, so that each hidden layer contains the same number of neurons as the others.

The independent factors influence the number of hidden neurons to varying degrees; the factor with the highest influence is the Reference Distance.

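Putting the two models together, a small sketch of the final structure derivation (the function name and the remainder handling are our additions; the paper assumes an exactly even split):

    def mlp_hidden_structure(n_layers, n_neurons):
        """Distribute the predicted neuron total evenly over the predicted layers."""
        base, rest = divmod(n_neurons, n_layers)
        # Spread any remainder one neuron at a time over the first layers.
        return [base + (1 if i < rest else 0) for i in range(n_layers)]

    # Waveform example (Sects. 4.2 and 4.3): one hidden layer, 74 neurons.
    print(mlp_hidden_structure(1, 74))   # -> [74]
    print(mlp_hidden_structure(2, 75))   # -> [38, 37]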

3.4 Clustering of the Training Dataset

The proposed regression method depends mainly on the results obtained from clustering the training dataset. The most convenient clustering algorithm for the proposed method is the Agglomerative Hierarchical Clustering algorithm [23]. Each cluster obtained through Agglomerative Hierarchical Clustering seeks to ensure the highest similarity of objects within the cluster and, at the same time, the highest dissimilarity between clusters [24]. Clusters obtained this way can contain several sub-clusters, giving a hierarchical clustering: a set of nested clusters that builds a cluster tree (dendrogram) to represent the objects. The root of the tree represents the cluster that groups all other clusters and objects; in some cases, the leaves of the tree represent clusters of one object. The Agglomerative Hierarchical Clustering algorithm [25,26,27,28,29] considers each object as a single cluster and then repeatedly joins the closest clusters until only one single cluster remains. The optimal number of clusters is determined by cutting all segments with a length greater than a predefined value [30]. This reference value (the Reference Distance) is chosen according to specific criteria.
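A sketch of this step with SciPy, using the average linkage and Manhattan metric adopted later in Sect. 4.1 (the random data is a stand-in for a real training dataset):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(100, 21)    # stand-in for the training dataset (100 items, 21 features)
    Z = linkage(X, method='average', metric='cityblock')   # 'cityblock' = Manhattan distance

    reference_distance = 8.0       # cut the dendrogram at this height
    labels = fcluster(Z, t=reference_distance, criterion='distance')
    print(int(labels.max()))       # number of clusters obtained at this cut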

The value of the Reference Distance appropriate for obtaining the optimal number of clusters must satisfy a set of criteria. Implementing the following criteria makes the number of obtained clusters useful for the proposed regression method to determine the structure of the Multi-layer Perceptron.

The first criterion requires grouping at least ninety percent of the items of the training dataset; grouping ninety percent of the items is considered sufficient for the results to represent the entire training dataset.

The second criterion requires that the number of clusters be kept as small as possible in order to minimize the size of the network: with the proposed method, an increase in the number of clusters causes an increase in the number of hidden layers and hidden neurons, and with fewer hidden layers and neurons the complexity of the Multi-layer Perceptron is reduced [31].

The third criterion requires a Reference Distance value such that any increase in it does not affect the number of obtained clusters [32,33,34]. Two conditions should be avoided: a very short Reference Distance, for which each leaf of the tree represents a cluster of one object, and a relatively large Reference Distance, for which all objects are grouped in one cluster.

The fourth criterion requires the right selection of the distance metric (such as Manhattan or Euclidean) and the linkage method (single, complete, or average linkage) appropriate to the clustering algorithm and the type of training dataset [35]. A good choice of distance metric increases the accuracy of the proposed method.

By following these criteria, clustering the training dataset can generate a set of parameters useful for constructing the regression models that determine the optimal structure of a Multi-layer Perceptron.
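A sketch of how the first and third criteria can be checked when scanning candidate Reference Distance values; reading "grouped items" as items belonging to clusters with more than one member is our interpretation of the text:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster

    def grouped_fraction(labels):
        """Fraction of items in clusters with more than one member."""
        _, counts = np.unique(labels, return_counts=True)
        return counts[counts > 1].sum() / labels.size

    def choose_reference_distance(Z, candidates):
        """Smallest cut height t that groups at least 90% of the items and whose
        cluster count does not change when t is increased to the next candidate."""
        candidates = sorted(candidates)
        for t, t_next in zip(candidates, candidates[1:]):
            labels = fcluster(Z, t=t, criterion='distance')
            labels_next = fcluster(Z, t=t_next, criterion='distance')
            if grouped_fraction(labels) >= 0.9 and labels.max() == labels_next.max():
                return t
        return None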

4 Experimental Results

A number of experimental tests were conducted to prove the effectiveness of the proposed method. The training dataset is trained using different Multi-layer Perceptron structures, and the results are compared with those of the proposed method.

In this paper, the Waveform Database Generator Version 1 dataset is used to prove the validity of the proposed method. The Waveform dataset consists of 21 attributes and 5000 instances; its classes are generated from a combination of two of three “base” waves.

4.1 Experimental Results Obtained from Clustering of the Training Dataset

Agglomerative Hierarchical Clustering is used to cluster the Waveform dataset. The number of clusters varies with the value of the Reference Distance; according to the criteria described above, the optimal Reference Distance value is 8.

Figures 2 and 3 below present the number of clusters, the percentage of objects clustered, and the corresponding values of the Reference Distance.

Fig. 2. Clusters obtained based on the percentage of items grouped

Fig. 3. Clusters number obtained vs. the corresponding reference distance of the Waveform dataset, and percentage of items grouped vs. number of clusters of the Waveform dataset

The appropriate parameters selected in this case study are “standard” normalization, “average link” cluster distance, and the “Manhattan” distance metric.

Figure 2 presents the obtained dendrogram and the number of clusters obtained for each reference distance value. Based on the criteria listed above, the optimal number of clusters is one cluster.

According to the criteria described above, we conclude that the ideal number of clusters is one cluster, with 93.2% of items clustered, for a Reference Distance value of 8. These results can then be used to determine the regression models that construct the optimal structure of the Multi-layer Perceptron for training on the Waveform dataset.

4.2 Calculating the Number of Hidden Layers Using Regression Model

Based on the clustering of the Waveform dataset, the independent factors used in Eq. (2) for calculating the number of hidden layers have the following values: X1 = 1 × 8 = 8, X2 = 8, X3 = 93.2, X4 = 98.38.

According to the criteria described above, the selected value of the Reference Distance is 8, for a percentage of grouped items above 90%, and the corresponding number of clusters is one; therefore the value of X1 is 8 × 1 = 8. X2 represents the value of the Reference Distance, so X2 = 8. X3 represents the percentage of grouped items, so X3 = 93.2, as seen in Fig. 3. X4, the quality measure of the network structure, was determined by creating a structure whose number of hidden layers equals the number of clusters obtained at the selected Reference Distance, with the number of neurons determined using Formula (9); on this basis, the quality measure of the network structure is X4 = 98.38 at the selected Reference Distance.

Using the above values in Eq. (2) results in y = 1; that is, the optimal number of hidden layers according to the proposed method is one layer.

The value of the multiple determination coefficient R2 obtained is close to 1. R2 expresses the level of variation of the number of hidden layers explained by the variation of the selected independent variables X1, X2, X3, and X4; this supports the validity of the proposed model and the successful choice of independent factors.

4.3 Calculating the Number of Hidden Neurons Using Regression Model

Equation (3) is applied to calculate the number of neurons in the hidden layers using the following values of the independent factors, obtained from the clustering of the Waveform dataset:

X1 = 21, X2 = 1, X3 = 8, X4 = 98.38

Using the above values in Eq. (3) results in y = 74; that is, the optimal number of hidden neurons according to the proposed method is 74.

The R2 obtained is again close to 1. R2 expresses the level of variation of the number of hidden neurons explained by the variation of the selected independent variables X1, X2, X3, and X4; this supports the validity of the proposed model and the successful choice of independent factors.

4.4 Comparison of the Proposed Method with Classical Methods

To validate the results obtained using the proposed method, a comparison with widely used classical methods is conducted. The following classical formulas are used in the comparison, with the notation:

  • In: number of input neurons

  • Out: number of output neurons

  • Hidden: number of hidden neurons

  • Training: number of training forms

$$ Hidden = \frac{1}{2}\left( {In + Out} \right) $$
(4)
$$ Hidden = \sqrt {\frac{1}{2}\left( {In + Out} \right)} $$
(5)
$$ Hidden = \frac{2}{3}\left( {In + Out} \right) $$
(6)
$$ Hidden = \frac{Training}{{10\left( {In + Out} \right)}} $$
(7)
$$ Hidden = \frac{{Training - Out}}{{In + Out + 1}} $$
(8)
$$ Hidden = \frac{1}{2}\left( {In + Out} \right) + \sqrt {Training} $$
(9)
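For reference, formulas (4)-(9) translate directly into code; reading the denominators of (7) and (8) as grouped and rounding to whole neurons are our assumptions:

    import math

    def classical_hidden_neurons(n_in, n_out, n_training):
        """Hidden-neuron counts given by formulas (4)-(9)."""
        return {
            "(4)": (n_in + n_out) / 2,
            "(5)": math.sqrt((n_in + n_out) / 2),
            "(6)": (n_in + n_out) * 2 / 3,
            "(7)": n_training / (10 * (n_in + n_out)),
            "(8)": (n_training - n_out) / (n_in + n_out + 1),
            "(9)": (n_in + n_out) / 2 + math.sqrt(n_training),
        }

    # Waveform dataset: 21 inputs, 3 classes, 5000 instances
    for name, h in classical_hidden_neurons(21, 3, 5000).items():
        print(name, round(h))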

Formula (10): the number of hidden neurons should be between the size of the input layer and the size of the output layer. Formula (11): the number of hidden neurons should be less than twice the size of the input layer. A set of datasets is used: the Waveform Database Generator dataset, Image Segmentation dataset, Glass Identification dataset, Landsat dataset, Sonar dataset, ECG dataset, QRS dataset, P-wave dataset, and T-wave dataset. Table 1 presents the specifications of the datasets used.

Table 1. Specifications of neural networks

Table 2 presents the number of hidden neurons using the classical method:

Table 2. Number of hidden neurons using the classical method

Table 3 presents the number of hidden neurons using the proposed method:

Table 3. Number of hidden neurons using the proposed method

Comparison Based on the Training Time Using Classical Methods vs. the Proposed Method.

A comparison of the training time using classical methods vs. the proposed method shows that the proposed method does not achieve the best training time for all datasets, but it performs well for the ECG, P-wave, and QRS datasets. For example, the training time for ECG is 0.75 s using the proposed method, while the best result among the classical methods is 1.47 s. The failure of the proposed method to obtain the best training time on some datasets is due to the number of neurons selected: the training time depends mainly on the number of neurons in the network and the size of the dataset, regardless of the formula used. Since the number of neurons is selected based on the complexity of the problem, the training time with the proposed method is affected by the complexity of the problem.

Comparison Based on the Percentage of Accuracy Using Classical Methods vs. the Proposed Method.

Table 4 below shows a comparison of the results in terms of percentage of classification accuracy [36] using classical methods vs. the proposed method.

Table 4. Comparison of the percentage of accuracy using classical methods vs. the proposed method

As observed from Table 4 and Fig. 4, the proposed method has the best percentage of accuracy for most datasets. The classical methods sometimes get good results, but this depends on the dataset. For example, formula (7) has a good percentage of accuracy for the ECG dataset and the lowest percentage of accuracy for the Sonar and Landsat datasets. Formula (8) obtained the highest percentage of accuracy among the classical methods for the Glass dataset while getting the lowest percentage for the ECG dataset. Formula (9) performs well with the Glass, Segmentation, and Waveform datasets but has a medium percentage of accuracy for the other datasets.

Fig. 4. Comparison of the percentage of accuracy using classical methods vs. the proposed method

Comparison Based on the Error/Epoch Using Classical Methods vs. the Proposed Method.

Table 5 presents a comparison of the error/epoch [37] using classical methods vs. the proposed method:

Table 5. Comparison of the error/epoch using classical methods vs. the proposed method

As observed in Table 5 and Fig. 5, the proposed method has the lowest error/epoch values for most datasets. The classical methods sometimes get good results with formulas (8), (9), and (10); these results are somewhat acceptable compared to the other classical methods. Formulas (5) and (7) have the highest error/epoch values.

Fig. 5. Comparison of the error/epoch using classical methods vs. the proposed method

Comparison of Classical Methods vs. the Proposed Method Conclusion.

The proposed method gets the best percentage of accuracy for most datasets, unlike the classical methods. The classical formulas (4), (5), (6), (10), and (11) perform well with small datasets that have few training items, whereas formulas (7), (8), and (9) perform well with large datasets because they take the number of training items into consideration. Formula (9) is better than (7) and (8), which suggests that the square root of the number of training items has a positive effect on the results. Formulas (4), (5), and (6) depend mainly on the number of input and output neurons, making them effective for small datasets but weak on large datasets with complex problems to solve. The training time depends mainly on the number of neurons in the network and the size of the dataset, regardless of the formula used. The error/epoch results are broadly similar to the accuracy results. Comparing the proposed method with the classical methods leads us to deduce that the proposed method performs well for different types of datasets, meaning that it is more flexible with different dataset types than the classical methods. The proposed method adapts to the complexity of the dataset to provide the best results regardless of its size. In some cases, a dataset is chosen with a size larger than required, which leads to bad results with the classical methods; this problem is avoided by the proposed method, since it focuses on the complexity of the problem to be solved regardless of the size of the dataset.

5 Conclusion

It is noticeable that pattern recognition plays a significant role in determining the optimal structure of the Multi-layer Perceptron using the proposed method. The proposed method makes the design of the Multi-layer Perceptron unsupervised, dispensing with the need for designer experience and the time wasted by trial-and-error methods. By clustering the training dataset, we can collect a set of parameters useful for determining the structure of the Multi-layer Perceptron, used as independent variables of the regression models of the proposed method. The independent variable Reference Distance has the highest influence on the results compared to the other variables. Comparing the proposed method with classical methods leads us to deduce that the proposed method performs well for different types of datasets, meaning that it is more flexible with different dataset types than classical methods. The proposed method adapts to the complexity of the dataset to provide the best results regardless of its size.