1 Introduction

Artificial Neural Networks (ANNs) are a highly flexible modelling technique inspired by biological neural systems, whose computing power is developed through an adaptive learning process. The properties and characteristics of ANNs have made them a common tool for successfully solving highly complex problems from different areas, e.g. medical diagnosis, financial data modelling, predictive microbiology, remote sensing and analytical chemistry. In some of these problems, items have to be classified into naturally ordered classes. Such problems are traditionally handled by conventional methods intended for the classification of nominal classes, where the order relation is ignored. This kind of supervised learning problem is referred to as ordinal regression, where an ordinal scale is used to label the examples. Therefore, in ordinal classification problems, the goal is to learn how to classify examples in the correct class, but one should take into account that the greater the distance between the predicted and real labels (with respect to the ordinal scale), the more the misclassification error should be penalised.

In general, Logistic Regression (LR) is a simple and useful classification procedure, although it poses problems when applied to real-world problems, where we frequently cannot assume additive and purely linear effects of the covariates. As suggested by [13], an obvious way to generalise linear logistic regression is to replace the linear predictors with structured non-parametric models, such as an additive model of basis functions. In this paper, we extend the ideas introduced in [12], where a combination of LR and neural network models was used to solve nominal classification problems. We present an adaptation of the corresponding algorithm to tackle ordinal classification, combined with an ordinal ANN model with hybrid basis functions.

As exposed in [19], the motivation for studying hybrid basis functions for neural networks comes from several directions. Biological neural systems are built from a large diversity of neuron types. We can also think about computational efficiency: neural networks with high diversity perform better [7]. That leads us to a third motivation, the reduction of complexity in the neural network: adding diversity to a neural network allows the number of nodes (and links) to be reduced [25].

One of the first models specifically designed for ordinal classification, and the one our model is built on, is the Proportional Odds Model (POM) [20], which is basically an ordinal logistic regression. This model is based on the assumption of stochastic ordering in the input space and on the use of thresholds to split the projected input space into different ordered classes. We propose a hybrid neural network ordinal model using a combination of projection functions (sigmoidal units, SU) and kernel functions (radial basis functions, RBF) in the hidden layer of a feed-forward neural network [11]. A mixture of different kinds of basis functions [11] is an interesting alternative, able to take advantage of the benefits of each one. An evolutionary algorithm is adapted to train this model and applied to learn the model architecture, link weights and node typology. In order to obtain further conclusions, the proposed hybrid basis neural network is compared to its corresponding pure models: SU and RBF neural networks. Our proposal follows the idea of augmenting the vector of inputs with non-linear covariates obtained by the hybrid hidden layer of the neural network, and then using the POM in this new space of derived input features.

The estimation of the coefficients is carried out in several steps. In a first step, an evolutionary algorithm (EA) [27] is applied to design the structure and train the weights of a hybrid SURBF neural network (SURBFNN) [5, 15]. Evolutionary computation has been widely used in recent years to evolve NN architectures and weights, with many applications for parametric learning [22] and for both parametric and structural learning [1, 17, 26]. This evolutionary process determines the number of neurons in the model and the corresponding variables, which will be the new covariates in the non-linear LR model; the best model in the last generation is used for that purpose. In a second step, we augment the input space by adding the SURBF non-linear covariates to the linear ones. This leads us to a third step, where we run a local optimization algorithm on the new input covariates, using a maximum likelihood method for ordinal LR based on the structure of the POM.

The rest of the paper is organized as follows. Section 2 introduces the hybrid neural networks. In Sect. 3, the neural network model for ordinal regression is explained. Section 4 presents the algorithm developed in order to obtain the coefficients for the hybrid model. Section 5 includes the experiments: experimental design, information about the datasets and results of the experiments. Finally, in Sect. 6, we present the conclusions of the paper.

2 Hybrid Artificial Neural Networks

Different types of neural networks are used today for classification purposes, including neural networks based on sigmoidal units (SU), radial basis functions (RBF) [15] and a class of multiplicative basis functions called product units (PU) [18, 23]. The combination of different basis functions in the hidden layer of a neural network has been proposed as an alternative to traditional neural networks [16]. We use RBF and SU neurons according to the insights of Cohen and Intrator [7], based on the duality and complementary properties of projection-based functions (SU and PU) and kernel-based ones (RBF). These models have also been theoretically justified by Donoho [8], who demonstrated that any continuous function can be decomposed into two mutually exclusive functions, such as radial (RBF) and crest ones (SU and PU). In this way, RBF neurons contribute to a local recognition model [4], while SU neurons contribute to a global one [18]. Their combination results in a high degree of diversity, because the submodels differ from one another.
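To make the combination concrete, the following minimal sketch (NumPy only; all function and variable names are hypothetical) evaluates a hybrid hidden layer in which the first \(m_1\) basis functions are sigmoidal units and the remaining \(m_2\) are Gaussian RBFs:

```python
import numpy as np

def sigmoidal_unit(x, w):
    # SU: projection-based (global) basis; w = (w_0, w_1, ..., w_k), with bias w_0
    return 1.0 / (1.0 + np.exp(-(w[0] + np.dot(w[1:], x))))

def rbf_unit(x, c, r):
    # RBF: kernel-based (local) basis with centre c and radius r
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * r ** 2))

def hybrid_layer(x, su_weights, rbf_centres, rbf_radii):
    # Outputs of the hybrid hidden layer: m1 SU nodes followed by m2 RBF nodes
    su_out = [sigmoidal_unit(x, w) for w in su_weights]
    rbf_out = [rbf_unit(x, c, r) for c, r in zip(rbf_centres, rbf_radii)]
    return np.array(su_out + rbf_out)
```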

3 Proposed Neural Network Model

The POM model [20], as the majority of existing ordinal regression models, can be represented in the following general form:

$$C(\mathbf {x})= \begin{cases} \mathcal {C}_1, & \text {if } f(\mathbf {x},{\varvec{\uptheta }})\le \beta ^1_0\\ \mathcal {C}_2, & \text {if } \beta ^1_0<f(\mathbf {x},{\varvec{\uptheta }})\le \beta ^2_0\\ \;\;\vdots \\ \mathcal {C}_J, & \text {if } f(\mathbf {x},{\varvec{\uptheta }})>\beta ^{J-1}_0, \end{cases}$$
(1)

where \(\beta ^1_0< \beta ^2_0< \cdots < \beta ^{J-1}_0\) (this will be the most important constraint in order to adapt the nominal classification model to ordinal classification), J is the number of classes, \(\mathbf {x}\) is the input pattern to be classified, \(f(\mathbf {x},{\varvec{\uptheta }})\) is a ranking function and \({\varvec{\uptheta }}\) is the vector of parameters of the model. Indeed, the analysis of Eq. (1) uncovers the general idea presented in [20]: patterns, \(\mathbf {x}\), are projected to a real line by using the ranking function, \(f(\mathbf {x},{\varvec{\uptheta }})\), and the biases or thresholds, \(\beta ^j_0\), are separating the ordered classes.
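As an illustration, the decision rule of Eq. (1) can be sketched as follows (a minimal example, assuming the thresholds are already ordered; the names are hypothetical):

```python
import numpy as np

def pom_class(f_value, thresholds):
    # thresholds = [beta_1, ..., beta_{J-1}] with beta_1 < ... < beta_{J-1}
    assert all(a < b for a, b in zip(thresholds, thresholds[1:]))
    # the number of thresholds strictly below f_value gives the class index
    return int(np.searchsorted(thresholds, f_value)) + 1  # 1-based: C_1, ..., C_J
```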

We use an adaptation of the POM to artificial neural networks. This adaptation is based on two elements. The first one is a second hidden linear layer with only one node, whose inputs are the non-linear transformations of the first hidden layer; the task of this node is to project the values onto a real line, imposing an order on them. The second one is an output layer, included after this one-node linear layer, with one bias per class, whose objective is to set the optimum thresholds for classifying the patterns in the class they belong to.

The structure of our model is presented in Fig. 1, which has two main parts. The lower one shows the SURBFNN model, where \(\mathbf {x} = (x_1, \ldots, x_k)\) is the vector of input variables and k is the number of variables in the database. \(\varvec{\omega } = (\omega _{1,10}, \ldots, \omega _{1,m_{1}k}, \omega _{2,10}, \ldots, \omega _{2,m_{2}k})\) is the matrix of weights of the connections from the input nodes to the hidden layer, both to the SU nodes (\(\omega _{1,10}, \ldots, \omega _{1,m_{1}k}\)) and to the RBF ones (\(\omega _{2,10}, \ldots, \omega _{2,m_{2}k}\)). B denotes the nodes in the hybrid hidden layer, where \(m_1\) is the number of nodes of the first type (SU) and \(m_2\) the number of nodes of the second type (RBF), and “1” is the bias of the layer, which takes part in the calculations.

Fig. 1. Neural network model for ordinal classification

The upper part of the figure shows a single node in the second hidden layer of the model, which is the one that performs the linear transformation of the POM model. Its result, \(f(\mathbf {x},\varvec{\uptheta })\), is connected, together with a second bias, to the output layer, where J is the number of classes and \(\beta ^{1}_0, \ldots, \beta ^{J-1}_0\) are the thresholds for the different classes. These \(J-1\) thresholds are able to separate the J classes, but they have to fulfil the order constraint shown in the figure. Finally, the output layer obtains the outputs of the model, \(f_j(\mathbf {x},\varvec{\uptheta },\beta ^{j}_0)\), for \(j\in \{1,\ldots ,J-1\}\). These outputs are transformed by the POM link function into probabilities (\(g_j(\mathbf {x},\varvec{\uptheta },\beta ^{j}_0)\)), namely the probability that each pattern has of belonging to each of the classes; the class with the greatest probability is the one selected by the NN as the class of the pattern.
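This transformation into probabilities can be sketched as follows (a hedged example: we assume the standard POM logistic link \(g_j = \sigma (\beta ^j_0 - f(\mathbf {x},\varvec{\uptheta }))\) for the cumulative probabilities, whose sign convention may differ from the actual implementation):

```python
import numpy as np

def class_probabilities(f_value, thresholds):
    # Cumulative probabilities g_j = P(y <= C_j | x), j = 1, ..., J-1
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(thresholds) - f_value)))
    cum = np.concatenate(([0.0], cum, [1.0]))   # add g_0 = 0 and g_J = 1
    return np.diff(cum)                         # p_j = g_j - g_{j-1}

# The predicted class is the one with the greatest probability:
# predicted = np.argmax(class_probabilities(f_value, thresholds)) + 1
```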

4 Estimation of the Coefficients

The proposed methodology is based on the combination of an EA and an ordinal maximum likelihood optimization method. Figure 2 represents the steps of the algorithm and the models obtained for the experiments. These steps are now explained:

Fig. 2. Steps in the proposed methodology. The different models associated with this methodology are presented in a double-squared box

Step 1: We apply an EA to find the basis functions (SUs and RBFs):

$$\mathbf {B}(\mathbf {x},\mathbf {W}) = \{B_{1,1}(\mathbf {x},\mathbf {w}_{1,1}), \ldots , B_{1,m_1}(\mathbf {x},\mathbf {w}_{1,m_1}), B_{2,1}(\mathbf {x},\mathbf {w}_{2,1}), \ldots , B_{2,m_2}(\mathbf {x},\mathbf {w}_{2,m_2})\}$$

corresponding to the non-linear part of the hybrid logistic regression model presented in this paper. The NN model for the EA is presented in Fig. 1. The EA begins with a random initial population, and in each iteration the population is updated using a population-update algorithm [9]. The population is subject to operations of mutation and replication with ordinal constraints; crossover is not used because of its disadvantages in evolving NNs [1].
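Since the exact operators are beyond the scope of this summary, the following generic sketch only illustrates the replication-plus-mutation scheme of Step 1 (the `fitness` and `mutate` functions are hypothetical placeholders; `mutate` is assumed to perform both parametric and structural changes while preserving the ordinal constraints):

```python
import copy
import random

def evolve(population, fitness, mutate, n_generations=100, elite_frac=0.1):
    # Replication + mutation, no crossover (see [1] for its drawbacks in NNs)
    for _ in range(n_generations):
        population.sort(key=fitness, reverse=True)   # best individuals first
        n_elite = max(1, int(elite_frac * len(population)))
        elite = population[:n_elite]                 # replicated unchanged
        offspring = [mutate(copy.deepcopy(random.choice(elite)))
                     for _ in range(len(population) - n_elite)]
        population = elite + offspring
    return max(population, key=fitness)              # best model, last generation
```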

Step 2: We perform a transformation of the input space, including the non-linear transformations of the inputs obtained by the EA in Step 1:

$$\begin{aligned}&\mathbb {H}: \mathbb {R}^k \rightarrow \mathbb {R}^{k+m}\\&(x_1,x_2,\ldots ,x_k) \rightarrow (x_1,x_2,\ldots ,x_k,z_1,z_2,\ldots ,z_m), \end{aligned}$$

where \(z_1= B_1(\mathbf {x},\mathbf {w}_1)\), \(z_2= B_2(\mathbf {x},\mathbf {w}_2),\ldots ,z_m= B_m(\mathbf {x},\mathbf {w}_m)\), with \(m = m_1 + m_2\) being the total number of hidden nodes.
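In code, this augmentation is a simple concatenation (a sketch; `basis_functions` stands for the m hidden-node outputs fixed after Step 1):

```python
import numpy as np

def augment(x, basis_functions):
    # H: (x_1, ..., x_k) -> (x_1, ..., x_k, z_1, ..., z_m)
    z = np.array([B(x) for B in basis_functions])  # z_i = B_i(x, w_i)
    return np.concatenate([x, z])
```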

Step 3: We apply an ordinal maximum likelihood optimization method in the new input space obtained in Step 2. The maximum likelihood optimization is performed using a gradient-based algorithm called iRProp+ [14], which optimises the non-linear ordinal logistic regression for a defined number of epochs.
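The objective of this step can be sketched as the following negative log-likelihood over the augmented inputs (a minimal example: iRProp+ is not available in SciPy, so a quasi-Newton optimizer is shown as a stand-in, and the thresholds are kept ordered by a cumulative-sum reparametrisation that we assume here for convenience):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, H, y, J):
    # H: (n, k+m) augmented inputs; y: integer labels in {1, ..., J}
    d = H.shape[1]
    theta, raw = params[:d], params[d:]
    # enforce beta_1 < ... < beta_{J-1} via positive increments
    beta = np.cumsum(np.concatenate(([raw[0]], np.exp(raw[1:]))))
    f = H @ theta                                              # projections f(x, theta)
    cum = 1.0 / (1.0 + np.exp(-(beta[None, :] - f[:, None])))  # P(y <= C_j | x)
    cum = np.hstack([np.zeros((len(y), 1)), cum, np.ones((len(y), 1))])
    p = cum[np.arange(len(y)), y] - cum[np.arange(len(y)), y - 1]
    return -np.sum(np.log(np.clip(p, 1e-12, None)))

# x0 = np.zeros(H.shape[1] + J - 1)
# res = minimize(neg_log_likelihood, x0, args=(H, y, J), method="L-BFGS-B")
```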

5 Experiments

In order to analyse the performance of the proposed models, ten datasets have been tested; their characteristics are shown in Table 1. The collection of datasets is taken from the UCI [2] and the mldata.org [21] repositories. The experimental design was conducted using 10 random holdout procedures.

Three different neural networks have been compared: RBFOrdLR is the LR combined with the hidden nodes of an evolutionary RBFNN; SUOrdLR is the same model with SU functions in its hidden layer; finally, SURBFOrdLR is the LR combined with both the SU and RBF nodes from the hybrid hidden layer of an evolutionary neural network. We also compare the results against the original POM model, SVOREX [6] and SVR [24], the latter being an ordinal regression problem transformed into a standard regression one by changing the ordinal labels \((C_1,C_2,\ldots )\) for the numbers \((0,1/(Q-1),2/(Q-1),\ldots , 1)\).

All the parameters of the algorithm are common to these ten problems. The main parameters of the algorithm are: number of generations: 100; population size: 250; mutation percentage: \(10\,\%\); minimum number of hidden nodes: 4; maximum number of hidden nodes: 14.

In order to set the minimum number of hidden neurons, a preliminary experiment was carried out with one partition of each dataset. A 5-fold cross-validation (using only the training split) was performed for each of the following minimum numbers of hidden nodes: \(\{1,2,4,...,20\}\). We concluded that the optimum minimum number of hidden nodes was 4, and we added 10 nodes for the maximum in order to give the EA some freedom to optimise the NN.

The idea behind having a small population size and a small number of generations is to give less importance to the evolutionary algorithm, because its computational time is much higher than that of the optimisation of the non-linear ordinal logistic regression. After finishing the EA, we took the SURBF hidden layer from the best model in the last generation to augment the input space and applied a gradient descent algorithm with 1000 epochs to optimise the non-linear ordinal logistic regression model.

Table 1. Characteristics of the ten datasets used for the experiments: number of instances (\(\mathrm {Size}\)), inputs (\(\# \mathrm {In.}\)), classes (\(\# \mathrm {Out.}\)) and patterns per-class (\(\# \mathrm {PPC}\))

The following two measures have been used for comparing the models:

  • CCR: The Correct Classification Rate (CCR) is the rate of correctly classified patterns: \(CCR = \frac{1}{N}\sum ^{N}_{i=1}\llbracket y^*_i=y_i \rrbracket \),  where \(y_i\) is the true label, \(y^*_i\) is the predicted label and \(\llbracket \cdot \rrbracket \) is a Boolean test which is 1 if the inner condition is true and 0 otherwise.

  • MAE: The Mean Absolute Error (MAE) is the average deviation (in number of categories) in absolute value of the predicted class from the true class [3]: \(MAE = \frac{1}{N}\sum ^{N}_{i=1}e(\mathbf {x}_i),\) where \(e(\mathbf {x}_i)=|\mathcal {O}(y_i)-\mathcal {O}(y^*_i)|\) is the distance between the true and the predicted ranks, with \(\mathcal {O}(\mathcal {C}_j)=j\). This is a way of evaluating the ordering performance of the classifier. A small sketch of both measures is given after this list.
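Both measures are straightforward to compute; a minimal sketch, assuming the labels are encoded as integers \(1, \ldots, J\) on the ordinal scale:

```python
import numpy as np

def ccr(y_true, y_pred):
    # fraction of exactly matched labels
    return np.mean(y_true == y_pred)

def mae(y_true, y_pred):
    # average absolute deviation in number of categories, with O(C_j) = j
    return np.mean(np.abs(y_true - y_pred))
```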

Table 2 shows the mean test value and standard deviation of the Correct Classification Rate (CCR) and the Mean Absolute Error (MAE) over the 10 models obtained.

Table 2. Generalization results obtained for benchmark datasets

To determine the statistical significance of the rank differences observed for each method in the different datasets, we have carried out a non-parametric Friedman test [10] with the CCR and MAE rankings of the best models as the test variables. For CCR, the test shows that the effect of the method used for classification is statistically significant at a significance level of 10 %, as the confidence interval is \(C_0=(0,F_{0.10}=1.97)\) and the F-distribution statistic value is \(F^*=6.78 \notin C_0\). For MAE, the test concludes the same, obtaining \(C_0=(0,F_{0.10}=1.97)\) as the confidence interval and \(F^*=4.73 \notin C_0\) as the statistic value. Consequently, we reject the null hypothesis stating that all algorithms perform equally in mean ranking for both variables.
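For reference, an F statistic of this form can be reproduced with the Iman-Davenport correction of the Friedman chi-square statistic (we assume this is the F-form used here, since it is the usual one in classifier comparisons; a sketch):

```python
import numpy as np
from scipy.stats import f as f_dist, rankdata

def friedman_f(scores, alpha=0.10):
    # scores: (N datasets x k methods), higher = better; rank 1 = best
    N, k = scores.shape
    ranks = np.array([rankdata(-row) for row in scores])
    R = ranks.mean(axis=0)                                    # mean rank per method
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)                 # Iman-Davenport
    crit = f_dist.ppf(1.0 - alpha, k - 1, (k - 1) * (N - 1))  # e.g. F_{0.10}
    return F, crit  # reject the null hypothesis when F > crit
```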

Based on this rejection, the Holm post-hoc test is used to compare all classifiers to each other using both CCR and MAE. The results of the Holm test for \(\alpha =0.10\) can be seen in Table 3, using the corresponding p and \(\alpha ^{'}_{Holm}\) values. From the results of this test, it can be concluded that SURBFOrdLR obtains a significantly better ranking in both CCR and MAE when compared to the remaining methods, which justifies the method proposed in this paper. Note that, as MAE is a metric that needs to be minimised, the best MAE ranking is the highest one.

Table 3. Comparison of p-values and \(\alpha ^{'}\) values for the Holm post-hoc non-parametric tests in CCR and MAE with \(\alpha =0.1\) (SURBFOrdLR is the control method)

6 Conclusions

This work proposes to improve an ordinal linear logistic regression model by transforming it into a non-linear one. To this end, the linear model is extended with non-linear covariates obtained from the outputs of the hidden neurons of a hybrid SURBFNN. These neural network models are trained using an evolutionary algorithm that optimizes both their architecture and weights.

Moreover, the coefficients of the ordinal logistic regression model, built on the initial covariates together with the SURBF non-linear outputs, are estimated by a gradient descent algorithm that maximises the likelihood.

Experiments show that this hybrid approach is promising and generally improves accuracy and ordering quality, performing better than the corresponding pure models. The model also obtains competitive results when compared to state-of-the-art ordinal classification algorithms.