
1 Introduction

Nonlinear regression modeling, i.e. estimating (smoothing, fitting) a continuous response variable based on a set of regressors (features, independent variables), plays a crucial role in the analysis of real data in a tremendous variety of applications. An important task of regression modeling is also to predict the future development of the response [6]. In practical applications, the nonlinear regression function is not known and is not assumed to be of any specific form. Recently, there has been an increasing trend towards applying machine learning methods to nonlinear regression modeling. In this paper, multilayer perceptrons (MLPs) and radial basis function (RBF) networks, i.e. two very important classes of feedforward artificial neural networks [10], are considered for the nonlinear regression task.

Real data across various disciplines, e.g. in numerous regression tasks of biomedicine, economics, engineering etc., are typically contaminated by the presence of outlying measurements (outliers). In some applications (e.g. in measurements of molecular genetic and metabolomic biomarkers [14]), outliers appear unavoidably, because severe measurement errors are inherent to the measurement technology. So far, most available applications of MLPs and RBF networks to regression tasks have not paid sufficient attention to the presence and influence of outliers; however, both these networks implicitly assume that the observed data are not contaminated by outliers [2, 26]. Therefore, it is highly desirable to consider alternative robust approaches to the training of MLPs and RBF networks. One direction of robustification is based on an intrinsically performed detection of outliers [1]. Another direction is inspired by the very rich experience of robust statistics with data contamination by outliers or anomalies (see [12]); this approach is the focus of the current paper.

While there are some robust approaches to training neural networks available, they are mostly tailor-made for the classification task; see ([17], p. 54) for a discussion. Let us mention at least a few available robust approaches for the regression task. Compositions of sigmoidal activation functions were considered in [18] to robustify the performance for a rather specific task, namely estimating a response that is almost constant over relatively large intervals. If subtractive clustering (SC) is used for an automatic recommendation of the center vectors, a robustified loss function may be subsequently used [26]; still, the popular SC approach remains vulnerable to outliers, and subsequent steps of the training cannot remedy this. A recent approach to outlier detection for regression RBF networks, denoted as the generalized edited nearest neighbor (ENN) algorithm, was developed in [17]; it was also combined with robust versions of the activation function. Robust loss functions based on the least trimmed squares and least trimmed absolute value estimators were investigated in [24, 25], where they outperformed standard training approaches on contaminated data. We do not agree with the formulas for the partial derivatives of the loss function published in [24], but this may not influence the results presented there, as practical computations typically exploit numerical approximations of derivatives (rather than relying on theoretical expressions). Nevertheless, even the extensive numerical computations in [25] do not compare robust neural networks with the (sophisticated and powerful) support vector regression.

The idea of applying a robust loss function in neural networks is extended in the current paper by means of the least weighted squares estimator, which represents a natural generalization of the least trimmed squares and turns out to be a promising and (possibly) highly robust tool for estimating parameters in linear regression. Section 2 recalls the least trimmed squares and least weighted squares estimators of parameters in linear regression and in the location model. Section 3 uses these estimators to propose novel robust versions of MLPs and RBF networks. Numerical examples presented in Sect. 4 illustrate the performance of the novel robust neural networks. Finally, Sect. 5 concludes the paper.

2 Highly Robust Estimation in Linear Models

This section recalls two (possibly highly) robust implicitly weighted estimators of parameters of the linear regression model (including the location model as a special case), namely the least trimmed squares and least weighted squares estimators. Highly robust estimators are defined as those which attain a high value of the breakdown point; this measure of robustness of a statistical estimator of an unknown parameter represents a fundamental concept of robust statistics [12]. Formally, the finite-sample breakdown point evaluates the minimal fraction of data that can drive an estimator beyond all bounds when set to arbitrary values.
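For completeness, a standard (replacement) formulation of this definition, not taken verbatim from [12], can be written for an estimator T and a dataset \(Z_n\) with n observations as

$$\begin{aligned} \varepsilon ^*_n(T, Z_n) = \min \left\{ \frac{m}{n}:~ \sup _{\tilde{Z}_n^{(m)}} \Vert T(\tilde{Z}_n^{(m)}) - T(Z_n)\Vert = \infty \right\} , \end{aligned}$$

where \(\tilde{Z}_n^{(m)}\) ranges over all datasets obtained from \(Z_n\) by replacing m of its observations by arbitrary values.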

The standard linear regression model has the form

$$\begin{aligned} Y_i = \beta _0 + \beta _1 X_{i1} + \cdots + \beta _p X_{ip} + e_i, \quad i=1,\dots ,n, \end{aligned}$$
(1)

with a continuous response \(Y_1,\dots ,Y_n\) explained by a total of p regressors, and independent and identically distributed (not necessarily Gaussian) random errors \(e_1,\dots ,e_n\).

The least trimmed squares (LTS) estimator [22, 23] of \(\beta \) represents a popular robust regression estimator with a high breakdown point. Consistency of the LTS and other properties were derived in [27]. The user must select the value of a trimming constant h (\(n/2 \le h < n\)). We will denote residuals corresponding to a particular \(b=(b_0,\dots ,b_p)^T \in \mathbbm {R}^{p+1}\) as

$$\begin{aligned} u_i(b) = Y_i - b_0 - b_1 X_{i1} - \cdots - b_p X_{ip} \end{aligned}$$
(2)

and order statistics of their squares as

$$\begin{aligned} u_{(1)}^2(b) \le \cdots \le u_{(n)}^2(b). \end{aligned}$$
(3)

The LTS estimator, formally obtained as

$$\begin{aligned} \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{b \in \mathbbm {R}^{p+1}} \frac{1}{n} \sum _{i=1}^h u^2_{(i)}(b), \end{aligned}$$
(4)

may attain a high robustness but cannot achieve a high efficiency. We may consider the LTS as an implicitly weighted estimator, namely as a special case of the least weighted squares with weights equal only to 0 or 1.

The least weighted squares (LWS) estimator (see e.g. [28]) for the model (1) represents a flexible and natural extension of the LTS. The LWS estimator, motivated by the idea of down-weighting potential outliers based on the ranks of their residuals, however remains much less known than the LTS. The LWS estimator may achieve a high breakdown point (with properly selected weights) and is robust to heteroscedasticity [28]. Its primary focus is on estimating \(\beta \) rather than on outlier detection. The LWS estimator with given magnitudes of weights \(w_1,\dots ,w_n\) is defined as

$$\begin{aligned} \mathbf{b}^{LWS}= (b_0^{LWS}, \dots , b_p^{LWS})^T = \mathop {\mathrm {arg}\,\mathrm {min}}\limits _{b \in \mathbbm {R}^{p+1}} \sum _{k=1}^n w_k u_{(k)}^2(b). \end{aligned}$$
(5)

The efficiency of the LWS is able to exceed the low efficiency of the LTS; if the data-dependent adaptive weights of [4] are used, the estimator asymptotically attains the full efficiency of the least squares. The LWS estimator has been successful in a variety of recent applications, including the denoising of gene expression measurements acquired by microarray technology [14] and image analysis based on landmarks measured in facial images [13]. Implicit weighting has also proven useful in multivariate robust estimation; the multivariate analogue of the LWS is the minimum weighted covariance determinant (MWCD) estimator proposed in [21].
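To make the definitions (4) and (5) concrete, the following sketch (our own illustrative Python/NumPy code, not part of any referenced implementation) evaluates both objective functions at a candidate coefficient vector b; computing the estimators themselves additionally requires minimizing these objectives over b, which is typically done by approximate algorithms.

```python
import numpy as np

def lts_lws_objectives(X, Y, b, h, w):
    """Evaluate the LTS objective (4) and the LWS objective (5) at a candidate b.

    X : (n, p) matrix of regressors, Y : (n,) response,
    b : (p+1,) candidate coefficients (intercept first),
    h : trimming constant of the LTS (n/2 <= h < n),
    w : (n,) non-increasing LWS weights (standardized to sum to one).
    """
    n = len(Y)
    u = Y - (b[0] + X @ b[1:])       # residuals (2)
    u2_sorted = np.sort(u ** 2)      # order statistics of squared residuals (3)
    lts = u2_sorted[:h].sum() / n    # LTS objective (4)
    lws = np.sum(w * u2_sorted)      # LWS objective (5)
    return lts, lws
```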

The location model represents an important special case of (1) in the form

$$\begin{aligned} Y_i = \mu + e_i \quad \text{ for }\quad i=1,\dots ,n, \end{aligned}$$
(6)

where \(\mu \in \mathbbm {R}\) represents a parameter of location (shift). In (6), the LWS estimator inherits the appealing properties of the LWS in (1). The LWS performed successfully on real data in (6), e.g. in the image analysis applications of [13], where the LWS estimator in (6) was also shown to correspond to the estimator with the smallest weighted variance; this allows a very efficient computation of the LWS in (6).

3 Robust Neural Networks with Implicitly Weighted Loss Functions

We consider the regression model

$$\begin{aligned} Y_i = f\left( X_i\right) + e_i, \quad i=1,\dots ,n, \end{aligned}$$
(7)

with an unknown nonlinear function f, where \(Y_1,\dots ,Y_n\) are values of a univariate continuous response and \(X_i \in \mathbbm {R}^p\) (with \(p \ge 1\)) is the vector of regressors corresponding to the i-th observation; this is thus a nonlinear regression setup with the response explained by means of p regressors. A novel robust tool for neural networks is proposed in this section, namely an MLP or an RBF network based on the loss function of the LWS estimator.

MLPs, which represent a very popular type of artificial neural networks, contain an input layer, one or more hidden layers with a fixed number of neurons, and an output layer. As we use the most standard form of multilayer perceptrons, we will not present their detailed model, as it can be found in numerous monographs (see e.g. [7, 9]). For a particular multilayer perceptron (with a selected architecture), let the fitted value of the response for the i-th measurement (i.e. the estimate of \(Y_i\)) be denoted by \(\hat{Y}_i\) for each \(i=1,\dots ,n\).

Let us start by describing the training of a standard MLP in a symbolic (general but very simplified) way in Algorithm 1. There, we denote the whole (say m-dimensional) vector of all parameters of a given MLP with a specified architecture as \(\theta \in \mathbbm {R}^m\). Denoting the estimated version of f obtained by the MLP as \(\hat{f}\), we may denote the vector of fitted values of Y by \(\hat{Y} = \hat{f}(\hat{\theta })\) and the vector of residuals, which depends on \(\hat{f}\), as \(u = Y - \hat{Y}\). Concerning the stopping rule in Algorithm 1, our computations use a default version implemented in [3]. Algorithm 1 is formulated in such a way that it remains valid also for a robust version of an MLP, as it considers a general loss function.

[Algorithm 1: symbolic scheme of training an MLP (standard or robust) by iterative updates of \(\theta \) minimizing a general loss function \(\ell \) of the residuals]
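Since Algorithm 1 is described only symbolically, the following Python sketch (with our own simplifying assumptions and illustrative names, not taken from [3]) shows the general scheme: a parameter vector theta is iteratively updated to decrease a general loss ell of the residuals, with gradients approximated numerically and a simplistic stopping rule.

```python
import numpy as np

def train_general_loss(theta0, f_hat, ell, Y, lr=0.01, max_iter=1000, tol=1e-8, eps=1e-6):
    """Symbolic version of Algorithm 1.

    theta0 : initial parameter vector (all weights and biases of the network stacked),
    f_hat  : function mapping theta to the vector of fitted values of the response,
    ell    : general loss applied to the residual vector u = Y - f_hat(theta).
    """
    theta = np.asarray(theta0, dtype=float).copy()
    previous = np.inf
    for _ in range(max_iter):
        objective = lambda t: ell(Y - f_hat(t))
        grad = np.zeros_like(theta)
        for j in range(theta.size):              # central-difference gradient
            e = np.zeros_like(theta)
            e[j] = eps
            grad[j] = (objective(theta + e) - objective(theta - e)) / (2 * eps)
        theta -= lr * grad                       # gradient step on the general loss
        current = objective(theta)
        if abs(previous - current) < tol:        # simplistic stopping rule
            break
        previous = current
    return theta
```

A standard MLP corresponds to ell = lambda u: np.sum(u ** 2); the robust versions below change only this single ingredient.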

The most common way of training MLPs minimizes the sum of squared prediction errors, i.e. uses the loss function

$$\begin{aligned} \ell = \ell (u_1,\dots ,u_n) := \sum _{i=1}^n u_i^2. \end{aligned}$$
(8)

This corresponds to least squares estimation in a location model. It is now natural to replace this quadratic loss function by one of the available robust alternatives (again for the location model). We consider the method of [24], denoted here as LTS-MLP; for a fixed h, it is defined by replacing (8) by

$$\begin{aligned} \ell := \sum _{i=1}^h u_{(i)}^2. \end{aligned}$$
(9)

We define a new version of the MLP, denoted as LWS-MLP, by choosing \(\ell \) in the form

$$\begin{aligned} \ell := \sum _{i=1}^n w_i u_{(i)}^2 \end{aligned}$$
(10)

for selected magnitudes of weights \(w_1,\dots ,w_n\). We always consider the natural standardization to \(\sum _{i=1}^n w_i =1\). We consider three particular choices, namely the LWSa-MLP with linear weights

$$\begin{aligned} w_i = \frac{2(n+1-i)}{n(n+1)}, \quad i=1,\dots ,n, \end{aligned}$$
(11)

LWSb-MLP with trimmed linear weights

$$\begin{aligned} w_i = \frac{h-i+1}{h} \mathbbm {1}[i \le h], \quad i=1,\dots ,n, \end{aligned}$$
(12)

where we consider \(h=\lfloor 3n/4 \rfloor \) with \(\lfloor x \rfloor = \max \{k\in \mathbbm {N};~ k \le x\}\) denoting the integer part of x, and finally LWSc-MLP with weights generated by the (strictly decreasing) logistic function

$$\begin{aligned} w_i = \left( 1+\exp \left\{ \frac{i-n-1}{n}\right\} \right) ^{-1}, \quad i=1,\dots ,n. \end{aligned}$$
(13)

While the LTS-MLP loss detects outliers and trims them away, the LWS-MLP does not; instead, it intrinsically arranges the observations according to their outlyingness and down-weights the less reliable ones.
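As a concrete illustration of (10)-(13), the following sketch (our own illustrative Python code) generates the three weight sequences and evaluates the LWS loss for a residual vector; the weights of (12) and (13) are rescaled here so that the standardization \(\sum _{i=1}^n w_i = 1\) holds.

```python
import numpy as np

def lwsa_weights(n):
    # Linear weights (11); these already sum to one.
    i = np.arange(1, n + 1)
    return 2.0 * (n + 1 - i) / (n * (n + 1))

def lwsb_weights(n, h=None):
    # Trimmed linear weights (12) with the default h = floor(3n/4), rescaled to sum to one.
    if h is None:
        h = (3 * n) // 4
    i = np.arange(1, n + 1)
    w = np.where(i <= h, (h - i + 1) / h, 0.0)
    return w / w.sum()

def lwsc_weights(n):
    # Weights (13) generated by a strictly decreasing logistic function, rescaled to sum to one.
    i = np.arange(1, n + 1)
    w = 1.0 / (1.0 + np.exp((i - n - 1) / n))
    return w / w.sum()

def lws_loss(u, w):
    # LWS loss (10): squared residuals sorted in ascending order and weighted by w.
    # The LTS loss (9) is the special case with w_i = 1 for i <= h and w_i = 0 otherwise.
    return np.sum(w * np.sort(u ** 2))
```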

Another alternative version denoted here as LTA-MLP was defined in [24], where a robust loss function corresponding to the least trimmed absolute value (LTA) estimator was used. LTA-MLP is defined for a fixed h (\(n/2 \le h < n\)) by means of

$$\begin{aligned} \ell := \sum _{i=1}^h |u_{(i)}| \end{aligned}$$
(14)

and according to [24] yields very similar results to those of LTS-MLP. The LTA estimator remains practically unknown in the robust statistics community and is not sufficiently discussed in most monographs on robust estimation [12]. Although we are not aware of a systematic numerical comparison of the LTA estimator with other robust estimators in linear regression, it has been claimed that its performance is very similar to that of the LTS; possible improvements of the LTA over the LTS are known to be at most marginal (see p. 429 of [29]). Still, the LWS estimator seems much more promising in terms of both robustness and efficiency, as repeatedly discussed in [5, 28].

Radial basis function (RBF) networks represent another important class of neural networks. They contain an input layer with p inputs, a single hidden layer with N RBF units (neurons), and a linear output layer. The user chooses N together with a radially symmetric function denoted here as \(\rho \). The RBF network is also based on minimizing (8); using the Gaussian density as \(\rho \), the residuals can be expressed as

$$\begin{aligned} u_i = Y_i - \sum _{j=1}^N a_j\rho (||X_i-c_j||), \quad i=1,\dots ,n, \end{aligned}$$
(15)

with parameters \(c_1,\dots ,c_N \in \mathbbm {R}^p\) and \(a_1,\dots ,a_N \in \mathbbm {R}\), and possibly with other parameters corresponding to \(\rho \). We refer to [10, 16] for a detailed description of RBF networks. The training of RBF networks can be expressed analogously to Algorithm 1 for MLPs, by means of minimizing the sum of squared residuals.
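For illustration, the fitted values entering the residuals (15) can be computed as in the following sketch for a Gaussian \(\rho \); the width parameter sigma and the function name are our own illustrative choices (sigma being one of the additional parameters of \(\rho \) mentioned above).

```python
import numpy as np

def rbf_predict(X, centers, a, sigma=1.0):
    """Output of an RBF network with N Gaussian units, as used in (15).

    X       : (n, p) matrix of regressors,
    centers : (N, p) matrix of center vectors c_1, ..., c_N,
    a       : (N,) output-layer coefficients a_1, ..., a_N,
    sigma   : width of the Gaussian radial function.
    """
    # Squared Euclidean distances ||X_i - c_j||^2 for all pairs (i, j)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    rho = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian radial basis function
    return rho @ a                           # fitted values: sum_j a_j * rho(||X_i - c_j||)
```

The residuals (15) are then Y - rbf_predict(X, centers, a), and the robust networks described next only change the loss applied to these residuals.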

Robust versions of RBF networks, which will be denoted here as LTS-RBF, LWSa-RBF, LWSb-RBF, or LWSc-RBF networks, will be defined by means of the loss functions above. In other words, they are obtained by replacing the quadratic loss (8), evaluated over the residuals (15), by the loss functions of the LTS or LWS estimators.

We implemented all the robust neural networks in Keras [3]. The implementation exploits a back-propagation algorithm, namely a stochastic gradient descent method, i.e. the same approach as in [24, 25], for the optimization of all parameters of both standard and robust MLPs as well as RBF networks. As our experiments have demonstrated, the loss functions of the LWS-MLP and LWS-RBF networks are also smooth enough in practice for this gradient-based approach.
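The following sketch indicates how such a custom loss can be plugged into Keras; it is our own minimal illustration (the layer sizes, learning rate, and full-batch setting are assumptions, not the exact configuration used in the experiments). Full-batch training (batch_size equal to n) is assumed so that the sorted squared residuals within a batch match the fixed weight sequence.

```python
import numpy as np
import tensorflow as tf

n, p = 101, 1                                   # e.g. the size of the simulated dataset (B)

# Trimmed linear weights (12) with h = floor(3n/4), standardized to sum to one
h = (3 * n) // 4
i = np.arange(1, n + 1)
w_b = np.where(i <= h, (h - i + 1) / h, 0.0)
w_b = w_b / w_b.sum()

def make_lws_keras_loss(weights):
    # Keras-compatible LWS loss: sort squared residuals and apply the fixed weights.
    w = tf.constant(np.asarray(weights), dtype=tf.float32)

    def loss(y_true, y_pred):
        u2 = tf.square(tf.reshape(y_true, [-1]) - tf.reshape(y_pred, [-1]))
        return tf.reduce_sum(w * tf.sort(u2, direction='ASCENDING'))

    return loss

# An illustrative LWSb-MLP: one hidden layer with sigmoid neurons, a linear output layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='sigmoid', input_shape=(p,)),
    tf.keras.layers.Dense(1, activation='linear'),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=make_lws_keras_loss(w_b))
# model.fit(X, Y, batch_size=n, epochs=500)     # full-batch training; X: (n, p), Y: (n,)
```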

4 Numerical Experiments

The aim of the computations over 1 simulated and 3 real datasets is to illustrate the performance of the novel robust neural networks and compare it with other nonlinear regression tools.

Fig. 1 Dataset Eckerle4. Horizontal axis: the regressor. Vertical axis: the response. The curve corresponds to the standard MLP (left) and the LTS-MLP with \(h=\lfloor 3n/4 \rfloor \) (right)

Fig. 2 The simulated dataset. Horizontal axis: the regressor. Vertical axis: the response. The curve corresponds to the standard RBF network (left) and the LTS-RBF network with \(h=\lfloor 3n/4 \rfloor \) (right)

4.1 Data Description

  1. (A) The so-called Eckerle4 dataset, publicly available in the package NISTnls of R software [20], has \(p=1\) regressor and \(n=35\) observations, including one apparent outlier. In Fig. 1, this real dataset is presented together with the fitted trend, estimated by a standard MLP as well as by the LTS-MLP.

  2. (B) A simulated dataset obtained by means of a sine function with a (rather artificial) contamination by a linearly decreasing trend, with \(p=1\) and \(n=101\). The dataset is presented in Fig. 2, together with the estimated trend, obtained by a standard RBF network as well as by the LTS-RBF network.

  3. (C) The Auto MPG dataset [8] with \(p=4\) continuous regressors and \(n=392\) observations, obtained after omitting all observations with missing values (i.e. observations with indices 33, 127, 331, 337, 355, and 375) from the original dataset. The consumption of each car in miles per gallon (MPG) is considered here as a response explained by engine displacement, horsepower, weight, and acceleration.

  4. (D) The Boston Housing dataset [8] with \(p=11\) continuous regressors (omitting features 4, 7, and 9 from the original dataset) and \(n=506\) observations. The per capita crime rate by town (i.e. in each individual location) is considered as the response variable here.

4.2 Methods

The following methods will be used in the computations. For the description of standard machine learning methods, the reader may refer to monographs [9, 10].

  • RBF network. The number N of RBF units used in particular examples is specified in Table 1.

  • LTS-RBF network with the same architecture as the plain RBF network and \(h = \lfloor 3n/4 \rfloor \).

  • LWS-RBF (i.e. LWSa-RBF, LWSb-RBF, LWSc-RBF) networks with the same architecture as the plain RBF network.

  • MLP with 1 or 2 hidden layers as specified in Table 1 for particular examples, together with the number of neurons in these layers. In every example, a sigmoid activation function is considered in every hidden layer. A linear output layer is always used.

  • LTS-MLP with the same architecture as the plain MLP and \(h = \lfloor 3n/4 \rfloor \).

  • LWS-MLP (i.e. LWSa-MLP, LWSb-MLP, LWSc-MLP) with the same architecture as the plain MLP.

Three different measures of prediction error are evaluated for each situation within a ten-fold cross-validation study, performed in a standard way. Because the standard MSE suffers from the presence of outliers in the data, we also consider the trimmed MSE (TMSE) and weighted MSE (WMSE) defined formally as

$$\begin{aligned} \mathsf{MSE}= \frac{1}{n} \sum _{i=1}^n r_i^2, \quad \mathsf{TMSE}= \frac{1}{h} \sum _{i=1}^h r_{(i)}^2, \quad \mathsf{WMSE} = \sum _{i=1}^n w_i r_{(i)}^2, \end{aligned}$$
(16)

where \(r_i=Y_i-\hat{Y}_i\) are prediction errors and \(\hat{Y}_i\) denotes the fitted value of the i-th observation for \(i=1,\dots ,n\). For TMSE, we choose h as the integer part of 3n/4, and the squared prediction errors are arranged as \(r_{(1)}^2 \le \cdots \le r_{(n)}^2\). WMSE requires fixed non-increasing magnitudes of weights; here, we use the trimmed linear weights (12) standardized so that \(\sum _{i=1}^n w_i=1\).
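The three error measures (16) can be computed, for example, as in the following sketch (an illustration with our own function name), where r is the vector of prediction errors of a fitted model on the test fold.

```python
import numpy as np

def prediction_error_measures(r):
    """MSE, TMSE and WMSE of (16) for a vector r of prediction errors."""
    n = len(r)
    h = (3 * n) // 4                      # integer part of 3n/4
    r2_sorted = np.sort(r ** 2)           # r_(1)^2 <= ... <= r_(n)^2
    mse = np.mean(r ** 2)
    tmse = r2_sorted[:h].sum() / h
    i = np.arange(1, n + 1)               # trimmed linear weights (12), standardized to sum to one
    w = np.where(i <= h, (h - i + 1) / h, 0.0)
    w = w / w.sum()
    wmse = np.sum(w * r2_sorted)
    return mse, tmse, wmse
```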

Table 1 Results of numerical experiments. Three error measures (MSE, TMSE and WMSE) defined in (16) evaluated for various nonlinear regression methods for 4 datasets. The architectures (number of RBF units and neurons in hidden layers) are specified here for various versions of RBF networks and MLPs, respectively

4.3 Results

The results for standard as well as robust neural networks with the selected architectures and parameters are presented in Table 1. The number N of RBF units for all versions of RBF networks was selected as the most suitable one for plain RBF networks. The number of neurons in the hidden layers for all versions of MLPs was selected as the most suitable for plain MLPs.

The dataset Eckerle4 is the simplest of the 4 datasets under consideration (apart from containing one apparent outlier). Results over the two datasets with \(p=1\) are illustrated in Figs. 1 and 2. For these datasets, TMSE is able to ignore the true outliers for robust as well as plain neural networks. This is because the regression task is not so difficult for these datasets and the outliers are exactly those points with large absolute residuals. The situation becomes much more complex for the other datasets.

In all examples, robust versions of neural networks are able to yield smaller values of the robust prediction errors (TMSE and WMSE); this holds in spite of the fact that the architecture of the networks was optimized for the plain versions. On the other hand, standard versions of neural networks are superior in terms of the conventional MSE. This does not mean that the robust methods are less suitable, because the MSE itself is vulnerable to the presence of outliers; thus, only robust versions of the MSE should be considered for data contaminated by outliers.

Comparing RBF networks with MLPs, RBF networks turn out to yield smaller values of the prediction errors for all 4 datasets. It is especially interesting that the superiority of robust neural networks over standard (non-robust) ones is revealed for the two datasets with \(p>1\) coming from real applications. Basically, we can say that using (any) robust neural network brings benefits, while the results of the LWSb-RBF network are not outperformed by any other method on the 4 datasets.

5 Conclusions

Robust alternatives to training neural networks are highly desirable because of the vulnerability of common types of neural networks to the presence of outliers in the data. We use the loss functions of the highly robust LTS and LWS estimators to formulate robust versions of MLPs and RBF networks. Thus, we extend the idea of [24], who used the loss function of the LTS (only) within MLPs. To the best of our knowledge, our approach is the first application of the LWS estimator within neural networks. The novel methods assign implicit weights to individual observations according to their outlyingness, which offers a possible interpretation of individual observations and of their influence on the resulting estimated trend. Robust fitting of neural networks based on the loss function of the least weighted squares estimator is able to minimize robust measures of prediction error. The methods denoted as LWSb-MLP and LWSb-RBF networks, i.e. those with trimmed linear weights, turn out to yield better results in terms of prediction accuracy compared to other choices of weights for the LWS loss.

The superior results of the neural networks based on the LWS estimator are in correspondence with recent findings of [15]. There, the LWS turned out to outperform other estimators in linear regression, including S-estimators and especially MM-estimators, where the latter allow their parameters to be tuned so that a high robustness and a high efficiency are reached simultaneously.

The robust neural networks considered in this paper appear suitable for all 4 datasets analyzed here and are thus recommendable for real datasets where robustness to data contamination by outliers is desirable. All datasets analyzed here do contain outliers. If a new dataset that does not seem to contain apparent outliers is to be analyzed, the strategy common in linear regression may be adopted for neural networks as well; namely, the novel robust neural networks may serve as a diagnostic tool. In such a situation, the user may check whether the results of a standard neural network are similar to the results of the robust ones; in case of remarkable discrepancies, the robust approach may be more suitable. As a limitation, however, it is necessary to state that the robust neural networks of this paper (just like any robust statistical method [12]) may not be suitable for certain datasets, e.g. when we are interested in every individual observation and ignoring specific observations (or their clusters) is not desirable.

Several possible directions recommendable for future research include adapting robust neural networks for heteroscedastic data, proposing an adaptive selection of h for the LTS-based loss function, considering robust and regularized neural networks, or proposing adaptive (data-dependent) selection of weights for the LWS-based loss. In addition, it would be desirable to perform a systematic comparison of robust approaches to training neural networks over a larger number of datasets, accompanied by a detailed statistical analysis of the data and by a thorough interpretation of the results on the level of individual observations.