1 Introduction

When dealing with real-world problems, some degree of uncertainty can rarely be avoided. In modeling physical or social systems, whether for further understanding or as a guide for decision processes, handling uncertainty is a critical issue. Uncertainty has been formalized in different ways, leading to several uncertainty theories (Klir and Smith 2001; Klir 2006; Wong 1993; Vigo 2013; IEEE 1993; Zhang et al. 2014). Ambiguity, that is, uncertainty with unknown probabilities, is also a subject of concern in decision problems (Inukai and Takahashi 2009; Christensen 2013), information retrieval under queries (Roul and Sahay 2012; Clarke et al. 2009), parameter identification (Reppa et al. 2014) and dynamical systems reconstruction and control (Yan and Wang 2014; Hao and Jagannathan 2013; Alfaro-Ponce et al. 2014). Here, we are concerned with uncertainty in the construction of models from observed data. In this context, uncertainty may arise either from inaccuracies in the measurement of the observed variables or from the fact that the measured variables do not provide a complete specification of the behavior of the system. Our main concern in this paper is the latter situation, that is, the case of ambiguous data.

There are several models of distributed learning systems that can be used to reconstruct functional relationships between variables. It has been shown that they are all basically equivalent (Doyne 1990) and that, among them, neural networks, with or without adjustable node parameters (Dente and Vilela 1996), are capable of learning deterministic relations or of extracting the characteristic parameters of stochastic processes (Dente and Vilela 1997). Their computational power is also at least as broad as that of a large class of symbolic languages (Martins and Mendes 2001). Here, feedforward networks are used to implement our ambiguity detection algorithm, but other classifier modules might of course be used as well. The claimed novelty of our approach is not the neural network architecture, but its use in a global algorithm which, in particular, identifies and quantifies the degree of ambiguity in each region of the input variable space.

In the context of constructing models of physical phenomena with neural networks, the distinct problem of learning from data with error bars has been addressed before by several authors (see, for example, Gernoth and Clark 1995; Gabrys and Bargiela 1999; Cawley et al. 2007; Huang et al. 2012; Alippi et al. 1995). Here, however, we are concerned not with inaccuracies in the input data but with the fact that the observed variables might not completely specify the output, that is, some essential variables may not be accessible as inputs. Of special interest is the characterization of the ambiguity across the subregions of the input space, in order to know in which regions the output is or is not reliable. Although mostly concerned with ambiguous data rather than with smoothing out noisy data, that is, with “incomplete” rather than “inaccurate” measurements, the system developed here can also be used to estimate the degree of noisiness in the data.

The ambiguity situation is a rather complex one because, in general, the uncertainty is not uniform throughout the input variable space. There might be regions where the input variables provide an unambiguous answer and others where they are not sufficient to provide a precise answer. For example, in credit scoring, which we will use here as an example, the “no income, no job, no asset” situation is a clear sign of no credit reliability, but most other situations are not so clear-cut. Therefore, it is desirable to develop a method that, for each region of the input variable space, provides the most probable outcome and, at the same time, tells us how reliable the result is.

The system consists of two coupled networks, one to learn the most probable output value for each input and the other to provide the expected error (or variance) of the result for that particular input. The first network converges to an average of the target values in each region of the input space, and the second to the expected uncertainty (or ambiguity) in that region. One finds in practice, and in our examples, that the ambiguity varies greatly from region to region of input space. Therefore, it would not make sense to simply compute the sample variance of the whole data set. The second network does indeed compute a sample variance, but does so for each region of the input variable space and, because of the interpolating features of the network, does it more accurately than if, for example, we were to divide the input space arbitrarily into subregions for a numerical computation. In short, the main idea of our system is not to smooth out fluctuations in the data to obtain an approximate output. Instead, it is to characterize the ambiguity of the answer and, in particular, to quantify this ambiguity for each region of input space.

The ambiguity problem in model reconstruction has been addressed in the past by other authors. For example, in the context of fuzzy models, ambiguity is dealt with by increasing the number of fuzzy sets or by changing the membership function from bell-shaped to trapezoid-shaped surfaces (Cox 2005). The lattice basis reduction used by some authors (Svendsen 2003) to assign an approximate output to ambiguous inputs is similar to the average value output of our first network, but lacks the quantitative estimate of ambiguity provided by the second network. In other approaches, a collection of classifiers and learning algorithms is used to arbitrate between the results, using the most reliable classifier for each subdomain (Ortega et al. 2001; dos Santos et al. 2007). This, of course, is useful if there is a domain-specific adequacy of the classifiers, but does not help if the ambiguity is intrinsic to the data. In other cases, the most ambiguous data subsets are gathered into clusters (Lin et al. 2006) and the classifiers retrained (Albalate et al. 2010) or the input data resampled (Bailey-Kellogg and Ramakrishnan 2001) in these domains. Again, this helps only if new input variables can be used in the ambiguity regions, which is not always possible. For example, in a credit scoring problem, if all known socio-economic parameters are already fed into the system, what else can we use? And in this particular case, at least, a great deal of ambiguity is known to exist. In our system, we simply aim at detecting the ambiguity and quantifying it in each region of the input space.

For definiteness, the system is formalized as the problem of learning random functions in the next section. Then, we study two application examples, the first being the measurement of track angles by straw chambers in high-energy physics and the other a credit scoring model.

2 Learning the average and variance of random functions

The general setting to be analyzed is the following.

The signal to be learned is a random function \(\theta ( \vec {X}) \) with distribution \(F_{\vec {X}}( \theta )\). For simplicity we consider \(\theta \) to be a scalar and the set \(\{ \vec {X}\} \) to be vector-valued, \(\vec {X}\in {\mathbb {R}}^{i}\). Notice that we allow for different distribution functions at different points.

In the straw chamber example, to be discussed later, \(\vec {X}\) would be the set of delay times and \(\theta \) the track angle. For the credit score example, \(\vec {X}\) would be the set of client parameters and \(\theta \) the credit reliability.

In our learning system, the \(\vec {X}\) values are inputs to a multilayer (feedforward) network, \(\{ W\}\) denoting the full set of connection strengths, the output being \(Y( \vec {X}) =f_{W}( \vec {X} )\). The aim is to choose a set of connection strengths \(\{ W\} \) that annihilates the expectation value

$$\begin{aligned} {\mathbb {E}}\left\{ \sum _{\left\{ \vec {X}\right\} }\left( f_{W}\left( \vec {X}\right) -\theta \left( \vec {X}\right) \right) ^{2}\right\} =0 \end{aligned}$$
(1)

However, what the backpropagation algorithm does, for example, is to minimize \( {\mathbb {E}}( f_{W}( \vec {X}) -\theta ( \vec {X}) ) ^{2}\) for each realization of the random variable \(\vec {X}\). Hence, let us fix \(\vec {X}\) and consider \(f_{W}( \vec {X}) \) evolving in learning time. That is, we consider, within the learning process, the subprocess corresponding to the sampling of one particular fixed region of the variables. Then, the time evolution of the network output is given by

$$\begin{aligned} f_{W}\left( \vec {X},\tau +1\right) =f_{W}\left( \vec {X},\tau \right) -2\eta \left( f_{W}\left( \vec {X},\tau \right) -\theta \left( \vec {X}\right) \right) \frac{\partial f_{W}}{\partial W}\cdot \frac{\partial f_{W}}{\partial W} \end{aligned}$$
(2)

where \(\Delta W=-\eta \frac{\partial e}{\partial W}\), \(\eta \) being the learning rate and \(e=( f_{W}( \vec {X}) -\theta ( \vec {X}) ) ^{2}\) the error function. Let the input random variable \(\theta ( \vec {X}) \) at learning step \(\tau \) be modeled as

$$\begin{aligned} \theta \left( \vec {X}\right) _{\tau }=\overline{{ \theta }}\left( \vec {X}\right) +B\left( \vec {X},\tau \right) \end{aligned}$$
(3)

\(\overline{{\theta }}( \vec {X}) \) being the expectation (average) value of the variable and \(B( \vec {X},\tau ) \) a zero-mean Wiener process. Then, taking expectation values in Eq. (2), and because the \(\frac{\partial f_{W}}{\partial W}\) quantities are deterministic functions, one obtains

$$\begin{aligned} {\mathbb {E}}\left[ f_{W}\left( \vec {X},\tau +1\right) \right] ={\mathbb {E}}\left[ f_{W}\left( \vec {X},\tau \right) \right] -2\eta \left( {\mathbb {E}}\left[ f_{W}\left( \vec {X},\tau \right) \right] -\overline{\theta }\left( \vec {X}\right) \right) \frac{\partial f_{W}}{\partial W}\cdot \frac{\partial f_{W}}{\partial W} \end{aligned}$$
(4)

Notice that we are now dealing with two time scales: the time scale of the \( \theta ( \vec {X}) \) random variable and the time scale of the learning process, controlled by the learning rate \(\eta \). If the learning rate \(\eta \) is sufficiently small, so that the learning process evolves much more slowly than the sampling of the \(\theta ( \vec {X}) \) random variable, the fluctuations are averaged out and the last equality may be approximated by

$$\begin{aligned} f_{W}\left( \vec {X},\tau +1\right) =f_{W}\left( \vec {X},\tau \right) -2\eta \left( f_{W}\left( \vec {X},\tau \right) -\overline{\theta }\left( \vec {X}\right) \right) \frac{\partial f_{W}}{\partial W}\cdot \frac{\partial f_{W}}{\partial W} \end{aligned}$$
(5)

A fixed point is obtained at

$$\begin{aligned} f_{W}\left( \vec {X}\right) =\overline{\theta }\left( \vec {X}\right) , \end{aligned}$$
(6)

the average value of the random variable \(\theta \) at the argument \(\vec {X}\). In practice, the convergence to the average of the objective variable is better achieved by making \(\eta \) converge slowly to zero during the learning process.
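To see the fixed point (6) at work in the simplest possible case, consider a "network" reduced to a single weight \(W\) with constant output \(f_{W}=W\), trained on noisy samples of \(\theta \). The following toy sketch (our own construction, not from the paper) also illustrates the slowly vanishing learning rate; with the schedule \(2\eta _{\tau }=1/\tau \), the update reduces exactly to a running average of the samples:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_mean, theta_std = 2.0, 0.5     # hypothetical mean and fluctuation of theta

W = 0.0                              # single-weight "network": f_W(X) = W
for tau in range(1, 50001):
    theta = theta_mean + theta_std * rng.standard_normal()  # noisy sample theta(X)_tau
    eta = 0.5 / tau                  # learning rate converging slowly to zero
    W -= 2.0 * eta * (W - theta)     # Delta W = -eta de/dW, with e = (W - theta)^2
print(W)                             # approaches theta_mean, the fixed point of Eq. (6)
```

With this particular schedule, the recursion \(W_{\tau +1}=(1-1/\tau )W_{\tau }+\theta _{\tau }/\tau \) is exactly the sample mean of the \(\theta \) values seen so far, which makes the convergence to the average transparent.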

Similarly, if a second network [with output \(g_{W^{\prime }}( \vec {X})\)] and the same input \(\vec {X}\) is constructed according to the learning law

$$\begin{aligned} g_{W^{\prime }}\left( \vec {X},\tau ^{\prime }+1\right) =g_{W^{\prime }}\left( \vec {X},\tau ^{\prime }\right) + \frac{\partial g_{W^{\prime }}}{\partial W^{\prime }}\cdot \Delta W^{\prime } \end{aligned}$$
(7)

with error function

$$\begin{aligned} e^{\prime }=\left( g_{W^{\prime }}\left( \vec {X} \right) -\left( f_{W}\left( \vec {X}\right) -\theta \left( \vec {X}\right) \right) ^{2}\right) ^{2} \end{aligned}$$
(8)

and \(\Delta W^{\prime }=-\eta ^{\prime }\frac{\partial e^{\prime }}{\partial W^{\prime }}\), then

$$\begin{aligned} g_{W^{\prime }}\left( \vec {X},\tau ^{\prime }+1\right) =g_{W^{\prime }}\left( \vec {X},\tau ^{\prime }\right) -2\eta ^{\prime }\left( g_{W^{\prime }}\left( \vec {X},\tau ^{\prime }\right) -\left( f_{W}\left( \vec {X}\right) -\theta \left( \vec {X}\right) \right) ^{2}\right) \frac{\partial g_{W^{\prime }}}{\partial W^{\prime }}\cdot \frac{\partial g_{W^{\prime }}}{\partial W^{\prime }} \end{aligned}$$

and, under the same assumptions as before concerning the smallness of the learning rates, \(g_{W^{\prime }}( \vec {X}) \) has the fixed point

$$\begin{aligned} g_{W^{\prime }}\left( \vec {X}\right) =\overline{{ \left( \theta \left( \vec {X}\right) -\overline{{\theta }}\left( \vec {X}\right) \right) ^{2}}} \end{aligned}$$
(9)

In conclusion: The first network reproduces the average value of the random function \(\theta \) for each input \(\vec {X}\), and the second one, receiving as data the errors of the first, reproduces the variance of the function at \(\vec {X}\). Instead of the variance, the second network might as well be programmed to learn the expected value of the absolute error \({\mathbb {E}}\vert \theta ( \vec {X}) -\overline{\theta }( \vec {X}) \vert \). Actually, for numerical convenience, we use this alternative in the examples of the next section. Figure 1 is a schematic representation of the learning process.
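The scheme lends itself to a compact implementation. Below is a minimal sketch in Python/NumPy: two one-hidden-layer tanh networks trained by plain stochastic backpropagation, the first on the targets and the second on the absolute errors of the first. The architecture sizes, learning-rate schedules and the toy region-dependent random function are our own illustrative assumptions, not the settings of the examples below:

```python
import numpy as np

rng = np.random.default_rng(1)

def init(n_in, n_hid):
    # one-hidden-layer tanh network with linear output
    return {"W1": rng.normal(0, 0.5, (n_hid, n_in)), "b1": np.zeros(n_hid),
            "W2": rng.normal(0, 0.5, n_hid),         "b2": 0.0}

def forward(p, x):
    h = np.tanh(p["W1"] @ x + p["b1"])
    return p["W2"] @ h + p["b2"], h

def sgd_step(p, x, target, eta):
    # one backpropagation step on e = (f(x) - target)^2
    y, h = forward(p, x)
    d = 2.0 * (y - target)
    p["W2"] -= eta * d * h
    p["b2"] -= eta * d
    dh = d * p["W2"] * (1.0 - h**2)
    p["W1"] -= eta * np.outer(dh, x)
    p["b1"] -= eta * dh
    return y

# Toy ambiguous data: theta(x) = sin(pi*x) plus noise whose size depends on x,
# mimicking a region-dependent ambiguity (small for x < 0, large for x > 0)
def sample():
    x = rng.uniform(-1, 1, size=1)
    sigma = 0.05 if x[0] < 0 else 0.4
    return x, np.sin(np.pi * x[0]) + sigma * rng.standard_normal()

net_f = init(1, 10)   # learns the average of theta(x)
net_g = init(1, 10)   # learns the expected absolute error of net_f

for t in range(200000):
    x, theta = sample()
    eta = 0.05 / (1.0 + 1e-4 * t)       # slowly decreasing learning rate
    y = sgd_step(net_f, x, theta, eta)
    if t > 50000:                       # start net_g after net_f has settled
        sgd_step(net_g, x, abs(y - theta), 0.5 * eta)   # eta' < eta

for x0 in (-0.5, 0.5):
    x = np.array([x0])
    print(x0, forward(net_f, x)[0], forward(net_g, x)[0])
# net_g's output should be small near x = -0.5 and large near x = 0.5
```

The delayed start of the second network and the smaller rate \(\eta ^{\prime }\) anticipate the practical remarks that follow.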

In practice, the training of the second network should start after that of the first because, before the first one begins to converge, its errors are not representative of the fluctuations of the random function. In general, it seems reasonable to have \(\eta ^{\prime }( t) <\eta ( t)\), with \(\eta ( t) \) decreasing in time.

Implicit in the derivation sketched above is the assumption that we are already in the basin of attraction of the global minimum of the cost functions. In practice, the existence of local minima is an issue to be taken into account in all modeling and optimization problems. To avoid convergence to local minima, one may use occasional random perturbations. Our approach, however, has been to run the algorithm several times, starting from different initial conditions for the neural network parameters.

Because in both networks one wants convergence to average values of the target functions, a critical issue is also to avoid overfitting in the design of the networks. Because our examples use networks with one hidden layer, the number of neurons in the hidden layer is the main parameter of concern. In general, for a learning machine, the ideal situation is to have a Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis 1971; Vapnik 2000) equal to the number of independent functions that one wants to discriminate or, in a classification setting, the number of points that one wants to shatter. Methods have been developed to estimate the VC dimension of a learning machine (Vapnik et al. 1994; Bartlett and Maass 2003). In the spirit of the final prediction error criterion (Alippi 1999), we use here a simple approach to estimate the right number of neurons in the hidden layer. In “Appendix”, we show how the mean square error, after training of the networks in the two examples, evolves when the number of neurons in the hidden layer changes. One sees that, for the scoring case, 14 hidden layer neurons seem to be an appropriate number to obtain a good fit without overfitting and, similarly, for the straw chamber a number between 10 and 15 is adequate. Several other methods have been proposed in the literature to choose the number of hidden layer neurons. A popular one is the clipping method, in which the synapses with the smallest strengths are suppressed during the learning process. As we have found out, this method is not very effective when there is a high level of ambiguity in the data. Therefore, the control of the mean square error seems more appropriate in this case.
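For illustration, this hidden-layer scan can be sketched as follows, with scikit-learn's MLPRegressor standing in for the backpropagation networks actually used; the synthetic data and the candidate sizes are assumptions made for the sketch:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (5000, 4))                        # stand-in input data
y = np.sin(np.pi * X[:, 0]) + 0.2 * rng.standard_normal(5000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for n_hidden in (2, 5, 10, 14, 20, 30, 50):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="tanh",
                       max_iter=2000, random_state=0).fit(X_tr, y_tr)
    mse = np.mean((net.predict(X_te) - y_te) ** 2)
    print(n_hidden, mse)   # pick the size where the test MSE stops improving
```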

Fig. 1  A schematic representation of the learning process. The two-network system, given inputs \(X_{i}\) and target values \(\theta _{k}\), learns the average values \(\overline{Y_{k}}( X_{i})\) and the average errors \(\overline{\vert Y_{k}-\theta _{k}\vert }\) for each set \(\{ X_{i}\}\) of input values

Finally, for the scoring example we have used 14 hidden layer neurons and, for the straw chamber, both 14 and 25. The test with 25 is included to reproduce the setting of Denby et al. (1990), who use 25 hidden layer neurons.

3 Examples

3.1 Measuring track angles by straw chambers

One of the first applications of neural networks to the processing of high-energy physics data (Denby 1999) was the work by Denby et al. (1990) on the slopes of particle tracks in straw tube drift chambers. In a straw chamber (Fig. 2), each wire receives a signal delayed by a time proportional to the distance of closest approach of the particle to the wire.

Fig. 2  A particle track through a straw chamber. The input values to the neural networks are the delay times, proportional to the distances of the particle to the wires

The neural network receives these times as inputs \(\{ \vec {X}\} \), with as many inputs as the number of wires, and, for the training, the track angle \(\theta ( \vec {X}) \) is the target function. The half-cell shift of alternate layers in the straw chamber resolves some of the left–right ambiguities, but the ambiguity still remains for many directions (Fig. 3).

Fig. 3  An example with two different beam track angles generating the same input signal

The authors of Denby et al. (1990) required the training and test events to pass through at least four straws to avoid edge effects. Nevertheless, they consistently find large non-Gaussian error tails when testing the trained network. The authors did not separate the contribution to the tails coming from the ambiguities from that arising from possible inadequacies in training or network architecture. We have repeated the simulations, and our results essentially reproduce those of Denby et al. (1990), showing that the non-Gaussian tails do indeed originate from the left–right ambiguities. If edge effects are allowed for, by including in the training set events that pass through fewer than four straws, the degree of ambiguity and the tails increase even further. The important role of persistent fat tails in the error response of a learned system as a symptom of data ambiguity will be discussed later.

This is therefore a typical instance of the situation described in the introduction, where some regions of the input data correspond to a unique event while others have an ambiguous identification. It is also a pure example of ambiguous data, in the sense that the signals fed to the networks have no noise component. As the example shows, it is not easy to separate the ambiguous regions from the non-ambiguous ones because they are mixed all over parameter space. It is therefore important to have a system that not only provides an answer but also states how reliable that answer is.

We have applied the two-network scheme (Fig. 1) described before to this example. Both networks have the same architecture and train on the same input data, the first with the target track angles and the second with the absolute values of the errors of the first. To avoid large fluctuations in training convergence, the second network starts learning only after the first has stabilized and finished training. Both networks are feedforward networks with three neuron layers: input, hidden and output. Both are trained with a supervised backpropagation algorithm. The neuron activation function is a sigmoid (tan-sigmoid). After some optimization, our sigmoid-based backpropagation became rather efficient. The use of radial basis functions (RBF) might, in some cases, provide faster learning if the RBFs are tuned to the particular application.

For the results presented here, we use 14 input neurons (representing the drift times in each straw), either 14 or 25 hidden neurons and one output neuron for the slope of each track. We use Monte Carlo generated data, coded as follows: if the track does not meet a straw, the corresponding input value is zero; if the track crosses the straw, the input value is the difference between the straw radius and the distance of the track to the wire at the center of the straw. The output is the track slope angle. A training sample of 25,000 simulated tracks was generated. After training, the performance of the network was tested on a new set of 5000 independent tracks.
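A sketch of this input coding on Monte Carlo tracks is given below. The chamber geometry (straw radius, layer spacing, half-cell shift) is a simplified stand-in, since the paper does not specify the actual dimensions:

```python
import numpy as np

R = 0.5                                # straw radius (arbitrary units, assumed)
# wire positions: 14 straws in 4 staggered layers (simplified stand-in geometry)
wires = np.array([(x + 0.5 * (layer % 2), layer)   # half-cell shift of alternate layers
                  for layer in range(4) for x in range(4)][:14], dtype=float)

def encode_track(x0, angle):
    """Input coding: 0 if the track misses the straw, else R minus the
    perpendicular distance from the wire to the track (the drift time)."""
    t = np.tan(angle)
    # distance from wire (wx, wy) to the line x = x0 + t*y
    d = np.abs(wires[:, 0] - x0 - t * wires[:, 1]) / np.sqrt(1.0 + t * t)
    return np.where(d < R, R - d, 0.0)

rng = np.random.default_rng(3)
angles = rng.uniform(-0.3, 0.3, 25000)             # target track angles
x0s = rng.uniform(0.0, 4.0, 25000)                 # track entry positions
inputs = np.array([encode_track(x0, a) for x0, a in zip(x0s, angles)])
```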

Figure 4 compares the actual error of the first network with the uncertainty predicted by the second. One sees that the largest errors do indeed correspond to large uncertainty predictions by the second network. Of course, in a few cases, a large uncertainty is predicted when the actual error is small. This only means that the particular result is unreliable, in the sense that it fell near the middle of the error bar interval by chance. The results obtained with either 14 or 25 hidden neurons are practically indistinguishable, meaning that the networks are indeed characterizing the ambiguity of the data.

Fig. 4  Comparison of the actual error of the first network and the estimated uncertainty predicted by the second network (straw chamber data)

Now that we are equipped with a system that predicts both an angle and its probable uncertainty, it makes sense to state that the result of a measurement is \(\theta \pm \Delta _{\mathrm{ann}}\), \(\theta \) being the output of the first network and \(\Delta _{\mathrm{ann}}\) the output of the second. In this sense, we will count an output as an error only when the objective value is outside the error bars. The effective error will be the distance of the objective value to the boundary of the error bars. Figure 5 plots the effective error for a sample of 1000 tracks.
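This error-bar convention translates into a few lines of code; `theta_true`, `theta_pred` and `delta_ann` are hypothetical names for the target values and the outputs of the two networks:

```python
import numpy as np

def effective_error(theta_true, theta_pred, delta_ann):
    """Zero when the true value lies inside theta_pred +/- delta_ann;
    otherwise, the distance from the true value to the nearest error bar."""
    excess = np.abs(theta_true - theta_pred) - delta_ann
    return np.maximum(excess, 0.0)
```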

Fig. 5  Effective error (straw chamber data)

An important problem when attempting to model experimental data is the detection of ambiguities or, equivalently, knowing whether the data set completely characterizes the phenomenon. As mentioned before, clustering methods have been proposed (Lin et al. 2006; Albalate et al. 2010; Bailey-Kellogg and Ramakrishnan 2001) to isolate the ambiguity regions and identify the origin of the ambiguities. This is not always possible, nor reliable, when ambiguity regions are spread all over the input space and share subsets of variable values with non-ambiguous regions. This is the case in the straw chamber example. Therefore, the more conservative approach of looking for fat tails in the error distribution and constructing an ambiguity predictor, as proposed here, is, in our opinion, more appropriate. As an illustration, we have studied the error distribution of the first network for successively larger training sets and found that the fat tails are indeed persistent and quite stable in nature. Figure 6 shows the error distribution for a training set of 273,271 (4-hit) tracks; as a clear symptom, the excess kurtosis of the distribution is 5.3.
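The fat-tail symptom quoted above is simply the excess kurtosis of the residuals, which vanishes for a Gaussian. A minimal computation, with `errors` standing for the first network's test errors:

```python
import numpy as np

def excess_kurtosis(errors):
    """Fourth standardized moment minus 3; zero for a Gaussian,
    markedly positive for fat-tailed error distributions."""
    e = np.asarray(errors) - np.mean(errors)
    return np.mean(e**4) / np.mean(e**2) ** 2 - 3.0

# values well above zero that persist as the training set grows
# signal fat tails and hence likely ambiguity in the data
```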

Fig. 6  Fat tails as a symptom of ambiguity

3.2 A credit scoring model

Defaulting on loans has recently increased, prompting financial institutions to search for accurate techniques of credit evaluation. Credit scoring is a quantitative method, based on credit report information, that helps lenders decide whether to grant credit. The objective is to categorize credit applicants into two separate classes: the “good credit” class, likely to repay loans on time, and the “bad credit” class, to which credit should be denied due to a high probability of default. For a more detailed account of credit scoring models, we refer to Lando (2004), Van Gestel and Baesens (2009) and Thomas et al. (2002).

Here, we have developed a credit scoring model based on the two-network scheme discussed before. Because complete information on the credit applicants is impossible to obtain and human behavior depends on many factors, credit scoring is also a typical example of a situation where one tries to predict an outcome from incomplete information. Credit scoring models based on neural networks have been proposed in the past (see, for example, West 2000; Pacelli and Azzollini 2011). The novelty of our system is that it not only provides a scoring result but also estimates how reliable that result is. A similar system has been successfully developed by the authors for a credit company, where scoring ambiguities are of utmost importance for risk evaluation. Owing to privacy restrictions, however, the data used in our second example come from an open source.

We use here publicly available credit data of anonymous clients, downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). The set is composed of 1000 cases, one per applicant, of which 700 correspond to creditworthy applicants and 300 to applicants who were later found to be in the bad credit class. Each instance comprises 24 attributes (e.g., loan amount, credit history, employment status, personal information) together with the credit status of the applicant, coded as good (1) or bad (0). Inspecting the database, it is clear that some apparently good attributes correspond, in the end, to bad credit performance, and conversely, putting into evidence the incomplete-information nature of the problem.

For our system, the attributes are numerically coded, and we use a neural network architecture with 24 input neurons (one per attribute), 14 hidden neurons and one output neuron indicating good or bad credit. To ensure that the network learns evenly, we randomly alternate between good and bad applicant instances during training, as sketched below. After training, the performance of the network was tested. Figure 7 shows the distribution of the errors of the first network after training. Although, in general, the network provides good estimates, several customers are classified as good when they are bad, and vice versa. In fact, there are some extremely incorrect predictions, as can easily be seen from the bins at the two ends of the histogram. These bins clearly reveal a lack of information in the data set.
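The class-balancing device mentioned above can be sketched as follows; the function and variable names are illustrative, not from the authors' implementation, and `sgd_step` stands for whatever per-example training routine is in use:

```python
import numpy as np

rng = np.random.default_rng(4)

def balanced_stream(X_good, y_good, X_bad, y_bad, n_steps):
    """Yield training instances, choosing the class at random with equal
    probability, so the 700/300 imbalance does not bias the learning."""
    for _ in range(n_steps):
        if rng.random() < 0.5:
            i = rng.integers(len(X_good))
            yield X_good[i], y_good[i]
        else:
            i = rng.integers(len(X_bad))
            yield X_bad[i], y_bad[i]

# usage: one backpropagation step per yielded (attributes, credit status) pair
# for x, target in balanced_stream(X_good, y_good, X_bad, y_bad, 100000):
#     sgd_step(net, x, target, eta)
```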

Fig. 7  Error distribution in the first network (credit scoring)

As in the previous example, Fig. 8 compares the errors of the first network with the uncertainty estimated by the second network, and Fig. 9 shows the effective error distribution. As in the straw chamber example, the second network yields good uncertainty predictions. It misclassified very few cases: only two occurrences with no actual error were assigned maximum uncertainty, and only three critical errors went unflagged, being assigned no uncertainty.

Fig. 8  Comparison of the actual error of the first network and the estimated uncertainty predicted by the second network (credit scoring)

Looking at the effective error distribution plot, it is easy to confirm the refinement in the degree of certainty of each estimate. Nevertheless, a few estimates still fall outside the error bar interval.

Fig. 9  Effective error (credit scoring)

4 Conclusions

1. The goal of this research was to develop a computational scheme able to evaluate the degree of reliability of predictive models. Two application examples were studied, the first being the measurement of track angles by straw chambers in high-energy physics and the other a credit scoring model. Both examples use data with incomplete information. A two-network system is used which, although not perfect, greatly improves the reliability check of the predicted results.

2. That the estimate of the reliability of the data modeling is sensitive to situations where uncertainty is not uniform throughout the parameter space is an asset of the system. A weakness is, of course, the assumption that the uncertainty is well modeled by the second moment of the input data. Skewness, power-law distributions and rare events fall outside the scope of the system. In any case, to model such features with neural networks might not be appropriate, and more complex systems involving, for example, estimates of characteristic functions (Dente and Vilela 1997) might have to be brought into play.