1 Introduction

In recent years, there has been growing interest in extracting patterns from data using artificial neural network (ANN)-based modelling techniques. The use of these models in real-life scenarios is becoming a primary focus area across industries and among data analytics practitioners. It is well established that ANN-based models provide a flexible framework for building models with high predictive performance on large and complex data. Unfortunately, due to the high degree of complexity of ANN models, the interpretability of the results can be significantly reduced, and such models have come to be called “black boxes” in this community. For example, in a banking system that detects fraud, a robo-advisor for securities consulting, or the opening of a new account in compliance with the KYC process, there are no mechanisms in place that make the results understandable. The risk with this type of complex computing machine is that customers or bank employees are left with a series of questions after a consultation or decision which the banks themselves cannot answer: “Why did you recommend this share?”, “Why was this person rejected as a customer?”, “How does the machine classify this transaction as terror financing or money laundering?”. Naturally, industries are focusing more and more on transparency and understanding when deploying artificial intelligence and complex learning systems.

This has opened a new direction of research aimed at developing approaches to understand model behaviour and explain model structure. Recently, Joel et al. (2018) developed an explainable neural network model based on additive index models to learn interpretable network connectivity. However, this is still not sufficient to understand the significance of the features used in the model or whether the model is well specified.

In this article, we express the neural network (NN) model as a nonlinear regression model and use statistical measures to interpret the model parameters and the model specification under certain assumptions. We consider only multilayer perceptron (MLP) networks, which form a very flexible class of statistical procedures. The article is arranged as follows: (a) the structure of the MLP as a feed-forward neural network expressed as a nonlinear regression model, (b) the estimation of the parameters, (c) the properties of the parameters and their asymptotic distribution, and (d) a simulation study and conclusion.

Fig. 1: A multilayer perceptron neural network: MLP network with three layers

2 Transparent Neural Network Model (TRANN)

In this article, we consider the MLP structure given in Fig. 1. Each neural network can be expressed as a function of the explanatory variables \(X=\left[ x_1,x_2,\ldots ,x_p\right] \) and the network weights \(\omega =\left( \gamma ^{\prime },\beta ^\prime ,b^\prime \right) \), where \(\gamma ^\prime \) holds the weights between the input and hidden layers, \(\beta ^\prime \) holds the weights between the hidden and output layers and \(b^\prime \) holds the biases of the network. The network has the following functional form

$$\begin{aligned} F(X,\omega ) = \sum _{h=1}^{H}\beta _{h}g(\sum _{i=1}^{I}\gamma _{hi}x_{i}+b_{h})+b_{00} \end{aligned}$$
(1)

where the scalars I and H denote the number of input units and hidden units of the network and g is a nonlinear transfer function. The transfer function g can be taken as either the logistic function or the hyperbolic tangent function; in this paper, we use the logistic transfer function for all calculations. Let Y be the dependent variable; then we can write Y in nonlinear regression form

$$\begin{aligned} Y=F(X,\omega ) + \epsilon \end{aligned}$$
(2)

where \(\epsilon \) is \( {i.i.d}\) normal with E\([\epsilon ]=0\) and E\([\epsilon \epsilon ^{\prime }]=\sigma ^{2} I\). Equation (2) can now be interpreted as a parametric nonlinear regression of Y on X, so, given data, we can estimate all the network parameters. The most important questions are then what the right architecture of the network is, how to identify the number of hidden units, and how to measure the importance of the parameters. The aim is always to identify an optimal network with a small number of hidden units which approximates the unknown function well (Sarle 1995). Therefore, it is important to derive a methodology not only to select an appropriate network but also to explain the network well for a given problem.
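
To make the notation concrete, the following minimal NumPy sketch evaluates \(F(X,\omega )\) of Eq. (1) with a logistic transfer function and generates Y according to the regression form of Eq. (2). All function and variable names (logistic, mlp_forward, the dimensions of the toy example) are our own illustration, not part of the model definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    # Logistic transfer function g(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, gamma, b_hidden, beta, b_out):
    # F(X, omega) of Eq. (1): X is (T, I), gamma is (H, I),
    # b_hidden is (H,), beta is (H,), b_out is the scalar b_00.
    hidden = logistic(X @ gamma.T + b_hidden)   # hidden-unit outputs, (T, H)
    return hidden @ beta + b_out                # network output, (T,)

# Toy example with I = 3 inputs and H = 2 hidden units.
I, H, T = 3, 2, 500
gamma = rng.normal(size=(H, I))
b_hidden = rng.normal(size=H)
beta = rng.normal(size=H)
b_out = 0.5

X = rng.exponential(size=(T, I))
eps = rng.normal(scale=0.1, size=T)   # i.i.d. Gaussian errors of Eq. (2)
Y = mlp_forward(X, gamma, b_hidden, beta, b_out) + eps
```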

In the network literature, the available and commonly pursued approaches are regularization, stopped training and pruning (Reed 1993). In regularization methods, the network weights are chosen by minimizing the network error (e.g. the sum of squared errors) together with a penalty term. In stopped training, the data are split into a training and a validation set, and the training algorithm is stopped when the model error on the validation set begins to grow, i.e. estimation stops when the model starts to become overparameterized or overfitted. The resulting values may not be sensible estimates of the parameters, since a growing validation error is rather an indication that the network complexity should be reduced. In pruning methods, the network parameters are retained based on their “significant” contribution to overall network performance; however, this “significance” is not judged by any theoretical construct but is more a heuristic measure of importance.

The main issue with regularization, stopped training and pruning is that they are highly judgemental in nature, which makes the model-building process difficult to reconstruct. In the transparent neural network (TRANN), we explain the statistical construction of the parameter estimates and their properties, through which we establish the statistical importance of the network weights and address the model misspecification problem. In the next section, we describe the statistical concepts needed to estimate the network parameters and derive their properties. We have carried out a simulation study to justify our claims.

3 TRANN Parameter Estimation

In general, the parameters of a nonlinear regression model cannot be determined analytically; numerical procedures are needed to find the optima of nonlinear functions, a standard problem in numerical mathematics. To estimate the parameters, we minimize the squared error, \(SE=\sum _{t=1}^{T}(Y_{t}-F(X_{t},\omega ))^2\), using the backpropagation method. Backpropagation is the most widely used algorithm for supervised learning with multilayered feed-forward networks. In the backpropagation algorithm (Rumelhart et al. 1986), repeated application of the chain rule is used to compute the influence of each weight in the network on the error function SE:

$$\begin{aligned} \frac{\partial {SE}}{\partial \omega _{ij}}=\frac{\partial {SE}}{\partial {s_{i}}}\frac{\partial {s_{i}}}{\partial {\text {net}_{i}}}\frac{\partial {\text {net}_{i}}}{\partial {\omega _{ij}}} \end{aligned}$$
(3)

where \(\omega _{ij}\) is the weight from neuron j to neuron i, \(s_{i}\) is the output, and net\(_{i}\) is the weighted sum of the inputs of neuron i. Once the partial derivatives of each weight are known, then minimizing the error function can be achieved by performing

$$\begin{aligned} \check{\omega }_{t+1}= \check{\omega }_{t}-\eta _{t}[- \nabla F(X_{t},\check{\omega }_{t})]^{\prime }[Y_{t}-F(X_{t},\check{\omega }_{t})], \quad t=1,2,\ldots ,T \end{aligned}$$
(4)
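
As a purely illustrative sketch that reuses the hypothetical mlp_forward above, the online update of Eq. (4) can be written as follows; grad_F collects the partial derivatives of Eq. (3) via the chain rule, and the caller supplies the learning rate \(\eta _{t}\) (e.g. proportional to \(t^{-1}\), as discussed below).

```python
def grad_F(x, gamma, b_hidden, beta, b_out):
    # Gradient of F(x, omega) w.r.t. all weights for a single sample x,
    # via the chain rule of Eq. (3); g'(z) = g(z)(1 - g(z)) for the logistic g.
    a = logistic(gamma @ x + b_hidden)   # hidden activations, (H,)
    d = beta * a * (1.0 - a)             # back-propagated signal, (H,)
    return {
        "gamma": np.outer(d, x),         # dF/dgamma_hi
        "b_hidden": d,                   # dF/db_h
        "beta": a,                       # dF/dbeta_h
        "b_out": 1.0,                    # dF/db_00
    }

def backprop_step(x, y, params, eta):
    # One online update of Eq. (4): omega <- omega + eta * (dF/domega) * residual.
    gamma, b_hidden, beta, b_out = params
    resid = y - mlp_forward(x[None, :], gamma, b_hidden, beta, b_out)[0]
    g = grad_F(x, gamma, b_hidden, beta, b_out)
    return (gamma + eta * resid * g["gamma"],
            b_hidden + eta * resid * g["b_hidden"],
            beta + eta * resid * g["beta"],
            b_out + eta * resid * g["b_out"])
```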

Based on the assumptions of the nonlinear regression model (2) and under some regularity conditions on F, it can be proven (White 1989) that the parameter estimator \(\hat{\omega }\) is consistent and asymptotically normal. White (1989) also showed that an asymptotically equivalent estimator can be obtained from the backpropagation estimator of Eq. (4), when \(\eta _{t}\) is proportional to \(t^{-1}\), as

$$\begin{aligned} \hat{\omega }_{t+1}&= \check{\omega }_{t}+\left[ \sum _{t=1}^{T}\nabla F(X_{t},\check{\omega }_{t})^{'}\nabla F(X_{t},\check{\omega }_{t})\right] ^{-1}\\ \nonumber&\quad \times \sum _{t=1}^{T}\nabla F(X_{t},\check{\omega }_{t})^{'}[Y_{t}-F(X_{t},\check{\omega }_{t})], \quad t=1,2,\ldots ,T \end{aligned}$$
(5)
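
Eq. (5) is a Gauss–Newton-type correction. Under the same illustrative assumptions as above, it might be sketched as a single linear solve (flat_grads and gauss_newton_increment are our own names):

```python
def flat_grads(X, params):
    # Stack dF/domega for every observation into a (T, K) matrix,
    # flattening the parameters in the order (gamma, b_hidden, beta, b_00).
    gamma, b_hidden, beta, b_out = params
    rows = []
    for x in X:
        g = grad_F(x, gamma, b_hidden, beta, b_out)
        rows.append(np.concatenate([g["gamma"].ravel(), g["b_hidden"],
                                    g["beta"], [g["b_out"]]]))
    return np.asarray(rows)

def gauss_newton_increment(X, Y, params):
    # The correction term of Eq. (5): [sum G'G]^{-1} sum G'(Y - F),
    # to be added to the flattened backpropagation estimate omega-check.
    gamma, b_hidden, beta, b_out = params
    G = flat_grads(X, params)
    resid = Y - mlp_forward(X, gamma, b_hidden, beta, b_out)
    return np.linalg.solve(G.T @ G, G.T @ resid)
```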

In that case, the usual hypothesis tests for nonlinear models, such as the Wald test or the LM test, can be applied. A neural network belongs to the class of misspecified models, since it does not map the unknown function exactly but only approximates it. The application of the asymptotic standard tests is still valid, as the misspecification can be taken care of through the calculation of the covariance matrix of the parameters (White 1994). The estimated parameters \(\hat{\omega }\) are normally distributed with mean \(\omega ^{*}\) and covariance matrix \(\frac{1}{T}C\), where the parameter vector \(\omega ^{*}\) can be considered the best projection of the misspecified model onto the true model. This leads to:

$$\begin{aligned} \sqrt{T}(\hat{\omega }-\omega ^{*})\sim N(0,C) \end{aligned}$$
(6)

where T denotes the number of observations. As per the theory of misspecified models (Anders 2002), the covariance matrix can be calculated as

$$\begin{aligned} \frac{1}{T}C = A^{-1}BA^{-1} \end{aligned}$$
(7)

where the matrices A and B are given by \(A \equiv E[\nabla ^{2} SE_{t}]\) and \(B \equiv E[\nabla SE_{t}\nabla SE_{t}^{'}]\), \(SE_{t}\) denotes the squared error contribution of the tth observation, and \(\nabla \) is the gradient with respect to the weights.

4 TRANN Model Parameter Test for Significance

Hypothesis tests for the significance of parameters are a standard instrument of any statistical model. In TRANN, we find and eliminate redundant inputs from the feed-forward single-hidden-layer network through statistical tests of significance. This helps in understanding the network and makes it possible to explain the network connections with mathematical evidence, thereby providing transparency to the model. The case of irrelevant hidden units occurs when identical optimal network performance can be achieved with fewer hidden units. In any regression method, the t-statistic plays an important role in hypothesis testing, whereas it is often overlooked in neural networks. The non-significant parameters can be removed from the network, and the network can then be uniquely defined (White 1989); this holds for linear regression as well as for neural networks. Here, we compute the t-statistic as

$$\begin{aligned} {\frac{\hat{\omega }_{k}-\omega _{H_{0}}(k)}{\hat{\sigma }_{k}}} \end{aligned}$$
(8)

where \(\omega _{H_{0}}(k)\) denotes the value or restriction to be tested under the null hypothesis \(H_{0}\), and \(\hat{\sigma }_{k}\) is the estimated standard deviation of the estimated parameter \(\hat{\omega }_{k}\). We estimate the variance–covariance matrix \(\hat{C}\), whose diagonal elements provide the variances of the \(\hat{\omega }_{k}\), as

$$\begin{aligned} \frac{1}{T}\hat{C} = \hat{A}^{-1}\hat{B}\hat{A}^{-1} \end{aligned}$$
(9)
$$\begin{aligned} \hat{A}= \frac{1}{T} \sum _{t=1}^{T}\frac{\partial ^{2}SE_{t}}{\partial \hat{\omega }\partial \hat{\omega }^{'}} \text{ and } \hat{B} = \frac{1}{T}\sum _{t=1}^{T}\hat{\epsilon }_{t}^{2}\left( \frac{\partial F(X_{t},\hat{\omega })}{\partial \hat{\omega }}\right) \left( \frac{\partial F(X_{t},\hat{\omega })}{\partial \hat{\omega }}\right) ^{'} \end{aligned}$$
(10)

where \(\hat{\epsilon }_{t}^{2}\) is the squared estimated error for the tth sample.
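
A plug-in version of Eqs. (9)–(10), continuing the illustrative sketch, can be written with full-sample sums so that the \(1/T\) factors cancel. Note that differentiating \(SE_{t}=\hat{\epsilon }_{t}^{2}\) introduces constant factors 2 and 4, and we use the common Gauss–Newton approximation to the Hessian (dropping the residual-weighted second-derivative term); both choices are our own simplifications.

```python
def sandwich_covariance(X, Y, params):
    # Plug-in sandwich estimator: Cov(omega-hat) = C-hat / T = H^{-1} S H^{-1},
    # where H approximates sum_t nabla^2 SE_t and S = sum_t nabla SE_t nabla SE_t'.
    gamma, b_hidden, beta, b_out = params
    G = flat_grads(X, params)                      # (T, K) matrix of dF/domega
    resid = Y - mlp_forward(X, gamma, b_hidden, beta, b_out)
    H = 2.0 * (G.T @ G)                            # Gauss-Newton Hessian of SE
    S = 4.0 * (G * (resid ** 2)[:, None]).T @ G    # outer products of scores
    H_inv = np.linalg.inv(H)
    return H_inv @ S @ H_inv                       # covariance of omega-hat
```

In the homoskedastic case the factors 2 and 4 cancel and this expression reduces to the familiar \(\hat{\sigma }^{2}\left[ \sum _{t}\nabla F^{'}\nabla F\right] ^{-1}\) of nonlinear least squares.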

Equation (6) implies that the asymptotic distribution of the network parameter estimates is normal, so it is possible to test the significance of each parameter using the estimated covariance matrix \(\hat{C}\). Both the Wald test and the LM test are then applicable, as per the theory of misspecified models (Anders 2002).
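
A minimal sketch of these tests, assuming the covariance estimate from the previous block (the function names and the SciPy-based p-value computation are our own choices):

```python
from scipy import stats

def t_statistics(omega_hat, cov, omega_null=None):
    # t-statistic of Eq. (8) for each parameter, with two-sided p-values
    # from the asymptotic normal distribution of Eq. (6).
    omega_null = np.zeros_like(omega_hat) if omega_null is None else omega_null
    se = np.sqrt(np.diag(cov))                 # sigma-hat_k of Eq. (8)
    t = (omega_hat - omega_null) / se
    p = 2.0 * stats.norm.sf(np.abs(t))
    return t, p

def wald_test(omega_hat, cov, idx):
    # Joint Wald test of H0: omega_k = 0 for all k in idx; under H0 the
    # statistic is asymptotically chi-squared with len(idx) degrees of freedom.
    w = omega_hat[idx]
    V = cov[np.ix_(idx, idx)]
    W = float(w @ np.linalg.solve(V, w))
    return W, stats.chi2.sf(W, df=len(idx))
```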

5 Simulation Study

We have performed a simulation study to establish the estimation methods and hypothesis tests of significance with an 8-2-1 feed-forward network, i.e. eight input variables, one hidden layer with two hidden units and one output unit. As per the structure of Eq. (1), the network model therefore contains 21 parameters, and we have set the parameter values as \(b^{\prime } = (b_{00}: 0.91, b_{1}: -0.276, b_{2}: 0.276)\)

\(\beta ^{\prime } = (\beta _{1}: 0.942, \beta _{2}: 0.284)\)

\(\gamma ^{\prime } = (\gamma _{11}= -1.8567, \gamma _{21} = -0.0185, \gamma _{31}= -0.135, \gamma _{41}= 0.743, \gamma _{51}= 0.954, \gamma _{61}= 1.38, \gamma _{71}= 1.67, \gamma _{81}= 0.512, \gamma _{12}= 1.8567, \gamma _{22}= 0.0185, \gamma _{32}= 0.135, \gamma _{42}= -0.743, \gamma _{52}= -0.954, \gamma _{62}= -1.38, \gamma _{72}= -1.67, \gamma _{82}= -0.512)\)

and the error term \(\epsilon \) is generated from a normal distribution with mean zero and standard deviation 0.001. The independent variables \(X = [x_{1}, \ldots , x_{8}]\) are drawn from an exponential distribution. We generated 100,000 observations using the above parameters, then repeatedly drew random samples of 5000 observations from these 100,000 and derived the parameter estimates and confidence intervals; we refer to this resampling scheme as the bootstrap method. The estimated parameter values, standard errors, confidence intervals, t-values and p-values obtained through bootstrapping are given in Table 1. The corresponding results based on the asymptotic properties of the estimates, computed from Eq. (9), are given in Table 2. Both methods establish the test of significance of the parameters under the null hypothesis \(H_{0} : \omega =0\).
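
Continuing the running sketch, the resampling scheme might look as follows. The fitting routine (here a SciPy nonlinear least-squares call), the number of resamples, and the perturbed starting values are our own illustrative choices and are not specified in the text; in practice one would use several random restarts per fit.

```python
from scipy.optimize import least_squares

def unpack(w, I=8, H=2):
    # Split the flat 21-vector into (gamma, b_hidden, beta, b_00).
    g = w[:H * I].reshape(H, I)
    return g, w[H * I:H * I + H], w[H * I + H:H * I + 2 * H], w[-1]

def fit_mlp(X, Y, w0):
    # Nonlinear least-squares fit of the regression model (2).
    return least_squares(lambda w: Y - mlp_forward(X, *unpack(w)), w0).x

# True parameters as listed above, flattened as (gamma, b_1, b_2, beta, b_00).
gamma_true = np.array([[-1.8567, -0.0185, -0.135, 0.743, 0.954, 1.38, 1.67, 0.512],
                       [1.8567, 0.0185, 0.135, -0.743, -0.954, -1.38, -1.67, -0.512]])
w_true = np.concatenate([gamma_true.ravel(), [-0.276, 0.276], [0.942, 0.284], [0.91]])

# Population of 100,000 observations.
X_pop = rng.exponential(size=(100_000, 8))
Y_pop = mlp_forward(X_pop, *unpack(w_true)) + rng.normal(scale=0.001, size=100_000)

# Repeatedly draw 5000 observations, refit, and summarise the estimates.
draws = []
for _ in range(200):                        # number of resamples is illustrative
    idx = rng.choice(100_000, size=5_000, replace=False)
    w0 = w_true + rng.normal(scale=0.1, size=w_true.size)   # cheap warm start
    draws.append(fit_mlp(X_pop[idx], Y_pop[idx], w0))
draws = np.asarray(draws)

se = draws.std(axis=0, ddof=1)                    # bootstrap standard errors
ci = np.percentile(draws, [2.5, 97.5], axis=0)    # 95% confidence intervals
```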

Table 1 Results using bootstrapping method
Table 2 Results using asymptotic properties

6 Conclusion

Neural networks embody a very flexible class of assumptions about the structural form of the unknown function F. In this paper, we have used nonlinear regression techniques to explain the network through statistical analysis. The statistical procedures usable for model building in neural networks are the significance tests of parameters, through which an optimal network architecture can be established. In our opinion, the transparent neural network is a major requirement for diagnosing a neural network architecture that not only approximates the unknown function but also explains the network features well through the assumptions of statistical nonlinear modelling. As a next step, we would like to investigate deep neural networks based on similar concepts.