
3.1 Introduction

This chapter presents the Bayesian and the evidence frameworks to create an automatic relevance determination technique. These methods are explained in detail, and pertinent literature reviews are conducted. The automatic relevance determination method is then applied to establish the relevance of economic variables that are critical for driving the consumer price index (CPI).

Shutin et al. (2012) introduced incremental reformulated automatic relevance determination and related this to the incremental version of sparse Bayesian learning. The fast marginal likelihood maximization procedure is an incremental method where the objective function is optimized with respect to the parameters of a single component given that the other parameters are fixed. The procedure is then demonstrated to relate to a series of re-weighted convex optimization problems.

Huang et al. (2012) introduced stochastic optimization using an automatic relevance determination (ARD) prior model and applied this to Bayesian compressive sensing. Compressive sensing is a unique data acquisition procedure where the compression is conducted during the sampling process. In condition monitoring systems, effective data compression methods such as compressive sensing are required to decrease the cost of signal transfer and storage. Huang et al. (2012) introduced Bayesian compressive sensing (BCS) for condition monitoring of signals. The results obtained from the improved BCS technique were better than those of conventional BCS reconstruction procedures.

Böck et al. (2012) used the ARD method for a hub-centered gene network reconstruction procedure and exploited the topology of gene regulatory networks by using a Bayesian network. The proposed technique was applied to a large publicly available dataset and was able to identify several main hub genes.

Jacobs (2012) applied Bayesian support vector regression with an ARD kernel for modeling antenna input characteristics. The results indicated that Bayesian support vector regression was appropriate for highly non-linear modeling tasks. They observed that the Bayesian framework allowed efficient training of the multiple kernel ARD hyper-parameters.

Shutin et al. (2011) proposed fast variational sparse Bayesian learning with ARD for superimposed signals. They showed that a fast version of variational sparse Bayesian learning can be built using stationary points of the variational update factors with non-informative ARD hyper-priors.

Zhang et al. (2010) applied Gaussian process classification using ARD for synthetic aperture radar target recognition. The method they proposed implemented kernel principal component analysis to identify sample features and performed target recognition using Gaussian process classification with an ARD function. When they compared this technique to the k-Nearest Neighbor method, a Naïve Bayes classifier, and a Support Vector Machine, the proposed technique was found to be able to automatically select an appropriate model and to optimize the hyper-parameters.

Lisboa et al. (2009) applied a neural network model and Bayesian regularization with the typical approximation of the evidence to create an automatic relevance determination system. The model was applied to local and distal recurrence of breast cancer.

Mørup and Hansen (2009) applied automatic relevance determination to multi-way models, while Browne et al. (2008) applied ARD to identifying thalamic regions implicated in schizophrenia. Other successful applications of ARD include earthquake early warning systems (Oh et al. 2008), the investigation of feature correlations (Fu and Browne 2008), ranking variables to determine ischaemic episodes (Smyrnakis and Evans 2007), the estimation of relevant variables in multichannel EEG (Wu et al. 2010), and the estimation of relevant variables in classifying ovarian tumors (Van Calster et al. 2006).

3.2 Mathematical Framework

The automatic relevance determination technique is a process applied to evaluate the relevance of each input variable in its ability to predict a particular phenomenon. In this chapter, we apply ARD to determine the relevance of economic variables in predicting the CPI. ARD achieves this task of ranking input variables by optimizing the hyper-parameters to maximize the evidence in the Bayesian framework (Marwala and Lagazio 2011). As already described by Marwala and Lagazio (2011), Wang and Lu (2006) successfully applied the ARD technique to approximate influential variables in modeling the ozone layer, while Nummenmaa et al. (2007) successfully applied ARD to a dataset in which a male subject was presented with uncontaminated tone and checkerboard reversal stimuli, individually and in combination, employing a magnetic resonance imaging-based cortical surface model. Ulusoy and Bishop (2006) applied ARD to categorize relevant features for object recognition in 2D images. One component of the ARD technique as applied in this chapter is the neural network, the topic of the next sub-section. It should be borne in mind that ARD is not limited to neural networks and can, within the context of artificial intelligence, be built on models such as support vector machines, Gaussian mixture models, and many others.

3.2.1 Neural Networks

This section gives a summary of neural networks in the context of economic modeling (Leke and Marwala 2005; Lunga and Marwala 2006). A neural network is an information processing procedure that is inspired by the way that biological nervous systems, like the human brain, process information (Marwala and Lagazio 2011). It is a computer-based mechanism aimed at modeling the way in which the brain performs a specific function of interest (Haykin 1999; Marwala and Lagazio 2011).

The neural network technique is a powerful tool that has been successfully used in mechanical engineering (Marwala and Hunt 1999; Vilakazi and Marwala 2007; Marwala 2012), civil engineering (Marwala 2000, 2001), aerospace engineering (Marwala 2003), biomedical engineering (Mohamed et al. 2006; Russell et al. 2008, 2009a, b), finance (Leke and Marwala 2005; Patel and Marwala 2006; Hurwitz and Marwala 2011; Khoza and Marwala 2012), statistics (Marwala 2009), intrusion detection (Alsharafat 2013), telecommunications (Taspinar and Isik 2013) and political science (Marwala and Lagazio 2011).

Meena et al. (2013) applied fuzzy logic and neural networks to gender classification in speech recognition. To train the fuzzy logic and neural network models, features that included the pitch of the speech were used, and the tests showed good results. Nasir et al. (2013) applied a multilayer perceptron and simplified fuzzy ARTMAP neural networks to the classification of acute leukaemia cells. The cells were classified into lymphoblast, myeloblast, and normal cells to categorize the leukaemia types, and the results gave good classification performance. Shaltaf and Mohammad (2013) applied a hybrid of neural networks and maximum-likelihood-based estimation to chirp signal parameters and observed that the hybrid of neural networks and maximum-likelihood gradient-based optimization gave accurate parameter approximation for large signal-to-noise ratios.

In this chapter, neural networks are viewed as generalized regression models that can model any data, linear or non-linear. As described by Marwala and Lagazio (2011), a neural network is made up of four main constituents (Haykin 1999; Marwala 2012):

  • the processing units \( u_j \), where each \( u_j \) has a particular activation level \( a_j(t) \) at any given time;

  • weighted inter-connections between several processing units. These inter-connections control how the activation of one unit influences the input for another unit;

  • an activation rule, which takes input signals at a unit to yield a new output signal; and

  • a learning rule that stipulates how to regulate the weights for a given input/output pair (Haykin 1999).

Neural networks are able to derive meaning from complex data and, therefore, can be used to extract patterns and detect trends that are too complex to be detected by many other computational approaches (Hassoun 1995; Marwala 2012). A trained neural network can be viewed as an expert in the category of information it has been set to analyze (Yoon and Peterson 1990; Valdés et al. 2012; Sinha et al. 2013) and can be used to make predictions in new circumstances. Neural networks have been applied to model a number of non-linear problems because of their capacity to adapt to non-linear data (Leke et al. 2007; Martínez-Rego et al. 2012).

The topology of the neural processing units and their inter-connections can have a profound influence on the processing capabilities of a neural network. Accordingly, there are many different connection schemes that describe how data flows between the input, hidden, and output layers.

There are different kinds of neural network topologies and these include the multi-layer perceptron (MLP) and the radial basis function (RBF) (Bishop 1995; Sanz et al. 2012; Prakash et al. 2012). In this chapter, the MLP was applied to identify the relationship between economic variables and CPI. The motivation for using the MLP was because it offers a distributed representation with respect to the input space due to cross-coupling between hidden units, while the RBF gives only local representation (Bishop 1995; Marwala and Lagazio 2011).

Ikuta et al. (2012) applied the MLP to solving the two-spiral problem, whereas Rezaeian-Zadeh et al. (2012) applied the MLP and the RBF to hourly temperature prediction. Wu (2012) applied a multi-layer perceptron neural network to scattered-point-data surface reconstruction, while Li and Li (2012) implemented the MLP in a hardware application.

The conventional MLP contains hidden units and output units and normally has one hidden layer. The bias parameters in the first layer are presented as mapping weights from an extra input having a fixed value of \( x_0 = 1 \). The bias parameters in the second layer are displayed as weights from an extra hidden unit, with the activation fixed at \( z_0 = 1 \). The model in Fig. 3.1 can take into account the intrinsic dimensionality of the data. Models of this form can approximate any continuous function to arbitrary accuracy if the number of hidden units M is adequately large.

Fig. 3.1 Feed-forward multi-layer perceptron network having two layers of adaptive weights (Reprinted with permission from Marwala 2009; Marwala and Lagazio 2011)

The size of the MLP may be increased by allowing more layers of adaptive weights; however, the universal approximation theorem (Cybenko 1989) establishes that a network with two layers of adaptive weights is sufficient for the MLP. Because of this theorem, the two-layered network shown in Fig. 3.1 was chosen in this chapter. The relationship between the CPI, y, and the economic variables, x, may be written as follows (Bishop 1995; Marwala 2012):

$$ {y_k}={f_{\it outer}}\left( {\sum\limits_{j=1}^M {w_{kj}^{(2) }} {f_{\it inner}}\left( {\sum\limits_{i=1}^d {w_{ji}^{(1) }{x_i}+w_{j0}^{(1) }} } \right)+w_{k0}^{(2) }} \right) $$
(3.1)

Here, \( w_{ji}^{(1) } \) and \( w_{kj}^{(2) } \) specify the neural network weights in the first and second layers, respectively, going from input i to hidden unit j and from hidden unit j to output unit k, M is the number of hidden units, d is the number of input units, while \( w_{j0}^{(1) } \) specifies the bias for hidden unit j. This chapter uses a hyperbolic tangent function for \( {f_{\it inner}}(\bullet) \). The function \( {f_{\it outer}}(\bullet) \) is linear because the problem being handled is a regression problem (Bishop 1995; Marwala and Lagazio 2011).
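To make Eq. 3.1 concrete, the following minimal NumPy sketch implements the forward pass of the two-layer MLP, with hyperbolic tangent hidden units and a linear output; the function name, layer sizes, and random weights are illustrative only:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of the two-layer MLP in Eq. 3.1."""
    a = W1 @ x + b1        # first-layer activations a_j
    z = np.tanh(a)         # f_inner: hyperbolic tangent hidden units
    return W2 @ z + b2     # f_outer: linear output for regression

# Illustrative sizes matching Sect. 3.3: d = 23 inputs, M = 12 hidden units,
# K = 1 output (the CPI); the weights here are random placeholders.
rng = np.random.default_rng(0)
d, M, K = 23, 12, 1
W1, b1 = 0.1 * rng.standard_normal((M, d)), np.zeros(M)
W2, b2 = 0.1 * rng.standard_normal((K, M)), np.zeros(K)
print(mlp_forward(rng.standard_normal(d), W1, b1, W2, b2))
```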

On training regular neural networks, the network weights are identified while the training of probabilistic neural networks identifies the probability distributions of the network weights. An objective function must be selected to identify the weights in Eq. 3.1. An objective function is a mathematical representation of the global objective of the problem (Marwala and Lagazio 2011). In this chapter, the principal objective was to identify a set of neural network weights, given the economic variables and the CPI and then rank the inputs in terms of importance.

If the training set \( D=\left\{ {{x_k},{y_k}} \right\}_{k=1}^N \) is used, where N is the number of training examples, and assuming that the targets \( y_k \) are sampled independently given the inputs \( x_k \) and the weight parameters \( w_{kj} \), then the objective function, E, may be built from the sum-of-squares error (Rosenblatt 1961; Bishop 1995; Wang and Lu 2006; Marwala 2009):

$$ {E_D}=\frac{1}{2}\sum\limits_{n=1}^N {\sum\limits_{k=1}^K {{{{\left\{ {{t_{nk }}-{y_{nk }}} \right\}}}^2}} } $$
(3.2)

Here, \( t_{nk} \) is the target for the kth output unit and the nth training example, N is the number of training examples, K is the number of network output units, and n and k index the training patterns and the output units, respectively. The hyper-parameter β, which scales the contribution of the data error, enters through the combined objective of Eq. 3.4 below.

The sum-of-squares error objective function was chosen because it has been found to be better suited to regression problems than the cross-entropy error objective function (Bishop 1995). Equation 3.2 can be regularized by adding a penalty term to the objective function. This penalty term addresses ill-posedness and prevents over-fitting by encouraging smooth solutions, thereby reaching a trade-off between complexity and accuracy (Bishop 1995; Marwala and Lagazio 2011):

$$ {E_W}=\frac{1 }{2}\sum\limits_{j=1}^W {w_j^2} $$
(3.3)

Here, α is a hyper-parameter that controls the prior contribution to the regularized objective (see Eq. 3.4) and W is the number of network weights. This regularization term penalizes weights of large magnitudes (Bishop 1995; Tibshirani 1996; Marwala 2009; Marwala and Lagazio 2011). To solve for the weights in Eq. 3.1, the back-propagation technique described in the next section was used.

By merging Eqs. 3.2 and 3.3, the complete objective function can be written as follows (Bishop 1995):

$$ \begin{aligned} E= & \beta {E_D}+\alpha {E_W} \\ = & \frac{\beta}{2}\sum\limits_{n=1}^N {\sum\limits_{k=1}^K {{{{\left\{ {{t_{nk }}-{y_{nk }}} \right\}}}^2}} } +\frac{\alpha }{2}\sum\limits_{j=1}^W {w_j^2} \\ \end{aligned} $$
(3.4)
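As a sketch of how Eq. 3.4 is evaluated in practice, the function below computes the combined objective from a flattened weight vector, the targets, and the network outputs produced by the forward pass of Eq. 3.1; the function name and arguments are illustrative:

```python
import numpy as np

def regularized_objective(w_flat, targets, outputs, alpha, beta):
    """Combined objective of Eq. 3.4: E = beta*E_D + alpha*E_W,
    with E_D = 0.5 * sum (t - y)^2 and E_W = 0.5 * sum w^2."""
    data_error = 0.5 * np.sum((targets - outputs) ** 2)   # E_D, Eq. 3.2
    weight_error = 0.5 * np.sum(w_flat ** 2)              # E_W, Eq. 3.3
    return beta * data_error + alpha * weight_error
```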

3.2.1.1 Back-Propagation Method

As described by Marwala and Lagazio (2011), the back-propagation technique is a procedure for training neural networks (Bryson and Ho 1989; Rumelhart et al. 1986; Russell and Norvig 1995). Back-propagation is a supervised learning procedure and is basically an application of the Delta rule (Rumelhart et al. 1986; Russell and Norvig 1995). It requires that the activation functions in Eq. 3.1 be differentiable, and it proceeds in two phases: propagation and weight update. According to Bishop (1995), the propagation phase has these steps:

  • the output activations are computed by forward-propagating the training pattern’s input through the neural network,

  • the deltas of all output and hidden neurons are computed by back-propagating the error between the output activations and the training targets through the neural network.

The weight-update phase applies the output delta and the input activation to obtain the gradient of the error with respect to each weight. A proportion of this gradient is then subtracted from the weight, and this proportion influences the performance of the learning process. Because the sign of the gradient of a weight indicates the direction in which the error increases, the weight is updated in the opposite direction (Bishop 1995). This process is repeated until convergence.

In basic terms, back-propagation is used to identify the network weights given the training data, using an optimization technique. Generally, the weights can be identified using the following iterative technique (Werbos 1974; Zhao et al. 2010; Marwala and Lagazio 2011):

$$ {{\{w\}}_{i+1 }}={{\{w\}}_i}-\eta \frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_i}} \right) $$
(3.5)

In Eq. 3.5, the parameter \( \eta \) is the learning rate while {} denotes a vector. The objective function, E, is minimized by computing the derivative of the error in Eq. 3.2 with respect to the network weights. The derivative of the error with respect to a weight connecting the hidden layer to the output layer may be written as follows, using the chain rule (Bishop 1995):

$$ \begin{aligned}[b] \frac{{\partial E}}{{\partial {w_{kj }}}}&= \frac{{\partial E}}{{\partial {a_k}}}\frac{{\partial {a_k}}}{{\partial {w_{kj }}}} \\ &= \frac{{\partial E}}{{\partial {y_k}}}\frac{{\partial {y_k}}}{{\partial {a_k}}}\frac{{\partial {a_k}}}{{\partial {w_{kj }}}} \\ &= \sum\limits_n {f_{outer}^{\prime}\left( {{a_k}} \right)\frac{{\partial E}}{{\partial {y_{nk }}}}{z_j}} \end{aligned} $$
(3.6)

In Eq. 3.6, \( {z_j}={f_{\it inner}}\left( {{a_j}} \right) \) and \( {a_k}=\sum\limits_{j=0}^M {w_{kj}^{(2) }{z_j}} \). The derivative of the error with respect to a weight connecting the input layer to the hidden layer may also be written using the chain rule (Bishop 1995):

$$ \begin{aligned}[b] \frac{{\partial E}}{{\partial {w_{ji }}}} &= \frac{{\partial E}}{{\partial {a_j}}}\frac{{\partial {a_j}}}{{\partial {w_{ji }}}} \\ & = \sum\limits_n {f_{\it inner}^{\prime}\left( {{a_j}} \right)} \sum\limits_k {{w_{kj }}f_{outer}^{\prime}\left( {{a_k}} \right)\frac{{\partial E}}{{\partial {y_{nk }}}}} {x_i} \end{aligned} $$
(3.7)

In Eq. 3.7, \( {a_j}=\sum\limits_{i=1}^d {w_{ji}^{(1) }{x_i}} \). The derivative of the objective function in Eq. 3.2 may thus be written as (Bishop 1995):

$$ \frac{{\partial E}}{{\partial {y_{nk }}}}={y_{nk }}-{t_{nk }} $$
(3.8)

while that of the hyperbolic tangent function is (Bishop 1995):

$$ f_{\it inner}^{\prime}\left( {{a_j}} \right)=\mathrm{sech}^2\left( {{a_j}} \right) $$
(3.9)
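A minimal sketch of Eqs. 3.5–3.9 for a single training pattern follows; it assumes the tanh/linear network of Eq. 3.1 and folds the regularization term of Eq. 3.4 into the gradients. All function names are illustrative:

```python
import numpy as np

def backprop_gradients(x, t, W1, b1, W2, b2, alpha, beta):
    """Gradients of the objective in Eq. 3.4 for one pattern (Eqs. 3.6-3.9)."""
    # Forward pass (Eq. 3.1)
    a1 = W1 @ x + b1
    z = np.tanh(a1)
    y = W2 @ z + b2
    # Output deltas: dE_D/dy = y - t (Eq. 3.8); f_outer is linear, so f' = 1
    d_out = beta * (y - t)
    # Hidden deltas: f'_inner(a) = sech^2(a) = 1 - tanh^2(a) (Eq. 3.9)
    d_hid = (1.0 - z ** 2) * (W2.T @ d_out)
    # Weight gradients (Eqs. 3.6 and 3.7), plus the alpha*w weight-decay term
    return (np.outer(d_hid, x) + alpha * W1,   # dE/dW1
            d_hid + alpha * b1,                # dE/db1
            np.outer(d_out, z) + alpha * W2,   # dE/dW2
            d_out + alpha * b2)                # dE/db2

def steepest_descent_step(params, grads, eta=0.01):
    """One update of Eq. 3.5: w <- w - eta * dE/dw."""
    return [p - eta * g for p, g in zip(params, grads)]
```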

Now that it has been explained how to estimate the gradient of the error with respect to the network weights using the back-propagation procedure, Eq. 3.5 can be applied to update the network weights iteratively until some pre-defined stopping criterion is realized. If the learning rate in Eq. 3.5 is fixed, this is called the steepest descent optimization technique (Robbins and Monro 1951). However, the steepest descent technique is not computationally efficient and, hence, an improved scheme needs to be identified. In this chapter, the scaled conjugate gradient technique (Møller 1993), the subject of the next section, was applied.

3.2.1.2 Scaled Conjugate Gradient Method

The network weights are approximated from the data using a non-linear optimization technique (Mordecai 2003). In this chapter, the scaled conjugate gradient technique (Møller 1993) was used. As described earlier in this chapter, the weight vector which provides the minimum error is calculated by taking successive steps through the weight space, as presented in Eq. 3.5, until a pre-determined stopping criterion is achieved. Different procedures select the learning rate in different ways. In this section, the gradient descent technique is described, followed by how it is modified into the conjugate gradient technique (Hestenes and Stiefel 1952). For the gradient descent technique, the step size is defined as \( -\eta\,{{\partial E}}/{{\partial w}} \), where the parameter \( \eta \) is the learning rate and the gradient of the error is estimated using the back-propagation method.

If the learning rate is adequately small, the value of the error decreases at each successive step until a minimum of the error between the model prediction and the training target data is achieved. The drawback of this method is that it converges slowly compared to other procedures. For the conjugate gradient method, the quadratic function of the error is minimized at every step over a gradually increasing linear vector space that contains the global minimum of the error (Luenberger 1984; Fletcher 1987; Bertsekas 1995; Marwala 2012). In the conjugate gradient technique, the following steps are used (Haykin 1999; Marwala 2009; Babaie-Kafaki et al. 2010; Marwala and Lagazio 2011):

  1. Select the initial weight vector \( {{\{w\}}_0} \).

  2. Estimate the gradient vector \( \frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_0}} \right) \); the initial search direction is \( {{\{d\}}_0}=-\frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_0}} \right) \).

  3. At each step n, apply the line search to identify the \( \eta (n) \) that minimizes \( E(\eta ) \), the objective function expressed in terms of \( \eta \) for fixed \( {{\{w\}}_n} \) and fixed search direction \( {{\{d\}}_n} \).

  4. Check that the Euclidean norm of the vector \( \frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_n}} \right) \) is sufficiently smaller than that of \( \frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_0}} \right) \).

  5. Update the weight vector using Eq. 3.5.

  6. For \( {{\{w\}}_{n+1}} \), calculate the new gradient \( \frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_{n+1 }}} \right) \).

  7. Apply the Polak-Ribière formula to estimate:

    $$ \beta (n+1)=\frac{{\nabla E{{{\left( {{{{\{w\}}}_{n+1 }}} \right)}}^T}\left( {\nabla E\left( {{{{\{w\}}}_{n+1 }}} \right)-\nabla E\left( {{{{\{w\}}}_n}} \right)} \right)}}{{\nabla E{{{\left( {{{{\{w\}}}_n}} \right)}}^T}\nabla E\left( {{{{\{w\}}}_n}} \right)}}. $$

  8. Update the search direction:

    $$ {{\{d\}}_{n+1 }}=-\frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_{n+1 }}} \right)+\beta (n+1){{\{d\}}_n}. $$

  9. Let n = n + 1 and return to step 3.

  10. Terminate when the change in the gradient satisfies \( \left\| {\frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_{n+1 }}} \right)-\frac{{\partial E}}{{\partial \{w\}}}\left( {{{{\{w\}}}_n}} \right)} \right\|\le \varepsilon \), where \( \varepsilon \ll 1 \).

The scaled conjugate gradient technique differs from the conjugate gradient scheme in that it does not include the line search referred to in step 3. The step size is instead estimated using the following formula (Møller 1993):

$$ \eta (n)=2\left( {\frac{{\eta (n)-{{{\left( {\frac{{\partial E}}{{\partial \{w\}}}(n)} \right)}}^T}H(n)\left( {\frac{{\partial E}}{{\partial \{w\}}}(n)} \right)+\eta (n){{{\left\| {\frac{{\partial E}}{{\partial \{w\}}}(n)} \right\|}}^2}}}{{{{{\left\| {\frac{{\partial E}}{{\partial \{w\}}}(n)} \right\|}}^2}}}} \right) $$
(3.10)

where H is the Hessian of the error. The scaled conjugate gradient technique was applied because it has been observed to solve the optimization problem of training an MLP network more efficiently than the gradient descent and conjugate gradient approaches (Bishop 1995).
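The following hedged sketch trains the network of Eq. 3.1 by minimizing Eq. 3.4 with SciPy's non-linear conjugate gradient routine, which uses a Polak-Ribière-type update; it is a stand-in for Møller's scaled variant, which additionally avoids the line search. For brevity the gradient is approximated numerically, whereas a production implementation would supply the back-propagation gradient of Sect. 3.2.1.1. All names and defaults are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def train_mlp_cg(X, T, d, M, alpha=0.01, beta=1.0, seed=0):
    """Fit the two-layer MLP of Eq. 3.1 by conjugate-gradient minimization
    of the regularized objective in Eq. 3.4 (single linear output)."""
    sizes = [M * d, M, M, 1]   # W1, b1, W2, b2 for one output unit

    def unpack(w):
        W1, b1, W2, b2 = np.split(w, np.cumsum(sizes)[:-1])
        return W1.reshape(M, d), b1, W2.reshape(1, M), b2

    def objective(w):
        W1, b1, W2, b2 = unpack(w)
        Y = np.tanh(X @ W1.T + b1) @ W2.T + b2          # forward pass
        return (0.5 * beta * np.sum((T - Y) ** 2)       # beta * E_D
                + 0.5 * alpha * np.sum(w ** 2))         # alpha * E_W

    w0 = 0.1 * np.random.default_rng(seed).standard_normal(sum(sizes))
    result = minimize(objective, w0, method='CG')       # Polak-Ribiere CG
    return unpack(result.x)
```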

3.2.2 Bayesian Framework

Multi-layered neural networks are parameterized models that make probabilistic assumptions about the data. The probabilistic outlook of these models is enabled by the application of the Bayesian framework (Marwala 2012). Learning algorithms are viewed as approaches for identifying parameter values that appear probable in the light of the presented data. The learning process is performed by dividing the data into training, validation, and testing sets. This is done for model selection and to ensure that the trained network is not biased towards the training data it has seen. Another way of realizing this is through the regularization framework, which arises naturally from the Bayesian formulation and is now explained in detail in this chapter.

Thomas Bayes originated Bayes' theorem, and Pierre-Simon Laplace generalized it; the theorem has since been applied to problems in engineering, statistics, political science, and reliability (Stigler 1986; Fienberg 2006; Bernardo 2005; Marwala and Lagazio 2011). Initially, the Bayesian method applied uniform priors and was known as "inverse probability"; it was later superseded by an approach called "frequentist statistics", also known as the maximum-likelihood method. The maximum-likelihood method is intended to identify the most probable solution without regard to the probability distribution of that solution. It is essentially a special case of the Bayesian result, corresponding to the most probable solution of the posterior probability distribution. The Bayesian procedure rests on the following concepts (Bishop 1995):

  • The use of hierarchical models and the marginalization over the values of irrelevant parameters using methods such as Markov chain Monte Carlo techniques.

  • The iterative use of Bayes' theorem as data points are acquired: after a posterior distribution has been approximated, the posterior becomes the next prior.

  • In the maximum-likelihood method, a hypothesis is a proposition that must be proven right or wrong, while in a Bayesian procedure a hypothesis has a probability.

The Bayesian method has been applied to many complex problems, including finite element model updating (Marwala and Sibisi 2005), missing data estimation (Marwala 2009), health risk assessment (Goulding et al. 2012), astronomy (Petremand et al. 2012), classification of file system activity (Khan 2012), simulating ecosystem metabolism (Shen and Sun 2012), and image processing (Thon et al. 2012). The problem of identifying the weights \( w_i \) and biases (the parameters with subscript 0 in Fig. 3.1) in the hidden layers may be posed in Bayesian form as (Box and Tiao 1973; Marwala 2012):

$$ P\left( {\{w\}|[D]} \right)=\frac{{P\left( {[D]|\{w\}} \right)P\left( {\{w\}} \right)}}{{P\left( {[D]} \right)}} $$
(3.11)

where P(w) is the probability distribution function of the weight-space in the absence of any data, also known as the prior distribution, and \( D\equiv ({y_1},\ldots,{y_N}) \) is a matrix containing the data. The quantity P(w|D) is the posterior probability distribution after the data have been seen and P(D|w) is the likelihood function.

3.2.2.1 Likelihood Function

The likelihood function expresses the probability of the observed data given the free parameters of the model, viewed as a function of the weight parameters. Using the sum-of-squares error, it can be expressed mathematically as follows (Edwards 1972; Bishop 1995; Marwala 2012):

$$ \begin{aligned}[b] P\left( {[D]|\{w\}} \right)= & \frac{1}{{{Z_D}}}\exp \left( {-\beta {E_D}} \right) \\ = & \frac{1}{{{Z_D}}}\exp \left( {-\frac{\beta}{2} \sum\limits_n^N {\sum\limits_k^K {{{{\left\{ {{t_{nk }}-{y_{nk }}} \right\}}}^2}} } } \right) \end{aligned} $$
(3.12)

In Eq. 3.12, \( E_D \) is the sum-of-squares error function, β is a hyper-parameter, and \( Z_D \) is a normalization constant which can be approximated as follows (Bishop 1995):

$$ {Z_D}=\int { \exp \left( {-\frac{\beta}{2} \sum\limits_n^N {\sum\limits_k^K {{{{\left\{ {{t_{nk }}-{y_{nk }}} \right\}}}^2}} } } \right)} \,d[D] $$
(3.13)

3.2.2.2 Prior Function

The prior probability distribution is the assumed probability of the free parameters and is approximated by a knowledgeable expert (Jaynes 1968; Bernardo 1979; Marwala and Lagazio 2011). There are many types of priors and these comprise informative and uninformative priors. An informative prior expresses accurate, particular information about a variable while an uninformative prior states general information about a variable. A prior distribution that assumes that model parameters are of the same order of magnitude can be written as follows (Bishop 1995):

$$ \begin{aligned} P\left( {\{w\}} \right)= & \frac{1}{{{Z_w}}} \exp \left( {-\alpha {E_W}} \right) \\ = & \frac{1}{{{Z_w}}} \exp \left( {-\frac{\alpha }{2}\sum\limits_j^W {w_j^2} } \right) \\ \end{aligned} $$
(3.14)

The parameter α is a hyper-parameter, and \( Z_W \) is the normalization constant, which can be approximated as follows (Bishop 1995):

$$ {Z_w}=\int\limits_{{-\infty}}^{\infty } { \exp \left( {-\frac{\alpha }{2}\sum\limits_j^W {w_j^2} } \right)} d\{w\} $$
(3.15)

The prior distribution of the Bayesian method corresponds to the regularization term in Eq. 3.3. Regularization introduces additional information into the objective function through a penalty term, to solve an ill-posed problem or to prevent over-fitting by guaranteeing the smoothness of the solution, thereby balancing complexity against accuracy.

3.2.2.3 Posterior Function

The posterior probability is the probability of the network weights given the observed data. It is a conditional probability assigned after the appropriate evidence is taken into account (Lee 2004). It is estimated by multiplying the likelihood function by the prior function and dividing by a normalization constant. Substituting Eqs. 3.12 and 3.14 into Eq. 3.11, the posterior distribution can be expressed as follows (Bishop 1995):

$$ P\left( {\{w\}|[D]} \right)=\frac{1}{{{Z_E}}} \exp \left( {-\frac{\beta}{2} \sum\limits_n^N {\sum\limits_k^K {{{{\left\{ {{t_{nk }}-{y_{nk }}} \right\}}}^2}} } -\frac{\alpha }{2}\sum\limits_j^W {w_j^2} } \right) $$
(3.16)

where

$$ \begin{aligned} {Z_E}(\alpha, \beta )= & \int { \exp \left( {-\beta {E_D}-\alpha {E_W}} \right)} d\{w\} \\ = & {{\left( {\frac{{2\pi }}{\beta }} \right)}^{{\frac{N}{2}}}}{{\left( {\frac{{2\pi }}{\alpha }} \right)}^{{\frac{W}{2}}}} \\ \end{aligned} $$
(3.17)

Training the network using a Bayesian method gives the probability distribution of the weights shown in Eq. 3.1. The Bayesian method penalizes highly complex models and can choose an optimal model (Bishop 1995).
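The correspondence between Eqs. 3.4 and 3.16 can be made explicit in code: up to normalization, the log posterior is the negative of the regularized objective, so maximizing the posterior is exactly minimizing Eq. 3.4. A minimal sketch, with illustrative names:

```python
import numpy as np

def log_posterior_unnormalized(w_flat, targets, outputs, alpha, beta):
    """log P(w|D) + const = -beta*E_D - alpha*E_W (Eqs. 3.12, 3.14, 3.16)."""
    log_likelihood = -0.5 * beta * np.sum((targets - outputs) ** 2)
    log_prior = -0.5 * alpha * np.sum(w_flat ** 2)
    return log_likelihood + log_prior
```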

3.2.3 Automatic Relevance Determination

As described by Marwala and Lagazio (2011), an automatic relevance determination method is built by associating a hyper-parameter of the prior with each input variable. This, consequently, necessitates generalizing the prior of Eq. 3.14 so that the weight error term becomes (MacKay 1991, 1992):

$$ {E_W}=\frac{1}{2}\sum\limits_k {{\alpha_k}{{{\{w\}}}^T}\left[ {{I_k}} \right]\{w\}} $$
(3.18)

Here, superscript T denotes the transpose, k indexes the weight groups, and \( [I_k] \) is a diagonal matrix whose diagonal elements equal one for weights in group k and zero elsewhere. Using the generalized prior of Eq. 3.18, the posterior probability in Eq. 3.16 becomes (Bishop 1995):

$$ \begin{aligned} P\left( {\{w\}|[D],{H_i}} \right)= & \frac{1}{{{Z_E}}} \exp \left( {-\frac{\beta}{2} \sum\limits_n {{{{\left\{ {{t_n}-y\left({{{\{x\}}}_n}\right)} \right\}}}^2}-\frac{1}{2}\sum\limits_k {{\alpha_k}{{{\{w\}}}^T}\left[ {{I_k}} \right]\{w\}} } } \right) \\ = & \frac{1}{{{Z_E}}} \exp \left( {-E\left( {\{w\}} \right)} \right) \end{aligned} $$
(3.19)

where

$$ {Z_E}(\alpha, \beta )={{\left( {\frac{{2\pi }}{\beta }} \right)}^{{\frac{N}{2}}}}\prod\limits_k {{{{\left( {\frac{{2\pi }}{{{\alpha_k}}}} \right)}}^{{\frac{{{W_k}}}{2}}}}} $$
(3.20)

Here, W k is the number of weights in group k. The evidence can be written as follows (Bishop 1995):

$$ \begin{aligned}[b] p\left( {[D]\left| {\alpha, \beta } \right.} \right)& = \frac{1}{{{Z_D}{Z_W}}}\int { \exp \left( {-E\left( {\{w\}} \right)} \right)d\{w\}} \\ &= \frac{{{Z_E}}}{{{Z_D}{Z_W}}} \end{aligned} $$
(3.21)

The simultaneous estimation of the network weights and the hyper-parameters can be achieved in a number of ways, including the use of Monte Carlo methods or any of their derivatives to sample the posterior probability distribution. Another way of achieving this goal is to maximize the log evidence, giving the following estimates for the hyper-parameters (Bishop 1995):

$$ {\beta^{MP }}=\frac{{N-\gamma }}{{2{E_D}\left( {{{{\{w\}}}^{MP }}} \right)}} $$
(3.22)
$$ \alpha_k^{MP }=\frac{{{\gamma_k}}}{{2{E_{{{W_k}}}}\left( {{{{\{w\}}}^{MP }}} \right)}} $$
(3.23)

where \( \gamma =\sum\limits_k {{\gamma_k}} \), \( 2{E_{{{W_k}}}}={{\{w\}}^T}\left[ {{I_k}} \right]\{w\} \) and

$$ {\gamma_k}=\sum\limits_j {\left( {\frac{{{\eta_j}-{\alpha_k}}}{{{\eta_j}}}{{{\left( {{{{\left[ V \right]}}^T}\left[ {{I_k}} \right]\left[ V \right]} \right)}}_{jj }}} \right)} $$
(3.24)

and \( {{\{w\}}^{MP}} \) is the weight vector at the maximum of the posterior distribution, identified in this chapter using the scaled conjugate gradient method, \( \eta_j \) are the eigenvalues of [A], the Hessian of the objective function in Eq. 3.19, and [V] is the matrix of corresponding eigenvectors, which satisfies \( {{\left[ V \right]}^T}[V]=[I] \). To estimate the relevance of each input variable, \( \alpha_k^{MP } \) and \( \beta^{MP} \) are computed using the following steps (MacKay 1991):

  1. Randomly choose the initial values for the hyper-parameters.

  2. Train the neural network using the scaled conjugate gradient algorithm to minimize the objective function in Eq. 3.4 and thus identify \( {{\{w\}}^{MP}} \).

  3. Apply the evidence framework to re-estimate the hyper-parameters using Eqs. 3.22 and 3.23.

  4. If not converged, go to step 2.
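A hedged sketch of the re-estimation in step 3 is given below. It assumes a Gauss-Newton approximation \( [A] \approx \beta J^T J + \sum_k \alpha_k [I_k] \) of the Hessian, where J is the Jacobian of the network output with respect to the weights, and uses the trace form \( \gamma_k = W_k - \alpha_k\, \mathrm{tr}([A]^{-1}[I_k]) \) (MacKay 1992), which is equivalent to the eigen-expansion of Eq. 3.24. The grouping of weights by input variable and all names are illustrative:

```python
import numpy as np

def evidence_reestimate(J, residuals, w, groups, alpha, beta):
    """One evidence-framework update of the hyper-parameters (Eqs. 3.22-3.23).

    J         : N x W Jacobian of the network output w.r.t. the W weights
    residuals : t - y at the current weights (length N)
    groups    : one index array per input variable, selecting the
                first-layer weights fanning out of that input
    """
    N, W = J.shape
    diag_alpha = np.zeros(W)
    for g, a_k in zip(groups, alpha):
        diag_alpha[g] = a_k
    A_inv = np.linalg.inv(beta * (J.T @ J) + np.diag(diag_alpha))

    # gamma_k = W_k - alpha_k * tr(A^{-1} I_k): the well-determined weights
    gamma = np.array([len(g) - a_k * np.trace(A_inv[np.ix_(g, g)])
                      for g, a_k in zip(groups, alpha)])
    new_alpha = gamma / np.array([w[g] @ w[g] for g in groups])  # Eq. 3.23
    new_beta = (N - gamma.sum()) / np.sum(residuals ** 2)        # Eq. 3.22
    return new_alpha, new_beta
```

In the full loop, steps 2 and 3 alternate: the network is retrained with the updated \( \alpha_k \) and β, and the procedure stops when the hyper-parameters change negligibly.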

3.3 Applications of ARD in Inflation Modeling

In this chapter, we apply ARD to identify the variables that drive inflation. Inflation is measured using the Consumer Price Index (CPI). Artificial intelligence has been used in the past to model inflation. For example, Şahin et al. (2004) applied neural networks and cognitive mapping to model Turkey's inflation dynamics, while Anderson et al. (2012) applied a neural network to estimate the functional relationship between certain component sub-indexes and the CPI, and extracted decision rules from the network.

Binner et al. (2010) studied the influence of money on inflation forecasting. They applied recurrent neural networks and kernel recursive least-squares regression to identify the best-fitting U.S.A. inflation prediction models and compared these to a naïve random-walk model. Their results demonstrated no correlation between monetary aggregates and inflation. McAdam and McNelis (2005) used neural networks and models based on Phillips-curve formulations to forecast inflation. The proposed models outperformed the best-performing linear models. Cao et al. (2012) applied a linear autoregressive moving average (ARMA) model and neural networks to forecast medical cost inflation rates. The results showed that the neural network model outperformed the ARMA model.

Nakamura (2005) applied neural networks to predicting inflation, and the results on U.S.A. data demonstrated that neural networks outperformed a univariate autoregressive model, while Binner et al. (2006) used a neural network and a Markov switching autoregressive (MS-AR) model to predict U.S.A. inflation and found that the MS-AR model performed better than the neural networks.

The CPI is a measure of inflation in an economy. It measures the changes in prices of a fixed, pre-selected basket of goods. The basket of goods used for calculating the CPI in South Africa is as follows (Anonymous 2012):

  1. Food and non-alcoholic beverages: bread and cereals, meat, fish, milk, cheese, eggs, oils, fats, fruit, vegetables, sugar, sweets, desserts, and other foods

  2. Alcoholic beverages and tobacco

  3. Clothing and footwear

  4. Housing and utilities: rents, maintenance, water, electricity, and others

  5. Household contents, equipment, and maintenance

  6. Health: medical equipment, outpatient, and medical services

  7. Transport

  8. Communication

  9. Recreation and culture

  10. Education

  11. Restaurants and hotels

  12. Miscellaneous goods and services: personal care, insurance, and financial services.

This basket is weighted, and the variation in prices of these goods is tracked from month to month as the basis for calculating inflation. It must be noted that there is often debate as to whether this basket of goods is appropriate. For example, in South Africa, where there are two economies, one developed and formal and the other informal and under-developed, there is an ongoing debate on the validity of the CPI. This matters all the more because salary negotiations are based on the CPI.

In this chapter, we use CPI data from 1992 to 2011 to model the relationship between economic variables and the CPI. These economic variables are listed in Table 3.1. They represent the performance of various aspects of the economy and comprise 23 variables covering agriculture, manufacturing, mining, energy, construction, and other sectors. A multi-layered perceptron neural network with 23 input variables, 12 hidden nodes, and 1 output representing the CPI was constructed. The ARD-based MLP network was trained using the scaled conjugate gradient method; all of these techniques were described earlier in the chapter. The relevance of each variable is indicated in Table 3.1.

Table 3.1 Automatic relevance with multi-layer perceptron and scaled conjugate gradient

From Table 3.1, the following variables are deemed essential for modeling the CPI: mining; transport, storage and communication; financial intermediation, insurance, real estate and business services; community, social and personal services; gross value added at basic prices; taxes less subsidies on products; affordability; economic growth; repo rate; gross domestic product; household consumption; and investment.

It should be noted, however, that these results are based purely on the data set that was analyzed and the methodology that was used, namely ARD based on the MLP. These conclusions may change from one economy to another and from one methodology to another, e.g., support vector machines instead of the MLP.
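As a final illustrative sketch, a ranking such as that in Table 3.1 can be read off the optimized hyper-parameters: a small \( \alpha_k \) permits large weights for input k and therefore signals high relevance, so inputs are commonly ranked by \( 1/\alpha_k \). The values and variable names below are placeholders, not the chapter's results:

```python
import numpy as np

# Placeholder alphas for three of the 23 inputs, e.g. as returned by an
# evidence-framework loop like the sketch in Sect. 3.2.3.
alpha = np.array([0.2, 5.0, 0.4])
names = ['mining', 'transport', 'repo_rate']   # illustrative subset

for name, relevance in sorted(zip(names, 1.0 / alpha), key=lambda p: -p[1]):
    print(f'{name}: relevance {relevance:.2f}')
```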

3.4 Conclusions

This chapter presented the Bayesian and the evidence frameworks used to create an automatic relevance determination technique. The ARD method was then applied to determine the relevance of economic variables that are essential for driving the consumer price index. It is concluded that, for the data analyzed using the MLP-based ARD technique, the variables driving the CPI are mining; transport, storage and communication; financial intermediation, insurance, real estate and business services; community, social and personal services; gross value added at basic prices; taxes less subsidies on products; affordability; economic growth; repo rate; gross domestic product; household consumption; and investment.