
1 Introduction

Mixtures of truncated basis functions (MoTBFs) [2] have recently been proposed as a general framework for handling hybrid Bayesian networks, i.e., Bayesian networks where discrete and continuous variables coexist. Previous hybrid models, such as the so-called mixtures of truncated exponentials (MTEs) [7] and mixtures of polynomials (MoPs) [10], can be regarded as particular cases of MoTBFs.

Part of the success of MoTBFs is due to the fact that they can model hybrid Bayesian networks with no structural restrictions, unlike the conditional Gaussian (CG) model [6], where discrete variables are not allowed to have continuous parents. Furthermore, MoTBFs are closed under addition, multiplication, and integration, which facilitates the use of efficient inference methods like the Shenoy-Shafer architecture [9] or the variable elimination algorithm [12].

The problem of learning MoTBFs from data has already received considerable attention (see, e.g., [3, 5]). However, even though a Bayesian network model populated with MoTBF distributions requires the specification of both marginal and conditional MoTBF distributions, only limited attention has been given to learning the conditional MoTBF distributions directly from data [1, 11]. In this paper we first extend previous work on learning marginal MoTBF distributions [5] to also learn joint densities. These are in turn employed to generate the required conditional MoTBFs.

The remainder of the paper is organized as follows: The MoTBF model is introduced in Sect. 2. Next, techniques for learning marginal and joint MoTBF densities from data are described in Sect. 3, where we also detail how we define the conditional distributions. The main part of this work is given in Sect. 4, where our proposal is validated through a series of experiments. Finally, we give some conclusions in Sect. 5.

2 The MoTBF Model

The MoTBF framework is based on the abstract notion of real-valued basis functions \(\psi (\cdot )\), which include both polynomial and exponential functions as special cases. Let X be a continuous variable with domain \(\varOmega _X\subset \mathbb {R}\) and let \(\psi _i:\varOmega _X\mapsto \mathbb {R}\), for \(i=0,\ldots , k\), define a collection of real basis functions. We say that a function \(f:\varOmega _{X}\mapsto \mathbb {R}_0^+\) is an MoTBF potential of level k wrt. \(\varPsi =\{\psi _0,\psi _1,\ldots , \psi _k\}\) if f can be written as

$$\begin{aligned} f(x) = \sum _{i=0}^k c_{i}\, \psi _i\left( x \right) \!, \end{aligned}$$

where \(c_i\) are real numbers [2]. The potential is a density if \(\int _{\varOmega _X} f(x) \,\mathrm{d}x=1\).

In this paper we will restrict our attention to the MoP framework, meaning that \(\psi _i(x)=x^i\).
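To make the representation concrete, the following Python sketch evaluates a univariate MoP potential from its coefficient vector and checks that it integrates to one over its domain. The coefficients and the domain \([-1,1]\) are hypothetical and serve only as an illustration.

```python
import numpy as np

# A level-k MoP potential f(x) = sum_i c_i x^i is fully described by its
# coefficient vector (c_0, ..., c_k).  Hypothetical coefficients for a
# density on [-1, 1]: f(x) = 0.75 - 0.75 x^2 = 0.75 (1 - x^2).
coeffs = np.array([0.75, 0.0, -0.75])

def mop_eval(c, x):
    """Evaluate the MoP potential with coefficients c at the points x."""
    return np.polynomial.polynomial.polyval(x, c)

# The potential is a density if it integrates to one over the domain.
antideriv = np.polynomial.polynomial.polyint(coeffs)
mass = (np.polynomial.polynomial.polyval(1.0, antideriv)
        - np.polynomial.polynomial.polyval(-1.0, antideriv))

print(mop_eval(coeffs, np.array([-0.5, 0.0, 0.5])))  # [0.5625, 0.75, 0.5625]
print(mass)                                          # 1.0
```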

When there is more than one variable, we can use a joint MoP to capture the probability density function over the variables. Let \(\mathbf {X}\) be a d-dimensional continuous variable, \(\mathbf {X}=(X_1,\dots ,X_d)\) with domain \(\varOmega _\mathbf {X}\subset \mathbb {R}^d\). A function \(f: \varOmega _{\mathbf {X}}\mapsto \mathbb {R}^+\) is said to be an MoP potential of level k if it can be written as

$$\begin{aligned} f(\mathbf {x})=\sum _{\ell _1=0}^k \ldots \sum _{\ell _d=0}^k c_{\ell _1,\ell _2,\ldots ,\ell _d} \prod _{i=1}^d x_i^{\ell _i}, \end{aligned}$$
(1)

or if there is a partition of \(\varOmega _{\mathbf {X}}\) into hypercubes where f can be written as in Eq. 1 for each part.
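A joint MoP is represented analogously by a multidimensional coefficient array. Below is a minimal bivariate sketch; the coefficient matrix is again hypothetical and only illustrates the evaluation of Eq. 1 for d = 2.

```python
import numpy as np

# A bivariate MoP of level 1 with coefficient array C, where
# f(x1, x2) = sum_{l1, l2} C[l1, l2] x1^l1 x2^l2.
# Hypothetical coefficients: f(x1, x2) = 0.25 + 0.2 x1 x2, which is
# non-negative on [-1, 1]^2 and integrates to one there.
C = np.array([[0.25, 0.0],
              [0.0,  0.2]])

value = np.polynomial.polynomial.polyval2d(0.3, -0.2, C)  # evaluate at one point
print(value)                                              # 0.25 + 0.2*0.3*(-0.2) = 0.238
```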

3 Learning MoPs from Data

We will now investigate how to learn MoP distributions for a given set of random variables. We start by looking at how to learn univariate MoP distributions from data, before we extend that approach to learning joint MoP distributions, and finally discuss how one can obtain conditional distribution functions.

3.1 Univariate MoPs

The learning of univariate MoTBFs from data was explored in [5], and we will briefly summarize that approach here in the special case of MoPs. The estimation procedure relies on the empirical cumulative distribution function (CDF) as a representation of the data \(\mathcal{D}=\{ x_1,\ldots ,x_N \}\). The empirical CDF is defined as

$$\begin{aligned} G_N(x) = \frac{1}{N} \sum _{\ell =1}^N \mathbf {1}\{ x_\ell \le x\}, \quad x\in \varOmega _X\subset \mathbb {R}, \end{aligned}$$

where \(\mathbf {1}\{ \cdot \}\) is the indicator function.
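A direct implementation of the empirical CDF is straightforward; the sketch below computes \(G_N\) on a set of evaluation points for a toy sample (the sample itself is, of course, only illustrative).

```python
import numpy as np

def empirical_cdf(data, points):
    """G_N evaluated at the given points: for each point x, the fraction
    of observations x_l with x_l <= x."""
    data = np.asarray(data)
    return np.array([np.mean(data <= x) for x in np.atleast_1d(points)])

rng = np.random.default_rng(0)
sample = rng.normal(size=100)                       # toy data set
print(empirical_cdf(sample, [-1.0, 0.0, 1.0]))      # roughly [0.16, 0.5, 0.84]
```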

The algorithm in [5] approximates the empirical CDF by a function whose derivative is an MoTBF, using least squares. In our case, the role of the basis functions is taken by the polynomials, and since the integral of a polynomial is itself a polynomial, the target function is of the form \(F(x)= \sum _{i=0}^{k} c_{i}\,x^i\), defined on an interval \(\varOmega _X=[a,b]\subset \mathbb {R}\). The optimization problem thus becomes

$$\begin{aligned} \mathbf {minimize}\quad &\sum _{\ell =1}^N \left( G_N(x_\ell ) - F(x_\ell )\right) ^2 \\ \mathbf {subject~to}\quad &\frac{\mathrm{d} F(x)}{\mathrm{d}x} \ge 0 \quad \forall x\in \varOmega _X, \\ &F(a) = 0 \text { and } F(b) = 1. \end{aligned}$$
(2)

The probability density function (PDF) is found by simple differentiation of the estimated CDF. The constraints of the optimization program ensure that the result is a legal density: the first requirement ensures that the PDF is non-negative over the domain, while the other two ensure that it integrates to one. Furthermore, [5] remarks that the solution obtained by solving the program in Eq. 2 is a consistent estimator of the true CDF in terms of the mean squared error for all \(x\in \varOmega _X\).

Note that the optimization program is convex, and can be efficiently solved in theory. However, the infinite number of constraints introduced by requiring that \(\frac{\mathrm{d} F(x)}{\mathrm{d}x} \ge 0\) for all \(x\in \varOmega _X\) complicates the implementation on a computer. In practice, we therefore only check that the constraint is fulfilled for a limited set of points spread across \(\varOmega _X\).
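The following Python sketch illustrates one way of solving the program in Eq. 2 with the non-negativity constraint checked only on a finite set of points, as described above. It relies on scipy's SLSQP solver and is not the authors' implementation; the function name and defaults are ours.

```python
import numpy as np
from scipy.optimize import minimize

P = np.polynomial.polynomial

def fit_mop_cdf(data, k, a, b, n_check=50):
    """Least-squares fit of a CDF F(x) = sum_{i=0}^k c_i x^i on [a, b]
    (the program in Eq. 2), with F'(x) >= 0 enforced on n_check points."""
    data = np.sort(np.asarray(data))
    G = np.arange(1, len(data) + 1) / len(data)    # empirical CDF at the (sorted) data
    check = np.linspace(a, b, n_check)             # points where F'(x) >= 0 is checked

    def objective(c):
        return np.sum((G - P.polyval(data, c)) ** 2)

    constraints = [
        {"type": "ineq", "fun": lambda c: P.polyval(check, P.polyder(c))},  # F' >= 0
        {"type": "eq",   "fun": lambda c: P.polyval(a, c)},                 # F(a) = 0
        {"type": "eq",   "fun": lambda c: P.polyval(b, c) - 1.0},           # F(b) = 1
    ]

    c0 = np.zeros(k + 1)
    c0[0], c0[1] = -a / (b - a), 1.0 / (b - a)     # start from the uniform CDF on [a, b]
    res = minimize(objective, c0, constraints=constraints, method="SLSQP")
    return res.x                                    # coefficients of the estimated CDF

# The PDF is obtained by differentiating the estimated CDF, e.g.
# pdf_coeffs = np.polynomial.polynomial.polyder(fit_mop_cdf(sample, 4, -3.0, 3.0))
```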

In learning situations with large amounts of data (N large), solving the program can be slow. In such cases we instead define a grid on \(\varOmega _X\), chosen so that the number of observations falling between each pair of consecutive grid-points is the same. The grid-points then play the role of the evaluation points in the objective function.
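Such an equal-frequency grid can be obtained directly from the empirical quantiles, e.g. as in this small helper (our own sketch):

```python
import numpy as np

def equal_count_grid(data, n_points):
    """Grid on the domain of X with (roughly) the same number of observations
    between consecutive grid-points; these replace the raw observations as
    evaluation points in the objective when N is large."""
    return np.quantile(np.asarray(data), np.linspace(0.0, 1.0, n_points))
```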

The level k of the estimated MoP can be decided using a multitude of different model selection techniques. For the results presented in this paper we have searched greedily for k, and chosen the value that maximized the BIC score [8]. This choice is motivated by [3], who showed that the estimators based on Eq. 2 are consistent in terms of the mean squared error for all \(x\in \varOmega _X\).
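A greedy BIC-based search over the level k could be sketched as follows. The fitting routine is passed in as a parameter (for instance the fit_mop_cdf sketch above); the exact parameter count and stopping rule used in [5] may differ from this illustration.

```python
import numpy as np

P = np.polynomial.polynomial

def bic_select_level(data, a, b, fit_cdf, k_max=12):
    """Increase the level k greedily as long as the BIC score improves.
    fit_cdf(data, k, a, b) must return the CDF coefficients of a level-k MoP."""
    data = np.asarray(data)
    best = (None, -np.inf, None)                   # (k, BIC, coefficients)
    for k in range(1, k_max + 1):
        c = fit_cdf(data, k, a, b)
        pdf_vals = np.clip(P.polyval(data, P.polyder(c)), 1e-12, None)
        bic = np.sum(np.log(pdf_vals)) - 0.5 * k * np.log(len(data))
        if bic <= best[1]:
            break                                  # stop at the first non-improving level
        best = (k, bic, c)
    return best
```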

3.2 Joint MoPs

When defining the conditional distributions (described in Sect. 3.3), we will investigate the use of joint MoP densities. We therefore proceed by extending the program in Eq. 2 to random vectors of arbitrary dimension. The procedure is very similar to the univariate case. The data now consist of d-dimensional observations, \(\mathcal{D}=\{ \mathbf {x}_1,\ldots ,\mathbf {x}_N \}\), \(\mathbf {x}\in \varOmega _\mathbf {X}\subset \mathbb {R}^d\). We continue to use \(\mathbf {1}\{\cdot \}\) to denote the indicator function, and say that the event \(\mathbf {x}_\ell \le \mathbf {x}\) is true if and only if \(x_{\ell ,i} \le x_i\) for each dimension \(i=1,\ldots , d\). For notational convenience we use \(\varOmega _\mathbf {X}^-\in \mathbb {R}^d\) to denote the minimal point of \(\varOmega _\mathbf {X}\) (obtained by choosing the minimum of \(\varOmega _\mathbf {X}\) in each dimension), and let \(\varOmega _\mathbf {X}^+\in \mathbb {R}^d\) be the corresponding maximal point. Then, the empirical CDF is defined as

$$\begin{aligned} G_N(\mathbf {x}) = \frac{1}{N} \sum _{\ell =1}^N \mathbf {1}\{ \mathbf {x}_\ell \le \mathbf {x}\}, \quad \mathbf {x}\in \varOmega _\mathbf {X}\subset \mathbb {R}^d. \end{aligned}$$

Our goal is to find a representation of the empirical CDF of the form

$$F(\mathbf {x})=\sum _{\ell _1=0}^k \ldots \sum _{\ell _d=0}^k c_{\ell _1,\ell _2,\ldots ,\ell _d} \prod _{i=1}^d x_i^{\ell _i},$$

leading us to the optimization problem

$$\begin{aligned} \mathbf {minimize}\quad &\sum _{\ell =1}^N \left( G_N(\mathbf {x}_\ell ) - F(\mathbf {x}_\ell )\right) ^2 \\ \mathbf {subject~to}\quad &\frac{\partial ^d F(\mathbf {x})}{\partial x_1 \cdots \partial x_d} \ge 0 \quad \forall \mathbf {x}\in \varOmega _\mathbf {X}, \\ &F\left( \varOmega _\mathbf {X}^-\right) = 0 \text { and } F\left( \varOmega _\mathbf {X}^+\right) = 1. \end{aligned}$$
(3)

The solution to this problem is the parameter set that defines the joint CDF, and the density can be obtained simply by differentiating the joint CDF. As in the univariate case, the problem is a quadratic optimization problem that can be solved efficiently. When the amount of data and/or the dimensionality gets large, we use the same grid-point strategy for the joint density as when estimating the univariate PDFs.
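For concreteness, the sketch below specializes the program in Eq. 3 to the bivariate case, again enforcing non-negativity of the mixed derivative only on a finite grid of check points. As before, this is our own illustration (using scipy's SLSQP solver), not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

P = np.polynomial.polynomial

def fit_joint_mop_cdf(data, k, lower, upper, n_check=10):
    """Fit a bivariate CDF F(x1, x2) with a (k+1) x (k+1) coefficient array
    (the program in Eq. 3 for d = 2); data has shape (N, 2)."""
    data = np.asarray(data)
    # Empirical joint CDF evaluated at the observations.
    G = np.array([np.mean(np.all(data <= x, axis=1)) for x in data])
    g1 = np.linspace(lower[0], upper[0], n_check)
    g2 = np.linspace(lower[1], upper[1], n_check)
    X1, X2 = np.meshgrid(g1, g2)

    def unpack(c):
        return c.reshape(k + 1, k + 1)

    def objective(c):
        return np.sum((G - P.polyval2d(data[:, 0], data[:, 1], unpack(c))) ** 2)

    def mixed_derivative(c):
        D = P.polyder(P.polyder(unpack(c), axis=0), axis=1)   # d^2 F / (dx1 dx2)
        return P.polyval2d(X1, X2, D).ravel()

    constraints = [
        {"type": "ineq", "fun": mixed_derivative},                              # density >= 0
        {"type": "eq", "fun": lambda c: P.polyval2d(*lower, unpack(c))},        # F(min) = 0
        {"type": "eq", "fun": lambda c: P.polyval2d(*upper, unpack(c)) - 1.0},  # F(max) = 1
    ]

    # Start from the CDF of the uniform distribution on the rectangle.
    u1 = np.zeros(k + 1); u1[:2] = [-lower[0], 1.0]
    u2 = np.zeros(k + 1); u2[:2] = [-lower[1], 1.0]
    c0 = np.outer(u1, u2) / ((upper[0] - lower[0]) * (upper[1] - lower[1]))

    res = minimize(objective, c0.ravel(), constraints=constraints, method="SLSQP")
    return unpack(res.x)

# The joint density is recovered by differentiating the fitted CDF once in each
# dimension: pdf_coeffs = P.polyder(P.polyder(cdf_coeffs, axis=0), axis=1)
```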

The top of Fig. 1 shows the MoP density generated by solving the optimization program in Eq. 3. The model was learned from a database of 1000 observations generated from a bivariate standard normal distribution (i.e., with correlation coefficient \(\rho =0\)). The bottom part of Fig. 1 shows the model learned from the same type of distribution but with correlation \(\rho =0.99\).

3.3 Conditional Distributions

The last piece of the puzzle is to learn the conditional density functions for a variable X with parents \(\mathbf {Z}\), which will be used to populate the Bayesian network structure. Using the minimization program in Eq. 3, we can learn both \(f(x,\mathbf {z})\) and \(f(\mathbf {z})\); hence, by the definition of a conditional probability density, it seems natural to define \(f(x|\mathbf {z})\) as

$$\begin{aligned} f(x|\mathbf {z}) \leftarrow \frac{f(x, \mathbf {z})}{f(\mathbf {z})}, \end{aligned}$$
(4)

where both \(f(\mathbf {z})\) and \(f(x, \mathbf {z})\) are MoPs. Unfortunately, though, MoPs are not closed under division [2], so \(f(x|\mathbf {z})\) defined by Eq. 4 will not lead to a legal MoP representation of a conditional density. An alternative was therefore pursued in [2], where the influence the parents \(\mathbf {Z}\) have on X is encoded only through a partitioning of the domain of \(\mathbf {Z}\) into hypercubes. Then, a specific distribution for X, valid as long as \(\mathbf {Z}\) lies inside a given hypercube, is learned from data for each hypercube.

Fig. 1. The contour and perspective plots of the result of learning an MoP from \(N=1000\) samples drawn from bivariate standard normal distributions with \(\rho =0\) (top) and \(\rho =0.99\) (bottom).

Here, however, we will follow an alternative strategy similar to the one pursued in [11]. The idea is to learn representations for \(f(x, \mathbf {z})\) and \(f(\mathbf {z})\), and then utilize Eq. 4 to calculate \(f(x|\mathbf {z})\). As already noted, this will not result in an MoP, and the next step is therefore to approximate this representation by an MoP. Varando et al. [11] investigated two schemes: (i) using the representation in Eq. 4 to generate samples from the conditional distribution of X given \(\mathbf {Z}\) and learning the MoP representation from the generated dataset; (ii) using numerical techniques to approximate the fraction directly (specifically, both Taylor series and Lagrange interpolation were considered). In our work we first learn an MoP representation for \(f(x,\mathbf {z})\) using the program in Eq. 3, and then calculate \(f(\mathbf {z}) = \int _{\varOmega _X} f(x,\mathbf {z})\, \mathrm{d}x\) directly from the learned joint. Note that since \(f(x,\mathbf {z})\) is an MoP, the integral can easily be computed analytically. Next, the conditional distribution defined through Eq. 4 is our target, leading to the following optimization program:

$$\begin{aligned} \mathbf {minimize}\quad &\sum _{\ell =1}^N \left( \frac{f(x_\ell , \mathbf {z}_\ell )}{f(\mathbf {z}_\ell )} - f(x_\ell |\mathbf {z}_\ell )\right) ^2 \\ \mathbf {subject~to}\quad &f(x|\mathbf {z}) \ge 0 \quad \forall (x, \mathbf {z}) \in \varOmega _X\times \varOmega _\mathbf {Z}. \end{aligned}$$
(5)

The solution to this problem is a parameter set that defines an un-normalized conditional PDF (that is, we have no guarantee that \(\int _{\varOmega _X} f(x|\mathbf {z})\, \mathrm{d}x=1\) for all \(\mathbf {z}\in \varOmega _{\mathbf {Z}}\)). Hence, the procedure is finalized by partially normalizing the distribution [10]. The program is quadratic, and can therefore be solved efficiently.
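As an illustration of this last step, the sketch below marginalizes a joint MoP analytically and then fits the conditional MoP of Eq. 5 for a single parent Z, with the non-negativity constraint checked on a finite grid. The helper names are ours, C_joint denotes the coefficient array of the joint density f(x, z) (e.g. obtained by differentiating the fitted joint CDF), and the final partial-normalization step [10] is omitted.

```python
import numpy as np
from scipy.optimize import minimize

P = np.polynomial.polynomial

def marginalize_x(C_joint, ax, bx):
    """Integrate the joint MoP f(x, z) over x analytically (x along axis 0),
    returning the coefficient vector of the marginal f(z)."""
    A = P.polyint(C_joint, axis=0)                    # antiderivative in x
    return P.polyval(bx, A) - P.polyval(ax, A)

def fit_conditional_mop(x_obs, z_obs, C_joint, dom_x, dom_z, k, n_check=15):
    """Fit an un-normalized MoP f(x | z) to the ratio f(x, z)/f(z) evaluated
    at the observations (the program in Eq. 5), for a one-dimensional parent."""
    cz = marginalize_x(C_joint, *dom_x)
    target = (P.polyval2d(x_obs, z_obs, C_joint)
              / np.clip(P.polyval(z_obs, cz), 1e-12, None))

    GX, GZ = np.meshgrid(np.linspace(*dom_x, n_check), np.linspace(*dom_z, n_check))

    def unpack(d):
        return d.reshape(k + 1, k + 1)

    def objective(d):
        return np.sum((target - P.polyval2d(x_obs, z_obs, unpack(d))) ** 2)

    constraints = [{"type": "ineq",
                    "fun": lambda d: P.polyval2d(GX, GZ, unpack(d)).ravel()}]  # f(x|z) >= 0

    res = minimize(objective, np.zeros((k + 1) ** 2),
                   constraints=constraints, method="SLSQP")
    return unpack(res.x)     # coefficients of the un-normalized conditional MoP
```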

We note that while the programs in Eqs. 2 and 3 are defined to obtain CDFs, the program in Eq. 5 works directly with the PDF. The reason for the programs in Eqs. 2 and 3 to work with cumulative distribution functions is that \(G_N(\cdot )\) is a more robust representation of the data than, say, a histogram [5], and since \(G_N(\cdot )\) represents the empirical CDF, the results of these programs are also CDFs. The program in Eq. 5, on the other hand, does not work directly with a representation of the data, but rather defines its target function through Eq. 4. Therefore, the objects under study in this program are PDFs.

4 Experimental Analysis

In this section, we compare the proposal given in Sect. 3 with the methods described in [5] (where the conditioning variables are discretized) and in [11] (where B-splines are used) for learning conditional MoPs from data.

We consider two different scenarios concerning two continuous variables, X and Y. In the first one, \(Y\sim \mathcal{N} (\mu = 0,\sigma = 1)\) and \(X|\{Y=y\}\sim \mathcal{N} (\mu = y,\sigma = 1)\). In the second scenario, \(Y\sim \text {Gamma}(\text {rate}=10,\text { shape}=10)\) and \(X|\{Y=y\}\sim \text {Exp}(\text {rate}=y)\). For each scenario, we generated 10 data sets of samples \(\{X_i,Y_i\}_{i=1}^N\), with sample sizes \(N =25\), 500, 2500, and 5000. The effectiveness of the tested methods was measured by computing the mean squared error for each set of samples. The results are shown in Tables 1 and 2.
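The sampling setup for the two scenarios is easy to reproduce; the sketch below shows one possible way of generating the data sets and computing the mean squared error on a grid of evaluation points. Seeds, grids, and helper names are our own assumptions, as the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scenario(scenario, n):
    """Draw one data set {(x_i, y_i)}_{i=1}^n for the two scenarios."""
    if scenario == 1:
        y = rng.normal(0.0, 1.0, size=n)
        x = rng.normal(loc=y, scale=1.0)                     # X | Y=y ~ N(y, 1)
    else:
        y = rng.gamma(shape=10.0, scale=1.0 / 10.0, size=n)  # rate 10 -> scale 1/10
        x = rng.exponential(scale=1.0 / y)                   # X | Y=y ~ Exp(rate=y)
    return x, y

def mean_squared_error(true_density, estimated_density):
    """MSE between the true and estimated conditional densities, both
    evaluated on the same grid of (x, y) points."""
    return np.mean((np.asarray(true_density) - np.asarray(estimated_density)) ** 2)
```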

The results in Table 1 indicate that the most accurate results for scenario 1 are achieved by the B-spline approach [11]. The worst results by far are obtained by the approach that discretizes the conditioning variables [5]. Both the proposed approach and the B-spline approach yield errors close to zero in most cases.

The results for scenario 2 are reported in Table 2. In this case, the most accurate results in terms of mean square error are provided by the MoTBF approach. Again, the method in [5] obtains the worst results overall.

The results are consistent with the plots in Fig. 2, where the MoTBF approach presented in this paper (bottom row in the figure) is able to resemble the shape of the exact conditional distribution (top row), especially in the non-Gaussian scenario, while the method in [5] (middle row) is penalized by the fact that the estimated model is piecewise constant along the Y axis. The plots in Fig. 2 show the results obtained when learning from \(N=5000\) samples.

Table 1. Average MSE between the MoP approximations obtained by the different methods and the true conditional densities, for each set of 10 samples, where \(Y\sim \mathcal{N}(0,1)\) and \(X|Y=y\sim \mathcal{N}(y,1)\).
Table 2. Average MSE between the MoP approximations obtained by the different methods and the true conditional densities, for each set of 10 samples, where \(Y\sim \text {Gamma}(\text {rate}=10,\text {shape}=10)\) and \(X|Y=y\sim \text {Exp}(\text {rate}=y)\).
Fig. 2. For the two scenarios (in columns): the true conditional density (top row), the MoP produced by the method introduced in [5] (middle row), and the MoP obtained by the proposal in this paper (bottom row).

5 Concluding Remarks

In this paper we have extended the learning algorithm for univariate MoTBFs in [5] to multivariate and conditional densities. The advantage of the proposal described here with respect to the B-spline approach in [11] is that there is no need to split the domain of any variable, which is fundamental for keeping the complexity of inference in hybrid Bayesian networks under control. We note that while in theory high-order polynomials may be required to model the distributions, the use of the BIC score [8] leads to low-order polynomials being selected in practice [4, 5].

The experimental analysis suggests that our proposal is competitive with the B-spline approach over a range of commonly used distributions. Even though the conditional distribution functions yielded by the method in this paper are not proper conditional densities, the evidence so far indicates that they are accurate approximations, which in practice allows the method to be used as a means of representing the parameters of a Bayesian network. This paves the way for structural learning algorithms for hybrid Bayesian networks parameterized by MoTBFs.

Finally, we note that even though the paper develops a learning method for MoPs, the techniques employed here can easily be extended to MoTBFs in general.