Introduction

Recently, demand has been growing for interpretable prediction models in many machine learning problems, even as uninterpretable models with high prediction accuracy, such as deep learning, have become widely used. For example, when constructing a credit rating model, it is legally required that the model be interpretable. Interpretable models include decision trees and regression models such as linear regression and logistic regression.

Such frequently used interpretable models are simple and are therefore usually inferior to uninterpretable models in terms of prediction accuracy. One way to improve prediction accuracy is to construct an interpretable model by combining multiple interpretable models.

In addition, relevance determination is important when constructing an interpretable model. Relevance determination prunes irrelevant features and yields sparse models. Consequently, it is an important machine-learning tool.

Motivated by these backgrounds, we address the issue of relevance determination in supervised mixture distributions [1]. We designate this issue as hierarchical relevance determination (HRD). The HRD task is to simultaneously optimize the number of mixture components, the model parameters of the individual components, and the latent variables that assign observations to the components, and to select an optimal subset of the input variables used in the individual components. As described below, the task is decomposed into a mixture model selection task and a variable selection task.

The mixture model selection task (model selection task hereinafter) is an estimation task for a mixture distribution: to optimize the number of mixture components, the model parameters of the individual components, and the latent variables simultaneously. The expectation-maximization (EM) algorithm [2] has been used widely to estimate the parameters of mixture distributions. However, the number of components cannot be optimized using the EM algorithm. Model comparison approaches are often used to determine the number of components automatically: a mixture with the minimum criterion value is selected as the best distribution from a set of candidate mixtures with different numbers of components. As the criterion, an information criterion such as AIC (Akaike information criterion) [3] or BIC (Bayesian information criterion) [4] is typically used. AIC and BIC are model estimation measures that penalize the training fit by the number of model parameters. Therefore, by choosing the minimum-AIC/BIC model, we can determine the model complexity, such as the number of components. Model comparison approaches are computationally expensive because all candidate mixtures must be estimated. We can avoid this computational cost using Bayesian inference methods, which directly minimize an upper bound of the negative marginal log-likelihood and can therefore estimate the number of components through a single mixture estimation. Such Bayesian inference methods include variational Bayesian inference (VB) [1, 5], collapsed variational Bayesian inference (CVB) [6], and factorized asymptotic Bayesian inference (FAB) [7]. VB entails the shortcoming that the upper bound is loose because the latent variables and model parameters are assumed to be independent. CVB and FAB do not require this independence assumption; therefore, their upper bounds are tighter than that of VB.

The variable selection task is an estimation task for a single distribution: to optimize the parameters and to select an optimal subset of input variables. Model comparison approaches such as the forward–backward algorithm [8, 9] have often been used for variable selection. As in the case of model selection, model comparison approaches are computationally expensive. Sparse estimation methods such as Lasso [10], SCAD [11], and least angle regression [12] have also often been used for variable selection. These methods solve the variable selection task because they derive sparse parameters by minimizing regularized objective functions. Sparse estimation methods avoid the computational cost of model comparison approaches because they directly minimize the regularized objective functions and therefore require only a single estimation.

Then, how can the HRD task be solved? At first glance, it appears that we can readily construct a solution by simply combining the model selection and variable selection methods described above. However, such simple combinations entail the technical difficulties discussed below. It is unrealistic to solve the HRD task using model comparison approaches because the computational cost becomes too large: in the case of HRD, the number of candidate models is much larger than in the cases of model selection and variable selection alone. It is also unrealistic to accomplish the task using sparse estimation approaches alone because it is not trivial to solve the model selection task using a sparse estimation method. Using VB, we can solve the HRD task. However, VB presents the following two shortcomings. First, it is difficult to apply VB widely because update equations must be derived analytically. Second, the latent variables and the model parameters must be assumed to be independent, although their dependence is fundamentally important for the true distributions. We can also solve the HRD task using the FAB-based method proposed in an earlier report [13]. That method solves the model selection task through continuous optimization (by variational inference) and the variable selection task by discrete optimization (using a model comparison approach).

Using these inference methods, hierarchical models have been developed and applied to many areas. Such models include Gaussian process mixtures [14], hidden Markov mixtures [15], gamma mixtures [16], and hierarchical multinomial-Dirichlet model [17].

As described in this paper, we propose the sequential information criterion minimization (SICM) algorithm, a method for relevance determination in mixture distributions. In addition to being a “consistent” framework for solving the HRD task, SICM has the following properties. First, the objective function of SICM is consistent. Therefore, for example, SICM is NOT a method which solves the model selection task by VB and the variable selection task by Lasso. Rather, SICM solves the HRD task as a minimization problem of an information criterion. Second, it is easy to derive the model parameter update equations used in SICM. In fact, SICM estimates the model parameters by iterating \(L_{1}\)-regularized sparse estimation. It is, therefore, not necessary to derive the parameter update equations analytically as in the VB case. Third, SICM requires no assumption of independence between the latent variables and the model parameters. Fourth, the optimization method used in SICM is consistent. Therefore, for example, SICM is NOT a method which solves the model selection task by variational inference (by continuous optimization) and the variable selection task by model comparison (by discrete optimization). SICM minimizes the objective function continuously using variational inference and \(L_{1}\)-regularized sparse estimation.

The key ideas of SICM are summarized as follows (“Overall flow of the SICM algorithm” presents details). First, we use an information criterion, AIC or BIC, as the objective function. SICM solves the HRD task by directly minimizing an information criterion. When minimizing BIC, SICM corresponds to the Bayesian inference methods such as VB and FAB because BIC is an approximate representation of a negative marginal log-likelihood. Second, in SICM, the difficulty of the \(L_{0}\) term included in an information criterion is overcome by a concave continuous approximation. The resulting optimization problem is solved using a method based on the majorization-minimization (MM) algorithm [18]. The MM algorithm minimizes an objective function approximately by minimizing an upper bound of the objective. Third, independence between the latent variables and the model parameters is not assumed because only the marginal distribution of the latent variables is used (that of the model parameters is not used) when variational inference is used in SICM.

We propose a method for constructing an interpretable hierarchical model by the application of SICM. The model is supervised and is represented as a combination of a decision tree and regression models. When conducting prediction, we assign a regression model to an observation by selection from a set of regression models in accordance with the decision tree. By SICM, the number of regression models used in the hierarchical model and subsets of input variables used in the individual regressions are determined automatically.

We demonstrate the utility of the SICM algorithm through experiments conducted using nine UCI datasets [19]. The results indicate that we can construct an interpretable prediction model with higher prediction accuracy than frequently used interpretable models such as decision tree and logistic regression. This is true because, in binary classification problems, the above-mentioned interpretable hierarchical model based on SICM (1) outperformed decision tree and logistic regression, (2) outperformed VB, and (3) performed comparably to support vector machine (SVM), which is representative of uninterpretable models having high prediction accuracy.

In summary, the main contributions of this study are the following.

  1.

    We propose the SICM algorithm, which solves the HRD problem by continuously minimizing an information criterion: AIC or BIC.

  2.

    As an SICM application, we propose an interpretable hierarchical model represented as a combination of decision tree and regression models.

  3.

    Through binary classification experiments, we demonstrate that SICM supports the construction of an interpretable model with higher prediction performance than either decision tree or regression: in the experiments, the interpretable hierarchical model based on SICM outperformed decision tree and regression and performed comparably to SVM, an uninterpretable model that exhibits high performance.

The remainder of this paper is organized as follows. The next section provides a problem setting of HRD. The following section explains the proposed SICM algorithm, which solves the HRD problem by minimizing information criteria continuously. The next section proposes a method for constructing an interpretable hierarchical model based on the SICM algorithm. The following section explains the experimentally obtained findings. The last section presents concluding remarks.

Problem Setting

This section presents a description of our problem settings. We consider the hierarchical relevance determination task for supervised mixture distributions, which we explain below.

Hierarchical Relevance Determination Problem

Suppose that we are given a dataset of N observations, each of which consists of D numerical input variables \({\varvec{x}}\) and a target variable y:

$$\begin{aligned} (Y,X)= & \{y_{n},{\varvec{x}}_{n}\}_{n=1}^{N}. \end{aligned}$$
(1)

We consider supervised mixture distributions represented as

$$\begin{aligned} p(Y|X,Z,{\theta })= \,& {\prod _{n=1}^{N}}{\prod _{k=1}^{K}} \left( p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k})\right) ^{z_{nk}}, \end{aligned}$$
(2)

where p, \(p_{k}\), K, Z, and \({\theta }_{k}\) respectively represent a probability density function (pdf) of a mixture, a pdf of the kth mixture component, the number of components, latent variables explained below, and parameters of the kth mixture. Let us denote a set of latent variables as \(Z=\{z_{nk}\}\)\((n=1,\ldots ,N,\,k=1,\ldots ,K)\), where Z is a set of binary variables representing the component assignments. If the nth observation \((y_{n},{\varvec{x}}_{n})\) is generated from the kth component distribution \(p_{k}\), then \(z_{nk}=1\), and \(z_{nk}=0\) otherwise. Latent variables of one observation are mutually exclusive. Therefore, \({\sum _{k=1}^{K}}z_{nk}=1\) holds. We assume that the latent variables follow the following distribution:

$$\begin{aligned} p(Z|{\pi })= {\prod _{n=1}^{N}}{\prod _{k=1}^{K}}\left( {\pi }_{k}\right) ^{z_{nk}} \quad \left( 0\,{\le }\,{\pi }_{k}\,{\le }\,1,\, {\sum _{k=1}^{K}}{\pi }_{k}=1 \right) . \end{aligned}$$
(3)
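To make the generative process in Eqs. (2) and (3) concrete, the following is a minimal sketch that samples a dataset from a supervised mixture, assuming logistic regression components (the component form used later in our experiments); all parameter values here are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 3, 2                       # observations, input variables, components
pi = np.array([0.6, 0.4])                  # mixing coefficients, Eq. (3)
theta = rng.normal(size=(K, D))            # hypothetical component parameters theta_k

X = rng.normal(size=(N, D))                # input variables x_n
z = rng.choice(K, size=N, p=pi)            # latent assignments z_n drawn from p(Z|pi)

# Each observation is generated from its assigned component p_k(y|x, theta_k);
# here p_k is a Bernoulli (logistic regression) component, matching Eq. (2).
prob = 1.0 / (1.0 + np.exp(-np.einsum("nd,nd->n", X, theta[z])))
y = rng.binomial(1, prob)                  # target variables y_n
```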

As described herein, we consider a task of hierarchical relevance determination (HRD) for mixture distributions. The HRD task is to optimize, simultaneously, the number of mixtures K, the parameters \({\theta }\), the mixing coefficients \({\pi }\) and the latent variables Z, and to select an optimal subset of the input variables used in the individual components. The HRD task is to determine automatically all the degrees of freedom in the mixture distribution.

As discussed below, we solve the task by minimizing an information criterion. Therefore, when solving the task, we assume that the following two conditions hold. First, \(\left\{ p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k})\right\} _{k=1}^{K}\) satisfy the regularity conditions by which the Fisher information matrices of \(\left\{ p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k})\right\} _{k=1}^{K}\) are non-singular around the maximum likelihood estimators. Second, the optimal assignment Z is unique.

Decomposition of the Hierarchical Relevance Determination Problem

We decompose the HRD problem into two subproblems: variable selection and model selection.

Variable Selection Problem

First, we consider the HRD problem in which the number of components K is fixed to 1 and the pdf is represented as

$$\begin{aligned} p(Y|X,{\theta })= \, & {\prod _{n=1}^{N}}p(y_{n}|{\varvec{x}}_{n},{\theta }). \end{aligned}$$
(4)

We define variable selection as the relevance determination problem in this case. The variable selection task is to optimize parameter \({\theta }\) and to select an optimal subset of the input variables simultaneously.

Model Selection Problem

Next, we consider the HRD problem for which all the input variables are used in the individual components. We define model selection as the relevance determination problem in this case. The model selection task is to optimize, simultaneously, the number of mixtures K, the parameters \({\theta }\), the mixing coefficients \({\pi }\), and the latent variables Z.

Proposed Method: SICM Algorithm

In this section, we propose the sequential information criterion minimization (SICM) algorithm, which solves the HRD task by continuously minimizing an information criterion of a supervised mixture distribution.

We derive the SICM algorithm for HRD by the following procedure. First, we propose an algorithm for solving the variable selection task by minimizing an information criterion continuously. Second, we propose an algorithm for solving the model selection task by continuously minimizing an information criterion. Third, we derive the SICM algorithm for solving the HRD problem by combining these two algorithms.

Overall Flow of the SICM Algorithm

We briefly outline the SICM algorithm. We denote the parameters included in the information criterion as \({\varTheta }\). In the case of variable selection, \({\varTheta }={\theta }\). In the cases of model selection and HRD, \({\varTheta }=(Z,{\theta })\).

The overall flow of the SICM algorithm is summarized as presented below.

  1.

    We use the objective function E, which satisfies \({\min _{\varTheta }}E({\varTheta })=\mathrm{IC}^{*}\), where IC\(^{*}\) represents the minimum of an information criterion. We then formulate a minimization problem of the information criterion as a minimization problem of E.

  2.

    We approximate the \(L_{0}\) term of the parameters in the objective function by a function of \(L_{1}\) norm, and numerically stabilize the problem as

    $$\begin{aligned} ||{\varTheta }||_{0}\simeq \, & g(||{\varTheta }||_{1}), \end{aligned}$$
    (5)
    $$\begin{aligned} E\longrightarrow \, & E|_{||{\varTheta }||_{0}{\rightarrow }g(||{\varTheta }||_{1})}. \end{aligned}$$
    (6)
  3.

    We introduce new parameters \({\varTheta }_{0}\) and derive an upper bound of the objective function:

    $$\begin{aligned} E({\varTheta })\le & \, F({\varTheta },\,{\varTheta }_{0}). \end{aligned}$$
    (7)
  4.

    By iteratively minimizing F, we estimate \({\varTheta }\) as

    $$\begin{aligned} {\varTheta }^{(s+1)}= & \text{ arg }{\min _{\varTheta }}\,F({\varTheta },\,{\varTheta }_{0}^{(s)}), \end{aligned}$$
    (8)
    $$\begin{aligned} {\varTheta }_{0}^{(s+1)}= \, & {\varTheta }^{(s+1)}, \end{aligned}$$
    (9)

    where s represents an update step. The minimization method by which an upper bound of an objective function is iteratively minimized is called the majorization-minimization (MM) algorithm [18].
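As a schematic illustration of steps 3 and 4, the MM iteration in Eqs. (8) and (9) can be written as the following Python skeleton; `minimize_F` stands for any solver of the inner problem in Eq. (8) and is a hypothetical placeholder rather than part of the proposed method.

```python
import numpy as np

def mm_minimize(Theta_init, minimize_F, max_iter=100, tol=1e-6):
    """Majorization-minimization loop of Eqs. (8)-(9).

    minimize_F(Theta0) must return argmin over Theta of F(Theta, Theta0),
    where F is an upper bound of E that is tight at Theta = Theta0.
    """
    Theta0 = np.asarray(Theta_init, dtype=float)
    for _ in range(max_iter):
        Theta = minimize_F(Theta0)               # Eq. (8): minimize the upper bound
        if np.max(np.abs(Theta - Theta0)) < tol: # stop when the iterate stabilizes
            return Theta
        Theta0 = Theta                           # Eq. (9): retighten the bound at Theta
    return Theta0
```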

Objective Function

Next, we present a definition of the objective function of SICM.

Let us denote \({\varTheta }\) as

$$\begin{aligned} {\varTheta }= \, & (h,v), \end{aligned}$$
(10)

where h and v respectively represent a subset of the parameters used (selected) in a model and values of the “selected” parameters. In the case of regression models, h represents a subset of input variables which have non-zero regression coefficients, and v represents the values of the non-zero coefficients. An information criterion and its minimum are expressed as

$$\begin{aligned} IC(h,v_{\text{ML }}(h))= & -{\log }\,p(Y|X,h,v_{\text{ML }}(h)) +f(h,v_{\text{ML }}(h)), \end{aligned}$$
(11)
$$\begin{aligned} IC^{*}= \, & {\min _{h}}\,IC(h,v_{\text{ML }}(h)), \end{aligned}$$
(12)

where f represents a model complexity, an \(L_{0}\) term included in the information criterion, and \(v_{\text{ML }}(h)\) represents a maximum likelihood estimator where the subset is fixed to h:

$$\begin{aligned} v_{\text{ML }}(h)= & \text{ arg }{\min _{v}}\,-{\log }\,p(Y|X,h,v). \end{aligned}$$
(13)

Model complexity f is dependent on h and is independent of v. Therefore, f is expressed as

$$\begin{aligned} f(h,v_{\text{ML }}(h))= \, & f(h) =f(h,v). \end{aligned}$$
(14)

From Eqs. (10) to (14), the minimum of the information criterion is expressed as

$$\begin{aligned} IC^{*}= \, & {\min _{h}}\left[ {\min _{v}}\,-{\log }\,p(Y|X,h,v)+f(h)\right] \nonumber \\= \, & {\min _{h,v}}\,-{\log }\,p(Y|X,h,v)+f(h,v) \nonumber \\= \, & {\min _{\varTheta }}\,-{\log }\,p(Y|X,{\varTheta })+f({\varTheta }). \end{aligned}$$
(15)

As described in this paper, we solve the HRD problem by minimizing the information criterion of the mixture distribution. For this purpose, we use E in Eq. (17) below as the objective function to be minimized. We consider the following minimization problem:

$$\begin{aligned}&{\min _{\varTheta }}\,E({\varTheta }), \end{aligned}$$
(16)
$$\begin{aligned}&E({\varTheta })=-{\log }\,p(Y|X,{\varTheta })+f({\varTheta }). \end{aligned}$$
(17)

By solving the minimization problem in Eq. (16), we can minimize the information criterion for the following reasons. The objective function E is derived by replacing \(v_{\text{ML }}\) of the information criterion with v. Therefore, E is not equal to the information criterion itself. However, the minimization problem of E coincides with that of the information criterion because the minimum E is equal to the minimum information criterion value IC\(^{*}\), as shown in Eq. (15).

SICM for Variable Selection

In this section, we propose a method for solving the variable selection problem. We consider the following information criterion minimization problem:

$$\begin{aligned}&{\min _{\theta }}\,E({\theta }), \end{aligned}$$
(18)
$$\begin{aligned}&E({\theta })=-{\log }\,p(Y|X,{\theta })+f({\theta }), \end{aligned}$$
(19)
$$\begin{aligned}&f({\theta })=c_{\text{IC }}||{\theta }||_{0}, \end{aligned}$$
(20)
$$\begin{aligned}&c_{\text{AIC }}=1, \end{aligned}$$
(21)
$$\begin{aligned}&c_{\text{BIC }}=\frac{1}{2}{\log }\,N. \end{aligned}$$
(22)

We use AIC or BIC as the information criterion to be minimized. We set \(c_{\text{IC }}=c_{\text{AIC }}\) (\(c_{\text{IC }}=c_{\text{BIC }}\)) when we use AIC (BIC).

It is difficult to treat \(E({\theta })\) numerically because \(||{\theta }||_{0}\) is not continuous with respect to \({\theta }\). Therefore, we use the following approximation:

$$\begin{aligned} ||{\theta }||_{0}= & {\sum _{i}}||{\theta }_{i}||_{0} \,\,{\simeq }\,\, {\sum _{i}} \frac{|{\theta }_{i}|}{|{\theta }_{i}|+{\eta }}\quad ({\eta }>0), \end{aligned}$$
(23)

where \({\theta }_{i}\) and \({\eta }\) respectively represent the ith component of \({\theta }\) and a user-defined small positive constant. The approximation in Eq. (23) is justified because \(||s||_{0}\) and \(\frac{|s|}{|s|+{\eta }}\) (s is a scalar) share the following properties: (1) they equal 0 when \(s=0\); (2) they rapidly approach 1 as |s| increases; (3) they are concave functions of |s|; and (4) they become nearly equal as \({\eta }\) approaches 0.

We introduce new parameters \({\theta }_{0}\), which have the same dimension as \({\theta }\). \(\frac{|{\theta }_{i}|}{|{\theta }_{i}|+{\eta }}\) in Eq. (23) is a concave function of \(|{\theta }_{i}|\). Therefore, \(||{\theta }||_{0}\) has the following upper bound:

$$\begin{aligned} ||{\theta }||_{0}\simeq & {\sum _{i}} \frac{|{\theta }_{i}|}{|{\theta }_{i}|+{\eta }} \,\,{\le }\,\, {\sum _{i}} \frac{{\eta }}{(|{\theta }_{0i}|+{\eta })^{2}}|{\theta }_{i}|\nonumber \\&+ \frac{|{\theta }_{0i}|^{2}}{(|{\theta }_{0i}|+{\eta })^{2}}, \end{aligned}$$
(24)

where \({\theta }_{0i}\) represents the ith component of \({\theta }_{0}\). The equality in Eq. (24) holds if \({\theta }={\theta }_{0}\) because the right-hand side is the tangent line of \(u(|{\theta }_{i}|)=\frac{|{\theta }_{i}|}{|{\theta }_{i}|+{\eta }}\) at the point of contact \(|{\theta }_{i}|=|{\theta }_{0i}|\). Using this inequality, the following upper bound of the objective function is derived:

$$\begin{aligned} E({\theta })\le \, & F({\theta },{\theta }_{0}) \end{aligned}$$
(25)
$$\begin{aligned}= & -{\log }\,p(Y|X,{\theta }) \nonumber \\&+ c_{\text{IC }} {\sum _{i}} \left\{ \frac{{\eta }}{(|{\theta }_{0i}|+{\eta })^{2}}|{\theta }_{i}| + \frac{|{\theta }_{0i}|^{2}}{(|{\theta }_{0i}|+{\eta })^{2}} \right\} . \end{aligned}$$
(26)
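For reference, the weights appearing in Eq. (26) come directly from the tangent-line bound in Eq. (24): writing \(u(t)=\frac{t}{t+{\eta }}\) for \(t\,{\ge }\,0\), we have

$$\begin{aligned} u'(t)=\frac{{\eta }}{(t+{\eta })^{2}},\qquad u(t)\,{\le }\,u(t_{0})+u'(t_{0})(t-t_{0}) =\frac{{\eta }}{(t_{0}+{\eta })^{2}}\,t+\frac{t_{0}^{2}}{(t_{0}+{\eta })^{2}}, \end{aligned}$$

where the inequality holds because u is concave, with equality at \(t=t_{0}\). Substituting \(t=|{\theta }_{i}|\) and \(t_{0}=|{\theta }_{0i}|\) gives Eq. (24).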

Using the upper bound in Eq. (26), we define the variable selection problem as the following minimization problem:

$$\begin{aligned} {\min _{{\theta },{\theta }_{0}}}\,F({\theta },{\theta }_{0}). \end{aligned}$$
(27)

By minimizing \(F({\theta },{\theta }_{0})\) in Eq. (26), we approximately minimize the original objective function E.

We propose an algorithm for minimizing F, i.e., for solving the variable selection task. Because the algorithm sequentially minimizes F, it approximately minimizes the information criterion; we therefore designate it as the sequential information criterion minimization (SICM) algorithm.

We summarize the SICM algorithm for variable selection as Algorithm 1. The algorithm consists of the following two steps. First, F is minimized with respect to \({\theta }\). As shown in Eq. (26), the problem in Eq. (27) is an \(L_{1}\)-regularized maximum likelihood estimation problem with respect to \({\theta }\). The \(L_{1}\) regularization term has weights \(\left\{ \frac{{\eta }}{(|{\theta }_{0i}|+{\eta })^{2}}\right\}\) on \(\left\{ |{\theta }_{i}|\right\}\), like that of adaptive lasso [20]. Importantly, unlike adaptive lasso, the weights are determined automatically based on the information criterion. Second, F is minimized with respect to \({\theta }_{0}\). The solution of this problem is \({\theta }_{0}={\theta }\) because the equality in Eq. (24) holds if \({\theta }={\theta }_{0}\), as discussed above.

Algorithm 1 The SICM algorithm for variable selection
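As a concrete illustration of Algorithm 1, the following is a minimal sketch for a linear regression component with unit noise variance, so that the negative log-likelihood is \(\frac{1}{2}||y-X{\theta }||^{2}\) up to a constant; the inner weighted-\(L_{1}\) subproblem is solved by a simple proximal gradient (ISTA) loop. This is a sketch under those assumptions, not the implementation used in the paper.

```python
import numpy as np

def weighted_l1_ls(X, y, w, n_iter=500):
    """argmin_theta 0.5*||y - X theta||^2 + sum_i w_i*|theta_i|, solved by ISTA."""
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2      # 1/L, L = Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        u = theta - step * grad
        theta = np.sign(u) * np.maximum(np.abs(u) - step * w, 0.0)  # soft thresholding
    return theta

def sicm_variable_selection(X, y, c_ic, eta=1.0, n_outer=20):
    """Algorithm 1 (sketch): alternate the weighted-L1 estimation of theta (Step 1)
    with the update theta0 <- theta (Step 2)."""
    theta0 = weighted_l1_ls(X, y, np.full(X.shape[1], c_ic))       # initial estimate
    for _ in range(n_outer):
        w = c_ic * eta / (np.abs(theta0) + eta) ** 2               # L1 weights of Eq. (26)
        theta = weighted_l1_ls(X, y, w)                            # minimize F w.r.t. theta
        theta0 = theta                                             # minimize F w.r.t. theta0
    return theta0

# c_ic = 1 for AIC (Eq. (21)) or 0.5*np.log(N) for BIC (Eq. (22)).
```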

SICM for Model Selection

In this section, we propose a method for solving the model selection problem. We consider the following information criterion minimization problem:

$$\begin{aligned}&{\min _{Z,{\theta }}}\,E(Z,{\theta }), \end{aligned}$$
(32)
$$\begin{aligned}&E(Z,{\theta }) = -{\log }\,p(Y|X,Z,{\theta })+f(Z,{\theta }), \end{aligned}$$
(33)
$$\begin{aligned}&f(Z,{\theta }) = {\sum _{k=1}^{K}}g_{\text{IC }}(N_{k}) \,\, \left( N_{k}={\sum _{n=1}^{N}}z_{nk} \right) , \end{aligned}$$
(34)
$$\begin{aligned}&g_{\text{AIC }}(N_{k}) = ||N_{k}||_{0}(N+D_{k}), \end{aligned}$$
(35)
$$\begin{aligned}&g_{\text{BIC }}(N_{k}) = \frac{1}{2}||N_{k}||_{0}(N{\log }\,N+D_{k}{\log }\,N_{k}), \end{aligned}$$
(36)

where \(D_{k}\) represents the degrees of freedom of the k-th mixture component \(p_{k}(y|{\varvec{x}},{\theta }_{k})\) and is expressed as \(D_{k}=||{\theta }_{k}||_{0}\). In the case of the model selection task, \(D_{k}\) is a constant. We set \(g_{\text{IC }}=g_{\text{AIC }}\) (\(g_{\text{IC }}=g_{\text{BIC }}\)) when we minimize AIC (BIC).

We derive an upper bound of the objective function \(E(Z,{\theta })\), as in the case of variable selection. We denote the minimum of \(E(Z,{\theta })\) as \(E(Z^{*},{\theta }^{*})\). A minimum is always less than or equal to an expectation. Consequently, the following inequality holds

$$\begin{aligned} e^{-E(Z^{*},{\theta }^{*})}\ge & {\langle }e^{-E(Z,{\theta })}{\rangle }_{p(Z|{\pi })} = {\int }\!\text{ d }Z\, p(Z|{\pi }) \left[ p(Y|X,Z,{\theta })e^{-f(Z,{\theta })} \right] \end{aligned}$$
(37)
$$\begin{aligned}= & {\int }\!\text{ d }Z\, p(Y,Z|X,{\theta },{\pi })e^{-f(Z,{\theta })}. \end{aligned}$$
(38)

Therein, \({\langle }{\cdot }{\rangle }_{p}\) represents the expectation of \({\cdot }\) with respect to p. By applying Jensen’s inequality (see Appendix A) to \(-{\log }\) of Eq. (38), we obtain the following upper bound as

$$\begin{aligned} E(Z^{*},{\theta }^{*})\le & -{\log }\, {\int }\!\text{ d }Z\, q(Z)\frac{p(Y,Z|X,{\theta },{\pi })e^{-f(Z,{\theta })}}{q(Z)} \nonumber \\\le & -{\int }\!\text{ d }Z\, q(Z){\log }\,\frac{p(Y,Z|X,{\theta },{\pi })e^{-f(Z,{\theta })}}{q(Z)}, \end{aligned}$$
(39)

where q(Z) represents a pdf of Z.

As in the variable selection case, we use the following approximation of the \(L_{0}\) term:

$$\begin{aligned} ||N_{k}||_{0}\simeq \, & \frac{N_{k}}{N_{k}+{\eta }} \,\,({\eta }>0), \end{aligned}$$
(40)

where \({\eta }\) is a user-defined positive constant. Also, \(g(N_{k})\) is a concave function of \(N_{k}\). Consequently, by introducing a pdf \({\bar{q}}(Z)\), the following upper bound is derived:

$$\begin{aligned} g(N_{k})\le \, & g({\bar{N}}_{k})+g'({\bar{N}}_{k})(N_{k}-{\bar{N}}_{k}), \end{aligned}$$
(41)
$$\begin{aligned} g'(N_{k})= \, & \frac{{\partial }}{{\partial }N_{k}}g(N_{k}), \end{aligned}$$
(42)
$$\begin{aligned} {\bar{N}}_{k}= \, & {\langle }N_{k}{\rangle }_{{\bar{q}}(Z)} = {\sum _{n=1}^{N}}{\langle }z_{nk}{\rangle }_{{\bar{q}}(Z)}. \end{aligned}$$
(43)

By combining inequalities (39) and (41), the following upper bound of the objective function E is derived:

$$\begin{aligned} E(Z^{*},{\theta }^{*})\le \, & F({\theta },{\pi },q_{{Z}},{\bar{q}}_{{Z}}) \end{aligned}$$
(44)
$$\begin{aligned}= \,& -{\int }\!\text{ d }Z\, q(Z){\log }\,\frac{p(Y,Z|X,{\theta },{\pi })e^{-{\sum _{k=1}^{K}}G_{k}(Z,{\bar{q}})}}{q(Z)}, \end{aligned}$$
(45)
$$\begin{aligned} G_{k}(Z,{\bar{q}})= \, & g({\bar{N}}_{k})+g'({\bar{N}}_{k})(N_{k}-{\bar{N}}_{k}), \end{aligned}$$
(46)
$$\begin{aligned} N_{k}= \, & {\sum _{n=1}^{N}}z_{nk}, \end{aligned}$$
(47)
$$\begin{aligned} {\bar{N}}_{k}= \, & {\langle }N_{k}{\rangle }_{{\bar{q}}(Z)} ={\sum _{n=1}^{N}}{\langle }z_{nk}{\rangle }_{{\bar{q}}(Z)}, \end{aligned}$$
(48)
$$\begin{aligned} g_{\text{AIC }}(N_{k})= \,& \frac{N_{k}}{N_{k}+{\eta }}(N+D_{k}), \end{aligned}$$
(49)
$$\begin{aligned} g_{\text{BIC }}(N_{k})= \, & \frac{N_{k}}{2(N_{k}+{\eta })}(N{\log }\,N+D_{k}{\log }\,N_{k}), \end{aligned}$$
(50)
$$\begin{aligned} g_{\text{AIC }}'(N_{k})= \, & \frac{{\eta }}{(N_{k}+{\eta })^{2}}(N+D_{k}), \end{aligned}$$
(51)
$$\begin{aligned} g_{\text{BIC }}'(N_{k})= \, & \frac{{\eta }}{2(N_{k}+{\eta })^{2}} \left( N{\log }\,N+D_{k}{\log }\,N_{k}+\frac{D_{k}}{{\eta }}(N_{k}+{\eta }) \right) , \end{aligned}$$
(52)

where we set \(g=g_{\text{AIC }}\) and \(g'=g'_{\text{AIC }}\) (\(g=g_{\text{BIC }}\) and \(g'=g'_{\text{BIC }}\)) when we minimize AIC (BIC).

Using the upper bound in Eq. (45), we define the model selection problem as the following minimization problem:

$$\begin{aligned}&{\min _{{\theta },{\pi },q,{\bar{q}}}} F({\theta },{\pi },q_{{Z}},{\bar{q}}_{{Z}}). \end{aligned}$$
(53)

By minimizing \(F({\theta },{\pi },q_{Z},{\bar{q}}_{Z})\) in Eq. (45), we approximately minimize \(E(Z,{\theta })\).

We propose an algorithm for solving the problem in Eq. (53). We designate it as the SICM algorithm because it minimizes the information criterion sequentially and approximately, as in the variable selection case.

We summarize the SICM algorithm for model selection as Algorithm 2. The algorithm consists of the following four steps. First, we minimize F with respect to \({\theta }\) by solving the following maximum likelihood estimation problem:

$$\begin{aligned} {\theta }= \, & \text{ arg }{\min _{\theta }}\, -{\int }\!\text{ d }Z\, q(Z){\log }\,p(Y,Z|X,{\theta },{\pi }), \end{aligned}$$
(54)
$$\begin{aligned} {\theta }_{k}= \, & \text{ arg }{\min _{{\theta }_{k}}}\, -{\sum _{n}} r_{nk} {\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k}) \,\,\left( r_{nk}={\langle }z_{nk}{\rangle }_{q(Z)}\right) . \end{aligned}$$
(55)

Second, F is minimized with respect to \({\pi }\) by maximum likelihood estimation:

$$\begin{aligned} {\pi }_{k}= \, & \frac{1}{N}{\sum _{n}}{\langle }z_{nk}{\rangle }_{q(Z)} = \frac{1}{N}{\sum _{n}}r_{nk}. \end{aligned}$$
(56)

Third, F is minimized with respect to q(Z). We assume that q(Z) factorizes so that

$$\begin{aligned} q(Z)= \, & {\prod _{n=1}^{N}} q_{n}({\varvec{z}}_{n}), \end{aligned}$$
(57)
$$\begin{aligned} {\varvec{z}}_{n}= \, & (z_{n1},z_{n2},\ldots ,z_{nK})^{\mathrm{T}}. \end{aligned}$$
(58)

Then, using the variational inference method explained in Appendix B, q(Z) is optimized as

$$\begin{aligned} q_{n}({\varvec{z}}_{n})= \, & {\prod _{k=1}^{K}}\left( r_{nk}\right) ^{z_{nk}}, \end{aligned}$$
(59)
$$\begin{aligned} r_{nk}= \, & \frac{{\rho }_{nk}}{{\sum _{k'=1}^{K}}{\rho }_{nk'}}, \end{aligned}$$
(60)
$$\begin{aligned} {\rho }_{nk}= \, & {\pi }_{k} p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k}) {\exp }\left[ -g'({\bar{N}}_{k})\right] , \end{aligned}$$
(61)
$$\begin{aligned} {\langle }z_{nk}{\rangle }_{q(Z)}= \, & r_{nk}. \end{aligned}$$
(62)

Fourth, F is minimized with respect to \({\bar{q}}(Z)\). The solution of this problem is expressed as

$$\begin{aligned} {\bar{q}}= \, & q. \end{aligned}$$
(63)

This is true because (1) \({\bar{q}}_{Z}\) appears in \(F({\theta },{\pi },q_{{Z}},{\bar{q}}_{{Z}})\) in the form \({\langle }G_{k}(Z,{\bar{q}}){\rangle }_{q}\); (2) \(G_{k}(Z,{\bar{q}})\) is represented as Eq. (46); and (3) inequality (41) holds for \(g(N_{k})\), which is a concave function of \(N_{k}\).

Algorithm 2 The SICM algorithm for model selection
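As a minimal sketch of the responsibility update in Eqs. (59)–(62), with the BIC-based factor \(g'({\bar{N}}_{k})\) of Eq. (52), the following Python function can be used; `log_pk` denotes the array of log-densities \({\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k})\), assumed to be computed elsewhere.

```python
import numpy as np

def g_prime_bic(N_bar, D, N, eta=1.0):
    """g'_BIC(N_bar_k) of Eq. (52); D = D_k, N = total number of observations."""
    return eta / (2.0 * (N_bar + eta) ** 2) * (
        N * np.log(N) + D * np.log(np.maximum(N_bar, 1e-12)) + D / eta * (N_bar + eta)
    )

def update_responsibilities(log_pk, pi, N_bar, D, eta=1.0):
    """Eqs. (59)-(62): r_nk proportional to pi_k * p_k(y_n|x_n, theta_k) * exp(-g'(N_bar_k)).

    log_pk : (N, K) array of log p_k(y_n | x_n, theta_k)
    pi     : (K,) mixing coefficients
    N_bar  : (K,) expected component sizes from the previous step
    D      : (K,) degrees of freedom D_k of the components
    """
    N = log_pk.shape[0]
    log_rho = np.log(pi) + log_pk - g_prime_bic(N_bar, D, N, eta)  # Eq. (61) in log space
    log_rho -= log_rho.max(axis=1, keepdims=True)                  # numerical stabilization
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)                        # Eq. (60)

# The new expected sizes are N_bar = r.sum(axis=0); components with small N_bar_k
# are suppressed by the exp(-g') factor, which is how the number of components shrinks.
```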

SICM for Hierarchical Relevance Determination

In this section, we propose the SICM algorithm for solving the HRD problem. We construct the algorithm based on the SICM algorithms described above, i.e., those for variable selection and model selection, which are summarized as Algorithms 1 and 2.

Let us start from the SICM for model selection. When we consider the HRD task, it is necessary to treat \(\left\{ ||{\theta }_{k}||_{0}\right\}\) as variables, whereas they are constants in the model selection case. Therefore, when considering the HRD problem, it is necessary to make the following alterations to the SICM for model selection:

$$\begin{aligned} D_{k}\rightarrow & ||{\theta }_{k}||_{0}, \end{aligned}$$
(72)
$$\begin{aligned} G_{k}(Z,{\bar{q}})\rightarrow & G_{k}(Z,{\bar{q}},{\theta }_{k}). \end{aligned}$$
(73)

By these changes, the upper bound to be minimized for solving the HRD problem is derived as

$$\begin{aligned}&F({\theta },{\pi },q_{Z},{\bar{q}}_{Z})\nonumber \\&\quad = -{\int }\!\text{ d }Z\, q(Z){\log }\,\frac{p(Y,Z|X,{\theta },{\pi })e^{-{\sum _{k=1}^{K}}G_{k}(Z,{\bar{q}},{\theta }_{k})}}{q(Z)}. \end{aligned}$$
(74)

Under these alterations, the individual steps in Algorithm 2 change as follows. First, the estimation step of q is unchanged except for the expression of \(g'({\bar{N}}_{k})\): it is necessary to replace \(D_{k}\) included in \(g'({\bar{N}}_{k})\) with \(||{\theta }_{k}||_{0}\). Second, the estimation step of \({\bar{q}}\) is unchanged because (1) the optimal \({\bar{q}}\) is expressed as \({\bar{q}}=q\), as discussed above, and (2) this result is independent of the expression of \(D_{k}\) (independent of the alterations described above). Third, the estimation step of \({\pi }\) is unchanged because the estimation of \({\pi }\) depends only on \(\{{\langle }z_{nk}{\rangle }_{q}\}\) and is independent of \(D_{k}\). Fourth, the estimation step of \({\theta }_{k}\) changes because \({\sum _{k=1}^{K}}G_{k}(Z,{\bar{q}},{\theta }_{k})\) becomes dependent on \({\theta }_{k}\). Because of this dependence, the estimation step of \({\theta }\) in the HRD case is expressed as

$$\begin{aligned} {\theta }= & \text{ arg }{\min _{\theta }}\, -{\int }\!\text{ d }Z\, q(Z){\log }\,\frac{p(Y,Z|X,{\theta },{\pi })e^{-{\sum _{k=1}^{K}}G_{k}(Z,{\bar{q}},{\theta }_{k})}}{q(Z)} \nonumber \\= & \text{ arg }{\min _{\theta }}\, {\int }\!\text{ d }Z\, q(Z)\left[ -{\log }\,p(Y,Z|X,{\theta },{\pi })+{\sum _{k=1}^{K}}G_{k}(Z,{\bar{q}},{\theta }_{k}) \right] \nonumber \\= & \text{ arg }{\min _{\theta }}\,{\sum _{k=1}^{K}}F_{k}, \end{aligned}$$
(75)
$$\begin{aligned} F_{k}= & {\langle } -{\sum _{n}}z_{nk}{\log }\,{\pi }_{k} -{\sum _{n}}z_{nk}{\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k})\nonumber \\&+g({\bar{N}}_{k})+g'({\bar{N}}_{k})(N_{k}-{\bar{N}}_{k}) {\rangle }_{q(Z)} \nonumber \\= & -{\sum _{n}} r_{nk}{\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k}) + g({\bar{N}}_{k},\,D_{k}=||{\theta }_{k}||_{0})\nonumber \\&+\mathrm{const}.\quad ({\bar{q}}=q). \end{aligned}$$
(76)

Therefore \({\theta }_{k}\) is estimated as presented below:

$$\begin{aligned} {\theta }_{k}= & \text{ arg }{\min _{{\theta }_{k}}}\,F_{k} = \text{ arg }{\min _{{\theta }_{k}}}\, -{\sum _{n}} r_{nk}{\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k}) \nonumber \\&+ h({\theta }_{k}), \end{aligned}$$
(77)
$$\begin{aligned} h_{\text{AIC }}({\theta }_{k})= & \frac{{\bar{N}}_{k}}{{\bar{N}}_{k}+{\eta }}||{\theta }_{k}||_{0}, \end{aligned}$$
(78)
$$\begin{aligned} h_{\text{BIC }}({\theta }_{k})= & \frac{1}{2}{\cdot }\frac{{\bar{N}}_{k}}{{\bar{N}}_{k}+{\eta }} {\log }\,{\bar{N}}_{k} {\cdot } ||{\theta }_{k}||_{0}, \end{aligned}$$
(79)
$$\begin{aligned} {\bar{N}}_{k}= & {\sum _{n=1}^{N}}{\langle }z_{nk}{\rangle }_{{\bar{q}}(Z)} ={\sum _{n=1}^{N}}{\langle }z_{nk}{\rangle }_{q(Z)} ={\sum _{n}}r_{nk}, \end{aligned}$$
(80)

where we set \(h=h_{\text{AIC }}\) (\(h=h_{\text{BIC }}\)) when we minimize AIC (BIC).

If we make the following alterations, then the upper bound F in Eq. (26) in the variable selection case coincides with \(F_{k}\) in Eq. (77) in the HRD case, except for their constant terms:

$$\begin{aligned} c_{\text{IC }}\rightarrow & \frac{{\bar{N}}_{k}}{{\bar{N}}_{k}+{\eta }}c_{\text{IC }}, \end{aligned}$$
(81)
$$\begin{aligned} {\log }\,p(Y|X,{\theta })= & {\sum _{n}}{\log }\,p(y_{n}|{\varvec{x}}_{n},{\theta })\nonumber \\\rightarrow & {\sum _{n}} r_{nk} {\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k}). \end{aligned}$$
(82)

Therefore, we can estimate \({\theta }_{k}\) using the SICM for variable selection (Algorithm 1) with the alterations in Eqs. (81) and (82). As a result, the problem of estimating \({\theta }_{k}\) becomes a weighted maximum likelihood estimation with \(L_{1}\) regularization. The minimization of \(F_{k}\) is interpreted as the minimization of the information criterion in which the pdf is \(p_{k}(y|{\varvec{x}},{\theta }_{k})\) and the number of observations is \(N_{k}\), because (1) \(r_{nk}{\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k})\) can be considered a weighted log-likelihood of \(N_{k}\) observations, given that \(0\,{\le }\,r_{nk}\,{\le }\,1\) and \({\sum _{n}}r_{nk}=N_{k}\); and (2) \(\frac{{\bar{N}}_{k}}{{\bar{N}}_{k}+{\eta }}c_{\text{IC }}\) is approximately equal to \(c_{\text{IC }}\) when \(N_{k}\) is larger than \({\eta }\); we can adopt a small \({\eta }\), and we need not estimate \({\theta }_{k}\) if \(N_{k}=0\).

Summarizing the points presented above, the SICM algorithm can be constructed for solving the HRD problem in the following way. First, we replace Step 5 of Algorithm 2 (SICM for model selection) with Algorithm 1 (SICM for variable selection). Second, we make the alterations in Eqs. (81) and (82) to Algorithm 1. As an example, Appendix D summarizes the estimation step of \({\theta }_{k}\), which corresponds to Step 5 in Algorithm 2 (corresponds to Algorithm 1 with the alterations), for the HRD problem of a logistic regression mixture. In Algorithm 3, we summarize the overall flow of the SICM algorithm for solving the HRD problem. The objective function F in Algorithm 3 is represented as

$$\begin{aligned} F= & -{\sum _{n,\,k}}r_{nk}{\log }\,p_{k}(y_{n}|{\varvec{x}}_{n},\,{\theta }_{k}) +{\sum _{k}}g({\bar{N}}_{k})\big |_{D_{k}=||{\theta }_{k}||_{0}}\nonumber \\&+{\sum _{n,\,k}}r_{nk}{\log }\,\frac{r_{nk}}{{\pi }_{k}}, \end{aligned}$$
(83)

which is derived from Eqs. (49), (50), (74), (75) and (76).

Algorithm 3 The SICM algorithm for hierarchical relevance determination
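To illustrate how the estimation of \({\theta }_{k}\) changes in the HRD case, the following sketch implements Eq. (77) for a linear regression component with unit noise variance, reusing the `sicm_variable_selection` routine sketched after Algorithm 1 with the alterations of Eqs. (81) and (82); the responsibility weighting of the log-likelihood is absorbed by rescaling the data, which is valid under the Gaussian assumption. This is only an illustrative sketch, not the logistic-regression version summarized in Appendix D.

```python
import numpy as np

def estimate_theta_k(X, y, r_k, eta=1.0, use_bic=True):
    """Eq. (77) for a unit-variance linear regression component, solved with
    the variable-selection SICM (sicm_variable_selection defined earlier).

    r_k : (N,) responsibilities r_nk = <z_nk>_q for component k.
    """
    N_bar = r_k.sum()                                  # Eq. (80)
    if N_bar < 1e-8:                                   # empty component: nothing to estimate
        return np.zeros(X.shape[1])
    # Eq. (82): the r_nk-weighted squared loss equals an unweighted loss on rescaled data.
    sw = np.sqrt(r_k)
    Xw, yw = X * sw[:, None], y * sw
    # Eqs. (78)-(79) via Eq. (81): per-component penalty strength.
    if use_bic:
        c_k = 0.5 * N_bar / (N_bar + eta) * np.log(max(N_bar, 1.0 + 1e-8))
    else:
        c_k = N_bar / (N_bar + eta)
    return sicm_variable_selection(Xw, yw, c_ic=c_k, eta=eta)
```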

Properties of SICM

In this section, we describe some properties of the SICM algorithm for HRD.

Sparsity

The SICM algorithm derives a sparse solution theoretically as discussed below.

SICM determines the number of mixture components automatically. In fact, Eq. (65) in Algorithm 2 plays a fundamentally important role in this determination. By the effect of the term \({\exp }\left[ -g'({\bar{N}}_{k})\right]\), a mixture component with few observations is erased, and the number of components is thereby optimized: \({\exp }\left[ -g'({\bar{N}}_{k})\right]\) approaches 0 as \({\bar{N}}_{k}\), the expected number of observations of the k-th component, becomes small. Consequently, \({\rho }_{nk}\) rapidly converges to 0 as \({\bar{N}}_{k}\) becomes smaller.

The parameters of the individual components are made sparse by SICM. As discussed in “SICM for Hierarchical Relevance Determination”, SICM estimates the parameters of the individual mixtures by solving the \(L_{1}\)-regularized maximum likelihood estimation problems. Therefore SICM derives sparse parameters of the individual components by the regularization effect of \(L_{1}\) penalty, as in the case of sparse estimation methods such as Lasso.

Monotonicity

The SICM algorithm for HRD monotonically decreases the objective function, which corresponds to the upper bound of the information criterion to be minimized. We present a sketch of the proof below.

SICM for model selection monotonically decreases F in Eq. (45), the upper bound of the information criterion to be minimized. A sketch of the proof can be shown as

$$\begin{aligned} F^{(s)}= & F({\theta }^{(s)},{\pi }^{(s)},q^{(s)},{\bar{q}}^{(s)})\nonumber \\\ge & F({\theta }^{(s)},{\pi }^{(s)},q^{(s+1)},{\bar{q}}^{(s)}) \end{aligned}$$
(84)
$$\begin{aligned}\ge & F({\theta }^{(s)},{\pi }^{(s)},q^{(s+1)},{\bar{q}}^{(s+1)}) \end{aligned}$$
(85)
$$\begin{aligned}\ge & F({\theta }^{(s+1)},{\pi }^{(s+1)},q^{(s+1)},{\bar{q}}^{(s+1)}), \end{aligned}$$
(86)

where s represents an update step. Inequality (84) arises from Eqs. (90) and (92) in Appendix B, and Step 3 of Algorithm 2, which estimates the optimal q variationally by fixing \({\bar{q}}\), \({\theta }\) and \({\pi }\). Inequality (85) arises from Step 4 of Algorithm 2, which estimates the optimal \({\bar{q}}\) [as described immediately after Eq. (62)] by fixing q, \({\theta }\) and \({\pi }\). Inequality (86) arises from Step 5 and Step 6, which conduct maximum likelihood estimation of \({\theta }\) and \({\pi }\) by fixing q and \({\bar{q}}\).

In the case of SICM for HRD, which is a combination of SICM for model selection and SICM for variable selection, the maximum likelihood estimation of \({\theta }\) is replaced with \(L_{1}\)-penalized sparse estimation of \({\theta }\) such as sparse logistic regression summarized in Appendix D. The sparse estimation decreases the upper bound expressed as the \(L_{1}\)-regularized objective function. Therefore inequality (86) also holds in the HRD case.

Summarizing the explanation presented above, the SICM for HRD satisfies inequalities (84)–(86). It therefore monotonically decreases the objective function.

Properties Related to Information Criterion Minimization

The SICM algorithm can avoid the singularity problem of mixture modeling described hereinafter. Both AIC and BIC are derived from second-order expansions of their original objective functions under the regularity condition. It is therefore not justifiable to apply these information criteria to singular models. However, a mixture model represented as \(p(y|{\varvec{x}},{\theta })={\sum _{k}}{\pi }_{k}p_{k}(y|{\varvec{x}},{\theta }_{k})\) is singular. SICM avoids this difficulty for the following reasons. When constructing SICM, we started from a mixture represented as Eq. (2) and considered the objective function corresponding to its information criterion in Eq. (19). \(p(Y|X,Z,{\theta })\) in Eq. (2) is regular because it is assumed that the conditions described in “Hierarchical Relevance Determination Problem” (“each component \(p_{k}(y|{\varvec{x}},{\theta })\) is regular” and “the component assignment of each observation is unique”) hold. Consequently, use of the information criterion corresponding to Eq. (19) is justified.

SICM with BIC minimization has asymptotic consistency: because \(p(Y|X,Z,{\theta })\) satisfies the regularity condition as described above, its Laplace approximation (the BIC used in SICM) is asymptotically consistent. SICM with AIC does not have asymptotic consistency because AIC itself does not. Nevertheless, AIC is “consistent” in the sense that the estimated distribution asymptotically approaches the true distribution (rather than the true “model”, as in the case of BIC).

One benefit of SICM is that it can minimize either AIC or BIC, whereas Bayesian methods such as variational Bayesian inference correspond only to BIC minimization (more precisely, minimization of the negative marginal log-likelihood). We expected, and show in “Experiments”, that SICM with AIC outperforms SICM with BIC in some cases.

Applicability to Unsupervised Learning

As described in this report, we have considered the task of relevance determination in supervised mixture distributions. However, the SICM algorithm is applicable to unsupervised mixtures by making the following alteration:

$$\begin{aligned} p_{k}(y_{n}|{\varvec{x}}_{n},{\theta }_{k})\rightarrow & p_{k}({\varvec{x}}_{n}|{\theta }_{k}). \end{aligned}$$
(87)

Unsupervised mixtures include a mixture of normal distributions. For instance, by application of SICM to a normal mixture, we can estimate a mixture of sparse Gaussian graphical models.

Application to Interpretable Hierarchical Modeling

In this section, as an application of SICM, we propose a method for constructing an interpretable hierarchical model, which is constructed as a combination of interpretable models: a decision tree and regression models.

The overall flow of the model construction is summarized as presented below.

  1.

    We estimate \(p(Y|X,Z,{\theta })\) using the SICM algorithm. As the individual mixtures, we use interpretable regression models: linear regression models or logistic regression models.

  2.

    We construct a decision tree for model assignment using the training dataset where \(\{r_{nk}\}_{k=1}^{K}\) (\(r_{nk}={\langle }z_{nk}{\rangle }_{q}\)) is a set of target variables and where \({\varvec{x}}_{n}\) is a set of input variables. The decision tree for predicting \(r_{{\cdot }k}\) is estimated by solving a K-class classification problem.

  3.

    Using the results of the two steps described above, we conduct prediction as follows:

    $$\begin{aligned} p(y|{\varvec{x}})= & {\sum _{k}}r_{k}({\varvec{x}})p_{k}(y|{\varvec{x}},{\theta }_{k}). \end{aligned}$$
    (88)

    Here \(r_{k}({\varvec{x}})\) represents the predicted value of \(r_{{\cdot }k}\) estimated from a test observation \({\varvec{x}}\) using the decision tree. In effect, \(r_{k}({\varvec{x}})\) can be regarded as \(p(k|{\varvec{x}})\).

A key point in constructing the model is the introduction of Step 2, the decision tree estimation. To conduct prediction, it is necessary to estimate the predicted value of \(r_{{\cdot }k}\) from a test observation. However, the prediction mechanism of \(r_{{\cdot }k}\) is not included in the SICM algorithm. Therefore, Step 2 is introduced to predict \(r_{{\cdot }k}\). Consequently, we adopt the two-step training process described above, which consists of HRD based on SICM and model assignment determination based on a decision tree.
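A minimal sketch of Steps 2 and 3 is given below, assuming that the responsibilities `R` (an N×K array of \(r_{nk}\)) and the fitted component models `components` are available from the SICM step; hard labels obtained by argmax are used as one simple way to train the K-class assignment tree, and the components are taken to be scikit-learn logistic regressions for concreteness.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_assignment_tree(X_train, R, max_depth=3):
    """Step 2: a K-class decision tree that predicts the component assignment."""
    labels = R.argmax(axis=1)                      # hard assignments derived from r_nk
    return DecisionTreeClassifier(max_depth=max_depth).fit(X_train, labels)

def predict_hierarchical(X_test, tree, components):
    """Step 3, Eq. (88): p(y=1|x) = sum_k r_k(x) * p_k(y=1|x, theta_k).

    Assumes every surviving component appears in the tree's training labels,
    so that the columns of predict_proba line up with `components`.
    """
    r = tree.predict_proba(X_test)                 # r_k(x), regarded as p(k|x)
    p_k = np.column_stack(
        [m.predict_proba(X_test)[:, 1] for m in components]
    )
    return (r * p_k).sum(axis=1)
```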

The hierarchical model is constructed as a combination of a decision tree and regression models (linear regression models or logistic regression models). When conducting prediction with the hierarchical model, the regression model used for an observation is switched in accordance with the rules in the decision tree. Figure 1 presents an example of the interpretable hierarchical model. For example, in the case of Fig. 1, the predicted value is estimated (1) by Model (regression model) 1 (\(p(y|{\varvec{x}})= p_{1}(y|{\varvec{x}},{\theta }_{1})\)) if the observation \({\varvec{x}}\) satisfies Condition A; (2) by Model 2 if \({\varvec{x}}\) satisfies Conditions \({\bar{A}}\), B and C; (3) by Model 3 if \({\varvec{x}}\) satisfies Conditions \({\bar{A}}\), B and \({\bar{C}}\); and (4) by Model 4 if \({\varvec{x}}\) satisfies Conditions \({\bar{A}}\) and \({\bar{B}}\). The proposed model differs from ensemble learning models because a single model is assigned to each observation. When estimating the model, the number of regression models used in the hierarchical model and the subsets of input variables used in the individual regression models are determined automatically using the SICM algorithm.

Fig. 1 Example of the interpretable hierarchical model based on the SICM algorithm. In accordance with the rules represented as the decision tree, the prediction model used for an observation is switched. In the figure, squares and circles respectively represent conditional branches and assigned regression models. \(\{A,B,C\}\) and \(\{{\bar{A}},{\bar{B}},{\bar{C}}\}\) respectively represent conditions and their complements. The number of regression models used in the hierarchical model and the subsets of input variables used in the individual regressions are determined automatically by the SICM algorithm

Experiments

Experimental Setting

Through the following experiments, we demonstrate the utility of the SICM algorithm using the hierarchical prediction model proposed in “Application to Interpretable Hierarchical Modeling”.

We considered binary classification problems in the experiments. Therefore, we used logistic regression as the mixture components of the hierarchical model. Appendix D summarizes the SICM algorithm for logistic regression mixtures. The maximum number of components, the constant for approximating the \(L_{0}\) terms, and the threshold for the convergence condition were set to \((K,\,{\eta },\,{\epsilon })=(50,\,1.0,\,10^{-6})\). The maximum number of components K is equal to the initial number of components in the SICM algorithm. The number of components decreases as the SICM steps progress, but it never increases. Therefore, K must be set to a sufficiently large number; we set K to 50. The approximation constant \({\eta }\) was not estimated from the training datasets but was given as a fixed value.

We used nine datasets from the UCI repository [19]. We selected these datasets from the UCI repository because they are datasets for binary classification problems and consist of a binary target variable and numerical input variables. Some properties of the datasets are summarized in Table 1.

Table 1 Properties of nine UCI datasets used in the experiments

For training the models, we transformed categorical input variables into their one-hot encodings and standardized all input variables.

We observed the binary classification accuracies of the proposed models: \(\text{ SICM }_{\text{AIC }}\) and \(\text{ SICM }_{\text{BIC }}\). Hereinafter, we denote the proposed hierarchical model estimated by minimizing AIC (BIC) as \(\text{ SICM }_{\text{AIC }}\) (\(\text{ SICM }_{\text{BIC }}\)). As a measure of the classification accuracy, we used the mean Area Under the Curve (AUC) of ROC curves estimated using five-fold cross validation.
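For reference, the evaluation protocol (one-hot encoding, standardization, and the mean ROC-AUC over five-fold cross validation) can be reproduced for a baseline classifier with scikit-learn roughly as follows; `clf` stands for any of the compared models, and for simplicity only the numerical columns are standardized here.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def mean_cv_auc(X, y, categorical_cols, numerical_cols, clf=None):
    """Mean ROC-AUC over five-fold cross validation for a baseline classifier."""
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numerical_cols),
    ])
    model = make_pipeline(pre, clf if clf is not None else LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
```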

In the experiments, we aimed to show the effectiveness of the SICM algorithm by demonstrating that (a) we could construct accurate and interpretable models by solving the HRD problem, and (b) we could obtain better results using the SICM algorithm than using an existing inference method. In other words, (a) means that we could improve the prediction accuracy of interpretable models by solving the HRD problem. To show (a), we compared SICMs (\(\text{ SICM }_{\text{AIC }}\) and \(\text{ SICM }_{\text{BIC }}\), the hierarchical models proposed in “Application to Interpretable Hierarchical Modeling”, estimated using the SICM algorithm) to decision tree and logistic regression, which are frequently used as interpretable models. For this comparison, we constructed the hierarchical model from a decision tree and logistic regressions, i.e., the same building blocks as the compared models. For reference, we also compared SICMs to support vector machine (SVM), a highly accurate uninterpretable model. We note that SICM is proposed not for constructing highly accurate uninterpretable models but for improving the prediction accuracy of interpretable models. Therefore, in the experiments, we did NOT aim to show that SICMs outperformed SVM, because SVM is uninterpretable; SVM was selected to observe how closely SICMs perform to an uninterpretable model.

To show (b), that SICMs derive better results than an existing inference method, we compared SICMs to VB. Here, VB denotes the hierarchical prediction model that has the same structure as SICMs but whose parameter inference method is variational Bayesian inference instead of the SICM algorithm (namely, SICMs and VB are the same model estimated by different methods). We selected VB for the following reasons. First, as described in “Introduction”, our main aim and contribution are to construct a new inference method for solving the HRD task. Therefore, we compared the SICM algorithm to variational Bayesian inference by comparing the same model estimated by the two inference methods. Variational Bayesian inference is one of the most popular methods for relevance determination and remains a state-of-the-art approach in this area. Second, in this paper, we do NOT aim to propose a new hierarchical model; selecting an optimal model family is therefore out of scope. Consequently, we selected VB, which has the same structure as SICMs, and did not compare against other recent hierarchical models belonging to different model families.

We provide supplementary explanations of how the compared models were estimated. Logistic regression models were constructed by maximum likelihood estimation. Gaussian kernels were used in SVM; we optimized the decay parameter of the Gaussian kernel by two-fold cross validation. As mentioned above, we used variational Bayesian inference to estimate VB.

The computational environment was as follows: an Intel Xeon E5-1650 3.50 GHz CPU and 64 GB memory on a Linux Ubuntu platform. We conducted the experiments in Python. The implementations of SVM, logistic regression, and decision tree (including the tree estimation part in VB and SICMs) were taken from scikit-learn [21], a public machine learning library in Python. We implemented the HRD part of VB (variational Bayesian inference of a logistic regression mixture) based on Refs. [1, 22].

Results and Discussion

Prediction Performance

Table 2 presents the experimentally obtained results: the classification accuracies of six models. Results indicate the SICM algorithm’s effectiveness for the following reasons.

By combining decision tree and logistic regression, we can construct a model with higher accuracy than that of either individual model; that is, we can construct an interpretable and highly accurate model using the SICM algorithm. This statement is supported by the following results. First, the combination (hierarchical) models, namely \(\text{ SICM }_{\text{AIC }}\), \(\text{ SICM }_{\text{BIC }}\) and VB, outperformed decision tree and logistic regression. Second, the combination models performed comparably to SVM.

The SICM algorithm that minimizes BIC is theoretically better than variational Bayesian inference because SICM does not assume independence between the latent variables and the parameters of the individual components, whereas variational Bayesian inference requires that independence assumption. This statement is supported by the result that \(\text{ SICM }_{\text{BIC }}\) outperformed VB.

One benefit of SICM is that SICM can minimize either AIC or BIC, whereas Bayesian methods such as variational Bayesian inference only minimize the negative marginal log-likelihood. In fact, BIC is an approximate representation of the negative marginal log-likelihood. This statement is supported by results showing that \(\text{ SICM }_{\text{AIC }}\) performed comparably to \(\text{ SICM }_{\text{BIC }}\) from the viewpoint of their win-loss record.

Table 2 Classification accuracies of six prediction models

Interpretability

Using the proposed method, we can construct an interpretable model because the proposed hierarchical model is represented as a combination of a decision tree and regression models and has a sparse structure. In this section, we qualitatively describe this interpretability through an example of the proposed model.

As an example, we used the hierarchical model trained on the “Default of Credit Card Clients” dataset. An observation in the dataset represents a record of a credit card user and consists of a binary target variable and 23 input variables. The target variable represents default on a payment in October. The input variables represent (1) the amount of the given credit; (2) payment status recorded in April–September, such as payment delay (months), amount of bill statement, and amount of payment; and (3) user profile, such as age, gender, education, and marital status. We transformed the categorical input variables into their one-hot encodings.

Tables 3 and 4 summarize the proposed hierarchical model trained on the “default of credit card clients” dataset based on the SICM algorithm. Table 3 represents the model assignment rules included in the decision tree. Table 4 shows regression coefficients of the individual logistic regression models. These results indicate that the proposed model has the following properties.

Table 3 Model assignment rules of the proposed hierarchical model
Table 4 Regression coefficients of the individual logistic regression models included in the proposed hierarchical model

The SICM algorithm derives sparse solutions, as demonstrated by the following observations. First, as shown in Table 4, many coefficients were estimated as exactly zero. Second, the number of regression models diminished from the initial \(K=50\) to six.

The proposed model assigns a regression model to an observation based on its properties. For example, in the case of Table 3, (1) Model 1 corresponded to the “standard” clients (observations), who had little payment delay in September and a low bill statement in July; (2) Model 2 corresponded to clients who had larger bill statements in July than the standard clients; (3) Model 3 corresponded to clients who had more months of payment delay in September than the standard clients; and (4) Models 4, 5, and 6 corresponded to clients who had exceptionally large payments in August. The numbers of observations assigned to the regression models were imbalanced: three models (Models 4, 5 and 6) were used for the “exceptional” clients.

For each input variable, the signs of the regression coefficients may differ depending on the regression model. For example, in the case of Model 2, the regression coefficient of “age” was negative, which indicates that the default risk decreases as age increases. However, for Models 4 and 6, the coefficients were positive, which indicates that the default risk increases as age increases. Thus, age contributed to the default risk in opposite directions depending on whether the payment in August was exceptionally large (\(>\,4.0{\sigma }\)) or not. Such local sign inversion is regarded as one reason why the proposed hierarchical model is highly accurate.

Sign inversion does not always occur. For example, “the amount of the given credit” invariably had non-positive coefficients, which indicates that the default risk always decreases as the credit rating becomes better. This result is consistent with our intuition.

Summary

The SICM algorithm was proposed for solving the hierarchical relevance determination problem. The SICM algorithm minimizes an information criterion continuously and therefore enables us to determine the degrees of freedom in a mixture distribution automatically. A method for constructing an interpretable hierarchical model based on the SICM algorithm was also proposed. Experiment results obtained using the interpretable hierarchical model have demonstrated the utility of the SICM algorithm for the following reasons. First, the hierarchical model outperformed frequently used interpretable models (tree and logistic regression) in terms of prediction accuracy. Second, it was shown qualitatively that the hierarchical model derived interpretable results consistent with our intuition.

Future work includes the following two issues. One is a theoretical extension of the interpretable hierarchical model: when constructing the model, the degrees of freedom of the decision tree for model assignment are not determined automatically. Therefore, introducing a relevance determination mechanism into the decision tree estimation is a subject of future work. The other is the application of the proposed information criterion minimization method to problems other than HRD. Relevance determination based on the continuous minimization of an information criterion is widely applicable. A promising application is relevance determination in unsupervised distributions, such as sparse estimation of Gaussian graphical model mixtures.