1 Introduction

In our previous research (Plajner and Vomlel 2015) we focused on Computerized Adaptive Testing (CAT) (Almond and Mislevy 1999; van der Linden and Glas 2000). We used artificial student models to select questions during the course of testing. We have shown that it is useful to include monotonicity conditions while learning the parameters of these models (Plajner and Vomlel 2016b). Monotonicity conditions incorporate qualitative influences into a model. These influences restrict conditional probabilities in a specific way to avoid unwanted behavior. Some models we use for CAT include monotonicity naturally, but in this article we focus on a specific family of models, Bayesian Networks, which do not. Monotonicity in Bayesian Networks has been discussed in the literature for a long time. It is addressed, for example, by Wellman (1990), Druzdzel and Henrion (1993) and more recently by, e.g., Restificar and Dietterich (2013) and Masegosa et al. (2016). Monotonicity restrictions are often motivated by reasonable demands of model users. In our case of CAT this means we want to make sure that students having certain skills will have a higher probability of correctly answering questions that depend on these skills. Moreover, assuming monotonicity, we can learn better models, especially when the data sample is small. In our work we have so far attained monotonicity through logistic regression models of CPTs. This has proven useful, but it is restrictive since it requires a prescribed CPT structure.

In this article we extend our results in the domain of Bayesian Networks. We present a gradient descent method for learning parameters of CPTs that respects monotonicity conditions. First, we establish our notation and the monotonicity conditions in Sect. 2. Our method is derived in Sect. 3. We have implemented the method and performed tests on two different data sets: first, a synthetic data set generated from a monotonic model (CPTs satisfying monotonicity), and second, a real data set collected earlier. On these data sets we also ran the isotonic regression EM (irem) method described by Masegosa et al. (2016) and ordinary EM learning without monotonicity restrictions. In Sect. 4 we take a closer look at the experimental setup and present the results of the described tests. The last section brings an overview and a discussion of the obtained results.

2 BN Models and Monotonicity

2.1 Notation

In this article we use Bayesian Networks (BNs). Details about BNs can be found, for example, in Pearl (1988), Nielsen and Jensen (2007). We restrict ourselves to the following BN structure. Networks have two levels. In compliance with our previous articles, variables in the parent level are addressed as skill variables S. The children level contains question variables X. Example network structures, which we also used for experiments, are shown in Figs. 1 and 2.

  • We will use symbol \(\varvec{X}\) to denote the multivariable \((X_1,\ldots , X_{n})\) taking states \(\varvec{x} = (x_1,\ldots , x_{n})\). The total number of question variables is n, the set of all indexes of question variables is \(\varvec{N} = \{1,\ldots ,n\}\). Question variables are binary and they are observable.

  • We will use symbol \(\varvec{S}\) to denote the multivariable \((S_1,\ldots , S_{m})\) taking states \(\varvec{s} = (s_1,\ldots , s_{m})\). The set of all indexes of skill variables is \(\varvec{M} = \{1,\ldots ,m\}\). Skill variables may have different numbers of states; the total number of states of a variable \({S_j}\) is \(m_j\) and its individual states are \(s_{j,k}, k \in \{1,\ldots ,m_j\}\). The variable \(\varvec{S}^i = \varvec{S}^{pa(i)}\) stands for the same multivariable as \(\varvec{S}\) but restricted to the parent variables of the question \(X_i\). Indexes of these variables are \(\varvec{M}^i \subseteq \varvec{M}\). The set of all possible state configurations of \(\varvec{S}^i\) is \(Val(\varvec{S}^i)\). Skill variables are all unobservable.

Fig. 1. Artificial model

Fig. 2. CAT model network

CPT parameters for a question variable \(X_i\) for all \(i \in \varvec{N}, \varvec{s}^i \in Val(\varvec{S}^i)\) are

$$\begin{aligned} \theta _{i,\varvec{s}^i} = P(X_i = 0 | \varvec{S}^i = \varvec{s}^i), \ \varvec{\theta }_{i} = (\theta _{{i,\varvec{s}^i}})_{\varvec{s}^i \in Val(\varvec{S}^i)} . \end{aligned}$$

We will also use \(\theta _{i,\varvec{s}} = \theta _{i,\varvec{s}^i}\) with the whole parent set \(\varvec{S}\), where variables from \(\varvec{S} {\setminus } \varvec{S}^i\) do not affect the value. The probability of a correct answer to a question \(X_i\) given state configuration \(\varvec{s}^i\) is \(P(X_i = 1| \varvec{S}^i = \varvec{s}^i) = 1-\theta _{i,\varvec{s}^i}\) (questions are binary).

Parameters of parent variables for \(j \in \varvec{M}\) are

$$\begin{aligned} \rho _{j,s_j} = P(S_j = s_j), \; \varvec{\rho }_j = \left( P(S_j = s_{j'})\right) , \, {j' \in \{1,\ldots ,m_j\}} . \end{aligned}$$

Parameter vector \(\varvec{\rho }_j\) is constrained by a condition \(\sum _{s_j = 1}^{m_j}{\rho _{j,s_j}} = 1\). To remove this condition we reparametrize this vector to

$$\begin{aligned} \rho _{j,s_j}= & {} \dfrac{exp(\mu _{j,s_j})}{\sum _{s_j'=1}^{m_j}{exp(\mu _{j,s_j'})}} . \end{aligned}$$
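This reparametrization is the familiar softmax map. A minimal numpy sketch (the helper name and the max-shift for numerical stability are our additions, not part of the text):

```python
import numpy as np

def softmax(mu_j):
    """Map an unconstrained vector (mu_{j,1}, ..., mu_{j,m_j}) to the
    probability vector rho_j; the sum-to-one constraint then holds by
    construction, so it no longer needs to be imposed explicitly."""
    z = np.exp(mu_j - np.max(mu_j))  # subtract the max for numerical stability
    return z / z.sum()

rho_j = softmax(np.array([0.0, 1.0, -0.5]))
# rho_j is a valid distribution: strictly positive entries summing to 1
```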

The whole vector of parameters is then

$$\begin{aligned} \varvec{\theta } = \left( \varvec{\theta }_{1},\ldots ,\varvec{\theta }_{n},\varvec{\rho }_{1},\ldots ,\varvec{\rho }_{m}\right) , \; \text {or} \; \varvec{\mu } = \left( \varvec{\theta }_{1},\ldots ,\varvec{\theta }_{n},\varvec{\mu }_{1},\ldots ,\varvec{\mu }_{m}\right) , \end{aligned}$$

where the meaning of \(\varvec{\mu _j}\) is the same as \(\varvec{\rho _j}\) but in this case vectors contain reparametrized variables. The transition from \(\varvec{\mu }\) to \(\varvec{\theta }\) is simply done with the reparametrization above and will be used without further notice. The total number of elements in the vector \(\varvec{\mu }\) and \(\varvec{\theta }\) is

$$\begin{aligned} l_{\varvec{\mu }} = l_{\varvec{\theta }} = \sum _{{i \in \varvec{N}}}\prod _{j \in \varvec{M^i}}m_j + \sum _{l \in \varvec{M}}{m_l} . \end{aligned}$$
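As a quick sanity check, this count can be computed directly from a model structure. A small sketch (the structure below is hypothetical, chosen only to illustrate the formula; the actual parent sets are those of the models in Figs. 1 and 2):

```python
from math import prod

def num_parameters(parents_of, num_states):
    """l = sum over questions i of prod_{j in M^i} m_j, plus the sum of
    skill state counts, exactly as in the formula above."""
    cpt_part = sum(prod(num_states[j] for j in parents)
                   for parents in parents_of.values())
    prior_part = sum(num_states.values())
    return cpt_part + prior_part

# hypothetical two-level structure: 2 skills with 3 states each,
# 5 questions, each question having both skills as parents
num_states = {1: 3, 2: 3}
parents_of = {i: [1, 2] for i in range(1, 6)}
num_parameters(parents_of, num_states)  # 5 * 9 + 6 = 51
```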

2.2 Monotonicity

The concept of monotonicity in BNs has been discussed in the literature since the 1990s (Wellman 1990; Druzdzel and Henrion 1993). Later its benefits for BN parameter learning were addressed, for example, by van der Gaag et al. (2004) and Altendorf et al. (2005). The topic remains active, e.g., Feelders and van der Gaag (2005), Restificar and Dietterich (2013), Masegosa et al. (2016).

We will consider only variables with states from \(\mathbb {N}_0\) with their natural ordering, i.e., the ordering of states of skill variable \(S_j\), \(j \in \varvec{M}\), is

$$\begin{aligned} s_{j,1} \prec \ldots \prec s_{j,m_j} . \end{aligned}$$

For questions we use the natural ordering of their states (\(0 \prec 1\)).

A variable \(S_j\) has monotone, resp. antitone, effect on its child if for all \(k,l \in \{1,\ldots ,m_j\}\):

$$\begin{aligned} s_{j,k} \preceq s_{j,l}\Rightarrow & {} P(X_i = 1|S_j = s_{j,k}, \varvec{s}) \ \le \ P(X_i = 1|S_j = s_{j,l}, \varvec{s}), \quad \text {resp.} \\ s_{j,k} \preceq s_{j,l}\Rightarrow & {} P(X_i = 1|S_j = s_{j,k}, \varvec{s}) \ \ge \ P(X_i = 1|S_j = s_{j,l}, \varvec{s}) , \end{aligned}$$

where \(\varvec{s}\) is the configuration of the remaining parents of question i other than \(S_j\). For each question \(X_i, i \in \varvec{N}\), we denote by \(\varvec{S}^{i,+}\) the set of parents with a monotone effect and by \(\varvec{S}^{i,-}\) the set of parents with an antitone effect.

Next, we create a partial ordering \(\preceq _i\) on all state configurations of parents \(\varvec{S}^i\) of the i-th question, where for all \(\varvec{s}^i, \varvec{r}^{i} \in Val(\varvec{S^i})\):

$$\begin{aligned} \varvec{s}^i \preceq _i \varvec{r}^{i} \Leftrightarrow \left( s^i_j \preceq r^i_j, \ j \in \varvec{S}^{i,+}\right) \ \text {and} \ \left( r^i_j \preceq s^i_j, \ j \in \varvec{S}^{i,-}\right) . \end{aligned}$$

The monotonicity condition then requires that the probability of a correct answer is not lower for a higher-ordered parent configuration, i.e., for all \(\varvec{s}^i, \varvec{r}^{i} \in Val(\varvec{S^i})\):

$$\begin{aligned} \varvec{s}^i \preceq _i \varvec{r}^i\Rightarrow & {} P(X_i = 1|\varvec{S}^i = \varvec{s}^i) \ \le \ P(X_i = 1|\varvec{S}^i = \varvec{r}^i),\\ \varvec{s}^i \preceq _i \varvec{r}^i\Rightarrow & {} P(X_i = 0|\varvec{S}^i = \varvec{s}^i) \ \ge \ P(X_i = 0|\varvec{S}^i = \varvec{r}^i) \ \Leftrightarrow \ \theta _{i,\varvec{s}^i} \ \ge \ \theta _{i,\varvec{r^i}} . \end{aligned}$$
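For a given CPT this condition is easy to verify mechanically. A sketch for the all-isotone case, where \(\preceq_i\) reduces to the componentwise order on parent configurations (function and variable names are illustrative):

```python
from itertools import product

def satisfies_monotonicity(theta_i, state_counts):
    """Check theta_{i,s} >= theta_{i,r} whenever s <=_i r componentwise
    (all parents isotone).  theta_i maps parent-state tuples to
    P(X_i = 0 | S^i = s), i.e. the probability of an incorrect answer."""
    configs = list(product(*(range(m) for m in state_counts)))
    for s in configs:
        for r in configs:
            if all(a <= b for a, b in zip(s, r)) and theta_i[s] < theta_i[r]:
                return False  # higher skills must not raise P(wrong answer)
    return True

# two binary parents; higher skill states lower P(X_i = 0)
theta_i = {(0, 0): 0.9, (0, 1): 0.7, (1, 0): 0.6, (1, 1): 0.3}
satisfies_monotonicity(theta_i, [2, 2])  # True
```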

In our experimental part we consider only isotone effects of parents on their children; antitone effects differ only in the partial ordering used.

3 Parameter Gradient Search with Monotonicity

To learn the parameter vector \(\varvec{\mu }\) we develop a method based on gradient descent optimization. We follow the work of Altendorf et al. (2005), who use a gradient descent method with exterior penalties to learn parameters. The main difference is that we consider models with hidden variables.

We denote by \(\varvec{D}\) the set of indexes of observation vectors. One vector \(\varvec{x}^k, k \in \varvec{D}\), corresponds to one student, and the observation of the i-th variable \(X_i\) is \(x_i^k\). The number of occurrences of the k-th configuration vector in the data sample is \(d_k\).

We use the model structure as described in Sect. 2, i.e., unobserved parent variables and observed binary children variables. With sets \(\varvec{I}^k_0\) and \(\varvec{I}^k_1\) of indexes of incorrectly and correctly answered questions, we create the following products based on observations in the k-th vector:

$$\begin{aligned} p_0^k(\varvec{\mu }, \varvec{s}, k) = \prod _{i \in \varvec{I}^k_0}{\theta _{i,\varvec{s}}}, \quad p_1^k(\varvec{\mu }, \varvec{s}, k) = \prod _{i \in \varvec{I}^k_1}{(1-\theta _{{i,\varvec{s}}})}, \quad p_{\mu }(\varvec{\mu }, \varvec{s})= & {} \prod _{j = 1}^{m}{exp(\mu _{j,s_j})} . \end{aligned}$$

We work with the log likelihood:

$$\begin{aligned} LL(\varvec{\mu })= & {} \sum _{k \in \varvec{D}}{d_k \cdot log \left( \sum _{\varvec{s} \in Val(\varvec{S})} {\prod _{j = 1}^{m}{\dfrac{exp(\mu _{j,s_j})}{\sum _{s_j'=1}^{m_j}{exp(\mu _{j,s_j'})}}} \cdot p_0^k(\varvec{\mu }, \varvec{s}, k) \cdot p_1^k(\varvec{\mu }, \varvec{s}, k) }\right) } \\= & {} \sum _{k \in \varvec{D}}{d_k \cdot log \Big ( \sum _{\varvec{s} \in Val(\varvec{S})}{ {p_{\mu }(\varvec{\mu }, \varvec{s})} \cdot p^k_0(\varvec{\mu }, \varvec{s}, k) \cdot p_1^k(\varvec{\mu }, \varvec{s}, k) } \Big )} \\&{-} \, N\cdot \sum _{j=1}^{m}{log\sum _{s_j'=1}^{m_j}{exp(\mu _{j,s_j'})}} . \end{aligned}$$
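The first form of \(LL(\varvec{\mu})\) can be evaluated directly by marginalizing over all hidden skill configurations (exponential in the number of skills, but faithful to the formula). A minimal sketch; the data layout and argument names are our assumptions for illustration:

```python
import numpy as np
from itertools import product

def log_likelihood(mu, theta, data, counts):
    """LL(mu): for each observation vector x^k (with weight d_k), sum the
    joint probability over all hidden skill configurations s.
    mu:    list of unconstrained prior vectors, one per skill variable
    theta: dict (i, s) -> P(X_i = 0 | S = s), s a full skill configuration
    data:  list of binary answer vectors; counts: the weights d_k"""
    rho = [np.exp(m) / np.exp(m).sum() for m in mu]  # softmax priors p_mu
    ll = 0.0
    for x, d in zip(data, counts):
        total = 0.0
        for s in product(*(range(len(r)) for r in rho)):
            p = np.prod([rho[j][s[j]] for j in range(len(rho))])
            for i, x_i in enumerate(x):
                # correct answer contributes (1 - theta), wrong contributes theta
                p *= (1.0 - theta[i, s]) if x_i == 1 else theta[i, s]
            total += p
        ll += d * np.log(total)
    return ll
```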

The partial derivatives of \(LL(\mu )\) with respect to \(\theta _{i,\varvec{s^i}}\) for \(i \in \varvec{N}, \varvec{s}^i \in Val(\varvec{S}^i)\) are

$$\begin{aligned} \dfrac{\delta LL(\varvec{\mu })}{\delta \theta _{i,\varvec{s^i}}}= & {} \sum _{k \in \varvec{D}}{d_k \cdot \dfrac{ (-2 x^k_i + 1)\cdot {p_\mu (\varvec{\mu }, \varvec{s}^i)} \cdot p_0^k(\varvec{\mu }, \varvec{s}^i, k) \cdot p_1^k(\varvec{\mu }, \varvec{s}^i, k) }{ \theta _{i,\varvec{s^i}}\cdot \sum _{\varvec{s} \in Val(\varvec{S})}{ {p_{\mu }(\varvec{\mu }, \varvec{s})} \cdot p_0^k(\varvec{\mu }, \varvec{s}, k) \cdot p_1^k(\varvec{\mu }, \varvec{s}, k)} } } , \end{aligned}$$

and with respect to \(\mu _{i,l}\) for \(i \in \varvec{M}, l \in \{1,\ldots ,m_i\}\) are

$$\begin{aligned} \dfrac{\delta LL(\varvec{\mu })}{\delta \mu _{i,l}}= & {} \sum _{k \in \varvec{D}}{d_k \cdot \dfrac{ \sum _{\varvec{s} \in Val(\varvec{S})}^{s_i = l}{ {p_{\mu }(\varvec{\mu }, \varvec{s})} \cdot p_0^k(\varvec{\mu }, \varvec{s}, k) \cdot p_1^k(\varvec{\mu }, \varvec{s}, k) } }{ \sum _{\varvec{s} \in Val(\varvec{S})}{ {p_{\mu }(\varvec{\mu }, \varvec{s})} \cdot p_0^k(\varvec{\mu }, \varvec{s}, k) \cdot p_1^k(\varvec{\mu }, \varvec{s}, k) } } } \\&{-} \, N\cdot \dfrac{exp(\mu _{i,l})}{\sum _{l' = 1}^{m_i}{exp(\mu _{i,l'})}} . \end{aligned}$$

3.1 Monotonicity Restriction

To ensure monotonicity we use a penalty function

$$\begin{aligned} p(\theta _{i,\varvec{s}^i}, \theta _{i,\varvec{r}^i}) = {exp(c\cdot (\theta _{i,\varvec{r}^i} - \theta _{i,\varvec{s}^i})) } \end{aligned}$$

for the log likelihood:

$$\begin{aligned} LL'(\varvec{\mu }, c) = LL(\varvec{\mu }) - \sum _{i \in \varvec{N}}{\sum _{\varvec{s}^i \preceq _i \varvec{r}^i}}p(\theta _{i,\varvec{s}^i}, \theta _{i,\varvec{r}^i}), \end{aligned}$$

where c is a constant determining the strength of the condition. Theoretically, this penalty does not guarantee monotonicity but, in practice, selecting high values of c results in monotonic estimates. If monotonicity is not violated, i.e., \(\theta _{i,\varvec{r}^i} \le \theta _{i,\varvec{s}^i}\), then the penalty value is close to zero. Otherwise, the penalty grows exponentially fast with respect to \(\theta _{i,\varvec{r}^i} - \theta _{i,\varvec{s}^i}\). In our experiments we used the value \(c = 40\), but any value higher than 20 provided almost identical results.
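This switching behavior of the penalty is easy to see numerically; a short sketch:

```python
import numpy as np

def penalty(theta_s, theta_r, c=40.0):
    """exp(c * (theta_r - theta_s)): near zero when theta_s >= theta_r
    (the monotonicity constraint holds), exponentially large otherwise."""
    return np.exp(c * (theta_r - theta_s))

penalty(0.9, 0.3)  # constraint satisfied: roughly exp(-24), negligible
penalty(0.3, 0.9)  # constraint violated: roughly exp(24), a huge penalty
```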

Partial derivatives with respect to \(\mu _{i,l}\) remain unchanged. Partial derivatives with respect to \(\theta _{i,\varvec{s^i}}\) are:

$$\begin{aligned} \dfrac{\delta LL'(\varvec{\mu },c)}{\delta \theta _{i,\varvec{s^i}}} = \dfrac{\delta LL(\varvec{\mu })}{\delta \theta _{i,\varvec{s^i}}} + c\sum _{\varvec{s}^i \preceq _i \varvec{r}^i} p(\theta _{i,\varvec{s}^i}, \theta _{i,\varvec{r}^i}) - c\sum _{\varvec{r}^i \preceq _i \varvec{s}^i} p(\theta _{i,\varvec{r}^i}, \theta _{i,\varvec{s}^i}) \end{aligned}$$

Using the penalized log likelihood, \(LL'(\varvec{\mu }, c)\), and its gradient

$$\begin{aligned}&\nabla LL'(\varvec{\mu },c) = \Big (\dfrac{\delta LL'(\varvec{\mu },c)}{\delta \theta _{i,\varvec{s^i}}},\dfrac{\delta LL(\varvec{\mu })}{\delta \mu _{j,l}}\Big ) , \end{aligned}$$

for \(i \in \varvec{N}, \varvec{s}^i \in Val(\varvec{S}^i), \ j \in \varvec{M}, l \in \{1,\ldots ,m_j\}\), we can apply a standard gradient optimization method to solve the problem. In order to keep \(\varvec{\theta }_i, i \in \varvec{N}\), within valid probability bounds it is necessary to use a bounded optimization method.
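Any off-the-shelf bounded optimizer can play this role. As a minimal illustration of the "bounded" aspect only (not the authors' implementation), a projected gradient ascent that clips the CPT parameters back into \([\epsilon, 1-\epsilon]\) after every step:

```python
import numpy as np

def projected_gradient_ascent(grad, theta0, lr=0.01, steps=500, eps=1e-6):
    """Maximize an objective by gradient ascent, projecting the parameters
    back into [eps, 1 - eps] after each step so they remain valid
    probabilities.  grad(theta) returns the objective's gradient."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta + lr * grad(theta)
        theta = np.clip(theta, eps, 1.0 - eps)  # the bounding step
    return theta

# toy concave objective -(theta - 0.7)^2 with gradient -2 * (theta - 0.7):
result = projected_gradient_ascent(lambda t: -2.0 * (t - 0.7), [0.1])
# result converges close to the unconstrained maximizer 0.7
```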

4 Experiments

For testing we use two different Bayesian Network models. The first one is an artificial model with simulated data. The second is one of the models we used for computerized adaptive testing, and here we work with real data (for details please refer to Plajner and Vomlel (2016a)). In both cases we learn model parameters from data. Parameters are learned with our gradient method, with isotonic regression EM (irem), and with the standard unrestricted EM algorithm. The learned model quality is measured by the log likelihood of the whole data sample, including the training subset. This is done in order to provide results comparable between different training set sizes.

4.1 Artificial Model

The first model is displayed in Fig. 1. This model was created to provide simulated data for testing. The structure of the model is similar to the models we use in CAT, with two levels of variables. Parents \(S_1\) and \(S_2\) have 3 possible states and children \(X_1, \ldots , X_5\) are binary. We instantiated the model with a random parameter vector \(\varvec{\theta }^*\) satisfying the monotonicity conditions and drew a random sample of 100 000 cases from the model.

For parameter learning we use random subsets of sizes \(k\) = 10, 20, 50, 100, 1 000, 10 000, 50 000, and 100 000 (the full data set). For each size (except the last one) we use 10 different sets. Next, we prepared 15 initial parameter configurations for the fixed Bayesian Network structure (Fig. 1). These networks have starting parameters \(\varvec{\theta }_i\) generated at random, but in such a way that they satisfy the monotonicity conditions. The assumption of monotonicity is part of our domain expert knowledge, therefore we can use it to speed up the process and avoid local optima. Parameters of parent variables are uniform and the initial vectors are the same for each method. In our experiment we learn network parameters for each initial parameter setup and each set of a particular set size (giving a total of 150 learned networks for one set size). The learned parameter vectors are \(\varvec{\theta }_{i,j}\) for the j-th subset of data.

Fig. 3. Negative log likelihood for the whole sample and different training set sizes for the artificial model.

The average log likelihood for the whole data sample

$$\begin{aligned} LL_A = \dfrac{\sum _{j=1}^{10}\sum _{i=1}^{15}{LL(\varvec{\theta }_{i,j})}}{150} \end{aligned}$$

is shown in Fig. 3 for each set size. In the case of this model we are also able to measure, in addition to the log likelihood, the distance of the learned parameters from the generating parameters. First we calculate an average error for each learned model:

$$\begin{aligned} e_{i,j} = \dfrac{|\varvec{\theta }^* - \varvec{\theta }_{i,j}|}{l_{\varvec{\theta }}} , \end{aligned}$$

where \(|\cdot |\) denotes the \(L_1\) norm. Next we average over all results in one set size:

$$\begin{aligned} e = \dfrac{\sum _{j = 1}^{10}\sum _{i = 1}^{15}e_{i,j}}{150} . \end{aligned}$$

Resulting values of e are displayed in Fig. 4 for each set size.

Fig. 4. Mean difference of parameters of learned and generating networks for different set sizes for the artificial model.

Fig. 5. Negative log likelihood for the whole sample and different training set sizes for the CAT model.

4.2 CAT Model

The second model is the one we used for CAT (Plajner and Vomlel 2016b). Its structure is displayed in Fig. 2. Parent variables \(S_1, \ldots , S_7\) have 3 states and each of them represents a particular student skill. Child nodes \(X_i\) are binary variables representing questions. Data associated with this model were collected from paper tests of mathematical skills of high school students. In total the data sample has 281 cases. For a more detailed overview of the tests refer to Plajner and Vomlel (2016a). For learning we use random subsets of sizes 1/10, 2/10, 3/10, and 4/10 of the whole sample. Similarly to the previous model, we drew 10 random sets for each size and initialized models with 15 different random monotonic starting parameter vectors \(\varvec{\theta }_i\).

After learning we compute log likelihoods of the whole data set and we create averages \(LL_A(k)\) for each set size, as with the previous model. Resulting values are in Fig. 5. In this case we cannot compare learned parameters because the true generating parameters are unknown.

5 Conclusions

In this article we have presented a gradient-based method for learning parameters of a Bayesian Network under monotonicity restrictions. The method was described and then tested on two data sets. In Figs. 3 and 5 it is clearly visible that this method achieves the best results of the three tested methods (especially for small training samples). The irem method has problems with small training samples and the log likelihood in those cases is low. This is a consequence of the fact that it moves to a monotonic solution from a poor EM estimate, and in these cases ensuring monotonicity implies log likelihood degradation. We can also observe that for training sets larger than 1 000 data vectors the EM algorithm stabilizes in its parameter estimates. It means that at about \({k=1000}\) the EM algorithm has found the best model it can, and increasing the training size does not improve the result. Nevertheless, as we can observe in Fig. 4, for both the irem and the gradient method the parameters of the learned networks are always closer to the generating parameters than for the standard EM.

These results verify the usefulness of monotonicity for learning Bayesian Networks. A possible extension is to generalize the gradient-based method to work with more general network structures.