
1 Introduction

In recent years, interest in non-standard supervised classification problems has grown significantly. In particular, the Label Ranking (LR) problem [9] consists of learning preference models able to predict rankings (a.k.a. total orders or permutations) defined over a finite set of class labels. An important difference between this problem and other non-standard supervised classification problems (e.g., ordinal classification) is that the instances of the training dataset are labeled with rankings, and these rankings are used during model learning.

In this paper, we focus on the Partial Label Ranking problem [6]. In this problem, the rankings associated with the instances of the training dataset and the predictions given as output are partial rankings (a.k.a. total orders with ties or bucket orders), that is, rankings with (possibly) tied class labels.

Based on [22], we rely on a hybrid Bayesian network [14] whose root is a hidden discrete variable, in order to jointly model the probability distributions for the discrete (multinomial) and continuous (Gaussian) attributes, and for the rankings. Although permutations (complete rankings without ties) may be modeled with the Mallows distribution [20], as far as we know, there is no probability distribution for bucket orders (complete rankings with ties). Therefore, we transform the global preferences expressed by the rankings into a set of pairwise local preferences (precedes, tied and succeeds). In this way, these variables can be modeled with a multinomial distribution (discrete variables), and so they can be easily integrated into the hybrid Bayesian network. The predictions for these variables are used to solve the associated rank aggregation problem [12], thus outputting a partial ranking for a given unlabeled instance.

The paper is structured as follows. In Sect. 2, we review some basic notions concerning rankings. In Sect. 3, we formally describe the proposed model. In Sect. 4, we extend this model to allow interactions between the (continuous) attributes by means of a multivariate Gaussian distribution. In Sect. 5, we set out the experimental evaluation conducted to assess the proposed models. Finally, in Sect. 6, we provide the conclusions and future research lines.

2 Rank Aggregation Problem

Given a set of items \([[n]] = \{1, \dots , n\}\), a ranking \(\pi \) represents a precedence relation among them. In particular, rankings may be without ties (a.k.a. total orders or permutations) or with ties (a.k.a. partial rankings, total orders with ties or bucket orders) if there is no preference among some of the ranked items.

The rank aggregation problem (RAP) [1] consists of obtaining a consensus order from a set of rankings. The Kemeny Ranking Problem (KRP) [19] is probably the best-known instance: its goal is to obtain the consensus permutation (a.k.a. central permutation) that minimizes a particular distance measure (e.g., the Kendall distance) with respect to a set of permutations. The KRP is usually solved with the Borda count algorithm because of its good trade-off between accuracy and efficiency.
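For illustration, the sketch below shows a minimal Borda count aggregation in Python (an illustrative example, not the implementation used in the paper); it assumes that each permutation is encoded as a vector of rank positions, where 1 denotes the most preferred item.

```python
import numpy as np

def borda_count(rankings):
    """Aggregate permutations by Borda count.

    `rankings` is an (m, n) array where rankings[t, i] is the rank
    (1 = most preferred) of item i in the t-th permutation.  Items with
    a lower total rank come first in the consensus permutation.
    """
    rankings = np.asarray(rankings)
    scores = rankings.sum(axis=0)              # total rank per item
    order = np.argsort(scores)                 # best (lowest total) first
    consensus = np.empty_like(order)
    consensus[order] = np.arange(1, len(order) + 1)
    return consensus                           # consensus rank per item

# Example: three voters ranking four items.
print(borda_count([[1, 2, 3, 4],
                   [2, 1, 3, 4],
                   [1, 3, 2, 4]]))             # -> [1 2 3 4]
```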

In addition to the KRP, another well-known RAP is the Optimal Bucket Order Problem (OBOP) [12]. The goal of the OBOP is to obtain the bucket matrix \(B\) (associated with a bucket order \(\pi \)) that minimizes the distance \(D\)

$$\begin{aligned} D(B, P) = \sum _{u, v \in [[n]]} \left| B(u, v) - P(u, v)\right| \end{aligned}$$

where \(P\) is the pair order matrix associated with a set of bucket orders (see [12] for the details).
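As an illustration, and assuming the usual convention \(B(u, v) = 1\), \(0.5\) or \(0\) when \(u\) precedes, is tied with (or equal to) or succeeds \(v\), the distance \(D\) can be computed directly from dense matrices; a minimal sketch in Python follows.

```python
import numpy as np

def obop_distance(B, P):
    """Distance D(B, P): sum of absolute differences between the bucket
    matrix B and the pair order matrix P over all pairs of items."""
    return np.abs(np.asarray(B, float) - np.asarray(P, float)).sum()

# Toy example with n = 3: items 1 and 2 are tied and both precede item 3.
B = np.array([[0.5, 0.5, 1.0],
              [0.5, 0.5, 1.0],
              [0.0, 0.0, 0.5]])
P = np.array([[0.5, 0.6, 0.9],
              [0.4, 0.5, 0.8],
              [0.1, 0.2, 0.5]])
print(obop_distance(B, P))   # -> 0.8
```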

Although there are several heuristic methods to solve the OBOP [2,3,4, 17], we use a particular instance of the Bucket Pivot Algorithm with least indecision assumption [2] named \(LIA_{G}^{MP2}\), because of the good trade-off it achieves between accuracy and efficiency.

3 Hidden Naive Bayes

Given that the goal is to output a partial ranking for an unlabeled instance, we have to obtain the pair order matrix needed to solve the corresponding RAP. In our proposal, we use a Bayesian network to get the entries of this matrix, which codify the preference of a class label \(c_u\) over \(c_v\), with \(u < v\). Since \(P(u, v) = 1 - P(v, u)\), we model both entries with a single variable \(Z_{u, v}\), which also captures the probability that \(c_u\) is tied with \(c_v\). Thus, for each pair of class labels, we create the discrete variable

$$ Z_{u,v} = \left\{ \begin{array}{ll} z_1, & \text {if } c_u \succ c_v \\ z_2, & \text {if } c_u \sim c_v \\ z_3, & \text {if } c_u \prec c_v \end{array} \right. $$

The advantage of this approach is that we only manage discrete variables. However, the number of target variables grows quadratically with the number of class labels, since \( n_L = (n \cdot (n - 1)) / 2 \).
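To make the transformation concrete, the sketch below (illustrative, not the scikit-lr implementation) maps a bucket order, encoded as the bucket of each class label, into one pairwise label per pair.

```python
from itertools import combinations

def ranking_to_pairwise(ranking):
    """Map a bucket order into the pairwise variables Z_{u,v}.

    `ranking[i]` is the bucket of class label i (tied labels share a
    bucket).  The output assigns each pair (u, v), u < v, one of the
    values z1 ('precedes'), z2 ('tied') or z3 ('succeeds').
    """
    pairwise = {}
    for u, v in combinations(range(len(ranking)), 2):
        if ranking[u] < ranking[v]:
            pairwise[(u, v)] = "precedes"    # c_u precedes c_v (z1)
        elif ranking[u] == ranking[v]:
            pairwise[(u, v)] = "tied"        # c_u tied with c_v (z2)
        else:
            pairwise[(u, v)] = "succeeds"    # c_u succeeds c_v (z3)
    return pairwise

# Example: c_1 ranked first, c_2 and c_3 tied in the second bucket.
print(ranking_to_pairwise([1, 2, 2]))
# {(0, 1): 'precedes', (0, 2): 'precedes', (1, 2): 'tied'}
```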

We propose a Bayesian network with Naive Bayes structure and a hidden variable to capture the interactions between the predictive variables and the target variable, and so obtain the a-posteriori probabilities for an unlabeled instance.

3.1 Model Definition

Figure 1 shows the plate representation of the proposed model. Note that the (hybrid) Bayesian network contains the discrete and continuous predictive variables, the discrete target variables and the discrete hidden variable \(H\). In particular:

Fig. 1. Proposed HNB model

  • Discrete predictive variables, denoted by \(X_j\), \(j = 1, \dots , n_J\) with \(dom(X_j) = \{x_{j_1}, \dots , x_{j_{r_{j}}}\}\). They are observed both in the learning and inference stages.

  • Continuous predictive variables, denoted by \(Y_k\), \(k = 1, \dots , n_K\). They are observed both in the learning and inference stages.

  • Target variables, denoted by \(Z_{u,v}\), \(u = 1, \dots , n - 1\) and \(v = u + 1, \dots , n\), with domain \(dom(Z_{u, v}) = \{z_1, z_2, z_3\}\), where \(z_1 \equiv c_u \succ c_v\), \(z_2 \equiv c_u \sim c_v\) and \(z_3 \equiv c_u \prec c_v\). They are observed in the learning stage.

  • Hidden variable, denoted by \(H\) with \(dom(H) = \{h_1, \dots , h_{r_H}\}\), where \(r_H\) is the number of mixtures. This variable is never observed.

The discrete attributes, the target variables and the hidden variable follow a (conditional) multinomial distribution, and the continuous attributes follow a (conditional) Gaussian distribution. Given the naive Bayes structure, the joint probability distribution is given by

$$\begin{aligned} P(H, \mathbf {X}, \mathbf {Y}, \mathbf {Z}) = P(H) \cdot \prod \limits _{j = 1}^{n_J} P(X_j \mid H) \cdot \prod \limits _{k = 1}^{n_K} f(Y_k \mid H) \cdot \prod \limits _{u = 1}^{n - 1} \prod \limits _{v = u + 1}^{n} P(Z_{u, v} \mid H) \end{aligned}$$

3.2 Parameter Estimation

We assume complete and i.i.d. data in both the attributes and the ranking variable. Therefore, the only unobserved variable is the hidden variable \(H\), and we use the Expectation-Maximization (EM) algorithm to jointly estimate the parameters of the observed and hidden variables:

  • E step: Under the assumption that the parameters of the discrete attributes \(p(x_{j_i}^{h_w})\), continuous attributes \(\mu ^{h_w}_k, \sigma ^{h_w}_k\), target variables \(p(z_{u, v}^{h_w})\) and hidden variable \(p(h_w)\), with \(j = 1, \dots , n_J\), \(i = 1, \dots , r_j\), \(k = 1, \dots , n_K\), \(u = 1, \dots , n - 1\), \(v = u + 1, \dots , n\) and \(w = 1, \dots , r_H\), are known, the probability of an instance \(e_t = (x_{1, t}, \dots , x_{n_J, t}, y_{1, t}, \dots , y_{n_K, t}, \pi _t)\) belonging to the mixture component \(h_w\) is

    $$\begin{aligned} \begin{aligned} P(h_w \mid x_{1, t}, \dots , x_{n_J, t}, y_{1, t}, \dots , y_{n_K, t}, \pi _t) = \frac{1}{C} \cdot p(h_w) \cdot \prod \limits _{k = 1}^{n_K} \frac{1}{\sigma _{k}^{h_w}\sqrt{2\pi }} e^{-\frac{1}{2} \left( \frac{y_{k, t} - \mu _{k}^{h_w}}{\sigma _{k}^{h_w}}\right) ^2} \\ \cdot \prod \limits _{j = 1}^{n_J} p(x^{h_w}_{j, t}) \cdot \prod \limits _{u = 1}^{n - 1} \prod \limits _{v = u + 1}^{n} p(z^{h_w}_{u, v, t}) \end{aligned} \end{aligned}$$
    (1)

    where \(C\) is a normalization constant.

  • M step: Under the assumption that the probabilities of belonging to each mixture for all examples are known, the parameters of the model can be estimated as follows:

    • Multinomial parameters for the discrete attributes and the target variables. Each multinomial parameter is estimated by means of maximum likelihood estimation (MLE), where the count for each instance is weighted by the probability of \(H = h_w\) given the instance.

    • Gaussian parameters for the continuous attributes. The Gaussian parameters \(\mu ^{h_w}_k\) and \(\sigma ^{h_w}_k\) are estimated by means of MLE for each \(H = h_w\), weighting each instance by its probability of belonging to that mixture component.

Stopping Condition: We use the log-likelihood of the model given the data as the convergence criterion, with \(\alpha = 0.001\) as the convergence threshold. Moreover, we fix a maximum of \(\beta = 100\) iterations. A compact sketch of the whole EM loop is given below.
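The following is a compact, simplified sketch (in Python) of the EM loop described above. It is illustrative only: it uses a random soft initialization instead of the k-means initialization of Sect. 3.3, adds small constants for numerical stability, and works in log space.

```python
import numpy as np

def em_hidden_nb(X, card, Y, Z, r_H, alpha=1e-3, beta=100, seed=0):
    """Simplified EM for the hidden naive Bayes model of Sect. 3.2.

    X    : (m, n_J) int array of discrete attributes (values 0..card[j]-1)
    card : cardinality of each discrete attribute
    Y    : (m, n_K) float array of continuous attributes
    Z    : (m, n_L) int array of pairwise target variables (values 0, 1, 2)
    r_H  : number of mixture components of the hidden variable H
    """
    rng = np.random.default_rng(seed)
    m, n_J = X.shape
    n_L = Z.shape[1]

    R = rng.dirichlet(np.ones(r_H), size=m)     # soft responsibilities P(h_w | e_t)
    prev_ll = -np.inf
    for _ in range(beta):                       # at most beta iterations
        # ---- M step: weighted maximum likelihood estimates ----
        Nw = R.sum(axis=0)                      # effective count of each h_w
        p_h = Nw / m
        mu = (R.T @ Y) / Nw[:, None]            # (r_H, n_K) conditional means
        var = (R.T @ Y ** 2) / Nw[:, None] - mu ** 2 + 1e-6
        p_x = [np.stack([(R * (X[:, j] == i)[:, None]).sum(0) / Nw
                         for i in range(card[j])], axis=1) for j in range(n_J)]
        p_z = [np.stack([(R * (Z[:, l] == i)[:, None]).sum(0) / Nw
                         for i in range(3)], axis=1) for l in range(n_L)]

        # ---- E step: responsibilities, i.e. Eq. (1) in log space ----
        log_r = np.log(p_h) - 0.5 * (np.log(2 * np.pi * var).sum(1)
                                     + (((Y[:, None, :] - mu) ** 2) / var).sum(2))
        for j in range(n_J):
            log_r += np.log(p_x[j][:, X[:, j]].T + 1e-12)
        for l in range(n_L):
            log_r += np.log(p_z[l][:, Z[:, l]].T + 1e-12)
        ll = np.logaddexp.reduce(log_r, axis=1)  # log-likelihood of each e_t
        R = np.exp(log_r - ll[:, None])          # normalised responsibilities

        # ---- stopping condition on the log-likelihood improvement ----
        if ll.sum() - prev_ll < alpha:
            break
        prev_ll = ll.sum()
    return p_h, mu, var, p_x, p_z
```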

3.3 Model Learning and Selection

We use the following procedure to select the number of mixtures of the hidden variable \(H\):

  1. The dataset is divided into training \(Tr\) (80%) and validation \(Tv\) (20%) splits, and the \(\tau _X\) rank correlation coefficient is used to evaluate the goodness of the models in the search procedure.

  2. The search for the number of mixtures is carried out greedily. First, we evaluate the model with \(r_H = \{2^i\}_{i = 1}^{10}\), and we select the best number of mixtures \(r_H^{\prime }\) according to \(\tau _X^{Tv}\). Second, we apply a binary search in \([\frac{r_H^{\prime }}{2}, r_H^{\prime }]\), and we use the best number of mixtures \(r_H^*\) to train the model with the whole dataset.

Each time a new value of \(r_H\) is tried, the process starts from scratch, that is, all the parameters of the components (probabilities and weights) are initialized using the k-means clustering algorithm [23] with \(k = r_H\) and \(\gamma = 10\) different centroid seeds. Then, the EM algorithm is executed.
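A rough sketch of this search is given below. `train_and_score(r_H)` is a hypothetical callback assumed to initialize and train the model on \(Tr\) with \(r_H\) mixtures and return \(\tau _X^{Tv}\); the exact policy of the binary search is our interpretation, not necessarily the one implemented in scikit-lr.

```python
def select_num_mixtures(train_and_score, max_power=10):
    """Greedy search for the number of mixtures r_H (Sect. 3.3)."""
    # 1) Coarse search over powers of two: r_H = 2, 4, ..., 2**max_power.
    candidates = [2 ** i for i in range(1, max_power + 1)]
    scores = {r: train_and_score(r) for r in candidates}
    best = max(candidates, key=scores.get)

    # 2) Binary search in [best / 2, best], keeping the better half.
    lo, hi = best // 2, best
    while hi - lo > 1:
        mid = (lo + hi) // 2
        scores[mid] = train_and_score(mid)
        if scores[mid] > scores[hi]:
            hi = mid
        else:
            lo = mid
    return max(scores, key=scores.get)       # r_H* used for the final model
```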

3.4 Inference

In the inference process, the method needs to compute the partial ranking \(\pi _t\) of the class labels associated with an unlabeled instance \(e_t\). Although the standard approach would select the (partial) ranking that maximizes the a-posteriori probability given the instance \(e_t\), the cardinality of the search space (approximately \(n! / (2 (\ln 2)^{n + 1})\), the number of bucket orders over \(n\) items) is too high, so we need to use an approximate method:

  1. We compute the a-posteriori probability \(P(Z_{u, v} \mid e_t)\) for each target variable using the Bayesian network.

  2. We use these probabilities to populate the pair order matrix \(P_t\) associated with the instance \(e_t\)

    $$ \begin{aligned} P_t(u, v) &= P(Z_{u, v} = z_1 \mid e_t) + \frac{1}{2} \cdot P(Z_{u, v} = z_2 \mid e_t) \\ P_t(v, u) &= P(Z_{u, v} = z_3 \mid e_t) + \frac{1}{2} \cdot P(Z_{u, v} = z_2 \mid e_t) = 1 - P_t(u, v) \end{aligned} $$

    with \(u < v\) and \(P_t(u, v) = 0.5\) if \(u = v\), and we solve the OBOP using \(P_t\) to obtain the (partial) ranking \(\pi _t\) for the instance \(e_t\), as sketched below.
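A minimal sketch of this step is shown below; `posteriors[(u, v)]` is assumed to hold the triple \((P(z_1 \mid e_t), P(z_2 \mid e_t), P(z_3 \mid e_t))\) computed from the Bayesian network, with 0-based label indices.

```python
import numpy as np

def pair_order_matrix(posteriors, n):
    """Populate the pair order matrix P_t of an unlabeled instance."""
    P_t = np.full((n, n), 0.5)               # P_t(u, u) = 0.5 by convention
    for (u, v), (p1, p2, p3) in posteriors.items():
        P_t[u, v] = p1 + 0.5 * p2
        P_t[v, u] = p3 + 0.5 * p2            # equals 1 - P_t[u, v]
    return P_t

# P_t is then fed to the OBOP solver (LIA_G^MP2 in our case) to obtain the
# predicted bucket order pi_t.
```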

4 Gaussian Mixture Semi Naive Bayes

In this section, we assume that all the attributes are continuous, and we allow interactions among them.

4.1 Definition

We propose a Semi-Naive Bayes (SNB) [8, 15] structure to model the continuous attributes using a multivariate normal distribution, while the rest of interactions are still managed by the hidden variable \(H\).

We consider two variants of the Gaussian mixture model (GMM) [21]: full, where each component has its own general covariance matrix, and tied, where all components share the same general covariance matrix.
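This distinction mirrors the full and tied covariance types of scikit-learn's GaussianMixture; the toy snippet below only illustrates the difference in the shape of the estimated covariances and is not the model proposed in this paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

Y = np.random.default_rng(0).normal(size=(200, 3))   # toy continuous data

# full: each component has its own covariance matrix.
gmm_full = GaussianMixture(n_components=4, covariance_type="full").fit(Y)
print(gmm_full.covariances_.shape)                    # (4, 3, 3)

# tied: all components share a single covariance matrix.
gmm_tied = GaussianMixture(n_components=4, covariance_type="tied").fit(Y)
print(gmm_tied.covariances_.shape)                    # (3, 3)
```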

4.2 Estimation

The differences of the GMSNB model with respect to the HNB model are:

  • E step: Similarly to the HNB model, we use Eq. 1, but, instead of the product of the conditional Gaussian distributions, we use the probability density function of the multivariate normal distribution \(\mathcal{MN}(\mathbf {y}_t \mid \mathbf {\mu }^{h_w}, \mathbf {\varSigma }^{h_w})\), where \(\mathbf {y}_t\) is the configuration of values for the continuous attributes in \(e_t\), \(\mathbf {\mu }^{h_w}\) is the vector of means and \(\mathbf {\varSigma }^{h_w}\) is the covariance matrix.

  • M step: As with the parameters of the HNB model, the means and empirical covariances of the continuous attributes are estimated weighting each instance by \(w_t^{h_w} = P(h_w \mid \mathbf {y}_t, \pi _t)\) (see the sketch below).
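The sketch below illustrates this weighted estimation (under our interpretation of the tied variant as the responsibility-weighted pooling of the per-component covariances); in the E step, the product of univariate densities in Eq. 1 is simply replaced by the multivariate density evaluated with these parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

def m_step_gaussian(Y, R, tied=False):
    """Weighted MLE of the multivariate Gaussian parameters (GMSNB M step).

    Y : (m, n_K) continuous attributes; R : (m, r_H) responsibilities,
    i.e. R[t, w] = w_t^{h_w} = P(h_w | y_t, pi_t).
    """
    Nw = R.sum(axis=0)                                   # (r_H,)
    mu = (R.T @ Y) / Nw[:, None]                         # (r_H, n_K) means
    covs = []
    for w in range(R.shape[1]):
        D = Y - mu[w]
        covs.append((R[:, w, None] * D).T @ D / Nw[w])   # weighted covariance
    covs = np.stack(covs)
    if tied:                                             # pool the components
        covs = np.average(covs, axis=0, weights=Nw)
    return mu, covs

# Toy demo with n_K = 2 and r_H = 2.
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 2))
R = rng.dirichlet(np.ones(2), size=100)
mu, covs = m_step_gaussian(Y, R, tied=True)
print(multivariate_normal.pdf(Y[0], mean=mu[0], cov=covs))  # E-step Gaussian term
```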

4.3 Learning and Inference

The learning and inference stages of the HNB model and the GMSNB model are the same, but we model the continuous attributes with the multivariate normal distribution.

5 Experimental Evaluation

In this section, we detail the datasets used, the algorithms tested, the methodology adopted and the results obtained in the evaluation of our proposal.

5.1 Datasets

Table 1 shows the main characteristics of the 15 (semi-synthetic) datasets used as benchmark for the PLR problem [6]. The columns #rankings and #buckets stand for the mean number of different (partial) rankings in the dataset and the mean number of buckets per ranking, respectively. The datasets (and their description) are provided at: https://www.openml.org/u/25829.

Table 1. Description of the datasets

5.2 Algorithms

We tested the following algorithms:

  • Instance Based Partial Label Ranking (IBPLR) [6]. The Euclidean distance was used to identify the \(k\) nearest neighbors, and the (partial) rankings associated with these neighbors were weighted according to the (inverse) distance. The number of nearest neighbors was adjusted with a five-fold cross validation (5-cv) over the training dataset (see [6] for details).

  • Partial Label Ranking Trees (PLRT) using the four criteria described in [6].

  • HNB-PLR (Sect. 3). We considered four alternatives: Gaussian distribution (HNB-PLR-G) for the continuous attributes and multinomial distribution for the equal-width (HNB-PLR-W), equal-frequency (HNB-PLR-F) and entropy-based [13] (HNB-PLR-E) discretized versions. The number of bins for the equal-width and equal-frequency binning was fixed to 5 (see the sketch after this list).

  • GMSNB-PLR (Sect. 4). We used a different covariance matrix for each mixture (full, GMSNB-PLR-F) and the same covariance matrix for all the mixtures (tied, GMSNB-PLR-T).
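For reference, the equal-width and equal-frequency discretizations with 5 bins correspond to the "uniform" and "quantile" strategies of scikit-learn's KBinsDiscretizer (the entropy-based discretization [13] has no built-in counterpart there); the toy snippet below only illustrates this preprocessing step.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

Y = np.random.default_rng(0).normal(size=(100, 4))    # toy continuous attributes

# Equal-width binning (HNB-PLR-W) with 5 bins per attribute.
equal_width = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
# Equal-frequency binning (HNB-PLR-F) with 5 bins per attribute.
equal_freq = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

X_width = equal_width.fit_transform(Y)
X_freq = equal_freq.fit_transform(Y)
print(X_width[:3], X_freq[:3], sep="\n")
```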

5.3 Methodology

We adopted the following design decisions:

  • A \(5 \times 10\) cross validation method was used (standard in the PLR problem).

  • The accuracy was measured with the \(\tau _X\) rank correlation coefficient [11].

  • We used the standard statistical analysis procedure [10, 16], carried out with the exreport tool [7], to analyze the results (a rough sketch in code is given after this list):

    1. First, a Friedman test is carried out with a significance level of \(\alpha = 0.05\). If the obtained \(p\)-value is less than or equal to \(\alpha = 0.05\), we reject the null hypothesis \(H_0\), that is, at least one algorithm differs from the rest.

    2. Second, a post-hoc test using the Holm procedure [18] is applied to discover the outstanding methods. This test compares all the algorithms with respect to the one ranked first by the Friedman test (control algorithm).
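The actual analysis is performed with exreport (an R tool); the snippet below is only a rough Python analogue, using scipy's Friedman test and statsmodels' Holm adjustment, with placeholder numbers instead of the real results.

```python
from scipy.stats import friedmanchisquare
from statsmodels.stats.multitest import multipletests

# tau_X scores of three hypothetical algorithms on five datasets (toy numbers).
scores = {"A": [0.82, 0.75, 0.90, 0.68, 0.77],
          "B": [0.80, 0.70, 0.88, 0.65, 0.74],
          "C": [0.60, 0.55, 0.71, 0.50, 0.58]}

stat, p_value = friedmanchisquare(*scores.values())
if p_value <= 0.05:                        # reject H0: not all algorithms are equal
    # The raw pairwise p-values of the control (best-ranked) algorithm against
    # the others would be computed here; Holm adjusts them for multiplicity.
    raw_p = [0.030, 0.004]                 # placeholder pairwise p-values
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
    print(adj_p, reject)
```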

5.4 Reproducibility

The source code is provided at: https://github.com/alfaro96/scikit-lr. The experiments were executed on computers running the CentOS Linux 7 operating system, with an Intel(R) Xeon(R) E5-2630 CPU at 2.40 GHz and 16 GB of RAM.

5.5 Results

In this section, we provide and analyze the accuracy and CPU time results.

Accuracy. Table 2 shows the accuracy of the mixture-based models. Each cell contains the mean and standard deviation, over the test folds of the \(5 \times 10\)-cv, of the \(\tau _X\) rank correlation coefficient between the real and predicted (partial) rankings. The boldfaced values correspond to the algorithms leading to the best accuracy for each dataset.

Table 2. Accuracy for the HNB-PLR and GMSNB-PLR algorithms

We used the standard statistical analysis procedure described in Sect. 5.3:

  1. The \(p\)-value obtained in the Friedman test was \(4.124 \cdot 10^{-6}\), so we rejected the null hypothesis (\(H_0\)): at least one algorithm was different from the rest.

  2. Table 3 shows the results of the post-hoc test, taking the GMSNB-PLR-T algorithm as control. The columns ranking and p-value report the ranking obtained by the Friedman test and the p-value adjusted by the Holm procedure, respectively. The columns win, tie and loss contain the number of times that the control algorithm wins, ties and loses with respect to the row-wise one. The boldfaced p-values correspond to non-rejected null hypotheses (\(H_0\)).

Table 3. Results of the post-hoc test for the mean accuracy of the HNB-PLR and GMSNB-PLR algorithms
Table 4. Mean accuracy for the IBPLR and PLRT algorithms

From the statistical analysis, we can observe that the GMSNB-PLR-T algorithm is ranked first by the Friedman test, and it is statistically different from the HNB-PLR-F, HNB-PLR-W and GMSNB-PLR-F algorithms. However, there is no statistical difference with respect to the HNB-PLR-G and HNB-PLR-E algorithms.

Let us now compare the best mixture-based algorithms with the IBPLR and PLRT algorithms. Table 4 shows the results of this comparison.

For this set of algorithms, the \(p\)-value obtained by the Friedman test was \(1.5 \cdot 10^{-2}\), so we rejected the null hypothesis \(H_0\): at least one algorithm is different from the rest. Table 5 shows the results of the post-hoc test adjusted with the Holm procedure, taking as control the IBPLR algorithm (ranked first by the Friedman test).

Table 5. Results of the post-hoc test for the mean accuracy of the compared algorithms

According to these results, we can conclude that:

  • The IBPLR algorithm is ranked first by the Friedman test, without statistical difference with respect to the PLRT and GMSNB-PLR-T algorithms. These results show that the model-based methods are competitive with the instance-based ones in the PLR problem, which is not the case for the LR problem [5, 9].

  • The main disadvantage of the HNB-PLR and GMSNB-PLR algorithms is the memory requirements, as they are not able to deal with datasets generating a high number of target variables. For instance, there are no results for the letter dataset (325 target variables and 20000 instances).

Time. Our proposals are slower than the instance-based (IBPLR) and tree-based (PLRT) methods because of the high number of mixtures (see Table 6), and hence of EM iterations, required to properly model the joint probability distribution. Furthermore, since we have a high number of target variables (due to the ranking transformation), the EM algorithm takes a long time to converge. For instance, on the pendigits dataset (10992 instances and 10 class labels, that is, 45 target variables), the GMSNB-PLR-T is 120 times slower than the IBPLR algorithm and 850 times slower than PLRT-G.

Table 6. Mean number of mixtures for each PGM

6 Conclusions and Future Work

In this paper, we have proposed an algorithm based on Bayesian networks and rank aggregation to solve the PLR problem. In particular, we have transformed the ranking variable into several target variables to model the preferences among pairs of class labels with a multinomial distribution. Our proposal is based on an SNB structure with a hidden variable as root to model the interactions between the predictive and target variables. Thus, we only need to estimate the parameters with the EM algorithm. Given an unlabeled instance, the a-posteriori probabilities for the target variables are computed and used to populate the pair order matrix, which is then used to solve the OBOP and so obtain the output (partial) ranking.

From the experimental evaluation, we have concluded that the GMSNB-PLR-T algorithm is competitive with the IBPLR and PLRT algorithms. Note that, although the GMSNB-PLR-T requires more computational resources during the learning phase than the IBPLR algorithm, this is not the case during the inference phase.

As future research, we plan to reduce the size of the problem using clustering techniques to alleviate the memory problems, which we expect will also reduce the CPU time required in the learning phase.