1 Introduction

Ordinal regression is an important supervised learning paradigm that aims to learn predictive models for ordinal targets. Ordinal regression problems are common in a wide range of domains, from social sciences [9] to financial technology [1] and clinical research [23]. In recent years, ordinal regression has seen significant developments, with many prevalent methods adapted from traditional machine learning algorithms such as support vector machines [3, 24], neural networks [2, 4, 6, 13], boosting [11] and discriminant learning [21, 22]. These methods have been shown to be effective in many scenarios, but unavoidably retain substantial weaknesses of the original methods. One significant challenge comes from the fact that the feature space can be very high-dimensional yet sparse in many real-world ordinal regression applications, e.g., collaborative filtering, click-through rate prediction, and computer-aided pathology diagnosis. It is known that the sparse representation problem greatly hinders the performance of traditional machine learning methods, as well as their extensions for ordinal regression [7].

One successful solution to the sparse representation problem is to model the inherent interactions among features, because the co-occurrence of features often helps reveal high-level domain knowledge about the task under consideration. One effective approach to modeling feature interactions is Factorization Machines (FM) [19], which embed high-dimensional sparse features into a low-rank latent space and learn pairwise feature interactions via the inner product of the features’ embedding vectors. Although originally proposed in the context of recommender systems, FM has shown great promise in a wide range of prediction tasks, especially those with very high-dimensional and sparse feature spaces [16,17,18, 25]. However, the target variables of traditional FM models can only be either discrete or continuous; the traditional FM therefore does not cater to the ordinal relationship among learning targets. To the best of our knowledge, there is little work adapting FM for ordinal regression.

In this paper, we propose a novel Factorization Machines for Ordinal Regression (FMOR) model, in which the sparsity issue in ordinal regression is tackled through factorized feature interactions. Motivated by the popular threshold methodology in ordinal regression studies, the proposed FMOR extends the traditional FM by introducing a set of threshold parameters that map the real-valued outputs of FM to ordinal labels. We implement the learning algorithm of FMOR based on stochastic gradient descent, and further show that the ordinal threshold constraint required by threshold-based ordinal regression methods is automatically satisfied by the derived model. Finally, we perform comprehensive experiments on several benchmark datasets and compare FMOR with state-of-the-art approaches. The results show that FMOR noticeably outperforms all counterparts, especially in the case of sparse feature spaces.

The rest of this paper is organized as follows. Section 2 briefly discusses the literature on ordinal regression. Section 3 gives the details of the proposed factorization machines for ordinal regression. Section 4 reports the experimental results. Section 5 concludes the paper.

2 Related Work

Generally, ordinal regression methods can be classified into three categories: (i) naive methods, (ii) ordinal binary decomposition methods, and (iii) threshold-based methods.

Naive Methods. Ordinal regression can be simplified into conventional supervised learning paradigms such as nominal classification or metric regression by either ignoring the ordinal relationship among classes or casting ordinal labels into real values. A more advanced method of this type transforms ordinal regression into cost-sensitive classification, in which the ordinal information is encoded as misclassification costs [10].

Ordinal Binary Decomposition Methods. The main idea of ordinal binary decomposition methods is to decompose the ordinal classes into several binary pairs, each modeled by one or more traditional classifiers. Lin et al. [12] proposed a reduction framework from ordinal regression to binary classification: each sample is extended with a series of ordinal patterns, and a binary classifier is then learned for each ordinal class to answer the question “Is the rank of \(\mathbf {x}\) greater than r or not?”. Liu et al. [14] used triplets, with each element drawn from a different rank, as samples, and learned a binary classifier for each ordinal class to answer the question “Is the rank of \(\mathbf {x}\) greater than \(r-1\) and smaller than \(r+1\)?”.

Threshold-Based Methods. Threshold-based methods have been a popular technique for handling ordinal samples. They aim to estimate: (i) a latent regression function \(f(\mathbf {x})\) that maps the input feature space to a one-dimensional real space; and (ii) a set of thresholds \(b_1 \le \cdots \le b_R\) that cast the real-valued \(f(\mathbf {x})\) into the interval corresponding to an ordinal class. The proportional odds model (POM) [15] is one of the first threshold-based methods and has inspired many subsequent studies. Another well-known threshold-based ordinal regression method is Support Vector Ordinal Regression (SVOR) [3, 20], which generalizes the “large margin” principle of support vector machines to ordinal regression. Two solutions to SVOR have been developed: one maximizes the margin between the closest neighboring classes (the fixed-margin strategy) and the other maximizes the sum of the margins of all classes (the sum-of-margins strategy). In a recent survey on ordinal regression [5], SVOR is reported to be among the best threshold-based methods for its competitive performance in both prediction accuracy and training time.

3 The Proposed Method

In this section, we first give a preliminary introduction to the traditional FM, and then elaborate our proposed FMOR method.

3.1 Preliminary

Factorization Machines (FM) [19] are a generic model class capable of dealing with high-dimensional and sparse features. Formally, FM takes as input a real-valued vector \(\mathbf {x} \in \mathbb {R}^d\) and estimates the target by modelling pairwise interactions of sparse features using low-rank latent factors. The model equation of FM is formulated as:

$$\begin{aligned} \hat{y}_{\text {FM}}(\mathbf {x};{\Theta }) = w_0 + \sum _{j=1}^d w_j x_j + \sum _{j=1}^d\sum _{j'=j+1}^d \langle \mathbf {v}_j,\mathbf {v}_{j'} \rangle x_j x_{j'} \end{aligned}$$
(1)

where the parameters \({\Theta }\) to be estimated are:

$$\begin{aligned} w_0\in \mathbb {R}; \quad \mathbf {w}\in \mathbb {R}^d; \quad \mathbf {V}=(\mathbf {v}_1,\cdots ,\mathbf {v}_d)\in \mathbb {R}^{p\times d} \end{aligned}$$

In Eq. 1, the first two terms are a linear combination of individual features with weights \(w_j \ (1\le j\le d)\) and a global bias \(w_0\), and the last term models pairwise feature interactions using a factorized weighting scheme \(\hat{w}_{jj'}=\langle \mathbf {v}_j,\mathbf {v}_{j'} \rangle =\sum _{k=1}^p{v_{jk}\cdot v_{j'k}}\), where \(\mathbf {v}_j \) is the factor vector of the j-th feature and \(p\in \mathbb {N}^+\) is the hyper-parameter that defines the dimensionality of the factor vectors. The feature factors in FM are commonly said to be low-rank, since \(p\ll d\). Compared with traditional ways of modeling feature interactions with separate interaction weights (e.g., polynomial SVM), the factorization scheme of FM reduces the model complexity from \(O(d^2)\) to \(O(p\cdot d)\), which is a favored property for high-dimensional feature spaces.
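To make the factorized interaction computation concrete, the following minimal sketch (in Python/NumPy, not part of the original model description) evaluates Eq. 1 directly; the names `fm_predict_naive`, `w0`, `w` and `V` are illustrative placeholders.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Eq. 1 computed directly: O(d^2) pairwise interactions.

    x : (d,) input vector; w0 : global bias; w : (d,) linear weights;
    V : (d, p) matrix whose j-th row is the factor vector v_j.
    """
    d = x.shape[0]
    y = w0 + w @ x
    for j in range(d):
        for jp in range(j + 1, d):
            # factorized interaction weight w_hat_{jj'} = <v_j, v_{j'}>
            y += (V[j] @ V[jp]) * x[j] * x[jp]
    return y
```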

Furthermore, FM is practically efficient due to its linear computational complexity. The model equation of FM in Eq. 1 can be reformulated as:

$$\begin{aligned} \hat{y}_{\text {FM}}(\mathbf {x};{\Theta }) = w_0 + \sum _{j=1}^d w_j x_j + \frac{1}{2}\sum _{k=1}^p \left( \left( \sum _{j=1}^d v_{jk}x_{j}\right) ^2 - \sum _{j=1}^d v_{jk}^2x_{j}^2 \right) \end{aligned}$$
(2)

Equation 2 indicates that the model equation of FM can be evaluated in time linear in both d and p. In fact, the pairwise feature interactions need only be computed over the non-zero elements of \(\mathbf {x}\), i.e., the computational complexity is \(O(p\cdot N_z)\), where \(N_z\) is the number of non-zero elements of \(\mathbf {x}\). Under sparsity, \(N_z\) can be much smaller than d, so the model equation of FM can be evaluated very efficiently. In brief, FM provides a promising framework for handling high-dimensional and sparse data.
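Continuing the sketch above, the reformulation in Eq. 2 can be evaluated over only the non-zero entries of \(\mathbf {x}\); the hypothetical helper below illustrates the \(O(p\cdot N_z)\) computation.

```python
import numpy as np

def fm_predict_sparse(nz_idx, nz_val, w0, w, V):
    """Eq. 2 restricted to the N_z non-zero entries of x: O(p * N_z).

    nz_idx, nz_val : indices and values of the non-zero features;
    w0, w, V : as in the naive sketch above.
    """
    y = w0 + w[nz_idx] @ nz_val
    # per factor k: (sum_j v_{jk} x_j)^2 - sum_j v_{jk}^2 x_j^2
    s = V[nz_idx].T @ nz_val                  # shape (p,)
    s2 = (V[nz_idx] ** 2).T @ (nz_val ** 2)   # shape (p,)
    return y + 0.5 * float(np.sum(s ** 2 - s2))
```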

3.2 Factorization Machines for Ordinal Regression

We realize Factorization Machines for Ordinal Regression (FMOR) by leveraging the threshold methodology. The basic idea is to introduce a set of consecutive thresholds that partition the real line into several intervals defining the boundaries of the ordinal classes.

Given an ordinal regression problem with R ordinal classes, FMOR estimates the target of an input vector \(\mathbf {x}\in \mathbb {R}^d\) as:

$$\begin{aligned} \hat{y}(\mathbf {x}) = \mathop {\arg \min }_{r\in \{1,\cdots ,R\}}\{f(\mathbf {x})-b_r \le 0\} \end{aligned}$$
(3)

where \(b_1,\cdots ,b_{R}\in \mathbb {R}\) are the R thresholds partitioning the real line into intervals, each corresponding to an ordinal class. The thresholds are required to satisfy the constraint \(b_1\le \cdots \le b_{R}\). For mathematical convenience, \(b_R\) is simply set to \(+\infty \). \(f(\mathbf {x})\) is the latent factor regression function that captures all possible interactions between features (up to second order, in practice):

$$\begin{aligned} f(\mathbf {x}) = \sum _{j=1}^d w_j x_j + \sum _{j=1}^d\sum _{j'=j+1}^d \langle \mathbf {v}_j,\mathbf {v}_{j'} \rangle x_j x_{j'} \end{aligned}$$
(4)

Note that \(f(\mathbf {x})\) is the same as the traditional FM model in Eq. 1 except for the global bias term. In essence, FMOR extends the traditional FM by introducing a set of thresholds in place of the single global bias. The thresholds map the regression function value \(f(\mathbf {x})\) to ordinal targets: an input is predicted as r if and only if \(b_{r-1}<f(\mathbf {x})\le b_r\) (with \(b_0=-\infty \)).
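The decision rule of Eq. 3 amounts to finding the first threshold that \(f(\mathbf {x})\) does not exceed; a minimal sketch, assuming `b` holds the \(R-1\) finite thresholds:

```python
def fmor_predict(f_x, b):
    """Eq. 3: return the smallest rank r such that f(x) <= b_r.

    f_x : real-valued score f(x); b : thresholds (b_1, ..., b_{R-1});
    since b_R is +infinity, rank R is returned when f_x exceeds them all.
    """
    for r, b_r in enumerate(b, start=1):
        if f_x <= b_r:
            return r
    return len(b) + 1
```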

The model parameters of FMOR that have to be estimated are:

$$\begin{aligned} b_1,\cdots ,b_{R-1}\in \mathbb {R}; \quad \mathbf {w}\in \mathbb {R}^d; \quad \mathbf {v}_1,\cdots ,\mathbf {v}_d\in \mathbb {R}^p \end{aligned}$$

Next, we discuss the learning procedure of FMOR, including the learning objective and the optimization algorithm.

The Learning Objective. Following the traditional supervised learning framework, the parameters \({\Theta }\) are learned by minimizing the following regularized empirical risk over a given training set \(\mathcal {D}\):

$$\begin{aligned} \mathcal {O}({\Theta },\mathcal {D})=R_{\mathrm {emp}}({\Theta },\mathcal {D})+\lambda {\Omega }({\Theta }) \end{aligned}$$
(5)

where \(R_{\mathrm {emp}}\) is the empirical risk of an ordinal regression model on the training data, \({\Omega }(\cdot )\) is the regularization term, and \(\lambda \) controls the trade-off between the empirical risk and the regularization of the model parameters.

In order to account for the ordinal relationship among targets when calculating the empirical risk, we measure the prediction errors w.r.t. each ordinal class. Formally, given a training set \(\mathcal {D}=\mathcal {D}^{(1)}\cup \cdots \cup \mathcal {D}^{(R)}\), where \(\mathcal {D}^{(r)}=\{(\mathbf {x}_1,r),\cdots ,(\mathbf {x}_{N_r},r)\}\ (1\le r \le R)\) is the set of \(N_r\) training samples with class r, the empirical risk of an FMOR model is defined as:

$$\begin{aligned} R_{\mathrm {emp}}({\Theta },\mathcal {D})=\sum _{r=1}^R\sum _{i=1}^{N_r} \left( \sum _{k=1}^{r-1}\ell (f(\mathbf {x}_i)-b_{k}) +\sum _{k=r}^{R-1}\ell (b_{k}-f(\mathbf {x}_i))\right) \end{aligned}$$
(6)

where \(\ell (\cdot )\) is the surrogate loss function that penalizes an erroneous prediction. In effect, each training sample contributes to the empirical risk through all thresholds, both the lower ones (\(k=1,\cdots ,r-1\)) and the upper ones (\(k=r,\cdots ,R-1\)), whenever a prediction error occurs.
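The per-sample contribution to Eq. 6 can be sketched as follows (illustrative Python; `loss` stands for the surrogate \(\ell \), e.g. the smoothed hinge defined next):

```python
def sample_risk(f_x, r, b, loss):
    """Per-sample contribution to Eq. 6 for a sample with true rank r.

    f_x : f(x) for this sample; b : thresholds (b_1, ..., b_{R-1});
    loss : surrogate loss function of a single margin value.
    """
    risk = 0.0
    for k in range(1, r):              # lower thresholds: want f(x) > b_k
        risk += loss(f_x - b[k - 1])
    for k in range(r, len(b) + 1):     # upper thresholds: want f(x) <= b_k
        risk += loss(b[k - 1] - f_x)
    return risk
```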

Generally, a surrogate loss function is required to be monotonically non-increasing in the prediction margin. Moreover, smoothness is a desirable property, as it allows efficient optimization techniques to be applied to estimate the model parameters. Here, we adopt the smoothed hinge loss:

$$\begin{aligned} \ell (z)=\left\{ \begin{aligned}&0&\mathrm{if}&\ z\ge 1\\&\frac{(1-z)^2}{2}&\mathrm{if}&\ 0<z<1 \\&0.5-z&\mathrm{if}&\ z\le 0 \end{aligned} \right. \end{aligned}$$
(7)
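For completeness, a direct transcription of Eq. 7, which can be passed as the `loss` argument of the previous sketch:

```python
def smoothed_hinge(z):
    """Smoothed hinge loss of Eq. 7."""
    if z >= 1.0:
        return 0.0
    if z > 0.0:
        return 0.5 * (1.0 - z) ** 2
    return 0.5 - z

# example usage with the hypothetical sample_risk sketch above:
# sample_risk(f_x=0.3, r=2, b=[0.0, 1.0], loss=smoothed_hinge)
```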

As mentioned above, the threshold parameters need to satisfy the ordinal inequality constraint \(b_1 \le \cdots \le b_{R-1}\). Interestingly, this constraint, although not imposed explicitly during learning, is automatically satisfied at the optimal solution of FMOR, as shown in the following theorem.

Theorem 1

Let \({\Theta }^*=(b_1^*,\cdots ,b_{R-1}^*, \mathbf {w}^*, \mathbf {v}_1^*,\cdots ,\mathbf {v}_d^*)\) be the optimal solution of the regularized empirical risk minimization problem in Eq. 5, i.e.,

$$\begin{aligned} {\Theta }^* = \mathop {\arg \min }_{\Theta } \{R_{\mathrm {emp}}({\Theta },\mathcal {D})+\lambda {\Omega }({\Theta })\}, \end{aligned}$$

Then we have \(b_1^* \le \cdots \le b_{R-1}^*\).

Theorem 1 not only establishes a nice property of FMOR but also induces a heuristic that is helpful for finding a better FMOR model, as stated in the following corollary.

Corollary 1

Given two solutions \(\dot{\Theta }=(\dot{b}_1,\cdots ,\dot{b}_{R-1},\dot{\mathbf {w}},\dot{\mathbf {v}}_1,\cdots ,\dot{\mathbf {v}}_d)\) and \(\ddot{\Theta }=(\ddot{b}_1,\cdots ,\ddot{b}_{R-1},\dot{\mathbf {w}},\dot{\mathbf {v}}_1,\cdots ,\dot{\mathbf {v}}_d)\), where \((\ddot{b}_1,\cdots ,\ddot{b}_{R-1})\) is obtained by sorting \((\dot{b}_1,\cdots ,\dot{b}_{R-1})\) in ascending order, we have \(R_{\mathrm {emp}}(\ddot{{\Theta }},\mathcal {D})\le R_{\mathrm {emp}}(\dot{{\Theta }},\mathcal {D})\).

Due to space limitations, the proofs will be provided in the full version of the paper.

The Learning Algorithm. We employ Adaptive Moment Estimation (Adam) [8], a popular variant of stochastic gradient descent that uses adaptive per-parameter learning rates, to solve the regularized empirical risk minimization problem in Eq. 5. The main idea is to iterate over each sample \((\mathbf {x},r)\) in the training set and update the model parameters in the direction of the negative gradient of the objective:

$$\begin{aligned} \theta ^{(t)}=\theta ^{(t-1)}-\eta ^{(\theta ,t)}\cdot \left( \frac{\partial R_{\mathrm {emp}}({\Theta },\{(\mathbf {x},r)\})}{\partial \theta } + \lambda \frac{\partial {\Omega }({\Theta })}{\partial \theta }\right) \end{aligned}$$
(8)

where \(\eta ^{(\theta ,t)}\) is the individual adaptive learning rate for \(\theta \) at the t-th iteration.

For the empirical risk in Eq. 6, the gradient is given by

$$\begin{aligned} \frac{\partial R_{\mathrm {emp}}({\Theta },\{(\mathbf {x},r)\})}{\partial \theta } = \sum _{k=1}^{r-1}\frac{\partial \ell (f(\mathbf {x})-b_{k})}{\partial \theta } +\sum _{k=r}^{R-1}\frac{\partial \ell (b_{k}-f(\mathbf {x}))}{\partial \theta } \end{aligned}$$
(9)

From Eq. 7, the gradient of the smoothed hinge loss is:

$$\begin{aligned} \frac{\partial \ell (z)}{\partial z}=\left\{ \begin{aligned}&0&\mathrm{if}&\ z\ge 1\\&z-1&\mathrm{if}&\ 0<z<1 \\&-1&\mathrm{if}&\ z\le 0 \end{aligned} \right. \end{aligned}$$
(10)

From Eq. 4, the gradient of the factorized regression function is:

$$\begin{aligned} \begin{aligned} \frac{\partial f(\mathbf {x})}{\partial w_j}= & {} x_j \quad (j=1,\cdots ,d) \\ \frac{\partial f(\mathbf {x})}{\partial v_{j,l}}= & {} x_j\cdot \sum _{j'\ne j}{v_{j',l}x_{j'}} \quad (j=1,\cdots ,d;l=1,\cdots ,p) \\ \end{aligned} \end{aligned}$$
(11)

By substituting Eqs. 10 and 11 into Eq. 9, we obtain the gradients used in the optimization algorithm.
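The hedged sketch below puts Eqs. 9-11 together in a single per-sample update. For readability it uses plain SGD with a constant learning rate and L2 regularization of \(\mathbf {w}\) and \(\mathbf {V}\) in place of the per-parameter Adam rates of Eq. 8; the function and argument names (`sgd_step`, `lr`, `lam`) are illustrative and not taken from LibFM.

```python
import numpy as np

def smoothed_hinge_grad(z):
    """Gradient of the smoothed hinge loss (Eq. 10)."""
    if z >= 1.0:
        return 0.0
    if z > 0.0:
        return z - 1.0
    return -1.0

def sgd_step(x, r, w, V, b, lr=0.01, lam=0.01):
    """One per-sample FMOR update combining Eqs. 9-11.

    x : dense (d,) input; r : true rank (1-indexed); w : (d,) linear
    weights; V : (d, p) factor matrix; b : (R-1,) thresholds (float array).
    """
    # f(x) of Eq. 4, evaluated with the linear-time trick of Eq. 2
    s = V.T @ x                                   # s_l = sum_j v_{jl} x_j
    f_x = w @ x + 0.5 * float(np.sum(s ** 2 - (V ** 2).T @ (x ** 2)))

    # Eq. 9: accumulate loss gradients over lower and upper thresholds
    g_f = 0.0                                     # d(risk)/d(f(x))
    g_b = np.zeros_like(b)
    for k in range(len(b)):                       # b[k] is b_{k+1}
        if k < r - 1:                             # lower thresholds b_1..b_{r-1}
            g = smoothed_hinge_grad(f_x - b[k])
            g_f += g
            g_b[k] -= g
        else:                                     # upper thresholds b_r..b_{R-1}
            g = smoothed_hinge_grad(b[k] - f_x)
            g_f -= g
            g_b[k] += g

    # Eq. 11 combined with the chain rule through g_f
    grad_w = g_f * x
    grad_V = g_f * (np.outer(x, s) - V * (x ** 2)[:, None])

    # Eq. 8 with a constant learning rate; thresholds left unregularized here
    w -= lr * (grad_w + lam * w)
    V -= lr * (grad_V + lam * V)
    b -= lr * g_b
    return w, V, b
```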

Note that gradient-based algorithms, though simple and efficient, are not guaranteed to find the globally optimal solution, since the regularized empirical risk minimization problem is highly non-convex. Thus the ordinal inequality constraint might be violated by the estimated parameters. Fortunately, according to Corollary 1, we can obtain better parameters, which not only satisfy the ordinal inequality constraint but also achieve a lower regularized empirical risk, by sorting the learned threshold parameters in ascending order.
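In the sketch above, this post-processing amounts to a single line (assuming `b` holds the learned thresholds as a NumPy array):

```python
# Corollary 1 in code: sorting the learned thresholds in ascending order
# never increases the regularized empirical risk.
b = np.sort(b)
```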

4 Empirical Study

In this section, we report the results of the empirical studies on the proposed FMOR using several benchmark datasets.

4.1 Experimental Settings

As the proposed FMOR is essentially a threshold-based method, we select several state-of-the-art threshold-based ordinal regression methods as baselines. We also compare FMOR against the traditional FM.

  • ORBoost: The thresholded ensemble model for ordinal regression proposed by Lin and Li [11]. The two implementations, namely ORBoost-LR (Ordinal Regression Boosting with Left-Right margins) and ORBoost-All (Ordinal Regression Boosting with All margins), are used as the baselines.

  • SVOR: The support vector formulation for ordinal regression proposed by Chu and Keerthi [3]. The two implementations, namely SVOREX (Support Vector Ordinal Regression with EXplicit constraints) and SVORIM (Support Vector Ordinal Regression with IMplicit constraints), are used as baselines. Each is run with both a Gaussian kernel (kernel width set to 1) and a linear kernel.

  • POMNN: The ordinal neural network based on the proportional odds model proposed by Gutiérrez [6].

  • FM: The original factorization machines proposed by Rendle [19]. The FM model is learned with the least-squares regression loss, and its predictions are rounded to the nearest ordinal class.

The ORBoost methods, the SVOR methods and the traditional FM are run using the publicly available implementations provided by their authors. We implemented the POMNN method using TensorFlow, and implemented the proposed FMOR on the basis of LibFM. The hyper-parameters of each method are chosen from a certain range (shown in Table 1) using 5-fold cross-validation within the training set. For the other parameters of each method, we use the default settings provided by the implementations.

Table 1. The ranges for hyper-parameter selection

All the methods are evaluated using the following measures:

  1. MZE: The Mean Zero-one Error (MZE) is the fraction of incorrect predictions:

     $$\begin{aligned} MZE=\frac{1}{N}\sum _{i=1}^{N}\llbracket \hat{y}(\mathbf {x}_i) \ne y_i \rrbracket \end{aligned}$$

  2. MAE: The Mean Absolute Error (MAE) is the average absolute deviation of the predictions from the ground truth:

     $$\begin{aligned} MAE=\frac{1}{N}\sum _{i=1}^{N}|\hat{y}(\mathbf {x}_i)-y_i| \end{aligned}$$
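Both measures above have straightforward implementations; a minimal sketch for reference (array names are illustrative):

```python
import numpy as np

def mze(y_pred, y_true):
    """Mean Zero-one Error: fraction of incorrect rank predictions."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

def mae(y_pred, y_true):
    """Mean Absolute Error between predicted and true ranks."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(np.abs(y_pred - y_true)))
```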

4.2 Prediction Accuracy

In this experiment, we compare the proposed FMOR against the baselines on 9 benchmark datasets taken from public machine learning data repositories. All are real ordinal datasets with varying numbers of samples, features and classes. We preprocess each dataset by normalizing every numeric attribute into [0, 1] and transforming every categorical attribute into binary form with one-hot encoding (one feature per value). For the winequality dataset, we generate one additional preprocessed dataset by transforming all attributes, both numeric and categorical, into binary form. Specifically, each numeric attribute in the original dataset is discretized into pre-defined bins and then converted into one-hot vectors. This dataset, denoted as winequality0/1 in Table 2, is highly sparse, as each sample is described by a 180-dimensional binary feature vector. The characteristics of the benchmark datasets are described in Table 2.
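As a rough illustration of this preprocessing (a sketch, not the exact pipeline used in the experiments), the following pandas snippet normalizes numeric attributes, one-hot encodes categorical ones, and optionally discretizes numeric attributes into bins as done for winequality0/1; the column lists and the bin count are illustrative assumptions.

```python
import pandas as pd

def preprocess(df, numeric_cols, categorical_cols,
               binarize_numeric=False, n_bins=5):
    """Min-max normalize numeric columns (or bin and one-hot them when
    binarize_numeric=True) and one-hot encode categorical columns."""
    out = df.copy()
    if binarize_numeric:
        for c in numeric_cols:
            out[c] = pd.cut(out[c], bins=n_bins)   # equal-width bins
        cat = numeric_cols + categorical_cols
    else:
        for c in numeric_cols:
            lo, hi = out[c].min(), out[c].max()
            out[c] = (out[c] - lo) / (hi - lo) if hi > lo else 0.0
        cat = categorical_cols
    return pd.get_dummies(out, columns=cat)        # one feature per value
```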

Table 2. Characteristics of the benchmark datasets
Table 3. The MZE results (means and standard deviations over 5 runs) on benchmark datasets

Each dataset is randomly split 5 times into training and testing sets with a ratio of 2:1. The average MZE and MAE over the 5 runs, along with the standard deviations, are reported in Tables 3 and 4 (best in bold), respectively. From the results, we can see that the proposed FMOR beats all baselines on 6 of the 10 datasets in terms of both MZE and MAE. Among the datasets on which the proposed FMOR performs best, the most significant improvement is obtained on the winequality0/1 dataset, whose feature space is sparser than the others. This indicates that the advantage of the proposed FMOR becomes more significant as the level of sparsity increases. In fact, on the datasets (i.e., user-knowledge, eucalyptus and turkiye-eval) where FMOR or the traditional FM fails to outperform traditional ordinal regression methods, the sparsity issue rarely occurs. Taking the user-knowledge dataset as an example, the attributes are all numerical ones, such as the exam performance or the study time of a student. However, this result does not mean that the proposed FMOR cannot be applied to ordinal regression problems with dense feature spaces. In fact, there is only a small gap between FMOR and the best-performing baselines on these datasets. Also note that the proposed FMOR still achieves the best performance on the dense winequality dataset. Among all the baselines, the SVOR methods perform best in most cases. We also notice that POMNN does not perform as well as expected. We argue that more advanced techniques for training deep neural networks need to be employed to learn a better neural network model for ordinal regression.

Table 4. The MAE results (means and standard deviations over 5 runs) on benchmark datasets

4.3 Training Efficiency

In this experiment, we evaluate the training efficiency of the proposed FMOR by comparing its training time with that of the other methods. We only consider the SVOR methods for comparison, owing to their consistent outperformance over the other baselines. All methods were run on a single core of an Intel(R) Xeon(R) CPU E7-4830 processor clocked at 2.13 GHz with access to 24 GB of RAM. Due to space limitations, only the results on the turkiye-eval dataset are reported. Figure 1 plots the training time of each method with varying sizes of the training dataset. It can be clearly seen that the proposed FMOR scales much better than the SVOR methods. On large training sets with thousands of samples, training an FMOR model takes only seconds, while training SVOR models can take several hours.

Fig. 1. Training time with varying dataset size

5 Conclusion

In this paper, we put forward Factorization Machines for Ordinal Regression (FMOR), a latent factor model addressing the sparsity issue in ordinal regression problems. Using factorization machines as the generic base framework for modeling sparse feature spaces, we incorporate the threshold methodology to handle ordinal targets properly. We experimentally show that FMOR has the dual advantages of effectiveness and efficiency, and that it can be applied not only to sparse ordinal data but also yields competitive results on dense data. Future work includes applying FMOR to model ordinal user preference scores in recommender systems.