1 Introduction

In machine learning, models are typically trained under the hypothesis that the training and test data comply with the same statistical distribution [1, 2]. Nevertheless, in real-world applications, this assumption often does not hold, resulting in degraded models. To overcome this issue, the paradigm of unsupervised domain adaptation (UDA) [3,4,5,6,7,8,9,10] was proposed to mitigate the distribution inconsistency between the training and test data domains.

In UDA, the supervised domains whose knowledge is to be transferred are referred to as source domains, while the unsupervised domains are referred to as target domains. According to the modeling methodology, existing UDA methods can be grouped into three categories [11], i.e. instance-level, feature-level and model-level UDA. Specifically, instance-level UDA [12,13,14,15,16] typically assigns weights to the source instances according to their similarity to the target domains, and uses the weighted source instances to help train the target model. Such methods usually work effectively when the cross-domain divergence is small, but may lose efficacy otherwise, especially when the distributions of the source and target domains do not overlap. Feature-level UDA [17,18,19,20,21,22] typically transforms the source and target domains into a common representation space, in which the cross-domain distributions are pulled as close together as possible. Although feature-level UDA usually achieves better results, its efficacy greatly depends on the choice of the representation space. As for model-level UDA [23,24,25,26], it carries out knowledge adaptation from the source model parameters. Although this kind of UDA can distill the source knowledge to the target domain, the data distribution priors are usually ignored.

Most existing UDA works concentrate on task scenarios where only one source and one target domain (1S1T) are involved, while fewer works implement UDA from multiple sources to one target domain (mS1T). To generalize knowledge from one source to multiple target domains (1SmT), Yu et al. [27] proposed the 1SmT UDA method PA-1SmT, which implements domain adaptation by reconstructing the source model with the target model parameters and relates the targets through a shared representation dictionary. Nevertheless, PA-1SmT tends to fail when the union of the targets is a proper subset of the source domain, so that the latter cannot be completely approximated by the former. Even worse, PA-1SmT is designed for ordinary (non-ordinal) problems, so it may degenerate when facing ordinal data. Take human age as an example: it exhibits ordinal relationships among different ages, e.g., a person aged 20 is younger than somebody aged 25 but older than someone aged 18; in other words, misclassifying age 20 as 25 is more serious than misclassifying it as 18. Such order relationships are not preserved in existing UDA methods, so they cannot be directly employed to handle cross-domain ordinal problems.

To implement 1SmT UDA for ordinal data scenarios, as shown in Fig. 1, we construct an ordinal unsupervised domain adaptation model, coined OrUDA, which transfers both implicit and explicit knowledge from the perspectives of model parameters and data distributions. In addition, we design an optimization algorithm that solves the OrUDA model alternatingly, with a theoretical convergence guarantee. Finally, through extensive evaluations on artificial and real datasets, we demonstrate the effectiveness of the proposed method. In summary, the main contributions of this work are fourfold:

  1. A 1SmT UDA method for ordinal data (OrUDA) is proposed, which transfers both explicit and implicit knowledge from the supervised source domain via distribution alignment and model adaptation, and transfers knowledge among the unsupervised target domains via dictionary transmission.

  2. In the process of 1SmT UDA, the unknown ordinal prior of the target domains is transferred from the already trained source model via source model adaptation.

  3. An alternating optimization algorithm is designed to solve the OrUDA model, with a convergence guarantee.

  4. Extensive evaluations are conducted to demonstrate the effectiveness and superiority of the proposed method.

The rest of this article is organized as follows. Section 2 briefly reviews the related work. Section 3 elaborates the proposed method. Section 4 experimentally evaluates the proposed method with analysis. Finally, Section 5 concludes this article and gives future research directions.

Fig. 1

Illustration of OrUDA. The implicit knowledge transfer and the explicit knowledge transfer are represented by the purple solid line and the black dashed line, respectively, while the target relation transfer is represented by the solid line between the target domains

2 Related work

In this section, we review related research on UDA, including 1S1T, mS1T and the most related 1SmT UDA methods.

2.1 1S1T UDA

Thanks to their broad application prospects, a large number of 1S1T UDA methods have been proposed based on both non-deep and deep architectures, and they can be grouped into three categories [11], i.e. instance-level, feature-level and model-level UDA. For instance-level UDA, most methods [12,13,14,15,16] reweight the source instances according to sample similarity; they work effectively when the cross-domain divergence is small but may fail otherwise. A typical example is KLIEP [28], which reweights the instances by solving a convex optimization problem with a sparse solution. The model-level methods [23,24,25,26] mitigate the domain shift by transferring the parameters of the source model. For example, DAN [29] conducts domain adaptation by sharing the parameters of the probability distribution matching layers. Additionally, most UDA methods [17,18,19,20,21,22] belong to the feature level and transfer knowledge by distribution alignment, such as MMD [30], CORAL [31], CMD [32] and so on. Recently, CMMS [21] captures feature consistency by class centroid matching, and SALFL [22] aligns the domains by incorporating projection clustering, label propagation and distributional alignment into a unified optimization framework.

2.2 mS1T UDA

Recently, more and more mS1T UDA methods have emerged, which aim to transfer knowledge from multiple source domains simultaneously to better assist the learning of the target domain. Specifically, mDA [33] aligns all the domains by selecting a shared latent subspace. Differently, SSF [34] samples the subspace along a spline flow from the source domains to the target domain, which associates the domains on the Grassmann manifold. Later, UMDL [35] realizes domain adaptation by jointly training the constructed task-shared and task-specific components. Additionally, to transfer the source decision model to the target domain without bias, MDAN [36] learns an aligned cross-domain semantic network via a generative adversarial scheme. Further, WS-UDA [37] and CMSS [38] train adversarial networks by reweighting the samples and the subspaces of the source domains respectively, which transfers the source knowledge effectively. Moreover, DistanceNet [39] conducts mS1T UDA with a dynamic distance measure and a Bandit controller, and LtC-MSDA [40] constructs an adjacency graph over the mixed domain knowledge to realize consistent mS1T UDA transfer.

2.3 1SmT UDA

Although a variety of 1S1T and mS1T UDA methods have been proposed, research on 1SmT UDA is quite rare. To our knowledge, PA-1SmT [27] is the first and most representative 1SmT UDA method; it transfers knowledge between the source and target domains via model parameter adaptation. More specifically, it performs clustering in the label space of the multiple target domains simultaneously through soft large-margin clustering, and it assumes that the label space of the target domains is a subset of that of the source domain. To transfer the source domain knowledge to help cluster the unlabeled target instances, PA-1SmT bridges the single source domain with each target domain through an individual representation factor. Besides, a shared dictionary is embedded in the model to capture the correlations between the target domains. Taking these considerations into account, the objective function of PA-1SmT is formulated as follows:

$$\begin{aligned} \begin{aligned}&\min _{ \{ \varvec{W}_T^m, \varvec{V}^m, \varvec{D}, \varvec{V}_T^m, u_{ki}^m \}}\; \sum _{m=1}^M \Big \{ \frac{1}{2}\Vert \varvec{W}_T^m\Vert _F^2\\&\qquad \qquad \qquad +\frac{\alpha }{2}\sum _{k=1}^{C_T^m}\sum _{i=1}^{N_t^m}(u_{ki}^m)^2\Vert (\varvec{W}_T^m)^T\varvec{x}_i^m-\varvec{l}_k\Vert _2^2 \\&\qquad \qquad \qquad +\frac{\beta }{2}\Vert \varvec{W}_S - \varvec{W}_T^m\varvec{V}^m\Vert _F^2 + \frac{\gamma }{2}\Vert \varvec{W}_T^m - \varvec{D}\varvec{V}_T^m\Vert _F^2 \\&\qquad \qquad \qquad + \eta \left( \Vert \varvec{V}^m\Vert _{2,1}+\Vert \varvec{V}_T^m\Vert _{2,1}\right) \Big \} \\&\qquad \, \; s.t. \qquad \, \; \sum _{k=1}^{C_T^m}u_{ki}^m = 1, \; 0 \le u_{ki}^m \le 1 \end{aligned} \end{aligned}$$
(1)

where \(\varvec{W}_S\) and \(\varvec{W}_T^m\) respectively denote the projection matrices for the source and target domains, \(\varvec{V}^m\) and \(\varvec{V}_T^m\) indicate the individual selection matrices, \(\varvec{D}\) stands for the shared dictionary among the target domains, \(u_{ki}^m\) is the clustering membership of instance \(\varvec{x}_i^m\) to the kth class in the mth target domain. \(\alpha\), \(\beta\), \(\gamma\) and \(\eta\) are the tradeoff parameters. For more details about the PA-1SmT model and its algorithm, please refer to [27].

Although PA-1SmT incorporates the knowledge relationship between the source and the target domains, it fails to preserve the cross-target relationships, so its performance may be limited, especially in scenarios where the target domains are closely related to each other. Even worse, it does not characterize the ordinal relationships within the data.

3 The proposed method

In this section, we propose an unsupervised domain adaptation method for ordinal data scenarios (OrUDA) that transfers implicit and explicit knowledge from the source domain.

3.1 Notation and hypothesis

For convenience of elaboration, we summarize the notations used in the remaining sections in Table 1.

Table 1 Summary of notation definitions involved in this article

Without loss of generality, we also comply with the hypothesis that the source data set \(\varvec{X}_S \in \mathbb {R}^{d\times N_S}\) follows the distribution \(\mathcal {P}_S(\varvec{x}_S)\), while the data set of the mth target domain follows the distribution \(\mathcal {P}_T^m(\varvec{x}_T^m)\). We concentrate on the UDA scenario where the supervised source and unsupervised target domains share the same original feature space and label space, i.e. \(\mathcal {X}_S = \mathcal {X}_T^m\) and \(\mathcal {Y}_S = \mathcal {Y}_T^m\). Owing to the domain shift between the source and target domains, the marginal distributions differ, \(\mathcal {P}_S(\varvec{x}_S) \ne \mathcal {P}_T^m(\varvec{x}_T^m)\), and so do the conditional distributions, \(\mathcal {P}_S(\varvec{y}_S \mid \varvec{x}_S) \ne \mathcal {P}_T^m(\varvec{y}_T^m \mid \varvec{x}_T^m)\).

3.2 Ordinal UDA with implicit and explicit knowledge transfer

3.2.1 Implicit knowledge transfer from the source domain

For ordinal classification or regression (e.g., human age estimation), one mainstream approach is to project the samples into an ordered feature subspace and then make decisions in that space. Following this principle, the KDLOR method [41] was proposed to seek a discriminative ordinal projection. Furthermore, in order to obtain orthogonal projections with complementary components, a multi-direction counterpart of KDLOR [42] was derived, with the objective function formulated as

$$\begin{aligned} \begin{aligned}&\min _{ \{\varvec{w}_{p+1}, \rho _{p+1}\} } \; \varvec{w}_{p+1}^T\varvec{S}_W\varvec{w}_{p+1} - \lambda \rho _{p+1} \\&\; \quad s.t. \quad \varvec{w}_{p+1}^T\left( \varvec{m}_{k+1}-\varvec{m}_{k}\right) \ge \rho _{p+1}, \, k=1,\cdot \cdot \cdot ,K-1 \\&\qquad \; \; \quad \varvec{w}_{p+1}^T\varvec{w}_{h}=0,\; h=1,\cdot \cdot \cdot ,p \\ \end{aligned} \end{aligned}$$
(2)

where \(\varvec{w}_{p+1}\) denotes the (p+1)th ordinal projection direction, which is restricted to be orthogonal to the previous p directions, \(\rho _{p+1}\) is the class margin along this projection, \(\varvec{S}_W\) is the intra-class scatter matrix, \(\varvec{m}_{k}\) indicates the data centroid of the kth class, and \(\lambda\) is a tradeoff parameter.
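For concreteness, the following is a minimal sketch of how one direction of (2) could be obtained with a generic convex solver; the function name, the cvxpy-based formulation and the default parameter value are our own illustrative assumptions rather than the reference implementation of [42].

```python
import cvxpy as cp

def next_ordinal_direction(S_W, class_means, W_prev, lam=1.0):
    """Sketch of one step of the multi-direction problem (2).

    S_W         : (d, d) intra-class scatter matrix (symmetric PSD).
    class_means : list of the K ordered class centroids m_1, ..., m_K.
    W_prev      : (d, p) matrix whose columns are the p directions found so far.
    lam         : tradeoff parameter between the scatter and the margin.
    """
    d = S_W.shape[0]
    w = cp.Variable(d)
    rho = cp.Variable()

    objective = cp.Minimize(cp.quad_form(w, S_W) - lam * rho)
    # Ordinal margin constraints between consecutive class centroids.
    constraints = [w @ (class_means[k + 1] - class_means[k]) >= rho
                   for k in range(len(class_means) - 1)]
    # Orthogonality to the previously extracted directions.
    constraints += [w @ W_prev[:, h] == 0 for h in range(W_prev.shape[1])]

    cp.Problem(objective, constraints).solve()
    return w.value, rho.value
```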

The source projection matrix \(\varvec{W}_S = [\varvec{w}_{p+1}, \varvec{w}_{p}, \cdot \cdot \cdot , \varvec{w}_{1}]\) can be obtained by successively solving (2) for \(\varvec{w}_{p+1}\) on the source domain. We can then transfer knowledge from the source domain to the target domains via \(\varvec{W}_S\). Nevertheless, considering the distribution shift between the source and target domains, as well as the individuality divergence among the targets, it is not reasonable to directly assign \(\varvec{W}_S\) to \(\{\varvec{W}_T^m\}_{m=1}^M\). To this end, we propose to adaptively transfer the positive components from the source to the target domains by designing individual transfer matrices \(\{\varvec{V}^m\}_{m=1}^M\), and consequently formulate it as

$$\begin{aligned} \begin{aligned}&\mathcal {J}_{im} = \min _{ \{\varvec{W}_T^m, \varvec{V}^m\}} \; \sum _{m=1}^M\left( \Vert \varvec{W}_T^m - \varvec{W}_S\varvec{V}^m\Vert _F^2\right) \\&\qquad \quad \; s.t. \quad (\varvec{V}^m)^T\varvec{V}^m = \varvec{I} \end{aligned} \end{aligned}$$
(3)

where the transfer matrix \(\varvec{V}^m\) adaptively extracts components from the source model \(\varvec{W}_S\) to represent the mth target module \(\varvec{W}_T^m\). The constraint aims to preserve the discriminant components of the transfer matrix, with \(\varvec{I}\) being an identity matrix. Modeling an individual transfer matrix \(\varvec{V}^m\) for each target domain effectively preserves its personality. Since knowledge is transferred from \(\varvec{W}_S\) to \(\varvec{W}_T^m\) in an implicit manner, we call this implicit knowledge transfer.
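To make the role of \(\varvec{V}^m\) concrete, the following is a minimal numpy sketch (with our own variable names) that evaluates the implicit transfer term in (3) and orthonormalizes a candidate \(\varvec{V}^m\) so that \((\varvec{V}^m)^T\varvec{V}^m = \varvec{I}\); a QR factorization is used here in place of explicit Gram–Schmidt.

```python
import numpy as np

def implicit_transfer_loss(W_T_list, V_list, W_S):
    """Evaluate J_im = sum_m ||W_T^m - W_S V^m||_F^2 from (3)."""
    return sum(np.linalg.norm(W_T - W_S @ V, 'fro') ** 2
               for W_T, V in zip(W_T_list, V_list))

def orthonormalize(V):
    """Return a column-orthonormal version of V, i.e. V^T V = I."""
    Q, _ = np.linalg.qr(V)
    return Q
```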

3.2.2 Explicit knowledge transfer from the source domain

Considering the distribution shift between the source and target domains, we need to align the domains by reducing their divergence in both the marginal and the conditional distributions. To this end, we introduce the maximum mean discrepancy (MMD) [43] to model the marginal distribution divergence and the conditional MMD [44] to characterize the conditional distribution divergence between the domains. Seeking a balance between the two, we formulate the objective as

$$\begin{aligned} \begin{aligned} \mathcal {J}_{ex} =&\min _{ \{\varvec{W}_T^m, \varvec{F}_T^m\}} \; \sum _{m=1}^M\left( (1-\mu )\Vert (\varvec{W}_T^m)^T\overline{\varvec{X}_T^m}\right. \\&\left. - (\varvec{W}_S)^T\overline{\varvec{X}_S} \Vert _2^2 + \mu \Vert \varvec{F}_T^m-\varvec{F}_S\Vert _F^2\right) \end{aligned} \end{aligned}$$
(4)

where the first term characterizes the marginal distribution divergence between the source and the M target domains, while the second term describes their conditional divergence, and the two are balanced by the parameter \(0 \le \mu \le 1\). \(\varvec{F}_S\) and \(\varvec{F}_T^m\) respectively store the class centroids column-wise for the source and target domains. It is worth noting that \(\varvec{F}_T^m\) is actually padded with “pseudo-centroids” for the target domains, obtained by classifying their instances with the classifier trained on the source domain. To boost reliability, these pseudo-centroids are updated iteratively during model optimization. To gain a further performance improvement, we adaptively calculate \(\mu\) according to the \(\mathcal {A}\)-distance [45] between the marginal and conditional distributions. Compared to the implicit knowledge transfer in Section 3.2.1, the domain distribution alignment is an explicit way of transferring prior knowledge from the source domain, so we distinguish it as explicit knowledge transfer.
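As an illustration of (4), the sketch below evaluates the explicit transfer term for one target domain; the variable names are ours, and the adaptive computation of \(\mu\) via the \(\mathcal {A}\)-distance is omitted and treated as a given input.

```python
import numpy as np

def explicit_transfer_loss(W_T, X_T_mean, W_S, X_S_mean, F_T, F_S, mu):
    """Evaluate the explicit transfer term of (4) for a single target domain.

    X_T_mean, X_S_mean : d-dimensional mean vectors of the target / source data.
    F_T, F_S           : class-centroid matrices (one centroid per column);
                         F_T holds the pseudo-centroids obtained from the
                         source classifier and is refreshed every iteration.
    mu                 : tradeoff in [0, 1] between the two divergences.
    """
    marginal = np.linalg.norm(W_T.T @ X_T_mean - W_S.T @ X_S_mean) ** 2
    conditional = np.linalg.norm(F_T - F_S, 'fro') ** 2
    return (1.0 - mu) * marginal + mu * conditional
```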

3.2.3 Relation transfer between the target domains

For 1SmT UDA, there are usually potential correlations between the target domains. To explore these relations, we construct a shared representation dictionary to bridge the target domains as

$$\begin{aligned} \begin{aligned}&\mathcal {J}_{re} = \min _{\{\varvec{D}, \varvec{V}_T^m\}} \; \sum _{m=1}^M\left( \Vert \varvec{W}_T^m - \varvec{DV}_T^m\Vert _F^2 + \lambda \Vert \varvec{V}_T^m\Vert _{2,1}\right) \end{aligned} \end{aligned}$$
(5)

where \(\varvec{D} \in \mathbb {R}^{d\times r}\) denotes the dictionary shared by the target domains, and \(\varvec{V}_T^m\) is the relation transfer matrix for the mth target domain. As formulated in (5), all M target domains are related through the common dictionary, which enables knowledge transfer among them.
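The relation term in (5) can be evaluated as in the following sketch (names are ours); note that the \(\ell _{2,1}\) norm sums the \(\ell _2\) norms of the rows, which encourages row-sparse relation matrices.

```python
import numpy as np

def l21_norm(V):
    """||V||_{2,1}: sum of the L2 norms of the rows of V."""
    return np.sum(np.linalg.norm(V, axis=1))

def relation_transfer_loss(W_T_list, D, V_T_list, lam):
    """Evaluate J_re = sum_m (||W_T^m - D V_T^m||_F^2 + lam ||V_T^m||_{2,1})."""
    return sum(np.linalg.norm(W_T - D @ V_T, 'fro') ** 2 + lam * l21_norm(V_T)
               for W_T, V_T in zip(W_T_list, V_T_list))
```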

3.2.4 Overall objective of OrUDA

For the concerned ordinal 1SmT UDA, we build the overall objective function of the OrUDA model by taking all the above considerations into account simultaneously, and formulate it as

$$\begin{aligned} \begin{aligned}&\mathcal {J} = \mathcal {L}_{target} + \frac{\lambda _1}{2}\mathcal {J}_{ex} + \frac{\lambda _2}{2}\mathcal {J}_{im} + \frac{\lambda _3}{2}\mathcal {J}_{re} \end{aligned} \end{aligned}$$
(6)

where the first term denotes the empirical loss on the target domains, while the other terms regularize the learning by transferring knowledge from the source domain (implicit and explicit) and from the other target domains (relation). It is worth noting that the implicit knowledge transfer constructs an individual transfer matrix for each target domain to learn the latent ordinal information from the source domain, while the explicit knowledge transfer aims to mitigate the domain shift in the shared subspace obtained from the explicit measure of the domain distributions. In order to transfer the ordinal structure from the source domain to the target domains, we encode the target instance labels through least-squares regression on their centroids. Substituting (3), (4) and (5) into (6), we rewrite (6) as

$$\begin{aligned} \begin{aligned}&\mathcal {J} = \min _{\{\varvec{W}_T^m, \varvec{F}_T^m, \varvec{G}_T^m, \varvec{V}^m, \varvec{V}_T^m, \varvec{D}\}} \; \sum _{m=1}^M \Bigg \{\frac{1}{2}\Vert (\varvec{W}_T^m)^T\varvec{X}_T^m-\varvec{F}_T^m(\varvec{G}_T^m)^T\Vert _F^2 \\&\qquad + \frac{\lambda _1}{2}\left( (1-\mu )\Vert (\varvec{W}_T^m)^T\overline{\varvec{X}_T^m} - (\varvec{W}_S)^T\overline{\varvec{X}_S} \Vert _2^2 + \mu \Vert \varvec{F}_T^m-\varvec{F}_S\Vert _F^2\right) \\&\qquad + \frac{\lambda _2}{2}\Vert \varvec{W}_T^m - \varvec{W}_S\varvec{V}^m\Vert _F^2 + \frac{\lambda _3}{2}\Vert \varvec{W}_T^m - \varvec{DV}_T^m\Vert _F^2 + \lambda _4\Vert \varvec{V}_T^m\Vert _{2,1}\Bigg \} \\&\qquad s.t. \quad (\varvec{V}^m)^T\varvec{V}^m = \varvec{I} \end{aligned} \end{aligned}$$
(7)

where \(\lambda _1\) to \(\lambda _4\), as well as \(\mu\), are predefined tradeoff parameters. Through the modeling manner of (7), the ordinal characteristics of the data, as well as the other domain knowledge, can be effectively transferred to the target domains.
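Putting the pieces together, the overall objective (7) can be evaluated as in the sketch below; the per-target container and all variable names are our own assumptions, and the pseudo-label refresh is not shown.

```python
import numpy as np

def oruda_objective(domains, W_S, X_S_mean, F_S, D, mu, lam1, lam2, lam3, lam4):
    """Evaluate the overall OrUDA objective (7).

    `domains` is a list of per-target dictionaries holding X (d x N data),
    X_mean (d,), W (d x p projection), F (p x K centroids), G (N x K
    assignments), V (transfer matrix) and V_T (relation matrix).
    """
    total = 0.0
    for dom in domains:
        X, W, F, G = dom['X'], dom['W'], dom['F'], dom['G']
        target_loss = 0.5 * np.linalg.norm(W.T @ X - F @ G.T, 'fro') ** 2
        marginal = np.linalg.norm(W.T @ dom['X_mean'] - W_S.T @ X_S_mean) ** 2
        conditional = np.linalg.norm(F - F_S, 'fro') ** 2
        explicit = 0.5 * lam1 * ((1 - mu) * marginal + mu * conditional)
        implicit = 0.5 * lam2 * np.linalg.norm(W - W_S @ dom['V'], 'fro') ** 2
        relation = (0.5 * lam3 * np.linalg.norm(W - D @ dom['V_T'], 'fro') ** 2
                    + lam4 * np.sum(np.linalg.norm(dom['V_T'], axis=1)))
        total += target_loss + explicit + implicit + relation
    return total
```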

3.3 Optimization of OrUDA

As shown in (7), the objective function is convex w.r.t. each variable when the others are fixed; therefore, we construct an alternating optimization to solve it, i.e. solving one variable while fixing the others.

  • Solve \(\varvec{W}_T^m\) with \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\), \(\varvec{D}\) fixed.

    When \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\) and \(\varvec{D}\) are fixed, then (7) w.r.t. \(\varvec{W}_T^m\) can be equivalently written as

    $$\begin{aligned} \begin{aligned} \mathcal {J}_{\varvec{W}_T^m}&= \min _{\varvec{W}_T^m} \; \frac{1}{2}\Vert (\varvec{W}_T^m)^T\varvec{X}_T^m-\varvec{F}_T^m(\varvec{G}_T^m)^T\Vert _F^2 \\&\qquad \quad + \frac{\lambda _1}{2}(1-\mu )\Vert (\varvec{W}_T^m)^T\overline{\varvec{X}_T^m} - (\varvec{W}_S)^T\overline{\varvec{X}_S} \Vert _2^2 \\&\qquad \quad + \frac{\lambda _2}{2}\Vert \varvec{W}_T^m - \varvec{W}_S\varvec{V}^m\Vert _F^2 + \frac{\lambda _3}{2}\Vert \varvec{W}_T^m - \varvec{DV}_T^m\Vert _F^2 \\ \end{aligned} \end{aligned}$$
    (8)

    Calculating the derivative of (8) w.r.t. \(\varvec{W}_T^m\) and setting it to zero yields the closed-form solution

    $$\begin{aligned} \begin{aligned} \varvec{W}_T^m & = \left( \varvec{X}_T^m(\varvec{X}_T^m)^T+\lambda _1(1-\mu )\overline{\varvec{X}_T^m}(\overline{\varvec{X}_T^m})^T+(\lambda _2+\lambda _3)\varvec{I}_d\right) ^{-1}\\& \quad \cdot \left( \varvec{X}_T^m\varvec{G}_T^m(\varvec{F}_T^m)^T+\lambda _1(1-\mu )\overline{\varvec{X}_T^m}(\overline{\varvec{X}_S})^T\varvec{W}_S\right. \\&\quad \left. +\lambda _2\varvec{W}_S\varvec{V}^m+\lambda _3\varvec{DV}_T^m\right) \end{aligned} \end{aligned}$$
    (9)
  • Solve \(\varvec{F}_T^m\) with \(\varvec{W}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\), \(\varvec{D}\) fixed.

When \(\varvec{W}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\) and \(\varvec{D}\) are fixed, then (7) w.r.t. \(\varvec{F}_T^m\) can be written as

$$\begin{aligned} \begin{aligned} \mathcal {J}_{\varvec{F}_T^m} =&\min _{\varvec{F}_T^m} \; \frac{1}{2}\Vert (\varvec{W}_T^m)^T\varvec{X}_T^m-\varvec{F}_T^m(\varvec{G}_T^m)^T\Vert _F^2\\&+ \frac{\lambda _1\mu }{2}\Vert \varvec{F}_T^m-\varvec{F}_S\Vert _F^2 \end{aligned} \end{aligned}$$
(10)

Setting the derivative of (10) w.r.t. \(\varvec{F}_T^m\) to zero yields the following closed-form solution

$$\begin{aligned} \begin{aligned}&\varvec{F}_T^m = \left( (\varvec{W}_T^m)^T\varvec{X}_T^m\varvec{G}_T^m+\lambda _1\mu \varvec{F}_S\right) \left( (\varvec{G}_T^m)^T\varvec{G}_T^m+\lambda _1\mu \varvec{I}_{K_T^m}\right) ^{-1} \end{aligned} \end{aligned}$$
(11)
  • Solve \(\varvec{G}_T^m\) with \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\), \(\varvec{D}\) fixed.

When \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\) and \(\varvec{D}\) are fixed, then (7) w.r.t. \(\varvec{G}_T^m\) can be written as

$$\begin{aligned} \begin{aligned}&\mathcal {J}_{\varvec{G}_T^m} = \min _{\varvec{G}_T^m} \; \frac{1}{2}\Vert (\varvec{W}_T^m)^T\varvec{X}_T^m-\varvec{F}_T^m(\varvec{G}_T^m)^T\Vert _F^2 \end{aligned} \end{aligned}$$
(12)

Considering that the (ij)th element \(\varvec{G}_{T(ij)}^m\) of \(\varvec{G}_T^m\) stores the membership degree of the ith instance to the jth class, we compare the distance of each instance to the class centroids and assign it to the class with the closest centroid, as formulated

$$\begin{aligned} \begin{aligned}&\varvec{G}_{T(ij)}^m = {\left\{ \begin{array}{ll} 1, \quad j = \arg \min _{k}\Vert (\varvec{W}_T^m)^Tx_i-\varvec{F}_k\Vert _2^2 \\ 0, \quad otherwise \end{array}\right. } \end{aligned} \end{aligned}$$
(13)
  • Solve \(\varvec{V}^m\) with \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}_T^m\), \(\varvec{D}\) fixed.

When \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}_T^m\) and \(\varvec{D}\) are fixed, then (7) w.r.t. \(\varvec{V}^m\) can be written as

$$\begin{aligned} \begin{aligned}&\mathcal {J}_{\varvec{V}^m} = \min _{\varvec{V}^m} \; \frac{1}{2}\Vert \varvec{W}_T^m - \varvec{W}_S\varvec{V}^m\Vert _F^2\\ \end{aligned} \end{aligned}$$
(14)

constrained by \((\varvec{V}^m)^T\varvec{V}^m = \varvec{I}\). Temporarily ignoring the constraint and setting the derivative of \(\mathcal {J}_{\varvec{V}^m}\) to zero yields

$$\begin{aligned} \begin{aligned}&\varvec{V}^m = \left( (\varvec{W}_S)^T\varvec{W}_S\right) ^{-1}\left( (\varvec{W}_S)^T\varvec{W}_T^m\right) \end{aligned} \end{aligned}$$
(15)

Then, performing Gram–Schmidt orthogonalization on \(\varvec{V}^m\) restores the constraint and generates the solution.

  • Solve \(\varvec{V}_T^m\) with \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{D}\) fixed.

When \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\) and \(\varvec{D}\) are fixed, then (7) w.r.t. \(\varvec{V}_T^m\) can be written as

$$\begin{aligned} \begin{aligned}&\mathcal {J}_{\varvec{V}_T^m} = \min _{\varvec{V}_T^m} \; \frac{\lambda _3}{2}\Vert \varvec{W}_T^m - \varvec{DV}_T^m\Vert _F^2 + \lambda _4\Vert \varvec{V}_T^m\Vert _{2,1} \end{aligned} \end{aligned}$$
(16)

For convenience of optimization, we introduce a diagonal matrix

$$\begin{aligned} \begin{aligned}&\varvec{S}_v = diag\left\{ \frac{1}{2\Vert \varvec{V}_{T(1:)}^m\Vert _2}, \cdot \cdot \cdot ,\frac{1}{2\Vert \varvec{V}_{T(r:)}^m\Vert _2}\right\} \end{aligned} \end{aligned}$$
(17)

into (16) and reformulate it as

$$\begin{aligned} \begin{aligned}&\mathcal {J}_{\varvec{V}_T^m} = \min _{\varvec{V}_T^m} \; \frac{\lambda _3}{2}\Vert \varvec{W}_T^m - \varvec{DV}_T^m\Vert _F^2 + \lambda _4tr\left( (\varvec{V}_T^m)^T\varvec{S}_v\varvec{V}_T^m\right) \end{aligned} \end{aligned}$$
(18)

Setting the derivative of \(\mathcal {J}_{\varvec{V}_T^m}\) w.r.t. \(\varvec{V}_T^m\) to zero yields

$$\begin{aligned} \begin{aligned}&\varvec{V}_T^m = (\lambda _3\varvec{D}^T\varvec{D}+2\lambda _4\varvec{S}_v)^{-1}(\lambda _3\varvec{D}^T\varvec{W}_T^m) \end{aligned} \end{aligned}$$
(19)

Since \(\varvec{S}_v\) depends on \(\varvec{V}_T^m\), the two are updated in an alternating manner within this subproblem.

  • Solve \(\varvec{D}\) with \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\) fixed.

When \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\) and \(\varvec{V}_T^m\) are fixed, then (7) w.r.t. \(\varvec{D}\) can be equivalently formulated as

$$\begin{aligned} \begin{aligned}&\mathcal {J}_{\varvec{D}} = \min _{\varvec{D}} \; \sum _{m=1}^M\frac{\lambda _3}{2}\Vert \varvec{W}_T^m - \varvec{DV}_T^m\Vert _F^2 \end{aligned} \end{aligned}$$
(20)

Setting the derivative of \(\mathcal {J}_{\varvec{D}}\) w.r.t. \(\varvec{D}\) to zero yields the closed-form analytical solution

$$\begin{aligned} \begin{aligned}&\varvec{D} = \left( \sum _{m=1}^M\varvec{W}_T^m(\varvec{V}_T^m)^T\right) \left( \sum _{m=1}^M\varvec{V}_T^m(\varvec{V}_T^m)^T\right) ^{-1} \end{aligned} \end{aligned}$$
(21)

By updating \(\varvec{W}_T^m\), \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\) and \(\varvec{D}\) alternately until convergence, we eventually obtain the solution. The complete optimization algorithm is summarized in Algorithm 1.

Algorithm 1 The alternating optimization algorithm for solving the OrUDA model (7)
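For readers who prefer code to pseudocode, the following is a minimal numpy sketch of one pass of the updates (9), (11), (13), (15), (19) and (21); it reuses the per-target container introduced above, omits the pseudo-centroid refresh and the adaptive \(\mu\), and is our own illustration rather than the reference implementation.

```python
import numpy as np

def oruda_one_pass(domains, W_S, X_S_mean, F_S, D, mu,
                   lam1, lam2, lam3, lam4, n_inner=3):
    """One pass of the alternating updates for all target domains and D."""
    d = W_S.shape[0]
    for dom in domains:
        X, X_mean = dom['X'], dom['X_mean']
        F, G, V, V_T = dom['F'], dom['G'], dom['V'], dom['V_T']

        # W_T^m update, closed form (9).
        A = (X @ X.T + lam1 * (1 - mu) * np.outer(X_mean, X_mean)
             + (lam2 + lam3) * np.eye(d))
        B = (X @ G @ F.T + lam1 * (1 - mu) * np.outer(X_mean, X_S_mean) @ W_S
             + lam2 * W_S @ V + lam3 * D @ V_T)
        W = np.linalg.solve(A, B)

        # F_T^m update, closed form (11); the right-hand factor is symmetric.
        K = G.shape[1]
        rhs = W.T @ X @ G + lam1 * mu * F_S
        F = np.linalg.solve(G.T @ G + lam1 * mu * np.eye(K), rhs.T).T

        # G_T^m update: assign each instance to its closest centroid, (13).
        proj = W.T @ X                                            # p x N
        dist = ((proj[:, :, None] - F[:, None, :]) ** 2).sum(0)   # N x K
        G = np.eye(K)[dist.argmin(axis=1)]

        # V^m update, (15), then orthonormalization (Gram-Schmidt via QR).
        V = np.linalg.solve(W_S.T @ W_S, W_S.T @ W)
        V, _ = np.linalg.qr(V)

        # V_T^m update, (19), alternated with the reweighting matrix (17).
        for _ in range(n_inner):
            row_norms = np.maximum(np.linalg.norm(V_T, axis=1), 1e-8)
            S_v = np.diag(1.0 / (2.0 * row_norms))
            V_T = np.linalg.solve(lam3 * D.T @ D + 2 * lam4 * S_v,
                                  lam3 * D.T @ W)

        dom.update(W=W, F=F, G=G, V=V, V_T=V_T)

    # Shared dictionary update, closed form (21).
    num = sum(dom['W'] @ dom['V_T'].T for dom in domains)
    den = sum(dom['V_T'] @ dom['V_T'].T for dom in domains)
    D = num @ np.linalg.inv(den)
    return domains, D
```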

3.4 Convergence analysis

Here, we analyze the convergence property of Algorithm 1. Specifically, denote by \(\mathcal {J}(\varvec{W}_T^{m(t)}, \varvec{F}_T^{m(t)}, \varvec{G}_T^{m(t)},\) \(\varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)})\) the objective value of (7) at the tth iteration. The objective is convex w.r.t. \(\varvec{W}_T^m\) when fixing \(\varvec{F}_T^m, \varvec{G}_T^m, \varvec{V}^m, \varvec{V}_T^m, \varvec{D}\). Therefore, after updating the solution of \(\varvec{W}_T^m\), it holds

$$\begin{aligned} \begin{aligned}&\mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t)}, \varvec{G}_T^{m(t)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \\&\quad \le \mathcal {J}(\varvec{W}_T^{m(t)}, \varvec{F}_T^{m(t)}, \varvec{G}_T^{m(t)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \end{aligned} \end{aligned}$$
(22)

The objective of (7) is also convex w.r.t. each of \(\varvec{F}_T^m\), \(\varvec{G}_T^m\), \(\varvec{V}^m\), \(\varvec{V}_T^m\) and \(\varvec{D}\) when all the other variables are fixed. As a result, the following inequalities hold

$$\begin{aligned}&\begin{aligned}&\mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \\&\quad \le \mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t)}, \varvec{G}_T^{m(t)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \end{aligned} \end{aligned}$$
(23)
$$\begin{aligned}&\begin{aligned}&\mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \\&\quad \le \mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \end{aligned} \end{aligned}$$
(24)
$$\begin{aligned}&\begin{aligned}&\mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t+1)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \\&\quad \le \mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \end{aligned} \end{aligned}$$
(25)
$$\begin{aligned}&\begin{aligned}&\mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t+1)}, \varvec{V}_T^{m(t+1)}, \varvec{D}^{(t)}) \\&\quad \le \mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t+1)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \end{aligned} \end{aligned}$$
(26)

and

$$\begin{aligned} \begin{aligned}&\mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t+1)}, \varvec{V}_T^{m(t+1)}, \varvec{D}^{(t+1)}) \\&\quad \le \mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t+1)}, \varvec{V}_T^{m(t+1)}, \varvec{D}^{(t)}) \end{aligned} \end{aligned}$$
(27)

Combining (22) through (27), it follows that

$$\begin{aligned} \begin{aligned}&\mathcal {J}(\varvec{W}_T^{m(t+1)}, \varvec{F}_T^{m(t+1)}, \varvec{G}_T^{m(t+1)}, \varvec{V}^{m(t+1)}, \varvec{V}_T^{m(t+1)}, \varvec{D}^{(t+1)}) \\&\quad \le \mathcal {J}(\varvec{W}_T^{m(t)}, \varvec{F}_T^{m(t)}, \varvec{G}_T^{m(t)}, \varvec{V}^{m(t)}, \varvec{V}_T^{m(t)}, \varvec{D}^{(t)}) \end{aligned} \end{aligned}$$
(28)

This verifies that the objective value descends monotonically as the iterations proceed. In addition, (7) is lower-bounded by zero since it is a nonnegative combination of norms, i.e. \(\Vert \cdot \Vert _F^2\), \(\Vert \cdot \Vert _2^2\) and \(\Vert \cdot \Vert _{2,1}\). As a result, we conclude that the objective function of (7), solved by Algorithm 1, converges as the iterations increase.

3.5 Time complexity analysis

The time cost of Algorithm 1 mainly lies in updating the variables. More specifically, the cost of calculating the solution of \(\varvec{W}_T^m\) in line 3 is \(\mathcal {O}(d^3+d^2p)\), and the cost of updating \(\varvec{F}_T^m\) in line 4 is \(\mathcal {O}({K_T^m}^3+{K_T^m}^2p)\). In lines 6 and 8, calculating \(\varvec{V}^m\) and \(\varvec{V}_T^m\) costs \(\mathcal {O}(d^3+p^2d)\) and \(\mathcal {O}(r^3+rdp)\), respectively. As for solving the dictionary \(\varvec{D}\) in line 11, the cost is \(\mathcal {O}(Mr^3+Mdpr)\). Usually, it holds that \(d \ge p \ge r\). Assuming the algorithm converges in L iterations and taking all the costs into account, the total time complexity of Algorithm 1 is \(\mathcal {O}\left( LMdpr+Ld^3+L(K_T^m)^2p+L(K_T^m)^3\right)\).

4 Experiment

In this section, we conduct experiments to evaluate the proposed method. Firstly, we introduce the settings and datasets used for the evaluations. Secondly, we report the comparison with related methods, together with a hypothesis test on the performance and an ablation study. Finally, we evaluate the convergence efficiency of the proposed algorithm.

4.1 Dataset and setting


Artificial dataset In order to verify the motivation of the proposed method, we construct an artificial dataset with a known structure. As shown in Table 2, the artificial dataset consists of one source domain and two target domains, each with two classes. We fix the covariance matrix and randomly generate twenty Gaussian-distributed samples per class around the given class centers. It is worth noting that, compared with the source domain, the class centers of target domain 2 are closer to those of target domain 1, which is designed to demonstrate the feasibility of target knowledge transfer.

Table 2 Statistics of the benchmarks
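To make the construction concrete, a sketch of how such a toy 1SmT setting can be generated is given below; the class centers and covariance are placeholders of our own choosing, not the exact values summarized in Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = 0.3 * np.eye(2)          # fixed covariance shared by all classes (placeholder)

def make_domain(centers, n_per_class=20):
    """Sample n_per_class Gaussian points around each given class center."""
    X = np.vstack([rng.multivariate_normal(c, cov, n_per_class) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return X, y

# Placeholder centers: target 2 is deliberately closer to target 1 than to the source.
source  = make_domain([(0.0, 0.0), (3.0, 0.0)])
target1 = make_domain([(1.0, 1.5), (4.0, 1.5)])
target2 = make_domain([(1.5, 2.5), (4.5, 2.5)])
```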

Real dataset We evaluate on two types of ordinal image datasets: a character dataset, i.e. Chars74k [47], and face aging datasets, i.e. AgeDB [48], Morph (album 2) [49] and CACD [50]. Chars74k consists of over 100,000 character images in three modalities, i.e. Img, Hnd and Fnt, as shown in Fig. 2. We uniformly resize the images to \(32\times 32\), extract normalized HOG features from them, and use the resulting 288-dimensional vectors as the feature representation. The AgeDB, Morph and CACD face datasets respectively contain about 16,000, 55,000 and 160,000 face images with age annotations, as illustrated in Fig. 3. We extract their normalized BIF visual features and retain 95% of the components for evaluation.
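For reference, a rough sketch of the Chars74k preprocessing is given below using scikit-image; the specific HOG parameterization (8 orientations, 8\(\times\)8-pixel cells, 2\(\times\)2-cell blocks), which happens to yield a 288-dimensional descriptor for a 32\(\times\)32 image, is our assumption and may differ from the configuration actually used.

```python
import numpy as np
from skimage.transform import resize
from skimage.feature import hog

def char_features(image):
    """Resize a grayscale character image to 32x32 and extract a 288-d HOG vector."""
    img = resize(image, (32, 32), anti_aliasing=True)
    feat = hog(img, orientations=8, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys', feature_vector=True)
    return feat / (np.linalg.norm(feat) + 1e-12)   # normalization
```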

Fig. 2

Image examples of Img, Hnd, Fnt from the Chars74k dataset

Fig. 3

Image examples of the AgeDB, Morph and CACD datasets


Setting To make extensive evaluations, we conduct comparisons with the most related 1SmT UDA method PA-1SmT, as well as other related 1S1T UDA methods, i.e. STC [51], TSC [52], TFSC [53], CMMS [21] and SLSA [22]. For fairness of comparison, the source modules of these methods are trained in a supervised manner while the target modules are trained without supervision. The hyperparameters \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) and \(\lambda _4\) are searched in the range (1e−3, 1e−2, 1e−1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6), the number p of source domain projection directions in KDLOR is selected from [5, 10, 15, \(\ldots\), 100], and the dimension r of the dictionary from [5, 10, 15, \(\ldots\), p], all through five-fold cross-validation. The parameters of the compared methods are also tuned by cross-validation following their original literature. To comprehensively evaluate the performance, we adopt the Normalized Mutual Information (NMI) and Rand Index (RI) [27], as well as the Mean Absolute Error (MAE) [18], as performance measures. To mitigate the randomness of the results, we run the evaluations ten times and report the average results.
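For clarity, the three measures can be computed as in the sketch below (scikit-learn \(\ge\) 0.24 provides rand_score); the MAE line assumes the predictions have already been mapped to ordinal group indices, which is a simplification of the full evaluation protocol.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, rand_score

def evaluate(y_true, y_pred):
    """NMI and RI treat y_pred as a clustering; MAE assumes ordinal group indices."""
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ri = rand_score(y_true, y_pred)
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    return nmi, ri, mae
```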

4.2 Results and analysis


Artificial dataset recognition For comparison, we construct two 1S1T tasks, “source \(\rightarrow\) target1” and “source \(\rightarrow\) target2”, and one 1SmT task, “source \(\rightarrow\) target1, target2”. The data distributions and classification boundaries of the three tasks are shown in Fig. 4 (the classification boundary of each domain is marked by the dashed line in the corresponding color). We can find that the classification result on target2 is worse than that on target1 in the 1S1T tasks, while the performance improves in the 1SmT task, which is consistent with our expectation. In fact, in the process of 1SmT UDA, target1, which is closer to the source, can be seen as an intermediate domain between the source and target2. Moreover, the dictionary learning can be regarded as a bias term in the linear space of this artificial dataset, so that the discriminative information of target1 is utilized by target2.

Fig. 4

The distribution and classification bound of artificial datasets


Ordinal character recognition We conduct the ordinal character recognition evaluation on the Chars74k dataset. Specifically, we randomly choose one modality from Img, Fnt and Hnd as the source domain and take the rest as target domains. The results are shown in Tables 3 and 4 (best in bold, second-best underlined).

Table 3 Character recognition NMI results on Chars74k
Table 4 Character recognition RI results on Chars74k

We can observe the following findings. On the one hand, the proposed OrUDA model generates the best results in terms of both the NMI and RI measures, with clear performance improvements. Moreover, even in the 1S1T setting, OrUDA still beats the other methods. This indicates that transferring both the implicit source model knowledge and the explicit distribution information, as well as the inter-target relations, effectively benefits the target domain learning. On the other hand, the extent of improvement differs across cases, which reflects the divergence between the target domains and verifies the rationality of modeling target-specific transfer and relation matrices in OrUDA.


Human age estimation We also perform human age estimation in a cross-dataset setting. Specifically, we take one of AgeDB, Morph and CACD as the source dataset and the other two as target datasets. For the sake of domain knowledge transfer, we select their common age range of 16 to 62 years and divide it into several groups, i.e. 16–20, 21–25, \(\ldots\), 56–60, 61–62, for age group estimation. The results averaged over ten random runs are shown in Table 5.
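As a concrete illustration of the grouping (the bin edges below simply follow the listed five-year ranges and are not taken from the original code):

```python
import numpy as np

# Age groups 16-20, 21-25, ..., 56-60, 61-62 mapped to indices 0..9.
bin_edges = np.arange(21, 62, 5)           # 21, 26, ..., 56, 61
ages = np.array([16, 20, 21, 37, 60, 62])
groups = np.digitize(ages, bin_edges)      # -> [0, 0, 1, 4, 8, 9]
```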

Table 5 Age group estimation MAE results on the AgeDB, Morph and CACD datasets

We observe that, in both the 1S1T and 1SmT settings, the proposed OrUDA model generates the best age estimation results compared to the related methods. This demonstrates the effectiveness of the proposed model and its superiority over the compared models.

In order to assess the significance of the performance improvement achieved by OrUDA, we perform a hypothesis test [54] on the results in Tables 3 to 5. The test results are shown in Fig. 5. We can observe that the proposed OrUDA method (i.e. OURS) achieves a quite clear performance improvement over the others.

Fig. 5

Hypothesis test (Friedman Test) among the compared methods

4.3 Ablation study

In order to explore the effectiveness of the modules of the proposed model (objective function), we additionally perform an ablation study. Specifically, we evaluate respectively the efficacy of the ordinal projection, the implicit knowledge transfer, the explicit knowledge transfer and the target knowledge transfer in (7). As shown in Table 6, each of the four modules in OrUDA is significant, especially the explicit knowledge transfer. Moreover, although the target knowledge transfer does not improve the model as much as the knowledge transferred from the labeled source domain in most tasks, owing to the absence of supervised information, transferring supervised knowledge alone is far from sufficient for a UDA model. In fact, various relationships among domains, such as the relations between target domains or the ordinal relations among samples, are crucial to domain adaptation and act as a complement to the source domain knowledge. The results of the ablation study support this hypothesis and explain why our model performs better than the other UDA models.

Table 6 Ablation estimation MAE results on the AgeDB, Morph and CACD datasets

4.4 Convergence evaluation

We also empirically evaluate the convergence efficiency of Algorithm 1. Without loss of generality, we conduct the analysis experiments with the aforementioned setting on the three face aging datasets and report the convergence results in Fig. 6. We can observe from the results that the algorithm converges efficiently within about 15 iterations.

Fig. 6

Convergence efficiency of Algorithm 1 on the AgeDB, Morph and CACD datasets

4.5 Parameter sensitivity analysis

To assess the parameters of the proposed model, we perform a parameter sensitivity analysis for OrUDA on the face datasets. Specifically, we vary the parameter of interest while fixing all the others. The evaluation results are shown in Fig. 7. We can observe the following findings. On the one hand, although OrUDA is sensitive to \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\), the performance changes with clear trends. In summary, the best performance is achieved when \(1e5< \lambda _1 < 1e7\), \(\lambda _2 < 1e1\) and \(\lambda _3 < 1e1\), regardless of which face dataset the source sub-model is trained on. On the other hand, the performance is largely insensitive to \(\lambda _4\), which can therefore be fixed in practical applications.

Fig. 7

Parameter sensitivity results on the AgeDB, Morph and CACD datasets

5 Conclusion

In this work, we proposed an ordinal unsupervised domain adaptation model, i.e. OrUDA, which transfers knowledge from both the implicit model parameters and the explicit cross-domain data distributions, as well as the relations between the target domains. With this kind of model, knowledge from both the source and the other target domains is exploited to train the concerned target models. In addition, we designed an alternating optimization algorithm to solve the OrUDA model and provided a theoretical convergence proof. Finally, we experimentally evaluated the performance of the proposed method and the sensitivity of its parameters, and showed that this modeling approach can effectively handle UDA problems in ordinal and 1SmT scenarios. Compared with the related existing UDA methods, the proposed OrUDA outperforms the others thanks to the utilization of the ordinal prior and the related information in the other target domains. Actually, more priors could be taken into consideration, such as sparsity, low rank and so on. Hence, in the future, we will consider generalizing the proposed method by exploring more prior knowledge [55] and extending it to deep network architectures.