1 Introduction

The fields of machine learning [1] and pattern recognition have been applied widely and successfully to many practical problems, in which patterns extracted from training data are used to predict future results [2, 3]. Traditional machine learning methodologies assume that the training and test data come from the same domain, such that the input feature space and the data distribution are identical. The performance of a predictive classifier degrades when the distributions of the training and test data differ. In some scenarios, obtaining training data that matches the feature space and distribution of the test data is laborious and costly. Therefore, adaptive classifiers for a target domain must be built from related domains. This objective is the motivation of transfer learning.

Transfer learning is used to solve the problem in one domain (i.e., the target domain) by using information from a related domain (i.e., the source domain). Domain adaptation is a subtopic of transfer learning that transfers knowledge from a labeled source domain to an unlabeled target domain by learning domain-invariant and label-discriminative representations that capture the similarities between domains despite significant differences. To date, domain adaptation has been successfully applied in various fields, such as text sentiment classification [4, 5], image classification [6,7,8], human activity classification [9], and multi-language text classification [10]. Domain divergence remains the major obstacle to adapting predictive models across domains.

The main problem of domain adaptation is the reduction of distribution divergence between domains. To this end, existing approaches can be categorized into four main groups [2, 3]: (a) instance-based adaptation, which reweights samples in the source domain or in both domains to reduce domain discrepancy [11, 12], (b) feature representation-based adaptation, which learns feature representations to minimize domain shift, learning task error, or both [13, 14], (c) classifier-based adaptation, which aims to learn a new model that minimizes the generalization error in the target domain via training data from both domains [15, 16], and (d) hybrid knowledge-based adaptation, which transfers more than one kind of knowledge, such as joint instance and feature representation-based adaptation [17,18,19,20], joint instance and classifier-based adaptation [21, 22], or joint feature representation and classifier-based adaptation [23,24,25].

Among the abovementioned classical approaches, hybrid methods reduce the cross-domain discrepancy better than single-knowledge methods. Most existing hybrid methods follow a two-step procedure: instance reweighting or feature representation learning is first performed independently, and the cross-domain classifier is then trained separately. Such methods do not perform well in practical applications because many factors cannot be considered. For example, some methods are strongly influenced by feature representations or irrelevant instances, some ignore the importance of evaluating data distributions, and some fail to exploit the hidden knowledge structure in the data labels of the source and target domains. Therefore, a new hybrid method for robust unsupervised domain adaptation needs to be developed. Knowledge that can be successfully transferred across domains should be (1) invariant to feature representations and unbiased toward irrelevant instances, (2) quantitatively estimated in terms of the importance of distributions, and (3) able to exploit the potential manifold structure behind the data.

As far as we know, no research has addressed all three challenges together in a unified learning machine for unsupervised domain adaptation. In this paper, we take on this challenge and propose a new Lie Group Manifold Analysis (LGMA) method based on Fisher's linear discriminant analysis (FLDA) [26], which learns a domain-invariant and label-discriminative classifier in the Lie algebra manifold space by extracting invariant representations, estimating unbiased instance weights, and performing weighted distribution alignment and graph Laplacian regularization, which jointly minimize the cross-domain distribution discrepancy. To the best of our knowledge, LGMA is the first attempt to minimize the cross-domain discrepancy in a Lie algebra manifold space for domain adaptation. Extensive experiments on five real-world benchmark datasets validate that LGMA outperforms competitive state-of-the-art methods.

The rest of the paper is organized as follows. Section 2 introduces related works of domain adaptation. Section 3 presents the LGMA algorithm based on Lie algebra transformation. Section 4 provides experiments to illustrate the effectiveness and efficiency of the proposed method. Section 5 draws the conclusions of this paper.

2 Related work

According to a recent survey [2], existing domain adaptation methods can be roughly divided into four categories: instance-based, feature representation-based, classifier-based, and hybrid knowledge-based adaptations.

Instance-based adaptation methods aim to minimize the cross-domain distribution discrepancy by reweighting the source samples according to the related samples in the target domain. Baktashmotlagh et al. [27] introduced a sample selection method and a subspace-based method that use the structure of the Riemannian manifold to compare the source and target distributions. Transfer component analysis (TCA) [28] learns transfer components across domains in a reproducing kernel Hilbert space (RKHS) using the maximum mean discrepancy (MMD) [29].

Feature representation-based adaptation methods aim to reduce distribution differences by learning a new feature representation. Fernando et al. [30] proposed a subspace alignment (SA) algorithm that learns a mapping function to align the source subspace with the target one. Geodesic flow kernel (GFK) [31] extends the idea of sampling points on a manifold [32] and provides a method for learning the geodesic flow kernel between domains. Generalized unsupervised manifold alignment (GUMA) [33] builds connections between domains without any known correspondences by using manifold alignment. Low-rank transfer subspace learning (LTSL) [34] is a framework that solves the transfer learning problem through subspace learning and low-rank representation constraints. Zhai et al. [35] proposed a manifold alignment method that learns the underlying common manifold with supervision from corresponding data pairs of different observation sets.

Classifier-based adaptation methods aim to learn a new domain-invariant classifier that minimizes the generalization error in the target domain via training data from both domains. The works of distribution matching machine (DMM) [36] and adaptation regularization transfer learning (ARTL) [6] aim to learn a unified domain-invariant classifier based on structural risk minimization (SRM) [37].

Hybrid knowledge-based adaptation methods aim to learn domain-invariant knowledge by jointly utilizing multiple kinds of adaptations. Locality preserving joint transfer (LPJT) [19], domain invariant and class discriminative feature learning (DICD) [17], and transfer independently together (TIT) [20] jointly leverage instance-based and feature representation-based adaptations to learn domain-invariant and label-discriminative vector representations. Qin et al. [21] proposed a novel generatively inferential co-training (GICT) framework based on instance-based and classifier-based adaptations. In [25], three unsupervised transfer learning methods, i.e., discriminative subspace learning (DSL), joint geometrical and statistical distribution adaptation (GSDA), and joint subspace and distribution adaptation (DSL-GSDA) are proposed to transfer the common domain-invariant knowledge from the source domain to the target domain by jointly adapting feature representation and classifier.

In general, single adaptation methods explore instance reweighting, feature representation, or classifier learning independently and are ineffective when the domain difference is substantially large. Hybrid knowledge-based adaptation methods perform better than single adaptation methods when the domain differences are large, when some outlier source instances are unrelated to the target domain, or when both conditions hold. Almost all existing adaptation methods for image classification tasks proceed by linearizing the images, which makes an implicit Euclidean space assumption [38, 39]. However, when the domain divergence is extremely large, the classification performance of adaptation methods based on the Euclidean space assumption degrades significantly. In general, most of the transformations used in image classification tasks have a matrix Lie group structure. Thus, we first devise a nonlinear transformation to project samples from the original Lie group manifold space onto a corresponding Lie algebra manifold space, where the samples are more discriminative and can be classified more easily. We then perform hybrid knowledge-based adaptation to further minimize the discrepancy between domains for higher cross-domain classification accuracy.

The most similar approaches to the proposed hybrid LGMA method are scatter component analysis (SCA) [40] and joint geometrical and statistical alignment (JGSA) [41]. However, LGMA differs significantly from SCA and JGSA in two key aspects: (a) LGMA jointly learns the invariant cross-domain classifier and the transferable knowledge (invariant to feature representations) in a unified learning paradigm in a linear Lie algebra manifold space, whereas SCA and JGSA learn the transferable knowledge and transfer classifier in a nonlinear Lie group manifold space (reproducing kernel Hilbert space). (b) LGMA learns unbiased instance weights, remaining unbiased toward irrelevant instances, not only by using the domain scatters but also by exploiting the weighted distribution alignment and the graph Laplacian regularization, whereas SCA and JGSA learn reweighting only through scatters or unweighted distribution alignment. In summary, the proposed LGMA approach jointly learns the cross-domain classifier and transferable knowledge with statistical and geometrical guarantees.

3 LGMA

In this section, we provide the LGMA approach in detail.

3.1 Problem definition

We begin with the formalized definition of domain adaptation [6, 42, 43]. For clarity, the frequently used notations are summarized in Table 1.

Table 1 Notations and corresponding descriptions used in this paper

Definition 1

(Domain adaptation). Given a labeled source domain \( \mathcal {D}_{\mathit {s}} = \{x_{\mathit {s_{i}}},\mathit {y_{\mathit {s_{i}}}}\}_{i=1}^{n} \) and an unlabeled target domain \( \mathcal {D}_{\mathit {t}}= \{x_{\mathit {t_{j}}}\}_{j=n+1}^{n+m} \), we assume that the feature spaces are identical, \( \mathcal {X}_{\mathit {s}} = \mathcal {X}_{\mathit {t}} \), and that the label spaces are identical, \( \mathcal {Y}_{\mathit {s}} = \mathcal {Y}_{\mathit {t}} \). However, the marginal probability distributions differ, Ps(xs) ≠ Pt(xt), as do the conditional probability distributions, Qs(ys|xs) ≠ Qt(yt|xt). The purpose of unsupervised domain adaptation is to learn a classifier \( f:x_{\mathit {t}} \mapsto y_{\mathit {t}}, y_{\mathit {t}} \in \mathcal {Y}_{\mathit {t}} \) that classifies the samples of the target domain \( \mathcal {D}_{\mathit {t}}\) using the label information of the related source domain \( \mathcal {D}_{\mathit {s}}\). The data in the source and target domains are denoted as \( \mathrm {X}_{\mathit {s}} \in \mathbb {R}^{D\times n}\) and \( \mathrm {X}_{\mathit {t}} \in \mathbb {R}^{D\times m}\), respectively.

Classical Fisher’s linear discriminant analysis (FLDA) [26] can be represented as

$$ \underset{v}{\mathrm{arg max}} J(v)= \frac{v^{\mathrm{T}}\mathrm{S}_{b}v}{v^{\mathrm{T}}\mathrm{S}_{w}v} $$
(1)

where Sb and Sw are the between-class and within-class scatter matrices, respectively. Maximizing the FLDA objective increases the separation of samples with respect to their class clusters. However, classification accuracy is degraded by the different distributions of \( \mathcal {D}_{\mathit {s}}\) and \( \mathcal {D}_{\mathit {t}}\). Thus, minimizing the domain distribution discrepancy is essential to improve classification performance when learning a cross-domain classifier f.
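For reference, the classical FLDA objective (1) reduces to a generalized eigenproblem \( \mathrm{S}_{b}v = \lambda\mathrm{S}_{w}v \). The following minimal sketch (with illustrative names, not the authors' code) builds the two scatter matrices from labeled samples and solves for the leading discriminant directions with SciPy:

```python
import numpy as np
from scipy.linalg import eigh

def flda(X, y, k):
    """Classical FLDA (1): maximize v^T S_b v / v^T S_w v.

    X: (D, n) samples as columns, y: (n,) integer labels,
    k: number of discriminant directions to keep."""
    D, _ = X.shape
    mean_all = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((D, D))
    S_w = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)
        S_b += Xc.shape[1] * (mc - mean_all) @ (mc - mean_all).T
        S_w += (Xc - mc) @ (Xc - mc).T
    # Generalized eigenproblem S_b v = lambda S_w v; a small ridge keeps S_w invertible.
    w, V = eigh(S_b, S_w + 1e-6 * np.eye(D))
    return V[:, np.argsort(w)[::-1][:k]]   # (D, k) projection matrix
```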

3.2 Main idea

LGMA mainly includes three steps. First, LGMA performs the Lie algebra transformation to project features in the Lie group manifold space onto the corresponding Lie algebra manifold space. Second, LGMA finds a paired transformation (i.e., A for the source domain and B for the target domain) to obtain new representations of the respective domains. Third, LGMA performs weighted distribution alignment and manifold alignment to learn a cross-domain invariant classifier in the linear Lie algebra space. Figure 1 shows the main idea of the proposed LGMA method.

Fig. 1

The main idea of LGMA. (a) Features in the Lie group manifold space are mapped to the Lie algebra manifold space (a projection point is the intersection of the black bold curve and the geodesic etv). Data with similar manifold properties are aggregated together after the Lie algebra transformation. (b) LGMA finds a paired transformation (one for the source domain and another for the target domain) to obtain new representations of the respective domains. (c) Weighted distribution alignment and manifold alignment are performed in the Lie algebra manifold space to learn the cross-domain invariant classifier f

We first obtain the transformed features by means of the Lie algebra transformation. Then, on the basis of FLDA, weighted distribution alignment, and manifold alignment [41], the domain-invariant classifier f can be represented as

$$ \underset{\mathrm{A},\mathrm{B}}{\max}\frac{\alpha\mathit{S}_{f}(\mathcal{D}_{t})+\beta\mathit{S}_{bf}(\mathcal{D}_{s})} {\bar{D}_{f}(\mathcal{D}_{s},\mathcal{D}_{t}) + \delta\mathit{R}_{f}(\mathcal{D}_{s},\mathcal{\!D}_{t}) + \lambda\mathit{D}_{f}(S_{A},S_{B}) + \beta\mathit{S}_{wf}(\mathcal{D}_{s})} $$
(2)

where the terms Sf(⋅), Sbf(⋅), \( \bar {D}_{f}(\cdot ,\cdot ) \), Rf(⋅,⋅), Df(⋅,⋅), and Swf(⋅) represent the target domain variance, the between-class variance, the weighted distribution alignment, the graph Laplacian regularization, the subspace divergence, and the within-class variance, respectively. α, β, δ, and λ are the regularization parameters.

3.3 Lie algebra transformation

The Lie algebra transformation serves as the preprocessing step: it finds a geodesic on the Lie group manifold and projects all features onto this geodesic, after which weighted distribution alignment and manifold alignment are performed to maximize the ratio in (2).

Before the Lie algebra transformation is introduced, we first give the definitions of the Lie group and the Lie algebra [44, 45].

Definition 2

(Lie group). A real Lie group [44] is a group that is also a finite-dimensional real smooth manifold, in which the group operations of multiplication and inversion are smooth maps. Smoothness of the group multiplication \( \mu :G\times G\rightarrow G,\ \mu (x,y)=xy \) means that μ is a smooth mapping of the product manifold G × G into G. These two requirements can be combined into the single requirement that the mapping \( (x,y)\mapsto x^{-1}y \) be a smooth mapping of the product manifold into G.

Definition 3

(Lie algebra). A Lie algebra [45] is a vector space \( \mathfrak {g} \) over some field \( \mathbb {F} \) together with a binary operation \( [\cdot ,\cdot ]:\mathfrak {g}\times \mathfrak {g}\rightarrow \mathfrak {g}\) called the Lie bracket that satisfies the following axioms:

  • Bilinearity: [ax + by,z] = a[x,z] + b[y,z],[z,ax + by] = a[z,x] + b[z,y] for all scalars a, b in \( \mathbb {F} \) and all elements x, y, z in \( \mathfrak {g} \).

  • Alternativity: [x,x] = 0 for all x in \( \mathfrak {g} \).

  • The Jacobi identity: [x,[y,z]] + [z,[x,y]] + [y,[z,x]] = 0 for all x, y, z in \( \mathfrak {g} \).

  • Anticommutativity: [x,y] = −[y,x] for all x, y in \( \mathfrak {g} \).
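For the matrix Lie algebras considered later, the Lie bracket is simply the commutator [X, Y] = XY − YX (bilinearity then follows from the linearity of matrix multiplication). The short check below is our own illustration, not part of LGMA; it verifies the remaining axioms numerically on random matrices:

```python
import numpy as np

def bracket(X, Y):
    """Matrix commutator: the Lie bracket of a matrix Lie algebra."""
    return X @ Y - Y @ X

rng = np.random.default_rng(0)
X, Y, Z = (rng.standard_normal((3, 3)) for _ in range(3))

assert np.allclose(bracket(X, Y), -bracket(Y, X))   # anticommutativity
assert np.allclose(bracket(X, X), 0)                # alternativity
jacobi = bracket(X, bracket(Y, Z)) + bracket(Z, bracket(X, Y)) + bracket(Y, bracket(Z, X))
assert np.allclose(jacobi, 0)                       # Jacobi identity
```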

The exponential and logarithmic transformations [45] are important tools in Lie group theory. The exponential transformation is defined as

$$ \exp:\mathfrak{g}\rightarrow G, \ \ \exp(x)=\sum\limits_{i=0}^{\infty }\frac{x^{i}}{i!} $$
(3)

Elements in Lie algebra manifold space can be transformed into Lie group manifold space through this transformation. Similarly, logarithmic transformation can also be represented as

$$ \log:G\rightarrow \mathfrak{g}, \ \ \log(x)=\sum\limits_{i=1}^{\infty }\frac{(-1)^{i-1}}{i}(x-e)^{i} $$
(4)

Features in Lie group manifold space can be transformed into Lie algebra manifold space through this transformation.

We denote g(⋅) as the Lie algebra transformation. Thus, the feature in the Lie group manifold space can be transformed into Lie algebra manifold space through z = g(x).
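As a concrete illustration of z = g(x), the sketch below uses the matrix logarithm and exponential from SciPy to move between a matrix Lie group and its Lie algebra, assuming each feature is represented as (or reshaped into) a group element. This is a simplified stand-in under that assumption, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import expm, logm

def lie_algebra_transform(G):
    """Map a group element G (e.g., a rotation matrix) to the Lie algebra, log: G -> g (Eq. 4)."""
    return np.real(logm(G))

def lie_group_transform(g):
    """Inverse map exp: g -> G (Eq. 3)."""
    return expm(g)

# Round-trip sanity check on a small rotation matrix.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z = lie_algebra_transform(R)                 # skew-symmetric element of so(2)
assert np.allclose(lie_group_transform(z), R)
```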

3.4 Target domain variance maximization

The variance of the target domain can be maximized in the corresponding subspace to avoid projecting features onto some irrelevant dimensions. Therefore, the variance maximization term can be generalized as

$$ \underset{\mathrm{B}}{\max} \mathit{S}_{f}(\mathcal{D}_{t})=\underset{\mathrm{B}}{\max} \text{tr}(\mathrm{B}^{\mathrm{T}}\mathrm{S}_{t}\mathrm{B}) $$
(5)

where tr(⋅) denotes the trace of a matrix and

$$ \mathrm{S}_{t} = \mathrm{Z}_{t}\mathrm{H}_{t}\mathrm{Z}_{t}^{\mathrm{T}} $$
(6)

is the scatter matrix of the target domain, Zt is the set of projected target samples, \( \mathrm {H}_{t} = \mathrm {I}_{t} - \frac {1}{m}\mathrm {1}_{t}\mathrm {1}_{t}^{\mathrm {T}} \) is the centering matrix, and \( \mathrm {1}_{t} \in \mathbb {R}^{m} \) is the column vector with all elements equal to 1.

3.5 Source domain discriminative feature preservation

We use the rich label information in the source domain to make the new representation of samples in the source domain discriminative as follows:

$$ \underset{\mathrm{A}}{\max} \mathit{S}_{bf}(\mathcal{D}_{s})=\underset{\mathrm{A}}{\max} \text{tr}\left( \mathrm{A}^{\mathrm{T}}\mathrm{S}_{b}\mathrm{A}\right) $$
(7)
$$ \underset{\mathrm{A}}{\min} \mathit{S}_{wf}(\mathcal{D}_{s})=\underset{\mathrm{A}}{\min} \text{tr}\left( \mathrm{A}^{\mathrm{T}}\mathrm{S}_{w}\mathrm{A}\right) $$
(8)

where Sb and Sw are the between-class and within-class scatter matrices, respectively, and are defined as follows:

$$ \mathrm{S}_{w} = \sum\limits_{c = 1}^{C}\mathrm{Z}_{\mathit{s}}^{\mathit{(c)}}\mathrm{H}_{\mathit{s}}^{\mathit{(c)}}\left( \mathrm{Z}_{\mathit{s}}^{\mathit{(c)}}\right)^{\mathrm{T}} $$
(9)
$$ \mathrm{S}_{b} = \sum\limits_{c = 1}^{C}n^{(c)}\left( m_{\mathit{s}}^{(c)}-\bar{m}_{\mathit{s}}\right)\left( m_{\mathit{s}}^{(c)}-\bar{m}_{\mathit{s}}\right)^{\mathrm{T}} $$
(10)

where \( \mathrm {Z}_{\mathit {s}}^{(c)} \) indicates the set of transformed source samples that belong to class c, \( m_{\mathit {s}}^{(c)} = \frac {1}{n^{(c)}}{\sum }_{i = 1}^{n^{(c)}}z_{s_{i}}^{(c)} \), \( \bar {m}_{\mathit {s}}=\frac {1}{n}{\sum }_{i=1}^{n}z_{s_{i}} \), and \( \mathrm {H}_{\mathit {s}}^{\mathit {(c)}}=\mathrm {I}_{\mathit {s}}^{\mathit {(c)}}-\frac {1}{n^{(c)}}\mathrm {1}_{\mathit {s}}^{\mathit {(c)}}\left (\mathrm {1}_{\mathit {s}}^{\mathit {(c)}}\right )^{\mathrm {T}} \) is the centering matrix of samples within class c, \( \mathrm {I}_{\mathit {s}}^{\mathit {(c)}} \in \mathbb {R}^{n^{(c)}\times {n^{(c)}}} \) is the identity matrix, \( \mathrm {1}_{\mathit {s}} \in \mathbb {R}^{n^{(c)}} \) is a column vector with all ones, and n(c) is the number of source samples in class c.
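The scatter matrices in (6), (9), and (10) can be computed directly from the transformed samples. A minimal sketch, with illustrative names and the same column-wise data layout as above, is given below:

```python
import numpy as np

def target_scatter(Zt):
    """S_t = Z_t H_t Z_t^T with the centering matrix H_t (Eq. 6). Zt: (d, m)."""
    m = Zt.shape[1]
    Ht = np.eye(m) - np.ones((m, m)) / m
    return Zt @ Ht @ Zt.T

def source_scatters(Zs, ys):
    """Within-class S_w (Eq. 9) and between-class S_b (Eq. 10). Zs: (d, n), ys: (n,)."""
    d, n = Zs.shape
    mean_all = Zs.mean(axis=1, keepdims=True)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(ys):
        Zc = Zs[:, ys == c]
        nc = Zc.shape[1]
        Hc = np.eye(nc) - np.ones((nc, nc)) / nc          # per-class centering matrix
        Sw += Zc @ Hc @ Zc.T
        mc = Zc.mean(axis=1, keepdims=True)
        Sb += nc * (mc - mean_all) @ (mc - mean_all).T
    return Sw, Sb
```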

3.6 Weighted distribution alignment

Weighted distribution alignment is devised to minimize the distribution divergence between the source and target domains by quantitatively assessing the importance of the marginal distribution (i.e., P) and the conditional distribution (i.e., Q). Formally, the weighted distribution alignment \( \bar {D}_{f}(\mathcal {D}_{s},\mathcal {D}_{t}) \) can be defined as follows:

$$ \bar{D}_{f}(\mathcal{D}_{s},\mathcal{D}_{t}) = (1-\mu)\mathit{D}(\mathit{P}_{s},\mathit{P}_{t})+\mu\mathit{D}(\mathit{Q}_{s},\mathit{Q}_{t}) $$
(11)

with μ ∈ [0,1] as the adaptive parameter. The projected MMD [6, 46, 47] can be adopted to compute the marginal and conditional distribution divergences by comparing the distance between the sample means of the two domains in a low-dimensional smooth manifold. The marginal distribution divergence D(Ps, Pt) can be written as

$$ \parallel\frac{1}{n}\sum\limits_{\mathrm{z}_{\mathit{s_{i}}} \in \mathrm{Z}_{\mathit{s}}}\mathrm{A}^{\mathrm{T}}\mathrm{z}_{\mathit{s_{i}}}-\frac{1}{m}\sum\limits_{\mathrm{z}_{\mathit{t_{j}}} \in \mathrm{Z}_{\mathit{t}}}\mathrm{B}^{\mathrm{T}}\mathrm{z}_{\mathit{t_{j}}}\parallel_{\mathrm{F}}^{\mathrm{2}} $$
(12)

Correspondingly, the conditional distribution divergence D(Qs, Qt) can be expressed as

$$ \sum\limits_{c=1}^{C}\parallel \frac{1}{n^{(c)}}\sum\limits_{\mathrm{z}_{\mathit{s_{i}}} \in \mathrm{Z}_{\mathit{s}}^{\mathit{(c)}}}\mathrm{A}^{\mathrm{T}}\mathrm{z}_{\mathit{s_{i}}}-\frac{1}{m^{(c)}}\sum\limits_{\mathrm{z}_{\mathit{t_{j}}} \in \mathrm{Z}_{\mathit{t}}^{\mathit{(c)}}}\mathrm{B}^{\mathrm{T}}\mathrm{z}_{\mathit{t_{j}}}\parallel_{\mathrm{F}}^{\mathrm{2}} $$
(13)

where \( \mathrm {Z}_{\mathit {s}}^{\mathit {(c)}} = \left \{ \mathrm {z}_{\mathit {s_{i}}}:\mathrm {z}_{\mathit {s_{i}}}\in \mathrm {Z}_{\mathit {s}}\wedge \mathit {y}(\mathrm {z}_{\mathit {s_{i}}}) = c \right \} \) is the set of projected source samples that belong to class c and \(\mathit {y}(\mathrm {z}_{\mathit {s_{i}}}) \) is the true label of \( \mathrm {z}_{\mathit {s_{i}}} \). \( \mathrm {Z}_{\mathit {t}}^{\mathit {(c)}} = \left \{ \mathrm {z}_{\mathit {t_{j}}}:\mathrm {z}_{\mathit {t_{j}}}\in \mathrm {Z}_{\mathit {t}}\wedge \hat {\mathit {y}}(\mathrm {z}_{\mathit {t_{j}}}) = c \right \} \) is the set of projected target samples that belong to class c, \( \hat {\mathit {y}}(\mathrm {z}_{\mathit {t_{j}}}) \) is the pseudo (predicted) label of \( \mathrm {z}_{\mathit {t_{j}}} \), and \( n^{(c)} = |\mathrm {Z}_{\mathit {s}}^{\mathit {(c)}}| \) and \( m^{(c)} = |\mathrm {Z}_{\mathit {t}}^{\mathit {(c)}}| \) are the numbers of samples of class c in the respective projected manifold spaces of the source and target domains. Evaluating the conditional distribution divergence D(Qs, Qt) is relatively difficult because no labeled data are available in the target domain. Long et al. [6] proposed utilizing the pseudo labels of the target domain, which are predicted by supervised approaches (e.g., KNN) trained on the source-domain data. The pseudo labels can be refined iteratively to minimize the difference in conditional distributions between the source and target domains. We follow this idea to further reduce the conditional MMD between domains.

Combining the marginal and conditional MMDs, the final weighted distribution alignment optimization can be stated in the following matrix form

$$ \underset{\mathrm{A},\mathrm{B}}{\min} \bar{D}_{f}(\mathcal{D}_{s},\mathcal{D}_{t})=\underset{\mathrm{A},\mathrm{B}}{\min} \text{tr} \left (\begin{bmatrix} \mathrm{A}^{\mathrm{T}} & \mathrm{B}^{\mathrm{T}} \end{bmatrix} \begin{bmatrix} \mathrm{M}_{\mathit{ss}} &\mathrm{M}_{\mathit{st}} \\ \mathrm{M}_{\mathit{ts}} & \mathrm{M}_{\mathit{tt}} \end{bmatrix} \begin{bmatrix} \mathrm{A} \\ \mathrm{B} \end{bmatrix}\right ) $$
(14)

where

$$ \begin{array}{ll} \mathrm{M}_{\mathit{ss}} &= \mathrm{Z}_{\mathit{s}}\left( (1-\mu)\mathrm{N}_{\mathit{ss}}+\mu{\sum}_{c=1}^{C}\mathrm{N}_{\mathit{ss}}^{\mathit{(c)}}\right)\mathrm{Z}_{\mathit{s}}^{\mathrm{T}}, \quad \mathrm{N}_{\mathit{ss}} = \frac{1}{n^{2}}\mathrm{1}_{\mathit{n}}\mathrm{1}_{\mathit{n}}^{\mathrm{T}},\\ \left( \mathrm{N}_{\mathit{ss}}^{\mathit{(c)}}\right)_{ij} &= \begin{cases} \frac{1}{(n^{(c)})^{2}}, & \mathrm{z}_{\mathit{i}},\mathrm{z}_{\mathit{j}}\in \mathrm{Z}_{\mathit{s}}^{(c)} \\ 0, & \text{otherwise} \end{cases} \end{array} $$
(15)
$$ \begin{array}{ll} \mathrm{M}_{\mathit{tt}} &= \mathrm{Z}_{\mathit{t}}\left( (1-\mu)\mathrm{N}_{\mathit{tt}}+\mu{\sum}_{c=1}^{C}\mathrm{N}_{\mathit{tt}}^{\mathit{(c)}}\right)\mathrm{Z}_{\mathit{t}}^{\mathrm{T}}, \quad \mathrm{N}_{\mathit{tt}} = \frac{1}{m^{2}}\mathrm{1}_{\mathit{m}}\mathrm{1}_{\mathit{m}}^{\mathrm{T}},\\ \left( \mathrm{N}_{\mathit{tt}}^{\mathit{(c)}}\right)_{ij} &= \begin{cases} \frac{1}{(m^{(c)})^{2}}, & \mathrm{z}_{\mathit{i}},\mathrm{z}_{\mathit{j}}\in \mathrm{Z}_{\mathit{t}}^{(c)} \\ 0, & \text{otherwise} \end{cases} \end{array} $$
(16)
$$ \begin{array}{ll} \mathrm{M}_{\mathit{st}} &= \mathrm{Z}_{\mathit{s}}\left( (1-\mu)\mathrm{N}_{\mathit{st}}+\mu{\sum}_{c=1}^{C}\mathrm{N}_{\mathit{st}}^{\mathit{(c)}}\right)\mathrm{Z}_{\mathit{t}}^{\mathrm{T}}, \quad \mathrm{N}_{\mathit{st}} = -\frac{1}{nm}\mathrm{1}_{\mathit{n}}\mathrm{1}_{\mathit{m}}^{\mathrm{T}},\\ \left( \mathrm{N}_{\mathit{st}}^{\mathit{(c)}}\right)_{ij} &= \begin{cases} -\frac{1}{n^{(c)}m^{(c)}}, & \mathrm{z}_{\mathit{i}}\in\mathrm{Z}_{\mathit{s}}^{(c)},\ \mathrm{z}_{\mathit{j}}\in \mathrm{Z}_{\mathit{t}}^{(c)} \\ 0, & \text{otherwise} \end{cases} \end{array} $$
(17)
$$ \begin{array}{ll} \mathrm{M}_{\mathit{ts}} &= \mathrm{Z}_{\mathit{t}}\left( (1-\mu)\mathrm{N}_{\mathit{ts}}+\mu{\sum}_{c=1}^{C}\mathrm{N}_{\mathit{ts}}^{\mathit{(c)}}\right)\mathrm{Z}_{\mathit{s}}^{\mathrm{T}}, \quad \mathrm{N}_{\mathit{ts}} = -\frac{1}{nm}\mathrm{1}_{\mathit{m}}\mathrm{1}_{\mathit{n}}^{\mathrm{T}},\\ \left( \mathrm{N}_{\mathit{ts}}^{\mathit{(c)}}\right)_{ij} &= \begin{cases} -\frac{1}{n^{(c)}m^{(c)}}, & \mathrm{z}_{\mathit{j}}\in\mathrm{Z}_{\mathit{s}}^{(c)},\ \mathrm{z}_{\mathit{i}}\in \mathrm{Z}_{\mathit{t}}^{(c)} \\ 0, & \text{otherwise} \end{cases} \end{array} $$
(18)
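Putting (15)-(18) together, the MMD blocks of (14) can be assembled from the projected samples and the (pseudo) labels. The following sketch assumes pseudo target labels are available (e.g., from a source-trained classifier, as discussed above) and uses illustrative names; it is not the authors' implementation:

```python
import numpy as np

def weighted_mmd_blocks(Zs, Zt, ys, yt_pseudo, mu, C):
    """Assemble M_ss, M_st, M_ts, M_tt of Eq. (14) from Eqs. (15)-(18).

    Zs: (d, n) projected source samples, Zt: (d, m) projected target samples,
    ys: (n,) source labels, yt_pseudo: (m,) pseudo target labels,
    mu: weight in [0, 1], C: number of classes (labels assumed in 0..C-1)."""
    n, m = Zs.shape[1], Zt.shape[1]
    # Marginal terms N_ss, N_tt, N_st.
    Nss = np.ones((n, n)) / n ** 2
    Ntt = np.ones((m, m)) / m ** 2
    Nst = -np.ones((n, m)) / (n * m)
    # Conditional (class-wise) terms N_ss^(c), N_tt^(c), N_st^(c).
    Nss_c, Ntt_c, Nst_c = np.zeros((n, n)), np.zeros((m, m)), np.zeros((n, m))
    for c in range(C):
        s_idx, t_idx = np.where(ys == c)[0], np.where(yt_pseudo == c)[0]
        nc, mc = len(s_idx), len(t_idx)
        if nc:
            Nss_c[np.ix_(s_idx, s_idx)] = 1.0 / nc ** 2
        if mc:
            Ntt_c[np.ix_(t_idx, t_idx)] = 1.0 / mc ** 2
        if nc and mc:
            Nst_c[np.ix_(s_idx, t_idx)] = -1.0 / (nc * mc)
    Mss = Zs @ ((1 - mu) * Nss + mu * Nss_c) @ Zs.T
    Mtt = Zt @ ((1 - mu) * Ntt + mu * Ntt_c) @ Zt.T
    Mst = Zs @ ((1 - mu) * Nst + mu * Nst_c) @ Zt.T
    Mts = Mst.T          # N_ts = N_st^T, so M_ts = M_st^T (Eq. 18)
    return Mss, Mst, Mts, Mtt
```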

3.7 Graph Laplacian regularization

In this section, we use graph Laplacian regularization to keep the classifier unbiased toward irrelevant instances.

In domain adaptation, both labeled and unlabeled data are used, and knowledge of the marginal distributions (i.e., Ps and Pt) can be further exploited to improve the performance of function learning because the unlabeled samples often reveal underlying properties of the target domain, such as sample variances. The manifold assumption [48] can be expressed as follows: if two points zi, zj ∈ \( \mathfrak {g} \) are close in the geometry of the marginal distributions Ps(zs) and Pt(zt), then the conditional distributions Qs(ys|zs) and Qt(yt|zt) are similar. Under the hypothesis of the smoothness of geodesics, Laplacian regularization can be used to further exploit the similar geometrical properties of nearest points in the Lie algebra manifold space \( \mathfrak {g} \). Thus, the graph Laplacian regularization \(\mathit {R}_{f}(\mathcal {D}_{s},\mathcal {D}_{t})\) can be computed as

$$ \underset{\mathrm{A},\mathrm{B}}{\min}\mathit{R}_{f}(\mathcal{D}_{s},\mathcal{D}_{t}) = \underset{\mathrm{A},\mathrm{B}}{\min} \text{tr} \left (\begin{bmatrix} \mathrm{A}^{\mathrm{T}} & \mathrm{0} \\ \mathrm{0} & \mathrm{B}^{\mathrm{T}}\end{bmatrix} \begin{bmatrix} \mathrm{Z}_{\mathit{s}}\mathrm{L}_{\mathit{ss}}\mathrm{Z}_{\mathit{s}}^{\mathrm{T}} & \mathrm{Z}_{\mathit{s}}\mathrm{L}_{\mathit{st}}\mathrm{Z}_{\mathit{t}}^{\mathrm{T}} \\ \mathrm{Z}_{\mathit{t}}\mathrm{L}_{\mathit{ts}}\mathrm{Z}_{\mathit{s}}^{\mathrm{T}} & \mathrm{Z}_{\mathit{t}}\mathrm{L}_{\mathit{tt}}\mathrm{Z}_{\mathit{t}}^{\mathrm{T}}\end{bmatrix} \begin{bmatrix} \mathrm{A} & \mathrm{0} \\ \mathrm{0} & \mathrm{B}\end{bmatrix} \right ) $$
(19)

where \( \mathrm{L} = \mathrm{I} - \mathrm{D}^{-1/2}\mathrm{W}\mathrm{D}^{-1/2} \) is the graph Laplacian matrix and D is a diagonal matrix whose i-th diagonal element is the sum of the i-th row of W, i.e., \( \mathrm{D}_{ii} = {\sum}_{j}\mathrm{W}_{ij} \). W is defined by

$$ \mathrm{W}_{ij} = \begin{cases} \cos(\mathrm{z}_{\mathit{i}},\mathrm{z}_{\mathit{j}}), & \mathrm{z}_{\mathit{i}} \in \mathcal{N}_{p}(\mathrm{z}_{\mathit{j}}) \vee \mathrm{z}_{\mathit{j}} \in \mathcal{N}_{p}(\mathrm{z}_{\mathit{i}})\\ 0, & \mathrm{otherwise,} \end{cases} $$
(20)

where \( \mathcal {N}_{p}(\mathrm {z}_{\mathit {i}}) \) denotes the p nearest neighbors of zi that belong to the same class as zi.
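A minimal sketch of building W, D, and the normalized Laplacian L for (19) is shown below. For brevity it uses cosine similarity over a p-nearest-neighbor graph on the stacked source and target samples and omits the class constraint mentioned above; the names are illustrative, not the authors' code:

```python
import numpy as np

def graph_laplacian(Z, p):
    """Normalized Laplacian L = I - D^{-1/2} W D^{-1/2} used in Eq. (19).

    Z: (d, N) source and target samples stacked column-wise,
    p: number of nearest neighbors in the affinity graph of Eq. (20)."""
    N = Z.shape[1]
    Zn = Z / (np.linalg.norm(Z, axis=0, keepdims=True) + 1e-12)
    cos = Zn.T @ Zn                              # pairwise cosine similarities
    sim = cos.copy()
    np.fill_diagonal(sim, -np.inf)               # exclude self from the neighbor search
    knn = np.argsort(-sim, axis=1)[:, :p]        # indices of the p nearest neighbors
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), p), knn.ravel()] = True
    mask |= mask.T                               # z_i in N_p(z_j) or z_j in N_p(z_i)
    np.fill_diagonal(cos, 0.0)
    W = np.where(mask, cos, 0.0)                 # Eq. (20)
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
```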

3.8 Subspace divergence minimization

In this section, we further mitigate the domain divergence by moving the source and target subspaces closer together, in the same spirit as transfer component analysis (TCA) [28] and joint distribution adaptation (JDA) [42]. The differences between the two domains are reduced but cannot be completely removed by such a transformation. We therefore borrow the idea from [30] and optimize A and B jointly by minimizing their divergence, so that the statistical and geometrical features are preserved. Formally, we use the following Frobenius-norm minimization to move the two subspaces closer.

$$ \underset{\mathrm{A},\mathrm{B}}{\min}\mathit{D}_{f}(S_{A},S_{B})=\underset{\mathrm{A},\mathrm{B}}{\min}\left \| \mathrm{A}-\mathrm{B} \right \|_{\mathrm{F}}^{2} $$
(21)

3.9 Optimization

To control the scale of the solution B, we follow [40, 41] and impose the constraint that tr(BTB) is sufficiently small. We formulate the LGMA method by incorporating (5), (7), (8), (14), (19), and (21). Our objective function (2) can then be formulated as follows:

$$ \underset{\mathrm{A},\mathrm{B}}{\mathrm{arg max}}\frac{\text{tr} \left (\begin{bmatrix} \mathrm{A}^{\mathrm{T}} & \mathrm{B}^{\mathrm{T}} \end{bmatrix} \begin{bmatrix} \beta\mathrm{S}_{\mathit{b}} & \mathrm{0} \\ \mathrm{0} & \alpha \mathrm{S}_{\mathit{t}} \end{bmatrix} \begin{bmatrix} \mathrm{A} \\ \mathrm{B} \end{bmatrix} \right )} {\text{tr}\left( \begin{bmatrix} \mathrm{A}^{\mathrm{T}} & \mathrm{B}^{\mathrm{T}} \end{bmatrix} \begin{bmatrix} \mathrm{M}_{\mathit{ss}}+\delta\mathrm{L}_{\mathit{ss}}+\lambda\mathrm{I}+\beta\mathrm{S}_{\mathit{w}} & \mathrm{M}_{\mathit{st}}+\delta\mathrm{L}_{\mathit{st}}-\lambda\mathrm{I}\\ \mathrm{M}_{\mathit{ts}}+\delta\mathrm{L}_{\mathit{ts}}-\lambda\mathrm{I} & \mathrm{M}_{\mathit{tt}}+\delta\mathrm{L}_{\mathit{tt}}+(\lambda+\alpha)\mathrm{I} \end{bmatrix} \begin{bmatrix} \mathrm{A} \\ \mathrm{B} \end{bmatrix} \right) } $$
(22)

where α, β, δ, and λ are penalty parameters, and \( \mathrm {I} \in \mathbb {R}^{d \times d} \) is the identity matrix.

LGMA aims to find a paired transformation A and B by solving a generalized eigendecomposition problem in the projected Lie algebra manifold space. To optimize (22), we define \( \mathrm{U}^{\mathrm{T}} = \begin{bmatrix} \mathrm{A}^{\mathrm{T}} & \mathrm{B}^{\mathrm{T}} \end{bmatrix} \). Thus, we obtain

$$ \begin{array}{ll} &\underset{\mathrm{U}}{\mathrm{arg max}}\ \text{tr} \left (\mathrm{U}^{\mathrm{T}} \begin{bmatrix} \beta\mathrm{S}_{\mathit{b}} & \mathrm{0} \\ \mathrm{0} & \alpha\mathrm{S}_{\mathit{t}} \end{bmatrix} \mathrm{U} \right ) \\ &\mathrm{s.t.}\ \text{tr}\left( \mathrm{U}^{\mathrm{T}} \begin{bmatrix} \mathrm{M}_{\mathit{ss}} + \delta\mathrm{L}_{\mathit{ss}} + \lambda\mathrm{I} + \beta\mathrm{S}_{\mathit{w}} & \mathrm{M}_{\mathit{st}}+\delta\mathrm{L}_{\mathit{st}}-\lambda\mathrm{I}\\ \mathrm{M}_{\mathit{ts}}+\delta\mathrm{L}_{\mathit{ts}}-\lambda\mathrm{I} & \mathrm{M}_{\mathit{tt}}+\delta\mathrm{L}_{\mathit{tt}}+(\lambda+\alpha)\mathrm{I} \end{bmatrix} \mathrm{U} \right) = 1 \end{array} $$
(23)

Equivalently, the constraint optimization of (23) can be written in the form of Lagrangian. Thus, we have

$$ \begin{array}{ll} \mathit{L}(\mathrm{U}) &=\text{tr} \left (\mathrm{U}^{\mathrm{T}} \begin{bmatrix} \beta\mathrm{S}_{\mathit{b}} & \mathrm{0} \\ \mathrm{0}& \alpha\mathrm{S}_{\mathit{t}} \end{bmatrix} \mathrm{U} \right ) \\ &\quad+\text{tr}\left( \left( \mathrm{U}^{\mathrm{T}} \begin{bmatrix} \mathrm{M}_{\mathit{ss}}+\delta\mathrm{L}_{\mathit{ss}}+\lambda\mathrm{I}+\beta\mathrm{S}_{\mathit{w}} & \mathrm{M}_{\mathit{st}}+\delta\mathrm{L}_{\mathit{st}}-\lambda\mathrm{I}\\ \mathrm{M}_{\mathit{ts}}+\delta\mathrm{L}_{\mathit{ts}}-\lambda\mathrm{I} & \mathrm{M}_{\mathit{tt}}+\delta\mathrm{L}_{\mathit{tt}}+(\lambda+\alpha)\mathrm{I} \end{bmatrix} \mathrm{U}-\mathrm{I} \right){\varLambda} \right) \end{array} $$
(24)

To solve (24), we set the first derivative \( \frac {\partial L(\mathrm {U})}{\partial \mathrm {U}}=\mathrm {0} \). Then, we obtain generalized eigendecomposition

$$ \begin{bmatrix} \beta\mathrm{S}_{\mathit{b}} & \mathrm{0} \\ \mathrm{0} & \alpha\mathrm{S}_{\mathit{t}} \end{bmatrix} \mathrm{U} = \begin{bmatrix} \mathrm{M}_{\mathit{ss}} + \delta\mathrm{L}_{\mathit{ss}} + \lambda\mathrm{I} + \beta\mathrm{S}_{\mathit{w}} & \mathrm{M}_{\mathit{st}} + \delta\mathrm{L}_{\mathit{st}}-\lambda\mathrm{I}\\ \mathrm{M}_{\mathit{ts}}+\delta\mathrm{L}_{\mathit{ts}}-\lambda\mathrm{I} & \mathrm{M}_{\mathit{tt}}+\delta\mathrm{L}_{\mathit{tt}}+(\lambda+\alpha)\mathrm{I} \end{bmatrix} \mathrm{U}{\varLambda} $$
(25)

where Λ = diag(λ1,...,λk) contains the k leading eigenvalues and \( \mathrm {U}=\begin {bmatrix} \mathrm {U}_{1},...,\mathrm {U}_{k} \end {bmatrix} \) contains the corresponding eigenvectors. Finding the optimal adaptation matrix U thus reduces to solving (25) for the k leading eigenvectors. Algorithm 1 provides a complete summary of LGMA.
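Assuming the block matrices of (22) have already been assembled (as sketched in the previous subsections), the core step of Algorithm 1, i.e., solving (25) for the k leading generalized eigenvectors and splitting U into A and B, can be written as follows (illustrative names, not the authors' code):

```python
import numpy as np
from scipy.linalg import eigh

def solve_lgma_projection(Sb, St, Sw, M, L, alpha, beta, delta, lam, k):
    """Solve the generalized eigenproblem (25) and return A, B (each d x k).

    Sb, Sw: (d, d) source scatter matrices; St: (d, d) target scatter matrix;
    M, L: (2d, 2d) MMD and graph Laplacian block matrices of (22)."""
    d = Sb.shape[0]
    I = np.eye(d)
    # Left-hand side of (25): block-diagonal matrix of the terms to maximize.
    P = np.block([[beta * Sb,        np.zeros((d, d))],
                  [np.zeros((d, d)), alpha * St      ]])
    # Right-hand side of (25): MMD, Laplacian, subspace-shift, and within-class terms.
    Q = M + delta * L + np.block([[lam * I + beta * Sw, -lam * I          ],
                                  [-lam * I,            (lam + alpha) * I ]])
    # k leading generalized eigenvectors of P u = lambda Q u (small ridge for stability).
    w, U = eigh(P, Q + 1e-6 * np.eye(2 * d))
    U = U[:, np.argsort(w)[::-1][:k]]
    return U[:d], U[d:]                          # A, B
```

In the full algorithm, this step would be repeated for T iterations, with the target pseudo labels re-predicted from the projected source data after each update.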

3.10 Computational complexity

The computational complexity of Algorithm 1 consists of four parts as follows.

  1. The computation of St, Sb, and Sw in step 2.

  2. The construction of Mss, Mtt, Mst, and Mts in step 2.

  3. The optimization of the eigendecomposition problem in step 4.

  4. The computation of all other processes.

Let T denote the number of iterations and k the number of subspace bases. In terms of big O notation, the computation of St, Sb, and Sw costs \( O(m^{2}) \), \( O(n^{2}) \), and \( O(n^{2}) \), respectively. The construction of Mss, Mtt, Mst, and Mts costs \( O(TCn^{2}) \), \( O(TCm^{2}) \), \( O(TCnm) \), and \( O(TCmn) \), respectively. The optimization of the eigendecomposition problem costs \( O(Tkm^{2}) \), and all other computations cost \( O(Tmn) \). The overall computational cost of Algorithm 1 is therefore \( O(T(k + C)m^{2} + TCn^{2} + TCmn) \).

Algorithm 1 LGMA

4 Experiments

In this section, we perform extensive experiments on real-world image recognition datasets to evaluate the proposed LGMA approach against state-of-the-art methods. The experiments are divided into three parts. Section 4.1 visualizes the features learned on image classification tasks. Section 4.2 evaluates performance on a range of cross-domain image classification tasks with standard and realistic hyper-parameter tuning. Section 4.3 reports results with parameters tuned on the target domain, a protocol established in the literature, for completeness.

4.1 Feature visualization

Figure 2a and b, e and f, and c, d, g, and h show the visualizations of the transfer tasks V→I and A\( \rightarrow \)W after performing the SCA, JGSA, and LGMA algorithms, respectively. Some interesting conclusions can be drawn. (a) SCA cannot learn the invariant cross-domain features well because the differences between the source and target domains remain large. (b) JGSA does not perform weighted distribution alignment, so the distributions of the source domain remain dissimilar to those of the target domain, leading to large domain bias. These observations show the inferior performance of SCA and JGSA and validate the superiority of LGMA.

Fig. 2

Feature visualization of source and target domain data. (a) and (b) indicate the visualization of the source domain V and the target domain I after performing SCA, respectively. (c) and (d) indicate the visualization of the source domain V and the target domain I after performing LGMA, respectively. (e) and (f) indicate the visualization of the source domain A and the target domain W after performing JGSA, respectively. (g) and (h) indicate the visualization of the source domain A and the target domain W after performing LGMA, respectively. Color markers denote different classes

4.2 Real world object recognition

4.2.1 Experimental setup

Five public large-scale image datasets are used, as shown in Table 2.

Table 2 Five benchmark datasets used in this paper

The public large-scale image recognition datasets in our experiments include Office+Caltech10, Office-31, and ImageNet + VOC2007, which are popular image classification datasets that are widely used for evaluating machine learning and data mining models, such as [6, 31, 41].

Office + Caltech10

[49] contains 2,533 images from 10 subcategories. The dataset includes 4 image domains, i.e., Amazon (A), DSLR (D), Webcam (W), and Caltech (C). Figure 3 depicts sample images from the monitor object category in the four domains, namely, Caltech, Amazon, DSLR, and Webcam [31]. Because features in Office and Caltech follow different distributions, domain adaptation can improve cross-domain image classification performance. In total, 10 classes are used in each dataset, and 12 tasks are constructed, namely, A→C, A→D, A\(\rightarrow \)W, ..., D→W. In this study, A\(\rightarrow \)B represents the transfer task from source domain A to target domain B.

Fig. 3

Sample images from object monitor category in the four domains Caltech, Amazon, DSLR, and Webcam [31]

Office-31

[49] is also a widely used dataset for transfer learning tasks in image recognition and multimedia analysis. It includes 4,652 images and 31 categories from three domains: Amazon (A), Webcam (W), and DSLR (D). Each ordered pair of domains constructs a transfer learning task, leading to 6 tasks: A→D, A\(\rightarrow \)W, ..., and W→D.

ImageNet + VOC2007

(I, V) are another pair of widely used image datasets. Because images from the same classes in the two datasets follow different distributions, each dataset can be considered one domain. In this paper, we use the datasets in [50] to perform transfer learning tasks. Both datasets contain the same five classes, namely, bird, cat, chair, dog, and person. Thus, another two transfer learning tasks, i.e., I→V and V→I, are constructed.

For all the baseline approaches, we use the optimal parameters reported in the original papers. For LGMA, we set λ = 1 and α = 1 so that the inner subspace bias and the target variance are treated as equally important. The subspace dimension is k = 30 for the Office + Caltech10 tasks with DeCaf6 features and the ImageNet + VOC2007 tasks, and k = 100 for the Office-31 tasks with DeCaf7 features. We empirically validate that these fixed parameters obtain promising performance on different types of tasks. The weighted coefficient μ, the regularization parameter β, the number of iterations T, the number of nearest neighbors p, and the coefficient of the graph Laplacian regularization term δ remain free parameters.

We adopt classification accuracy on the test data as the evaluation metric, which is widely used in many studies [28, 31, 47]:

$$ Accuracy = \frac{|\mathbf{x}:\mathbf{x}\in\mathcal{D}_{t}\wedge \hat{y}(\mathbf{x})=y(\mathbf{x})|}{|\mathbf{x}:\mathbf{x}\in\mathcal{D}_{t}|}, $$
(26)

where y(x) and \(\hat {y}(\mathbf {x})\) indicate the ground-truth and predicted labels of x in the target domain, respectively.
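For completeness, (26) corresponds to the following one-liner, assuming ground-truth and predicted target labels given as arrays (illustrative only):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Classification accuracy on the target domain, Eq. (26)."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```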

4.2.2 Baselines

To evaluate the robustness of the proposed LGMA approach to different configurations of datasets, we conduct comprehensive evaluation on image recognition datasets and compare LGMA with competitive state-of-the-art domain adaptation methods as follows:

  • 1-Nearest neighbor (1NN) classifier;

  • Support vector machine (SVM) [51];

  • Transfer component analysis (TCA) [28], which adapts marginal distribution;

  • Transfer joint matching (TJM) [52], which performs marginal distribution with the sample selection of the source domain;

  • Distribution matching machine (DMM) [36], which learns an SVM classifier with distribution alignment based on SRM;

  • Scatter component analysis (SCA) [40], which learns a classifier through scatter component analysis;

  • Joint geometrical and statistical alignment (JGSA) [41], which performs geometrical and statistical alignment with label propagation;

  • Unsupervised transfer metric learning (UTML) [18], which decreases intra-class distance and increases inter-class distance;

  • Locality preserving joint transfer (LPJT) [19], which jointly exploits feature adaptation with distribution matching and sample adaptation with landmark selection;

  • Domain invariant and class discriminative feature learning (DICD) [17], which matches the marginal and conditional distributions, and maximizes the inter-class dispersion and minimizes the intra-class scatter;

  • Transfer independently together (TIT) [20], which learns multiple transformations for each domain to map data onto a shared latent space where the domains are well aligned.

4.2.3 Experimental results and analysis

The classification performance of all comparison models on the 12 transfer tasks of the Office + Caltech10 datasets with DeCaf6 features, the 6 transfer tasks of the Office-31 datasets with DeCaf7 features, and the 2 transfer tasks of the ImageNet + VOC2007 datasets is shown in Tables 3, 4, and 5, respectively. LGMA considerably outperforms the competitive baseline methods on most of the transfer tasks. Specifically, LGMA achieves the following performance gains over the best baselines: (1) 1.3% on the 12 transfer tasks of the Office + Caltech10 datasets with DeCaf6 features, (2) 0.4% on the 6 transfer tasks of the Office-31 datasets with DeCaf7 features, and (3) 6.1% on the 2 transfer tasks of the ImageNet + VOC2007 datasets. Although LGMA does not perform best on every task, when it does, it usually outperforms the best baseline approach by a considerable margin; otherwise, it performs only slightly worse than the optimal baseline. This finding demonstrates that LGMA is robust to feature shift and instance bias for domain adaptation.

Table 3 Recognition accuracy(%) against other baseline methods on Office + Caltech10 (DeCaf6) datasets, the best results and the best baseline results are shown in boldface and italic, respectively
Table 4 Recognition accuracy(%) against other baseline methods on Office-31 (DeCaf7) datasets, the best results and the best baseline results are shown in boldface and italic, respectively
Table 5 Recognition accuracy(%) against other baseline methods on ImageNet + VOC2007 datasets, the best results and the best baseline results are shown in boldface and italic, respectively

Several further observations can be made. (1) Domain adaptation methods (i.e., instance-based, feature representation-based, classifier-based, and hybrid knowledge-based adaptation methods) are generally superior to SVM and 1NN, which indicates that minimizing the distribution differences is the key to domain adaptation. (2) The classifier-based adaptation method DMM outperforms TCA, showing the effectiveness of minimizing the distribution differences based on SRM in the infinite-dimensional reproducing kernel Hilbert space (DMM) rather than in the dimension-reduced kernel PCA space (TCA). (3) Hybrid knowledge-based adaptation methods (i.e., SCA, JGSA, TIT, LPJT, UTML, DICD, and LGMA) further outperform TCA and the other single methods, and LGMA performs best on most transfer tasks. Single knowledge-based adaptation methods alone are insufficient for domain adaptation when the domain discrepancy is substantially large. The reason is that source samples that are irrelevant to the target samples are not helpful for learning a unified classifier, even when using cross-domain invariant features, high-dimensional nonlinear features, or both. LGMA addresses this limitation by reweighting the source instances according to their relevance to the target instances and performing weighted distribution alignment in the linear Lie algebra manifold space.

Although SCA, JGSA, LPJT, and DICD perform distribution matching by using hybrid knowledge-based adaptation, the advantages of LGMA over these four methods are threefold. (1) LGMA corrects the domain mismatch by quantitatively evaluating the importance of the marginal and conditional distributions in the generalized FLDA framework. LGMA further performs feature matching to guarantee a large number of effective source instances for classifying the related target domain. In SCA, JGSA, LPJT, and DICD, the evaluation of distribution importance is ignored. (2) LGMA jointly learns the domain-invariant and label-discriminative transfer classifier and the transferable knowledge (invariant to feature representations and unbiased toward irrelevant instances) in a unified learning paradigm in the linear Lie algebra manifold space, whereas SCA, JGSA, LPJT, and DICD learn the transferable knowledge and cross-domain classifier in a nonlinear manifold space. (3) LGMA finds a geodesic on the original Lie group and projects all the samples onto a Lie algebra manifold space along the geodesic direction while ensuring the discrimination of the projected samples in the linear Lie algebra manifold space. By contrast, the other four methods (i.e., SCA, JGSA, LPJT, and DICD) cannot guarantee that the transformed samples are linearly separable in the RKHS.

We further verify the performance of LGMA on the Office + Caltech10 datasets using SURF features, and the results are reported in Table 6. It is worth noting that LGMA outperforms the other baselines, ranging from traditional machine learning methods (i.e., 1NN and SVM) to state-of-the-art transfer learning models (i.e., TJM, SCA, JGSA, UTML, TIT, DICD, and LPJT), which demonstrates that LGMA is significantly superior to the other baselines in minimizing the cross-domain discrepancy.

Table 6 Recognition accuracy(%) against other baseline methods on Office + Caltech10 (SURF) datasets, the best results and the best baseline results are shown in boldface and italic, respectively

We also evaluate the importance of the Lie algebra transformation, the graph Laplacian regularization term (including the parameters p and δ), and the weighted distribution alignment factor μ, which distinguish LGMA from the baseline methods. We randomly select several tasks and show the results in Figs. 4, 5, and 6. In Fig. 4, the dotted lines represent the baseline methods, and the solid lines represent the proposed LGMA method. Additional observations can be made. (1) The Lie algebra transformation (L), the graph Laplacian regularization (GLR), and the weighted distribution alignment (WDA) are highly important in dealing with domain adaptation problems (Figs. 5 and 6). (2) Compared with the other variants (FLDA; FLDA with Lie algebra transformation (FL); FLDA with Lie algebra transformation and weighted distribution alignment (FLW); and FLDA with Lie algebra transformation, weighted distribution alignment, and graph Laplacian regularization (FLWG)), LGMA performs better, which validates the effectiveness of the proposed method. (3) LGMA reaches steady performance in approximately \( \mathrm {T\leqslant 10} \) iterations (Figs. 4d and 5). (4) LGMA achieves high performance over a wide range of parameter values (Fig. 4a, b, and c).

Fig. 4

The parameter sensitivity and convergence analysis of the proposed LGMA approach

Fig. 5

The recognition accuracy of methods F, FL, FLW, FLWG, and LGMA

Fig. 6

Evaluation of the importance of the Lie algebra transformation (L), weighted distribution alignment (W), and graph Laplacian regularization (G)

The reasons for these results are as follows. First, the instances in the Lie group manifold space are projected onto the linear Lie algebra manifold space by the Lie algebra transformation, which makes data that are hard to discriminate in the nonlinear Lie group manifold space easier to separate. Second, the graph Laplacian regularization further exploits the similar geometrical properties of nearest points in domain adaptation. Third, the weighted distribution alignment factor μ ∈ {0, 0.01, ..., 0.99, 1} evaluates the importance of the marginal and conditional distributions. We do not perform these experiments on the DeCaf7 features of the Office-31 datasets because the results on them are already satisfactory.

4.3 Results with parameter tuning on target domain

In this section, we analyze the parameter fluctuations of LGMA on different types of datasets to validate that a wide range of parameter values can be selected for improved performance.

We examine the sensitivity to the number of nearest neighbors p by experimenting over a large range, p ∈ {2, 4, 8, ..., 64}, on randomly selected tasks. From Fig. 4a and the experimental results, we conclude that LGMA is robust around p = 32. μ is a weight factor with the value range μ ∈ {0, 0.01, ..., 0.99, 1}, and its value can be chosen from the analysis of Fig. 4c.

LGMA accepts a wide range of values for the regularization parameters β and δ and the other necessary parameters k and T. We follow the same setup as [41], with \( \beta \in [2^{-15},2^{-1}] \) and \( k \in [20,180] \). In this study, we set the number of iterations T = 10 (Fig. 4d). δ (Fig. 4b) is a factor with δ ∈ {0, 0.01, ..., 0.99, 1}. We observe that LGMA achieves robust performance for a wide range of parameter values.

In the experiment on Office + Caltech10 datasets using DeCaf6 features, we set the free parameters β = 0.08, δ = 0.18, and μ = 0.81. In the experiment on Office-31 datasets using DeCaf7 features, we set the free parameters β = 0.1, δ = 0.46, and μ = 0.74. In the experiment on ImageNet + VOC2007 datasets, we set the free parameters β = 0.1, δ = 0.11, and μ = 0.81.

5 Conclusions

In this paper, we proposed a new Lie group manifold analysis (LGMA) method for unsupervised domain adaptation. LGMA performs a transformation using variances between subsets of data to suppress insignificant differences (within labels and between domains) and to amplify useful differences (between labels and overall variability) in a linear Lie algebra manifold space. Meanwhile, LGMA learns an invariant cross-domain classifier by extracting domain-invariant feature representations, evaluating the importance of the marginal and conditional distributions, exploiting the similar geometrical properties of nearest points, and estimating unbiased instance weights, which jointly reduce the cross-domain distribution difference. Extensive experiments on several cross-domain image datasets validate that LGMA considerably outperforms state-of-the-art domain adaptation methods.

In general, the problem of dataset bias in domain adaptation is far from solved. High accuracy (\(\geqslant 90\%\)) is achieved by existing approaches on only a few cross-domain tasks, even with advanced feature extraction methods such as DeCaf6 and DeCaf7 features. Performance with raw features is clearly not satisfactory. Therefore, it is critical to develop more robust algorithms that can significantly reduce data bias in all cases.