1 Introduction

In the machine learning and pattern recognition fields, enormous amounts of data, e.g., images, videos, and texts, are emerging, while traditional supervised machine learning methods require labeled data for each gallery or corpus [1, 2]. In most existing applications, there are not sufficient labeled data to classify new domains, and manually labeling the unlabeled instances is laborious and expensive. Thus, exploiting other existing related labeled domains to classify new visual domains has drawn increasing attention over the last few years. However, classification results are often poor when classifiers trained on the available labeled samples are applied directly to new unlabeled instances with different distributions. For example, imagine that we are to develop an iPhone app that identifies cars in images captured by the phone's camera while no labeled images are available. In this case, the trained model would not work appropriately, since the training and test images differ in pose and lighting conditions, which means different distributions [3].

The challenge of exploiting other related domains with different distributions or feature spaces to classify new tasks is known as the domain shift problem. To address it, a variety of solutions have been developed under the names transfer learning (TL) and domain adaptation (DA), which improve learning in a new task by transferring knowledge from related tasks that have already been learned.

Generally, DA and TL techniques are categorized into two different settings. The first setting is called semi-supervised domain adaptation, in which a small portion of the target domain is labeled and the rest is unlabeled. In the second setting, called unsupervised domain adaptation, there is no labeled sample in the target domain [4]. However, in both settings, the source and target data often have different marginal and conditional distributions. Some basic DA methods only consider the marginal distribution disparity between domains and ignore the conditional distribution discrepancy. Unlike these methods, our proposed approach exploits the geometry of the data manifold to minimize both the marginal and the conditional distribution differences.

Learning invariant features when the distributions of the source and target domains differ is decisive. However, most traditional dimensionality reduction approaches, such as Fisher linear discriminant analysis (FDA) [5] and locality preserving projections (LPP) [6], perform poorly on domain shift problems, whether in the original or a low-dimensional space. Thus, we develop a new dimensionality reduction algorithm to address the domain shift issue.

On the other hand, Bregman divergence (BD) [7] is a nonlinear measure of the distance between distributions that can be estimated using the kernel density estimation (KDE) technique [8]. Specifically, we extend the nonlinear Bregman divergence to measure the discrepancy of the marginal and conditional distributions and integrate it with local Fisher discriminant analysis (LFDA) [9] to create a new feature representation that is efficacious and robust against considerable distribution divergence. LFDA effectively combines the ideas of FDA and LPP to simultaneously maximize the between-class separability and preserve the within-class local structure. Furthermore, Bregman divergence can transfer the gained knowledge from the training to the test set by minimizing the mismatch between their marginal and conditional distributions.

In this work, to tackle the unsupervised DA problem, we propose a novel domain adaptation approach called unsupervised domain adaptation via transferred local Fisher discriminant analysis (TLFDA), which projects the source and target data into a common subspace such that both the marginal and conditional distribution discrepancies of the domains are minimized. Moreover, TLFDA exploits Bregman divergence to measure the distribution difference, which enables transferring knowledge from the training samples to the test ones. Furthermore, TLFDA maximizes the between-class separability of the source and target domains and preserves the within-class local structure in a low-dimensional subspace. TLFDA considers both the discriminative information of marginal instances in various classes and the local geometry of instances in each class.

Contributions: The contributions of our TLFDA are listed as follows.

  1. TLFDA mitigates the joint marginal and conditional distribution discrepancies across the source and target domains via Bregman divergence.

  2. TLFDA introduces a novel dimensionality reduction method for the domain shift problem whose idea originates from joint FDA and LPP.

  3. TLFDA predicts the pseudo-labels of the target data based on an informed estimate from a model trained on the source data.

  4. TLFDA benefits from an iterative approach to refine the pseudo-labels of the target samples.

Organization of the paper: The next section provides a review of related work. We then present our proposed method in Sect. 3. In Sects. 4 and 5, we provide the experimental results and discussion. Finally, Sect. 6 concludes the paper and outlines future work.

2 Related work

The existing DA methods to tackle the problem of domain shift are organized into three adaptation categories [10, 11]: (1) instance-based methods, (2) model-based methods, and (3) feature-based methods.

The instance-based approaches [12, 13] assign less importance to irrelevant source instances to reduce the distribution discrepancy across the source and target domains. Landmark selection [14] is one of the instance-based methods, which benefits from maximum mean discrepancy (MMD) [15] to select a subset of source samples that obey the same distribution as the target samples. In other words, the major focus of the landmark selection method is to connect the source data with the target data using the discovered landmarks. LSSA [16] is another instance-based method that selects a subset of instances as landmarks and nonlinearly projects the source and target data into a latent subspace based on the selected landmarks. LSSA then uses subspace alignment to adapt the unaligned domains by learning a nonlinear mapping function.

The model-based methods [17, 18] center on the notion of adaptive classifier design: they build a robust model that copes with the distribution mismatch across domains by transferring parameters from a model built on the source domain to a target model. Domain adaptation machine (DAM) [19] benefits from a set of auxiliary/source classifiers that are trained with labeled samples from many source domains to learn a robust target classifier. Adaptation regularization-based transfer learning (ARTL) [20] reduces the structural risk and jointly minimizes the marginal and conditional distribution differences between domains. Also, ARTL maximizes the manifold consistency to tackle unsupervised domain adaptation.

The feature-based methods [21,22,23,24,25,26,27,28,29,30] change the feature representation of the source and target domains to jointly bring their marginal and conditional distributions closer. Joint distribution adaptation (JDA) [21] creates a novel feature representation using a principled dimensionality reduction technique that is robust against distribution shift. Low-rank and sparse representation (LRSR) [22] preserves the inherent geometric data structure via a low-rank constraint and/or sparse representation in an embedded subspace. LRSR geometrically aligns the source and target data through both the low-rank and sparse constraints such that the source and target data are interleaved within a new shared feature subspace. Visual domain adaptation (VDA) [23] uses joint domain adaptation and transfer learning to deal with the problem of domain shift. VDA discriminates various classes in an embedded representation via condensed domain-invariant clusters. Close yet discriminative domain adaptation (CDDA) [31] is a framework that constructs a common feature representation with the following two properties. First, the difference between the source and target domains is measured in terms of the joint marginal and conditional probability distributions via maximum mean discrepancy. Second, CDDA discriminates data using an inter-class repulsive force. Coupled local–global adaptation (CLGA) [24] globally adapts the marginal and conditional distribution disparities. CLGA builds a graph to minimize the distances between sample pairs that lie on the same class manifold but different domain manifolds and to maximize the distances between sample pairs that lie on the same domain manifold but different class manifolds. Domain invariant and class discriminative representations (DICD) [25] jointly matches the marginal and conditional distributions in a latent subspace. DICD discriminates classes by increasing the intra-class compactness and the inter-class dispersion. Discriminative and geometry aware domain adaptation (DGA-DA) [26] defines a repulsive force term to discriminate the latent feature space. DGA-DA infers the labels using the geometric structures of the explored data through label smoothness and geometric structure consistencies. Transductive transfer learning for image classification (TTLC) [27], a state-of-the-art domain adaptation method, globally adapts the marginal and conditional distributions in two respective low-dimensional subspaces. TTLC regulates the distances between sample pairs in both domains to discriminate various classes. Finally, TTLC locally aligns the two latent subspaces. Domain adaptation with geometrical preservation and distribution alignment (GPDA) [28] preserves both the statistical and the geometrical properties of the domains in a unified framework. First, GPDA preserves the statistical properties of the data via a nonnegative matrix factorization model [32]. Then, GPDA preserves the geometrical structure of the data through graph dual regularization in the nonnegative matrix factorization framework. Also, the marginal and conditional distribution disparities are aligned in the same framework. Unified cross-domain classification via geometric and statistical adaptations (UCGS) [29] first minimizes the structural risk on the source data and then uses MMD for statistical adaptation and the Nystrom method for geometrical adaptation in a unified framework. Feature selection-based visual domain adaptation (FSVDA) [30] uses the particle swarm optimization (PSO) [33] algorithm to select the most relevant feature subsets across both domains. To evaluate the effectiveness of each subset, FSVDA uses the manifold embedded distribution alignment (MEDA) objective [34] as its fitness function.

Recently, great effort has been dedicated to deep DA methods. Deep methods, which extract features through hidden layers, need large amounts of training data. Venkateswara et al. proposed a domain adaptive hashing network (DAH) [35] to assign unique hash codes for the source and target domains. Manifold Aligned Label Transfer for Domain Adaptation (MALT-DA) [36] uses a densely connected architecture (DenseNet) [37] to learn better deep features on the source domain. MALT-DA aligns features across domains through two mechanisms, Adaptive Batch Normalization (ABN) [38] and subspace alignment via LPP. MALT-DA then clusters the features into different groups and compares the labels produced by the cluster-matching process with the labels hypothesized by the network; the samples with matching labels are used as training data for the adaptation method. Deep methods have a high training time complexity, whereas TLFDA, whose cost is quadratic in the number of samples (Sect. 3.3), can be preferred to deep methods.

As a result, our current research in TL focuses on the third category, the feature-based adaptation approach, which looks for a common feature representation across the source and target domains. In summary, our main contribution is a new dimensionality reduction-based method that combines the ideas of FDA and LPP and minimizes the distance between the marginal and conditional distributions of the source and target data to enable effective transfer learning. Unlike most previous works, TLFDA decreases the marginal and conditional discrepancies via Bregman divergence. Moreover, whereas TTLC maps the source and target samples into two respective subspaces, TLFDA maps all samples into a common subspace. We apply our new method to four real-world applications in a transfer learning setting to demonstrate its outstanding performance.

3 Proposed method

In this section, we give a precise description of TLFDA for dealing with the unsupervised domain shift problem. In domain shift problems, accounting for the distribution nonconformity between the source and target domains is vital if methods based on FDA and LPP criteria are to achieve the desired results. To this end, we present a novel dimensionality reduction framework for the domain shift problem whose idea originates from joint FDA, LPP, and Bregman divergence. Our contribution in this work is to find a common subspace where the marginal and conditional distribution discrepancies of the domains are jointly minimized and the local structure of the data is well preserved.

To this end, we define a domain \(D = \{X, P(x)\}\) that consists of the instances in a feature space X with marginal probability distribution P(x). Moreover, we consider \( T = \{Y, f(x)\} \) as a task for domain D that consists of a label set Y and a prediction function f(x). Note that f(x) can be interpreted as a conditional probability distribution, i.e., P(Y|x). In this paper, an unsupervised domain shift problem with \( D_S = \{(x_{1}, y_{1}),\ldots , (x_{n_{\textrm{s}}}, y_{n_{\textrm{s}}})\} \) as the labeled source domain and \( D_T = \{x_{n_s+1},\ldots , x_{n_s+n_t}\} \) as the unlabeled target domain is addressed, where \( n_s \) and \( n_t \) are the numbers of source and target samples, respectively.

3.1 Dimensionality and divergence reduction

The major idea behind our proposed approach is to discover an optimal couple of projection and classification for the source and target instances such that the distribution divergence between domains is decreased. A feasible way to attain this objective is to use dimensionality reduction methods (e.g., LPP and FDA). However, despite the success of such methods, they cannot guarantee the model's efficiency against the problem of domain shift. Thus, in addition to exploiting dimensionality reduction approaches, we consider domain adaptation settings to build a model that is robust against distribution shift. In this section, we first formulate the classical FDA and LPP and then introduce the classical LFDA [9], which combines both dimensionality reduction methods.

The main objective of FDA is to express one dependent variable as a linear combination of other variables. To this end, FDA extracts new features of the domain as linear combinations of the available features. In fact, FDA attempts to maximize the degree of class separation. Therefore, FDA incorporates the following two criteria: (1) it maximizes the between-class scatter, and (2) it minimizes the within-class scatter, such that the samples in the embedded subspace have maximum discrimination. Let \(x_i \in R^D (i = 1, 2,\ldots , n)\) be a D-dimensional sample and \(y_i \in \{1, 2,\ldots , c\}\) be the class label of the \(i\textrm{th}\) sample, where n is the number of instances and c is the number of classes. Let \( n_l \) be the number of instances in class l, where \(\sum _{l=1}^cn_l=n\). Mathematically, the between-class scatter matrix is given by \( S^{(b)}= \sum _{l=1}^c n_l(\mu _l-\mu )(\mu _l-\mu )^{\textrm{T}} \) and the within-class scatter matrix is \( S^{(w)}= \sum _{l=1}^c\sum _{i:y_i=l} (x_i-\mu _l)(x_i-\mu _l )^{\textrm{T}} \), where the inner sum runs over i such that \( y_i =l \), \( \mu _l \) is the mean of the instances in class l, and \( \mu \) is the mean of all instances.

Thus, FDA subspace is given by \( \textrm{argmax}_{J}tr(J^{\textrm{T}} S^{(b)}J) /tr \)\( (J^{\textrm{T}} S^{(w)}J) \) subject to \( J^{\textrm{T}} J=I \). Therefore, FDA transformation matrix J is defined as follows:

$$\begin{aligned} J_{\textrm{FDA}}= \textrm{argmax}_{J} \left[ \textrm{tr}\left( J^{\textrm{T}} S^{(b)}J\right) \textrm{tr}^{-1}\left( J^{\textrm{T}} S^{(w)}J\right) \right] , \end{aligned}$$
(1)

where \( \textrm{tr}^{-1}(.) \) is the inverse of matrix trace. The intuition behind maximizing \( J_{\textrm{FDA}} \) is to learn a projection matrix \( J \in R^{D\times d} \) to transform data from the original feature space composed of D features into a low dimensional subspace with d features (i.e., \(d < D\)).
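For concreteness, the following NumPy sketch builds the two scatter matrices defined above and extracts a projection by solving the corresponding generalized eigenvalue problem. It is only a minimal reading of the FDA criterion in Eq. 1 under our own variable names (X is an n-by-D data matrix, y holds integer class labels), not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def fda_projection(X, y, d):
    """Classical FDA (cf. Eq. 1): scatter matrices and top-d discriminant directions."""
    n, D = X.shape
    mu = X.mean(axis=0)
    S_b = np.zeros((D, D))
    S_w = np.zeros((D, D))
    for label in np.unique(y):
        X_l = X[y == label]                          # samples of class `label`
        mu_l = X_l.mean(axis=0)
        diff = (mu_l - mu)[:, None]
        S_b += X_l.shape[0] * diff @ diff.T          # between-class scatter S^(b)
        S_w += (X_l - mu_l).T @ (X_l - mu_l)         # within-class scatter S^(w)
    # Generalized eigenproblem S^(b) v = lambda S^(w) v; a small ridge keeps S^(w) invertible.
    eigvals, eigvecs = eigh(S_b, S_w + 1e-6 * np.eye(D))
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order]                         # D x d projection matrix J

# Usage: Z = X @ fda_projection(X, y, d=10) projects the data into the d-dimensional subspace.
```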

LPP, another dimensionality reduction technique, exploits an undirected graph encoding the neighbor relations of pairwise samples to preserve the local geometry of the data. LPP also linearly and optimally approximates the eigenfunctions of the Laplace–Beltrami operator over the data manifold. The weight between instances \( \textbf{x}_{i} \) and \( \textbf{x}_{j} \) is calculated via \( E_{ij}=\exp (-\Vert \textbf{x}_{i}-\textbf{x}_{j}\Vert ^2/t)\) for same-class samples and is set to \( E_{ij}=0 \) otherwise. The LPP criterion \( J_{\textrm{LPP}} \) is defined as follows:

$$\begin{aligned} J_{\textrm{LPP}}&= \frac{1}{2} \sum _{i,j=1}^n\left( \left( J^{\textrm{T}} x_i-J^{\textrm{T}} x_j \right) ^{\textrm{T}}\left( J^{\textrm{T}} x_i-J^{\textrm{T}} x_j\right) \right) E_{ij}\\&= \textrm{tr}\left( J^{\textrm{T}} X\left( W-E\right) X^{\textrm{T}} J\right) , \end{aligned}$$
(2)

where W is a diagonal matrix with \( W_{ii}= \sum _{j=1}^n E_{ji} \), and minimizing \(J_{\textrm{LPP}}\) yields the transformation matrix that maps samples into a latent subspace.
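The sketch below evaluates the LPP criterion of Eq. 2 for a given projection J, building the heat-kernel weights E over same-class pairs exactly as described above. It is an illustrative reading only (the variable names and the bandwidth t are ours), not the paper's code.

```python
import numpy as np

def lpp_criterion(X, y, J, t=1.0):
    """LPP criterion of Eq. 2: tr(J^T X (W - E) X^T J) with heat-kernel weights."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    E = np.exp(-sq_dists / t) * (y[:, None] == y[None, :])   # same-class weights, 0 otherwise
    np.fill_diagonal(E, 0.0)
    W = np.diag(E.sum(axis=0))                               # W_ii = sum_j E_ji
    Z = X @ J                                                # projected samples (n x d)
    return np.trace(Z.T @ (W - E) @ Z)                       # graph-Laplacian form of Eq. 2
```

A small value of this criterion means that nearby same-class samples stay close after projection, which is exactly the locality that LPP preserves.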

The performance of FDA degrades dramatically when the instances of a class come from several distinct clusters. This is caused by the global nature of the within-class and between-class scatter evaluation. Moreover, whenever samples from different classes are close in the original high-dimensional space \( R^{D} \), LPP, with its unsupervised nature, may also overlap them. To overcome these problems, we combine FDA and LPP. Since the various classes might come from different distributions, they are treated separately and the dissimilarity among them is taken into account; therefore, we maximize the borders between the various classes as much as possible. In fact, we adopt local Fisher discriminant analysis (LFDA) [9], a linear dimensionality reduction approach that effectively combines the ideas of FDA and LPP. Since the embedding transformation has an analytical form, the projection matrix can be easily computed just by solving a generalized eigenvalue problem. The LFDA transformation matrix \( J_{\textrm{LFDA}} \) is defined as follows:

$$\begin{aligned} J_{\textrm{LFDA}}=\textrm{argmax}_{J} \left[ \textrm{tr}\left( J^{\textrm{T}} \tilde{S}^{(b)}J\right) \textrm{tr}^{-1}\left( J^{\textrm{T}} \tilde{S}^{(w)}J\right) \right] , \end{aligned}$$
(3)

where \( \tilde{S}^{(w)} \) can be defined as

$$\begin{aligned} \tilde{S}^{(w)}= \frac{1}{2} \sum _{i=1}^n \frac{1}{n_{y_i}}\tilde{P}_i^{(w)} \end{aligned}$$
(4)

and \( n_{y_i} \) is the number of instances in the class to which the sample \( x_i \) belongs, and \( \tilde{P}_i^{(w)} \) is the pointwise local within-class scatter matrix around \( x_i \), defined as follows:

$$\begin{aligned} \tilde{P}_i^{(w)}=\frac{1}{2} \sum _{j:y_j=y_i}^n E_{ij}(x_j-x_i)(x_j-x_i)^{\textrm{T}}. \end{aligned}$$
(5)

Accordingly, minimizing \( \tilde{S}^{(w)} \) corresponds to minimizing the weighted sum of the pointwise local within-class scatter matrices over all instances. Moreover, \( \tilde{S}^{(b)}\) can be defined in a similar way as follows:

$$\begin{aligned} \tilde{S}^{(b)}= \frac{1}{2} \sum _{i=1}^n\left( \frac{1}{n}-\frac{1}{n_{y_i}}\right) \tilde{P}_i^{(w)}+\frac{1}{2n}\sum _{i=1}^n \frac{1}{n_{y_i}}\tilde{P}_i^{(b)}, \end{aligned}$$
(6)

where \( \tilde{P}_i^{(b)} \) is the pointwise between-class scatter matrix around \( x_i \) and is expressed as follows:

$$\begin{aligned} \tilde{P}_i^{(b)}= \sum _{j:y_j\ne y_i}(x_j-x_i)(x_j-x_i)^{\textrm{T}}. \end{aligned}$$
(7)

Note that \( \tilde{P}_i^{(b)}\) does not include the localization factor \( E_{ij} \). Moreover, Eq. 6 shows that maximizing \( \tilde{S}^{(b)} \) corresponds to decreasing the weighted sum of pointwise local within-class scatter matrices and increasing the sum of pointwise between-class scatter matrices. Therefore, the eigenvalue decomposition of \( (\tilde{S}^{(w)})^{-1} \tilde{S}^{(b)} \) is used to obtain \( J_{\textrm{LFDA}} \) as the solution of this optimization problem, and the eigenvectors corresponding to the d largest eigenvalues construct the mapping matrix J. Despite its efficiency, LFDA cannot minimize the distribution diversity across the source and target domains. Hence, we seek a solution that adapts the distribution divergence across domains.
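To make Eqs. 4–7 concrete, the sketch below accumulates the local within-class and between-class scatter matrices with a plain double loop (fine for illustration, too slow for large n). The weighting follows our reading of the equations above; the projection would then come from the generalized eigenvectors of the pair \((\tilde{S}^{(w)}, \tilde{S}^{(b)})\).

```python
import numpy as np

def lfda_scatters(X, y, t=1.0):
    """Local within/between-class scatters of Eqs. 4-7 (illustrative O(n^2) version)."""
    n, D = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    E = np.exp(-sq_dists / t)                        # locality weights E_ij
    n_y = np.array([np.sum(y == y_i) for y_i in y])  # n_{y_i}: class size of sample i
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for i in range(n):
        for j in range(n):
            d_ij = (X[j] - X[i])[:, None]
            outer = d_ij @ d_ij.T
            if y[i] == y[j]:
                # P_i^(w) terms of Eq. 5, weighted as in Eqs. 4 and 6
                S_w += 0.5 * (1.0 / n_y[i]) * 0.5 * E[i, j] * outer
                S_b += 0.5 * (1.0 / n - 1.0 / n_y[i]) * 0.5 * E[i, j] * outer
            else:
                # P_i^(b) terms of Eq. 7, no locality factor
                S_b += (1.0 / (2.0 * n)) * (1.0 / n_y[i]) * outer
    return S_w, S_b
```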

Therefore, the major issue is to reduce the distribution mismatch between the source and target domains by minimizing an empirical distance measure. Measuring the distance between distributions with parametric criteria requires expensive distribution estimation. Thus, we utilize a nonlinear distance measure referred to as Bregman divergence. Bregman divergence measures the distribution diversity of samples drawn from different domains in a projected subspace. In fact, Bregman divergence is able to transfer the gained knowledge from the training set to the test set by reducing the distance between the distributions of the training and test samples.

The Bregman distance is a generalization of a wide range of distance functions (e.g., the Mahalanobis distance [39], the squared Euclidean distance, and the Kullback–Leibler divergence [40]) and is capable of exploring the nonlinear correlations of data features. Applications of Bregman divergence underlie many recent advances in machine learning.

Definition 3.1

(Bregman divergence). Given a strictly convex function f on \( \Omega \), the Bregman divergence corresponding to f is defined as:

$$\begin{aligned} \textit{BD}(x, y)= f(x) - f(y) - (x - y)\nabla f(y), \end{aligned}$$
(8)

where \( \nabla f \) represents the gradient vector of f.
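As a sanity check of Definition 3.1, the snippet below implements Eq. 8 for a user-supplied convex function and verifies the squared-Euclidean special case discussed next. It is a toy illustration, not part of the TLFDA pipeline.

```python
import numpy as np

def bregman_divergence(f, grad_f, x, y):
    """BD(x, y) = f(x) - f(y) - (x - y) . grad f(y)   (Eq. 8)."""
    return f(x) - f(y) - np.dot(x - y, grad_f(y))

# With f(x) = ||x||^2 the divergence reduces to the squared Euclidean distance.
f = lambda v: float(np.dot(v, v))
grad_f = lambda v: 2.0 * v
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
assert np.isclose(bregman_divergence(f, grad_f, x, y), np.sum((x - y) ** 2))
```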

The Bregman divergence defined in Eq. 8 can be described as the discrepancy between the value of the convex function at x and its first-order Taylor expansion at y, or equivalently as the remainder term of the first-order Taylor expansion of f at y. Indeed, the Bregman divergence reduces to a well-known loss function depending on the choice of the convex function f. For example, if \( f(x) = x^2 \), then \( \textit{BD}(x, y) = (x - y)^2 \), and clearly its square root is a metric. Thus, the functional Bregman divergence \( \textit{BD}(.,.) \) is expressed as

$$\begin{aligned} \textit{BD}(P_{\textrm{S}},P_{T})&= \int \left( P_{\textrm{S}}(\textbf{y})- P_{T}(\textbf{y})\right) ^{2}d\textbf{y}\\&= \int \left( P_{\textrm{S}}(\textbf{y})^2 -2P_{\textrm{S}}(\textbf{y})P_{T}(\textbf{y})+P_{T}(\textbf{y})^2\right) d\textbf{y}, \end{aligned}$$
(9)

where \( P_{\textrm{S}} \) and \( P_{T} \) are the probability density functions (PDFs) of the source and the target data in latent subspaces, respectively. Therefore, Bregman divergence measures the distribution differences of the source and target domains in the latent subspace.

Using the kernel density estimation technique, the densities are estimated in the latent subspace. The density at each point \( \textbf{y}\in R^d \) is estimated as the normalized sum of kernels between \( \textbf{y} \) and the sample points \( \textbf{y}_{i} \) as follows:

$$\begin{aligned} p(\textbf{y}) = \frac{1}{n}\sum _{i=1}^{n}\textbf{G}_{\sum }\left( \textbf{y}-\textbf{y}_{i}\right) , \end{aligned}$$
(10)

where n is the number of samples and \( \textbf{G}_{\sum }(.) \) is the d-dimensional Gaussian kernel with the covariance matrix \( \sum \).
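Read this way, Eq. 10 is the standard Gaussian kernel density estimate. The sketch below evaluates it at a query point, with the covariance matrix Σ treated as a user-chosen bandwidth (our assumption, since the paper does not specify how Σ is set).

```python
import numpy as np

def gaussian_kernel(diff, Sigma):
    """d-dimensional Gaussian kernel G_Sigma evaluated at a difference vector."""
    d = diff.shape[0]
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def kde_density(y_query, Y, Sigma):
    """p(y) = (1/n) sum_i G_Sigma(y - y_i)   (Eq. 10); Y holds projected samples row-wise."""
    return float(np.mean([gaussian_kernel(y_query - y_i, Sigma) for y_i in Y]))
```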

3.2 Distribution adaptation using Bregman divergence

The most important challenge in domain adaptation is to decrease the divergence between the source and target domains. BD minimizes the distribution divergence across domains while preserving the geometric structure of the data. However, aligning both the marginal and the conditional distributions is essential for robust domain adaptation. To measure the discrepancy between the marginal distributions of the source and target domains, we employ BD as follows:

$$\begin{aligned}&\textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T})\\&\quad = \int \left( \frac{1}{n_{\textrm{s}}}\sum _{i=1}^{n_{\textrm{s}}}\textbf{G}_{\sum _{1}}(\textbf{y}-\textbf{y}_{i})\right) ^{2}d\textbf{y}\\&\quad \quad +\int \left( \frac{1}{n_{\textrm{t}}}\sum _{j=n_{\textrm{s}}+1}^{n_{\textrm{s}}+n_{\textrm{t}}}\textbf{G}_{\sum _{2}}(\textbf{y}-\textbf{y}_{j})\right) ^{2}d\textbf{y}\\&\quad \quad -\frac{2}{n_{\textrm{s}}n_{\textrm{t}}}\sum _{i=1}^{n_{\textrm{s}}}\sum _{j=n_{\textrm{s}}+1}^{n_{\textrm{s}}+n_{\textrm{t}}}\textbf{G}_{\sum _{12}}(\textbf{y}_{i}-\textbf{y}_{j}), \end{aligned}$$
(11)

where \( \textrm{Dist}^{\textrm{marginal}} \) is the distance between the marginal distributions of the source and target domains, and \( D_S \) and \( D_T \) denote the sets of instances in the source and target domains, respectively. The discrepancy between the marginal distributions \( P(X_{\textrm{S}}) \) and \( P(X_{T}) \) is reduced by minimizing \( \textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T}) \).
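Because products of Gaussian kernels integrate in closed form (the convolution identity recalled in Sect. 3.3), the distance in Eq. 11 can be evaluated without numerical integration. The sketch below is our reading of that computation for two sets of projected samples; the bandwidth matrices are assumptions, and the constant factors follow the expansion of Eq. 9 rather than any reference implementation.

```python
import numpy as np

def gauss(diff, Sigma):
    d = diff.shape[0]
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def marginal_distance(Ys, Yt, Sigma1, Sigma2):
    """Closed-form BD between the KDEs of projected source (Ys) and target (Yt) samples (Eq. 11)."""
    ns, nt = len(Ys), len(Yt)
    ss = sum(gauss(a - b, Sigma1 + Sigma1) for a in Ys for b in Ys) / ns ** 2   # int P_S^2
    tt = sum(gauss(a - b, Sigma2 + Sigma2) for a in Yt for b in Yt) / nt ** 2   # int P_T^2
    st = sum(gauss(a - b, Sigma1 + Sigma2) for a in Ys for b in Yt) / (ns * nt) # int P_S P_T
    return ss + tt - 2.0 * st
```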

Although decreasing the marginal distribution difference between the source and target domains reduces the domain misalignment, the conditional distribution difference should also be considered for robust distribution adaptation. However, since the target domain lacks labels, conditional distribution adaptation is a nontrivial problem. We apply a classifier trained on the source samples to estimate the posterior probabilities of the target samples and thereby obtain their pseudo-labels. We rewrite the Bregman divergence measure to match the class-conditional distributions as follows:

$$\begin{aligned}&\textrm{Dist}^{\textrm{conditional}}\left( D_{\textrm{S}^c},D_{T^c}\right) \\&\quad =\int \left( \frac{1}{n_{\textrm{s}}^c}\sum _{i=1}^{n_{\textrm{s}}^c}\textbf{G}_{\sum _{1}}\left( \textbf{y}-\textbf{y}_{i}\right) \right) ^{2}d\textbf{y}\\&\quad \quad +\int \left( \frac{1}{n_{\textrm{t}}^c}\sum _{j={n_{\textrm{s}}^c}+1}^{{n_{\textrm{s}}^c} +{n_{\textrm{t}}^c}}\textbf{G}_{\sum _{2}}\left( \textbf{y}-\textbf{y}_{j}\right) \right) ^{2}d\textbf{y}\\&\quad \quad -\frac{2}{{n_{\textrm{s}}^c}{n_{\textrm{t}}^c}}\sum _{i=1}^{n_{\textrm{s}}^c} \sum _{j={n_{\textrm{s}}^c}+1}^{{n_{\textrm{s}}^c}+{n_{\textrm{t}}^c}}\textbf{G}_{\sum _{12}}\left( \textbf{y}_{i}-\textbf{y}_{j}\right) , \end{aligned}$$
(12)

where \( \textrm{Dist}^{\textrm{conditional}} \) is the class-conditional distribution distance between the source and target domains. Moreover, \( n^{c}_{\textrm{s}} \) and \( n^{c}_{\textrm{t}} \) denote the numbers of examples in the source and target domains that belong to class c, respectively. Also, \( D_{\textrm{S}^{c}} \) is the set of examples belonging to class c in the source data, and \( D_{T^{c}} \) is the set of examples belonging to class c in the target data. By minimizing \( \sum _{c=1}^C\textrm{Dist}^{\textrm{conditional}}(D_{\textrm{S}^{c}},D_{T^{c}}) \), the conditional distribution mismatches between \( D_{\textrm{S}^{c}} \) and \( D_{T^{c}} \) are reduced.
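One way to realize Eq. 12 in code is to pseudo-label the target samples with a classifier trained on the projected source data (a 1-NN here, matching the NN classifier used in the experiments) and to sum the per-class distances, reusing marginal_distance from the previous sketch. Ys and Yt are NumPy arrays of projected samples; the handling of empty classes is our assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def conditional_distance(Ys, ys, Yt, Sigma1, Sigma2):
    """Sum over classes of per-class BD terms (Eq. 12) using 1-NN pseudo-labels for the target."""
    ys = np.asarray(ys)
    yt_pseudo = ys[np.argmin(cdist(Yt, Ys), axis=1)]   # nearest source neighbour's label
    total = 0.0
    for c in np.unique(ys):
        Ys_c, Yt_c = Ys[ys == c], Yt[yt_pseudo == c]
        if len(Ys_c) == 0 or len(Yt_c) == 0:
            continue                                    # skip classes with no pseudo-labelled targets
        total += marginal_distance(Ys_c, Yt_c, Sigma1, Sigma2)  # from the previous sketch
    return total
```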

3.3 Unsupervised domain adaptation via transferred local Fisher discriminant analysis (TLFDA)

The intuition behind TLFDA is to minimize the marginal and conditional distribution mismatches between the source and target domains by finding an optimal couple of projection and classification models. To this end, LFDA is exploited as a dimensionality reduction method to find a latent subspace with the criteria embedded in Eqs. 3, 11, and 12 as follows:

$$\begin{aligned} J_{\textrm{TLFDA}}&=\textrm{argmin}_{J\in \mathbb {R}^{D\times d}}\Big [J_{\textrm{LFDA}} \\&\quad +\lambda \Big (\textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T}) \\&\quad +\sum _{c=1}^{C}\textrm{Dist}^{\textrm{conditional}}(D_{\textrm{S}^c},D_{T^c})\Big )\Big ], \end{aligned}$$
(13)

where \( \lambda \) denotes the regularization parameter that balances feature matching and domain adaptation. The first term is the LFDA criterion, which yields the transformation that maps samples into the latent subspace. The second and third terms minimize the marginal and conditional distribution mismatches across domains, respectively. Equation 13 can be solved using the gradient descent algorithm, i.e.,

$$\begin{aligned} J&\leftarrow J-\eta \Big (\partial _J J_{\textrm{LFDA}} +\lambda \Big (\partial _J \textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T})\\&\quad +\partial _J \sum _{c=1}^{C}\textrm{Dist}^{\textrm{conditional}}(D_{\textrm{S}^c},D_{T^c})\Big )\Big ), \end{aligned}$$
(14)

where \(\eta \) is the learning rate and \( \partial _J \) is the gradient with respect to J. The derivative of \( J_{\textrm{LFDA}} \) with respect to J is given by

$$\begin{aligned} \partial _J J_{\textrm{LFDA}}&=2\,\textrm{tr}^{-1}\left( J^{\textrm{T}} \tilde{S}^{(b)}J\right) \tilde{S}^{(w)}J\\&\quad -2\,\textrm{tr}^{-2}\left( J^{\textrm{T}} \tilde{S}^{(b)}J\right) \textrm{tr}\left( J^{\textrm{T}} \tilde{S}^{(w)}J\right) \tilde{S}^{(b)}J. \end{aligned}$$
(15)

To obtain the optimal linear subspace J in Eq. 13, a direct method is to optimize Eq. 13 with respect to J iteratively by adopting the gradient descent technique as follows:

$$\begin{aligned} J_{k+1}&=J_k-\eta (k)\Bigg ( \partial _J J_{\textrm{LFDA}}\\&\quad +\lambda \Bigg (\sum _{i=1}^{n_{\textrm{s}}+n_{\textrm{t}}}\frac{\partial \textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T})}{\partial \textbf{y}_{i}} \frac{\partial \textbf{y}_{i}}{\partial J}\\&\quad +\sum _{c=1}^{C}\sum _{i=1}^{n_{\textrm{s}}^c+n_{\textrm{t}}^c}\frac{\partial \textrm{Dist}^{\textrm{conditional}}(D_{\textrm{S}^c},D_{T^c})}{\partial \textbf{y}_{i}}\frac{\partial \textbf{y}_{i}}{\partial J}\Bigg )\Bigg ), \end{aligned}$$
(16)

where \( \eta (k) \) is the learning rate factor at the \( k\textrm{th} \) iteration that controls the gradient step size. According to the quadratic form of Eq. 11, the derivative of \( \textrm{Dist}^{\textrm{marginal}} \) with respect to J is

$$\begin{aligned}&\sum _{i=1}^{n_{\textrm{s}}+n_{\textrm{t}}}\frac{\partial \textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T})}{\partial \textbf{y}_{i}}\frac{\partial \textbf{y}_{i}}{\partial J}\\&\quad =\sum _{i=1}^{n_{\textrm{s}}}\frac{\partial \textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T})}{\partial \textbf{y}_{i}}\textbf{x}^{\textrm{T}}_{i}\\&\quad \quad +\sum _{i=n_{\textrm{s}}+1}^{n_{\textrm{s}}+n_{\textrm{t}}}\frac{\partial \textrm{Dist}^{\textrm{marginal}}(D_{\textrm{S}},D_{T})}{\partial \textbf{y}_{i}}\textbf{x}^{\textrm{T}}_{i}\\&\quad =\frac{1}{n_{\textrm{s}}^2} \sum _{s=1}^{n_{\textrm{s}}} \sum _{t=1}^{n_{\textrm{s}}} {\textbf{G}_{\sum _{11}}} (\textbf{y}_{\textrm{s}}-\textbf{y}_{\textrm{t}})\\&\quad \quad +\frac{1}{n_{\textrm{t}}^2} \sum _{s=n_{\textrm{s}}+1}^{n_{\textrm{s}}+n_{\textrm{t}}} \sum _{t=n_{\textrm{s}}+1}^{n_{\textrm{s}}+n_{\textrm{t}}} {\textbf{G}_{\sum _{22}}} (\textbf{y}_{\textrm{s}}-\textbf{y}_{\textrm{t}})\\&\quad \quad -\frac{1}{n_{\textrm{s}}n_{\textrm{t}}} \sum _{s=1}^{n_{\textrm{s}}} \sum _{t=n_{\textrm{s}}+1}^{n_{\textrm{s}}+n_{\textrm{t}}} {\textbf{G}_{\sum _{12}}} (\textbf{y}_{\textrm{s}}-\textbf{y}_{\textrm{t}}). \end{aligned}$$
(17)

Similarly, according to the quadratic form of Eq. 12, the derivative of \( \textrm{Dist}^{\textrm{conditional}} \) with respect to J is

$$\begin{aligned}&\sum _{i=1}^{n_s^c+n_t^c}\frac{\partial {\textrm{Dist}^{\textrm{conditional}}(D_{\textrm{S}^c},D_{T^c})}}{\partial \textbf{y}_{i}}\frac{\partial \textbf{y}_{i}}{\partial J}\\&\quad = \sum _{i=1}^{n_s^c}\frac{\partial \textrm{Dist}^{\textrm{conditional}}(D_{\textrm{S}^c},D_{T^c})}{\partial \textbf{y}_{i}}\textbf{x}^{\textrm{T}}_{i} \\&\quad \quad +\sum _{i=n_s^c+1}^{n_s^c+n_t^c}\frac{\partial \textrm{Dist}^{\textrm{conditional}}(D_{\textrm{S}^c},D_{T^c})}{\partial \textbf{y}_{i}}\textbf{x}^{\textrm{T}}_{i}\\&\quad =\frac{1}{(n_s^c)^2} \sum _{s=1}^{n_s^c} \sum _{t=1}^{n_s^c} {\textbf{G}_{\sum _{11}}} (\textbf{y}_{\textrm{s}}-\textbf{y}_{\textrm{t}})\\&\quad \quad +\frac{1}{(n_t^c)^{2}} \sum _{s=n_s^c+1}^{n_s^c+n_t^c} \sum _{t=n_s^c+1}^{n_s^c+n_t^c} {\textbf{G}_{\sum _{22}}} (\textbf{y}_{\textrm{s}}-\textbf{y}_{\textrm{t}})\\&\quad \quad -\frac{1}{n_s^cn_t^c} \sum _{s=1}^{n_s^c} \sum _{t=n_s^c+1}^{n_s^c+n_t^c} {\textbf{G}_{\sum _{12}}} (\textbf{y}_{\textrm{s}}-\textbf{y}_{\textrm{t}}). \end{aligned}$$
(18)

For any two Gaussian kernels, we have \( \int \textbf{G}_{\sum _1}(\textbf{y}-\textbf{y}_{\textrm{s}}) \textbf{G}_{\sum _2}(\textbf{y}-\textbf{y}_{\textrm{t}})d\textbf{y}=\textbf{G}_{\sum _1 + \sum _2}(\textbf{y}_s -\textbf{y}_{\textrm{t}}) \), where \( \sum _{11}=\sum _1 + \sum _1 \), \(\sum _{22}=\sum _2 + \sum _2 \), and \( \sum _{12}=\sum _1 + \sum _2 \). Based on Eqs. 15, 17, and 18, TLFDA is solved iteratively subject to \( J^{\textrm{T}}J=I \).
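Putting Eqs. 13–18 together, the optimization is a gradient-descent loop over J with the pseudo-labels refreshed at every iteration and the constraint \(J^{\textrm{T}}J=I\) re-imposed after each step. The schematic below reflects our understanding of that loop; grad_lfda, grad_marginal, grad_conditional, and predict are placeholders for Eqs. 15, 17, 18, and the pseudo-labeling classifier, and are not spelled out here.

```python
import numpy as np

def tlfda(Xs, ys, Xt, d, grad_lfda, grad_marginal, grad_conditional, predict,
          n_iter=20, eta=0.01, lam=1.0):
    """Schematic TLFDA optimization (Eqs. 13-16): descend on J, refine pseudo-labels, re-orthonormalize."""
    D = Xs.shape[1]
    J, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((D, d)))  # J^T J = I at start
    for _ in range(n_iter):
        Ys, Yt = Xs @ J, Xt @ J                       # project both domains
        yt_pseudo = predict(Ys, ys, Yt)               # refresh target pseudo-labels (Sect. 3.2)
        grad = (grad_lfda(Xs, ys, J)
                + lam * (grad_marginal(Xs, Xt, J)
                         + grad_conditional(Xs, ys, Xt, yt_pseudo, J)))
        J, _ = np.linalg.qr(J - eta * grad)           # gradient step, then re-impose J^T J = I
    return J
```

Re-orthonormalizing with a QR factorization is one simple way to honor the constraint; the paper only states the constraint, so this particular choice is ours.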

We now analyze the computational complexity of TLFDA using big O notation, where \( n_s \) and \( n_t \) denote the numbers of source and target samples, respectively. Moreover, \( n_s^c \) and \( n_t^c \) denote the numbers of instances in the source and target domains that belong to class c, respectively. The computational costs of Eqs. 15, 17, and 18 are \( O((n_s + n_t)^2) \), \( O((n_s + n_t)^2) \), and \( O((n_s^c + n_t^c)^2) \), respectively. Hence, the total computational complexity of TLFDA is \( O((n_s + n_t)^2) \).

4 Experimental setup

In this section, we introduce the domain adaptation benchmark datasets and the implementation details of TLFDA and other compared methods.

4.1 Data description

We apply our experiments on the following four visual domain adaptation benchmarks: Office+Caltech-256 (Surf) [41, 42], Office+Caltech-256 (Decaf6) [43], Digit (USPS [44] and MNIST [45]), and CMU-PIE [46].

The Office benchmark contains three domains with 31 object classes, whose images are either downloaded from commercial sites (e.g., amazon.com), taken with high-resolution digital SLR cameras, or captured by low-resolution webcams. The set contains 4110 images and has 10 common classes with a minimum of 7 and a maximum of 100 examples per class over the three domains. Caltech-256 includes images in which objects appear in several different poses; thus, its images are not generally aligned. The dataset contains 256 categories with a minimum of 80 and a maximum of 827 images in each category. We build 12 cross-domain tasks from the four Office+Caltech-256 (Surf) domains by considering each pair of different domains as the source and target domains. Also, we utilize Decaf6 (deep convolutional activation feature) features with 4096 dimensions normalized to unit vectors. Decaf6 features are the activation values of the \(6\textrm{th}\) layer of a convolutional neural network (CNN) trained on the ImageNet dataset [43]. In this way, we can compare the effectiveness of TLFDA with traditional as well as deep DA methods.

Table 1 Classification accuracy (%) of the proposed method on Office+Caltech-256 (Surf) and Digits datasets

The USPS (U) and MNIST (M) domains are popular handwritten digit benchmarks with different statistics and distributions. The USPS dataset contains 7291 training and 2007 test images (9298 images overall) of size \( 16 \times 16 \) scanned from envelopes of the US Postal Service. The MNIST dataset contains 60,000 training and 10,000 test images of size \( 28 \times 28 \) collected from American Census Bureau employees and American high school students. All images of the USPS and MNIST datasets are resized to \( 16 \times 16 \) in grayscale. We therefore design two cross-domain tasks: \(U\longrightarrow M\) and \(M\longrightarrow U\).

PIE is a face benchmark containing 41,368 images of size \( 32 \times 32 \) captured from 68 individuals. All images are taken by 13 synchronized cameras and 21 flashes with different poses, illuminations, and expressions. Depending on the pose, the dataset is divided into five subsets: PIE1 (C05, left pose), PIE2 (C07, upward pose), PIE3 (C09, downward pose), PIE4 (C27, front pose), and PIE5 (C29, right pose). Hence, twenty cross-domain tasks are conducted as follows: \(P1\longrightarrow P2\), \(P1\longrightarrow P3,\ldots , P5\longrightarrow P4\).

4.2 Method evaluation

The performance evaluation of our proposed approach is conducted on four DA datasets against two baseline machine learning methods (LPP and FDA) and ten state-of-the-art DA methods (JACRL [47], RTML [48], VDA [23], JGSA [49], CLGA [24], DICD [25], DGA-DA [26], JDA-CDMA [50], TTLC [27], and DOLL-DA [51]). Since TLFDA and the other DA methods are dimensionality reduction techniques, we exploit an NN classifier to obtain the classification results. Furthermore, TLFDA is compared with the best reported results of the compared methods. We also evaluate the performance of TLFDA on the Office+Caltech-256 (Decaf6) dataset against a baseline deep method, AlexNet [52], and deep DA methods including DDC [53], AELM [54], and ELM [54]. Moreover, we evaluate TLFDA against DA methods including PUnDA [55], TAISL [56], SCA [57], and TIT [58] on the Office+Caltech-256 (Decaf6) benchmark.

4.3 Implementation details

To fairly test and compare TLFDA with other methods, we use the classification accuracy on the target domain \( (D_t) \) as the evaluation metric:

$$\begin{aligned} \textrm{Accuracy}=\frac{\mid \{x : x\in D_t \wedge f(x)=y(x)\}\mid }{n_t}, \end{aligned}$$
(19)

where f(x) is the learned prediction function and y(x) is the correct label of sample x. In addition, \( n_t \) is the number of target domain samples.
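Eq. 19 amounts to the fraction of correctly labeled target samples; a one-line check, with hypothetical label arrays:

```python
import numpy as np

def accuracy(pred, true):
    """Eq. 19: share of target samples whose predicted label matches the ground truth."""
    return float(np.mean(np.asarray(pred) == np.asarray(true)))
```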

Furthermore, TLFDA has the following two free parameters: (1) \( \lambda \), the regularization parameter in Eq. 13 that controls the trade-off between feature matching and Bregman divergence, and (2) \( \eta (k) \), the learning rate factor at iteration k in Eq. 16, which controls the gradient step size at the \( k\textrm{th} \) iteration. Additionally, the number of iterations for TLFDA to converge is set to 20.

5 Experimental results and discussion

In this section, we compare the performance of TLFDA and the other methods on a variety of visual DA benchmarks.

Table 2 Classification accuracy (%) of the proposed method on PIE dataset
Fig. 1 Classification accuracy (%) on Office+Caltech-256 (Surf) and Digits datasets

Fig. 2 Classification accuracy (%) on PIE datasets. a the first ten tasks, b the second ten tasks

Table 3 Classification accuracy (%) of the proposed method on Office+Caltech-256 (Decaf6) datasets

5.1 Result evaluation

Tables 1 and 2 show the classification accuracies of TLFDA and the other machine learning and DA approaches on the object recognition (Office+Caltech-256 (Surf)), handwritten digit recognition (USPS, MNIST), and face (PIE) datasets, respectively. The results are visualized in Figs. 1 and 2 for better interpretation.

Due to the mismatched distributions between the training and test datasets, the performance improvement of TLFDA over FDA and LPP is (48.34%) and (57.23%) on the PIE dataset, respectively. In comparison with the best compared approach, TTLC, on the PIE dataset, TLFDA achieves a (1.32%) performance improvement. Moreover, TLFDA outperforms LPP and FDA in all tasks and TTLC in 11 tasks on the PIE dataset. In the rest of this section, our proposed method is compared with the other methods in detail.

Fig. 3 Classification accuracy (%) on Office+Caltech-256 (Decaf6) datasets

Most of the available methods that are based on FDA or LPP criteria do not achieve the desired results in the case of domain shift because they do not consider the distribution mismatch between the source and target domains. FDA is a well-known approach that projects data into a low-dimensional subspace with a linear combination of features. When dealing with shifted data, FDA generally cannot transfer knowledge across domains.

LPP is another classical dimensionality reduction method that finds an embedding preserving the local information. LPP uses a graph to model the geometrical structure of the data. Despite its efficiency, LPP cannot reduce the distribution divergence between the source and target domains due to the significant dissimilarity of scatters from the mean. Nevertheless, FDA performs better than LPP because it uses discriminative information in the feature extraction step. The results illustrate that TLFDA achieves (11.7%) and (13.07%) average classification accuracy improvements over FDA and LPP on the Office+Caltech-256 (Surf) dataset, respectively. The performance improvement of TLFDA over FDA on the Digits dataset is (16.5%). Also, TLFDA obtains a (30.38%) improvement compared with LPP on the Digits dataset.

JACRL is another state-of-the-art transfer learning method that reduces the structural risk and the distribution mismatch across domains. JACRL learns an adaptive classifier by maximizing its manifold consistency. However, TLFDA outperforms JACRL in most cases because it considers both the discriminative information contained in the training samples and the distribution bias between the training and test sets. TLFDA gains (7.18%) and (17.78%) performance improvements compared with JACRL on the object+digit and PIE datasets, respectively.

Fig. 4 Classification accuracy \((\%)\) with respect to the number of iterations for Office+Caltech-256 (Surf) dataset. a \(C \longrightarrow A\). b \(C \longrightarrow W\). c \(C \longrightarrow D\). d \(A \longrightarrow C\). e \(A \longrightarrow W\). f \(A \longrightarrow D\). g \(W \longrightarrow C\). h \(W \longrightarrow A\). i \(W \longrightarrow D\). j \(D \longrightarrow C\). k \(D \longrightarrow A\). l \(D \longrightarrow W\)

Fig. 5 Classification accuracy \((\%)\) with respect to the number of iterations for Digits dataset. a USPS versus MNIST. b MNIST versus USPS

Fig. 6 Classification accuracy \((\%)\) with respect to the number of iterations for PIE dataset. a \(P1 \longrightarrow P2\). b \(P1 \longrightarrow P3\). c \(P1 \longrightarrow P4\). d \(P1 \longrightarrow P5\). e \(P2 \longrightarrow P1\). f \(P2 \longrightarrow P3\). g \(P2 \longrightarrow P4\). h \(P2 \longrightarrow P5\). i \(P3\longrightarrow P1\). j \(P3 \longrightarrow P2\). k \(P3 \longrightarrow P4\). l \(P3 \longrightarrow P5\). m \(P4\longrightarrow P1\). n \(P4 \longrightarrow P2\). o \(P4 \longrightarrow P3\). p \(P4 \longrightarrow P5\). q \(P5\longrightarrow P1\). r \(P5 \longrightarrow P2\). s \(P5 \longrightarrow P3\). t \(P5 \longrightarrow P4\)

RTML exploits the knowledge transfer to alleviate the domain shift in two directions, i.e., sample and feature space. RTML aims to build a cross-domain metric to reduce the mismatches across domains. However, TLFDA jointly benefits from representation and classification learning to adapt the source and target domains. TLFDA uses the source domain labels to construct the shared low-dimensional subspace and discriminate across various classes. TLFDA achieves (5.05%) and (29.05%) performance improvement in average classification accuracy compared to RTML on object+digit and PIE datasets, respectively.

VDA is a novel technique that exploits joint DA and TL to create a shared feature space that is robust against distribution mismatch. VDA discriminates different classes in the latent subspace by employing the domain invariant clustering technique. VDA only seeks to align the marginal and conditional distributions across the source and target domains, while it ignores the discriminative properties between different classes in the adapted domain. However, TLFDA minimizes the distances between the marginal and the conditional distributions of domains, while the specific information of domains (i.e., data manifold structure and within-class local structure) is preserved. TLFDA achieves (5.28%) and (16.85%) performance improvement in average classification accuracy compared to VDA on object+digit and PIE datasets, respectively.

JGSA proposes a unified framework to minimize the shift between domains both geometrically and statistically using both shared and domain-specific features. JGSA aligns the source and target domains even under high divergence. However, jointly aligning the marginal and conditional distributions between domains does not explicitly enforce data discrimination in the obtained feature representation. TLFDA gains (3.64%) and (10.86%) performance improvements compared with JGSA on the object+digit and face datasets, respectively.

CLGA finds a unified subspace where the marginal and conditional distributions are globally matched. CLGA locally adapts both domains using the class and domain manifold structures. CLGA measures the distance across distributions via MMD whereas TLFDA employs Bregman divergence as a measurement metric. Moreover, TLFDA uses Bregman divergence to preserve the discrimination ability. TLFDA outperforms CLGA in 10 tasks out of 14 domain adaptation tasks and gains (6.54%) improvement on object+digit datasets. Also, TLFDA works better than CLGA with (15.7%) improvement on the face dataset.

DICD adapts both the marginal and conditional distribution disparities in a shared subspace. DICD discriminates classes by minimizing the distances between sample pairs of the same class in both domains and maximizing the distances between samples with different class labels in both domains. However, TLFDA discriminates classes by preserving the intra-class and inter-class scatters through Bregman divergence. TLFDA outperforms DICD with (4.96%) and (14.76%) performance improvements on the object+digit and face datasets, respectively.

DGA-DA, as a novel domain adaptation method, aligns the marginal and conditional distributions in a unified subspace using a non-parametric measurement method. Unlike DGA-DA, TLFDA measures the distribution disparities across domains via Bregman divergence and preserves the discriminative information of the data through it, whereas DGA-DA defines a repulsive force term to discriminate between different classes. TLFDA works better than DGA-DA with a (4.19%) accuracy improvement on the object+digit datasets. Also, TLFDA outperforms DGA-DA in 18 out of 20 tasks on the face dataset with a (22.75%) accuracy improvement.

JDA-CDMA proposes a measurement metric called cross-domain mean approximation (CDMA) to estimate the distances between the source and target samples. Based on CDMA, JDA-CDMA then reduces the marginal and conditional divergences between both domains in a shared subspace. Although CDMA has lower computational complexity than Bregman divergence, TLFDA gains (2.67%) and (5.67%) improvements over JDA-CDMA on the object+digit and face datasets, respectively.

Fig. 7 Parameter evaluation with respect to classification accuracy \((\%)\) and parameter, \(\lambda \), for Office+Caltech-256 (Surf) dataset. TLFDA obtains considerable results with large values of \(\lambda \). We consider \(\lambda =1\) for Office+Caltech-256 (Surf) dataset. a \(C \longrightarrow A\). b \(C \longrightarrow W\). c \(C \longrightarrow D\). d \(A \longrightarrow C\). e \(A \longrightarrow W\). f \(A \longrightarrow D\). g \(W \longrightarrow C\). h \(W \longrightarrow A\). i \(W \longrightarrow D\). j \(D \longrightarrow C\). k \(D \longrightarrow A\). l \(D \longrightarrow W\)

Fig. 8 Parameter evaluation with respect to the classification accuracy \((\%)\) and the regularization parameter, \(\lambda \), for Digits dataset. TLFDA performs well on Digits dataset with small values of \(\lambda \). We adjust \(\lambda =0.01\) on Digits dataset. a USPS versus MNIST. b MNIST versus USPS

Fig. 9 Parameter evaluation with respect to the classification accuracy \((\%)\) and parameter, \(\lambda \) for PIE dataset. In most cases, TLFDA has better performance with \(\lambda \in [0.00001 \, \, 1]\). The optimal value of \(\lambda \) is 0.5 for PIE dataset. a \(P1 \longrightarrow P2\). b \(P1 \longrightarrow P3\). c \(P1 \longrightarrow P4\). d \(P1 \longrightarrow P5\). e \(P2 \longrightarrow P1\). f \(P2 \longrightarrow P3\). g \(P2 \longrightarrow P4\). h \(P2 \longrightarrow P5\). i \(P3\longrightarrow P1\). j \(P3 \longrightarrow P2\). k \(P3 \longrightarrow P4\). l \(P3 \longrightarrow P5\). m \(P4\longrightarrow P1\). n \(P4 \longrightarrow P2\). o \(P4 \longrightarrow P3\). p \(P4 \longrightarrow P5\). q \(P5\longrightarrow P1\). r \(P5 \longrightarrow P2\). s \(P5 \longrightarrow P3\). t \(P5 \longrightarrow P4\)

TTLC aligns the marginal and conditional distributions in two respective subspaces via MMD, whereas TLFDA adapts the marginal and conditional distribution disparities in a unified subspace via Bregman divergence. TTLC discriminates classes by creating condensed clusters in both domains: it minimizes the distances between sample pairs of the same class in both domains and maximizes the distances between instance pairs of different classes. TLFDA gains (1.21%) and (1.32%) accuracy improvements on the object+digit and face datasets, respectively.

DOLL-DA projects both domains into a common subspace by decreasing the marginal and conditional distribution discrepancies through adding a repulsive force to the MMD constraint. DOLL-DA uses a label embedding trick to propose an orthogonal label subspace and introduces a noise-robust sparse orthogonal label regression term to prevent negative transfer and overfitting. However, TLFDA benefits from Bregman divergence as its measurement method. TLFDA outperforms DOLL-DA in 11 out of 14 tasks on the object+digit datasets. Also, TLFDA achieves a (5.35%) performance improvement over DOLL-DA on the face benchmark.

Figures 1 and 2 compare TLFDA with DA methods including VDA, JGSA, and DGA-DA on the Office+Caltech-256 (Surf) and Digit benchmarks with 14 tasks and on the PIE dataset with 20 tasks, respectively. TLFDA outperforms VDA in 10 out of 14 tasks and in 15 out of 20 tasks on the PIE dataset. Figure 1 shows that TLFDA performs better than JGSA in 9 tasks, and Fig. 2 shows that TLFDA outperforms JGSA in all tasks on the PIE dataset. Moreover, TLFDA has better accuracy than DGA-DA in 7 tasks on the Office+Caltech-256 (Surf) and Digit datasets and in 18 tasks on the PIE dataset.

In recent years, deep DA approaches have achieved high performance. To compare the effectiveness of TLFDA with deep methods, we train TLFDA on the Office+Caltech-256 (Decaf6) datasets. The experimental results are shown in Table 3. According to the results, TLFDA outperforms deep methods including AlexNet, DDC, AELM, and ELM in most cross-domain tasks. TLFDA works better than AlexNet and DDC with (3.33%) and (1.23%) improvements, respectively. TLFDA outperforms the ELM, PUnDA, and TAISL methods in most cases, and the results are visualized in Fig. 3. To be precise, Fig. 3 shows that TLFDA is better than ELM and PUnDA in 10 tasks and outperforms TAISL in 11 tasks of the Office+Caltech-256 (Decaf6) dataset. Moreover, TLFDA outperforms the domain adaptation methods SCA and TIT on Office+Caltech-256 (Decaf6) with (3.55%) and (1.23%) improvements, respectively. Although deep methods achieve strong performance, they need to be trained on large amounts of data for reliable prediction, whereas TLFDA outperforms them while being trained on a reasonable number of samples. Deep methods also have high time complexity and need powerful processing hardware such as GPUs, whereas TLFDA can be run on a medium-power CPU. Its processing simplicity and reliable predictions from a moderate number of samples make TLFDA preferable to deep DA methods.

As the main findings of this study, TLFDA decreases the marginal and conditional distribution discrepancies via Bregman divergence in the mapped subspace. Moreover, TLFDA iteratively predicts pseudo-labels of the target domain via a model trained on the source domain.

5.2 Effectiveness evaluation

We conduct a variety of experiments over 20 iterations to evaluate the efficiency of TLFDA. We run TLFDA, JGSA, and VDA for 20 iterations on the Office+Caltech-256 (Surf), Digit, and PIE datasets. Figures 4, 5, and 6 illustrate the performance of TLFDA and the two baseline methods with respect to the number of iterations on the different benchmarks. As the figures show, TLFDA outperforms the best baseline method, JGSA, on all datasets. Our proposed approach significantly reduces the difference between the marginal and conditional distributions of the source and target domains. TLFDA predicts accurate labels for the target samples in an iterative manner, and in almost every iteration the predicted labels are better than those of the previous one.

Table 4 Ablation study of 3 variants of TLFDA

The convergence of TLFDA is evaluated over 20 iterations and its results are compared against JGSA and VDA. We run TLFDA on the Office+Caltech-256 (Surf), digit, and face datasets for 20 iterations and show the results in Figs. 4, 5, and 6, respectively. As is clear from the figures, TLFDA converges within 20 iterations in most cases. Although TLFDA fluctuates in some cases, it stays within a limited interval after 20 iterations, and increasing the number of iterations does not noticeably improve the performance of the algorithm.

5.3 Parameter and ablation study

In Fig. 7, the classification accuracies of TLFDA and baseline methods are shown with respect to parameter \(\lambda \) on Office+Caltech-256 (Surf) dataset. \(\lambda =1\) is chosen for Office+Caltech-256 (Surf) dataset. Figure 8 shows the parameter evaluation with respect to the classification accuracy and parameter \(\lambda \in [0.00001 \, \, 10]\) for the Digits dataset. The reported results demonstrate that TLFDA operates well on the Digits dataset with small values of \(\lambda \). Figure 9 illustrates the experimental results for parameter \(\lambda \in [0.00001 \, \, 10]\) on the PIE dataset. As is obvious from the sub-figures, TLFDA has better results with \(\lambda =0.5\) in most cases.

The performance of TLFDA is also evaluated under different settings. To understand our model more deeply, we evaluate several variants, i.e., (1) \( \textrm{TLFDA}_M \), obtained by eliminating the conditional distribution adaptation, (2) \( \textrm{TLFDA}_C \), obtained by eliminating the marginal distribution adaptation, and (3) \( \textrm{TLFDA}_{M+C} \), obtained by removing both the marginal and conditional distribution adaptation. The evaluation results are shown in Table 4. The results indicate that \( \textrm{TLFDA}_{M+C} \) performs worse than the other two variants, whereas \( \textrm{TLFDA}_C \) works better than the others, since minimizing the diversity of the conditional distributions is crucial for robust distribution adaptation. However, none of the three variants achieves better results than TLFDA. In fact, TLFDA constructs an effective feature representation under significant distribution misalignment across domains because it jointly aligns both the marginal and conditional distributions.

6 Conclusion and future works

In this paper, unsupervised domain adaptation via transferred local Fisher discriminant analysis (TLFDA) is proposed to deal with shifted and biased data in cross-domain problems. In TLFDA, a projection matrix is utilized to map the source and target domains into a shared subspace. Moreover, TLFDA reduces the joint marginal and conditional distribution mismatches based on Bregman divergence minimization. Experimental results on different visual benchmarks illustrate that TLFDA achieves better performance when shifts and biased data exist across domains. However, TLFDA has not been investigated for big data. In the future, we plan to merge transfer learning and deep learning for big data problems, enabling deep networks to transfer knowledge across sets with different distributions.