1 Introduction

Machine learning and pattern recognition tasks often assume that the training and test data come from similar distributions and feature spaces [1]. However, this assumption is unrealistic for many real-world applications where, due to the lack of labeled training data, we have to benefit from other existing and related domains. In this situation, because of the distribution mismatch between the training and test data, the trained model may perform poorly on the test data [2]. For example, in sentiment classification, reviews of books differ significantly in distribution from reviews of electronic devices [3]. Hence, when labels are not available for the test data, we have to adapt learning data from other related domains. The distribution difference between the training and test sets is known as the domain shift problem.

To address the domain shift issue, domain adaptation (DA) [4] and transfer learning (TL) [5] have provided major solutions in recent years. In DA, the knowledge from an already trained machine learning model is transferred to a different but related problem. In other words, DA tries to improve generalization in one task by employing what has been learned in another task, and learns a robust classifier that deals with the distribution mismatch between the source and target domains. According to the information available in the target domain, DA approaches are divided into two general categories: unsupervised DA, where there are no labeled data in the target domain [3, 6,7,8,9,10], and semi-supervised DA, where the target domain contains a small amount of labeled data [11,12,13]. Both unsupervised and semi-supervised DA can benefit from either a single source domain [14,15,16,17] or multiple source domains [18,19,20,21] to transfer knowledge across domains.

Since the source and target domains have different distributions, the key to successful adaptation is reducing the distribution divergence. To this end, existing DA approaches fall into the following three categories: (1) instance-based transfer learning approaches, in which the source domain samples are reweighted to have a distribution similar to the target samples [22,23,24,25], (2) feature-based transfer learning approaches, which project the source and target data into a common subspace with shared features [8, 17, 26,27,28,29,30,31] and (3) model-based transfer learning approaches, in which an adaptive classifier is modeled using the joint parameters and priors of the learned model [32,33,34,35]. In this paper, our focus is on feature-based and model-based transfer learning. However, there are two important challenges in existing works, i.e., defective transformation and unevaluated discriminant analysis.

Defective transformation means that both feature learning and model learning approaches can only reduce, but not remove, the distribution mismatch [36, 37]. In particular, feature learning [17, 28,29,30,31] conducts feature transformation to obtain a better feature representation. However, the feature mismatch is not removed after transformation [38] since the feature transformation only exploits the manifold and structure of data but does not strengthen the model against cross-domain changes. Also, model learning usually adapts the priors and parameters of the model in the original feature space, where the features are often mismatched, which makes it difficult to minimize the discrepancy across domains. Therefore, it is essential to combine feature learning and model learning to further facilitate DA.

Unevaluated discriminant analysis means that existing FLDA-based works [39, 40] only attempt to project the training and test samples into a low-dimensional subspace based on maximum class discrimination; they fail to evaluate the distribution difference across domains during the discriminant analysis. In contrast, an iterative FLDA can exploit pseudotarget labels to customize the FLDA criteria and adapt multiple source domains to a target domain.

As far as we know, no previous work tackles these two challenges together. In this work, we propose a novel cross- and multiple-domains visual transfer learning via iterative Fisher linear discriminant analysis (CIDA) approach to address both the defective transformation and the unevaluated discriminant analysis challenges. CIDA learns a domain-invariant classifier in an iterative FLDA-based embedding with empirical risk minimization, while performing hybrid distribution alignment by considering the different importance of the criteria on the embedded subspace. This work makes the following contributions:

  1. CIDA addresses the challenges of both defective transformation and unevaluated discriminant analysis. It strengthens the model against cross-domain changes and minimizes the cross-domain discrepancy by benefiting from both feature learning and model learning.

  2. CIDA focuses on multi-source DA, where it exploits multiple knowledge resources to transfer across domains. The experimental results indicate that the existence of multiple related resources facilitates the adaptation tasks.

  3. CIDA employs an iterative FLDA method to estimate pseudotarget labels for a better transformation of the data in a hybrid manner. It evaluates the distribution difference across domains during the discriminant analysis and uses the pseudotarget labels to customize the FLDA criteria so that multiple source domains are adapted to the target domain.

  4. CIDA is evaluated on nine benchmark domain adaptation datasets. The experiments assess the robustness of CIDA under various situations, and the results illustrate that CIDA outperforms both baseline machine learning and state-of-the-art transfer learning approaches.

The rest of the paper is organized as follows. The next section presents a short review of the DA literature. The proposed method is introduced in Sect. 3. The experimental setup and implementation details are explained in Sect. 4. Section 5 includes the experimental results and discussion. The last section contains the conclusion and future work.

2 Related work

In this section, two lines of related work are discussed to highlight the difference between the proposed algorithm and available works: (1) dimensionality reduction-based transfer learning and (2) multi-source transfer learning.

2.1 Dimensionality reduction-based transfer learning

Dimensionality reduction is a well-known family of representation learning techniques. In general, most dimensionality reduction approaches follow two main frameworks: (1) the PCA-based (principal component analysis) framework [29, 41, 42], which attempts to project data into a low-dimensional space while preserving maximum variance in the embedded subspace, and (2) the FLDA-based framework [43,44,45], which attempts to project data into a low-dimensional space while maximizing class discrimination. However, both the PCA- and FLDA-based frameworks show poor performance in the presence of the domain shift problem, where the source and target domains follow different distributions.

There are several PCA-based approaches, such as transfer component analysis (TCA) [29], joint distribution adaptation (JDA) [41] and visual domain adaptation (VDA) [42], which exploit PCA to embed data into a latent subspace. TCA is an efficient feature extraction method that finds the transferred components of the input data based on variance maximization and mismatch minimization. TCA benefits from the maximum mean discrepancy (MMD) [46] to measure the distribution difference between the source and target domains and is one of the benchmark approaches in the DA literature.

JDA is another transfer learning approach that aims to learn a common feature subspace that jointly decreases the marginal and conditional distribution differences between the source and target domains; it also utilizes MMD to measure the distance between the source and target domains. VDA is a framework that constructs a shared feature representation while minimizing the joint marginal and conditional distribution differences between the source and target domains. In fact, VDA preserves the statistical and geometrical structure of the input data using manifold assumptions. In addition, VDA exploits domain-invariant clustering in an embedded subspace to discriminate the various classes of target data.

The main drawback of PCA-based approaches is that most of them embed data in a low-dimensional subspace without considering a class discrimination criterion. In contrast, FLDA-based approaches consider the class discrimination criterion alongside the domain adaptation criteria to handle the distribution mismatch across domains. Wenting et al. [44] proposed an effective framework that finds a common feature representation that maximizes the difference between classes (class-separate objective) and minimizes the difference between domains (domain-merge objective).

Cuong et al. [43] introduced a generalized Fisher-based method for the domain shift problem (FIDOS) that constructs a shared feature representation while minimizing the within-class scatter and maximizing class discrimination. Zheng et al. [45] proposed transferred dimensionality reduction (TDA), an iterative method that utilizes a clustering procedure to predict the labels of unlabeled target data. TDA combines dimensionality reduction with distribution discrepancy minimization across the source and target domains.

2.2 Multi-source transfer learning

In recent years, multi-source transfer learning has attracted the interest of researchers, since multiple sources are generally available for knowledge transfer to the target [47]. Although tapping multiple sources provides more knowledge, it also leads to a more challenging domain adaptation problem, since the multiple sources may have large mismatches with each other. To this end, dozens of methods have been proposed to deal with multi-source problems [48,49,50,51,52].

Transfer learning for multiple-domain sentiment analysis [48] is a Bayesian probabilistic model that handles multiple source and multiple target domains. The method uses Gibbs sampling to infer the model parameters from unlabeled and labeled data. Multi-domain collaborative filtering (MCF) [52] is a probabilistic method that exploits probabilistic matrix factorization to model the rating problem in various domains. MCF transfers knowledge across different domains by automatically learning the correlation between domains.

Conditional probability-based multi-source domain adaptation (CP-MDA) [51] is a multi-source domain adaptation method for recognizing different stages of fatigue from surface electromyography signals while tackling the distribution differences. CP-MDA employs a weighting scheme to address the conditional probability distribution differences across multiple domains. Boosting for transfer learning with multiple sources [47] extends the boosting framework to transfer knowledge from multiple sources and addresses the negative transfer problem when importing knowledge from multiple sources. Multi-domain adaptation for sentiment classification (MCS) [50] adapts classifiers to a specific domain via multiple source domains; it combines base classifiers to select automatically labeled instances from the unlabeled target data.

Different from the existing models, our CIDA extracts an embedded shared subspace in which within-class and between-class scatter regularizers are developed to couple multiple sources during knowledge transfer. Compared with [43], our model uses more general criteria to extract high-ranked subspaces. Furthermore, we propose feature and model learning regularizers to further strengthen the supervised knowledge from the multiple sources and the intrinsic information of the target.

3 Proposed method

In this section, we introduce our CIDA approach in detail for efficiently addressing the unsupervised domain shift problem.

3.1 Motivation

In this work, we propose a new FLDA-based framework that projects the input data into an embedded subspace based on the following criteria: (1) the source and target data should follow similar distributions after projection, (2) the solution should combine representation learning and model learning, and (3) intermediate pseudolabel predictions should be refined until the results converge. In the rest of this section, the preliminaries and problem description are presented in full detail.

3.2 Problem description

Definition 1

(Domain) A domain \(\mathcal {D}\) is comprised of \(\{\mathcal {X},P(X)\}\) where \(\mathcal {X}\) is an m-dimensional feature space and P(X) is a marginal probability distribution on \(\mathcal {X}\) where \(X= \{x_1,\dots ,x_n\}\in \mathcal {X}\). The input data include two domains, the source domain S and the target domain T. We denote the source domain as \(\mathcal {D}_s=\{(x_1,y_1),\dots ,(x_{n_s},y_{n_s})\}\), which is completely labeled. Similarly, we define the target domain as \(\mathcal {D}_t=\{x_{n_s+1},\dots ,x_{n_s+n_t}\}\), which is fully unlabeled. Also, \(n_s\) and \(n_t\) denote the number of source and target samples, respectively.

Definition 2

(Task) Given a specific domain \(\mathcal {D}\), a task for domain \(\mathcal {D}\) is denoted by \(\mathcal {T} =\{\mathcal {Y},f(x)\}\), which is composed of the following two components: \(\mathcal {Y}\) is the set of labels of domain \(\mathcal {D}\) and f(x) is a classifier that can be employed to predict the corresponding label of a data point x. From a probabilistic standpoint, f(x) can be expressed as the conditional probability distribution, i.e., \(f(x) = Q(y \mid x)\) where \(y \in \mathcal {Y}\).

The domain shift problem is considered with \(N_s\) source domains and a single target domain. Therefore, the input is a collection of related source domains \(X^S=\{X^1,X^2,\dots ,X^{N_s}\}\) and the output is a linear mapping that transforms data into an embedded subspace in order to predict the labels of the target data \(X^T\). Since the distribution difference between the source and target domains degrades the performance of the model, in this paper, we learn a feature representation in which the marginal distribution difference between the source and target domains is reduced, i.e., \(P_s(x_s) \approx P_t(x_t)\), where \(P_s(x_s)\) and \(P_t(x_t)\) are the marginal probability distributions of the source and target domains, respectively. Moreover, \(\mathcal {X}_s=\mathcal {X}_t\), where \(\mathcal {X}_s\) and \(\mathcal {X}_t\) are the feature spaces of the source and target domains, in turn. In fact, CIDA attempts to learn a shared low-dimensional feature space in which the marginal distributions of the source and target domains are similar.

3.3 Generating domain invariant representation

In this section, we first introduce the classical FLDA and then propose our CIDA, which is based on FLDA.

3.3.1 Feature extraction using classical FLDA

The main objective of FLDA is to model one dependent variable as a linear combination of other variables. In this way, FLDA extracts new features of a domain as linear combinations of the available features. In fact, FLDA attempts to maximize the class-separation degree by incorporating the following two criteria: (1) it maximizes the between-class scatter matrix (\(S_b\)) and (2) it minimizes the within-class scatter matrix (\(S_w\)), such that the samples in the embedded subspace have maximum discrimination.

\(S_b\) and \(S_w\) are defined as follows, where K and \(n_i\) denote the number of available classes and the number of samples belonging to class i, respectively:

$$\begin{aligned} S_b= & {} \sum _{i=1}^{K} p_i (\mu _i-\mu )(\mu _i-\mu )^T \end{aligned}$$
(1)
$$\begin{aligned} S_w= & {} \sum _{i=1}^{K}\sum _{j=1}^{n_i}(x_i^j-\mu _i)(x_i^j-\mu _i)^T \end{aligned}$$
(2)

where \(p_i=\frac{n_i}{N}\) is the prior of class i, N is the total number of samples, \(\mu _i=\frac{1}{n_i}\sum _{j=1}^{n_i}x^{j}_{i}\) is the mean vector of class i, \(\mu =\frac{1}{N}\sum _{i=1}^{N}x^{i}\) is the overall data mean, and \(x_i^j\) is the \(j^{th}\) sample in the \(i^{th}\) class. Therefore, the projection matrix of FLDA, i.e., the matrix A, is obtained by maximizing the following optimization problem J(A):

$$\begin{aligned} J(A)=\frac{A^TS_bA}{A^TS_wA}. \end{aligned}$$
(3)

The intuition behind maximizing J(A) is to learn a projection matrix \(A\in R^{m\times k}\) that transforms data from the original feature space, composed of m features, into a low-dimensional subspace with k features (i.e., \(k<m\)). The optimization problem J(A) can be solved by the eigenvalue decomposition of \({S_w}^{-1}S_b\), where the k eigenvectors of \({S_w}^{-1}S_b\) corresponding to the k largest eigenvalues are chosen as the matrix A.
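As a concrete illustration of this step, the following minimal sketch (our own, not code released with the paper) builds \(S_b\) and \(S_w\) from Eqs. 1 and 2 and obtains the projection matrix A from the leading eigenvectors of \({S_w}^{-1}S_b\); the function name and signature are assumptions for illustration.

```python
import numpy as np

def flda_projection(X, y, k):
    """X: (N, m) data matrix, y: (N,) integer class labels, k: target dimensionality (k < m)."""
    N, m = X.shape
    mu = X.mean(axis=0)                       # overall mean, used in Eq. (1)
    S_b = np.zeros((m, m))                    # between-class scatter, Eq. (1)
    S_w = np.zeros((m, m))                    # within-class scatter, Eq. (2)
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        d = (mu_c - mu).reshape(-1, 1)
        S_b += (len(X_c) / N) * d @ d.T       # class prior p_i = n_i / N
        S_w += (X_c - mu_c).T @ (X_c - mu_c)
    # Eq. (3): keep the k leading eigenvectors of S_w^{-1} S_b as the columns of A.
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    A = vecs[:, np.argsort(-vals.real)[:k]].real
    return A                                  # A in R^{m x k}; project data with Z = X @ A
```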

3.3.2 CIDA

In recent years, classical machine learning approaches have not been able to cope with many real-world applications, and attention to DA has increased due to its considerable performance on such problems. Thus, we aim to tackle the shift problem by integrating machine learning approaches and DA solutions.

In this paper, the domain shift problem is addressed in a multi-source scenario. Thus, the training and test sets are defined as \(X^S=\{X^1,X^2,\dots ,X^{N_s}\}\) and \(X^T\), respectively, where \(X^u\) denotes the \(u^{th}\) source domain and \(N_s\) is the total number of source domains. In general, domain adaptation problems are divided into two categories, heterogeneous and homogeneous. In heterogeneous domain adaptation, the source and target domains come from different feature spaces, while in homogeneous domain adaptation, they come from the same one. Our problem belongs to the homogeneous category.

Since the various classes might come from different distributions, they are treated differently, and thus there is dissimilarity among them. Therefore, we enlarge the margins between the various classes as much as possible. To this end, the new between-class scatter matrix, \(S_B^\prime \), is defined as follows:

$$\begin{aligned} S_B^{\prime }=\frac{1}{N_s^{2}}\sum _{i,j}\sum _{u,v} p_i^up_j^v(\mu _i^u-\mu _j^v)(\mu _i^u-\mu _j^v)^T \end{aligned}$$
(4)

where \(p_i^u\) and \(\mu _i^u\) are the prior and the mean of class i on the subset \(X^u\), respectively. Also, \(p_j^v\) and \(\mu _j^v\) are the prior and the mean of class j on the subset \(X^v\), in turn. \(S_B^\prime \) computes the weighted average of the between-class scatter matrices across different subsets and classes of the source domains. In other words, maximizing \(S_B^\prime \) enlarges the marginal distribution difference between the various classes of the source domains, so that the learned classifier can accurately predict the labels of the target data thanks to the large margins between the classes of the different source domains.

Moreover, we aim to minimize the distribution difference between the same classes in different domains in order to adapt the source and target domains. In this way, the difference between the same classes of the source and target domains is minimized. Hence, we shrink the margins among the samples of the same class in the source and target domains in order to align the samples well. Consequently, \(S_W^\prime \) is defined as the new within-class scatter matrix as follows:

$$\begin{aligned} S_W^{\prime }=\sum _{u=1}^{N_s}\sum _{i=1}^{K}(\mu _i^u-\mu _i^t)(\mu _i^u-\mu _i^t)^T \end{aligned}$$
(5)

where \(\mu _i^t\) is the mean of class i that belongs to \(X^T\). In fact, \(S_W^{\prime }\) minimizes the marginal distribution difference between the same classes that belong to the source and target domains.

The intuition behind CIDA is to learn a projection matrix \(A\in R^{m\times k}\) that pursues the following three principal objectives: (1) the marginal distribution difference between the various classes of the source and target domains is maximized (i.e., \(S_B^{\prime }\)), (2) the marginal distribution difference between the same class of the source and target domains is minimized (i.e., \(S_W^{\prime }\)) and (3) the variance within each class is minimized (i.e., \(S_W\)). Therefore, the optimization problem of CIDA, i.e., \(J^{\prime }(A)\), is composed of \(S_B^{\prime }\), \(S_W^\prime \) and \(S_W\) as follows:

$$\begin{aligned} J^{\prime }(A)=\frac{A^TS_B^{\prime }A}{A^T(cS_W+(1-c)S_W^{\prime })A} \end{aligned}$$
(6)

where \(c\in [0,1]\) is a parameter that regulates the trade-off between \(S_W\) and \(S_W^\prime \). Similar to the FLDA optimization problem, \(J^{\prime }(A)\) can also be solved by an eigenvalue decomposition of \((cS_W+(1-c)S_W^{\prime } )^{-1} S_B^\prime \), where the k eigenvectors corresponding to the k largest eigenvalues are chosen as the matrix A. In contrast to FLDA, in which the number of extracted features depends on the number of available classes, i.e., \(K-1\), CIDA extracts more features according to the rank of \(S_B^\prime \). In fact, the number of extracted features of CIDA is \(\min \{m,N_s\times K-1\}\), which generally increases with the number of source domains.
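For clarity, the sketch below assembles \(S_B^{\prime }\), \(S_W^{\prime }\) and \(S_W\) and solves Eq. 6 by eigenvalue decomposition. It is our own hedged illustration: names such as `sources` and `y_t_pseudo`, and the choice of computing \(S_W\) over all domains using the current target pseudolabels (suggested by the complexity analysis in Sect. 3.4), are assumptions rather than details fixed by the paper.

```python
import numpy as np

def cida_projection(sources, X_t, y_t_pseudo, c=0.5, k=20):
    """sources: list of (X_u, y_u) pairs, one per source domain; X_t: target samples;
    y_t_pseudo: current pseudolabels of the target; c in [0, 1] trades off S_W and S_W'."""
    m = X_t.shape[1]
    N_s = len(sources)
    classes = np.unique(np.concatenate([y for _, y in sources]))

    # per-subset class priors and means (p_i^u, mu_i^u)
    stats = [{i: (np.mean(y == i), X[y == i].mean(axis=0))
              for i in classes if np.any(y == i)} for X, y in sources]

    # Eq. (4): weighted between-class scatter across source subsets
    S_B = np.zeros((m, m))
    for su in stats:
        for sv in stats:
            for i, (p_i, mu_i) in su.items():
                for j, (p_j, mu_j) in sv.items():
                    d = (mu_i - mu_j).reshape(-1, 1)
                    S_B += p_i * p_j * d @ d.T
    S_B /= N_s ** 2

    # Eq. (5): scatter between the same classes of each source subset and the target
    S_Wp = np.zeros((m, m))
    for su in stats:
        for i, (_, mu_i) in su.items():
            if np.any(y_t_pseudo == i):
                d = (mu_i - X_t[y_t_pseudo == i].mean(axis=0)).reshape(-1, 1)
                S_Wp += d @ d.T

    # classical within-class scatter S_W (Eq. (2)) over all domains, using pseudolabels
    X_all = np.vstack([X for X, _ in sources] + [X_t])
    y_all = np.concatenate([y for _, y in sources] + [y_t_pseudo])
    S_W = np.zeros((m, m))
    for i in classes:
        X_i = X_all[y_all == i]
        S_W += (X_i - X_i.mean(axis=0)).T @ (X_i - X_i.mean(axis=0))

    # Eq. (6): the leading eigenvectors of (c*S_W + (1-c)*S_W')^{-1} S_B' form A
    vals, vecs = np.linalg.eig(np.linalg.pinv(c * S_W + (1 - c) * S_Wp) @ S_B)
    return vecs[:, np.argsort(-vals.real)[:k]].real
```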

3.3.3 Adaptive classifier

In the second phase, CIDA exploits an adaptive classifier to meet the following two complementary objectives: (1) minimizing the empirical risk of the prediction function on the labeled source data, which adapts the source and target domains, and (2) maximizing the consistency between the prediction function and the geometric data structure in order to preserve the structure of the input data. In the rest of this section, the adaptive classifier and its objectives are described in detail.

Learning based on empirical risk minimization. The first objective of the adaptive classifier is to minimize the empirical risk of the prediction function on the labeled source data. The loss function is formulated as follows:

$$\begin{aligned} \sum _{i=1}^{n_s} l(f(g(x_i)),y_i)=\sum _{i=1}^{n_s} \max (0, 1-y_i f(g(x_i))) \end{aligned}$$
(7)

where l is the hinge loss, f denotes the prediction function of the classifier used to predict the labels of the labeled source data, and g(x) is the mapping function that transfers a feature vector x into the new representation. Equation 7 accumulates the hinge loss between the true and predicted labels of f on the source data.

Learning based on data structure preservation. The second objective of the adaptive classifier is to maximize the consistency between the prediction function and the geometric data structure. We realize this objective through the manifold assumption. According to the manifold assumption, if two points \(x_s\) and \(x_t\) are close in the underlying geometry of the marginal distribution, their conditional distributions are similar as well, i.e., \(Q_s (y_s \mid x_s )\) \(\approx \) \(Q_t (y_t \mid x_t )\) [53]. Therefore, the marginal distribution knowledge is utilized to learn a prediction function that performs well on the target domain.

Generally, the structure of the input data is modeled via a nearest neighbor graph that contains \(n_s+n_t\) vertices, in which each data point represents a node. For each data point, the P nearest neighbors are determined and connected via edges. To determine the weight of each edge connecting nodes \(x_i\) and \(x_j\), the following weight function is employed:

$$\begin{aligned} W_{i,j}=\exp \left( -\frac{\parallel x_i-x_j\parallel ^{2}}{\delta }\right) \end{aligned}$$
(8)

where \(\delta \) is the normalization parameter that normalizes the matrix W and \(W_{i,j}\) is the weight of the edge between nodes \(x_i\) and \(x_j\). Then, the function \(M_f\) is defined to maximize the consistency between the prediction function and the manifold underlying the marginal distribution as follows:

$$\begin{aligned} M_f(P_s,P_t)=\sum _{i,j=1}^{n_s+n_t}(f(x_i)-f(x_j))^{2}W_{ij}=\sum _{i,j=1}^{n_s+n_t}f(x_i)\overline{L}_{i,j}f(x_j) \end{aligned}$$
(9)

where \(\overline{L}\) is the normalized Laplacian matrix and \(P_s\) and \(P_t\) are the marginal distributions of the source and target domains, respectively. Moreover, D is a diagonal matrix whose elements are defined as follows:

$$\begin{aligned} D_{ii}=\sum _{j=1}^{n_s+n_t}W_{ij} \end{aligned}$$
(10)

where \(D_{ii}\) is the sum of the weights of the \(i^{th}\) node with the other nodes. Also, \(L=D-W\) is the un-normalized Laplacian matrix, where \(L_{ii}\) is the sum of the weights of node i with the other nodes except itself. The normalized form of the Laplacian matrix L is defined as follows [54]:

$$\begin{aligned} \overline{L}=I-D^{-\frac{1}{2}}WD^{-\frac{1}{2}}. \end{aligned}$$
(11)

where I is the identity matrix. Thus, the optimization problem of the adaptive classifier is defined as follows:

$$\begin{aligned} min_{f\in H} \sum _{i=1}^{n_s} l(f(g(x_i)),y_i)+\sigma \parallel f \parallel ^{2}+\gamma M_f(P_s,P_t) \end{aligned}$$
(12)

where H is a set of classifiers, \(\sigma \) and \(\gamma \) are regularization parameters, and \(\parallel f \parallel \) is the norm of f. Let the prediction function f be defined as \(f(g(x_i ))=w^{T} \varphi (g(x_i))\), where w denotes the classifier parameters and \(\varphi \) is the mapping function that transfers data from the original space to a Hilbert space. Also, the kernel function k is defined as \(k(g(x_i ),g(x_j ))=<\varphi (g(x_i )), \varphi (g(x_j))>\). According to the Representer theorem [55], the minimizer of the optimization problem in Eq. 12 can be formulated as:

$$\begin{aligned} f(g(x))=\sum _{i=1}^{n_s+n_t}\alpha _{i} k(g(x_i),g(x)). \end{aligned}$$
(13)

where \(\alpha _{i}\) denotes the classifier parameters. If Eqs. 7 and 9 are rewritten using Eq. 13 and the results are incorporated into Eq. 12, the final optimization problem becomes:

$$\begin{aligned} \alpha =argmin_{\alpha \in R^{n_s+n_t}} \parallel (Y-\alpha ^{T}\mathbf{K} )R\parallel ^{2}+tr(\gamma \alpha ^{T}\mathbf{K} \overline{L}\mathbf{K} \alpha + \sigma \alpha ^{T} \mathbf{K} \alpha ) \end{aligned}$$
(14)

where \(\mathbf{K} \) denotes the kernel matrix. Therefore, the value of \(\alpha \) is obtained from the following relation:

$$\begin{aligned} \alpha =(\sigma I+(R+\gamma \overline{L})\mathbf{K} )^{-1}RY^{T} \end{aligned}$$
(15)

where R is a diagonal matrix in which \(R_{ii}=1\) if \(x_i \in X_s\) and \(R_{ii}=0\) otherwise, and Y is the label matrix. We now have a robust classifier that adapts the source and target domains. Algorithm 1 shows the complete procedure of CIDA. In each iteration, CIDA finds a projection matrix A and learns an adaptive classifier f on the projected data to refine the pseudolabels of the target data. In general, CIDA updates the projection matrix and classifier parameters in an iterative manner to predict the pseudotarget labels with increasing accuracy. In the next section, the data description and the implementation details are explained.

Algorithm 1 The complete procedure of CIDA
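To summarize the classifier step of one CIDA iteration, the following sketch builds the kNN graph of Eq. 8, the normalized Laplacian of Eq. 11, computes \(\alpha \) in closed form via Eq. 15, and returns refined pseudolabels for the target samples. It is our own illustration; the one-hot label matrix convention, the Gaussian kernel choice and all names are assumptions not fixed by the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def adaptive_classifier(Z_s, Y_s, Z_t, P=5, delta=1.0, sigma=1e-4, gamma=0.1, width=1.0):
    """Z_s, Z_t: projected source/target samples; Y_s: one-hot source labels (n_s, K)."""
    Z = np.vstack([Z_s, Z_t])
    n_s, n_t = len(Z_s), len(Z_t)
    n = n_s + n_t

    # Eq. (8): P-nearest-neighbour graph with Gaussian edge weights
    D2 = cdist(Z, Z, 'sqeuclidean')
    W = np.exp(-D2 / delta)
    nn = np.argsort(D2, axis=1)[:, 1:P + 1]          # skip self (column 0)
    mask = np.zeros_like(W, dtype=bool)
    mask[np.repeat(np.arange(n), P), nn.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)              # symmetric kNN graph

    # Eqs. (10)-(11): degree matrix and normalized Laplacian
    d = W.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_bar = np.eye(n) - D_is @ W @ D_is

    # Gaussian kernel matrix K over the projected samples g(x)
    K = np.exp(-D2 / (2 * width ** 2))

    # Eq. (15): closed-form classifier parameters (Y holds zeros for target rows)
    R = np.diag(np.concatenate([np.ones(n_s), np.zeros(n_t)]))
    Y = np.vstack([Y_s, np.zeros((n_t, Y_s.shape[1]))])
    alpha = np.linalg.solve(sigma * np.eye(n) + (R + gamma * L_bar) @ K, R @ Y)

    # Eq. (13): decision values; target pseudolabels are the arg-max over classes
    F = K @ alpha
    return F[n_s:].argmax(axis=1)
```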

3.4 Computational complexity

In this section, the computational complexity of CIDA is analyzed. According to Algorithm 1, the number of iterations of the main loop is fixed to a constant (e.g., 10), i.e., O(1). In more detail, the computational cost is as follows: \(O(K^{2}N_s^{2})\) for computing \(S_B^{\prime }\), i.e., Line 3; \(O(N_sK)\) for computing \(S_W^{\prime }\), i.e., Line 5; \(O((N_s+1)K(n_s+n_t))\) for computing \(S_W\), i.e., Line 6; \(O(m^3)\) for solving the eigenvalue decomposition problem, i.e., Line 7; and \(O((n_s+n_t)^{2})\) for constructing the adaptive classifier, i.e., Line 18. Since \(N_{s}\ll K\ll m\ll (n_s+n_t)\), the total computational complexity of CIDA is \(O((n_s+n_t)^{2})\).

4 Experimental setup

In this section, the evaluation data are introduced and the implementation details are discussed.

4.1 Data description

CIDA is evaluated on three benchmark visual domain adaptation datasets, which are summarized in Table 1. Office and Caltech-256 form a collection of four different domains, which were investigated in [8, 13, 23, 56]: the Webcam domain (W), with low-resolution images taken by a web camera; the Amazon domain (A), with images downloaded from online merchants; the DSLR domain (D), with high-resolution images taken by a digital SLR camera; and the Caltech-256 domain (C), with images downloaded and sieved from Google Images [57]. In our experiments, we use the public Office dataset published by Gong et al. [8] to compare the reported results with other state-of-the-art methods.

Table 1 Three benchmark domain adaptation datasets

We choose the following ten classes common to the Office and Caltech-256 datasets: head-phones, touring-bike, computer-monitor, computer-mouse, computer-keyboard, laptop-101, calculator, video-projector, backpack, and coffee-mug. Also, we utilize SURF features [58] for all images, represent each image with an 800-bin histogram from codebooks trained on Amazon images, and standardize the histograms by z-score normalization.
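A minimal sketch of the z-score step, assuming the histograms are standardized per bin across samples (the paper does not state the axis), is as follows; the matrix name is hypothetical.

```python
import numpy as np

def zscore(H):
    """H: (n_samples, 800) SURF bag-of-words histograms; standardize each bin to zero mean, unit variance."""
    return (H - H.mean(axis=0)) / (H.std(axis=0) + 1e-12)
```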

We conduct three different scenarios to compare our proposed approach against other state-of-the-art domain adaptation approaches. (1) Single source domain, in which one domain is considered the training set and another domain is used as the test set, i.e., \(C \longrightarrow A, C \longrightarrow W, \dots , D \longrightarrow W\). (2) Double source domains, where two domains are selected as the training set and another domain is selected as the test set, i.e., \( A \& C \longrightarrow D,A \& C \longrightarrow W,\dots ,D \& W \longrightarrow C\). (3) Triple source domains, in which three domains are considered the training set and another domain is considered the test set, i.e., \( A \& W \& D \longrightarrow C,\dots ,C \& A \& W \longrightarrow D\). Therefore, CIDA is evaluated on twenty-eight different tasks on the Office dataset.

Table 2 Classification accuracy (%) on Office+Caltech-256 datasets

PIE is another benchmark domain adaptation dataset; its name is an abbreviation of "Pose, Illumination, Expression." The dataset contains face images of 68 individuals, with 41,368 images of size \(32 \times 32\) taken by 13 synchronized cameras and 21 flashes under different poses, illuminations and expressions. We select the following five subsets of the PIE dataset, each pertaining to a different pose: PIE1 (C05, left pose), PIE2 (C07, upward pose), PIE3 (C09, downward pose), PIE4 (C27, frontal pose) and PIE5 (C29, right pose). In our experiments, we use the public PIE dataset published by Gong et al. [8] to have a fair comparison.

To evaluate the classification performance of CIDA versus the other methods, four scenarios are designed as follows. (1) Single source domain, in which one domain is considered the training set and another domain is considered the test set, i.e., \(P1 \longrightarrow P2,P1 \longrightarrow P3,\dots ,P5 \longrightarrow P4\). (2) Double source domains, where two domains are selected as the training set and another domain is selected as the test set, i.e., \( P1 \& P2 \longrightarrow P3,P1 \& P2 \longrightarrow P4,\dots ,P4 \& P5 \longrightarrow P3\). (3) Triple source domains, in which three domains are chosen as the training set and another domain is chosen as the test set, i.e., \( P1 \& P2 \& P3 \longrightarrow P4,\dots ,P3 \& P4 \& P5 \longrightarrow P2\). (4) Quadruple source domains, where four domains are selected as the training set and another domain is selected as the test set, i.e., \( P1 \& P2 \& P3 \& P4 \longrightarrow P5,\dots ,P2 \& P3 \& P4 \& P5 \longrightarrow P1\). Therefore, CIDA is tested on seventy-five different tasks.

4.2 Method evaluation

We systematically compare our CIDA results with two baseline machine learning methods, i.e., nearest neighbor (NN) and PCA, and with related state-of-the-art domain adaptation approaches including TCA [29], GFK [8], FIDOS [43], TSL [30], LTSL [59] and TSL-LRSR [60]. Since these methods are dimensionality reduction approaches, we train a classifier on the labeled training data (i.e., NN) and then apply it to the test data to predict the primary labels of the unlabeled test data. To validate the theoretical results of this research, the proposed method is compared with the best reported results of standard machine learning and other state-of-the-art domain adaptation methods.

4.3 Implementation details

To evaluate the performance of CIDA against the other methods, classification accuracy is utilized as the evaluation criterion. We set the number of iterations for the convergence of CIDA to 10 and use \(c=0.71\) for the Office+Caltech datasets and \(c=0.01\) for the PIE datasets. Also, we set \(\sigma =0.0001\) and \(\gamma =0.1\) for all datasets. The parameter settings are analyzed in detail in Sect. 5.4.
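For reference, the reported hyperparameters can be summarized as follows (a hedged summary; the dictionary and key names are ours, not part of any released code):

```python
# Hyperparameters reported in Sect. 4.3.
CIDA_PARAMS = {
    "iterations": 10,                  # outer pseudolabel-refinement loop
    "c": {"Office+Caltech": 0.71,      # trade-off between S_W and S_W' in Eq. (6)
          "PIE": 0.01},
    "sigma": 1e-4,                     # regularization on ||f||^2 in Eq. (12)
    "gamma": 0.1,                      # manifold regularization weight in Eq. (12)
}
```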

5 Experimental results and discussion

In this section, we compare the performance of our proposed method with eight related state-of-the-art and baseline methods on benchmark visual domain adaptation datasets.

Fig. 1

Classification accuracy (%) of single source domain scenario on Office and Caltech-256 datasets. CIDA outperforms other dimensionality reduction and DA approaches in 7 out of 12 tasks using NN classifier

Fig. 2

Classification accuracy (%) of double source domains scenario on Office and Caltech-256 datasets. CIDA outperforms other dimensionality reduction and DA approaches in 7 out of 12 tasks using NN classifier

Fig. 3

Classification accuracy (%) of triple source domains scenario on Office and Caltech-256 datasets. CIDA outperforms other dimensionality reduction and DA approaches in 2 out of 4 tasks using NN classifier

5.1 Results evaluation

Object recognition: The classification accuracy of CIDA and the other methods on the Office+Caltech datasets is reported in Table 2 for the single, double and triple source domain settings, respectively. For better interpretation, the results are visualized in Figs. 1, 2 and 3. We draw the following observations from the reported experimental results. (1) CIDA achieves the best average classification accuracy (47.12%) in the single source domain setting, where it performs better than the state-of-the-art domain adaptation methods in 7 out of 12 DA tasks. Moreover, since NN suffers from the mismatched distributions of the training and test data, the performance improvement of CIDA over NN is 15.75%. This substantiates that CIDA performs robustly and effectively for cross-domain image classification tasks. (2) CIDA achieves a significant improvement (2.07%) over the best baseline method, TSL-LRSR, in the double source domain setting, where the performance of CIDA is higher than the competing domain adaptation methods in 7 out of 12 DA tasks. Also, CIDA obtains a 21.53% performance improvement over NN. (3) The performance improvement of CIDA over the best baseline method, TSL-LRSR, in the triple source domain setting is 2.44%, where CIDA outperforms the other domain adaptation methods in 2 out of 4 DA tasks. In addition, CIDA shows a 23.26% improvement over the NN classifier.

Table 3 Classification accuracy (%) on Multi-PIE datasets

Face recognition: We summarize the classification accuracy of CIDA and the other methods on the PIE datasets in Table 3 for the single, double, triple and quadruple source domain settings, respectively. For better interpretation, the results are visualized in Figs. 4, 5, 6 and 7. We draw the following observations from the reported experimental results. (1) CIDA obtains a remarkable improvement in average classification accuracy (7.31%) over the best method, TSL-LRSR, in the single source domain setting, and outperforms all other domain adaptation methods in 15 out of 20 DA tasks. Also, CIDA obtains a 17.05% improvement over NN. (2) CIDA achieves a significant improvement in average classification accuracy (4.42%) over the best method, TSL-LRSR, in the double source domain setting, where CIDA performs the classification task more accurately in 14 out of 30 DA tasks. In addition, CIDA achieves a 33.51% performance improvement over NN. (3) The improvement of CIDA is 3.33% in classification accuracy over the best method, TSL-LRSR, in the triple source domain setting, where CIDA outperforms the other methods in 12 out of 20 DA tasks. CIDA also gains a 34.86% performance improvement over NN. (4) CIDA achieves a 2.26% improvement in average classification accuracy over the best baseline method, TSL-LRSR, in the quadruple source domain setting, where it outperforms the other methods in 3 out of 5 DA tasks. Moreover, CIDA gains 33.86% over NN. In the rest of this section, the performance of the compared methods is investigated in detail.

Fig. 4

Classification accuracy (%) of single source domain scenario on PIE datasets. CIDA outperforms other dimensionality reduction and DA approaches in 15 out of 20 tasks using NN classifier. a the first ten tasks, b the second ten tasks

Fig. 5

Classification accuracy (%) of double source domains scenario on PIE datasets. CIDA outperforms other dimensionality reduction and DA approaches in 14 out of 30 tasks using NN classifier. a the first ten tasks, b the second ten tasks, c the third ten tasks

Fig. 6

Classification accuracy (%) of triple source domains scenario on PIE datasets. CIDA outperforms other dimensionality reduction and DA approaches in 12 out of 20 tasks using NN classifier. a the first ten tasks, b the second ten tasks

Fig. 7

Classification accuracy (%) of quadruple source domains scenario on PIE datasets. CIDA outperforms other dimensionality reduction and DA approaches in 3 out of 5 tasks using NN classifier

Fig. 8

Average classification accuracy (%) of different methods under various scenarios. GFK, TSL and LTSL perform poorly on multiple source scenario tasks. However, CIDA systematically benefits from the available knowledge in different domains to adapt the input data. a Office+Caltech datasets, b PIE datasets

Fig. 9

Average classification accuracy (%) with respect to the number of iterations for the Office+Caltech and PIE datasets under different scenarios. CIDA predicts accurate labels for the target samples in an iterative manner; in almost all cases, the labels predicted at each stage are better than those of the previous one. a, b and c are single, double and triple source domain, respectively, on the Office+Caltech datasets. d–g are single, double, triple and quadruple source domain, respectively, on the PIE datasets

Fig. 10

Parameter evaluation with respect to the classification accuracy (%) and parameter c, for Office+Caltech datasets under single source domain scenario. CIDA is not sensitive to the value of c in most cases

Fig. 11

Parameter evaluation of CIDA with respect to the classification accuracy (%). The parameter \(\sigma \) on Office+Caltech datasets under single source domain scenario. CIDA is not sensitive to the value of \(\sigma \) in most cases. Also, CIDA achieves acceptable results with small values of \(\sigma \). Indeed, we consider \(\sigma \in [0.00001 \,\, 0.01]\) for all datasets

Fig. 12

Parameter evaluation of CIDA with respect to the classification accuracy (%). The parameter \(\gamma \), on Office+Caltech datasets under single source domain scenario. CIDA is not sensitive to the value of \(\gamma \) in period \([0.00001 \,\, 0.1]\)

Fig. 13

Convergence evaluation of CIDA with respect to the classification accuracy (%) over 20 iterations on the Office+Caltech datasets under the double source domains scenario. CIDA converges within 10 iterations in most cases

PCA is probably the most popular dimensionality reduction approach; it attempts to discover a shared representation across domains while preserving maximum variance in the new representation. Since PCA does not consider the distribution difference between domains, it does not perform well compared with the domain adaptation baselines. Nevertheless, PCA obtains better performance than NN.

TCA is a domain adaptation method that learns common transfer components between domains and maps the original data into a new subspace according to the transferred components. TCA is affected by the following two major restrictions: (1) TCA projects the domains in an unsupervised manner and does not consider the label information of the source data, and (2) TCA only reduces the marginal distribution difference across domains and does not consider the conditional distribution difference. In contrast, CIDA benefits from the source domain labels in constructing the shared low-dimensional subspace and also discriminates between the various classes.

GFK is another well-known DA approach that transfers domains into a shared low-dimensional subspace while reducing the marginal distribution difference. The main limitation of GFK is the low dimensionality of the embedded subspace, which causes the original data to be represented inaccurately in the embedded subspace. In contrast, CIDA learns an accurate shared subspace that faithfully represents the original data thanks to the high rank of its between-class scatter matrix.

TSL is another noticeable method that adapts the marginal distributions of the source and target domains based on kernel density estimation. TSL suffers from the following three important weaknesses. (1) TSL does not reduce the conditional distribution difference between the source and target domains due to its dependence on the distribution density. (2) Since TSL is sensitive to the data size, it cannot describe the distribution of data using kernel density estimation when the target domain contains only a few samples. (3) Even with enough data, TSL has convergence problems on large-scale data such as the PIE dataset. In contrast, CIDA performs well on both small and large datasets and shows considerable improvement over TSL.

LTSL is a framework that transfers data into a shared subspace such that some combination of the source samples represents the target samples. Also, LTSL utilizes a low-rank constraint to preserve the structure of the source and target domains. However, there are two reasons why LTSL is insufficient for domain adaptation and subspace alignment. (1) In LTSL, since the subspace learning and the reconstruction process are independent, the domain adaptation performance is limited. (2) In LTSL, the target data are only reconstructed from the source data; thus, LTSL performs poorly on small datasets. In contrast, CIDA jointly benefits from representation learning and classifier learning to adapt the source and target domains.

FIDOS is a modern framework that constructs a shared low-dimensional subspace while reducing the distribution difference and preserving the discrimination across classes. FIDOS, similar to CIDA, is an FLDA-based approach, but it is only effective for strongly related datasets.

TSL-LRSR is another approach that transfers the source and target data into a shared subspace in which each target sample is reconstructed as a composition of the source samples. TSL-LRSR employs low-rank and sparse constraints on the reconstruction matrix to preserve the local and global structure of the data. Moreover, TSL-LRSR learns a flexible linear classifier and a non-negative label relaxation matrix to maximize the margins across the various classes. In spite of the complicated structure of TSL-LRSR, CIDA benefits from a simpler and more robust optimization problem that adapts the distribution mismatch.

5.2 Multi-source domain adaptation problems

We remark that some of the methods, such as GFK, TSL and LTSL, perform poorly in the multiple source scenarios (according to Fig. 8). In fact, the multi-source scenario causes a severe multi-modality problem across the various classes and large distribution mismatches across domains. In this case, a classifier learned on the source domains performs poorly in predicting the labels of the target domain. However, CIDA systematically benefits from the knowledge available in the different domains to adapt the input data. The following three major factors contribute to the supremacy of our approach over the other DA and machine learning approaches: (1) CIDA maximizes the marginal distribution difference between the various classes of the source and target domains, (2) CIDA minimizes the distribution difference between the same classes of the source and target domains, and (3) CIDA minimizes the variance among the samples of each class.

5.3 Effectiveness evaluation

We conduct experiments over 10 iterations to evaluate the performance of CIDA and the best baseline method, TSL-LRSR, by comparing their average classification accuracy. We run TSL-LRSR and CIDA on all datasets under different scenarios. Since CIDA behaves similarly against the other methods, we only report the analysis of CIDA and TSL-LRSR. Our results are reported in Fig. 9; the convergence of CIDA is investigated in the next section. As can be seen from the figures, CIDA outperforms the best baseline method TSL-LRSR in all scenarios. Our proposed approach significantly reduces the distribution difference between the source and target domains and employs an adaptive classifier to adapt them. CIDA predicts accurate labels for the target samples in an iterative manner; in almost all cases, the labels predicted at each stage are better than those of the previous one.

5.4 Impact of parameter settings

The performance of CIDA is evaluated with respect to different parameter values in various situations. In general, we adjust three regularization parameters, c, \(\sigma \) and \(\gamma \), for CIDA on the various datasets. Since CIDA behaves similarly on all datasets, we only report the results on the Office+Caltech datasets due to space limitations.

In Fig. 10, the experimental results on the Office+Caltech datasets are reported for the evaluation of the parameter c. We run CIDA with various values of c and report the classification accuracy with \(c \in [0.01 \,\, 0.91]\) on the 12 Office+Caltech tasks. As is clear from the figures, CIDA is not sensitive to the value of c in most cases.

Figure 11 illustrates the experimental results for the parameter \(\sigma \) on the Office+Caltech datasets. We plot the classification accuracy of CIDA with \(\sigma \in [0.00001 \,\, 10]\) on the 12 Office+Caltech tasks. As is clear from the plots, CIDA is not sensitive to the value of \(\sigma \) in most cases. Also, CIDA achieves acceptable results with small values of \(\sigma \); indeed, we set \(\sigma \in [0.00001 \,\, 0.01]\) for all datasets.

Figure 12 shows the experimental results of CIDA with respect to \(\gamma \in [0.00001 \,\, 10]\) on the Office+Caltech datasets. The results demonstrate that CIDA is not sensitive to the value of \(\gamma \) in the interval \([0.00001 \,\, 0.1]\).

5.5 Convergence evaluation

The convergence property of CIDA is validated by conducting experiments on the Office+Caltech datasets under the double source domains scenario. Figure 13 shows the classification accuracy of CIDA over 20 iterations. As is clear from the figures, CIDA converges within 10 iterations in most cases.

6 Conclusion and future work

In this paper, we proposed a novel cross- and multiple-domains visual transfer learning via iterative Fisher linear discriminant analysis (CIDA) approach for visual domain adaptation. Compared to existing works, CIDA is the first attempt to handle both the defective transformation and the unevaluated discriminant analysis challenges. CIDA trains a domain-invariant classifier with structural risk minimization and customized FLDA-based adaptation, and we provide a hybrid solution that exploits an adaptive classifier.

The effectiveness of CIDA is validated from a variety of perspectives, such as accuracy, effectiveness, parameter sensitivity and convergence, and its performance is compared with eight state-of-the-art and baseline methods on benchmark visual domain adaptation datasets under different scenarios. The experimental results indicate that CIDA significantly outperforms the other DA methods, especially when the number of source domains increases. In the future, we plan to generalize our approach to cope with non-linear feature extraction, to utilize online transfer learning, and to employ inductive transfer learning.