1 Introduction

In the past decade, machine learning has been widely used in computer vision applications such as multimedia retrieval [1,2,3], image classification [4,5,6,7,8,9], object detection [10,11,12,13], and person re-identification [14,15,16,17,18]. Traditional machine learning methods learn a model from training data and then apply it to testing data, under the fundamental assumption that the two share the same distribution. In real-world applications, however, this assumption cannot always be guaranteed, and when the training and testing distributions differ the learned model may produce very poor results since it is no longer optimal for the testing data. For example, an image classifier trained on the Amazon dataset may fail on phone photos taken in daily life, and a face recognition model trained on frontal, well-illuminated images may struggle to recognize non-frontal poses under poor illumination. This is commonly viewed as the visual domain adaptation problem, which has attracted increasing interest in understanding and overcoming it.

Domain adaptation aims to learn an adaptive classifier by exploiting the relationship between a source domain with plenty of labeled data and a target domain collected from a different distribution. Depending on the availability of labels in the target domain, domain adaptation is generally divided into two settings: semi-supervised domain adaptation and unsupervised domain adaptation. In the semi-supervised setting, labeled data are available in both the source domain (plenty of labeled data) and the target domain (a few labeled data), whereas in the unsupervised setting labeled data are available only in the source domain. In this paper, we focus on unsupervised domain adaptation, which is the more challenging task and more in line with real-world applications.

Many recent works [19,20,21] adopt subspace-based methods to tackle visual domain adaptation. In [21], Li et al. determined a feature subspace via canonical correlation analysis (CCA) [22] for recognizing faces with different poses. In [19], Gopalan et al. used geodesic flows to generate intermediate subspaces along the geodesic path between the source and target subspaces on the Grassmann manifold. In [20], Gong et al. proposed the Geodesic Flow Kernel (GFK), which computes a symmetric kernel between source and target points based on the geodesic flow along a latent manifold.

Fig. 1. The overall schema of the proposed framework.

In the last few years, dictionary learning based sparse representation has received extensive attention and has been successfully applied to a variety of computer vision tasks, such as classification [23], recognition [24], and denoising [25]. Given an over-complete dictionary, a signal or image can be approximated by a combination of only a small number of atoms chosen from the learned dictionary. One of the earliest dictionary learning algorithms was proposed by Olshausen and Field [26], who used maximum likelihood (ML) learning to sparsely encode images on a redundant dictionary. Based on the same ML objective function as [26], Engan et al. [27] developed a more efficient algorithm, the method of optimal directions (MOD), which introduced a closed-form solution for the dictionary update. More recently, Aharon, Elad and Bruckstein [28] proposed the K-SVD algorithm, which generalizes k-means clustering and efficiently learns an over-complete dictionary from a set of training signals; it has been applied to a variety of image processing problems.

Most existing dictionary-based methods assume that training and testing data come from the same distribution, so the learned dictionary may not be optimal when the testing data follow a different distribution. Learning dictionaries across domains is a challenging task and has gradually become an active research topic over the last few years. In [29], Jia et al. considered a special case where corresponding samples from each domain were available and learned a dictionary for each domain. Qiu et al. [30] presented a general joint optimization function that transforms a dictionary learned in one domain to another, and applied this framework to pose alignment, pose and illumination estimation, and face recognition. Zheng et al. [31] achieved promising results on cross-view action recognition with pairwise dictionaries constructed from correspondences between the target and source views. In [32], Shekhar et al. learned a latent dictionary that succinctly represents both domains in a common projected low-dimensional space. Ni et al. [33] learned a set of subspaces through dictionary learning to mitigate the divergence between the source and target domains. Huang and Wang [34] proposed a joint model that learns a pair of dictionaries together with a feature space for describing and associating cross-domain data. In [35, 36], Zhu and Shao proposed a weakly-supervised framework that learns pairwise dictionaries and a classifier while considering the capacity of the dictionaries in terms of reconstructability, discriminability, and domain adaptability.

In this paper, we present an unsupervised domain adaptation approach based on dictionary learning. Unlike the dictionary learning based domain adaptation methods above, our method directly learns adaptive dictionaries in the low-level feature space and requires no labels in either the source or the target domain during dictionary learning. Our method is inspired by [35, 36], which forces similar samples of the same class to have identical representations in the sparse space. Our method, however, is unsupervised: we assume that the nearest coupled low-level features in the original space should maintain their relationship in the sparse space, i.e., these coupled features should share the same sparse representation. Following this idea, we learn a transformation matrix that selects the nearest source sample for each target sample; the dictionaries for the two domains are then jointly learned from the selected source data and the target data. The data from each domain are encoded by their respective dictionaries into sparse features, on which an SVM classifier is trained and then used to predict the test data. The learning framework is carried out with a classic and efficient dictionary learning method, K-SVD [28]. We demonstrate the effectiveness of our approach on standard cross-domain datasets, where it achieves state-of-the-art results. An overall schema of the proposed framework is shown in Fig. 1.

1.1 Organization of the Paper

The structure of the rest of the paper is as follows: In Sect. 2, we present our unsupervised domain adaptation dictionary learning algorithm and introduce the classification scheme for the learned dictionary. Experimental results on object recognition are presented in Sect. 3. Finally, the conclusion of this work is given in Sect. 4.

2 Proposed Method

2.1 Problem Notation

Let \(I_{s} = \{I_{s,i}\}_{i=1}^{N_s}\) and \(I_t = \{I_{t,j}\}_{j=1}^{N_t}\) be the data instances from the source and target domains respectively, where \(N_s\) and \(N_t\) denote the numbers of samples. Each sample from \(I_s\) and \(I_t\) carries a set of \(d\)-dimensional local features, so each sample can be represented by \(I_{s,i}=\{I_{s,i}^1,I_{s,i}^2,...,I_{s,i}^{M_i}\}\) and \(I_{t,j}=\{I_{t,j}^1,I_{t,j}^2,...,I_{t,j}^{M_j}\}\) in the source and target domains respectively, where \(M_i\) and \(M_j\) denote the numbers of local features. The sets of all local features of the source and target domains are then denoted by \(Y_s\in \mathbb {R}^{d*L_s}\) and \(Y_t\in \mathbb {R}^{d*L_t}\) respectively, where \(L_s\) and \(L_t\) denote the total numbers of local features in the source and target domains.
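To make the notation concrete, the following minimal sketch (Python/NumPy, with hypothetical shapes and random data standing in for real local descriptors) stacks per-image local features column-wise into the matrices \(Y_s\) and \(Y_t\):

```python
import numpy as np

# Each image I_{s,i} contributes M_i local d-dimensional descriptors; stacking
# them column-wise over all images yields Y_s (d x L_s) and Y_t (d x L_t).
d = 64                                                         # illustrative feature dimension
source_images = [np.random.randn(d, m) for m in (30, 45, 28)]  # M_i columns per image
target_images = [np.random.randn(d, m) for m in (33, 51)]      # M_j columns per image

Y_s = np.hstack(source_images)            # shape (d, L_s), L_s = sum of M_i
Y_t = np.hstack(target_images)            # shape (d, L_t), L_t = sum of M_j
```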

2.2 Dictionary Learning

Here we give a brief review of the classical dictionary learning approach. Given a set of \(d\)-dimensional input signals \(Y\in \mathbb {R}^{d*L}\), where \(L\) denotes the number of input signals, a \(K\)-atom dictionary \(D\in \mathbb {R}^{d*K}\) for the signals \(Y\) can be obtained by solving the following optimization problem:

$$\begin{aligned} \begin{aligned} \{D,X\}=\arg \min _{D,X}\Vert Y-DX\Vert ^2_F \\ s.t. \text { }\text { }\forall i, \text { }\Vert x_i\Vert _0 \le T_0 \end{aligned} \end{aligned}$$
(1)

where \(D=[d_1,d_2,...,d_K]\in \mathbb {R}^{d*K}\) denotes the dictionary, \(X=[x_1,x_2,...,x_L]\in \mathbb {R}^{K*L}\) denotes the sparse coefficients of \(Y\) decomposed over \(D\), and \(T_0\) is the sparsity level that constrains the number of nonzero entries in each \(x_i\).

The performance of sparse representation depends heavily on the dictionary learning method. The K-SVD algorithm [28] is a highly effective dictionary learning method that focuses on minimizing the reconstruction error. In this paper, we solve our unsupervised domain adaptation dictionary learning formulation based on the K-SVD algorithm, as sketched below.
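For concreteness, here is a minimal K-SVD sketch for Eq. (1) (Python/NumPy), alternating orthogonal matching pursuit (OMP) for the sparse codes with atom-wise rank-1 SVD updates for the dictionary. This is an illustrative re-implementation under simplifying assumptions (random initialization from the data, fixed iteration count), not the reference code of [28]:

```python
import numpy as np

def omp(D, y, T0):
    """Orthogonal Matching Pursuit: greedy sparse code with at most T0 atoms."""
    residual, idx = y.copy(), []
    x = np.zeros(D.shape[1])
    for _ in range(T0):
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        idx.append(j)
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x[idx] = coef
    return x

def ksvd(Y, K, T0, n_iter=10):
    """Minimal K-SVD for Eq. (1); assumes Y has at least K columns."""
    rng = np.random.default_rng(0)
    D = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
    for _ in range(n_iter):
        # sparse coding stage: encode every signal with the current dictionary
        X = np.column_stack([omp(D, Y[:, i], T0) for i in range(Y.shape[1])])
        # dictionary update stage: refit each atom on the signals that use it
        for k in range(K):
            users = np.nonzero(X[k])[0]
            if users.size == 0:
                continue
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k], X[k, users] = U[:, 0], S[0] * Vt[0]
    return D, X
```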

2.3 Unsupervised Domain Adaptation Dictionary Learning

Now consider a more general scenario where we have data from two domains, a source domain \(Y_s\in \mathbb {R}^{d*L_s}\) and a target domain \(Y_t\in \mathbb {R}^{d*L_t}\). We wish to jointly learn a corresponding dictionary for each domain. Formally, we minimize the following cost function:

$$\begin{aligned} \begin{aligned}&\{D_s,D_t,X_s,X_t\} \\&=\arg \min _{D_s,D_t,X_s,X_t}\Vert Y_s - D_s X_s\Vert ^2_F \\&+ \Vert Y_t - D_t X_t\Vert ^2_F \text { } \text { } \text { } s.t. \text { } \text { }\forall i, \text { }\Vert x_i^s\Vert _0 \le T_0, \text { }\Vert x_i^t\Vert _0 \le T_0 \end{aligned} \end{aligned}$$
(2)

In addition, to preserve the relationships of the original feature space, we assume that the nearest coupled low-level features in the original space should also be the nearest couple in the sparse space. The new cost function is then given by:

$$\begin{aligned} \begin{aligned}&\{D_s,D_t,X_s,X_t\}\\&=\arg \min _{D_s,D_t,X_s,X_t}\Vert Y_s - D_s X_s\Vert ^2_F \\&+ \Vert Y_t - D_t X_t\Vert ^2_F + C([X_s X_t]) \\&s.t. \text { }\text { } \forall i, \text { }\Vert x_i^s\Vert _0 \le T_0, \text { }\Vert x_i^t\Vert _0 \le T_0 \end{aligned} \end{aligned}$$
(3)

where \(D_s=[d_1^s,d_2^s,...,d_K^s]\in \mathbb {R}^{d*K}\) is the learned source domain dictionary, \(X_s=[x_1^s,x_2^s,...,x_{L_s}^s]\in \mathbb {R}^{K*L_s}\) contains the sparse coefficients of the source domain, \(D_t=[d_1^t,d_2^t,...,d_K^t]\in \mathbb {R}^{d*K}\) is the learned target domain dictionary, and \(X_t=[x_1^t,x_2^t,...,x_{L_t}^t]\in \mathbb {R}^{K*L_t}\) contains the sparse coefficients of the target domain. The function \(C(\cdot )\) measures the distance, in the new sparse space, between couples that were nearest in the original space; a small \(C(\cdot )\) indicates that the data preserve more of their original relationships in the sparse space. This idea is inspired by [35, 36], where an analogous function measures the distances between similar cross-domain instances of the same class. Our method, by contrast, is fully unsupervised and operates directly on low-level features. The function \(C([X_s X_t])\) is thus defined as:

$$\begin{aligned} C([X_s X_t])=\Vert X_t-X_s P\Vert _F^2 \end{aligned}$$
(4)

where \(P\in \mathbb {R}^{L_s*L_t}\) is the transformation matrix that records the nearest couples between the original data of the source and target domains. \(P\) can be represented by:

$$\begin{aligned} P=\left( \begin{array}{ccc} \varPhi (y_1^s,y_1^t) &{} \cdots &{} \varPhi (y_1^s,y_{L_t}^t)\\ \vdots &{} \ddots &{} \vdots \\ \varPhi (y_{L_s}^s,y_1^t) &{} \cdots &{} \varPhi (y_{L_s}^s,y_{L_t}^t)\\ \end{array} \right) \end{aligned}$$
(5)

where \(\varPhi (y_i^s,y_j^t)\) is the Gaussian affinity between data in the original feature space:

$$\begin{aligned} \varPhi (y_i^s,y_j^t)= \frac{1}{\sqrt{2\pi }}e^{-\frac{\Vert y_i^s-y_j^t\Vert ^2}{2}} \end{aligned}$$
(6)

Then \(P\) is binarized by setting the maximum entry of each column to 1 and all other entries to 0, i.e., the nearest source feature is selected for each target feature:

$$\begin{aligned} P(i,j)={\left\{ \begin{array}{ll} 1&{} \text {if} \quad \varPhi (y_i^s,y_j^t)=\max _k \varPhi (y_k^s,y_j^t)\\ 0&{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(7)
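A small sketch of Eqs. (5)-(7) follows (Python/NumPy). It reads Eq. (6) as a Gaussian kernel of the pairwise feature difference, so the column-wise argmax picks the nearest source feature for each target feature; the helper name is ours:

```python
import numpy as np

def nearest_couple_matrix(Y_s, Y_t):
    """Eqs. (5)-(7): Gaussian affinities between all source/target feature
    pairs, then one-hot selection of the nearest source feature per column."""
    # pairwise squared Euclidean distances, shape (L_s, L_t)
    sq = (np.sum(Y_s**2, axis=0)[:, None] + np.sum(Y_t**2, axis=0)[None, :]
          - 2 * Y_s.T @ Y_t)
    phi = np.exp(-sq / 2) / np.sqrt(2 * np.pi)   # Eq. (6) affinities
    P = np.zeros_like(phi)
    P[np.argmax(phi, axis=0), np.arange(phi.shape[1])] = 1.0   # Eq. (7)
    return P
```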

Thus, Eq. (3) can be written as:

$$\begin{aligned} \begin{aligned}&\{D_s,D_t,X_s,X_t\}\\&=\arg \min _{D_s,D_t,X_s,X_t}\Vert Y_s - D_s X_s\Vert ^2_F \\&+ \Vert Y_t - D_t X_t\Vert ^2_F + \Vert X_t-X_sP\Vert _F^2 \\&s.t. \text { }\text { }\forall i, \text { }\Vert x_i^s\Vert _0 \le T_0, \text { }\Vert x_i^t\Vert _0 \le T_0 \end{aligned} \end{aligned}$$
(8)

Assuming \(P\) yields a perfect mapping between the sparse codes \(X_s\) and \(X_t\), i.e., each nearest couple has an identical representation after encoding, we have \(\Vert X_t-X_sP\Vert _F^2=0\) and hence \(X_t=X_sP\). Substituting this into Eq. (8), and noting that right-multiplying the source reconstruction term by \(P\) restricts it to the source samples matched to target samples, we can rewrite Eq. (8) as:

$$\begin{aligned} \begin{aligned}&\{D_s,D_t,X_s,X_t\} \\&=\arg \min _{D_s,D_t,X_s,X_t}\Vert (Y_s - D_s X_s)P\Vert ^2_F + \Vert Y_t - D_t X_t\Vert ^2_F \\&=\arg \min _{D_s,D_t,X_s,X_t}\Vert Y_sP - D_s X_sP\Vert ^2_F + \Vert Y_t - D_t X_t\Vert ^2_F \\&=\arg \min _{D_s,D_t,X_s,X_t}\Vert Y_sP - D_s X_t\Vert ^2_F + \Vert Y_t - D_t X_t \Vert ^2_F \\&s.t. \text { }\text { } \forall i, \text { }\Vert x_i^t\Vert _0 \le T_0 \\ \end{aligned} \end{aligned}$$
(9)

2.4 Optimization

We can rewrite Eq. (9) as:

$$\begin{aligned} \begin{aligned} \{\widetilde{D},\widetilde{X}\}=\arg \min _{\widetilde{D},\widetilde{X}}\Vert \widetilde{Y}-\widetilde{D}\widetilde{X}\Vert ^2_F \\ s.t. \text { }\text { }\text { } \forall i, \text { }\Vert \widetilde{x}_i\Vert _0 \le T_0 \end{aligned} \end{aligned}$$
(10)

where \(\widetilde{Y}=\left( \begin{matrix} Y_sP \\ Y_t \end{matrix} \right) \), \(\widetilde{D}=\left( \begin{matrix} D_s \\ D_t \end{matrix} \right) \), and \(\widetilde{X}=X_t \). This optimization problem can then be solved using the K-SVD algorithm [28], as in the sketch below.
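Putting the pieces together, a sketch of Eq. (10) (reusing the hypothetical ksvd and nearest_couple_matrix helpers and the d, Y_s, Y_t variables from the earlier sketches) stacks the matched source features over the target features, runs standard K-SVD once, and splits the stacked dictionary into \(D_s\) and \(D_t\):

```python
# Stack matched source features over target features (Eq. 10) and solve with
# plain K-SVD; the top/bottom halves of the stacked dictionary are D_s and D_t.
P = nearest_couple_matrix(Y_s, Y_t)       # (L_s, L_t) one-hot couplings
Y_tilde = np.vstack([Y_s @ P, Y_t])       # shape (2d, L_t)
D_tilde, X_t = ksvd(Y_tilde, K=512, T0=5) # K and T_0 as in Sect. 3
D_s, D_t = D_tilde[:d], D_tilde[d:]       # split rows back into the two domains
```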

2.5 Object Recognition

Given the learned \(D_s\) and \(D_t\), we obtain sparse representations of the training data in the source domain and the testing data in the target domain respectively. For each image we obtain a set of sparse representations \(X_i=[x_{i,1},x_{i,2},...,x_{i,M_i}]\in \mathbb {R}^{K*M_i}\), where \(x_{i,j}\) is the sparse representation of the \(j^{th}\) local feature of image \(i\), \(K\) denotes the dictionary size, and \(M_i\) is the number of local features in image \(i\). Each image is then represented by a \(K\)-dimensional global vector obtained by max pooling the sparse codes of its local features, and a linear SVM classifier is used for cross-domain recognition.
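As an illustration, the classification stage might look as follows (Python, scikit-learn), reusing the omp helper, the dictionaries, and the image lists from the sketches above. Max pooling over absolute code values and the source_labels array are our assumptions, not details specified by the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

def encode_image(D, features, T0=5):
    # Sparse-code each local feature with the learned dictionary, then
    # max-pool the absolute codes into a single K-dimensional descriptor.
    X = np.column_stack([omp(D, features[:, i], T0)
                         for i in range(features.shape[1])])
    return np.max(np.abs(X), axis=1)

# source_labels: per-image class labels from the source domain (assumed given)
train_X = np.vstack([encode_image(D_s, img) for img in source_images])
test_X = np.vstack([encode_image(D_t, img) for img in target_images])
clf = LinearSVC().fit(train_X, source_labels)   # train on labeled source images
pred = clf.predict(test_X)                      # predict unlabeled target images
```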

Fig. 2. Example images from the LAPTOP category on four datasets.

3 Experiments

In this section, we evaluate our domain adaptation approach on 2D object recognition across different datasets.

Experimental Setup: Following the experimental setting in [20], we evaluate our domain adaptation approach on four datasets: Amazon (images downloaded from online merchants), Webcam (low-resolution images from a web camera), Dslr (high-resolution images from a digital SLR camera), and Caltech-256 [37]. We regard each dataset as a domain. Figure 2 shows sample images from these datasets and clearly highlights the differences between them. We extract the 10 classes common to all four datasets: BACKPACK, TOURING-BIKE, CALCULATOR, HEADPHONES, COMPUTER-KEYBOARD, LAPTOP-101, COMPUTER-MONITOR, COMPUTER-MOUSE, COFFEE-MUG, and VIDEO-PROJECTOR. There are 2533 images in total, and each class has 8 to 151 images per dataset. We use the SURF detector [38] to extract local features from all images. For each pair of source and target domains, we use 20 training samples per class when Amazon or Caltech is the source, and 8 samples per class when Dslr or Webcam is the source. For a complete comparison with existing domain adaptation methods, we also carried out experiments in the semi-supervised setting, where we additionally sampled 3 labeled images per class from the target domain. We ran 20 trials corresponding to different random selections of labeled data from the source and target domains, testing on all unlabeled data in the target domain. Our baseline is BOW, where all images are represented by 800-bin histograms over codebooks trained from a subset of Amazon images. Our method is also compared with Metric [39], SGF [19], and GFK [20]. Note that Metric [39] is limited to the semi-supervised setting.

Fig. 3. Cross-dataset object recognition accuracies on target domains with unsupervised adaptation over the four datasets (A: Amazon, C: Caltech, D: Dslr, W: Webcam).

Fig. 4. Cross-dataset object recognition accuracies on target domains with semi-supervised adaptation over the four datasets (A: Amazon, C: Caltech, D: Dslr, W: Webcam).

Parameter Settings: For our method, we set the dictionary size to \(K=512\) and the sparsity level to \(T_0=5\) for each domain.

Results: The average recognition rates are reported in Figs. 3 and 4 for the unsupervised and semi-supervised settings respectively. The baseline BOW has the lowest recognition rate, and all domain adaptation methods improve accuracy over it. Furthermore, the GFK-based method [20] clearly outperforms Metric [39] and SGF [19]. Overall, our method consistently outperforms all compared methods, except for one source-target pair in the unsupervised setting where it is slightly below GFK [20].

4 Conclusions

In this paper, we presented a fully unsupervised domain adaptation dictionary learning method that jointly learns domain dictionaries by capturing the relationship between the source and target domains in the original data space. We evaluated our method on publicly available datasets and obtained improved performance over the state of the art.