1 Introduction

1.1 Background

The image tagging problem is becoming increasingly important for both image retrieval and classification applications [5, 7, 23, 32, 33, 40, 41]. The problem is defined as assigning a set of textual tags to a target set of images. An example is keyword-based image search: given a query keyword, we return a list of images whose tags are similar to the query words. In this application, the quality of the tags assigned to the images is critical for retrieval performance. Ideally, the tags of an image should be accurate and complete. In real-world applications, however, the tags of images are usually incomplete, and it is necessary to complete them. For example, when a social network user uploads an image to his/her page, some tags may be provided, but the provided tags usually cannot cover all the semantic concepts of the image, so the tags are incomplete. In this scenario, it is necessary to obtain the complete list of tags of the image before using it for image retrieval. The problem of completing the tags of images is called image tag completion [10, 14, 25, 44]. To solve this problem, many approaches have been proposed [6, 15, 17, 18, 39, 42]; matrix completion is a popular method for the tag completion problem [2,3,4]. However, the performance of these methods is not yet satisfactory. Meanwhile, the convolutional neural network (CNN) has proved to be an effective tool for representing images [8, 20, 21, 43]. Surprisingly, however, CNNs have not been applied to the problem of image tag completion. In this paper, we propose to use a CNN as the representation method for image tag completion and develop a joint learning method for the convolutional representations and the complete tags of images.

1.2 Relevant works

Our work is an image tag completion method; thus, we review a few state-of-the-art image tag completion methods. Lin et al. [18] proposed a method for image tag completion using image-specific and tag-specific linear sparse reconstructions. The incomplete initial tagging matrix is reconstructed by linear sparse reconstructions that optimally reconstruct the images and the tags. The reconstruction is constrained by sparsity, image–image similarity, image–tag association, and tag–tag concurrence. Wu et al. [39] represented the image–tag relation by a tag matrix and completed the tags by learning an optimal tag matrix. The completion is constrained by both the observed tags and the visual similarity, and a new algorithm was proposed to solve the resulting optimization problem. Feng et al. [6] showed that the problem of tag completion can be solved as a noisy matrix recovery problem; the missing tags are recovered and the noisy tags are de-emphasized jointly. Moreover, a graph Laplacian component derived from visual features is also used to regularize the noisy matrix recovery. Lin et al. [17] developed an approach for image tag completion using dual-view linear sparse reconstructions. This method conducts tag completion from two views, the image view and the tag view, and uses the available context information. The image–image correlations are used to linearly reconstruct the low-level image features and the tag vectors jointly. Moreover, the tag–tag correlations are also used to reconstruct the tag vector from the initial incomplete tags. The image-view and tag-view reconstructed tag vectors are combined to predict the complete tags. Xia et al. [42] designed a regularized optimization framework for the tag completion problem, jointly conducting nonnegative matrix factorization (NMF) and holistic visual diversity minimization to complete the tag–image matrix. The NMF maps the tag–image matrix into a latent low-rank space and uses the semantic relevance of the tags to partially complete the tags. Moreover, the holistic visual diversity is used to regularize the NMF so that content-similar images are utilized. Li et al. [15] proposed a tag completion method based on low rank and error sparsity, local reconstruction structure consistency, and the promotion of basis diversity. The tag matrix is decomposed into the combination of a low-rank matrix and a sparse error matrix, and local linear reconstruction is further applied to regularize the learning of the tag vectors.

1.3 Our contributions

In this paper, we propose a novel image tag completion method based on the convolutional representation of images and the linear prediction of tag assignment vectors. We first use a CNN model to map the images to convolutional vectors and then apply a linear predictive model over the convolutional representations to obtain the complete tag assignment vectors of the images. To learn the parameters of the CNN and the linear predictive model, we require the learned tag assignment vectors to be consistent with the available elements of the incomplete tag vectors, minimizing the prediction errors of the linear predictive model over the image set. We also argue that the tag assignment vectors of images with large convolutional similarities should be similar, and we minimize the distance between their tag assignment vectors weighted by their convolutional similarity. Finally, we apply a sparsity penalty to the tag assignment vectors by minimizing their \(L_1\) norms. The learning problem is modeled as a joint minimization problem with regard to the CNN filters, the complete tag assignment matrix, and the predictive model parameters. The objective function is the weighted combination of the squared Frobenius norm distance between the tag assignment elements and the available tag elements, a prediction error term, a convolutional similarity regularization term, and a sparsity term on the tag assignment vectors. To solve this problem, we use the augmented Lagrangian method (ALM). Experimental results over several benchmark data sets show that the proposed convolutional representation-based tag completion method outperforms state-of-the-art tag completion methods.

1.4 Paper organization

The rest of this paper is organized as follows. In Sect. 2, we introduce the proposed method. In Sect. 3, we conduct experiments to evaluate it. In Sect. 4, we conclude the paper and discuss future work.

2 Proposed method

2.1 Problem modeling

We suppose that we have a set of images, denoted as \({\mathcal {I}} = \{I_1,\ldots ,I_n\}\), where \(I_i\) is the ith image and n is the total number of images. A set of candidate tags for these images is also given, \({\mathcal{T}}=\{T_1,\ldots , T_m\}\), where \(T_j\) is the jth tag and m is the number of tags. The problem of image tagging is to assign the tags in \({\mathcal{T}}\) to the images in \({\mathcal {I}}\). To denote the assignment relation between the images and the tags, we define an assignment matrix, \(T=[t_{ji}]\in \{1,0\}^{m\times n}\), whose (j, i)th entry \(t_{ji}\) is set to 1 if \(T_j\) is assigned to \(I_i\), and 0 otherwise. In many applications, we already have an existing assignment matrix \(\widehat{T} = [\widehat{t}_{ji}]\in \{1,0\}^{m\times n}\), whose entries have the same meaning as those of T, but it is incomplete, i.e., many of its entries are missing. To denote which entries are missing, we further define a binary matrix \(\varPhi = [\phi _{ji}] \in \{1,0\}^{m\times n}\), whose (j, i)th entry indicates whether \(\widehat{t}_{ji}\) is missing,

$$\begin{aligned} \phi _{ji}= \left\{ \begin{array}{ll} 1, &{}\quad \hbox {if}\,\widehat{t}_{ji}\,\hbox {is}\,\hbox {available}, \\ 0, &{}\quad \hbox {otherwise}. \end{array}\right. \end{aligned}$$
(1)

Thus, with these definitions, the problem of image tagging is transformed into learning a complete assignment matrix T from the image set \({\mathcal {I}}\), the incomplete assignment matrix \(\widehat{T}\), and its corresponding missing-entry indicator matrix \(\varPhi\). To this end, we propose to map each image to a convolutional representation by a convolutional network and then use a linear model to predict its assignment vector from its convolutional representation. The learning of the parameters of the convolutional network and the complete assignment matrix is conducted jointly. In the following sections, we introduce the model for predicting tag assignment vectors and the learning of the parameters.
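As a minimal illustration of these definitions, the following sketch (using NumPy, with hypothetical toy dimensions and randomly generated entries; it is not part of the proposed method itself) shows how \(\widehat{T}\), \(\varPhi\), and T can be stored:

```python
import numpy as np

m, n = 5, 4                        # hypothetical numbers of tags and images
rng = np.random.default_rng(0)

# Incomplete assignment matrix \hat{T}: observed entries are 0/1; missing entries
# can hold any placeholder (here 0) because \Phi masks them out.
T_hat = rng.integers(0, 2, size=(m, n)).astype(float)

# Indicator matrix \Phi: 1 where \hat{t}_{ji} is available, 0 where it is missing.
Phi = (rng.random((m, n)) > 0.4).astype(float)

# The complete assignment matrix T to be learned (later relaxed to real values).
T = T_hat.copy()
```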

2.1.1 Tagging model

To complete the tags of an image, I, we propose to learn its convolutional representation and its complete tag assignment vector jointly. Given the image I, we use a sliding window to split the image into \(n_I\) overlapping small image patches, \(I\rightarrow \{{\mathbf{x}}_1,\ldots ,{\mathbf{x}}_{n_I}\}\), where \({\mathbf{x}}_i\in {\mathbb {R}}^d\) is the visual feature vector of the ith image patch. We further organize the feature vectors of the image patches as a matrix \(X = \left[ {\mathbf{x}}_1,\ldots ,{\mathbf{x}}_{n_I}\right] \in {\mathbb {R}}^{d\times n_I}\), where \({\mathbf{x}}_i\) is the ith column of the matrix. To obtain the convolutional representation, we first apply filtering and a nonlinear transformation to the matrix X,

$$\begin{aligned} G = g\left( W^\top X \right) \in {\mathbb {R}}^{r\times n_I},\; \hbox {where}\,g(x) = \frac{1}{1+\exp (-x)}, \end{aligned}$$
(2)

\(W = [{\mathbf{w}}_1,\ldots ,{\mathbf{w}}_r]\in {\mathbb {R}}^{d\times r}\) is the matrix of r filters, and \({\mathbf{w}}_k\in {\mathbb {R}}^d\) is the kth filter vector. \(g(\cdot )\) is an element-wise nonlinear transformation function; in our experiments, we use the sigmoid function. Then, we perform row-wise max-pooling on the output matrix G,

$$\begin{aligned} \mathbf{y }= \max (G)= [y_1,\ldots ,y_r]^\top \in {\mathbb {R}}^{r} \end{aligned}$$
(3)

where \(\max (\cdot )\) gives the row-wise maximum elements, and \(y_k\) is the maximum entry of the kth row of G,

$$\begin{aligned} y_k = \max (G_{k,:}), \end{aligned}$$
(4)

where \(G_{k,:}\) is the kth row of matrix G. The output vector of max-pooling, \(\mathbf{y }\), is the convolutional representation vector of the image I. The tag completion problem for this image is to learn a tag assignment vector \(\mathbf{t }= [t_1,\ldots ,t_m]\in \{1,0\}^m\), where \(t_j=1\) indicates that \(T_j\) is assigned to the image I.

Remark

Each image can be assigned multiple tags, not just one. Thus, this is a multi-tag completion problem.

Direct learning of the binary vector is difficult; thus, we relax the learning of the binary vector to the learning of a continuous-valued vector, \(\mathbf{t }\in {\mathbb {R}}^m\). The elements of the assignment vector can be treated as the scores of assigning the tags to this image. To learn the tag assignment vector \(\mathbf{t }\) of an image from its convolutional representation vector \(\mathbf{y }\), we use a linear function to predict \(\mathbf{t }\) from \(\mathbf{y }\),

$$\begin{aligned} \mathbf{t }\leftarrow (U\mathbf{y }- \mathbf{b }) \end{aligned}$$
(5)

where \(U\in {\mathbb {R}}^{m\times r}\) and \(\mathbf{b }\in {\mathbb {R}}^m\) are the parameters of the predictive model for the assignment vector. To learn the complete tag assignment matrix T for all the images in \({\mathcal {I}}\), we apply the convolutional representation and the linear tag assignment prediction to all the images. The learning of the complete tag assignment matrix is considered jointly with the learning of the convolutional representation parameters, i.e., the filter matrix, and the linear predictive model parameters.
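The tagging model of Eqs. (2)–(5) can be sketched as follows (NumPy; all variable names and dimensions are illustrative assumptions, not part of the method's specification):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_representation(X, W):
    """Eqs. (2)-(4): filter the patch matrix X (d x n_I) with W (d x r),
    apply the element-wise sigmoid, and max-pool over patches."""
    G = sigmoid(W.T @ X)          # r x n_I response matrix
    return G.max(axis=1)          # r-dimensional representation y

def predict_tags(y, U, b):
    """Eq. (5): linear prediction of the tag assignment scores."""
    return U @ y - b              # m-dimensional score vector t

# toy example with hypothetical sizes
d, n_I, r, m = 32, 50, 8, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((d, n_I))
W = rng.standard_normal((d, r))
U = rng.standard_normal((m, r))
b = rng.standard_normal(m)
y = conv_representation(X, W)
t_scores = predict_tags(y, U, b)
```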

2.2 Learning problem

To jointly learn the complete tag assignment matrix T, the convolutional representation filter matrix W, and the linear predictive model parameters U and \(\mathbf{b }\), we consider the following problems to construct the joint objective function of the learning problem.

  • For the ith training image \(I_i\), its existing incomplete assignment vector is given as \(\widehat{\mathbf{t }}_i = \left[ \widehat{t}_{1i},\ldots ,\widehat{t}_{mi}\right] ^\top \in \{1,0\}^m\), which is the ith column of the matrix \(\widehat{T}\). This vector is not complete, and the ith column of \(\varPhi\), \({\varvec{\phi }}_i=[\phi _{1i},\ldots ,\phi _{mi}]^\top \in \{1,0\}^m\), indicates the missing elements of \(\widehat{\mathbf{t }}_i\). If \(\phi _{ji}=1\), \(\widehat{t}_{ji}\) is available, not missing. In this case, the learned tag assignment score \(t_{ji}\) should be as close as possible to the available \(\widehat{t}_{ji}\). To measure how close \(t_{ji}\) is to its corresponding available \(\widehat{t}_{ji}\), we calculate the squared distance between them, \((t_{ji}-\widehat{t}_{ji})^2\). This squared distance, weighted by \(\phi _{ji}\), is minimized with regard to \(t_{ji}\), and the minimization is applied over all the images and tags,

    $$\begin{aligned}&\min _{T} \left\{ \sum _{i=1}^n \sum _{j=1}^m\phi _{ji} (t_{ji}-\widehat{t}_{ji})^2 \right. \\&\left.\quad = \sum _{i=1}^n Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) \right\} , \end{aligned}$$
    (6)

    where \(Tr(\cdot )\) is the trace of a matrix, and \(\hbox {diag}({\varvec{\phi }}_i)\) is a diagonal matrix with \({\varvec{\phi }}_i\) as its diagonal vector. By minimizing this problem, we hope the learned T can respect the incomplete tag assignment vector, \(\widehat{T}\).

  • Since we use a linear model to predict the tag assignment vector from the convolutional representation vector, we want the prediction error to be as small as possible. Again, we use the squared Frobenius norm distance to measure the prediction error. For the ith training image, we minimize the squared Frobenius norm distance between its tag assignment vector, \(\mathbf{t }_i\), and the output of the linear predictive model applied to its convolutional representation vector, \(\mathbf{y }_i = \max \left( g ( W^\top X_i)\right)\). The minimization is applied over the whole training set,

    $$\begin{aligned} \min _{T,U,\mathbf{b },W} \sum _{i=1}^n \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2. \end{aligned}$$
    (7)

    In this way, we bridge the problem of learning the complete tags and that of learning the convolutional representations of the images.

  • We also use the similarities between the visual features of different images to regularize the learning of the complete tag assignment vectors. For an image \(I_i\), we seek its k-nearest neighbors to represent its visually similar images; the set of k-nearest neighbors of \(I_i\) is denoted as \({\mathcal {N}}_i\). The similarity between \(I_i\) and a neighboring image \(I_{i^{\prime }}\in {\mathcal {N}}_i\) is measured by the normalized Gaussian kernel of the Frobenius norm distance between their convolutional representation vectors (a sketch of this computation is given after this list),

    $$\begin{aligned} S_{ii^{\prime }} = \left\{ \begin{array}{ll} \frac{\exp \left( -\gamma \Vert \mathbf{y }_i-\mathbf{y }_{i^{\prime }}\Vert _F^2\right) }{\sum _{i^{\prime \prime }\in {\mathcal {N}}_i} \exp \left( -\gamma \Vert \mathbf{y }_i-\mathbf{y }_{i''}\Vert _F^2\right) },&{}\quad \hbox {if}\,I_{i^{\prime }}\in \,{\mathcal {N}}_i \\ 0, &{}\hbox {otherwise}. \end{array}\right. \end{aligned}$$
    (8)

    If \(S_{ii^{\prime }}\) is large, \(I_i\) and \(I_{i^{\prime }}\) are visually similar, and we expect their complete tag assignment vectors, \(\mathbf{t }_i\) and \(\mathbf{t }_{i^{\prime }}\), to be close to each other. Thus, we minimize the squared Frobenius norm distance between \(\mathbf{t }_i\) and \(\mathbf{t }_{i^{\prime }}\) weighted by \(S_{ii^{\prime }}\). The minimization is applied over all pairs of training images,

    $$\begin{aligned} \min _{T} \sum _{i,i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2. \end{aligned}$$
    (9)
  • We also observe that each image is usually assigned only a few tags, so the complete tag assignment vector should be sparse. To encourage the sparsity of the tag assignment vector, we use an \(L_1\) norm term to regularize it and force most of its elements to zero. The \(L_1\) norm of the tag assignment vector \(\mathbf{t }_i\), \(\left\| \mathbf{t }_i\right\| _1=\sum _{j=1}^m |t_{ji}|\), is minimized, and the minimization is applied over all the training images,

    $$\begin{aligned} \min _{T} \left\{ \sum _{i=1}^n \left\| \mathbf{t }_i\right\| _1 = \sum _{i=1}^n \sum _{j=1}^m |t_{ji}|\right\} . \end{aligned}$$
    (10)
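The similarity of Eq. (8) can be computed as in the following sketch (NumPy; the neighborhood size k and the bandwidth \(\gamma\) are illustrative parameters):

```python
import numpy as np

def knn_similarity(Y, k, gamma):
    """Eq. (8): normalized Gaussian similarities restricted to the k-nearest
    neighbors of each image. Y is r x n (one convolutional vector per column)."""
    n = Y.shape[1]
    D = ((Y[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)   # n x n squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])
        nbrs = nbrs[nbrs != i][:k]                           # k-nearest neighbors of image i
        w = np.exp(-gamma * D[i, nbrs])
        S[i, nbrs] = w / w.sum()                             # normalized over the neighborhood
    return S
```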

The overall optimization problem is the weighted summation of the above objectives,

$$\begin{aligned}&\min _{T,U,\mathbf{b },W} \left\{ \sum _{i=1}^n Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) +\lambda _1\sum _{i=1}^n \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 \phantom {\sum _{i,i^{\prime }=1}^n } \right. \\&\left. +\lambda _2 \sum _{i,i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2 +\lambda _3 \sum _{i=1}^n \Vert \mathbf{t }_i\Vert _1 \right\} \end{aligned}$$
(11)

where \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are the weights of the different regularization terms of the objective; their values are chosen by cross-validation in our experiments, and they control the trade-off between the terms. Please note that \(\mathbf{y }_i\) is not listed as a variable in this problem; it is a function of the convolutional filter matrix W. Since directly solving (11) for W is difficult, we propose to explicitly introduce \(\mathbf{y }_i,i=1,\ldots ,n\) as variable vectors, with constraints imposing \(\mathbf{y }_i = \max \left( g ( W^\top X_i)\right)\). The optimization problem of (11) is thus transformed into a constrained optimization problem,

$$\begin{aligned} \min _{T,U,\mathbf{b },W,Y}&\left\{ \sum _{i=1}^n Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) +\lambda _1\sum _{i=1}^n \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 \phantom {\sum _{i,i^{\prime }=1}^n } \right. \\&\left. +\lambda _2 \sum _{i,i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2 +\lambda _3 \sum _{i=1}^n \Vert \mathbf{t }_i\Vert _1 \right\} \\ \hbox {s.t.}\,&\mathbf{y }_i = \max \left( g(W^\top X_i)\right) ,\forall \,i=1,\ldots ,n, \end{aligned}$$
(12)

where \(Y=[\mathbf{y }_1,\ldots ,\mathbf{y }_n]\in {\mathbb {R}}^{r\times n}\) is the matrix of convolutional representations.
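For reference, the objective value of (12) can be evaluated as in the following sketch (NumPy; columns of T, \(\widehat{T}\), and Y are assumed to index images, and the constraint on \(\mathbf{y }_i\) is ignored here because it is handled by ALM below):

```python
import numpy as np

def objective(T, T_hat, Phi, U, b, Y, S, lam1, lam2, lam3):
    """Value of the objective in Eq. (12); columns of T, T_hat, Y index images."""
    consistency = np.sum(Phi * (T - T_hat) ** 2)             # first term
    pred = T - (U @ Y - b[:, None])                          # prediction residuals
    prediction = lam1 * np.sum(pred ** 2)                    # second term
    diff = ((T[:, :, None] - T[:, None, :]) ** 2).sum(axis=0)
    smoothness = lam2 * np.sum(S * diff)                     # third term
    sparsity = lam3 * np.abs(T).sum()                        # fourth term
    return consistency + prediction + smoothness + sparsity
```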

2.3 Problem optimization

To solve the problem in (12), we use the augmented Lagrangian method (ALM) [1, 9, 26, 28, 30, 31, 45,46,47]. The augmented Lagrangian function of the problem in (12) is given as follows,

$$\begin{aligned}&\mathcal {L}(T,U,\mathbf{b },W,Y,A)= \sum _{i=1}^n Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) +\lambda _1\sum _{i=1}^n \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 \\& \quad +\lambda _2 \sum _{i,i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2 +\lambda _3 \sum _{i=1}^n \Vert \mathbf{t }_i\Vert _1 \\& \quad +\sum _{i=1}^n{\varvec{\alpha }}_i^\top \left( \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right) +\frac{\beta }{2}\sum _{i=1}^n\left\| \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right\| _F^2, \end{aligned}$$
(13)

where \({\varvec{\alpha }}_i \in {\mathbb {R}}_+^r\) is the Lagrange multiplier vector for the constraint \(\mathbf{y }_i =\max \left( g(W^\top X_i)\right)\), and \(\beta\) is the positive penalty parameter. The optimization problem is transformed into the following minimization–maximization coupled problem,

$$\begin{aligned} \max _{A}&\min _{T,U,\mathbf{b },W,Y}\mathcal {L}(T,U,\mathbf{b },W,Y,A)\\ \hbox {s.t.}\,&A\ge 0, \end{aligned}$$
(14)

where \(A=[{\varvec{\alpha }}_1,\ldots ,{\varvec{\alpha }}_n]\in {\mathbb {R}}_+^{r\times n}\) is the matrix of Lagrange multiplier vectors. To solve this coupled problem, we use the alternating direction method of multipliers (ADMM). This method updates the multiplier matrix A and the other variables, \(T,U,\mathbf{b },W\), and Y, alternately. In the following sections, we discuss how to optimize these variables, respectively.

2.3.1 Optimizing T

To optimize the complete tag assignment matrix T, we fix all the other variables and obtain the following sub-optimization problem from (14),

$$\begin{aligned}&\min _{T} \left\{ \sum _{i=1}^n Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) +\lambda _1\sum _{i=1}^n \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 \right. \\&+\lambda _2 \sum _{i,i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2 +\lambda _3 \sum _{i=1}^n \Vert \mathbf{t }_i\Vert _1\\&=\sum _{i=1}^n \left( Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) +\lambda _1 \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 \phantom {\sum _1^1} \right. \\&\left. \left. +\lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2 +\lambda _3 \Vert \mathbf{t }_i\Vert _1\right) =\sum _{i=1}^n g(\mathbf{t }_i)\right\} . \end{aligned}$$
(15)

From (15), we observe that the objective is a summation of a function \(g(\mathbf{t }_i)\) over the individual tag assignment vectors \(\mathbf{t }_i\). Thus, the optimization of each \(\mathbf{t }_i\) is independent of the optimization of the other \(\mathbf{t }_{i^{\prime }}\), \(i^{\prime }\ne i\). We can optimize \(\mathbf{t }_i\) by minimizing \(g(\mathbf{t }_i)\) alone, which gives the following optimization problem,

$$\begin{aligned}&\min _{\mathbf{t }_i} \left\{ g(\mathbf{t }_i) = Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) +\lambda _1 \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 \phantom {\sum _1^1} \right. \\&\left. +\lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2 +\lambda _3 \Vert \mathbf{t }_i\Vert _1\right\} . \end{aligned}$$
(16)

In \(g(\mathbf{t }_i)\), the last term \(\Vert \mathbf{t }_i\Vert _1\) is non-smooth, and it is difficult to minimize directly. To overcome this problem, we use the fixed-point method. The last term \(\Vert \mathbf{t }_i\Vert _1\) can be rewritten as follows,

$$\begin{aligned} \Vert \mathbf{t }_i\Vert _1 = \sum _{j=1}^m \left| t_{ji}\right| = \sum _{j=1}^m \frac{t_{ji}^2}{\left| t_{ji}\right| }=\mathbf{t }_i^\top \varLambda (\mathbf{t }_i) \mathbf{t }_i, \end{aligned}$$
(17)

where \(\varLambda (\mathbf{t }_i) = \hbox {diag}\left( \frac{1}{\left| t_{1i}\right| },\ldots ,\frac{1}{\left| t_{mi}\right| }\right)\). Observe that if we fix the matrix \(\varLambda (\mathbf{t }_i)\), then \(\mathbf{t }_i^\top \varLambda (\mathbf{t }_i) \mathbf{t }_i\) is a quadratic function of \(\mathbf{t }_i\). In each iteration of an iterative algorithm, we first use the previous solution \(\widetilde{\mathbf{t }}_i\) to update \(\varLambda (\widetilde{\mathbf{t }}_i)\) and then fix it to solve the minimization problem of (16). Meanwhile, we note that the third term, \(\sum _{i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \mathbf{t }_{i^{\prime }}\right\| _F^2\), also requires the tag assignment vectors of the other images, \(\mathbf{t }_{i^{\prime }}\), \(i^{\prime }\ne i\), to update \(\mathbf{t }_i\); we therefore use their previous solutions, \(\widetilde{\mathbf{t }}_{i^{\prime }}\), to regularize the learning of \(\mathbf{t }_i\). The resulting objective is given as follows,

$$\begin{aligned}&\widetilde{g}(\mathbf{t }_i) = Tr\left( (\mathbf{t }_i-\widehat{\mathbf{t }}_i)^\top \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i)\right) +\lambda _1 \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2\\&+\lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }}\left\| \mathbf{t }_i - \widetilde{\mathbf{t }}_{i^{\prime }}\right\| _F^2 +\lambda _3 \mathbf{t }_i^\top \varLambda (\widetilde{\mathbf{t }}_i) \mathbf{t }_i\\ \end{aligned}$$
(18)

To minimize this objective, we set its derivative \(\nabla \widetilde{g}(\mathbf{t }_i)\) to zero to obtain the optimal solution \(\mathbf{t }_i^*\),

$$\begin{aligned}&\nabla \widetilde{g}(\mathbf{t }_i) = 2 \hbox {diag}({\varvec{\phi }}_i)(\mathbf{t }_i-\widehat{\mathbf{t }}_i) +2\lambda _1 \left( \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right) \\&+2\lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }}\left( \mathbf{t }_i - \widetilde{\mathbf{t }}_{i^{\prime }}\right) +2\lambda _3 \varLambda (\widetilde{\mathbf{t }}_i) \mathbf{t }_i=0,\\&\Rightarrow \left( \hbox {diag}({\varvec{\phi }}_i) + \left( \lambda _1 + \lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }} \right) I + \lambda _3 \varLambda (\widetilde{\mathbf{t }}_i)\right) \mathbf{t }_i \\&= \left( \hbox {diag}({\varvec{\phi }}_i)\widehat{\mathbf{t }}_i + \lambda _1 (U\mathbf{y }_i - \mathbf{b }) + \lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }} \widetilde{\mathbf{t }}_{i^{\prime }} \right) ,\\&\Rightarrow \mathbf{t }_i^* = \left( \hbox {diag}({\varvec{\phi }}_i) + \left( \lambda _1 + \lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }} \right) I + \lambda _3 \varLambda (\widetilde{\mathbf{t }}_i)\right) ^{-1}\left( \hbox {diag}({\varvec{\phi }}_i)\widehat{\mathbf{t }}_i \phantom {\sum _1^1} \right. \\&\left. + \lambda _1 (U\mathbf{y }_i - \mathbf{b }) + \lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }} \widetilde{\mathbf{t }}_{i^{\prime }} \right) . \end{aligned}$$
(19)
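The closed-form update of Eq. (19) can be sketched as follows (NumPy; the small constant added to \(\varLambda\) for numerical stability is an implementation choice not stated in the derivation):

```python
import numpy as np

def update_t(t_prev, t_hat, phi, U, y, b, S_row, T_prev, lam1, lam2, lam3, eps=1e-8):
    """Eq. (19): closed-form update of one tag assignment vector t_i.
    t_prev, t_hat, phi, y, b are vectors; T_prev is m x n with previous solutions;
    S_row holds the similarities S_{ii'} for image i."""
    m = t_hat.shape[0]
    Lam = np.diag(1.0 / (np.abs(t_prev) + eps))              # fixed-point reweighting of the L1 term
    lhs = np.diag(phi) + (lam1 + lam2 * S_row.sum()) * np.eye(m) + lam3 * Lam
    rhs = phi * t_hat + lam1 * (U @ y - b) + lam2 * (T_prev @ S_row)
    return np.linalg.solve(lhs, rhs)
```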

2.3.2 Optimizing U and \(\mathbf{b }\)

To optimize the linear predictive model parameters U and \(\mathbf{b }\), we fix the other variables and obtain the following sub-optimization problem,

$$\begin{aligned}&\min _{U,\mathbf{b }}\left\{ h(U,\mathbf{b })= \lambda _1\sum _{i=1}^n \left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 \right\} , \end{aligned}$$
(20)

where \(h(U,\mathbf{b })\) is the objective function of this sub-optimization problem. To minimize this objective function with regard to \(\mathbf{b }\), we first set its derivative with regard to \(\mathbf{b }\) to zero to obtain the optimal solution \(\mathbf{b }^*\),

$$\begin{aligned}&\nabla _\mathbf{b }h(U,\mathbf{b })= \lambda _1\sum _{i=1}^n \left( \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right) = 0,\\&\Rightarrow \mathbf{b }^* = \frac{1}{n}\sum _{i=1}^n \left( U\mathbf{y }_i-\mathbf{t }_i\right) . \end{aligned}$$
(21)

We substitute the optimal solution \(\mathbf{b }^*\) into (20) to reduce the problem to only one variable, U,

$$\begin{aligned}&\min _{U}\left\{ \widetilde{h}(U)= \lambda _1\sum _{i=1}^n \left\| \mathbf{t }_i - \left( U\mathbf{y }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \left( U\mathbf{y }_{i^{\prime }}-\mathbf{t }_{i^{\prime }}\right) \right) \right\| _F^2\right. \\&\left. = \lambda _1\sum _{i=1}^n \left\| \left( \mathbf{t }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{t }_{i^{\prime }} \right) - U \left( \mathbf{y }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{y }_{i^{\prime }} \right) \right\| _F^2 \right\} . \end{aligned}$$
(22)

To obtain the optimal solution of U, we set the derivative of \(\widetilde{h}(U)\) to zero,

$$\begin{aligned}&\nabla _U\widetilde{h}(U)= 2\lambda _1\sum _{i=1}^n \left( \left( \mathbf{t }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{t }_{i^{\prime }} \right) - U \left( \mathbf{y }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{y }_{i^{\prime }} \right) \right) =0,\\&\Rightarrow U \sum _{i=1}^n \left( \mathbf{y }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{y }_{i^{\prime }} \right) = \sum _{i=1}^n \left( \mathbf{t }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{t }_{i^{\prime }} \right) ,\\&\Rightarrow U^* = \sum _{i=1}^n \left( \mathbf{t }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{t }_{i^{\prime }} \right) \left( \sum _{i=1}^n \left( \mathbf{y }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{y }_{i^{\prime }} \right) \right) ^{-1}. \end{aligned}$$
(23)
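The following sketch carries out this update (NumPy): it computes \(\mathbf{b }\) from Eq. (21) and obtains U from the centered least-squares problem (22) via the usual normal equations, which is a standard way to solve (22); the small ridge term is an added implementation choice for invertibility, not part of the derivation.

```python
import numpy as np

def update_U_b(T, Y, ridge=1e-8):
    """Update of the linear predictor. b follows Eq. (21); U solves the centered
    least-squares problem of Eq. (22) through the usual normal equations."""
    Tc = T - T.mean(axis=1, keepdims=True)                   # centered tag scores
    Yc = Y - Y.mean(axis=1, keepdims=True)                   # centered representations
    U = (Tc @ Yc.T) @ np.linalg.inv(Yc @ Yc.T + ridge * np.eye(Y.shape[0]))
    b = (U @ Y - T).mean(axis=1)                             # Eq. (21)
    return U, b
```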

2.3.3 Optimizing W

To solve for the filter matrix W, we fix the other variables and obtain the following sub-optimization problem,

$$\begin{aligned}&\min _{W} \left\{ \sum _{i=1}^n{\varvec{\alpha }}_i^\top \left( \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right) +\frac{\beta }{2}\sum _{i=1}^n\left\| \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right\| _F^2\right. \\&= \sum _{k=1}^r \sum _{i=1}^n\left( \alpha _{ik}\left( y_{ik} - \max \left( g({\mathbf{w}}_k^\top X_i)\right) \right) +\frac{\beta }{2}\left( y_{ik} - \max \left( g({\mathbf{w}}_k^\top X_i)\right) \right) ^2 \right) \\&\left. =\sum _{k=1}^r f({\mathbf{w}}_k)\right\} . \end{aligned}$$
(24)

According to (24), the objective is actually a summation of the function \(f({\mathbf{w}}_k)\) over the independent filters \({\mathbf{w}}_k,k=1,\ldots ,r\). We can therefore optimize the filters independently by solving r sub-optimization problems. The sub-optimization problem for the kth filter is given as follows,

$$\begin{aligned}&\min _{{\mathbf{w}}_k} \left\{ f({\mathbf{w}}_k) = \sum _{i=1}^n \left( \alpha _{ik}\left( y_{ik} - \max \left( g({\mathbf{w}}_k^\top X_i)\right) \right) \phantom {\frac{1}{1}} \right. \right. \\&\left. \left. +\frac{\beta }{2}\left( y_{ik} - \max \left( g({\mathbf{w}}_k^\top X_i)\right) \right) ^2\right) \right\} , \end{aligned}$$
(25)

where \(\max \left( g({\mathbf{w}}_k^\top X_i)\right)\) is the output of the convolutional representation model with regard to the kth filter, \({\mathbf{w}}_k\). It can be rewritten as follows by introducing a maximum-responding instance indicator vector, \({\varvec{\eta }}_{ik}\),

$$\begin{aligned}&\max \left( g({\mathbf{w}}_k^\top X_i)\right) = \max _{{\varvec{\eta }}_{ik}}\left( g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik})\right) ,\\&\hbox {s.t.} \,{\varvec{\eta }}_{ik}\in \{1,0\}^{n_i},{\varvec{\eta }}_{ik}\mathbf 1 = 1. \end{aligned}$$
(26)

\({\varvec{\eta }}_{ik}\) is a binary vector with exactly one element equal to one; all the other elements are zero. The position of this one-valued element indicates the instance (patch) in \(X_i\) that gives the maximum response. Substituting (26) into (25), we have

$$\begin{aligned}&\min _{{\mathbf{w}}_k} \left\{ f({\mathbf{w}}_k) = \sum _{i=1}^n \left( \alpha _{ik}\left( y_{ik} - g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}^*)\right) \phantom {\frac{1}{1}} \right. \right. \\&\left. \left. +\frac{\beta }{2}\left( y_{ik} - g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}^*)\right) ^2 \right) \right\} ,\\&\hbox {s.t.} \;{\varvec{\eta }}_{ik}^* = {\arg \max }_{{\varvec{\eta }}_{ik}\in \{1,0\}^{n_i},\;{\varvec{\eta }}_{ik}\mathbf 1=1}g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}),\quad i=1,\ldots ,n. \end{aligned}$$
(27)

We also use the fixed-point method to solve this problem. In each iteration of an iterative algorithm, we first fix \({\mathbf{w}}_k\) to its previous solution, \(\widetilde{{\mathbf{w}}}_k\), to update \({\varvec{\eta }}_{ik}^*\). Then, we fix \({\varvec{\eta }}_{ik}^*\) and update \({\mathbf{w}}_k\) using the gradient descent method,

$$\begin{aligned} {\mathbf{w}}_k\leftarrow {\mathbf{w}}_k - \eta \nabla _{{\mathbf{w}}_k}f({\mathbf{w}}_k), \end{aligned}$$
(28)

where \(\nabla _{{\mathbf{w}}_k}f({\mathbf{w}}_k)\) is the gradient function of \(f({\mathbf{w}}_k)\),

$$\begin{aligned} \nabla _{{\mathbf{w}}_k}f({\mathbf{w}}_k) = \sum _{i=1}^n \left( -\alpha _{ik} -\beta \left( y_{ik} - g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}^*)\right) \right) \nabla g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}^*)\,X_i {\varvec{\eta }}_{ik}^*, \end{aligned}$$
(29)

\(\nabla g(\cdot )\) is the derivative of the nonlinear transformation function, and \(\eta\) is the descent step size.
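One gradient step of Eqs. (26)–(29) for a single filter can be sketched as follows (NumPy; the sigmoid derivative is used for \(\nabla g\), and the variable names are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_filter(w_k, X_list, y_k, alpha_k, beta, step):
    """Eqs. (26)-(29): fix the max-responding patch of each image, then take one
    gradient step on the k-th filter w_k. X_list holds the patch matrices X_i,
    y_k[i] = y_{ik}, alpha_k[i] = alpha_{ik}."""
    grad = np.zeros_like(w_k)
    for i, X_i in enumerate(X_list):
        responses = sigmoid(w_k @ X_i)                       # responses of all patches
        j = int(np.argmax(responses))                        # eta*_{ik}: max-responding patch
        z = X_i[:, j]
        s = responses[j]
        g_prime = s * (1.0 - s)                              # sigmoid derivative
        grad += -(alpha_k[i] + beta * (y_k[i] - s)) * g_prime * z
    return w_k - step * grad
```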

2.3.4 Optimizing Y

To optimize the convolutional representation vectors of the images, we fix the other variables and obtain the following sub-optimization problem,

$$\begin{aligned}&\min _Y \left\{ \sum _{i=1}^n \left( \lambda _1\left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 +{\varvec{\alpha }}_i^\top \left( \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right) \phantom {\frac{1}{1}}\right. \right. \\&\left. \left. \quad+\frac{\beta }{2}\left\| \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right\| _F^2 \right) = \sum _{i=1}^n l(\mathbf{y }_i) \right\} . \end{aligned}$$
(30)

We observe that the objective function of this problem is a summation of an independent function \(l(\mathbf{y }_i)\) over the training images. Thus, we can minimize \(l(\mathbf{y }_i)\) with regard to \(\mathbf{y }_i\) independently for each image. The minimization problem of \(l(\mathbf{y }_i)\) is given as follows,

$$\begin{aligned}&\min _{\mathbf{y }_i} \left\{ l(\mathbf{y }_i) = \lambda _1\left\| \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right\| _F^2 +{\varvec{\alpha }}_i^\top \left( \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right) \phantom {\frac{1}{1}}\right. \\&\left. +\frac{\beta }{2}\left\| \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right\| _F^2 \right\} . \end{aligned}$$
(31)

To solve this problem, we set the derivative of \(l(\mathbf{y }_i)\) with regard to \(\mathbf{y }_i\) to zero and obtain the following optimal solution \(\mathbf{y }_i^*\),

$$\begin{aligned}&\nabla l(\mathbf{y }_i) = -2\lambda _1U^\top \left( \mathbf{t }_i - (U\mathbf{y }_i - \mathbf{b })\right) +{\varvec{\alpha }}_i +\beta \left( \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right) =0,\\&\Rightarrow \left( \beta I+2\lambda _1U^\top U\right) \mathbf{y }_i =2\lambda _1U^\top \left( \mathbf{t }_i + \mathbf{b }\right) +\beta \max \left( g(W^\top X_i)\right) -{\varvec{\alpha }}_i,\\&\Rightarrow \mathbf{y }_i^* =\left( \beta I+2\lambda _1U^\top U\right) ^{-1}\left( 2\lambda _1U^\top \left( \mathbf{t }_i + \mathbf{b }\right) +\beta \max \left( g(W^\top X_i)\right) -{\varvec{\alpha }}_i\right) . \end{aligned}$$
(32)
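The closed-form update of Eq. (32) can be sketched as follows (NumPy; "pooled" stands for \(\max \left( g(W^\top X_i)\right)\), and the names are illustrative):

```python
import numpy as np

def update_y(t, U, b, alpha, pooled, beta, lam1):
    """Eq. (32): closed-form update of one convolutional representation vector y_i.
    'pooled' is max(g(W^T X_i)), the pooled response of the current filters."""
    r = U.shape[1]
    lhs = beta * np.eye(r) + 2.0 * lam1 * (U.T @ U)
    rhs = 2.0 * lam1 * U.T @ (t + b) + beta * pooled - alpha
    return np.linalg.solve(lhs, rhs)
```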

2.3.5 Optimizing A

To optimize the Lagrange multiplier matrix, we fix the other variables and obtain the following sub-optimization problem,

$$\begin{aligned} \max _{A} \left\{ s(A) = \sum _{i=1}^n{\varvec{\alpha }}_i^\top \left( \mathbf{y }_i - \max \left( g(W^\top X_i)\right) \right) = Tr\left( A^\top \varTheta \right) \right\} \end{aligned}$$
(33)

where \(\varTheta = \left[ \left( \mathbf{y }_1 - \max \left( g(W^\top X_1)\right) \right) ,\ldots , \left( \mathbf{y }_n - \max \left( g(W^\top X_n)\right) \right) \right] \in {\mathbb {R}}^{r\times n}.\) We use the gradient ascent method to update this matrix so as to maximize the objective function s(A),

$$\begin{aligned} A\leftarrow A+\varsigma \nabla s(A) \end{aligned}$$
(34)

where \(\nabla s(A)\) is the gradient function of s(A), which is defined as

$$\begin{aligned} \nabla s(A) = \varTheta , \end{aligned}$$
(35)

and \(\varsigma\) is the ascent step size.

2.4 Algorithm

Based on the optimization results of the above sections, we design an iterative algorithm to learn these variables. The algorithm, named the Joint Image Convolutional Representation and Tag Completion (JICRTC) algorithm, is described in Algorithm 1.

Algorithm 1: Iterative optimization algorithm of JICRTC.

Inputs: \(X_i,\widehat{\mathbf{t }}_i,i=1,\ldots ,n\).

Initialization: \(\mathbf{t }_i,\mathbf{y }_i,i=1,\ldots ,n\), U, \(\mathbf{b }\), W, and A.

Repeat:

  1.

    Calculate the image pair-wise convolutional similarities for \(i,i^{\prime }=1,\ldots ,n\):

    $$\begin{aligned} S_{ii^{\prime }} = \left\{ \begin{array}{ll} \frac{\exp \left( -\gamma \Vert \mathbf{y }_i-\mathbf{y }_{i^{\prime }}\Vert _F^2\right) }{\sum _{i^{\prime \prime }\in {\mathcal {N}}_i} \exp \left( -\gamma \Vert \mathbf{y }_i-\mathbf{y }_{i^{\prime \prime }}\Vert _F^2\right) },&{}\hbox {if}\,I_{i^{\prime }}\in \,{\mathcal {N}}_i \\ 0, &{}\hbox {otherwise}. \end{array}\right. \end{aligned}$$
    (36)
  2.

    Update the complete tag assignment score vectors for \(i=1,\ldots ,n\):

    $$\begin{aligned}&\mathbf{t }_i = \left( \hbox {diag}({\varvec{\phi }}_i) + \left( \lambda _1 + \lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }} \right) I + \lambda _3 \varLambda ({\mathbf{t }}_i)\right) ^{-1}\left( \hbox {diag} ({\varvec{\phi }}_i)\widehat{\mathbf{t }}_i \phantom {\sum _{i,i^{\prime }=1}^n }\right. \\&\left. + \lambda _1 (U\mathbf{y }_i - \mathbf{b }) + \lambda _2 \sum _{i^{\prime }=1}^n S_{ii^{\prime }} {\mathbf{t }}_{i^{\prime }} \right) . \end{aligned}$$
    (37)
  3.

    Update the predictive model parameters, U and \(\mathbf{b }\),

    $$\begin{aligned}&U = \sum _{i=1}^n \left( \mathbf{t }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{t }_{i^{\prime }} \right) \left( \sum _{i=1}^n \left( \mathbf{y }_i - \frac{1}{n}\sum _{i^{\prime }=1}^n \mathbf{y }_{i^{\prime }} \right) \right) ^{-1},\,\hbox {and}\\&\mathbf{b }= \frac{1}{n}\sum _{i=1}^n \left( U\mathbf{y }_i-\mathbf{t }_i\right) . \end{aligned}$$
    (38)
  4.

    Update the maximum responding instance indicators for \(i=1,\ldots ,n\):

    $$\begin{aligned} {\varvec{\eta }}_{ik}^* = {\arg \max }_{{\varvec{\eta }}_{ik}\in \{1,0\}^{n_i},\;{\varvec{\eta }}_{ik}\mathbf 1 =1}g\left( {\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}\right) ,\quad k=1,\ldots ,r. \end{aligned}$$
    (39)
  5.

    Update the filters for \(k=1,\ldots ,r\):

    $$\begin{aligned} {\mathbf{w}}_k= {\mathbf{w}}_k - \eta \sum _{i=1}^n \left( -\alpha _{ik} -\beta \left( y_{ik} - g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}^*)\right) \right) \nabla g({\mathbf{w}}_k^\top X_i {\varvec{\eta }}_{ik}^*)\,X_i {\varvec{\eta }}_{ik}^*. \end{aligned}$$
    (40)
  6.

    Update the convolutional representations of the images:

    $$\begin{aligned}&\mathbf{y }_i =\left( \beta I+2\lambda _1U^\top U\right) ^{-1}\left( 2\lambda _1U^\top \left( \mathbf{t }_i + \mathbf{b }\right) +\beta \max \left( g(W^\top X_i)\right) -{\varvec{\alpha }}_i\right) . \end{aligned}$$
    (41)
  7.

    Update the Lagrange multiplier matrix A:

    $$\begin{aligned} A= A+\varsigma \left( \left[ \left( \mathbf{y }_1 - \max \left( g(W^\top X_1)\right) \right) ,\ldots , \left( \mathbf{y }_n - \max \left( g(W^\top X_n)\right) \right) \right] \right) \end{aligned}$$
    (42)

    Until convergence.

Outputs: \(\mathbf{t }_i,i=1,\ldots ,n\).
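For orientation, one outer iteration of Algorithm 1 can be sketched as follows, reusing the hypothetical helper functions from the sketches above (knn_similarity, update_t, update_U_b, update_filter, update_y, sigmoid); all names and the column-wise data layout are illustrative assumptions:

```python
import numpy as np

# One outer iteration of Algorithm 1, reusing the hypothetical helpers sketched above.
def jicrtc_iteration(T, T_hat, Phi, U, b, W, Y, A, X_list, params):
    lam1, lam2, lam3, beta, gamma, k, step, varsigma = params
    S = knn_similarity(Y, k, gamma)                              # step 1, Eq. (36)
    for i in range(T.shape[1]):                                  # step 2, Eq. (37)
        T[:, i] = update_t(T[:, i], T_hat[:, i], Phi[:, i], U, Y[:, i], b,
                           S[i], T, lam1, lam2, lam3)
    U, b = update_U_b(T, Y)                                      # step 3, Eq. (38)
    for kk in range(W.shape[1]):                                 # steps 4-5, Eqs. (39)-(40)
        W[:, kk] = update_filter(W[:, kk], X_list, Y[kk], A[kk], beta, step)
    pooled = np.column_stack([sigmoid(W.T @ X_i).max(axis=1) for X_i in X_list])
    for i in range(T.shape[1]):                                  # step 6, Eq. (41)
        Y[:, i] = update_y(T[:, i], U, b, A[:, i], pooled[:, i], beta, lam1)
    A = A + varsigma * (Y - pooled)                              # step 7, Eq. (42)
    return T, U, b, W, Y, A
```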

3 Experiments

In this section, we evaluate the proposed algorithm, JICRTC, over a few benchmark image data sets. The evaluation is based on two different computer vision problems: image annotation and image retrieval.

3.1 Data sets and experimental setting

In the experiments, three image data sets are used. They are introduced as follows.

  • The Corel data set is composed of 4993 images. The tag set is composed of 260 different tags, and each image has at most five tags.

  • The Labelme data set is composed of 2900 images, and its tag set has 495 tags. Each image of this data set has at most 48 tags.

  • The Flickr data set has 1,000,000 images. Its tag set has more than 1,000 tags, and each image is tagged with at most 76 tags.

The statistical information of the data sets is summarized in Table 1.

Table 1 Data set statistical information

3.2 Image annotation experiments

3.2.1 Experimental setting

We first evaluate our image tag completion method on the image annotation problem [27]. Given an image and a set of candidate tags, the problem of image annotation is to predict the complete list of tags relevant to the image. This is a special case of image tag completion in which each image has no existing tags and all the tags are missing, so we must predict all of them.

We use the fourfold cross-validation protocol to split the training/test subsets. Each data set is split into four subsets of equal size. We use each subset as the test set in turn and the remaining three subsets as the training set. We first apply the proposed JICRTC method to the training set to learn the parameters of the CNN and the linear predictive model and then apply the models to the test images to predict their tags. The tag assignment scores are the outputs of the model, and the top-ranked tags are returned as the tags of the candidate image. We set the number of returned top-ranked tags to 5 and 10, respectively, and report the precisions and recalls as performance measures.
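The precision and recall at the top-k returned tags can be computed per test image as in the following sketch (NumPy; the helper name and inputs are illustrative assumptions):

```python
import numpy as np

def precision_recall_at_k(scores, true_tags, k):
    """scores: predicted tag assignment scores for one image (length m);
    true_tags: binary ground-truth tag vector; k: number of returned top tags."""
    top_k = np.argsort(-scores)[:k]
    hits = true_tags[top_k].sum()
    precision = hits / k
    recall = hits / max(true_tags.sum(), 1)
    return precision, recall
```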

3.2.2 Experimental results

We compare our method to existing state-of-the-art methods, including the methods proposed by Lin et al. [18], Wu et al. [39], Feng et al. [6], Lin et al. [17], and Li et al. [15]. The results are reported in Table 1 and Fig. 1. From the results reported in Fig. 1, the proposed JICRTC method outperforms the compared methods over all three data sets on the four performance measures. This is strong evidence of the advantage of the CNN model for the tag completion and annotation of images. Over the three data sets, it appears that the compared methods obtain the best results on the Corel data set and the worst results on the Flickr data set.

Fig. 1: Results of the image annotation experiments

3.3 Image retrieval experiments

3.3.1 Experimental setting

We then evaluate the proposed method on the problem of tag-based image retrieval [16]. This problem uses tags as queries to retrieve images from an image database. In each data set, we remove some tags of the images to set up the image tag completion problem and then apply the image tag completion algorithm to complete the missing tags. Given a query tag, we return the images in the database that have this tag as the retrieval results. We measure the retrieval performance by the positives at the top (Pos@Top). The use of this performance measure is motivated by the works of Liang et al. [12, 16], which show that Pos@Top is a robust and parameter-free performance measure suitable for most database retrieval problems. Following these works [12, 16, 29], we adopt this performance measure to evaluate the results of the image retrieval problem in our experiments.
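Assuming the usual definition of Pos@Top as the proportion of positive (relevant) images ranked above the highest-scored negative image for a query, it can be computed as in the following sketch (NumPy; names are illustrative):

```python
import numpy as np

def pos_at_top(scores, labels):
    """Pos@Top for one query, assuming its usual definition: the fraction of
    positive images ranked above the highest-scored negative image.
    scores: retrieval scores; labels: 1 for relevant images, 0 otherwise."""
    neg_scores = scores[labels == 0]
    if neg_scores.size == 0:
        return 1.0
    top_negative = neg_scores.max()
    positives = labels == 1
    return (scores[positives] > top_negative).sum() / max(positives.sum(), 1)
```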

3.3.2 Experimental results

The retrieval results of the different methods are reported in Fig. 2. We can observe from this figure that the proposed JICRTC method outperforms the other methods over all three data sets. The Pos@Top values of JICRTC are higher than those of all the compared tag completion methods. This indicates that the convolutional representation-based tag completer works better than the other tag completers based on matrix completion and similar techniques.

Fig. 2: Results of the image retrieval experiments measured by Pos@Top

3.4 Convergence of the alternating gradient descent

We also plot the curve of the objective value against the number of iterations for the alternating gradient descent algorithm. The curve for the experiments over the Corel data set is shown in Fig. 3. According to Fig. 3, the algorithm converges after about 40 iterations.

Fig. 3: Convergence curve over the Corel data set

4 Conclusion and future works

In this paper, we proposed a novel image tag completion method based on the CNN model. We use the CNN model to represent the images and then predict the complete image tags from the CNN representations. The complete tag assignment score vectors are also regularized by the visual similarities calculated from the CNN representations. We developed an iterative algorithm to learn the parameters of the CNN model, the linear predictive model, and the complete tags. Experiments on image annotation and image retrieval based on image tag completion over three benchmark data sets show the advantage of the proposed method.

In the future, we will extend our CNN model to other machine learning problems besides image tag completion, such as communication networks [48,49,50,51], human action recognition [19, 52,53,54], biometrics [35,36,37,38], medical imaging [11, 13, 22], computational mechanics [24, 34], and big data analysis [26, 28].