1 Introduction

In real-world applications such as text categorization and medical diagnosis, an instance often has multiple labels, which gives rise to a multi-label classification problem. For example, a given image may be labeled as ‘sky’, ‘tree’ and ‘mountain’. In recent years, this multi-label classification problem has drawn increasing attention from researchers and has been widely studied in many fields, e.g. multimedia annotation [9, 27, 36], web mining [28, 28, 33], and tag recommendation [1, 19].

The multi-label classification problem can be treated as a generalization of traditional multi-class classification. The major difference between the two is that labels are mutually exclusive in multi-class classification; that is to say, an instance can only belong to one class, whereas multi-label classification focuses on instances that may simultaneously have more than one label. Independently handling each label through a classical single-label classifier would theoretically be straightforward; in practice, however, connections between labels often exist. For example, the labels ‘tree’ and ‘mountain’ usually appear simultaneously in an image. Investigating the correlation between labels has been demonstrated to be effective for improving the performance of multi-label classification, and many methods have been devised to capture label correlation, for example by pruning the label set to distill the most important label relationships [25, 29], including predicted labels as auxiliary features to build classifier chains [26], or applying the maximum margin strategy to multi-label data (Rank-SVM) [10].

Due to the rapid development of the internet and social media, a large number of labels are often associated with a single instance. For example, in image tagging, the number of tags can easily reach tens or even hundreds of thousands. There are millions of categories on Amazon, yet every new product has to be assigned to a small number of relevant categories before it goes on the website. The complexity of traditional methods usually grows with the number of labels in the dataset, which makes handling a large number of labels infeasible under most circumstances. Moreover, in a large-scale multi-label dataset, the dimensionality of the example features also tends to be large. Many approaches have been developed to tackle large-scale multi-label classification problems; these approaches aim to capture label correlation and tackle multi-label classification from different points of view. The major idea behind embedding methods is to learn a low-dimensional subspace of the labels or features. Compared with one-vs-all methods, these approaches can achieve a significant speed-up. Embedding methods can be roughly grouped into two subcategories, namely FSDR (feature space dimension reduction) and LSDR (label space dimension reduction).

FSDR involves learning the subspace of example features, e.g. using locality-preserving projections to reduce the number of feature dimensions [13, 37, 42], while LSDR acts on example labels, e.g. PLST [31], which discovers the label subspace using Principal Component Analysis (PCA). Instead of separately investigating the subspaces of features and labels as in classical FSDR and LSDR, some works have integrated both feature and label information, e.g. label-aware FSDR and feature-aware LSDR. MDDM [42] aims to find a low-dimensional feature subspace with maximal dependence on the labels. As a least-squares extension of Canonical Correlation Analysis (CCA), LS-CCA [30] learns a feature subspace under the supervision of labels. Moreover, Chen et al. [8] proposed the conditional PLST, which is a feature-aware extension of PLST.

Label-aware FSDR and feature-aware LSDR can be used effectively to integrate feature and label information. However, existing embedding methods generally rely on a single subspace, which is discovered from the feature space or the label space, or shared by the features and labels. In practice, redundancy exists in both features and labels, and investigating feature and label information in a single shared subspace does not always yield accurate results. Moreover, although an instance can have a number of labels, the correlation between these labels naturally limits the possible types of label combinations; nevertheless, the independent components of labels have not been fully investigated.

In this paper, we propose to combine the advantages of FSDR and LSDR through the learning of separated subspaces for features and labels. By maximizing the independence between components in the label subspace, we discover label correlation represented by independent label components. Furthermore, independent feature components are extracted from example features so that independent coefficients can provide a compact representation of examples. We further maximize the correlation between label subspace and feature subspace via a regression problem. As several existing methods can be regarded as special cases of the proposed algorithm, their connections are thoroughly analysed in order to reveal the advantages of the proposed framework. In addition, we conduct comprehensive experiments on real-world datasets. The experimental results demonstrate that the proposed algorithm can effectively discover independent components from multi-label data, and thus bring about classification performance improvement. Furthermore, our method can be extended to handle non-linear independent components of features and labels; we employ a powerful non-linear neural network to achieve this extension.

2 Related work

Embedding methods are popular approaches to tackling the multi-label classification problem due to their simplicity, ease of implementation, and ability to handle label correlation. In this section, we will first introduce some classic embedding-based approaches, after which we will review some works that have successfully improved these conventional embedding approaches.

The most basic multi-label classification method is binary relevance (BR) [40], which independently trains a classifier for each label. Its primary advantage in multi-label learning is efficiency. However, since multi-label learning techniques have diverse applications, pure BR cannot achieve good performance in many specific applications, especially when features and labels are high dimensional. Accordingly, several FSDR and LSDR methods have been proposed to tackle this problem. Wang et al. proposed ML-LDA [34], a classical Linear Discriminant Analysis (LDA) method for multi-label classification; the authors redefined the scatter matrices so that the approach adapts to both single-label and multi-label settings. Zhang et al. proposed MDDM [42], which identifies a low-dimensional feature subspace and maximizes the dependence between the feature subspace and the label space, using the Hilbert-Schmidt Independence Criterion as the measure of dependence. Yu et al. introduced a supervised Latent Semantic Indexing (LSI) approach to multi-label classification [39]; this method maps the original features into a low-dimensional subspace that retains the feature information while also capturing the label dependency. Jian et al. proposed a feature selection method known as MIFS [17], which first maps the label information into a low-dimensional subspace that is later used to guide the feature selection phase. Hsu [14] introduced the LSDR paradigm, which finds a latent subspace of the labels using random projection. Moreover, Tai and Lin [31] replaced the random transformation with Principal Component Analysis (PCA), and thereby proposed the principal label space transformation (PLST) for multi-label classification.

Conventional LSDR and FSDR methods focus on the latent space of either the features or the labels. It is widely accepted that using label information or feature information alone precludes a full investigation of the multi-label data. Consequently, many feature-aware LSDR and label-aware FSDR methods, which retain both feature and label information, have been devised. The boundary between LSDR and FSDR has thus become ambiguous; accordingly, we categorize both of these approaches as embedding-based methods in the remainder of this paper.

Zhang & Schneider [41] applied Canonical Correlation Analysis to embed features and labels into a lower-dimensional vector. A codeword was subsequently generated by concatenating the original label vector with the lower-dimensional vector; for a given data point, its new codeword was predicted using Random Forest, although one can choose other regressors in practice. Bayesian Inference was then used to map the codeword to the label distribution. Chen and Lin proposed an enhanced version of PLST [31], know as conditional principal component transformation (CPLST). This approach combines the concepts of PLST and Canonical Correlation Analysis (CCA), thereby improving PLST through taking feature information into consideration.

Many embedding methods factorize the label matrix into a low-dimensional matrix as a low-rank approximation. In real-world applications, however, the label matrix is usually not low-rank due to the existence of tail labels. Rather than striving for a global projection, Bhatia et al. developed a method named SLEEC [5], which aims to preserve the pairwise distances between the closest label vectors with local embeddings. SLEEC proposed a novel objective function for preserving the local information of labels while also ensuring recoverability; during prediction, SLEEC uses a k-nearest neighbor (kNN) classifier. Xu et al. [38] proposed a robust extreme multi-label classification approach by treating tail labels as additive noise on the true label distribution. Moreover, while all previous embedding methods found a continuous subspace, Zhou et al. [43] were the first to introduce an embedding method with a binary subspace. These authors found a lower-dimensional embedding that minimizes the residual error, but forced their low-dimensional vectors to be binary. Due to the binary embedding constraints, classification can be used in the learning step instead of regression, which significantly accelerates their method.

Neural networks have long been known to be powerful non-linear representation learning tools. BP-MLL, the first method to utilize a neural network architecture for the multi-label classification problem, devised a novel loss function to exploit the dependency across labels. Subsequently, more recent works utilizing deep neural network techniques have been proposed. For example, CNN-RNN learns a label embedding space while capturing label co-occurrence information via a recurrent neural network. To further utilize the correlation between features and labels, Yeh et al. proposed C2AE, which combines deep canonical correlation analysis and an autoencoder to learn a latent feature-aware subspace for multi-label classification.

The purpose of our method is to learn an independent representation to better tackle the multi-label classification problem. The benefits of independent representations have been noted in many previous papers. Le et al. proposed RICA [21] to efficiently discover independent overcomplete representations for computer vision problems. Dinh et al. proposed non-linear independent components estimation (NICE), which adopts a neural network architecture that allows the determinant of the Jacobian matrix to be efficiently computed, and showed that independent representations can improve performance in many applications. Other methods have also tried to generate independent representations [4, 6]. RICA is a linear model, while NICE is based on a neural network; however, NICE requires a special architecture, which constrains its capacity. By contrast, our proposed DICE adds a regularization term that can be efficiently optimized. In fact, there is a line of work named disentangled representation learning that assumes the data is generated from independent factors of variation [7, 32]. Recently, Kageback et al. proposed a method whose regularizer contains only the l1 norm of the sample covariance matrix, which encourages decorrelation [18]. Meanwhile, the combination of canonical correlation and multiple autoencoders is not rare in the field of multi-view learning, since learning a correlative subspace is essential for multi-view learning; for example, Wang et al. proposed DCCAE [35], which consists of an autoencoder for each view and a correlation loss between the representations of the two views.

3 The proposed method

In this section, we describe our proposed approach by firstly introducing our motivation and an overview of our approach, after which technical details are provided.

3.1 Preliminaries

Let \(\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{N},y_{N})\}\) be the training data, where \(x_{i} \in \mathbb{R}^{d}\) is the feature vector of the i-th multi-label example and \(y_{i} \in \{0,1 \}^{l}\) is its corresponding label vector. We denote \(X \in \mathbb{R}^{N\times d}\) as the training sample matrix and \(Y \in \{0,1\}^{N\times l}\) as the label matrix, where \(x_{i}\) is the i-th row of X and \(y_{j}\) is the j-th row of Y.

Taking LSDR as an example, the general approach is to find a low-dimensional representation in a latent subspace via \({\Psi } :Y\rightarrow Z\), where \(Z \in \mathbb{R}^{N\times k}\) with \(k \ll l\), and then learn a mapping \({\Phi } :X\rightarrow Z\). A new instance is first transformed into a k-dimensional vector using Φ, which is then transformed back into the original label space with the inverse mapping of Ψ.
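To make this concrete, the following sketch implements a minimal LSDR-style baseline: labels are encoded with a truncated SVD (a PLST-style choice of Ψ) and a closed-form ridge regressor plays the role of Φ. The function names and toy data are illustrative and are not the implementation proposed in this paper.

```python
import numpy as np

def lsdr_pipeline(X, Y, k, lam=1.0):
    """Generic LSDR sketch: encode labels, regress, decode (illustrative only)."""
    # Psi: encode labels into a k-dimensional subspace via truncated SVD (PLST-style)
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    V = Vt[:k].T                                   # l x k orthonormal basis of the label space
    Z = Yc @ V                                     # N x k latent label codes

    # Phi: ridge regression from features to latent codes (closed form)
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)   # d x k

    def predict(X_new):
        # the inverse of Psi is simply V^T because V has orthonormal columns
        return (X_new @ W) @ V.T + Y.mean(axis=0)  # real-valued label scores
    return predict

# toy usage
X = np.random.randn(100, 20)
Y = (np.random.rand(100, 8) > 0.7).astype(float)
scores = lsdr_pipeline(X, Y, k=3)(X[:5])
```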

3.2 Motivation

FSDR and LSDR are efficient paradigms for multi-label classification; they reduce computational cost and improve performance by removing redundant information from either the feature or label perspective. To inherit the advantages of both approaches, we propose to discover two low-dimensional subspaces for labels and features, respectively. In the context of multi-label learning methods, it is widely accepted that capturing the label correlations of data is essential for performance improvement. Label correlations are common in practical multi-label datasets. By exploring and exploiting correlations among labels, we are able to better tackle the multi-label learning problem.

Although the number of labels can be very large, label correlations limit the number of label combinations that occur in practice. We therefore aim to discover a set of label components from which the whole label set can be reconstructed. Mathematically speaking, we assume the existence of some base labels {w1,w2,...,wk}, such that the label yi of a data point can be decomposed as \(y_{i}={\sum }_{j=1}^{k}a_{ij}w_{j} \), where \(\{a_{ij}\}_{j=1}^{k}\) is a new low-dimensional representation of the original label. Furthermore, we want to reconstruct the whole label set using the minimal number of label components. To achieve this, we assume that the combination coefficients of the components are mutually independent.

Since sparsity has a close connection with non-Gaussianity, sparse coding can be formulated as a special case of independent component analysis (ICA). ICA further reduces statistical dependencies and produces a sparse and independent representation that is useful for the subsequent learning procedure; accordingly, we also employ ICA to find a compact independent representation for the features. It would be unreasonable to discover the feature subspace and the label subspace separately: there are not only correlations among labels, but also correlations between features and labels. For example, the features extracted from an image and the labels of this image are representations of the image content from different points of view. Particularly when the number of labels is large enough, we can naturally treat labels as textual features of the examples. Thus, we not only want to find two independent subspaces for features and labels, but also hope that these two subspaces will have a strong correlation with each other. Many studies have shown that the correlations between features and labels have a substantial impact on the predictability of the latent spaces [8, 22, 42].

We perform decompositions on both features and labels, Y = AWy and X = BWx, where Wy and Wx are the bases of the labels and features, while A and B are the projections (coefficients) on these bases. We assume the bases are orthonormal to avoid redundant information, so that \(Y{W_{y}^{T}}=A\) and \(X{W_{x}^{T}}=B\). We aim to maximize the non-Gaussianity of every dimension of A and B, which is a common approach to pursuing independence. Finally, we maximize the correlation between A and B so that there can be a simple mapping between them. Based on this idea, we can write our primary objective function as follows:

$$ \begin{array}{@{}rcl@{}} \max\limits_{W_{x},W_{y}}g(X,W_{x})+g(Y,W_{y})+\gamma h(X{W_{x}^{T}},Y{W_{y}^{T}}) \\ s.t. W_{x}{W_{x}^{T}}=I,W_{y}{W_{y}^{T}}=I \end{array} $$
(1)

where γ is a trade-off parameter, g(⋅) measures non-Gaussianity, and h(⋅) measures the correlation between the two latent spaces.

3.3 Non-Gaussianity prior

Discovering independent components from random variables is a complicated endeavour. This kind of problem can be formulated as an Independent Component Analysis (ICA) problem [16], in which the observed matrix is decomposed under an assumption of independence. Maximizing independence is equivalent to maximizing non-Gaussianity in ICA models. Standard ICA can be defined as the following optimization problem:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W} \sum\limits_{i=1}^{N} \sum\limits_{j=1}^{k} g(w_{j}x_{i}) \\ s.t. w_{i}{w_{j}^{T}}=\delta_{ij} \end{array} $$
(2)

where g is a non-linear convex function used to pursue non-Gaussian components [15], and δij is the Kronecker delta, which equals 1 if and only if i = j and 0 otherwise. We can then rewrite formula (1) as

$$ \begin{array}{@{}rcl@{}} \min\limits_{W_{x},W_{y}}&\alpha\sum\limits_{i=1}^{N} \sum\limits_{j=1}^{k_{x}} g({W_{x}^{j}}x_{i})+\beta\sum\limits_{i=1}^{N} \sum\limits_{j=1}^{k_{y}} g({W_{y}^{j}}y_{i}) \\ & -\gamma h(X{W_{x}^{T}},Y{W_{y}^{T}}) \\ &s.t. W_{x}{W_{x}^{T}}=I,W_{y}{W_{y}^{T}}=I \end{array} $$
(3)

where \({W_{x}^{j}}\) is the j-th row of Wx, and α and β are trade-off parameters. The presence of orthonormal constraints leads to a difficult optimization problem. One option is to orthogonalize Wx and Wy at every update, but this results in higher computational costs. Accordingly, to get rid of these orthogonality constraints and achieve efficient optimization, we construct an unconstrained objective by replacing the orthonormal constraints with a soft reconstruction cost [21],

$$ \begin{array}{@{}rcl@{}} \min\limits_{W_{x},W_{y}}\|X-X{W_{x}^{T}}W_{x}\|_{F}^{2}+ \alpha\sum\nolimits_{i,j} \log(\cosh(X{W_{x}^{T}}))_{ij}\\ +\|Y-Y{W_{y}^{T}}W_{y}\|_{F}^{2}+ \beta\sum\nolimits_{i,j}\log(\cosh(Y{W_{y}^{T}}))_{ij} \\ -\gamma h(X{W_{x}^{T}},Y{W_{y}^{T}}), \end{array} $$
(4)

where we have adopted \(\log (\cosh (\cdot ))\) as the function g(⋅).
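For reference, the following Python snippet evaluates the reconstruction and log-cosh terms of Eq. (4) for given Wx and Wy; the correlation term is added in the next subsection. The function name and default trade-off values are illustrative only.

```python
import numpy as np

def soft_ica_objective(X, Y, Wx, Wy, alpha=0.1, beta=0.1):
    """Reconstruction + log-cosh (non-Gaussianity) part of Eq. (4), correlation term omitted.
    Wx has shape (k_x, d) and Wy has shape (k_y, l); a sketch, not the authors' code."""
    rec_x = np.linalg.norm(X - X @ Wx.T @ Wx, 'fro') ** 2   # soft orthonormality / reconstruction
    rec_y = np.linalg.norm(Y - Y @ Wy.T @ Wy, 'fro') ** 2
    ng_x = np.log(np.cosh(X @ Wx.T)).sum()                  # log-cosh contrast on the feature codes
    ng_y = np.log(np.cosh(Y @ Wy.T)).sum()                  # log-cosh contrast on the label codes
    return rec_x + alpha * ng_x + rec_y + beta * ng_y
```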

3.4 Correlation

As noted in our motivation section, we aim to maximize the correlation between the latent subspaces of features and labels. However, the features and labels might have different numbers of independent components, which requires us to calculate the correlation between variables of different dimensions. Canonical correlation is a suitable tool for this purpose, as it defines a correlation between two sources of input data with different dimensions:

$$ \rho = \frac{u^{T}{\Sigma}_{xy}v}{\sqrt{u^{T}{\Sigma}_{xx}u}\sqrt{v^{T}{\Sigma}_{yy}v}} $$
(5)

where u and v are the canonical projection directions, \({\Sigma }_{xx}=\frac {1}{N}\hat {X}^{T}\hat {X}\), \({\Sigma }_{yy}=\frac {1}{N}\hat {Y}^{T}\hat {Y}\), and \({\Sigma }_{xy}=\frac {1}{N}\hat {X}^{T}\hat {Y}\). \(\hat {X}\) and \(\hat {Y}\) satisfy \({\sum }_{i}\hat {x}_{i}=0 ~\text {and}~{\sum }_{j}\hat {y}_{j}=0\), which can be achieved by subtracting the empirical mean from the samples. There are several ways to calculate the correlation; however, what we need is the canonical correlation coefficient rather than the values of u and v. We therefore follow the solution proposed in [23], which was also adopted by Deep CCA [2], because it is differentiable.

The total correlation over all components of \(\hat {X}\) and \(\hat {Y}\) can be computed as the sum of all singular values of \(T\triangleq {\Sigma }_{xx}^{-1/2}{\Sigma }_{xy}{\Sigma }_{yy}^{-1/2}\), i.e. the trace norm of T. Due to the non-smooth nature of the trace norm, however, we instead use the \(\ell_2\) norm of the singular values, which is the Frobenius norm of T,

$$ h(B,A) =\|T_{BA}\|_{F}=\text{tr}(T_{BA}^{T}T_{BA})^{1/2}. $$
(6)
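A minimal NumPy sketch of this correlation measure is given below: it forms T from the empirical covariances of the two code matrices and returns its Frobenius norm. The small ridge eps added to Σxx and Σyy is our own assumption for numerical stability and is not part of the formulation above.

```python
import numpy as np

def corr_strength(B, A, eps=1e-6):
    """h(B, A) = ||T||_F with T = Sxx^{-1/2} Sxy Syy^{-1/2} (illustrative sketch)."""
    N = B.shape[0]
    Bc, Ac = B - B.mean(0), A - A.mean(0)          # center the codes
    Sxx = Bc.T @ Bc / N + eps * np.eye(B.shape[1])
    Syy = Ac.T @ Ac / N + eps * np.eye(A.shape[1])
    Sxy = Bc.T @ Ac / N

    def inv_sqrt(S):                               # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.norm(T, 'fro')                # l2 norm of the singular values of T
```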

In conclusion, we decompose both the features and the labels to discover independent components by maximizing the non-Gaussianity of the subspaces, and we simultaneously maximize the correlation between the projections on the two bases. Overall, therefore, we can rewrite Equation (1) as follows:

$$ \begin{array}{@{}rcl@{}} &\min\limits_{W_{x},W_{y}} J(W_{x},W_{y}) = \|X-X{W_{x}^{T}}W_{x}\|_{F}^{2}+\|Y-Y{W_{y}^{T}}W_{y}\|_{F}^{2} \\ &+\alpha\sum\nolimits_{i,j} \log(\cosh(X{W_{x}^{T}}))_{ij} +\beta\sum\nolimits_{i,j}\log(\cosh(Y{W_{y}^{T}}))_{ij} \\ & -\gamma\|T(X{W_{x}^{T}},Y{W_{y}^{T}})\|_{F} \end{array} $$
(7)

3.5 Connections with other methods

Canonical Correlation Analysis is a proven technique that has been widely used in multi-label learning [30, 41]. Many variants of CCA have been proposed. Based on two lemmas in [21], we show that our proposed approach can be expressed as a sparse CCA.

Lemma 1

When the data is whitened, the reconstruction error of the whitened data \(\|X-XW^{T}W\|_{F}^{2}\) is equivalent to the orthonormality cost \(\|I-W^{T}W\|_{F}^{2}\).

Proof

For whitened data, we have \(X^{T}X = I\); then

$$ \begin{array}{@{}rcl@{}} & \|X-XW^{T}W\|_{F}^{2}=\text{tr}(X-XW^{T}W)^{T}(X-XW^{T}W) \\ & =\text{tr}(X^{T}X-2X^{T}XW^{T}W+W^{T}WX^{T}XW^{T}W)\\ & =\text{tr}(I-2W^{T}W+W^{T}WW^{T}W)= \|I-W^{T}W\|_{F}^{2} \end{array} $$

Lemma 2

The column orthonormality cost \(\|I_{c}-W^{T}W\|_{F}^{2}\) is equivalent to the row orthonormality cost \(\|I_{r}-WW^{T}\|_{F}^{2}\) up to an additive constant.

Proof

Note that c and r are the numbers of columns and rows of W, respectively.

$$ \begin{array}{@{}rcl@{}} & \|I_{c}-W^{T}W\|_{F}^{2}=\text{tr}(I_{c}-2W^{T}W+W^{T}WW^{T}W)\\ & =\text{tr}(I_{r}-2WW^{T}+WW^{T}WW^{T})+c-r \\ & = \|I-WW^{T}\|_{F}^{2}+c-r. \end{array} $$
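The identity is easy to verify numerically; the snippet below checks that the two orthonormality costs differ by exactly c − r for a random W (sizes are arbitrary).

```python
import numpy as np

# Numerical check of Lemma 2: the column and row orthonormality costs differ by c - r.
rng = np.random.default_rng(0)
r, c = 5, 12
W = rng.standard_normal((r, c))
col_cost = np.linalg.norm(np.eye(c) - W.T @ W, 'fro') ** 2
row_cost = np.linalg.norm(np.eye(r) - W @ W.T, 'fro') ** 2
assert np.isclose(col_cost, row_cost + c - r)
```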

According to Lemmas 1 and 2, we can rewrite Eq. (7) as follows:

$$ \begin{array}{@{}rcl@{}} \min\limits_{W_{x},W_{y}} & \|I-W_{x}{W_{x}^{T}}\|_{F}^{2} +\|I-W_{y}{W_{y}^{T}}\|_{F}^{2} - \gamma \|T(B,A)\|_{F} \\ &+\alpha \sum\limits_{i,j}g(B)_{ij} + \beta \sum\limits_{i,j}g(A)_{ij}, \end{array} $$
(8)

where T(B,A) is a correlation term that measures the correlation between B and A. Hence, for whitened data, our approach is similar to the Lagrangian function of CCA with sparsity constraints.

Our approach consists of two parts: reconstruction and correlation. Some existing methods have also applied this idea. Here we demonstrate the connection between CPLST and our proposed approach. CPLST merges feature-aware information into its model, and its objective function can be written as follows:

$$ \begin{array}{@{}rcl@{}} \min\limits_{U,V} & \|XU-YV\|_{F}^{2} + \|YVV^{T}-Y\|_{F}^{2} \\ &\textbf{s.t.}\ V^{T}V=I. \end{array} $$
(9)

Obviously, the feature-aware term \({\min \limits } \|XU-YV\|_{F}^{2}\) of CPLST is a variant of CCA, while the second term is the reconstruction error of the labels.

FaIE [22] is another multi-label algorithm that has also been developed to utilize reconstruction and correlation terms. One of its major advancements is that it directly learns a lower-dimensional representation without making any assumption regarding the encoding process. Its objective function is written as follows:

$$ \begin{array}{@{}rcl@{}} \min\limits_{C}&-\text{tr}[C^{T}X(X^{T}X)^{-1}X^{T}C]+\|Y-CC^{T}Y\|_{F}^{2} \\ &\textbf{s.t.} \ C^{T}C=I \end{array} $$
(10)

where the first term is the correlation term between the feature matrix X and the lower-dimensional representation C, while the second term is again the reconstruction error of the labels. Obviously, CPLST and FaIE are alike. If we ignore the reconstruction term of the features and set α = 0, β = 0, γ = 1, we obtain an objective similar to those of CPLST and FaIE.

From the formulations, it is clear that our proposed approach has notable connections to CPLST, sparse CCA and FaIE. Although they have quite different motivations, they all contain reconstruction and correlation terms, which play an important role in embedding-based approaches for multi-label classification. In contrast with CPLST and FaIE, however, our proposed approach encodes features and labels into two different subspaces. In terms of optimization, the original CPLST and FaIE also involve orthonormal constraints, but they handle them by turning the problem into an eigendecomposition. By contrast, our approach optimizes an unconstrained function and therefore allows for an unconstrained optimizer (e.g., L-BFGS, SGD). Given n examples, both CPLST and FaIE have to compute an n × n matrix, which is infeasible for large n; they therefore cluster the data first, but clustering slows down prediction. We use mini-batch updates, which allow us to tackle large datasets without clustering (Table 1).

Table 1 A summary of different multi-label learning methods

3.6 Mapping between subspaces

After learning the subspaces for features and labels, we need to learn a mapping between these two subspaces, which allows us to transform from features to labels in order to accomplish multi-label learning. Since the two subspaces are low dimensional, we can adopt an efficient multi-dimensional regressor. We train the regressor based on the following objective:

$$ \min\limits_{f} \|f(X{W_{x}^{T}})W_{y}-Y\|_{F}^{2} $$
(11)

where f is an arbitrary regressor, e.g. a ridge regressor or a neural network. Multi-label examples are assumed to share the independent components. For a test sample, we map its features into the feature subspace; with the learned regressor, we then obtain its coefficients in the label subspace, i.e. \(Y_{pred} = f(X_{t}{W_{x}^{T}})W_{y}\), where Xt is the test data and Ypred is the prediction.
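As an illustration, the snippet below instantiates Eq. (11) with a closed-form ridge regressor that maps feature-subspace coefficients to label-subspace coefficients and then decodes with Wy. The linear choice of f and the function name are our own simplifications.

```python
import numpy as np

def fit_predict(X_tr, Y_tr, X_te, Wx, Wy, lam=1.0):
    """Ridge regressor f from feature codes to label codes, then decode with Wy (Eq. 11).
    Wx: (k_x, d), Wy: (k_y, l) are the learned bases; Wy is assumed to have orthonormal rows."""
    B_tr = X_tr @ Wx.T                     # coefficients of training features in the feature subspace
    # closed-form ridge mapping B -> A, fitted so that f(B) Wy approximates Y
    M = np.linalg.solve(B_tr.T @ B_tr + lam * np.eye(B_tr.shape[1]),
                        B_tr.T @ (Y_tr @ Wy.T))          # k_x x k_y
    A_te = (X_te @ Wx.T) @ M               # predicted label-subspace coefficients
    return A_te @ Wy                       # back to the original label space
```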

3.7 Neural network extension

To extend our method to a non-linear one, we leverage the power of neural networks. Our method can be described as a latent representation learning method. The first characteristic of the representation is that it is independent; the independence requirement can be seen as a stronger form of decorrelation, since decorrelation aims to minimize the covariance of the representation, while independence additionally constrains its higher-order statistics. Some recent works have suggested that independent representations are more effective. The second characteristic is that the representation is reconstructable: we can reconstruct the original label from the representation. The final characteristic is that it is correlative, as many works have shown that a feature-aware label subspace is much more useful. Following these three guidelines, we present the neural network architecture of DeepIFLC in Fig. 1. The architecture consists of two independent component autoencoders and one deep canonical correlation analysis (DCCA) module. The objective function can be written as follows:

$$ \begin{array}{@{}rcl@{}} J(\phi_{x},\phi_{y},\psi_{x},\psi_{y})=R(\phi_{x})+R(\phi_{y})+\alpha T(\phi_{x},\phi_{y})\\ +L(\phi_{x},\psi_{x})+L(\phi_{y},\psi_{y}) \end{array} $$
(12)

where R is the regularizer of the autoencoder, T is the canonical correlation term and L is the reconstruction loss of the autoencoder.

Fig. 1 The architecture of DeepIFLC
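A minimal PyTorch sketch of this architecture is shown below: two autoencoders whose codes share the same dimension k, so that the L2 correlation term introduced later (Eq. 23) can be applied between them. The class name, layer sizes and activations are illustrative (Section 5.6 reports 512-unit hidden layers with leaky ReLU and a sigmoid output).

```python
import torch.nn as nn

class DeepIFLC(nn.Module):
    """Sketch of the two-autoencoder architecture in Fig. 1 (illustrative, not the authors' code)."""
    def __init__(self, d, l, k, hidden=512):
        super().__init__()
        self.enc_x = nn.Sequential(nn.Linear(d, hidden), nn.LeakyReLU(), nn.Linear(hidden, k))
        self.dec_x = nn.Sequential(nn.Linear(k, hidden), nn.LeakyReLU(), nn.Linear(hidden, d))
        self.enc_y = nn.Sequential(nn.Linear(l, hidden), nn.LeakyReLU(), nn.Linear(hidden, k))
        self.dec_y = nn.Sequential(nn.Linear(k, hidden), nn.LeakyReLU(),
                                   nn.Linear(hidden, l), nn.Sigmoid())

    def forward(self, x, y):
        zx, zy = self.enc_x(x), self.enc_y(y)      # independent codes of features and labels
        return zx, zy, self.dec_x(zx), self.dec_y(zy)
```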

It is well known that the non-linear ICA problem is ill-posed and can only be solved with additional assumptions. Thus, rather than searching for a truly non-linear independent component estimator, we propose a method called Deep Independent Component autoEncoder (DICE), which satisfies only some essential properties of the ICA solution: the representation should be uncorrelated, and independence is further encouraged through a higher-order statistic, namely kurtosis. Our regularization is therefore quite straightforward. Denoting the output of the encoder as Z, which has zero mean, the regularizer can be written as follows:

$$ \begin{array}{@{}rcl@{}} R(\phi) = \beta\|C_{zz}\|_{1}-\gamma|n({1_{n}^{T}}Z^{4})/({1_{n}^{T}}Z^{2})^{2}-3|1_{k} \end{array} $$
(13)

where Czz is the sample covariance matrix of the representation Z, \(\|C_{zz}\|_{1}={\sum }_{i,j}|C_{zz}^{ij}|\), 1n is a column vector of all ones, the powers and the absolute value are applied elementwise, and γ is a trade-off parameter. Since this regularizer is differentiable, it can be applied to the autoencoder directly.
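The two statistics entering Eq. (13) can be computed from a zero-mean code matrix Z as follows; this is an illustrative NumPy sketch with the elementwise powers and absolute values made explicit.

```python
import numpy as np

def independence_stats(Z):
    """Ingredients of the regularizer in Eq. (13) for a zero-mean code matrix Z of shape (n, k)."""
    n = Z.shape[0]
    Czz = Z.T @ Z / (n - 1)                                     # sample covariance matrix
    kurt = n * (Z ** 4).sum(0) / ((Z ** 2).sum(0) ** 2) - 3.0   # excess kurtosis per component
    return np.abs(Czz).sum(), np.abs(kurt).sum()                # ||Czz||_1 and the summed |kurtosis|
```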

3.8 Stochastic independent loss

To apply this regularizer in a neural network, we need to estimate the true covariance and kurtosis of the whole dataset from mini-batches. We therefore propose a stochastic version of Eq. 13. We denote a mini-batch as \(Z\in \mathbb {R}^{m\times k}\), where m is the batch size and k is the number of independent components.

The mini-batch covariance matrix and kurtosis vector at step t are given as:

$$ \begin{array}{@{}rcl@{}} C_{mini}^{t} &=& \frac{1}{m-1}Z^{T}Z \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} K_{mini}^{t} &=&m({1_{m}^{T}}Z^{4})/({1_{m}^{T}}Z^{2})^{2}-3 \end{array} $$
(15)

Then we approximate the true covariance and kurtosis by accumulating the history

$$ \begin{array}{@{}rcl@{}} C_{accu}^{t} &=& \delta C_{accu}^{t-1}+C^{t}_{mini} \end{array} $$
(16)
$$ \begin{array}{@{}rcl@{}} K_{accu}^{t} &=&\delta K_{accu}^{t-1}+K^{t}_{mini} \end{array} $$
(17)

where δ ∈ [0,1). A normalising factor is also computed as \(c_{t} = \delta c_{t-1} + 1\). The approximate covariance and kurtosis are then given as:

$$ \begin{array}{@{}rcl@{}} C_{appr}^{t} &=& \frac{C^{t}_{accu}}{c_{t}} \end{array} $$
(18)
$$ \begin{array}{@{}rcl@{}} K_{appr}^{t} &=&\frac{K^{t}_{accu}}{c_{t}} \end{array} $$
(19)

Therefore, Eq. 13 can be rewritten as follows:

$$ \begin{array}{@{}rcl@{}} R(\phi) = \beta \|C_{appr}\|_{1}-\gamma\|K_{appr}\|_{1} \end{array} $$
(20)
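The sketch below implements this stochastic independent loss in PyTorch, keeping the accumulators of Eqs. (16)-(17) as running state. Detaching the accumulated history so that gradients only flow through the current mini-batch is our own implementation choice, as is assuming the codes are already (approximately) zero-mean.

```python
import torch

class StochasticIndependenceLoss:
    """Running estimate of covariance and kurtosis over mini-batches (Eqs. 14-20); a sketch."""
    def __init__(self, beta=1.0, gamma=1.0, delta=0.9):
        self.beta, self.gamma, self.delta = beta, gamma, delta
        self.C_accu = self.K_accu = None
        self.c = 0.0

    def __call__(self, Z):                                            # Z: (m, k) mini-batch of codes
        m = Z.shape[0]
        C_mini = Z.t() @ Z / (m - 1)                                  # Eq. (14)
        K_mini = m * (Z ** 4).sum(0) / ((Z ** 2).sum(0) ** 2) - 3.0   # Eq. (15)
        if self.C_accu is None:
            self.C_accu = torch.zeros_like(C_mini)
            self.K_accu = torch.zeros_like(K_mini)
        # accumulate history; detach so gradients flow only through the current batch
        self.C_accu = self.delta * self.C_accu.detach() + C_mini      # Eq. (16)
        self.K_accu = self.delta * self.K_accu.detach() + K_mini      # Eq. (17)
        self.c = self.delta * self.c + 1.0                            # normalising factor c_t
        C_appr, K_appr = self.C_accu / self.c, self.K_accu / self.c   # Eqs. (18)-(19)
        return self.beta * C_appr.abs().sum() - self.gamma * K_appr.abs().sum()   # Eq. (20)
```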

For the reconstruction losses, we use an L2 loss for the features and the BP-MLL loss for the labels:

$$ \begin{array}{@{}rcl@{}} L(\phi_{x},\psi_{x}) &=&\|X-\psi_{x}(\phi_{x}(X))\|_{F}^{2} \end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} L(\phi_{y},\psi_{y}) &=& \sum\limits_{i} E_{i},\\ E_{i} &=& \frac{1}{|{y_{i}^{0}}||{y_{i}^{1}}|}\sum\limits_{(p,q)\in {y_{i}^{1}}\times {y_{i}^{0}}} \exp\left(\psi_{y}(\phi_{y}(y_{i}))^{q} - \psi_{y}(\phi_{y}(y_{i}))^{p} \right) \end{array} $$
(22)
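A direct PyTorch implementation of the BP-MLL loss in Eq. (22) is sketched below, where y_i^1 and y_i^0 denote the index sets of relevant and irrelevant labels of example i; skipping examples with an empty positive or negative set is our own safeguard.

```python
import torch

def bpmll_loss(outputs, targets):
    """BP-MLL pairwise exponential loss of Eq. (22): outputs are decoded label scores,
    targets the binary ground truth (illustrative sketch)."""
    loss = outputs.new_zeros(())
    for o, y in zip(outputs, targets):
        pos, neg = o[y > 0.5], o[y <= 0.5]         # scores of relevant (y^1) and irrelevant (y^0) labels
        if len(pos) == 0 or len(neg) == 0:
            continue                               # skip samples where |y^1| or |y^0| is zero
        diff = neg.unsqueeze(0) - pos.unsqueeze(1) # (|y^1|, |y^0|) matrix of o_q - o_p
        loss = loss + torch.exp(diff).sum() / (len(pos) * len(neg))
    return loss
```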

As for the correlation term, since the covariance is already being minimized, an L2 loss between the two codes is a good approximation of canonical correlation:

$$ \begin{array}{@{}rcl@{}} T(\phi_{x},\phi_{y})=\|Z_{x}-Z_{y}\|_{F}^{2} \end{array} $$
(23)

4 Optimization

Due to the correlation term in formula (7), our objective function is not jointly convex in Wx and Wy, although formula (7) is convex in Wx or Wy separately. We can hardly find a globally optimal solution with a gradient descent method, so we adopt a greedy strategy and solve the problem approximately by alternately solving the subproblems of formula (7).

4.1 Updating Wx and Wy

The subproblem for Wx can be written as follows:

$$ \begin{array}{@{}rcl@{}} W_{x}&= &\underset{W_{x}}{\arg\min} \|X-X{W_{x}^{T}}W_{x}\|_{F}^{2} + \alpha\sum\limits_{i,j}\log(\cosh(X{W_{x}^{T}}))_{ij} \\ &&- \gamma\|T(X{W_{x}^{T}},Y{W_{y}^{T}})\|_{F} \end{array} $$
(24)

where X is the feature matrix. It is hard to derive a closed-form solution of formula (24), but the subproblem is differentiable, so we can use an unconstrained optimizer (e.g., L-BFGS or SGD) to minimize this loss function. The gradient with respect to Wx consists of two parts: the reconstruction (and non-Gaussianity) term and the correlation term. The gradient of the former can be computed as

$$ \begin{array}{@{}rcl@{}} \nabla_{W_{x}}J(W_{x},W_{y})&=& 2W_{x}{W_{x}^{T}}W_{x}X^{T}X +2W_{x}X^{T}X{W_{x}^{T}}W_{x} \\ &&-4W_{x}X^{T}X +\alpha\tanh(W_{x}X^{T})X \end{array} $$
(25)
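The expression above can be checked numerically; the snippet below evaluates this analytic gradient of the reconstruction and log-cosh terms and compares it with central finite differences on random data (sizes and tolerances are illustrative).

```python
import numpy as np

def grad_rec(W, X, alpha):
    """Analytic gradient of ||X - X W^T W||_F^2 + alpha * sum(log cosh(X W^T)) w.r.t. W (Eq. 25)."""
    S = X.T @ X
    return 2 * W @ W.T @ W @ S + 2 * W @ S @ W.T @ W - 4 * W @ S + alpha * np.tanh(W @ X.T) @ X

# finite-difference sanity check of the expression above
rng = np.random.default_rng(0)
X, W, alpha, h = rng.standard_normal((30, 6)), rng.standard_normal((3, 6)), 0.1, 1e-6
obj = lambda W: (np.linalg.norm(X - X @ W.T @ W, 'fro') ** 2
                 + alpha * np.log(np.cosh(X @ W.T)).sum())
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W); E[i, j] = h
        num[i, j] = (obj(W + E) - obj(W - E)) / (2 * h)
assert np.allclose(grad_rec(W, X, alpha), num, rtol=1e-4)
```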

To compute the gradient of the correlation term, we use the chain rule. Given \(B=X{W_{x}^{T}}\) and \(A=Y{W_{y}^{T}}\), the centered matrices are \(\hat {B}=B-\frac {1}{N}\pmb {1}B\) and \(\hat {A}=A-\frac {1}{N}\pmb {1}A\), where \(\pmb {1} \in \mathbb {R}^{N\times N} \) is an all-ones matrix. Assume the singular value decomposition of T(B,A) is T(B,A) = UDVT. We have

$$ \begin{array}{@{}rcl@{}} &&(\nabla_{W_{x}}h(B,A))_{ij}=\sum\limits_{a,b}\frac{\partial h(B,A)}{\partial (B)_{ab}}\frac{\partial (B)_{ab}}{\partial (W_{x})_{ij}} \\ &&= \sum\limits_{a,b}\frac{1}{N-1}(2\nabla_{xx}\hat{B}+\nabla_{xy}\hat{A})_{ab}X_{aj}\delta_{bi}\\ &&= \frac{1}{N-1}((2\nabla_{xx}\hat{B}+\nabla_{xy}\hat{A})^{T}X)_{ij} \end{array} $$
(26)

where

$$ \begin{array}{@{}rcl@{}} && \nabla_{xy}={\Sigma}_{xx}^{-1/2}UV^{T}{\Sigma}_{yy}^{-1/2}\\ &&\nabla_{xx}=-\frac{1}{2}{\Sigma}_{xx}^{-1/2}UDV^{T}{\Sigma}_{xx}^{-1/2} \end{array} $$

The total gradient with respect to Wx is then as follows:

$$ \begin{array}{@{}rcl@{}} \nabla_{W_{x}}J(W_{x},W_{y})&=&2W_{x}{W_{x}^{T}}W_{x}X^{T}X+2W_{x}X^{T}X{W_{x}^{T}}W_{x} \\ &&-4W_{x}X^{T}X + \alpha\tanh(W_{x}X^{T})X \\ &&-\frac{\gamma}{N-1}((2\nabla_{xx}\hat{B}+\nabla_{xy}\hat{A})^{T}X), \end{array} $$
(27)

and the gradient w.r.t. Wy has a similar expression (Table 2).

Table 2 Examples of test images from iapr tc12 dataset

5 Experiments

5.1 Configuration

In order to validate the proposed algorithm, we perform experiments on six benchmark datasets from Mulan.Footnote 1 There are two relatively small datasets, medical [24] and enron [20], and two larger ones, Corel16k [3] and iapr tc12 [11]. The Corel16k dataset contains over 16,000 images and is organized into 10 different samples; we choose three of these samples to limit computational cost. All of these datasets have already been pre-separated into training and testing sets. The iapr tc12Footnote 2 dataset is a collection of 19,627 natural images taken from locations around the world, including pictures of different sports and actions, photographs of people, animals, cities, landscapes and many other aspects of contemporary life. We adopt the SIFT-based representation as in [12]. Some statistics of the datasets are given in Table 3.

Table 3 Statistics of the datasets in experiments

In the experiments, we compare our method with several competitive methods to validate its predictive performance. Brief introductions are given below.

  • CPLST: The label space is encoded by a feature-aware principal label space transformation. It reduces the label space while considering feature information.

  • FaIE: This approach aims to find a latent space that maximizes its recoverability and predictability. Linear-FaIE is adopted in the experiments.

  • LEML: One of the state-of-the-art approaches that tackles the multi-label problem in a generic low-rank empirical risk minimization framework.

  • SLEEC: This is one of the state-of-the-art approaches for learning sparse local embeddings in multi-label classification. It finds a latent space that preserves the pairwise distances between the closest label vectors.

The implementations of all comparison methods use the code provided by their authors. For the hyperparameters, we use the authors' recommendations where available; otherwise, we tune the hyperparameters on each dataset. SLEEC needs to cluster the data before embedding; we set the number of clusters to Nclusters = ⌈N/6000⌉ as recommended, where N is the number of training examples. For our method, the hyperparameters α, β and γ are determined by 5-fold cross validation on the training set; they are chosen from {0.001, 0.01, 0.1, 1, 10}.

One key parameter for all methods is the label compression rate k/l, where k and l are the dimensions of the latent label space and the original label space, respectively. We compare all methods under four different rates: 20%, 40%, 60%, and 80%. We also investigate the sensitivity of the parameters on the Corel16k, medical and iapr tc12 datasets.

Following previous works, we use ridge regression as the base regressor for a fair comparison, except for SLEEC, which uses kNN for prediction. We use two widely-used ranking-based evaluation metrics to validate all methods: precision at k (P@k), which measures the fraction of correct predictions among the top-k scoring labels, and the normalized Discounted Cumulative Gain (nDCG@k). These metrics are commonly used in multi-label learning experiments. Denote y ∈{0,1}l as the ground-truth label and \(\hat {y} \in \mathbb {R}^{l}\) as the vector of predicted scores. P@k and nDCG@k are expressed as

$$ \text{P@}k:=\frac{1}{k}\sum\nolimits_{i\in rank_{k}(\hat{y})}y_{i}, $$

and

$$ \text{nDCG@}k:=\frac{1}{k}\sum\nolimits_{i\in rank_{k}(\hat{y})}\frac{y_{i}}{\log(i+1)}, $$

where \(rank_{k}(\hat {y})\) returns the indices of the k largest entries of \(\hat {y}\), ranked in descending order.
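The two metrics, as defined above, can be computed as follows; the base-2 logarithm in the nDCG discount and the averaging over test examples are our own conventions, since the text leaves them unspecified.

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """P@k: fraction of relevant labels among the top-k scored ones, averaged over examples."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    return np.mean([y_true[i, topk[i]].sum() / k for i in range(len(y_true))])

def ndcg_at_k(y_true, y_score, k):
    """nDCG@k following the simplified 1/k normalisation used in the text (log base 2 assumed)."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    discount = np.log2(np.arange(2, k + 2))        # log(i+1) for ranks i = 1..k
    return np.mean([(y_true[i, topk[i]] / discount).sum() / k for i in range(len(y_true))])
```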

5.2 Experiment results

To validate the proposed method, we compare it with several popular methods; we also include some variants of our proposed method to illustrate the reasonableness of our configuration. Tables 4 and 5 show the overall results. We evaluate the experimental results in terms of Precision@1,3,5 and nDCG@1,3,5. In general, our approach obtains the best results. From the experimental results, we observe the following: (1) The performance of our proposed approach improves as the label compression rate increases on almost every dataset; (2) Compared with the other methods, our approach performs significantly better on the top-1 and top-3 evaluation metrics, and it achieves the best top-1 precision on all datasets; (3) SLEEC does not work very well on enron. We think the reason might be that this dataset has a small label set, so the distance between two label vectors cannot precisely reflect the label structure; (4) Our approach ranks better in terms of nDCG than in terms of Precision. This is because nDCG is a cumulative quantity in which the top-1 gain has the largest weight; as shown in these two tables, our approach has advantages in the top-1 and top-3 metrics, and therefore performs better in nDCG; (5) As an extension of CPLST and FaIE, our approach generally outperforms both on these datasets, with an average improvement of 1% to 2%.

Table 4 Performance comparison of multi-label learning methods with precision@k (%)
Table 5 Performance comparison of multi-label learning methods with ndcg@k (%)

5.3 Feature independent components

The number of feature-independent components (FIC) is one of the key hyperparameters in our approach. We analyze the influence of the number of FICs on three datasets; to do so, we fix the other hyperparameters and adjust only the number of FICs. We choose FIC rates (the number of FICs over the feature dimension) between 0.01 and 1.4. Results are presented in Fig. 2, from which we observe the following: (1) The performance on all datasets remains stable as the FIC rate increases and converges to its best value. We can therefore choose a high FIC rate to ensure good performance; even for lower FIC rates such as 0.1, however, a good result can be expected. This phenomenon suggests that example features are often redundant, which is consistent with the motivation of FSDR methods. (2) We observe a decrease in performance when the FIC rate is below a certain threshold (i.e. the red line). Note that the decrease usually occurs on the left side of the red line, indicating that the dimension of the feature subspace is smaller than that of the label subspace; under these circumstances, a multivariate regression that maps a low-dimensional space to a higher-dimensional space can be quite inaccurate. The performance of conventional embedding methods that discover a single subspace should lie around the red line. As the FIC rate increases, moreover, the performance of our method becomes slightly better than the red line, especially for the top-1 metric, which explains the superiority of our method in terms of top-1 precision. (3) The behaviors of the FIC rate and the label compression rate are quite different. The FIC rate has a small threshold, beyond which performance barely depends on it; in contrast, the higher the label compression rate, the better the performance we usually obtain. As shown in Tables 4 and 5, there is a significant improvement on the medical dataset under a 20% label compression rate. Note that a 20% label compression rate on the medical dataset implies nine label-independent components; in other words, a traditional embedding method would use nine feature-independent components, corresponding to an FIC rate of around 0.006, which is close to the threshold on the medical dataset. Hence, the comparison methods suffer a performance reduction because the FIC rate is so low. This is likely why our proposed approach achieves such a significant improvement over the comparison methods on the medical dataset. It is difficult to determine exactly how many FICs are needed in practice; the experimental results suggest using an FIC rate larger than 0.1, and the number of FICs should also be larger than the number of label-independent components.

Fig. 2 Precision under different FIC rates. The blue line is Precision@1, the orange line is Precision@3, and the green line is Precision@5. The red line marks the point at which the dimensions of the two subspaces are equal

5.4 Comparison with variant algorithms

To validate the effect of each part of our model, we consider the following variants: (1) L-IFLC considers only the ICA term of the labels and the correlation term; we achieve this by fixing Wx to the identity matrix; (2) F-IFLC considers only the ICA term of the features and the correlation term; (3) C-IFLC excludes the correlation term. In this experiment, we run all methods with a 40% label compression rate; for F-IFLC we use a 40% FIC rate. Results are given in Table 6.

Table 6 Performance comparison of variant methods with precision@k

From the experimental results, we can draw the following observations. (1) C-IFLC is the worst method, and there is a huge gap between C-IFLC and the other methods; the correlation term is therefore very important for our method. (2) On Corel16k1, F-IFLC outperforms L-IFLC, whereas on iapr tc12, L-IFLC outperforms F-IFLC. As CPLST suggests, there are two types of label correlation: feature-unaware correlation and feature-aware correlation. L-IFLC may overemphasize the feature-unaware correlation, while F-IFLC errs in the other direction. Our method strikes a reasonable balance between the two by introducing both feature and label subspaces, and achieves even better results.

5.5 Results on iapr tc12 dataset

The iapr tc12 dataset is much larger and has various label categories. Our approach achieves the best result among all comparison methods under almost every label compression rate in terms of the different evaluation metrics. Table 2 presents example annotations on the iapr tc12 dataset produced by FaIE, SLEEC and our proposed approach. The top-k predicted labels from each method are taken as the annotation labels, where k is the number of true labels. As shown in the table, the mismatched labels produced by our approach are still closely related to the image content.

5.6 Comparisons with DNN methods

In this section, we compare our deep learning version, DICE, against some state-of-the-art methods:

  • BP-MLL: One of the baseline neural network methods for multi-label classification.

  • C2AE: Learns a feature-aware label subspace, but only includes reconstruction of the labels.

  • CNN-RNN: A unified framework that uses an RNN to explore label co-occurrence, and then combines the recurrent representation with CNN features to improve classification performance.

We conduct our experiments on the following datasets: iapr tc12, tmc2007, espgame and NUS-WIDE. These four datasets are all image datasets. For the first three datasets, 1000-dimensional SIFT features are extracted. For NUS-WIDE, we extract 4096-dimensional features with a pre-trained AlexNet. For a fair comparison, we also use the same pre-trained AlexNet structure in the CNN part of the CNN-based method CNN-RNN.

Regarding the neural network architecture, our method, C2AE and BP-MLL use the same architecture. We use two hidden layers to encode (decode) the features, and one hidden layer to encode (decode) the labels; each hidden layer has 512 neurons. For the output layer, we use the sigmoid activation function, and leaky ReLU for the other layers. The batch size is fixed at 500 for C2AE and BP-MLL, while DICE's batch size is fixed at 100. To select parameters for DICE, we randomly leave out 1/6 of the training data for validation, and we adopt the same strategy for the other methods. As for evaluation metrics, we consider micro-F1, macro-F1, top-k precision and top-k nDCG; for CNN-RNN, we only report micro-F1 and macro-F1.

Figure 3 and Table 7 compare the performance of DICE, C2AE and BP-MLL. From Table 7, we can see that DICE achieves superior performance: DICE > C2AE > BP-MLL. Both DICE and C2AE contain the BP-MLL loss, so the improvement over BP-MLL is quite obvious. Compared to C2AE, DICE introduces the stochastic independent loss and the reconstruction of the features, which allows it to attain better results. Moreover, with the help of the stochastic independent loss, we are able to use a much smaller batch size, which saves computing resources. From Fig. 3, we notice that BP-MLL suffers more serious performance degradation when the label compression rate is small; we attribute this to the feature-aware information exploited by the other methods.

Fig. 3 Micro-F1 and macro-F1 of different methods

Table 7 Performance comparison of DNN methods

5.7 Results on NUS-WIDE

The NUS-WIDE dataset is a web image dataset consisting of 269,648 images and 5,018 tags from Flickr. After some simplification, only 81 concepts remain; they are more accurate and less noisy. Table 8 lists and compares the classification results of all four methods. CNN-RNN uses an RNN to exploit label co-occurrence information; according to the experimental results, label co-occurrence appears to be weak information for classification, since CNN-RNN cannot even outperform BP-MLL. C2AE introduces feature-aware information and achieves better results, but our method DICE still attains the most promising results among all methods. This supports the benefits of independent representation and feature reconstruction.

Table 8 Performance comparison on NUS-WIDE

In terms of computation time, CNN-RNN takes a relatively long time to train due to the RNN. Although DICE takes more time per epoch than BP-MLL and C2AE, it converges in fewer epochs; overall, DICE can be faster than C2AE while achieving better results.

6 Conclusion

In this paper we proposed a method that learns separate subspaces for features and labels by maximizing the independence between components within each subspace and the correlation between the two subspaces. To solve the resulting non-convex problem, we used alternating optimization. We also studied the connections between our model and some existing methods, and showed that the principles adopted in our method are widely used in multi-label classification. Experiments on real-world multi-label datasets showed the superior performance of our method and the necessity of exploring independent components in multi-label data. Furthermore, we proposed a stochastic independent loss and built a neural network version of IFLC, which also attains superior performance.