1 Introduction

In recent years, multi-instance multi-label learning (MIML) has attracted much attention in the machine learning community. In MIML, an example is described by several instances and associated with a set of labels [1]. Many real-world problems can be formalized under the MIML framework, such as text categorization, scene classification, web page classification and gene sequence encoding. For example, in text categorization, each document usually comprises several paragraphs or sections, each of which can be regarded as an instance, while the document may be assigned to a set of predefined topics. Similarly, multiple links can be extracted from a web page, each described by an instance, so the web page can be represented by a set of instances; at the same time, the web page may belong to many classes, such as news page, sports page, soccer page, etc. [2].

Traditional supervised learning, i.e., single-instance single-label learning (SISL), can be viewed as a degenerated version of MIML in which each example is represented by a single instance and associated with a single class label. Hence, one way to solve a MIML problem is to identify its equivalent SISL problem via degeneration. Although this degeneration strategy is feasible, the performance of the resulting algorithm may suffer from the loss of information during the degeneration process [3].

The Support Vector Machine (SVM) [4] has been extensively applied in different areas, such as incremental learning [5], multi-class classification [6], and semi-supervised classification [7]. The traditional SVM can only solve single-instance single-label learning problems. In contrast, E-MIMLSVM+ is an SVM-based algorithm that can solve multi-instance multi-label learning problems and, in addition, exploits the correlations between labels to improve classification performance [8]. E-MIMLSVM+ is an improved version of MIMLSVM+. MIMLSVM+ first employs a simple degeneration strategy: it decomposes the learning of multiple labels into a series of binary classification tasks and constructs an SVM for each label. Since the kernel function employed here is based on a set of instances instead of a single feature vector, the well-known multi-instance kernel [9] is adopted. Because MIMLSVM+ decomposes the multi-label problem into a series of independent binary learning tasks, the correlation between labels is neglected. To overcome this drawback, E-MIMLSVM+ incorporates label correlations by utilizing multitask learning techniques, which treat the SVM training of each label as a task [10, 11]. Since MIMLSVM+ is a support vector machine algorithm, the kernel-based multitask learning framework [12] is employed. However, E-MIMLSVM+ consumes more time and memory than MIMLSVM+, since training all tasks simultaneously involves many more instances in the optimization procedure [8].

In this paper, we propose a new SVM approach to MIML named ECC-MIMLSVM+. Briefly, ECC-MIMLSVM+ employs a degeneration strategy to decompose multi-label learning into a series of binary classification tasks. It then applies a classifier chains method that exploits label correlation information while maintaining acceptable computational complexity.

The rest of this paper is organized as follows. Section 2 gives the formal definition of MIML and reviews the related works. Section 3 presents ECC-MIMLSVM+. Section 4 reports experimental results on two real-world MIML data sets. Finally, Section 5 concludes the paper.

2 Related works

In the MIML framework, a learning algorithm typically takes a set of labeled training examples \(L = \{(X_{1}, Y_{1}), (X_{2}, Y_{2}), \ldots, (X_{n}, Y_{n})\}\) as input, where \(X_{i} \subseteq \mathcal{X}\) is a bag of instances \(\left\{x_{1}^{(i)}, x_{2}^{(i)}, \ldots, x_{n_{i}}^{(i)}\right\}\), \(x_{j}^{(i)} \in \mathcal{X}\ (j=1,2,\ldots,n_{i})\), and \(Y_{i} \subseteq \mathcal{Y}\) is a set of labels \(\left\{y_{1}^{(i)}, y_{2}^{(i)}, \ldots, y_{l_{i}}^{(i)}\right\}\), \(y_{k}^{(i)} \in \mathcal{Y}\ (k=1,2,\ldots,l_{i})\), associated with \(X_{i}\). The task of MIML is to learn a function \(f: 2^{\mathcal{X}} \to 2^{\mathcal{Y}}\) from a set of MIML training examples L. The MIML framework can be viewed as a generalization of multi-instance learning [14], multi-label learning [15, 16], and traditional supervised learning.
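As a concrete illustration, the following minimal Python sketch shows one way MIML training examples might be represented in practice; the variable names are ours and are not part of the formal framework above.

```python
import numpy as np

# A minimal, hypothetical representation of one MIML training example:
# a bag is a 2-D array whose rows are the instances x_1^(i), ..., x_{n_i}^(i),
# and the label set Y_i is a Python set of label identifiers.
bag = np.array([[0.2, 0.7, 0.1],    # instance x_1^(i)
                [0.9, 0.4, 0.3]])   # instance x_2^(i)
label_set = {"mountains", "sea"}    # Y_i, a subset of the label space

# A MIML training set L is then simply a list of (bag, label set) pairs.
L = [(bag, label_set)]
```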

Multi-instance learning [14], or multi-instance single-label learning (MISL), was proposed by Dietterich et al. The goal of MISL is to learn a function \(f_{MIL}: 2^{\mathcal{X}} \to \{-1, +1\}\) from a set of MISL training examples \(\{(X_{1}, y_{1}), (X_{2}, y_{2}), \ldots, (X_{m}, y_{m})\}\), where \(X_{i} \subseteq \mathcal{X}\) is a bag of instances \(\left\{x_{1}^{(i)}, x_{2}^{(i)}, \ldots, x_{n_{i}}^{(i)}\right\}\), \(x_{j}^{(i)} \in \mathcal{X}\ (j=1,2,\ldots,n_{i})\), and \(y_{i} \in \{-1, +1\}\) is the binary label of \(X_{i}\).

Multi-label learning, or single-instance multi-label learning (SIML), originated from the investigation of text categorization problems. The goal of SIML is to learn a function \(f_{MLL}: \mathcal{X} \to 2^{\mathcal{Y}}\) from \(\{(x_{1}, Y_{1}), (x_{2}, Y_{2}), \ldots, (x_{m}, Y_{m})\}\), where \(x_{i} \in \mathcal{X}\) is an instance and \(Y_{i} \subseteq \mathcal{Y}\) is a set of labels \(\left\{y_{1}^{(i)}, y_{2}^{(i)}, \ldots, y_{l_{i}}^{(i)}\right\}\), \(y_{k}^{(i)} \in \mathcal{Y}\ (k=1,2,\ldots,l_{i})\), associated with \(x_{i}\).

Zhi-Hua Zhou and Min-Ling Zhang proposed two MIML algorithms named MIMLBOOST and MIMLSVM [1]. Both are typical MIML algorithms based on the degeneration strategy: they transform MIML into traditional supervised learning using MISL and SIML, respectively, as the bridge. However, useful information encoded in the relations between instances and labels may be lost during the degeneration process, which may degrade classification accuracy.

Zhi-Hua Zhou and Min-Ling Zhang also proposed a direct MIML algorithm named D-MIMLSVM [16] and a maximum margin MIML algorithm named M3MIML [17]; both are regularization-based MIML algorithms. In this context, regularization refers to techniques for stabilizing ill-posed problems, such as systems of linear equations with a large condition number, by adding extra constraints or penalty terms. D-MIMLSVM assumes that the labels associated with the same example have some relatedness, and that the performance of classifying the bags depends on the loss between the labels and the predictions on the bags as well as on the constituent instances [2]. However, it can only deal with moderate training set sizes because of the large optimization problem involved [16]. M3MIML assumes a linear model for each class, where the output on a class is determined by the maximum prediction over all instances with respect to the corresponding linear model; the outputs on all possible classes are then combined to define the margin of a MIML example over the classification system.

Ying-Xin Li and Shuiwang Ji [8] designed a MIML algorithm for large-scale learning problems named MIMLSVM+, together with an extended algorithm named E-MIMLSVM+. MIMLSVM+ simply employs a degeneration strategy that decomposes the learning of multiple labels into a series of binary classification tasks. Unlike the traditional SVM, the kernel function used in MIMLSVM+ is based on a bag of instances instead of a single feature vector. In principle, any kernel defined on sets of instances [18] can be used; the well-known multi-instance kernel [9] is adopted in MIMLSVM+. As the degeneration strategy of MIMLSVM+ loses the information of label correlations, E-MIMLSVM+ extends MIMLSVM+ by incorporating label correlations through multitask learning techniques [10]. For a given label \(y \in \mathcal{Y}\), suppose its classification function is

$$ f_{y} (X) = \left\langle {w_{y} ,\varphi (X)} \right\rangle +b=\left\langle {(w_{0} +v_{y} ),\varphi (X)} \right\rangle +b $$
(1)

where \(w_{0}\) reflects the commonalities shared by the different learning tasks, \(v_{y}\) is the task-specific model parameter that measures the dissimilarity between task y and the other tasks, \(\varphi(X)\) is a mapping function, and b is the offset [8]. The following multi-instance multitask kernel, which bridges the multi-instance kernel and the multitask kernel, is used:

$$ K_{ty} (X_{i}, X_{j}) = \left( \frac{1}{\mu} + \delta(t=y) \right) K_{MI}(X_{i}, X_{j}) \quad (t, y \in \mathcal{Y}) $$
(2)

where t and y denote two tasks, and δ(t = y) = 1 if t = y and δ(t = y) = 0 otherwise.
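A minimal sketch of how the multitask kernel in (2) could be evaluated on top of an existing multi-instance kernel routine; the function names are illustrative and not taken from [8].

```python
def multitask_kernel(X_i, X_j, t, y, mu, k_mi):
    """Multi-instance multitask kernel K_ty of Eq. (2).

    X_i, X_j : bags of instances
    t, y     : the task (label) indices associated with X_i and X_j
    mu       : the multitask coupling parameter
    k_mi     : a callable computing the multi-instance kernel K_MI(X_i, X_j)
    """
    delta = 1.0 if t == y else 0.0
    return (1.0 / mu + delta) * k_mi(X_i, X_j)
```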

To avoid the situation in which all the models \(f_{y}\) are forced to be close to a single common parameter \(w_{0}\), a clustering process that partitions the labels into subgroups based on the correlations between labels is applied before training. Although E-MIMLSVM+ achieves superior performance over the other algorithms, it is time consuming and needs more memory than they do.

The classifier chains (CC) method is built on the binary relevance (BR) method; it overcomes the main disadvantage of BR while retaining its important advantages, most notably low time complexity. Like BR, CC models can be parallelized or serialized, and only a single binary problem needs to be kept in memory at any time, which is an obvious advantage over methods based on a single large model [19].

3 Proposed method

3.1 The MIMLSVM+ method

Suppose n is the number of training examples, \(y \in \mathcal{Y}\) is a label, and \(X_{i}\) is a bag of instances in the training set. For each label y, let \(\phi(X_{i}, y)\) be the indicator function defined as \(\phi(X_{i}, y) = 1\) if y is in the label set \(Y_{i}\) corresponding to \(X_{i}\), and \(\phi(X_{i}, y) = -1\) otherwise. The SVM classification model then involves the following optimization problem:

$$\min\limits_{w_{y} ,b_{y} ,\xi_{iy}} \frac{1}{2}\left\| w_{y} \right\|^{2}+C\sum\limits_{i=1}^{n} {\xi_{iy} } \tau_{iy} $$
$$ \text{s.t. :}~\phi (X_{i} ,y)\left( {\left\langle {w_{y} ,\varphi \left( {X_{i} } \right)} \right\rangle +b_{y} } \right)\ge 1-\xi_{iy} $$
(3)
$$\xi_{iy} \geq 0\left( {i=1,2, {\ldots} ,n} \right), $$

where 〈⋅,⋅〉 denotes the inner product, \(\varphi(X_{i})\) is the function that maps the bag of instances \(X_{i}\) into a higher dimensional space H, and \(w_{y}\) and \(b_{y}\) are the parameters of a linear discriminant function in H. \(\xi_{iy}\) is the nonnegative slack variable in the constraints that permits some training bags to be misclassified, \(\left\| w_{y} \right\|^{2}\) reflects the complexity of the model [4], and C is a parameter that balances the model complexity and the accumulated losses on the training bags. \(\tau_{iy}\) is the amplification coefficient of the loss \(\xi_{iy}\), introduced to handle the class imbalance problem and defined as

$$ \tau_{iy} =\frac{1+\phi \left( {X_{\mathrm{i}} ,y} \right)}{2}R_{y} +\frac{1-\phi \left( {X_{\mathrm{i}} ,y} \right)}{2}, $$
(4)

where \(R_{y}\) is the imbalance level of label y, which can be evaluated as the number of negative bags divided by the number of positive bags in the training set.
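The imbalance level \(R_{y}\) and the loss amplification coefficient in (4) can be sketched as follows; this is an illustrative implementation under our naming, not the authors' code.

```python
def imbalance_level(indicator_column):
    """R_y: number of negative bags divided by number of positive bags for label y.

    indicator_column holds phi(X_i, y) in {+1, -1} for all training bags."""
    n_pos = sum(1 for v in indicator_column if v == 1)
    n_neg = sum(1 for v in indicator_column if v == -1)
    return n_neg / n_pos

def tau(phi_iy, R_y):
    """Loss amplification coefficient tau_iy of Eq. (4):
    equals R_y for positive bags (phi = +1) and 1 for negative bags (phi = -1)."""
    return (1 + phi_iy) / 2 * R_y + (1 - phi_iy) / 2
```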

The kernel function is very important in a support vector machine and needs to be predefined [4]; different kernel functions lead to different support vector machines. Here the well-known multi-instance kernel is employed, defined as follows:

$$ K_{MI}\left( X_{i}, X_{j} \right) = \frac{1}{n_{i} n_{j}} \sum\limits_{\left( x_{is\_0},\, x_{is\_1} \right)\in X_{i}} \; \sum\limits_{\left( x_{jz\_0},\, x_{jz\_1} \right)\in X_{j}} e^{-\gamma_{1} \left\| x_{is\_0} - x_{jz\_0} \right\|^{2} - \gamma_{2} \left\| x_{is\_1} - x_{jz\_1} \right\|^{2}} $$
(5)

where \(n_{i}\) and \(n_{j}\) denote the number of instances in the bags \(X_{i}\) and \(X_{j}\), respectively. \(\left\| x_{is\_0} - x_{jz\_0} \right\|^{2}\) measures the similarity of visual features (low-level features represented by a local descriptor) between two instances, while \(\left\| x_{is\_1} - x_{jz\_1} \right\|^{2}\) measures the spatial distance (the distance between points, lines and surfaces in three-dimensional space) between two instances. \(\gamma_{1}\) and \(\gamma_{2}\) are the weights used to combine the visual and spatial information.
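The kernel in (5) can be sketched in Python as follows, assuming each instance is stored as a pair (visual feature vector, spatial feature vector); this is a hedged illustration of the formula rather than the authors' implementation.

```python
import numpy as np

def k_mi(bag_i, bag_j, gamma1, gamma2):
    """Multi-instance kernel of Eq. (5).

    bag_i, bag_j   : lists of instances, each instance a pair
                     (visual_features, spatial_features) of numpy arrays.
    gamma1, gamma2 : weights for the visual and spatial terms.
    """
    total = 0.0
    for x_vis_i, x_sp_i in bag_i:
        for x_vis_j, x_sp_j in bag_j:
            d_vis = np.sum((x_vis_i - x_vis_j) ** 2)   # visual feature term
            d_sp = np.sum((x_sp_i - x_sp_j) ** 2)      # spatial distance term
            total += np.exp(-gamma1 * d_vis - gamma2 * d_sp)
    return total / (len(bag_i) * len(bag_j))
```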

For a given label y and an example X, the resulting classification model can be defined as

$$\begin{array}{@{}rcl@{}} f_{y} \left( X \right) &=& \left\langle {w_{y}, \varphi \left( X \right)} \right\rangle + b_{y} \\ &=& \sum\limits_{i=1}^{n} a_{iy}\, \phi \left( {X_{i}, y} \right) K_{MI} \left( {X_{i}, X} \right) + b_{y} \end{array} $$
(6)
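Given dual coefficients \(a_{iy}\) and the offset \(b_{y}\) obtained from solving (3), the decision function (6) can be evaluated as in the sketch below; `k_mi` is the kernel sketched above and the argument names are ours.

```python
def f_y(X, train_bags, alpha_y, phi_y, b_y, k_mi, gamma1, gamma2):
    """Decision value of Eq. (6) for label y on an unseen bag X.

    train_bags : list of training bags X_1, ..., X_n
    alpha_y    : dual coefficients a_iy for label y
    phi_y      : indicator values phi(X_i, y) in {+1, -1}
    b_y        : offset for label y
    """
    score = b_y
    for X_i, a_iy, p_iy in zip(train_bags, alpha_y, phi_y):
        score += a_iy * p_iy * k_mi(X_i, X, gamma1, gamma2)
    return score
```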

3.2 The improved method: ECC-MIMLSVM+

In order to incorporate label correlations, E-MIMLSVM+ introduces multitask learning techniques, which make the algorithm consume more time and memory. In this paper we instead utilize an advanced classifier chains method, Ensembles of Classifier Chains (ECC) [19], to improve MIMLSVM+. The classifier chains method builds on the well-known binary relevance method for multi-label classification, which treats each label as an independent binary problem, and extends it by passing label information down a chain of classifiers.

Since ECC was originally designed for SIML problems, we adapt it to our MIML setting. We first extend each label into a column vector: let \(y_{k}^{\prime} = \left(y_{k}, y_{k}, \ldots, y_{k}\right)_{d}^{T}\ (k=1,2,\ldots,L)\) be the extended k-th label, where L is the total number of different labels and d denotes the dimension of the feature vectors. We then use these extended labels to form a new training set in which each training example carries the label information. Specifically, for each label \(y_{k}\ (k=1,2,\ldots,L)\), we add the label vectors of the preceding labels to each bag, obtaining \(X_{i}^{\prime} = \left[ x_{i1}, x_{i2}, \ldots, x_{in}, y_{1}^{\prime}, y_{2}^{\prime}, \ldots, y_{k-1}^{\prime} \right]\ (i=1,2,\ldots,m)\), where \(x_{ij}\) is an instance of bag \(X_{i}\), n is the number of instances in bag \(X_{i}\), and m is the number of examples. For each label \(y_{k}\) we thus obtain a training set \(S_{k} = \left\{ \left( X_{i}^{\prime}, \phi\left( X_{i}, y_{k} \right) \right) \right\}\ (k=1,2,\ldots,L)\), on which we train an SVM according to formulation (3). After this extension of the labels and the training bags, the ECC method can deal with MIML problems.
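The label extension and the construction of the chain training sets \(S_{k}\) can be sketched as follows. This is a hedged illustration: in particular, appending the ±1 indicator values of the preceding labels as the replicated vectors is our reading of the construction above, and all names are ours.

```python
import numpy as np

def extend_label(y_k, d):
    """Replicate the scalar label value y_k into a d-dimensional vector y'_k."""
    return np.full(d, y_k, dtype=float)

def chain_training_set(bags, indicator, k, d):
    """Build S_k for the k-th classifier in the chain (0-based k).

    bags      : list of bags, each a list of d-dimensional instance vectors
    indicator : indicator[i][j] = phi(X_i, y_j) in {+1, -1}
    k         : index of the label whose classifier is being trained
    d         : dimension of the instance feature vectors
    """
    S_k = []
    for X_i, phis in zip(bags, indicator):
        # Append the extended vectors of the k preceding labels to the bag.
        extra = [extend_label(phis[j], d) for j in range(k)]
        X_i_prime = list(X_i) + extra
        S_k.append((X_i_prime, phis[k]))
    return S_k
```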

In classifier chain models, the order of the chain normally has an effect on accuracy. Using an Ensemble of Classifier Chains (ECC), each with a random label order, greatly reduces the risk of a poor ordering having an overall negative effect on classification accuracy, at only a linear time cost with respect to the number of iterations [19]. ECC trains N CC classifiers \(h_{1}, \ldots, h_{N}\), each with a random chain order. Their outputs \(\overset{\wedge}{\mathrm{y}_{1}}, \ldots, \overset{\wedge}{\mathrm{y}_{N}}\) are used to calculate a confidence vector \(\overset{\wedge}{\mathrm{w}} = \left[ \overset{\wedge}{w_{1}}, \ldots, \overset{\wedge}{w_{L}} \right] \in R^{L}\), where L is the total number of different labels, \(\overset{\wedge}{w_{j}}\) is the confidence of the j-th label, and \(R^{L}\) is the output space:

$$ \overset{\wedge }{w_{j}}=\frac{1}{N}\sum\limits_{k=1}^{N} \overset{\wedge }{y_{j,k} } $$
(7)

A threshold function can be applied to \(\overset{\wedge}{\mathrm{w}}\) to predict \(\overset{\wedge}{\mathrm{y}}\):

$$ \overset{\wedge }{y_{j}}=\left\{ {\begin{array}{ll} 1 & \quad if~ \overset{\wedge }{w_{j}} \geq t \\ 0 & \quad if~ \overset{\wedge }{w_{j}} < t \\ \end{array}} \right. $$
(8)

where t is the threshold, calibrated as follows:

$$ \mathrm{t}=\arg \min\limits_{t} \left\| {LCARD(S)-\left(\frac{1}{N}\sum\limits_{i=1}^{N} \sum\limits_{j=1}^{\mathrm{L}} {1_{\overset{\wedge }{w_{j}} \geq t}} \right)} \right\| $$
(9)

where N here denotes the number of test examples and \(LCARD(S) = \frac{1}{\left| S \right|}\sum\limits_{i=1}^{\left| S \right|} \sum\limits_{j=1}^{L} {y_{j}^{i}}\) denotes the label cardinality of the training set S, i.e., the average number of labels associated with each training example [20].
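A sketch of the calibration in (9): the threshold t is chosen so that the average number of predicted labels per test example matches the label cardinality of the training set. The search grid of candidate thresholds and all names here are our assumptions.

```python
import numpy as np

def calibrate_threshold(W_hat, lcard_train, candidates=None):
    """Pick the threshold t of Eq. (9).

    W_hat       : (num_test_examples, L) array of confidences w_hat_j
    lcard_train : LCARD(S), the average number of labels per training example
    candidates  : candidate threshold values to search over
    """
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)   # assumed search grid
    best_t, best_gap = None, np.inf
    for t in candidates:
        predicted_card = np.mean(np.sum(W_hat >= t, axis=1))
        gap = abs(lcard_train - predicted_card)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```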

For an unseen bag X and a given chain order, we make predictions starting from the first label of the chain. Before predicting the j-th label, we form \(X^{\prime} = \left[x_{1}, x_{2}, \ldots, x_{n}, \overset{\wedge}{y_{1}}, \overset{\wedge}{y_{2}}, \ldots, \overset{\wedge}{y_{j-1}}\right]\), where \(\overset{\wedge}{y_{i}}\) is the vector extended from the prediction of the i-th classifier in the chain on bag X, and then we use \(X^{\prime}\) to predict the j-th label \(\overset{\wedge}{y_{j}} = f_{j}(X^{\prime})\). Finally, we use the predictions of all chains to calculate the confidence vector and obtain the final prediction set Y through the threshold function.
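Prediction along one chain can be sketched as follows; this is illustrative only, `classifiers[j]` is assumed to return the binary prediction of the j-th label in that chain's order, and `extend_label` is the helper sketched earlier.

```python
def predict_chain(X, classifiers, d):
    """Predict all labels of an unseen bag X along one classifier chain.

    X           : list of instance vectors of the unseen bag
    classifiers : chain of trained per-label predictors f_1, ..., f_L
    d           : dimension of the instance feature vectors
    """
    X_prime = list(X)
    y_hat = []
    for f_j in classifiers:
        y_j = f_j(X_prime)                           # predict the j-th label
        y_hat.append(y_j)
        X_prime = X_prime + [extend_label(y_j, d)]   # feed the prediction forward
    return y_hat
```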

Figure 1 illustrates the pseudo-code of our proposed algorithm.

Fig. 1

ECC-MIMLSVM+ algorithm

The main idea of the proposed algorithm is as follows: we first modify the MIML training examples to fit the ECC algorithm, then use the ECC training procedure to train a chain of classifiers, where each classifier in the chain is trained with the MIMLSVM+ procedure. During testing, we modify the examples in the test set in the same way and use the predicted values to calculate a confidence vector \(\overset{\wedge}{w}\), each dimension of which denotes the confidence of one label. Finally, given a threshold t and the confidences \(\overset{\wedge}{w_{j}}\ (j=1,2,\ldots,L)\), we obtain the final predicted label set.

4 Experiment and argumentation

4.1 Experimental setup

In this section, the performance of ECC-MIMLSVM+ is compared with MIMLBOOST, MIMLSVM and E-MIMLSVM+ on two real-world MIML learning tasks. The first data set is a scene data set collected from the COREL image collection and the Internet. The scene classification data contains 2,000 natural scene images with 5 possible class labels: desert, mountains, sea, sunset and trees; a set of labels is manually assigned to each image. Images belonging to more than one class comprise over 22 % of the data set, and the average number of labels per image is 1.24 ± 0.44. Each image is represented as a bag of nine 15-dimensional instances.

The second data set is a text data set collected from the widely studied Reuters-21578 collection [21]. The seven most frequent categories are considered. After removing documents whose label sets or main texts are empty, 8,866 documents are retained, of which only 3.37 % are associated with more than one class label. After randomly removing documents with only one label, a text categorization data set containing 2,000 documents is obtained. Documents with multiple labels comprise around 15 % of the resultant data set, and the average number of labels per document is 1.15 ± 0.37. Each document is represented as a bag of instances using the sliding window technique [22], where each instance corresponds to a text segment enclosed in one sliding window of size 50. “Function words” on the SMART stop-list [23] are removed from the vocabulary and the remaining words are stemmed. Instances in bags adopt the “Bag-of-Words” representation based on term frequency [21, 24]. Without loss of effectiveness, dimensionality reduction is performed by retaining the top 2 % of words with the highest document frequency [25]. Thereafter, each instance is represented as a 243-dimensional feature vector. Table 1 summarizes the characteristics of both data sets.

Table 1 Characteristics of the data sets

ECC-MIMLSVM+ is compared with E-MIMLSVM+, MIMLBOOST and MIMLSVM. The parameters of E-MIMLSVM+, MIMLBOOST and MIMLSVM are set according to [8] and [1], respectively. In particular, the number of boosting rounds for MIMLBOOST is set to 25 and a Gaussian kernel with γ = 0.22 is used to implement MIMLSVM. The kernel parameters of E-MIMLSVM+ are \(\gamma_{1} = 10^{-5}\) and \(\gamma_{2} = 10^{-2}\), and the cluster parameter q is set to 0.5. The number of chain orders (the ensemble size) is set to 3 in ECC-MIMLSVM+. For a fair comparison, we employ the same setting with the same partition of the data sets and report the average performance.

The performance of the four MIML algorithms is evaluated according to five popular multi-label metrics: hamming loss, one-error, coverage, ranking loss and average precision. Briefly, for hamming loss, one-error, coverage and ranking loss, the smaller the value, the better the performance; for average precision, the larger the value, the better the performance.
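As an example of the evaluation protocol, hamming loss (the fraction of example–label pairs that are misclassified) can be computed as in the following sketch; the other metrics follow their standard multi-label definitions, and the array layout here is our assumption.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Hamming loss: fraction of misclassified example-label pairs.

    Y_true, Y_pred : (num_examples, L) binary arrays of true and predicted labels.
    """
    return np.mean(Y_true != Y_pred)
```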

4.2 Experimental results

Tables 2 and 3 show the experimental results of each compared algorithm under the five metrics, together with the running time, on the scene data and the Reuters data, respectively. For hamming loss, one-error, coverage, ranking loss and running time, the smaller the value, the better the performance; for average precision, the larger the value, the better the performance. The best result on each evaluation criterion is highlighted in boldface. As can be seen from Tables 2 and 3, ECC-MIMLSVM+ performs better than the other three algorithms on both data sets.

Table 2 Experimental results on the scene data set
Table 3 Experimental results on the Reuters data set

Table 4 shows the time consumed by the four compared algorithms on both data sets. As can be seen from Table 4, ECC-MIMLSVM+ is slightly slower than MIMLSVM, while far faster than MIMLBOOST and E-MIMLSVM+.

Table 4 Training time of each algorithm on both data sets

5 Conclusions

In this paper, a novel SVM method for the MIML problem, named ECC-MIMLSVM+, is proposed. This method captures the connections between labels through the classifier chains method in an ensemble framework (ECC). Experiments on both scene classification and text categorization show that our method is more efficient and produces better performance than other MIML methods.