
1 Introduction

Compared with traditional single-instance learning (SIL), multi-instance learning (MIL) studies bags that contain multiple instances. Taking drug activity prediction as an example, molecules and their isomers are viewed as bags and instances, respectively. The task is to predict whether a new molecule is suitable for making drugs. A molecule is positive if at least one of its isomers can be used to make drugs; otherwise, it is negative. Multi-instance problems are common in real-world applications, such as image retrieval [2], text classification [21], and web index recommendation [14].

Fig. 1.

The main framework of MIHI compared with traditional methods. Traditional methods usually use clustering algorithms to select instance prototypes in the entire instance space. Our method first selects instance prototypes inside each bag, and then selects the high-level instances from these prototypes.

In recent years, many embedded MIL algorithms based on instance prototypes have been proposed. Their common strategy is to perform clustering in the entire instance space to select instance prototypes [10, 15]. MILFM [10] first selects instance prototypes in the entire instance space, and selects cluster instances from the negative bags. CMIL [9] divides only the instances of the positive bags into multiple clusters, and selects the instance with the highest score in each cluster as an instance prototype. However, two dilemmas are encountered: 1) the cardinality of the instance space is much larger than that of the bag space; and 2) the number of negative instances is far greater than that of positive instances. As a result, classification effectiveness may be reduced. Figure 1 shows an example of a tiger image classification task. In subgraphs (a) and (b), there are tigers, grass, and water. In subgraphs (c) and (d), there are only grass and water. Obviously, grass and water occupy a large proportion of the entire feature space, so the instance prototypes chosen by traditional methods are more likely to be grass and water than tigers. Moreover, the selected instances have weak representativeness because the internal structure of the bag is ignored. Therefore, the selection of highly representative instance prototypes is the key to embedded MIL algorithms.

In this paper, we propose the multi-instance embedding learning through high-level instance selection (MIHI) algorithm to handle these issues with two techniques. Figure 1 shows the main framework of MIHI. The goal is to select high-level instances with strong representativeness. In Step 1, the fast bag-inside instance selection technique selects instance prototypes from each bag. This technique takes into account the density and affinity of instances within the bag, so the instance prototypes highlight the bag's internal structure information. In Step 2, the high-level instance selection technique chooses globally representative instances. For each instance prototype, we calculate its local density and its minimum distance from higher-density prototypes. The instance prototypes with peak density are then identified as the high-level instances. Experiments on six learning tasks confirm the effectiveness of MIHI in terms of efficiency and classification accuracy. The main contributions of our work are:

  • We propose a fast bag-inside instance selection technique, which effectively exploits the structure information of the bag. Using new density and affinity metrics, the instance prototypes of each bag are found.

  • We propose a high-level instance selection technique based on the instance prototypes. Through a peak density metric, the high-level instances have more representative power than the other prototypes.

2 Related Work

MIL was first proposed in the study of drug activity prediction [7]. Since then, many MIL algorithms have been proposed. They are mainly divided into two categories: 1) basic methods, which predict the bag label based on the structural characteristics of the bag [21] or instance [8] space; and 2) embedding methods, which transform MIL into SIL based on representative samples [3, 17].

The basic methods mainly handle MIL problems by designing a bag-level kernel. mi-SVM and MI-SVM [2] treat bags as samples and use support vector machines. mi-SVM tries to identify the maximum-margin hyperplane for the instances, under the constraint that at least one instance of each positive bag lies in the positive half-space. MI-SVM treats the margin of the most positive instance as the margin of the bag, aiming to identify the maximum-margin hyperplane for the bags. miGraph [21] proposes an effective bag-level kernel through the affinity matrix. However, it only focuses on the relationship between bags and fails to extract instance-level information.

The bag embedding methods deal with MIL problems by transforming the space. DD-SVM [4] learns a set of instance prototypes using Diverse Density, and then embeds the bags into a new feature space based on these prototypes. MILES [3] uses a joint strategy based on all instances to implement bag embedding. BAMIC [22] selects representative bags through unsupervised learning. MIKI [19] first trains a weighted multi-class model to select instance prototypes with high positiveness, and then converts each bag into a vector with instance prototype information. To narrow the gap between the training and testing distributions, the weights of the instance prototypes are combined into the converted bag vector. However, these algorithms directly select instance prototypes in the entire feature space and ignore the internal structure of the bag. As a result, they may choose weakly representative instances, which affects classification performance. MIHI provides a solution for selecting high-level instances through two techniques.

3 The Proposed Algorithm

In this section, we first give the basic notation of MIL. Then we describe the overall process of the proposed MIHI algorithm. Finally, the two key techniques of MIHI are described in detail.

3.1 Algorithm Description

Algorithm 1 reports the detailed process of the proposed MIHI. Let \(\mathcal {T} = \{ \boldsymbol{B}_i\}_{i=1}^{N}\) be the MIL data set with N bags, where \(\boldsymbol{B}_i = \{\boldsymbol{x}_{ij}\}_{j=1}^{n_i}\) is a bag containing \(n_i\) instances. Let \(\mathcal {Y}= \{y_i\}_{i=1}^{N}\) be the label vector corresponding to \(\mathcal {T}\), where \(y_i \in \{ -1, +1 \}\) is the label of \(\boldsymbol{B}_i\). Lines 2–11 use the two techniques to obtain the high-level instance set \(\boldsymbol{H}\). By analyzing the internal structure of each bag, at least one instance is selected to construct the instance prototype set \(\boldsymbol{C}\). Specifically, Lines 4–5 calculate the representativeness of the instances in each bag \(\boldsymbol{B}_i \in \mathcal {T}\), and Lines 6–7 select the top-ranked instances as the instance prototypes (IP). Next, the goal is to generate the high-level instance set \(\boldsymbol{H}\) from \(\boldsymbol{C}\): Lines 9–11 select the instances with peak density from \(\boldsymbol{C}\) to construct \(\boldsymbol{H}\). We then design an embedding function that transforms each bag into a single instance in the new feature space: Lines 13–17 embed each bag \(\boldsymbol{B}_i\) into a new feature vector \(\boldsymbol{V}_i\) through \(\boldsymbol{H}\). Finally, Line 18 trains the SIL classifier \(\mathcal {F}(\cdot )\) on the new data set \(\{(\boldsymbol{V}_i, y_i)\}_{i=1}^{N}\).

(Algorithm 1: the detailed procedure of MIHI)
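For readers who prefer code to pseudocode, a minimal Python sketch of Algorithm 1 is given below. The helper functions select_prototypes, select_high_level and embed_bag correspond to Sects. 3.2–3.4 and are sketched there; all names, default parameters and the linear SVM are illustrative choices rather than the exact implementation.

    import numpy as np
    from sklearn.svm import SVC

    def mihi_train(bags, labels, r_c=0.3, n_h=5):
        # bags: list of (n_i, d) arrays; labels: array of {-1, +1} bag labels.
        # Step 1: fast bag-inside instance selection (Sect. 3.2).
        C = np.vstack([select_prototypes(B, r_c) for B in bags])
        # Step 2: high-level instance selection from the prototype set C (Sect. 3.3).
        H = select_high_level(C, n_h)
        # Step 3: embed every bag into a single vector via H (Sect. 3.4).
        V = np.vstack([embed_bag(B, H) for B in bags])
        # Step 4: train any single-instance classifier F on the embedded vectors.
        clf = SVC(kernel="linear").fit(V, labels)
        return clf, H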

3.2 The Fast Bag-Inside Instance Selection Technique

The common methods for instance prototype (IP) generation select cluster centers [15] or causal instances [18] in the entire feature space. However, these methods have two problems: a) high time complexity; and b) the selected instances carry no bag structure information. The fast bag-inside instance selection technique chooses the instance prototypes of each bag by using its internal structure. The density \(\rho _{ij}\) and affinity \(l_{ij}\) metrics of an instance \(\boldsymbol{x}_{ij}\) are computed as follows.

The Density of Instance. For each instance \(\boldsymbol{x}_{ij} \in \boldsymbol{B}_i\), the density \(\rho _{ij}\) is defined as

$$\begin{aligned} \rho _{ij} = \sum _{k\ne j}^{n_i}{\exp \left( -\left( \frac{d_{jk}}{d_c}\right) ^2\right) }, \end{aligned}$$
(1)

where \(d_c\) is a cutoff distance and \(d_{jk}\) is the distance between \(\boldsymbol{x}_{ij}\) and \(\boldsymbol{x}_{ik}\). A high-density instance has more adjacent instances within the given neighborhood radius, and therefore reflects the local feature distribution of the bag.

In addition, the instances in a bag are not completely independent and identically distributed [21], so it is not enough to determine representativeness based only on the density of an instance. Therefore, we use the cosine similarity to represent the affinity between instances: the closer the cosine similarity of two instances is to 1, the more similar they are.

The Affinity of Instance. For each instance \(\boldsymbol{x}_{ij} \in \boldsymbol{B}_i\), the affinity \(l_{ij}\) is defined as

$$\begin{aligned} l_{ij} = \sum _{1\le k \le n_i}{\frac{\boldsymbol{x}_{ij} \cdot \boldsymbol{x}_{ik}}{\Vert \boldsymbol{x}_{ij} \Vert \Vert \boldsymbol{x}_{ik} \Vert }}, \end{aligned}$$
(2)

where \(j, k \in [1..n_i]\).

After obtaining the density and affinity of each instance in the bag, the representativeness score \(s_{ij}\) of the instance can be computed as

$$\begin{aligned} s_{ij}=\rho _{ij}\times l_{ij}. \end{aligned}$$
(3)

According to the MIL assumption, the proportions of positive and negative instances differ across bags (e.g., tiger, grass and water in Fig. 1). Therefore, we choose the low/high score instances from the positive/negative bags as the IP. Finally, we obtain the instance prototype set \(\boldsymbol{C} = \{\boldsymbol{c}_1,\cdots ,\boldsymbol{c}_{n_c}\}\), where \(n_c\) is the cardinality of \(\boldsymbol{C}\).

By considering the solution interval of the optimization objective, we design three instance prototype selection modes as follows:

  • Global (G) selects \( \lceil r_c\times n_i \rceil \) instance prototypes from each bag, where \(r_c\) is the selection ratio.

  • Positive (P) selects instance prototypes only from the positive bags.

  • Negative (N) selects instance prototypes only from the negative bags.

The time complexity of the fast bag-inside instance selection technique is O(dn), where d is the dimension and n is the total number of instances. The time complexity of instance selection over the entire space is \(O(dn^2)\). In contrast, our complexity is only linear, rather than quadratic, in the number of instances.
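A minimal Python sketch of this step is given below, assuming Euclidean distances for \(d_{jk}\) and illustrative defaults for \(d_c\) and \(r_c\); for brevity it always keeps the highest-scored instances (mode “G”), whereas the text above also allows the low-score side for positive bags.

    import numpy as np

    def select_prototypes(B, r_c=0.3, d_c=1.0):
        # Instance prototypes of one bag B of shape (n_i, d), following Eqs. (1)-(3).
        n_i = B.shape[0]
        dist = np.linalg.norm(B[:, None, :] - B[None, :, :], axis=-1)  # pairwise d_{jk}
        rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0             # Eq. (1), drop the k = j term
        norm = np.linalg.norm(B, axis=1) + 1e-12
        l = ((B @ B.T) / np.outer(norm, norm)).sum(axis=1)             # Eq. (2), affinity
        s = rho * l                                                    # Eq. (3), representativeness score
        k = int(np.ceil(r_c * n_i))                                    # mode "G": top ceil(r_c * n_i) per bag
        return B[np.argsort(-s)[:k]]                                   # highest-scored instances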

3.3 High-level Instance Selection Technique

To explore the characteristics of the instance space, the high-level instance selection technique is proposed. Based on \(\boldsymbol{C} = \{\boldsymbol{c}_1,\cdots ,\boldsymbol{c}_{n_c}\}\), we obtain the high-level instances (HI).

For each \(\boldsymbol{c}_i\), we calculate two quantities: its local density \(\delta _i\) and its minimum distance \(\beta _i\) from the higher-density prototypes. The local density \(\delta _i\) is computed by Eq. (1), except that the calculation interval is migrated from each bag to \(\boldsymbol{C}\). The distance \(\beta _i\) is the minimum distance between \(\boldsymbol{c}_i\) and any other IP with higher density:

$$\begin{aligned} \beta _i=\mathop {\min }\limits _{j:\delta _j>\delta _i}(d_{ij}). \end{aligned}$$
(4)

In particular, for the IP with the highest density, the distance is \(\beta _i = \max _j(d_{ij})\). Finally, the score \(\lambda _i\) of each IP is calculated as

$$\begin{aligned} \lambda _i = \delta _i\times \beta _i. \end{aligned}$$
(5)

With the scores of all IP calculated by Eq. (5), we select the top-\(n_h\) IP as the HI. Finally, we obtain \(\boldsymbol{H} = \{\boldsymbol{h}_1,\cdots ,\boldsymbol{h}_{n_h}\}\), where \(n_h\) is the cardinality of \(\boldsymbol{H}\).
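A corresponding sketch of the high-level instance selection is shown below, reusing the Gaussian density of Eq. (1) over \(\boldsymbol{C}\); again, \(d_c\) and \(n_h\) are illustrative parameters rather than the exact implementation.

    import numpy as np

    def select_high_level(C, n_h=5, d_c=1.0):
        # Pick the n_h prototypes with the largest delta * beta, following Eqs. (4)-(5).
        dist = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
        delta = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0           # local density, Eq. (1) over C
        beta = np.empty(len(C))
        for i in range(len(C)):
            denser = delta > delta[i]
            # Eq. (4): minimum distance to a denser prototype; the densest one gets max_j(d_ij)
            beta[i] = dist[i, denser].min() if denser.any() else dist[i].max()
        lam = delta * beta                                              # Eq. (5)
        return C[np.argsort(-lam)[:n_h]]                                # top-n_h high-level instances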

3.4 Embedding Technique via HI

After obtaining \(\boldsymbol{H}\), we design the following method to embed the bags into a new feature space. First, each instance \(\boldsymbol{x}_{ij} \in \boldsymbol{B}_i\) is assigned to its nearest \(\boldsymbol{h}_k\), denoted by \(NH(\boldsymbol{x}_{ij})=\boldsymbol{h}_{k}\). Then, each bag \(\boldsymbol{B}_i\) can be expressed by \(n_h\) local vectors \(\boldsymbol{v}_{ik}\):

$$\begin{aligned} \boldsymbol{v}_{ik} = \sum _{\boldsymbol{x}_{ij}\in \varOmega }{\left( \boldsymbol{x}_{ij}-\boldsymbol{h}_k\right) }, \end{aligned}$$
(6)

where \(\varOmega = \{\boldsymbol{x}_{ij}|NH(\boldsymbol{x}_{ij})=\boldsymbol{h}_{k} \}\). Finally, the embedding vector \(\boldsymbol{V}_i\) of bag \(\boldsymbol{B}_i\) is a D-dimensional vector composed of concatenated local vectors [15]:

$$\begin{aligned} \boldsymbol{V}_i=\mathop {\big \Vert }\limits _{k=1}^{n_h}{\boldsymbol{v}_{ik}}, \end{aligned}$$
(7)

where \(D = n_h \times d\) and d is the dimension of instance \(\boldsymbol{x}_{ij}\). However, this embedding method maps each bag into a high-dimensional space. Therefore, we design a second embedding method, which superimposes all the local vectors to obtain the embedding vector:

$$\begin{aligned} \boldsymbol{V}_i=\sum _{k=1}^{n_h}{\boldsymbol{v}_{ik}}. \end{aligned}$$
(8)

Furthermore, each element of \(\boldsymbol{V}_i\) is processed by \(V_{il}\leftarrow sign(V_{il})\sqrt{|V_{il}|}\), and then the embedding vector is normalized by \(\boldsymbol{V}_i\leftarrow \boldsymbol{V}_i/\parallel \boldsymbol{V}_i\parallel _2\) [11]. After obtaining \(\boldsymbol{V}_i\) for each \(\boldsymbol{B}_i\), the bag label can be predicted by feeding \(\boldsymbol{V}_i\) to any single-instance classifier \(\mathcal {F}(\cdot )\) (e.g., SVM).
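The whole embedding step of Eqs. (6)–(8), including the signed square-root and \(\ell _2\) normalization, can be sketched as follows (mode “S” superimposes, mode “C” concatenates; the function name and defaults are illustrative).

    import numpy as np

    def embed_bag(B, H, mode="S"):
        # Embed bag B (n_i, d) into a single vector via the high-level instances H (n_h, d).
        n_h, d = H.shape
        local = np.zeros((n_h, d))
        nearest = np.argmin(np.linalg.norm(B[:, None, :] - H[None, :, :], axis=-1), axis=1)
        for x, k in zip(B, nearest):                                   # Eq. (6): accumulate residuals
            local[k] += x - H[k]
        V = local.reshape(-1) if mode == "C" else local.sum(axis=0)    # Eq. (7) concatenate / Eq. (8) sum
        V = np.sign(V) * np.sqrt(np.abs(V))                            # signed square-root
        return V / (np.linalg.norm(V) + 1e-12)                         # L2 normalization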

4 Experiments

In this section, we conducted experiments on MIHI and 9 comparison algorithms across six learning tasks. To ensure the validity of the experiments, we used 10 times 10-fold cross-validation to calculate the average accuracy. The average accuracy (mean) and standard deviation (std) of each algorithm are reported.
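The following generic sketch illustrates the 10 times 10-fold cross-validation protocol (illustrative code, not the exact experimental script), reusing the mihi_train and embed_bag sketches from Sect. 3; the high-level instances are re-selected on each training fold.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def evaluate(bags, labels, repeats=10, folds=10, seed=0):
        # 10 x 10-fold CV: fit MIHI on the training bags, then embed and score the test bags.
        labels = np.asarray(labels)
        accs = []
        for r in range(repeats):
            skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed + r)
            for tr, te in skf.split(np.zeros((len(bags), 1)), labels):
                clf, H = mihi_train([bags[i] for i in tr], labels[tr])
                V_te = np.vstack([embed_bag(bags[i], H) for i in te])
                accs.append(clf.score(V_te, labels[te]))
        return float(np.mean(accs)), float(np.std(accs))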

4.1 Comparison Algorithms

We compared MIHI with 9 state-of-the-art algorithms: a) MILES [3] embeds bags based on the bag-instance similarity measure and all instances; b) BAMIC [22] embeds bags by employing bag-level k-means, with the average Hausdorff distance and the number of clustering centers (\(r\times \min \{N,100\}\), where r is enumerated in \(\{0.1,\cdots ,1.0\}\)) as parameters; c) MILFM [10] uses AdaBoost to select the bag features embedded by instance prototypes, with the number of cluster centers set to 40; d) Simple-MI [1] uses the arithmetic mean of the instances in a bag as the representation of the bag itself; e) miFV [15] extracts the instance information with a Gaussian mixture model (GMM), with the number of GMM components enumerated in \(\{1,2,3\}\); f) miVLAD [15] embeds bags based on instance-level k-means, with the number of clustering centers enumerated in \(\{1,2\}\); g) MILDM [16] selects the discriminative instances via instance evaluation criteria, with the size of the discriminative instance pool set to the number of bags; h) StableMIL [18] embeds bags based on causal instances, with the scale variable set to 0.25; and i) ELDB [17] selects more representative bags with a discriminative analysis and reinforcement technique, and finally obtains more distinguishable single vectors.

4.2 Experimental Data Sets

Six fields of learning tasks across 26 data sets are used to validate MIHI. We briefly introduce these data sets: 1) Image retrieval: content-based image retrieval requires identifying the expected target object in an image [2]; in our experiments, the elephant, fox, and tiger data sets are used; 2) Mutagenicity prediction: mutagenesis is a drug activity prediction problem with two versions of the data set, easy (1) and hard (2) [13]; 3) Medical image: messidor is a medical classification data set consisting of 1,200 fundus images from 546 healthy and 654 diabetic patients [5]; 4) Newsgroups: newsgroups is a text categorization data set [21], where posts from different newsgroups form a bag and each category has 50 positive and 50 negative bags; 5) Web recommendation: the task is to decide whether a web page is interesting to a user [20]; nine users rated the web pages in this way, yielding 9 different data sets, where a web page is a bag and the links on the page are its instances; and 6) Biocreative: Biocreative is a large-scale text classification data set [12], where the task is to decide whether a Gene Ontology (GO) code should be used to annotate a given pair.

4.3 Performance Comparison

Table 1. Accuracy (\(\%, mean_{\pm {std}}\)) on 26 MIL data sets. The highest average accuracy on each data set is marked with \(\bullet \).

Table 1 shows the experimental results of MIHI and the comparison algorithms. The best performance on each data set is highlighted with a small black bullet. The mean rank represents the ranking of the average performance of each algorithm across the data sets [6]. The symbol “N/A” means that the algorithm could not obtain experimental results.

The experimental results show that MIHI achieves the best results on more than 70% of the data sets, and its mean rank of 2.71 is superior to the 9 traditional algorithms. Specifically, the accuracy of MIHI on some data sets, such as elephant, rec.sport.hockey and web4, is about \(10\%\) higher than that of the other algorithms. The reason may be that our instance selection techniques can effectively select the most informative instances from each bag. On the image retrieval data, MIHI performs well on all three image data sets. However, MIHI performs poorly on the two mutagenicity prediction data sets, which may be caused by the low dimensionality of mutagenesis. StableMIL performs very well on mutagenesis; the reason may be that StableMIL can obtain the most informative causal instances from the very low-dimensional positive bags. From the performance on the newsgroups, web recommendation and large-scale data sets, MIHI obtains good results on both low-dimensional and high-dimensional data. We only compare MIHI with the other four algorithms on the large-scale data sets, because the time complexity of MILES, MILFM, MILDM, StableMIL and ELDB is relatively high.

Fig. 2.

Comparison of MIHI with the 9 comparison algorithms using the Bonferroni-Dunn test. Algorithms not connected to MIHI in the CD plot are considered to perform significantly differently from the control algorithm (CD = 2.24, significance level 0.05).

We also applied the post hoc Bonferroni-Dunn test [6] to examine whether MIHI achieves competitive performance among the 9 compared algorithms. Figure 2 reports the critical difference (CD) plot at the 0.05 significance level. The mean accuracy rank of each algorithm is marked along the axis (lower ranks to the left). Algorithms whose mean rank is within one CD of that of MIHI are connected by thick lines; otherwise, they are considered significantly different from MIHI.
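For reference, the critical difference in the Bonferroni-Dunn test [6] is commonly computed as

$$\begin{aligned} \mathrm {CD} = q_{\alpha }\sqrt{\frac{k(k+1)}{6N}}, \end{aligned}$$

where k is the number of compared algorithms, N is the number of data sets, and \(q_{\alpha }\) is the critical value of the Bonferroni-Dunn test at significance level \(\alpha \).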

4.4 Parameter Analysis

Figure 3 shows the parameter analysis on the elephant data set. The symbols “S” and “C” denote the two bag embedding modes, superimpose and concatenate; “G”, “P” and “N” denote the three instance selection modes. In each subgraph, the abscissa indicates the instance prototype selection mode, and the ordinate indicates the number of instance prototypes. The three subgraphs show the classification accuracy with the kNN, Decision Tree (DTree) and SVM classifiers, respectively. The darkest cell of each heat map indicates the highest accuracy. The impact of the parameters on MIHI is summarized as follows:

Fig. 3.

Parameter analysis of MIHI with respect to the number of instance prototypes, the three instance selection modes, the two bag embedding modes and the three classifiers on the elephant data set. The best parameter setting for elephant is: 3 instance prototypes, instance selection mode “G” and bag embedding mode “C”.

  • Bag embedding modes: the classification performance of the two bag embedding modes is comparable. However, since mode “C” embeds each bag into a high-dimensional space, we choose mode “S” for bag embedding in subsequent experiments.

  • Instance selection modes: the results on elephant with the SVM classifier show that mode “G” outperforms the other two modes. However, with the other two classifiers, mode “P” performs best.

  • The number of instance prototypes: MIHI can achieve the best performance in most cases when the number of instance prototypes is 3–5.

  • Classifier: SVM is more suitable for these data sets than DTree and kNN.

4.5 Efficiency Comparison

Table 2. The CPU runtime (in seconds) of one run of 10-fold cross-validation (10CV) for each comparison algorithm on 4 MIL classification data sets.

Table 2 shows the time complexity and runtime of MIHI compared with the 9 competing algorithms. For MIHI, the construction of the high-level instances costs O(dn), where d is the dimension and n is the cardinality of the instance space. Table 2 compares the CPU running time of these methods on four data sets. The mean rank shows that MIHI is slightly slower than Simple-MI and miVLAD. This may be because Simple-MI does not need to spend time computing distances between instances, and the k-means algorithm used by miVLAD has low time complexity. However, Simple-MI does not perform well on these data sets. Besides, even on the small-scale data sets, the runtimes of MILFM and StableMIL are relatively long.

5 Conclusion

In this paper, we proposed the MIHI algorithm to select high-level instances. MIHI fully utilizes the internal structure information of the bag and effectively explores the characteristics of the instance space. Experiments were conducted on 26 MIL data sets. According to Table 1, MIHI achieves the best accuracy on more than 70% of the data sets, and its mean rank of 2.71 is superior to the 9 traditional algorithms. In addition, MIHI has linear time complexity, and its efficiency is only slightly lower than that of Simple-MI and miVLAD.