1 Introduction

Multi-instance learning (MI learning) has received much attention in the machine learning research field. MI learning is a variation of standard supervised learning. In standard supervised learning, each example is an instance represented by an attribute vector, augmented with a class label. The learning task is to build a classifier that predicts the class label of an unseen instance, given its attribute vector. In MI learning, however, each example consists of a bag of instances. Each bag has a class label, but the instances themselves are not labeled. Therefore, the learning task is to build a classifier that predicts the class label of an unseen bag [1]. Recently, some researchers have combined multi-instance learning and multi-label learning [2] to propose another machine learning framework: multi-instance multi-label learning (MIML) [3, 4]. In this paper, we confine our attention to multi-instance learning.

Two different approaches have been adopted to classify an unseen bag of instances in the context of the multi-instance problem. The first approach classifies a bag as negative if all the instances in it are negative and as positive if at least one instance in it is positive [5]. In contrast, the second approach labels a bag with the most frequent label among its instances, i.e., a simple majority vote over the instance labels [6].

MI learning was originally motivated by the drug activity prediction problem, where each instance is a possible conformation of a molecule and each bag contains all likely low-energy conformations of the molecule. A molecule is active if it binds strongly to the target protein in at least one of its conformations and is inactive if no conformation binds to the protein. Thus the problem is to predict the label (active or inactive) of molecules based on their conformations [7].

In recent years, in addition to the drug activity prediction problem, a variety of learning problems have been tackled as multi-instance problems, for example, several inductive logic programming problems [8, 9], identifying thioredoxin-fold proteins [10], content-based image retrieval [11, 12], stock market prediction [13], text categorization [14], natural scene classification [15], and image categorization [16].

In order to address these multi-instance problems, a large number of algorithms have been proposed, which can be broadly divided into two main categories: eager learning and lazy learning [17, 18]. It appears that up to now most of these algorithms belong to eager learning. For example, Dietterich et al. [5] assumed that the classifier could be represented as an axis-parallel rectangle and proposed four typical algorithms, among which the iterated-discrim APR algorithm performs best. Auer [19] focused on theoretical research and presented the MULTINST algorithm to efficiently learn axis-parallel concepts. After that, Maron and Lozano-Perez [20] described a new general framework called Diverse Density (DD), and Zhang and Goldman [7] proposed an improved algorithm called EM-DD by combining EM with an extended DD. Multi-instance Learning via Embedded Instance Selection (MILES) is a more recent MI learning approach presented by Chen et al. [21]. Foulds and Frank [22] later revisited the MILES algorithm and presented an empirical study investigating the efficacy of alternative base learners for MILES. Besides, TILDE [23] and RELIC [6] are two top-down decision tree induction systems for MI learning.

For lazy learning, Wang and Zucker [18] proposed two variants of the K-nearest-neighbor (KNN) algorithm, Bayesian-KNN and Citation-KNN, which we denote by BKNN and CKNN respectively in this paper. Their experimental results on the drug discovery benchmark datasets show that both BKNN and CKNN are competitive with the best algorithms conceived in the concept learning framework. After that, Xu [24] proposed a nearest distribution approach to multi-instance learning (MINND). Besides, the KNN algorithm can be used as the base classifier for some meta multi-instance learning algorithms such as MIBoost [25] and MIWrapper [26].

As discussed in the paper by Wang and Zucker [18], CKNN still applies the simple majority vote approach to the references and citers to classify unseen bags. In this paper, we propose an improved algorithm called Bayesian Citation-KNN (BCKNN). For each unseen bag, BCKNN first finds its \( k \) references and \( q \) citers, and then a Bayesian approach is applied to its \( k \) references and a distance weighted majority vote approach is applied to its \( q \) citers. The experimental results on two drug discovery benchmark datasets validate its effectiveness and efficiency. Besides, in order to adapt KNN to multi-instance learning, several alternative distance functions and classification approaches are introduced in this paper.

The rest of the paper is organized as follows. In Sect. 2, we briefly introduce the KNN algorithm and its two variants BKNN and CKNN, and then discuss how to adapt KNN to multi-instance learning. Section 3 proposes our improved algorithm called Bayesian Citation-KNN (BCKNN). Section 4 gives the detailed experimental setup and the comparison results on several benchmark datasets. Finally, we draw conclusions and outline the main directions for our future work in Sect. 5.

2 Adapting K-nearest-neighbor to multi-instance learning

K-nearest-neighbor (KNN) has been widely used as a simple and effective classification model in traditional supervised learning [27]. KNN assumes all instances correspond to points in the \( m \)-dimensional real space \( R^{m} \). The nearest neighbors of an unseen instance are defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance \( x \) be described by the attribute vector \( \langle a_{1} (x),a_{2} (x), \ldots ,a_{m} (x) \rangle \), where \( a_{j} (x) \) denotes the value of the \( jth \) attribute \( A_{j} \) of the instance \( x \). Then the distance between two instances \( x \) and \( y \) is defined as:

$$ d(x,y) = \sqrt {\sum\limits_{j = 1}^{m} {(a_{j} (x) - a_{j} (y))^{2} } } . $$
(1)

In KNN, the target function can be either discrete-valued (classification) or real-valued (regression). This paper considers learning discrete-valued target functions of the form \( f:R^{m} \to C \), where \( C \) is a finite set \( \{ c_{1} ,c_{2} , \ldots ,c_{s} \} \). For an unseen instance \( y \), KNN first finds its \( k \) nearest neighbors. We assume its \( k \) nearest neighbors are \( \{ x_{1} ,x_{2} , \ldots ,x_{k} \} \) and their class labels are \( \{ c_{1} ,c_{2} , \ldots ,c_{k} \} \) respectively. Then, KNN uses a simple majority vote to determine the class label of \( y \), so the classification rule of KNN is

$$ c_{KNN} (y) = \mathop {\arg \hbox{Max} }\limits_{c \in C} \sum\limits_{i = 1}^{k} {\delta (c,c_{i} )} $$
(2)

where \( \delta (c,c_{i} ) = 1 \) if \( c = c_{i} \), and 0 otherwise.
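
As a minimal illustration of Eq. 2, the following Python sketch (all function and variable names are our own, not taken from the paper or any particular library) finds the \( k \) nearest training instances of a query under the Euclidean distance of Eq. 1 and returns the majority label.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Majority-vote KNN of Eq. (2) for a single query instance."""
    # Euclidean distances (Eq. 1) from the query to every training instance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest training instances
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels (ties broken arbitrarily)
    return Counter(y_train[nearest]).most_common(1)[0][0]
```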

KNN is a typical example of lazy learning, which simply stores training instances at training time and delays the learning process until classification time. In contrast, eager learning generates an explicit model at training time. Due to its simplicity and effectiveness, KNN has been widely used for classification in traditional supervised learning. In this paper, however, we focus our attention on multi-instance classification problems instead of traditional supervised classification problems. Thus, a natural question is how to adapt KNN to multi-instance classification problems.

According to the observation by Wang and Zucker [18], two practical problems must be addressed when adapting KNN to multi-instance learning: the distance measure problem and the classification approach problem. To address the first problem, four distance functions [28–30], namely the minimum distance, the maximum distance, the centroid distance, and the Hausdorff distance, have been proposed.

Given two sets of instances \( X = \{ x_{1} ,x_{2} , \ldots ,x_{o} \} \) and \( Y = \{ y_{1} ,y_{2} , \ldots ,y_{p} \} \), the above distance functions are respectively defined as:

$$ D_{\hbox{Min} } (X,Y) = \mathop {\hbox{Min} }\limits_{x \in X} \left\{ {\mathop {\hbox{Min} }\limits_{y \in Y} \{ d(x,y)\} } \right\} $$
(3)
$$ D_{\hbox{Max} } (X,Y) = \mathop {\hbox{Max} }\limits_{x \in X} \left\{ {\mathop {\hbox{Max} }\limits_{y \in Y} \{ d(x,y)\} } \right\} $$
(4)
$$ D_{cen} (X,Y) = d\left( {\frac{1}{o}\sum\limits_{i = 1}^{o} {x_{i} } ,\frac{1}{p}\sum\limits_{j = 1}^{p} {y_{j} } } \right) $$
(5)
$$ D_{H} (X,Y) = \hbox{Max} \left\{ {\mathop {\hbox{Max} }\limits_{x \in X} \left\{ {\mathop {\hbox{Min} }\limits_{y \in Y} \{ d(x,y)\} } \right\},\mathop {\hbox{Max} }\limits_{y \in Y} \left\{ {\mathop {\hbox{Min} }\limits_{x \in X} \{ d(x,y)\} } \right\}} \right\} $$
(6)

where \( d(x,y) \) is the standard Euclidean distance between two instances \( x \) and \( y \), and \( \frac{1}{o}\sum\nolimits_{i = 1}^{o} {x_{i} } \) and \( \frac{1}{p}\sum\nolimits_{j = 1}^{p} {y_{j} } \) are the centroids of the instance sets \( X \) and \( Y \) respectively. In Sect. 4, we design three groups of experiments to validate the effectiveness of the different distance functions. The experimental results show that the maximum distance seems to be the worst and the minimum distance seems to be the best among the four distance functions. More detailed experimental results can be found in Tables 2, 3 and 4 in Sect. 4.
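
For concreteness, the four bag-level distances of Eqs. 3–6 can be sketched in Python as follows, assuming each bag is a NumPy array of shape (number of instances, number of attributes); the function names are our own.

```python
import numpy as np

def pairwise_dists(X, Y):
    """All Euclidean distances d(x, y) between the instances of bags X and Y."""
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

def d_min(X, Y):        # Eq. (3): minimum distance
    return pairwise_dists(X, Y).min()

def d_max(X, Y):        # Eq. (4): maximum distance
    return pairwise_dists(X, Y).max()

def d_centroid(X, Y):   # Eq. (5): distance between the two centroids
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def d_hausdorff(X, Y):  # Eq. (6): Hausdorff distance
    d = pairwise_dists(X, Y)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```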

To address the second problem, Wang and Zucker [18] propose two alternative classification approaches instead of the simple majority vote: the Bayesian approach and the citation approach. They call the resulting algorithms Bayesian-KNN (BKNN) and Citation-KNN (CKNN) respectively.

BKNN uses Eq. 7 to classify \( y \).

$$ c_{BKNN} (y) = \mathop {\arg \hbox{Max} }\limits_{c \in C} P(c)P(c_{1} ,c_{2} , \ldots ,c_{k} |c) $$
(7)

where \( C \) is a finite set \( \{ c_{1} ,c_{2} , \ldots ,c_{s} \} \). It appears that up to now almost all MI classification problems are binary classification problems; namely, each \( c_{i} \) is either positive or negative. Thus, the maximal number of distinct combinations that \( \{ c_{1} ,c_{2} , \ldots ,c_{k} \} \) can take is \( k + 1 \), since a combination is fully determined by the number of positive labels it contains. Therefore, the computational cost of BKNN is low.
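
The paper defers the details of estimating \( P(c) \) and \( P(c_{1} ,c_{2} , \ldots ,c_{k} |c) \) to [18], so the Python sketch below is only our own instantiation of Eq. 7 for binary labels: the prior comes from class frequencies of the training bags, the conditional term from counting how many training bags of class \( c \) have the same number of positives among their own \( k \) nearest neighbors as the query does, and both counts use the Laplace correction mentioned in Sect. 4. All arguments are NumPy arrays and all names are ours.

```python
import numpy as np

def bknn_vote(neighbor_labels, ref_pos_counts, ref_labels, k):
    """One possible Bayesian vote for Eq. (7), binary labels only.

    neighbor_labels : labels (0/1) of the query's k nearest training bags.
    ref_pos_counts  : for each training bag, the number of positives among
                      its own k nearest training bags (precomputed offline).
    ref_labels      : class label (0/1) of each training bag.
    With binary labels, the combination (c_1, ..., c_k) is summarized by
    the number of positives it contains, so only k + 1 cases are counted.
    """
    t = int(np.sum(np.asarray(neighbor_labels) == 1))
    best_c, best_score = 0, -1.0
    for c in (0, 1):
        mask = (ref_labels == c)
        prior = (mask.sum() + 1.0) / (len(ref_labels) + 2.0)      # Laplace-corrected P(c)
        matches = np.sum(ref_pos_counts[mask] == t)
        likelihood = (matches + 1.0) / (mask.sum() + k + 1.0)     # Laplace-corrected P(c_1..c_k | c)
        if prior * likelihood > best_score:
            best_c, best_score = c, prior * likelihood
    return best_c
```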

In CKNN, the concept of citation is used and the nearest neighbors are extended to the combination of references and citers. Given an unseen bag \( y \), its references are its \( k \) nearest training bags \( \{ x_{1} ,x_{2} , \ldots ,x_{k} \} \), and its citers are the \( q \) training bags \( \{ x_{1} ,x_{2} , \ldots ,x_{q} \} \) that rank \( y \) among their own \( c \) nearest neighbors. Please note that the parameter \( c \) (the citation rank) is different from the number \( q \) of citers actually found. Then, CKNN uses Eq. 8 to classify \( y \).

$$ c_{CKNN} (y) = \mathop {\arg \hbox{Max} }\limits_{c \in C} \left( {\sum\limits_{i = 1}^{k} {\delta (c,c_{i} )} + \sum\limits_{j = 1}^{q} {\delta (c,c_{j} )} } \right) $$
(8)

where \( \{ c_{1} ,c_{2} , \ldots ,c_{k} \} \) and \( \{ c_{1} ,c_{2} , \ldots ,c_{q} \} \) are the class labels of \( k \) references and \( q \) citers of \( y \) respectively. Their experimental results on two drug discovery benchmark datasets show that both BKNN and CKNN are competitive with the best ones conceived in the concept learning framework [18].
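
The reference and citer sets used by CKNN (and later by our BCKNN) can be computed as sketched below; the code and its names are our own, `dist` is any of the bag-level distances of Eqs. 3–6 (e.g. the minimum distance), and a training bag is taken to be a citer if the query ranks among that bag's \( c \) nearest neighbors.

```python
import numpy as np

def references_and_citers(query, bags, dist, k, c):
    """Return the indices of the query's k references and of its citers."""
    n = len(bags)
    d_query = np.array([dist(query, bags[i]) for i in range(n)])
    # References: the k training bags closest to the query
    references = np.argsort(d_query)[:k]

    citers = []
    for i in range(n):
        # Distances from training bag i to all other training bags
        d_others = [dist(bags[i], bags[j]) for j in range(n) if j != i]
        # The query's rank among bag i's neighbors: one plus the number of
        # other training bags that are closer to bag i than the query is
        rank = 1 + sum(d < d_query[i] for d in d_others)
        if rank <= c:
            citers.append(i)
    return references, np.array(citers)
```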

3 Bayesian Citation-KNN with distance weighting

Our research starts from an observation about CKNN. As discussed before, CKNN still applies the simple majority vote approach to the references and citers to classify unseen bags. That is to say, each reference and citer is treated equally. However, intuitively, different references and citers should play different roles in classifying the unseen bag, since some of them are more important than others. Thus, a natural way to improve CKNN is to assign different weights to different references and citers according to their distances to the unseen bag. Besides, the Bayesian approach has been shown to be much better than the simple majority vote approach [18].

Motivated by these facts, in this paper we propose an improved algorithm called Bayesian Citation-KNN (BCKNN) that combines the Bayesian approach and the distance weighting approach. For each unseen bag, BCKNN first finds its \( k \) references and \( q \) citers, and then a Bayesian approach is applied to its \( k \) references and a distance weighted majority vote approach is applied to its \( q \) citers.

Therefore, our BCKNN uses Eq. 9 to classify \( y \).

$$ c_{BCKNN} (y) = \mathop {\arg \hbox{Max} }\limits_{c \in C} \left( {P(c)P(c_{1} ,c_{2} , \ldots ,c_{k} |c) + {{\sum\limits_{j = 1}^{q} {w_{j} \delta (c,c_{j} )} } \mathord{\left/ {\vphantom {{\sum\limits_{j = 1}^{q} {w_{j} \delta (c,c_{j} )} } {\sum\limits_{j = 1}^{q} {w_{j} } }}} \right. \kern-0pt} {\sum\limits_{j = 1}^{q} {w_{j} } }}} \right) $$
(9)

where \( \{ c_{1} ,c_{2} , \ldots ,c_{k} \} \) and \( \{ c_{1} ,c_{2} , \ldots ,c_{q} \} \) are the class labels of \( k \) references and \( q \) citers of \( y \) respectively, and \( w_{j} \) is the weight of the \( jth \) citer.
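
Below is a minimal sketch of Eq. 9 for binary labels, assuming the Bayesian term for each class has already been computed from the \( k \) references (for instance with the vote sketched in Sect. 2) and that the \( q \) citers come with distance-based weights; the names and the binary-label restriction are our own.

```python
import numpy as np

def bcknn_vote(ref_score_pos, ref_score_neg, citer_labels, citer_weights):
    """Combine the two terms of Eq. (9) and return the winning class (0/1).

    ref_score_pos, ref_score_neg : P(c) * P(c_1, ..., c_k | c) for the
        positive and negative class, computed from the k references.
    citer_labels  : class labels (0/1) of the q citers.
    citer_weights : distance-based weights w_j of the q citers.
    """
    w = np.asarray(citer_weights, dtype=float)
    lab = np.asarray(citer_labels)
    w_sum = w.sum() if w.size > 0 else 1.0   # guard against a bag with no citers
    score_pos = ref_score_pos + np.sum(w[lab == 1]) / w_sum
    score_neg = ref_score_neg + np.sum(w[lab == 0]) / w_sum
    return 1 if score_pos >= score_neg else 0
```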

Many proposals can be used to define the weights \( w_{j} \) (\( j = 1,2, \ldots ,q \)). The simplest is to weight each citer according to its distance to the unseen bag: any function that decreases with this distance can be used, so that citers further from the unseen bag receive less weight. Thus, just for simplicity, we define three alternative simple weight functions as follows:

$$ w_{j} = \frac{1.0}{{1.0 + d_{j} }} $$
(10)
$$ w_{j} = \frac{1.0}{{1.0 + d_{j}^{2} }} $$
(11)
$$ w_{j} = e^{{ - d_{j}^{2} }} . $$
(12)
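
The three weight functions of Eqs. 10–12 translate directly into Python; here \( d_{j} \) is the bag-level distance between the \( jth \) citer and the unseen bag, and the function names are ours.

```python
import numpy as np

def w_inverse(d_j):      # Eq. (10): 1 / (1 + d)
    return 1.0 / (1.0 + d_j)

def w_inverse_sq(d_j):   # Eq. (11): 1 / (1 + d^2)
    return 1.0 / (1.0 + d_j ** 2)

def w_gaussian(d_j):     # Eq. (12): exp(-d^2)
    return np.exp(-d_j ** 2)
```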

In Sect. 4, we design another group of experiments to compare the three distance weighting functions. The experimental results show that the distance weighting function defined by Eq. 11 is a little better than the other two defined by Eqs. 10 and 12. More detailed experimental results can be found in Table 5 in Sect. 4. More importantly, as seen from Table 7 in Sect. 4, our BCKNN is generally better than BKNN and CKNN. Please also note that our newly proposed BCKNN almost maintains the same order of computational overhead as the previous CKNN. The experimental results in Tables 8 and 9 in Sect. 4 support this conclusion.

4 Experimental conditions, methods, and results

Because our proposed BCKNN is an improvement of the previous BKNN and CKNN, we conduct our comparison experiments on the same two musk datasets (Musk1 and Musk2) used in the paper by Wang and Zucker [18]. We downloaded them from the website of Prof. E. Frank (http://www.cs.waikato.ac.nz/~eibe/multi_instance/). Each bag represents a molecule, and the main difference between the two datasets is that Musk2 contains molecules that have more possible conformations than Musk1. The learning task is to predict whether a molecule emits a musky odour. Detailed descriptions of these two datasets are shown in Table 1.

Table 1 Descriptions of two musk datasets

We implement our improved algorithm BCKNN, as well as KNN and the previous BKNN, and use the implementation of CKNN in Weka 3.5.7 [31]. Please note that, in our implementation, the Laplace correction is used to smooth the estimated class membership probabilities.

In the first three groups of experiments, we respectively use KNN, BKNN, and CKNN with different \( k \) values to validate the effectiveness of the different distance functions. In CKNN, we set \( c \) to \( k + 2 \), which is the same as the setting in the paper by Wang and Zucker [18]. Besides, in order to save running time, we use ten-fold cross-validation instead of the leave-one-out test to estimate the classification accuracy of each algorithm on each dataset. Note that runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds are the same for all the experiments on each dataset.
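
The evaluation protocol described above (one fixed ten-fold split shared by all algorithms) can be sketched as follows; the fold generator, the seed, and the `fit_predict` callback are our own assumptions and not part of the original experimental code.

```python
import numpy as np

def ten_fold_indices(n_bags, seed=1):
    """Split bag indices into ten folds once, so that every algorithm is
    trained and tested on exactly the same folds."""
    rng = np.random.RandomState(seed)
    return np.array_split(rng.permutation(n_bags), 10)

def cross_validate(labels, folds, fit_predict):
    """Average accuracy of `fit_predict(train_idx, test_idx)` over the folds."""
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predictions = fit_predict(train_idx, test_idx)
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return float(np.mean(accuracies))
```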

The detailed experimental results are shown in Tables 2, 3 and 4. From these results, we can see that, generally speaking, the minimum distance seems to be the best among the four distance functions and the classification accuracies of the related algorithms are highest when \( k = 2 \). This conclusion is consistent with that of the paper by Wang and Zucker [18].

Table 2 Classification accuracy comparisons for KNN with different \( k \) values and different distance functions
Table 3 Classification accuracy comparisons for BKNN with different \( k \) values and different distance functions
Table 4 Classification accuracy comparisons for CKNN with different \( k \) values and different distance functions

In the fourth group of experiments, we use our proposed BCKNN with different weight functions to compare the three distance weighting functions. In BCKNN, we set \( k \) to 2 and \( c \) to \( k + 2 \), which is the same as the setting in the paper by Wang and Zucker [18]. Besides, the minimum distance defined by Eq. 3 is used. The detailed experimental results are shown in Table 5. From these results, we can see that the distance weighting function defined by Eq. 11 is a little better than the other two defined by Eqs. 10 and 12.

Table 5 Classification accuracy comparisons for BCKNN with different weights

Finally, we design another group of experiments to validate our improved algorithm BCKNN by comparing it with the previous BKNN and CKNN. In this group of experiments, we use the parameter settings tuned in the previous four groups of experiments: the values of \( k \) and \( c \) in the related algorithms are set to 2 and 4 respectively. Besides, the minimum distance defined by Eq. 3 and the distance weighting function defined by Eq. 11 are used.

We add two train direction prediction datasets (denoted by eastwest and westeast) and three mutagenicity prediction datasets (denoted by muta-atoms, muta-bonds and muta-chains) to this group of experiments. At the same time, we choose two other related multi-instance learning algorithms (MIBoost [25] and MIWrapper [26]) as competitors. We use the implementations of MIBoost and MIWrapper (with the nearest neighbor algorithm as the base classifier) in Weka 3.5.7 [31]. Detailed descriptions of these datasets are shown in Table 6.

Table 6 Descriptions of two train direction prediction datasets and three mutagenicity prediction datasets

Table 7 shows the detailed comparison results in terms of classification accuracy. In the same way, we also record the CPU training and test time of each algorithm (the averaged CPU time in seconds; all experiments are carried out on a PC running Microsoft Windows XP Professional 2002 Service Pack 3 with an Intel(R) Core(TM)2 Quad CPU Q8400 (2.66 GHz) and 3 GB of memory). The detailed experimental results are shown in Tables 8 and 9 respectively. From the experimental results, we can see that:

Table 7 Classification accuracy comparisons for BCKNN versus BKNN, CKNN, MIBoost, and MIWrapper
Table 8 CPU Training time comparisons for BCKNN versus BKNN, CKNN, MIBoost, and MIWrapper
Table 9 CPU test time comparisons for BCKNN versus BKNN, CKNN, MIBoost, and MIWrapper
1. BCKNN (92.22 %) is better than previous BKNN (90.11 %) and CKNN (89.11 %) on Musk1. On Musk2, BCKNN (83.18 %) also outperforms BKNN (81.18 %) and CKNN (82.36 %). These results validate the effectiveness of our improved solution.

2. The CPU training and test time of BCKNN (19.291 s and 4.46 s) is almost equal to that of the previous CKNN (19.243 s and 4.584 s). These results show that our BCKNN almost maintains the same order of computational overhead as CKNN.

3. Generally speaking, on the two train direction prediction datasets, our BCKNN is much better than the two related competitors (MIBoost and MIWrapper), while the situation is reversed on the three mutagenicity prediction datasets.

5 Conclusion and future work

In this paper, we revisit the well-known algorithms Bayesian-KNN (BKNN) and Citation-KNN (CKNN). After discussing two practical problems in adapting KNN to multi-instance classification, we propose an improved algorithm called Bayesian Citation-KNN (BCKNN) that combines the Bayesian approach and the distance weighting approach. For each unseen bag, BCKNN first finds its \( k \) references and \( q \) citers, and then a Bayesian approach is applied to its \( k \) references and a distance weighted majority vote approach is applied to its \( q \) citers. The experimental results on several benchmark datasets show that our BCKNN is generally better than the previous BKNN and CKNN. Besides, BCKNN almost maintains the same order of computational overhead as CKNN.

Like BKNN and CKNN, our BCKNN uses the standard Euclidean distance to measure the difference between each pair of instances. Thus, using newly proposed distance measures, such as the affinity-based distance function [32], to improve the accuracy of our BCKNN is one of the main directions for our future work. Besides, in the current version of BCKNN, the \( k \) references are treated equally in the Bayesian approach. Thus, how to apply the distance weighting approach to the references and obtain a distance weighted Bayesian approach is another main direction for our future work. Finally, investigating the ranking performance of our BCKNN in terms of AUC [33–35] will also be included in our future work.