
5.1 Introduction

As opposed to instance-based classification methods, bag-based methods carry out the learning process at the bag level. The main feature that distinguishes bag-based from instance-based classifiers is that the former can predict the label of a new bag by considering each training bag as a whole entity, without the need to discover any hidden instance labels. Instance-based classification methods need to construct an instance classifier that is as accurate as possible, but this is not a requirement for bag-based methods. Although some types of bag-based classifiers do train an instance-level learning model, it is only used as a rough guide to the main bag-level learning process. Moreover, the MI assumption of bag-based methods typically need not be as precise as that of instance-based methods, but can be more flexible and general. We discuss the following two important subcategories of bag-based methods:

  • Bag-based methods that work in the original bag space: these methods rely on a metric function defined over bags. The metric is used in a distance-based classification algorithm, e.g., a nearest neighbor algorithm. By introducing the bag-wise distance measure, the learner is effectively upgraded to a full-fledged MI classification algorithm. We refer to these methods as original bag space classification methods (original-BS methods, for short) and discuss them in more depth in Sect. 5.2.

  • Bag-based methods that work in a mapped space: these methods transform the multi-instance data into a single-instance representation and train a single-instance classifier on the transformed data. The same transformation is applied to an unseen bag and its class label is predicted by the single-instance classifier learned in the mapped space. We refer to these methods as mapped bag space classification methods (mapped-BS methods, for short). They are discussed in Sect. 5.3.

5.2 Original Bag Space Methods

In single-instance learning, each instance is interpreted as a point in a multidimensional space determined by the features of the problem at hand. Many traditional single-instance learning algorithms rely on a distance function between points of this space to determine separating boundaries between classes. In MIL, bags can be understood as regions in the instance space and a bag-wise distance function is required to evaluate similarity relations between them. By using such a bag-wise distance function, a traditional distance-based learning algorithm becomes a multi-instance algorithm able to locate bag class boundaries. The two main design options of any bag-distance-based classification method are:

  • A distance-based classification method: we describe two distance-based methods: nearest neighbor methods (Sect. 5.2.1) and kernel methods (Sect. 5.2.2).

  • A bag-wise distance/similarity function: recall that similarity functions can be used instead of distance functions by inverting the objective function of the learner. Both types of comparison measures are complementary and using one or the other depends on the definition of the bag label prediction method. In Sect. 3.5, we listed several distance and similarity functions that can be used in these algorithms.

5.2.1 Nearest Neighbor Methods

The CitationKNN algorithm was proposed in [20] and extends the traditional single-instance k-nearest neighbors method (KNN) to the level of bags. To classify a new bag X, CitationKNN uses a distance function between bags to determine which training bags are closest to X. Inspired by the concept of citations in the field of information science, this algorithm extends the set of nearest neighbors to consider not only the r bags closest to X (references, Fig. 5.1), but also the bags for which X is among the c closest bags (citers, Fig. 5.2). A voting scheme uses the class labels of both references and citers to determine the class label of X.
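The following minimal sketch illustrates the references-and-citers voting scheme. The minimal Hausdorff distance is used as bag-wise metric, and all function names and parameter values are illustrative; this is not the reference implementation of [20].

```python
import numpy as np

def min_hausdorff(A, B):
    """Minimal Hausdorff distance: smallest instance-to-instance distance between two bags."""
    return min(np.linalg.norm(a - b) for a in A for b in B)

def citation_knn_predict(X, train_bags, train_labels, r=2, c=4, dist=min_hausdorff):
    """Predict the label of bag X from its r references and c citers by majority vote."""
    d_to_X = np.array([dist(X, B) for B in train_bags])
    references = list(np.argsort(d_to_X)[:r])           # the r training bags closest to X

    citers = []
    for i, B in enumerate(train_bags):
        # distances from B to every other training bag and to X
        d_from_B = [dist(B, C) for j, C in enumerate(train_bags) if j != i] + [d_to_X[i]]
        # B is a citer if X ranks among its c nearest neighbors
        if d_to_X[i] <= np.sort(d_from_B)[c - 1]:
            citers.append(i)

    votes = [train_labels[i] for i in references + citers]
    return max(set(votes), key=votes.count)
```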

Fig. 5.1 References. The circle encompasses the nearest 3-references to X (filled balls). The closest references correspond to the (traditional) nearest neighbors

Fig. 5.2 Citers. The 3-citers nearest to X (filled balls) are those whose three nearest neighbors include X. Each circle contains the three nearest neighbors of the sample located at its center. For clarity, we have only represented circles including X. These are the 3-nearest citers to X

Any bag-wise distance function can be used in CitationKNN (see Sect. 3.5). In particular, the study of [20] uses the minimal Hausdorff distance (3.18), maximal Hausdorff distance ((3.19), (3.20)) and k-th ranked Hausdorff distance (3.22).

The distance function employed in CitationKNN has a major impact on its performance [2]. Each application domain can benefit more from a certain distance function than from others and some applications may require the selection of a less conventional metric. For example, the work of [27] on a web mining application adapts CitationKNN for text data represented by sets of terms, rather than the traditional attribute-value vector representation, which suffers from the so-called curse of dimensionality. They represent an instance x by a set of textual terms \(\left\{ t_{1},t_{2},\ldots ,t_{n}\right\} \), where \(t_{i}\) \(\left( i=1,\ldots ,n\right) \) is one of the n most frequent terms in the text fragment corresponding to x. They use the minimal Hausdorff distance variant, i.e., \(k=1\) in (3.22), and define a distance function between two instances \(a=\left\{ a_{1},a_{2},\ldots ,a_{n}\right\} \) and \(b=\left\{ b_{1},b_{2},\ldots ,b_{n}\right\} \) as

$$ \left\| a-b\right\| =1-\sum _{\begin{array}{c} i,j=1\\ a_{i}=b_{j} \end{array}}^{n}\frac{1}{n}, $$

based on the idea that the fewer common terms two instances share, the greater the distance between them.

CitationKNN has been extended to regression tasks [8], clustering [13] and multi-label classification [24]. It has been used successfully in several application domains, such as textual classification [21] and anomaly detection [23].

5.2.2 Bag-Level SVM

Bag-level kernels are used to measure the similarity between two bags in a transformed representation space. They operate on whole bags and return a single number assessing how close the two bags are. As stated in Sect. 3.5, the similarity is inversely related to the distance. Kernel-based methods, as well as distance-based ones, rely on space metrics to find the separating class boundaries. When a bag-level kernel is used in a standard SVM, the latter becomes able to optimize the margin between bag classes without any modification to the SVM itself. One of the first bag-level kernels was presented by Gärtner et al. [11]. They define the set kernel between two bags A and B as

$$ k_{MI}\left( A,B\right) =\sum _{a\in A,b\in B}k_{I}^{p}\left( a,b\right) , $$

where \(k_{I}\) is a kernel defined at the instance level. Theoretically, for sufficiently large values of p, this kernel ensures the separability of the training set. Because of the computational cost involved in the MI kernel above, [11] defines a minimax kernel based on the minimum and maximum attribute values of instances in each bag, namely

$$ k\left( A,B\right) =\left( \left\langle s\left( A\right) ,s\left( B\right) \right\rangle +1\right) ^{p}, $$

where \(s\left( \cdot \right) \) defines the attribute transformation

$$ s\left( X\right) =\left\langle \min _{x\in X}x_{1},\ldots ,\min _{x\in X}x_{m},\max _{x\in X}x_{1},\ldots ,\max _{x\in X}x_{m}\right\rangle . $$
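As an illustration, the following sketch computes the set kernel and the minimax kernel for two bags given as NumPy arrays with one instance per row. The instance-level RBF kernel and the parameter values are illustrative choices, not those of [11]; a Gram matrix built from either function could, for example, be passed to an SVM with a precomputed kernel.

```python
import numpy as np

def instance_rbf(a, b, gamma=0.5):
    """Instance-level kernel k_I(a, b): a standard RBF kernel."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def set_kernel(A, B, p=2, k_I=instance_rbf):
    """MI set kernel: sum of the instance kernel raised to the power p over all instance pairs."""
    return sum(k_I(a, b) ** p for a in A for b in B)

def minimax_kernel(A, B, p=2):
    """Minimax kernel: polynomial kernel on the min/max attribute summary s(.) of each bag."""
    s = lambda X: np.concatenate([X.min(axis=0), X.max(axis=0)])
    return (np.dot(s(A), s(B)) + 1) ** p
```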

In the MI kernels proposed by Gärtner et al. [11], all attributes are treated with equal weight. On the other hand, Blaschko et al. [3] propose conformal kernels which can locally reduce or expand each attribute dimension based on the discriminative importance of each attribute, while preserving the angles between vectors in the transformed space.

Kwok and Cheung [14] present marginalized kernels, which assume that the data are generated by a latent variable model. The observed variable is the bag and the hidden variables are the labels of its instances. In particular, let \(Z_{1}=\left( X_{1},\ell _{1}\right) \) and \(Z_{2}=\left( X_{2},\ell _{2}\right) \) be two bags augmented with the labels of their instances. A joint kernel is defined as

$$ k_{Z}\left( Z_{1},Z_{2}\right) =\sum _{i=1}^{n_{1}}\sum _{j=1}^{n_{2}}k_{\ell }\left( \ell _{1i},\ell _{2j}\right) k_{x}\left( x_{1i},x_{2j}\right) , $$

where \(k_{\ell }\left( \cdot ,\cdot \right) \) is a kernel defined over the instance labels and \(k_{x}\left( \cdot ,\cdot \right) \) is a kernel defined over the instance space. The marginalized kernel, defined over two observed variables \(X_{1}\) and \(X_{2}\), is obtained by taking the expectation of the joint kernel with respect to the hidden variables \(\ell _{1}\) and \(\ell _{2}\), that is,

$$\begin{aligned} k\left( X_{1},X_{2}\right) =\sum _{\ell _{1}\in \mathbb {L}}\,\sum _{\ell _{2}\in \mathbb {L}}P\left( \ell _{1}|X_{1}\right) P\left( \ell _{2}|X_{2}\right) k_{Z}\left( Z_{1},Z_{2}\right) . \end{aligned}$$
(5.1)

It is possible to calculate this marginalized kernel in polynomial time. The posterior distribution of \(\ell _{1}\) and \(\ell _{2}\) is obtained from a probabilistic model \(P\left( \ell _{i}|X_{i}\right) \) estimated from the data.

Bag-level kernels make an implicit transformation of bags into a single-instance representation such that standard SVMs can be directly applied to multi-instance data. In Sect. 5.3, we show that an explicit transformation of the bags can also be set up to obtain a single-instance dataset on which any single-instance learner can be trained and used to predict bag labels.

5.3 Mapped Bag Space Methods

The multi-instance classification algorithms described in Chap. 4 and Sect. 5.2 are based on single-instance classifiers that have been modified to function in the MIL setting. Although good results have been reported in many applications for these multi-instance algorithms, the high cost of developing new algorithms limits the applicability of this approach. There is only a small number of multi-instance algorithms compared to the large number of methods and algorithmic variants that have been developed for single-instance learning.

In this section, we examine another approach to solving multi-instance classification problems. Instead of using a modified single-instance classifier, a transformation is applied to the multi-instance data resulting in a single-instance representation of bags. In this new data representation, it is possible to construct a classification model using any traditional single-instance algorithm, effectively solving the multi-instance classification problem. The single-instance representation of multi-instance data not only allows the use of any single-instance classifier, but also the application of data preprocessing techniques, such as editing, cleaning, and dimensionality reduction, which have been well studied in single-instance learning.

In mapping-based methods, the learning process occurs at the bag level, but always relies on a mapping process. These methods transform the original multi-instance representation, in which each bag is a set of points (instances) in the attribute space, into another form of representation in which each bag is represented as a single point of the induced space. The multi-instance problem effectively becomes a single-instance problem to which any traditional learning algorithm can be applied.

Mapping-based methods differ from each other in their specific mapping processes. In general, the following procedure is used. The methods are based on a function \(\mathscr {M}:\mathbb {N}^{\mathbb {X}}\rightarrow \left\langle a_{1},\ldots ,a_{d}\right\rangle \) that transforms the multi-instance representation of a bag X into a single vector \(\mathscr {M}\left( X\right) =\left\langle a_{1},\ldots ,a_{d}\right\rangle \). The multi-instance training set is transformed into a single-instance training set by applying the function \(\mathscr {M}\) to each training bag. Any suitable single-instance classifier is built on the new training set. To classify a new bag, it is first converted to the new space using \(\mathscr {M}\) and is then fed to the classifier, which predicts a class label. By representing a bag as a point in the new space, some of the inherent ambiguity and imprecision of the multi-instance dataset can be reduced. However, as it is practically impossible to eliminate it completely, some of the original ambiguity remains encoded in the attribute values of each vector. The amount of ambiguity reduction depends on the design of the mapping function \(\mathscr {M}\).

Mapping-based classification algorithms differ primarily in the design of \(\mathscr {M}\) and the mapping process. Below, we examine each of these mapping strategies and the classifiers that use them. For a better understanding, we have divided them into four categories, based on the meaning of the attributes in the new representation space:

  • Mapping methods based on bag statistics (Sect. 5.3.1): each attribute of the new mapping space is the value of a statistic that is applied to the set of values of the corresponding attribute in the original representation space.

  • Mapping methods based on representative instance concatenation (Sect. 5.3.2): each vector of the new mapping space is the concatenation of N instances of the bag, where each instance is a representative of one pattern in the instance space.

  • Mapping methods based on counting (Sect. 5.3.3): each attribute of the new mapping space indicates the presence, amount, or frequency of instances of the bag in a specific region of the instance space.

  • Mapping methods based on distance (Sect. 5.3.4): each attribute of the new mapping space represents the distance (or similarity) of the bag to a specific region of the instance space.

5.3.1 Mapping Methods Based on Bag Statistics

Bag statistics-based methods seek to represent each bag by a single attribute vector that summarizes the statistical information of the bag. Consider a bag \(X=\left\{ x_{1},\ldots ,x_{n}\right\} \) in which each instance is described by d attributes, i.e., \(x_{i}=\left\langle x_{i}^{1},\ldots ,x_{i}^{d}\right\rangle ,\,\forall i\in \left[ 1,\ldots ,n\right] \). The bag can be seen as a set of d random variables with unknown probability distribution, for which we have a sample of size n. Several statistics can be used to characterize the probability distribution of these random variables. In the new attribute space, into which the multi-instance examples are mapped, each attribute of the original space is represented by one or more statistics that attempt to capture the shape of the probability distribution of the original variable within the bag. We list some examples of the kind of transformation performed on the bags:

  • Average mapping: \(M\left( X\right) =\left\langle m_{1},\ldots ,m_{d}\right\rangle \), where \(m_{j}\) is the mean value of the jth attribute over all the instances of X. This transformation is used by the SimpleMI algorithm described in [7] and included in the experiments of Sect. 5.4.

  • Min-Max mapping: \(M\left( X\right) =\left\langle a_{1},\ldots ,a_{d},b_{1},\ldots ,b_{d}\right\rangle \), where \(a_{j}=\min _{i}\,(x_{i}^{j})\) and \(b_{j}=\max _{i}\,(x_{i}^{j})\) are the minimum and maximum values of the jth attribute over all the instances in X. This transformation is used by the Min-Max kernel proposed by [11].

  • Moments mapping: \(M\left( X\right) =\left\langle m_{1},\ldots ,m_{d},v_{1},\ldots ,v_{d},s_{1},\ldots ,s_{d},k_{1},\ldots ,k_{d}\right\rangle \). The values \(m_{j}\), \(v_{j}\), \(s_{j}\) and \(k_{j}\) represent the first to fourth statistical moment (i.e., mean, variance, skewness, and kurtosis) of the jth attribute of the instances in X.

The dimension of the new mapped space is the number of dimensions of the original space multiplied by the number of statistics used to describe each variable.
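As an illustration, the following sketch implements the three mappings listed above for a bag given as a NumPy array with one instance per row; the function names are ours and the resulting vectors can be fed to any single-instance learner.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def average_mapping(X):
    """SimpleMI-style mapping: per-attribute mean over the instances of the bag (d attributes)."""
    return X.mean(axis=0)

def min_max_mapping(X):
    """Per-attribute minimum and maximum, concatenated (2d attributes)."""
    return np.concatenate([X.min(axis=0), X.max(axis=0)])

def moments_mapping(X):
    """First four statistical moments of each attribute (4d attributes)."""
    return np.concatenate([X.mean(axis=0), X.var(axis=0),
                           skew(X, axis=0), kurtosis(X, axis=0)])

# mapped_train = np.vstack([average_mapping(bag) for bag in train_bags])
# Any single-instance classifier can now be trained on mapped_train.
```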

Stratified Bag Statistics

The methods described above are limited to summarizing the statistical information of all instances inside the bag and do not consider that different patterns can coexist within the same bag. In different instance patterns, one or more attributes can have different probability distributions. If all instances of the bag are treated as if they belonged to the same pattern, the statistics will be unable to adequately describe the mixture of distributions. A more sophisticated mapping method can try to discover patterns or classes of instances in the data and represent each bag in the embedded space with the statistic values of each original attribute for each instance pattern separately. We call this stratified bag statistics-based mapping.

The most common way to discover instance patterns in the data is to use unsupervised methods, since instance class labels are unknown. Unsupervised methods find groups of instances with shared characteristics. These groups can be considered as different instance classes. We can also use supervised methods, assuming that instances share the class labels of their bags. Clearly, this assumption can cause a certain proportion of mislabeled instances, but the goal is to obtain a first approximation of the underlying instance-level patterns. From this first approximation, a learning algorithm can be trained to obtain a more accurate instance-level classifier.

Learning methods based on stratified bag statistics represent each bag by a single attribute vector with statistical information of the different patterns or instance classes contained in the bag. The new attribute values related to each instance pattern are concatenated in the vector describing the bag. Let \(C_{1},\ldots ,C_{k}\) be instance patterns found in the data and \(\theta :\mathbb {N}^{\mathscr {A}}\rightarrow \mathbb {R}\) a statistic (e.g., average, minimum, maximum, or moments) applicable to the d attributes of a set of instances. The stratified bag statistics based mapping is defined as

$$\begin{aligned} M\left( X\right) \mapsto \left\langle \theta _{11},\ldots ,\theta _{1d},\theta _{21},\ldots ,\theta _{2d},\ldots ,\theta _{k1},\ldots ,\theta _{kd}\right\rangle , \end{aligned}$$
(5.2)

where \(\theta _{ij}\) represents the statistic value applied to the jth original attribute of the instance subset in the bag belonging to the ith pattern. Equation 5.2 represents the case where each attribute probability distribution is described by a single statistic, but in general several statistics can be used for each attribute. The dimension of the new embedded space is \(d\times k\times q\), where d is the number of dimensions of the original space, k is the number of patterns or classes of instances discovered in the data and q is the number of statistics used to describe each original attribute distribution (e.g., in the Min-Max mapping two statistics are used, so \(q=2\)).
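A minimal sketch of this idea is given below, assuming the instance patterns are found with k-means and the average is the only statistic, i.e., (5.2) with \(q=1\); all names, as well as the fallback for patterns without instances in the bag, are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_instance_patterns(train_bags, k=3, seed=0):
    """Discover k instance patterns by clustering all training instances."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(np.vstack(train_bags))

def stratified_mean_mapping(X, kmeans):
    """Map a bag to the per-pattern, per-attribute means of its instances (d*k attributes)."""
    labels = kmeans.predict(X)
    parts = []
    for i, center in enumerate(kmeans.cluster_centers_):
        members = X[labels == i]
        # if the bag has no instance of pattern i, fall back to the pattern center
        parts.append(members.mean(axis=0) if len(members) else center)
    return np.concatenate(parts)
```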

5.3.2 Mapping Methods Based on Prototype Concatenation

This approach was introduced by Boughorbel et al. [4]. They look for k instance patterns in the data and characterize each pattern \(C_{i}\) through its center \(p_{i}\). However, instead of using statistics operating on individual attributes, they use a function \(\varphi \left( X,p_{i}\right) :\mathbb {N}^{\mathbb {X}}\times \mathbb {X}\rightarrow \mathbb {X}\) to select the instance in the bag closest to the center \(p_{i}\) of the ith pattern and use that instance as the pattern representative. The mapping by Boughorbel et al. can be defined as \(M\left( X\right) \mapsto \left\langle v_{1},v_{2},\ldots ,v_{k}\right\rangle \), where \(v_{i}\) is the instance from X that is closest to the center \(p_{i}\) of the ith pattern. The authors use this transformation to construct an SVM with an ad hoc kernel. However, as with all mapping methods described in this chapter, any other single-instance learning algorithm can be applied to the mapped data as well.

This method can be generalized so that an aggregation of all instances of the bag is used to represent the matching degree between the bag and the instance pattern. Let \(S\left( x,C\right) \in \left[ 0,1\right] \) be a function that measures the matching degree between an instance x and a pattern C. A natural way of defining \(S\left( x,C_{i}\right) \) is as a similarity measure between instance x and the center of the ith pattern \(p_{i}\). The vector \(v_{i}\) can be calculated as

$$\begin{aligned} v_{i}=\frac{\sum \limits _{x\in X}x\cdot S\left( x,C_{i}\right) }{\sum \limits _{x\in X}S\left( x,C_{i}\right) }, \end{aligned}$$
(5.3)

which represents the average of the instances weighted by their matching degree with the pattern. This method is related to the stratified statistic mapping method described in Sect. 5.3.1. When the matching function \(S\left( x,C\right) \) is binary, so that \(S\left( x,C\right) \) equals 1 if the similarity between x and C is above a given threshold and \(S\left( x,C\right) \) equals 0 otherwise, we can use (5.2) to compute \(v_{i}\) using the average as the only statistic. In the other case, if the matching function takes on continuous values in the interval \(\left[ 0,1\right] \), we have a generalization of (5.2), where the value of each attribute is weighted with a matching degree.
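A sketch of this generalized mapping follows, with a Gaussian similarity to the pattern center as an illustrative choice for the matching degree S(x, C); the pattern centers are assumed to have been found beforehand (e.g., by clustering).

```python
import numpy as np

def matching_degree(x, center, sigma=1.0):
    """Illustrative matching degree S(x, C): Gaussian similarity to the pattern center."""
    return np.exp(-np.sum((x - center) ** 2) / sigma ** 2)

def prototype_concatenation_mapping(X, centers, sigma=1.0):
    """Concatenate one weighted-average representative v_i per pattern, as in (5.3)."""
    reps = []
    for c in centers:
        w = np.array([matching_degree(x, c, sigma) for x in X])
        reps.append((w[:, None] * X).sum(axis=0) / w.sum())
    return np.concatenate(reps)
```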

5.3.3 Mapping Methods Based on Counting

This group of methods represents each bag as a single vector, where each attribute is the number of instances of the bag that are found in a specific region of the instance space. In other words, they describe the relationship between the bag label and the instance classes covered by different regions of the instance space.

Multi-instance classifiers using a counting-based mapping are strongly inspired by the MI assumptions hierarchy of Weidmann et al. (Sect. 3.4.2). Some algorithms create binary attributes in the mapping process, where the ith attribute indicates the presence or absence of instances of the bag in the ith region. These algorithms can model the presence-based assumption, including the standard MI assumption. Other algorithms create attributes that take on positive numeric values representing absolute or relative frequencies of the instances belonging to the bag and lying inside the corresponding region. These algorithms can model the threshold and counting-based MI assumptions.

We can further divide this group into two major categories, taking the acquisition of the MI assumption into account. On the one side are those algorithms for which the designers decide in advance which MI assumption is used. This category is examined in Sect. 5.3.3.1. On the other side, we consider the algorithms for which no MI assumption has been specified. They learn the hypothesis from the data during execution. Section 5.3.3.2 is devoted to these methods.

5.3.3.1 Using an a Priori Count-Based MI Assumption

The best known algorithm using a count-based MI assumption is GMIL, which first appeared in [17]. GMIL stands for Generalized Multiple Instance Learning and is indeed a generalization of the standard MI assumption. The presence-based MI assumption of the Weidmann hierarchy is generalized by GMIL as well. However, it cannot represent learning problems obeying the threshold or counting based MI assumption, because the attributes constructed in the mapping are binary.

Like all algorithms using count-based assumptions, GMIL first identifies regions of the instance space that will be used in a second step to map bag attributes. Regions are identified systematically and exhaustively. All possible axis-parallel boxes in the instance space are explicitly enumerated. As an illustration, consider a discrete d-dimensional instance space \(\mathbb {X}=\left\{ 1,\ldots ,v\right\} ^{d}\) in a two-class classification problem. In a one-dimensional space (\(d=1\)), if the attribute has two possible values (\(v=2\)), there are three possible axis-parallel boxes as shown in Fig. 5.3. If the space has two dimensions and each dimension can take one of two possible values, there are nine possible axis-parallel boxes as shown in Fig. 5.4. If the space has three dimensions, each with two possible values, there are 27 possible axis-parallel boxes as shown in Fig. 5.5. In general, there are \(N=\left( v\left( v+1\right) /2\right) ^{d}\) possible axis-parallel boxes in a d-dimensional space. The regions have axis-parallel box shapes because the infinity norm is used to determine distances in the instance space. This norm defines the length of a d-dimensional vector x as \(\left\| x\right\| _{\infty }=\max \,\left\{ \left| x_{1}\right| ,\ldots ,\left| x_{d}\right| \right\} \), the largest absolute value of its components. GMIL creates two Boolean attributes for each box, indicating whether a bag contains an instance within that box. To reduce the number of attributes, boxes containing the same set of points are grouped together and only one representative box for each group is used.

Concretely, GMIL maps a bag X to \(M\left( X\right) =\left\langle a_{1},\ldots ,a_{N},\overline{a}_{1},\ldots ,\overline{a}_{N}\right\rangle \). The algorithm sets \(a_{i}\) to 1 if any point of X is contained in the ith box and sets it to 0 otherwise. It sets \(\overline{a}_{i}=1-a_{i}\), \(\forall i\in \left[ 1,\ldots ,N\right] \). All information is encoded by the first N attributes. This would be sufficient for many learning algorithms. However, GMIL was originally designed to learn monotone disjunctions using the Winnow classifier [15]. Since Winnow only generates formulas containing disjunctions of the input variables, the negations of the first N attributes must also be supplied such that any logical combination of the initial variables can be formed.
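The box enumeration and the Boolean mapping can be sketched as follows for a small discrete instance space; this toy code is only meant to make the combinatorics concrete and is far from a practical implementation.

```python
import numpy as np
from itertools import product

def enumerate_boxes(v, d):
    """All axis-parallel boxes in {1,...,v}^d, each given by one (lo, hi) interval per dimension."""
    intervals = [(lo, hi) for lo in range(1, v + 1) for hi in range(lo, v + 1)]
    return list(product(intervals, repeat=d))          # (v*(v+1)/2)**d boxes in total

def gmil_mapping(X, boxes):
    """Boolean GMIL-style mapping: a_i = 1 if the bag has a point in box i, plus the negations."""
    a = np.array([int(any(all(lo <= p[j] <= hi for j, (lo, hi) in enumerate(box)) for p in X))
                  for box in boxes])
    return np.concatenate([a, 1 - a])

boxes = enumerate_boxes(v=2, d=2)        # 9 boxes, matching Fig. 5.4
bag = np.array([[1, 2], [2, 2]])         # a toy bag with two instances
print(gmil_mapping(bag, boxes))
```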

Fig. 5.3 There are three axis-parallel boxes when \(d=1\)

Fig. 5.4 There are nine axis-parallel boxes when \(d=2\)

Fig. 5.5 There are 27 axis-parallel boxes when \(d=3\)

Once the bags have been mapped to Boolean attributes, the algorithm tries to learn the target concept using a specific MI assumption based on theoretical results from geometric pattern recognition [12]. In the standard MI assumption, a single positive instance inside a bag determines that the bag belongs to the positive concept. Instance labels are typically determined by the proximity of the instance to a single target point, but GMIL can represent more general concepts. It represents a concept by a set of target points, more specifically, a set of attraction points, which can be seen as instances from an ideal positive bag. GMIL can also include a set of repulsion points, which can be seen as instances from an ideal negative bag. In this setting, a bag is positive if and only if it is sufficiently close to attraction points and sufficiently far from repulsion points.

GMIL’s notion of distance between bags is based on the Hausdorff distance. Recall from Sect. 3.5 that the Hausdorff distance between two sets of points P and Q is defined as the largest distance from either a point in P to its nearest neighbor in Q or from a point in Q to its nearest neighbor in P. Due to its use of the \(\max \) operator, the Hausdorff distance is sensitive to outlier points. To improve the robustness against noise, Scott et al. use the ranked full-Hausdorff distance

$$\begin{aligned} \max \left\{ \underset{{\scriptstyle p\in P}}{\text {max}^{s}}\left\{ \min _{q\in Q}\left\{ \left\| p-q\right\| _{\infty }\right\} \right\} ,\underset{{\scriptstyle q\in Q}}{\text {max}^{s}}\left\{ \min _{p\in P}\left\{ \left\| p-q\right\| _{\infty }\right\} \right\} \right\} , \end{aligned}$$
(5.4)

in which instead of using the largest distance, the sth largest distance is used. In (5.4), \(\max ^{s}\) denotes the sth largest value, P represents the pattern and Q is the model. Positive bags are within a ranked full-Hausdorff distance of some threshold \(\gamma \) from the ideal positive bag and at least a ranked full-Hausdorff distance of \(\gamma ^{\prime }\) away from the ideal negative bag. Let \(Q=\left\{ q_{1},\ldots ,q_{k}\right\} \) be the set of attraction points and \(\overline{Q}=\left\{ \bar{q}_{1},\ldots ,\bar{q}_{k'}\right\} \) the set of repulsion points representing the target concept. The concept can be modeled as a set of k axis-parallel attraction boxes and a set of \(k^{\prime }\) axis-parallel repulsion boxes. A bag is positive if and only if it contains points within at least \(r=k-s\) of the k attraction boxes and contains points within at most s of the \(k^{\prime }\) repulsion boxes.
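A short sketch of the ranked full-Hausdorff distance (5.4) is given below, using the infinity norm as in GMIL; setting s = 1 recovers the ordinary Hausdorff distance. The function names are illustrative.

```python
import numpy as np

def sth_largest(values, s=1):
    """Return the s-th largest value of a list (s = 1 gives the maximum)."""
    return np.sort(values)[-s]

def ranked_full_hausdorff(P, Q, s=1):
    """Ranked full-Hausdorff distance (5.4) between two point sets under the infinity norm."""
    d_PQ = [min(np.max(np.abs(p - q)) for q in Q) for p in P]   # each p to its nearest q
    d_QP = [min(np.max(np.abs(p - q)) for p in P) for q in Q]   # each q to its nearest p
    return max(sth_largest(d_PQ, s), sth_largest(d_QP, s))
```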

The Winnow algorithm is used in [17] to implement the GMIL assumption. Winnow is a linear-threshold algorithm that learns r-of-k threshold functions. It assigns nonnegative real-valued weights \(w_{a}\) to each attribute a. Weights are iteratively modified to find a hyperplane

$$ \sum _{i=1}^{N}a_{i}w_{a_{i}}+\overline{a}_{i}w_{\overline{a}_{i}}=\theta , $$

which separates both classes, where \(\theta \) is the threshold determined by the algorithm. The \(k+k'\) attributes with the largest weights are selected at the end of training. The selected attributes correspond to the k attraction points plus the \(k'\) repulsion points identified by the algorithm. In the classification stage, a bag is labeled positive if \(a_{i_{1}}+\cdots +a_{i_{k}}+\overline{a}_{i_{1}}+\cdots +\overline{a}_{i_{k'}}\ge r\).

Scott et al. also presented a GMIL variant using the ranked half-Hausdorff distance. Using this distance they assume that the model is accurate and compute the distance from the bag to the model, but not vice versa. According to this variant, positive bags are within a distance

$$\begin{aligned} \underset{{\scriptstyle q\in Q}}{\text {max}^{s}}\left\{ \min _{p\in P}\left\{ \left\| p-q\right\| _{\infty }\right\} \right\} \end{aligned}$$
(5.5)

of some threshold \(\gamma \) to the ideal positive bag and, including repulsion points, beyond a distance

$$\begin{aligned} \underset{{\scriptstyle q\in Q}}{\text {min}^{s^{\prime }}}\left\{ \min _{p\in P}\left\{ \left\| p-q\right\| _{\infty }\right\} \right\} \end{aligned}$$
(5.6)

of another threshold \(\gamma ^{\prime }\) to the ideal negative bag. As before, the concept is a set of k axis-parallel attraction boxes and a set of \(k^{\prime }\) axis-parallel repulsion boxes. A bag is positive if and only if it contains points within at least \(r=k-s\) of the k attraction boxes and contains points within at most \(s^{\prime }\) of the \(k^{\prime }\) repulsion boxes. Note that, in contrast to the full-Hausdorff distance model, the number of points s which are tolerated not to fall in attraction boxes can be different from the number of points \(s^{\prime }\) which are tolerated to fall in repulsion boxes. Though it was theorized that the half-Hausdorff variant should be more robust against noise and more able to avoid overfitting, empirical results show a higher generalization ability of the full-Hausdorff variant on data from several domains.

GMIL has a theoretically sound foundation. However, it is not a practical learning method, since it has a very high time complexity. The sequence of Figs. 5.3, 5.4 and 5.5 shows that the number of boxes grows exponentially as d increases. A real application with a moderate number of attributes, like Musk, is infeasible for GMIL. The strategy of using a reduced number of instances to build the learning model [18] fails because, when the dimension is not trivially small, the number of instances required to significantly reduce the computational cost becomes too small to build an accurate model. A kernel-based reformulation is another strategy used to improve the efficiency of GMIL. The kernel performs the feature mapping implicitly and allows a support vector machine to be applied directly to the data. However, computing the kernel on two bags requires counting the number of boxes that contain at least one instance from each of the two bags, which again leads to severe scalability issues and quickly renders the problem intractable as the problem size increases. To address this issue, a fully polynomial randomized approximation scheme (FPRAS) was presented in [19], reducing the time complexity from exponential to polynomial.

5.3.3.2 Learning a Count-Based MI Assumption

In Chap. 4, we showed that instance-based methods make strong assumptions regarding the MI hypothesis. Each instance-based algorithm implements a specific MI assumption: some algorithms are based on the standard MI assumption, others on the collective assumption, and so on. The GMIL algorithm discussed in Sect. 5.3.3.1 has an MI assumption wired into its design as well. In these methods, the MI assumption is not only used in the classification stage to determine the bag label, but also in the training stage to impose restrictions that help determine the class likelihood of instances. If the imposed MI assumption does not conform reasonably well to a given dataset, then the algorithm cannot build an appropriate learning model for it. Each algorithm is only appropriate for those problems which conform to the applied MI assumption.

Unlike instance-based methods and methods like GMIL that have a specific MI assumption embedded in their design, mapping-based algorithms described in this section do not assume a priori the existence of a specific relationship between the labels of each bag and of its instances. This relationship is learned from the data instead, in the form of a count-based MI assumption. Training occurs in two steps

  1. The method tries to identify regions in the instance space using either supervised or unsupervised methods. These regions appear as a result of the instance space structure.

  2. The underlying MI assumption is learned. This is the relationship between the bag labels and the instance space regions identified in the first step. To this end, a new representation space is built, in which each attribute corresponds to one of the regions. Each bag is mapped into this new space, such that the value of the ith attribute indicates the presence or frequency of instances of the bag in the ith region. Any single-instance learning model can be built on this new single-instance training set.

Methods in this category differ fundamentally in the way they identify instance patterns, i.e., in the first step described above. In the Two-Level Classification (TLC) algorithm [22], a standard decision tree is used for this purpose. The tree is built on all instances of all training bags. Each instance is assigned its bag's class label. Instances are weighted such that all training bags have equal weight in the construction of the learning model. Each node in the tree represents a region of the instance space. In the second step, each bag is mapped to a new representation in which each attribute contains the number of instances of the bag that have reached the corresponding node in the tree.

Constructive Clustering Ensemble (CCE) [25] uses a clustering algorithm to determine the regions. The k-means algorithm is used to obtain a number of groups whose centers are stored. In the second step, each bag is mapped to a new representation in which each attribute indicates the presence of instances of the bag in the corresponding group. An instance belongs to a group g if its distance to the center of g is less than its distance to the centers of the other groups. As it is not possible to determine the optimal number of groups in advance, CCE generates many classifiers, each obtained from a different number of groups, and then combines their predictions by majority vote.
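The core of such a clustering-based count mapping can be sketched as follows. This is a simplified, single-clustering version (CCE additionally repeats the process for several numbers of groups and votes over the resulting classifiers), and the choices of k-means, a decision tree, and the parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def fit_count_based_classifier(bags, bag_labels, k=5, seed=0):
    """Cluster all instances into k groups, map bags to presence bits, train a learner."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(np.vstack(bags))
    mapped = np.array([np.isin(np.arange(k), kmeans.predict(B)).astype(int) for B in bags])
    clf = DecisionTreeClassifier(random_state=seed).fit(mapped, bag_labels)
    return kmeans, clf

def predict_bag(X, kmeans, clf):
    """Map an unseen bag to its presence bits and classify it."""
    z = np.isin(np.arange(kmeans.n_clusters), kmeans.predict(X)).astype(int)
    return clf.predict(z.reshape(1, -1))[0]
```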

Since these algorithms do not make a priori assumptions about the nature of the relationship between bags and instances underlying the data, they can learn a wider variety of problems. For example, all algorithms based on the standard MI assumption take for granted that there are two classes of instances (positive and negative). Algorithms learning the MI assumption during training can find an arbitrary number of classes in the instance space and can discover relationships between bags and instances that best fit the training data.

5.3.4 Mapping Methods Based on Distance

In count-based mapping methods, the attribute values of each bag are defined by the location of instances of the bag inside a delimited region of the instance space corresponding to that attribute. The notion of instance membership in a region is strict. It only accepts two extreme possibilities: the point either belongs or does not belong to the region, depending on which side of the region border the point is located. The fact that we may only have a vague idea of the borders of the instance regions is ignored. In many applications, perfectly delimited region boundaries make no sense. For example, if tall people form an important region of the instance space, it is difficult to determine where the region should start: at 1.70 m, 1.80 m, or 1.85 m? Any such value would be merely conventional. However, if an instance is near the center of a region, we can be quite certain that it falls within the region. The farther an instance is located from the center, the less likely it is to belong to the region. This is the idea behind distance-based mapping methods: each attribute value in the output space is related to the distance from the bag to the center of a region.

These methods try to identify instance regions that are representative of the structure of the instance space. Regions can be obtained through a clustering or classification model constructed from training instances. A prototypical point is recorded at the center of each region. In some cases, prototypes of only one class (usually the positive class) are used. In other cases, they are determined for each class. Each attribute of the induced space corresponds to one of the prototypes found in the original space. The attribute value is a distance measure (or a similarity measure) between the bag and the prototype. Note that the bag contains many points (instances), while the prototype is a single point. Specific distance functions between bags and prototypes have to be used. Distance functions used in these cases are usually aggregations of distances between the instances of the bag and the prototype. Distance-based mapping methods differ in how instance prototypes are determined and in their definition of the distance function.

One of the first algorithms using this type of mapping was DD-SVM [5]. This algorithm selects instance prototypes for both classes based on the values of the diverse density (DD) function. Under the diverse density framework, a prototype for class C is a point of the instance space with a high probability of being found in bags of class C. Prototypes are local extrema of the DD function, where the positive prototypes are maxima and the negative prototypes minima. To locate the prototypes, gradient descent methods are used over the DD function. To find the positive prototypes, optimization processes are started from each instance of the positive bags, while for negative prototypes, searches start from each instance of the negative bags. The located prototypes are used to map each bag to the new representation space. Using T prototypes, a bag X is transformed as

$$\begin{aligned} \mathscr {M}\left( X\right) =\left\langle S\left( t_{1},X\right) ,\ldots ,S\left( t_{T},X\right) \right\rangle , \end{aligned}$$
(5.7)

where \(t_{i}\) represents the ith prototype and \(S\left( t_{i},X\right) \) is a distance measure between the bag X and \(t_{i}\). Specifically, in [5] an absolute distance measure \(S\left( t,X\right) =\min _{j}\left\| x_{j}-t\right\| \) is used. The authors apply an SVM to the bags represented in the mapped space to obtain a bag classification model. In general, as with all mapping methods, any single-instance learning algorithm can be used to build this model as well.

The MILES algorithm [6] was introduced by the same authors as DD-SVM. Instead of looking for class prototypes in each bag, MILES uses all training instances as reference points to construct the new bag space. In other words, each instance is treated as a prototype. The new representation space has as many attributes as the total number of instances in the training set. More formally, let \(\mathbf X =\left\{ X_{1},\ldots X_{m}\right\} \) be the training bag set. We align the instances inside the bags and renumber them to get the set of instances \(\left\{ t_{k}|\exists X_{i}\in \mathbf X :t_{k}\in X_{i}\right\} \), \(k=1,\ldots ,T\), where \(T=\sum _{i=1}^{m}n_{i}\). We use (5.7) to map a bag X to the output space. To calculate the value of the ith attribute, MILES uses the Gaussian similarity function given by

$$\begin{aligned} S\left( t,X\right) =\max _{j}\exp \left( -\frac{\left\| x_{j}-t\right\| ^{2}}{\sigma ^{2}}\right) , \end{aligned}$$
(5.8)

where \(\sigma \) is a parameter to scale the attributes.

MILES is more computationally efficient than DD-SVM, because it avoids the expensive optimization procedure over the diverse density function, which DD-SVM must perform for every instance. Chen et al. [6] have shown that MILES is as good as and sometimes superior to DD-SVM in generalization accuracy and it is also more robust with respect to label noise.

The MILES mapping can be seen as a method for determining the weight of each instance. Indeed, the SVM applied to the mapped space calculates a weight for each attribute, which is normally used for feature selection. The attributes of the mapped space are precisely the instances of the training bags, which allows the influence of different parts of the instance space to be determined. However, MILES does not create a well-defined weight function over the instance space, because the max operator used in (5.8) to determine the value of each attribute only takes into account the influence of the instance of the bag nearest to the target point, resulting in a bag-dependent weight function [9]. Foulds et al. [9, 10] proposed the YARDS algorithm, which is similar to MILES in almost everything except that YARDS finds a true weight function over the instance space. By replacing the max operator with the sum operator, that is, by setting

$$ S\left( t,X\right) =\sum _{j}\exp \left( -\frac{\left\| x_{j}-t\right\| ^{2}}{\sigma ^{2}}\right) , $$

the bag-dependence in the similarity function is removed. In YARDS, each instance of the bag has an influence on the bag-level classification and that influence only depends on the attributes of the instance and not on the rest of the bag.
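The mapping (5.7) and the different choices of S discussed in this section can be sketched as follows; the functions are illustrative one-line translations of the formulas above, not the authors' implementations.

```python
import numpy as np

def s_ddsvm(t, X):
    """DD-SVM: distance from prototype t to its closest instance in the bag."""
    return min(np.linalg.norm(x - t) for x in X)

def s_miles(t, X, sigma=250.0):
    """MILES (5.8): Gaussian similarity of the closest instance only (max operator)."""
    return max(np.exp(-np.sum((x - t) ** 2) / sigma ** 2) for x in X)

def s_yards(t, X, sigma=250.0):
    """YARDS: every instance of the bag contributes (sum operator)."""
    return sum(np.exp(-np.sum((x - t) ** 2) / sigma ** 2) for x in X)

def distance_mapping(X, prototypes, S):
    """Mapping (5.7): one attribute per prototype t_i, valued by S(t_i, X)."""
    return np.array([S(t, X) for t in prototypes])
```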

5.3.5 Bag-Level Distance Mapping Methods

In the mapping methods included in Sects. 5.3.3 and 5.3.4, bags are described by their relationships with instance-level spatial structures. This mapping relates instance space regions with bag classes. Another way to transform a multi-instance problem into a single-instance one is by describing each bag through the spatial relationship it has with the other bags of the training set. In this case, the mapping is done at the bag level, but instance space regions are ultimately related to bag classes, since bags are represented by multiple vectors in the instance space. However, in this mapping each instance maintains the relationship with its bag, making it a more informative mapping than that which only includes instance-level relations.

The idea of bag-level distance mapping methods has been developed by Zhang and Zhou [26] with their BARTMIP algorithm. The work scheme of BARTMIP is shown in Fig. 5.6. A multi-instance clustering model is built on the training bags, dividing them into k groups. Each group is represented by its medoid, i.e., its most central bag. Each bag is mapped to a vector of k attributes, one for each group of bags. The ith attribute value of a bag is the distance from the bag to the ith medoid. All training bags are mapped with this form of representation. This results in a single-instance training set on which a single-instance classification algorithm is trained. In the prediction step, the new bag is mapped in the same way to a vector of k attributes and processed by the single-instance classification model.

The components of this algorithm can be selected from a wide variety of choices. BARTMIP can train any single-instance classification algorithm and use any multi-instance clustering algorithm. Multi-instance clustering algorithms are described in Chap. 7. Specifically, in [26], BARTMIP uses a multi-instance clustering algorithm called BAMIC (Sect. 7.1.4.1), which is an adaptation of the single-instance k-medoids clustering algorithm to the multi-instance setting. Many multi-instance clustering methods depend on a bag-level distance function, which in turn uses an instance-level distance function. The distance functions at the bag and instance levels are other components of the model that should be chosen. The optimal number of groups to be generated in the clustering step can be determined by cross-validation. An alternative is to build several clustering models, each with a different number of groups, and train a classifier from each grouping. The ensemble prediction is obtained by majority vote.
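A simplified sketch of the BARTMIP pipeline is given below. For brevity, the medoid bags are picked at random instead of by the BAMIC clustering used in [26], the bag-level distance is a symmetrized average nearest-instance distance, and the SVM is an arbitrary choice of base classifier; all of these are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def bag_distance(A, B):
    """Symmetrized average nearest-instance distance between two bags."""
    d = lambda P, Q: np.mean([min(np.linalg.norm(p - q) for q in Q) for p in P])
    return (d(A, B) + d(B, A)) / 2

def bartmip_like_fit(bags, labels, k=3, seed=0):
    """Pick k medoid bags, map every bag to its distances to them, train a classifier."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(bags), size=k, replace=False)   # crude stand-in for BAMIC
    mapped = np.array([[bag_distance(B, bags[m]) for m in medoids] for B in bags])
    clf = SVC().fit(mapped, labels)
    return medoids, clf

def bartmip_like_predict(X, bags, medoids, clf):
    """Map an unseen bag to its distances to the medoids and classify it."""
    z = np.array([bag_distance(X, bags[m]) for m in medoids])
    return clf.predict(z.reshape(1, -1))[0]
```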

Fig. 5.6 BARTMIP algorithm

5.4 Experimental Analysis

In this section, we empirically compare the performance of some representative bag-based MIC methods. We show experimental results for both original-BS methods and mapped-BS methods and compare the two strategies. These experiments are only intended for illustration purposes and cannot be taken as a rigorous comparison among classifiers. The experimental setup is specified in Sect. 5.4.1, while Sect. 5.4.2 presents the results.

5.4.1 Setup

We use the same datasets as in the experimental study of Chap. 4, described in Table 4.1. The algorithms included in the study are named in the first column of Table 5.1. The second column describes the method type. CitationKNN and MISMO are representative algorithms that work on the original bag space. The other algorithms are mapping methods, one of each type described in Sect. 5.3 with the exception of prototype concatenation discussed in Sect. 5.3.2. Prototype concatenation mapping methods have been excluded due to their high memory requirements. They are appropriate to use in small problems, but even for medium-sized datasets (as some are in these experiments) it is difficult to make comparative studies.

Unlike methods that work on the original bag space and construct a specific classifier, a mapping method can train any standard classification algorithm. Their performance depends on both the mapping method and the learner used as base classifier. In order to get a better idea of the qualities of the mapping methods, we try each alternative with five popular classification algorithms: one nearest neighbor (1NN), C4.5, logistic regression (LR), an SVM and AdaBoost with C4.5 as base classifier (AdaBoost). We use Weka implementations for the algorithms in the first four rows of Table 5.1, while the last two were implemented by us. A rough optimization was made for the most important parameters of each method, looking for the values yielding the best result across all datasets. We use default parameter settings for each algorithm if not specified otherwise. We use a fivefold cross-validation procedure and evaluate the performance of the classifiers by means of their accuracy (Sect. 1.4).

Table 5.1 Bag-based classification algorithms to be compared

5.4.2 Results and Discussion

In Sect. 5.4.2.1, we show empirical results of the selected original-BS methods. We compare typical mapped-BS methods with each other using several base classifiers in Sect. 5.4.2.2. Finally, we compare the original-BS and mapped-BS classifiers in Sect. 5.4.2.3.

5.4.2.1 Original-BS Methods

Table 5.2 presents the experimental results of two original-BS methods, namely CitationKNN and a bag-level SVM. The latter is a standard SVM using the Gärtner et al. MI kernel described in Sect. 5.2.2 with the standard RBF instance-level kernel. The table lists the best results for each algorithm after a simple parameter adjustment was made, looking for the best average result over all data. The results shown for CitationKNN were obtained with \(C=2\) and \(R=2\) and those for SVM with \(C=1.0\) and \(\gamma =0.5\). The last two rows of the table show the average accuracy and the standard deviation of each classifier over the nine datasets. The best accuracy is highlighted in bold for each dataset. SVM is the winner in six out of nine cases, while CitationKNN wins in three datasets. The higher average accuracy of SVM supports the idea that it performs better than CitationKNN over the studied problem domains. The lower standard deviation of the SVM means that its good performance is more evenly distributed across the datasets than that of CitationKNN, which obtains very good results on a few datasets, but poor results on many of them.

Table 5.2 Classification accuracy for methods working in the original bag space

5.4.2.2 Mapped-BS Methods

Table 5.3 presents a summary of the experimental results of the selected mapped-BS methods using five base classifiers. The average accuracy computed over the nine datasets, along with the confidence interval at significance level \(\alpha =0.05\), is shown for each pair of mapping method and classifier. The algorithm parameters were set as follows: \(\sigma =250\) in MILES, a clustering ratio of 60 % in BARTMIP, five iterations in CCE and ten iterations in AdaBoost. The SVM in all mapping methods uses an RBF kernel with \(C=10.0\) and \(\gamma =0.5\). The most accurate mapped-BS method for each base classifier is highlighted in bold. SimpleMI obtains the best performance for three classifiers: C4.5, SVM, and AdaBoost. BARTMIP is the best performing mapped-BS method for the 1NN and LR classifiers. This suggests that SimpleMI and BARTMIP are two of the most accurate mapped-BS methods overall, since they achieve the highest quality predictions with several base classifiers over a range of datasets from different application domains.

Table 5.3 Average classification accuracy of mapping methods using different base classifiers

Table 5.4 presents the detailed experimental results of each mapped-BS method executed with its best base classifier following the conclusions of Table 5.3. The highest accuracy for each dataset among the four methods is marked in bold. SimpleMI and BARTMIP are again the most outstanding algorithms, as each one wins in four datasets. With respect to the application domains, it seems that SimpleMI is best suited for molecular activity prediction, while BARTMIP looks like the leader in the image recognition domain. In the next section, we delve deeper into this topic when we compare all bag-based methods to each other.

Table 5.4 Classification accuracy for best performing mapping method schemes

5.4.2.3 Overall Comparison

In Sects. 5.4.2.1 and 5.4.2.2, we pointed out the most accurate classifiers of each type. We are now interested to make an overall comparison between original-BS and mapped-BS methods in order to discover their advantages and disadvantages. The best performing model of each type is taken into account in this comparison. Two original-BS methods and four mapped-BS methods are included.

To discover which method is the best option in each case, we first separate the results by application domain. In Fig. 5.7, we depict the accuracy of the methods on the biochemical applications. Note that the accuracy axis starts at 40 to better distinguish the differences between the methods. SimpleMI, BARTMIP, and MISMO dominate in almost all datasets, while CitationKNN and MILES are not stable in their results. It is remarkable that BARTMIP performs quite well on all five datasets.

In Figs. 5.8 and 5.9, we show the accuracy of the methods on datasets from the textual and image domains, respectively. From these charts, we cannot identify one algorithm that is superior to the others in either of these domains. The advantage (discussed in the previous section) of BARTMIP over SimpleMI on image datasets is negligible. We can only point out some general trends. CitationKNN and MILES again show poorer results than the other methods. SimpleMI, BARTMIP and MISMO excel in most datasets. Figure 5.10 shows the average accuracy of the six methods over the nine datasets and supports the above statement.

Fig. 5.7 Biochemistry domain

Fig. 5.8 Textual domain

Fig. 5.9 Imaging domain

Fig. 5.10 Average accuracy of selected bag-based classification methods over the experimental datasets

We are also interested in analyzing the training time of the models. Figure 5.11 shows the average training time of the six methods over the nine datasets. Note that a logarithmic scale is used to represent time intervals, such that differences between methods can be correctly perceived. Time values are given in seconds, but we are mostly interested in the relative time proportions of the different models. The training time of CitationKNN is mainly devoted to the calculation of bag-level distances, whereas the kernel calculations made by MISMO are three times faster than the work of CitationKNN. The very small training time of SimpleMI is one of the most remarkable things in the figure. The key lies in the simplicity of its mapping method. MILES has a fair training time, which is in line with its moderately simple mapping method. Conversely, BARTMIP training has a considerable time complexity. It has a much more complex mapping method, which includes bag-level distance calculations and bag-level clustering. Finally, training CCE takes a long time. It includes instance-level distance calculations and instance-level clustering, which are much more time demanding than their bag-level counterparts.

Fig. 5.11 Average training time of selected bag-based classification methods over the experimental datasets

5.5 Comparing Instance-Based, Bag-Based, and Traditional Classification Methods

In Chap. 4 and this chapter, we discussed two classifier families that work very differently: one of them learns at the instance level, the other at the bag-level. In both cases, we have presented comparative experiments on the performance of several representative members of each family. An obvious question is how these two families compare. This does not have an easy answer and has been the subject of study of some recent work [1, 2]. There is another more basic question that some researchers have put forward [1, 16], namely whether multi-instance classifiers outperform single-instance classifiers in all multi-instance datasets.

Ray and Craven [16] found that single-instance classification algorithms can perform well on several MIL problems, outperforming MI classifiers in some cases. This result had a strong impact on the MIL community, as it showed that MI classifiers with good success records could be beaten by simple single-instance models on some datasets.

Alpaydin et al. [1] designed artificial datasets with increasing complexity levels, corresponding to more and more complex dependencies between instances in a bag. They compare instance-based, bag-based, and single-instance classifiers on artificial datasets of different sizes and complexity levels. Their conclusion was that, in general, single-instance classifiers can only handle the simplest MIL problems, corresponding to the lowest complexity level; instance-based classifiers are good at solving problems from the first and second complexity levels; and bag-based classifiers can solve problems from the first three levels. Datasets from the fourth complexity level require even more advanced classification methods. Alpaydin et al. also found that the datasets where single-instance classifiers outperform multi-instance methods are those with the lowest complexity level and with a small number of bags, because there is not enough data to train the bag-level classifiers.

This explanation clarifies the general relation that appears between algorithms and data complexity. Nevertheless, we should keep in mind that no classifier exists that can handle all different application domains. Faced with a new MIL problem, the best algorithm might be an instance-based, a bag-based or a traditional classifier.

5.6 Summarizing Comments

Bag-based classification algorithms are an important group of MIC methods. They predict the bag class mainly using information at the bag level. They do not strive to predict instance class labels and have more flexible and general MI assumptions. Several bag-based methods have appeared in the literature. According to their main features, we organize them in the category system depicted in Fig. 5.12. There are two principal categories of bag-based classifiers: (i) methods that operate on the original bag space by relying on a distance, similarity or kernel function and (ii) methods that use a mapping function to transform the data to a single-instance representation, such that single-instance classifiers can be trained and used to predict bag labels. Several types of transformations have been developed. Some mapping functions are based on simple bag statistics. Others represent the new space by concatenating prototypes extracted from the training bags. Other mapping methods count the number of instances of the bag falling in specific regions of the instance space and yet others compute the distance from the bag to the centers of these regions.

The experimental study shows that each of the discussed methods can attain high accuracy in some application domains. Nevertheless, we do not recommend CCE because of its large training time and uncertain performance. We advise the use of SimpleMI, because it often attains a very good accuracy and is very fast to train. BARTMIP is also a good option, because of its stable performance over several domains.

Fig. 5.12 Bag-based methods hierarchy