1 Introduction

Multi-instance learning (MIL) is a sub-area of machine learning in which training and testing data are organized in sets called bags. What makes MIL challenging is that labels associated with these data are frequently provided at the bag level, but not at instance level. This is also known as weakly supervised learning in the literature. Algorithms that adhere to this type of weakly supervised learning paradigm are naturally suited to a wide variety of real-world problems that contain limited labeled data. For example, images can be represented by a bag of patches, documents can be organized into sentences or paragraphs, and patients can be represented by a collection of medical records, to name a few. Because the labels are given at the bag level, a lot of resources spent on characterizing each instance are saved. For example, the clinicians only need to label/diagnose the bag or patient, but not each medical record. However, as illustrated in Fig. 1, since each bag can have an arbitrary number of instances, standard machine learning approaches that rely on fixed-length vector representations cannot be applied to such data directly. In this case, the multi-instance learning would be a better choice compared to single-instance learning, because in single-instance learning, predicting becomes difficult when the given instance does not contain any useful information for the prediction. Meanwhile, multi-instance learning enables correct classification from the key instances included in the bag. As a result, significant research efforts have been made to design algorithms that can handle this type of data in recent years.

1.1 Related works

In the past twenty years, a large number of MIL algorithms [1,2,3,4,5,6,7,8] have been proposed. These approaches have been applied to many different topics including drug activity prediction [9], content-based image and video retrieval [10, 11], medical image analysis [12], and document classification [13], among many other application areas [14]. Recently, deep learning-based MIL methods [15,16,17] have also been proposed to handle multi-instance data. While these methods have demonstrated their effectiveness in solving a variety of real-world problems, their limitations have also been discussed [18, 19]. For example, a recent survey paper [19] notes that current state-of-the-art MIL approaches are sensitive to the construction of instances within a bag. Specifically, they determine that the performance of MIL methods are sensitive to witness rate, e.g., the proportion of positive instances in positive bags, as well as whether the algorithm operates on the instance or bag level. This has also been observed in older MIL survey papers [18] and requires new algorithms to be tested on a range of different datasets and applications. In addition to dataset-specific performance, the authors of these survey papers highlight that performance improvements in the training time of MIL algorithms, especially those who rely on instance-level information, are necessary for further adoption.

Fig. 1
figure 1

An illustrative comparison of the single-instance and multi-instance learning paradigms. Algorithms that operate on multi-instance data must contend with the fact that instances are rarely individually labeled. Instead, labels are generally provided at the bag level. Thus, the goal of a multi-instance learning algorithm is to learn to identify instances, within a given bag, that indicate a particular class membership

Recently, the scalability for analyzing large amount of data has become another major issue of MIL studies with the development of data mining technologies. The large amount of bags and instances are involved in MIL problems, and MIL models require many parameters to analyze the complex patterns between instances. However, many existing MIL models are often tested on small or moderately sized dataset. To alleviate this scalability issue, We et.al., [20] proposes to map the raw representation of a bag into simpler representation of a vector format which can be classified by the followed SVM model. Although effective, this method does not improve the scalability of the followed SVM model. Vatsavai [21] employs the divide-and-conquer strategy to scatter the large images into patches of predefined size which can be parallelized and processed by Citation-KNN MIL algorithms [22]. However, Citation-KNN MIL algorithm adapted in [21] scales quadratically with respect to the average number of instances per bag.

Besides the quantitative evaluation of MIL models, the qualitative evaluation is also gaining interest. For example, when multiple patches of a histopathological image are formulated as instances, the key instances identified as patches exhibiting the evidence of the disease would be useful for the doctors who need a reference for the corresponding medical decision [23]. At the same time, the aforementioned identified key instances can provide credibility to the predictions of MIL models. In an effort to interpret the outputs of MIL models, mi-SVM [23] provides pixel-wise abnormality scores of an X-ray image and these scores are plotted over the image constructing the heatmap to identify the regions exhibiting abnormalities. Another multi-instance learning framework [24] divides each instance into patches and calculates their similarities with respect to the positive and negative prototypical parts. As a result, the method in [24] explains which patches in an instance are responsible for the prediction. However the interpretation often requires the ground truth explanation or additional processes to generate the explanations, which sacrifices the scalability.

In this work, we focus specifically on scaling SVM-based MIL algorithms, as they have shown consistent performance and can be further extended to nonlinear decision boundaries via kernelization. Popular SVM-based MIL approaches such as miSVM/MISVM [1], NSK [25], and sMIL/sbMIL [2] have been proposed to handle multi-instance data and have demonstrated promising performance, even when compared against modern MIL deep-learning architectures such as miNet/MINet [26]. While these approaches have performed well and can be extended to solve a variety of real-world problems, they are not widely used in practice as they do not scale well to large datasets. Furthermore, many of these approaches are not equipped with capabilities to interpret the results of their predictions. These two shortcomings, speed of model training and model interpretability of multi-instance learning methods, are the focus of this work.

1.2 Our contributions

For the remainder of this manuscript, we present a novel method that extends a multi-instance SVM to large-scale data. Our approach uses the multi-block alternating direction method of multipliers (ADMM) to avoid iteratively solving the quadratic programming problems that arise from standard SVM-based MIL approaches. The scientific contributions of this work are as follows:

  • A novel MIL algorithm derivation, named the primal–dual multi-instance SVM (pdMISVM) method, and an associated implementation that scales linearly as the number of bags increases.

  • An inexact variation of our approach, based on the optimal line search method, that scales linearly as the number of features increases.

  • Experimental results showcasing the promising predictive performance, scalability, and interpretability of our approach on baseline multi-instance data and real-world image data compared against other MIL algorithms.

  • An extension of our approach that allows for the inclusion of an arbitrary kernel function and a proof-of-concept experiment on synthetic data verifying our derivation.

This paper is an extension of our recent work [27] originally reported in the Proceedings of the Twenty-First International Conference on Data Mining (ICDM 2021). In this extended journal manuscript, we provide the following expansions over its conference version:

  • We expand our discussions on the previous works related with the scalability and interpretability of MIL problems (Sect. 1.1).

  • We present the complete derivations of the kernel variation of our pdMISVM method that scales linearly against the number of bags.

    • While the scalability issue of the kernel version of our new method against the number of bags has been discussed in the conference version, it was not solved. In this journal extension, we systematically derive the objective of the kernel version of our pdMISVM method and its solution algorithms. (Sect. 2.5).

    • The experimental results of our two kernel variations are added to compare their performance and scalability (Sects. 3.2 and 3.3).

  • We provide the detailed analyses on the computational complexities of our pdMISVM method and its kernel extension (Sects. 2.4 and 2.5). The codes to implement our method and the data used in our methods have been made publicly available online at: https://github.com/minds-mines/pdMISVM.jl.

  • We expand our experimental evaluations with two additional benchmark datasets and two recently proposed attention-based deep learning models (Sect. 3.2).

  • We include a case study on neuroimaging data to further evaluate the performance and interpretability of our method to solve real-world problems. We apply our method to identify the disease relevant brain regions from the neuroimaging perspective (Sect. 3.6).

2 Methods

In this section, we begin with a sketch for the steps required for the multi-instance SVM (MISVM) derivation initially presented by Andrews et.al., [1]. Then, following the multi-block ADMM framework [28,29,30], we construct the augmented Lagrangian that will be used to derive the solution to the proposed pdMISVM method, which is then followed by a step-by-step derivation to optimize the proposed objective. In addition, we extend our approach to handle a large number of features through an application of the optimal line search method [31]. Finally, we derive the solution algorithm of kernel pdMISVM.

2.1 Notation

In this manuscript we represent matrices as \({\textbf{M}}\), vectors as \({\textbf{m}}\), and scalars as m. The \(i\)-th row and \(j\)-th column of \({\textbf{M}}\) are denoted as \(\textbf{m}^i\) and \(\textbf{m}_j\), respectively. Similarly, \(m^i_j\) is the scalar value indexed by the \(i\)-th row and \(j\)-th column of \({\textbf{M}}\). The matrix \({\textbf{M}}_p\) corresponds to the \(p\)-th column-block of \({\textbf{M}}\). Given a \(K \times N\) matrix \({\textbf{M}}\), \(\{m, i\} = \mathop {\mathrm{arg\,max}}_{m',i'}({\textbf{M}})\) gives the row-by-column coordinates for the maximum element in \({\textbf{M}}\). The row and column indices are given by \(\mathop {\mathrm{arg\,max}}_{m',i'}({\textbf{M}})^m\) and \(\mathop {\mathrm{arg\,max}}_{m',i'}({\textbf{M}})_i\), respectively.

2.2 Extending the MISVM to K-class classification

In the binary multi-instance classification problem, the MIL algorithm is presented with a collection of bags and labels represented by the set \(\{{\textbf{X}}_i, y_i\}_{i = 1}^N\), where \(y_i \in \{-1, 1\}\) and \({\textbf{X}}_i \in {\mathbb {R}}^{d \times n_i}\) designates a bag containing \(n_i\) instances, and \(\{{\textbf{x}}_1, \dots , {\textbf{x}}_{n_i}\} \in {\textbf{X}}_i\) represent each instance within the \(i\)-th bag. Following the instance-centric approach advocated by Andrews et.al., [1] for the MISVM model, where a single “witness” instance determines the class of a bag, we define the decision function for a multi-instance binary classifier as

$$\begin{aligned} y_i = {\text {sign}}\left( \max _{{\textbf{x}}_i \in {\textbf{X}}_i}({\textbf{w}}^T{\textbf{x}}_i + b)\right) , \end{aligned}$$
(1)

where \({\textbf{w}}\) and b are the hyperplane and intercept for the MISVM model. The MISVM objective devised by Andrews et.al., is [1]

$$\begin{aligned} \begin{aligned} \min _{{\textbf{w}}, b, \varvec{\xi }}&\quad \frac{1}{2}\left\| {\textbf{w}}\right\| _{2}^2 + C \sum _{i=1}^N \xi _i\\ \text {subject to}&\quad \max _{{\textbf{x}}_i \in {\textbf{X}}_i} \left( {\textbf{w}}^T{\textbf{x}}_i + b\right) y_i \ge 1 - \xi _i,\\&\quad \xi _i \ge 0,\ i=1, \dots , N, \end{aligned} \end{aligned}$$
(2)

where C is a hyperparameter that determines the level of regularization on the learned hyperplane, and \(\xi _i\) are slack variables. The constraints in Eq. (2) can be incorporated into the objective via a Lagrangian function

$$\begin{aligned} \begin{aligned} \min _{{\textbf{w}}, b}\ \max _{\varvec{\alpha }}\quad&\mathcal {L}({\textbf{w}}, b, \varvec{\alpha })\qquad \text {subject to}\quad&\alpha _i \ge 0, \end{aligned} \end{aligned}$$
(3)

which can be solved with respect to the dual variables (\(\alpha _i\)) using off-the-shelf quadratic programming solvers or heuristic algorithms like sequential minimal optimization [32] that takes advantage of a limited number of support vectors. Although the MISVM formulation proposed by Andrews et.al., [1] is widely used in MIL literature, it is generally limited to binary classification problems.

In order to design a method suitable for multi-class multi-instance classification, we extend the decision function presented in Eq. (1) to K-classes via

$$\begin{aligned} \hat{y}_i = \mathop {\mathrm{arg\,max}}_{m', i'}\left( {\textbf{W}}^T{\textbf{X}}_i+\textbf{b}^T{\textbf{1}}_i\right) ^m, \end{aligned}$$
(4)

where \({\textbf{W}}\in {\mathbb {R}}^{d \times K}\), \(\textbf{b}\in {\mathbb {R}}^{K}\), \(\hat{y}_i \in \{1, \dots , K\}\) represents the hyperplanes, intercepts, and labels for K classes. Motivated by the results of [33] where it is argued that all-in-one formulations for K-class SVMs provide superior predictive performance, when compared to one-vs-all approaches, we construct the following Weston & Watkins [34] MISVM extension to Eq. (2)

$$\begin{aligned} \begin{aligned} \min _{{\textbf{W}}, \textbf{b}, \varvec{\xi }}\ {}&\frac{1}{2}\sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C \sum _{i=1}^N\sum _{m=1}^K \xi _i^m\\ \text {subject to}\quad&\Big (1 - \big [\max ({\textbf{w}}_m^T{\textbf{X}}_{i}+{\textbf{1}}b_m)- \max ({\textbf{w}}_y^T{\textbf{X}}_{i} + {\textbf{1}}b_y)\big ]y_{i}^{m}\Big )_+ \le \xi _i^m,\\&0 \le \xi _i^m,\ i = 1, \dots , 2,\ m = i, \dots , K, \end{aligned} \end{aligned}$$
(5)

where \({\textbf{w}}_y\), \(b_y\) is the hyperplane-intercept pair associated with the \(i\)-th bag’s class label, \((\cdot )_+ = \max (0, \cdot )\) is the hinge loss function, and \(y_i^m \in \{-1, 1\}\) indicates if the \(i\)-th bag belongs to the \(m\)-th class. Similar to Eq. (2), the K-class formulation above can be transformed into a quadratic programming problem and solved, although this approach is known [33] not to scale well as the number of bags increases. To address this issue, we propose a novel primal–dual algorithm based on the multi-block ADMM [30] to optimize Eq. (5).

2.3 A primal–dual multi-instance SVM

Incorporating the constraints of Eq. (5) into the objective gives the unconstrained optimization

$$\begin{aligned} \begin{aligned} \min _{{\textbf{W}}, \textbf{b}}\ {}&\frac{1}{2}\sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C\sum _{i=1}^{N}\sum _{m=1}^K(1 - [\max ({\textbf{w}}_m^T{\textbf{X}}_{i} + {\textbf{1}}b_m) \\ {}&- \max ({\textbf{w}}_y^T{\textbf{X}}_{i} + {\textbf{1}}b_y)]y_{i}^{m})_+, \end{aligned} \end{aligned}$$
(6)

which is difficult to solve given the coupling across \({\textbf{w}}_m\), \(b_m\), and the \(\max (\cdot )\) operations. Using the multi-block ADMM approach, we introduce the following constraints, inspired by [31, 35], and rewrite Eq. (6) as

$$\begin{aligned} \begin{aligned} \min _{\begin{array}{c} {\textbf{W}}, \textbf{b}, {\textbf{E}}, \\ {\textbf{Q}}, {\textbf{R}}, {\textbf{T}}, {\textbf{U}} \end{array}} \quad&\frac{1}{2} \sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C \sum _{i=1}^N \sum _{m=1}^K \left( y_{i}^{m}e_{i}^{m}\right) _+\\ \text {subject to}\quad&e_{i}^{m} = y_{i}^{m} - q_{i}^{m} + r_{i}^{m},\ q_{i}^{m} = \max \left( {\textbf{t}}_{i}^{m}\right) ,\ {\textbf{t}}_{i}^{m} = {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m,\\&r_{i}^{m} = \max \left( {\textbf{u}}_{i}^{m}\right) ,\ {\textbf{u}}_{i}^{m} = {\textbf{w}}_y^T{\textbf{X}}_i + {\textbf{1}}b_{y}, \end{aligned} \end{aligned}$$
(7)

to decouple the primal variables. Then, the augmented Lagrangian function of Eq. (7) is

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mu&= \frac{1}{2} \sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C \sum _{i=1}^N \sum _{m=1}^K \left( y_{i}^{m}e_{i}^{m}\right) _+ \\ {}&\quad + \frac{\mu }{2}\sum _{i=1}^N \sum _{m=1}^K\bigg [\left( e_{i}^{m} - (y_{i}^{m} - q_{i}^{m} + r_{i}^{m} - \lambda _{i}^{m}/\mu )\right) ^2\\&\quad + \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{t}}_{i}^{m} - \left( {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m\right) + \varvec{\theta }_{i}^{m}/\mu \right\| _{2}^2\\&\quad + \left( r_{i}^{m} - \max \left( {\textbf{u}}_{i}^{m}\right) + \omega _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{u}}_{i}^{m} - \left( {\textbf{w}}_y^T{\textbf{X}}_i + {\textbf{1}}b_y\right) + \varvec{\xi }_{i}^{m}/\mu \right\| _{2}^2\bigg ], \end{aligned} \end{aligned}$$
(8)

where \({\textbf{W}}, \textbf{b}, {\textbf{E}}, {\textbf{Q}}, {\textbf{T}}, {\textbf{R}}, {\textbf{U}}\) are the primal variables, \(\varvec{\Lambda }, \varvec{\Sigma }, \varvec{\Theta }, \varvec{\Omega }, \varvec{\Xi }\) are the dual variables, and \(\mu > 0\) is a tuning parameter. Equation (8) is then differentiated with respect to each primal variable to derive Algorithm 2.1. The primal–dual updates terminate when the total difference between the constraints incorporated via the augmented Lagrangian terms are less than a predefined tolerance. In the following, we provide the details to derive each step of Algorithm 2.1.

figure g

Update \({\textbf{W}}\) & \(\textbf{b}\), exact. Removing all terms from Eq. (8) that do not include \({\textbf{W}}\) and decoupling across columns of \({\textbf{W}}\) gives the following K subproblems

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m&= \mathop {\mathrm{arg\,min}}_{{\textbf{w}}_m}\ \frac{1}{2} \left\| {\textbf{w}}_m\right\| _{2}^2 + \frac{\mu }{2}\sum _{i=1}^N\Big [\big \Vert {\textbf{t}}_{i}^{m} - \big ({\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m\big ) + \varvec{\theta }_{i}^{m}/\mu \big \Vert ^2_2\Big ]\\&\quad + \sum _{i'=1}^{N'}\sum _{m'=1}^K\Big [\frac{\mu }{2}\big \Vert {\textbf{u}}_{i'}^{m'} - \big ({\textbf{w}}_m^T{\textbf{X}}_{i'} + {\textbf{1}}b_{m}\big ) + \varvec{\xi }_{i'}^{m'}/\mu \big \Vert ^2_2\Big ], \end{aligned} \end{aligned}$$
(9)

where \(i'\) indicates the column blocks in \({\textbf{X}}\) (and the corresponding columns of \({\textbf{U}}\) and \(\varvec{\Xi }\)) that belong to the \(m\)-th class and \(N'\) is the total number of bags belonging to the \(m\)-th class. Taking the derivative of Eq. (9) with respect to \({\textbf{w}}_m\) and setting the result equal to zero gives the closed form solution

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m^T&= \Big (\textstyle \sum _{i=1}^N\left[ \left( {\textbf{t}}_{i}^{m} - {\textbf{1}}b_m + \varvec{\theta }_{i}^{m}/\mu \right) {\textbf{X}}_i^T\right] + \textstyle \sum _{i'=1}^{N'} \textstyle \sum _{m'=1}^K\big [({\textbf{u}}_{i'}^{m'} - {\textbf{1}}b_m \\ {}&\quad + \varvec{\xi }_{i'}^{m'}/\mu ){\textbf{X}}_{i'}^T\big ]\Big )*\left( {\textbf{I}}/\mu + \textstyle \sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + K \textstyle \sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T\right) ^{-1}, \end{aligned} \end{aligned}$$
(10)

which can be calculated via a least-squares solver to avoid an inverse calculation.

Similarly, differentiating Eq. (9) element-wise with respect to \(b_m\) and setting the result equal to zero gives the update

$$\begin{aligned} \begin{aligned} b_m = \frac{\sum _{i=1}^N\left[ {\textbf{t}}_{i}^{m} - {\textbf{w}}_m^T{\textbf{X}}_i + \varvec{\theta }_{i}^{m}/\mu \right] +\sum _{i'=1}^{N'}\sum _{m'=1}^K\big [{\textbf{u}}_{i'}^{m'} - {\textbf{w}}_m^T{\textbf{X}}_{i'}+ \varvec{\xi }_{i'}^{m'}/\mu \big ]}{N + KN'}. \end{aligned} \end{aligned}$$
(11)

Update \({\textbf{E}}\). Dropping terms that do not contain \({\textbf{E}}\) from Eq. (8), by performing element-wise decoupling of the problem, we end up with the following \(K \times N\) subproblems

$$\begin{aligned} \begin{aligned} e_{i}^{m} = \mathop {\mathrm{arg\,min}}_{e_{i}^{m}}\ {}&C \left( y_{i}^{m}e_{i}^{m}\right) _+ + \frac{\mu }{2}\left( e_{i}^{m} - n_{i}^{m}\right) ^2, \end{aligned} \end{aligned}$$
(12)

where \(n_{i}^{m}= y_{i}^{m} - q_{i}^{m} + r_{i}^{m} - \frac{\lambda _{i}^{m}}{\mu }\). Equation (12) can be differentiated with respect to \(e_i^m\) via the sub-gradient method, and solved by the following three cases

$$\begin{aligned} e_{i}^{m} = {\left\{ \begin{array}{ll} n_{i}^{m} - \frac{C}{\mu }y_{i}^{m} \quad &{}\text {when} \quad y_{i}^{m}n_{i}^{m}>\frac{C}{\mu },\\ 0 &{}\text {when} \quad 0 \le y_{i}^{m}n_{i}^{m} \le \frac{C}{\mu },\\ n_{i}^{m} &{}\text {when} \quad y_{i}^{m}n_{i}^{m} < 0.\\ \end{array}\right. } \end{aligned}$$
(13)

Update \({\textbf{Q}}\) & \({\textbf{R}}\). Keeping only terms with \({\textbf{Q}}\) in Eq. (8) and performing element-wise decoupling, we end up with the following \(K \times N\) subproblems

$$\begin{aligned} \begin{aligned} q_{i}^{m}&= \mathop {\mathrm{arg\,min}}_{q_{i}^{m}}\ \left( e_{i}^{m} - y_{i}^{m} + q_{i}^{m} - r_{i}^{m} + \lambda _{i}^{m}/\mu \right) ^2\\&\quad + \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2. \end{aligned} \end{aligned}$$
(14)

Taking the derivative of Eq. (14) with respect to \(q_i^m\) and setting the result equal to zero, we can the problem for \(q_i^m\) by the following update

$$\begin{aligned} q_{i}^{m} = \frac{\big (y_{i}^{m} - e_{i}^{m} + r_{i}^{m} - \lambda _{i}^{m}/\mu + \max \big ({\textbf{t}}_{i}^{m}\big ) - \sigma _{i}^{m}/\mu \big )}{2}. \end{aligned}$$
(15)

Following the same steps for each \(r_i^m \in {\textbf{R}}\), we derive the element-wise updates

$$\begin{aligned} r_{i}^{m} = \frac{\big (e_{i}^{m} - y_{i}^{m} + q_{i}^{m} + \lambda _{i}^{m}/\mu + \max \big ({\textbf{u}}_{i}^{m}\big ) - \omega _{i}^{m}/\mu \big )}{2}. \end{aligned}$$
(16)

Update \({\textbf{T}}\) & \({\textbf{U}}\). Keeping terms in Eq. (8) containing \({\textbf{T}}\) and decoupling across K and N, we end up with the following subproblem

$$\begin{aligned} \begin{aligned} {\textbf{t}}_{i}^{m}&= \mathop {\mathrm{arg\,min}}_{{\textbf{t}}_{i}^{m}}\ \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{t}}_{i}^{m} - \left( {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m\right) + \theta _{i}^{m}/\mu \right\| _{2}^2, \end{aligned} \end{aligned}$$
(17)

which can be further decoupled into element-wise subproblems for each \(t_{i,j}^m \in {\textbf{t}}_i^m\), giving \(K \times (n_1 + \dots + n_N)\) problems

$$\begin{aligned} t_{i,j}^{m} = \mathop {\mathrm{arg\,min}}_{t_{i,j}^{m}} {\left\{ \begin{array}{ll}\left( q_{i}^{m} - t_{i,j}^{m} + \sigma _{i}^{m}/\mu \right) ^2 + \left( t_{i,j}^m - \phi _{i,j}^m\right) ^2, \\ \ \text {when } t_{i,j}^m=\max \left( {\textbf{t}}_{i}^{m}\right) ,\\ \left( t_{i,j}^m - \phi _{i,j}^m\right) ^2\ \text {else}, \end{array}\right. } \end{aligned}$$
(18)

\(\text {where } \varvec{\phi }_{i}^{m} = {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m - \varvec{\theta }_{i}^{m}/\mu \).

Taking the derivative of Eq. (18) with respect to \(t_{i,j}^m\) and setting the result equal to zero, we solve the problem for \(t_{i,j}^m\) by the following updates

$$\begin{aligned} t_{i,j}^{m} = {\left\{ \begin{array}{ll} \frac{\max (\varvec{\phi }_{i}^{m}) + q_{i}^{m} + \sigma _{i}^{m}/\mu }{2} &{}\text {if } j = \mathop {\mathrm{arg\,max}}(\varvec{\phi }_{i}^{m}),\\ \phi _{i,j}^{m} \quad &{}\text {else}. \end{array}\right. } \end{aligned}$$
(19)

This same strategy is applied to derive the element-wise updates of \({\textbf{U}}\), which gives

$$\begin{aligned} {u_{i,j}^{m} = {\left\{ \begin{array}{ll} \frac{\max (\varvec{\psi }_{i}^{m}) + r_{i}^{m} + \omega _{i}^{m}/\mu }{2} &{}\text {if } j = \mathop {\mathrm{arg\,max}}(\varvec{\psi }_{i}^{m}),\\ \psi _{i,j}^{m} \quad &{}\text {else}. \end{array}\right. }} \end{aligned}$$
(20)

\(\text {where } \varvec{\psi }_{i}^{m} = {\textbf{w}}_y^T{\textbf{X}}_i + {\textbf{1}}b_y - \varvec{\xi }_{i}^{m}/{\mu }\).

The associated dual variable updates are provided in Algorithm 2.1.

2.4 Scaling to a large number of features

Although the updates derived in Sect. 2.3 provide a suitable algorithm as the number of bags increase, it does not scale well against the increasing number of features. To be specific, the calculation in Eq. (10) for the left parenthesis \(\big (\textstyle \sum _{i=1}^N\left[ \left( {\textbf{t}}_{i}^{m} - {\textbf{1}}b_m + \varvec{\theta }_{i}^{m}/\mu \right) {\textbf{X}}_i^T\right] + \textstyle \sum _{i'=1}^{N'} \textstyle \sum _{m'=1}^K\big [({\textbf{u}}_{i'}^{m'} - {\textbf{1}}b_m + \varvec{\xi }_{i'}^{m'}/\mu ){\textbf{X}}_{i'}^T\big ]\big )\) has the computational complexity \(O\big (\big (n_1+\dots +n_N\big ) \cdot d\big )\) and the right parenthesis \(({\textbf{I}}/\mu + \textstyle \sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + K \textstyle \sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T)^{-1}\) can be efficiently calculated via the least-squares solver which has complexity \(O\left( d^2 \right) \). As a result, the updating \({\textbf{w}}_m\) requires the time complexity \(O\left( \left( n_1+\dots +n_N + d\right) \cdot d\right) \) which scales quadratically as the number of features increase; this limits the scalability of our approach to bags only. Additionally, since \(\mu \) is updated every iteration, the least-squares solver must be invoked at every iteration although \((\sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + K \textstyle \sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T)\) can be precomputed at the beginning of algorithm. The inversion calculation for \(d \times d\) matrix is the bottleneck of SVM algorithms. To handle this issue, we propose an alternative optimal line search method [31] to update \({\textbf{W}}\) for avoiding the inverse matrix calculation.

Update \({\textbf{W}}\), inexact. The partial derivative of Eq. (9) with respect to \({\textbf{w}}_k\) gives

$$\begin{aligned} \begin{aligned} \nabla ^T_{{\textbf{w}}_m}&= {\textbf{w}}_m^T - \mu \sum _{i=1}^N[{\textbf{t}}_i^m - {\textbf{w}}_m^T{\textbf{X}}_i - {\textbf{1}}b_m + \varvec{\theta }_i^m/\mu ]{\textbf{X}}_i^T\\&\quad -\mu \sum _{i'=1}^{N'}\sum _{m'=1}^K[{\textbf{u}}_{i'}^{m'}-{\textbf{w}}_m^T{\textbf{X}}_{i'} - {\textbf{1}}b_m + \varvec{\xi }_{i'}^{m'}/\mu ]{\textbf{X}}_{i'}^T, \end{aligned} \end{aligned}$$
(21)

which can be used to create the following minimization problem

$$\begin{aligned} \begin{aligned} s_m&= \mathop {\mathrm{arg\,min}}_{s_m}\ \frac{1}{2} \left\| {\textbf{w}}_m^T - s_m\nabla ^T_{{\textbf{w}}_m}\right\| _{2}^2 + \frac{\mu }{2}\sum _{i=1}^N\Big [\big \Vert {\textbf{t}}_{i}^{m} - ({\textbf{w}}_m^T - s_m\nabla ^T_{{\textbf{w}}_m}){\textbf{X}}_i - {\textbf{1}}b_m \\ {}&\quad + \theta _{i}^{m}/\mu \big \Vert ^2_2\Big ] + \sum _{i'=1}^{N'}\sum _{m'=1}^K \Big [\frac{\mu }{2}\big \Vert {\textbf{u}}_{i'}^{m'} - ({\textbf{w}}_m^T - s_m\nabla ^T_{{\textbf{w}}_m}){\textbf{X}}_{i'} - {\textbf{1}}b_{m} + \xi _{i'}^{m'}/\mu \big \Vert ^2_2\Big ], \end{aligned} \end{aligned}$$
(22)

in terms of \(s_m\) instead of \({\textbf{w}}_m\).

Differentiating Eq. (22) with respect to \(s_m\) and setting the result equal to zero, we solve the problem for \(s_m\) as follows

$$\begin{aligned} s_m = \frac{\left( {\textbf{w}}_m^T - \mu \sum _{i=1}^N\hat{{\textbf{t}}}^m_i{\textbf{X}}_i^T - \mu \sum _{i'=1}^{N'}\sum _{m'=1}^K\hat{{\textbf{u}}}_{i'}^{m'} {\textbf{X}}_{i'}^T\right) \nabla _{{\textbf{w}}_m}}{\nabla ^T_{{\textbf{w}}_m}\left( {\textbf{I}}+ \mu \sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + \mu K\sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T\right) \nabla _{{\textbf{w}}_m}}, \end{aligned}$$
(23)

where \(\hat{{\textbf{t}}}_i^m = {\textbf{t}}^m_i - {\textbf{w}}_m^T{\textbf{X}}_i - {\textbf{1}}b_m + \varvec{\theta }^m_i/\mu \) and \(\hat{{\textbf{u}}}_{i'}^{m'}={\textbf{u}}_{i'}^{m'}-{\textbf{w}}_m^T{\textbf{X}}_{i'}-{\textbf{1}}b_m + \xi _{i'}^{m'}/\mu \).

Because the denominator of Eq. (23) is equivalent to

$$\begin{aligned} \left\| \nabla _{{\textbf{w}}_m}\right\| _{2}^2 + \mu \sum _{i=1}^N\left\| \nabla ^T_{{\textbf{w}}_m}{\textbf{X}}_i\right\| _{2}^2 + \mu K\sum _{i'=1}^{N'}\left\| \nabla ^T_{{\textbf{w}}_m}{\textbf{X}}_{i'}\right\| _{2}^2, \end{aligned}$$
(24)

Equation (23) can be calculated efficiently in \(O\left( \left( n_1 + \dots + n_N \right) \cdot d\right) \) time.

By combining Eq. (21) and Eq. (23), we can update \({\textbf{w}}_m\) via

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m = {\textbf{w}}_m - s_m\nabla _{{\textbf{w}}_m}. \end{aligned} \end{aligned}$$
(25)

This “inexact” update option avoids solving the least squares problem present in Eq. (10) and is provided as an option on Line 6 of Algorithm 2.1 to extend our method to handle a large number of features.

2.5 A primal–dual multi-instance SVM with Kernel

While the exact and inexact formulations described in Algorithm 2.1 are computationally efficient and show promising performance on a variety of multi-instance datasets, they are limited to classification problems where instances within bags are linearly separable. In order for enabling our method to learn nonlinear decision boundaries, we derive an kernel extension of pdMISVM method in this subsection.

We begin by replacing all bags \({\textbf{X}}_i\) in Eq. (6) by their corresponding feature matrices \(\phi ({\textbf{X}}_i) = \varvec{\Phi }_i \in {\mathbb {R}}^{d_\phi \times n_i}\), which gives

$$\begin{aligned} \begin{aligned} \min _{{\textbf{W}}, \textbf{b}}\ {}&\frac{1}{2}\sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C\sum _{i=1}^{N}\sum _{m=1}^K(1 - [\max ({\textbf{w}}_m^T\varvec{\Phi }_{i}+ {\textbf{1}}b_m) \\ {}&- \max ({\textbf{w}}_y^T\varvec{\Phi }_{i} + {\textbf{1}}b_y)]y_{i}^{m}) \end{aligned} \end{aligned}$$
(26)

where \(\phi \) is an arbitrary kernel function. Then, by introducing constraints to decouple \({\textbf{w}}_m\) and \(b_m\) in Eq. (7) and incorporating them into the objective, we derive the following Lagrangian formulation

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mu&= \frac{1}{2} \sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + \sum _{i=1}^N \sum _{m=1}^K C \left( y_{i}^{m}e_{i}^{m}\right) _+ \\ {}&\quad + \frac{\mu }{2}\sum _{i=1}^N \sum _{m=1}^K\bigg [\left( e_{i}^{m} - (y_{i}^{m} - q_{i}^{m} + r_{i}^{m} - \lambda _{i}^{m}/\mu )\right) ^2\\&\quad + \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{t}}_{i}^{m} - \left( {\textbf{w}}_m^T\varvec{\Phi }_i + {\textbf{1}}b_m\right) + \varvec{\theta }_{i}^{m}/\mu \right\| _{2}^2\\&\quad + \left( r_{i}^{m} - \max \left( {\textbf{u}}_{i}^{m}\right) + \omega _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{u}}_{i}^{m} - \left( {\textbf{w}}_y^T\varvec{\Phi }_i) + {\textbf{1}}b_y\right) + \varvec{\xi }_{i}^{m}/\mu \right\| _{2}^2\bigg ]. \end{aligned} \end{aligned}$$
(27)

We pause here to recognize that, while the number of columns in each \(\varvec{\Phi }_i\) is equal to the number of instances inside the original bag \({\textbf{X}}_i\), the number of rows, i.e.,, \(d_\phi \), can be arbitrarily (even infinitely) large. Motivated by [36], we work to derive each update of our algorithm with respect to each primal variable in Eq. (27) without explicitly calculating \({\textbf{w}}_m\).

2.5.1 The Kernel extension of our method with exact solutions

We update \({\textbf{W}}\) by discarding all terms in Eq. (27) that do not contain \({\textbf{W}}\) and decoupling across columns, which gives the following K problems

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m&= \mathop {\mathrm{arg\,min}}_{{\textbf{w}}_m}\ \frac{1}{2} \left\| {\textbf{w}}_m\right\| _{2}^2 + \frac{\mu }{2}\sum _{i=1}^N\Big [\big \Vert {\textbf{t}}_{i}^{m} - {\textbf{w}}_m^T\varvec{\Phi }_i + {\textbf{1}}b_m + \varvec{\theta }_{i}^{m}/\mu \big \Vert ^2_2\Big ]\\&\quad +\sum _{i'=1}^{N'}\sum _{m'=1}^K\Big [\frac{\mu }{2}\big \Vert {\textbf{u}}_{i'}^{m'} - {\textbf{w}}_m^T\varvec{\Phi }_{i'} + {\textbf{1}}b_{m} + \varvec{\xi }_{i'}^{m'}/\mu \big \Vert ^2_2\Big ], \end{aligned} \end{aligned}$$
(28)

where \(i'\) indicates the corresponding column blocks of \(\varvec{\Phi }= [\varvec{\Phi }_1\ \dots \varvec{\Phi }_N] \in {\mathbb {R}}^{d_\phi \times \left( n_1 + \dots + n_{N}\right) }\) that belong to the \(m\)-th class. Equation (28) can be differentiated with respect to \({\textbf{w}}_m\), set equal to zero, and solved for each \({\textbf{w}}_m\)

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m^T&= \Big (\textstyle \left[ \left( {\textbf{t}}^{m} - {\textbf{1}}b_m + \varvec{\theta }^{m}/\mu \right) \varvec{\Phi }^T\right] + \textstyle \textstyle \sum _{m'=1}^K\big [({\textbf{u}}_{'}^{m'} - {\textbf{1}}b_m + \varvec{\xi }_{'}^{m'}/\mu )\varvec{\Phi }_{'}^T\big ]\Big )\\&\quad *\Big ({\textbf{I}}/\mu + \varvec{\Phi }\varvec{\Phi }^T + K \varvec{\Phi }_{'}\varvec{\Phi }_{'}^T\Big )^{-1}, \end{aligned} \end{aligned}$$
(29)

where \(\varvec{\Phi }_{'} = [\varvec{\Phi }_{1'}\ \dots \varvec{\Phi }_{N'}]\). Equation (29) can be written in the matrix form

$$\begin{aligned} {\textbf{w}}_m^T = {\textbf{v}}^m{\textbf{D}}\hat{\varvec{\Phi }}^T*\left( {\textbf{I}}/\mu + \hat{\varvec{\Phi }}{\textbf{D}}\hat{\varvec{\Phi }}^T\right) ^{-1}, \end{aligned}$$
(30)

where \({\textbf{v}}^m=[{\textbf{t}}^{m} - {\textbf{1}}b_m + \varvec{\theta }^{m}/\mu \quad 1/K\sum _{m'=1}^K({\textbf{u}}_{'}^{m'} - {\textbf{1}}b_m + \varvec{\xi }_{'}^{m'}/\mu )]\), \({\textbf{D}}=[{\textbf{I}}\ {\textbf{0}};\ {\textbf{0}}\ K{\textbf{I}}]\) and \(\hat{\varvec{\Phi }}=[\varvec{\Phi }\ \varvec{\Phi }_{'}]\). Since the kernel function applied to each \({\textbf{X}}_i\) may return feature vectors that are infinitely long, it may be impossible to calculate the inverse required to express \({\textbf{w}}_m\) in Eq. (30). In order to solve this issue we use the following method introduced in [36]

$$\begin{aligned} ({\textbf{P}}^{-1} + {\textbf{m}}^T{\textbf{R}}^{-1}{\textbf{m}})^{-1}{\textbf{m}}^T{\textbf{R}}^{-1} = {\textbf{P}}{\textbf{m}}^T({\textbf{m}}{\textbf{P}}{\textbf{m}}^T+{\textbf{R}})^{-1} \end{aligned}$$

to rewrite \({\textbf{w}}_m^T\) equivalently, as

$$\begin{aligned} {\textbf{w}}_m^T = {\textbf{v}}_m(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}\hat{\varvec{\Phi }}^T. \end{aligned}$$
(31)

The updated expression in Eq. (31) can then be used to update \({\textbf{w}}_m^T\phi ({\textbf{X}}_i)\)

$$\begin{aligned} {\textbf{w}}_m^T\varvec{\Phi }= {\textbf{v}}_m(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}\hat{\varvec{\Phi }}^T\varvec{\Phi }, \end{aligned}$$
(32)

and calculate \(\left\| {\textbf{w}}_m\right\| _{2}^2 = {\text {{\textbf {tr}}}}\left( {\textbf{w}}_m^T{\textbf{w}}_m\right) \) by

$$\begin{aligned} \left\| {\textbf{w}}_m\right\| _{2}^2 = {\text {{\textbf {tr}}}}\left( {\textbf{v}}_m(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }}(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}{\textbf{v}}_m^T\right) , \end{aligned}$$
(33)

without directly computing \({\textbf{w}}_m\). These two expressions are computationally tractable as the kernel expressions occur as an inner product in both cases. The updates for the other primal variables \(\textbf{b}, {\textbf{E}}, {\textbf{Q}}, {\textbf{R}}, {\textbf{T}}\) and \({\textbf{U}}\) are the same as Algorithm 2.1, except each instance of \({\textbf{w}}_m{\textbf{X}}_i\) and \({\textbf{w}}_y{\textbf{X}}_i\) are replaced by the corresponding columns of Eq. (32). We outline the update steps for the pdMISVM method with kernel in Algorithm 2.2. The time complexity of Algorithm 2.2 \(O\left( \left( n_1+\dots +n_N\right) ^2\right) \). The complexity comes from the multiplication between \({\textbf{v}}_m \in {\mathbb {R}}^{n_1 + \dots + n_N + n_{1'} + \dots + n_{N'}}\) and matrix and calculating \((\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}\) in Eq. (32) and Eq. (33).

figure h

2.5.2 The Kernel extension of our method with inexact solutions

Our exact kernel method in Eqs. (32) and (33) scales quadratically as the number of instances increases. To effectively deal with the large number of instances in the dataset, we can use the optimal line search method in Eq. (25) for the kernel version of our method by replacing \({\textbf{X}}_i\) with \(\varvec{\Phi }_i\):

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m = {\textbf{w}}_m - s_m\nabla _{{\textbf{w}}_m}, \end{aligned} \end{aligned}$$
(34)

where

$$\begin{aligned} \begin{aligned} \nabla ^T_{{\textbf{w}}_m}&= {\textbf{w}}_m^T - \mu \sum _{i=1}^N[{\textbf{t}}_i^m - {\textbf{w}}_m^T\varvec{\Phi }_i - {\textbf{1}}b_m + \varvec{\theta }_i^m/\mu ]\varvec{\Phi }_i^T\\&\quad -\mu \sum _{i'=1}^{N'}\sum _{m'=1}^K[{\textbf{u}}_{i'}^{m'}-{\textbf{w}}_m^T\varvec{\Phi }_{i'} - {\textbf{1}}b_m + \varvec{\xi }_{i'}^{m'}/\mu ]\varvec{\Phi }_{i'}^T, \end{aligned} \end{aligned}$$
(35)

and

$$\begin{aligned} s_m = \frac{\left( {\textbf{w}}_m^T - \mu \sum _{i=1}^N\hat{{\textbf{t}}}^m_i\varvec{\Phi }_i^T - \mu \sum _{i'=1}^{N'}\sum _{m'=1}^K\hat{{\textbf{u}}}_{i'}^{m'}\varvec{\Phi }_{i'}^T\right) \nabla _{{\textbf{w}}_m}}{\left\| \nabla _{{\textbf{w}}_m}\right\| _{2}^2 + \mu \sum _{i=1}^N\left\| \nabla ^T_{{\textbf{w}}_m}\varvec{\Phi }_i\right\| _{2}^2 + \mu K\sum _{i'=1}^{N'}\left\| \nabla ^T_{{\textbf{w}}_m}\varvec{\Phi }_{i'}\right\| _{2}^2}. \end{aligned}$$
(36)

The computational complexity of inexact kernel method in Eq. (34) is \(O((n_1 + \dots + n_N) \cdot d_{\phi })\), which is apparently more computationally efficient than Eq. (32) when the number of instances is larger than the number of kernel features \(d_{\phi }\). However, the inexact kernel pdMISVM is not applicable to the kernel function of an infinite number of features such as radial basis function. Therefore in our experiments, in contrast to using the radial basis function (RBF) for the exact method, the kernel extension of our method with exact solution employs the degree-2 polynomial function (poly) kernel. In case of degree-2 polynomial kernel \(K({\textbf{x}}, {\textbf{y}})=({\textbf{x}}^T {\textbf{y}} + c)^2\), the complexity becomes \(O((n_1 + \dots + n_N) \cdot d^2)\), since the feature map \(\phi \) is given by:

$$\begin{aligned} \begin{aligned} \phi ({\textbf{x}})&= [x_n^2, \dots , x_1^2, \sqrt{2} x_n x_{n-1}, \cdots , \sqrt{2} x_n x_{1}, \sqrt{2} x_{n-1} x_{n-2}, \cdots , \sqrt{2} x_{n-1} x_{1},\\&\quad \quad \cdots , \sqrt{2} x_{2} x_{1}, \sqrt{2c}x_n, \cdots , \sqrt{2c}x_1, c]. \end{aligned} \end{aligned}$$
(37)

Apparently, our inexact method that scales to the number of features is especially useful to kernel features, because the kernel feature map usually increases the dimensionality.

3 Experiments

In this section we explore the performance of our linear/kernel and exact/inexact pdMISVM implementations. We first test our method against an array of standard MIL benchmark datasets to explore how our implementations compare against other recent MIL methods. We follow the baseline experiments with an investigation into increasingly complex natural scene data to determine the performance characteristics of our approach. Then, we conduct experiments with synthetic data to illustrate the scalability of our approach and experimentally verify the expected computational complexity/performance characteristics of our approach compared to others. We follow with a discussion of the interpretability of our method on three multi-instance datasets derived from two well-known baseline datasets.

3.1 Experimental settings and datasets

As discussed in Sect. 2.5.1, because it is not possible to directly access the kernel features of radial basis function (RBF), we use the degree-2 polynomial kernel (poly) for our inexact kernel method. We compare our methods of linear/kernel and exact/inexact versions against ten recent MIL learning algorithms: (1) a single-instance learning (SIL) approach that assigns the bags’ labels to all instances during training and returns the maximum response for each bag/class-pair at test time for the testing bags’ instances; (2) the miSVM and (3) MISVM algorithms [1] that assume that at least one instance per bag is positive to classify a bag as positive; (4) the NSK algorithm [25], a bag-based method, that maps the entire bag to a single-instance by way of a kernel function; (5) the sMIL and (6) sbMIL [2] algorithms which expect that only a small number of instances within a bag are classified as positive and combine instance-level and bag-level relationships to make a prediction. We also compare our approach against two end-to-end MIL algorithms, (7) miNet and (8) MINet [26], based on deep neural-networks (DNN). Finally, the two DNN MIL models using the attention mechanism are compared: (9) Attention-based deep multiple instance learning (AMIL) [16] calculates the parameterized attention (importance) score for each instance to generate the probability distribution of bag labels; (10) loss-based attention for deep multiple instance learning (LAMIL) [37] proposes to learn the instance scores and predictions jointly by integrating the attention mechanism with the loss function. These two attention-based DNN methods have demonstrated the state-of-the-art classification performance in MIL.

These methods are compared against the proposed pdMISVM (Ours) method, and the inexact variation, described in Algorithm 2.1 and Algorithm 2.2. The grid search and performance calculations for each method-dataset pair are conducted using the MLJ library [38] and are included with our code.Footnote 1 All experiments were run on an Intel Xeon processor running at 2.20GHz using 126GB of RAM, running Ubuntu 18.04.4 LTS. The competing SVM-based methods are implemented using a libraryFootnote 2 written by Doran et.al. [39], while the DNN methods are implemented using the codeFootnote 3 provided as a companion to the paper [16, 26, 37]. Methods that take longer than one-thousand seconds to train during a single cross-validation are considered “timed-out” (T/O) and their performance metrics are not provided.

Each method is compared against a synthetic dataset and ten multi-instance datasets that are normalized to have zero mean and unit variance. The synthetic dataset contains 10 to 1,000 bags with three to five instances per bag and 10 to 1,000 features per instance. The first instance per bag is constructed from two normally distributed clusters with a standard deviation of one; the second to fifth instances per bag contain uniform random noise.

The MUSK-2 [9], Elephant, Fox, Colon [40], and Tiger [1] datasets are standard small-scale MIL evaluation datasets and are widely cited in MIL literature as benchmarks. The MUSK-2 dataset is designed to classify chemical compounds as either “musk” or “non-musk” which describes the chemical properties of a given compound; bags within this dataset are representative of the possible conformations of the labeled compound. The MUSK-2 dataset contains 39 positive and 63 negative bags with 166 features per instance. The Elephant, Fox and Tiger datasets are derived from the Corel image dataset [41] and each contain 100 positive and 100 negative bags with 143 non-zero features per instance. The Colon [40] consists of 25,000 histopathological images (instances) generated from 750 lung tissue images (bags). There are 5 classes of lung benign tissue, lung adenocarcinoma, lung squamous cell carcinoma, colon adenocarcinoma, and colon benign tissue in Colon dataset.

The MNIST-bags [16] dataset contains 100 positive and 100 negative bags where a bag is made up of a random number of \(28 \times 28\) greyscale images taken from the MNIST dataset. A bag is given a positive label if it contains a ‘9’ and negative label if it does not. For our experiments the average number of bags is ten, thus the witness rate for positive bags is 10%, on average. This low witness rate makes this a challenging dataset for the chosen MIL algorithms.

The SIVAL dataset was specifically designed for content-based image retrieval (CBIR) and contains natural scene images consisting of 25 categories with 60 images per category for a total of 1500 bags. In this work, we use the processed dataset provided in the initial work of Rahmani et.al. [42] and create a new dataset derived from the raw SIVAL images. In the original processed SIVAL dataset, the images are segmented into 30 or 31 instances, depending on the picture, consisting of 30 features each. In total, there are 47,414 instances across the entire SIVAL dataset. In order to explore the prediction and runtime performance of the compared methods, we construct a few subsets of this dataset containing a predetermined number of classes. Specifically, we construct the SIVAL-3, SIVAL-5, SIVAL-10, SIVAL-15, and SIVAL-25 datasets each containing three, five, ten, fifteen, and twenty-five classes from the SIVAL dataset, of 180, 300, 600, 900, and 1500 bags.

In addition, we construct the “SIVAL-25-deep” dataset, which is inspired by the “hybrid” approach detailed by Zheng et.al., [43], which investigates the ongoing shift from SIFT-based descriptors [44] to convolutional neural networks for generating image descriptors. To create this multi-instance dataset, we extract patches from the raw SIVAL images using the EdgeBox [45] proposal generator (eta=0.2, minScore=0.04, maxBoxes=200) provided in the OpenCV library.Footnote 4 These extracted patches are fed into a pre-trained AlexNet [46] convolutional neural network where the second to last fully connected layer (F10) is used to represent each instance by 4,096 features. We note that more complex/newer deep neural architectures, and other proposal generators, c ould be used to create this patch-level embedding but leave this to future work. This process is repeated for every image (where object proposals are detected by EdgeBox) and results in 1,463 bags for a total of 80,561 instances. The SIVAL-25-deep dataset is our attempt at a modernization of the standard SIVAL dataset; the pipeline used to generate this benchmark is provided with our code.

3.2 Classification performance

In Tables 1 and 2 we provide the classification performance of our approach compared against the other competing MIL algorithms described above in Sect. 3.1. Our goal is to verify that our approach matches the performance of the other MIL algorithms. For each dataset-method pair we report the accuracy (ACC) and balanced accuracy (BACC) results across ten sixfold cross validation experiments. We can see from Table 1 that our method gives comparable performance on the MUSK-2, Elephant, and Tiger datasets; this applies for both the exact and inexact implementations. Interestingly, the inexact linear version of our approach outperforms all other methods on both the Fox and MNIST-Bags datasets. That can be naturally explained by the previous study [47] which has shown some implementations of SVM obtain the highest accuracy before their objectives reach their minimum. In the Colon dataset, our kernel versions show superior performance in classifying the different shapes of tissues, which shows the benefits of kernel functions in our model.

In Table 2 our exact linear pdMISVM only slightly outperforms the next best performing method on the SIVAL-3 dataset; this impressive performance result does not hold for the inexact version. Although, the inexact method performs better (in comparison) when the number of classes/bags increase. The inexact linear methods shows surprisingly impressive results on SIVAL-25-deep dataset which are recorded just within the time-budget; this significant performance improvement can be seen very clearly in the comparison between the confusion matrices in Figs. 2 and 3. In comparison of training times (TT) between our exact and inexact kernel pdMISVM, we can clearly see that the inexact version scales to the increasing number of bags (the number of bags increases in the order of SIVAL-3, 5, 10, 15, and 25) better than the exact version. This empirically verifies the analytical complexity discussed in Sect. 2.5. It is clear from these results that the exact/inexact and linear/kernel methods are capable of providing competitive performance results on a variety of multi-instance datasets.

Table 1 Classification performance and train time (seconds) of our method and ten other MIL learning methods on six benchmark datasets
Table 2 Classification and train-time (seconds) performance of our method and ten other MIL learning methods on variants of the SIVAL dataset across a different number of classes (and bags) and preprocessing pipelines
Fig. 2
figure 2

Confusion matrix of the exact linear pdMISVM tested on the original SIVAL-25 dataset with 30 features per-instance. Results are derived from a sixfold cross-validation experiment across all 1500 bags

Fig. 3
figure 3

Confusion matrix of the inexact linear pdMISVM approach tested on the SIVAL-25-deep dataset created from the patch-wise application of a convolutional neural network as a pre-processing step. Results are derived from a sixfold cross-validation experiment across all 1463 bags

3.3 Bag/feature scalability

The key contribution of this work is that the derived algorithms described in Sects. 2.3 and 2.4 scale to large datasets. This can be clearly seen in the SIVAL-25 column of Table 2 where our methods are the only ones that are able to fit a model within one-thousand seconds. In order to further validate this finding, in Fig. 4 we report training time results on a synthetic multi-instance dataset where we increase the number of bags as described in Sect. 3.1. In this timing experiment, we use the degree-2 polynomial function for both of exact/inexact kernel methods to compare them fairly. Our linear-exact, linear-inexact, and poly-inexact methods scale well with respect to the number of bags which shows the importance of our primal–dual derivation. However, the training time of our poly-exact model increases rapidly as the number of instances increases, demonstrating the usefulness of our inexact kernel method. This conclusion is especially clear when our method is compared against SVM-based MIL methods which rely on repeatedly solving quadratic programming problems.

Fig. 4
figure 4

Time to train our method and other MIL methods on synthetic multi-instance data where the number of bags increase. Both the linear-exact and linear-inexact methods end up training faster than the competing methods once the number of synthetic bags is greater than eight-hundred

Although the initial pdMISVM derivation scales well with respect to bags, it does not scale to the number of features when it is large. This is due to the fact that the update for each \({\textbf{w}}_k\) in Eq. (10) requires solving a least squares problem which scales quadratically as the number of features increase. To address this limitation, we proposed an optimal line search method to improve the scalability of our approach in Eq. (25). We conduct a timing experiment using synthetic data where the number of features is increased to see if our methods provide improved runtime performance. We can see in Fig. 5 that the proposed linear-inexact and poly-exact methods significantly reduce the training time of our approach as the number of features increase. We note that in our exact kernel method, we don’t need to directly access the kernel features as discussed in Sect. 2.5.2.

Fig. 5
figure 5

Elapsed time to train the linear/kernel and exact/inexact methods on synthetic multi-instance data as the number of features is varied. As expected, the linear-exact and poly-inexact methods perform poorly as the number of features increases, but the linear-inexact and poly-exact method continues to scale linearly and almost constantly

3.4 Model interpretability

In addition to the promising predictive performance and scalability of our method, we note that instance-based methods such as ours come with an additional benefit: interpretability. Instance-based methods such as miSVM, MISVM, and the proposed pdMISVM method, identify an explicit instance within a bag that is responsible for the predicted label. We use this phenomenon to explore the limitations of our method on the MNIST-bags dataset and showcase patches identified during the SIVAL experiment across a number of different classes.

For the MNIST-bags dataset we plot the learned positive and negative class coefficients associated with the two learned hyperplanes in Fig. 6 (e.g., \({\textbf{w}}_1\) and \({\textbf{w}}_2\)). In addition, we plot four randomly chosen testing bags and what instance was chosen by the multi-instance decision function for the positive class hyperplane in Fig. 7. On the left-hand side of Fig. 6, we can see that our method can roughly detect the loop at the top of the ‘9’ although it is clear from this interpretation that our approach will not be able to handle even moderate translation or rotation if it is only provided with raw-pixel values. Additionally, even though our method correctly classifies the first bag, it incorrectly identifies the ‘4’ as the witness instance; we can see that a ‘4’ appears to be contaminating the learned coefficients displayed in Fig. 6. In order to solve this problem it is likely that additional preprocessing will be required to extract more descriptive features from instances within the MNIST-bags dataset beyond raw pixels for our method to be effective.

In order to illustrate how effective feature extraction can aid in the interpretability of our method, we extend our discussion to the SIVAL-25 and SIVAL-25-deep datasets. In Fig. 8, we provide image patches identified by our approach on images chosen from the SIVAL-25 dataset. We can see that our method identifies distinctive visual characteristics in each of the classes. For example, the bag representing a “Banana” is identified by a distinctive patch along the length of the fruit while in the “Apple” image our approach identifies the round patch on top of the fruit. Similarly, in Fig. 9 we present the neural-network embedded patches extracted via the EdgeBox detection algorithm and the identified patches. We can clearly see in Fig. 9 that our approach is able to accurately localize the most distinctive parts of the object, at the patch level, within the image. For example, the medal is recognized by the “gold” part while the “bowl” of the spoon is recognized.

Remarkably, the results in Figs. 8 and 9 show that when our method is given a set of sufficiently descriptive object proposals/patches, paired with a bag-level label, our method can accurately locate objects within an image. This is one of the significant advantages of instance-based MIL methods over traditional single-instance learning methods that require all instances to be labeled. We anticipate that this framework could be extended to investigate and interpret the effectiveness of pre-trained neural networks on an assortment of datasets that can be formulated as MIL problems. We plan to further investigate different aspects of our approach under different object proposal methods [48, 49], neural architectures [50], and applications [13, 14, 51].

Fig. 6
figure 6

Learned class-specific hyperplanes of the pdMISVM method on the MNIST-bags dataset plotted in a \(28\times 28\) grid. Left: Learned coefficients for predicting whether a bag contains the MNIST handwritten digit ‘9’. Right: The learned coefficients for predicting whether a bag does not contain the MNIST digit ‘9’

Fig. 7
figure 7

Instance identification results on the first five testing bags of our method on the MNIST-bags dataset with the detectors in Fig. 6. Our approach correctly classifies the first, second and third bags. Although the first bag is classified correctly the “9”s are not properly identified

Fig. 8
figure 8

Instance identification on the SIVAL-25 dataset across different classes. In each set of three pictures the leftmost is the original image, the middle shows the bag of patches extracted by the original authors, and the final image highlights the patch identified by our approach for classifying the image

Fig. 9
figure 9

Instance identification on the SIVAL-25-deep dataset across different classes. In each set of three pictures the leftmost is the original image, the middle shows the bag of patches extracted by the EdgeBox detector, and the final image highlights the patch identified by our approach for classifying the image

3.5 Capability to learn nonlinear decision boundaries

As an important extension of our pdMISVM method to deal with nonlinearly separable data, we introduced the kernel versions of our exact and inexact methods. In this subsection, we evaluate their classification performances.

We implement the kernel version of the pdMISVM method and compare it against the linear version in Fig. 10. We can see this extension successfully extends our approach to correctly classify data that is not linearly separable. In this paper, we propose two versions of kernel pdMISVM which scales to the number of instances or features, which enable to efficiently learn the nonlinear decision boundaries from the large dataset.

Fig. 10
figure 10

The predictions of linear and RBF kernel (exact) pdMISVM on synthetic multi-instance data. Each bag in the training dataset \({\textbf{X}}\) has up to three instances, where only the first instance determines the correct classification. The kernel extension of our approach is able to correctly learn a nonlinear decision boundary to separate the two classes

3.6 A case study on neuroimaging data

While we have demonstrated the effectiveness of our new pdMISVM method from a number of perspectives in the previous subsections, in this subsection we apply our new methods on a medical imaging dataset to verify its capability to solve real-world problems.

3.6.1 Comparisons of the classification capabilities

Alzheimer’s disease (AD) is a serious neurodegenerative condition in which people suffer from the deterioration of cognitive functions. To verify the capability of our new methods to solve real-world problems, we apply them to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [52] dataset which provides the comprehensive neuroimagings such as voxel-Based Morphometry (VBM) and FreeSurfer (FS). We collect the magnetic resonance imaging (MRI) scans and their diagnosis in Alzheimer’s disease, mild cognitive decline, and healthy condition of 821 participants from ADNI databaseFootnote 5. We perform VBM and FS automated parcellation on the MRIs following [53] and extract mean modulated gray matter measures for 90 target regions of interest (ROI). Because the different number of MRI scans have been captured across the participants, it is difficult to directly apply the standard statistical methods to all of the neuroimagings provided. In this case study, we formulate each neuroimaging as an instance and each participant as a bag to predict their AD diagnosis.

In Table 3, we report the classification performances of ours and the other competing models. Although the deep learning models (miNet, MINet, AMIL, and LAMIL) have shown the promising performances, they require the significant amount of training time. On the other hands, four variations of our models show the comparable performance with a few seconds of training time.

Table 3 Classification and train-time (seconds) performance of our method and ten other MIL learning methods on ADNI dataset

3.6.2 Identification of brain regions

Likewise we analyze the hyperplane in Fig. 6 from MNIST digits dataset, we analyze the hyperplane to identify the brain regions exhibiting AD risk factors. Since the p-th feature of each instance is multiplied with the p-th weight of \({\textbf{w}}_m\) to contribute to the response for the m-th class, we calculate the contribution of p-th feature as the summation of p-th weight of hyperplanes \(\sum _{m=1}^K \Vert w_m^p \Vert \). Each feature of instance (neuroimage) represents each ROI in the brain, therefore we visualize the disease relevance of each brain region in Fig. 11.

Fig. 11
figure 11

Visualization of contributions of the brain regions to the AD diagnosis classification. The brain regions of the larger contribution are plotted with the darker colors. The top four AD relevant regions are identified in FS: left thalamus, left lateral ventrical, right caudate, and brodmann area 44, in VBM: left thalamus, left hippocampus, right medial occipital, and left amygdala

The brain regions identified by our method (linear, exact) have been appeared in the previous medical literatures. For example, the pathological changes in the Brodmann area 44 (Broca’s area) involves in the comprehension and production of verbs and communication [54]. Based on the previous study [55], the larger ventricular volume is the risk factor of dementia related disease in the future. The change in venticular volume is also associated with the cognitive decline and dementia [55, 56]. The atrophy of the caudate nucleus is also related to the cognitive decline [57]. The previous study [58] found that the anterior thalamus plays an important role in generating attention, and it is in charge of declarative memory functioning. The hippocampus is particularly susceptible to damage from AD and involves long-term memory and spatial navigation in AD patients [59]. The amygdala region, which is related to the emotional response, can also be easily damaged by AD [60]. The identified brain regions in this case study are well represented in the previous AD studies, and support the correctness of our methods by providing the further interpretability.

4 Conclusion

In this work, we propose a primal–dual multi-instance SVM method that is able to scale to large multi-instance datasets. Our SVM-based approach is able to handle data that grows in terms of bags as well as features since it avoids solving a quadratic programming problem that limits the adoption of traditional SVM-based MIL techniques. Throughout the manuscript, we provide detailed derivations, implementations, and experimental results which illustrate the utility of our approach on both synthetic and real-world datasets. In addition, we provide the kernel extensions of our approach which scale to the number of instances or features. Our experimental results on synthetic multi-instance data validate our kernel extension successfully learn the nonlinear decision boundaries. Finally, we investigate the interpretability of our method on benchmark multi-instance datasets and develop an extension to the ADNI dataset as part of this study. From the ADNI dataset, our method identifies the disease relevant brain regions which are in nice accordance with the existing medical studies.