A linear primal–dual multi-instance SVM for big data classifications

Brand, Lodewijk; Seo, Hoon; Baker, Lauren Zoe; Ellefsen, Carla; Sargent, Jackson; Wang, Hua

doi:10.1007/s10115-023-01961-z

A linear primal–dual multi-instance SVM for big data classifications

Regular Paper
Published: 26 August 2023

Volume 66, pages 307–338, (2024)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Knowledge and Information Systems Aims and scope Submit manuscript

A linear primal–dual multi-instance SVM for big data classifications

Download PDF

Lodewijk Brand¹,
Hoon Seo¹,
Lauren Zoe Baker¹,
Carla Ellefsen¹,
Jackson Sargent² &
…
Hua Wang ORCID: orcid.org/0000-0002-5986-7413¹

270 Accesses
3 Citations
Explore all metrics

Abstract

Multi-instance learning (MIL) handles data that is organized into sets of instances known as bags. Traditionally, MIL is used in the supervised-learning setting for classifying bags which contain any number of instances. However, many traditional MIL algorithms do not scale efficiently to large datasets. In this paper, we present a novel primal–dual multi-instance support vector machine that can operate efficiently on large-scale data. Our method relies on an algorithm derived using a multi-block variation of the alternating direction method of multipliers. The approach presented in this work is able to scale to large-scale data since it avoids iteratively solving quadratic programming problems which are broadly used to optimize MIL algorithms based on SVMs. In addition, we improve our derivation to include an additional optimization designed to avoid solving a least-squares problem in our algorithm, which increases the utility of our approach to handle a large number of features as well as bags. Finally, we derive a kernel extension of our approach to learn nonlinear decision boundaries for enhanced classification capabilities. We apply our approach to both synthetic and real-world multi-instance datasets to illustrate the scalability, promising predictive performance, and interpretability of our proposed method.

Multiple instance classification via quadratic programming

Article 11 January 2022

Multiple Instance Learning via Semi-supervised Laplacian TSVM

Article 21 January 2017

v-soft margin multi-task learning logistic regression

Article 06 September 2017

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Multi-instance learning (MIL) is a sub-area of machine learning in which training and testing data are organized in sets called bags. What makes MIL challenging is that labels associated with these data are frequently provided at the bag level, but not at instance level. This is also known as weakly supervised learning in the literature. Algorithms that adhere to this type of weakly supervised learning paradigm are naturally suited to a wide variety of real-world problems that contain limited labeled data. For example, images can be represented by a bag of patches, documents can be organized into sentences or paragraphs, and patients can be represented by a collection of medical records, to name a few. Because the labels are given at the bag level, a lot of resources spent on characterizing each instance are saved. For example, the clinicians only need to label/diagnose the bag or patient, but not each medical record. However, as illustrated in Fig. 1, since each bag can have an arbitrary number of instances, standard machine learning approaches that rely on fixed-length vector representations cannot be applied to such data directly. In this case, the multi-instance learning would be a better choice compared to single-instance learning, because in single-instance learning, predicting becomes difficult when the given instance does not contain any useful information for the prediction. Meanwhile, multi-instance learning enables correct classification from the key instances included in the bag. As a result, significant research efforts have been made to design algorithms that can handle this type of data in recent years.

1.1 Related works

In the past twenty years, a large number of MIL algorithms [1,2,3,4,5,6,7,8] have been proposed. These approaches have been applied to many different topics including drug activity prediction [9], content-based image and video retrieval [10, 11], medical image analysis [12], and document classification [13], among many other application areas [14]. Recently, deep learning-based MIL methods [15,16,17] have also been proposed to handle multi-instance data. While these methods have demonstrated their effectiveness in solving a variety of real-world problems, their limitations have also been discussed [18, 19]. For example, a recent survey paper [19] notes that current state-of-the-art MIL approaches are sensitive to the construction of instances within a bag. Specifically, they determine that the performance of MIL methods are sensitive to witness rate, e.g., the proportion of positive instances in positive bags, as well as whether the algorithm operates on the instance or bag level. This has also been observed in older MIL survey papers [18] and requires new algorithms to be tested on a range of different datasets and applications. In addition to dataset-specific performance, the authors of these survey papers highlight that performance improvements in the training time of MIL algorithms, especially those who rely on instance-level information, are necessary for further adoption.

Recently, the scalability for analyzing large amount of data has become another major issue of MIL studies with the development of data mining technologies. The large amount of bags and instances are involved in MIL problems, and MIL models require many parameters to analyze the complex patterns between instances. However, many existing MIL models are often tested on small or moderately sized dataset. To alleviate this scalability issue, We et.al., [20] proposes to map the raw representation of a bag into simpler representation of a vector format which can be classified by the followed SVM model. Although effective, this method does not improve the scalability of the followed SVM model. Vatsavai [21] employs the divide-and-conquer strategy to scatter the large images into patches of predefined size which can be parallelized and processed by Citation-KNN MIL algorithms [22]. However, Citation-KNN MIL algorithm adapted in [21] scales quadratically with respect to the average number of instances per bag.

Besides the quantitative evaluation of MIL models, the qualitative evaluation is also gaining interest. For example, when multiple patches of a histopathological image are formulated as instances, the key instances identified as patches exhibiting the evidence of the disease would be useful for the doctors who need a reference for the corresponding medical decision [23]. At the same time, the aforementioned identified key instances can provide credibility to the predictions of MIL models. In an effort to interpret the outputs of MIL models, mi-SVM [23] provides pixel-wise abnormality scores of an X-ray image and these scores are plotted over the image constructing the heatmap to identify the regions exhibiting abnormalities. Another multi-instance learning framework [24] divides each instance into patches and calculates their similarities with respect to the positive and negative prototypical parts. As a result, the method in [24] explains which patches in an instance are responsible for the prediction. However the interpretation often requires the ground truth explanation or additional processes to generate the explanations, which sacrifices the scalability.

In this work, we focus specifically on scaling SVM-based MIL algorithms, as they have shown consistent performance and can be further extended to nonlinear decision boundaries via kernelization. Popular SVM-based MIL approaches such as miSVM/MISVM [1], NSK [25], and sMIL/sbMIL [2] have been proposed to handle multi-instance data and have demonstrated promising performance, even when compared against modern MIL deep-learning architectures such as miNet/MINet [26]. While these approaches have performed well and can be extended to solve a variety of real-world problems, they are not widely used in practice as they do not scale well to large datasets. Furthermore, many of these approaches are not equipped with capabilities to interpret the results of their predictions. These two shortcomings, speed of model training and model interpretability of multi-instance learning methods, are the focus of this work.

1.2 Our contributions

For the remainder of this manuscript, we present a novel method that extends a multi-instance SVM to large-scale data. Our approach uses the multi-block alternating direction method of multipliers (ADMM) to avoid iteratively solving the quadratic programming problems that arise from standard SVM-based MIL approaches. The scientific contributions of this work are as follows:

A novel MIL algorithm derivation, named the primal–dual multi-instance SVM (pdMISVM) method, and an associated implementation that scales linearly as the number of bags increases.
An inexact variation of our approach, based on the optimal line search method, that scales linearly as the number of features increases.
Experimental results showcasing the promising predictive performance, scalability, and interpretability of our approach on baseline multi-instance data and real-world image data compared against other MIL algorithms.
An extension of our approach that allows for the inclusion of an arbitrary kernel function and a proof-of-concept experiment on synthetic data verifying our derivation.

This paper is an extension of our recent work [27] originally reported in the Proceedings of the Twenty-First International Conference on Data Mining (ICDM 2021). In this extended journal manuscript, we provide the following expansions over its conference version:

We expand our discussions on the previous works related with the scalability and interpretability of MIL problems (Sect. 1.1).
We present the complete derivations of the kernel variation of our pdMISVM method that scales linearly against the number of bags.
- While the scalability issue of the kernel version of our new method against the number of bags has been discussed in the conference version, it was not solved. In this journal extension, we systematically derive the objective of the kernel version of our pdMISVM method and its solution algorithms. (Sect. 2.5).
- The experimental results of our two kernel variations are added to compare their performance and scalability (Sects. 3.2 and 3.3).
We provide the detailed analyses on the computational complexities of our pdMISVM method and its kernel extension (Sects. 2.4 and 2.5). The codes to implement our method and the data used in our methods have been made publicly available online at: https://github.com/minds-mines/pdMISVM.jl.
We expand our experimental evaluations with two additional benchmark datasets and two recently proposed attention-based deep learning models (Sect. 3.2).
We include a case study on neuroimaging data to further evaluate the performance and interpretability of our method to solve real-world problems. We apply our method to identify the disease relevant brain regions from the neuroimaging perspective (Sect. 3.6).

2 Methods

In this section, we begin with a sketch for the steps required for the multi-instance SVM (MISVM) derivation initially presented by Andrews et.al., [1]. Then, following the multi-block ADMM framework [28,29,30], we construct the augmented Lagrangian that will be used to derive the solution to the proposed pdMISVM method, which is then followed by a step-by-step derivation to optimize the proposed objective. In addition, we extend our approach to handle a large number of features through an application of the optimal line search method [31]. Finally, we derive the solution algorithm of kernel pdMISVM.

2.1 Notation

In this manuscript we represent matrices as ${\textbf{M}}$, vectors as ${\textbf{m}}$, and scalars as m. The $i$-th row and $j$-th column of ${\textbf{M}}$ are denoted as $\textbf{m}^i$ and $\textbf{m}_j$, respectively. Similarly, $m^i_j$ is the scalar value indexed by the $i$-th row and $j$-th column of ${\textbf{M}}$. The matrix ${\textbf{M}}_p$ corresponds to the $p$-th column-block of ${\textbf{M}}$. Given a $K \times N$ matrix ${\textbf{M}}$, $\{m, i\} = \mathop {\mathrm{arg\,max}}_{m',i'}({\textbf{M}})$ gives the row-by-column coordinates for the maximum element in ${\textbf{M}}$. The row and column indices are given by $\mathop {\mathrm{arg\,max}}_{m',i'}({\textbf{M}})^m$ and $\mathop {\mathrm{arg\,max}}_{m',i'}({\textbf{M}})_i$, respectively.

2.2 Extending the MISVM to K-class classification

In the binary multi-instance classification problem, the MIL algorithm is presented with a collection of bags and labels represented by the set $\{{\textbf{X}}_i, y_i\}_{i = 1}^N$, where $y_i \in \{-1, 1\}$ and ${\textbf{X}}_i \in {\mathbb {R}}^{d \times n_i}$ designates a bag containing $n_i$ instances, and $\{{\textbf{x}}_1, \dots , {\textbf{x}}_{n_i}\} \in {\textbf{X}}_i$ represent each instance within the $i$-th bag. Following the instance-centric approach advocated by Andrews et.al., [1] for the MISVM model, where a single “witness” instance determines the class of a bag, we define the decision function for a multi-instance binary classifier as

$$\begin{aligned} y_i = {\text {sign}}\left( \max _{{\textbf{x}}_i \in {\textbf{X}}_i}({\textbf{w}}^T{\textbf{x}}_i + b)\right) , \end{aligned}$$

(1)

where ${\textbf{w}}$ and b are the hyperplane and intercept for the MISVM model. The MISVM objective devised by Andrews et.al., is [1]

$$\begin{aligned} \begin{aligned} \min _{{\textbf{w}}, b, \varvec{\xi }}&\quad \frac{1}{2}\left\| {\textbf{w}}\right\| _{2}^2 + C \sum _{i=1}^N \xi _i\\ \text {subject to}&\quad \max _{{\textbf{x}}_i \in {\textbf{X}}_i} \left( {\textbf{w}}^T{\textbf{x}}_i + b\right) y_i \ge 1 - \xi _i,\\&\quad \xi _i \ge 0,\ i=1, \dots , N, \end{aligned} \end{aligned}$$

(2)

where C is a hyperparameter that determines the level of regularization on the learned hyperplane, and $\xi _i$ are slack variables. The constraints in Eq. (2) can be incorporated into the objective via a Lagrangian function

$$\begin{aligned} \begin{aligned} \min _{{\textbf{w}}, b}\ \max _{\varvec{\alpha }}\quad&\mathcal {L}({\textbf{w}}, b, \varvec{\alpha })\qquad \text {subject to}\quad&\alpha _i \ge 0, \end{aligned} \end{aligned}$$

(3)

which can be solved with respect to the dual variables ($\alpha _i$) using off-the-shelf quadratic programming solvers or heuristic algorithms like sequential minimal optimization [32] that takes advantage of a limited number of support vectors. Although the MISVM formulation proposed by Andrews et.al., [1] is widely used in MIL literature, it is generally limited to binary classification problems.

In order to design a method suitable for multi-class multi-instance classification, we extend the decision function presented in Eq. (1) to K-classes via

$$\begin{aligned} \hat{y}_i = \mathop {\mathrm{arg\,max}}_{m', i'}\left( {\textbf{W}}^T{\textbf{X}}_i+\textbf{b}^T{\textbf{1}}_i\right) ^m, \end{aligned}$$

(4)

where ${\textbf{W}}\in {\mathbb {R}}^{d \times K}$, $\textbf{b}\in {\mathbb {R}}^{K}$, $\hat{y}_i \in \{1, \dots , K\}$ represents the hyperplanes, intercepts, and labels for K classes. Motivated by the results of [33] where it is argued that all-in-one formulations for K-class SVMs provide superior predictive performance, when compared to one-vs-all approaches, we construct the following Weston & Watkins [34] MISVM extension to Eq. (2)

$$\begin{aligned} \begin{aligned} \min _{{\textbf{W}}, \textbf{b}, \varvec{\xi }}\ {}&\frac{1}{2}\sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C \sum _{i=1}^N\sum _{m=1}^K \xi _i^m\\ \text {subject to}\quad&\Big (1 - \big [\max ({\textbf{w}}_m^T{\textbf{X}}_{i}+{\textbf{1}}b_m)- \max ({\textbf{w}}_y^T{\textbf{X}}_{i} + {\textbf{1}}b_y)\big ]y_{i}^{m}\Big )_+ \le \xi _i^m,\\&0 \le \xi _i^m,\ i = 1, \dots , 2,\ m = i, \dots , K, \end{aligned} \end{aligned}$$

(5)

where ${\textbf{w}}_y$, $b_y$ is the hyperplane-intercept pair associated with the $i$-th bag’s class label, $(\cdot )_+ = \max (0, \cdot )$ is the hinge loss function, and $y_i^m \in \{-1, 1\}$ indicates if the $i$-th bag belongs to the $m$-th class. Similar to Eq. (2), the K-class formulation above can be transformed into a quadratic programming problem and solved, although this approach is known [33] not to scale well as the number of bags increases. To address this issue, we propose a novel primal–dual algorithm based on the multi-block ADMM [30] to optimize Eq. (5).

2.3 A primal–dual multi-instance SVM

Incorporating the constraints of Eq. (5) into the objective gives the unconstrained optimization

$$\begin{aligned} \begin{aligned} \min _{{\textbf{W}}, \textbf{b}}\ {}&\frac{1}{2}\sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C\sum _{i=1}^{N}\sum _{m=1}^K(1 - [\max ({\textbf{w}}_m^T{\textbf{X}}_{i} + {\textbf{1}}b_m) \\ {}&- \max ({\textbf{w}}_y^T{\textbf{X}}_{i} + {\textbf{1}}b_y)]y_{i}^{m})_+, \end{aligned} \end{aligned}$$

(6)

which is difficult to solve given the coupling across ${\textbf{w}}_m$, $b_m$, and the $\max (\cdot )$ operations. Using the multi-block ADMM approach, we introduce the following constraints, inspired by [31, 35], and rewrite Eq. (6) as

$$\begin{aligned} \begin{aligned} \min _{\begin{array}{c} {\textbf{W}}, \textbf{b}, {\textbf{E}}, \\ {\textbf{Q}}, {\textbf{R}}, {\textbf{T}}, {\textbf{U}} \end{array}} \quad&\frac{1}{2} \sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C \sum _{i=1}^N \sum _{m=1}^K \left( y_{i}^{m}e_{i}^{m}\right) _+\\ \text {subject to}\quad&e_{i}^{m} = y_{i}^{m} - q_{i}^{m} + r_{i}^{m},\ q_{i}^{m} = \max \left( {\textbf{t}}_{i}^{m}\right) ,\ {\textbf{t}}_{i}^{m} = {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m,\\&r_{i}^{m} = \max \left( {\textbf{u}}_{i}^{m}\right) ,\ {\textbf{u}}_{i}^{m} = {\textbf{w}}_y^T{\textbf{X}}_i + {\textbf{1}}b_{y}, \end{aligned} \end{aligned}$$

(7)

to decouple the primal variables. Then, the augmented Lagrangian function of Eq. (7) is

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mu&= \frac{1}{2} \sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C \sum _{i=1}^N \sum _{m=1}^K \left( y_{i}^{m}e_{i}^{m}\right) _+ \\ {}&\quad + \frac{\mu }{2}\sum _{i=1}^N \sum _{m=1}^K\bigg [\left( e_{i}^{m} - (y_{i}^{m} - q_{i}^{m} + r_{i}^{m} - \lambda _{i}^{m}/\mu )\right) ^2\\&\quad + \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{t}}_{i}^{m} - \left( {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m\right) + \varvec{\theta }_{i}^{m}/\mu \right\| _{2}^2\\&\quad + \left( r_{i}^{m} - \max \left( {\textbf{u}}_{i}^{m}\right) + \omega _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{u}}_{i}^{m} - \left( {\textbf{w}}_y^T{\textbf{X}}_i + {\textbf{1}}b_y\right) + \varvec{\xi }_{i}^{m}/\mu \right\| _{2}^2\bigg ], \end{aligned} \end{aligned}$$

(8)

where ${\textbf{W}}, \textbf{b}, {\textbf{E}}, {\textbf{Q}}, {\textbf{T}}, {\textbf{R}}, {\textbf{U}}$ are the primal variables, $\varvec{\Lambda }, \varvec{\Sigma }, \varvec{\Theta }, \varvec{\Omega }, \varvec{\Xi }$ are the dual variables, and $\mu > 0$ is a tuning parameter. Equation (8) is then differentiated with respect to each primal variable to derive Algorithm 2.1. The primal–dual updates terminate when the total difference between the constraints incorporated via the augmented Lagrangian terms are less than a predefined tolerance. In the following, we provide the details to derive each step of Algorithm 2.1.

Update ${\textbf{W}}$ & $\textbf{b}$, exact. Removing all terms from Eq. (8) that do not include ${\textbf{W}}$ and decoupling across columns of ${\textbf{W}}$ gives the following K subproblems

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m&= \mathop {\mathrm{arg\,min}}_{{\textbf{w}}_m}\ \frac{1}{2} \left\| {\textbf{w}}_m\right\| _{2}^2 + \frac{\mu }{2}\sum _{i=1}^N\Big [\big \Vert {\textbf{t}}_{i}^{m} - \big ({\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m\big ) + \varvec{\theta }_{i}^{m}/\mu \big \Vert ^2_2\Big ]\\&\quad + \sum _{i'=1}^{N'}\sum _{m'=1}^K\Big [\frac{\mu }{2}\big \Vert {\textbf{u}}_{i'}^{m'} - \big ({\textbf{w}}_m^T{\textbf{X}}_{i'} + {\textbf{1}}b_{m}\big ) + \varvec{\xi }_{i'}^{m'}/\mu \big \Vert ^2_2\Big ], \end{aligned} \end{aligned}$$

(9)

where $i'$ indicates the column blocks in ${\textbf{X}}$ (and the corresponding columns of ${\textbf{U}}$ and $\varvec{\Xi }$) that belong to the $m$-th class and $N'$ is the total number of bags belonging to the $m$-th class. Taking the derivative of Eq. (9) with respect to ${\textbf{w}}_m$ and setting the result equal to zero gives the closed form solution

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m^T&= \Big (\textstyle \sum _{i=1}^N\left[ \left( {\textbf{t}}_{i}^{m} - {\textbf{1}}b_m + \varvec{\theta }_{i}^{m}/\mu \right) {\textbf{X}}_i^T\right] + \textstyle \sum _{i'=1}^{N'} \textstyle \sum _{m'=1}^K\big [({\textbf{u}}_{i'}^{m'} - {\textbf{1}}b_m \\ {}&\quad + \varvec{\xi }_{i'}^{m'}/\mu ){\textbf{X}}_{i'}^T\big ]\Big )*\left( {\textbf{I}}/\mu + \textstyle \sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + K \textstyle \sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T\right) ^{-1}, \end{aligned} \end{aligned}$$

(10)

which can be calculated via a least-squares solver to avoid an inverse calculation.

Similarly, differentiating Eq. (9) element-wise with respect to $b_m$ and setting the result equal to zero gives the update

$$\begin{aligned} \begin{aligned} b_m = \frac{\sum _{i=1}^N\left[ {\textbf{t}}_{i}^{m} - {\textbf{w}}_m^T{\textbf{X}}_i + \varvec{\theta }_{i}^{m}/\mu \right] +\sum _{i'=1}^{N'}\sum _{m'=1}^K\big [{\textbf{u}}_{i'}^{m'} - {\textbf{w}}_m^T{\textbf{X}}_{i'}+ \varvec{\xi }_{i'}^{m'}/\mu \big ]}{N + KN'}. \end{aligned} \end{aligned}$$

(11)

Update ${\textbf{E}}$. Dropping terms that do not contain ${\textbf{E}}$ from Eq. (8), by performing element-wise decoupling of the problem, we end up with the following $K \times N$ subproblems

$$\begin{aligned} \begin{aligned} e_{i}^{m} = \mathop {\mathrm{arg\,min}}_{e_{i}^{m}}\ {}&C \left( y_{i}^{m}e_{i}^{m}\right) _+ + \frac{\mu }{2}\left( e_{i}^{m} - n_{i}^{m}\right) ^2, \end{aligned} \end{aligned}$$

(12)

where $n_{i}^{m}= y_{i}^{m} - q_{i}^{m} + r_{i}^{m} - \frac{\lambda _{i}^{m}}{\mu }$. Equation (12) can be differentiated with respect to $e_i^m$ via the sub-gradient method, and solved by the following three cases

$$\begin{aligned} e_{i}^{m} = {\left\{ \begin{array}{ll} n_{i}^{m} - \frac{C}{\mu }y_{i}^{m} \quad &{}\text {when} \quad y_{i}^{m}n_{i}^{m}>\frac{C}{\mu },\\ 0 &{}\text {when} \quad 0 \le y_{i}^{m}n_{i}^{m} \le \frac{C}{\mu },\\ n_{i}^{m} &{}\text {when} \quad y_{i}^{m}n_{i}^{m} < 0.\\ \end{array}\right. } \end{aligned}$$

(13)

Update ${\textbf{Q}}$ & ${\textbf{R}}$. Keeping only terms with ${\textbf{Q}}$ in Eq. (8) and performing element-wise decoupling, we end up with the following $K \times N$ subproblems

$$\begin{aligned} \begin{aligned} q_{i}^{m}&= \mathop {\mathrm{arg\,min}}_{q_{i}^{m}}\ \left( e_{i}^{m} - y_{i}^{m} + q_{i}^{m} - r_{i}^{m} + \lambda _{i}^{m}/\mu \right) ^2\\&\quad + \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2. \end{aligned} \end{aligned}$$

(14)

Taking the derivative of Eq. (14) with respect to $q_i^m$ and setting the result equal to zero, we can the problem for $q_i^m$ by the following update

$$\begin{aligned} q_{i}^{m} = \frac{\big (y_{i}^{m} - e_{i}^{m} + r_{i}^{m} - \lambda _{i}^{m}/\mu + \max \big ({\textbf{t}}_{i}^{m}\big ) - \sigma _{i}^{m}/\mu \big )}{2}. \end{aligned}$$

(15)

Following the same steps for each $r_i^m \in {\textbf{R}}$, we derive the element-wise updates

$$\begin{aligned} r_{i}^{m} = \frac{\big (e_{i}^{m} - y_{i}^{m} + q_{i}^{m} + \lambda _{i}^{m}/\mu + \max \big ({\textbf{u}}_{i}^{m}\big ) - \omega _{i}^{m}/\mu \big )}{2}. \end{aligned}$$

(16)

Update ${\textbf{T}}$ & ${\textbf{U}}$. Keeping terms in Eq. (8) containing ${\textbf{T}}$ and decoupling across K and N, we end up with the following subproblem

$$\begin{aligned} \begin{aligned} {\textbf{t}}_{i}^{m}&= \mathop {\mathrm{arg\,min}}_{{\textbf{t}}_{i}^{m}}\ \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{t}}_{i}^{m} - \left( {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m\right) + \theta _{i}^{m}/\mu \right\| _{2}^2, \end{aligned} \end{aligned}$$

(17)

which can be further decoupled into element-wise subproblems for each $t_{i,j}^m \in {\textbf{t}}_i^m$, giving $K \times (n_1 + \dots + n_N)$ problems

$$\begin{aligned} t_{i,j}^{m} = \mathop {\mathrm{arg\,min}}_{t_{i,j}^{m}} {\left\{ \begin{array}{ll}\left( q_{i}^{m} - t_{i,j}^{m} + \sigma _{i}^{m}/\mu \right) ^2 + \left( t_{i,j}^m - \phi _{i,j}^m\right) ^2, \\ \ \text {when } t_{i,j}^m=\max \left( {\textbf{t}}_{i}^{m}\right) ,\\ \left( t_{i,j}^m - \phi _{i,j}^m\right) ^2\ \text {else}, \end{array}\right. } \end{aligned}$$

(18)

$\text {where } \varvec{\phi }_{i}^{m} = {\textbf{w}}_m^T{\textbf{X}}_i + {\textbf{1}}b_m - \varvec{\theta }_{i}^{m}/\mu $.

Taking the derivative of Eq. (18) with respect to $t_{i,j}^m$ and setting the result equal to zero, we solve the problem for $t_{i,j}^m$ by the following updates

$$\begin{aligned} t_{i,j}^{m} = {\left\{ \begin{array}{ll} \frac{\max (\varvec{\phi }_{i}^{m}) + q_{i}^{m} + \sigma _{i}^{m}/\mu }{2} &{}\text {if } j = \mathop {\mathrm{arg\,max}}(\varvec{\phi }_{i}^{m}),\\ \phi _{i,j}^{m} \quad &{}\text {else}. \end{array}\right. } \end{aligned}$$

(19)

This same strategy is applied to derive the element-wise updates of ${\textbf{U}}$, which gives

$$\begin{aligned} {u_{i,j}^{m} = {\left\{ \begin{array}{ll} \frac{\max (\varvec{\psi }_{i}^{m}) + r_{i}^{m} + \omega _{i}^{m}/\mu }{2} &{}\text {if } j = \mathop {\mathrm{arg\,max}}(\varvec{\psi }_{i}^{m}),\\ \psi _{i,j}^{m} \quad &{}\text {else}. \end{array}\right. }} \end{aligned}$$

(20)

$\text {where } \varvec{\psi }_{i}^{m} = {\textbf{w}}_y^T{\textbf{X}}_i + {\textbf{1}}b_y - \varvec{\xi }_{i}^{m}/{\mu }$.

The associated dual variable updates are provided in Algorithm 2.1.

2.4 Scaling to a large number of features

Although the updates derived in Sect. 2.3 provide a suitable algorithm as the number of bags increase, it does not scale well against the increasing number of features. To be specific, the calculation in Eq. (10) for the left parenthesis $\big (\textstyle \sum _{i=1}^N\left[ \left( {\textbf{t}}_{i}^{m} - {\textbf{1}}b_m + \varvec{\theta }_{i}^{m}/\mu \right) {\textbf{X}}_i^T\right] + \textstyle \sum _{i'=1}^{N'} \textstyle \sum _{m'=1}^K\big [({\textbf{u}}_{i'}^{m'} - {\textbf{1}}b_m + \varvec{\xi }_{i'}^{m'}/\mu ){\textbf{X}}_{i'}^T\big ]\big )$ has the computational complexity $O\big (\big (n_1+\dots +n_N\big ) \cdot d\big )$ and the right parenthesis $({\textbf{I}}/\mu + \textstyle \sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + K \textstyle \sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T)^{-1}$ can be efficiently calculated via the least-squares solver which has complexity $O\left( d^2 \right) $. As a result, the updating ${\textbf{w}}_m$ requires the time complexity $O\left( \left( n_1+\dots +n_N + d\right) \cdot d\right) $ which scales quadratically as the number of features increase; this limits the scalability of our approach to bags only. Additionally, since $\mu $ is updated every iteration, the least-squares solver must be invoked at every iteration although $(\sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + K \textstyle \sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T)$ can be precomputed at the beginning of algorithm. The inversion calculation for $d \times d$ matrix is the bottleneck of SVM algorithms. To handle this issue, we propose an alternative optimal line search method [31] to update ${\textbf{W}}$ for avoiding the inverse matrix calculation.

Update ${\textbf{W}}$, inexact. The partial derivative of Eq. (9) with respect to ${\textbf{w}}_k$ gives

$$\begin{aligned} \begin{aligned} \nabla ^T_{{\textbf{w}}_m}&= {\textbf{w}}_m^T - \mu \sum _{i=1}^N[{\textbf{t}}_i^m - {\textbf{w}}_m^T{\textbf{X}}_i - {\textbf{1}}b_m + \varvec{\theta }_i^m/\mu ]{\textbf{X}}_i^T\\&\quad -\mu \sum _{i'=1}^{N'}\sum _{m'=1}^K[{\textbf{u}}_{i'}^{m'}-{\textbf{w}}_m^T{\textbf{X}}_{i'} - {\textbf{1}}b_m + \varvec{\xi }_{i'}^{m'}/\mu ]{\textbf{X}}_{i'}^T, \end{aligned} \end{aligned}$$

(21)

which can be used to create the following minimization problem

$$\begin{aligned} \begin{aligned} s_m&= \mathop {\mathrm{arg\,min}}_{s_m}\ \frac{1}{2} \left\| {\textbf{w}}_m^T - s_m\nabla ^T_{{\textbf{w}}_m}\right\| _{2}^2 + \frac{\mu }{2}\sum _{i=1}^N\Big [\big \Vert {\textbf{t}}_{i}^{m} - ({\textbf{w}}_m^T - s_m\nabla ^T_{{\textbf{w}}_m}){\textbf{X}}_i - {\textbf{1}}b_m \\ {}&\quad + \theta _{i}^{m}/\mu \big \Vert ^2_2\Big ] + \sum _{i'=1}^{N'}\sum _{m'=1}^K \Big [\frac{\mu }{2}\big \Vert {\textbf{u}}_{i'}^{m'} - ({\textbf{w}}_m^T - s_m\nabla ^T_{{\textbf{w}}_m}){\textbf{X}}_{i'} - {\textbf{1}}b_{m} + \xi _{i'}^{m'}/\mu \big \Vert ^2_2\Big ], \end{aligned} \end{aligned}$$

(22)

in terms of $s_m$ instead of ${\textbf{w}}_m$.

Differentiating Eq. (22) with respect to $s_m$ and setting the result equal to zero, we solve the problem for $s_m$ as follows

$$\begin{aligned} s_m = \frac{\left( {\textbf{w}}_m^T - \mu \sum _{i=1}^N\hat{{\textbf{t}}}^m_i{\textbf{X}}_i^T - \mu \sum _{i'=1}^{N'}\sum _{m'=1}^K\hat{{\textbf{u}}}_{i'}^{m'} {\textbf{X}}_{i'}^T\right) \nabla _{{\textbf{w}}_m}}{\nabla ^T_{{\textbf{w}}_m}\left( {\textbf{I}}+ \mu \sum _{i=1}^N{\textbf{X}}_i{\textbf{X}}_i^T + \mu K\sum _{i'=1}^{N'}{\textbf{X}}_{i'}{\textbf{X}}_{i'}^T\right) \nabla _{{\textbf{w}}_m}}, \end{aligned}$$

(23)

where $\hat{{\textbf{t}}}_i^m = {\textbf{t}}^m_i - {\textbf{w}}_m^T{\textbf{X}}_i - {\textbf{1}}b_m + \varvec{\theta }^m_i/\mu $ and $\hat{{\textbf{u}}}_{i'}^{m'}={\textbf{u}}_{i'}^{m'}-{\textbf{w}}_m^T{\textbf{X}}_{i'}-{\textbf{1}}b_m + \xi _{i'}^{m'}/\mu $.

Because the denominator of Eq. (23) is equivalent to

$$\begin{aligned} \left\| \nabla _{{\textbf{w}}_m}\right\| _{2}^2 + \mu \sum _{i=1}^N\left\| \nabla ^T_{{\textbf{w}}_m}{\textbf{X}}_i\right\| _{2}^2 + \mu K\sum _{i'=1}^{N'}\left\| \nabla ^T_{{\textbf{w}}_m}{\textbf{X}}_{i'}\right\| _{2}^2, \end{aligned}$$

(24)

Equation (23) can be calculated efficiently in $O\left( \left( n_1 + \dots + n_N \right) \cdot d\right) $ time.

By combining Eq. (21) and Eq. (23), we can update ${\textbf{w}}_m$ via

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m = {\textbf{w}}_m - s_m\nabla _{{\textbf{w}}_m}. \end{aligned} \end{aligned}$$

(25)

This “inexact” update option avoids solving the least squares problem present in Eq. (10) and is provided as an option on Line 6 of Algorithm 2.1 to extend our method to handle a large number of features.

2.5 A primal–dual multi-instance SVM with Kernel

While the exact and inexact formulations described in Algorithm 2.1 are computationally efficient and show promising performance on a variety of multi-instance datasets, they are limited to classification problems where instances within bags are linearly separable. In order for enabling our method to learn nonlinear decision boundaries, we derive an kernel extension of pdMISVM method in this subsection.

We begin by replacing all bags ${\textbf{X}}_i$ in Eq. (6) by their corresponding feature matrices $\phi ({\textbf{X}}_i) = \varvec{\Phi }_i \in {\mathbb {R}}^{d_\phi \times n_i}$, which gives

$$\begin{aligned} \begin{aligned} \min _{{\textbf{W}}, \textbf{b}}\ {}&\frac{1}{2}\sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + C\sum _{i=1}^{N}\sum _{m=1}^K(1 - [\max ({\textbf{w}}_m^T\varvec{\Phi }_{i}+ {\textbf{1}}b_m) \\ {}&- \max ({\textbf{w}}_y^T\varvec{\Phi }_{i} + {\textbf{1}}b_y)]y_{i}^{m}) \end{aligned} \end{aligned}$$

(26)

where $\phi $ is an arbitrary kernel function. Then, by introducing constraints to decouple ${\textbf{w}}_m$ and $b_m$ in Eq. (7) and incorporating them into the objective, we derive the following Lagrangian formulation

$$\begin{aligned} \begin{aligned} \mathcal {L}_\mu&= \frac{1}{2} \sum _{m=1}^K\left\| {\textbf{w}}_m\right\| _{2}^2 + \sum _{i=1}^N \sum _{m=1}^K C \left( y_{i}^{m}e_{i}^{m}\right) _+ \\ {}&\quad + \frac{\mu }{2}\sum _{i=1}^N \sum _{m=1}^K\bigg [\left( e_{i}^{m} - (y_{i}^{m} - q_{i}^{m} + r_{i}^{m} - \lambda _{i}^{m}/\mu )\right) ^2\\&\quad + \left( q_{i}^{m} - \max \left( {\textbf{t}}_{i}^{m}\right) + \sigma _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{t}}_{i}^{m} - \left( {\textbf{w}}_m^T\varvec{\Phi }_i + {\textbf{1}}b_m\right) + \varvec{\theta }_{i}^{m}/\mu \right\| _{2}^2\\&\quad + \left( r_{i}^{m} - \max \left( {\textbf{u}}_{i}^{m}\right) + \omega _{i}^{m}/\mu \right) ^2 + \left\| {\textbf{u}}_{i}^{m} - \left( {\textbf{w}}_y^T\varvec{\Phi }_i) + {\textbf{1}}b_y\right) + \varvec{\xi }_{i}^{m}/\mu \right\| _{2}^2\bigg ]. \end{aligned} \end{aligned}$$

(27)

We pause here to recognize that, while the number of columns in each $\varvec{\Phi }_i$ is equal to the number of instances inside the original bag ${\textbf{X}}_i$, the number of rows, i.e.,, $d_\phi $, can be arbitrarily (even infinitely) large. Motivated by [36], we work to derive each update of our algorithm with respect to each primal variable in Eq. (27) without explicitly calculating ${\textbf{w}}_m$.

2.5.1 The Kernel extension of our method with exact solutions

We update ${\textbf{W}}$ by discarding all terms in Eq. (27) that do not contain ${\textbf{W}}$ and decoupling across columns, which gives the following K problems

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m&= \mathop {\mathrm{arg\,min}}_{{\textbf{w}}_m}\ \frac{1}{2} \left\| {\textbf{w}}_m\right\| _{2}^2 + \frac{\mu }{2}\sum _{i=1}^N\Big [\big \Vert {\textbf{t}}_{i}^{m} - {\textbf{w}}_m^T\varvec{\Phi }_i + {\textbf{1}}b_m + \varvec{\theta }_{i}^{m}/\mu \big \Vert ^2_2\Big ]\\&\quad +\sum _{i'=1}^{N'}\sum _{m'=1}^K\Big [\frac{\mu }{2}\big \Vert {\textbf{u}}_{i'}^{m'} - {\textbf{w}}_m^T\varvec{\Phi }_{i'} + {\textbf{1}}b_{m} + \varvec{\xi }_{i'}^{m'}/\mu \big \Vert ^2_2\Big ], \end{aligned} \end{aligned}$$

(28)

where $i'$ indicates the corresponding column blocks of $\varvec{\Phi }= [\varvec{\Phi }_1\ \dots \varvec{\Phi }_N] \in {\mathbb {R}}^{d_\phi \times \left( n_1 + \dots + n_{N}\right) }$ that belong to the $m$-th class. Equation (28) can be differentiated with respect to ${\textbf{w}}_m$, set equal to zero, and solved for each ${\textbf{w}}_m$

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m^T&= \Big (\textstyle \left[ \left( {\textbf{t}}^{m} - {\textbf{1}}b_m + \varvec{\theta }^{m}/\mu \right) \varvec{\Phi }^T\right] + \textstyle \textstyle \sum _{m'=1}^K\big [({\textbf{u}}_{'}^{m'} - {\textbf{1}}b_m + \varvec{\xi }_{'}^{m'}/\mu )\varvec{\Phi }_{'}^T\big ]\Big )\\&\quad *\Big ({\textbf{I}}/\mu + \varvec{\Phi }\varvec{\Phi }^T + K \varvec{\Phi }_{'}\varvec{\Phi }_{'}^T\Big )^{-1}, \end{aligned} \end{aligned}$$

(29)

where $\varvec{\Phi }_{'} = [\varvec{\Phi }_{1'}\ \dots \varvec{\Phi }_{N'}]$. Equation (29) can be written in the matrix form

$$\begin{aligned} {\textbf{w}}_m^T = {\textbf{v}}^m{\textbf{D}}\hat{\varvec{\Phi }}^T*\left( {\textbf{I}}/\mu + \hat{\varvec{\Phi }}{\textbf{D}}\hat{\varvec{\Phi }}^T\right) ^{-1}, \end{aligned}$$

(30)

where ${\textbf{v}}^m=[{\textbf{t}}^{m} - {\textbf{1}}b_m + \varvec{\theta }^{m}/\mu \quad 1/K\sum _{m'=1}^K({\textbf{u}}_{'}^{m'} - {\textbf{1}}b_m + \varvec{\xi }_{'}^{m'}/\mu )]$, ${\textbf{D}}=[{\textbf{I}}\ {\textbf{0}};\ {\textbf{0}}\ K{\textbf{I}}]$ and $\hat{\varvec{\Phi }}=[\varvec{\Phi }\ \varvec{\Phi }_{'}]$. Since the kernel function applied to each ${\textbf{X}}_i$ may return feature vectors that are infinitely long, it may be impossible to calculate the inverse required to express ${\textbf{w}}_m$ in Eq. (30). In order to solve this issue we use the following method introduced in [36]

$$\begin{aligned} ({\textbf{P}}^{-1} + {\textbf{m}}^T{\textbf{R}}^{-1}{\textbf{m}})^{-1}{\textbf{m}}^T{\textbf{R}}^{-1} = {\textbf{P}}{\textbf{m}}^T({\textbf{m}}{\textbf{P}}{\textbf{m}}^T+{\textbf{R}})^{-1} \end{aligned}$$

to rewrite ${\textbf{w}}_m^T$ equivalently, as

$$\begin{aligned} {\textbf{w}}_m^T = {\textbf{v}}_m(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}\hat{\varvec{\Phi }}^T. \end{aligned}$$

(31)

The updated expression in Eq. (31) can then be used to update ${\textbf{w}}_m^T\phi ({\textbf{X}}_i)$

$$\begin{aligned} {\textbf{w}}_m^T\varvec{\Phi }= {\textbf{v}}_m(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}\hat{\varvec{\Phi }}^T\varvec{\Phi }, \end{aligned}$$

(32)

and calculate $\left\| {\textbf{w}}_m\right\| _{2}^2 = {\text {{\textbf {tr}}}}\left( {\textbf{w}}_m^T{\textbf{w}}_m\right) $ by

$$\begin{aligned} \left\| {\textbf{w}}_m\right\| _{2}^2 = {\text {{\textbf {tr}}}}\left( {\textbf{v}}_m(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }}(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}{\textbf{v}}_m^T\right) , \end{aligned}$$

(33)

without directly computing ${\textbf{w}}_m$. These two expressions are computationally tractable as the kernel expressions occur as an inner product in both cases. The updates for the other primal variables $\textbf{b}, {\textbf{E}}, {\textbf{Q}}, {\textbf{R}}, {\textbf{T}}$ and ${\textbf{U}}$ are the same as Algorithm 2.1, except each instance of ${\textbf{w}}_m{\textbf{X}}_i$ and ${\textbf{w}}_y{\textbf{X}}_i$ are replaced by the corresponding columns of Eq. (32). We outline the update steps for the pdMISVM method with kernel in Algorithm 2.2. The time complexity of Algorithm 2.2 $O\left( \left( n_1+\dots +n_N\right) ^2\right) $. The complexity comes from the multiplication between ${\textbf{v}}_m \in {\mathbb {R}}^{n_1 + \dots + n_N + n_{1'} + \dots + n_{N'}}$ and matrix and calculating $(\hat{\varvec{\Phi }}^T\hat{\varvec{\Phi }} + {\textbf{D}}^{-1}/\mu )^{-1}$ in Eq. (32) and Eq. (33).

2.5.2 The Kernel extension of our method with inexact solutions

Our exact kernel method in Eqs. (32) and (33) scales quadratically as the number of instances increases. To effectively deal with the large number of instances in the dataset, we can use the optimal line search method in Eq. (25) for the kernel version of our method by replacing ${\textbf{X}}_i$ with $\varvec{\Phi }_i$:

$$\begin{aligned} \begin{aligned} {\textbf{w}}_m = {\textbf{w}}_m - s_m\nabla _{{\textbf{w}}_m}, \end{aligned} \end{aligned}$$

(34)

where

$$\begin{aligned} \begin{aligned} \nabla ^T_{{\textbf{w}}_m}&= {\textbf{w}}_m^T - \mu \sum _{i=1}^N[{\textbf{t}}_i^m - {\textbf{w}}_m^T\varvec{\Phi }_i - {\textbf{1}}b_m + \varvec{\theta }_i^m/\mu ]\varvec{\Phi }_i^T\\&\quad -\mu \sum _{i'=1}^{N'}\sum _{m'=1}^K[{\textbf{u}}_{i'}^{m'}-{\textbf{w}}_m^T\varvec{\Phi }_{i'} - {\textbf{1}}b_m + \varvec{\xi }_{i'}^{m'}/\mu ]\varvec{\Phi }_{i'}^T, \end{aligned} \end{aligned}$$

(35)

and

$$\begin{aligned} s_m = \frac{\left( {\textbf{w}}_m^T - \mu \sum _{i=1}^N\hat{{\textbf{t}}}^m_i\varvec{\Phi }_i^T - \mu \sum _{i'=1}^{N'}\sum _{m'=1}^K\hat{{\textbf{u}}}_{i'}^{m'}\varvec{\Phi }_{i'}^T\right) \nabla _{{\textbf{w}}_m}}{\left\| \nabla _{{\textbf{w}}_m}\right\| _{2}^2 + \mu \sum _{i=1}^N\left\| \nabla ^T_{{\textbf{w}}_m}\varvec{\Phi }_i\right\| _{2}^2 + \mu K\sum _{i'=1}^{N'}\left\| \nabla ^T_{{\textbf{w}}_m}\varvec{\Phi }_{i'}\right\| _{2}^2}. \end{aligned}$$

(36)

The computational complexity of inexact kernel method in Eq. (34) is $O((n_1 + \dots + n_N) \cdot d_{\phi })$, which is apparently more computationally efficient than Eq. (32) when the number of instances is larger than the number of kernel features $d_{\phi }$. However, the inexact kernel pdMISVM is not applicable to the kernel function of an infinite number of features such as radial basis function. Therefore in our experiments, in contrast to using the radial basis function (RBF) for the exact method, the kernel extension of our method with exact solution employs the degree-2 polynomial function (poly) kernel. In case of degree-2 polynomial kernel $K({\textbf{x}}, {\textbf{y}})=({\textbf{x}}^T {\textbf{y}} + c)^2$, the complexity becomes $O((n_1 + \dots + n_N) \cdot d^2)$, since the feature map $\phi $ is given by:

$$\begin{aligned} \begin{aligned} \phi ({\textbf{x}})&= [x_n^2, \dots , x_1^2, \sqrt{2} x_n x_{n-1}, \cdots , \sqrt{2} x_n x_{1}, \sqrt{2} x_{n-1} x_{n-2}, \cdots , \sqrt{2} x_{n-1} x_{1},\\&\quad \quad \cdots , \sqrt{2} x_{2} x_{1}, \sqrt{2c}x_n, \cdots , \sqrt{2c}x_1, c]. \end{aligned} \end{aligned}$$

(37)

Apparently, our inexact method that scales to the number of features is especially useful to kernel features, because the kernel feature map usually increases the dimensionality.

3 Experiments

In this section we explore the performance of our linear/kernel and exact/inexact pdMISVM implementations. We first test our method against an array of standard MIL benchmark datasets to explore how our implementations compare against other recent MIL methods. We follow the baseline experiments with an investigation into increasingly complex natural scene data to determine the performance characteristics of our approach. Then, we conduct experiments with synthetic data to illustrate the scalability of our approach and experimentally verify the expected computational complexity/performance characteristics of our approach compared to others. We follow with a discussion of the interpretability of our method on three multi-instance datasets derived from two well-known baseline datasets.

3.1 Experimental settings and datasets

As discussed in Sect. 2.5.1, because it is not possible to directly access the kernel features of radial basis function (RBF), we use the degree-2 polynomial kernel (poly) for our inexact kernel method. We compare our methods of linear/kernel and exact/inexact versions against ten recent MIL learning algorithms: (1) a single-instance learning (SIL) approach that assigns the bags’ labels to all instances during training and returns the maximum response for each bag/class-pair at test time for the testing bags’ instances; (2) the miSVM and (3) MISVM algorithms [1] that assume that at least one instance per bag is positive to classify a bag as positive; (4) the NSK algorithm [25], a bag-based method, that maps the entire bag to a single-instance by way of a kernel function; (5) the sMIL and (6) sbMIL [2] algorithms which expect that only a small number of instances within a bag are classified as positive and combine instance-level and bag-level relationships to make a prediction. We also compare our approach against two end-to-end MIL algorithms, (7) miNet and (8) MINet [26], based on deep neural-networks (DNN). Finally, the two DNN MIL models using the attention mechanism are compared: (9) Attention-based deep multiple instance learning (AMIL) [16] calculates the parameterized attention (importance) score for each instance to generate the probability distribution of bag labels; (10) loss-based attention for deep multiple instance learning (LAMIL) [37] proposes to learn the instance scores and predictions jointly by integrating the attention mechanism with the loss function. These two attention-based DNN methods have demonstrated the state-of-the-art classification performance in MIL.

These methods are compared against the proposed pdMISVM (Ours) method, and the inexact variation, described in Algorithm 2.1 and Algorithm 2.2. The grid search and performance calculations for each method-dataset pair are conducted using the MLJ library [38] and are included with our code.^{Footnote 1} All experiments were run on an Intel Xeon processor running at 2.20GHz using 126GB of RAM, running Ubuntu 18.04.4 LTS. The competing SVM-based methods are implemented using a library^{Footnote 2} written by Doran et.al. [39], while the DNN methods are implemented using the code^{Footnote 3} provided as a companion to the paper [16, 26, 37]. Methods that take longer than one-thousand seconds to train during a single cross-validation are considered “timed-out” (T/O) and their performance metrics are not provided.

Each method is compared against a synthetic dataset and ten multi-instance datasets that are normalized to have zero mean and unit variance. The synthetic dataset contains 10 to 1,000 bags with three to five instances per bag and 10 to 1,000 features per instance. The first instance per bag is constructed from two normally distributed clusters with a standard deviation of one; the second to fifth instances per bag contain uniform random noise.

The MUSK-2 [9], Elephant, Fox, Colon [40], and Tiger [1] datasets are standard small-scale MIL evaluation datasets and are widely cited in MIL literature as benchmarks. The MUSK-2 dataset is designed to classify chemical compounds as either “musk” or “non-musk” which describes the chemical properties of a given compound; bags within this dataset are representative of the possible conformations of the labeled compound. The MUSK-2 dataset contains 39 positive and 63 negative bags with 166 features per instance. The Elephant, Fox and Tiger datasets are derived from the Corel image dataset [41] and each contain 100 positive and 100 negative bags with 143 non-zero features per instance. The Colon [40] consists of 25,000 histopathological images (instances) generated from 750 lung tissue images (bags). There are 5 classes of lung benign tissue, lung adenocarcinoma, lung squamous cell carcinoma, colon adenocarcinoma, and colon benign tissue in Colon dataset.

The MNIST-bags [16] dataset contains 100 positive and 100 negative bags where a bag is made up of a random number of $28 \times 28$ greyscale images taken from the MNIST dataset. A bag is given a positive label if it contains a ‘9’ and negative label if it does not. For our experiments the average number of bags is ten, thus the witness rate for positive bags is 10%, on average. This low witness rate makes this a challenging dataset for the chosen MIL algorithms.

The SIVAL dataset was specifically designed for content-based image retrieval (CBIR) and contains natural scene images consisting of 25 categories with 60 images per category for a total of 1500 bags. In this work, we use the processed dataset provided in the initial work of Rahmani et.al. [42] and create a new dataset derived from the raw SIVAL images. In the original processed SIVAL dataset, the images are segmented into 30 or 31 instances, depending on the picture, consisting of 30 features each. In total, there are 47,414 instances across the entire SIVAL dataset. In order to explore the prediction and runtime performance of the compared methods, we construct a few subsets of this dataset containing a predetermined number of classes. Specifically, we construct the SIVAL-3, SIVAL-5, SIVAL-10, SIVAL-15, and SIVAL-25 datasets each containing three, five, ten, fifteen, and twenty-five classes from the SIVAL dataset, of 180, 300, 600, 900, and 1500 bags.

In addition, we construct the “SIVAL-25-deep” dataset, which is inspired by the “hybrid” approach detailed by Zheng et.al., [43], which investigates the ongoing shift from SIFT-based descriptors [44] to convolutional neural networks for generating image descriptors. To create this multi-instance dataset, we extract patches from the raw SIVAL images using the EdgeBox [45] proposal generator (eta=0.2, minScore=0.04, maxBoxes=200) provided in the OpenCV library.^{Footnote 4} These extracted patches are fed into a pre-trained AlexNet [46] convolutional neural network where the second to last fully connected layer (F10) is used to represent each instance by 4,096 features. We note that more complex/newer deep neural architectures, and other proposal generators, c ould be used to create this patch-level embedding but leave this to future work. This process is repeated for every image (where object proposals are detected by EdgeBox) and results in 1,463 bags for a total of 80,561 instances. The SIVAL-25-deep dataset is our attempt at a modernization of the standard SIVAL dataset; the pipeline used to generate this benchmark is provided with our code.

3.2 Classification performance

In Tables 1 and 2 we provide the classification performance of our approach compared against the other competing MIL algorithms described above in Sect. 3.1. Our goal is to verify that our approach matches the performance of the other MIL algorithms. For each dataset-method pair we report the accuracy (ACC) and balanced accuracy (BACC) results across ten sixfold cross validation experiments. We can see from Table 1 that our method gives comparable performance on the MUSK-2, Elephant, and Tiger datasets; this applies for both the exact and inexact implementations. Interestingly, the inexact linear version of our approach outperforms all other methods on both the Fox and MNIST-Bags datasets. That can be naturally explained by the previous study [47] which has shown some implementations of SVM obtain the highest accuracy before their objectives reach their minimum. In the Colon dataset, our kernel versions show superior performance in classifying the different shapes of tissues, which shows the benefits of kernel functions in our model.

In Table 2 our exact linear pdMISVM only slightly outperforms the next best performing method on the SIVAL-3 dataset; this impressive performance result does not hold for the inexact version. Although, the inexact method performs better (in comparison) when the number of classes/bags increase. The inexact linear methods shows surprisingly impressive results on SIVAL-25-deep dataset which are recorded just within the time-budget; this significant performance improvement can be seen very clearly in the comparison between the confusion matrices in Figs. 2 and 3. In comparison of training times (TT) between our exact and inexact kernel pdMISVM, we can clearly see that the inexact version scales to the increasing number of bags (the number of bags increases in the order of SIVAL-3, 5, 10, 15, and 25) better than the exact version. This empirically verifies the analytical complexity discussed in Sect. 2.5. It is clear from these results that the exact/inexact and linear/kernel methods are capable of providing competitive performance results on a variety of multi-instance datasets.

Table 1 Classification performance and train time (seconds) of our method and ten other MIL learning methods on six benchmark datasets

Full size table

Table 2 Classification and train-time (seconds) performance of our method and ten other MIL learning methods on variants of the SIVAL dataset across a different number of classes (and bags) and preprocessing pipelines

Full size table

3.3 Bag/feature scalability

The key contribution of this work is that the derived algorithms described in Sects. 2.3 and 2.4 scale to large datasets. This can be clearly seen in the SIVAL-25 column of Table 2 where our methods are the only ones that are able to fit a model within one-thousand seconds. In order to further validate this finding, in Fig. 4 we report training time results on a synthetic multi-instance dataset where we increase the number of bags as described in Sect. 3.1. In this timing experiment, we use the degree-2 polynomial function for both of exact/inexact kernel methods to compare them fairly. Our linear-exact, linear-inexact, and poly-inexact methods scale well with respect to the number of bags which shows the importance of our primal–dual derivation. However, the training time of our poly-exact model increases rapidly as the number of instances increases, demonstrating the usefulness of our inexact kernel method. This conclusion is especially clear when our method is compared against SVM-based MIL methods which rely on repeatedly solving quadratic programming problems.

Although the initial pdMISVM derivation scales well with respect to bags, it does not scale to the number of features when it is large. This is due to the fact that the update for each ${\textbf{w}}_k$ in Eq. (10) requires solving a least squares problem which scales quadratically as the number of features increase. To address this limitation, we proposed an optimal line search method to improve the scalability of our approach in Eq. (25). We conduct a timing experiment using synthetic data where the number of features is increased to see if our methods provide improved runtime performance. We can see in Fig. 5 that the proposed linear-inexact and poly-exact methods significantly reduce the training time of our approach as the number of features increase. We note that in our exact kernel method, we don’t need to directly access the kernel features as discussed in Sect. 2.5.2.

3.4 Model interpretability

In addition to the promising predictive performance and scalability of our method, we note that instance-based methods such as ours come with an additional benefit: interpretability. Instance-based methods such as miSVM, MISVM, and the proposed pdMISVM method, identify an explicit instance within a bag that is responsible for the predicted label. We use this phenomenon to explore the limitations of our method on the MNIST-bags dataset and showcase patches identified during the SIVAL experiment across a number of different classes.

For the MNIST-bags dataset we plot the learned positive and negative class coefficients associated with the two learned hyperplanes in Fig. 6 (e.g., ${\textbf{w}}_1$ and ${\textbf{w}}_2$). In addition, we plot four randomly chosen testing bags and what instance was chosen by the multi-instance decision function for the positive class hyperplane in Fig. 7. On the left-hand side of Fig. 6, we can see that our method can roughly detect the loop at the top of the ‘9’ although it is clear from this interpretation that our approach will not be able to handle even moderate translation or rotation if it is only provided with raw-pixel values. Additionally, even though our method correctly classifies the first bag, it incorrectly identifies the ‘4’ as the witness instance; we can see that a ‘4’ appears to be contaminating the learned coefficients displayed in Fig. 6. In order to solve this problem it is likely that additional preprocessing will be required to extract more descriptive features from instances within the MNIST-bags dataset beyond raw pixels for our method to be effective.

In order to illustrate how effective feature extraction can aid in the interpretability of our method, we extend our discussion to the SIVAL-25 and SIVAL-25-deep datasets. In Fig. 8, we provide image patches identified by our approach on images chosen from the SIVAL-25 dataset. We can see that our method identifies distinctive visual characteristics in each of the classes. For example, the bag representing a “Banana” is identified by a distinctive patch along the length of the fruit while in the “Apple” image our approach identifies the round patch on top of the fruit. Similarly, in Fig. 9 we present the neural-network embedded patches extracted via the EdgeBox detection algorithm and the identified patches. We can clearly see in Fig. 9 that our approach is able to accurately localize the most distinctive parts of the object, at the patch level, within the image. For example, the medal is recognized by the “gold” part while the “bowl” of the spoon is recognized.

Remarkably, the results in Figs. 8 and 9 show that when our method is given a set of sufficiently descriptive object proposals/patches, paired with a bag-level label, our method can accurately locate objects within an image. This is one of the significant advantages of instance-based MIL methods over traditional single-instance learning methods that require all instances to be labeled. We anticipate that this framework could be extended to investigate and interpret the effectiveness of pre-trained neural networks on an assortment of datasets that can be formulated as MIL problems. We plan to further investigate different aspects of our approach under different object proposal methods [48, 49], neural architectures [50], and applications [13, 14, 51].

3.5 Capability to learn nonlinear decision boundaries

As an important extension of our pdMISVM method to deal with nonlinearly separable data, we introduced the kernel versions of our exact and inexact methods. In this subsection, we evaluate their classification performances.

We implement the kernel version of the pdMISVM method and compare it against the linear version in Fig. 10. We can see this extension successfully extends our approach to correctly classify data that is not linearly separable. In this paper, we propose two versions of kernel pdMISVM which scales to the number of instances or features, which enable to efficiently learn the nonlinear decision boundaries from the large dataset.

3.6 A case study on neuroimaging data

While we have demonstrated the effectiveness of our new pdMISVM method from a number of perspectives in the previous subsections, in this subsection we apply our new methods on a medical imaging dataset to verify its capability to solve real-world problems.

3.6.1 Comparisons of the classification capabilities

Alzheimer’s disease (AD) is a serious neurodegenerative condition in which people suffer from the deterioration of cognitive functions. To verify the capability of our new methods to solve real-world problems, we apply them to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [52] dataset which provides the comprehensive neuroimagings such as voxel-Based Morphometry (VBM) and FreeSurfer (FS). We collect the magnetic resonance imaging (MRI) scans and their diagnosis in Alzheimer’s disease, mild cognitive decline, and healthy condition of 821 participants from ADNI database^{Footnote 5}. We perform VBM and FS automated parcellation on the MRIs following [53] and extract mean modulated gray matter measures for 90 target regions of interest (ROI). Because the different number of MRI scans have been captured across the participants, it is difficult to directly apply the standard statistical methods to all of the neuroimagings provided. In this case study, we formulate each neuroimaging as an instance and each participant as a bag to predict their AD diagnosis.

In Table 3, we report the classification performances of ours and the other competing models. Although the deep learning models (miNet, MINet, AMIL, and LAMIL) have shown the promising performances, they require the significant amount of training time. On the other hands, four variations of our models show the comparable performance with a few seconds of training time.

Table 3 Classification and train-time (seconds) performance of our method and ten other MIL learning methods on ADNI dataset

Full size table

3.6.2 Identification of brain regions

Likewise we analyze the hyperplane in Fig. 6 from MNIST digits dataset, we analyze the hyperplane to identify the brain regions exhibiting AD risk factors. Since the p-th feature of each instance is multiplied with the p-th weight of ${\textbf{w}}_m$ to contribute to the response for the m-th class, we calculate the contribution of p-th feature as the summation of p-th weight of hyperplanes $\sum _{m=1}^K \Vert w_m^p \Vert $. Each feature of instance (neuroimage) represents each ROI in the brain, therefore we visualize the disease relevance of each brain region in Fig. 11.

The brain regions identified by our method (linear, exact) have been appeared in the previous medical literatures. For example, the pathological changes in the Brodmann area 44 (Broca’s area) involves in the comprehension and production of verbs and communication [54]. Based on the previous study [55], the larger ventricular volume is the risk factor of dementia related disease in the future. The change in venticular volume is also associated with the cognitive decline and dementia [55, 56]. The atrophy of the caudate nucleus is also related to the cognitive decline [57]. The previous study [58] found that the anterior thalamus plays an important role in generating attention, and it is in charge of declarative memory functioning. The hippocampus is particularly susceptible to damage from AD and involves long-term memory and spatial navigation in AD patients [59]. The amygdala region, which is related to the emotional response, can also be easily damaged by AD [60]. The identified brain regions in this case study are well represented in the previous AD studies, and support the correctness of our methods by providing the further interpretability.

4 Conclusion

In this work, we propose a primal–dual multi-instance SVM method that is able to scale to large multi-instance datasets. Our SVM-based approach is able to handle data that grows in terms of bags as well as features since it avoids solving a quadratic programming problem that limits the adoption of traditional SVM-based MIL techniques. Throughout the manuscript, we provide detailed derivations, implementations, and experimental results which illustrate the utility of our approach on both synthetic and real-world datasets. In addition, we provide the kernel extensions of our approach which scale to the number of instances or features. Our experimental results on synthetic multi-instance data validate our kernel extension successfully learn the nonlinear decision boundaries. Finally, we investigate the interpretability of our method on benchmark multi-instance datasets and develop an extension to the ADNI dataset as part of this study. From the ADNI dataset, our method identifies the disease relevant brain regions which are in nice accordance with the existing medical studies.

Notes

References

Andrews S, Tsochantaridis I, Hofmann T (2002) Support vector machines for multiple-instance learning. In: NIPS, vol 2. Citeseer, pp 561–568
Bunescu RC, Mooney RJ (2007) Multiple instance learning for sparse positive bags. In: Proceedings of the 24th international conference on machine learning, pp 105–112
Wang H, Nie F, Huang H (2011) Learning instance specific distance for multi-instance classification. In: Twenty-fifth AAAI conference on artificial intelligence
Wang H, Huang H, Kamangar F, Nie F, Ding C (2011) Maximum margin multi-instance learning. Adv Neural Inf Process Syst 24:1–9
Google Scholar
Wang H, Nie F, Huang H (2012) Robust and discriminative distance for multi-instance learning. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 2919–2924
Wang H, Deng C, Zhang H, Gao X, Huang H (2016) Drosophila gene expression pattern annotations via multi-instance biological relevance learning. In: Proceedings of the AAAI conference on artificial intelligence, vol 30, p 1
Liu K, Wang H, Nie F, Zhang H (2018) Learning multi-instance enriched image representations via non-greedy ratio maximization of the $\ell _1$-norm distances. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7727–7735
Brand L, Baker LZ, Wang H (2021) A multi-instance support vector machine with incomplete data for clinical outcome prediction of covid-19. In: Proceedings of the 12th ACM conference on bioinformatics, computational biology, and health informatics, pp 1–6
Dietterich TG, Lathrop RH, Lozano-Pérez T (1997) Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 89(1–2):31–71
Article Google Scholar
Zhou ZH, Xue XB, Jiang Y (2005) Locating regions of interest in CBIR with multi-instance learning techniques. In: Australasian joint conference on artificial intelligence. Springer, pp 92–101
Wang H, Nie F, Huang H, Yang Y (2011) Learning frame relevance for video classification. In: Proceedings of the 19th ACM international conference on multimedia, pp 1345–1348
Cheplygina V, de Bruijne M, Pluim JP (2019) Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med Image Anal 54:280–296
Article Google Scholar
Settles B, Craven M, Ray S (2007) Multiple-instance active learning. In: Proceedings of the 20th international conference on neural information processing systems, pp 1289–1296
Xu Y, Zhu J-Y, Eric I, Chang C, Lai M, Tu Z (2014) Weakly supervised histopathology cancer image segmentation and classification. Med Image Anal 18(3):591–604
Article Google Scholar
Wu J, Yu Y, Huang C, Yu K (2015) Deep multiple instance learning for image classification and auto-annotation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3460–3469
Ilse M, Tomczak J, Welling M (2018) Attention-based deep multiple instance learning. In: International conference on machine learning. PMLR, pp 2127–2136
Yan Y, Wang X, Guo X, Fang J, Liu W, Huang J (2018) Deep multi-instance learning with dynamic pooling. In: Asian conference on machine learning. PMLR, pp 662–677
Ray S, Craven M (2005) Supervised versus multiple instance learning: an empirical comparison. In: Proceedings of the 22nd international conference on machine learning, pp 697–704
Carbonneau M-A, Cheplygina V, Granger E, Gagnon G (2018) Multiple instance learning: a survey of problem characteristics and applications. Pattern Recogn 77:329–353
Article Google Scholar
Wei X-S, Wu J, Zhou Z-H (2016) Scalable algorithms for multi-instance learning. IEEE Trans Neural Netw Learn Syst 28(4):975–987
Article Google Scholar
Vatsavai RR (2012) Scalable multi-instance learning approach for mapping the slums of the world. In: SC companion: high performance computing, networking storage and analysis. IEEE, pp 833–837
Wang J, Zucker JD (2000) Solving multiple-instance problem: a lazy learning approach
Melendez J, van Ginneken B, Maduskar P, Philipsen RH, Reither K, Breuninger M, Adetifa IM, Maane R, Ayles H, Sánchez CI (2014) A novel multiple-instance learning-based approach to computer-aided detection of tuberculosis on chest x-rays. IEEE Trans Med Imaging 34(1):179–192
Article Google Scholar
Rymarczyk D, Kaczyńska A, Kraus J, Pardyl A, Zieliński B (2021)Protomil: Multiple instance learning with prototypical parts for fine-grained interpretability. arXiv preprint arXiv:2108.10612
Gärtner TP, Flach A, Kowalczyk A, Smola AJ (2002) Multi-instance kernels. In: ICML, vol 2, no. 3, p 7
Wang X, Yan Y, Tang P, Bai X, Liu W (2018) Revisiting multiple instance neural networks. Pattern Recogn 74:15–24
Article Google Scholar
Brand L, Baker LZ, Ellefsen C, Sargent J, Wang H (2021) A linear primal–dual multi-instance SVM for big data classifications. In: 2021 IEEE international conference on data mining (ICDM), pp 21–30
Boyd S, Parikh N, Chu E (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc
Google Scholar
Lium L, Han Z (2015) Multi-block admm for big data optimization in smart grid. In: 2015 International conference on computing, networking and communications (ICNC). IEEE, pp 556–561
Hong M, Luo Z-Q (2017) On the linear convergence of the alternating direction method of multipliers. Math Program 162(1–2):165–199
Article MathSciNet Google Scholar
Nie F, Huang Y, Wang X, Huang H (2014) New primal SVM solver with linear computational cost for big data classifications. In: Proceedings of the 31st international conference on international conference on machine learning, vol 32, pp II–505
Platt J (1998) Sequential minimal optimization: a fast algorithm for training support vector machines
Dogan Ü, Glasmachers T, Igel C (2016) A unified view on multi-class support vector classification. J Mach Learn Res 17(45):1–32
MathSciNet Google Scholar
Weston J, Watkins C et al (1999) Support vector machines for multi-class pattern recognition. Esann 99:219–224
Google Scholar
Wang J, Zhao L (2017) Nonconvex generalization of admm for nonlinear equality constrained problems. arXiv preprint arXiv:1705.03412
Welling M (2013) Kernel ridge regression. Max Welling’s classnotes in machine learning, pp 1–3
Shi X, Xing F, Xie Y, Zhang Z, Cui L, Yang L (2020) Loss-based attention for deep multiple instance learning. Proc AAAI Conf Artif Intell 34(04):5742–5749
Google Scholar
Blaom AD, Kiraly F, Lienart T, Simillides Y, Arenas D, Vollmer SJ (2020) MLJ: a Julia package for composable machine learning. J Open Source Softw 5(55):2704
Article Google Scholar
Doran G, Ray S (2014) A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Mach Learn 97(1–2):79–102
Article MathSciNet Google Scholar
Borkowski AA, Bui MM, Thomas LB, Wilson CP, DeLand LA, Mastorides SM (2019) Lung and colon cancer histopathological image dataset (lc25000). arXiv preprint arXiv:1912.12142
Chakrabarti K, Mehrotra S (1999) The hybrid tree: an index structure for high dimensional feature spaces. In: Proceedings 15th international conference on data engineering (Cat. No. 99CB36337). IEEE, pp 440–447
Rahmani R, Goldman SA, Zhang H, Krettek J, Fritts JE (2005) Localized content based image retrieval. In: Proceedings of the 7th ACM SIGMM international workshop on multimedia information retrieval, pp 227–236
Zheng L, Yang Y, Tian Q (2017) Sift meets CNN: a decade survey of instance retrieval. IEEE Trans Pattern Anal Mach Intell 40(5):1224–1244
Article Google Scholar
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Article Google Scholar
Zitnick, CL, Dollár P (2014) Edge boxes: locating object proposals from edges. In: European conference on computer vision. Springer, pp 391–405
Krizhevsky A (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997
Chang KW, Hsieh CJ, Lin CJ (2008) Coordinate descent method for large-scale l2-loss linear support vector machines. J Mach Learn Res 9:7
MathSciNet Google Scholar
Uijlings JR, Van De Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. Int J Comput Vis 104(2):154–171
Article Google Scholar
Kong S, Fowlkes CC (2018) Recurrent pixel embedding for instance grouping. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9018–9028
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and patter recognition, pp 770–778
Cai X, Wang H, Huang H, Ding C (2012) Joint stage recognition and anatomical annotation of drosophila gene expression patterns. Bioinformatics 28(12):i16–i24
Article Google Scholar
Petersen RC, Aisen P, Beckett LA, Donohue M, Gamst A, Harvey DJ, Jack C, Jagust W, Shaw L, Toga A et al (2010) Alzheimer’s disease neuroimaging initiative (ADNI): clinical characterization. Neurology 74(3):201–209
Article Google Scholar
Risacher SL, Shen L, West JD, Kim S, McDonald BC, Beckett LA, Harvey DJ, Jack CR Jr, Weiner MW, Saykin AJ et al (2010) Longitudinal MRI atrophy biomarkers: relationship to conversion in the ADNI cohort. Neurobiol Aging 31(8):1401–1418
Article Google Scholar
Bak TH, O’Donovan DG, Xuereb JH, Boniface S, Hodges JR (2001) Selective impairment of verb processing associated with pathological changes in brodmann areas 44 and 45 in the motor neurone disease-dementia-aphasia syndrome. Brain 124(1):103–120
Article Google Scholar
Carmichael OT, Kuller LH, Lopez OL, Thompson PM, Dutton RA, Lu A, Lee SE, Lee JY, Aizenstein HJ, Meltzer CC et al (2007) Ventricular volume and dementia progression in the cardiovascular health study. Neurobiol Aging 28(3):389–397
Article Google Scholar
Jack C, Shiung M, Gunter J, O’brien P, Weigand S, Knopman DS, Boeve BF, Ivnik R, Smith G, Cha R et al (2004) Comparison of different MRI brain atrophy rate measures with clinical disease progression in ad. Neurology 62(4):591–600
Article Google Scholar
Apostolova LG, Beyer M, Green AE, Hwang KS, Morra JH, Chou Y-Y, Avedissian C, Aarsland D, Janvin CC, Larsen JP et al (2010) Hippocampal, caudate, and ventricular changes in Parkinson’s disease with and without dementia. Mov Disord 25(6):687–695
Article Google Scholar
Van der Werf YD, Witter MP, Uylings HB, Jolles J (2000) Neuropsychology of infarctions in the thalamus: a review. Neuropsychologia 38(5):613–627
Article Google Scholar
Mu Y, Gage FH (2011) Adult hippocampal neurogenesis and its role in Alzheimer’s disease. Mol Neurodegener 6(1):85
Article Google Scholar
Poulin SP, Dautoff R, Morris JC, Barrett LF, Dickerson BC, Initiative ADN et al (2011) Amygdala atrophy is prominent in early Alzheimer’s disease and relates to symptom severity. Psychiatry Res Neuroimaging 194(1):7–13
Article Google Scholar

Download references

Acknowledgements

This work was supported in part by the National Science Foundation (NSF) under the grants of IIS 1652943, IIS 1849359, CNS 1932482 and CCF 2029543. Part of Lodewijk Brand’s work was supported in part by the Department of Defense (DoD) Science, Mathematics And Research for Transformation (SMART) Scholarship-for-Service program.

Author information

Authors and Affiliations

Department of Computer Science, Colorado School of Mines, Golden, CO, USA
Lodewijk Brand, Hoon Seo, Lauren Zoe Baker, Carla Ellefsen & Hua Wang
Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI, USA
Jackson Sargent

Authors

Lodewijk Brand
View author publications
You can also search for this author in PubMed Google Scholar
Hoon Seo
View author publications
You can also search for this author in PubMed Google Scholar
Lauren Zoe Baker
View author publications
You can also search for this author in PubMed Google Scholar
Carla Ellefsen
View author publications
You can also search for this author in PubMed Google Scholar
Jackson Sargent
View author publications
You can also search for this author in PubMed Google Scholar
Hua Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hua Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Brand, L., Seo, H., Baker, L.Z. et al. A linear primal–dual multi-instance SVM for big data classifications. Knowl Inf Syst 66, 307–338 (2024). https://doi.org/10.1007/s10115-023-01961-z

Download citation

Received: 29 January 2022
Revised: 04 August 2023
Accepted: 05 August 2023
Published: 26 August 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s10115-023-01961-z

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

A linear primal–dual multi-instance SVM for big data classifications

Abstract

Similar content being viewed by others

Multiple instance classification via quadratic programming

Multiple Instance Learning via Semi-supervised Laplacian TSVM

v-soft margin multi-task learning logistic regression

1 Introduction

1.1 Related works

1.2 Our contributions

2 Methods

2.1 Notation

2.2 Extending the MISVM to K-class classification

2.3 A primal–dual multi-instance SVM

2.4 Scaling to a large number of features

2.5 A primal–dual multi-instance SVM with Kernel

2.5.1 The Kernel extension of our method with exact solutions

2.5.2 The Kernel extension of our method with inexact solutions

3 Experiments

3.1 Experimental settings and datasets

3.2 Classification performance

3.3 Bag/feature scalability

3.4 Model interpretability

3.5 Capability to learn nonlinear decision boundaries

3.6 A case study on neuroimaging data

3.6.1 Comparisons of the classification capabilities

3.6.2 Identification of brain regions

4 Conclusion

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A linear primal–dual multi-instance SVM for big data classifications

Abstract

Similar content being viewed by others

Multiple instance classification via quadratic programming

Multiple Instance Learning via Semi-supervised Laplacian TSVM

v-soft margin multi-task learning logistic regression

Explore related subjects

1 Introduction

1.1 Related works

1.2 Our contributions

2 Methods

2.1 Notation

2.2 Extending the MISVM to K-class classification

2.3 A primal–dual multi-instance SVM

2.4 Scaling to a large number of features

2.5 A primal–dual multi-instance SVM with Kernel

2.5.1 The Kernel extension of our method with exact solutions

2.5.2 The Kernel extension of our method with inexact solutions

3 Experiments

3.1 Experimental settings and datasets

3.2 Classification performance

3.3 Bag/feature scalability

3.4 Model interpretability

3.5 Capability to learn nonlinear decision boundaries

3.6 A case study on neuroimaging data

3.6.1 Comparisons of the classification capabilities

3.6.2 Identification of brain regions

4 Conclusion

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation