6.1 Motivating Examples for Multiple Instance Learning in Hyperspectral Analysis

In standard supervised machine learning, each training sample is assumed to be coupled with the desired classification label. However, acquiring accurately labeled training data can be time consuming, expensive, or at times infeasible. Challenges with obtaining precise training labels and location information are pervasive throughout many remote sensing and hyperspectral image analysis tasks. A learning methodology to address imprecisely labeled training data is multiple instance learning (MIL). In MIL, data is labeled at the bag level where a bag is a multi-set of data points as illustrated in Fig. 6.1. In standard MIL, bags are labeled as “positive” if they contain any instances representing a target class whereas bags are labeled as “negative” if they contain only nontarget instances. Generating labels for a bag of points is often much less time consuming and aligns with the realistic scenarios encountered in remote sensing applications as outlined in the following motivating examples.

Fig. 6.1

Source: © [2019] IEEE. Reprinted, with permission, from [62]

In multiple instance learning, data is labeled at the bag level. A bag is labeled as a positive bag if it contains at least one target instance. The number of target versus nontarget instances in each positive bag is unknown. A bag is labeled as a negative bag if it contains only nontarget instances. In this figure, blue points correspond to nontarget instances, whereas red points correspond to target instances.

  • Hyperspectral Classification: Training a supervised classifier requires accurately labeled spectra for the classes of interest. In practice, this is often accomplished by creating a ground truth map of a hyperspectral scene (scenes which frequently contain hundreds of thousands of pixels or more). Generation of ground truth maps is challenging due to labeling ambiguity that naturally arises from the relatively coarse spatial resolution and the compound diversity of the remotely sensed hyperspectral scene. For example, an area that is labeled as vegetation may contain both plants and bare soil, making the training label inherently ambiguous. Furthermore, labeling each pixel of the hyperspectral scene is tedious, and annotator performance is generally inconsistent from person to person and over time. Due to these challenges, “ground-rumor” may be a more appropriate term than “ground-truth” for the maps that are generated. These ambiguities naturally map to the MIL framework: an annotator labels spatial regions that contain a class of interest as positive bags and spatial regions known to exclude those classes as negative bags. For instance, when vegetation is the class of interest, an annotator can easily mark (e.g., circle on a map) positive bag regions that contain vegetation and then mark regions of only bare soil and building/man-made materials as negative bags.

  • Sub-pixel Target Detection: Consider the hyperspectral target detection problem illustrated in Fig. 6.2. This hyperspectral scene was collected over the University of Southern Mississippi-Gulfpark Campus [1] and includes many emplaced targets. These targets are cloth panels of four colors (Brown, Dark Green, Faux Vineyard Green, and Pea Green) in three sizes: 0.5 m \(\times \) 0.5 m, 1 m \(\times \) 1 m, and 3 m \(\times \) 3 m. The ground sample distance of this hyperspectral data set is 1 m. Thus, the 0.5 m \(\times \) 0.5 m targets are, at best, a quarter of a pixel in size; the 1 m \(\times \) 1 m targets are, at best, exactly one pixel in size; and the 3 m \(\times \) 3 m targets cover multiple pixels. However, the targets are rarely aligned with the pixel grid, so the responses of the 0.5 m \(\times \) 0.5 m and 1 m \(\times \) 1 m targets often straddle multiple pixels and are sub-pixel. The scene also has heavy tree coverage, resulting in targets being heavily occluded by the tree canopy. The sub-pixel nature of the targets and the occlusion by the tree canopy make this a challenging target detection problem, and one in which manual labeling of target locations by visual inspection is impractical. Ground truth locations of the targets in this scene were collected by a GPS unit with 2–5 m accuracy. Thus, the ground truth is only accurate up to some spatial region (as opposed to the pixel level). For example, the region highlighted in Fig. 6.2 contains one brown target. From this highlighted region, one can clearly see that the GPS coordinate of this brown target (denoted by the red dot) is shifted one pixel from the actual brown target location (denoted by the yellow rectangle). This is a rare example in which the brown target is visible; most of the targets are difficult to distinguish visually. 
Developing a classifier or extracting a pure prototype for the target class given such incomplete knowledge of the training data is intractable using standard supervised learning methods. This problem also maps directly to the MIL framework: each positive bag corresponds to the spatial region associated with a ground truth point and its range of imprecision, and negative bags correspond to spatial regions that do not overlap with any ground truth point or its associated halo of uncertainty.

  • Multi-sensor Fusion: When fusing information obtained by multiple sensors, each sensor may provide complementary information that can aid scene understanding and analysis. Figure 6.3 shows a three-dimensional scatter plot of the LiDAR (Light Detection And Ranging) point cloud collected over the University of Southern Mississippi-Gulfpark Campus simultaneously with the hyperspectral imagery (HSI) described above. In this data set, the hyperspectral and LiDAR data can be leveraged jointly for scene segmentation, ground cover classification, and target detection. However, challenges arise during fusion. The HSI and LiDAR data are of drastically different modalities and resolutions. HSI is collected natively on a pixel grid with a 1 m ground sample distance, whereas the raw LiDAR data is a point cloud with a higher resolution of 0.60 m cross-track and 0.78 m along-track spot spacing. Commonly, before fusion, data is co-registered onto a shared pixel grid. However, image co-registration and rasterization may introduce inaccuracies [2, 3]. In this example, consider the edges of the buildings with gray roofs in Fig. 6.3. Some of the hyperspectral pixels of the buildings have been inaccurately mapped to LiDAR points corresponding to the neighboring grass pixels on the ground. Similarly, some hyperspectral pixels corresponding to sidewalks and dirt roads have been inaccurately mapped to high elevation values similar to nearby trees and buildings. Directly using such inaccurate measurements for fusion can cause further inaccuracy or error in classification, detection, or prediction. Therefore, it is beneficial to develop a fusion algorithm that can handle such inaccurate/imprecise measurements. Imprecise co-registration can also be mapped to the MIL framework by considering a bag of points from a local region in one sensor (e.g., LiDAR) to be candidates for fusion with each pixel in the other sensor (e.g., hyperspectral).

Fig. 6.2

Illustration of inaccurate GPS coordinates: the GPS-reported location of one brown target is shifted by one pixel from the true target location

Fig. 6.3

An example 3D scatterplot of LiDAR data over the University of Southern Mississippi-Gulfpark campus. The LiDAR points are colored using the RGB imagery derived from the HSI sensor over the scene

These examples illustrate that remote-sensing data and applications are often plagued with inherent spatial imprecision in ground truth information. Multiple instance learning is a framework that can alleviate the issues that arise due to this imprecision. Therefore, although imprecise ground truth plagues instance-level labels, bags (i.e., spatial regions) can be labeled readily and analyzed using MIL approaches.

6.2 Introduction to Multiple Instance Classification

MIL was first proposed by Dietterich et al. [4] for the prediction of drug activity. The effectiveness of a drug is determined by how tightly the drug molecule binds to a larger protein molecule. Although a molecule may be determined to be effective, it can have variants called “conformations,” of which only one (or a few) actually binds to the desired target binding site. In this task, the learning objective is to infer the shape of the molecule conformation that actually has tight binding capacity. In order to solve this problem, Dietterich et al. introduced the definition of “bags.” Each molecule was treated as a bag, and each possible conformation of the molecule was treated as an instance in that bag. This setting directly induces the definition of multiple instance learning: a positively labeled bag contains at least one positive instance (and possibly many negative instances), and negatively labeled bags are composed entirely of negative instances. The goal is to uncover which instances in each positive bag are truly positive and what characterizes the positive class.

Although initially proposed for this drug activity application, the multiple instance learning framework is extremely relevant and applicable to a number of remote-sensing problems arising from imprecision in ground truth information. By labeling data and operating at the bag level, the ground truth imprecision inherent in remote sensing problems is addressed and accounted for within a multiple instance learning framework.

6.2.1 Multiple Instance Learning Formulation

The multiple instance learning framework can be formally described as follows. Let \(\mathbf {X}=\left[ \mathbf {x}_1,\ldots ,\mathbf {x}_N\right] \in \mathbb {R}^{n\times N}\) be training data instances where n is the dimensionality of an instance and N is the total number of training instances. The data are grouped into K bags, \(\mathbf {B} = \left\{ \mathbf {B}_1, \ldots , \mathbf {B}_K\right\} \), with associated binary bag-level labels, \(L = \left\{ L_1, \ldots , L_K\right\} \) where \(L_i \in \left\{ 0, 1\right\} \) for two-class classification. A bag, \(\mathbf {B}_i\), is termed positive with \(L_i\)=1 if it contains at least one positive instance. The exact number or identification of positive and negative instances in each positive bag is unknown. A bag is termed negative with \(L_i\)=0 when it contains only negative instances. The instance \(\mathbf {x}_{ij} \in \mathbf {B}_i\) denotes the jth instance in bag \(\mathbf {B}_i\) with the (unknown) instance-level label \(l_{ij}\in \left\{ 0, 1\right\} \).

In standard supervised machine learning methods, all instance level labels are known for the training data. However, in multiple instance learning, only the bag-level labels are known. Given this formulation, the fundamental goal of an MIL method is to determine what instance-level characteristics are common across all positive bags and cannot be found in any instance in any negative bag.
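As a concrete illustration of this formulation, bags can be represented simply as arrays of instances paired with bag-level labels. The following sketch uses hypothetical toy data (the `make_bag` helper and its distributions are illustrative assumptions, not from this chapter):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bag(label, n_instances=5, dim=4):
    """Simulate one bag: a positive bag gets at least one 'target' instance,
    but the instance-level labels inside it are treated as unknown."""
    X = rng.normal(0.0, 1.0, size=(n_instances, dim))   # nontarget instances
    if label == 1:
        X[0] += 5.0                                     # one shifted target instance
    return X

# Bag-level labels L_i are known; instance-level labels l_ij are not.
labels = [1, 1, 0, 0]
bags = [make_bag(L) for L in labels]

for B, L in zip(bags, labels):
    assert B.ndim == 2 and L in (0, 1)
```

An MIL method sees only `bags` and `labels`; recovering which rows of the positive bags are targets is precisely the learning problem.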

6.2.2 Axis-Parallel Rectangles, Diverse Density, and Other General MIL Approaches

Many general MIL approaches have been developed in the literature. Axis-parallel rectangle (APR) [4] algorithms were the first set of MIL algorithms, proposed by Dietterich et al. for drug activity prediction in the 1990s. An axis-parallel rectangle can be viewed as a region of true positive instances in the feature space. In APR algorithms, a lower and upper bound encapsulating the positive class is estimated in each feature dimension. Three APR algorithms, greedy feature selection elimination count (GFS elim-count), greedy feature selection kernel density estimation (GFS kde), and iterated discrimination (iterated-discrim), were investigated and compared in [4]. As an illustration, GFS elim-count finds an APR in a greedy manner starting from a region that exactly covers all of the positive instances. Figure 6.4 shows this “all-positive APR” as a solid bounding box, where the unfilled markers represent positive instances and filled markers represent negative instances. As shown in the figure, the all-positive APR may contain several negative instances. The algorithm proceeds by greedily eliminating all negative instances within the APR while retaining as many positive instances as possible. The dashed box in Fig. 6.4 indicates the final APR identified by the GFS elim-count algorithm, obtained by iteratively excluding the “cheapest” negative instance, i.e., the one whose exclusion requires removing the minimum number of positive instances from the APR.

Fig. 6.4

Illustration of the GFS elim-count procedure for excluding negative instances. The “all-positive APR” is indicated by a solid box. The unfilled markers represent positive instances and filled markers represent negative instances. The final APR is indicated by the dashed box [4]
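Under simplifying assumptions, the greedy elimination idea of GFS elim-count can be written as a short loop. The sketch below is an illustrative reconstruction on 2-D toy data, not the authors' implementation; `gfs_elim_count` and its tie-breaking are assumptions:

```python
import numpy as np

def inside(X, lo, hi):
    """Boolean mask of which rows of X fall inside the APR [lo, hi]."""
    return np.all((X >= lo) & (X <= hi), axis=1)

def gfs_elim_count(pos, neg, eps=1e-9):
    """Greedy sketch: start from the APR exactly covering all positives,
    then repeatedly exclude the 'cheapest' inside negative, i.e., the one
    whose exclusion (moving one bound just past it) loses the fewest
    positive instances."""
    lo, hi = pos.min(axis=0).astype(float), pos.max(axis=0).astype(float)
    while True:
        neg_in = neg[inside(neg, lo, hi)]
        if len(neg_in) == 0:
            return lo, hi                     # no negatives left inside
        best = None                           # (positives kept, new lo, new hi)
        for x in neg_in:
            for d in range(pos.shape[1]):
                lo2 = lo.copy(); lo2[d] = x[d] + eps   # raise lower bound past x
                hi2 = hi.copy(); hi2[d] = x[d] - eps   # lower upper bound past x
                for lo_c, hi_c in ((lo2, hi), (lo, hi2)):
                    kept = inside(pos, lo_c, hi_c).sum()
                    if best is None or kept > best[0]:
                        best = (kept, lo_c.copy(), hi_c.copy())
        lo, hi = best[1], best[2]
```

Each pass moves one bound just past the cheapest inside negative; the loop ends once the rectangle contains no negative instances.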

Diverse density (DD) [5, 6] was one of the first multiple instance learning algorithms that estimated a positive concept. The positive concept is a representative of the positive class. This representative is estimated in DD by identifying a representative feature vector that is close to the intersection of all positive bags and far from every negative instance. In other words, the target concept represents an area that preserves both a high density of target points and a low density of nontarget points, called diverse density. This is accomplished in DD by maximizing the likelihood function in Eq. (6.1),

$$\begin{aligned} \arg \max _{d}\prod _{i=1}^{K^+} \mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {B}_i^+)\prod _{i=K^++1}^{K^++K^-}\mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {B}_i^-) , \end{aligned}$$
(6.1)

where \(\mathbf {s}\) is the assumed true positive concept, \(\mathbf {d}\) is the concept representative to be estimated, \(K^+\) is the number of positive bags and \(K^-\) is the number of negative bags. The first term in Eq. (6.1), which is used for all positive bags, is defined by the noisy-or model,

$$\begin{aligned} \mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {B}_i^+)=\mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {x}_{i1}, \mathbf {x}_{i2}, \ldots , \mathbf {x}_{iN_i})=1-\prod _{j=1}^{N_i}(1-\mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {x}_{ij}\in \mathbf {B}_i^+)), \end{aligned}$$
(6.2)

where \(\mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {x}_{ij})=exp(-\Vert \mathbf {x}_{ij}-\mathbf {d}\Vert ^2)\). The term in (6.2) can be interpreted as requiring there be at least one instance in positive bag \(\mathbf {B}_i^+\) that is close to the positive representative \(\mathbf {d}\). This can be understood by noticing that (6.2) evaluates to 1 if there is at least one instance in the positive bag that is close to the representative (i.e., \(exp(-\Vert \mathbf {x}_{ij}-\mathbf {d}\Vert ^2) \rightarrow 1\) which implies \(1-\mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {x}_{ij}\in \mathbf {B}_i^+) \rightarrow 0\), resulting in \(1-\prod _{j=1}^{N_i}(1-\mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {x}_{ij}\in \mathbf {B}_i^+)) \rightarrow 1\)). In contrast, (6.2) evaluates to 0 if all points in a positive bag are far from the positive concept.

The second term is defined by

$$\begin{aligned} \mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {B}_i^-)=\prod _{j=1}^{N_i}(1-\mathbf {Pr}(\mathbf {d}=\mathbf {s}|\mathbf {x}_{ij}\in \mathbf {B}_i^-)). \end{aligned}$$
(6.3)

which encourages positive concepts to be far from all negative points. The noisy-or model, however, is highly non-smooth and there are several local maxima in the solution space. This is alleviated in practice by repeatedly performing gradient ascent on the log-likelihood function with a starting point at every positive instance. Alternatively, an expectation maximization version of diverse density (EM-DD) [7] was proposed by Zhang et al. in order to improve the computation time of DD [5, 6]. EM-DD assumes there exists only one instance per bag corresponding to the bag-level label and treats the identity of this key instance as a hidden latent variable. EM-DD starts with an initial estimate of the positive concept \(\mathbf {d}\) and iterates between an expectation step (E-step) that selects one point per bag as the representative of that bag and a maximization step (M-step) that performs quasi-Newton optimization [8] on the resulting single-instance DD problem. In practice, EM-DD is much more computationally efficient than DD; however, these computational benefits are traded off against potentially lower accuracy than DD [9].
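The noisy-or construction of Eqs. (6.1)–(6.3) can be sketched directly. The bags below are illustrative toy data; in practice the (log-)objective is maximized by gradient ascent from every positive instance rather than simply evaluated at candidate concepts as done here:

```python
import numpy as np

def instance_prob(X, d):
    """Pr(d = s | x) = exp(-||x - d||^2) for each row x of X."""
    return np.exp(-np.sum((X - d) ** 2, axis=1))

def diverse_density(d, pos_bags, neg_bags):
    dd = 1.0
    for B in pos_bags:   # noisy-or: at least one instance close to d (Eq. 6.2)
        dd *= 1.0 - np.prod(1.0 - instance_prob(B, d))
    for B in neg_bags:   # every negative instance must be far from d (Eq. 6.3)
        dd *= np.prod(1.0 - instance_prob(B, d))
    return dd

# Toy bags: both positive bags share an instance near (5, 5); the negative
# bag sits near the other positive instances.
pos_bags = [np.array([[0.0, 0.0], [5.0, 5.0]]),
            np.array([[5.1, 4.9], [9.0, 0.0]])]
neg_bags = [np.array([[0.1, 0.1], [9.0, 0.2]])]

# The shared concept near (5, 5) scores far higher than a concept at (0, 0).
assert diverse_density(np.array([5.0, 5.0]), pos_bags, neg_bags) > \
       diverse_density(np.array([0.0, 0.0]), pos_bags, neg_bags)
```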

Since the development of APR and DD, many MIL approaches have been developed and published in the literature. These include prototype-based methods such as the dictionary-based multiple instance learning (DMIL) algorithm [10] and its generalization, generalized dictionaries for multiple instance learning (GDMIL) [11], which optimize the noisy-or model using dictionary learning approaches by learning a set of discriminative positive dictionary atoms to describe the positive class [12,13,14]. The Max-Margin Multiple-Instance Dictionary Learning (MMDL) method [15] adopts the bag-of-words concept [16] and trains a set of linear SVMs as a codebook. The novel assumption of MMDL is that the positive instances may belong to many different categories. For example, the positive class “computer room” may have image patches containing a desk, a screen, and a keyboard. The MILIS algorithm [17] alternates between selecting one instance per bag as a prototype that represents its bag and training a linear SVM on these prototypes.

Additional support vector machine-based methods include the MILES (Multiple-Instance Learning via Embedded Instance Selection) approach [18], which embeds each training and testing bag into a high-dimensional space and then performs classification in that space using a one-norm support vector machine (SVM) [19]. Furthermore, the mi-SVM and MI-SVM methods model the MIL problem as a generalized mixed integer formulation of the support vector machine [20]. The MissSVM algorithm [21] solves the MIL problem using a semi-supervised SVM with the constraint that at least one point from each positive bag must be classified as positive. Hoffman et al. [22] jointly exploit image-level and bounding box labels and achieve state-of-the-art results in object detection. Li and Vasconcelos [23] further investigate the MIL problem with labeling noise in negative bags, use “top instances” as the representatives of “soft bags”, and then proceed with bag-level classification via latent-SVM [24].

Meng et al. [25] integrate self-paced learning (SPL) [26] into MIL and propose SP-MIL for co-saliency detection. The Citation-kNN algorithm [27] adapts the k nearest neighbor (kNN) method [28] to MIL problems by using the Hausdorff distance [29] to compute the distance between two bags and assigns bag-level labels based on nearest neighbor rules. Extensions of Citation-kNN include Bayesian Citation-kNN [30] and Fuzzy-Citation-kNN [31, 32]. Furthermore, a large number of MIL neural network methods, such as [33] (often called “weak” learning methods), have also been developed. Among the vast literature of MIL research, very few methods focus on remote sensing and hyperspectral analysis. These methods are reviewed in the following sections.

6.3 Multiple Instance Learning Approaches for Hyperspectral Target Characterization and Sub-pixel Target Detection

Hyperspectral target detection refers to the task of locating all instances of a target given a known spectral signature within a hyperspectral scene [34,35,36]. Hyperspectral target detection is challenging for a number of reasons: (1) Class Imbalance: The number of training instances from the positive target class is small compared to that of the negative training data, making it difficult to train a standard classifier; (2) Sub-pixel Targets: Due to the relatively low spatial resolution of hyperspectral imagery and the diversity of natural scenes, a single pixel may contain several ground materials, resulting in sub-pixel targets of interest; and (3) Imprecise Labels: As outlined in Sect. 6.1, precise training labels are often difficult to obtain. For these reasons, signature-based hyperspectral target detection [34] is commonly used as opposed to a two-class classifier. However, the performance of a signature-based detector depends on the target signature, and obtaining an effective target signature is challenging. In the past, this was commonly accomplished by measuring target signatures for materials of interest in the lab or using point-spectrometers in the field. However, this approach may introduce error due to changing environmental and atmospheric conditions that impact spectral responses.

In this section, algorithms for multiple instance target characterization (i.e., estimation of target concepts) from training data with label ambiguity are presented. The aim is to estimate the target concepts from highly mixed training data that are effective for target detection. Since these algorithms extract target concepts from training data assumed to have the same environmental context, influence from background materials, environmental and atmospheric conditions are addressed during target concept estimation.

6.3.1 Extended Function of Multiple Instances

The extended Function of Multiple Instances (eFUMI) approach [37, 38] is motivated by the linear mixing model in hyperspectral analysis. eFUMI assumes each data point is a convex combination of target and/or nontarget concepts (i.e., endmembers) and performs linear unmixing (i.e., decomposing spectra into endmembers and the proportion of each endmember found in the associated pixel spectra) to estimate positive and negative concepts. The approach also addresses label ambiguity by incorporating a latent variable indicating whether each instance in a positively labeled bag is a true target.

More formally, the goal of eFUMI is to estimate a target concept, \(\mathbf {d}_T\), nontarget concepts, \(\mathbf {d}_k, \; \forall k = 1, \ldots M\), the number of needed nontarget concepts, M, and the abundances, \(\mathbf {a}_j\), which define the convex combination of the concepts for each data point \(\mathbf {x}_j\) from labeled bags of hyperspectral data. If a bag \(B_i\) is positive, there is at least one data point in \(B_i\) containing target,

$$\begin{aligned} \text {if }L_i = 1, \exists \mathbf {x}_j \in B_i\text { s.t. }\mathbf {x}_j = \alpha _{jT}\mathbf {d}_T + \sum _{k=1}^{M} \alpha _{jk}\mathbf {d}_{k}+\varvec{\varepsilon }_{j}, \alpha _{jT} > 0. \end{aligned}$$
(6.4)

However, the exact number of data points in a positive bag with a target contribution (i.e., \(\alpha _{jT} > 0\)) and target proportions are unknown. Furthermore, if \(B_i\) is a negative bag, this indicates that none of the data in this bag contains target,

$$\begin{aligned} \text {if }L_i = 0, \forall \mathbf {x}_j \in B_i, \mathbf {x}_j = \sum _{k=1}^{M} \alpha _{jk}\mathbf {d}_{k}+\varvec{\varepsilon }_{j}. \end{aligned}$$
(6.5)

Given this framework, the eFUMI objective function is shown in (6.7). The three terms in this objective function were motivated by the sparsity promoting iterated constrained endmember (SPICE) algorithm [39]. The first term computes the squared error between the input data and its estimate found using the current target and nontarget signatures and proportions. The parameter u is a constant controlling the relative importance of various terms. The scaling value w, which aids in the data imbalance issue by weighting the influence of positive and negative data, is shown in (6.6),

$$\begin{aligned} w_{l(\mathbf {x}_j)} = \left\{ \begin{array}{c c} 1, &{} \text {if }l(\mathbf {x}_j) = 0; \\ \frac{\alpha N^-}{N^+}, &{} \text {if } l(\mathbf {x}_j) = 1. \end{array} \right. , \end{aligned}$$
(6.6)

where \(N^+\) is the total number of points in positive bags and \(N^-\) is the total number of points in negative bags.

The second term of the objective encourages target and nontarget signatures to provide a tight fit around the data by minimizing the squared difference between each signature and the global data mean, \(\varvec{\upmu }_0\). The third term is a sparsity promoting term used to determine M, the number of nontarget signatures needed to describe the input data, where \(\gamma _k = \frac{\Gamma }{\sum _{j=1}^N a_{jk}^{(t-1)}}\) and \(\Gamma \) is a constant parameter that controls the degree to which sparsity is promoted. Higher values of \(\Gamma \) generally result in a smaller estimated value of M. The \(a_{jk}^{(t-1)}\) values are the proportion values estimated in the previous iteration of the algorithm. Thus, as the proportions for a particular endmember decrease, the weight of its associated sparsity promoting term increases.

$$\begin{aligned} F = \frac{1}{2}(1-u)\sum _{j=1}^Nw_j\bigg \Vert (\mathbf {x}_j-z_ja_{jT}\mathbf {d}_T-\sum _{k=1}^Ma_{jk}\mathbf {d}_k)\bigg \Vert _2^2+\frac{u}{2}\sum _{k=T,1}^{M}\bigg \Vert \mathbf {d}_k-\varvec{\upmu }_0\bigg \Vert _2^2+\sum _{k=1}^M\gamma _k\sum _{j=1}^Na_{jk} \end{aligned}$$
(6.7)
$$\begin{aligned} E[F]= & {} \sum _{\begin{array}{c} z_j\in \{0,1\} \end{array}} \left[ \frac{1}{2}(1-u)\sum _{j=1}^N w_j P(z_j|\mathbf {x}_j, \varvec{\theta }^{(t-1)})\left\| \mathbf {x}_j - z_ja_{jT}\mathbf {d}_T - \sum _{k=1}^Ma_{jk}\mathbf {d}_k\right\| _2^2\right] \nonumber \\&+\frac{u}{2}\sum _{k=T,1}^M\left\| \mathbf {d}_k-\varvec{\upmu }_0\right\| _2^2 + \sum _{k=1}^M\gamma _k\sum _{j=1}^Na_{jk} \end{aligned}$$
(6.8)

The difference between (6.7) and the SPICE objective is the inclusion of a set of hidden, latent variables, \(z_j, j=1,\ldots , N\), accounting for the unknown instance-level labels \(l(\mathbf {x}_j)\). To address the fact that the \(z_j\) values are unknown, the expected value of the log likelihood with respect to \(z_j\) is taken, as shown in (6.8). In (6.8), \(\varvec{\theta }^{(t)}\) is the set of parameters estimated at iteration t and \(P(z_j|\mathbf {x}_j, \varvec{\theta }^{(t-1)})\) is the probability that an individual point contains any proportion of target or not. The value of the term \(P(z_j|\mathbf {x}_j, \varvec{\theta }^{(t-1)})\) is determined given the parameter set estimated in the previous iteration and the constraints of the bag-level labels, \(L_i\), as shown in (6.9),

$$\begin{aligned}&P(z_j|\mathbf {x}_j, \varvec{\theta }^{(t-1)}) =\left\{ \begin{array}{l l} e^{-\beta r_j}, &{} \text {if } z_j = 0, L_i = 1;\\ 1-e^{-\beta r_j}, &{} \text {if } z_j = 1, L_i = 1;\\ 0, &{} \text {if } z_j = 1, L_i = 0;\\ 1, &{} \text {if } z_j = 0, L_i = 0;\\ \end{array}\right. \end{aligned}$$
(6.9)

where \(\beta \) is a scaling parameter and \(r_j=\left\| \mathbf {x}_j - \sum _{k=1}^Ma_{jk}\mathbf {d}_k\right\| _2^2\) is the approximation residual between \(\mathbf {x}_j\) and its representation using only background endmembers. The definition of \(P(z_j|\mathbf {x}_j, \varvec{\theta }^{(t-1)})\) in (6.9) indicates that if a point \(\mathbf {x}_j\) is a nontarget point, it should be fully represented by the background endmembers with a very small residual \(r_j\); thus, \(P(z_j=0|\mathbf {x}_j, \varvec{\theta }^{(t-1)})=e^{-\beta r_j} \rightarrow 1\). Otherwise, if \(\mathbf {x}_j\) is a target point, it may not be well represented by only the background endmembers, so the residual \(r_j\) will be large and \(P(z_j=1|\mathbf {x}_j, \varvec{\theta }^{(t-1)})=1-e^{-\beta r_j} \rightarrow 1\). Note, \(z_j\) is unknown only for the positive bags; in the negative bags, \(z_j\) is fixed to 0. This constitutes the E-step of the EM algorithm.
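This E-step reduces to computing background-only residuals and mapping them through Eq. (6.9). The helper below is a hypothetical sketch (the names `D_neg` and `A_bg` are illustrative), assuming the background concepts and their abundances are given from the previous M-step:

```python
import numpy as np

def e_step_probs(X, D_neg, A_bg, bag_label, beta=5.0):
    """Return P(z_j = 1 | x_j) of Eq. (6.9) for each instance (column) of X.

    X:     (dim x N) instances from one bag
    D_neg: (dim x M) background endmembers
    A_bg:  (M x N)   background-only abundances for these instances
    """
    # r_j: squared residual of the background-only representation
    resid = np.sum((X - D_neg @ A_bg) ** 2, axis=0)
    if bag_label == 0:
        return np.zeros(X.shape[1])        # negative bags: z_j fixed to 0
    return 1.0 - np.exp(-beta * resid)     # large residual -> likely target
```

A point well explained by the background endmembers gets a probability near 0, while a poorly explained point in a positive bag gets a probability near 1.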

The M-step is performed by optimizing (6.8) for each of the desired parameters. The method is summarized in Algorithm 6.1. Please refer to [37] for a detailed discussion of the optimization approach and derivation.


6.3.2 Multiple Instance Spectral Matched Filter and Multiple Instance Adaptive Coherence/Cosine Detector

The eFUMI algorithm described above can be viewed as a semi-supervised hyperspectral unmixing algorithm, where the endmembers of the target and nontarget materials are estimated. Since eFUMI minimizes the reconstruction error of the data, it is a representative algorithm that learns target concepts that are representatives for (and have similar shape to) the target class. Significant challenges in applying the eFUMI algorithm in practice are the large number of parameters that need to be set and the fact that all positive bags are combined in the algorithm, neglecting the MIL concept that each positive bag contains at least one target instance.

In contrast, the multiple instance spectral matched filter (MI-SMF) and multiple instance adaptive coherence/cosine detector (MI-ACE) [41] learn discriminative target concepts that maximize the SMF or ACE detection statistics; this formulation preserves the bag structure and requires no tuning parameters. These goals are accomplished by optimizing the following objective function,

$$\begin{aligned} \arg \max _{\mathbf {s}} \frac{1}{K^+} \sum _{i: L_i = 1} \Lambda (\mathbf {x}_i^{*}, \mathbf {s}) - \frac{1}{K^-}\sum _{i:L_i = 0}\frac{1}{N_i^-}\sum _{\mathbf {x}_{ij} \in B_i^-} \Lambda (\mathbf {x}_{ij}, \mathbf {s}), \end{aligned}$$
(6.10)

where \(\mathbf {s}\) is the target signature, \(\Lambda (\mathbf {x}, \mathbf {s})\) is the detection statistic of data point \(\mathbf {x}\) given target signature \(\mathbf {s}\), \(\mathbf {x}_{i}^{*}\) is the selected representative instance from the positive bag \(B_i^+\), \(K^+\) is the number of positive bags, and \(K^-\) is the number of negative bags.

$$\begin{aligned} \mathbf {x}_i^{*} = \arg \max _{\mathbf {x}_{ij} \in B_i^+} \Lambda (\mathbf {x}_{ij} , \mathbf {s}). \end{aligned}$$
(6.11)

This general objective can be applied to any target detection statistic. However, consider the ACE detector, \(\Lambda _{ACE}(\mathbf {x}, \mathbf {s})=\frac{\mathbf {s}^T\varvec{\Sigma }^{-1}_b(\mathbf {x}-\varvec{\upmu }_b)}{\sqrt{\mathbf {s}^T\varvec{\Sigma }^{-1}_b\mathbf {s}}\sqrt{(\mathbf {x}-\varvec{\upmu }_b)^T\varvec{\Sigma }^{-1}_b(\mathbf {x}-\varvec{\upmu }_b)}}\), where \(\varvec{\upmu }_b\) is the mean of the background and \(\varvec{\Sigma }_b\) is the background covariance. This detection statistic can be viewed as an inner product in a whitened coordinate space

$$\begin{aligned} \Lambda _{ACE}(\mathbf {x}, \mathbf {s})=\frac{\hat{\mathbf {s}}^T\hat{\mathbf {x}}}{\Vert \hat{\mathbf {s}}\Vert \Vert \hat{\mathbf {x}}\Vert }, \end{aligned}$$
(6.12)

where \(\hat{\mathbf {x}} = \mathbf {V}^{-\frac{1}{2}}\mathbf {U}^T(\mathbf {x}-\varvec{\upmu }_b)\), \(\hat{\mathbf {s}} = \mathbf {V}^{-\frac{1}{2}}\mathbf {U}^T\mathbf {s}\), and \(\mathbf {U}\) and \(\mathbf {V}\) are the eigenvectors and eigenvalues of the background covariance matrix \(\varvec{\Sigma _b}\), respectively. It is clear from Eq. (6.12) that the ACE detector response is the cosine of the angle between a test data point, \(\mathbf {x}\), and the target signature, \(\mathbf {s}\), after whitening. Thus, the objective function (6.10) for MI-ACE can be rewritten as

$$\begin{aligned} \arg \max _{\hat{\mathbf {s}}} \frac{1}{K^+} \sum _{i: L_i = 1} \hat{\mathbf {s}}^T\frac{\hat{\mathbf {x}}_i^{*}}{\Vert \hat{\mathbf {x}}_i^{*}\Vert } - \frac{1}{K^-}\sum _{i:L_i = 0}\frac{1}{N_i^-}\sum _{\mathbf {x}_{ij} \in B_i^-} \hat{\mathbf {s}}^T\frac{\hat{\mathbf {x}}_{ij}}{\Vert \hat{\mathbf {x}}_{ij}\Vert } \quad \text {s.t. } \hat{\mathbf {s}}^T\hat{\mathbf {s}} = 1. \end{aligned}$$
(6.13)

The \(l_2\) norm constraint, \(\hat{\mathbf {s}}^T\hat{\mathbf {s}} = 1\), results from the normalization term in Eq. (6.12). The optimum for (6.13) can be derived by solving the associated Lagrangian optimization problem, yielding the target signature

$$\begin{aligned} \hat{\mathbf {s}} = \frac{\hat{\mathbf {t}}}{\Vert \hat{\mathbf {t}}\Vert }, \quad \text {where } \hat{\mathbf {t}} = \frac{1}{K^+} \sum _{i: L_i = 1} \frac{\hat{\mathbf {x}}_i^{*}}{\Vert \hat{\mathbf {x}}_i^{*}\Vert } - \frac{1}{K^-}\sum _{i:L_i = 0}\frac{1}{N_i^-}\sum _{\mathbf {x}_{ij} \in B_i^-} \frac{\hat{\mathbf {x}}_{ij}}{\Vert \hat{\mathbf {x}}_{ij}\Vert }. \end{aligned}$$
(6.14)

A similar approach can be applied for the spectral matched filter detector,

$$\begin{aligned} \Lambda _{SMF}(\mathbf {x}, \mathbf {s})=\frac{\mathbf {s}^T\varvec{\Sigma }^{-1}_b(\mathbf {x}-\varvec{\upmu }_b)}{\sqrt{\mathbf {s}^T\varvec{\Sigma }^{-1}_b\mathbf {s}}}, \end{aligned}$$
(6.15)

resulting in the following update equation for MI-SMF:

$$\begin{aligned} \hat{\mathbf {s}} = \frac{\hat{\mathbf {t}}}{\Vert \hat{\mathbf {t}}\Vert }, \quad \text {where } \hat{\mathbf {t}} = \frac{1}{K^+} \sum _{i: L_i = 1} \hat{\mathbf {x}}_i^{*} - \frac{1}{K^-}\sum _{i:L_i = 0}\frac{1}{N_i^-}\sum _{\mathbf {x}_{ij} \in B_i^-} \hat{\mathbf {x}}_{ij}. \end{aligned}$$
(6.16)

The MI-SMF and MI-ACE algorithms alternate between two steps: (1) selecting representative instances from each positive bag and (2) updating the target concept \(\mathbf {s}\). MI-SMF and MI-ACE stop when there is no change in the selection of instances from positive bags across subsequent iterations. Similar to [7], since there exists only a finite set of possible selections of positive instances given a finite number of training bags, the convergence of MI-SMF and MI-ACE is guaranteed. In the experiments shown in [41], MI-SMF and MI-ACE generally converged in fewer than seven iterations. The MI-SMF/MI-ACE algorithm is summarized in Algorithm 6.2. Please refer to [41] for a detailed derivation of the algorithm.
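A compact sketch of this alternation is given below, under simplifying assumptions: the data are already whitened (and, for MI-ACE, normalized to unit length), and the initialization heuristic is illustrative rather than the paper's exact procedure:

```python
import numpy as np

def mi_ace(pos_bags, neg_bags, n_iter=50):
    """Sketch of the MI-ACE alternation.

    pos_bags/neg_bags: lists of (n_i x n) arrays of whitened, unit-norm data.
    """
    neg_mean = np.mean([B.mean(axis=0) for B in neg_bags], axis=0)
    # Illustrative initialization: positive instance most separated from background.
    s = max((x for B in pos_bags for x in B), key=lambda x: x @ (x - neg_mean))
    s = s / np.linalg.norm(s)
    prev_sel = None
    for _ in range(n_iter):
        # Step 1: select the max-response representative x_i^* per positive bag.
        sel = [B[np.argmax(B @ s)] for B in pos_bags]
        if prev_sel is not None and all(np.allclose(a, b)
                                        for a, b in zip(sel, prev_sel)):
            break                              # selections unchanged -> converged
        prev_sel = sel
        # Step 2: closed-form update of the unit-norm target concept.
        t = np.mean(sel, axis=0) - neg_mean
        s = t / np.linalg.norm(t)
    return s
```

Step 1 implements the selection of Eq. (6.11); step 2 is the closed-form concept update, and the loop terminates when the selected representatives no longer change.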

6.3.3 Multiple Instance Hybrid Estimator

Both eFUMI and the MI-ACE/MI-SMF methods are limited in that they estimate only a single target concept. However, in many problems, the target class has significant spectral variability [43]. The Multiple Instance Hybrid Estimator (MI-HE) [44, 45] was developed to fill this gap and estimate multiple target concepts simultaneously. The MI-HE algorithm maximizes the responses of the hybrid sub-pixel detector [46] within the MIL framework, which is accomplished by minimizing the following negative log-likelihood objective function:

$$\begin{aligned} J= & {} -\ln \prod _{i=1}^{K^+} \left( \frac{1}{N_i}\sum _{j=1}^{N_i}\Pr (l_{ij}=1|\mathbf {B}_i)^b\right) ^{\frac{1}{b}}\prod _{i=K^++1}^{K}\prod _{j=1}^{N_i}\Pr (l_{ij}=0|\mathbf {B}_{i}) \nonumber \\= & {} -\sum _{i=1}^{K^+}\frac{1}{b}\ln \left( \frac{1}{N_i}\sum _{j=1}^{N_i}\exp \left( -\beta \frac{\Vert \mathbf {x}_{ij}-\mathbf {D}\mathbf {a}_{ij}\Vert ^2}{\Vert \mathbf {x}_{ij}-\mathbf {D}^-\mathbf {p}_{ij}\Vert ^2}\right) ^b\right) \nonumber \\&+\rho \sum _{i=K^++1}^{K}\sum _{j=1}^{N_i}\Vert \mathbf {x}_{ij}-\mathbf {D}^-\mathbf {p}_{ij}\Vert ^2\nonumber \\&+\frac{\alpha }{2}\sum _{i=K^++1}^{K}\sum _{j=1}^{N_i}\left( (\mathbf {D}^+{\mathbf {a}}^+_{ij})^T\mathbf {x}_{ij}\right) ^2, \end{aligned}$$
(6.17)

where the first term corresponds to a generalized mean (GM) term [47], which can approximate the max operation as b approaches \(+\infty \). This term can be interpreted as determining a representative positive instance in each positive bag by identifying the instance that maximizes the hybrid sub-pixel detector (HSD) [46] statistic, \(\exp \left( -\beta \frac{\Vert \mathbf {x}_{ij}-\mathbf {D}\mathbf {a}_{ij}\Vert ^2}{\Vert \mathbf {x}_{ij}-\mathbf {D}^-\mathbf {p}_{ij}\Vert ^2}\right) \). In the HSD, each instance is modeled as a sparse linear combination of target and/or background concepts \(\mathbf {D}\), \(\mathbf {x}\approx \mathbf {D}\mathbf {a}\), where \(\mathbf {D}=\begin{bmatrix}\mathbf {D}^+&\mathbf {D}^-\end{bmatrix}\in \mathbb {R}^{d\times (T+M)}\), \(\mathbf {D}^+ = \left[ \mathbf {d}_{1},\ldots ,\mathbf {d}_{T}\right] \) is the set of T target concepts and \(\mathbf {D}^- = \left[ \mathbf {d}_{T+1},\ldots ,\mathbf {d}_{T+M}\right] \) is the set of M background concepts, \(\beta \) is a scaling parameter, and \(\mathbf {a}_{ij}\) and \(\mathbf {p}_{ij}\) are the sparse representation of \(\mathbf {x}_{ij}\) given the entire concept set \(\mathbf {D}\) and background concept set \(\mathbf {D}^-\), respectively. The second term in the objective function is viewed as the background data fidelity term, which is based on the assumption that minimizing the least squares of all negative points provides a good description of the background. The scaling factor \(\rho \) is usually set to be smaller than one to control the influence of negative bags. The third term is the cross incoherence term (motivated by the Dictionary Learning with Structured Incoherence [48] and the Fisher discrimination dictionary learning (FDDL) algorithm [49, 50]) that encourages positive concepts to have distinct spectral signatures from negative points.
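The HSD statistic at the heart of this objective can be sketched as follows. This is an illustrative simplification: ordinary least squares stands in for the sparse coding MI-HE actually uses to obtain \(\mathbf {a}_{ij}\) and \(\mathbf {p}_{ij}\), and the function name and default \(\beta \) are assumptions.

```python
import numpy as np

def hsd_statistic(x, D_plus, D_minus, beta=5.0):
    """Hybrid sub-pixel detector response for one instance x.
    Sketch: plain least squares stands in for the sparse coding step."""
    D = np.hstack([D_plus, D_minus])
    a, *_ = np.linalg.lstsq(D, x, rcond=None)        # code under full dictionary
    p, *_ = np.linalg.lstsq(D_minus, x, rcond=None)  # code under background only
    r_full = np.linalg.norm(x - D @ a) ** 2
    r_bg = np.linalg.norm(x - D_minus @ p) ** 2
    # high response when the full (target + background) dictionary explains x
    # much better than the background dictionary alone
    return np.exp(-beta * r_full / max(r_bg, 1e-12))
```

A target-containing pixel leaves a large background-only residual and hence a response near 1, while a pure background pixel is reconstructed about equally well by both dictionaries and gets a response near \(e^{-\beta }\).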

figure c

The target concepts in \(\mathbf {D}\) are initialized by computing the means of T random subsets drawn from the union of all positive training bags. The vertex component analysis (VCA) [53] method was applied to the union of all negative bags, and the M extracted endmembers were set as the initial background concepts. The pseudocode of the MI-HE algorithm is presented in Algorithm 6.3.Footnote 3 Please refer to [44] for a detailed optimization derivation.

6.3.4 Multiple Instance Learning for Multiple Diverse Hyperspectral Target Characterizations

The multiple instance learning of multiple diverse characterizations for the SMF (MILMD-SMF) and ACE detectors (MILMD-ACE) [55] is an extension of MI-ACE and MI-SMF that learns multiple target signatures to characterize the variability in hyperspectral target concepts. In contrast to the MI-HE method explained above, the MILMD-SMF and MILMD-ACE methods do not model target and background signatures explicitly. Instead, they focus on maximizing the detection statistics of the positive bags and capturing the characteristics of the training data using a set of diverse target signatures, as shown below:

$$\begin{aligned} {{\mathbf {S}}^{*}}=\underset{\mathbf {S}}{\mathop {\arg \max }}\,\prod \limits _{i}{P(\mathbf {S}|{{B}_{i}},{{L}_{i}}=1)}\prod \limits _{i}{P(\mathbf {S}|{{B}_{i}},{{L}_{i}}=0)}, \end{aligned}$$
(6.18)

where \(\mathbf {S}=\left\{ {\mathbf {s}}^{(1)},{{\mathbf {s}}^{(2)}},\ldots ,{{\mathbf {s}}^{(K)}} \right\} \) is the set of K assumed target signatures and \(P(\mathbf {S}|{{B}_{i}},{{L}_{i}}=1)\) and \(P(\mathbf {S}|{{B}_{i}},{{L}_{i}}=0)\) denote the probabilities given the positive and negative bags, respectively. An equivalent form of (6.18) for multiple target characterization is

$$\begin{aligned} \mathbf {S}^{*}=\underset{\mathbf {S}}{\mathop {\arg \max }}\,\left\{ {{C}_{1}}(\mathbf {S})+{{C}_{2}}(\mathbf {S}) \right\} , \end{aligned}$$
(6.19)
$$\begin{aligned} {{C}_{1}}(\mathbf {S})=\frac{1}{{{N}^{+}}}\sum \limits _{i:{{L}_{i}}=1}{\Omega (D,\ X_{i}^{*},\mathbf {S})}, \end{aligned}$$
(6.20)
$$\begin{aligned} {{C}_{2}}(\mathbf {S})=-\frac{1}{{{N}^{-}}}\sum \limits _{i:{{L}_{i}}=0}{\Upsilon (D,\ {{X}_{i}},\mathbf {S})}, \end{aligned}$$
(6.21)

where \(\Omega (\cdot )\) and \(\Upsilon (\cdot )\) are defined to capture the detection statistics of the positive and negative bags, \(D(\cdot )\) is the detection response of the given ACE or SMF detector, and \(\mathbf {X}_i^*= \{\mathbf {x}^{(1)*}_i,\mathbf {x}^{(2)*}_i,\ldots , \mathbf {x}^{(K)*}_i\}\) is the subset of instances selected from the ith positive bag, each having the maximum detection response for one of the target signatures \(\mathbf {s}^{(k)}\), such that

$$\begin{aligned} x_{i}^{(k)*}=\underset{{{\mathbf {x}}_{n}}\in {{\mathbf {B}}_{i}},{{L}_{i}}=1}{\mathop {\arg \max }}\,\ D({{\mathbf {x}}_{n}},{{\mathbf {s}}^{(k)}}). \end{aligned}$$
(6.22)

The term \(\Omega (D,\ X_{i}^{*},\mathbf {S})\) is the global detection statistics term for the positive bags whose ACE form is shown in

$$\begin{aligned} {{\Omega }_{ACE}}(D,\ X_{i}^{*},\mathbf {S})=\frac{1}{K}\sum \limits _{k}{{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{\hat{\hat{{\mathbf {x}}}}}_{i}^{(k)*}}. \end{aligned}$$
(6.23)

Similar to [41], \({\hat{\hat{{\mathbf {s}}}}}^{(k)}\) and \({\hat{\hat{{\mathbf {x}}}}}^{(k)}\) are the kth target signature and corresponding instance after whitening with the background statistics and normalization. The global detection term \({\Omega }_{ACE}(D,\ X_{i}^{*},\mathbf {S})\) provides the average detection statistic over the positive bags given a set of learned target signatures. Of particular note, in contrast with MI-HE, this approach assumes that each positive bag contains a representative for each variation of the positive concept.

On the other hand, the global detection term \(\Upsilon _{ACE} (D,\ {{X}_{i}},\mathbf {S})\) for negative instances should be small, thereby suppressing the background, as shown in Eq. (6.24). That is, if the maximum responses of the target signature set \(\mathbf {S}\) over the negative instances are minimized, the estimated target concepts can effectively reject nontarget training instances

$$\begin{aligned} {{\Upsilon }_{ACE}}(D,\ {{X}_{i}},\mathbf {S})=\frac{1}{{{N}_{i,{{L}_{i}}=0}}}\sum \limits _{{{\mathbf {x}}_{n}}\in {{\mathbf {B}}_{i}},{{L}_{i}}=0}{\underset{k}{\mathop {\max }}\,\ {{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{\hat{\hat{{\mathbf {x}}}}}_{n}}. \end{aligned}$$
(6.24)

To explicitly enforce the normalization constraint and encourage diversity in the estimated target concepts, [55] also includes two terms: a diversity-promoting term that maximizes the difference between estimated target concepts, and a normalization term that pushes the inner product of each estimated signature with itself to 1, as shown in (6.25) and (6.26), respectively.

$$\begin{aligned} {{C}^{div}}(\mathbf {S})=-\frac{2}{K(K-1)}\sum \limits _{k,l,k\ne l}{{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(l)}}}, \end{aligned}$$
(6.25)
$$\begin{aligned} {{C}^{con}}(\mathbf {S})=-\frac{1}{K}\sum \limits _{k}{\left| {{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)}}-1 \right| }. \end{aligned}$$
(6.26)

Combining the global detection statistics, the diversity-promoting term, and the normalization constraint term, the final cost function is given in (6.27).

$$\begin{aligned} {{C}_{ACE}}= & {} \frac{1}{{{N}^{+}}}\sum \limits _{i:{{L}_{i}}=1}{\sum \limits _{k}{\frac{1}{K}{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{\hat{\hat{{\mathbf {x}}}}}_{i}^{(k)*}}-\frac{1}{{{N}^{-}}}\sum \limits _{i:{{L}_{i}}=0}{\frac{1}{{{N}_{i,{{L}_{i}}=0}}}\sum \limits _{{{\mathbf {x}}_{n}}\in {{\mathbf {B}}_{i}},{{L}_{i}}=0}{\underset{k}{\mathop {\max }}\,\ {{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{{{{\hat{\hat{{\mathbf {x}}}}}}}_{n}}}}}\nonumber \\&-\frac{2\alpha }{K(K-1)}\sum \limits _{k,l,k\ne l}{{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(l)}}}-\frac{\lambda }{K}\sum \limits _{k}{\left| {{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)T}}{{{{\hat{\hat{{\mathbf {s}}}}}}}^{(k)}}-1 \right| }. \end{aligned}$$
(6.27)

The objective for SMF can be derived similarly; the only difference is the use of the training data without normalization. Eq. (6.27) is optimized by gradient descent. Since the \(\max (\cdot )\) and \(|\cdot |\) operators are not differentiable everywhere, the noisy-or function is adopted as a smooth approximation of \(\max (\cdot )\), and a sub-gradient method is used to compute the gradient of \(|\cdot |\). Please refer to [55] for a detailed optimization derivation.
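For concreteness, the combined cost in (6.27) can be evaluated directly with NumPy, assuming the bags have already been whitened and normalized (the hat-hat quantities). This is an illustrative sketch: the function name and default weights are assumptions, and [55] optimizes this cost rather than merely evaluating it.

```python
import numpy as np

def milmd_ace_cost(S, pos_bags, neg_bags, alpha=0.1, lam=0.1):
    """Evaluate the MILMD-ACE cost of Eq. (6.27) for K candidate signatures.
    S is (K x d); bags are lists of (N x d) arrays, all pre-whitened/normalized."""
    K = S.shape[0]
    # positive term: each signature picks its best-matching instance per bag
    pos = np.mean([np.mean([np.max(B @ s) for s in S]) for B in pos_bags])
    # negative term: max response over signatures, averaged per negative bag
    neg = np.mean([np.mean(np.max(B @ S.T, axis=1)) for B in neg_bags])
    # diversity term: average pairwise inner product between signatures
    G = S @ S.T
    div = (np.sum(G) - np.trace(G)) / (K * (K - 1)) if K > 1 else 0.0
    # normalization term: push each signature's squared norm toward 1
    con = np.mean(np.abs(np.diag(G) - 1.0))
    return pos - neg - 2 * alpha * div - lam * con
```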

6.3.5 Experimental Results for MIL in Hyperspectral Target Detection

In this section, several MIL methods are evaluated on both simulated and real hyperspectral detection tasks to illustrate their properties and provide insight into how and when these methods are effective.

For the experiments below, the parameter settings of the comparison algorithms were optimized using a grid search on the first task of each experiment and then applied to the remaining tasks. For example, for the mi-SVM classifier on the Gulfport Brown target task, the \(\gamma \) value of the RBF kernel was first varied from 0.5 to 5 at a step size of 0.5, and then a finer search around the current best value (with the highest AUC) was performed at a step of 0.1. For algorithms with stochastic results, e.g., EM-DD and eFUMI, each parameter setting was run five times and the median performance was selected. Finally, the optimal parameters achieving the highest AUC for the brown target were selected and used for the other three target types.

6.3.5.1 Simulated Data

As discussed in Sect. 6.3.1, the eFUMI algorithm combines all positive bags into one large positive bag and all negative bags into one large negative bag, then learns a target concept from the positive bag that differs from the negative bag. Thus, if the negative bags contain incomplete knowledge of the background, e.g., some nontarget concept appears only in a subset of positive bags, eFUMI will perform poorly. However, discriminative MIL algorithms, e.g., MI-HE, MI-ACE, and MI-SMF, maintain bag structure and can still distinguish the target.

Given this hypothesis, simulated data was generated from four spectra selected from the ASTER spectral library [56]. Specifically, the Red Slate, Verde Antique, Phyllite, and Pyroxenite spectra from the rock class with 211 bands and wavelengths ranging from 0.4 to \(2.5\,\upmu \)m (as shown in Fig. 6.5 in solid lines) were used as endmembers to generate hyperspectral data. Red Slate was labeled as the target endmember.

Fig. 6.5
figure 5

Signatures from ASTER library used to generate simulated data

Table 6.1 List of constituent endmembers for synthetic data with incomplete background knowledge

Four sets of highly mixed noisy data with varied mean target proportion values (\(\varvec{\alpha }_{t\_mean}\)) were generated; a detailed generation process can be found in [37]. Specifically, this synthetic data has 15 positive and 5 negative bags, with each bag containing 500 points. Each positively labeled bag contains 200 highly mixed target points with mean target (Red Slate) proportions of 0.1 to 0.7, respectively, to vary the level of target presence from weak to high. Gaussian white noise was added so that the signal-to-noise ratio of the data was 20 dB. To highlight the ability of MI-HE, MI-ACE, and MI-SMF to leverage individual bag-level labels, different subsets of background endmembers were used to build the synthetic data, as shown in Table 6.1. The negatively labeled bags contain only two negative endmembers, and one confusing background endmember, Verde Antique, appears in the first 5 positive bags. The discriminative MIL algorithms, MI-HE, MI-ACE, and MI-SMF, are therefore expected to perform well in this experimental configuration.

Fig. 6.6
figure 6

MI-HE and comparisons on synthetic data with incomplete background knowledge, \(\varvec{\alpha }_{t\_mean}=0.3\). MI-SMF and MI-ACE are not expected to recover the true signature. a Estimated target signatures for Red Slate and comparison with ground truth. b ROC curves cross validated on test data

The aforementioned MI-HE [44, 45], eFUMI [37, 38], MI-SMF and MI-ACE [41], and DMIL [10, 11] are multiple instance target concept learning methods; mi-SVM [9] provides a comparison with an MIL approach that does not rely on estimating a target signature. Figure 6.6a shows the estimated target signatures from data with a 0.3 mean target proportion value. eFUMI is consistently confused by another nontarget endmember, Verde Antique, that exists in some positive bags but is excluded from the background bags, whereas the other comparison algorithms estimate a target concept close to the ground truth Red Slate spectrum. Note that since MI-ACE and MI-SMF are discriminative concept learning methods that minimize the detection responses of the negative bags, they are not expected to recover the true target signature.

For simulated detection analysis, the target concepts estimated from the training data were applied to test data generated separately following the same procedure. Detection was performed using the HSD [46] or ACE [57] detection statistic. For MI-HE and eFUMI, both detectors were applied, since these two algorithms simultaneously estimate a set of background concepts during training; for MI-SMF, both SMF and ACE were applied, since MI-SMF's objective is maximizing the multiple instance spectral matched filter; for the remaining multiple instance target concept learning algorithms, MI-ACE and DMIL, only ACE was applied. For the testing procedure of mi-SVM, a regular SVM testing process was performed using LIBSVM [58], and the decision values (signed distances to the hyperplane) of the test data determined from the trained SVM model were taken as the confidence values. For the signature-based detectors, the background mean and covariance were estimated from the negative instances of the training data.

Table 6.2 Area under the ROC curves for MI-HE and comparison algorithms on simulated hyperspectral data with incomplete background knowledge. Best results shown in bold, second best results underlined, and ground truth shown with an asterisk

For quantitative evaluation, Fig. 6.6b shows the receiver operating characteristic (ROC) curves using the estimated target signatures, where it can be seen that eFUMI is confused with the Verde Antique test data at very low probability of false alarm (PFA) rates. Table 6.2 shows the area under the curve (AUC) for MI-HE and the comparison algorithms. The results reported are the median over five runs of each algorithm on the same data. From Table 6.2, it can be seen that for MI-HE and MI-ACE, the best detection performance was achieved using the ACE detector, which is quite close to the performance of using the ground truth target signature (denoted by an asterisk). MI-HE's detection using the HSD detector is slightly worse because HSD relies on complete knowledge of the background concepts to properly represent each nontarget test point; the missing nontarget concept (Verde Antique) leaves nontarget test data containing Verde Antique with a relatively large reconstruction error, and thus a large detection statistic.

6.3.5.2 MUUFL Gulfport Hyperspectral Data

The MUUFL Gulfport hyperspectral data set, collected over the University of Southern Mississippi-Gulfpark Campus, was used to evaluate target detection performance across various MIL classification methods. This data set contains \(325\times 337\) pixels with 72 spectral bands corresponding to wavelengths from 367.7 to 1043.4 nm at a \(9.5{-}9.6\) nm spectral sampling interval. The ground sample distance of this hyperspectral data set is 1 m [1]. The first four and last four bands were removed due to sensor noise. Two sets of this data (Gulfport Campus Flight 1 and Gulfport Campus Flight 3) were selected as cross-validated training and testing data, since these two flights have the same altitude and spatial resolution. Throughout the scene there are 64 man-made targets, of which 57 were considered in this experiment: cloth panels of four different colors, Brown (15 examples), Dark Green (15 examples), Faux Vineyard Green (FVGr) (12 examples), and Pea Green (15 examples). The spatial locations of the targets are shown as scattered points over an RGB image of the scene in Fig. 6.7. Some of the targets are in the open and some are occluded by live oak trees. Moreover, the targets vary in size; for each target type, there are targets that are \(0.25\,\mathrm{m}^2\), \(1\,\mathrm{m}^2\), and \(9\,\mathrm{m}^2\) in area, resulting in a very challenging, highly mixed sub-pixel target detection problem.

Fig. 6.7
figure 7

MUUFL Gulfport data set RGB image and the 57 target locations

MUUFL Gulfport Hyperspectral Data, Individual Target Type Detection

For this part of the experiments, each individual target type was treated as the target class in turn. For example, when “Brown” is the target class, a \(5\times 5\) rectangular region around each of the 15 ground truth locations denoted by GPS was grouped into a positive bag to account for GPS drift. This size was chosen based on the accuracy of the GPS device used to record the ground truth locations. The remaining area that does not contain a brown target was grouped into one large negative bag. This constructs the detection problem for the “Brown” target. Similarly, there are 15, 12, and 15 positively labeled bags for Dark Green, Faux Vineyard Green, and Pea Green, respectively.

Fig. 6.8
figure 8

MI-HE and comparisons on Gulfport data, Dark Green, training on flight 3 and testing on flight 1. a Estimated target signatures from flight 3 for Dark Green and comparison with ground truth. b ROC curves cross validated on flight 1

The comparison algorithms were evaluated on this data using the normalized area under the receiver operating characteristic curve (NAUC), in which the area was normalized out to a false alarm rate (FAR) of \(1\times 10^{-3}\) false alarms\(/\mathrm{m}^2\) [59]. During detection on the test data, the background mean and covariance were estimated from the negative instances of the training data. The results reported are the median over five runs of each algorithm on the same data.

Table 6.3 Area under the ROC curves for MI-HE and comparison algorithms on Gulfport data with individual target type. Best results shown in bold, second best results underlined, and ground truth shown with an asterisk

Figure 6.8a shows the target concept estimated by all comparison algorithms for the Dark Green target type, training on flight 3. eFUMI and MI-HE recover target concepts quite close to the ground truth spectrum manually selected from the scene. Figure 6.8b shows the detection ROCs given target spectra estimated on flight 3 and cross validated on flight 1. Table 6.3 shows the NAUCs for all comparison algorithms cross validated on all four target types, where it can be seen that MI-HE generally outperforms the comparison algorithms for most target types and achieves performance close to that of using ground truth target signatures. Since MI-HE is a discriminative target concept learning framework that aims to distinguish one target instance in each positively labeled bag, MI-HE had lower performance for the Pea Green target: the relatively large occlusion of those targets makes it difficult to distinguish the Pea Green signature within each positive bag.

MUUFL Gulfport Hyperspectral Data, All Four Target Types Detection

For training and detection on the four target types together, positive bags were generated by grouping each \(5\times 5\) region around a ground truth location containing any of the four target types. Thus, 57 positive bags were generated from the 57 target points in each flight. The remaining area that does not contain any target was grouped into one large negative bag. Table 6.4 summarizes the NAUCs as a quantitative comparison, which shows that the detection statistic of MI-HE using HSD is significantly better than those of the comparison algorithms.

Table 6.4 Area under the ROC curves for MI-HE and comparison algorithms on Gulfport data with all four target types. Best results shown in bold, second best results underlined, and ground truth shown with an asterisk

6.4 Multiple Instance Learning Approaches for Classifier Fusion and Regression

Although most extensively studied for sub-pixel hyperspectral target detection, the multiple instance learning approach can be used in other hyperspectral applications, including fusion with other sensors and regression, in addition to the two-class classification and detection problems discussed in previous sections. In this section, algorithms for multiple instance classifier fusion and regression are presented and their applications to hyperspectral and remote sensing data analysis are discussed.

6.4.1 Multiple Instance Choquet Integral Classifier Fusion

The multiple instance Choquet integral (MICI) algorithmFootnote 4 [61, 62] is a multiple instance classifier fusion method that integrates different classifier outputs with imprecise labels under the MIL framework. In MICI, the Choquet integral [63, 64] is used to fuse outputs from multiple classifiers or sensors, improving accuracy while accounting for imprecise labels in hyperspectral classification and target detection.

The Choquet integral (CI) is an effective nonlinear information aggregation method based on a fuzzy measure. Assume there exist m sources, \(C=\{c_1,c_2,\ldots ,c_m\}\), for fusion. These “sources” can be the decision outputs of different classifiers or data collected by different sensors. The power set of C is denoted \(2^C\), which contains all possible (crisp) subsets of C. A monotonic and normalized fuzzy measure, \(\mathbf {g}\), is a real-valued function that maps \(2^C \rightarrow [0, 1]\). It satisfies the following properties:

  1. \(g(\emptyset )=0\) (empty set);

  2. \(g(C)=1\) (normalization property);

  3. \(g(A)\le g(B)\) if \(A\subseteq B\) and \(A,B\subseteq C\) (monotonicity property).

Let \(h(c_k;\mathbf {x}_n)\) denote the output of the kth classifier, \(c_k\), on the nth instance, \(\mathbf {x}_n\). The discrete Choquet integral of instance \(\mathbf {x}_n\) given C (m sources) is computed using

$$\begin{aligned} C_\mathbf {g}(\mathbf {x}_n) = \sum _{k=1}^{m}\left[ h(c_k; \mathbf {x}_n) - h(c_{k+1}; \mathbf {x}_n)\right] g(A_k), \end{aligned}$$
(6.28)

where the sources are sorted so that \(h({{c}_{1}};{{\mathbf {x}}_{n}})\ge h({{c}_{2}};{{\mathbf {x}}_{n}})\ge \cdots \ge h({{c}_{m}};{{\mathbf {x}}_{n}})\) and \(h({{c}_{m+1}};{{\mathbf {x}}_{n}})\) is defined to be zero. The fuzzy measure element value \(g(A_k)\) corresponds to the subset \(A_k=\{c_1, c_2, \ldots , c_k\}\).
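For concreteness, Eq. (6.28) can be implemented directly; here the fuzzy measure is represented as a dictionary over subsets of source indices (an illustrative sketch, with names chosen for this example):

```python
import numpy as np

def choquet_integral(h, g):
    """Discrete Choquet integral of Eq. (6.28).
    h: length-m array of source outputs for one instance.
    g: dict mapping frozensets of source indices to fuzzy measure values."""
    order = np.argsort(h)[::-1]            # sort sources in descending order
    h_sorted = np.append(h[order], 0.0)    # h(c_{m+1}) is defined to be zero
    total = 0.0
    for k in range(len(h)):
        A_k = frozenset(order[:k + 1])     # A_k = {c_1, ..., c_k} after sorting
        total += (h_sorted[k] - h_sorted[k + 1]) * g[A_k]
    return total
```

As a sanity check, when every measure element equals 1 the integral reduces to the maximum of the source outputs.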

In a classifier fusion problem, the training data and fusion sources \(h({{c}_{k}};{{\mathbf {x}}_{n}})\), \(\forall k, n\), are known. The desired bag-level labels for the sets of \(C_\mathbf {g}(\mathbf {x}_n)\) values are also known (positive label “+1”, negative label “0”). The goal of the MICI algorithm is then to learn all element values of the unknown fuzzy measure \(\mathbf {g}\) from the training data and the bag-level (imprecise) labels. The MICI method includes three variations that formulate the fusion problem under the MIL framework to address label imprecision: the noisy-or model, the min-max model, and the generalized-mean model.

The MICI noisy-or model follows the Diverse Density formulation (see Sect. 6.2.2) and uses a noisy-or objective function

$$\begin{aligned} \begin{aligned} J_{N}&= \sum _{a=1}^{K^-} \sum _{i=1}^{N^-_a} \ln \left( 1 - \mathscr {N}\left( C_\mathbf {g}(\mathbf {x}_{ai}^-) | \upmu , \sigma ^2 \right) \right) \\&+ \sum _{b=1}^{K^+} \ln \left( 1 -\prod _{j=1}^{N^+_b} \left( 1- \mathscr {N}\left( C_\mathbf {g}(\mathbf {x}_{bj}^+) | \upmu , \sigma ^2 \right) \right) \right) , \end{aligned} \end{aligned}$$
(6.29)

where \(K^+\) denotes the total number of positive bags, \(K^-\) denotes the total number of negative bags, \(N^+_b\) is the total number of instances in positive bag b, and \(N^-_a\) is the total number of instances in negative bag a. Each data point/instance is either positive or negative, as indicated by the following notation: \(\mathbf {x}_{ai}^-\) is the ith instance in the ath negative bag and \(\mathbf {x}_{bj}^+\) is the jth instance in the bth positive bag. The \(C_\mathbf {g}\) is the Choquet integral output given measure \(\mathbf {g}\) computed using (6.28). The \(\upmu \) and \(\sigma ^2\) are the mean and variance of the Gaussian function \(\mathscr {N}(\cdot )\), respectively. In practice, the parameter \(\upmu \) can be set to 1 or a value close to 1 for two-class classifier fusion problems, in order to encourage the CI values of positive instances to be 1 and the CI values of negative instances to be far from 1. The variance of the Gaussian \(\sigma ^2\) controls how sharply the CI values are pushed to 0 and 1, and thus controls the weighting of the two terms in the objective function. By maximizing the objective function (6.29), the CI values of all the points in the negative bag are encouraged to be zero (first term) and the CI values of at least one instance in the positive bag are encouraged to be one (second term), which follows the MIL assumption.

The MICI min-max model applies the min and max operators to the negative and positive bags, respectively. The min-max model follows the MIL formulation without the need to manually set parameters such as the Gaussian variance in the noisy-or model. The objective function of the MICI min-max model is

$$\begin{aligned} J_{M}=\sum _{a=1}^{K^-} \max _{\forall \mathbf {x}_{ai}^- \in \mathbf {B}_a^-} \left( C_\mathbf {g}(\mathbf {x}_{ai}^- ) - 0 \right) ^2 + \sum _{b=1}^{K^+} \min _{\forall \mathbf {x}_{bj}^+ \in \mathbf {B}_b^+} \left( C_\mathbf {g}(\mathbf {x}_{bj}^+)-1\right) ^2, \end{aligned}$$
(6.30)

where \( \mathbf {B}_a^-\) denotes the ath negative bag, and \(\mathbf {B}_b^+\) denotes the bth positive bag. The remaining terms follow the same notation as in (6.29). The first term of the objective function encourages the CI values of all instances in the negative bag to be zero, and the second term encourages the CI values of at least one instance in the positive bag to be one. By minimizing the objective function in (6.30), the MIL assumption is satisfied.
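As a sketch, the min-max objective (6.30) can be evaluated from precomputed Choquet integral values (the function and variable names here are illustrative; in MICI this quantity would be minimized over candidate fuzzy measures):

```python
import numpy as np

def mici_minmax_objective(ci_neg_bags, ci_pos_bags):
    """MICI min-max objective of Eq. (6.30).
    Inputs are lists of 1-D arrays of Choquet integral values C_g(x),
    one array per negative / positive bag."""
    # worst (largest) squared CI value in each negative bag: push toward 0
    neg_term = sum(np.max(ci ** 2) for ci in ci_neg_bags)
    # best instance in each positive bag: push its CI value toward 1
    pos_term = sum(np.min((ci - 1.0) ** 2) for ci in ci_pos_bags)
    return neg_term + pos_term
```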

Instead of selecting only one instance from each bag as a “prime instance” that determines the bag-level label, as the min-max model does, the MICI generalized-mean model allows more instances to contribute toward the classification of bags. The MICI generalized-mean objective function is written as

$$\begin{aligned} J_G = \sum _{a=1}^{K^-} \left[ \frac{1}{N_a^-} \sum _{i=1}^{N_a^-} \left( C_\mathbf {g}(\mathbf {x}_{ai}^- ) - 0 \right) ^{2p_1} \right] ^\frac{1}{p_1} + \sum _{b=1}^{K^+} \left[ \frac{1}{N_b^+} \sum _{j=1}^{N_b^+} \left( C_\mathbf {g}(\mathbf {x}_{bj}^+)-1\right) ^{2p_2} \right] ^\frac{1}{p_2}, \end{aligned}$$
(6.31)

where \(p_1\) and \(p_2\) are the exponential factors controlling the generalized-mean operation. When \(p_1 \rightarrow +\infty \) and \(p_2 \rightarrow -\infty \), the generalized-mean terms become equivalent to the max and min operators, respectively, making the generalized-mean model equivalent to the min-max model. By adjusting the p value, the generalized-mean term can act as various other aggregation operators, such as the arithmetic mean (\(p=1\)) or quadratic mean (\(p=2\)). For another interpretation, when \(p \ge 1\), the generalized-mean can be rewritten as an \(l_p\) norm [65].
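This limiting behavior is easy to verify numerically with a small generalized-mean helper (an illustrative sketch, assuming positive inputs):

```python
import numpy as np

def generalized_mean(x, p):
    """Generalized mean (1/N * sum x_i^p)^(1/p) for positive values x_i."""
    x = np.asarray(x, dtype=float)
    return np.mean(x ** p) ** (1.0 / p)
```

For `x = [1, 2, 4]`, `p=1` gives the arithmetic mean, `p=2` the quadratic mean, and large positive or negative p approach the max and min of the inputs, respectively.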

The MICI models can be optimized by sampling-based evolutionary algorithms, where the element values of the fuzzy measure \(\mathbf {g}\) are sampled and selected through a truncated Gaussian distribution, either based on the valid interval (how much an element value can change without violating the monotonicity property of the fuzzy measure) or based on the number of times a measure element is used across all training instances. A more detailed optimization process and pseudocode for the MICI models can be found in [62, 66]. The MICI models have been used for hyperspectral sub-pixel target detection [61, 62], were effective in fusing multiple detector inputs (e.g., the ACE detector), and yield competitive classification results.

6.4.2 Multiple Instance Regression

Multiple instance regression (MIR) handles multiple instance problems where the prediction values are real-valued instead of binary class labels. MIR methods have been used in the remote sensing literature for applications such as aerosol optical depth retrieval [67, 68] and crop yield prediction [62, 68,69,70].

Prime-MIR, proposed by Ray and Page in 2001 [71], was one of the earliest MIR algorithms. It is based on the “primary instance” assumption: only one primary instance per bag contributes to the real-valued bag-level label. Prime-MIR assumes a linear regression hypothesis, and the goal is to find a hyperplane \(\mathbf {Y}=\mathbf {Xb}\) such that

$$\begin{aligned} \mathbf {b} = \mathop {\arg \min }\limits _b \sum _{i=1}^{n} L \left( y_i,{X}_{ip}, \mathbf {b} \right) , \end{aligned}$$
(6.32)

where \({X}_{ip}\) is the primary instance in bag i, and L is some error function, such as the squared error. An expectation–maximization (EM) algorithm was used to iteratively solve for the ideal hyperplane. First, a random hyperplane was initialized, and for each instance j in each bag i, the error L of instance \({X}_{ij}\) with respect to the hyperplane \(\mathbf {Y}=\mathbf {Xb}\) was computed. In the E-step, the instance with the lowest error L was selected as the “primary instance.” In the M-step, a new hyperplane was constructed by performing multiple regression over all the primary instances selected in the E-step. The two steps were repeated until the algorithm converged, and the best hyperplane was returned. In [71], Prime-MIR showed the benefits of multiple instance regression over ordinary regression, especially when the non-primary instances in a bag were not correlated with the primary instances.
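A minimal sketch of this EM alternation follows (illustrative names and simplifications, not the original implementation from [71]):

```python
import numpy as np

def prime_mir(bags, labels, n_iter=20, seed=0):
    """Toy Prime-MIR: alternate between selecting the primary instance per
    bag (lowest squared error under the current hyperplane) and refitting.
    bags: list of (N_i x d) arrays; labels: length-n real-valued bag labels."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(bags[0].shape[1])   # random initial hyperplane
    y = np.asarray(labels, dtype=float)
    for _ in range(n_iter):
        # E-step: primary instance = lowest squared error under current b
        prim = np.array([B[np.argmin((B @ b - yi) ** 2)]
                         for B, yi in zip(bags, y)])
        # M-step: refit b by least squares over the selected primary instances
        b, *_ = np.linalg.lstsq(prim, y, rcond=None)
    return b
```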

The MI k-NN approach and its variations [72] extend the Diverse Density, k-NN, and Citation-kNN algorithms to real-valued multiple instance learning. The minimal Hausdorff distance from [27] was used to measure the distance between two bags. Given two sets of points \(A = \{a_1, \ldots , a_m\}\) and \(B=\{b_1, \ldots ,b_n\}\), the Hausdorff distance is defined as

$$\begin{aligned} H(A,B) = \max \{h(A,B),h(B,A)\}, \end{aligned}$$
(6.33)

where \(h(A,B)= \max _{a \in A} \min _{b \in B} \left\| a-b \right\| \) and \(\left\| a-b \right\| \) is the Euclidean distance between points a and b. In the MI k-NN algorithm, the prediction for a bag B is the average label of its k closest bags under the Hausdorff metric. In the MI citation-kNN algorithm, the prediction for a bag B is the average label of the R closest bag neighbors of B under the Hausdorff metric and its C-nearest citers, where the “citers” are the bags for which B is one of their C-nearest neighbors. It is generally recommended that \(C=R+2\) [72]. The third variant, a diverse density approach for the real-valued setting, maximizes

$$\begin{aligned} \prod _{i=1}^K Pr(t|B_i) \end{aligned}$$
(6.34)

where \(Pr(t|B_i) = (1-|l_i - Label(B_i|t)|)/Z\), K is the total number of bags, t is the target point, \(l_i\) is the label for the ith bag, and Z is a normalization constant. The results in [72] showed good prediction performance for all three variants on the benchmark Musk Molecules data set [4], but the performance of both the nearest neighbor and diverse density algorithms was sensitive to the number of relevant features, as expected given the sensitivity of the Hausdorff distance to outliers.
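Equation (6.33) and the MI k-NN prediction rule translate directly into code. The following is a minimal sketch; the function names `hausdorff` and `mi_knn_predict` and the bag representation (NumPy arrays of instances) are our own illustrative choices.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two bags of points, Eq. (6.33):
    H(A,B) = max{h(A,B), h(B,A)}, h(A,B) = max_a min_b ||a - b||."""
    def h(X, Y):
        return max(min(np.linalg.norm(x - y) for y in Y) for x in X)
    return max(h(A, B), h(B, A))

def mi_knn_predict(train_bags, train_labels, test_bag, k=3):
    """MI k-NN prediction: average label of the k training bags
    closest to test_bag under the Hausdorff metric."""
    order = sorted(range(len(train_bags)),
                   key=lambda i: hausdorff(test_bag, train_bags[i]))
    return float(np.mean([train_labels[i] for i in order[:k]]))
```

The citation-kNN variant would additionally collect the C-nearest citers of the test bag before averaging.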

A real-valued multiple instance on-line model proposed by Goldman and Scott [73] uses MIR for learning real-valued geometric patterns, motivated by the landmark matching problem in robot navigation and vision applications. This algorithm associates a real-valued label with each point and uses the Hausdorff metric to classify a bag as positive if the points in the bag are within some Hausdorff distance of the target concept points. This algorithm differs from supervised MIR in that standard supervised MIR learns from a given set of training bags and bag-level training labels, while [73] applies an online agnostic model [74,75,76] in which the learner makes predictions as each bag \(\mathbf {B}_t\) is presented at iteration t. Wang et al. [77] also used the idea of online MIR, i.e., using the latest arriving bag with its training label to update the current predictive model. This work was further extended in [78].

A regularization framework for MIR proposed by Cheung and Kwok [79] defines a loss function that takes into consideration both training bags and training instances. The first part of the loss function computes the error (loss) between each training bag’s label and its prediction, and the second part considers the loss between the bag-level prediction and all the instances in the bag. This work still adopted the “primary instance” assumption but simplified it by assuming the primary instance is the instance with the highest prediction output value. This model provided comparable or better performance on the synthetic Musk Molecules data set [72] than citation-kNN [27] and the multiple instance kernel-based SVM [79, 80].

Most MIR methods discussed above only provided theoretical discussions or results on synthetic regression data sets. More recently, MIR methods have been applied to real-world hyperspectral and remote sensing data analysis. Wagstaff et al. [69, 70] investigated using MIR to predict crop yield from remotely sensed data collected over California and Kansas. In [69], a novel method was proposed for inferring the “salience” of each instance with regard to the real-valued bag label. The salience of an instance, i.e., its “relevance” relative to all other instances in the bag for predicting the bag label, is the weight associated with that instance. The salience values were constrained to be nonnegative and to sum to one over the instances in each bag. Like Ray and Page [71], Wagstaff et al. followed the “primary-instance” assumption, but their primary instance, or “exemplar” of a bag, is the weighted average of all the points in the bag instead of a single instance from the bag. Given training bags and instances, the salience values are solved for under a fixed linear regression model; given the estimated salience, the regressor is updated, and the two steps alternate until convergence. This work did not intend to provide predictions over new data, but instead focused on understanding the contents (the salience) of each training instance.

Wagstaff et al. then made use of the learned salience to provide predictions for new, unlabeled bags by proposing the MI-ClusterRegress algorithm (sometimes referred to as the Cluster-MIR algorithm) [70], which maps instances onto (hidden) cluster labels. The main assumption of MI-ClusterRegress is that the instances in a bag are drawn (with noise) from a set of underlying clusters and that one of the clusters is “relevant” to the bag-level labels. After obtaining k clusters for each bag by EM-based Gaussian mixture models (or any other clustering method), a local regression model is constructed for each cluster. MI-ClusterRegress then selects the best-fit model and uses it to predict labels for test bags. A support vector regression learner [81] is used for the regression prediction. Results on simulated data and on crop yield prediction show that modeling the bag structure when such structure (clusters) is present is effective for regression prediction, especially when the cluster number k is equal to or larger than the number actually present in the bags.

In Chap. 2, Moreno-Martínez et al. proposed a kernel distribution regression (KDR) model for MIR by embedding the bag distribution in a high-dimensional Hilbert space and performing standard least squares regression on the mean embedded data. This kernel method exploits the rich structure in bags by considering all higher order moments of the bag distributions and performing regression with the bag distributions directly. It also allows combining bags with different numbers of instances per bag by summarizing the bag feature vectors with a set of mean map embeddings of the instances in each bag. The KDR model was shown to outperform standard regression models such as the least squares regularized linear regression model (RLR) and the (nonlinear) kernel ridge regression (KRR) method for crop yield applications.

Wang et al. [67, 68] proposed a probabilistic and generalized mixture model for MIR based on the primary-instance assumption (sometimes referred to as the EM-MIR algorithm). It is assumed that the bag label is a noisy function of the primary instance, so that the conditional probability \(p(y_i|\mathbf {B}_i)\) for predicting label \(y_i\) for the ith bag depends entirely on the primary instance. A binary random variable \(z_{ij}\) is defined such that \(z_{ij}=1\) if the jth instance in the ith bag is the primary instance and \(z_{ij}=0\) otherwise. The mixture model for each bag i is written as

$$\begin{aligned} p(y_i|\mathbf {B}_i)&= \sum _{j=1}^{N_i} p(z_{ij}=1|\mathbf {B}_i)p(y_i|\mathbf {x}_{ij})\end{aligned}$$
(6.35)
$$\begin{aligned}&= \sum _{j=1}^{N_i} \pi _{ij}p(y_i|\mathbf {x}_{ij}), \end{aligned}$$
(6.36)

where \(\pi _{ij}\) is the (prior) probability that the jth instance in the ith bag is the primary instance, \(p(y_i|\mathbf {x}_{ij})\) is the label probability given the primary instance \(\mathbf {x}_{ij}\), and \(N_i\) is the total number of instances in the ith bag \(\mathbf {B}_i\). The learning problem is therefore transformed into learning the mixture weights \(\pi _{ij}\) and \(p(y_i|\mathbf {x}_{ij})\) from training data, and an EM algorithm is used to optimize the parameters. This work discussed several methods to set the prior \(\pi _{ij}\): as a deterministic function, as a Gaussian function of the prediction deviation, or as a parametric function (in this case a feed-forward neural network). It was shown in [68] that several algorithms discussed above, including Prime-MIR [71] and Pruning-MIR [67], are in fact special cases of the mixture model. The mixture model MIR showed better performance on simulated data, as well as on predicting aerosol optical depth (AOD) from remote sensing data and on crop yield prediction, compared with the Cluster-MIR [70] and Prime-MIR [71] algorithms described above.
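The mixture in Eqs. (6.35)–(6.36) can be made concrete with a small sketch. Here we assume (for illustration only) a linear instance-level regressor with a Gaussian label model and, by default, a uniform prior \(\pi _{ij}\); the function names are our own.

```python
import numpy as np

def bag_likelihood(bag, y, w, sigma=1.0, pi=None):
    """Mixture-model bag likelihood, Eqs. (6.35)-(6.36):
    p(y | B) = sum_j pi_j * N(y; x_j . w, sigma^2)."""
    n = bag.shape[0]
    pi = np.full(n, 1.0 / n) if pi is None else pi  # uniform prior by default
    mu = bag @ w                                    # per-instance predictions
    dens = np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return float(pi @ dens)

def responsibilities(bag, y, w, sigma=1.0, pi=None):
    """E-step quantity: posterior probability that instance j is primary."""
    n = bag.shape[0]
    pi = np.full(n, 1.0 / n) if pi is None else pi
    mu = bag @ w
    dens = np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    r = pi * dens
    return r / r.sum()
```

In an EM fit, the responsibilities would weight the regression update in the M-step; instances whose predictions agree with the bag label receive higher posterior probability of being primary.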

Two baseline methods for MIR, Aggregate-MIR and Instance-MIR, have also been described in [68]. In Aggregate-MIR, a “meta-instance” is obtained for each bag by averaging all the instances in that bag, and a regression model is trained using the bag-level labels and the meta-instances. In Instance-MIR, all instances in a bag are assumed to have the same label as the bag-level label, and a regression model is trained on the pooled instances from all bags. At test time, the label for a test bag is the average of the instance-level predictions in that bag. The Aggregate-MIR and Instance-MIR methods correspond to the “input summary” and “output expansion” approaches described in Chap. 2, Sect. 2.3.1. These two methods are straightforward to implement and have been used as baseline comparison methods for a variety of MIR applications.
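Both baselines reduce to ordinary least squares after a simple rearrangement of the bags, as the sketch below shows. The function names and the use of a plain least-squares fit are illustrative assumptions; any regressor could be substituted.

```python
import numpy as np

def aggregate_mir_fit(bags, labels):
    """Aggregate-MIR: average each bag into one 'meta-instance',
    then fit least squares on the meta-instances."""
    X = np.array([bag.mean(axis=0) for bag in bags])
    w, *_ = np.linalg.lstsq(X, np.asarray(labels, dtype=float), rcond=None)
    return w

def instance_mir_fit(bags, labels):
    """Instance-MIR: give every instance its bag's label and fit
    least squares over the pooled instances."""
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(bag), l) for bag, l in zip(bags, labels)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def instance_mir_predict(bag, w):
    """Instance-MIR test rule: average the instance-level predictions."""
    return float((bag @ w).mean())
```

These few lines explain why the two baselines are so widely used as comparison methods: they require no MIL-specific machinery at all.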

The robust fuzzy clustering for MIR (RFC-MIR) algorithm was proposed by Trabelsi and Frigui [82] to incorporate data structure in MIR. The RFC-MIR algorithm uses fuzzy clustering methods such as the fuzzy c-means (FCM) and possibilistic c-means (PCM) [83] to cluster the instances and fit multiple local linear regression models to the clusters. Similar to Cluster-MIR, the RFC-MIR method combines all instances from all training bags for clustering. However, Cluster-MIR performs clustering in an unsupervised manner without considering bag-level labels, while RFC-MIR uses instance features as well as labels in clustering. Validation results of RFC-MIR show improved accuracy on crop yield prediction and drug activity prediction applications [84], and the possibilistic memberships obtained from the RFC-MIR algorithm can be used to identify the primary and irrelevant instances in each bag.

In parallel with the multiple instance classifier fusion models described in Sect. 6.4.1, a Multiple Instance Choquet Integral Regression (MICIR) model has been proposed to accommodate real-valued predictions for remote sensing applications [62]. The objective function of the MICIR model is written as

$$\begin{aligned} \min \sum \limits _{i=1}^{K}{\left[ \underset{\forall j,{{x}_{ij}}\in {{\mathbf {B}}_{i}}}{\mathop {\min }}\,{{({{C}_{g}}({{\mathbf {x}}_{ij}})-{{o}_{i}})}^{2}} \right] }, \end{aligned}$$
(6.37)

where \(o_i\) is the desired training label for bag \(\mathbf {B}_i\). Note that MICIR is able to fuse real-valued outputs from regression models as well as from classifiers. When \(o_i\) is binary, MICIR reduces to the MICI min-max model for two-class classifier fusion. The MICIR algorithm also follows the primary instance assumption by minimizing the error between the CI value of one primary instance and the given bag-level labels, while allowing imprecision in the other instances. Similar to the MICI classifier fusion models, an evolutionary algorithm can be used to sample the fuzzy measure \(\mathbf {g}\) from the training data.

Overall, Multiple Instance Regression methods have been studied in the literature for nearly two decades, and most studies are based on the primary-instance assumption proposed by Ray and Page in 2001. Most MIR methods that employ a regressor use linear regression models, and experiments have shown effective results of using MIR on crop yield prediction and aerosol optical depth retrieval applications given remote sensing data.

6.4.3 Multiple Instance Multi-resolution and Multi-modal Fusion

Previous MIL classifier fusion and regression methods, such as the MICI and MICIR models, can only be applied if the fusion sources have the same number of data points and the same resolution across multiple sensors. As motivated in Sect. 6.1, in remote sensing applications, sensor outputs often have different resolutions and modalities, such as rasterized hyperspectral imagery versus LiDAR point cloud data. To address multi-resolution and multi-modal fusion under imprecision, the multiple instance multi-resolution fusion (MIMRF) algorithm was developed to fuse multi-resolution and multi-modal sensor outputs while learning from automatically generated, imprecisely labeled data [66, 86].

In multi-resolution and multi-modal fusion, a local region from one sensor may yield a set of candidate points corresponding to a single point from another sensor, due to sensor measurement inaccuracy and differing data resolutions and modalities. Take hyperspectral imagery and LiDAR point cloud fusion as an example: for each pixel \(H_i\) in the HSI imagery, there may exist a set of points \(\{L_{i1}, L_{i2}, \ldots , L_{il}\}\) from the LiDAR point cloud that corresponds to the area covered by the pixel \(H_i\). The MIMRF algorithm first constructs such correspondences by writing the collection of the sensor outputs for pixel i as

$$\begin{aligned} \mathbf {S}_{i} = \begin{bmatrix} H_i &amp; L_{i1} \\ H_i &amp; L_{i2}\\ \vdots &amp; \vdots \\ H_i &amp; L_{il} \end{bmatrix}. \end{aligned}$$
(6.38)

This notation extends to any number of correspondences l by row, and to multiple sensors by column. The MIMRF assumes that at least one of the candidate LiDAR points is accurate, but it is unknown which one. One of the goals of the MIMRF algorithm is to automatically select the points with accurate measurement and correspondence information. To achieve this goal, the CI fusion for the collection of the sensor outputs of the ith negative data point is written as

$$\begin{aligned} C_\mathbf {g}(\mathbf {S}_{i}^-)= \min _{\forall \mathbf {x}_{k}^- \in \mathbf {S}_{i}^- } C_\mathbf {g}(\mathbf {x}_{k}^- ), \end{aligned}$$
(6.39)

and the CI fusion for the collection of the sensor outputs values of the jth positive data point is written as

$$\begin{aligned} C_\mathbf {g}(\mathbf {S}_{j}^+) = \max _{\forall \mathbf {x}_{l}^+ \in \mathbf {S}_{j}^+} C_\mathbf {g}(\mathbf {x}_{l}^+), \end{aligned}$$
(6.40)

where \(\mathbf {S}_{i}^-\) is the collection of sensor outputs for the ith negative data point and \(\mathbf {S}_{j}^+\) is the collection of sensor outputs for the jth positive data point; \(C_\mathbf {g}(\mathbf {S}_{i}^-)\) is the Choquet integral output for \(\mathbf {S}_{i}^-\) and \( C_\mathbf {g}(\mathbf {S}_{j}^+)\) is the Choquet integral output for \(\mathbf {S}_{j}^+\). In this way, the min and max operators automatically select one data point (which is assumed to be the data point with correct information) from each negative and positive bag to be used for fusion, respectively.

Moreover, the MIMRF is designed to handle bag-level imprecise labels. Recall that the MIL framework assumes a bag is labeled positive if at least one instance in the bag is positive and a bag is labeled negative if all the instances in the bag are negative. Thus, the objective function for MIMRF algorithm is proposed as

$$\begin{aligned} \begin{aligned} J&= \sum _{a=1}^{K^-} \max _{\forall \mathbf {S}_{ai}^- \in \mathbf {B}_a^-} \left( C_\mathbf {g}(\mathbf {S}_{ai}^-) - 0 \right) ^2 + \sum _{b=1}^{K^+} \min _{\forall \mathbf {S}_{bj}^+ \in \mathbf {B}_b^+} \left( C_\mathbf {g}(\mathbf {S}_{bj}^+) -1\right) ^2\\&=\sum _{a=1}^{K^-} \boxed {\max _{\forall \mathbf {S}_{ai}^- \in \mathbf {B}_a^-}} \left( \min _{\forall \mathbf {x}_{k}^- \in \mathbf {S}_{ai}^- } C_\mathbf {g}(\mathbf {x}_{k}^- ) - 0 \right) ^2 + \sum _{b=1}^{K^+} \boxed {\min _{\forall \mathbf {S}_{bj}^+ \in \mathbf {B}_b^+}} \left( \max _{\forall \mathbf {x}_{l}^+ \in \mathbf {S}_{bj}^+} C_\mathbf {g}(\mathbf {x}_{l}^+)-1\right) ^2, \end{aligned} \end{aligned}$$
(6.41)

where \(K^+\) is the total number of positive bags and \(K^-\) is the total number of negative bags; \(C_\mathbf {g}\) is the Choquet integral given fuzzy measure \(\mathbf {g}\); \( \mathbf {B}_a^-\) is the ath negative bag and \(\mathbf {B}_b^+\) is the bth positive bag. The term \(\mathbf {S}_{ai}^- \) is the collection of input sources for the ith pixel in the ath negative bag, and \(\mathbf {S}_{bj}^+ \) is the collection of input sources for the jth pixel in the bth positive bag.

In (6.41), the min and max operators outside the squared errors (the boxed terms) are comparable to the MICI min-max model: the max operator encourages the Choquet integral of all the points in a negative bag to be 0 (first term), and the min operator encourages the Choquet integral of at least one point in a positive bag to be 1 (second term), which satisfies the MIL assumption. The min and max operators inside the squared error terms come from (6.39) and (6.40), which select one correspondence from each collection of candidates. By minimizing the objective function in (6.41), the first term pushes the fusion output of all the points in each negative bag toward the desired negative label 0, and the second term pushes the fusion output of at least one point in each positive bag toward the desired positive label \(+1\). This satisfies the MIL assumption while addressing label imprecision for multi-resolution and multi-modal data. The MIMRF algorithm has been used to fuse rasterized hyperspectral imagery and un-rasterized LiDAR point cloud data over urban scenes and has shown effective fusion results for land cover classification [66, 86].
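The nested min/max structure of (6.41) can be sketched directly. In this illustrative sketch, `fuse` stands in for the Choquet integral \(C_\mathbf {g}\) (the actual MIMRF model learns a fuzzy measure, e.g., by an evolutionary algorithm); each bag is a list of correspondence sets S, and each S is a list of candidate sensor-output vectors.

```python
import numpy as np

def mimrf_objective(neg_bags, pos_bags, fuse):
    """Sketch of the MIMRF objective, Eq. (6.41). The inner min/max
    (from Eqs. 6.39-6.40) selects one correspondence per candidate
    set S; the outer max/min encodes the MIL bag-level labels
    (0 for negative bags, 1 for positive bags)."""
    j_neg = sum(
        max((min(fuse(x) for x in S) - 0.0) ** 2 for S in bag)
        for bag in neg_bags)          # all points in a negative bag -> 0
    j_pos = sum(
        min((max(fuse(x) for x in S) - 1.0) ** 2 for S in bag)
        for bag in pos_bags)          # at least one point in a positive bag -> 1
    return j_neg + j_pos
```

The objective is zero exactly when every negative bag contains, for each correspondence set, at least one candidate fusing to 0, and every positive bag contains at least one pixel with a candidate fusing to 1.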

Here is a small example illustrating the performance of the MIMRF algorithm using the MUUFL Gulfport hyperspectral and LiDAR data set collected over the University of Southern Mississippi-Gulfpark Campus [1]. An illustration of the rasterized hyperspectral imagery and the LiDAR data over the complete scene can be seen in Figs. 6.2 and 6.3 in Sect. 6.1. The task here is to fuse hyperspectral and LiDAR data to perform building detection and classification. The simple linear iterative clustering (SLIC) algorithm [87, 88] was used to segment the hyperspectral imagery. The SLIC algorithm is a widely used, unsupervised superpixel segmentation algorithm that can produce spatially coherent regions. Each superpixel from the segmentation is treated as a “bag” in the learning process, and the pixels in each superpixel are the instances in that bag. The bag-level labels in this data set were generated from OpenStreetMap (OSM), a third-party, crowd-sourced online map [89] that provides map information for urban regions around the world. Figure 6.9c shows the map extracted from OSM over the study area based on the available ground cover tags, such as “highway”, “footway”, and “building”. Information from Google Earth [90], Google Maps [91], and geo-tagged photographs from a digital camera taken at the scene were also used as auxiliary data to assist the labeling process. This way, reliable bag-level labels can be automatically generated with minimal human intervention. These bag-level labels are then used in the MIMRF objective function (6.41) to learn the unknown fuzzy measure \(\mathbf {g}\) for HSI-LiDAR fusion. Figure 6.9 shows the RGB imagery, the SLIC segmentation, and the OSM map labels for the MUUFL Gulfport hyperspectral imagery.

Fig. 6.9
figure 9

Source: © [2020] IEEE. Reprinted, with permission, from [86]

The RGB image (a), SLIC segmentation (b), and the OSM map for the MUUFL Gulfport hyperspectral imagery (c). In the OSM map, the blue lines correspond to roads and highways. The magenta lines correspond to sidewalks/footways. The green lines mark buildings. Here, the “building” tag is specific to buildings with a grey (asphalt) roof. The black lines correspond to “other” tags.

Three multi-resolution and multi-modal sensor outputs were used as fusion sources, one generated from the HSI imagery and two from raw LiDAR point cloud data. The first fusion source is the ACE detection map for buildings, based on the mean spectral signature of randomly sampled building points from the scene. The ACE detection map for buildings is shown in Fig. 6.10a. As shown, the ACE confidence map highlights most buildings, but also highlights some roads that have similar spectral signatures (similar construction materials, such as asphalt). The ACE detector also failed to detect the top right building due to the darkness of its roof. The two other fusion sources were generated from LiDAR point cloud data according to the building height profile, with the rasterized confidence maps shown in Fig. 6.10b, c. Note that in MIMRF fusion, the LiDAR sources are point clouds; Fig. 6.10b, c are provided for visualization and comparison purposes only.

Fig. 6.10
figure 10

Source: © [2020] IEEE. Reprinted, with permission, from [86]

The fusion sources generated from HSI and LiDAR data for building detection. a ACE detection map from HSI data. b, c LiDAR building detection map from two LiDAR flights. The colorbar can be seen in d.

As shown in Fig. 6.10, each HSI and LiDAR sensor output contains certain building information. The goal is to use MIMRF to fuse all three sensor outputs and perform accurate building classification. We randomly sampled 50% of the bags (the superpixels) and used these to learn a set of fuzzy measures for the MIMRF algorithm. We conducted such random sampling three times using the MATLAB randperm() function and call these the three random runs. The sampled bags differ between random runs. In each random run, the MIMRF algorithm is applied to learn a fuzzy measure from the randomly sampled 50% of the bags, and fusion results are evaluated on the remaining 50% of the data at the pixel level. Note that there are two sets of results in each run: learning from the first sampled 50% of the bags and performing fusion on the second half of the data (denoted “Half1”), and vice versa (denoted “Half2”). The fusion results of MIMRF were compared with previously discussed MIL algorithms such as MICI and mi-SVM, and with the CI-QP approach. The CI-QP (Choquet integral-quadratic programming) approach [64] is a CI fusion method that learns a fuzzy measure for the Choquet integral by optimizing a least squares error objective using quadratic programming. Note that these comparison methods only work with rasterized LiDAR imagery, while the MIMRF algorithm can directly handle raw LiDAR point cloud data. The fusion results of MIMRF were also compared with commonly used fusion methods, such as the min, max, and mean operators and a support vector machine, as well as with the ACE and LiDAR sensor sources before fusion.

Fig. 6.11
figure 11

An example of ROC curve results for building detection across all methods

Table 6.5 The AUC results of building detection using MUUFL Gulfport HSI and LiDAR data across three random runs. (The higher the AUC, the better.) The best two results with the highest AUC were bolded and underlined, respectively. “Half1” refers to the results of learning a fuzzy measure from the first 50% of the bag-level labels from campus 1 data and performing pixel-level fusion on the second half. “Half2” refers to the results of learning a fuzzy measure from the second 50% of the bag-level labels from campus 1 data and performing pixel-level fusion on the first half. The ACE, Lidar1, and Lidar2 rows show results from the individual HSI and LiDAR sources before fusion; the methods below the dotted line show fusion results for all comparison methods. The standard deviations of the MICI and MIMRF methods are computed across three runs (three random fuzzy measure initializations) and are shown in parentheses. The same notation applies to the RMSE table below as well

Figure 6.11 shows an example of the receiver operating characteristic (ROC) curve results for building detection across all comparison methods. Table 6.5 shows the area under the curve (AUC) results across all methods in all random runs. Table 6.6 shows the root mean square error (RMSE) results across all methods in all random runs. The AUC evaluates how well each method detects the buildings (the higher the AUC, the better), and the RMSE shows how the detection results on both the building and nonbuilding points differ from the ground truth (the lower the RMSE, the better). We observed from the tables that the MIMRF method was able to achieve high AUC detection results and low RMSE compared with the other methods, and that MIMRF is stable across different randomizations. The MICI classifier fusion method also did well in detection (high AUC), but had a higher RMSE than MIMRF, possibly due to MICI’s inability to handle multi-resolution data. The min operator did well in RMSE because it places low confidence everywhere, but it failed to achieve high detection performance. The ACE detector did well in detection, which shows that the hyperspectral signature is effective at distinguishing building roof materials. However, it also places high confidence on other asphalt materials such as roads, and thus yields a high RMSE value.

Table 6.6 The RMSE results of building detection using MUUFL Gulfport HSI and LiDAR data across three random runs. (The lower the RMSE, the better.) The best two results with the lowest RMSE were bolded and underlined, respectively. “Half1” refers to the results of learning a fuzzy measure from the first 50% of the bag-level labels from campus 1 data and performing pixel-level fusion on the second half. “Half2” refers to the results of learning a fuzzy measure from the second 50% of the bag-level labels from campus 1 data and performing pixel-level fusion on the first half. The ACE, Lidar1, and Lidar2 rows show results from the individual HSI and LiDAR sources before fusion; the methods below the dotted line show fusion results for all comparison methods. The standard deviations of the MICI and MIMRF methods are computed across three runs (three random fuzzy measure initializations) and are shown in parentheses
Fig. 6.12
figure 12

a An illustration of the 50% randomly sampled bags from one of the random runs. The MIMRF algorithm learns a fuzzy measure from the semi-transparent bags marked by the red lines. b The ground truth for the other 50% of the data [92]. The yellow and green regions are building and nonbuilding ground truth locations in the “test” data. The dark blue (labeled “–1”) regions denote the 50% of the bags that were used in MIMRF learning and therefore not included in the testing process

Fig. 6.13
figure 13

The fusion results for building detection in the MUUFL Gulfport data set, learned from the randomly drawn bags shown in Fig. 6.12a and evaluated on the remaining regions against the ground truth shown in Fig. 6.12b. Note that the MIMRF method learns a set of fuzzy measures from bag-level data and produces per-pixel fusion results on the fusion regions. The subplots show fusion results by a SVM; b min operator; c max operator; d mean operator; e mi-SVM; f CI-QP; g MICI; h MIMRF. Yellow indicates where the fusion algorithm places high detection confidence, green indicates low confidence, and dark blue indicates the regions not used in the evaluation. This plot uses the same color bar as in Fig. 6.10d. It is desirable that high confidence (yellow) be placed on buildings for building detection. As shown, the MIMRF algorithm in h was able to detect all buildings (yellow) in the regions that were evaluated and placed low confidence (green) on nonbuilding areas. The other comparison methods either missed some buildings or produced many more false positives in non-building regions, such as tree canopy

Figures 6.12 and 6.13 show a qualitative comparison of the fusion performance. Figure 6.12 shows an example of the randomly sampled bags. All the semi-transparent bags marked by the red lines in Fig. 6.12a were used to learn a fuzzy measure, and pixel-level fusion results were evaluated against the “test” ground truth shown in Fig. 6.12b. Note that the MIMRF is a self-supervised method that learns a fuzzy measure from bag-level labels and produces pixel-level fusion results. Although the standard training and testing scheme does not apply here, this experiment is set up using cross validation to show that the MIMRF algorithm is able to take the fuzzy measure learned from one part of the data and apply it to perform fusion on new test data, even when the learned bags were excluded from testing.

Figure 6.13 shows all fusion results on the test regions across all methods. As shown, the MIMRF algorithm in Fig. 6.13h was able to detect all buildings (yellow) in the evaluation regions while placing low confidence (green) on nonbuilding areas. The other comparison methods either missed some buildings or produced many more false positives in non-building regions. Other randomizations yielded similar results.

To summarize, the above experimental results show that the MIMRF method was able to successfully perform detection and fusion with high detection accuracy and low root mean square error on multi-resolution and multi-modal data sets. This experiment further demonstrated the effectiveness of the self-supervised learning approach used by the MIMRF method at learning a fuzzy measure from one part of the data (using only bag-level labels) and performing pixel-level fusion on other regions. Guided by publicly available crowd-sourced data such as OpenStreetMap, the MIMRF algorithm is able to automatically generate imprecise bag-level labels instead of relying on the traditional manual labeling process. Moreover, [86] has shown effective results of MIMRF fusion on agricultural applications as well, in addition to hyperspectral and LiDAR analysis. We envision the MIMRF as an effective fusion method for performing pixel-level classification and producing fusion maps with minimal human intervention for a variety of multi-resolution and multi-modal fusion applications.

6.5 Summary

This chapter introduced the Multiple Instance Learning framework and reviewed MIL methods for hyperspectral classification, sub-pixel target detection, classifier fusion, regression, and multi-resolution multi-modal fusion. Given imprecise (bag-level) ground truth information in the training data, the MIL methods are effective in addressing the inevitable imprecision observed in remote-sensing data and applications.

 

  • Imprecise training labels are omnipresent in hyperspectral image analysis due to unreliable ground truth information, sub-pixel targets, occlusion, and heterogeneous sensor outputs. MIL methods can handle bag-level labels instead of requiring pixel-perfect labels in training, which enables easier annotation and more accurate data analysis.

  • Multiple instance target characterization algorithms were presented, including the eFUMI, MI-ACE/MI-SMF, and MI-HE algorithms. These algorithms can estimate target concepts from the data given imprecise labels, without requiring target signatures a priori.

  • Multiple instance classifier fusion and regression algorithms were presented. In particular, the MICI method is versatile in that it can perform classifier fusion and regression with minor adjustments in the objective function.

  • The MIMRF algorithm extends MICI to multi-resolution and multi-modal sensor fusion on remote sensing data with label uncertainty. To our knowledge, this is the first algorithm that can handle HSI imagery and LiDAR point cloud fusion without co-registration or rasterization while considering imprecise labels.

  • Various optimization strategies exist to optimize an MIL problem, such as expectation maximization, sampling-based evolutionary algorithms, and gradient descent.

 

The algorithms discussed in this chapter cover the state-of-the-art MIL approaches and provide an effective solution to address the imprecision challenges in hyperspectral image analysis and remote-sensing applications. Several challenges in these current approaches warrant future work. For example, current MI regression methods often rely on the “primary instance” assumption, which may not hold in all applications, and MIL assumes no contamination (of positive points) in negative bags, which in practice is often not the case. Future study of more flexible MIL frameworks (such as the kernel embeddings described in Chap. 2) could relax these assumptions.