
1 Introduction

Remote sensing technology has been widely used in various urban and environmental applications, such as land use change monitoring, water quality measurement and vegetation mapping. In general, these applications rely on the classification and detection of targets in remote sensing images. Target detection refers to the process of distinguishing target and non-target areas in an image and can essentially be viewed as a machine learning task: a statistical classification model is learned from a set of positively and negatively labeled data and then used to assign class labels to the remaining unlabeled pixels.

In recent years, with the development of machine learning and image processing technologies, remote sensing object detection methods have achieved relatively good detection results. However, in some applications, we may be interested only in specific target areas and not in other areas, which can mean that no negative labeled data are available [1,2,3,4,5]. For example, if the goal of a project is to detect roads from remote sensing data and update the information of an existing transport system, we may be reluctant to label forests and agricultural areas in the images as negative training data. Moreover, even if we can afford the time and labor cost, it is still difficult to obtain a proper negative training dataset because of the high diversity of negative classes, particularly when high-spatial-resolution images are used. The classification problem in which the training data include only labeled positive training samples (target region) and no labeled negative training samples (non-target region) is called the one-class learning problem in machine learning [6, 7]. For this type of problem, traditional supervised classification methods are usually inefficient because they require every class in the remote sensing image to have labeled training pixels. Thus, it is necessary to develop a stable and efficient remote sensing image target region detection method for cases where the training set contains only positive labeled samples.

At present, two strategies are used in the literature to address the one-class classification problem. The first strategy completely ignores unlabeled data and trains a classifier on only the positive labeled data. Typical approaches of this type include the Gaussian model (GM) [7], the one-class support vector machine (OCSVM) [8, 9] and support vector data description (SVDD) [10]. The GM assumes that the positive data are sampled from a Gaussian distribution. After density estimation on the positive labeled data, the GM discriminates the positive class from the other classes by specifying an appropriate threshold. The disadvantage of the GM is the difficulty of determining a suitable threshold. Moreover, when the feature dimensionality is high, density estimation is usually very difficult. SVDD and OCSVM regard the origin as the only negative training case and find a hypersphere that exactly encloses all positive examples or a hyperplane that separates the positive labeled data from the origin with the maximum margin. The disadvantage of these two methods is that their classification results are sensitive to the parameter values, so careful parameter tuning is required. The second strategy is semi-supervised learning, where unlabeled data are added to the learning process to compensate for the missing negative labeled data. Representative works include the semi-supervised one-class SVM (S2OC-SVM) [11], 1-SVMs [13], the positive and unlabeled learning method (PUL) [3] and the cost-sensitive positive and unlabeled learning method (CSPUL) [13]. S2OC-SVM and 1-SVMs improve the classifier by introducing manifold regularization terms into the learning objective to enforce label smoothness. However, the classification outcome is still sensitive to the parameter values. PUL and CSPUL are state-of-the-art methods for one-class classification. They use an estimated class prior to learn a classifier on the positive and unlabeled data directly, where the unlabeled data play a role similar to that of negative labeled data. However, this two-step strategy makes the classification precision strongly dependent on the class prior estimated in the first step.

In this paper, we propose a regularized positive and unlabeled learning (RPUL) method in which the regularization term is formalized as the Bhattacharyya coefficient (BC). The BC measures the amount of overlap between two statistical samples or populations and is widely used in research on feature extraction and selection, image processing, speaker recognition, and phone clustering. We use the BC to impose an additional restriction on the unknown negative class conditional PDF so that it is as far from the positive class conditional PDF as possible. Since the positive class conditional PDF and the mixture PDF of the positive and negative classes can be estimated from the positive data and the unlabeled data, respectively, such a learning strategy makes it possible to obtain an estimate of the negative class conditional PDF. Moreover, we adopt an implicit mixture model of restricted Boltzmann machines (IRBM) to model the data distribution, which avoids the problem of simultaneously estimating unknown class priors and unknown density functions. Thus, RPUL is established by embedding the BC between the two class conditional densities into the risk function, i.e., the KL divergence between the samples and the IRBM model, as a regularization term.

In contrast to other one-class methods, RPUL makes no assumptions about the data generation mechanism and requires no processing steps to estimate the threshold or class prior. We apply RPUL to classify data extracted from three scenes of a high-spatial-resolution image under the assumption that only positive data and unlabeled data are available for training. The experimental results illustrate the superiority of the proposed method compared with other state-of-the-art strategies.

2 The Proposed Approach

2.1 Preliminaries

Bhattacharyya Coefficient.

The BC between two probability densities \(p_{1} ({\mathbf{v}})\) and \(p_{2} ({\mathbf{v}})\), with \({\mathbf{v}} \in {\text{R}}^{d}\), is defined as

$$ B = \int_{{R^{d} }} {\sqrt {p_{1} ({\mathbf{v}})p_{2} ({\mathbf{v}})} d{\mathbf{v}}.} $$
(1)

Clearly, the value of \(B\) is always confined within the interval \([0,1]\).
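For intuition, the BC of two discrete (e.g., histogram) densities can be estimated directly from normalized bin counts. The following is a minimal sketch; the function name and the example histograms are illustrative only.

```python
import numpy as np

def bhattacharyya_coefficient(p1, p2):
    """Discrete estimate of Eq. (1): sum of sqrt(p1 * p2) over bins.

    p1 and p2 are nonnegative arrays that each sum to one. The result lies
    in [0, 1]: 1 for identical distributions, 0 for disjoint supports.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return float(np.sum(np.sqrt(p1 * p2)))

# Two toy histograms over the same bins.
p_pos = np.array([0.7, 0.2, 0.1, 0.0])
p_neg = np.array([0.0, 0.1, 0.3, 0.6])
print(bhattacharyya_coefficient(p_pos, p_pos))  # 1.0 (complete overlap)
print(bhattacharyya_coefficient(p_pos, p_neg))  # ~0.31 (little overlap)
```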

Implicit mixture model of RBMs (IRBM) [14].

The IRBM is a mixture model of RBMs in which the mixture weights are implicitly parameterized.

Let \({\mathbf{v}} \in {\text{R}}^{d}\) be a vector of visible (observed) variables and \({\mathbf{h}}\) be a vector of hidden variables. Let K be the number of components (classes); in this paper, K = 2 since only two-class situations are considered. Let \({\mathbf{q}}\) be a K-dimensional binary vector with exactly one element equal to one. If \(q_{1} = 1\) and \(q_{2} = 0\), then the current \({\mathbf{v}}\) is a case of the positive class; otherwise, it is a case of the negative class. The energy function of the IRBM is

$$ E({\mathbf{v}},{\mathbf{h}},\,{\mathbf{q}}) = \frac{1}{2}\sum\limits_{i} {\left( {v_{i} - c_{i} } \right)^{2} } - \sum\limits_{j} {h_{j} d_{j} } - \sum\limits_{k} {q_{k} \sum\limits_{{i,j}} {W_{{ijk}} v_{i} h_{j} } } $$
(2)
$$ c_{i} = \sum\limits_{k} {q_{k} C_{{ik}} } ,d_{j} = \sum\limits_{k} {q_{k} D_{{jk}} } $$
(3)

where W, C and D are the weight parameters, the visible unit biases and the hidden unit biases, respectively, and k represents the component index. The joint distribution of the mixture model is

$$ p_{{\rm model}} ({\mathbf{v}}^{s} ,{\mathbf{h}}^{s} ,{\mathbf{q}}^{s} ) = \exp \left( { - E({\mathbf{v}}^{s} ,{\mathbf{h}}^{s} ,{\mathbf{q}}^{s} )} \right)/Z $$
(4)

where

$$ Z = \sum\limits_{{{\mathbf{v}},{\mathbf{h}},{\mathbf{q}}}} {\exp \left( { - E({\mathbf{v}},{\mathbf{h}},{\mathbf{q}})} \right)} $$
(5)

is the partition function of the implicit mixture model. The components of IRBM are standard RBMs. The energy function of the \(k^{{th}}\) component derived from (2) is

$$ E_{k} ({\mathbf{v}},{\mathbf{h}}) = E({\mathbf{v}},{\mathbf{h}},\,q_{k} = 1) $$
(6)

The corresponding distribution function of the \(k^{{th}}\) component is

$$ \begin{gathered} p_{{\rm model}} ({\mathbf{v}}^{s} ,{\mathbf{h}}^{s} |q_{k} = 1) = \exp \left( { - E_{k} ({\mathbf{v}}^{s} ,{\mathbf{h}}^{s} )} \right)/Z_{k} \hfill \\ Z_{k} = \sum\limits_{{{\mathbf{v}},{\mathbf{h}}}} {\exp \left( { - E_{k} ({\mathbf{v}},{\mathbf{h}})} \right)} \hfill \\ \end{gathered} $$
(7)

Let \(\theta = \left\{ {W,C,D} \right\}\) be the set of model parameters. Given a set of N training cases \(\left\{ {{\mathbf{v}}^{1} ,...,{\mathbf{v}}^{N} } \right\}\), the learning process of the IRBM maximizes the log likelihood \(L = \sum\nolimits_{{n = 1}}^{N} {\log p_{{\text{model} }} ({\mathbf{v}}^{n} ;\theta )}\) or, equivalently, minimizes the Kullback–Leibler (KL) distance between the empirical data distribution and the model distribution, \(KL(p_{{data}} ({\mathbf{v}})||p_{{\text{model} }} ({\mathbf{v}}\;;\theta ))\), where \(p_{{data}} ({\mathbf{v}}) = \frac{1}{N}\sum\nolimits_{{n = 1}}^{N} {\delta ({\mathbf{v}} - {\mathbf{v}}^{n} )}\) and \(\delta ({\mathbf{v}} - {\mathbf{v}}^{n} )\) is 1 only when \({\mathbf{v}} = {\mathbf{v}}^{n}\) and 0 otherwise. The IRBM can be trained by a contrastive divergence (CD)-like algorithm that samples the conditional distributions \(p({\mathbf{h}},{\mathbf{q}}|{\mathbf{v}})\) and \(p({\mathbf{v}}|{\mathbf{h}},{\mathbf{q}})\). Sampling \(p({\mathbf{h}},{\mathbf{q}}|{\mathbf{v}})\) is not straightforward and is performed in two steps. First, the K-way discrete distribution \(p({\mathbf{q}}|{\mathbf{v}})\) is computed (see below) and sampled. Then, given \(q_{k} = 1\), the kth component RBM is selected and its conditional distribution \(p({\mathbf{h}}|{\mathbf{v}})\) is sampled. \(p({\mathbf{q}}|{\mathbf{v}})\) is given by

$$ p\left( {q_{k} = 1|{\mathbf{v}}} \right) = \frac{{\exp \left( { - F({\mathbf{v}},q_{k} = 1)} \right)}}{{\sum\nolimits_{m} {\exp \left( { - F({\mathbf{v}},q_{m} = 1)} \right)} }} $$
(8)

where

$$ F({\mathbf{v}},q_{k} = 1) = \frac{1}{2}\sum\limits_{i} {\left( {v_{i} - C_{{ik}} } \right)^{2} } - \sum\limits_{j} {\log \left( {1 + \exp \left( {D_{{jk}} + \sum\limits_{i} {W_{{ijk}} v_{i} } } \right)} \right)} $$
(9)
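To make (8), (9) and the two-step sampling described above concrete, the following is a minimal numerical sketch for the two-component case, assuming Gaussian visible units with unit variance as in (2); the array names and sizes are illustrative and are not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_hidden, K = 4, 3, 2                         # visible dim, hidden dim, components

# Illustrative (random) IRBM parameters: W[i, j, k], C[i, k], D[j, k].
W = rng.normal(scale=0.1, size=(d, n_hidden, K))
C = rng.normal(scale=0.1, size=(d, K))
D = rng.normal(scale=0.1, size=(n_hidden, K))

def free_energy(v, k):
    """F(v, q_k = 1) of the k-th Gaussian-Bernoulli component, Eq. (9)."""
    quadratic = 0.5 * np.sum((v - C[:, k]) ** 2)
    softplus = np.sum(np.log1p(np.exp(D[:, k] + v @ W[:, :, k])))
    return quadratic - softplus

def posterior_q(v):
    """p(q_k = 1 | v), Eq. (8): softmax of the negative free energies."""
    f = np.array([free_energy(v, k) for k in range(K)])
    p = np.exp(-(f - f.min()))                   # shift for numerical stability
    return p / p.sum()

def sample_h_then_v(v):
    """Two-step sampling of p(h, q | v), then reconstruction from p(v | h, q)."""
    q = rng.choice(K, p=posterior_q(v))          # step 1: sample the component
    p_h = 1.0 / (1.0 + np.exp(-(D[:, q] + v @ W[:, :, q])))
    h = (rng.random(n_hidden) < p_h).astype(float)         # step 2: sample hidden units
    v_new = C[:, q] + W[:, :, q] @ h + rng.normal(size=d)  # unit-variance Gaussian visibles
    return q, h, v_new

v = rng.normal(size=d)
print(posterior_q(v))                            # two probabilities summing to one
q, h, v_next = sample_h_then_v(v)                # one alternating sampling step
```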

2.2 Learning Framework

Notation

Let \(\mathcal{Y} = \{ + 1, - 1\}\) be the set of possible labels. Without loss of generality, we suppose that only the first \(l\) cases in \(\left\{ {{\mathbf{v}}^{1} ,...,{\mathbf{v}}^{N} } \right\}\) are labeled as positive (+1) and that the rest are unlabeled. Let \(P = \left\{ {{\mathbf{v}}^{1} ,...,{\mathbf{v}}^{l} } \right\}\) be the set of positive samples, and let \(U = \left\{ {{\mathbf{v}}^{{l + 1}} ,...,{\mathbf{v}}^{N} } \right\}\) be the set of unlabeled samples, with \(u = N - l\) denoting its size.

Method

The goal of our method is to learn the posterior probability function \(p(q_{1} = 1|{\mathbf{v}})\). According to Bayes’ rule,

$$ p(q_{1} = 1|{\mathbf{v}}) = \frac{{p({\mathbf{v}}|q_{1} = 1)p(q_{1} = 1)}}{{p({\mathbf{v}})}}. $$
(10)

Then, the positive class conditional density function \(p({\mathbf{v}}|q_{1} = 1)\), the mixture density \(p({\mathbf{v}})\) and the class prior \(p(q_{1} = 1)\) must be estimated. Since the IRBM is adopted as the data description model, estimating the class prior is replaced by estimating the negative class conditional density function. However, because of the lack of labeled negative data, estimating the negative class conditional density is not straightforward. To address this problem, we introduce the BC to obtain additional information about the negative class conditional density and thereby compensate for the absence of negative labeled data. This approach is reasonable: minimizing the BC between the conditional densities of the two classes, i.e., the amount of overlap, leads to a negative class conditional density that is far from the positive class conditional density. Then, in the area far from the negative data, \(p({\mathbf{v}}|q_{1} = 1)p(q_{1} = 1)\) is approximately equal to \(p({\mathbf{v}})\). Notably, approximating \(p({\mathbf{v}}|q_{1} = 1)p(q_{1} = 1)\) by \(p({\mathbf{v}})\) is also the starting point of the state-of-the-art one-class method [13] for estimating the class prior.

Finally, the proposed framework of RPUL is formulated to minimize

$$ \begin{aligned} J(\theta ) = & \;KL\left( {p_{{{\text{data}}}} ({\mathbf{v}}|q_{1} = 1)||p({\mathbf{v}}|q_{1} = 1;\theta _{1} )} \right) \\ & + KL\left( {p_{{{\text{data}}}} ({\mathbf{v}})||p({\mathbf{v}};\theta )} \right) + \mu B\left( {p({\mathbf{v}}|q_{1} = 1;\theta _{1} ),p({\mathbf{v}}|q_{2} = 1;\theta _{2} )} \right) \\ \end{aligned} $$
(11)

where \(\theta = \{ \theta _{1} ,\theta _{2} \}\) is the set of model parameters and \(\theta _{k}\) is the set of parameters of the kth component of the IRBM. KL(•) is the Kullback–Leibler divergence. The first two terms on the right-hand side measure the degree of fit between the positive data and the first (positive) component of the IRBM and the degree of fit between the unlabeled data and the complete IRBM, respectively. The final term is the BC regularization term, which ensures that the second component of the IRBM captures the negative class conditional density precisely, as discussed in the preceding analysis. The trade-off between the data-fit terms and the regularization term is controlled by the positive parameter \(\mu\), which is fixed at 0.1 in this paper.

Solution

As in the training process of the IRBM, gradient descent is employed to solve optimization problem (11). To keep the notation concise, the three terms on the right-hand side of (11) are denoted by \(KL_{1} (\theta _{1} )\), \(KL_{2} (\theta )\), and \(B(\theta )\), and we write \(f({\mathbf{v}};\theta _{1} ) = p({\mathbf{v}}|q_{1} = 1;\theta _{1} )\) and \(g({\mathbf{v}};\theta _{2} ) = p({\mathbf{v}}|q_{2} = 1;\theta _{2} )\) for the two class conditional densities. Given the samples \({\mathbf{v}}^{s} \in U\), the estimate of \(B(\theta )\) is computed by

$$ B(\theta ) = \sum\nolimits_{{{\mathbf{v}}^{s} }} {\sqrt {f({\mathbf{v}}^{s} ;\theta _{1} )g({\mathbf{v}}^{s} ;\theta _{2} )} } . $$
(12)

Then, the derivative of \(B(\theta )\) with respect to \(\theta _{k}\) is

$$ \frac{{\partial B}}{{\partial \theta _{k} }} = \frac{{ - 1}}{{2\sqrt {p(q_{1} = 1)p(q_{2} = 1)} }}\left[ \begin{gathered} \sum\nolimits_{{{\mathbf{v}}^{s} }} {p({\mathbf{v}}^{s} )\sqrt {p(q_{1} = 1|{\mathbf{v}}^{s} )p(q_{2} = 1|{\mathbf{v}}^{s} )} \frac{{\partial F_{k} ({\mathbf{v}}^{s} )}}{{\partial \theta _{k} }}} \hfill \\ - \left( {\frac{{\sum\nolimits_{{{\mathbf{v}}^{s} }} {p({\mathbf{v}}^{s} )\sqrt {p(q_{1} = 1|{\mathbf{v}}^{s} )p(q_{2} = 1|{\mathbf{v}}^{s} )} } }}{{\sum\nolimits_{{\mathbf{v}}} {p({\mathbf{v}})p(q_{k} = 1|{\mathbf{v}})} }}} \right) \hfill \\ \left( {\sum\nolimits_{{\mathbf{v}}} {p({\mathbf{v}})p(q_{k} = 1|{\mathbf{v}})\frac{{\partial F_{k} ({\mathbf{v}})}}{{\partial \theta _{k} }}} } \right) \hfill \\ \end{gathered} \right] $$
(13)

where \(\theta\) is omitted for brevity and \(F_{k} ({\mathbf{v}}) = F({\mathbf{v}},q_{k} = 1)\). Computing the terms of (13) that involve sums over \({\mathbf{v}}\) exactly would require summing over the joint space of all possible visible variables, which is intractable. Fortunately, we can address this problem with the CD learning algorithm, which has been found to be effective for training a variety of energy-based models. Based on the CD algorithm, we sample from the mixture density \(p({\mathbf{v}})\) to compute the corresponding expectation terms and then obtain the approximation of the derivative of \(B(\theta )\):

$$ \frac{{\partial B\left( \theta \right)}}{{\partial \theta _{k}^{n} }} \approx \frac{{ - 1}}{{2\sqrt {p(q_{1} = 1)p(q_{2} = 1)} }}\left[ \begin{gathered} \sum\limits_{{s = l}}^{{l + u}} {p_{{data}} ({\mathbf{v}}^{s} )\sqrt {p(q_{1} = 1|{\mathbf{v}}^{s} )p(q_{2} = 1|{\mathbf{v}}^{s} )} \frac{{\partial F_{k} ({\mathbf{v}}^{s} )}}{{\partial \theta _{k}^{n} }}} \hfill \\ - \left( {\frac{{\sum\limits_{{s = l}}^{{l + u}} {p_{{data}} ({\mathbf{v}}^{s} )\sqrt {p(q_{1} = 1|{\mathbf{v}}^{s} )p(q_{2} = 1|{\mathbf{v}}^{s} )} } }}{{\mathop \sum \limits_{{s = l}}^{{l + u}} p\left( {({\mathbf{v}}^{s} )^{ - } } \right)p(q_{k} = 1|({\mathbf{v}}^{s} )^{ - } )}}} \right) \hfill \\ \left( {\mathop \sum \limits_{{s = l}}^{{l + u}} p\left( {({\mathbf{v}}^{s} )^{ - } } \right)p(q_{k} = 1|({\mathbf{v}}^{s} )^{ - } )\frac{{\partial F_{k} (({\mathbf{v}}^{s} )^{ - } )}}{{\partial \theta _{k}^{n} }}} \right) \hfill \\ \end{gathered} \right] $$
(14)

where \(({\mathbf{v}}^{s} )^{ - }\) is obtained in the negative phase, i.e., it contains the values of the visible variables after M steps of alternating sampling from \(p({\mathbf{h}},{\mathbf{q}}|{\mathbf{v}})\) and \(p({\mathbf{v}}|{\mathbf{h}},{\mathbf{q}})\). Moreover, given \({\mathbf{v}}^{s}\), if the sampled \(q_{1} = 1\), we let \(s_{k} = 1\); otherwise, we let \(s_{k} = 2\). Similarly, given \(({\mathbf{v}}^{s} )^{ - }\), we obtain the value of \(\left( {s_{k} } \right)^{ - }\). The derivative of \(F_{k}\) in (14) can be computed approximately as follows:

$$ \frac{{\partial F_{k} ({\mathbf{v}}^{s} )}}{{\partial W_{{ijk}} }} = - p\left( {h_{j} |{\mathbf{v}}^{s} ,q_{k} = {\text{1}}} \right)v_{i}^{s} \approx \left\{ {\begin{array}{*{20}l} { - h_{j}^{s} v_{i}^{s} ,} \hfill & {k = s_{k} } \hfill \\ {0,} \hfill & {k \ne s_{k} } \hfill \\ \end{array} } \right., $$
(15)
$$ \frac{{\partial F_{k} (({\mathbf{v}}^{s} )^{ - } )}}{{\partial W_{{ijk}} }} = - p\left( {h_{j} |({\mathbf{v}}^{s} )^{ - } ,q_{k} = {\text{1}}} \right)(v_{i}^{s} )^{ - } \approx \left\{ {\begin{array}{*{20}l} { - (h_{j}^{s} )^{ - } (v_{i}^{s} )^{ - } ,} \hfill & {k = s_{k}^{ - } } \hfill \\ {0,} \hfill & {k \ne s_{k}^{ - } } \hfill \\ \end{array} ,} \right. $$
(16)
$$ \frac{{\partial F_{k} ({\mathbf{v}}^{s} )}}{{\partial C_{{ik}} }} = \left\{ \begin{array}{ll} - v_{i}^{s} + c_{i}, & k = s_{k} \hfill \\ {\text{0, }}& k \ne s_{k} \hfill \\ \end{array} \right., $$
(17)
$$ \frac{{\partial F_{k} (({\mathbf{v}}^{s} )^{ - } )}}{{\partial C_{{ik}} }} = \left\{ {\begin{array}{*{20}l} { - (v_{i}^{s} )^{ - } + c_{i} ,} \hfill & {k = s_{k}^{ - } } \hfill \\ {0,} \hfill & {k \ne s_{k}^{ - } } \hfill \\ \end{array} } \right. $$
(18)
$$ \frac{{\partial F_{k} ({\mathbf{v}}^{s} )}}{{\partial D_{{jk}} }} = - p\left( {h_{j} |{\mathbf{v}}^{s} ,q_{k} = {\text{1}}} \right) \approx \left\{ {\begin{array}{*{20}l} { - h_{j}^{s} ,} \hfill & {k = s_{k} } \hfill \\ {{\text{0,}}} \hfill & {k \ne s_{k} } \hfill \\ \end{array} } \right., $$
(19)
$$ \frac{{\partial F_{k} (({\mathbf{v}}^{s} )^{ - } )}}{{\partial D_{{jk}} }} = - p\left( {h_{j} |({\mathbf{v}}^{s} )^{ - } ,q_{k} = {\text{1}}} \right) \approx \left\{ \begin{array}{ll} - (h_{j}^{s} )^{ - } , & k = s_{k}^{ - } \hfill \\ {\text{0, }} & k \ne s_{k}^{ - } \hfill \\ \end{array} \right. $$
(20)
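For a single case, the approximations (15)-(20) reduce to an outer product and bias terms that are zeroed out for the component that was not sampled. A minimal sketch (using the same illustrative array layout as in the earlier sketch) is given below; it applies equally to \({\mathbf{v}}^{s}\) with \(s_{k}\) and to \(({\mathbf{v}}^{s})^{-}\) with \((s_{k})^{-}\).

```python
import numpy as np

def grad_F(v, h, s_k, C, K):
    """Approximate derivatives of F_k w.r.t. W, C and D, Eqs. (15)-(20).

    v: visible vector, h: sampled hidden vector, s_k: index of the sampled
    component. The gradients of the non-selected component remain zero.
    """
    d, n_hidden = v.shape[0], h.shape[0]
    dW = np.zeros((d, n_hidden, K))
    dC = np.zeros((d, K))
    dD = np.zeros((n_hidden, K))
    dW[:, :, s_k] = -np.outer(v, h)       # Eqs. (15)/(16)
    dC[:, s_k] = -v + C[:, s_k]           # Eqs. (17)/(18), with c_i = C_{i, s_k}
    dD[:, s_k] = -h                       # Eqs. (19)/(20)
    return dW, dC, dD

# Toy usage with random values standing in for a sampled case.
rng = np.random.default_rng(2)
v = rng.normal(size=4)
h = (rng.random(3) < 0.5).astype(float)
C = rng.normal(size=(4, 2))
dW, dC, dD = grad_F(v, h, s_k=0, C=C, K=2)
```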

The derivative of \(KL_{1} (\theta _{1} )\) with respect to \(\theta _{1}\) and the derivative of \(KL_{2} (\theta )\) with respect to \(\theta\) can be computed with the CD algorithm, as described in the preliminaries. After the derivatives are computed, the parameters of our model are iteratively updated as follows:

$$ \theta _{{new}} = \theta _{{old}} - \eta \Delta \theta , $$
(21)

where \(\eta\) is the learning rate and

$$ \Delta \theta = \frac{{\partial (\mu B + KL_{1} + KL_{2} )}}{{\partial \theta }}. $$
(22)

Finally, for any given sample \({\mathbf{v}}\), following Bayes’ decision theory, the label is positive if \(p\left( {q_{1} = 1|{\mathbf{v}}} \right) > p\left( {q_{2} = 1|{\mathbf{v}}} \right)\) and negative otherwise, where the posteriors are computed via formula (8).

Note that the computation of (22) simply involves applying the CD algorithm to the P set and the U set, so the time complexity of the proposed method is the same as that of the IRBM.

3 Experiments

In this section, we investigate the performance of the proposed RPUL for one-class classification of remote sensing data. The cost-sensitive positive and unlabeled learning method (CSPUL) proposed in [13] is a state-of-the-art method for the same positive/unlabeled scenario, and the Gaussian domain descriptor (GDD) is a commonly used one-class classifier. Hence, these methods are compared with the proposed RPUL in our experiments.

3.1 Dataset Description

The initial dataset used in this paper was RIT-18 [15, 16], which is composed of very-high-resolution aerial photographs (4.7 cm GSD) acquired by an unmanned aircraft system (see Fig. 1). The dataset includes 6 VNIR spectral bands and 18 labeled object classes. The 2nd, 14th, 15th and 16th classes were chosen as positive classes in this paper because they are the first four classes with a sufficient number of pixels (at least 1% of the total). The size of the photographs is 9393 × 5642 pixels, for a total of 52,995,306 pixels. We slid a 5 × 5 pixel template over the image and extracted 88 features for each pixel, including the mean, variance, homogeneity, contrast, and second moment of the six bands. All features were rescaled to the range [0, 1].
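For illustration, the sketch below computes two of the listed statistics (per-band mean and variance) in a sliding 5 × 5 window and rescales them to [0, 1]; the texture measures (homogeneity, contrast, second moment) are GLCM-based and are omitted here, so this is only a simplified subset of the full 88-dimensional feature vector, and the array shapes are illustrative.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def window_mean_var(image, win=5):
    """Per-band mean and variance in a sliding win x win window.

    image: (H, W, bands) array. Returns an (H, W, 2 * bands) feature array,
    rescaled to [0, 1] per feature as in the experiments.
    """
    feats = []
    for b in range(image.shape[2]):
        band = image[:, :, b].astype(float)
        mean = uniform_filter(band, size=win)
        mean_sq = uniform_filter(band ** 2, size=win)
        var = np.maximum(mean_sq - mean ** 2, 0.0)
        feats.extend([mean, var])
    feats = np.stack(feats, axis=-1)
    fmin = feats.min(axis=(0, 1), keepdims=True)
    fmax = feats.max(axis=(0, 1), keepdims=True)
    return (feats - fmin) / np.maximum(fmax - fmin, 1e-12)

# Toy example: a random 6-band image patch.
rng = np.random.default_rng(3)
features = window_mean_var(rng.random((64, 64, 6)))   # shape (64, 64, 12)
```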

RPUL and CSPUL require positive and unlabeled data for training, whereas GDD and SVDD require only positive data. In general, more labeled training data results in higher accuracy but also increases the required labeling effort. In our experiments, for each class extraction, we randomly selected only 50 pixels of a class as labeled positive training samples: the labeled pixels constituted less than \(9 \times 10^{-5}\)% of the entire image. Additionally, for RPUL and CSPUL, we randomly selected an additional 1000 pixels from the entire image as the unlabeled dataset. As mentioned in the introduction, the classification results of GDD strongly depend on the tuned model parameters: high classification accuracy on the testing dataset is difficult to guarantee if these parameters are tuned with only positive data. To investigate the optimal performance, we used 1000 randomly selected background pixels of other classes in addition to the previously prepared positive labeled samples to tune the parameters. Finally, the remaining pixels of the photographs formed the test dataset. Moreover, to obtain statistically reliable results, ten different random realizations of the training data were considered for each classification, and the classification results were evaluated in terms of the overall accuracy (OA), F-measure (F), recall (R) and kappa coefficient (K) [17].
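The four evaluation measures can be computed from the binary confusion matrix; a minimal sketch (with the positive class encoded as 1) follows.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Overall accuracy, F-measure, recall and kappa for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    n = tp + tn + fp + fn
    oa = (tp + tn) / n
    recall = tp / max(tp + fn, 1)
    precision = tp / max(tp + fp, 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    # Expected agreement by chance, then Cohen's kappa.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    return oa, f_measure, recall, kappa

oa, f, r, k = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```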

Fig. 1.

RGB visualization of the RIT-18 dataset. This dataset has six spectral bands.

3.2 Model Development

RPUL.

The RPUL model was developed in MATLAB. We used models with 200 hidden (latent) variables. The value of the parameter \(\mu\) in (11) was fixed at 0.1, the learning rate in (21) was set to \(10^{-3}\), and the weight decay was set to \(10^{-2}\). A momentum term was also used: 0.9 of the previously accumulated gradient was added to the current gradient. As in the training process of the IRBM, a temperature parameter was introduced to scale the free energies; it was set to 100. We trained the model using all samples in both the P set and the U set until the class labels of the data no longer changed or the number of iterations reached 2000.
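A minimal sketch of one parameter update under these settings is given below; `grad` stands for the current gradient from (22), the arrays are placeholders rather than the actual RPUL parameters, and the placement of the weight-decay term is one common choice.

```python
import numpy as np

lr, weight_decay, momentum = 1e-3, 1e-2, 0.9

# Placeholder parameter and gradient; in RPUL these would be W, C, D and the
# gradient of (22) estimated by the CD-based procedure described above.
theta = np.zeros((4, 3))
velocity = np.zeros_like(theta)
grad = np.random.default_rng(4).normal(size=theta.shape)

velocity = momentum * velocity + grad + weight_decay * theta   # accumulated gradient
theta = theta - lr * velocity                                  # update rule (21)
```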

CSPUL.

The CSPUL model was implemented in MATLAB. We used a Gaussian radial basis function (RBF) kernel and followed the empirical approach in [6] to tune the parameters. The number of basis functions was set to 300. The regularization parameter was tuned in the range [−3, 10] on a log scale with a step size of 1. The kernel width was tuned over a range computed by first estimating the median of the distances from all samples to randomly selected centroids and then multiplying this median by the numbers in the interval [−2, 10] on a log scale with a step size of 1. Moreover, CSPUL requires the class prior to be known in advance. We used the method in [13] to estimate the class prior, with the parameters tuned under the same settings as those used for CSPUL.
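A sketch of how the kernel-width and regularization grids described above could be constructed (median of sample-to-centroid distances, scaled by log-spaced multipliers) is given below; the sample counts and feature dimension are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.random((1050, 12))                                   # all training samples (features)
centroids = X[rng.choice(len(X), size=300, replace=False)]   # randomly selected basis centers

# Median of all sample-to-centroid distances.
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
median_dist = np.median(dists)

# Candidate kernel widths and regularization values on a log scale.
sigma_grid = median_dist * 10.0 ** np.arange(-2, 11)   # multipliers 1e-2 ... 1e10
lambda_grid = 10.0 ** np.arange(-3, 11)                # regularization 1e-3 ... 1e10
```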

GDD.

The GDD model was implemented via dd_tools. We used the simple Gaussian target distribution and tuned two parameters: the error on the target class in the range [0.1, 1] with a step size of 0.1 and the regularization parameter in the range [0.1, 1] with a step size of 0.1. As with SVDD, only the samples in the P set were used to train the classifier with the tuned parameters.

3.3 Experimental Results

Every experiment was repeated ten times with randomly selected positive and unlabeled samples. Figure 2 shows the classification maps from one run of the experiments on the image in Fig. 1 for each land type, where (a) shows the benchmark, i.e., the true pixel labels, and (b), (c) and (d) show the classification results of RPUL, CSPUL and GDD, respectively. In general, RPUL provides the best classification results in the extraction of a single land type from the aerial photograph. Note that these results are obtained with only 50 positive labeled pixels and no negative labeled pixels for training. Therefore, with the help of the regularization term, RPUL can learn additional information about the unknown negative class from the positive and unlabeled samples and construct a proper classifier even without labeled negative training samples. CSPUL also provides relatively good results, particularly for water areas, but GDD produces poor results. Both RPUL and CSPUL used unlabeled samples to build the classifier, which may be the reason that they achieve better classification results than GDD. Moreover, CSPUL is slightly inferior to RPUL. The main reason is likely that the distribution of the positive class in the training set is not identical to its distribution in the unlabeled set, since we selected only 50 positive samples as labeled samples; therefore, CSPUL might not obtain an optimal estimate of the class prior for training the classifier. Table 1 compares the accuracy, F-measure, recall and kappa coefficient of the three methods for the different land types. The results in Table 1 show that RPUL and CSPUL achieved similar, and the best, evaluation values, whereas GDD provided the worst classification results, even with its parameters tuned on the set of additional negative labeled samples and positive labeled samples.

Fig. 2.

Prediction maps of each land type. From the first row to the last row, prediction maps of tree, grass and water. White: positive; black: negative.

Table 1. The accuracy (OA), F-measure (F), recall (R) and kappa coefficient (K) of RPUL, CSPUL and GDD for all land types

4 Conclusion

In this paper, we addressed the problem of one-class classification of remote sensing data by proposing a new BC-based positive and unlabeled learning algorithm. In contrast to other one-class methods, the proposed method makes no assumptions about the data generation mechanism and does not need a processing step to estimate the threshold or the class prior. Moreover, the proposed method is a semi-supervised learning method that requires only a small set of labeled positive data for classifier training. The experimental results indicate that the new algorithm achieves high classification accuracy, outperforming the CSPUL, SVDD, and GDD methods. In future work, we will apply the learning strategy to a generative adversarial network to further improve the performance of PUL methods.