1 Introduction

Land use and land cover (LULC) information extraction is a key step in national geographic state surveying and monitoring. To describe LULC in more detail, the NASG of China has proposed about ten first-level classes, 46 second-level classes, and some third-level classes. High spatial resolution images offer abundant features, but they also exhibit large intra-class variation and low inter-class variation, which makes processing and using them even more challenging. The widely used eCognition software offers a series of object-oriented image classification tools, but the uncertainty of segmentation makes them very difficult to use in surveying projects. In real applications, operators therefore work with high-resolution remote sensing images such as GF-1 or UAV images and carry out heavy manual interpretation work.

Practice has shown that the segmentation problem remains very difficult. Researchers have proposed many image segmentation methods, such as MST (Felzenszwalb and Huttenlocher 2004), mean shift (Comaniciu and Meer 2002), watershed (Beucher and Meyer 1992), and graph cut (Boykov and Jolly 2001). Each of these methods requires segmentation parameters to be set, and a poor choice may lead to over-segmentation or under-segmentation. How can the segmentation accuracy be improved? Can segmentation and classification be combined to obtain optimized segment boundaries while also improving the classification accuracy? Machine learning and computer vision technology make this possible. Accumulated manual interpretation results provide massive training samples, which are a very useful resource for learning segmentation and classification models, and multi-scale and context features are important clues for recognizing objects. In this paper, we adopt the Conditional Random Field (CRF), which can model these features and the relationships among objects, to realize simultaneous segmentation and classification.

2 Related Work

In computer vision and object recognition, methods can be divided into two types according to their outputs. The first is object detection, which gives the center position and a bounding box of the target. Examples include face recognition (Déniz et al. 2011), human detection (Zhu et al. 2006), part-based models (Felzenszwalb et al. 2009), sparselet models (Song et al. 2012), bag-of-features models (Lazebnik et al. 2006; Yang et al. 2009), sparse coding (Yang et al. 2009; Gao et al. 2010; Jia et al. 2012; Jiang et al. 2012), and deep learning (Krizhevsky et al. 2012; Donahue et al. 2013). The second is semantic image segmentation, which simultaneously segments an input image into regions and recognizes their categorical labels. An effective way to achieve this goal is to assign a label to each pixel of the input image and to impose structural constraints on the output label space. Graphs are a natural tool for modeling the relations among different objects. With the development of Markov random fields (MRF) and maximum a posteriori (MAP) estimation, MRF and CRF, which represent this type of problem with probabilistic graphical models, have been widely used. They model the labeling problem based on pixel features (Lafferty et al. 2001; Blake et al. 2004; Shotton et al. 2006; Larlus and Jurie 2008; Toyoda and Hasegawa 2008; Gould et al. 2008), region features (Yang et al. 2007; Kohli and Torr 2009; Fulkerson et al. 2009; Tighe and Lazebnik 2010; Yang and Forstner 2011b), or multi-level regions (Russell et al. 2009; Schnitzspan et al. 2009; Yang et al. 2010; Kohli et al. 2013; Ladicky et al. 2014). In these approaches, the image classification problem is expressed through potential functions and transformed into an energy minimization problem, whose solution also yields the boundaries of the targets. Yang and Forstner (2011a, b), Zhong and Wang (2007), and Montoya-Zegarra et al. (2015) use CRF to model spatial and hierarchical structures to label and classify images of man-made scenes, such as buildings and roads, and demonstrate the effectiveness of the CRF method in extracting man-made targets from high-resolution remote sensing images. Both types of target recognition methods represent and make use of many kinds of features, such as SIFT (Scale Invariant Feature Transform), HOG (Histogram of Oriented Gradients), and LBP (Local Binary Patterns).

In remote sensing image applications, we usually need the accurate boundaries of the targets. We therefore adopt the second type of method and combine it with the features used in the first type to obtain both the object class labels and the boundaries. To make use of spectral, texture, and context features, we construct a three-level potential function comprising a pixel level, a segment level, and an up-down-layer level, with reference to Toyoda and Hasegawa (2008), Kohli and Torr (2009), and Ladicky et al. (2014), and use graph cut (Boykov et al. 2001; Boykov and Jolly 2001) to find the optimal solution. Experiments on a GF-1 satellite image with 2 m spatial resolution show that the proposed method is an efficient way to improve segmentation and classification accuracy.

3 CRF-Based Image Classification Method

Features at different scales are very helpful for image interpretation, and high-resolution remote sensing images provide more detailed textures. To utilize different scales and different kinds of features, we select four popular features and segmentation results at three scales to describe the classes. Figure 1 shows the workflow. We calculate the SIFT, LBP, texton, and color SIFT features of each image. To reduce the computational complexity and obtain a sparse representation, we apply k-means clustering separately to each of the four kinds of features extracted from the training images to get the visual words. We then construct the pixel potentials and the segment potentials at the different scales, using the mean shift method for image segmentation. The classifiers are obtained through boosting. Finally, we combine the pixel-level and segment-level potentials with the potentials across the different levels and use the graph cut optimization algorithm to find the best solution, that is, the most suitable class for each pixel.

Fig. 1 Workflow of our method

3.1 Features Calculation

3.1.1 SIFT Descriptor

In computer vision, SIFT descriptors are generally computed at detected key points and used for point matching. Because the SIFT descriptor can describe the spatial distribution within a window, it has been used in many target detection studies (Lazebnik et al. 2006; Yang et al. 2009; Jia et al. 2012), which only label the rectangular region where the object may exist.

This feature is derived from a 4 × 4 grid of gradient windows, using a histogram of 4 × 4 samples per window in 8 directions. The gradients are Gaussian weighted around the center. This leads to a 4 × 4 × 8 = 128-dimensional feature vector that reflects the distribution of gradient directions. The feature values should be normalized before use.
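The 128-dimensional layout can be illustrated with a simplified sketch. It assumes a 16 × 16 patch split into 4 × 4 cells of 4 × 4 samples, with gradients supplied by the caller. Gaussian weighting and bin interpolation are left out, and the function name `sift_like_descriptor` is hypothetical, so this is not a full SIFT implementation.

```python
import math

def sift_like_descriptor(gx, gy, cells=4, cell_size=4, bins=8):
    """Build a cells*cells*bins orientation histogram from per-pixel gradients.

    gx, gy: square 2-D lists (side = cells * cell_size) of x and y gradients.
    Gaussian weighting and interpolation are omitted in this sketch.
    """
    side = cells * cell_size
    hist = [0.0] * (cells * cells * bins)
    for r in range(side):
        for c in range(side):
            magnitude = math.hypot(gx[r][c], gy[r][c])
            angle = math.atan2(gy[r][c], gx[r][c]) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            cell = (r // cell_size) * cells + (c // cell_size)
            hist[cell * bins + b] += magnitude
    # L2 normalization, as the text requires before the feature is used.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist

# A 16 x 16 patch with a constant horizontal gradient (made-up data).
gx = [[1.0] * 16 for _ in range(16)]
gy = [[0.0] * 16 for _ in range(16)]
desc = sift_like_descriptor(gx, gy)
print(len(desc))
```

Each of the 16 cells contributes 8 orientation bins, which gives the 128 entries.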

3.1.2 Color-SIFT Descriptor

Color is an important cue for distinguishing objects. Color invariant descriptors have been proposed to increase illumination invariance and discriminative power. There are many different ways to obtain color descriptors; Van De Sande et al. (2010) compared their invariance properties and distinctiveness. In this paper, we choose the RGB-SIFT descriptor, for which SIFT descriptors are computed for each RGB channel independently.

3.1.3 LBP Feature

An LBP is a local descriptor that captures the appearance of an image in a small neighborhood around a pixel. Due to its discriminative power and computational simplicity, the LBP texture operator has become a popular approach in various applications. An LBP is a string of bits, with one bit for each of the pixels in the neighborhood. Each bit is obtained by thresholding the corresponding neighbor against the value of the center pixel. Here we select a 3 × 3 neighborhood, so each LBP is a string of 8 bits and there are 256 possible LBPs.
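The following minimal sketch computes the LBP code of a single pixel. The clockwise bit ordering starting at the top-left neighbor and the `>=` comparison are assumptions, since the text does not fix them. The function name `lbp_code` and the sample image are illustrative only.

```python
def lbp_code(img, r, c):
    """Return the 8-bit LBP code of pixel (r, c) from its 3 x 3 neighborhood."""
    center = img[r][c]
    # Clockwise from the top-left neighbor; the ordering is a convention (assumption).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbor at least as bright as the center sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

img = [[5, 9, 1],
       [4, 6, 7],
       [2, 3, 8]]
print(lbp_code(img, 1, 1))  # bits 1, 3, and 4 are set: 2 + 8 + 16 = 26
```

Because the code is an 8-bit integer, it always falls in the range 0 to 255, which matches the 256 possible patterns.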

3.1.4 Texton Feature

The term texton was first proposed by Julesz (1981), who described textons as “the putative units of pre-attentive human texture perception” (Julesz 1981). Leung and Malik (2001) use this term to describe vector quantized responses of a linear filter bank. Textons have been proven effective in categorizing materials as well as generic object classes. In this paper, we build the filter bank from three Gaussians, four Laplacians of Gaussians (LoG), and four first-order derivatives of Gaussians. The three Gaussian kernels (with σ = 0.1, 0.2, 0.4) are applied to each of the CIE L, a, b channels. The four LoGs (with σ = 0.1, 0.2, 0.4, 0.8) are applied to the L channel only. The four derivatives of Gaussians are divided into x-aligned and y-aligned sets, each with two different values of σ (σ = 0.2, 0.4), and are also applied to the L channel only. This produces 3 × 3 + 4 + 4 = 17 filter responses, so each pixel in each image has an associated 17-dimensional feature vector (Julesz 1981).
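The count of 17 responses can be checked by enumerating the filter bank as listed above. The tuple layout and the filter names in this sketch are hypothetical, and no filtering is performed.

```python
# Filter-bank layout as listed in the text: (filter type, sigma, channel).
gaussians = [("gaussian", s, ch) for s in (0.1, 0.2, 0.4) for ch in ("L", "a", "b")]
logs = [("log", s, "L") for s in (0.1, 0.2, 0.4, 0.8)]
derivatives = [("dgauss_" + axis, s, "L") for axis in ("x", "y") for s in (0.2, 0.4)]

filter_bank = gaussians + logs + derivatives
print(len(gaussians), len(logs), len(derivatives), len(filter_bank))  # 9 4 4 17
```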

3.2 Mean Shift Image Segmentation

Mean shift is a robust feature space analysis approach that can delineate arbitrarily shaped clusters. Mean shift-based image segmentation has been widely used on many kinds of images, including high-resolution remote sensing images. The segmentation is actually a merging process performed on the regions produced by mean shift filtering, and it considers both the spatial domain and the spectral domain while merging. The Euclidean metric is used for both domains. Because Euclidean distance in RGB color space does not correlate well with the color differences perceived by people, we use the LUV color space, in which Euclidean distance better models perceived color difference. The mean shift segmentation algorithm requires the selection of the bandwidth parameter h = (hr, hs), which determines the resolution of the mode detection by controlling the size of the kernel. To obtain segmentation results at different scales, we select three groups of parameters: h1 = (3.5, 3.5), h2 = (5.5, 3.5), and h3 = (3.5, 5.5).
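The role of the bandwidth can be seen in a one-dimensional sketch of mean shift mode seeking. It assumes a flat kernel and scalar samples, so it is a simplification of the joint spatial and spectral procedure used for segmentation. The function name `mean_shift_mode` and the sample values are illustrative.

```python
def mean_shift_mode(points, start, bandwidth, tol=1e-6, max_iter=100):
    """Follow the mean shift vector from `start` until it settles on a mode."""
    x = start
    for _ in range(max_iter):
        # Flat kernel: every sample within the bandwidth has equal weight.
        window = [p for p in points if abs(p - x) <= bandwidth]
        if not window:
            return x
        new_x = sum(window) / len(window)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]  # made-up samples with two clusters
modes = sorted({round(mean_shift_mode(values, v, bandwidth=1.0), 3) for v in values})
coarse = sorted({round(mean_shift_mode(values, v, bandwidth=5.0), 3) for v in values})
print(modes, coarse)
```

With the small bandwidth the two clusters keep separate modes, while the large bandwidth merges them into one. This is the sense in which the bandwidth controls the resolution of mode detection.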

3.3 Potential Function

Image classification is the problem of assigning an object category label to each pixel in a given image. MRFs are the most popular models for incorporating local contextual constraints in labeling problems. Let \( l_{i} \) be the label of the ith site of the image site set S, and \( N_{i} \) be the neighboring sites of site i. The label set \( L = \{ l_{i} \}_{i \in S} \) is said to be an MRF on \( S \) with respect to a neighborhood system N if and only if the following condition is satisfied

$$ P\left( {l_{i} |l_{{S - \left\{ i \right\}}} } \right) = P(l_{i} |l_{{N_{i} }} ) $$
(1)

Let \( l \) be a realization of \( L \), then \( P(l) \) has an explicit formulation (Gibbs distribution):

$$ P\left( l \right) = \frac{1}{Z}\exp \left( - \frac{1}{T}E\left( l \right) \right) $$
(2)
$$ E\left( l \right) = \mathop \sum \limits_{c \in C} V_{C} \left( l \right) = \mathop \sum \limits_{{\left\{ i \right\} \in C_{1} }} V_{1} \left( {l_{i} } \right) + \mathop \sum \limits_{{\{ i,i^{\prime} \} \in C_{2} }} V_{2} (l_{i} ,l_{{i^{\prime} }} ) + \cdots $$
(3)

where \( E(l) \) is the energy function, Z is a normalizing factor called the partition function, T is a constant, and the clique set is \( C_{k} = \{ \{ i,i^{\prime} ,i^{\prime\prime} , \ldots \} \mid i,i^{\prime} ,i^{\prime\prime} , \ldots \text{ are neighbors to one another} \} \). \( V_{C} (l) \) is the potential function, which represents a priori knowledge of the interactions between the labels of neighboring sites. Maximizing the posterior probability is equivalent to minimizing the posterior energy:

$$ L^{*} = \arg \min_{L} E(L|X) $$
(4)
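The equivalence between maximizing a Gibbs probability and minimizing its energy can be checked by brute force on a toy problem. The sketch below assumes a three-site chain with two labels, made-up single-site potentials, and a Potts pair potential. It enumerates all labelings, which is only feasible at this size.

```python
import itertools
import math

def energy(labels, unary, smooth=1.0):
    """E(l): single-site potentials plus Potts pair potentials on a chain."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += sum(smooth for a, b in zip(labels, labels[1:]) if a != b)
    return e

unary = [[0.2, 1.5], [0.9, 1.0], [1.4, 0.3]]  # hypothetical V_1 values
T = 1.0
labelings = list(itertools.product((0, 1), repeat=len(unary)))

# Gibbs distribution: P(l) = exp(-E(l) / T) / Z.
weights = {l: math.exp(-energy(l, unary) / T) for l in labelings}
Z = sum(weights.values())
prob = {l: w / Z for l, w in weights.items()}

map_labeling = max(prob, key=prob.get)
min_energy_labeling = min(labelings, key=lambda l: energy(l, unary))
print(map_labeling, min_energy_labeling)
```

Since the exponential is monotonically decreasing in the energy and Z does not depend on the labeling, both searches return the same labeling.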

Let G = (S, E) be a graph, then (X, L) is said to be a CRF if, when conditioned on X, the random variables \( l_{i} \) obey the Markov property with respect to the graph:

$$ P\left( {l_{i} |X,l_{{S - \left\{ i \right\}}} } \right) = P(l_{i} |X,l_{{N_{i} }} ) $$
(5)

where \( S - \{ i \} \) is the set of all sites in the graph except site i, and \( N_{i} \) is the set of neighbors of site i in G. A CRF thus models the posterior \( P(L|X) \) directly. In a CRF, the potentials are functions of all the observation data as well as of the labels, which allows us to incorporate shape, color, texture, layout, and edge cues in a single unified model using conditional potentials. Several kinds of CRF have been proposed, defined for example over image pixels (Gould et al. 2008; Toyoda and Hasegawa 2008), patches (Yang et al. 2007; Fulkerson et al. 2009; Kohli and Torr 2009; Tighe and Lazebnik 2010), or a hierarchy of regions (Russell et al. 2009; Yang et al. 2010; Kohli et al. 2013). We use a CRF model (Kohli and Torr 2009; Russell et al. 2009; Ladicky et al. 2014) to learn the conditional distribution over the class labeling given an image. The energy associated with this conditional distribution is defined as

$$ E\left( X \right) = \mathop \sum \limits_{i \in V} \theta_{v} \varphi_{i} \left( {x_{i} } \right) + \mathop \sum \limits_{(i,j) \in \varepsilon } \theta_{\varepsilon } \varphi_{ij} \left( {x_{i} ,x_{j} } \right) + \mathop \sum \limits_{c \in S} \theta_{s} \varphi_{c} \left( {X_{c} } \right) $$
(6)

where V is the set of image pixels, \( \varepsilon \) is the set of edges in an 8-connected grid structure, and S is the set of image segments. \( \varphi_{i} \left( {x_{i} } \right) \), \( \varphi_{ij} \left( {x_{i} ,x_{j} } \right) \) and \( \varphi_{c} \left( {X_{c} } \right) \) are the potentials defined on them, \( \theta_{v} \), \( \theta_{\varepsilon } \;{\text{and}}\;\theta_{s} \) are the model parameters, and \( i \) and \( j \) index pixels in the image, which correspond to nodes in the graph. Here \( x_{i} \) denotes the label assigned to pixel \( i \). The three potentials are the unary potential, the pairwise potential, and the region potential, which are described as follows.

3.3.1 Unary Potential

The unary potential allows for local and global evidence aggregation; each potential models the evidence from a specific image feature. Usually, it is computed from the color of the pixel and the appearance model of each object. However, color alone is not a very discriminative feature and fails to produce accurate segmentation and classification. This problem can be overcome by using more sophisticated potential functions based on color, texture, location, and shape priors. The unary potential used here can be written as

$$ \varphi_{i} \left( {x_{i} } \right) = \theta_{s} \varphi_{s} \left( {x_{i} } \right) + \theta_{l} \varphi_{l} \left( {x_{i} } \right) + \theta_{t} \varphi_{t} \left( {x_{i} } \right) + \theta_{cs} \varphi_{cs} \left( {x_{i} } \right) $$
(7)

where \( \theta_{s} \), \( \theta_{l} \), \( \theta_{t} \) and \( \theta_{cs} \) are parameters weighting the potentials obtained from SIFT, LBP, texton, and color SIFT respectively.
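Equation (7) is a weighted sum of four per-feature terms, which can be sketched as below. The weights and cost values are made up for illustration. In the method they would come from the learned parameters and the boosted classifiers. The function name `unary_potential` is hypothetical.

```python
def unary_potential(costs, weights):
    """Weighted sum of per-feature costs for one pixel and one candidate label."""
    return sum(weights[name] * costs[name] for name in weights)

# Hypothetical weights (theta_s, theta_l, theta_t, theta_cs) and feature costs.
weights = {"sift": 1.0, "lbp": 0.5, "texton": 2.0, "color_sift": 1.0}
costs = {"sift": 0.4, "lbp": 0.8, "texton": 0.1, "color_sift": 0.3}
print(unary_potential(costs, weights))
```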

3.3.2 Pairwise Potential

The pairwise potentials have the form of a contrast sensitive Potts model (Kohli and Torr 2009).

$$ \varphi_{ij} \left( {x_{i} ,x_{j} } \right) = \left\{ {\begin{array}{*{20}l} 0 \hfill & {{\text{if}}\;x_{i} = x_{j} } \hfill \\ {g\left( {i,j} \right)} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(8)

where the function \( g(i,j) \) is an edge feature based on the difference in colors of neighboring pixels (Song et al. 2012). It is typically defined as

$$ g\left( {i,j} \right) = \theta_{p} + \theta_{v} \exp ( - \theta_{\beta } ||I_{i} - I_{j} ||^{2} ) $$
(9)

where \( I_{i} \) and \( I_{j} \) are the color vectors of pixels i and j respectively. \( \theta_{p} \), \( \theta_{v} \) and \( \theta_{\beta } \) are model parameters whose values are learned using training data.
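Equations (8) and (9) can be written directly as a small function. The parameter values in the usage lines are arbitrary placeholders rather than learned values, and the function name `pairwise_potential` is illustrative.

```python
import math

def pairwise_potential(x_i, x_j, color_i, color_j, theta_p, theta_v, theta_beta):
    """Contrast-sensitive Potts cost for two neighboring pixels."""
    if x_i == x_j:
        return 0.0
    # Squared Euclidean distance between the two color vectors.
    sq_diff = sum((a - b) ** 2 for a, b in zip(color_i, color_j))
    return theta_p + theta_v * math.exp(-theta_beta * sq_diff)

similar = pairwise_potential(0, 1, (10, 10, 10), (11, 10, 10), 1.0, 2.0, 0.1)
distinct = pairwise_potential(0, 1, (10, 10, 10), (90, 10, 10), 1.0, 2.0, 0.1)
print(similar, distinct)
```

A label change between similar colors costs more than one across a strong color edge, so label boundaries are pushed toward image edges.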

3.3.3 Region Consistency Potential

The region consistency potential is modeled by the robust \( P^{n} \) Potts model (Kohli and Torr 2009). It encourages all pixels belonging to a segment to take the same label, while allowing some variables in the segment to take different labels, and thus reflects the consistency of segments. It is very useful for obtaining object segmentations with fine boundaries. We refer the reader to Kohli and Torr (2009) for more details. It takes the form

$$ \varphi_{c} \left( {X_{c} } \right) = \left\{ {\begin{array}{*{20}l} {N_{i} (X_{c} )\frac{1}{Q}\gamma_{\hbox{max} } } \hfill & {{\text{if}}\;N_{i} \left( {X_{c} } \right) \le Q} \hfill \\ {\left| c \right|^{{\theta_{\alpha } }} (\theta_{p}^{h} + \theta_{v}^{h} G(c))} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(10)

where \( N_{i} \left( {X_{c} } \right) = \min_{k} (\left| c \right| - n_{k} (X_{c} )) \) denotes the number of variables in the clique c not taking the dominant label, \( \gamma_{\hbox{max} } = \left| c \right|^{{\theta_{\alpha } }} (\theta_{p}^{h} + \theta_{v}^{h} G(c)) \), and Q is the truncation parameter, which controls the rigidity of the higher-order clique potential. G(c) evaluates the consistency of all constituent pixels of a segment; the variance of the response of a unary feature is used, that is

$$ G\left( c \right) = \exp \left( - \theta_{\beta }^{h} \frac{{\left\| {\mathop \sum \nolimits_{i \in c} (f\left( i \right) - \mu )^{2} } \right\|}}{\left| c \right|}\right) $$
(11)

where \( \mu = \frac{{\mathop \sum \nolimits_{i \in c} f(i)}}{\left| c \right|} \) and \( f( \cdot ) \) is a function used to evaluate the quality of a segment. This potential gives rise to a cost that is a truncated linear function of the number of inconsistent variables (Kohli and Torr 2009).
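The truncated linear behavior of Eq. (10) can be sketched as follows. Here \( \gamma_{\hbox{max} } \) is passed in as a number rather than computed from \( G(c) \). The function name `robust_pn_potential` and the sample label lists are illustrative.

```python
from collections import Counter

def robust_pn_potential(labels, gamma_max, Q):
    """Cost of a segment: linear in the non-dominant label count, capped at gamma_max."""
    n_dominant = max(Counter(labels).values())
    n_inconsistent = len(labels) - n_dominant  # N_i(X_c) in Eq. (10)
    if n_inconsistent <= Q:
        return n_inconsistent * gamma_max / Q
    return gamma_max

print(robust_pn_potential([1, 1, 1, 1], gamma_max=4.0, Q=2))        # fully consistent
print(robust_pn_potential([1, 1, 1, 2], gamma_max=4.0, Q=2))        # one outlier
print(robust_pn_potential([1, 1, 2, 2, 3, 3], gamma_max=4.0, Q=2))  # truncated
```

The cost grows with each pixel that disagrees with the dominant label until Q pixels disagree, and stays at the cap beyond that point.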

In this paper, we use a boosting algorithm to train the three parts of the energy function. Boosting helps us select features and obtain a strong classifier.

3.4 Graph Cut

Given the CRF model and its learned parameters, we wish to find the most probable labeling, i.e., the labeling that minimizes the energy function (6). The graph cut-based α-expansion and \( \alpha \beta \)-swap algorithms are an effective way to solve this energy minimization problem (Boykov et al. 2001). They transform the energy minimization problem into a min-cut problem on a graph and have been successfully used to minimize energy functions composed of pairwise potential functions.

As with other move-making algorithms, the first step of graph cut is to initialize the nodes and edges and build the graph. In accordance with the constituents of the energy function, the nodes and edges consist of the pixel, pairwise, and three-scale segment-level components. The potentials of the three levels are calculated and used to initialize the edge weights. The initial label image is set by assigning each pixel its minimum-cost class. The algorithm then computes optimal α-expansion moves for the labels in some order, accepting a move only if it decreases the energy. The output is a strong local minimum of the energy function.
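The move-making loop can be illustrated on a toy chain of four pixels. In this sketch each optimal expansion move is found by exhaustive search rather than by a min-cut, which is only practical for a handful of sites. The energy contains unary and Potts pairwise terms only, and the cost values are made up. It shows the accept-if-lower logic and is not a graph cut implementation.

```python
import itertools

def energy(labels, unary, smooth):
    """Unary costs plus a Potts penalty between neighbors on a chain."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += sum(smooth for a, b in zip(labels, labels[1:]) if a != b)
    return e

def best_expansion_move(labels, alpha, unary, smooth):
    """Best labeling reachable by letting any subset of sites switch to alpha.

    Exhaustive search stands in for the min-cut used in practice.
    """
    best = list(labels)
    for switch in itertools.product((False, True), repeat=len(labels)):
        candidate = [alpha if s else l for s, l in zip(switch, labels)]
        if energy(candidate, unary, smooth) < energy(best, unary, smooth):
            best = candidate
    return best

def alpha_expansion(unary, smooth, n_labels):
    # Initial labeling: the cheapest unary cost at each site.
    labels = [min(range(n_labels), key=lambda l: u[l]) for u in unary]
    improved = True
    while improved:
        improved = False
        for alpha in range(n_labels):
            candidate = best_expansion_move(labels, alpha, unary, smooth)
            # Accept the move only if it lowers the energy.
            if energy(candidate, unary, smooth) < energy(labels, unary, smooth):
                labels, improved = candidate, True
    return labels

unary = [[0.1, 1.0, 1.0], [0.6, 0.5, 1.0], [0.1, 1.0, 1.0], [1.0, 1.0, 0.2]]
labels = alpha_expansion(unary, smooth=0.5, n_labels=3)
print(labels, energy(labels, unary, 0.5))
```

In this example the isolated label at the second pixel is absorbed by its neighbors, because the smoothness penalty outweighs the small unary advantage.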

4 Results and Discussion

In our experiments, we use a high-resolution satellite image to verify the applicability of our method. The image is the fusion of the multispectral and panchromatic “GF-1” images, with 2 m spatial resolution. The test area is located in the southeast of Liaoning Province, China. The image was manually interpreted into paddy field, dry land, bare land, shrub land, buildings, and greenhouse; some areas are labeled as “other class.” The image has 20000 × 8000 pixels. We split it into 128 × 128 tiles, which are divided into training and test groups, with 30% of the tiles used for training. For comparison with another object-oriented image classification method, we select the widely used software eCognition. In the eCognition classification process, after many trials, we selected a scale parameter of 100, calculated the mean and standard deviation of each band and the angular second moment of the GLCM, and used the Nearest Neighbor algorithm to perform the object-oriented classification. Figure 2 shows the original image (a), the ground truth (b), the classification result of our method (c), and the classification result of eCognition. Table 1 gives the classification accuracies of the two methods. The accuracy of our method is higher than that of the traditional object-oriented classification, and the salt-and-pepper effect is greatly reduced compared with the traditional method.
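The tiling and the 30% split can be sketched as follows. The text does not say how partial tiles at the image border are handled or how the training tiles are chosen. This sketch assumes that only full tiles are kept and that the training set is drawn at random with a fixed seed. The function name `split_tiles` is hypothetical.

```python
import random

def split_tiles(width, height, tile, train_fraction, seed=0):
    """Enumerate full tiles by their top-left corner and draw a random training subset."""
    tiles = [(x, y)
             for y in range(0, height - tile + 1, tile)
             for x in range(0, width - tile + 1, tile)]
    rng = random.Random(seed)
    rng.shuffle(tiles)
    n_train = int(train_fraction * len(tiles))
    return tiles[:n_train], tiles[n_train:]

train, test = split_tiles(20000, 8000, 128, 0.3)
print(len(train), len(test))
```

Under these assumptions the image yields 156 × 62 = 9672 full tiles, of which 2901 are used for training.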

Fig. 2 Original satellite image, ground truth, and classification results

Table 1 Comparison of classification accuracy

5 Conclusion

The experimental procedure and results show that this method has three main advantages for image classification. First, it does not need manual segmentation, which avoids the tedious selection of a segmentation scale. Second, it can make good use of existing classification results to train the classifier. Third, it selects features automatically and improves the segmentation and classification results. It may offer a route to image interpretation based on large numbers of samples. At the same time, we must admit that the method is still time-consuming. How to apply it to large-scale general surveying and monitoring of geographical conditions remains a problem to be solved.