Introduction

Analysis and interpretation of Remote Sensing (RS) images has helped people understand many phenomena related to the Earth. Information extracted from RS images can be used in many fields, such as weather forecasting, resource management, regional planning, traffic monitoring, and environmental risk assessment. Many analysis and interpretation tasks rely on understanding the content of an RS image or scene. In the literature, several techniques have been proposed to help understand the content of RS images, including object detection, object recognition, image segmentation, and semantic image segmentation. These techniques are often confused with one another. Object detection in RS images aims to determine whether an image contains one or more objects belonging to a class of interest and to locate their positions (Lei et al. 2012; Cheng and Han 2016). Object recognition aims to detect all objects in RS images and locate their positions (Durand et al. 2007; Haiyang and Fuping 2009; Diao et al. 2015). In image segmentation, the RS image is divided into regions, but these regions are not labeled (Ming et al. 2015; Zhang et al. 2017). Semantic image segmentation labels each pixel in the RS image according to a class of objects such as urban, forest, water, etc. (Athanasiadis et al. 2007; Shotton and Kohli 2014; Zheng et al. 2017). In this paper, we are interested in semantic RS image segmentation.

The semantic segmentation of RS images has been a core topic in recent years. Several methods for semantic image segmentation act at the pixel level by classifying each pixel independently; others group pixels into clusters and assign a label to each cluster. For instance, Ma et al. (2014) proposed an object-oriented approach that combines pixel-based classification with a segmentation technique to classify polarimetric Synthetic Aperture Radar (SAR) images. The authors developed a soft voting strategy to fuse multiple classifiers, and the approach was validated through experiments conducted on two quad-polarimetric SAR images. Rau et al. (2014) proposed an object-oriented analysis scheme for landslide recognition using existing software, with input data comprising only multispectral optical ortho-images and a digital elevation model; they developed a semiautomatic method that detects landslide seeds and then performs region growing and false-positive elimination for these seeds. Zhang et al. (2016a) proposed to overcome the semantic gap between low-level visual features and high-level image semantics by developing an object-based mid-level representation method for semantic classification; the proposed algorithm is based on the bag-of-visual-words model, which generates mid-level features to bridge the two levels. Zhang et al. (2016c) developed a higher-order potential function based on nonlocal shared constraints within the framework of a conditional random field model. Their approach combines classification knowledge from labeled data with unsupervised segmentation cues derived from the test data; the conditional random field model integrates low-level and high-level contextual cues from labeled and unlabeled test datasets. Andrés et al. (2017) presented an ontology-based approach to classify RS images, developing spectral rules for pixel-based classification of Landsat images. Their prototype is coupled with open-source image processing software at the pre-processing step and uses a reasoner to perform image classification; its major limitation is processing time. Zheng et al. (2017) detailed a semantic segmentation method for high spatial resolution RS images based on an object MRF (Markov random field) model with auxiliary label fields. The idea is to define a label field and two auxiliary label fields on the same region adjacency graph with different class numbers; a net structure is then built to describe the interactions between label fields, with messages passed between each label field and the two auxiliary label fields. Marmanis et al. (2018) presented a deep convolutional neural network for semantic segmentation with boundary detection, combining semantic segmentation with semantically informed edge detection by adding boundary detection to the encoder-decoder architecture. Boulila et al. (2018) developed a decision support tool for RS big data analytics whose main aim is to assist users in making decisions in many RS-related fields; the tool provides descriptive, predictive, and prescriptive analytics. Boulila et al. proposed to overcome the complexity of RS data by implementing an iterative and incremental data integration process. Additionally, they designed a multidimensional model based on a star schema for the image data warehouse, and they proposed techniques such as distribution, indexing, and partitioning to enhance the retrieval of RS big data. Experiments were conducted on three different applications (clustering, decision trees, and association rules).

Recent years have witnessed an increasing amount of RS images with different spectral and spatial resolutions (Liu et al. 2016). This growth has opened the door to new challenging problems facing the RS community: how can valuable information be extracted from the various kinds of RS data, and how can the increasing data types and volumes be handled? (Zhang et al. 2016b). The semantic segmentation of big RS images is thus becoming a challenging research topic. In the current manuscript, we propose a top-down approach for semantic segmentation. The goal of the top level is to compute features for objects extracted from RS images, while the goal of the down level is to determine the class of each pixel using the information computed at the previous level.

The remainder of this paper is organized as follows. In Section 2, we detail the proposed approach for semantic segmentation of RS big images. The presented method is evaluated on different real datasets in Section 3. Finally, Section 4 concludes the paper and discusses some issues for further research.

Proposed approach

The goal of the proposed approach is to partition an RS image into meaningful objects and assign a class to each of them. This is achieved through a top-down approach composed of object processing (top level) and pixel processing (down level). The proposed approach is detailed in the following parts.

Proposed approach for semantic segmentation of big RS images

Figure 1 describes the proposed approach. The proposed semantic segmentation process is divided into two levels: 1) the top level and 2) the down level. The first level trains the multi-layer feed-forward neural network (MLFFNN): we start by computing features of objects extracted from RS images, and these features constitute the input of the MLFFNN module, which generates a structure for classifying RS objects. In the second level, the down level, the generated structure is used to perform semantic segmentation at the pixel level. For an input RS image, the 3 × 3 neighborhood (the pixel and its 8-connected neighbors) centered on every pixel is considered when computing the features related to that pixel; the same features computed at the object level are computed at the pixel level, based on this 3 × 3 window. The computed features are fed to the MLFFNN to determine the most similar trained class and assign a class to each pixel.

Fig. 1 Proposed approach for semantic segmentation of big RS images

The proposed semantic image segmentation process is depicted in Algorithm 1.

Algorithm 1 takes as input: 1) imgs, a set of RS images; 2) nbrclass, the number of classes for the input images; 3) connectivity, the number of pixels connected horizontally, vertically, and diagonally to every pixel in the considered image; 4) minNumberPixels, the minimum number of pixels that every extracted object should contain; 5) hiddenLayerSize, the hidden layer size for the MLFFNN (see Section 2.e for more details); and 6) trainFcn and performFcn, which represent, respectively, the training and performance functions for the neural network. The output of Algorithm 1 is the set of classified input images.

Algorithm 1 Semantic segmentation of big RS images
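Since the pseudocode itself appears only as a figure, the following Python skeleton restates the interface of Algorithm 1 as described above. The function name and signature are illustrative assumptions, not the authors' implementation; the individual steps are sketched in the subsections below.

```python
from typing import List
import numpy as np

def algorithm1(imgs: List[np.ndarray], nbrclass: int, connectivity: int,
               minNumberPixels: int, hiddenLayerSize: int,
               trainFcn: str = "trainbr",
               performFcn: str = "crossentropy") -> List[np.ndarray]:
    """Two-level semantic segmentation of RS images (interface sketch).

    Top level: segment each image (k-means), extract objects (area
    opening with the given connectivity and minNumberPixels), compute
    object features, label objects, and train the MLFFNN.
    Down level: classify every pixel of each image from the features of
    its 3x3 neighborhood using the trained network.
    Returns the classified input images.
    """
    ...  # see the step-by-step sketches in the following subsections
```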

Top level: object processing

This level aims to generate a neural network structure that will be used for semantic segmentation. The object processing is divided into five steps: a) image segmentation, b) object extraction, c) computation of object features, d) object labeling, and e) generation of neural network structure.

RS image segmentation

The results of subsequent steps depend strongly on the results provided by the image segmentation step, and the success of image interpretation is closely tied to the reliability of the segmentation. Accurate partitioning of RS images remains a challenging problem, and much work has been devoted to RS image segmentation; among these works, we can list (Haralick and Shapiro 1985; Ryherd and Woodcock 1996; Trias-Sanz et al. 2008; Akcay and Aksoy 2008; Boulila et al. 2010; Cheng and Han 2016).

In this paper, the k-means method is used to segment images (MacQueen 1967). This method can be replaced by any other segmentation method.

Object extraction

After image segmentation, we obtain a set of objects that collectively cover the entire image; pixels belonging to the same object have the same label. The goal of this step is to determine meaningful objects from the segmented images by removing small objects. We use two parameters, the connectivity and the minimum number of pixels (respectively connectivity and minNumberPixels in Algorithm 1), to perform this task. All connected objects containing fewer than a given number of pixels are removed from the segmented image; this operation is known as an area opening (Vincent 1993). For the connectivity, we consider 8-connected pixels (the 3 × 3 window containing the pixel and the neighbors connected to it horizontally, vertically, and diagonally). The object extraction task starts by determining the connected components, then computes the area of each component, and finally removes the small objects (those with fewer than minNumberPixels pixels). Isolated pixels are also removed from the segmented image in this step, as shown in the sketch below.
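A minimal sketch of this area opening with scikit-image follows; the library choice and the helper structure are assumptions, since the paper does not specify an implementation.

```python
import numpy as np
from skimage.measure import label
from skimage.morphology import remove_small_objects

def extract_objects(segmented: np.ndarray, min_number_pixels: int = 100):
    """Keep, per class, only 8-connected objects >= min_number_pixels."""
    objects = []
    for cls in np.unique(segmented):
        mask = segmented == cls
        # connectivity=2 in scikit-image means 8-connected neighborhoods.
        lbl = label(mask, connectivity=2)
        # Area opening: drops small components, including isolated pixels.
        lbl = remove_small_objects(lbl, min_size=min_number_pixels)
        objects.append(lbl)
    return objects
```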

Computation of object features

Computing features aims to find a mapping from the pixel level to a high-level data space. Feature extraction plays an important role in RS image analysis and interpretation. In the present paper, we use six features computed on objects extracted from RS images: the radiometry of the object centroid and five texture features derived from the gray-level co-occurrence matrix.

Let us consider an object obj extracted from a satellite image img. The features used in the proposed study are:

  • The radiometry of the centroid of the object:

    $$ f_1 = \mathrm{img}\left(\mathrm{centroid}\left(\mathrm{obj}\right)\right) $$
    (1)

where centroid is the function that returns the centroid of the object obj.

  • The five features derived from the GLCM (Gray-Level Co-Occurrence Matrix) of an object: the contrast, the correlation, the energy, the homogeneity, and the entropy (Haralick et al. 1973; Conners et al. 1984; Yang et al. 2012). The GLCM counts the combinations of gray levels co-occurring in the object obj, and the features extracted from it measure the variation in intensity at the pixel of interest. A sketch of how all six features can be computed is given after the list.

Let p(i,j) denote the element with coordinates (i,j) in the normalized symmetric GLCM.

  • Contrast: measures the intensity contrast between a pixel and its neighbor over the object. The contrast is given by the following equation:

    $$ f_2 = \sum_{i,j} \left(i-j\right)^2 p\left(i,j\right) $$
    (2)
  • Correlation: measures the correlation between a pixel and its neighbor over the object. The correlation is given by the following formula:

    $$ f_3 = \sum_{i,j} \frac{\left(i-\mu\right)\left(j-\mu\right) p\left(i,j\right)}{\sigma^2} $$
    (3)

where \( \mu = \sum_{i,j} p\left(i,j\right) i \) is the mean of the GLCM and \( \sigma^2 = \sum_{i,j} p\left(i,j\right)\left(i-\mu\right)^2 \) is its variance.

  • Energy (also known as uniformity): the sum of squared elements in the GLCM. The energy is given by the following formula:

    $$ f_4 = \sum_{i,j} \left(p\left(i,j\right)\right)^2 $$
    (4)
  • Homogeneity: measures the closeness of the distribution of GLCM elements to the GLCM diagonal. The homogeneity is given by the following formula:

    $$ f_5 = \sum_{i,j} \frac{p\left(i,j\right)}{1+\left(i-j\right)^2} $$
    (5)
  • Entropy: quantifies the randomness of the gray-level intensity distribution. The entropy is given by the following formula:

    $$ f_6 = -\sum_{i,j} p\left(i,j\right) \ln\left(p\left(i,j\right)\right) $$
    (6)
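As referenced above, the following sketch shows one way to compute the six features for an extracted object using scikit-image. The library, the helper name object_features, and the simplification of computing the GLCM over the object's bounding window are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def object_features(img: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Features f1-f6 for the object selected by a boolean mask."""
    rows, cols = np.nonzero(mask)
    f1 = img[int(rows.mean()), int(cols.mean())]      # radiometry, Eq. (1)
    # GLCM over the object's bounding window (a simplifying assumption).
    win = img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    glcm = graycomatrix(win.astype(np.uint8), distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    f2, f3, f4, f5 = (graycoprops(glcm, prop)[0, 0] for prop in
                      ("contrast", "correlation", "energy", "homogeneity"))
    p = glcm[:, :, 0, 0]                              # normalized GLCM
    f6 = -np.sum(p[p > 0] * np.log(p[p > 0]))         # entropy, Eq. (6)
    return np.array([f1, f2, f3, f4, f5, f6])
```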

Object labeling

The objective of this step is to link the features computed for a given object to its label. For the purpose of this work, five land cover classes of interest were identified: water, forest, urban, bare soil, and non-dense vegetation. A set of manually labeled regions was acquired from experts through careful visual interpretation of the studied area. Polygons of the studied area were digitized to derive the thematic information using a topographic map at a scale of 1:50,000, and this topographic information was used to determine the thematic classes (Boulila et al. 2011). The labeled regions were divided into training, validation, and test data.

At the end of this step, two outputs are provided: the first contains the features corresponding to each object, and the second contains the label of each object. These two outputs are provided to the neural network to build the classification structure.

Generation of neural network structure

The goal of the neural network structure is to build a process able to determine the class of an object extracted from the RS image according to its features.

In this study, we choose to work with an MLFFNN (Svozil et al. 1997; Ashwini Reddy et al. 2011; Wang et al. 2015a, b; Ulyanov et al. 2016). Our choice is motivated by the ability of the MLFFNN to adapt without continuous assistance from the user. In addition, MLFFNNs considerably reduce the computational effort and the memory capacity needed to store weights. Moreover, this type of neural network is very robust in the presence of uncertainty and noise, which is often the case for RS images. In the current work, imperfection modeling is not considered; readers interested in modeling the imperfections related to RS images can refer to our previous works. In (Ferchichi et al. 2017a), we detailed the sources of imperfection related to RS images and their main types, and we proposed to reduce imperfection by using 1) image fusion (Farah et al. 2008; Boulila et al. 2009), 2) imperfection propagation (Boulila et al. 2014; Ferchichi et al. 2017b), and 3) sensitivity analysis (Boulila et al. 2017; Ferchichi et al. 2018).

Figure 2 depicts the proposed MLFFNN architecture. The inputs are the object features and their corresponding land cover types. The network has one hidden layer and one output layer with five outputs (the different land cover types).

Fig. 2 Proposed MLFFNN architecture

w and b denote, respectively, the parameter (or weight) and the bias associated with each connection between units in successive layers.
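Concretely, each hidden or output unit computes the textbook feed-forward transformation (a standard formulation added here for clarity, not taken from the paper):

$$ a_j = \varphi\left(\sum_i w_{ji}\, x_i + b_j\right) $$

where \( x_i \) are the inputs of the unit and \( \varphi \) is the activation function.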

Down-level: pixel processing

The goal of the down level is to determine the class of every pixel of an input image according to its features. This is done using the neural network structure built in the top level.

Computation of pixel features

The features described in Section 2.c are object-based features and cannot be computed directly at the pixel level. However, our objective in this paper is to determine the class of every pixel of a given RS image. To achieve this, we consider the 8-connected matrix centered on that pixel, as shown in Fig. 3, and compute the considered features over this matrix.

Fig. 3 The 8-connected matrix centered on the reference pixel

The features describing the pixel are the radiometry, the contrast, the correlation, the energy, the homogeneity, and the entropy.

For example, consider the following 3 × 3 matrix centered at the reference pixel (2,2):

120  125  127
 90  128  127
 80  129  129

The values of the radiometry, the contrast, the correlation, the energy, the homogeneity, and the entropy will be, respectively, 128, 1.2597, 0.0493, 0.1136, 0.6444, and −0.0072696.
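As a sketch of how such per-pixel windows can be gathered efficiently, the following NumPy snippet extracts the 3 × 3 neighborhood of every interior pixel. The tooling is an illustrative assumption; only the radiometry of the example's reference pixel is checked here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

img = np.array([[120, 125, 127],
                [ 90, 128, 127],
                [ 80, 129, 129]])            # the example matrix above

# For an H x W image this yields an (H-2, W-2, 3, 3) stack of windows;
# each window would be passed to the same feature routine used at the
# object level (radiometry, contrast, correlation, energy, homogeneity,
# entropy) before being fed to the MLFFNN.
windows = sliding_window_view(img, (3, 3))
print(windows[0, 0][1, 1])                   # radiometry -> 128
```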

Semantic pixel segmentation

Once the MLFFNN has been trained, validated, and tested, it can be used to determine the class of every pixel. The features (radiometry, contrast, correlation, energy, homogeneity, and entropy) of every pixel of an input RS image are computed and provided to the MLFFNN structure. Assigning a class to every pixel in this way yields the semantic image segmentation.

Experimental results

In this section, we first describe the dataset used to validate the proposed approach. Then, we present the MLFFNN architecture used in the experiments. The third part is devoted to the semantic segmentation of RS images. Finally, we evaluate the performance of the proposed approach against traditional classification methods.

Dataset description

The proposed approach is tested and evaluated on the Reunion Island site. This site is located in the southwest of the Indian Ocean (21°06′ South and 55°32′ East; 700 km from Madagascar, which lies to the west, and 180 km from Mauritius, to the northeast). Reunion Island is 63 km long and 45 km wide.

Experiments are conducted on a real dataset belonging to the Kalideos database set up by CNES (Centre National d'Études Spatiales).

Due to sensor limitations and the influence of atmospheric conditions, RS images generally require a preprocessing step to enhance their quality before any subsequent processing tasks. In the current study, RS images are preprocessed as follows: 1) radiometric preprocessing, achieved by converting the pixel values into reflectance (Chander and Markham 2003; Chander et al. 2009); the reflectance at ground level is then inverted by comparing the estimated reflectance with top-of-atmosphere simulations for the geometric and atmospheric conditions corresponding to the measurement; and 2) geometric preprocessing, which aims to provide perfectly coregistered image series. The goal is to build a reference image through a validation process including field measurements collected by scientists; the RS images are then superimposed on the reference image in order to refine the corresponding sensor attitude model.
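For reference, the standard conversion from at-sensor spectral radiance \( L_\lambda \) to top-of-atmosphere reflectance used in this kind of radiometric preprocessing (Chander and Markham 2003) is:

$$ \rho_\lambda = \frac{\pi\, L_\lambda\, d^2}{ESUN_\lambda \cos\theta_s} $$

where \( d \) is the Earth-Sun distance in astronomical units, \( ESUN_\lambda \) the mean solar exoatmospheric irradiance, and \( \theta_s \) the solar zenith angle.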

In this paper, experiments have been carried out on a dataset containing a total of 293 images. The spatial resolution of the data is 10 m per pixel, and the size of each image varies from 3000 × 3000 pixels to 6000 × 6000 pixels. Several thumbnail images are extracted from these images, and segmentation is then performed on these thumbnails using the k-means algorithm.

The goal of the k-means algorithm is to partition an RS image into k segments (objects). It starts by randomly selecting initial cluster centers (centroids) for a given image. It then assigns each pixel in the image to the segment with the nearest centroid (in this paper, the Euclidean distance is computed between each center and each pixel, and the pixel is assigned to the segment at minimum distance). Once this assignment is done, the centroids of the new segments are recalculated and the pixels are reassigned to the new segments. This process is repeated iteratively until a stopping criterion is met (e.g., no pixel changes its cluster, the sum of the distances is minimized, or a maximum number of iterations is reached).
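A minimal sketch of this step with scikit-learn follows; the implementation is an assumption, and, as noted in Section 2, any segmentation method could be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_kmeans(img: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Cluster pixel values into k segments and return a label image."""
    # One row per pixel: spectral bands if present, else the gray value.
    pixels = img.reshape(-1, img.shape[2]) if img.ndim == 3 \
        else img.reshape(-1, 1)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed) \
        .fit_predict(pixels.astype(float))   # Euclidean distance, as above
    return labels.reshape(img.shape[:2])
```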

Using 8-pixel connectivity and a minimum of 100 pixels per extracted object, we obtain the numbers of labeled samples depicted in Table 1.

Table 1 Land cover classes with number of samples

Figure 4 presents an excerpt of samples for each land cover type.

Fig. 4 Excerpt of samples for each land cover type

Figure 5 shows the ground truth images for three thumbnail images extracted from the study region. To obtain the ground truth, information was extracted by experts over the studied areas: polygons of the studied regions of Reunion Island were digitized to derive the thematic information using a topographic map at a scale of 1:50,000, and this topographic information was used to determine the thematic classes. Five thematic classes are identified: urban, water, forest, bare soil, and non-dense vegetation areas.

Fig. 5 Ground truth for three thumbnail images extracted from the study region

MLFFNN description

We used the nprtool provided by Matlab R2008a (nprtool 2018). This tool uses a function named patternnet to classify RS images; the function is based on a feed-forward network trained to classify pixels according to target classes. patternnet takes three input parameters, hiddenLayerSizes, trainFcn, and performFcn, and returns a pattern recognition neural network. hiddenLayerSizes is the hidden layer size, set to 10 in the present paper.

trainFcn is the training function, set to trainbr, the Bayesian regularization backpropagation. This network training function updates the weight and bias values according to Levenberg-Marquardt optimization (Levenberg 1944); it minimizes a combination of squared errors and weights and then determines the correct combination so as to produce a network that generalizes well.

performFcn is the performance function, set to crossentropy; it calculates the MLFFNN performance given the targets and outputs.

The data used in this paper is divided into 70% for training (205 images), 15% for validation (44 images) and 15% for testing (44 images). The objective of the validation set is to monitor the classification error and stop training before overfitting occurs. The test set is then used independently to evaluate the classification quality (Heisel et al. 2017).
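For readers without MATLAB, the following scikit-learn sketch approximates this setup. It is an assumption, not the authors' code: there is no direct sklearn counterpart of trainbr, so L2 regularization stands in for Bayesian regularization, while MLPClassifier's log-loss plays the role of the crossentropy performance function.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_mlffnn(X, y, hidden_layer_size: int = 10):
    # 70% training / 15% validation / 15% test, as described above.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.70, random_state=0, stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)
    net = MLPClassifier(hidden_layer_sizes=(hidden_layer_size,),
                        alpha=1e-3,          # crude stand-in for trainbr
                        max_iter=1000, random_state=0)
    net.fit(X_tr, y_tr)
    print("validation accuracy:", net.score(X_val, y_val))
    print("test accuracy:", net.score(X_te, y_te))
    return net
```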

The MLFFNN training process is performed iteratively 100 times on an NVIDIA GeForce GTX 1080 with 8 GB of GPU memory.

Figure 6 shows the performance plot of the MLFFNN, i.e., the training, validation, and testing errors. It can be noted from this figure that the best validation performance was achieved at epoch 108, with an error rate of 0.047969. Moreover, the validation and test curves are very similar, which implies that no significant overfitting occurred (Samuel et al. 2017).

Fig. 6 Performance plot of the MLFFNN

Semantic segmentation of RS images

The goal of this section is to illustrate the applicability of the proposed approach for semantic image segmentation.

Figure 7 (left) shows an image taken from the previously described Kalideos database; this image is not part of the training dataset. It was acquired on January 31, 2015 by the SPOT 5 (Satellite Pour l'Observation de la Terre) satellite, with a spatial resolution of 10 m and a size of 800 × 500 pixels. Figure 7 (right) depicts the semantic image segmentation performed by the proposed approach.

Fig. 7 Satellite image acquired on January 31, 2015 (left) and the semantic image segmentation performed by the proposed approach (right)

Results of the semantic segmentation are compared to the ground truth image representing the same region at the same date. The comparison is carried out using two criteria: the overall accuracy (OA) and the kappa coefficient (K). OA is the number of correctly classified pixels divided by the total number of image pixels. K is an accuracy measure that compares the proposed classification results to the real ones; it takes values from zero to one, with higher values indicating a better classification. K is defined as follows (Congalton and Green 2008):

$$ K = \frac{n\sum_{i=1}^{k} n_{ii} - \sum_{i=1}^{k} n_{i+}\, n_{+i}}{n^2 - \sum_{i=1}^{k} n_{i+}\, n_{+i}} $$
(7)

where:

\( k \): the number of classes;

\( n \): the total number of pixels in the image;

\( n_{ii} \): the number of correctly classified pixels for class i (pixels belonging to class i in the ground truth that have also been classified as class i in the classified image);

\( n_{i+} \): the number of pixels classified into class i in the proposed image classification;

\( n_{+i} \): the number of pixels classified into class i in the ground truth image.
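As a sketch, these two criteria can be computed directly from a confusion matrix; the snippet below is a straightforward transcription of Eq. (7) and the OA definition (NumPy is an illustrative choice).

```python
import numpy as np

def overall_accuracy(cm: np.ndarray) -> float:
    """OA: correctly classified pixels over the total pixel count."""
    return np.trace(cm) / cm.sum()

def kappa(cm: np.ndarray) -> float:
    """Kappa coefficient, Eq. (7); cm rows = ground truth, cols = predicted."""
    n = cm.sum()
    diag = np.trace(cm)                              # sum of n_ii
    marg = (cm.sum(axis=0) * cm.sum(axis=1)).sum()   # sum of n_i+ * n_+i
    return (n * diag - marg) / (n ** 2 - marg)
```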

Table 2 depicts the confusion matrix of the proposed semantic segmentation for the image presented in Fig. 7. Rows denote the classes of the ground truth image, whereas columns represent the classes of the proposed semantic image segmentation. As indicated in Table 2, the proposed approach achieves a good semantic segmentation of the image, with OA = 91.85% and K = 0.8982.

Table 2 Confusion matrix of the proposed semantic image segmentation

Evaluation of the proposed approach

To further evaluate the performance of the top-down approach for semantic image segmentation, we compare the results of the proposed approach with widely used traditional classification methods: SVM (Support Vector Machines) (Huang et al. 2002; Mitra et al. 2004) and Maximum Likelihood Classification (MLC) (Bruzzone and Prieto 2001; Murthy et al. 2003).

Table 3 compares the image classification results of SVM, MLC, and the MLFFNN in terms of overall accuracy and kappa coefficient. As can be seen, the proposed approach outperforms the two other methods for the image presented in Fig. 7.

Table 3 Comparison of the classification accuracy between the proposed method, SVM, and MLC

Figure 8 illustrates the overall accuracies of image classification according to the training set size for the three methods: SVM, MLC, and the proposed method. The size of the training set varies from 100 to 800,000 samples. The important observation from this figure is that all three methods are positively influenced by the size of the training set, with SVM and the proposed approach providing the best results overall. The OA moves from 79.8% (100 samples) to 87.4% (800,000 samples) for the SVM method, from 71.5% to 84.1% for the MLC method, and from 74.2% to 91.6% for the proposed method. Furthermore, the proposed approach provides the best results when the training set becomes large (more than 200,000 samples).

Fig. 8 Overall accuracy of image classification according to the training set size for SVM, MLC, and the proposed method

Figure 9 shows the variation of the image classification error for SVM, MLC, and the proposed approach. We can note that SVM is the least sensitive to the size of the training set, with a difference of 7.6% between 100 and 800,000 samples. MLC comes second with a difference of 12.6%, and the proposed approach third with a difference of 17.4%.

Fig. 9 Image classification error for SVM, MLC, and the proposed approach

The SVM classifier outperforms the MLC method in all situations, regardless of the size of the training set. However, although SVM provides good classification results, it benefits little from very large training sets; this observation is consistent with results reported in the literature (Huang et al. 2002; Foody and Mathur 2004). In big RS data, however, the volume of data plays a very important role in semantic image segmentation, which makes the proposed method more appropriate for RS big data. Furthermore, our method consistently provides good results even when the training set is very small.

Conclusion

In this paper, we proposed a new approach for the semantic segmentation of RS big images. The main idea is to use features calculated at the object level to determine the class of every pixel in a new image; to do this, an 8-connected matrix centered on each pixel is considered. Determining the pixel class is achieved using an MLFFNN whose inputs are the object features and the different class labels, and whose output is the structure used for semantic segmentation.

Experiments were carried out on a real dataset belonging to the Kalideos database. The classification results obtained by the proposed approach show significant improvements in both the overall and the per-category classification accuracies. Moreover, the comparison with state-of-the-art classification methods shows that the proposed approach performs well, especially when the volume of data becomes large.

However, despite the promising results obtained by the proposed approach, several issues can be addressed in future work, such as the determination of the number of hidden-layer nodes and their respective weights. Another challenging topic to be explored is the effect of the 8-connected matrix centered on the pixel on the determination of the class label.