1 Introduction

Remote sensing images play a vital role in natural disaster detection, agricultural resource management, environmental monitoring, urbanization surveys and other research fields. However, a factor that cannot be ignored in optical satellite images is cloud cover. Clouds interfere with remote sensing data by reflecting and absorbing electromagnetic radiation, which makes data interpretation difficult. Consequently, accurately identifying the cloud coverage of images is a crucial task in the remote sensing field for subsequent applications [1].

Cloud detection methods can be roughly grouped into classical methods and pattern recognition methods [2]. Threshold-based methods are the earliest classical methods. They mainly analyze individual pixels, as in the automatic cloud cover assessment [3] and Fmask [4], and segment cloud from images using multiple fixed thresholds. Notably, Sun et al. [5] proposed a general dynamic-threshold cloud detection algorithm to address the difficulty of selecting fixed thresholds. Since threshold-based methods are easily restricted by the spectrum, Bayesian methods [6] and texture-based methods [7], which exploit the spectral and geometric properties of cloud, were proposed to leverage more features. Moreover, some methods based on statistical characteristics [8] were proposed for thin cloud detection. These approaches mainly exploit the physical properties of clouds, so results can be obtained quickly, but the high-level characteristics of the images are ignored, which causes detection difficulties in complex surface environments and with ever-changing clouds.

With the development of computer hardware, pattern recognition technology has attracted much attention, and many advanced machine learning methods for cloud identification have been proposed. Among them, early clustering [9], fuzzy clustering [10, 11] and SVM [12,13,14] methods have formed a mature system; however, their detection accuracy is limited by poor performance on large-scale training sets. In recent years, artificial neural networks have emerged as a promising approach for cloud detection owing to their ability to learn complex patterns and feature representations from large amounts of labeled training data. For example, U-net [15, 16] uses a fully symmetrical network structure and skip connections to improve cloud detection accuracy with fewer training samples. MS-UNet [17] combines convolutions of different sizes to extract multi-scale features and thereby identify clouds of different sizes and shapes. Cloud-Net [18], proposed by Mohajerani et al., adds a residual structure to U-Net and achieves superior results on Landsat 8 images. More advanced networks have since been proposed: UNet 3+ [19] uses full-scale skip connections to preserve spatial information and fuse features across layers; Li et al. proposed the global context-dense block U-Net (GCDB-UNet) [20] to enhance thin cloud detection; and Lu et al. designed a mutual guidance module (MGM) [21] to address rough segmentation boundaries. Although these methods can detect most clouds in remote sensing images, their thin cloud recognition and boundary identification capabilities still need to be strengthened, especially for medium- and high-resolution images such as Landsat 8.

To better capture complex semantic features and precisely segment clouds in remote sensing images, a model with two cascaded U-shape attention networks (CUA-Net) is proposed. Its innovations are as follows: (1) it enhances the connections between network layers to preserve as much information as possible; (2) it uses an attention module to focus on relevant cloud features and ignore irrelevant ones, which improves the network's ability to identify clouds in complex scenes with varying cloud and background noise; (3) a second U-shape network is designed to correct inaccurate information from the previous steps. Through these structures, the features extracted by the convolution blocks can be used effectively to recover sophisticated cloud masks and achieve higher accuracy.

2 Algorithm

The architecture consists of two cascaded U-shape networks, as shown in Fig. 1. The first network performs a preliminary segmentation by identifying the possibly cloudy regions of the image. Its output \({X}_{Up}^{1}\) is then fed into the second network, which refines edges and details by further segmenting the cloudy regions and removing false detections. After that, the preliminary result \({X}_{Up}^{1}\) and the supplementary information \({X}_{De}^{1}\) are added and convolved once to obtain the final cloud detection result. The components of the proposed CUA-Net are introduced below.

Fig. 1. The proposed Cascaded U-shape Attention Networks (CUA-Net).
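To make the data flow concrete, the following minimal Keras sketch wires the two sub-networks together (Keras is the framework stated in Sect. 3.1). The callables `first_unet` and `second_unet` are placeholders for the networks of Sects. 2.1 and 2.2, and the exact placement of activations is simplified; this illustrates the cascade rather than reproducing the authors' code.

```python
from tensorflow.keras import layers, Model

def build_cua_net(first_unet, second_unet, input_shape=(384, 384, 4)):
    """Cascade wiring of CUA-Net (cf. Fig. 1); sub-networks are placeholders."""
    inputs = layers.Input(shape=input_shape)
    x_up1 = first_unet(inputs)            # preliminary cloud probabilities
    x_de1 = second_unet(x_up1)            # refined / corrected supplement
    fused = layers.Add()([x_up1, x_de1])  # add preliminary result + supplement
    # One final convolution produces the cloud mask, as described in Sect. 2.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(fused)
    return Model(inputs, outputs, name="cua_net")
```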

2.1 The First U-shape Network

The first U-shape network consists of a contraction path for feature extraction and an expansion path for image recovery. The two paths are linked by attention-based skip connections, which transfer deep features from the contraction path to the expansion path to preserve spatial information.

Down-sampling Layer in Contraction Path.

The down-sampling layer mainly uses the residual structure shown in Fig. 2. The upper branch contains two \(3\times 3\) convolutions to extract features from the input. The lower branch uses a small-scale skip connection, where the input first goes through a \(1\times 1\) convolution and is then connected with the original input. Finally, the results of the two branches are summed and passed to a max-pooling operation. This structure avoids the vanishing gradients caused by deep networks and makes the encoder converge faster. At the same time, it allows the network to learn the residual mapping between the input and output feature maps, which helps preserve the low-level features from earlier layers.

Fig. 2. Down-sampling layer in contraction path.
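A minimal Keras sketch of this down-sampling layer follows. The filter count, padding and ReLU activations are assumptions, and "connected with the original input" is read here as an additive shortcut:

```python
from tensorflow.keras import layers

def down_sampling_layer(x, filters):
    """Residual down-sampling block (cf. Fig. 2); details are assumptions."""
    # Upper branch: two 3x3 convolutions for feature extraction.
    upper = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    upper = layers.Conv2D(filters, 3, padding="same", activation="relu")(upper)
    # Lower branch: a 1x1 convolution forming a small-scale skip connection,
    # added back to the input when the channel counts allow it.
    lower = layers.Conv2D(filters, 1, padding="same")(x)
    if int(x.shape[-1]) == filters:
        lower = layers.Add()([lower, x])
    # Sum the two branches (residual mapping), then max-pool to down-sample.
    summed = layers.Add()([upper, lower])
    pooled = layers.MaxPooling2D(pool_size=2)(summed)
    return pooled, summed  # `summed` is X_Res^i, reused by skip connections
```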

Attention-Based Skip Connection.

In U-Net, skip connections are used to preserve the features learned in the contraction path and improve segmentation accuracy. However, only layers at the same depth are connected in the original U-Net architecture. To address this limitation, the modified skip connection shown in Fig. 3 is proposed: the features from all previous layers in the contraction path are concatenated and sent to the expansion path. To make the outputs of layers \({X}_{Res}^{1}, \cdots , {X}_{Res}^{i-1}, {X}_{Res}^{i}\) compatible, multiple self-connections are used to match the dimensions of \({X}_{Res}^{1}, \cdots , {X}_{Res}^{i-1}\) to that of \({X}_{Res}^{i}\), and the feature-map size is then unified by max pooling. After that, all \(i\) layers are added and fed into the subsequent attention module. This modified skip connection allows the network to capture more fine-grained details and improves cloud detection accuracy.

The convolutional block attention module (CBAM) [22] is a lightweight attention architecture composed of a channel attention module (CAM) and a spatial attention module (SAM). CAM focuses on category information: the input first goes through parallel MaxPool and AvgPool layers and then passes through a shared MLP to extract more comprehensive high-level features. SAM pays more attention to the spatial location of the target: it applies average pooling and max pooling along the channel axis, which effectively strengthens the spatial information.
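A compact Keras sketch of CBAM is given below. It follows the original formulation in [22]; the reduction ratio of 8 and the \(7\times 7\) spatial kernel come from that paper, not from this one, and should be treated as assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam(x, reduction=8):
    """CBAM [22]: channel attention (CAM) followed by spatial attention (SAM)."""
    channels = int(x.shape[-1])
    # CAM: shared MLP over average- and max-pooled channel descriptors.
    mlp_1 = layers.Dense(channels // reduction, activation="relu")
    mlp_2 = layers.Dense(channels)
    avg = mlp_2(mlp_1(layers.GlobalAveragePooling2D()(x)))
    mx = mlp_2(mlp_1(layers.GlobalMaxPooling2D()(x)))
    ca = layers.Reshape((1, 1, channels))(
        layers.Activation("sigmoid")(layers.Add()([avg, mx])))
    x = layers.Multiply()([x, ca])
    # SAM: average and max pooling along the channel axis, then one conv.
    avg_sp = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_sp = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    sa = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        layers.Concatenate()([avg_sp, max_sp]))
    return layers.Multiply()([x, sa])
```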

The attention-based skip connection preserves the features extracted from all layers in the contraction path and directs effective attention to the channel and spatial characteristics of the target. Moreover, the number of parameters in this structure is small, so it does not place an additional burden on the network.

Fig. 3. Attention-based skip connection.
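The following sketch shows one way to assemble this attention-based skip connection for depth \(i\), reusing the `cbam` helper above. Reading "self-connection" as repeated channel-wise self-concatenation is one interpretation of the text, and the sketch assumes the usual channel doubling between encoder depths:

```python
from tensorflow.keras import layers

def attention_skip(enc_feats, i):
    """Fuse X_Res^1..X_Res^i into AM^i (Fig. 3); dimension matching is
    an interpretation of the text, not a confirmed implementation."""
    target_ch = int(enc_feats[i].shape[-1])
    fused = []
    for j, f in enumerate(enc_feats[: i + 1]):
        ch = int(f.shape[-1])
        while ch < target_ch:            # self-concatenate to match channels
            f = layers.Concatenate()([f, f])
            ch *= 2
        if i - j > 0:                    # max-pool to unify the spatial size
            f = layers.MaxPooling2D(pool_size=2 ** (i - j))(f)
        fused.append(f)
    return cbam(layers.Add()(fused))     # attention module from the sketch above
```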

Up-sampling Layer in Expansion Path.

The up-sampling layer in the expansion path increases the resolution of the feature maps while reducing the number of channels, as shown in Fig. 4. The input \({X}_{Up}^{i+1}\) is first up-sampled by a deconvolution, then combined with \({AM}^{i}\) from the corresponding skip connection and with \({X}_{Up}^{i+2}, {X}_{Up}^{i+3}, \cdots , {X}_{Up}^{5}\) from the lower up-sampling layers. In this way, not only the feature maps from the contraction path but also those from the preceding layers of the expansion path are used. The combination goes through two convolutions to recover semantic details and is added to the deconvolved \({X}_{Up}^{i+1}\). Thanks to this full use of multi-scale information, more complex and detailed cloud properties can be recovered from the deep feature maps.

Fig. 4. Up-sampling layer in expansion path.
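A possible Keras realization of this up-sampling layer is sketched below; the ordering of `deeper_ups` (shallowest first) and the resize factors are assumptions made for illustration:

```python
from tensorflow.keras import layers

def up_sampling_layer(x_below, am_i, deeper_ups, filters):
    """Up-sampling layer (cf. Fig. 4): deconvolve X_Up^{i+1} (`x_below`) and
    fuse it with AM^i and the resized deeper outputs X_Up^{i+2}..X_Up^5."""
    up = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x_below)
    # X_Up^{i+2+k} sits k+2 levels below level i, hence the 2**(k+2) factor.
    resized = [layers.UpSampling2D(size=2 ** (k + 2))(f)
               for k, f in enumerate(deeper_ups)]
    merged = layers.Concatenate()([up, am_i] + resized)
    # Two convolutions recover semantic detail; the deconvolved input is
    # then added back, giving X_Up^i.
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(merged)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    return layers.Add()([y, up])
```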

2.2 The Second U-shape Network

The second U-shape network is mainly used to refine the segmentation mask generated by the first network. Although most cloud information can be extracted by the first network, thin and fragmentary clouds are easily missed, and some highly reflective surfaces can be mistaken for cloud. The second U-shape network is therefore designed to revise these incorrect detections. It consists of an encoder-decoder structure with skip connections between them, similar to a four-layer U-Net. The difference is that the bridge layer in the middle uses dropout to prevent the model from overfitting. No extra structures are added to the second network because of its complementary role and the desire to keep network complexity low.
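A sketch of such a four-layer refinement U-Net with a dropout bridge is shown below; the base filter count and the dropout rate are assumptions:

```python
from tensorflow.keras import layers

def second_unet(x, base=32):
    """Four-layer refinement U-Net with a dropout bridge (Sect. 2.2);
    filter counts and dropout rate are assumptions."""
    skips, y = [], x
    for d in range(4):                   # encoder
        y = layers.Conv2D(base * 2 ** d, 3, padding="same", activation="relu")(y)
        skips.append(y)
        y = layers.MaxPooling2D(2)(y)
    y = layers.Conv2D(base * 16, 3, padding="same", activation="relu")(y)
    y = layers.Dropout(0.5)(y)           # bridge layer with dropout
    for d in reversed(range(4)):         # decoder with skip connections
        y = layers.Conv2DTranspose(base * 2 ** d, 2, strides=2, padding="same")(y)
        y = layers.Concatenate()([y, skips[d]])
        y = layers.Conv2D(base * 2 ** d, 3, padding="same", activation="relu")(y)
    return layers.Conv2D(1, 1, activation="sigmoid")(y)  # X_De^1
```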

2.3 Activation Function and Loss Function

ReLU is used as the activation function everywhere except in the last layers of the two U-shape networks and inside the attention module, which has its own fixed activations. ReLU is a piecewise linear function that outputs zero for negative inputs and passes positive inputs through linearly. By introducing non-linearity at low cost, it helps protect the network from vanishing gradients and overfitting. Sigmoid is used as the activation function after \({X}_{Up}^{1}\) and \({X}_{De}^{1}\) to map the outputs into the range 0 to 1, thus giving the probability that each pixel is cloudy.

Denoting the ground-truth value as \(t\), the predicted value as \(p\), and the total number of pixels as \(N\), the loss function is given in Eq. (1).

$$Loss\left(t, p\right)=1-\frac{\left(1+{\beta }^{2}\right)\times \sum_{i=1}^{N}t\left(i\right)p\left(i\right)+\epsilon }{{\beta }^{2}\times \sum_{i=1}^{N}t\left(i\right)+\sum_{i=1}^{N}p\left(i\right)+\epsilon }$$
(1)

where \(i\) denotes the \(i\)-th pixel in the image and \(\beta\) is a constant that controls the weight of recall relative to precision. In the experiments, \(\beta\) is set to 2 to give more weight to recall, which suits cloud detection datasets where the positive class is smaller than the negative class. \(\epsilon\) is set to \({10}^{-7}\) to avoid division by zero.
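Eq. (1) translates directly into a Keras loss function, as in the following sketch (`beta` and `eps` follow the values stated above):

```python
import tensorflow.keras.backend as K

def f_beta_loss(beta=2.0, eps=1e-7):
    """Soft F-beta loss of Eq. (1); beta = 2 emphasizes recall."""
    def loss(t, p):
        tp = K.sum(t * p)                               # soft true positives
        numerator = (1.0 + beta ** 2) * tp + eps
        denominator = beta ** 2 * K.sum(t) + K.sum(p) + eps
        return 1.0 - numerator / denominator
    return loss
```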

3 Data and Experiments

3.1 Data and Environment

The experimental dataset is the 38-Cloud dataset [18] created by Sorour Mohajerani, which includes 18 scenes for training and 20 scenes for testing; each scene is cut into 384 \(\times\) 384 patches. The dataset is derived from Landsat 8 images with a resolution of 30 m, and their red, green, blue and near-infrared bands are chosen for cloud detection.
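For illustration, the following sketch stacks the four band patches of one 38-Cloud sample into a single array; the directory and file naming follow the public dataset layout, and the 16-bit normalization is an assumption:

```python
import numpy as np
from skimage import io  # skimage is part of the paper's stated environment

def load_patch(patch_id, root="38-Cloud/train"):
    """Stack R, G, B, NIR patches into a (384, 384, 4) float array.
    File layout follows the public 38-Cloud release -- an assumption."""
    bands = []
    for b in ("red", "green", "blue", "nir"):
        path = f"{root}/train_{b}/{b}_{patch_id}.TIF"
        bands.append(io.imread(path).astype(np.float32) / 65535.0)  # 16-bit
    return np.stack(bands, axis=-1)
```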

The experiments were performed on a Linux system with Python 3.6, configured with GPU versions of TensorFlow 1.12.0, Keras 2.2.4 and scikit-image 0.15.0. A Quadro RTX 5000 graphics card was used for training and prediction. The Adam optimizer with an initial learning rate of \(1\times {10}^{-4}\) was used during training, and training was stopped once the learning rate had decayed to \(1\times {10}^{-8}\).
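The paper states only the initial and final learning rates, so the following training-setup sketch assumes a plateau-based decay schedule as the mechanism in between:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.optimizers import Adam

# Decay trigger (plateau on validation loss), factor and patience values
# are assumptions; only the 1e-4 start and 1e-8 floor come from the paper.
optimizer = Adam(learning_rate=1e-4)
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5, min_lr=1e-8),
    EarlyStopping(monitor="val_loss", patience=15, restore_best_weights=True),
]
```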

3.2 Experiments Results

To verify the capability of the proposed CUA-Net, comparison experiments and ablation experiments were conducted. The comparison experiments evaluate CUA-Net against state-of-the-art networks, while the ablation experiments assess the effectiveness of the second U-shape network and of CBAM in the skip connections.

Comparison Experiments.

U-net [16], MS-UNet [17], Cloud-Net [18] and UNet 3+ [19] are selected for comparison, and the experimental results are shown in Fig. 5, where black and white denote correctly identified clear and cloudy areas, respectively, red marks cloudy pixels falsely detected as clear, and blue marks clear pixels falsely detected as cloudy.

Fig. 5. Visual results of cloud detection in comparison experiments.

The visual results of cloud detection on a whole scene by the different methods are shown in Fig. 5(a). All methods detect the majority of the cloudy area, but U-net, MS-UNet and Cloud-Net make more mistakes, especially in the highlighted regions in the lower right corner. Although UNet 3+ achieves better results, its boundary performance is still inferior and it misses more cloud than CUA-Net. Figures 5(b)–5(e) show local details for four different types of land cover: bare land, ice, vegetation and mountains. The results indicate that CUA-Net achieves better visual quality, with less confusion and clearer boundaries, under the different surface conditions. For example, in Fig. 5(b) and Fig. 5(d), which contain both thin and thick cloud, all methods accurately detect the main cloud body, but at edges and in fine details the results of CUA-Net are the most consistent with the ground truth. In Fig. 5(c), which is covered with ice and snow, U-net and MS-UNet show many omissions along the boundary; Cloud-Net and UNet 3+ perform better but their detail extraction still needs strengthening, whereas CUA-Net accurately distinguishes ice from cloud thanks to its advantageous structures. For the highlighted ground shown at the top of Fig. 5(e), the four competing methods all misclassify it as cloud to some extent, while CUA-Net does not. This visual interpretation confirms that CUA-Net delivers more detailed edges and superior cloud detection results compared with the other methods.

To evaluate cloud detection accuracy more objectively, Precision, Recall, Specificity, Intersection over Union (IoU), Overall Accuracy (OA) and the F1 score are selected for quantitative evaluation. High precision indicates that the detected cloud pixels are mostly true cloud, while high recall means that the model detects most of the cloud. Specificity measures the accuracy of negative predictions, IoU the overlap between the prediction and the ground truth, and OA the proportion of correctly classified pixels. The F1 score is the harmonic mean of precision and recall and measures their balance. They are defined in Eqs. (2)–(7).

$$Precision=\frac{TP}{TP+FP}$$
(2)
$$Recall=\frac{TP}{TP+FN}$$
(3)
$$Specificity=\frac{TN}{TN+FP}$$
(4)
$$IoU=\frac{TP}{TP+FP+FN}$$
(5)
$$OA=\frac{TP+TN}{TP+FP+FN+TN}$$
(6)
$$F1=2\times \frac{Precision\times Recall}{Precision+Recall}$$
(7)

where \(TP\) (true positive) is the number of correctly detected cloud pixels, \(TN\) (true negative) the number of correctly detected clear pixels, \(FP\) (false positive) the number of clear pixels incorrectly detected as cloud, and \(FN\) (false negative) the number of cloud pixels incorrectly detected as clear. The quantitative evaluation results are shown in Table 1.
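These definitions can be computed directly from binary prediction and ground-truth masks, as in this NumPy sketch (it assumes both classes are present, so no denominator is zero):

```python
import numpy as np

def evaluate(pred, truth):
    """Pixel-wise metrics of Eqs. (2)-(7); pred/truth are binary arrays."""
    tp = np.sum((pred == 1) & (truth == 1))  # cloud detected as cloud
    tn = np.sum((pred == 0) & (truth == 0))  # clear detected as clear
    fp = np.sum((pred == 1) & (truth == 0))  # clear detected as cloud
    fn = np.sum((pred == 0) & (truth == 1))  # cloud detected as clear
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Precision": precision,
        "Recall": recall,
        "Specificity": tn / (tn + fp),
        "IoU": tp / (tp + fp + fn),
        "OA": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * precision * recall / (precision + recall),
    }
```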

Table 1. Accuracy evaluation results in comparison experiments (%).

Table 1 shows that the proposed method achieves higher accuracy than the other four networks in Precision, Recall, Specificity, IoU, OA and F1, which is consistent with the visual interpretation and indicates that the proposed method performs better in most remote sensing scenes.

Ablation Experiments.

To verify the effect of the second U-shape network (denoted as S-UNet) and of CBAM in the skip connections, four ablation experiments were designed: (1) only the first U-shape network (denoted as F-UNet only), (2) the second U-shape network without CBAM (denoted as +S-UNet), (3) CBAM without the second U-shape network (denoted as +CBAM), and (4) both the second U-shape network and CBAM (CUA-Net). Their visual results and accuracy evaluations are shown in Fig. 6 and Table 2, respectively.

Fig. 6. Cloud detection visual results of ablation experiments.

Table 2. Accuracy evaluation results of ablation experiments (%).

Comparing F-UNet only with +S-UNet, and +CBAM with CUA-Net, it can be seen that S-UNet leads to a slight decrease in Recall, but the Specificity, IoU and F1 scores are higher than in the experiments without S-UNet, and the Precision is remarkably improved. The visual interpretation also shows that adding S-UNet yields results closer to the ground truth, as it complements cloud edges and details well. Comparing F-UNet only with +CBAM, and +S-UNet with CUA-Net, it can be confirmed that CBAM focuses well on the attributes and locations of cloud, which improves detection accuracy across the board and reduces the probability of confusing cloudy and clear areas. Overall, the best cloud detection results are achieved when both S-UNet and CBAM are used.

4 Conclusion

In conclusion, the proposed CUA-Net for cloud detection has shown promising results. The second U-shape network supplements details and cloud boundaries, yielding more refined results that are closer to the ground truth. The dense connections and the attention module help the network preserve and focus on important features while suppressing irrelevant ones, contributing to higher accuracy. CUA-Net has been evaluated on the 38-Cloud dataset against four representative networks, and the results show that it outperforms the other methods in both quantitative evaluation and visual quality. Overall, the proposed method has potential in remote sensing applications where cloud detection is essential, and further research can be conducted to optimize the model for better performance.