
1 Introduction

With the rapid development of remote sensing technology, optical remote sensing images have been extensively used in fields such as agricultural engineering, geographical surveying, military reconnaissance, natural disaster prediction, and environmental pollution monitoring [16]. However, cloud occlusion is an inevitable challenge in satellite imagery, since extensive cloud cover spans more than 60% of the Earth’s surface [14]. Cloud cover prevents satellite sensors from obtaining a clear view of the Earth’s surface, complicating many image analysis tasks such as remote sensing image classification and segmentation [29] and image matching [8]. It is therefore necessary to detect cloud cover quickly and accurately in order to improve the usability of remote sensing images.

Over the years, researchers have studied cloud detection in remote sensing imagery in depth and have proposed numerous algorithms. These methods can be broadly categorized into two types: threshold-based methods and machine learning-based methods. Threshold-based methods rely on the physical characteristics of clouds and set appropriate thresholds based on these characteristics to classify the pixels of an image into cloud and non-cloud categories. The ISCCP cloud mask algorithm [21] exploited the fact that cloudy and clear scenes differ in the amount of radiance variability they exhibit in space and time to detect clouds. Cihlar and Howarth [7] proposed a method that identifies clouds of different opacities as well as cloud shadows present in composites, effectively eliminating the impact of cloud contamination over land in AVHRR composite images. Huang et al. [12] used clear forest pixels as a reference to define cloud boundaries and separate clouds from clear surfaces in a spectral-temperature space. However, these methods lack a universal threshold and do not consider the structure and texture of clouds in complex scenes, resulting in low robustness. Machine learning-based methods extract features from remote sensing images as input and train a classification model on labeled samples. An and Shi [2] designed a scene-learning-based cloud detection algorithm that utilizes the color, texture, and structural features of the image. Li et al. [13] extracted brightness features, texture features, and average gray-level co-occurrence matrix features [10] from the image and used them to train a support vector machine (SVM) [25] classifier. Shi et al. [23] proposed a ground-based cloud detection method using a graph model built upon superpixels [1] to integrate multiple sources of information. However, these methods extract only shallow features from images through statistics such as the mean, maximum, minimum, and variance, which do not capture image content effectively, leading to reduced detection accuracy.

In recent years, convolutional neural networks (CNNs) have enabled rapid progress in computer vision thanks to their powerful feature extraction capabilities. CNN-based approaches improve a model’s understanding of images by stacking convolutional layers and yield superior performance in object detection, image classification, and semantic segmentation [24, 33]. Yang et al. [30] utilize thumbnail images to extract cloud masks, capturing multi-scale contextual information without losing resolution. Wu and Xu [28] present cross-supervised learning for cloud detection to address the shortage of labeled cloudy images. However, most existing deep learning methods rely on complex network structures with high computational resource requirements, whereas the devices deployed in practice often have limited computational power and storage space, which restricts the applicability of these methods.

To solve the above problem, we design a lightweight cloud detection framework called LigCDnet, which achieves excellent performance with very few parameters. The contributions of this paper are briefly summarized as follows:

  1. We propose a lightweight cloud detection framework based on the U-shaped architecture, called LigCDnet. It achieves state-of-the-art detection accuracy compared to existing cloud detection algorithms while having an extremely small number of parameters, only 2.39M.

  2. We utilize a lightweight feature extraction module (LFEM) to capture spatial and contextual information and design a channel attention module (CAM) to adjust the impact of different channels on cloud detection performance. Additionally, we propose a lightweight feature pyramid module (LFPM) to extract cloud features at different scales.

2 Proposed Method

2.1 Overall Framework

Fig. 1. Framework of the proposed LigCDnet.

We use a U-shaped encoder-decoder structure as the framework of our cloud detection network, as shown in Fig. 1. High-level features contain rich semantic information, while low-level features contain rich spatial information [32]. This abundant spatial information plays a crucial role in generating cloud masks. Therefore, in the encoder, we perform only three downsampling operations to preserve rich spatial information in the feature maps. We utilize the lightweight feature extraction module (LFEM) to extract cloud features as fully as possible while adding very few parameters. To enhance the feature-map channels that are favorable for cloud segmentation, we design a channel attention module (CAM) to adjust the channel weights. To better handle clouds of different scales, we propose the lightweight feature pyramid module (LFPM). In the decoder, we gradually restore the resolution of the feature maps through upsampling and compensate for the spatial information lost during encoding by connecting them with the encoder feature maps via skip connections.

Given a remote-sensing image I as input, the feature map S1 is first generated in the encoder by a depthwise separable convolution, which combines a depthwise convolution and a pointwise convolution, that is

$$\begin{aligned} S1 = H_{ conv_{dep}}\big (H_{ conv_{poi}}(I)\big ) \end{aligned}$$
(1)

where \(H_{ conv_{dep}}(\cdot )\) represents the depthwise convolution operation and \(H_{ conv_{poi}}(\cdot )\) denotes the pointwise convolution operation; S1 has the same spatial size as the input image.
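For concreteness, a minimal PyTorch sketch of this depthwise separable convolution is given below; the output channel count and kernel size are illustrative assumptions, and the operation order follows Eq. (1):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution as used in Eq. (1): pointwise then depthwise."""
    def __init__(self, in_ch=3, out_ch=32, kernel_size=3):
        super().__init__()
        # Pointwise (1x1) convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Depthwise convolution applies one spatial filter per channel (groups=out_ch).
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2, groups=out_ch)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))  # spatial size is preserved

# S1 = H_conv_dep(H_conv_poi(I)) for a 512 x 512 RGB patch
I = torch.randn(1, 3, 512, 512)
S1 = DepthwiseSeparableConv()(I)  # -> (1, 32, 512, 512)
```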

To reduce the computational complexity while extracting cloud information, we use a downsampling unit, that is

$$\begin{aligned} F_{ downsampling} = MaxPool\big (H_{LFEM}(S)\big ) \end{aligned}$$
(2)

where \(MaxPool(\cdot )\) is the max pooling operation and \(H_{LFEM}(\cdot )\) denotes the lightweight feature extraction module. The feature maps S2 and S3 are then generated by consecutive downsampling units, that is

$$\begin{aligned} S2 = F_{ downsampling}(S1) \end{aligned}$$
(3)
$$\begin{aligned} S3 = H_{LFPM}\bigg (H_{CAM}\Big (H_{LFEM}\big (F_{ downsampling}^2(S2)\big )\Big )\bigg ) \end{aligned}$$
(4)

where \(F^{2}_{downsampling}(\cdot )\) means that \(F_{downsampling}\) is applied twice, \(H_{CAM}(\cdot )\) represents the channel attention module, and \(H_{LFPM}(\cdot )\) denotes the lightweight feature pyramid module. S2 is \(1/2\times 1/2\) the size of the input image, and S3 is \(1/8\times 1/8\) the size of the input image.
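The downsampling unit of Eq. (2) can be sketched as follows; here `lfem` stands for any module implementing the LFEM described in Sect. 2.2, so the sketch only fixes the pooling behaviour:

```python
import torch.nn as nn

class DownsamplingUnit(nn.Module):
    """F_downsampling = MaxPool(LFEM(S)), Eq. (2)."""
    def __init__(self, lfem: nn.Module):
        super().__init__()
        self.lfem = lfem
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves height and width

    def forward(self, s):
        return self.pool(self.lfem(s))
```

Applying one such unit to S1 yields S2 at half resolution (Eq. (3)); two further units followed by CAM and LFPM yield S3 at 1/8 resolution (Eq. (4)).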

Because the feature map S3 generated in the encoder has a small spatial resolution, it suffers from information loss, insufficient contextual information, and blurred boundaries. In the decoder, we therefore restore the feature map to the same resolution as the input image by gradually upsampling it, with each upsampling step limited to \(2\times \). To reduce computational complexity, we apply simple bilinear interpolation directly to S3 twice, producing the feature map N1, which is \(1/2\times 1/2\) the size of the input image. We then introduce the upsampling unit, that is

$$\begin{aligned} F_{upsampling} ={ bilinear }\big (H_{conv_3}^2(I)\big ) \end{aligned}$$
(5)

where \(bilinear(\cdot )\) denotes bilinear interpolation operation, \(H_{conv_3}(\cdot )\) represents a convolution operation with a convolution kernel size of 3, and the predicted cloud detection result \(I_{O}\) can be described as

$$\begin{aligned} I_{O} =H_{CAM}\Bigg (concat\bigg (S1,F_{upsampling}\Big (concat\big (H_{CAM}(S2),N1\big )\Big )\bigg )\Bigg ) \end{aligned}$$
(6)

where concat denotes the concatenation operation, and \(I_{O}\) has the same size as the input image.
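A minimal PyTorch sketch of the decoder path in Eqs. (5) and (6) is given below; the channel counts, the CAM and upsampling-unit instances, and the interpolation settings are assumptions of the sketch rather than specifications of the implemented network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplingUnit(nn.Module):
    """F_upsampling = bilinear(conv3(conv3(x))), Eq. (5)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        x = self.conv(x)
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

def decode(S1, S2, S3, cam_s2, cam_out, up_unit):
    """Sketch of Eq. (6); cam_s2 and cam_out are CAM instances (Sect. 2.3)."""
    # N1: bilinearly restore S3 from 1/8 to 1/2 of the input resolution.
    N1 = F.interpolate(S3, scale_factor=4, mode="bilinear", align_corners=False)
    x = torch.cat([cam_s2(S2), N1], dim=1)       # fuse the S2 skip connection
    x = up_unit(x)                               # back to the full input resolution
    return cam_out(torch.cat([S1, x], dim=1))    # I_O, same size as the input image
```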

2.2 Lightweight Feature Extraction Module

In recent years, the main trend in improving a network’s understanding of complex scenes has been to build deeper and more complex networks. However, such networks incur a significant computational cost during both training and inference. To address this challenge, numerous lightweight network frameworks have been proposed. For instance, MobileNet [22] employs depthwise separable convolution, which consists of a depthwise convolution and a pointwise convolution: the depthwise convolution operates independently on each channel of the input feature map, while the pointwise convolution integrates information across channels to enhance the network’s representational capacity. LEDNet [26] applies channel splitting and shuffle operations in each residual block: channel splitting divides the channels of the feature map into multiple groups, allowing the network to independently extract different types of features, and the shuffle operation enables information exchange between the channel groups. Inspired by MobileNet, we design the lightweight feature extraction module shown in Fig. 2. It consists of two pointwise convolutions and one depthwise convolution. For a feature map with C channels, a pointwise convolution first expands the channel dimension to 2C; the different channels of the feature map can be seen as the network’s responses to various characteristics of the data, enabling the network to understand the data from different perspectives. A depthwise convolution is then used to capture the spatial features of clouds and extract local information in each channel. Finally, another pointwise convolution reduces the dimensionality back to C while integrating features across the channels of the feature map.

Fig. 2. Structure of LFEM.

For a \(3\times 3\) standard convolution with an input feature map of size \([H\times W \times C]\), the output channels set to 2C, and a convolution depth of 3, the number of parameters is \(3\times 3\times C\times 2C+3\times 3\times 2C\times 2C+3\times 3\times 2C\times 2C=90C^2\). The number of parameters of our lightweight feature extraction module is \(1\times 1\times C\times 2C+3\times 3\times 2C+1\times 1\times 2C\times 2C=6C^2+18C\). With the same depth of convolution layers, the standard convolution therefore requires \(15C/(C+3)\) times as many parameters as our module, i.e., more than an order of magnitude more for typical channel widths, so the LFEM greatly reduces the number of parameters while increasing inference speed and computational efficiency. The lightweight feature extraction module can be stated as

$$\begin{aligned} H_{\textrm{LFEM}} = H_{conv_{poi}}\Big (H_{conv_{dep}}\big (H_{conv_{poi}}(I)\big )\Big ) \end{aligned}$$
(7)
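A minimal PyTorch sketch of the LFEM, following the description above (expand to 2C channels, depthwise convolution, project back to C channels), is given below; normalisation and activation layers are omitted, and the exact internal layout is an assumption of the sketch:

```python
import torch
import torch.nn as nn

class LFEM(nn.Module):
    """Lightweight feature extraction module, Eq. (7): expand -> depthwise -> project."""
    def __init__(self, channels):
        super().__init__()
        self.expand = nn.Conv2d(channels, 2 * channels, kernel_size=1)        # C -> 2C
        self.depthwise = nn.Conv2d(2 * channels, 2 * channels, kernel_size=3,
                                   padding=1, groups=2 * channels)             # per-channel spatial features
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)        # 2C -> C

    def forward(self, x):
        return self.project(self.depthwise(self.expand(x)))

# Trainable parameter count for C = 64 channels (weights plus biases).
print(sum(p.numel() for p in LFEM(64).parameters() if p.requires_grad))
```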

2.3 Channel Attention Module

Fig. 3. Structure of CAM.

Computer vision tasks typically rely on convolutional operations to extract features from images, and during network learning the different channels of the feature map contribute to the task with different levels of importance. We therefore design a channel attention module that introduces a learnable weight vector so the network can automatically learn the importance of each channel. This allows the network to adjust the weight of each channel, strengthening its dependence on important channels and weakening its dependence on unimportant ones, as shown in Fig. 3. First, we apply max pooling to each channel of the feature map to generate an initial channel weight vector. This vector is then fed into three linear layers so that the network learns the importance of the different channels. The weight vector is subsequently normalized with the sigmoid function and, finally, multiplied element-wise with the corresponding channels. By re-weighting the channels of the feature map, the network can exploit inter-channel information more effectively, thereby improving task performance. The channel attention module can be stated as

$$\begin{aligned} H_{\textrm{CAM}} = \sigma \bigg (F_{fc2}\Big (F_{fc1}\big (F_{fc0}(MaxPool(I))\big )\Big )\bigg )\otimes I \end{aligned}$$
(8)

where \(F_{fc}\) denotes a linear layer operation, \(\sigma \) is the sigmoid function, and \(\otimes \) is the element-wise multiplication operation.
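A minimal PyTorch sketch of the CAM is shown below; the channel reduction ratio and the ReLU activations between the linear layers are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Channel attention module, Eq. (8): max-pool -> three linear layers -> sigmoid -> re-weight."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)                     # per-channel max pooling
        self.fc = nn.Sequential(                                # three linear units
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                             # initial channel weight vector
        w = torch.sigmoid(self.fc(w)).view(b, c, 1, 1)          # learned, normalized weights
        return x * w                                            # element-wise channel re-weighting
```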

2.4 Lightweight Feature Pyramid Module

Fig. 4. Structure of LFPM.

Clouds have diverse morphologies, and accurately segmenting clouds of different sizes is a fundamental challenge for cloud detection algorithms. Capturing multi-scale cloud features and establishing contextual information effectively enables the network to learn the feature differences between cloud regions and the background. Inspired by the ASPP [5] module, we propose the lightweight feature pyramid module shown in Fig. 4. The number of channels of the feature map is first adjusted by a pointwise convolution. Dilated convolution [31], also known as atrous convolution, can significantly expand the receptive field of a convolutional neural network, and combining multiple dilated convolutions with different sampling rates in parallel effectively captures multi-scale contextual information. We therefore use parallel dilated convolutions with dilation rates of 1, 6, and 12. To reduce computational complexity, the dilated convolutions are implemented as depthwise convolutions while keeping the dilation rates unchanged. Additionally, a global average pooling (GAP) layer is introduced to extract global contextual information. The features captured by the four parallel branches are then concatenated along the channel dimension. To facilitate feature reuse and mitigate the vanishing-gradient problem, a short connection is introduced. The lightweight feature pyramid module can be stated as

$$\begin{aligned} \begin{aligned} H_{\textrm{LFPM}} =concat(&I,H_{conv_{poi}}(I),H_{conv_{dep-r6}}(I),\\ {} &H_{conv_{dep-r12}}(I),bilinear(H_{GAP}(I))) \end{aligned} \end{aligned}$$
(9)

where \(H_{conv_{dep-r}}(\cdot )\) is a depthwise convolution with dilation rate r, and \(H_{GAP}(\cdot )\) denotes the global average pooling operation.
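A minimal PyTorch sketch of the LFPM is given below; the trailing 1×1 convolution that fuses the concatenated branches back to the original channel count is an assumption of the sketch, as Eq. (9) itself only specifies the concatenation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LFPM(nn.Module):
    """Lightweight feature pyramid module, Eq. (9)."""
    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, kernel_size=1)               # pointwise branch
        self.branch6 = nn.Conv2d(channels, channels, kernel_size=3, padding=6,
                                 dilation=6, groups=channels)                      # dilated depthwise, r = 6
        self.branch12 = nn.Conv2d(channels, channels, kernel_size=3, padding=12,
                                  dilation=12, groups=channels)                    # dilated depthwise, r = 12
        self.gap = nn.AdaptiveAvgPool2d(1)                                         # global context branch
        self.fuse = nn.Conv2d(5 * channels, channels, kernel_size=1)               # assumed fusion conv

    def forward(self, x):
        h, w = x.shape[2:]
        g = F.interpolate(self.gap(x), size=(h, w), mode="bilinear", align_corners=False)
        # Short connection (x itself) concatenated with the four parallel branches.
        out = torch.cat([x, self.branch1(x), self.branch6(x), self.branch12(x), g], dim=1)
        return self.fuse(out)
```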

3 Experimental Results

3.1 Dataset and Experimental Setup

Dataset. We chose two widely used datasets, the GF-1 remote sensing images and the LandSat8 dataset, to validate the effectiveness of our method, using only their visible channels. The GF-1 dataset includes 108 GF-1 Wide Field of View (WFV) level-2A scenes together with their reference cloud and cloud shadow masks; 86 of the images are used for training and 22 for testing [15]. The LandSat8 dataset contains 18 images of size 1000 \(\times \) 1000 for training and 20 images of the same size for testing [19]. We crop the original high-resolution images into 512 \(\times \) 512 \(\times \) 3 sub-images for training and testing.

Evaluation Metrics. To measure the performance of the model comprehensively, we use six widely adopted quantitative metrics: Jaccard index, precision, recall, F1-score, overall accuracy (OA), and mean intersection over union (MIoU). These metrics are defined as follows:

$$\begin{aligned} JaccardIndex= \frac{TP}{(TP+FN+FP)} \end{aligned}$$
(10)
$$\begin{aligned} Precision= \frac{TP}{(TP+FP)} \end{aligned}$$
(11)
$$\begin{aligned} Recall= \frac{TP}{(TP+FN)} \end{aligned}$$
(12)
$$\begin{aligned} F1\text {-}score= 2\times \frac{Precision\times Recall}{(Precision + Recall)} \end{aligned}$$
(13)
$$\begin{aligned} Overall Accuracy= \frac{TP+TN}{(TP+TN+FP+FN)} \end{aligned}$$
(14)
$$\begin{aligned} MIoU=\frac{1}{k}\sum _{i=1}^k\frac{n_{ii}}{\sum _{j=1}^k n_{ij}+\sum _{j=1}^k n_{ji}-n_{ii}} \end{aligned}$$
(15)

where TP, TN, FP, and FN are the total numbers of true-positive, true-negative, false-positive, and false-negative pixels, respectively. Here k represents the number of categories, \(n_{ii}\) represents the count of correctly predicted pixels of class i, and \(n_{ij}\) represents the count of pixels whose true class is i but which are predicted as class j.
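A minimal NumPy sketch of these metrics for a single binary cloud mask (1 = cloud, 0 = clear) is given below; the handling of empty classes and the aggregation over a whole test set are left out for brevity:

```python
import numpy as np

def cloud_metrics(pred, gt):
    """Pixel-wise metrics of Eqs. (10)-(15) for binary masks (1 = cloud, 0 = clear)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()

    jaccard = tp / (tp + fn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    # MIoU, Eq. (15) with k = 2: average IoU of the cloud and non-cloud classes.
    miou = 0.5 * (tp / (tp + fn + fp) + tn / (tn + fp + fn))
    return dict(jaccard=jaccard, precision=precision, recall=recall,
                f1=f1, oa=oa, miou=miou)
```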

Parameter Settings. Our model is implemented in the PyTorch framework [20], and training is performed on Ubuntu 22.04 with an RTX 3090 GPU. We optimize with the stochastic gradient descent (SGD) [3] algorithm using an initial learning rate of \(2\times 10^{-4}\), the “poly” decay strategy [4], a batch size of 4, and a momentum of 0.9. All CNN-based methods are trained with the same configuration and settings, without pre-training.
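The optimizer and learning-rate schedule can be set up as in the sketch below; the poly power (0.9) and the total number of iterations are assumptions, since they are not specified above, and the model here is only a stand-in for LigCDnet:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, kernel_size=1)          # stand-in for LigCDnet
base_lr, max_iters, power = 2e-4, 40000, 0.9    # power and max_iters are assumptions

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
# "poly" decay: lr = base_lr * (1 - iter / max_iters) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda it: (1 - it / max_iters) ** power)
# In the training loop: optimizer.step() followed by scheduler.step() each iteration.
```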

3.2 Comparative Experiments

Comparative Methods. This paper compares one machine learning-based cloud detection method, SVM [11], and six state-of-the-art deep learning-based cloud detection algorithms: FCN-8 [17], DeeplabV3+ [6], CDnetV1 [30], CDnetV2 [9], LWCDnet [18], and Boundarynet [27]. Among these, LWCDnet is a lightweight cloud detection method.

Table 1. Quantitative comparisons with other cloud detection methods on the GF-1 test set. Cloud extraction accuracy (%)
Fig. 5. Visual comparisons of different cloud detection methods on the GF-1 dataset.

Cloud Detection Results on GF-1 Dataset: Table 1 reports the results of the different cloud detection methods on the GF-1 dataset. Our proposed LigCDnet outperforms the compared methods on most metrics. Compared with the SVM machine learning method, the deep learning methods show significant advantages on all metrics; FCN-8, for instance, improves over SVM by an average of 12% in Jaccard index, recall, and F1-score. Among the deep learning methods, Boundarynet achieves slightly higher precision than ours and the same OA; however, in terms of Jaccard index and MIoU, our method is higher by 1.15% and 0.67%, respectively. Compared with the lightweight method LWCDnet, our method is higher by 7.3% and 5.74% in F1-score and MIoU, respectively. Fig. 5 shows a visual comparison of the cloud detection methods on five typical examples from the GF-1 dataset, covering a variety of cloud cover and backgrounds. For clarity, white denotes correctly labeled cloud pixels, black denotes non-cloud pixels, and red marks misclassified pixels. Visually, SVM performs the worst: it only extracts the physical features of the image and does not fully comprehend its context. CDnetV2 tends to misclassify bright objects as clouds. Overall, our LigCDnet performs the best.

Table 2. Quantitative comparisons with other cloud detection methods on the LandSat8 test set. Cloud extraction accuracy (%)
Fig. 6. Visual comparisons of different cloud detection methods on the LandSat8 dataset.

Cloud Detection Results on LandSat8 Dataset: Table 2 reports the results of the different cloud detection methods on the LandSat8 dataset. Our proposed LigCDnet again achieves better performance, especially in terms of Jaccard index and MIoU. While LigCDnet’s OA is only 0.39% higher than that of CDnetV1, its MIoU improves significantly, by 3.19%. Compared with the lightweight network LWCDnet, our network still shows clear advantages, with a 3.38% higher recall and an 8.19% higher MIoU. Figure 6 illustrates five examples from the LandSat8 dataset; these examples cover various backgrounds, including scenes where thin clouds and ice coexist. The visual results show that SVM performs poorly in scenarios where ice and snow are present, and DeeplabV3+ and LWCDnet also exhibit significant errors in scenes containing thin clouds. In contrast, our method performs best overall across these complex scenarios and has fewer false positives (highlighted in red) than the other methods.

Computational Complexity Analysis: In Table 3, we use floating point operations (FLOPs) and the number of trainable parameters to assess the computational complexity of the compared networks. Since FLOPs are directly proportional to the input image size, they are computed for input images of size 224 \(\times \) 224 \(\times \) 3. The table shows that our proposed network has the fewest parameters. Although our method has 7.69% higher GFLOPs than the lightweight model LWCDnet, it demonstrates significant advantages in both the quantitative and qualitative analyses on both datasets. This is because we employ the channel attention module (CAM) multiple times to adjust the weights of the feature-map channels and design the lightweight feature pyramid module (LFPM) to capture multi-scale cloud features.

Table 3. Computational Complexity Analysis Based on CNN Method
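The parameter column of Table 3 can be reproduced with a few lines of PyTorch, as sketched below; FLOPs, in contrast, depend on the input size and are measured on a fixed 224 × 224 × 3 input with a profiling tool:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameter count in millions, as reported in Table 3."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```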

3.3 Ablation Study

Our proposed LigCDnet consists of three modules: the lightweight feature extraction module (LFEM), the channel attention module (CAM), and the lightweight feature pyramid module (LFPM). To investigate the contribution of the different components, we conduct an ablation analysis on the GF-1 dataset. Table 4 provides the detailed quantitative results.

From the results, the full LigCDnet demonstrates the best performance, and removing any of the modules degrades network performance to some degree. Removing CAM worsens the metrics, indicating that CAM adjusts the weights of the different channels in the feature maps so that channels favorable for the detection task play a major role. Without LFPM, the metrics also decrease, which suggests that LFPM captures cloud features at different scales. Overall, all three modules play important roles in the cloud detection task.

Table 4. Ablation study on the GF-1 dataset by our LigCDnet with different modules

4 Conclusions

This article proposes a lightweight cloud detection method, LigCDnet. Compared with existing cloud detection models, LigCDnet achieves the best detection accuracy with a minimal number of parameters. In LigCDnet, we extensively extract multi-scale contextual features and further enhance segmentation accuracy by adjusting the channel weights of the feature maps. In the encoder, the LFEM effectively extracts the semantic information of clouds, while the CAM enhances feature-map channels beneficial to the detection task and suppresses channels that interfere with segmentation accuracy; owing to the diverse morphology of clouds, the LFPM efficiently captures contextual features at different scales. In the decoder, the feature maps are gradually restored to the size of the input image, with skip connections compensating for the lost spatial information. Extensive experiments on the GF-1 and LandSat8 datasets show that LigCDnet achieves excellent performance while reducing computational cost.