
1 Introduction

Natural water bodies are critical for monitoring water supplies, predicting catastrophes, and conserving nature, all of which depend on measuring how a water body changes. Recognizing water bodies in detail and monitoring their changes through remote sensing images is therefore a critical task. The objective of this research is to accurately detect water bodies in strenuous and complex environments using high-resolution remote sensing imagery [1,2,3]. Satellites and airborne vehicles carry a variety of instruments, and their remote sensing photography covers large-scale water areas; the resulting pictures, however, can be difficult to interpret. Aquatic organisms are often to blame for such degradations [4]. Banks are blocked by vegetation, silt, and boats, as well as by shadows cast by surrounding tall trees, and imaging conditions, water quality, and microbes may all play a role in producing unusual hues [5, 6]. Consequently, obtaining the shape of aquatic bodies is a major difficulty (Fig. 1).

Fig. 1
Some typical water-body samples in a VHR aerial images and b Gaofen-2 (GF2) satellite images (eight of each), showing water bodies together with surrounding vegetation, terrain, and buildings

Extracting water bodies correctly from VHR remote sensing imagery in complicated settings remains challenging. Existing remote sensing extraction approaches concentrate on the spectral features of each band and on manually constructed algorithms, such as band-threshold methods, supervised classification-based methods, water- and vegetation-index-based methods, and spectral relationship methods [7]. These approaches, however, pay little attention to the spatial information of water bodies (i.e., shape, size, texture, edge, shadow, and context semantics), which has a substantial impact on classification accuracy. The scarcity of automation in traditional approaches is also a barrier for large-scale remote sensing imagery. Convolutional neural networks (CNNs), by contrast, have demonstrated tremendous capabilities in image classification, target recognition, and semantic segmentation [8,9,10,11,12,13]. Long et al. [8] pioneered the fully convolutional network (FCN), which replaces the last fully connected layers with convolutional ones for end-to-end semantic segmentation. FCNs are broadly utilized and well developed in the realm of semantic segmentation, making them a mainstream technology.

Deep learning-based water-body segmentation from remote sensing images has attracted considerable interest recently. The feature fusion in FCN-based methods combines high-level semantic features with precisely localized features, making it easier to identify water bodies and extract their borders with precision. Our technique considers three parts: feature extraction, prediction optimization, and the merging of shallow and deep layers.
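To make the FCN idea from [8] concrete, the following is a minimal PyTorch sketch, not the paper's network: the classifier is a 1 × 1 convolution rather than a fully connected layer, so the model produces a dense per-pixel map at any input size. The layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Minimal FCN-style network: the final classifier is a 1x1
    convolution instead of a fully connected layer, yielding a
    per-pixel score map that is upsampled to the input size."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(  # toy encoder with overall stride 4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(64, num_classes, 1)  # "convolutionalized" FC layer

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.classifier(self.backbone(x))
        # upsample logits back to the input resolution for end-to-end training
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)

out = TinyFCN()(torch.randn(1, 3, 256, 256))  # -> (1, 2, 256, 256)
```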

2 Methodology

To begin, we go through the design of the MECNet architecture. A multi-feature extraction and combination (MEC) module is described first, which obtains more diversified and richer features as well as enhanced semantics. Then, to better predict the fine contour of the water body, we create a multi-scale prediction fusion (MPF) module that combines prediction results from three separate levels. Finally, to solve the issue of semantic inconsistency between the encoder and the decoder, we propose an encoder-decoder semantic feature fusion module (DSFF).

2.1 MECNet’s Underlying Network Architecture

MECNet is made up of three primary components. First, multi-feature extraction and combination modules are built, which provide a more diversified set of encoded features. Three feature extraction sub-modules are proposed within the MEC module to model the spatial and channel interactions between feature maps: local feature extraction, longer-receptive-field feature extraction, and between-channel feature extraction. Second, an encoder-decoder semantic feature fusion module is built to resolve the semantic discrepancy between features from the encoding stage and the decoding stage. Third, water-body segmentation contours are generated by a simple multi-scale prediction fusion module that takes predictions at three distinct scales as input and derives a mask assigning a binary label to each pixel. The encoder-decoder architecture of the proposed MECNet [9] is portrayed in Figs. 2, 3 and 4.

Fig. 2
An overview of our proposed Multi-feature Extraction and Combination Network (MECNet). The encoder passes the input image through six MEC modules with 2-D max pooling; the decoder's four modules fuse encoder features through DSFF modules with 2-D upsampling, and the output is produced by the MPF module. MECNet thus has three module types: Multi-feature Extraction and Combination (MEC), Encoder and Decoder Semantic Feature Fusion (DSFF), and Multi-scale Prediction Fusion (MPF)

Fig. 3
The details of the multi-feature extraction and combination module. a The Multi-feature Extraction and Combination (MEC) module consists of b a Local Feature Extraction (LFE) sub-module, c a between-channel feature enhancement (CFE) sub-module, and a longer-receptive-field feature extraction (LRFE) sub-module, realized either as d Densely Connected Atrous Convolutions (DCAC) or as e JCC (Joint Conv7-S4-Conv3-S1)

Fig. 4
Two ways to combine the different feature sub-modules in the MEC (Multi-feature Extraction and Combination) module. Left: in parallel, with the three sub-modules independent of one another. Right: in cascade, with LFE, LRFE, and CFE connected sequentially and all three feeding the combination step
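To make the two combination strategies in Fig. 4 concrete, the following is a minimal PyTorch sketch, with simple stand-ins assumed for the three sub-modules (a 3 × 3 convolution for LFE, a dilated convolution for LRFE, and a 1 × 1 gating convolution for CFE); the paper's actual sub-modules (Fig. 3) are richer.

```python
import torch
import torch.nn as nn

class MECBlock(nn.Module):
    """Sketch of a Multi-feature Extraction and Combination block.
    mode='parallel' runs the three sub-modules independently on the
    same input (Fig. 4, left); mode='cascade' chains them
    LFE -> LRFE -> CFE (Fig. 4, right). Both variants concatenate
    the three outputs and fuse them with a 1x1 convolution."""
    def __init__(self, ch, mode="cascade"):
        super().__init__()
        self.mode = mode
        self.lfe = nn.Sequential(  # local features
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.lrfe = nn.Sequential(  # longer receptive field via dilation
            nn.Conv2d(ch, ch, 3, padding=4, dilation=4), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.cfe = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())  # stand-in channel gate
        self.combine = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x):
        if self.mode == "parallel":
            a, b, c = self.lfe(x), self.lrfe(x), self.cfe(x) * x
        else:  # cascade: each sub-module feeds the next
            a = self.lfe(x)
            b = self.lrfe(a)
            c = self.cfe(b) * b
        return self.combine(torch.cat([a, b, c], dim=1))

y = MECBlock(64)(torch.randn(1, 64, 128, 128))  # -> (1, 64, 128, 128)
```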

2.2 Semantic Features Fusion Module for Encoder-Decoder

The DSFF module (Fig. 6) extends the 3D channel attention module described in our earlier work [28] to overcome the issue of semantic inconsistency in feature fusion at the decoding stage. To reduce the number of channels in the concatenated feature maps at the same scale from the encoding and decoding stages, the DSFF first applies a 1 × 1 convolution with BN and ReLU. The concatenated features are then used to construct the global context through 1 × 1 convolutions with BN, ReLU, and a Sigmoid function. Acting as a guide for fusing different semantic characteristics, the module automatically learns how the channels of the feature maps are semantically linked. The concatenated features are multiplied by, and then added to, the global context information. Finally, 3 × 3 convolutions with BN and ReLU are applied to the resulting feature maps. To accomplish an effective fusion of distinct semantic features, the DSFF module is applied to features at each scale in the decoding stage. The Multi-scale Prediction Fusion (MPF) module is shown in Fig. 5.
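A minimal sketch of the DSFF steps just described, assuming a global-average-pooled form of the global context (BN on the pooled 1 × 1 context is omitted for simplicity); channel counts are illustrative.

```python
import torch
import torch.nn as nn

class DSFF(nn.Module):
    """Sketch of DSFF: concatenate same-scale encoder/decoder features,
    reduce channels (1x1 conv + BN + ReLU), build a sigmoid-gated
    global context, multiply-and-add it back, then refine with a
    3x3 conv + BN + ReLU."""
    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(enc_ch + dec_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.context = nn.Sequential(  # channel-wise global context guide
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1), nn.Sigmoid())
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, enc_feat, dec_feat):
        x = self.reduce(torch.cat([enc_feat, dec_feat], dim=1))
        g = self.context(x)            # (B, C, 1, 1) semantic channel guide
        return self.refine(x * g + g)  # multiply and add the global context

fused = DSFF(64, 64, 64)(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
```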

Fig. 5
MPF: multi-scale prediction fusion module, which fuses the semantic feature maps predicted at three different scales
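As a hedged illustration of multi-scale prediction fusion, the sketch below assumes the three scale-wise score maps are upsampled to the finest resolution, concatenated, and fused by a learnable 1 × 1 convolution; the paper's exact wiring may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPF(nn.Module):
    """Sketch: fuse per-scale water-body score maps into one prediction."""
    def __init__(self, num_scales=3):
        super().__init__()
        self.fuse = nn.Conv2d(num_scales, 1, 1)  # learnable weighted sum of scales

    def forward(self, preds):
        # preds: list of (B, 1, Hs, Ws) logits from three decoder scales
        size = preds[0].shape[-2:]  # the finest scale sets the output resolution
        ups = [F.interpolate(p, size=size, mode="bilinear", align_corners=False)
               for p in preds]
        return self.fuse(torch.cat(ups, dim=1))  # (B, 1, H, W) fused logits

preds = [torch.randn(1, 1, 256, 256), torch.randn(1, 1, 128, 128), torch.randn(1, 1, 64, 64)]
fused = MPF()(preds)  # -> (1, 1, 256, 256)
```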

Fig. 6
DSFF: the different semantic feature fusion module, which fuses same-scale features from the encoding and decoding stages using convolution, pooling, and 3-D channel attention

The Total Loss Function (TLF). The difficulty of training deep neural networks grows as the network's depth increases [20]. To train our proposed model more efficiently, we implement a simple and effective output layer at each scale in the decoding stage and apply a loss constraint between its result and the ground truth.

The total loss function is the sum of the cross-entropy losses L computed between the prediction at each scale and the ground truth.
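A plausible form of this total loss, assuming one binary cross-entropy term per supervised decoding scale s (three scales here), with predicted water probability ŷ and binary ground truth y over N pixels, is:

$$L_{\mathrm{total}} = \sum_{s=1}^{3} L\big(\hat{y}^{(s)}, y\big), \qquad L(\hat{y}, y) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log \hat{y}_i + (1 - y_i)\log\big(1 - \hat{y}_i\big)\Big]$$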

3 The Architecture of the Proposed Model

The proposed architecture mainly depends on four steps. The first is image preprocessing, where every image goes through geometric correction, i.e., its color, texture, and shape are identified, and the analyzed results are immediately combined in a step known as image fusion. In the second step, the image produced by preprocessing is passed to the sample generation stage, in which the image is analyzed pixel by pixel to form two datasets: a training dataset and a test dataset; the training dataset values are later compared with those of the test dataset. In the third step, water extraction, the model predicts the accurate position of the water content in the image. The final step is accuracy assessment, where the accuracy percentage is evaluated and represented in graphical format (Fig. 7).

Fig. 7
Architecture of water-body segmentation, comprising four steps: image preprocessing, sample generation, water extraction, and accuracy assessment
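A schematic sketch of the four steps in Fig. 7; every function here (geometric correction, image fusion, the model call) is a hypothetical placeholder, not the paper's implementation.

```python
import numpy as np

def run_pipeline(image, model, threshold=0.5):
    """Hypothetical end-to-end sketch of the four-step architecture."""
    # 1. Image preprocessing: geometric correction and image fusion of
    #    color/texture/shape cues would be applied here (placeholder).
    fused_image = image

    # 2. Sample generation: split pixels into training and test sets
    rng = np.random.default_rng(0)
    pixels = fused_image.reshape(-1, fused_image.shape[-1])
    is_train = rng.random(len(pixels)) < 0.8
    train_set, test_set = pixels[is_train], pixels[~is_train]

    # 3. Water extraction: predict a per-pixel water probability map
    prob_map = model(fused_image)   # (H, W) values in [0, 1]
    water_mask = prob_map > threshold

    # 4. Accuracy assessment is done against ground truth (Sect. 4)
    return water_mask, train_set, test_set
```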

4 Experimental Results

Multi-scale feature extraction. The multi-feature integrated network uses three main feature extraction methods: local feature extraction (LFE), longer-receptive-field feature extraction (LRFE), and between-channel feature extraction (CFE). In this study, spatial relationships between linked features are characterized using LFE and LRFE, while CFE explores the maximum features acquired across multiple channels.

Contour map optimization. A variety of state-of-the-art methods detect contours in a picture using localization information. We explore contour detection in a satellite image based on multi-scale globalization and semantic image attributes such as texture, color, and shape.
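As a concrete, hedged example of deriving a contour map from a predicted water probability map, the sketch below uses OpenCV's contour tracing rather than the paper's own detector.

```python
import cv2
import numpy as np

def water_contours(prob_map, threshold=0.5):
    """Binarize a water probability map and trace water-body contours."""
    mask = (prob_map > threshold).astype(np.uint8) * 255
    # external boundaries only; each contour is an (N, 1, 2) array of (x, y) points
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours

contours = water_contours(np.random.rand(256, 256))  # toy input
```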

Multi-scale feature extraction using contours. Contour-based multi-feature extraction is quite different from typical multi-scale extraction methods, and we employ the following modules to test its viability: local feature extraction (LFE), longer-receptive-field feature extraction (LRFE), and channel feature extraction (CFE). The LFE and LRFE sub-modules identify regions with certain features, whereas CFE investigates the relationships between distinct feature maps.

Optimized water-body segmentation extraction. Once training is finished and the multi-feature extraction procedure is done, the weight of each pixel is evaluated using the appropriate neighbor-pixel selection for each picture. Raw images of the linked objects are used as input, and the probability maps derived from the multi-scale feature search approach are used to segment the water. Because this task involves a large number of pixels, this module processes a significant number of pixels from each picture. The most effective model for evaluating pixels with varied decoding variables is optimal water-body segmentation with multi-scale feature extraction (Figs. 8 and 9).
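The accuracy values reported in Figs. 8 and 9 are consistent with a per-pixel accuracy; a minimal sketch of such a metric, assuming binary masks (the paper's exact metric is not spelled out), is:

```python
import numpy as np

def pixel_accuracy(pred_mask, true_mask):
    """Percentage of pixels whose predicted label matches the ground truth."""
    return 100.0 * np.mean(pred_mask.astype(bool) == true_mask.astype(bool))

pred = np.random.rand(256, 256) > 0.5   # stand-in prediction
truth = np.random.rand(256, 256) > 0.5  # stand-in ground truth
print(f"accuracy: {pixel_accuracy(pred, truth):.2f}%")
```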

Fig. 8
CNN prediction: original image, ground-truth mask, and predicted water-body mask (accuracy 72.355419, duration 3.30596)

Fig. 9
MECNET prediction: original image, ground-truth mask, and predicted water-body mask (accuracy 81.70690, duration 3.028074)

5 Conclusion

To enhance water-body contour identification from VHR remotely sensed photos, combining aerial and satellite pictures, we use an embedding structure built on three components: a MEC module that automatically extracts richer and more diverse features in the encoding stage, a DSFF module that solves the issue of semantic inconsistency between features from the encoding and decoding stages and obtains more advanced semantic information for feature fusion in the decoding stage, and an MPF module that fuses predictions at multiple scales. On VHR aerial and satellite photos, our technique achieved the greatest accuracy as well as the best resilience under tough conditions in our studies. Beyond water-body extraction, these design modules may also be used for semantic segmentation and object recognition. We compared CNN, MECNET, and MECNET-CMO, and found that the plain CNN consumes more time and yields lower accuracy, whereas MECNET consumes less time and produces higher accuracy, as demonstrated by our experiments.