Introduction

Real-time semantic edge segmentation is a crucial topic in image processing and computer vision. It has several applications, such as biomedical image processing and autonomous self-driving systems, many of which are safety-critical and demand high accuracy. However, increasing the network size reduces inference speed, which may cause lag under real-time operating conditions. For example, DeepLab and PSPNet deliver excellent semantic edge segmentation performance, but they contain millions of parameters and process as little as one frame per second (FPS), whereas real-time processing typically requires about 30 FPS.

In short, the larger the network, the lower the processing speed. Alongside processing speed, memory consumption is also a primary parameter. This study builds an optimized and efficient semantic edge segmentation network with less memory consumption and higher processing speed.

Current high-speed semantic segmentation models, notably ESPNet [5] and ENet [4], achieve high inference speed but at the cost of accuracy. Slightly larger models such as ICNet [6] and ContextNet [5] achieve good results, but not with the best size and speed. Hence, this study concentrates on jointly obtaining better inference speed, accuracy, and model size. Earlier works [7, 8] performed multi-scale convolution with kernels of various sizes and receptive fields, which allows a model to extract features at multiple levels and capture data at multiple scales. Dilated convolution is one of the most effective ways to extract large-scale features with a limited number of parameters [10, 11]. Both approaches have limitations: channelwise and Inception-style models contain many parameters even with factorization, while dilated convolution with a single dilation rate captures global information well but may miss local features. A fixed dilation rate lets the model extract the large classes in the Cityscapes data set but misses the minute ones.

The proposed methodology is a new, tiny CNN module that combines the advantages of dilated and Inception convolutions. The proposed module is used to construct a shallow, practical encoder–decoder-based model that extracts dense features.

The modified channelwise feature pyramid network design (MCFPNet) is based on the CFPNet model. With fewer parameters, it achieves better results than current semantic segmentation models. Because the proposed module efficiently incorporates dilated and Inception convolutions, it is called a channelwise feature pyramid (CFP) module. This framework can jointly extract contextual information and multi-size feature maps while significantly reducing the model size and number of parameters.

Literature Survey

Recent works on semantic edge segmentation mainly employ factorization, dilated convolution, low-bit networks, or a mixture of these techniques to optimize the model size and speed of a CNN. This section briefly describes these techniques and gives an overview of encoder–decoder-based semantic edge segmentation.

Dilated convolution [17] extends the standard 3 × 3 convolution by inserting gaps between kernel elements, enlarging the effective receptive field without adding parameters. For an n × n dilated convolution with dilation rate r, the effective kernel size is [r(n − 1) + 1]^2, where r denotes the number of pixel gaps between adjacent convolution elements, yet only n^2 elements participate in training. Several studies have used dilated convolution to extract multi-scale features, for instance by building a spatial feature pyramid, as in DenseASPP [18, 19] and DeepLab [10, 12, 13]. These application patterns show its strength in pixel-level tasks. This study applies dilated convolution to each channel of the CFP module.
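
As an illustration only (a minimal sketch assuming PyTorch, not the authors' implementation), the snippet below shows how the dilation rate enlarges the receptive field while the trainable kernel stays n × n:

```python
import torch
import torch.nn as nn

n, r = 3, 4                  # kernel size and dilation rate
effective = r * (n - 1) + 1  # 4 * 2 + 1 = 9: the 3x3 kernel covers a 9x9 field

# padding = r * (n - 1) // 2 keeps the spatial size unchanged
conv = nn.Conv2d(16, 16, kernel_size=n, dilation=r, padding=r * (n - 1) // 2)

x = torch.randn(1, 16, 64, 64)
print(conv(x).shape, f"effective field: {effective}x{effective}")
# torch.Size([1, 16, 64, 64]) effective field: 9x9
```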

The naïve Inception architecture [7] demonstrated a joint model consisting of parallel 1 × 1, 3 × 3, and 5 × 5 convolutions, which produces multi-scale feature maps from kernels of different sizes. The large convolution kernels, however, lead to high processing cost.

Hence, later versions of the Inception architecture introduce factorized convolutions to decrease the number of elements. This factorization has two parts: factorization into smaller convolutions and asymmetric convolution. For example, a 5 × 5 convolution can be replaced with stacked 3 × 3 convolutions of the same receptive field, and a 3 × 3 convolution can be further factorized into a 3 × 1 convolution followed by a 1 × 3 convolution, saving about 34% of the elements. ResNeXt [18], MobileNets [14], and Inception [16] are factorization-based modules that have been applied successfully to substantially decrease the processing cost of CNN models. These factorization modules inspired the development of CFPNet: each CNN channel uses the smaller-convolution approach to reduce the Inception-like model, and the asymmetric convolution technique reduces the per-channel parameters. Factorization decreases the computation substantially while still allowing the module to learn features from receptive fields of a series of sizes. Related works [33,34,35] segmented images using watershed and particle swarm optimization techniques to obtain good results.
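
The following sketch (assuming PyTorch) illustrates both factorizations; the exact savings depend on channel counts, but the ratios hold: two 3 × 3 convolutions use 28% fewer weights than one 5 × 5, and an asymmetric 3 × 1/1 × 3 pair uses a third fewer than one 3 × 3:

```python
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

c = 64
full_5x5 = nn.Conv2d(c, c, 5, padding=2, bias=False)
two_3x3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                        nn.Conv2d(c, c, 3, padding=1, bias=False))
asym_3x3 = nn.Sequential(nn.Conv2d(c, c, (3, 1), padding=(1, 0), bias=False),
                         nn.Conv2d(c, c, (1, 3), padding=(0, 1), bias=False))

print(n_params(full_5x5))  # 102400
print(n_params(two_3x3))   # 73728, 28% fewer than the 5x5
print(n_params(asym_3x3))  # 24576, 33% fewer than one 3x3 (36864)
```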

An encoder–decoder network has two parts. The encoder consists of a sequence of convolutions and down-sampling operators for extracting high-dimensional features. The decoder reverses this process: instead of down-sampling, it up-samples and convolves to create the segmentation masks. A few encoder–decoder-based designs, such as U-Net [19], FCN [16], and SegNet [21], have demonstrated strong results in pixel-level edge segmentation. The remainder of this study discusses the architecture of MCFPNet, which is derived from CFPNet.

Methodology

Channelwise Feature Pyramid Channel

CFPNet is derived from the CFP module, which decomposes a larger convolution kernel into several smaller convolutions, as shown in Fig. 1a, b. The modified CFPNet performs better than CFPNet. The traditional Inception model directly uses a 5 × 5 kernel, whereas Inception-V2, shown in Fig. 1c, deploys two 3 × 3 convolution operators instead. Building on factorization and multi-scale feature maps, the present design also targets a 7 × 7 kernel: in the same way, two 3 × 3 kernels replace the 5 × 5 kernel and three 3 × 3 kernels replace the 7 × 7 kernel, saving 28% and 45% of the parameter elements, respectively. Since this alone is not enough for real-time goals, both decomposed kernels are combined into a single channel built only from 3 × 3 kernels, and the conventional convolutions are further decomposed into asymmetric convolutions to construct the feature pyramid (FP) channel. To generate multi-scale feature maps, skip connections concatenate the feature maps extracted by each asymmetric convolution set. The resulting FP channel saves 69% of the parameters compared with the CFPNet and Inception-V2 designs, yet it retains the capability to learn the attribute data and preserves the original dimensions.

Fig. 1
figure 1

a Naïve Inception module. b Feature pyramid module. c Inception-V2 module

Because it unites features from the asymmetric convolution blocks, the FP channel keeps the output dimensions equal to the input dimensions by redistributing the number of filters per asymmetric set. If the input size is N, then N/4 filters are assigned to each of the primary and secondary sets, which correspond to the 3 × 3 and 5 × 5 convolutions. The tertiary set, corresponding to the 7 × 7 kernel, is allocated N/2 filters and extracts the most substantial features.
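
A minimal sketch of one FP channel under these assumptions (PyTorch; the asym_block helper and exact layer ordering are illustrative, not the authors' code) could look as follows:

```python
import torch
import torch.nn as nn

def asym_block(cin, cout, dilation=1):
    """One 3x3 convolution factorized into 3x1 and 1x3 convolutions."""
    pad = dilation
    return nn.Sequential(
        nn.Conv2d(cin, cout, (3, 1), padding=(pad, 0), dilation=(dilation, 1)),
        nn.Conv2d(cout, cout, (1, 3), padding=(0, pad), dilation=(1, dilation)),
    )

class FPChannel(nn.Module):
    """Chained asymmetric blocks: receptive fields of 3x3, 5x5, and 7x7."""
    def __init__(self, n, dilation=1):
        super().__init__()
        self.b1 = asym_block(n, n // 4, dilation)       # 3x3 field, N/4 filters
        self.b2 = asym_block(n // 4, n // 4, dilation)  # 5x5 field, N/4 filters
        self.b3 = asym_block(n // 4, n // 2, dilation)  # 7x7 field, N/2 filters

    def forward(self, x):
        y1 = self.b1(x)
        y2 = self.b2(y1)
        y3 = self.b3(y2)
        # skip connections concatenate the three sets back to N channels
        return torch.cat([y1, y2, y3], dim=1)

x = torch.randn(1, 32, 64, 64)
print(FPChannel(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```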

Channelwise Feature Pyramid Module

The CFP module has L FP channels with dilation rates {r1, r2, …, rL}. A conventional CFP module first applies a 1 × 1 convolution to reduce the input dimension from M to M/L. The filter dimensions of the primary, secondary, and tertiary asymmetric sets are then M/4L, M/4L, and M/2L, respectively.

Figure 2 gives detailed information about the CFP module. First, a 1 × 1 convolution produces high-dimensional feature maps at a lower channel count. Multiple FP channels are then arranged in parallel, each with a different dilation rate. Finally, all feature maps are concatenated back to the input dimension and a 1 × 1 convolution produces the output. This is the fundamental architecture of a conventional CFP module, demonstrated in Fig. 2a. The asymmetric convolutions increase the depth of the module, which makes it harder to train, and the plain concatenation introduces gridding (checkerboard) artifacts that noticeably degrade the accuracy and quality of the edge identification masks. To ease training, a residual connection is added as the first step, which keeps the deep module trainable and provides additional feature data [21]. To suppress the gridding artifacts, hierarchical feature fusion (HFF) [5] is applied for de-gridding.

Fig. 2
figure 2

a Conventional CFP network. b CFP network module

Starting from the secondary channel, an addition operation sums the feature maps stage by stage, and the resulting hierarchical feature maps are then concatenated, which finally reduces the gridding disturbances. The final CFP module with reduced gridding disturbance is represented in Fig. 2b.
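
A hedged sketch of a CFP module with HFF, reusing the FPChannel sketch above (normalization and activation layers are omitted for brevity):

```python
import torch
import torch.nn as nn

class CFPModule(nn.Module):
    """1x1 reduction, L parallel FP channels, HFF, 1x1 projection, residual."""
    def __init__(self, m, rates=(1, 2, 4, 8)):
        super().__init__()
        L = len(rates)
        self.reduce = nn.Conv2d(m, m // L, 1)  # shrink input dimension M -> M/L
        self.channels = nn.ModuleList(FPChannel(m // L, r) for r in rates)
        self.project = nn.Conv2d(m, m, 1)      # 1x1 convolution on the fusion

    def forward(self, x):
        y = self.reduce(x)
        outs = [ch(y) for ch in self.channels]
        fused = [outs[0]]
        for o in outs[1:]:          # hierarchical, stage-by-stage summation
            fused.append(fused[-1] + o)
        out = self.project(torch.cat(fused, dim=1))
        return out + x              # residual connection eases training

x = torch.randn(1, 32, 64, 64)
print(CFPModule(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```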

MCFPNet Module

First, the CFP module details used to construct MCFPNet are fixed. The number of FP channels is set to C = 4. The input dimension is D = 32, giving 8 filters per channel. The filter numbers of the primary, secondary, and tertiary asymmetric convolution sets are 3, 3, and 4, respectively, whereas in CFPNet [29] they are 2, 2, and 4. Next, a different dilation rate is assigned to each FP channel. Given a maximum dilation rate rC, the first and fourth channels are set to r1 = 1 and r4 = rC so that the MCFP module can extract both local and global features. The secondary and tertiary channels are set to dilation rates r2 = rC/4 and r3 = rC/2, so the module can also learn mid-sized features. If rC/4 is less than 1 (i.e., rC = 2), the channel's dilation rate is simply set to 1.
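
This rate assignment can be summarized in a small helper (hypothetical naming, a sketch of the rule above):

```python
def fp_dilation_rates(rc: int) -> list:
    """Dilation rates for the four FP channels given a maximum rate rc."""
    return [1, max(1, rc // 4), max(1, rc // 2), rc]

print(fp_dilation_rates(16))  # [1, 4, 8, 16]
print(fp_dilation_rates(2))   # [1, 1, 1, 2]
```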

MCFPNet architecture: The agenda of this manuscript is to develop a lightweight module with the best performance; hence, a shallow network is proposed, as shown in Fig. 3, with the detailed architecture given in Table 1. Initially, three 3 × 3 convolutions serve as the feature extractor, followed by the down-sampling method of ENet [4], which fuses a 3 × 3 convolution with stride two and a 2 × 2 max pooling. This down-sampling is applied three times, so the output dimensions are 1/8th of the original input size. Skip connections insert resized input images before the first and second max pooling and before the final 1 × 1 convolution, providing additional data for the segmentation network. In the CFP-1 and CFP-2 clusters, the CFP module is repeated n = 2 and m = 6 times with dilation rates rCFP−1 = [3, 3] and rCFP−2 = [4, 4, 8, 8, 16, 16]. As a last step, a 1 × 1 convolution produces the output feature map, and the final edge segmentation masks are obtained by bilinear interpolation. Each convolution is followed by batch normalization and a PReLU [23] activation, which this study found to perform better than ReLU in shallow networks. Feeding the CFP-2 output to up-sampling through a 1 × 3 convolution instead of a 3 × 3 convolution improved the results in MCFPNet, as the CFP-1 and CFP-2 outputs already carry maximal coordinate values and may not need a 3 × 3 convolution [32].
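
A sketch of the ENet-style initial down-sampler described above, assuming PyTorch; the channel split (cout − cin convolution filters concatenated with the pooled input) follows ENet's initial block:

```python
import torch
import torch.nn as nn

class DownSampler(nn.Module):
    """Fuse a stride-2 3x3 convolution with 2x2 max pooling, as in ENet."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout - cin, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2)

    def forward(self, x):
        # concatenating the two branches halves H and W in one step
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

x = torch.randn(1, 3, 360, 480)
print(DownSampler(3, 32)(x).shape)  # torch.Size([1, 32, 180, 240])
```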

Fig. 3
figure 3

Proposed MCFPNet architecture

Table 1 Design details of MCFPNet

The proposed neural network has been tested on the BSDS500, CamVid, and Cityscapes data sets, which are widely used in semantic edge segmentation. The whole work was validated on the CamVid and Cityscapes data sets, together with some selected images from BSDS500. Parameters such as repeat times, the number of channels, and dilation rates were experimented with, and finally the networks are compared across the data sets.

Results and Discussion

Data Set

Cityscapes The Cityscapes data set consists of 5000 finely annotated and 20,000 coarsely annotated images collected from 50 different cities. The original input image resolution is 1024 × 2048. The annotations are grouped into seven categories; the vehicle category, for example, includes cars, trucks, and buses.

CamVid This urban scene data set can be used in automotive applications such as self-driving. It has 701 images at a resolution of 720 × 960, of which 234 are used for training and 101 for validation. In the proposed work, the images are resized to 360 × 480 before training.

Analysing MCFPNet Architecture

The proposed MCFPNet variants differ in repeat times, channel numbers, and dilation rates; their performance is analyzed on the CamVid test data set, with results represented in Table 2. The multiple features of the edge-segmented images are extracted using regions of interest (ROI) and classifiers.

Table 2 Edge segmented examples from Cityscapes

As the study does not use pre-trained models, training runs for a maximum of 1024 epochs. Adam [24] is used to train the network with momentum 0.9 and weight decay 4.5e−4, applying the "poly" learning rate policy [25] with power 0.9. Different batch sizes are used for the two data sets, eight for Cityscapes and sixteen for CamVid, as represented in Table 3. Data augmentation is also performed to create diversity in training, with scale rates of {0.5, 0.75, 1.0, 1.25, 1.5, 1.75}.
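
The "poly" policy can be sketched as follows; the base learning rate shown is a placeholder, as the paper does not state it:

```python
def poly_lr(base_lr: float, epoch: int, max_epochs: int = 1024,
            power: float = 0.9) -> float:
    """'Poly' learning rate: decays smoothly from base_lr to zero."""
    return base_lr * (1 - epoch / max_epochs) ** power

# In PyTorch this maps onto a LambdaLR scheduler, e.g.:
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lambda e: (1 - e / 1024) ** 0.9)
print(poly_lr(4.5e-3, 512))  # halfway through training, hypothetical base_lr
```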

Table 3 Evaluation of MCFPNet on CamVid

MCFPNet-V1 This is the shallowest version: the repeat times n and m are 1 and 2. For this primary MCFPNet, the dilation rates are set to rCFP−1 = [4] and rCFP−2 = [8, 16].

MCFPNet-V2 Compared to MCFPNet-V1, this network can extract more local features. Hence, the repeat times are modified from {n, m} = {1, 2} to {n, m} = {1, 3}, and the corresponding dilation rates are changed to rCFP−1 = [2] and rCFP−2 = [4, 8, 16].

MCFPNet The capacity of the network is increased while the model size is controlled: the repeat times of MCFPNet-V2 are doubled, and the dilation rates are modified per cluster to rCFP−1 = [2, 2] and rCFP−2 = [4, 4, 8, 8, 16, 16].

Table 4 presents the evaluation results of MCFPNet on Cityscapes: it is 3.2% more accurate at the same size as CFPNet-V3.

Table 4 Evaluation of MCFPNet on Cityscapes

Although the dimensions of MCFPNet are diminutive, its mIoU accuracy is impressively competitive, as demonstrated in Table 5, both in terms of classwise and categorywise evaluations. A closer examination of the results reveals that MCFPNet exhibits a higher level of sensitivity and precision, particularly in the detection of small and low-frequency classes, such as traffic lights.

Table 5 Cityscapes data set results

Comparisons

Results on the CamVid and Cityscapes test data sets are compared between the proposed system and existing conventional systems. Figure 4 represents the relationship between classwise mIoU accuracy and network size.

Fig. 4
figure 4

Network size and classwise mIoU accuracy

In Fig. 4, the blue circles represent model size: the smaller the circle, the smaller the model. MCFPNet is small while its mIoU accuracy is very competitive, as represented in Table 5; it also reports higher sensitivity and accuracy for the tiny, low-frequency classes.

Figure 5 plots accuracy versus parameters for the Cityscapes test results. The proposed MCFPNet achieves an excellent accuracy of 71.0%, the best compared to LEDNet [29], ESPNet, and CGNet [27]. It also outperforms the existing CFPNet [32], giving better training and segmentation results.

Fig. 5
figure 5

Classwise mIoU and parameters

The existing methods were executed on various GPUs, and their comparative results are placed in Table 6. MCFPNet has a processing speed very similar to CFPNet [32], DABNet, and ICNet for the same 1024 × 2048 input. CFPNet [32] saves 28.6% whereas MCFPNet saves 28.9%, a comparable figure with improved network performance. For the CamVid data set, the state-of-the-art comparison is placed in Table 7: MCFPNet achieves slightly better results with a small network. Compared to CFPNet, MCFPNet improves mIoU by 3.2% with only a 0.04% increase in parameters.

Table 6 Evaluation results for Cityscapes data set for testing
Table 7 CamVid testing set performance

Given the variability in input size and GPU specifications across networks, both factors are reported in Table 6. In terms of computational ability, the hierarchy of GPUs is as follows: Titan Maxwell < Titan X Pascal ≈ GTX 1080Ti < Titan Xp < RTX 2080Ti < V100. Despite the differences in input size and GPU devices, Fig. 7 is included to facilitate comparison of the results presented in Table 6.

In addition, we conduct performance assessments on the CamVid test data set and undertake comparisons with several other current techniques. Our findings, as outlined in Table 7, indicate that MCFPNet also delivers exceptional results despite its compact size. For instance, when comparing with ENet and ESPNet, it is evident that their minimal parameter count has a significant impact on their overall performance, as they rank the lowest in Table 7. In comparison with other methods boasting high performance, MCFPNet demonstrates a competitive level of accuracy despite having fewer than 3.4% of their parameters.

Conclusion

In this paper, a small real-time semantic edge segmentation network is developed. MCFPNet, the modified CFPNet, is built on the feature pyramid channel as a modified version of CFPNet. The analyses and results on the CamVid and Cityscapes data sets are obtained with the MCFPNet module. Overall, MCFPNet enhances accuracy to 71% with the best inference speed, parameter count, and model size, and efficiently performs semantic edge segmentation. Compared to CFPNet, MCFPNet improves mIoU by 3.2% with only a 0.04% increase in parameters.