1 Introduction

Image degradation is a common issue that occurs during image acquisition due to a variety of factors such as camera limitations, environmental conditions, and human factors. For instance, smartphone cameras with narrow apertures, small sensors, and limited dynamic range can produce blurred and noisy images due to device shaking caused by body movements. Similarly, images captured in adverse weather conditions can be affected by haze and rain. Most classical image restoration tasks can be formulated as:

$$\begin{aligned} \textbf{L} = \textbf{D}(\textbf{H}) + \gamma \end{aligned}$$
(1)

where \(\textbf{L}\) denotes an observed low-quality image, \(\textbf{H}\) refers to its corresponding high-quality image, and \(\textbf{D}(\cdot ), \gamma \) indicate the degradation function and the noise during the imaging and transmission processes, respectively. This formulation can signify different image restoration tasks when \(\textbf{D}(\cdot )\) varies (Fig. 1).

Fig. 1
figure 1

Visualized results of M3SNet on various image restoration tasks. Left: degraded image. Right: the predicted result of M3SNet. From top to bottom: image deblurring and image deraining, respectively

Image restoration aims to recover the high-quality clean image \(\textbf{H}\) from its degraded observation \(\textbf{L}\). It is a highly ill-posed problem, as infinitely many candidates exist for any observed input. To restrict this feasible space to natural images, traditional methods [1,2,3,4,5,6,7] explicitly design appropriate priors for the given restoration problem, such as domain-relevant and task-relevant priors. The latent high-quality image can then be obtained by solving a maximum a posteriori (MAP) problem:

$$\begin{aligned} \hat{\textbf{H}}= \underset{\textbf{H}}{ {\text {arg }}\, {\text {max}}} \log P(\textbf{L}|\textbf{H}) + \log P(\textbf{H}) \end{aligned}$$
(2)

where \(P(\textbf{L}|\textbf{H})\) represents the probability of observing the degraded image \(\textbf{L}\) given the clean image \(\textbf{H}\), and \(P(\textbf{H})\) represents the prior distribution of the clean image \(\textbf{H}\). This can also be expressed as a constrained maximum likelihood estimation:

$$\begin{aligned} \hat{\textbf{H}}= \underset{\textbf{H}}{ {\text {arg }}\, {\text {min}}} \ \Vert \textbf{L}-\mathbf {D(H)} \Vert ^2 + \lambda \Psi (\textbf{H}) \end{aligned}$$
(3)

where the fidelity term \(\Vert \textbf{L}-\mathbf {D(H)} \Vert ^2\) serves as an approximation of the likelihood \(P(\textbf{L}|\textbf{H})\), while the regularization term \(\lambda \Psi (\textbf{H})\) encodes priors on the latent image \(\textbf{H}\) or constraints on the solution. The objective thus balances fidelity of the reconstruction to the observation against the prior knowledge or constraints imposed on the solution.
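To make Eq. 3 concrete, the sketch below minimizes the fidelity-plus-regularization objective by gradient descent for a known, differentiable degradation; the box-blur operator and the simple smoothness prior are illustrative stand-ins for \(\textbf{D}(\cdot )\) and \(\Psi (\cdot )\), not the hand-crafted priors used in the cited works.

```python
import torch
import torch.nn.functional as F

def restore_map(L, degrade, lam=1e-2, steps=200, lr=0.1):
    """Minimal sketch of Eq. 3: minimize ||L - D(H)||^2 + lam * Psi(H) by
    gradient descent, assuming a known differentiable degradation `degrade`
    and a simple smoothness prior Psi (illustrative choices only)."""
    H = L.clone().requires_grad_(True)            # initialize with the observation
    opt = torch.optim.Adam([H], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        fidelity = ((degrade(H) - L) ** 2).sum()  # data-fidelity term
        dx = H[..., :, 1:] - H[..., :, :-1]       # horizontal gradients
        dy = H[..., 1:, :] - H[..., :-1, :]       # vertical gradients
        prior = (dx ** 2).sum() + (dy ** 2).sum() # smoothness prior Psi(H)
        (fidelity + lam * prior).backward()
        opt.step()
    return H.detach()

# Example degradation D: a 9x9 box blur (assumed known for illustration)
blur = lambda x: F.avg_pool2d(F.pad(x, (4, 4, 4, 4), mode="replicate"), 9, stride=1)
restored = restore_map(blur(torch.rand(1, 3, 64, 64)), blur)
```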

However, designing effective priors for image restoration is challenging, and the resulting methods often do not generalize. With large-scale data, deep models such as Convolutional Neural Networks (CNNs) [8,9,10,11,12,13,14,15,16,17] and Transformers [18,19,20,21,22] have emerged as the preferred choice due to their ability to implicitly learn more general priors by capturing natural image statistics, achieving state-of-the-art (SOTA) performance in image restoration. The performance gain of these deep learning models over conventional restoration approaches is primarily attributed to their model design, which includes numerous network modules and functional units for image restoration, such as recursive residual learning [23], transformers [18, 19, 21], encoder-decoders [12, 13, 24], multi-scale models [25,26,27], and generative models [28,29,30].

Nevertheless, most of these models for low-level vision problems are based on a single-stage design, which ignores the interactions between spatial details and contextualized information. To address this limitation, [8,9,10,11] propose multi-stage architectures in which contextualized features are first learned through an encoder-decoder and subsequently integrated with a high-resolution branch to preserve local information. Despite its good performance, this design requires each later stage to refine the results of the previous stage, leading to high system complexity.

Fig. 2
figure 2

PSNR vs. computational cost on image deblurring. Under different parameter capacities, our model achieves state-of-the-art performance. In addition, our model involves only a relatively small number of multiply-accumulate operations (MACs)

Based on the above, a natural question is whether a single-stage architecture can reduce system complexity and achieve the same balance between spatial details and contextualized information as a multi-stage architecture while maintaining SOTA performance. To this end, we propose a mountain-shaped single-stage image restoration architecture, called M3SNet, with several key components. (1) We utilize NAFNet [12] as the baseline architecture and concentrate on modifying the network to attain multi-stage functionality. By omitting the information transfer between stages and eliminating the nonlinear activation functions from the network structure, we reduce the system's complexity. (2) A plug-and-play feature fusion middleware (FFM) mechanism facilitates multi-scale information fusion between encoder and decoder blocks from different layers, allowing the acquisition of more contextual information. This approach also enables manipulation at the original image resolution, thereby aiding the preservation of spatial details. As a basic feature fusion block, it can be plugged into various other image restoration networks to improve model representation. (3) A multi-head attention middle block (MHAMB) serves as the bridge between the encoder and decoder, surpassing the limitations of the receptive field of CNNs and capturing more global information.

The main contributions of this work are:

  1. (1)

    A novel single-stage approach capable of generating outputs that are contextually enriched and spatially accurate, similar to a multi-stage architecture. Our architecture reduces system complexity because its single-stage design eliminates the need to pass information between stages.

  2. (2)

    A feature fusion middleware mechanism (FFM) that enables the exchange of information across multiple scales while preserving the fine details from the input image to the output image. It can be used as a general and efficient plug-in module with few lightweight parameters for various image restoration networks.

  3. (3)

    A multi-head attention middle block (MHAMB) that is capable of aggregating local and non-local pixel interactions.

  4. (4)

    We demonstrate the versatility of M3SNet by setting new state-of-the-art results on six synthetic and real-world datasets for various restoration tasks (image deraining and deblurring) while maintaining low complexity (see Fig. 2). Further, we provide detailed analysis, qualitative results, and generalization tests.

2 Related work

Image degradation is a common occurrence caused by camera equipment and a variety of environmental factors. Depending on the specific degradation, different image restoration tasks are formulated, e.g., deblurring and deraining. Early image restoration work relied mainly on hand-crafted priors, such as total variation and self-similarity [1,2,3,4,5,6,7]. With the rise of deep learning, data-driven methods such as CNNs [8, 31,32,33,34,35,36,37,38,39,40,41] and Transformers [19,20,21, 42, 43] have become the dominant approach for image restoration due to their impressive performance. These methods can be categorized as either single-stage or multi-stage based on their architectural design.

2.1 Single-stage architecture

In recent years, the majority of image restoration research has focused on single-stage architectures. These mainly include encoder-decoder based U-Nets [8, 12, 13, 21, 30, 44,45,46] and dual network structures [47,48,49,50,51].

Encoder-Decoder Approaches. Encoder-decoder architectures have gained considerable attention in image restoration thanks to their ability to capture multi-scale information. To construct an effective and efficient Transformer-based architecture for image restoration, [21] introduces a novel locally-enhanced window and a multi-scale restoration modulator to create a hierarchical encoder-decoder network. [37] utilizes selective kernel feature fusion to realize attention-based information exchange and aggregation across scales. [27] develops a simple yet effective boosted decoder that progressively restores the haze-free image by incorporating the strengthen-operate-subtract boosting strategy in the decoder. By eliminating or substituting the nonlinear activation functions, [12] establishes a simple baseline that yields strong results while requiring fewer computing resources.

Dual Network Approaches. The dual-network architecture is designed with two parallel branches that separately estimate the structure and detail components of the target signal from the input. These components are then combined to reconstruct the final result according to the task-specific formulation module. This architecture was first proposed by [52] and has since inspired much subsequent work, including image dehazing [47,48,49], image deraining [53], image denoising [50], and image super-resolution/deblurring [51]. The dual-network approach has proven effective in addressing various low-level vision problems by enabling a better separation of structure and detail information, leading to improved accuracy and computational efficiency. Furthermore, the flexibility of this architecture makes it adaptable to different types of data, making it a popular choice in the field of image restoration.

Despite the significant achievements of these networks, it remains challenging to effectively balance the competing goals of preserving spatial details and contextualized information while recovering images.

2.2 Multi-stage architecture

Multi-stage networks have been shown to be more effective than their single-stage counterparts in high-level vision problems [54,55,56,57]. In recent years, there have been several attempts [8, 10, 11, 58,59,60,61] to apply multi-stage networks to image restoration. These methods break the image restoration process down into several manageable stages, enabling lightweight subnetworks to progressively restore clear images. This facilitates the capture of both spatial details and contextualized information by individual subnetworks at each stage. To prevent the suboptimal results that may arise from using the same subnetwork at each stage, a supervised attention mechanism was proposed along with distinct subnetwork structures [8]. Additionally, [60] presents a novel self-supervised event-guided deep hierarchical multi-patch network to handle blurry images and videos through fine-to-coarse hierarchical localized representations. Nevertheless, this approach increases system complexity, as each subsequent stage must refine the previous stage's results.

3 Method

Our primary goal is to create a single-stage network architecture that can efficiently handle the challenging task of image restoration by balancing the need for spatial details and context information, all while using fewer computational resources. The M3SNet is built upon a U-Net architecture, as shown in Fig. 3. As is apparent from the figure, in contrast to the traditional U-Net network, we have inverted the architecture and introduced two key components: (a) the feature fusion middleware (FFM) and (b) the multi-head attention middle block (MHAMB). The model’s architecture takes on a mountain-like shape, and we liken the image restoration process to climbing a mountain.

Fig. 3
figure 3

Architecture of M3SNet for image restoration

Overall Pipeline. Given a degraded image \(\textbf{I} \in \mathbb R^{H \times W \times 3}\), M3SNet first applies a \(3 \times 3\) convolutional layer to extract shallow feature maps \(\mathbf {F_{0}} \in \mathbb R^{H \times W \times C}\) (H, W, and C are the feature map height, width, and channel number, respectively). Next, these shallow features \(\mathbf {F_{0}}\) pass through a four-level encoder-decoder and one multi-head attention middle block, yielding deep features \(\mathbf {F_{DF}} \in \mathbb R^{H \times W \times C}\). Each level contains multiple feature fusion middleware blocks between the encoder and decoder to capture multi-scale information and retain spatial details. Finally, we apply convolution to the deep features \(\mathbf {F_{DF}}\) to generate a residual image \(\textbf{R}\in \mathbb R^{H \times W \times 3}\), which is added to the degraded image to obtain the restored image: \(\hat{\textbf{I}} = \textbf{R} +\textbf{I}\). We optimize the proposed network using the PSNR loss:

$$\begin{aligned} PSNR = 10 \cdot \log _{10} \frac{(2^n-1)^2}{\Vert \hat{\textbf{I}}-\dot{\textbf{I}}\Vert ^2} \end{aligned}$$
(4)

where \(\dot{\textbf{I}}\) denotes the ground-truth image.
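In practice this loss is minimized as the negative PSNR over each mini-batch; the snippet below is a minimal PyTorch sketch of this reading of Eq. 4, assuming images normalized to \([0, 1]\) so that \((2^n-1)\) becomes the peak value 1 and the mean squared error replaces the raw squared norm.

```python
import torch

def psnr_loss(pred, target, max_val=1.0, eps=1e-8):
    """Negative PSNR as a training loss (sketch of Eq. 4 under our normalization assumptions)."""
    mse = torch.mean((pred - target) ** 2, dim=(1, 2, 3))   # per-image mean squared error
    psnr = 10.0 * torch.log10(max_val ** 2 / (mse + eps))
    return -psnr.mean()                                     # minimizing -PSNR maximizes PSNR
```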

Fig. 4
figure 4

(a) Feature fusion middleware (FFM) that enables the exchange of information across multiple scales while preserving the fine details. (b) The architecture of nonlinear activation free block (NAFBlock). (c) Simplified Channel Attention (SCA). (d) Multi-head attention middle block (MHAMB) that captures more global information

3.1 Feature fusion middleware (FFM)

By incorporating an encoder-decoder network in the initial stage, followed by a network operating at the original input resolution in the final stage, a multi-stage network can produce high-quality images with accurate spatial details and reliable contextual information. However, each later stage of this process must revise the results of the previous stage, which adds complexity to the system. A single-stage network has relatively less complexity, but may struggle to balance spatial details and context information effectively. Therefore, we explore a feature fusion middleware mechanism that enables a single-stage architecture to achieve the same functionality as a multi-stage architecture. As a basic feature fusion block, it can be plugged into various other image restoration networks to improve model representation.

As shown in Fig. 4a, the feature fusion middleware (FFM) is a nonlinear activation-free block (NAFBlock) combined with upsampling and feature fusion. We introduce the FFM between the levels of the encoder-decoder architecture to integrate upper-layer information into adjacent lower layers. This integration takes place sequentially, from the highest layer down to the lowest, until all information is fused at the original-image-resolution manipulation level. This approach enhances the network's capacity to capture and fuse multi-scale features, ranging from simple patterns at low levels, such as corners or edge/color conjunctions, to more complex higher-level features, such as significant variations and specific objects. As a result, this structure preserves spatial details while integrating contextual information, ultimately ensuring high-quality image restoration. Formally, let \(\mathbf {FE_i} \in \mathbb R^{\frac{H}{2^{i-1}} \times \frac{W}{2^{i-1}} \times 2^{i-1}C}\) be the output of the i-th level encoder \((i=1,2,3,4)\). At each level, the feature fusion information \(FFM_{s,i}\) is given as:

$$\begin{aligned} \begin{aligned} FFM_{1,i}&= H_{Naf_{1,i}}( FE_{i} \oplus UP(FE_{i+1})) \\ FFM_{s,i}&= H_{Naf_{s,i}}(UP(FFM_{s-1,i+1}) \oplus FFM_{s-1,i}) \end{aligned} \end{aligned}$$
(5)

where \(\oplus \) denotes element-wise addition, \(UP(\cdot )\) represents the up-sampling operation, and \(H_{Naf_{s,i}}(\cdot )\) is the s-th FFM in the i-th level.

This design offers two benefits. First, the feature fusion middleware integrates multi-scale information. For example, the FFM at the third level fuses the encoder information of the third and fourth levels, and the FFM at the second level fuses the information of the second, third, and fourth levels, so that the network can capture abundant contextual information. Second, the feature fusion middleware at the first level operates on the original image resolution without any subsampling, thereby enabling the network to acquire detailed spatial information from high-resolution features.
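A minimal sketch of the per-level fusion in Eq. 5 follows; the pixel-shuffle up-sampling and the doubling of channels per level are illustrative assumptions for \(UP(\cdot )\), and a plain convolution stands in for the NAFBlock described next.

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Sketch of Eq. 5: fuse the coarser-level feature into the current level."""
    def __init__(self, channels, block=None):
        super().__init__()
        # UP(.): 1x1 conv then pixel-shuffle, halving channels and doubling resolution (assumed)
        self.up = nn.Sequential(nn.Conv2d(2 * channels, 4 * channels, 1, bias=False),
                                nn.PixelShuffle(2))
        # H_Naf: a NAFBlock in our design; a plain 3x3 conv stands in for this sketch
        self.block = block if block is not None else nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, cur, coarser):
        # cur: (N, C, H, W) feature at this level, e.g. FE_i or FFM_{s-1,i}
        # coarser: (N, 2C, H/2, W/2) feature from the level below, e.g. FE_{i+1} or FFM_{s-1,i+1}
        return self.block(cur + self.up(coarser))
```

Stacking such modules level by level reproduces the cascade \(FFM_{1,i}, FFM_{2,i}, \ldots \) of Eq. 5.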

NAFBlock. The NAFBlock [12] is the building block of NAFNet, a U-Net variant that simplifies the system by replacing or removing nonlinear activation functions. Figure 4b illustrates the process of obtaining an output Y from an input X using Layer Normalization, convolution, the Simple Gate, and Simplified Channel Attention, expressed as follows:

$$\begin{aligned} \begin{aligned} X_1&= X + C_1(SCA(SG(C_3(C_1(LN(X)))))) \\ Y&= X_1 + C_1(SG(C_1(LN(X_1)))) \\ SG&= X_{f1} \times X_{f2} \end{aligned} \end{aligned}$$
(6)

where \(C_1\) is the \(1 \times 1\) convolution, \(C_3\) is the \(3 \times 3\) depth-wise convolution, GAP is the adaptive average pool, \(X_{f1}, X_{f2} \in \mathbb R^{H \times W \times \frac{C}{2}}\) are obtained by splitting the Simple Gate input \(X_{f3}\) along the channel dimension, and \(SCA(\cdot )\) is shown in Fig. 4c.
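The following condensed PyTorch sketch mirrors Eq. 6; it omits the learnable skip scales and dropout of the original NAFBlock and uses GroupNorm as a stand-in for the channel-wise LayerNorm, so it is an approximation rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SimpleGate(nn.Module):
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)   # split X_f3 along channels into X_f1, X_f2
        return x1 * x2               # SG = X_f1 * X_f2

class NAFBlockSketch(nn.Module):
    def __init__(self, c, expand=2):
        super().__init__()
        self.norm1, self.norm2 = nn.GroupNorm(1, c), nn.GroupNorm(1, c)                  # LN stand-ins
        self.conv1 = nn.Conv2d(c, expand * c, 1)                                         # C_1
        self.dwconv = nn.Conv2d(expand * c, expand * c, 3, padding=1, groups=expand * c) # C_3 (depth-wise)
        self.sg = SimpleGate()
        self.sca = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1))            # GAP + 1x1 conv
        self.conv2 = nn.Conv2d(c, c, 1)
        self.ffn1, self.ffn2 = nn.Conv2d(c, expand * c, 1), nn.Conv2d(c, c, 1)

    def forward(self, x):
        y = self.sg(self.dwconv(self.conv1(self.norm1(x))))
        x = x + self.conv2(self.sca(y) * y)                        # first line of Eq. 6
        return x + self.ffn2(self.sg(self.ffn1(self.norm2(x))))   # second line of Eq. 6
```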

Finally, the deep features \(\mathbf {F_{DF}}\) are obtained through this single-stage architecture, as shown below:

$$\begin{aligned} \begin{aligned} FD_{i}&= H_{Naf_{\text {-1},i}}(FD_{i+1} + FFM_{\text {-1},i}) \\ F_{DF}&= FD_{1} \end{aligned} \end{aligned}$$
(7)

where \(FD_i\) is the output of the i-th level decoder, and \(-1\) indicates the last feature fusion middleware at that level.
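In code, the decoder update of Eq. 7 reduces to one fused addition per level; up-sampling the coarser decoder feature before the addition is assumed here so that the spatial sizes match (Eq. 7 leaves the resizing implicit).

```python
def decode_level(fd_coarser, ffm_last, up, dec_block):
    """Sketch of Eq. 7: FD_i = H_Naf(FD_{i+1} + FFM_{-1,i}).
    `up` and `dec_block` are the level's up-sampling operator and decoder NAFBlock."""
    return dec_block(up(fd_coarser) + ffm_last)
```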

Table 1 Dataset description

3.2 Multi-head attention middle block (MHAMB)

Transformer models [19,20,21] have gained popularity in image restoration tasks in recent years due to their capability to capture global information. However, computing self-attention globally results in quadratic complexity with respect to the number of tokens, as shown in Eq. 8, rendering it inadequate for representing high-resolution images.

$$\begin{aligned} \mathcal {O}_{MSA} = 4hwC^2 + 2(hw)^2C \end{aligned}$$
(8)
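For intuition (illustrative numbers, not values reported elsewhere in this work), at a modest resolution \(h=w=256\) with \(C=64\) channels, the quadratic term already dominates by a factor of about 500:

$$\begin{aligned} 4hwC^2 \approx 1.1\times 10^{9}, \qquad 2(hw)^2C \approx 5.5\times 10^{11} \end{aligned}$$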

To alleviate this issue, [19, 20, 43], among others, propose various methods to reduce this complexity. In this paper, we propose a multi-head attention middle block (MHAMB) as the bridge of the encoder-decoder, shown in Fig. 4d. The MHAMB utilizes global self-attention to process and integrate the feature maps produced by the last layer of the encoder. This is efficient even for large images because the convolutional encoder has already downsampled the spatial dimensions, so attention operates on a low-resolution feature map. From the last encoder output \(FE_4\), our MHAMB first encodes channel-wise context by applying a \(1 \times 1\) convolution and a \(3 \times 3\) depth-wise convolution to generate the query, key, and value matrices \(\textbf{Q} \in \mathbb R^{H \times W \times C}, \textbf{K} \in \mathbb R^{H \times W \times C}\) and \(\textbf{V} \in \mathbb R^{H \times W \times C}\) as follows:

$$\begin{aligned} \begin{aligned}&Q = C_3(C_1(FE_4)) \\&K = C_3(C_1(FE_4)) \\&V = C_3(C_1(FE_4)) \end{aligned} \end{aligned}$$
(9)

Inspired by [19], we reshape \(\textbf{Q}\), \(\textbf{K}\), \(\textbf{V}\) to \(\hat{\textbf{Q}} \in \mathbb R^{(H W) \times \frac{C}{h} \times h}\), \(\hat{\textbf{K}} \in \mathbb R^{(H W) \times \frac{C}{h} \times h}\), \(\hat{\textbf{V}} \in \mathbb R^{(H W) \times \frac{C}{h} \times h}\) and apply self-attention across channels rather than spatial dimensions to reduce the computational complexity, where h is the number of heads. Next, we calculate the similarities between all the reshaped queries and keys as:

$$\begin{aligned} Attention(\hat{Q}, \hat{K}, \hat{V}) = SoftMax\left( \frac{\hat{Q}\hat{K}}{\beta }\right) \hat{V} \end{aligned}$$
(10)

where \(\beta \) is a learnable scaling parameter used to adjust the magnitude of the dot product of \(\hat{Q}\) and \(\hat{K}\) prior to applying the softmax function. As we use the multi-head strategy, we finally concatenate the outputs of all attention heads, reshape the attention output back to its original dimensions \(\mathbb R^{H \times W \times C}\), and obtain the final result by applying a \(1 \times 1\) convolution. The resulting output is then added to \(FE_4\) and passed to the decoder. This allows the model to incorporate self-attention on the lowest-resolution feature maps, generating richer feature representations that improve the overall performance of the model, as shown in the experiments below.
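The sketch below illustrates this channel-wise multi-head attention in the style of [19]; the separate projections for \(\textbf{Q}\), \(\textbf{K}\), \(\textbf{V}\) follow Eq. 9, while other details (bias terms, the exact tensor layout) are simplifying assumptions of the example.

```python
import torch
import torch.nn as nn

class MHAMBSketch(nn.Module):
    """Sketch of Eqs. 9-10: multi-head self-attention across channels on FE_4."""
    def __init__(self, c, heads=4):
        super().__init__()
        self.heads = heads
        make_qkv = lambda: nn.Sequential(nn.Conv2d(c, c, 1),                        # C_1
                                         nn.Conv2d(c, c, 3, padding=1, groups=c))   # C_3 (depth-wise)
        self.to_q, self.to_k, self.to_v = make_qkv(), make_qkv(), make_qkv()
        self.beta = nn.Parameter(torch.ones(heads, 1, 1))    # learnable scaling beta
        self.proj = nn.Conv2d(c, c, 1)                       # final 1x1 convolution

    def forward(self, fe4):
        n, c, h, w = fe4.shape
        q, k, v = self.to_q(fe4), self.to_k(fe4), self.to_v(fe4)
        # reshape to (N, heads, C/heads, HW): similarities are computed between channels
        q, k, v = (t.reshape(n, self.heads, c // self.heads, h * w) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.beta, dim=-1)  # (N, heads, C/h, C/h)
        out = (attn @ v).reshape(n, c, h, w)                 # concatenate heads and reshape back
        return fe4 + self.proj(out)                          # residual output passed to the decoder
```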

4 Experiments

We evaluate the proposed M3SNet on benchmark datasets for three image restoration tasks, including (a) image deraining, (b) image deblurring, and (c) image denoising.

Table 2 Image deraining results
Fig. 5
figure 5

Image deraining example. The outputs of M3SNet exhibit no traces of rain streaks on both image samples. M3SNet also recovers the most detailed images

4.1 Datasets and evaluation protocol

We use PSNR and SSIM as quality assessment metrics. To report the reduction in error for each method relative to the best-performing method, we convert PSNR to RMSE (\(RMSE \propto \sqrt{10^{-PSNR/10}}\)) and SSIM to DSSIM (DSSIM = (1 - SSIM)/2). The datasets are summarized in Table 1.
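A small helper illustrating these conversions is shown below; the two PSNR values are illustrative numbers chosen only to realize a 0.93 dB gap, matching the average deraining gain reported in Sect. 4.3.

```python
import math

def rmse_reduction(psnr_ours, psnr_other):
    """Relative RMSE reduction implied by a PSNR gap, using RMSE ∝ sqrt(10^(-PSNR/10))."""
    return 1 - math.sqrt(10 ** (-psnr_ours / 10)) / math.sqrt(10 ** (-psnr_other / 10))

def dssim(ssim):
    return (1 - ssim) / 2

print(f"{rmse_reduction(33.00, 32.07):.1%}")  # a 0.93 dB gain -> about a 10.2% RMSE reduction
```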

Image deraining. Our derain model is trained on a collection of 13,712 clean-rain image pairs obtained from multiple datasets [62, 63, 65, 66]. We assess the model’s performance on various test sets, including Test100 [65], Test1200 [66], Rain100H [63], and Rain100L [63].

Image deblurring. To perform image deblurring, we utilize the GoPro [67] dataset, which consists of 2,103 image pairs for training and 1,111 pairs for evaluation. Additionally, we assess the generalizability of our approach by applying the GoPro-trained model directly to the test images of the HIDE dataset. The HIDE dataset is designed specifically for human-aware motion deblurring, and its test set comprises 2,025 images.

Table 3 Image deblurring results

Image denoising. For training our image denoising model, we utilize the SIDD dataset [69], which consists of 320 high-resolution images for training and 1,280 patches from 40 high-resolution images for evaluation. It is important to note that the SIDD dataset comprises real-world images.

4.2 Implementation details

We train the proposed M3SNet from scratch, without any pre-training, using separate models for different image restoration tasks. We utilize the following block configurations in our network for each level: [1, 1, 1, 28] blocks for the encoder, [1, 1, 1, 1] blocks for the decoder, and [2, 2, 1, 0] blocks for the FFM, with one MHAMB as the bridge between the encoder and decoder. We train models with the Adam [74] optimizer (\(\beta _1=0.9, \beta _2=0.999\)) and the PSNR loss for \(5 \times 10^5\) iterations, with the initial learning rate \(1 \times 10^{-3}\) gradually reduced to \(1 \times 10^{-7}\) by cosine annealing [75]. We extract patches of size \(256 \times 256\) from training images, and the batch size is set to 32. We adopt TLC [76] to address the performance degradation caused by training on patches and testing on full images. For data augmentation, we perform horizontal and vertical flips.
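The optimization loop implied by this setup can be sketched as follows; model, train_loader, and psnr_loss are placeholders standing in for M3SNet, the 256 × 256 patch loader, and the Eq. 4 loss sketch, and this is not the released training code.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, psnr_loss, total_iters=500_000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters, eta_min=1e-7)
    step = 0
    while step < total_iters:
        for degraded, clean in train_loader:   # 256x256 patches, batch size 32, flip augmentation
            loss = psnr_loss(model(degraded), clean)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            step += 1
            if step == total_iters:
                break
    return model
```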

4.3 Image deraining results

For the image deraining task, we compute PSNR/SSIM scores using the Y channel in the YCbCr color space, consistent with previous works [8, 71, 73]. As presented in Table 2, our method outperforms existing approaches significantly and consistently. Specifically, it achieves a remarkable improvement of 0.93 dB and a \(10.2\%\) error reduction on average across all datasets compared to the best CNN-based method, SPAIR [73], and an improvement of 0.2 dB and a \(3\%\) error reduction compared to the best Transformer-based method, Restormer [19]. Moreover, we achieve up to 2.76 dB improvement over HINet [9] on individual datasets, such as Rain100L. Compared to our baseline network NAFNet [12], we see significant performance gains across all datasets, with an average improvement of 1.11 dB. This demonstrates that our proposed FFM and MHAMB learn an enriched set of hybrid features combining local and non-local information, which is vital for high-quality image deraining.

In addition to quantitative evaluations, Fig. 5 presents qualitative results that demonstrate the effectiveness of our M3SNet in removing rain streaks of various orientations and magnitudes while preserving the structural content of the images.

Fig. 6
figure 6

Image deblurring example on the GoPro dataset [67]. Compared to the state-of-the-art methods, our M3SNet restores sharper and perceptually-faithful images

Table 4 Image denoisng results

4.4 Image deblurring results

The performance evaluation of image deblurring approaches on the GoPro [67] and HIDE [68] datasets is presented in Table 3. Our M3SNet outperforms other methods, with a performance gain of 0.09 dB when averaged across both datasets [67, 68]. Specifically, compared to our baseline network NAFNet [12], we improve by 0.08 dB and 0.12 dB at widths 32 and 64, respectively. Compared with previous CNN-based models [8], the gains are even more pronounced. Compared with the previous best Transformer-based method, Restormer-local [19], we improve by 0.17 dB on the GoPro dataset. It is worth noting that even though our network is trained solely on the GoPro dataset, it still achieves state-of-the-art results (31.49 dB in PSNR) on the HIDE dataset. This demonstrates its impressive generalization capability.

Table 5 Ablation analysis for FFM and MHAMB on the benchmarks
Table 6 Plug-and-play study of FFM. We plugged FFM into various image restoration networks to verify its improvement on model performance

Figure 6 displays some of the deblurred images produced by our method. Our model recovers clearer images that are closer to the ground truth than those produced by other methods.

4.5 Image denoising results

We compare RGB image denoising results with other SOTA methods on SIDD [69], as shown in Table 4. Our method obtains considerable gains over the state-of-the-art approaches, i.e., 0.25 dB over NAFNet [12] and 0.53 dB over the Transformer-based method Restormer [19].

4.6 Ablation studies

Effectiveness of FFM. To examine the effect of the FFM, we present deraining results without the FFM in Table 5. The derained images obtained by the method utilizing the FFM exhibit higher PSNR and SSIM values than those produced by the method without it.

Fig. 7
figure 7

Visualization of feature maps. Our proposed FFM can be plugged into existing image restoration models to generate more precise high-frequency details

Fig. 8
figure 8

Effect of the MHAMB on image deraining. (a) Input corrupted image. Compared with (b), the image restored by M3SNet w/ MHAMB (c) removes much more rain and recovers the numbers accurately. (d) Target high-quality image

To demonstrate the compatibility of our FFM as a plug-and-play component for other image restoration models, we showcase the deblurring results achieved by integrating FFM into top-performing models (e.g., Restormer [19], Uformer [21], and NAFNet [12]) in Table 6. The performance of all three models is clearly improved by our approach. For a visually intuitive understanding of the effect of FFM, we further use high-pass filtering (HPF) to visualize the learned features in Fig. 7. Compared to the original restoration models without FFM, adding FFM helps reconstruct finer detail features and improves restoration quality. Since FFM seamlessly integrates information from the upper layers of the encoder-decoder structure into the adjacent lower layers and fuses all the information at the original image resolution manipulation level, it allows the model to better capture local and non-local information, thus facilitating a more accurate representation for high-quality output.

Effectiveness of MHAMB. To evaluate the effectiveness of the MHAMB, we report results without it in Table 5. Compared to the variant without MHAMB, the MHAMB provides additional performance benefits thanks to its capability to capture global information.

In Fig. 8, we provide visual comparisons of M3SNet w/ and w/o the MHAMB. This study validates that the results recovered by the model with the MHAMB tend to be clearer, since it enables global features to be more fully exploited during the restoration process.

4.7 Resource efficiency

Table 7 The evaluation of model computational complexity. This is conducted with an input size of \(256 \times 256\), on an NVIDIA 1060 GPU

Deep learning models have become increasingly complex in order to achieve higher accuracy. However, larger models require more resources and may not be practical in certain contexts. Therefore, there is a need to design lightweight image restoration models that can achieve high accuracy. In our work, we design a mountain-shaped single-stage network. This architecture optimizes the balance between spatial details and contextual information while minimizing the computational resources required to restore images.

As shown in Table 7, our M3SNet outperforms other models. Despite having 0.6M more parameters than MIMO-UNet++ [44], our proposed M3SNet-32 still achieves better performance while using significantly fewer computational resources, with MACs approximately 40 times smaller than those of MIMO-UNet++. Considering all factors, including model parameters, MACs, and performance, our model is the optimal choice.

5 Conclusion

In this paper, we present a mountain-shaped single-stage network that effectively captures multi-scale feature information while minimizing the computational resources required for image restoration. Our design is guided by the principle of balancing the competing goals of contextual information and spatial details while recovering images. To this end, we propose a feature fusion middleware mechanism that enables seamless information exchange between the different levels of the encoder-decoder architecture. This approach smoothly combines upper-layer information with adjacent lower-layer information and eventually integrates all information at the original image resolution manipulation level. As a basic feature fusion block, it can be plugged into various other image restoration networks to improve model representation. To overcome the limitations of CNNs' receptive fields and capture more global information, we utilize a multi-head attention middle block as the bridge of our encoder-decoder architecture. Furthermore, to maintain computational efficiency and a lightweight model size, we replace or remove nonlinear activation functions and use multiplication instead. Extensive experiments on multiple benchmark datasets demonstrate that our M3SNet significantly outperforms existing methods while utilizing low computational resources.