1 Introduction

An ocean, sea, lake, pond, reservoir, river, canal, or aquifer is considered an Underwater Environment (UE). Several tasks are conducted in the UE, such as seafloor surveying, vehicle navigation and positioning, pipeline inspection, and drowning detection [1]. The quality of underwater images is affected by the transmission medium and the underwater environment; influential factors include water turbidity, artificial lighting, and light absorption and scattering by particles in the water [2]. Captured underwater images have more spatial and visual redundancy than surface images. Crewless underwater vehicles, comprising Remotely Operated Underwater Vehicles (ROUVs) [3] and Autonomous Underwater Vehicles (AUVs), are deployed to acquire, store, and transmit video or images for monitoring.

The ROUVs, AUVs, their imaging sensors, the internet, and underwater wireless sensor networks constitute the Internet of Underwater Things (IoUT). Some of the challenges faced by the IoUT in transferring underwater images are [4]: the considerable transmission distance between the AUVs and the terrestrial control centres, and the overall communication connectivity and link quality, which are affected by low bandwidth, propagation delays, limited communication range, and non-rechargeable batteries in the sensor nodes. To overcome these challenges, an energy-efficient transmission method is needed. Image compression minimizes the quantity of data by efficiently coding digital images so that fewer bits are communicated, which helps accomplish energy efficiency in IoUT nodes.

Conventional image compression techniques such as JPEG [5] and JPEG2000 [6], developed by the Joint Photographic Experts Group, use linear and invertible transforms to convert an image into coefficients with low statistical dependencies. These methods may produce noticeable artifacts such as blurring, ringing, and blocking at low bit-rates. Other traditional non-learning techniques for image compression include the Discrete Wavelet Transform (DWT) [7], Embedded Zerotrees of Wavelet transforms (EZW) [8], and Set Partitioning in Hierarchical Trees (SPIHT) [9]. Kahu et al. [10] proposed a Contrast Sensitivity Function based quantization in JPEG to provide better performance at low bit-rates. These methods have proved to be inefficient for underwater images.

Deep Convolutional Neural Networks (DCNNs) are expected to achieve better compression performance than existing image compression standards by stacking multiple convolution layers to provide flexible non-linear analysis and synthesis transformations [11]. Learning with a non-differentiable quantizer and incompatibility with existing image codecs remain challenging issues for DCNNs. The residual encoder-decoder [12] contains symmetric convolution (encoder) and deconvolution (decoder) layers. In residual encoders, gradients tend to decrease in magnitude as they traverse long paths from later Neural Network (NN) stages back to earlier stages, so adding more layers degrades the reconstructed quality of underwater images. Hussain and Jeong [13] proposed a Deep Neural Network (DNN) with the Rectified Linear Unit (ReLU), in which the compression rate can be adjusted through the number of hidden layers and hidden neurons between the input and output neurons. Johnston et al. [14] proposed three techniques to boost a baseline recurrent image compression architecture: a perceptually weighted training loss, hidden-state priming, and spatially adaptive bit rates. Li et al. [15] proposed a CNN model that addresses quantization and entropy rate estimation by using a content-weighted importance map. A symmetric Convolutional Autoencoder (CAE) was proposed by Cheng et al. [16] to replace the transform and inverse transform in traditional codecs and achieve high coding efficiency. The methods discussed above achieve state-of-the-art results but are incompatible with existing image codecs, limiting their use in existing systems. The Compact Representation CNN proposed by Li et al. [17] generates low-resolution images from high-resolution ones that are both visually pleasing and informative. Zhang et al. [18] proposed an image restoration technique that reconstructs a high-resolution image from a low-resolution image. The works [17, 18] improve the quality of the reconstructed image, but they cannot be used to obtain low bit-rates. Jiang et al. [19] used a Compact-CNN followed by standard JPEG for image compression on the sender side and reconstructed the image on the receiver side using a Reconstruction CNN (C-CNN_R-CNN) to achieve both low bit-rates and good reconstructed image quality.

This paper proposes a compression framework, inspired by [10, 17, 18], that integrates deep-learning and traditional techniques for underwater image compression to improve both the compression rate and the quality of the restored image. It combines Contrast Sensitivity Function (CSF) quantization-based JPEG with a non-symmetric DCNN image compression model to improve reconstructed image quality at a low compression rate. The proposed model works on two levels. At the first level, two CNNs, a Compact CNN (C-CNN) and a Residual Dense Convolutional Neural Network (RD-CNN), are trained together to retain the structural information of the data and to provide better reconstruction. As underwater images have more spatial and visual redundancy, the C-CNN helps preserve maximum information and provides a visually pleasing compact image. At the second level, the compact representation of the original underwater image is subjected to CSF quantization-based JPEG encoding to improve the compression rate further. The RD-CNN helps improve the quality of the up-scaled, decompressed image.

The advantages of the proposed methodology are:

  1. (i)

    It combines the merits of both traditional and deep learning techniques to provide better compression rates and reconstruction quality of underwater images for monitoring purposes.

  2. (ii)

    It overcomes the blurring, ringing, and blocking artifacts caused by traditional techniques by training the C-CNN and RD-CNN together to preserve the structural information of the data.

  3. (iii)

    The reconstructed images from the proposed framework are subjected to fish classification using transfer learning techniques. The results show significant performance in recognizing the fishes under study.

The paper is organised as follows. Section 2 contains details of the experimental system, a summary of the dataset used for training and testing, and the metrics used for performance evaluation. Section 3 describes the proposed methodology. Section 4 contains the experimental results and discussion. Section 5 concludes why the proposed method is superior to the existing techniques and outlines future work.

2 Materials and methods

This section contains details of the experimental setup required for carrying out the proposed work, a summary of the dataset used for training and testing the model under study, and the metrics used to evaluate the performance of the model.

2.1 Experimental system

The analysis has been carried out on Google Colaboratory. Images of size 128 × 128 × 3, the Adaptive Moment Estimation (Adam) optimizer [20] with a learning rate of 1e-3, and 1000 epochs are used for training and testing.

2.2 Dataset description

The fish image dataset [21] used for the experiment is taken from Fish4Knowledge, funded by EUSFP (European Union Seventh Framework Programme). The images are collected using ten underwater cameras that provide live video feeds. Some images are crowded, and some are blurred due to underwater lighting effects. To evaluate the proposed framework, 4,483 underwater images are used to train the network, and the trained network is tested on a sample of 200 images.

2.3 Performance evaluation index

The efficiency of the proposed model can be evaluated using the following metrics:

  1. 1.

    Objective metrics such as Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) [22] are used to quantify the quality of the reconstructed underwater image with respect to the original underwater image.

Let x represent the original image and y the reconstructed image, both of size p × q. Then PSNR and SSIM are defined by the formulas given in Eqs. (1) and (2).

$$PSNR\left( {x,y} \right) = 10\,log_{10} \left( {\frac{255^{2}}{MSE\left( {x,y} \right)}} \right)$$
(1)
$$SSIM\left(x,y\right)=l\left(x,y\right)c\left(x,y\right)s\left(x,y\right)$$
(2)

where,

$$MSE\left(x,y\right)= \frac{1}{pq}\sum_{i=1}^{p}\sum_{j=1}^{q}{\left({x}_{ij}-{y}_{ij}\right)}^{2}$$

Luminance comparison is made using the \(l\left(x,y\right)\) function, contrast comparison using the \(c\left(x,y\right)\) function, and structure comparison using the \(s\left(x,y\right)\) function.

  1. 2.

    Bits per pixel (bpp) is used to quantify the effectiveness of the compression technique, using the formula given in Eq. (3):

    $$bpp= \frac{number\,of\,bits\,in\,the\,compressed\,stream}{p \times q \times 3}$$
    (3)

where p denotes the number of rows, and q denotes the number of columns in the given image.

  1. 3.

    A relative measure named Compression Ratio (CR) is used to compute the ratio between the uncompressed and compressed image sizes using Eq. (4).

    $$CR = \frac{I_{uncomp}}{I_{comp}}$$
    (4)

where, \({I}_{uncomp}\) represents uncompressed image size and \({I}_{comp}\) represents compressed image size.
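To make these definitions concrete, a minimal sketch of the metric computations in Python/NumPy is given below; the function names and the assumption of 8-bit, three-channel images are illustrative only and are not part of the original framework.

```python
import numpy as np

def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between original x and reconstruction y."""
    return float(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

def psnr(x: np.ndarray, y: np.ndarray) -> float:
    """Peak signal-to-noise ratio of Eq. (1), in dB, for 8-bit images."""
    return 10.0 * np.log10(255.0 ** 2 / mse(x, y))

def bits_per_pixel(n_bits_in_stream: int, p: int, q: int) -> float:
    """Bits per pixel of Eq. (3) for a p x q three-channel image."""
    return n_bits_in_stream / (p * q * 3)

def compression_ratio(uncompressed_size: int, compressed_size: int) -> float:
    """Compression ratio of Eq. (4); both sizes in the same unit (e.g., bytes)."""
    return uncompressed_size / compressed_size
```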

The proposed methodology is compared with standard compression methods such as JPEG, JPEG with CSF-based quantization (JPEG-CSF), and C-CNN_R-CNN. C-CNN_R-CNN uses a Compact-CNN followed by standard JPEG for image compression on the sender side and reconstructs the image on the receiver side using a Reconstruction CNN, whereas the proposed method uses a Compact CNN followed by JPEG with CSF-based quantization on the sender side and reconstructs the image on the receiver side using a Residual Dense Convolutional Neural Network. The proposed methodology is also compared with the Super-Resolution CNN (SRCNN) [23] to measure the quality of the reconstructed images.

3 Proposed method

This section presents a method that deals with the blurring, ringing, and blocking artifacts that traditional image compression techniques suffer from at low bit-rates and that is more efficient and adaptable for underwater images than recent works [11]. Existing methods produce poor reconstructed image quality at a low compression rate because they cannot extract the hierarchical features required for image reconstruction. Figure 1 shows the architecture of the proposed system, which consists of two CNNs, i.e., C-CNN and RD-CNN, trained together.

Fig. 1
figure 1

Architecture of Image compression model with CSF quantization-based JPEG

The proposed method consists of the C-CNN and a CSF quantization-based JPEG encoder on the sender side. The C-CNN is used to retain the structural information of the 128 × 128 × 3 input. The output of the C-CNN, a compact 64 × 64 × 3 image, is fed to the CSF quantization-based JPEG encoder for further compression. On the receiver side, the compressed image is reconstructed using the CSF quantization-based JPEG decoder and up-scaled to the original image size of 128 × 128 × 3 using bicubic interpolation. The interpolated image is then passed through the RD-CNN for image restoration. Each part of the proposed system is discussed in the following sub-sections; a high-level sketch of the pipeline is given below.
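The following is a minimal sketch of this sender/receiver pipeline, assuming hypothetical helper callables (c_cnn, csf_jpeg_encode, csf_jpeg_decode, rd_cnn) that stand in for the components defined in Sects. 3.1–3.3; it is illustrative only and not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import zoom  # cubic spline up-scaling, used here in place of bicubic interpolation

def sender_side(image_128: np.ndarray, c_cnn, csf_jpeg_encode) -> bytes:
    compact_64 = c_cnn(image_128)          # 128 x 128 x 3 -> 64 x 64 x 3 compact image
    return csf_jpeg_encode(compact_64)     # CSF quantization-based JPEG bitstream

def receiver_side(bitstream: bytes, csf_jpeg_decode, rd_cnn) -> np.ndarray:
    compact_64 = csf_jpeg_decode(bitstream)              # decoded 64 x 64 x 3 compact image
    upscaled_128 = zoom(compact_64, (2, 2, 1), order=3)  # up-scale back to 128 x 128 x 3
    return rd_cnn(upscaled_128)                          # RD-CNN restores the final image quality
```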

3.1 Architecture of CSF quantization-based JPEG

The CSF quantization-based JPEG shown in Fig. 2 is part of the proposed compression system. It uses the linear and perceptually uniform CIE La’b’ color space for the JPEG compression, and a linear contrast sensitivity function is used to generate the quantization matrices. CIE La’b’ is a perceptually device-independent [24], uniform, linear, luminance-chrominance color space, so quantization can be implemented effectively without perceptual loss of visual quality [25]. There is no direct transformation from RGB color space to CIE La’b’; the conversion consists of the following steps:

  1. 1.

    Transform gamma corrected RGB values to linear RGB.

  2. 2.

    Convert linear \({R}_{l}\) \({G}_{l}\) \({B}_{l}\) to CIE XYZ using the following formula:

    $$\left[\begin{array}{c}\mathrm{X}\\ \mathrm{Y}\\ \mathrm{Z}\end{array}\right]= \left[\begin{array}{ccc}0.4124& 0.3576& 0.1805\\ 0.2126& 0.7152& 0.0722\\ 0.0193& 0.1192& 0.9505\end{array}\right] \times \left[\begin{array}{c}{\mathrm{R}}_{\mathrm{l}}\\ {\mathrm{G}}_{\mathrm{l}}\\ {\mathrm{B}}_{\mathrm{l}}\end{array}\right]$$
  3. 3.

    Convert CIE XYZ to CIE La’b’ [25,26,27] using Eqs. (5), (6) and (7):

    $$\mathrm{L}=116 \times \mathrm{f}\left(\frac{\mathrm{Y}}{{\mathrm{Y}}_{\mathrm{n}}}\right)-16$$
    (5)
    $${a}^{^{\prime}}=500 \times \left[\mathrm{f}\left(\frac{\mathrm{X}}{{\mathrm{X}}_{\mathrm{n}}}\right)- \mathrm{f}\left(\frac{\mathrm{Y}}{{\mathrm{Y}}_{\mathrm{n}}}\right)\right]$$
    (6)
    $${b}^{^{\prime}}=200 \times \left[\mathrm{f}\left(\frac{\mathrm{Y}}{{\mathrm{Y}}_{\mathrm{n}}}\right)- \mathrm{f}\left(\frac{\mathrm{Z}}{{\mathrm{Z}}_{\mathrm{n}}}\right)\right]$$
    (7)

where the function f is defined as follows [26,27,28]

Fig. 2
figure 2

CSF quantization-based JPEG (CSF_JPEG) [16]

$$f\left( x \right) = \left\{ {\begin{array}{ll} {x^{\frac{1}{3}}} & {x > 0.008856} \\ {7.787x + \frac{16}{116}} & {x \le 0.008856} \\ \end{array} } \right.$$
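A minimal NumPy sketch of steps 1–3 is given below, assuming 8-bit sRGB input, the standard sRGB gamma, and the D65 reference white (Xn, Yn, Zn); these assumptions are illustrative and may differ from the exact settings used in [10, 25].

```python
import numpy as np

M_RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                      [0.2126, 0.7152, 0.0722],
                      [0.0193, 0.1192, 0.9505]])
XN, YN, ZN = 0.9505, 1.0000, 1.0890        # D65 reference white (assumed)

def f(t: np.ndarray) -> np.ndarray:
    """Piecewise function used in Eqs. (5)-(7)."""
    return np.where(t > 0.008856, np.cbrt(t), 7.787 * t + 16.0 / 116.0)

def rgb_to_lab(rgb_u8: np.ndarray) -> np.ndarray:
    srgb = rgb_u8.astype(np.float64) / 255.0
    # Step 1: undo the sRGB gamma to obtain linear RGB
    linear = np.where(srgb <= 0.04045, srgb / 12.92, ((srgb + 0.055) / 1.055) ** 2.4)
    # Step 2: linear RGB -> CIE XYZ
    xyz = linear @ M_RGB2XYZ.T
    x, y, z = xyz[..., 0] / XN, xyz[..., 1] / YN, xyz[..., 2] / ZN
    # Step 3: CIE XYZ -> L, a', b' using Eqs. (5)-(7)
    L = 116.0 * f(y) - 16.0
    a = 500.0 * (f(x) - f(y))
    b = 200.0 * (f(y) - f(z))
    return np.stack([L, a, b], axis=-1)
```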

After converting the RGB compact image produced by the C-CNN into CIE La’b’ using steps 1–3, each sub-plane (L, a’ or b’) is divided into non-overlapping uniform blocks of size 8 × 8. The statistical moments (mean, variance) are calculated for each block, and a histogram table is constructed for all the blocks based on the statistical moments and threshold values of mean and variance. In parallel, a matrix Ix of size p/8 × q/8 (where p × q is the sub-plane size) is formed, which contains the indices of the blocks corresponding to the histogram table.

Using the matrix Ix, adjacent blocks with the same index values are merged. Since adjacent blocks are grouped, the block sizes vary from 8 × 8 to 32 × 32, and these block sizes are allotted indices from 0 to 15. Finally, a block size index array (Blx) is formed using the histogram table and the matrix Ix. The array Blx denotes the size of each block and its corresponding statistical moments in the histogram table. It is encoded using an Exponential Golomb code [29] and sent as overhead information to the receiver side. As both the position and size of the image blocks are variable, the block structure of CSF_JPEG is more flexible than the hierarchical variable block sizes of H.264 or High Efficiency Video Coding (HEVC). Let M × N be the size of an image sub-block. Each sub-block is subjected to the Two-Dimensional Discrete Cosine Transform (2D-DCT), CSF-based quantization, Zig-Zag ordering, and run-length coding, in that order.

According to [30], CSF quantization is given as:

$$CSF\left(f\right)=100\sqrt{f} \mathrm{exp}(-0.13f)$$
(8)

where f is defined as the spatial frequency for the M × N matrix, measured in cycles/degree:

$$f\left({x}_{1},{y}_{1}\right)=30\frac{\sqrt{{{x}_{1}}^{2}+{{y}_{1}}^{2}}}{{N}_{n} \times \Delta }$$
(9)

where \({x}_{1},{y}_{1}\) represent the DCT block coordinates, \(\Delta\) represents the pixel size, assumed to be 1.5 arc-min/pixel [30], and \({\mathrm{N}}_{\mathrm{n}}\) is defined as

$$N_{n} = \sqrt {M \times N}$$

A linear CSF, as defined by Eq. (10), is used for quantization matrix generation since the Commission Internationale de l’Eclairage (CIE) La’b’ color space is used for compression.

$$CSF\left(f\right)=c(f-{f}_{max})$$
(10)

\({f}_{max}\) is defined as the maximum frequency in an M × N image sub-block and is calculated using Eq. (11):

$${f}_{max}=30\frac{\sqrt{{M}^{2}+{N}^{2}}}{{N}_{n} \times \Delta }$$
(11)

Quantization matrix is defined as:

$$Quant\left({x}_{1},{y}_{1}\right)=\mathrm{min}\left( T\left({x}_{1},{y}_{1}\right) \times range ,\; {c}_{max}\left({x}_{1},{y}_{1}\right)\right)$$
(12)

For a given range of spatial frequencies, \({c}_{max}\left({x}_{1},{y}_{1}\right)\) is the matrix containing the maximum values that the DCT coefficients can take. \(T\left({x}_{1},{y}_{1}\right)\) is the threshold for the DCT basis functions in an M × N matrix, defined by Eq. (13):

$$T\left( {x_{1} ,y_{1} } \right) = \left\{ {\begin{array}{ll} {\frac{1}{{Nm\left( {x_{1} ,y_{1} } \right) \times CSF\left( f \right)}}} & {for\;x_{1} = 0\;or\;y_{1} = 0} \\ {\frac{1}{{Nm\left( {x_{1} ,y_{1} } \right) \times CSF\left( f \right) \times OTF\left( {x_{1} ,y_{1} } \right)}}} & {for\;x_{1}\;and\;y_{1} > 0} \\ \end{array} } \right.$$
(13)

\(Nm\left({x}_{1},{y}_{1}\right)\) is defined as the normalization function used in 2D-DCT. Orientation Tuning Function \(\left(OTF\right)\) is defined as:

$$OTF\left({x}_{1},{y}_{1}\right)= \left\{\begin{array}{ll}\mathrm{exp}\left(-9.5{\left(\frac{{x}_{1}}{{y}_{1}}\right)}^{2}\right)& for\; {x}_{1}<{y}_{1}\\ \mathrm{exp}\left(-9.5{\left(\frac{{y}_{1}}{{x}_{1}}\right)}^{2}\right)& otherwise\end{array}\right.$$
(14)

The nonzero CSF-quantized DCT coefficients obtained after applying Eq. (12) to each sub-block are zigzag ordered, and the zigzag-ordered coefficients are encoded using run-length coding followed by binary arithmetic (QM) coding. The encoded data is sent as a bitstream to the decoder. Decoding is performed at the decoder in the inverse order of the encoding procedure shown in Fig. 2 to obtain the reconstructed image.
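As an illustration of this step, a minimal sketch of zig-zag scanning and run-length coding for one quantized block is given below; the (run, value) pair format and the end-of-block marker are assumed conventions, and the QM arithmetic coder is omitted.

```python
import numpy as np

def zigzag(block: np.ndarray) -> np.ndarray:
    """Return the coefficients of a square block in zig-zag scan order."""
    n = block.shape[0]

    def key(ij):
        i, j = ij
        d = i + j
        # even anti-diagonals run bottom-left to top-right, odd ones top-right to bottom-left
        return (d, j if d % 2 == 0 else i)

    order = sorted(((i, j) for i in range(n) for j in range(n)), key=key)
    return np.array([block[i, j] for i, j in order])

def run_length_encode(coeffs: np.ndarray):
    """Encode a zig-zag ordered sequence as (zero-run, value) pairs for nonzero values."""
    pairs, run = [], 0
    for v in coeffs:
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    pairs.append((0, 0))  # assumed end-of-block marker
    return pairs

# Example: run-length code one CSF-quantized 8 x 8 block
# pairs = run_length_encode(zigzag(quantized_block))
```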

3.2 Architecture of C-CNN for compact representation

The underlying architecture of the C-CNN is shown in Fig. 3. To maintain the spatial structure of the underwater image, the C-CNN uses three weight layers. An input image of size 128 × 128 × 3 is given to the first convolutional layer, which uses 64 filters of size 3 × 3 followed by ReLU activation. ReLU converges faster and generalizes better in deep neural networks than the widely used logistic sigmoid and hyperbolic tangent functions, even though it is asymmetric, hard-linear, and not differentiable at zero. The second layer is a convolutional layer with a stride of two, followed by Batch Normalization (BN) and ReLU; BN normalizes the activations of the input volume before passing them to the next layer, reducing Internal Covariate Shift. The second layer thus down-scales and enhances the features. Its output is fed to the last layer, which consists of c filters of size 3 × 3 × 64, to construct the compact representation.

Fig. 3
figure 3

C-CNN for Compact representation of input underwater image [10]
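A sketch of this three-layer C-CNN in Keras is shown below; the layer order and filter counts follow the description above, while the padding, the number of filters in the strided layer, and other hyper-parameters are assumptions.

```python
import tensorflow as tf

def build_c_cnn(c: int = 3) -> tf.keras.Model:
    """Three-layer compact CNN: 128 x 128 x 3 input -> 64 x 64 x c compact image."""
    inp = tf.keras.Input(shape=(128, 128, 3))
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same")(x)   # down-scale to 64 x 64
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    out = tf.keras.layers.Conv2D(c, 3, padding="same")(x)             # compact representation
    return tf.keras.Model(inp, out, name="c_cnn")
```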

3.3 Architecture of receiver side to attain high image quality

Figure 4 shows the overall architecture of the receiver side. The decoded image is up-scaled to the size of the original image using bicubic interpolation. The up-scaled image is passed through the RD-CNN, which consists of D Residual Dense Blocks (RDBs). Figure 5 shows the building blocks of the RD-CNN, and Fig. 6 shows the architecture of a single RDB.

Fig. 4
figure 4

Overall architecture of RD-CNN [18]

Fig. 5
figure 5

RD-CNN to attain high quality underwater reconstructed image

Fig. 6
figure 6

Building blocks of an RDB [18]

Due to scaling, the same or similar objects in an image can appear different and are subject to further artifacts. Hierarchical features can capture such variations and contribute to a finer reconstruction. This is made possible by the RD-CNN, which consists of densely connected layers and local feature fusion (LFF) with local residual learning (LRL).

LFF in each RDB extracts dense local features by concatenating the states of the preceding and current RDBs. Global feature fusion preserves global hierarchical features by combining shallow and deep features. A fusion kernel of size 1 × 1 is chosen for both local and global feature fusion. All remaining convolutional layers use a 3 × 3 kernel with padding on all sides of the input to maintain its size. The residual output is added to the up-scaled image to recover the original image. This helps maintain the quality of deep-sea images transmitted by AUVs, which supports real-time intelligent monitoring of underwater fish behaviour. A sketch of an RDB and the RD-CNN assembly is given below.
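The following Keras sketch shows one RDB and the RD-CNN assembly as described above (dense 3 × 3 convolutions with concatenation, 1 × 1 local feature fusion with local residual learning, and 1 × 1 global feature fusion with a global residual added to the up-scaled input); the number of blocks D, the layers per block, and the growth rate are assumed values, not those of the original experiments.

```python
import tensorflow as tf

def rdb(x, growth: int = 32, layers: int = 4):
    """One Residual Dense Block: dense convolutions, 1x1 LFF, local residual learning."""
    feats = [x]
    for _ in range(layers):
        inp = tf.keras.layers.Concatenate()(feats) if len(feats) > 1 else feats[0]
        feats.append(tf.keras.layers.Conv2D(growth, 3, padding="same", activation="relu")(inp))
    fused = tf.keras.layers.Conv2D(x.shape[-1], 1)(tf.keras.layers.Concatenate()(feats))  # LFF
    return tf.keras.layers.Add()([x, fused])                                              # LRL

def build_rd_cnn(D: int = 4, channels: int = 64) -> tf.keras.Model:
    inp = tf.keras.Input(shape=(128, 128, 3))                 # bicubically up-scaled image
    shallow = tf.keras.layers.Conv2D(channels, 3, padding="same")(inp)
    x, block_outs = shallow, []
    for _ in range(D):
        x = rdb(x)
        block_outs.append(x)
    gff = tf.keras.layers.Conv2D(channels, 1)(tf.keras.layers.Concatenate()(block_outs))  # GFF
    residual = tf.keras.layers.Conv2D(3, 3, padding="same")(tf.keras.layers.Add()([shallow, gff]))
    out = tf.keras.layers.Add()([inp, residual])              # add residual to the up-scaled image
    return tf.keras.Model(inp, out, name="rd_cnn")
```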

3.4 Learning algorithm

The learning algorithm for the proposed network is presented in this section. The C-CNN and RD-CNN are trained jointly to reduce the error between the input image and the reconstructed image using the following optimization goal [31]:

$$\left\langle {\hat{\alpha }_{1} ,\hat{\alpha }_{2} } \right\rangle = \mathop {\arg \min }\limits_{{\alpha_{1} ,\alpha_{2} }} \left\| {Rd\left( {\alpha_{2} ,Cf\left( {Cc\left( {\alpha_{1} ,\psi } \right)} \right)} \right) - \psi } \right\|^{2}$$
(15)

Here \(\psi\) is the original input image. \({\alpha }_{1},{\alpha }_{2}\) are the parameters of C-CNN and RD-CNN, respectively. Cc(.) and Rd(.) represent C-CNN and RD-CNN, respectively. Cf(.) represents CSF_JPEG.

During backpropagation, the rounding function inside Cf(.) in Eq. (15) is not differentiable. An iterative optimization algorithm based on [31] is used to overcome this problem by alternately fixing one of the parameters \(\alpha_{1} ,\alpha_{2}\) of C-CNN and RD-CNN, as given in Eqs. (16) and (17), respectively.

$$\left\langle {\hat{\alpha }_{1} } \right\rangle = \mathop {\arg \min }\limits_{{\alpha_{1} }} \left\| {Rd\left( {\hat{\alpha }_{2} ,Cf\left( {Cc\left( {\alpha_{1} ,\psi } \right)} \right)} \right) - \psi } \right\|^{2}$$
(16)
$$\left\langle {\hat{\alpha }_{2} } \right\rangle = \mathop {\arg \min }\limits_{{\alpha_{2} }} \left\| {Rd\left( {\alpha_{2} ,Cf\left( {Cc\left( {\hat{\alpha }_{1} ,\psi } \right)} \right)} \right) - \psi } \right\|^{2}$$
(17)

To update the parameter \({\alpha }_{2}\), an auxiliary variable \({\widehat{\psi }}_{m}\) is defined as the decoded compact representation of \(\psi\), as given in Eq. (18).

$${\widehat{\psi }}_{m}=Cf\left(Cc\left({\widehat{\alpha }}_{1},\psi \right)\right)$$
(18)

By combining Eqs. (17) and (18), Eq. (19) is obtained:

$$\left\langle {\hat{\alpha }_{2} } \right\rangle = \mathop {\arg \min }\limits_{{\alpha_{2} }} \left\| {Rd\left( {\alpha_{2} ,\hat{\psi }_{m} } \right) - \psi } \right\|^{2}$$
(19)

To update the parameter \({\alpha }_{1}\), an auxiliary variable \({\widehat{{\psi }^{^{\prime}}}}_{m}\) is defined as the optimal input to the RD-CNN, as given in Eq. (20), since Cf(.) is not differentiable during backpropagation.

$$\left\langle {\widehat{{\psi^{\prime}}}_{m} } \right\rangle = \mathop {\arg \min }\limits_{{\hat{\psi }_{m} }} \left\| {Rd\left( {\hat{\alpha }_{2} ,\hat{\psi }_{m} } \right) - \psi } \right\|^{2}$$
(20)

Assume that \(Rd\left( {\hat{\alpha }_{2} , \cdot } \right)\) is monotonic with respect to \(\widehat{{\psi^{\prime}}}_{m}\), as shown below:

$$\left\| {\tau - \widehat{{\psi^{\prime}}}_{m} } \right\|^{2} \ge \left\| {\varphi - \widehat{{\psi^{\prime}}}_{m} } \right\|^{2}$$

if and only if

$$\left\| {Rd\left( {\hat{\alpha }_{2} ,\tau } \right) - \psi } \right\|^{2} \ge \left\| {Rd\left( {\hat{\alpha }_{2} ,\varphi } \right) - \psi } \right\|^{2}$$
(21)

Assume \({\stackrel{\sim }{\alpha }}_{1}= \underset{{\alpha }_{1}}{\mathrm{arg\,min}}{\Vert Cf(Cc\left({\alpha }_{1},\psi \right)) - {\widehat{{\psi }^{^{\prime}}}}_{m}\Vert }^{2}\) to be the solution such that Eq. (22) is satisfied for any possible value of \({{\alpha }^{^{\prime}}}_{1}\):

$$\left\| {Cf\left( {Cc\left( {\alpha^{\prime}_{1} ,\psi } \right)} \right) - \widehat{{\psi^{\prime}}}_{m} } \right\|^{2} \ge \left\| {Cf\left( {Cc\left( {\tilde{\alpha }_{1} ,\psi } \right)} \right) - \widehat{{\psi^{\prime}}}_{m} } \right\|^{2}$$
(22)

From assumption (21), the following is obtained:

$$\left\| {Rd\left( {\hat{\alpha }_{2} ,Cf\left( {Cc\left( {\alpha^{\prime}_{1} ,\psi } \right)} \right)} \right) - \psi } \right\|^{2} \ge \left\| {Rd\left( {\hat{\alpha }_{2} ,Cf\left( {Cc\left( {\tilde{\alpha }_{1} ,\psi } \right)} \right)} \right) - \psi } \right\|^{2}$$
(23)

Accordingly,

$$\left\langle {\tilde{\alpha }_{1} } \right\rangle = \mathop {\arg \min }\limits_{{\alpha_{1} }} \left\| {Rd\left( {\hat{\alpha }_{2} ,Cf\left( {Cc\left( {\alpha_{1} ,\psi } \right)} \right)} \right) - \psi } \right\|^{2}$$
(24)

From Eq. (16), \({\widehat{\alpha }}_{1}\) = \({\stackrel{\sim }{\alpha }}_{1}\) is obtained, which is

$$\left\langle {\hat{\alpha }_{1} } \right\rangle = \mathop {\arg \min }\limits_{{\alpha_{1} }} \left\| {Cf\left( {Cc\left( {\alpha_{1} ,\psi } \right)} \right) - \widehat{{\psi^{\prime}}}_{m} } \right\|^{2}$$
(25)

Since Cf(.) is a codec, Eq. (26) can be formulated as:

$$\left\langle {\hat{\alpha }_{1} } \right\rangle \approx \mathop {\arg \min }\limits_{{\alpha_{1} }} \left\| {Cc\left( {\alpha_{1} ,\psi } \right) - \widehat{{\psi^{\prime}}}_{m} } \right\|^{2}$$
(26)

Combining the assumption in Eq. (21) with Eq. (26), we arrive at:

$$\left\langle {\hat{\alpha }_{1} } \right\rangle = \mathop {\arg \min }\limits_{{\alpha_{1} }} \left\| {Rd\left( {\hat{\alpha }_{2} ,Cc\left( {\alpha_{1} ,\psi } \right)} \right) - \psi } \right\|^{2}$$
(27)

Equation (27) is used instead of Eq. (16) to train the C-CNN, since it approximates Eq. (16). Thus, by iteratively optimizing Eqs. (19) and (27), the optimal values of the parameters \({\alpha }_{1},{\alpha }_{2}\) of C-CNN and RD-CNN are obtained. The complete algorithm to train the proposed network is given in Algorithm-I.

figure a

3.5 Loss functions for C-CNN and RD-CNN

Mean Squared Error (MSE) is defined as the loss function of C-CNN as follows:

$$L_{1} \left( {\alpha_{1} } \right) = \frac{1}{2N}\sum_{k = 1}^{N} \left\| {Rd\left( {\hat{\alpha }_{2} ,Cc\left( {\alpha_{1} ,\psi_{k} } \right)} \right) - \psi_{k} } \right\|^{2}$$
(28)

where \({\psi }_{k}\) represents the original image, \({\widehat{\alpha }}_{2}\) the trained parameter of the RD-CNN, N the batch size, and \({\alpha }_{1}\) the trainable parameter.

For training the RD-CNN, the loss function (MSE) is defined as:

$$L_{2} \left( {\alpha _{2} } \right) = \frac{1}{{2N}}\mathop \sum \limits_{{k = 1}}^{N} \left\| {res\left( {Cf\left( {\hat{\psi }_{{m_{k} }} } \right),\alpha _{2} } \right) - \left( {Cf\left( {\hat{\psi }_{{m_{k} }} } \right) - \psi _{k} } \right)} \right\|^{2}$$
(29)

where \({\widehat{\psi }}_{{m}_{k}}\) is the compact representation of \({\psi }_{k}\), \({\alpha }_{2}\) is the trainable parameter, and res(.) is the residual mapping learned by the RD-CNN. A sketch of one alternating training round is given below.
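The following is a minimal sketch of one alternating training round (Algorithm-I), assuming eager TensorFlow, the Adam settings of Sect. 2.1, a placeholder csf_jpeg_roundtrip function standing in for the non-differentiable codec Cf(.), and bicubic up-scaling to keep the RD-CNN input at the original resolution; the losses are MSE terms corresponding to Eqs. (28) and (29) up to constant factors and sign conventions.

```python
import tensorflow as tf

opt_c = tf.keras.optimizers.Adam(1e-3)   # updates C-CNN parameters (alpha_1)
opt_rd = tf.keras.optimizers.Adam(1e-3)  # updates RD-CNN parameters (alpha_2)

def train_step(batch, c_cnn, rd_cnn, csf_jpeg_roundtrip):
    """One alternating update: Eq. (19) for RD-CNN, then the approximation Eq. (27) for C-CNN."""
    # Step 1: fix alpha_1; pass the compact image through the (non-differentiable) codec Cf(.)
    compact = c_cnn(batch, training=False)
    decoded = tf.convert_to_tensor(csf_jpeg_roundtrip(compact.numpy()), dtype=tf.float32)
    upscaled = tf.image.resize(decoded, [128, 128], method="bicubic")
    with tf.GradientTape() as tape:
        loss_rd = tf.reduce_mean(tf.square(rd_cnn(upscaled, training=True) - batch))
    opt_rd.apply_gradients(zip(tape.gradient(loss_rd, rd_cnn.trainable_variables),
                               rd_cnn.trainable_variables))
    # Step 2: fix alpha_2; update C-CNN through the differentiable path of Eq. (27), skipping Cf(.)
    with tf.GradientTape() as tape:
        restored = rd_cnn(tf.image.resize(c_cnn(batch, training=True), [128, 128],
                                          method="bicubic"), training=False)
        loss_c = tf.reduce_mean(tf.square(restored - batch))
    opt_c.apply_gradients(zip(tape.gradient(loss_c, c_cnn.trainable_variables),
                              c_cnn.trainable_variables))
    return loss_c, loss_rd
```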

4 Results and discussion

The proposed method is compared with traditional compression techniques such as Block Truncation Coding (BTC) [32], the Pyramid technique [33], DCT [34], Singular Value Decomposition (SVD) [35], SPIHT, and DWT-DCT [36]. Figure 7 shows the qualitative difference between the decompressed outputs of all the above-mentioned techniques.

Fig. 7
figure 7

A qualitative comparison between the decompressed images of traditional techniques and the proposed technique

Figure 7 shows that the proposed technique retains the features of the original input image. The Pyramid technique produces blurred output, BTC has a lot of noise and blocking artifacts, and DCT, SVD, SPIHT, and DWT-DCT are not able to retain the edge features of the object under study.

PSNR values for standard JPEG, JPEG-CSF, C-CNN_R-CNN, and the proposed method are compared in Fig. 8. The proposed method shows better results than standard JPEG, JPEG-CSF, and C-CNN_R-CNN.

Fig. 8
figure 8

Average PSNR values for Standard JPEG, JPEG-CSF, C-CNN_R-CNN and the Proposed method on the sample of 200 test images

A comparison of the PSNR values of standard JPEG, JPEG-CSF, C-CNN_R-CNN, and the proposed method is shown in Table 1 for a sample of 51 images. The average performance of the proposed method is better than standard JPEG, JPEG-CSF, and C-CNN_R-CNN. From Fig. 9, it is seen that the proposed method requires the fewest bits per pixel (0.095484 bpp on average) for representing compressed underwater images when compared with C-CNN_R-CNN (0.27095 bpp on average), standard JPEG (0.683034 bpp on average), and JPEG-CSF (0.198957 bpp on average). As the compression in the proposed method takes place in two steps, it provides better compression than the existing methods, i.e., standard JPEG, JPEG-CSF, and C-CNN_R-CNN.

Table 1 Comparison of PSNR values in dB for 51 images taken from fish image dataset
Fig. 9
figure 9

Average Bits per pixel values for the Proposed method, JPEG-CSF, Standard JPEG and C-CNN_R-CNN on the sample of 200 test images

A comparison of the bits-per-pixel values of standard JPEG, JPEG-CSF, C-CNN_R-CNN, and the proposed method is shown in Table 2 for a sample of 51 images. The C-CNN reduces each spatial dimension of the image to 50% of its original size, and the result is further reduced by JPEG-CSF.

Table 2 Comparison of Bits per pixel values for 51 images taken from fish image dataset

The quality of the image reconstructed by the proposed network is also compared with the Super-Resolution CNN (SRCNN). Figures 10 and 11 show that the proposed method achieves better PSNR and SSIM than SRCNN. In this comparison, SRCNN reconstructs images that have been reduced to 50% of their original size. The proposed method uses a residual dense neural network for reconstruction, which transfers image features from one block to another and thereby improves the quality of the reconstructed image; hence the proposed method provides better quality than SRCNN. From Tables 1 and 2 and Figs. 8 and 10, it can be concluded that the proposed method performs better than C-CNN_R-CNN and SRCNN because it first compresses the original image to 50% of its size using a compact CNN and then reduces it further using JPEG-CSF. In C-CNN_R-CNN, no further compression is applied to the compact representation, so the proposed method provides better compression performance. The reconstructed images from the proposed framework are then subjected to fish image classification.

Fig. 10
figure 10

Comparison of PSNR of SRCNN with the proposed method on the sample of 200 test images

Fig. 11
figure 11

Comparison of SSIM of SRCNN with the proposed method on the sample of 200 test images

Table 3 provides the validation accuracy percentage and the computation time obtained using transfer learning on various deep learning models such as Dense Convolutional Network (Densenet201) [37], Googlenet [38], Mobilenetv2 [39], Residual Network (Resnet18) [40], Resnet50, Resnet101, Shufflenet [41], Visual Geometry Group (VGG16) [42], and VGG19.

Table 3 Comparison of validation accuracy and time of computation by various Deep Learning models using transfer learning for fish image classification

From Table 3, Shufflenet has the highest validation accuracy, with a computation time of 46 min 24 s. If computation time is more important than accuracy, Googlenet provides an accuracy of 91.94% with a computation time of 30 min 27 s.

5 Conclusions and future work

The proposed method provides an energy-efficient technique that reduces the amount of data transmitted while retaining the quality of the transmitted image. The proposed model works on two levels. At the first level, two CNNs (C-CNN and RD-CNN) are trained together to retain the data's structural information and to provide better reconstruction quality, respectively. At the second level, the compact representation of the original underwater image is subjected to CSF quantization-based JPEG encoding to enhance the compression rate.

Experimental results reveal that the proposed work provides better underwater image quality and a higher compression ratio than traditional and existing CNN techniques. The proposed method requires the fewest bits per pixel to represent compressed underwater images compared with C-CNN_R-CNN, standard JPEG, and JPEG-CSF: on average, a 52% reduction in bits per pixel compared with JPEG-CSF, an 86% reduction compared with standard JPEG, and a 64.7% reduction compared with C-CNN_R-CNN. The proposed method has, on average, 3.80% better PSNR and 3.51% better SSIM than SRCNN, and provides better PSNR and bits-per-pixel values on 200 images taken from the fish image dataset compared with C-CNN_R-CNN, standard JPEG, and JPEG-CSF. The reconstructed images of the proposed model are classified with a highest accuracy of 92.12% using Shufflenet, which makes the recognition of different fish species very efficient.

Images of fishes collected by AUVs are transmitted via communication channels to the terrestrial control centre for monitoring purposes. Fast data transmission is therefore needed between the underwater nodes and terrestrial monitoring systems to overcome the power constraints of sensor nodes and to use the communication bandwidth effectively. Effective utilization of the communication bandwidth also calls for variable-rate encoding, and the underwater images need to be enhanced to remove the dominance of blue-green colour. These improvements will be carried out as future work.