1 Introduction

The voluminous video content transmitted over networks has led to the emergence of more powerful and efficient codecs. Rising image and video quality standards place an extra challenge on compression standards to retain good visual perception at higher compression rates. Compression strategies reduce the size of frames by eliminating temporal and spatial redundancies, aiming to achieve a better compression rate while retaining the perceptual quality of the frames. Widely used, traditionally designed codecs such as H.264 or HEVC primarily focus on reducing rate-distortion error. The surging demand for video streaming calls for more efficient video storage, transmission and retrieval techniques.

The impressive results of deep learning based image compression fueled further research into the application of deep learning to video compression. Several DNN based image compression networks were later extended to video by incorporating motion estimation and prediction strategies. End-to-end trainable, autoencoder-style fully convolutional networks have also been proposed for video compression, and some have produced promising results compared to state-of-the-art methods. Pure deep learning based compression techniques are end-to-end trainable and optimizable.

The proposed video compression architecture, motivated by these deep learning based approaches, comprises three small networks: a frame autoencoder for frame compression, a flow autoencoder for compression of the estimated flow, and a motion extension network for reconstruction of the current frame. Each frame is reconstructed from the previous frame, the current frame and its corresponding flow values. The results have been compared with several state-of-the-art techniques, including H.264, in terms of SSIM and PSNR.

The remainder of this paper is organized as follows. Section 2 describes related work in the field of video compression using different approaches. Section 3 presents the proposed architecture, with its components described in the subsections. Section 4 deals with the experimental setup, experimental analysis, comparison with state-of-the-art methods and the ablation study. Finally, Section 5 concludes the work.

2 Related work

Video compression techniques encompass a pair of encoder and decoder: video frames are compressed by the encoder and reconstructed by the corresponding decoder, and together they form a codec. The codec primarily reduces the number of bits used for representation, but this reduction usually introduces reconstruction error. Early video codecs such as Motion JPEG were based on the individual compression of video frames, and hence on image compression.

The emerging field of deep learning has had a profound effect on image compression and has resulted in good quality compression [3, 13, 26, 29, 30]. These techniques mainly focus on reducing the distortion error by training an autoencoder network on a suitable training set [13, 26, 30]. Some of the autoencoder based networks make use of RNNs [2, 13, 30] to achieve variable compression rates from the same network, a concept that was later extended to videos as well. Fully convolutional networks are used so that the networks apply to variable frame sizes. These models employ entropy coding to eliminate spatially redundant information [3, 23, 26, 29, 30]. Some advanced models, such as PixelCNN, are based on probability driven adaptive arithmetic coding [25]. Different quantization techniques are used to learn the binary representation: stochastic binarization has been used in the model proposed by Toderici et al., and quantization based on soft assignment has been implemented by Agustsson et al. [1]. All such models work in a similar manner. Conclusively, deep learning based image compression techniques were found to give good compression rates with better image quality compared to traditional techniques such as WebP [33] or JPEG. Image interpolation is a popular and powerful technique for image compression, which mainly employs an encoder-decoder network to predict an intermediate frame between two reference frames [11, 12, 19, 24]. Image extrapolation has also performed well in predicting unseen frames [22, 31, 34]. However, both techniques are limited to short, slow-motion videos.

Traditional video codecs such as H.264 or HEVC [15] are primarily manually designed and have evolved through frequent developments. Their strategy mainly relies on splitting a video clip into I (image) frames and referencing (P or B) frames, compressing the I frames directly and predicting the referencing frames. Although these traditional techniques give good results, they evolved through incremental developments, so end-to-end optimization is not possible. Many machine learning based enhancements to the traditional codecs have been proposed to improve compression quality and efficiency. Popular techniques include intra-prediction coding [10] and motion compensation and interpolation [21, 27, 35, 39]. Approaches such as rate control and post-processing refinement have also led to improved results [4, 9, 17, 18, 28, 32, 36,37,38].

Encoding complexity is one of the major challenges in traditional codecs such as HEVC. To address this challenge, several different techniques have been proposed [7, 8, 16]. In [8], a significant encoding complexity improvement was achieved with a new technique for non-skipped coding blocks. This technique employs an entropy value for early non-Asymmetric Motion Partitioning detection and results in about a 40% reduction in encoding time with almost the same video quality. Another promising technique for reducing encoder complexity comprises motion activity classification and early PU decision; here, depth correlations and spatio-temporal analyses are used to make an early PU decision at each CU level [16]. In addition to these developments in traditional codecs, a surge in pure deep learning based video compression architectures and frameworks has been observed. DeepCoder, a pure CNN based video compression architecture, encodes the quantized feature maps into a binary stream using scalar quantization and Huffman coding [5]. This model achieves limited efficiency but presents huge potential for further exploration. Several other deep learning based techniques for video compression are presented in [6].

Most of the popular techniques make use of optical flow based motion estimation to extract and represent motion information. Many optical flow estimation techniques, both traditional differential-equation based and machine learning based, have been developed. Motivated by these developments in deep learning based video compression, the architecture proposed in this paper comprises flow driven reconstruction of frames. A Frame AutoEncoder Network module is utilized for better image generation from the available current data, and the previously reconstructed frame information is utilized in the Motion Extension Network. This design was chosen to give better modularity in the network architecture.

3 Proposed architecture

The architecture is an amalgamation of three separate networks: a frame compression/decompression network, a flow vector compression/decompression network, and finally a motion extension/frame reconstruction network. The frame and flow compression/decompression networks are each composed of an encoder and a decoder. The relationship between the modules is as follows (a code sketch of the overall pipeline is given after the list):

  1. Step 1:

    Frame Compression

    We use a recurrent ConvGRU based encoder model to encode the frame with varying degrees of compression quality.

  2. Step 2:

    Flow Vector Estimation

    This is done using the traditional Farneback flow estimation method. The flow vectors between every two frames are estimated.

  3. Step 3:

    Flow Vector Compression

    The estimated flow vectors from step 2 are compressed using a standard CNN based encoder network with Generalized Divisive Normalization (GDN) layers as the nonlinearity.

  4. Step 4:

    Frame Decompression

A ConvGRU based decoder model is used to decompress the encoded frames from step 1 according to the degree of compression.

  5. Step 5:

    Flow Vector Decompression

    A CNN based decoder network with Inverse GDN as the nonlinearity is used to decompress the flow vectors.

  6. Step 6:

    Motion based Frame Reconstruction

    A three pronged CNN based network is used to estimate the current video frame from the reconstructed frame from step 4, the reconstructed flow vector from step 5, and the previous output video frame.
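
To make the data flow between these modules concrete, the following is a minimal Python sketch of one per-frame step, assuming PyTorch-style `frame_codec`, `flow_codec` and `motion_net` objects and an `estimate_flow` function; all names and interfaces here are illustrative, not the paper's actual code.

```python
def compress_decompress_step(frame_t, prev_frame, prev_output,
                             frame_codec, flow_codec, motion_net, estimate_flow):
    """One per-frame pass through the six steps above (illustrative interfaces)."""
    # Step 1: compress the current frame (ConvGRU-based frame autoencoder).
    frame_bits = frame_codec.encode(frame_t)

    # Steps 2-3: estimate the flow between the previous and current frame,
    # then compress it with the flow autoencoder.
    flow = estimate_flow(prev_frame, frame_t)
    flow_bits = flow_codec.encode(flow)

    # Steps 4-5: the decoder side reconstructs the frame and the flow field.
    frame_rec = frame_codec.decode(frame_bits)
    flow_rec = flow_codec.decode(flow_bits)

    # Step 6: the motion extension network fuses the decoded frame, the
    # decoded flow and the previous output frame into the final frame.
    output_frame = motion_net(frame_rec, flow_rec, prev_output)
    return frame_bits, flow_bits, output_frame
```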

3.1 Network architecture

A high-level overview of the proposed network architecture, and the relationship between its modules, is given in Fig. 1. It mainly comprises the Frame Compression Autoencoder Network, the Flow Vector Estimation Network and the Motion-based Reconstruction Network; the architectures of the individual networks are given below. In the frame autoencoder, a ConvGRU is used in one of the five layers on both the encoder and decoder sides. The flow autoencoder compresses the flow values, which are decompressed before the next frame is generated by the motion extension network. The motion extension network reconstructs the next frame using the previous frame and the outputs of both the frame and flow autoencoders.

Fig. 1 High-level overview of video compression architecture

3.1.1 Frame compression/decompression

A recurrent ConvGRU based encoder model, depicted in Fig. 2, is used to encode each frame with varying degrees of compression quality, and a ConvGRU based decoder model decompresses the encoded frames according to the degree of compression. The autoencoder network contains five convolutional layers with Generalized Divisive Normalization (GDN) layers as the nonlinearity to reduce the image size. The output is then passed to a binarizer that converts it into highly compressible binary values. For decompression, this binary-coded frame information is passed through five deconvolutional layers to recreate the frame.

Fig. 2 Frame autoencoder network

One of the convolutional layers in both the encoder and decoder networks has been replaced with a ConvGRU layer. The image is passed through the autoencoder network to obtain the binary-coded information and the recreated image. The residual image is calculated as the difference between the original image and the recreated image, and is again passed through the autoencoder network to be encoded into the compressed binary format. This process is repeated according to the required compression quality.
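
A minimal sketch of this residual (emission-step) loop is given below, assuming `encoder`, `decoder` and `binarizer` modules defined elsewhere; the ConvGRU hidden state that the actual network carries across passes is omitted for brevity.

```python
import torch

def iterative_frame_coding(frame, encoder, decoder, binarizer, num_emissions):
    """Encode a frame over several emission steps: each pass codes the
    remaining residual, so more emissions give higher quality at the cost
    of compression efficiency. Module interfaces are illustrative."""
    residual = frame
    codes = []
    reconstruction = torch.zeros_like(frame)
    for _ in range(num_emissions):
        bits = binarizer(encoder(residual))   # binary code for this pass
        codes.append(bits)
        reconstruction = reconstruction + decoder(bits)
        residual = frame - reconstruction     # what is still left to encode
    return codes, reconstruction
```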

The details of the parameters used in Frame Encoder Decoder Network are described in Table 1.

Table 1 Network parameters of frame autoencoder

3.1.2 Flow vector estimation/compression/decompression

In computer vision tasks, optical flow is widely used to exploit temporal relationships. Although many learning based optical flow estimation methods have recently been proposed, we calculate optical flow using the traditional Farneback flow estimation method. It is a well tested method and removes the need to train a flow estimation network. In future, learning based flow estimation could be integrated to produce a purely neural network based encoder.
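
For reference, dense Farneback flow between two frames can be computed with OpenCV as sketched below; the parameter values shown are common defaults, not necessarily those used in the paper.

```python
import cv2

def farneback_flow(prev_frame, curr_frame):
    """Dense optical flow (dx, dy per pixel) between two consecutive BGR frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # array of shape (H, W, 2)
```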

The flow vectors between every two consecutive frames are estimated. The estimated flow vectors are compressed using a standard CNN based encoder network with Generalized Divisive Normalization (GDN) layers as the nonlinearity (Fig. 3). A CNN based decoder network with inverse GDN as the nonlinearity is used to decompress the flow vectors. Figure 4 presents the decoder side overview of the network.

Fig. 3 Flow AutoEncoder

Fig. 4 Decoder side network overview

The details of the network parameters used in Flow Autoencoder are explained in Table 2.

Table 2 Network parameters of flow autoencoder

3.1.3 Motion based frame reconstruction

A three-pronged CNN based network is used to estimate the current video frame from the reconstructed frame (step 4), the reconstructed flow vector (step 5), and the previous output video frame. Figure 5 illustrates the architecture of the network. It uses CNN blocks to transform the inputs and concatenation blocks to merge the prongs.
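
The sketch below illustrates one way such a three-pronged fusion can be laid out in PyTorch, following the pairwise merging described in Section 3.2.1; the layer widths and depths are placeholders rather than the exact configuration of Fig. 5.

```python
import torch
import torch.nn as nn

class MotionExtensionNet(nn.Module):
    """Illustrative sketch: the previous output frame is first combined with
    the decoded flow, and that intermediate representation is then merged
    with the decoded intermediate frame."""
    def __init__(self, ch=32):
        super().__init__()
        self.prev_branch = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.flow_branch = nn.Sequential(nn.Conv2d(2, ch, 3, padding=1), nn.ReLU())
        self.frame_branch = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.merge_motion = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU())
        self.merge_frame = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, frame_rec, flow_rec, prev_out):
        # First concat: motion-based intermediate from previous frame + flow.
        motion = self.merge_motion(
            torch.cat([self.prev_branch(prev_out), self.flow_branch(flow_rec)], dim=1))
        # Second concat: fuse with the decoded intermediate frame.
        return self.merge_frame(
            torch.cat([motion, self.frame_branch(frame_rec)], dim=1))
```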

Fig. 5 Motion extension network

3.2 Video encoding and decoding algorithms

Figure a Video encoding algorithm

3.2.1 Video decoding

For video reconstruction, the binary-coded outputs of the frame encoder and the flow encoder are needed. The number of emission steps controls the size of the frame encoder data and hence the bit rate of the signal.

The intermediate frame is calculated by passing the binary-coded frame data through the frame decoder, and the flow information is decoded via the flow decoder network. Finally, the Motion Extension Network takes the previous output image and merges it with the decoded flow information to create an intermediate representation of the current image. It then merges this intermediate representation with the already decoded intermediate frame to produce the high quality current frame, as displayed in Fig. 4. The same is represented by Eq. (1).

$$ I_{decoded} = f_{decoder}\left( I_{encoded},\ F_{encoded},\ I_{decoded_{prev}} \right) $$
(1)

where \( I_{encoded} \) and \( F_{encoded} \) are the binarized encodings of the current frame and the flow vectors, \( I_{decoded_{prev}} \) is the previously decoded frame, and \( f_{decoder} \) denotes the decoder neural network.
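
A compact sketch of this decoder-side step, with illustrative stand-ins for the trained networks, is:

```python
def decode_frame(frame_decoder, flow_decoder, motion_net,
                 frame_bits, flow_bits, prev_decoded):
    """Reconstruct the current frame following Eq. (1)."""
    frame_intermediate = frame_decoder(frame_bits)  # intermediate frame
    flow_rec = flow_decoder(flow_bits)              # decoded flow field
    # The motion extension network merges the previous output, the decoded
    # flow and the intermediate frame into the final current frame.
    return motion_net(frame_intermediate, flow_rec, prev_decoded)
```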

Figure b Video decoding algorithm

3.3 Training strategy

Loss function

The goal of the network is to reduce the structural distortion between the input video and the output. A mean squared error (MSE) loss term is also used to reduce color distortion in the decompressed images, giving the combined loss:

$$ L = L_{ssim} + \alpha\, L_{mse} $$
(2)

where MSE error is evaluated as:

$$ L_{mse}\left(y, y^{\prime}\right) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - y_i^{\prime}\right)^2 $$
(3)

and SSIM error is evaluated based upon three comparison measurements, luminance (l), contrast (c) and structure (s):

$$ L_{ssim}\left(y, y^{\prime}\right) = \left[\, l\left(y, y^{\prime}\right) \cdot c\left(y, y^{\prime}\right) \cdot s\left(y, y^{\prime}\right) \,\right] $$
(4)
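
In code, the combined loss of Eq. (2) can be sketched as below, assuming an external `ssim_fn` that returns SSIM values in [0, 1]; the structural term is implemented here as 1 − SSIM so that minimizing the loss maximizes similarity, and the value of α is a tunable weight.

```python
import torch.nn.functional as F

def combined_loss(output, target, ssim_fn, alpha=1.0):
    """L = L_ssim + alpha * L_mse (Eq. 2), with L_ssim taken as 1 - SSIM."""
    l_mse = F.mse_loss(output, target)        # Eq. (3)
    l_ssim = 1.0 - ssim_fn(output, target)    # structural term from Eq. (4)
    return l_ssim + alpha * l_mse
```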

Emissions

The models can be configured to emit at any number of emission steps, each emission refining the output while decreasing the compression efficiency. For the ConvGRU model and the MotionNet model, two training strategies are used: in the first, only the last emission is trained; in the second, a random emission is chosen in each epoch and trained. For the Flow-MotionNet model, only the randomized strategy is used.
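
A sketch of the randomized emission strategy is shown below; the `num_emissions` keyword on the model is an assumed interface, not the actual code.

```python
import random

def train_epoch_random_emission(model, loader, optimizer, loss_fn, max_emissions=10):
    """Randomized emission-step strategy (sketch): for each batch, pick a
    random number of emission steps and optimize the reconstruction at
    that depth."""
    model.train()
    for frames in loader:
        steps = random.randint(1, max_emissions)
        reconstruction = model(frames, num_emissions=steps)
        loss = loss_fn(reconstruction, frames)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```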

4 Experiments

4.1 Experimental setup

Dataset

A dataset comprising 571 short (20 s) videos out of the 826 videos in the YouTube UGC dataset has been used to train the network; the remaining clips were used for testing and validation. Videos of varying quality (360p, 480p and 720p) were chosen. The frame size was set to 64 × 64, so clips of all qualities were first rescaled to this size before training. Frames are sampled randomly during training, whereas for testing the clips are taken from the start. The model was trained with the randomized emission step strategy, with emission steps varying between 1 and 10. Each additional emission step improves the output but reduces the compression efficiency.

Implementation details

For implementation, a single T4, K80 or P100 GPU was used to train the network on the Google Colaboratory platform. Frames were kept at a size of 64 × 64. The Adam optimizer was used with a learning rate of 10e-4, with λ1 set to 1 and λ2 to 10. The frame encoder was first trained for 100 epochs, and then the complete network was trained end-to-end for 70 epochs. During the 100-epoch frame encoder training, the learning rate was divided by ten at the 50th, 70th and 90th epochs. For the end-to-end training of the whole model with the pretrained frame encoder, 70 epochs were used and the learning rate was divided by ten at the 35th and 55th epochs.
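
The reported schedule can be expressed with standard PyTorch optimizers and step schedulers as sketched below; the placeholder modules and the interpretation of the reported learning rate (taken here as 1e-4) are assumptions.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the frame autoencoder and the full model.
frame_autoencoder = nn.Conv2d(3, 3, 3, padding=1)
full_model = nn.Conv2d(3, 3, 3, padding=1)

lr = 1e-4  # reported as "10e-4"; interpreted here as 1e-4 (adjust if 1e-3 was meant)

# Phase 1: pretrain the frame autoencoder for 100 epochs,
# dividing the learning rate by 10 at epochs 50, 70 and 90.
opt1 = torch.optim.Adam(frame_autoencoder.parameters(), lr=lr)
sched1 = torch.optim.lr_scheduler.MultiStepLR(opt1, milestones=[50, 70, 90], gamma=0.1)

# Phase 2: train the complete network end-to-end for 70 epochs,
# dividing the learning rate by 10 at epochs 35 and 55.
opt2 = torch.optim.Adam(full_model.parameters(), lr=lr)
sched2 = torch.optim.lr_scheduler.MultiStepLR(opt2, milestones=[35, 55], gamma=0.1)

for epoch in range(70):
    # ... one end-to-end training epoch over the data loader goes here ...
    sched2.step()  # advance the schedule once per epoch
```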

Evaluation

SSIM (Structural Similarity Index) and PSNR (Peak Signal-to-Noise Ratio) have been used to measure the visual quality of the reconstructed frames. The temporal distortion among frames is evaluated by the flow End Point Error (EPE), and the reconstruction time of individual frames is measured by the Time Per Frame (TPF) parameter.
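
For reference, PSNR and flow EPE can be computed directly from their standard definitions, as in the short sketch below (SSIM is typically taken from an existing library implementation).

```python
import numpy as np

def psnr(reference, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two frames."""
    diff = reference.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

def flow_epe(flow_gt, flow_pred):
    """Average End Point Error between two dense flow fields of shape (H, W, 2)."""
    return float(np.mean(np.linalg.norm(flow_gt - flow_pred, axis=-1)))
```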

4.2 Experimental results and analysis

4.2.1 Experimental results

The proposed model results in good quality compression, with an SSIM score of 0.963 on the test dataset. The values of the different performance parameters at each emission step are given in Table 3, and the quality of compression achieved with different emission steps is compared in Fig. 6.

Table 3 Performance values of proposed model
Fig. 6 Compressed frame quality at emission steps of 2, 7 and 10

Figure 7 shows the video quality of compressed frames for four video sequences, where the first row represents the uncompressed sequence and the second the compressed sequence.

Fig. 7 Compressed frame quality compared for video sequences

4.2.2 Comparison with state-of-art methods

The performance of the proposed architecture has been compared with state-of-the-art conventional compression techniques such as H.264 and H.265, as well as with the deep learning based models DVC [20] and adversarial video compression [14]. Table 4 presents the comparative results of the proposed model against these prominent models, and the corresponding graphical comparison of MS-SSIM and PSNR is given in Fig. 8(a) and (b). The UVC dataset was used to measure the SSIM and PSNR metrics; MS-SSIM correlates better with human perception of distortion. The proposed model outperforms the others in terms of MS-SSIM, achieving good structural similarity but with a drop in PSNR value.

Table 4 MS-SSIM and PSNR values of various architectures
Fig. 8 (a & b) Graphical Comparison of MS-SSIM and PSNR values of various architectures

4.3 Ablation study

The proposed Flow-MotionNet architecture comprises three network blocks: a ConvGRU based frame compression/decompression network, a motion extension network and a flow vector estimation block. The ConvGRU based compression/decompression network uses GRU based blocks to improve compressed image quality over multiple iterations. The motion extension network uses a previously decoded frame to help encode/decode the next frame. The flow vector estimation block calculates the optical flow of the frames and helps the motion extension network maintain this optical flow.

To train the proposed network, two training strategies have been used. The first is a fixed emission step strategy in which the emission step is fixed at 10. In the second, the emission step is chosen randomly between 1 and 10.

To study the incremental effect of each network block, a baseline network consisting of a pure CNN has been taken. Next, GRU blocks were added to the network to study their effect; this network is referred to as the ConvGRU network in the comparison below. Then the Motion Extension Network block was added, resulting in the configuration referred to as MotionNet. Finally, flow vector estimation was added to MotionNet to give the proposed configuration. These configurations were then evaluated with the randomized emission step training strategy to see the effect of the training strategy on each part of the network.

4.3.1 Comparison tables

The qualitative and quantitative performance of the proposed model has been measured by four parameters: SSIM (Structural Similarity Index Measure), EPE (End Point Error), PSNR (Peak Signal-to-Noise Ratio) and TPF (Time Per Frame). As the network is trained with up to ten emission steps, the tables below present the performance of the network with each additional emission step. Perceptual quality parameters, i.e. SSIM and PSNR values, are given in Tables 5 and 6 respectively; flow EPE and TPF values are presented in Tables 7 and 8. Table 9 depicts the average performance of the network over ten emission steps.

Table 5 SSIM values per emission
Table 6 PSNR values per emission
Table 7 Flow EPE values per emission
Table 8 TPF values per emission
Table 9 Comparative average performance of different incremental modules of the proposed network

4.3.2 Comparison charts

The figures below present graphical representations of the performance parameters SSIM, PSNR, EPE and TPF over the 10 emission steps. For adaptive bit-rate video compression, the randomized emission step training strategy surpasses the fixed emission step strategy. The relative analysis of the graphs shows the growth in SSIM and PSNR values with each additional emission step. The proposed model is designed incrementally: first, the addition of the motion extension network improves the performance of the simple frame autoencoder; then, the flow autoencoder is added to the MotionNet architecture for motion estimation and prediction, which outperforms the plain MotionNet architecture in terms of visual quality. The same can be inferred from the relative graphs of the respective models. However, the addition of modules also increases the computation of the network, leading to increased frame reconstruction time. Figures 9 to 12 present the graphical and relative performance comparison of the different models at emission steps ranging from 1 to 10.

Fig. 9 Variation of MS-SSIM of different incremental modules of the proposed network

Fig. 10 Variation of PSNR of different incremental modules of the proposed network

Fig. 11 Variation of Flow End Point Error of different incremental modules of the proposed network

Fig. 12 Time Per Frame of different incremental modules of the proposed network

4.3.3 Average comparison chart

The charts in Figs. 13 and 14 present histograms of the average values of the various performance parameters obtained for the various compression models, for better relative performance analysis. The graphs represent the comparison data between the various experiments. The sample frames produced by the models are given in the next subsection. In each sample, the top-left frame is input frame 1, the bottom-left frame is input frame 10, the top-right frame is output frame 1 and the bottom-right frame is output frame 10.

It can be inferred from the graphs that the proposed architecture achieves an increased SSIM value, which eventually leads to an improvement in the visual quality of the video frames. However, the addition of the optical flow and motion extension networks increases the frame reconstruction time, so the TPF value of the proposed architecture is higher than that of the others. Further exploration and improvement may lead to more efficient outcomes.

4.3.4 Sample frames

Samples of frames reconstructed after compression and decompression by the different models at different emission steps are presented in this section. In each image matrix, the left column contains two original images and the right column contains the corresponding decompressed output images. Figure 15 presents the relative visual quality of some sample frames at different emission steps.

Fig. 13 (a & b) Relative comparative analysis of MS-SSIM and PSNR of different incremental modules of the proposed network

Fig. 14 (a & b) Relative comparative analysis of Flow-EPE and TPF of different incremental modules of the proposed network

Fig. 15 Sample frames at different emission steps

4.3.5 Comparison inference

Incorporation of the previously decoded frame: The flow representation capability of the model improves significantly when the previous output frame is added as an input while inferring the current output frame from the decoded frame. The tables above corroborate this claim, and there is also a clear visual improvement in the sample frames.

Incorporation of flow vectors: Incorporating flow vectors during frame decoding, along with the previous frame and the currently decoded frame, gives a significant boost over the plain ConvGRU model but only marginal improvements over incorporating the previous frame alone. However, since the decode time per frame is not significantly increased by incorporating flow vectors, the improvement is welcome, and the visual difference between the results is pronounced.

5 Conclusion

The proposed model presents a lightweight, learning-based adaptive video compression approach that allows the quality of the reconstructed video to vary with the amount of data sent, without requiring separate low-resolution versions of the same video. The ConvGRU units in the encoder and decoder networks and the flow vector driven frame reconstruction significantly improve the performance of the network. The whole network has been designed and evaluated incrementally. In comparison with some state-of-the-art techniques, a considerable improvement in visual quality and efficiency is observed when the network is trained with the random emission step training strategy, at the cost of a slight increase in frame reconstruction time. The architecture can be further enhanced and optimized by plugging in additional modules such as entropy coding.