1 Introduction

Images and videos inevitably suffer distortion during acquisition, synthesis, compression, transmission, and other processing. To address this issue, effective image quality assessment (IQA) has long been an active research field [1,2,3]. In the real world, an ideal reference image is hard to obtain, so no-reference IQA methods are more practical and have wider application prospects. As more and more image processing operations are specifically designed for stereoscopic images, no-reference (NR) stereoscopic IQA (SIQA) becomes increasingly important.

Compared with 2D IQA, SIQA is much more sophisticated because it is affected by many factors, such as 2D image quality, depth perception, and visual comfort [2]. It is particularly challenging when the stereoscopic image pair consists of two views with different quality levels. Therefore, understanding binocular vision perception is a very important issue in 3D IQA.

During the past few decades, research on binocular vision perception for 3D IQA has made significant progress. Häkkinen et al. [4] studied the difference in eye movement patterns between 2D and 3D versions of the same video content. They found that eye movements were more widely distributed for 3D content. Their results show that depth information from binocular depth cues provides viewers with additional information, creating new salient regions in the scene. In building a stereoscopic saliency model, depth (binocular disparity or disparity contrast) is considered an attentional cue similar to other features such as color and orientation [5, 6]. Jansen et al. [5] suggested that when building a 3D saliency model, in addition to the disparity factor, it is also necessary to include monocular image features such as luminance and texture contrast. The authors of [6] studied color characteristics and proposed an NR-SIQA method using a gradient dictionary. Maki et al. [7] suggested that a target closer in depth has higher priority in a 3D saliency model. Wang et al. [8] presented a stereo saliency model integrating edges, disparity boundaries, and a saliency bias computed from stereoscopic perception. Fang et al. [9] proposed to incorporate color, luminance, texture, and depth cues to generate the saliency map for 3D images. Nevertheless, depth maps estimated from the spatial variance [10] of a color image are not always accurate. Li et al. [11] built a saliency dictionary by selecting a group of potential foreground objects, pruning the outliers in the dictionary, and running iterative tests on the remaining superpixels to refine the dictionary; for stereoscopic saliency detection, disparity was appended directly to the feature vector. Liu et al. [12] used binocular combination behaviors and human 3D visual saliency characteristics to evaluate the quality of stereoscopic images. Li et al. [13] designed an NR-SIQA method based on visual attention and perception, which uses saliency and the just noticeable difference (JND) to weight the global and local features extracted from the left and right images, while the global structural features are extracted from the cyclopean map.

Since the human eye is the ultimate receiver of stereoscopic images, HVS characteristics, such as the binocular rivalry and suppression mechanisms and depth perception, need to be considered when designing SIQA methods. Some researchers combined depth (or disparity) information with 2D image information to analyze the quality of stereoscopic images. Benoit et al. [14] proposed a stereo image quality assessment method that applies 2D full-reference quality assessment algorithms such as C4 and SSIM to the stereo views and to the disparity map computed between the distorted and reference images. You et al. [15] applied various methods to calculate disparity maps and used a variety of 2D IQA methods to evaluate the image pairs and disparity maps of stereoscopic images; they showed that disparity is an important factor in stereoscopic vision. Chen et al. [16] extracted 2D natural scene statistics features from the cyclopean image of stereo pairs and stereoscopic features from disparity maps and uncertainty maps, and then learned to predict quality scores. Akhter et al. [17] extracted features from stereoscopic image pairs and disparity maps and used logistic regression models to predict quality scores.

In recent years, deep learning has shown great success in object detection, recognition, and semantic segmentation [18,19,20,21,22], and some researchers have also begun to use deep learning for NR IQA. The authors of [23] combined feature extraction and learning, used both max pooling and mean pooling to reduce the dimensionality of the feature maps, and employed a shallow CNN to assess image quality, achieving relatively good performance. Bosse et al. [24] designed a deep CNN for 2D IQA and achieved relatively high results. Zhang et al. [25] designed a CNN model for SIQA that takes the left view, right view, and disparity maps normalized by luminance and contrast as input, which also improved the results. Zhou et al. [26] devised a deep-fused CNN model for NR-SIQA, which consists of monocular feature encoding networks and a binocular feature fusion network, followed by a quality prediction layer. These studies show the tremendous potential of CNNs for improving SIQA performance.

Based on the above analysis, we exploit visual saliency and a CNN to design an NR SIQA method. A depth-perceived 3D visual saliency model is proposed by fusing a 2D saliency map and a depth saliency map, which more realistically reflects how salient regions are extracted by the human visual system. Both the stereoscopic image pair and the 3D visual saliency map are fed into a CNN to train and learn stereoscopic image quality scores. The main contributions are as follows:

  1. We propose a new NR-SIQA model based on a three-channel CNN, which utilizes both the stereoscopic image pair and the 3D visual saliency map.

  2. A depth-perceived 3D visual saliency map is constructed by weighting the 2D saliency map and the depth saliency map.

The rest of this paper is organized as follows. The proposed NR SIQA model is elaborated in Sect. 2. Experimental results and analyses are presented in Sect. 3. Conclusions are provided in Sect. 4.

2 Proposed NR SIQA method

The framework of the proposed NR SIQA method is shown in Fig. 1, which includes four steps. Firstly, a 2D saliency map is calculated by the SDSP algorithm [27]. Secondly, a depth saliency map is calculated by the optical flow algorithm [31]. Thirdly, a 3D visual saliency map is obtained by weighting the 2D saliency map and the depth saliency map. Finally, the 3D saliency map, the left view, and the right view are fed into a three-channel CNN to learn an image quality prediction model.

Fig. 1 The framework of the proposed NR SIQA method

2.1 2D saliency map by SDSP algorithm

The authors of [27] proposed the SDSP algorithm, which computes the salient regions of an image using three kinds of prior knowledge. In the SDSP model, a band-pass filtering saliency map SF(x), a location saliency map SD(x), and a color saliency map SC(x) are calculated from these three priors, as described below.

2.1.1 Bandpass filtering

The behavior of the HVS in detecting salient objects in a visual scene can be well modeled by band-pass filtering [28].

A log-Gabor filter is used as the band-pass filter. Its transfer function in the frequency domain is given by:

$$ G(u) = \exp \left( {\frac{{ - \left( {\log \frac{{\left\| u \right\|_{2} }}{{\omega_{0} }}} \right)^{2} }}{{2\sigma_{F}^{2} }}} \right) $$
(1)

where \(u = (u,v) \in R^{2}\) denotes the coordinates in the frequency domain, \(\omega_{0}\) is the center frequency of the filter, and \(\sigma_{F}\) controls the bandwidth of the filter. The spatial-domain kernel g is obtained from G(u) by the inverse Fourier transform. For an RGB image f(x), it is first converted to CIE L*a*b* space to obtain three channels, namely \(f_{L} (x)\), \(f_{a} (x)\), and \(f_{b} (x)\). The band-pass filtering saliency map \(S_{F} (x)\) is defined as Eq. (2).

$$ S_{F} (x) = ((f_{L} *g)^{2} + (f_{a} *g)^{2} + (f_{b} *g)^{2} )(x) $$
(2)

where ‘*’ represents the convolution operation.
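
For illustration, the following is a minimal NumPy sketch of the band-pass saliency computation in Eqs. (1) and (2); it assumes scikit-image for the CIE L*a*b* conversion, and the parameter values (ω0, σF) are illustrative rather than the settings of the original SDSP implementation [27].

```python
import numpy as np
from skimage.color import rgb2lab

def log_gabor_response(shape, omega0=0.02, sigma_f=1.34):
    """Frequency response G(u) of Eq. (1) on an image-sized grid."""
    rows, cols = shape
    u = np.fft.fftfreq(cols)            # normalized horizontal frequencies
    v = np.fft.fftfreq(rows)            # normalized vertical frequencies
    U, V = np.meshgrid(u, v)
    radius = np.sqrt(U ** 2 + V ** 2)
    radius[0, 0] = 1.0                  # avoid log(0) at the DC component
    G = np.exp(-(np.log(radius / omega0)) ** 2 / (2 * sigma_f ** 2))
    G[0, 0] = 0.0                       # suppress the DC component
    return G

def bandpass_saliency(rgb):
    """S_F(x) of Eq. (2): band-pass energy over the L*, a*, b* channels.

    rgb: float array in [0, 1] of shape (rows, cols, 3).
    """
    lab = rgb2lab(rgb)
    G = log_gabor_response(lab.shape[:2])
    s_f = np.zeros(lab.shape[:2])
    for ch in range(3):
        # Convolution with g is applied as multiplication in the frequency domain.
        filtered = np.real(np.fft.ifft2(np.fft.fft2(lab[..., ch]) * G))
        s_f += filtered ** 2
    return s_f
```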

2.1.2 Position significance

Objects near the center of an image are more attractive to observers [29]. This means that positions near the center of the image are more likely to be salient than positions far from the center. Let c denote the center of the image.

The “Position Significance” at pixel x can be represented by Eq. (3):

$$ S_{D} (x) = \exp \left( { - \frac{{\left\| {x - c} \right\|_{2}^{2} }}{{\sigma_{D}^{2} }}} \right) $$
(3)

2.1.3 Color saliency map

Compared with cool colors, the human eye is more likely to be attracted by warm colors [30]. To distinguish cool and warm colors in the image, we convert it to CIE L*a*b* space. The a* channel represents green–red color information and the b* channel represents blue–yellow color information; the larger the values at a pixel, the closer its color is to red and yellow. Based on this, the color saliency map \(S_{C} (x)\) is defined by Eq. (4).

$$ S_{C} (x) = 1 - \exp \left( { - \frac{{f_{an}^{2} (x) + f_{bn}^{2} (x)}}{{\sigma_{c}^{2} }}} \right) $$
(4)

where \(\sigma_{c}\) is a parameter, \(f_{an} (x)\) is the min–max normalized a* channel, and \(f_{bn} (x)\) is the min–max normalized b* channel.

Finally, the overall SDSP saliency map is computed as follows:

$$ SDSP(x) = S_{C} (x) \times S_{D} (x) \times S_{F} (x) $$
(5)
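
The three priors are then combined as in Eq. (5). The following sketch, which reuses bandpass_saliency from the previous snippet, is a minimal illustration of Eqs. (3)–(5); the values of σD and σC are assumptions, not the settings of [27].

```python
def sdsp_saliency(rgb, sigma_d=114.0, sigma_c=0.25):
    """SDSP(x) of Eq. (5): product of color, location, and band-pass saliency."""
    rows, cols = rgb.shape[:2]

    # Location prior S_D(x), Eq. (3): Gaussian fall-off from the image center c.
    y, x = np.mgrid[0:rows, 0:cols]
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    s_d = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / sigma_d ** 2)

    # Color prior S_C(x), Eq. (4): warm colors (large a*, b*) attract more attention.
    lab = rgb2lab(rgb)
    def minmax(channel):
        return (channel - channel.min()) / (channel.max() - channel.min() + 1e-12)
    f_an, f_bn = minmax(lab[..., 1]), minmax(lab[..., 2])
    s_c = 1.0 - np.exp(-(f_an ** 2 + f_bn ** 2) / sigma_c ** 2)

    # Band-pass prior S_F(x), Eq. (2), from the previous sketch.
    s_f = bandpass_saliency(rgb)

    return s_c * s_d * s_f
```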

Figure 2 shows the calculation procedure of the SDSP saliency map.

Fig. 2 The flowchart of calculating the SDSP saliency map

2.2 Depth saliency map

In general, when the target object and the background have different depth values, the target object attracts more human visual attention. In this paper, the optical flow algorithm [31] is used to calculate the disparity map of a stereoscopic image. The optical flow estimation-based algorithm is built on the classical optical flow objective function given in Eq. (6).

$$ E(u,v) = \sum\limits_{i,j} \left\{ {\rho_{D} \left( {I_{L} (i,j) - I_{R} (i + u_{i,j} ,j + v_{i,j} )} \right) + \lambda \left[ {\rho_{s} (u_{i,j} - u_{i + 1,j} ) + \rho_{s} (u_{i,j} - u_{i,j + 1} ) + \rho_{s} (v_{i,j} - v_{i + 1,j} ) + \rho_{s} (v_{i,j} - v_{i,j + 1} )} \right]} \right\} $$
(6)

where u and v are the horizontal and vertical components of the optical flow field between the left and right images, λ is a regularization parameter, ρD is a data penalty function, and ρs is a spatial penalty function. Since horizontal disparities on the retina are related to depth perception, only the horizontal component u of the computed motion vectors is taken as the horizontal disparity. The depth feature map is then obtained directly from the disparity map by Eq. (7).

$$ f_{D} (x,y) = 1 - \frac{{D(x,y) - D_{\min } }}{{D_{\max } - D_{\min } }} $$
(7)

where \(D_{\min }\) and \(D_{\max }\) are the minimum and maximum values in the disparity map, and \(D(x,y)\) is the value at location \((x,y)\) in the disparity map, with \(D_{\min } \le D(x,y) \le D_{\max }\). Objects closer to the camera have smaller depth values because they produce larger disparities, and distant objects have larger depth values because they produce smaller disparities.
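
As an illustration of this step, the sketch below uses OpenCV's Farneback dense optical flow as a stand-in for the optical flow method of [31], keeps only the horizontal component as the disparity, and applies the normalization of Eq. (7); the flow parameters are illustrative.

```python
import cv2
import numpy as np

def depth_feature_map(left_gray, right_gray):
    """Depth feature map of Eq. (7) from the horizontal optical flow component.

    left_gray, right_gray: single-channel uint8 views of the stereo pair.
    """
    # Dense optical flow between the views; Farneback is a stand-in for [31].
    flow = cv2.calcOpticalFlowFarneback(left_gray, right_gray, None,
                                        pyr_scale=0.5, levels=5, winsize=15,
                                        iterations=5, poly_n=5, poly_sigma=1.1,
                                        flags=0)
    disparity = flow[..., 0]            # keep only the horizontal component u
    d_min, d_max = disparity.min(), disparity.max()
    # Closer objects (larger disparity) receive smaller depth values.
    return 1.0 - (disparity - d_min) / (d_max - d_min + 1e-12)
```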

2.3 3D visual saliency map

Here, we propose a fused 3D visual saliency map that combines the 2D saliency map with the depth saliency map. A linear weighting is used to combine the two maps, as given in Eq. (8).

$$ SVMap(x,y) = w_{1} \times VS_{2D} (x,y) + w_{2} \times f_{D} (x,y) $$
(8)

where w1 and w2 are weighting parameters, both simply set to 0.5 in this paper.
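
A minimal sketch of Eq. (8), with w1 = w2 = 0.5 as in this paper; it assumes the 2D saliency map and the depth saliency map have already been rescaled to a common range.

```python
def fuse_saliency(vs_2d, depth_saliency, w1=0.5, w2=0.5):
    """3D visual saliency map of Eq. (8): linear fusion of 2D and depth saliency."""
    return w1 * vs_2d + w2 * depth_saliency
```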

Figure 3 shows the left view in a stereo image pair and its corresponding 2D saliency map, depth saliency map, and 3D saliency map.

Fig. 3 Left and right images and their saliency maps

2.4 Structure of designed CNN for NR SIQA

In recent years, convolutional neural networks have achieved great success in image feature representation and classification, speech recognition, and other fields. In this paper, a three-channel CNN is constructed for NR SIQA, as shown in Fig. 4. The network parameters are configured as listed in Table 1.

Fig. 4 Structure of the proposed three-channel CNN for SIQA

Table 1 Parameters of CNN

The inputs of the network are the left-view and right-view images of a stereoscopic pair and their 3D saliency map. Since a neural network usually requires a fixed input size, but the image sizes in the databases vary, we crop the original color images into 32 × 32 patches as input. Because the distortions in the LIVE 3D image databases are spatially uniform, each input patch is assigned the same quality score as the original image. The final predicted quality score of an image is the average of the quality scores of all of its patches.
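
A minimal sketch of this patch sampling, assuming the left view, right view, and 3D saliency map are cropped at the same random locations; the number of patches per image is an assumption, since it is not specified here.

```python
import numpy as np

def sample_patches(left, right, saliency, patch_size=32, n_patches=256, rng=None):
    """Randomly crop co-located patches from the left view, right view, and saliency map.

    Each patch inherits the quality score of the whole image.
    """
    rng = rng or np.random.default_rng()
    h, w = left.shape[:2]
    crops = {"left": [], "right": [], "sal": []}
    for _ in range(n_patches):
        top = int(rng.integers(0, h - patch_size + 1))
        lft = int(rng.integers(0, w - patch_size + 1))
        crops["left"].append(left[top:top + patch_size, lft:lft + patch_size])
        crops["right"].append(right[top:top + patch_size, lft:lft + patch_size])
        crops["sal"].append(saliency[top:top + patch_size, lft:lft + patch_size])
    return {k: np.stack(v) for k, v in crops.items()}
```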

In all convolutional layers, the convolution kernels are 3 × 3. Except for the last fully connected layer, all layers use the ReLU [32] activation function. The convolutions are applied with zero-padding, so their outputs have the same spatial dimensions as their inputs. All max-pooling windows are 2 × 2. We apply dropout regularization [33] with a ratio of 0.35 to the fully connected layers. Given an image represented by \(N_{P}\) randomly sampled patches and a ground truth quality label \(q_{t}\), the quality prediction \(q\) is calculated by averaging the CNN outputs \(y_{i}\) over the patches:

$$ q = \frac{1}{{N_{P} }}\sum\limits_{i = 1}^{{N_{P} }} {y_{i} } $$
(9)

The objective function of the model is defined as follows:

$$ E_{{{\text{patchwise}}}} = \frac{1}{{N_{P} }}\sum\limits_{i = 1}^{{N_{P} }} {\left\| {y_{i} - q_{t} } \right\|_{2}^{2} } $$
(10)

Note that the optimization is performed with the adaptive learning rate optimizer Adam [34] at a learning rate of α = 0.0001.
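
To make the above description concrete, the following PyTorch sketch shows one possible realization of the three-channel network and its patchwise training: three branches with 3 × 3 zero-padded convolutions, ReLU activations, and 2 × 2 max pooling, concatenated features, fully connected layers with dropout 0.35, a patchwise MSE objective, and Adam with a learning rate of 10^-4. The number of layers and channel widths are illustrative and do not reproduce Table 1; batching and data loading are omitted.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One input branch: 3 x 3 zero-padded convolutions, ReLU, and 2 x 2 max pooling."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                          # 32 x 32 -> 16 x 16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                          # 16 x 16 -> 8 x 8
        )

    def forward(self, x):
        return self.features(x).flatten(1)

class ThreeChannelSIQA(nn.Module):
    """Three branches (left view, right view, 3D saliency map) fused for quality regression."""
    def __init__(self):
        super().__init__()
        self.left, self.right = Branch(3), Branch(3)
        self.saliency = Branch(1)                     # the saliency map is single-channel
        feat_dim = 64 * 8 * 8                         # per-branch feature length
        self.regressor = nn.Sequential(
            nn.Linear(3 * feat_dim, 512), nn.ReLU(inplace=True),
            nn.Dropout(0.35),
            nn.Linear(512, 1),                        # last fully connected layer has no ReLU
        )

    def forward(self, left, right, sal):
        feats = torch.cat([self.left(left), self.right(right), self.saliency(sal)], dim=1)
        return self.regressor(feats).squeeze(1)

model = ThreeChannelSIQA()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()                              # patchwise squared-error objective

def train_step(left, right, sal, q_t):
    """One Adam step on a batch of patches whose ground-truth score is q_t."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(left, right, sal), q_t)
    loss.backward()
    optimizer.step()
    return loss.item()

def predict_image_score(left_patches, right_patches, sal_patches):
    """Image-level quality score: mean of the per-patch predictions."""
    model.eval()
    with torch.no_grad():
        return model(left_patches, right_patches, sal_patches).mean().item()
```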

3 Experimental results and analysis

3.1 Databases and evaluation metrics

The LIVE 3D Phase I [2] database includes 20 reference images and a total of 365 distorted images covering five distortion types: 45 groups of Gaussian blur (Blur) distortion and 80 groups each of JPEG2000 compression (JP2K), JPEG compression (JPEG), white noise (WN), and fast fading (FF) distortion, together with the DMOS value of each distorted stereo image. The LIVE 3D Phase II [16] database consists of eight reference stereo image pairs and 360 distorted stereo image pairs. The distortion types are the same as in LIVE 3D Phase I.

Two commonly used indicators are employed to evaluate performance: the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC). These coefficients measure the correlation between the objective scores predicted by the stereoscopic IQA models and the subjective mean opinion scores (MOS) provided by the database. The PLCC evaluates prediction accuracy, and the SROCC evaluates prediction monotonicity. A better model is expected to have higher PLCC and SROCC.
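
Both coefficients can be computed with SciPy; a minimal sketch:

```python
from scipy.stats import pearsonr, spearmanr

def evaluate(predicted_scores, subjective_scores):
    """PLCC (prediction accuracy) and SROCC (prediction monotonicity)."""
    plcc, _ = pearsonr(predicted_scores, subjective_scores)
    srocc, _ = spearmanr(predicted_scores, subjective_scores)
    return plcc, srocc
```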

In our experiments, 80% of the distorted images are randomly selected as the training set and the remaining 20% as the testing set. The results are obtained after 100 epochs.

3.2 Performance comparison

Tables 2, 3, 4, and 5 report the overall SROCC and PLCC on the LIVE 3D Phase I and Phase II databases, comparing the proposed method with representative NR SIQA methods reported in the literature, including Akhter [17], Chen [16], Zhang [25], Liu [12], Shao [36], Oh [37], Wang [38], Li [13], Liu [39], and Yang [40]. "All" in Tables 2, 3, 4, and 5 indicates that the test set contains all five distortion types.

Table 2 SROCC comparison of several metrics on the LIVE 3D Phase I database
Table 3 PLCC comparison of several metrics on the LIVE 3D Phase I database
Table 4 SROCC comparison of several metrics on the LIVE 3D Phase II database
Table 5 PLCC comparison of several metrics on the LIVE 3D Phase II database

From Tables 2 and 3, it can be seen that the proposed method achieves the best performance among the compared methods on the individual JP2K and WN distortion types and on all distortions, and is only slightly inferior to Oh [37] on the FF distortion type.

From Tables 4 and 5 on the LIVE 3D Phase II database, the proposed NR SIQA method shows the best performance on the individual JP2K, WN, and FF distortion types and on all distortions, although its SROCC is inferior to Liu [39] on the JPEG and Blur distortion types. Overall, these results indicate that the proposed method is reliable.

Figure 5 shows scatter plots of the DMOS predicted by the proposed NR SIQA method against the subjective DMOS on the LIVE 3D Phase I and Phase II databases, which further suggests that the proposed algorithm is linearly consistent with subjective perception.

Fig. 5 Scatter plots between predicted DMOS and DMOS

3.3 Effect of 2D, depth, and 3D saliency map

To illustrate the importance of the proposed 3D saliency map, we feed the 2D saliency map and the depth saliency map into the CNN model instead of the 3D saliency map for training and learning; the test results are listed in Tables 6 and 7. "All" in Tables 6 and 7 indicates that the test set contains all five distortion types.

Table 6 Performance comparisons of different features on LIVE 3D Phase I database
Table 7 Performance comparisons of different features on LIVE 3D Phase II database

It can be seen from Table 6 that using the 3D saliency map gives good performance for the individual JP2K, JPEG, WN, and FF distortion types and for all distortions on Phase I, while using the 2D saliency map only performs best on Blur. In addition, Table 7 shows that using the 3D saliency map achieves satisfactory performance for the individual JP2K, WN, Blur, and FF distortion types and for all distortions. Overall, the proposed 3D saliency map has clear advantages and is effective for stereoscopic image quality evaluation.

3.4 Influence of 3D saliency map and image pairs

To further analyze the roles of the stereoscopic image pair and the 3D saliency map in the model, ablation experiments are carried out. Table 8 gives the results of inputting only the 3D saliency map, only the left and right images, and the combination of the 3D saliency map with the left and right images. It can be seen that the proposed combination achieves the best performance, which shows that the features extracted from the left and right images and from the 3D saliency map are complementary.

Table 8 Performance comparison of different features

3.5 Cross-database validation

Furthermore, we conducted cross-database validation to test the stability of the proposed model. The results are shown in Table 9, where LIVE I/LIVE II indicates that the model is trained on LIVE 3D Phase I and tested on Phase II; similarly, LIVE II/LIVE I indicates that the model is trained on LIVE 3D Phase II and tested on Phase I. The best results are highlighted in bold. It can be seen that the proposed model shows competitive performance, which suggests that it is reliable for measuring stereoscopic image quality, insensitive to image content, and has good universality and stability.

Table 9 Performances of cross-database validations

4 Conclusion

In this paper, we have proposed an NR SIQA algorithm based on a 3D visual saliency map and a deep convolutional neural network (DCNN). The 3D visual saliency map is generated by combining the depth saliency map of a stereoscopic image with its 2D saliency information. Our experiments suggest that the proposed 3D visual saliency map performs better than using the 2D saliency map alone. Finally, the 3D saliency map and the left and right views are fed into the designed DCNN to predict the stereoscopic image quality score. Experimental results on the LIVE 3D Phase I and Phase II databases show that the proposed method achieves better performance than other NR-SIQA methods.