1 Introduction

In recent decades, with the rapid development of multimedia and network technology, the number of digital images has grown explosively, and they play an increasingly important role in people's daily life. However, current technology has many limitations, which lead to various distortions during the collection, transmission, processing and display of images. Image quality assessment (IQA) was developed to evaluate and monitor image quality, and shows great potential for controlling and improving the performance of image processing systems such as compression [41], enhancement [15] and segmentation [16]. Recently, stereoscopic/3D multimedia applications have become increasingly popular and greatly enhance the immersive experience, so stereoscopic image quality assessment (SIQA) has become a research hotspot.

Generally, quality assessment methods can be divided into subjective and objective methods. Subjective methods are effective and reliable, but they are time-consuming, laborious and cannot be performed in real time. In contrast, objective methods are more in line with practical needs, so they have been widely investigated. Broadly speaking, according to the availability of reference information, objective methods can be further divided into full reference (FR), reduced reference (RR) and no reference (NR). Many 2D IQA methods [35, 37, 47] have already achieved quite competitive results. Stereoscopic image quality assessment is more challenging, since it needs to consider a variety of factors such as binocular fusion and binocular rivalry. The distortion of stereoscopic image pairs can be divided into symmetric and asymmetric distortion; in asymmetric distortion, the two views may suffer different degrees or even different types of distortion, which also has a great impact on the quality of the stereoscopic image. In addition, unlike 2D IQA, SIQA must account for depth perception and the binocular vision mechanism between the left and right visual fields, which makes current research difficult.

Stereo visualization is used in a growing number of application fields, such as distance education, medical treatment and robot navigation. It is therefore reasonable to believe that the amount of 3D content will continue to grow in the next few years. SIQA is a key technology in stereoscopic image and video processing; by monitoring the quality of stereoscopic images, it can help image retrieval systems filter out low-quality images and thus produce a better subjective experience. In addition, SIQA will also promote research in other fields, such as stereoscopic video quality assessment [7, 13, 40] and stereo matching [9].

Yang et al. [42] used a 3D-CNN model to capture spatial-temporal features and evaluate the quality of stereoscopic video. Inspired by this, we attempt to address the difficulty of SIQA with 3D-CNN, which has already been applied to many research topics. For example, Zhang et al. [51] proposed a 3D-CNN structure for mental workload assessment, learning EEG features from the spatial, spectral and temporal dimensions. Considering the temporal continuity of video, Yang et al. [44] exploited the spatial-spectral correlation of hyperspectral images to design a multiscale wavelet 3D-CNN method for hyperspectral image super-resolution. As these works from different fields demonstrate, 3D-CNN can discover the connections among different features and is a very effective solution to such problems. In this work, we extend 2D-CNN to 3D-CNN for SIQA, letting the network learn binocular fusion rather than designing it explicitly. In addition, we consider the "binocular summation/difference theory" [26], which converts the information obtained by the left and right eyes into uncorrelated sum and difference signals before transmitting them forward, so that the 3D-CNN can obtain multi-dimensional information.

In addition to the fusion mechanism of the brain, visual saliency is also important for image processing. Research in visual psychophysics has found that when the human eye looks at an image, it unconsciously focuses on certain areas and gives priority to the information in these areas [36], which are called salient areas. In reference [11], a saliency-based DCNN (SDCNN) framework for NR-IQA is proposed. Inspired by this, we use the saliency map to modify the monocular images, highlighting regions of interest and weakening insensitive parts, and propose a blind SIQA method via 3D-saliency-selected binocular perception and 3D-CNN.

Here we use a 3D-CNN model to automatically simulate human vision and build the relationship between subjective perception and the predicted quality scores of stereoscopic images. Our main contributions are as follows:

(1) We propose a 3D-CNN based NR-SIQA method. To the best of our knowledge, we are among the first to use 3D-CNN to evaluate the quality of stereoscopic images.

(2) We propose a method to select salient image patches from the 3D saliency map. By weighting the 2D saliency map and the depth saliency map, a 3D saliency map incorporating depth perception is obtained. Only when the saliency value of a 3D saliency map patch is greater than a set threshold are the corresponding patches selected from the left and right images for predicting image quality.

(3) For the 3D-CNN model, we add simple summation and difference images to supplement the left and right images as input, providing more diverse and effective information for predicting image quality.

The rest of this paper is organized as follows. Section 2 describes related work on stereoscopic image quality assessment. In Section 3, the proposed model is introduced in detail. The experimental results and analysis on multiple databases are presented in Section 4. Finally, we conclude this paper in Section 5.

2 Related works

Generally, according to whether reference image information is used, objective methods can be categorized into full reference (FR), reduced reference (RR) and no reference (NR, also called blind) [22]. In SIQA, if the original reference image is available, the quality of the distorted stereoscopic image pair can be obtained more accurately by comparing the local similarity between the two groups of images; such methods are called full-reference (FR) SIQA [4, 5, 12, 39]. Shao et al. [31] simulated simple and complex visual cells to obtain a feature encoding approach and defined a similarity measure between the original and distorted images. Li et al. [20] proposed an FR SIQA method based on ensemble learning and an adaptive cyclopean image, which was modified by a saliency map.

In contrast, no-reference (NR) SIQA methods do not need the reference image, which is more in line with practical needs and more promising in real applications, but also more challenging [3, 32, 50]. Akhter et al. [2] proposed an NR SIQA method that combined manually designed segmented local features with disparity information estimated from the stereoscopic image pair. Subsequently, by exploring the interaction of the two views in the HVS, much research began to focus on the binocular behavior of simple and complex cells in the human brain to generate "cyclopean" images from the two views [5]. The algorithm proposed by Chen et al. [6] used the 2D features of the synthesized cyclopean image and the 3D features of the corresponding depth map to predict the perceived quality of stereoscopic image pairs. Zhou et al. [52] described a blind SIQA method based on binocular combination and an extreme learning machine (ELM). Yang et al. [45] used a depth perception map to quantify the depth features of stereoscopic images and also considered binocular features; in addition, a deep belief network was used to evaluate content quality. Li et al. [21] proposed an NR-SIQA method based on visual attention and perception; the model combined saliency and just noticeable difference (JND), weighted the global and local features extracted from the left and right views, and finally learned a support vector regression (SVR) to evaluate the quality of stereoscopic images. Liu et al. [23] extracted monocular color and luminance features as well as binocular summation/difference features and proposed a blind SIQA model based on SVR.

In recent years, deep learning has been widely used to solve various image processing and computer vision problems [29] and has achieved great success. Convolutional neural networks (CNN) have shown outstanding performance in many applications of computer vision and image processing. Compared with traditional image processing methods, a CNN can automatically learn deep visual features that are closely related to the target by optimizing the network parameters rather than relying on hand-crafted features. The main advantage of CNN is that images can be input directly, combining feature learning with quality regression during training. When CNN is directly used for NR SIQA, a key obstacle arises: the training data are insufficient because of the limited number of subjectively rated images, and existing data augmentation and image preprocessing techniques are not suitable for NR SIQA [17]. To address this, most researchers adopt an image segmentation strategy: cut the image into patches of the same size, input the patches into the CNN model to predict the quality score of each patch, and then obtain the overall perceptual quality according to an established rule. Kang et al. [14] proposed a CNN model for 2D images that takes image patches as input, learns the quality characteristics of each patch and predicts its visual quality; the quality scores of all patches are then averaged to obtain the objective score of the image. Li et al. [19] transferred the structure and weights of a model pre-trained on ImageNet, modified the last several layers to directly output the quality score, and fine-tuned the network to regress the objective image quality. Lv et al. [24] established a deep neural network model to predict the monocular distortion of stereoscopic images and evaluated binocular features considering binocular rivalry; the two kinds of features were then fused to comprehensively evaluate stereoscopic images from various aspects. Sun et al. [34] used CNNs to learn deeper local quality-aware structures and removed features from non-salient patches; the reserved features were then aggregated into a final quality score in an end-to-end manner.

3 Proposed method

Our proposed blind SIQA framework is shown in Fig. 1. Given the left and right images, we first compute the 3D saliency map by combining the 2D saliency map with the depth saliency map. Then salient left and right image patches are selected according to the saliency of the corresponding 3D saliency map patches, and the salient summation and difference patches are generated; after local normalization, these are fed to the 3D-CNN to predict perceived quality. Finally, the score of the distorted stereoscopic image is obtained by averaging the quality scores of the salient image patches. Next, our proposed blind SIQA method is introduced in detail in six parts (Sections 3.1–3.6).

Fig. 1 Structure of our proposed blind SIQA

3.1 3D saliency map

Zhang et al. [49] found that the visual saliency of an image varies with image quality, so we design a 3D saliency map and select image patches with high saliency to train the network, reducing the adverse effect of low-saliency image patches on the result. In [38], several methods were compared, including the no-depth, depth-weighting (DW) and depth-saliency (DS) methods, and the depth-saliency method performed best, indicating the importance of the depth saliency map in modeling 3D visual attention. A reasonable fusion of the depth saliency map and 2D visual feature detection results can better predict the salient regions. We use the depth-saliency model from [38]. As shown in Fig. 2, the cyclopean image and depth saliency map are first computed from the left and right images, then the 2D saliency map is obtained from the cyclopean image, and finally the 3D saliency map is computed by combining the 2D saliency map with the depth saliency map.

Fig. 2 Calculation process of the 3D saliency map

Figure 3 shows the left image of a stereoscopic pair together with its cyclopean image and saliency maps. Compared with the cyclopean image and the 2D saliency map, the 3D saliency map emphasizes the contour and edge information of objects, as well as the regions closer to the observer, which is in line with our daily experience. Therefore, the 3D saliency map not only reflects the saliency of the 2D image, but also emphasizes the depth information of the stereoscopic image. Next, the details of the cyclopean image, 2D saliency map, depth saliency map and saliency map combination are introduced.

Fig. 3 Stereoscopic image and its cyclopean image and saliency maps. a Left image. b Cyclopean image. c 2D saliency map. d Depth saliency map. e 3D saliency map

1) Cyclopean image

We adopt the cyclopean image calculation method of reference [5]:

$$ CI(x,y)=W_L(x,y)\times I_L(x,y)+W_R(x+d,y)\times I_R(x+d,y) $$
(1)

where CI denotes the cyclopean image, I_L and I_R are the left and right views, d is the disparity obtained from the left image, and W_L and W_R are computed from the normalized Gabor filter amplitude responses.
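A minimal sketch of Eq. (1) is given below. It assumes the disparity map d and the Gabor amplitude responses used to build W_L and W_R are already available; the helper names and the small stabilizing constant are hypothetical additions, not part of the original method.

```python
import numpy as np

def cyclopean_image(I_L, I_R, d, GE_L, GE_R):
    """Sketch of Eq. (1). I_L, I_R: grayscale views; d: per-pixel disparity of the
    left image; GE_L, GE_R: Gabor filter amplitude responses (assumed precomputed,
    as in [5]). W_L and W_R are obtained by normalizing the two responses."""
    H, W = I_L.shape
    CI = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            xr = int(np.clip(x + d[y, x], 0, W - 1))          # disparity-shifted column
            w_l = GE_L[y, x] / (GE_L[y, x] + GE_R[y, xr] + 1e-12)
            w_r = 1.0 - w_l                                    # weights sum to one
            CI[y, x] = w_l * I_L[y, x] + w_r * I_R[y, xr]
    return CI
```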

An example of cyclopean images is shown in Fig. 4. Figure 4a shows a left and right view pair with asymmetric white noise (WN) distortion, and Fig. 4b shows a pair with asymmetric Gaussian blur (BLUR) distortion. Figure 4c and d are the cyclopean images generated from Fig. 4a and b, respectively. Noise is clearly visible in Fig. 4c, and the image quality is significantly reduced by it. In Fig. 4d, however, blur is hardly visible, which indicates that binocular suppression exists. These observations are consistent with reference [1] (binocular suppression is found in BLUR and JP2K images, but not in WN and JPEG). The cyclopean image reflects the phenomenon of binocular suppression well, but it hardly conveys the depth information of the stereoscopic image.

Fig. 4 An example of generated cyclopean images. a Left and right views with asymmetric white noise distortion. b Left and right views with asymmetric Gaussian blur distortion. c Cyclopean image of view (a). d Cyclopean image of view (b)

2) 2D saliency map

We use the SDSP method [48] to calculate the 2D saliency map; it incorporates the prior knowledge that people are more attracted to objects with warm colors and to objects located near the center of the image. The saliency map calculated by SDSP focuses on the object, from which the shape and boundary of the object can be clearly seen.

When viewing stereoscopic images, new depth cues are introduced, and they combine or conflict with other monocular cues; it is therefore unreasonable and ineffective to directly apply a 2D visual saliency model to 3D saliency calculation.

3) Depth saliency map

We consider not only the 2D saliency but also the depth saliency of stereoscopic images. The purpose of the optical flow algorithm [33] is to calculate the velocity vector of each pixel. If we regard the left view as the first frame and the right view as the second frame, objects close to the human eye have large disparity and therefore show high velocity in the optical flow field. We thus use the optical flow algorithm to calculate the disparity map of the stereoscopic image, with the following energy formulation:

$$ E(u,v)=\sum_{i,j}\Big\{\rho_D\big(I_L(i,j)-I_R(i+u_{i,j},\,j+v_{i,j})\big)+\lambda\big[\rho_S(u_{i,j}-u_{i+1,j})+\rho_S(u_{i,j}-u_{i,j+1})+\rho_S(v_{i,j}-v_{i+1,j})+\rho_S(v_{i,j}-v_{i,j+1})\big]\Big\} $$
(2)

In the optical flow field, u is the horizontal component, v is the vertical component, λ is the regularization parameter, ρD is the data penalty function and ρS is the spatial penalty function.

For the human visual system, the horizontal disparity is much larger than the vertical disparity and is more effective for depth perception. Therefore, only the horizontal component of the computed motion vector is retained, forming the disparity map D. The depth saliency map is then defined as

$$ S_D(x,y)=1-\frac{D(x,y)-D_{\min}}{D_{\max}-D_{\min}} $$
(3)

where D_min and D_max are the minimum and maximum values in the disparity map D, respectively.
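The sketch below illustrates Eqs. (2)–(3). The paper uses the optical flow method of [33]; OpenCV's Farneback flow is substituted here only as a stand-in, and the disparity sign convention and flow parameters are assumptions.

```python
import cv2
import numpy as np

def depth_saliency(left_gray, right_gray):
    """Sketch of Eqs. (2)-(3): estimate disparity from left->right optical flow and
    map it to a depth saliency map. Inputs are 8-bit grayscale views."""
    flow = cv2.calcOpticalFlowFarneback(left_gray, right_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    D = flow[..., 0]                                            # horizontal component only
    S_D = 1.0 - (D - D.min()) / (D.max() - D.min() + 1e-12)     # Eq. (3)
    return S_D
```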

4) Saliency maps combination

The purpose of saliency map combination is to fuse features of different dimensions, namely the saliency map of 2D visual attention features and the depth saliency map, so as to obtain the 3D saliency information of the stereoscopic image. Although many scholars have proposed depth-saliency models, there is still no standard and widely used fusion method. Therefore, considering the different importance of 2D saliency and depth saliency, we use a linear pooling strategy to fuse the 2D saliency map obtained by the SDSP algorithm and the depth saliency map into the final 3D saliency map. The linear pooling strategy is the same as the approach in [30]:

$$ S=\gamma S_{SDSP}+(1-\gamma)S_D $$
(4)

where S_SDSP denotes the 2D saliency map obtained by the SDSP algorithm, S_D is the depth saliency map, and γ is the weighting coefficient. How the two saliency maps interact and ultimately affect 3D saliency is not yet completely clear; since we consider that both content saliency and depth perception have a great impact on stereoscopic images, we treat them as equally important and set γ to 0.5.
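A minimal sketch of Eq. (4) follows. The rescaling of each map to [0, 1] before fusion is an assumed preprocessing step, not stated in the text.

```python
import numpy as np

def rescale(S):
    """Rescale a saliency map to [0, 1] (assumed preprocessing)."""
    return (S - S.min()) / (S.max() - S.min() + 1e-12)

def saliency_3d(S_SDSP, S_D, gamma=0.5):
    """Eq. (4): linear pooling of the 2D (SDSP) and depth saliency maps.
    gamma = 0.5 weights the two maps equally, as chosen in this work."""
    return gamma * rescale(S_SDSP) + (1.0 - gamma) * rescale(S_D)
```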

3.2 Local saliency region

As shown in Fig. 5, the left and right images and the 3D saliency map are divided into image patches, and salient image patches are chosen via the 3D saliency map. If the saliency value of a patch of the 3D saliency map exceeds a given threshold, the corresponding patches of the left and right images are selected and fed to the 3D-CNN.

Fig. 5 Selection procedure of salient image patches
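A sketch of this selection rule is shown below. Summarizing a patch's saliency by its mean value is an assumption (the text only speaks of a patch's saliency value); the threshold T = 0.2 follows Section 4.5.

```python
import numpy as np

def select_salient_patches(I_L, I_R, S, patch=32, T=0.2):
    """Divide the views and the 3D saliency map S into non-overlapping patches and
    keep only the left/right patch pairs whose saliency exceeds the threshold T."""
    selected = []
    H, W = S.shape
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            if S[y:y + patch, x:x + patch].mean() > T:       # patch saliency (assumed: mean)
                selected.append((I_L[y:y + patch, x:x + patch],
                                 I_R[y:y + patch, x:x + patch]))
    return selected
```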

3.3 Summation / difference image

The authors of [25] suggest that the visual system has separate adaptive binocular summation and difference channels to achieve efficient transmission of binocular information. At the physiological level, the signals of the summation and difference channels are multiplexed, and each V1 neuron receives a weighted sum of the signals in these two channels [18]. To clearly show the effect of the summation/difference theory, a reference stereoscopic image and the corresponding summation and difference images are shown in Fig. 6. The summation image resembles a 2D image with ghosting, which is caused by the parallax of the stereoscopic pair. We believe that the human visual system can sense this parallax and convert it into depth information, because the images perceived in the brain are clear and three-dimensional, while the difference image mainly shows the depth and contour information of objects. Therefore, we extract quality features from the summation and difference images of stereoscopic images to predict image quality. According to references [8, 25], the binocular summation/difference images can be simply calculated as follows:

$$ \begin{aligned} I_S(i,j)&=I_L(i,j)+I_R(i,j)\\ I_D(i,j)&=I_L(i,j)-I_R(i,j) \end{aligned} $$
(5)

where I_S is the summation image and I_D is the difference image.
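Eq. (5) amounts to a per-pixel sum and difference; a minimal sketch, assuming the two views are arrays of the same size, is:

```python
import numpy as np

def summation_difference(I_L, I_R):
    """Eq. (5): binocular summation and difference images."""
    I_L = I_L.astype(np.float32)
    I_R = I_R.astype(np.float32)
    return I_L + I_R, I_L - I_R
```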

Fig. 6 Reference stereoscopic image and the corresponding summation and difference images from the LIVE 3D Phase I database

3.4 Image local normalization

We divide the M × N stereoscopic image into m × n patches without overlapping, and then scale each image patch into the range [0, 1] by local normalization [27]. The locally normalized image I′(i, j) is calculated as follows:

$$ \begin{aligned} I^{\prime}(i,j)&=\frac{I(i,j)-\mu(i,j)}{\sigma(i,j)+C}\\ \mu(i,j)&=\sum_{k,l}\omega_{k,l}\,I_{k,l}(i,j)\\ \sigma(i,j)&=\sqrt{\sum_{k,l}\omega_{k,l}\big(I_{k,l}(i,j)-\mu(i,j)\big)^2} \end{aligned} $$
(6)

where I is the image before local normalization, μ and σ are the local mean and standard deviation of I, respectively, ω_{k,l} is a two-dimensional circularly symmetric Gaussian weighting function, and C is a small constant that prevents division by zero.
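A minimal sketch of Eq. (6) follows, approximating the circularly symmetric Gaussian window with SciPy's gaussian_filter; the window scale sigma and the constant C are assumed values, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_normalize(patch, sigma=7.0 / 6.0, C=1.0):
    """Eq. (6): local (divisive) normalization of an image patch."""
    patch = patch.astype(np.float32)
    mu = gaussian_filter(patch, sigma)                    # local weighted mean
    var = gaussian_filter(patch ** 2, sigma) - mu ** 2
    std = np.sqrt(np.maximum(var, 0.0))                   # local weighted standard deviation
    return (patch - mu) / (std + C)
```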

3.5 3D convolutional neural networks

In general, a typical CNN alternates convolution and pooling layers to process the input information, and then uses fully connected layers to obtain the mapping between features and targets. In a 2D-CNN, the convolution and pooling layers can only extract features of a 2D image and cannot automatically capture the interaction information between the two views of a stereoscopic image. 3D convolution and 3D pooling can extract features across different images, which is exactly what stereoscopic images need. Therefore, we employ a 3D-CNN to complete the NR SIQA task.

1) 3D convolution

Convolution in a CNN is a special linear operation between the input data and multiple kernels, which is used to generate feature maps. For SIQA, 3D convolution adds depth information on top of 2D convolution. The value at position (x, y, z) of the j-th feature map in the i-th layer can be written as [10]:

$$ v_{ij}^{xyz}=g\left(b_{ij}+\sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\,v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right) $$
(7)

where g(·) is a non-linear activation function such as the hyperbolic tangent (tanh) or Rectified Linear Unit (ReLU), b_ij is the bias of the current feature map, m indexes the feature maps in the (i − 1)-th layer connected to the current feature map, the size of the 3D kernel in the i-th layer is P_i × Q_i × R_i, and w_{ijm}^{pqr} is the value at position (p, q, r) of the kernel connected to the m-th feature map.
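As a concreteness check, the sketch below implements Eq. (7) naively in NumPy for a single output feature map; padding, stride and the layer indices are omitted, and the axis ordering (maps, depth, height, width) is an assumption.

```python
import numpy as np

def conv3d_single_map(inputs, kernels, bias, activation=np.tanh):
    """Naive Eq. (7): one output feature map of a 3D convolution layer.
    inputs : (M, D, H, W) feature maps from the previous layer
    kernels: (M, R, P, Q) one 3D kernel per input feature map
    bias   : scalar b_ij"""
    M, D, H, W = inputs.shape
    _, R, P, Q = kernels.shape
    out = np.zeros((D - R + 1, H - P + 1, W - Q + 1))
    for z in range(out.shape[0]):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                window = inputs[:, z:z + R, x:x + P, y:y + Q]
                out[z, x, y] = np.sum(window * kernels)    # sum over m, p, q, r
    return activation(out + bias)
```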

2) Structure of 3D-CNN

Based on 3D convolution, we design a 3D-CNN model to automatically learn the quality-aware features of stereoscopic images, as shown in Fig. 7. In theory, the deeper the model, the stronger its representational ability, but the more training data it requires. However, the number of images in stereoscopic image databases is limited, which makes complex models prone to overfitting. Therefore, we designed an effective model according to the size of the database, consisting of five convolution layers, three pooling layers and two fully connected layers, namely Conv1-Conv5, Maxpooling1-Maxpooling3 and FC1-FC2. The input contains the RGB color channels of the left and right images and of their summation and difference images, which lets the network accept more diverse features. Therefore, a cube of 4 × m × n (for example, m = 32, n = 32) with three feature maps is taken as the input of the 3D-CNN model. In particular, Conv1 and Conv2 have 32 convolution kernels each, Conv3 and Conv4 have 64, and Conv5 has 128. The sizes and strides of the convolution kernels are shown in Table 1. Padding is used to keep the input and output sizes of each convolution layer consistent. In addition, there are Maxpooling layers after Conv1, Conv2 and Conv5. Finally, the two fully connected layers have 512 nodes each. The full set of network parameters is given in Table 1.

Fig. 7 Structure of the 3D-CNN model

Table 1 Configurations of the proposed 3D-CNN structure
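For illustration, a PyTorch sketch of this architecture is given below. Since Table 1 is not reproduced here, the kernel sizes (3 × 3 × 3), the spatial-only pooling shapes and the final single-node regression head are assumptions; only the layer counts and channel/node widths follow the text.

```python
import torch
import torch.nn as nn

class SIQA3DCNN(nn.Module):
    """Sketch of the 3D-CNN in Fig. 7: Conv1-Conv5, Maxpooling1-3, FC1-FC2."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),    # Conv1 + Maxpooling1
            nn.Conv3d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),   # Conv2 + Maxpooling2
            nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),                            # Conv3
            nn.Conv3d(64, 64, 3, padding=1), nn.ReLU(),                            # Conv4
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),  # Conv5 + Maxpooling3
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4 * 4, 512), nn.ReLU(), nn.Dropout(0.5),           # FC1
            nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.5),                       # FC2
            nn.Linear(512, 1),                                                     # quality score (assumed output head)
        )

    def forward(self, x):
        # x: (batch, 3 RGB channels, 4 images {L, R, sum, diff}, 32, 32)
        return self.regressor(self.features(x))

model = SIQA3DCNN()
print(model(torch.randn(2, 3, 4, 32, 32)).shape)   # torch.Size([2, 1])
```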

We use the adaptive moment estimation (Adam) optimizer and back-propagation to train the network, with the learning rate set to 0.0001. The mini-batch size is set to 32, and the model parameters are updated after each iteration. The activation function of all convolution layers and fully connected layers is the Rectified Linear Unit (ReLU), which simplifies back-propagation and improves optimization by thresholding the input. To avoid overfitting, dropout is used in the fully connected layers: the output of each neuron is randomly set to 0 with a probability of 0.5. As an effective approximation, dropout prevents the network from overfitting while sharing weights.
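A training-loop sketch consistent with these settings is shown below; `train_loader`, the number of epochs and the device are assumptions, and each salient patch cube is labeled with the DMOS of its whole image (see Sections 3.6 and 5).

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, device="cpu"):
    """Adam, lr = 1e-4, mini-batches of 32 patch cubes, MSE loss against DMOS."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        running = 0.0
        for patches, dmos in train_loader:                 # patches: (32, 3, 4, 32, 32)
            patches = patches.to(device)
            dmos = dmos.to(device).float()
            optimizer.zero_grad()
            pred = model(patches).squeeze(1)               # (32,)
            loss = criterion(pred, dmos)
            loss.backward()                                # back-propagation
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: MSE = {running / len(train_loader):.4f}")
```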

3.6 Global quality

The input of the network is the salient image patches cut from the left and right images, and the quality of each patch is predicted by the 3D-CNN. The quality of the whole stereoscopic image is then obtained by averaging the local quality scores of the salient image patches, as in (8).

$$ Q=\frac{1}{N_p}\sum_{i=1}^{N_p} q_i $$
(8)

where N_p is the number of salient image patches and q_i is the predicted quality of the i-th salient patch.

4 Experimental results and analysis

In this section, the LIVE 3D databases and evaluation metrics are introduced, and the performance of the proposed blind SIQA method is evaluated comprehensively, including overall performance comparison, performance on individual distortion types, cross-database validation, saliency threshold analysis, and performance on symmetrically and asymmetrically distorted images.

4.1 Databases and evaluation metrics

We use the two public LIVE 3D Phase I and Phase II IQA databases to verify the effectiveness of the algorithm. LIVE 3D Phase I [28] contains 20 reference images, each with five types of distorted images. There are 80 distorted images each for JP2K, JPEG, WN and FF, and 45 for BLUR, 365 in total. Each stereoscopic pair is symmetrically distorted, that is, the left and right views have the same distortion type and degree. The database also contains the corresponding differential mean opinion score (DMOS) of every stereoscopic image; the lower the DMOS value, the better the image quality, and the DMOS of the 20 reference images is 0. LIVE 3D Phase II [5] contains 8 reference images and 360 distorted images with the same distortion types as Phase I, and each stereoscopic pair has a corresponding DMOS value. The difference is that only 120 of the pairs are symmetrically distorted, and the remaining 240 are asymmetrically distorted.

In our experiments, we use the same three performance indicators as most of the literature: the Spearman rank-order correlation coefficient (SROCC), the Pearson linear correlation coefficient (PLCC) and the root mean square error (RMSE). The closer SROCC and PLCC are to 1 and RMSE is to 0, the better the objective evaluation. In each experiment, 80% of the images are randomly selected as the training set and the remaining 20% as the test set, and the median over 100 random splits is reported as the final result.
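The three indicators can be computed as in the sketch below. PLCC and RMSE are often reported after a nonlinear logistic mapping of the predictions; whether such a mapping is applied here is not stated, so it is omitted.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, dmos):
    """SROCC, PLCC and RMSE between predicted scores and subjective DMOS."""
    pred = np.asarray(pred, dtype=float)
    dmos = np.asarray(dmos, dtype=float)
    srocc = spearmanr(pred, dmos)[0]
    plcc = pearsonr(pred, dmos)[0]
    rmse = float(np.sqrt(np.mean((pred - dmos) ** 2)))
    return srocc, plcc, rmse
```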

4.2 Overall performance comparison

To evaluate the effectiveness of the proposed model, we compared its results with four state-of-the-art FR-SIQA methods (Chen2013 [5], Shao2017 [31], Jiang2018 [12] and Li2019 [20]) and nine NR-SIQA methods (Chen2013 [6], Appina2016 [3], Zhou2017 [52], Yang2018 [43], Yue2018 [46], Yang2019 [45], Li2019 [21], Liu2020 [23] and Sun2020 [34]). The SROCC, PLCC and RMSE results on the LIVE 3D Phase I and II databases are summarized in Table 2, with the best two results displayed in bold. Table 2 shows that the proposed model is competitive on both databases, which proves that it can effectively predict the quality of stereoscopic images. In particular, the SROCC on Phase II is 0.954, which is 0.008 higher than the best result of the other thirteen methods (0.946, Li2019 [21]). Based on the above analysis, our model simulates the human visual system well for both symmetrically and asymmetrically distorted stereoscopic images.

Table 2 Comparison of overall performance of different methods

Figure 8 shows the change of the loss during training on the two databases. The MSE loss decreases rapidly in the first few iterations and stabilizes after about 30 iterations, which shows that the model converges quickly and keeps the training time cost low.

Fig. 8 Optimization process of the training loss on the LIVE 3D Phase I and II databases

4.3 Performance comparison of single distorted types

In practical applications, images suffer different types of distortion during acquisition, transmission and display. Therefore, an SIQA method should not only have good overall performance, but also achieve satisfactory results on individual distortion types. The SROCC and PLCC results for the five distortion types in the LIVE 3D Phase I and II databases are listed in Tables 3 and 4, respectively, with the best two results displayed in bold.

Table 3 Performance comparison of SROCC with five kinds of distortion
Table 4 Performance comparison of PLCC with five kinds of distortion

According to Tables 3 and 4, the proposed model shows the highest prediction accuracy for most distortion categories; only the SROCC and PLCC for Blur are less satisfactory. In addition, the FR method Jiang2018 [12] also performs well on the WN, Blur and FF categories, but its performance on JP2K and JPEG distortion on the two databases is poor, which degrades its overall performance. In general, the proposed model performs well, which demonstrates its robustness and effectiveness.

To show the prediction performance of the proposed model more intuitively, Fig. 9 gives scatter plots of predicted DMOS against subjective DMOS on LIVE 3D Phase I and Phase II. The horizontal axis represents the DMOS predicted by the proposed model, and the vertical axis represents the subjective DMOS. The more tightly the scatter points cluster around a straight line, the better the model. Figure 9 shows that the scatter of the various distortion types lies close to a straight line and the fitted curve is very close to the diagonal of the first quadrant, which further shows that the proposed algorithm is linearly consistent with subjective perception.

Fig. 9 Scatter plots of predicted DMOS of the proposed method against subjective DMOS. a On the LIVE 3D Phase I. b On the LIVE 3D Phase II

4.4 Cross-database validation

To further show that the proposed model is not limited to the samples of a single database, we conduct cross-database validation, training the model on one database and testing it on a different one. Two cross-tests are conducted: (1) LIVE I/LIVE II, meaning the model is trained on LIVE 3D Phase I and tested on LIVE 3D Phase II, and (2) LIVE II/LIVE I, meaning the model is trained on LIVE 3D Phase II and tested on LIVE 3D Phase I. The results are given in Table 5. The cross-database results are significantly lower than those obtained when training and testing on the same database, because the distortion types and degrees of the two databases differ considerably. Moreover, the performance of LIVE I/LIVE II is clearly worse than that of LIVE II/LIVE I, probably because LIVE II contains both symmetrically and asymmetrically distorted stereoscopic images while LIVE I contains only symmetrically distorted ones, so asymmetric distortion is never learned in the LIVE I/LIVE II setting. Compared with six other recent methods, the proposed model shows competitive performance, which further suggests that it is effective for SIQA, insensitive to image content, and has good generality and stability.

Table 5 Performances of cross-database validations

4.5 Saliency threshold analysis

The saliency threshold determines how many salient patches are selected: the larger the threshold, the fewer patches are selected; conversely, the smaller the threshold, the more patches are selected. We therefore carried out a comparative experiment under different saliency thresholds. The results for thresholds of 0, 0.1, 0.2, 0.3, 0.4, 0.6 and 0.8 are listed in Tables 6, 7, 8 and 9, with the best results shown in bold. On the whole, the results are best when the saliency threshold is 0.2, so we set the threshold to 0.2 in this paper.

Table 6 Overall performance comparison of different saliency thresholds on LIVE 3D Phase I
Table 7 Overall performance comparison of different saliency thresholds on LIVE 3D Phase II
Table 8 SROCC of individual distortion types for different saliency thresholds on LIVE 3D Phase I
Table 9 SROCC of individual distortion types for different saliency thresholds on LIVE 3D Phase II

4.6 Performance comparison on symmetrically and asymmetrically distorted images

We further carried out a comparative experiment on the symmetrically and asymmetrically distorted images of LIVE 3D Phase II. We compared the proposed method with three FR (Benoit2008 [4], Chen2013 [5], and Wang2015 [39]) and four NR (Mittal2012 [27], Chen2013 [6], Zhang2016 [50], and Shao2018 [32]) IQA methods, whose results have been reported in the related papers or whose source code has been made public. The SROCC and PLCC results are summarized in Table 10, with the best two results displayed in bold. Most of the other methods achieve good results on symmetric distortion, indicating that they can accurately evaluate the perceptual quality of symmetrically distorted stereoscopic images, but they perform poorly on asymmetric distortion. In contrast, the results of the proposed method on asymmetric distortion are even better than those on symmetric distortion, which shows that our method is also suitable for asymmetric distortion. Overall, the proposed method achieves the best performance on both symmetrically and asymmetrically distorted images.

Table 10 Performance Comparison on symmetrically and asymmetrically distorted images

5 Conclusion and future work

In this paper, we proposed a blind SIQA model via 3D-saliency-selected binocular perception and 3D-CNN. The 3D saliency map is used to select salient image patches that are more suitable for SIQA, and the objective quality score of the stereoscopic image is obtained by averaging the predicted scores of the salient patches. The experimental results show that the SROCC of the proposed method on LIVE 3D Phase I and Phase II is 0.962 and 0.954, respectively, and in cross-database validation the SROCC of LIVE II/LIVE I is 0.910. Compared with state-of-the-art NR SIQA methods, our metric achieves higher performance, which shows its superiority and robustness.

In this method we select salient local regions of the image to train the 3D-CNN model, but the training labels come from the DMOS of the whole image, which are not necessarily the true quality scores of the local patches and may limit the learned subjective relationship. In future research, it will be necessary to obtain real quality scores of local image patches for non-uniformly distorted images.