1 Introduction

With the development of stereoscopic display and network technologies, three-dimensional (3D) image processing technologies have attracted widespread public attention and have broad application prospects [11, 35]. 3D imaging technology presents two slightly different images of one scene to the left and right eyes, allowing the brain to reconstruct the original scene via binocular disparity [18]. As the amount of data required to store or transmit 3D images can be double or more that of traditional two-dimensional (2D) images [24], a series of 3D (including stereoscopic and multi-view) video coding schemes have been proposed [5, 12, 19]. How coding distortions are assessed directly affects the performance evaluation of these coding schemes. Therefore, stereoscopic (or 3D) image quality assessment (SIQA) has become an important issue.

Generally, SIQA methods can be categorized as subjective and objective. Subjective methods have been standardized by the International Telecommunication Union (ITU) [8-10]. As observers are the ultimate receivers of visual information, the results of subjective methods are reasonable and reliable; they can be exploited to analyze the effects of various factors on the perceived quality of stereoscopic images and to evaluate the predictive performance of objective methods. IJsselsteijn [7], Tam [23], and Wang [27] analyzed the effects of camera parameters, display duration, and quality-asymmetric coding on the perceived quality of stereoscopic images, respectively. However, subjective methods are not only inconvenient and time-consuming but also infeasible in many scenarios, e.g., real-time video systems. Therefore, the goal of SIQA research is to design an efficient algorithm for objective assessment of image quality that is consistent with human visual perception.

Objective SIQA methods mainly depend on a number of quantified image features to measure the quality of stereoscopic images. Most perceptual evaluations of 3D television systems are currently performed with assessment concepts based on 2D image quality assessment (2D-IQA) methods, which directly apply 2D-IQA methods to evaluate the overall quality of a stereoscopic image as the mean quality of its two views [30, 33]. Peak Signal-to-Noise Ratio (PSNR) was applied to predict the quality of stereoscopic images in [30], and a joint method of PSNR and the Structural SIMilarity index (SSIM) [26] was applied in [33]. It is a well-known fact that these methods do not take the binocular perception of the human visual system (HVS) into consideration. Yasakethu et al. described stereoscopic image quality as a combination of several perceptual attributes, including image quality, perceived depth, presence, naturalness, and eye strain [29]. Some researchers simplified stereoscopic image quality to a combination of two dominant perceptual attributes, i.e., image quality and depth perception [1, 17, 25, 28, 31, 32]. Usually, the image-quality component assesses the ordinary monoscopic distortions caused by blur, noise, contrast change, etc., while the depth-perception component assesses the degradation of depth perception (or stereo sense) via a depth or disparity map. Yang et al. assessed stereoscopic image quality from the perspectives of image quality and depth perception [28]. Zhang et al. proposed a multi-view video quality assessment method based on disparity and SSIM, covering both image quality and depth perception [32]. Sazzad et al. proposed a no-reference quality assessment for JPEG-coded stereoscopic images based on segmented local features of artifacts and disparity [17]. You et al. 
compared several combinations of disparity quality and image quality obtained by well-known 2D-IQA methods and gave a good combination scheme using SSIM and the Universal Quality Index (UQI) [25] in SIQA [31]. Benoit et al. also presented a linear combination scheme for disparity distortion and the measurement of 2D image quality on both views [1]. These approaches are clearly extensions or improvements of 2D-IQA methods. Even though their predictive performance is remarkably improved compared with methods that ignore depth perception, they do not account for the binocular interactions between the two eyes in HVS or for the relationship between image content and depth perception. As a result, the predictive results of existing objective SIQA methods are not fully consistent with human visual perception.

Psychovisual research has indicated that HVS relies on binocular fusion and suppression to resolve the differing information from the two retinal images and achieve binocular single vision [4, 14]. The difference between the two retinal images, called binocular disparity, contributes critically to depth perception [3]. Therefore, the generation of binocular single vision involves not only the interactions between the two retinal images but also the creation of depth perception. Following the way HVS deals with the two retinal images, we propose a full-reference SIQA method based on binocular vision in this paper. The paper is structured as follows: Section 2 analyzes the problems of existing SIQA methods and presents the mechanism by which HVS deals with a stereoscopic image; Section 3 proposes an objective SIQA method based on binocular vision; Section 4 optimizes the method and analyzes its predictive performance compared with existing SIQA methods; finally, conclusions and future work are given in Section 5.

2 Problem description and motivations

It is well known that there is a binocular interval of 60–65 mm between the two human eyes, which results in two slightly different retinal images of the same scene. However, the phenomenon of diplopia (one object, two images) does not occur in normal vision. This is due to binocular fusion and suppression, by which HVS is able to process the differing information from the two retinal images and achieve depth perception in binocular single vision [4, 14]. Therefore, the processing of a stereoscopic image in HVS is the generation of binocular single vision, which involves the interactions between the two views and the creation of depth perception. However, existing objective SIQA methods may not take binocular perceptual attributes into consideration and suffer from the following two main problems:

  i)

    Existing SIQA methods may be unable to assess some distortions unique to stereoscopic images, such as crosstalk, cardboard, and keystone distortions [3]. As these distortions involve interactions between the two views of a stereoscopic image rather than affecting each view separately, any combination of image quality assessments over the separate views may fail to predict the perceptual degradation they cause. In addition, existing SIQA methods mainly follow the approach of traditional 2D-IQA methods, i.e., they are designed from the perspective of distortions and built on statistical analysis of the effects of distortions on the perceived quality of image content and depth perception. As a result, they may predict some types of degradation well but fail on others. However, an objective SIQA method should be applicable to any distortion rather than just one or two.

  ii)

    They may not truly assess the depth perception of stereoscopic images. It is well known that stereoscopic images provide not only image content but also depth perception. Compared with 2D-IQA, the quality of depth perception also needs to be assessed in SIQA. However, depth perception is a mental sense produced in the brain and is hard to describe with mathematical models. Thus, a depth map (or disparity map) of the stereoscopic image is used as a substitute for describing depth perception, and it is assessed by traditional 2D-IQA methods. In fact, there are significant differences between a depth map and a 2D image: a depth map represents the distance from object surfaces to the camera, while a 2D image represents the color and brightness of objects. Therefore, 2D-IQA methods may not truly reflect changes in depth perception. In addition, depth perception cannot exist independently; it accompanies and changes with the content of the stereoscopic image. Existing SIQA methods that neglect the relationship between depth perception and image content may therefore be unreasonable.

To solve the problems mentioned above, the mechanism by which HVS deals with the two retinal images should be taken into account in SIQA design. Fig. 1 shows the way HVS deals with two retinal images. According to research on human perception and psychophysics, HVS first searches for matching local features between the two retinal images. If the two retinal images (or local features) are similar with small disparity, binocular fusion merges them into a single binocular percept [22]. However, when attempting to fuse two dissimilar images, or two similar images with large disparity, HVS is faced with conflicting data from the two eyes, a situation known as rivalry. During rivalry, the initial perception is diplopia or confusion. However, HVS normally cannot tolerate rivalry for long; it usually reconciles the conflicting data by suppressing one of them. The entire image from one retina may be suppressed, but in most cases parts of the right eye’s visual field are suppressed while, elsewhere, parts of the left eye’s visual field are suppressed. Sometimes HVS may resolve the rivalry by alternately suppressing either eye.

Fig. 1

The way that two retinal images may be dealt with in HVS [22]

On the other hand, in full-reference SIQA, an original stereoscopic image is used as the reference benchmark and is regarded as presenting a perfect percept of a real-world scene to viewers. As binocular suppression usually accompanies visual discomfort and fatigue [13], while the reference stereoscopic image does not cause any visual discomfort or fatigue, we assume that binocular suppression does not occur when HVS deals with the reference stereoscopic image. Thus, the reference stereoscopic image can be divided into two regions: a monocular region dealt with by monocular vision, i.e., the occluded (or disoccluded) region, and a binocular region dealt with by binocular fusion. However, after the injection of distortions, the reference stereoscopic image turns into a distorted stereoscopic image. The distortions may affect not only the identification of some objects but also the depth perception of some objects. For example, some parts may turn into new occluded regions because the similarity between the corresponding parts in the two views disappears; some parts’ disparities may become larger, possibly triggering binocular suppression; some parts’ disparities may become smaller; in other cases, some occluded parts may no longer exist. Therefore, the distorted stereoscopic image may comprise three kinds of regions: an occluded region dealt with by monocular vision, a binocular region dealt with by binocular fusion, and a binocular region dealt with by binocular suppression.

3 The proposed binocular fusion and suppression based objective stereoscopic image quality assessment (SIQA) method

As mentioned above, three regions, i.e., occluded, binocular fusion, and binocular suppression regions, may coexist in a distorted stereoscopic image. The three regions are dealt with by HVS in different ways; thus, it is better to assess them in different manners. However, except for the occluded region, which has no corresponding region in the other view, it is hard to accurately identify the binocular fusion and suppression regions. Shao et al. proposed a method to segment these three kinds of regions, based on which a perceptual full-reference quality metric is further proposed [6]. In [6], a left-right consistency check and the matching error between corresponding pixels are utilized to identify the binocular suppression region, and the binocular fusion region is defined as the area excluding the occluded and binocular suppression regions. After that, local phase and local amplitude maps are extracted from the original and distorted stereoscopic images as quality-assessment features. Then, each region is evaluated independently, and all evaluation results are integrated into an overall score. In this paper, to simplify the segmentation, the disparity information of the reference stereoscopic image is used as a benchmark to distinguish the two regions, based on the assumption that the reference stereoscopic image has no binocular suppression regions because it does not cause the visual discomfort and fatigue usually brought on by binocular suppression. The framework of the proposed binocular vision based SIQA method is shown in Fig. 2. A distorted stereoscopic image is first divided into an occluded region (unmatched region) and a binocular region (matched region with disparity) according to stereo matching between the two views. 
Compared with the corresponding disparity of the reference stereoscopic image, the binocular region with smaller or unchanged disparity in the distorted stereoscopic image is regarded as being dealt with by binocular fusion, while the binocular region with larger disparity is regarded as being dealt with by binocular suppression. Therefore, the binocular region in the distorted stereoscopic image is subdivided into a pseudo-binocular fusion region and a pseudo-binocular suppression region according to the changes of disparity in the corresponding binocular region. Then the visual quality indices of the three regions are obtained by simulating the ways HVS deals with them. Finally, according to the contributions of the three regions to the overall visual quality of the distorted stereoscopic image, the weighted sum of the three quality indices is used to represent the overall visual quality.

Fig. 2

The framework of the proposed SIQA method

3.1 Region classification of distorted stereoscopic image

The occluded, pseudo-binocular fusion, and pseudo-binocular suppression regions are classified as follows.

  1)

    Occluded region detection: For a given distorted stereoscopic image \( \widehat{I} \) with resolution m × n, stereo matching is performed between the two views. Let \( {\widehat{p}}_{x,y}^l \) denote a pixel at position (x, y) of the left view \( {\widehat{I}}_l \), and \( {\widehat{p}}_{s,t}^r \) a pixel at position (s, t) of the right view \( {\widehat{I}}_r \) matching \( {\widehat{p}}_{x,y}^l \). There are three cases, denoted (\( {\widehat{p}}_{x,y}^l \), \( {\widehat{p}}_{s,t}^r \)), (\( {\widehat{p}}_{x,y}^l \), ϕ), and (ϕ, \( {\widehat{p}}_{s,t}^r \)), where ϕ means that there is no matching pixel in the corresponding view. The occluded region \( {\widehat{R}}_{occ} \) of the distorted stereoscopic image is the set of pixels that have no matching pixels, defined as

    $$ {\widehat{R}}_{occ}={\widehat{R}}_{occ}^l\cup {\widehat{R}}_{occ}^r $$
    (1)
    $$ {\widehat{R}}_{occ}^l=\left\{{\widehat{p}}_{x,y}^l\left|\left({\widehat{p}}_{x,y}^l,\phi \right)\wedge {\widehat{p}}_{x,y}^l\in {\widehat{I}}_l,0\le x<m,0\le y<n\right.\right\} $$
    (2)
    $$ {\widehat{R}}_{occ}^r=\left\{{\widehat{p}}_{s,t}^r\left|\left(\phi, {\widehat{p}}_{s,t}^r\right)\wedge {\widehat{p}}_{s,t}^r\in {\widehat{I}}_r,0\le s<m,0\le t<n\right.\right\} $$
    (3)

    where \( {\widehat{R}}_{occ}^l \) denotes the occluded region in \( {\widehat{I}}_l \), and \( {\widehat{R}}_{occ}^r \) denotes the occluded region in \( {\widehat{I}}_r \).

  2)

    Pseudo-binocular suppression region detection: In the distorted stereoscopic image, the pseudo-binocular suppression region \( {\widehat{R}}_{bs} \) is defined as the set of pixels whose disparities between matching pixels are larger than the original disparities in the reference stereoscopic image. Let \( {\widehat{R}}_{bs}^l \) denote the region corresponding to \( {\widehat{R}}_{bs} \) in \( {\widehat{I}}_l \), and \( {\widehat{R}}_{bs}^r \) the region corresponding to \( {\widehat{R}}_{bs} \) in \( {\widehat{I}}_r \). \( {\widehat{R}}_{bs} \) is regarded as being dealt with by binocular suppression, for which the perceived quality is dominated by whichever of \( {\widehat{R}}_{bs}^l \) and \( {\widehat{R}}_{bs}^r \) has the better quality [3]; this can be defined as

    $$ {\widehat{R}}_{bs}= Sup\left\{{\widehat{R}}_{bs}^l,{\widehat{R}}_{bs}^r\right\} $$
    (4)
    $$ {\widehat{R}}_{bs}^l=\left\{{\widehat{p}}_{x,y}^l\left|{\widehat{p}}_{x,y}^l\in {\widehat{I}}_l\wedge {\widehat{p}}_{s,t}^r\in {\widehat{I}}_r\wedge \right.\left({\widehat{p}}_{x,y}^l,{\widehat{p}}_{s,t}^r\right)\wedge \left|{\widehat{d}}_{x,y}^h\right|+\left|{\widehat{d}}_{x,y}^v\right|>\left|{d}_{x,y}^h\right|+\left|{d}_{x,y}^v\right|,0\le x<m,0\le y<n\right\} $$
    (5)
    $$ {\widehat{R}}_{bs}^r=\left\{{\widehat{p}}_{s,t}^r\left|{\widehat{p}}_{x,y}^l\in {\widehat{R}}_{bs}^l\wedge {\widehat{p}}_{s,t}^r\in {\widehat{I}}_r\wedge \right.\left({\widehat{p}}_{x,y}^l,{\widehat{p}}_{s,t}^r\right),0\le s<m,0\le t<n\right\} $$
    (6)

    where Sup{} denotes the way HVS deals with the binocular region, described in detail in subsection 3.2; \( {\widehat{d}}_{x,y}^h \) and \( {\widehat{d}}_{x,y}^v \) are the horizontal and vertical left-to-right disparities of the pixel \( {\widehat{p}}_{x,y}^l \), respectively; and \( {d}_{x,y}^h \) and \( {d}_{x,y}^v \) are the horizontal and vertical left-to-right disparities of the corresponding position in the reference stereoscopic image, respectively. All of \( {\widehat{d}}_{x,y}^h \), \( {\widehat{d}}_{x,y}^v \), \( {d}_{x,y}^h \), and \( {d}_{x,y}^v \) are obtained with a stereo matching algorithm [21] provided by Cornell University.

  3)

    Pseudo-binocular fusion region detection: In the distorted stereoscopic image, the pseudo-binocular fusion region \( {\widehat{R}}_{bf} \) is the set of pixels whose disparities between matching pixels are smaller than or equal to the original disparities in the reference stereoscopic image. Let \( {\widehat{R}}_{bf}^l \) denote the region corresponding to \( {\widehat{R}}_{bf} \) in \( {\widehat{I}}_l \), and \( {\widehat{R}}_{bf}^r \) the region corresponding to \( {\widehat{R}}_{bf} \) in \( {\widehat{I}}_r \). \( {\widehat{R}}_{bf} \) is regarded as being dealt with by binocular fusion, for which the perceived quality is determined by both \( {\widehat{R}}_{bf}^l \) and \( {\widehat{R}}_{bf}^r \) [22]; this can be defined as

    $$ {\widehat{R}}_{bf}= Fus\left\{{\widehat{R}}_{bf}^l,{\widehat{R}}_{bf}^r\right\} $$
    (7)
    $$ {\widehat{R}}_{bf}^l=\left\{{\widehat{p}}_{x,y}^l\left|{\widehat{p}}_{x,y}^l\in {\widehat{I}}_l\wedge {\widehat{p}}_{s,t}^r\in {\widehat{I}}_r\wedge \right.\left({\widehat{p}}_{x,y}^l,{\widehat{p}}_{s,t}^r\right)\wedge \left|{\widehat{d}}_{x,y}^h\right|+\left|{\widehat{d}}_{x,y}^v\right|\le \left|{d}_{x,y}^h\right|+\left|{d}_{x,y}^v\right|,0\le x<m,0\le y<n\right\} $$
    (8)
    $$ {\widehat{R}}_{bf}^r=\left\{{\widehat{p}}_{s,t}^r\left|{\widehat{p}}_{x,y}^l\in {\widehat{R}}_{bf}^l\wedge {\widehat{p}}_{s,t}^r\in {\widehat{I}}_r\wedge \right.\left({\widehat{p}}_{x,y}^l,{\widehat{p}}_{s,t}^r\right),0\le s<m,0\le t<n\right\} $$
    (9)

    where Fus{} denotes the way HVS deals with the pseudo-binocular fusion region, involving binocular summation [2], which will be described in detail in subsection 3.2.
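The pixel-level classification above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' code: it assumes dense horizontal and vertical disparity maps in which unmatched (occluded) pixels are marked with NaN, whereas the paper relies on the stereo matcher of [21] to flag unmatched pixels.

```python
import numpy as np

def classify_pixels(dh_hat, dv_hat, dh_ref, dv_ref):
    """Label each pixel per Section 3.1: NaN disparity marks an
    unmatched (occluded) pixel; among matched pixels, a distorted
    |d^h| + |d^v| exceeding the reference one indicates pseudo-binocular
    suppression (Eq. (5)), otherwise pseudo-binocular fusion (Eq. (8))."""
    hat = np.abs(dh_hat) + np.abs(dv_hat)   # distorted L1 disparity magnitude
    ref = np.abs(dh_ref) + np.abs(dv_ref)   # reference L1 disparity magnitude
    labels = np.where(hat > ref, 'suppression', 'fusion').astype(object)
    labels[np.isnan(hat)] = 'occluded'      # no match found for this pixel
    return labels

# Toy example: three pixels of a left view (third one is unmatched).
labels = classify_pixels(
    np.array([3.0, 1.0, np.nan]), np.array([0.0, 0.0, np.nan]),
    np.array([2.0, 1.0, 2.0]),    np.array([0.0, 0.0, 0.0]))
print(list(labels))  # ['suppression', 'fusion', 'occluded']
```

Note that an unchanged disparity (second pixel) falls into the fusion region, matching the "smaller than or equal to" condition of Eq. (8).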

3.2 Binocular vision based quality assessment

For an image, the luminance component I can be considered a matrix with an integer value at each pixel. I can be decomposed into a product of three matrices

$$ I= US{V}^T $$
(10)

where U and V are orthogonal matrices, and \( S=\mathrm{diag}\left({s}_1,{s}_2,\dots \right) \). The diagonal entries of S are called the singular values of I. It is well known that the singular values from singular value decomposition (SVD) are sensitive to perturbations [20]. Adding distortions to an image modifies its structural information, perturbing the singular values. Since HVS is sensitive to structural changes, using singular values to quantify structural distortions provides a sound basis for assessing image quality. In this paper, singular values are used as features for gauging structural changes in stereoscopic images.
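The sensitivity of singular values to perturbations can be illustrated with a small numpy example (illustrative only; the rank-1 test block and the noise level are arbitrary choices):

```python
import numpy as np

# A clean rank-1 block and a noise-perturbed copy: adding distortion
# changes the block's structure, and the singular values move with it.
rng = np.random.default_rng(0)
block = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 5.0))  # rank-1, 4x4
noisy = block + 0.5 * rng.standard_normal(block.shape)

s = np.linalg.svd(block, compute_uv=False)      # one dominant value, rest ~0
s_hat = np.linalg.svd(noisy, compute_uv=False)  # all four values perturbed

# The Euclidean distance between the singular-value vectors quantifies
# the structural change; it is zero only for an unchanged block.
dist = float(np.sqrt(np.sum((s - s_hat) ** 2)))
print(dist > 0.0)  # True
```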

To reduce computational complexity, both the left and right views of the distorted stereoscopic image are segmented into non-overlapping blocks of size k × k, where k is an integer. Before the SVD is applied to each block, all blocks of the stereoscopic image are classified into occluded, pseudo-binocular suppression, and pseudo-binocular fusion blocks. The classification rule is as follows: for any block, if the block contains a pixel belonging to the occluded region, it is considered an occluded block; otherwise, if it contains a pixel belonging to the pseudo-binocular suppression region, it is considered a pseudo-binocular suppression block; else, it is considered a pseudo-binocular fusion block. According to the three kinds of block-wise regions of the distorted stereoscopic image, the three corresponding regions of the reference stereoscopic image are updated to the same block-wise partition.
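The priority rule above can be sketched as follows; this is an illustrative implementation under the assumption that the pixel masks are boolean arrays whose dimensions are multiples of k:

```python
import numpy as np

def classify_blocks(occ_mask, sup_mask, k):
    """Block labels with the stated priority: any occluded pixel makes
    the k x k block occluded; otherwise any suppression pixel makes it
    a suppression block; else it is a fusion block."""
    m, n = occ_mask.shape
    labels = np.empty((m // k, n // k), dtype=object)
    for i in range(m // k):
        for j in range(n // k):
            o = occ_mask[i*k:(i+1)*k, j*k:(j+1)*k].any()
            s = sup_mask[i*k:(i+1)*k, j*k:(j+1)*k].any()
            labels[i, j] = 'occ' if o else ('sup' if s else 'fus')
    return labels

# Toy 4x4 masks, 2x2 blocks: one occluded pixel and one suppression pixel.
occ = np.zeros((4, 4), bool); occ[0, 0] = True
sup = np.zeros((4, 4), bool); sup[0, 3] = True
print(classify_blocks(occ, sup, 2).tolist())  # [['occ', 'sup'], ['fus', 'fus']]
```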

Since HVS deals with the occluded, binocular suppression, and binocular fusion regions in different ways, different quality assessment methods are adopted for the three regions. SVD is applied to each block in each region, and the local error of that block is computed to obtain all the local errors of the blocks in the region. The quality assessments of these regions are described in the following subsections.

  1)

    Quality assessment for occluded region: As pixels in the occluded region exist in only one of the two views, the occluded region of the distorted stereoscopic image involves only monocular vision and can be assessed by 2D-IQA methods. Let \( {\widehat{B}}_i \) denote the i-th block in \( {\widehat{R}}_{occ} \), and \( B_i \) be the reference block of \( {\widehat{B}}_i \) in the reference stereoscopic image. Then the distance \( {D}_{occ}(i) \) between the singular values of \( {\widehat{B}}_i \) and \( B_i \) is defined as

    $$ {D}_{occ}(i)=\sqrt{{\displaystyle \sum_{j=1}^k{\left({s}_{i,j}-{\widehat{s}}_{i,j}\right)}^2}} $$
    (11)

    where \( s_{i,j} \) is the j-th singular value of \( B_i \), \( {\widehat{s}}_{i,j} \) is the j-th singular value of \( {\widehat{B}}_i \), and k is the block size.

    Let Q occ be the global error of the occluded region, which can be defined as

    $$ {Q}_{occ}=\frac{1}{N_{occ}}{\displaystyle \sum_{i=1}^{N_{occ}}\left|{D}_{occ}(i)-\left.{D}_{occ}^m\right|\right.} $$
    (12)

    where \( N_{occ} \) is the number of blocks in the occluded region, and \( {D}_{occ}^m \) is the median of \( \left\{{D}_{occ}(1),{D}_{occ}(2),\dots, {D}_{occ}\left({N}_{occ}\right)\right\} \).

  2)

    Quality assessment for pseudo-binocular suppression region: Let \( {\widehat{B}}_i \) be the i-th block in \( {\widehat{R}}_{bs} \), \( {\widehat{B}}_i^l \) be the corresponding block in \( {\widehat{R}}_{bs}^l \), \( {\widehat{B}}_i^r \) be the corresponding block in \( {\widehat{R}}_{bs}^r \), \( B_i^l \) be the reference block of \( {\widehat{B}}_i^l \) in the left view of the reference stereoscopic image, and \( B_i^r \) be the reference block of \( {\widehat{B}}_i^r \) in the right view of the reference stereoscopic image. Then the distance \( {D}_{bs}^l(i) \) between the singular values of \( {\widehat{B}}_i^l \) and \( B_i^l \), and the distance \( {D}_{bs}^r(i) \) between the singular values of \( {\widehat{B}}_i^r \) and \( B_i^r \), can be calculated analogously to Eq. (11). The overall visual quality of the binocular suppression region is dominated by the better-quality view [13]; therefore, the quality \( {D}_{bs}(i) \) of \( {\widehat{B}}_i \) is defined as

    $$ {D}_{bs}(i)= \min \left\{{D}_{bs}^l(i),{D}_{bs}^r(i)\right\} $$
    (13)

    Let Q bs be the global error of the pseudo-binocular suppression region, which can be defined as

    $$ {Q}_{bs}=\frac{1}{N_{bs}}{\displaystyle \sum_{i=1}^{N_{bs}}\left|{D}_{bs}(i)-\left.{D}_{bs}^m\right|\right.} $$
    (14)

    where \( N_{bs} \) is the number of blocks in the pseudo-binocular suppression region, and \( {D}_{bs}^m \) is the median of \( \left\{{D}_{bs}(1),{D}_{bs}(2),\dots, {D}_{bs}\left({N}_{bs}\right)\right\} \).

  3)

    Quality assessment for pseudo-binocular fusion region: Let \( {D}_{bf}^l(i) \) denote the distance between the singular values of the i-th block in \( {\widehat{R}}_{bf}^l \) and the corresponding block in \( R_{bf}^l \), and \( {D}_{bf}^r(i) \) be the distance between the singular values of the i-th block in \( {\widehat{R}}_{bf}^r \) and the corresponding block in \( R_{bf}^r \); both can be calculated analogously to Eq. (11). The global error \( Q_{bf}^l \) of \( {\widehat{R}}_{bf}^l \) and the global error \( Q_{bf}^r \) of \( {\widehat{R}}_{bf}^r \) can be calculated analogously to Eq. (12). The overall visual quality of the binocular fusion region is determined by the global errors of both the left and right views in the fusion region [22]. In addition, owing to binocular summation in the binocular fusion region, binocular acuity is approximately 1.4 times better than individual monocular acuities [22]. Therefore, the global error \( Q_{bf} \) of \( {\widehat{R}}_{bf} \) is defined as

    $$ {Q}_{bf}=1.4\times \frac{Q_{bf}^l+{Q}_{bf}^r}{2} $$
    (15)
  4)

    Quality fusion for distorted stereoscopic image: The overall visual quality of the distorted stereoscopic image is determined by \( \widehat{R}{}_{occ} \), \( \widehat{R}{}_{bs} \), and \( \widehat{R}{}_{bf} \). As the distortions in these three regions are independent of one another, the global error Q of the distorted stereoscopic image is obtained as a weighted linear combination of the quality indices of the three regions, defined as

    $$ Q=a\cdot {Q}_{occ}+b\cdot {Q}_{bs}+c\cdot {Q}_{bf} $$
    (16)

    where a, b and c are the weights of the three regions in the overall quality and restricted by

    $$ \left\{\begin{array}{c}\hfill a+b+c=1\hfill \\ {}\hfill 0\le a\le 1\hfill \\ {}\hfill 0\le b\le 1\hfill \\ {}\hfill 0\le c\le 1\hfill \end{array}\right. $$
    (17)
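The per-region pooling and the final fusion of Eqs. (11)-(17) can be sketched compactly. This is an illustrative Python sketch, not the authors' code; the default weights below are the optimized values reported in Section 4.2 for k = 4, and the toy inputs are made up:

```python
import numpy as np

def sv_distance(ref_block, dist_block):
    """Eq. (11): Euclidean distance between the singular-value vectors
    of a reference block and its distorted counterpart."""
    s = np.linalg.svd(ref_block, compute_uv=False)
    s_hat = np.linalg.svd(dist_block, compute_uv=False)
    return float(np.sqrt(np.sum((s - s_hat) ** 2)))

def global_error(distances):
    """Eqs. (12)/(14): mean absolute deviation of the per-block
    distances from their median, pooling local errors for a region."""
    d = np.asarray(distances, dtype=float)
    return float(np.mean(np.abs(d - np.median(d))))

def suppression_error(d_left, d_right):
    """Eq. (13): suppression keeps the better (smaller-error) view."""
    return min(d_left, d_right)

def fusion_error(q_left, q_right):
    """Eq. (15): the mean of the two view errors scaled by the
    binocular summation factor 1.4."""
    return 1.4 * (q_left + q_right) / 2.0

def overall_error(q_occ, q_bs, q_bf, a=0.0, b=0.440, c=0.560):
    """Eq. (16) under the constraints of Eq. (17)."""
    assert min(a, b, c) >= 0.0 and abs(a + b + c - 1.0) < 1e-9
    return a * q_occ + b * q_bs + c * q_bf

block = np.arange(16, dtype=float).reshape(4, 4)
print(sv_distance(block, block))      # 0.0 for an unchanged block
print(global_error([1.0, 2.0, 4.0]))  # median 2 -> (1 + 0 + 2) / 3 = 1.0
print(suppression_error(1.2, 0.8))    # 0.8
print(overall_error(0.5, 1.0, fusion_error(0.4, 0.6)))
```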

4 Experimental results and analyses

In this section, we optimize the proposed method and then compare its predictive performance with existing SIQA methods through the following experiments: a consistency test, cross-image and cross-distortion tests, and a robustness test. We first briefly introduce the SIQA database.

4.1 Database of SIQA

A database of nine reference stereoscopic images and their 234 corresponding distorted versions, established in our previous research [34], is used to evaluate the predictive performance of the proposed SIQA method. The 234 distorted stereoscopic images are generated with five distortion types: Gaussian blurring (Gblur), white Gaussian noise (Wn), JPEG compression, and JPEG2000 compression, each with five quality levels, and H.264 compression with six quality levels. All the reference stereoscopic images in the database were captured by parallel cameras with a spacing of 50-75 mm, consistent with the interpupillary distance of human eyes. Fig. 3 shows the left views of the stereoscopic images in the SIQA database. Following the Double Stimulus Continuous Quality Scale (DSCQS) testing method described in ITU-R Recommendation BT.500-11 [9], the subjective ratings of the distorted images were collected on a linear polarization stereoscopic display system [16]. Thus, a total of 243 stereoscopic images with varied types and amounts of distortions have been tested to demonstrate the general applicability of the proposed method.

Fig. 3

The left views of stereoscopic images in the SIQA database

4.2 Optimization of the proposed method

In this paper, the SIQA database is divided into two parts for training and testing. The training data set consists of four reference stereoscopic images (i.e., ‘Alt Moabit’, ‘Door flowers’, ‘Kendo’, and ‘Newspaper’) and all their distorted versions, covering indoor and outdoor, close-up and long-shot, complex and simple, and strong and weak depth perception scenes. Therefore, the training data set has a wide variety of image contents and can be regarded as comprehensive. The testing data set consists of the other five reference stereoscopic images (i.e., ‘Akko & Kayo’, ‘Leaving Laptop’, ‘Balloons’, ‘Lovebird1’, and ‘Xmas’) and all their distorted versions. There is no overlap between training and testing. In the proposed method, the block size k and the weights of the three regions (i.e., a, b, and c) are obtained from all distorted images in the training data set. A nonlinear mapping is first employed between the output of the proposed method and the subjective quality score, following the validation method in [9]. The nonlinearity chosen for regression is a five-parameter logistic function (a logistic function with an added linear term, constrained to be monotonic), given by

$$ Quality(x)={\beta}_1\cdot \mathrm{logistic}\left({\beta}_2,\left(x-{\beta}_3\right)\right)+{\beta}_4\cdot x+{\beta}_5 $$
(18)
$$ \mathrm{logistic}\left(\tau, x\right)=1/2-1/\left(1+ \exp \left(\tau \cdot x\right)\right) $$
(19)

This nonlinearity is applied to the output of the proposed method or its logarithm, whichever gives a better fit for all data. As k, a, b, and c in the proposed method are all unknown, the output of the proposed method cannot be obtained directly. Thus, we incorporate the parameter optimization into the fitting between the output of the proposed method and the subjective quality scores.
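The mapping of Eqs. (18) and (19) can be written directly as code. The β values in the example are arbitrary placeholders; in the paper they are fitted on the training set together with k, a, b, and c:

```python
import numpy as np

def logistic(tau, x):
    """Eq. (19)."""
    return 0.5 - 1.0 / (1.0 + np.exp(tau * x))

def quality(x, b1, b2, b3, b4, b5):
    """Eq. (18): five-parameter logistic with an added linear term,
    mapping an objective score x onto the subjective scale."""
    return b1 * logistic(b2, x - b3) + b4 * x + b5

# Since logistic(tau, 0) = 0, quality(b3, ...) reduces to b4*b3 + b5,
# i.e. the logistic term vanishes at the inflection point x = b3.
print(quality(0.5, 2.0, 3.0, 0.5, 4.0, 1.0))  # 4*0.5 + 1 = 3.0
```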

In the proposed method, k denotes the size of the segmented blocks in the left and right views of stereoscopic images. The larger the block size, the higher the computational complexity of the SVD. Therefore, k should be as small as possible, and four block sizes k = {4, 8, 12, 16} are analyzed. The parameters (including a, b, and c) are computed with the Levenberg-Marquardt (LM) and Universal Global Optimization (UGO) algorithms, using the mathematical software 1stOpt (First Optimization) Pro. v1.5 [15]. Table 1 shows the optimal values of a, b, and c for the four k values, along with the corresponding Pearson correlation coefficients (CC) of the proposed method on the training and testing data sets. The closer the value of CC is to 1, the better the performance of the proposed method. As shown in Table 1, for the training data set, the CC is 0.943 when k is 4 and 0.947 when k is 8, with no obvious difference between the two. However, for the testing data set, the CC is 0.936 when k is 4 but only 0.922 when k is 8. Moreover, as k increases further (beyond 8), the CC decreases rapidly. This is because the larger the value of k, the more inaccurate the region classification of the distorted stereoscopic image becomes. Fig. 4 shows the proportions of the three regions (i.e., the occluded region, the pseudo-binocular suppression region, and the pseudo-binocular fusion region) in a distorted ‘Akko & Kayo’ as k varies. It is clear that, as k increases, the proportion of the occluded region increases, the proportion of the pseudo-binocular fusion region decreases, and the proportion of the pseudo-binocular suppression region shows no evident change. 
This means that parts of the pseudo-binocular suppression and pseudo-binocular fusion regions are treated as the occluded region, and parts of the pseudo-binocular fusion region are treated as the pseudo-binocular suppression region. As a result, CC decreases as k increases, and the predictive performance of the proposed method becomes worse. Based on the analyses above, k is set to 4, for which the optimum values of a, b, and c are 0, 0.440, and 0.560, respectively. The value a = 0 indicates that human visual attention focuses on binocular information; the occluded region belongs to monocular information, whose quality cannot be represented by the binocular information of stereoscopic images.
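With these optimised weights, the overall score follows from the three per-region quality indices. The sketch below assumes a linear combination in which a, b, and c weight the occluded, pseudo-binocular suppression, and pseudo-binocular fusion region scores; the function and argument names are hypothetical.

```python
def combine_region_scores(q_occluded, q_suppression, q_fusion,
                          a=0.0, b=0.440, c=0.560):
    # Weighted combination of the three per-region quality indices.
    # The linear form is an assumption; the paper reports only the
    # optimised weights a = 0, b = 0.440, c = 0.560 for k = 4.
    return a * q_occluded + b * q_suppression + c * q_fusion
```

Because a = 0, the occluded (monocular) region drops out and the overall score is driven entirely by the two binocular regions, consistent with the observation above.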

Table 1 The results of a, b, c, CC over four k values
Fig. 4
figure 4

The proportions of three regions over different k values

4.3 Comparison with the existing methods

We compare the predictive performance of the proposed method with five existing SIQA methods: the PSNR-based method, the SSIM-based method, the MSVD-based method (with block size 4) [20], SSIM_d1 [1], and OQM [31]. Although this list of methods is not exhaustive, it is representative of existing SIQA methods. The PSNR-based, SSIM-based, and MSVD-based methods apply PSNR, SSIM, and MSVD, respectively, to estimate the quality of each view separately and then represent the overall stereoscopic image quality as the mean of the per-view results. SSIM_d1 assesses the quality of a stereoscopic image by combining the disparity with the averaged left- and right-view distortions, and is given by \( {d}_1=M\cdot \sqrt{D_d} \), where M is the average SSIM result of the left and right views and D_d is the quality of the disparity, computed as the global correlation coefficient between the reference and distorted disparity maps. OQM combines the quality of the image and the disparity map by \( OQM=\sqrt{IQM}+\sqrt{DQM}+\sqrt{IQM\cdot DQM} \), where IQM is the average SSIM result of the left and right views and DQM is the quality of the disparity map computed with UQI. We analyze the predictive performance of the proposed method against these five SIQA methods through a consistency test, cross-image and cross-distortion tests, and a robustness test.
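The two combination rules above translate directly into code; the function names and example inputs are illustrative.

```python
import math

def ssim_d1(m, d_d):
    # d1 = M * sqrt(D_d): M is the mean SSIM of the two views, D_d the
    # global correlation between reference and distorted disparity maps.
    return m * math.sqrt(d_d)

def oqm(iqm, dqm):
    # OQM = sqrt(IQM) + sqrt(DQM) + sqrt(IQM * DQM), with IQM the mean
    # SSIM of the two views and DQM the UQI score of the disparity map.
    return math.sqrt(iqm) + math.sqrt(dqm) + math.sqrt(iqm * dqm)
```

Both rules multiply or add view-averaged image quality with a single disparity-map quality term, which is why, as discussed below, neither captures the generation of the binocular single vision.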

  1)

    Consistency Test: Experimental results are reported in terms of four criteria used for performance comparison between subjective and objective scores: Pearson linear correlation coefficient CC (prediction accuracy), Spearman rank order correlation coefficient SROCC (monotonicity), root mean squared error RMSE (prediction accuracy), and outlier ratio OR (prediction consistency). For a perfect match between objective and subjective scores, CC = SROCC = 1 and RMSE = OR = 0. Test results for all compared SIQA methods are given as benchmarks in Tables 2-5. Fig. 5 shows scatter plots of the Difference Mean Opinion Score (DMOS) from subjective evaluation versus the scores predicted by the six objective SIQA methods after the nonlinear mapping. The density of data points close to the line y = x reflects the consistency between the predictive method and the subjective evaluation.

    Table 2 Linear correlation coefficient (CC) after nonlinear regression
    Fig. 5
    figure 5

    Scatter plots for the quality predictions by the six methods with the training and the testing data sets. Training: (a) PSNR-based, (b) SSIM-based, (c) MSVD-based, (d) SSIM_d1, (e) OQM, (f) the proposed; Testing: (g) PSNR-based, (h) SSIM-based, (i) MSVD-based, (j) SSIM_d1, (k) OQM, (l) the proposed

    The results in Tables 2, 3, 4, 5 show that the accuracy, monotonicity, and consistency of the proposed method are all better overall than those of the other five methods. The results also demonstrate that the perceived quality of a stereoscopic image is quite different from that of 2D images. For example, human eyes are sensitive to the structural information of images, which gives SSIM good predictive performance on 2D images. For stereoscopic images, however, the binocular single vision formed in HVS from the left and right views is more important, and distortion in either view directly affects its generation. PSNR, which reflects signal errors in images, can predict the signal errors of a stereoscopic image to some extent; therefore, the predictive performance of the PSNR-based method is better than that of the SSIM-based method. Singular values from the SVD are sensitive to perturbations, so the predictive performance of the MSVD-based method is better than that of the PSNR-based method. Since none of the PSNR-based, SSIM-based, and MSVD-based methods take into account the generation of the binocular single vision in HVS, their predictive performances are far worse than that of the proposed method, especially over all data. Even though both SSIM_d1 and OQM take depth perception into account, their ways of combining image quality and depth perception may be unreasonable, worsening their performance on some distortions compared with the SSIM-based method. These two methods also ignore the generation of the binocular single vision in HVS; as a result, the predictions of SSIM_d1 and OQM are inconsistent with subjective quality evaluation. Besides, experimental results show that the PSNR-based and MSVD-based methods predict better than the proposed method for Wn distortion. This is because Wn is an additive noise that mainly affects non-edge areas of images rather than the generation of the binocular single vision, so the subjective quality of such stereoscopic images is determined by the quantity of noise injected into them.

    Table 3 Spearman rank order correlation coefficient (SROCC) after nonlinear regression
    Table 4 Root-mean-squared error (RMSE) after nonlinear regression
    Table 5 Outlier ratio (OR) after nonlinear regression

    Table 6 additionally compares the proposed method with the method in [6]. Since [6] used all 9 stereoscopic images shown in Fig. 3 as test images, the results in Table 6 for the proposed method, unlike those in Tables 2, 3, 4, 5, also correspond to all 9 images. Table 6 shows that the proposed method performs well on Gblur, Wn, and H.264 distortions, while the method in [6] is better suited to JPEG and JPEG 2000 distortions. Over all data, which spans all five kinds of distortion, the proposed method is slightly superior to the compared one.

    Table 6 Comparison between the proposed method and the method in [6]
  2)

    Cross-image and Cross-distortion Tests: Many SIQA methods have been shown to be consistent when applied to distorted images generated from the same reference image with the same distortion type, but their effectiveness degrades significantly when applied to a set of images originating from different reference images and involving a variety of distortions. To further evaluate the effectiveness of the proposed method, cross-image and cross-distortion tests are conducted, which are critical in evaluating a quality assessment method. As shown in Fig. 5 and Tables 2, 3, 4, 5, the proposed method performs better than the other methods. Some methods are sensitive to image content: for SSIM_d1, there is a significant difference in predictive performance on H.264 between the training and testing data sets, with CC values of 0.725 and 0.948, respectively. Other methods are independent of image content but perform well only on individual distortions, not when all distortions are present together (all data): for the MSVD-based method, although all CC values for the five individual distortion types exceed 0.920, the CC value over all data is only 0.890 on the testing data set. For the proposed method, all CC values for the five individual distortions exceed 0.933, and over all data the CC value exceeds 0.935 and the SROCC value exceeds 0.930. This fully demonstrates that the proposed method is a good predictor of the perceived quality of stereoscopic images.

  3)

    Robustness Test: We choose from the SIQA database some distorted stereoscopic images whose left-image PSNR values are close to 28 dB to test the robustness of the six methods. Although these stereoscopic images are injected with almost the same quantity of errors, their perceived quality differs greatly. Table 7 lists these distorted stereoscopic images together with the predicted scores (DMOSp) obtained with the six SIQA methods. The distorted images cover five image contents and five distortion types. The distorted version of ‘Newspaper’ with Wn distortion yields the best perceived quality, with a DMOS of 12.957, while the distorted ‘Akko & Kayo’ with Gblur yields the worst, with a DMOS of 32.217. The perceived quality of a stereoscopic image is therefore sensitive to distortion type. As shown in Table 7, only the proposed method can tell apart the effects of these five distortion types on the perceived quality of stereoscopic images, and the quality rank order it produces coincides with that of the DMOS. Robustness to distortion type is very important for predicting the perceived quality of stereoscopic images, and the proposed method meets this requirement.

    Table 7 The information of distorted stereoscopic images used in robustness test and their predictive scores by six SIQA methods
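The four evaluation criteria used throughout the tests above (CC, SROCC, RMSE, and OR) can be sketched as follows. The fixed outlier threshold here is a simplifying assumption; the outlier ratio is usually defined per image relative to the spread of the subjective scores.

```python
import numpy as np
from scipy import stats

def consistency_metrics(dmos, dmos_pred, outlier_thresh=2.0):
    # CC (prediction accuracy), SROCC (monotonicity), RMSE (accuracy),
    # and OR (consistency): fraction of predictions whose absolute error
    # exceeds a threshold. A fixed threshold is used here for illustration.
    dmos = np.asarray(dmos, dtype=float)
    dmos_pred = np.asarray(dmos_pred, dtype=float)
    cc = np.corrcoef(dmos, dmos_pred)[0, 1]
    srocc = stats.spearmanr(dmos, dmos_pred).correlation
    rmse = np.sqrt(np.mean((dmos - dmos_pred) ** 2))
    outlier_ratio = np.mean(np.abs(dmos - dmos_pred) > outlier_thresh)
    return cc, srocc, rmse, outlier_ratio
```

For a perfect match between objective and subjective scores these return CC = SROCC = 1 and RMSE = OR = 0, matching the ideal values quoted in the consistency test.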

4.4 Summary

Compared with the five existing SIQA methods in terms of the consistency test, the cross-image and cross-distortion tests, and the robustness test, the proposed method has the following advantages.

Overall performance

The proposed method simulates the processing of a stereoscopic image in HVS, applying the properties of binocular suppression, binocular fusion, and binocular summation to the objective quality assessment of stereoscopic images. It outperforms the other methods in terms of accuracy, monotonicity, and consistency.

Cross-image and cross-distortion

The proposed method is based on the mechanisms of binocular vision, and its predictive performance relates only to the generation of the binocular single vision. Therefore, the proposed method generalizes across image content and distortion type, and its predictions conform well to subjective evaluations.

Robustness

To evaluate the robustness of the proposed method, five distorted stereoscopic images whose left images have almost the same PSNR value are chosen. Comparing the differences between the DMOS scores and the scores predicted by the six SIQA methods shows that the proposed method is robust to distortion type and predicts human visual perception well.

5 Conclusions

Following the processing of a stereoscopic image in the human visual system (HVS), this paper has proposed a novel objective stereoscopic image quality assessment method based on binocular vision. We first analyzed the generation of the binocular single vision in HVS and classified the distorted stereoscopic image into three regions: the occluded region, dealt with by monocular vision; the pseudo-binocular fusion region, simulating the region dealt with by binocular fusion; and the pseudo-binocular suppression region, assumed to be dealt with by binocular suppression. As the occluded region refers to monocular vision, we adopted a two-dimensional image quality assessment method to predict its quality. As both the pseudo-binocular fusion region and the pseudo-binocular suppression region relate to binocular vision, we assessed these two regions with different methods according to the mechanisms of binocular fusion and suppression. We then combined the three quality indices into one to represent the overall visual quality of the stereoscopic image. Finally, the predictive performance of the proposed method was analyzed against existing objective quality assessment methods in terms of the consistency test, the cross-image and cross-distortion tests, and the robustness test. Experimental results show that the proposed method outperforms the other methods and is in line with human visual perception. In this paper we only consider the generation of the binocular single vision, while other perceptual attributes of binocular vision need to be considered in future work. Additionally, the region segmentation in this paper, which simply uses the disparity of the original image as a benchmark, is only a rough processing; how to segment the three kinds of regions more reasonably and accurately is also worth considering in the future.