1 Introduction

Image Quality Assessment (IQA) is an emerging artificial intelligence technology that simulates human perception of image quality. Full-Reference IQA (FR-IQA) takes the known original image as a reference to evaluate the quality of the distorted image. However, ideal reference images are often unavailable in practice. Reduced-Reference IQA (RR-IQA) methods were proposed for the case in which only partial information about the reference image is known, but in practical applications even the features of reference images are often difficult to obtain. No-Reference IQA (NR-IQA) predicts image quality without any information from a reference image. Therefore, it is one of the most widely used and challenging tasks.

Deep neural networks have demonstrated a strong ability to capture the basic attributes of images, which provides a new solution for NR-IQA. Convolutional Neural Networks (CNNs) can extract image features quickly and accurately. For example, Kang et al. [11] were pioneers in applying CNNs to IQA. Inspired by the CORNIA method [22], they proposed a meaningful framework and achieved excellent results. However, Kang's network contains only one convolutional layer, which limits the expressiveness of its features. Sun et al. [20] and Bosse et al. [3] added convolutional layers on the basis of Kang's work. In [20], a branching framework based on global and local perception was proposed, in which local and global features were combined to estimate the overall image quality. Reference [3] estimated the quality without employing any domain knowledge. Pan et al. [17] proposed an improved CNN combined with saliency detection; the algorithm used the free-energy neural model [7] to detect the saliency map and then applied the saliency map as a weighting mask to output the quality score of the whole image. The Squeeze-and-Excitation Network [8] was designed to enhance features and improve accuracy. When an existing model cannot meet specific needs, customizing a new architecture may come at the cost of heavy human effort or numerous GPU hours [27]. Zhou et al. [26] divided the training images into high-confidence and low-confidence distorted images, and assigned different local quality scores to each patch through Gaussian functions with the global quality score as the mean and an undetermined hyperparameter as the standard deviation. By mimicking the active inference process of the IGM, Ma et al. [15] established an active inference module based on a generative adversarial network (GAN) to predict the primary content; the image quality is then predicted according to the correlation between the distorted image and its primary content. However, the above networks use convolutional kernels of a single size, so local features at different scales cannot be extracted well, and the computational complexity is relatively high. Therefore, a multiscale CNN using three different convolutional kernels is proposed to form a faster and more effective IQA model.

In CNN-based NR-IQA, most methods process small patches and then use the average score of the patches to predict the quality of the whole image [18]. Therefore, how to select appropriate patches is a topic that deserves attention. Since IQA simulates the perception of image quality by the Human Visual System (HVS), salient areas are more valuable for reference, so saliency detection can be used to select patches. Saliency detection helps humans quickly and accurately select the most important areas from complex images. There are many classic algorithms, such as the earliest visual attention algorithm Itti [9], the LC algorithm [23], the HC algorithm [6], the AC algorithm [1] and the FT algorithm [2]. Itti applies multiple characteristics, multiscale decomposition, and filtering to obtain the saliency map; the LC algorithm uses the difference in pixel values as the saliency value to generate the saliency map; the HC algorithm takes color information into account instead of the gray information used by LC; the AC algorithm obtains the final saliency map by adding the saliency of multiscale blurred images; the FT algorithm proposes five indicators to detect image saliency. The SDSP algorithm [24] integrates the following three priors. First, the behavior of the HVS in detecting salient objects can be well simulated through band-pass filtering. Second, people tend to focus their attention on the center of the image. Third, warm colors attract people's attention more easily than cool colors.

In this paper, in order to relieve the overfitting problem and be more consistent with the HVS, we propose a multiscale CNN for NR-IQA with saliency detection. Firstly, the SDSP algorithm is applied to generate a saliency map used to select appropriate patches; patches with saliency values between given thresholds are retained as training data. Secondly, the sampled patches are fed into the multiscale CNN to extract features. Different features are extracted by different convolutional kernels, so the designed network consists of three branches with multiscale convolutional kernels. Finally, the weighted average of the quality scores of the salient patches gives the final score. The rest of the paper is organized as follows. Section 2 describes the designed NR-IQA algorithm in detail. Section 3 presents comparison experiments and evaluates the performance of our method. The conclusions are given in Section 4.

2 Method for NR-IQA

In this section, the implementation of the algorithm is elaborated. It is divided into two parts. Firstly, a saliency detection algorithm is adopted to select non-overlapping patches. Secondly, the normalized patches are fed into the multiscale CNN to extract features and train the network.

2.1 Patch sampling based on saliency detection

The CNN for NR-IQA takes the patches segmented from the images as training data. However, patches have different reference values for IQA. In order to make the extracted patches representative, a patch sampling strategy will be proposed based on saliency detection.

In this paper, we adopt a salient region detection method, Saliency Detection by combining Simple Priors (SDSP for short), to generate the saliency map. The computation process is shown in Fig. 1.

Fig. 1 Illustration of the computation process of SDSP

Given an image {f(x)|x ∈ Ω}, where Ω ⊂ R2 denotes the image spatial domain, x = (i, j) is the position coordinate. f(x) is actually a vector, containing three values representing R, G, and B intensities at the position x.

The image f(x) in the RGB color space is converted to the CIE L*a*b* color space. In the transformed image, fL(x), fa(x) and fb(x) denote the L*-channel, a*-channel, and b*-channel, respectively. The L*-channel represents the brightness of the pixel, the a*-channel represents green-red information, and the b*-channel represents blue-yellow information. The frequency prior map SF(x) is defined as

$$ {S}_F(x)={\left({\left({f}_L(x)\ast g(x)\right)}^2+{\left({f}_a(x)\ast g(x)\right)}^2+{\left({f}_b(x)\ast g(x)\right)}^2\right)}^{\frac{1}{2}} $$
(1)

where * denotes the convolution operation, g(x) is the transfer function of a log-Gabor filter.

The location prior map is expressed as a Gaussian map:

$$ {S}_D(x)=\mathit{\exp}\left(-\frac{{\left\Vert x-c\right\Vert}_2^2}{\sigma_D^2}\right) $$
(2)

where the center of the image f(x) is denoted by c, and the location variance by σD.

The color prior map is defined as

$$ S_C(x)=1-\exp\left(-\frac{f_{an}^2(x)+f_{bn}^2(x)}{\sigma_C^2}\right) $$
(3)

where σC is a color variance, fan(x) and fbn(x) can be computed as:

$$ f_{an}(x)=\frac{f_a(x)-\min f_a}{\max f_a-\min f_a},\qquad f_{bn}(x)=\frac{f_b(x)-\min f_b}{\max f_b-\min f_b}, $$
(4)

min fa (max fa) is the minimum (maximum) value of fa(x), and min fb (max fb) is the minimum (maximum) value of fb(x).

The image’s final saliency map can be naturally defined as:

$$ S(x)=S_F(x)\cdot S_D(x)\cdot S_C(x) $$
(5)

where SF(x), SD(x) and SC(x) are the maps corresponding to the frequency prior, location prior and color prior, computed by Eqs. (1), (2) and (3), respectively.
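As a rough illustration, the following Python sketch combines the three priors into a saliency map. It is only a simplified approximation: the log-Gabor band-pass filter g(x) of Eq. (1) is replaced here by a difference-of-Gaussians band-pass, rgb2lab from scikit-image is assumed for the color conversion, and the default σC and σD follow the values reported in Section 3.1.

```python
# Sketch of the SDSP saliency map (Eqs. (1)-(5)).
# Assumptions: the log-Gabor band-pass g(x) is approximated by a
# difference-of-Gaussians band-pass; sigma_c and sigma_d default to the
# values of Section 3.1; scikit-image provides the RGB -> CIE L*a*b* conversion.
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.color import rgb2lab

def sdsp_saliency(rgb, sigma_c=0.25, sigma_d=114.0):
    lab = rgb2lab(rgb)                                   # L*, a*, b* channels
    f_l, f_a, f_b = lab[..., 0], lab[..., 1], lab[..., 2]

    # Frequency prior, Eq. (1): band-pass each channel, then take the L2 norm.
    def bandpass(ch):                                    # crude log-Gabor stand-in
        return gaussian_filter(ch, 1.0) - gaussian_filter(ch, 8.0)
    s_f = np.sqrt(bandpass(f_l) ** 2 + bandpass(f_a) ** 2 + bandpass(f_b) ** 2)

    # Location prior, Eq. (2): Gaussian fall-off from the image center c.
    h, w = f_l.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist2 = (yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2
    s_d = np.exp(-dist2 / sigma_d ** 2)

    # Color prior, Eqs. (3)-(4): warm colors (large normalized a*, b*) are salient.
    f_an = (f_a - f_a.min()) / (f_a.max() - f_a.min() + 1e-12)
    f_bn = (f_b - f_b.min()) / (f_b.max() - f_b.min() + 1e-12)
    s_c = 1.0 - np.exp(-(f_an ** 2 + f_bn ** 2) / sigma_c ** 2)

    # Final map, Eq. (5): pointwise product of the priors, rescaled to [0, 1].
    s = s_f * s_d * s_c
    return (s - s.min()) / (s.max() - s.min() + 1e-12)
```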

Figure 2 shows an example of patch sampling. Firstly, for each distorted image, the saliency map is generated by the SDSP algorithm. Secondly, the average saliency value of the saliency map is calculated as

$$ Sa=\frac{\sum \limits_{i=1}^H\sum \limits_{j=1}^WS\left(i,j\right)}{H\times W} $$
(6)

where S(i, j) is the value of the saliency map at the position (i, j), H and W are the height and width of the saliency map respectively.

Fig. 2 Example of patch sampling

Thirdly, if S(i, j) < Sa, set B(i, j) = 0; otherwise, B(i, j) = 1. In this way, a binary image B of the saliency map is generated.

Fourthly, the binary image is segmented into 32 × 32 patches. For each patch Bp, its average saliency value is defined as

$$ {Sa}_p=\frac{\sum \limits_{i=1}^{32}\sum \limits_{j=1}^{32}{B}_p\left(i,j\right)}{32\times 32} $$
(7)

where Bp(i, j) is the value at the position (i, j) in binary patch Bp.

If the average saliency value Sap of patch p lies between the given thresholds Ta1 and Ta2, i.e., Ta1 < Sap < Ta2, the patch is considered salient and retained; otherwise, it is discarded.
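A minimal sketch of this sampling procedure (Eqs. (6) and (7)) is given below, assuming a saliency map of the same height and width as the image (for example from the sdsp_saliency sketch above) and treating the thresholds Ta1 and Ta2 as free parameters.

```python
# Sketch of the saliency-based patch sampling (Eqs. (6)-(7)).
# Assumes "saliency" has the same height and width as "image";
# ta1 and ta2 play the role of Ta1 and Ta2 in the text.
import numpy as np

def sample_salient_patches(image, saliency, patch=32, ta1=0.0, ta2=1.0):
    h, w = saliency.shape
    sa = saliency.mean()                                  # average saliency, Eq. (6)
    binary = (saliency >= sa).astype(np.float32)          # binarized saliency map B

    patches = []
    for i in range(0, h - patch + 1, patch):              # non-overlapping 32x32 grid
        for j in range(0, w - patch + 1, patch):
            sa_p = binary[i:i + patch, j:j + patch].mean()  # patch saliency, Eq. (7)
            if ta1 < sa_p < ta2:                          # keep only salient patches
                patches.append(image[i:i + patch, j:j + patch])
    return patches
```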

2.2 Multiscale CNN

In this section, we elaborate on the data preprocessing, the CNN architecture, and the parameter learning.

2.2.1 Local normalization

In practice, the value range of data is not uniform, which makes the learning process time-consuming. Therefore, each patch is normalized before training. In order to improve the normalization efficiency, the patches are locally normalized to the standard normal distribution. For the value f(i, j) of a pixel at the position (i, j), its normalized value \( \overset{\sim }{f}\left(i,j\right) \) is calculated as follows:

$$ \overset{\sim }{f}\left(i,j\right)=\frac{f\left(i,j\right)-\mu \left(i,j\right)}{\delta \left(i,j\right)+C} $$
(8)
$$ \mu \left(i,j\right)=\frac{\sum \limits_{m=-M}^M\sum \limits_{n=-N}^Nf\left(i+m,j+n\right)}{\left(2M+1\right)\left(2N+1\right)} $$
(9)
$$ \delta \left(i,j\right)=\sqrt{\sum \limits_{m=-M}^M\sum \limits_{n=-N}^N{\left(f\left(i+m,j+n\right)-\mu \left(i,j\right)\right)}^2} $$
(10)

where C is a small constant that maintains numerical stability, M and N determine the size of the normalization window, and μ(i, j) and δ(i, j) are the local mean and local deviation around the pixel at position (i, j), respectively.
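The following sketch applies this local normalization to a patch, assuming M = N = 3 and C = 1 as in Section 3.1; δ follows Eq. (10) literally, i.e., the square root of the summed squared deviations over the window.

```python
# Sketch of the local normalization of Eqs. (8)-(10), assuming a
# (2M+1) x (2N+1) window with M = N = 3 and C = 1 as in Section 3.1.
# delta follows Eq. (10) literally (square root of the summed squared
# deviations over the window, without dividing by the window size).
import numpy as np
from scipy.ndimage import uniform_filter

def local_normalize(patch, M=3, N=3, C=1.0):
    patch = np.asarray(patch, dtype=np.float64)
    win = (2 * M + 1, 2 * N + 1)
    n = win[0] * win[1]
    mu = uniform_filter(patch, size=win)                          # local mean, Eq. (9)
    sum_sq = n * np.maximum(uniform_filter(patch ** 2, size=win) - mu ** 2, 0.0)
    delta = np.sqrt(sum_sq)                                       # Eq. (10)
    return (patch - mu) / (delta + C)                             # Eq. (8)
```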

2.2.2 Network structure

The network consists of three parts: patch sampling, the CNN model, and quality evaluation, as shown in Fig. 3. The designed CNN includes three branches with multiscale convolutional kernels, whose sizes are chosen as 3 × 3, 5 × 5, and 7 × 7. Each branch includes five convolutional and five pooling layers, and the three branches are fused after the last pooling layer. In order to merge the three scales, zero padding is used in the convolutions and the stride is 1 pixel. All pooling layers adopt 2 × 2 max pooling. Dropout with a ratio of 0.5 is added in the first fully connected layer to improve the generalization ability of the model and reduce overfitting. The Rectified Linear Unit (ReLU) is used as the activation function in the two fully connected layers.

Fig. 3 Architecture of multiscale CNN

Table 1 shows the detailed parameters of the CNN, including the sizes of the convolutional kernels and the number of feature maps. First, a patch of size 32 × 32 × 1 passes through the first convolutional layer (C1) with kernels of size 3 × 3 × 32, 5 × 5 × 32, and 7 × 7 × 32, producing three groups of feature maps of size 32 × 32 × 32. A 2 × 2 max pooling follows and reduces the feature maps to 16 × 16 × 32. Through the second convolutional layer (C2) and pooling layer with the same settings, three groups of feature maps of size 8 × 8 × 32 are produced. After five such convolution and pooling stages, three groups of features of size 1 × 1 × 32 are obtained. Finally, the three feature maps are fused into a 1 × 1 × 96 feature, and the quality score is predicted through two fully connected layers with 128 nodes each and a simple linear regression layer with one node.

Table 1 Parameters of proposed multiscale CNN
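The following PyTorch sketch illustrates one plausible realization of this architecture. It follows the description above (three branches with 3 × 3, 5 × 5 and 7 × 7 kernels, five conv + 2 × 2 max-pool stages of 32 feature maps per branch, fusion into a 96-dimensional feature, two 128-node fully connected layers with ReLU and dropout 0.5, and a one-node linear output); the ReLU after every convolution is our assumption, since the text only specifies the activation of the fully connected layers.

```python
# A PyTorch sketch of the three-branch multiscale CNN described above:
# each branch stacks five (conv + 2x2 max-pool) stages with 32 feature maps,
# zero padding keeps the spatial size so that only pooling shrinks 32 -> 1,
# and the fused 1x1x96 feature is regressed to a single quality score.
import torch
import torch.nn as nn

class Branch(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        pad = kernel_size // 2                       # "same" zero padding, stride 1
        layers, in_ch = [], 1
        for _ in range(5):                           # five conv + pool stages
            layers += [nn.Conv2d(in_ch, 32, kernel_size, padding=pad),
                       nn.ReLU(inplace=True),        # assumed activation after conv
                       nn.MaxPool2d(2)]
            in_ch = 32
        self.body = nn.Sequential(*layers)

    def forward(self, x):                            # (B, 1, 32, 32) -> (B, 32, 1, 1)
        return self.body(x)

class MultiscaleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList(Branch(k) for k in (3, 5, 7))
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(96, 128), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1))                       # linear regression to the score

    def forward(self, x):
        feats = [b(x) for b in self.branches]        # three 1x1x32 feature maps
        return self.head(torch.cat(feats, dim=1))    # fused 1x1x96 feature
```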

2.2.3 Parameter learning

For training the network, each 32 × 32 non-overlapping patch is assigned the ground-truth quality score of its source image. The loss function is defined as:

$$ \boldsymbol{Loss}=\frac{1}{N}{\sum}_{i=1}^N{\left\Vert {y}_i-F\left({p}_i;\omega \right)\right\Vert}_{l_1} $$
(11)

where yi denotes the ground-truth score, F(pi; ω) denotes the estimated score of the input patch pi, N is the total number of image patches, and ω is the network parameter to be learned. Parameter learning is achieved by minimizing the loss function. The parameters are updated by solving the following optimization problem:

$$ {\omega}^{\prime }=\underset{\omega }{\arg \min }\ \boldsymbol{Loss} $$
(12)

where ω′ is the updated parameter. The loss function is optimized by the Adam method, as follows.

1) Calculate the gradient of the loss function with respect to the parameter ωt at time t:

$$ {d}_t=\nabla \boldsymbol{Loss}=\frac{\partial \boldsymbol{Loss}}{\partial \left({\omega}_t\right)} $$
(13)

2) Calculate the first-order momentum at time t:

$$ {m}_t={\beta}_1{m}_{t-1}+\left(1-{\beta}_1\right){d}_t $$
(14)

Correct the first order momentum:

$$ {\hat{m}}_t=\frac{m_t}{1-{\beta}_1^t} $$
(15)

3) Calculate the second-order momentum at time t:

$$ {v}_t={\beta}_2{v}_{t-1}+\left(1-{\beta}_2\right){d}_t^2 $$
(16)

Correct the second order momentum:

$$ {\hat{v}}_t=\frac{v_t}{1-{\beta}_2^t} $$
(17)

4) Calculate the descent gradient at time t:

$$ {\eta}_t= lr\cdot \frac{{\hat{m}}_t}{\sqrt{{\hat{v}}_t}}= lr\cdot \frac{\frac{m_t}{1-{\beta}_1^t}}{\sqrt{\frac{v_t}{1-{\beta}_2^t}}} $$
(18)

5) Update the parameters ωt + 1 at time t + 1:

$$ {\omega}_{t+1}={\omega}_t-{\eta}_t={\omega}_t- lr\cdot \frac{\frac{m_t}{1-{\beta}_1^t}}{\sqrt{\frac{v_t}{1-{\beta}_2^t}}+\varepsilon } $$
(19)

where β1 and β2 are constants between 0 and 1, the parameter ε is a very small number to prevent dividing by zero in the implementation, the learning rate is denoted by lr.
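A compact NumPy sketch of one update following Eqs. (13)-(19) is given below; β1, β2, ε and lr take the values listed in Section 3.1. In practice the same update is available off-the-shelf, for example torch.optim.Adam together with an l1 loss such as nn.L1Loss for Eq. (11).

```python
# NumPy sketch of a single Adam update following Eqs. (13)-(19); beta1, beta2,
# eps and lr take the values given in Section 3.1.
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad               # first-order momentum, Eq. (14)
    v = beta2 * v + (1 - beta2) * grad ** 2          # second-order momentum, Eq. (16)
    m_hat = m / (1 - beta1 ** t)                     # bias correction, Eq. (15)
    v_hat = v / (1 - beta2 ** t)                     # bias correction, Eq. (17)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)      # update step, Eqs. (18)-(19)
    return w, m, v
```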

3 Experimental results

In this section, we first describe the experimental setups, including the datasets, evaluation indicators, and experimental parameters. Then the performance of the multiscale CNN is compared with other IQA methods on the LIVE dataset. To investigate the generalization ability of our multiscale CNN, the CNN trained on the LIVE dataset is validated on the CSIQ dataset.

3.1 Experimental setups

The experiments were conducted on two widely used IQA datasets, LIVE [19] and CSIQ [5]. The LIVE dataset consists of 29 source reference images and 982 distorted images with five distortions: JPEG2000 compression (JP2K), JPEG compression (JPEG), white Gaussian noise (WN), Gaussian blur (BLUR), and fast fading (FF). Differential Mean Opinion Scores (DMOS) in the range [0, 100] represent the subjective quality of the images; a higher DMOS corresponds to lower image quality. The CSIQ dataset consists of 30 source reference images and 866 distorted images with six distortions: JP2K, JPEG, WN, BLUR, FN, and CONTRAST. A DMOS in the range [0, 1] is associated with each image.

Two measures, the Spearman Rank Order Correlation Coefficient (SROCC) and the Linear Correlation Coefficient (LCC), were employed to evaluate the performance of the IQA algorithms. SROCC assesses the monotonic relationship between the estimated scores and the ground-truth scores, and LCC measures the degree of linear dependence between the two.
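For reference, both indicators can be computed directly with SciPy, as in the following sketch.

```python
# Sketch of the two indicators with SciPy: SROCC (Spearman rank correlation)
# and LCC (Pearson linear correlation) between predicted and subjective scores.
from scipy.stats import spearmanr, pearsonr

def evaluate(predicted, ground_truth):
    srocc, _ = spearmanr(predicted, ground_truth)
    lcc, _ = pearsonr(predicted, ground_truth)
    return srocc, lcc
```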

The hyperparameters involved in the experiments are set as follows. In the patch sampling, σC = 0.25 and σD = 114, and the fixed thresholds are Ta1 = 1.1 and Ta2 = 1.8. In local normalization, C = 1 and M = N = 3. In parameter learning, the optimization parameters were selected as β1 = 0.9, β2 = 0.999, ε = 10−8, and lr = 0.001. The experimental results were obtained from 95 train-test iterations. In each iteration, the batch size was 64; 60% of the data was randomly selected for training, 20% for validation, and the remaining 20% for testing. The experimental results showed that this setting could nearly halve the training time while achieving better performance.

3.2 Performance on LIVE dataset

The network was trained with the five distortions, and the evaluation indicators SROCC and LCC are compared with those of seven advanced NR-IQA algorithms in Tables 2 and 3. The comprehensive results over all distortions are listed in the last column of each table, and the best results among the algorithms are shown in bold. Our method works well on each of the five distortions, especially on JP2K, JPEG, and WN. Although the results on the distortion types BLUR and FF are slightly lower than those of other algorithms, our algorithm outperforms the others on the comprehensive distortion set.

Table 2 SROCC of different methods on LIVE
Table 3 LCC of different methods on LIVE

3.3 Cross-dataset validation

To evaluate the generalization ability of our multiscale CNN, we trained the model on the LIVE dataset and tested it on the CSIQ dataset, using only the four distortions (JPEG, JP2K, BLUR, WN) shared with LIVE. Table 4 shows the SROCC and LCC on these four distortions. The results show that our method has excellent robustness and outperforms the other algorithms.

Table 4 SROCC and LCC on the shared distortions

The CSIQ dataset contains two types of distortion (FN, CONTRAST) that are not shared with LIVE, so these two distortions are often omitted in the literature. In this paper, our multiscale CNN was tested on all distortions from CSIQ. Since the network is not trained on the distortion types FN and CONTRAST, Table 5 shows that all algorithms perform worse on the full set of distortions than on the shared distortions. However, our algorithm surpasses the other algorithms on both the shared and the full distortion sets.

Table 5 SROCC on all distortions

3.4 Effect of the size of the convolutional kernels

Different sizes of convolutional kernels extract different features from images. Comparative experiments were designed to analyze the effect of the size of the convolutional kernels, and the results are shown in Fig. 4. In the network, the single-size convolutional kernels are compared with their multiscale fusion under the same conditions. The blue line represents the network with a convolutional kernel size of 3 × 3, the orange line 5 × 5, the gray line 7 × 7, and the red line their fusion. As can be seen, the multiscale CNN shows the best performance on both the SROCC and LCC indicators.

Fig. 4 SROCC and LCC under different sizes of convolutional kernels

3.5 Effect of the patch sampling

The CNN was trained and tested on the LIVE dataset with and without saliency sampling, respectively. The experimental results in Table 6 show that the network with saliency sampling achieves better performance.

Table 6 Performance comparison under salient sampling

Some conclusions follow from an overall analysis of the experimental results. First, the multiscale CNN can effectively improve the accuracy of the model. Second, saliency sampling can increase the efficiency of the CNN. Therefore, the proposed multiscale CNN based on saliency detection can effectively evaluate the quality of distorted images and improves the performance of NR-IQA.

4 Conclusion

In this paper, a multiscale CNN for NR-IQA with saliency detection was proposed. Human vision always focuses on the salient areas of an image. In the proposed model, saliency detection was used to filter the training data, and the filtered data were fed into the CNN for training, which is consistent with the characteristics of the Human Visual System (HVS) and improves training efficiency. In addition, the field of view of human vision is usually not fixed, so three-scale convolutional kernels were designed to extract richer features. Experiments show that the proposed network not only has fewer parameters, but also achieves higher performance.

However, the patch size and convolutional kernel sizes in our model are determined empirically, and it is not ideal to use the same sizes for input images of different types and sizes. In future work, designing an adaptive mechanism that selects the patch size and kernel sizes according to the type and size of the image will be a promising topic.