1 Introduction

Over the years, scientists have sought effective methods to assess image quality automatically, without the need for human observers. There are three main ways to tackle the problem, depending on the available information. The most mature approaches use the reference image to quantify the fidelity between the reference and the processed image. These so-called full-reference (FR) metrics, usually based on models of the human visual system, try to predict the visibility of the observed degradation without any assumption on the degradation process [1]. In real-world applications where the reference image is not available, no-reference (NR) approaches attempt to assess the quality of a processed image without any cue from its unprocessed version. To reach this goal, they usually assume some prior information on the degradation process. Consequently, most NR techniques are conceived for a specific distortion type and cannot be generalized [2]. Reduced-reference (RR) approaches provide a good compromise between FR and NR, as only partial information is involved. They rely on perceptually relevant features extracted from the reference image; these features are used at the receiver side to measure the visual quality degradation. As the measure proposed in this paper falls in the RR category, we focus on reduced-reference methods in the following.

The general block diagram of RRIQA is given in Fig. 1. At the sender side, a representative feature is extracted from the original image. This side information is transmitted along with the image through an ancillary channel. At the receiver side, the same feature is extracted from the distorted image and compared to the one received through the ancillary channel. The measured distortion is then used to assess the visual quality. The data rate of the ancillary channel is an essential parameter when assessing the performance of RR techniques: if the side information transmitted through the ancillary channel is the original image itself, we are back to the FR case, while if no information is sent, we are in the NR case. The ultimate goal is to predict the quality with high accuracy while using a minimum of side information. Note that techniques based on watermarking [3, 4] can hardly be classified in these three categories.

Although there are few studies dealing with RRIQA, we may distinguish three main orientations, namely: image distortion modeling, human visual system (HVS) modeling, and natural image statistics modeling.

Fig. 1 The general deployment scheme of the RRIQA measure

Image distortion modeling solutions are conceived for specific applications, where the nature of the degradation process is known. In that case, the extracted feature is made of a set of features, each sensitive to a particular distortion type. For instance, the hybrid image quality metric proposed in [5] is designed to account for four types of artifacts (blocking, blurring, lost blocks, and ringing) observed in JPEG transmission schemes; luminance and texture masking effects are also incorporated in the measure. Given the limited scope of the distortions considered, one cannot expect a universal solution from these methods.

HVS modeling approaches rely on features computed at interest points, processed at either side by a human visual system model. These general-purpose methods are very similar to the FR metrics based on the HVS, but rather than using the entire image, a carefully selected sample of pixels is used to assess the visual degradation. In their recent work [6], Carnec et al. used an HVS model integrating early vision phenomena (perceptual color space, contrast sensitivity, subband decomposition, masking effect). The characteristic points used are regularly distributed on concentric ellipses centered at the image center. In addition to the difficulty of modeling the HVS, the choice of the characteristic points is very sensitive in these methods.

The idea behind natural image statistics-based models is that distortions make an image appear unnatural and affect its statistics. These techniques try to predict the quality from the variations between the statistics of the distorted and the reference image. This approach was introduced by Wang et al. [7]. The proposed measure, called the wavelet natural image statistics metric (WNISM), is based on steerable pyramids (a redundant transform of the wavelet family). The image features are the estimated parameters of the generalized Gaussian distribution (GGD) for all subbands. The quality metric is computed using the Kullback–Leibler (KL) divergence between the parameterized reference and distorted images. WNISM is to date the most prominent RR quality metric.

In the same vein, Li et al. investigated a divisive normalization-based transform (DNT) [8]. It uses steerable pyramids as the representation domain and the Gaussian scale mixture (GSM) as a statistical model of the subband coefficients. The quality metric is also computed using the KL divergence. This leads to a general-purpose RRIQA method, but the divisive normalization transform is computationally complex. In [9], the construction of the strongest component map (SCM) is proposed as the representation domain. The Weibull distribution parameters are estimated from the histogram of the SCM coefficients. Finally, only the scale parameter \(\beta \) is involved in a measure called \(\beta \)W-SCM. Experiments with the LIVE dataset show a significant correlation between the model predictions and the subjective scores, nearly the same as WNISM. Note that this result is obtained at a lower side information rate.

Recently, Soundararajan et al. introduced the reduced reference entropic differencing index (RRED) [10]. The steerable pyramid transform is used, and entropies are computed on blocks of the transformed reference image. The quality metric is obtained by computing entropy differences between subband coefficients of the reference and distorted images. The efficiency of this metric depends on the amount of side information used. Results showed that this metric outperforms competing methods for several distortion types in the TID2008 database when the amount of side information is significant compared to the size of the reference image. For a more complete overview of RR and NR methods based on natural image statistics, the reader can refer to the recent paper by Wang et al. [11].

Although WNISM is still the standard, there is room to further improve the performance of natural image statistics RRIQA schemes. The main challenge is to find an appropriate image representation and an efficient feature well correlated with visual perception. From this perspective, we propose to substitute a more efficient image representation into the general scheme. Indeed, the steerable pyramid is a non-adaptive transform: it depends on a basis function which cannot fit all signals. Furthermore, the wavelet transform provides a linear representation which cannot reflect the nonlinear masking phenomenon in human visual perception [12].

Empirical mode decomposition (EMD) was introduced by Huang et al. [13]. It aims to decompose non-stationary and nonlinear signals into a finite number of components: intrinsic mode functions (IMF) and a residue. In contrast to wavelets, EMD is a nonlinear and adaptive method: since no basis function is needed, it depends only on the image content. Motivated by the advantages of the EMD over the wavelet in signal and image processing applications, and to remedy the wavelet drawbacks, we propose to use its bidimensional extension (BEMD) as the representation domain.

As distortions affect the IMF coefficients and hence their distribution, the marginal distribution of IMF coefficients is a reasonable choice of feature. Thereafter, we will refer to this new RRIQA measure with the acronym EMISM, which stands for empirical mode image statistic measure.

In addition to the objective measure introduced in this paper, an alternative to logistic function-based regression is investigated. In the literature, most RR methods use a logistic function-based regression to predict mean opinion scores from the values given by an objective measure; these predicted scores are then compared, in terms of correlation, with the existing subjective scores. We propose to use a classification scheme instead of the nonlinear regression to grade the image quality. Indeed, subjective experiments very often involve a five-category quality scale (excellent, good, fair, poor, and bad) rather than a continuous quality scale. Furthermore, we believe that this task is more in line with human visual perception.

Part of this work was presented in [14], where EMISM was evaluated using the LIVE dataset. In this paper, we complete this study by conducting an exhaustive comparison taking into account a wider field of degradation processes. To do so, we use three datasets: IVC [15], LIVE [16], and TID 2008 [17]. Each dataset contains carefully selected images which undergo several distortion types at various distortion levels covering the full range of the subjective quality scale. The TID 2008 dataset is the most complete, as it covers 17 distortion types with four different levels of distortion. It is particularly suited to investigate more deeply the generic nature of EMISM as compared to alternative solutions.

The rest of this paper is organized as follows: Sect. 2 presents the general scheme of the proposed RRIQA. Section 3 is devoted to the EMD; we present the algorithm and the results of an empirical evaluation of the effectiveness of the BEMD as compared to the wavelet representation. In Sect. 4, the distribution model and the impact of image degradation on the marginal statistics of the IMFs are reported. Section 5 presents the distortion measure used to compute the quality score through logistic regression or a classification process. Section 6 describes the experimental setup. Section 7 reports the comparative evaluation of the proposed metric with alternative solutions from the literature for a broad range of distortions. Finally, a conclusion ends the paper.

2 Natural image statistics RRIQA

If we consider the general framework of a natural image statistics RRIQA scheme, the feature extraction process transforms the image into a suitable representation from which statistics are extracted. At the receiver side, the same feature is extracted and compared to assess the visual quality. The deployment scheme of the proposed RRIQA measure is given in Fig. 2. At the sender side, the BEMD decomposition of the reference image is computed. The marginal distribution of the IMF coefficients is then estimated and used as the representative feature of the original image. This can be done using either parametric or nonparametric statistics. In the nonparametric case, the histogram of IMF coefficients is a good candidate estimator. Nevertheless, it raises the question of the amount of side information to be transmitted: a coarse bin size yields low approximation accuracy but a small data rate, while a fine bin size yields good accuracy but a larger data rate. To avoid this problem, it is more convenient to assume a theoretical distribution for the marginal distribution of the IMF coefficients and to estimate its parameters. In this case, the only side information to be transmitted is the estimated parameters and, possibly, an error term between the empirical distribution and the estimated one. This raises the issue of the effectiveness of the model. We propose to use the generalized Gaussian density (GGD) for its ability to model a wide range of symmetric distributions with heavier or lighter tails than the Gaussian density. It provides a good approximation of the IMF coefficients histogram using only two parameters. Moreover, we consider the fitting error between the empirical and estimated IMF distributions. Finally, at the receiver side, we use the extracted features to compute the global distance over all IMFs.

Fig. 2 The deployment scheme of the proposed RRIQA measure

3 The bidimensional empirical mode decomposition

3.1 The EMD algorithm

The empirical mode decomposition (EMD) was introduced as a data-driven algorithm: it is based purely on properties observed in the data, without predetermined basis functions. The main goal of EMD is to extract the oscillatory modes that represent the highest local frequencies in a signal, the remainder being considered as a residual. These modes are called intrinsic mode functions (IMF). An IMF is a function that satisfies two conditions:

1. The function should be symmetric in time, and the number of extrema and the number of zero crossings must be equal, or differ by at most one.

2. At any point, the mean value of the upper envelope and the lower envelope must be zero.

The so-called sifting process works iteratively on the signal to extract each IMF. Let \(x(t)\) be the input signal; the EMD algorithm is summarized as follows:

(i) Identify all the local maxima and minima of \(x(t)\).

(ii) Interpolate between the maxima to form the upper envelope \(e_{u}(t)\), and between the minima to form the lower envelope \(e_{l}(t)\).

(iii) Compute the local mean of the envelopes \(e_{m}(t)=(e_{u}(t)+e_{l}(t))/2\).

(iv) Subtract it from the signal to obtain the detail \(d(t)=x(t)-e_{m}(t)\).

The sifting process consists in iterating steps (i) to (iv) on the detail signal \(d(t)\) until the latter can be considered zero mean. The resulting signal is designated as an IMF, and the residual is then taken as the input signal for the next IMF. The algorithm terminates when a stopping criterion or a desired number of IMFs is reached. After the IMFs are extracted through the sifting process, the original signal \(x(t)\) can be represented by:

$$\begin{aligned} x(t)=\sum _{j=1}^{n}\text{ IMF}_{j}(t)+m(t) \end{aligned}$$
(1)

where \(\text{ IMF}_{j}\) is the \(j\)th extracted IMF, \(n\) is the total number of IMFs, and \(m(t)\) is the final residue.
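To make the sifting loop concrete, a minimal one-dimensional sketch in Python is given below. It assumes cubic-spline envelopes and a simple zero-mean stopping rule (common but not unique choices), and it only illustrates steps (i)–(iv) and Eq. 1; it is not the order-statistics-filter BEMD implementation used in our experiments.

```python
# Minimal 1-D EMD sketch: cubic-spline envelopes, zero-mean stopping rule.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x, t):
    """One pass of steps (i)-(iv): envelopes, local mean, detail signal."""
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:   # too few extrema: x is a residue
        return None
    upper = CubicSpline(t[maxima], x[maxima])(t)   # upper envelope e_u(t)
    lower = CubicSpline(t[minima], x[minima])(t)   # lower envelope e_l(t)
    return x - (upper + lower) / 2.0               # detail d(t) = x(t) - e_m(t)

def emd(x, t, n_imfs=4, tol=0.05, max_sift=50):
    """Decompose x into IMFs plus a residue, as in Eq. 1."""
    imfs, residue = [], x.astype(float).copy()
    for _ in range(n_imfs):
        d = residue.copy()
        for _ in range(max_sift):
            d_new = sift_once(d, t)
            if d_new is None:                       # no more oscillations
                return imfs, residue
            d = d_new
            if abs(d.mean()) < tol * np.abs(x).mean():  # near zero mean
                break
        imfs.append(d)
        residue = residue - d
    return imfs, residue
```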

In two dimensions (BEMD), the algorithm remains the same as in one dimension, with a few changes: the curve fitting for extrema interpolation is replaced by surface fitting, which increases the computational complexity of identifying extrema and, especially, of interpolating them. To extract IMFs, several two-dimensional EMD versions have been developed [18–20]. They differ mainly in the interpolation method they use. The most recent one proposes an interpolation based on order-statistics filters; from a computational cost standpoint, it is a fast implementation, since only one iteration is required to compute each IMF. This is reason enough to use it in our experiments. Figure 3 illustrates an application of the BEMD.

Fig. 3 Three iterations of the BEMD decomposition on a sample image

The resulting IMFs from the BEMD retain the highest frequencies at each decomposition level. Note that the frequency content decreases as the order of the IMF increases.

3.2 Comparison of the EMD with the steerable pyramid transform

The steerable pyramid is a type of redundant wavelet transform that avoids aliasing in subbands. It performs a polar-separable decomposition in the frequency domain, thus allowing independent representation of scale and orientation. The basis functions of the steerable pyramid are directional derivative operators that come in different sizes and orientations.

The EMD method is a local and adaptive method in frequency–time analysis. It decomposes the signal into a number of intrinsic mode functions (IMFs) and a residue. Both representations achieve a decomposition into fluctuations and trend, but scales are predetermined for the steerable pyramid transform and data-driven for the EMD.

In the steerable pyramid, the basis functions are fixed, and this representation does not necessarily match the varying nature of images. In contrast, the basis functions of the EMD are derived from the image content, allowing an adaptive analysis that may be better suited to model visual information. Indeed, the EMD has shown its superiority to wavelets in many signal and image processing algorithms [21–24].

In order to assess the effectiveness of the EMD as compared to the steerable pyramid transform, we conducted a large set of experiments. To compare both representations under similar conditions, we decomposed all the original images of the three databases under investigation using four IMF levels for the EMD and the first decomposition level with four orientations for the steerable pyramid. Then, we compared the errors between the reconstructed and reference images. These experiments lead to two main conclusions. First, the PSNR values are, in every case, far higher with the EMD than with the steerable pyramid. Furthermore, the structure of the errors is very different: for the EMD, errors are localized and concentrated at the edges, while for the steerable pyramid, they are spread over the whole image and more diffuse around the edges. Note that these observations are independent of the image content. Figure 4 illustrates this behavior through a typical example of the error images and the PSNR for both transforms. To summarize, these results confirm previous findings in the literature that, with the same amount of data, the EMD is a better-suited image representation than the steerable pyramid.

Fig. 4 Error images obtained from both representations: a, c, and e are obtained from the BEMD, while b, d, and f are obtained from the steerable pyramid transform. The PSNR is computed between the original image and its corresponding reconstructed image

4 Image statistics in the IMF domain

Selecting a representative feature is a critical step of RR methods. On the one hand, the extracted features should be sensitive to a wide range of distortion types, at different distortion levels. On the other hand, the extracted features should be as compact as possible. Our RRIQA method relies essentially on the variation of the marginal distribution of the IMF coefficients. Therefore, it is essential to check that this distribution is really influenced by the degradation process. Figure 5 reports the histogram of the coefficients of the first extracted IMF for distorted images under various distortion types. The estimated distribution of the first IMF coefficients of the original image is also reported for comparison.

Fig. 5 Histograms of IMF coefficients under various distortion types: a original image, b additive Gaussian noise contaminated image, c Gaussian blurred image, d JPEG compressed image, e JPEG transmission errors distorted image. Solid curves: histogram of IMF coefficients. Dashed curves: GGD model fitted to the histogram of IMF coefficients in the original image

Note that the other extracted IMFs exhibit a similar behavior. As can be seen, the marginal distributions of the distorted images always deviate significantly from that of the original image. It is also worth noticing that the observed differences depend on the type of distortion. For example, additive Gaussian noise increases the width of the histogram (Fig. 5b), Gaussian blur reduces the width of the histogram and increases its peak (Fig. 5c), JPEG compression changes only the peak (Fig. 5d), and JPEG transmission errors present the smallest deviation from the original image (Fig. 5e). Throughout the tests we performed, we found that the IMF coefficients histogram exhibits a non-Gaussian behavior, with a sharp peak at zero and heavier tails than the Gaussian distribution, as can be seen in Fig. 5a. Such a distribution can be well fitted with a two-parameter generalized Gaussian density (GGD) model:

$$\begin{aligned} p(x)=\frac{\beta }{2\alpha \Gamma (\frac{1}{\beta })}\exp \left(- \left(\frac{|x|}{\alpha } \right)^\beta \right) \end{aligned}$$
(2)

where \( \Gamma (z)=\int _0^{\infty }\text{ e}^{-t}t^{z-1}\text{ d}t,\; z >0\) is the Gamma function, \(\alpha \) is the scale parameter, and \(\beta \) is the shape parameter of the distribution. Parameter estimation via maximum likelihood and via the method of moments have both been proposed. We choose the moment matching method [25] for its simplicity.
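For concreteness, a sketch of the moment matching estimation is given below. It relies on the standard GGD moment relations \(E|x|=\alpha \Gamma (2/\beta )/\Gamma (1/\beta )\) and \(E x^2=\alpha ^2 \Gamma (3/\beta )/\Gamma (1/\beta )\), and inverts the resulting \(\beta \)-dependent moment ratio numerically; the bracketing interval for \(\beta \) is an illustrative assumption.

```python
# Moment-matching estimation of the GGD parameters (alpha, beta) of Eq. (2).
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def fit_ggd(coeffs):
    x = np.asarray(coeffs, dtype=float).ravel()
    m1 = np.mean(np.abs(x))    # E|x|  = alpha * Gamma(2/b) / Gamma(1/b)
    m2 = np.mean(x ** 2)       # E x^2 = alpha^2 * Gamma(3/b) / Gamma(1/b)
    ratio = m1 ** 2 / m2       # depends only on the shape parameter b
    r = lambda b: gamma(2.0 / b) ** 2 / (gamma(1.0 / b) * gamma(3.0 / b))
    beta = brentq(lambda b: r(b) - ratio, 0.1, 10.0)    # shape parameter
    alpha = m1 * gamma(1.0 / beta) / gamma(2.0 / beta)  # scale parameter
    return alpha, beta
```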

5 Distortion measure

A dissimilarity measure is required to quantify the differences observed between two distributions. Several distances exist in the literature: the chi-square statistic, the quadratic form distance, the match distance, the Kolmogorov–Smirnov distance, and the Kullback–Leibler divergence (KLD). The last is frequently used in image retrieval systems, since it is a convenient way to compute the divergence between two probability density functions (PDFs). Let us consider \(\text{ IMF}_O\), an IMF from the original image, and \(\text{ IMF}_D\), its counterpart from the distorted image. Assuming that \(p(x)\) and \(q(x)\) are the PDFs of \(\text{ IMF}_O\) and \(\text{ IMF}_D\), respectively, the KLD between them is defined as:

$$\begin{aligned} d(p\Vert q)=\int p(x)\log \frac{p(x)}{q(x)}\text{ d}x \end{aligned}$$
(3)

To compute this quantity, the histograms of the original image must be available at the receiver side. As we choose to transmit the estimated parameters of the GGD model rather than the original histograms of the IMF coefficients, we also transmit the error term between the estimated and empirical distributions of the IMF coefficients of the original image. Denoting by \(p_m(x)\) the approximation of \(p(x)\) by a two-parameter GGD model, our feature contains a third characteristic, the prediction error, defined as the KLD between \(p_m(x)\) and \(p(x)\):

$$\begin{aligned} d(p_m\Vert p)=\int p_m(x)\log \frac{p_m(x)}{p(x)}\text{ d}x \end{aligned}$$
(4)

In practice, this quantity can be computed as follows:

$$\begin{aligned} d(p_m\Vert p)=\sum _{i=1}^{L} P_m(i)\log \frac{P_m(i)}{P(i)} \end{aligned}$$
(5)

where \(P(i)\) and \(P_m(i)\) are the normalized heights of the \(i\)th histogram bin, and \(L\) is the number of bins in the histograms. \(P_m(i)\) is obtained by computing \((\hat{\alpha },\hat{\beta })\), the estimators of \((\alpha ,\beta )\), and integrating the fitted distribution over the range of the bin:

$$\begin{aligned} P_m(i)=\frac{\hat{\beta }}{2\hat{\alpha }\Gamma (\frac{1}{\hat{\beta }})}\int \limits _{R_{\min }}^{R_{\max }}\exp \left(-\left(\frac{|x|}{\hat{\alpha }} \right)^{\hat{\beta }} \right)\text{ d}x \end{aligned}$$
(6)

where \(R_{\min }\) and \(R_{\max }\) denote the lower and upper bounds of the \(i\)th histogram bin.
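In practice, Eq. 6 is a difference of the GGD cumulative distribution function evaluated at the bin edges. As a sketch, scipy's gennorm distribution implements exactly the density of Eq. 2 (shape \(\beta \), scale \(\alpha \)), so the bin probabilities can be computed as:

```python
# Model bin probabilities P_m(i) of Eq. (6): integrate the fitted GGD over
# each histogram bin via differences of the gennorm CDF at the bin edges.
import numpy as np
from scipy.stats import gennorm

def model_bin_probs(alpha_hat, beta_hat, edges):
    cdf = gennorm.cdf(edges, beta_hat, scale=alpha_hat)
    return np.diff(cdf)   # P_m(i) for each of the L bins
```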

At the receiver side, we compute the KLD between \(q(x)\) and \(p_m(x)\) (Eq. 7). We do not fit \(q(x)\) with a GGD model because we cannot be sure that the distorted image is still a natural one and, consequently, that the GGD model is still adequate. Indeed, the distortion introduced by the processing can greatly modify the marginal distribution of the IMF coefficients. Therefore, it is more accurate to use the empirical distribution of the processed image.

$$\begin{aligned} d(p_m\Vert q)=\int p_m(x)\log \frac{p_m(x)}{q(x)}\text{ d}x \end{aligned}$$
(7)

Then, the KLD between \(p(x)\) and \(q(x)\) is estimated by:

$$\begin{aligned} \widehat{d}(p\Vert q)=d(p_m\Vert q)-d(p_m\Vert p) \end{aligned}$$
(8)

Finally, the overall distortion between the original and distorted images is given by:

$$\begin{aligned} D=\log _2 \left(1+\frac{1}{D_o}\sum _{k=1}^{K}|\widehat{d}^k(p^k\Vert q^k)| \right) \end{aligned}$$
(9)

where \(K\) is the number of IMFs, \(p^k\) and \(q^k\) are the probability density functions of the \(k\)th IMF in the reference and distorted images, respectively, \(\widehat{d}^k\) is the estimate of the KLD between \(p^k\) and \(q^k\), and \(D_o\) is a constant used to control the scale of the distortion measure.
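The receiver-side computation of Eqs. 5 and 7–9 can be sketched as follows, assuming the side information for each IMF consists of the model bin probabilities \(P_m\) and the prediction error \(d(p_m\Vert p)\); the bin edges and the value of \(D_o\) are illustrative assumptions.

```python
# Receiver-side sketch of the distortion measure (Eqs. 5, 7, 8, 9).
import numpy as np

def kld(p, q, eps=1e-12):
    """Discrete KLD sum_i p_i log(p_i / q_i) over normalized histograms."""
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))

def emism_distance(ref_feats, imfs_dist, edges, D_o=0.1):
    """ref_feats: per-IMF (P_m, d(p_m||p)) received as side information.
    imfs_dist: per-IMF coefficient arrays of the distorted image."""
    total = 0.0
    for (P_m, d_pm_p), coeffs in zip(ref_feats, imfs_dist):
        q, _ = np.histogram(coeffs.ravel(), bins=edges)
        d_pm_q = kld(P_m, q)            # Eq. (7), computed as in Eq. (5)
        total += abs(d_pm_q - d_pm_p)   # |d_hat(p||q)|, Eq. (8)
    return np.log2(1.0 + total / D_o)   # overall distortion D, Eq. (9)
```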

6 Experimental setup

6.1 Datasets

The goal of IQA is to provide quality predictions correlated with human observers' opinions. To test the performance of IQA algorithms, a dataset of distorted images graded by human observers is needed. As the evaluation process is greatly influenced by the semantic value of the images, care must be taken when choosing the dataset: it must contain images reflecting adequate diversity in content, and the generated distortions should cover a broad range of image impairments. To comply with these requirements, our approach has been tested and validated using three datasets of distorted and scored images (MOS values are provided for each dataset) specially designed for quality metric evaluation.

The IVC dataset [15] is built from 10 reference images. It provides 150 distorted and scored images: 50 JPEG compressed images, 50 JPEG2000 compressed images, 10 blurred images, and 40 locally adaptive resolution (LAR) coded images [26]. LAR coding is an image compression solution which consists in partitioning the image into non-overlapping regions; these regions are then quantized and coded individually. Subjective evaluations were made using a Double Stimulus Impairment Scale method with five categories and 15 observers in a controlled environment. In every case, the distortion levels were chosen so as to uniformly cover the subjective scale. The reference images are shown in Fig. 6 with the label (a).

Fig. 6 Reference images of the three datasets. The 10 reference images of the IVC dataset are labeled (a); the 29 reference images of the LIVE dataset are labeled (b) and (b, c); the reference images of the TID 2008 dataset are labeled (b, c) and (c)

The LIVE dataset [16] is the most widely used benchmark. It is built from 29 high-resolution images and contains seven sets of distorted and scored images representing five distortion types. Sets 1 and 2 are JPEG2000 compressed images, sets 3 and 4 are JPEG compressed images, and sets 5, 6, and 7 are, respectively, Gaussian blur, white noise, and simulated transmission error distorted images. A single-stimulus methodology was used for the subjective evaluation. The reference images were evaluated in the same experimental session as the test images, so that a quality difference score could be derived for all distorted images and all subjects. The average number of subjects rating each image was about 23. No specific conditions were imposed during the experiments (no viewing distance restrictions, no display device calibration, normal indoor illumination). The subjective test was carried out with each of the seven sets individually. A cross-comparison set mixing images from all distortion types was used to align the scores across the different sets. These aligned subjective scores (all data) are particularly useful to test general-purpose IQA algorithms. The 29 reference images (labels (b) and (b, c)) shown in Fig. 6 have very different textural characteristics and various proportions of homogeneous regions, edges, and details.

The TID2008 dataset [17] contains 25 reference images, illustrated in Fig. 6 (labels (b, c) and (c)), and 1,700 distorted images. As for the LIVE dataset, the Kodak test set [27] is the basis of the original images; a synthetic image containing texture fragments and objects with various characteristics is also included. Seventeen types of distortion corresponding to a wide variety of situations are taken into account (acquisition artifacts, compression, watermarking, etc.). For each type of distortion, four distortion levels are considered, leading to a pool of 100 distorted images per distortion type. Subjective quality scores were obtained using a comparison-based methodology: a pair of distorted images and the reference image are presented simultaneously, and the viewer is asked to choose the image nearest to the reference. Scores ranging from 0 to 9 are obtained through a competition process, and the MOS is obtained by averaging about 33 evaluations per image. Here again, no special precautions were taken regarding the experimental conditions. For a more detailed discussion of the distortion types and the subjective rating methodology, refer to [17]. Table 1 summarizes the distortions that can be analyzed with the three datasets, and Fig. 7 presents samples of highly degraded images to illustrate the visual impact of each distortion process under study.

To evaluate the performance of EMISM, each test consists in choosing a reference image and one of its distorted versions. Both images are taken as inputs of the scheme given in Fig. 1. After the feature extraction step in the BEMD domain, a global distance is computed between the reference and distorted images, as given in Eq. 9. This distance represents an objective measure for image quality assessment. To link this value to a subjective evaluation grade, different protocols can be used. Classically, the logistic function-based regression proposed by the video quality experts group (VQEG) is used. Alternatively, we propose to use the distortion measure as the input of a trained classifier that selects one of the five categories of the quality scale to grade the image under test. This strategy is more flexible and more in line with the subjective rating process. These two protocols are briefly reviewed in the following.

Fig. 7 The visual impact of artifacts on distorted images from the datasets under study. Labels refer to the distortions in Table 1: (1) JPEG compression, (2) JPEG2000 compression, (3) Gaussian blur, (4) masked noise, (5) high-frequency noise, (6) impulse noise, (7) quantization noise, (8) spatially correlated noise, (9) image denoising, (10) additive Gaussian noise, (11) additive noise in color components, (12) JPEG transmission errors, (13) JPEG2000 transmission errors, (14) non-eccentricity pattern noise, (15) local block-wise distortions of different intensity, (16) mean shift (intensity shift), (17) contrast change, (18) LAR coding

Table 1 Summary of the distortion types considered in the three datasets and their labels

6.2 Validation protocols

The distortion measure must be linked to the subjective quality score (MOS). Following the suggestions of the video quality experts group (VQEG) Phase I FR-TV test, a nonlinear regression based on a logistic function is generally used to map the objective scores to the subjective ones. The quality metric evaluation is then based on the correlation between objective and subjective scores. However, it is easier for a human observer to express a perception of quality using words such as "Bad," "Poor," "Fair," "Good," or "Excellent," than to assign a quality score to a distorted image. The classification process is more natural and more flexible than the nonlinear regression. Indeed, the nonlinear regression is based on a four- or five-parameter logistic function, while for classification multiple choices are available: the classifier (SVM, KNN, neural network, etc.), the training strategy, and even multiple classifiers combined with different fusion strategies. Thus, it is of prime interest to evaluate a classification process as an alternative to the nonlinear regression. For this purpose, we propose to use the SVM algorithm; the quality metric evaluation then relies on the classification accuracy. We now briefly describe the two alternatives.

6.2.1 Logistic function-based regression

The objective scores are computed from the values generated by the objective measure (the global distance in our case) using a nonlinear function, according to the video quality experts group Phase I FR-TV test [28]. Here, we use a four-parameter logistic function given by:

$$\begin{aligned} \text{ logistic}(\gamma ,D)=\frac{\gamma _1-\gamma _2}{1+e^{-\left(\frac{D-\gamma _3}{\gamma _4}\right)}}+\gamma _2 \end{aligned}$$

where \(\gamma =(\gamma _1,\gamma _2,\gamma _3,\gamma _4)\).

Once the nonlinear mapping is achieved, we obtain the predicted objective quality scores (DMOSp):

$$\begin{aligned} \text{ DMOS}_\mathrm{p}=\text{ logistic}(\gamma ,D) \end{aligned}$$

Several metrics were introduced by the VQEG to compare subjective and objective quality scores. In our study, we compute the correlation coefficient to evaluate prediction accuracy and the rank-order coefficient to evaluate prediction monotonicity. These metrics are defined as follows:

$$\begin{aligned}&\text{ CC}=\frac{\sum _{i=1}^{N}\left(\text{ DMOS}(i)-\overline{\text{ DMOS}}\right)\left(\text{ DMOSp}(i)-\overline{\text{ DMOSp}}\right)}{\sqrt{\sum _{i=1}^{N}\left(\text{ DMOS}(i)-\overline{\text{ DMOS}}\right)^2}\sqrt{\sum _{i=1}^{N}\left(\text{ DMOSp}(i)-\overline{\text{ DMOSp}}\right)^2}} \end{aligned}$$
(10)
$$\begin{aligned}&\text{ ROCC}= 1-\frac{6\sum _{i=1}^{N}\left(\text{ rank}(\text{ DMOS}(i))-\text{ rank}(\text{ DMOSp}(i))\right)^2}{N(N^2-1)} \end{aligned}$$
(11)

where the index \(i\) denotes the image sample and \(N\) denotes the number of samples.
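As an illustration, the whole validation step can be sketched in a few lines. The paper relies on Matlab's fminsearch; in the sketch below, scipy's Nelder–Mead minimizer plays the same role, and the initial parameter values are illustrative assumptions.

```python
# Sketch of the validation pipeline: fit the four-parameter logistic by
# least squares, then compute CC (Eq. 10) and ROCC (Eq. 11).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import pearsonr, spearmanr

def logistic(g, D):
    g1, g2, g3, g4 = g
    return (g1 - g2) / (1.0 + np.exp(-(D - g3) / g4)) + g2

def fit_logistic(D, mos):
    # heuristic starting point: span of MOS, median distance, quarter range
    g0 = [mos.max(), mos.min(), np.median(D), (D.max() - D.min()) / 4.0]
    res = minimize(lambda g: np.sum((mos - logistic(g, D)) ** 2), g0,
                   method="Nelder-Mead")
    return res.x

# D: objective distances (Eq. 9); mos: subjective scores (numpy arrays)
# g = fit_logistic(D, mos)
# dmos_p = logistic(g, D)
# cc, _ = pearsonr(mos, dmos_p)       # prediction accuracy
# rocc, _ = spearmanr(mos, dmos_p)    # prediction monotonicity
```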

6.2.2 SVM-based classification

The effectiveness of this approach is linked to the choice of discriminative features and of the multiclass classification strategy. Among the many existing classifiers, SVM provides better results than traditional techniques such as neural networks. It relies on two key ideas:

1. For separation, it uses a hyperplane or a set of hyperplanes; a good separation is achieved by the hyperplane that has the largest distance to the nearest training data of any class.

2. In order to discriminate sets which are not linearly separable in their original space, the data are mapped into a space of higher (possibly infinite) dimension, where the separation problem becomes easier.

SVM was introduced in a binary classification setting; in other words, it was designed to separate data into two classes. As we plan to classify images on a five-category scale, we need one of the proposed multiclass extensions. Two strategies have been introduced. The first, named all-in-one (AIO), considers all the data in a single optimization problem. In the second strategy, called divide and combine, several binary SVM classifiers are combined to make the final decision. We can distinguish three divide-and-combine methodologies: one-against-one (OAO), directed acyclic graph (DAG), and one-against-all (OAA). The differences between these methods come from the number of binary classifiers involved and the strategy used to combine them. In [29], an extensive comparison of multiclass SVM classification methods is reported; the results show that the AIO strategy outperforms the three divide-and-combine methods. Based on these results, we choose the AIO strategy for our classification problem.

7 Experimental results

First of all, as WNISM is the reference metric, we perform a detailed comparison between it and the proposed metric using the three datasets, which together cover a broad spectrum of distortions. Then, we investigate the comparative efficiency with respect to DNT, RRED, and WNISM on the TID 2008 database. This is done using both the nonlinear regression approach and the SVM classification process. For the latter experiment, in order to make a fair comparison between competing methods, the computation time and the size of the side information are also reported.

7.1 Comparison of EMISM and WNISM using the logistic function-based regression

Two measures are used to evaluate the performance of EMISM: the Pearson linear correlation coefficient (CC) measures performance in terms of accuracy, while the Spearman rank-order correlation coefficient (ROCC) measures performance in terms of monotonicity. To compute these quantities, we first carry out the regression analysis. The logistic function-based regression aims to predict the MOS from the quality scores using a parametric logistic function, whose parameters must be optimized to obtain a good prediction. This is done with the fminsearch function of the Matlab Optimization Toolbox. Figure 8 shows the scatter plot of MOS versus the model prediction for eight of the 17 distortion types of the TID 2008 dataset. The points in Fig. 8 represent images of a specific set: for every image (point), we have its MOS (vertical axis) and its quality score according to EMISM (horizontal axis). A good prediction exhibits a maximum number of points close to the fitted logistic curve (Fig. 8b, c, d, e, f, g), while a prediction where most points deviate far from the fitted logistic curve is unsatisfactory (Fig. 8a, h). Throughout our experiments, we used WNISM as the reference for comparison. Table 2 reports the results obtained using the IVC dataset.

Fig. 8 Scatter plots of MOS versus the model prediction: a additive noise in color components, b high-frequency noise, c Gaussian blur, d JPEG compression, e JPEG2000 compression, f JPEG transmission errors, g local block-wise distortions of different intensity, h impulse noise

Table 2 CC and ROCC scores obtained on the IVC dataset using the WNISM and EMISM RR methods

First of all, we can see that EMISM compares favorably with WNISM for the blur distortion: for this set, it reaches a high accuracy (CC = 0.95) and a high monotonicity (ROCC = 0.94). Even if WNISM scores slightly higher for the JPEG2000 distorted images, the differences are so small that it is difficult to choose between the two. The results are more mixed for the LAR distortion: WNISM clearly outperforms our metric in terms of prediction accuracy, but not in terms of monotonicity. As the number of distortion types in the IVC dataset is too small to draw a conclusion about the overall efficiency of EMISM, we turn to the results of the evaluation on the LIVE dataset.

Table 3 reports the results of our investigations using the LIVE dataset. In addition to the results obtained for the two RR metrics (EMISM, WNISM), the performance of two FR metrics (PSNR, MSSIM) is also reported. As we can see, our method ensures better prediction accuracy (higher correlation coefficients) and better prediction monotonicity (higher Spearman rank-order correlation coefficients) than the steerable pyramid-based method for the white noise perturbation. Surprisingly enough, WNISM is more efficient for blurred images, which contradicts the previous conclusion drawn from the IVC dataset. For the JPEG2000 artifact, the results confirm the previous observations, leaving no doubt that WNISM is better suited to this type of distortion. However, for JPEG-coded images, the results are more puzzling: for the first set (JPG1), the results are only slightly different, while for the second set (JPG2), the differences are much more important.

Table 3 CC and ROCC scores obtained on the LIVE dataset using the WNISM and EMISM RR methods

Compared to the FR metrics, one could expect them to perform better. This is always true for MSSIM but not for PSNR. Among the seven sets given in Table 3, PSNR performs better for the JP2(1), JP2(2), and JPG2 distortions, EMISM outperforms PSNR for the transmission errors set, and for the remaining sets (JPG1, Noise, Blur) the results are more mixed. Using only five types of distortion, it is difficult to decide whether an IQA method is a general-purpose tool or not. To go deeper into the subject, we need to consider a dataset with a larger number of distortions. The results of the experiments with the TID 2008 dataset are reported in Table 4. First, let us turn to the previously studied artifacts (JPEG2000, JPEG, Gaussian blur, white Gaussian noise, transmission errors). For JPEG2000 distorted images, WNISM reaches its best score: accuracy and monotonicity are above 0.90, while our metric reaches only 0.76. The comparison for the Gaussian blur and transmission errors artifacts also turns to the advantage of WNISM. Considering the results of the previous experiments (IVC, LIVE), we can confirm that WNISM is particularly suited to assess the quality of images degraded by JPEG2000 coding or transmission errors. We cannot say the same for the Gaussian blur artifact, since on IVC, EMISM presents higher prediction accuracy and monotonicity than WNISM. The results also confirm the superiority of EMISM for white Gaussian noise. For the JPEG distortion type, EMISM outperforms WNISM in terms of prediction accuracy, while the opposite holds in terms of prediction monotonicity; overall, the differences between the two metrics are hardly distinguishable for this artifact.

Table 4 CC and ROCC scores obtained on the TID2008 dataset using the WNISM and EMISM RR methods

Now, let us see what happens for the other artifacts. They can be divided into two categories. The first contains artifacts linked to noise impairments (additive noise in color components, spatially correlated noise, masked noise, high-frequency noise, impulse noise, quantization noise, non-eccentricity pattern noise), while the remaining artifacts (image denoising, JPEG2000 transmission errors, local block-wise distortions of different intensity, mean shift, contrast change) form the second category. Among the seven artifacts of the first category, EMISM outperforms WNISM for five (additive noise in color components, spatially correlated noise, masked noise, high-frequency noise, non-eccentricity pattern noise). However, both metrics fail to measure impulse noise and are not very efficient for the quantization noise artifact. Furthermore, except for these two artifacts, EMISM maintains its stability with a prediction accuracy always above or close to 0.70, while WNISM achieves a prediction accuracy around 0.60, and even below 0.5 for the non-eccentricity pattern noise artifact. For the second category, both metrics fail to measure the mean shift distortion; WNISM performs well for image denoising and contrast change; and EMISM shows better prediction accuracy for the JPEG2000 transmission errors and local block-wise distortion artifacts. In order to evaluate the robustness of the metrics over a wide range of distortions, we add a set formed from all distortions, termed "All." For the LIVE and IVC datasets, WNISM performs slightly better than EMISM on this set (Tables 2, 3), while for the TID 2008 dataset, the result is inverted (Table 4).

To summarize, EMISM achieves good stability in terms of accuracy and monotonicity over a wide range of distortion types. It is particularly suited to artifacts related to noise impairments, while WNISM proves to be a good measure for compression-related artifacts (JPEG, JPEG2000). Moreover, both measures fail for exactly the same kinds of artifacts. Nevertheless, EMISM remains advantageous since it covers a larger number of artifacts. Indeed, the comparative results with WNISM confirm the efficiency of EMISM as a robust general-purpose reduced-reference measure.

7.2 Comparison of EMISM and WNISM using the SVM classification process

The classification procedure is the same for all datasets; let us describe it for the TID 2008 dataset. It contains 17 sets of 100 distorted images. Since the MOS values lie in the interval [0, 9], this interval is divided into five subintervals ]0, 2], ]2, 4], ]4, 6], ]6, 8], and ]8, 9], corresponding to the quality classes bad, poor, fair, good, and excellent, respectively. Each set of distorted images is divided into five subsets according to the MOS associated with each image in the set. Then, a multiclass SVM is trained using the leave-one-out cross-validation strategy; in other words, each iteration uses a single observation from the original sample as validation data and the remaining observations as training data, and the process is repeated until each observation has been used once for validation. The radial basis function (RBF) kernel is used, and a selection step is carried out to choose the kernel parameters that give the best classification accuracy. The inputs of the SVM are the distances computed in Eq. 8: for the \(i\)th distorted image, \(X_i=[d_1,d_2,d_3,d_4 ]\) is the feature vector (only four IMFs are used).
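An illustrative sketch of this protocol using scikit-learn is given below. Note that sklearn's SVC implements a one-against-one multiclass scheme rather than the AIO formulation adopted here, and the parameter grid is an assumption; the sketch approximates the protocol rather than reproducing it.

```python
# Sketch of the SVM-based grading: RBF kernel, parameter selection by grid
# search, and leave-one-out accuracy as the evaluation figure.
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, GridSearchCV, cross_val_score

# X: one row [d_1, d_2, d_3, d_4] of per-IMF distances per distorted image
# y: quality class in {0..4} obtained by binning the MOS into five intervals
def classify_quality(X, y):
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [1, 10, 100], "gamma": ["scale", 0.1, 1.0]},
                        cv=5)
    grid.fit(X, y)
    # leave-one-out accuracy, as in the validation strategy described above
    acc = cross_val_score(grid.best_estimator_, X, y, cv=LeaveOneOut()).mean()
    return grid.best_estimator_, acc
```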

Tables 5, 6, and 7 report the classification accuracy, the classification accuracy rank, and the correlation coefficient rank for the IVC, LIVE, and TID 2008 datasets, respectively.

Table 5 Comparison of classification accuracies between WNISM and EMISM for the IVC dataset
Table 6 Comparison of classification accuracies between WNISM and EMISM for the LIVE dataset
Table 7 Comparison of classification accuracies between WNISM and EMISM for the TID2008 dataset

Let us begin with the IVC dataset. We can observe that EMISM outperforms WNISM for all types of degradation in terms of classification accuracy. This confirms the results obtained using the logistic-based approach for the blur degradation and highlights the differences between the two methods, but this time in favor of EMISM, suggesting that the classification approach is more effective for EMISM than the nonlinear regression. A comparison between the classification accuracy rank and the correlation coefficient rank reveals that the SVM-based classification is in perfect agreement with the logistic function-based regression for both metrics. Results with the LIVE dataset confirm the trend observed with the IVC dataset. While according to the regression approach EMISM outperforms WNISM only for the white noise distortion, the classification process improves its score significantly: WNISM performs better only for the JPG1 set in this situation. The high correlation (0.92) between the classification accuracy rank and the CC rank indicates that both strategies (classification and nonlinear regression) lead to statistically the same ordering for EMISM. This is not the case for WNISM, for which the correlation is very low (0.21). Globally, for the TID 2008 dataset, the classification accuracy rates of both metrics are much lower than those measured for the two previous datasets. The results are quite consistent with those observed using nonlinear regression, and the gain observed for the two other datasets is less pronounced: EMISM becomes more effective in only one additional case, JPEG2000 transmission errors. It may be noted that the classification strategy makes it easier to highlight the differences. The correlation (0.61) between the classification accuracy ranks of the two metrics shows that they behave very differently; as noted before, the metrics are efficient for different types of degradation. Furthermore, the mean classification accuracy is slightly higher for EMISM (72.55) than for WNISM (69.22), making it suitable for a larger number of degradations. If we compare the agreement between the classification accuracy rank and the correlation coefficient rank, the results show that for both metrics the orderings obtained by the two strategies are quite similar: the correlation between them is above 0.98 in both cases.

Globally, EMISM performs better than WNISM. This confirms the effectiveness of the EMD for IQA, since the RR features extracted from the reference image are the same for both WNISM and EMISM. Furthermore, the classification process increases its efficiency as compared to the nonlinear regression; it is more flexible and better reflects human judgment than the logistic-based regression. Nevertheless, this gain comes at the price of increased complexity: training is required before this strategy can be used, although once the training step is done, the classification itself is straightforward.

7.3 Comparison of EMISM with DNT and RRED

The goal of an RRIQA measure is to provide the best compromise between prediction accuracy and side information rate; complexity is also an important issue. Here, we compare the proposed measure with the most prominent RRIQA methods based on natural image statistics modeling: the wavelet-domain natural image statistics model (WNISM) [7], the divisive normalization-based transform (DNT) [8], and the reduced reference entropic differencing index (RRED) [10]. We consider the prediction accuracy with both strategies (nonlinear regression and classification), the amount of side information transmitted, and the computation time, in order to evaluate the relative efficiency of the methods under investigation. The experiments are performed on the TID 2008 database, since it is the most challenging and contains the widest set of distortion types.

In EMISM, for each IMF, we extract three parameters \(\{\alpha ,\beta , d(p_m\Vert p)\}\) that must be sent as side information. We conducted many experiments while varying the number of IMFs. Our results show that the best trade-off in terms of data rate, complexity, and performance is to use four IMFs. Figure 9 shows the evolution of the correlation coefficient values, using the nonlinear regression strategy on the LIVE dataset, as the number of IMFs increases. It demonstrates that increasing the size of the feature set does not produce a significant gain in performance, while it increases both the computational complexity of the algorithm and the size of the feature set. The three parameters extracted from each of the four IMFs are quantized to finite precision: \(\beta \) and \(d(p_m\Vert p)\) are quantized to 8-bit precision, and \(\alpha \) is represented as an 11-bit floating-point number, with 8 bits for the mantissa and 3 bits for the exponent. In summary, a total of \((8 + 8 + 8 + 3) \times 4 = 108\) bits is used to represent the RR features.

Fig. 9 Evolution of the CC quality score as a function of the number of IMFs for the LIVE database. Blue line: white noise; red line: Gaussian blur; green line: JPEG2000 (JP2(1)); black line: JPEG2000 (JP2(2)); magenta line: JPEG compression (JPG1); cyan line: JPEG compression (JPG2); yellow line: fast fading (color figure online)

In WNISM, the same three parameters (the shape parameter, the scale parameter, and the prediction error) are extracted from six subbands. As a similar quantization process is used for the RR features, a total of \((8 + 8 + 8 + 3) \times 6 = 162\) bits is used to represent them.

In DNT, four parameters (the standard deviation, the kurtosis, the skewness, and the prediction error) are extracted from each of the 12 subbands (three levels and four orientations); 48 scalar features are therefore used for RR image quality assessment. If each scalar feature is quantized to 8-bit precision, a total of \(48\times 8=384\) bits is required to represent the RR features.

In RRED, the amount of side information can vary from the no-reference case to the full-reference one. A special case of this algorithm requiring just a single parameter has been reported for the TID2008 database; in this case, satisfactory performance was obtained for only three distortion types, while in the case named \(\text{ RRED}^{M_{16}}_{16}\), good performance was obtained for several distortions. For this reason, we choose the latter for our comparison. \(\text{ RRED}^{M_{16}}_{16}\) uses \(N/36\) scalar features for each reference image, where \(N\) is the total number of pixels in the image; these features are the entropies extracted from the blocks of the selected subband. For an image of \(512 \times 512\) pixels, the size of the RR feature set becomes \((512 \times 512)/36 = 7{,}282\) scalar features. If each scalar feature is quantized to 8-bit precision, a total of \(7{,}282 \times 8 =58{,}256\) bits is required to represent the RR features.
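The side information budgets discussed above can be tallied directly; the snippet below reproduces the arithmetic for a \(512 \times 512\) image, with \(N/36\) rounded up as in the text.

```python
# Side-information budgets (in bits) for the four RR methods compared here.
import math

bits_emism = (8 + 8 + 8 + 3) * 4            # 4 IMFs x 27 bits = 108
bits_wnism = (8 + 8 + 8 + 3) * 6            # 6 subbands x 27 bits = 162
bits_dnt   = 48 * 8                         # 48 scalar features = 384
bits_rred  = math.ceil(512 * 512 / 36) * 8  # 7,282 features = 58,256
```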

Table 8 summarizes all the information needed to compare the four methods. Besides data rate and performance, we report the execution time of the four algorithms on an AMD Turion II Dual-Core Mobile M520 processor (2 CPUs) with a 2.3 GHz clock speed and 3 GB of RAM. Note that the computation time shown in Table 8 is obtained by averaging the time required for ten randomly selected images from the TID2008 dataset; the performance figures in the same table also concern the TID2008 dataset. First of all, we can notice that EMISM and WNISM are very comparable, nevertheless with a certain advantage for EMISM: with a data rate reduced by more than 33 %, it still outperforms WNISM. This advantage is obtained at the expense of a slightly higher computation time.

Table 8 Comparison of EMISM with \(\text{ RRED}^{M_{16}}_{16}\), DNT, and WNISM on the TID2008 dataset

DNT and \(\text{ RRED}^{M_{16}}_{16}\) clearly perform better according to the correlation coefficient, although this advantage is reduced when the classification accuracy is used. \(\text{ RRED}^{M_{16}}_{16}\) is almost two times faster than EMISM and WNISM, and more than five times faster than DNT. Comparing the data rates, there is a clear advantage for EMISM and WNISM: DNT is about three times more costly than EMISM, and the ratio reaches 58 for \(\text{ RRED}^{M_{16}}_{16}\). Such differences suggest that these last two methods should be reserved for situations where the bit budget is not a priority.

8 Conclusion

Inspired by the natural image statistics paradigm, a new reduced-reference method for image quality assessment has been introduced. EMISM is based on the BEMD representation, which is particularly suited to nonlinear and non-stationary situations; among other benefits, fast and efficient algorithms have been proposed to compute this representation. To assess the performance of EMISM, an extensive comparative evaluation has been conducted involving LIVE, the most prominent benchmark, and two other datasets (IVC, TID 2008). TID 2008 contains 17 types of distortions covering a wide range of practical situations and allows us to study the efficiency of EMISM as a general-purpose measure. A comparison with the well-known WNISM measure demonstrates that EMISM is very efficient as a general-purpose solution; furthermore, it outperforms the reference WNISM over a large number of distortions (additive noise in color components, spatially correlated noise, masked noise, high-frequency noise, non-eccentricity pattern noise, JPEG2000 transmission errors, and local block-wise distortions of different intensity). Additionally, a classification framework is proposed as an alternative to the logistic function-based regression; results showed that the classification strategy is better suited to the proposed RRIQA, with a further gain over WNISM in this case. The comparison with RRED and DNT showed that these methods perform better, but this gain is obtained at the expense of an increased data rate, ranging from 3 times higher for DNT to 58 times higher for RRED. As future work, we plan to investigate alternative models for the marginal distribution of BEMD coefficients; the Gaussian scale mixture, Bessel K forms (BKF), or symmetric alpha-stable \((S\alpha S)\) models seem to be convenient solutions for this purpose. In addition, as this study revealed the poor performance of both RRIQA measures for impulsive and quantization noise, we plan to pay more attention to these cases, which are very important in practical situations.