1 Introduction

Inpainting has long been practiced in the restoration of ancient artwork. The term digital image inpainting, however, was first introduced by Bertalmio in 2000 [67]. Because inpainting has many applications, such as restoring lost or damaged images and removing objects and noise, the evaluation of inpainted images has attracted the attention of researchers in recent years. One of its main uses is the removal of unwanted objects from an image, which is valuable for both videos and photos.

For example, unwanted objects can be removed from film frames and from pictures of sports, social, cultural, and political events. Image inpainting removes the need to retake the picture, which saves money. Its importance is further underlined by the fact that most events, such as sports and political events, cannot be repeated, so when an extra object appears in the picture, inpainting is the only option. Image inpainting has therefore gradually become more important and more widely used. However, no evaluation measure has been proposed for the resulting data. The emergence of numerous algorithms has led to another problem: which of these algorithms performs better? The most rudimentary way to evaluate them is to use human observers. Despite its precision, evaluation by human observers is time consuming and costly. It is therefore essential to develop a metric that can assess the output of these algorithms objectively and quickly, and that evaluates image quality in the same manner that human beings perceive images.

This issue has received significant attention in recent years. However, researchers have not yet developed an efficient measure that can be applied to all methods. Existing measures depend heavily on a reference image, and many of them work only in individual cases. Because the reference image is unavailable in many inpainting scenarios, a metric that does not rely on it is crucial. The characteristics of the human eye can be beneficial in assessing an image. One of these characteristics is the limited region of sharp vision: the human eye sees only a small area of an object clearly at any moment, while the rest appears blurry. Researchers have provided evidence that, while looking at a picture, the human eye attends only to specific parts of the image at any moment and ignores the rest. These parts are captured by the saliency map, which was first presented by Koch and Ullman and later improved by Itti and Koch [10, 16, 17, 21, 44, 56]. Exploiting this property allows the measure to operate more similarly to human perception.

2 Objective quality evaluation

To meet the need for evaluating image inpainting, researchers have proposed measures that operate satisfactorily only in some cases. Ardis et al. presented the ASVS and DN methods in 2010 [2]. ASVS focuses on the inpainted pixels, so it is applicable only to the inpainted area. DN, in contrast, considers the region outside the inpainted area and indicates how much attention that outer region receives. For both measures, the image with the smallest value is considered the best inpainted one. Unfortunately, this approach is not efficient and gives useful results only in exceptional cases.

Other measures are \(\text{GD}_{\text{out}}\) and \(\text{GD}_{\text{in}}\), which were introduced by Mahalingam [30]. These criteria use visual saliency to measure the objective quality of the inpainted image, on the premise that any change in saliency is related to the perceived quality. This method is also weak and not functional in most cases.

Another approach, based on machine learning, was proposed by Voronin et al. [46, 47]. It relies on the statistical properties of images and is not very accurate; the main problem is its use of low-level features, which reduces its measurement accuracy. The methods mentioned above have many weaknesses, are useful only in some cases, and are far from human perception [34]. Although no proper method exists for the evaluation of inpainted images, quality assessment in image processing has attracted the attention of scientists in recent years. For example, current methods of assessing segmented images have shown excellent performance [40].

3 Proposed objective measure

Researchers have proposed numerous image inpainting methods, which can be divided into two general categories: partial differential equation (PDE)-based methods [5, 8, 9, 13, 23, 25, 32, 36, 39, 43, 49, 54, 55] and exemplar-based methods [3, 4, 11, 15, 20, 22, 24, 33, 42, 50, 51, 53, 57, 58]. Of the two, the exemplar-based approach is more efficient and provides better results [52]. To assess an inpainted image, the proposed method uses mid-level features so that its results are similar to human perception. In this article, the mid-level features are orientation, intensity, and color. The symbol I denotes the input image, Ω denotes the inpainted area, and \(\phi\) denotes its complementary region in the original photo. These areas are shown in Fig. 1.

Fig. 1 Image model input

First, the image is inpainted using the exemplar-based method. Then, the saliency maps of the inpainted area and of its complementary region are calculated using the Itti-Koch method [19, 48]. The main reason for using the Itti-Koch method is the set of features it uses to build the saliency map; in fact, only the features obtained by the Itti-Koch method are used here. Another reason is that these features play an essential role in exemplar-based inpainting: the better they are preserved, the better the inpainting. In essence, exemplar-based inpainting repeats the orientation, intensity, and color of the neighborhood.

Because of its importance to the proposed measure, the method of Itti et al. is explained here. Itti et al. proposed a model of bottom-up selective attention based on serially scanning a saliency map, which is computed from local contrast features; each attended location is subsequently inhibited so that attention moves on to the next most salient one. The model presented by Itti et al. can thus serve as a computational substitute for human visual attention.

First, the input image I is sub-sampled into a dyadic Gaussian pyramid by convolution with a linearly separable Gaussian filter followed by decimation by a factor of two. Repeating this process on each new level yields the pyramid levels \(\sigma\) = [0, …, 8]. The resolution of level \(\sigma\) is \(1/2^{\sigma}\) times the original image resolution. For example, the 4th level has \(\frac{1}{16}\) of the resolution of the input image I and \({\left(\frac{1}{16}\right)}^{2}\) of the total number of pixels.
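The pyramid construction can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 5-tap binomial kernel (1, 4, 6, 4, 1)/16 stands in for the Gaussian filter, borders are handled by clamping, and the function names are hypothetical. None of these choices are specified in the text.

```python
def blur_1d(row, kernel=(1, 4, 6, 4, 1)):
    """Convolve a 1-D sequence with a small kernel, clamping at the borders."""
    norm = float(sum(kernel))
    half = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), n - 1)  # clamp index to valid range
            acc += w * row[j]
        out.append(acc / norm)
    return out


def blur_2d(img):
    """Separable blur: filter every row, then every column."""
    rows = [blur_1d(r) for r in img]
    cols = [blur_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def downsample(img):
    """Blur, then keep every second pixel in both directions (decimation by 2)."""
    blurred = blur_2d(img)
    return [row[::2] for row in blurred[::2]]


def gaussian_pyramid(img, levels=9):
    """Return the list of levels; level 0 is the input image itself."""
    pyramid = [img]
    for _ in range(1, levels):
        if len(pyramid[-1]) < 2 or len(pyramid[-1][0]) < 2:
            break  # image too small to decimate further
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


# Synthetic 64 x 64 test image (assumption: grayscale values as floats).
image = [[float((x + y) % 7) for x in range(64)] for y in range(64)]
pyr = gaussian_pyramid(image, levels=5)
sizes = [(len(p), len(p[0])) for p in pyr]
```

With a 64 × 64 input, level 4 has 4 × 4 pixels, that is, 1/16 of the side length and (1/16)² of the pixel count, in line with the example in the text.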

If r, g, and b are red, green, and blue respectively, then the intensity map is calculated as follows:

$$M_I=\frac{r+g+b}3$$
(1)

This operation is repeated for each level of the input pyramid to obtain an intensity pyramid with levels \({M}_{I}\left(\sigma \right)\). In addition, for each level of the image pyramid, maps of the red-green (RG) and blue-yellow (BY) opponencies are computed:

$$M_{RG}=\frac{r-g}{max\left(r,g,b\right)}$$
(2)
$$M_{BY}=\frac{b-min\left(r,g\right)}{max\left(r,g,b\right)}$$
(3)

At low luminance, the color opponency values fluctuate strongly. To avoid this, \({M}_{RG}\) and \({M}_{BY}\) are set to zero at locations where \(\max\left(r,g,b\right)<\frac{1}{10}\), assuming a dynamic range of [0, 1].
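Equations 1 to 3 and the low-luminance rule can be illustrated per pixel as follows. The function name and the `threshold` parameter are hypothetical; the channel values are assumed to lie in [0, 1].

```python
def pixel_features(r, g, b, threshold=0.1):
    """Return (M_I, M_RG, M_BY) for one pixel with channels in [0, 1]."""
    intensity = (r + g + b) / 3.0          # Eq. 1
    brightest = max(r, g, b)
    if brightest < threshold:
        # Low luminance: opponency values are unreliable, so set them to zero.
        return intensity, 0.0, 0.0
    rg = (r - g) / brightest               # Eq. 2
    by = (b - min(r, g)) / brightest       # Eq. 3
    return intensity, rg, by


pure_red = pixel_features(1.0, 0.0, 0.0)
pure_blue = pixel_features(0.0, 0.0, 1.0)
yellow = pixel_features(1.0, 1.0, 0.0)
dark = pixel_features(0.05, 0.02, 0.01)
```

Pure red gives the maximum red-green response, pure blue the maximum blue-yellow response, and yellow the most negative blue-yellow response, while the dark pixel has both opponencies suppressed.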

Local orientation maps \({M}_{\theta }\) are obtained by applying steerable filters to the intensity pyramid levels \({M}_{I}\left(\sigma \right)\). Lateral inhibition between units with different \(\theta\) helps to detect faint elongated objects. Another highly salient feature is motion. Center-surround receptive fields are simulated by a cross-scale subtraction \(\ominus\) between two maps at the center (c) and surround (s) levels of these pyramids, yielding the “feature maps”:

$$F_{l,c,s}=N\left(\left|M_l\left(c\right)\ominus M_l\left(s\right)\right|\right)\quad\forall l\in L=L_I\cup L_C\cup L_O$$
(4)

With

$$L_I=\left\{I\right\},\quad L_C=\left\{RG,BY\right\},\quad L_O=\left\{0^{\circ},45^{\circ},90^{\circ},135^{\circ}\right\}$$

N(·) is an iterative, nonlinear normalization operator that simulates local competition between neighboring salient locations. Each iteration step consists of self-excitation and neighbor-induced inhibition, implemented by convolution with a “difference of Gaussians” filter, followed by rectification.
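The cross-scale subtraction in Eq. 4 can be sketched as follows. This is a simplified illustration, not the full model: the coarser surround map is brought to the center resolution by nearest-neighbor replication (an assumed interpolation scheme), and the normalization N(·) is omitted entirely. The function names are hypothetical.

```python
def upsample_nearest(coarse, factor):
    """Replicate every pixel of a 2-D list `factor` times in both directions."""
    out = []
    for row in coarse:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def center_surround(center_map, surround_map, c, s):
    """|M(c) (-) M(s)|: absolute difference after matching resolutions.

    Assumes level s is coarser than level c by a factor of 2 ** (s - c).
    The normalization operator N is not applied in this sketch.
    """
    factor = 2 ** (s - c)
    up = upsample_nearest(surround_map, factor)
    if len(up) != len(center_map) or len(up[0]) != len(center_map[0]):
        raise ValueError("map sizes do not match the pyramid levels c and s")
    return [
        [abs(a - b) for a, b in zip(row_c, row_s)]
        for row_c, row_s in zip(center_map, up)
    ]


# A single bright pixel on a dark background stands out against its surround.
center = [[0.0] * 4 for _ in range(4)]
center[1][1] = 1.0
surround = [[0.25, 0.25], [0.25, 0.25]]
feature = center_surround(center, surround, c=2, s=3)
```

The bright pixel keeps a large response (0.75) while the uniform background is reduced to the small residual of the surround (0.25).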

The feature maps are summed over the center-surround combinations using a cross-scale addition \(\oplus\), and the sums are normalized again:

$$\overline{F}_l=N\left(\bigoplus_{c=2}^{4}\;\bigoplus_{s=c+3}^{c+4}F_{l,c,s}\right)\quad\forall l\in L$$
(5)
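The index ranges in Eq. 5 can be made concrete by enumerating them. The short sketch below lists the center-surround pairs and counts the resulting feature maps; the string labels for the features are only illustrative.

```python
# Center levels c = 2..4, surround levels s = c + 3 and c + 4 (Eq. 5).
pairs = [(c, s) for c in (2, 3, 4) for s in (c + 3, c + 4)]

# One intensity, two color-opponency, and four orientation features.
features = ["I", "RG", "BY", "0", "45", "90", "135"]

maps_per_feature = len(pairs)
total_feature_maps = maps_per_feature * len(features)
```

Each feature therefore contributes six feature maps, for 42 feature maps in total, and the coarsest surround level used is 8, which matches the pyramid depth \(\sigma\) = [0, …, 8].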

For the general features of color and orientation, the contributions of the sub-features are summed and normalized once more to yield the “conspicuity maps”. For intensity, the conspicuity map is the same as \(\overline{F}_{I}\) obtained in Eq. 5:

$$C_I=\overline{F}_I,\quad C_C=N\left(\sum_{l\in L_C}\overline{F}_l\right),\quad C_O=N\left(\sum_{l\in L_O}\overline{F}_l\right)$$
(6)

All conspicuity maps are combined into one saliency map:

$$S=\frac{1}{3}\sum_{k\in\{I,C,O\}}C_k$$
(7)

A winner-take-all (WTA) network determines the most salient locations in the saliency map of each image: the locations with the highest values are selected. For each image, one or more locations may be selected.

The WTA method first yields the most salient location \(\left(x_w, y_w\right)\). This location is then inhibited, so the WTA competition generates the second most salient location, which is attended to subsequently and then inhibited in turn. In this way, the model simulates a scan path over the image in order of decreasing saliency.
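The winner-take-all selection with subsequent inhibition can be sketched as a simple loop. This is a strong simplification of the neural WTA network: the winner is found with a plain maximum search, and inhibition is modeled by zeroing a square window around it. The window shape, the `radius` parameter, and the function name are assumptions made for illustration.

```python
def scan_path(saliency, n_fixations=3, radius=1):
    """Return attended locations (x_w, y_w) in order of decreasing saliency."""
    sal = [list(row) for row in saliency]  # work on a copy
    height, width = len(sal), len(sal[0])
    path = []
    for _ in range(n_fixations):
        # Winner-take-all: pick the location with the highest saliency.
        yw, xw = max(
            ((y, x) for y in range(height) for x in range(width)),
            key=lambda p: sal[p[0]][p[1]],
        )
        if sal[yw][xw] <= 0.0:
            break  # nothing salient is left
        path.append((xw, yw))
        # Inhibition: suppress a square window around the winner.
        for y in range(max(0, yw - radius), min(height, yw + radius + 1)):
            for x in range(max(0, xw - radius), min(width, xw + radius + 1)):
                sal[y][x] = 0.0
    return path


grid = [[0.0] * 5 for _ in range(5)]
grid[1][2] = 0.9   # strongest peak
grid[1][3] = 0.8   # neighbor of the strongest peak, suppressed with it
grid[4][0] = 0.7   # isolated weaker peak
fixations = scan_path(grid, n_fixations=3, radius=1)
```

In this toy map the second fixation jumps to the isolated peak rather than to the stronger neighbor of the first winner, because that neighbor falls inside the inhibited window.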

Fig. 2 Architecture of Itti-Koch’s method [19]

The architecture of the Itti-Koch method is shown in Fig. 2. Using the features computed by the Itti-Koch method for the saliency map, the similarity between the saliency map of the inpainted area and that of the complementary region is calculated in terms of intensity, orientation, and color; the greater this similarity, the better the inpainting. This is because the exemplar-based method reconstructs the damaged region from the colors and patterns of its surroundings, so the quality of inpainting depends strongly on how faithfully intensity, color, and orientation are preserved. This approach is superior to computing the similarity directly between the pixels of the inpainted area and the complementary region, because pixel-level comparison can be misleading: repeating the pattern of one part can produce high similarity while the quality of inpainting is still low. Comparing saliency features such as orientation, intensity, and color therefore increases the accuracy of the assessment of image inpainting.

After the saliency maps are obtained, they are compared using the Jaccard index [41], which is defined as follows:

$$J\left(A,B\right)=\frac{\left|A\cap B\right|}{\left|A\cup B\right|}$$
(8)
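As a small worked example of Eq. 8, the Jaccard index of two finite sets can be computed as below. The value returned for two empty sets is a convention chosen here, not something stated in the text.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two finite collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


partial_overlap = jaccard({1, 2, 3}, {2, 3, 4})   # 2 shared of 4 distinct
identical = jaccard({"a", "b"}, {"a", "b"})
disjoint = jaccard({1, 2}, {3, 4})
```

The index is 1 for identical sets, 0 for disjoint sets, and 0.5 in the partial-overlap case above, where two of the four distinct elements are shared.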

It should be noted that earlier methods for evaluating inpainted images use only the concept of the saliency map and ignore how it is calculated and which features it is built from. In the proposed method, however, the features of the saliency map are needed, so the Itti-Koch method is used to compute it. Further, the Jaccard index has never been used in this area (the evaluation of inpainted images).

To obtain a measure that is more precise and closer to human perception, one penalty term and one compensation term are added to the above Jaccard index. The penalty term is the ratio of the saliency of the inpainted region to the total saliency of the inpainted and complementary regions, and it is added to the denominator. The higher this ratio, the more attention the inpainted region attracts relative to the complementary region, which in turn indicates lower-quality inpainting. The penalty term is calculated as follows:

$$\frac{\left|\Omega\right|}{\left|\Omega\right|+\left|\varphi\right|}$$
(9)

The compensation term is the opposite of the penalty term and is added to the numerator of the Jaccard index. It is the ratio of the saliency of the complementary region to the total saliency of both regions. A higher value of this ratio indicates that the complementary region attracts more attention than the inpainted region. This helps to increase the accuracy of the measure and brings it closer to human perception. The compensation term is defined as follows:

$$\frac{\left|\varphi\right|}{\left|\Omega\right|+\left|\varphi\right|}$$
(10)

Finally, the proposed measure, which we call the Objective Inpainting Metric (OIM), is as follows:

$$\text{OIM}=\frac{\left|\Omega\cap\varphi\right|+\frac{\left|\varphi\right|}{\left|\Omega\right|+\left|\varphi\right|}}{\left|\Omega\cup\varphi\right|+\frac{\left|\Omega\right|}{\left|\Omega\right|+\left|\varphi\right|}}$$
(11)
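Equation 11 can be evaluated numerically once the two saliency maps are represented as finite sets. The sketch below assumes exactly that: `omega` and `phi` are hypothetical sets of salient feature labels extracted from the inpainted and complementary regions. How the saliency maps are turned into such sets is not specified in the text, so this representation is an assumption made purely for illustration.

```python
def oim(omega, phi):
    """Evaluate Eq. 11 for two finite sets of salient features.

    omega: features of the inpainted region (assumed set representation)
    phi:   features of the complementary region (assumed set representation)
    """
    omega, phi = set(omega), set(phi)
    total = len(omega) + len(phi)
    if total == 0:
        raise ValueError("both feature sets are empty")
    numerator = len(omega & phi) + len(phi) / total
    denominator = len(omega | phi) + len(omega) / total
    return numerator / denominator


same = oim({1, 2, 3}, {1, 2, 3})        # identical feature sets
partial = oim({1, 2, 3}, {2, 3, 4})     # (2 + 0.5) / (4 + 0.5)
nothing_shared = oim({1}, {2})          # (0 + 0.5) / (2 + 0.5)
```

Under this representation, identical sets give exactly 1, and the value falls toward 0 as the overlap shrinks, which agrees with the reading of the score given in the text.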

This metric is used to evaluate images inpainted with exemplar-based and object-removal methods. It is a quantitative measure and can be a reliable alternative to subjective measurement. It is obtained from the features of the saliency map and the Jaccard index. For each inpainted image, Eq. 11 yields a number: the closer this number is to zero, the lower the quality of the inpainting, and the closer it is to one, the higher the quality and the harder it is to tell that the image has been manipulated. In the next section, the results are examined.

4 Experimental results

In this paper, the images of Ran Shi et al.’s database are used [41]. The images of this database come from four well-known public object segmentation databases, Weizmann [1], VOC2012 [14], MSRA [28], and Microsoft Research Cambridge’s grabcut [37], together with some Berkeley Segmentation images [31]. The images include people, objects, animals, highly structured images, simple images, landscapes, etc.

In this section, the images are first inpainted and then the quality of the inpainting is measured by the proposed metric. The photos and masks were chosen so that the removed objects lie at different positions in the picture, because the position of the removed object affects the quality of inpainting. The removed objects also differ in size, from small to large, and in type, including animals, objects, and people. The aim was to cover as many inpainting conditions as possible so that the proposed criterion can be applied to all photographs. The images and their masks are inpainted with the exemplar-based algorithm, and then the saliency maps of the inpainted region and of its complementary region are extracted using the Itti-Koch method. Figure 3 shows some of the database images and their masks.

Fig. 3 Some examples of the database images with their masks [41]

The results are evaluated using the proposed metric. Figure 4 displays the evaluation process: (a) shows the original image and (b) shows the image inpainted using the exemplar-based method. The saliency maps of the inpainted image and of the inpainted areas are shown in (c) and (d), respectively.

Fig. 4 Examples of the inpaint and saliency map of images from the test database

After evaluating the images, the proposed metric reports the quality of each image as a number between zero and one. The closer this number is to one, the higher the quality of the inpainting. Figure 5 illustrates an example of the evaluation of ten inpainted images. The score given by the proposed metric is written below each image.

Fig. 5 Examples of the proposed method rating: image (a) = very good inpainting, image (b) = good inpainting, image (c) = average inpainting, image (d) = bad inpainting, and image (e) = very bad inpainting

5 Evaluation

To assess the proposed evaluation metric, subjective methods were used: after the inpainted images had been scored by the proposed criterion, they were rated by human observers following the ITU (International Telecommunication Union) standard (Table 1) [1]. Each picture was rated from 5 to 1: 5 (excellent inpainting), 4 (good inpainting), 3 (average inpainting), 2 (poor inpainting), and 1 (bad inpainting). For this purpose, 30 human observers with little or no experience in image processing were employed. Each participant evaluated an average of 200 images. In accordance with the ITU standard, unrealistic opinions were removed to ensure that the observers had understood the test and that their ratings were not random. The inpainted image and the original image were displayed simultaneously to each participant [18].

Table 1 ITU grading scales

The final score of each image is the mean opinion score (MOS) [14], calculated by the following formula:

$$\text{MOS}=\frac1{\text{n}}\sum_{\text{i}=1}^\text{n}{\text{score}}_\text{i}$$
(12)

In the above formula, n represents the number of human observers and \({\text{score}}_{\text{i}}\) represents the score given to the image by observer i. All images were inpainted using the Criminisi method [12] and then rated by the observers. The original and inpainted images were shown to the human observers for 5 s, after which they rated them. To avoid biasing the ratings, the images were presented in random order [18, 31, 35, 37]. Also according to the ITU standard, subjects should be informed through the test instructions [18] (Fig. 6).
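Equation 12 amounts to a plain average of the observers' ratings for one image. The sketch below adds a range check for the five-point scale described above; the function name and the example ratings are hypothetical.

```python
def mean_opinion_score(scores):
    """Average the ratings (integers 1..5) given to one image (Eq. 12)."""
    if not scores:
        raise ValueError("at least one rating is required")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(scores) / len(scores)


# Hypothetical ratings of one inpainted image by five observers.
mos = mean_opinion_score([5, 4, 4, 3, 5])
```

For these five example ratings the sum is 21, so the MOS is 4.2.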

Fig. 6 Samples of database [41]

Four general assessment criteria, namely the Spearman Rank-Order Correlation Coefficient (SROCC), the Linear Correlation Coefficient (LCC), the Root Mean Squared Error (RMSE), and the Outlier Ratio (OR), are used to compare the objective and subjective results [26, 27, 29, 38, 45]. Table 2 shows the results for the proposed measure and for three other metrics.
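Three of these criteria can be sketched with the standard library alone. The code assumes that the objective scores have already been mapped onto the same scale as the subjective scores before RMSE is computed, and it handles tied values in SROCC with average ranks. The outlier ratio is left out because its threshold is not defined in the text. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient (LCC) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """SROCC: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


# Hypothetical objective scores and subjective MOS values for four images.
objective = [0.1, 0.4, 0.5, 0.9]
subjective = [1.0, 2.0, 3.0, 5.0]
srocc = spearman(objective, subjective)
lcc = pearson(objective, subjective)
```

Because the example scores rise together, SROCC is 1 even though the relationship is not perfectly linear, which is why the two coefficients are reported side by side.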

Table 2 Performances of four quality measures

6 Discussion

The SROCC and LCC coefficients lie between zero and one. The closer they are to one, the higher the correlation between the proposed metric and the subjective tests, indicating better performance. The relationship is reversed for RMSE and OR: the lower the value, the better the agreement with the subjective method. The experiments showed that when more than one object is removed from an image, the other methods cannot evaluate the quality of inpainting reliably; that is, their accuracy decreases once the number of removed objects exceeds one. The proposed method also works well on these images, and its assessment is not affected by whether one or several objects are removed.

In images with high complexity, the other methods performed very poorly, whereas the proposed method assessed the quality of the inpainted images better than the rest. When a small object of low complexity was removed, almost all methods reflected the quality of the inpainted images well. When large objects were removed, the DN method was weaker than the rest. As can be seen, the proposed method has the highest correlation (0.88) with the subjective results among the four measures, suggesting that it is more efficient and accurate than the other three.

7 Conclusions

In this paper, a metric based on human visual characteristics was developed to assess the quality of inpainted images. Two terms, a penalty and a compensation, were added to make the metric more accurate. The metric was validated against subjective assessment by human observers. The results were satisfactory and showed that the metric approximates human perception of quality. Further, the proposed method was compared against three other methods, and the results confirmed that it performs better than all three. In future work, further terms related to human perception can be added to improve the metric. This requires the use of high-level features in the evaluation of images.