1 Introduction

Artificial Intelligence Generated Content (AIGC) refers to content, including text, images, audio, and video, that is created or generated with the assistance of AI technology. Many impressive AIGC models, such as ChatGPT and DALLE [26], have been developed in recent years and utilized in various application scenarios. As an important part of AIGC, AI Generated Images (AIGIs) have also gained significant attention due to advances in generative models, including Generative Adversarial Networks (GANs) [9], Variational Autoencoders (VAEs) [14], and diffusion models [27], as well as language-image pre-training techniques such as CLIP [25] and BLIP [18].

However, the development of AIGI models also raises new problems and challenges. One significant challenge is that not all generated images are qualified for real-world applications; they often need to be processed, adjusted, refined, or filtered out before being applied to practical scenes. Moreover, unlike common image content such as Natural Scene Images (NSIs) [7, 8], screen content images [3, 20], and graphic images [5, 20], which generally suffer from common distortions including noise, blur, and compression [4, 6], AIGIs may exhibit unique degradations such as unreal structures and unreasonable object combinations. In addition, the generated images may not correspond to the semantics of their text prompts [15, 17, 29]. Therefore, it is important to study human visual preferences for AIGIs and design corresponding objective Image Quality Assessment (IQA) metrics for these images.

Many subjective IQA studies have been conducted for human-captured or human-created images, and many objective IQA models have been developed accordingly. However, these models are designed to assess low-level distortions, while AIGIs generally contain both low-level artifacts and high-level semantic degradations. Quantitative evaluation metrics such as the Inception Score (IS) [10] and the Fréchet Inception Distance (FID) [12] have been proposed to assess the performance of generative models and are widely used to evaluate the authenticity of generated images. However, these metrics cannot evaluate the authenticity of a single generated image, nor can they measure the correspondence between a generated image and its text prompt. Since AIGIs constitute a new type of image content with irregular distortions, previous IQA methods may fail to assess their quality and may not align well with human preferences.

To gain a better understanding of human visual preferences for AIGIs and to guide the design of corresponding objective IQA models, in this paper we conduct a comprehensive subjective and objective IQA study for AIGIs. We first establish a large-scale IQA database for AIGIs, termed AIGCIQA2023, which contains 2,400 diverse images generated by 6 state-of-the-art AIGI models from 100 various text prompts. Based on these images, a well-organized subjective experiment is conducted to assess human visual preferences for each generated image from three perspectives: quality, authenticity, and correspondence. Based on the constructed AIGCIQA2023 database, we evaluate the performance of several state-of-the-art IQA models and establish a new benchmark. Experimental results demonstrate that current IQA methods do not align well with human visual preferences for AIGIs, and that more effort is needed in this research field. The main contributions of this paper are summarized as follows:

  • We propose to disentangle the human visual experience for AIGIs into three perspectives: quality, authenticity, and correspondence.

  • Based on this disentanglement, we establish a novel large-scale database, i.e., AIGCIQA2023, to better understand human visual preferences for AIGIs and guide the design of objective IQA models.

  • We conduct a benchmark experiment to evaluate the performance of several current state-of-the-art IQA algorithms in measuring the quality, authenticity, and text-image correspondence of AIGIs.

The rest of the paper is organized as follows. In Sect. 2 we introduce the details of the constructed AIGCIQA2023 database, including the generation of AIGIs and the subjective quality assessment methodology and procedures. In Sect. 3 we present the benchmark experiment for current state-of-the-art IQA algorithms on the established database. Section 4 concludes the paper and discusses possible future research that can be conducted with the database.

2 Database Construction and Analysis

To better understand human visual preferences for AI-generated images based on text prompts, we construct a novel IQA database for AIGIs, termed AIGCIQA2023, which collects images generated by six state-of-the-art deep generative models from 100 text prompts, together with corresponding subjective quality ratings from three different perspectives. We then analyze human visual preferences for AIGIs based on the constructed database.

2.1 AIGI Collection

Fig. 1. Pie chart of the ten challenge categories and ten scene categories selected from PartiPrompts [32].

We adopt six of the latest text-to-image generative models, namely Glide [24], Lafite [34], DALLE [26], Stable-diffusion [27], Unidiffuser [1], and Controlnet [33], to produce AIGIs using their open-source code and default weights. To ensure content diversity and match practical application requirements, we collect diverse texts from the PartiPrompts website [32] as the prompts for AIGI generation. The text prompts can be simple, allowing generative models to produce imaginative results, or complex, which raises the challenge for generative models. We select 10 scene categories from the prompt set, and each scene contains 10 challenge categories. Overall, we collect 100 text prompts (10 scene categories \(\times \) 10 challenge categories) from PartiPrompts [32]. The distribution of the selected scene and challenge categories is displayed in the pie chart in Fig. 1. It can be observed that the dataset exhibits a high level of scene diversity, with the generated images covering a broad range of challenges. We then perform text-to-image generation with these models and prompts. Specifically, for each prompt, we randomly generate 4 different images with each generative model. Therefore, the constructed AIGCIQA2023 database contains a total of 2,400 AIGIs (4 images \(\times \) 6 models \(\times \) 100 prompts) corresponding to the 100 prompts (Fig. 2).
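To make the collection procedure concrete, the following Python sketch outlines the generation loop, assuming a user-supplied wrapper generate_fn(model_name, prompt) around the open-source code of each generator; the function names and file-naming scheme are illustrative and not part of any released database tooling.

```python
import os
from typing import Callable, Sequence

# The six generators used in AIGCIQA2023. `generate_fn(model_name, prompt)` is a
# hypothetical user-supplied wrapper around each model's open-source code (their
# real APIs differ), returning an object with a PIL-style .save() method.
MODELS = ["Glide", "Lafite", "DALLE", "Stable-diffusion", "Unidiffuser", "Controlnet"]
IMAGES_PER_PROMPT = 4  # four random samples per (model, prompt) pair

def collect_aigis(prompts: Sequence[str],
                  generate_fn: Callable[[str, str], object],
                  out_dir: str = "AIGCIQA2023/images") -> None:
    """Produce 4 images x 6 models x 100 prompts = 2400 AIGIs."""
    os.makedirs(out_dir, exist_ok=True)
    for model_name in MODELS:
        for p_idx, prompt in enumerate(prompts):
            for k in range(IMAGES_PER_PROMPT):
                image = generate_fn(model_name, prompt)  # each call draws a new random sample
                image.save(os.path.join(out_dir, f"{model_name}_{p_idx:03d}_{k}.png"))
```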

2.2 Subjective Experiment Setup

Subjective IQA is the most reliable way to evaluate the visual quality of digital images as perceived by users. It is generally used to construct image quality datasets and serves as the ground truth for optimizing or evaluating objective quality assessment metrics. Due to the unnatural nature of AIGIs and the fact that different text prompts target different image spaces, it is unreasonable to use a single "quality" score to represent human visual preferences. In this paper, we propose to measure human visual preferences for AIGIs from three perspectives: quality, authenticity, and text-image correspondence. For a given image, these three visual perception perspectives are related but distinct.

Fig. 2. Sample images from the AIGCIQA2023 database generated by six different generative models (Glide [24], Lafite [34], DALLE [26], Stable-diffusion [27], Unidiffuser [1], and Controlnet [33]).

Fig. 3. Illustration of the images from the perspectives of quality, authenticity, and text-image correspondence. (a) 10 high-quality and 10 low-quality examples of images generated by the prompt "a corgi". (b) 10 high-authenticity and 10 low-authenticity examples of images generated by the prompt "a girl". (c) 10 high text-image correspondence and 10 low correspondence examples of images generated by the prompt "a grandmother reading a book to her grandson and granddaughter".

The first dimension of AIGI evaluation is "quality", i.e., evaluating an AIGI in terms of its clarity, color, lightness, contrast, etc., which is similar to the assessment of NSIs. During the experiment, subjects are instructed to evaluate whether the image outline is clear, whether the content can be distinguished, the richness of details, etc. Figure 3(a) shows 10 high-quality and 10 low-quality examples of images generated by the prompt "a corgi".

Considering the generative nature of AIGIs, an important issue is that they may not look real compared to NSIs. Therefore, we introduce a second evaluation dimension for the generated images, i.e., "authenticity". For this dimension, subjects are instructed to assess whether the image looks real, i.e., whether they can tell that the image is AI-generated or not. Figure 3(b) shows 10 high-authenticity and 10 low-authenticity examples of images generated by the prompt "a girl".

Since an AIGI is generated from a text prompt, it is also important to evaluate its correspondence with the original prompt, i.e., the third dimension, text-image "correspondence". For this purpose, subjects are instructed to consider the textual information provided with the image and then give a correspondence score from 0 to 5 to assess the relevance between the generated image and its prompt. Figure 3(c) shows 10 high text-image correspondence and 10 low correspondence examples of images generated by the prompt "a grandmother reading a book to her grandson and granddaughter".

2.3 Subjective Experiment Procedure

To evaluate the quality of the images in AIGCIQA2023 and obtain Mean Opinion Scores (MOSs), a subjective experiment is conducted following the guidelines of ITU-R BT.500-14 [3]. The subjects are asked to rate their degree of visual preference for the exhibited AIGIs in terms of quality, authenticity, and text-image correspondence. The AIGIs are presented in random order on an iMac monitor with a resolution of up to 4096 \(\times \) 2304, using an interface designed with Python Tkinter, as shown in Fig. 4. The interface allows viewers to browse the previous and next AIGIs and rate them on a scale from 0 to 5 with a minimum interval of 0.01. A total of 28 graduate students (14 males and 14 females) participate in the experiment; they are seated at a distance of around 60 cm in a laboratory environment with normal indoor lighting.

Fig. 4. An example of the subjective assessment interface. The subject can evaluate an AIGI and record the quality, authenticity, and correspondence scores with the scroll bars on the right.
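For illustration, the snippet below sketches a rating panel of this kind with Python Tkinter: three 0-to-5 scales with a 0.01 step, one per evaluation perspective, plus navigation buttons. The actual interface shown in Fig. 4 is not public, so the widget layout and names here are assumptions.

```python
# A minimal Tkinter sketch in the spirit of the interface in Fig. 4 (the real
# tool used in the study is not public; layout and names are illustrative).
import tkinter as tk

def build_rating_panel() -> tk.Tk:
    root = tk.Tk()
    root.title("AIGI subjective rating")
    scales = {}
    # One 0-5 scale per evaluation perspective, with a 0.01 step.
    for dim in ("quality", "authenticity", "correspondence"):
        s = tk.Scale(root, label=dim, from_=0.0, to=5.0, resolution=0.01,
                     orient=tk.HORIZONTAL, length=300)
        s.pack(padx=10, pady=5)
        scales[dim] = s
    # Navigation buttons; in a full tool the callbacks would load the
    # previous/next AIGI and store the three scores for the current one.
    tk.Button(root, text="Previous").pack(side=tk.LEFT, padx=10, pady=10)
    tk.Button(root, text="Next").pack(side=tk.RIGHT, padx=10, pady=10)
    return root

if __name__ == "__main__":
    build_rating_panel().mainloop()
```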

2.4 Subjective Data Processing

We follow the procedures recommended by ITU to conduct outlier detection and subject rejection; the score rejection rate is 2%. To obtain the MOS for an AIGI, we first convert the raw ratings into Z-scores and then linearly scale them to the range [0, 100] as follows:

$$z_{ij}=\frac{r_{ij}-\mu _i}{\sigma _i},\quad z_{ij}'=\frac{100(z_{ij}+3)}{6},$$
$$\mu _i=\frac{1}{N_i}\sum _{j=1}^{N_i}r_{ij}, \quad \sigma _i=\sqrt{\frac{1}{N_i-1}\sum _{j=1}^{N_i}{(r_{ij}-\mu _i)^2}}$$

where \(r_{ij}\) is the raw rating given by the i-th subject to the j-th image, and \(N_i\) is the number of images judged by subject i.

Next, the MOS of image j is computed by averaging the rescaled z-scores as follows:

$$MOS_j=\frac{1}{M}\sum _{i=1}^{M}z_{ij}'$$

where \(MOS_j\) indicates the MOS for the j-th AIGI, M is the number of valid subjects, and \(z_{ij}'\) are the rescaled z-scores.
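A short NumPy sketch of the subject-wise normalization and MOS computation defined above is given below; the array layout (subjects as rows, images as columns, NaN marking missing or rejected ratings) is an assumption for illustration.

```python
# Sketch of the z-score normalization and MOS computation defined above.
import numpy as np

def compute_mos(ratings: np.ndarray) -> np.ndarray:
    """ratings[i, j] = raw 0-5 score of subject i for image j (NaN if missing/rejected)."""
    mu = np.nanmean(ratings, axis=1, keepdims=True)            # per-subject mean mu_i
    sigma = np.nanstd(ratings, axis=1, ddof=1, keepdims=True)  # per-subject std sigma_i (N_i - 1)
    z = (ratings - mu) / sigma                                  # z-scores z_ij
    z_rescaled = 100.0 * (z + 3.0) / 6.0                        # linear rescaling to [0, 100]
    return np.nanmean(z_rescaled, axis=0)                       # average over valid subjects

# Example: 3 subjects rating 4 images.
ratings = np.array([[4.2, 1.0, 3.5, 2.8],
                    [3.9, 0.8, 3.1, 2.5],
                    [4.5, 1.5, 3.8, 3.0]])
print(compute_mos(ratings))
```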

Fig. 5. Comparison of the differences between the three evaluation perspectives. (a) The left image has better quality but worse authenticity and correspondence. (b) The left image has better authenticity but worse quality and correspondence. (c) The left image has better correspondence but worse quality and authenticity.

2.5 AIGI Analysis from Three Perspectives

To further illustrate the differences among the three perspectives, we show several example images and their corresponding subjective ratings from the three aspects in Fig. 5. In each subfigure, the right AIGI outperforms the left AIGI on two evaluation dimensions but is much worse on the remaining one, which demonstrates that each evaluation dimension (quality, authenticity, or text-image correspondence) captures a distinct aspect of human preference and has its own value.

Figure 6 shows the MOS and score distributions for quality, authenticity, and text-image correspondence evaluation, respectively, which demonstrates that the images in AIGCIQA2023 cover a wide range of perceptual quality.

3 Experiment

3.1 Benchmark Models

Since the AIGIs in the proposed AIGCIQA2023 database are generated based on text prompts and have no pristine reference images, they can only be evaluated by no-reference (NR) IQA metrics. In this paper, we select fifteen state-of-the-art IQA models for comparison. The selected models can be classified into two groups:

Fig. 6. (a) MOS distribution of the quality scores. (b) MOS distribution of the authenticity scores. (c) MOS distribution of the correspondence scores. (d) Distribution of the quality scores. (e) Distribution of the authenticity scores. (f) Distribution of the correspondence scores.

  • Handcrafted-based models, including: NIQE [23], BMPRI [21], BPRI [19], BRISQUE [22], HOSA [30], BPRI-LSSn [19], BPRI-LSSs [19], BPRI-PSS [19], QAC [31], HIGRADE-1 and HIGRADE-2 [16].

    These models extract handcrafted features based on prior knowledge about image quality.

  • Deep learning-based models, including: CNNIQA [13], WaDIQaM-NR [2], VGG (VGG-16 and VGG-19) [28] and ResNet (ResNet-18 and ResNet-34) [11].

    These models characterize quality-aware information by training deep neural networks on labeled data; a minimal sketch of such a regression model is given below.
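As a reference for the second group, the following PyTorch sketch shows a typical NR-IQA regressor of this kind: an ImageNet-pretrained ResNet-18 whose classifier is replaced by a one-dimensional head that regresses the MOS. The loss, optimizer, and input size are common choices and are not claimed to be the exact benchmark configuration.

```python
# A minimal NR-IQA regressor sketch: ImageNet backbone + one-dimensional head.
import torch
import torch.nn as nn
from torchvision import models

class ResNetIQA(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # regress a single score
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x).squeeze(-1)

model = ResNetIQA()
criterion = nn.MSELoss()  # regress toward the ground-truth MOS
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate as in Sect. 3.3

# One illustrative training step on a dummy batch (batch size 4, as in Sect. 3.3).
images = torch.randn(4, 3, 224, 224)
mos = torch.rand(4) * 100
loss = criterion(model(images), mos)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```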

3.2 Evaluation Criteria

In this study, we utilize four performance evaluation criteria to evaluate the consistency between the predicted scores and the corresponding ground-truth MOSs: the Spearman Rank Correlation Coefficient (SRCC), the Pearson Linear Correlation Coefficient (PLCC), the Kendall Rank Correlation Coefficient (KRCC), and the Root Mean Squared Error (RMSE).
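These four criteria can be computed directly with SciPy and NumPy, as in the sketch below. Note that in IQA practice PLCC and RMSE are often computed after a nonlinear logistic mapping of the predictions, which is omitted here for brevity.

```python
# Sketch of the four evaluation criteria (SRCC, PLCC, KRCC, RMSE).
import numpy as np
from scipy import stats

def iqa_criteria(pred: np.ndarray, mos: np.ndarray) -> dict:
    srcc, _ = stats.spearmanr(pred, mos)   # rank-order consistency
    plcc, _ = stats.pearsonr(pred, mos)    # linear correlation
    krcc, _ = stats.kendalltau(pred, mos)  # pairwise rank agreement
    rmse = float(np.sqrt(np.mean((pred - mos) ** 2)))
    return {"SRCC": srcc, "PLCC": plcc, "KRCC": krcc, "RMSE": rmse}
```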

3.3 Experimental Setup

All benchmark models are validated on the proposed AIGCIQA2023 database. The traditional handcrafted-based models are evaluated directly on the database. For the deep trainable models, we first randomly split the database at a 4:1 ratio for training/testing, while ensuring that images generated from the same prompt fall into the same set. The partitioning and evaluation process is repeated several times for a fair comparison while keeping the computational cost manageable, and the average result is reported as the final performance. Specifically, the deep learning-based models, i.e., CNNIQA [13], WaDIQaM-NR [2], VGG (VGG-16 and VGG-19) [28], and ResNet (ResNet-18 and ResNet-34) [11], are trained to predict the MOS of each image. The split is repeated 10 times, and each model is trained for 50 epochs with an initial learning rate of 0.0001 and a batch size of 4.
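The prompt-level split described above can be sketched as follows: prompts, rather than individual images, are shuffled and partitioned, so that all 24 images (6 models \(\times \) 4 images) generated from the same prompt fall into the same set. The helper below is illustrative, not the exact benchmark code.

```python
# Prompt-level 4:1 train/test split, repeated with different seeds per round.
import random

def split_by_prompt(num_prompts: int = 100, train_ratio: float = 0.8, seed: int = 0):
    prompt_ids = list(range(num_prompts))
    random.Random(seed).shuffle(prompt_ids)
    cut = int(train_ratio * num_prompts)
    return set(prompt_ids[:cut]), set(prompt_ids[cut:])  # train prompts, test prompts

train_prompts, test_prompts = split_by_prompt(seed=3)
# An image belongs to the training set iff its prompt id is in train_prompts.
```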

Table 1. Performance comparisons of the state-of-the-art IQA methods on the AIGCIQA2023 database. The best and second-best performance results are highlighted.

3.4 Performance Discussion

The performance results of the above state-of-the-art IQA models on the proposed AIGCIQA2023 database are presented in Table 1, from which we can draw several conclusions:

  • The handcrafted-based methods achieve poor performance on the whole database, which indicates that their handcrafted features are not effective for modeling the quality representation of AIGIs. This is because most of these features are built on prior knowledge learned from NSIs, which does not transfer well to AIGIs.

  • The deep learning-based methods achieve relatively more competitive results across the three evaluation perspectives. However, their performance is still far from satisfactory.

  • Most IQA models achieve better performance on quality evaluation and worse performance on text-image correspondence assessment. The reason is that the text prompts used for image generation are not utilized during IQA model training, which makes it more challenging for the IQA models to capture text-image relevance from the AIGIs alone and inevitably leads to performance drops.

4 Conclusion and Future Work

In this paper, we study the human visual preference problem for AIGIs. We first construct a new IQA database for AIGIs, termed AIGCIQA2023, which includes 2,400 AIGIs generated from 100 various text prompts, along with corresponding subjective MOSs evaluated from three perspectives (i.e., quality, authenticity, and text-image correspondence). Experimental analysis demonstrates that these three dimensions reflect different aspects of human visual preferences for AIGIs, which further indicates that the evaluation of the Quality of Experience (QoE) for AIGIs should consider multiple dimensions. Based on the constructed database, we evaluate the performance of several state-of-the-art IQA models and establish a new benchmark to facilitate future research.

In future work, we will further explore the human visual perception for AIGIs and develop corresponding objective evaluation models for better assessing the quality of AIGIs from the three perspectives proposed in this paper.