1 Introduction

Generative Adversarial Networks (GANs) have become extremely popular in the fields of artificial intelligence and deep learning. Along with the development of new models, numerous applications of GANs have been proposed. The spectrum of task-specific applications is wide, ranging from jellyfish swarm detection to testing of gear safety in vehicles. This popularity is reflected in the number of research publications in the field, as shown in Fig. 1. Although other generative models such as variational autoencoders are available, GANs have several advantages: they can handle sharp estimated density functions, eliminate deterministic bias, generate desired samples, and are compatible with neural architectures. Especially in computer vision, GANs have been successful in many applications, e.g. image generation, image-to-image translation, image super-resolution, and shadow GANs [35, 36, 101]. Because of the vast amount of available information on GANs, this paper narrows its focus to GANs in the field of image/video processing. The inherent relation between image and video is that an image is a motionless picture that does not change with time, whereas a video has a third dimension, time, and comprises moving images, as shown in Fig. 2.

Fig. 1 Trends in research concerning generative adversarial networks have shown exponential growth

Fig. 2 Image and video dimensionality

This paper provides a critical study of the major developments in image and video processing that utilize generative adversarial networks. Despite their wide range of applications, the main purpose of GANs is to act as a target density estimator: GANs can implicitly learn the distribution of the original set and generate samples from the estimated distribution. GANs are efficient in image synthesis because explicitly estimating a data distribution in high dimensions is a tedious task that requires the construction of likelihood functions, which the implicit approach avoids. Currently, there is high demand for GAN-based image/video algorithms, since the need for artificially generated images in deep learning simulations is increasing, as noted by Wang et al. [36] and Gonog [35]. Figure 3 shows the number of publications per year, from 2014 to April 2020, on image/video processing using generative adversarial networks. Papers were identified through a manual online search covering journals, conferences, reviews and other sources, using the keywords “image processing using generative adversarial networks” and “video processing using generative adversarial networks” separately on the Semantic Scholar search engine [41, 44, 45, 46]. Semantic Scholar is an AI-backed search engine for articles that highlights important and significant papers [25, 27, 32, 34].

Fig. 3 Image and video GANs publications from Semantic Scholar

There has been substantial progress in image/video applications of GANs, motivated both by their advantages and by efforts to address their disadvantages [47, 89, 90, and]. To summarize, the contributions of the paper are:

  • To provide a solid understanding of the different image and video processing GANs, their architectures, their cost functions, and the models that exist or are being developed.

  • To present, in one place, the quantitative and qualitative measures used to assess and compare the quality of GANs, together with their pros and cons.

  • To give an extensive overview of real-world image/video applications of GANs across human activities, concluding with the future challenges of GANs.

1.1 Basic building blocks and architecture of GAN

A Generative Adversarial Network consists of a generator (G) and a discriminator (D), as shown in Fig. 4. The generator produces fake data by taking noise vectors as input. This fake data is given to the discriminator together with the training data. The discriminator works as a simple classifier that labels each input sample as original or fake by assigning it a probability [51, 62, 64, 65]. Ideally, a real sample is assigned a probability of 1 and a forged sample a probability of 0. This information is backpropagated to the generator network in the form of a gradient. It helps the generator to learn the features of the training dataset and, in turn, to generate sample images whose statistical distribution matches that of the original data.

Fig. 4 Schematic view of major GAN models. (a) Original GAN model [9, 42] (b) Info GAN [20] (c) BEGAN [11] / EBGAN [161] (d) Conditional GAN [93] (e) Pix2Pix [56] (f) Cycle GAN [164] and (g) LAPGAN [30]. ‘z’ is the noise vector, c is condition labels, G is generator and D is discriminator. Subscript g refers to the generated sample and subscript refers to the real sample

In the next step, the discriminator accepts both the original data and the generated fake samples. It decides whether each image is fake or not, and this learning process repeats. The discriminator estimates the ratio of densities and passes it to the generator in the form of a gradient. The features are learned cooperatively, alternating between the two blocks, the generator and its discriminator. At the start of the min-max game, winning is effortless for the discriminator. As training continues, however, the game becomes more challenging for both players because the generator starts converging towards the original data [78, 80, 96, 98, 108].
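For reference, the min-max game described above is usually written as the value function of the original GAN formulation [42]. The symbols p_data (distribution of the original data) and p_z (distribution of the input noise) are notation assumed here for illustration.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator raises V by pushing D(x) towards 1 on original samples and D(G(z)) towards 0 on generated ones, while the generator lowers V by pushing D(G(z)) towards 1.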

1.2 Input, training and cost functions

1.2.1 Input noise

Multiple approaches are used to provide the inputs to the generator network. The inputs can be provided at any layer in the model. For example, the noise vector z can be divided into two vectors, with the first noise vector provided at the first layer and the second noise vector provided at the last layer. Another approach is to provide multiple random noise vectors to the inner hidden layers of the generator network.

1.2.2 Training

The training steps for a GAN are summarized in Fig. 5. A GAN is a structured probabilistic model. It consists of latent variables ‘z’ (noise) and observed variables ‘x’ (original distribution). The G function (the generator) takes ‘z’ (the noise vectors) as its input and uses θ(G) as parameters, while the D function (the discriminator) takes ‘x’ (data samples from the fake and original distributions) as input and uses θ(D) as parameters.

Fig. 5 Steps of GAN model training

For an ordinary optimization problem, the solution is a minimum, i.e. a point in parameter space where all neighbouring points have an equal or higher cost. For the game between the discriminator and the generator, the solution is instead a Nash equilibrium: a point that is a local minimum of J(D) with respect to θ(D) and a local minimum of J(G) with respect to θ(G). By training the discriminator, an estimate of the ratio pdata(x)/pmodel(x) is obtained for each point in the x domain. This ratio enables the computation of the vast majority of divergences and their gradients. GANs make their approximations by using supervised learning to estimate this ratio. Simultaneous stochastic gradient descent is applied at each training step. Two mini-batches are sampled, one from ‘x’ and the other from ‘z’, and the two gradient steps are then made simultaneously. Any gradient-based optimization algorithm can be used for these updates; the most commonly used one is Adam.
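The link between the discriminator and the density ratio can be illustrated with a small sketch. It assumes an ideal discriminator D*(x) = pdata(x) / (pdata(x) + pmodel(x)), from which the ratio is recovered as D*(x) / (1 - D*(x)). The two Gaussian densities and all function names are hypothetical choices made only for this example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def optimal_discriminator(x, p_data, p_model):
    """Ideal discriminator output for known densities (an assumption:
    a trained network only approximates this value)."""
    pd, pm = p_data(x), p_model(x)
    return pd / (pd + pm)


def density_ratio_from_discriminator(d):
    """Recover pdata(x)/pmodel(x) from a discriminator output d in (0, 1)."""
    return d / (1.0 - d)


# Hypothetical densities used only for illustration.
def p_data(x):
    return gaussian_pdf(x, 0.0, 1.0)


def p_model(x):
    return gaussian_pdf(x, 1.0, 1.5)


for x in (-1.0, 0.0, 0.5, 2.0):
    d = optimal_discriminator(x, p_data, p_model)
    recovered = density_ratio_from_discriminator(d)
    exact = p_data(x) / p_model(x)
    print(f"x={x:+.1f}  D*={d:.4f}  ratio from D*={recovered:.4f}  exact={exact:.4f}")
```

When the two densities coincide, the ideal discriminator outputs 0.5 everywhere and the recovered ratio is 1, which corresponds to the equilibrium of the game.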

1.2.3 Role of cost functions

The cost functions of both D and G play a crucial role in the training of the GAN model. The discriminator cost function is defined so that the discriminator maximizes the probability of correctly labelling a sample as original or counterfeit, while the generator cost function maximizes the probability that a generated sample is judged to be real. The vanilla GAN model explored three approaches for the generator cost function, namely the min-max approach, the heuristic approach and the maximum likelihood approach. Newer GAN models have explored different mechanisms to calculate the distance between the original and generated distributions. The original GAN paper used KL divergence and Jensen-Shannon divergence [42]. Changing the cost function leads to different training behaviour and outcomes, hence researchers have explored several alternatives such as the Earth Mover's distance [4], the χ2 divergence [89], etc. The cost functions of major GAN variants are summarized in Table 1.
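The discriminator cost and the three generator costs mentioned above can be sketched as follows. The formulas follow the commonly cited forms (cross-entropy for D; J(G) = -J(D) for the min-max game; the non-saturating log D(G(z)) form for the heuristic; an exponentiated-logit form for maximum likelihood). The function names and the use of plain Python lists of discriminator outputs are assumptions made for illustration, not the notation of Table 1.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def logit(p):
    """Inverse of the sigmoid function."""
    return math.log(p / (1.0 - p))


def discriminator_cost(d_real, d_fake):
    """J(D): cross-entropy with label 1 for original and 0 for generated samples.

    d_real holds D(x) for original samples, d_fake holds D(G(z)).
    """
    return (-0.5 * mean(math.log(d) for d in d_real)
            - 0.5 * mean(math.log(1.0 - d) for d in d_fake))


def generator_cost_minimax(d_real, d_fake):
    """Min-max (zero-sum) game: J(G) = -J(D)."""
    return -discriminator_cost(d_real, d_fake)


def generator_cost_heuristic(d_fake):
    """Heuristic (non-saturating) cost: maximize log D(G(z))."""
    return -0.5 * mean(math.log(d) for d in d_fake)


def generator_cost_max_likelihood(d_fake):
    """Maximum likelihood cost: -1/2 E[exp(logit(D(G(z))))]."""
    return -0.5 * mean(math.exp(logit(d)) for d in d_fake)


# Hypothetical discriminator outputs early in training:
# D is confident on original data and rejects generated data.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.1, 0.05, 0.2]
print("J(D)            :", round(discriminator_cost(d_real, d_fake), 4))
print("J(G) min-max    :", round(generator_cost_minimax(d_real, d_fake), 4))
print("J(G) heuristic  :", round(generator_cost_heuristic(d_fake), 4))
print("J(G) max. lik.  :", round(generator_cost_max_likelihood(d_fake), 4))
```

When the discriminator confidently rejects generated samples, the min-max cost changes very little with D(G(z)), whereas the heuristic cost remains large; this is the usual motivation for preferring the heuristic form in practice.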

Table 1 Summary of major Discriminator cost functions

2 Literature survey

Goodfellow et al. [42] first proposed the GAN, with the objective of an algorithm that efficiently replicates the distribution of the original data using maximum likelihood approximation and Jensen-Shannon divergence. In practice, the two blocks, G and D, were two neural networks; the former performed generation while the latter classified samples as fake or original by assigning them probabilities between 0 and 1.

Soon after, many researchers independently found that GAN models lack training stability and suffer from mode collapse [122, 123, 133]. The DCGAN model [107] was the first to offer a simple solution: instead of the sigmoid activation function, it used the ReLU function. DCGAN also combined the already popular deep convolutional nets [2] with the GAN model to further improve image quality.

As generated image quality improved, more emphasis was laid on conditional image generation. Building on the initial GAN model, Mirza et al. [93] proposed a conditional GAN (CGAN) model to generate samples based on class labels. CGAN can also learn multi-modal data distributions and can be used to generate descriptive tags that are not part of the training labels. Taking inspiration from CGAN, InfoGAN [20] focused on learning disentangled representations in an unsupervised manner. It works by maximizing mutual information through the optimization of a lower bound on the mutual information.

As ReLU activation proved insufficient to stabilize training, researchers explored different cost functions and divergence functions. Denton et al. [30] took inspiration from the Laplacian pyramid framework and implemented it using a cascade of convolutional networks. The Laplacian pyramid first generates images with coarse texture and appearance, and finer details are added as the pyramid progresses. LSGAN, introduced by Mao et al. [89, 90], used a new loss function for its GAN model, the least-squares loss function. Minimizing it is a special case of minimizing the Pearson χ2 divergence. It also pulls generated samples towards the decision boundary, and thereby closer to the original data samples.
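A minimal sketch of the least-squares losses is given below. It assumes the common label coding a = 0 for generated samples, b = 1 for original samples, and c = 1 as the value the generator wants the discriminator to output; the function names and the list-based inputs are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def lsgan_discriminator_loss(d_real, d_fake, a=0.0, b=1.0):
    """Least-squares discriminator loss.

    d_real: discriminator outputs on original samples (target b).
    d_fake: discriminator outputs on generated samples (target a).
    """
    return (0.5 * mean((d - b) ** 2 for d in d_real)
            + 0.5 * mean((d - a) ** 2 for d in d_fake))


def lsgan_generator_loss(d_fake, c=1.0):
    """Least-squares generator loss: push D(G(z)) towards c."""
    return 0.5 * mean((d - c) ** 2 for d in d_fake)


# A generated sample that D already scores far on the "real" side (3.0)
# is still penalized, unlike with a sigmoid cross-entropy loss.
print(lsgan_generator_loss([0.0]))  # far on the fake side
print(lsgan_generator_loss([1.0]))  # exactly on the target
print(lsgan_generator_loss([3.0]))  # far on the real side
```

The quadratic penalty on both sides of the target is what pulls samples towards the decision boundary, as described above.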

A major landmark was achieved by Salimans et al. [113], who proposed several techniques to combat training instability, non-convergence, and mode collapse. Major GAN models and their details are summarized in Table 2, and the training issues faced by researchers are summarized in Table 3.

Table 2 Training Challenges in Generative Adversarial Networks
Table 3 Summary of Major GAN Models along with their datasets, performance parameters, and contributions

WGAN [4] was a pioneering GAN model: it not only used the Wasserstein metric (also called the Earth Mover's, or EM, distance) but also introduced weight clipping. It continuously estimates the EM distance, which is useful for debugging and hyper-parameter searches. The improved version of WGAN [47] replaced weight clipping with a gradient penalty. The authors found that weight clipping leads to optimization difficulties: when the K-Lipschitz constraint (where K is the Lipschitz constant) is enforced by clipping, WGAN models lean towards learning extremely simple functions and ignore higher moments. Using a gradient penalty to enforce Lipschitz continuity resolved this issue. Other changes were the removal of batch normalization from the discriminator model and the use of a two-sided penalty to keep the gradient norm close to 1. Improved WGAN proved to be much more stable for training GANs.
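The difference between weight clipping and the two-sided gradient penalty can be sketched with a toy linear critic f(x) = w · x, whose gradient with respect to the input is simply w. In the improved WGAN the penalty is evaluated at points interpolated between original and generated samples; for a linear critic the gradient is the same everywhere, so that step is omitted here. The clipping threshold c = 0.01, the penalty weight 10, and all function names are assumptions chosen for illustration.

```python
import math


def clip_weights(weights, c=0.01):
    """WGAN-style weight clipping: force every weight into [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


def linear_critic(weights, x):
    """Toy critic f(x) = w . x (a stand-in for a neural network)."""
    return sum(w * xi for w, xi in zip(weights, x))


def input_gradient_norm(weights):
    """For f(x) = w . x the gradient with respect to x is w at every point."""
    return math.sqrt(sum(w * w for w in weights))


def gradient_penalty(weights, lam=10.0):
    """Two-sided penalty: lam * (||grad_x f|| - 1)^2."""
    return lam * (input_gradient_norm(weights) - 1.0) ** 2


def critic_loss(weights, real_batch, fake_batch, lam=10.0):
    """Critic loss: E[f(fake)] - E[f(real)] + gradient penalty."""
    real_score = sum(linear_critic(weights, x) for x in real_batch) / len(real_batch)
    fake_score = sum(linear_critic(weights, x) for x in fake_batch) / len(fake_batch)
    return fake_score - real_score + gradient_penalty(weights, lam)


weights = [3.0, 4.0]
print("clipped weights :", clip_weights(weights))
print("gradient norm   :", input_gradient_norm(weights))
print("gradient penalty:", gradient_penalty(weights))
print("critic loss     :", critic_loss(weights, [[1.0, 1.0]], [[0.0, 0.0]]))
```

Clipping shrinks every weight to a tiny range regardless of the data, whereas the penalty only discourages gradient norms that differ from 1 and leaves the critic free to represent richer functions.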

WGANs paved the way for more GAN models to utilize the EM distance. Berthelot et al. [11] proposed BEGAN, a method in which a loss derived from the EM distance is used to train an auto-encoder based GAN. It is an equilibrium-enforcing method that uses proportional control theory to balance the generator and the discriminator, and it also provides a new convergence measure and a robust GAN architecture. Another model that used a GAN alongside an auto-encoder was EBGAN [161], where the discriminator is used as an energy function that assigns low energy to regions close to the original data distribution and high energy elsewhere.
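The proportional-control idea can be sketched as a single bookkeeping step. Here `loss_real` and `loss_fake` stand for the auto-encoder reconstruction losses of the discriminator on original and generated samples; the values gamma = 0.5 and lambda_k = 0.001, like the function name, are assumptions for illustration rather than settings taken from this survey.

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN-style balancing step.

    Returns the discriminator loss, the generator loss, the updated
    control variable k (kept in [0, 1]) and a convergence measure.
    """
    d_loss = loss_real - k * loss_fake
    g_loss = loss_fake
    # Proportional control: adjust k according to how far the two
    # reconstruction losses are from the target ratio gamma.
    balance = gamma * loss_real - loss_fake
    k_next = min(1.0, max(0.0, k + lambda_k * balance))
    convergence = loss_real + abs(balance)
    return d_loss, g_loss, k_next, convergence


k = 0.0
for step, (l_real, l_fake) in enumerate([(1.0, 0.2), (0.8, 0.3), (0.6, 0.3)]):
    d_loss, g_loss, k, m = began_step(l_real, l_fake, k)
    print(f"step {step}: d_loss={d_loss:.4f} g_loss={g_loss:.4f} k={k:.6f} M={m:.4f}")
```

The control variable k decides how much weight the discriminator puts on generated samples, and the convergence measure decreases as reconstructions improve and the two losses approach the target balance.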

The improved WGAN model heavily influenced researchers to come up with new normalization and penalty methods. One such model was proposed by Kodali et al. [63], who studied GAN training as regret minimization and introduced a new gradient penalty scheme called Deep Regret Analytic GAN. Miyato et al. [94] suggested spectral normalization, a weight normalization technique that is regarded as easier to incorporate into existing models than weight normalization, weight clipping, or gradient penalty.

Another improvement was proposed by Alexia Jolicoeur-Martineau [57] in the form of relativistic GANs and relativistic average GANs (RaGANs). The author argued that the probability of real data being real should decrease as the generator becomes more effective. Experimental results showed that RaGANs with gradient penalty outperformed WGAN-GP and reduced the time needed to reach state-of-the-art results by a factor of four. RaGAN even produced better quality pictures from a very small training set (n = 2011), which is not possible with LSGAN.
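A minimal sketch of the relativistic idea follows. It assumes the discriminator is built from an unbounded critic score C(x), with the relativistic output sigmoid(C(x_real) - C(x_fake)) and the averaged variant comparing against the mean score of a batch of generated samples. The function names and the numeric scores are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def relativistic_prob(c_real, c_fake):
    """Probability that the original sample is more realistic than the
    generated one, given their critic scores."""
    return sigmoid(c_real - c_fake)


def relativistic_average_prob(c_real, c_fake_batch):
    """Averaged variant: compare against the mean critic score of a
    batch of generated samples."""
    avg_fake = sum(c_fake_batch) / len(c_fake_batch)
    return sigmoid(c_real - avg_fake)


# As generated samples obtain higher critic scores, the probability
# assigned to the original sample being "more real" drops.
for c_fake in (-2.0, 0.0, 2.0):
    print(c_fake, round(relativistic_prob(2.0, c_fake), 4))
```

The loop shows the behaviour argued for above: with a fixed score for the original sample, its relativistic probability falls as the generated samples improve.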

3 Current trends

The basic objective of generative adversarial networks is image generation. However, GANs have been adapted to a plethora of applications in several fields. Owing to their flexible nature, numerous GAN models have been combined with other techniques such as attention [158], relativism [57], etc. to perform specific tasks. In most cases, GANs have outperformed existing models. In the present work, the emphasis is on applications of GANs to image and video processing, which are discussed below [138, 147].

3.1 Image processing

GANs are basically synthetic image generation algorithms. Over the years, they have been used to perform several tasks in the field of image processing.

3.1.1 Image generation

Because image generation is the primary task of GANs, the major models are summarized in Table 4. The remaining applications are discussed in the following sections.

Table 4 Summary of Major GAN Models used for Image Translation

3.1.2 Inpainting

Image inpainting fills in plausible content for images that are incomplete, damaged, or missing information, without any prior knowledge. Among the most popular GANs that incorporate inpainting are PatchGAN and its derivatives. Demir et al. [28] proposed PGGAN, which combines two previous versions of GAN, namely the global GAN (GGAN) and PatchGAN. PGGAN shares its layers with both of these networks but later bifurcates to produce two adversarial losses that feed the generator network. Yu et al. [148] presented a new method for inpainting using gated convolution, which takes masks and inputs and uses a dynamic feature selection mechanism. The authors also introduced a new GAN model called SNPatchGAN, named for its spectral-normalized discriminator. It can eliminate undesirable or distracting objects, improve image layouts, clean up watermarks and generate new ones as well. Yu et al. [149] proposed another GAN model that is capable of utilizing surrounding image features for better predictions and of handling multiple holes of different sizes at arbitrary locations. The above-mentioned GAN models are used for 2D image inpainting. Wang et al. [132] proposed a 3D GAN to carry out 3D shape inpainting, which combines a 3D encoder-decoder GAN with an LRCN. The 3D models are processed in the form of 2D shapes. While the GAN framework captures the global 3D structure, the LRCN produces the finer details.

3.1.3 Image translation

Image translation requires more effort because the context has to be preserved during translation. Numerous algorithms have been deployed to achieve near-perfect translation. Inspired by CGAN, Pix2Pix [56] is capable of generating images from label information, reconstructing objects from edge information, filling and coloring images, etc. It has become popular among researchers and has been applied to different types of data, as it is capable of adapting its loss to the task at hand. It produces very good results, especially for translation tasks that involve highly structured graphical outputs.

Image-to-image translation is usually implemented by learning a mapping between images from a training set of image pairs. Zhu et al. [164] presented a novel algorithm to translate images when paired samples are not available. Because this type of mapping is under-constrained, an inverse mapping was learned in the same algorithm and a cycle consistency loss was used to couple the two mappings. This GAN is known as Cycle GAN because of its cycle consistency loss. It also performed object transfiguration, photo enhancement, season and style transfer, etc. Although compelling results were obtained, they were not uniform across all cases and applications. It succeeds when color and texture changes are required. However, in tasks that require geometric changes, the performance was subpar and left scope for further improvement. Another popular GAN model for cross-domain style transfer is DiscoGAN [60]. It addressed the task of discovering cross-domain relations from unpaired data: it works without explicit pair labels and learns to relate datasets from diverse domains. Using the discovered cross-domain relations, DiscoGAN successfully transferred style from one domain to the other while preserving key attributes, such as the identity and orientation of a face. However, it does not handle mixed modalities well.
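The cycle consistency loss can be sketched with toy mappings. Here G maps domain X to domain Y and F maps back; images are represented as flat lists of pixel values and the mean absolute (L1) error is used. The toy mappings and all names are hypothetical and only serve to show how the loss is assembled.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, x_batch, y_batch):
    """Forward cycle x -> G(x) -> F(G(x)) should return to x;
    backward cycle y -> F(y) -> G(F(y)) should return to y."""
    forward = sum(l1_distance(F(G(x)), x) for x in x_batch) / len(x_batch)
    backward = sum(l1_distance(G(F(y)), y) for y in y_batch) / len(y_batch)
    return forward + backward


# Toy "generators": G doubles every pixel, F_exact halves it again,
# F_biased is an imperfect inverse with a constant offset.
def G(img):
    return [2.0 * p for p in img]


def F_exact(img):
    return [p / 2.0 for p in img]


def F_biased(img):
    return [p / 2.0 + 0.1 for p in img]


x_batch = [[1.0, 2.0]]
y_batch = [[4.0]]
print(cycle_consistency_loss(G, F_exact, x_batch, y_batch))   # perfect inverse
print(cycle_consistency_loss(G, F_biased, x_batch, y_batch))  # imperfect inverse
```

In a full model this term is added to the adversarial losses of both mappings, which is what constrains the otherwise under-determined unpaired translation.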

3.1.4 Super resolution

Super-resolution is used to enhance the visual quality and details of an image; the major models are summarized in Table 5. Ledig et al. [66] addressed the problem of recovering fine texture details when super-resolving at large upscaling factors with SRGAN. The authors used a perceptual loss, which is a combination of a content loss and an adversarial loss. The content loss plays a critical part in super-resolution, as it takes into account perceptual similarity rather than pixel similarity. Wang et al. [129] made significant improvements to SRGAN and introduced Enhanced SRGAN (ESRGAN). The authors changed both the loss functions and the network architecture of SRGAN. They adopted the relativistic discriminator used in RaGAN, and they used the Residual-in-Residual Dense Block as the building block of the updated network architecture. To achieve better brightness consistency and texture recovery, the perceptual loss in ESRGAN was improved by extracting features before activation. This provided improved visual quality with new realistic textures.
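A minimal sketch of a perceptual loss built from a content term and an adversarial term is shown below. It assumes that the content loss is a mean squared error between feature vectors of the super-resolved and the high-resolution image (the features would come from a pre-trained network, which is not modelled here), and that the adversarial term is weighted by a small factor; the weight 1e-3 and all function names are assumptions for illustration.

```python
import math


def content_loss(features_sr, features_hr):
    """Mean squared error between feature vectors (perceptual similarity)."""
    return sum((a - b) ** 2 for a, b in zip(features_sr, features_hr)) / len(features_sr)


def adversarial_loss(d_sr):
    """Generator-side adversarial loss: -log D(G(I_LR)), averaged over
    the discriminator outputs for a batch of super-resolved images."""
    return sum(-math.log(d) for d in d_sr) / len(d_sr)


def perceptual_loss(features_sr, features_hr, d_sr, adv_weight=1e-3):
    """Weighted sum of content loss and adversarial loss."""
    return content_loss(features_sr, features_hr) + adv_weight * adversarial_loss(d_sr)


# Hypothetical feature vectors and discriminator outputs.
features_sr = [1.0, 2.0]
features_hr = [1.0, 4.0]
print(content_loss(features_sr, features_hr))
print(perceptual_loss(features_sr, features_hr, d_sr=[0.5]))
```

Measuring the content term in feature space rather than pixel space is what lets the loss reward perceptually similar textures even when individual pixels differ.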

Table 5 Summary of major GAN models used for Image Resolution Improvement

3.1.5 Segmentation

Segmentation is a common task in image processing, and multiple algorithms have been applied to it. Ehsani et al. [105] took a new step with SeGAN, which aims to complete the appearance of occluded objects in scenes. To complete this task, knowing “which pixels to paint?” and “what color to paint them?” is crucial. These two questions lead to the segmentation of the invisible parts followed by the generation of the invisible parts. SeGAN optimizes both tasks jointly. SeGAN is trained on synthetic photorealistic images and can reliably segment natural images. It is also capable of depth layering using the occluder-occludee relationships. A summarized comparison of typically used image GANs is given in Table 6.

Table 6 Advantages, Disadvantages and features of typical GANs [17]

3.1.6 Real-world image applications

Real-world image applications based on generative adversarial networks are growing rapidly. A GAN can generate realistic image samples from a random latent vector z, without needing to know the real data distribution or to assume particular mathematical conditions. These are the main reasons why GANs are used in academia, engineering, and almost every other field. Several researchers have presented prominent applications of GANs in real-world image processing [2, 3, 77, 167].

High-resolution human face images can be generated from low-resolution images by up-sampling and using the trained model to infer photo-realistic details. Bulat et al. [15] proposed real-world face super-resolution because many existing methods fail to produce good results on real-world low-resolution, low-quality face images. To solve this problem, a two-stage process was presented. First, a High-to-Low Generative Adversarial Network (GAN) is trained to degrade and downsample high-resolution images. The output of this network is then used to train a Low-to-High GAN for image super-resolution, this time using paired low- and high-resolution images. Figure 6 shows the results of face super-resolution on real-world low-resolution faces.

Fig. 6 Super-resolution results produced by Bulat et al. [15] on real-world low-resolution faces and compared to SRGAN [67] and CycleGAN [165]

Realistic pedestrian images in real scenes were generated using a GAN model in 2018 [102], and the augmented data were then used to train a CNN-based pedestrian detector. A GAN with multiple discriminators was used in order to generate realistic pedestrians and learn the background simultaneously. To handle pedestrians of different sizes, a Spatial Pyramid Pooling layer was used in the discriminator. Cross-dataset experiments were also performed, i.e., the model was trained on one dataset and tested on another. It was concluded that PS-GAN generates realistic pedestrians and improves the performance of CNN-based detectors. Figure 7 shows the PS-GAN model used to generate realistic pedestrian images.

Fig. 7 PS-GAN model using multiple discriminators (Dp and Db) to generate realistic pedestrian images [102]

WaterGAN [73] is a generative adversarial network that generates realistic underwater images from in-air images; in addition, the color of monocular underwater images is corrected using an unsupervised pipeline, as shown in Fig. 8. High-resolution images can be captured using autonomous cameras and remotely operated vehicles to map the seafloor. With WaterGAN, a large training dataset of raw underwater images with corresponding true-color in-air images was generated. It served as input to a novel network for the color correction of underwater images.

Fig. 8 WaterGAN model used for generating realistic underwater images [73]

3.2 Video processing

GANs have also been utilized in video processing. Some of their major applications are discussed in the following sections. The applications are also summarized, and important publications are listed, in Table 8.

3.2.1 Frame generation and prediction

Intermediate frame generation is an important task. Earlier methods based on interpolation yielded low-quality videos with extreme blurriness. Keyframe generation is an important research area, as it is used in slow-motion filming, video compression, forensic analysis, etc. Wen et al. [134] used two concatenated GANs for frame generation: one GAN captured the motion while the other generated the frame details. The technique used three losses, namely an adversarial loss, a normalized product correlation loss and a gradient difference loss.

For both video generation and recognition tasks, a model of how frames transform is needed. However, because objects and frames can change in a large number of ways, modelling these dynamics is challenging. Vondrick et al. [124] presented a video GAN that separates a scene's foreground from its background using a spatio-temporal convolutional architecture. Simulation results show that this model generates tiny videos at full frame rate and can predict plausible futures of still images. Moreover, visualizations and experiments show that the model learns suitable features with minimal supervision. Mathieu et al. [92] trained a CNN to produce future frames from an input sequence. Three feature learning strategies were proposed to address the blurry predictions obtained with the Mean Squared Error (MSE) loss. The UCF101 dataset was used in this work to compare the predictions. Tulyakov et al. [121] suggested MoCoGAN, the Motion and Content decomposed Generative Adversarial Network, for video generation. It generates a video by mapping a series of random vectors to a series of video frames. Each random vector has a content part and a motion part: the content part remains fixed, while the motion part is modelled as a stochastic process. Both an image discriminator and a video discriminator are used in the adversarial learning scheme. Experimental results on various datasets, compared against existing approaches, verify the efficacy of the framework.
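The decomposition of each latent vector into a fixed content part and a changing motion part can be sketched as follows. For simplicity, this sketch draws the motion part independently for every frame from a Gaussian, which is a simplification assumed here and not the mechanism of the MoCoGAN paper; the function name and dimensions are hypothetical.

```python
import random


def sample_latent_sequence(num_frames, content_dim, motion_dim, rng):
    """Build one latent vector per frame.

    The content part is sampled once and shared by all frames;
    the motion part is re-sampled for every frame.
    """
    z_content = [rng.gauss(0.0, 1.0) for _ in range(content_dim)]
    sequence = []
    for _ in range(num_frames):
        z_motion = [rng.gauss(0.0, 1.0) for _ in range(motion_dim)]
        sequence.append(z_content + z_motion)
    return sequence


rng = random.Random(0)
latents = sample_latent_sequence(num_frames=4, content_dim=3, motion_dim=2, rng=rng)
for t, z in enumerate(latents):
    print(t, [round(v, 3) for v in z])
```

Each latent vector would then be passed to an image generator to produce one frame, so the shared content part keeps the subject constant while the motion part changes from frame to frame.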

3.2.2 Video De-blurring

Video de-blurring is a difficult task, as it requires modelling both the spatial domain, which comprises the image plane, and the temporal domain, which comprises the neighbouring frames of the video. Another challenge is recovering sharp images when training relies on a pixel-wise error. To address these issues, Zhang et al. [160] suggested applying 3D convolution in the spatio-temporal domain, using DBLRNet for video de-blurring. To generate sharp images, DBLRNet was used as the generator in a GAN architecture. In addition to the regular adversarial loss, a content loss was also used. This GAN was named de-blurring GAN and achieved better results when tested on a benchmark dataset.

Shen et al. [115, 116] presented a triple-branch encoder-decoder architecture. Two branches are learned to sharpen foreground (FG) and background (BG) details, and the last branch produces a harmonious overall result. The model is further equipped with a supervised, human-aware attention mechanism trained in an end-to-end fashion. A soft mask encodes the FG human information and directs the FG and BG decoder branches to attend to their respective domains. To further support human-aware deblurring, the HIDE dataset, consisting of pairs of blurry and sharp images, was also proposed. HIDE covers a large number of scenes, motion patterns, and backgrounds.

3.2.3 Haze removal

Pang et al. [104] proposed the Haze Removal GAN, or HRGAN, for removing haze from videos in order to provide better visual quality in surveillance videos and their analysis. It contains a specialized generator and discriminator network. The generator estimates transmission maps, haze-free images and atmospheric light all at once. A loss function combining the adversarial loss produced by the discriminator with a pixel-wise loss was also proposed. The authors demonstrated the superiority of HRGAN over state-of-the-art de-hazing algorithms.

3.2.4 De-identification

GANs have also been used to hide information in videos. Brkic et al. [14] used GANs for person de-identification, i.e. the removal of non-biometric features and their replacement with features generated by GANs. The non-biometric features include clothing colors, hairstyles, etc. In some cases, even faces can be replaced. Such tools are very helpful when a person's information is to be concealed, as simple methods like blurring do not offer much protection. The de-identified videos are also immune to re-identification attacks.

3.2.5 Video super-resolution

Lucas et al. [83] made the first attempt of its kind to use a GAN for video super-resolution. They proposed a new GAN model called VSRResFeatGAN, whose generator is named VSRResNet. The generator is pre-trained with the mean squared error loss, which leads to better quantitative performance than other VSR models. The work also used a new evaluation metric, PercepDist, to assess the perceptual quality of videos more accurately than the formerly used SSIM metric.

In 2017, Chen et al. [21] suggested a video super-resolution framework using a GAN with temporal information fusion. It was observed that the temporal information of consecutive video frames contributes to the task of video super-resolution. The main contribution of this work was twofold. First, a new generator architecture was suggested with various mechanisms for the temporal fusion of information, namely early fusion, slow fusion and 3D convolution, which were also compared with other methods. Second, a discriminator architecture based on SRGAN was trained with an adaptive training routine, which reduces the burden of hand-tuning the large number of model parameters. Figure 9 compares video frames obtained using temporal fusion, bicubic interpolation and per-frame SRGAN.

Fig. 9 Comparison of Foreman and football frames. Left to right: Ground truth frame, bicubic interpolation, SRGAN, early fusion, slow fusion and 3D convolution [21]

Several GANs are now used for video super-resolution, depending on the requirements of the application. To differentiate between them, PSNR is used as a common evaluation parameter, as given in Table 7.
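For reference, the peak signal-to-noise ratio between a reference frame and a reconstructed frame can be computed as 10 · log10(MAX² / MSE). The sketch below assumes 8-bit pixel values (MAX = 255) stored as flat lists; the function names are hypothetical.

```python
import math


def mean_squared_error(reference, reconstructed):
    return sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    mse = mean_squared_error(reference, reconstructed)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [100, 100, 100, 100]
reconstructed = [110, 90, 110, 90]  # every pixel off by 10, so MSE = 100
print(round(psnr(reference, reconstructed), 2), "dB")
```

A higher PSNR means a smaller pixel-wise error, although it does not necessarily reflect perceptual quality.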

Table 7 Difference in super-resolution Video GANs based on evaluation parameters

3.2.6 Real-world video applications

Extending GANs from image processing to video is difficult because of the three dimensions of video, memory limitations, and training instability. In 2019 [53], video enhancement using a divide-and-conquer approach with adversarial learning was suggested, which divides and merges perception-based, frequency-based and dimension-based problems. The photo enhancement process was decomposed into multiple sub-problems, which were then recombined in a bottom-up manner. At the top level, a perception-based division was presented to learn the additive and multiplicative components required to convert a low-quality image/video into a high-quality one. At the intermediate level, a frequency-based division with a generative adversarial network was used to supervise photo enhancement. The lower level used a dimension-based division to enable the GAN model to better estimate the distribution distance on multi-dimensional data and to ease training of the GAN model.

Video generation was divided into frame generation and sequence generation using GANs [53], which made the task easier to solve. A two-step training scheme was used: first, a generator was trained on static frames, as shown in Fig. 10. Afterward, to generate natural-looking scenes, a recurrent model was trained on top of the previously trained frame generator. Both training steps used adversarial training. To avoid training instabilities, multiple discriminators were used.

Fig. 10 Video generator using GAN. Pink blocks are the pre-trained frame generator (F1, F2, ..., FN) with noise vectors ZF 1 to N [53]

Chu et al. [24] proposed a temporally self-supervised algorithm with adversarial learning to obtain temporally coherent solutions without loss of spatial detail. A Ping-Pong loss was suggested for better long-term temporal consistency; it effectively avoids artifacts in recurrent networks without suppressing detailed features. Moreover, progressive growing of generative adversarial networks was used for higher-resolution video generation in 2018 [1], as shown in Fig. 11. Videos of low resolution and short duration were generated first, and the resolution and duration were then gradually increased by adding new spatio-temporal convolutional layers. The progressive model learns spatio-temporal information and generates higher-resolution video. Table 8 presents major publications on generative adversarial networks in various fields.

Fig. 11
figure 11

Video generation using the Progressive model [1]

Table 8 Major Publications of Generative Adversarial Networks in various fields

4 Performance evaluation

With a plethora of GAN models available for different applications, choosing a suitable metric for evaluating GANs and comparing generative models remains a challenge. There is still no globally accepted benchmark metric for evaluating the performance of a GAN architecture in all aspects. There are, however, some essential and desirable characteristics of a GAN evaluation metric. An effective GAN evaluation metric should possess the following characteristics.

  • It should favor models that generate high-fidelity samples, i.e. it should be able to distinguish generated samples from real ones.

  • It should be sensitive to over-fitting of the model.

  • It should have well-defined boundary values.

  • It must be sensitive to image distortions and transformations.

  • It should be able to match human perceptual judgments and rankings of models.

  • It must have low computational complexity.

GAN performance is measured using both qualitative and quantitative measures. Qualitative evaluation is a visual check by humans, covering naturalness, shapes, perspective and structure. It is the most commonly used way to evaluate GANs, as given in Table 9.

Table 9 Summary of Qualitative Performance Evaluation Metrics used in image GANs [13, 31]

While qualitative measures help to inspect models, they also have some drawbacks. Firstly, evaluating image quality by human vision is cumbersome and biased [76]. It is difficult to reproduce and does not reflect the capacity of the model. Secondly, the high variance among human inspectors makes it necessary to average over a large number of subjects [29]. Visual inspection also fails to reveal whether a model drops modes; in fact, mode dropping tends to help visual sample quality. Hence quantitative measures play a prominent part in GAN evaluation, as summarized in Table 10 [13]. It is important to note that most evaluation schemes do not examine disentanglement in the latent space. The table also gives relative ratings of high, moderate and low. “-” means the value is missing or unknown and requires further research. “*” shows that there are multiple scores available in that group of measures.

Table 10 Summary of Quantitative Performance Evaluation Metrics used in GANs [13, 31, 36]

4.1 Pros and cons

Based on the above analysis, the advantages and inherent limitations of the most significant evaluation metrics can be summarized, along with the conditions under which they produce meaningful results. Some metrics make it possible to study over-fitting, perform model selection and compare GAN models without resorting to human evaluation of selected samples. There is no particular standard for selecting the best score. Different scores evaluate different aspects of the image generation process, and it is very unlikely that a single score can measure all of them. However, some measures are used more widely than others. For over-fitting, nearest-neighbor visualization and rapid categorization are the most commonly used quality measures. Overall, finding a measure that evaluates diversity and visual fidelity simultaneously remains an open problem. Diversity means that all modes are covered, whereas visual fidelity implies that generated samples should have a high likelihood. Parzen window estimation is an example of a likelihood estimate that favors trivial models, and it mostly fails to estimate the true likelihood in high-dimensional spaces. IS and FID are the two most widely used scores; both rely on pre-trained deep networks to compare original and generated samples. IS correlates with the quality and diversity of generated images. However, it evaluates Pg only as an image generation model, rather than in terms of its similarity to Pr. It may therefore encourage GAN models to learn sharp images regardless of Pr. Some evaluation methods, such as MS-SSIM, try to evaluate the diversity of generated images. It can detect mode collapse, as well as how well a generator captures the true data distribution. MMD is able to distinguish generated/noise images from real images. Both have low computational complexity. Given these advantages, MMD is still recommended even though it is biased.
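To make the scores named above concrete, the following is a minimal NumPy sketch of IS, the Fréchet distance underlying FID, and a biased estimate of squared MMD with a Gaussian kernel. It is an illustration under assumptions rather than a reference implementation: the class probabilities and feature vectors would normally come from a pre-trained network, which is replaced here by synthetic arrays, and the function names, the kernel bandwidth `sigma` and the toy data are all hypothetical.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """exp(mean_x KL(p(y|x) || p(y))) from an (N, C) matrix of class probabilities."""
    p_yx = np.asarray(class_probs, dtype=np.float64)
    p_y = p_yx.mean(axis=0, keepdims=True)          # marginal class distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) is the sum of the square roots of the eigenvalues
    # of cov_r @ cov_g; clip tiny negative values caused by rounding.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)

def gaussian_kernel(a, b, sigma):
    """Pairwise Gaussian kernel values between rows of a and rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    return float(gaussian_kernel(x, x, sigma).mean()
                 + gaussian_kernel(y, y, sigma).mean()
                 - 2.0 * gaussian_kernel(x, y, sigma).mean())

# Synthetic stand-ins for network outputs (assumptions for illustration).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + 3.0                                 # same covariance, shifted mean

is_uniform = inception_score(np.full((8, 4), 0.25))  # uninformative predictions
is_one_hot = inception_score(np.eye(4)[[0, 1, 2, 3] * 2])  # confident and diverse
fd_same = frechet_distance(real, real)
fd_shifted = frechet_distance(real, shifted)
mmd_same = mmd2_biased(real, real)
mmd_shifted = mmd2_biased(real, shifted)
```

On these toy inputs, IS is lowest when every prediction is uniform and approaches the number of classes when predictions are confident and spread evenly over classes, while both the Fréchet distance and MMD are near zero for identical sample sets and grow when one set is shifted.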
Figure 12 compares typical GANs in terms of F1, precision and recall, where the best F1 score for a fixed model is selected as the budget is varied. It is observed that even for this simple task, GAN models struggle to achieve a high F1 score. Analogous plots for precision and recall at various thresholds are also given.

Fig. 12
figure 12

Comparison of F1, Precision, Recall vs. computational budget for Boundary equilibrium generative adversarial networks (BEGAN), Deep Regret Analytic Generative Adversarial Networks (DRAGAN), Mean Minimax GAN (MM GAN), Non-Saturating GAN (NS GAN), Wasserstein GAN (WGAN), Wasserstein GAN gradient penalty (WGAN GP). [86]

5 Future challenges

GANs have become extremely popular in a very short time, making the world of GANs a magnum opus in itself. They are flexible algorithms that can be adapted and molded to suit particular requirements. Owing to this ability, numerous application-specific GANs have been developed in diverse fields. In the future, more such applications can be developed and combined with newer models and mechanisms such as attention and relativism.

  • In the future, GANs can be used in the following potential applications:

      • Creating infographics from text

      • Generating website designs

      • Compressing data

      • Drug discovery and development

      • Generating music

  • Researchers can look for a more stable distance metric for the GAN loss function.

  • As GANs lack a universal performance metric, there is an urgent need to develop one.

  • Among the newest deep learning mechanisms are the attention mechanism [154] and relativism. It will be valuable to see both combined in the future, along with reinforcement learning [6].

  • The best aspect of GANs is that the general models can be easily molded into new variants, including application-specific variants in fields such as audio processing, biological sciences, medical science and astronomy.

  • Mode collapse: An open research area is the tendency of the system to collapse, also called mode collapse. The system is a combination of two interacting entities, and feedback and input from each are pivotal to the survival of the other. In this situation, training might go on forever, or the failure of one entity might cause overall failure.

  • Equilibrium: The system is designed to minimize the cost functions of the generator and the discriminator individually, but it is not designed to reach a Nash equilibrium, so training can continue without settling anywhere. Convergence of the system is therefore not guaranteed, and algorithms that help a system reach equilibrium are another open research area.

  • Repetition: GANs often fall into a trap in which, once an image manages to fool the discriminator, the generator either creates the same image again and again or outputs only minor variations of it. This limits the diversity of the output and introduces inflexibility in the system.

  • Evaluation metrics: As of now, there is a lack of proper evaluation metrics for GANs. Those in use include nearest-neighbor algorithms that measure the distance between the produced image/sample and the training set.

  • Training instability: The Hessian of the objective becomes indefinite in GANs, so the optimal solution is a saddle point rather than a local minimum.
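The equilibrium and training-instability points above can be illustrated on a toy saddle-point problem. The sketch below is not a GAN: it assumes the bilinear objective V(x, y) = x * y, often used as a minimal example, with one player minimizing over x and the other maximizing over y through simultaneous gradient steps. The only equilibrium is the saddle point (0, 0), yet the iterates move away from it. The function name, learning rate and starting point are illustrative assumptions.

```python
import math

def simultaneous_gradient_play(x, y, lr=0.1, steps=50):
    """Simultaneous gradient descent on x and ascent on y for V(x, y) = x * y.

    Returns the distance from the equilibrium (0, 0) after every step.
    """
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x                    # dV/dx = y, dV/dy = x
        # Both players update at the same time from the current point.
        x, y = x - lr * grad_x, y + lr * grad_y
        distances.append(math.hypot(x, y))
    return distances

distances = simultaneous_gradient_play(1.0, 1.0)
```

Each update multiplies the distance to the equilibrium by sqrt(1 + lr^2), so the two players orbit the saddle point with a growing radius instead of converging, even though each of them follows the gradient of its own objective.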

6 Conclusion

In this paper, image/video processing GANs have been thoroughly reviewed. Publication trends, development, new variants of GANs and applications have been discussed in depth in an attempt to bring all relevant information under one umbrella. The recent rush among the scientific community to utilize GANs in core as well as interdisciplinary fields indicates that GANs have a very promising future. Their ability to deal with missing and unlabelled data is also notable. They have clear advantages over other generative models. However, they have also introduced some new problems. Their training instability, mode collapse and non-convergence have led to the development of various new models. Nevertheless, in a paper by Google Brain, after a thorough examination, the authors concluded that it cannot be established that any of the new variants consistently outperforms the originally proposed vanilla GAN algorithm. The conclusions can be summarized as follows:

  • GANs clearly have a plethora of applications; however, there is still a need for an algorithm that provides stability in GAN training. Much of the doubt remains due to the unavailability of mathematical proofs in support of the GAN mechanism. Some authors have tried to investigate the underlying cause of training failures and mode collapse, but the results are closer to speculation than to a solution. For stable training, the loss function plays an important role, and various loss functions have been proposed to address this issue. Having reviewed different types of loss functions, it is observed that spectral normalization generalizes well, i.e., it can be applied to every GAN, is easy to implement and has a very low computational cost.

  • Despite numerous old and new performance evaluation metrics, GANs lack a universal metric that keeps all the parameters in view. This paper has given the strengths and limitations of the quantitative and qualitative performance measures used for evaluating GANs. The current favorites among researchers, IS and FID, both fail to capture all the information required when analyzing GANs. Some directions for future research on evaluation measures have also been identified. There should be comparative empirical and analytical studies of evaluation measures in the future. A code repository for evaluation measures should be made available on authenticated websites for researchers.

    Finally, it is concluded that there are many prospects for future research and applications in many fields.