1 Introduction

Computer vision applications can be found in almost every domain, including medical imaging, gaming, video surveillance, multimedia, industrial applications and remote sensing, to mention just a few. In most cases these applications are based on images obtained from cameras working in the visible spectrum. There are some cases, in particular in medical imaging and remote sensing, where cross-spectral and multispectral images are considered. The appeal of using images from different spectral bands lies, on the one hand, in the possibility of obtaining information that cannot be seen in the visible spectrum and, on the other hand, in combining that information to support some kind of higher-level reasoning; for instance, in remote sensing images from different spectral bands are combined to generate vegetation indexes. These vegetation indexes are used to determine the health and strength of vegetation, and their definitions involve several factors, such as soil reflectance, atmosphere and vegetation density.

Among the different indexes proposed in the literature, the Normalized Difference Vegetation Index (NDVI) is the most widely used [1]; in general, it is used to determine the condition, developmental stage and biomass of cultivated plants and to forecast their yields. The values of this index range from -1 to 1, with zero approximately marking the point where vegetation is absent; negative values correspond to non-vegetated surfaces. The index is calculated as the ratio between the difference and the sum of the reflectances in the NIR and red regions:

$$\begin{aligned} NDVI =\dfrac{R_\mathrm{NIR} - R_\mathrm{RED} }{R_\mathrm{NIR} + R_\mathrm{RED}}, \end{aligned}$$
(1)

where \(R_\mathrm{NIR}\) is the reflectance of NIR radiation and \(R_\mathrm{RED}\) is the reflectance of visible red radiation.
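A minimal NumPy sketch of Eq. (1); the function name and the epsilon guard against division by zero are our own additions, and the inputs are assumed to be registered reflectance images of equal size.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    # Per-pixel NDVI from registered NIR and red reflectance images, Eq. (1).
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)
```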

Although interesting, cross/multi-spectral solutions require the setup of more than one camera. For instance, in the case of NDVI, an image from the visible spectrum and an image from the NIR spectrum are required. In other words, two cameras are needed, acquiring images of the same scene at the same time, in order to compute the values of Eq. (1). It should be noticed that before computing Eq. (1) the images need to be accurately registered, i.e., their information must be referred to the same reference system. Unfortunately, since images from different spectra are considered, they may look quite different, so the problem is how to find the same set of points in both spectra [2] to be used as references. Recently, deep learning based approaches have been proposed to overcome this drawback and find correspondences across spectral domains [3, 4]. Once these points are obtained, the images can be registered into a single reference system.
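As an illustration of the registration step, the following sketch warps a NIR image into the reference system of the visible image with OpenCV, assuming the cross-spectral correspondences (e.g., obtained as in [3, 4]) are already available as two Nx2 point arrays; this is only one possible registration pipeline, not necessarily the one used by the cited works.

```python
import cv2
import numpy as np

def register_nir_to_rgb(nir_img, pts_nir, pts_rgb, rgb_shape):
    # Estimate a homography from NIR points to RGB points (RANSAC rejects
    # outlier correspondences), then warp the NIR image accordingly.
    H, _ = cv2.findHomography(np.float32(pts_nir), np.float32(pts_rgb),
                              cv2.RANSAC, 5.0)
    h, w = rgb_shape[:2]
    return cv2.warpPerspective(nir_img, H, (w, h))
```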

Cross/multi-spectral approaches provide unique solutions to complex problems; however, as mentioned above, several preprocessing stages need to be performed before computing these solutions. Hence, in the current work we ask whether it is possible to obtain the same result using information from a single spectral band. A similar philosophy has recently been presented in [5], where the vegetation index is estimated with a learning approach from a single near infrared image. Although interesting results have been obtained, the weak point of that approach lies in the need for NIR images, which are far less common than visible spectrum images. In the current work we propose to explore the possibility of estimating the NDVI vegetation index using only the red channel of the visible spectrum. The index is estimated with a learning based approach, where a Conditional Generative Adversarial Network (CGAN) is trained on a large dataset. The CGAN architecture used in the current work is similar to the one presented in [6], but it includes the conditioning red channel image at the final layer of the learning model to improve the details of the generated NDVI vegetation index. Additionally, a more elaborate loss function is proposed to preserve details of the given image.

Fig. 1. Conditional Generative Adversarial process implemented in the current work to estimate the NDVI vegetation index.

The rest of the paper is organized as follows. Section 2 introduces the Generative Adversarial Network formulation. Then, Sect. 3 presents the architecture proposed in the current work, detailing its design, the proposed loss functions and the training with cross-spectral datasets. Section 4 depicts the experimental results and, finally, conclusions are presented in Sect. 5.

2 Generative Adversarial Networks

Generative Adversarial Networks (GANs) are powerful and flexible tools useful in several computer vision problems; one of their most common applications is image generation. In the GAN framework [7], generative models are estimated via an adversarial process in which two models are trained simultaneously: (i) a generative model G that captures the data distribution, and (ii) a discriminative model D that estimates the probability that a sample came from the training data rather than from G. The training procedure for G is to maximize the probability of D making a mistake. In this architecture it is possible to apply certain conditions to improve the learning process. According to [8], to learn the generator's distribution \(p_g\) over data \({\varvec{{x}}}\), the generator builds a mapping function from a prior noise distribution \(p_z(z)\) to the data space, \(G(z;\theta _g)\), while the discriminator, \(D(x;\theta _d)\), outputs a single scalar representing the probability that x came from the training data rather than from \(p_g\). G and D are trained simultaneously: the parameters of G are adjusted to minimize \(\log (1-D(G(z)))\) and those of D to maximize \(\log D(x)\), according to the value function V(D, G):

$$\begin{aligned} \min _G \max _D V(D, G) = \mathbb {E}_{x\sim p_\mathrm{data}(x)}[\log D(x)] + \mathbb {E}_{z\sim p_z(z)}[\log (1-D(G(z)))]. \end{aligned}$$
(2)
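The following PyTorch sketch shows one adversarial training step implementing Eq. (2); it assumes a discriminator ending with a sigmoid and uses the common non-saturating generator loss (maximizing \(\log D(G(z))\) instead of minimizing \(\log (1-D(G(z)))\)). All names are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x, z, opt_g, opt_d):
    # Discriminator step: ascend log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    d_real = D(x)
    d_fake = D(G(z).detach())          # detach: do not update G here
    loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Generator step: non-saturating loss, i.e. ascend log D(G(z)).
    opt_g.zero_grad()
    d_fake = D(G(z))
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```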

Generative adversarial networks can be extended to a conditional model if both the generator and the discriminator are conditioned on some extra information y. This information can be any kind of auxiliary data, such as class labels or data from other modalities. The conditioning can be performed by feeding y into both the discriminator and the generator as an additional input layer. The objective function of the resulting two-player minimax game is:

$$\begin{aligned} \min _G \max _D V(D, G) = \mathbb {E}_{x\sim p_\mathrm{data}(x)}[\log D(x|y)] + \mathbb {E}_{z\sim p_z(z)}[\log (1-D(G(z|y)))]. \end{aligned}$$
(3)
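A common way to implement such conditioning for image-to-image tasks, and a plausible reading of the present setup, is to concatenate the condition y (here, the red-channel image) with the network input along the channel dimension; the shapes below are illustrative.

```python
import torch

def condition(x, y):
    # x: (B, C, H, W) input (noise map or image); y: (B, 1, H, W) condition
    return torch.cat([x, y], dim=1)

z = torch.randn(8, 1, 64, 64)    # per-pixel Gaussian noise
red = torch.rand(8, 1, 64, 64)   # conditioning red-channel patches
g_input = condition(z, red)      # (8, 2, 64, 64), fed to the generator
```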

In the current work a novel Conditional GAN model is proposed for vegetation index estimation from just the red channel of an RGB image; it is inspired both by the GAN architecture presented in [9] for NIR colorization and by the triplet model proposed in [5] for learning vegetation indexes from NIR images. It is, in fact, an adaptation of the architectures mentioned above, which consists of reducing the number of layers and removing the internal levels of the learning architecture (a single-level, FLAT scheme). Another difference with respect to previous approaches lies in the proposed loss function, which takes into account not only intensity level information but also image structure information.

Fig. 2. GAN architecture for NDVI vegetation index estimation; a single level layer model (FLAT) evaluated as the generator network; at the bottom, the discriminator network.

Fig. 3. Pairs of patches (64 \(\times \) 64) from the country category (two left columns) and the field category (two right columns) [10]: (top) RGB image; (middle) red channel of the given RGB image; (bottom) NDVI vegetation index computed from the RGB images and the corresponding NIR images.

3 Proposed Approach

This section presents the approach proposed for NDVI vegetation index estimation. As mentioned above, it uses an architecture similar to the one presented in [5], where a conditional generative adversarial network was proposed. A traditional scheme of layers in a deep network is considered. In the current work a Conditional GAN model with a flat scheme is evaluated; this type of GAN model has been chosen because it has shown good performance on problems such as colorization, dehazing, enhancement and object recognition. Based on the improvements in accuracy and performance obtained on this type of problems, we propose a learning model that maps a single channel of the RGB image (the red channel) to a vegetation index. The model receives as input a patch corresponding to the red channel of an RGB image. Gaussian noise is added to each patch to increase the variability during the learning of index patch generation, which increases convergence time but improves generalization. An l1 regularization term has been added to each layer of the model to prevent the coefficients from overfitting; it makes the network learn small weights while minimizing the loss and improves the generalization capability of the model. Figure 1 depicts the Conditional GAN process proposed in the current work.
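A minimal sketch of the two training details just described, with illustrative names and magnitudes (the paper does not report the noise variance or the regularization weight): Gaussian noise added to the input patches, and an l1 penalty accumulated over the layer weights and added to the task loss.

```python
import torch

def noisy_input(red_patch, sigma=0.05):
    # red_patch: (B, 1, 64, 64) red-channel patch in [0, 1]
    return red_patch + sigma * torch.randn_like(red_patch)

def l1_weight_penalty(model, lam=1e-5):
    # Added to the task loss; pushes the layer weights towards small values.
    return lam * sum(p.abs().sum() for p in model.parameters())
```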

As mentioned above, in our case the generator network has been implemented using a single level of layers (FLAT). Figure 2 presents an illustration of the GAN network used in this research. In all cases, the vegetation index is obtained at the output of the generator network. This vegetation index is validated by the discriminative network, which evaluates the probability that the generated image (the vegetation index in grayscale) is similar to the real one used as ground truth. Additionally, in the generator model, in order to obtain a better image representation, the CGAN framework is reformulated for a conditional generative image modeling tuple. In other words, the generative model \(G(z;\theta _g)\) is trained on the red channel of an RGB image plus Gaussian noise to produce an NDVI vegetation index image; additionally, a discriminative model \(D(z; \theta _d)\) is trained to assign the correct label to the generated NDVI image, according to the provided real NDVI image, which is used as ground truth. The variables \(\theta _g\) and \(\theta _d\) represent the weights of the generative and discriminative networks.
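To make the final-layer conditioning concrete, the following is an illustrative flat generator that concatenates the red channel with the last convolutional features before the output layer; the layer count and widths are assumptions, not the paper's exact architecture (the tanh output conveniently matches the [-1, 1] range of the NDVI).

```python
import torch
import torch.nn as nn

class FlatGenerator(nn.Module):
    def __init__(self, feats=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(inplace=True))
        # the final layer sees the features plus the conditioning red channel
        self.head = nn.Conv2d(feats + 1, 1, 3, padding=1)

    def forward(self, z, red):
        h = self.body(torch.cat([z, red], dim=1))
        return torch.tanh(self.head(torch.cat([h, red], dim=1)))
```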

The model is trained with a multi-term loss (\(\mathcal {L}\)) formed by the combination of the adversarial loss, the intensity loss (MSE) and the structural loss (SSIM). This combined loss has been defined to avoid measuring the mismatch between a generated image and its corresponding ground-truth image with a pixel-wise loss alone; the multi-term loss function is better aligned with human perceptual criteria of image quality, as detailed next. The adversarial loss is designed to minimize the cross-entropy and improve the texture of the result:

$$\begin{aligned} \mathcal {L}_{Adversarial} = - \sum \limits _{i} \log D(G_w(I_{z|y}), I_{x|y}), \end{aligned}$$
(4)

where D and \(G_w\) are the discriminator and the generator of the GAN network, and \(I_{x|y}\) and \(I_{z|y}\) are the real and generated images, both conditioned by the red channel of the RGB image.
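A simplified PyTorch reading of Eq. (4) for the generator update, assuming a discriminator with a sigmoid output; the epsilon is our own numerical safeguard.

```python
import torch

def adversarial_loss(D, fake, eps=1e-8):
    # fake: generated NDVI patches, conditioned on the red channel
    return -torch.log(D(fake) + eps).mean()
```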

The Intensity loss is defined as:

$$\begin{aligned} \mathcal {L}_{Intensity} = \frac{1}{NM}\sum \limits _{i = 1}^N \sum \limits _{j = 1}^M (NDVIe_{i,j}-NDVIg_{i,j})^2, \end{aligned}$$
(5)

where \(NDVIe_{i,j}\) is the vegetation index estimated by the network, \(NDVIg_{i,j}\) is the ground-truth vegetation index and \(N\times M\) is the size of the patches. This loss measures the per-pixel intensity difference between the images without considering texture or content. It penalizes large errors but is tolerant to small ones, and it disregards the specific structure of the image.
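Eq. (5) translates directly into a one-line mean squared error over the patch (tensors or arrays of matching shape are assumed):

```python
def intensity_loss(ndvi_est, ndvi_gt):
    # Eq. (5): mean squared error between estimated and ground-truth patches.
    return ((ndvi_est - ndvi_gt) ** 2).mean()
```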

To address the limitations of the simple intensity loss function, the usage of a reference-based measure is proposed. One such reference-based measure is the Structural Similarity Index (SSIM) [11], which evaluates images accounting for the fact that the human visual system is sensitive to changes in local structure; the idea behind this loss function is to help the learning model produce a visually improved image. The structural loss for a pixel p is defined as:

$$\begin{aligned} \mathcal {L}_{SSIM} = \frac{1}{NM}\sum \limits _{p = 1}^{NM} \left( 1- SSIM(p)\right) , \end{aligned}$$
(6)

where SSIM(p) is the Structural Similarity Index (see [11] for more details) computed over a local window centered at pixel p of the \(N\times M\) patch.
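A compact SSIM-loss sketch using uniform local windows via average pooling ([11] uses Gaussian windows, so this is a simplification); inputs are assumed to be (B, 1, H, W) tensors in [0, 1], and the constants follow the usual choices for that range.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, win=7, c1=0.01 ** 2, c2=0.03 ** 2):
    pad = win // 2
    mu_x = F.avg_pool2d(x, win, 1, pad)
    mu_y = F.avg_pool2d(y, win, 1, pad)
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1 - ssim_map).mean()   # Eq. (6): mean of 1 - SSIM(p)
```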

The final loss (\(\mathcal {L}\)) used in this work is the sum of the individual adversarial, intensity and structural loss terms:

$$\begin{aligned} \mathcal {L}_{Final} = \mathcal {L}_{Adversarial} + \mathcal {L}_{Intensity} + \mathcal {L}_{SSIM} \end{aligned}$$
(7)
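Reusing the helpers sketched above, Eq. (7) is the plain, unweighted sum of the three terms (the text mentions no weighting factors):

```python
def final_loss(D, fake, ndvi_gt):
    # Eq. (7): adversarial + intensity (MSE) + structural (SSIM) terms.
    return (adversarial_loss(D, fake) +
            intensity_loss(fake, ndvi_gt) +
            ssim_loss(fake, ndvi_gt))
```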

4 Experimental Results

The proposed approach has been evaluated using the red channel of RGB images and the corresponding NDVI vegetation index (ground truth), computed with Eq. (1) from the NIR and red channel images; this cross-spectral dataset comes from [10]. The country and field categories have been considered for evaluating the performance of the proposed approach; examples from this dataset are presented in Fig. 3. The dataset consists of 477 registered image pairs, categorized in 9 groups, captured in the RGB (visible spectrum) and NIR (near infrared) bands. The country category contains 52 pairs of images (1024 \(\times \) 680 pixels), while the field category contains 51 pairs of images (1024 \(\times \) 680 pixels). In order to train the network to generate the vegetation index for each of these categories, 380,000 pairs of patches (64 \(\times \) 64 pixels) have been cropped, both from the RGB images and from the corresponding NDVI images. Additionally, 3800 pairs of patches (64 \(\times \) 64 pixels) per category have been generated for validation. It should be noted that the images are correctly registered, so a pixel-to-pixel correspondence is guaranteed.
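The patch extraction can be sketched as aligned random crops, relying on the pixel-to-pixel registration of the dataset (function and variable names are illustrative):

```python
import numpy as np

def random_patch_pair(red, ndvi, size=64, rng=np.random):
    # red, ndvi: registered 2-D arrays of identical shape
    h, w = red.shape
    i = rng.randint(0, h - size + 1)
    j = rng.randint(0, w - size + 1)
    return red[i:i + size, j:j + size], ndvi[i:i + size, j:j + size]
```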

Table 1. Root Mean Square Errors (RMSE) and Structural Similarities (SSIM) obtained with the proposed GAN architecture using different loss functions (for SSIM, the bigger the better).

The Conditional Generative Adversarial network evaluated in the current work is flat (a single level of learning layers) for NDVI vegetation index estimation. It has been trained on a 3.4 GHz four-core processor with 16 GB of memory and an NVIDIA Titan XP GPU. Qualitative results are presented in Figs. 4 and 5: Fig. 4 shows NDVI vegetation index images from the country category generated with the proposed flat GAN network, while Fig. 5 shows NDVI vegetation index images from the field category. Quantitative evaluations for the different loss functions are provided below. To the best of our knowledge there is no previous work on a similar technique to estimate the vegetation index using only the red channel of RGB images; hence, results are evaluated by comparing the Root Mean Square Error (RMSE) of each configuration. The RMSE measures the distance between the estimated NDVI and the ground truth; it is the standard deviation of the residuals, which measure how different the compared images are from each other.
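The RMSE reported here can be computed as follows, assuming the NDVI images are represented in the [0, 255] range mentioned later in this section:

```python
import numpy as np

def rmse(ndvi_est, ndvi_gt):
    diff = ndvi_est.astype(np.float64) - ndvi_gt.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))
```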

Fig. 4. (1st col.) Ground truth NDVI index from the country category. (2nd-4th cols.) NDVI index obtained with the proposed GAN architecture with different loss functions: \(\mathcal {L}_{Final} \), \(\mathcal {L}_{Adversarial} + \mathcal {L}_{SSIM}\) and \(\mathcal {L}_{Adversarial} + \mathcal {L}_{Intensity}\).

Fig. 5. (1st col.) Ground truth NDVI index from the field category. (2nd-4th cols.) NDVI index obtained with the proposed GAN architecture with different loss functions: \(\mathcal {L}_{Final} \), \(\mathcal {L}_{Adversarial} + \mathcal {L}_{SSIM}\) and \(\mathcal {L}_{Adversarial} + \mathcal {L}_{Intensity}\).

The results obtained with the multi-term loss approach show that the structural similarity term contributes to improving the texture of the estimated NDVI vegetation index, while the intensity loss term, which measures the mean square error between the estimated values and the corresponding ground truth, keeps the estimation close to the ground truth in intensity.

Table 1 presents the average Root Mean Square Errors (RMSE) and the Structural Similarity metric (SSIM) obtained with the single level architecture when the different loss functions (\(\mathcal {L}_{Adversarial} + \mathcal {L}_{SSIM}\)), (\(\mathcal {L}_{Adversarial} + \mathcal {L}_{Intensity}\)) and (\(\mathcal {L}_{Final}\)) are evaluated on the two categories used as case studies. It can be appreciated that the \(\mathcal {L}_{Final}\) loss function reaches the best results. They show that the more elaborate the loss function, the better the results, since the network becomes able to learn complex scenes with a faster convergence. Having in mind that the NDVI indexes resulting from the learning process are represented as images in the range [0, 255], the results presented in Table 1 show that the average deviation of the estimated values is 1.4%. Additionally, looking at the SSIM metric, which is a perception-based measure that models image degradation as a perceived change in structural information, we can observe that, on average, results in both categories are above 0.9. This value means that the obtained results largely preserve pixel inter-dependencies, which carry important information about the structure of the objects in the visual scene. This metric, combined with the RMSE, allows us to confirm that the proposed approach yields a valid NDVI index.

5 Conclusion

This paper tackled the challenging problem of NDVI vegetation index estimation by using a novel Conditional Generative Adversarial Network model. The novelty of the proposed approach lies in the usage of just a single spectral band (the red channel of RGB images). The architecture proposed for the generative network consists of a single level structure, which combines at the final layer the results of the convolutional operations together with the given red channel, resulting in a sharp NDVI image. Then, the discriminative model estimates the probability that the generated NDVI index came from the training dataset rather than having been generated automatically. Different loss functions have been evaluated to help the learning model produce a visually improved image; the proposed loss function takes into account both intensity level information and image structure information. Experimental results with a large set of outdoor images show the validity of the proposed approach for estimating the NDVI index from monospectral images. As future work, the possibility of obtaining the NDVI from all the channels will be considered.