
1 Introduction

Pixel prediction has proven to be one of the most valuable tools in image processing, supporting applications such as anomaly detection [18, 31], object detection [8, 17], edge detection [28], video compression [7, 24], semantic segmentation [27, 36], image restoration [11, 32] and keypoint prediction [37]. In practice, pixel prediction consists of approximating a pixel from its neighbors, usually through a linear or non-linear combination of the neighboring pixels plus an error term [29, 38]. In this paper, inspired by the work proposed in [20], we take the optimal predictor of \(x_{ij}\) to be the conditional expectation \(E\big (x_{ij}\mid \forall x_{i'j'} \in {N_{i,j}}\big )\), where the \(x_{i'j'}\) are the neighbors of \(x_{ij}\) within the set of pixels \(N_{i,j}\).

Exploiting the ease of analytical derivations, the authors in [40] derived optimal predictors for the Gaussian distribution and for mixtures of Gaussians. Meanwhile, the theory of non-Gaussian distributions has expanded considerably [23, 35], and many researchers have shown that the Gaussian assumption is generally inappropriate: alternative distributions model data more effectively by unveiling more appropriate patterns and correlations among data features [4,5,6, 10]. In our recent work [19], we showed that the finite inverted Dirichlet mixture model effectively represents positive vectors [3, 9, 39]. However, it suffers from a significant drawback: its strictly positive covariance structure. To overcome this limitation, we consider the generalized inverted Dirichlet (GID), which belongs to the Liouville family of distributions [22] and provides a richer representation of the variability of the data [12]. Indeed, the fact that the generalized inverted Dirichlet can be factorized into a set of inverted Beta distributions [13] gives more flexibility for modeling data in real-world applications.

In this work, we derive a novel optimal predictor based on the generalized inverted Dirichlet distribution, which results in a linear combination of the neighboring pixels. We evaluate the proposed approach on an image inpainting application, using the publicly available Paris StreetView dataset [16]. To assess the efficiency of the proposed optimal predictor, we consider two types of pixel discarding: in the first experiment pixels are removed at random, whereas in the second we discard lines from the image. We use a 3\(^\mathrm{rd}\) order non-symmetrical half-plane causal (NSHP) neighborhood system to compute the missing pixels [11]. Finally, we use two image comparison metrics to evaluate the proposed model and compare it to other similar optimal predictors.

The rest of the paper is organized as follows: in Sect. 2, we describe our prediction model, and we derive the analytical expression of the GID optimal predictor. In Sect. 3, we consider the image inpainting application on Paris StreetView dataset with two different data masks to demonstrate the effectiveness of the proposed predictor, and we discuss the experimental results. Finally, a summary is provided in Sect. 4.

2 GID Prediction Model

The generalized inverted Dirichlet mixture model has shown high flexibility for modeling and clustering positive vectors. In this section, we first review the finite GID mixture model. Then, we introduce parameter learning through the EM algorithm and, finally, we extend the model to prediction.

2.1 Mixture of Generalized Inverted Dirichlet Distributions

Let \(\mathcal {X}=(\mathbf {X}_1,\dots ,\mathbf {X}_N)\) be a set of N d-dimensional positive vectors, where each vector \(\mathbf {X}_i\) follows a mixture of K generalized inverted Dirichlet (GID) distributions whose jth component is characterized by parameters \(\mathbf {\theta }_{j}=(\mathbf {\alpha }_j,\mathbf {\beta }_j)\) and mixing weight \(\pi _j\):

$$\begin{aligned} P(\mathbf {X}_i|\Theta ) = \sum _{j=1}^{K} \pi _j P(\mathbf {X}_i|\mathbf {\theta }_{j}) \end{aligned}$$
(1)

where \(\Theta =(\mathbf {\theta }_{1},\dots , \mathbf {\theta }_{K}, \pi _1,\dots ,\pi _K)\) represents the GID mixture model parameters and \(P(\mathbf {X}_i|\mathbf {\theta }_{j})\) is the generalized inverted Dirichlet distribution, which has the following form [26]:

$$\begin{aligned} P(\mathbf {X}_i|\mathbf {\theta }_{j}) = \prod _{l=1}^{d} \frac{\Gamma (\alpha _{jl} + \beta _{jl})}{\Gamma (\alpha _{jl}) \Gamma (\beta _{jl})} \frac{X_{il}^{\alpha _{jl}-1}}{(1+\sum _{s=1}^{l}X_{is})^{\gamma _{jl}}} \end{aligned}$$
(2)

where \(\gamma _{jl}=\beta _{jl}+\alpha _{jl}-\beta _{jl+1}\), for \(l=1,\dots ,d\), with \(\beta _{jd+1}=0\). It is to be noted that the GID reduces to the inverted Dirichlet distribution when \(\gamma _{jl}=0\) for \(l=1,\dots ,d-1\).

The flexibility of the generalized inverted Dirichlet stems from the “force of mortality” concept of the underlying population, which allows us to introduce a set of independent variables \(Y_{il}\) defined as

$$\begin{aligned} Y_{i1} = X_{i1}, ~~ Y_{il} = \frac{X_{il}}{T_{il-1}}, ~~ l>1 \end{aligned}$$
(3)

where \(T_{il} = 1+ X_{i1} + X_{i2} + \dots + X_{il},~~ l=1,\dots ,d\), and \(T_{i0}=1\).
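To make this transformation concrete, the following minimal Python sketch (our illustration, not part of the original derivation) maps a positive vector \(\mathbf {X}_i\) to the variables \(Y_{il}\) of Eq. (3):

```python
import numpy as np

def gid_to_ibeta_coordinates(X):
    """Map a d-dimensional positive vector X to the variables Y of Eq. (3).

    Y[0] = X[0] and Y[l] = X[l] / (1 + X[0] + ... + X[l-1]) for l > 0,
    so that each Y[l] follows an inverted Beta distribution when X is GID-distributed.
    """
    X = np.asarray(X, dtype=float)
    T = 1.0 + np.cumsum(X)        # T[l] = 1 + X[0] + ... + X[l]
    Y = np.empty_like(X)
    Y[0] = X[0]
    Y[1:] = X[1:] / T[:-1]        # divide X_l by T_{l-1}
    return Y

# Example: a 4-dimensional positive vector
print(gid_to_ibeta_coordinates([0.5, 1.2, 0.3, 2.0]))
```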

For each vector \(\mathbf {Y}_i\) of \(\mathcal {Y}=(\mathbf {Y}_1,\dots ,\mathbf {Y}_N)\), the joint density factorizes into a product of two-parameter inverted Beta distributions, where \(\theta _{l}=(\alpha _{l},\beta _{l})\):

$$\begin{aligned} P(\mathbf {Y}_i|\mathbf {\theta }) = \prod _{l=1}^{d} P_{IBeta} (Y_{il}|\theta _l) \end{aligned}$$
(4)

where the probability density function of the inverted Beta is given by:

$$\begin{aligned} P_{IBeta} (Y_{il}|\theta _l) = \frac{\Gamma (\alpha _l + \beta _l)}{\Gamma (\alpha _l)\, \Gamma (\beta _l)} \frac{Y_{il}^{\alpha _l - 1}}{(1+Y_{il})^{\alpha _l+\beta _l}} \end{aligned}$$
(5)
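For reference, a minimal sketch of the inverted Beta density of Eq. (5), written in log-space for numerical stability (the function name is our own):

```python
import numpy as np
from scipy.special import gammaln

def ibeta_logpdf(y, alpha, beta):
    """Log-density of the inverted Beta distribution of Eq. (5)."""
    return (gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
            + (alpha - 1.0) * np.log(y)
            - (alpha + beta) * np.log1p(y))

# The GID log-density of a vector X is then the sum of ibeta_logpdf over
# the transformed coordinates Y of Eq. (3).
```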

Many characteristics of this distribution are detailed in [30]. We recall some useful statistics here.

First, the \(n^{th}\) moment is given by:

$$\begin{aligned} E(Y^n) = \frac{\Gamma (\alpha + n)\,\Gamma (\beta - n)}{\Gamma (\alpha )\, \Gamma (\beta )} \end{aligned}$$
(6)

where \(\beta - n\) is positive.

Then, the covariance between two variables \(Y_1\) and \(Y_2\) is defined as:

$$\begin{aligned} COV(Y_1,Y_2) = \frac{(\alpha _1 - 1)(\alpha _2 - 1)}{(\alpha _1 \alpha _2 - 1)(\beta _1 - 1) (\beta _2 - 1)} \end{aligned}$$
(7)

The covariance between two features for the inverted Beta is always positive, which means that they tend to increase or decrease together.

Finally, for \(\beta > 2\), the variance of a variable Y is given by:

$$\begin{aligned} VAR(Y) = \frac{\alpha (\alpha + \beta -1)}{(\beta - 1)^2(\beta - 2)} \end{aligned}$$
(8)
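As a quick sanity check of the mean \(\alpha /(\beta -1)\) and of Eq. (8), one can compare the closed-form values with empirical estimates obtained by sampling; SciPy's betaprime distribution uses the same parameterization as Eq. (5). This snippet is illustrative only:

```python
import numpy as np
from scipy.stats import betaprime

alpha, beta = 3.0, 6.0                      # beta > 2 so the variance exists
samples = betaprime.rvs(alpha, beta, size=200_000, random_state=0)

mean_closed = alpha / (beta - 1.0)
var_closed = alpha * (alpha + beta - 1.0) / ((beta - 1.0) ** 2 * (beta - 2.0))

print(mean_closed, samples.mean())          # ~0.600 vs empirical estimate
print(var_closed, samples.var())            # ~0.240 vs empirical estimate
```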

2.2 Likelihood-Based Learning

A plethora of studies agrees on the effectiveness of likelihood-based approaches for estimating mixture parameters. One of the best-known methodologies is the Expectation-Maximization (EM) technique [15]: starting from a tuned initialization of the parameters, the expectation step infers the posterior probabilities (often called “responsibilities”), and the iterations then proceed to update the parameters until convergence. The heart of the matter is estimating the parameters from the first and second derivatives of the log-likelihood function with respect to each parameter. First, we introduce the log-likelihood as follows:

$$\begin{aligned} \log P (\mathcal {Y}|\Theta ) = \sum _{i=1}^{N}\log \Big [\sum _{j=1}^{K} \pi _j \prod _{l=1}^{d} P_{IBeta} (Y_{il}|\theta _{jl}) \Big ] \end{aligned}$$
(9)

Initializing Process. As a first step, the unsupervised K-means algorithm is applied to obtain K initial clusters. Then, for each cluster, the method of moments is used to obtain the initial \(\mathbf {\alpha }_j\) and \(\mathbf {\beta }_j\) parameters of component j. The initial mixing weight is set to the number of elements in each cluster divided by the total number of vectors. As mentioned earlier, with conditionally independent features, the GID factorizes into inverted Beta distributions. Thus, given the moments of the inverted Beta distribution [2], the initial \(\alpha _{jl_0}\) and \(\beta _{jl_0}\) are deduced as

$$\begin{aligned} \alpha _{jl_0}=\frac{\mu _{jl}^2(1+\mu _{jl}) + \mu _{jl}\sigma _{jl}^2}{\sigma _{jl}^2} \end{aligned}$$
(10)
$$\begin{aligned} \beta _{jl_0}=\frac{\mu _{jl}(1+\mu _{jl}) + 2 \sigma _{jl}^2}{\sigma _{jl}^2} \end{aligned}$$
(11)

where \(\mu _{jl}\) and \(\sigma _{jl}^2\) are, respectively, the sample mean and variance of feature l in cluster j, \(j=1,\dots ,K,~ l=1,\dots ,d\).
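A possible sketch of this initialization step (K-means followed by Eqs. (10)-(11)); the use of scikit-learn's KMeans and the function name are our own choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_gid_mixture(Y, K):
    """Initialize mixing weights and (alpha, beta) via K-means + Eqs. (10)-(11).

    Y : (N, d) array of transformed positive data (Eq. (3)).
    """
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(Y)
    N, d = Y.shape
    pi = np.zeros(K)
    alpha = np.zeros((K, d))
    beta = np.zeros((K, d))
    for j in range(K):
        Yj = Y[labels == j]
        pi[j] = len(Yj) / N
        mu, var = Yj.mean(axis=0), Yj.var(axis=0)
        alpha[j] = (mu ** 2 * (1.0 + mu) + mu * var) / var   # Eq. (10)
        beta[j] = (mu * (1.0 + mu) + 2.0 * var) / var        # Eq. (11)
    return pi, alpha, beta
```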

Expecting the Responsibilities. The responsibilities, or posterior probabilities, play an essential role in the likelihood-based estimation technique. They drive the parameter updates of the following maximization step, computed with the current parameter values:

$$\begin{aligned} P(j|\mathbf {Y}_i) = \frac{\pi _j P(\mathbf {Y}_i|\mathbf {\theta }_j)}{\sum _{m=1}^{K} \pi _m P(\mathbf {Y}_i|\mathbf {\theta }_m) } \end{aligned}$$
(12)
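A minimal sketch of this E-step, reusing the ibeta_logpdf helper sketched in Sect. 2.1 and applying the log-sum-exp trick for numerical stability (our illustration):

```python
import numpy as np

def responsibilities(Y, pi, alpha, beta):
    """E-step of Eq. (12): posterior probability of each component for each vector.

    Y : (N, d) transformed data; pi : (K,) weights; alpha, beta : (K, d) parameters.
    Relies on the ibeta_logpdf helper sketched earlier.
    """
    K = len(pi)
    log_r = np.stack([np.log(pi[j]) + ibeta_logpdf(Y, alpha[j], beta[j]).sum(axis=1)
                      for j in range(K)], axis=1)          # (N, K)
    log_r -= log_r.max(axis=1, keepdims=True)              # log-sum-exp trick
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```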

Maximizing and Updating the GID Parameters. First, we set the gradient of the log-likelihood function with respect to the mixing weight \(\pi _j\) (subject to the constraint \(\sum _{j=1}^{K}\pi _j=1\)) equal to zero:

$$\begin{aligned} \frac{\partial \log P (\mathcal {Y},\Theta ) }{\partial \pi _j} = 0 \end{aligned}$$
(13)

Then, we obtain the update formula for \(\pi _j\), for \(j=1,\dots ,K\) as

$$\begin{aligned} \pi _j= \frac{1}{N} \sum _{i=1}^{N} P(j|\mathbf {Y}_i) \end{aligned}$$
(14)

where \(P(j|\mathbf {Y}_i)\) is the posterior computed in the E-step.

To learn the parameters \(\mathbf {\alpha }_j\) and \(\mathbf {\beta }_j\), the Fisher scoring algorithm [25] is used. Thus, we need the first and second derivatives of the log-likelihood function, which enter the following updates [1]:

$$\begin{aligned} \alpha _{jl}^{t+1} = \alpha _{jl}^{t} - \Big ( \frac{\partial ^2}{\partial \alpha _{jl}^2} \log P (\mathcal {Y},\Theta )\Big )^{-1} \times \frac{\partial }{\partial \alpha _{jl}} \log P (\mathcal {Y},\Theta )\end{aligned}$$
(15)
$$\begin{aligned} \beta _{jl}^{t+1} = \beta _{jl}^{t} - \Big ( \frac{\partial ^2}{\partial \beta _{jl}^2} \log P (\mathcal {Y},\Theta )\Big )^{-1} \times \frac{\partial }{\partial \beta _{jl}} \log P (\mathcal {Y},\Theta ) \end{aligned}$$
(16)

The first derivatives of \(\log P (\mathcal {Y},\Theta )\) are given by:

$$\begin{aligned} \frac{\partial }{\partial \alpha _{jl}} \log P (\mathcal {Y},\Theta )= & {} \sum _{i=1}^N P(j|\mathbf {Y}_i) \Big [\Psi (\alpha _{jl}+\beta _{jl}) - \Psi (\alpha _{jl}) + \log Y_{il} \nonumber \\- & {} \log (1+Y_{il})\Big ], \end{aligned}$$
(17)
$$\begin{aligned} \frac{\partial }{\partial \beta _{jl}} \log P (\mathcal {Y},\Theta ) = \sum _{i=1}^N P(j|\mathbf {Y}_i) \Big [\Psi (\alpha _{jl}+\beta _{jl}) - \Psi (\beta _{jl}) -\log (1+Y_{il})\Big ] \end{aligned}$$
(18)

Holding the responsibilities fixed, as in Fisher scoring, the second derivative with respect to \(\alpha _{jl}\) is given by:

$$\begin{aligned} \frac{\partial ^2}{\partial \alpha _{jl}^2} \log P (\mathcal {Y},\Theta ) = \sum _{i=1}^N P(j|\mathbf {Y}_i) \Big [\Psi '(\alpha _{jl}+\beta _{jl}) - \Psi '(\alpha _{jl})\Big ], \end{aligned}$$
(19)

The second derivative with respect to \(\beta _{jl}\) is obtained analogously, with \(\Psi '(\beta _{jl})\) in place of \(\Psi '(\alpha _{jl})\).
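For illustration, a sketch of the resulting Newton-Raphson (Fisher scoring) update of Eqs. (15)-(19), with the responsibilities held fixed; the step size and positivity guard are our own pragmatic choices:

```python
import numpy as np
from scipy.special import psi, polygamma

def newton_update(Y, r, alpha, beta, step=1.0):
    """One Newton-Raphson update of alpha[j, l] and beta[j, l] (Eqs. (15)-(19)).

    Y : (N, d) transformed data; r : (N, K) responsibilities from the E-step.
    """
    K, d = alpha.shape
    log_y, log_1py = np.log(Y), np.log1p(Y)
    for j in range(K):
        w = r[:, j][:, None]                              # (N, 1)
        # First derivatives, Eqs. (17)-(18)
        d_alpha = (w * (psi(alpha[j] + beta[j]) - psi(alpha[j])
                        + log_y - log_1py)).sum(axis=0)
        d_beta = (w * (psi(alpha[j] + beta[j]) - psi(beta[j])
                       - log_1py)).sum(axis=0)
        # Second derivatives, Eq. (19) and its beta analogue (responsibilities fixed)
        n_j = w.sum()
        dd_alpha = n_j * (polygamma(1, alpha[j] + beta[j]) - polygamma(1, alpha[j]))
        dd_beta = n_j * (polygamma(1, alpha[j] + beta[j]) - polygamma(1, beta[j]))
        # Newton step with a crude positivity guard (our choice)
        alpha[j] = np.maximum(alpha[j] - step * d_alpha / dd_alpha, 1e-3)
        beta[j] = np.maximum(beta[j] - step * d_beta / dd_beta, 1e-3)
    return alpha, beta
```

A full EM run then alternates the E-step of Eq. (12), the mixing-weight update of Eq. (14) and this Newton step until the log-likelihood of Eq. (9) stops improving.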

2.3 GID Optimal Predictor

In this section, we present our novel non-linear optimal predictor based on the generalized inverted Dirichlet distribution. We rely on the conditional expectation to predict one random variable from its neighboring variables.

We consider p data points \( ({X}_i,{X}_{i+1},\dots ,{X}_{i+p-1})\) with known values, and we predict the neighboring data point \(\hat{{X}}_{i+p}\) by minimizing the mean squared error (MSE). Therefore, we model the joint density of \({X}_{i+p}\) and its neighbors with the generalized inverted Dirichlet. Without loss of generality, we take \(i=0\) and derive the equations:

$$\begin{aligned} \mathbf {X} \sim GID (\mathbf {\theta }) \end{aligned}$$
(20)

Considering the properties of the generalized inverted Dirichlet [22], the conditional random variable \(Y_{p}\) follows an inverted Beta distribution:

$$\begin{aligned} Y_{p} = \frac{X_{p}}{T_{p-1}} | X_{p-1},\dots , X_{1}, X_{0} \sim IB(\alpha _{p},\beta _{p}) \end{aligned}$$
(21)

Consequently, \(X_{p}\) conditionally follows a scaled inverted Beta distribution:

$$\begin{aligned} X_{p}~ | ~X_{p-1},\dots , X_{0} \sim T_{p-1}\, IB(\alpha _{p},\beta _{p}) \end{aligned}$$
(22)

where \(T_{p-1} = 1+\sum _{k=1}^{p-1}X_k\).

Hence, the conditional expectation of \(X_{p}\) is expressed as follows:

$$\begin{aligned} E(X_{p}~ | ~X_{p-1},\dots , X_{1}, X_{0} ) = T_{p-1}\, \frac{\alpha _{p}}{\beta _{p} -1} \end{aligned}$$
(23)
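As a quick numerical check of Eq. (23) (illustrative only), one can sample the scaled inverted Beta of Eq. (22) and compare the empirical mean with \(T_{p-1}\,\alpha _p/(\beta _p-1)\):

```python
import numpy as np
from scipy.stats import betaprime

alpha_p, beta_p = 4.0, 7.0
T = 1.0 + np.sum([0.8, 1.5, 0.2])          # T_{p-1} for some known neighbors
samples = T * betaprime.rvs(alpha_p, beta_p, size=100_000, random_state=0)

print(T * alpha_p / (beta_p - 1.0), samples.mean())   # closed form vs empirical
```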

In the case of mixture models, the optimal predictor expression can be derived directly by following the steps defined in [19] (more details are given in [40]):

$$\begin{aligned} \hat{X}_{p}= E\big (X_{p}~ | ~X_{p-1},\dots , X_{0}\big ) \end{aligned}$$
(24)
$$\begin{aligned} \hat{X}_{p}= & {} \int X_{p}\, P(X_{p}\mid X_{p-1},\dots , X_{0}) \,\mathrm {d}X_{p}\nonumber \\= & {} \sum _{j=1}^{K}{\pi }'_{j}\,E_j\big (X_{p}~ | ~X_{p-1},\dots , X_{0}\big ) \end{aligned}$$
(25)

where

$$\begin{aligned} {\pi }'_{j}= \pi _j\frac{\int P_{j}(X_{p}~,\dots , X_{0}) \,\mathrm {d}X_{p}}{\int P(X_{p}~,\dots , X_{0}) \,\mathrm {d}X_{p}} \end{aligned}$$
(26)
$$\begin{aligned} {\pi }'_{j}= \pi _j\frac{P_{j}(X_{p-1}~,\dots , X_{0})}{\sum _{m=1}^{K}\pi _{m} P_m(X_{p-1}~,\dots , X_{0})} \end{aligned}$$
(27)

Finally, the GID optimal predictor reduces to the following linear combination of the neighbors of \(X_{p}\):

$$\begin{aligned} \hat{X}_{p}= \sum _{j=1}^{K}\pi '_{j}\Bigg (1+\sum _{k=1}^{p-1}X_{k}\Bigg )\frac{\alpha _{jp}}{\beta _{jp}-1} \end{aligned}$$
(28)
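For illustration, a sketch of the resulting predictor for a single pixel given its causal neighbors, reusing the helpers sketched in Sect. 2.1 and 2.2; the parameter layout (one row of \(\alpha ,\beta \) per component) is our own convention:

```python
import numpy as np

def gid_predict_pixel(neighbors, pi, alpha, beta):
    """Predict a pixel from its causal neighbors with the GID mixture (Eq. (28)).

    neighbors : 1-D array (X_1, ..., X_{p-1}) of known neighboring pixels.
    pi, alpha, beta : mixture parameters; alpha, beta have shape (K, d) with d >= p.
    Relies on gid_to_ibeta_coordinates and ibeta_logpdf sketched earlier.
    """
    x = np.asarray(neighbors, dtype=float)
    p = len(x) + 1                                  # index of the pixel to predict
    T = 1.0 + x.sum()                               # T_{p-1} = 1 + sum of neighbors
    y = gid_to_ibeta_coordinates(x)                 # Eq. (3) on the known part
    # Eq. (27): weights of the neighborhood; the Jacobian of the X->Y change of
    # variables does not depend on j, so it cancels in the ratio.
    log_w = np.array([np.log(pi[j])
                      + ibeta_logpdf(y, alpha[j, :p - 1], beta[j, :p - 1]).sum()
                      for j in range(len(pi))])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Eq. (28): weighted sum of the per-component conditional expectations
    return float(np.sum(w * T * alpha[:, p - 1] / (beta[:, p - 1] - 1.0)))
```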

3 Experimental Results

Image inpainting is the process of restoring deteriorated, damaged or missing parts of an image to produce a complete picture. It is an active area of image processing research [32,33,34] in which machine learning has achieved results comparable to those of artists. In this process, a missing pixel is completed with an approximated value that depends on its neighborhood. In our work, we use the 3\(^{rd}\) order non-symmetrical half-plane causal (NSHP) neighborhood system [19, 21]. We apply the model to a publicly available dataset, Paris StreetView [16], and compare it with the widely used mixture of Gaussians predictor, the generalized Dirichlet mixture predictor and the inverted Dirichlet mixture predictor. We are not trying to restore the ground-truth image exactly; our goal is to obtain an output image that is close or similar to the ground truth. Therefore, we use the structural similarity index (SSIM) [11] to gauge the differences between the predicted images and the original ones. We also compute the peak signal-to-noise ratio (PSNR) [14] to evaluate the performance of the models.
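For illustration, one possible way to gather the causal neighbors of a pixel under an NSHP-style neighborhood is sketched below; the exact third-order support used in [19, 21] may differ, so the offsets here are an assumption:

```python
import numpy as np

# Illustrative NSHP-style causal offsets (row, col) relative to the current pixel:
# pixels to the left on the same row and pixels on previous rows. The exact
# third-order support used in the paper may differ.
NSHP_OFFSETS = [(0, -1), (0, -2), (0, -3),
                (-1, -1), (-1, 0), (-1, 1),
                (-2, -1), (-2, 0), (-2, 1)]

def causal_neighbors(img, i, j, offsets=NSHP_OFFSETS):
    """Collect the available causal neighbors of pixel (i, j), skipping out-of-bounds positions."""
    h, w = img.shape
    vals = [img[i + di, j + dj] for di, dj in offsets
            if 0 <= i + di < h and 0 <= j + dj < w]
    return np.asarray(vals, dtype=float)
```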

We resize the original images to \(256\times 256\) to reduce the complexity of computing the model's parameters. We train the model on \(70\%\) of the database and test on the rest. We apply two types of masks: the first one is randomly distributed, as shown in Fig. 1a, and for the second one we discard lines of the images, as in Fig. 1b. Finally, we compute the SSIM and PSNR of each test image with respect to its ground truth and average over the test set.

Fig. 1. Types of image mask

Fig. 2. Models' performance on random masked images. 1\(^\mathrm{st}\) column: ground truth images; 2\(^\mathrm{nd}\) column: masked images; 3\(^\mathrm{rd}\) column: GM prediction; 4\(^\mathrm{th}\) column: DM prediction; 5\(^\mathrm{th}\) column: IDM prediction; 6\(^\mathrm{th}\) column: GIDM prediction

Table 1. Models’ evaluation for the randomly masked images.

In the first experiment, we discard around \(15\%\) of the pixels at random. Figure 2 shows that the differences between the models' predictions are visually undetectable. Moreover, Table 1 shows that the differences between the models are not significant, with only a slight advantage for the GIDM model. We therefore conclude that this evaluation setting is not discriminative enough, and we instead remove slightly thick lines of pixels and re-evaluate the models.

Fig. 3. Models' performance on line masked images. 1\(^\mathrm{st}\) column: ground truth images; 2\(^\mathrm{nd}\) column: masked images; 3\(^\mathrm{rd}\) column: GM prediction; 4\(^\mathrm{th}\) column: DM prediction; 5\(^\mathrm{th}\) column: IDM prediction; 6\(^\mathrm{th}\) column: GIDM prediction

Table 2. Models’ evaluation for the line masked images.

To evaluate the models' performance, we used TensorFlow to compute the PSNR and the scikit-image (skimage) Python package for the SSIM metric.
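The metric calls we rely on look roughly as follows (assuming grayscale images stored as arrays in \([0, 255]\); the argument choices and function name are ours):

```python
import tensorflow as tf
from skimage.metrics import structural_similarity

def evaluate_pair(pred, gt):
    """PSNR via TensorFlow and SSIM via scikit-image for one predicted image."""
    psnr = tf.image.psnr(tf.convert_to_tensor(pred[..., None], dtype=tf.float32),
                         tf.convert_to_tensor(gt[..., None], dtype=tf.float32),
                         max_val=255).numpy()
    ssim = structural_similarity(pred, gt, data_range=255)
    return float(psnr), float(ssim)
```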

After discarding lines from the images, we are able to regenerate the missing pixels, and Fig. 3 demonstrates that GIDM is the most efficient of the compared models. This is also clear in Table 2, and the selected images show that GIDM regenerates the discarded pixels most accurately. Therefore, our work shows that image data is better represented by the generalized inverted Dirichlet. It is worth noting that the models' performance depends heavily on the size of the masks, the hyper-parameters, and the type and order of the neighbouring system.

4 Conclusion

In this paper, we have developed a new optimal predictor based on finite generalized inverted Dirichlet mixtures. The GID is efficient at representing positive vectors thanks to its statistical characteristics, in particular its covariance structure.

We learned the model parameters using a maximum likelihood approach with the Newton-Raphson method, and we used the NSHP neighbouring system to compute the predicted pixels. We evaluated the GID optimal predictor on image inpainting and compared the proposed model to other similar related works. The experimental results demonstrate that it offers reliable prediction and modeling capabilities.