1 Introduction

There has been growing interest in data-driven approaches to low-level image processing, including image super-resolution [33], denoising [37], and dehazing [35]. These approaches replace task-specific processing with generic models learned from a large number of examples designated to each task. For instance, the algorithms for super-resolution [20], denoising [25], and dehazing [12] learn similar Gaussian process (GP) models, and Markov random fields (MRFs) are common in the processes for super-resolution [15] and dehazing [19]. Only the types of training example pairs fed to these common models differ between tasks. Data are therefore central to data-driven image processing techniques, yet their selection is neglected in many studies. In this paper, we address the issue of seeking optimal training examples that yield improved performance on a given input.

Fig. 1 Algorithm overview. We first search optimal training examples for an input synthetic hazy image to train a Gaussian process (GP) regressor and thus estimate a more accurate transmission map for dehazing by applying the trained input-adaptive GP regressor

Recent works typically learn direct mappings from example pairs of a degraded input to its desired output. Zoran et al. [40] propose a framework to learn the mid-level visual properties of an image and perform the estimation of reflectance, shading, and depth. Burger et al. [3] use pairs of noisy and clean image patches as training data to learn the parameters of a multilayer perceptron (MLP) model for denoising. Tang et al. [35] learn the relationship between input features and transmission using random forest regression for dehazing, while we model this relationship with Gaussian processes [12]. Zhu et al. [39] learn the parameters of a linear model and then recover scene depth information for transmission estimation. These learning-based methods are able to yield superior performance when abundant training pairs are available.

“More data beats a clever algorithm” [8]. The quantity of data might not be an issue, but their quality is critical in this “big data” era. Regressors behave better when the regression process adapts to the input. Schmidt et al. [30] adaptively train a discriminative regressor for deblurring by minimizing a loss function upon the training set. Selecting sparse inducing points and manually re-labeling the training examples can also improve the performance of regression models [31]. In our previous work [12], a simple screening step with trained support vector machines (SVM) is able to improve the performance of GP regression for dehazing. Nevertheless, it is still an open issue to efficiently find, from a large volume of data, the training examples upon which more accurate regressors can be learnt for low-level processing of a given input.

In this paper, we propose a searching strategy on image examples (including those collected from the Web) for learning a more accurate GP regressor for image dehazing. We prove that the training examples neighboring the input are able to train a GP regressor with lower predictive variance. We leverage the Hamming embedding [24] to efficiently search for these examples neighboring the input. Our strategy, which uses modality-specific learning to build a more accurate regressor for the given modality (input), is also supported by recent cognitive studies [17]. Figure 1 shows the overview of the proposed method for a synthetic hazy image. The resultant image is quite close to the original haze-free image, recovering image textural details as well as the chromatic information. More experimental results validate the effectiveness of our method in Sect. 5.

2 Related work

In this section, we review existing dehazing methods and recent advances that improve the GP regression.

2.1 Image dehazing

Hazy images with low visibility influence both human perception and computer vision. Pioneering dehazing works in the past decade typically rely on prior modeling of the physical image formation process. Fattal [13] proposed a refined image formation model that includes the surface shading and assumed that the transmission is uncorrelated with the surface shading. He et al. [22] estimated the transmission map of a hazy image under the assumption of the dark channel prior (DCP), namely that the local minimum of RGB channels in a haze-free image is close to zero. These image priors are applicable to certain kinds of images, but not to generic real-world images. They work well in the scenarios that follow those assumptions, but fail otherwise.

Recently, researchers have approached haze removal from a learning perspective. Gibson et al. developed a learning framework for haze removal using synthesized hazy images with known fog and depth [18]. Tang et al. investigated haze-relevant features and applied random forests for transmission learning [35], and we developed a two-layer GP regression to generate smoother transmission estimates [9]. Zhu et al. used a supervised learning method to train the parameters of a linear model [39]. There also exist dehazing methods using deep networks. Ren et al. employed a multiscale convolutional neural network (CNN) to estimate transmission maps for hazy images [27], while Cai et al. built an end-to-end system from images (instead of features) to transmissions [4]. These works, focusing on learning the relationship between features (or images) and transmissions, collect many synthetic hazy images to constitute a fixed training set for all input testing images. However, a fixed set of synthetic images can hardly cover the great variations of real-world hazy images. Finding ‘optimal’ training examples, the most critical issue for data-driven processing, remains unresolved.

2.2 Improvements on Gaussian process regression

The Gaussian process regression model is simple and flexible, and is a powerful tool in many areas. Significant efforts have been invested in Gaussian process regression, yielding improvements from various aspects. Miguel et al. introduced a sparse Gaussian process regression model and sparsified the spectral representation of the GP, which makes the regression model simpler and more efficient [26]. In a recent study, Kwon et al. improved the quality of restoring degraded images by learning a semi-local GP regression model [25]. Instead of training a single GP model on a large data set, they constructed a set of sparse models to perform the prediction at each testing point. Cao et al. introduced an efficient optimization algorithm for GP regression and achieved the joint selection of inducing points and estimation of GP hyper-parameters by optimizing a single objective [5]. These improvements for GP regression greatly reduce the time complexity. Unfortunately, few methods increase the precision from the aspect of choosing training examples. In this study, we develop a systematic selection process to search for the appropriate training set for a given input to improve the accuracy of GP regression.

3 Training example searching

Training examples have such a great impact on regression methods that a training process with examples adapted to given inputs may significantly improve the performance. In this study, we propose an efficient searching strategy to select optimal training examples from a vast collection of data points from various sources (synthesized or from the Web). This section gives the proof of how the searching improves the precision of GP regression and presents the fast algorithm derived from the Hamming embedding.

3.1 Optimal examples for Gaussian process regression

3.1.1 Prediction distribution

The GP regression is not only able to learn a mapping from the input to the target with a set of training examples, but also to provide the probability distribution for predicting the target given a new input. This predictive probability gives the estimates of the target as well as the prediction precision dependent on training examples. We focus on the derivation of the distribution relating the precision with training examples.

Similar to [12], we build the nonlinear mappings from input features \(\mathbf {f}\) to target transmission t with GP regression. The transmission t is a function \(\varPhi (\mathbf {f})\) of the input \(\mathbf {f}\) with additive noise \(\varepsilon \) following a Gaussian distribution \(N(0,\sigma _n^2)\), where \(\sigma _n^2\) is the noise variance, expressed as:

$$\begin{aligned} t = \varPhi (\mathbf {f}) + \varepsilon . \end{aligned}$$
(1)

The covariance matrix of the marginal distribution for the target t is determined by the Gram matrix \(\mathbf {G}\):

$$\begin{aligned} \mathbf {G}({\mathbf {f}_i},{\mathbf {f}_j}) = k({\mathbf {f}_i},{\mathbf {f}_j}) + \sigma _n^2\delta ({\mathbf {f}_i},{\mathbf {f}_j}), \end{aligned}$$
(2)

where \(\delta (\cdot )\) is the Kronecker delta function, \({\mathbf {f}_i}\) and \({\mathbf {f}_j}\) are two feature vectors among the input features \(\mathbf {f}\), and \(k({\mathbf {f}_i},{\mathbf {f}_j})\) is a kernel function of \({\mathbf {f}_i}\) and \({\mathbf {f}_j}\). We take the squared exponential as the kernel:

$$\begin{aligned} k({\mathbf {f}_i},{\mathbf {f}_j}) = \sigma _f^2\exp \left[ \frac{{ - {{({\mathbf {f}_i} - {\mathbf {f}_j})}^2}}}{{2{l^2}}}\right] , \end{aligned}$$
(3)

where \(\sigma _f^2\) is the maximal allowable covariance and l is the length parameter.
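Equations (2) and (3) can be illustrated with a minimal sketch in plain Python. The function names and the hyper-parameter values (`sigma_f`, `length`, `sigma_n`) are illustrative assumptions rather than values from this paper, and the Kronecker delta is implemented here by comparing example indices.

```python
import math

def sq_exp_kernel(fi, fj, sigma_f=1.0, length=1.0):
    """Squared exponential kernel of Eq. (3) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(fi, fj))
    return sigma_f ** 2 * math.exp(-sq_dist / (2.0 * length ** 2))

def gram_matrix(features, sigma_f=1.0, length=1.0, sigma_n=0.1):
    """Gram matrix of Eq. (2): kernel values plus the noise variance
    on the diagonal (assumed: delta is 1 only when i == j)."""
    n = len(features)
    return [[sq_exp_kernel(features[i], features[j], sigma_f, length)
             + (sigma_n ** 2 if i == j else 0.0)
             for j in range(n)]
            for i in range(n)]

# Three toy 2-D feature vectors (hypothetical values).
features = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
G = gram_matrix(features)
```

The off-diagonal entries shrink as the feature vectors move apart, which is the property exploited in Sect. 3.1.2.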

Fig. 2 The process of the online searching. For every super-pixel in the input image, we assign it to the corresponding cluster. Then, we represent the test feature and the features in the training set by binary signatures. The preliminary selection set is then obtained by calculating the binary similarity scores. Finally, we sort the Euclidean distances between the test feature vector \({\mathbf {f}^*}\) and the feature vectors in the preliminary selection set to obtain the final training set T_min

We focus on the prediction of target \(t^*\) from a new input \(\mathbf {f}^*\) given \(N_t\) training inputs \(\mathbf {F}_{N_t} = [{\mathbf {f}_1},\ldots ,{\mathbf {f}_{N_t}}]\) and the corresponding observations \(\mathbf {t}_{N_t} = [{t_1},\ldots ,{t_{N_t}}]^T\). The GP regression provides the conditional probability density \(p(t^* \mid \mathbf {t}_{N_t})\) following a Gaussian distribution with mean \(m({\mathbf {f}^*})\) and variance \({\sigma ^2}({\mathbf {f}^*})\):

$$\begin{aligned} p({t^*}\left| \mathbf {t}_{N_t} \right. ) \sim N(m({\mathbf {f}^*}),{\sigma ^2}({\mathbf {f}^*})). \end{aligned}$$
(4)

According to the GP regression process, the mean of the predictive distribution in (4) is taken as the estimate of the transmission \(t^*\), expressed as

$$\begin{aligned} \begin{array}{l} m({\mathbf {f}^*}) = {\mathbf {k}_*}\mathbf {G}^{ - 1}\mathbf {t}_{N_t}, \end{array} \end{aligned}$$
(5)

where the \(N_t\)-dimensional vector \({\mathbf {k}_*}\) is a function of the input \({\mathbf {f}}^*\) as \({\mathbf {k}_*}= [k({\mathbf {f}^*},{\mathbf {f}_1}),\ldots ,k({\mathbf {f}^*},{\mathbf {f}_{N_t}})]\), and \(\mathbf {G}\) is the \(N_t \times N_t\) Gram matrix of the training samples defined in (2). The variance \({\sigma ^2}({\mathbf {f}^*})\) reflects the prediction precision, deduced as

$$\begin{aligned} {\sigma ^2}({\mathbf {f}^*}) = k({\mathbf {f}^*},{\mathbf {f}^*}) + \sigma _n^2 - {\mathbf {k}_*}\mathbf {G}^{ - 1}{\mathbf {k}_*^T}. \end{aligned}$$
(6)

Refer to [2] for the detailed training process of hyper-parameters and the solving process of the mean \(m(\mathbf {f}^*)\) and the variance \({\sigma ^2}({\mathbf {f}^*})\).
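As a hedged illustration of Eqs. (5) and (6), the following sketch computes the predictive mean and variance with fixed hyper-parameters. The hyper-parameter values, the helper names, and the plain Gaussian-elimination solver are assumptions made for a self-contained example; hyper-parameter training as described in [2] is not shown.

```python
import math

def kernel(a, b, sigma_f=1.0, length=1.0):
    """Squared exponential kernel of Eq. (3)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return sigma_f ** 2 * math.exp(-d2 / (2.0 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def gp_predict(F, t, f_star, sigma_f=1.0, length=1.0, sigma_n=0.1):
    """Return (mean, variance) of Eqs. (5) and (6) for one test input."""
    n = len(F)
    G = [[kernel(F[i], F[j], sigma_f, length)
          + (sigma_n ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [kernel(f_star, fi, sigma_f, length) for fi in F]
    alpha = solve(G, t)        # G^{-1} t
    beta = solve(G, k_star)    # G^{-1} k_*^T
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = (kernel(f_star, f_star, sigma_f, length) + sigma_n ** 2
           - sum(k * b for k, b in zip(k_star, beta)))
    return mean, var

# Toy 1-D training set (hypothetical features and transmissions).
F = [[0.0], [1.0], [2.0]]
t = [0.2, 0.5, 0.8]
mean_near, var_near = gp_predict(F, t, [1.0])
mean_far, var_far = gp_predict(F, t, [10.0])
```

In this toy setting the test input far from all training examples receives a variance close to the prior value \(\sigma _f^2 + \sigma _n^2\), whereas the input lying among the training examples receives a smaller one.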

3.1.2 Optimal training examples

The \({\mathbf {k}_*}\) term in Eq. (6) relates the predictive variance for a given input feature \({\mathbf {f}^*}\) with the relationship between \({\mathbf {f}^*}\) and available training examples. We prove that the closest training examples to \({\mathbf {f}^*}\) train a GP regressor yielding lower predictive variance for the input \({\mathbf {f}^*}\).

In order to investigate the effect of a training pair \({\mathbf {f}_i}\) and \(t_i\) on the prediction, we examine the variance of the predictive distribution \(p({t^*}\left| {{t_i}} \right. )\) conditional on the observed target \(t_i\) of the input \(\mathbf {f}_i\):

$$\begin{aligned} {\sigma ^2}({t^*}\left| {{t_i}} \right. ) = k({\mathbf {f}^*},{\mathbf {f}^*}) + \sigma _n^2 - k({\mathbf {f}_i},\mathbf {f}^*)\mathbf {G}^{- 1}k(\mathbf {f}_i,\mathbf {f}^*)^T. \end{aligned}$$
(7)

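The last term in (7) reveals that the predictive variance for the GP regression depends on the connection between the training example \(\mathbf {f}_i\) and given input \(\mathbf {f}^*\). With a single training example, \(\mathbf {G}\) reduces to the scalar \(k(\mathbf {f}_i,\mathbf {f}_i) + \sigma _n^2 = \sigma _f^2 + \sigma _n^2\), and \(k(\mathbf {f}^*,\mathbf {f}^*) = \sigma _f^2\). Substituting (2) and (3) into (7), we make the dependency more evident: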

$$\begin{aligned} {\sigma ^2}({t^*}\left| {{t_i}} \right. ) = \sigma _f^2 + \sigma _n^2 - \frac{k(\mathbf {f}_i,\mathbf {f}^*)^2}{{\sigma _f^2 + \sigma _n^2}}. \end{aligned}$$
(8)

The term \(k(\mathbf {f}_i,\mathbf {f}^*)\) determines the predictive variance given the learned hyper-parameters \(\sigma _f^2\) and \(\sigma _n^2\). As shown in (3), the kernel decreases monotonically with the Euclidean distance between \(\mathbf {f}_i\) and \(\mathbf {f}^*\), so the subtracted term in (8) is largest for the nearest examples. Hence, the closest training examples to the input of interest \(\mathbf {f}^*\) yield the prediction with a lower variance. It is possible to apply this dependency for efficient online training when data points arrive sequentially [25], while herein we devise a fast strategy to localize these “optimal” examples from a collection of examples for accurate prediction.
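The dependency in Eq. (8) can be checked numerically with a short sketch. The hyper-parameter values and the list of distances below are arbitrary assumptions chosen only to show the trend.

```python
import math

def single_example_variance(distance, sigma_f=1.0, sigma_n=0.1, length=1.0):
    """Predictive variance of Eq. (8) for one training example lying at
    the given Euclidean distance from the test input."""
    k = sigma_f ** 2 * math.exp(-distance ** 2 / (2.0 * length ** 2))
    prior = sigma_f ** 2 + sigma_n ** 2
    return prior - k ** 2 / prior

distances = [0.0, 0.5, 1.0, 2.0, 4.0]
variances = [single_example_variance(d) for d in distances]
for d, v in zip(distances, variances):
    print(f"distance {d:.1f} -> variance {v:.4f}")
```

The variance grows with the distance and approaches the prior value \(\sigma _f^2 + \sigma _n^2\), which is why the nearest examples are the preferred training data.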

3.2 Fast searching of optimal examples

As shown in (8), the Euclidean distance between the testing input and training examples governs the prediction accuracy of GP regression. Unfortunately, directly calculating all Euclidean distances and finding the closest ones every time a new input is predicted would be prohibitively time-consuming. One straightforward strategy is to construct a kd-tree of training examples for acceleration [32, 36]. The Hamming embedding, adopted in large-scale visual retrieval [24], provides a more informative representation for distances between pairs of feature vectors by binary signatures. These signatures not only reflect rich contextual information, but also have extremely low computational loads and little memory usage. The searching process is divided into off-line and online stages, given below.

3.2.1 Off-line stage

The off-line process builds an efficient index structure for all available training examples in order to accelerate the online searching of optimal examples for the input. As shown in Fig. 2, we first collect a training set containing both synthetic and real-world images. Subsequently, we cluster examples in the training set and generate binary signatures for these examples.

We construct a training set with both synthetic and real-world images. Our previous transmission model in [11] is able to generate hazy images for natural scenes from original sharp images and their corresponding depth maps [29]. For real-world images, we apply the screening process in [12] to categorize natural images into three levels of haze and apply three traditional dehazing methods, [22], [34], and [13], to dense, moderate, and light hazy images, respectively. Refer to [12] for the justification of this haze generation process upon haze levels. The target transmission maps are available for real-world images as a common by-product of these dehazing methods. We thus have the super-pixel features (detailed in the next section) and their corresponding transmissions as the target labels for training.

We reduce the dimensionality of training features upon their closeness for the sake of efficiency. The classical k-means algorithm groups the training feature vectors of super-pixels into \(\omega \) clusters. Subsequently, we generate a \(d_f \times d_f\) matrix (\(d_f\) is the dimensionality of feature vectors) of i.i.d. random values from a Gaussian distribution N(0, 1) and apply the QR decomposition to the matrix, yielding orthogonal bases. The first d rows of the resultant orthogonal matrix are taken as the projection matrix \(\mathbf {Q}_d\) for the dimensionality reduction. Multiplying the matrix \(\mathbf {Q}_d\) with the training feature matrix \(\mathbf {F}_n^i=[\mathbf {f}_1^i,\ldots ,\mathbf {f}_n^i]\) of the i-th cluster \(\omega _i\), where n is the number of features in the cluster, we have the new training feature matrix \(\mathbf {Z}^i\) for \(\omega _i\). We then compute the median value of each row in \(\mathbf {Z}^i\) and obtain a median vector \(\mathbf {m}^i = [m_1^i,\ldots ,m_d^i]^T\) for the cluster \({\omega _i}\). This median vector facilitates the fast localization of the cluster and generation of binary signatures in the online stage for example searching.
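The per-cluster indexing described above can be sketched as follows. This is an illustration under stated assumptions: the cluster is taken as already formed (k-means is not shown), Gram-Schmidt orthogonalization of Gaussian vectors stands in for the QR decomposition, and all function names and toy dimensions are hypothetical.

```python
import random
import statistics

def random_orthogonal_rows(d_f, d, seed=0):
    """Return d orthonormal rows of length d_f, obtained by Gram-Schmidt
    on i.i.d. N(0, 1) vectors (a stand-in for the QR decomposition)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d_f)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-10:
            rows.append([a / norm for a in v])
    return rows

def project(Q_d, f):
    """Project a d_f-dimensional feature onto the d rows of Q_d."""
    return [sum(qk * fk for qk, fk in zip(q, f)) for q in Q_d]

def build_cluster_index(Q_d, cluster_features):
    """Return the median vector of a cluster and the binary signature
    of every training feature in that cluster."""
    Z = [project(Q_d, f) for f in cluster_features]
    d = len(Q_d)
    medians = [statistics.median(z[k] for z in Z) for k in range(d)]
    signatures = [[1 if z[k] > medians[k] else 0 for k in range(d)]
                  for z in Z]
    return medians, signatures

# Toy cluster: six 5-D features projected to 3 bits (hypothetical sizes).
rng = random.Random(2)
cluster = [[rng.random() for _ in range(5)] for _ in range(6)]
Q_d = random_orthogonal_rows(d_f=5, d=3, seed=1)
medians, signatures = build_cluster_index(Q_d, cluster)
```

Thresholding at the median splits each projected dimension roughly in half, so every bit of the signature carries a comparable amount of information about the cluster.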

3.2.2 Online stage

In the online process, we generate the binary signature of the input and then search the examples close to the input through the binary index of the training set generated in the off-line process.

Given a new input image, we assign its feature \(\mathbf {f}^*\) to the cluster \({\omega ^j}\) with the closest centroid and then project the feature to a vector \(\mathbf {z}^* = [z_1^*,\ldots ,z_d^*]^T\) by \(\mathbf {Q}_d\). The bit \(b_k\) is set to one if the k-th component of the projected vector \(z_k^*\) is larger than the corresponding median value of the cluster \(m_k^j\), and to zero otherwise. Hence, we generate the binary signature \(\mathbf {b}(\mathbf {f}^*) = [b_1(\mathbf {f}^*),\ldots ,b_d(\mathbf {f}^*)]^T\) for the input. The similarity between this binary feature and training features reflects how close the input feature is to the training ones. More importantly, this similarity score \(B_s\) can be efficiently evaluated by applying the binary exclusive-or (XOR) operator to \(\mathbf {b}(\mathbf {f}^*)\) against those binary features in the cluster. The online searching process is illustrated in Fig. 2, and the algorithm is summarized in Algorithm 1.

Algorithm 1

Those feature vectors having the similarity score above a threshold \({b_t}\) in the cluster \({\omega ^j}\) are chosen as the candidate training set. The Hamming embedding algorithm significantly reduces the memory and time expenses since similarity evaluation on binary signatures has a negligible computational load. Subsequently, we sort the Euclidean distances between the input feature \({\mathbf {f}^*}\) and the feature vectors in the candidate set and take the \(N_t\) feature vectors with the closest distances as the final training set T_min. The similarity score on binary features efficiently localizes the candidate set, while the closest training features are picked upon the Euclidean distances between original features. This strategy balances the efficiency and accuracy.
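The two-step selection described above (binary screening followed by Euclidean ranking) can be sketched as below. The sketch assumes that the similarity score counts the bits on which two signatures agree, which is one plausible reading of the XOR-based score \(B_s\); the function names and toy data are hypothetical.

```python
def similarity(sig_a, sig_b):
    """Number of agreeing bits: a bit pair agrees when its XOR is 0."""
    return sum(1 for a, b in zip(sig_a, sig_b) if not (a ^ b))

def search_examples(f_star, sig_star, features, signatures, b_t, n_t):
    """Return indices of the n_t training features closest to f_star
    among those whose signature similarity exceeds the threshold b_t."""
    candidates = [i for i, s in enumerate(signatures)
                  if similarity(sig_star, s) > b_t]
    candidates.sort(
        key=lambda i: sum((a - b) ** 2
                          for a, b in zip(f_star, features[i])))
    return candidates[:n_t]

# Toy cluster with four training features and 4-bit signatures.
features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]]
signatures = [[1, 0, 1, 1], [1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 0]]
f_star, sig_star = [0.1, 0.1], [1, 0, 1, 1]
T_min = search_examples(f_star, sig_star, features, signatures,
                        b_t=2, n_t=2)
print(T_min)
```

In this toy data the fourth feature is the nearest in Euclidean distance but is screened out by its signature, which illustrates why the binary screening is an approximation of the exhaustive search.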

4 Regression model for dehazing

Researchers have devoted great efforts to haze removal from a data-driven perspective in recent years. These data-driven methods typically learn image priors from training examples and achieve better performance than the classical methods based on physical models. In this study, we employ the two-layer Gaussian process regression to learn the mapping from features to transmissions, as in our previous work [12]. For the completeness of this paper, we sketch the regression model for which we search optimal training examples.

4.1 Hazy image formation model

The widely used formation model of hazy image [22, 35] is as follows:

$$\begin{aligned} I({p_i}) = J({p_i})t({p_i}) + A(1 - t({p_i})), \end{aligned}$$
(9)

where \(p_i\) is a pixel, I is the hazy image, J is the haze-free image of I, A is the atmospheric light, and \(t({p_i})\) is the medium transmission of \(p_i\) that characterizes the portion of the light reaching the camera.
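Equation (9) and its inversion for a single pixel can be sketched as follows. The lower bound `t_min` on the transmission is an assumption added here to avoid division by very small values; it is not part of Eq. (9), and the pixel values are hypothetical.

```python
def add_haze(J, t, A):
    """Apply Eq. (9) to one RGB pixel J with transmission t and
    atmospheric light A (a scalar shared by the three channels here)."""
    return [j * t + A * (1.0 - t) for j in J]

def remove_haze(I, t, A, t_min=0.1):
    """Invert Eq. (9): J = (I - A) / t + A, with t clamped from below
    by t_min (an assumed safeguard)."""
    t = max(t, t_min)
    return [(i - A) / t + A for i in I]

J = [0.2, 0.5, 0.8]                  # haze-free pixel (hypothetical)
I = add_haze(J, t=0.6, A=0.9)        # synthetic hazy pixel
J_rec = remove_haze(I, t=0.6, A=0.9) # recovered pixel
```

With the true transmission, the recovered pixel matches the haze-free one, which is why the accuracy of the estimated transmission directly controls the dehazing quality.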

Fig. 3 Dehazing with GPR model. a A hazy image. b The transmission map estimated by the first GPR layer. c The transmission map estimated by the second GPR layer. d Final transmission map refined by guided filtering. e Our final dehazing result

We slightly modify the transmission model to refine the transmission [11]. The refined transmission can be derived as:

$$\begin{aligned} {t_r}({p_i}) = {t_e}({p_i})^{1 - Dvis_{e}/Dvis_{r}}, \end{aligned}$$
(10)

where the \({t_e}{({p_i})}\) is the original estimated transmission and the \({t_r}{({p_i})}\) is the refined transmission. The two parameters \(Dvis_{e}\) and \(Dvis_{r}\) are the maximum visibility values for the original and desired images, respectively. By tuning the ratio of the two parameters, users can control the degree of haze in the resultant image.

4.2 Multiscale feature vector

Haze-relevant features form the input vector for regression. We use the hue disparity [6] between the hazy image and its semi-inverse image as one feature, owing partly to its ability to detect haze [1]. As shown in previous studies on image dehazing, the dark channel [22], local maximum contrast, and saturation are highly correlated with the amount of haze. All these quantities vary with the local window size. Thus, we generate these values across various scales as features. The Gabor feature [6] represents the texture of an image, and its value changes notably in hazy regions. We convolve the input hazy image with a set of Gabor filters and calculate the Gabor features from the filtered image. Finally, the input feature vector includes the hue disparity, dark channel, local maximum contrast, saturation, and Gabor features.
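As one example of a scale-dependent feature, the dark channel can be computed over square windows of different radii. The sketch below is a straightforward, unoptimized illustration; the window radii and the toy image are assumptions and do not reproduce the four scales used in the experiments.

```python
def dark_channel(image, radius):
    """Dark channel of an H x W image of (r, g, b) tuples: the minimum
    over the colour channels and over a (2*radius+1)^2 window, clipped
    at the image border."""
    h, w = len(image), len(image[0])
    min_rgb = [[min(px) for px in row] for row in image]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            out[y][x] = min(min_rgb[yy][xx] for yy in ys for xx in xs)
    return out

# 3 x 3 toy image: bright pixels with one dark pixel at the centre.
bright, dark = (0.8, 0.7, 0.9), (0.3, 0.0, 0.5)
image = [[bright, bright, bright],
         [bright, dark, bright],
         [bright, bright, bright]]
multiscale = {r: dark_channel(image, r) for r in (0, 1)}
```

A larger window spreads the influence of a single dark pixel over its neighbourhood, so the same pixel can receive different dark-channel values at different scales.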

4.3 Regression models

We employ a two-layer GPR model to learn transmissions of hazy images. The first layer takes the feature vector as the input and outputs the preliminary transmission. The second layer smooths the transmissions predicted by the first layer and preserves the consistency of image structures.

4.3.1 The first layer of GPR

For the first layer, we take the average feature vector within a super-pixel [28] \(S_i\) as the input \({\mathbf {f}_i}\) and the average transmission within \(S_i\) as the target output, since pixels in a local region with similar structural contexts tend to have similar amounts of haze. The \({\mathbf {f}_i}\) can be expressed as:

$$\begin{aligned} {\mathbf {f}_i} = \frac{1}{{\left| S_i \right| }}\sum \limits _{{p_i} \in {S_i}} {\tilde{\mathbf {f}}}({p_i}) , \end{aligned}$$
(11)

where \(\left| S_i \right| \) is the number of pixels in \(S_i\) and \({\tilde{\mathbf {f}}}({p_i})\) is the multiscale feature vector of the pixel \(p_i\). The process of obtaining the optimal training examples is described in Sect. 3.2.
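The averaging in Eq. (11) can be sketched as below, assuming that the super-pixel segmentation is already available as one integer label per pixel. The function name and the toy data are hypothetical.

```python
def superpixel_features(pixel_features, labels):
    """Average the per-pixel feature vectors inside each super-pixel.
    pixel_features[n] is the feature vector of pixel n and labels[n]
    is the index of the super-pixel containing it."""
    sums, counts = {}, {}
    for f, s in zip(pixel_features, labels):
        if s not in sums:
            sums[s] = [0.0] * len(f)
            counts[s] = 0
        sums[s] = [a + b for a, b in zip(sums[s], f)]
        counts[s] += 1
    return {s: [v / counts[s] for v in sums[s]] for s in sums}

# Three pixels with 2-D features, grouped into two super-pixels.
pixel_features = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels = [0, 0, 1]
averaged = superpixel_features(pixel_features, labels)
```

The same routine applies to the target: averaging the per-pixel transmissions with the same labels gives the training output of each super-pixel.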

Given an input feature vector \({\mathbf {f}^*}\) of an image to be dehazed, we can obtain the conditional probability of the target transmission \({t_f^*}\) by the trained GPR. The conditional probability is a Gaussian distribution:

$$\begin{aligned} p(t_f^*\left| \mathbf {T} \right. )\sim N(m(t_f^*),{\sigma ^2}(t_f^*)), \end{aligned}$$
(12)

where \(\mathbf {T}\) are the transmissions of training data, \(m(t_f^*)\) and \({\sigma ^2}(t_f^*)\) are the mean and variance of this distribution, respectively, and the values of \(m(t_f^*)\) and \({\sigma ^2}(t_f^*)\) are taken as the predicted transmission \({t_f^*}\) and its error, respectively. One assumption of our algorithm is that the pixels within a super-pixel present an identical depth, and thus the same transmission, as the transmission is related to the depth:

$$\begin{aligned} t=e^{-\lambda d}, \end{aligned}$$
(13)

where \({\lambda }\) is a hyper-parameter that is independent of the transmission t and depth d. Therefore, heterogeneous pixels, i.e., those with different depths, given by the super-pixel segmentation may produce inaccurate transmission estimation. Fortunately, pixels in a super-pixel are more likely to share common structural contexts than those of a regular patch. Consequently, regressions upon super-pixels in our approach perform better than traditional patch-wise regressions.
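A small numerical sketch of Eq. (13) shows why pixels at different depths inside one super-pixel are problematic. The value of \(\lambda \) and the depths below are arbitrary assumptions.

```python
import math

def transmission(depth, lam=0.5):
    """Eq. (13): t = exp(-lambda * d)."""
    return math.exp(-lam * depth)

def transmission_spread(depths, lam=0.5):
    """Largest gap between per-pixel transmissions in a super-pixel."""
    ts = [transmission(d, lam) for d in depths]
    return max(ts) - min(ts)

homogeneous = [2.0, 2.0, 2.0]     # pixels at one depth (hypothetical)
heterogeneous = [1.0, 1.0, 5.0]   # one pixel much farther away
spread_same = transmission_spread(homogeneous)
spread_mixed = transmission_spread(heterogeneous)
```

For the homogeneous super-pixel a single transmission value is exact, while for the mixed one no single value fits all pixels, so the averaged target is inaccurate for at least some of them.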

According to the predicted transmission of every super-pixel, we can obtain the transmission map of hazy image as shown in Fig. 3b. The transmission map can roughly reflect the depth and global structure of the image, but exhibits local disparity across super-pixels.

4.3.2 The second layer of GPR

The second layer builds connections between latent variables similar to the Markov random fields (MRFs) in [16, 38] without any iterative energy optimization or inference process. The target of the second GP regressor is the averaged transmission within current super-pixel \(S_i\), and the input \({\tilde{\mathbf {t}}_i}\) is the collection of its eight neighbors \(N_e({S_i})\), where the \({\tilde{\mathbf {t}}_i}\) can be expressed as:

$$\begin{aligned} {\tilde{\mathbf {t}}_i} = {[t({S_1}),\ldots ,t({S_j}),\ldots ,t({S_8})]_{{S_j} \in {N_e}({S_i})}}. \end{aligned}$$
(14)

The process of obtaining the training transmissions is the same as the first layer. Since a super-pixel does not necessarily share a boundary with eight adjacent neighbors as a pixel does, we take the eight neighbors nearest to the current super-pixel as the input.
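The input vector of Eq. (14) can be assembled as sketched below. The sketch assumes that "nearest" is measured between super-pixel centroids, which is one possible reading; the function name and the toy layout are hypothetical.

```python
def neighbor_input(centroids, transmissions, i, k=8):
    """Transmissions of the k super-pixels whose centroids are nearest
    to the centroid of super-pixel i (assumed distance measure)."""
    cx, cy = centroids[i]
    others = [j for j in range(len(centroids)) if j != i]
    others.sort(key=lambda j: (centroids[j][0] - cx) ** 2
                              + (centroids[j][1] - cy) ** 2)
    return [transmissions[j] for j in others[:k]]

# 3 x 3 grid of super-pixel centroids plus one distant super-pixel.
centroids = [(x, y) for y in range(3) for x in range(3)] + [(100, 100)]
transmissions = [j / 10 for j in range(10)]
t_tilde = neighbor_input(centroids, transmissions, i=4)
```

Here the centre super-pixel (index 4) takes its eight grid neighbours as input, and the distant super-pixel is ignored.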

In the prediction, we take the transmission of an input super-pixel \(S_i^*\) as the target \({\tilde{\mathbf {t}}^*}\) and the transmissions of its eight neighbors estimated by the first layer as the input vector. Then, the conditional probability of predicted transmission \({\tilde{\mathbf {t}}^*}\) follows a Gaussian distribution. Similarly, the mean of the Gaussian distribution is taken as the predicted transmission of \(S_i^*\). The second layer maintains the consistency of image structures and attenuates the local disparity in the output of the first layer. As shown in Fig. 3c, the transmission map estimated by the second layer imposes local smoothness on the transmission map of the first layer. We apply the guided filtering [21] to achieve a further refined transmission map for the haze removal and then restore the sharp image using the final transmission map and Eq. (9), as shown in Fig. 3e.

5 Experimental results and analysis

In this section, we compare the regression using the proposed example searching with the previous two-layer GP regression [9], where all images available in a data set were used for training, in order to verify the effectiveness of the searching strategy. As for the hyper-parameters in the online stage, we set \(\omega =10\), \(d=16\), and \({b_t}=15\), which are fixed for a wide variety of input images while training features adapt to a specific input. To avoid unstable behavior of the GP regression, we take the chosen number of training features \(N_t=10\). The input feature vector has 37 dimensions (\(d_f=37\)) including the hue disparity, dark channel (four scales), local maximum contrast (four scales), saturation (four scales), and Gabor features (three scales and eight directions).

We also demonstrate the superior performance of our input-adaptive dehazing with example searching by comparing with four recently developed dehazing algorithms [22, 27, 35, 39]. As a nontrivial by-product, we collect different kinds of testing hazy images including people, buildings, landscape, etc., and categorize them by degree of haze for performance evaluation of dehazing algorithms.

5.1 Execution time

In this paper, we use the Hamming embedding to accelerate the example searching process. The Hamming embedding converts real feature vectors of training super-pixels into binary signatures and applies binary operators for similarity comparisons during the searching process. These binary signatures and operators have negligible computational costs and memory storage, resulting in time- and space-efficient example searching. Table 1 lists the averaged execution time of direct exhaustive searching, searching with a kd-tree structure [36], and the proposed strategy on all available training features of super-pixels. If we directly calculate all the Euclidean distances between the input and training examples, the selection process costs as much as 965.17 s. The Hamming embedding significantly reduces the time consumption to 20.67 s, which is acceptable in practice. Also, we use the direct exhaustive searching as the baseline to calculate the accuracy of the accelerated searching techniques. The accuracy of finding the optimal examples for the Hamming embedding is 85%, higher than that of the kd-tree, 80%. The Hamming embedding outperforms the kd-tree in terms of both accuracy and efficiency.

Table 1 Execution time comparisons (s)

Fig. 4 Variance distributions of GPR by T_max, T_min, and T_mid training sets

5.2 Comparisons using different training data sets

As shown in Sect. 3.1, the variance of the predicted transmission is directly related to the similarity between the training features and the input. Herein, we compare the variances using three different training sets, i.e., T_max, T_min, and T_mid. The set T_min includes the ten training examples having the lowest Euclidean distances to the input, while T_max and T_mid consist of those with the ten largest and ten median distances, respectively. Each of these three sets trains the GP regression model, which then dehazes the input image. The variance of the estimated transmission for every super-pixel in the input image reflects the accuracy of the prediction for the super-pixel. We take the variances of the predicted transmissions for the first GPR layer for analysis and show the histogram distribution of 3315 variances of the four hazy images in Fig. 4. The y-axis shows the number of variance values that fall into each interval of the x-axis. Over 70% of variances from the model trained with the T_min set fall into the range between 0 and 0.02, while about 56% from T_mid and 29% from T_max fall within this range. Most of the transmission variances from the T_min training set are smaller than those from the other two training sets. Table 2 shows the mean values of the variances of transmissions for the three input images. We can see that the variances decrease in the order T_max, T_mid, and T_min, which verifies the relationship between the regression accuracy and the similarity of training examples with the input given in Sect. 3.1. The haze removal results of T_max, T_min, and T_mid are shown in Fig. 5. We zoom in on some details (referring to the red boxes) in the dehazed images. The results from T_mid and T_max either have color distortions or retain a great portion of haze. The dehazed results of T_min have the highest visibility, and the details in the images are restored well.

Fig. 5 Visual comparisons on haze removal results of different groups of training examples. The zoomed-in regions (refer to the red rectangles) are illustrated on the second and fifth rows. a Hazy images. b T_max’s results. c T_mid’s results. d T_min’s results

Table 2 Mean variances \((\times \,10^{-6})\) for 3 hazy images

We also present quantitative comparisons on different training sets in order to evaluate the applicability of the selection process to the GP regression for dehazing. We calculate the peak signal-to-noise ratio (PSNR) [23] of the dehazed results on the synthetic hazy images with respect to the corresponding original haze-free images in the testing set. We use 27 synthetic hazy images in this experiment, and the box plots of the PSNR values on these images are shown in Fig. 6. The top and bottom lines of the box are the upper and lower quartile values. The horizontal line inside the box indicates the median value, while the ends of the whiskers represent the extent of the values. The median PSNR values ascend in the order T_max, T_mid, and T_min. The T_min set achieves the best restoration, yielding the highest median value among the three. This set learns a better mapping from the input to transmission because of the similarity between training examples and the input, indicating the effectiveness of our example searching for input-adaptive dehazing.
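For reference, PSNR follows the standard definition \(10\log _{10}(\mathrm {peak}^2/\mathrm {MSE})\). The sketch below assumes intensities normalized to [0, 1] and images flattened into lists; the sample values are hypothetical and are not results from this paper.

```python
import math

def psnr(reference, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    images given as flat lists of intensities."""
    mse = sum((a - b) ** 2
              for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)

reference = [0.0, 0.5, 1.0]   # haze-free intensities (hypothetical)
restored = [0.1, 0.5, 0.9]    # dehazed intensities (hypothetical)
score = psnr(reference, restored)
```

A higher PSNR means that the dehazed image is closer to the haze-free reference, so higher boxes in the box plots indicate better restoration.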

5.3 Comparisons with regression using all available images

The results of our work outperform those of [9], which shares a common regression model but trains the model using all available images without any selection. The previous work performs well in some cases, but its accuracy is lower than that obtained with the chosen examples. Inaccurate estimation in [9] may cause the underestimation or overestimation of the transmission, and consequently the dehazed results retain haze or exhibit distortions. In this study, we choose optimal training examples for a given input, reducing the variance of the transmission estimation. Figure 7 shows the visual comparisons with the full training set to illustrate the effectiveness of the selection process. In the first and fourth rows of the dehazed results using the full training set, the trees are over-dehazed while the backgrounds are under-dehazed, showing inconsistent quality. The second row of the results using all images has evident color distortions. These unpleasant results can be partially attributed to the inaccurate estimation of the transmission. The use of examples that are optimal for the input greatly reduces the inaccurate estimation and produces consistent and favorable dehazed results.

Fig. 6
figure 6

Box plots of PSNR comparisons between different groups of training examples on synthetic hazy images

Fig. 7
figure 7

Visual comparisons with the full training set. The zoomed-in regions (refer to the red rectangles) are illustrated on the third and fifth rows. a Hazy images. b Full training set. c Selected training set

Fig. 8
figure 8

Box plot of PSNR comparisons with the full training set

Again, we compare the PSNR values of the dehazed results obtained by regression on the full training set with those obtained from the selected examples. We take the 27 synthetic hazy images in this experiment and show the box plots of PSNR on these images in Fig. 8. The top lines of the boxes are almost the same, but the bottom line for the full training set is much lower than that for the model trained on selected examples. Also, the difference between the upper and lower quartiles for the full training set is much larger than that for the selected examples. The smaller gap between quartiles indicates the stability of our input-adaptive dehazing with example searching. The selection of training examples ensures the accuracy of transmission estimation, and thus hazy images with various amounts of haze are consistently well restored.
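The quantities read from the box plots can be computed as in the following sketch. The function names are illustrative, and the inclusive quantile method is an assumption, since plotting tools differ in how they interpolate quartiles.

```python
import statistics

def box_stats(values):
    """Five-number summary behind a box plot: whisker ends at the
    extremes, box edges at the quartiles, inner line at the median."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

def interquartile_range(values):
    """Gap between upper and lower quartiles; a smaller gap means the
    per-image PSNR values are more tightly clustered."""
    stats = box_stats(values)
    return stats["q3"] - stats["q1"]
```

Comparing `interquartile_range` on the PSNR values of two methods gives a single number for the stability argument made above.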

5.4 Comparisons with existing methods

Finally, we compare our input-adaptive haze removal with four recent dehazing methods [22, 27, 35, 39]. Figure 9 shows the resultant images obtained by these different methods. In the first row of Fig. 9, the methods of He and Tang both overestimate the thickness of haze and generate dim haze removal results. Those of Zhu and Ren underestimate the thickness of haze, leaving unpleasant residual haze in the resultant images. In contrast, our dehazing result is natural and clear. The regions between the tree and the building (referring to the red rectangle in the second row) are severely smeared by the other four methods, while our method preserves the details well. In the top-left corner of the gym image, all the other four methods leave a portion of haze, but our method restores the region well without any color distortion.

Fig. 9
figure 9

Visual comparisons on haze removal: a input hazy images, b He’s [22], c Tang’s [35], d Zhu’s [39], e Ren’s [27], and f ours. The zoomed-in regions (referring to the red rectangles) are illustrated on the second, fourth, and sixth rows

We also use the nonreference blur metric [7] to quantitatively evaluate the haze removal results of the different methods. The blur metric evaluates the image quality from the perspective of blur perception. When an image is hazy, sharp edges in the image are smeared out. The blur metric reflects the loss of image details and thus indicates the quality of dehazed images. The lower the value, the better the quality of the dehazed image. Nonreference evaluation of dehazing algorithms is still an open issue. There exist several objective metrics as well as subjective rating schemes, but no consensus has yet been reached on which one is the best. In our previous work [12], we performed evaluations in terms of two metrics and a subjective survey. These evaluations from different perspectives are largely consistent, especially for regression-based approaches. This paper focuses on the adaptive example selection for regression algorithms. The blur metric, which is easily reproducible, suffices to provide fair evaluations on regression results with and without the example selection.

We collect 34 real-world hazy images for this experiment. From the box plots shown in Fig. 10, our results exhibit the lowest median value among all compared methods, showing the effectiveness of the proposed method. Additionally, the proposed method performs quite stably, as our dehazed results exhibit the lowest upper and lower quartile values. Similar to [12], we classify the testing images into three categories based on the amount of haze, yielding three subsets of testing images: thin, moderate, and dense hazy images. We calculate the mean values of the blur metric on the images in the three categories for the five dehazing methods, as shown in Table 3. Our method has the lowest mean values among all methods on all three subsets, demonstrating its effectiveness on a wide variety of images with different amounts of haze.

Fig. 10
figure 10

Box plot of blur metric comparisons on real-world hazy images

Table 3 Mean blur metric of different methods

6 Conclusion

In this paper, we first advocate input-adaptive dehazing, which adaptively seeks examples to train a data-driven model specific to a given input, and then propose an efficient searching strategy on image examples to learn a more accurate GP regression model for dehazing. The proposed fast searching strategy efficiently finds optimal training examples adaptive to the input. These examples generate GP regressors that predict the target transmission with higher precision, thus yielding improved dehazing performance. The GP model learnt from the examples chosen by the strategy better represents the relationship between the input features and the corresponding transmission, and finally produces appealing dehazed results. The comparisons with other recent dehazing methods demonstrate the effectiveness of our input-adaptive dehazing with efficient example searching. The idea of searching for optimal examples is likely to apply to many data-driven approaches to low-level image processing, where data are always a central issue, in order to improve the performance of the respective algorithms.

In the future, we will study optimal example searching algorithms for other regressors used in image dehazing. As shown in this paper, using examples close to the input improves the regression accuracy for Gaussian processes, so that the optimization of training examples naturally becomes a search for nearest neighbors. For other applications such as facial analysis, we designed a sparse model for linear regression [14] and a self-reinforced learning strategy for cascaded regression [10]. It remains nontrivial to determine what the optimal training examples are, and how to find them, for regressors other than GP in image dehazing. We leave this investigation to future work.