1 Introduction

HDR imaging has received considerable attention in modern computer graphics applications. Its success is mainly due to its ability to capture an extremely wide range of illumination in real-world scenes and to produce images that are more realistic. Numerically, an HDR image encodes each pixel with three floating-point numbers related to the physical luminance in the scene, typically at 96 bits per pixel (bpp) instead of the 24 bpp of its Low Dynamic Range (LDR) counterpart.

Over the past decades, HDR imaging [8] has received significant recognition in several computer vision tasks [6, 29]. As a result, the subject of HDR image acquisition [7, 31] has attracted the attention of researchers and raised the challenge of storing the generated HDR images in specific formats. In this context, various compression methods have been proposed to represent the floating-point numbers in an efficient and compact way, and several formats, such as RGBE [48], LogLuv [27], and OpenEXR [55], support this type of data. However, the use of these formats is hampered by difficulties in rendering HDR content on standard display devices that are designed for conventional images. This problem has been tackled by using tone mapping operators [23], which aim at reducing the high dynamic range while preserving image content such as contrast, brightness, and colors. On the other hand, some researchers [4, 9, 10, 24, 25, 30, 52] are interested in developing reverse tone mapping (rTM) methods to expand LDR content to HDR. The principle of rTM is to estimate, from the LDR image, the real-world luminance values as faithfully as possible.

In accordance with the development of HDR imaging, it is expected that the number of HDR images will grow rapidly and that collections of this type of image will become available in different application domains. Therefore, the development of effective HDR image indexing and retrieval methods is becoming extremely important. In the literature, many works have focused on LDR image retrieval using different methods. In the last few years, deep learning approaches have become the foremost choice for most problems in the fields of computer vision and image processing, such as image dehazing [50, 51], recommender systems [34], object detection [35], visual captioning [11], and image retrieval [2, 12, 15, 18, 40, 44]. The latter is a fundamental task in many computer vision applications and has gained the interest of the scientific community as a means to effectively access, search, or browse images in databases. Several CNN-based methods have been developed in this field to supply a high-level description of image content. In Section 2, we introduce some of them.

The CNN architecture provides an attractive solution for different tasks thanks to its high performance, discriminative power, and compact representation, allowing for large-scale data modeling. However, to the best of our knowledge, no CNN-based scheme has been proposed yet for the purpose of HDR image retrieval. In this paper, we aim to shed light on this issue. Specifically, we propose a query-by-example HDR image retrieval method that uses the Fully Connected (FC) layer activations to define the relevant features of HDR images. Before the descriptor computation stage, the HDR pixels are transformed using a Perceptually Uniform (PU) encoding [1] that maps the luminance values to a perceptually uniform scale. The originality of our approach lies in extracting and testing CNN features on HDR content. To this end, we selected the configuration that yields the most powerful descriptors and the highest HDR retrieval accuracy. The novel contributions of this work are as follows:

  • Design an algorithm for HDR image retrieval based on CNN.

  • Apply PU encoding [1] to HDR content and evaluate its influence on the retrieval accuracy.

  • Build an HDR image database for the purpose of retrieval performance evaluation.

  • Analyse the efficiency of the CNN descriptor and report the performance of the Visual Geometry Group Network (VGGNet).

  • Evaluate the effectiveness of the proposed retrieval algorithm according to the number of layers in the CNN.

  • Demonstrate the competitiveness of convolutional and FC layers for LDR and HDR datasets.

  • Present experiments showing significant accuracy improvements over a recent state-of-the-art method.

The paper is structured as follows: In Section 2, we give a brief overview of related works regarding HDR image retrieval and the use of the CNN methodology. In Section 3, we present the commonly used CNN architecture. In Section 4, we describe the proposed CNN-based scheme for HDR image retrieval. In the experimental Section 5, we compare our method to others in the literature and assess their accuracy. Finally, conclusions are drawn in Section 6.

2 Related work

In the last few years, some works have focused on HDR image retrieval. In [19], the authors proposed to use histogram intersection to define an HSV color descriptor. The experiments revealed that HSV histograms can be efficiently used as a global descriptor for the HDR image retrieval task. To improve this method, the authors in [20, 22] combined the HSV color histograms with color moments. Despite their practical use, these approaches [19, 20, 22] appear very limited compared to local descriptors, which have proved to be very effective in indexing applications. Some researchers [5, 37, 38] turned their attention to the detection of key-points in HDR images under changing illumination conditions, varying camera viewpoints, camera distances, and scene lighting. Experimental results reported in [5, 38] demonstrated that the direct use of HDR images in linear scale is inappropriate for key-point detection. In [21], the authors introduced a new retrieval method based on LDR expansion. They improved the feature extraction by using reverse tone mapping and then applying a tone mapping operator before computing the Scale Invariant Feature Transform (SIFT) descriptor. The experimental results showed the potential of tone-mapped HDR content for detecting local descriptors and demonstrated that the selected features are more descriptive than those extracted from the LDR and linear HDR versions. The authors in [21] also established that applying local SIFT descriptors directly to HDR images is not appropriate.

Recently, to achieve a higher level of robustness, researchers have successfully used machine learning approaches in many imaging applications. Generally speaking, deep learning systems allow building rich features with hierarchical representations, resulting in effective classification [26]. In particular, the CNN architecture has become one of the most interesting topics that revolutionized the field of computer vision, including segmentation and object detection [3, 39]. It is characterized by its ability to capture different patterns while achieving high classification accuracy. In the literature, several CNN-based systems have been investigated to effectively describe LDR images. Among these, some methods use the activations of fully connected or convolutional layers as image descriptors [2, 12, 18, 40, 44]. In [39], the authors propose off-the-shelf CNN features: they extract generic features from the OverFeat network using the fully connected layer (FC6) of AlexNet and demonstrate that this approach clearly outperforms local feature methods. Various works, such as [53], use max-pooled activations from convolutional layers. To obtain compact descriptors, a number of dimensionality reduction methods are applied, such as Principal Component Analysis (PCA) [2, 44], Bag-of-Words [32], VLAD [33], and Fisher Vectors [45]. The authors in [36] propose a trainable Generalized-Mean (GeM) pooling layer: the idea consists in adding a new pooling layer with learnable parameters after the convolutional layers, followed by a whitening step to reduce the descriptor dimensionality. In [36], the authors also introduce a new weighted query expansion. The work presented in [14] builds a descriptor based on the regional maximum activations of convolutions (R-MAC) descriptor [44], learns the CNN weights end-to-end, and applies a siamese network with three streams and a triplet loss for training. A regional network is proposed to select the relevant regions of the image, using image scaling to extract local features. Another solution consists in adding a new layer, named NetVLAD, that can be plugged into any CNN architecture and trained through backpropagation in an end-to-end manner; the obtained features are reduced using PCA. In [15], the authors introduce a spatial pyramid pooling (SPP) of CNN features, an extension of BoW that generates a fixed-length representation regardless of image size and scale. Recently, the authors in [46] introduced an end-to-end trainable network using multiscale local pooling based on NetVLAD and triplet mining. In [16], the authors present REMAP, a global descriptor based on a hierarchy of deep features from multiple CNN layers.

A number of methods have been developed to define binary codes based on deep learning [28, 42]. Recently, a unified framework has been introduced for image retrieval and compression [42]. This framework applies deep hashing to learn compact binary codes and uses a new loss function to adapt the binary representation. For retrieval purposes, the VGG network is used with a specific configuration. To capture the manifold structure of the training data, the K-Nearest Neighbor (KNN) algorithm is applied to create a neighborhood matrix during the training of the neural networks, which can be unsupervised or supervised. Experimental results show that this method outperforms some existing state-of-the-art ones.

CNN features can be global [3, 26], local [33, 49], or regional [13]. In all cases, however, CNNs share a common architecture, which is described in the following section.

3 CNN architecture

Convolutional layer

The convolutional (Conv) layer is the main component that allows extracting image features. It operates on the input volume using a set of kernels (weights) as parameters. The different kernels are convolved across the width, height, and depth of the input volume using dot products to produce the output volume. This layer comprises a rectangular grid of neurons. In practice, the convolution is often implemented as a matrix multiplication: each block of pixels is stretched into a matrix column, the number of columns corresponds to the number of local regions, and the result of the multiplication is reshaped into an output volume whose depth corresponds to the number of kernels, yielding a compact description of the input volume.
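To make this matrix-multiplication view concrete, the following toy sketch (NumPy; the shapes and helper name are illustrative and not taken from the paper) implements a "valid" convolution by stretching each local region into a column and applying a single matrix product:

```python
import numpy as np

def conv2d_as_matmul(image, kernels, stride=1):
    """Toy 'valid' convolution: stretch each receptive field into a
    column (im2col), then one matrix multiplication yields the output
    volume, as described above."""
    C, H, W = image.shape                # input volume (channels, height, width)
    K, _, kh, kw = kernels.shape         # K kernels, each of shape (C, kh, kw)
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1

    # Each column holds one flattened local region (C*kh*kw values);
    # the number of columns equals the number of local regions.
    cols = np.empty((C * kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * out_w + j] = patch.ravel()

    # One matmul: the output depth equals the number of kernels.
    out = kernels.reshape(K, -1) @ cols
    return out.reshape(K, out_h, out_w)

x = np.random.rand(3, 8, 8)              # RGB-like input
w = np.random.rand(16, 3, 3, 3)          # 16 kernels of size 3x3
print(conv2d_as_matmul(x, w).shape)      # (16, 6, 6)
```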

Pooling layer

A pooling layer may be used after each convolutional layer. It is a simple operation applied independently on each slice of the input volume: within each window, it summarizes the outputs of neighboring groups of neurons. The most common type is max pooling, which is used to decrease the size (width and height) of the feature map while preserving the relevant information.
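As a minimal sketch (our own illustration, in NumPy), 2 × 2 max pooling with stride 2 halves each spatial dimension:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Toy max pooling: keeps the strongest response in each window,
    shrinking the width and height of every feature map."""
    C, H, W = x.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.empty((C, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[:, i*stride:i*stride+size, j*stride:j*stride+size]
            out[:, i, j] = window.max(axis=(1, 2))
    return out

fmap = np.random.rand(16, 6, 6)
print(max_pool2d(fmap).shape)   # (16, 3, 3) -- spatial size halved
```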

Normalization layer

This layer supports faster convergence. It adjusts the internal activations and is applied before the activation function. In the literature, various normalization models have been proposed for ConvNet architectures. The two most commonly used types are Local Response Normalization (LRN) and Batch Normalization (BatchNorm). The latter performs a more global normalization, whereas LRN normalizes within a small local neighborhood of channels for each pixel. This method was introduced in [26] and applied in [41] with the same parameters. The normalized output is given as follows:

$$ Y_{x,y}^{i} = X_{x,y}^{i} \Big/ \left( \kappa+\alpha \sum\limits_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)}\left( X_{x,y}^{j}\right)^{2} \right)^{\beta} $$
(1)

where \(X_{x,y}^{i}\) and \(Y_{x,y}^{i}\) are the pixel values at position (x,y) of kernel i before and after normalization, respectively, and N is the total number of feature channels in X. The constants are used as hyper-parameters, with κ = 2, n = 5, \(\alpha = 10^{-4}\), and β = 0.75.
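As a sanity check on Eq. (1), here is a direct, unoptimized NumPy sketch of LRN with the hyper-parameters above (our own illustration):

```python
import numpy as np

def lrn(x, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local Response Normalization of Eq. (1): each channel i at
    position (x, y) is divided by a term accumulated over the n
    neighboring channels. x has shape (N_channels, H, W)."""
    N = x.shape[0]
    y = np.empty_like(x)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * (x[lo:hi + 1] ** 2).sum(axis=0)) ** beta
        y[i] = x[i] / denom
    return y

fmap = np.random.rand(64, 8, 8)
print(lrn(fmap).shape)   # (64, 8, 8) -- same shape, rescaled values
```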

Fully-connected layer

All neurons in this layer are connected to all activations in the previous layer, so its output can be computed as a matrix multiplication followed by a bias offset. It takes as input the result of the previous (convolution and pooling) layers and returns a single vector that describes the image. However, this layer occupies the major part of the CNN memory and incurs a high computation cost, due to its large number of parameters.

Correction layer (activation function)

To improve the CNN efficiency, a correction layer is incorporated between layers. Its role consists in applying an activation function to make the output nonlinear. The most commonly used activation function is the rectified linear unit (ReLU), which applies the element-wise function \(f(x) =\max (x, 0)\). In addition to its simplicity, the ReLU activation function does not require any additional parameters and does not change the size of the input volume.

Loss layer

The loss layer is the last layer in the neural network. It specifies how network training penalizes the gap between the expected and actual outputs. Various loss functions, adapted to different tasks, can be employed. In particular, the Softmax function, also known as the normalized exponential function, is used to predict a single class among K mutually exclusive classes. The Softmax function takes as input a vector of K real numbers and normalizes it into a probability distribution of K probabilities (σ(.)) proportional to the exponentials of the input numbers. In the case of a neural network, given the output-layer vector z = [z1,…,zK], the conventional Softmax function can be expressed as follows:

$$ \sigma(z)_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{K} e^{z_{k}}} $$
(2)

where j is the index of the output unit, with j = 1,2,…,K.
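As an illustration of Eq. (2), a minimal implementation (with the standard max-subtraction trick, which leaves the result unchanged) looks as follows:

```python
import numpy as np

def softmax(z):
    """Softmax of Eq. (2); subtracting max(z) avoids overflow in exp
    without changing the resulting probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))          # K probabilities proportional to exp(z_j)
print(softmax(z).sum())    # 1.0
```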

Many studies [2, 13, 40] have shown that CNN features can be successfully extracted from traditional LDR images for retrieval. On the other hand, deep learning methodology has previously been proposed for reconstructing an HDR image from a single-exposure LDR one [9, 10, 52]. Additionally, the CNN architecture has been used to reconstruct HDR video from multiple exposures captured over time [17]. However, to the best of our knowledge, CNNs have never been investigated for the purpose of HDR image indexing and retrieval.

In this work, we attempt to exploit the many advantages of the CNNs to design an HDR image retrieval system. The proposed method is discussed in detail in the following section.

4 Proposed method

To determine an adequate descriptor, we model an HDR image as a collection of features using the VGG19 architecture [41]. Figure 1 summarizes the main steps of the proposed scheme for query-by-example HDR image retrieval. The database is divided into training and testing sets. Firstly, a perceptually uniform (PU) encoding [1] is applied to the HDR images. This encoding is defined by a specific transfer function that prepares the HDR content for feature extraction; notably, it ensures that distortion visibility is perceptually uniform across the coded pixels. For example, it can decrease the color sensitivity when the luminance is low (an illustrative sketch of this pre-processing step is given after Fig. 1). Secondly, a CNN descriptor is computed by extracting rich features from the FC layer activations. In this way, each image in the dataset is indexed by a CNN feature vector: DTr (descriptors of the training data) and DTs (descriptors of the testing data). Finally, the sought HDR images are retrieved using Support Vector Machine (SVM) classification. We note that the SVM classifier model is learned from the features of the training samples.

Fig. 1 Block diagram of the proposed HDR retrieval method
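The snippet below is a minimal stand-in for the PU pre-processing stage; the actual PU encoding of [1] is a fitted transfer function (typically applied via a lookup table), which we approximate here with a hypothetical log-based placeholder purely for illustration:

```python
import numpy as np

def pu_encode_stub(hdr, l_min=1e-2, l_max=1e4):
    """Hypothetical stand-in for the PU transfer function of [1]:
    maps linear HDR luminance to a roughly perceptually uniform,
    display-friendly range. The real encoding uses a fitted curve,
    not this simple logarithm."""
    lum = np.clip(hdr, l_min, l_max)
    enc = np.log10(lum / l_min) / np.log10(l_max / l_min)  # -> [0, 1]
    return (255.0 * enc).astype(np.float32)                # 8-bit-like scale

hdr_pixels = np.random.uniform(1e-2, 1e4, size=(224, 224, 3))
encoded = pu_encode_stub(hdr_pixels)
print(encoded.min(), encoded.max())   # values now in a bounded range
```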

4.1 Extraction of CNN features

VGGNet [41] is a widely known network that encompasses 19 layers (convolutional and fully-connected). It is recognized for its simplicity and its large feature maps. In this architecture, each hidden layer uses the ReLU activation function. Following the training setup, the input RGB image size is fixed to 224 × 224. A stack of convolutional layers is applied to the image using very small 3 × 3 filters, which capture left/right, up/down, and center details, as mentioned in [41]; 1 × 1 convolution filters are also used as linear transformations. The convolution stride equals 1 pixel, and the spatial padding is 1 pixel for the 3 × 3 convolutional layers. Concerning the pooling layers, the operation is applied over a 2 × 2 pixel window with stride 2; in total, five max-pooling layers are placed after some of the convolutional ones. A convolutional layer is designated by Convm_n, where m and n refer to the order of the stack and the order of the convolutional layer within the stack, respectively. For example, Conv1_1 is the first convolutional layer in the first stack, whereas Conv5_4 is the deepest layer in this network. At the top of this architecture there are three FC layers, applied after the sets of convolutional layers. The first and second FC layers have 4096 channels each, while the third one comprises only 1000 channels. As illustrated in Fig. 2, the global network architecture uses 19 weight layers: 16 convolutional and 3 FC layers. It is worth noting that the VGG16 architecture uses a configuration with reduced depth: only 16 layers are implemented, without the Conv3_4, Conv4_4, and Conv5_4 layers. In this network, each layer produces different feature maps that can be used as a local descriptor with a specific dimension. Previous work established that the use of FC features improves image retrieval accuracy [18, 44], mainly due to their high generalization and semantic descriptive ability. In this work, we propose to extract the 4096-dimensional output of the second FC layer and then replace the softmax layer with a linear SVM model, which is recognized for its very good practical results. The next section gives an overview of the linear SVM.

Fig. 2 The VGGNet architecture used for CNN feature extraction
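As a concrete illustration of this extraction step, the following sketch (PyTorch/torchvision; the paper does not prescribe a framework, so this setup is an assumption) loads a pre-trained VGG19, truncates its classifier just after the second FC layer, and returns the L2-normalized 4096-dimensional descriptor:

```python
import torch
from torchvision import models, transforms

# Pre-trained VGG19; its classifier is [FC1, ReLU, Dropout, FC2, ReLU, Dropout, FC3].
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
# Keep everything up to (and including) the ReLU after the second FC layer,
# dropping the final 1000-way layer to turn the network into a feature extractor.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:5])

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),      # VGG input size fixed to 224 x 224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fc2_descriptor(pil_image):
    """Returns the 4096-D FC2 activation used to index one image."""
    x = preprocess(pil_image).unsqueeze(0)       # (1, 3, 224, 224)
    with torch.no_grad():
        feat = vgg(x)                            # (1, 4096)
    return torch.nn.functional.normalize(feat, p=2, dim=1).squeeze(0)
```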

4.2 Linear SVM

SVM is one of the most commonly used algorithms in machine learning applications, thanks to its powerful discriminative classification. A hybrid approach that combines a CNN with a linear SVM has been proposed in several works, such as [43, 47, 54]. These works concluded that the CNN architecture can achieve impressive performance when combined with a linear SVM instead of Softmax. As demonstrated in [47], the linear SVM algorithm can be applied to each layer with no additional fine-tuning of the hidden representations.

In the field of image retrieval, the linear SVM algorithm finds the optimal separating hyperplane between classes using training samples. In our work, the feature vector is L2-normalized. Consider a training dataset (xi, yi), with i ∈ {1,…,N}, where N is the number of training samples and yi equals +1 and −1 for class ω1 and class ω2, respectively. In the case of linearly separable descriptors, it is possible to find at least one hyperplane, defined by a weight vector w and a bias b, that separates the classes without error:

$$ f(x) = w \cdot x + b = 0. $$
(3)

Therefore, to find such a hyperplane, we need to estimate w and b subject to the following constraints:

$$ y_{i} (w \cdot x_{i} + b ) \geq +1 \quad \text{for } y_{i} = +1 \ (\text{class } \omega_{1}), $$
(4)
$$ y_{i} (w \cdot x_{i} + b ) \leq -1 \quad \text{for } y_{i} = -1 \ (\text{class } \omega_{2}). $$
(5)
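In code, this classification stage can be sketched with scikit-learn's LinearSVC on the L2-normalized CNN descriptors (an assumed implementation choice; variable names are illustrative, and multiclass data is handled one-vs-rest under the hood):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize

# D_tr: (N, 4096) CNN descriptors of the training images, y_tr: class labels.
D_tr = np.random.rand(200, 4096)           # stand-in for real descriptors
y_tr = np.random.randint(0, 10, size=200)  # stand-in for real labels

clf = LinearSVC(C=1.0)                     # finds the separating hyperplanes
clf.fit(normalize(D_tr), y_tr)             # L2-normalize, then train

# At query time, the class of a test descriptor is predicted the same way.
D_ts = np.random.rand(5, 4096)
print(clf.predict(normalize(D_ts)))
```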

5 Experimental results

In this section, we present and compare experimental results on LDR and HDR datasets. Firstly, we describe the HDR databases and evaluation criteria used in these experiments. Then, we provide a quantitative evaluation of HDR image retrieval performance, compared to LDR, using the features extracted with the CNN framework. We also provide comparative evaluations against other related methods. Finally, we assess the time complexity.

5.1 Databases and measures

Although there is a growing interest in HDR content, the amount of data available to evaluate and test HDR indexing and retrieval systems remains limited. Fortunately, the rapid development of HDR tools has made it possible to easily generate HDR images. Thus, in order to create HDR databases, we used the inverse tone mapping method presented in [25], which provides very satisfactory results, to build HDR images by expanding LDR ones. In the current work, we consider the LDR PASCAL VOC2007, CIFAR-10, and Wang databases.

  • PASCAL VOC2007 database: It is one of the most widely used benchmarks for image classification. It contains 9963 RGB images divided into 20 classes (Person, Bird, Cat, Cow, Dog, Horse, Sheep, Airplane, Bicycle, Boat, Bus, Car, Motorbike, Train, Bottle, Chair, Dining table, Potted plant, Sofa, and TV/Monitor) and is available at http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html

  • CIFAR-10: It is one of the most popular deep learning datasets. It comprises 60000 natural images of size 32 × 32 divided into 10 categories (Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, and Truck). From the CIFAR-10 collection, 50000 images are devoted to training (5000 per class), while the remaining 10000 images are devoted to testing (1000 per class). https://www.cs.toronto.edu/~kriz/cifar.html

  • Wang: It contains 1000 images classified into ten categories (Africa, Beach, Buses, Monuments, Dinosaurs, Elephants, Flowers, Horses, Mountains, Food). Each category comprises 100 images. http://wang.ist.psu.edu/docs/related/

Several performance measures can be used to assess the efficiency of indexing/retrieval methods. Specifically, evaluation over the different datasets is performed by using test images (one or more images as the query) and ranking the database images from the most similar to the least similar. The performance of a particular method is estimated by averaging the performances over all query images. To assess the efficiency of the tested methods, we retained the following measures:

  • Precision against recall plot: A curve illustrating the relationship between precision and recall for a retrieval system. Precision represents the ability of the retrieval algorithm to return only images that are relevant, whereas recall corresponds to the system's ability to return all images that are relevant.

  • mAP: The mean Average Precision (mAP) of a set of queries is a common metric used to evaluate the effectiveness of an image retrieval system. It is worth noting that, among retrieval evaluation measures, mAP has been shown to have good discrimination and stability (a computation sketch is given after this list).

  • Accuracy: It is defined as the percentage of correctly classified instances.
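To make the mAP measure concrete, the standard per-query computation can be sketched as follows (our own illustration; the helper name is not from the paper):

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked list: `relevant` is a boolean sequence, True
    where the retrieved image belongs to the query's class."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                 # relevant items seen so far
    ranks = np.arange(1, len(relevant) + 1)
    precisions = hits / ranks                  # precision at each rank
    return precisions[relevant].mean()         # averaged over relevant hits

# mAP = mean of AP over all query images.
ranked_lists = [[True, False, True], [False, True, True]]
print(np.mean([average_precision(r) for r in ranked_lists]))
```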

5.2 Image retrieval results

To evaluate the performance of our HDR image retrieval approach and investigate the impact of the CNN, retrieval experiments are carried out on the HDR and LDR versions of the PASCAL VOC2007 database using mAP scores. Table 1 shows the retrieval accuracy using SIFT and CNN descriptors for the LDR (LDR-SIFT/LDR-CNN), Expanded-Mapped (EM-SIFT), HDR with linear luminance values (HDR-Lin-CNN), and HDR with PU encoding (HDR-PU-SIFT/HDR-PU-CNN) representations. We note that both the LDR-SIFT and EM-SIFT [21] methods use a bag of visual words as descriptor.

Table 1 mAP scores on the HDR PASCAL VOC2007 dataset using VGG19

From the results reported in Table 1, we observe that the HDR-PU-CNN descriptor yields excellent mAP scores for the majority of classes. This may be explained by the fact that PU encoding plainly improves the effectiveness of the CNN descriptor. Additionally, this encoding procedure leads to more accurate results than those provided by the original LDR content. On average, PU encoding provides a gain of about 2.71% and 1.11% for the CNN and SIFT features, respectively, compared to the same features obtained from the LDR representation.

Earlier studies have shown that the SIFT descriptor is recognized for its ability to capture local object details, such as edges and corners, and consequently achieves good performance in image retrieval. In the case of HDR images, our experiments show that PU encoding enhances the representation of HDR images. On the basis of the results reported in Table 1, the majority of the mAP values of HDR-PU-SIFT are higher than 70%; however, in some classes such as Bottle, Pottedplant, and Sheep, the mAP scores are lower than 50%. In contrast, examining the results obtained with the CNN descriptor, it appears that the latter offers clear advantages in terms of retrieval efficiency, thanks to its capability to learn from diverse images and successfully model HDR data. The results reported in Table 1 clearly show that the CNN descriptor systematically outperforms the EM-SIFT [21] and HDR-PU-SIFT ones. Moreover, we note that the mAP scores obtained by the HDR-PU-CNN method surpass 95% for most classes.

From a more quantitative point of view, we studied the impact of the features extracted with VGG16 and VGG19 on the LDR, HDR-Lin (HDR with linear luminance values), and HDR-PU (HDR with PU encoding) representations. For a given HDR-PU content, we extracted features from several layers (conv5_1 (C5_1), conv5_3 (C5_3), FC1, and FC2) and compared the matching precisions to those obtained for the LDR and HDR-Lin versions of the PASCAL VOC2007 dataset. From Fig. 3, we observe that the VGG19 model achieves higher performance for HDR content (HDR-Lin and HDR-PU) across the different layers: it gives very high precision and outperforms the VGG16 model. We believe that this is mainly due to the number of layers, which strongly influences the richness of the information in the HDR feature descriptor. However, in the case of LDR content, the results obtained with VGG16 and VGG19 are almost similar. Again, one may also conclude that PU encoding improves the precision of the HDR content and consequently induces an overall retrieval efficiency gain. For instance, the gain of PU is about 20.92% for the Airplane class using FC2 as descriptor. Moreover, we can clearly notice that when FC layers are considered, the mAP scores for the LDR and HDR-PU contents are very close. For some classes, like Dog, the LDR content exhibits superior performance compared to the HDR representation. This limitation stems from the high sensitivity of HDR content, which adversely affects the matching accuracy against the original dataset. As discussed above, PU encoding has proven its usefulness in alleviating this limitation by adjusting the HDR pixel values and reducing their sensitivity.

Fig. 3 Retrieval performance (mAP) for the LDR, HDR-Lin, and HDR-PU representations using the VGG16 and VGG19 frameworks on some classes of the PASCAL VOC2007 dataset

We have also tested our retrieval method, using the VGG19 network, on the CIFAR-10 database. Figure 4 presents the accuracy results for some classes. From this figure, we can see that the FC features outperform the convolutional ones for all recall values. Specifically, the accuracy of the top layers is superior to that obtained from the bottom ones. One may also notice that, in the case of the C5_1 layer, PU encoding shows a distinctive improvement for all classes. Quantifying the retrieval performance improvements brought by PU encoding, gains in accuracy of about 3.35% and 1.36% are attained for the Automobile and Deer classes, respectively, compared to HDR-Lin using the FC1 descriptor.

Fig. 4 Accuracy results for some classes of the CIFAR-10 database, using different layers of the VGG19 framework

In Table 2, we compare the CNN method for the HDR and LDR representations with other state-of-the-art methods (global and local) on the Wang dataset; 30 images per class are randomly selected for training. From this table, we can notice that the CNN descriptor offers the best result for both LDR and HDR content. It achieves good retrieval power for HDR content because its features are more descriptive than those extracted in [21] and [22], which is explained by the fact that the CNN descriptor fits HDR content well. Meanwhile, the local method (SIFT) yields the worst accuracy.

Table 2 Comparison of accuracy scores with other methods using the LDR and HDR representations on the Wang dataset

Table 3 shows the CIFAR-10 retrieval results, based on mAP, for the proposed method and BGAN+ [42], a recent state-of-the-art learning-based hashing method for image retrieval. According to these results, we observe that the proposed method achieves strong results and outperforms BGAN+ in terms of precision.

Table 3 Performance comparison (mAP) with another method on the CIFAR-10 dataset

As a last comparison, we carried out evaluations of the proposed retrieval method on the whole CIFAR-10 database. Figure 5 illustrates the precision-recall plots for the whole dataset. From the depicted curves, we observe that the configurations using the first and second FC layers achieve higher performance than the other ones. However, the performance decreases when using the bottom layers of the CNN framework. One may safely conclude that the selection of the network architecture and depth is crucial to achieve successful HDR content retrieval.

Fig. 5 Precision vs. recall curves of the tested VGG19 framework (under different configurations) using the HDR-PU representation on the whole CIFAR-10 database

Figure 6 shows three queries and their corresponding top 9 retrieved HDR images from the CIFAR-10 dataset using the proposed method. Query 1 is from the Automobile class, Query 2 from the Frog class, and Query 3 from the Horse class. The HDR images are tone mapped using the TMO proposed in [22] to ensure rendering on LDR devices. As we can see from this figure, the retrieved images in the first positions of the ranked lists belong to the same class as their corresponding queries. This proves the effectiveness of the CNN features in HDR image retrieval despite the large scale differences between images.

Fig. 6 Some examples of query HDR images from our HDR CIFAR-10 dataset and the top-9 retrieval results

5.3 Complexity evaluation

In order to evaluate the time complexity of the proposed CNN method, execution-time tests are performed on a machine with a 10-core Intel(R) Xeon E5-2640 v4 CPU at 2.40 GHz. For example, on the CIFAR-10 dataset, retrieving an HDR image takes on average 400 ms for CNN feature extraction, 4 ms for PU encoding, and 78 ms for SVM classification.

6 Conclusion

We have presented an HDR image retrieval method based on the CNN paradigm. To improve the accuracy of the proposed approach, PU encoding is applied to the HDR pixel values before the computation of the descriptor components. Through this work, we have reported, for the first time, results competitive with existing methods on the challenging HDR PASCAL VOC2007, CIFAR-10, and Wang datasets. In the same context, we have provided good practices for extracting features from selected layers of the CNN, using the pre-trained VGG19 model. Experimental assessments have demonstrated that CNN features exhibit substantial performance improvements over the SIFT descriptor. Moreover, the obtained results reveal that the FC layers offer the best performance among the CNN layers; hence, we can claim that they are very appropriate for describing HDR images. However, it must be emphasized that in some layers, especially in the VGG16 framework, the accuracy of our retrieval method applied to HDR images is lower than its counterpart applied to LDR images. This shortcoming is particularly apparent when the number of layers decreases. As future work, we intend to resolve this problem by incorporating additional discriminative cues in the layers to enhance the CNN-based HDR features. As another promising line of future work, we plan to investigate HDR datasets with deeper CNN architectures, such as ResNet, adjusted to HDR content.