Introduction

Content-Based Image Retrieval (CBIR) is the automatic indexing of images through the extraction of their low-level features, such as color and shape, and retrieval is driven by these features alone. Feature representation and similarity measurement are the two most important components of a CBIR system, and researchers have worked on them for more than a decade. Many different approaches have been proposed, but the problem remains one of the most difficult in ongoing CBIR research, a major reason being the gap between the low-level pixel representations captured by machines and high-level human perception. This semantic gap poses a fundamental challenge for Artificial Intelligence viewed broadly: how to build and train intelligent machines that perform real-world tasks as humans do. Machine learning is a promising alternative for addressing this challenge, and machine learning techniques have shown steady progress in recent years. Deep learning, an important branch of this progress, involves a family of machine learning algorithms that attempt to learn high-level abstractions of data through deep architectures composed of many layers of non-linear transformations.

Deep learning enables complex architectures that, like the human brain, process information through many stages of transformation and representation, in contrast to conventional machine learning methods that often rely on fixed, shallow structures. By exploiting deep architectures that learn features at multiple levels of abstraction from data, deep learning techniques make it possible for a system to learn complex functions that map input data directly to the output, without relying on hand-crafted features.

The success of deep learning inspired us to explore deep learning techniques for CBIR tasks on images. Although much research has applied deep learning to image classification and recognition in computer vision, comparatively little attention has been given to CBIR applications.

The proposed method applies deep learning techniques to the CBIR task. We train a large-scale neural network to learn effective feature representations and test it against the established categories of the Corel dataset (African tribes, Beaches, Buildings, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains, and Food). The results of the proposed method are compared with existing techniques and a performance analysis is carried out.

Literature Survey

In [1], the authors proposed a deep learning framework for CBIR that trains large-scale Deep Belief Networks to learn effective feature representations of images. They discuss three feature generalization schemes. In scheme I, the dataset images are fed into a pre-trained CNN model and the activation values of the last three layers are taken as features. In scheme II, instead of directly using the features extracted by the pre-trained deep model, similarity learning algorithms are explored to refine the features obtained in scheme I. In scheme III, the deep convolutional neural network is retrained on the image dataset for different CBIR tasks, initializing the CNN model with the parameters of ImageNet-trained models. Scheme III showed the best performance of the three.

In [2], the authors proposed an efficient pre-trained convolutional neural network model. They used the LeNet-5 CNN model and conducted experiments on the Corel-1K dataset [3,4,5], obtaining an average precision of 0.79. The results are compared with three traditional visual features, the hue-saturation-value (HSV) color feature, gray-level co-occurrence matrix (GLCM) features, and the scale-invariant feature transform (SIFT), against which the proposed model achieved the highest average precision value of 86%.

In [6], the authors proposed the Deep Belief Network (DBN) method of deep learning for feature extraction and classification. The DBN stacks several restricted Boltzmann machines (RBMs) into multiple stages, each with a single hidden layer, to make the learning process faster. The experimental results show an accuracy of 98.6% on a small dataset of 1000 images and 96% on a large dataset (> 10,000 images), without compromising the time-complexity requirement.

In [7], the authors proposed a CNN-SVM model, where the CNN performs feature extraction and the SVM acts as the recognizer. The first part of a CNN is the convolutional phase, which works as an extractor of image features. At its end, the convolution maps are flattened and concatenated into a feature vector called a CNN code. The SVM takes this CNN code from the output of the convolution phase as a new feature vector for training. The precision obtained using the pre-trained CNN with the Caltech256 database is 90% for 1000 images. Padmashree Desai et al. [8,9,10,11,12,13] discuss different methods of feature extraction using wavelets, edge operators, morphological operators, and moment invariants, with performance analysis using different distance measures. Video summarization [14, 15] can be used in image/video retrieval by searching for the query image in the summarized video dataset rather than in the original dataset, which can improve retrieval time.

Proposed Method and Implementation

The proposed architecture, shown in Fig. 1, consists of two layers: the first uses a CNN for training and feature extraction, and the second uses an SVM for classification and image retrieval. The feature vector obtained from the CNN model is fed as input to the SVM.

Fig. 1 Architecture of the CBIR system

The basic flow starts with a query image submitted by the user as input to the system. The features of the query image are extracted by the CNN model and stored in a vector. This feature vector is then passed to the SVM, which has already been trained on the Corel dataset. The pre-trained SVM module calculates the distance between the features of the query image and the features of the entire dataset. Retrieved images are displayed based on the similarity index, that is, the distance values with respect to the query image; the top 10, top 20, and so on are displayed as part of the retrieval process.

Feature Extraction

A VGG16 layered CNN model is used to extract the features of the dataset; the features of a query image are extracted in the same way when the user submits it to the system. Figure 2 represents the VGG16 layered architecture. It consists of twelve convolutional layers interleaved with max-pooling layers, followed by fully connected layers and finally a softmax classifier.

Fig. 2 VGG16 layered architecture

The data passes through the following layers of the CNN module, with the image transformed into different sizes, leading to the extraction of features. The original images, of size 384 × 256 or 256 × 384, are first converted to 224 × 224 pixels before being fed to the CNN model; a minimal sketch of this preprocessing step is given below.
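A minimal Keras sketch of this step (the file path is a hypothetical example):

    import numpy as np
    from tensorflow.keras.preprocessing import image
    from tensorflow.keras.applications.vgg16 import preprocess_input

    # Load a Corel image (hypothetical path) and resize it to 224 x 224.
    img = image.load_img("corel/africa_001.jpg", target_size=(224, 224))
    x = image.img_to_array(img)      # (224, 224, 3) float array
    x = np.expand_dims(x, axis=0)    # add batch dimension -> (1, 224, 224, 3)
    x = preprocess_input(x)          # VGG16 channel ordering and mean subtraction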

  • First and second layers

    The VGG16 CNN model takes a 224 × 224 × 3 RGB image as input. The image is passed through the first and second convolutional layers, each with 64 feature maps (filters) of size 3 × 3, 'same' padding, and a stride of 1. The image dimensions change from 224 × 224 × 3 to 224 × 224 × 64.

    A max-pooling layer is then applied with a 2 × 2 filter and a stride of 2, reducing the image dimensions from 224 × 224 × 64 to 112 × 112 × 64.

  • Third and fourth layers

    The image is passed through the third and fourth convolutional layers, each with 128 feature maps of size 3 × 3 and a stride of 1. A max-pooling layer with a 2 × 2 filter and a stride of 2 is then applied, reducing the output to 56 × 56 × 128.

  • Fifth and sixth layers

    The image is then passed through the fifth and sixth convolutional layers, each with 256 feature maps of size 3 × 3 and a stride of 1. A max-pooling layer with a 2 × 2 filter and a stride of 2 follows, reducing the output to 28 × 28 × 256.

  • Seventh to twelfth layer

    The seventh to twelfth layers consist of two sets of three convolutional layers, each set followed by a max-pooling layer. All six convolutional layers have 512 filters of size 3 × 3 and a stride of 1. The final output is reduced to 7 × 7 × 512.

  • Thirteenth Layer

    This final layer flattens the 7 × 7 × 512 output into a single vector of 25,088 values (equivalently, 25,088 feature maps each of size 1 × 1), which serves as the feature vector.

ReLU Activation Function

ReLU’s purpose is to introduce non-linearity into the CNN model. ReLU is linear for all positive values and zero for all negative values: a node is activated only when its input is above zero, and the output is zero otherwise, i.e. the node value is A(x) = max(0, x).
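For concreteness, ReLU is a one-liner in NumPy; this tiny sketch only makes the definition A(x) = max(0, x) concrete:

    import numpy as np

    def relu(x):
        return np.maximum(0, x)   # element-wise max(0, x)

    relu(np.array([-2.0, 0.0, 3.5]))   # -> array([0. , 0. , 3.5])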

The CNN model consisting of the above layers is compiled using model.compile(), and the image array is then passed to model.predict() to obtain the feature-vector output of the flatten layer.

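A minimal sketch of this extraction step, assuming the standard Keras VGG16 with ImageNet weights (in which the flatten layer is named "flatten" and outputs 25,088 values):

    import numpy as np
    from tensorflow.keras.applications.vgg16 import VGG16
    from tensorflow.keras.models import Model

    # Build an extractor whose output is the flatten layer of VGG16.
    base = VGG16(weights="imagenet", include_top=True)
    extractor = Model(inputs=base.input,
                      outputs=base.get_layer("flatten").output)
    # compile() mirrors the text; it is not strictly required for predict().
    extractor.compile(optimizer="adam", loss="categorical_crossentropy")

    # Stand-in for the preprocessed (1, 224, 224, 3) array from the earlier sketch.
    x = np.zeros((1, 224, 224, 3), dtype="float32")
    feature_vector = extractor.predict(x)   # shape (1, 25088)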

Image Classification and Retrieval

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. The obtained feature vector is given as input to the SVM, which calculates the distance between each image of the dataset and the query image. Combining binary SVM classifiers in this way is called the one-versus-one method. Multi-class classification with this method can be described as the task of constructing n(n−1)/2 binary SVMs, one classifier Cij for every pair of distinct classes, i.e. the ith class and the jth class, where i ≠ j and i, j = 1, …, n. Each classifier Cij is trained with the samples of the ith class as positive labels and the samples of the jth class as negative labels. The SVM is trained according to the labels attached to the feature vectors.

Classification of images is done using a linear SVM. Consider a set T of t training feature vectors xi ∈ R^D, i = 1, …, t, with corresponding class labels yi ∈ {−1, +1}. The SVM finds a separating hyperplane satisfying yi (w^T xi + b) ≥ 1 for all i, where w is the normal vector of the hyperplane and |b|/‖w‖ is the perpendicular distance between the hyperplane and the origin. In this case, we train on the shape and color features.

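A minimal scikit-learn sketch of this training step; the features and labels arrays below are hypothetical stand-ins for the CNN codes and their Corel class indices (SVC builds the n(n−1)/2 pairwise binary classifiers internally, i.e. 45 for the 10 classes):

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical stand-ins: N CNN codes (25,088-dim) and class indices 0..9.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(100, 25088)).astype("float32")
    labels = rng.integers(0, 10, size=100)

    # Linear SVM with one-versus-one multi-class handling.
    clf = SVC(kernel="linear", decision_function_shape="ovo")
    clf.fit(features, labels)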

Testing

  1. Select a query image x.jpg from the dataset.

  2. Find the class label to which the query image belongs.

  3. Perform classification of the images.

  4. Perform binary predictions for the selected query image.

  5. Retrieve the first n images most similar to the input image from class x using voting (a sketch of these steps follows the list).
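One possible reading of steps 2-5 as code, assuming the clf, features, and labels from the previous sketch and using Euclidean distance for the similarity ranking:

    import numpy as np

    def retrieve(query_vec, features, labels, clf, n=10):
        # Step 2: predict the class label of the query image.
        query_class = clf.predict(query_vec.reshape(1, -1))[0]

        # Steps 3-4: keep only the images assigned to that class.
        idx = np.where(labels == query_class)[0]

        # Step 5: rank candidates by distance to the query; return the top n.
        dists = np.linalg.norm(features[idx] - query_vec, axis=1)
        return idx[np.argsort(dists)][:n]

    top10 = retrieve(features[0], features, labels, clf, n=10)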

Results and Discussion

The query set consists of 10 images, the first image from each category. Figure 3 shows the query set of the different categories for the proposed work. Figure 4 shows the top 10 images retrieved for the class Africa. In the confusion matrix of Fig. 5, the class with the highest accuracy is Horses with 96% (48 true positives), and the class with the lowest accuracy is Mountains with 62% (31 true positives).

Fig. 3 Query set of the proposed work

Fig. 4 Top 10 images retrieved for class Africa

Fig. 5 Confusion matrix for class Africa

The experimental results in Fig. 6 show the top 10 images retrieved. Experiments conducted on the 10 classes of the Corel-1K dataset are reported in Table 1, which lists the precision values for each class when 10, 15, and 20 images are retrieved from the database.

Fig. 6 Top 10 images retrieved for class Elephant

Table 1 Results of different classes obtained with the Corel dataset

The precision for the class Elephant obtained by retrieving 10 similar images is 83.86%. Figure 7 shows the corresponding confusion matrix: the class with the highest accuracy is Dinosaurs with 100% (50 true positives), and the class with the lowest accuracy is Mountains with 52% (26 true positives).

Fig. 7 Confusion matrix for class Elephant

Table 2 compares the average precision values of various previously implemented CBIR models, HSV (hue, saturation, value), GLCM (gray-level co-occurrence matrix), SIFT (scale-invariant feature transform), and CNN, with the CNN-SVM model. The average precision of the proposed CNN-SVM model is higher than that of the HSV, GLCM, SIFT, and CNN models.

Table 2 Comparison of the CNN-SVM model with existing models

Conclusion

The proposed Content-Based Image Retrieval system, using a CNN for feature extraction and an SVM for classification, achieved an average precision of 83.5%. The use of the SVM helped reduce the time required to retrieve results. The experimental results were compared with previously proposed models: HSV (hue, saturation, value), GLCM (gray-level co-occurrence matrix), SIFT (scale-invariant feature transform), and CNN. The average precision of the proposed system is higher than that of these existing models, which makes it effective and promising.