Abstract
Advances in computer and multimedia technology and the introduction of the World Wide Web have increased the volume of image databases and collections, for example medical imagery, digital libraries, and art galleries, which together contain millions of images. Retrieving images from such huge databases by traditional methods such as Text-Based Image Retrieval, Color Histogram, and Chi-Square Distance may take a long time. It is necessary to develop an effective image retrieval system that can handle these huge numbers of images. The main purpose is to build a robust system that builds, executes, and responds to data efficiently. A Content-Based Image Retrieval (CBIR) system is an efficient image retrieval tool in which users submit a query and the system retrieves the desired images from the image collection. Moreover, with the growth of web development and transmission networks, the number of images available to users continues to grow. We propose an effective deep learning framework based on Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) for fast image retrieval. The proposed architecture extracts features using a CNN and performs classification using an SVM. The results demonstrate the robustness of the system.
Introduction
Content-Based Image Retrieval is the process of automatically indexing images by extracting their low-level features, such as color and shape; retrieval relies on these features alone. Feature representation and similarity measurement are the two most important components of a CBIR system, and researchers have studied them for more than a decade. Many different approaches have been proposed, but this remains one of the most challenging problems in ongoing CBIR research. A major reason is the gap between the low-level image pixels captured by machines and the high-level concepts perceived by humans. Viewed broadly, this is a fundamental challenge of Artificial Intelligence: how to build and train intelligent machines to perform real-world tasks as humans do. Machine learning is a promising approach to this challenge, and machine learning techniques have shown progress in recent years. Deep learning is a family of machine learning algorithms that attempt to model high-level abstractions in data through deep architectures composed of multiple non-linear transformations.
Deep learning enables complex architectures that, like the human brain, process information through many stages of transformation and representation, in contrast to conventional machine learning methods, which often use fixed structures. By exploring deep architectures that learn features at multiple levels of abstraction from data, deep learning techniques make it possible for a system to learn complex functions that map input data directly to the output, without relying on hand-crafted features based on domain knowledge.
The success of deep learning inspired us to explore deep learning techniques for CBIR tasks. Although deep learning has been studied extensively for image classification and recognition in computer vision, CBIR applications have received comparatively little attention.
The proposed method applies deep learning techniques to the CBIR task. We train a large-scale neural network to learn effective feature representations. We tested the method on the established categories of the Corel dataset (African tribes, Beaches, Buildings, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains, and Food). The results of the proposed method are compared with existing techniques, and a performance analysis is presented.
Literature Survey
In [1], the authors proposed a deep learning framework for CBIR by training large-scale deep convolutional neural networks to learn effective feature representations of images. They discussed three feature generalization schemes. In scheme I, the dataset images are fed into the pre-trained CNN model and the activation values of the last three layers are taken as features. In scheme II, instead of directly using the features extracted by the pre-trained deep model, similarity learning algorithms are used to refine the features obtained in scheme I. In scheme III, the deep convolutional neural network is retrained on the image dataset for different CBIR tasks, with the CNN model initialized from the parameters of the ImageNet-trained model. Scheme III showed the best performance of the three.
In [2], the authors proposed an efficient pre-trained convolutional neural network model. They used the LeNet-5 CNN model and conducted experiments on the Corel 1K dataset. The proposed model has an average precision of 0.79 on the Corel 1K dataset [3,4,5]. The results are compared with three traditional visual features: the hue-saturation-value (HSV) color feature, gray-level co-occurrence matrix (GLCM) features, and the scale-invariant feature transform (SIFT). Among the compared methods, the proposed model is reported to have the highest average precision, 86%.
In [6], the authors proposed the Deep Belief Network (DBN) method of deep learning for feature extraction and classification. The DBN consists of restricted Boltzmann machines (RBMs) stacked in multiple stages, each with only a single hidden layer, which makes the learning process faster. The experimental results show an accuracy of 98.6% on a small dataset of 1000 images and 96% on a large dataset (> 10,000 images), while still meeting the time-complexity requirement.
In [7], the authors proposed a CNN-SVM model in which the CNN performs feature extraction and the SVM acts as the recognizer. The first part of a CNN is the convolutional phase, which works as an extractor of image features. At its end, the convolution maps are flattened and concatenated into a feature vector called a CNN code. The SVM takes this CNN code, the output of the convolutional phase, as a new feature vector for training. The precision obtained using a pre-trained CNN with the Caltech256 database is 90% for 1000 images. Padmashree Desai et al. [8,9,10,11,12,13] discuss different methods of feature extraction using wavelets, edge operators, morphological operators, and moment invariants, with performance analysis carried out using different distance measures. Video summarization [14, 15] can be used in image and video retrieval by searching for the query image in the summarized video dataset rather than in the original dataset, which can reduce the retrieval time.
Proposed Method and Implementation
The proposed architecture, shown in Fig. 1, consists of two layers: the first layer uses a CNN for training and feature extraction, and the second layer uses an SVM for classification and image retrieval. The feature vector obtained from the CNN model is fed as input to the SVM.
The basic flow starts with a query image submitted by the user as input to the system. Features of the query image are extracted by the CNN model and stored in a vector. This query feature vector is passed to the SVM, which has already been trained on the Corel dataset. The trained SVM module calculates the distance between the features of the query image and the features of every image in the dataset. Retrieved images are displayed according to a similarity index, that is, their distance values with respect to the query image. The top 10, top 20, and so on are displayed as the result of the retrieval process.
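The ranking step of this flow can be illustrated with a minimal sketch. It assumes Euclidean distance as the similarity index and uses tiny made-up 3-dimensional vectors in place of real CNN features; the function name `rank_by_distance` and the file names are hypothetical and not taken from the original implementation.

```python
import math

def rank_by_distance(query_vec, database, top_k=10):
    """Return the top_k (image_id, distance) pairs closest to query_vec."""
    scored = [(image_id, math.dist(query_vec, vec))
              for image_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:top_k]

# Toy feature vectors standing in for CNN features (hypothetical values).
database = {
    "horse_01.jpg": [0.9, 0.1, 0.0],
    "horse_02.jpg": [0.8, 0.2, 0.1],
    "beach_01.jpg": [0.1, 0.9, 0.3],
    "bus_01.jpg": [0.0, 0.2, 0.9],
}
query = [0.9, 0.1, 0.05]

top2 = rank_by_distance(query, database, top_k=2)
for image_id, distance in top2:
    print(f"{image_id}: {distance:.3f}")
```

Changing `top_k` to 10 or 20 corresponds to the "top 10" and "top 20" displays described in the text.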
Feature Extraction
The VGG16 layered CNN model is used to extract features from the dataset. Features of a query image are extracted when the user submits it to the system. Figure 2 represents the VGG16 layered architecture. It consists of twelve convolutional layers, followed by maximum pooling layers, then four fully connected layers, and finally a softmax classifier.
The original images are of size 384 × 256 or 256 × 384; they are resized to 224 × 224 pixels and then fed to the CNN model. The data passes through the following layers of the CNN module, and the image is transformed into different sizes as features are extracted.
- First and second layers: The VGG16 CNN model takes a 224 × 224 × 3 RGB image as input. The image is passed through the first and second convolutional layers, each with 64 feature maps (filters) of size 3 × 3, same padding, and a stride of 1. The image dimensions change from 224 × 224 × 3 to 224 × 224 × 64. A maximum pooling layer with a filter size of 3 × 3 and a stride of 2 is then applied, changing the dimensions from 224 × 224 × 64 to 112 × 112 × 64.
- Third and fourth layers: The image is passed through the third and fourth convolutional layers, each with 128 feature maps, filters of size 3 × 3, and a stride of 1. A maximum pooling layer with a filter size of 3 × 3 and a stride of 2 is then applied. The output is reduced to 56 × 56 × 128.
- Fifth and sixth layers: The image is then passed through the fifth and sixth convolutional layers, each with 256 feature maps, a filter size of 3 × 3, and a stride of 1. A maximum pooling layer with a filter size of 3 × 3 and a stride of 2 is then applied, halving the spatial size again to give an output of 28 × 28 × 256.
- Seventh to twelfth layers: These consist of two sets of three convolutional layers, each set followed by a maximum pooling layer. In each set, all three convolutional layers have 512 filters of size 3 × 3 with a stride of 1. The final output is reduced to 7 × 7 × 512.
- Thirteenth layer: This final layer is a flatten layer that converts the 7 × 7 × 512 output into a vector of 25,088 features, each of size 1 × 1.
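The dimension changes listed above can be checked with a short calculation. The sketch below is illustrative only: it assumes every convolution and pooling layer uses "same" padding, so that the spatial output size is ceil(size / stride), and it takes the block layout (2, 2, 2, 3, 3 convolutional layers with 64, 128, 256, 512, 512 filters) from the walkthrough. The names `BLOCKS`, `same_padding_out`, and `trace_shapes` are hypothetical.

```python
import math

def same_padding_out(size, stride):
    """Spatial output size of a conv or pooling layer with 'same' padding."""
    return math.ceil(size / stride)

# (number of convolutional layers, number of filters) for each block,
# following the layer-by-layer description in the text.
BLOCKS = [(2, 64), (2, 128), (2, 256), (3, 512), (3, 512)]

def trace_shapes(height=224, width=224, channels=3):
    """Return the tensor shape at the input and after each pooling layer."""
    shapes = [(height, width, channels)]
    for n_convs, filters in BLOCKS:
        for _ in range(n_convs):
            # Stride-1 convolution keeps height and width, sets channel count.
            height = same_padding_out(height, 1)
            width = same_padding_out(width, 1)
            channels = filters
        # Stride-2 pooling halves height and width.
        height = same_padding_out(height, 2)
        width = same_padding_out(width, 2)
        shapes.append((height, width, channels))
    return shapes

shapes = trace_shapes()
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
for shape in shapes:
    print(" x ".join(str(d) for d in shape))
print("flattened length:", flattened)
```

Under these assumptions the trace ends at 7 × 7 × 512, and the flattened length equals 7 · 7 · 512 = 25,088, matching the figures in the text.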
ReLU Activation Function
The purpose of ReLU is to introduce non-linearity into the CNN model. ReLU is linear for all positive values and zero for all negative values: a node is activated only if its input is above zero, and when the input is below zero the output is zero. The node value is determined as A(x) = max(0, x).
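As a small illustration of A(x) = max(0, x), the following sketch applies ReLU to a handful of sample values. The input values are arbitrary and chosen only to show the behavior on both sides of zero.

```python
def relu(x):
    """Rectified linear unit: passes positive inputs through, clips the rest to zero."""
    return max(0.0, x)

inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
activations = [relu(v) for v in inputs]
print(activations)
```

Negative inputs and zero map to zero, while positive inputs are returned unchanged.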
The CNN model consisting of the above layers is compiled using model.compile(). The image array is then passed to model.predict() to obtain the feature vector output by the flattening layer.
Image Classification and Retrieval
Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. The obtained feature vector is given as input to the SVM, which calculates the distance between each image of the dataset and the query image. An SVM is a binary classifier, so multi-class classification is handled with the one-versus-one method. This method constructs n(n − 1)/2 binary SVMs, one classifier Cij for every pair of distinct classes, i.e. the ith class and the jth class, where i ≠ j, i = 1, …, n, j = 1, …, n. Each classifier Cij is trained with the samples of the ith class as positive examples and the samples of the jth class as negative examples. Each SVM is trained according to the labels attached to the feature vectors.
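To make the one-versus-one construction concrete, the sketch below enumerates the class pairs and builds the positive/negative training labels for one pairwise classifier. It covers only the bookkeeping, not SVM training itself; the helper names `one_vs_one_pairs` and `binary_training_set` and the sample feature values are hypothetical.

```python
from itertools import combinations

# The ten Corel categories named in the text.
CLASSES = ["African tribes", "Beaches", "Buildings", "Buses", "Dinosaurs",
           "Elephants", "Flowers", "Horses", "Mountains", "Food"]

def one_vs_one_pairs(classes):
    """One (class_i, class_j) pair per binary classifier C_ij."""
    return list(combinations(classes, 2))

def binary_training_set(samples, positive, negative):
    """Keep only samples of the two classes and label them +1 / -1."""
    return [(features, 1 if label == positive else -1)
            for features, label in samples
            if label in (positive, negative)]

pairs = one_vs_one_pairs(CLASSES)
print(len(pairs), "binary classifiers")

# Hypothetical one-dimensional features, used only to show the labeling.
samples = [([0.1], "Horses"), ([0.2], "Buses"), ([0.3], "Food")]
horses_vs_buses = binary_training_set(samples, "Horses", "Buses")
print(horses_vs_buses)
```

For the ten Corel categories this gives 10 × 9 / 2 = 45 binary classifiers.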
Classification of images is done using a linear approach. Consider a set T of t training feature vectors x_i ∈ R^D, i = 1, …, t, with corresponding class labels y_i. For a binary classifier with labels y_i ∈ {−1, +1}, the separating hyperplane satisfies y_i (w^T x_i + b) − 1 ≥ 0, where w is the normal vector of the hyperplane and b is the bias term, which sets the perpendicular distance between the hyperplane and the origin. In this case, the SVM is trained on shape and color features.
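The role of w and b can be illustrated with a small check of the standard hard-margin condition y_i (w^T x_i + b) − 1 ≥ 0 for a single binary classifier with labels ±1. This is a sketch under that assumption; the weight vector, bias, and sample points are made up for illustration and are not results from the paper.

```python
def decision_value(w, x, b):
    """Linear decision function w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, b, samples):
    """True if every (x, y) pair satisfies y * (w^T x + b) - 1 >= 0."""
    return all(y * decision_value(w, x, b) - 1 >= 0 for x, y in samples)

# Hypothetical hyperplane and two-dimensional training points.
w, b = [1.0, -1.0], 0.0
separable = [([2.0, 0.0], 1), ([3.0, 1.0], 1),
             ([0.0, 2.0], -1), ([1.0, 3.0], -1)]
margin_ok = satisfies_margin(w, b, separable)

# Adding a positive point that lies inside the margin breaks the condition.
margin_violated = satisfies_margin(w, b, separable + [([1.0, 0.5], 1)])
print(margin_ok, margin_violated)
```

The sign of `decision_value` gives the predicted side of the hyperplane, which is what each pairwise classifier contributes during voting.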
Testing
1. Select a query image, x.jpg, from the dataset.
2. Find the class label to which the query image belongs.
3. Perform classification of the images.
4. Perform binary predictions for the selected query image.
5. Retrieve the first n images similar to the input image from class x using voting.
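The testing steps above can be sketched as follows. Two simplifications are assumed and should not be read as the authors' implementation: each pairwise decision is made by a nearest-centroid rule standing in for a trained binary SVM, and images within the winning class are ordered by Euclidean distance to the query. All names and feature values are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def vote_class(query, centroids):
    """Run one pairwise decision per class pair; return the class with most votes."""
    votes = Counter()
    for a, b in combinations(sorted(centroids), 2):
        # Stand-in for binary classifier C_ab: the closer centroid wins the vote.
        closer_to_a = math.dist(query, centroids[a]) <= math.dist(query, centroids[b])
        votes[a if closer_to_a else b] += 1
    return votes.most_common(1)[0][0]

def retrieve(query, database, centroids, n=10):
    """Predict the query class by voting, then return the n nearest images of that class."""
    predicted = vote_class(query, centroids)
    candidates = [(name, vec) for name, (vec, label) in database.items()
                  if label == predicted]
    candidates.sort(key=lambda item: math.dist(query, item[1]))
    return predicted, [name for name, _ in candidates[:n]]

# Toy two-dimensional features (hypothetical values).
centroids = {"Horses": [1.0, 0.0], "Buses": [0.0, 1.0], "Flowers": [1.0, 1.0]}
database = {
    "h1.jpg": ([0.9, 0.1], "Horses"),
    "h2.jpg": ([0.7, 0.0], "Horses"),
    "b1.jpg": ([0.1, 0.9], "Buses"),
    "f1.jpg": ([0.9, 1.0], "Flowers"),
}
predicted, results = retrieve([0.95, 0.05], database, centroids, n=2)
print(predicted, results)
```

Restricting the search to the voted class means only that class's images need to be ranked, rather than the whole collection.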
Results and Discussion
The query set consists of 10 randomly chosen images, the first image from each category. Figure 3 shows the query set of the different categories used in the proposed work. Figure 4 shows the top 10 images retrieved for the class Africa. In the confusion matrix of Fig. 5, the class with the highest accuracy is Horses, with 96% and a true positive count of 48, and the class with the lowest accuracy is Mountains, with 62% and a true positive count of 31.
The experimental results in Fig. 6 show the top 10 images retrieved. Results of the experiments on the 10 classes of the Corel 1K dataset are given in Table 1, which lists the precision of each class when 10, 15, and 20 images are retrieved from the database.
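Precision at 10, 15, and 20 retrieved images can be computed as the fraction of retrieved images that share the query's class. The sketch below uses a made-up list of retrieved labels purely for illustration; it does not reproduce the values in Table 1, and the function name `precision_at_k` is hypothetical.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the first k retrieved images that belong to the query's class."""
    top = retrieved_labels[:k]
    if not top:
        raise ValueError("no retrieved images to evaluate")
    return sum(1 for label in top if label == query_label) / len(top)

# Hypothetical ranked labels for one Elephants query (20 retrieved images).
retrieved = (["Elephants"] * 8 + ["Horses", "Mountains"]
             + ["Elephants"] * 7 + ["Horses"] * 3)

p10 = precision_at_k(retrieved, "Elephants", 10)
p15 = precision_at_k(retrieved, "Elephants", 15)
p20 = precision_at_k(retrieved, "Elephants", 20)
print(p10, p15, p20)
```

Averaging these values over the queries of a class gives a per-class precision of the kind reported for each retrieval depth.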
The precision of the class Elephant when 10 similar images are retrieved is 83.86%. Figure 7 shows the confusion matrix: the class with the highest accuracy is Dinosaurs, with 100% and a true positive count of 50, and the class with the lowest accuracy is Mountains, with 52% and a true positive count of 26.
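The per-class accuracy read from a confusion matrix is the diagonal entry divided by the row total. The sketch below assumes 50 test images per class, which is consistent with the reported pairs (50 true positives for 100%, 26 for 52%) but is not stated explicitly in the text. The off-diagonal counts and the Beaches row are invented for illustration.

```python
def per_class_accuracy(matrix, labels):
    """Percentage of each class's samples (one row per class) on the diagonal."""
    return {label: 100 * matrix[i][i] / sum(matrix[i])
            for i, label in enumerate(labels)}

labels = ["Dinosaurs", "Mountains", "Beaches"]
# Rows are true classes, columns are predicted classes.
# Diagonal counts for Dinosaurs and Mountains follow the text; the rest is hypothetical.
confusion = [
    [50, 0, 0],
    [0, 26, 24],
    [0, 5, 45],
]

accuracy = per_class_accuracy(confusion, labels)
for label, value in accuracy.items():
    print(f"{label}: {value:.0f}%")
```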
Table 2 compares the average precision of previously implemented CBIR models, namely HSV (hue, saturation, value), GLCM (gray-level co-occurrence matrix), SIFT (scale-invariant feature transform), and CNN, with that of the CNN-SVM model. The average precision of the proposed CNN-SVM model is higher than that of each of the HSV, GLCM, SIFT, and CNN models.
Conclusion
The proposed Content-Based Image Retrieval system, which uses a CNN for feature extraction and an SVM for classification, provided an average efficiency of 83.5%. The use of the SVM helped to reduce the time required to retrieve the results. The experimental results were compared with previously proposed models: HSV (hue, saturation, value), GLCM (gray-level co-occurrence matrix), SIFT (scale-invariant feature transform), and CNN. The average precision of the proposed system is higher than that of the existing models, which makes the approach effective and promising.
References
Wan J, Wang D, Hoi SCH, Wu P, Zhu J, Zhang Y, Li J. Deep learning for content-based image retrieval: a comprehensive study. In: ACM international conference on multimedia. 2014.
Huang W, Qiang W. Image retrieval algorithm based on convolutional neural network. In: Selected paper from Common Sense Media Awards. 2017.
Wang J, Li J, Wiederhold G. Simplicity: semantics-sensitive integrated matching for picture libraries. IEEE Trans Pattern Anal Mach Intell. 2001;23(9):947–63.
Chen Y, Wang JZ, Li J. FIRM: fuzzily integrated region matching for content-based image retrieval. In: Proceedings of the ninth ACM international conference on multimedia. ACM. 2001. p. 543–45.
Saritha RR, Paul V, Ganesh Kumar P. Content based image retrieval using deep learning process. Cluster Comput. 2019;22(2):4187–200.
Mohamed O, El Asnaoui K, Mohammed O, Brahim A. Content-based image retrieval using convolutional neural networks. In: Lecture Notes in Real-Time Intelligent Systems. 2019.
http://wang.ist.psu.edu/IMAGE. Accessed Jan 2001.
Desai P, Pujari J, Goudar RH. Image retrieval using wavelet based shape features. J Inform Syst Commun (JISC). 2012;3:1162–166. http://www.bioinfo.in/contents.php?id=45.
Desai P, Pujari J, Parwatikar S. Image retrieval using shape feature: a study. In: International conference on computational intelligence and information technology (CIIT 2011), ACEEE, CCIS 250. Berlin: Springer; 2011. p. 817–21.
Desai P, Pujari J, Ayachit NH, Kamakshi Prasad V. Content based image retrieval using hexagonal resampling and detection of ailments in MRI scans of Brain. In: Third international conference on computational intelligence and information technology, CIIT 2013 ACEEE. Elsevier. 2013.
Desai P, Pujari J, Kinnikar A. Performance evaluation of image retrieval systems using shape feature based on wavelet transform. In: IEEE second international conference on cognitive computing and information processing CCIP 2016, India. IEEE. 2016. p. 1–5. https://doi.org/10.1109/CCIP.2016.7802876.
Desai P, Pujari J, Kinnikar A. An image retrieval using combined approach wavelets and local binary pattern. In: International conference on informatics and analytics (ICIA-16), Aug 25th and 26th 2016, Department of computer science and engineering, Pondicherry engineering college, India. ACM digital library within its international conference proceedings series. 2016. https://doi.org/10.1145/2980258.2980404.
Desai P, Pujari J, Ayachit NH, Kamakshi Prasad V. Classification of archaeological monuments for different art forms with an application to CBIR IEEE. In: International conference on advances in computing, communications and informatics (ICACCI-2013). 2013. p. 1108–12. https://doi.org/10.1109/ICACCI.2013.6637332.
Sujatha C, Chivate AR, Tabib RA, Mudenagudi U. Multilevel framework for summarization of surveillance videos. In: International conference on signal and image processing (ICSIP). 2014. p. 265–70.
Sujatha C, Mudenagudi U. Gaussian mixture model for summarization of surveillance videos. In: National conference on computer vision, pattern recognition, image processing and graphics (NCVPRIPG). 2015. p. 1–4.
Ethics declarations
Conflict of Interest
There is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Data Science and Communication” guest edited by Kamesh Namudri, Naveen Chilamkurti, Sushma S J and S. Padmashree.
Cite this article
Desai, P., Pujari, J., Sujatha, C. et al. Hybrid Approach for Content-Based Image Retrieval using VGG16 Layered Architecture and SVM: An Application of Deep Learning. SN COMPUT. SCI. 2, 170 (2021). https://doi.org/10.1007/s42979-021-00529-4
DOI: https://doi.org/10.1007/s42979-021-00529-4