
1 Introduction

Content-based image retrieval (CBIR) is an advancing trend in computer vision. CBIR is used not only to retrieve relevant images from large-scale databases, but also to recognize objects in medical picture archiving and communication systems (PACS), biomedical applications, satellite image retrieval systems, and many other applications. CBIR systems have achieved sophisticated accuracy using feature extraction based on color, shape, and texture. However, CBIR lags behind in efficient storage and processing paradigms as the use of smart devices grows, since CBIR systems have to adapt to commodity hardware [1]. Processing multimedia data is a tedious task in terms of speed and accuracy. This paper uses Hadoop, the open-source distributed and parallel framework released by the Apache Software Foundation. To achieve a more reliable and faster response, Spark is used on top of Hadoop. To achieve accurate retrieval results, the Canny edge algorithm and Haralick textures are used in the spatial domain.

This paper is organized as follows: Sect. 2 describes the Big Data processing framework, Sect. 3 describes the related work done so far in this application area, Sect. 4 describes the proposed methodology, Sect. 5 describes the experimental results and discussion, and Sect. 6 concludes the work.

2 Big Data Processing Framework

Multimedia data is growing rapidly through social media (Flickr—75 million public images per day, Instagram—60 million images per day, Facebook—136,000 images per minute, Google—57,988 queries per second). Storing these huge amounts of data is a very complex task with traditional systems. To overcome this issue, Google published a white paper on the Google File System, a distributed block-level file system. The data is decentralized across several nodes, and each node holds numerous blocks [2]. The block size can be changed according to user requirements. Later, Google published another white paper on MapReduce, a parallel processing paradigm that introduced data locality: instead of transferring the data across the network, the code is sent to the distributed blocks residing on the several nodes. These two approaches were combined and released as an open-source framework named Apache Hadoop. Hadoop is not suitable for a large number of small files. Hadoop is designed as a block-oriented file system with a default block size of 128 MB [3]. When a file is smaller than the block size, a separate block is created for that file and the remaining block capacity is not used by it. MapReduce divides processing into two tasks, Map and Reduce, where Map tasks are created based on the number of blocks that contain the data. So if the system contains more blocks, more Map tasks have to be initiated by the Hadoop master, which creates a bottleneck in the Hadoop cluster. Hadoop introduced the sequence file format to process many small files as a single large file [4]. Data sharing is not very fast in Hadoop because of the replication and serialization processes [5].

To address this problem, Spark is used on top of Hadoop for processing. It is designed for fast cluster computation. Spark executes on top of Hadoop and extends the MapReduce model to interactive and streaming processing [6]. Spark was developed in 2009 and is now maintained by the Apache Software Foundation. The main feature of Spark is in-memory cluster computation. Spark supports not only MapReduce-style workloads but also SQL queries, machine learning, and more. Spark represents data as Resilient Distributed Datasets (RDDs). RDDs are divided into logical partitions that can be processed on several nodes. An RDD is a read-only collection of records that can be created only through deterministic operations. RDDs are fault-tolerant and can reuse intermediate results kept in memory, which accelerates data sharing.
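As a brief illustration of RDD-based processing, the following PySpark sketch creates an RDD, applies a transformation, and caches the intermediate result so a second action can reuse it without recomputation; the data and names are illustrative, not part of the proposed system.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-caching-sketch")

# Create an RDD from an in-memory collection and split it into partitions.
numbers = sc.parallelize(range(1000000), numSlices=8)

# A deterministic transformation; nothing is computed yet (lazy evaluation).
squares = numbers.map(lambda x: x * x)

# Keep the intermediate RDD in memory so later actions reuse it.
squares.cache()

print(squares.sum())    # first action triggers computation and caching
print(squares.take(5))  # second action reuses the cached partitions

sc.stop()
```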

The image database was collected from the Signal Processing Laboratory (SPLab) Web server. Ultrasound images were collected from the online Web server and stored in HDFS [7, 8]. Feature extraction and similarity measurement were implemented in Python with the OpenCV library. The experiment was run on Spark integrated with a Hadoop-2.6.0 single-node cluster. The proposed architecture is shown in Fig. 1.
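A minimal sketch of how such images could be read from HDFS into Spark and decoded with OpenCV is shown below; the HDFS path is a placeholder, and it is assumed that OpenCV is available on the worker nodes.

```python
import numpy as np
import cv2
from pyspark import SparkContext

sc = SparkContext(appName="hdfs-image-loading-sketch")

# Read every file under the (placeholder) HDFS directory as (path, bytes) pairs.
raw_images = sc.binaryFiles("hdfs://localhost:9000/user/cbir/ultrasound/")

def decode_gray(record):
    path, data = record
    # Decode the raw bytes into a grayscale OpenCV image.
    img = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_GRAYSCALE)
    return path, img

images = raw_images.map(decode_gray).filter(lambda kv: kv[1] is not None)
print(images.count())

sc.stop()
```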

2.1 Architecture

See Fig. 1.

Fig. 1 Architecture of the proposed system

3 Related Work

CBIR has evolved with increasingly robust feature extraction algorithms. Color histogram-based features are the primary and easiest way to represent an image. The color maps can be represented as binary, grayscale, RGB, HSV, etc.; the color histogram is extracted from the intensity levels of the pixels in an image. The distribution of intensities can be represented mathematically as follows:

$$ h_{c} = \frac{1}{M \times N}\sum_{i = 0}^{M - 1}\sum_{j = 0}^{N - 1}\delta\left(f_{ij} - c\right),\quad \forall c \in C, \qquad \text{where } \delta(x) = \begin{cases} 1, & \text{if } x = 0 \\ 0, & \text{if } x \neq 0 \end{cases} $$
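For intuition, this histogram can be computed with a few lines of OpenCV/NumPy; the 256-bin grayscale setting and the file path below are illustrative assumptions.

```python
import cv2
import numpy as np

# Load a grayscale image (path is a placeholder).
img = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)

# hist[c] counts pixels with intensity c; dividing by M*N gives the
# distribution h_c defined above.
hist = cv2.calcHist([img], [0], None, [256], [0, 256]).ravel() / img.size

# Equivalent NumPy formulation.
hist_np = np.bincount(img.ravel(), minlength=256) / img.size
assert np.allclose(hist, hist_np)
```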

The color histogram features are computed, and the similarity between vectors is measured using distance metrics [9]. Color features alone cannot retrieve the exact required image because of high variations in intensity levels, so texture-based features were introduced by Haralick in 1973. The gray-level co-occurrence matrix (GLCM) is computed from the image, and statistical features of the GLCM are extracted as the texture of the image [10]. The important statistical features that can be extracted from the GLCM are shown below:

$$ {\text{Energy}} = \sqrt{\sum_{i = 0}^{N - 1}\sum_{j = 0}^{N - 1} M^{2}(i,j)} $$

In the above formula, i and j index the entries of the co-occurrence matrix M. Energy measures the extent of pixel-pair repetition; it becomes large when the image pixels are highly uniform.

$$ {\text{Entropy}} = -\sum_{i = 0}^{N - 1}\sum_{j = 0}^{N - 1} M(i,j)\,\ln\left(M(i,j)\right) $$

Entropy is used to characterize the texture of an image; it measures the randomness of the input image. Entropy reaches its maximum when all the co-occurrence matrix values are equal.

$$ {\text{Contrast}} = \sum\limits_{i = 0}^{N - 1} {\sum\limits_{j = 0}^{N - 1} {(i - j)^{2} M(i,j)} } $$

Contrast measures the intensity difference between a pixel and its neighbor over the whole image. Contrast is the difference in color and brightness that makes an object distinguishable in visual perception [11].
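A small NumPy sketch of these three statistics, assuming a normalized co-occurrence matrix M is already available, is given below; the toy matrix is purely illustrative.

```python
import numpy as np

def glcm_statistics(M):
    """Energy, entropy, and contrast of a normalized GLCM M (entries sum to 1)."""
    N = M.shape[0]
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")

    energy = np.sqrt(np.sum(M ** 2))
    # Only non-zero entries contribute; this also avoids log(0).
    nz = M > 0
    entropy = -np.sum(M[nz] * np.log(M[nz]))
    contrast = np.sum((i - j) ** 2 * M)
    return energy, entropy, contrast

# Toy 4-level GLCM for illustration.
M = np.array([[0.2, 0.1, 0.0, 0.0],
              [0.1, 0.3, 0.1, 0.0],
              [0.0, 0.1, 0.1, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
print(glcm_statistics(M))
```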

Many applications have shown experimentally that hybrid features give higher accuracy. However, while this accuracy may be satisfactory in a traditional approach, processing real-time, large-scale datasets is too complex for existing CBIR systems [9]. Distributed and parallelized CBIR systems have been developed and shown to have much lower response times than traditional systems [12].

4 Methodology

The proposed methodology uses edge and texture features. The workflow of the proposed methodology is shown in Fig. 2. The Canny algorithm is used to extract the edge features. Canny uses the first-order derivatives of the image. To remove noise, Canny first applies a Gaussian filter, and the gradients are then extracted by applying first-order derivatives [13]. Intensity thresholding is used to remove false edges.

Fig. 2 Workflow of the system

4.1 Canny Edge Descriptor Algorithm

1. Step 1. Apply a Gaussian filter to remove noise and smooth the image.

2. Step 2. Find the intensity gradients of the image.

3. Step 3. Apply non-maximum suppression to remove spurious responses to edge detection.

4. Step 4. Apply a double threshold to determine the potential edges.

5. Step 5. Finalize the detection by suppressing weak edges that are not connected to strong edges (hysteresis).
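These steps map directly onto OpenCV's Canny detector. The sketch below shows how an edge map could be extracted; the blur kernel size, the thresholds 50/150, and the final profile-based summary are illustrative assumptions, not the paper's tuned parameters or descriptor.

```python
import cv2

# Load the query image in grayscale (path is a placeholder).
img = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)

# Step 1: Gaussian smoothing to suppress noise.
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)

# Steps 2-5: gradient computation, non-maximum suppression,
# double thresholding, and hysteresis are handled by cv2.Canny.
edges = cv2.Canny(blurred, 50, 150)

# One illustrative way to summarize the edge map into a feature vector:
# the fraction of edge pixels per image row and per image column.
row_profile = edges.mean(axis=1) / 255.0
col_profile = edges.mean(axis=0) / 255.0
```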

4.2 Workflow

See Fig. 2.

4.3 Haralick Texture Features

Haralick analysis is one of several techniques used to derive "texture" data. Haralick descriptors extract texture features, which are computed from gray-level co-occurrence matrices (GLCMs). A GLCM is a 2D histogram that captures the co-occurrences of pairs of pixel intensities at a certain offset within the region from which texture is calculated. Features such as energy, homogeneity, contrast, and dissimilarity are extracted from the GLCM [10]. The GLCM can be constructed in three directions (horizontal, vertical, and diagonal), as shown in Fig. 3.
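A sketch of constructing GLCMs at several directions and reading off Haralick-style properties is shown below; it assumes scikit-image (version 0.19 or later, where the functions are named graycomatrix/graycoprops) rather than the paper's exact implementation, and the input patch is a toy example.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

# Toy 8-level image patch for illustration.
patch = np.random.randint(0, 8, size=(64, 64), dtype=np.uint8)

# Distance 1, in horizontal (0), diagonal (pi/4), and vertical (pi/2) directions.
glcm = graycomatrix(patch,
                    distances=[1],
                    angles=[0, np.pi / 4, np.pi / 2],
                    levels=8,
                    symmetric=True,
                    normed=True)

# Statistical descriptors per direction.
features = {prop: graycoprops(glcm, prop).ravel()
            for prop in ("energy", "homogeneity", "contrast", "dissimilarity")}
print(features)
```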

5 Results and Discussion

Figures 4 and 5 show the retrieval results of the proposed system. Figure 4 shows retrieval on the MedPix dataset, which contains scan images of various body parts. Figure 5 shows retrieval on the SPLab dataset, which contains ultrasound images. The performance of the retrieval system is mathematically defined as follows [14]:

$$ {\text{Precision}} = \frac{{{\text{Number}}\,{\text{of}}\,{\text{relevant}}\,{\text{images}}\,{\text{retrieved}}}}{{{\text{Total}}\,{\text{Number}}\,{\text{of}}\,{\text{Retrieved}}\,{\text{Images}}}} $$
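For example, if 15 of the 20 images retrieved for a query are relevant, the precision is 15/20 = 0.75; these counts are purely illustrative.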
Fig. 3 GLCM directions

Fig. 4 Result for dataset1

The proposed retrieval system obtained 75% precision on the input datasets. Another performance measure considered is speed. The speed of the system is compared between single-node and multi-node clusters for varying dataset sizes. The performance graphs are shown below.

Figures 6 and 7 show the relationship between execution time and data size on the single-node and multi-node clusters, respectively. The X-axis denotes the time taken to execute the task in milliseconds, and the Y-axis denotes the number of images in the dataset. The experimental results show that the proposed methodology obtains better precision and is faster than existing systems.

Fig. 5 Result for dataset2

Fig. 6 Time graph for single-node cluster

Fig. 7 Time graph for multi-node cluster

6 Conclusion

Image retrieval is a growing trend, and many technologies are emerging for this purpose. Features such as mean, standard deviation, and variance are extracted from LBPs. Canny edge detection is used for detecting edges, and gray-level co-occurrence matrices (GLCMs) are used to extract texture features. The Mahalanobis distance is used to measure the similarity between the query image and the database images. Spark performs well for the entire system. Thus, we are able to successfully retrieve the images related to the query image from the database.
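As a closing illustration of the similarity step, a minimal sketch of ranking database feature vectors by Mahalanobis distance to a query vector is given below, using SciPy; the feature dimensionality and the random data are placeholders, not the paper's actual features.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Placeholder feature matrix: one row per database image, one column per feature.
db_features = np.random.rand(500, 6)
query = np.random.rand(6)

# Inverse covariance of the database features, required by the Mahalanobis distance.
VI = np.linalg.inv(np.cov(db_features, rowvar=False))

# Rank database images by increasing distance to the query.
distances = np.array([mahalanobis(query, row, VI) for row in db_features])
top_matches = np.argsort(distances)[:10]
print(top_matches)
```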