1 Introduction

Nowadays, the exponential increase in the usage of digital cameras and mobile phones has made image datasets gigantic. Maintaining such large image datasets is an extremely tedious and troublesome job, so efficient techniques are required to retrieve desired images from them. One effective solution to this retrieval problem is Content Based Image Retrieval (CBIR). The term “content” signifies that images are retrieved based on features computed from the actual content of the images. The retrieval process depends on the similarity between the query image and all the images of the dataset; comparing feature vectors is one possible way to measure this similarity. Features of an image can be classified as color features, texture features and shape features.

A color model can be defined as a coordinate system in which each point uniquely describes a color. One of the most widely used color models for CBIR is RGB. The RGB model has the limitation that the color information in the R, G and B channels is highly correlated, so it is not directly informative for texture and shape features. The HSI (Hue, Saturation, Intensity) model, in contrast, describes the dominant color, its purity and its brightness, and is therefore used in our proposed method. In the HSI model, hue represents the dominant color and saturation represents the degree to which a pure color is diluted by white light, whereas intensity represents the brightness of a pixel. Hue and saturation together give the complete color description; the HSI model thus decouples the color-carrying information from the intensity component of an image, and intensity, as the value of a color, is related to texture. Fadaei et al. [1] proposed a CBIR scheme in which a uniform partitioning scheme is applied in the HSI color model to calculate a Dominant Color Descriptor (DCD). Various curvelet and wavelet features are used as texture features to subdue the problems of image translation and image noise. Color and texture features are often concatenated to improve the performance of CBIR.
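For concreteness, the sketch below converts an RGB image to HSI using the standard formulas; the paper itself does not spell out the conversion, so treat this as an illustrative implementation rather than the authors' exact code.

```python
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an RGB image (floats in [0, 1], shape HxWx3) to HSI.

    A minimal sketch of the standard RGB-to-HSI formulas; quantization
    settings and edge-case handling are illustrative assumptions.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-8                                   # guard against division by zero
    i = (r + g + b) / 3.0                        # intensity: channel average
    s = 1.0 - np.minimum(np.minimum(r, g), b) / (i + eps)   # saturation
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + eps
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    # hue in radians, [0, 2*pi); ill-defined (but harmless) for grey pixels
    h = np.where(b > g, 2 * np.pi - theta, theta)
    return np.stack([h, s, i], axis=-1)
```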

1.1 Content based image retrieval

The name Content Based Image Retrieval (CBIR) implies that images are retrieved based on their contents, so color, texture and shape information are chosen as image features. To produce color features, extraction procedures such as the color histogram [2], color correlogram [2], color autocorrelogram [3, 4] and inter-channel voting between hue and saturation [5] can be applied to an image. For texture feature extraction, LBP [6], ULBP [6], CS_LBP [7], LEP [8], LDP [9] and LTrP [10] can be used. One more texture feature descriptor is the GLCM, which reveals knowledge about pixel-pair co-occurrences in the image [11, 12]. For extracting shape information, HOG [13], angular and binary angular patterns [14], the wavelet Fourier descriptor [15], the convex hull [16], etc. can be used. In [17,18,19], different image retrieval methods are covered and discussed.

1.1.1 Local binary patterns

One of the most effective texture features used in CBIR is Local Binary Patterns (LBP), proposed by Ojala et al. [6]. For each pixel p it considers the 8-neighbourhood N8(p), which yields an 8-bit pattern; the process of obtaining the pattern is shown in Fig. 1. Each value in N8(p) is compared with the value of ‘p’ and produces 0 if it is ≤ p and 1 if it is > p. The resulting 8-bit pattern of each pixel in the image is then converted into its decimal equivalent, which is the final result for that pixel. The histogram of these decimal numbers, of length 256, is the feature vector of the image. Note that when converting the binary pattern to a decimal number, the place values may start at any of the eight bit positions, but the same ordering must be used for the entire image.

Fig. 1

LBP calculation. (a) Original image. (b) Resultant binary pattern for (a). (c) Decimal equivalent of (b)
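A minimal sketch of this computation follows; the neighbour ordering is an arbitrary but fixed convention, exactly as the text requires.

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 3x3 LBP: compare each pixel's 8 neighbours against the centre
    (1 if neighbour > centre, else 0) and read the bits in a fixed order.
    A minimal sketch; only consistency of the bit order across the image
    matters, as noted in the text.
    """
    # offsets of the 8 neighbours, clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = gray.shape
    centre = gray[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = gray[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
        codes += (neigh > centre).astype(np.int32) << bit
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist  # 256-bin LBP feature vector
```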

1.1.2 Uniform local binary patterns

Uniform Local Binary Patterns (ULBP) are used to reduce the feature vector length produced by LBP, and also because in many image datasets most of the patterns fall into the 59 bins given in [6]. In ULBP, the number of bins, and hence the length of the feature vector, is reduced from 256 to 59 based on the number of bitwise transitions in the binary patterns. The decimal numbers given in Fig. 2 have at most 2 transitions; for all remaining decimal numbers in the range 0 to 255 the number of transitions is 4, 6 or 8. Each of the 58 decimal numbers with at most 2 transitions keeps its own bin, resulting in 58 bins, and all remaining numbers are placed together in the 59th bin. From Fig. 1, the LBP code of the centre pixel (45) is 206; applying ULBP, since 206 is not in the list of numbers in Fig. 2, it falls into the 59th bin.

Fig. 2

ULBP 59 bins. The decimal numbers with at most 2 transitions are shown
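A small sketch of the ULBP bin mapping follows; counting circular bit transitions reproduces the 58 uniform codes of Fig. 2 and confirms that code 206 lands in the 59th bin.

```python
import numpy as np

def transitions(code):
    """Number of 0/1 transitions in the circular 8-bit pattern."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

# the 58 uniform codes (at most 2 transitions) each keep their own bin;
# every other code shares the 59th bin, as described in the text
uniform_codes = [c for c in range(256) if transitions(c) <= 2]
bin_of = {c: i for i, c in enumerate(uniform_codes)}

def ulbp_bin(code):
    return bin_of.get(code, 58)   # e.g. ulbp_bin(206) == 58 (non-uniform)

def ulbp_histogram(lbp_codes):
    hist = np.zeros(59, dtype=np.int64)
    for code in np.ravel(lbp_codes):
        hist[ulbp_bin(code)] += 1
    return hist
```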

1.1.3 Color histogram

One of the most basic color features is the color histogram, which provides the color frequency information in a particular color model [2]. The image is first decomposed into its color components; a separate histogram is then obtained for each component, and the resulting histograms are concatenated to obtain the feature vector. To reduce the length of the feature vector, the pixels of each color component can be quantized into bins before the histogram is computed.
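A minimal sketch of this procedure for an 8-bit image follows; the bin count is an illustrative choice, not a value fixed by the paper.

```python
import numpy as np

def quantized_color_histogram(img, bins_per_channel=16):
    """Per-channel quantized histogram, concatenated into one feature vector.

    For an 8-bit, 3-channel image this gives a vector of length
    3 * bins_per_channel, instead of 3 * 256 without quantization.
    """
    feats = []
    for c in range(img.shape[-1]):
        # map pixel values 0..255 into bins_per_channel equal-width bins
        q = (img[..., c].astype(np.int64) * bins_per_channel) // 256
        feats.append(np.bincount(q.ravel(), minlength=bins_per_channel))
    return np.concatenate(feats)
```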

1.1.4 Color correlogram and color autocorrelogram

The Color Correlogram was proposed by Huang et al. [3]. This feature gives not only the frequency of each color but also the co-occurrence of pixel pairs at a specified distance k. The distance is measured with the D8 distance, defined in Eq. (1):

$$D_{8} \left( {p,\;q} \right) = \max \left( {\left| {p_{x} - q_{x} } \right|,\;\left| {p_{y} - q_{y} } \right|} \right)$$
(1)

where p and q are two pixels of the image at locations (px, py) and (qx, qy) respectively.

The Color Correlogram can be expressed as a matrix Ck of size N × N, in which each cell value Ck(i, j) is the joint probability of occurrence of a pixel pair (i, j) separated by the specified distance k. It can be calculated using Eq. (2):

$$C^{k} \left( {i,\;j} \right) = \frac{1}{{N_{i} \times 8k}}\sum\limits_{m = 1}^{M} {\sum\limits_{n = 1}^{N} {\left( {1 - \left\lceil {\frac{1}{2}\frac{{\left| {I\left( {m,\;n} \right) - i} \right|}}{{L_{\max } }} + \frac{1}{2}\frac{{\left| {I\left( {m + \Delta x,\;n + \Delta y} \right) - j} \right|}}{{L_{\max } }}} \right\rceil } \right)} }$$
(2)

\(\forall \Delta x,\;\Delta y \in \left\{ {0,\;1,\;2,\; \ldots ,\;k} \right\}\) with \(\max \left( {\Delta x,\;\Delta y} \right) = k\), and \(\forall i,\;j \in \left\{ {0,\;1,\;2,\; \ldots ,\;L_{\max } } \right\}\), where \(N_{i}\) is the histogram count of color ‘i’, given by Eq. (3):

$$N_{i} = \sum\limits_{m\; = \;1}^{M} {\;\sum\limits_{n\; = \;1}^{N} {1 - \;\left\lceil {\frac{{\left| {I\left( {m,\;n} \right) - i} \right|}}{{L_{\max } }}} \right\rceil } }$$
(3)

The Color Correlogram gives the joint probability of occurrence of all possible pixel-level pairs, which results in a feature vector of length N × N. To reduce the feature vector length, the Color Autocorrelogram was proposed [3] and is also discussed in [4]. This color feature considers only co-occurrences of the same color, resulting in a feature vector αk of length N consisting of the diagonal values of the correlogram matrix Ck. It can be calculated using Eq. (4). Figure 3 shows an original image and its autocorrelogram.

$$\alpha^{k} \left( i \right) = C^{k} \left( {i,\;i} \right),\quad \forall i \in \left\{ {0,\;1,\;2,\; \ldots ,\;L_{\max } } \right\}$$
(4)
Fig. 3

Color autocorrelogram calculation. (a) Original image. (b) Color autocorrelogram of (a)
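Below is a direct (unoptimized) sketch of the autocorrelogram of Eqs. (2)-(4) for one quantized channel; the quantization level and distance k are illustrative choices.

```python
import numpy as np

def autocorrelogram(channel, levels=64, k=1):
    """Colour autocorrelogram of a single 8-bit channel, i.e. Eq. (2)
    restricted to the diagonal i == j: for each level i, the fraction of
    neighbours at D8 distance exactly k that share level i.
    """
    q = (channel.astype(np.int64) * levels) // 256     # quantize to `levels`
    h, w = q.shape
    hist = np.bincount(q.ravel(), minlength=levels).astype(float)  # N_i
    # all offsets whose D8 (Chebyshev) distance is exactly k: 8k of them
    offsets = [(dy, dx) for dy in range(-k, k + 1) for dx in range(-k, k + 1)
               if max(abs(dy), abs(dx)) == k]
    counts = np.zeros(levels)
    for y in range(h):
        for x in range(w):
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and q[ny, nx] == q[y, x]:
                    counts[q[y, x]] += 1
    return counts / (np.maximum(hist, 1) * 8 * k)      # Eq. (2) normalization
```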

1.1.5 Inter-channel voting

The two color feature extraction methods discussed above consider each color channel separately. Suresh et al. [20] proposed a new color feature, inter-channel voting, among the three components of an HSI image. This method explores the interrelationship among the Hue (H), Saturation (S) and Intensity (I) components of a color image. To perform inter-channel voting between the I and H channels, the I channel is quantized into bins and, for each pixel, its H value is added to the bin indexed by its quantized I value; the reverse pairing is computed analogously. This process is shown in Fig. 4. Because inter-channel voting is not commutative, the feature vector produced by voting between H and I differs from the feature vector produced by voting between I and H. The voting process is applied to the hue-saturation, hue-intensity and intensity-saturation pairs of the image separately, in both orders, which creates a total of 6 feature vectors; the final feature vector is their concatenation. Feature vector creation using the quantized I channel is shown in Eqs. (5)-(8).

$$Range_{I} = \frac{{\max \left( {I\left( {i,\;j} \right)} \right) - \min \left( {I\left( {i,\;j} \right)} \right)}}{{I_{{{\text{level}}}} }};\forall i,j \in I$$
(5)
$$I_{new} \left( {i,j} \right) = \left\{ {\begin{array}{*{20}c} {I_{level} - 1,} & {I\left( {i,j} \right) = \max \left( {I\left( {i,j} \right)} \right)} \\ {\left\lfloor {\frac{{I\left( {i,j} \right)}}{{Range_{I} }}} \right\rfloor ,} & {else} \\ \end{array} } \right.$$
(6)

where Ilevel is the number of quantization levels for Intensity.

$$Bin_{IH} \left( {I_{new} \left( {i,j} \right)} \right) = Bin_{IH} \left( {I_{new} \left( {i,j} \right)} \right) + H\left( {i,j} \right),\forall i,j \in I$$
(7)
$$Bin_{IS} \left( {I_{new} \left( {i,j} \right)} \right) = Bin_{IS} \left( {I_{new} \left( {i,j} \right)} \right) + S\left( {i,j} \right),\forall i,j \in I$$
(8)
Fig. 4

Inter-channel voting. The first 1-D array is the result of the process between ‘Hue & Intensity’ while the second one is the result of the process between ‘Intensity & Hue’ components
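A minimal sketch of Eqs. (5)-(8), extended to all six channel orderings, is given below; the number of quantization levels is an illustrative choice.

```python
import numpy as np

def quantize(channel, levels):
    """Eqs. (5)-(6): linear quantization of a channel into `levels` bins."""
    lo, hi = channel.min(), channel.max()
    rng = (hi - lo) / levels                   # Eq. (5)
    q = np.floor(channel / max(rng, 1e-8)).astype(int)   # Eq. (6)
    q[channel == hi] = levels - 1              # top value goes to the last bin
    return np.clip(q, 0, levels - 1)

def vote(src, dst, levels):
    """Eqs. (7)-(8): accumulate dst values into bins indexed by quantized src."""
    bins = np.zeros(levels)
    np.add.at(bins, quantize(src, levels).ravel(), dst.ravel())
    return bins

def interchannel_voting(h, s, i, levels=16):
    """All six orderings (H&S, S&H, H&I, I&H, I&S, S&I), concatenated.
    The per-channel bin count `levels` is an assumption, not the paper's value.
    """
    pairs = [(h, s), (s, h), (h, i), (i, h), (i, s), (s, i)]
    return np.concatenate([vote(a, b, levels) for a, b in pairs])
```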

1.2 Big image data processing

Digital devices have evolved from large devices with very little storage and processing capacity to small devices with far more storage and processing power. Because of this evolution, these devices today generate a lot of diverse and complex data, and existing computing systems are often unsuitable for storing and processing it. Astronomy and genomics, the first fields to experience such a data explosion in the 2000s, coined the term Big Data [21]. The word ‘Big’ cannot be quantified; it is a moving target, and what is considered ‘Big’ today will not be so years ahead. The data in ‘Big Data’ can have three properties: Volume, Variety and Velocity, although it need not have all three. Over the years the number of Vs has grown, to 4 Vs in 2012, 7 Vs in 2013 [22] and 10 Vs by 2014 [23].

Of all the world’s data, 80% is unstructured, essentially consisting of photos and videos [24]. In developed countries such as the UK, billions of videos per year are recorded by millions of CCTV cameras [25]. As these billions of videos need to be stored and processed, the demand for storage and search has increased substantially. Among the kinds of data that ‘Big Data’ technologies handle, images and videos fall under unstructured data; the relationship between image processing and video processing is shown in Fig. 5. This can be called Big Image/Video Data Processing. With it, many technological challenges, including compression, storage, transmission, analysis and recognition, that could not be addressed by earlier technologies can now be tackled [26,27,28,29].

Fig. 5

Evolution of Big Image/Video Data Processing. In 2001 there were 3 Vs in Big Data processing; one more V was added in 2012, and three more were added in each of 2013 and 2014, for a total of 10 Vs. As of 2020, more Vs are being deliberated

Big Data has applications in several important areas such as manufacturing, healthcare, fraud detection, transportation services, communication, banking, media and insurance. In healthcare, conditions such as genomic disorders, cancer, Chronic Obstructive Pulmonary Disease (COPD) and tumors can be predicted, diagnosed and monitored [30, 31]. In the development of smart cities, transportation services play an important role by controlling traffic, planning routes, managing revenue and providing travel guidance to urban residents [32]. In [33], a method is proposed for automatically calculating traffic volume and vehicle speed by pattern analysis using pixel data extracted from CCTV video. In the field of e-commerce, analyzing customer feedback and shopping patterns and identifying market areas leads to a better decision-making process. With the help of GPS-enabled smart devices and social media, analysis of customer behavior facilitates a reduction of risk in the insurance and banking sectors [34].

1.2.1 Hadoop

One of Apache’s most successful projects is Hadoop [35]. It is an open-source Java-based framework used to handle Big Data storage and processing in a distributed environment, and it uses the MapReduce paradigm for programming. In this environment, a large number of computers (nodes) can be grouped together to form a cluster, which provides more storage and processing power than a single node. A full cluster is not required, however: Hadoop can also run in pseudo-distributed mode on a single computer. Hadoop offers a flat scalability curve, which is its major advantage over the Message Passing Interface (MPI). Hadoop is responsible for, among other things: breaking the input data into chunks; forwarding each chunk to a node; executing the code on each chunk and checking that it executed; forwarding results either to subsequent processing stages (known as jobs) or to the final output location; carrying out the sorting between the map and reduce stages and forwarding each chunk of sorted data to the right node; and writing debugging information on each job’s progress [35, 36]. Other noteworthy implementations of MapReduce include Infinispan, the Disco Project, CouchDB, MongoDB and Riak. The two components of Hadoop, one for storage and one for processing, are the Hadoop Distributed File System (HDFS) and the MapReduce model.

1.2.2 Hadoop distributed file system

HDFS is a file system for distributed environments. When a cluster is formed from multiple nodes, every node contributes some of its storage to HDFS. HDFS is the backbone of the Hadoop system: it stores data by replication, placing the copies on different racks for fault tolerance. The replication factor can be any value, but conventionally 3 is used. For storage, it maintains three daemon processes (visible through the Java Virtual Machine Process Status tool, jps): the NameNode, the Secondary NameNode and the DataNode. The entire cluster has only one NameNode and one Secondary NameNode, but can have n DataNodes. For processing it maintains two more daemons, the ResourceManager and the NodeManager; again, there is only one ResourceManager and n NodeManagers in the cluster. In this paper, HDFS is used to store a number of images too large to hold on a single system, and the MapReduce model of programming is used to reduce the processing time over these images.

1.2.3 MapReduce

MapReduce is a programming model for processing huge amounts of data taken either from HDFS [37, 38] or from the Local File System (LFS). It consists of two phases: a map phase and a reduce phase. The map phase uses a function known as the mapper, which takes the input data as a series of <key, value> pairs and also produces output as <key, value> pairs. The intermediate <key, value> pairs are then combined by another function, the reducer, in the reduce phase.
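The two phases can be illustrated with a few lines of Python. The run_mapreduce helper below simulates map, shuffle-and-sort and reduce in memory, with word count as the canonical usage example; the helper name and the in-memory simulation are illustrative, not Hadoop's API.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory illustration of the MapReduce phases: map each record
    to <key, value> pairs, shuffle-and-sort by key, then reduce per key."""
    # map phase: every record may emit any number of <key, value> pairs
    pairs = [kv for rec in records for kv in mapper(rec)]
    # shuffle and sort: group the intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    # reduce phase: one reducer call per distinct key
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# usage: word count, the canonical MapReduce example
lines = ["map reduce", "map shuffle sort reduce"]
out = run_mapreduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda word, ones: (word, sum(ones)))
# [('map', 2), ('reduce', 2), ('shuffle', 1), ('sort', 1)]
```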

1.3 Map reduce paradigm for image retrieval

When dealing with a large number of images, the MapReduce paradigm is one of the best ways to obtain results in less time than on a single system. It is a programming paradigm in which execution takes place where the data resides, in three stages: map, shuffle and sort, and reduce. The map stage takes its input as <key, value> pairs and produces output as <key, value> pairs; the shuffle and sort stage then sorts these pairs by key; finally, the reducer consolidates the work for each key and produces the final output. A distributed file system can be used to store the data of the intermediate steps. The data can be in any form: text, images, videos, log data, etc.

Istephan et al. [39] proposed a method to retrieve images from unstructured medical image big data, with a case study on epilepsy. They used two criteria to validate the feasibility of the proposed framework: accuracy and ability. Accuracy was tested by executing queries on data containing both structured and unstructured parts; ability was tested by comparing query results on Hadoop clusters of different sizes. The same kind of ability test is performed in [40]. A novel CBIR framework known as PIC was proposed by Zhang et al. [41], in which cloud computing is used to search for an image in a large image dataset while preserving the privacy of the input data. To deal with massive numbers of images, they designed a system suited to distributed and parallel computation to expedite the search process. Le Dong [29] proposed an effective processing framework named Image Cloud Processing (ICP) to deal with the data explosion in the image processing field. The ICP framework consists of two mechanisms: Static ICP (SICP) and Dynamic ICP (DICP), where SICP is designed to cooperate with the MapReduce paradigm and DICP is implemented through a parallel processing procedure working with the traditional processing mechanism of the distributed system. They validated the ICP framework on the ImageNet dataset.

1.4 Performance measures for CBIR

1.4.1 Average precision rate (APR) and average recall rate (ARR)

Precision is defined as the ratio of the number of relevant images retrieved to the total number of images retrieved for a given query. Recall is defined as the ratio of the number of relevant images retrieved to the total number of images belonging to the same class as the query image. The precision averaged over different step sizes m1, m2, …, mk is known as APR; similarly, the averaged recall is known as ARR.
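In symbols, one standard formalization of the above, assuming precision and recall are averaged over all query images Q at each step size m, with NG(q) the number of images in the class of query q:

$$\mathrm{Precision} \left( {q,\;m} \right) = \frac{{\left| {\text{relevant images in top } m} \right|}}{m},\qquad \mathrm{Recall} \left( {q,\;m} \right) = \frac{{\left| {\text{relevant images in top } m} \right|}}{{NG\left( q \right)}}$$

$$\mathrm{APR} \left( m \right) = \frac{1}{\left| Q \right|}\sum\limits_{q \in Q} {\mathrm{Precision} \left( {q,\;m} \right)},\qquad \mathrm{ARR} \left( m \right) = \frac{1}{\left| Q \right|}\sum\limits_{q \in Q} {\mathrm{Recall} \left( {q,\;m} \right)}$$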

1.4.2 F-Measure

The F-Measure is a single value that reflects the relationship between precision and recall. It is obtained by assigning equal weight to both precision and recall in the harmonic mean, as given in Eq. (9).

$${\text{F - Measure}}\left( n \right) = \frac{{\left( {2 \times APR \times ARR} \right)}}{{\left( {APR + ARR} \right)}}$$
(9)

1.4.3 Average normalized modified retrieval rank (ANMRR)

ANMRR is used to measure retrieval accuracy. To calculate it, for each query image only those retrieved images whose rank is less than 2 × (number of images in the class) are considered: if a relevant image’s rank is below this cut-off, its score is its rank; otherwise, its score is a predefined fixed penalty. The average score is then computed and normalized, and ANMRR is the mean of the normalized scores over all queries.
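For reference, this description matches the MPEG-7 NMRR formulation, in which the cut-off is \(K\left( q \right) = 2\,NG\left( q \right)\), with NG(q) the number of images in the query’s class, and the ‘predefined fixed number’ is \(1.25\,K\left( q \right)\):

$$\mathrm{Rank}^{*} \left( k \right) = \left\{ {\begin{array}{*{20}l} {\mathrm{Rank} \left( k \right),} & {\mathrm{Rank} \left( k \right) \le K\left( q \right)} \\ {1.25\,K\left( q \right),} & {\text{otherwise}} \\ \end{array} } \right.$$

$$\mathrm{NMRR} \left( q \right) = \frac{{\frac{1}{{NG\left( q \right)}}\sum\nolimits_{k = 1}^{NG\left( q \right)} {\mathrm{Rank}^{*} \left( k \right)} - 0.5\left[ {1 + NG\left( q \right)} \right]}}{{1.25\,K\left( q \right) - 0.5\left[ {1 + NG\left( q \right)} \right]}},\qquad \mathrm{ANMRR} = \frac{1}{\left| Q \right|}\sum\limits_{q \in Q} {\mathrm{NMRR} \left( q \right)}$$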

1.4.4 Total minimum retrieval epoch (TMRE)

It is used to measure the minimum number of images to be traversed to retrieve all the relevant images.

2 Methodology

Feature extraction and computation of the performance measures (APR, ARR, F-Measure, TMRE, ANMRR) are done using the MapReduce paradigm. The texture features used for CBIR are LBP and ULBP. In addition, two color features, the color histogram and the color autocorrelogram, are used. Finally, a fused feature, inter-channel voting combined with DS_GLCM [20], is used for CBIR. All five methods are explained in the Introduction. This section gives a detailed description of the MapReduce paradigm used to retrieve queried images from a given image dataset, followed by a detailed explanation of each MapReduce job. In the proposed method, a total of 8 MapReduce jobs, Job0 to Job7, are used for the image retrieval process; the block diagram is shown in Fig. 6.

Fig. 6

Block diagram of the proposed system using the MR paradigm. For each job the input and output are shown, and the performance measure values are calculated

2.1 Job0 functionality


The given image dataset can be stored in the LFS or in HDFS. The Job0 functionality is shown in Fig. 7. The mapper takes images with any extension (jpg/png/tif, …) as input and outputs the key-value pair <FileName, pixel values of the image>. Note that if the given image is a color image, all three channels, i.e. the red, green and blue pixels, are stored as part of the <key, value> pair. These pairs are the input to shuffle and sort, where they are sorted by key. The sorted <FileName, pixel values of the image> pairs are then the input to the reducer, which converts them into sequence files. The number of <key, value> pairs in each sequence file depends on the sequence file size supported by the software.

Fig. 7

Outline of MapReduce Job0. It is used to convert the given images into sequence files
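As a rough illustration, the sketch below mimics Job0 in plain Python: the mapper emits one <FileName, pixel array> pair per image, and the reduce step bundles the key-sorted pairs into fixed-size chunks standing in for sequence files. The dataset path, chunk size and use of PIL/NumPy are illustrative assumptions, not the paper's implementation, which runs under MATLAB's/Hadoop's MapReduce.

```python
import glob
import numpy as np
from PIL import Image

def job0_mapper(path):
    """Job0 map step: one image file in, one <FileName, pixel array> pair out.
    For a color image all three channels stay in the value part."""
    yield path, np.asarray(Image.open(path))   # HxW (gray) or HxWx3 (RGB)

def job0_reduce(sorted_pairs, chunk=100):
    """Stand-in for the reducer's sequence files: bundle the key-sorted
    pairs into fixed-size chunks (the chunk size is hypothetical)."""
    return [sorted_pairs[i:i + chunk]
            for i in range(0, len(sorted_pairs), chunk)]

# "dataset/" is a hypothetical folder; the real system reads LFS or HDFS
pairs = sorted((kv for p in glob.glob("dataset/*.jpg") for kv in job0_mapper(p)),
               key=lambda kv: kv[0])           # shuffle and sort on the key
sequence_files = job0_reduce(pairs)
```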

2.2 Job1 functionality


The mapper takes the unbundled data, i.e. <FileName, pixel values of the image>, as input. For a color image, the mapper first converts it to a gray image and then calculates the feature vector for LBP and ULBP; for the other three methods the color image is used as is. The output of the mapper is <FileName, Feature Vector>. These pairs are the input to shuffle and sort, where they are sorted by key, and the sorted <FileName, Feature Vector> pairs are then the input to the reducer, which converts them into sequence files. The Job1 functionality is shown in Fig. 8.

Fig. 8

Outline of MapReduce Job1. It is used for calculating the Feature vectors for all the images

2.3 Job2 functionality


The mapper takes each image (the ‘key’ part of the pair) and calculates its distance to all n images in the dataset, producing <FileName, n distances to images 1 to n>. It does this for every key, i.e. for every image, e.g. <image 1, (0, 15, 45, 6, …)>, <image 2, (15, 0, 4, 61, …)>. In <image 1, (0, 15, 45, 6, …)>, ‘0’ is the distance from image 1 to itself, ‘15’ the distance between image 1 and image 2, ‘45’ the distance between image 1 and image 3, and so on. Observe that the distance from image 1 to image 2 is the same as the distance from image 2 to image 1. As in the previous jobs, shuffle and sort orders the data by image number. The reducer then ranks the image numbers by the distances in the value part and writes the image numbers in rank order, e.g. <image 1, (1, 101, 205, 900, …)>. Here the value part (1, 101, 205, 900, …) lists the image numbers closest to image 1: image 1 itself is the 1st closest, image 101 the 2nd closest, image 205 the 3rd closest, and so on. Another example is <image 2, (108, 2, 43, 66, …)>. To sum up, the output <key, value> pairs of Job2 are the columns of the rank matrix representation given in Fig. 9. The Job2 functionality is shown in Fig. 10.

Fig. 9

Rank matrix representation for the 1000-image dataset. The (i, j)th position of this matrix contains the ith most similar image to image j

Fig. 10

Outline of MapReduce Job2. Its purpose is to construct the rank matrix from the distances between the feature vectors of each query image and all images in the dataset
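A compact, non-distributed sketch of what Job2 computes is given below, assuming Euclidean distance between feature vectors; the section does not fix the distance measure, so treat that choice as illustrative.

```python
import numpy as np

def job2_rank_matrix(features):
    """Job2 in matrix form: pairwise distances between all feature vectors,
    then, per query image, the image numbers sorted by closeness.
    `features` is an n x d array, one row per image.
    """
    # mapper: distance of each image to all n images (symmetric, d(i,i)=0)
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))            # n x n distances
    # reducer: for each query (column), image numbers ordered by distance;
    # column j lists the 1st, 2nd, ... closest images to image j (Fig. 9),
    # so entry (1, j) is always image j itself
    return np.argsort(dist, axis=0, kind="stable") + 1   # 1-based image ids
```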

2.4 Job3 functionality


The mapper reads the data as <FileName, n image numbers in rank order>; for each key, it checks whether each image number in the value part belongs to the same group as the key and updates the count accordingly. The same process is followed for the top m1, m2, m3, … images, ∀ mi ≤ k, where k is the number of images in that group. The output of the mapper is <m1, (6, 7, 8, 2, … a total of n numbers)>: here ‘6’ means that for image 1, 6 of the top m1 images belong to the same group, ‘7’ means that for image 2, 7 of the top m1 are of the same group, and so on; likewise for m2, m3, …. Shuffle and sort orders the data by key, i.e. m1, m2, m3, …. The reducer adds all the numbers in the value part and outputs <mi, number of images of that particular group>. For example, <m1, 6918> means that out of m1 × n images, 6918 are of the same group, and <m2, 12353> means that out of m2 × n images, 12353 are of the same group. This is calculated for all mi. The entire process is given in Fig. 11.

Fig. 11

Outline of MapReduce Job3. It is used for obtaining the counts of correct retrievals for each query image of the dataset

2.5 Job4 functionality


The input to the mapper is the table of counts of within-group matches for each step size. The mapper converts these counts into percentages, producing <APR, (for m1, for m2, …)>, <ARR, (for m1, for m2, …)> and <F-Measure, (for m1, for m2, …)>. These three are sorted by the shuffle and sort phase using the keys APR, ARR and F-Measure. The reducer retains the APR of the top m1 matches and the ARR and F-Measure for the top k, discarding the rest. This is the same final result as obtained without the MapReduce paradigm. The entire functionality of Job4 is shown in Fig. 12.

Fig. 12

Outline of MapReduce Job4. It is used for calculating APR, ARR and F-Measure

2.6 Job5 functionality


The functionality of this job is to calculate TMRE. The mapper takes the rank matrix as input, the same input used for Job3. For each column of the rank matrix, the mapper traverses the ranked images until it has found all the images belonging to the group of the image given in the ‘key’. The output of the mapper is one <commonkey, count> pair per column of the rank matrix, i.e. for every image in the dataset, where ‘count’ is the rank depth up to which we must go to find all the images of that group. The reducer takes these pairs as input and calculates the parameter TMRE. This process is given in Fig. 13.

Fig. 13

Outline of MapReduce Job5. Output of this job is TMRE value
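A minimal sketch of Job5's computation follows, assuming classes are contiguous blocks of class_size image numbers, as in Corel-1K with 100 images per category, and that TMRE sums the per-query depths; both are assumptions for illustration.

```python
import numpy as np

def job5_tmre(rank_matrix, class_size=100):
    """Job5: for each query (column of the rank matrix of 1-based image ids),
    find how deep in the ranking one must go before all `class_size` images
    of the query's class have appeared; TMRE totals these depths."""
    n = rank_matrix.shape[1]
    total = 0
    for q in range(n):                        # q is a 0-based query index
        group = q // class_size               # class id of the query image
        # boolean mask over ranks: positions holding same-class images
        same = (rank_matrix[:, q] - 1) // class_size == group
        depth = np.max(np.nonzero(same)[0]) + 1   # deepest same-class rank
        total += depth
    return total
```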

2.7 Job6 functionality


The mapper and the shuffle and sort of this job work exactly like those of Job2. The output of the reducer is an image-based matrix, shown in Fig. 14. The reducer output has the form <image 1, (1, 101, 205, 900, …)>, where the value part (1, 101, 205, 900, …) contains the ranks of image 1 versus image 1, image 1 versus image 2, image 1 versus image 3, and so on. Note that in this image-based matrix the principal diagonal elements are all 1 s, because the rank of image i with respect to itself is always 1. The entire process of Job6 is shown in Fig. 15.

Fig. 14

Image-based matrix representation for the 1000-image dataset. The (i, j)th position of this matrix contains the similarity of image i with respect to image j

Fig. 15

Outline of MapReduce Job6. It is used for obtaining the image-based matrix by considering the distance between the feature vectors of the query image and the images in the dataset

2.8 Job7 functionality


In Job7, the mapper takes the image-based matrix with the ranks as input. The output of the mapper is the NMRR value for each image, i.e. <FileName 1, NMRR>, <FileName 2, NMRR>, …, and the sorted version of this is given as input to the reducer. The reducer calculates the ANMRR value and displays it. The process is shown in Fig. 16.

Fig. 16

Outline of MapReduce Job7. It is used for calculating the ANMRR value

The process explained above can be implemented with different options, based on the storage and processing mechanisms. The storage can be LFS or HDFS; the processing can be the non-MR model, MATLAB’s MR model, or Hadoop’s MR model. With these options, seven different modes of implementation are given in Table 1. All seven modes give the same CBIR results for the five measures APR, ARR, F-Measure, TMRE and ANMRR; the only difference is the time taken to complete the process.

Table 1 Different MapReduce Paradigms for CBIR

3 Results and discussions

The experimentation is done on two image datasets: Corel-1K, a natural image dataset, and VisTex, a texture image dataset. The same datasets are used by Netalkar et al. [42] for CBIR by extracting features in the frequency domain, specifically the Discrete Cosine Transform. The five image retrieval methods explained in the Introduction are considered for retrieving the images. The results reported here use Mode-6 from Table 1. The experiment is carried out using 1, 2 and 4 parallel workers in MATLAB R2020b. Detailed results are given for each dataset.

3.1 Dataset-1 (Corel1K)

This dataset [43] consists of a total of 1000 images in 10 categories, each of 100 images. The categories are Africans, Beaches, …, Food. Each image in this dataset is 384 × 256 or 256 × 384. Three images from each group, 30 in total, are shown in Fig. 17. The performance measures for all five CBIR methods are given in Table 2. Figures 18, 19, 20, 21 and 22 show the time in seconds for each of the seven jobs, and the total time, for all five methods. Table 3 shows the total time taken by each of the five methods, and by all five methods together, using 1, 2 and 4 workers in parallel execution; it also shows the percentage of time saved for different numbers of workers.

Fig. 17

Corel-1K samples. Three images from each of the 10 categories are displayed

Table 2 Performance measures for different CBIR methods on Corel-1K dataset
Fig. 18

Time graph for LBP on the Corel-1K dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 19

Time graph for ULBP on the Corel-1K dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 20

Time graph for ColorHist_RGB on the Corel-1K dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 21

Time graph for Color_Autocorrelogram on the Corel-1K dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 22

Time graph for IC_HSI + DS_GLCM on the Corel-1K dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Table 3 The time saved for Corel1K image datasets

From the graphs in Figs. 18, 19, 20, 21 and 22, it is clearly evident that the time to completion with 4 workers is much lower than with a single worker as well as with the 2-worker setup. From Table 3, the time saved by using 4 workers instead of 2 is around 41% for all five CBIR methods, and similarly the time saved by using 4 workers instead of 1 is around 68%.

3.2 Dataset-2 (VisTex)

The first texture image dataset considered is the VisTex texture dataset [44], which consists of a total of 484 images; of these, 40 are used for experimentation. The original image dimension is 512 × 512. Each of the 40 images is divided into 16 non-overlapping sub-images of dimension 128 × 128, which results in a dataset of 640 texture images. Of these 640, images 1, 17, 33, 49, …, 625, i.e. the first sub-image of each of the 40 original textures, are shown in Fig. 23. The performance measures for all five CBIR methods are given in Table 4. Figures 24, 25, 26, 27 and 28 show the time in seconds for each of the seven jobs, and the total time, for all five methods. Table 5 shows the total time taken by each of the five methods, and by all five methods together, using 1, 2 and 4 workers in parallel execution; it also shows the percentage of time saved for different numbers of workers.

Fig. 23

VisTex sample images. The first image of each of the 40 categories is displayed

Table 4 Performance measures for different CBIR methods on VisTex dataset
Fig. 24

Time graph for LBP on the VisTex dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 25

Time graph for ULBP on the VisTex dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 26

Time graph for ColorHist_RGB on the VisTex dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 27

Time graph for Color_Autocorrelogram on the VisTex dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Fig. 28

Time graph for IC_HSI + DS_GLCM on the VisTex dataset. The time to complete each job decreases as the number of workers increases; the same pattern is observed for the total time

Table 5 The time saved for VisTex image datasets

From the graphs in Figs. 24, 25, 26, 27 and 28, it is clearly evident that the time to completion with 4 workers is much lower than with a single worker as well as with the 2-worker setup. From Table 5, the time saved by using 4 workers instead of 2 is around 42% for all five CBIR methods, and similarly the time saved by using 4 workers instead of 1 is around 68%.

4 Conclusions

The results clearly show that the MapReduce paradigm works as expected: as the number of workers increases, the time to compute the whole process decreases accordingly. For all five image retrieval methods, the final performance measures (Average Precision Rate, Average Recall Rate, F-Measure, Average Normalized Modified Retrieval Rank and Total Minimum Retrieval Epoch) are exactly the same as in single-computer execution. Irrespective of the retrieval method used, the times for the five methods are also roughly the same. For completing all five image retrieval methods on Corel-1K, the time saved is 43%, 45% and 68% for 4 vs 2, 2 vs 1 and 4 vs 1 workers respectively; for VisTex it is 42%, 46% and 68%. In future work, larger image collections will be analyzed with 96 parallel workers, and state-of-the-art technologies such as Spark and HBase will be used.