1 Introduction

Computer-based biomedical image analysis techniques help medical experts and technicians improve their diagnosis of diseases, based on crucial inputs suggested by the computer system [49, 53]. Biomedical image retrieval is one of the fundamental and most challenging problems in medical and health informatics [28]. In image retrieval, the best-matching images, along with their descriptions, are identified from a database against a query image, based on the content similarity between the query and database images [19, 52]. The feature representation plays an important role in measuring the similarity between images [38, 47, 56].

In the past, the local binary pattern (LBP) was very popular for image representation [35]. Owing to the huge success and simplicity of LBP, numerous variants have been proposed over the past decades to address the challenges of image retrieval [38]. Some notable LBP variants are the local ternary pattern (LTP) [55], local derivative pattern (LDP) [62], local gradient hexa pattern (LGHP) [3], local directional gradient pattern (LDGP) [4] and local directional order pattern (LDOP) [11] for face recognition/retrieval; the local tetra pattern (LTrP) [29] and multi-channel decoded LBP (mdLBP) [16] for image retrieval; the local intensity order pattern (LIOP) [58] and interleaved intensity order-based local descriptor (IOLD) [12] for local image matching; and the complete dual-cross pattern (CDCP) [44], local directional ZigZag pattern (LDZP) [46] and local jet pattern (LJP) [45] for texture classification. LBP-based approaches are also widely used in biomedical image analysis, such as pulmonary emphysema analysis [51], cell phenotype classification [33], biomedical image classification [34] and stem cell classification [36]. The latest developments over the LBP-variant descriptors for biomedical image retrieval include the local mesh pattern (LMeP) [31], local ternary co-occurrence pattern (LTCoP) [30], local diagonal extrema pattern (LDEP) [13], local bit-plane dissimilarity pattern (LBDISP) [17], local bit-plane decoded pattern (LBDP) [15] and local wavelet pattern (LWP) [14]. Lan and Zhou used compressed scattering coefficients for medical image retrieval [24]. It is observed from the literature that bit-plane decoding-based descriptors are well suited to the biomedical image retrieval task [15]. Thus, in this work, we utilize the bit-plane decoded information within a convolutional neural network (CNN) framework.

During the past few years, CNN-based methods have advanced very rapidly. They show better efficacy than classification based on hand-designed features. The first revolutionary work in this direction was the AlexNet architecture of Krizhevsky et al. [23] for image classification. After AlexNet, various deep architectures have been proposed for image classification, such as Vgg16 with increased depth [48], GoogleNet with the inception module [54] and ResNet with the residual module [21]. CNNs have also proven effective for other problems, such as Faster R-CNN [43] for object detection, Mask R-CNN [20] for semantic segmentation, image fusion [22], CNN-ranker [61] for retrieval and Cross-CNN [59] for multiple-modality data representation. CNN-based methods are also efficient for biomedical image analysis, such as colon cancer recognition [50], cervical cell classification [63], pneumonia detection [41], multispectral MR image segmentation [5] and medical image registration [57].

Training deep CNNs requires a huge number of images, which may not be available in many real-life scenarios. This issue is generally dealt with by applying transfer learning with models pre-trained over some large databases. Researchers have used pre-trained CNN models for applications such as content-based image retrieval [25], remote sensing image retrieval [18], face retrieval [10], military object recognition [60] and dumpster recognition [42]. CNN models pre-trained on the ImageNet database [9] have also been successfully applied to medical imaging tasks such as mammogram analysis [2], bioimage classification [32] and domain transfer for biomedical images [37].

Some attempts have been made to utilize CNNs for biomedical image retrieval. Qayyum et al. [39] used an eight-layer CNN architecture similar to AlexNet for medical image retrieval. They trained the network over a database of 7200 images obtained from different sources and achieved a mean average precision of 0.69; due to the lack of sufficient training images, the performance remained limited. Qiu et al. [40] applied hash coding over the ‘FC6’ and ‘FC7’ AlexNet features for medical image retrieval. The binary features reduce the retrieval time in [40], but at the cost of degraded performance. Chung et al. [7] used a deep Siamese CNN (SCNN) for diabetic retinopathy fundus image retrieval; the retrieval performance of the last SCNN layer proposed in [7] is quite similar to that of the CNN softmax layer. Chowdhury et al. [6] used a CNN and the edge histogram descriptor for radiographic image retrieval. Their approach works in two steps: first, the relevant database classes are computed for a query image using the CNN, and then the hand-crafted edge histogram descriptor is used to retrieve images only from the relevant classes. That approach combines the CNN with a hand-crafted descriptor in a sequential fashion, whereas in our proposed approach the CNN features are computed over a hand-designed feature map and fused with the original CNN features (i.e., parallel fusion).

Motivated by the suitability of the bit-plane decoding mechanism for biomedical images, the success of CNNs in various challenging problems and the re-usability of pre-trained models, we propose local bit-plane decoded CNN descriptors for biomedical image retrieval. The main contributions of this paper can be summarized as follows:

  • The local bit-plane decoding mechanism is used for image transformation similar to LBDP [15].

  • The pre-trained CNN models such as AlexNet [23], Vgg16 [48], GoogleNet [54] and ResNet50 [21] are employed to generate the features.

  • The CNN features are generated over raw input image as well as bit-plane decoded image and combined at the last representation layers using different fusion strategies.

The rest of the paper is structured as follows: Section 2 proposes the local bit-plane decoded CNN descriptor; Sect. 3 presents the experimental setup including retrieval framework, databases and evaluation criteria; Sect. 4 reports the experimental results and analysis; and Sect. 5 concludes the paper.

Fig. 1

Proposed local bit-plane decoded AlexNet descriptor (LBpDAD) by fusing the original AlexNet features with local bit-plane decoded AlexNet features

2 Proposed local bit-plane decoded CNN descriptor

This section illustrates the proposed local bit-plane decoded AlexNet descriptor (LBpDAD), obtained by integrating the trained AlexNet [23] with the local bit-plane decoding mechanism of [15]. The trained weights of the AlexNet model, computed over the large-scale ImageNet database [9], are used in this paper. The proposed method for biomedical image retrieval is illustrated in Fig. 1. The input image I of dimension \(m \times n \times 3\) is passed through the local bit-plane decoding mechanism proposed in [15] to generate the local bit-plane decoded map \(I_M\) as follows:

$$\begin{aligned} I_M^{i,j,k} = \sum _{b=1}^{8}{{\hbox {sign}}\left( I^{i,j,k}, B_D^{i,j,k,b}\right) \times 2^{b-1}} \end{aligned}$$
(1)

where \(i = 2,3,\ldots ,m-1\), \(j = 2,3,\ldots ,n-1\), \(k = 1,2,3\) represents the \(k{\mathrm{th}}\) channel, \(b = 1,2,\ldots ,8\) represents the \(b{\mathrm{th}}\) bit-plane, \(I^{i,j,k}\) is the value at position \((i,j,k)\) in the input image, \(I_M^{i,j,k}\) is the value at position \((i,j,k)\) in the output map of local bit-plane decoding, \(B_D^{i,j,k,b}\) is the local bit-plane decoded decimal value in the \(b{\mathrm{th}}\) bit-plane for the center pixel \((i,j)\) in the \(k{\mathrm{th}}\) channel, and \({\hbox {sign}}(\alpha , \beta )\) is given as,

$$\begin{aligned} {\hbox {sign}}(\alpha , \beta ) ={\left\{ \begin{array}{ll} 1, &{} \quad \text {if}\,\,\alpha \ge \beta \\ 0, &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

The \(B_D^{i,j,k,b}\) is computed as,

$$\begin{aligned} B_D^{i,j,k,b} = \sum _{n=1}^{8}{B_{n}^{i,j,k,b} \times 2^{n-1}} \end{aligned}$$
(3)

where \(B_{n}^{i,j,k,b}\) is the binary bit in the \(b{\mathrm{th}}\) bit-plane of the \(k{\mathrm{th}}\) channel corresponding to the \(n{\mathrm{th}}\) neighbor of \(I^{i,j,k}\), located at unit distance in the direction of \((n-1) \times 45^{\circ }\) from the positive x axis. Note that the summation index n in Eq. (3) runs over the eight neighbors and is distinct from the image width n.
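For concreteness, a minimal NumPy sketch of Eqs. (1)–(3) is given below. It assumes 8-bit image values and standard image coordinates (row index increasing downward), leaves the border pixels at zero in line with the index ranges above, and the function name is ours:

```python
import numpy as np

def local_bit_plane_decoded_map(img):
    """Compute the local bit-plane decoded map I_M of Eqs. (1)-(3)."""
    if img.ndim == 2:                       # gray image: replicate the channel
        img = np.stack([img] * 3, axis=-1)
    m, n, _ = img.shape
    out = np.zeros_like(img)
    # (row, col) offsets of the 8 unit-distance neighbors at (n-1)*45 degrees
    # from the positive x axis, for n = 1, ..., 8
    offs = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
            (0, -1), (1, -1), (1, 0), (1, 1)]
    for k in range(3):
        I = img[:, :, k].astype(np.int32)
        C = I[1:m - 1, 1:n - 1]                         # center pixels
        acc = np.zeros_like(C)
        for b in range(8):                              # bit-plane b+1
            BD = np.zeros_like(C)
            for nb_idx, (dr, dc) in enumerate(offs):    # neighbor n = nb_idx+1
                nb = I[1 + dr:m - 1 + dr, 1 + dc:n - 1 + dc]
                BD += ((nb >> b) & 1) << nb_idx         # Eq. (3)
            acc += (C >= BD).astype(np.int32) << b      # Eqs. (1)-(2)
        out[1:m - 1, 1:n - 1, k] = acc.astype(img.dtype)
    return out
```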

Fig. 2

The biomedical image retrieval framework using proposed \({ LBpDAD}\) features

Now, the input image I and the local bit-plane decoded image map \(I_M\) are converted into \(I_A\) and \(I_{MA}\), respectively, to satisfy the input dimension required by the pre-trained AlexNet. The \(I_A\) and \(I_{MA}\) are computed as,

$$\begin{aligned} I_A&= \tau (I, [227, 227]) \end{aligned}$$
(4)
$$\begin{aligned} I_{MA}&= \tau (I_M, [227, 227]) \end{aligned}$$
(5)

where \(\tau (\Gamma , [\xi , \xi ])\) is a function that resizes any 3D volume \(\Gamma\) of dimension \(\varrho \times \upsilon \times \psi\) into the dimension \(\xi \times \xi \times \psi\). Here, \(227 \times 227\) is the spatial resolution required at the input of AlexNet.

Let Alex be a function composed of convolutional, ReLU, max-pooling and fully connected layers, which returns the features at a particular layer of the pre-trained AlexNet for an input image of dimension \(227 \times 227 \times 3\). The feature vectors AlexNet and \({ LBpD}\_{ Alex}\) are computed for the input images \(I_A\) and \(I_{MA}\), respectively, at the class score layer (‘cs’) as,

$$\begin{aligned} { AlexNet}&= { ReLU}({ Alex}(I_A, cs)) \end{aligned}$$
(6)
$$\begin{aligned} { LBpD}\_{ Alex}&= { ReLU}({ Alex}(I_{MA}, cs)) \end{aligned}$$
(7)

where ReLU [23] is a function defined as,

$$\begin{aligned} { ReLU}(\phi _v) ={\left\{ \begin{array}{ll} \phi _v, &{} \quad {\text {if}} \,\phi _v \ge 0 \\ 0, &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

\(\forall v = 1,2,\ldots ,D(\phi )\), where \(\phi\) represents a feature vector and \(D(\phi )\) represents its length. The ReLU operator is used in the CNN framework to introduce nonlinearity into the convolved features by filtering out negative values. Note that applying ReLU over the feature vectors is required here because only nonnegative values are useful in most distance measures.
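As an illustration, the following sketch extracts the rectified features of Eqs. (6)–(8) at the ‘cs’, ‘fc7’ or ‘fc6’ layer. It assumes a torchvision AlexNet pre-trained on ImageNet as a stand-in for the MATLAB model used in the paper, together with the standard torchvision input normalization; the helper name and the layer-index mapping of the torchvision classifier are ours:

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

alex = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),                  # HxWx3 uint8 -> 3xHxW float in [0, 1]
    transforms.Resize((227, 227)),          # tau(., [227, 227]) of Eqs. (4)-(5)
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def alex_features(img, layer='cs'):
    """ReLU-rectified AlexNet features at layer 'cs', 'fc7' or 'fc6'."""
    x = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        x = alex.avgpool(alex.features(x)).flatten(1)
        # torchvision classifier: [Drop, fc6, ReLU, Drop, fc7, ReLU, fc8 (cs)]
        stop = {'fc6': 2, 'fc7': 5, 'cs': 7}[layer]
        for mod in alex.classifier[:stop]:
            x = mod(x)
    return F.relu(x.squeeze(0))             # Eq. (8)
```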

The Maximum (‘Max’) fusion technique is used to combine the AlexNet and \({ LBpD}\_{ Alex}\) feature vectors into the final LBpDAD descriptor as,

$$\begin{aligned} LBpDAD_v = M(AlexNet_v, LBpD\_Alex_v) \end{aligned}$$
(9)

where \(LBpDAD_v\), \(AlexNet_v\) and \(LBpD\_Alex_v\) are the \(v{\mathrm{th}}\) elements of the LBpDAD, AlexNet and \({ LBpD}\_{ Alex}\) feature vectors, respectively, \(v = 1,2,\ldots ,D(AlexNet)\) with \(D(AlexNet)=D(LBpD\_Alex)\), and M is the ‘Max’ operator defined as,

$$\begin{aligned} M(\alpha , \beta ) ={\left\{ \begin{array}{ll} \alpha , &{} \quad \text {if } \alpha \ge \beta \\ \beta , &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(10)
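Using the two helpers sketched above (both hypothetical names), the full LBpDAD pipeline of Eqs. (9)–(10) reduces to an element-wise maximum:

```python
import torch

def lbpdad(img, layer='cs'):
    """LBpDAD of Eq. (9): 'Max' fusion of the two rectified feature vectors."""
    img_map = local_bit_plane_decoded_map(img)            # Eq. (1)
    return torch.maximum(alex_features(img, layer),       # AlexNet path
                         alex_features(img_map, layer))   # LBpD_Alex path
```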

The \({ LBpDAD}^{fc7}\) (i.e., final fused feature vector at ‘fc7’ layer) is computed as,

$$\begin{aligned} LBpDAD_v^{fc7} = M(AlexNet_v^{fc7}, LBpD\_Alex_v^{fc7}) \end{aligned}$$
(11)

where \(LBpDAD_v^{fc7}\), \(AlexNet_v^{fc7}\) and \(LBpD\_Alex_v^{fc7}\) are the \(v{\mathrm{th}}\) elements of \({ LBpDAD}^{fc7}\), \(AlexNet^{fc7}\) and \(LBpD\_Alex^{fc7}\) feature vectors, respectively. The \(AlexNet^{fc7}\) and \(LBpD\_Alex^{fc7}\) are the feature vectors computed at ‘fc7’ layer for the input images \(I_A\) and \(I_{MA}\), respectively, as,

$$\begin{aligned} AlexNet^{fc7}&= ReLU(Alex(I_A, fc7)) \end{aligned}$$
(12)
$$\begin{aligned} LBpD\_Alex^{fc7}&= ReLU(Alex(I_{MA}, fc7)) \end{aligned}$$
(13)

Similarly, the fused feature vector at the ‘fc6’ layer can be computed as,

$$\begin{aligned} LBpDAD_v^{fc6} = M(AlexNet_v^{fc6}, LBpD\_Alex_v^{fc6}) \end{aligned}$$
(14)

where \(LBpDAD_v^{fc6}\), \(AlexNet_v^{fc6}\) and \(LBpD\_Alex_v^{fc6}\) are the \(v{\mathrm{th}}\) elements of the \({ LBpDAD}^{fc6}\), \(AlexNet^{fc6}\) and \(LBpD\_Alex^{fc6}\) feature vectors, respectively. The \(AlexNet^{fc6}\) and \(LBpD\_Alex^{fc6}\) are the feature vectors computed at the ‘fc6’ layer for the input images \(I_A\) and \(I_{MA}\), respectively, as,

$$\begin{aligned} AlexNet^{fc6}&= ReLU(Alex(I_A, fc6)) \end{aligned}$$
(15)
$$\begin{aligned} LBpD\_Alex^{fc6}&= ReLU(Alex(I_{MA}, fc6)). \end{aligned}$$
(16)

Note that all the feature descriptors are normalized to unit sum using the following formula,

$$\begin{aligned} \phi _v = \frac{\phi _v}{\sum _{i=1}^{D(\phi )}{\phi _i}} \end{aligned}$$
(17)

where \(\phi\) is any feature vector of dimension \(D(\phi )\). This normalization makes the descriptors robust against image resolution variations.
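A one-line sketch of Eq. (17) follows; the small epsilon guarding an all-zero vector is our addition:

```python
import numpy as np

def unit_sum(phi, eps=1e-12):
    """Normalize a descriptor to unit sum, Eq. (17)."""
    phi = np.asarray(phi, dtype=np.float64)
    return phi / (phi.sum() + eps)          # sum is nonnegative after ReLU
```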

3 Experimental setup

This section first presents the biomedical image retrieval framework using the proposed descriptor, then describes the biomedical databases used for the experiments and, finally, the evaluation measures.

3.1 Proposed biomedical image retrieval framework

The biomedical image retrieval framework using the proposed local bit-plane decoded AlexNet descriptor (LBpDAD) is portrayed in Fig. 2. The feature extraction steps are the same for the query image and the database images. The image is passed through the pre-trained AlexNet to generate the direct features. The input image is also converted into a local bit-plane decoded map, which is then passed through the pre-trained AlexNet to generate the local bit-plane decoded features. Finally, the direct Alex features and the local bit-plane decoded Alex features are combined using the ‘Max’ fusion strategy to generate the final LBpDAD descriptor. As the biomedical images are gray scale and AlexNet requires a three-channel input, the single gray scale channel is copied three times to create the three-channel input. Once the descriptors are computed for all images, including the query, the feature matching is performed by computing the distances between the descriptors of the query image and the database images. Based on these distances, the top-matching images are retrieved from the database against the given query image. The ‘Chi-square’ distance measure is adopted in this paper as it has shown better performance for state-of-the-art descriptors [14, 15]. However, the performance of the proposed LBpDAD descriptor is also analyzed with other distances such as ‘Euclidean,’ ‘Manhattan,’ ‘Cosine’ and ‘Canberra’ in Sect. 4.
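The matching step can be sketched as follows, assuming the unit-sum descriptors are stacked row-wise in a matrix `db` (one row per database image); the Chi-square form with the 1/2 factor is one common convention, and the epsilon guards empty bins:

```python
import numpy as np

def chi_square(q, db, eps=1e-12):
    """Chi-square distance between query q and every row of db."""
    return 0.5 * np.sum((db - q) ** 2 / (db + q + eps), axis=1)

def retrieve(q, db, top_k=10):
    """Indices of the top_k best-matching database images."""
    return np.argsort(chi_square(q, db))[:top_k]
```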

3.2 Biomedical databases used

Three biomedical databases of different modalities, namely OASIS-MRI [27], TCIA-CT [8] and HeLa-Microscopic [1], are used in this paper to justify the improved performance of the proposed LBpDAD descriptor in the image retrieval framework. The Open Access Series of Imaging Studies has released a magnetic resonance imaging database (OASIS-MRI) in the public domain for research and analysis [27]. This database covers 421 subjects aged between 18 and 96 years. The OASIS-MRI database contains cross-sectional images of \(176 \times 208\) resolution. The database is divided into four categories, similar to [15], having 106, 89, 102 and 124 images; the categories represent varying ventricular shapes inside the images. The Cancer Imaging Archive (TCIA) is a repository of images of various cancer sites in the Digital Imaging and Communications in Medicine (DICOM) format [8]; these images are publicly accessible for research. We use the same TCIA-CT database as used in [14]. This database has 604 Colo_prone 1.0B30f CT images of the DICOM series number 1.3.6.1.4.1.9328.50.4.2 of study instance UID 1.3.6.1.4.1.9328.50.4.1 for subject 1.3.6.1.4.1.9328.50.4.0001. The database is divided into eight categories having 75, 50, 58, 140, 70, 92, 78 and 41 images, as per the size and structure of Colo_prone. The original image size in the TCIA-CT database is \(512 \times 512\) pixels. We also use fluorescence microscope images taken from the 2D HeLa database [1]. This database contains a total of 862 images of HeLa cells from ten different categories corresponding to ten different subcellular patterns imaged using fluorescence microscopy.

Fig. 3

The retrieval results comparison over OASIS-MRI, TCIA-CT and HeLa databases

Fig. 4

The retrieval results from the OASIS-MRI database. The first column represents the query image. The third to last columns represent the top ten retrieved images in decreasing order of similarity against the query image in the first column. The results in the first to 11th rows correspond to the LBP [35], LTP [55], LDP [62], LTrP [29], LTCoP [30], LMeP [31], LDEP [13], LBDP [15], LWP [14], LBDISP [17] and proposed \({ LBpDAD}^{fc6}\) descriptors, respectively. The false positive retrieved images are highlighted with red rectangles

Fig. 5

The retrieval results from the TCIA-CT database. The first column represents the query image. The third to last columns represent the top ten retrieved images in decreasing order of similarity against the query image in the first column. The results in the first to 11th rows correspond to the LBP [35], LTP [55], LDP [62], LTrP [29], LTCoP [30], LMeP [31], LDEP [13], LBDP [15], LWP [14], LBDISP [17] and proposed \({ LBpDAD}^{fc6}\) descriptors, respectively. The false positive retrieved images are highlighted with red rectangles

Fig. 6

The retrieval results from the HeLa database. The first column represents the query image. The third to last columns represent the top ten retrieved images in decreasing order of similarity against the query image in the first column. The results in the first to 11th rows correspond to the LBP [35], LTP [55], LDP [62], LTrP [29], LTCoP [30], LMeP [31], LDEP [13], LBDP [15], LWP [14], LBDISP [17] and proposed \({ LBpDAD}^{fc6}\) descriptors, respectively. The false positive retrieved images are highlighted with red rectangles

3.3 Evaluation criteria

The average retrieval precision (ARP), average retrieval rate (ARR), F-Score and average normalized modified retrieval rank (ANMRR) are used for performance measurement, similar to [13–15, 17, 30, 31]. The ARP and ARR are computed as,

$$\begin{aligned} ARP&= \frac{1}{C}\sum _{c=1}^{C}{MP_{c}} \end{aligned}$$
(18)
$$\begin{aligned} ARR&= \frac{1}{C}\sum _{c=1}^{C}{MR_{c}} \end{aligned}$$
(19)

where C is the number of classes in a database, and \(MP_{c}\) and \(MR_{c}\) are the mean precision and mean recall for the \(c{\mathrm{th}}\) class, defined as,

$$\begin{aligned} MP_{c}&= \frac{1}{n_c}\sum _{i=1}^{n_c}{\frac{\#CR_i}{\#TR}} \end{aligned}$$
(20)
$$\begin{aligned} MR_{c}&= \frac{1}{n_c}\sum _{i=1}^{n_c}{\frac{\#CR_i}{\#TG_c}} \end{aligned}$$
(21)

where \(n_c\) is the number of images in the \(c{\mathrm{th}}\) class, \(\#CR_i\) is the number of correctly retrieved images for the \(i{\mathrm{th}}\) query, \(\#TR\) is the total number of retrieved images, and \(\#TG_c\) is the number of ground truth images in the \(c{\mathrm{th}}\) class. The F-Score is calculated from the ARP and ARR as,

$$\begin{aligned} F{\text {-}}{} { Score} = 2 \times \frac{{ ARP} \times { ARR}}{{ ARP} + { ARR}}. \end{aligned}$$
(22)

The ANMRR is calculated by following the steps provided in [26]. Higher values of ARP, ARR and F-Score and a lower value of ANMRR represent better performance.
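A sketch of Eqs. (18)–(22) in code form follows. It assumes every database image serves once as the query against the remaining images (whether the query itself is kept in the database is a protocol detail the text does not fix), and reuses the hypothetical `retrieve` helper from Sect. 3.1:

```python
import numpy as np

def arp_arr_fscore(descs, labels, top_k):
    """ARP, ARR and F-Score of Eqs. (18)-(22) over a leave-one-out protocol."""
    labels = np.asarray(labels)
    mp, mr = [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]                  # images of class c
        p, r = [], []
        for i in idx:                                   # each image as query
            db = np.delete(descs, i, axis=0)
            db_labels = np.delete(labels, i)
            top = retrieve(descs[i], db, top_k)
            correct = np.sum(db_labels[top] == c)       # #CR_i
            p.append(correct / top_k)                   # Eq. (20): #CR / #TR
            r.append(correct / len(idx))                # Eq. (21): #CR / #TG_c
        mp.append(np.mean(p))
        mr.append(np.mean(r))
    arp, arr = np.mean(mp), np.mean(mr)                 # Eqs. (18)-(19)
    return arp, arr, 2 * arp * arr / (arp + arr)        # Eq. (22)
```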

4 Results and analysis

This section presents the experimental results, the comparison between methods and the analysis. First, the results of the proposed model are compared with state-of-the-art methods; then its performance is analyzed for different layers, fusion strategies, distance measures and CNN models.

4.1 Results comparison

In order to demonstrate the improved performance of the proposed model, the \({ LBpDAD}^{fc6}\) results are compared with the results of state-of-the-art descriptors such as LBP [35], LTP [55], LDP [62], LTrP [29], LTCoP [30], LMeP [31], LDEP [13], LBDP [15], LWP [14] and LBDISP [17]. Note that \({ LBpDAD}^{fc6}\) is used here for comparison, whereas the comparison between the \({ LBpDAD}^{fc6}\), \({ LBpDAD}^{fc7}\) and LBpDAD descriptors is carried out in the next subsection. The image retrieval results in terms of the ARP (%), ARR (%), F-Score (%) and ANMRR (%) for a varying number of retrieved images are presented in Fig. 3. The first, second, third and fourth rows contain the ARP, ARR, F-Score and ANMRR plots, respectively. The first, second and third columns are dedicated to the results over the OASIS-MRI, TCIA-CT and HeLa databases, respectively. The Chi-square distance is used for feature matching.

It is observed from Fig. 3a, d, g, j that the proposed \({ LBpDAD}^{fc6}\) descriptor outperforms the state-of-the-art descriptors by a large margin. The \({ LBpDAD}^{fc6}\) descriptor also succeeds on the TCIA-CT database, where it narrowly beats the best-performing LBDP descriptor in terms of all the evaluation measures (see Fig. 3b, e, h, k). A similar improvement using the proposed descriptor over the existing descriptors is observed for the HeLa database, as shown in Fig. 3c, f, i, l. The improved performance of the proposed descriptor may be due to the following three reasons: (1) the CNN features are more discriminative, being trained over the large ImageNet database, (2) the local bit-plane decoding mechanism is better suited to biomedical images, and (3) the fusion of the raw CNN feature and the local bit-plane decoded CNN feature further improves the discriminative power of the resultant descriptor.

The retrieved images using the different methods for an example query image of the OASIS-MRI, TCIA-CT and HeLa databases are shown in Figs. 4, 5 and 6, respectively. In these figures, the results in the first to 11th rows correspond to the LBP [35], LTP [55], LDP [62], LTrP [29], LTCoP [30], LMeP [31], LDEP [13], LBDP [15], LWP [14], LBDISP [17] and proposed \({ LBpDAD}^{fc6}\) descriptors, respectively. The first column represents the query image. The third to last columns represent the top ten retrieved images in decreasing order of similarity against the query image in the first column. The false positive retrieved images are highlighted with red rectangles. It can be observed from these results that the proposed method (last row) outperforms the other methods. The \({ LBpDAD}^{fc6}\) achieves 100%, 90% and 100% precision over the OASIS-MRI (Fig. 4), TCIA-CT (Fig. 5) and HeLa (Fig. 6) databases, respectively.

Fig. 7

The comparison between the AlexNet, \({ LBpD}\_{ Alex}\) and LBpDAD features taken from the ‘Softmax,’ ‘FC7’ and ‘FC6’ layers over the OASIS-MRI, TCIA-CT and HeLa databases using the ARP and ANMRR evaluation measures. Here, AlexNet refers to the features computed over the raw image, \({ LBpD}\_{ Alex}\) represents the AlexNet features computed over the local bit-plane decoded image instead of the original image, and LBpDAD depicts the features obtained after fusing AlexNet and \({ LBpD}\_{ Alex}\) using the ‘Max’ fusion strategy

4.2 Performance analysis over different layers

The previous subsection presented a comparison of the \({ LBpDAD}^{fc6}\) descriptor with the existing descriptors. In this experiment, the results of the proposed descriptor are analyzed at different layers, i.e., LBpDAD for the ‘class score’ layer, \({ LBpDAD}^{fc7}\) for the ‘fc7’ layer and \({ LBpDAD}^{fc6}\) for the ‘fc6’ layer (see Fig. 7). Moreover, the results of the original AlexNet (i.e., AlexNet, \(AlexNet^{fc7}\) and \(AlexNet^{fc6}\) for the ‘class score,’ ‘fc7’ and ‘fc6’ layers, respectively) as well as the results of the local bit-plane decoded AlexNet without fusion (i.e., \({ LBpD}\_{ Alex}\), \(LBpD\_Alex^{fc7}\) and \(LBpD\_Alex^{fc6}\) for the ‘class score,’ ‘fc7’ and ‘fc6’ layers, respectively) are also compared in Fig. 7. The results are shown for the ARP (first row) and ANMRR (second row) evaluation metrics over the OASIS-MRI (first column), TCIA-CT (second column) and HeLa (third column) databases. It is perceived across the plots of Fig. 7 that, in general, the performance of the fused local bit-plane decoded AlexNet descriptors (i.e., LBpDAD, \({ LBpDAD}^{fc7}\) and \({ LBpDAD}^{fc6}\)) is better than that of the local bit-plane decoded AlexNet descriptors without fusion (i.e., \({ LBpD}\_{ Alex}\), \(LBpD\_Alex^{fc7}\) and \(LBpD\_Alex^{fc6}\)), which in turn is better than that of the original AlexNet descriptors (i.e., AlexNet, \(AlexNet^{fc7}\) and \(AlexNet^{fc6}\)). Moreover, the performance gain due to the ‘Max’ fusion is very prominent over the HeLa database. This observation also supports that the CNN features extracted over the local bit-plane decoded image are more discriminative than the raw CNN features: the local bit-plane decoded image is rich in the local relationships at each bit-plane, while the two CNN features carry complementary information owing to the different input modalities (i.e., the raw input image and the local bit-plane decoded input image). It is also discovered from this experiment that the features of \({ LBpDAD}^{fc6}\) at the ‘fc6’ layer are more discriminative than the features of \({ LBpDAD}^{fc7}\) at the ‘fc7’ layer and LBpDAD at the ‘class score’ layer for the OASIS-MRI and TCIA-CT databases, because the later ‘fc7’ and ‘class score’ layer features are more fitted toward the training database than the earlier ‘fc6’ layer features. However, the \({ LBpDAD}^{fc7}\) descriptor at the ‘fc7’ layer is the best-performing one on the HeLa database, due to the presence of more homogeneous regions in its images.

Table 1 The results comparison between the Maximum (Max), Addition (Add), Product (Prod), Absolute Difference (Diff) and Division (Div) fusion strategies in terms of the ARP values for 5 retrieved images
Table 2 The t test computed over the results of Table 1

4.3 Performance analysis using different fusion strategies

This experiment analyzes the effect of different fusion strategies for combining the features of the original AlexNet and the local bit-plane decoded AlexNet. The ARP (%) values using the Maximum (Max), Addition (Add), Product (Prod), Absolute Difference (Diff) and Division (Div) fusion strategies are summarized in Table 1. Note that all features are passed through the ReLU operator before fusion. The LBpDAD, \({ LBpDAD}^{fc7}\) and \({ LBpDAD}^{fc6}\) descriptors are used over the OASIS-MRI, TCIA-CT and HeLa databases to validate the results. The number of retrieved images is set to 5, and the Chi-square distance is used. The main objective of the proposed method is to study the effect of feature-level fusion of hand-crafted and CNN features, and many fusion strategies are possible; the experiments explore some of them. Even though the Product (‘Prod’) fusion technique performs better in many instances (the product of two nonnegative feature vectors is sparser, which decreases the effect of inter-class variability on the final feature vector), it introduces additional computational overhead. Hence, we opt for the ‘Max’ fusion strategy in the remaining experiments.

In order to observe the statistical difference between the results of the different fusion strategies, we conduct a t test over the results of each pair of fusion strategies. Note that a higher absolute t value represents higher variability between the two distributions and vice versa; a positive sign indicates that the corresponding distribution has greater values than the other. We summarize the t test values for the results of Table 1 in Table 2. It is clear from this table that the overall performance using Max fusion is better, as it has positive t values compared to all other fusion strategies. The t test analysis confirms the choice of the Max fusion strategy in the proposed method. It is also observed that, statistically, the {Max, Add} fusion approaches and the {Prod, Diff} fusion approaches are very similar.
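As an illustration, a paired t test over the matched ARP entries of Table 1 could be computed as below; whether the paper uses a paired or an independent test is not stated, so the paired form over matched (descriptor, database) settings is an assumption here:

```python
import numpy as np
from scipy import stats

def compare_strategies(arp_a, arp_b):
    """Paired t test between the ARP values of two fusion strategies."""
    t, p = stats.ttest_rel(np.asarray(arp_a), np.asarray(arp_b))
    # |t| large -> the two result distributions differ markedly;
    # t > 0 -> strategy `a` tends to score higher than strategy `b`.
    return t, p
```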

Table 3 The comparison among the Euclidean (Eucld), Manhattan (L1), Cosine (Cosn), Canberra (Canb) and Chi-square (Chisq) distance measures in terms of the ARP values for 5 retrieved images
Table 4 The t test computed over the results of Table 3

4.4 Performance analysis using different distance measures

The Chi-square distance measure is used in the previous results to find the dissimilarity between two images. This experiment analyzes the effect of the distance measure on the performance of the proposed descriptors. The Euclidean (Eucld), Manhattan (L1), Cosine (Cosn), Canberra (Canb) and Chi-square (Chisq) distance measures are considered. The results with the different distance measures, in terms of the ARP (%) for 5 retrieved images using the LBpDAD, \({ LBpDAD}^{fc7}\) and \({ LBpDAD}^{fc6}\) descriptors, are illustrated in Table 3. The Chi-square distance is generally used with hand-crafted descriptors, as it works well with histograms, whereas the feature vector of the proposed descriptor is not in the form of a histogram. Though the Canberra distance is better suited to measuring the distance between two general vectors (rather than histograms), the Chi-square (‘Chisq’) distance measure is used in the rest of this paper for fair comparison with the state-of-the-art hand-crafted (i.e., histogram-based) descriptors.
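For reference, minimal sketches of the five compared distances are given below, written in their common forms (the epsilon terms guarding zero denominators are our additions):

```python
import numpy as np

eps = 1e-12  # guards zero denominators in Cosine, Canberra and Chi-square
euclidean = lambda q, d: np.sqrt(np.sum((q - d) ** 2))
manhattan = lambda q, d: np.sum(np.abs(q - d))
cosine    = lambda q, d: 1 - q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + eps)
canberra  = lambda q, d: np.sum(np.abs(q - d) / (np.abs(q) + np.abs(d) + eps))
chisquare = lambda q, d: 0.5 * np.sum((q - d) ** 2 / (q + d + eps))
```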

The t test values for the results of Table 3 are shown in Table 4. It can be seen that the Chi-square distance has the maximum t values compared to all other distances. The performance of the Canberra distance is also close to that of Chi-square, as suggested by the smallest t value between them. This experiment confirms the choice of the Chi-square distance for the proposed biomedical image retrieval framework.

Fig. 8

The results in terms of the ARP versus number of retrieved images by applying the proposed architecture over AlexNet [23], VGG16 [48], GoogleNet [54] and ResNet50 [21] models. Here, AlexNet, VGG16, GoogleNet and ResNet50 represent the features obtained by applying ReLU over ‘softmax’ layer. The LBpDAD, LBpDVD, LBpDGD and LBpDRD refer to the features obtained by applying ReLU over ‘softmax’ layer in the proposed architecture corresponding to the AlexNet, VGG16, GoogleNet and ResNet50 models

4.5 Performance analysis using other CNN models

In this experiment, we analyze the suitability of the proposed approach with other widely adopted CNN models, namely ‘Vgg16’ [48], ‘GoogleNet’ [54] and ‘ResNet50’ [21]. The pre-trained weights of these models available in MATLAB are used, and the ‘class score’ features are considered. Similar to AlexNet, the original features of these models are referred to as Vgg16, GoogleNet and ResNet50. Similar to LBpDAD, the local bit-plane decoded CNN descriptors for the ‘Vgg16’, ‘GoogleNet’ and ‘ResNet50’ models are denoted by LBpDVD, LBpDGD and LBpDRD, respectively. The retrieval results in terms of the ARP (%) versus the number of retrieved images are displayed in Fig. 8 for the proposed LBpDAD, LBpDVD, LBpDGD and LBpDRD descriptors corresponding to the ‘AlexNet’, ‘Vgg16’, ‘GoogleNet’ and ‘ResNet50’ models, respectively. Note that the feature dimension is 1000 for all these descriptors. The results of the local bit-plane decoded CNN descriptors fused at the ‘class score’ layer are compared with the original CNN features obtained at the ‘class score’ layer in Fig. 8. All features are passed through the ReLU operator before use. It is observed through this experiment that the proposed approach is well suited to the ‘AlexNet’, ‘Vgg16’, ‘GoogleNet’ and ‘ResNet’ models over the OASIS-MRI and TCIA-CT databases. In the case of the HeLa database, the performance of the LBpDAD and LBpDGD features is better than that of the AlexNet and GoogleNet features. In general, ‘ResNet50’ is more discriminative than ‘AlexNet’, ‘Vgg16’ and ‘GoogleNet’, because the last layer features of ‘ResNet50’ are generated through deep hierarchical transformations.

5 Conclusion

A local bit-plane decoding and convolutional neural network (CNN)-based architecture is proposed in this paper to produce image descriptors. The introduced approach fuses, at a particular layer of the CNN, two feature vectors computed from the raw image and the local bit-plane decoded map, using the maximum fusion strategy; all features are passed through the ReLU operator before fusion. The proposed LBpDAD descriptor corresponding to the ‘AlexNet’ model is tested in an image retrieval framework over three biomedical databases of different modalities. It is noted that the proposed descriptor outperforms the state-of-the-art biomedical image descriptors. It is also observed that the performance at the ‘FC6’ layer is generally better than at the ‘FC7’ and ‘class score’ layers, and that the performance of the fused features is better than that of the individual features. Another observation points out that the ‘Product’-based fusion strategy is more suitable in the proposed architecture. As per the experimental results using different distances, the ‘Canberra’ distance measure is more appropriate. Favorable observations are also made for the proposed architecture with different CNNs such as ‘AlexNet,’ ‘Vgg16,’ ‘GoogleNet’ and ‘ResNet50’. The experiments and analysis support the proposed local bit-plane decoding-based CNN descriptor in terms of improved retrieval performance over biomedical images.