1 Introduction

Interpreting musculoskeletal X-rays is a critical aspect of orthopedic care, as bone and joint problems affect an estimated 1.7 billion people worldwide [1]. However, accurate interpretation is time-consuming and resource-intensive [2], and the accident and emergency (A&E) department in particular requires timely and accurate clinical observations [3]. Prior approaches based on fully supervised learning [4, 5] have performed well in classifying and detecting abnormalities. However, these methods rely on precise expert annotations, which are a significant limiting factor because in many real-world scenarios such labels are difficult, expensive, or time-consuming to obtain [6]. Moreover, supervised models are black boxes: the difficulty of interpreting and explaining their decision-making process hinders widespread clinical adoption [7]. Providing visual evidence of the decision-making process, by localizing the informative image regions that correspond to the target, is one way to examine the black-box nature of a supervised model [7, 8].

Existing methods for unsupervised feature localization [9] rely on substantial amounts of unannotated data. Weakly supervised learning offers an alternative that uses only image-level labels to train classification and localization models [10, 11]. The class activation map (CAM) [12] is one of the most widely used approaches for weakly supervised object localization (WSOL) [13,14,15]. To create a CAM, convolutional neural network (CNN) feature maps are globally average pooled and passed through a fully connected (FC) layer with softmax activation to generate class probabilities; the FC weights of the target class are then used to weight the feature maps and produce the final localization map [12]. CAM identifies the most discriminative regions of an input image that are associated with a specific object class [16]. However, CAM-based methods have been criticized for their lack of robustness due to sensitivity to the background [17], difficulty in localizing co-occurring object classes [18], limited expressiveness, and susceptibility to variations in object appearance [19]. Additionally, global average pooling (GAP) introduces bias by assigning a higher weight to feature maps with smaller activated areas, which further contributes to CAMs’ shortcomings.

In this work, we propose a novel weakly supervised multi-stage (WSMS) learning network to address the challenges of object localization. Our approach builds on a weakly supervised object detection method [12], which has previously shown improved performance using data labeled with predefined classes. WSMS consists of three stages that together aim to yield a more effective and robust model for classification and weakly supervised object localization. In each stage, a CNN with shared weights encodes the image into a feature representation; a squeeze-and-excitation (SE) attention module [20] then computes attention maps from the CNN output, followed by a fully connected (FC) layer. WSMS localizes the informative region or object in a bounding box by ensembling the normed feature map and the weighted attention map, each binarized by thresholding at its mean. These two maps contain the structure of the informative region or object, as shown in Fig. 1.

Fig. 1 Components of the class activation map (CAM), normed feature map (\(f\)) and weighted attention map (\(\mathcal{F}\)) from WSMS

In the context of abnormality localization, the informative region or object in an input image is used to estimate the different discriminating regions based on scalar values obtained from the attention map. These discriminating regions are then used to localize the abnormality within a bounding box, as illustrated in Fig. 1. This approach enables the accurate and efficient detection of abnormal regions in medical images. Our approach is illustrated by the clear localization results in Fig. 1. A multi-stage neural network with shared parameters increases the robustness of object classification, improves feature map generalization, and enhances the clarity of CAM and localization results. This approach can also be used to check the black-box nature of the model, by providing visual evidence of the decision-making process and localizing informative regions in images corresponding to the target. Overall, multi-stage neural networks with shared parameters hold significant promise for improving the performance of object localization and classification, while also providing valuable insights into the black-box nature of the model. The potential impact of our approach on orthopedic emergency room diagnosis is significant, promising increased accuracy and reduced turnaround times.

To the best of our knowledge, this paper represents the first attempt to propose weakly supervised learning for abnormality detection from X-ray images. The key contributions of this research are summarized as follows:

  1. Proposing a novel weakly supervised approach to address abnormality detection in X-ray images using a multistage attention map learning approach that not only classifies the dataset but also provides a bounding box region of interest.

  2. An attention module, SE-based attention map, is introduced for feature map learning to localize the informative region in different branches. Each branch is only supervised by the image label, thus avoiding the need for costly pixel-level annotations.

  3. A comprehensive evaluation is conducted on a large dataset of X-ray images with four different abnormalities. The proposed model achieves promising classification results with excellent detection visualization. Furthermore, it outperforms SOTA models on two publicly available benchmarks for classification and localization by a significant margin, demonstrating the effectiveness of the proposed approach.

The rest of this paper is structured to guide the reader through our research process step-by-step. Section 2 reviews the relevant literature, setting the stage for the discussion of our methodology. In Section 3, we explain the details of our proposed WSMS framework, including its underlying concepts and how it works. Section 4 presents our findings and provides a critical analysis of the results. Section 5 discusses our findings in the broader context of X-ray image analysis. Finally, Section 6 summarizes the key contributions of our research and suggests potential future directions.

2 Related work

The research on bone X-ray image classification and abnormality detection can be broadly divided into two areas: weakly supervised object localization and sophisticated abnormality localization. Early approaches for detecting abnormalities involved calculating neck-shift angles, Gabor analysis, and gradient-based intensity methods [21]. Later, classifiers were combined using majority voting schemes, Bayesian theory [22], AdaBoost with classifier weighting based on performance [23], and probabilistic combination [24]. With the emergence of CNNs in machine learning, researchers have applied deep learning to medical image analysis, including musculoskeletal X-ray images. Previous studies have demonstrated that deep learning (DL) models can effectively detect abnormalities in medical images using supervised learning methods for classification and bounding box localization [3]. However, it has been shown that pre-training deep CNN on non-medical images can be a viable alternative for abnormality detection [25]. Previous research on bone abnormality detection has placed less emphasis on ensuring the model's generalizability, with a greater focus on specific bone regions [26].

Researchers have used various techniques to extract features from pelvic CT and X-ray images for abnormality detection [27,28,29]. Adaptive filtering [30], boundary tracing [31], and wavelet transform [32] were applied to pelvic CT images, and an active shape model was developed for abnormality detection. Authors from [26] used stacked random forests based on feature fusion to detect abnormalities in X-ray images. Mathematical morphology has also been widely used for bone abnormality detection. Previous methods have considered the entire image to determine whether it is abnormal [33, 34], but they cannot localize the abnormality to a specific bone region. To address this challenge, researchers have started working with annotated bounding boxes of the abnormality as ground truth and segmenting the region of interest using techniques like entropy-based segmentation [35, 36]. Models such as ResNet [37] and Faster R-CNN [38] have been developed for abnormality detection and medical image segmentation, using annotations and bounding boxes in training. The field of weakly supervised object localization (WSOL) is a crucial area of deep learning as it reduces the time needed for user annotations by training methods to localize objects using trained classifiers. Many approaches have been developed in the past for learning object detectors with weak supervision in various problems. OXnet [39] shows promise as a feasible and general solution for real-world applications by leveraging as much available supervision as possible.

For instance, Zhou et al. [12] introduced CAM to localize the region of interest at the target level via GAP; it often fails to localize the full object because GAP biases the result towards small activation areas by giving them a higher weight. Recent methods such as POSL [40] and SPOL [41] use two separate networks for localization and classification. MMAL-net [42] localizes without adding any additional parameters, and [16] bridges the gap between classification and localization by adding new parameters to the objective. Weakly supervised learning is widely applied to X-ray images [39, 43]. Most recent work has shown the potential of full object localization in a weakly supervised manner [44, 45]. Our method aims to improve the model’s robustness by sharing parameters across the stages and localizing based on the feature map before GAP.

In Table 1 we provide a succinct overview of different weakly supervised object localization (WSOL) methods, detailing their operational principles and respective drawbacks. It contrasts approaches such as Class Activation Mapping (CAM) which targets highly discriminative features, with techniques like Divergent Activation for WSOL (DANet) that also consider less discriminative regions, aiming to enhance localization. The table also highlights the limitations each method faces, particularly in achieving accurate object localization without relying on extensive manual supervision.

Table 1 This table compares various WSOL methods discussed in literature, highlighting their unique approaches and inherent limitations, especially in achieving precise object localization without extensive supervision

3 Methodology

3.1 Overview of CAM decomposition

Given an image \(X\) of size \(C\times H\times W\), we would like a representation that captures approximately all of the image’s information needed to classify it. Typically, a neural network comprising convolutional layers followed by average pooling and a fully connected (FC) layer for classification is used, and the CAM is computed as follows:

$$CAM\left(X\right)={W}_{cl}^{T}F\left(X\right).$$
(1)

where \(F:\mathbb{R}^{C\times H\times W}\rightarrow \mathbb{R}^{n\times h\times w}\) represents the feature map before average pooling, with spatial dimensions \((h, w)\) and \(n\) channels, and \({W}_{cl}\in {\mathbb{R}}^{n}\) are the weights of the FC layer corresponding to the target class \(cl\). The authors of [16] bridge the gap between classification and localization by decomposing CAM in terms of a cosine similarity map as follows:

$$CAM\left(X\right)=\Vert {W}_{cl}\Vert \,\Vert F\left(X\right)\Vert \cos\theta \le \Vert {W}_{cl}\Vert \,\Vert F\left(X\right)\Vert ,$$
(2)

where \(\cos \theta\) is the cosine similarity between the two vectors \({W}_{cl}\) and \(F\left(X\right)\) at each spatial position. A larger value of \(\cos \theta\) indicates a higher degree of alignment, while a smaller value suggests a lesser degree of alignment. We define a weighted feature space map (\(\mathcal{F}\)) that corresponds to every target class as follows:

$$\mathcal{F}=\Vert W\cdot {F}_{A}\left(X\right)\Vert ,$$
(3)

where \(W\in {\mathbb{R}}^{n\_class\times n}=[{W}_{1},\dots ,{W}_{cl},\dots ,{W}_{n\_class}]\) and the norm is taken over all the classes. As Fig. 1 shows, CAM alone cannot localize the full informative object corresponding to its class, because it learns the differences between classes, which leads to poor localization. However, the normed feature map \(\Vert F\left(X\right)\Vert\) and the weighted norm feature map \(\mathcal{F}\) contain more information for localizing the object corresponding to its class. The object can be localized based on \(\mathcal{F}\) and \(f\), where \(f=\Vert F\left(X\right)\Vert\). A position \((i, j)\), with \(i\in \left[0,h-1\right]\) and \(j\in \left[0,w-1\right]\), is treated as part of the object to be localized if \(f\left(i,j\right)\) is higher than the mean \(\overline{f}\) of the normed feature map. Mathematically:

$$\overline{{f} } =\frac{{\sum }_{i=0}^{h-1}{\sum }_{j=0}^{w-1}{f}\left(i,j\right)}{h \times w}$$
(4)
$$\widehat{{ M}_{\left(i,j\right)}}=\left\{\begin{array}{l}1\;\;if {f}\left(i,j\right)>\overline{{f}}\\ 0\;\;else \end{array}\right.$$
(5)

where \(\widehat{{M}_{\left(i,j\right)}}\) is the possible area of the object to localize. The final possible area is obtained from the ensemble of \(\mathcal{F}\) and \(f\). Once the informative region is localized, the cropped image lets the model look closer at that region and provides additional insight into the image. This requires no additional parameters, as it is based on the trained classifier model. As Fig. 1 shows, the areas with higher values of \(f\) are where the key parts are located, whereas CAM most often highlights the bone joints, which may be incorrect for tumor identification. We divide the image into overlapping windows and classify each window as foreground (marked as 1) or background (marked as 0) using Eq. (5). The overlapping windows are moved across the image, allowing the classifier to process multiple regions of the image in a sliding manner.
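As an illustration of Eqs. (1), (2), (4) and (5), the following minimal Python sketch computes a CAM, the normed feature map \(f\) and the mean-threshold mask for a toy feature map. Plain nested lists stand in for tensors, and the function names and toy values are assumptions made for this example; it is not the WSMS implementation.

```python
import math

def cam(features, w_cl):
    """Eq. (1): CAM(i, j) = sum_k w_cl[k] * F[k][i][j]."""
    n, h, w = len(features), len(features[0]), len(features[0][0])
    return [[sum(w_cl[k] * features[k][i][j] for k in range(n))
             for j in range(w)] for i in range(h)]

def normed_map(features):
    """f(i, j): L2 norm over the channels at each spatial position."""
    n, h, w = len(features), len(features[0]), len(features[0][0])
    return [[math.sqrt(sum(features[k][i][j] ** 2 for k in range(n)))
             for j in range(w)] for i in range(h)]

def mean_threshold(f):
    """Eqs. (4)-(5): 1 where f exceeds its spatial mean, 0 elsewhere."""
    h, w = len(f), len(f[0])
    mean = sum(sum(row) for row in f) / (h * w)
    return [[1 if value > mean else 0 for value in row] for row in f]

# Toy feature map (hypothetical values): n = 2 channels, h = w = 2.
F = [[[3.0, 0.0], [0.0, 1.0]],
     [[4.0, 0.0], [0.0, 0.0]]]
w_cl = [1.0, 0.5]  # hypothetical FC weights of the target class

cam_map = cam(F, w_cl)
f = normed_map(F)
mask = mean_threshold(f)
norm_w = math.sqrt(sum(v * v for v in w_cl))  # ||W_cl|| in Eq. (2)
```

In this toy case the mean of \(f\) is 1.5, so only the top-left position is marked as foreground, and every CAM value stays below \(\Vert {W}_{cl}\Vert \,f(i,j)\), as Eq. (2) requires.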

3.2 WSMS network

Our proposed model is inspired by fine-grained (FG) visual classification models. The architecture of Weakly supervised learning-based abnormality detection is shown in Fig. 2A.

Fig. 2 A The full architecture of our proposed network consists of three branches; across branches, blocks with the same shape and color share parameters. The attention block calculates the attention map (AM) from the global average pooling by computing attention weights and multiplying (Mul) them with the feature map. B1, B2, and B3 represent the loss in each branch from the fully connected layer. B The attention block consists of two fully connected (FC) layers activated by ReLU and sigmoid, respectively

The proposed network consists of three stages. Each stage has an encoder (En) that encodes the image; the encoder is composed of a convolutional neural network producing the feature map \(F(X)\), followed by a nonlinear rectified linear unit (ReLU) activation. A squeeze-and-excitation (SE) block is then connected to the activation function to provide an attention map (AM) for the following stage. Each stage is supervised at the image-label level, so a fully connected layer is added after the SE module, and a revised focal loss based on binary cross entropy is used as the objective function. The stages of the proposed architecture are explained in detail below.

3.2.1 Main stage

In this branch, we encode the image using En to obtain the feature representation maps \(f\) and \(\mathcal{F}\). The coordinates of the bounding box are generated from the intersection of the informative regions given by \(f\) and \(\mathcal{F}\), which is shown in Fig. 2B. In this branch, the full-feature AM is used for classification through the FC layer. Furthermore, the hypothesized object inside the box is cropped from the image and passed to the object branch.

3.2.2 Object stage

The cropped image (local image) is passed through the encoder, whose parameters are shared with the main branch, to obtain its feature representation for classification. Then, based on the local normed map \(\left(f_{L}\right)\) obtained from the second-stage En for the local image, we divide the image into overlapping windows and classify each window as foreground or background. The overlapping windows are moved across the image, and the cropped regions are passed to the third branch. Those regions cover the most informative parts.

3.2.3 Parts classification and multi-scale stage

The object stage provides the discriminative regions in the local image by aggregating each window’s normed map \(f_{W}\) over the channels and taking its mean value \(\overline{f_{W}}\) as follows:

$$\overline{{f}_{W}} =\frac{{\sum }_{i=0}^{{h}_{W}-1}{\sum }_{j=0}^{{w}_{W}-1}{f}_{W}\left(i,j\right)}{{h}_{W} \times {w}_{W}}$$
(6)

where \({h}_{W} \times {w}_{W}\) is the spatial dimension of the window’s normed map. We rank the windows by \(\overline{f_{W}}\), since a higher value indicates more information; in Fig. 2A the windows are colored red (highest \(\overline{f_{W}}\)), orange, yellow, and green. In this stage, the cropped regions from the object branch are classified using the shared encoder parameters, shown in Fig. 2A by the same-colored CNN, which also improves the robustness of the overall model. These cropped parts are combined to produce one possible box for the abnormality.
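The window ranking described by Eq. (6) can be sketched in plain Python as follows. The stride, the function name and the toy normed map are assumptions made for illustration only; the actual window sizes used in the experiments are given in the implementation details.

```python
def window_scores(f, hw, ww, stride=1):
    """Mean of f over every hw x ww window (Eq. 6).

    Returns (score, top, left) tuples sorted by decreasing score.
    """
    h, w = len(f), len(f[0])
    scores = []
    for top in range(0, h - hw + 1, stride):
        for left in range(0, w - ww + 1, stride):
            total = sum(f[i][j]
                        for i in range(top, top + hw)
                        for j in range(left, left + ww))
            scores.append((total / (hw * ww), top, left))
    return sorted(scores, reverse=True)

# Toy local normed map f_L (hypothetical values).
f_local = [[0, 0, 0, 0],
           [0, 4, 4, 0],
           [0, 4, 8, 0],
           [0, 0, 0, 0]]

ranked = window_scores(f_local, hw=2, ww=2)
best_score, best_top, best_left = ranked[0]
```

Here the 2 × 2 window whose top-left corner is at (1, 1) has the highest mean and would be ranked first.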

We optimize the overall loss over the training images for all stages, with the shared parameters shown by the same-colored CNN in Fig. 2A, using three types of loss function. Due to the class imbalance problem, we use the weighted cross entropy defined in Eq. (7).

$$C{E}_{t}^{j}\left(P\left(\theta \right), Q\right)=-{\sum }_{{p}_{i}\in P }{\sum }_{j\in T}{w}_{+}^{j} {q}_{i}\text{ln}{\mathbb{P}}\left({q}_{i}=1|{p}_{i}\right)-{\sum }_{{p}_{i}\in P }{\sum }_{j\in T}{w}_{-}^{j} \left(1-{q}_{i}\right)\text{ln}{\mathbb{P}}\left({q}_{i}=0|{p}_{i}\right)$$
(7)

where

  • \(P\left(\theta \right)=\left\{{p}_{1},{p}_{2},\dots ,{p}_{n}\right\}\) are the predicted labels.

  • \(Q = \left\{{q}_{1}, {q}_{2},\dots , {q}_{n}\right\}\) are the labels for the corresponding instances in \(P\left(\theta \right)\).

  • \({\mathbb{P}}\left({q}_{i}=1|{p}_{i}\right)\) is the predictive probability, \(\forall {q}_{i}\in \left\{0,1\right\}\), conditioned on \(P\left(\theta \right)\).

  • \(T\) is a set of all the abnormality parts.

  • \({w}_{+}^{j}\) is the weight for all abnormal classes of part type j ∈ T.

  • \({w}_{-}^{j}\) is the weight for the normal class of part type j ∈ T.
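A minimal sketch of a weighted binary cross entropy for a single part type is given below. It assumes the standard form in which the abnormal term is weighted by \({q}_{i}\) and the normal term by \(1-{q}_{i}\). The weight values are also an assumption: they are set here by inverse class frequency, using the 26.7% fractured share of the Hand training set as an example, since the text does not state how \({w}_{+}\) and \({w}_{-}\) are chosen.

```python
import math

def weighted_cross_entropy(probs, labels, w_pos, w_neg, eps=1e-12):
    """Weighted binary cross entropy summed over instances.

    probs[i]  : predicted probability that instance i is abnormal
    labels[i] : 1 for abnormal, 0 for normal
    """
    total = 0.0
    for p, q in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= (w_pos * q * math.log(p)
                  + w_neg * (1 - q) * math.log(1.0 - p))
    return total

# Assumed inverse-frequency weights for a set with 26.7% abnormal images.
w_pos, w_neg = 1.0 - 0.267, 0.267

loss = weighted_cross_entropy([0.9, 0.2], [1, 0], w_pos, w_neg)
```

With such weights, an error on the rarer abnormal class costs more than the same error on the normal class.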

To focus on misclassified examples arising from the class imbalance problem, we use the focal loss defined in Eq. (8):

$$F{L}_{t}\left(\theta \right)=\frac{1}{N}{\sum }_{\text{i}=1}^{\text{N}}{\left(1-{\text{e}}^{-C{E}_{t}^{i}\left(\theta \right)}\right)}^{\beta }C{E}_{t}^{i}\left(\theta \right)$$
(8)

where \(C{E}_{t}\left(\theta \right)\) is the cross entropy at the \({t}^{th}\) step with shared parameters \(\uptheta\), defined in Eq. (7), and \(\upbeta\) is the focal loss hyperparameter. As Eq. (8) shows, obtaining the loss for the \({t}^{th}\) step requires a pass over the entire mini-batch to compute the mean, which increases the computational cost. So instead of the mean, we use a scalar \(\alpha\) and rewrite Eq. (8) as

$$F{L}_{t}\left(\theta \right)=\alpha {\left(1-{\text{e}}^{-C{E}_{t}\left(\theta \right)}\right)}^{\beta }C{E}_{t}\left(\theta \right)$$
(9)

So, we sum the losses of all stages to get the total loss:

$${L}_{t}^{total}\left(\uptheta \right)=F{L}_{t}^{B1}\left(\uptheta \right)+ F{L}_{t}^{B2}\left(\uptheta \right)+F{L}_{t}^{B3}\left(\uptheta \right)$$
(10)

where \(B1\), \(B2\), and \(B3\) in \(F{L}_{t}\left(\theta \right)\) denote the losses of the first, second, and third branches, respectively. For \(N\) part images, the third-branch loss is defined as the sum of the focal losses of the individual parts:

$$F{L}_{t}^{P}\left(\uptheta \right)={\sum }_{i=0}^{N-1}F{L}_{t}^{\text{p}i}\left(\theta \right)$$
(11)
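Eqs. (9) to (11) can be sketched as below. The values of \(\alpha\) and \(\beta\) and the cross-entropy inputs are placeholders chosen for this example, since the text does not report them, and the function names are hypothetical.

```python
import math

def focal_loss(ce, alpha=1.0, beta=2.0):
    """Eq. (9): FL = alpha * (1 - exp(-CE))**beta * CE."""
    return alpha * (1.0 - math.exp(-ce)) ** beta * ce

def total_loss(ce_main, ce_object, ce_parts, alpha=1.0, beta=2.0):
    """Eq. (10), with the third-branch term summed over parts (Eq. 11)."""
    fl_b1 = focal_loss(ce_main, alpha, beta)
    fl_b2 = focal_loss(ce_object, alpha, beta)
    fl_b3 = sum(focal_loss(ce, alpha, beta) for ce in ce_parts)
    return fl_b1 + fl_b2 + fl_b3

# Placeholder cross-entropy values for one training step.
ce_main, ce_object = 0.7, 0.4
ce_parts = [0.9, 0.2, 0.1]

loss = total_loss(ce_main, ce_object, ce_parts)
```

Since \({\text{e}}^{-CE}\) is the probability assigned to the correct outcome for a single example, the factor \({\left(1-{\text{e}}^{-CE}\right)}^{\beta }\) shrinks the contribution of well-classified examples and leaves hard examples almost unchanged.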

3.3 Attention map module

A CNN encodes the image into a feature map from which we extract high-level features. The CNN output, activated by ReLU, is then passed through the attention map module shown in Fig. 2B, which computes an attention weight for each channel \(F(X)\). Since each \(F(X)\) in \(F\) is produced by a separate kernel, we assume that the CNN may generate activation maps with unnecessary values across the feature maps \(F(X)\). Our goal is to exploit the relationships between the \(F(X)\) by scaling each channel according to the quality of the representation produced by the CNN. To this end, we employ an SE block, which aims to improve the overall network’s performance by emphasizing the features that result in better localization. First, we apply global weighting to the \(F(X)\) of a given feature map to highlight the most essential features and suppress the less important ones. As a result, the encoder can focus on the prominent features of the image, resulting in improved performance. Additionally, attention is intended to increase WSMS’s modeling capacity and to introduce non-linearity between \(F(X)\) and the attention map (AM) outputs. For \(F\in {\mathbb{R}}^{C\times h\times w}\), the attention map module reduces \(F\) to \(s\in {\mathbb{R}}^{C}\) through global average weighting on each \(F(X)\). Thereafter, \(s\) is passed through a fully connected layer activated by ReLU, which reduces the number of features from \(C\) to \(\frac{C}{r}\), where \(r\) is the reduction ratio, followed by a second fully connected layer that computes the \(C\) important features corresponding to the image label. The steps are as follows:

$${S}{\prime}=ReLU\left({W}_{1}\cdot s+{B}_{1}\right)$$
(12)
$$S=\sigma \left({W}_{2}\cdot {S}{\prime}+{B}_{2}\right)$$
(13)

where \({W}_{1}\in {\mathbb{R}}^{\frac{C}{r}\times C}\) and \({B}_{1}\in {\mathbb{R}}^{\frac{C}{r}}\) are the weights and biases of the first fully connected (FC) layer, activated by ReLU, and \({W}_{2}\in {\mathbb{R}}^{C\times \frac{C}{r}}\) and \({B}_{2}\in {\mathbb{R}}^{C}\) are the weights and biases of the second FC layer, activated by the sigmoid \(\sigma\). The first FC layer consists of \(\frac{C}{r}\) nodes, where information is squeezed by a reduction ratio of \(r\). Finally, a tensor \(Z\) with the same spatial dimensions as \(F\) is generated as follows:

$$Z=S\cdot F\left(X\right)$$
(14)

where \(\cdot\) denotes element-wise multiplication and \(Z\in {\mathbb{R}}^{C\times h\times w}\) is the final tensor containing the recalibrated attention map.
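The squeeze, excitation and rescaling steps of Eqs. (12) to (14) can be sketched in plain Python as follows. The toy feature map and the weight values are hypothetical (C = 2 channels and r = 2, so the hidden layer has a single node), and nested lists stand in for tensors.

```python
import math

def se_attention(F, W1, B1, W2, B2):
    """Squeeze-and-excitation on F of shape C x h x w (nested lists)."""
    C, h, w = len(F), len(F[0]), len(F[0][0])
    # Squeeze: global average of each channel, giving s of length C.
    s = [sum(sum(row) for row in F[c]) / (h * w) for c in range(C)]
    # Eq. (12): first FC layer (C -> C/r) followed by ReLU.
    hidden = [max(0.0, sum(W1[k][c] * s[c] for c in range(C)) + B1[k])
              for k in range(len(W1))]
    # Eq. (13): second FC layer (C/r -> C) followed by sigmoid.
    S = [1.0 / (1.0 + math.exp(-(sum(W2[c][k] * hidden[k]
                                     for k in range(len(hidden))) + B2[c])))
         for c in range(C)]
    # Eq. (14): rescale every channel of F by its attention weight.
    Z = [[[S[c] * v for v in row] for row in F[c]] for c in range(C)]
    return S, Z

F = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
W1, B1 = [[1.0, 0.0]], [0.0]           # shape (C/r) x C
W2, B2 = [[0.0], [2.0]], [0.0, 0.0]    # shape C x (C/r)

S, Z = se_attention(F, W1, B1, W2, B2)
```

Each entry of \(S\) lies in (0, 1), so \(Z\) keeps the spatial layout of \(F\) while down-weighting the less informative channels.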

3.4 Feature localization and amplification

The normed feature map \(f\) of the full image is binarized by mean thresholding, as shown in Eq. (5). In the resulting binary map, pixels with equal values are connected according to their neighborhood; here, connectivity refers to the number of orthogonal hops within which a pixel is considered a neighbor, and all connected regions are returned with the same label. We also find the corresponding area based on the weighted feature map \(\mathcal{F}\) and select the intersecting region to produce a bounding box. If the intersecting area is zero, we assign a default bounding box of width \(w\) and height \(h\) (the spatial dimensions of each feature map). Finally, we select the region that covers the maximum activation area and upsample it bilinearly to look closer when localizing the informative region within the X-ray.
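One possible reading of this procedure is sketched below: the two binary maps are intersected, the largest 4-connected region (one orthogonal hop) is kept, and its bounding box is returned in feature-map coordinates, with the whole map as the default when the intersection is empty. The order of operations, the choice of the largest region, and all names are assumptions made for this example.

```python
from collections import deque

def intersect(mask_a, mask_b):
    """Element-wise AND of two binary maps of the same size."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]

def largest_component_bbox(mask):
    """(top, left, bottom, right) of the largest 4-connected region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] != 1 or seen[i][j]:
                continue
            queue, region = deque([(i, j)]), []
            seen[i][j] = True
            while queue:  # breadth-first search over orthogonal neighbors
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    if not best:  # empty intersection: default box covers the whole map
        return (0, 0, h - 1, w - 1)
    ys = [y for y, _ in best]
    xs = [x for _, x in best]
    return (min(ys), min(xs), max(ys), max(xs))

# Hypothetical binary maps obtained from f and from the weighted map.
mask_f = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
mask_w = [[0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]

bbox = largest_component_bbox(intersect(mask_f, mask_w))
```

The resulting box would then be scaled from feature-map coordinates to image coordinates before cropping.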

3.5 Informative regions localization

Although the cropped local image contains the informative region with good probability, the idea is to localize the key parts of the image. We use the feature map to search for the areas with higher activation, which indicate the locations of critical parts in the local (cropped) image. We therefore extract the feature map of the cropped image using the shared parameters of En to obtain the normed feature map \(f_{L}\), and for each selected region (a window of height \({h}_{W}\) and width \({w}_{W}\)) we calculate a score by average pooling with a kernel size of \(\left({h}_{W},{w}_{W}\right)\). The window size \(\left({h}_{W},{w}_{W}\right)\) is a hyperparameter to be tuned for different types of problems. The basic idea in selecting the window sizes is to cover as many distinct parts as possible. Localization is based on the binary map obtained from the activation map of each window by thresholding at the window mean \(\overline{f_{W}}\) defined in Eq. (6). Non-maximum suppression (NMS) [38] is applied after scoring to select a fixed number of parts per image, so that the selected regions contain fewer redundant parts.
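The part selection step can be illustrated with a generic greedy NMS over scored windows. The IoU threshold, the number of kept parts and the candidate boxes below are assumptions for this sketch, as the text does not report the values used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(scored_boxes, iou_threshold=0.25, max_keep=None):
    """Greedy NMS: keep the best box, drop boxes overlapping a kept one."""
    kept = []
    for score, box in sorted(scored_boxes, reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
            if max_keep is not None and len(kept) == max_keep:
                break
    return kept

# Hypothetical windows scored by their mean normed activation.
candidates = [(5.0, (1, 1, 3, 3)), (3.0, (2, 1, 4, 3)),
              (3.0, (1, 2, 3, 4)), (1.0, (0, 0, 2, 2))]

parts = nms(candidates, iou_threshold=0.25, max_keep=2)
```

In this example the two windows that overlap the top-scoring window by one third are suppressed, so the second selected part comes from a different area of the image.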

3.6 Possible abnormality detection

As described in the informative region localization section, the larger the value of \({\overline{f}}_{W}\), the more information the part contains. We combine the \({w}_{n}\) windows to detect the abnormality: after sorting the windows by decreasing \({\overline{f}}_{W}\), each window is discounted so that windows with lower \({\overline{f}}_{W}\) contribute less.

The final window \(U\left(x, y, x+{w}_{w}, y+{h}_{w}\right)\) is then given by:

$$U\left(x, y, x+{w}_{w}, y+{h}_{w}\right)={\sum }_{i=1}^{{w}_{n}}{\gamma }^{i} U\left({x}_{i}, {y}_{i}, {x}_{i}+{w}_{wi}, {y}_{i}+{h}_{wi}\right)$$
(15)

where \(\gamma \in \left(0,1\right]\) and, for \({w}_{n}\) proposed windows, \(U\left({x}_{i}, {y}_{i}, {x}_{i}+{w}_{wi}, {y}_{i}+{h}_{wi}\right)\) gives the extent of the \(i\)-th window, \(\forall i\in \left[1,{w}_{n}\right]\). However, \(U\left(x, y, x+{w}_{w}, y+{h}_{w}\right)\) is a part localized in the local image, which was cropped from the original image and resized to the size of the original image; to map it back onto the original image, we propose the simple flowchart shown in Fig. 3.

Fig. 3 Visualization of the output from our proposed weakly supervised fine-grained detection model at each step. Crop is the first localized informative region, and Parts are the proposed regions in Crop that produce the Final Part. The last images show the abnormality in a white box, which reflects the Final Part on the input Image
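The sketch below illustrates one way to read Eq. (15) and the mapping back to the original image. Two points are assumptions made here and are not stated in the text: the discounted sum is divided by the sum of the weights \({\gamma }^{i}\) so that the combined box stays inside the image, and the mapping back is a linear rescaling from the resized crop to the crop's position in the original image. All names and values are hypothetical.

```python
def combine_windows(boxes, gamma=0.9):
    """Discounted combination of boxes sorted by decreasing mean score.

    Each box is (x0, y0, x1, y1). Box i gets weight gamma**i (i from 1);
    normalizing by the sum of the weights is an assumption of this sketch.
    """
    weights = [gamma ** i for i in range(1, len(boxes) + 1)]
    norm = sum(weights)
    return tuple(sum(wt * box[k] for wt, box in zip(weights, boxes)) / norm
                 for k in range(4))

def to_original(box, crop, resized_size):
    """Map a box from resized-crop coordinates to original-image coordinates.

    crop         : (x0, y0, x1, y1) of the crop in the original image
    resized_size : (width, height) the crop was resized to
    """
    cx0, cy0, cx1, cy1 = crop
    sx = (cx1 - cx0) / resized_size[0]
    sy = (cy1 - cy0) / resized_size[1]
    x0, y0, x1, y1 = box
    return (cx0 + x0 * sx, cy0 + y0 * sy, cx0 + x1 * sx, cy0 + y1 * sy)

# Two hypothetical part windows, best first.
final_equal = combine_windows([(0, 0, 2, 2), (2, 2, 4, 4)], gamma=1.0)
final_discounted = combine_windows([(0, 0, 2, 2), (2, 2, 4, 4)], gamma=0.5)

# A 224 x 224 crop at (100, 50) that was resized to 448 x 448.
mapped = to_original((100, 200, 300, 400), (100, 50, 324, 274), (448, 448))
```

With \(\gamma <1\) the combined box is pulled towards the highest-ranked window, while \(\gamma =1\) gives every window the same influence.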

4 Experiments results and analysis

4.1 Dataset description

Abnormality dataset

To evaluate our models, we curated a dataset comprising five types of abnormalities, namely fracture, tumor, dislocation, metal implant, and arthritis, along with normal bone images. The images were collected from various regions of the human body, with the majority of the fractured images sourced from the MURA dataset [1]. The other classes of this dataset were collected by the radiologist. In this work, we conduct experiments with two classes for training, i.e., normal and abnormal. The size of the training dataset is \(38.7k\), in which \(16.7k\) are abnormal images from the five abnormality types and \(16k\) are normal bone images; the test dataset consists of \(6k\) images, including \(1k\) abnormal images collected by the same radiologist. The entire dataset has X-ray images of both normal and abnormal cases.

HAND Fracture

Class imbalance is one of the most important problems for weakly supervised learning on medical images, and for our method in particular. To check the effectiveness of our model in a class-imbalanced setting, we use the Hand dataset from MURA [1]. This dataset consists of \(5.5k\) images in the training set, \(26.7\%\) of them fractured, and \(460\) images in the test set, \(41\%\) of them fractured.

CUB 200 2011

For FG classification and localization of objects, we use the CUB dataset to test the SOTA performance of our proposed model. The CUB-200-2011 dataset contains 11,788 images across 200 bird species categories, with annotations for part locations, attributes, and bounding boxes, suitable for supervised learning tasks involving fine-grained visual categorization.

Stanford Cars

The Stanford Cars dataset is a collection of images of vehicles, along with their associated labels, that was collected by researchers at Stanford University. It contains more than 16,000 images of 196 classes of cars, covering models made by various manufacturers. To check the SOTA performance of the proposed model on classification, we use the Stanford Cars (CAR) dataset.

4.2 Experimental setup

4.2.1 Implementation details

The input images were pre-processed to a size of \(448\times 448\) to obtain the augmented images for the first and second branches, as illustrated in Fig. 2A. The original image was cropped based on the coordinates from the first branch, and the input to the second branch was also scaled to a height and width of \(448\). For the part branch, all images were reshaped to a height and width of \(224\). We selected windows from a broad range of scale categories: \(\left\{\left[6\times 6, 7\times 5\right], \left[8\times 8, 6\times 10, 7\times 9\right], \left[10\times 10, 9\times 11, 8\times 12\right]\right\}\) and used 14 × 14 as the size of \(f\) in Eq. (4). The number of part images \({w}_{n}\) was set to \(7\), where \({w}_{1n}=2\), \({w}_{2n}=2\), and \({w}_{3n}=3\) were the numbers of windows taken from the three scale categories. Different baselines pre-trained on ImageNet were used as the backbone within the same-colored CNNs shown in Fig. 2A. During model training, no annotations other than the images' class labels were used. We optimized the loss using SGD, with an initial learning rate of \(1\times {10}^{-4}\) and a minibatch size of \(5\) on an RTX A4000 GPU. PyTorch was used as the codebase for implementation.

We use various performance measures including Area Under the Curve (AUC) [46], Cohen’s Kappa (KAPPA) [47], Matthews Correlation Coefficient (MCC) [48], and Accuracy. AUC measures the degree of separability between the positive and negative classes. AUC is calculated by plotting the Receiver Operating Characteristic (ROC) curve, which is a graph of the True Positive Rate (\({\text{TPR}}=\frac{\text{TP}}{{\text{TP}}+{\text{FN}}}\)) against the False Positive Rate (\({\text{FPR}}=\frac{\text{FP}}{{\text{FP}}+{\text{TN}}}\)) at various threshold settings, and then computing the area under this curve. Here we use the notation from the confusion matrix and represent TP (True Positive) as cases that are both Actual Positive and Predicted Positive, FN (False Negative) as cases that are Actual Positive but Predicted Negative, FP (False Positive) as cases that are Actual Negative but Predicted Positive, and TN (True Negative) as cases that are both Actual Negative and Predicted Negative. A higher AUC value indicates a better model performance. In discrete terms, AUC can also be approximated by summing the areas of trapezoids formed by the points on the ROC curve:

$${\text{AUC}}\approx {\sum }_{i=1}^{n-1}\left({\text{FPR}}_{i+1}-{\text{FPR}}_{i}\right)\left(\frac{{\text{TPR}}_{i+1}+{\text{TPR}}_{i}}{2}\right),$$

where \(i\) represents each point on the ROC curve and \(n\) is the total number of points. KAPPA is a statistical measure used to assess the agreement between two raters who each classify items into mutually exclusive categories. It accounts for the possibility of the agreement occurring by chance. The KAPPA is calculated using the observed agreement (\({P}_{o}\)) and the expected agreement (\({P}_{e}\)):

$$\text{KAPPA }=\frac{{P}_{o}-{P}_{e}}{1-{P}_{e}}$$

Here, \({P}_{o}\) represents the relative observed agreement among raters, while \({P}_{e}\) represents the hypothetical probability of chance agreement. With \(n\) the total number of cases, they are calculated as follows:

$$\begin{array}{l}{P}_{o}=\frac{{\text{TP}}+{\text{TN}}}{n}\\ {P}_{e}=\frac{\left({\text{TP}}+{\text{FN}}\right)\left({\text{TP}}+{\text{FP}}\right)+\left({\text{FP}}+{\text{TN}}\right)\left({\text{FN}}+{\text{TN}}\right)}{{n}^{2}}\end{array}$$

MCC is a balanced measure of the quality of binary classifications, taking into account true and false positives and negatives. It is especially useful when the classes are of different sizes. MCC is calculated with the formula:

$$\text{MCC}=\frac{\left({\text{TP}}\times {\text{TN}}\right)-\left({\text{FP}}\times {\text{FN}}\right)}{\sqrt{\left({\text{TP}}+{\text{FP}}\right)\left({\text{TP}}+{\text{FN}}\right)\left({\text{TN}}+{\text{FP}}\right)\left({\text{TN}}+{\text{FN}}\right)}}$$

Accuracy measures the proportion of correct predictions:

$$\text{Accuracy}=\frac{{\text{TP}}+{\text{TN}}}{{\text{TP}}+{\text{TN}}+{\text{FP}}+{\text{FN}}}$$
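The remaining measures of this section can be sketched in the same way. The following Python example builds ROC points by sweeping a threshold over the scores, applies the trapezoidal approximation of AUC, and evaluates MCC and Accuracy from confusion-matrix counts. All function names, the example labels and scores, and the choice to return 0 for MCC when the denominator vanishes are assumptions made for illustration.

```python
import math


def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by thresholding the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def trapezoid_auc(points):
    """Sum of trapezoid areas between consecutive ROC points."""
    return sum((x1 - x0) * (y1 + y0) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, fn, fp, tn):
    """Proportion of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels (1 = abnormal) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = trapezoid_auc(roc_points(labels, scores))
print(f"AUC = {auc:.3f}")
print(f"MCC = {mcc(40, 10, 5, 45):.3f}, Accuracy = {accuracy(40, 10, 5, 45):.2f}")
```

In this toy example, eight of the nine positive-negative score pairs are ordered correctly, which matches the trapezoidal AUC of 8/9.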

4.2.2 Baseline

We use ResNet18, ResNet34, ResNet50, ResNet101, and Inception [41], each pretrained on ImageNet21k, as backbones for extracting the feature maps.

4.3 Comparison with baseline network

We evaluated our model on the largest abnormality dataset against well-established baseline models for weakly supervised fine-grained detection and classification. The classification results for the performance measures discussed above are shown in Table 2. Our proposed model exceeds the baseline models by margins of more than 8.8% in AUC, 12.6% in KAPPA, 12.6% in MCC, and 3.9% in Accuracy.

Table 2 Comparison results on Bone Abnormality dataset with baseline models

4.4 Comparison with SOTA

To compare the model's performance against the state-of-the-art (SOTA), we conducted experiments on publicly available datasets for fine-grained (FG) classification. The results are reported in Table 3, which compares the performance of various models on the CUB and CAR datasets. Several key observations can be made. First, the choice of backbone architecture plays a significant role in the model's performance, with models using ResNet50 as the backbone consistently achieving high accuracy on both datasets. For instance, MAMC, DFL-CNN, TASN, LIO, BNT, ASD, and API-Net, all based on ResNet50, achieved accuracy above 86% on CUB and above 92% on CAR. Second, the proposed WSMS-Net, also based on ResNet50, achieved competitive accuracy of 89.2% on CUB and 95.0% on CAR, outperforming several SOTA models such as RA-CNN, KP, NTS-NET, and HBP. These results suggest that the WSMS-Net model effectively leverages the full feature map, leading to improved localization and classification performance.

Table 3 Comparison with SOTA models on their backbone. SOTA performance is highlighted by bold

Our proposed network outperforms the SOTA models on the CAR dataset by more than 0.1%. Specifically, our model achieved an improvement of over 0.4% compared to existing methods. However, compared with the very recent P2P-Net [63] on the CUB200 dataset, our model lags by just 0.3%. Finally, it is noteworthy that accuracy on the CUB dataset is generally lower than on the CAR dataset (for our model, 89.2% versus 95.0%), which indicates that CUB is the more difficult of the two benchmarks. The discriminative parts found in these datasets (see Fig. 4) can be valuable for further improving the performance of classification models.

Fig. 4

Crop localization and discriminative part localization. We use red, orange, yellow, and green to indicate the order of the windows ranked by \(\overline{{f}_{W}}\).

4.5 Localization interpretation

The Percentage of Correctly Localized (PCL) regions measures how often a localization method places its predicted box on the object or informative region with an intersection over union (IoU) of more than 50%. Using an ensemble of two ResNet-50 layers, the Attention Object Localization Module (AOLM) achieved a PCL of 85.1% after one epoch with an ImageNet21k-pretrained backbone. However, as training progressed, its PCL decreased to 71.1% because the CNN-based network focused increasingly on the most prominent regions. Our proposed model, which combines the attention and backbone modules, achieved the highest accuracy of 77.2%, surpassing recent weakly supervised methods (see Table 4). Its PCL also decreased as training progressed, to 73.4%, which is still better than that of the AOLM (see Table 4) [42].
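To make the PCL criterion concrete, the following minimal Python sketch computes the IoU of two axis-aligned boxes and the percentage of predictions whose IoU with the ground truth exceeds 50%. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the example boxes are assumptions made for illustration; the sketch also assumes exactly one predicted box and one ground-truth box per image.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def pcl(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of predicted boxes with IoU strictly above the threshold."""
    if len(pred_boxes) != len(gt_boxes) or not gt_boxes:
        raise ValueError("need one predicted box per ground-truth box")
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes) if iou(p, g) > threshold)
    return 100.0 * hits / len(gt_boxes)


# Hypothetical boxes: the first prediction overlaps well, the second does not.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 8), (5, 5, 15, 15)]
print(f"PCL = {pcl(preds, gts):.1f}%")
```

Here the first pair has an IoU of 0.8 and counts as correctly localized, while the second has an IoU of about 0.14 and does not.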

Table 4 Localization performance on CUB. “Yes” represents the training from scratch of the attention module (AM)

4.6 Visualizations

4.6.1 Localization

The second column of Fig. 4 shows the localization of informative parts for the X-ray abnormality, CUB, and CAR datasets based on the test images. The visualization demonstrates that our model accurately focuses on the informative region of the image without any information loss in the crop part.

4.6.2 Discriminative part

The Part column of Fig. 4 visualizes the discriminative parts located by our network. It displays the regions with the highest average activation values at different scales using red, orange, yellow, and green boxes, with the red box representing the highest average activation value. The most discriminative features, which agree with human perception, are found at the joints in the bone abnormality dataset, at the head and beak of birds, and at the headlights and front of cars.
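The ranking of windows by average activation can be sketched as follows. This Python example slides windows of several sizes over a 2D activation map, keeps the window with the highest mean activation per size, and orders the winners by that mean. The window sizes, the stride of one, the selection of a single window per scale, and all names are assumptions made for this sketch rather than a description of our exact procedure.

```python
def best_window(act, win_h, win_w):
    """Return (mean, top, left, win_h, win_w) of the most activated window."""
    rows, cols = len(act), len(act[0])
    best = None
    for top in range(rows - win_h + 1):
        for left in range(cols - win_w + 1):
            total = sum(act[r][c]
                        for r in range(top, top + win_h)
                        for c in range(left, left + win_w))
            mean = total / (win_h * win_w)
            if best is None or mean > best[0]:
                best = (mean, top, left, win_h, win_w)
    return best  # None if the window does not fit inside the map


def rank_windows(act, sizes):
    """Best window per size, sorted by mean activation (highest first)."""
    candidates = [best_window(act, h, w) for h, w in sizes]
    candidates = [c for c in candidates if c is not None]
    return sorted(candidates, key=lambda c: c[0], reverse=True)


# Hypothetical 4x4 activation map with a peak near the centre.
activation = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 9, 0],
    [0, 0, 0, 0],
]
ranked = rank_windows(activation, sizes=[(1, 1), (2, 2), (3, 3)])
colors = ["red", "orange", "yellow", "green"]
for color, (mean, top, left, h, w) in zip(colors, ranked):
    print(f"{color}: top={top}, left={left}, size={h}x{w}, mean={mean:.2f}")
```

In this toy map the smallest window on the peak ranks first, which mirrors the convention in Fig. 4 that the red box marks the highest average activation.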

4.7 Abnormality location

The results of the model on bone abnormality detection in X-ray images from the test dataset are depicted in Fig. 5. The white box in the figure indicates bone abnormalities such as bone tumors, joint dislocations, arthritis, and metal inside the bone. In the third column of the figure, a unique bone tumor abnormality identified by the radiologist is shown in the white box. The radiologist verified all abnormality detections from the test dataset. The model has demonstrated its capability to perform in complex cases that even a human expert found difficult to diagnose.

Fig. 5

The detected abnormality region is shown in the bounding box. The area in the fourth column indicates a bone tumor.

4.8 Comparison with weakly supervised CAM

In Fig. 6, we compare the CAM from the first branch of WSMS with a ResNet50 backbone against the CAM from a ResNet50 trained on its own. The CAM highlights the area of the image that contributes most to classification and confirms that our model trains smoothly. Drawing a bounding box from the CAM (WSMS) produces localization results similar to those of our proposed method, demonstrating the agreement between the two. However, CAMs cannot identify the regions responsible for errors in an image, which makes it difficult to determine what must be improved to increase accuracy, as is evident in the first and second images of Fig. 6. In contrast, our method provides a clear indication of the required bounding box, as depicted in Fig. 6. The figure also indicates that the model is more robust than normal CAM, which we attribute to multi-stage training. Our method inspects the image in the first branch, and the second branch then selects windows with different confidence levels. These windows have a low probability of being incorrect because the first branch ensures accuracy, and we combine all windows to form the final bounding box.
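One common way to draw a bounding box from a CAM is to threshold the map at a fraction of its maximum and enclose the cells that remain. The following Python sketch illustrates this idea. The relative threshold of 0.2, the function name `cam_to_bbox`, and the simplification of enclosing all above-threshold cells (rather than only the largest connected region) are assumptions made for this sketch and may differ from the procedure used to produce Fig. 6.

```python
def cam_to_bbox(cam, rel_threshold=0.2):
    """Enclose all CAM cells >= rel_threshold * max in one box.

    Returns (x_min, y_min, x_max, y_max) in cell coordinates with
    inclusive bounds, or None if the map has no positive activation.
    """
    peak = max(max(row) for row in cam)
    if peak <= 0:
        return None
    cut = rel_threshold * peak
    coords = [(r, c)
              for r, row in enumerate(cam)
              for c, value in enumerate(row)
              if value >= cut]
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(cols), min(rows), max(cols), max(rows))


# Hypothetical 4x4 CAM with a compact high-activation region.
cam = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.6, 0.9, 0.0],
    [0.0, 0.5, 1.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
print(cam_to_bbox(cam))
```

Because the box depends on a single global threshold, weak background responses just above the cut can enlarge it, which is one reason a CAM-derived box may be less stable than a box assembled from several confident windows.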

Fig. 6

Visual comparison with weakly supervised CAM

4.9 Ablation study

4.9.1 Effect of Attention module

In Fig. 7, the second and third columns display localization using the feature map and using the attention map assembled with the feature map, respectively, for the image in the first column. The fourth and fifth columns of Fig. 7 show the corresponding abnormality detection in the white box. The impact of attention maps (AM) on classification performance is presented in Fig. 8. Although the localization results are similar in both cases, AM-based localization focuses strongly on the weighted feature map, which yields more accurate results, as seen in the fifth column and verified by the radiologist. The use of AM therefore improves not only localization and abnormality detection accuracy but also classification performance (Fig. 8). AM outperforms F(X)-based classification by a significant margin (17.8% in AUC, 23% in KAPPA, 23% in MCC, and 6.3% in accuracy), which we attribute to its emphasizing important features and suppressing less important ones, thereby improving the network's discriminative power.

Fig. 7

Effect of attention module (AM) on localization (crop) and abnormality detection compared to feature maps (F(X)s)

Fig. 8

Effect of attention module (AM) on feature map (F(X)) on Bone Abnormality datasets

4.10 Class imbalance problem in weakly supervised learning

The hand training dataset suffers from class imbalance, with only 27.77% of the data belonging to the '+' class and 73.22% belonging to the '-' class, which may be one of the reasons the models perform worse than expected. To address this problem, we employed several loss functions based on cross-entropy: the weighted cross-entropy (WCE) loss (Eq. (6)), the mean focal loss (MFL) (Eq. (7)), and the α focal loss (α-WFL; Eq. (8)). In our experiments, the α-WFL loss yielded the highest MCC score and approximately the same accuracy as the loss function that performed best in terms of accuracy. We therefore used α-WFL for all our experiments. Results of our models with the different loss functions are presented in Table 5.
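For orientation, the following Python sketch shows textbook per-sample forms of a class-weighted cross-entropy and an α-balanced focal loss for binary classification. These are the standard formulations and may differ in detail from Eqs. (6) to (8); the function names and the default values of `alpha`, `gamma`, and the class weights are assumptions made for illustration.

```python
import math

EPS = 1e-12  # guards against log(0)


def weighted_cross_entropy(p_pos, label, w_pos=1.0, w_neg=1.0):
    """Cross-entropy with a per-class weight; p_pos is P(label == 1)."""
    p_t = p_pos if label == 1 else 1.0 - p_pos
    weight = w_pos if label == 1 else w_neg
    return -weight * math.log(max(p_t, EPS))


def alpha_focal_loss(p_pos, label, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p_pos if label == 1 else 1.0 - p_pos
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # The modulating factor shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, EPS))


# An easy and a hard positive sample (hypothetical probabilities).
easy = alpha_focal_loss(0.9, 1)
hard = alpha_focal_loss(0.1, 1)
print(f"easy positive: {easy:.5f}, hard positive: {hard:.5f}")
print(f"WCE with up-weighted positives: {weighted_cross_entropy(0.1, 1, w_pos=3.0):.5f}")
```

With `gamma = 0` the focal loss reduces to an α-weighted cross-entropy, so the two functions differ only in how strongly easy samples are down-weighted.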

Table 5 Effect of loss function for class imbalance in weakly supervised learning

4.11 Model baselines

Figure 9 shows the results of our network on the CUB200 dataset with different baseline models. As the complexity of the baseline model increases (from ResNet18 to ResNet101 and Xception), accuracy generally improves, with the exception that ResNet101 achieves higher accuracy than the Xception network. A likely reason is that ResNet101 is a deeper version of ResNet50, which allows it to learn more complex and nuanced features from the data and can improve accuracy on some tasks. However, training ResNet101 also requires more computational resources and may be more prone to overfitting if the training data is insufficient or the regularization is not strong enough. Across all the baseline models, our WSMS network is outperformed by the ResNet50 baseline (see Fig. 9).

Fig. 9

Variety of baseline for WSMS net on CUB dataset

5 Discussion

Weakly supervised learning is a promising approach for medical image analysis, offering potential benefits in terms of scalability, cost-effectiveness, and accuracy. With further research and development, weakly supervised learning methods have the potential to greatly improve the field of medical image analysis.

As a contribution to weakly supervised learning for medical image analysis, we propose the WSMS learning network, which has shown promising results in addressing the challenge of abnormality detection in X-ray images. The WSMS approach leverages the available supervision to the maximum extent possible: it not only classifies the abnormality dataset but also provides the region of interest in the form of a bounding box. This enables the model to learn discriminative features and focus on relevant regions in the X-ray images for better classification and localization performance.

Using CAM and attention maps improves the model's interpretability by providing visual evidence of the decision-making process and facilitating verification, which mitigates some of the black-box aspects of the classifier. The multi-stage architecture with shared parameters increases the robustness and generalization performance of the model by enabling it to learn more complex and abstract features. The experimental results demonstrate the effectiveness of the proposed approach, which outperforms several state-of-the-art methods in terms of classification accuracy and achieves clear visualization of the area of interest.

However, we recognize that there is room for improvement in the proposed approach. More advanced attention mechanisms and architectures can be explored to further improve the interpretability and localization performance of the model. Additionally, the use of more diverse and larger datasets can be investigated to evaluate the generalization performance of the proposed approach in real-world scenarios.

A notable limitation of our weakly supervised method is its suboptimal performance on datasets with class imbalance. Despite incorporating objective functions designed to mitigate this issue, the model's reliance on positive samples during training remains a challenge, particularly for representing features of less prevalent classes. This limitation suggests the need for continued exploration of strategies to counter class imbalance in weakly supervised learning, especially in the context of medical imaging. In future work, we also intend to explore the explainability of the method and its outcomes [64, 65]. Overall, the proposed weakly supervised learning approach shows substantial potential for addressing the challenges of abnormality detection in X-ray images, and further research in this direction can lead to significant advancements in medical imaging applications.

6 Conclusion

The proposed approach in this work addresses the challenge of abnormality detection in X-ray images through a weakly supervised multistage attention map learning approach. The use of a multistage neural network with shared parameters increases the robustness of the model in classification and feature map generalization, resulting in clearer attention maps and better object localization. Comprehensive experiments demonstrate that the model outperforms SOTA baseline models in terms of classification and abnormality localization quality. The proposed model achieves SOTA results in the classification and localization of images, as demonstrated through the analysis of two public benchmarks. This approach is promising for real-world applications as it leverages as much available supervision as possible, but still requires some level of supervision. The ability to provide a region of interest in the form of a bounding box in addition to classification provides a useful tool for medical professionals in identifying abnormalities and their locations in X-ray images. Overall, this work has the potential to significantly impact the field of medical image analysis and improve diagnostic accuracy in clinical settings.

Future work could explore the use of additional sources of supervision or a more comprehensive weakly supervised approach to further improve the accuracy and robustness of the proposed method. Additionally, further evaluation and comparison with state-of-the-art methods could provide insight into the effectiveness and practicality of this approach.