1 Introduction

Biomedical images play a crucial role in education and medical research. They are also valuable for building clinical decision support systems (CDSS) that benefit from content-based image retrieval (CBIR). An efficient CBIR system requires labeled regions-of-interest (ROIs). Since biomedical images comprise several different regions, detecting an annotated arrow can help segment ROIs, which can then be used for indexing images or analyzing their content [1, 2]. Figure 1 shows a complete scenario of the project, where the importance of the arrow is highlighted. Biomedical images are often annotated with pointers such as arrows and asterisks to highlight ROIs (see Fig. 2); in this way, pointers minimize distractions from other image regions. In addition, ROIs are often referred to in the article text and figure captions. This paper improves on prior work in arrow detection toward meeting this goal in image content analysis. Detecting arrows is not straightforward. Arrows (see Fig. 2) appear with either high or low intensity to enhance their visibility in the image, which means that their intensities vary with respect to the background. In addition, in many cases arrows are blurred, overlapped or surrounded by textured areas. An arrow can be just a triangle (i.e., a regular arrowhead) or can have a straight or curved tail.

Fig. 1

Illustration, using the US National Library of Medicine’s (NLM’s) Open-i℠ image retrieval search engine (https://openi.nlm.nih.gov), of the importance of arrows in biomedical images (i.e., the location of the arrow pointing to an ROI, and the relationship between the text and the ROI)

Fig. 2

Examples showing different types of arrows pointing to specific image regions. These are taken from published biomedical articles

1.1 Related work

Few techniques have been reported in the literature for detecting arrows overlaid on biomedical images. These techniques depend on segmenting text-like and symbol-like objects, sparse pixel vectorization, and local or global thresholding.

In [3], Dori et al. proposed a technique to detect arrows based on previously reported work on sparse pixel vectorization [4]. The concept relies on the cross-sectional runs (or width runs) of black image regions (assuming arrows are black). The technique is interesting but has never been applied to biomedical images, as it is limited to machine-printed images such as electrical wiring diagrams, drawings and graphical symbols. Other techniques use features such as eccentricity, convex area and solidity [5]. These features can describe regular arrows (i.e., straight arrows pointing left, right, up and down). Since overlaid arrows in biomedical images can be distorted, such straightforward geometrical features cannot differentiate arrows from other regions. Cheng et al. separated text-like and arrow-like objects, assuming that arrows are shown in either black or white with respect to the background [6]. As mentioned earlier (see the last paragraph of Sect. 1), arrows do not appear only in black or white pixels, so their approach does not fit our target. From the binary image, arrow-like object separation employs a fixed-size mask (after removing small objects and noise as in [5]), and the resulting objects are then used to compute features such as major and minor axis lengths, axis ratio, area, solidity and Euler number. Removing small candidates is not a solution, since an overlaid arrow can be just a triangle (which can be small too). Further, arrows can have a texture similar to that of the regions they are connected to or point at, which produces distorted arrow candidates at segmentation time. A recent study uses pointer region and boundary detection to handle distorted arrows [7], followed by edge detection techniques and fixed thresholds as reported in [8, 9]. These candidates are used to compute overlapping regions, which are then binarized to extract the boundary of the expected pointers.
Fundamentally, edge-based arrow detection techniques are limited by the weak-edge problem [5–7]. A weak edge occurs when intensities vary considerably within a single arrow candidate; as a consequence, part of the edge is missed. In addition, techniques that rely on hard thresholding have to be empirically tuned from one dataset to another. For edge detection in binary or grayscale images, most state-of-the-art methods use classical algorithms such as Roberts, Sobel and Canny edge detection. Template-based methods are limited, since they require new templates for new images. Also, it may be necessary to re-evaluate the threshold values when new images are used. Edge-based techniques are still considered, since sampling points can be remarkably compact compared to solid regions, especially when broken boundaries are recoverable (as reported in [10]). In biomedical images, one of the major causes of a broken boundary is non-homogeneous intensity distribution, where pointers overlap with content. This is one of the primary reasons why hard thresholding (at the time of binarization, for instance) may not work.

Fig. 3

Overall system workflow in block format. Block-wise explanation can be found in Sect. 1.2

1.2 Contribution outline

Our method is summarized in Fig. 3. It relies on a grayscale fuzzy binarization process at different levels [11], where candidates are segmented based on the connected component (CC) principle. Unlike common state-of-the-art methods, we use four different levels of fuzzy binarization. This ensures that overlaid arrow candidates are not missed (see Fig. 4). Our system comprises two classifiers that analyze the candidates in sequence to determine whether these arrow candidates are arrows: a neural network-based bidirectional long short-term memory (BLSTM) classifier followed by convexity defect-based arrowhead detection. Npen++ and Radon features are computed for these candidates and validated against a BLSTM-trained arrow model. The BLSTM classifies each candidate as arrow or non-arrow with a cross-entropy error (CER) score. The CER indicates how confident the BLSTM classification is: it is inversely proportional to the confidence score. If the confidence score of the BLSTM classifier crosses the threshold, the candidate is classified as an arrow. Otherwise, the candidate is passed to the convexity defect-based technique. The latter step prunes artefacts (i.e., unwanted noisy objects and/or image regions) and keeps arrowhead-like candidates, since it deals with just the arrowhead.

The remainder of the paper is organized as follows. In Sect. 2, we explain the binarization technique. In Sect. 3, we explain BLSTM arrow detection, including feature extraction and classification. In Sect. 4, we discuss convexity defect-based techniques for arrowhead detection. Experimental setup and results are reported in Sect. 5. We extend our evaluation with a comprehensive comparative study against state-of-the-art techniques in Sect. 6. Section 7 concludes the paper.

Fig. 4

Fuzzy binarization (of Fig. 2c): four different levels (levels 1–4), where the arrows are encircled both in red and black with respect to the background color

2 Multilayer image segmentation

In biomedical images (see Fig. 2), arrows appear with either high or low intensity to enhance their visibility in the image. In addition, in many cases arrows are blurred, overlapped or surrounded by textured areas. In such contexts, typical binarization tools based on fixed threshold values are unable to segment candidate regions. Therefore, we focus on an adaptive binarization tool, which is based on a fuzzy partition of a 2D histogram of the image, taking into account the gray-level intensities and local variations [12]. 2D Z-function criteria based on the optimization of fuzzy entropy are then computed from this histogram to set the threshold automatically. The Z-function employs two kernels: low-level and high-level cuts. In addition, we take their inversions, so that altogether four different binarized levels are processed, as illustrated in Fig. 4, where arrow candidates are encircled in both red and black (with respect to the background color). The main idea of using four different levels of binarization is not to miss the overlaid arrows. Furthermore, deformed and/or distorted arrows can be discarded, since the arrows are repeated in other levels of binarization. Note that image regions are segmented based on the connected component (CC) principle with 8-pixel connectivity. In general, CCs in a 2D image are clusters of pixels with the same value that are connected to each other through 8-pixel connectivity. CCs are referred to as candidate regions. From a pool of several candidates, we are required to select arrow-like candidates.
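
As an illustration of the segmentation step, the following minimal sketch builds four binary levels from two cuts and their inversions, and labels candidate regions with 8-pixel connectivity. It does not reproduce the fuzzy-entropy threshold selection: `low_cut` and `high_cut` are hypothetical stand-ins for the cuts that the Z-function would select, and the function names are ours.

```python
from collections import deque


def four_levels(gray, low_cut, high_cut):
    """Binarize at two cuts and add both inversions (four levels in total).

    low_cut and high_cut are hypothetical placeholders for the cuts that a
    fuzzy-entropy criterion would choose; they are passed in by hand here.
    """
    low = [[1 if v >= low_cut else 0 for v in row] for row in gray]
    high = [[1 if v >= high_cut else 0 for v in row] for row in gray]
    inv_low = [[1 - v for v in row] for row in low]
    inv_high = [[1 - v for v in row] for row in high]
    return [low, high, inv_low, inv_high]


def label_components(binary):
    """Label foreground pixels (value 1) using 8-pixel connectivity."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or labels[r][c] != 0:
                continue
            current += 1                      # start a new candidate region
            labels[r][c] = current
            queue = deque([(r, c)])
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = current
                            queue.append((ny, nx))
    return labels, current
```

Each of the four levels would be labeled independently, so an arrow that is lost at one level can still appear as a candidate region at another.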

3 Arrow detection using BLSTM classifier

To train and test a neural network-based BLSTM classifier, we compute two features: (1) Npen++ and (2) the Radon transform. Using both features, we classify arrowhead candidates based on the confidence score of the BLSTM.

3.1 Features

The performance of any neural network classifier depends on the features used to represent the candidates. In our study, we tested the performance with two different feature descriptors: Npen++ [13] and the Radon feature [14]. Npen++ features, which include curvature, curliness and orientation, are expected to work well for well-defined geometric patterns (such as arrows). Similarly, the Radon feature is well suited to such patterns, since projection in the Radon space turns a 2D arrow image into 1D signals that can be treated as strokes. The BLSTM classifier is well known for stroke and gesture recognition [15, 16].

3.1.1 Npen++ feature descriptor

Npen++ features were originally introduced for handwriting recognition. The descriptor comprises a number of features computed along the handwriting trajectory: the normalized sequence of captured coordinates (x(i), y(i)) forms the input to the system, and a sequence of features is computed along this trajectory. Not all of these features are relevant for our arrow detection approach. Most Npen++ features depend on the baseline. In the original Npen++ recognizer, the baseline b(x(i)) corresponds to the writing line on which a word or text was written. In our study, we take as baseline the lowest line parallel to the x-axis that passes through the contour of the arrow.

Fig. 5

Npen++ features representing a arrow and b non-arrow candidates. They are shown side by side so that one can compare the features and see how discriminative they are

Vertical position  The vertical distance between y(i) and b(x(i)) of a point (x(i), y(i)) is the vertical position of the point, where b(x(i)) is the y-value of the baseline for ith point on the contour.

Orientation  At point (x(i), y(i)), the local writing direction is described as

$$\begin{aligned} \cos \alpha (i) = \frac{\Delta x(i)}{\Delta s(i)} \quad \text{and} \quad \sin \alpha (i) = \frac{\Delta y(i)}{\Delta s(i)}, \end{aligned}$$
(1)

where \(\Delta s,\Delta x,\) and \(\Delta y\) are defined, respectively, as follows:

$$\begin{aligned} \Delta s(i) &= \sqrt{\Delta x(i)^{2} + \Delta y(i)^{2}}, \\ \Delta x(i) &= x(i-1) - x(i+1), \quad \text{and} \\ \Delta y(i) &= y(i-1) - y(i+1). \end{aligned}$$

Curvature  The curvature at a point (x(i), y(i)) is represented by the angle between the writing directions at its neighboring points along the trajectory, and is described as follows:

$$\begin{aligned} \cos \beta (i) &= \cos \alpha (i-1) \cos \alpha (i+1) + \sin \alpha (i-1) \sin \alpha (i+1) \quad \text{and} \\ \sin \beta (i) &= \cos \alpha (i-1) \sin \alpha (i+1) - \sin \alpha (i-1) \cos \alpha (i+1). \end{aligned}$$
(2)

Note that this sequence does not represent the curvature itself but the angular difference.
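
The orientation and angular-difference features can be sketched as follows. This is an illustration only, not the original Npen++ implementation. It assumes a closed contour (indices wrap around), returns a fixed direction when the two neighbors coincide, and takes the sine term from the angle-difference identity, so that (cos β, sin β) describes the angle α(i+1) − α(i−1).

```python
import math


def writing_direction(points, i):
    """Return (cos alpha, sin alpha) at index i, following Eq. (1)."""
    n = len(points)
    prev_pt, next_pt = points[(i - 1) % n], points[(i + 1) % n]
    dx = prev_pt[0] - next_pt[0]
    dy = prev_pt[1] - next_pt[1]
    ds = math.hypot(dx, dy)
    if ds == 0.0:
        return 1.0, 0.0  # assumption: coincident neighbors get direction 0
    return dx / ds, dy / ds


def angular_difference(points, i):
    """Return (cos beta, sin beta) for the directions at i-1 and i+1."""
    cos_prev, sin_prev = writing_direction(points, i - 1)
    cos_next, sin_next = writing_direction(points, i + 1)
    cos_beta = cos_prev * cos_next + sin_prev * sin_next
    # Angle-difference identity: sin(a_next - a_prev).
    sin_beta = cos_prev * sin_next - sin_prev * cos_next
    return cos_beta, sin_beta
```

On a straight run of points the result is (1, 0), and at a right-angle corner it is (0, ±1), so the pair behaves as a rotation-invariant turn indicator.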

Aspect  The aspect A(i) of the contour in the vicinity of a point characterizes the height-to-width ratio of the bounding box containing the preceding and succeeding points of (x(i), y(i)). It is computed as:

$$\begin{aligned} A(i) = \frac{\Delta y(i) - \Delta x(i)}{\Delta y(i) + \Delta x(i)} \end{aligned}$$
(3)

Curliness  The curliness feature C(i) describes the deviation from a straight line in the vicinity of (x(i), y(i)). It is computed from the ratio of the length of the contour to the longer side of the bounding box,

$$\begin{aligned} C(i) = \frac{L(i)}{\max (\Delta x,\Delta y)} - 2, \end{aligned}$$
(4)

where L(i) denotes the length of the contour in the vicinity of the point computed as the sum of all line segments, and \(\Delta x\) and \(\Delta y\) are width and height of the bounding box, respectively.
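
Aspect (Eq. 3) and curliness (Eq. 4) can be sketched in the same way. The size of the vicinity is not specified in the text, so the `radius` parameter below (two points on each side) is an assumption, as is the value returned for a degenerate bounding box.

```python
import math


def vicinity(points, i, radius=2):
    """Points from i-radius to i+radius on a closed contour.

    The radius is an assumed parameter; the text does not fix it.
    """
    n = len(points)
    return [points[(i + k) % n] for k in range(-radius, radius + 1)]


def aspect(points, i, radius=2):
    """Eq. (3): (dy - dx) / (dy + dx) for the bounding box of the vicinity."""
    xs, ys = zip(*vicinity(points, i, radius))
    dx, dy = max(xs) - min(xs), max(ys) - min(ys)
    return 0.0 if dx + dy == 0 else (dy - dx) / (dy + dx)


def curliness(points, i, radius=2):
    """Eq. (4): contour length over the longer bounding-box side, minus 2."""
    near = vicinity(points, i, radius)
    xs, ys = zip(*near)
    longest = max(max(xs) - min(xs), max(ys) - min(ys))
    length = sum(math.hypot(q[0] - p[0], q[1] - p[1])
                 for p, q in zip(near, near[1:]))
    return 0.0 if longest == 0 else length / longest - 2.0
```

A straight horizontal run gives an aspect of −1 and a curliness of −1; any wiggle in the contour lengthens L(i) and raises the curliness above −1.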

We have applied the Npen++ features to select arrow-like candidates. Figure 5 shows what the Npen++ features look like for both arrow and non-arrow candidates, which illustrates their discriminative power.

3.1.2 The Radon transform

The Radon transform computes projections of an image matrix along specified directions [14]. A projection of a two-dimensional function f(x, y) is a set of line integrals. It is computed by evaluating line integrals from multiple sources along parallel paths, or beams, in a certain direction. In general, the Radon transform of f(x, y) is the line integral of f parallel to the y'-axis:

$$\begin{aligned} R_{\theta }(x') = \int _{-\infty }^{\infty } f(x'\cos \theta - y' \sin \theta , x'\sin \theta + y' \cos \theta ) \, dy', \end{aligned}$$
(5)

where \(x' = x\cos \theta + y\sin \theta\) and \(y' = y\cos \theta - x\sin \theta.\) The beams are spaced 1 pixel unit apart. To represent an image, the Radon function takes multiple parallel-beam projections of the image from different angles by rotating the source around the center of the image. Since an arrow is a regular geometric shape, the Radon transform of an arrow tends to show regularities.
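
A discrete version of one projection can be sketched as below. This is a nearest-bin approximation for illustration, not the interpolated routine of any particular toolbox: each pixel is assigned to the beam closest to its rotated coordinate x' = x cos θ + y sin θ, measured from the image center with beams 1 pixel apart. The function name and the binning scheme are assumptions of this sketch.

```python
import math


def radon_projection(image, theta_deg):
    """Approximate R_theta(x') by binning pixel values along the x' axis."""
    rows, cols = len(image), len(image[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    theta = math.radians(theta_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Enough bins to hold the image diagonal on both sides of the center.
    half = int(math.ceil(math.hypot(rows, cols) / 2.0))
    bins = [0.0] * (2 * half + 1)
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0:
                continue
            x, y = c - cx, cy - r             # centered, y pointing up
            x_prime = x * cos_t + y * sin_t   # position along the x' axis
            bins[int(round(x_prime)) + half] += image[r][c]
    return bins
```

Stacking the projections for a range of angles gives the set of 1D signals that serves as the Radon feature of a candidate.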

3.2 BLSTM classifier

We are further motivated by the recently introduced bidirectional long short-term memory (BLSTM) [17]. In simple terms, the BLSTM is a recurrent neural network whose connections between nodes form a directed cycle, thus providing a ‘memory’ of the network’s previous internal state.

3.2.1 Long short-term memory (LSTM) layer

The nodes of an LSTM network are formed by a specific architecture known as a memory block. Each memory block contains a memory cell, which interacts with the rest of the network through three gates, viz., an input gate, an output gate and a forget gate [17]. The forget gate determines when the input is important enough to keep in memory and when the block can forget the values. This helps memory cells retain their state for a long time and model the context at the feature level. The input signal is processed in both directions, forward and backward, by two different layers, thus improving 1D sequence recognition. The next layer combines the outputs of both layers to form the feature map. As in a convolutional neural network, each LSTM layer can have multiple forward and backward layers, and multiple feature maps at the output layer are possible. Further, it is possible to stack multiple LSTM layers using methods such as max-pooling subsampling.
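
To make the role of the three gates concrete, the sketch below runs a single scalar memory block over a 1D sequence in both directions. It is a toy illustration under strong assumptions: one block per direction, scalar inputs, and hand-picked weights (`EXAMPLE_WEIGHTS` is hypothetical). A trained BLSTM would learn these weights and use many blocks.

```python
import math

# Hypothetical weights, chosen by hand for illustration only.
EXAMPLE_WEIGHTS = {
    "wi": 0.5, "ui": 0.1, "bi": 0.0,   # input gate
    "wf": 0.5, "uf": 0.1, "bf": 1.0,   # forget gate
    "wo": 0.5, "uo": 0.1, "bo": 0.0,   # output gate
    "wc": 1.0, "uc": 0.1, "bc": 0.0,   # candidate cell value
}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_pass(sequence, w):
    """Run one scalar LSTM memory block over a sequence; return its outputs."""
    h, c, outputs = 0.0, 0.0, []
    for x in sequence:
        i_gate = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])
        f_gate = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])
        o_gate = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])
        candidate = math.tanh(w["wc"] * x + w["uc"] * h + w["bc"])
        c = f_gate * c + i_gate * candidate   # memory cell: keep or overwrite
        h = o_gate * math.tanh(c)             # block output
        outputs.append(h)
    return outputs


def bidirectional_pass(sequence, w_forward, w_backward):
    """Pair the forward output with the backward output at each position."""
    forward = lstm_pass(sequence, w_forward)
    backward = lstm_pass(sequence[::-1], w_backward)[::-1]
    return list(zip(forward, backward))
```

Each position thus receives context from both the preceding and the following samples of the feature sequence.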

3.2.2 Candidate selection

To train the BLSTM model, the aforementioned features (see Sects. 3.1.1 and 3.1.2) can be either separately applied or combined.

Individual feature performance in BLSTM

The Npen++ and Radon features are used separately to train BLSTM models for arrows (and non-arrows). For testing, a test candidate is passed through the trained BLSTM models. As a reminder, the LSTM layer is discussed in Sect. 3.2.1.

Integrating features in BLSTM

Feature integration is another option for assessing how well the BLSTM performs. To this end, we propose two different setups: (1) weighted average score, and (2) confidence scores (in parallel).

Setup 1: Weighted average score  Since we do not know in advance which feature performs better, in this setup we start by assigning two different weights, \(\alpha\) and \(1-\alpha,\) to the two scores. This can be formalized as follows:

$$\begin{aligned} x_{\text {avg}} = \alpha \times x_{\text{ Npen++ }} + (1-\alpha ) \times x_{\text{ Radon }}, \end{aligned}$$
(6)

where the value of \(\alpha\) ranges from 0 to 1, and \(x_{F}\) represents the confidence score from the BLSTM for that particular feature, F. In general, we have

$$\begin{aligned} \nonumber x_{\text {avg}}= {\left\{ \begin{array}{ll} x_{\text{ Radon }}, &{}\quad \text {if } \alpha = 0\\ x_{\text{ Npen++ }}, &{}\quad \text {if } \alpha = 1\\ \alpha \times x_{\text{ Npen++ }} + (1-\alpha ) \times x_{\text{ Radon }}, &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$

Therefore, we do not bias the decision toward any particular feature in advance. The average score, \(x_{\text {avg}},\) is then used to classify arrow candidates.

Setup 2: Confidence score in parallel fashion  As in setup 1, the candidate image is passed through both the Npen++-trained classifier and the Radon-trained classifier in parallel. We then take the BLSTM classification decision from the classifier that yields the higher confidence score. It can be generalized as,

$$\begin{aligned} x_{\text {decision}}= {\left\{ \begin{array}{ll} x_{\text{ Radon }}, &{}\quad \text {if } x_{\text{ Radon }} \ge x_{\text{ Npen++ }} \\ x_{\text{ Npen++ }}, &{}\quad \text {otherwise,} \end{array}\right. } \end{aligned}$$
(7)

where \(x_{F}\) represents confidence score from BLSTM for that particular feature, F.
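
Both integration setups reduce to a few lines. The sketch below follows Eqs. (6) and (7) directly; the function names and the example scores are ours and purely illustrative.

```python
def weighted_average_score(x_npen, x_radon, alpha):
    """Setup 1, Eq. (6): weight alpha on Npen++ and 1 - alpha on Radon."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * x_npen + (1.0 - alpha) * x_radon


def parallel_decision(x_npen, x_radon):
    """Setup 2, Eq. (7): keep the score of the more confident classifier."""
    if x_radon >= x_npen:
        return "Radon", x_radon
    return "Npen++", x_npen
```

For instance, with illustrative scores of 0.8 (Npen++) and 0.6 (Radon), setup 1 returns 0.6 at α = 0 and 0.8 at α = 1, while setup 2 always returns the Npen++ score of 0.8.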

As in state-of-the-art works, a candidate classified by the BLSTM with a high confidence score is accepted. The remaining candidates are passed through a second phase of screening (i.e., convexity defect-based arrowhead detection; refer also to Fig. 3). The latter phase (Sect. 4) aims to eliminate the false positives coming from the BLSTM.

4 Arrowhead detection using convexity defect-based algorithm

Unlike the previous section and previously reported works [11, 18], here we do not take the arrow tail into account, because arrow tail structures vary a lot. In previous work, the geometric signature was computed from the extreme points of a triangle (i.e., a triplet), and a change in tail structure can affect the overall appearance of the arrow (see Fig. 6). This limits the performance of the previous technique, both in computational complexity and in detection rate. In this section, we therefore restrict our work to detecting the arrowhead, in two steps: (1) convexity defect-based arrowhead candidate cropping; and (2) arrowhead candidate matching via dynamic time warping (DTW).

The convexity defect-based technique is based on the characteristics of the arrowhead that can be represented by a triangle [19, 20]. Once candidate arrowheads are selected, we confirm them by matching with the templates via DTW.

Fig. 6

Examples showing the changes in tail structure. Further, an absence of the tail is also possible

4.1 Convexity defect-based arrowhead candidate cropping

We apply the convex hull and convexity defect concept to select arrow-like candidates (see Fig. 7). A set of points is said to be convex if it contains the line segments connecting each pair of its points. In a convex combination, each point \(x_i \in S\) is assigned a weight or coefficient \(w_i\) in such a way that the coefficients sum to one; these weights are used to compute a weighted average of the points. The convex hull of a finite point set \(S \subset \mathbb {R}^2,\) here the points along the contour of the binary CC, is the set of all such convex combinations: \(\{ \sum _{i=1}^{|S|} w_i x_i |(\forall i: w_i \ge 0) \wedge \sum _{i=1}^{|S|} w_i = 1\}.\) An example is shown in Fig. 7b. The convex-shaped silhouettes on both sides of the arrow are computed by subtracting the original candidate from the convex hull (see Fig. 7c). The convexity defect region, shown in Fig. 7d, is simply the convex hull of both convex-shaped silhouettes; it covers the tail. Finally, the arrowhead candidate(s) are selected by subtracting the convexity defect region from the original image, as shown in Fig. 7e. All connected components (after subtraction) are taken as potential arrowhead candidates.
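
The cropping steps can be sketched on a set of pixel coordinates as follows. This is a simplified illustration, not the implementation used in our experiments: the hull is computed with Andrew's monotone chain (our choice), pixels are treated as points (x, y), and all function names are hypothetical.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, p):
    """True if p lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[k], hull[(k + 1) % n], p) >= 0 for k in range(n))


def convexity_defect(pixels):
    """Pixels covered by the convex hull but absent from the shape."""
    shape = set(pixels)
    hull = convex_hull(pixels)
    xs = [x for x, _ in shape]
    ys = [y for _, y in shape]
    return {(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)
            if (x, y) not in shape and inside_hull(hull, (x, y))}


def arrowhead_candidates(pixels):
    """Remove shape pixels that fall inside the hull of the defect pixels."""
    shape = set(pixels)
    defect = convexity_defect(pixels)
    if not defect:
        return shape
    region = convex_hull(list(defect))
    return {p for p in shape if not inside_hull(region, p)}
```

On a small right-pointing arrow, the defect pixels lie on both sides of the tail, and removing their hull from the shape cuts the tail next to the head. Any tail remnant survives as a separate connected component and is left for the matching step to reject.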

Fig. 7

Arrowhead candidate cropping: a an arrow, b convex hull, c convexity defect, d a complete convexity defect region, and e arrowhead candidates

4.2 Arrowhead candidate selection

We apply a template matching technique to confirm arrowhead candidates (see Fig. 7). We extract a feature along the contour and match with the predefined templates using dynamic time warping (DTW) technique. The arrowhead candidate is confirmed when the similarity score crosses the empirically designed threshold.

We have a set of coordinate points along the contour, \(P = \{ p_i\}_{i = 1, \ldots , n}.\) We compute the angle with respect to the x-axis for each consecutive pair \(p_i \text { and } p_{i+1},\) \(\alpha _i = \arctan \left( \frac{y_{i+1}-y_{i}}{x_{i+1}-x_{i}} \right) ,\) and therefore represent the whole sequence as a feature vector, \(f = \{\alpha _i \}_{i = 1, \ldots , n}.\) We then represent the digital curve using fewer points through polygonal approximation [21–23], such that the curvature properties of the digital curve are retained. The feature vector can contain runs of redundant values, \(\alpha _i = \alpha _{i+j}, j = 1,\ldots , m,\) where \(m \le n.\) In our implementation, we compute the difference between consecutive angles and check it against a threshold: a single value \(\alpha _i\) is retained if \(|\alpha _i - \alpha _{i+1}| \le \epsilon,\) where \(\epsilon\) is user-defined. Figure 8 shows three examples, where the changes in angle are shown at all dominant points.
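
The angle feature and the removal of redundant values can be sketched as below. Two choices here are assumptions of the sketch rather than part of the method: `math.atan2` replaces the plain arctangent so that vertical segments do not divide by zero, and an angle is dropped when it is within `eps` of the last retained one.

```python
import math


def angle_sequence(contour):
    """Angle (radians) of each segment p_i -> p_{i+1} w.r.t. the x-axis."""
    return [math.atan2(y2 - y1, x2 - x1)
            for (x1, y1), (x2, y2) in zip(contour, contour[1:])]


def drop_redundant(angles, eps=1e-6):
    """Keep an angle only if it differs from the last kept one by > eps."""
    kept = []
    for a in angles:
        if not kept or abs(a - kept[-1]) > eps:
            kept.append(a)
    return kept
```

For a contour that runs right, then up, then left, the six segment angles collapse to the three values 0, π/2 and π, one per dominant direction.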

Fig. 8

Examples showing a complete process (from left to right) starting from an original candidate (resulting from fuzzy binarization—see Fig. 4), arrowhead cropping (see Fig. 7) to feature extraction after polygonal approximation

For matching, we use the DTW algorithm, since it finds the dissimilarity between two non-linear sequences of potentially different lengths [24, 25]. In Fig. 8, one can notice the variation in feature vector size from one arrowhead to another. Consider two feature sequences \(f_1 = \{ \alpha _{i} \}_{i=1,\ldots ,n} \text{ and } f_2 = \{ \beta _{j} \}_{j=1,\ldots ,m}\) of size n and m, respectively. The aim of the algorithm is to provide the optimal alignment between both sequences. First, a matrix of size \(n \times m\) is constructed. Then, for each element, the local distance \(\delta (i,j)\) between the elements \(\alpha _i\) and \(\beta _j\) is computed, i.e., \(\delta (i,j) = (\alpha _i - \beta _j)^2.\) Let D(i, j) be the global distance up to (i, j),

$$\begin{aligned} \nonumber D(i,j) = \min \left[ \begin{array}{l} D(i-1,j-1),\\ D(i-1,j), \\ D(i,j-1)\end{array} \right] + \delta (i,j) \end{aligned}$$

with the initial condition \(D(1,1) = \delta (1,1),\) such that the warping path runs diagonally from the starting node (1, 1) to the end node (n, m). The main aim is to find the path with the least associated cost. The warping path therefore provides the difference cost between the compared signatures. Formally, the warping path is \(W = \left\{ w_{k} \right\} _{k= 1 \ldots l},\) where \(\text{ max }(n,m) \le l \le n+m-1\) and the \(k{\mathrm{th}}\) element of W is \(w(i,j)_k \in [1:n] \times [1: m]\) for \(k \in [1:l].\) The optimised warping path W satisfies the following three conditions: boundary condition, monotonicity condition and continuity condition. We then define the global distance between \(f_1\) and \(f_2\) as,

$$\begin{aligned} \Delta \left( f_1,f_2\right) = \frac{D(n,m)}{l}. \end{aligned}$$
(8)

The last element of the \({n \times m}\) matrix gives the DTW distance between \(f_1\) and \(f_2,\) which is normalised by l, i.e., the number of discrete warping steps along the diagonal of the DTW matrix.
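
The recurrence and the normalisation of Eq. (8) can be sketched as follows. The sketch tracks the length l of the optimal path alongside its cost; breaking ties between equal-cost predecessors in favor of the shorter path is our assumption, as the text does not specify it.

```python
def dtw_distance(f1, f2):
    """Return D(n, m) / l, where l is the length of the optimal warping path."""
    n, m = len(f1), len(f2)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] holds D(i, j); steps[i][j] holds the path length up to (i, j).
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # makes D(1, 1) equal to delta(1, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = (f1[i - 1] - f2[j - 1]) ** 2
            # Cheapest predecessor; ties go to the shorter path (assumption).
            best, length = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = best + delta
            steps[i][j] = length + 1
    return cost[n][m] / steps[n][m]
```

A candidate would then be compared against each template, with the smallest normalised distance checked against the empirically chosen threshold.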

5 Experiments

5.1 Datasets, ground-truth and evaluation protocol

The well-known imageCLEF dataset [26] is used for testing. It is composed of 298 chest CT images. Each image is expected to have at least one arrow, and there are 1049 arrows in total. For all images in the dataset, ground-truths of the arrows were created; each ground-truth includes information such as arrow type, color, location and direction. For any given image in the dataset, our performance evaluation criteria are precision, recall and \(\text{F}_{1}\) score,

$$\begin{aligned} \nonumber & \text{ precision } = \frac{m_1}{M}, \quad \text{ recall } = \frac{m_1}{N}\quad \text{ and } \\ & {\text{F}}_1 \text{ score } = 2\left( \frac{(m_1/M) \times (m_1/N)}{(m_1/M) + (m_1/N)} \right) , \end{aligned}$$
(9)

where \(m_1\) is the number of correct matches from the detected set M and N is the total number of arrows (in the ground-truth) that are expected to be detected.
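
Eq. (9) translates directly into code. In the sketch below, the argument names are ours, and returning zero for empty sets is an assumption made to avoid division by zero.

```python
def detection_scores(correct, detected, expected):
    """Precision, recall and F1 from m_1 (correct), M (detected), N (expected)."""
    precision = correct / detected if detected else 0.0
    recall = correct / expected if expected else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

As an illustration with made-up counts, 90 correct matches out of 100 detections and 120 expected arrows give a precision of 0.90, a recall of 0.75 and an F1 score of about 0.818.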

5.2 Our results and analysis

5.2.1 Results

In Table 1, performance is reported in terms of precision, recall and \(\text{F}_{1}\) score for the following configurations.

  1. BLSTM classifier with two different features that are separately applied;

  2. Convexity defect-based algorithm (without BLSTM classifier); and

  3. Sequential classifier, by integrating both: BLSTM and convexity defect-based technique (see Fig. 3).

In Table 1, we observe the following:

  • BLSTM alone has a very good recall value but produces a large number of false positives. The high recall holds for both features, Npen++ (98.43%) and Radon (97.31%), applied separately.

  • Convexity defect-based algorithm performs better than BLSTM in terms of precision.

  • Integrating both BLSTM and the convexity defect-based technique (sequential classifier) provides interesting results. Candidates with low confidence scores (at the BLSTM end) are now correctly detected/classified in the latter phase, thanks to convexity defect-based arrowhead cropping (see Sect. 4). On the whole, the sequential classifier that combines BLSTM and convexity defect-based arrowhead cropping provides better results. Table 1 reports results for each feature separately. For example, the sequential classifier achieves an \(\text{F}_{1}\) score of 97.26% when BLSTM (with Npen++) is combined with the convexity defect-based algorithm, which is better than when the Radon features are used in BLSTM (refer to the last two rows of Table 1).

Table 1 Performance of the proposed system (in %)
Fig. 9

Graph showing the change in accuracies with change in the threshold value of cross entropy error for a Npen++ feature and b Radon feature

Table 2 Performance (in %): integrating features in BLSTM (setup 2) and with convexity defect-based algorithm

Note that our BLSTM classifier uses a threshold on the cross-entropy error value to decide whether the candidate should be further tested with the convexity defect-based algorithm. If not, we accept the candidate as classified by the BLSTM classifier. Figure 9 shows how BLSTM performance changes with this threshold value. The best (and almost equal) results are obtained when the normalized threshold values are in the range [0.01–0.03], for both features. Based on this validation, we use a threshold value of 0.025 for all tests (see Table 1). Besides, in Fig. 9, we observe the following:

  • for small threshold values, all arrow candidates will be above the threshold and will be passed through the convexity defect-based algorithm (i.e., only the convexity defect-based algorithm is active); and

  • for large threshold values, all arrow candidates will be below the threshold and will not be passed through the convexity defect-based algorithm (i.e., only BLSTM is active).
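
The routing rule behind these two limiting cases can be sketched as follows. The function names and the CER values in the usage note are hypothetical, and accepting a CER exactly equal to the threshold is an assumption of this sketch.

```python
def sequential_decision(blstm_label, cer, threshold, second_stage):
    """Accept confident BLSTM decisions; defer the rest to the second stage.

    second_stage is a callable standing in for the convexity defect-based
    test; it is only invoked when the CER exceeds the threshold.
    """
    if cer <= threshold:
        return blstm_label, "BLSTM"
    return second_stage(), "convexity defect"


def routed_to_second_stage(cers, threshold):
    """Count the candidates whose CER exceeds the threshold."""
    return sum(1 for cer in cers if cer > threshold)
```

With illustrative CER values of 0.005, 0.02, 0.04 and 0.3, a threshold of 0.025 sends two candidates to the second stage, a threshold of 0 sends all four, and a threshold of 1 sends none.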

5.2.2 Integrating features in BLSTM

As mentioned in Sect. 3.2.2, we used two different setups: 1) weighted average score, and 2) confidence scores (in parallel).

Using Eq. 6, Fig. 10 shows the change in performance with the value of the weight, \(\alpha \in [0,1].\) In Fig. 10, we also provide the performance of the sequential classifier (i.e., integrating features in BLSTM plus the convexity defect-based algorithm). In our test, performance is maximal for the Npen++ feature alone; it first tends to decrease linearly and then improves slightly as the weight of the Radon score increases. On the whole, we observe that integrating the Radon feature-trained classifier with the Npen++ feature-trained classifier does not improve accuracy (i.e., no gain in performance).

Fig. 10

Graph showing the changes in performance with different values of the weight, \(\alpha\): a without and b with convexity defect-based algorithm

In the latter setup, the candidate image is passed through both the Npen++-trained classifier and the Radon-trained classifier in parallel. As described in Eq. 7, we take the BLSTM classification decision from the classifier that yields the higher confidence score. In Table 2, precision, recall and F1 score are shown. Recall that we first integrate the features in BLSTM and then combine the result with the convexity defect-based algorithm. Compared to Table 1, the performance increases by more than 1%, thanks to feature integration. For comparison, we use these best scores (Table 2) in Sect. 6.

Fig. 11

Examples showing how different binarization levels are used to detect arrows. This demonstrates the idea of image inversion used in binarization, since arrows are not just black-filled

Fig. 12

Examples illustrating arrow detection, on the right of the input image

5.3 Processing times

Processing times, on average, are similar for the Npen++ and Radon features. For each candidate image, the Npen++ feature took approximately 12.7 and 14.4 ms without and with the convexity defect-based approach, respectively. With the Radon feature, each candidate image took 15.5 and 16.2 ms without and with the convexity defect-based approach, respectively. We used a PC with an Intel Core i7 processor and 8 GB RAM, running Ubuntu 16.04 and Matlab R2015a.

5.4 Example outputs

In Fig. 11, we provide some example outputs illustrating arrow detection. Our aim is to show the importance of the multilayer image segmentation concept (see Sect. 2). In both examples of Fig. 11, arrows are detected using two different layers, since they appear in both black and white pixels. This means that straightforward image binarization does not work (for example, the works reported in [6–8]). For a comparison, we refer to Sect. 6. For better understanding, we provide more outputs in Fig. 12.

6 State-of-the-art comparison

Further, we have carried out a comparative study with state-of-the-art methods. In this comparison, our benchmarking methods are categorized into two groups:

  1. State-of-the-art methods that are specially designed for arrow detection; and

  2. Common template-based method by using well-known state-of-the-art shape descriptors.

Table 3 Performance comparison (in %) of previously reported methods

6.1 Arrow detection methods

Four well-known methods from the state-of-the-art that are specially designed for arrow detection are used:

  1. Global thresholding-based method (M1) [6];

  2. Two edge-based methods (M2:M3) [7, 8]; and

  3. A template-free geometric signature-based method (M4) [11].

The results are provided in Table 3, where method 4 (M4) performs the best with precision, recall and \(\text{F}_1\) score of 93.14, 86.92 and 89.94%, respectively.

6.2 Template-based methods

For the template-based methods, we created 11 templates (arrows) with different shapes (including sizes). The template set can be further extended in accordance with the dataset. To extract shape features, we took the most frequently used shape descriptors (in computer vision) from the state-of-the-art. They are

  1. Generic Fourier descriptor (GFD) [27],

  2. Shape context (SC) [28],

  3. Zernike moment (ZM) [29],

  4. Generic Radon transform (G-RT) [30] and

  5. DTW-Radon [31].

As before, results (precision, recall and \(\text{F}_1\) score) are provided in Table 4. Among all shape descriptors, GFD provides the best performance.

Table 4 Performance comparison (in %) of template based method
Table 5 Performance comparison (in %) of our method with the best previously reported works

6.3 Comparison summary

From all reported methods (see Tables 2, 3 and 4), we take the best results for comparison with the proposed method. A complete comparison is provided in Table 5. On this dataset, the proposed method outperforms the best state-of-the-art arrow detection method by more than \(8\%\) in \(\text{F}_1\) score, and the template-based (shape descriptor) method by more than 20% in \(\text{F}_1\) score. This attests to the usefulness of our method for the subsequent image region labelling problem in biomedical images.

7 Conclusion and future work

We have presented a sequential classifier to detect overlaid arrows in biomedical images, comprising a bidirectional long short-term memory (BLSTM) classifier followed by convexity defect-based arrowhead detection. Our test results on biomedical images from the imageCLEF 2010 collection outperform the existing state-of-the-art arrow detection techniques by more than 3% in precision, 12% in recall and, therefore, 8% in \(\text{F}_1\) score.

To the best of our knowledge, this is the first time a sequential classifier combining a neural network with the geometrical shape of the arrow has been used. Our immediate next step is to label image regions using the arrows detected by the current work.