1 Introduction

The shape skeleton, or medial axis Blum (1973), is a structure-based object descriptor that reveals local symmetry as well as connectivity between object parts Marr and Nishihara (1978); Dickinson (2009). Modeling objects via their axes of symmetry and, in particular, using skeletons has a long history in computer vision. Skeletonization algorithms provide a concise and effective representation of deformable objects, while supporting many applications, including object recognition and retrieval Zhu and Yuille (1996); Felzenszwalb and Huttenlocher (2005); Bai et al. (2009); Trinh and Kimia (2011), pose estimation Girshick et al. (2011); Shotton et al. (2011); Wei et al. (2016), hand gesture recognition Ren et al. (2013), shape matching Siddiqi et al. (1999), scene text detection Zhang et al. (2015), and road detection in aerial scenes Sironi et al.; Mattyus et al. (2015); Máttyus et al. (2017).

Fig. 1

a Previous CNN-based methods treat skeleton detection as binary pixel classification, followed by non-maximum suppression (NMS). This can result in poor localization as well as disconnected segments. b The proposed DeepFlux method models the spatial context of skeletal points using a novel flux representation (left). The flux vector field encodes the position of skeletal points in relation to their associated image pixels, and hence also the implied object boundaries. This allows one to associate skeletal pixels with sinks, where the flux is absorbed, in the spirit of flux-based skeletonization methods Siddiqi et al. (2002). Red: ground truth skeleton; Green: detected skeleton (Color figure online)

Early algorithms for computing skeletons directly from images Lindeberg (1998); Liu et al. (1998); Jang and Hong (2001); Yu and Bajaj (2004); Nedzved et al. (2006); Zhang and Couloigner (2007); Lindeberg (2013) yield a gradient intensity map, driven by geometric constraints between skeletal pixels and edge fragments. Such methods cannot easily handle complex image data without prior information about object shape and location. Learning-based methods Levinshtein et al. (2013); Sie Ho Lee et al. (2013); Tsogkas and Kokkinos (2012); Shen et al. (2016); Sironi et al., on the other hand, demonstrate an improved ability for object skeleton detection in natural images, but are still unable to cope with complex backgrounds or clutter.

Convolutional neural networks (CNNs) are a specific instance of learning frameworks that have led to vast improvements in the performance of object skeleton detection algorithms in recent years Shen et al. (2016, 2017); Ke et al. (2017); Liu et al. (2017); Zhao et al. (2018); Liu et al. (2018); Xu et al. (2019). CNN-based methods typically frame the problem as one of binary pixel classification: given a dataset of images containing objects, paired with their (binary) skeleton annotations, the network is trained to predict the probability of each pixel belonging to a skeleton. The ground truth skeletons are usually extracted by applying a binary skeletonization algorithm to pre-segmented masks of the objects present in the image. As a result, the skeletons detected by each model are dataset-dependent. For instance, some datasets may only contain skeleton annotations for a single foreground object Shen et al. (2011), while others may involve scenes with multiple objects Ke et al. (2017), or may include annotations for background structures Tsogkas and Kokkinos (2012). This stands in contrast to recent work in unsupervised medial axis extraction from natural scenes Tsogkas and Dickinson (2017); Dufresne-Camaro et al. (2020).

Most of the aforementioned CNN-based methods derive from the Holistically-Nested Edge Detection (HED) model Xie and Tu (2015) or variations of it that better leverage multi-level features for capturing skeletons across a range of spatial scales. However, object skeleton computation in natural images using CNNs is inherently different from the problem of edge detection. As illustrated in Fig. 1a, edges associated with object boundaries can often be detected using information such as local appearance or texture changes. Such cues can be picked up by the more spatially accurate, shallow convolutional layers. Object skeletons, however, embody medial properties and high-level semantics. They are situated at regions within object parts that exhibit local bilateral symmetry, since the medial axis bisects the object angle Siddiqi and Pizer (2008). Successfully detecting skeletons purely from local image information (e.g., the green box numbered 3 in Fig. 1a) is challenging, since this requires reasoning over a larger spatial extent, such as the width of the torso of the horse in this case. Layers deeper in the CNN architecture are more appropriate for computing features at such coarser scales, but this presents a confound. Coarse features might not provide accurate spatial localization of the object skeleton.

In this paper, we propose a novel notion of spatial context flux, to accurately detect object skeletons within a CNN framework. Models based on the related notion of a field potential have also shown promise for other visual processing tasks that require non-local interactions, such as border ownership computations in the visual cortex Zucker (2012). We start by considering the spatial context of the skeleton, i.e., a neighborhood around a skeleton branch. For each context pixel, we define a two-dimensional unit flux vector pointing to the nearest skeleton pixel, generating a flux vector field. Within this representation, the object skeleton corresponds to pixels where the net inward flux is positive, following the motivation behind past flux-based methods for skeletonizing binary objects Siddiqi et al. (2002); Dimitrov et al. (2003). We then use a CNN to learn the spatial context flux, via a pixel-wise regression task in place of binary classification. The learned flux vector field encodes the relative locations of context and skeleton pixels, enabling the accurate recovery of the object skeleton via a simple post-processing step. Explicitly leveraging skeleton spatial context in our representation provides a larger receptive field size for estimation. This is helpful both for detecting medial points associated with larger spatial scales, and for more robust localization around junctions.

The present article builds upon work first presented in Wang et al. (2019). Our contributions can be enumerated as follows.

  1. We propose a novel spatial context flux representation for object skeleton detection. This concept explicitly encodes the relationship between image pixels and their closest skeletal points.

  2. Using this spatial context flux, we develop a method, dubbed DeepFlux, that accurately and efficiently detects object skeletons in an image.

  3. DeepFlux consistently outperforms state-of-the-art methods on six public benchmarks. To our knowledge, this is the first application of flux concepts, which have been successfully used for skeletonization of binary objects Siddiqi et al. (2002); Dimitrov et al. (2003), to the detection of object skeletons in natural images. It is also the first attempt at learning such flux-based representations directly from natural images.

Compared to the preliminary version of this study Wang et al. (2019), the current journal extension introduces two major improvements. First, we replace the post-processing step with a convolutional module, making the pipeline trainable in an end-to-end manner, while improving performance and requiring less runtime. Second, in addition to the flux and skeleton branch, we also learn the associated skeleton scale, allowing DeepFlux to detect objects in aerial images and achieve competitive performance against classical CNN-based object detectors.

2 Related Work

Object skeletonization has been widely studied in the last few decades. In our review, we contrast traditional, bottom-up methods with those that rely on supervised learning on annotated skeleton datasets.

2.1 Bottom-Up Skeletonization Methods

Many early skeleton detection algorithms Lindeberg (1998); Liu et al. (1998); Jang and Hong (2001); Yu and Bajaj (2004); Nedzved et al. (2006); Zhang and Couloigner (2007); Lindeberg (2013) are based on gradient intensity maps. In Siddiqi et al. (2002), the authors study the limiting average outward flux of the gradient of a Euclidean distance function to a 2D or 3D object boundary. The skeleton is associated with those locations where an energy principle is violated, i.e., where there is a net inward flux. Other researchers have constructed the skeleton by merging local skeleton segments with a learned segment-linking model. Levinshtein et al. (2013) propose a method to work directly on images, which uses multi-scale super-pixels and a learned affinity between adjacent super-pixels to group proximal medial points. A graph-based clustering algorithm is then applied to form the complete skeleton. Sie Ho Lee et al. (2013) improve the approach in Levinshtein et al. (2013) by using a deformable disc model, which can detect curved and tapered symmetric parts. A novel definition of an appearance medial axis transform (AMAT) has been proposed in Tsogkas and Dickinson (2017), to detect symmetry in the wild in a purely bottom-up, unsupervised fashion. In recent follow-up work Dufresne-Camaro et al. (2020), the AMAT framework is augmented by explicitly incorporating rules from the Shock Grammar for shapes Siddiqi et al. (1999), resulting in significant improvements in computational speed and medial axis quality. Finally, Jerripothula et al. (2017) describe an interesting framework for solving segmentation and skeletonization by exploiting the commonalities among different images of semantically similar objects, in a joint co-segmentation and co-skeletonization optimization scheme.

2.2 Learning-Based Skeleton Detection

In more recent literature Tsogkas and Kokkinos (2012); Shen et al. (2016); Sironi et al., object skeleton detection is treated as a pixel-wise classification or regression problem, and is solved using supervised learning. Tsogkas and Kokkinos (2012) extract hand-designed features at each pixel and train a classifier for symmetry detection. They employ a multiple instance learning (MIL) framework to accommodate the unknown scale and orientation of symmetry axes. Shen et al. (2016) extend the approach in Tsogkas and Kokkinos (2012) by training a group of MIL classifiers to capture the diversity of symmetry patterns. Sironi et al. propose a regression-based approach to improve the accuracy of skeleton locations. They train regressors that learn the distances to the closest skeleton in scale-space and identify the skeleton by finding the local maxima.

With the popularization of CNNs, deep learning-based methods Shen et al. (2016, 2017); Ke et al. (2017); Liu et al. (2017); Zhao et al. (2018); Liu et al. (2018) have shown great promise for object skeleton detection. Shen et al. (2016) propose an approach which fuses scale-associated deep side-outputs (FSDS), based on the architecture of HED Xie and Tu (2015). Since skeletons at different spatial scales can be captured in different stages, they supervise the side outputs with scale-associated ground-truth data. They then extend their original method by learning multi-task scale-associated deep side outputs (LMSDS) in Shen et al. (2017), which leads to improved skeleton localization and scale prediction, and better overall performance. Ke et al. (2017) present a side-output residual network (SRN), which leverages the output residual units to fit the errors between the ground-truth and the side-outputs. By cascading residual units in a deep-to-shallow manner, SRN can effectively detect the skeleton at different scales. Liu et al. (2017) develop a two-stream network that combines image and segmentation cues to capture complementary information for skeleton localization. Zhao et al. (2018) introduce a hierarchical feature integration (Hi-Fi) mechanism, where multi-scale features are integrated with bidirectional guidance so that high-level semantics and low-level details can benefit from each other. Liu et al. (2018) propose a linear span network (LSN) that uses linear span units to increase the independence of convolutional features and the efficiency of feature integration. In Xu et al. (2019), Xu et al. introduce a geometry-aware objective function based on Hausdorff distance, to better incorporate geometric constraints.

Fig. 2

The pipeline of the proposed method. For an input image, the network computes a two-dimensional vector field of symmetry spatial context flux (with a visualization of its magnitude and direction on the right). Based on this flux representation, we can recover medial axes reflecting object part symmetries by localizing points with high inward flux (followed by a morphological closing), or by using additional convolution layers, which makes the entire pipeline end-to-end trainable

2.3 Features of DeepFlux

Though the method we propose in the present paper also benefits from CNN-based learning, it differs from the methods in Shen et al. (2016, 2017); Ke et al. (2017); Liu et al. (2017); Zhao et al. (2018); Liu et al. (2018) in a fundamental way, due to its different learning objective. Instead of treating object skeleton detection in natural images as a binary classification problem, DeepFlux focuses on learning the spatial context flux of skeletons, and as such includes more informative non-local cues, such as the relative position of skeleton points to image points in their vicinity. Thus, the relationship between skeletal point locations and their associated object boundaries is also captured, at least implicitly. A direct consequence of this powerful image context flux representation is that a simple post-processing step can recover the skeleton directly from the learned flux. In this manner, we avoid the inaccurate localization of skeletal points by non-maximum suppression used in previous deep learning methods. In addition, DeepFlux enlarges the spatial extent used by the CNN to detect the skeleton, through its use of spatial context flux. This region-based flux representation allows our approach to capture larger object parts.

We note that the proposed DeepFlux is similar in spirit to the original notion of flux Siddiqi et al. (2002); Dimitrov et al. (2003) that is defined based on an object boundary, for skeletonization of 2D/3D binary objects. As such, DeepFlux inherits its mathematical properties including the unique mapping of skeletal points to boundary points. However, the present article is the first to extend this notion of flux to skeleton detection in natural images, where the flux is computed on dilated skeletons in a supervised learning setting. Our work is also related to the approaches in Bai and Urtasun (2017b); Maninis et al. (2018); Chen et al. (2018); Ci et al. (2018); Xu et al. (2019); Ahn et al. (2019); Kreiss et al. (2019); Peng et al. (2019) which learn direction cues for edge detection, instance segmentation, and pose estimation. In the present article, this direction information is encoded in the flux representation, and is implicitly learned for skeleton recovery.

2.4 Direction Fields in Models of Spatial Context

The use of direction fields to model spatial context has also shown promise in other computer vision applications, including image segmentation, object segmentation, and pose estimation. In Maninis et al. (2018), the authors propose to learn edge directions in addition to edge location, for generic image segmentation. Other methods make use of a direction field defined on regions of interest to achieve instance segmentation, such as the deep watershed transform in Bai and Urtasun (2017b), which regresses the distance map to boundaries obtained by semantic segmentation. A similar direction field on text areas is proposed in Xu et al. (2019), to extract instances of text in scenes, whereas direction cues pointing to object centers are used to improve instance and video segmentation in Chen et al. (2018) and Ci et al. (2018), respectively. Finally, direction cues are also used to improve instance segmentation in Ahn et al. (2019) and direction fields pointing towards keypoints are used for pose estimation in Kreiss et al. (2019); Peng et al. (2019).

3 Method

3.1 Overview

Many recent CNN-based skeleton detection approaches build on some variant of the HED architecture Xie and Tu (2015). The combination of a powerful classifier (CNN) and the use of side outputs to extract and combine features at multiple scales has enabled these systems to accurately localize medial points of objects in natural images. However, while state-of-the-art skeleton detection systems are quite effective at extracting medial axes of elongated structures, they still struggle when reasoning about ligature areas. This is not a surprise, because in contrast to the skeletal branches they connect, ligature areas exhibit much less structural regularity, making their exact localization ambiguous. As a result, most methods suffer from poor localization of ligature points, or from fragmentation between the medial axis segments that represent object parts.

We propose to mitigate this problem by casting skeleton detection as the problem of predicting a two-dimensional flux field from scene points to nearby skeleton points, within a fixed-size neighborhood. We then define skeleton points as the local flux minima, or, alternatively, as sinks “absorbing” flux from nearby points. We argue, and show empirically in our experiments, that this approach leads to more robust localization and better connectivity between skeletal branches. We also argue that considering a small neighborhood around the true skeleton points is sufficient, consistent with past approaches to binary object skeletonization Dimitrov et al. (2003). Whereas predicting the flux for the entire object would allow us to also infer the medial radius function, in this work we focus on improving medial point localization, and employ existing ideas for integrating scale prediction into our network, to tackle a high-level task in Sect. 5. The overall pipeline of the proposed method, which we dub DeepFlux, is depicted in Fig. 2.

Fig. 3

Dilating the object skeleton with a fixed-size disk defines a neighborhood of “skeleton spatial context”, shown in the middle figure as a binary mask. For each pixel \(\mathbf {p}\) within this neighborhood (excluding skeleton points), let \(\mathbf {p}_n\) be its nearest skeleton point. The flux \(\mathbf {F}(\mathbf {p})\) is defined as the two-dimensional unit vector pointing from \(\mathbf {p}\) to \(\mathbf {p}_n\). In the figure on the right, we visualize the orientation of all points in the skeleton neighborhood (the color wheel shows the correspondence between color and orientation). For the pixels outside the skeleton context neighborhood, there is no flux, i.e., the flux is set to (0, 0) (Color figure online)

3.2 Spatial Context Flux

Let \(\mathbf {p}= (x,y)\) be the coordinates of a pixel in a 2D RGB image. We represent the flux vector field \(\mathbf {F}(\mathbf {p}) = \mathbf {F}(x,y) = (F_x, F_y)\) as a two-channel map with continuous values \(F_x, F_y\), corresponding to the x and y coordinates of the flux vector, respectively. An intuitive visualization is shown in Fig. 3. In most related approaches, skeleton detection is framed as a binary classification task, for which the ground truth is a 1-pixel wide binary skeleton map. In our case, we are dealing with a regression problem, so we must modify the ground truth appropriately.

We divide a binary skeleton map into three non-overlapping regions: (1) skeleton spatial context, \(R_c\), which is a set of pixels in the vicinity of the skeleton; (2) skeleton pixels, denoted by \(R_s\); and (3) background pixels, \(R_b\). In practice, we obtain \(R_c\) by dilating the binary skeleton map with a disk of radius r, and subtracting skeleton pixels \(R_s\). Then, for each context pixel \(\mathbf {p}\in R_c\), we use an efficient distance transform algorithm Felzenszwalb and Huttenlocher (2012) to find its nearest skeleton pixel \(\mathbf {p}_n \in R_s\), in terms of \(L_2\) distance. We then define the flux on the context pixel \(\mathbf {p}\) as the unit direction vector that points from \(\mathbf {p}\) to \(\mathbf {p}_n\). For the remaining pixels composed of \(R_s\) and \(R_b\), we set the flux to (0, 0). Formally, we have:

$$\begin{aligned} \mathbf {F}(\mathbf {p}) = \left\{ \begin{matrix} \ \overrightarrow{\mathbf {p}\mathbf {p}_n}/\left| \overrightarrow{\mathbf {p}\mathbf {p}_n}\right| , &{} \mathbf {p}\in R_c \\ \\ (0,0), &{} \mathbf {p}\in R_s \cup R_b, \end{matrix} \right. \end{aligned}$$
(1)

where \(\left| \overrightarrow{\mathbf {p}\mathbf {p}_n}\right| \) denotes the length of the vector from pixel \(\mathbf {p}\) to \(\mathbf {p}_n\). We note that \(\mathbf {F}(\mathbf {p})\) is defined as a unit vector field only at the context pixels in our groundtruth; at test time, the predicted field is not normalized.
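To make this construction concrete, the following NumPy/SciPy sketch shows one way to build such a ground-truth flux field from a binary skeleton map. The function and variable names are ours, and the use of scipy.ndimage is an illustrative implementation choice, not the toolchain used in our experiments.

```python
import numpy as np
from scipy import ndimage

def make_flux_ground_truth(skeleton, r=7):
    """skeleton: (H, W) boolean map of 1-pixel-wide skeleton points (R_s).
    Returns the flux field (2, H, W) and the context mask R_c."""
    # R_c: dilate the skeleton with a disk of radius r, then remove skeleton pixels.
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    disk = (xx ** 2 + yy ** 2) <= r ** 2
    dilated = ndimage.binary_dilation(skeleton, structure=disk)
    context = dilated & ~skeleton

    # The Euclidean distance transform of the complement returns, for every pixel,
    # the indices (ny, nx) of its nearest skeleton pixel p_n.
    _, (ny, nx) = ndimage.distance_transform_edt(~skeleton, return_indices=True)

    h, w = skeleton.shape
    py, px = np.mgrid[:h, :w]
    vy, vx = ny - py, nx - px                      # vector from p to p_n
    norm = np.sqrt(vx ** 2 + vy ** 2)
    norm[norm == 0] = 1.0                          # avoid division by zero on R_s
    flux = np.zeros((2, h, w), dtype=np.float32)
    flux[0][context] = (vx / norm)[context]        # F_x on R_c
    flux[1][context] = (vy / norm)[context]        # F_y on R_c; zero on R_s and R_b
    return flux, context
```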

As a representation of the spatial context associated with each skeletal pixel, our proposed spatial context flux possesses a few distinct advantages when used to detect object skeletons in the wild. Unlike most learning approaches that predict skeleton probabilities individually for each pixel, our DeepFlux method leverages consistency between flux predictions within a neighborhood around each candidate pixel. Conversely, if the true skeleton location changes, the surrounding flux field will also change noticeably. A beneficial side-effect is that our method does not rely directly on the coarse responses produced by deeper CNN layers for localizing skeletons at larger scales, which further reduces localization errors. As we show in our experiments, these properties make our method more robust in localizing skeleton points, especially around ligature regions, and less prone to gaps, discontinuities, and irregularities caused by local mispredictions. In Sect. 3.5, we explain how we can easily and accurately recover a binary object skeleton using the magnitude and direction of the predicted flux.

Fig. 4

End-to-end network architecture. We adopt the pre-trained VGG16 Simonyan and Zisserman (2015) (or ResNet101 He et al. (2016)) with the ASPP module Chen et al. (2018) as the backbone network. We then obtain multi-level features by concatenating features extracted from stage3 (or stage2 for a ResNet101 backbone) to stage5 and the ASPP layer. The network is trained to regress the spatial context flux \(F=( {\mathbf {F_x}}, {\mathbf {F_y}})\) and predict a skeleton confidence score map. Our architecture can be easily augmented with a scale prediction branch to facilitate high-level tasks. In Sect. 5 we describe how to use such a variant to detect objects in aerial images

3.3 Network Architecture

The network for learning the spatial context flux of skeletons closely follows the fully convolutional architecture of Long et al. (2015), and is shown in Fig. 4. It consists of five modules: (1) a backbone network used to extract 3D feature maps; (2) an “atrous” spatial pyramid pooling (ASPP) module Chen et al. (2018) to enlarge the receptive field while avoiding excessive downsampling; (3) a multi-stage feature fusion module; (4) a flux regression and skeleton classification module, built from convolution and up-sampling layers; and (5) an optional skeleton scale prediction branch, which helps to bridge the gap between skeleton extraction and a complete medial axis transform Blum (1973).

To ensure a fair comparison with previous work, we also adopt VGG16 Simonyan and Zisserman (2015) as the backbone network. As in Xie and Tu (2015), we discard the last pooling layer and the fully connected layers that follow. In the rest of the text, we call this variant DeepFlux-VGG16. The use of the atrous module is motivated by the need for a wide receptive field: when extracting skeletons we have to guarantee that the receptive field of the network is wider than the largest medial radius of an object part in the input image. The receptive field of the VGG16 backbone is 196, which is not wide enough for large objects. Furthermore, it has been demonstrated in Luo et al. (2016) that the effective receptive field only takes up a fraction of the full theoretical receptive field. Thus, we employ ASPP to capture multi-scale information. Specifically, four parallel atrous convolutional layers with \(3\times 3\) kernels but different atrous rates (2, 4, 8, 16) are added to the last layer of the backbone, followed by a concatenation along the channel dimension. In this way, we obtain feature maps with a theoretical receptive field size of 708, which we have found to be large enough for the images we have experimented on.
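As a rough sanity check of this figure (the arithmetic below is ours, not part of the original analysis): a \(3\times 3\) convolution with atrous rate d, applied to features at output stride s, enlarges the theoretical receptive field by \((3-1)\cdot d\cdot s\) pixels, so the largest-rate branch (\(d=16\) at stride \(s=16\)) yields
$$\begin{aligned} 196 + (3-1)\times 16\times 16 = 708. \end{aligned}$$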

To construct a multi-scale representation of the input image, we fuse the feature maps from side outputs at the conv3, conv4, conv5, and ASPP layers, after convolving them with a \(1 \times 1\) kernel. Since feature maps at different levels have different spatial resolutions, we resize them all to the dimensions of conv3 before concatenating them. We then predict the flux field from the fused features and up-sample it to the dimensions of the input image using bilinear interpolation. This yields a 2-channel response map, corresponding to flux predictions \({\hat{\mathbf {F}}}(\mathbf {p})\) for every pixel \(\mathbf {p}\) in the image.

We propose two different ways of extracting skeletons from this 2-channel response map. The first is a simple post-processing scheme, described in Sect. 3.5. The second extends our network by plugging in three \(3 \times 3\) convolutional layers (with 64-channel outputs for the first two layers) after the (up-sampled) flux field prediction layer, which together output a pixel-wise skeleton confidence score. This score can subsequently be thresholded to produce a binary skeleton. Our network outputs both types of prediction (learned flux and skeleton confidence), as shown in Fig. 4.
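For illustration, a PyTorch-style sketch of this wiring is given below, assuming a torchvision VGG16 backbone; the layer indices, channel widths, and module names are our own assumptions (the original model was implemented in Caffe), so the snippet should be read as a schematic rather than a faithful re-implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class DeepFluxVGG16Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16().features          # ImageNet-pretrained weights would be loaded in practice
        # Stage splits follow torchvision's layer indexing (our assumption).
        self.stage3 = feats[:16]          # conv1_1 .. relu3_3 (1/4 resolution, 256 ch)
        self.stage4 = feats[16:23]        # pool3 .. relu4_3 (1/8 resolution, 512 ch)
        self.stage5 = feats[23:30]        # pool4 .. relu5_3 (1/16 resolution); pool5 dropped
        # ASPP: four parallel 3x3 atrous convolutions with rates 2, 4, 8, 16.
        self.aspp = nn.ModuleList(
            nn.Conv2d(512, 128, 3, padding=d, dilation=d) for d in (2, 4, 8, 16))
        # 1x1 side convolutions applied before multi-level fusion.
        self.side = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in (256, 512, 512, 512))
        self.flux_head = nn.Conv2d(4 * 64, 2, 1)          # regresses (F_x, F_y)
        self.skeleton_head = nn.Sequential(               # confidence from predicted flux
            nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, x):
        c3 = self.stage3(x)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        a = torch.cat([conv(c5) for conv in self.aspp], dim=1)
        size = c3.shape[2:]               # fuse everything at the conv3 resolution
        fused = torch.cat([F.interpolate(self.side[i](f), size, mode='bilinear',
                                         align_corners=False)
                           for i, f in enumerate((c3, c4, c5, a))], dim=1)
        flux = F.interpolate(self.flux_head(fused), x.shape[2:], mode='bilinear',
                             align_corners=False)
        skeleton_logits = self.skeleton_head(flux)
        return flux, skeleton_logits
```

Here, skeleton_head plays the role of the three \(3 \times 3\) convolutional layers of the end-to-end variant, while the post-processing variant would instead apply the procedure of Sect. 3.5 to the flux output.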

We also consider an alternative architectural choice for the proposed model by replacing the VGG16 backbone with ResNet101 He et al. (2016). In this case, similar to DeepFlux-VGG16, we fuse the feature maps from different side outputs. Specifically, we apply \(1 \times 1\) convolution at conv2 (whose spatial size is already 1/4 of the original image), conv3, conv4, conv5, and the ASPP layers. We then concatenate the resized side outputs together as a multi-scale representation of the input image. The following layers are kept the same as the DeepFlux-VGG16 variant. In the rest of the text, we call this variant DeepFlux-ResNet101. When not specified, we assume a DeepFlux-VGG16 architecture.

Finally, similar to previous work Shen et al. (2017), we explore the advantages of simultaneously predicting skeleton position and scale, bridging the gap between skeleton extraction and a complete medial axis transform Blum (1973). This also provides us with a richer representation that can find practical use in downstream tasks. To this end, we optionally include an additional branch to the DeepFlux backbone, that predicts the scale s associated with each medial point, as shown in Fig. 4.

3.4 Training Objective

We split our loss function into two terms, one for each type of output. For the flux field branch, we choose the \(L_2\) loss function as our training objective. Due to a severe imbalance in the number of context and background pixels, we adopt a class-balancing strategy similar to the one in Xie and Tu (2015). Our balanced flux loss function is

$$\begin{aligned} L_f = \sum _{\mathbf {p}\in \varOmega }{w_f(\mathbf {p}) \cdot \left\| \mathbf {F}(\mathbf {p})- {\hat{\mathbf {F}}}(\mathbf {p})\right\| _2^2}, \end{aligned}$$
(2)

where \(\varOmega \) is the image domain, \({\hat{\mathbf {F}}}(\mathbf {p})\) is the predicted flux, and \(w_f(\mathbf {p})\) denotes the weight coefficient of pixel \(\mathbf {p}\). The weight \(w_f(\mathbf {p})\) is calculated as follows:

$$\begin{aligned} w_f(\mathbf {p}) = \left\{ \begin{array}{ll} \frac{|R_b|}{|R_c|+|R_b|+|R_s|}, &{}\quad \mathbf {p}\in R_c \cup R_s \\ \\ \frac{|R_c|+|R_s|}{|R_c|+|R_b|+|R_s|}, &{}\quad \mathbf {p}\in R_b, \end{array} \right. \end{aligned}$$
(3)

where \(|R_c|\), \(|R_b|\) and \(|R_s|\) denote the number of context, background, and skeleton pixels, respectively.

The second branch, which predicts skeleton probability scores from the predicted flux, corresponds to a standard binary classification problem. We follow Xie and Tu (2015) and use a class-balanced cross-entropy loss function

$$\begin{aligned} L_s = - \beta \sum _{\mathbf {p}\in R_s}{\log S(\mathbf {p})} - (1-\beta ) \sum _{\mathbf {p}\in R_b \cup R_c}{\log (1-S(\mathbf {p}))}, \end{aligned}$$
(4)

where S denotes the skeleton probability scores (obtained using a sigmoid function), and \(\beta = \tfrac{|R_c|+|R_b|}{|R_s|+|R_c|+|R_b|}\) is the balancing factor.

The final training objective is given by summing the two loss terms:

$$\begin{aligned} L = L_f + \lambda _1 L_s, \end{aligned}$$
(5)

where \(\lambda _1\) is a hyper-parameter. We set \(\lambda _1=1\) in all our experiments.

For the optional extra scale prediction branch, we use a smoothed-L1 loss for scale regression:

$$\begin{aligned} L_{scale} = smooth_{L1}\left( \frac{{\hat{s}} - s}{s}\right) , \end{aligned}$$
(6)

where \({\hat{s}}\) is the predicted scale and s is the ground truth. When we also predict the skeleton scale, the overall loss is given by \(L + \lambda _2 L_{scale}\), where \(\lambda _2\) is a hyper-parameter that is set to 1 in all our experiments.
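A compact sketch of the full training objective (Eqs. 2-6) is given below, in PyTorch-style code with our own naming conventions; the reduction used for the smooth-\(L_1\) term is an assumption on our part.

```python
import torch
import torch.nn.functional as F

def deepflux_loss(pred_flux, gt_flux, pred_logits, context_mask, skeleton_mask,
                  pred_scale=None, gt_scale=None, lambda1=1.0, lambda2=1.0):
    """pred_flux, gt_flux: (B, 2, H, W); pred_logits: (B, 1, H, W) skeleton logits;
    context_mask, skeleton_mask: (B, 1, H, W) boolean masks for R_c and R_s."""
    background_mask = ~(context_mask | skeleton_mask)              # R_b
    n_c = context_mask.sum().float()
    n_s = skeleton_mask.sum().float()
    n_b = background_mask.sum().float()
    n = n_c + n_s + n_b

    # Eq. (3): class-balancing weights for the flux regression loss.
    w = torch.where(background_mask, (n_c + n_s) / n, n_b / n)

    # Eq. (2): weighted L2 loss between predicted and ground-truth flux.
    l_f = (w * ((pred_flux - gt_flux) ** 2).sum(dim=1, keepdim=True)).sum()

    # Eq. (4): class-balanced cross-entropy for the skeleton confidence branch.
    beta = (n_c + n_b) / n
    s = torch.sigmoid(pred_logits).clamp(1e-6, 1 - 1e-6)
    l_s = -(beta * torch.log(s[skeleton_mask]).sum()
            + (1 - beta) * torch.log(1 - s[~skeleton_mask]).sum())

    loss = l_f + lambda1 * l_s                                     # Eq. (5)

    # Optional scale branch, Eq. (6): smooth-L1 on the relative scale error at
    # skeleton pixels (mean reduction is our assumption).
    if pred_scale is not None:
        rel = (pred_scale[skeleton_mask] - gt_scale[skeleton_mask]) / gt_scale[skeleton_mask]
        loss = loss + lambda2 * F.smooth_l1_loss(rel, torch.zeros_like(rel))
    return loss
```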

Algorithm 1

3.5 From Flux to Skeleton Predictions

We propose a simple post-processing procedure to recover the object skeleton from the predicted spatial context flux. As described in Eq. (1), pixels around the skeleton are labeled with unit two-dimensional vectors while the others are set to (0, 0). Thus, thresholding the magnitude of the vector field reveals the context pixels while computing the flux direction reveals the location of context pixels relative to the skeleton. We refer the reader to Fig. 2 for a visualization of the post-processing steps, listed in Algorithm 1.

Let \(|{\hat{\mathbf {F}}}|\) and \(\angle {\hat{\mathbf {F}}}\) be the magnitude and direction of the predicted context flux \({\hat{\mathbf {F}}}\), respectively. For a given pixel \(\mathbf {p}\), \(\angle {\hat{\mathbf {F}}}(\mathbf {p})\) is binned into one of 8 directions, pointing to one of the 8 neighbors, denoted by \({\mathcal {N}}_{\angle {\hat{\mathbf {F}}}(\mathbf {p})}(\mathbf {p})\). Having computed these two quantities, extracting the skeleton is straightforward: pixels close to the real object skeleton should have a high inward flux, due to a singularity in the vector field \({\hat{\mathbf {F}}}\), as analyzed in Dimitrov et al. (2003). These pixels are defined as “quench points”. Finally, we apply a morphological dilation with a disk structuring element of radius \(k_1\), followed by a morphological erosion with a disk of radius \(k_2\), to group quench points together and produce the object skeleton. We call this variant DeepFlux-P.
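Since only a high-level description of these steps is reproduced here, the NumPy sketch below is our reconstruction of the DeepFlux-P post-processing from that description (magnitude thresholding, direction binning into 8 neighbors, quench-point detection, then dilation and erosion); it approximates Algorithm 1 rather than reproducing it exactly, and the default values follow Sect. 4.2.

```python
import numpy as np
from scipy import ndimage

def flux_to_skeleton(flux, lambda_m=0.4, k1=3, k2=4):
    """flux: (2, H, W) predicted context flux (F_x, F_y). Returns a binary skeleton."""
    fx, fy = flux
    mag = np.sqrt(fx ** 2 + fy ** 2)
    context = mag > lambda_m                     # pixels recognized as skeleton context

    # Bin the flux direction into 8 neighbors; mark the pixels that context pixels
    # point to but whose own flux is weak, i.e., pixels "absorbing" inward flux.
    bins = np.round(np.arctan2(fy, fx) / (np.pi / 4)).astype(int) % 8
    offsets = [(0, 1), (1, 1), (1, 0), (1, -1),
               (0, -1), (-1, -1), (-1, 0), (-1, 1)]   # (dy, dx) for bins 0..7
    h, w = mag.shape
    pointed_to = np.zeros((h, w), dtype=bool)
    for y, x in zip(*np.nonzero(context)):
        dy, dx = offsets[bins[y, x]]
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w:
            pointed_to[ny, nx] = True
    quench = pointed_to & (mag <= lambda_m)

    # Group quench points: dilation with radius k1, then erosion with radius k2.
    def disk(r):
        yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
        return (xx ** 2 + yy ** 2) <= r ** 2
    grouped = ndimage.binary_dilation(quench, structure=disk(k1))
    return ndimage.binary_erosion(grouped, structure=disk(k2))
```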

One can also learn to predict skeleton confidence from the predicted flux field. More precisely, as described in Sect. 3.3, we add three \(3 \times 3\) convolution layers after the flux prediction layer, and train this branch in the standard manner for a binary skeleton classification problem, using a cross-entropy loss. We call this end-to-end trainable variant DeepFlux-E, and use it as our default, unless explicitly stated.

Fig. 5

Example images selected from different datasets, and their corresponding annotations. The ground truth annotations are thickened and drawn in red for improved visibility. Best viewed in color (Color figure online)

4 Experiments on Skeleton and Centerline Detection

We conduct experiments on six challenging datasets, five of which are publicly available: SK-LARGE Shen et al. (2017), SK506 Shen et al. (2016), WH-SYMMAX Shen et al. (2016), SYM-PASCAL Ke et al. (2017), SYMMAX300 Tsogkas and Kokkinos (2012); and SK-AID, a bridge/road centerline dataset we collected ourselves from AID Xia et al. (2017), which will also be made publicly available. Some sample images are shown in Fig. 5. We note that for some of these datasets, only the skeletons of foreground objects are annotated, whereas others come with skeleton or centerline annotations for both foreground objects and background structures.

We describe the above datasets and the evaluation protocol in detail, in Sect. 4.1. We follow with implementation details in Sect. 4.2. Qualitative and quantitative results are shown in Sect. 4.3. We carry out a runtime analysis and an ablation study in Sects. 4.4 and 4.5, respectively.

4.1 Dataset and Evaluation Protocol

SK-LARGE Shen et al. (2017) is a benchmark for foreground object skeleton extraction, consisting of 746 training and 745 test images. Each image in SK-LARGE is obtained by cropping an image from MS-COCO Chen et al. (2015) so that it contains a single, centered object. SK-LARGE contains various object categories including person, horse, giraffe, and man-made objects such as plane and hydrant. In this dataset, both the location and scale (the radius of the corresponding maximal disk) of each skeletal point are annotated.

SK506 Shen et al. (2016), also referred to as SK-SMALL, is an earlier version of SK-LARGE released by the same authors Shen et al. (2017). There are 300 training images and 206 test images. Note that this dataset contains less training data, which might make the training of deep neural networks more challenging.

WH-SYMMAX Shen et al. (2016) contains 328 cropped images from the Weizmann Horse dataset Borenstein and Ullman (2002), and their skeleton point and scale annotations. The dataset is split into 228 training images and 100 test images.

Fig. 6

Some qualitative results on SK-LARGE, WH-SYMMAX, SYM-PASCAL, and SYMMAX300. Red: GT; Green: detected skeleton; Yellow: detected skeleton and GT overlap. Qualitatively, DeepFlux-P performs similarly to the variant DeepFlux-E. Two examples of partial failure are also shown on the bottom right (enclosed by red boxes). In one case, DeepFlux fails to detect the skeleton on the body of the bird due to image blurring. In the other case, DeepFlux detects a horizontal symmetry axis instead of the vertical one annotated in the ground truth (Color figure online)

SK-AID is built on AID Xia et al. (2017), a dataset for aerial scene classification with 20 scene categories. We use 60 images for training and 40 images for testing, and focus on two object categories: bridge and road. We manually annotate the segmentation masks of roads and bridges, and then adopt a binary skeletonization algorithm Shen et al. (2011) to obtain their centerlines as the skeleton ground truth. As shown in Fig. 5, SK-AID exhibits a large variation in skeleton orientation and curvature, as well as challenging cases of junctions of multiple skeleton branches.

SYM-PASCAL Ke et al. (2017) is derived from the PASCAL-VOC-2011 segmentation dataset Everingham et al. (2010) for symmetry detection in the wild, and contains 648 training and 787 test images. Compared to SK-LARGE and SK506, the images from this dataset possess more complex backgrounds and variations of object appearance, including occlusions and missing parts, making it quite challenging.

SYMMAX300 Tsogkas and Kokkinos (2012) is built on the Berkeley Segmentation Dataset (BSDS300) Martin et al. (2001), which contains 200 training images and 100 test images. Unlike the datasets described above, both foreground and background regions are considered. It is noteworthy that each image in SYMMAX300 is accompanied by 5-7 symmetry annotations, corresponding to the multiple segmentation annotations present in BSDS300. The final local symmetry annotation is obtained by merging all available annotations for a given image, through a binary union operation.

Evaluation protocol Following previous work Tsogkas and Kokkinos (2012); Shen et al. (2016, 2017), we use precision-recall (PR) curves and the F-measure metric to evaluate skeleton detection performance in our experiments. For methods that output a skeleton probability map (including our end-to-end variant DeepFlux-E), we first apply a standard non-maximal suppression (NMS) algorithm Dollár and Zitnick (2015). We then threshold the thinned skeleton into a binary map and match it with the ground truth using a bi-partite matching routine that allows for small localization errors Martin et al. (2004). We select threshold values that yield the highest F-measure for each method-dataset combination.

Fig. 7

Quantitative evaluation in terms of PR curves on six skeleton detection datasets. Both DeepFlux-VGG16 (in green) and DeepFlux-ResNet101 (in blue) offer high precision, especially in the high-recall regime. A stronger backbone (e.g., ResNet101) leads to more accurate skeleton detection (Color figure online)

For the variant DeepFlux-P, which does not directly output skeleton probabilities, we use the inverse magnitude of predicted context flux on the recovered skeleton as a surrogate for a “skeleton confidence”. Thresholding at different values gives rise to a PR curve and the optimal threshold for each dataset is selected as the one producing the highest F-measure according to the formula \(F = 2PR/(P+R)\). The F-measure is commonly reported as a single scalar performance index.
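The sketch below illustrates this threshold sweep; note that it replaces the bipartite matching of Martin et al. (2004) with a simple distance-tolerance matching, so it is only a rough stand-in for the official benchmark code, and it assumes the input probability map has already been thinned by NMS.

```python
import numpy as np
from scipy import ndimage

def best_fmeasure(prob_map, gt_skeleton, tol=5, num_thresholds=50):
    """prob_map: (H, W) thinned skeleton probabilities; gt_skeleton: (H, W) boolean GT."""
    dist_to_gt = ndimage.distance_transform_edt(~gt_skeleton)
    best = 0.0
    for t in np.linspace(0.01, 0.99, num_thresholds):
        pred = prob_map > t
        tp = np.count_nonzero(pred & (dist_to_gt <= tol))         # predictions near GT
        precision = tp / max(np.count_nonzero(pred), 1)
        dist_to_pred = ndimage.distance_transform_edt(~pred)
        recall = np.count_nonzero(gt_skeleton & (dist_to_pred <= tol)) / \
                 max(np.count_nonzero(gt_skeleton), 1)             # GT pixels near predictions
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```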

4.2 Implementation Details

Our implementation involves one major hyperparameter: the radius r of the skeleton context neighborhood, which is set to 7 for all experiments. For the DeepFlux-P variant there are three extra hyperparameters (the values provided are the ones used in our experiments): the threshold used to recover skeletal points from the predicted flux field, \(\lambda _m=0.4\), and the sizes of the structuring elements involved in the morphological operations for skeleton recovery, \(k_1=3\) and \(k_2=4\).

For training, we adopt standard data augmentation strategies Shen et al. (2016, 2017); Zhao et al. (2018). Specifically, we resize training images to 3 different scales (0.8, 1, 1.2) and then rotate them to 4 angles (\(0^\circ \), \(90^\circ \), \(180^\circ \), \(270^\circ \)). We also flip them with respect to different axes (up-down, left-right, no flip). We consider two different initializations for the proposed network, one with the VGG16 Simonyan and Zisserman (2015) and one with the ResNet101 He et al. (2016) model, pre-trained on ImageNet Deng et al. (2009) and optimized using ADAM Kingma and Ba (2015). For the first 80k iterations, the learning rate is set to \(10^{-5}\) for the backbone (VGG16 or ResNet101) layers and to \(10^{-4}\) for the rest of the layers in the network, then reduced to \(10^{-6}\) and \(10^{-5}\), respectively, for the remaining 40k iterations.

We use the Caffe Jia et al. (2014) framework to train DeepFlux. All experiments are carried out on a workstation with an Intel Xeon 16-core CPU (3.5GHz), 64GB RAM, and a single Titan Xp GPU. Training on SK-LARGE with batch size set to 1 takes about 2 hours.

Table 1 Quantitative comparison in terms of F-measure

4.3 Results

Comparison with other methods. We start by showing a qualitative comparison of DeepFlux-VGG16 with other skeleton detection methods, on images from WH-SYMMAX and SYM-PASCAL. As illustrated in Figs. 1 and 6, DeepFlux accurately localizes skeleton points while preserving good connectivity at junctions.

In Fig. 7 we plot the PR-curves for SK-LARGE, SK506, WH-SYMMAX, SK-AID, and SYM-PASCAL. DeepFlux significantly outperforms other methods in all cases, excelling in the high-precision regime. This is indicative of the role of local context towards more robust and accurate localization of skeleton points.

Table 1 lists the optimal F-measure score for all methods. DeepFlux-VGG16 consistently outperforms all other approaches. Specifically, DeepFlux-VGG16-E surpasses the most recent method Hi-Fi Zhao et al. (2018) by \(1.2\%\), \(2.3\%\), \(5.0\%\), \(6.8\%\), and \(11.6\%\) on SK-LARGE, SK506, WH-SYMMAX, SK-AID, and SYM-PASCAL, respectively, despite the fact that Hi-Fi uses stronger supervision during training (skeleton position and scale). DeepFlux-VGG16-E also outperforms LSN Liu et al. (2018), another recent method, by \(6.8\%\), \(7.1\%\), \(5.8\%\), \(14.5\%\), and \(5.1\%\) on SK-LARGE, SK506, WH-SYMMAX, SYM-PASCAL, and SYMMAX300, respectively. It is noteworthy that the proposed DeepFlux improves over the previous state-of-the-art by more than \(11\%\) in terms of F-measure on SYM-PASCAL, whose images have more complex backgrounds and variations in object appearance. This implies that DeepFlux is better able to handle skeleton detection in complex images. For a fair comparison with previous methods, we also report results for DeepFlux using the vanilla VGG16 architecture without the ASPP module. Barring SK-LARGE, where the proposed DeepFlux performs slightly worse than Hi-Fi Zhao et al. (2018), DeepFlux significantly outperforms competing methods on all other datasets. It is also noteworthy that Hi-Fi Zhao et al. (2018) relies on additional scale supervision during training, which is not the case for DeepFlux.

GeoSkeletonNet Xu et al. (2019) is trained using “resolution normalization”: the authors resize the images and their associated ground-truth from a size of \(H \times W\) to \(\sqrt{KH/W} \times \sqrt{KW/H}\) (\(K = 180000\) for SYM-PASCAL and \(K = 60000\) for the other datasets) before applying data augmentation. This procedure normalizes the number of pixels to a fixed value K, while keeping the aspect ratio of the images the same, factoring out the variance of resolutions across different datasets. Using the same resolution normalization protocol, DeepFlux-VGG16-E achieves a 0.758 (+0.1%), 0.730 (+0.3%), 0.863 (+1.4%), and 0.569 (+4.9%) F-score on SK-LARGE, SK506, WH-SYMMAX, and SYM-PASCAL, respectively.
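As a worked example of this normalization (our illustration, not code from GeoSkeletonNet):

```python
import math

def normalized_size(h, w, k=60000):
    # New size sqrt(K*H/W) x sqrt(K*W/H): the pixel count becomes K while the
    # aspect ratio H/W is preserved.
    return round(math.sqrt(k * h / w)), round(math.sqrt(k * w / h))

print(normalized_size(600, 400))   # -> (300, 200); 300 * 200 = 60000 pixels
```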

Comparison of different network backbones Using a more powerful backbone further boosts performance. DeepFlux-ResNet101-E improves over DeepFlux-VGG16-E by \(1.8\%\), \(1.5\%\), \(1.2\%\), \(0.1\%\), and \(2.3\%\) on SK-LARGE, SK506, WH-SYMMAX, SK-AID, and SYM-PASCAL, respectively. The modest gains from the more powerful ResNet on SK-AID can potentially be attributed to the significantly lower variation of skeleton scales in that dataset; the capacity of VGG16 seems to be sufficient to already achieve close to \(90\%\) accuracy. Curiously, DeepFlux-ResNet101 performs slightly worse than DeepFlux-VGG16 on SYMMAX300. Our hypothesis is that, because of the multiple, potentially conflicting, annotations per image in this dataset, the lower capacity of the VGG16 may act as a regularizer, leading to slightly better performance.

Post-processing versus end-to-end training DeepFlux-E (end-to-end) performs slightly better than DeepFlux-P (post-processing) in all cases. In particular, DeepFlux-VGG16-E outperforms DeepFlux-VGG16-P by \(1.2\%\) and \(0.6\%\) on SYM-PASCAL and SYMMAX300, respectively. As shown in the qualitative results on SYMMAX300 in Fig. 6 (see the blue dashed circles), DeepFlux-E better preserves connectivity at ligature areas than DeepFlux-P, which may only have a few quench points instead of a set of connected ones, due to direction discretization into 8 bins. DeepFlux-E also enjoys a slightly faster runtime, as shown in the comparison in Table 2.

Failure cases Despite the effectiveness of DeepFlux in accurately detecting object skeletons in images, there are some challenging cases where the model fails partially. An example is illustrated on the middle right of Fig. 6, where the skeleton of the body of the bird is not detected due to severe image blurring. Another example of failure is shown on the bottom right in Fig. 6, where DeepFlux fails to capture the symmetry of each bus instance individually, detecting instead the horizontal symmetry axis of the entire cluster.

Table 2 Runtime and performance on SK-LARGE. For DeepFlux-P, we list the total inference (GPU) + post-processing (CPU) time

4.4 Runtime Analysis

In Table 2 we compare the runtime of DeepFlux to alternatives. Since competing models typically use the VGG16 backbone, we mainly employ the DeepFlux-VGG16 variant in our analysis, to keep the comparison fair. As shown in Table 2, DeepFlux is as fast as competing methods while achieving superior performance. Inference of DeepFlux-VGG16-E on the GPU takes on average 14 ms for a \(300 \times 200\) image, which is faster than other methods. The DeepFlux-VGG16-P variant requires on average an extra 3 ms on the CPU, for post-processing.

4.5 Ablation Study

We study the contribution of the two main modules (ASPP module and flux representation) to skeleton detection on SK-LARGE and SYM-PASCAL, by removing them one at a time from the VGG16 backbone. We conduct four experiments corresponding to the four possible combinations of each module being present or not.

When the spatial context flux representation is not used, we train the model with the same architecture, but for binary classification using a binary cross-entropy loss. The baseline model is trained without the ASPP module and without the spatial context flux representation. As depicted in Table 3, the ASPP module, which offers a larger receptive field, results in an improvement of \(1.9\%\) on SK-LARGE and \(5.1\%\) on SYM-PASCAL, compared to the baseline model. This confirms that a large receptive field is beneficial for skeleton/symmetry extraction. We then remove the ASPP module and train the model using the proposed flux representation, which yields an improvement of \(0.8\%\) on SK-LARGE and \(3.5\%\) on SYM-PASCAL. These gains are complementary to each other; indeed, combining both the ASPP module and the flux representation improves performance over the baseline by \(2.9\%\) on SK-LARGE and \(9.7\%\) on SYM-PASCAL.

Table 3 Ablation study on the effect of the spatial context flux representation and the ASPP module on the performance in terms of F-measure
Table 4 Ablation study on the influence of the context size r on the performance in terms of F-measure

We also study the effect of the size r of the neighborhood within which context flux is computed. We conduct experiments with different radii, ranging from \(r=3\) to \(r=11\), with a step of 2, on the SK-LARGE and SYM-PASCAL datasets. The best results are obtained for \(r=7\), and using smaller or larger values seems to slightly decrease performance. Our understanding is that a narrower spatial context neighborhood provides less contextual information to predict the final skeleton map. On the other hand, using a wider neighborhood may increase the chance for mistakes in flux prediction around areas of severe discontinuities, such as the areas around boundaries of thin objects that are fully contained in the context neighborhood. DeepFlux does not appear to be sensitive to the value of r, as shown in Table 4.

Finally, one may argue that simply using a dilated ground truth when training the network for skeleton classification is sufficient to make the model more robust in accurately localizing skeletal points. To examine if this is the case, we removed the flux module and retrained our VGG16-based model on the same dilated skeletons we used to compute the spatial context flux ground truth, using a binary cross-entropy loss instead. Without the spatial context flux representation, the performance drops from \(F=0.736\) to \(F=0.697\) (\(-3.9\%\)) on SK-LARGE and from \(F=0.570\) to \(F=0.490\) (\(-8\%\)) on SYM-PASCAL, demonstrating the effectiveness of our proposed representation for accurate localization.

Table 5 Comparison with some state-of-the-art methods dedicated for object detection in remote sensing images on DOTA Xia et al. (2018)
Fig. 8

Qualitative visualization of some large vehicle detection results on the DOTA dataset. Pink line: predicted skeleton segment; Green box: large vehicle detection; Red box: GT; Best viewed in the electronic version (Color figure online)

5 Application to Object Detection in Aerial Images

We consider an application of simultaneously predicting skeleton position and scale, for the task of detecting large vehicles in remote sensing imagery. For a fair comparison with other methods, we use ResNet101 as the backbone. The stride between block3 and block4 is set to 1, and all the layers in block4 are replaced with dilated convolution layers.

Predicting the scale associated with each skeleton pixel allows us to generate the object mask in a straightforward way. Let \({\hat{s}}_i\) denote the predicted scale for a skeleton pixel \(x_i\) in a skeleton segment (i.e., a connected component of the binary object skeleton). We obtain the object mask as \(O = \cup _{i=1}^{N} D_i\), where N is the number of skeleton pixels in the segment, and \(D_i\) is the disk of radius \({\hat{s}}_i\) centered at \(x_i\). We use the bounding box of O as our final detection, and the mean magnitude value of the enclosed spatial context flux as a proxy for the classification score for evaluation purposes. As depicted in Table 5, we achieve competitive performance against state-of-the-art methods dedicated to object detection in aerial images. Qualitative detection results are shown in Fig. 8.
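A minimal sketch of this mask and box generation, with our own function names and an illustrative scoring convention, is given below.

```python
import numpy as np
from scipy import ndimage

def skeleton_to_detections(skeleton, scale, flux_mag):
    """skeleton: (H, W) boolean skeleton; scale: (H, W) predicted radius per pixel;
    flux_mag: (H, W) magnitude of the predicted context flux (used as a score proxy)."""
    detections = []
    labels, num = ndimage.label(skeleton)          # one skeleton segment per component
    h, w = skeleton.shape
    for seg in range(1, num + 1):
        ys, xs = np.nonzero(labels == seg)
        mask = np.zeros((h, w), dtype=bool)
        for y, x, s in zip(ys, xs, scale[ys, xs]):
            r = int(np.ceil(s))
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            yy, xx = np.ogrid[y0:y1, x0:x1]
            mask[y0:y1, x0:x1] |= (yy - y) ** 2 + (xx - x) ** 2 <= s ** 2
        my, mx = np.nonzero(mask)
        box = (mx.min(), my.min(), mx.max(), my.max())     # x1, y1, x2, y2
        score = flux_mag[mask].mean()                      # mean enclosed flux magnitude
        detections.append((box, score))
    return detections
```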

6 Conclusions

We have proposed DeepFlux, a novel approach for accurate skeleton detection in the wild. In contrast to classical learning-based methods that consider skeleton detection as a binary classification problem, we learn to regress a 2D vector field of “context flux”. Context flux is a reliable intermediate cue for skeleton point localization, either through simple post-processing or end-to-end training. The proposed approach alleviates many limitations (e.g., poor localization) of previous methods, and performs very well in handling ligature points and skeletons of objects at large spatial scales, while also being very fast (\(\sim\)14-17 ms for detection on a Titan Xp GPU). Our experiments on six challenging benchmarks demonstrate that DeepFlux consistently improves over the state-of-the-art both quantitatively and qualitatively.

While the skeleton represents a powerful shape representation in support of many tasks, it lacks the dual boundary/region encoding offered by the medial axis transform (MAT), since skeleton points do not encode the scale of the maximal inscribed disk. We have extended our framework to explicitly recover both skeleton position and scale, significantly enhancing the representation power and utility of our skeletons, as demonstrated on an object detection/segmentation task of vehicle detection in remote sensing imagery.