1 Introduction

The shape skeleton, or medial axis Blum (1973), is a structure-based object descriptor that reveals local symmetry as well as connectivity between object parts Marr and Nishihara (1978); Dickinson (2009). Modeling objects via their axes of symmetry and, in particular, using skeletons has a long history in computer vision. Skeletonization algorithms provide a concise and effective representation of deformable objects, while supporting many applications, including object recognition and retrieval Zhu and Yuille (1996); Felzenszwalb and Huttenlocher (2005); Bai et al. (2009); Trinh and Kimia (2011), pose estimation Girshick et al. (2011); Shotton et al. (2011); Wei et al. (2016), hand gesture recognition Ren et al. (2013), shape matching Siddiqi et al. (1999), scene text detection Zhang et al. (2015), and road detection in aerial scenes Sironi et al.; Mattyus et al. (2015); Máttyus et al. (2017).

Fig. 1

a Previous CNN-based methods treat skeleton detection as binary pixel classification, followed by non-maximum suppression (NMS). This can result in poor localization as well as disconnected segments. b The proposed DeepFlux method models the spatial context of skeletal points using a novel flux representation (left). The flux vector field encodes the position of skeletal points in relation to their associated image pixels, and hence also the implied object boundaries. This allows one to associate skeletal pixels with sinks, where the flux is absorbed, in the spirit of flux-based skeletonization methods Siddiqi et al. (2002). Red: ground truth skeleton; Green: detected skeleton (Color figure online)

Early algorithms for computing skeletons directly from images Lindeberg (1998); Liu et al. (1998); Jang and Hong (2001); Yu and Bajaj (2004); Nedzved et al. (2006); Zhang and Couloigner (2007); Lindeberg (2013) yield a gradient intensity map, driven by geometric constraints between skeletal pixels and edge fragments. Such methods cannot easily handle complex image data without prior information about object shape and location. Learning-based methods Levinshtein et al. (2013); Sie Ho Lee et al. (2013); Tsogkas and Kokkinos (2012); Shen et al. (2016); Sironi et al., on the other hand, demonstrate an improved ability for object skeleton detection in natural images, but are still unable to cope with complex backgrounds or clutter.

Convolutional neural networks (CNNs) are a specific instance of learning frameworks that have led to vast improvements in the performance of object skeleton detection algorithms in recent years Shen et al. (2016, 2017); Ke et al. (2017); Liu et al. (2017); Zhao et al. (2018); Liu et al. (2018); Xu et al. (2019). CNN-based methods typically frame the problem as one of binary pixel classification: given a dataset of images containing objects, paired with their (binary) skeleton annotations, the network is trained to predict the probability of each pixel belonging to a skeleton. The ground truth skeletons are usually extracted by applying a binary skeletonization algorithm to pre-segmented masks of the objects present in the image. As a result, the skeletons detected by each model are dataset-dependent. For instance, some datasets may only contain skeleton annotations for a single foreground object Shen et al. (2011), while others may involve scenes with multiple objects Ke et al. (2017), or may include annotations for background structures Tsogkas and Kokkinos (2012). This stands in contrast to recent work in unsupervised medial axis extraction from natural scenes Tsogkas and Dickinson (2017); Dufresne-Camaro et al. (2020).

Most of the aforementioned CNN-based methods derive from the Holistically-Nested Edge Detection (HED) model Xie and Tu (2015) or variations of it that better leverage multi-level features for capturing skeletons across a range of spatial scales. However, object skeleton computation in natural images using CNNs is inherently different from the problem of edge detection. As illustrated in Fig. 1a, edges associated with object boundaries can often be detected using information such as local appearance or texture changes. Such cues can be picked up by the more spatially accurate, shallow convolutional layers. Object skeletons, however, embody medial properties and high-level semantics. They are situated at regions within object parts that exhibit local bilateral symmetry, since the medial axis bisects the object angle Siddiqi and Pizer (2008). Successfully detecting skeletons purely from local image information (e.g., the green box numbered 3 in Fig. 1a) is challenging, since this requires reasoning over a larger spatial extent, such as the width of the torso of the horse in this case. Layers deeper in the CNN architecture are more appropriate for computing features at such coarser scales, but this presents a confound. Coarse features might not provide accurate spatial localization of the object skeleton.

In this paper, we propose a novel notion of spatial context flux, to accurately detect object skeletons within a CNN framework. Models based on the related notion of a field potential have also shown promise for other visual processing tasks that require non-local interactions, such as border ownership computations in the visual cortex Zucker (2012). We start by considering the spatial context of the skeleton, i.e., a neighborhood around a skeleton branch. For each context pixel, we define a two-dimensional unit flux vector pointing to the nearest skeleton pixel, generating a flux vector field. Within this representation, the object skeleton corresponds to pixels where the net inward flux is positive, following the motivation behind past flux-based methods for skeletonizing binary objects Siddiqi et al. (2002); Dimitrov et al. (2003). We then use a CNN to learn the spatial context flux, via a pixel-wise regression task in place of binary classification. The learned flux vector field encodes the relative locations of context and skeleton pixels, enabling the accurate recovery of the object skeleton via a simple post-processing step. Explicitly leveraging skeleton spatial context in our representation provides a larger receptive field size for estimation. This is helpful both for detecting medial points associated with larger spatial scales, and for more robust localization around junctions.

The present article builds upon work first presented in Wang et al. (2019). Our contributions can be enumerated as follows.

  1. We propose a novel spatial context flux representation for object skeleton detection. This concept explicitly encodes the relationship between image pixels and their closest skeletal points.

  2. Using this spatial context flux, we develop a method, dubbed DeepFlux, that accurately and efficiently detects object skeletons in an image.

  3. DeepFlux consistently outperforms state-of-the-art methods on six public benchmarks. To our knowledge, this is the first application of flux concepts, which have been successfully used for skeletonization of binary objects Siddiqi et al. (2002); Dimitrov et al. (2003), to the detection of object skeletons in natural images. It is also the first attempt at learning such flux-based representations directly from natural images.

Compared to the preliminary version of this study Wang et al. (2019), the current journal extension introduces two major improvements. First, we replace the post-processing step with a convolutional module, making the pipeline trainable in an end-to-end manner, while improving performance and requiring less runtime. Second, in addition to the flux and skeleton branch, we also learn the associated skeleton scale, allowing DeepFlux to detect objects in aerial images and achieve competitive performance against classical CNN-based object detectors.

2 Related Work

Object skeletonization has been widely studied in the last few decades. In our review, we contrast traditional, bottom-up methods with those that rely on supervised learning on annotated skeleton datasets.

2.1 Bottom-Up Skeletonization Methods

Many early skeleton detection algorithms Lindeberg (1998); Liu et al. (1998); Jang and Hong (2001); Yu and Bajaj (2004); Nedzved et al. (2006); Zhang and Couloigner (2007); Lindeberg (2013) are based on gradient intensity maps. In Siddiqi et al. (2002), the authors study the limiting average outward flux of the gradient of a Euclidean distance function to a 2D or 3D object boundary. The skeleton is associated with those locations where an energy principle is violated, i.e., where there is a net inward flux. Other researchers have constructed the skeleton by merging local skeleton segments with a learned segment-linking model. Levinshtein et al. (2013) propose a method to work directly on images, which uses multi-scale super-pixels and a learned affinity between adjacent super-pixels to group proximal medial points. A graph-based clustering algorithm is then applied to form the complete skeleton. Sie Ho Lee et al. (2013) improve the approach in Levinshtein et al. (2013) by using a deformable disc model, which can detect curved and tapered symmetric parts. A novel definition of an appearance medial axis transform (AMAT) has been proposed in Tsogkas and Dickinson (2017), to detect symmetry in the wild in a purely bottom-up, unsupervised fashion. In recent follow-up work Dufresne-Camaro et al. (2020), the AMAT framework is augmented by explicitly incorporating rules from the Shock Grammar for shapes Siddiqi et al. (1999), resulting in significant improvements in computational speed and medial axis quality. Finally, Jerripothula et al. (2017) describe an interesting framework for solving segmentation and skeletonization by exploiting the commonalities among different images of semantically similar objects, in a joint co-segmentation and co-skeletonization optimization scheme.

2.2 Learning-Based Skeleton Detection

In more recent literature Tsogkas and Kokkinos (2012); Shen et al. (2016); Sironi et al., object skeleton detection is treated as a pixel-wise classification or regression problem, and is solved using supervised learning. Tsogkas and Kokkinos (2012) extract hand-designed features at each pixel and train a classifier for symmetry detection. They employ a multiple instance learning (MIL) framework to accommodate the unknown scale and orientation of symmetry axes. Shen et al. (2016) extend the approach in Tsogkas and Kokkinos (2012) by training a group of MIL classifiers to capture the diversity of symmetry patterns. Sironi et al. propose a regression-based approach to improve the accuracy of skeleton locations. They train regressors that learn the distances to the closest skeleton in scale-space and identify the skeleton by finding the local maxima.

With the popularization of CNNs, deep learning-based methods Shen et al. (2016, 2017); Ke et al. (2017); Liu et al. (2017); Zhao et al. (2018); Liu et al. (2018) have shown great promise for object skeleton detection. Shen et al. (2016) propose an approach which fuses scale-associated deep side-outputs (FSDS), based on the architecture of HED Xie and Tu (2015). Since skeletons at different spatial scales can be captured in different stages, they supervise the side outputs with scale-associated ground-truth data. They then extend their original method by learning multi-task scale-associated deep side outputs (LMSDS) in Shen et al. (2017), which leads to improved skeleton localization and scale prediction, and better overall performance. Ke et al. (2017) present a side-output residual network (SRN), which leverages the output residual units to fit the errors between the ground-truth and the side-outputs. By cascading residual units in a deep-to-shallow manner, SRN can effectively detect the skeleton at different scales. Liu et al. (2017) develop a two-stream network that combines image and segmentation cues to capture complementary information for skeleton localization. Zhao et al. (2018) introduce a hierarchical feature integration (Hi-Fi) mechanism, where multi-scale features are integrated with bidirectional guidance so that high-level semantics and low-level details can benefit from each other. Liu et al. (2018) propose a linear span network (LSN) that uses linear span units to increase the independence of convolutional features and the efficiency of feature integration. In Xu et al. (2019), Xu et al. introduce a geometry-aware objective function based on Hausdorff distance, to better incorporate geometric constraints.

Fig. 2

The pipeline of the proposed method. For an input image, the network computes a two-dimensional vector field of symmetry spatial context flux (with a visualization of its magnitude and direction on the right). Based on this flux representation, we can recover medial axes reflecting object part symmetries by localizing points with high inward flux (followed by a morphological closing), or by using additional convolution layers, which makes the entire pipeline end-to-end trainable

2.3 Features of DeepFlux

Though the method we propose in the present paper also benefits from CNN-based learning, it differs from the methods in Shen et al. (2016, 2017); Ke et al. (2017); Liu et al. (2017); Zhao et al. (2018); Liu et al. (2018) in a fundamental way, due to its different learning objective. Instead of treating object skeleton detection in natural images as a binary classification problem, DeepFlux focuses on learning the spatial context flux of skeletons, and as such includes more informative non-local cues, such as the relative position of skeleton points to image points in their vicinity. Thus, the relationship between skeletal point locations and their associated object boundaries is also captured, at least implicitly. A direct consequence of this powerful image context flux representation is that a simple post-processing step can recover the skeleton directly from the learned flux. In this manner, we avoid the inaccurate localization of skeletal points by non-maximum suppression used in previous deep learning methods. In addition, DeepFlux enlarges the spatial extent used by the CNN to detect the skeleton, through its use of spatial context flux. This region-based flux representation allows our approach to capture larger object parts.

We note that the proposed DeepFlux is similar in spirit to the original notion of flux Siddiqi et al. (2002); Dimitrov et al. (2003) that is defined based on an object boundary, for skeletonization of 2D/3D binary objects. As such, DeepFlux inherits its mathematical properties including the unique mapping of skeletal points to boundary points. However, the present article is the first to extend this notion of flux to skeleton detection in natural images, where the flux is computed on dilated skeletons in a supervised learning setting. Our work is also related to the approaches in Bai and Urtasun (2017b); Maninis et al. (2018); Chen et al. (2018); Ci et al. (2018); Xu et al. (2019); Ahn et al. (2019); Kreiss et al. (2019); Peng et al. (2019) which learn direction cues for edge detection, instance segmentation, and pose estimation. In the present article, this direction information is encoded in the flux representation, and is implicitly learned for skeleton recovery.

2.4 Direction Fields in Models of Spatial Context

The use of direction fields to model spatial context has also shown promise in other computer vision applications, including image segmentation, object segmentation, and pose estimation. In Maninis et al. (2018), the authors propose to learn edge directions in addition to edge location, for generic image segmentation. Other methods make use of a direction field defined on regions of interest to achieve instance segmentation, such as the deep watershed transform in Bai and Urtasun (2017b), which regresses the distance map to boundaries obtained by semantic segmentation. A similar direction field on text areas is proposed in Xu et al. (2019), to extract instances of text in scenes, whereas direction cues pointing to object centers are used to improve instance and video segmentation in Chen et al. (2018) and Ci et al. (2018), respectively. Finally, direction cues are also used to improve instance segmentation in Ahn et al. (2019) and direction fields pointing towards keypoints are used for pose estimation in Kreiss et al. (2019); Peng et al. (2019).

3 Method

3.1 Overview

Many recent CNN-based skeleton detection approaches build on some variant of the HED architecture Xie and Tu (2015). The combination of a powerful classifier (CNN) and the use of side outputs to extract and combine features at multiple scales has enabled these systems to accurately localize medial points of objects in natural images. However, while state-of-the-art skeleton detection systems are quite effective at extracting medial axes of elongated structures, they still struggle when reasoning about ligature areas. This is not a surprise, because in contrast to the skeletal branches they connect, ligature areas exhibit much less structural regularity, making their exact localization ambiguous. As a result, most methods suffer from poor localization of ligature points, or from fragmentation between the medial axis segments that represent object parts.

We propose to mitigate this problem by casting skeleton detection as the problem of predicting a two-dimensional flux field from scene points to nearby skeleton points, within a fixed-size neighborhood. We then define skeleton points as the local flux minima, or, alternatively, as sinks “absorbing” flux from nearby points. We argue, and show empirically in our experiments, that this approach leads to more robust localization and better connectivity between skeletal branches. We also argue that considering a small neighborhood around the true skeleton points is sufficient, consistent with past approaches to binary object skeletonization Dimitrov et al. (2003). Whereas predicting the flux for the entire object would allow us to also infer the medial radius function, in this work we focus on improving medial point localization, and employ existing ideas for integrating scale prediction into our network, to tackle a high-level task in Sect. 5. The overall pipeline of the proposed method, which we dub DeepFlux, is depicted in Fig. 2.

Fig. 3

Dilating the object skeleton with a fixed-size disk defines a neighborhood of “skeleton spatial context”, shown in the middle figure as a binary mask. For each pixel \(\mathbf {p}\) within this neighborhood (excluding skeleton points), let \(\mathbf {p}_n\) be its nearest skeleton point. The flux \(\mathbf {F}(\mathbf {p})\) is defined as the two-dimensional unit vector pointing from \(\mathbf {p}\) to \(\mathbf {p}_n\). In the figure on the right, we visualize the orientation of all points in the skeleton neighborhood (the color wheel shows the correspondence between color and orientation). For the pixels outside the skeleton context neighborhood, there is no flux, i.e., the flux is set to (0, 0) (Color figure online)

3.2 Spatial Context Flux

Let \(\mathbf {p}= (x,y)\) be the coordinates of a pixel in a 2D RGB image. We represent the flux vector field \(\mathbf {F}(\mathbf {p}) = \mathbf {F}(x,y) = (F_x, F_y)\) as a two-channel map with continuous values \(F_x, F_y\), corresponding to the x and y coordinates of the flux vector, respectively. An intuitive visualization is shown in Fig. 3. In most related approaches, skeleton detection is framed as a binary classification task, for which the ground truth is a 1-pixel wide binary skeleton map. In our case, we are dealing with a regression problem, so we must modify the ground truth appropriately.

We divide a binary skeleton map into three non-overlapping regions: (1) skeleton spatial context, \(R_c\), which is a set of pixels in the vicinity of the skeleton; (2) skeleton pixels, denoted by \(R_s\); and (3) background pixels, \(R_b\). In practice, we obtain \(R_c\) by dilating the binary skeleton map with a disk of radius r, and subtracting skeleton pixels \(R_s\). Then, for each context pixel \(\mathbf {p}\in R_c\), we use an efficient distance transform algorithm Felzenszwalb and Huttenlocher (2012) to find its nearest skeleton pixel \(\mathbf {p}_n \in R_s\), in terms of \(L_2\) distance. We then define the flux on the context pixel \(\mathbf {p}\) as the unit direction vector that points from \(\mathbf {p}\) to \(\mathbf {p}_n\). For the remaining pixels composed of \(R_s\) and \(R_b\), we set the flux to (0, 0). Formally, we have:

$$\begin{aligned} \mathbf {F}(\mathbf {p}) = \left\{ \begin{matrix} \ \overrightarrow{\mathbf {p}\mathbf {p}_n}/\left| \overrightarrow{\mathbf {p}\mathbf {p}_n}\right| , &{} \mathbf {p}\in R_c \\ \\ (0,0), &{} \mathbf {p}\in R_s \cup R_b, \end{matrix} \right. \end{aligned}$$
(1)

where \(\left| \overrightarrow{\mathbf {p}\mathbf {p}_n}\right| \) denotes the length of the vector from pixel \(\mathbf {p}\) to \(\mathbf {p}_n\). We note that \(\mathbf {F}(\mathbf {p})\) is defined as a unit vector field only at the context pixels in our groundtruth; at test time, the predicted field is not normalized.
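To make this construction concrete, the following NumPy/SciPy sketch shows one way to build such a ground-truth flux field from a binary skeleton map. The function and variable names are ours, and the use of scipy.ndimage is an illustrative implementation choice, not the toolchain used in our experiments.

```python
import numpy as np
from scipy import ndimage

def make_flux_ground_truth(skeleton, r=7):
    """skeleton: (H, W) boolean map of 1-pixel-wide skeleton points (R_s).
    Returns the flux field (2, H, W) and the context mask R_c."""
    # R_c: dilate the skeleton with a disk of radius r, then remove skeleton pixels.
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    disk = (xx ** 2 + yy ** 2) <= r ** 2
    dilated = ndimage.binary_dilation(skeleton, structure=disk)
    context = dilated & ~skeleton

    # The Euclidean distance transform of the complement returns, for every pixel,
    # the indices (ny, nx) of its nearest skeleton pixel p_n.
    _, (ny, nx) = ndimage.distance_transform_edt(~skeleton, return_indices=True)

    h, w = skeleton.shape
    py, px = np.mgrid[:h, :w]
    vy, vx = ny - py, nx - px                      # vector from p to p_n
    norm = np.sqrt(vx ** 2 + vy ** 2)
    norm[norm == 0] = 1.0                          # avoid division by zero on R_s
    flux = np.zeros((2, h, w), dtype=np.float32)
    flux[0][context] = (vx / norm)[context]        # F_x on R_c
    flux[1][context] = (vy / norm)[context]        # F_y on R_c; zero on R_s and R_b
    return flux, context
```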

As a representation of the spatial context associated with each skeletal pixel, our proposed spatial context flux possesses a few distinct advantages when used to detect object skeletons in the wild. Unlike most learning approaches that predict skeleton probabilities individually for each pixel, our DeepFlux method leverages consistency between flux predictions within a neighborhood around each candidate pixel. Conversely, if the true skeleton location changes, the surrounding flux field will also change noticeably. A beneficial side-effect is that our method does not rely directly on the coarse responses produced by deeper CNN layers for localizing skeletons at larger scales, which further reduces localization errors. As we show in our experiments, these properties make our method more robust in localizing skeleton points, especially around ligature regions, and less prone to gaps, discontinuities, and irregularities caused by local mispredictions. In Sect. 3.5, we explain how we can easily and accurately recover a binary object skeleton using the magnitude and direction of the predicted flux.

Fig. 4

End-to-end network architecture. We adopt the pre-trained VGG16 Simonyan and Zisserman (2015) (or ResNet101 He et al. (2016)) with the ASPP module Chen et al. (2018) as the backbone network. We then obtain multi-level features by concatenating features extracted from stage3 (or stage2 for a ResNet101 backbone) to stage5 and the ASPP layer. The network is trained to regress the spatial context flux \(F=( {\mathbf {F_x}}, {\mathbf {F_y}})\) and predict a skeleton confidence score map. Our architecture can be easily augmented with a scale prediction branch to facilitate high-level tasks. In Sect. 5 we describe how to use such a variant to detect objects in aerial images

3.3 Network Architecture

The network for learning the spatial context flux of skeletons closely follows the fully convolutional architecture of Long et al. (2015), and is shown in Fig. 4. It consists of five modules: (1) a backbone network used to extract 3D feature maps; (2) an “atrous” spatial pyramid pooling (ASPP) module Chen et al. (2018) to enlarge the receptive field while avoiding excessive downsampling; (3) a multi-stage feature fusion module; (4) a flux regression and skeleton classification module, built from convolution and up-sampling layers; and (5) an optional skeleton scale prediction branch, which helps to bridge the gap between skeleton extraction and a complete medial axis transform Blum (1973).

To ensure a fair comparison with previous work, we also adopt VGG16 Simonyan and Zisserman (2015) as the backbone network. As in Xie and Tu (2015), we discard the last pooling layer and the fully connected layers that follow. In the rest of the text, we call this variant DeepFlux-VGG16. The use of the atrous module is motivated by the need for a wide receptive field: when extracting skeletons we have to guarantee that the receptive field of the network is wider than the largest medial radius of an object part in the input image. The receptive field of the VGG16 backbone is 196, which is not wide enough for large objects. Furthermore, it has been demonstrated in Luo et al. (2016) that the effective receptive field only takes up a fraction of the full theoretical receptive field. Thus, we employ ASPP to capture multi-scale information. Specifically, four parallel atrous convolutional layers with \(3\times 3\) kernels but different atrous rates (2, 4, 8, 16) are added to the last layer of the backbone, followed by a concatenation along the channel dimension. In this way, we obtain feature maps with a theoretical receptive field size of 708, which we have found to be large enough for the images we have experimented on.
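As a rough sanity check of this figure (the arithmetic below is ours, not part of the original analysis): a \(3\times 3\) convolution with atrous rate d, applied to features at output stride s, enlarges the theoretical receptive field by \((3-1)\cdot d\cdot s\) pixels, so the largest-rate branch (\(d=16\) at stride \(s=16\)) yields
$$\begin{aligned} 196 + (3-1)\times 16\times 16 = 708. \end{aligned}$$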

To construct a multi-scale representation of the input image, we fuse the feature maps from side outputs at the conv3, conv4, conv5, and ASPP layers, after convolving them with a \(1 \times 1\) kernel. Since feature maps at different levels have different spatial resolutions, we resize them all to the dimensions of conv3 before concatenating them. We then predict the flux field from the fused features and up-sample it to the dimensions of the input image using bilinear interpolation. This yields a 2-channel response map, corresponding to flux predictions \({\hat{\mathbf {F}}}(\mathbf {p})\) for every pixel \(\mathbf {p}\) in the image.

We propose two different ways of extracting skeletons from this 2-channel response map. The first is a simple post-processing scheme, described in Sect. 3.5. The second extends our network by plugging in three \(3 \times 3\) convolutional layers (with 64-channel outputs for the first two layers) after the (up-sampled) flux field prediction layer, which together output a pixel-wise skeleton confidence score. This score can subsequently be thresholded to produce a binary skeleton. Our network outputs both types of prediction (learned flux and skeleton confidence), as shown in Fig. 4.
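For illustration, a PyTorch-style sketch of this wiring is given below, assuming a torchvision VGG16 backbone; the layer indices, channel widths, and module names are our own assumptions (the original model was implemented in Caffe), so the snippet should be read as a schematic rather than a faithful re-implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class DeepFluxVGG16Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16().features          # ImageNet-pretrained weights would be loaded in practice
        # Stage splits follow torchvision's layer indexing (our assumption).
        self.stage3 = feats[:16]          # conv1_1 .. relu3_3 (1/4 resolution, 256 ch)
        self.stage4 = feats[16:23]        # pool3 .. relu4_3 (1/8 resolution, 512 ch)
        self.stage5 = feats[23:30]        # pool4 .. relu5_3 (1/16 resolution); pool5 dropped
        # ASPP: four parallel 3x3 atrous convolutions with rates 2, 4, 8, 16.
        self.aspp = nn.ModuleList(
            nn.Conv2d(512, 128, 3, padding=d, dilation=d) for d in (2, 4, 8, 16))
        # 1x1 side convolutions applied before multi-level fusion.
        self.side = nn.ModuleList(nn.Conv2d(c, 64, 1) for c in (256, 512, 512, 512))
        self.flux_head = nn.Conv2d(4 * 64, 2, 1)          # regresses (F_x, F_y)
        self.skeleton_head = nn.Sequential(               # confidence from predicted flux
            nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, x):
        c3 = self.stage3(x)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        a = torch.cat([conv(c5) for conv in self.aspp], dim=1)
        size = c3.shape[2:]               # fuse everything at the conv3 resolution
        fused = torch.cat([F.interpolate(self.side[i](f), size, mode='bilinear',
                                         align_corners=False)
                           for i, f in enumerate((c3, c4, c5, a))], dim=1)
        flux = F.interpolate(self.flux_head(fused), x.shape[2:], mode='bilinear',
                             align_corners=False)
        skeleton_logits = self.skeleton_head(flux)
        return flux, skeleton_logits
```

Here, skeleton_head plays the role of the three \(3 \times 3\) convolutional layers of the end-to-end variant, while the post-processing variant would instead apply the procedure of Sect. 3.5 to the flux output.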

We also consider an alternative architectural choice for the proposed model by replacing the VGG16 backbone with ResNet101 He et al. (2016). In this case, similar to DeepFlux-VGG16, we fuse the feature maps from different side outputs. Specifically, we apply \(1 \times 1\) convolution at conv2 (whose spatial size is already 1/4 of the original image), conv3, conv4, conv5, and the ASPP layers. We then concatenate the resized side outputs together as a multi-scale representation of the input image. The following layers are kept the same as the DeepFlux-VGG16 variant. In the rest of the text, we call this variant DeepFlux-ResNet101. When not specified, we assume a DeepFlux-VGG16 architecture.

Finally, similar to previous work Shen et al. (2017), we explore the advantages of simultaneously predicting skeleton position and scale, bridging the gap between skeleton extraction and a complete medial axis transform Blum (1973). This also provides us with a richer representation that can find practical use in downstream tasks. To this end, we optionally include an additional branch to the DeepFlux backbone, that predicts the scale s associated with each medial point, as shown in Fig. 4.

3.4 Training Objective

We split our loss function into two terms, one for each type of output. For the flux field branch, we choose the \(L_2\) loss function as our training objective. Due to a severe imbalance in the number of context and background pixels, we adopt a class-balancing strategy similar to the one in Xie and Tu (2015). Our balanced flux loss function is

$$\begin{aligned} L_f = \sum _{\mathbf {p}\in \varOmega }{w_f(\mathbf {p}) \cdot \left\| \mathbf {F}(\mathbf {p})- {\hat{\mathbf {F}}}(\mathbf {p})\right\| _2^2}, \end{aligned}$$
(2)

where \(\varOmega \) is the image domain, \({\hat{\mathbf {F}}}(\mathbf {p})\) is the predicted flux, and \(w_f(\mathbf {p})\) denotes the weight coefficient of pixel \(\mathbf {p}\). The weight \(w_f(\mathbf {p})\) is calculated as follows:

$$\begin{aligned} w_f(\mathbf {p}) = \left\{ \begin{array}{ll} \frac{|R_b|}{|R_c|+|R_b|+|R_s|}, &{}\quad \mathbf {p}\in R_c \cup R_s \\ \\ \frac{|R_c|+|R_s|}{|R_c|+|R_b|+|R_s|}, &{}\quad \mathbf {p}\in R_b, \end{array} \right. \end{aligned}$$
(3)

where \(|R_c|\), \(|R_b|\) and \(|R_s|\) denote the number of context, background, and skeleton pixels, respectively.

The second branch, which predicts skeleton probability scores from the predicted flux, corresponds to a standard binary classification problem. We follow Xie and Tu (2015) and use a class-balanced cross-entropy loss function

$$\begin{aligned} L_s = - \beta \sum _{\mathbf {p}\in R_s}{\log S(\mathbf {p})} - (1-\beta ) \sum _{\mathbf {p}\in R_b \cup R_c}{\log (1-S(\mathbf {p}))}, \end{aligned}$$
(4)

where S denotes the skeleton probability scores (obtained using a sigmoid function), and \(\beta = \tfrac{|R_c|+|R_b|}{|R_s|+|R_c|+|R_b|}\) is the balancing factor.

The final training objective is given by summing the two loss terms:

$$\begin{aligned} L = L_f + \lambda _1 L_s, \end{aligned}$$
(5)

where \(\lambda _1\) is a hyper-parameter. We set \(\lambda _1=1\) in all our experiments.

For the optional extra scale prediction branch, we use a smoothed-L1 loss for scale regression:

$$\begin{aligned} L_{scale} = smooth_{L1}\left( \frac{{\hat{s}} - s}{s}\right) , \end{aligned}$$
(6)

where \({\hat{s}}\) is the predicted scale and s is the ground truth. When we also predict the skeleton scale, the overall loss is given by \(L + \lambda _2 L_{scale}\), where \(\lambda _2\) is a hyper-parameter that is set to 1 in all our experiments.
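A compact sketch of the full training objective (Eqs. 2-6) is given below, in PyTorch-style code with our own naming conventions; the reduction used for the smooth-\(L_1\) term is an assumption on our part.

```python
import torch
import torch.nn.functional as F

def deepflux_loss(pred_flux, gt_flux, pred_logits, context_mask, skeleton_mask,
                  pred_scale=None, gt_scale=None, lambda1=1.0, lambda2=1.0):
    """pred_flux, gt_flux: (B, 2, H, W); pred_logits: (B, 1, H, W) skeleton logits;
    context_mask, skeleton_mask: (B, 1, H, W) boolean masks for R_c and R_s."""
    background_mask = ~(context_mask | skeleton_mask)              # R_b
    n_c = context_mask.sum().float()
    n_s = skeleton_mask.sum().float()
    n_b = background_mask.sum().float()
    n = n_c + n_s + n_b

    # Eq. (3): class-balancing weights for the flux regression loss.
    w = torch.where(background_mask, (n_c + n_s) / n, n_b / n)

    # Eq. (2): weighted L2 loss between predicted and ground-truth flux.
    l_f = (w * ((pred_flux - gt_flux) ** 2).sum(dim=1, keepdim=True)).sum()

    # Eq. (4): class-balanced cross-entropy for the skeleton confidence branch.
    beta = (n_c + n_b) / n
    s = torch.sigmoid(pred_logits).clamp(1e-6, 1 - 1e-6)
    l_s = -(beta * torch.log(s[skeleton_mask]).sum()
            + (1 - beta) * torch.log(1 - s[~skeleton_mask]).sum())

    loss = l_f + lambda1 * l_s                                     # Eq. (5)

    # Optional scale branch, Eq. (6): smooth-L1 on the relative scale error at
    # skeleton pixels (mean reduction is our assumption).
    if pred_scale is not None:
        rel = (pred_scale[skeleton_mask] - gt_scale[skeleton_mask]) / gt_scale[skeleton_mask]
        loss = loss + lambda2 * F.smooth_l1_loss(rel, torch.zeros_like(rel))
    return loss
```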

Algorithm 1

3.5 From Flux to Skeleton Predictions

We propose a simple post-processing procedure to recover the object skeleton from the predicted spatial context flux. As described in Eq. (1), pixels around the skeleton are labeled with unit two-dimensional vectors while the others are set to (0, 0). Thus, thresholding the magnitude of the vector field reveals the context pixels while computing the flux direction reveals the location of context pixels relative to the skeleton. We refer the reader to Fig. 2 for a visualization of the post-processing steps, listed in Algorithm 1.

Let \(|{\hat{\mathbf {F}}}|\) and \(\angle {\hat{\mathbf {F}}}\) be the magnitude and direction of the predicted context flux \({\hat{\mathbf {F}}}\), respectively. For a given pixel \(\mathbf {p}\), \(\angle {\hat{\mathbf {F}}}(\mathbf {p})\) is binned into one of 8 directions, pointing to one of the 8 neighbors, denoted by \({\mathcal {N}}_{\angle {\hat{\mathbf {F}}}(\mathbf {p})}(\mathbf {p})\). Having computed these two quantities, extracting the skeleton is straightforward: pixels close to the real object skeleton should have a high inward flux, due to a singularity in the vector field \({\hat{\mathbf {F}}}\), as analyzed in Dimitrov et al. (2003). These pixels are defined as “quench points”. Finally, we apply a morphological dilation with a disk structuring element of radius \(k_1\), followed by a morphological erosion with a disk of radius \(k_2\), to group quench points together and produce the object skeleton. We call this variant DeepFlux-P.
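Since only a high-level description of these steps is reproduced here, the NumPy sketch below is our reconstruction of the DeepFlux-P post-processing from that description (magnitude thresholding, direction binning into 8 neighbors, quench-point detection, then dilation and erosion); it approximates Algorithm 1 rather than reproducing it exactly, and the default values follow Sect. 4.2.

```python
import numpy as np
from scipy import ndimage

def flux_to_skeleton(flux, lambda_m=0.4, k1=3, k2=4):
    """flux: (2, H, W) predicted context flux (F_x, F_y). Returns a binary skeleton."""
    fx, fy = flux
    mag = np.sqrt(fx ** 2 + fy ** 2)
    context = mag > lambda_m                     # pixels recognized as skeleton context

    # Bin the flux direction into 8 neighbors; mark the pixels that context pixels
    # point to but whose own flux is weak, i.e., pixels "absorbing" inward flux.
    bins = np.round(np.arctan2(fy, fx) / (np.pi / 4)).astype(int) % 8
    offsets = [(0, 1), (1, 1), (1, 0), (1, -1),
               (0, -1), (-1, -1), (-1, 0), (-1, 1)]   # (dy, dx) for bins 0..7
    h, w = mag.shape
    pointed_to = np.zeros((h, w), dtype=bool)
    for y, x in zip(*np.nonzero(context)):
        dy, dx = offsets[bins[y, x]]
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w:
            pointed_to[ny, nx] = True
    quench = pointed_to & (mag <= lambda_m)

    # Group quench points: dilation with radius k1, then erosion with radius k2.
    def disk(r):
        yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
        return (xx ** 2 + yy ** 2) <= r ** 2
    grouped = ndimage.binary_dilation(quench, structure=disk(k1))
    return ndimage.binary_erosion(grouped, structure=disk(k2))
```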

One can also learn to predict skeleton confidence from the predicted flux field. More precisely, as described in Sect. 3.3, we add three \(3 \times 3\) convolution layers after the flux prediction layer, and train this branch in the standard manner for a binary skeleton classification problem, using a cross-entropy loss. We call this end-to-end trainable variant DeepFlux-E, and use it as our default, unless explicitly stated.

Fig. 5

Example images selected from different datasets, and their corresponding annotations. The ground truth annotations are thickened and drawn in red for improved visibility. Best viewed in color (Color figure online)

4 Experiments on Skeleton and Centerline Detection

We conduct experiments on six challenging datasets, five of which are publicly available: SK-LARGE Shen et al. (2017), SK506 Shen et al. (2016), WH-SYMMAX Shen et al. (2016), SYM-PASCAL Ke et al. (2017), SYMMAX300 Tsogkas and Kokkinos (2012); and SK-AID, a bridge/road centerline dataset we collected ourselves from AID Xia et al. (2017), which will also be made publicly available. Some sample images are shown in Fig. 5. We note that for some of these datasets, only the skeletons of foreground objects are annotated, whereas others come with skeleton or centerline annotations for both foreground objects and background structures.

We describe the above datasets and the evaluation protocol in detail, in Sect. 4.1. We follow with implementation details in Sect. 4.2. Qualitative and quantitative results are shown in Sect. 4.3. We carry out a runtime analysis and an ablation study in Sects. 4.4 and 4.5, respectively.

4.1 Dataset and Evaluation Protocol

SK-LARGE Shen et al. (2017) is a benchmark for foreground object skeleton extraction, consisting of 746 training and 745 test images. Each image in SK-LARGE is obtained by cropping an image from MS-COCO Chen et al. (2015) so that it contains a single, centered object. SK-LARGE contains various object categories including person, horse, giraffe, and man-made objects such as plane and hydrant. In this dataset, both the location and scale (the radius of the corresponding maximal disk) of each skeletal point are annotated.

SK506 Shen et al. (2016), also referred to as SK-SMALL, is an earlier version of SK-LARGE released by the same authors Shen et al. (2017). There are 300 training images and 206 test images. Note that this dataset contains less training data, which might make the training of deep neural networks more challenging.

WH-SYMMAX Shen et al. (2016) contains 328 cropped images from the Weizmann Horse dataset Borenstein and Ullman (2002), and their skeleton point and scale annotations. The dataset is split into 228 training images and 100 test images.

Fig. 6

Some qualitative results on SK-LARGE, WH-SYMMAX, SYM-PASCAL, and SYMMAX300. Red: GT; Green: detected skeleton; Yellow: detected skeleton and GT overlap. Qualitatively, DeepFlux-P performs similarly to the variant DeepFlux-E. Two examples of partial failure are also shown on the bottom right (enclosed by red boxes). In one case, DeepFlux fails to detect the skeleton on the body of the bird due to image blurring. In the other case, DeepFlux detects a horizontal symmetry axis instead of the vertical one annotated in the ground truth (Color figure online)

SK-AID is built on AID Xia et al. (2017), a dataset for aerial scene classification with 20 scene categories. We use 60 images for training and 40 images for testing, and focus on two object categories: bridge and road. We manually annotate the segmentation masks of roads and bridges, and then adopt a binary skeletonization algorithm Shen et al. (2011) to obtain their centerlines as the skeleton ground truth. As shown in Fig. 5, SK-AID exhibits a large variation in skeleton orientation and curvature, as well as challenging cases of junctions of multiple skeleton branches.

SYM-PASCAL Ke et al. (2017) is derived from the PASCAL-VOC-2011 segmentation dataset Everingham et al. (2010) for symmetry detection in the wild, and contains 648 training and 787 test images. Compared to SK-LARGE and SK506, the images from this dataset possess more complex backgrounds and variations of object appearance, including occlusions and missing parts, making it quite challenging.

SYMMAX300 Tsogkas and Kokkinos (2012) is built on the Berkeley Segmentation Dataset (BSDS300) Martin et al. (2001), which contains 200 training images and 100 test images. Unlike the datasets described above, both foreground and background regions are considered. It is noteworthy that each image in SYMMAX300 is accompanied by 5-7 symmetry annotations, corresponding to the multiple segmentation annotations present in BSDS300. The final local symmetry annotation is obtained by merging all available annotations for a given image, through a binary union operation.

Evaluation protocol Following previous work Tsogkas and Kokkinos (2012); Shen et al. (2016, 2017), we use precision-recall (PR) curves and the F-measure metric to evaluate skeleton detection performance in our experiments. For methods that output a skeleton probability map (including our end-to-end variant DeepFlux-E), we first apply a standard non-maximal suppression (NMS) algorithm Dollár and Zitnick (2015). We then threshold the thinned skeleton into a binary map and match it with the ground truth using a bi-partite matching routine that allows for small localization errors Martin et al. (2004). We select threshold values that yield the highest F-measure for each method-dataset combination.

Fig. 7

Quantitative evaluation in terms of PR curves on six skeleton detection datasets. Both DeepFlux-VGG16 (in green) and DeepFlux-ResNet101 (in blue) offer high precision, especially in the high-recall regime. A stronger backbone (e.g., ResNet101) leads to more accurate skeleton detection (Color figure online)

For the variant DeepFlux-P, which does not directly output skeleton probabilities, we use the inverse magnitude of predicted context flux on the recovered skeleton as a surrogate for a “skeleton confidence”. Thresholding at different values gives rise to a PR curve and the optimal threshold for each dataset is selected as the one producing the highest F-measure according to the formula \(F = 2PR/(P+R)\). The F-measure is commonly reported as a single scalar performance index.
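The sketch below illustrates this threshold sweep; note that it replaces the bipartite matching of Martin et al. (2004) with a simple distance-tolerance matching, so it is only a rough stand-in for the official benchmark code, and it assumes the input probability map has already been thinned by NMS.

```python
import numpy as np
from scipy import ndimage

def best_fmeasure(prob_map, gt_skeleton, tol=5, num_thresholds=50):
    """prob_map: (H, W) thinned skeleton probabilities; gt_skeleton: (H, W) boolean GT."""
    dist_to_gt = ndimage.distance_transform_edt(~gt_skeleton)
    best = 0.0
    for t in np.linspace(0.01, 0.99, num_thresholds):
        pred = prob_map > t
        tp = np.count_nonzero(pred & (dist_to_gt <= tol))         # predictions near GT
        precision = tp / max(np.count_nonzero(pred), 1)
        dist_to_pred = ndimage.distance_transform_edt(~pred)
        recall = np.count_nonzero(gt_skeleton & (dist_to_pred <= tol)) / \
                 max(np.count_nonzero(gt_skeleton), 1)             # GT pixels near predictions
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```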

4.2 Implementation Details

Our implementation involves one major hyperparameter: the radius r of the skeleton context neighborhood, which is set to 7 for all experiments. For the DeepFlux-P variant there are three extra hyperparameters (the values provided are the ones used in our experiments): the threshold used to recover skeletal points from the predicted flux field, \(\lambda _m=0.4\), and the sizes of the structuring elements involved in the morphological operations for skeleton recovery, \(k_1=3\) and \(k_2=4\).

For training, we adopt standard data augmentation strategies Shen et al. (2016, 2017); Zhao et al. (2018). Specifically, we resize training images to 3 different scales (0.8, 1, 1.2) and then rotate them to 4 angles (\(0^\circ \), \(90^\circ \), \(180^\circ \), \(270^\circ \)). We also flip them with respect to different axes (up-down, left-right, no flip). We consider two different initializations for the proposed network, one with the VGG16 Simonyan and Zisserman (2015) and one with the ResNet101 He et al. (2016) model, pre-trained on ImageNet Deng et al. (2009) and optimized using ADAM Kingma and Ba (2015). For the first 80k iterations, the learning rate is set to \(10^{-5}\) for the backbone (VGG16 or ResNet101) layers and to \(10^{-4}\) for the rest of the layers in the network, then reduced to \(10^{-6}\) and \(10^{-5}\), respectively, for the remaining 40k iterations.

We use the Caffe Jia et al. (2014) framework to train DeepFlux. All experiments are carried out on a workstation with an Intel Xeon 16-core CPU (3.5GHz), 64GB RAM, and a single Titan Xp GPU. Training on SK-LARGE with batch size set to 1 takes about 2 hours.

Table 1 Quantitative comparison in terms of F-measure

4.3 Results

Comparison with other methods. We start by showing a qualitative comparison of DeepFlux-VGG16 with other skeleton detection methods, on images from WH-SYMMAX and SYM-PASCAL. As illustrated in Figs. 1 and 6, DeepFlux accurately localizes skeleton points while preserving good connectivity at junctions.

In Fig. 7 we plot the PR-curves for SK-LARGE, SK506, WH-SYMMAX, SK-AID, and SYM-PASCAL. DeepFlux significantly outperforms other methods in all cases, excelling in the high-precision regime. This is indicative of the role of local context towards more robust and accurate localization of skeleton points.

Table 1 lists the optimal F-measure score for all methods. DeepFlux-VGG16 consistently outperforms all other approaches. Specifically, DeepFlux-VGG16-E surpasses the most recent method Hi-Fi Zhao et al. (2018) by \(1.2\%\), \(2.3\%\), \(5.0\%\), \(6.8\%\), and \(11.6\%\) on SK-LARGE, SK506, WH-SYMMAX, SK-AID, and SYM-PASCAL, respectively, despite the fact that Hi-Fi uses stronger supervision during training (skeleton position and scale). DeepFlux-VGG16-E also outperforms LSN Liu et al. (2018), another recent method, by \(6.8\%\), \(7.1\%\), \(5.8\%\), \(14.5\%\), and \(5.1\%\) on SK-LARGE, SK506, WH-SYMMAX, SYM-PASCAL, and SYMMAX300, respectively. It is noteworthy that the proposed DeepFlux improves over the previous state-of-the-art by more than \(11\%\) in terms of F-measure on SYM-PASCAL, whose images have more complex backgrounds and variations in object appearance. This implies that DeepFlux is better able to handle skeleton detection in complex images. For a fair comparison with previous methods, we also report results for DeepFlux using the vanilla VGG16 architecture without the ASPP module. Barring SK-LARGE, where the proposed DeepFlux performs slightly worse than Hi-Fi Zhao et al. (2018), DeepFlux significantly outperforms competing methods on all other datasets. It is also noteworthy that Hi-Fi Zhao et al. (2018) relies on additional scale supervision during training, which is not the case for DeepFlux.

GeoSkeletonNet Xu et al. (2019) is trained using “resolution normalization”: the authors resize the images and their associated ground-truth from a size of \(H \times W\) to \(\sqrt{KH/W} \times \sqrt{KW/H}\) (\(K = 180000\) for SYM-PASCAL and \(K = 60000\) for the other datasets) before applying data augmentation. This procedure normalizes the number of pixels to a fixed value K, while keeping the aspect ratio of the images the same, factoring out the variance of resolutions across different datasets. Using the same resolution normalization protocol, DeepFlux-VGG16-E achieves a 0.758 (+0.1%), 0.730 (+0.3%), 0.863 (+1.4%), and 0.569 (+4.9%) F-score on SK-LARGE, SK506, WH-SYMMAX, and SYM-PASCAL, respectively.
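As a worked example of this normalization (our illustration, not code from GeoSkeletonNet):

```python
import math

def normalized_size(h, w, k=60000):
    # New size sqrt(K*H/W) x sqrt(K*W/H): the pixel count becomes K while the
    # aspect ratio H/W is preserved.
    return round(math.sqrt(k * h / w)), round(math.sqrt(k * w / h))

print(normalized_size(600, 400))   # -> (300, 200); 300 * 200 = 60000 pixels
```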

Comparison of different network backbones Using a more powerful backbone further boosts performance. DeepFlux-ResNet101-E improves over DeepFlux-VGG16-E by \(1.8\%\), \(1.5\%\), \(1.2\%\), \(0.1\%\), and \(2.3\%\) on SK-LARGE, SK506, WH-SYMMAX, SK-AID, and SYM-PASCAL, respectively. The modest gains from the more powerful ResNet on SK-AID can potentially be attributed to the significantly lower variation of skeleton scales in that dataset; the capacity of VGG16 seems to be sufficient to already achieve close to \(90\%\) accuracy. Curiously, DeepFlux-ResNet101 performs slightly worse than DeepFlux-VGG16 on SYMMAX300. Our hypothesis is that, because of the multiple, potentially conflicting, annotations per image in this dataset, the lower capacity of the VGG16 may act as a regularizer, leading to slightly better performance.

Post-processing versus end-to-end training DeepFlux-E (end-to-end) performs slightly better than DeepFlux-P (post-processing) in all cases. In particular, DeepFlux-VGG16-E outperforms DeepFlux-VGG16-P by \(1.2\%\) and \(0.6\%\) on SYM-PASCAL and SYMMAX300, respectively. As shown in the qualitative results on SYMMAX300 in Fig. 6 (see the blue dashed circles), DeepFlux-E better preserves connectivity at ligature areas than DeepFlux-P, which may only have a few quench points instead of a set of connected ones, due to direction discretization into 8 bins. DeepFlux-E also enjoys a slightly faster runtime, as shown in the comparison in Table 2.

Failure cases Despite the effectiveness of DeepFlux in accurately detecting object skeletons in images, there are some challenging cases where the model fails partially. An example is illustrated on the middle right of Fig. 6, where the skeleton of the body of the bird is not detected due to severe image blurring. Another example of failure is shown on the bottom right in Fig. 6, where DeepFlux fails to capture the symmetry of each bus instance individually, detecting instead the horizontal symmetry axis of the entire cluster.

Table 2 Runtime and performance on SK-LARGE. For DeepFlux-P, we list the total inference (GPU) + post-processing (CPU) time

4.4 Runtime Analysis

In Table 2 we compare the runtime of DeepFlux to alternatives. Since competing models typically use the VGG16 backbone, we mainly employ the DeepFlux-VGG16 variant in our analysis, to keep the comparison fair. As shown in Table 2, DeepFlux is as fast as competing methods while achieving superior performance. Inference of DeepFlux-VGG16-E on the GPU takes on average 14 ms for a \(300 \times 200\) image, which is faster than other methods. The DeepFlux-VGG16-P variant requires on average an extra 3 ms on the CPU, for post-processing.

4.5 Ablation Study

We study the contribution of the two main modules (ASPP module and flux representation) to skeleton detection on SK-LARGE and SYM-PASCAL, by removing them one at a time from the VGG16 backbone. We conduct four experiments corresponding to the four possible combinations of each module being present or not.

When the spatial context flux representation is not used, we train the model with the same architecture, but for binary classification using a binary cross-entropy loss. The baseline model is trained without the ASPP module and without the spatial context flux representation. As depicted in Table 3, the ASPP module, which offers a larger receptive field, results in an improvement of \(1.9\%\) on SK-LARGE and \(5.1\%\) on SYM-PASCAL, compared to the baseline model. This confirms that a large receptive field is beneficial for skeleton/symmetry extraction. We then remove the ASPP module and train the model using the proposed flux representation, which yields an improvement of \(0.8\%\) on SK-LARGE and \(3.5\%\) on SYM-PASCAL. These gains are complementary to each other; indeed, combining both the ASPP module and the flux representation improves performance over the baseline by \(2.9\%\) on SK-LARGE and \(9.7\%\) on SYM-PASCAL.

Table 3 Ablation study on the effect of the spatial context flux representation and the ASPP module on the performance in terms of F-measure
Table 4 Ablation study on the influence of the context size r on the performance in terms of F-measure

We also study the effect of the size r of the neighborhood within which context flux is computed. We conduct experiments with different radii, ranging from \(r=3\) to \(r=11\), with a step of 2, on the SK-LARGE and SYM-PASCAL datasets. The best results are obtained for \(r=7\), and using smaller or larger values seems to slightly decrease performance. Our understanding is that a narrower spatial context neighborhood provides less contextual information to predict the final skeleton map. On the other hand, using a wider neighborhood may increase the chance for mistakes in flux prediction around areas of severe discontinuities, such as the areas around boundaries of thin objects that are fully contained in the context neighborhood. DeepFlux does not appear to be sensitive to the value of r, as shown in Table 4.

Finally, one may argue that simply using a dilated ground truth when training the network for skeleton classification is sufficient to make the model more robust in accurately localizing skeletal points. To examine if this is the case, we removed the flux module and retrained our VGG16-based model on the same dilated skeletons we used to compute the spatial context flux ground truth, using a binary cross-entropy loss instead. Without the spatial context flux representation, the performance drops from \(F=0.736\) to \(F=0.697\) (\(-3.9\%\)) on SK-LARGE and from \(F=0.570\) to \(F=0.490\) (\(-8\%\)) on SYM-PASCAL, demonstrating the effectiveness of our proposed representation for accurate localization.

Table 5 Comparison with some state-of-the-art methods dedicated for object detection in remote sensing images on DOTA Xia et al. (2018)
Fig. 8

Qualitative visualization of some large vehicle detection results on the DOTA dataset. Pink line: predicted skeleton segment; Green box: large vehicle detection; Red box: GT; Best viewed in the electronic version (Color figure online)

5 Application to Object Detection in Aerial Images

We consider an application of simultaneously predicting skeleton position and scale, for the task of detecting large vehicles in remote sensing imagery. For a fair comparison with other methods, we use ResNet101 as the backbone. The stride between block3 and block4 is set to 1, and all the layers in block4 are replaced with dilated convolution layers.

Predicting the scale associated with each skeleton pixel allows us to generate the object mask in a straightforward way. Let \({\hat{s}}_i\) denote the predicted scale for a skeleton pixel \(x_i\) in a skeleton segment (i.e., a connected component of the binary object skeleton). We obtain the object mask as \(O = \cup _{i=1}^{N} D_i\), where N is the number of skeleton pixels in the segment, and \(D_i\) is the disk of radius \({\hat{s}}_i\) centered at \(x_i\). We use the bounding box of O as our final detection, and the mean magnitude value of the enclosed spatial context flux as a proxy for the classification score for evaluation purposes. As depicted in Table 5, we achieve competitive performance against state-of-the-art methods dedicated to object detection in aerial images. Qualitative detection results are shown in Fig. 8.
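A minimal sketch of this mask and box generation, with our own function names and an illustrative scoring convention, is given below.

```python
import numpy as np
from scipy import ndimage

def skeleton_to_detections(skeleton, scale, flux_mag):
    """skeleton: (H, W) boolean skeleton; scale: (H, W) predicted radius per pixel;
    flux_mag: (H, W) magnitude of the predicted context flux (used as a score proxy)."""
    detections = []
    labels, num = ndimage.label(skeleton)          # one skeleton segment per component
    h, w = skeleton.shape
    for seg in range(1, num + 1):
        ys, xs = np.nonzero(labels == seg)
        mask = np.zeros((h, w), dtype=bool)
        for y, x, s in zip(ys, xs, scale[ys, xs]):
            r = int(np.ceil(s))
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            yy, xx = np.ogrid[y0:y1, x0:x1]
            mask[y0:y1, x0:x1] |= (yy - y) ** 2 + (xx - x) ** 2 <= s ** 2
        my, mx = np.nonzero(mask)
        box = (mx.min(), my.min(), mx.max(), my.max())     # x1, y1, x2, y2
        score = flux_mag[mask].mean()                      # mean enclosed flux magnitude
        detections.append((box, score))
    return detections
```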

6 Conclusions

We have proposed DeepFlux, a novel approach for accurate skeleton detection in the wild. In contrast to classical learning-based methods that consider skeleton detection as a binary classification problem, we learn to regress a 2D vector field of “context flux”. Context flux is a reliable intermediate cue for skeleton point localization, either through simple post-processing or end-to-end training. The proposed approach alleviates many limitations (e.g., poor localization) of previous methods, and performs very well in handling ligature points and skeletons of objects at large spatial scales, while also being very fast (\(\sim\)14-17 ms for detection on a Titan Xp GPU). Our experiments on six challenging benchmarks demonstrate that DeepFlux consistently improves over the state-of-the-art both quantitatively and qualitatively.

While the skeleton represents a powerful shape representation in support of many tasks, it lacks the dual boundary/region encoding offered by the medial axis transform (MAT), since skeleton points do not encode the scale of the maximal inscribed disk. We have extended our framework to explicitly recover both skeleton position and scale, significantly enhancing the representation power and utility of our skeletons, as demonstrated on an object detection/segmentation task of vehicle detection in remote sensing imagery.