1 Introduction

Staircases can be found almost everywhere, in different colors, shapes and sizes, in both indoor and outdoor environments. While staircases are useful in everyday life, they can also be an obstacle to the navigation of humans with disabilities, as well as to the navigation of artificial, robotic agents. Detecting a staircase can be even more difficult in unknown environments, where there is no prior knowledge about the surroundings; for the visually impaired in particular, undetected staircases can become hazardous. Therefore, staircase detection can be considered an important component of any system aiming to provide navigational assistance in either indoor or outdoor environments. In controlled, indoor environments, markers, such as augmented reality markers, can be used to provide a high staircase detection success rate [1]. The detection problem usually becomes much harder in uncontrolled, outdoor environments, where staircases of different types and sizes can be found under various illumination conditions and can be observed from different viewpoints.

In this paper we address image-based staircase detection as a pattern recognition problem in the context of embedded and mobile devices. The main challenge is to provide sufficient detection accuracy using the limited computational resources of such devices, especially in outdoor environments with low-latency requirements and limited network accessibility. To address this challenge, we propose a novel lightweight Fully Convolutional Network (FCN) architecture as a modification of our recent Look-Behind FCN (LB-FCN) architecture [2]. This novel architecture, named LB-FCN light, has significantly fewer free parameters and requires fewer Floating Point Operations (FLOPs) than the previous LB-FCN and state-of-the-art architectures for mobile devices. This was achieved by implementing depthwise separable convolutions throughout the convolutional layers of the network. The architecture also enables multi-scale feature extraction and residual learning, making it suitable for multi-scale staircase detection in both indoor and outdoor environments. To evaluate the performance of LB-FCN light we created a weakly labeled image dataset of staircases found in natural images collected from publicly available datasets, i.e., a dataset with images semantically labeled as containing or not containing staircases.

The rest of the paper consists of four sections. In Sect. 2, related work on staircase detection is presented. In Sect. 3 we describe the proposed architecture and its advantages. In Sect. 4 we describe our weakly annotated staircase dataset and the results of the experiments performed. The last section summarizes the conclusions that can be derived from this study, along with our plans for future work.

2 Related Work

Staircase detection has been an active research topic in computer vision and robotics, attracting increasing interest as we move through the era of ubiquitous computing and pervasive intelligence. One of the first relevant works [3] was based on Gabor filters and concurrent line grouping for distant and close staircase detection, respectively. In the context of autonomous vehicle navigation, an outdoor descending staircase detection algorithm based on texture energy, optical flow, and scene geometry features was presented in [4]. In the context of computer-aided navigation of the visually impaired in outdoor environments using a wearable stereo camera, [5] utilized Haar features and AdaBoost learning, providing real-time detection performance. A similar approach that utilizes Haar-like features and an improved staircase-specific Viola-Jones detector was proposed in [6].

Frequency-domain features obtained from ultrasonic sensors were investigated in [7] to detect and recognize floors and staircases with an electronic white cane. A wearable RGB-D camera mounted on the chest of a visually impaired individual was used in [8], where an indoor staircase detection and modeling approach was proposed. That approach is capable of providing information on the presence and location of staircases, along with their number of steps. Recently an indoor staircase detection framework utilizing depth images and capable of running on mobile devices was proposed in [9]. The approach is based on the detection and clustering of image patches whose surface normal vectors point upwards. In addition, information from the Inertial Measurement Unit (IMU) sensor of the device is used to calibrate the surface vectors with the camera orientation. Most of the current staircase detection approaches are supervised, requiring fully annotated training images from controlled environments, i.e., images indicating the location of the staircases within them. Furthermore, to the best of our knowledge, staircase detection has not been previously investigated to a sufficiently generic extent.

Although deep learning, and more specifically Convolutional Neural Networks (CNNs) [10], has demonstrated impressive performance in computer vision applications, especially in natural image classification [11], CNN-based staircase detection approaches have not been previously reported. While they are effective, conventional deep CNNs such as [12] suffer from high computational complexity, mainly due to their large number of free parameters. As a result, high-end computational equipment such as Graphics Processing Units (GPUs) is needed at both training and testing time, limiting their use to indoor workstations. Recent studies such as [13,14,15] focus on reducing the computational complexity of CNN architectures, aiming to enable their usage on mobile and embedded devices. In this context, the tradeoff between computational efficiency and detection performance has been investigated, resulting in a state-of-the-art architecture called MobileNet-v2 [16], which extends the original MobileNet-v1 proposed in [14]. More specifically, this architecture keeps the basic principle of depthwise convolutions from the original design and enhances it by adding linear bottleneck layers and shortcut connections between the bottlenecks. Linear bottleneck layers were adopted because experimental evidence indicated that non-linear ones were damaging the features extracted between the bottlenecks. As a result of these changes, the architecture contains 30% fewer parameters than MobileNet-v1 while providing higher accuracy. Recently, we presented the LB-FCN [2] architecture in the context of abnormality detection in medical images. The architecture featured multi-scale feature extraction modules composed of conventional convolutional layers, to better represent the different scales of abnormalities. In addition, look-behind connections were used, which connect the input features to the output of each multi-scale feature extraction module. This enables high-level features to propagate throughout the network, allowing the network to converge faster and increasing the overall detection accuracy.

The core of the LB-FCN light architecture is inspired by LB-FCN [2] and includes modifications that enable efficient computation on mobile and embedded devices while providing sufficient staircase detection accuracy. More specifically, LB-FCN light extends the original LB-FCN design by replacing the multi-scale conventional convolutional layers with depthwise separable convolutional layers [17]. Key features of this architecture include the utilization of multi-scale depthwise separable convolution layers [17] and residual learning [18] connections, which help to maintain a relatively low number of free parameters without sacrificing detection accuracy.

3 Architecture

The design of the LB-FCN light architecture follows the FCN [19] network design, where only convolutional layers are utilized throughout the network. By replacing the fully connected layers usually found in the classification part of conventional CNN architectures such as [11, 12] with convolutional ones, a significant reduction of the number of free parameters of the architecture can be achieved. Inspired by the MobileNet architecture proposed in [14], depthwise separable convolutions [17] are implemented throughout the network to further reduce the complexity of the overall architecture. While in conventional convolution each filter spans the entire depth of the input channels, in depthwise separable convolution a filter is applied separately on each channel. To combine the outputs of the separate filters, the depthwise layer is followed by a 1 × 1 conventional (pointwise) convolution.
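A minimal sketch of this depthwise separable building block is given below, written with the Keras functional API; the helper name, the filter count argument and the omission of biases are our own choices rather than details of the original implementation.

```python
from tensorflow.keras import layers

def depthwise_separable_conv(x, filters, kernel_size=3):
    # Depthwise stage: one spatial filter per input channel, applied separately.
    x = layers.DepthwiseConv2D(kernel_size, padding="same", use_bias=False)(x)
    # Pointwise stage: a 1 x 1 conventional convolution combines the per-channel outputs.
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    return x
```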

The main component of LB-FCN light is the Multi-Scale Depthwise Convolution module (Fig. 1), which follows the principles established in [2]. This module extracts features from parallel depthwise separable convolution layers, each one with a different filter size. More specifically, the layers extract features at three different scales: 3 × 3, 5 × 5 and 7 × 7, respectively. The feature maps from each layer are then concatenated, forming a multi-scale representation of the input, which is followed by a 1 × 1 convolution layer. The architecture features residual connections, which aggregate the input volume of the multi-scale module with its output through element-wise addition. This is done in order to preserve the higher-level features extracted by the previous multi-scale blocks throughout the network.

Fig. 1. The main building block of the LB-FCN light architecture.
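A possible realization of the module of Fig. 1 is sketched below, continuing the example above (same imports and helper). The exact filter counts, the ordering of activation and normalization, and the 1 × 1 projection used to match channel dimensions on the residual path are assumptions, not details taken from the original implementation.

```python
def multi_scale_module(x, filters):
    branches = []
    for k in (3, 5, 7):                               # three parallel scales: 3x3, 5x5 and 7x7
        b = depthwise_separable_conv(x, filters, kernel_size=k)
        b = layers.ReLU()(b)                          # ReLU followed by batch normalization (Sect. 3)
        b = layers.BatchNormalization()(b)
        branches.append(b)
    y = layers.Concatenate()(branches)                # multi-scale feature representation
    y = layers.Conv2D(filters, 1, padding="same")(y)  # 1x1 convolution after the concatenation
    y = layers.ReLU()(y)
    y = layers.BatchNormalization()(y)
    # Residual (look-behind) connection: element-wise addition of module input and output;
    # a 1x1 projection matches the channel dimensions when they differ (our assumption).
    if x.shape[-1] != filters:
        x = layers.Conv2D(filters, 1, padding="same")(x)
    return layers.Add()([x, y])
```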

Following the FCN [19] approach, which shows that the conventional max pooling operation can be replaced with a convolution-based one, we utilized convolutional pooling with filter size 3 × 3 and stride 2. This introduces another level of non-linearity to the network while keeping the overall architecture logically unified. After each pooling operation the number of filters extracted by each convolutional layer is doubled. In total, four multi-scale depthwise convolution modules are utilized in the network, with three residual connections, as illustrated in Fig. 2. For staircase detection, a softmax layer of two neurons is used as the output of the network.

Fig. 2. The complete LB-FCN light architecture, composed of four multi-scale blocks and three residual connections.
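The network layout of Fig. 2 can then be assembled roughly as follows, reusing the module sketched above. For simplicity this sketch places a residual connection inside every module, whereas Fig. 2 shows three; the initial filter count and the 1 × 1 convolution with global average pooling used as the classification head are also assumptions.

```python
from tensorflow.keras import Model

def build_lb_fcn_light(input_shape=(224, 224, 3), base_filters=32):
    inputs = layers.Input(shape=input_shape)
    x, filters = inputs, base_filters
    for _ in range(4):                                    # four multi-scale modules
        x = multi_scale_module(x, filters)
        x = layers.Conv2D(filters, 3, strides=2, padding="same",
                          activation="relu")(x)           # convolutional pooling (3x3, stride 2)
        filters *= 2                                      # filters are doubled after each pooling
    x = layers.Conv2D(2, 1)(x)                            # 1x1 convolution instead of a dense layer
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Activation("softmax")(x)             # two-way softmax: staircase / non-staircase
    return Model(inputs, outputs)
```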

Throughout the architecture all convolution layers use ReLU activations followed by output batch normalization. The normalization is used so that the outputs of the convolution layers are centered on zero mean with unit standard deviation. It has been empirically confirmed that output normalization can contribute to faster network convergence while reducing overfitting. As a result, no Dropout layer [20] was used.

While we maintained the multi-scale feature extraction characteristics established in the original LB-FCN [2] architecture, the change in the filter size selection of the original block increased the overall accuracy of the network. Furthermore, we utilized conventional ReLU activation functions throughout the network instead of the Parametric ReLU used in the original LB-FCN architecture, which resulted in lower computational complexity without any significant detection performance overhead. The improvements made to the original LB-FCN architecture resulted in a significant increase in computational efficiency. As a result, the LB-FCN light architecture is capable of running efficiently on mobile and embedded devices.

4 Experiments and Evaluation

4.1 Dataset

To evaluate the performance of the proposed architecture in the context of natural image staircase detection, we considered two publicly available datasets. The first dataset, named LM+Sun [21], is a fully annotated natural image dataset obtained from the combination of the LabelMe database [22] and the SUN dataset [23]. It consists of 45,676 images from 232 categories, found in indoor and outdoor environments, under various conditions and of various sizes. For the purpose of our experiment we utilized a subset of the LM+Sun dataset which includes natural images found in urban and street areas. While the full LM+Sun dataset contains 314 staircase-labeled images, most of them are found in indoor environments. Images containing staircases were also found in the urban and street subsets of this dataset, e.g., staircases of buildings that can be directly recognized by a human observer, considering: (a) staircases that have at least two steps, and (b) staircases covering >15% of the image (in staircases of smaller coverage the steps are not distinguishable; therefore, they cannot be perceived directly as such without contextual information). To minimize the possibility of human error in the annotation process, two reviewers separately reviewed and annotated the dataset, and found in total 245 images that include outdoor staircases. To further increase the number of outdoor staircase images, we created a second dataset named “StairFlickr”, which extends the LM+Sun staircases with a total of 524 outdoor staircase images. The StairFlickr images were obtained from the popular photo management and sharing web application Flickr [24].

For the purposes of our research, we omitted the fully annotated metadata provided about the staircases in the original LM+Sun dataset, as our architecture aims at staircase detection on solely weakly-labeled natural images. In total the described dataset includes 5,539 images, of which 1,083 contain staircases. Indicative images from this dataset are illustrated in Fig. 3. As can be observed, the dataset includes various types of staircases, found in various positions and sizes and captured from different viewpoints.

Fig. 3. Top: staircases found in the StairFlickr dataset. Middle: staircases found in the LM+Sun dataset. Bottom: non-staircase images from the LM+Sun dataset.

4.2 Evaluation Methodology

To evaluate the detection performance of the proposed architecture we followed the stratified 10-fold cross-validation (CV) procedure. The dataset was partitioned into 10 stratified subsets, of which 9 were used for training and 1 for testing. This was repeated 10 times, each time selecting a different subset, until all folds had been tested. For each evaluation we calculated the accuracy (ACC), specificity (SPC), and sensitivity (TPR) of the trained model following Eqs. (1–3), where true positives are denoted as TP, true negatives as TN, false positives as FP and false negatives as FN; the false positive rate (FPR) in Eq. (4) follows directly from the specificity.

$$ ACC = \frac{TP + TN}{TP + TN + FP + FN} $$
(1)
$$ SPC = \frac{TN}{TN + FP} $$
(2)
$$ TPR = \frac{TP}{TP + FN} $$
(3)
$$ FPR = 1 - SPC $$
(4)

To better evaluate the classification performance of the trained network, we utilized the Area Under the ROC curve (AUC) measure. The AUC is a reliable classification performance measure that is insensitive to imbalanced class distributions [25]. This was chosen because the total number of images containing staircases was significantly smaller than the number of the remaining natural images in the dataset.
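For illustration, the measures above can be computed per test fold as in the following sketch, where y_true holds the ground-truth labels and y_score the predicted staircase probabilities (the variable names and the 0.5 decision threshold are ours); scikit-learn is used for the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fold_metrics(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)   # Eq. (1)
    spc = tn / (tn + fp)                    # Eq. (2)
    tpr = tp / (tp + fn)                    # Eq. (3)
    fpr = 1 - spc                           # Eq. (4)
    auc = roc_auc_score(y_true, y_score)
    return acc, spc, tpr, fpr, auc
```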

4.3 Results

We trained the LB-FCN light architecture using the images from both the StairFlickr and LM+Sun datasets. As the images differ from each other in both size and aspect ratio, we rescaled them to the standardized network input size of 224 × 224 pixels. To maintain the original aspect ratio of the images, they were padded with zeros to match the network’s input dimensions. It is worth mentioning that no further pre-processing step was applied to the images. As the proposed architecture focuses on weakly labeled images, the detailed staircase annotations provided by the LM+Sun [21] dataset were ignored; we utilized only the semantic annotations of the images, which indicate the presence or absence of staircases.
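A simple way to perform this aspect-ratio-preserving rescaling with zero padding is sketched below using Pillow and NumPy; the interpolation method and the centering of the resized image on the padded canvas are assumptions.

```python
import numpy as np
from PIL import Image

def resize_with_padding(image, target=224):
    w, h = image.size
    scale = target / max(w, h)                              # fit the longer side to the target
    resized = image.resize((int(round(w * scale)),
                            int(round(h * scale))), Image.BILINEAR)
    canvas = np.zeros((target, target, 3), dtype=np.uint8)  # zero (black) padding
    rw, rh = resized.size
    top, left = (target - rh) // 2, (target - rw) // 2      # center the resized image
    canvas[top:top + rh, left:left + rw] = np.asarray(resized.convert("RGB"))
    return canvas
```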

For the training of the network we utilized the Adam [26] optimizer with initial learning rate α = 0.001 and exponential decay rates β1 = 0.9 and β2 = 0.999 for the first and second moment estimates, respectively. For the implementation of the architecture we utilized the Python Keras [27] library and the Tensorflow [28] tensor graph framework. The network was trained with a mini-batch size of 32 samples on an NVIDIA TITAN X GPU, equipped with 3584 CUDA [29] cores, 12 GB of RAM and a base clock speed of 1417 MHz. On each fold we utilized the early-stopping technique, where a small subset of the training fold served as a validation set.
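The reported training configuration roughly corresponds to the following Keras sketch; the loss function, epoch budget, early-stopping criterion and patience are our assumptions, and build_lb_fcn_light refers to the architecture sketch of Sect. 3.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

def train_fold(model, x_train, y_train, x_val, y_val):
    # Adam with the reported hyperparameters: alpha = 0.001, beta1 = 0.9, beta2 = 0.999.
    model.compile(optimizer=Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
                  loss="categorical_crossentropy",   # assumes one-hot labels for the two classes
                  metrics=["accuracy"])
    # Early stopping monitored on a small validation subset of the training fold.
    model.fit(x_train, y_train, batch_size=32, epochs=200,
              validation_data=(x_val, y_val),
              callbacks=[EarlyStopping(monitor="val_loss", patience=10,
                                       restore_best_weights=True)])
    return model
```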

To evaluate the effectiveness of the LB-FCN light architecture in terms of both detection accuracy and computational complexity, we used MobileNet-v2 [16] as a state-of-the-art architecture for comparison. The results obtained by the two architectures are illustrated in Table 1. The confusion matrix of the LB-FCN light classification performance is illustrated in Table 3.

Table 1. Detection performance comparison, using 10-fold cross-validation, between state-of-the-art MobileNet-v2 [16] and our LB-FCN light architecture

While the detection performance is slightly higher in the case of LB-FCN light, the noticeable difference between the two architectures lies in the computational complexity requirements. Table 2 includes a comparison between the architectures in terms of both the number of trainable free parameters and the total number of required FLOPs. The improvements made to the original LB-FCN design resulted in a significant reduction of the overall number of FLOPs, from 1.3 × 10⁷ down to 0.6 × 10⁶, and of the free parameters of the network, from 8.2 × 10⁶ down to 0.3 × 10⁶, respectively.
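For reference, the free-parameter counts of Table 2 can be verified for a Keras model as follows (FLOP counting typically relies on the TensorFlow profiler and is omitted here); build_lb_fcn_light again refers to the architecture sketch of Sect. 3.

```python
model = build_lb_fcn_light()
print("Total parameters:", model.count_params())   # compare against Table 2
```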

Table 2. Computational complexity comparison between state-of-the-art MobileNet-v2 [16] and our LB-FCN light architecture
Table 3. Confusion matrix of LB-FCN light classification performance.

5 Conclusions

We proposed a novel lightweight multi-scale FCN architecture that copes with the problem of staircase detection in natural images. To evaluate the performance of the architecture we extended the LM+Sun [21] natural image dataset with staircase images obtained from the Flickr [24] photo-sharing platform. To the best of our knowledge, there has been no existing work in this field that utilizes solely weakly-labeled images to detect staircases in natural images. The key features of the proposed LB-FCN light architecture can be summarized as follows:

  • It has a relatively low number of free parameters and requires a correspondingly low number of FLOPs, which makes it suitable for mobile and embedded devices;

  • It features a multi-scale feature extraction design, allowing the architecture to detect staircases of various sizes under the difficult conditions encountered in natural images;

  • Following the FCN [19] approach, it offers a lightweight and logically unified design;

  • Compared to the MobileNet-v2 [16] network, the proposed architecture requires fewer FLOPs and free parameters while providing slightly higher detection performance. This makes it attractive for lower-end mobile and embedded devices.

In our future work we are planning to evaluate the proposed architecture on larger weakly-labeled staircase natural image datasets, to further explore its potential. Furthermore, we plan to extend the LB-FCN light architecture to include the localization of staircases within the images, following a weakly supervised approach.