Introduction

Autonomous victim identification in urban search and rescue (USAR) scenes is challenging due to the occlusion of body parts in cluttered environments, variations in body poses and sensory viewpoints, and sensor noise [1]. The majority of classical learning approaches developed to detect human body parts in cluttered USAR environments first extract a set of handcrafted features, such as human geometric and skin region features [1] or histograms of oriented gradients (HOG) [2], and then train a supervised learning model (e.g., a support vector machine (SVM)) on these features. The manual design of features often requires empirical selection and validation [3], which can be time-consuming and demand expert knowledge. Furthermore, these approaches use pre-defined rules to analyze the grouping of human parts. However, in USAR scenes, occlusions mean that multiple body parts of a person may not be visible at the same time for such groupings to occur.

Deep networks have the potential to be used in USAR to autonomously extract features directly from sensory data. While they have been applied to human body part detection in structured environments, such as operating rooms [4], office buildings [5], and outdoor urban settings [6,7,8,9], they have not been considered for cluttered USAR environments. In USAR, victim identification needs to take place in environments that are unknown, without any a priori information available regarding victim locations. Furthermore, the entire body of a victim may not be visible due to occlusions and lighting conditions may vary significantly.

Our previous research has focused on developing rescue robots that use learning for exploration, navigation and victim identification tasks in USAR environments [1, 10,11,12], and identifying landmarks in USAR scenes for 3D mapping [13,14,15]. In this paper, we present the first feasibility study that investigates the use of deep learning to address the victim identification problem in robotic USAR. We propose an overall architecture that uses deep neural network detectors with RGB-D images to identify body parts. We provide a detailed investigation of these deep neural networks to determine which ones are robust to body part occlusion and low-lighting conditions.

Related Work

Person and Body Part Detection Using Learning

In this section, we discuss classical machine learning and deep learning methods that have been used to identify human bodies or body parts in varying environments.

Person Detection Using Classical Machine Learning Approaches

A handful of papers have specifically focused on finding victims in USAR environments using RGB and depth images with classical learning classifiers, e.g., [1, 2, 16]. For these detection methods, input features or body-part grouping templates needed to be handcrafted.

For example, our own previous work in [1] focused on first segmenting potential bodies from depth images based on concave curvature information. Then, 2D ellipses were fit to the segmented regions and an elliptical shape factor was computed. A recursive algorithm grouped potential body parts that were spatially close. The grouping of body parts, the elliptical shape factor, and skin color extracted from corresponding RGB images were all used as features for an SVM classifier.

In [2], infrared images were first used to detect human body temperature. In cases where a human body could not be detected using these images, a head detection technique was used. The head detection technique extracted Haar features from RGB images and HOG features from the infrared images. Adaboost was then used to classify the Haar features, while an SVM was used to classify the HOG features. The correspondence between the two sets of images was used to locate the head.

In [16], an infrared sensor was used to first detect a potential victim, and then trigger an RGB image capture of the scene. The RGB image was converted into grayscale and fed into a three-layer feed-forward neural network (NN) for body part classification. The input and hidden layers of the NN both contained 256 nodes, and the output layer contained 3 nodes representing a foot, hand, or body.

Other classical learning approaches have also been proposed for identifying human body parts in outdoor environments, such as parking lots and town centers [17], and indoor environments, such as retail stores and offices [18]. For example, in [18], a two-stage procedure was used to detect the top of human heads using RGB and depth images.

Person Detection Using Deep Learning Approaches with RGB Images

Recently, deep learning approaches have been used for body pose estimation by detecting individual body parts in RGB images [6,7,8,9]. In [6], Adapted Fast R-CNN (AFR-CNN) [19] and a dense CNN architecture were used to identify body parts as part of the pose estimator DeepCut. Training and evaluations were conducted on the Leeds Sports Poses [20] and MPII Human Pose [21] public datasets consisting of people doing sports or everyday activities such as eating, fishing, or typing, in both indoor and outdoor environments.

In [7], a sliding window detector with a 152-layer deep residual network (ResNet) [22] was used to detect body parts as part of the pose estimator DeeperCut. The model was trained on the same datasets as in [6].

In [9], a Faster R-CNN multi-person detector with a ResNet backbone was trained using the person category of the COCO public dataset [23] as part of the pose estimation process. This category contains adults and children doing sports or everyday activities in indoor or outdoor environments.

In [24], a Single Shot Multi-box Detector (SSD) [25] network was used to recognize body parts within a pose estimator. It was trained on both the MPII Human Pose and Leeds Sports Poses datasets containing annotations for the lower and upper legs, lower and upper arms, and head.

In [26], a You Only Look Once (YOLOv2) detection network was used to detect hands for a hand-pose estimator. The network was initialized using weights pre-trained on ImageNet [27], a public dataset with 14 million images consisting of humans, animals, and objects. It was then fine-tuned on an in-house RGB image dataset captured in different indoor environments.

In [28], a Feature Pyramid Network (FPN) was extended for body part instance segmentation. A multi-task loss was used to regress instance-level body part masks and surface patches, including the left and right hands, feet, upper and lower legs, head, etc. The network was trained on the DensePose-COCO dataset.

In [29], a Detector-in-Detector network was proposed where the first detector (body detector) detects the body, and the second (parts detector) uses this information to detect hands and faces. The body detector uses Faster R-CNN with a ResNet-50 backbone while the parts detector builds on the body detector with two convolutional layers. A custom Human-Parts dataset consisting of 14,962 images and 106,879 annotations was used for training.

The availability of public datasets makes training of the aforementioned RGB image-based detectors very convenient. However, as these detectors are dependent on only RGB images, they have difficulty functioning in low-lighting USAR environments.

Person Detection Using Deep Learning Approaches with RGB and Depth Images

Only a few detectors have considered using both RGB and depth (RGB-D) images, which are more robust against illumination and texture variations, as inputs to their networks [4, 5]. In [4], a ResNet detector was used to detect upper body parts in an operating room. RGB-D information was used as the input, and the network output a score map for upper body parts. The RGB-D data was captured by multiple cameras fixed around the operating room. The score map was then used by a random forest classifier to classify the overall human pose.

In [5], a long short-term memory (LSTM) network was used to detect head-tops. The first layer employed the head-top detection technique presented in [18], where for each possible head-top pixel, a set of bounding boxes were generated from both RGB and depth images. This set of boxes contained different ratios of potential human body proportions for a particular head-top. Each set of bounding boxes belonging to a head-top pixel was simultaneously fed into two LSTM chains, one for RGB images and one for depth images. A third LSTM fusion network used feature vectors from both LSTM chains at each link in the chain, and logistic regression was used at the end of the third LSTM chain to classify whether a person was detected.

The aforementioned detectors have been trained for structured indoor environments such as operating rooms, offices, and building corridors. People in such settings are less occluded and typically have common poses, such as standing, sitting, or lying down. Therefore, they do not generalize well to cluttered USAR environments in which people can be partially buried in a variety of different poses and with only small portions of their body visible. In this paper, we investigate the first use of deep learning networks to uniquely address these challenges for the victim identification problem in cluttered USAR scenes.

Deep Learning Networks for the Victim Identification Problem in USAR Environments

The proposed architecture for victim identification comprises three stages: data collection, training, and inference (Fig. 1). In the data collection stage, RGB and depth images are collected and used as inputs to the training stage, where features are extracted to produce a feature map used to train a detector for body part classification. In the inference stage, new RGB-D images are used as inputs for the trained detector for body part detection.

Fig. 1
figure 1

Deep network architecture for body part detection

Two main approaches can be used when designing deep learning architectures for person detection. The first is a two-stage approach, which comprises a first stage that generates a set of region proposals indicating where target objects might be located, and a second stage that classifies each proposed region as an object class or as background [30•]. In contrast, a single-stage detector performs object localization and classification concurrently [31]. When using such approaches, there is a trade-off between accuracy and speed. In this work, we investigate both approaches for the victim identification problem in USAR environments. The two-stage detector we consider is Feature Pyramid Network (FPN) with Faster R-CNN [38]. It is more accurate than its predecessors such as Faster R-CNN [32•] and R-CNN [33]. FPN with Faster R-CNN and its variations have been used in person and object detection applications, e.g., [34, 35]. However, they have not been used in cluttered USAR environments where body parts are occluded.

Single-stage detectors have the advantage of faster detection speed than two-stage approaches because they remove the proposal-generating stage. Their drawback is that they tend to have lower accuracy [30•]. The most popular single-stage detectors are SSD [25], YOLOv2 [36], YOLOv3 [37•], and RetinaNet [30•]. They have been adopted for real-time object detection in self-driving cars and environment monitoring applications [38,39,40]. However, they have not yet been applied to cluttered USAR scenes. Their faster detection speeds can be an advantage in time-critical search and rescue missions. The sub-sections below discuss how we have designed the network architectures for each of the aforementioned detectors to address the victim identification problem.

Two-Stage Detector

FPN with Faster R-CNN

The FPN with Faster R-CNN approach [32•] (Fig. 2a) uses an FPN to extract features from RGB-D images taken in USAR scenes and outputs feature maps at different scales. The feature maps are generated by a backbone ResNet-50 model pretrained on the ImageNet dataset [27]. They are passed to the region proposal network (RPN) to generate bounding box proposals, which are used by the second-stage network for body part classification and bounding box refinement. The FPN structure is designed to improve detection accuracy by extracting features at different scales while keeping computation cost low [32•]. In USAR, body parts can appear at any scale depending on their pose relative to the robot. As shown in Fig. 2a, the FPN structure consists of a multiple-layer CNN (ResNet) that scales down an input image through convolution and, at the last layer, scales it back up. Feature maps produced while scaling down are added element-wise, through lateral connections, to those produced while scaling up. While lower-level feature maps have higher resolution and provide more detail on small body parts, higher-level feature maps are processed through more convolution layers and gain more semantic understanding of the overall image. By combining the feature maps, the detector benefits from both aspects. With high-level semantic features in higher-resolution layers, the network becomes more robust in detecting small body parts when occlusion is present. The number of output classes for the network is seven: six body parts (arm, foot, hand, head, leg, torso) and one background class.
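
As a concrete illustration of this two-stage design, the sketch below builds an FPN + Faster R-CNN detector with a ResNet-50 backbone and seven output classes using torchvision. It is a minimal sketch under stated assumptions (PyTorch/torchvision and a 3-channel input tensor), not the exact implementation used in this work; an RGB-D pipeline would additionally require a 4-channel first convolution layer.

```python
# Minimal sketch of an FPN + Faster R-CNN detector with a ResNet-50 backbone,
# using torchvision (assumed here; not the authors' exact implementation).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

NUM_CLASSES = 7  # arm, foot, hand, head, leg, torso + background

# The keyword for loading ImageNet-pretrained backbone weights varies across
# torchvision versions; pretrained weights are omitted here for simplicity.
model = fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES)

# A real RGB-D pipeline would need a 4-channel first convolution layer;
# this example runs on a standard 3-channel tensor for illustration.
model.eval()
with torch.no_grad():
    prediction = model([torch.rand(3, 480, 640)])[0]
print(prediction["boxes"].shape, prediction["labels"], prediction["scores"])
```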

Fig. 2
figure 2

Two-stage detector: a FPN with Faster R-CNN network flow. Single-stage detectors: b SSD architecture, c YOLOv2 architecture, d Darknet-53, the feature extraction layers used in YOLOv3, and e RetinaNet

Single-Stage Detectors

SSD

In SSD [25], RGB-D images are first processed by pretrained convolutional layers (VGG16 [41]) to output a feature map (Fig. 2b). The feature map then goes through size reduction via a chain of convolution layers. The feature maps at the different detection layers are processed independently by convolution filters to provide the coordinates of victim body part bounding boxes and classification probabilities. Each cell in a feature map is associated with k × 4 values representing the four coordinates of k bounding boxes centered at that cell [25]. The size and aspect ratio of the boxes are initialized using manually selected default values and then refined by the network, enabling the network to detect both small and large body parts. For each bounding box, the filters output one detection probability for each of the six body part classes, plus four additional scalars predicting offsets that refine the bounding box coordinates [25]. The output for each bounding box, 6 + 4 values, is compared with the manually labeled ground truth to calculate losses.
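
The per-cell bookkeeping described above can be made concrete with a short sketch. The number of default boxes k and the grid sizes below are illustrative assumptions, not the exact SSD configuration used here.

```python
# Minimal sketch of the SSD per-cell output bookkeeping described above
# (k and the grid sizes are illustrative assumptions, not the exact configuration).
num_classes = 6              # body-part classes scored for each default box
k = 4                        # default boxes (priors) per feature-map cell (assumed)

loc_channels = k * 4         # 4 offset values per default box
conf_channels = k * num_classes

# Total default boxes across a few example detection-layer grid sizes:
total_boxes = sum(g * g * k for g in (38, 19, 10, 5, 3, 1))
print(loc_channels, conf_channels, total_boxes)
```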

YOLOv2

YOLO detectors use a single CNN [36, 37] for both body part localization and classification. The CNN is a Darknet-19 pretrained on the ImageNet dataset [27]. To apply YOLO detectors to our body part dataset from a cluttered USAR-like environment, the labels were annotated according to the Pascal VOC format [42]. An input RGB-D image is processed by the CNN shown in Fig. 2c [36], which directly outputs an S × S × (b × (c + 5)) tensor for bounding box localization and victim body part classification, where c = 6 is the number of body part classes. The number of grid cells that an input is divided into is S × S = 13 × 13. For each cell, b bounding boxes are initialized, and for each bounding box, 6 + 5 scalar values are predicted. Of the five additional scalar values, four are for localization and one is for confidence, defined as \( \Pr(class \mid object) \times IOU_p^t \). Pr(class|object) is the probability that a body part belongs to a specific class, conditioned on the grid cell containing a victim body part. The four localization values are the horizontal and vertical offsets relative to the grid cell, and the height and width normalized by the size of the entire image. \( IOU_p^t \) is the Intersection over Union of the predicted body part bounding box, p, and the hand-labeled ground truth bounding box, t, calculated by dividing the area of overlap by the area of union.
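
To make the output layout and the confidence term concrete, the following sketch computes the size of the S × S × (b × (c + 5)) tensor and the Intersection over Union between a predicted and a ground-truth box. The value of b is an assumption for illustration, not necessarily the value used in this work.

```python
# Minimal sketch of the YOLOv2 output-tensor size and the IoU term in the
# confidence score (b is an assumed value, not necessarily the one used here).
S, b, c = 13, 5, 6                       # grid size, boxes per cell (assumed), classes
output_values = S * S * (b * (c + 5))    # 4 localization + 1 confidence + c class scores per box

def iou(p, t):
    """Intersection over Union of predicted and ground-truth boxes (x1, y1, x2, y2)."""
    x1, y1 = max(p[0], t[0]), max(p[1], t[1])
    x2, y2 = min(p[2], t[2]), min(p[3], t[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (p[2] - p[0]) * (p[3] - p[1])
    area_t = (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area_p + area_t - inter)

print(output_values, iou((0, 0, 10, 10), (5, 5, 15, 15)))
```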

YOLOv3

YOLOv3 [37•] further improves upon YOLOv2 by incorporating elements used in other state-of-the-art detection algorithms, such as residual blocks [22] and feature pyramids [32•]. The feature extraction layers of YOLOv2 are replaced by a pretrained Darknet-53 (Fig. 2d), which consists of 53 layers mainly composed of 3 × 3 and 1 × 1 convolutions with residual blocks. The output feature map is passed through another 53 layers for detection. Detection is done at three different scales, using a concept similar to feature pyramid networks [37•], to improve small body part detection. Namely, body parts are detected on feature maps of three different sizes, output by different layers. The larger-dimension grids are responsible for detecting smaller body parts, and vice versa.
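
A short sketch of the three detection scales is given below; the 416 × 416 input resolution and 3 anchor boxes per scale follow the standard YOLOv3 configuration and are assumptions for illustration.

```python
# Minimal sketch of YOLOv3's three detection scales (a 416 x 416 input and
# 3 anchor boxes per scale are assumed, as in the standard YOLOv3 configuration).
input_size, anchors_per_scale, c = 416, 3, 6
for stride in (32, 16, 8):                  # coarse to fine feature maps
    grid = input_size // stride
    channels = anchors_per_scale * (c + 5)  # 4 box values + 1 confidence + c classes per anchor
    print(f"{grid} x {grid} grid with {channels} output channels per cell")
# The finest (largest-dimension) grid handles the smallest body parts.
```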

RetinaNet

The RetinaNet architecture [30•] uses an FPN for multi-scale feature extraction from RGB-D images, followed by two parallel branches of convolutional networks for body part classification and bounding box regression (Fig. 2e). Similar to FPN with Faster R-CNN, the feature maps are generated by a backbone ResNet-50 model pretrained on the ImageNet dataset. RetinaNet uses feature pyramid levels P3 to P7. At each level, b = 9 anchor boxes are selected for each spatial location of the feature map grid. Each box is associated with a class prediction for all c classes (6 body parts + 1 background) and four coordinates. From a structural perspective, the feature map at each level of the pyramid is passed to two branches of convolutional networks in parallel. The classification branch consists of four 3 × 3 × 256 convolution layers with rectified linear unit (ReLU) activation, followed by a 3 × 3 × (b × c) convolution layer that outputs a grid-size × b × c tensor predicting the victim body part classifications for each anchor box. The box regression branch also consists of four 3 × 3 × 256 convolution layers with ReLU activation, followed by a 3 × 3 × (b × 4) layer that predicts the location coordinates of all bounding boxes (Fig. 2e).
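
The head dimensions described above can be summarized in a short sketch; the values simply mirror the b = 9 anchors and c = 7 classes stated in the text and are not taken from a specific implementation.

```python
# Minimal sketch of the RetinaNet head dimensions described above
# (illustrative; mirrors the b = 9 anchors and c = 7 classes stated in the text).
b, c = 9, 7                       # anchors per location, 6 body parts + background
head_width = 256                  # channels of the four 3 x 3 convolution layers

cls_out_channels = b * c          # classification branch: one score per anchor per class
box_out_channels = b * 4          # regression branch: four coordinates per anchor

pyramid_levels = ("P3", "P4", "P5", "P6", "P7")
print(cls_out_channels, box_out_channels, len(pyramid_levels))
```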

Training

In order to train all the designed detectors, we created a dataset consisting of 570 corresponding RGB-D images of both human and mannequin body parts in a cluttered USAR-like environment (Fig. 3). The images were obtained from a Kinect sensor onboard a mobile Turtlebot 2 platform. The images were manually labeled into six classes for training purposes: arm, foot, hand, head, leg, and torso. To account for different lighting conditions, we applied gamma correction with randomly assigned gamma values to the RGB images during preprocessing. First, image pixel intensities were scaled from [0, 255] to [0, 1.0]. A gamma-corrected image was then obtained using

$$ O = I^{1/G} $$
(1)
Fig. 3
figure 3

USAR-like environment layout (top two and bottom two panels consist of mannequin and human victims, respectively)

where I is the scaled input image and G is the gamma value. The corrected image O is then converted back to the range [0, 255]. We distributed our image dataset across five possible gamma values: 0.1, 0.2, 0.4, 0.8, and 1.0, where G = 1 has no effect on the image and G = 0.1 is the darkest setting. We trained each network on RGB-D images consisting of both partially and fully visible body parts. For the training process, k-fold cross validation (k = 5) was used to partition our dataset into training and validation images. Training took place on an Nvidia Titan V GPU. The learning parameters were initialized according to Table 1 and fine-tuned empirically for each network. The maximum training iterations required for all runs of each network are also reported in Table 1. The reported batch size is the number of images used to compute the gradient for backpropagation. For deeper networks, this is generally limited by GPU memory (e.g., for RetinaNet, the maximum is two images on our GPU). The same training procedure was also implemented separately on RGB-only and depth-only images for comparison.
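
A minimal sketch of the gamma-correction preprocessing in Eq. (1) is shown below, assuming a NumPy implementation; the exact preprocessing code used in this work is not reproduced here.

```python
# Minimal sketch of the gamma-correction preprocessing in Eq. (1), using NumPy
# (assumed; the exact implementation used by the authors is not specified).
import numpy as np

def gamma_correct(image_uint8, gamma):
    """Apply O = I^(1/G) to an 8-bit RGB image to simulate a lighting condition."""
    scaled = image_uint8.astype(np.float32) / 255.0     # [0, 255] -> [0, 1]
    corrected = np.power(scaled, 1.0 / gamma)           # Eq. (1)
    return (corrected * 255.0).astype(np.uint8)         # back to [0, 255]

# G = 1.0 leaves the image unchanged; G = 0.1 gives the darkest setting.
rng = np.random.default_rng(0)
for g in (0.1, 0.2, 0.4, 0.8, 1.0):
    darkened = gamma_correct(rng.integers(0, 256, (4, 4, 3), dtype=np.uint8), g)
```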

Table 1 Training parameters

Experiments

Experiments were performed on a validation set of images from our dataset. All predicted regions with an Intersection over Union (IoU) of at least 0.5 with the manually labeled ground truth were accepted, which is common for object detection benchmarking [43]. Furthermore, repetitive detections of the same object in an image were minimized by using non-maximum suppression (NMS) [44] with the default threshold of 0.45 [36].
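
For reference, a minimal sketch of greedy NMS with the 0.45 IoU threshold is shown below. It is illustrative only; in practice, each detector's own NMS implementation was used.

```python
# Minimal sketch of greedy non-maximum suppression with the 0.45 IoU threshold
# mentioned above (illustrative only; boxes are given as (x1, y1, x2, y2)).
def box_iou(a, b):
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Keep the highest-scoring box and drop overlapping lower-scoring duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

# Two near-duplicate detections: only the higher-scoring one survives.
print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7]))
```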

We chose 11-point mean average precision (mAP) [42] and recall as evaluation metrics. Recall measured the percentage of true victim body parts detected, and mAP measured how well each network maintained high precision as recall increased. The precision-recall results for both the fully visible and partially occluded body parts for all networks are presented in Table 2 and Fig. 4. Furthermore, results for the networks using RGB-only or depth-only images are also presented for comparison.
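
A minimal sketch of 11-point interpolated average precision, the per-class quantity behind the reported mAP values, is given below; the precision-recall points in the example are hypothetical, not results from our experiments.

```python
# Minimal sketch of 11-point interpolated average precision, the per-class metric
# behind the mAP values reported below (illustrative data, not from the paper).
def eleven_point_ap(recalls, precisions):
    """Average of the max precision at recall >= r for r in {0.0, 0.1, ..., 1.0}."""
    ap = 0.0
    for r in [i / 10.0 for i in range(11)]:
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11.0

# Hypothetical precision-recall points for a single body-part class:
print(eleven_point_ap([0.2, 0.4, 0.6, 0.8], [0.9, 0.8, 0.6, 0.4]))
```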

Table 2 Victim body part detection results for the networks on varying body part visibilities
Fig. 4
figure 4

Comparison of the precision-recall results for both the fully visible and partially occluded body parts for all networks; values are averaged across all body parts

Table 2 presents the results for each individual body part. In general, RetinaNet had higher overall precision-recall for both the fully visible and partially occluded datasets, demonstrating its robustness to occlusion. The main advantage of RetinaNet is its focal loss, which down-weights the contribution of easier examples (e.g., fully visible body parts) in the loss function [30•], allowing the network to focus on harder examples (e.g., occluded body parts) and harder classes (e.g., body parts that are more difficult to detect). Namely, the focal loss allows RetinaNet to significantly outperform the other networks, in some cases with up to 43% performance improvement on the most difficult body parts to detect: the hand and the foot. Therefore, despite being a single-stage detector, RetinaNet is able to outperform the two-stage detector here. FPN with Faster R-CNN, the two-stage detector, outperformed the single-stage detectors YOLOv2, YOLOv3, and SSD overall when using the RGB-D and RGB datasets. The YOLO networks generally performed better than SSD as they used higher-resolution input images. One possible reason that FPN with Faster R-CNN was able to outperform the YOLO and SSD detectors is its lateral connections, which produce high-resolution, semantically rich feature maps, allowing it to detect small body parts. For example, being able to capture small features such as fingers on hands can result in more accurate hand detection, especially when the hand is partially occluded.
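
For reference, a minimal sketch of the binary focal loss from [30•] is shown below; the alpha and gamma values are the commonly used defaults and are assumptions here, not necessarily those used in our training.

```python
# Minimal sketch of the focal loss described above (binary form, following [30•];
# alpha and gamma are the commonly used defaults, assumed here).
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for a single prediction."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified example contributes far less than a hard one:
print(focal_loss(0.95, 1), focal_loss(0.3, 1))
```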

The hand, due to self-occlusion, its size, and its similarity with the foot, was difficult to detect for a number of the networks, especially in instances where the spacing between fingers is less distinct. In contrast, the head and torso were easier to detect, with higher precision-recall for the majority of the networks. Using the RGB-D information resulted in higher overall precision and recall for the majority of the networks compared to using only RGB or depth data. The RGB-D data incorporates color, geometry, and scale information while being invariant to illumination. By further analyzing failure cases, it was observed that the other single-stage detectors, namely the two YOLO detectors and SSD, could not handle changes in illumination, such as dim lighting conditions, as well as RetinaNet did. As depth is invariant to lighting, these networks achieved better precision on the depth-only dataset than on the RGB dataset, especially the YOLO detectors. The robustness of RetinaNet to illumination also suggests that the network, aided by the focal loss, has encoded stronger illumination-invariant features.

Figure 5 shows the performance of RetinaNet using the RGB-D dataset under the different illumination conditions. In Fig. 5a, with the lowest lighting condition, two feet, an arm, and a hand of a potential victim were detected. In Fig. 5b, with the second lowest lighting condition, the partially occluded torso and head of one potential victim were detected, along with a hand and foot of other potential victims. Both Fig. 5c and e exhibit large body part occlusions, from both self-occlusion and clutter, while Fig. 5d presents partially occluded heads at different viewpoints and scales. RetinaNet was able to detect these body parts, demonstrating its ability to deal not only with occlusion, but also with different illumination conditions and body parts at varying viewpoints and scales.

Fig. 5
figure 5

Test results from RetinaNet. Each sub-figure contains, from top to bottom, the RGB input image, the depth input image, the combined RGB-D image, and the detection output. Gamma values are 0.1, 0.2, 0.4, 0.8, and 1.0 from a to e, respectively

Conclusions

In this paper, we investigated, for the first time, the use of deep learning networks to address the victim identification problem in cluttered USAR environments. By providing the first feasibility and comparison study of state-of-the-art detectors, our results showed that, by including RGB-D information, deep networks can be trained to perform in dark, cluttered environments and to detect partially occluded body parts. In general, using RGB-D information resulted in higher precision-recall compared to using only RGB or depth data. With respect to the individual detectors, the single-stage detector RetinaNet had both higher recall and higher mean average precision than the other detectors. By adopting such end-to-end deep networks, we can eliminate the time-consuming process of manually defining features to extract from such complex environments.