1 Introduction

With the introduction of large-scale pedestrian datasets (Dollár et al. 2009; Dollar et al. 2012; Zhang et al. 2017; Geiger et al. 2013), deep convolutional neural networks (DCNNs) have achieved promising detection accuracy. However, the trained DCNN detectors may not be robust enough because negative background examples greatly outnumber positive foreground examples during training. Recent studies have confirmed that DCNN detectors trained with limited foreground examples can be vulnerable to difficult objects with unexpected states (Huang and Ramanan 2017) and diversified poses (Alcorn et al. 2018).

Fig. 1

We propose a shape transformation-based dataset augmentation framework for pedestrian detection. In the framework, we introduce the shape-guided warping field to deform pedestrians and the environment-aware blending map to adapt the deformed pedestrians to background environments. Our proposed framework can effectively generate more realistic-looking pedestrians for augmenting pedestrian datasets in which real pedestrians are usually of low quality. Best viewed in color

To improve detector robustness, besides designing new machine learning algorithms, many researchers have attempted to augment training datasets by generating new foreground examples. For instance, Huang et al. (Huang and Ramanan 2017) used a 3D game engine to simulate pedestrians and adapt them into pedestrian datasets. Other studies (Ma et al. 2018; Siarohin et al. 2018; Zanfir 2018; Ge et al. 2018) attempted to augment person re-identification datasets by transferring the poses of pedestrians using generative adversarial networks (GANs). Despite this progress, it remains very challenging to apply existing augmentation approaches to common pedestrian detection datasets. First, synthesizing pedestrians using external platforms like 3D game engines may introduce a significant domain gap between synthesized pedestrians and real pedestrians, limiting the benefit of the generated pedestrians for improving model robustness on real pedestrians. Moreover, methods that utilize GANs to render pedestrians generally require rich appearance details from paired training images to define the desired output of the generative networks during training. However, in common pedestrian detection datasets like Caltech (Dollar et al. 2012) and CityPersons (Zhang et al. 2017), pedestrians are usually of low quality due to factors like heavy occlusion, blurry appearance, and low resolution caused by small sizes. As a result, the available real pedestrians provide only an extremely limited amount of appearance detail that can be used for training generative networks. Without sufficient description of the desired appearance of synthesized pedestrians, current GAN-based methods, as we show in our experiments, generate less realistic or even corrupted pedestrians from the very low-quality pedestrians in common pedestrian detection datasets.

To address the above issues, we propose to augment pedestrian datasets by transforming real pedestrians from the same dataset according to different shapes (i.e. segmentation masks in this study) rather than rendering new pedestrians. Our motivation comes from the following observations. First, unlike existing methods that require sufficient appearance details to define the desired output, it is much easier to access rich pixel-level shape deformation supervision, which defines the deformation from one shape to another, when only low-quality pedestrian examples are available in the datasets. The learned deformation between shapes can guide the deformation of the appearances of real pedestrians, avoiding the need for detailed supervision that directly defines the transformed appearances. In addition, since the shape information naturally distinguishes foreground areas from background areas, we can simply focus on adapting synthesized foreground appearances to background environments, avoiding the risk of generating unnatural background environments together with the synthesized pedestrians, as required in current GAN-based approaches. Last but not least, we find that transforming real pedestrians based on different shapes can effectively increase foreground sample diversity while still adequately maintaining the appearance characteristics of real pedestrians.

Based on these observations, we devise a Shape Transformation based Dataset Augmentation (STDA) framework to fulfill the pedestrian dataset augmentation task more effectively. Figure 1 presents an overview of our framework. In particular, the framework first deforms a real pedestrian into a similar pedestrian with a different shape and then adapts the shape-deformed pedestrian into the surrounding environment of the image to be augmented. In the STDA framework, we introduce a shape-guided warping field, which is a set of vectors defining the warping operation between shapes, to define an appropriate deformation of both the shapes and the appearances of real pedestrians. Moreover, we introduce an environment-aware blending map to help the shape-deformed pedestrians blend better into various background environments, delivering more realistic-looking pedestrians on the image.

In this study, our key contributions are listed as follows:

  • We propose a shape transformation-based dataset augmentation framework to augment pedestrian detection datasets and improve pedestrian detection accuracy. To the best of our knowledge, we are the first to apply a shape-transformation-based data synthesis methodology to pedestrian detection.

  • We propose the shape-guided warping field to help define a proper shape deformation procedure. We also introduce an environment-aware blending map to better adapt the shape-transformed pedestrians into different backgrounds, achieving better augmentation results on the image.

  • We introduce a shape constraining operation to improve shape deformation quality. We also apply a hard positive mining loss to take advantage of the concept of hard example mining and further magnify the benefits of the synthesized pedestrians for improving detection robustness.

  • Our proposed framework is promising for generating pedestrians, especially when using low-quality examples. Comprehensive evaluations on the widely used Caltech (Dollar et al. 2012) and CityPersons (Zhang et al. 2017) benchmarks validate that our proposed framework can generate more realistic-looking pedestrians than existing methods when using low-quality data. With pedestrian datasets augmented by our framework, we substantially boost the performance of the baseline pedestrian detector, achieving performance superior to other cutting-edge pedestrian detectors.

2 Related Work

2.1 Pedestrian Detection

Pedestrian detection is critical in many applications such as robotics and autonomous driving (Enzweiler and Gavrila 2008; Dollár et al. 2009; Dollar et al. 2012; Zhang et al. 2016c) and in downstream tasks like tracking, scene segmentation, and keypoint estimation (Chen et al. 2017, 2019; Zhang et al. 2020). Traditional pedestrian detectors generally use hand-crafted features (Viola et al. 2005; Ran et al. 2007) and adopt a human part-based detection strategy (Felzenszwalb et al. 2010b) or cascaded structures (Felzenszwalb et al. 2010a; Bar-Hillel et al. 2010; Felzenszwalb et al. 2008). Recently, by taking advantage of large-scale pedestrian datasets (Dollár et al. 2009; Dollar et al. 2012; Zhang et al. 2017; Geiger et al. 2013; Loy et al. 2019), researchers have greatly improved pedestrian detection performance with DCNNs (Simonyan and Zisserman 2014; He et al. 2016; Ouyang and Wang 2013; Ouyang et al. 2017). Among the DCNN detectors, two-stage detection pipelines (Ouyang and Wang 2013; Ren et al. 2015; Li et al. 2018; Cai et al. 2016; Zhang et al. 2016b; Du et al. 2017) usually perform better than single-stage detection pipelines (Liu et al. 2016; Redmon et al. 2016; Lin et al. 2018). Despite this progress, the extreme imbalance between foreground and background examples in pedestrian datasets still adversely affects the robustness of DCNN detectors. Current pedestrian detectors can still be fragile to even small transformations of pedestrians. To tackle this problem, many researchers augment the datasets by synthesizing new foreground data.

2.2 Simulation-based Dataset Augmentation

To achieve dataset augmentation, researchers have used 3D simulation platforms to synthesize new examples for the datasets. For example, (Lerer et al. 2016; Ros et al. 2016) used a 3D game engine to help build new datasets. More closely related studies used 3D simulation platforms to augment pedestrian-related datasets. In particular, (Pishchulin et al. 2011; Hattori et al. 2015) employed a game engine to synthesize training data for pedestrian detection. In addition, (Huang and Ramanan 2017) applied a GAN to narrow the domain gap between 3D-simulated pedestrians and natural pedestrians when augmenting pedestrian datasets, but this method brings limited improvement on common pedestrian detection benchmarks, suggesting that the domain gap remains large. Such a gap can adversely affect DCNN detectors, so the augmented datasets deliver only incremental improvements on pedestrian detection.

2.3 GAN-based Dataset Augmentation

Recently, with several improvements (Radford et al. 2015; Arjovsky et al. 2017; Gulrajani et al. 2017), GANs (Goodfellow et al. 2014) have shown great benefits in synthesis-based applications such as image-to-image translation (Isola et al. 2017; Liu et al. 2017a; Zhu et al. 2017) and skeleton-to-image generation (Villegas et al. 2017; Yan et al. 2017).

In the person re-identification literature, many works have attempted to transfer the poses of real pedestrians to deliver diversified pedestrians for augmentation. For instance, (Liu et al. 2018; Ma et al. 2018; Siarohin et al. 2018; Zanfir 2018; Ge et al. 2018; Zhang et al. 2017; Ma et al. 2017) introduced various techniques to transform human appearance according to 2D or 3D poses and improve person re-identification performance. (Vobecky et al. 2019) proposed a novel approach to generate pedestrians according to different poses; the synthesis results are promising and rare pedestrian situations can be simulated. In practice, these methods require accurate and reliable pose information or paired training images that contain rich appearance details to achieve successful transformation. However, existing widely used pedestrian datasets like Caltech provide neither pose annotations nor paired appearance information for training GANs. Furthermore, current pedestrian datasets contain a large number of small pedestrians whose appearances are usually of low quality, making it difficult for existing pose estimators to deliver reasonable predictions. Figure 2 shows examples in which the poses estimated for low-quality pedestrians are much more unstable than the masks estimated by the same Mask RCNN (He et al. 2017) detector. As a result, it is quite infeasible to seamlessly apply these pose transfer models to augmenting current pedestrian datasets.

In pedestrian detection, some studies have introduced specifically designed GANs for augmentation. As an example, (Ouyang et al. 2018b) modified the pix2pixGAN (Isola et al. 2017) to make it more suitable for pedestrian generation, but this method lacks a particular mechanism for producing diversified pedestrians and still delivers poor generation results on low-quality data. In the study (Lee et al. 2018), the authors introduced an end-to-end trainable neural network for placing new pedestrian masks and vehicle masks in an urban scene, but it does not generate transformed pedestrian appearances to augment datasets. Also, (Liu et al. 2019) developed an effective unrolling mechanism that jointly optimizes a generative model and a detector to improve detection performance by generating new data for datasets with limited training examples. This approach directly generates pedestrian appearances from noise, while our method mainly transforms the shapes of real pedestrians to achieve better augmentation performance on low-quality data.

Fig. 2

Some examples showing that the shape estimation results are more accurate than pose estimation results on low-quality images using the same Mask RCNN model. Best viewed in color

In this study, we propose that transforming pedestrians from the original dataset by altering their shapes can produce diversified and much more lifelike pedestrians without requiring rich appearance details for supervision.

Fig. 3

Overview of the proposed shape transformation-based dataset augmentation framework for pedestrian datasets with low-quality pedestrian data. In particular, we introduce the shape-guided warping field, \(\mathbf {V}_{i\rightarrow j}\), and the environment-aware blending map, \(\mathbf {\alpha }(x,y)\), to respectively help implement the shape-guided deformation and the environment adaptation, obtaining the deformed shape \(\mathbf {s}^w_{i\rightarrow j}\), the deformed pedestrian \(\mathbf {z}^w_{i\rightarrow j}\), and the transformation result \(\mathbf {z}^{gen}_{i\rightarrow j}\). By placing \(\mathbf {z}^{gen}_{i\rightarrow j}\) into I, we can effectively augment the original image. In practice, we employ a U-Net to predict both \(\mathbf {V}_{i\rightarrow j}\) and \(\alpha (x,y)\). Training losses for the U-Net include \(\mathcal {L}_{shape}\), \(\mathcal {L}_{adv}\), \(\mathcal {L}_{cyc}\), and \(\mathcal {L}_{hpm}\). Best viewed in color

3 Shape Transformation-based Dataset Augmentation Framework

3.1 Problem Definition

Data augmentation, commonly formulated as transformations of raw data, has been used to achieve the vast majority of state-of-the-art results in image recognition. Data augmentation is intuitively explained as increasing the training data size and as a regularizer that controls hypothesis complexity (Goodfellow et al. 2016; Zhang et al. 2016a; Dao et al. 2019). In particular, the hypothesis complexity can be used to measure the generalization error, i.e. the difference between the training and test errors, of learning algorithms (Vapnik 2013; Liu et al. 2017b). A larger hypothesis complexity usually implies a larger generalization error and vice versa. In practice, a small training error and a small generalization error are favoured to guarantee a small test error. As a result, data augmentation is especially useful for deep learning models, which are powerful at maintaining a small training error but have a large hypothesis complexity. It has been empirically demonstrated that data augmentation operations can greatly improve the generalization ability of deep models (Cireşan et al. 2010; Dosovitskiy et al. 2015; Sajjadi et al. 2016).

In this study, the overall goal is to devise a more effective dataset augmentation framework to improve pedestrian detection models. The framework is supposed to generate diversified and more realistic-looking pedestrian examples to enrich the corresponding datasets, in which real pedestrians are usually of very low quality. We achieve this goal by transforming real pedestrians into different shapes rather than rendering new pedestrians. First, using a deformation operation, we transform the shapes of pedestrians into various shapes to enrich the pedestrian appearances. The deformation introduces appropriate perturbations that help regularize deep models, unlike existing methods such as PS-GAN (Ouyang et al. 2018b) that may distract deep models by producing less realistic training examples. Second, we apply adequate environment adaptation to better blend the generated pedestrians into different background areas. This minimizes the risk of producing obvious unnatural artifacts that could hurt performance while keeping the rich diversity of generated pedestrian appearances. Therefore, our method can be effective for regularizing the hypothesis complexity. This is empirically justified by our experiments, which show that using our method to augment datasets can significantly improve the pedestrian detection performance of the baseline model and outperform other augmentation methods.

Formally, suppose \(\mathbf {z}_i\) is an image patch containing a real pedestrian in the dataset and \(\mathbf {s}_i\) is its extracted shape or segmentation mask. Here, we refer to the shape or “mask” \(\mathbf {s}_i\) of a pedestrian \(\mathbf {z}_i\) as a set of labels, denoted as \(s_i(x,y)\), that distinguish foreground areas from background areas within the pedestrian patch, where (x, y) represents coordinates on the image: \(s_i(x,y)=1\) when the location (x, y) is on the foreground and \(s_i(x,y)=0\) when the location (x, y) is on the background. Denote by \(\mathbf {s}_j\) a different shape, which can be obtained from another real pedestrian’s shape. In this study, we implement a shape transformation-based dataset augmentation function, denoted as \(f_{STDA}\), that transforms a real pedestrian into a new pedestrian with a realistic-looking appearance but a different shape \(\mathbf {s}_j\) for the augmentation:

$$\begin{aligned} \mathbf {z}^{gen}_{i\rightarrow j} = f_{STDA}(\mathbf {z}_i,\mathbf {s}_i, \mathbf {s}_j, I), \end{aligned}$$
(1)

where \(\mathbf {z}^{gen}_{i\rightarrow j}\) is a patch containing the new pedestrian generated by transforming the shape \(\mathbf {s}_i\) of \(\mathbf {z}_i\) into \(\mathbf {s}_j\), and I is the image to be augmented.

3.2 Framework Overview

In pedestrian detection datasets, it is difficult to access sufficient appearance details to define the desired \(\mathbf {z}^{gen}_{i\rightarrow j}\), making it extremely challenging to generate realistic-looking pedestrians from low-quality appearances. To properly implement \(f_{STDA}\), we decompose the pedestrian generation task into two sub-tasks, i.e. shape-guided deformation and environment adaptation. The first task focuses on varying the appearances to enrich data diversity, and the second task mainly adapts and blends the deformed pedestrians into different environments. More specifically, we first deform the pedestrian image \(\mathbf {z}_i\) into a new one with a similar appearance but a different shape \(\mathbf {s}_j\). We define this deformation according to the transformation from \(\mathbf {s}_i\) into \(\mathbf {s}_j\). Then, we adapt the deformed pedestrian image to a background environment on the image I. Denote by \(f_{SD}\) the function that implements the shape-guided deformation, and by \(f_{EA}\) the function that implements the environment adaptation. The proposed framework implements \(f_{STDA}\) as follows:

$$\begin{aligned} f_{STDA}(\mathbf {z}_i, \mathbf {s}_i, \mathbf {s}_j, I) = f_{EA}(f_{SD}(\mathbf {z}_i, \mathbf {s}_i, \mathbf {s}_j),I). \end{aligned}$$
(2)

Figure 3 shows the detailed architecture of the proposed framework. As illustrated in the figure, we introduce a shape-guided warping field, denoted as \(\mathbf {V}_{i\rightarrow j}\), to help implement the shape-guided deformation function. The warping field is formulated as an assignment of vectors on the image plane for warping between shapes. With the help of \(\mathbf {V}_{i\rightarrow j}\), the deformation between different shapes can guide the deformation of the appearances of real pedestrians. We also propose an environment-aware blending map to achieve environment adaptation. We define the blending map as a set of weighting parameters that fuse foreground pixel values with background pixel values. We use \(\alpha (x,y)\) to represent the entry of the blending map located at position (x, y). After adapting the shape-deformed pedestrian to the background environment, we obtain diversified and more realistic-looking pedestrians to augment pedestrian detection datasets. In practice, we employ a single end-to-end U-Net (Ronneberger et al. 2015) to fulfill both sub-tasks in a single pass. The employed network takes as input the pedestrian patch \(\mathbf {z}_i\), its shape \(\mathbf {s}_i\), the target shape \(\mathbf {s}_j\), and a background patch from I, and then predicts both \(\mathbf {V}_{i\rightarrow j}\) and \(\alpha (x,y)\). Although it is more intuitive to learn the shape-guided warping field and the environment-aware blending map separately, we find in practice that the U-Net is able to learn both tasks jointly, and the effects of learning jointly and separately are the same. Learning jointly greatly simplifies the processing framework and saves computational resources and parameters. Therefore, we fuse the learning of both functions by feeding all the necessary input information to the U-Net at the same time.
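To make the joint prediction concrete, the following PyTorch sketch (all module and tensor names are illustrative, not taken from the paper) shows one way a single forward pass of the employed network could produce both \(\mathbf {V}_{i\rightarrow j}\) and \(\alpha (x,y)\) from the concatenated inputs; the squashing of \(\alpha \) into [0.8, 1.2] follows Sect. 3.2.2.

```python
import torch
import torch.nn as nn

class STDAHead(nn.Module):
    """Sketch: a backbone (e.g. a U-Net) maps the concatenated inputs to
    three output channels, two for the warping field and one for alpha."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # any module mapping (B, 8, H, W) -> (B, 3, H, W)

    def forward(self, z_i, s_i, s_j, bg):
        # z_i, bg: (B, 3, H, W) pedestrian / background patches;
        # s_i, s_j: (B, 1, H, W) input and target shapes (binary masks).
        x = torch.cat([z_i, s_i, s_j, bg], dim=1)    # (B, 8, H, W)
        out = self.backbone(x)                       # (B, 3, H, W)
        warp_field = out[:, :2]                      # per-pixel 2D offsets V_{i->j}
        alpha = 1.0 + 0.2 * torch.tanh(out[:, 2:3])  # blending map in [0.8, 1.2]
        return warp_field, alpha

# usage with a placeholder backbone standing in for the 8-block U-Net:
head = STDAHead(nn.Conv2d(8, 3, kernel_size=3, padding=1))
```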

3.2.1 Shape-guided Deformation

Fig. 4

A detailed example of the shape-guided warping field \(\mathbf {V}_{i\rightarrow j}\) that deforms the shape \(\mathbf {s}_i\) (colored in blue) into the shape \(\mathbf {s}_j\) (colored in purple). (x, y) represents the 2D coordinates of a pixel on the image plane. \(\mathbf {v}_{i\rightarrow j}(x, y)\) represents a vector that describes the 2D deformation offsets. Best viewed in color

In this study, we implement deformation according to warping operations. To obtain a detailed description of the warping operations, we introduce the shape-guided warping field to help deform pedestrians. Denote by \(\mathbf {v}_{i\rightarrow j}(x, y)\) the warping vector located at (x, y) that helps warp the shape \(\mathbf {s}_i\) into the shape \(\mathbf {s}_j\). The set of these warping vectors, i.e. \(\mathbf {V}_{i\rightarrow j} = \{\mathbf {v}_{i\rightarrow j}(x, y)\}\), then forms the shape-guided warping field. An example of this warping field can be found in Fig. 4, where the warping field helps deform \(\mathbf {s}_i\) (colored in blue) into \(\mathbf {s}_j\) (colored in purple). Then, supposing \(f_{warp}\) is the function that warps the input image patch according to the predicted warping field, we implement \(f_{SD}\) by:

$$\begin{aligned} \mathbf {z}^w_{i\rightarrow j} = f_{SD}(\mathbf {z}_i, \mathbf {s}_i, \mathbf {s}_j) = f_{warp}(\mathbf {z}_i; \mathbf {V}_{i\rightarrow j}), \end{aligned}$$
(3)

where \(\mathbf {z}^w_{i\rightarrow j}\) is the pedestrian \(\mathbf {z}_i\) warped according to the shape \(\mathbf {s}_j\). In practice, each warping vector \(\mathbf {v}_{i\rightarrow j}(x, y)\) is a 2D vector containing the horizontal and vertical displacements between the mapped warping point and the original point located at (x, y). Thus, we can make the employed network directly predict \(\mathbf {V}_{i\rightarrow j}\). In addition, we implement \(f_{warp}\) with bilinear interpolation, since bilinear interpolation can properly back-propagate gradients from \(\mathbf {z}^w_{i\rightarrow j}\) to \(\mathbf {V}_{i\rightarrow j}\), aiding the training effectively. For more details on using bilinear interpolation for warping and training, we refer readers to (Jaderberg et al. 2015; Dai et al. 2017).
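As one possible realization of \(f_{warp}\) (a sketch under our own naming, not necessarily the authors' implementation), the bilinear warping can be written with PyTorch's differentiable grid_sample, which back-propagates gradients from the warped output to the predicted offsets.

```python
import torch
import torch.nn.functional as F

def f_warp(patch: torch.Tensor, warp_field: torch.Tensor) -> torch.Tensor:
    """Warp `patch` (B, C, H, W) by per-pixel 2D offsets `warp_field` (B, 2, H, W).

    Offsets are in pixels; grid_sample expects sampling locations normalized
    to [-1, 1], so the identity grid plus the offsets is rescaled accordingly.
    """
    b, _, h, w = patch.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=patch.dtype, device=patch.device),
        torch.arange(w, dtype=patch.dtype, device=patch.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + warp_field[:, 0]      # horizontal displacement
    grid_y = ys.unsqueeze(0) + warp_field[:, 1]      # vertical displacement
    grid_x = 2.0 * grid_x / max(w - 1, 1) - 1.0      # normalize to [-1, 1]
    grid_y = 2.0 * grid_y / max(h - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)     # (B, H, W, 2), (x, y) order
    return F.grid_sample(patch, grid, mode="bilinear", align_corners=True)
```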

To make the shape-guided warping field adequately describe the deformation between shapes, we require that the estimated warping field warps the shape \(\mathbf {s}_i\) into the shape \(\mathbf {s}_j\). Suppose \(\mathbf {s}^w_{i\rightarrow j}\) is the shape \(\mathbf {s}_{i}\) warped according to \(\mathbf {V}_{i\rightarrow j}\): \(\mathbf {s}^w_{i\rightarrow j} = f_{warp} ( \mathbf {s}_{i};\mathbf {V}_{i\rightarrow j})\). Then, the desired warping field \(\mathbf {V}_{i\rightarrow j}\) should make \(\mathbf {s}^w_{i\rightarrow j}\) as close to \(\mathbf {s}_j\) as possible. Since \(\mathbf {s}_i\) and \(\mathbf {s}_j\) can be easily obtained from the pedestrian datasets, we are able to access sufficient pixel-level supervision to train the employed network. We mainly apply the \(L_1\) distance, \(||\mathbf {s}_j - \mathbf {s}^w_{i\rightarrow j}||_1\), to measure the distance between \(\mathbf {s}_j\) and \(\mathbf {s}^w_{i\rightarrow j}\). This \(L_1\) distance can then be used as the training loss for the network to learn the desired warping field, which helps generate shape-transformed natural pedestrians based on Eq. 3.

Shape Constraining Operation: In practice, we observe that if the target shape \(\mathbf {s}_j\) differs too much from \(\mathbf {s}_i\), the obtained warping field may distort the input pedestrians after warping, producing unnatural results that could degrade the augmentation performance. To avoid this, we apply a shape constraining operation to the target shape.

Fig. 5

An example of the shape constraining operation. In particular, we combine the shape \(\mathbf {s}_j\) with \(\mathbf {s}_i\) based on a weighting function \(\gamma (y)\). \(c^j_y\) and \(l^j_y\) respectively denote the middle point and the width of the foreground areas on the horizontal line with vertical offset y on the shape \(\mathbf {s}_j\). \(c^i_y\) and \(l^i_y\) are the corresponding middle point and width of the foreground areas on the shape \(\mathbf {s}_i\). \(\gamma (y)\) is a linear function of y that controls the combination of \(\mathbf {s}_i\) and \(\mathbf {s}_j\). Best viewed in color

More specifically, we define the shape constraining operation as constraining the target shape \(\mathbf {s}_j\) by combining it with the input shape \(\mathbf {s}_i\) according to a weighting function. The combination is defined on the middle point and the width of the foreground areas in each horizontal line of \(\mathbf {s}_j\). Suppose y is the vertical offset of a horizontal line on \(\mathbf {s}_j\). We denote by \(c^j_y\) and \(l^j_y\) the middle point and the width of the foreground areas on line y, respectively. Similarly, \(c^i_y\) and \(l^i_y\) denote the middle point and width of the foreground areas of \(\mathbf {s}_i\) at line y. We then define the shape constraining operation as:

$$\begin{aligned} \left\{ \begin{array}{ll} {c^j_y}' &{}= \gamma (y) ~ c^j_y + (1-\gamma (y)) ~ c^i_y, \\ {l^j_y}' &{}= \gamma (y) ~l^j_y + (1-\gamma (y))~ l^i_y, \end{array} \right. \end{aligned}$$
(4)

where \({c^j_y}'\) and \({l^j_y}'\) represent the center and width on the constrained mask, and \(\gamma (y)\) is the weighting function w.r.t. y that controls the strictness of the constraint. According to Eq. 4, a smaller weight \(\gamma (y)\) makes the target shape contribute less to the combination result and vice versa. We set \(\gamma (y)\) to different values for different parts of a body. In particular, we define \(\gamma (y)\) as a linear function that increases from 0 to 1 as y varies from the top to the bottom. Therefore, when y becomes larger, \(\gamma \) becomes larger accordingly. This allows more transformation for the lower parts of a pedestrian body, whose vertical offsets y are large.

We formulate the shape constraining operation according to Eq. 4 because we hypothesize that varying the lower body of a natural pedestrian is more acceptable than varying the upper body. In particular, we find that changing the upper body of a pedestrian too much would generally require changing the viewpoint of that pedestrian (e.g. from the side view to the front view) to obtain a natural appearance. However, warping operations do not generate new image content to render the pedestrian from a different viewpoint.

Figure 5 shows a visual example of the introduced shape constraining operation when constraining \(\mathbf {s}_j\) according to Eq. 4. We can observe that the proposed shape constraining operation adequately constrains the \(\mathbf {s}_j\) by making the output shape closer to \(\mathbf {s}_i\) for the upper body of the pedestrian and closer to \(\mathbf {s}_j\) for the lower body.
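For illustration, a minimal NumPy sketch of Eq. 4 on binary masks is given below; it assumes, for simplicity, that the foreground pixels in each row form a single run, and the function and variable names are ours.

```python
import numpy as np

def constrain_shape(s_i: np.ndarray, s_j: np.ndarray) -> np.ndarray:
    """Combine the target mask s_j with the input mask s_i row by row (Eq. 4).

    gamma(y) grows linearly from 0 at the top row to 1 at the bottom row, so
    the upper body stays close to s_i while the lower body follows s_j.
    Both masks are (H, W) binary arrays; rows without foreground are skipped.
    """
    h, w = s_i.shape
    out = np.zeros_like(s_i)
    for y in range(h):
        gamma = y / max(h - 1, 1)
        xi = np.flatnonzero(s_i[y])
        xj = np.flatnonzero(s_j[y])
        if xi.size == 0 or xj.size == 0:
            continue
        c_i, l_i = xi.mean(), xi.size              # middle point and width on s_i
        c_j, l_j = xj.mean(), xj.size              # middle point and width on s_j
        c = gamma * c_j + (1.0 - gamma) * c_i      # constrained middle point
        l = gamma * l_j + (1.0 - gamma) * l_i      # constrained width
        lo, hi = int(round(c - l / 2.0)), int(round(c + l / 2.0))
        out[y, max(lo, 0):min(hi, w)] = 1
    return out
```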

3.2.2 Environment Adaptation

After the shape-guided deformation, we place the deformed pedestrians into the image I to fulfill the augmentation. However, directly pasting deformed pedestrians can produce significant appearance mismatch due to issues like discontinuities in illumination and imperfect shapes predicted by the mask extractor. To refine the generated pedestrians according to their environments, we further perform environment adaptation.

To properly blend a shape-deformed pedestrian into the image I by considering surrounding environments, we introduce an environment-aware blending map to help refine the deformed pedestrians. We formulate this refinement procedure as follows:

$$\begin{aligned} \mathbf {z}^{gen}_{i\rightarrow j} = f_{EA}(\mathbf {z}^w_{i\rightarrow j} , I) = \{\mathbf {z}^{a}_{i\rightarrow j}(x,y)\}, \end{aligned}$$
(5)

where \(\mathbf {z}^{a}_{i\rightarrow j}(x,y)\) is the environment adaptation result located at (x, y):

$$\begin{aligned} \mathbf {z}^{a}_{i\rightarrow j}(x,y)= & {} \Big (\mathbf {s}_j(x,y) \cdot \mathbf {\alpha }(x,y)\Big ) \cdot \mathbf {z}^w_{i\rightarrow j}(x,y) \nonumber \\&+ \Big (1-\mathbf {s}_j(x,y) \cdot \mathbf {\alpha }(x,y)\Big ) \cdot I(x,y), \end{aligned}$$
(6)

where \(\mathbf {\alpha }(x,y)\) is the entry of the environment-aware blending map located at (x, y). This refinement procedure thus specifies that each output pixel \(\mathbf {z}^{a}_{i\rightarrow j}(x,y)\) is a weighted combination of a pixel \(\mathbf {z}^w_{i\rightarrow j}(x,y)\) from the shape-deformed pedestrian patch and a pixel I(x, y) from the original image. The combination weight is computed as \(\mathbf {s}_j(x,y) \cdot \mathbf {\alpha }(x,y)\), where \(\mathbf {s}_j(x,y)\) is 1 for foreground areas and 0 for background areas. An example of the estimated \(\mathbf {\alpha }(x,y)\) can be found in Fig. 3.

In practice, it is difficult to define the desired refinement result and the desired environment-aware blending map. Therefore, we cannot access appropriate supervision information to train the employed network for environment adaptation. Without such supervision, we apply an adversarial loss to facilitate the employed network in learning to blend the deformed pedestrians into the environments effectively. Similar to the shape-guided warping field, we make the employed network directly predict the environment-aware blending map. Note that we constrain the environment-aware blending map to prevent it from changing the appearance of the deformed pedestrians too much. In particular, we adopt a shifted and rescaled tanh squashing function to make the values of \(\alpha (x,y)\) lie in the range of 0.8 to 1.2.
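Eq. 6, together with the tanh squashing just described, reduces to a per-pixel convex combination of the deformed pedestrian and the background; a compact sketch with illustrative tensor names follows.

```python
import torch

def f_ea(z_warp, s_j, alpha_raw, bg):
    """Blend a shape-deformed pedestrian patch into a background patch (Eq. 6).

    z_warp, bg : (B, 3, H, W) deformed pedestrian and background patches.
    s_j        : (B, 1, H, W) target shape, 1 on foreground, 0 on background.
    alpha_raw  : (B, 1, H, W) raw network output for the blending map.
    """
    alpha = 1.0 + 0.2 * torch.tanh(alpha_raw)   # constrain alpha to [0.8, 1.2]
    weight = s_j * alpha                        # per-pixel foreground weight
    return weight * z_warp + (1.0 - weight) * bg
```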

3.3 Objectives

Since we employ a single network to predict both the shape-guided warping field and the environment-aware blending map, we can unify the objectives for training.

First, to obtain a proper shape-guided warping field, we introduce a shape deformation loss and a cyclic reconstruction loss. The shape deformation loss ensures that the predicted warping field satisfies the constraint described in Sect. 3.2.1. The cyclic reconstruction loss ensures that the deformed shape and pedestrian can be deformed back to the input shape and pedestrian. We define the shape deformation loss \(\mathcal {L}_{shape}\) for a pair of samples \(i \rightarrow j\) as follows:

$$\begin{aligned} \mathcal {L}_{shape} = \mathbb {E}[||\mathbf {s}_j - \mathbf {s}^{w}_{i\rightarrow j}||_1 ], \end{aligned}$$
(7)

and the cyclic loss is defined as follows:

$$\begin{aligned} \mathcal {L}_{cyc} = \mathbb {E}[ ||\mathbf {s}_i - \mathbf {s}^{w}_{j\rightarrow i}||_1 + ||\mathbf {z}_i - \mathbf {z}^{w}_{j\rightarrow i}||_1], \end{aligned}$$
(8)

where \(\mathbf {s}^w_{j\rightarrow i}\) is the deformation result of \(\mathbf {s}^w_{i\rightarrow j}\) according to \(\mathbf {s}_i\), and \(\mathbf {z}^w_{j\rightarrow i}\) is the deformation result of \(\mathbf {z}^w_{i\rightarrow j}\) using the same warping field used for computing \(\mathbf {s}^w_{j\rightarrow i}\). As a result, Eq. 7 describes the \(L_1\)-based shape deformation loss, and Eq. 8 forms the cyclic reconstruction loss.
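A sketch of how Eqs. 7 and 8 can be computed as plain \(L_1\) losses; the names are ours, and f_warp stands for any differentiable warping function, e.g. the grid_sample sketch above.

```python
import torch.nn.functional as F

def shape_and_cyclic_losses(s_i, s_j, z_i, s_w_ij, z_w_ij, warp_field_ji, f_warp):
    """L_shape (Eq. 7) and L_cyc (Eq. 8).

    s_w_ij, z_w_ij : mask and pedestrian already warped from i to j.
    warp_field_ji  : warping field predicted for the reverse direction j -> i;
                     it is applied to the i -> j results to reconstruct the inputs.
    """
    l_shape = F.l1_loss(s_w_ij, s_j)
    s_w_ji = f_warp(s_w_ij, warp_field_ji)   # deform the warped mask back towards s_i
    z_w_ji = f_warp(z_w_ij, warp_field_ji)   # deform the warped pedestrian back towards z_i
    l_cyc = F.l1_loss(s_w_ji, s_i) + F.l1_loss(z_w_ji, z_i)
    return l_shape, l_cyc
```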

In addition, an adversarial loss, denoted as \(\mathcal {L}_{adv}\), is included to ensure that the shape-guided deformation and environment adaptation produce more realistic-looking pedestrian patches. As in typical GANs, the adversarial loss is computed by introducing a discriminator D for the employed network:

$$\begin{aligned} \mathcal {L}_{adv}= \mathbb {E}[\log D(\mathbf {z})] +\mathbb {E}[\log \left( 1-D(\mathbf {z}^{gen}_{i\rightarrow j})\right) ], \end{aligned}$$
(9)

where \(\mathbf {z}\) refers to any real pedestrian in the dataset.

Hard Positive Mining Loss: Since our final goal is to improve the detection performance, we further apply a hard positive mining loss to magnify the benefits of the transformed pedestrians for improving detection robustness. Inspired by the study of hard positive generation (Wang et al. 2017), we attempt to generate pedestrians that are not easy for an RCNN detector (Girshick et al. 2014) to recognize. Unlike the study (Wang et al. 2017), which additionally introduced an occlusion mask and spatial transformation operations to generate hard positives, we only introduce a loss function that helps the employed network learn to produce harder positives for the RCNN detector. To compute this loss, we additionally train an RCNN, denoted as R, to distinguish pedestrian patches from background patches that contain no pedestrians. Let \(\mathcal {L}_{hpm}\) denote the hard positive mining loss; then we have:

$$\begin{aligned} \mathcal {L}_{hpm}= & {} \mathbb {E}[\log (1-R(\mathbf {z}^{gen}_{i\rightarrow j}))] + \mathbb {E}[\log R(\mathbf {z})] \nonumber \\&+\mathbb {E}[\log (1-R(\mathbf {b}))], \end{aligned}$$
(10)

where \(\mathbf {b}\) refers to background image patches in the dataset. Although hard mining is a well-developed technique, the contribution of \(\mathcal {L}_{hpm}\) is to facilitate the synthesis of pedestrian examples that are more difficult to detect but more beneficial for training, which differs from common hard mining approaches.

The major difference between \(\mathcal {L}_{hpm}\) and \(\mathcal {L}_{adv}\) is that R distinguishes between pedestrian patches and background patches, while D in \(\mathcal {L}_{adv}\) distinguishes between true pedestrian patches and shape-transformed pedestrian patches.

Overall Loss. To sum up, the overall training objective \(\mathcal {L}\) of the network employed to help implement the proposed framework can be written as follows:

$$\begin{aligned} \mathcal {L} = \omega _1 \mathcal {L}_{shape} + \omega _2\mathcal {L}_{cyc} + \omega _3\mathcal {L}_{adv} + \omega _4\mathcal {L}_{hpm}, \end{aligned}$$
(11)

where \(\omega _1\), \(\omega _2\), \(\omega _3\), \(\omega _4\) are the corresponding loss weights. In general, we borrow the setting from the implementation of pix2pixGAN (Footnote 1) and set \(\omega _1\) and \(\omega _3\) to 100 and 1, respectively. Since we find in our experiments that the network can hardly learn a proper shape-guided warping field if \(\omega _2\) is too large, we empirically set \(\omega _2\) to a small value, i.e. 0.5, in this study. Similarly, we also set \(\omega _4\) to 0.5 to make the hard positive mining loss contribute less to the overall objective. In practice, the network is obtained by minimizing the overall loss \(\mathcal {L}\), the discriminator D is obtained by maximizing \(\mathcal {L}_{adv}\), and R is obtained by maximizing \(\mathcal {L}_{hpm}\).
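For reference, the generator-side combination of Eq. 11 with the weights stated above can be sketched as follows; the terms of Eqs. 9 and 10 that do not depend on the generated patch are constant for the U-Net update and are therefore omitted, and d_fake and r_fake are assumed to be probability outputs of D and R for the generated patch.

```python
import torch

def overall_generator_loss(l_shape, l_cyc, d_fake, r_fake,
                           w1=100.0, w2=0.5, w3=1.0, w4=0.5, eps=1e-6):
    """Eq. 11 as minimized by the U-Net (generator).

    d_fake : D(z_gen) in (0, 1) for the generated patch (Eq. 9).
    r_fake : R(z_gen) in (0, 1) for the generated patch (Eq. 10).
    """
    l_adv = torch.log(1.0 - d_fake + eps).mean()
    l_hpm = torch.log(1.0 - r_fake + eps).mean()
    return w1 * l_shape + w2 * l_cyc + w3 * l_adv + w4 * l_hpm
```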

3.4 Dataset Augmentation

When augmenting pedestrian datasets with the proposed framework, we attempt to sample natural locations and sizes at which to place the transformed pedestrians in the image. Fortunately, pedestrian datasets deliver sufficient knowledge, encoded within the bounding box annotations, to define the geometric statistics of a natural pedestrian. For example, in the Caltech dataset (Dollár et al. 2009; Dollar et al. 2012), the aspect ratio of a pedestrian is usually around 0.41. In addition, it is also possible to describe the bottom edge \(y^{box}\) and the height \(h^{box}\) of an annotated bounding box for a pedestrian using a linear model (Park et al. 2010): \(h^{box} = k y^{box} + b\), where k and b are coefficients. In the Caltech dataset, whose images are 480 by 640, k and b are found to be around 1.15 and -194.24. For each image to be augmented, we sample several locations and sizes according to this linear model. To avoid sampling patches with inappropriate background, we constrain the sampled boxes to not differ much from the neighboring boxes of true pedestrians. For example, we tend to sample locations around true pedestrians (within 100 pixels), and we constrain the difference between the height of a sampled patch and the height of its nearest true pedestrian to be within 20 pixels. Then, for each sampled location and size, we run the proposed framework and place the transformation result into the image. Algorithm 1 describes the detailed pipeline of applying the proposed framework to augment pedestrian datasets. Algorithm 2 describes in detail how we sample a location and a size in an image, which reduces the risk of introducing inappropriate background by sampling around true pedestrians.

Algorithm 1 (pipeline for augmenting a pedestrian dataset with the proposed STDA framework)
Algorithm 2 (sampling a location and a size for a synthesized pedestrian)
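A simplified sketch of the sampling step (Algorithm 2), using the numbers quoted above for 480 × 640 Caltech frames (k ≈ 1.15, b ≈ −194.24, aspect ratio ≈ 0.41, the 100-pixel neighbourhood and the 20-pixel height tolerance); the retry loop and the box format are our own assumptions.

```python
import random

def sample_box(gt_boxes, img_w=640, img_h=480, k=1.15, b=-194.24,
               aspect_ratio=0.41, max_tries=50):
    """Sample one (x1, y1, x2, y2) box at which to place a synthesized pedestrian.

    gt_boxes: list of true-pedestrian boxes (x1, y1, x2, y2) in the frame.
    The bottom edge determines the height via h = k * y_bottom + b, and the
    width follows the typical aspect ratio; samples far from true pedestrians
    or with a very different height are rejected.
    """
    if not gt_boxes:
        return None
    for _ in range(max_tries):
        ref = random.choice(gt_boxes)                    # anchor on a true pedestrian
        cx = (ref[0] + ref[2]) / 2.0 + random.uniform(-100, 100)
        y_bottom = ref[3] + random.uniform(-100, 100)
        h = k * y_bottom + b
        if h <= 0 or abs(h - (ref[3] - ref[1])) > 20:    # keep heights similar
            continue
        w = aspect_ratio * h
        x1, y1, x2, y2 = cx - w / 2.0, y_bottom - h, cx + w / 2.0, y_bottom
        if x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h:
            return (x1, y1, x2, y2)
    return None
```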

4 Experiments

We perform a comprehensive evaluation of the proposed STDA framework for augmenting pedestrian datasets. We use the popular Caltech (Dollár et al. 2009; Dollar et al. 2012) and CityPersons (Zhang et al. 2017) benchmarks for the evaluation.

In this section, we first present the overall dataset augmentation results on the evaluated datasets. Then, we validate the improvements in detection accuracy obtained by applying our proposed STDA framework to augment different datasets, comparing against other cutting-edge pedestrian detectors. Finally, we perform detailed ablation studies on the STDA framework to analyze the effects of its different components on generating more realistic-looking pedestrians and on improving detection accuracy.

4.1 Settings and Implementation Details

For evaluation, we consider the log-average miss rate (MR) against different false positive rates as the major metric representing pedestrian detection performance. On Caltech, we follow the protocol of (Zhang et al. 2016b) and use around 42k images for training and 4024 images for testing. On CityPersons, as suggested in the original study, we use 2975 images for training and perform the evaluation on the 500 images of the validation set. We apply a Mask RCNN to extract shapes on Caltech and use the annotated pedestrian masks on CityPersons. To augment the datasets, for each frame we transform n pedestrians using our framework, where n is uniformly sampled from \(\{1, 2, 3, 4, 5\}\). Thus, each image has its number of positive pedestrians increased by 1 \(\sim \) 5.

Fig. 6

Detailed structure of the employed U-Net. Each blue rectangle represents a convolution. For each rectangle, the numbers on the side represent the resolution and the number on top represents the channel number. Best viewed in color

For the network employed to implement the framework, we use a U-Net architecture with 8 blocks. All input and output patches have a size of 256 \(\times \) 256. Figure 6 shows the detailed structure of the employed U-Net. Both D as introduced in Eq. 9 and R as introduced in Eq. 10 are CNNs with 3 convolutional blocks. During optimization, we reduce the update frequency of D and R to stabilize training, i.e. we update D and R once at every 40-th update of the U-Net. The learning rate is set to \(1e-5\) and we train for 80 epochs on a dataset.
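The optimization schedule described above (learning rate 1e-5, 80 epochs, with D and R updated once every 40 U-Net updates) could be organized as in the following skeleton; the optimizer choice and the three loss callables are placeholders rather than details from the paper.

```python
import torch

def train_schedule(unet, disc, rcnn, loader, g_loss_fn, d_loss_fn, r_loss_fn,
                   epochs=80, lr=1e-5, d_every=40):
    """Skeleton: the U-Net is updated every step, while D and R are updated
    once every `d_every` U-Net updates to stabilize training."""
    opt_g = torch.optim.Adam(unet.parameters(), lr=lr)
    opt_d = torch.optim.Adam(disc.parameters(), lr=lr)
    opt_r = torch.optim.Adam(rcnn.parameters(), lr=lr)
    step = 0
    for _ in range(epochs):
        for batch in loader:
            loss_g = g_loss_fn(unet, disc, rcnn, batch)   # minimize Eq. 11
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()
            step += 1
            if step % d_every == 0:
                loss_d = d_loss_fn(unet, disc, batch)     # maximize Eq. 9
                opt_d.zero_grad()
                (-loss_d).backward()
                opt_d.step()
                loss_r = r_loss_fn(unet, rcnn, batch)     # maximize Eq. 10
                opt_r.zero_grad()
                (-loss_r).backward()
                opt_r.step()
```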

We adopt a ResNet50-based FPN detector (Lin et al. 2017) as our major baseline detector. When training this detector, we modify some default parameters for the pedestrian detection task. First, for the region proposal network in FPN, we follow (Zhang et al. 2016b) and only use anchors with an aspect ratio of 2.44. We discard the 512 \(\times \) 512 anchors in FPN because they do not contribute much to the performance. In addition, we set the batch size to 512 for both the Region Proposal Network (RPN) and the Region-based CNN (RCNN) in FPN. To reduce the false positive rates of FPN, we further set the foreground thresholds of the RPN and RCNN to 0.5 and 0.7, respectively. During training, we set the length of the shorter side of input images to 720 for Caltech and to 1024 for CityPersons. Both the FPN baseline and the FPN trained with our method are pre-trained on the MS COCO (Lin et al. 2014) dataset to gain proper prior knowledge about people. We train the FPN detector for 3 epochs on Caltech and for 6 epochs on CityPersons. The final performance of the baseline detector is a \(10.4\%\) mean miss rate on the Caltech test set and a \(13.9\%\) mean miss rate on the CityPersons validation set. Note that we weight the loss values for synthesized pedestrians by a factor of 0.1, reducing potential biases towards generated pedestrians rather than real pedestrians.
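One simple way to realize the 0.1 loss weighting for synthesized pedestrians (a sketch; the detector's actual loss plumbing is not specified here) is to scale each sampled RoI's classification loss by a flag marking whether its matched ground truth is a generated pedestrian.

```python
import torch
import torch.nn.functional as F

def weighted_cls_loss(logits, labels, is_synth, synth_weight=0.1):
    """Per-RoI cross-entropy that down-weights synthesized pedestrians.

    logits   : (N, num_classes) classification scores of sampled RoIs.
    labels   : (N,) ground-truth class indices.
    is_synth : (N,) bool tensor, True for RoIs matched to generated pedestrians.
    """
    per_roi = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(is_synth,
                          torch.full_like(per_roi, synth_weight),
                          torch.ones_like(per_roi))
    return (per_roi * weights).mean()
```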

For the hyperparameters introduced in this study, such as the loss weights of the cyclic loss and the hard positive mining loss, we mainly select them by performing a grid search according to the quality of the generated pedestrians and the performance of the improved detectors.

In addition to FPN, we further adopt MS CNN (Cai et al. 2016) as another baseline to evaluate our method more comprehensively. We use the released source code to implement MS CNN and apply a similar loss weight for synthesized pedestrians when training MS CNN with the proposed STDA.

Fig. 7

Dataset augmentation results of the proposed STDA on images from Caltech (top 2 rows) and CityPersons (bottom 3 rows), respectively. Light green bounding boxes indicate the synthesized pedestrians. The presented image patches are cropped and zoomed to better illustrate the details. Best viewed in color

4.2 Dataset Augmentation Results

We first present the pedestrian synthesis results of applying our STDA framework to augment pedestrian datasets.

4.2.1 Pedestrian Synthesis Results

In Fig. 7, we illustrate the dataset augmentation results on both the evaluated Caltech and CityPersons datasets. Even though some of the pedestrians are blurry and lack rich appearance details, we can observe that the shape-transformed pedestrians can still be naturally blended into the environments of the image, yielding very realistic-looking pedestrians for dataset augmentation. Furthermore, STDA can also produce pedestrians in uncommon walking areas, such as the middle of the street. This increases the number of irregular foreground examples for pedestrian detection, making the model more robust after augmentation. Moreover, with a geometric arrangement similar to that of real pedestrians, the illustrated results demonstrate that the proposed STDA framework is effective in generating pedestrians in a domain similar to that of real pedestrians. Besides, our method can produce occlusion cases, e.g. by overlapping generated pedestrians with real pedestrians, which can promisingly increase the number of occlusion cases for training and thus improve detection robustness to occlusions.

In addition, we also compare our method with another recently published powerful GAN-based data rendering technique, PS-GAN (Ouyang et al. 2018b), using the same background patches. We implement PS-GAN with the code released by its authors and follow the original training scripts to train the model. However, the original PS-GAN release does not include training datasets. For a fair comparison, we modified its training scripts to use the same training data as our method. As mentioned above, there are plenty of very low-quality pedestrians in our training data. Furthermore, since we discarded irregular shapes according to the predicted confidence scores of the Mask RCNN, the number of pedestrians available for training is relatively small. The training schemes for our method and PS-GAN are kept the same. Figure 8a shows some pedestrian synthesis results using existing GANs. We find that the compared GAN-based method produces very blurry pedestrians, and the generated backgrounds can also be unnatural and distorted. There are a few reasons why the compared PS-GAN performs badly. First, since PS-GAN generates pedestrians without conditioning on the quality of training examples, the mix of high-quality and very low-quality data used here could confuse PS-GAN during training and degrade the quality of the generated pedestrians. Also, since the number of pedestrians used for training is relatively small, it is difficult to train PS-GAN thoroughly. On the contrary, as shown in Fig. 8b, our proposed STDA framework can effectively generate much more realistic and natural-looking pedestrians in different background patches. Our method also achieves a significantly lower score than PS-GAN, meaning that the STDA-generated pedestrians are much more similar to the true data. This illustrates its superiority over GAN-based data rendering methods.

Fig. 8

Pedestrian synthesis results of STDA compared to another cutting-edge GAN-based data generation method. Synthesized pedestrians are in the middle of each image patch and background patches are kept the same. Best viewed in color

4.2.2 Improvements for Pedestrian Detection

Fig. 9

Effects of the proposed STDA framework for augmenting the Caltech pedestrian dataset, compared to other cutting-edge pedestrian detectors. “+” means multi-scale testing

Fig. 10

Effects of the proposed STDA framework for augmenting the Caltech pedestrian dataset, compared to other pedestrian synthesis methods such as random pasting, PoseGen (Ma et al. 2017), and PS-GAN (Ouyang et al. 2018b). We also compare our partial method, “STDA w-o SD”, which only blends pedestrians into environments without shape deformation, for an ablation study

Caltech: To evaluate the augmentation results of our proposed STDA framework, we first perform the evaluation on the test set of the Caltech benchmark. We evaluate the performance gains with respect to the baseline detector to demonstrate the effectiveness. Figure 9 shows the detailed performance of our method compared to other cutting-edge methods (Ouyang et al. 2018a; Zhang et al. 2018b; Cai et al. 2016; Li et al. 2018; Zhang et al. 2016b, 2017; Du et al. 2017; Brazil et al. 2017; Lin et al. 2018). In particular, our framework improves the miss rate by around 30\(\%\) over the baseline FPN. By further applying multi-scale testing, we achieve a 38\(\%\) improvement, significantly outperforming other cutting-edge pedestrian detectors. Moreover, our method also delivers a 3-point improvement over another baseline detector, MS CNN. Using multi-scale testing for MS CNN, our method further improves by 3.8 points, obtaining the lowest average miss rate of 6.1\(\%\). This shows that our method can consistently improve different trained detectors.

We also present qualitative results of the FPN detector with and without our method in Fig. 11. With the limited training examples in existing datasets, we find in the figure that the baseline FPN detector produces inaccurate results (first column), false positives (second column), or false negatives (third column). On the contrary, the FPN trained with our method correctly detects pedestrians in the corresponding images, providing more accurate boxes and fewer false predictions. This further demonstrates that our method is effective for improving detection performance by including more diversified training examples.

Fig. 11

Comparison of qualitative results with and without applying our method for dataset augmentation. Red boxes are ground-truths. Green dotted boxes are detection results. Best viewed in color

In Fig. 10, we also compare our framework with several other augmentation methods, including Pasting, which directly pastes real pedestrians at random; PS-GAN (Ouyang et al. 2018b), which generates pedestrian patches based on a pix2pixGAN (Isola et al. 2017) pipeline; and a variant of our method that only blends real pedestrians without shape deformation. We find that the three compared methods can also slightly improve the baseline detector, suggesting that augmenting pedestrian datasets with synthesized pedestrians is useful for improving detection accuracy. However, due to the unnatural pedestrians synthesized from low-quality data, as presented in Fig. 8a, the improvement brought by PS-GAN is very limited. Even randomly pasting real pedestrians delivers a slightly better improvement with low-quality data. Moreover, we also observe that the compared methods have higher false positives per image than the baseline detector at low miss rates, suggesting that the baseline detector may be distracted by unnatural pedestrians to some extent. Compared to the other pedestrian synthesis methods, the performance gain brought by our proposed STDA is much more significant with respect to the baseline detector, confirming that our proposed framework is much more effective in augmenting pedestrian datasets with low-quality pedestrian data. Furthermore, with the more realistic-looking pedestrians synthesized by STDA, the augmented dataset consistently improves the baseline detector at all presented false positives per image. To further validate the idea of deforming the shapes of pedestrians to augment datasets, we perform another ablation study evaluating our method when it only blends real pedestrians into environments without deforming their shapes. The results are also presented in Fig. 10. It shows that dataset augmentation without shape deformation delivers worse performance than our complete method, even though it improves on random pasting. This demonstrates that the shape deformation, which enhances the diversity of synthesized pedestrians, is important for improving detection performance.

Table 1 Performance on occluded (OCC) pedestrians on Caltech test set. Best results are highlighted in bold. “+” means multi-scale testing

Besides the overall performance, we also present performance on specific detection attributes. For example, Table 1 shows the detection accuracy on pedestrians with partial or heavy occlusions. According to the statistics, the proposed STDA effectively reduces the average miss rate of the baseline detector for both partially and heavily occluded pedestrians, achieving favorable performance compared to other cutting-edge pedestrian detectors. This confirms that synthesizing pedestrians with occlusions using our proposed STDA framework can help improve the detection robustness and accuracy for occluded pedestrians in the test set. In addition, we also evaluate the performance of applying STDA to augment Caltech on pedestrians with different aspect ratios in Table 2. In particular, for the detection of pedestrians with “typical” aspect ratios, our proposed framework boosts the performance of the baseline detector by up to 41%. When detecting pedestrians with “a-typical” aspect ratios, our method also promisingly improves the baseline performance, obtaining the best average miss rate among the compared pedestrian detectors. These results demonstrate that our framework can produce richly diversified and beneficial pedestrians for augmentation. Furthermore, Table 3 shows the detection performance on pedestrians at medium or far distances. It shows that our method improves greatly on pedestrians at both distances. Since “far” pedestrians usually have small sizes (e.g. bounding box heights less than 80 pixels), our method proves beneficial for enhancing detectors’ performance on small pedestrians that are originally difficult to detect.

Table 2 Performance on pedestrians with diversified aspect ratios (AR) on Caltech test set. Best results are highlighted in bold. “typical” means the pedestrians with normal aspect ratios; “a-typical” means the pedestrians with unusual aspect ratios; “+” means multi-scale testing
Table 3 Performance on pedestrians with different distances on Caltech test set. Best results are highlighted in bold. “far” means the pedestrians at longer distances; “medium” means the pedestrians at medium distances; “+” means multi-scale testing
Table 4 Performance on the validation set of CityPersons. Best results are highlighted in bold. “+” means multi-scale testing

CityPersons: In this section, we also report the performance on the validation set of CityPersons. The experimental settings are similar to those for the Caltech evaluation except that image sizes are 1024 \(\times \) 2048 for training and testing.

Table 4 presents the detailed statistics of the evaluated methods. We find that our framework effectively augments the original dataset and improves the performance of the baseline FPN detector. Besides, our approach also improves MS CNN promisingly. The MS CNN trained with our approach achieves the best single-model and multi-scale testing results among the compared detectors. By achieving state-of-the-art performance with our proposed framework, we validate that our proposed framework can consistently augment different pedestrian datasets with low-quality pedestrian data.

4.3 Ablation Studies

In this section, we perform a comprehensive component analysis of the proposed STDA framework for both pedestrian generation and pedestrian detection augmentation, using the low-quality pedestrians in the Caltech dataset and the Caltech benchmark for training.

Fig. 12

Visual effects of different components in the proposed STDA framework. “SC” means shape constraining operation; “EBM” means environment-aware blending map; “HPM” means hard positive mining. Best viewed in color

Fig. 13

Visual effects of different sampling strategies to place synthesized pedestrians. True pedestrians are highlighted in red boxes. Sampled pedestrians are highlighted in green boxes. Frames are zoomed for better illustration. Best viewed in color

4.3.1 Qualitative Study

We first evaluate the qualitative effects of different components of the STDA framework on the pedestrian generation task. In particular, we start the experiments from only using the shape-guided deformation supervised by \(\mathcal {L}_{shape}\) for pedestrian generation. Then, we gradually add the shape constraining operation, the cyclic reconstruction loss \(\mathcal {L}_{cyc}\), the adversarial loss \(\mathcal {L}_{adv}\), the environment-aware blending map, and the hard positive mining loss \(\mathcal {L}_{hpm}\) to help generate pedestrians. We present the effects of the different components by generating pedestrians from low-quality real pedestrian data in Fig. 12. According to the presented results, we observe that the quality of the generated pedestrians is progressively improved as more components are introduced, demonstrating the effectiveness of the different components of the STDA framework. More specifically, the shape constraining operation first helps the deformation operation produce less distorted pedestrians. Then, by adding the cyclic loss \(\mathcal {L}_{cyc}\) and the adversarial loss \(\mathcal {L}_{adv}\), the obtained pedestrians become more realistic-looking in their details. Subsequently, the introduced environment-aware blending map helps the transformed pedestrians adapt better to the background image patch. Lastly, \(\mathcal {L}_{hpm}\) slightly changes some appearance characteristics, such as illumination or color, to make the pedestrians less distinguishable from the environments, which further improves the pedestrian generation results.

In addition, we also evaluate the effects of the pedestrian sampling strategy described in Sect. 3.4 and Algorithm 2. The qualitative results are presented in Fig. 13. We compare three different schemes for sampling locations and sizes at which to place synthesized pedestrians: (a) sampling pedestrians in the image purely at random; (b) sampling pedestrians in the image only according to the linear model; (c) sampling pedestrians according to the linear model and true pedestrians as described in Algorithm 2. From the presented results, we find that scheme (a) generates unnatural locations and sizes, placing the synthesized pedestrians in inappropriate background areas. Scheme (b) improves on (a) promisingly, but the sampled sizes are still sub-optimal. Scheme (c) has the best sampling quality and generates more appropriate locations and sizes, significantly reducing the risk of including inappropriate background content.

Lastly, we compare different learning strategies for the proposed network, namely separate learning and joint learning. Joint learning trains the network to predict both the shape-guided warping field and the environment-aware blending map at the same time, while separate learning trains two independent networks to predict the shape-guided warping field and the environment-aware blending map, respectively. Figure 14 shows the synthesis results of the two learning strategies. We find that both learning strategies produce identical synthesis performance, illustrating that the developed network for synthesizing pedestrians is insensitive to the learning scheme.

Fig. 14

Visual effects of different learning strategies for training our developed network. Best viewed in color

Table 5 Effects of different components in the proposed STDA framework on the selected validation set on Caltech dataset. “SC” means shape constraining operation; “EBM” means environment-aware blending map; “HPM” means hard positive mining

4.3.2 Quantitative Study

To perform the ablation studies, we split the training set of Caltech into a smaller training set and a validation set. More specifically, we collect the frames from the first four sets of the training data as training images, while the frames from the last set are used as validation images. We sample every 30-th frame in the overall dataset to set up the training/validation sets. Note that this training/validation split is ONLY used for the ablation study.

Table 5 presents the detailed results. We find that each of the introduced components, including the shape constraining operation (SC), the cyclic loss (\(\mathcal {L}_{cyc}\)), the adversarial loss (\(\mathcal {L}_{adv}\)), the environment-aware blending map (EBM), and the hard positive mining (HPM), contributes a promising reduction in average miss rate. In particular, the cyclic and adversarial losses, which help better deform pedestrians, and the environment-aware blending map, which helps better adapt the deformed pedestrians, both greatly boost the benefits of the synthesized pedestrians for improving detection accuracy. The proposed hard positive mining scheme further improves the detection accuracy, demonstrating its effectiveness in dataset augmentation. Together with the qualitative analysis shown in Fig. 12, we can further conclude that augmenting pedestrian datasets with more realistic-looking pedestrians delivers better improvements in detection accuracy.

Table 6 Effects of using different sampling strategies to place synthesized pedestrians on the selected validation set of the Caltech dataset. Compared strategies: (a) sample pedestrians purely at random; (b) sample pedestrians according to the linear model; (c) sample pedestrians according to the linear model and true pedestrians

We also study the influence of different sampling strategies on the detection performance. Table 6 shows the results. We find that purely random sampling, which may place synthesized pedestrians in inappropriate background areas, offers limited help for the detection performance. Introducing the linear model to sample locations and sizes for synthesized pedestrians improves on random sampling promisingly, indicating that the linear model provides a more reasonable way to place synthesized pedestrians. By further considering true pedestrians, we obtain the best performance, illustrating that using both the linear model and true pedestrians helps avoid unnatural background areas when inserting synthesized pedestrians.

We further study the effects of the different learning strategies, i.e. joint learning and separate learning, on the selected validation set of the Caltech dataset. The joint learning applied in our main implementation achieves a 7.49% log-average miss rate on the validation set. The separate learning achieves a similar result of a 7.51% log-average miss rate. This shows that the synthesis results obtained with separate learning bring nearly the same benefits for improving detection as joint learning.

5 Conclusions

In this study, we present a novel shape transformation-based dataset augmentation framework to improve pedestrian detection. The proposed framework can effectively deform natural pedestrians into different shapes and adequately adapt the deformed pedestrians to various background environments. Using the low-quality pedestrian data available in the datasets, our proposed framework produces much more lifelike pedestrians than other cutting-edge data synthesis techniques. By applying the proposed framework to two different well-known pedestrian benchmarks, i.e. Caltech and CityPersons, we improve the baseline pedestrian detector by a large margin, achieving state-of-the-art performance on both evaluated benchmarks.