1 Introduction

Object detection has been one of the most important tasks in computer vision over the past decade. It aims to predict the existence of objects and to localize them in a given image, which is useful in a wide range of applications, e.g., self-driving vehicles, robotics, and augmented reality. In the earliest stage of object detection, pioneering works relied on hand-crafted features for object representation. Viola et al. used lightweight Haar features and a cascade classifier to efficiently detect human faces [48]. Later, Dalal et al. introduced Histograms of Oriented Gradients (HOG) [5] as effective features for pedestrian detection; HOG was subsequently widely used to detect other objects. Felzenszwalb et al. proposed discriminatively trained part-based models (DPM) [9] to detect objects with deformable parts, e.g., humans in different poses. Van de Sande et al. proposed the Selective Search method [46], which uses segmentation to generate a limited set of candidate locations, permitting more powerful yet expensive bag-of-words features. Wang et al. proposed Regionlets [49], a hierarchical representation of low-level features in local regions that is more robust to generic deformable objects. Recently, advances in deep learning and Convolutional Neural Network (CNN) models, e.g., Region CNN [13], Fast RCNN [12], Faster-RCNN [33], R-FCN [4], SSD [27], and YOLO [32], have significantly boosted the performance of object detection. Modern object detection approaches have shifted from designing features and object representation models to applying different architectures of neural networks. Many works explore different deep network structures to improve performance, such as AlexNet [24], VGGNet [37], GoogLeNet/Inception [39,40,41], ResNet [17, 26], and DenseNet [19]. Convolutional neural networks have become deeper and deeper, with state-of-the-art networks going from 7 layers (AlexNet) to 1000 layers (ResNet). Naturally, deep networks require far more computing resources, i.e., GPU memory, and generally run slower than shallow ones. Our main objective in this paper is not limited to the deep network architecture itself. Instead, we investigate the impact of training data on deep networks. In the literature, several works focus on enriching data, e.g., data augmentation [27, 32, 33] or generating synthetic datasets [10, 16, 35, 47]. However, these works consider the amount of generated data rather than its importance, that is, which objects should be generated more and how to use them effectively.

In psychology research, the term “lucid dreaming” describes the technique of controlling dreams and steering them towards a desired conclusion [22]; a lucid dream is one in which the dreamer is aware that they are dreaming. In this work, we consider lucid data as data synthesized for a specific problem, namely object detection. Our “lucid dreaming” is the process of synthesizing the desired training data to train an object detector. It generates new training data by discovering the previous failure cases of the detector; these failure cases are then synthesized onto many related scenes to strengthen the detector. In particular, we explore intentionally synthesized data for training a deep learning model and propose an effective detection strategy built on it. Our proposed framework is named YADA (short for You Always Dream Again), which captures the idea of using lucid data dreaming and then re-training a deep model for better detection performance. We introduce our novelty in two stages, namely data preparation during pre-training and data residual during post-training. Regarding data preparation, we propose using lucid data dreaming to produce more problem-related training data. For the data residual, we first train a detection model, adopting Faster-RCNN [33] in this work, and then train another Faster-RCNN model to tackle the challenging objects.

The remainder of our paper is organized as follows. Section 2 summarizes the related works. Section 3 introduces the proposed framework. Section 4 presents the experimental results. Finally, the conclusion and future works are given in Section 5.

2 Related works

Recently, advances in deep learning have significantly improved performance in many computer vision problems. For instance, CNNs [25] improve the performance of image recognition, image parsing, and saliency analysis [42,43,44]. The success of CNN models in these tasks has inspired works integrating CNNs into object detection, e.g., Region CNN [13], Fast RCNN [12], Faster-RCNN [33], R-FCN [4], SSD [27], and YOLO [32]. Several works have further boosted the performance of CNN object detectors with advanced detection techniques. Cheng et al. [1, 2] proposed an effective method to train rotation-invariant and Fisher discriminative CNN (RIFD-CNN) models, which improves the performance of R-CNN [13], Fast R-CNN [12], Faster R-CNN [33], and R-FCN [4]. Chu et al. [3] proposed multi-scale adjacent-level feature maps and a cascaded region proposal network to improve both the recall and the accuracy of Fast R-CNN and Faster R-CNN detectors. Zhang et al. [51] presented a novel weakly supervised learning framework that utilizes instance-level and image-level prior knowledge for object detection. SPFTN [50] was then proposed to address weakly supervised video object localization and segmentation. Several approaches have shifted from designing features and classifiers to applying different architectures of neural networks. Accordingly, many networks have been explored, such as AlexNet [24], VGGNet [37], GoogLeNet/Inception [39,40,41], ResNet [17, 26], and DenseNet [19]. A comprehensive review of advanced deep network object detectors is presented by Han et al. [15]. Convolutional neural networks have become deeper and deeper, with state-of-the-art networks going from 7 layers (AlexNet) to 1000 layers (ResNet). Generally, deep networks give better accuracy than shallow ones thanks to their advantage in approximating compositional functions [34, 52], at the cost of requiring much more time for training and testing. Several works have succeeded in designing light-weight networks that achieve competitive accuracy, for instance SqueezeNet [20], Darknet-19 [30], and Darknet-53 [31].

The role of data in training deep networks has also been explored. In a recent paper, Girshick et al. [28] scaled the training process up to 3.5 billion images and 17,000 distinct “labels” with successful results. Labelled data is costly, so many works augment the available labelled data. Basic augmentation techniques such as cropping, flipping, and colour jittering are commonly used (e.g., Faster-RCNN [33], YOLO [32], SSD [27]). Applying such transformations generates more samples per object class and helps the trained detectors recognize objects under minor changes in appearance. Other efforts augment data with GAN models.

From another viewpoint, the amount of supplied data is not the only thing that matters; the quality (or plausibility) of the data also matters. To this end, several works focus on generating realistic synthetic data, in which synthetic objects have plausible appearances and locations without major artefacts. Rendering images from 3D models is a common method for building synthetic datasets, e.g., SYNTHIA [35], SceneNet [16], Virtual KITTI [10], and SURREAL [47]. This approach has also been used to render realistic data for training object detectors. Gupta et al. [14] use 3D CAD object models and render them into scenes, observing a 1-point improvement in mean AP on the NYUD2 dataset compared with training without synthetic data. Taking a closer look, Peng et al. [29] investigate the ability of CNN object detectors to learn from synthetic CAD-rendered images with and without simulating low-level cues such as realistic object texture, pose, or background. Likewise, Tremblay et al. [45] render 3D car models while varying aspects of the scene (e.g., car type, texture, location, camera angle, and lighting). Their experiments on the KITTI dataset show that a Faster R-CNN detector additionally trained on the synthetic data outperforms one using COCO-initialized weights. Johnson-Roberson et al. [21] leverage the rich virtual worlds created for major video games to create synthetic data, capturing images from a video game under different simulated times of day, complex weather, and lighting scenarios.

Rendering-based data are expensive to generate, requiring artists to carefully model specific environments in detail, and are typically limited to a few categories such as cars. Without relying on rich 3D repositories, Dwibedi et al. [6] proposed a cut-and-paste method to generate synthetic images and trained a Faster R-CNN detector using the VGG network. Their method, when combined with real images, improves relative performance by more than 21% on the GMU Kitchen dataset. The success of this approach should encourage similar investigation on popular object detection datasets such as PASCAL VOC and KITTI; however, no successful result has been reported so far. Our hypothesis is that this is due to the unintentional way the synthesized data is generated. Many easy objects are already learned effectively by deep networks, so generating more instances of these objects cannot improve the performance of the system. The synthesizing process should instead focus on the hard and rarely appearing objects. We therefore propose a method to create intentionally synthesized data by adopting hard example mining, as inspired by our previous work [23]. Instead of generating more object instances indiscriminately, we explore the hard and rarely appearing cases in the dataset and then generate synthesized data for these cases. Furthermore, we build a separate synthesized set for these cases and train an additional specific detector, whose detections effectively complement those of the detector trained on real data.

3 Proposed framework

In this section, we introduce the proposed YADA framework in detail. Figure 1 shows an overview of the framework pipeline.

Fig. 1 Our YADA framework pipeline

3.1 Lucid data synthesizer

3.1.1 Similar scene retrieval

In this context, we define “easy” and “hard” objects as objects detected and misdetected by a trained object detector, respectively. “Hard” objects are thus discovered with respect to the baseline detector: we train the baseline detector, then use the trained model to detect objects in the training image set. Objects that the baseline detector misdetects (or assigns a confidence score lower than the detection threshold) are considered “hard” objects. The first step in synthesizing hard object instances is to create a pool of similar images. This step aims to keep the context of the synthesized data close to that of the original images. Given the training set T, for each image Iq in T we use features extracted from the fc7 layer of the ImageNet pre-trained VGG model to retrieve the top k most similar images in the training set, forming its similarity pool. Let P denote this pool of similar images (a ranked list of images). Let Vq and \({V^{P}_{i}}\) denote the feature vectors of the query image Iq and of the image Ii from the pool, respectively. We define the similarity level between Iq and the i-th image of P as the Euclidean distance between their feature vectors: \(s_{i} = \lVert V_{q} - {V^{P}_{i}} \rVert \). The smaller the Euclidean distance, the more similar the two images. Images in P are ranked in ascending order of this distance.

Note that if the value of k is small, we only paste hard objects into a few images that are very similar to the original image, so the method favours realistic synthesized images over a large number of new samples. On the other hand, if k is large, the similarity between the original and destination images decreases, but we can generate more hard samples for the synthesized set. We empirically set this parameter to 100 in our experiments.
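As a minimal sketch of this retrieval step, the snippet below ranks training images by Euclidean distance in fc7 feature space, assuming the fc7 vectors have already been extracted with a pre-trained VGG model; the function and variable names are illustrative and not taken from the original implementation.

```python
import numpy as np

def build_similarity_pools(features, k=100):
    """Rank training images for each query by Euclidean distance in fc7 space.

    features: (N, 4096) array holding one ImageNet pre-trained VGG fc7 vector
              per training image (feature extraction is assumed done elsewhere).
    Returns, for each image index, the indices of its k most similar images.
    """
    pools = []
    for q, v_q in enumerate(features):
        # Euclidean distance to every other training image
        dists = np.linalg.norm(features - v_q, axis=1)
        dists[q] = np.inf                  # exclude the query image itself
        ranked = np.argsort(dists)         # ascending: most similar first
        pools.append(ranked[:k])
    return pools

# usage sketch: pools = build_similarity_pools(fc7_features, k=100)
```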

3.1.2 “Hard” object lucid data synthesizer

For the lucid data synthesizer, we replace “easy” objects with “hard” ones to preserve a similar context in the new synthesized image. Specifically, for an image I in the training set, we first find all available positions, which are the bounding boxes of detected objects. Next, we create a pool of hard objects by collecting them from the top k images most similar to I; in particular, we run a Faster RCNN detector on these images to find misdetected objects. The similarity between two images is the Euclidean distance between their feature vectors, as presented in Section 3.1.1. Then, each available position in image I is matched with an object in the pool of “hard” objects. We use the width, the height, and the aspect ratio (height/width) of the available boxes and the object bounding boxes for matching. The details are given in Algorithm 1, and the process is illustrated in Fig. 2, with a matching sketch shown after the figure.

Fig. 2 Lucid Data Synthesizer: the process of generating synthesized images in our framework
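Algorithm 1 itself is not reproduced here; the following is a hedged sketch of one plausible greedy matching between available positions and the pool of hard objects, using the width, height, and aspect-ratio cues mentioned above. The cost function and greedy strategy are our own illustrative choices, not necessarily those of Algorithm 1.

```python
def box_shape(box):
    # box = (x1, y1, x2, y2)
    w, h = box[2] - box[0], box[3] - box[1]
    return w, h, h / max(w, 1e-6)

def match_hard_objects(available_boxes, hard_boxes):
    """Greedily assign each available position to the unused hard object whose
    width, height and aspect ratio are closest (illustrative cost function)."""
    matches, used = [], set()
    for i, pos in enumerate(available_boxes):
        pw, ph, pr = box_shape(pos)
        best, best_cost = None, float("inf")
        for j, obj in enumerate(hard_boxes):
            if j in used:
                continue
            ow, oh, orr = box_shape(obj)
            # relative differences in width, height and aspect ratio
            cost = (abs(pw - ow) / max(pw, 1e-6)
                    + abs(ph - oh) / max(ph, 1e-6)
                    + abs(pr - orr) / max(pr, 1e-6))
            if cost < best_cost:
                best, best_cost = j, cost
        if best is not None:
            matches.append((i, best))
            used.add(best)
    return matches  # list of (position index, hard object index) pairs
```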

After synthesizing, we further highlight image regions in the synthesized image set. In particular, we apply histogram equalization to make the hard object instances more salient. The details of this operation are given by the equations below,

$$ \left\{\begin{array}{l} C_{eq_{(x,y)}} = (L-1)\times\tau(C_{(x,y)}) \\ C_{eq} = \{C_{eq_{(x,y)}}: 0\leqslant x < W,\ 0\leqslant y < H\}, \end{array}\right. $$
(1)

where \(\tau (C_{(x,y)}) ={\sum }_{i=0}^{W}{\sum }_{j=0}^{H}(C_{(i,j)} \leqslant C_{(x,y)})/(W\times H)\), C is a color channel (R/G/B), Ceq represents the color channel after histogram equalization, C(x,y) represents the value of pixel (x,y) in the image channel, \(C_{eq_{(x,y)}}\) represents the value of pixel (x,y) in the image channel after histogram equalization, L is fixed at 256, W and H are the width and the height of the image, respectively.
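A minimal NumPy sketch of per-channel histogram equalization following Eq. (1) is given below; the function names are ours, and an equivalent result could also be obtained with a library routine such as OpenCV's equalizeHist applied per channel.

```python
import numpy as np

def equalize_channel(C, L=256):
    """Histogram equalization of one colour channel, following Eq. (1).
    C: 2-D uint8 array (one of the R/G/B channels)."""
    H, W = C.shape
    hist = np.bincount(C.ravel(), minlength=L)
    cdf = np.cumsum(hist) / float(W * H)          # tau(c) = fraction of pixels <= c
    return ((L - 1) * cdf[C]).astype(np.uint8)    # C_eq(x, y) = (L-1) * tau(C(x, y))

def equalize_image(img):
    # apply the same operation independently to each colour channel
    return np.stack([equalize_channel(img[..., c]) for c in range(3)], axis=-1)
```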

Algorithm 1

Directly pasting objects onto background images obviously creates boundary artifacts. Although these artifacts may seem subtle, images containing them yield poor performance when used to train detection algorithms, as shown in [6]. Since current detection methods [33] depend strongly on local region-based features, boundary artifacts substantially degrade their performance. A blending step smooths out the boundary artifacts between the pasted object and the background, thereby improving the performance of the trained detectors. We use traditional Gaussian blending to smooth edges. Furthermore, Mask RCNN [18] is used to segment objects from the background before pasting them onto the similar images; excluding background pixels inside object bounding boxes makes the synthesized objects blend better with the new scenes.
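The sketch below illustrates one way to paste an object patch with Gaussian-feathered edges; the kernel size and bounds handling are illustrative assumptions, and the mask may come from Mask R-CNN or default to the full bounding box.

```python
import cv2
import numpy as np

def paste_with_gaussian_blending(scene, obj_patch, mask, x, y, ksize=7):
    """Paste obj_patch onto scene at (x, y), feathering the mask edges with a
    Gaussian blur so boundary artefacts are smoothed (bounds checks omitted).

    mask: binary (0/255) foreground mask of the object, e.g. from Mask R-CNN;
          a full-ones mask reduces this to plain bounding-box pasting."""
    h, w = obj_patch.shape[:2]
    alpha = cv2.GaussianBlur(mask.astype(np.float32) / 255.0, (ksize, ksize), 0)
    alpha = alpha[..., None]                              # (h, w, 1) for broadcasting
    roi = scene[y:y + h, x:x + w].astype(np.float32)
    blended = alpha * obj_patch.astype(np.float32) + (1 - alpha) * roi
    scene[y:y + h, x:x + w] = blended.astype(np.uint8)
    return scene
```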

3.2 Bounding box fusion

During testing, an input image is fed into the first CNN model to detect “easy” objects. Then, the second CNN model, trained on hard objects, is used to detect the remaining undiscovered objects. In principle, the objects detected by the first CNN model could be erased from the image by masking to prevent duplicate detections. However, this would reduce the recall, because it also removes objects occluded by the detections. Therefore, we propose a fusion method to combine the two bounding box sets B1 (first CNN model) and B2 (second CNN model) as follows. Figure 3 shows the pipeline of the proposed fusion scheme.

Fig. 3 Proposed detection fusion scheme illustrated with real data (detections of the ‘plane’ class on the VOC 2007 test set). The CNN model trained on synthesized data (focusing on hard objects) has a balanced detection confidence histogram, whereas most detections of the CNN model trained on standard data have confidence scores in the range [0.9-1.0]. Our fusion method improves detection recall by normalizing the detections of the synthesized-data-based model (including duplicate truncation and score rescaling) and then concatenating both detection sets

3.2.1 Duplicate truncation

Firstly, all the duplicated bounding boxes are truncated. These duplicates are identified as follows.

$$ \mathrm{D} = \{(b_{i}, b_{j}) : \frac{area(b_{j} \cap b_{i})}{area(b_{j} \cup b_{i})} > \zeta\}, $$
(2)

where bi is a bounding box in B1, bj is a bounding box in B2, and ζ is the NMS threshold, fixed at 0.7. Here, ∩ denotes the intersection of two bounding boxes and ∪ denotes their union.

We remove the duplicated bounding boxes from B2 instead of B1 because the detections from the second model generally have lower confidence scores.
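A minimal sketch of this duplicate truncation, implementing Eq. (2) with an illustrative box format of (x1, y1, x2, y2), is shown below.

```python
def iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def truncate_duplicates(B1, B2, zeta=0.7):
    """Drop boxes in B2 (hard-object model) that overlap some box in B1
    (baseline model) with IoU > zeta, following Eq. (2)."""
    return [b2 for b2 in B2 if all(iou(b1, b2) <= zeta for b1 in B1)]
```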

3.2.2 Score re-scaling

We then rescale the confidence scores of detections from the second model. This serves two purposes: first, it handles the uncertainty of these detections, which focus on hard objects and therefore should not have higher confidence than the easy-object detections; second, it scales the scores to a reasonable range so that they can effectively complement the detections from the first model.

To do that, for each object class C we estimate the mean value μC of detection scores from the first model. Then we rescale confidence scores of detections from the second model by the following equation:

$$ S^{Normalized} = S^{Original}\times \frac{1}{\mu^{\prime}_{C}}\times(1-\mu_{C}-\gamma\times\sigma_{C}), $$
(3)

where \(\mu ^{\prime }_{C}\) is the mean of the detection scores from the second model, σC is the standard deviation of the detection scores from the first model, and γ is a coefficient for flexibly handling the fusion of the two Gaussian distributions. The idea behind Eq. 3 is to re-scale the confidence scores of the hard object detector by translating their mean value; the new mean is determined from the mean of the confidence scores of the baseline detector, and γ controls the distance between the two means. In our experiments, we set γ = 0.2.
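For concreteness, a per-class re-scaling routine implementing Eq. (3) might look as follows; the function name and array-based interface are our own illustrative choices.

```python
import numpy as np

def rescale_scores(scores_second, scores_first, gamma=0.2):
    """Re-scale hard-object detector scores for one class, following Eq. (3)."""
    mu_c = np.mean(scores_first)         # mean score of the baseline detector
    sigma_c = np.std(scores_first)       # std of the baseline detector scores
    mu_prime_c = np.mean(scores_second)  # mean score of the hard-object detector
    return scores_second * (1.0 / mu_prime_c) * (1.0 - mu_c - gamma * sigma_c)
```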

4 Experimental results

4.1 Benchmark datasets

We evaluate our proposed method on two challenging real-world datasets: PASCAL VOC and KITTI. The PASCAL Visual Object Classes Challenge [7] is a popular dataset in object detection and recognition. The KITTI dataset [11] is used as a benchmark for autonomous driving systems.

We conduct experiments on the PASCAL VOC 2007 dataset, which contains 9,963 images and 24,640 annotated objects across 20 classes including people, vehicles, animals, and indoor objects. Half of the images are used for training and validation, and the other half for testing, with approximately the same number of objects per class in both sets. We also provide a statistical comparison of the hard-object to total-object ratio between the original trainval set and our synthesized trainval set in Fig. 4. Synthesized images are visualized in Fig. 5.

Fig. 4 Statistical comparison of the hard-object to total-object ratio between the VOC 2007 trainval set and the synthesized VOC 2007 trainval set

Fig. 5 Synthesized images from the VOC 2007 dataset. The first row of each group (separated by the red line) shows real images in the dataset, whereas the second row shows the corresponding synthesized images. Replaced objects are highlighted by yellow rectangles

The KITTI dataset has 7,481 images in the training set and 7,518 images in the test set, with a total of 80,256 labelled objects from 8 classes: car, van, truck, pedestrian, person sitting, cyclist, tram, and misc. However, ground-truth labels are available only for the training set.

For measuring performance, we use the average precision (AP) metric on both datasets. AP is the average of the precision obtained at all recall values over the ranked output bounding boxes. We compute AP using all data points [8] instead of only 11 equally spaced recall levels as in [7]. Precision and recall are defined as follows:

$$ precision = \frac{tp}{tp+fp} $$
(4)
$$ recall = \frac{tp}{tp+fn}, $$
(5)

where tp denotes the number of detected bounding boxes which are correct, fp denotes the number of detected bounding boxes which are incorrect, and fn denotes the number of missed bounding boxes.

A bounding box is classified as a true positive or a false positive by measuring the ratio between its intersection and union areas with a ground-truth box, usually called IoU (Intersection over Union). This metric is computed as:

$$ \tau = \frac{area(b_{p} \cap b_{g})}{area(b_{p} \cup b_{g})}, $$
(6)

where bp is a predicted bounding box, bg is a ground-truth bounding box, ∩ denotes the intersection of bounding boxes, and ∪ denotes their union. To be counted as a true positive, τ must be greater than 0.5. Each predicted box is assigned to the best-overlapping ground-truth box (i.e., the one with the largest IoU). When multiple predicted boxes overlap a ground-truth box and all satisfy the IoU constraint, only the box with the maximum IoU is accepted as a true positive; the remaining boxes are counted as false positives.
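The evaluation protocol described above can be summarized by the following sketch, which greedily matches ranked detections to ground truth at IoU > 0.5 and integrates the all-point precision envelope; it reuses the iou() helper from the duplicate-truncation sketch, and the data layout (per-image lists of boxes) is an assumption of ours rather than the official VOC toolkit interface.

```python
import numpy as np

def average_precision(detections, gt_boxes, iou_thr=0.5):
    """All-point AP for one class.

    detections: list of (image_id, score, box); gt_boxes: dict image_id -> list of boxes.
    Assumes the iou() helper defined in the duplicate-truncation sketch."""
    n_gt = sum(len(v) for v in gt_boxes.values())
    detections = sorted(detections, key=lambda d: -d[1])        # rank by confidence
    matched = {img: [False] * len(v) for img, v in gt_boxes.items()}
    tp, fp = np.zeros(len(detections)), np.zeros(len(detections))
    for i, (img, _, box) in enumerate(detections):
        cands = gt_boxes.get(img, [])
        ious = [iou(box, g) for g in cands]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] > iou_thr and not matched[img][j]:
            tp[i], matched[img][j] = 1, True                    # best-overlap ground truth
        else:
            fp[i] = 1                                           # duplicate or low overlap
    recall = np.cumsum(tp) / max(n_gt, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    ap, prev_r = 0.0, 0.0
    for r in recall:
        # all-point interpolation: precision envelope at recall level r
        ap += (r - prev_r) * precision[recall >= r].max()
        prev_r = r
    return ap
```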

4.2 Implementation settings

We conduct experiments with a medium-scale CNN model, VGGM, and a large-scale one, VGG16. We also report the performance of our reproduced Faster-RCNN to make a fair comparison. For the Faster-RCNN implementation, we modify the authors' public source code to work with the OHEM model, which was developed for Fast RCNN. This model significantly reduces GPU memory consumption by accumulating gradients over two forward/backward passes; thus, we can run the VGG16 network, which requires at most 4GB of memory, on a medium GPU card (Tesla K20c). The disadvantage is that this setup supports only alternating optimization training and therefore takes more training time.
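The original implementation modifies Caffe; purely as a hedged illustration of the general idea of trading speed for memory by accumulating gradients over two passes, a PyTorch-style sketch (framework and names are our assumptions, not the authors' code) is given below.

```python
def accumulated_step(model, optimizer, criterion, sub_batches):
    """One optimizer step whose gradients are accumulated over several smaller
    forward/backward passes, reducing peak GPU memory at the cost of speed."""
    optimizer.zero_grad()
    n = len(sub_batches)
    for images, targets in sub_batches:                 # e.g. two halves of the batch
        loss = criterion(model(images), targets) / n    # rescale so gradients match one big pass
        loss.backward()                                 # gradients accumulate in .grad
    optimizer.step()
```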

4.3 Performance on PASCAL VOC

The experimental results of different network settings are shown in Table 1. Regarding the single-scale setting, our YADA improves Faster-RCNN by a large margin; with the VGG16 backbone, YADA outperforms Faster-RCNN by 2.6%. In addition, multi-scale also proves superior to the single-scale Faster-RCNN setting, with a 1.0% gain.

Table 1 Detection results on PASCAL VOC 2007 test set (all methods are trained on VOC 2007 trainval set)

We also compare our proposed method with other state-of-the-art methods. The results in Table 2 show the effectiveness of YADA: our method outperforms all the baselines. In detail, YADA outperforms Fast RCNN by 5.4% mAP and Faster-RCNN by 3.6% mAP. Compared with online hard example mining (OHEM) [36], our method achieves an improvement of 3.6% mAP. We also evaluate our proposed method with training data from PASCAL VOC 2007 and 2012. As shown in Table 3, all methods improve thanks to the larger training set; again, YADA leads all baselines by a remarkable margin. Figure 6 shows visualization results of YADA on the PASCAL VOC 2007 test set, where YADA successfully discovers challenging objects that the baseline misses.

Table 2 VOC 2007 test detection average precision (%)
Table 3 VOC 2007 test detection average precision (%), trained on VOC 2007 and 2012 trainval sets
Fig. 6 Visualization of true detections of YADA on VOC 2007. For each image, red boxes represent detections from the standard Faster-RCNN model and green boxes represent detections from our proposed YADA framework. The results show the superiority of YADA over Faster-RCNN, with more objects detected in each image

4.4 Performance on KITTI

For the KITTI dataset, we evaluate on a validation set due to the unavailability of ground truth for the test set: we split the released training images into \(\frac{2}{3}\) for training and \(\frac{1}{3}\) for validation. The results of different settings on KITTI are reported in Table 4. Our proposed framework surpasses the Faster-RCNN baselines for all network structures. Looking closer at the VGG16 implementation, our method significantly improves Faster-RCNN on the three main categories, namely car, pedestrian, and cyclist, by 2.8%, 3.7%, and 2.5%, respectively. This improvement highlights the potential of our method in practical systems such as autonomous vehicles. The overall performance of YADA-SS is 83.2%, a 3.4% gain over Faster-RCNN-SS, and YADA-MS further boosts YADA-SS by 1.8%.

Table 4 KITTI detection average precision (%) on the validation set for 7 object categories with different network structures of Faster-RCNN and our method

4.5 Effect of score-scaling process in YADA

In this subsection, we conduct a comparison experiment on the score re-scaling process, which is used to combine detected objects from the two models. To isolate the effect of this process, we evaluate our YADA method in two settings: with and without score re-scaling. The results are shown in Table 5. We observe that the mAP of YADA without score re-scaling decreases by 2.8% and 3.2% for the single-scale and multi-scale configurations, respectively. The detections from the hard object detector (the second model) should therefore be post-processed before being concatenated with the detections from the baseline detector (the first model); our solution is to re-scale their confidence scores to eliminate noise from uncertain detections. With the score re-scaling process, YADA improves Faster-RCNN by 1.2% and 0.8% mAP for the single-scale and multi-scale configurations, respectively.

Table 5 VOC 2007 test detection average precision (%) of YADA with/without score-scaling process

4.6 YADA on other deep networks

In this subsection, we investigate the impact of our synthesized data on other deep networks. In addition to Faster-RCNN, as used in the previous experiments, we integrate YOLOv2 [30], RFCN [4], and SNIPER [38] into our framework. Note that RFCN and SNIPER deploy ResNet-101, whereas YOLOv2 deploys the Darknet-19 network. Similar to the YADA framework with the Faster RCNN baseline, we generate hard objects for each of these detectors separately and then train them with the generated images. Table 6 shows the results of YADA with these detectors. On the VOC 2007 test set, our YADA method further boosts the YOLOv2, RFCN, and SNIPER baselines by 0.7%, 0.6%, and 0.3% mAP, respectively. These results show that our method is also effective for modern object detectors and deep networks. Taking a closer look at the amount of improvement for each detector, YADA-YOLO gains the most mAP whereas YADA-SNIPER gains the least; this can be explained by the fact that a superior network has less to learn from the synthesized data than an inferior one.

Table 6 VOC 2007 test detection average precision (%) of YADA on other baselines

5 Conclusions and future work

In this paper, we have presented a novel method named YADA (You Always Dream Again) for generic object detection. The contribution of this paper is two-fold: first, we apply lucid data synthesis to the training set by mining hard examples and cloning them into locations with similar context. Unlike previous data augmentation works, our synthesized data is generated according to clear criteria. Second, we utilize a dual level of deep networks leveraging the synthesized data; our framework is designed to flexibly combine the two levels through a fusion scheme. Extensive experiments on two benchmarks, PASCAL VOC and KITTI, demonstrate the superiority of our approach over state-of-the-art methods. In the future, we would like to extend our work to more sophisticated deep networks for object detection and to apply the YADA philosophy to other computer vision tasks.