1 Introduction

The detection of small pedestrians in aerial images is widely used in applications such as search and rescue, smart drone monitoring, and video surveillance systems. Detecting pedestrians in single images (Redmon and Farhadi 2018; Ren et al. 2015; Xie et al. 2019; Barekatain et al. 2017) has made remarkable progress with the emergence of deep convolutional neural networks (CNNs). In this field, the rapid adoption of deep learning architectures has produced highly accurate detection methods for traditional pedestrian datasets. The state-of-the-art method (Liu et al. 2019) on the Caltech pedestrian dataset (Dollar et al. 2011) achieves a miss rate of about 4% in the reasonable setting. On another popular benchmark, the INRIA pedestrian dataset (Wojek et al. 2009), a 5% miss rate was reported by the method proposed in (Lin et al. 2018). For the KITTI benchmark (Geiger et al. 2012), the accuracy of pedestrian detection is close to 90% according to the KITTI website leaderboards.

Although existing methods detect large-scale pedestrians close to the camera reasonably well, their performance deteriorates severely for small-scale pedestrians, as found in drone images, because of low resolution, distorted top-view appearances, small instance sizes, and poor visual cues. Single-image detection methods usually lack robustness, especially for small objects. Drone-view small-scale instances often have obscure appearances and blurred boundaries, which yield less effective feature representations for objects in aerial images. Current detectors frequently fail to leverage appearance information effectively to distinguish these objects from the surrounding background or from similar objects. In addition, small instances may suddenly be missed in certain frames.

There are two main limitations of single-image detectors. First, detectors based on a single image are not robust enough: their detection confidence values fluctuate across frames because they cannot incorporate temporal consistency and constraints. Second, complicated backgrounds degrade detection performance. Single-image detectors are more likely to generate false positives because only the information in one frame is used. However, if the context information of the whole video is exploited, these false positives can be effectively removed, as demonstrated in (Kang et al. 2017).

Most video-based object detection methods are developed on the ImageNet VID dataset (Russakovsky et al. 2015), in which objects usually lie near the image center at a relatively large scale. Drone vision is more challenging due to its varied viewpoints and scales. Therefore, we combine the advantages of single-image based and video-based object detection methods. Among the advanced deep CNN architectures for general object detection, we follow the pipeline of R-FCN (Dai et al. 2016) because it shows superior and faster performance than its R-CNN counterparts (Girshick 2015; Ren et al. 2015) for object detection. We also exploit the advantages of video-based object detection. Videos provide multiple frames, so per-frame feature learning can be improved by temporal aggregation. Furthermore, motion information, such as that estimated by an optical flow network (Dosovitskiy et al. 2015), can describe the motion between frames to further enhance the features.

Inspired by these motivations, we developed a new video-based, small-scale pedestrian detection method. To the best of our knowledge, this is the first work that exploits a video-based two-stream architecture to solve the small object detection problem. The main contributions are as follows:

  1. A novel deep neural network architecture with two-stream subnetworks incorporates spatial and temporal information to improve detection performance for small-scale instances as well as partially occluded objects.

  2. Feature aggregation with nearby frames is proposed for our two-stream network. An average operator aggregates the feature maps after the spatial feature maps of nearby frames are mapped to the reference frame by flow-guided warping. A more refined flow network, SPyNet, rather than a simple FlowNet, is adopted as the temporal subnetwork to obtain motion information and to generate flow feature maps.

  3. Some effective techniques from single-image based methods are also adopted in our spatial-stream network, including less downsampling and dilated convolution. Less downsampling, which yields a relatively larger feature resolution, preserves more detailed information about small-scale instances in the spatial stream network. To compensate for the reduced receptive field, dilated convolution is applied in the deep layers of the network, producing a final spatial feature map with richer information.

  4. Our method shows state-of-the-art performance on drone-view datasets such as the Okutama Human-action dataset (Barekatain et al. 2017) and the VisDrone dataset (P. Zhu et al. 2018). By additionally exploiting temporal information, it improves mAP by 3.09% over the state-of-the-art method (Xie et al. 2019) on the Okutama Human-action dataset. To further verify the performance for general drone-view object detection, we also performed experiments on the VisDrone dataset and achieved a 14.06% improvement in mean Average Precision at a 0.5 IoU threshold (mAP@0.5) compared with the well-known SSD-PeleeNet method (Ozge Unel et al. 2019) on the VisDrone VID validation set.

The remainder of this paper is organized as follows. Section 2 reviews recent works on both single-image based and video-based detection methods. Section 3 explains the proposed two-stream detection network with feature aggregation (TDFA) in detail. Experiments and results are discussed in Sect. 4. Finally, Sect. 5 summarizes conclusions and future work.

2 Related works

2.1 Single-image based pedestrian detection

With the recent rapid growth of deep CNN technologies, many general pedestrian detectors have achieved good performance. For small pedestrian detection, a common and popular strategy is the multi-layer approach, which builds multiple branches or subnetworks trained for different scales. MS-CNN (Cai et al. 2016) detects pedestrians of various scales from the outputs of multiple layers. Similarly, SAF R-CNN (Li et al. 2017) proposes a divide-and-conquer approach on the Fast R-CNN pipeline: pedestrians are detected by two built-in subnetworks specialized for disjoint scale ranges. Another way to enhance the feature representation is to combine the rich semantic information of deeper-layer features with the fine-grained information of shallow-layer feature maps. To extract strong semantic representations at all scales, including small ones, lateral connections with a top-down pathway have been proposed in the Feature Pyramid Network (FPN) (Dollár et al. 2014). YOLO-v3 (Redmon and Farhadi 2018) uses a similar scheme but with nearest-neighbor upsampling in place of deconvolution, achieving better detection performance for small objects.

2.2 Video-based object detection

Since ImageNet introduced the video object detection (VID) challenge and dataset, many works have focused on video object detection. One typical architecture is ConvNets + LSTM, which extracts features on each frame individually and then pools the predictions over the entire video. For example, ROLO (Ning et al. 2017) develops a spatially supervised recurrent convolutional neural network for visual object tracking: it concatenates high-level spatial features captured by convolutional networks with region information and applies a Long Short-Term Memory (LSTM) in the temporal domain. This type of ConvNets + LSTM approach can deliver high-level semantic information but cannot capture fine low-level details, which are important for small-scale detection. It is also time-consuming to train because the network must be unrolled over multiple frames for backpropagation through time. The second typical architecture is 3D ConvNets (Varol et al. 2017), which directly builds hierarchical spatio-temporal feature representations. However, 3D ConvNets have many more parameters than 2D ConvNets owing to the extra kernel dimension, which makes training more difficult. The third typical architecture is the two-stream network proposed by Simonyan and Zisserman (2014). To capture spatio-temporal information about both the appearance and the motion of objects, the RGB and optical-flow frames are fed into separate deep ConvNets, and their softmax scores are combined by late fusion. An extended work (Feichtenhofer et al. 2016) fuses the spatial and flow branches at the last convolutional layer of the network. In more recent works such as FGFA (Zhu et al. 2017) and MANet (Wang et al. 2018a, b), the features of a single frame are enhanced by using an optical flow network to estimate the motion between the reference frame and nearby frames within a more advanced deep learning framework. These methods exploit temporal information at the feature level rather than at the final box level, as ConvNets + LSTM does. Compared with the other two typical architectures, this type of two-stream network shows better performance and requires relatively less time for training and testing.

2.3 Video-based pedestrian detection

In TLL (Song et al. 2018), somatic topological line localization (TLL) is integrated with temporal feature aggregation using a joint Conv-LSTM model to detect multi-scale pedestrians. ADM (Zhang et al. 2018) introduces an RCNN-based localization policy that applies sequences of coordinate-transformation actions to obtain the final detections of pedestrian instances. In (Wang et al. 2018a, b), a part and context network (PCN) is proposed that incorporates a primary branch, a context branch, and a part branch into an integrated architecture with an LSTM module for communicating body-part semantic information.

In this research, we also adopt a similar two-stream network architecture to incorporate temporal as well as spatial information for better accuracy and robustness. To effectively capture the features of small-scale instances, we design a network architecture based on the R-FCN pipeline, with a variant ResNet as the backbone of the spatial stream. Furthermore, we aggregate the features of nearby frames to acquire more temporal information. Our method achieves more robust detection performance on difficult examples, such as small-scale instances in drone-view images, and effectively reduces false positive detections.

3 Two-stream detection network with feature aggregation

In this section, we describe the details of our proposed method. The main notations used in this paper are listed in Table 1.

Table 1 Notations

3.1 Two-stream network design

Our architecture consists of two subnetworks. One subnetwork, a variant ResNet, extracts the spatial feature map \(f_{i}\) of frame \(I_{i}\); the other is the temporal stream network. Given a reference frame \(I_{i}\) and a nearby frame \(I_{i + k}\), a two-dimensional flow field \({\text{M}}_{i \to i + k} = {\mathcal{F}}\left( {I_{i} ,I_{i + k} } \right)\) is obtained by the optical flow estimation algorithm (Ranjan and Black 2017), where \({\mathcal{F}}\left( {I_{i} ,I_{i + k} } \right)\) denotes the flow field estimated from frame \(I_{i}\) to frame \(I_{i + k}\). Figure 1 shows a flowchart of the two-stream detection network with feature aggregation. Given a sequence of images, a series of spatial feature maps \(f_{i - k} , \ldots ,f_{i + k}\) is generated by passing the frames \(I_{i - k} , \ldots ,I_{i + k}\) through the spatial stream, and the temporal feature map \({\text{M}}_{i \to i + k}\) is computed by applying the flow network to frames \(I_{i}\) and \(I_{i + k}\). Flow-guided feature warping then produces the warped feature maps \(wf_{i - k \to i} , \ldots ,wf_{i + k \to i}\). These features are aggregated into \(af_{i}\) and passed to the PS ROI pooling layer. Finally, the detection module with softmax classification and bounding box regression produces the detection results.

Fig. 1
figure 1

Architecture of our proposed two-stream detection network with feature aggregation (TDFA)

3.2 Feature aggregation

Motivated by (Zhu et al. 2017) and (Wang et al. 2018a, b), we adopt flow-guided feature warping to capitalize on temporal information. In the spatial stream, the spatial network is applied to the nearby frames \(I_{i - k} , \ldots ,I_{i + k}\) to obtain the corresponding feature maps \(f_{i - k} , \ldots ,f_{i + k}\). The feature map of a nearby frame \(I_{i + k}\) is then warped to the current frame \(I_{i}\) as follows:

$$ wf_{i + k \to i} = {\text{W}}\left( {f_{i + k} ,{\text{M}}_{i \to i + k} } \right) = {\text{W}}\left( {f_{i + k} ,{\mathcal{F}}\left( {I_{i} ,I_{i + k} } \right)} \right) $$
(1)

where \(wf_{i + k \to i}\) denotes the warped feature map from frame \(I_{i + k}\) to frame \(I_{i}\), and \({\text{W}}\left( \cdot \right)\) denotes the bilinear interpolation function, which is applied to each location of every channel of the feature map. A location p in the reference frame \(I_{i}\) maps to the location \(p + \Delta p\) in frame \(I_{i + k}\), as given in Eqs. (2) and (3):

$$ \Delta p = {\mathcal{F}}\left( {I_{i} ,I_{i + k} } \right)\left( p \right) $$
(2)
$$ wf_{i + k \to i} \left( p \right) = \mathop \sum \limits_{q} G\left( {q,p + \Delta p} \right)f_{i + k} \left( q \right) $$
(3)

where q enumerates all spatial locations in the feature map \(f_{i + k}\), ∆p is the output of the flow estimation at location p, and G(·) is the bilinear interpolation kernel:

$$ G( {q,p + \Delta p}) = \max \left( {0, 1 - \left\| {q -( {p + \Delta p} )} \right\| } \right). $$
(4)
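
To make the flow-guided warping concrete, the following is a minimal NumPy sketch of Eqs. (2)–(4), not the MXNet implementation used in our experiments; it uses the standard separable bilinear kernel and clips out-of-frame sampling locations to the border, which are implementation choices rather than details prescribed by the equations.

```python
import numpy as np

def warp_bilinear(feature, flow):
    """Flow-guided warping of a feature map, following Eqs. (2)-(4).

    feature : (C, H, W) spatial feature map f_{i+k} of a nearby frame
    flow    : (2, H, W) flow field M_{i->i+k}; channel 0 holds the horizontal
              and channel 1 the vertical displacement Delta p at each location p
    returns : (C, H, W) warped feature map wf_{i+k->i}
    """
    C, H, W = feature.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sx, sy = xs + flow[0], ys + flow[1]          # sampling locations p + Delta p

    x0 = np.floor(sx).astype(int); x1 = x0 + 1   # the four integer neighbours q
    y0 = np.floor(sy).astype(int); y1 = y0 + 1
    wx1, wy1 = sx - x0, sy - y0                  # bilinear weights G(q, p + Delta p)
    wx0, wy0 = 1.0 - wx1, 1.0 - wy1

    # clip so that locations warped outside the frame reuse the border values
    x0, x1 = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0, y1 = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)

    return (feature[:, y0, x0] * (wy0 * wx0) + feature[:, y0, x1] * (wy0 * wx1)
            + feature[:, y1, x0] * (wy1 * wx0) + feature[:, y1, x1] * (wy1 * wx1))
```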

When the warped features of nearby frames \(wf_{i - k \to i} , \ldots , wf_{i + k \to i}\) are obtained, the feature map of the reference frame can be enhanced by accumulating the multiple feature maps of nearby frames, which provides the temporal information of the object instances. We aggregate the feature maps by averaging them. The aggregated feature \(af_{i} \) at the reference frame i is generated as

$$ af_{i} = \frac{{\mathop \sum \nolimits_{t = i - k}^{i + k} wf_{t \to i} }}{2k + 1} $$
(5)

The procedure of generating the aggregated feature is presented in Algorithm 1.

figure a
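
As a companion to Algorithm 1, the sketch below expresses the averaging of Eq. (5) in the same NumPy style; spatial_features and flows are assumed to be precomputed by the two streams, and warp_bilinear is the function from the sketch above.

```python
import numpy as np

def aggregate_features(spatial_features, flows, i, k):
    """Average-based feature aggregation of Eq. (5) (a sketch of Algorithm 1).

    spatial_features : dict {t: (C, H, W) array} of feature maps f_t
    flows            : dict {t: (2, H, W) array} of flow fields M_{i->t},
                       with M_{i->i} an all-zero field for the reference frame
    i, k             : reference frame index and number of nearby frames per side
    returns          : (C, H, W) aggregated feature af_i
    """
    warped = [warp_bilinear(spatial_features[t], flows[t])   # wf_{t->i}
              for t in range(i - k, i + k + 1)]
    return np.mean(np.stack(warped, axis=0), axis=0)         # average over 2k+1 maps
```

The aggregated map \(af_{i}\) then replaces the single-frame feature \(f_{i}\) before position-sensitive ROI pooling and the detection head.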

3.3 Flow network of the temporal stream

Instead of a simple version of FlowNet (Dosovitskiy et al. 2015), we adopt the Spatial Pyramid Network (SPyNet) (Ranjan and Black 2017), which estimates residual flow in a coarse-to-fine spatial pyramid structure. In our case, the motion of object instances between frames is small, and SPyNet captures such detailed and precise optical flow better. Furthermore, as reported in (Ranjan and Black 2017), the SPyNet model is faster and smaller than FlowNet. To further reduce computation time, we apply the flow network to non-adjacent frame pairs as in (Zhu et al. 2017): the flow field between non-adjacent frames is estimated by compositing the intermediate flow fields. As a result, the computation time is reduced by about half with almost the same accuracy.
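
As an illustration, one way to composite two intermediate flow fields is to warp the second flow by the first and add the two; the sketch below assumes this composition rule and reuses the warp_bilinear function from Sect. 3.2 (our MXNet implementation may differ in detail).

```python
def compose_flow(flow_ij, flow_jm):
    """Approximate the flow between non-adjacent frames by composition.

    flow_ij : (2, H, W) flow field M_{i->j}
    flow_jm : (2, H, W) flow field M_{j->m}
    returns : (2, H, W) approximate flow field M_{i->m}

    M_{i->m}(p) ~ M_{i->j}(p) + M_{j->m}(p + M_{i->j}(p)); the second term is
    M_{j->m} sampled at the displaced locations, i.e. warped by M_{i->j}.
    """
    return flow_ij + warp_bilinear(flow_jm, flow_ij)
```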

In SPyNet, the residual flow is first computed by convolution at the top (coarsest) level of the pyramid using the low-resolution images. At each pyramid level, the residual flow is computed and propagated to the next lower level, which has a higher resolution. Eventually, the full flow is obtained at the lowest (finest) level of the pyramid. Each such procedure can be treated as a flow block, as illustrated in Fig. 2. We adopt a 5-level SPyNet; the flowchart of the SPyNet architecture is also illustrated in Fig. 2.
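
The coarse-to-fine update rule can be sketched as follows (a simplified illustration in the spirit of SPyNet; residual_nets are hypothetical placeholders for the trained per-level convolutional modules, warp_bilinear is the function from Sect. 3.2, and a 5-level pyramid corresponds to five residual predictors):

```python
import numpy as np

def upsample_flow_2x(flow):
    """Nearest-neighbour 2x upsampling of a (2, H, W) flow field; displacements
    are doubled to match the finer pixel grid."""
    return 2.0 * np.repeat(np.repeat(flow, 2, axis=1), 2, axis=2)

def coarse_to_fine_flow(pyr_ref, pyr_nbr, residual_nets):
    """SPyNet-style coarse-to-fine residual flow estimation (a sketch).

    pyr_ref, pyr_nbr : image pyramids of the reference and nearby frame as lists
                       of (C, H, W) arrays, coarsest level first, each level
                       having twice the resolution of the previous one
    residual_nets    : one callable per level mapping (ref, warped_nbr, flow)
                       to a (2, H, W) residual flow (stand-ins for the CNNs)
    """
    flow = np.zeros((2,) + pyr_ref[0].shape[1:])              # zero flow at the top level
    for level, net in enumerate(residual_nets):
        if level > 0:
            flow = upsample_flow_2x(flow)                     # propagate to the finer level
        nbr_warped = warp_bilinear(pyr_nbr[level], flow)      # warp neighbour by current flow
        flow = flow + net(pyr_ref[level], nbr_warped, flow)   # add the predicted residual
    return flow                                               # flow at the finest level
```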

Fig. 2
figure 2

Flowchart of SPyNet architecture

3.4 Feature network of the spatial stream

Following R-FCN (Dai et al. 2016), we adopt a variant ResNet-101 (ResNet-101-M) as the backbone network for spatial feature extraction. Compared with the original ResNet-101 (He et al. 2016), the final average pooling and fully connected (fc) layers are removed for the object detection task. The variant ResNet-101 is designed specifically for feature extraction of small objects. In the last block of the conv5 stage, the convolution stride of 2 is changed to 1 to keep a relatively large spatial resolution of the feature map, which increases the feature resolution by reducing the effective stride of the conv5 output from 32 to 16 pixels. Dilated convolution is then used to compensate for the reduced receptive field: the dilation rate of the convolutional layers with kernel size larger than 1 in the last block of the conv5 stage is set to 2.
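
Our implementation uses MXNet (see Sect. 4.1); purely as an illustration of these modifications, a close counterpart can be written with torchvision's ResNet, where the replace_stride_with_dilation option removes the stride-2 downsampling of the conv5 (layer4) stage and dilates its 3 × 3 convolutions instead. This is a sketch under that assumption, not our actual code:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

def resnet101_m_backbone():
    """Illustrative PyTorch counterpart of the ResNet-101-M backbone.

    replace_stride_with_dilation=[False, False, True] keeps conv2-conv4 unchanged
    and, in the conv5 (layer4) stage, replaces the stride of 2 with a dilation of 2
    in the 3x3 convolutions, so the effective stride drops from 32 to 16 pixels
    while the receptive field is preserved.
    """
    net = resnet101(replace_stride_with_dilation=[False, False, True])
    # drop the ending average pooling and fully connected layer; keep only the
    # convolutional stages as the spatial feature extractor
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                         net.layer1, net.layer2, net.layer3, net.layer4)

if __name__ == "__main__":
    # quick shape check: stride 16 means a 224x224 input yields a 14x14 map
    feats = resnet101_m_backbone()(torch.zeros(1, 3, 224, 224))
    print(feats.shape)  # torch.Size([1, 2048, 14, 14])
```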

4 Experimental results

4.1 Implementation details

Our experiments use a ResNet-101-M model pre-trained on the ILSVRC-CLS image classification dataset (Russakovsky et al. 2015) and a base SPyNet model pre-trained on the Flying Chairs dataset (Dosovitskiy et al. 2015). Image flipping is adopted to augment the training data. We use single-scale images of 720 \(\times\) 1280 pixels in training to avoid GPU memory overflow. Non-maximum suppression (NMS) is applied to remove duplicate detection bounding boxes, and online hard example mining (OHEM) (Shrivastava et al. 2016) is adopted to select hard examples automatically during training. We chose MXNet as the platform and trained the network on four parallel Nvidia GeForce GTX TITAN X GPUs with 12 GB of memory each, while testing was performed on a single GPU.
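
For reference, a generic greedy NMS can be sketched as follows (an illustration only; the IoU threshold shown is a placeholder rather than the value used in our experiments):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes.

    boxes  : (N, 4) array of box corners
    scores : (N,) detection confidences
    returns: indices of the kept boxes, highest score first
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        # suppress boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```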

4.2 Okutama human-action dataset

The Okutama Human-action dataset is a real-world aerial-view video dataset with high image resolution. A total of 43 video sequences were captured at 30 FPS, including 33 training sequences and 10 testing sequences. The videos were recorded by UAVs flying at altitudes varying from 10 to 45 m, with camera angles between 45 and 90 degrees. The dataset can be used for both human detection and action understanding; in this paper, we address the human detection task. The dataset is fully annotated spatio-temporally, and each instance has a tracking ID. In the experiments, we used 54,503 images for training and 14,114 images for testing.

Following the object detection protocols in (Everingham et al. 2010; Barekatain et al. 2017), the mean Average Precision at a 0.5 IoU threshold (mAP@0.5) is used as the evaluation metric. The Intersection over Union (IoU) measures the overlap between the predicted bounding box and the ground-truth bounding box, calculated by Eq. (6):

$$ IOU = \left( {area\;of\;intersection} \right)/\left( {area\;of\;union} \right). $$
(6)

If \(IOU \ge 0.5\), the detection is classified as a true positive; otherwise, it is a false positive. Precision is the fraction of correct detections among all detections, and recall is the proportion of ground-truth instances that are correctly detected. The Average Precision (AP) is computed by averaging the precision over a set of evenly spaced recall levels [0, 0.1, …, 1.0]. The definitions of precision, recall, and AP, in terms of true positives (TP), false positives (FP), and false negatives (FN), are as follows:

$$ Precision = TP/\left( {TP + FP} \right) $$
(7)
$$ Recall = TP/\left( {TP + FN} \right) $$
(8)
$$ AP = \frac{1}{11}\mathop \sum \limits_{{r \in \left[ {0,0.1, \ldots ,1} \right]}} p_{interp} \left( r \right) $$
(9)

where \(p_{interp} \left( r \right) = \mathop {\max }\limits_{{\tilde{r}:\tilde{r} \ge r}} p\left( {\tilde{r}} \right)\), and \(p\left( {\tilde{r}} \right)\) is the measured precision at recall \(\tilde{r}\). The precision-recall curves were drawn using the precision \(p\left( r \right)\) as a function of recall r. The area under the curve was summarized to obtain Average Precision. The mean Average Precision (mAP) for a set of classes is the mean of the Average Precision (AP) scores for each class, which is computed as:

$$ {\text{mAP}} = \frac{{\mathop \sum \nolimits_{n = 1}^{N} AP\left( n \right)}}{N} $$
(10)

where N is the number of classes. In our task, we consider only the class "pedestrian"; therefore N = 1 and mAP equals AP.
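
For clarity, Eqs. (6) and (9) can be expressed in code as below (a minimal NumPy illustration, not the official evaluation tooling of the benchmarks):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes, Eq. (6)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def eleven_point_ap(precision, recall):
    """11-point interpolated Average Precision, Eq. (9).

    precision, recall : arrays of precision/recall values obtained by sweeping
                        the detection confidence threshold
    """
    precision, recall = np.asarray(precision), np.asarray(recall)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # p_interp(r): maximum precision at any recall level >= r
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap
```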

4.2.1 Ablation experiments of aggregation number of frames in training and inference

Table 2 compares the performance and runtime when different numbers of frames are used. The case of k = 0 is single-image detection with our network architecture and no multi-frame input; the parameter k is the number of additional nearby frames used for feature aggregation. Performance improves as more input frames are added, up to a certain level, while the runtime gradually increases with the number of aggregated frames. Notice that the performance decreases slightly when 10 frames are aggregated in this example, probably because information from an image 10 frames away is no longer very useful for the current frame. Our model reaches its best performance of 87.18% mAP when k = 9.

Table 2 Comparison of the results by using a different number of frames as input

4.2.2 Ablation experiments

We evaluated the effect of each component of our proposed approach. In the spatial stream, we compared the original ResNet-101 and ResNet-101-M, with R-FCN as the basic architecture. In the temporal stream, we compared FlowNet and SPyNet for capturing temporal information, with the number of input frames k = 9. As shown in Table 3, ResNet-101-M outperforms the original ResNet-101 by 0.75% mAP for extracting features in the spatial domain. A significant improvement of 5.94% mAP is achieved after adding the temporal stream with FlowNet. Replacing FlowNet with SPyNet to capture more detailed and precise optical flow yields an additional improvement of 2.54% mAP.

Table 3 Influence of each component of our proposed method on the Okutama test dataset

4.2.3 Comparison with the state of the arts on Okutama dataset

Table 4 compares the detection results and runtime of our method with state-of-the-art methods on the Okutama dataset. The R-FCN with ResNet-101-M is the base network of our spatial stream, corresponding to single-image detection with \(k = 0\). By exploiting temporal information and aggregating the feature maps of nearby frames, our two-stream method improves the performance from 78.7 to 87.18% mAP. Compared with the existing best method, DIF R-CNN (Xie et al. 2019), our method is 3.09% better. DIF R-CNN (Xie et al. 2019), which was our previous work, reported 90.3% mAP on the Okutama validation set; however, the test set in (Xie et al. 2019) differs from the Okutama official test set, since (Xie et al. 2019) was published before the official data were released. DIF R-CNN (Xie et al. 2019) achieves 84.09% mAP on the official test set, and all results presented in Table 4 use the official test set. The proposed approach takes about 50–52 h to train the best model with k = 9 on the Okutama dataset, and the runtime for testing is 0.28 s/frame with the original image size of 3840 × 2160 pixels. Figure 3 shows the precision-recall curves of our proposed TDFA and other existing methods on the Okutama human-action dataset.

Table 4 Performance comparison with other existing methods on the Okutama test dataset
Fig. 3
figure 3

The pedestrian detection comparison of our proposed TDFA and other methods on the Okutama human-action dataset

Figure 4 shows a visual comparison of the detection results. Our proposed method generates fewer false positive detections and handles partially occluded cases better than SSD-Okutama (Barekatain et al. 2017) and DIF R-CNN (Xie et al. 2019). Figure 5 compares the detection results for a sequence from the test set Drone 2-Noon-1.2.1; for this example, we choose the frames \(id = \left\{ {1100, 1105, 1110, 1115} \right\}\), and the images are displayed with partial magnification. Compared with the single-image based methods (Xie et al. 2019; Barekatain et al. 2017), our approach more robustly detects the persons in every frame without sudden misses.

Fig. 4
figure 4

Visual comparison of the detection results. Top to bottom: a SSD-Okutama (Barekatain et al. 2017) (baseline method) showing two false negatives in the left case and two false positives in the right case. b DIF R-CNN (Xie et al. 2019) showing one false negative and one false positive in the left case, one false positive in the right case. c our results on the Okutama test set detected all persons without false positives. We denote ground truth in green and detection results in red (Color figure online)

Fig. 5
figure 5

Comparison of sequence detection results. Top to bottom: a SSD-Okutama (Barekatain et al. 2017) (baseline method) with four errors, b DIF R-CNN (Xie et al. 2019) with two errors, and c our results on the Okutama test set without error. We denote ground truth in green and detection results in red (Color figure online)

4.3 VisDrone dataset

To further verify the performance of our proposed method for general small-scale object detection in drone videos, we also use the VisDrone dataset (P. Zhu et al. 2018) in our experiments. The VisDrone dataset focuses on advancing visual understanding tasks, such as object detection and tracking, for drone applications. It was collected by drone-mounted cameras and covers diverse locations, environments (urban and country), densities (crowded and sparse scenes), and 10 object classes (people, pedestrian, bicycle, bus, car, van, truck, motor, tricycle, awning-tricycle). Following the comparison method, SSD-PeleeNet (Ozge Unel et al. 2019), the 10 classes were grouped into two main categories: pedestrian and vehicle. Training and validation were conducted only on the VisDrone-VID training set, which contains 56 video clips with 24,201 frames. The performance metrics were computed on the VisDrone-VID validation set, comprising seven video clips with 2819 images whose video sequences have non-uniform image sizes. We trained the network with the variant ResNet (ResNet-101-M) backbone and chose the best model trained with 9 frames (k = 9) to obtain the final results.

Table 5 compares our detector with SSD-PeleeNet (Ozge Unel et al. 2019) and shows that our method achieves significantly better performance in most cases. It produces an overall mAP@0.5 of 50.73%, a large improvement of 14.06% from 36.67 to 50.73%. The runtime is 0.26 s/frame on a single TITAN X GPU for testing, and it takes about 12 h to train the best model with k = 9 on the VisDrone dataset. For vehicle detection, ours performs 11.12% better than SSD-PeleeNet (Ozge Unel et al. 2019) in mAP@0.5. Our approach also improves mAP (IoU: 0.75 and 0.5:1.0) to varying degrees on both the vehicle and pedestrian classes. The detection accuracy shows a slight decrease (1.49%) for pedestrian detection in mAP@0.5. In addition to the small instance sizes, this dataset contains night-time scenes captured without the aid of infrared (IR) imagery. Furthermore, heavy occlusion, complicated backgrounds, fog, and bad illumination make the detection task even more challenging. Detection examples of our approach are shown in Fig. 6, and the precision-recall curve of our proposed TDFA for each class of the VisDrone dataset is presented in Fig. 7.

Table 5 Comparison of our method TDFA with SSD-PeleeNet on the VisDrone validation set
Fig. 6
figure 6

Visualization of our detection results on the VisDrone validation set. We denote ground truth in green and detection results in red (Color figure online)

Fig. 7
figure 7

The precision-recall curve of our proposed TDFA for each class of the VisDrone dataset

5 Conclusion and future work

In this paper, a novel two-stream detection network with feature aggregation (TDFA) is proposed for small-scale pedestrian detection in drone-view videos. To achieve more robust detection on drone-view videos, we introduce two-stream video-based detection techniques into the R-FCN pipeline. In the spatial stream, we follow the traditional single-image based feature map extraction method. In addition, we apply SPyNet to extract flow feature maps that capture subtle motion and incorporate temporal information. The spatial features of nearby frames are then warped to the reference frame under the guidance of the flow features, and the warped feature maps are aggregated. The aggregated feature provides a more effective representation with spatio-temporal information.

Experimental evaluations demonstrate that the proposed TDFA is superior to other single-image based detection methods in detecting small-scale pedestrian instances. Our result is 3.09% better in mAP than the state-of-the-art result on the Okutama Human-action dataset. Furthermore, TDFA also achieves good performance on general drone-view object detection tasks such as the VisDrone VID task, reaching a mAP@0.5 of 50.73%, a 14.06% improvement over SSD-PeleeNet on the VisDrone VID validation set. In the future, we plan to focus on developing algorithms for handling occlusion and poor illumination. Solving these two main challenging cases would further enhance the overall detection performance.