Abstract
Satellite video scene classification (SVSC) is an advanced topic in the remote sensing field, which refers to determining the scene categories of satellite videos. SVSC is a fundamental step in satellite video analysis and understanding, as it provides priors for the presence of objects and dynamic events. In this paper, a two-stage framework is proposed to extract spatial features and motion features for SVSC. The first stage extracts spatial features from satellite videos: representative frames are first selected based on blur detection and the spatial activity of the video, and a fine-tuned visual geometry group network (VGG-Net) is then transferred to extract spatial features from their content. The second stage builds a motion representation for satellite videos. First, the motion of moving targets is represented by the second temporal principal component obtained from principal component analysis (PCA). Second, features from the first fully connected layer of VGG-Net serve as a high-level spatial representation of the moving targets. Third, a small long short-term memory (LSTM) network is designed to encode temporal information. The two-stage features, which respectively characterize the spatial and temporal patterns of satellite scenes, are finally fused for SVSC. A satellite video dataset is built for video scene classification, comprising 7209 video segments and covering 8 scene categories; the videos were acquired by Jilin-1 satellites and Urthecast. The experimental results demonstrate the effectiveness of the proposed framework for SVSC.
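To illustrate the motion-representation step described above, the sketch below projects each pixel's temporal profile onto the second temporal principal component of a frame stack, which tends to highlight moving targets once slowly varying global content is captured by the first component. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name, the eigendecomposition of the temporal covariance, and the use of the projection magnitude as a motion map are all illustrative choices.

```python
import numpy as np

def motion_map_second_pc(frames):
    """Sketch of a temporal-PCA motion map (hypothetical helper, not the paper's code).

    frames: array of shape (T, H, W), grayscale satellite video clip.
    Returns an (H, W) map whose large values indicate temporally active pixels.
    """
    T, H, W = frames.shape
    X = frames.reshape(T, -1).astype(np.float64)  # (T, HW): each column is one pixel's temporal profile
    X -= X.mean(axis=0, keepdims=True)            # center each pixel over time
    C = X @ X.T / max(X.shape[1] - 1, 1)          # (T, T) temporal covariance
    vals, vecs = np.linalg.eigh(C)                # eigenvalues in ascending order
    pc2 = vecs[:, -2]                             # second temporal principal component (length T)
    m = pc2 @ X                                   # project every pixel's profile onto the 2nd PC
    return np.abs(m).reshape(H, W)                # magnitude of the projection as a motion map
```

A perfectly static clip yields a near-zero map (all temporal profiles collapse to zero after centering), so thresholding such a map could serve as a crude moving-target indicator before the VGG-Net and LSTM stages.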
Acknowledgements
This work was supported by the National Natural Science Foundation of Key International Cooperation (Grant No. 61720106002), the Key Research and Development Project of the Ministry of Science and Technology (Grant No. 2017YFC1405100), the National Natural Science Foundation of China (Grant No. 61901141), and the Fundamental Research Funds for the Central Universities (Grant No. HIT.HSRIF.2020010). The authors would like to thank the IEEE GRSS Image Analysis and Data Fusion Technical Committee for providing the Urthecast satellite videos.
Gu, Y., Liu, H., Wang, T. et al. Deep feature extraction and motion representation for satellite video scene classification. Sci. China Inf. Sci. 63, 140307 (2020). https://doi.org/10.1007/s11432-019-2784-4