1 Introduction

Today, video surveillance using closed-circuit television (CCTV) has become an essential tool in many security-related fields and is widely used in public spaces, not only for monitoring traffic and human behavior but also for preventing crime and disaster. With increasing security threats and crime rates worldwide, the video surveillance market continues to grow in various sectors [2]. In addition, as video surveillance systems have become widespread, big data and the power to process it have become equally important. Recent technologies such as artificial intelligence and big data infrastructure play a major role in improving the scalability and accuracy of video surveillance solutions. Therefore, research into intelligent surveillance systems is becoming an important area for disaster and crime prevention as well as industrial and public security. Several models have been proposed for intelligent traffic video surveillance and accident identification [43], along with enhanced algorithms for improving the image resolution of intelligent video systems [20]. However, few studies have explored video surveillance systems with cutting-edge technologies such as deep learning [12, 23].

Moreover, although video surveillance systems are widely used in public and other areas, especially for observing people and traffic, they are not yet widely used for crime and disaster surveillance and prevention [44]. Criminal investigations and disaster observations still rely on analyzing objective data collected from video surveillance systems after the fact. In other words, video surveillance systems are used passively, mostly as evidence after a crime has occurred. In addition, existing video surveillance systems require a great deal of labor and cost to operate because the video must be continuously monitored by people in real time to be useful [16, 17]. For better crime detection and prevention, video surveillance systems should respond to incidents actively and process images in real time in a cost-effective manner. This can be achieved with big data and artificial intelligence models.

Therefore, this study proposes an intelligent video surveillance system that monitors actively in real time without relying on direct human observation. To address the problems of existing video surveillance systems, an artificial intelligence server and a video surveillance camera are set up, and deep learning is applied in the data processing model to detect crimes and visualize the results. In addition, the design allows crimes to be detected quickly and effectively by sending video images and notification messages to the web through real-time processing. Related work on video surveillance systems and deep learning is discussed in the following sections, before the system configuration and development process are described. The insights and contributions of this design are also addressed.

2 Intelligent video surveillance system

Closed-circuit television (CCTV), often known as video surveillance, is a system that automatically identifies particular objects and behaviors through software [13]. Advanced video surveillance has also been described as “an intelligent video processing technique designed to assist security personnel by providing reliable real-time alerts and to support efficient video analysis for forensic investigations” [31]. CCTV uses video cameras to transmit a signal to a specific place, on a limited set of monitors [14]. The camera captures light and converts it into an electrical signal, which is processed as a video signal for recording and display on a screen [5]. Recently, the use of video surveillance systems for public surveillance has increased greatly across many areas and sectors, and the market has been growing continuously. For example, video surveillance can be used for crime prevention, traffic monitoring, employee monitoring, and even sporting events. Likewise, image processing-based intelligent robotic systems have been used in various sectors: they can detect crop diseases in agriculture and count sperm to assess sperm health in the medical field [29, 45]. Some research has classified intelligent video surveillance systems by their desirable characteristics and features: object detection, tracking and movement analysis, abnormal behavior detection, vehicle detection and traffic analysis, object counting, privacy preservation, etc. [46].

Because they rely entirely on human intervention, traditional monitoring systems have several disadvantages, such as high labor costs, long-term capture, and a limited ability to monitor several screens simultaneously. Conventional surveillance systems have therefore been replaced with advanced intelligent systems that better capture abnormal behavior and unidentified patterns, using technologies such as artificial intelligence and pattern recognition. Moreover, as technology has advanced, video surveillance has evolved along with trends such as big data infrastructure, artificial intelligence, and neural network systems [35]. The video surveillance market has shifted from traditional analog video to IP video surveillance systems that incorporate new technologies and improve processing power. In particular, advances in artificial intelligence through deep learning are used directly in video surveillance systems, enabling proactive prediction of security incidents and predictive analysis so that responses can be prepared in advance. Intelligent video surveillance systems provide flexible control over the rate of video data collection and increase that rate whenever an indicator of a security incident is detected, providing more information and enabling accurate and reliable analysis. Big data also gives video surveillance systems a new way to store and access video data seamlessly and cost-effectively. Surveillance systems that apply these advanced technologies can automatically and accurately monitor objects and actions, even with fewer observers.

Several types of architectural model for image captioning are used in surveillance systems, including image feature-based and R-CNN object-based models. The semantic-based architectural model uses techniques such as artificial neural networks (ANN), support vector machines (SVM), and Bayesian networks (BN) [18, 30, 38]. Convolutional neural networks (CNN) are the most popular type of network used in deep learning, with models such as VGG, AlexNet, and Inception being the most predominant. Table 1 shows the types of architecture for image captioning with different features [40].

Table 1 Architecture type of image captioning

Despite the increasing demand for intelligent video surveillance systems, there are conflicting issues to deal with when using them in public spaces, such as high traffic density, heterogeneity, and privacy. Privacy in particular has been a subject of ongoing debate as surveillance systems embed more high-end intelligent technology [3]. People like to feel safe, but at the same time they generally dislike having their activities watched everywhere. Therefore, when developing intelligent video surveillance, care needs to be taken to capture the objects of surveillance while protecting people’s privacy [47]. This development aims to provide real-time and objective data through active monitoring using artificial intelligence to prevent crimes and disasters.

3 Deep learning technology

Activity across various networks, including social media, streaming video and images, and video surveillance systems, produces a tremendous amount of data every second, so-called big data. Handling big data requires computing complex functions and building complex hierarchies of concepts using deep learning and sophisticated algorithms [1]. Moreover, the big data from video surveillance systems placed in almost every corner of public places requires substantial data warehouses. However, storage space could be reduced if only the analysis results need to be stored. Deep learning is required to effectively achieve various objectives in different fields that involve big data [40]. Thus, deep learning techniques have drawn significant attention in recent years for their ability to solve complicated problems with high accuracy on the big data generated by the growing number of public video surveillance systems.

Deep learning is “a machine learning technique with artificial neural networks and representation learning” that uses multiple layers in the network to optimize performance [4, 37]. Deep learning techniques involve both training and learning components [39]. They rely on neural networks and learning algorithms together with big data and powerful computational resources. That is, deep learning techniques apply multiple layers of neural network algorithms to raw input data, with successive layers representing progressively higher-level information. In general, the more layers, the finer the model and the higher the performance [22].

Several model architectures are used for object detection, including R-CNN, Fast R-CNN, and Faster R-CNN. Convolutional neural networks (CNN) are among the most popular types of neural networks used in deep learning [19]. When applied to detection, a CNN divides the image into multiple regions and classifies each region into one of several classes. However, because many regions are needed for an accurate prediction, this approach is inefficient and its computation time grows with the volume of data. The R-CNN model was proposed to reduce the number of regions that must be evaluated, and thereby the processing time, while improving accuracy and efficiency [34]. R-CNN stands for regions with CNN features: it combines region proposals with convolutional neural networks. It still incurs high computation time because it extracts around 2000 region proposals from each image [9]. It is widely used to extract visual information from multiple visual data sources. However, training is expensive in space and time, and object detection still requires high computation time.

Fast R-CNN is a Fast Region-based Convolutional Network model that uses deep convolutional networks for object detection, with higher training and testing speed and higher detection accuracy. It is implemented in Python and takes an image as input. The image is then processed by convolutional layers to create a convolutional feature map [11]. Lastly, Faster R-CNN is a canonical model for deep learning-based object detection that replaces the selective search method with a region proposal network (RPN). It is fast and among the best-performing models for detecting objects [24].

Deep learning techniques have been widely used in various fields, including fraud detection, bioinformatics, speech and image recognition, and 3D point clouds [10], as well as organizational strategy and customer relationship management. Smartphones and video cameras are essential parts of connected networks. The prevalence of images, video, and audio in social media, streaming analytics, and web browsing has made it necessary to produce and process massive amounts of data. Computing such complex features requires knowledge of deep learning networks and the ability to develop complex hierarchies of concepts using sophisticated algorithms. A solid working knowledge of deep learning techniques, types, and applications helps users apply it for various purposes. With unlabeled data, conventional machine learning may not always be feasible because manual labeling is expensive and time-consuming. Deep learning networks are designed to help overcome these issues. In short, deep learning provides better performance on many problems with complete automation [7]. As seen in Table 2, fields such as healthcare, human behavior, and accident and disaster management are paying close attention to deep learning models and applying them.

Table 2 Deep learning application in various fields

4 Design of an intelligent video surveillance system

4.1 System configurations

Figure 1 depicts the system configuration. For this study, a Raspberry Pi camera is used together with a GPU server in place of an existing video surveillance system.

Fig. 1 System configuration

The system environment includes Python Flask for the web server, Python TensorFlow for deep learning, and Python sockets for communication with the Raspberry Pi. The GPU server has four vCPUs, 30 GB of memory, and one Tesla P40 GPU with 24 GB of memory.

A. Raspberry Pi (camera)

The Raspberry Pi plays the same role as the camera in an existing video surveillance system. It captures images and transmits the camera image frames.

B. GPU server

The GPU server performs three functions: socket communication with the Raspberry Pi, the deep learning algorithm for automatic recognition and notification, and the web server. Each function runs in its own thread within the GPU server, and the three threads operate simultaneously, as shown in the thread configuration of the GPU server in Fig. 2.

Fig. 2 GPU server thread configuration
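
The three-thread layout described above could be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function names (`receive_frames`, `detect_objects`, `serve_web`, `run_pipeline`) and the use of `queue.Queue` to hand data between threads are assumptions, and each worker is a stub standing in for the real socket, TensorFlow, and Flask logic.

```python
import queue
import threading

# Hypothetical hand-off queues between the three threads (an assumption;
# the paper does not say how the threads share data).
frames = queue.Queue()
results = queue.Queue()
STOP = None  # sentinel telling a consumer thread to finish


def receive_frames(source):
    """Thread 1 stub: stands in for the socket receiver."""
    for frame in source:
        frames.put(frame)
    frames.put(STOP)


def detect_objects():
    """Thread 2 stub: stands in for the deep learning detector."""
    while True:
        frame = frames.get()
        if frame is STOP:
            results.put(STOP)
            break
        results.put((frame, "detections for " + frame))


def serve_web(sink):
    """Thread 3 stub: stands in for the web server that publishes results."""
    while True:
        item = results.get()
        if item is STOP:
            break
        sink.append(item)


def run_pipeline(source):
    """Start the three workers, wait for them, and return what was published."""
    published = []
    workers = [
        threading.Thread(target=receive_frames, args=(source,)),
        threading.Thread(target=detect_objects),
        threading.Thread(target=serve_web, args=(published,)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return published
```

Calling `run_pipeline(["f1", "f2"])` pushes two placeholder frames through all three stages. Queues are used here because they let each stage run at its own pace without explicit locking.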

4.2 Deep learning algorithm

The first thread handles socket communication with the Raspberry Pi. The server socket listens on port 5001, and when data is received it is converted into an image. The converted image is used as input to the deep learning algorithm (Fig. 3). The Raspberry Pi performs TCP socket communication by connecting to port 5001 on the public IP address of the GPU server. The data transmitted between the Raspberry Pi and the GPU server is the real-time image frame converted into bytes.

Fig. 3 Python socket communication receipt
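
A receiver for byte-encoded frames on port 5001 might look like the sketch below. Because TCP delivers a byte stream rather than discrete messages, the sketch assumes each frame is preceded by a 4-byte big-endian length; the paper does not describe its framing, so this prefix and the helper names (`recv_exact`, `send_frame`, `recv_frame`, `serve_once`) are assumptions. Decoding the bytes into an image is left out.

```python
import socket
import struct

HEADER = struct.Struct(">I")  # assumed 4-byte big-endian length prefix


def recv_exact(conn, n):
    """Read exactly n bytes from a TCP connection, or return None if the peer closed early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            return None
        buf.extend(chunk)
    return bytes(buf)


def send_frame(conn, payload):
    """Camera side: send one encoded frame preceded by its length."""
    conn.sendall(HEADER.pack(len(payload)) + payload)


def recv_frame(conn):
    """Server side: receive one length-prefixed frame, or None at end of stream."""
    header = recv_exact(conn, HEADER.size)
    if header is None:
        return None
    (length,) = HEADER.unpack(header)
    return recv_exact(conn, length)


def serve_once(host="0.0.0.0", port=5001):
    """Accept one camera connection and yield its frames until it closes."""
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn:
            while (frame := recv_frame(conn)) is not None:
                yield frame
```

On the camera side, the same `send_frame` helper would be called once per captured frame after connecting to the server's address and port.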

The second thread runs the deep learning algorithm, which uses the Inception V3 model based on Faster R-CNN. Faster R-CNN is regarded as a state-of-the-art solution for detecting objects in images with high accuracy and reliability [27]. We have therefore used it as the basis for the design of the intelligent video surveillance system.

The model forms a hierarchy, as shown in Fig. 4, and has a very high recognition rate for images. It can recognize people and various objects when trained on the COCO dataset, a large-scale dataset for object detection, segmentation, and captioning. However, specific objects such as deadly weapons and fires are not covered by the COCO dataset. To recognize them, the model has to be trained directly on additional labeled images.

Fig. 4 Inception V3 model

The processed data that serves as input to the model is generated as shown in Fig. 5. To prepare training data for the deep learning algorithm, the relevant objects in each image were located and labeled, as shown in Fig. 5, and the annotations were saved as XML files. The XML files were converted to a CSV file, and then the existing checkpoint (ckpt) was loaded and the data was converted to a new TFRecord. Using the converted TFRecord, training was repeated multiple times so that the model could recognize the various weapons and fires that are essential to the system. Training used the following settings to improve the detection rate.

  • batch size = 1;

  • Repeat 900,000 times with a learning rate of 0.00003;

  • Repeat 1,200,000 times with a learning rate of 0.00003.

Fig. 5 Data image processing diagram
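
The XML-to-CSV step in this pipeline could be sketched as below. The paper does not name the annotation format, so the sketch assumes the Pascal VOC layout produced by common labeling tools (`filename`, `size`, and one `object` element per bounding box); the column names and function names are likewise assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layout for the intermediate CSV file.
FIELDS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def xml_to_rows(xml_text):
    """Turn one annotation file (assumed Pascal VOC layout) into one row per labeled box."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append({
            "filename": filename,
            "width": width,
            "height": height,
            "class": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return rows


def rows_to_csv(rows):
    """Serialize annotation rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Each CSV row then carries everything needed to build one training example (image file, image size, class label, and box corners) in the later TFRecord conversion.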

The developed deep learning algorithm runs in its own thread alongside the socket communication, so that real-time image frames received through the socket can be analyzed and objects detected immediately. As a result, the detection rate increased to as much as 99 % with deep learning. Figure 6 shows fire being detected at 99 % by the deep learning model.

Fig. 6 Deep learning with 99 % chance of fire detection
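
A detector of this kind typically returns a label and a score for each detected object, and a simple filter can decide which detections count as incidents. The sketch below is illustrative only: the detection format (a list of dictionaries with `label` and `score`), the class names, and the 0.9 threshold are assumptions, not values taken from the system.

```python
ALERT_CLASSES = {"fire", "weapon"}  # hypothetical label names
MIN_SCORE = 0.9                     # hypothetical confidence threshold


def select_alerts(detections, classes=ALERT_CLASSES, min_score=MIN_SCORE):
    """Keep detections of alert classes whose score meets the threshold, highest score first."""
    hits = [
        d for d in detections
        if d["label"] in classes and d["score"] >= min_score
    ]
    return sorted(hits, key=lambda d: d["score"], reverse=True)
```

Raising the threshold trades missed incidents for fewer false alarms, so in practice the value would be tuned on validation footage.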

4.3 Web site application

The third thread is a web server, with Python’s Flask as the back-end technology. Flask opens a web server that is connected to the socket communication and the deep learning algorithm. As shown in Fig. 7, the Flask web server was opened to transmit the image frames on which deep learning detection has been performed. It was also configured to stream video by repeatedly sending the image frames as they arrive in real time.

Fig. 7 Web server creation using Python flask
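
One common way to stream video by repeatedly sending image frames over HTTP is a `multipart/x-mixed-replace` response, in which each part is a JPEG image that replaces the previous one in the browser. The paper does not state that this is the mechanism used, so the sketch below is an assumption; the generator name `mjpeg_parts` and the boundary string are hypothetical.

```python
BOUNDARY = b"frame"  # assumed boundary string


def mjpeg_parts(jpeg_frames, boundary=BOUNDARY):
    """Wrap each JPEG-encoded frame as one part of a multipart/x-mixed-replace stream."""
    for jpeg in jpeg_frames:
        header = b"".join([
            b"--", boundary, b"\r\n",
            b"Content-Type: image/jpeg\r\n",
            b"Content-Length: ", str(len(jpeg)).encode("ascii"), b"\r\n\r\n",
        ])
        yield header + jpeg + b"\r\n"
```

In a Flask view, a generator like this can be returned as `Response(mjpeg_parts(frames), mimetype="multipart/x-mixed-replace; boundary=frame")`, so the browser keeps the connection open and redraws the image as each part arrives.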

Finally, image data streamed to the server from the Raspberry Pi is processed as shown in Fig. 8. The processed data is passed through the Faster R-CNN model to produce the final result. If anything unusual is detected, the system sends push notifications to the application via the FCM broker.

Fig. 8 Data processing flow
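
Since the detector runs on every frame, a notification step usually needs some rule for when to push an alert, or a single incident would generate one message per frame. The sketch below shows one possible rule, a per-label cooldown. The cooldown, the message fields, and the class name `AlertNotifier` are all assumptions; the actual push is delegated to a caller-supplied `send` function so that no particular broker API is implied.

```python
class AlertNotifier:
    """Send at most one notification per label within a cooldown window."""

    def __init__(self, send, cooldown_seconds=30.0):
        self._send = send                  # caller-supplied callable; stands in for the push broker
        self._cooldown = cooldown_seconds  # hypothetical value
        self._last_sent = {}               # label -> time of the last notification

    def handle(self, label, score, now):
        """Notify for this detection unless the same label was sent recently.

        Returns True if a notification was sent, False if it was suppressed.
        """
        last = self._last_sent.get(label)
        if last is not None and now - last < self._cooldown:
            return False
        self._last_sent[label] = now
        self._send({"title": f"{label} detected", "score": round(score, 2), "time": now})
        return True
```

Passing the current time in as `now` (for example from `time.monotonic()`) keeps the class easy to test and independent of the wall clock.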

The web application is developed using JavaScript and the React framework. With the application, users can watch real-time images from the video surveillance system and receive a notice in case of an emergency. Figure 9 shows the main screen of the application, which allows users to view images from the video surveillance system in real time and to receive notifications in particular situations. It also allows users to check a specific zone: a map is displayed by region or for the inside of a building, and users can view the video by clicking a particular image icon. As shown in Fig. 10, if the intelligent video surveillance system recognizes a crime or disaster situation, the app saves the case, pushes a notification with an image, and sends it as a text message. The Fig. 10 screen shows the detection of a crime scene and the notification message, with an image, being sent to an individual.

Fig. 9 Main screen

Fig. 10 Crime and disaster situations

A growing number of proposed systems use deep learning models to enhance detection speed and accuracy. As shown in Table 3, CNN-based models have been proposed for various surveillance systems but show limitations in real-time detection speed, accuracy, and efficiency. Previous research with a similar approach also reports a trade-off between speed and accuracy [12, 23, 27]. The proposed model, however, strikes a good balance between speed and accuracy in detecting crime and disaster. It shows up to 99 % accuracy and transmits images and notifications to the user’s application in real time. Unlike other proposed systems, it allows incidents to be identified and detected immediately from real-time image data received through socket communication. Video streaming is also possible while continuously transmitting the real-time image frames processed by the deep learning algorithm. By providing real-time notifications to the user’s application, crimes can be prevented more proactively. This system may not be the simplest or fastest way to detect an object, but it may be one of the best-performing and most accurate models for crime detection.

Table 3 Comparison with other systems using a deep learning model

5 Discussion and conclusion

Deep learning has drawn attention for its ability to analyze big data accurately. However, its use in video surveillance remains underexplored, and the underlying theory behind such systems is still not well understood. Incorporating a deep learning model into a video surveillance system could solve a wide variety of problems with manual systems and provide significant assistance in preventing disaster and crime. This research proposed an intelligent video surveillance system that actively detects crimes and helps protect against them. The key to the proposed system is a deep learning algorithm assigned to the distributed servers. The proposed model suggests that if deep learning is applied to servers linked to the notification system, crime and disaster notifications can be delivered faster and with high accuracy through enriched information analysis. With a deep learning model on the servers of the video surveillance system, the system provides higher image processing speed and accuracy. It also sends images to web applications at the same time to notify users of any accidents or crimes. In this way, crimes and disasters can be detected faster and further preventive action can be taken.

Even though deep learning has been utilized in various research areas, the design of video surveillance systems with deep learning still needs to be explored. This study opens new ways of developing surveillance systems that perform better at crime prevention. It also provides insights into the application and use of deep learning by showing that it is important to create and implement an architecture suited to the video surveillance infrastructure. That architecture depends on the decisions of developers and managers who fully understand the purpose, use, and cost of designing video surveillance systems. Moreover, objective data from surveillance systems such as CCTV is widely used to analyze crime and disaster, but it is still insufficient for preventing crime and disaster in advance through real-time analysis. The intelligent video surveillance system makes it possible to capture and predict serious incidents by analyzing real-time images without human intervention. This could provide a cost-effective and efficient way to ensure safety. However, there are conflicting concerns around privacy and security, in that more robust surveillance for the sake of security can threaten privacy [26]. Several techniques have been developed and studied to identify the security and privacy issues in video surveillance systems [32]. Developers therefore also need to keep track of negative responses to intelligent surveillance systems, such as privacy and social justice risks [42]. Also, since all advanced technologies have drawbacks as well as advantages, ethical issues such as who will manage and control the algorithms of artificial intelligence surveillance systems once their use becomes common will need attention in future research.