Introduction

Smart City (SC) technology, which makes our daily lives more connected and informed, is growing rapidly throughout the world. It represents a new framework that integrates information and communication technology (ICT) and Internet-of-Things (IoT) technology to improve citizens’ quality of life (Memos et al. 2018). Nowadays, many organizations and corporations have proposed innovative solutions to support the development of the Smart City, such as NVIDIA’s Metropolis, Siemens’ MindSphere, and Huawei’s OceanConnect. As a crucial component of Smart City technology, IoT-driven and cloud-enabled intelligent transportation systems (ITS) have drawn considerable attention for their potential to make driving safer, greener, and smarter. The ITS market is expected to grow to approximately $130 billion by 2025 (Mishra et al. 2019). Through the USDOT’s research-based initiative “Beyond Traffic 2045: The Smart City Challenge”, the department has expanded its resource pool by more than $500 million to support innovative Smart City projects and improve the performance of the transportation system by addressing issues such as congestion and safety (Gandy and Nemorin 2020).

In traditional ITS, loop detectors and microwave sensors have been the primary choices for gathering traffic information. However, the data collected by these traditional sensors are aggregated on the sensor side, so detailed information is lost, and installing such sensors is also costly (Dai et al. 2019). In addition, traditional video surveillance requires watchstanders to pay close attention to hundreds of screens simultaneously, which is challenging and tedious (Liu et al. 2013). Yet a massive number of traffic surveillance cameras are already installed on the road network across the US to ensure traffic safety. Thus, converting the existing traffic surveillance cameras into connected “smart sensors”, also called intelligent video analysis (IVA), has gained considerable attention over the past several years.

Typically, based on computer vision (CV) techniques, traffic cameras have been considered capable sensors for data extraction and analysis in various ITS applications, such as congestion detection, vehicle counting, and traffic anomaly detection (Chakraborty et al. 2018; Dai et al. 2019; Kumaran et al. 2019). Such ITS approaches mainly consist of vehicle detection and tracking. For vehicle detection, deep learning-based methods have been proposed thanks to the development of convolutional neural networks (CNN), including two-stage methods such as Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2015), and Mask R-CNN (He et al. 2017), and one-stage methods such as YOLOv3 (Redmon and Farhadi 2018), SSD (Liu et al. 2016), and RetinaNet (Lin et al. 2017). For vehicle tracking, many algorithms have been proposed, such as DeepSORT (Wojke et al. 2017), IOU (Bochinski et al. 2018), KLT (Lucas et al. 1981), and DCF (Lukezic et al. 2017). Taking advantage of these computer vision techniques, Dai et al. (2019) proposed a video-based vehicle-counting framework using YOLOv3 and kernelized correlation filters (KCF) as detector and tracker, respectively, which achieved 87.6% counting accuracy at 20.7 fps with a video real-time rate of 1.3 (the ratio of the framework’s run time to the duration of the test videos). Meng et al. (2020) proposed a correlation-matched tracking algorithm combined with SSD for vehicle counting, which reached 93% counting accuracy at 25 fps on an expressway dataset.

However, current research on deep learning-based approaches has focused more on model effectiveness than on efficiency (Liu et al. 2018). Traditional analysis frameworks need to store and process video streams locally, which is infeasible due to intensive computation cost and processing latency, especially for large-scale camera systems. Such offline approaches, with their high latency, lead to delayed decisions, which is not tolerable for IoT-driven ITS applications like accident detection and congestion reduction (Ferdowsi et al. 2019). Hence, by pushing deep learning techniques close to their data sources, edge computing frameworks show promise for real-time intelligent video analysis (Liu et al. 2018). To this end, NVIDIA has developed an AI-powered, high-throughput, scalable inference framework called NVIDIA DeepStream, which enables edge computing techniques for multi-GPU, multi-stream video analysis (NVIDIA Deepstream 2020). In addition, NVIDIA Metropolis (NVIDIA Metropolis 2019), an end-to-end video analysis platform, has been developed to extract valuable data from AI and IoT devices for applications such as frictionless retail and streamlined inventory management.

Next-generation ITS applications must be designed to efficiently support the development of Smart Cities. Thus, the cloud server plays an important role in horizontally integrating edge computing nodes and IoT devices to provide real-time, dynamic message exchange for modern ITS applications (Iqbal et al. 2018; Petrolo et al. 2017). As summarized in Eltoweissy et al. (2019), running ITS applications in the cloud has the following benefits. First, cloud servers give users a massive pool of computation resources, eliminating the need to install and maintain private clusters. Second, cloud servers provide flexible charging plans that allow users to pay for resources on a short-term basis. In addition, a properly configured cloud server provides worldwide remote access that helps traffic agencies, stakeholders, and governmental organizations access and understand their data easily. Therefore, edge-computing-enabled real-time video analysis frameworks deserve further study to support large-scale Smart-City-based ITS applications. Further, cost estimation for operating such cloud-enabled IVA frameworks is needed to help traffic agencies build these systems efficiently.

In this study, we introduce a real-time, large-scale, cloud-enabled traffic video analysis framework using NVIDIA DeepStream and NVIDIA Metropolis, which are recognized as a cutting-edge AI-powered video analysis toolkit and an end-to-end IVA platform, respectively. For this study, hundreds of live video streams recorded by traffic CCTV cameras across Iowa were decoded and analyzed using NVIDIA DeepStream. The model inference results generated by NVIDIA DeepStream were processed and analyzed using the big data platforms Apache Kafka (Kafka 2019), Apache Spark (Spark 2020), and Elasticsearch and Kibana (Elastic 2020) for data transmission, processing, indexing, and visualization, respectively. Toward the goal of efficient access and standardized deployment, each component of the framework is containerized using Docker (Docker 2020). Taking advantage of the benefits of cloud servers, we have deployed the proposed framework on Google Cloud Platform (GCP) and estimated the operating costs in detail from the associated billing reports. We then evaluated the technical feasibility of our proposed framework by measuring its vehicle-counting accuracy. Further, we present the extensibility of our proposed framework and discuss its limitations for supporting future real-time ITS applications.

Contributions The main contributions of this study are (1) we introduce a real-time, large-scale, cloud-enabled traffic video analysis framework using a cutting-edge video analysis toolkit and end-to-end platform; (2) we present detailed computational resource usage and operating costs to help traffic agencies develop such IVA frameworks; and (3) we evaluate the technical feasibility of our proposed framework for real-time, large-scale implementation by discussing its vehicle-counting accuracy under various scenarios.

Outline The remainder of this paper is organized as follows: the next section provides the background and materials for our proposed IVA framework in detail. The subsequent section discusses the capability and efficiency of our proposed framework, followed by an economic assessment of operating the framework on GCP. The technical feasibility of applying this framework is also discussed. Finally, we discuss the potential capabilities and limitations of the proposed framework and the scope of future work.

Background and Materials

In this section, we present the background and materials used in this study. Following recent AI-based IoT projects, especially NVIDIA smart parking (NVIDIA IoT 2019), NVIDIA real-time video redaction (Shah et al. 2020), and real-time analysis of popular Uber locations (Carol 2018), we introduce a practical, end-to-end, cloud-enabled traffic video analysis framework for real-time, large-scale traffic recognition. More specifically, the proposed framework is based on the NVIDIA Metropolis design, especially the NVIDIA smart parking application. The proposed framework consists of two modules, namely a perception module and an analysis module, with end-to-end implementation on GCP as discussed below.

Perception Module

As mentioned in “Introduction”, NVIDIA DeepStream is an accelerated AI-powered framework based on GStreamer, a multimedia processing framework (Gstreamer 2019). The key features of NVIDIA DeepStream can be summarized as follows. First, it provides a multi-stream processing framework with low latency to help developers build IVA frameworks easily and efficiently. Second, it delivers high-throughput model inference for object detection, tracking, etc. It also enables the transmission and integration of model inference results with IoT interfaces such as Kafka, MQTT, or AMQP. Combined with the NVIDIA Transfer Learning Toolkit (TLT), NVIDIA DeepStream provides highly flexible customization for developers to train and optimize their desired AI models.

Fig. 1

Perception module architecture

Therefore, NVIDIA DeepStream is used as our perception module, as shown in Fig. 1. The main components of the perception module are (1) video pre-processing, which decodes and forms batches of frames from multiple video stream sources; (2) model inference, which loads TensorRT-optimized models for vehicle detection; (3) an OpenCV-based tracker, which tracks the detected vehicles to provide detailed trajectories; (4) on-screen display, which draws model inference results such as bounding boxes and vehicle IDs on composite frames for visualization; and (5) a message handler, which converts the model inference results to JSON and transmits them through the IoT interface.
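To make the data flow concrete, the sketch below shows a minimal, single-stream DeepStream-style GStreamer pipeline built with the Python GStreamer bindings. The element names (nvstreammux, nvinfer, nvtracker, nvmsgconv, nvmsgbroker) are standard DeepStream plugins, but the RTSP URL, configuration-file paths, topic name, and property values are placeholders, and exact plugin options vary across DeepStream versions; our deployment uses the DeepStream reference application with multi-stream configuration files rather than a hand-written pipeline.

```python
# Illustrative single-stream sketch of the perception pipeline (not the full
# multi-camera deployment); the URL, file paths, and property values are placeholders.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

pipeline = Gst.parse_launch(
    "uridecodebin uri=rtsp://example-dot-server/cam01 ! m.sink_0 "
    "nvstreammux name=m batch-size=1 width=480 height=270 live-source=1 ! "
    "nvinfer config-file-path=trafficcamnet_config.txt ! "   # TensorRT-optimized detector
    "nvtracker ll-lib-file=libnvds_nvdcf.so "
    "ll-config-file=tracker_config.yml ! "                   # NvDCF low-level tracker
    "nvmsgconv config=msgconv_config.txt ! "                 # metadata -> JSON payload
    'nvmsgbroker proto-lib=libnvds_kafka_proto.so '
    'conn-str="localhost;9092" topic=iva-metadata'           # publish to Kafka
)
# The on-screen-display branch (nvvideoconvert ! nvdsosd ! sink) is omitted here.
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```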

The perception module supports various video stream inputs, such as the Real-Time Streaming Protocol (RTSP), recorded video files, and V4L2 cameras. In this study, we used live video streams from traffic cameras through the public RTSP service managed by the Iowa Department of Transportation. For the primary vehicle detector, we used TrafficCamNet (NVIDIA TLT 2020), which uses ResNet18 as its backbone for feature extraction (He et al. 2016). The model was trained to detect four object classes, namely car, person, roadsign, and two-wheeler. A total of 200,000 images, including 2.8 million car-class objects captured by traffic signal cameras and dashcams, were involved in the training. The model’s accuracy and F1 score, measured against 19,000 images across various scenes, were 83.9% and 91.28%, respectively (NVIDIA TLT 2020).

For the multi-object tracking (MOT) task, occlusion and shadow are inevitable issues, especially for vehicle tracking in complex scenarios. Thus, to recognize traffic patterns more robustly and efficiently, we used the NvDCF tracker, a low-level tracker based on the discriminative correlation filter (DCF) and the Hungarian algorithm. We refer interested readers to NVIDIA Deepstream (2020) for more details.

Analysis Module

As mentioned above, the model’s inference results are delivered to the data transmission interface. Here, the analysis module processes and analyzes the resulting metadata for further ITS applications as shown in Fig. 2.

Fig. 2

Analysis module architecture

First, we used Apache Kafka to handle data transmission at scale. Kafka is a real-time, fault-tolerant, distributed streaming data platform. It takes input data from various sources such as mobile devices, physical sensors, and web servers. The data collected from edge devices (here, traffic cameras) are published to a Kafka broker under predefined Kafka topics. Each Kafka topic is maintained in multiple partitions to achieve high throughput and fault tolerance. In addition, data from multiple sources can be connected to different Kafka brokers, which are coordinated by ZooKeeper.
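As an illustration of how downstream components interact with this interface, the following is a minimal sketch of a Kafka consumer that inspects the JSON metadata published by the perception module; it assumes the kafka-python package, a broker at localhost:9092, and a topic named iva-metadata, and the JSON field names shown are placeholders that depend on the configured DeepStream message schema.

```python
# Minimal sketch of inspecting perception-module messages from Kafka.
# Broker address, topic name, and JSON field names are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "iva-metadata",                       # topic published by the perception module
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    record = message.value
    # Each record carries per-object metadata (e.g., camera ID, object ID, bounding box).
    print(record.get("sensorId"), record.get("object", {}).get("id"))
```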

For large-scale batch data processing in streaming environments, Apache Spark was used in this study. Spark is recognized as a computationally efficient, large-scale data processing platform. Using Resilient Distributed Datasets (RDD) and Directed Acyclic Graphs (DAG), Spark provides a high-speed, robust, fault-tolerant engine for big data processing. In the analysis module, Spark acts as a “consumer,” subscribing to the streaming data from Kafka for further data cleaning, data analysis, trend prediction with machine learning, etc. More specifically, we use Spark Structured Streaming, which provides exactly-once processing built on the Spark SQL engine by treating the live data stream as an unbounded table that is appended to through “micro-batch” processing.
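A minimal sketch of this consumer side in PySpark is shown below; it assumes the spark-sql-kafka connector is available, a topic named iva-metadata, and a deliberately simplified JSON schema (sensorId, timestamp, objectId), none of which are taken from the actual deployment.

```python
# Sketch of the Spark Structured Streaming "consumer" (simplified schema and
# topic/broker names are assumptions for illustration only).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("iva-analysis").getOrCreate()

# Hypothetical, simplified schema for the JSON metadata emitted by the perception module
schema = (StructType()
          .add("sensorId", StringType())
          .add("timestamp", TimestampType())
          .add("objectId", StringType()))

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "iva-metadata")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("m"))
          .select("m.*"))

# Example micro-batch aggregation: detections per camera in 10-minute windows
counts = stream.groupBy("sensorId", window(col("timestamp"), "10 minutes")).count()

query = (counts.writeStream
         .outputMode("update")       # append mode would additionally require a watermark
         .format("console")
         .start())
query.awaitTermination()
```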

Cloud-Enabled Implementation

As mentioned in “Introduction”, deploying a large-scale IVA framework on a cloud server is flexible and powerful, as the cloud provides easy accessibility for traffic agencies to maintain and manage IVA systems remotely. In this study, we implemented our proposed framework on the Google Cloud Platform (GCP), a scalable and powerful cloud computing service. Docker (version 19.03) was used to deploy the proposed framework efficiently. Each component mentioned in “Perception module” and “Analysis module” has been containerized and is managed by Docker and Docker Compose. The entire proposed framework is shown in Fig. 3.

Fig. 3

Overall intelligent video analysis (IVA) framework

In summary, the perception Docker container takes the live video streams from multiple cameras as inputs through RTSP. The decoded video frames are then processed by the trained AI model, and the model inference results are sent to the analysis Docker container for further data processing and visualization. The entire framework operates on GCP and can be accessed remotely from local machines for maintenance, inspection, diagnosis, etc.

Case Studies and Discussion

In this study, 160 traffic surveillance cameras were involved. We accessed their live video streams using the public RTSP service managed by the Iowa Department of Transportation (Iowa DOT). These live videos are encoded following the H.264 standard with a resolution of \(480 \times 270\). In the following sections, we present in detail the resource usage, operating costs, and technical feasibility of the proposed framework running on GCP.

Economic Assessment

The GCP instance we used belongs to the N1 series powered by the Intel Skylake CPU platform. We used 8 virtual CPU (vCPU) cores with 30 GB of memory and the Ubuntu 18.04 LTS operating system for evaluation. In addition, we used 2 NVIDIA T4 GPUs for video analysis and a 3-TB persistent disk (PD) for data storage.

As mentioned in “Perception module”, the perception module can take multiple video streams as inputs, so developers must tune the number of video streams (traffic cameras) processed simultaneously on each GPU to balance performance and cost. We used built-in NVIDIA DeepStream functionality to determine the processing latency of the three main components for various numbers of traffic cameras. The results of this latency analysis are presented in Fig. 4.
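For reference, DeepStream's built-in latency measurement can be toggled before the pipeline is launched; a minimal sketch is shown below, where the environment-variable names follow the DeepStream 5.x documentation and should be verified against the installed version.

```python
# Enable DeepStream's built-in latency measurement before starting the pipeline
# (variable names as documented for DeepStream 5.x; treat them as assumptions here).
import os

os.environ["NVDS_ENABLE_LATENCY_MEASUREMENT"] = "1"            # frame-level latency
os.environ["NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT"] = "1"  # per-plugin latency
```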

Fig. 4

Latency for the perception module’s main components when simultaneously running different numbers of cameras per GPU

To process data in near real time, we assigned 80 cameras to each T4 GPU so that the overall processing latency caused by the perception module was less than 1 s. The 160 cameras were thus split across two T4 GPUs, with average GPU memory usage of around 65%. In other words, the number of cameras per GPU can be decided by a traffic agency based on its tolerance for processing delay, as long as the total GPU memory is not exceeded.

In the perception module, six consecutive frames are skipped at the model inference stage for computational efficiency. Coexisting CUDA 10.2 and CUDA 10.1 installations, GPU driver version 418+, and TensorRT 7.0 were used to improve the performance of the NVIDIA T4 GPU according to the T4 workaround specification (NVIDIA Deepstream 2020), and Kafka 2.4, Spark 2.4.5, and Elasticsearch/Kibana 7.5.2 were used in the analysis module. In Table 1, we summarize the computational resource usage for running our framework on GCP.

Table 1 Summary of computational resources usage on GCP

To measure the operating costs in detail, we ran the proposed end-to-end framework on GCP for 5 days. GCP provides a sustained use discount based on the monthly usage level. In Fig. 5a and b, we show the average daily cost with and without the sustained use discount. The daily operating cost for 160 cameras was $21.59 with the sustained use discount (assuming such a traffic video analysis framework would operate over the long term). The GPUs and vCPUs were the two major costs, accounting for 46.4% and 19.7% of the total, respectively, while persistent storage, network, and RAM together accounted for 33.9%. Thus, the daily cost of the proposed framework was $0.135 per camera, corresponding to a yearly cost of about $49.30 per camera. In Iowa, there are 390 operating traffic surveillance cameras on the road network; turning all of these surveillance cameras into connected “smart sensors” would cost about $1601.40 per month using such a cloud-enabled system, with the benefit of eliminating additional infrastructure costs.
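The per-camera and statewide figures above follow directly from the measured daily cost; a back-of-the-envelope sketch of the scaling (using the discounted $21.59/day figure and an average month length, with actual GCP pricing varying by region and over time) is:

```python
# Back-of-the-envelope scaling of the measured operating cost (illustrative only).
daily_cost_160 = 21.59                           # USD/day for 160 cameras (with discount)
per_camera_day = daily_cost_160 / 160            # ~0.135 USD
per_camera_year = per_camera_day * 365           # ~49.3 USD
statewide_month = per_camera_day * 390 * 30.4    # ~1600 USD for all 390 Iowa cameras

print(f"per camera per day:  ${per_camera_day:.3f}")
print(f"per camera per year: ${per_camera_year:.2f}")
print(f"statewide per month: ${statewide_month:.2f}")
```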

Fig. 5

Daily operating cost for 160 cameras a with sustained use discount and b without sustained use discount

Feasibility Assessment

In this section, we assess the feasibility of our proposed framework by measuring its vehicle-counting accuracy against manual inspection and counts from nearby radar sensors. Twelve cameras with different viewing angles were selected for this evaluation, as shown in Fig. 6. The selected cameras are representative, since they cover the most common camera viewing angles across Iowa.

Fig. 6

Sampled cameras with different viewing angles

For each camera, we manually counted the vehicles that passed a predefined region of interest (ROI) for 5 min on a sunny day and 4 min on a rainy/cloudy day, around 11 am, to obtain the ground truth; thus, 108 min of video were involved in our evaluation. The ROI for these cameras was set from 90 to 260 pixels, counted from the bottom left along the y-axis. For instance, the ROIs for camera 1 and camera 2 spanned from 10 to 90 pixels and from 10 to 200 pixels along the y-axis, respectively. The counting accuracy of our vehicle-counting application was defined as

$$\begin{aligned} C_\mathrm{r} = \frac{C_\mathrm{g} - |C_\mathrm{g} - C_\mathrm{e}|}{C_\mathrm{g}}, \end{aligned}$$
(1)

where \(C_\mathrm{r}\), \(C_\mathrm{g}\), and \(C_\mathrm{e}\) denote the correct counting rate, the ground-truth count, and the count estimated by the proposed framework, respectively. The results of this evaluation of our vehicle-counting application for different cameras under both sunny and cloudy/rainy conditions are shown in Table 2.

Table 2 Experimental results for different cameras in different environments
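For clarity, the sketch below illustrates, under an assumed trajectory data structure, how a count can be derived from the tracker output (unique vehicle IDs whose trajectories enter the ROI band along the y-axis) and how the correct counting rate of Eq. (1) is computed; the data format and example numbers are hypothetical.

```python
# Hypothetical trajectory format: {vehicle_id: [(x, y), ...]} in image coordinates.
def count_vehicles_in_roi(trajectories, roi_y_min, roi_y_max):
    """Count unique tracked vehicles whose trajectory enters the ROI band."""
    return sum(
        any(roi_y_min <= y <= roi_y_max for _, y in points)
        for points in trajectories.values()
    )

def correct_counting_rate(c_ground_truth, c_estimated):
    """Correct counting rate C_r from Eq. (1)."""
    return (c_ground_truth - abs(c_ground_truth - c_estimated)) / c_ground_truth

# Illustrative values only: 95 vehicles counted manually, 89 estimated -> C_r ~ 0.937
print(correct_counting_rate(95, 89))
```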

In addition, we show trajectory plots for two sample cameras under the sunny and rainy day scenarios in Fig. 7. For sunny days, the results show that the proposed framework performs as expected, achieving an average counting accuracy of 90.1%, with the exceptions of cameras 3 and 12. During the test periods, camera 3 suffered from network lag, so the video frames transmitted to the perception module were nonconsecutive. Such lag causes the tracker to repeatedly assign new labels to the same vehicle. For camera 12, the perception module barely captured the passing vehicles, which indicates that the model needs to be fine-tuned for such distant viewing angles. In addition, the results show that the counting accuracy was reduced on rainy days for cameras 6, 9, and 10, which, as shown in Fig. 7d, were exposed to the rain.

Fig. 7

Camera views and trajectory plots on sunny and rainy days: a Camera 5 sunny day, b Camera 5 rainy day, c Camera 9 sunny day, d Camera 9 rainy day

To further measure the feasibility of our proposed framework, we compared the daily volume patterns captured by our test cameras with those from nearby microwave radar sensors, as shown in Fig. 8. However, it is worth mentioning that matching the detection areas of camera sensors and radar sensors point by point is difficult, since our cameras were zoomed into ROIs that included ramps, interchanges, and minor roads, as noted above. Therefore, the overall traffic pattern comparison is more meaningful in showing the agreement between these two types of sensors.

Fig. 8

Daily traffic volume data (10-min aggregation) captured by cameras and nearby radar sensors on Monday, June 1st

The data in the daily volume plots were aggregated in 10-min intervals. For camera 1 and camera 11, the volume trend after 3 pm is not presented, since these cameras are turned to face the opposite direction after 3 pm, which is beyond the scope of this study. The comparison for camera 2, camera 5, and camera 7 was affected by nearby interchanges and ramps, especially for camera 7, whose closest radar sensor had failed (so we used the radar sensor located two interchanges away instead). In addition, for camera 11, vehicles on a minor road far from the camera were missed. Similarly, as described earlier, our model only occasionally captured passing vehicles in the distant view of camera 12. In Tables 3 and 4, we aggregate the data in 3-h intervals and present the differences using the absolute error (AE) and relative error (RE), defined as:

$$\begin{aligned} AE = |V_\mathrm{camera} - V_\mathrm{radar}| \end{aligned}$$
(2)
$$\begin{aligned} RE = \frac{|V_\mathrm{camera} - V_\mathrm{radar}|}{V_\mathrm{radar}} \end{aligned}$$
(3)

where V represents the 3-h aggregated traffic volumes.
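As a worked illustration of Eqs. (2) and (3), with made-up volumes rather than values from Tables 3 and 4:

```python
def absolute_error(v_camera, v_radar):
    """Absolute error (AE) between 3-h aggregated volumes, Eq. (2)."""
    return abs(v_camera - v_radar)

def relative_error(v_camera, v_radar):
    """Relative error (RE) between 3-h aggregated volumes, Eq. (3)."""
    return abs(v_camera - v_radar) / v_radar

# Illustrative values only: camera counts 1180 vehicles, radar reports 1250
print(absolute_error(1180, 1250))   # 70
print(relative_error(1180, 1250))   # 0.056
```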

Overall, the traffic patterns captured by the test cameras using computer vision techniques and by their nearby radar sensors are comparable, except for camera 12, where, as noted earlier, model fine-tuning for distant viewing angles is needed. The results shown in Fig. 8 and Table 4 also demonstrate that the model does not perform well at night (between 9 pm and 6 am), which indicates that the model also requires fine-tuning for night-time applications.

Table 3 Absolute error (AE) of daily volume between cameras and radar sensors
Table 4 Relative error (RE) of daily volume between cameras and radar sensors

As mentioned above, one advantage of our proposed framework is its large-scale production of vehicle trajectories in real time, which has potential for use in ITS applications such as anomaly detection, speed violation detection, and wrong-way detection. In Fig. 9, we present the extensibility of our proposed framework by showing three anomalous events on a freeway together with the corresponding image-coordinate moving speeds calculated from the associated vehicle trajectories. The mean and minimum image-coordinate moving speeds were calculated from all detected vehicles and their trajectories during a given time period (say, 1 s). It is clearly observable that, by applying appropriate thresholds or more advanced algorithms, such real-time trajectories generated by our proposed framework could help traffic agencies quickly address anomalous events.
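As one hypothetical example of such a threshold rule (not the method used to produce Fig. 9), a stopped-vehicle candidate could be flagged when the per-second minimum image-coordinate speed stays below a threshold for a sustained period:

```python
# Hypothetical threshold rule over per-second image-coordinate speeds derived from
# trajectories; the threshold and duration values are arbitrary placeholders.
def flag_possible_stoppage(min_speeds_px_per_s, speed_threshold_px=2.0, min_duration_s=10):
    """Flag an anomaly if the per-second minimum image speed stays below the
    threshold for at least min_duration_s consecutive seconds."""
    run = 0
    for speed in min_speeds_px_per_s:      # one value per second
        run = run + 1 if speed < speed_threshold_px else 0
        if run >= min_duration_s:
            return True
    return False
```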

Fig. 9

Anomalous events with image coordinates’ moving speed

For long-term, real-world operations, it should be noted that GCP currently does not support live migration for cloud instances with GPUs during host maintenance, which, as observed in our experiments, typically occurs once a week. In other words, running programs will be disrupted automatically by the GCP system for host maintenance. Therefore, to minimize the disruption of their programs, users need to prepare for their workload’s transition through this system restart by monitoring the maintenance schedule, which is usually announced 1 h ahead of system termination.

Conclusions and Future Work

In this study, we have introduced a real-time, large-scale, cloud-enabled traffic video analysis framework using NVIDIA DeepStream and NVIDIA Metropolis, and we have evaluated the proposed framework from two perspectives: economics and feasibility. The proposed framework consists mainly of a perception module and an analysis module. The perception module takes multiple live video streams as inputs and outputs the inference metadata produced by the embedded AI model. These metadata are transmitted to the analysis module through Kafka. In the analysis module, Spark consumes the data and forms a dynamic unbounded table for batch processing. Finally, the processed data are indexed and visualized by Elasticsearch and Kibana.

Our study demonstrates that the proposed framework is both economically efficient and technically feasible. From the economic perspective, the results show that the daily operating cost for each camera is less than $0.14, so the yearly operating cost per camera is less than $50. In addition, we have presented the processing latency of our framework’s main components and its usage of cloud computational resources, which can help developers design AI-powered traffic video analysis frameworks according to their requirements. From the perspective of technical feasibility, we have measured the accuracy of a vehicle-counting application using live video streams recorded by 12 cameras with different viewing angles. The results show that, for most of the tested viewing angles, the proposed framework produces the expected vehicle-counting results compared with both manual inspection and counts from nearby microwave sensors. Further, we have demonstrated the potential of our proposed framework for future real-time ITS applications by showing sample anomalous events with image-coordinate moving speeds calculated on a per-second basis.

It should be noted that such real-time ITS applications will likely always be subject to network lag. As discussed in “Case studies and discussion”, the nonconsecutive frames caused by network lag affect the performance of our proposed framework for both vehicle detection and tracking. Thus, developers need to work with their local department of transportation to optimize network bandwidth and address network lag problems. In addition, in our experiments, GCP cloud instances restarted once a week for host maintenance, so a proper transition plan is needed for long-term operations. Finally, model fine-tuning is required to improve performance for scenarios such as distant viewing angles, nighttime, and rainy days.

Future work will pursue (1) collecting more images from different viewing angles to fine-tune the current vehicle detection model and thus improve our framework’s performance in various environments, (2) deploying a real-time alert system to help traffic operation centers address anomalous events quickly, (3) calibrating the image coordinates of detected vehicles against the real-world geometric coordinates of the roadway network for more precise travel speed estimation and data visualization, and (4) creating modules using machine learning libraries for prediction applications such as travel speed estimation and congestion prediction.