
1 Introduction

Video surveillance applications over embedded vision systems have grown rapidly in recent years. These systems use background subtraction as one of the key techniques for automatic video analysis. State-of-the-art methods use GMM estimators as the background subtraction approach, but these have a high computational cost for real-time implementation [6, 7]. Hardware implementations of the GMM algorithm on FPGAs have been proposed in several works to deal with the computational burden. An architecture for real-time segmentation and denoising of HD video on Xilinx and Altera FPGAs is presented in [5]. The authors describe the implementation and performance evaluation of the OpenCV-based GMM algorithm and morphological operations applied to sequences of \(1920\times 1080\) frame size. In a later work [6], the authors present the implementation and evaluation of architectures based on the OpenCV GMM algorithm for HD video over FPGA and ASIC technologies. In both cases, the FPGA-based implementations were developed using VHDL, and fixed-point representation is used to achieve the performance required for HD video resolution. An FPGA-based implementation of GMM for offline background subtraction using fixed-point arithmetic is presented in [1]; block diagrams of the implemented system modules are shown, together with qualitative results for a group of \(360\times 240\) frames selected from a sequence of 1250. Other FPGA-based implementations of background subtraction techniques have also been reported. A performance evaluation of two multimodal background subtraction algorithms implemented on a ZedBoard is presented in [3]. A performance comparison of implementations of the GMM, ViBE, and PBAS algorithms on CPU, GPU, and FPGA is presented in [2]. In [8], a summary of several FPGA-based implementations of background modelling algorithms, developed by those authors in previous works, is presented.
Additionally, the authors present the evaluation of their PBAS implementation using the SBI dataset. An FPGA-based implementation of the codebook algorithm on a Spartan-3 FPGA is presented in [9], where the authors describe the system architecture and its performance for sequences of \(768\times 576\). Hardware acceleration for real-time foreground and background identification based on a SoC FPGA is presented in [10]; the proposed architecture was implemented on a Zynq-7 ZC702 Evaluation Board and evaluated using real-time HD video streams.

The main contribution of this paper is an implementation that allows greater flexibility than previous works based on fixed-point representation and conventional RTL description. This paper presents an FPGA-based implementation of the GMM algorithm on a ZedBoard using floating-point arithmetic. The OpenCV GMM function code is adapted for Vivado HLS, and parallelization directives are used for optimization. Taking advantage of the SoC architecture of the Zynq device, the generated HLS custom IP core is integrated with the Zynq processing system, allowing the development of a complete embedded vision system. Hardware resources and power consumption are reported for the HLS custom IP core and for the complete embedded vision system. Moreover, a performance comparison with a CPU software-based implementation for sequences at three different resolutions is presented in frames per second (fps). The experimental results show that the developed system is suitable for real-time processing at \(768\times 576\) resolution with low power consumption. The paper is organized as follows. In Sect. 2 the implemented algorithms and system architecture are described. Results and performance analysis are presented in Sect. 3. Finally, conclusions are drawn in Sect. 4.

2 Proposed Method

The algorithm implementations were developed using a hardware-software co-design methodology. The first stage is the software design, which starts with coding the algorithms using C++ and the OpenCV libraries to take advantage of rapid prototyping, visualization, and easy re-coding. The BGSLibrary (Background Subtraction Library) [13] covers a collection of algorithms from the literature. We selected three classic background subtraction algorithms: Frame Difference, Gaussian Mixture Model (GMM1) [11], and Efficient Gaussian Mixture Model (GMM2) [14]. The common steps of the GMM methods are summarized in Algorithm 1.
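The per-pixel update common to the GMM variants can be sketched as follows. This is a minimal, illustrative single-channel C++ version, not the BGSLibrary code: the struct and function names, the 2.5-sigma matching threshold, the initial variance, and the 0.25 weight threshold are our own illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical single-channel GMM pixel model: K Gaussians with
// weight w, mean mu, and variance var (names are ours, for illustration).
struct Gaussian { float w, mu, var; };

// Update one pixel model and classify the pixel as background (true)
// or foreground (false). alpha is the learning rate.
bool updatePixel(std::vector<Gaussian>& model, float x, float alpha) {
    int match = -1;
    for (std::size_t k = 0; k < model.size(); ++k) {
        float d = x - model[k].mu;
        // A sample matches a Gaussian if it lies within 2.5 sigma
        // (2.5^2 = 6.25), a commonly used threshold.
        if (d * d < 6.25f * model[k].var) { match = (int)k; break; }
    }
    if (match >= 0) {
        // Blend the matched Gaussian towards the new sample.
        Gaussian& g = model[match];
        g.w   += alpha * (1.0f - g.w);
        g.mu  += alpha * (x - g.mu);
        g.var += alpha * ((x - g.mu) * (x - g.mu) - g.var);
    } else {
        // No match: replace the least-weighted Gaussian with a new
        // one centred on the sample (initial variance is illustrative).
        std::size_t worst = 0;
        for (std::size_t k = 1; k < model.size(); ++k)
            if (model[k].w < model[worst].w) worst = k;
        model[worst] = {alpha, x, 100.0f};
    }
    // Renormalise the mixture weights.
    float sum = 0.0f;
    for (auto& g : model) sum += g.w;
    for (auto& g : model) g.w /= sum;
    // Background if the matched Gaussian carries sufficient weight.
    return match >= 0 && model[match].w > 0.25f;
}
```

GMM1 and GMM2 differ mainly in how the learning rates and the number of components are managed, but both follow this match-update-replace pattern per pixel.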

Fig. 1. HLS flow design used for FPGA implementations of the algorithms.

The hardware implementation stage comprises two steps. The first step is the acceleration of Algorithm 1 using Vivado HLS. Parallelization directives are applied in the code to improve the performance of the algorithm, following the flow shown in Fig. 1. Finally, the HDL code generated by Vivado HLS is synthesized and downloaded to the FPGA. Figure 2 shows the generated core, labelled HLS Custom IP Core, inside the SoC-FPGA architecture.
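To illustrate the kind of directives applied, a per-pixel loop annotated for Vivado HLS might look like the sketch below. This is an illustrative example using Frame Difference, not our actual GMM top function; the pragmas and loop structure in the real design may differ. A plain C++ compiler ignores unknown pragmas, so the same function can be tested in software.

```cpp
#include <cstdint>

// Illustrative HLS top function: per-pixel foreground mask by frame
// differencing. The #pragma HLS lines are Vivado HLS directives for
// the AXI interfaces and loop pipelining; PIPELINE II=1 asks the tool
// to start a new loop iteration every clock cycle.
void frame_diff(const uint8_t* cur, const uint8_t* prev,
                uint8_t* mask, int n, uint8_t thresh) {
#pragma HLS INTERFACE m_axi port=cur
#pragma HLS INTERFACE m_axi port=prev
#pragma HLS INTERFACE m_axi port=mask
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        int d = cur[i] - prev[i];
        if (d < 0) d = -d;                 // absolute difference
        mask[i] = (d > thresh) ? 255 : 0;  // binary foreground mask
    }
}
```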

The algorithms were implemented on a ZedBoard Zynq Evaluation and Development Kit, a heterogeneous architecture based on an FPGA. The Vivado® Design Suite, an integrated development environment (IDE) for HDL synthesis, was used for the synthesis of each design. Vivado® High-Level Synthesis (VHLS), included in the suite, is responsible for transforming code written in C/C++ into HDL using a software-based approach, see Fig. 1.

The PetaLinux Operating System (OS) was used on the ARM. This OS is a Linux distribution customized for SoC kit boards, and it facilitates the management of peripherals on the development board such as the Ethernet, USB, and HDMI ports. The communication between the processor and the FPGA is performed using the AMBA AXI4-Stream protocol, a data-flow handler that offers several end-to-end stream pipes for transporting application data. AXI4-Stream works as a data transmission channel between the processing system (PS) and the programmable logic (PL). On the PS side, it works as a memory access controller driven from a C++ function in PetaLinux. On the PL side, the communication is done using an AXI Master, which performs the memory-to-stream conversion and the low-level transmission tasks, allowing the designer to read/write the DRAM from Linux, and to read/write between the FPGA and DRAM, using a high-level approach. AMBA AXI4-Lite is a low-throughput memory-mapped interface between the PS and PL intended for simple communication; it is used for the control signals and status registers, see Fig. 2.
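On the PS side, a memory-mapped AXI4-Lite register block is typically reached from Linux through `/dev/mem` and `mmap`. The sketch below shows the page-alignment arithmetic involved; the base address is a hypothetical example, not the address of our design (on a real Zynq system it comes from the Vivado address editor), and the actual `mmap` call is shown only in a comment since it requires root access on the target.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical AXI4-Lite IP base address (example only; the real
// value is assigned in the Vivado address editor).
constexpr uint64_t kAxiLiteBase = 0x43C00000;
constexpr uint64_t kPageSize    = 4096;  // mmap granularity

// mmap requires a page-aligned physical offset; the register is then
// reached through the in-page remainder.
constexpr uint64_t pageBase(uint64_t phys)   { return phys & ~(kPageSize - 1); }
constexpr uint64_t pageOffset(uint64_t phys) { return phys &  (kPageSize - 1); }

// On the target board the mapping itself would look like:
//   int fd = open("/dev/mem", O_RDWR | O_SYNC);
//   void* p = mmap(nullptr, kPageSize, PROT_READ | PROT_WRITE,
//                  MAP_SHARED, fd, pageBase(kAxiLiteBase));
//   volatile uint32_t* ctrl =
//       (volatile uint32_t*)((char*)p + pageOffset(kAxiLiteBase));
```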

Fig. 2. Architecture of embedded vision system on SoC-FPGA.

3 Experiments and Results

We evaluated the implemented algorithms quantitatively and qualitatively on the Wallflower dataset [12]. The results show that the proposed architecture can compute the foreground for the scenarios shown in the first column of Fig. 3.

Fig. 3. Qualitative results. From left to right: original image, ground truth, Frame Difference, GMM1, GMM2. From top to bottom: bootstrap (BS), camouflage (CA), foreground aperture (FA), lightswitch (LS), time of day (TD), waving trees (WT).

Quantitative results were calculated with three quality metrics: Precision, Recall, and F-score, as defined in Eqs. 1, 2 and 3, which are based on the number of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN). The results in Table 1 are consistent with those found in the state of the art.

$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(1)
$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(2)
$$\begin{aligned} F\text{- }score = \frac{2\cdot Precision \cdot Recall}{Precision + Recall} \end{aligned}$$
(3)
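Equations 1-3 can be computed directly from a predicted foreground mask and its ground truth; the helper below is a minimal sketch (the function and struct names are ours) that treats any non-zero pixel as foreground and guards the degenerate case of zero true positives.

```cpp
#include <cassert>
#include <cmath>

// Precision, Recall and F-score (Eqs. 1-3) from a predicted mask and
// its ground truth; non-zero pixels are foreground.
struct Scores { double precision, recall, fscore; };

Scores evaluate(const unsigned char* pred, const unsigned char* gt, int n) {
    int tp = 0, fp = 0, fn = 0;
    for (int i = 0; i < n; ++i) {
        bool p = pred[i] != 0, g = gt[i] != 0;
        if (p && g)       ++tp;  // foreground correctly detected
        else if (p && !g) ++fp;  // false alarm
        else if (!p && g) ++fn;  // missed foreground
    }
    double precision = tp ? (double)tp / (tp + fp) : 0.0;
    double recall    = tp ? (double)tp / (tp + fn) : 0.0;
    double fscore    = (precision + recall) > 0.0
                     ? 2.0 * precision * recall / (precision + recall) : 0.0;
    return {precision, recall, fscore};
}
```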
Table 1. Performance using Wallflower dataset [12].

The proposed complete embedded vision system is based on a heterogeneous architecture composed of a SoC-FPGA, as seen in Fig. 2. The complete system transmits information between the ARM and the FPGA; for this reason, the maximum performance of the different implementations is limited by the maximum bandwidth of the communication channels. However, the complete system facilitates the management of peripherals and data through the operating system. Moreover, the HLS custom IP vision core running in the standalone FPGA can be used in applications with a direct connection to the FPGA. Table 2 shows the main hardware resources and power consumption of the algorithm implementations, comparing the HLS custom IP vision core on the FPGA with the complete embedded vision system on the SoC-FPGA. The latter uses a greater amount of hardware resources, due primarily to the drivers for the management and transmission of data provided by the ARM.

Table 2. Hardware resources required on a ZedBoard FPGA after place and route. The whole system includes processing modules.

The performance of the SoC-FPGA and the standalone FPGA was compared against a PC, as can be seen in Table 3. The ZedBoard Zynq is a SoC that contains a dual-core ARM Cortex-A9 and an Artix-7 FPGA fabric with a clock period of 10 ns (100 MHz). The PC is equipped with an AMD Quad-Core A10-9620P processor running at 2.5 GHz. The average frame rate on the PC, ARM, and SoC-FPGA is computed as in Eq. 4; for the FPGA, the frame rate is computed as in Eq. 5. The performance measures for the SoC-FPGA are better than those measured for the PC, and the SoC-FPGA allows real-time implementation in all cases. The standalone FPGA has the best performance because it does not have the limitation of the communication channels. Additionally, a comparison against the standalone ARM of the SoC-FPGA is included. ARM processors are a family of RISC CPUs typically used on microprocessor boards and mobile devices for real-time embedded system applications. In most cases the performance of the FPGA exceeds that of the ARM by more than \(10\times \), reaching over \(40\times \) in the best cases.

$$\begin{aligned} fps = \frac{1}{\left( \mathrm {elapsed\;time\;per\;frame}\right) } \end{aligned}$$
(4)
$$\begin{aligned} fps = \frac{1}{\left( \mathrm {FPGA\;cycles\;per\;frame}\right) \cdot \left( \mathrm {clock\;period}\right) } \end{aligned}$$
(5)
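Equations 4 and 5 translate directly into code; the small helpers below (names are ours) show both forms, with Eq. 5 instantiated using the ZedBoard's 10 ns clock period as an example.

```cpp
#include <cassert>
#include <cmath>

// Eq. 4: frame rate from measured elapsed time per frame.
double fpsFromTime(double secondsPerFrame) {
    return 1.0 / secondsPerFrame;
}

// Eq. 5: frame rate from the FPGA cycle count per frame and the
// clock period in seconds (10 ns = 1e-8 s on the ZedBoard fabric).
double fpsFromCycles(double cyclesPerFrame, double clockPeriodSeconds) {
    return 1.0 / (cyclesPerFrame * clockPeriodSeconds);
}
```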
Table 3. Performance comparison for sequences at three resolutions, in fps.

4 Conclusions

This work presented the real-time implementation of three background subtraction algorithms using floating-point arithmetic. The HLS implementations permit fast design and implementation of several architectures with different parallelization directives; in this way, it is possible to improve the performance of complex algorithms while keeping standard floating-point precision. The performance measures of the proposed architecture showed better computational times than the PC-based implementation, and the parallelization of the GMM algorithms reached the frames per second needed for real-time video processing with low power consumption. For these reasons, heterogeneous architectures based on FPGAs proved to be an effective tool for video surveillance applications over embedded vision systems.