1 Introduction

Video surveillance systems deal with monitoring, intercepting, or observing activities, behavior, or any other changing information related to people, places, or things. They have evolved over three generations: analog surveillance systems, digital surveillance systems, and smart/intelligent surveillance systems. Networks of surveillance video sensors/cameras are now ubiquitous. Figure 1 shows an example of a video sensor/camera network with overlapping as well as non-overlapping fields of view. Multimodal surveillance in intelligent transportation systems has a wide range of applications and is an emerging field of research. A multimodal surveillance system normally consists of a wireless sensor network of video/image sensors, audio sensors, pressure sensors, thermal sensors, and position sensors. In addition, recent advances in sensor hardware include embedded data processing algorithms that process the data captured by the sensor before sending it to a base station. The placement of the different video sensors/cameras is therefore very important. The central question in the multimodal scenario is whether we can determine which camera/sensor has captured important video content without watching the entire footage of every sensor/camera.

Fig. 1. Illustration of a multi-view camera network [3]

Sufficient progress has been made in summarizing a single video; comprehensive reviews of video summarization can be found in [1] and [2]. Truong et al. [1] describe two forms a video summary can take: a key-frame sequence and a video skim. Multi-view video summarization is now gaining popularity, as the many deployed video sensors/cameras often cover overlapping regions. The contributions of our work are a video sensor/camera placement strategy for surveillance in intelligent transportation systems, and a key-frame-based summarization technique for MPEG-4 and H.264 (AVC) videos that preserves both intra-view and inter-view correlation when generating video summaries.

2 Related Works

While the goal of many optimal camera placement strategies has been to minimize overlapping views, the objective of the proposed system requires overlapping views in order to track the complete motion path of the subjects under surveillance from multiple views, maximize their visibility, and maximize the degree of coverage of the surveillance perimeter. Most summarization algorithms work on single-view video; because of the redundancy across views, multi-view video summarization is considerably more complex.

Zouaoui et al. in [4] have proposed a multimodal system composed of two microphones and one camera integrated on-board for video and audio analytics for surveillance. The system relies on the fusion of unusual audio event detection and/or object detection from the captured video sequences. The audio analysis consists of modeling the normal ambience and detecting deviations from the trained models during testing, while the video analysis involves classification according to geometric shape and size. However, even though the system succeeds in robustly detecting the 3D position of objects, it employs only a single camera for surveillance, which does not provide a robust multi-dimensional view of the object of interest.

Wang et al. in [5] have presented a system for detecting and classifying moving vehicles. The system uses video sensors along with Laser Doppler Vibrometers (LDVs), a kind of acoustic sensor, to capture the motion, appearance, and acoustic features of moving vehicles, and later uses these data to classify them.

Magno et al. in [6] have proposed a multimodal low-power, low-cost video surveillance system based on a CMOS video sensor and a Pyroelectric InfraRed (PIR) sensor. To control power consumption, instead of transmitting full images, the sensors transmit only a very limited amount of information such as the number of objects, trajectory, position, size, etc., thus saving a large amount of energy in wireless transmission and extending battery life. However, nothing is done about data transmission and power consumption when the targeted object is not detected. In addition, the system is used only for detecting an object abandoned in or removed from the perimeter under surveillance, and hence there is no clear evidence of its usability in a large-scale, dynamically changing environment.

Gupta et al. in [7] have designed a distributed visual surveillance system for military perimeter surveillance. The system detects potential threats and creates actionable intelligence to support expeditionary war fighting at a military base camp using a multimodal wireless sensor network. It employs rule-based algorithms to automatically detect atomic actions in video, such as a person appearing in a restricted area, tripwire crossing, a person disappearing from a protected perimeter, a person entering or exiting, leaving an object behind, loitering, and taking an object away. A geodetic coordinate system provides metric information such as the size, distance, speed, and heading of detected objects for high-level inference and inter-sensor object tracking.

Prati et al. in [8] have proposed a PIR-sensor-based multimodal video surveillance system. In this system, PIR sensors are used to bring down the cost of deploying the surveillance system, and at the same time they are combined with vision systems to precisely detect the speed and direction of vehicles along with other complex events.

Rios-Cabrera et al. in [9] have presented an efficient multi-camera vehicle identification, detection, and tracking system for tunnels. A network of non-overlapping video cameras is used to detect and track vehicles inside a tunnel by creating a vehicle fingerprint from the Haar features of each vehicle, despite the poor illumination inside the tunnel and the low quality of the images.

Lopatka et al. in [10] have proposed a system for detecting traffic events that uses special acoustic sensors, pressure sensors, and video sensors to record the occurrence of audio-visual events. A use case of detecting a collision between two cars is demonstrated in the paper. The data collected by the multimodal sensors is sent to a computational cluster in real time for analysis of the traffic events, using the Real Time Streaming Protocol (RTSP).

Wang et al. in [11] have proposed a large-scale video surveillance system for wide-area monitoring, capable of monitoring and tracking a moving object in a widely open area using an embedded component on the camera for detailed visualization of objects on a 2D/3D interface. In addition, it is capable of detecting illegal parking and identifying the driver's face from the illegal parking event.

van den Hengel et al. in [12] have proposed a genetic algorithm for the automatic placement of multiple surveillance cameras, used to optimize camera coverage in large-scale surveillance systems and, at the same time, find overlapping views between cameras where necessary. Yildiz et al. in [13] have presented a bilevel algorithm to determine an optimal camera placement with maximum angular coverage for a WSN of homogeneous and heterogeneous cameras. Zhao et al. in [14] have presented two binary integer programming (BIP) algorithms for finding the optimal camera placement and network configuration; moreover, they have extended the proposed framework to include visual tagging of subjects in the surveillance environment. Liu et al. in [15] have presented a multi-modal particle filter technique to track vehicles from different views (frontal, rear, and side). In addition, they discuss a technique for occlusion handling in surveillance systems.

Denman et al. in [16] have presented a system for automatic monitoring and tracking of vehicles in real time using optical flow modules and motion detection on videos captured by four video cameras. Wang et al. in [17] have proposed an effective foreground object detection technique for surveillance systems that estimates the conditional probability densities of both foreground and background objects using feature extraction and temporal video filtering. Zheng et al. in [18] have proposed a motion-feature-based key-frame selection technique in which the motion information of each frame of the traffic surveillance video stream is computed on a GPU-based system, and frames whose motion information is greater than that of their neighbors are selected as key-frames. By exploiting GPU processing, the authors have shown a significant increase in the accuracy and processing speed of the algorithm.

Panda et al. in [19] have proposed a novel sparse representative selection method for summarizing multi-view videos, that is, videos captured from multiple cameras. They use inter-view and intra-view similarities between the feature descriptors of each view to model multi-view correlations. Kuanar et al. in [20] have proposed a bipartite matching method that correlates features such as visual bag-of-words, texture, and color across views and extracts frames for summarizing multi-view videos; the authors use the Optimum-Path Forest algorithm to cluster intra-view dependencies and remove intra-view redundancies. Liu et al. in [21] have proposed a method for visualizing object trajectories in multi-camera videos and creating video summaries of suspicious movements in a building.

3 Optimal Camera Placement in Multimodal Surveillance System

Many large-scale multimodal surveillance systems have relied on human experts for camera selection and placement; however, such an approach cannot effectively design a system while accounting for the multitude of factors involved. A straightforward alternative would be to deploy the video cameras uniformly around the surveillance area. In real-world deployment scenarios, however, uniform placement is not practical, since the placement of cameras is restricted by constraints such as cost, availability, visibility, applicability, and feasibility. This study investigates the effect of all the factors listed above and designs an optimal camera placement strategy that satisfies them.

Figure 2 illustrates the coverage of a video camera C in three-dimensional space. Point V(x, y, z) is the position of the camera and point G indicates the centre of gravity of its coverage pyramid. The four points A, B, C, and D are the extreme points of the FOV of the camera; they can be computed from the horizontal AOV, the vertical AOV, and the camera position, and they form the base plane of a rectangular pyramid with apex V. Point X is an arbitrary point within the FOV which is to be observed by the camera.

Fig. 2. Coverage of a video camera
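To make the construction of the base plane concrete, the following minimal Python sketch derives A, B, C, and D from the camera position, viewing direction, the two AOVs, and a depth-of-field distance at which the pyramid is assumed to be truncated; the function and parameter names are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def fov_base_corners(V, direction, up, h_aov_deg, v_aov_deg, depth):
    """Corners A, B, C, D of the FOV base plane of a camera at V.

    Assumes the coverage pyramid is truncated at distance `depth` along
    the viewing direction; h_aov_deg and v_aov_deg are the full
    horizontal and vertical angles of view in degrees.
    """
    d = direction / np.linalg.norm(direction)
    right = np.cross(d, up)
    right /= np.linalg.norm(right)
    up_ = np.cross(right, d)                         # re-orthogonalised up vector
    centre = V + depth * d                           # centre of the base plane
    hw = depth * np.tan(np.radians(h_aov_deg) / 2)   # half-width of the base
    hh = depth * np.tan(np.radians(v_aov_deg) / 2)   # half-height of the base
    A = centre - hw * right + hh * up_
    B = centre + hw * right + hh * up_
    C = centre + hw * right - hh * up_
    D = centre - hw * right - hh * up_
    return A, B, C, D
```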

The volume of the rectangular pyramid formed by the points {V, A, B, C, D} (that is, the volume of the coverage region of the camera) is given by Eq. 1,

$$ V_{c} = \frac{l \cdot b \cdot h}{3} $$
(1)

where l and b are the length and breadth of the base plane and h is the height of the apex above the base.

Now, since X is an arbitrary point inside the FOV, it forms four tetrahedrons with the four triangular side faces of the pyramid, each of which has V as its apex. The volume of each such tetrahedron is given by Eq. 2,

$$ V_{i} = \frac{\sqrt{2} \cdot Area_{base} \cdot h}{12} $$
(2)

where h is the height of the apex above the base and i = 1 to 4.

Consider Vtotal as the total volume obtained by adding the volumes of all four tetrahedrons and of the pyramid created with X as the apex. Vtotal is given by Eq. 3,

$$ V_{total}^{X} = V_{base} + \sum_{i=1}^{4} V_{i} $$
(3)

where Vbase is the volume of the pyramid with point X as its apex and the points A, B, C, D as its base plane. Equation 4 tests whether a point X lies within the FOV of a video camera C, whose coverage region is modelled as a rectangular pyramid of volume Vc, and returns true or false accordingly.

$$ FOV\left( {C,X} \right) = \begin{cases} true, & \text{if } V_{c} = V_{total}^{X} \\ false, & \text{otherwise} \end{cases} $$
(4)
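The test in Eq. 4 can be implemented directly. Below is a minimal Python sketch; as assumptions, it computes all volumes with the scalar triple product (rather than the closed forms of Eqs. 1 and 2) and replaces the exact equality of Eq. 4 with a floating-point tolerance. Function names are illustrative.

```python
import numpy as np

def tet_volume(p0, p1, p2, p3):
    """Volume of the tetrahedron with vertices p0..p3 (scalar triple product)."""
    return abs(np.dot(p1 - p0, np.cross(p2 - p0, p3 - p0))) / 6.0

def in_fov(V, A, B, C, D, X, tol=1e-9):
    """Eq. 4: True if X lies inside the coverage pyramid {V, A, B, C, D}.

    V is the camera position (apex); A, B, C, D are the corners of the
    rectangular base plane, given in order around the base.
    """
    # Volume of the full coverage pyramid, split into two tetrahedrons.
    v_c = tet_volume(V, A, B, C) + tet_volume(V, A, C, D)
    # Sub-volumes with X as apex: a pyramid over the base (again split
    # into two tetrahedrons) plus four tetrahedrons over the side faces.
    v_total = (tet_volume(X, A, B, C) + tet_volume(X, A, C, D) +
               tet_volume(X, V, A, B) + tet_volume(X, V, B, C) +
               tet_volume(X, V, C, D) + tet_volume(X, V, D, A))
    # X is inside exactly when the sub-volumes tile the pyramid.
    return abs(v_c - v_total) < tol
```

For a point outside the pyramid the sub-volumes overshoot Vc, so the test correctly returns false.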

Algorithm 1 presents the systematic steps for computing an optimal camera placement in a multimodal surveillance system. The algorithm can be used to decide the placement of multiple cameras at intersections, junctions, and crossroads and to achieve the best possible coverage of the surveillance area.

Algorithm 1. Optimal camera placement in a multimodal surveillance system
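Algorithm 1 itself appears only as a figure. Based on its description here and in Sect. 4 (the area is divided into four equal parts, and every camera must see the midpoints of two adjacent sides), the following Python sketch shows one plausible form the placement search could take; the candidate generation, the coverage metric, and all function names are assumptions, not the authors' implementation.

```python
import itertools

def best_placement(quarter_candidates, midpoints, coverage, in_fov):
    """Hypothetical exhaustive search over candidate camera poses.

    quarter_candidates: one list of candidate camera poses per quarter.
    midpoints: per quarter, the two adjacent-side midpoints the camera
        must see. coverage(poses) returns the covered fraction of the
        area; in_fov(pose, point) applies the Eq. 4 test for that pose.
    """
    best, best_cov = None, -1.0
    # Try every combination of one candidate pose per quarter.
    for poses in itertools.product(*quarter_candidates):
        # Constraint from Sect. 4: each camera must see the midpoints
        # of two adjacent sides of its quarter.
        if not all(in_fov(p, m) for p, ms in zip(poses, midpoints) for m in ms):
            continue
        cov = coverage(poses)   # fraction of the surveillance area covered
        if cov > best_cov:
            best, best_cov = poses, cov
    return best, best_cov
```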

4 Experimental Results and Discussion

4.1 Simulation Environment and Parameters

The optimal camera placement algorithm discussed previously was simulated in the OMNeT++ network simulator. Table 1 lists the simulation parameters that were considered while checking the results and validity of the algorithm.

Table 1. Simulation parameters for optimal camera placement algorithm

The network topology was configured so that each video camera would be placed randomly inside or on the edges of one quarter of the surveillance area, as shown in Fig. 3, the surveillance area being divided into four equal parts. As specified in the algorithm, each camera needed to have the midpoints of two adjacent sides within its FOV to achieve the best possible coverage of the surveillance area.

Fig. 3. Surveillance area with various possible camera placements

4.2 Results

During the simulation, the cameras were placed randomly inside each quarter of the surveillance area, near the midpoints, and on the vertex joining two edges. At any moment in time, the AOV and depth of field of every camera were fixed and identical, though the values were selected randomly from the ranges specified in Table 1.

The results of the simulation are presented in Table 2. From the results, it can be inferred that the best placement for the cameras in a multimodal surveillance system is near the vertex joining any two edges of the surveillance area. It can also be seen that a higher depth of field and a wider angle of view produce better region coverage, leaving less than 5% of the area uncovered. Moreover, since all the cameras have overlapping views, even better coverage of the whole area can be achieved, with very few, or zero, blind spots. Figure 4 shows the best placement of the video cameras as derived using Algorithm 1.

Table 2. Simulation results for optimal camera placement algorithm
Fig. 4. Optimal camera placement design

4.3 Video Summarization

The Ko-PER Intersection Dataset [22], which comprises highly accurate reference trajectories of cars, raw laser scanner measurements, and undistorted monochrome camera images, has been used. From this dataset, four videos were generated with varying GOP sizes N = 8, 16, 24, and 30 at a constant frame rate of 30 fps and encoded using the H.264/MPEG-4 Part 10 Advanced Video Coding (AVC) standard. These videos were used to evaluate the performance of each frame selection technique individually. The graph in Fig. 5 compares the total execution times of the three proposed frame selection techniques for video summarization. It is evident from the graph that the execution time of the algorithms increases with GOP size. Whichever frame selection technique is used, the duration of the final summarized video is the same, since no key-frames are dropped in the process; hence all three techniques are suitable for creating video summaries without losing important contextual information from the video.
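To illustrate how the GOP structure drives key-frame selection, the following minimal Python/OpenCV sketch collects one candidate key-frame per GOP. It assumes I-frames fall exactly on GOP boundaries (every N-th frame) and is not any of the three proposed selection techniques.

```python
import cv2

def gop_keyframes(video_path, gop_size):
    """Collect one candidate key-frame per GOP.

    Assumes the encoder places an I-frame at every GOP boundary,
    i.e. every gop_size-th frame of the decoded stream.
    """
    cap = cv2.VideoCapture(video_path)
    keyframes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                    # end of stream
            break
        if idx % gop_size == 0:       # GOP boundary -> candidate key-frame
            keyframes.append(frame)
        idx += 1
    cap.release()
    return keyframes
```

With this assumption, a larger GOP size N yields fewer candidate key-frames per second of video, which is consistent with the execution-time trend shown in Fig. 5.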

Fig. 5. Comparison of frame selection techniques for video summarization algorithm

5 Conclusion

The proposed optimal camera placement algorithm can be used to decide the placement of multiple cameras at intersections, junctions, and crossroads without compromising the coverage area of the deployed video cameras or the cost of deployment. The algorithm provides maximum coverage of the area under surveillance, leading to the complete elimination, or at least a reduction, of blind zones in the surveillance area; it also maximizes the view of subjects and minimizes occlusions in areas of high vehicular traffic. In addition, a video summarization algorithm using three different frame selection techniques for multi-view surveillance systems is presented, which can be used to create summaries of large, lengthy traffic surveillance video streams while reducing the computational processing required to create them. Collectively, the two proposed algorithms reduce the cost of camera deployment, the computational cost, and the power consumption, and provide efficient performance in multi-view as well as multimodal surveillance systems.