Abstract
In camera networks, dynamic node selection is an effective technique that enables video stream transmission with constrained network bandwidth, more economical node cooperation for nodes with constrained power supplies, and optimal use of a limited number of display terminals, particularly for applications that need to obtain high-quality video of specific targets. However, the nearest camera in a network cannot be identified by directional measurements alone. Furthermore, errors are introduced into computer vision algorithms by complex background, illumination, and other factors, causing unstable and jittery processing results. Consequently, in selecting camera network nodes, two issues must be addressed: First, a dynamic selection mechanism that can choose the most appropriate node is needed. Second, metrics to evaluate the visual information in a video stream must be modeled and adapted to various camera parameters, backgrounds, and scenes. This paper proposes a node selection method based on approximate reinforcement learning in which nodes are selected to obtain the maximum expected reward using approximate Q-learning. The Q-function is approximated by a Gaussian Mixture Model with parameters that are sequentially updated by a mini-batch stepwise Expectation–Maximization algorithm. To determine the most informative camera node dynamically, the immediate reward in Q-learning integrates the visibility, orientation, and image clarity of the object in view. Experimental results show that the proposed visual evaluation metrics can effectively capture the motion state of objects, and that the selection method reduces camera switching and related errors compared with state-of-the art methods.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Camera networks have recently received increased research interest for the effective resolution of partial and full occlusions, and the continuous tracking of targets over large areas where the limited field of view (FOV) of a single camera is insufficient. Such networks are used extensively in areas such as security, surveillance, human–computer interaction, navigation, and positioning. However, the utilization of camera networks is beset by challenges related to the deployment and control of the camera nodes, real-time fusion of high-resolution and high frame-rate video streams, and selection and coordination of the cameras [5, 30]. Camera selection involves determining which one (or more) camera(s) should be selected to obtain the maximum information or comfortable visual effects. Thus, camera selection is a fundamental challenge in applications such as target tracking and positioning [10, 18], area coverage [29], surveillance and behavior analysis [24, 33], and video summarization [8]. Dynamic node selection is an effective solution to problems such as video stream transmission with constrained network bandwidth, more economical node cooperation for nodes with constrained power supplies, and a limited number of display terminals. Such issues are faced by emerging applications such as security surveillance and smart homes, which need to obtain high-quality video of specific targets. For example, from the cameras deployed to survey some critical area, users need select only one optimal viewpoint to be displayed on a mobile device. However, in contrast to node selection in general sensor networks [11, 21], the nearest camera in a camera network cannot be identified by directional measurements alone. Furthermore, some errors are introduced in computer vision algorithms by complex background, illumination, and other factors, which cause processing results to be unstable and jittery in the time sequence. Consequently, in selecting nodes in camera networks, two issues must be addressed: First, a dynamic selection mechanism should be established that ensures the selected node captures the maximum expected amount of information and outputs comfortable video based on current and historical observations. Second, metrics to evaluate the visual information of a video stream must be modeled according to the demands of applications, and these should be adapted to various camera parameters, background, and scenes.
Previous studies have investigated the issue of node selection in camera networks from the viewpoint of selection mechanisms and visual information evaluation. The consequent methods proposed for node selection in camera networks can be divided into three categories: In the first category, the node that maximizes the current visual information is selected. For example, Tessens et al. [33] proposed a distributed principle viewpoint determination method for smart cameras, in which factors such as the visibility, direction of movement of targets, distance between the center of mass of the object and the camera, and the results of face detection are summarized into a score. The camera with the maximum score is then designated as the principle viewpoint. In [23], cameras are ranked and selected according to the target’s location and look-up tables stored locally at each node. A camera selection approach for object tracking was presented by Monari et al. [22], in which the relevant subset of cameras is determined according to the current or predicted state of the target object and prior geometrical knowledge. However, most methods in this category do not consider the dynamics of visual information and a target’s state in scenes, overall visual effects, network communications, or energy costs. The second category of node selection methods uses a reward or cost function to integrate the constraints, and the selection process is optimized using historic and current system states. For example, in a method proposed by Li and Bhanu [16], the best-view estimation is obtained by iterative bargaining based on game theory. A best-view selection algorithm that generates a smooth video sequence was proposed by Jiang et al. [13], who solved the resulting optimization problem using dynamic programming. Daniyal et al. [7] presented a best-view selection method based on a dynamic Bayesian network according to the states taken from a certain historical time window, and used the resulting output for video production. Although these methods do not predict the state transitions and the relay of rewards, the fact that the reward will be reflected after the action, and thus affect subsequent states and actions, is not considered. In the third category, dynamic selection decisions are made according to current system states and predicted future states. For example, Partially Observable Markov Decision Processing (POMDP) may be used to find the optimal scheduling policy in tracking and surveillance camera networks [17, 31] by predicting the future reward of an action according to certain assumptions. However, these methods require clear state transitions, which are difficult to explicitly define in complex dynamic systems. Additionally, the computational complexity grows exponentially with the state space because of the “curse of dimensionality.”
On the other hand, the main aim of visual information evaluation is to extract effective visual features and map them to comparable scores. Feature extraction is the key process, and is usually related to the application. For example, the visibility of a target [9], the location in an image or the real world [12, 22, 23], and the image area size [13] of the target are often used in tracking applications, whereas behavior analysis applications are more concerned with the orientation [33], body pose, and appearance [25, 27] of moving targets and the movement of the scene [7]. Surveillance applications for critical areas typically require front view, high-resolution video of the target of interest, and frequently switch views to prevent errors in visual computing. The front view of the target is determined by pose estimation and face detection; however, body pose estimation from a single perspective is still a nonlinear and ill-posed problem [1], and face detection has high error rates that are influenced by the video resolution and the relative position of the target with respect to the cameras. Therefore, the results of visual computing with both methods are often unstable and have high computational complexity.
In this paper, we address the problem of dynamic camera selection for the surveillance of critical areas. Our aim is to obtain the most informative camera node while simultaneously reducing camera switching. To achieve this, we model the dynamic selection of nodes in camera networks as a Markov Decision Process (MDP), and learn the selection strategy via reinforcement learning. For dynamic scenes, a state transition model is implicit, and the state space is continuous. We adopt an approximate Q-learning algorithm to find the optimal solution online, using a Gaussian Mixture Model (GMM) to represent the approximated Q-function. Model parameters are updated by a mini-batch stepwise Expectation–Maximization (EM) algorithm for the incoming episode state and Q-value. To evaluate the visual information, we design a function to integrate the visibility, orientation, and image resolution of the target, and conduct a trade-off between the video switching frequency and the amount of visual information.
The remainder of this paper is organized as follows. Section 2 discusses the problem of node selection in camera networks, before Section 3 presents our proposed selection policy-learning method based on Q-learning. Section 4 outlines how the Q-function is approximated with a GMM updated by mini-batch stepwise EM. Section 5 discusses visual information measures, and Section 6 presents the results of extensive experiments. Finally, we summarize this paper in Section 7.
2 Problem formulation
The camera node selection problem can be described as follows. There are N smart cameras (C 1, C 2 … C N) consisting of an image sensor, a processor, and a wireless communication module with partially or completely overlapped FOVs, as illustrated in Fig. 1. We assume that the communication between the cameras is synchronized and delay-free. A designated central controller node schedules the cameras according to a selection policy. At a given time instant t = 0, 1, 2, …, the camera nodes in the network extract the target’s visual features from their video streams, and convert them to a visual score that indicates the quality of the video stream. This score is then sent to the central controller. To obtain the maximum reward from the camera selection action, the central controller selects camera C ∗ as the optimal camera at that time instant according the previously selected camera and the visual scores. Therefore, camera C ∗ outputs its surveillance video for the interval (t, t + Δt).
In the selection process, the central agent dynamically selects the video camera that can best obtain a high-quality surveillance video. Thus, it can be considered a sequential decision-making process based on the analysis and evaluation of visual content. Because the state transition and the expected reward only depend on the current system state and the previous selection result, this can be modeled as an MDP. An MDP can be formally defined as a 4-tuple < A, S, T, R >, where A is a finite set of candidate cameras {C 1, C 2 … C N}, S is the set of all possible underlying states, and the system state s t ∈ S at time t is represented as a vector s t = (P t , O t , a t − 1), where P t is the Δt target position {(x 1 t , y 1 t ), (x 2 t , y 2 t ), … (x Δt t , y Δt t )} between the previous time point t − 1 and the current point t, O t is the sequence of target orientations {θ 1 t , θ 2 t , … θ Δt t }, and a t − 1 ∈ A is the node selected at time point t. T is a state transition function T : S × A → S, and R(s t , a t ) is an immediate reward function R : S × A → ℝ that assigns real-valued rewards to the selection actions performed in each of the underlying process states. To simplify the notation, we represent the immediate reward R(s t , a t ) as r t . Consequently, the MDP model can be used to derive a selection policy π * (s) : s ↦ a that yields the maximized expected reward \( {V}^{\pi^{*}}\left(\mathbf{s}\right) \) over some decision steps according to the history of states and actions, i.e.,
where γ ∈ [0, 1] is a discount factor that controls the impact of future rewards, causing the effect of a reward to decay exponentially with elapsed time. When the transition and reward functions are known, the model can be computed iteratively offline using dynamic programming. However, the transition model for the movement of the target is difficult to obtain for any real scene.
Reinforcement learning methods, in which agents interact with an unknown environment, synthesize, and improve their behavior through trial and error, have been extensively researched in recent years. We use a Q-learning-based algorithm to learn the optimal policy online, requiring no knowledge of the transition probabilities of the underlying MDP. The classic Q-learning algorithm maps every state-action pair to a real number (or Q-value) using a look-up table. The system described herein has a continuous state and discrete actions. Hence, a large storage space is needed to store all state-action pairs, and convergence is typically slow because it is not feasible to train all pairs within a large domain. Therefore, in this paper, we use a function approximated with a GMM and updated via stepwise EM for every episode sample to represent the Q-function.
3 Selecting camera-based Q-learning
3.1 Architecture of the proposed method
The general architecture of the node selection method for camera networks based on approximate Q-learning is depicted in Fig. 2. The key symbols used in this paper are given in Table 1. Target tracking is carried out by each camera node, and communication between cameras is used to enhance the fusion of the target state and association for a specific target. Visual scores are calculated and transferred to the central controller node, and the selected node then receives instructions from the central node to output its video stream.
The central controller has action and learning modules that can be executed in different parallel threads, and the Q-function Q(s, a) is approximated by a group of GMM models in which q a (s), a = 1 … N represents the expected Q-value for state s and camera a, that is, Q(s, a) = E[q a |s]. In the action module, the central agent selects camera node a t at time step t using the action selection strategy presented in Section 3.4. To accelerate the learning, the target’s current state and scene topology are used as a priori knowledge to initialize the Q-function, as described as Section 3.3. As a result, the selected camera index is transmitted to all nodes in the camera network, and the selected camera obtains the immediate reward r t + 1 and observes the next state s t + 1. The new sample 4-tuple < s t , a t , r t + 1, s t + 1 > is added to the sequence (or episode) s 0, a 0, r 1, s 1, …, s t , a t , r t + 1, s t + 1, …, s T − 1, a T − 1, r T , s T . Each episode ends when the target leaves the monitoring area or the number of selection actions reaches some pre-assigned value. In our method, a collection referred to as the EpisodeList is used to store the episodes produced by the action module. The EpisodeList is sent to the learning module to provide training samples when the number of episodes reaches a certain threshold.
In the learning module, the Q-value samples can be trained with the 4-tuple samples in EpisodeList and the current q a (s) by the Q-learning algorithm, described in Section 3.2. The parameters of the GMM for the cameras can then be updated online via mini-batch stepwise EM with the Q-value samples. The new approximation function q ' a (s) is then input to the action module as the basis of camera selection.
3.2 Q-learning
In reinforcement learning, an agent generally receives a reward for actions that are beneficial along the path to achieving some long-term goal. Q-learning is a well-known model-free reinforcement learning method that estimates the optimal action-value function [34] with no explicit knowledge of the dynamic environment. In this paper, every state-action pair is mapped to a real number by the function Q(s t , a t ), which is an estimated incremental function for the state-action pair (s t , a t ) based on the temporal difference principle in Q-learning, i.e.,
where ρ ∈ [0, 1] is the learning rate controlling how fast the Q-function estimations are modified (we set ρ = 0.1), γ is the discount factor for expected long-term rewards, and camera node a is selected to maximize the discounted sum of rewards \( \underset{a}{ \max }Q\left({\mathbf{s}}_{t+1},a\right) \). Q-learning converges to the optimal Q-function if each state-action pair is performed infinitely often and ρ is satisfied for each (s t , a t ) pair: ∑ρ = ∞ and ∑ρ 2 < ∞ [34]. When the Q-function converges, the optimal camera selection policy π(s) can be obtained by maximizing the function, i.e.,
We assume there is some approximate function q a (s) that corresponds to each camera a such that the Q-value for the state-action pair (s, a) is the expectation of the function q a (s) for state s, i.e.,
For every piece of sample data in EpisodeList, an initial Q-value sample \( <{\mathbf{s}}_t,{a}_t,{\widehat{q}}_{a_t}\left({\mathbf{s}}_t\right)> \) is calculated for the state-action pair (s t , a t ), and the current approximate function is given by Gaussian Mixture Regression with the corresponding GMM model. The Q-value \( {\widehat{q}}_{a_t}\left({\mathbf{s}}_t\right) \) is iteratively updated using:
When the iterative learning has been completed for all samples in EpisodeList, the state s t , selected camera a t , and Q-value \( {\widehat{q}}_a(s) \) are added to the Q-value sample collection \( \varPhi =\left\{<{s}_t,{a}_t,{\widehat{q}}_{a_t}\left({s}_t\right)>\right\} \). The subset \( {D}_a=\left\{<{s}_t,{a}_t,{\widehat{q}}_{a_t}\left({s}_t\right)>\left|a={a}_t\right.\right\} \) for camera a t is used to update the GMM approximation function for the corresponding camera. The pseudo-code for this process is outlined in Algorithm 1.
As can be seen from Algorithm 1, the immediate reward r t + 1 is the visual score of the target between t and t + Δt, as outlined in Section 5.
3.3 Q-function initialization
Q-learning starts from an arbitrary initial Q-function Q 0(s, a) that can be updated without requiring an a priori model. Thus, Q-learning often suffers from slow convergence when there are a large number of states and action spaces. Considering that the camera node with the maximum visual score can be selected greedily if the target’s state remains stable in a subsequent period of time, we use the target’s current state and scene topology as a priori knowledge, and initialize Q 0(s, a) with the immediate reward for greedy selection in state s 0. This accelerates the convergence speed of the learning. Assuming that the hardware configuration of the cameras in the scene is the same, the immediate reward for greedy selection is associated with the relative position and angle between the target and the selected camera. Thus, the initial pair value Q 0(s, a) can be defined as
where w 0 is a predefined weight, (x s , y s ) and (x a , y a ) give the position of the target and camera a, MaxDis is the maximum distance to normalization, and θ a , θ s are the angles of offset from the x -axis in camera a and the target’s front side, respectively, in state s. This equation implies that a closer distance and smaller angle between the camera and the target will result in better-quality monitoring videos.
3.4 Action selection strategy
In reinforcement learning, action selection must address the trade-off between exploration (that is, exploring the world to gather information about the system) and exploitation (which involves executing actions with the current policy). To obtain a greater initial exploration probability that gradually decreases with time until a steady state is reached, we implement an improved ε ‐ greedy action selection policy [32], where greedy action a t is selected with a probability of 1 − ε t , i.e., a t = arg max a Q(s t , a), whereas a non-greedy action is selected with probability ε t . We set ε t as the sum of the initial exploration probability ε 0 and an exponential function of time t produced with exploration probability coefficient ε Δ, i.e.,
where the weight coefficient τ is set to 300 in this study. In the case of a non-greedy action, the probability of selecting a candidate camera with a larger Q-value is enhanced by using a Gibbs distribution to select the exploratory action instead of a random distribution. That is, at time t during the exploration, camera a t is selected with probability
where N is the number of cameras in the scene.
4 Q-function approximation
Compared to the full-state representation using Q-values, the function approximation method can reduce the required storage space and accelerate convergence by exploiting the similarity of Q-values between adjacent state-action pairs in the continuous state. Function approximation methods in reinforcement learning include linear parameter representations [28], fuzzy rules [4], neural networks [3], and kernel-based techniques [14]. Approximation based on GMMs is a non-parametric function approximation method [2] that can represent complex systems by changing the number of Gaussian components. In this paper, we present an approximate Q-function method based on GMMs, whereby the GMM parameters are sequentially updated by the mini-batch stepwise EM algorithm. The procedure is divided into offline model initialization and online updating. In the initialization stage, we sample randomly from the state space, and first set the Q-value as the score of the corresponding state, as described in Section 3.3. The initial model parameters are then obtained by offline stepwise EM. In the online update stage, the set of samples Φ learnt from Q-learning are divided into blocks of size m, and the online stepwise EM iterates over each block to update the approximate model parameters.
4.1 GMMs
GMMs are a popular method for representing general probability density functions in multidimensional spaces by the weighted sum of some Gaussian component densities [20], with the number of components adjusted according to the complexity of the problem. In this paper, we assign each camera a GMM, and define the training set of the approximate Q-function for camera a as \( {\varPhi}_a={\left\{{\boldsymbol{\upxi}}_j\right\}}_{j=1}^J={\left\{{\boldsymbol{\upxi}}_{s,j},{\xi}_{q_a,j}\right\}}_{j=1}^J \) (written as Φ for simplicity), where ξ s,j = s j , \( {\xi}_{q_a,j}={\widehat{q}}_a^j \), and J is the number of samples. The probability distribution of ξ j is then the weighted sum of K Gaussian densities, given by:
with mean vector μ k and covariance Σ k for Gaussian component k; where D is the dimension of the samples, and η k, k = 1 … K are the mixture weights that satisfy the constraint \( {\displaystyle \sum_{k=1}^K{\eta}^k=1} \). Therefore, these parameters are collectively represented by the notation Θ = {η k, μ k, Σ k} K k = 1 . For the training set Φ, the log-likelihood function is defined as:
The GMM parameters are traditionally trained in batch mode (i.e., using the whole training set) using the iterative EM algorithm. An initial model Θ 0 is generated by sampling randomly in the state space to give a rough estimation, and the uth iterative model Θ u = {η k u , μ k u , Σ k u } K k = 1 is estimated via E-steps and M-steps until convergence:
E-step:
M-step:
The iteration terminates when \( \left|\frac{{\mathrm{\mathcal{L}}}_{u+1}}{{\mathrm{\mathcal{L}}}_u}-1\right|<{C}_1 \), where C 1 is a small number. It is worth noting that the GMM parameters η k u + 1 , μ k u + 1 , and Σ k u + 1 can be regarded as the weighted means of the sufficient statistics for the triple {1, ξ j , ξ j ξ j T} with the posterior probability p u + 1(k|ξ j ).
4.2 Stepwise EM
Because the approximate function trainer obtains the training episode data in multiple batches, the GMM needs to be sequentially updated online to conserve storage space and accelerate convergence. In this paper, we utilize the stepwise EM algorithm [19, 26] to sequentially update the GMM parameters in mini-batch mode with the Q-value of the sampling episode. In each update, the new Q-value samples, initial model parameters, and the number of historic update samples for each component are provided as input, and the algorithm is divided into E-steps and M-steps. Finally, the updated model parameters and the number of historic update samples are output. The pseudo-code for online stepwise EM is outlined in Algorithm 2.
In the E-step of Algorithm 2, the posterior probability p(k|ξ j ) of Gaussian component k for sample ξ j is calculated using (12). If all posterior probabilities of the GMM components are less than the threshold ThredNew, the new sample is deemed to be too far from the present GMM component to be explained by the current stochastic model. In this case, a new component is produced to account for the new sample data and added to the model. Otherwise, a sufficient statistics triple Suff k j = {< < 1 > > k j , < < ξ > > k j , < < ξξ T > > k j } for the kth component is stepwise updated with sample data ξ j , in which the component < < f(x) > > k j is a time-discounted weighted sum defined as
where {r k j } j ≥ 1 is a decreasing sequence with the number of historic update samples i k for component k, which should satisfy ∑ j r k j = ∞ and ∑ j (r k j )2 < ∞ (we set r k j = (i k + 2)− α), and α ∈ [0.6, 0.9].
In the M-step, to stabilize the algorithm, we update m samples at once; more specifically, the training set is divided into mini-batches. Thus, for sample ξ j (j mod m = 0), the update function \( \overline{\varTheta}\left(Suf{f}_j^1,\dots Suf{f}_j^K\right) \) for the model parameters is defined as:
Initialization of GMMs
Each model component is initialized prior to sequential updating. First, we conduct random sampling in the state space, and set the initial Q-value for the states, as described in Section 3.3. Algorithm 2 is then executed iteratively to update the components of the GMM according to the camera ID in the state sample until \( \left|\frac{{\mathrm{\mathcal{L}}}_{u+1}}{{\mathrm{\mathcal{L}}}_u}-1\right|<{C}_1 \) or the maximum number of iterations for (11) is reached.
Component production
We use a minimum likelihood criterion to recognize a new sample as belonging to a mixture component. For the incoming sample, the algorithm verifies whether it minimally fits any component, and then decides whether a new Gaussian component should be produced and added to the model. Because the probability N(ξ j ; μ k, Σ k) is interpreted as a likelihood function of the kth component for sample ξ j , sample ξ j is not recognized as belonging to any component in the model if its probability N(ξ j ; μ k, Σ k) is lower than the minimum likelihood threshold of all components, i.e.,
where τ min is a previously specified acceptable likelihood for a fraction of the minimum likelihood value, making the minimum likelihood criterion independent of the covariance matrix. If the sample is rejected by all Gaussian components of the model for the corresponding camera, then it is considered to include a new concept. In this case, a new component K + 1 is added to the model, and the initial parameters of the new component are given by
where d 1, d 2, …, d D in (24) are the ranges of variances, and β is an appropriate positive constant for scaling these variances. To control the number of Gaussian components and ensure that the complexity is within a certain range, we set the maximum number of Gaussian components to MAX comp . When a new component is required and K + 1 > MAX comp , the most infrequently updated component with the minimum sufficient statistics < < 1 > > k j ; i.e., arg min K k = 1 < < 1 > > k j , is replaced by the new component.
4.3 Gaussian mixture regression
The approximate Q-value is calculated with the current state and GMM in the action module. To obtain the initial Q-value \( {\widehat{q}}_{a_t}\left({s}_t\right) \) and \( \underset{a}{ \max }{q}_a\left({s}_{t+1}\right) \) in (5), we estimate the expectation of the Q-value for state s using Gaussian Mixture Regression (GMR) [6]. In a generic regression problem, we are given a set of predictor variables X ∈ ℝ p and response variables Y ∈ ℝ g, and the aim of GMR is to estimate the conditional expectation of Y given X on the basis of a set of observations {X, Y}. For the GMM model Θ that encodes the sample set \( {\boldsymbol{\upxi}}_j=\left\{{\boldsymbol{\upxi}}_{s,j},{\xi}_{q_a,j}\right\} \), the state and Q-value of the Gaussian component k for camera a are also separated, i.e., we define
Therefore, the conditional probability of the Q-value q is the mixture of the probability that component k is responsible for state s; specifically,
Where
Finally, as a mixture of K Gaussian components, the estimation of the conditional expectation of q(s) given s (that is, the conditional mean expectation \( \overline{q}\left(\mathbf{s}\right) \)) and conditional covariance expectation matrix \( {\overline{\boldsymbol{\Sigma}}}_{qq} \) are computed using (26) and
Thus, by evaluating \( \left\{\overline{q}\left(\mathbf{s}\right),{\overline{\boldsymbol{\Sigma}}}_{qq}\right\} \) for different states s, we can obtain the generalized form of data point \( \widehat{\boldsymbol{\upxi}}=\left\{\mathbf{s},\overline{q}\left(\mathbf{s}\right)\right\} \) with the associated covariance matrix \( {\overline{\boldsymbol{\Sigma}}}_{qq} \), which gives the estimated Q-value.
5 Measures of visual information
In this paper, the immediate reward described in Section 3 is presented as a set of metrics that evaluate the quality of visual information. To reflect the target information in the video and the visual comfort level of the output video stream, the reward r t + 1 is composed of \( \overline{M_{t+1}^{a_t}} \), the average visual score per frame for the target from two adjacent selections (positive reward), and the cost for view switching (negative reward). This can be expressed as follows:
In (32), w v , w s ∈ [0, 1] is a normalized weight for the visual score and switching, \( \overline{M_{t+1}^{a_t}} \) is the mean visual score of the selected camera a t for T frames after the selection time t in (33), and the switching cost is expressed as a δ function δ(a t , a t + 1) in (34). Further, for the selected camera a t , the visual score of the target at time t is composed of the visibility score \( {M}_v^{a_i} \), normalized orientation score \( {M}_d^{a_i} \), and the clarity score \( {M}_c^{a_i} \); that is,
In (35), w d and w c (w d + w c = 1) are the weights of the orientation and clarity scores, respectively. They define the importance of each metric, and can be learned and modified. Higher the weight, greater the emphasis on the corresponding metric. To simplify the problem, we represent the target visibility score, orientation, and image resolution as being camera-independent; that is, M v , M d , and M c .
In the process of calculating the visual information score, we first track the target in the selected camera’s video, and detect the target’s bounding box. We then obtain the target’s trajectory P t + 1 between time t and t + 1 in the ground plane, using the bottom center of the bounding boxes and the calibration parameters of the camera. The metrics for measuring the visual information are outlined below.
Visibility
The ability of a camera to capture the target is a crucial metric for camera selection. In this paper, we infer the visibility of the target for a particular camera according to the maximum area of the specified target’s motion region in the video frames from human motion detection. That is, when the area of motion is smaller than a predetermined threshold, it can be inferred that the target is invisible, i.e., M v = 0; otherwise, M v = 1.
Orientation
We infer the orientation from the trajectory of the target. Suppose that the target’s front is consistent with its direction of movement, and its trajectory can be considered as a straight line over a short period of time. The orientation angle θ Δt t of the target at time t + Δt for the target position sequence P t + 1 = {(x 1 t , y 1 t ), (x 2 t , y 2 t ), … (x Δt t , y Δt t )} can be approximated by the slope of the line fitted from the closest l points {(x Δt − l + 1 t , y Δt − l + 1 t ), (x Δt − l + 2 t , y Δt − l + 2 t ), … (x Δt t , y Δt t )}, as illustrated in Fig. 3. Further, fitting such a line can reduce the errors introduced by tracking and calibration. As in (6), the orientation score of the selected camera node at time t + Δt can be defined as
The above equation implies that the more positive the direction of motion with respect to the selected camera, the larger the value of M d . This score reaches a maximum when the angle between the direction of motion and the orientation of the camera is 180°, that is, the camera is facing forward towards the target’s front.
Clarity
The image clarity of a target reflects the ability of the camera to obtain detailed information under different perspectives, and can usually be represented by an average gradient or information entropy of the image region. Because the size of a target’s image will vary from different viewpoints, and to compare the resolutions from different views for the same target, we zoom the target region image to the same height H. If the width of the scaled image is W, we represent the mean gradient of the zoomed image as the clarity score M c for camera c, i.e.,
where I(x, y) is the gray value at point (x, y).
6 Experiments and results
6.1 Experimental setup
We conducted simulations and actual scene experiments using an Intel Core i7-3770 2.4 GHz CPU PC with 8 GB RAM. Section 6.2 discusses the simulation experiments, including various scenes and camera parameters under noisy conditions. We used simulations to compare the correctness and convergence of the proposed algorithm with that of greedy selection and classic Q-learning algorithms because simulations can effectively compensate for the limits of hardware and space in real experiments. The convergence rate using the Q-value function initialized as in Section 3.3 was compared with that of the traditional method, which sets the initial Q-value to zero. We also analyzed various effects of the selection results with different time intervals by comparing greedy selection (Max), game theory [16] (Game), and POMDP [31] with the proposed method. These three comparative methods are based on maximizing the current visual information, optimization, and scheduling policies, and are convenient for comparing performance within the framework of scoring and selection. In Section 6.3, the visual features proposed in Section 5 are evaluated and the selection performance of Max, Game, POMDP, and our method are compared. In our method, the learning rate and exploration probability for Q-learning were determined by the values of ρ = 0.1, ε 0 = 0.1, and ε Δ = 0.2 according to the general reinforcement learning method. The discount factor in (2) was set to γ = 0.85, which satisfies γ T < 0.02 and ensures that the currently selected action will not affect the state after T steps for T > 10. For the Q-function approximation, our experiments show that better iteration results can be achieved when the stepwise weight exponent α is set to 0.8 and the maximum number of GMM components is set to MAX comp = 30, with the initial number set to 10. Moreover, we set the scaling constant to variance of new component β as 0.6 and the acceptable likelihood τ min = 0.02. As described in Section 5, the visual feature weights reflect the degree of importance of each feature in different applications; here, we set the weights to w v = 0.5, w s = 0.5, w d = 0.6, and w c = 0.4. To increase the stability of line-fitting when computing the orientation, we set the number of trajectory points involved in fitting l to 10, and the height of the scaled image region of the target H to 300.
6.2 Results of simulation experiments
We conducted a number of simulation experiments by deploying 12 static cameras in a virtual rectangular region of length 12 m, width 8 m, and height 2 m. In this paper, we analyze the simulation results for example scenes A and B shown in Fig. 1. The cameras were deployed at the edge of scene A with overlapped or partially overlapped FOVs. In scene B, there were some static obstacles that necessitated handoffs between cameras. 20,000 trajectories were sampled in each scene to form a training dataset. In these trajectories, the target states consisted of the position coordinates (x, y) and orientation angles θ, which were obtained from the positions in the previous time step, i.e., θ = arctan((y t − y t − 1)/(x t − x t − 1)). To simulate a more realistic data process, Gaussian noise with mean zero and covariance σ was added as observation noise. To evaluate the visual information, we adopted the visibility, orientation, and clarity metrics proposed in Section 5. The visibility in scene A is defined as the angle between the target-camera line and the center axis being less than a certain threshold and the distance between the target and camera line being less than the radius of the FOV. In scene B, we additionally consider whether the target-camera line intersects the region of occlusion. If the camera configuration in both scenes is the same, the resolution of the target image in the simulation experiments can be represented by the distance between the target and the camera; that is, the clarity score for a target at (x s , y s ) and camera at \( \left({x}_{c_i},{y}_{c_i}\right) \) is \( {M}_c^{a_i}=1-\sqrt{{\left({x}_s-{x}_{a_i}\right)}^2+{\left({y}_s-{y}_{a_i}\right)}^2}/ MaxDis \), which is similar to (6).
To analyze the effectiveness and convergence, we used the 20,000 trajectories to train greedy selection (Max), discrete Q-learning (Discr_Q), and the proposed method (Appr_Q) 10 times for each scene. In the greedy selection method, the camera with the maximum visual score was selected. In the discrete Q-learning method, the system state space was discretized into 60 × 40 × 8 grids; more specifically, the virtual region was represented as a 60 × 40 grid according to the length and width, and the orientation angle was denoted as one of eight states, each covering 45°. Each method executed a selection decision at intervals of Δt = 20 samples. The average immediate reward \( Avg\mathrm{R}\mathrm{e} ward=\frac{1}{T}{\displaystyle \sum_{t=1}^T{r}_t} \), where T is the total number of total selection points, was calculated for each trajectory, and the average number of camera switches \( AvgSwitchNum=\frac{10}{T}{\displaystyle \sum_{t=1}^T\delta \left({a}_t,{a}_{t+1}\right)} \) was computed for each group of 10 selections.
The mean AvgReward curves for scenes A and B are shown in Fig. 4a and b, where the shaded areas represent the standard deviation of discrete Q-learning and approximate Q-learning, respectively, for 10 training sessions. Further, the AvgSwitchNum curves for scenes A and B are shown in Fig. 4c and d, where shaded areas around the discrete Q-learning and approximate Q-learning curves represent the standard deviation over 10 training sessions. As can be seen from Fig. 4, reinforcement learning methods outperform greedy selection in terms of immediate reward and switching number after they have converged. Further, the approximate function method converges faster than discrete Q-learning. In addition, the approximate method is marginally better than the discrete method in terms of both reward and switching number, because it is able to represent the model states more compactly.
We initialized the Q-value with the GMM in accordance with Section 3.3 (Initiation_Appr_Q) and compared the results to those when the initial Q-value was set to zero (Without_Initiation_Appr_Q) and discrete Q-Learning (Without_Initiation_Discr_Q), for 60,000 trajectories and 10 training sessions. The mean curves for AvgSwitchNum and AvgReward are shown in Fig. 5. The figure shows that the Q-value initialized with the target location and camera parameters tends to produce convergence after approximately 500 trajectories. In contrast, the Q-values initialized to zero for approximate Q-learning and discrete Q-Learning (especially) produce slower convergence. These experimental results show that the learning process can be accelerated to a significant degree if a priori knowledge of camera parameters and topology information about the environment is utilized, even though the state transitions, that is, the model of target movement in the scene, are not known in advance.
Because the node selection results may be affected by the use of different time intervals, we conducted the experiments for 50 trajectories using time intervals of 10–120 samples (in increments of 10), and set the exploration probability of the approximated Q-function to ε t = 0 (that is, there was no exploration in this test). The average visual score \( AvgVisScore=\frac{1}{T}{\displaystyle \sum_{t=1}^T\overline{M_{t+1}^{a_t}}} \), total switching number \( TotalSwitchNum={\displaystyle \sum_{t=1}^T\delta \left({a}_t,{a}_{t+1}\right)} \), and AvgSwitchNum were then compared for Max, Game, POMDP, and the proposed method (ARL). For comparison, the visual scores of Max, Game, and POMDP were calculated using the proposed visual measurement. The resulting AvgVisScore, TotalSwitchNum, and AvgSwitchNum for the above methods are shown in Fig. 6. It is clear that the total switching number reduces as the time interval increases, whereas the average switching number per 10 selections increases and the average visual score tends to decrease with the reduced correlation between the current state and the next state. Further, the results of our approach are slightly superior to those given by the other three methods.
6.3 Real-life experimental results
We deployed 12 traditional calibrated non-smart cameras in an indoor scene. Natural video sequences were captured simultaneously by the cameras, and the visual processing for different video streams was conducted in parallel, thus simulating a distributed smart camera network. The action and learning modules of Q-learning in the control agent were also processed using multiple threads. A total of 120 video groups of 10 min in length were recorded as training data, and another five video groups of 60 min length were selected as test data. Neither exploration nor policy updating was performed during the test process. We found that the learning convergence during training was faster than for the simulations described in Section 6.2. This is because the target trajectories were more regular in the real scene than in the simulation environment, and a smaller state space was needed for training with respect to objects such as tables and chairs occupying certain spaces. The tracking algorithm itself is not the focus of this study; we used the VTD tracker [15] and manually revised the object location when it was in danger of being lost during tracking. The sampling image and target tracking results for video sequence 1 are shown in Fig. 7, with the orientation score M i d and clarity score M i c for camera i calculated using (36) and (37). The orientation and clarity scores are illustrated in the figures; the target has a higher orientation score when it moves directly towards the camera, and has a higher clarity score when it is closer to the camera.
The orientation, clarity, and total scores for camera 1 (Cam1) and camera 4 (Cam4) for frames 1–2000 of sequence 2 are shown in Fig. 8a–c. For the orientation curves, the angle in radians of the target relative to the x-axis of Cam1 and Cam4 in frames 90 and 830 is 0.09 and 2.65, respectively. The orientation scores of 0.007 and 0.75 in frame 90, and 0.97 and 0.84 in frame 830 better reflect the orientation of the target relative to that of the camera. Although the target reappears in the scene, some parts of the orientation curves (e.g., Cam4 near frames 1950 and 1350) are unstable, which causes varying degrees of jitter in the total score curves. This is mainly because the initial tracker performance is unstable, and some fitting errors are introduced when there are fewer fitting points. The resolution curves show images from frames 210, 380, 1650, and 1870. When the target is far away from the camera, the resolution of the target image is lower and the score is smaller. The resolution curves appear to have a certain amount of jitter in some areas (e.g., frame 1150 of Cam1), signifying errors in the tracking and calibration algorithm and a complex background.
We compared our method (ARL) with Max, Game, and POMDP on five test video sequences. As in the simulation experiments, all the cameras were set to obtain tracking results, and the proposed visual evaluation metrics were used for comparison. Figure 9 shows the selection results for sequences 2 and 3, with a selection decision made every 20 s. In the figure, it can be seen that Max and Game make some false selections. This is because of the inconsistency between the visual score and the real target state in some views for some target errors. In particular, when the visual scores of multiple perspectives are similar or below a certain threshold, the Game selection result tends to produce jitter effects. For example, there is frequent switching in steps 65–75 of sequence 2 and steps 90–100 of sequence 3. In contrast, our method effectively reduces the incidence of false selection and obtains stable results because of the Q-value learning from the training data.
The switching numbers and visual scores per step for five video sequences are shown in Fig. 10a and b. Greedy selection obtained a higher visual score per step for these sequences, although the scores for each method are generally similar. The proposed method has obvious advantages over the other methods in terms of the total switching number, especially greedy selection and Game.
7 Conclusion
As camera networks become extensively utilized in areas such as surveillance and human–computer interaction, finding the most informative and comfortable view from the mass of real-time video is of increasing importance. In this paper, we have proposed a dynamic node selection framework for surveillance applications in critical areas. To deal with the unknown state transitions of target motion and maximize the long-term reward, we trained a selection policy using reinforcement learning in the central controller. The chosen camera nodes extract effective visual features from their video streams and send the immediate reward to the central controller. To accelerate the learning process and reduce storage space, the target’s current state and scene topology were used as a priori knowledge to initialize the Q-function, and an exploration strategy based on Gibbs’ distribution was adopted to guarantee that the camera with the greatest Q-value had a greater exploration probability than under traditional random exploratory action selection. We utilized GMMs to represent the approximate Q-value function, and updated the GMM parameters sequentially via the mini-batch stepwise EM algorithm to meet the learning requirements of episodic sample updating. In addition, we measured the visual information and comfort of visual effects given by different cameras by integrating metrics on visibility, orientation, target clarity, and the cost of switching cameras. Our study of single node selection and scheduling to maximize the expectation of visual information shows that good results can be achieved when fewer targets are in a partial overlap region. The implementation of the proposed approach on a camera network test-bed is planned for future work.
In terms of bandwidth consumption, because feature extraction and scoring are conducted on each camera node and the target association and fusion are distributed across all nodes, each camera need only send its current state and immediate reward to the central controller, and the central node transmits the selected camera’s index to all nodes. The process of node selection is thus distributed, and the network bandwidth consumption is relatively small. However, a central controller is still needed, and the learning and approximation are centralized. Therefore, a decentralized selection policy is of interest, that is, cooperation and coordination between smart camera nodes should be considered.
References
Agarwal A, Triggs B (2006) Recovering 3D human pose from monocular images. IEEE Trans Patt Anal Mach Intell 28(1):44–58. doi:10.1109/TPAMI.2006.21
Agostini A, Celaya E (2010) Reinforcement learning with a Gaussian mixture model. In Neural Networks (IJCNN), The 2010 Int Joint Conf : 1–8. doi: 10.1109/IJCNN.2010.5596306
Botvinick M, Niv Y, Barto A (2009) Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113(3):262–280. doi:10.1016/j.cognition.2008.08.011
Busoniu L, Babuska R, DeSchutter B, Ernst D (2010) Reinforcement learning and dynamic programming using function approximators. CRC press. doi: 10.1201/9781439821091
Charfi Y, Wakamiya N, Murata M (2009) Challenging issues in visual sensor networks. IEEE Wirel Commun 16(2):44. doi:10.1109/MWC.2009.4907559
Cohn D, Ghahramani Z, Jordan M (1996) Active learning with statistical models. J Artif Intell Res 4(1):129–145
Daniyal F, Taj M, Cavallaro A (2010) Content and task-based view selection from multiple video streams. Multimed Tools Applic 46:235–258. doi:10.1007/s11042-009-0355-z
Fu Y, Guo Y, Zhu Y (2010) Multi-view video summarization. IEEE Trans Multimed 12(7):717–729. doi:10.1109/TMM.2010.2052025
Gupta A, Mittal A, Davis L (2007) Cost: An approach for camera selection and multi-object inference ordering in dynamic scenes. Proceedings of IEEE International Conference on Computer Vision, Rio de Janeiro: 1–8. doi:10.1109/ICCV.2007.4408842
Han B, Joo S, Davis L (2011) Multi-camera tracking with adaptive resource allocation. Int J Comput Vis 91(1):45–58. doi:10.1007/s11263-010-0373-3
Huber M (2012) Optimal pruning for multi-step sensor scheduling. IEEE Trans Autom Control 57(5):1338–1343. doi:10.1109/TAC.2011.2175070
Javed O, Khan S, Rasheed Z (2000) Camera handoff: tracking in multiple uncalibrated stationary cameras. Proc Workshop Human Motion, IEEE: 113–118. doi: 10.1109/HUMO.2000.897380
Jiang H, Fels S, Little J (2008) Optimizing multiple object tracking and best view video synthesis. IEEE Trans Multimed 10(6):997–1012. doi:10.1109/TMM.2008.2001379
Kveton B, Theocharous G (2012) Kernel-based reinforcement learning on representative states. Association for the Advancement of Artificial Intelligence, pp 977–983
Kwon J, Lee K (2010) Visual tracking decomposition. IEEE Conf Comput Vision Patt Recognit: 1269–1276. doi: 10.1109/CVPR.2010.5539821
Li Y, Bhanu B (2011) Utility-based camera assignment in a video network: a game theoretic framework. IEEE Sensors J 11(3):676–687. doi:10.1109/JSEN.2010.2051148
Li Q, Sun Z, Chen S, Liu Y (2013) A method of camera selection based on partially observable Markov decision process model in camera networks. Ame Contrl Conf: 3833–3839. doi:10.1109/ACC.2013.6580424
Li W, Zhang W (2012) Sensor selection for improving accuracy of target localization in wireless visual sensor networks. IET Wireless Sens Syst 2(4):293–301. doi:10.1049/iet-wss.2012.0033
Liang P, Klein D (2009) Online EM for unsupervised models. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics. Assoc Comput Linguist: 611–619. doi: 10.3115/1620754.1620843
McLachlan G, Peel D (2004) Finite mixture models. New York: J. Wiley. doi:10.1002/0471721182
Mo Y, Ambrosino R, Sinopoli B (2011) Sensor selection strategies for state estimation in energy constrained wireless sensor networks. Automatica 47(7):1330–1338. doi:10.1016/j.automatica.2011.02.001
Monari E, Kroschel K (2009) A knowledge-based camera selection approach for object tracking in large sensor networks. Third ACM/IEEE Int Conf Distrib Smart Cam. doi:10.1109/ICDSC.2009.5289400
Park J, Bhat C, Kak A (2006) A look-up table based approach for solving the camera selection problem in large camera networks. In IEEE Workshop on Distributed Smart Cameras, Boulder, CO, USA, Oct. 31, 2006
Rudoy D, Zelnik-Manor L (2012) Viewpoint selection for human actions. Int J Comput Vis 97(3):243–254. doi:10.1007/s11263-011-0484-5
Rudoy D, Zelnik-Manor L (2012) Viewpoint selection for human actions. Int J Comput Vis 97(3):243–254. doi:10.1007/s11263-011-0484-5
Sato M, Ishii S (2000) On-line EM algorithm for the normalized Gaussian network. Neural Comput 12(2):407–432. doi:10.1162/089976600300015853
Shen C, Zhang C, Fels S (2007) A multi-camera surveillance system that estimates quality-of-view measurement. Proc IEEE Int Conf Image Process, San Antonio: III 193-III 196. doi:10.1109/ICIP.2007.4379279
Singh S, Jaakkola T, Jordan M (1995) Reinforcement learning with soft state aggregation. Advances in neural information processing systems: 361–368. doi: 10.1162/089976600300015961
Soro S, Heinzelman W (2007) Camera selection in visual sensor networks. Proc IEEE Conf Adv Video Signal Based Surveill: 81–86. doi: 10.1109/AVSS.2007.4425290
Soro S, Heinzelman W (2009) A survey of visual sensor networks. Adv Multimed: 1–22. doi: 10.1155/2009/640386
Spaan M, Lima P (2009) A decision-theoretic approach to dynamic sensor selection in camera networks. In ICAPS: 1–8. doi: ocs/index.php/ICAPS/ICAPS09/paper/viewFile/758/1125
Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press. Cambridge, Massachusetts London, England. doi:10.1109/TNN.1998.712192
Tessens L, Morbee M, Lee H, Philips W, Aghajan H (2008) Principal view determination for camera selection in distributed smart camera networks. Sec ACM/IEEE Int Conf Distrib Smart Cam: 1–10. doi:10.1109/ICDSC.2008.4635699
Watkins C, Dayan P (1992) Q-learning. Mach Learn 8(3):279–292. doi:10.1007/BF00992698
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61272219, 61100110, 41305138, 61473310 and 61321491), Program for New Century Excellent Talents of Ministry of Education of China (NCET-04-0460), Science and Technology Plan of Jiangsu Province (BE2010072, BE2011058, BY2012190, and BY2013072-04), and Innovation Foundation of State Key Lab for Novel Software Technology of China (ZZKT2013A12).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Li, Q., Sun, Z., Chen, S. et al. Dynamic node selection in camera networks based on approximate reinforcement learning. Multimed Tools Appl 75, 17393–17419 (2016). https://doi.org/10.1007/s11042-015-3003-9
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-015-3003-9