1 Introduction

Camera networks have recently received increased research interest for the effective resolution of partial and full occlusions, and the continuous tracking of targets over large areas where the limited field of view (FOV) of a single camera is insufficient. Such networks are used extensively in areas such as security, surveillance, human–computer interaction, navigation, and positioning. However, the utilization of camera networks is beset by challenges related to the deployment and control of the camera nodes, real-time fusion of high-resolution and high frame-rate video streams, and selection and coordination of the cameras [5, 30]. Camera selection involves determining which one (or more) camera(s) should be selected to obtain the maximum information or comfortable visual effects. Thus, camera selection is a fundamental challenge in applications such as target tracking and positioning [10, 18], area coverage [29], surveillance and behavior analysis [24, 33], and video summarization [8]. Dynamic node selection is an effective solution to problems such as video stream transmission with constrained network bandwidth, more economical node cooperation for nodes with constrained power supplies, and a limited number of display terminals. Such issues are faced by emerging applications such as security surveillance and smart homes, which need to obtain high-quality video of specific targets. For example, from the cameras deployed to survey some critical area, users need to select only one optimal viewpoint to be displayed on a mobile device. However, in contrast to node selection in general sensor networks [11, 21], the nearest camera in a camera network cannot be identified by directional measurements alone. Furthermore, computer vision algorithms introduce errors under complex backgrounds, illumination changes, and other factors, which cause the processing results to be unstable and jittery over time. Consequently, in selecting nodes in camera networks, two issues must be addressed: First, a dynamic selection mechanism should be established that ensures the selected node captures the maximum expected amount of information and outputs comfortable video based on current and historical observations. Second, metrics to evaluate the visual information of a video stream must be modeled according to the demands of applications, and these should be adapted to various camera parameters, backgrounds, and scenes.

Previous studies have investigated the issue of node selection in camera networks from the viewpoint of selection mechanisms and visual information evaluation. The consequent methods proposed for node selection in camera networks can be divided into three categories: In the first category, the node that maximizes the current visual information is selected. For example, Tessens et al. [33] proposed a distributed principal viewpoint determination method for smart cameras, in which factors such as the visibility, direction of movement of targets, distance between the center of mass of the object and the camera, and the results of face detection are summarized into a score. The camera with the maximum score is then designated as the principal viewpoint. In [23], cameras are ranked and selected according to the target’s location and look-up tables stored locally at each node. A camera selection approach for object tracking was presented by Monari et al. [22], in which the relevant subset of cameras is determined according to the current or predicted state of the target object and prior geometrical knowledge. However, most methods in this category do not consider the dynamics of visual information and a target’s state in scenes, overall visual effects, network communications, or energy costs. The second category of node selection methods uses a reward or cost function to integrate the constraints, and the selection process is optimized using historical and current system states. For example, in a method proposed by Li and Bhanu [16], the best-view estimation is obtained by iterative bargaining based on game theory. A best-view selection algorithm that generates a smooth video sequence was proposed by Jiang et al. [13], who solved the resulting optimization problem using dynamic programming. Daniyal et al. [7] presented a best-view selection method based on a dynamic Bayesian network according to the states taken from a certain historical time window, and used the resulting output for video production. However, these methods do not predict state transitions or the delayed effect of rewards; the fact that a reward is only realized after the action, and thus affects subsequent states and actions, is not considered. In the third category, dynamic selection decisions are made according to current system states and predicted future states. For example, a Partially Observable Markov Decision Process (POMDP) may be used to find the optimal scheduling policy in tracking and surveillance camera networks [17, 31] by predicting the future reward of an action according to certain assumptions. However, these methods require clear state transitions, which are difficult to explicitly define in complex dynamic systems. Additionally, the computational complexity grows exponentially with the state space because of the “curse of dimensionality.”

On the other hand, the main aim of visual information evaluation is to extract effective visual features and map them to comparable scores. Feature extraction is the key process, and is usually related to the application. For example, the visibility of a target [9], the location in an image or the real world [12, 22, 23], and the image area size [13] of the target are often used in tracking applications, whereas behavior analysis applications are more concerned with the orientation [33], body pose, and appearance [25, 27] of moving targets and the movement of the scene [7]. Surveillance applications for critical areas typically require front-view, high-resolution video of the target of interest, while avoiding the frequent view switching caused by errors in visual computing. The front view of the target is determined by pose estimation and face detection; however, body pose estimation from a single perspective is still a nonlinear and ill-posed problem [1], and face detection has high error rates that are influenced by the video resolution and the relative position of the target with respect to the cameras. Therefore, both methods often produce unstable visual computing results and have high computational complexity.

In this paper, we address the problem of dynamic camera selection for the surveillance of critical areas. Our aim is to obtain the most informative camera node while simultaneously reducing camera switching. To achieve this, we model the dynamic selection of nodes in camera networks as a Markov Decision Process (MDP), and learn the selection strategy via reinforcement learning. For dynamic scenes, a state transition model is implicit, and the state space is continuous. We adopt an approximate Q-learning algorithm to find the optimal solution online, using a Gaussian Mixture Model (GMM) to represent the approximated Q-function. Model parameters are updated by a mini-batch stepwise Expectation–Maximization (EM) algorithm for the incoming episode state and Q-value. To evaluate the visual information, we design a function to integrate the visibility, orientation, and image resolution of the target, and conduct a trade-off between the video switching frequency and the amount of visual information.

The remainder of this paper is organized as follows. Section 2 discusses the problem of node selection in camera networks, before Section 3 presents our proposed selection policy-learning method based on Q-learning. Section 4 outlines how the Q-function is approximated with a GMM updated by mini-batch stepwise EM. Section 5 discusses visual information measures, and Section 6 presents the results of extensive experiments. Finally, we summarize this paper in Section 7.

2 Problem formulation

The camera node selection problem can be described as follows. There are N smart cameras (C 1, C 2 … C N), each consisting of an image sensor, a processor, and a wireless communication module, with partially or completely overlapping FOVs, as illustrated in Fig. 1. We assume that the communication between the cameras is synchronized and delay-free. A designated central controller node schedules the cameras according to a selection policy. At a given time instant t = 0, 1, 2, …, the camera nodes in the network extract the target’s visual features from their video streams, and convert them to a visual score that indicates the quality of the video stream. This score is then sent to the central controller. To obtain the maximum reward from the camera selection action, the central controller selects camera C as the optimal camera at that time instant according to the previously selected camera and the visual scores. Camera C then outputs its surveillance video for the interval (t, t + Δt).

Fig. 1
figure 1

Example of scenes

In the selection process, the central agent dynamically selects the video camera that can best obtain high-quality surveillance video. Thus, it can be considered a sequential decision-making process based on the analysis and evaluation of visual content. Because the state transition and the expected reward only depend on the current system state and the previous selection result, this can be modeled as an MDP. An MDP can be formally defined as a 4-tuple < A, S, T, R >, where A is a finite set of candidate cameras {C 1, C 2 … C N}, S is the set of all possible underlying states, and the system state s t  ∈ S at time t is represented as a vector s t  = (P t , O t , a t − 1), where P t is the sequence of Δt target positions {(x 1 t , y 1 t ), (x 2 t , y 2 t ), … (x Δt t , y Δt t )} between the previous time point t − 1 and the current point t, O t is the sequence of target orientations {θ 1 t , θ 2 t , … θ Δt t }, and a t − 1 ∈ A is the node selected at time point t − 1. T is a state transition function T : S × A → S, and R(s t , a t ) is an immediate reward function R : S × A → ℝ that assigns real-valued rewards to the selection actions performed in each of the underlying process states. To simplify the notation, we represent the immediate reward R(s t , a t ) as r t . Consequently, the MDP model can be used to derive a selection policy π * (s) : s ↦ a that maximizes the expected reward \( {V}^{\pi^{*}}\left(\mathbf{s}\right) \) over some decision steps according to the history of states and actions, i.e.,

$$ {V}^{\pi^{*}}\left(\mathbf{s}\right)=\underset{\pi^{*}}{ \max }E\left[{\displaystyle \sum_{t=0}^{\infty }{\gamma}^t{r}_t}\left|{\mathbf{s}}_0=\mathbf{s}\right.\right] $$
(1)

where γ ∈ [0, 1] is a discount factor that controls the impact of future rewards, causing the effect of a reward to decay exponentially with elapsed time. When the transition and reward functions are known, the model can be computed iteratively offline using dynamic programming. However, the transition model for the movement of the target is difficult to obtain for any real scene.

Reinforcement learning methods, in which agents interact with an unknown environment and improve their behavior through trial and error, have been extensively researched in recent years. We use a Q-learning-based algorithm to learn the optimal policy online, requiring no knowledge of the transition probabilities of the underlying MDP. The classic Q-learning algorithm maps every state-action pair to a real number (or Q-value) using a look-up table. The system described herein has a continuous state and discrete actions. Hence, a large storage space would be needed to store all state-action pairs, and convergence is typically slow because it is not feasible to train all pairs within a large domain. Therefore, in this paper, we represent the Q-function by an approximation function based on a GMM that is updated via stepwise EM for every episode sample.
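To make the formulation concrete, the following minimal sketch (in Python, with hypothetical class and field names not taken from the paper) shows one way to represent the MDP state s t  = (P t , O t , a t − 1) and the transition samples collected during selection.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class State:
    positions: List[Tuple[float, float]]  # P_t: ground-plane positions over the last interval
    orientations: List[float]             # O_t: target orientation angles (radians)
    prev_camera: int                      # a_{t-1}: index of the previously selected camera

Action = int   # an action is simply the index of the camera to output video next

@dataclass
class Transition:
    """One sample <s_t, a_t, r_{t+1}, s_{t+1}> collected by the central agent."""
    state: State
    action: Action
    reward: float
    next_state: State
```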

3 Camera selection based on Q-learning

3.1 Architecture of the proposed method

The general architecture of the node selection method for camera networks based on approximate Q-learning is depicted in Fig. 2. The key symbols used in this paper are given in Table 1. Target tracking is carried out by each camera node, and communication between cameras is used to enhance the fusion of the target state and association for a specific target. Visual scores are calculated and transferred to the central controller node, and the selected node then receives instructions from the central node to output its video stream.

Fig. 2
figure 2

Architecture of the proposed approximate Q-learning-based method

Table 1 Key symbols and notation

The central controller has action and learning modules that can be executed in different parallel threads, and the Q-function Q(s, a) is approximated by a group of GMM models in which q a (s), a = 1 … N represents the expected Q-value for state s and camera a, that is, Q(s, a) = E[q a |s]. In the action module, the central agent selects camera node a t at time step t using the action selection strategy presented in Section 3.4. To accelerate the learning, the target’s current state and scene topology are used as a priori knowledge to initialize the Q-function, as described in Section 3.3. The selected camera index is then transmitted to all nodes in the camera network, after which the immediate reward r t + 1 is obtained and the next state s t + 1 is observed. The new sample 4-tuple < s t , a t , r t + 1, s t + 1 > is added to the sequence (or episode) s 0, a 0, r 1, s 1, …, s t , a t , r t + 1, s t + 1, …, s T − 1, a T − 1, r T , s T . Each episode ends when the target leaves the monitoring area or the number of selection actions reaches some pre-assigned value. In our method, a collection referred to as the EpisodeList is used to store the episodes produced by the action module. The EpisodeList is sent to the learning module to provide training samples when the number of episodes reaches a certain threshold.
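The following sketch illustrates how the action module described above could collect one episode of 4-tuples; the helper callables observe_state, select_camera, visual_reward, and episode_done are assumptions introduced for illustration and are not part of the paper.

```python
def run_episode(observe_state, select_camera, visual_reward, episode_done,
                max_steps=100):
    """Collect one episode of <s_t, a_t, r_{t+1}, s_{t+1}> samples."""
    episode = []
    state = observe_state()
    for _ in range(max_steps):
        action = select_camera(state)               # epsilon-greedy over the approximate Q (Section 3.4)
        next_state = observe_state()                # state after the interval (t, t + dt)
        reward = visual_reward(next_state, action)  # visual score minus switching cost (Section 5)
        episode.append((state, action, reward, next_state))
        if episode_done(next_state):                # e.g., the target has left the monitored area
            break
        state = next_state
    return episode
```

Episodes produced this way accumulate in the EpisodeList until the learning module is triggered.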

In the learning module, Q-value samples are generated from the 4-tuple samples in EpisodeList and the current q a (s) by the Q-learning algorithm described in Section 3.2. The parameters of the GMM for each camera can then be updated online via mini-batch stepwise EM with the Q-value samples. The new approximation function q '  a (s) is then input to the action module as the basis of camera selection.

3.2 Q-learning

In reinforcement learning, an agent generally receives a reward for actions that are beneficial along the path to achieving some long-term goal. Q-learning is a well-known model-free reinforcement learning method that estimates the optimal action-value function [34] with no explicit knowledge of the dynamic environment. In this paper, every state-action pair is mapped to a real number by the function Q(s t , a t ), which is estimated incrementally for each state-action pair (s t , a t ) based on the temporal-difference principle of Q-learning, i.e.,

$$ Q\left({\mathbf{s}}_t,{a}_t\right)\leftarrow Q\left({\mathbf{s}}_t,{a}_t\right)+\rho \left[{r}_{t+1}+\gamma \underset{a}{ \max }Q\left({\mathbf{s}}_{t+1},a\right)-Q\left({\mathbf{s}}_t,{a}_t\right)\right] $$
(2)

where ρ ∈ [0, 1] is the learning rate controlling how fast the Q-function estimates are modified (we set ρ = 0.1), γ is the discount factor for expected long-term rewards, and camera node a is selected to maximize the discounted sum of rewards \( \underset{a}{ \max }Q\left({\mathbf{s}}_{t+1},a\right) \). Q-learning converges to the optimal Q-function if each state-action pair is visited infinitely often and the learning rate for each (s t , a t ) pair satisfies ∑ρ = ∞ and ∑ρ 2 < ∞ [34]. When the Q-function converges, the optimal camera selection policy π(s) can be obtained by maximizing the function, i.e.,

$$ \pi \left(\mathbf{s}\right)= \arg \underset{a\in A}{ \max }Q\left(\mathbf{s},a\right) $$
(3)

We assume there is some approximate function q a (s) that corresponds to each camera a such that the Q-value for the state-action pair (s, a) is the expectation of the function q a (s) for state s, i.e.,

$$ Q\left(\mathbf{s},a\right)=E\left[{q}_a\left(\mathbf{s}\right)\right]=\mu \left[{q}_a\left(\mathbf{s}\right)\right] $$
(4)

For every piece of sample data in EpisodeList, an initial Q-value sample \( <{\mathbf{s}}_t,{a}_t,{\widehat{q}}_{a_t}\left({\mathbf{s}}_t\right)> \) is calculated for the state-action pair (s t , a t ), and the current approximate function is given by Gaussian Mixture Regression with the corresponding GMM model. The Q-value \( {\widehat{q}}_{a_t}\left({\mathbf{s}}_t\right) \) is iteratively updated using:

$$ {\widehat{q}}_{a_t}\left({\mathbf{s}}_t\right)\leftarrow {\widehat{q}}_{a_t}\left({\mathbf{s}}_t\right)+\rho \left[{r}_{t+1}+\gamma \underset{a}{ \max }{q}_a\left({\mathbf{s}}_{t+1}\right)-{\widehat{q}}_{a_t}\left({\mathbf{s}}_t\right)\right] $$
(5)

When the iterative learning has been completed for all samples in EpisodeList, the state s t , selected camera a t , and Q-value \( {\widehat{q}}_a(s) \) are added to the Q-value sample collection \( \varPhi =\left\{<{s}_t,{a}_t,{\widehat{q}}_{a_t}\left({s}_t\right)>\right\} \). The subset \( {D}_a=\left\{<{s}_t,{a}_t,{\widehat{q}}_{a_t}\left({s}_t\right)>\left|a={a}_t\right.\right\} \) for camera a t is used to update the GMM approximation function for the corresponding camera. The pseudo-code for this process is outlined in Algorithm 1.

As can be seen from Algorithm 1, the immediate reward r t + 1 is the visual score of the target between t and t + Δt, as outlined in Section 5.
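As a rough illustration of this step, the sketch below applies the update in (5) to every 4-tuple in EpisodeList and groups the resulting Q-value samples by camera. The q_models objects are assumed to expose a predict(state) method implementing the GMR of Section 4.3; this is an assumed interface, not the authors' code.

```python
RHO, GAMMA = 0.1, 0.85   # learning rate and discount factor used in the paper

def build_q_samples(episode_list, q_models):
    """Return per-camera training sets D_a = {(s_t, q_hat)} from the raw episodes."""
    samples = {a: [] for a in range(len(q_models))}
    for episode in episode_list:
        for (s, a, r_next, s_next) in episode:
            q_hat = q_models[a].predict(s)                        # current approximation q_a(s)
            best_next = max(m.predict(s_next) for m in q_models)  # max_a q_a(s_{t+1})
            q_hat += RHO * (r_next + GAMMA * best_next - q_hat)   # temporal-difference update, Eq. (5)
            samples[a].append((s, q_hat))
    return samples
```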

3.3 Q-function initialization

Q-learning starts from an arbitrary initial Q-function Q 0(s, a) that can be updated without requiring an a priori model. Thus, Q-learning often suffers from slow convergence when the state and action spaces are large. Considering that the camera node with the maximum visual score can be selected greedily if the target’s state remains stable over a subsequent period of time, we use the target’s current state and scene topology as a priori knowledge, and initialize Q 0(s, a) with the immediate reward for greedy selection in state s 0. This accelerates the convergence of the learning. Assuming that the hardware configuration of the cameras in the scene is the same, the immediate reward for greedy selection is associated with the relative position and angle between the target and the selected camera. Thus, the initial pair value Q 0(s, a) can be defined as

$$ {Q}_0\left(s,a\right)={w}_0\left( \sin \left(\frac{\left|{\theta}_s-{\theta}_a\right|}{2}\right)\right)+\left(1-{w}_0\right)\left(1-\frac{\sqrt{{\left({x}_s-{x}_a\right)}^2+{\left({y}_s-{y}_a\right)}^2}}{MaxDis}\right) $$
(6)

where w 0 is a predefined weight, (x s , y s ) and (x a , y a ) give the positions of the target and camera a, MaxDis is the maximum distance used for normalization, and θ a and θ s are the offset angles from the x-axis of camera a and of the target’s front side, respectively, in state s. This equation implies that a camera that is closer to the target and faces the target’s front more directly will produce better-quality monitoring video.
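A direct transcription of (6) might look as follows; the camera pose, MaxDis, and the default weight w0 = 0.5 are illustrative assumptions.

```python
import math

def initial_q(target_xy, target_theta, cam_xy, cam_theta, max_dis, w0=0.5):
    """Initial Q-value from the relative angle and distance, Eq. (6)."""
    angle_term = math.sin(abs(target_theta - cam_theta) / 2.0)
    dist = math.hypot(target_xy[0] - cam_xy[0], target_xy[1] - cam_xy[1])
    return w0 * angle_term + (1.0 - w0) * (1.0 - dist / max_dis)
```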

3.4 Action selection strategy

In reinforcement learning, action selection must address the trade-off between exploration (that is, exploring the world to gather information about the system) and exploitation (which involves executing actions with the current policy). To obtain a greater initial exploration probability that gradually decreases with time until a steady state is reached, we implement an improved ε ‐ greedy action selection policy [32], where greedy action a t is selected with a probability of 1 − ε t , i.e., a t  = arg max a Q(s t , a), whereas a non-greedy action is selected with probability ε t . We set ε t as the sum of the initial exploration probability ε 0 and an exponential function of time t produced with exploration probability coefficient ε Δ, i.e.,

$$ {\varepsilon}_t={\varepsilon}_0+{\varepsilon}_{\varDelta }{e}^{-t/\tau } $$
(7)

where the time constant τ is set to 300 in this study. In the case of a non-greedy action, the probability of selecting a candidate camera with a larger Q-value is enhanced by using a Gibbs distribution, rather than a uniform random distribution, to select the exploratory action. That is, at time t during exploration, camera a t is selected with probability

$$ {p}_{a_t}=\frac{e^{q_{a_t}\left({\mathbf{s}}_t\right){\varepsilon}_t}}{{\displaystyle {\sum}_{b=1}^N{e}^{q_b\left({\mathbf{s}}_t\right){\varepsilon}_t}}} $$
(8)

where N is the number of cameras in the scene.
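Combining (7) and (8), the action selection strategy can be sketched as follows, using the parameter values reported in Section 6.1; q_values is assumed to hold q a (s t ) for all candidate cameras.

```python
import math
import random

EPS0, EPS_DELTA, TAU = 0.1, 0.2, 300.0   # values reported in the experiments

def select_camera(q_values, t):
    """Improved epsilon-greedy selection with Gibbs-weighted exploration."""
    eps_t = EPS0 + EPS_DELTA * math.exp(-t / TAU)          # Eq. (7)
    if random.random() > eps_t:                            # exploit with probability 1 - eps_t
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # explore: Gibbs distribution favours cameras with larger Q-values, Eq. (8)
    weights = [math.exp(q * eps_t) for q in q_values]
    total = sum(weights)
    return random.choices(range(len(q_values)), weights=[w / total for w in weights])[0]
```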

4 Q-function approximation

Compared to a full tabular representation of Q-values, function approximation can reduce the required storage space and accelerate convergence by exploiting the similarity of Q-values between adjacent state-action pairs in the continuous state space. Function approximation methods in reinforcement learning include linear parameter representations [28], fuzzy rules [4], neural networks [3], and kernel-based techniques [14]. Approximation based on GMMs is a non-parametric function approximation method [2] that can represent complex systems by changing the number of Gaussian components. In this paper, we present an approximate Q-function method based on GMMs, whereby the GMM parameters are sequentially updated by the mini-batch stepwise EM algorithm. The procedure is divided into offline model initialization and online updating. In the initialization stage, we sample randomly from the state space, and set the initial Q-value to the score of the corresponding state, as described in Section 3.3. The initial model parameters are then obtained by offline stepwise EM. In the online update stage, the set of samples Φ learnt from Q-learning is divided into blocks of size m, and the online stepwise EM iterates over each block to update the approximate model parameters.

4.1 GMMs

GMMs are a popular method for representing general probability density functions in multidimensional spaces by the weighted sum of some Gaussian component densities [20], with the number of components adjusted according to the complexity of the problem. In this paper, we assign each camera a GMM, and define the training set of the approximate Q-function for camera a as \( {\varPhi}_a={\left\{{\boldsymbol{\upxi}}_j\right\}}_{j=1}^J={\left\{{\boldsymbol{\upxi}}_{s,j},{\xi}_{q_a,j}\right\}}_{j=1}^J \) (written as Φ for simplicity), where ξ s,j  = s j , \( {\xi}_{q_a,j}={\widehat{q}}_a^j \), and J is the number of samples. The probability distribution of ξ j is then the weighted sum of K Gaussian densities, given by:

$$ p\left({\xi}_j\right)={\displaystyle \sum_{k=1}^K{\eta}^kN\left({\xi}_j;{\boldsymbol{\upmu}}^k,{\boldsymbol{\Sigma}}^k\right)} $$
(9)
$$ N\left({\boldsymbol{\upxi}}_j;{\boldsymbol{\upmu}}^k,{\boldsymbol{\Sigma}}^k\right)=\frac{1}{{\left(2\pi \right)}^{D/2}{\left|{\boldsymbol{\Sigma}}^k\right|}^{1/2}} \exp \left\{-\frac{1}{2}\left({\boldsymbol{\upxi}}_j-{\boldsymbol{\upmu}}^k\right)\hbox{'}{\boldsymbol{\Sigma}}_k^{-1}\left({\boldsymbol{\upxi}}_j-{\boldsymbol{\upmu}}^k\right)\right\} $$
(10)

with mean vector μ k and covariance Σ k for Gaussian component k, where D is the dimension of the samples, and η k, k = 1 … K are the mixture weights that satisfy the constraint \( {\displaystyle \sum_{k=1}^K{\eta}^k=1} \). These parameters are collectively represented by the notation Θ = {η k, μ k, Σ k} K k = 1 . For the training set Φ, the log-likelihood function is defined as:

$$ \mathrm{\mathcal{L}}\left(\varPhi; \varTheta \right)= \ln \prod_{j=1}^Jp\left({\boldsymbol{\upxi}}_j;\varTheta \right)={\displaystyle \sum_{j=1}^J \ln \left(p\left({\boldsymbol{\upxi}}_j;\varTheta \right)\right)} $$
(11)

The GMM parameters are traditionally trained in batch mode (i.e., using the whole training set) using the iterative EM algorithm. An initial model Θ 0 is generated by sampling randomly in the state space to give a rough estimation, and the uth iterative model Θ u  = {η k u , μ k u , Σ k u } K k = 1 is estimated via E-steps and M-steps until convergence:

E-step:

$$ {p}_{u+1}\left(k\Big|{\boldsymbol{\upxi}}_j\right)=\frac{\eta_u^kN\left({\boldsymbol{\upxi}}_j;{\boldsymbol{\upmu}}_u^k,{\boldsymbol{\Sigma}}_u^k\right)}{{\displaystyle \sum_{k=1}^K{\eta}_u^kN\left({\boldsymbol{\upxi}}_j;{\boldsymbol{\upmu}}_u^k,{\boldsymbol{\Sigma}}_u^k\right)}} $$
(12)
$$ {E}_{u+1}^k={p}_{u+1}(k)={\displaystyle \sum_{j=1}^J{p}_{u+1}}\left(k\Big|{\boldsymbol{\upxi}}_j\right) $$
(13)

M-step:

$$ {\eta}_{u+1}^k=\frac{1}{J}{E}_{u+1}^k $$
(14)
$$ {\mu}_{u+1}^k=\frac{{\displaystyle \sum_{j=1}^J{p}_{u+1}}\left(k\Big|{\boldsymbol{\upxi}}_j\right){\boldsymbol{\upxi}}_j}{E_{u+1}^k} $$
(15)
$$ {\boldsymbol{\Sigma}}_{u+1}^k=\frac{{\displaystyle \sum_{j=1}^J{p}_{u+1}}\left(k\Big|{\boldsymbol{\upxi}}_j\right)\left({\boldsymbol{\upxi}}_j-{\boldsymbol{\upmu}}_{u+1}^k\right){\left({\boldsymbol{\upxi}}_j-{\boldsymbol{\upmu}}_{u+1}^k\right)}^T}{E_{u+1}^k}=\frac{{\displaystyle \sum_{j=1}^J{p}_{u+1}}\left(k\Big|{\boldsymbol{\upxi}}_j\right){\boldsymbol{\upxi}}_j{\boldsymbol{\upxi}}_j^T}{E_{u+1}^k}-{\boldsymbol{\upmu}}_{u+1}^k{\left({\boldsymbol{\upmu}}_{u+1}^k\right)}^T $$
(16)

The iteration terminates when \( \left|\frac{{\mathrm{\mathcal{L}}}_{u+1}}{{\mathrm{\mathcal{L}}}_u}-1\right|<{C}_1 \), where C 1 is a small number. It is worth noting that the GMM parameters η k u + 1 , μ k u + 1 , and Σ k u + 1 can be regarded as the weighted means of the sufficient statistics for the triple {1, ξ j , ξ j ξ j T} with the posterior probability p u + 1(k|ξ j ).
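For reference, one batch EM iteration of (12)–(16) can be written compactly with numpy as below; this is a generic GMM update rather than the authors' implementation, and a library routine such as scikit-learn's GaussianMixture could be substituted.

```python
import numpy as np

def em_step(xi, weights, means, covs):
    """xi: (J, D) samples; weights: (K,); means: (K, D); covs: (K, D, D)."""
    J, D = xi.shape
    K = len(weights)
    resp = np.zeros((J, K))
    for k in range(K):                                   # E-step: posterior p(k | xi_j), Eq. (12)
        diff = xi - means[k]
        inv = np.linalg.inv(covs[k])
        norm = 1.0 / np.sqrt(((2 * np.pi) ** D) * np.linalg.det(covs[k]))
        resp[:, k] = weights[k] * norm * np.exp(-0.5 * np.einsum('jd,de,je->j', diff, inv, diff))
    resp /= resp.sum(axis=1, keepdims=True)
    e_k = resp.sum(axis=0)                               # E_k in Eq. (13)
    new_weights = e_k / J                                # Eq. (14)
    new_means = (resp.T @ xi) / e_k[:, None]             # Eq. (15)
    new_covs = np.empty_like(covs)
    for k in range(K):                                   # Eq. (16)
        diff = xi - new_means[k]
        new_covs[k] = (resp[:, k, None] * diff).T @ diff / e_k[k]
    return new_weights, new_means, new_covs
```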

4.2 Stepwise EM

Because the approximate function trainer obtains the training episode data in multiple batches, the GMM needs to be sequentially updated online to conserve storage space and accelerate convergence. In this paper, we utilize the stepwise EM algorithm [19, 26] to sequentially update the GMM parameters in mini-batch mode with the Q-value of the sampling episode. In each update, the new Q-value samples, initial model parameters, and the number of historic update samples for each component are provided as input, and the algorithm is divided into E-steps and M-steps. Finally, the updated model parameters and the number of historic update samples are output. The pseudo-code for online stepwise EM is outlined in Algorithm 2.

In the E-step of Algorithm 2, the posterior probability p(k|ξ j ) of Gaussian component k for sample ξ j is calculated using (12). If all posterior probabilities of the GMM components are less than the threshold ThredNew, the new sample is deemed to be too far from the existing GMM components to be explained by the current stochastic model. In this case, a new component is produced to account for the new sample data and added to the model. Otherwise, the sufficient statistics triple Suff k j  = {<<1>> k j , <<ξ>> k j , <<ξξ T>> k j } for the kth component is updated stepwise with sample data ξ j , where each entry <<f(x)>> k j is a time-discounted weighted sum defined as

$$ <<f(x)>{>}_j^k=\left(1-{r}_j^k\right)<<f(x)>{>}_{j-1}^k+{r}_j^kp\left(k\Big|{\boldsymbol{\upxi}}_j\right)f(x) $$
(17)

where {r k j } j ≥ 1 is a sequence that decreases with the number of historical update samples i k for component k and satisfies ∑ j r k j  = ∞ and ∑ j (r k j )2 < ∞; we set r k j  = (i k  + 2)− α with α ∈ [0.6, 0.9].

In the M-step, to stabilize the algorithm, we update m samples at once; more specifically, the training set is divided into mini-batches. Thus, for sample ξ j (j mod m = 0), the update function \( \overline{\varTheta}\left(Suf{f}_j^1,\dots Suf{f}_j^K\right) \) for the model parameters is defined as:

$$ \overline{\varTheta}:{\eta}_j^k=\frac{<<1>{>}_j^k}{{\displaystyle \sum_{k=1}^K<<1>{>}_j^k}} $$
(18)
$$ {\boldsymbol{\upmu}}_j^k=\frac{<<\boldsymbol{\upxi} >{>}_j^k}{<<1>{>}_j^k} $$
(19)
$$ {\boldsymbol{\Sigma}}_j^k=\frac{<<\boldsymbol{\upxi} {\boldsymbol{\upxi}}^T>{>}_j^k}{<<1>{>}_j^k}-{\boldsymbol{\upmu}}_j^k{\boldsymbol{\upmu}}_j^k\hbox{'} $$
(20)
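The following sketch shows the stepwise statistics update of (17) and the mini-batch parameter recovery of (18)–(20); the dictionary-based bookkeeping and the value α = 0.8 (taken from the experiments) are illustrative assumptions.

```python
import numpy as np

ALPHA = 0.8   # stepwise weight exponent used in the experiments

def stepwise_update(suff, counts, k, xi, resp_k):
    """Blend sample xi into component k's statistics with step size r = (i_k + 2)^-alpha, Eq. (17)."""
    r = (counts[k] + 2.0) ** (-ALPHA)
    suff[k]['one'] = (1 - r) * suff[k]['one'] + r * resp_k
    suff[k]['xi'] = (1 - r) * suff[k]['xi'] + r * resp_k * xi
    suff[k]['xixiT'] = (1 - r) * suff[k]['xixiT'] + r * resp_k * np.outer(xi, xi)
    counts[k] += 1

def recover_parameters(suff):
    """Mini-batch M-step: rebuild (eta, mu, Sigma) for every component, Eqs. (18)-(20)."""
    total = sum(s['one'] for s in suff)
    params = []
    for s in suff:
        eta = s['one'] / total
        mu = s['xi'] / s['one']
        sigma = s['xixiT'] / s['one'] - np.outer(mu, mu)
        params.append((eta, mu, sigma))
    return params
```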

Initialization of GMMs

Each model component is initialized prior to sequential updating. First, we conduct random sampling in the state space, and set the initial Q-value for the states, as described in Section 3.3. Algorithm 2 is then executed iteratively to update the components of the GMM according to the camera ID in the state sample until \( \left|\frac{{\mathrm{\mathcal{L}}}_{u+1}}{{\mathrm{\mathcal{L}}}_u}-1\right|<{C}_1 \) or the maximum number of iterations for (11) is reached.

Component production

We use a minimum likelihood criterion to decide whether a new sample can be attributed to an existing mixture component. For each incoming sample, the algorithm verifies whether it fits any component at least minimally, and then decides whether a new Gaussian component should be produced and added to the model. Because the probability N(ξ j ; μ k, Σ k) is interpreted as a likelihood function of the kth component for sample ξ j , sample ξ j is not recognized as belonging to any component in the model if its probability N(ξ j ; μ k, Σ k) is lower than the minimum likelihood threshold of all components, i.e.,

$$ N\left({\boldsymbol{\upxi}}_j;{\boldsymbol{\upmu}}^k,{\boldsymbol{\Sigma}}^k\right)<\frac{\tau_{\min }}{{\left(2\pi \right)}^{D/2}\sqrt{\left|{\boldsymbol{\Sigma}}_k\right|}}\forall k $$
(21)

where τ min is a pre-specified acceptable likelihood expressed as a fraction of the minimum likelihood value, which makes the minimum likelihood criterion independent of the covariance matrix. If the sample is rejected by all Gaussian components of the model for the corresponding camera, then it is considered to represent a new concept. In this case, a new component K + 1 is added to the model, and the initial parameters of the new component are given by

$$ {\eta}_j^{K+1}=\frac{1}{1+{\displaystyle \sum_{k=1}^K<<1>{>}_j^k}} $$
(22)
$$ {\boldsymbol{\upmu}}_j^{K+1}={\boldsymbol{\upxi}}_j $$
(23)
$$ {\boldsymbol{\Sigma}}_j^{K+1}=\beta diag\left({d}_1,{d}_2,\dots, {d}_D\right) $$
(24)

where d 1, d 2, …, d D in (24) are the ranges of the variances, and β is an appropriate positive constant for scaling these variances. To control the number of Gaussian components and ensure that the complexity stays within a certain range, we set the maximum number of Gaussian components to MAX comp . When a new component is required and K + 1 > MAX comp , the least frequently updated component, i.e., the component with the minimum sufficient statistic <<1>> k j (arg min k <<1>> k j ), is replaced by the new component.
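The component-production rule of (21)–(24) can be sketched as follows; the variance ranges d and the constants τ_min = 0.02 and β = 0.6 (the experimental settings) are passed in as configuration values.

```python
import numpy as np

def needs_new_component(xi, means, covs, tau_min=0.02):
    """True if the sample is rejected by every existing component, Eq. (21)."""
    D = len(xi)
    for mu, cov in zip(means, covs):
        diff = xi - mu
        dens = np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / \
               np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov))
        threshold = tau_min / (((2 * np.pi) ** (D / 2)) * np.sqrt(np.linalg.det(cov)))
        if dens >= threshold:           # the sample fits this component at least minimally
            return False
    return True

def new_component(xi, suff_ones, variance_ranges, beta=0.6):
    """Initial weight, mean, and covariance of the added component, Eqs. (22)-(24)."""
    eta = 1.0 / (1.0 + sum(suff_ones))
    mu = np.asarray(xi, dtype=float).copy()
    sigma = beta * np.diag(variance_ranges)
    return eta, mu, sigma
```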

4.3 Gaussian mixture regression

The approximate Q-value is calculated with the current state and GMM in the action module. To obtain the initial Q-value \( {\widehat{q}}_{a_t}\left({s}_t\right) \) and \( \underset{a}{ \max }{q}_a\left({s}_{t+1}\right) \) in (5), we estimate the expectation of the Q-value for state s using Gaussian Mixture Regression (GMR) [6]. In a generic regression problem, we are given a set of predictor variables X ∈ ℝ p and response variables Y ∈ ℝ g, and the aim of GMR is to estimate the conditional expectation of Y given X on the basis of a set of observations {X, Y}. For the GMM model Θ that encodes the sample set \( {\boldsymbol{\upxi}}_j=\left\{{\boldsymbol{\upxi}}_{s,j},{\xi}_{q_a,j}\right\} \), the state and Q-value of the Gaussian component k for camera a are also separated, i.e., we define

$$ {\boldsymbol{\upmu}}_k=\left\{{\boldsymbol{\upmu}}_{s,k},{\mu}_{q,k}\right\},\;{\boldsymbol{\Sigma}}_k=\left(\begin{array}{cc}\hfill {\boldsymbol{\Sigma}}_{ss,k}\hfill & \hfill {\boldsymbol{\Sigma}}_{sq,k}\hfill \\ {}\hfill {\boldsymbol{\Sigma}}_{qs,k}\hfill & \hfill {\boldsymbol{\Sigma}}_{qq,k}\hfill \end{array}\right) $$
(25)

Therefore, the conditional probability of the Q-value q given state s is a mixture weighted by the probability that component k is responsible for state s; specifically,

$$ p\left(q\Big|\mathbf{s}\right)={\displaystyle \sum_{k=1}^K{\beta}_kp\left({q}_k\Big|\mathbf{s},k\right)}={\displaystyle \sum_{k=1}^K{\beta}_kN\left({q}_k;{\overline{q}}_k\left(\mathbf{s}\right),{\overline{\boldsymbol{\Sigma}}}_{qq,k}\right)} $$
(26)

where

$$ {\overline{q}}_k\left(\mathbf{s}\right)={\mu}_{q,k}+{\boldsymbol{\Sigma}}_{qs,k}{\left({\boldsymbol{\Sigma}}_{ss,k}\right)}^{-1}\left(\mathbf{s}-{\boldsymbol{\upmu}}_{s,k}\right) $$
(27)
$$ {\overline{\boldsymbol{\Sigma}}}_{qq,k}={\boldsymbol{\Sigma}}_{qq,k}-{\boldsymbol{\Sigma}}_{qs,k}{\left({\boldsymbol{\Sigma}}_{ss,k}\right)}^{-1}{\boldsymbol{\Sigma}}_{sq,k} $$
(28)
$$ {\beta}_k=\frac{p(k)p\left(s\Big|k\right)}{{\displaystyle {\sum}_{i=1}^Kp(i)p\left(s\Big|i\right)}}=\frac{\eta_kN\left(s;{\mu}_{s,k},{\boldsymbol{\Sigma}}_{ss,k}\right)}{{\displaystyle {\sum}_{i=1}^K{\eta}_iN\left(s;{\mu}_{s,i},{\boldsymbol{\Sigma}}_{ss,i}\right)}} $$
(29)

Finally, for the mixture of K Gaussian components, the conditional mean expectation \( \overline{q}\left(\mathbf{s}\right) \) of q(s) given s and the conditional covariance expectation matrix \( {\overline{\boldsymbol{\Sigma}}}_{qq} \) are computed from (26) as

$$ \overline{q}\left(\mathbf{s}\right)={\displaystyle \sum_{k=1}^K{\beta}_k{\overline{q}}_k\left(\mathbf{s}\right)} $$
(30)
$$ {\overline{\boldsymbol{\Sigma}}}_{qq}={\displaystyle \sum_{k=1}^K{\beta}_k^2{\overline{\boldsymbol{\Sigma}}}_{qq,k}} $$
(31)

Thus, by evaluating \( \left\{\overline{q}\left(\mathbf{s}\right),{\overline{\boldsymbol{\Sigma}}}_{qq}\right\} \) for different states s, we can obtain the generalized form of data point \( \widehat{\boldsymbol{\upxi}}=\left\{\mathbf{s},\overline{q}\left(\mathbf{s}\right)\right\} \) with the associated covariance matrix \( {\overline{\boldsymbol{\Sigma}}}_{qq} \), which gives the estimated Q-value.
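For clarity, the GMR computation of (27)–(31) for a scalar Q-value can be sketched as follows, with the component parameters partitioned as in (25); the array layout is an assumption made for the example.

```python
import numpy as np

def gmr_expected_q(s, eta, mu_s, mu_q, sig_ss, sig_sq, sig_qs, sig_qq):
    """Conditional expectation and variance of the Q-value given state s.

    eta: (K,) mixture weights; mu_s: (K, D); mu_q: (K,);
    sig_ss: (K, D, D); sig_sq, sig_qs: (K, D); sig_qq: (K,).
    """
    K, D = mu_s.shape
    beta = np.zeros(K)
    q_k = np.zeros(K)
    var_k = np.zeros(K)
    for k in range(K):
        diff = s - mu_s[k]
        inv_ss = np.linalg.inv(sig_ss[k])
        dens = np.exp(-0.5 * diff @ inv_ss @ diff) / \
               np.sqrt(((2 * np.pi) ** D) * np.linalg.det(sig_ss[k]))
        beta[k] = eta[k] * dens                                 # Eq. (29), before normalization
        q_k[k] = mu_q[k] + sig_qs[k] @ inv_ss @ diff            # Eq. (27)
        var_k[k] = sig_qq[k] - sig_qs[k] @ inv_ss @ sig_sq[k]   # Eq. (28)
    beta /= beta.sum()
    q_bar = float(beta @ q_k)                                   # Eq. (30)
    var_bar = float((beta ** 2) @ var_k)                        # Eq. (31)
    return q_bar, var_bar
```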

5 Measures of visual information

In this paper, the immediate reward described in Section 3 is defined in terms of a set of metrics that evaluate the quality of visual information. To reflect the target information in the video and the visual comfort of the output video stream, the reward r t + 1 is composed of \( \overline{M_{t+1}^{a_t}} \), the average visual score per frame for the target between two adjacent selections (positive reward), and the cost of view switching (negative reward). This can be expressed as follows:

$$ {r}_{t+1}={w}_v\overline{M_{t+1}^{a_t}}-{w}_s\delta \left({a}_t,{a}_{t+1}\right) $$
(32)
$$ \overline{M_{t+1}^{a_t}}=\frac{1}{T}{\displaystyle \sum_{k=t+1}^{t+T}{M}_k^{a_t}} $$
(33)
$$ \delta \left({a}_t,{a}_{t+1}\right)=\left\{\begin{array}{c}\hfill 0\kern1.08em {a}_t={a}_{t+1}\hfill \\ {}\hfill 1\kern0.72em otherwise\hfill \end{array}\right. $$
(34)

In (32), w v and w s  ∈ [0, 1] are normalized weights for the visual score and the switching cost, \( \overline{M_{t+1}^{a_t}} \) is the mean visual score of the selected camera a t for T frames after the selection time t in (33), and the switching cost is expressed as a δ function δ(a t , a t + 1) in (34). Further, for the selected camera a t , the visual score of the target at time t is composed of the visibility score \( {M}_v^{a_i} \), normalized orientation score \( {M}_d^{a_i} \), and clarity score \( {M}_c^{a_i} \); that is,

$$ {M}^{a_i}={M}_v^{a_i}\left({w}_d{M}_d^{a_i}+{w}_c{M}_c^{a_i}\right) $$
(35)

In (35), w d and w c (w d  + w c  = 1) are the weights of the orientation and clarity scores, respectively. They define the importance of each metric, and can be learned and modified. The higher the weight, the greater the emphasis on the corresponding metric. To simplify the notation, we omit the camera index and write the target’s visibility, orientation, and clarity scores as M v , M d , and M c .
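Putting (32)–(35) together, the reward computation can be sketched as follows; the weights are the experimental settings from Section 6.1, and the per-frame scores M v , M d , and M c are computed as described in the following subsections.

```python
W_V, W_S, W_D, W_C = 0.5, 0.5, 0.6, 0.4   # weights used in the experiments

def frame_score(m_v, m_d, m_c):
    """Per-frame visual score M, Eq. (35)."""
    return m_v * (W_D * m_d + W_C * m_c)

def immediate_reward(frame_scores, prev_camera, curr_camera):
    """Average score over the interval minus the switching penalty, Eq. (32)."""
    mean_score = sum(frame_scores) / len(frame_scores)
    switch_cost = 0.0 if prev_camera == curr_camera else 1.0   # delta function in Eq. (34)
    return W_V * mean_score - W_S * switch_cost
```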

In the process of calculating the visual information score, we first track the target in the selected camera’s video, and detect the target’s bounding box. We then obtain the target’s trajectory P t + 1 between time t and t + 1 in the ground plane, using the bottom center of the bounding boxes and the calibration parameters of the camera. The metrics for measuring the visual information are outlined below.

Visibility

The ability of a camera to capture the target is a crucial metric for camera selection. In this paper, we infer the visibility of the target for a particular camera according to the maximum area of the specified target’s motion region in the video frames from human motion detection. That is, when the area of motion is smaller than a predetermined threshold, it can be inferred that the target is invisible, i.e., M v  = 0; otherwise, M v  = 1.

Orientation

We infer the orientation from the trajectory of the target. We assume that the target’s front is consistent with its direction of movement, and that its trajectory can be treated as a straight line over a short period of time. The orientation angle θ Δt t of the target at time t + Δt for the target position sequence P t + 1 = {(x 1 t , y 1 t ), (x 2 t , y 2 t ), … (x Δt t , y Δt t )} can then be approximated by the slope of the line fitted to the most recent l points {(x Δt − l + 1 t , y Δt − l + 1 t ), (x Δt − l + 2 t , y Δt − l + 2 t ), … (x Δt t , y Δt t )}, as illustrated in Fig. 3. Moreover, fitting such a line reduces the errors introduced by tracking and calibration. As in (6), the orientation score of the selected camera node at time t + Δt can be defined as

Fig. 3
figure 3

Orientation angle based on a target’s position sequence

$$ {M}_d= \sin \left(\frac{\left|{\theta}_t^k-{\theta}_a\right|}{2}\right) $$
(36)

This equation implies that the more directly the target moves towards the selected camera, the larger the value of M d . The score reaches its maximum when the angle between the direction of motion and the orientation of the camera is 180°, that is, when the camera directly faces the target’s front.
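A minimal sketch of this orientation score, assuming calibrated ground-plane positions and resolving the direction of the fitted line from the net displacement (a detail the paper does not spell out), is given below.

```python
import math
import numpy as np

def orientation_score(track_xy, theta_a, l=10):
    """track_xy: ground-plane positions (x, y), newest last; theta_a: camera offset angle."""
    pts = np.asarray(track_xy[-l:], dtype=float)
    # Least-squares line fit through the most recent l points (ill-conditioned for purely
    # vertical motion, which a fuller implementation would handle separately).
    slope, _ = np.polyfit(pts[:, 0], pts[:, 1], 1)
    theta_t = math.atan(slope)
    # Orient the fitted line along the actual direction of motion.
    if pts[-1, 0] - pts[0, 0] < 0:
        theta_t += math.pi
    return math.sin(abs(theta_t - theta_a) / 2.0)   # Eq. (36)
```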

Clarity

The image clarity of a target reflects the ability of the camera to obtain detailed information from different perspectives, and can usually be represented by the average gradient or information entropy of the image region. Because the size of the target’s image varies across viewpoints, we scale the target region image to a common height H so that resolutions of the same target from different views can be compared. If the width of the scaled image is W, the clarity score M c is the mean gradient of the scaled image, i.e.,

$$ {M}_c=\frac{1}{\left(W-1\right)\left(H-1\right)}{\displaystyle \sum_{x=1}^{W-1}{\displaystyle \sum_{y=1}^{H-1}\sqrt{{\displaystyle {\left(\frac{\partial I\left(x,y\right)}{\partial x}\right)}^2}+{\displaystyle {\left(\frac{\partial I\left(x,y\right)}{\partial y}\right)}^2}}}} $$
(37)

where I(x, y) is the gray value at point (x, y).
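A sketch of the clarity computation follows; OpenCV is assumed only for resizing the grey-level target patch, and finite differences stand in for the partial derivatives in (37).

```python
import cv2
import numpy as np

def clarity_score(gray_patch, target_height=300):
    """Mean gradient of the target patch scaled to a common height, Eq. (37)."""
    h, w = gray_patch.shape
    new_w = max(1, int(round(w * target_height / float(h))))
    resized = cv2.resize(gray_patch, (new_w, target_height)).astype(np.float32)
    gx = np.diff(resized, axis=1)[:-1, :]   # horizontal differences, cropped to a common shape
    gy = np.diff(resized, axis=0)[:, :-1]   # vertical differences
    return float(np.mean(np.sqrt(gx ** 2 + gy ** 2)))
```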

6 Experiments and results

6.1 Experimental setup

We conducted simulations and actual scene experiments using an Intel Core i7-3770 2.4 GHz CPU PC with 8 GB RAM. Section 6.2 discusses the simulation experiments, including various scenes and camera parameters under noisy conditions. We used simulations to compare the correctness and convergence of the proposed algorithm with those of greedy selection and classic Q-learning, because simulations can effectively compensate for the limits of hardware and space in real experiments. The convergence rate using the Q-value function initialized as in Section 3.3 was compared with that of the traditional method, which sets the initial Q-value to zero. We also analyzed the effects of different time intervals on the selection results by comparing greedy selection (Max), game theory [16] (Game), and POMDP [31] with the proposed method. These three comparative methods are based on maximizing the current visual information, optimization, and scheduling policies, respectively, and are convenient for comparing performance within the framework of scoring and selection. In Section 6.3, the visual features proposed in Section 5 are evaluated and the selection performance of Max, Game, POMDP, and our method is compared. In our method, the learning rate and exploration probability for Q-learning were set to ρ = 0.1, ε 0 = 0.1, and ε Δ  = 0.2, following common practice in reinforcement learning. The discount factor in (2) was set to γ = 0.85, so that γ T becomes negligible (below 0.02) after a few tens of steps and the currently selected action does not appreciably affect later states. For the Q-function approximation, our experiments show that better iteration results can be achieved when the stepwise weight exponent α is set to 0.8 and the maximum number of GMM components is set to MAX comp  = 30, with the initial number set to 10. Moreover, we set the scaling constant β for the variance of a new component to 0.6 and the acceptable likelihood to τ min = 0.02. As described in Section 5, the visual feature weights reflect the degree of importance of each feature in different applications; here, we set the weights to w v  = 0.5, w s  = 0.5, w d  = 0.6, and w c  = 0.4. To increase the stability of line-fitting when computing the orientation, we set the number of trajectory points used for fitting to l = 10 and the height of the scaled target image region to H = 300.

6.2 Results of simulation experiments

We conducted a number of simulation experiments by deploying 12 static cameras in a virtual rectangular region of length 12 m, width 8 m, and height 2 m. In this paper, we analyze the simulation results for example scenes A and B shown in Fig. 1. The cameras were deployed at the edge of scene A with overlapping or partially overlapping FOVs. In scene B, there were some static obstacles that necessitated handoffs between cameras. A total of 20,000 trajectories were sampled in each scene to form a training dataset. In these trajectories, the target states consisted of the position coordinates (x, y) and orientation angles θ, which were obtained from the positions in the previous time step, i.e., θ = arctan((y t  − y t − 1)/(x t  − x t − 1)). To simulate a more realistic data process, Gaussian noise with mean zero and covariance σ was added as observation noise. To evaluate the visual information, we adopted the visibility, orientation, and clarity metrics proposed in Section 5. In scene A, the target is considered visible if the angle between the target-camera line and the camera’s central axis is less than a certain threshold and the distance between the target and the camera is less than the radius of the FOV. In scene B, we additionally consider whether the target-camera line intersects the region of occlusion. Because the camera configuration is the same in both scenes, the resolution of the target image in the simulation experiments can be represented by the distance between the target and the camera; that is, the clarity score for a target at (x s , y s ) and camera at \( \left({x}_{a_i},{y}_{a_i}\right) \) is \( {M}_c^{a_i}=1-\sqrt{{\left({x}_s-{x}_{a_i}\right)}^2+{\left({y}_s-{y}_{a_i}\right)}^2}/ MaxDis \), which is similar to (6).

To analyze the effectiveness and convergence, we used the 20,000 trajectories to train greedy selection (Max), discrete Q-learning (Discr_Q), and the proposed method (Appr_Q) 10 times for each scene. In the greedy selection method, the camera with the maximum visual score was selected. In the discrete Q-learning method, the system state space was discretized into 60 × 40 × 8 grids; more specifically, the virtual region was represented as a 60 × 40 grid according to its length and width, and the orientation angle was assigned to one of eight states, each covering 45°. Each method executed a selection decision at intervals of Δt = 20 samples. The average immediate reward \( AvgReward=\frac{1}{T}{\displaystyle \sum_{t=1}^T{r}_t} \), where T is the total number of selection points, was calculated for each trajectory, and the average number of camera switches \( AvgSwitchNum=\frac{10}{T}{\displaystyle \sum_{t=1}^T\delta \left({a}_t,{a}_{t+1}\right)} \) was computed for each group of 10 selections.

The mean AvgReward curves for scenes A and B are shown in Fig. 4a and b, where the shaded areas represent the standard deviation of discrete Q-learning and approximate Q-learning, respectively, for 10 training sessions. Further, the AvgSwitchNum curves for scenes A and B are shown in Fig. 4c and d, where shaded areas around the discrete Q-learning and approximate Q-learning curves represent the standard deviation over 10 training sessions. As can be seen from Fig. 4, reinforcement learning methods outperform greedy selection in terms of immediate reward and switching number after they have converged. Further, the approximate function method converges faster than discrete Q-learning. In addition, the approximate method is marginally better than the discrete method in terms of both reward and switching number, because it is able to represent the model states more compactly.

Fig. 4
figure 4

Average reward and switch number for scenes A and B (see Fig. 1)

We initialized the Q-value with the GMM in accordance with Section 3.3 (Initiation_Appr_Q) and compared the results to those obtained when the initial Q-value was set to zero for approximate Q-learning (Without_Initiation_Appr_Q) and discrete Q-learning (Without_Initiation_Discr_Q), using 60,000 trajectories and 10 training sessions. The mean curves for AvgSwitchNum and AvgReward are shown in Fig. 5. The figure shows that the Q-value initialized with the target location and camera parameters tends to converge after approximately 500 trajectories. In contrast, the Q-values initialized to zero converge more slowly, especially for discrete Q-learning. These experimental results show that the learning process can be accelerated to a significant degree if a priori knowledge of the camera parameters and topology information about the environment is utilized, even though the state transitions, that is, the model of target movement in the scene, are not known in advance.

Fig. 5
figure 5

Comparison between initialized Q-value and uninitialized Q-value

Because the node selection results may be affected by the use of different time intervals, we conducted experiments on 50 trajectories using time intervals of 10–120 samples (in increments of 10), and set the exploration probability of the approximated Q-function to ε t  = 0 (that is, there was no exploration in this test). The average visual score \( AvgVisScore=\frac{1}{T}{\displaystyle \sum_{t=1}^T\overline{M_{t+1}^{a_t}}} \), total switching number \( TotalSwitchNum={\displaystyle \sum_{t=1}^T\delta \left({a}_t,{a}_{t+1}\right)} \), and AvgSwitchNum were then compared for Max, Game, POMDP, and the proposed method (ARL). For comparison, the visual scores of Max, Game, and POMDP were calculated using the proposed visual measurement. The resulting AvgVisScore, TotalSwitchNum, and AvgSwitchNum for the above methods are shown in Fig. 6. It is clear that the total switching number decreases as the time interval increases, whereas the average switching number per 10 selections increases and the average visual score tends to decrease, because the correlation between the current state and the next state is reduced. Further, the results of our approach are slightly superior to those given by the other three methods.

Fig. 6
figure 6

Simulation results under different time intervals

6.3 Real-life experimental results

We deployed 12 traditional calibrated non-smart cameras in an indoor scene. Natural video sequences were captured simultaneously by the cameras, and the visual processing for the different video streams was conducted in parallel, thus simulating a distributed smart camera network. The action and learning modules of Q-learning in the control agent were also processed using multiple threads. A total of 120 video groups of 10 min in length were recorded as training data, and another five video groups of 60 min in length were selected as test data. Neither exploration nor policy updating was performed during the test process. We found that the learning convergence during training was faster than in the simulations described in Section 6.2. This is because the target trajectories were more regular in the real scene than in the simulation environment, and a smaller state space was needed for training because objects such as tables and chairs occupy part of the space. The tracking algorithm itself is not the focus of this study; we used the VTD tracker [15] and manually revised the object location when it was in danger of being lost during tracking. The sampled images and target tracking results for video sequence 1 are shown in Fig. 7, with the orientation score M i d and clarity score M i c for camera i calculated using (36) and (37). The orientation and clarity scores are illustrated in the figures; the target has a higher orientation score when it moves directly towards the camera, and a higher clarity score when it is closer to the camera.

Fig. 7
figure 7

Sample frames and tracking results for sequence 1

The orientation, clarity, and total scores for camera 1 (Cam1) and camera 4 (Cam4) for frames 1–2000 of sequence 2 are shown in Fig. 8a–c. For the orientation curves, the angles (in radians) of the target relative to the x-axis of Cam1 and Cam4 in frames 90 and 830 are 0.09 and 2.65, respectively. The orientation scores of 0.007 and 0.75 in frame 90, and 0.97 and 0.84 in frame 830, better reflect the orientation of the target relative to that of the camera. When the target reappears in the scene, some parts of the orientation curves (e.g., Cam4 near frames 1350 and 1950) are unstable, which causes varying degrees of jitter in the total score curves. This is mainly because the tracker is initially unstable, and some fitting errors are introduced when there are fewer fitting points. The resolution curves show images from frames 210, 380, 1650, and 1870. When the target is far away from the camera, the resolution of the target image is lower and the score is smaller. The resolution curves exhibit a certain amount of jitter in some areas (e.g., frame 1150 of Cam1), reflecting errors introduced by the tracking and calibration algorithms and the complex background.

Fig. 8
figure 8

Visual information score curves for Cam1 and Cam4

We compared our method (ARL) with Max, Game, and POMDP on five test video sequences. As in the simulation experiments, all cameras were set to obtain tracking results, and the proposed visual evaluation metrics were used for comparison. Figure 9 shows the selection results for sequences 2 and 3, with a selection decision made every 20 s. The figure shows that Max and Game make some false selections. This is because of the inconsistency between the visual score and the real target state in some views when target errors occur. In particular, when the visual scores of multiple perspectives are similar or below a certain threshold, the Game selection result tends to jitter. For example, there is frequent switching in steps 65–75 of sequence 2 and steps 90–100 of sequence 3. In contrast, our method effectively reduces the incidence of false selections and obtains stable results because the Q-values are learned from the training data.

Fig. 9
figure 9

Results of camera selection for sequences 2 and 3

The switching numbers and visual scores per step for the five video sequences are shown in Fig. 10a and b. Greedy selection obtained a higher visual score per step for these sequences, although the scores of the different methods are generally similar. The proposed method has obvious advantages over the other methods in terms of the total switching number, especially over greedy selection and Game.

Fig. 10
figure 10

Comparison of visual scores and switching numbers for the various methods

7 Conclusion

As camera networks become extensively utilized in areas such as surveillance and human–computer interaction, finding the most informative and comfortable view from the mass of real-time video is of increasing importance. In this paper, we have proposed a dynamic node selection framework for surveillance applications in critical areas. To deal with the unknown state transitions of target motion and maximize the long-term reward, we trained a selection policy using reinforcement learning in the central controller. The camera nodes extract effective visual features from their video streams and send the immediate reward to the central controller. To accelerate the learning process and reduce storage space, the target’s current state and scene topology were used as a priori knowledge to initialize the Q-function, and an exploration strategy based on the Gibbs distribution was adopted to guarantee that cameras with larger Q-values had a greater exploration probability than under traditional random exploratory action selection. We utilized GMMs to represent the approximate Q-value function, and updated the GMM parameters sequentially via the mini-batch stepwise EM algorithm to meet the learning requirements of episodic sample updating. In addition, we measured the visual information and comfort of the visual effects given by different cameras by integrating metrics on visibility, orientation, and target clarity with the cost of switching cameras. Our study of single node selection and scheduling to maximize the expected visual information shows that good results can be achieved when few targets are present in the partially overlapping regions. The implementation of the proposed approach on a camera network test-bed is planned for future work.

In terms of bandwidth consumption, because feature extraction and scoring are conducted on each camera node and the target association and fusion are distributed across all nodes, each camera need only send its current state and immediate reward to the central controller, and the central node transmits the selected camera’s index to all nodes. The process of node selection is thus distributed, and the network bandwidth consumption is relatively small. However, a central controller is still needed, and the learning and approximation are centralized. Therefore, a decentralized selection policy is of interest, that is, cooperation and coordination between smart camera nodes should be considered.