Keywords

Background

In this article, we represent linear time-invariant (LTI) systems by their associated transfer matrix G(z). The "size" of G(z), which plays a key role in assessing the effects of uncertainty, will be measured using the \(\mathcal{H}_{\infty }\) norm, defined as \(\|G\|_{\infty }\doteq\sup _{\omega }\overline{\sigma }\left (G(e^{j\omega })\right )\), where \(\overline{\sigma }(.)\) denotes the maximum singular value. For scalar systems, this reduces to the peak value of the frequency response (i.e., the maximum gain of the system). In the matrix case, this definition takes into account both the worst-case frequency and the worst-case spatial direction. Background material on the \(\mathcal{H}_{\infty }\) norm, its computation, and its significance in the context of robust control theory is given in Sánchez-Peña and Sznaier (1998). A general coverage of linear systems theory, including alternative representations of linear systems and their associated properties, can be found, for instance, in Rugh (1996).
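For a concrete feel, the \(\mathcal{H}_{\infty }\) norm of a stable scalar system can be approximated by gridding the unit circle and taking the peak frequency-response gain. The sketch below is a frequency-gridding approximation applied to an FIR truncation of the impulse response, not one of the exact computational algorithms covered in Sánchez-Peña and Sznaier (1998); the function name and grid size are illustrative choices:

```python
import numpy as np

def hinf_norm_fir(h, n_grid=2048):
    """Approximate the H-infinity norm of a stable scalar system from an
    FIR truncation h of its impulse response, by gridding frequencies in
    [0, pi] and taking the peak of |G(e^{jw})| (for scalar systems the
    maximum singular value is just the magnitude)."""
    h = np.asarray(h, dtype=float)
    w = np.linspace(0.0, np.pi, n_grid)          # frequency grid
    k = np.arange(len(h))
    # G(e^{jw}) = sum_k h_k e^{-jwk}, evaluated on the whole grid at once
    G = h @ np.exp(-1j * np.outer(k, w))
    return float(np.max(np.abs(G)))
```

For instance, for G(z) = 1 + z^{-1} the peak gain 2 is attained at ω = 0, and the grid recovers it.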

Multiframe Tracking

A requirement common to most dynamic vision applications is the ability to track objects across frames, in order to collect the data required by a subsequent activity analysis step. Current approaches integrate correspondences between individual frames over time, using a combination of some assumed simple target dynamics (e.g., constant velocity) and empirically learned noise distributions (Isard and Blake 1998; North et al. 2000). However, while successful in many scenarios, these approaches are vulnerable to model uncertainty, occlusion, and appearance changes, as illustrated in Fig. 1.

Uncertainty and Robustness in Dynamic Vision, Fig. 1

Unscented particle filter-based tracking in the presence of occlusion

As shown next, the fragility noted above can be avoided by modeling the motion of the target as the output of a dynamical system, to be identified directly from the available data, along with bounds on the identification error. In the sequel, we consider two different cases: (i) the motion of the target is known to belong to a relatively small set of a priori known motion modalities; and (ii) no prior knowledge is available.

The case of known motion models: Consider first the case where a set of models known to span all possible motions of the target is available a priori, as is often the case with human motion. In this case, the position \(y_{k}\) of a given target can be modeled as \(y(z) = \mathcal{F}(z)e(z) +\eta (z)\), where e(z) and η(z) denote a suitable input and measurement noise, respectively, and where \(\mathcal{F}\) admits an expansion of the form \(\mathcal{F} =\overbrace{\sum \limits _{ j=1}^{N_{p}}p_{j}\mathcal{F}^{j}}^{\mathcal{F}_{p}} + \mathcal{F}_{np}\). Here the \(\mathcal{F}^{j}\) represent the (known) motion modalities of the target and \(\|\mathcal{F}_{np}\|_{\infty }\leq K\), i.e., a bound on the maximum admissible approximation error of the expansion \(\mathcal{F}_{p}\) to \(\mathcal{F}\) is available. In the remainder of this article, we will further assume that a set membership description \(\eta _{k} \in \mathcal{N}\) is available and, without loss of generality, that e(z) = 1 (i.e., the motion of the target is modeled as the impulse response of the unknown operator \(\mathcal{F}\)).

In this context, the next location of the target feature \(y_{k}\) can be predicted by first identifying the relevant dynamics \(\mathcal{F}\) and then using it to propagate its past values. In turn, identifying the dynamics entails finding an operator \(\mathcal{F}(z) \in \mathcal{S}\doteq\left \{\mathcal{F}(z): \mathcal{F} = \mathcal{F}_{p} + \mathcal{F}_{np}\right \}\) such that \(y-\eta = \mathcal{F}\), precisely the class of interpolation problems addressed in Parrilo et al. (1999). As shown there, finding such an operator reduces to solving a linear matrix inequality (LMI) feasibility problem. Once this operator is found, it can be used in conjunction with a particle (or a Kalman) filter to predict the future location of the target. Figure 2 shows the tracking results obtained using this approach. Here, we used a combination of a priori information: (i) a 5 % noise level and (ii) \(\mathcal{F}_{p} \in \text{span}\left [\frac{1} {z-1},\ \frac{z} {z-a},\ \frac{z} {(z-1)^{2}},\ \frac{z^{2}} {(z-1)^{2}},\ \frac{z^{2}-\cos \omega \,z} {z^{2}-2\cos \omega \,z+1},\ \frac{\sin \omega \,z^{2}} {z^{2}-2\cos \omega \,z+1}\right ]\), where a ∈ { 0.9, 1, 1.2, 1.3, 2} and ω ∈ { 0.2, 0.45}. The experimental information consisted of the position of the target in N = 20 frames where it was not occluded. Note that, by exploiting the predictive power of the identified model, the Kalman filter is now able to track the target past the occlusion, eliminating the need for a (more computationally expensive) particle filter.
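A rough feel for this scheme can be had by replacing the LMI-based interpolation with ordinary least squares: fit the observed positions to the impulse responses of a few assumed motion modes, then extrapolate past the occlusion. The basis functions, parameter values, and function name below are illustrative assumptions, not the exact setup used for Fig. 2:

```python
import numpy as np

def fit_and_predict(y, n_pred, a=0.9, omega=0.45):
    """Fit observed positions y to impulse responses of assumed motion
    modes (constant position, constant velocity, geometric, sinusoidal)
    and extrapolate n_pred steps.  A least-squares sketch of the
    identification-based predictor; the article's method uses LMI-based
    worst-case interpolation instead."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    k = np.arange(N + n_pred)
    basis = np.stack([np.ones(N + n_pred),     # step: constant position
                      k.astype(float),         # ramp: constant velocity
                      a ** k,                  # geometric mode
                      np.cos(omega * k),       # oscillatory modes
                      np.sin(omega * k)], axis=1)
    # coefficients p_j fitted on the observed frames only
    p, *_ = np.linalg.lstsq(basis[:N], y, rcond=None)
    return basis[N:] @ p                       # predicted future positions
```

When the true trajectory lies in the span of the basis, the fit recovers it exactly and the extrapolation is exact as well.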

Uncertainty and Robustness in Dynamic Vision, Fig. 2

Using the identified model in combination with a Kalman filter allows for robust tracking in the presence of occlusion

Unknown motion models: This case could be addressed in principle by performing a purely nonparametric worst-case identification (Parrilo et al. 1999) and then proceeding as above. However, a potential difficulty here stems from the high order of the resulting model (recall that the order of the central interpolant is the number of experimental data points). If a bound n on the order of the underlying models is available, this difficulty can be avoided by recasting the prediction problem into a rank minimization form, which in turn can be relaxed to a semi-definite optimization problem. To this effect, recall (Ding et al. 2008) that, in the absence of noise, given the 2n values \(\{\mathbf{y}_{k}\}_{k=t-2n+1}^{t}\), the next value \(\mathbf{y}_{t+1}\) is the unique solution to the following rank minimization problem:

$$\displaystyle\begin{array}{rcl} \mathbf{y}_{t+1}& =& \mathop{\arg \min }_{\mathbf{y}}\left \{\mathrm{rank}\left [\mathbf{H}_{n+1}(\mathbf{y})\right ]\right \}\;\text{where}\;\mathbf{H}_{n+1}(\mathbf{y}) \\ & \doteq & \left [\begin{array}{@{}c@{\;\;\;}c@{\;\;\;}c@{\;\;\;}c@{}} \mathbf{y}_{t-2n+1}\;\;\;&\mathbf{y}_{t-2n+2}\;\;\;&\cdots \;\;\;& \mathbf{y}_{t-n} \\ \mathbf{y}_{t-2n+2}\;\;\;&\mathbf{y}_{t-2n+3}\;\;\;&\cdots \;\;\;&\mathbf{y}_{t-n+1}\\ \vdots \;\;\; & \vdots \;\;\; & \ddots \;\;\; & \vdots \\ \mathbf{y}_{t-n+1} \;\;\;& \mathbf{y}_{t-n+2} \;\;\;&\cdots \;\;\;& \mathbf{y} \end{array} \right ]{}\end{array}$$
(1)
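In the noise-free case, the rank-deficiency property behind (1) can be made concrete with a few lines of linear algebra: a trajectory generated by an order-n model obeys a linear recursion, which can be recovered from the Hankel rows and used to extrapolate. This is only a sketch of that property, not the rank minimization itself; names and the least-squares route are illustrative:

```python
import numpy as np

def predict_next(y, n):
    """Predict the next sample of a noise-free trajectory generated by an
    order-n LTI model.  Such a trajectory satisfies a linear recursion
    y_{k+n} = c_0 y_k + ... + c_{n-1} y_{k+n-1}, which is exactly what
    makes the (n+1)-row Hankel matrix in (1) rank deficient."""
    y = np.asarray(y, dtype=float)
    # stack Hankel rows: each row is a window of n consecutive samples
    A = np.array([y[k:k + n] for k in range(len(y) - n)])
    b = y[n:]                                   # the sample each window predicts
    c, *_ = np.linalg.lstsq(A, b, rcond=None)   # recursion coefficients
    return float(y[-n:] @ c)                    # extrapolate one step
```

For example, y_k = 0.9^k + 0.5^k satisfies the order-2 recursion y_{k+2} = 1.4 y_{k+1} − 0.45 y_k, and the function recovers its next sample exactly.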

Clearly, the same result holds if multiple elements of the sequence y are missing, at the price of considering longer sequences (the total number of data points should exceed 2n). This result allows for handling both noisy and missing data (due, for instance, to occlusion), by simply solving

$$\displaystyle{ \begin{array}{l} \min _{\boldsymbol{\zeta }}\left \{\mathrm{rank}\left [\mathbf{H}(\boldsymbol{\zeta })\right ]\right \}\;\mbox{ subject to $\mathbf{v} \in \mathcal{N}_{v}$ } \\ \mbox{ where $\zeta _{i} = \left \{\begin{array}{l} \mathbf{y}_{i} -\mathbf{v}_{i}\;\mathrm{if}\;i \in \mathcal{I}_{a} \\ \mathbf{x}_{i}\;\mathrm{if}\;i \in \mathcal{I}_{m} \end{array} \right .$}\\ \end{array} }$$

\(\mathcal{I}_{a}\) and \(\mathcal{I}_{m}\) denote the sets of available (but noisy) and missing measurements, respectively, and \(\mathcal{N}_{v}\) is a set membership description of the noise v. In the case where \(\mathcal{N}_{v}\) admits a convex description, using the nuclear norm as a surrogate for rank (Fazel et al. 2003) allows for reducing this problem to a convex semi-definite program. Examples of such descriptions are balls in \(\ell ^{\infty }\), e.g., \(\mathcal{N}\doteq\{v: \vert v_{k}\vert \leq \epsilon \}\), or constraints on the norm of \(\mathbf{H}_{v}\), the Hankel matrix of the noise sequence, which under mild ergodicity assumptions are equivalent to constraints on the magnitude of the noise covariance. Figure 3 illustrates the effectiveness of this approach. As shown there, the rank minimization-based filter successfully predicts the location of the target, while a Kalman filter-based tracker fails due to the substantial occlusion.
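The surrogate itself is elementary: the nuclear norm of a matrix is the sum of its singular values, and it is the tightest convex under-approximation of rank on the spectral-norm unit ball. A minimal sketch follows; the full relaxation additionally requires an SDP solver, which is not shown here:

```python
import numpy as np

def nuclear_norm(M):
    """Nuclear norm ||M||_* = sum of singular values of M, the convex
    surrogate for rank(M) used to relax the Hankel rank minimization
    to a semi-definite program (Fazel et al. 2003)."""
    return float(np.sum(np.linalg.svd(M, compute_uv=False)))
```

For a rank-1 matrix uv^T the nuclear norm equals ||u|| ||v||, so minimizing it over the feasible noise set favors low-rank (i.e., low-order) explanations of the data.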

Uncertainty and Robustness in Dynamic Vision, Fig. 3

Trajectory prediction. Rank minimization (1) versus Kalman filtering (2)

Event Detection and Activity Classification

Using the trajectories generated by the tracking step for activity recognition entails (i) segmenting the data into homogeneous segments, each corresponding to a single activity, and (ii) classifying these activities, typically based on exemplars from a database of known activities. As shown in the sequel, both steps can be efficiently accomplished by exploiting the properties of the underlying system. The starting point is to model these activities as the output of a switched piecewise linear system. In this context, under suitable dwell time constraints, each switch (indicating a change in the underlying activity) can be identified by simply searching for points associated with discontinuities in the rank of the associated Hankel matrix, as illustrated in Fig. 4. Further, in this framework, the problem of classifying each subactivity can be recast into the behavioral model (in)validation setup shown in Fig. 5. Here \(y_{i}(\cdot )\) represents the impulse response of the (unknown) LTI system G, affected by measurement noise \(\eta _{i} \in \mathcal{N}\) and uncertainty \(\Delta _{i} \in \mathcal{D}\) that accounts for the variability intrinsic to two different realizations of the same activity. Two different time series are considered to be realizations of the same activity if there exist at least one pair \((\eta _{1},\eta _{2}) \in \mathcal{N}^{2}\), one pair \((\Delta _{1},\Delta _{2}) \in \mathcal{D}^{2}\), an LTI system G with McMillan degree at most \(n_{G}\), and suitable initial conditions \(\mathbf{x}_{1}\), \(\mathbf{x}_{2}\) resulting in the observed data. Remarkably, this model (in)validation problem can be reduced to a rank minimization form. In the simpler case where \(\Delta _{i} = 0\), the problem can be solved using the following algorithm (Sznaier and Camps 2011):

Uncertainty and Robustness in Dynamic Vision, Fig. 4

The jump in the rank of the Hankel matrix corresponds to the time instant where the subjects meet and exchange a bag

Uncertainty and Robustness in Dynamic Vision, Fig. 5

Model (in)validation setup

Next, consider the more realistic case where the trajectories are also affected by bounded model uncertainty \(\Delta\), \(\|\Delta \|_{\infty }\leq \gamma\), where γ is given as part of the a priori information. In this scenario, the internal signal z is given by \(z(t) =\zeta (t) -\eta (t),\;\eta \in \mathcal{N}\), where the signal ζ satisfies

$$\displaystyle{ y = (1 + \Delta ){\ast}\zeta,\;\mathrm{for\;some\;}\Delta \in \mathcal{D} }$$
(2)

where ∗ denotes convolution. Exploiting Theorem 2.3.6 in Chen and Gu (2000) leads to an LMI condition in the variables z, η for feasibility of (2). Thus, the only modification to Algorithm 1 required to handle model uncertainty is to incorporate this additional (convex) constraint into the rank minimization problems. Table 1 shows the results of applying this approach to two video sequences, walking and running, from the KTH database (Laptev et al. 2008). Sample frames from these sequences are shown in Fig. 6. In order to reduce the dimensionality of the data, the frames were first projected into a three-dimensional space using principal component analysis (PCA), and the resulting time series were used as the input to Algorithm 1, assuming 10 % noise and 10 % model uncertainty. As shown in Table 1, the algorithm correctly identifies the subsequences (a)–(c) as being generated by the same underlying activity (walking).

Uncertainty and Robustness in Dynamic Vision, Fig. 6

Sample frames from KTH activity video database. (a) Walking. (b) Running

Algorithm 1 Behavioral model (in)validation

Data: Noisy measurements \(y_{1},y_{2}\).

A priori information: noise description \(\eta _{i} \in \mathcal{N}\)

  1.

    Solve the following rank–minimization problems:

    \(r_{1}^{min} =\min _{\boldsymbol{\eta }_{1}}\mathrm{rank}(\mathbf{H}_{y_{1}} -\mathbf{H}_{\eta _{1}})\)

    subject to: \(\eta _{1} \in \mathcal{N}\).

    \(r_{2}^{min} =\min _{\boldsymbol{\eta }_{2}}\mathrm{rank}(\mathbf{H}_{y_{2}} -\mathbf{H}_{\eta _{2}})\)

    subject to: \(\eta _{2} \in \mathcal{N}\).

    \(r_{12}^{min} =\min _{\boldsymbol{\eta }_{1},\boldsymbol{\eta }_{2}}\mathrm{rank}([\mathbf{H}_{y_{1n}}\quad \mathbf{H}_{y_{2n}}])\)

    subject to: \(\eta _{1},\eta _{2} \in \mathcal{N}\)

    where \(\mathbf{H}_{y_{1n}} = \mathbf{H}_{y_{1}} -\mathbf{H}_{\eta _{1}}\)

    \(\mathbf{H}_{y_{2n}} = \mathbf{H}_{y_{2}} -\mathbf{H}_{\eta _{2}}\)

  2.

    The given trajectories were generated by the same LTI system with McMillan degree \(\leq n_{G}\) if and only if:

    \(r_{1}^{min} = r_{2}^{min} = r_{12}^{min} \leq n_{G}\)
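In the noise-free special case (\(\eta _{i} = 0\)), the rank minimizations in Algorithm 1 collapse to ordinary numerical ranks, and the (in)validation test can be sketched in a few lines. The row count, tolerance, and function names below are illustrative choices, not part of the article's formulation:

```python
import numpy as np

def hankel(y, rows):
    """Hankel matrix with the given number of rows built from sequence y."""
    y = np.asarray(y, dtype=float)
    cols = len(y) - rows + 1
    return np.array([y[i:i + cols] for i in range(rows)])

def numrank(M, tol=1e-8):
    """Numerical rank: count singular values above a relative tolerance."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def same_system(y1, y2, n_G, rows=None):
    """Noise-free sketch of Algorithm 1: the trajectories come from a
    common LTI model of McMillan degree <= n_G iff the ranks of the
    individual and stacked Hankel matrices agree and do not exceed n_G."""
    rows = rows or (len(y1) // 2)
    H1, H2 = hankel(y1, rows), hankel(y2, rows)
    r1, r2 = numrank(H1), numrank(H2)
    r12 = numrank(np.hstack([H1, H2]))        # joint Hankel matrix
    return r1 == r2 and r2 == r12 and r12 <= n_G
```

For example, two scaled copies of the geometric response 0.9^k share a degree-1 model and pass the test, while 0.9^k and 0.5^k raise the joint rank to 2 and are correctly invalidated.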

Uncertainty and Robustness in Dynamic Vision, Table 1 Activity classification results. Sequences (a)–(c) correspond to walking and (d) to running

Summary and Future Directions

Vision-based systems are uniquely positioned to address the needs of a growing segment of the population. Aware sensors endowed with scene analysis capabilities can prevent crime, reduce time response to emergency scenes, and render viable the concept of ultra-sustainable buildings. Moreover, the investment required to accomplish these goals is relatively modest since a large number of cameras are already deployed and networked. Arguably, at this point, one of the critical factors limiting widespread use of these systems is their potential fragility when operating in unstructured scenarios. This article illustrates the key role that control theory can play in developing a comprehensive, provably robust dynamic vision framework. In turn, computer vision provides a rich environment both to draw inspiration from and to test new developments in systems theory.

Cross-References

Recommended Reading

Details on how to select good features to track can be found in Szeliski (2010). Using dynamics to recover 3D structure from 2D data is covered in Ayazoglu et al. (2010). Finally, further details on the connection between identification and the problem of extracting actionable information from large data streams can be found, for instance, in Sznaier (2012).