Abstract
Dynamic vision is a subfield of computer vision dealing explicitly with problems characterized by image features that evolve in time according to some underlying dynamics. Examples include sustained target tracking, activity classification from video sequences, and recovering 3D geometry from 2D video data. This article discusses the central role that systems theory can play in developing a robust dynamic vision framework, ultimately leading to vision-based systems with enhanced autonomy, capable of operating in stochastic, cluttered environments.
Access provided by Autonomous University of Puebla. Download reference work entry PDF
Similar content being viewed by others
Keywords
Background
In this article, we represent linear time invariant (LTI) systems by their associated transfer matrix G(z). The “size” of G(z), which plays a key role in assessing the effects of uncertainty, will be measured using the \(\mathcal{H}_{\infty }\) norm, defined as \(\|G\|_{\infty }\doteq\sup _{\omega }\overline{\sigma }\left (G(e^{j\omega }\right )\), where \(\overline{\sigma }\left (.\right )\) denotes maximum singular value. For scalar systems, this reduces to the peak value of the frequency response (i.e., the maximum gain of the system). In the matrix case, this definition takes into account both the worst-case frequency and spatial direction. Background material on the \(\mathcal{H}_{\infty }\) norm, its computation and its significance in the context of robust control theory, is given in Sánchez–Peña and Sznaier (1998). A general coverage of linear systems theory, including alternative representations of linear systems and their associated properties, can be found, for instance, in Rugh (1996).
Multiframe Tracking
A requirement common to most dynamic vision applications is the ability to track objects across frames, in order to collect the data required by a subsequent activity analysis step. Current approaches integrate correspondences between individual frames over time, using a combination of some assumed simple target dynamics (e.g., constant velocity) and empirically learned noise distributions (Isard and Blake 1998; North et al. 2000). However, while successful in many scenarios, these approaches are vulnerable to model uncertainty, occlusion, and appearance changes, as illustrated in Fig. 1.
As shown next, the fragility noted above can be avoided by modeling the motion of the target as the output of a dynamical system, to be identified directly from the available data, along with bounds on the identification error. In the sequel, we consider two different cases: (i) the motion of the target is known to belong to a relatively small set of a priori known motion modalities; and (ii) no prior knowledge is available.
The case of known motion models: Consider first the case where a set of models known to span all possible motions of the target is known a priori, as it is often the case with human motion. In this case, the position y k of a given target can be modeled as \(y(z) = \mathcal{F}(z)e(z) +\eta (z)\) where \(\mathbf{e}\) and η k denote a suitable input and measurement noise, respectively, and where \(\mathcal{F}\) admits an expansion of the form \(\mathcal{F} =\overbrace{\sum \limits _{ j=1}^{N_{p}}p_{j}\mathcal{F}^{j}}^{\mathcal{F}_{p}} + \mathcal{F}_{np}\). Here \(\mathcal{F}^{j}\) represent the (known) motion modalities of the target and \(\|\mathcal{F}_{np}\|_{\infty }\leq K\), e.g., a bound on the maximum admissible approximation error of the expansion \(\mathcal{F}_{p}\) to \(\mathcal{F}\) is available. In the reminder of this article, we will further assume that a set membership descriptions \(\eta _{k} \in \mathcal{N}\) is available and, without loss of generality, that e(z) = 1 (i.e., motion of the target is modeled as the impulse response of the unknown operator \(\mathcal{F}\)).
In this context, the next location of the target feature y k can be predicted by first identifying the relevant dynamics \(\mathcal{F}\) and then using it to propagate its past values. In turn, identifying the dynamics entails finding an operator \(\mathcal{F}(z) \in \mathcal{S}\doteq\left \{\mathcal{F}(z): \mathcal{F} = \mathcal{F}_{p} + \mathcal{F}_{np}\right \}\) such that \(y-\eta = \mathcal{F}\), precisely the class of interpolation problem addressed in Parrilo et al. (1999). As shown there, finding such an operator reduces to solving a linear matrix inequality (LMI) feasibility problem. Once this operator is found, it can be used in conjunction with a particle (or a Kalman) filter to predict the future location of the target. Figure 2 shows the tracking results obtained using this approach. Here, we used a combination of a priori information: (i) 5 % noise level and (ii) \(\mathcal{F}_{p} \in \text{span}[ \frac{1} {z-1},\ \frac{z} {z-a},\ \frac{z} {(z-1)^{2}},\ \frac{z^{2}} {(z-1)^{2}},\ \frac{z^{2}-\cos \omega z} {z^{2}-2\cos \omega z+1}\), \(\frac{\sin \omega z^{2}} {z^{2}-2\cos \omega z+1}]\) where a ∈ { 0. 9, 1, 1. 2, 1. 3, 2} and ω ∈ { 0. 2, 0. 45}. The experimental information consisted of the position of the target in N = 20 frames, where it was not occluded. Note that, by exploiting predictive power of the identified model, the Kalman filter is now able to track the target past the occlusion, eliminating the need for using a (more computationally expensive) particle filter.
Unknown motion models: This case could be addressed in principle by performing a purely nonparametric worst-case identification (Parrilo et al. 1999) and then proceeding as above. However, a potential difficulty here stems from the high order of the resulting model (recall that the order of the central interpolant is the number of experimental data points). If a bound n on the order of the underlying models is available, this difficulty can be avoided by recasting the prediction problem into a rank minimization form, which in turn can be relaxed to a semi-definite optimization. To this effect, recall that (Ding et al. 2008), in the absence of noise, given 2n values of \(\{\mathbf{y}_{k}\}_{k=t-2n+1}^{t}\), its next value \(\mathbf{y}_{t+1}\) is the unique solution to the following rank minimization problem:
Clearly, the same result holds if multiple elements of the sequence y are missing, at the price of considering longer sequences (the total number of data points should exceed 2n). This result allows for handling both noisy and missing data (due, for instance, to occlusion), by simply solving
\(\mathcal{I}_{a}\) and \(\mathcal{I}_{m}\) denote the set of available (but noisy) and missing measurements, respectively, and where \(\mathcal{N}_{v}\) is a set membership description of the noise v. In the case where \(\mathcal{N}_{v}\) admits a convex description, using the nuclear norm as a surrogate for rank (Fazel et al. 2003) allows for reducing this problem to a convex semi-definite program. Examples of these descriptions are balls in ℓ ∞ , e.g., \(\mathcal{N}\doteq\{v: \vert v_{k}\vert \leq \epsilon \}\) or constraints on the norm of \(\mathbf{H}_{v}\), the Hankel matrix of the noise sequence, which under mild ergodicity assumptions are equivalent to constraints on the magnitude of the noise covariance. Figure 3 illustrates the effectiveness of this approach. As shown there, the rank minimization-based filter successfully predicts the location of the target, while a Kalman filter-based tracker fails due to the substantial occlusion.
Event Detection and Activity Classification
Using the trajectories generated by the tracking step for activity recognition entails (i) segmenting the data into homogeneous segments each corresponding to a single activity and (ii) classifying these activities, typically based on exemplars from a database of known activities. As shown in the sequel, both steps can be efficiently accomplished by exploiting the properties of the underlying system. The starting point is to model these activities as the output of a switched piecewise linear system. In this context, under suitable dwell time constraints, each switch (indicating a change in the underlying activity) can be identified by simply searching for points associated with discontinuities in the rank of the associated Hankel matrix, as illustrated in Fig. 4. Further, in this framework, the problem of classifying each subactivity can be recast into the behavioral model (in)validation setup shown in Fig. 5. Here y i (. ) represents the impulse response of the (unknown) LTI system G, affected by measurement noise \(\eta _{i} \in \mathcal{N}\) and uncertainty \(\Delta _{i} \in \mathcal{D}\) that accounts for the variability intrinsic to two different realizations of the same activity. Two different time series are considered to be realizations of the same activity if there exists at least one pair \((\eta _{1},\eta _{2}) \in \mathcal{N}^{2}\), one pair \((\Delta _{1},\Delta _{2}) \in \mathcal{D}^{2}\), a LTI system G with McMillan degree at most n G , and suitable initial conditions \(\mathbf{x}_{1}\), \(\mathbf{x}_{2}\) resulting in the observed data. Remarkably, this model (in)validation problem can be reduced to a rank minimization form. In the simpler case where \(\Delta _{i} = 0\), the problem can be solved using the following algorithm (Sznaier and Camps 2011):
Next, consider the more realistic case where the trajectories are also affected by bounded model uncertainty \(\Delta\), \(\|\Delta \|_{\infty }\leq \gamma\), where γ is given as part of a priori information. In this scenario, the internal signal z is given by \(z(t) =\zeta (t) -\eta (t),\;\eta \in \mathcal{N}\), where the signal ζ satisfies
where ∗ denotes convolution. Exploiting Theorem 2.3.6 in Chen and Gu (2000) leads to an LMI condition in the variables z, η, for feasibility of (2). Thus, the only modification to Algorithm 1 required to handle model uncertainty is to incorporate this additional (convex) constraint to the rank minimization problems. Table 1 shows the results of applying this approach to 2 video sequences, walking and running, from the KTH database (Laptev et al. 2008). Sample frames from these sequences are shown in Fig. 6. In order to reduce the dimensionality of the data, the frames were first projected into a three-dimensional space using principal component analysis (PCA), and the resulting time series were used as the input to Algorithm 1, assuming 10 % noise and 10 % model uncertainty. As shown in Table 1, the algorithm correctly identifies the subsequences (a)–(c) as being generated by the same underlying activity (walking).
Algorithm 1 Behavioral model (in)validation
Data: Noisy measurements \(y_{1},y_{2}\).
A priori information: noise description \(\eta _{i} \in \mathcal{N}\)
-
1.
Solve the following rank–minimization problems:
\(r_{1}^{min} =\min _{\boldsymbol{\eta }_{1}}\mathrm{rank}(\mathbf{H}_{y_{1}} -\mathbf{H}_{\eta _{1}})\)
subject to: \(\eta _{1} \in \mathcal{N}\).
\(r_{2}^{min} =\min _{\boldsymbol{\eta }_{2}}\mathrm{rank}(\mathbf{H}_{y_{2}} -\mathbf{H}_{\eta _{2}})\)
subject to: \(\eta _{2} \in \mathcal{N}\).
\(r_{12}^{min} =\min _{\boldsymbol{\eta }_{1}}\mathrm{rank}([\mathbf{H}_{y_{1n}}\quad \mathbf{H}_{y_{2n}}])\)
subject to: \(\eta _{1},\eta _{2} \in \mathcal{N}\)
\(\mathbf{H}_{y_{1n}} = \mathbf{H}_{y_{1}} -\mathbf{H}_{\eta _{1}}\)
\(\mathbf{H}_{y_{2n}} = \mathbf{H}_{y_{2}} -\mathbf{H}_{\eta _{2}}\)
-
2.
The given trajectories were generated by the same LTI system with McMillan degree ≤ n G iff:
\(r_{1}^{min} = r_{2}^{min} = r_{12}^{min} \leq n_{G}\)
Summary and Future Directions
Vision-based systems are uniquely positioned to address the needs of a growing segment of the population. Aware sensors endowed with scene analysis capabilities can prevent crime, reduce time response to emergency scenes, and render viable the concept of ultra-sustainable buildings. Moreover, the investment required to accomplish these goals is relatively modest since a large number of cameras are already deployed and networked. Arguably, at this point, one of the critical factors limiting widespread use of these systems is their potential fragility when operating in unstructured scenarios. This article illustrates the key role that control theory can play in developing a comprehensive, provably robust dynamic vision framework. In turn, computer vision provides a rich environment both to draw inspiration from and to test new developments in systems theory.
Cross-References
Recommended Reading
Details on how to select good features to track can be found in Richard Szeliski (2010). Using dynamics to recover 3D structure from 2D data is covered in Ayazoglu et al. (2010). Finally, further details on the connection between identification and the problem of extracting actionable information from large data streams can be found, for instance, in Sznaier (2012).
Bibliography
Ayazoglu M, Sznaier M, Camps O (2010) Euclidean structure recovery from motion in perspective image sequences via Hankel rank minimization. LNCS, vol 6312. Springer, Berlin/New York, pp 71–84
Chen J, Gu G (2000) Control oriented system identification, An \(\mathcal{H}_{\infty }\) approach. Wiley, New York
Ding T, Sznaier M, Camps O (2008) Receding horizon rank minimization based estimation with applications to visual tracking. In: Proceedings of the 47th IEEE conference on decision and control, Cancún, Dec 2008, pp 3446–3451
Fazel M, Hindi H, Boyd SP (2003) Log-det heuristic for matrix rank minimization with applications to hankel and euclidean distance matrices. In: Proceedings of the 2003 ACC, Denver, pp 2156–2162
Isard M, Blake A (1998) CONDENSATION – conditional density propagation for visual tracking. Int J Comput Vis 29(1):5–28
Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE computer vision and pattern recognition, Anchorage, pp 1–8
North B, Blake A, Isard M, Rittscher J (2000) Learning and classification of complex dynamics. IEEE Trans Pattern Anal Mach Intell 22(9):1016–1034
Parrilo PA, Sánchez–Peña RS, Sznaier M (1999) A parametric extension of mixed time/frequency domain based robust identification. IEEE Trans Autom Control 44(2):364–369
Rugh WJ (1996) Linear systems theory, 2nd edn. Prentice Hall, Upper Saddle River
Sánchez–Peña RS, Sznaier M (1998) Robust systems theory and applications. Wiley, New York
Szeliski R (2010) Computer vision: algorithms and applications. Springer, New York
Sznaier M (2012) Compressive information extraction: a dynamical systems approach. In: Proceeding of the 2012 symposium on systems identification (SYSID 2012), July 2012, Brussels, pp 1559–1568
Sznaier M, Camps O (2011) A rank minimization approach to trajectory (in)validation. In: 2011 American control conference, pp 675–680
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer-Verlag London
About this entry
Cite this entry
Sznaier, M., Camps, O. (2015). Uncertainty and Robustness in Dynamic Vision. In: Baillieul, J., Samad, T. (eds) Encyclopedia of Systems and Control. Springer, London. https://doi.org/10.1007/978-1-4471-5058-9_134
Download citation
DOI: https://doi.org/10.1007/978-1-4471-5058-9_134
Published:
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5057-2
Online ISBN: 978-1-4471-5058-9
eBook Packages: EngineeringReference Module Computer Science and Engineering