1 Introduction

1.1 Motivation

Videos of natural scenes contain a vast variety of motion patterns. We divide these motion patterns into a \(2\times 2\) table based on their complexities measured by two criteria: (i) sketchability [18], i.e., whether a local patch can be represented explicitly by an image primitive from a sparse coding dictionary, and (ii) intrackability (or trackability) [17], which measures the uncertainty of tracking an image patch using the entropy of the posterior probability on velocities. Figure 1 shows examples of video patches in the four categories. Category A consists of the simplest vision phenomena, i.e., sketchable and trackable motions, such as trackable corners, lines, and feature points, whose positions and shapes can be tracked between frames. For example, patches (a), (b), (c), and (d) belong to category A. Category D is the most complex and is called textured motion or dynamic texture in the literature, such as water, fire, or grass, in which the images have no distinct primitives or trackable motion; patches (h) and (i) are examples. The other categories lie in between. Category B refers to sketchable but intrackable patches, which can be described by distinct image primitives but can hardly be tracked between frames due to fast motion, for example, the patches (e) and (f) at the legs of the galloping horse. Finally, category C includes the trackable but non-sketchable patches, which are cluttered features or moving kernels, e.g., patch (g).

In the vision literature, as pointed out in [33], there are two families of representations, which code images or videos by explicit and implicit functions, respectively.

1. Explicit representations with generative models. Olshausen [27] and Kim et al. [22] learned an over-complete set of coding elements from natural video sequences using the sparse coding model [28]. Elder and Zucker [15] and Guo et al. [18] represented image/video patches by fitting functions with explicit geometric and photometric parameters. Wang and Zhu [36] synthesized complex motion, such as birds, snowflakes, and waves, with a large number of particles and wave components. Black and Fleet [4] represented two types of motion primitives, namely smooth motion and motion boundaries, for motion segmentation. In higher-level object motion tracking, researchers have used different tracking units depending on the underlying objects and scales, such as sparse or dense feature point tracking [4, 32], kernel tracking [10, 16], contour tracking [24], and middle-level pairwise-component generation [41].

Fig. 1

The four types of local video patches characterized by two criteria—sketchability and trackability

2. Implicit representations with descriptive models. For textured motions or dynamic textures, researchers have used various Markov models that are constrained to reproduce statistics extracted from the input video. For example, dynamic textures [6, 35] were modeled by a spatio-temporal auto-regressive (STAR) model, in which the intensity of each pixel is represented by a linear combination of the intensities of its spatial and temporal neighbors. Bouthemy et al. [5] proposed mixed-state auto-models for motion textures by generalizing the auto-models in [3]. Doretto et al. [14] derived an auto-regressive moving-average model for dynamic texture. Chan and Vasconcelos [7] and Ravichandran et al. [30] extended it to a stable linear dynamical system (LDS) model.

Recently, to represent complex motion such as human activities, researchers have used Histograms of Oriented Gradients (HOG) [11] for appearance and Histograms of Oriented Optical Flow (HOOF) [8, 12] for motion. HOG and HOOF record rough geometric information through a grid and pool statistics (histograms) within local cells. Such features are used for recognition in discriminative tasks, such as action classification, and are not suitable for video coding and reconstruction.

In the literature, these video representations are often manually selected for specific videos in different tasks. There is no generic representation and criterion that can automatically select the proper model for different patterns in a video. Furthermore, as demonstrated in [17], both sketchability and trackability change over scale, density, and the stochasticity of the dynamics, so a good video representation must adapt itself continuously over a long video sequence.

1.2 Overview and Contributions

Motivated by the above observations, we study a unified middle-level representation, called video primal sketch (VPS), by integrating the two families of representations. Our work is inspired by Marr’s conjecture for a generic “token” representation called primal sketch as the output of early vision [26], and is aimed at extending the primal sketch model proposed by [18] from images to videos. Our goal is not only to provide a parsimonious model for video compression and coding, but also more importantly, to support and be compatible with high-level tasks such as motion tracking and action recognition.

Figure 2 overviews an example of the video primal sketch. Figure 2a is an input video frame which is separated into sketchable and non-sketchable regions by the sketchability map in (b), and trackable primitives and intrackable regions by the trackability map in (c). The sketchable or trackable regions are explicitly represented by a sparse coding model and reconstructed in (d) with motion primitives, and each non-sketchable and intrackable region has a textured motion which is synthesized in (e) by a generalized FRAME [42] model (implicit and descriptive). The synthesis of this frame is shown in (f) which integrates the results from (d) and (e) seamlessly.

As Table 1 shows, the explicit representations include 3,600 parameters for the positions, types, motion velocities, etc., of the video primitives and the implicit representations have \(420\) parameters for the histograms of a set of filter responses on dynamic textures. This table shows the efficiency of the VPS model.

Fig. 2

An example of video primal sketch. a An input frame. b Sketchability map where dark means sketchable. c Trackability map where darker means higher trackability. d Reconstruction of explicit regions using primitives. e Synthesis for implicit regions (textured motions) by sampling the generalized FRAME model through Markov chain Monte Carlo using the explicit regions as boundary conditions. f Synthesized frame by combining the explicit and implicit representations

Table 1 The parameters in video primal sketch model for the water bird video in Fig. 2

This paper makes the following contributions to the literature.

  1. We present and compare two different but related models for defining textured motions. The first is a spatio-temporal FRAME (ST-FRAME) model, which is a non-parametric Markov random field and generalizes the FRAME model [42] of texture with spatio-temporal filters. The ST-FRAME model is learned so that its marginal probabilities match the histograms of the responses of the spatio-temporal filters on the input video. The second is a motion-appearance FRAME (MA-FRAME) model, which matches not only the histograms of some spatio-temporal filter responses but also the histograms of velocities pooled over a local region. The MA-FRAME model achieves better results in video synthesis than the ST-FRAME model, and it is, to some extent, similar to the HOOF features used in action classification [8, 12].

  2. We learn a dictionary of motion primitives from input videos using a generative sparse coding model. These primitives are used to reconstruct the explicit regions and include two types: (i) generic primitives for the sketchable patches, such as corners and bars; and (ii) specific primitives for the non-sketchable but trackable patches, which are usually texture patches similar to those used in kernel tracking [10].

  3. The models for the implicit and explicit regions are integrated into a hybrid representation, the video primal sketch (VPS), which serves as a generic middle-level representation of video. We also show how the VPS changes over information scales affected by distance, density, and dynamics.

  4. We show the connections between this middle-level VPS representation and features used in high-level vision tasks such as action recognition.

Our work is inspired by Gong’s empirical study in [17], which revealed the statistical properties of videos over scale transitions and defined intrackability as the entropy of local velocities. When the entropy is high, the patch cannot be tracked locally and thus its motion is represented by a velocity histogram. Gong and Zhu [17] did not give a unified model for video representation and synthesis, which is the focus of the current paper.

This paper extends a previous conference paper [19] in the following aspects:

  1. We propose a new dynamic texture model, MA-FRAME, for better representing velocity information. Benefiting from the new temporal feature, the VPS model can be applied more directly to high-level action representation tasks.

  2. We compare the spatial and temporal features with HOG [11] and HOOF [12] and discuss the connections between them.

  3. We conduct a series of perceptual experiments to verify the quality of the video synthesis from the perspective of human perception.

The remainder of this paper is organized as follows. In Sect. 2, we present the framework of video primal sketch. In Sect. 3, we explain the algorithms for explicit representation, textured motion synthesis and video synthesis, and show a series of experiments. The paper is concluded with a discussion in Sect. 4.

2 Video Primal Sketch Model

In his monumental book [26], Marr conjectured a primal sketch as the output of early vision that transfers the continuous “analog” signals in pixels to a discrete “token” representation. The latter should be parsimonious and sufficient to reconstruct the observed image without noticeable perceptual distortion. A mathematical model was later studied by Guo et al. [18], which successfully modeled hundreds of images by integrating sketchable structures and non-sketchable textures. In this section, we extend it to the video primal sketch as a hybrid generic video representation.

Let \({\mathbf {I}}[1,m]=\{{\mathbf {I}}_{(t)}\}_{t=1}^m\) be a video defined on a 3D lattice \(\varLambda \subset Z^3\). \(\varLambda \) is divided disjointly into explicit and implicit regions,

$$\begin{aligned} \varLambda =\varLambda _\mathrm{ex}\bigcup \varLambda _\mathrm{im},\;\varLambda _\mathrm{ex}\bigcap \varLambda _\mathrm{im}=\emptyset . \end{aligned}$$
(1)

Then the video \({\mathbf {I}}\) is decomposed into two components

$$\begin{aligned} {\mathbf {I}}_{\varLambda }=({\mathbf {I}}_{\varLambda _\mathrm{ex}},{\mathbf {I}}_{\varLambda _\mathrm{im}}). \end{aligned}$$
(2)

\({\mathbf {I}}_{\varLambda _\mathrm{ex}}\) is defined by explicit functions \(I=g(w)\), in which each instance corresponds to a specific functional form of \(g()\) and is indexed by a particular value of the parameter \(w\). \({\mathbf {I}}_{\varLambda _\mathrm{im}}\) is defined by implicit functions \(H(I)=h\), in which \(H()\) extracts the statistics of filter responses from the image \(I\) and \(h\) is a specific value of the histograms.

In the following, we first present the two families of models for \({\mathbf {I}}_{\varLambda _\mathrm{ex}}\) and \({\mathbf {I}}_{\varLambda _\mathrm{im}}\), respectively, and then integrate them in the VPS model.

2.1 Explicit Representation by Sparse coding

The explicit region \(\varLambda _\mathrm{ex}\) of a video \({\mathbf {I}}\) is decomposed into \(n_\mathrm{ex}\) disjoint domains (usually \(n_\mathrm{ex}\) is on the order of \(10^2\)),

$$\begin{aligned} \varLambda _\mathrm{ex}=\bigcup _{i=1}^{n_\mathrm{ex}}\varLambda _{\mathrm{ex},i}. \end{aligned}$$
(3)

Here \(\varLambda _{\mathrm{ex},i}\subset \varLambda \) defines the domain of a “brick”. A brick, denoted by \({\mathbf {I}}_{\varLambda _{\mathrm{ex},i}}\), is a spatio-temporal volume like a patch in images. These bricks are divided into the three categories A, B, and C as we mentioned in Sect. 1.

The size of \(\varLambda _{\mathrm{ex},i}\) influences the results of tracking and synthesis to some degree. The spatial size should depend on the scale of the structures or the granularity of the textures, and the temporal size should depend on the motion amplitude and frequency in the time dimension, which are hard to estimate in real applications. However, a general size works well for most cases, say \(11\times 11\) pixels \(\times \) 3 frames for trackable bricks (sketchable or non-sketchable), or \(11\times 11\) pixels \(\times \) 1 frame for sketchable but intrackable bricks. Therefore, in all the experiments of this paper, the size of \(\varLambda _{\mathrm{ex},i}\) is chosen as such.

Figure 3 shows one example comparing the sketchable and trackable regions based on the sketchability and trackability maps shown in Fig. 2b, c, respectively. It is worth noting that the two regions largely overlap, and only a small percentage of the regions is exclusively sketchable or exclusively trackable.

Fig. 3

Comparison between sketchable and trackable regions

Each brick can be represented by a primitive \(B_i\in \varDelta _B\) through an explicit function,

$$\begin{aligned} {\mathbf {I}}(x,y,t)=\alpha _i B_i(x,y,t)+\epsilon ,\quad \forall (x,y,t)\in \varLambda _{\mathrm{ex},i}. \end{aligned}$$
(4)

\(B_i\) denotes the primitive from the dictionary \(\varDelta _B\) that best fits the brick \({\mathbf {I}}_{\varLambda _{\mathrm{ex},i}}\). Here \(i\) indexes the parameters of \(B_i\), such as its type, position, orientation, and scale. \(\alpha _i\) is the corresponding coefficient, and \(\epsilon \) represents the residue, which is assumed to be i.i.d. Gaussian. For a trackable primitive, \(B_i(x,y,t)\) spans \(3\) frames and thus encodes the velocity \((u,v)\) over those frames. For a sketchable but intrackable primitive, \(B_i(x,y,t)\) has only \(1\) frame.
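
As a concrete illustration of Eq. (4), the sketch below fits a single primitive to a brick by least squares. The function name, the unit-norm assumption on the primitives, and the exhaustive search over the dictionary are our simplifications, not details taken from the paper.

```python
import numpy as np

def fit_primitive(brick, primitives):
    """Fit one primitive B_i to a brick I_{Lambda_ex,i} as in Eq. (4).

    brick      : array of shape (11, 11, T), the spatio-temporal volume.
    primitives : list of candidate primitives of the same shape, assumed
                 normalized to unit L2 norm, so the least-squares
                 coefficient is simply the inner product.
    Returns (best_index, alpha, residual_energy).
    """
    best = None
    for i, B in enumerate(primitives):
        alpha = np.sum(brick * B)                 # alpha_i = <I, B_i>
        resid = np.sum((brick - alpha * B) ** 2)  # Gaussian residue energy E_i
        if best is None or resid < best[2]:
            best = (i, alpha, resid)
    return best
```

In practice the coefficients are computed from filter responses to speed up the search, as described in Sect. 3.2.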

As Fig. 4 illustrates, the dictionary \(\varDelta _B\) is composed of two categories:

  • Common primitives \(\varDelta _B^\mathrm{common}\). These are primitives shared by most videos, such as blobs, edges, and ridges. They have explicit parameters for orientations and scales. They mostly belong to the sketchable region, as shown in Fig. 3.

  • Special primitives \(\varDelta _B^\mathrm{special}\). These bricks do not have a common appearance and are specific to particular video frames. They are non-sketchable but trackable, and are recorded to code the specific video region. They mostly belong to the trackable region but are not included in the sketchable region, as shown in Fig. 3.

Note that the primitives and categories shown in Fig. 4 are selected examples rather than the whole dictionary. The details of learning these primitives are introduced in Sect. 3.2.

Fig. 4

Some selected examples of primitives. \(\varDelta _B\) is a dictionary of primitives with velocities (u,v) ( (u,v) is not shown), such as blobs, ridges, edges, and special primitives

Equation (4) uses only one basis function and thus differs from the conventional linear additive model. Following the Gaussian assumption for the residues, we have the following probabilistic model for the explicit region \({\mathbf {I}}_{\varLambda _\mathrm{ex}}\)

$$\begin{aligned} p({\mathbf {I}}_{\varLambda _\mathrm{ex}};{\mathbf {B}},\alpha )&= \prod _{i=1}^{n_\mathrm{ex}}\frac{1}{(2\pi )^{\frac{n}{2}}\sigma _i^n}\exp \{-E_i\}\nonumber \\ E_i&= \sum _{(x,y,t)\in \varLambda _{\mathrm{ex},i}}\frac{({\mathbf {I}}(x,y,t)-\alpha _iB_i(x,y,t))^2}{2\sigma _i^2}.\nonumber \\ \end{aligned}$$
(5)

where \({\mathbf {B}}=(B_1,\ldots ,B_{n_\mathrm{ex}})\) represents the selected primitive set, \(n\) is the size of each primitive, \(n_\mathrm{ex}\) is the number of selected primitives, and \(\sigma _i\) is the estimated standard deviation of the residue when representing natural videos by \(B_i\).

2.2 Implicit Representations by FRAME Models

The implicit region \(\varLambda _\mathrm{im}\) of video \({\mathbf {I}}\) is segmented into \(n_\mathrm{im}\) (usually \(n_\mathrm{im}\) is no more than \(10\)) disjoint homogeneous textured motion regions,

$$\begin{aligned} \varLambda _\mathrm{im}=\bigcup _{j=1}^{n_\mathrm{im}} \varLambda _{\mathrm{im},j}. \end{aligned}$$
(6)

One effective approach for texture modeling is to pool the histograms for a set of filters (Gabor, DoG and DooG) on the input image [2, 9, 21, 29, 42]. Since Gabor filters model the response functions of the neurons in the primary visual cortex, two texture images with the same histograms of filter responses generate the same texture impression, and thus are considered perceptually equivalent [34]. The FRAME model proposed in [42] generates the expected marginal statistics to match the observed histograms through the maximum entropy principle. As a result, any images drawn from this model will have the same filtered histograms and thus can be used for synthesis or reconstruction.

We extend this concept to video by adding temporal constraints and define each homogeneous textured motion region \({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}\) by an equivalence class of videos,

$$\begin{aligned} \varOmega _K({\mathbf {h}}_j)=\{{\mathbf {I}}_{\varLambda _{\mathrm{im},j}}: H_k({\mathbf {I}}_{\varLambda _{\mathrm{im},j}})= h_{k,j}, k=1,2,\ldots ,K\}.\nonumber \\ \end{aligned}$$
(7)

where \({\mathbf {h}}_j=(h_{1,j},\ldots ,h_{K,j})\) is a series of 1D histograms of filtered responses that characterize the macroscopic properties of the textured motion pattern. Thus we only need to code the histograms \({\mathbf {h}}_j\) and synthesize the textured motion region \({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}\) by sampling from the set \(\varOmega _K({\mathbf {h}}_j)\). As \({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}\) is defined by the implicit functions, we call it an implicit representation. These regions are coded up to an equivalence class in contrast to reconstructing the pixel intensities in the explicit representation.

To capture temporal constraints, one straightforward method is to choose a set of spatio-temporal filters and calculate the histograms of the filter responses. This leads to the spatio-temporal FRAME (ST-FRAME) model which will be introduced in Sect. 2.3. Another method is to compute the statistics of velocity. Since the motion in these regions is intrackable, at each point of the image, its velocity is ambiguous (large entropy). We pool the histograms of velocities locally in a way similar to the HOOF (Histogram of Oriented Optical Flow) [8, 12] features in action classification. This leads to the motion-appearance FRAME (MA-FRAME) model which uses histograms of both appearance (static filters) and velocities. We will elaborate on this model in Sect. 2.4.

2.3 Implicit Representation by Spatio-Temporal FRAME

ST-FRAME is an extension of the FRAME model [42] by adopting spatio-temporal filters.

A set of filters \({\mathbf {F}}\) is selected from a filter bank \(\varDelta _F\). Figure 5 illustrates the three types of filters in \(\varDelta _F\): (i) static filters for texture appearance in a single image; (ii) motion filters with certain velocities; and (iii) flicker filters that have zero velocity but opposite signs between adjacent frames. For each filter \(F_k\in {\mathbf {F}}\), the spatio-temporal filter response of \({\mathbf {I}}\) at \((x,y,t)\in \varLambda _{\mathrm{im},j}\) is \(F_k*{\mathbf {I}}(x,y,t)\), where the convolution is over the spatial and temporal domains. By pooling the filter responses over all \((x,y,t)\in \varLambda _{\mathrm{im},j}\), we obtain a number of 1D histograms

$$\begin{aligned} H_k({\mathbf {I}}_{\varLambda _{\mathrm{im},j}})&=H_k(z;{\mathbf {I}}_{\varLambda _{\mathrm{im},j}})\nonumber \\&=\frac{1}{|\varLambda _{\mathrm{im},j}|}\sum _{(x,y,t)\in \varLambda _{\mathrm{im},j}}{\delta (z;F_k*{\mathbf {I}}(x,y,t))},\nonumber \\&\quad k=1,\ldots ,K. \end{aligned}$$
(8)

where \(z\) indexes the histogram bins, and \(\delta (z;x)=1\) if \(x\) belongs to bin \(z\), and \(\delta (z;x)=0\) otherwise. Following the FRAME model, the statistical model of textured motion \({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}\) is written in the form of the following Gibbs distribution,
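
The pooled histogram of Eq. (8) translates directly into code; the sketch below is a minimal version. The bin count, the response range, and the boundary handling are illustrative choices (the 15-bin default follows the accounting in Sect. 2.5).

```python
import numpy as np
from scipy.ndimage import correlate

def st_histogram(video, st_filter, mask, bins=15, rng=(-1.0, 1.0)):
    """Histogram H_k of spatio-temporal filter responses, Eq. (8).

    video     : array (H, W, T) containing the textured-motion region.
    st_filter : array (h, w, tau), one spatio-temporal filter F_k.
    mask      : boolean array (H, W, T), True on Lambda_{im,j}.
    Returns a histogram over `bins` bins, normalized by |Lambda_{im,j}|.
    """
    resp = correlate(video, st_filter, mode='nearest')  # F_k * I
    hist, _ = np.histogram(resp[mask], bins=bins, range=rng)
    return hist / mask.sum()
```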

$$\begin{aligned} p({\mathbf {I}}_{\varLambda _{\mathrm{im},j}};{\mathbf {F}},\beta )\propto \exp \left\{ -\sum _k\langle \beta _{k,j},H_k({\mathbf {I}}_{\varLambda _{\mathrm{im},j}})\rangle \right\} . \end{aligned}$$
(9)

where \(\beta _{k,j}\) are potential functions, each a vector over the histogram bins of \(H_k\).

Fig. 5

\(\varDelta _F\) is a dictionary of spatio-temporal filters including static, motion, and flicker filters

According to the theorem of ensemble equivalence [39], the Gibbs distribution converges to the uniform distribution over the set \(\varOmega _K({\mathbf {h}}_j)\) in (7) when \(\varLambda _{\mathrm{im},j}\) is large enough. For any fixed local brick \(\varLambda _0\subset \varLambda _{\mathrm{im},j}\), the distribution of \(I_{\varLambda _0}\) follows the Markov random field model (9). The model can describe textured motion located in an irregularly shaped region \(\varLambda _{\mathrm{im},j}\).

The filters in \({\mathbf {F}}\) are pursued one by one from the filter bank \(\varDelta _F\) so that the information gain is maximized at each step.

$$\begin{aligned} F_k^*=\arg \max _{F_k\in \varDelta _F}\left\| H_k^\mathrm{syn}-H_k^0\right\| . \end{aligned}$$
(10)

Here \(H_k^0\) is the histogram of the \(F_k\) responses on the observed video, and \(H_k^\mathrm{{syn}}\) is the histogram on the video synthesized with the filters selected so far, before adding \(F_k\). The larger the difference, the more important the filter.
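
A minimal sketch of this pursuit step, under our reading of Eq. (10): among the filters not yet selected, pick the one whose histogram on the current synthesis deviates most (here in L1 norm) from the observed histogram. All names are hypothetical.

```python
def pursue_next_filter(filters, selected, hist_obs, hist_syn):
    """Choose the next filter by Eq. (10).

    hist_obs, hist_syn : lists/dicts mapping filter index k to its histogram
                         (numpy array) on the observed and currently
                         synthesized video, respectively.
    selected           : set of indices already chosen.
    Returns the index of the filter with the largest L1 discrepancy.
    """
    candidates = [k for k in range(len(filters)) if k not in selected]
    return max(candidates,
               key=lambda k: abs(hist_syn[k] - hist_obs[k]).sum())
```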

Following the distribution form of (9), the probabilistic model of implicit parts of \({\mathbf {I}}\) is defined as

$$\begin{aligned} p({\mathbf {I}}_{\varLambda _\mathrm{im}};{\mathbf {F}},\beta )\propto \prod _{j=1}^{n_\mathrm{im}}p({\mathbf {I}}_{\varLambda _{\mathrm{im},j}};{\mathbf {F}},\beta ). \end{aligned}$$
(11)

where \({\mathbf {F}}=(F_1,\ldots ,F_K)\) represents the selected spatio-temporal filter set.

In the experiments described later, we demonstrate that this model can synthesize a range of dynamic textures by matching the histograms of filter responses. The synthesis is done through sampling the probability by Markov chain Monte Carlo.

2.4 Implicit Representation by Motion-Appearance FRAME

Different from ST-FRAME, in which the temporal constraints are based on spatio-temporal filters, the MA-FRAME model uses the statistics of velocities in addition to the statistics of filter responses for appearance.

For the appearance constraints, the filter response histograms \(H^{(s)}\) are obtained similarly as ST-FRAME in (8)

$$\begin{aligned} H^{(s)}_k({\mathbf {I}}_{\varLambda _{\mathrm{im},j}})&=H^{(s)}_k(z;{\mathbf {I}}_{\varLambda _{\mathrm{im},j}})\nonumber \\&=\frac{1}{|\varLambda _{\mathrm{im},j}|}\sum _{(x,y,t)\in \varLambda _{\mathrm{im},j}}{\delta (z;F_k*{\mathbf {I}}(x,y,t))},\nonumber \\&\quad k=1,\ldots ,K. \end{aligned}$$
(12)

where the filter set \({\mathbf {F}}\) includes static and flicker filters in \(\varDelta _F\).

For the motion constraints, the velocity distribution of each local patch is estimated via the calculation of trackability [17], in which each patch is compared with its spatial neighborhood in the adjacent frame and the probability of the local velocity \(v\) is computed as

$$\begin{aligned}&p(v|I(x,y,t-1),I(x,y,t))\nonumber \\&\quad \propto \exp \left\{ -\frac{\Vert I_{\partial (x-v_x,y-v_y,t)}-I_{\partial (x,y,t-1)}\Vert ^2}{2\sigma ^2}\right\} . \end{aligned}$$
(13)

Here, \(\sigma \) is the standard deviation of the differences between local patches in adjacent frames under various velocities. The velocity statistics for a given textured area are approximated by averaging the velocity distributions over the region \(\varLambda _{\mathrm{im},j}\)

$$\begin{aligned} H^{(t)}(\mathbf {v}_{\varLambda _{\mathrm{im},j}})=\sum _{(x,y,t)\in \varLambda _{\mathrm{im},j}}{p(v|I(x,y,t-1),I(x,y,t))}. \end{aligned}$$
(14)
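
Equations (13) and (14) can be computed as in the following sketch; the patch size, search radius, boundary handling, and the explicit normalization of the pooled histogram (the averaging mentioned above) are our choices.

```python
import numpy as np

def velocity_distribution(prev, curr, x, y, radius=3, half=5, sigma=10.0):
    """p(v | I(t-1), I(t)) of Eq. (13) for pixel (x, y) of frame `curr`.

    Compares the (2*half+1)^2 patch around (x, y) in `curr` with the patches
    displaced by every velocity (vx, vy), |vx|, |vy| <= radius, in `prev`.
    Assumes (x, y) lies at least radius+half pixels from the border.
    Returns a (2*radius+1, 2*radius+1) array of probabilities.
    """
    patch = curr[y - half:y + half + 1, x - half:x + half + 1]
    logp = np.empty((2 * radius + 1, 2 * radius + 1))
    for i, vy in enumerate(range(-radius, radius + 1)):
        for j, vx in enumerate(range(-radius, radius + 1)):
            ref = prev[y - vy - half:y - vy + half + 1,
                       x - vx - half:x - vx + half + 1]
            logp[i, j] = -np.sum((patch - ref) ** 2) / (2 * sigma ** 2)
    p = np.exp(logp - logp.max())
    return p / p.sum()

def pooled_velocity_histogram(prev, curr, region_pixels):
    """Eq. (14): average the per-pixel distributions over Lambda_{im,j}."""
    return sum(velocity_distribution(prev, curr, x, y)
               for (x, y) in region_pixels) / len(region_pixels)
```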

Let \({\mathbf {H}}=( {\mathbf {H}}^{(s)}({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}),{\mathbf {H}}^{(t)}(\mathbf {v}_{\varLambda _{\mathrm{im},j}}) )\) collect the filter-response and velocity histograms of the video. The statistical model of the textured motion \({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}\) can be written in the form of the following joint Gibbs distribution,

$$\begin{aligned} p({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}; {\mathbf {F}},\beta ) \propto \exp \left\{ -\langle \varvec{\beta },{\mathbf {H}}({\mathbf {I}}_{\varLambda _{\mathrm{im},j}},\mathbf {v}_{\varLambda _{\mathrm{im},j}})\rangle \right\} . \end{aligned}$$
(15)

Here, \(\beta \) is the parameter of the model.

In summary, the probabilistic model for the implicit regions of \({\mathbf {I}}\) is defined as

$$\begin{aligned} p({\mathbf {I}}_{\varLambda _\mathrm{im}};{\mathbf {F}},\beta )\propto \prod _{j=1}^{n_\mathrm{im}}p({\mathbf {I}}_{\varLambda _{\mathrm{im},j}};{\mathbf {F}},\beta ). \end{aligned}$$
(16)

where \({\mathbf {F}}=(F_1,\ldots ,F_K)\) represents the selected filter set.

In the experiment section, we show the effectiveness of the MA-FRAME model and its advantages over the ST-FRAME model.

2.5 Hybrid Model for Video Representation

The ST or MA-FRAME models for the implicit regions \({\mathbf {I}}_{\varLambda _\mathrm{im}}\) use the explicit regions \({\mathbf {I}}_{\varLambda _\mathrm{ex}}\) as boundary conditions, and the probabilistic models for \({\mathbf {I}}_{\varLambda _\mathrm{ex}}\) and \({\mathbf {I}}_{\varLambda _\mathrm{im}}\) are given by (5) and (11), respectively

$$\begin{aligned} {\mathbf {I}}_{\varLambda _\mathrm{ex}}\sim p({\mathbf {I}}_{\varLambda _\mathrm{ex}};{\mathbf {B}},\alpha ),{\mathbf {I}}_{\varLambda _\mathrm{im}}\sim p({\mathbf {I}}_{\varLambda _\mathrm{im}}|{\mathbf {I}}_{\partial \varLambda _\mathrm{im}};{\mathbf {F}},\beta ). \end{aligned}$$
(17)

Here, \({\mathbf {I}}_{\partial \varLambda _\mathrm{im}}\) represents the boundary condition of \({\mathbf {I}}_{\varLambda _\mathrm{im}}\), which is taken from the reconstruction of \({\mathbf {I}}_{\varLambda _\mathrm{ex}}\). This leads to seamless boundaries in the synthesis.

By integrating the explicit and implicit representation, the video primal sketch has the following probability model,

$$\begin{aligned}&p({\mathbf {I}}|{\mathbf {B}},{\mathbf {F}},\alpha ,\beta )\nonumber \\&\quad =\frac{1}{Z}\exp \left\{ -\sum _{i=1}^{n_\mathrm{ex}}\sum _{(x,y,t)\in \varLambda _{\mathrm{ex},i}}\frac{({\mathbf {I}}(x,y,t)-\alpha _iB_i(x,y,t))^2}{2\sigma _i^2}\right. \nonumber \\&\quad \quad \left. -\sum _{j=1}^{n_\mathrm{im}} \sum _{k=1}^K\langle \beta _{k,j},H_k ({\mathbf {I}}_{\varLambda _{\mathrm{im},j}}|{\mathbf {I}}_{\partial \varLambda _{\mathrm{im},j}})\rangle \right\} , \end{aligned}$$
(18)

where \(Z\) is the normalizing constant.

We denote by \(\mathrm{VPS}=({\mathbf {B}},{\mathbf {H}})\) the representation for the video \({\mathbf {I}}_{\varLambda }\), where \({\mathbf {H}}=(\{h_{k,1}\}_{k=1}^{K_1},\ldots ,\{h_{k,n_\mathrm{im}}\}_{k=1}^{K_{n_\mathrm{im}}})\) includes the histograms described by \({\mathbf {F}}\) and \(\mathbf {V}\), and \({\mathbf {B}}\) includes all the primitives with parameters for their indexes, positions, orientations, and scales. \(p(\mathrm{VPS})=p({\mathbf {B}},{\mathbf {H}})=p({\mathbf {B}})p({\mathbf {H}})\) gives the prior probability of the video representation by VPS. \(p({\mathbf {B}})\propto \exp \{-|{\mathbf {B}}|\}\), in which \(|{\mathbf {B}}|\) is the number of primitives. \(p({\mathbf {H}})\propto \exp \{-\gamma _\mathrm{tex}({\mathbf {H}})\}\), in which \(\gamma _\mathrm{tex}({\mathbf {H}})\) is an energy term; for instance, \(\gamma _\mathrm{tex}({\mathbf {H}})=\rho n_\mathrm{im}\) penalizes the number of implicit regions. Thus, the best video representation \(\mathrm{VPS}^*\) is obtained by maximizing the posterior probability,

$$\begin{aligned}&\mathrm{VPS}^*=\arg \max _{\mathrm{VPS}} p(\mathrm{VPS}|{\mathbf {I}}_{\varLambda })=\arg \max _{\mathrm{VPS}} p({\mathbf {I}}_{\varLambda }|\mathrm{VPS})p(\mathrm{VPS})\nonumber \\&\quad =\arg \max _{{\mathbf {B,H}}}\frac{1}{Z}\exp \left\{ -\sum _{i=1}^{n_\mathrm{ex}}\sum _{(x,y,t)\in \varLambda _{\mathrm{ex},i}}\frac{({\mathbf {I}}(x,y,t)\!-\!\alpha _iB_i(x,y,t))^2}{2\sigma _i^2}\right. \nonumber \\&\quad \left. -|{\mathbf {B}}|-\sum _{j=1}^{n_\mathrm{im}}\sum _{k=1}^K\langle \beta _{k,j},H_k\rangle -\gamma _\mathrm{tex}({\mathbf {H}})\right\} . \end{aligned}$$
(19)

following the video primal sketch model in (18).

Table 1 shows an example of VPS. For a video of the size of \(288\times 352\) pixels, about 30 % of the pixels are represented explicitly by \(n_\mathrm{ex}=300\) motion primitives. As each primitive needs 11 parameters (the side length of the patch according to the primitive learning process in Sect. 3.2) to record the profile and 1 more to record the type, the number of total parameters for the explicit representation is 3,600. \(n_\mathrm{im}=3\) textured motion regions are represented implicitly by the histograms, which are described by \(K_1=11\), \(K_2=12\), and \(K_3=5\) filters, respectively. As each histogram has 15 bins, the number of the parameters for the implicit representation is 420.

2.6 Sketchability and Trackability for Model Selection

The computation of the VPS involves the partition of the domain \(\varLambda \) into the explicit regions \(\varLambda _\mathrm{ex}\) and implicit regions \(\varLambda _\mathrm{im}\). This is done through the sketchability and trackability maps. In this subsection, we overview the general ideas and refer to previous work on sketchability [18] and trackability [17] for details.

Let us consider one local volume \(\varLambda _0\subset \varLambda \) of the video \({\mathbf {I}}\). In the video primal sketch model, \({\mathbf {I}}_{\varLambda _0}\) may be modeled either by the sparse coding model in (5) or by the FRAME model in (11). The choice is determined via the competition between the two models, i.e., comparing which model gives shorter coding length [33] for representation.

If \({\mathbf {I}}_{\varLambda _0}\) is represented by the sparse coding model, the posterior probability is calculated by

$$\begin{aligned} p({\mathbf {B}}|{\mathbf {I}}_{\varLambda _0})=\frac{1}{(2\pi )^{n/2}\sigma ^n}\exp \left\{ -\sum _i\frac{\Vert {\mathbf {I}}_{\varLambda _0}-\alpha _i B_i\Vert ^2}{2\sigma ^2}\right\} . \end{aligned}$$
(20)

where \(n=|\varLambda _0|\). The coding length is

$$\begin{aligned} L_\mathrm{ex}({\mathbf {I}}_{\varLambda _0})&= \log \frac{1}{p({\mathbf {B}}|{\mathbf {I}}_{\varLambda _0})}\\&= \frac{n}{2}\log 2\pi \sigma ^2+\sum _i\frac{\Vert {\mathbf {I}}_{\varLambda _0}-\alpha _i B_i\Vert ^2}{2\sigma ^2}. \end{aligned}$$

Since \(\sigma ^2\) is estimated from the given data in practice, \(\frac{1}{n}\sum _i \Vert {\mathbf {I}}_{\varLambda _0}-\alpha _i B_i\Vert ^2=\sigma ^2\) holds by definition. As a result, the coding length becomes

$$\begin{aligned} L_\mathrm{ex}({\mathbf {I}}_{\varLambda _0})=\frac{n}{2}\left( \log 2\pi \sigma ^2+1\right) . \end{aligned}$$
(21)

If \({\mathbf {I}}_{\varLambda _0}\) is described by the FRAME model, the posterior probability is calculated by

$$\begin{aligned} p({\mathbf {F}}|{\mathbf {I}}_{\varLambda _0})\propto \exp \left\{ -\sum _{k=1}^K\langle \beta _{k},H_k({\mathbf {I}}_{\varLambda _0})\rangle \right\} . \end{aligned}$$
(22)

The coding length is estimated through a sequential reduction process. When \(K=0\), with no constraints, the FRAME model is a uniform distribution, and thus the coding length is \(\log |\varOmega _0|\), where \(|\varOmega _0|\) is the cardinality of the space of all videos on \(\varLambda _0\). Suppose the intensities of the video range from 0 to 255; then \(\log |\varOmega _0| = 8 \times |\varLambda _0|\). With each added constraint, the equivalence class \(\varOmega _K\) shrinks in size, and the compression ratio \(\log \frac{|\varOmega _{K-1}|}{|\varOmega _K|}\) is approximately equal to the information gain in (10). Therefore, we can calculate the coding length by

$$\begin{aligned} L_\mathrm{im}({\mathbf {I}}_{\varLambda _0})\!=\! \log |\varOmega _0|\!-\!\log \frac{|\varOmega _0|}{|\varOmega _1|}\!-\!\cdots -\log \frac{|\varOmega _{K-1}|}{|\varOmega _K|}. \end{aligned}$$
(23)

By comparing \(L_\mathrm{im}({\mathbf {I}}_{\varLambda _0})\) and \(L_\mathrm{ex}({\mathbf {I}}_{\varLambda _0})\), the model with the shorter coding length wins the competition and is chosen for \({\mathbf {I}}_{\varLambda _0}\).

In practice, we use a faster estimation which utilizes the relationship between the coding length and the entropy of the local posterior probabilities.

Consider the entropy of \(p(B|{\mathbf {I}}_{\varLambda _0})\),

$$\begin{aligned} {\mathcal {H}}(B|{\mathbf {I}}_{\varLambda _0})=-E_{p(B,{\mathbf {I}}_{\varLambda _0})}[\log p(B|{\mathbf {I}}_{\varLambda _0})]. \end{aligned}$$
(24)

It measures the uncertainty of selecting a primitive in \(\varDelta _B\) for the representation. The sharper the distribution \(p(B|{\mathbf {I}}_{\varLambda _0})\) is, the lower the entropy \({\mathcal {H}}(B|{\mathbf {I}}_{\varLambda _0})\) will be, which gives a smaller \(L_\mathrm{ex}({\mathbf {I}}_{\varLambda _0})\) according to (21). Hence, \({\mathcal {H}}(B|{\mathbf {I}}_{\varLambda _0})\) reflects the magnitude of \(L_\text {diff}({\mathbf {I}}_{\varLambda _0})=L_\mathrm{ex}({\mathbf {I}}_{\varLambda _0})-L_\mathrm{im}({\mathbf {I}}_{\varLambda _0})\). We set an entropy threshold \({\mathcal {H}}_0\) on \({\mathcal {H}}(B|{\mathbf {I}}_{\varLambda _0})\); ideally, \({\mathcal {H}}(B|{\mathbf {I}}_{\varLambda _0})={\mathcal {H}}_0\) if and only if \(L_\text {diff}({\mathbf {I}}_{\varLambda _0})=0\). Therefore, when \({\mathcal {H}}(B|{\mathbf {I}}_{\varLambda _0})<{\mathcal {H}}_0\), we consider \(L_\mathrm{ex}({\mathbf {I}}_{\varLambda _0})\) to be lower and model \({\mathbf {I}}_{\varLambda _0}\) by the sparse coding model; otherwise it is modeled by the FRAME model.

It is clear that \({\mathcal {H}}(B|{\mathbf {I}}_{\varLambda _0})\) has the same form and meaning as sketchability [18] in appearance representation and trackability [17] in motion representation. Therefore, sketchability and trackability can be used for model selection for each local volume. Figure 2b, c shows the sketchability and trackability maps calculated from the local entropy of the posteriors. The two maps determine the partition of the video into the explicit and implicit regions. Within the explicit regions, they also determine whether a patch is trackable (using primitives of size \(11\times 11\) pixels \(\times 3\) frames) or intrackable (using primitives of \(11\times 11\) pixels \(\times 1\) frame).
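
The selection rule of this subsection reduces to thresholding the entropy of the local posterior; a minimal sketch follows, with an illustrative threshold value (the actual \({\mathcal {H}}_0\) is not specified here).

```python
import numpy as np

def choose_model(posterior_over_primitives, H0=1.5):
    """Select the explicit (sparse coding) or implicit (FRAME) model
    for one local volume I_{Lambda_0}.

    posterior_over_primitives : 1D array, p(B | I_{Lambda_0}) over Delta_B.
    H0                        : entropy threshold (illustrative value).
    Returns 'explicit' if the posterior is peaked (low entropy, i.e.,
    sketchable/trackable), otherwise 'implicit'.
    """
    p = posterior_over_primitives / posterior_over_primitives.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return 'explicit' if entropy < H0 else 'implicit'
```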

3 Algorithms and Experiments

3.1 Spatio-Temporal Filters

In the vision literature, spatio-temporal filters have been widely used for motion information extraction [1], optical flow estimation [20], multi-scale representation of temporal data [23], pattern categorization [38], and dynamic texture recognition [13]. In our experiments, we choose the spatio-temporal filter bank \(\varDelta _F\) shown in Fig. 5, which includes three types:

  1. Static filters. Laplacian of Gaussian (LoG), Gabor, gradient, or intensity filters applied to a single frame. They capture the statistics of spatial features.

  2. Motion filters. Moving LoG, Gabor, or intensity filters at different speeds and directions over three frames. Gabor motion filters move perpendicularly to their orientations.

  3. Flicker filters. One static filter with opposite signs at two frames. They contrast the static filter responses between two consecutive frames and detect changes in the dynamics.

For the implicit representation, the filters are \(7\times 7\) pixels in size and have \(6\) scales, \(12\) directions, and \(3\) speeds. Each type of filter has a distinct effect on textured motion synthesis, which will be discussed in Sect. 3.3 and shown in Fig. 8.
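
For concreteness, the sketch below constructs one filter of each type; the exact parameterization (Gabor phase, how the motion filter is shifted across the three frames) is our own simplification of the bank described above.

```python
import numpy as np

def spatial_gabor(size=7, scale=2.0, theta=0.0, wavelength=4.0):
    """A zero-mean 2D Gabor kernel (cosine phase) of side `size`."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    Xr = X * np.cos(theta) + Y * np.sin(theta)
    Yr = -X * np.sin(theta) + Y * np.cos(theta)
    g = np.exp(-(Xr**2 + Yr**2) / (2 * scale**2)) * np.cos(2 * np.pi * Xr / wavelength)
    return g - g.mean()

def motion_gabor(theta=0.0, speed=1, size=7):
    """A size x size x 3 motion filter: the spatial Gabor shifted by `speed`
    pixels per frame perpendicular to its orientation, over 3 frames."""
    frames = []
    for t in (-1, 0, 1):
        dx = int(round(t * speed * -np.sin(theta)))  # shift along the normal
        dy = int(round(t * speed * np.cos(theta)))
        frames.append(np.roll(spatial_gabor(size, theta=theta), (dy, dx), axis=(0, 1)))
    return np.stack(frames, axis=-1)

def flicker_filter(size=7, scale=2.0):
    """A flicker filter: one static kernel with opposite signs on two frames."""
    g = spatial_gabor(size, scale)
    return np.stack([g, -g], axis=-1)
```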

3.2 Learning Motion Primitives and Reconstructing Explicit Regions

After computing the sketchability and trackability maps of one frame, we extract the explicit regions of the video. We calculate the coefficients \(\alpha _{i,j}=\langle {\mathbf {I}}_{\varLambda _{\mathrm{tr},i}},B_j\rangle \) between each trackable brick and the motion primitives from the primitive bank and rank all the \(\alpha _{i,j}\) from high to low. At each step, we select the primitive with the highest coefficient to represent the corresponding domain and then apply local suppression to its neighborhood to avoid excessive overlap of the extracted domains. The algorithm is similar to matching pursuit [25], and the primitives are chosen one by one.

In our work, in order to reduce the computational cost, the \(\alpha _{i,j}\) are computed from filter responses. The filters used here are \(11\times 11\) pixels and have \(18\) orientations and 8 scales. The fitted filter \(F_j\) gives a raw sketch of the trackable patch and provides property information, such as type and orientation, for generating the primitive. If the fitted filter is Gabor-like, the primitive \(B_j\) is calculated by averaging the intensities of the patch along the orientation of \(F_j\); if the fitted filter is LoG-like, \(B_j\) is calculated by averaging the intensities circularly around its center. Then \(B_j\) is added to the primitive set \({\mathbf {B}}\) with its motion velocities calculated from the trackability map. It is also added to \(\varDelta _B\) to build up the dictionary. The size of each primitive is \(11\times 11\), the same as the size of the fitted filter, and the velocity \((u,v)\) adds two parameters recording the motion information. In Fig. 4, we show some examples of different types of primitives, such as blobs, ridges, and edges. Figure 6 shows some examples of reconstruction by motion primitives. In each group, the original local image, the fitted filter, the generated primitive, and the motion velocity are given. In the frame, each patch is marked by a square with a short line representing its motion.
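
The selection-with-suppression loop described above can be sketched as follows; the suppression radius, the cap on the number of picks, and the data layout are assumptions.

```python
import numpy as np

def select_primitives(coeffs, positions, suppress_radius=8, max_picks=300):
    """Greedy primitive selection in the spirit of matching pursuit [25].

    coeffs    : array (n_bricks, n_primitives) of alpha_{i,j} = <I_i, B_j>.
    positions : array (n_bricks, 2) of brick centers (x, y).
    Returns a list of (brick_index, primitive_index, alpha) picks.
    """
    alpha = np.abs(coeffs).copy()
    picks = []
    while len(picks) < max_picks and alpha.max() > 0:
        i, j = np.unravel_index(alpha.argmax(), alpha.shape)
        picks.append((i, j, coeffs[i, j]))
        # local suppression: disable bricks whose centers overlap this one
        d = np.linalg.norm(positions - positions[i], axis=1)
        alpha[d < suppress_radius, :] = 0
    return picks
```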

Fig. 6

Some examples of primitives in a frame of video. Each group shows the original local image \({\mathbf {I}}\), the best fitted filter \({\mathbf {F}}\), the fitted primitive \({\mathbf {B}}\in \varDelta _B\) and the velocity \((u,v)\), which represents the motion of \({\mathbf {B}}\)

Through the matching pursuit process, the sketchable regions are reconstructed by a set of common primitives. Figure 7 shows an example of sketchable region reconstruction using a series of common primitives: (c) shows the reconstruction error between the observed frame (a) and the reconstructed frame (b). A more detailed quantitative assessment is given in Sect. 3.7. It is evident that a rich dictionary of video primitives can lead to a satisfactory reconstruction of the explicit regions of videos.

Fig. 7

The reconstruction effect of sketchable regions by common primitives. a The observed frame. b The reconstructed frame. c The errors of reconstruction

For non-sketchable but trackable regions, we obtain the motion trace of each local trackable patch from the trackability map. Because such a patch cannot be represented by a shared primitive, we record the whole patch and its motion information as a special primitive for video reconstruction. Special primitives obviously increase the model complexity compared with common primitives. However, as stated in Sect. 2.1, the percentage of special primitives used for the explicit region reconstruction of a video is very small (around 2–3 %), so they do not affect the final storage size significantly.

Fig. 8

Synthesis for one frame of the ocean textured motion. a Initial uniform white noise image. b Synthesized frame with only static filters. c Synthesized frame with only motion filters. d Synthesized frame with both of static and motion filters. e Synthesized frame with all of the 3 types of filters. f The original observed frame

3.3 Synthesizing Textured Motions by ST-FRAME

Each local volume \({\mathbf {I}}_{\varLambda _0}\) of textured motion located at \(\varLambda _0\) follows a Markov random field model conditioned on its local neighborhood \({\mathbf {I}}_{\partial \varLambda _0}\) following (9),

$$\begin{aligned} p({\mathbf {I}}_{\varLambda _0}|{\mathbf {I}}_{\partial \varLambda _0}; {\mathbf {F}},\beta )\propto \exp \left\{ -\sum _{k}{\langle \beta _k,H_k({\mathbf {I}}_{\varLambda _0})\rangle }\right\} , \end{aligned}$$
(25)

where Lagrange parameters \(\beta _k=\{\beta _k^{(i)}\}_{i=1}^L\in \beta \) are the discrete form of potential function \(\beta _k()\) learned from input videos by maximum likelihood,

$$\begin{aligned} \hat{\beta }&= \arg \max \limits _{\beta }\log p({\mathbf {I}}_{\varLambda _0}|{\mathbf {I}}_{\partial \varLambda _0}; \beta , {\mathbf {F}})\nonumber \\&= \arg \max \limits _{\beta }\left\{ -\log Z(\beta )-\sum _{k}{\langle \beta _k,H_k({\mathbf {I}}_{\varLambda _0})\rangle }\right\} \end{aligned}$$
(26)

The closed form of \(\beta \) is not available in general, so it is solved iteratively by

$$\begin{aligned} \frac{\mathrm{d}\beta ^{(i)}}{\mathrm{d}t}=E_{p({\mathbf {I}}; \beta , {\mathbf {F}})}[H^{(i)}]-H^{\mathrm{obs}(i)} \end{aligned}$$
(27)

In order to draw a typical sample frame from \(p({\mathbf {I}}; {\mathbf {F}},\beta )\), we use the Gibbs sampler, which simulates a Markov chain. Starting from any random image, e.g., white noise, the chain converges to a stationary process with distribution \(p({\mathbf {I}}; {\mathbf {F}},\beta )\). Therefore, the final converged results follow \(p({\mathbf {I}}; \beta , {\mathbf {F}})\), which characterizes the observed dynamic texture.

In summary, the process of textured motion synthesis is given by the following algorithm.

[Algorithm 1: Textured motion synthesis by ST-FRAME]
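
Since the algorithm box is not reproduced here, the following compact sketch spells out the loop implied by Eqs. (25)–(27): alternate Gibbs sweeps of the synthesized volume with gradient steps on \(\beta \) driven by the histogram mismatch. It is written for clarity rather than speed, and the bin range, step size, and sweep counts are illustrative.

```python
import numpy as np
from scipy.ndimage import correlate

def histograms(video, filters, bins, rng):
    """Normalized histograms of spatio-temporal filter responses (Eq. (8))."""
    return [np.histogram(correlate(video, F, mode='wrap'),
                         bins=bins, range=rng)[0] / video.size
            for F in filters]

def st_frame_synthesis(obs, filters, levels=16, bins=15, rng=(-4.0, 4.0),
                       n_outer=10, n_sweeps=2, step=1.0, seed=0):
    """Compact sketch of Algorithm 1 (ST-FRAME textured-motion synthesis).

    obs     : observed textured-motion volume (H, W, T), 4-bit intensities.
    filters : list of small spatio-temporal kernels (e.g., from Fig. 5).
    Alternates single-site Gibbs sweeps with the beta update of Eq. (27);
    histograms are recomputed from scratch, so this is deliberately slow.
    """
    rs = np.random.RandomState(seed)
    H_obs = histograms(obs, filters, bins, rng)
    syn = rs.randint(0, levels, obs.shape).astype(float)   # white-noise start
    beta = [np.zeros(bins) for _ in filters]

    def energy(v):                                          # sum_k <beta_k, H_k(v)>
        return sum(np.dot(b, h)
                   for b, h in zip(beta, histograms(v, filters, bins, rng)))

    for _ in range(n_outer):
        for _ in range(n_sweeps):                           # Gibbs sampling
            for idx in zip(*map(np.ravel, np.indices(syn.shape))):
                e = np.empty(levels)
                for g in range(levels):                     # candidate intensities
                    syn[idx] = g
                    e[g] = energy(syn)
                p = np.exp(-(e - e.min()))
                syn[idx] = rs.choice(levels, p=p / p.sum())
        H_syn = histograms(syn, filters, bins, rng)
        for k in range(len(filters)):                       # Eq. (27) ascent step
            beta[k] += step * (H_syn[k] - H_obs[k])
    return syn, beta
```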

Figure 8 shows an example of the synthesis process. (f) is one frame from the ocean textured motion. Starting from the white noise frame in (a), (b) is synthesized with only 7 static filters. It shows high smoothness in the spatial domain but lacks temporal continuity with the previous frames. In contrast, the synthesis in (c) with only 9 motion filters has a macroscopic distribution similar to the observed frame but appears quite grainy in its local spatial relationships. Using both static and motion filters, the synthesis in (d) performs well on both spatial and temporal relationships. Compared with (d), the synthesis with 2 extra flicker filters in (e) is smoother and more similar to the observed frame.

In Fig. 9, we show four groups of textured motion (4 bits) synthesized by Algorithm 1: ocean (a), water wave (b), fire (c), and forest (d). In each group, as time passes, the synthesized frames become more and more different from the observed ones, which is caused by the stochasticity of textured motions. Although the synthesized and observed videos are quite different at the pixel level, the two sequences are perceived as nearly identical by humans once the histograms of a small set of filter responses are matched. This conclusion is further supported by the perceptual studies in Sect. 3.9. Figure 10 shows that as \({\mathbf {I}}_{(m)}^\mathrm{syn}\) changes from white noise (Fig. 8a) to the final synthesized result (Fig. 8e), the histograms of filter responses become matched with the observed ones.

Fig. 9

Textured motion synthesis examples. For each group, the top row is the original videos and the bottom row shows the synthesized ones. a Ocean. b Water wave. c Fire. d Forest

Fig. 10

Matching of histograms of spatio-temporal filter responses for Ocean. The filters are a Static LoG (\(5\times 5\)). b Static gradient (vertical). c Motion gabor (6,150\(^\circ \)). d Motion gabor (2,60\(^\circ \)). e Motion gabor (2,0\(^\circ \)). f Flicker LoG (\(5\times 5\))

Table 2 compares the compression ratios of ST-FRAME and the dynamic texture model [14]. ST-FRAME has a significantly better compression ratio than the dynamic texture model, because the latter has to record PCA components as large as the image size.

Table 2 The number of parameters recorded and the compression ratios for synthesis of 5-frame textured motion videos by ST-FRAME and the dynamic texture model ([14])

3.4 Computing Velocity Statistics

One popular method for velocity estimation is optical flow. Based on the optical flow, HOOF features extract the motion statistics by calculating the distribution of velocities in each region. Optical flow is an effective method for estimating the motion of trackable areas, but it does not work for the intrackable dynamic texture areas. The three basic assumptions of the optical flow equations, i.e., brightness constancy between matched pixels in consecutive frames, smoothness among adjacent pixels, and slow motion, are violated in these areas due to the stochastic nature of dynamic textures. Therefore, we adopt a different velocity estimation method.

Consider one pixel \(I(x,y,t)\) at \((x,y)\) in frame \(t\) and denote its neighborhood by \(I_{\partial \varLambda _{x,y,t}}\). Comparing the patch \(I_{\partial \varLambda _{x,y,t}}\) with all the patches in the previous frame within a search radius, each corresponding to one velocity \(v=(v_x,v_y)\), we obtain a distribution

$$\begin{aligned} p(v)\propto \exp \left\{ -\Vert I_{\partial \varLambda _{x,y,t}}-I_{\partial \varLambda _{x-v_x,y-v_y,t-1}}\Vert ^2\right\} \end{aligned}$$
(28)

This distribution describes the probability of the origin of the patch, i.e., the location from which the patch \(I_{\partial \varLambda _{x,y,t}}\) moved. Equivalently, it reflects the average probability of the motions of the pixels in the patch. Therefore, by clustering all the pixels according to their velocity distributions, each cluster center approximates the velocity statistics of the pixels in that cluster and reflects their motion pattern. Figures 14 and 15 show some examples of velocity statistics, in which brighter means higher probability and darker means lower probability; the meanings of these two figures are explained later.

Compared to HOOF, the estimated velocity distribution is more suitable for modeling textured motion. First, the velocity distribution is estimated pixel-wise, so it can depict more non-smooth motions. Second, although it also compares the intensity pattern around a point with nearby regions at the subsequent temporal instance, which seems to invoke the brightness constancy assumption, the difference is that it calculates the probability of motions rather than a single pixel correspondence. As a result, the constraints imposed by the assumption are weakened, and it can represent stochastic dynamics.
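
To obtain the cluster-wise velocity statistics described above, one can stack the per-pixel velocity distributions of Eq. (28) (computed as in the sketch of Sect. 2.4) and cluster them; the small hand-rolled k-means below, and the number of clusters, are our own choices.

```python
import numpy as np

def cluster_velocity_statistics(pixel_distributions, n_clusters=5, n_iter=20, seed=0):
    """Cluster per-pixel velocity distributions and return the cluster centers.

    pixel_distributions : array (n_pixels, d) where each row is a flattened
                          velocity distribution p(v) for one pixel.
    Returns (centers, labels); each center approximates the velocity
    statistics of the pixels assigned to it (cf. Figs. 14 and 15).
    """
    rs = np.random.RandomState(seed)
    X = np.asarray(pixel_distributions, dtype=float)
    centers = X[rs.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels
```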

3.5 Synthesizing Textured Motions by MA-FRAME

In the MA-FRAME model, as in ST-FRAME, each local volume \(I_{\varLambda _0}\) of textured motion follows a Markov random field model. The difference is that MA-FRAME extracts motion information via the distribution of velocities \(v\).

In the sampling process, \(I_{\varLambda _{\mathrm{im},j}}\) and \(v_{\varLambda _{\mathrm{im},j}}\) are sampled simultaneously following the joint distribution in (15),

$$\begin{aligned} p({\mathbf {I}}_{\varLambda _{\mathrm{im},j}};{\mathbf {F}},\beta ) \propto \exp \left\{ -\langle \varvec{\beta },{\mathbf {H}}({\mathbf {I}}_{\varLambda _{\mathrm{im},j}},\mathbf {v}_{\varLambda _{\mathrm{im},j}})\rangle \right\} . \end{aligned}$$

In the experiments, we design an effective way of sampling from the above model. For each pixel, we build a 2D distribution matrix, whose two dimensions are velocities and intensities, respectively, to guide the sampling process. The sampling probability of each candidate (labeled by one velocity and one intensity) is obtained by combining a motion score and an appearance score and multiplying by a smoothness weight,

$$\begin{aligned}&\mathrm{SCORE}(v,I) \propto \exp \left\{ -\omega (v)(\langle \beta ^{(t)},\Vert H^{(t)}-H^{(t)\mathrm{obs}}\Vert ^2\rangle \right. \\&\quad \left. +\langle \beta ^{(s)},\Vert H^{(s)}-H^{(s)\mathrm{obs}}\Vert ^2\rangle )\right\} . \end{aligned}$$

The details of the sampling at one pixel are illustrated in Fig. 11. For each pixel \((x,y)\) of the current frame \(I_t\), we consider every possible velocity \(v_{i,j}=(v_{x,i},v_{y,j})\) within the range \(-v_\mathrm{max}\le v_{x,i},v_{y,j}\le v_\mathrm{max}\). Each velocity corresponds to a position \((x-v_{x,i},y-v_{y,j})\) in the previous frame \(I_{t-1}\). Under velocity \(v_{i,j}\), the perturbation range of \(I_{t-1}(x-v_{x,i},y-v_{y,j})\) yields the intensity candidates for \(I_t(x,y)\), which is a smaller interval than \([0,255]\) and thus reduces the computational cost. In the example shown in Fig. 11a, \(v_\mathrm{max}=1\) and the perturbed intensity range is \([I_{t-1}(x-v_{x,i},y-v_{y,j})-m, I_{t-1}(x-v_{x,i},y-v_{y,j})+m]\). Therefore, we have \(9\) velocity candidates and \(2m+1\) intensity candidates for each velocity; hence the size of the sampling matrix is \(9\times (2m+1)\) (Fig. 11b). With the motion constraints given by matching the velocity statistics, the velocity candidates receive motion scores. With the appearance constraints given by matching the filter response histograms, the intensity candidates receive appearance scores. By combining the two sets of scores, we obtain the preliminary sampling matrix shown in Fig. 11b.

Fig. 11

Sampling process of MA-FRAME. a For each pixel of current frame \(I_t(x,y)\), the sample candidates are perturbation intensities of its neighborhood pixels in previous frame \(I_{t-1}\) dominated by different velocities. b The velocity list and intensity perturbations construct two dimensions of the 2D distribution matrix, which is used for sampling \(I_t(x,y)\). Here, \(I(v_{i,j})\) is short for \(I_{t-1}(x-v_{x,i},y-v_{y,j})\) and \(i,j\) are indexes for different velocities in the neighborhood area

To keep the motion of each pixel as consistent as possible with its neighborhood, so that the macroscopic motion is sufficiently smooth, we apply a set of weights to the distribution matrix, in which the multiplier for the candidates of one velocity is calculated by

$$\begin{aligned} \omega (v_{x,i},v_{y,j})=\sum _{(x_k,y_k)\in \partial \varLambda _{(x,y)}}\Vert v_{x,i}-v_{x_k}\Vert ^2+\Vert v_{y,j}-v_{y_k}\Vert ^2. \end{aligned}$$

The weights favor the velocity candidates that are closer to the velocities of the neighbors. With the weights, the sampled velocities can be regarded as a “blurred” optical flow. The main difference is that it preserves the uncertainty of the dynamics in a textured motion rather than assigning a definite velocity to every local pixel.

After multiplying the preliminary matrix by the weights, we obtain the final sampling matrix. Although the main purpose of MA-FRAME is to sample the intensity of each pixel of a textured motion, the intensity sampling is highly related to the velocities, and the sampling process is actually based on the joint distribution of velocity and intensity.
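
A sketch of how the final sampling matrix of Fig. 11b could be assembled; the energy terms corresponding to the two histogram-mismatch brackets in the SCORE expression are passed in as functions, and all names and default values are illustrative.

```python
import numpy as np

def sampling_matrix(prev, x, y, motion_energy, appearance_energy,
                    neighbor_velocities, v_max=1, m=3):
    """Build the (2*v_max+1)^2 x (2*m+1) sampling matrix for pixel (x, y).

    prev                : previous frame I_{t-1} as a 2D array.
    motion_energy(v)    : histogram-mismatch term for a velocity candidate.
    appearance_energy(g): histogram-mismatch term for an intensity candidate.
    neighbor_velocities : velocities already sampled around (x, y); used for
                          the smoothness weight omega described above.
    Returns (P, velocity_candidates, intensity_candidates), where P is the
    normalized joint sampling distribution over (velocity, intensity).
    """
    vels, intens, scores = [], [], []
    for vy in range(-v_max, v_max + 1):
        for vx in range(-v_max, v_max + 1):
            base = float(prev[y - vy, x - vx])                # I_{t-1}(x-vx, y-vy)
            omega = sum((vx - nx) ** 2 + (vy - ny) ** 2
                        for nx, ny in neighbor_velocities)    # smoothness weight
            cand = [min(max(base + d, 0), 255) for d in range(-m, m + 1)]
            row = [np.exp(-omega * (motion_energy((vx, vy)) +
                                    appearance_energy(g))) for g in cand]
            vels.append((vx, vy)); intens.append(cand); scores.append(row)
    P = np.array(scores)
    return P / P.sum(), vels, np.array(intens)
```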

In summary, textured motion synthesis by MA-FRAME is given as follows

[Algorithm: Textured motion synthesis by MA-FRAME]

Figures 12 and 13 show two examples of textured motion synthesis by MA-FRAME. Different from ST-FRAME, it can deal with videos of larger size, higher intensity levels (8 bits here, compared to 4 bits in the ST-FRAME experiments), and more frames because of its smaller sampling space and higher temporal continuity. Furthermore, it produces better representations of the motion patterns.

Fig. 12

Texture synthesis for 18 frames of ocean video (from top-left to bottom-right) by MA-FRAME

Fig. 13

Texture synthesis for 18 frames of bushes video (from top-left to bottom-right) by MA-FRAME

Figure 14 compares the velocity statistics of the original and synthesized videos for different textured motion clusters; brighter means higher motion probability and darker means lower probability. The two are quite consistent, which means that the original and synthesized videos have similar macroscopic motion properties.

Fig. 14

Five pairs of global statistics of velocities for comparison. Each patch corresponds to the neighborhood lattices as shown in Fig. 11a. The brighter means higher motion probability, while the darker means lower probability

We also test the local motion consistency between the observed and synthesized videos by comparing the velocity distributions of every pair of corresponding pixels. Figure 15 shows the comparisons for ten pairs of randomly chosen pixels. Most of them match well, which demonstrates that the motion distributions of most local patches are also well preserved during the synthesis procedure.

Fig. 15

Ten pairs of local statistics of velocities for comparison. Upper row: original; lower row: synthesis. Each patch corresponds to the neighborhood lattices as shown in Fig. 11a. The brighter means higher motion probability, while the darker means lower probability

3.6 Dealing with Occluded Parts in Texture Synthesis

Before giving the full computational algorithm for VPS, we first describe how to deal with occluded areas.

In video, dynamic background textures are often occluded by the movement of foreground objects. Synthesizing background texture by ST-FRAME uses histograms of spatio-temporal filter responses. When a textured region becomes occluded, the pattern no longer belongs to the same equivalence class. In this case, the spatio-temporal responses are not precise enough to match the given histograms and may cause deviations in the synthesis results. These errors may accumulate over frames until the synthesis degenerates completely. Synthesis by MA-FRAME has an even greater problem because the intensities of the current frame are selected from small perturbations of the intensities of the previous frame. If no pixel in the neighborhood in the previous frame belongs to the same texture class, the adopted intensity may be incompatible with the surrounding pixels.

To solve this problem, occluded pixels are sampled separately by the original (spatial) FRAME model, which means we maintain two classes of filter response histograms:

  1. Static filter response histograms \(H^S\). These histograms are calculated by pooling the static filter responses of all the textured pixels;

  2. Spatio-temporal filter response histograms \(H^{ST}\). These histograms are calculated by pooling the spatio-temporal filter responses of all the non-occluded textured pixels.

Therefore, in the sampling process, the occluded and non-occluded pixels are treated differently. First, their statistics are constrained by different sets of filters; second, in MA-FRAME, the intensities of non-occluded pixels are sampled from the intensity perturbations of their neighborhood locations in the previous frame, while the intensities of occluded pixels are sampled from the whole intensity space, say 0–255 for 8-bit gray levels.
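
The two histogram classes above can be maintained as in this sketch; the masks, bin settings, and boundary handling are assumptions.

```python
import numpy as np
from scipy.ndimage import correlate

def occlusion_aware_histograms(video, texture_mask, occluded_mask,
                               static_filters, st_filters, bins=15, rng=(-4.0, 4.0)):
    """Compute the two histogram classes used during sampling.

    H_S  : static-filter histograms over ALL textured pixels (occluded
           included), constraining occluded pixels via the spatial FRAME model.
    H_ST : spatio-temporal-filter histograms over the NON-occluded textured
           pixels only, constraining the remaining pixels.
    All masks are boolean arrays with the video's shape.
    """
    def hist(F, mask):
        resp = correlate(video, F, mode='nearest')
        h, _ = np.histogram(resp[mask], bins=bins, range=rng)
        return h / max(mask.sum(), 1)

    H_S = [hist(F, texture_mask) for F in static_filters]
    H_ST = [hist(F, texture_mask & ~occluded_mask) for F in st_filters]
    return H_S, H_ST
```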

3.7 Synthesizing Videos with VPS

In summary, the full version of the computational algorithm for video synthesis of VPS is presented as follows.

[Algorithm: Video synthesis by VPS]

Figure 2 illustrates this process, as introduced in Sect. 1. Figure 16 shows three examples of frame-by-frame video synthesis (YCbCr color space, 8 bits per gray level) by VPS. In every experiment, the observed frames, trackability maps, and final synthesized frames are shown. In Table 3, H.264 is used as the reference for the compression ratio of VPS, which shows that VPS is competitive with a state-of-the-art video encoder on video compression.

Fig. 16

Video synthesis. For each experiment, Row 1 original frames; Row 2 trackability maps; Row 3 synthesized frames

To assess the quality of the synthesized results quantitatively, we adopt two criteria for the two representations, rather than the traditional approach based on error sensitivity, which has a number of limitations [37]. The error of the explicit representation is measured by the difference of pixel intensities

$$\begin{aligned} \mathrm{err}^\mathrm{ex}=\frac{1}{|\varLambda _\mathrm{ex}|}\sum _{\varLambda _\mathrm{ex}}\left\| I^\mathrm{syn}-I^\mathrm{obs}\right\| , \end{aligned}$$
(29)

while for implicit representations, the error is given by the difference of filter response histograms,

$$\begin{aligned} \mathrm{err}^\mathrm{im}=\frac{1}{n_\mathrm{im}\times K}\sum _{n_\mathrm{im},K}\left\| H_k\left( I_{\varLambda _{\mathrm{im},j}}^\mathrm{syn}\right) -H_k\left( I_{\varLambda _{\mathrm{im},j}}^\mathrm{obs}\right) \right\| . \end{aligned}$$
(30)
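Both measures can be computed directly from the synthesized and observed frames. The sketch below (Python with NumPy/SciPy) is a minimal illustration under several assumptions: the explicit mask and the per-region implicit masks are given, purely spatial kernels stand in for the spatio-temporal filter bank, and the L1 distance stands in for the norm in Eqs. (29) and (30).

```python
import numpy as np
from scipy.ndimage import convolve


def explicit_error(I_syn, I_obs, explicit_mask):
    """Eq. (29): mean absolute intensity difference over the explicit pixels."""
    diff = np.abs(I_syn.astype(float) - I_obs.astype(float))
    return diff[explicit_mask].mean()


def implicit_error(I_syn, I_obs, implicit_regions, filters, bins=16):
    """Eq. (30): average L1 distance between filter-response histograms over
    the implicit regions and the K chosen filters.

    `implicit_regions` is a list of boolean masks (one per homogeneous texture
    area) and `filters` a list of K 2-D kernels; spatial kernels stand in here
    for the spatio-temporal filter bank of the model."""
    errs = []
    for mask in implicit_regions:
        for f in filters:
            r_syn = convolve(I_syn.astype(float), f)[mask]
            r_obs = convolve(I_obs.astype(float), f)[mask]
            lo = min(r_syn.min(), r_obs.min())
            hi = max(r_syn.max(), r_obs.max())
            h_syn, _ = np.histogram(r_syn, bins=bins, range=(lo, hi), density=True)
            h_obs, _ = np.histogram(r_obs, bins=bins, range=(lo, hi), density=True)
            errs.append(np.abs(h_syn - h_obs).sum())
    return float(np.mean(errs))
```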

Table 4 shows the quality assessment of the syntheses, demonstrating the good performance of VPS in video synthesis.

3.8 Computational Complexity Analysis

In this subsection, we analyze the computational complexity of the algorithms studied in this paper. We discuss four algorithms in the following. The implementation environment is a desktop computer with an Intel Core i7 2.9 GHz CPU, 16 GB of memory, and the Windows 7 operating system.

Table 3 Compression ratio of video synthesis by VPS and H.264 to raw image sequence
Table 4 Error assessment of synthesized videos

1) Video modeling by VPS. Suppose one frame of a video contains \(N\) pixels, of which \(N_\mathrm{ex}\) belong to explicit regions and \(N_\mathrm{im}\) to implicit regions. Let the size of the filter dictionary be \(N_F\) and the filter size \(S_F\); then the computational complexity of calculating the filter responses is \(O(NN_FS_F)\). Extracting and learning the explicit bricks costs no more than \(O(N_\mathrm{ex}\sqrt{S_F})\). Calculating the response histograms of the \(K\) chosen filters within the implicit regions costs no more than \(O(N_\mathrm{im}Kk)\) if the regions contain \(k\) homogeneous texture areas. In total, the computational complexity of video coding is no more than \(O(NN_FS_F+N_\mathrm{ex}\sqrt{S_F}+N_\mathrm{im}Kk)\). In our experiments, coding one frame of size \(288\times 352\) takes less than 0.5 s.

2) Reconstruction of explicit regions. Because all the basis information for the explicit regions is recorded and no additional computation is needed for reconstruction, the computational complexity can be regarded as \(O(1)\); the reconstruction takes negligible time compared with the other components.

3) Synthesis of implicit regions by Gibbs sampling with ST-FRAME. In one sampling round, each of the \(N_\mathrm{im}\) pixels is sampled over the full range of intensity levels, say \(L\). For every candidate, i.e., one intensity value, the score is computed from the change in the synthesized filter response histograms. To reduce the computational burden, we only update the change in filter responses caused by changing the intensity of the current pixel, which requires \(KS_F\) multiplications (a schematic sketch of this incremental update is given below). As a result, the computational complexity of one sampling round for one frame is \(O(N_\mathrm{im}LKS_F)\). In the experiments of this paper, each frame is sampled for about 20 rounds; the running time is then about 2 min for a 4-bit image with an implicit region of 10,000 pixels.

4) Synthesis of implicit regions by Gibbs sampling with MA-FRAME. The computational complexity of MA-FRAME is quite similar to that of ST-FRAME; the main difference is the number of sampling candidates. With \(N_v\) velocity candidates and an intensity perturbation range of \([-m,m]\), the computational complexity is \(O(N_\mathrm{im}N_vmKS_F)\), which is of the same order as ST-FRAME. In practice, however, because the intensities in the neighborhood of a pixel are close to each other, the candidates arising from different velocities are largely redundant, so MA-FRAME can save considerable time compared with ST-FRAME, especially when the number of intensity levels is high. For an 8-bit frame with \(60{,}000\) implicit pixels, the running time is about 4 minutes for 20 sampling rounds.
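As a concrete illustration of the incremental update used by both samplers, the sketch below (Python/NumPy) recomputes only the filter responses whose support covers the modified pixel, which is where the \(KS_F\) multiplications per candidate come from. The filter bank, response maps, and correlation convention are assumptions made for the sketch, not the authors' implementation.

```python
import numpy as np


def update_responses(image, responses, filters, pos, new_value):
    """Incrementally update the K filter-response maps when the pixel at `pos`
    changes to `new_value` (schematic Gibbs step for ST-FRAME / MA-FRAME).

    For each kernel, only the responses whose support covers `pos` change, and
    each changed response needs one multiply-add, so scoring a candidate costs
    roughly K * S_F multiplications instead of refiltering the whole frame."""
    y, x = pos
    delta = float(new_value) - float(image[y, x])
    updated = []
    for f, resp in zip(filters, responses):              # K kernels
        r = resp.copy()
        h, w = f.shape                                    # S_F = h * w taps
        cy, cx = h // 2, w // 2
        for uy in range(h):                               # one update per tap
            for ux in range(w):
                py, px = y - uy + cy, x - ux + cx         # response touched by tap (uy, ux)
                if 0 <= py < r.shape[0] and 0 <= px < r.shape[1]:
                    r[py, px] += delta * f[uy, ux]
        updated.append(r)
    return updated
```

For MA-FRAME, the same update is scored over the candidate set built from the neighborhood intensities and their perturbations; since those candidates are often largely redundant, deduplicating them before scoring is what yields the time savings noted above.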

In summary, the computational cost of video modeling/coding by VPS is small, whereas that of video synthesis is large. This is due to the texture synthesis procedure: in VPS, textures are modeled by MRFs and synthesized via a Gibbs sampling process, which is well known to be computationally expensive. However, video synthesis is only one application of VPS and is used mainly to verify the correctness of the model, so its cost is not our primary concern here.

3.9 Perceptual Study

The error assessment of VPS is consistent with human perception. To support this claim, we present in this subsection a series of human perception experiments and explore how perceptual accuracy relates to the synthesis quality measured above. The 30 participants include graduate students and researchers from mathematics, computer science, and medical science; their ages range from 22 to 39, and all have normal or corrected-to-normal vision.

In the first experiment, we randomly crop video clips of different sizes from the four synthesized textured-motion examples and their corresponding original videos (one frame of each video is shown on the left side of Figs. 17, 18, and 19, marked (a), (b), (c), and (d), respectively; the clips have different sizes but are displayed at the same size after zooming for better viewing). For the original and the synthesized examples respectively, each participant is shown 40 clips one by one (10 clips from each texture) and is asked to guess which texture each clip comes from. We show three representative groups of results, in which the cropped examples have sizes \(5\times 5\), \(10\times 10\), and \(20\times 20\), respectively. The confusion rates (%) for the original and synthesized examples are given in the tables on the right side of Figs. 17, 18, and 19. Each row gives the average rates at which the video clip labeled by the row title is judged to come from the textures labeled by the column titles. To test whether the syntheses are perceived the same as the original videos, we compare the original and synthesis confusion tables within each group; the tables are mostly consistent. For a more precise quantitative comparison, we also analyze the recognition accuracies by ANOVA in Table 5, where each row gives the \(F\) and \(p\) values for one texture across the three groups. The results show that the recognition accuracies on original and synthesized textures do not differ significantly.
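The significance test reported in Table 5 is a standard one-way ANOVA over per-participant recognition accuracies. A minimal sketch using SciPy is shown below; the accuracy arrays are random placeholders standing in for the measured data, not the values behind Table 5.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Placeholder per-participant recognition accuracies (30 participants each)
# for one texture; these random numbers only stand in for the measured data
# summarized in Table 5.
acc_original = rng.uniform(0.6, 0.9, size=30)
acc_synthesis = rng.uniform(0.6, 0.9, size=30)

F, p = f_oneway(acc_original, acc_synthesis)
print(f"F = {F:.3f}, p = {p:.3f}")   # a large p suggests no significant difference
```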

Fig. 17
figure 17

Perceptual confusion test on original and synthesized textured motions, respectively. The size of cropped examples is \(5\times 5\)

Fig. 18
figure 18

Perceptual confusion test on original and synthesized textured motions, respectively. The size of cropped examples is \(10\times 10\)

Fig. 19
figure 19

Perceptual confusion test on original and synthesized textured motions, respectively. The size of cropped examples is \(20\times 20\)

Table 5 The ANOVA results of analyzing recognition accuracies of original and synthesized textures

We also note that textures (a) and (b) look similar, and (c) and (d) tend to be confused with each other; therefore, the confusion rates between (a) and (b), and between (c) and (d), are noticeably larger. However, from Figs. 17 to 19, as the size of the cropped clips increases, the confusion rate decreases; once the size exceeds \(25\times 25\) in this experiment, the accuracies get very close to 100 %. This experiment demonstrates that dynamic textures synthesized from the statistics of dynamic filters are well discriminated by human vision, even though the synthesized and original videos are completely different at the pixel level. It is therefore evident that the approximation of filter response histograms reflects the quality of video synthesis. Furthermore, the results indicate that larger texture regions give a much better perceptual match, because humans can extract more macroscopic statistical information and motion-appearance characteristics, whereas small local patches provide only salient structural information that may be shared by many different videos.

In the second experiment, we test whether the video synthesized by VPS gives a visual impression similar to the original video. Each time, we show one participant the original and the synthesized video at the same scale; the videos are played synchronously, and the participant is asked to point out the original video within 5 seconds. Each pair of videos is tested at four scales: 100, 75, 50, and 25 %. The accuracies are shown in Table 6. When the videos are shown at larger scales, it is easier to discriminate the original from the synthesized video, because many structural details can be noticed by the observers. As the scale decreases, macroscopic information dominates the visual impression, the original and synthesized videos are perceived as nearly the same, and the accuracy drops toward 50 %. This experiment shows that although VPS cannot reconstruct a video exactly at the pixel level, especially for dynamic textures, the synthesis gives humans a similar visual impression, meaning that most of the key information for perception is preserved by the VPS model.

Table 6 The accuracy of differentiating the original video from the synthesized one in different scales

3.10 VPS Adapting Over Scales, Densities and Dynamics

As observed in [17], the optimal visual representation of a region is affected by distance, density, and dynamics. In Fig. 20, we show four video clips from a long video sequence. As the scale changes from high to low over time, the birds in the videos are perceived as lines of boundaries, groups of kernels, dense points, and dynamic textures, respectively. We show the VPS of each clip and demonstrate that the model chooses the proper representations. Figure 21 shows the types of primitives chosen for the explicit representations, in which circles denote blob-like primitives and short lines denote edge-like primitives. Table 7 compares the numbers of blob-like and edge-like primitives at each scale, counted within the first 50, 100, 150, and 200 chosen primitives, respectively. The percentage of edge-like primitives chosen at the large scale is clearly much higher than at the small scale. Moreover, at the large scale the blob-like primitives start to appear very late, which shows that edge-like primitives are far more important for representing videos at this scale. At the small scale, by contrast, blob-like primitives take a large share from the very beginning, and the number of edge-like primitives grows faster and faster as more primitives are chosen. This demonstrates that blob-like structures are much more prominent at the small scale. This experiment shows that VPS chooses proper representations automatically and, furthermore, that the representation pattern reflects the scale of the video.

Fig. 20
figure 20

Representation switches triggered by scale. Row 1 observed frames; Row 2 trackability maps; Row 3 synthesized frames

Fig. 21
figure 21

Representation types in different scale video frames, where circles represent blob-like type and short lines represent edge-like type

Table 7 The comparisons between the number of blob-like and edge-like primitives in 3 scales

3.11 VPS Supporting Action Representation

VPS is also compatible with high-level action representation: by grouping meaningful explicit parts in a principled way, it represents an action template. Figure 22b shows the action template obtained by the deformable action template model [40] for the video shown in (a); the action template is essentially the set of sketches from the explicit regions. (c) shows an action synthesis using only filters from a matching pursuit process. In (d), following the VPS model, the action parts and a few sketchable background elements are reconstructed by the explicit representation, while the large water region is synthesized by the implicit representation, yielding a synthesis of the whole video. Here, the explicit regions correspond to meaningful “template” parts, while the implicit regions are auxiliary background parts.

To show the relationship between the VPS representation and effective high-level features, we take a KTH video [31] as an example. Figures 23 and 24 show the spatial and temporal features of the explicit regions, respectively. In Fig. 23, we compare the VPS spatial descriptor with the well-known HOG feature [11], which is widely used for object representation. (b) is the HOG descriptor for the human in the video frame (a). (c) shows the structural features extracted by VPS, where circles and short edges represent 53 local descriptors. Compared with HOG in (b), VPS makes a local decision in each area based on the statistics of filter responses and therefore provides a shorter coding length. It also gives a more precise description; e.g., the head is represented by a circle descriptor, which carries more information than a pure filter response histogram as in HOG. Finally, (d) gives a synthesis with the corresponding filters, which delineates the human boundary precisely.
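For reference, the HOG baseline in Fig. 23b can be reproduced with a standard implementation such as scikit-image; the sketch below is illustrative only, and the file path and cell/block parameters are placeholders rather than the settings used for the figure.

```python
from skimage import io, color
from skimage.feature import hog

# "kth_frame.png" is a placeholder path for one extracted video frame.
frame = io.imread("kth_frame.png")
if frame.ndim == 3:                      # convert RGB frames to grayscale
    frame = color.rgb2gray(frame)

# Dense HOG descriptor over the frame; these cell/block sizes are
# illustrative defaults, not necessarily the settings behind Fig. 23b.
features, hog_image = hog(frame,
                          orientations=9,
                          pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2),
                          visualize=True)
print(features.shape)                    # flattened block-normalized histograms
```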

Fig. 22
figure 22

Action representation by VPS. a The input video. b Action template obtained by the deformable action template [40]. c Action synthesis by filters. d Video synthesis by VPS

Fig. 23
figure 23

Structural information extracted by HOG and VPS. a The input video frame. b HOG descriptor. c VPS feature. d Boundary synthesis by filters

Fig. 24
figure 24

Motion statistics by VPS. a and b two continuous video frames of waving hands. c Trackability map. d Clustered motion style regions. e Corresponding motion statistics of each region

In Fig. 24, we show the motion information between two consecutive frames (a) and (b) extracted by MA-FRAME in VPS. (d) gives the clustered motion styles in the current video, and the motion statistics of the five styles are shown in (e). Region 1 corresponds to the head, which is almost still during the waving motion, while region 5 covers the two arms, which show a definite moving direction. Region 3 corresponds to the legs, which form an oriented, trackable area. Regions 2 and 4 are relatively ambiguous in motion direction and essentially correspond to background textures in the video. Given the trackability map in (c) based on these motion styles, the motion template pops out.

In summary, the information extracted by VPS is compatible with high-level object and motion representations. In particular, it is closely related to the HOG and HOOF descriptors, which have proven to be effective spatial and temporal features, respectively. The main difference is that VPS makes local decisions, giving a more compact expression that is better suited to visualization. Therefore, VPS not only gives a middle-level representation of video, but also has strong connections with low-level vision features and high-level vision templates.

4 Discussion and Conclusion

In this paper, we present a novel video primal sketch model as a middle-level generic representation of video. It is generative and parsimonious, integrating a sparse coding model for explicitly representing sketchable and trackable regions and extending the FRAME models for implicitly representing textured motions. It is a video extension of the primal sketch model [18]. It can choose appropriate models automatically for video representation.

Based on the model, we design an effective algorithm for video synthesis, in which explicit regions are reconstructed from learned video primitives and implicit regions are synthesized through a Gibbs sampling procedure driven by spatio-temporal statistics. Our experiments show that VPS is capable of video modeling and representation, achieving a high compression ratio and good synthesis quality. Furthermore, it learns explicit and implicit expressions of meaningful low-level vision features and is compatible with high-level structural and motion representations; it therefore provides a unified video representation for low-, middle-, and high-level vision tasks.

In ongoing work, we will strengthen this work in several directions, especially the connections with low-level and high-level vision tasks. On the low-level side, we are learning a richer and more comprehensive dictionary \(\varDelta _B\) of video primitives. On the high-level side, we are applying VPS features to object and action representation and recognition.