1 Introduction

Spatial local keypoints and appropriate local descriptors have been extensively studied in image processing and computer vision, within a variety of multi-scale models [28, 29]. They have been applied to image matching and to the higher-level problem of image classification, often in conjunction with appropriately designed global descriptors.

In the past decade, the video processing scenario has been characterized by a growing interest in so-called space–time interest points, which incorporate appearance as well as dynamic local information. From the pioneering work of Laptev [27], who proposed a reformulation of Harris corners [15] for space–time, soon followed by alternative and possibly richer approaches [6, 16, 31, 35, 36], we have come to appreciate the power of these keypoints as low-level building blocks for motion analysis and action recognition. An exhaustive overview of the related state of the art can be found in [4]; see also [37].

Space–time interest points are usually understood as points characterized by some special behavior both in space and in time (e.g., nonsmooth in both directions). Thus, the classical computational framework starts with a keypoint detection stage, often in conjunction with an appropriate keypoint descriptor [17, 26, 32, 33]. Generally, keypoints are detected by looking for singularities both in space and in time [27]. In this paper, we argue that the space–time domain offers a richer set of information to be exploited: different interesting local primitives can be observed and associated with an appropriate meaning in space and time. These primitives also include interesting spatial structures (spatial corners or edges) moving smoothly, or smooth surfaces undergoing significant velocity changes.

The mathematical framework we consider is that of shearlets [25]. Among multiresolution image representations, shearlets stand out for their ability to efficiently capture anisotropic features [18], to provide an optimally sparse representation [11, 21], to detect singularities [14, 22] and to be stable against noise and blurring [2, 9]. For further details, implementations and references, see [20]. The effectiveness of shearlets is supported by a well-established mathematical theory [3] and has been demonstrated in many image processing applications through efficient algorithms [7, 8, 20]. Shearlets have seldom been applied to spatio-temporal data, with the exception of shearlet-based video denoising and inpainting [24]; see also [30], which compares shearlet-based performance on video enhancement and denoising tasks with previously existing techniques.

In this work, we exploit different properties of shearlets. In particular, we focus on the ability of shearlet coefficients to detect the wavefront set of a signal both in the 2D [18] and in the 3D setting [12, 23], by directly encoding meaningful directional information such as, for example, the normal direction at each point of a surface singularity. From the computational viewpoint, we adopt the 3D shearlets implemented in ShearLab (see http://www.shearlab.org/).

The contribution of the paper is twofold. On the theoretical side, we propose a toy mathematical model to describe some of the significant properties of the complex behavior of a real video sequence. We consider a rigid compact 2D region that, by moving in time, generates a 2D \(+~T\) volume V. The spatial–temporal points are now associated with the wavefront set of the 3D “cartoon-like” signal [23]

$$\begin{aligned} f(x,y,t)= {\left\{ \begin{array}{ll} 1 &{}\quad (x,y,t) \in V \\ 0 &{}\quad (x,y,t) \notin V. \end{array}\right. } \end{aligned}$$

We show that the corresponding shearlet coefficients provide a clear signature of different spatio-temporal primitives. Clearly, our model does not capture the full complexity of a real video sequence; for instance, it does not deal with occlusions. Nevertheless, it provides important insight into what happens in the real world by highlighting the kind of spatio-temporal primitive each space–time point belongs to.
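For concreteness, the binary 2D \(+~T\) volume V of such a toy model can be rasterized numerically. The following sketch (illustrative code of our own, with hypothetical sizes) builds the volume spanned by a translating square whose velocity changes abruptly halfway through the sequence:

```python
import numpy as np

def toy_volume(n=64, frames=32, half=16, side=20):
    """Binary 2D+T volume of a translating square whose velocity
    changes abruptly at frame `half`, creating a spatio-temporal edge."""
    V = np.zeros((n, n, frames), dtype=np.uint8)
    x0, y0 = 5, 5
    for t in range(frames):
        # piecewise-constant velocity: (1, 0) pixels/frame, then (0, 1)
        if t < half:
            x, y = x0 + t, y0
        else:
            x, y = x0 + half - 1, y0 + (t - half + 1)
        V[y:y + side, x:x + side, t] = 1
    return V

V = toy_volume()
```

The boundary of the region where V equals 1 is exactly the surface whose wavefront set the shearlet coefficients are meant to capture; the frame where the velocity changes carries a 3D edge.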

Motivated by our theoretical framework, we propose an algorithm to represent keypoints highlighting their appearance and dynamic properties. First, we consider the 3D shearlet transform of a video sequence. Then, we derive a shearlet-based rotation-invariant representation of each point with respect to its space–time neighborhood at a fixed scale. This representation describes the behavior of the signal in the neighborhood and helps us discriminate among different types of points. We discuss how this representation varies little across sets of known spatial and spatio-temporal keypoints such as edges, corners and space–time interest points [27]. We also show how to identify the main primitives in a video signal by adopting an unsupervised approach and clustering points to obtain the most significant space–time primitives within the signal.

The real video sequences we use to discuss our findings are taken from the Chalearn (che vuoi [10]) and the KTH (boxing, handwaving and walking [32]) datasets, while synthetic data have been generated in-house.

This paper is organized as follows. Section 2 reviews shearlets on 2D \(+~T\) signals. Section 3 introduces the concept of spatio-temporal primitives. In Sect. 4, we describe our approach to representing points in their space–time neighborhood and discuss the expressiveness of the representation on both synthetic and real data. Section 5 discusses the results we obtain when clustering points with respect to the proposed representation. Section 6 offers a concluding discussion.

2 The 3D Shearlet Frame

In this section, we briefly review the construction of the shearlet frame for 2D \(+~T\) signals. We follow the presentation in [19], which is a standard reference for proofs and further details.

Fig. 1

Three pyramids \({{\mathcal {P}}}_1, {{\mathcal {P}}}_2\) and \({{\mathcal {P}}}_3\); for each pyramid, the area belonging to the positive part of the corresponding symmetry axis is displayed in black and the one related to its negative part in red (Color figure online)

We first set the notation. We denote by \(L^2\) the Hilbert space of functions \(f\,{:}\,\mathbb R^3\rightarrow \mathbb C\) such that

$$\begin{aligned} \int _{\mathbb R^3} \left|{f(x,y,t)}\right|^2\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}t <+\infty , \end{aligned}$$

where \(\mathrm{d}x\mathrm{d}y\mathrm{d}t\) is the Lebesgue measure of \(\mathbb R^3\), by \(\left||{f}\right||\) the corresponding norm and by \(\left\langle {f},{f'}\right\rangle \) the scalar product between two functions \(f,f'\in L^2\). Given an element \(f\in L^2\), we denote by \(\widehat{f}\) its Fourier transform, i.e.,

$$\begin{aligned} \widehat{f}(\xi _1,\xi _2,\xi _3)=\int _{\mathbb R^3} f(x,y,t) \mathrm{e}^{-2\pi i (\xi _1 x + \xi _2 y +\xi _3 t )} \mathrm{d}x\,\mathrm{d}y\,\mathrm{d}t, \end{aligned}$$

provided that f is also integrable.

We recall that a frame for \(L^2\) is a family \(\{ \psi _i \}_{i \in I}\) of functions such that each \(\psi _i\) is in \(L^2\) and

$$\begin{aligned} A \left||{f}\right||^2\le \sum _{i\in I} \left|{\left\langle {f},{\psi _i}\right\rangle }\right|^2 \le B \left||{f}\right||^2 \quad \forall f\in L^2, \end{aligned}$$

where A, B are positive constants, called frame bounds. The shearlet frame \(\mathcal {F}_{\mathrm{SH}}\) is defined in terms of four different subfamilies labeled by the index \(\ell =0,\ldots ,3\), as follows.
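As a quick numerical illustration of the frame inequality above (a finite-dimensional analogue of our own, not part of the shearlet construction), one can check that the orthonormal DFT basis of \(\mathbb C^n\) is a tight frame with \(A=B=1\), by Parseval's identity:

```python
import numpy as np

def frame_energy(f, frame_vectors):
    """Sum of squared moduli of the analysis coefficients <f, psi_i>."""
    return sum(abs(np.vdot(psi, f)) ** 2 for psi in frame_vectors)

n = 16
# Orthonormal DFT basis of C^n: a tight frame with A = B = 1.
dft = [np.exp(-2j * np.pi * k * np.arange(n) / n) / np.sqrt(n)
       for k in range(n)]

rng = np.random.default_rng(0)
f = rng.standard_normal(n) + 1j * rng.standard_normal(n)
# Parseval: the coefficient energy equals ||f||^2.
assert np.isclose(frame_energy(f, dft), np.linalg.norm(f) ** 2)
```

For a genuine (non-tight) frame, such as the shearlet frame, the same quantity is only bounded between \(A\left||f\right||^2\) and \(B\left||f\right||^2\).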

The first family

$$\begin{aligned} \mathcal {F}_{\mathrm{SH},0}=\left\{ \varphi _m\mid m\in \mathbb Z^3\right\} , \end{aligned}$$

associated with the index \(\ell = 0\) takes care of the low-frequency cube

$$\begin{aligned} {{\mathcal {P}}}_0=\left\{ (\xi _1,\xi _2,\xi _3)\in \widehat{\mathbb R}^3\mid \left|{\xi _1}\right|\le 1, \left|{\xi _2}\right|\le 1, \left|{\xi _3}\right|\le 1\right\} \end{aligned}$$

and it is given by

$$\begin{aligned} \varphi _m(x,y,t)= \varphi (x- { c m}_1,y-{ cm}_2,t-{ cm}_3), \end{aligned}$$

where \(m=(m_1,m_2,m_3)\in \mathbb Z^3\) labels the translations, \(c>0\) is a step size, and

$$\begin{aligned} \varphi (x,y,t)= \phi _1(x)\phi _1(y)\phi _1(t), \end{aligned}$$

where \(\phi _1\) is a 1D-scaling function.

The other three families are associated with the high-frequency domain. Each of them corresponds to the pyramid whose symmetry axis is one of the Cartesian axes \(\xi _1,\xi _2, \xi _3\) in the Fourier domain; see Fig. 1. For example, for \(\ell =1\) the pyramid is

$$\begin{aligned} {{\mathcal {P}}}_1=\left\{ (\xi _1,\xi _2,\xi _3)\in \widehat{\mathbb R}^3\mid \left|{\xi _1}\right|>1,\left|{\frac{\xi _2}{\xi _1}}\right|\le 1, \left|{\frac{\xi _3}{\xi _1}}\right|\le 1\right\} , \end{aligned}$$

and similarly for the other two pyramids.
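The assignment of a frequency point to the low-frequency cube or to one of the three pyramids can be mimicked by a small helper (an illustrative sketch, with names of our own choosing):

```python
import numpy as np

def pyramid_index(xi):
    """Return 0 if xi lies in the low-frequency cube P0, else the index
    ell in {1, 2, 3} of the pyramid P_ell whose symmetry axis carries
    the dominant frequency component (ties on cone boundaries are
    resolved arbitrarily toward the lower index)."""
    xi = np.asarray(xi, dtype=float)
    if np.all(np.abs(xi) <= 1):
        return 0                            # inside P0
    return int(np.argmax(np.abs(xi))) + 1   # dominant-axis pyramid

assert pyramid_index((0.5, 0.2, -0.9)) == 0
assert pyramid_index((4.0, 1.0, -2.0)) == 1   # |xi2/xi1|<=1, |xi3/xi1|<=1
assert pyramid_index((0.5, -3.0, 1.0)) == 2
```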

Fig. 2

Coefficients analysis on a 3D surface (see text). a A section of a surface parallel to the yt plane. b A plot of the coefficients at different shearings; on the x axis are the indexes corresponding to all the shearings in \(\mathbf {K}_j\) for the pyramid \(\mathcal {P}_1\), where the central peak corresponds to the shearing vector \(k=(0,0)\). c The coefficients decay for neighboring points along the surface normal (red line) (Color figure online)

For fixed \(\ell =1,2,3\), each family

$$\begin{aligned} \mathcal {F}_{\mathrm{SH},\ell }=\left\{ \psi _{\ell ,j,k,m}\mid j\in {\mathbb {N}}, k\in \mathbf {K}_j, m\in \mathbb Z^3\right\} , \end{aligned}$$

where

$$\begin{aligned} \mathbf {K}_j =\left\{ k=(k_1,k_2)\in \mathbb Z^2, \max \{\,|k_1|,|k_2|\,\} \le \lceil 2^{j /2} \rceil \right\} , \end{aligned}$$
(1)

is defined in terms of parabolic dilations

$$\begin{aligned} A_{1, j}= & {} \begin{pmatrix} 2^{j} &{} \quad 0 &{} \quad 0\\ 0 &{} \quad 2^{j/2} &{} \quad 0 \\ 0 &{} \quad 0 &{} \quad 2^{j/2} \end{pmatrix}, \quad A_{2, j} = \begin{pmatrix} 2^{j/2} &{} \quad 0 &{} \quad 0\\ 0 &{} \quad 2^{j} &{} \quad 0\\ 0 &{} \quad 0 &{} \quad 2^{j/2} \end{pmatrix},\\ A_{3, j}= & {} \begin{pmatrix} 2^{j/2} &{} \quad 0 &{} \quad 0\\ 0 &{} \quad 2^{j/2} &{} \quad 0 \\ 0 &{} \quad 0 &{} \quad 2^{j} \end{pmatrix}, \end{aligned}$$

where the index j refers to the dyadic scale (note that \(j=0\) corresponds to the coarsest scale), and shearings

$$\begin{aligned} S_{1,k}= & {} \begin{pmatrix} 1 &{} \quad k_1 &{} \quad k_2 \\ 0 &{} \quad 1 &{} \quad 0 \\ 0 &{} \quad 0 &{} \quad 1 \end{pmatrix}, \quad S_{2,k} = \begin{pmatrix} 1 &{} \quad 0 &{} \quad 0\\ k_1 &{} \quad 1 &{} \quad k_2 \\ 0 &{} \quad 0 &{} \quad 1 \end{pmatrix},\\ S_{3,k}= & {} \begin{pmatrix} 1 &{} \quad 0 &{} \quad 0\\ 0 &{} \quad 1 &{} \quad 0 \\ k_1 &{} \quad k_2 &{} \quad 1 \end{pmatrix}, \end{aligned}$$

where the index \(k = (k_1,k_2)\) controls the shearing and runs over the set \(\mathbf {K}_j\) defined in (1). Explicitly, the functions \(\psi _{\ell ,j,k,m}\) are given by

$$\begin{aligned} \psi _{\ell ,j,k,m}(x,y,t) = 2^{j} \psi _{\ell }\left( S_{\ell ,k} A_{\ell ,j} \left( {\begin{matrix} x- c_1 m_1\\ y-c_2 m_2 \\ t-c_3 m_3 \end{matrix}}\right) \right) , \end{aligned}$$
(2)

where for \(\ell =1\), \(c_1=c\) and \(c_2=c_3=\widehat{c}\), with \(\widehat{c}\) another step size (for \(\ell =2,3\) the values of \(c_1,c_2,c_3\) are interchanged accordingly), and the parameter \(m=(m_1,m_2,m_3)\in \mathbb Z^3\) labels the translations, as for the family \({\mathcal {F}}_{\mathrm{SH},0}\). Following [24], the generating function \(\psi _{1}\) is of the form

$$\begin{aligned} {\widehat{\psi }_{1}}(\xi _1,\xi _2,\xi _3)= & {} \widehat{\psi }_1(\xi _1 )\left( P\left( \frac{\xi _1}{2},\xi _2\right) \widehat{\phi }_1(\xi _2) \right) \nonumber \\&\quad \times \,\left( P\left( \frac{\xi _1}{2},\xi _3\right) \widehat{\phi }_1(\xi _3) \right) , \end{aligned}$$
(3)

where P is a suitable 2D polynomial Fan filter [5] and \(\psi _1\) is the 1D wavelet function associated with the scaling function \(\phi _1\) defining the family \(\{{\varphi _m} \}\). Similar equations hold for \(\ell =2,3\) by interchanging the roles of \(\xi _1, \xi _2\) and \(\xi _3\). We observe that, to obtain a frame, it is necessary to assume some technical conditions on the smoothness of \(\phi _1\) and on the vanishing moments of \(\psi _1\), see [20].
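The parabolic dilations \(A_{\ell ,j}\) and shearings \(S_{\ell ,k}\) defined above are straightforward to construct explicitly; a minimal numpy sketch (function names are our own):

```python
import numpy as np

def dilation(ell, j):
    """Parabolic dilation A_{ell,j}: scale 2^j along the axis of
    pyramid ell (1, 2 or 3), 2^(j/2) along the remaining two axes."""
    d = np.full(3, 2.0 ** (j / 2))
    d[ell - 1] = 2.0 ** j
    return np.diag(d)

def shearing(ell, k):
    """Shearing S_{ell,k}: the entries k = (k1, k2) sit on the row of
    the pyramid axis, in the remaining two columns."""
    S = np.eye(3)
    cols = [c for c in range(3) if c != ell - 1]
    S[ell - 1, cols[0]] = k[0]
    S[ell - 1, cols[1]] = k[1]
    return S
```

For \(\ell =1\), these reproduce \(A_{1,j}={\mathrm{diag}}(2^{j},2^{j/2},2^{j/2})\) and the shearing matrix with \(k_1, k_2\) on the first row; shearing matrices always have determinant 1.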

The shearlet transform of a signal \(f\in L^2\) is given by

$$\begin{aligned} \mathrm{SH}[f](\ell ,j,k,m) = {\left\{ \begin{array}{ll} \left\langle {f},{\varphi _m}\right\rangle &{}\quad \text {if}\,\, \ell =0\\ \left\langle {f},{\psi _{\ell ,j,k,m}}\right\rangle &{}\quad \text {if}\,\, \ell =1,2,3, \end{array}\right. } \end{aligned}$$

where \(j\in \mathbb N,\ k\in \mathbf {K}_j,\ m\in \mathbb Z^3\). We stress that, as shown in (1), the number of shearings in \(\mathbf {K}_j\) depends on j. In the experiments, we use the digital implementation described in [24], which is based on the well-known relation between the pair \((\phi _1,\psi _1)\) and the quadrature mirror filter pair (h, g), i.e.,

$$\begin{aligned} \phi _1(x)= & {} \sqrt{2} \sum _{n\in \mathbb Z} h(n) \phi _1(2x-n) \end{aligned}$$
(4)
$$\begin{aligned} \psi {_1}(x)= & {} \sqrt{2} \sum _{n\in \mathbb Z} g(n) \phi _1(2x-n). \end{aligned}$$
(5)

where h is a 1D low-pass filter and g is the corresponding high-pass filter.
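The scaling relation (4) can be iterated numerically via the classical cascade algorithm to approximate \(\phi _1\) from the filter h alone. This is an illustrative sketch of our own (not the ShearLab implementation, whose filters may differ), using the Haar filter \(h=(1/\sqrt{2},1/\sqrt{2})\) as an assumed example:

```python
import numpy as np

def cascade(h, iterations=8):
    """Approximate the scaling function phi_1 on a dyadic grid by
    iterating the two-scale relation
    phi(x) = sqrt(2) * sum_n h(n) phi(2x - n)."""
    h = np.asarray(h, dtype=float)
    phi = np.ones(1)
    for _ in range(iterations):
        up = np.zeros(2 * len(phi))
        up[::2] = phi                       # upsample phi by 2
        phi = np.sqrt(2) * np.convolve(up, h)
    return phi

# For the Haar filter, the cascade converges to the indicator of [0, 1).
phi = cascade([1 / np.sqrt(2), 1 / np.sqrt(2)], iterations=6)
```

After k iterations, the array samples the current approximant on a grid of spacing \(2^{-k}\); for the Haar filter the first \(2^k\) samples equal 1 and the rest vanish, as expected for the box scaling function.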

Furthermore, a maximum number J of scales is considered, and it is assumed that the signal f at the finest scale is given by

$$\begin{aligned} f(x,y,t)= & {} \sum _{m\in Z^3} f_{J,m}\, 2^{3J/2} \phi _1(2^J x-{ c m}_1)\\&\quad \times \phi _1(2^J y-{ c m}_2) \phi _1(2^J t-{ c m}_3), \end{aligned}$$

so that \(f_{J,m}\simeq f({ cm}_12^{-J}, { cm}_22^{-J}, { cm}_32^{-J})\) since \(\phi _1\) is well localized around the origin. The digital shearlet transform depends on the number of scales \(J+1\), the directional Fan filter P in (3) and the low-pass filter h associated with the scaling function \(\phi _1\) by (4).

Fig. 3

Coefficients analysis on a 3D edge (see text). a, d A section of the edge parallel to xy plane, where we highlight the two normal vectors. b, e The coefficients varying at different shearings in the two meaningful pyramids \(\mathcal {P}_1\) and \(\mathcal {P}_3\) where both central peaks correspond to the shearing vector \(k=(0,0)\). c, f The decay of coefficients for neighboring points along the corresponding normal directions

Our algorithm is based on the following useful property of the shearlet coefficients. As shown in [12, 13, 23], if the signal f is locally regular in a neighborhood of m, then \(\mathrm{SH}[f](\ell ,j,k,m)\) decays fast as j goes to infinity, for any \(\ell \ne 0\) and \(k\in \mathbf {K}_j\). Suppose now that f has a surface singularity at cm with normal vector \((1,n_1,n_2)\in {\mathcal {P}}_1\), and set \(k^* =(\lceil 2^{j /2} n_1\rceil , \lceil 2^{j /2} n_2\rceil )\). If \(\ell =2,3\), then \(\mathrm{SH}[f](\ell ,j,k,m)\) decays fast for any \(k\in \mathbf {K}_j\); if \(\ell =1\), the same good behavior holds only for \(k\ne k^*\), whereas for \(k=k^*\) the shearlet coefficients have a slow decay (a similar result holds if the normal direction of the surface singularity belongs to one of the other two pyramids). This behavior of the shearlet coefficients allows us to associate with any shearing vector \(k=(k_1,k_2)\) a direction (without orientation) parametrized by two angles, latitude and longitude, \(\alpha \) and \(\beta \). Thus, the direction associated with k is given by

$$\begin{aligned} (\cos \alpha \cos \beta ,\cos \alpha \sin \beta ,\sin \alpha )\quad \alpha ,\beta \in \left[ -\frac{\pi }{2},\frac{\pi }{2}\right] . \end{aligned}$$
(6)

The correspondence explicitly depends on \(\ell \) and, for the first pyramid, it is given by

$$\begin{aligned} \tan \alpha = \frac{2^{-j /2} k_2}{\sqrt{1+ 2^{-j}k_1^2}} \quad \tan \beta = 2^{-j /2} k_1 \quad \alpha ,\beta \in \left[ -\frac{\pi }{4},\frac{\pi }{4}\right] . \end{aligned}$$

The above formula shows that the ability to resolve different directions strongly depends on the number of available shearings in \(\mathbf {K}_j\). In particular, at the coarsest scales we detect the normal direction of singularity surfaces only at low resolution.
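The correspondence between shearings and directions for the first pyramid can be implemented directly; a minimal sketch of our own:

```python
import numpy as np

def shearing_to_direction(j, k1, k2):
    """Direction (without orientation) associated with the shearing
    (k1, k2) at scale j for the first pyramid P1, via the latitude and
    longitude angles alpha and beta."""
    beta = np.arctan(2.0 ** (-j / 2) * k1)
    alpha = np.arctan(2.0 ** (-j / 2) * k2
                      / np.sqrt(1.0 + 2.0 ** (-j) * k1 ** 2))
    return np.array([np.cos(alpha) * np.cos(beta),
                     np.cos(alpha) * np.sin(beta),
                     np.sin(alpha)])

# k = (0, 0) points along the axis of the pyramid (the xi_1 direction).
d0 = shearing_to_direction(2, 0, 0)
```

One can check that, for \(\mathcal {P}_1\), the resulting unit vector is the normalization of \((1,\, 2^{-j/2}k_1,\, 2^{-j/2}k_2)\), i.e., the normal direction the shearing is tuned to.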

Through a simple example, we illustrate the above behavior. We consider a black cube, and we fix a point on a side of the cube parallel to the yt plane. We compute the shearlet coefficients moving along the normal direction outside the cube. The behavior is shown in Fig. 2, where in the first column we show an xt-section of the cube at a given y. In the second column, we plot the value of the shearlet coefficients at the point on the surface in the first pyramid \(\mathcal {P}_1\). We show the coefficients associated with the grid of directions represented by the shearings in \(\mathbf {K}_j\), unrolling them along the x axis. In this example, \(k_1,k_2 \in \{-2,-1,0,1,2\}\), and the value 12 in Fig. 2b corresponds to \(k_1=k_2=0\), as expected. The coefficients of the other pyramids contain negligible values \({\sim }\,10^{-16}\). In the third column, we fix the shearing corresponding to the peak and observe how the coefficients evolve by moving along the normal direction, corresponding to the red line in Fig. 2a. The coefficients decay as we move away from the discontinuity, giving us empirical evidence of the appropriateness of 3D shearlets in localizing interest points.

Figure 3 shows a similar analysis on a 3D edge produced by two surfaces, one parallel to the xt plane and the other parallel to the yt plane. In this case, we identify two significant peaks in two different pyramids (the main peaks in Fig. 3b, e).

Within every pyramid (\(\mathcal {P}_1\) for Fig. 3a–c and \(\mathcal {P}_3\) for Fig. 3d–f), we see a behavior similar to the case of the 3D surface (Fig. 2b). However, the secondary peaks have higher values, since the spatio-temporal neighborhood around the point has a richer behavior. These peaks are also a consequence of visualizing two-dimensional information (the shearlet coefficients associated with a 2D grid of directions) as a 1D function; shearings that are close on the 2D grid may thus appear distant on the 1D unrolled function.

The plots shown in this section have been obtained thanks to the a priori information we have on the normal direction, which is in general not available in real data. This issue will be addressed in the following sections, where we identify a representation procedure applicable in the general case.

Fig. 4

Spatio-temporal primitives that can occur in the space–time domain, obtained by considering how the image in the background of each panel moves over time: a a 2D edge moving smoothly spawns a spatio-temporal surface, b a 2D edge undergoing a velocity change produces a 3D edge, c a 2D corner moving smoothly also produces a 3D edge, d a 2D corner undergoing a velocity change produces a 3D vertex

3 Spatio-Temporal Primitives

Clearly, a video is a temporal sequence of 2D spatial images and it can be regarded as a 2D \(+~T\) signal that fits the above theoretical framework.

In this context, 2D spatial discontinuities in an image, such as edges and corners, generate different space–time behaviors as the image evolves in time. Moreover, the temporal evolution of a given point in the image is continuous, but may lose regularity in correspondence with velocity changes. Therefore, if we analyze the behavior of the signal in space–time, we may observe different types of primitives (see also Fig. 4):

  • Spatio-temporal surfaces, caused by 2D edges moving with a smooth velocity and thus spanning surfaces in space–time.

  • Spatio-temporal edges, caused either by 2D corners moving smoothly or by 2D edges undergoing a velocity change. These two primitives can be discriminated by detecting the orientation of the 3D edge, see Fig. 4b, c.

  • Spatio-temporal corners or vertices caused by 2D corners undergoing a velocity change.

These spatio-temporal primitives are easily associated with classical 3D features (surfaces, edges and vertices) and can be analyzed by adapting 3D signal representation models. It should be observed, though, that 2D \(+~T\) features have a very specific nature that characterizes them beyond their three-dimensional structure. For instance, we could further cluster these primitives into still and moving entities (corresponding to different orientations in the 2D \(+~T\) space). Also, the third component (time) has a different intrinsic scale and very precise constraints, since spatial features do not disappear all of a sudden and time can only proceed forward. In the remainder of the paper, we refer to 2D edges when considering image discontinuities and to 3D or spatio-temporal edges when discussing the behavior in space–time. As for corners, we refer to 2D corners in space and to vertices or 3D corners in space–time.

We now observe that, thanks to the sensitivity of shearlets to singularities and their orientation, we may identify different spatio-temporal primitives. To better understand the relationship between coefficients and primitives, we start by considering a toy model for a space region evolving over time. We assume that the region of interest is a rigid planar body \({\mathcal {C}}\) moving in the time interval [0, T]. We further assume that the boundary of \({\mathcal {C}}\) can be parametrized at the initial time \(t=0\) by the simple closed curve

$$\begin{aligned} \gamma (s)=x(s) {\mathbf {i}}+ y(s) {\mathbf {j}}\quad s\in [0,{L}], \end{aligned}$$

where L is the length of the boundary, s is the arc length, and the curve is oriented so that the interior of the body is on the left side, see Fig. 5. We denote by \({\mathbf {i}}\) and \({\mathbf {j}}\) the canonical unit vectors of the x-axis and y-axis, respectively. Since the body is rigid, the time evolution of each point \(\gamma (s)\) is given by

$$\begin{aligned} \gamma (s,t)= r(t) + R(t) (\gamma (s)-r(0)) = x(s,t) \,{\mathbf {i}}+ y(s,t) \,{\mathbf {j}}, \end{aligned}$$

where r(t) is the time evolution of the center of mass of the body and R(t) is the time-dependent rotation around the center of mass. The evolution of the body in time describes a 3D volume whose boundary is the surface parametrized by

$$\begin{aligned} \sigma (s,t) = x(s,t) \,{\mathbf {i}}+ y(s,t) \,{\mathbf {j}} +t \,{\mathbf {k}} \quad s\in [0,{L}], t\in [0,T], \end{aligned}$$

where \({\mathbf {k}}\) is the canonical unit vector of the t-axis.

We now compute the normal vector to the surface at the spatial–temporal point \(\sigma (s,t)\):

$$\begin{aligned} N(s,t)= & {} \frac{\partial \sigma }{\partial s}(s,t) \times \frac{\partial \sigma }{\partial t}(s,t) = {\text {det}} \begin{bmatrix} {\mathbf {i}}&{\mathbf {j}}&{\mathbf {k}} \\ \frac{\partial x}{\partial s}(s,t)&\frac{\partial y}{\partial s} (s,t)&0 \\ \frac{\partial x}{\partial t} (s,t)&\frac{\partial y}{\partial t}(s,t)&1 \end{bmatrix}\\= & {} n(s,t) + \tau (s,t) \times v(s,t) \end{aligned}$$

where

$$\begin{aligned} \tau (s,t)= & {} \frac{\partial x}{\partial s}(s,t) {\mathbf {i}} + \frac{\partial y}{\partial s}(s,t) {\mathbf {j}} \\ n(s,t)= & {} \frac{\partial y}{\partial s}(s,t) {\mathbf {i}} - \frac{\partial x}{\partial s}(s,t) {\mathbf {j}} \\ v(s,t)= & {} \frac{\partial x}{\partial t}(s,t) {\mathbf {i}} + \frac{\partial y}{\partial t}(s,t){\mathbf {j}} \end{aligned}$$

are the tangent and external normal unit vectors to the boundary of \({\mathcal {C}}\) at the spatial point \((x(s,t), y(s,t))\), and \(v(s,t)\) is the corresponding velocity; all of them are regarded as 3D vectors. Since s is the arc length, the tangent vector \(\tau (s,t)\) has norm 1, and \(n(s,t)\) is the external normal unit vector since it is obtained by rotating the tangent vector \(\tau (s,t)\) clockwise by \(\pi /2\), see Fig. 5.
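The decomposition \(N = n + \tau \times v\) can be checked numerically against the direct cross product of the surface derivatives; a small sketch (helper names are our own) for a translating unit circle:

```python
import numpy as np

def surface_normal(tau, v):
    """N = n + tau x v, where n is tau rotated clockwise by pi/2
    (the external normal) and all vectors are embedded in 3D."""
    tau3 = np.array([tau[0], tau[1], 0.0])
    v3 = np.array([v[0], v[1], 0.0])
    n3 = np.array([tau[1], -tau[0], 0.0])   # clockwise rotation of tau
    return n3 + np.cross(tau3, v3)

# Unit circle translating with velocity v: compare with the direct
# cross product d(sigma)/ds x d(sigma)/dt of the surface derivatives.
s, v = 0.7, np.array([0.3, -0.2])
tau = np.array([-np.sin(s), np.cos(s)])      # arc-length tangent
sigma_s = np.array([tau[0], tau[1], 0.0])
sigma_t = np.array([v[0], v[1], 1.0])
assert np.allclose(surface_normal(tau, v), np.cross(sigma_s, sigma_t))
```

The first two components of N reduce to the spatial normal n, while the temporal component collects the contribution \(\tau \times v\), exactly as in the formula above.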

Fig. 5

A body at time t with the main relevant geometrical and dynamical quantities

Fig. 6

Space–time features in real data. Top: the tip of a foot changing direction at the end of a step produces a spatio-temporal corner; middle: the tip of a fist in the extension phase of a punching action produces a spatio-temporal edge; bottom: the side of an arm translating as a person is walking leads to a spatio-temporal surface

Let us consider the following four basic setups or behaviors:

  1.

    The boundary is smooth, so that both \(\tau (s,t)\) and \(n(s,t)\) are smooth, and the velocity is always smooth. Then, the surface parametrized by \(\sigma \) is everywhere smooth, and at each point there is a tangent plane whose normal vector is given by \(N(s,t)\) (see Fig. 4a); if the velocity is zero, then the normal vector N is simply given by n. Here we expect a single coefficient to have a high value, exactly the one directed along the surface normal.

  2.

    The boundary is smooth, so that both \(\tau (s,t)\) and \(n(s,t)\) are smooth, but the velocity at time \(t=t_0\) is not regular. Hence, the two surfaces

    $$\begin{aligned}&\{ \sigma (s,t)\mid s\in [0,{L}], t\in [0,t_0]\} \quad \text {and}\\&\{\sigma (s,t)\mid s\in [0,{L}], t\in [t_0,T]\} \end{aligned}$$

    create a 3D edge in the plane \(t=t_0\), and \(N(s,t)\) is discontinuous at \(t=t_0\) for all \(s\in [0,{L}]\), with sharp variation given by

    $$\begin{aligned} \Delta N(s,t_0)= & {} \tau (s,t_0) \times \Delta v(s,t_0) \quad \forall s\in [0,{L}], \end{aligned}$$

    where \(\Delta f\) is the jump of f (with respect to the second variable) at \(t_0\), i.e.,

    $$\begin{aligned} {\Delta f(s,t_0) = \lim _{t\rightarrow t_0^+} f(s,t) - \lim _{t\rightarrow t_0^-} f(s,t),} \end{aligned}$$

    and \(\Delta N(s,t_0)\) has a nonzero component only along the t-axis and lives on the 3D edge (see Fig. 4b). In this case, the shearlet coefficients would include two maximum values associated with the two surfaces.

  3.

    The velocity is smooth, but \((x(s_0),y(s_0))\) is a 2D corner of the boundary; then, the two surfaces

    $$\begin{aligned}&\{\sigma (s,t)\mid s\in [0,s_0], t\in [0,T]\} \quad \text {and}\\&\{\sigma (s,t)\mid s\in [s_0,{L}], t\in [0,T]\} \end{aligned}$$

    create a 3D edge parametrized by the temporal evolution of the 2D corner \((x(s_0),y(s_0))\). Hence, \(N(s,t)\) is discontinuous at \(s_0\) for all \(t\in [0,T]\), with sharp variation given by

    $$\begin{aligned} \Delta N(s_0,t)= & {} \Delta n(s_0,t)+ \Delta \tau (s_0,t) \times v(s_0,t)\nonumber \\&\quad \forall t\in [0,T], \end{aligned}$$

    where \(\Delta N(s_0,t)\) is the jump of N (with respect to the first variable) at \(s_0\); it has two contributions, the former in the xy-plane and the latter along the t-axis. As above, the vector \(\Delta N(s_0,t)\) lives on the 3D edge (see Fig. 4c). Again, the shearlet coefficients would include two maximum values associated with the two surfaces.

  4.

    The boundary has a 2D corner at the point \((x(s_0),y(s_0))\), and there is a change of velocity at time \(t=t_0\), either in direction or in speed. At the spatial–temporal point \((x(s_0,t_0), y(s_0,t_0), t_0)\), there is a vertex, which is the junction of the four surfaces

    $$\begin{aligned} {\mathcal {S}}_1= & {} \{\sigma (s,t)\mid s\in [0,s_0], t\in [0,t_0]\}\\ {\mathcal {S}}_2= & {} \{\sigma (s,t)\mid s\in [s_0,{L}], t\in [0,t_0]\}\\ {\mathcal {S}}_3= & {} \{\sigma (s,t)\mid s\in [0,s_0], t\in [t_0,T]\}\\ {\mathcal {S}}_4= & {} \{\sigma (s,t)\mid s\in [s_0,{L}], t\in [t_0,T]\}, \end{aligned}$$

    where \({\mathcal {S}}_1\) has a 3D edge in common with \({\mathcal {S}}_2\) and another in common with \({\mathcal {S}}_3\) (and similar relations hold for the other surfaces). At the vertex, there are four normal vectors (see Fig. 4d).

This toy model may be adapted to real data, as we will see in the next sections. We start by observing examples of different local behaviors within video sequences. In Fig. 6 (top), we may observe the evolution of the tip of a foot changing direction at the end of a step; this behavior produces a spatio-temporal corner or vertex. In the center of the figure, we analyze the tip of a fist in the extension phase of a punching action, producing a spatio-temporal (or 3D) edge. Finally, at the bottom, we may observe the side of an arm translating as a person is walking, producing a spatio-temporal surface.

Fig. 7

Main steps of the 2D \(+\,T\) signal representation procedure. For each space–time point \(\hat{m}\): a We compute matrices \(C_1, C_2\) and \(C_3\), b We create the object C which includes the space–time coefficients of the point neighborhood, (c, d) We map subsets of elements (i.e., shearlet coefficients) of C to different parts of a vector and (e) We obtain the representation for our point (Color figure online)

4 Enhancing Space–Time Features with Shearlets

In this section, we propose a method to represent local spatio-temporal information provided by shearlets in order to enhance different types of discontinuities of a 2D \(+\,T\) signal.

4.1 The Method

We consider a spatio-temporal point \(\hat{m}=(\hat{x},\hat{y},\hat{t})\) at a fixed scale \(\hat{j}\) and the subset of shearings

$$\begin{aligned} {\mathbf {K}= \left\{ k = (k_{1}, k_{2}) \> | \> k_{1}, k_{2} = -\lceil 2^{\hat{j} /2}\rceil , \ldots ,\lceil 2^{\hat{j} /2} \rceil \right\} ,} \end{aligned}$$

where we set \({M=2 \lceil 2^{\hat{j} /2}\rceil +1}\), so that \(\mathbf {K}\) contains \(M^2\) shearings (we suppress the dependence on \(\hat{j}\) from \(\mathbf {K}\) and M). The procedure we carry out in the discrete case is depicted in Fig. 7 and consists of two parts, which we describe in the following. In the first part, we merge the coefficients obtained from the different pyramids; in the second, we derive a representation for the considered point neighborhood. This representation should be characteristic of a specific space–time primitive.

4.1.1 Reorganize the Coefficients of a Point Neighborhood

  (a)

    We reorganize the information provided by \( \mathrm{SH}[f](\ell ,\hat{j}, k,\hat{m}) \) into three \(M \times M\) matrices, each one associated with a pyramid \(\ell =1,2,3\) and with each entry related to a specific shearing: \(C_\ell (r,c) = \mathrm{SH}[f](\ell ,\hat{j},k_{rc},\hat{m}) \), where \(r,c=1,\ldots ,{M} \) and \(k_{rc}\) is the corresponding shearing in \(\mathbf {K}\). As usual in this kind of analysis, we discard the information related to the shearlet coefficients in the low-frequency pyramid \(\ell =0\), since they are related to the smoothness of the signal. Figure 7a shows the three matrices for a specific space–time point.

  (b)

    We merge the three matrices into a single one by recombining them relative to the maximum shearlet coefficient (the central element of the column depicted in Fig. 7b). For a given scale \(\hat{j}\) and a fixed set of shearings \(\mathbf {K}\), the central element of \({\mathbf {C}}\) corresponds to \(k_{\mathrm{max}}\), the shearing corresponding to the coefficient with the maximum value in the set \(\mathrm{SH}[f](\ell ,\hat{j},k,\hat{m})\), with \(\ell \in \left\{ 1,2,3\right\} \) and \(k \in \mathbf {K}\). The eight values of \({\mathbf {C}}\) around the center (the blue ring in Fig. 7c) correspond to the values associated with the 8-neighborhood of \(k_{\mathrm{max}}\). These shearings can lie in any of the three cones, and hence the corresponding values are entries of one of the three matrices \(C_1, C_2\) and \(C_3\). This tiling procedure is repeated to cover the full index set \(\mathbf {K}\). This construction is needed to obtain a rotation-invariant representation in the next steps of the pipeline, since the values in \({\mathbf {C}}\) are redistributed similarly when considering two similar spatio-temporal primitives, even if they are oriented differently in the space–time domain. The matrix \({\mathbf {C}}\) models how the shearlet coefficients vary in a neighborhood of the direction of maximum variation, and it is built so that coefficients referring to nearby shearings end up close to one another in \({\mathbf {C}}\). We will see how different kinds of spatio-temporal elements can be associated with different kinds of local variations in \({\mathbf {C}}\).
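The recentering step can be sketched in a simplified form (code of our own): pick the pyramid containing the globally maximal coefficient and circularly shift its matrix so that the maximum sits at the central entry, ignoring the cross-pyramid tiling at the cone boundaries described above:

```python
import numpy as np

def center_on_max(C1, C2, C3):
    """Simplified recentering: select the pyramid holding the global
    maximum and roll its M x M matrix so the maximum is central.
    Returns the recentered matrix and the pyramid index ell."""
    mats = [np.asarray(C) for C in (C1, C2, C3)]
    ell = int(np.argmax([m.max() for m in mats]))
    C = mats[ell]
    M = C.shape[0]
    r, c = np.unravel_index(np.argmax(C), C.shape)
    C_centered = np.roll(C, (M // 2 - r, M // 2 - c), axis=(0, 1))
    return C_centered, ell + 1
```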

4.1.2 Compute a Compact Rotation-Invariant Representation

  1. (a)

    We group the available shearings in subsets \(\bar{s}_{i}\) according to the following rule: \(\bar{s}_{0} = \left\{ k_{\mathrm{max}}\right\} \), and \(\bar{s}_{i}\) contains the shearings in the ith ring of values around \(k_{\mathrm{max}}\) in \(\mathbf {C}\) (as highlighted in Fig. 7c). We extract the values corresponding to the coefficients for \(\bar{s}_{1}\) (by looking at the 8-neighborhood of \(k_{\mathrm{max}}\)); then we consider the adjacent outer ring (that is, the 24-neighborhood without the 8-neighborhood) to obtain the coefficients corresponding to \(\bar{s}_{2}\), and so on (Fig. 7d, e). By construction, the elements of \(\mathbf {C}\) are grouped in subsets, each associated with a ring, and the first and last element of each subset are close to each other. For the subsets \(\bar{s}_{i}\) with \(i>2\), not all the coefficients are selected: because of the way \(\mathbf {C}\) is built, selecting all of them would introduce redundancy in the representation, so only a portion is used.

  2. (b)

    We build a vector concatenating the values of the coefficients corresponding to each set as follows. We first define \(\mathrm{coeff}_{\bar{s}_{i}}\) as the set of coefficients associated with each shearing subset \(\bar{s}_{i}\):

    $$\begin{aligned} \mathrm{coeff}_{\bar{s}_{0}}&= \mathrm{SH}[f](\ell _{k_{\mathrm{max}}},\hat{j},k_{\mathrm{max}},\hat{m}) \\ \mathrm{coeff}_{\bar{s}_{i}}&= \left\{ \mathrm{SH}[f](\ell _{\bar{s}_{i}},\hat{j},k_{\bar{s}_{i}},\hat{m}) : k_{\bar{s}_{i}} \in \bar{s}_{i}\right\} , \end{aligned}$$

    where \(\ell _{k_{\mathrm{max}}}\) is the pyramid associated with the shearing \(k_{\mathrm{max}}\) and where \(\ell _{\bar{s}_{i}}\) represents the pyramid associated with each shearing \(k_{\bar{s}_{i}}\). Then, we set

    $$\begin{aligned} \mathbf {D}(\hat{m}) = \mathrm{coeff}_{\bar{s}_{0}}{}^{\frown }\,\mathrm{coeff}_{\bar{s}_{1}}{}^{\frown }\,\mathrm{coeff}_{\bar{s}_{2}}{}^{\frown }\cdots , \end{aligned}$$

    where \({}^\frown \) denotes vector concatenation. The size of the representation depends both on the number M of shearings and on the chosen scale, as introduced previously.
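A minimal sketch of the ring extraction and concatenation, under the simplifying assumption that every ring is kept in full (the actual construction discards part of the outer rings to avoid redundancy):

```python
import numpy as np

def ring_indices(M, i):
    """Entries of the i-th square ring around the center of an M x M grid
    (Chebyshev distance i from the central entry)."""
    c = M // 2
    return [(r, s) for r in range(M) for s in range(M)
            if max(abs(r - c), abs(s - c)) == i]

def descriptor(C):
    """Concatenate the central coefficient (for s_0 = {k_max}) with the
    coefficients on the successive rings s_1, s_2, ..."""
    M = C.shape[0]
    parts = [np.array([C[M // 2, M // 2]])]
    for i in range(1, M // 2 + 1):
        parts.append(np.array([C[r, s] for r, s in ring_indices(M, i)]))
    return np.concatenate(parts)

C = np.arange(25, dtype=float).reshape(5, 5)
D = descriptor(C)
assert D.size == 1 + 8 + 16        # center, first ring, second ring
assert D[0] == C[2, 2]
```

For a 5 x 5 matrix this yields a 25-dimensional vector; in general the length grows with the number of shearings M, consistent with the remark above.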

Fig. 8

a–c Sample frames of the synthetic video sequence. d–f Manually selected points on the 2D \(+\,T\) surface. g–i Corresponding average \(\mathbf {C}\)

Fig. 9

Example of 3D visualization of the result of the process. For this example we selected a static spatial edge (the blue circle) and a static spatial corner (the red circle), which are characterized by two different behaviors of change. a Selected points, b Edge and c Corner (Color figure online)

At this point, the object \(\mathbf {D}(\hat{m})\) encodes the relations between the direction of maximum variation \(k_{\mathrm{max}}\) for a given point \(\hat{m}\) and the directions corresponding to the other shearings \(k \ne k_{\mathrm{max}}\).

4.2 Expressiveness of Coefficients

We analyze the space–time neighborhood coefficients \(\mathbf{C}\) for different types of points. First, we consider a simple synthetic sequence, with a dark square on a white background. At the beginning of the sequence, the square is still; then, at frame 64 it starts translating up with constant speed until frame 108, when the square stops again until the end of the sequence. To avoid boundary problems, the sequence is composed of white frames before frame number 20 and after frame number 108. Figure 8a–c shows a selection of meaningful frames in the synthetic sequence, while Fig. 8d–f shows the volume we may obtain by stacking the video frames (and in particular the square silhouette) one on top of the other. In this synthetic example, we easily identify three types of spatio-temporal features, clearly visible on the 3D shape: surface points, 3D edges and vertices; in (d–f) we show manually selected points. Figure 8g–i shows the average \(\mathbf {C}\) computed on the space–time neighborhoods of all the marked points of a given type. In spite of the averaging, the 3D visualization highlights the neighborhood structure and shows how \(\mathbf{C}\) allows us to distinguish between different kinds of spatio-temporal structures. This speaks in favor of the expressiveness of 3D shearlet coefficients for the local space–time analysis we are considering.
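For reference, the synthetic sequence can be generated along these lines. Frame counts follow the text, while the frame size and the square side are arbitrary choices, not taken from the paper:

```python
import numpy as np

def synthetic_sequence(n_frames=128, size=64, side=20,
                       blank_until=20, move_from=64, move_to=108):
    """White frames up to `blank_until`, then a dark square that is
    still, translates upward with constant speed between `move_from`
    and `move_to`, and is still again afterward."""
    video = np.ones((n_frames, size, size), dtype=float)   # white frames
    top = size // 2                                        # initial top row
    left = size // 2 - side // 2
    for t in range(blank_until, n_frames):
        shift = max(0, min(t, move_to) - move_from)        # upward motion
        r = max(0, top - shift)
        video[t, r:r + side, left:left + side] = 0.0       # dark square
    return video

video = synthetic_sequence()
assert video[:20].min() == 1.0                  # blank frames at the start
assert video[40].min() == 0.0                   # still square visible
assert video[80].argmin() < video[63].argmin()  # square moved upward
```

Stacking these frames produces exactly the 2D \(+\,T\) volume of Fig. 8d–f, on which surface points, 3D edges and vertices can be marked.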

At this point, an observation is in order. In the case of surfaces, we identify only one meaningful peak around which we reorganize the other (negligible) contributions. Instead, in the case of 3D edges and 3D corners, \(\mathbf{C}\) presents more peaks than expected. In the case of 3D edges, we would expect two peaks, but in the construction of \(\mathbf{C}\) the second peak is replicated, due to the complexity of the point and the periodicity of the matrix \(\mathbf{C}\) on each subset associated with the different rings. A similar behavior was already observed in Fig. 3.

Furthermore, with respect to the theory, the 3D vertex in Fig. 8f corresponds to the intersection of three surfaces instead of four. This is due to the fact that we are dealing with a synthetic sequence with blank frames before frame 20. The 3D vertices at frame 64 are at the intersection of four surfaces, as expected; however, two of them are coplanar, so that we have only three distinct normal directions.

Figure 9 shows that the space–time neighborhood coefficients \(\mathbf{C}\) behave similarly on real data. It highlights two points of a real image sequence, an edge (in blue) and a corner (in red). The behavior of the neighborhood coefficients is consistent with what was previously discussed.

As further evidence, we analyze the average \(\mathbf{C}\) over sets of key points automatically detected by well-known algorithms in image processing and computer vision. We consider two spatial features, edges [1] and corners [34], and a space–time feature, STIP [27].

  1. Edges

    Figure 10 shows the average coefficients of all edge points obtained by the Canny detector applied to a 2D frame extracted from a video sequence. It is worth noting that, since the detected edge points also include corner points and moving edges, the 3D visualization also includes small lateral peaks.

  2. Corners

    Figure 11 shows the behavior of corner points automatically detected by the classical Harris algorithm. In this case, we report the visualizations for the subsets of still and moving corners separately, which, as expected, are more distinctive, since our representation takes space–time information into account while the Harris corner detector does not.

  3. STIP

    Figure 12 shows the average descriptor for the points detected as Laptev STIPs on a different image frame. It is well known that the STIP detector identifies very few points, meaningful both in space and in time. This specific image frame was chosen considering the limitations of the detection algorithm, which performs well only in the presence of very sharp space–time variations. This is clearly reflected in the behavior of the neighborhood coefficients; indeed, we observe peaks in both the space and time directions.

Fig. 10

a Frame points automatically extracted by the Canny edge detector and b a 3D visualization of \(\mathbf{C}\) averaged over all the edge points

Fig. 11

Harris corners. a Still Harris corners and b the shape visualization of their average descriptor. c Moving Harris corners and d the shape visualization of their average descriptor

Fig. 12

Laptev STIPs and a 3D visualization of \(\mathbf{C}\) averaged over all the detected points

5 Identifying Coherent Groups of Points

So far we have discussed the behavior of 3D shearlet coefficients in the space–time neighborhood of a point or a set of previously detected points. Here we discuss how we can group sets of points by similarity, with the goal of identifying automatically different types of space–time primitives.

We fix a frame in a video sequence, compute the shearlet coefficients of a suitable temporal neighborhood of the frame, and apply our algorithm to assign the local representation \(\mathbf{D}\) to each point of the given frame. We then cluster the points with the k-means algorithm into p clusters and consider the cluster centroids as an unsupervised estimate of the space–time primitives of the video frame.
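This clustering step can be sketched with a plain k-means on the per-point descriptors. The implementation below is a minimal stand-in (deterministic initialization, fixed iteration count) rather than the exact procedure used in the experiments:

```python
import numpy as np

def kmeans(X, p, n_iter=50):
    """Plain k-means: X holds one descriptor D per point, p is the
    number of space-time primitives. Deterministic initialization."""
    centroids = X[np.linspace(0, len(X) - 1, p).astype(int)].copy()
    for _ in range(n_iter):
        # assign every descriptor to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each centroid as the mean of its cluster
        for j in range(p):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# toy descriptors: two well-separated groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 8)), rng.normal(5, 0.1, (50, 8))])
labels, centroids = kmeans(X, p=2)
assert set(labels[:50]) == {0} and set(labels[50:]) == {1}
```

The returned centroids play the role of the estimated primitives, and the labels induce the per-pixel partitions visualized in Figs. 13 and 14.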

Fig. 13
figure 13

Clusters of space–time primitives for different choices of p (best seen in pdf). a Frame, b \(p=2\), c \(p=3\), d \(p=4\), e \(p=5\), f \(p=6\), g \(p=7\) and h \(p=8\)

Fig. 14
figure 14

Points belonging to the eight different clusters calculated on a frame of the boxing sequence with \(p=8\) (see text). a–h Clusters #1 to #8

Figure 13 shows the results obtained for different choices of p. The sequence is acquired by a still camera and represents a subject boxing in the air. The frame we selected to present the results captures the exact moment in which the subject is inverting the direction of movement of his arm (as shown in Fig. 12). Let us briefly comment on the results for different choices of p, which highlight space–time points at different granularities:

  • \(p= 2\): the first partition creates two groups: one containing almost all the points in the sequence with no significant local change either in space or in time (background points and those belonging to the inner part of the subject's body), and another containing points which are undergoing some spatio-temporal change.

  • \(p= 3\): the clustering better separates the points belonging to the background from those related to the shape of the subject, without further differentiating them. The background is divided into two parts, depending on the texture.

  • \(p= 4\): the additional cluster allows us to separate points that belong to spatio-temporal elements with higher dynamics, for example, the arm of the subject boxing in the air.

  • \(p= 5\): a new cluster does not provide significant changes.

  • \(p= 6\): different elements are now nicely separated: the edges belonging to the arm are grouped in a separate cluster from the edges belonging to the back and the legs; moreover, points that look like spatial corners are grouped together (in the yellow cluster), without any differentiation of their spatio-temporal behavior.

  • \(p= 7\): no additional information.

  • \(p= 8\): the points colored in white represent the last cluster added within this trial. These elements could correspond to spatial corners with particular dynamics (the fist is inverting direction, the corners joining the arm to the head and to the chest undergo some changes, and the front tip of the jacket is moving while the subject is punching). These points are also highlighted in Fig. 14h, and the corresponding average \(\mathbf {C}\) is shown in Fig. 15h. Their similarity with the STIP points in Fig. 12 is apparent.

This result highlights several desirable properties of our descriptor: the separation of all the points of the image frame into different sets according to their spatio-temporal behavior is obtained thanks to a space–time continuity of the representation inherited from the shearlet transform; as p grows, we may identify an interesting nested structure; and, even in an entirely unsupervised approach, most of the automatically detected point clusters can be associated with known feature points, such as edges or corners.

Fig. 15

3D visualization of the \(\mathbf{C}\) objects related to the centroids of the clusters shown in Fig. 14a–h. a–h \(\mathbf{C}\) for clusters #1 to #8

Fig. 16

Self-similarity matrix for a video frame of the boxing sequence. a Frame, b Clusters and c Self-similarity

Fig. 17

Similarity matrix between two video frames of the boxing sequence (the reference frame is shown in Fig. 16). a Frame, b Clusters and c Similarity

Fig. 18

Similarity matrix between a video frame of the boxing sequence (Fig. 16) and a frame of the handwaving sequence. a Frame, b Clusters and c Similarity

Fig. 19

Similarity matrix between a video frame of the boxing sequence (Fig. 16) and a frame of the walking sequence. a Frame, b Clusters and c Similarity

As a last observation, we discuss whether the estimated space–time clusters are persistent across different video frames and different videos. The intuition is that the answer should be negative, since the estimated space–time primitives are learnt from a short temporal observation, and thus different primitives may or may not be present. To this purpose, we estimate sets of primitives on different frames and compare them through the Euclidean distance, building similarity matrices. Note that, in every matrix, the entries of the two sets have been reordered so as to keep the values corresponding to the best matches along the diagonal; the assignment between the entries of the two centroid sets has been carried out by means of the Hungarian algorithm. Figure 16 shows the self-similarity within a set of space–time primitives. We consider this example a baseline observation, showing how the primitives are somewhat redundant (visible in the block structure of the matrix, which shows how different primitives are similar to one another). If we compare centroids obtained at different frames of the same sequence (Fig. 17), we again observe a very similar dominant diagonal, possibly due to the fact that we are observing a periodic action. If we compare video frames from different types of actions, we obtain noisier similarity matrices. Figure 18 compares a boxing frame with a handwaving frame; in this case, the dominant diagonal is still present, showing that each primitive has at least a counterpart in the other frame. In fact, the two actions, even if quite different, have much in common: they are upper-body actions, with abrupt changes of direction, and are executed at a similar pace; thus, we expected them to share at least a subset of very similar spatio-temporal primitives. Finally, Fig. 19 compares boxing with a walking frame, two very different types of dynamics, as confirmed by the noisy similarity matrix we obtain.
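The reordering of the similarity matrices can be sketched as follows; for the small values of p used here, a brute-force search over permutations gives the same optimal assignment as the Hungarian algorithm (the function name and toy data are illustrative):

```python
import numpy as np
from itertools import permutations

def reorder_by_assignment(A, B):
    """Reorder the columns of the distance matrix between two centroid
    sets A and B so that the best matches lie on the diagonal.
    Brute-force assignment: equivalent to the Hungarian algorithm for
    small p, but exponential in general."""
    Dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    best = min(permutations(range(len(B))),
               key=lambda pm: sum(Dist[i, j] for i, j in enumerate(pm)))
    return Dist[:, list(best)]          # matched pairs now on the diagonal

# toy example: B is a shuffled, slightly perturbed copy of A
rng = np.random.default_rng(2)
A = rng.random((4, 6))
B = A[[2, 0, 3, 1]] + rng.normal(0, 1e-3, (4, 6))
Dist = reorder_by_assignment(A, B)
assert np.allclose(np.diag(Dist), Dist.min(axis=1))
```

A strong, low-valued diagonal after this reordering corresponds to the persistent-primitive case of Figs. 16 and 17, while a noisy matrix corresponds to the cross-action comparisons of Figs. 18 and 19.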

6 Conclusions

In this paper, we discussed how to analyze space–time signals, or more specifically video sequences, in the framework of shearlets. The goal of our work was to evaluate the behavior of the signal in a space–time local neighborhood. Starting from a theoretical analysis, followed by toy as well as real examples, we discussed the typical patterns one may find in space–time signals. Then, we derived a point representation based on the shearlet coefficients and showed that it is stable on sets of points of the same nature, while also meaningfully highlighting their spatio-temporal behavior. Based on this property, we derived an unsupervised approach to identify different space–time primitives of a video frame. These primitives are the centroids of space–time point clusters obtained by the k-means algorithm. Our analysis shortens the gap between theory and algorithms and allows us to derive a computational model which may be applied to motion analysis and action recognition.

In this paper, we considered one frame at a time, together with its temporal neighborhood. We are currently investigating how to integrate the analysis at the level of the entire video. We conclude by observing that shearlets may lead to a perfectly scale-invariant representation: on 2D signals, this has been clearly demonstrated in theory and exploited in practice in [8]. Furthermore, it would be of interest to exploit the multi-scale property of the shearlet coefficients to detect spatio-temporal patterns at different scales. This requires a representation with a large number of different scales and, at present, poses some implementation problems whose solution will be the objective of future work.