1 Introduction

Recent revolutionary developments in multimedia technologies have advanced many disciplines and industries, such as health, intelligent vehicles and augmented reality. In the health area, with the help of crowd-based and social networking services, healthcare knowledge has become more convenient to share, acquire and disseminate among health seekers and providers. To bridge the vocabulary gap between health seekers and community-generated knowledge, Nie et al. [22] presented a scheme to label question-answer pairs by jointly utilizing local mining and global learning approaches. Health today is complementarily characterized by multi-modal data, which enables doctors to concisely comprehend the health conditions of patients. To help understand the progression of chronic diseases based on observational health records in the form of multimedia data, Nie et al. [24] proposed an adaptive multimodal multi-task learning model to co-regularize the modality agreement, temporal progression and discriminative capabilities of different modalities. To make health knowledge exchangeable and reusable, Nie et al. [23] presented a multilingual system that returns a single well-structured, multi-faceted answer precisely extracted from multiple heterogeneous healthcare sources. Patients nowadays actively seek online health information and post their disease control experiences, while the vocabulary gap between health seekers and providers has hindered cross-system operability and inter-user reusability. Nie et al. [25] presented a novel scheme to code medical records by jointly utilizing local mining and global learning approaches, which are tightly linked and mutually reinforced. Nie et al. [26] proposed a scheme that accurately and efficiently infers diseases, especially for community-based health services. Mobile devices and other wearable health sensors are carried by patients and doctors to track health and exercise, which makes real-time monitoring and remote health support possible. The camera is one kind of such sensor. Yan et al. proposed a novel multi-task learning framework (FEGA-MTL) for classifying the head pose of a person moving freely in an environment monitored by multiple, large field-of-view surveillance cameras [31] and for action recognition [32]. For complex event detection in videos, Yan et al. [33] proposed a novel strategy to automatically select semantically meaningful concepts for the event detection task based on both the event-kit text descriptions and the concepts' high-level feature descriptions. To cope with the vast amount of unlabeled and heterogeneous data for recognizing human activities from videos, Yan et al. [34] proposed a multi-task clustering framework for the analysis of activities of daily living from visual data gathered from wearable cameras.

We believe that remote scene dynamics may help doctors monitor patients and provide instructions for their health, so in this paper we propose a monocular scene flow estimation method. The concept of scene flow comes from optical flow: it not only solves for motion information in the 3D camera frustum, but also overcomes the rigid motion assumption of optical flow, which makes it possible to obtain 6-DoF (six degrees of freedom) data of the scene only from image sequences. Because the motion information is based on the scene structure itself, scene reconstruction is one of the key problems when solving for scene flow; beyond that, occlusion among different objects should also be taken into account, since multiple objects may have different motion states.

The prototype of scene flow comes from Gilad Adiv's study in 1985 [1], which put forward a method to calculate the depth and motion information of the camera scene by utilizing binocular optical flow and rigid motion segments, and which took a direct search approach to match these segments because of the complexity of the problem and the limited hardware of the time. This paper innovatively extends scene flow estimation to image sequences taken from a single moving camera, and makes no assumption about the rigidity of the motion itself. We express the consistency assumptions as a total energy functional by combining the 3D scene structure and the scene flow into a monocular camera projection model. In addition, we apply smoothness regularization to the flow estimation and take an anisotropic boundary operator as the smoothing operator, which makes the results more natural. When solving the functional, following Brox's optical flow method [6], we rewrite the main function according to the Euler-Lagrange condition and use a coarse-to-fine framework to prevent the total equation from converging to a local minimum after obtaining the iterative equation, so the nonlinearity of the functional can be maintained until the inner iteration. As the experimental results show and as Brox proved, solving the total scene flow energy functional by PDEs is effective and reasonable.

This paper is organized into seven sections. Section 2 describes the state-of-the-art scene flow related work from three research fields. Section 3 derives the total energy functional in detail, including the introduction of the inverse depth and three important consistency assumptions. In order to obtain the numerical solution, we process the energy functional according to the Euler-Lagrange equation to get a non-linear iterative equation in Section 4, and linearize it within a coarse-to-fine framework in Section 5. Experimental results are shown in Section 6. The last section summarizes the innovations and advantages of our algorithm, and also points out some shortcomings under bad environmental conditions, which we plan to address in future work.

2 Related works

2.1 Stereo based scene flow

After optical flow problems were solved in accurate and fast ways, scene flow methods based on binocular optical flow [27, 28] were proposed. These algorithms usually need a prior depth [14, 30] to obtain the scene structure, and then use the projection relationship between 3D scene flow and 2D image flow to solve the motion parameters with the least squares method. Vogel et al. represented the dynamic 3D scene as a collection of planar, rigidly moving local segments [29]. Basha et al. proposed a 3D point cloud parameterization that allows directly estimating the desired unknowns; their functional enforces multi-view geometric consistency and imposes brightness constancy and piecewise smoothness assumptions directly on the 3D unknowns. Besides the constancy assumption, Birkbeck et al. [5, 9] took known proxy motion into account, which enables 3D trajectory reconstruction when only a single view is available. Dame et al. tackled the 6D pose and additional shape degrees of freedom of an object of known class in the scene, combining image data and depth information for pose and shape recovery. Even though these methods can obtain effective solutions, they are not general because they require too much prior information.

2.2 RGBD scene flow

Since Microsoft released the Kinect motion-sensing device [8, 12, 16, 18, 36], studies on RGBD datasets have become popular. Because RGBD data provides complete and reliable information, scene flow estimation based on RGBD data has become important. Letouzey et al. [20] reconstructed the 3D scene from RGBD images, combining geometric information from depth maps with intensity variations in color images to estimate smooth and dense 3D motion fields, taking advantage of the geometric information provided by the depth camera to define a surface domain. Herbst et al. [13] proposed a method that generalizes two-frame variational 2D flow algorithms to 3D and computes flow reliably from RGBD data, overcoming depth noise. However, similar methods work only in indoor situations, because the IR camera that captures the depth image is easily affected by sunlight. Even though Yang et al. [35] proposed novel density-modulated binary patterns for depth acquisition, the carried phase is not strictly sinusoidal, so the depth reconstructed from the phase contains a systematic error.

2.3 SLAM based scene flow

As one of the most important and fundamental algorithms in robotics [10], after decades of development SLAM has become an effective algorithm that can accurately position a camera using only image information. After the camera position is obtained, point cloud matching can be used to rebuild the scene. For example, Alcantarilla et al. [2] combined visual SLAM and dense scene flow to parse the surrounding environment. The key idea is to continuously estimate a semi-dense inverse depth map for the current frame, which in turn can be used to track the motion of the camera. Even though SLAM offers much information for scene reconstruction, it needs a large amount of a posteriori data because it is based on probability theory [15], and its initial evaluation is not particularly accurate, so probability-based SLAM algorithms are not suitable for dense scene flow estimation.

3 Monocular scene flow

We use an integrated energy functional to estimate the scene flow, focusing on obtaining the inverse depth and the scene flow of the reference frame from monocular image sequences only. We make no assumption about the motion rigidity of the camera, and we express the solution in world coordinates.

3.1 Pinhole model

Based on the camera pinhole model in Fig. 1, we can obtain the relationship between a 3D object point and a 2D image point:

$$ {z}_{C^i}\overrightarrow{X_{I^i}}=\rho {T}_{C^i}\overrightarrow{X_{W^i}} $$
(1)
Fig. 1 Inverse depth: the red point is a pixel on the image plane; after back projection, its corresponding 3D point lies on the back-projected ray; the hexahedron containing the ray is the camera frustum. Conversely, perspective projection maps a 3D point in the frustum to the red pixel on the image plane

Specifically, let \( {M}^i=\rho {T}_{C^i} \), which stands for the 3×4 projection matrix of the camera at time i, including the external parameters relative to the reference frame and the internal parameters of the camera itself.

3.2 Inverse depth

The concept of inverse depth comes from Civera's study on monocular SLAM [7], and Newcombe emphasized it in the DTAM system [21]. According to the pinhole model, the inverse depth of a given pixel lies in a range \( d\subset \eta \left(\overrightarrow{X_i}\right) \) because of the information lost in projection (see Fig. 1).

For a 3D space object point \( \overrightarrow{X_{W^i}}=\left({x}_{w^i},{y}_{w^i},{z}_{w^i}\right) \), the corresponding pixel point is \( \overrightarrow{X_{I^i}}=\left({u}_i,{v}_i,1\right) \), their relationship can be expressed as:

$$ d\left({u}_i,{v}_i\right)\overrightarrow{X_{I^i}}={M}^i\overrightarrow{X_{W^i}} $$
(2)

Obviously, \( d\left({u}_i,{v}_i\right) \), the same as \( {z}_{C^i} \) in Eq. 1, is a function whose parameters are the image pixel coordinates. Thus, we can get a 3D point \( \overrightarrow{X_{W^i}}=\left({x}_{W^i},{y}_{W^i},d\left({u}_i,{v}_i\right)\right) \) by back-projecting a pixel of the reference frame, and then this 3D point will be re-projected onto the image at time i as a pixel \( {X}_{I^i}={f}_{proj}\left({\overrightarrow{X}}_{W^i}\right) \). The position of a 3D point at the next time instant is determined by its current position and its velocity per unit time, so we can express the relationship between two successive positions of a 3D point as \( {\overrightarrow{X}}_{W^i}={f}_{pos}\left({\overrightarrow{X}}_{W^0},\overrightarrow{V^i},t\right) \). The initial position of the 3D point comes from back projection (assuming that the camera is located at the origin of the world coordinate system in the first frame):

$$ \begin{array}{c}\hfill {\overrightarrow{X}}_{W^0}\left({I}_0\right)=\left[\begin{array}{c}\hfill {x}_{w^0}\hfill \\ {}\hfill {y}_{w^0}\hfill \\ {}\hfill {z}_{w^0}\hfill \end{array}\right]={f}_{backproject}\left({I}_0,d\left({u}_0,{v}_0\right)\right)\hfill \\ {}\hfill =d\left({u}_0,{v}_0\right)\left[\begin{array}{c}\hfill {u}_0/{f}_u-{o}_u/{f}_u\hfill \\ {}\hfill {v}_0/{f}_v-{o}_v/{f}_v\hfill \\ {}\hfill 1\hfill \end{array}\right]\hfill \end{array} $$
(3)

Expanding the above equation for a pixel, we express the correspondence between a 3D point of the point cloud and a 2D image pixel as follows:

$$ \begin{array}{c}\hfill {u}_i=\frac{{\left[{M}^i\right]}_1\overrightarrow{X_{W^i}}}{{\left[{M}^i\right]}_3\overrightarrow{X_{W^i}}}=\frac{{\left[{M}^i\right]}_1\left(\overrightarrow{X_{W^0}}+\overrightarrow{V_i}\right)}{{\left[{M}^i\right]}_3\left(\overrightarrow{X_{W^0}}+\overrightarrow{V_i}\right)}\hfill \\ {}\hfill =\frac{M_{11}^i\left({x}_{w^0}+{V}_{x^i}\right)+{M}_{12}^i\left({y}_{w^0}+{V}_{y^i}\right)+{M}_{13}^i\left(d\left({u}_0,{v}_0\right)+{V}_{z^i}\right)+{M}_{14}^i}{M_{31}^i\left({x}_{w^0}+{V}_{x^i}\right)+{M}_{32}^i\left({y}_{w^0}+{V}_{y^i}\right)+{M}_{33}^i\left(d\left({u}_0,{v}_0\right)+{V}_{z^i}\right)+{M}_{34}^i}\hfill \\ {}\hfill {v}_i=\frac{{\left[{M}^i\right]}_2\overrightarrow{X_{W^i}}}{{\left[{M}^i\right]}_3\overrightarrow{X_{W^i}}}=\frac{{\left[{M}^i\right]}_2\left(\overrightarrow{X_{W^0}}+\overrightarrow{V_i}\right)}{{\left[{M}^i\right]}_3\left(\overrightarrow{X_{W^0}}+\overrightarrow{V_i}\right)}\hfill \\ {}\hfill =\frac{M_{21}^i\left({x}_{w^0}+{V}_{x^i}\right)+{M}_{22}^i\left({y}_{w^0}+{V}_{y^i}\right)+{M}_{23}^i\left(d\left({u}_0,{v}_0\right)+{V}_{z^i}\right)+{M}_{24}^i}{M_{31}^i\left({x}_{w^0}+{V}_{x^i}\right)+{M}_{32}^i\left({y}_{w^0}+{V}_{y^i}\right)+{M}_{33}^i\left(d\left({u}_0,{v}_0\right)+{V}_{z^i}\right)+{M}_{34}^i}\hfill \end{array} $$
(4)

Equation 4 shows that, after being back-projected to 3D space and moved, a pixel of the reference frame at time 0 can be re-projected as another pixel in the frame at time i.
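To make Eqs. 2-4 concrete, the following minimal NumPy sketch (illustrative only, not the authors' implementation) back-projects a reference pixel with its depth value d(u_0, v_0), moves the resulting 3D point by the scene flow accumulated over the time interval, and re-projects it with a 3×4 camera matrix; the principal point values, the identity pose and the example pixel are assumptions.

```python
import numpy as np

def backproject(u0, v0, d, fu, fv, ou, ov):
    """Back-project a reference pixel (u0, v0) with depth d (Eq. 3)."""
    return d * np.array([(u0 - ou) / fu, (v0 - ov) / fv, 1.0])

def reproject(M_i, X_w0, V, t_i):
    """Move the 3D point with constant velocity V for time t_i and
    project it with the 3x4 camera matrix M_i (Eq. 4)."""
    X_wi = X_w0 + t_i * V                    # constant-velocity motion model
    X_h = M_i @ np.append(X_wi, 1.0)         # homogeneous projection
    return X_h[:2] / X_h[2]                  # (u_i, v_i)

# toy example with an identity pose; focal lengths follow dataset 1,
# the principal point (320, 240) is an assumption
fu, fv, ou, ov = 657.0, 658.0, 320.0, 240.0
K = np.array([[fu, 0, ou], [0, fv, ov], [0, 0, 1.0]])
M_i = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
X_w0 = backproject(100.0, 120.0, d=5.0, fu=fu, fv=fv, ou=ou, ov=ov)
print(reproject(M_i, X_w0, V=np.array([0.1, 0.0, 0.0]), t_i=1.0))
```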

3.3 Consistency assumption

Like optical flow methods, this paper makes some reasonable assumptions about the scene and the projection information. We first propose a short-time velocity consistency based on spatial coincidence, and then propose a brightness consistency based on illumination invariance within a short time interval. In order to make our algorithm tolerant to texture variance and brightness noise, we also adopt a gradient consistency.

3.3.1 Velocity consistency

As shown in Fig. 2, for sequential frames we assume that the velocity of an object point in world space is constant within a short time interval, so we can formulate the relationship between the moving camera and the dynamic scene over a short period of successive frames. Taking the camera motion as a known quantity, and the inverse depth and the object velocity \( \overrightarrow{V_o} \) as the unknowns, the same 3D object point is projected to different image positions due to the motion of the camera and the objects. Given the camera transform matrix and the intrinsic matrix, the pinhole model in Eq. 1 can be rewritten as the motion relationship of a 3D point, as in Eq. 5.

Fig. 2 Projection model over time: three frames extracted from a continuous monocular image sequence

$$ \begin{array}{c}\hfill {\overrightarrow{X}}_{W^i}={f}_{pos}\left(\overrightarrow{X_{W^0}},\overrightarrow{V},{t}_i\right)\hfill \\ {}\hfill ={\overrightarrow{X}}_{W^0}+{t}_i\overrightarrow{V}\left({u}_0,{v}_0\right)={f}_{backproject}\left({I}_0,d\left({u}_0,{v}_0\right)\right)+{t}_i\overrightarrow{V}\left({u}_0,{v}_0\right)\hfill \end{array} $$
(5)

The 3D scene flow \( \overrightarrow{V}\left({u}_0,{v}_0\right) \) is a function of the image coordinates, and \( {t}_i \) is the time interval between the reference frame at time 0 and the frame at time i, so the 3D point location at time i can be seen as the result of a non-linear function of velocity and time.

3.3.2 Brightness consistency

Assuming that there is no specular reflection in the scene and that the scene structure and illumination conditions are stable over a very short period of time, after a tiny movement the image pixels originating from the same 3D point will have similar light intensity. Based on this assumption, we can formulate the brightness consistency constraint between a real frame and its re-projected frame as in Eq. 6.

$$ B{C}_{s1}\left(d,\overrightarrow{V}\right)={\displaystyle \sum_{i=0}^n{m}_{cs1}^i{\displaystyle {\int}_{\varOmega}\psi \left({\left|{I}_i-{I}_i\left({\overrightarrow{X}}_{I_i}\right)\right|}^2\right)}} $$
(6)

where \( {I}_i \) is the real frame at time i, and \( {I}_i\left({\overrightarrow{X}}_{I_i}\right) \) is the re-projected frame based on the estimated inverse depth and scene flow. If the velocity consistency holds over n frames, BC s1 is the brightness consistency of the current frame. In Eq. 6 the integration domain is the rectangular image domain, which means our functional covers the whole image range by evaluating Eqs. 3 and 4. The function \( \psi (x)=\sqrt{x^2+\epsilon } \), where \( \epsilon \) is a tiny constant, is used to ensure the convexity of the total functional; in other words, the functional has a global minimum, so we can use the Euler-Lagrange equation to derive the iterative equation. \( {m}_{cs1}^i \) is a mask that covers occluded points and prevents consistency errors caused by the camera motion.

Similarly, referring to the brightness consistency, we can also write a constraint between the reference frame and the re-projected frames as follows:

$$ B{C}_{s2}\left(d,\overrightarrow{V}\right)={\displaystyle \sum_{i=1}^n{m}_{cs2}^i{\displaystyle {\int}_{\varOmega}\psi \left({\left|{I}_0-{I}_i\left({\overrightarrow{X}}_{I_i}\right)\right|}^2\right)}} $$
(7)
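As an illustration only (not the authors' code), the robust penalty ψ and the two brightness terms of Eqs. 6 and 7 could be evaluated as in the sketch below, where `warped[i]` stands for frame I_i sampled at the re-projected pixel positions of Eq. 4, and a single list of visibility masks replaces m_cs1 and m_cs2 for simplicity.

```python
import numpy as np

EPS = 1e-6

def psi(s):
    """Robust penalty psi(s) = sqrt(s + eps), applied to squared residuals."""
    return np.sqrt(s + EPS)

def brightness_terms(I, warped, masks):
    """I[i]: real frame at time i; warped[i]: I_i sampled at the re-projected
    coordinates X_Ii; masks[i]: occlusion mask (1 = visible, 0 = occluded)."""
    bc1 = sum(np.sum(masks[i] * psi((I[i] - warped[i]) ** 2))   # Eq. 6
              for i in range(len(I)))
    bc2 = sum(np.sum(masks[i] * psi((I[0] - warped[i]) ** 2))   # Eq. 7
              for i in range(1, len(I)))
    return bc1, bc2
```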

3.3.3 Gradient consistency

In real environments, scene flow estimation using only the brightness consistency may be influenced by inevitable noise, so gradient consistency is introduced in this paper to make the algorithm more robust.

$$ GC\left(d,\overrightarrow{V}\right)={\displaystyle \sum_{i=1}^n{m}_{gc}^i{\displaystyle {\int}_{\varOmega}\psi \left({\left|\nabla {I}_i-\nabla {I}_i\left({\overrightarrow{X}}_{I_i}\right)\right|}^2\right)}} $$
(8)

where ∇ stands for the light intensity gradient at a pixel. Equation 8 constrains the pixel gradient itself and makes the functional robust not only to image texture but also to the texture distribution.

The data term of the functional now consists of three parts:

$$ FC=B{C}_{s1}+B{C}_{s2}+{\alpha}_gGC $$
(9)

where \( {\alpha}_g \) is a weight factor that balances the proportions of the two kinds of consistency: too large a gradient weight blurs object edges, whereas too small a weight weakens the robustness. It is usually determined by the scene structure and is normally set to 0.5.
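For completeness, a similarly hedged sketch of the gradient term of Eq. 8 and the weighted data term of Eq. 9 follows; the finite-difference gradients and the single mask list are simplifying assumptions, and α_g defaults to the suggested value of 0.5.

```python
import numpy as np

EPS = 1e-6

def psi(s):
    """Robust penalty psi(s) = sqrt(s + eps)."""
    return np.sqrt(s + EPS)

def gradient_term(I, warped, masks):
    """GC of Eq. 8 using finite-difference image gradients."""
    gc = 0.0
    for i in range(1, len(I)):
        gy_i, gx_i = np.gradient(I[i])
        gy_w, gx_w = np.gradient(warped[i])
        gc += np.sum(masks[i] * psi((gx_i - gx_w) ** 2 + (gy_i - gy_w) ** 2))
    return gc

def data_term(bc1, bc2, gc, alpha_g=0.5):
    """FC = BC_s1 + BC_s2 + alpha_g * GC (Eq. 9)."""
    return bc1 + bc2 + alpha_g * gc
```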

3.3.4 Smooth regularization

The main purpose of the smoothness regularization is to reduce noise in the scene flow and the inverse depth, but over-smoothing blurs the edges, which leads to errors in the solution, so we take an anisotropic operator as the smoothing function.

Traditional methods often choose an isotropic operator such as the Laplacian to smooth the scene flow. We instead adopt an operator similar to that of Herbst et al. [13]; it is anisotropic and reduces the smoothing across object edges:

$$ L\left({\overrightarrow{X}}_I\right)=1-{e}^{-c{\left(rgb\left(\overrightarrow{X_I}\right)-rgb\left(\overrightarrow{Y_I}\right)\right)}^2},\kern0.5em \mathrm{where}\kern0.33em \overrightarrow{Y_I}\in N\left(\overrightarrow{X_I}\right) $$
(10)

\( N\left(\overrightarrow{x}\right) \) denotes the adjacent pixels of pixel \( \overrightarrow{x} \) in the reference image frame. We can then write the smoothness regularization term as:

$$ {S}_f\left(\overrightarrow{V}\right)=\psi \left(\overrightarrow{V}{\left(\overrightarrow{X_{I_0}}\right)}^TL\left(\overrightarrow{X_{I_0}}\right)\overrightarrow{V}\left(\overrightarrow{X_{I_0}}\right)\right) $$
(11)

Adopting Eq. 11 as the smoothness term ensures the similarity of the scene flow inside object areas while preserving the differences at object edges.
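The edge weight of Eq. 10 can be pictured with the short sketch below, computed here for the right and bottom neighbors of each pixel of the reference RGB image; the contrast parameter c and the 4-neighborhood are assumptions.

```python
import numpy as np

def anisotropic_L(rgb0, c=10.0):
    """L(X) = 1 - exp(-c * (rgb(X) - rgb(Y))^2) for the right and bottom
    neighbors Y (Eq. 10): close to 0 in homogeneous regions and close to 1
    across strong color edges of the reference image rgb0 (H x W x 3)."""
    dx = np.sum((rgb0[:, 1:] - rgb0[:, :-1]) ** 2, axis=-1)   # right neighbor
    dy = np.sum((rgb0[1:, :] - rgb0[:-1, :]) ** 2, axis=-1)   # bottom neighbor
    return 1.0 - np.exp(-c * dx), 1.0 - np.exp(-c * dy)
```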

As in traditional methods, we use a Laplacian for depth smoothing as in Eq. 12, since the correlation between depth and edges is not as strong as it is for the scene flow.

$$ {S}_z(d)=\psi \left({\left|\nabla d\left({X}_{I_0}\right)\right|}^2\right) $$
(12)

We obtain the smoothness term as follows:

$$ FS={\beta}_f{S}_f+{\beta}_z{S}_z $$
(13)

with β f and β z as weight factors.

Combining the above terms, we can derive the integrated energy functional in Eq. 14.

$$ \begin{array}{c}\hfill E\left(d,\overrightarrow{V}\right)={\displaystyle {\int}_{\varOmega}\left(FC+{\alpha}_sFS\right) dudv}\hfill \\ {}\hfill ={\displaystyle {\int}_{\varOmega}\left[B{C}_{s1}+B{C}_{s2}+{\alpha}_gGC+{\alpha}_s\left({\beta}_f{S}_f+{\beta}_z{S}_z\right)\right] dudv}\hfill \end{array} $$
(14)

where Ω is the image domain, which means the functional is solved over the whole image.

As Eq. 14 shows, the integrated energy functional depends on the depth of the object points and on the scene flow, and it is nonlinear. We therefore treat it as a variational problem, apply the Euler-Lagrange condition, and obtain a linear iterative equation for the numerical solution.

3.3.5 Occlusion estimation

In scene flow estimation, occlusion points may appear at any position because of the dynamics of the scene, so it is necessary to remove occluded points before the main iterative process. As shown in Algorithm 1, for a moving camera the change of its optic center may lead to boundary occlusion, so we recompute the center of projection (COP) position of the i-th frame as initialization. After acquiring the boundary occlusion map, every 3D point corresponding to a reference image pixel is projected to the image coordinates of the current time i. If two 3D points project to the same 2D image coordinate, the point farther from the COP is marked as occluded and its corresponding reference image coordinate is recorded for the frame at time i. Thus, scene flow estimation without occluded points becomes more accurate, and it requires only a few extra steps before the real estimation begins.

Algorithm 1
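Since Algorithm 1 appears only as a figure, the following depth-buffer sketch gives one possible reading of the procedure described above (project every reference 3D point into frame i and, when two points land on the same pixel, mark the one farther from the COP as occluded); it is a simplified illustration, not a transcription of the authors' algorithm.

```python
import numpy as np

def occlusion_mask(points_w, M_i, cop_i, height, width):
    """points_w: (N, 3) 3D points back-projected from reference pixels.
    M_i: 3x4 projection matrix of frame i; cop_i: its center of projection.
    Returns a boolean array, True where the reference point stays visible."""
    n = len(points_w)
    visible = np.ones(n, dtype=bool)
    zbuf = np.full((height, width), np.inf)   # closest COP distance seen so far
    owner = np.full((height, width), -1)      # index of the closest point
    dist = np.linalg.norm(points_w - cop_i, axis=1)
    proj = (M_i @ np.hstack([points_w, np.ones((n, 1))]).T).T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    for k, (u, v) in enumerate(uv):
        if not (0 <= u < width and 0 <= v < height):
            visible[k] = False                # boundary occlusion: leaves the image
        elif dist[k] < zbuf[v, u]:
            if owner[v, u] >= 0:
                visible[owner[v, u]] = False  # previously closest point is now occluded
            zbuf[v, u], owner[v, u] = dist[k], k
        else:
            visible[k] = False                # farther from the COP: occluded
    return visible
```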

4 Solving the energy functional

We now have the integrated energy functional for solving the scene flow of a monocular moving camera; through the mathematical modeling of the brightness and smoothness assumptions in Section 3, it is a convex nonlinear functional. First, we abbreviate the residual terms:

$$ \begin{array}{c}\hfill {\varphi}_i={I}_i-{I}_i\left(\overrightarrow{X_{I_i}}\right)\hfill \\ {}\hfill \widehat{\varphi_i}={I}_0-{I}_i\left(\overrightarrow{X_{I_i}}\right)\hfill \\ {}\hfill \nabla {\varphi}_i=\nabla {I}_0-\nabla {I}_i\left(\overrightarrow{X_{I_i}}\right)\hfill \end{array} $$
(15)

So we can rewrite the energy Functional as following:

$$ E\left(\overrightarrow{V},d\right)={\displaystyle {\int}_{\varOmega }F\left(\overrightarrow{V},d\right) dudv}={\displaystyle {\int}_{\varOmega}\left[{\displaystyle \sum_{i=1}^n\psi \left({\varphi}_i^2\right)}+{\displaystyle \sum_{i=0}^n\psi \left({\widehat{\varphi}}_i^2\right)}+{\displaystyle \sum_{i=1}^n\psi \left(\nabla {\varphi}_i^2\right)}+\psi \left({\overrightarrow{V}}^TL\overrightarrow{V}\right)+\psi \left({\left|\nabla d\right|}^2\right)\right] dudv} $$
(16)

Because Eq. 16 is continuous and differentiable over its domain of definition, we can rewrite it according to the Euler-Lagrange condition as an equation in the inverse depth d:

$$ 0=\frac{\delta F}{\delta d}-\frac{\delta }{\delta x}\left(\frac{\delta F}{\delta {d}^{\prime }}\right)={\displaystyle \sum_{i=0}^n\psi^{\prime}\left({\varphi}_i^2\right){\varphi}_i\left(\delta {\varphi}_i/\delta d\right)}+{\displaystyle \sum_{i=1}^n\psi^{\prime}\left({\widehat{\varphi}}_i^2\right){\widehat{\varphi}}_i\left(\delta {\widehat{\varphi}}_i/\delta d\right)}+{\displaystyle \sum_{i=0}^n\psi^{\prime}\left(\nabla {\varphi}_i^2\right)\nabla {\varphi}_i\left(\delta \nabla {\varphi}_i/\delta d\right)}-{\alpha}_s{\beta}_z div\left(\psi^{\prime}\left({\left|\nabla d\right|}^2\right)\nabla d\right) $$
(17)

For the component \( {v}_x \) of \( \overrightarrow{V} \), the corresponding Euler-Lagrange equation is:

$$ 0=\frac{\delta F}{\delta {v}_x}-\frac{d}{dx}\left(\frac{\delta F}{\delta {v}_{x^{\prime }}}\right)={\displaystyle \sum_{i=0}^n\psi^{\prime}\left({\varphi}_i^2\right){\varphi}_i\left(\delta {\varphi}_i/\delta {v}_x\right)}+{\displaystyle \sum_{i=1}^n\psi^{\prime}\left({\widehat{\varphi}}_i^2\right){\widehat{\varphi}}_i\left(\delta {\widehat{\varphi}}_i/\delta {v}_x\right)}+{\displaystyle \sum_{i=0}^n\psi^{\prime}\left(\nabla {\varphi}_i^2\right)\nabla {\varphi}_i\left(\delta \nabla {\varphi}_i/\delta {v}_x\right)}+\psi^{\prime}\left({\overrightarrow{V}}^TL\overrightarrow{V}\right)L\overrightarrow{V} $$
(18)

The derivations for the other components \( {v}_y \) and \( {v}_z \) are the same. Thus, we convert the functional minimization into an optimization problem, in which the optimal solution must satisfy the above equations.
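For intuition, the depth-smoothness contribution div(ψ'(|∇d|²)∇d) appearing in Eq. 17 can be discretized with finite differences as in the sketch below; this is a common discretization in variational methods and is not claimed to be the exact scheme used in the Appendix.

```python
import numpy as np

EPS = 1e-6

def dpsi(s):
    """Derivative psi'(s) of psi(s) = sqrt(s + eps)."""
    return 0.5 / np.sqrt(s + EPS)

def depth_smoothness_div(d):
    """Finite-difference approximation of div(psi'(|grad d|^2) * grad d)."""
    dy, dx = np.gradient(d)        # spatial gradient of the inverse depth map
    w = dpsi(dx ** 2 + dy ** 2)    # nonlinear diffusivity
    return np.gradient(w * dy, axis=0) + np.gradient(w * dx, axis=1)
```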

5 Numerics

The existence of local minima often leads to errors when solving optimization problems, so we use the convex penalty function ψ to ensure the convexity of the functional, which makes the iterative process converge to a global minimum. Because the back projection and the projection are non-linear, we use a first-order Taylor expansion to linearize the functional with respect to the scene flow and the depth.

$$ \begin{array}{l}{\overrightarrow{V}}^{k+1}={\overrightarrow{V}}^k+\delta {\overrightarrow{V}}^k\hfill \\ {}{d}^{k+1}={d}^k+\delta {d}^k\hfill \end{array} $$
(19)

Major components in the optimization can be rewritten as:

$$ \begin{array}{l}{\varphi}_i^k={I}_i^k-{I}_i{\left({\overrightarrow{X}}_{I_i}\right)}^k\hfill \\ {}{\widehat{\varphi}}_i^k={I}_0-{I}_i{\left({\overrightarrow{X}}_{I_i}\right)}^k\hfill \\ {}\nabla {\varphi_i}^k=\nabla {I}_0-\nabla {I}_i{\left({\overrightarrow{X}}_{I_i}\right)}^k\hfill \end{array} $$
(20)

Thus, with the first-order Taylor expansion, the values at the (k+1)-th iteration can be formed from those at the k-th iteration:

$$ \begin{array}{l}{\varphi}_i^{k+1}\approx {\varphi}_i^k+{\delta}_d{\varphi}_i^k\nabla {d}^k+{\delta}_{v_x}{\varphi}_i^k\nabla {v}_x^k+{\delta}_{v_y}{\varphi}_i^k\nabla {v}_y^k+{\delta}_{v_z}{\varphi}_i^k\nabla {v}_z^k\hfill \\ {}{\widehat{\varphi}}_i^{k+1}\approx {\widehat{\varphi}}_i^k+{\delta}_d{\widehat{\varphi}}_i^k\nabla {d}^k+{\delta}_{v_x}{\widehat{\varphi}}_i^k\nabla {v}_x^k+{\delta}_{v_y}{\widehat{\varphi}}_i^k\nabla {v}_y^k+{\delta}_{v_z}{\widehat{\varphi}}_i^k\nabla {v}_z^k\hfill \\ {}\nabla {\varphi}_i^{k+1}\approx \nabla {\varphi}_i^k+{\delta}_d\nabla {\varphi}_i^k\nabla {d}^k+{\delta}_{v_x}\nabla {\varphi}_i^k\nabla {v}_x^k+{\delta}_{v_y}\nabla {\varphi}_i^k\nabla {v}_y^k+{\delta}_{v_z}\nabla {\varphi}_i^k\nabla {v}_z^k\hfill \end{array} $$
(21)

Finally, by setting a convergence threshold, we obtain the following iterative equation for the inverse depth:

$$ \begin{array}{ll}0\hfill & ={\displaystyle \sum_{i=0}^n}{\psi}^{\prime}\left({\left({\varphi}_i^{k+1}\right)}^2\right){\varphi}_i^{k+1}\left(\delta {\varphi}_i^k/\delta d\right)\hfill \\ {}\hfill & +{\displaystyle \sum_{i=1}^n}{\psi}^{\prime}\left({\left({\widehat{\varphi}}_i^{k+1}\right)}^2\right){\widehat{\varphi}}_i^{k+1}\left(\delta {\widehat{\varphi}}_i^k/\delta d\right)\hfill \\ {}\hfill & +{\displaystyle \sum_{i=0}^n}{\psi}^{\prime}\left({\left(\nabla {\varphi}_i^{k+1}\right)}^2\right)\nabla {\varphi}_i^{k+1}\left(\delta \nabla {\varphi}_i^k/\delta d\right)\hfill \\ {}\hfill & -{\alpha}_s{\beta}_z div\left({\psi}^{\prime}\left({\left|\nabla {d}^{k+1}\right|}^2\right)\nabla {d}^{k+1}\right)\hfill \end{array} $$
(22)

The iterative process is presented in Appendix.
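At a high level, the coarse-to-fine scheme with the incremental updates of Eq. 19 and the linearized inner step of Eqs. 21-22 can be organized as in this skeleton; `build_pyramid`, `resize_to_level` and `solve_linearized_step` are hypothetical placeholders for the image pyramid, the prolongation of the current estimate, and the inner fixed-point solve detailed in the Appendix.

```python
def coarse_to_fine(frames, cam_mats, d0, v0, levels=4, inner_iters=10):
    """Skeleton of the coarse-to-fine estimation of inverse depth d and scene
    flow V: solve on the coarsest level first, then upsample and refine."""
    pyramids = [build_pyramid(f, levels) for f in frames]         # hypothetical helper
    d, v = d0, v0
    for lvl in range(levels - 1, -1, -1):                         # coarse -> fine
        imgs = [p[lvl] for p in pyramids]
        d = resize_to_level(d, lvl)                               # hypothetical helper
        v = resize_to_level(v, lvl)
        for _ in range(inner_iters):                              # inner fixed-point loop
            dd, dv = solve_linearized_step(imgs, cam_mats, d, v)  # Eqs. 21-22 (hypothetical)
            d, v = d + dd, v + dv                                 # incremental update, Eq. 19
    return d, v
```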

6 Experiments

Experiments were carried out on a self-built synthetic data set, a general synthetic data set and a real-scene data set. The three data sets have different settings for the scene and the movement of objects. The experimental results were compared with those of two current general scene flow estimation methods by calculating the root mean square error (RMSE).

Table 1 gives the experimental environment.

Table 1 Experimental environment

Table 2 shows the test dataset statistics.

Table 2 Dataset statistics

6.1 Data setup

The three data sets include two synthetic data sets and one real data set. Dataset 1 was generated with Unity3D by combining existing camera calibration parameters into the virtual camera projection matrix. Dataset 2 is a subset of Reinhard Klette's EISATS (Image Sequence Analysis Test Site) dataset [19]. Dataset 3 is a real scene from the sequence 2011\_09\_26\_drive\_0018 (1.1 GB) in KITTI [11]. Table 2 shows the basic properties of each data set. Since EISATS provides complete ground truth and ego-motion information, the comparative experiments and error calculations were carried out on dataset 2.

6.1.1 Dataset 1 detailed description

Dataset 1 simulates a dynamic ball in front of a static plane, created by putting the calibration parameters of a real camera into a virtual camera in Unity3D; the focal length of the camera is [657, 658]. As Fig. 3 shows, the plane center position is (0, 0, 80), the initial position of the ball is (-1, 3, -1), the ball moves with a constant speed (1.0, 2.0, 0.0), and the camera is static. We assume that the camera has no distortion and no optic center offset. In order to make the scene visible to the camera, we also use a point light as the ambient light. The material of the background plane is a carpet with repeated texture, which is used to show that our estimation results are not influenced by the texture distribution. When the experiment began, we obtained an integer-valued depth map of the scene by shading and set it as the initial depth of our algorithm.

Fig. 3 Dataset 1 setup: only one spot light in the scene serves as the ambient light, the ball moves in front of a carpet-textured plane, and the camera frustum matrix is set with real camera calibration parameters

6.1.2 Dataset 2 detailed description

The Image Sequence Analysis Test Site (EISATS) offers sets of image sequences for the comparative performance evaluation of stereo vision, optical flow, motion analysis and other computer vision techniques. We chose the synthesized (gray-level and color) sequences because they offer camera calibration information, camera ego-motion and ground truth information.

6.1.3 Dataset 3 detailed description

KITTI offers real stereo traffic datasets captured under different situations, and we chose a typical crossroad scene for testing.

6.2 Experimental results

In the experiments, the camera motion information and successive monocular image frames are the input, and the output is a text representation of the scene flow estimation results and the inverse depth information. By mapping the results onto a two-dimensional plane displayed in HSV color space, the motion differences can be seen directly. In this paper, we measured the accuracy of the scene flow by calculating the root mean square error (RMSE) as in Eq. 23. Since dataset 2 provides accurate ground truth information, two state-of-the-art scene flow estimation algorithms were compared with ours on dataset 2.

$$ RMSE=\sqrt{\frac{{\displaystyle {\int}_{\varOmega }{\left({\overrightarrow{V}}_{result}-{\overrightarrow{V}}_{groundtruth}\right)}^2}}{n}} $$
(23)
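In discrete form, with the integral over Ω replaced by a sum over the n image pixels, Eq. 23 can be computed as in the following short sketch (our reading of the formula).

```python
import numpy as np

def scene_flow_rmse(v_result, v_gt):
    """RMSE of Eq. 23 for (H, W, 3) scene flow fields."""
    sq_err = np.sum((v_result - v_gt) ** 2, axis=-1)   # per-pixel squared error
    return np.sqrt(np.mean(sq_err))
```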

6.2.1 Analysis of the experimental results on dataset 1

As shown in Fig. 4, we chose four sequential frames from dataset 1 to test the algorithm; in order to accelerate the iterative procedure, the integer representation of the depth map was used as the initial depth. The experimental results show that the scene flow and the inverse depth can be seen clearly in the HSV representation. This means that, under the static camera condition, our algorithm can recover relatively realistic point cloud motion information (e.g., the ball has no movement in the Z direction in the scene flow result, which matches the real condition) and obtain relatively accurate depth information.

Fig. 4 Our monocular scene flow estimation results on dataset 1. h, i and g are estimated X,Y,Z directional scene flows

6.2.2 Analysis of the experimental results on dataset 2

As shown in Fig. 5, we first extracted three consecutive frames from the EISATS stereo dataset as input frames, then added white noise and blurring to the ground truth depth image and used it as the initial depth (to be closer to a real scene). When the iteration began, the ego-motion was combined into the camera matrix. Figure 5 shows that monocular scene flow estimation remains relatively accurate under a dynamic camera, even with noise interference.

Fig. 5 Our monocular scene flow estimation results on dataset 2. a-c Input frames. e-g Estimated scene flow results. i-k Corresponding depths

In addition, since dataset 2 provides complete ground truth optical flow motion information, the scene flow accuracy can be evaluated by calculating the RMSE on the 2D projected image [3]. We therefore first re-projected the scene flow onto the image plane as in Fig. 6, and then computed the RMSE with respect to the ground truth flow under different pixel thresholds. To evaluate the effectiveness of our method, we also computed the RMSEs of two state-of-the-art algorithms, GCSF and MVSF. Our result is shown in Fig. 7. GCSF is a simple seed-growing algorithm for estimating scene flow in a stereo setup; it needs two calibrated and synchronized cameras observing a scene, and it simultaneously computes the disparity map between the image pairs and the optical flow maps between consecutive images [17]. GCSF's estimated result is shown in Fig. 8. MVSF includes a 3D point cloud parameterization of the 3D structure, which can directly estimate the desired unknowns, and its energy functional enforces multi-view geometric consistency and imposes brightness constancy and piecewise smoothness assumptions directly on the 3D unknowns [4]. MVSF's estimated result is shown in Fig. 9. According to the scene flow assessment methods of KITTI, we computed the percentage of erroneous pixels for the three algorithms under different pixel error thresholds. As Fig. 10 shows, our algorithm can achieve accuracy comparable to state-of-the-art stereo scene flow algorithms with only one camera, which removes the complexity of stereo calibration and camera synchronization.

Fig. 6 Re-projected scene flow on image plane

Fig. 7 Scene flow estimated results of our method on dataset 2

Fig. 8 Scene flow estimated results of GCSF on dataset 2

Fig. 9 Scene flow estimated results of MVSF on dataset 2

Fig. 10 Comparison of the three algorithms: the X axis is the pixel error threshold and the Y axis is the percentage of erroneous pixels

6.2.3 Analysis of the experimental results on dataset 3

We also conducted experiments on the real scene of dataset 3. Since no ego-motion is provided in the data, we adopted a static sequence to verify the algorithm, and the initial depth was set to a uniform 200 cm. As Fig. 11 shows, the monocular scene flow algorithm can still obtain relatively accurate estimation results in the real scene, without any information other than the camera intrinsic matrix.

Fig. 11 Our monocular scene flow estimation results on dataset 3. d-f are estimated X,Y,Z directional scene flows

7 Conclusion

This paper proposes a scene flow estimation algorithm for monocular image sequences, and innovatively incorporates the inverse depth into the consistency functional. Different from traditional methods, this monocular scene flow method:

1) Needs only one camera together with an existing navigation system, which makes it more flexible in traffic environments;

2) Recovers the depth and the scene flow simultaneously by putting the inverse depth into the total functional, obtaining the point cloud positions and motion information at the same time;

3) Takes an anisotropic operator for scene flow smoothing and an isotropic operator for inverse depth smoothing, which maintains the disparity between objects in the scene and reduces noise inside object areas, making the results more natural;

4) Makes three reasonable assumptions according to the attributes of dynamic scenes, extends the coarse-to-fine framework to monocular scene flow estimation and obtains the global minimum of the total energy functional as the numerical solution.

The solution accuracy depends on the velocity consistency in this algorithm: when scene objects undergo small and continuous movements, the estimation results are good. If an object undergoes non-rigid motion, the algorithm may be less accurate. Our future work will focus on overcoming such problems and will pay more attention to applications in related areas, such as societal health.