1 Introduction

Visual motion perception is a challenging topic in computer vision. A visual scene carries a wealth of information that depicts the motion of an object in the external environment, from which motion cues can be extracted through visual information processing in order to design intelligent vision systems [1, 2]. However, it is still difficult for traditional computer vision techniques to capture the motion features of a moving object in dynamic visual scenes. Fortunately, nature offers many inspirations from biological visual processing; for example, animals can effectively extract and perceive external motion cues with their vision systems. Such inspirations help us address the problem of visual motion perception.

Research in neurophysiology has revealed that specific types of visual neurons respond preferentially to specific motion patterns synthesized from three basic elements, i.e., translation, expansion/contraction, and rotation/depth rotation [3, 4]. For example, Maunsell and Van Essen [5] reported that translation-selective neurons in the dorsal part of the medial superior temporal area (MSTd) of the anesthetized macaque were sensitive to translational movement; Rind and Simmons [6] discovered depth motion selective neurons in the lobula complex of the locust that respond positively to expanding visual stimuli; Saito et al. [7,8,9] discovered rotation-selective neurons and depth rotation sensitive (DRS) neurons in the posterior parietal association cortex (area PG) of monkeys and in the medial superior temporal area (MST) of the macaque, which selectively react to an object's rotation on the fronto-parallel plane and to its depth rotation on the horizontal plane. The visual properties of such neurons can play an important role in engineering motion pattern detection, e.g., the detection of a rolling wheel's motion pattern.

To date, some computational models have been developed to detect motion patterns. However, little has been done to create computational models for depth rotation detection. Although neurophysiologists have discovered DRS neurons in the primate cerebral cortex, the underlying mechanism by which a biological vision system perceives depth rotation still remains open, let alone a systematic investigation of how to design computational models for depth rotation detection. Therefore, it is still an open topic to discuss bio-inspired computational models for detecting depth rotation motion in engineering from the angle of computer vision. In particular, can the functional properties of the discovered DRS neurons be simulated to develop bio-inspired neurocomputational models? If so, can such models be used to construct artificial vision systems for depth rotation object recognition and for such fault detection problems as gear and propeller rotation? The present work therefore discusses the problem of depth rotation perception from the angle of artificial visual neural networks, and designs a novel computational model to detect the spatio-temporal energy change caused by the depth rotation of an object.

The main contribution of the present work involves three points: (1) a bio-inspired feedforward depth rotation perception neural network (DRPNN) is originally developed to detect the depth rotation pattern of a moving object, and thus can be applied to depth rotation object detection; (2) DRPNN can reproduce some properties of depth rotation neurons reported in neurophysiology, e.g., the spatio-temporal energy change caused by depth rotation; and (3) the performance characteristics of DRPNN are thoroughly examined by means of depth rotation video sequences from different scenarios.

It is worth pointing out that DRPNN differs from existing neural networks, in particular our previous rotational motion perception neural network (RMPNN) [10]. The main differences between DRPNN and RMPNN lie in three points: (1) RMPNN is suited to the detection of rotational motion on the fronto-parallel plane, whereas DRPNN perceives the spatio-temporal energy change of an object's depth rotation; (2) the two networks originate from different biological inspirations: RMPNN is designed based on the framework of the locust visual system, whereas DRPNN is developed from the morphological and neurophysiological characteristics of the mammalian visual system; and (3) RMPNN can only detect the change of translational direction of a moving object, whereas DRPNN involves the spatio-temporal changes of both translation and depth motion.

The rest of this paper is organized as follows. The related work on visual motion perception is reviewed in Sect. 2. Section 3 describes the proposed neural network in detail. DRPNN’s computational complexity is given in Sect. 4. Section 5 displays the whole experimental analysis. Finally, Sect. 6 concludes the current work and outlines future studies.

2 Survey of related work

Depth rotation perception aims to detect a specific motion pattern in which an object rotates about an axis that is perpendicular to the observer's sight axis [7, 11, 12]. To date, many artificial visual neural networks have been proposed for different tasks in visual perception, such as target detection and tracking [13, 14], collision detection [15, 16], human identification [17, 18], visual question answering [19], and intelligent surveillance [20, 21]. However, there has been no appropriate computational model for depth rotation perception in the literature. Fortunately, many achievements reported by electro- and neuro-physiologists can give us valuable inspiration for developing bio-inspired computational models for depth rotation detection. The related work is summarized below.

2.1 Psychophysical depth rotation perception analysis

Compared with linear movement, depth rotation is more difficult to detect [22,23,24]. A number of psychophysical studies have been carried out to analyze different types of depth rotation perception. Although Shulman [25] claimed that the effect of attention was related to the visual process of depth rotation perception, it remained unclear which factors influence the effect of visual perception. Braunstein [22] suggested that some cues affect the psychophysical perception of depth rotation, and thereafter designed a mental morphological model to perceive the depth rotation pattern of a rectangle. In Braunstein's model, the change of the angle between the horizontal and vertical contours of a rectangle is taken as an indicator for detecting depth rotation. However, his model only applies to a rotating trapezoid or rectangle. Studying the visual factors that trigger mental depth rotation perception, Braunstein and Petersik [26, 27] reported that these factors could be processed separately. Later, Andersen and Braunstein [24, 28] validated that mental depth rotation perception arises from the combination of directional and depth visual cues.

2.2 Geometrical model on depth rotation

By employing projective geometry approaches, Johansson et al. [29] developed a geometrical model to simulate the change of the length and direction of a straight line in depth rotation. In their simulation experiment, a depth rotating straight line is projected onto a two-dimensional plane. They suggested that the change of the projected length of the straight line was sinusoidal when it was in depth rotation. However, they only simulated an idealized depth rotation, regardless of the influence of motion parallax on the projection of the object in three-dimensional space. On the basis of their studies, some questions remain open. For example, (1) taking motion parallax into account, what should the spatio-temporal energy changes in the retina caused by the depth rotation of an object be? (2) Can the perceived motion energies form a sinusoidal curve? (3) Do the energy changes induced by depth rotation on different planes present minor differences? We try to answer these questions in this work.
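As a minimal numerical sketch (ours, not part of [29]) of this idealized geometry, consider a line of length L rotating in depth about a vertical axis under orthographic projection; its projected length then traces a rectified sinusoid over one revolution:

```python
import numpy as np

# Idealized sketch (our assumptions): a line of length L rotates in depth about
# a vertical axis; under orthographic projection its image-plane length is
# l(theta) = L * |cos(theta)|.
L = 1.0                                      # line length (arbitrary units)
theta = np.linspace(0.0, 2.0 * np.pi, 361)   # rotation angle over one full turn
projected_length = L * np.abs(np.cos(theta))
# Peaks occur when the line is fronto-parallel (theta = 0, pi, 2*pi), and the
# projection vanishes when the line points along the sight axis (pi/2, 3*pi/2).
```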

2.3 Functional response of depth rotation perception

There has been a point of view that binocular parallax is the main factor activating DRS neurons for perceiving depth rotation [30, 31]. However, Saito et al. [7] discovered that all of the DRS neurons in the MST area of the monkey responded strongly to monocular stimulation. They also claimed that the response differences of DRS neurons under monocular and binocular viewing conditions were not very distinct [9, 11, 12], while emphasizing that the motion of a single spot in depth rotation could effectively activate the DRS neurons. These findings indicate that binocular parallax is not the key factor exciting the DRS neurons, and that the directional change of motion is more important than the moving object's shape. On the basis of this observation, they proposed that the continuous change of motion direction is the only cue that distinguishes rotation from linear movement. This means that specific computational models for depth rotation perception can be constructed to perceive the change of motion status of a moving object, and that two types of motion cues, i.e., depth motion and directional translation [32], need to be extracted in the early stage of visual information processing.

2.4 Computational models on directional selectivity

It has been discovered that lobula giant movement detector (LGMD) neurons in the lobula complex respond preferentially when an object approaches the eye of a locust [33]. Rind et al. [6, 34] presented the key features of LGMD for depth motion perception, i.e., lateral inhibition and the edge expansion of an approaching object, and proposed an LGMD-based neural network for perceiving an approaching object in 3-D space. A substantial number of experimental results suggest that the reported LGMD models work well in perceiving an object's approaching movement [35,36,37]. On the other hand, it has been reported that directional selective neurons widely exist in different animal species [38, 39]. Neurophysiological studies revealed that asymmetric lateral inhibition underlies these neurons' directional selectivity. Specifically, in the animal retina, starburst amacrine cells are connected asymmetrically to directional selective neurons and deliver inhibition in the null directions but not in the preferred direction [40, 41]. Based on such perception mechanisms, Yue and Rind [36] proposed a directional selective neural network (DSNN) to perceive the translational direction of a moving object on the fronto-parallel plane. A large number of experimental results have confirmed that DSNN is robust in perceiving an object's translational motion direction [42, 43].

3 Depth rotation perception neural network

Visual motion perception depends on hierarchical information processing. Neurophysiologists have revealed that the mammalian visual system has a layered structure and includes five types of information processing cells, respectively located in five neuropil layers, i.e., photoreceptor (P), horizontal (H), bipolar (B), starburst amacrine (S), and ganglion (G) cells [44, 45]. Each of the five layers processes its input visual signals and extracts motion cues sequentially. The process of motion perception in the mammalian visual neural system can be divided into two stages [46, 47]: (1) in the first stage, motion sensitive neurons capture and transmit local motion cues to the subsequent functional neurons, and (2) in the second stage, the functional neurons with large receptive fields synthesize the received cues in order to respond to specific complex motion patterns. Inspired by these two stages of biological visual information processing, the current neural network (DRPNN) consists of a presynaptic network and a postsynaptic one. The former comprises eight lateral inhibition neural sub-networks used for capturing visual motion information; the latter extracts the motion cues of different motion patterns, e.g., depth rotation and translational motion, and then synthesizes them to perceive the spatio-temporal energy change of an object's depth rotation.

DRPNN takes each image frame captured by a monocular video camera as its input signal, and outputs the sum of the membrane potentials produced by its internal structures. Based on the interior characteristics of the mammalian visual system [2, 44, 45], the framework of DRPNN is developed as schematically illustrated in Fig. 1. It comprises two parts, presynaptic and postsynaptic networks, whose design details are given below.

Fig. 1

Schematic illustration on DRPNN

3.1 Presynaptic network

In the presynaptic network of DRPNN, a depth perception (DP) neuron is used to capture the approaching/receding cues of an object's depth motion, while eight directional selective neurons are utilized to extract the object's translational direction cues. On the basis of their preferred translational motion directions, the eight directional selective neurons, which correspond to respective directional selective neural networks (DSNNs), can be classified into two types. The first type, consisting of horizontal and vertical directional selective neurons, includes the left (L), right (R), up (U), and down (D) selective neurons; the second type, formed of diagonally directional selective neurons, involves the left-up (LU), left-down (LD), right-up (RU), and right-down (RD) selective neurons. Each of the eight neurons corresponds to a specific DSNN which perceives the corresponding translational direction cues of the object. Therefore, the presynaptic network includes eight DSNNs obtained by improving the reported computational models [36, 37]. These eight neural networks share the four layers of P, H, B and S, but have different designs for their G layers, i.e., their direction inhibition layers. We take the left directional selective neural network (L-DSNN) as an example to illustrate their internal frameworks and functional mechanisms.

As shown in the top part of Fig. 1, L-DSNN, which preferentially responds to an object moving left in the field of view, includes five neural information-processing layers, i.e., the P, H, B, S and G layers, and one functional neuron, the aforementioned left selective neuron (L). The functions of each neural layer and of neuron L are described below.

1) P layer

The P layer, as the first layer of L-DSNN, captures the visual motion signals of an object in the field of view. By analogy with the morphological characteristics of the mammalian retina [44, 45, 48], it consists of nc × nr photoreceptor cells arranged in a matrix form. Each cell receives the luminance intensity or gray value at the corresponding position in an input image frame of size nc × nr. Let x and y denote the row and column coordinates in the input image, respectively, and let Lf−1 and Lf be the luminance values at frames f − 1 and f, respectively. Pf(x, y) represents the captured luminance change at pixel (x, y) at frame f, given [35] by

$$P_{f} (x,y) = {\text{abs}}(L_{f} (x,y) - L_{f - 1} (x,y)) .$$
(1)

According to the neurophysiological findings on the mammalian visual system [49, 50], the output of cell (x, y), \(\hat{P}_{f} (x,y)\), is determined by

$$\hat{P}_{f} (x,y) = \left\{ {\begin{array}{*{20}l} {P_{f} (x,y),} \hfill & {{\text{if}}\, P_{f} (x,y) \ge T_{rp} ,} \hfill \\ {0,} \hfill & {{\text{otherwise}} ,} \hfill \\ \end{array} } \right.$$
(2)

with signal threshold Trp.
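As an illustration, the P-layer computation of Eqs. (1)–(2) can be sketched as follows; the NumPy rendering and variable names are ours, not the authors' implementation:

```python
import numpy as np

def p_layer(curr_frame: np.ndarray, prev_frame: np.ndarray, T_rp: float) -> np.ndarray:
    """Eqs. (1)-(2): absolute luminance change, suppressed below threshold T_rp.

    curr_frame, prev_frame: 8-bit grayscale frames of shape (n_r, n_c).
    """
    # Eq. (1): absolute temporal difference of luminance.
    p = np.abs(curr_frame.astype(np.float64) - prev_frame.astype(np.float64))
    # Eq. (2): keep only changes at or above the signal threshold.
    return np.where(p >= T_rp, p, 0.0)
```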

Remark 1

To clarify the functionality of the P layer, a video sequence of a ball moving left in a carpeted office room is utilized (see Fig. 2a). We take two successive image frames, numbers 48 and 49, as an example to illustrate the processed result. After receiving frame 49, the P layer first computes the changes of luminance intensities with respect to frame 48 using Eq. (1); these changes form the difference image given in Fig. 2b. We notice that the motion edge of the moving ball is extracted, but some noise is included. Subsequently, the image is transformed into Fig. 2c after being further processed by means of Eq. (2) and a given threshold value Trp.

Fig. 2

Illustrative example on the P layer

2) H and B layers

The H and B layers, as the second and third layers of L-DSNN, respectively, are designed based on the neurophysiological finding that horizontal cells collect visual signals from the photoreceptor cells and provide feedforward signals to the bipolar cells to improve the spatial resolution of visual information [48, 51, 52]. Each of these two layers includes nc × nr cells arranged in a matrix form. The cells in the H layer directly receive the excitatory intensities from their retinotopic counterparts in the P layer and then transmit them to the B layer. The output of each cell (x, y) is defined by

$$H_{f} (x,y) = \hat{P}_{f} (x,y) .$$
(3)

In the B layer, each cell not only collects the outputs of the cells around its retinotopic counterpart in the H layer, but also fuses them with the output of its retinotopic counterpart in the P layer. More precisely, as related to the visual information integration metaphor in the mammalian retina [51, 52], the surround excitation passed from the H layer to the B layer only gets a smaller passing coefficient whb, whereas the direct excitation passed from the P layer to the B layer gains a larger passing coefficient wpb in the process of information integration. Therefore, the strength of the mixed excitation Bf(x, y) of each cell in the B layer is given by

$$B_{f} (x,y) = \sum\limits_{{i = - m_{w} }}^{{m_{w} }} {\sum\limits_{{j = - m_{w} }}^{{m_{w} }} {H_{f} (x + i,y + j)w_{c} (i,j)} } w_{hb} + \hat{P}_{f} (x,y)w_{pb} ,$$
(4)

with surround radius mw, where wc denotes a convolution mask given by

$$w_{c} = \left[ {\begin{array}{*{20}c} {0.125} & {0.25} & {0.125} \\ {0.25} & 0 & {0.25} \\ {0.125} & {0.25} & {0.125} \\ \end{array} } \right],$$
(5)

based on the neurophysiological achievements [52, 53] and empirical experience [20, 36, 54].
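A minimal sketch (ours) of the H and B layers of Eqs. (3)–(5); the zero padding at the frame border, the SciPy convolution, and the default coefficient values (taken from Sect. 5.1) are our implementation choices:

```python
import numpy as np
from scipy.ndimage import convolve

# Convolution mask of Eq. (5).
W_C = np.array([[0.125, 0.25, 0.125],
                [0.25,  0.0,  0.25],
                [0.125, 0.25, 0.125]])

def hb_layers(p_hat: np.ndarray, w_hb: float = 0.33, w_pb: float = 0.67) -> np.ndarray:
    """Eqs. (3)-(4): the H layer relays the P-layer output; the B layer fuses the
    surround excitation with the direct excitation. Zero padding is our assumption."""
    h = p_hat                                        # Eq. (3): direct relay
    surround = convolve(h, W_C, mode='constant', cval=0.0)
    return surround * w_hb + p_hat * w_pb            # Eq. (4)
```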

Remark 2

As related to Fig. 2c, the B layer generates nc × nr visual excitations relying upon the H layer and Eq. (4), forming the image given in Fig. 3. The figure indicates that the object's motion edge becomes clearer and some of the clutter in Fig. 2c is filtered out.

Fig. 3

Illustrative image frame acquired by the B layer

3) S and G layers

Neurophysiological studies have revealed that starburst amacrine cells, as inhibitory inter-neurons, play an important role in forming the visual perception of directional selectivity. More precisely, such cells gather signals from the bipolar cells and pass their directional inhibition signals to the ganglion cells in the null directions but not in the preferred direction by means of their major synapses [2, 44]. Analogously, the fourth and fifth layers of L-DSNN are the S and G layers, respectively, each arranged in an nc × nr matrix form. Each cell in the S layer receives the membrane potential of its retinotopic counterpart in the B layer and generates its left inhibition through

$$I_{f}^{\rm L} (x,y) = \sum\limits_{{i = - n_{{{\text{inh}}}} }}^{{n_{{{\text{inh}}}} }} {\sum\limits_{{j = - n_{{{\text{inh}}}} }}^{{n_{{{\text{inh}}}} }} {B_{f - 1} (x + i, y + j)} } - \sum\limits_{i = 1}^{{n_{{{\text{inh}}}} }} {B_{f - 1} (x + i, y)w_{\xi } (i)} ,$$
(6)

where the superscript L denotes the L neuron; ninh is the inhibition radius; and wξ(i) is the local inhibition weight, which controls the inhibition strength of the opposite-side neighbors and is given by

$$w_{\xi } (i) = (2n_{inh} + 1)^{2} \psi_{I} ,$$
(7)

with membrane potential constant ψI. Then, each cell in the S layer outputs its inhibition intensity [35] through

$$\hat{I}_{f}^{\rm L} (x,y) = \left\{ {\begin{array}{*{20}l} {I_{f}^{\rm L} (x,y),} \hfill & {{\text{if}} \,I_{f}^{\rm L} (x,y) > 0,} \hfill \\ {0,} \hfill & {{\text{else}}.} \hfill \\ \end{array} } \right.$$
(8)

Additionally, in line with the neurophysiological finding that the excitation and inhibition of bipolar and starburst amacrine cells are passed on to ganglion cells [2, 44], each cell in the G layer collects two types of visual signals from the above B and S layers. One is the output excitation of its retinotopic counterpart in the B layer, and the other is the gathered inhibition spread by the neighboring cells of its retinotopic counterpart in the S layer. The collected visual signals are integrated by

$$G_{f}^{\rm L} (x,y) = B_{f} (x,y) - \hat{I}_{f}^{\rm L} (x,y)w_{I} ,$$
(9)

where \(G_{f}^{\rm L} (x,y)\) is the integrated excitation of cell (x, y) in the G layer, and wI is the global inhibition weight which controls the overall inhibition strength. Subsequently, only those cells whose membrane potentials reach a threshold value Tg output their activities. That is, if the membrane potential of a cell in the G layer is smaller than Tg, its output is set to 0, and it remains unchanged otherwise. The output of each cell in the G layer is computed by

$$\hat{G}_{f}^{\rm L} (x,y) = \left\{ {\begin{array}{*{20}l} {G_{f}^{\rm L} (x,y), } \hfill & {{\text{if}}\, G_{f}^{\rm L} (x,y) \ge T_{g} ,} \hfill \\ {0,} \hfill & {{\text{else}}.} \hfill \\ \end{array} } \right.$$
(10)
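A minimal sketch (ours) of the left-directional S layer and the G layer of Eqs. (6)–(10); treating out-of-frame values as zero and the vectorized neighbourhood sum are our implementation choices:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def s_layer_left_inhibition(b_prev: np.ndarray, n_inh: int, psi_I: float) -> np.ndarray:
    """Eqs. (6)-(8): left-directional inhibition generated by the S layer from the
    previous B-layer frame; out-of-frame values are treated as zero (our assumption)."""
    n_r, n_c = b_prev.shape
    w_xi = (2 * n_inh + 1) ** 2 * psi_I            # Eq. (7) (constant in i as written)
    window = 2 * n_inh + 1
    # First term of Eq. (6): sum of B_{f-1} over the (2*n_inh+1)^2 neighbourhood.
    neigh_sum = uniform_filter(b_prev, size=window, mode='constant', cval=0.0) * window ** 2
    # Second term of Eq. (6): weighted sum along the null side.
    null_sum = np.zeros_like(b_prev)
    for i in range(1, n_inh + 1):
        shifted = np.zeros_like(b_prev)
        shifted[:n_r - i, :] = b_prev[i:, :]        # B_{f-1}(x + i, y)
        null_sum += shifted * w_xi
    return np.maximum(neigh_sum - null_sum, 0.0)    # Eqs. (6) and (8)

def g_layer(b_curr: np.ndarray, i_hat: np.ndarray, w_I: float, T_g: float) -> np.ndarray:
    """Eqs. (9)-(10): combine excitation and inhibition, then threshold at T_g."""
    g = b_curr - i_hat * w_I                        # Eq. (9)
    return np.where(g >= T_g, g, 0.0)               # Eq. (10)
```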

Remark 3

The S and G layers in L-DSNN are utilized to extract the directional visual cues of a moving object. Figure 4a, b presents the output inhibitions of the cells in the S layer and the output excitations of the cells in the G layer at frame 49, respectively. The directional inhibitions from the S layer are allocated to the cells in the G layer in the null directions but not in the preferred direction.

Fig. 4

The outputted frames: a the S layer; b the G layer

4) Neuron L

The output membrane potentials of all cells in the G layer are gathered by the L neuron. The strength of the converged excitation is computed [35] by

$${\text{SUM}}_{f}^{\rm L} = \sum\limits_{x = 1}^{{n_{c} }} {\sum\limits_{y = 1}^{{n_{r} }} {{\text{abs}}(\hat{G}_{f}^{\rm L} (x,y))} } .$$
(11)

Then, the acquired excitation is given by

$$E_{f}^{\rm L} = 2 \times \left( {1 + e^{{ - \frac{{{\text{SUM}}_{f}^{\rm L} }}{{n_{r} n_{c} }}}} } \right)^{ - 1} - 1,$$
(12)

where \(E_{f}^{\rm L}\) is the output excitation of the neuron L; it lies between 0 and 1, since \({\text{SUM}}_{f}^{\rm L} \ge 0\).
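The pooling and squashing of Eqs. (11)–(12) can be sketched as follows (our NumPy rendering):

```python
import numpy as np

def l_neuron_excitation(g_hat: np.ndarray) -> float:
    """Eqs. (11)-(12): pool the G-layer output and squash it into [0, 1)."""
    n_r, n_c = g_hat.shape
    sum_l = np.abs(g_hat).sum()                              # Eq. (11)
    return 2.0 / (1.0 + np.exp(-sum_l / (n_r * n_c))) - 1.0  # Eq. (12)
```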

5) Other directional selective neural networks

Besides the above L-DSNN, DRPNN includes seven sub-networks related to the corresponding neurons, e.g., R-DSNN, LU-DSNN, etc. These DSNNs have the same design as L-DSNN except for the directional inhibition designs of their G layers. Here, we take LU-DSNN as an example to illustrate their inhibition gathering designs. The gathered inhibition strength of each cell (x, y) in the G layer is defined by

$$I_{f}^{{{\text{LU}}}} (x,y) = \sum\limits_{{i = - n_{{{\text{inh}}}} }}^{{n_{{{\text{inh}}}} }} {\sum\limits_{{j = - n_{{{\text{inh}}}} }}^{{n_{{{\text{inh}}}} }} {B_{f - 1} (x + i, y + j)} } - \sum\limits_{j = 1,\,i = j}^{{n_{{{\text{inh}}}} }} {B_{f - 1} (x + i, y + j)w_{\xi } (i,j).}$$
(13)
6) Depth perception neuron—DP neuron

The DP neuron corresponds to the depth perception neural network used for capturing the approaching/receding cues of a moving object in depth motion. The DP neuron and the eight directional selective neurons share the same neural layers (i.e., the P, H and B layers) of the presynaptic network in processing visual signals, as shown in Fig. 1 (top). It gathers the output excitations of all cells in the B layer by [35]

$${\text{SUM}}_{f}^{{{\text{DP}}}} = \sum\limits_{x = 1}^{{n_{c} }} {\sum\limits_{y = 1}^{{n_{r} }} {{\text{abs}}(B_{f} (x,y))} } .$$
(14)

After that, the DP neuron’s output is decided by

$$E_{f}^{{{\text{DP}}}} = 2 \times \left( {1 + e^{{ - \frac{{{\text{SUM}}_{f}^{{{\text{DP}}}} }}{{n_{r} n_{c} }}}} } \right)^{ - 1} - 1,$$
(15)

where \(E_{f}^{{{\text{DP}}}}\) is the output excitation of the DP neuron at frame f.
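Analogously, the DP neuron of Eqs. (14)–(15) pools the whole B layer directly; a sketch (ours):

```python
import numpy as np

def dp_neuron_excitation(b_curr: np.ndarray) -> float:
    """Eqs. (14)-(15): the DP neuron pools the B layer and squashes the sum."""
    n_r, n_c = b_curr.shape
    sum_dp = np.abs(b_curr).sum()                             # Eq. (14)
    return 2.0 / (1.0 + np.exp(-sum_dp / (n_r * n_c))) - 1.0  # Eq. (15)
```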

Remark 4

In order to demonstrate the functionality of the DP neuron in capturing depth motion cues, a video sequence is chosen in which a ball approaches the video camera. Four of the video frames, presented in Fig. 5a, are picked out to represent this approaching motion pattern. After the video frames are processed in order by Eqs. (1)–(5), (14) and (15), the DP neuron generates the output curve given in Fig. 5b. We note that the output excitation of the DP neuron increases as the ball approaches the camera. This indicates that the DP neuron can correctly perceive the depth motion cues caused by the approaching ball, and thus possesses the perception feature of depth motion [34, 55].

Fig. 5

Illustrative example on depth motion perception

3.2 Postsynaptic network

The postsynaptic network receives the extracted visual motion cues from the above presynaptic network for further processing. As shown in Fig. 1 (bottom), it consists of a neuropil layer and a functional neuron, i.e., the direction column and DRS neuron.

1) Direction column

Inspired by the neurophysiological finding that the mammalian cerebral cortex includes visual neurons with respective motion preference axes organized in the form of direction columns [56, 57], the above-mentioned eight directional selective neurons form a direction column. According to the ranked order of the neurons, their excitations are represented by the following expression,

$$\Psi_{f} = (E_{f}^{{{\text{LU}}}} , E_{f}^{{\text{L}}} , E_{f}^{{{\text{LD}}}} , E_{f}^{{\text{D}}} , E_{f}^{{{\text{RD}}}} , E_{f}^{{\text{R}}} , E_{f}^{{{\text{RU}}}} , E_{f}^{{\text{U}}} ).$$
(16)

A spiking mechanism is utilized to determine the values of the elements in Ψf. More precisely, an internal spike κf(i) occurs inside neuron i with i ∈ {LU, L, LD, D, RD, R, RU, U}, that is

$$\kappa_{f} (i) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}} \,\Psi_{f} (i) \ge T_{e} \wedge \Psi_{f} (i) > 0,} \hfill \\ {0,} \hfill & {{\text{else}},} \hfill \\ \end{array} } \right.$$
(17)

where \(T_{e} = \max \{ \Psi_{f} (i),\,\,1 \le i \le 8\} .\) If ns successive spikes occur, the output excitation of the neuron i is computed by

$$\hat{\Psi }_{f} (i) = \left\{ {\begin{array}{*{20}l} {( - 1)^{{({\text{Quotient}}(i, \lambda_{d} ) + 1)}} \times {\rm E}_{f}^{{{\text{DP}}}} ,} \hfill & {{\text{if}}\sum\limits_{{m = f_{s}^{i} }}^{f} {\kappa_{m} (i)} \ge n{}_{s},} \hfill \\ {0,} \hfill & {{\text{else}},} \hfill \\ \end{array} } \right.$$
(18)

with quotient function Quotient(.,.) and constant divisor λd, where \(f_{s}^{i}\) denotes the first frame of the time period when continuous spikes are occurring inside the directional selective neuron i; the threshold ns is defined by

$$n_{s} = \max \left\{ {\sum\limits_{{m = f_{s}^{i} }}^{f} {\kappa_{m} (i)} ,\quad 1 \le i \le 8} \right\}.$$
(19)

With the processing mechanism described above, all elements of the output excitation vector \(\hat{\Psi }_{f}\) are zero if the object keeps static; otherwise at least one element will be nonzero.
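The spiking bookkeeping of Eqs. (16)–(19) can be sketched as below; the reset rule for broken spike trains and the reading of Quotient(·,·) as integer division of the 1-based neuron index are our assumptions, and the value of λd is not restated in this excerpt:

```python
import numpy as np

# Order of Eq. (16): index 1..8 = LU, L, LD, D, RD, R, RU, U.
NEURON_ORDER = ("LU", "L", "LD", "D", "RD", "R", "RU", "U")

class DirectionColumn:
    """Sketch (ours) of the spiking bookkeeping of Eqs. (16)-(19).

    Assumptions: a neuron's running spike count resets whenever it fails to
    spike, and a scene with no spikes yields a zero output vector.
    """

    def __init__(self, lambda_d: int):
        self.lambda_d = lambda_d                      # value not restated here
        self.spike_counts = np.zeros(8, dtype=int)    # running sums of kappa_m(i)

    def step(self, psi_f: np.ndarray, e_dp: float) -> np.ndarray:
        """psi_f: the eight excitations of Eq. (16); e_dp: DP output of Eq. (15)."""
        t_e = psi_f.max()                                     # threshold of Eq. (17)
        kappa = (psi_f >= t_e) & (psi_f > 0)                  # Eq. (17)
        self.spike_counts = np.where(kappa, self.spike_counts + 1, 0)
        n_s = self.spike_counts.max()                         # Eq. (19)
        psi_hat = np.zeros(8)
        for idx in range(8):
            i = idx + 1                                       # 1-based neuron index
            if n_s > 0 and self.spike_counts[idx] >= n_s:     # spike-run condition of Eq. (18)
                psi_hat[idx] = (-1) ** (i // self.lambda_d + 1) * e_dp   # Eq. (18)
        return psi_hat
```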

Remark 5

With the unique network structure of the direction column, the current motion direction of a moving object can always be detected by the eight neurons. Figure 6 exhibits the excitation curves of the neurons for the video sequence in Remark 1. We notice that the LU, L and LD neurons become excitatory, since their excitation curves rise above those of the other neurons. This is in accordance with the fact that the object moves in the left direction.

Fig. 6

Excitation curves acquired by the neurons

2) Depth rotation sensitive neuron—DRS neuron

The outputs of the above eight neurons in the direction column converge onto the DRS neuron, whose membrane potential strength at frame f is computed by

$$E_{f}^{{{\text{DRS}}}} = \sum\limits_{i = 1}^{8} {\hat{\Psi }_{f} (i)} ,$$
(20)

where the membrane potential \(E_{f}^{{{\text{DRS}}}}\) of the DRS neuron is taken as the output of DRPNN.
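A minimal sketch (ours) of how the per-frame DRS output of Eq. (20) could be assembled into the membrane potential curves reported in the experiments below; the helper names are our assumptions:

```python
import numpy as np

def drs_output(psi_hat: np.ndarray) -> float:
    """Eq. (20): the DRS neuron sums the eight direction-column outputs."""
    return float(psi_hat.sum())

# Per-frame usage (names of the per-frame inputs are assumptions of this sketch):
# curve = [drs_output(column.step(psi_f, e_dp)) for psi_f, e_dp in per_frame_cues]
```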

4 Computational complexity

Let N be the total number of pixels in each input image frame, with N = nc × nr. Within one iterative cycle, DRPNN executes 3N operations in the P layer, while the H and B layers involve 21 arithmetic operations. The S layer only performs N assignment operations. The G layer needs M1 operations to extract motion cues in the eight translational directions, with \(M_{1} = (32n_{{{\text{inh}}}}^{2} + 40n_{{{\text{inh}}}} + 72)N\). Additionally, the eight directional selective neurons require 16(N + 3) arithmetic operations, while the DP neuron executes 2N + 7 operations. Furthermore, the direction column needs M2 operations, with \(M_{2} = 8(f - f_{s}^{i} ) + 68\). Finally, the DRS neuron performs 7 addition/subtraction operations. In summary, the total number of operations executed by DRPNN in one loop is

$${\text{Sum}} = (32n_{{{\text{inh}}}}^{2} + 40n_{{{\text{inh}}}} + 94)N + 8(f - f_{s}^{i} ) + 151.$$
(21)

Since \(f - f_{s}^{i}\) takes small values, DRPNN's computational complexity in the worst case is

$${\text{O}}\left( {(32n_{{{\text{inh}}}}^{2} + 40n_{{{\text{inh}}}} + 94)N} \right).$$
(22)

Equation (22) shows that the image resolution N and the inhibition radius ninh influence DRPNN’s computational efficiency. Therefore, it will be beneficial to reduce the input video frame size and take a rational inhibition radius value in the G layer.
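As a worked example (ours, combining Eq. (21) with the parameter values used later in the experiments, N = 140 × 80 = 11,200 and ninh = 4),

$${\text{Sum}} = (32 \times 4^{2} + 40 \times 4 + 94) \times 11200 + 8(f - f_{s}^{i}) + 151 \approx 8.6 \times 10^{6},$$

i.e., roughly 8.6 million operations per frame; the cost grows linearly with the image resolution and quadratically with the inhibition radius.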

5 Experimental study

In the study of the DRS neurons in the macaque cerebral cortex, it was found that these visual neurons can perceive depth rotation in the field of view [7, 9, 11, 12]. Therefore, we use several sets of video sequences to analyze the performance of DRPNN. More specifically, in order to check whether DRPNN can effectively and robustly perceive the spatio-temporal energy change of depth rotation, and whether its output curve is sinusoidal or not, several real scenarios reflecting specific depth rotations of a moving object are first set up to sample video sequences; secondly, DRPNN is thoroughly verified on depth rotation on the horizontal and non-horizontal planes; finally, it is compared with three recent motion perception neural networks. The architecture of the experiment flowchart is given in Fig. 7.

Fig. 7

Schematic illustration on the architecture of the experimental study

5.1 Experimental environment

All experiments are executed on a Microsoft Windows 10 computer with a 2.66 GHz CPU and 4 GB of RAM, using the VC++ platform. Thirty-four video sequences are taken to examine the performance of DRPNN. Each video sequence is recorded at a frame rate of 30 fps and later separated into 8-bit grayscale images of size 140 × 80 per frame.
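For reference, this preprocessing could be reproduced as follows; OpenCV and the Python wrapper are our choices for illustration, whereas the original experiments use a VC++ implementation:

```python
import cv2

def load_gray_frames(path: str, width: int = 140, height: int = 80):
    """Read a recorded video and convert each frame to an 8-bit grayscale
    image of size 140 x 80, matching the preprocessing described above."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # 8-bit grayscale
        frames.append(cv2.resize(gray, (width, height)))  # 140 x 80 pixels
    cap.release()
    return frames
```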

The parameter settings of DRPNN are given in Table 1. nc and nr are set to 140 and 80, respectively, since the cells in the P layer correspond to the pixels of the input image. The potential constant ψI is set to 255, based on the maximum pixel value of an 8-bit grayscale image. The proportion weights whb and wpb are defined as 0.33 and 0.67, respectively, based on the visual information integration metaphor in the mammalian retina [51, 52]. mw, wI, ninh, and Tg take 1, 1.7, 4 and 12, respectively, following previous experiments [10, 20, 36, 37, 54].

Table 1 Parameter settings of DRPNN
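For convenience, the parameter values stated above can be collected in a single configuration; this is only a restatement of Sect. 5.1, and the entries marked as placeholders (Trp and λd) are not given numerically in this excerpt:

```python
# Parameter values restated from Sect. 5.1 / Table 1.
DRPNN_PARAMS = {
    "n_c": 140, "n_r": 80,       # image width and height (pixels)
    "psi_I": 255,                # membrane potential constant (8-bit maximum)
    "w_hb": 0.33, "w_pb": 0.67,  # H->B and P->B passing coefficients (Eq. 4)
    "m_w": 1,                    # surround radius (Eq. 4)
    "w_I": 1.7,                  # global inhibition weight (Eq. 9)
    "n_inh": 4,                  # inhibition radius (Eq. 6)
    "T_g": 12,                   # G-layer threshold (Eq. 10)
    "T_rp": None,                # P-layer threshold (placeholder; value not restated here)
    "lambda_d": None,            # constant divisor of Eq. (18) (placeholder)
}
```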

5.2 Depth rotation perception on the horizontal plane

In the study of DRS neurons in MSTd, it was found that such neurons respond well to depth rotation on the horizontal plane [7, 9, 11, 12]. Therefore, from the angle of computer simulation, we test whether DRPNN can simulate this property by means of a set of video sequences representing the horizontal depth rotation of a rigid object. More precisely, when the object is in depth rotation, there are two types of projection shapes in the retina, i.e., non-deforming and deforming. Hereby, a regular ball and a rectangle are used to generate four video sequences for testing whether DRPNN responds well to an object's depth rotation. The schematics of four typical depth rotation patterns on the horizontal plane are shown in Fig. 8.

Fig. 8

Schematic examples of four typical depth rotation patterns on the horizontal plane. The object rotates on the horizontal plane (i.e., the XZ plane). The sight axis of the video camera coincides with the Z axis and is perpendicular to the rotation axis of the object (i.e., the Y axis). The red direction line indicates the rotation direction of the object: a the ball is in counterclockwise (ccw) depth rotation, b the ball is in clockwise (cw) depth rotation, c the rectangle is in ccw depth rotation, and d the rectangle is in cw depth rotation

Using a monocular video camera, we first record two video sequences of a regular black ball (40 mm in diameter) that rotates around a fixed rotation center on the horizontal plane. Similarly, we record two video sequences of a regular black rectangle (80 mm in length and 35 mm in width) that rotates around one of its fixed edges on the horizontal plane. In each video sequence, the object is placed in the central region of the field of view and rotates at a constant angular velocity (see Fig. 9). Based on these four video sequences, we verify whether DRPNN can effectively perceive the motion change of an object in depth rotation and whether its output excitation presents a sinusoidal curve.

Fig. 9

As related to Fig. 8, the example frames of depth rotation on the horizontal plane are given here. Each video sequence is illustrated only by picking up four frames; the frame number is indicated under each image. a ccw depth rotating ball, b cw depth rotating ball, c ccw depth rotating rectangle, and d cw depth rotating rectangle

In Fig. 9a, a black ball at the leftmost position of the rotation trajectory keeps stationary from frame 1 to frame 31, then rotates counterclockwise through one full circle at an angular velocity of 6.5 rad/s on the horizontal plane from frame 32 to frame 60, and finally remains stationary from frame 61 to frame 88. In Fig. 9b, the black ball is at the rightmost position of the rotation trajectory and holds stationary from frame 1 to frame 33; subsequently, it rotates clockwise through one full circle at an angular velocity of 6.5 rad/s on the horizontal plane from frame 34 to frame 61; finally, it holds stationary from frame 62 to the end. Similarly, in Fig. 9c, d, the depth rotation pattern of the rectangle is similar to that of the ball shown in Fig. 9a or b, except that the angular velocity is 8.56 rad/s. The statistical results acquired by DRPNN are displayed in Table 2, and Fig. 10 presents DRPNN's output curves corresponding to Fig. 9.

Table 2 Depth rotation perception region and experimental results gotten by DRPNN in the horizontal plane test
Fig. 10

Output curves of DRPNN. The four subfigures are acquired by the above corresponding video sequences in Fig. 9

By Fig. 10 and Table 2, when an object is in depth rotation, DRPNN is triggered to respond to the spatio-temporal energy changes occurring in the field of view and outputs its excitation. However, its output curve is somewhat different from a standard sinusoidal one in this test: (1) its left and right sub-parts are asymmetric; (2) the absolute values of the upper and lower peaks of the curve are not equal; and (3) there are some perturbations at the peaks. The main reason is that depth rotation easily causes motion parallax. Three conclusions can be drawn: (1) DRPNN can effectively perceive the depth rotation of a moving object on the horizontal plane, regardless of the object's shape; (2) the spatio-temporal energy change caused by depth rotation can be effectively captured by DRPNN; and (3) the membrane potential curve output by DRPNN presents a quasi-sinusoidal shape, which is compatible with the hypothesis suggested by Johansson et al. [29] in projective geometry.

5.3 Depth rotation perception on the non-horizontal plane

This section examines whether DRPNN can perceive the change of spatio-temporal energy if depth rotation takes place on a non-horizontal plane [9, 11, 12]. Herein, a monocular video camera records six video sequences arising from the depth rotation of a ball on different non-horizontal planes. Three typical non-horizontal planes, i.e., the left diagonal, the sagittal, and the right diagonal planes, are employed to produce the video sequences. The schematics of depth rotation on the three typical non-horizontal planes are shown in Fig. 11.

Fig. 11

Schematic illustrations of depth rotation on the three typical non-horizontal planes. a ccw depth rotation on the left diagonal plane; b ccw depth rotation on the sagittal plane; c ccw depth rotation on the right diagonal plane

As related to Fig. 11, Fig. 12 presents six video sequences that depict different depth rotation patterns of a ball. In Fig. 12, the depth rotation patterns of the ball are similar to those in the horizontal plane test above. The experimental results are shown in Fig. 13.

Fig. 12

Example frames based on the different rotating planes. Each video sequence is illustrated only by picking up four frames; the frame number is indicated under each image. a ccw depth rotation on the left diagonal plane; b cw depth rotation on the left diagonal plane; c ccw depth rotation on the sagittal plane; d cw depth rotation on the sagittal plane; e ccw depth rotation on the right diagonal plane; f cw depth rotation on the right diagonal plane

Fig. 13

Output curves of DRPNN. Each subgraph uniquely corresponds to the experimental result of the same identifier video sequence in Fig. 12

Figure 13 indicates that, in the case of non-horizontal depth rotation, DRPNN can also perceive the spatio-temporal energy change caused by the depth rotation of the object. In particular, the output curves of DRPNN are also quasi-sinusoidal, and thus DRPNN can simulate the property that DRS neurons perceive the depth rotation of an object on a non-horizontal plane [9, 11, 12].

5.4 DRPNN’s intrinsic property

5.4.1 Case I: Position invariance test

Four video sequences in Fig. 14 are sampled based on scenarios of horizontal depth rotation of a ball. Each of them is recorded in a specific non-central region of the field of view (i.e., top-left, bottom-left, top-right, and bottom-right); the ball undergoes depth rotation at an angular velocity of 6.49 rad/s, and its depth rotation pattern is similar to that of the ball shown in Fig. 9a. DRPNN is executed on each of the video sequences in Fig. 14, and its output curves are displayed in Fig. 15.

Fig. 14

Sample frames with different scenarios. Each video sequence is represented with only four frames; the frame number is indicated under each image. a top-left region, b bottom-left region, c top-right region, and d bottom-right region

Fig. 15

Output curves of DRPNN. Each subgraph is acquired by DRPNN through the corresponding video sequence in Fig. 14

The curves show that DRPNN can correctly perceive the motion change of depth rotation and outputs quasi-sinusoidal curves, even if the depth rotation occurs within different non-central regions of the field of view. This is consistent with the property of DRS neurons, namely that they become excited wherever depth rotation takes place [7, 11].

5.4.2 Case II: Sensitivity on rotation speed

We here examine how the output curve of DRPNN is influenced by different rotation speeds. Six video sequences, generated from the horizontal depth rotation of a ball with different angular velocities, are taken (see Fig. 16). The depth rotation patterns in Fig. 16, with rotation angular velocities of 1.81, 2.27, 3.55, 9.92, 20.93 and 47.1 rad/s in order, are similar to that in Fig. 9a.

Fig. 16

Example frames with different rotation angular velocities. Each video sequence is illustrated only by picking up four frames; the frame number is indicated under each image

Figure 17 displays the output excitation curves of DRPNN, which indicate that DRPNN can perceive the depth rotation of the moving object. We also observe that, when the rotation angular velocity of the object is smaller than or equal to 9.92 rad/s (see Fig. 17a–d), the output curve of DRPNN is quasi-sinusoidal. However, when the rotation speed is equal to or larger than 20.93 rad/s, the output curve of DRPNN is no longer quasi-sinusoidal. This shows that an appropriate rotation speed helps DRPNN correctly perceive the pattern of depth rotation.

Fig. 17

Output curves of DRPNN. Each subgraph is acquired by DRPNN through the corresponding video sequence in Fig. 16

5.4.3 Case III: Sensitivity on the starting point

In Case II, the starting point of the rotating object is at the leftmost or rightmost position of the rotation trajectory, and DRPNN produces a quasi-sinusoidal excitation curve when the object is in depth rotation. To examine how DRPNN responds when depth rotation starts from other points on the trajectory, we use a set of video sequences arising from the horizontal depth rotation of a ball with different starting points to challenge DRPNN. In these video sequences, the ball rotates counterclockwise on the horizontal plane (i.e., the XZ plane) from different specific starting points. The schematic illustration of the test scenarios is given in Fig. 18.

Fig. 18

Schematic illustration based on different starting points

As illustrated by Fig. 18, the first starting point (P1) is at the leftmost position of the rotation trajectory of the ball. The horizontal deviation angle between the next starting point (P2) and P1 is 45°; similarly, the angle between the kth starting point (Pk) and P1 is (k − 1) × 45° with 1 < k ≤ 8. Hence, the eight test video sequences in Fig. 19 formulate the process of depth rotation of the ball from the eight starting points. In the video sequence of Fig. 19a, with a total of 112 frames, the ball is located at the first starting point P1 and keeps stationary from frame 1 to frame 30, after which it rotates counterclockwise through one full circle on the horizontal plane at an angular velocity of 3.55 rad/s from frame 31 to frame 83, and finally it remains stationary from frame 84 to the end. In the video sequences displayed in Fig. 19b–h, the motion patterns of the ball are similar to that presented in Fig. 19a.

Fig. 19

Example frames of depth rotation with different starting points. Each video sequence is illustrated only by picking up four frames; the frame number is indicated under each image. It should be noted that in each video sequence, the first illustrative image (i.e., frame 30) represents the starting point of depth rotation of the ball

As related to the video sequences shown in Fig. 19, Fig. 20 displays the output curves of DRPNN. By Fig. 20a, DRPNN can perceive the motion change of the ball's depth rotation, and its output curve is quasi-sinusoidal. However, as the starting point of depth rotation changes, phase shifts of the output waveform occur (see Fig. 20b–h). These test results indicate that the starting point of depth rotation influences the phase of the output waveform perceived by DRPNN; the output curve, however, remains quasi-sinusoidal.

Fig. 20

Output curves of DRPNN. Each subgraph is acquired by DRPNN through the corresponding video sequence in Fig. 19

5.4.4 Case IV: Sight axis deviation test

In the above tests, the sight axis of the video camera lies in the rotating plane of the object and is perpendicular to the rotation axis. Herein, we examine how a deviation of the sight axis influences the output waveform of DRPNN. More precisely, we test how DRPNN responds to the depth rotation of an object when the sight axis of the video camera deviates from the rotating plane of the object. We take horizontal ccw rotation as an example; the schematic of the sight axis deviation is given in Fig. 21.

Fig. 21

Schematic illustration of the sight axis deviation

At the beginning of sampling a video sequence, the camera's sight axis coincides with the Z axis. Then, the sagittal deviation angle between the sight axis and the Z axis is increased in steps of 15° until the sight axis approaches the Y axis; after the deviation reaches 75°, it is set to 85° for the final video sequence. The acquired video sequences are given in Fig. 22. In these video sequences, the motion patterns of the object are similar to that of the ball shown in Fig. 9a, except that the rotation angular velocity is 3.55 rad/s.

Fig. 22

Example frames for the sight axis deviation under ccw rotation on the horizontal plane. Each video sequence is illustrated only by picking up four frames; the frame number is indicated under each image

As related to Fig. 22, the output curves of DRPNN are plotted in Fig. 23. From these curves, we find that DRPNN can perceive the rotational motion of the object. When the sight axis deviation angle is no larger than 30°, the output curves of DRPNN are quasi-sinusoidal (Fig. 23a, b). However, as the sight axis deviation angle increases, the output curve of DRPNN gradually changes into a square-like waveform (Fig. 23c–e). When the sight axis is almost parallel to the rotation axis of the ball, the output curve is close to a square wave (Fig. 23f). This is because, when the sight axis deviates from the Z axis, the projection of the rotating object changes from a swinging line to an ellipse, and the perceived depth rotation cues are gradually reduced. In the extreme case where the sight axis is perpendicular to the rotating plane of the object, the projection of the object forms a circle, so that the depth rotation pattern fully disappears from the field of view. All the test results indicate that the spatio-temporal change of depth rotation is quasi-sinusoidal only when the sight axis of the camera deviates slightly.

Fig. 23

Output curves of DRPNN. Each subgraph is acquired by DRPNN through the corresponding video sequence in Fig. 22

5.5 Comparative analysis

As far as we know, there is no appropriate computational model for depth rotation perception in the literature to date. Here, we therefore compare DRPNN with three recent motion perception neural networks, i.e., Beardsley's neural network [58], the LGMD model [55] and RMPNN [10]. For the comparison, the four depth rotation patterns in Fig. 9 are employed, i.e., ccw and cw depth rotation of the ball and ccw and cw depth rotation of the rectangle; more details can be found in Sect. 5.2. We emphasize that in Fig. 9a, b, the ball makes ccw and cw depth motion from frame 32 to 60 and frame 34 to 61, respectively; in Fig. 9c, d, the rectangle does so from frame 31 to 52 and from frame 36 to 57, respectively.

A. Beardsley's neural network

Beardsley's model is a conventional three-layer back-propagation neural network. Its input stimuli are motion patterns represented by idealized optic flow, while its output layer includes sixteen MSTd units which prefer different types of motion patterns. The input optic flow caused by the depth rotation in each video sequence is acquired by the Lucas-Kanade method. The resulting output curves of the MSTd units, corresponding to Fig. 9, are displayed in Fig. 24a–d. Comparing Fig. 10 with Fig. 24a–d shows that Beardsley's neural network cannot respond to depth rotation in the real scene test, since the training samples of the network are required to be idealized [58] and the learned weight matrices only suit idealized optic flow samples in virtual environments. Therefore, Beardsley's model fails to perceive depth rotation in real scenes.

Fig. 24

Output curves of the three compared neural networks: ad Beardsley’s neural network, eh LGMD model, and il RMPNN

B. LGMD model

LGMD mainly consists of four neural layers and one neuron, whose input stimuli are the frames extracted from video sequences. Based on the four video sequences in Fig. 9, LGMD generates the four output excitation curves in Fig. 24e–h. These curves show that LGMD cannot correctly detect the spatio-temporal energy change of depth rotation. We take the video sequence in Fig. 9c as an example to analyze the performance of LGMD against depth rotation. The corresponding excitation curve indicates that, when the rectangle is in depth rotation on the horizontal plane, LGMD has no response for a long time, but it can become excitatory and discharge membrane potentials, since any depth rotation contains an approaching motion component [24, 28, 32]. Therefore, depth rotation can also make LGMD excitatory when the object moves toward the video camera. In Fig. 9c, when the rectangle passes through the two segments from P2 to P3 and from P8 to P1 (Fig. 18), it triggers LGMD to generate a collision alarm, which illustrates that LGMD is effective for collision detection.

C. RMPNN

RMPNN includes two types of sub-networks: ccwRMPNN, which responds to ccw rotational motion, and cwRMPNN, which reacts to cw rotational motion. Its output reflects the rotation sensitive neuron's preference for rotational motion on the fronto-parallel plane. The output curves, obtained from the visual frame stimuli extracted from the video sequences in Fig. 9, are shown in Fig. 24i–l. The curves indicate that neither ccwRMPNN nor cwRMPNN responds to any depth rotation, since RMPNN identifies its preferred motion patterns by detecting the continuous change in the translational motion direction on the fronto-parallel plane [10]. In particular, the left/right translational and approaching/receding motion cues generated by depth rotation on the horizontal plane cannot make RMPNN excitatory. Therefore, RMPNN cannot respond to depth rotation in the field of view.

In summary, compared with DRPNN, the above three motion perception neural networks exhibit their intrinsic characteristics and also expose their defects in solving the problem of depth rotation detection. Based on the above comparative experiments, we can draw the following conclusions: (1) as a specially designed novel computational model, DRPNN is suitable for detecting the spatio-temporal energy change of depth rotation in real scenes; (2) Beardsley's neural network is designed to recognize motion patterns represented by idealized optic flow, but struggles to detect unknown patterns of depth rotation in non-idealized scenes; (3) as a specific collision detection neural network, LGMD can detect the approaching motion contained in depth rotation, but fails to perceive the spatio-temporal energy change of depth rotation; and (4) even though RMPNN, as a specific neural network for rotational motion, can recognize rotational motion patterns, it cannot detect the depth rotation of an object.

5.6 Discussion

In the above sections, the presented DRPNN has been thoroughly examined with several types of depth rotation video sequences under various conditions. All of these experiments have verified the reliable ability of DRPNN in detecting depth rotation motion. The experimental results indicate that the properties of DRPNN coincide with most of the main functional properties of DRS neurons [7, 9, 11, 12], e.g., depth rotation detection, motion direction selection, position invariance, and sensitivity to rotation speed and starting point. DRPNN has also been compared with three state-of-the-art motion perception models, and the comparative results demonstrate that DRPNN is effective for depth rotation object detection.

Although DRPNN can simply simulate some properties of visual information processing in biological vision systems, it cannot avoid two common defects in the field of artificial visual neural networks: (1) when an object is in depth rotation at an extremely slow or fast rotation angular velocity, the extraction of motion cues is difficult, and hence DRPNN cannot correctly capture the motion characteristics of the moving object; and (2) DRPNN can only recognize the depth rotation of a moving object on certain planes, including the horizontal, left diagonal, sagittal, and right diagonal planes, which means that it might not work well for rotation on other planes.

On the other hand, the projective geometry hypothesis proposed by Johansson et al. [29] holds that, when a line segment is in depth rotation, the change of its projected length is similar to a sinusoidal curve; this has been confirmed by our experiments. Thus, DRPNN is an alternative model for depth rotation object detection.

6 Conclusion

Although many computational models have been developed for motion pattern detection, studies on how to detect the depth rotation pattern of an object are still rare. Thus, this work aims to develop a novel depth rotation perception neural network (DRPNN) in order to deal with the hard problem of depth rotation perception in computer vision. Since this work is consistent with related studies on visual information processing in biological vision systems, it can shed some light on the building of artificial vision systems which integrate visual neurophysiological findings with computer vision technologies for such tasks as visual motion perception, visual motion pattern recognition, intelligent video surveillance, autonomous robots, and so on.

Inspired by the internal structure of the mammalian retina and the functional properties of depth rotation sensitive neurons in neurophysiology, DRPNN is proposed to both simulate the framework of hierarchical visual information processing and detect the spatio-temporal energy change of an object's depth rotation in the field of view. Comprehensive experiments are used to examine DRPNN's performance characteristics. Three points can be drawn from the experimental results: (1) DRPNN can recognize the depth rotation motion pattern of an object; (2) DRPNN is robust to the object's rotating plane and motion position in the field of view; and (3) DRPNN is sensitive to the object's rotation angular velocity and starting point on the rotation trajectory, as well as to the camera's sight axis deviation. These intrinsic properties of DRPNN can simply explain some functional properties of depth rotation sensitive neurons in the posterior parietal association cortex of primates. As the first bio-inspired computational model for depth rotation perception, this research is a significant step toward both intensively understanding visual information processing mechanisms in biological vision systems and probing into bio-inspired computational models for depth rotation object detection. In the future, DRPNN can be extended by integrating other bio-inspired visual neural networks for complex visual detection tasks, and it can be used to construct artificial vision systems for engineering applications, e.g., gear/propeller rotation fault monitoring.