
1 Introduction

On June 10, 2014, the Cisco Visual Networking Index [8] predicted that it would take an individual over 5 million years to watch the amount of video that will cross global IP networks each month in 2018; by then, nearly a million minutes of video content will cross the network every second. Video data is thus generated considerably faster than it can be analyzed. Video-based target tracking has drawn increasing interest for its many applications [9], such as video surveillance, traffic control, machine intelligence, and biomedicine.

In this paper, we aim to design a simple simulation of human vision by combining static information and motion information. Static information refers to the phase information [1] of color pairs and the topological property [4]; motion information refers to PCNN fusion based on optical flow. The following sections detail how each contributes to locating targets.

Figure 1 illustrates the structure of the proposed model. First, three channels (color pair RG, color pair BY, and the topological property) are extracted from the video frames. Phase information is then obtained by applying the inverse Fourier transform to the phase spectrum of the two color pairs and the topological property. Second, PCNN is used to fuse motion features derived from the optical flow direction: pulses propagate from outside to inside according to the directional difference in the video frame until a sufficiently large difference value is reached. Third, the saliency map is computed by smoothing the linear fusion of the phase information, the optical flow magnitude, and the direction fusion.

Fig. 1. Structure of the proposed model

2 Related Work

This section covers PCNN fusion based on optical flow and topological information extraction. Optical flow and the PCNN used in our model are briefly introduced before the computational processes of PCNN fusion and topological information extraction are described in detail.

2.1 Optical Flow

Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene. The concept of optical flow was introduced by the American psychologist James J. Gibson [2] in the 1940s to describe the visual stimulus provided to animals moving through the world.

Optical flow methods try to calculate, at every pixel, the motion between two image frames taken at times t and \( t + \delta t \). For the 2-dimensional case, a pixel at location \( (x,y,t) \) with intensity \( I(x,y,t) \) will have moved by \( \delta x \), \( \delta y \) and \( \delta t \) between the two frames, giving the following brightness constancy constraint:

$$ I\left( {x,y,t} \right) = I\left( {x + \delta x,y + \delta y,t + \delta t} \right) $$
(1)

Assuming the movement to be small, the image constraint at \( I(x,y,t) \) can be expanded with a Taylor series to obtain:

$$ I(x + \delta x,y + \delta y,t + \delta t) = I(x,y,t) + \delta x\frac{\partial I}{\partial x} + \delta y\frac{\partial I}{\partial y} + \delta t\frac{\partial I}{\partial t} + e(0) $$
(2)

Equations (1) and (2) result in the following Eq. (3), in which

$$ I_{x} = \frac{\partial I}{\partial x},\quad I_{y} = \frac{\partial I}{\partial y},\quad I_{t} = \frac{\partial I}{\partial t},\quad V_{x} = \frac{\delta x}{\delta t},\quad V_{y} = \frac{\delta y}{\delta t}, \qquad I_{x} V_{x} + I_{y} V_{y} + I_{t} = 0 $$
(3)
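For clarity, Eq. (3) follows from Eqs. (1) and (2) by cancelling \( I(x,y,t) \) on both sides, dropping the higher-order term \( e(0) \), and dividing by \( \delta t \):

$$ \delta x\frac{\partial I}{\partial x} + \delta y\frac{\partial I}{\partial y} + \delta t\frac{\partial I}{\partial t} \approx 0 \quad \Longrightarrow \quad \frac{\delta x}{\delta t}\frac{\partial I}{\partial x} + \frac{\delta y}{\delta t}\frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0 $$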

This single equation with two unknowns is known as the aperture problem of optical flow algorithms and cannot be solved on its own. To find the optical flow, another set of equations is needed, given by some additional constraint. The Horn-Schunck algorithm [10] assumes smoothness in the flow over the whole image: the flow is formulated as a global energy functional, which is then minimized.

$$ E = \iint \left[ (I_{x} V_{x} + I_{y} V_{y} + I_{t})^{2} + \alpha^{2} \left( \left\| \nabla V_{x} \right\|^{2} + \left\| \nabla V_{y} \right\|^{2} \right) \right] dx\, dy $$
(4)

In Eq. (4), the parameter \( \alpha \) is a regularization constant. Larger values of \( \alpha \) lead to a smoother flow. This functional can be minimized by solving the associated multi-dimensional Euler-Lagrange equations:

$$ \alpha^{2} \Delta V_{x} = I_{x}^{2} V_{x} + I_{x} I_{y} V_{y} + I_{t} I_{x}, \qquad \alpha^{2} \Delta V_{y} = I_{x} I_{y} V_{x} + I_{y}^{2} V_{y} + I_{t} I_{y} $$
(5)

where subscripts again denote partial differentiation and \( \Delta = \partial^{2}/\partial x^{2} + \partial^{2}/\partial y^{2} \) denotes the Laplace operator. In practice the Laplacian is approximated numerically using finite differences and may be written \( \Delta V_{x} = \bar{V}_{x} - V_{x} \), \( \Delta V_{y} = \bar{V}_{y} - V_{y} \), where \( \bar{V}_{x} \) and \( \bar{V}_{y} \) are weighted averages of \( V_{x} \) and \( V_{y} \) calculated in a neighborhood around the pixel at location \( (x,y) \). Using this notation, the above equation system may be written:

$$ (I_{x}^{2} + \alpha^{2} )V_{x} + I_{x} I_{y} V_{y} = \alpha^{2} \bar{V}_{x} - I_{t} I_{x} ,I_{x} I_{y} V_{x} + (I_{y}^{2} + \alpha^{2} )V_{y} = \alpha^{2} \bar{V}_{y} - I_{t} I_{y} $$
(6)

However, since the solution depends on the neighboring values of the flow field, it must be repeated once the neighbors have been updated. The following iterative scheme is derived:

$$ V_{x}^{n + 1} = \bar{V}_{x}^{n} - I_{x} \frac{I_{x} \bar{V}_{x}^{n} + I_{y} \bar{V}_{y}^{n} + I_{t}}{\alpha^{2} + I_{x}^{2} + I_{y}^{2}}, \qquad V_{y}^{n + 1} = \bar{V}_{y}^{n} - I_{y} \frac{I_{x} \bar{V}_{x}^{n} + I_{y} \bar{V}_{y}^{n} + I_{t}}{\alpha^{2} + I_{x}^{2} + I_{y}^{2}} $$
(7)

where the superscript n + 1 denotes the next iteration to be calculated and n denotes the last calculated result.
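As an illustration, the iterative scheme of Eq. (7) can be sketched in NumPy/SciPy as below. The derivative kernels, the averaging kernel, and the default values of alpha and n_iter are illustrative assumptions, not the settings used in this paper.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(im1, im2, alpha=1.0, n_iter=100):
    """Minimal Horn-Schunck sketch following Eqs. (1)-(7).
    im1, im2: consecutive grayscale frames as 2-D float arrays."""
    im1 = im1.astype(np.float64)
    im2 = im2.astype(np.float64)

    # Spatial and temporal derivatives (simple finite differences).
    Ix = convolve(im1, np.array([[-1.0, 1.0]]))
    Iy = convolve(im1, np.array([[-1.0], [1.0]]))
    It = im2 - im1

    # Kernel giving the neighbourhood average V_bar used in Eq. (7).
    avg = np.array([[0.0, 0.25, 0.0],
                    [0.25, 0.0, 0.25],
                    [0.0, 0.25, 0.0]])

    Vx = np.zeros_like(im1)
    Vy = np.zeros_like(im1)
    for _ in range(n_iter):
        Vx_bar = convolve(Vx, avg)
        Vy_bar = convolve(Vy, avg)
        num = Ix * Vx_bar + Iy * Vy_bar + It   # shared numerator of Eq. (7)
        den = alpha ** 2 + Ix ** 2 + Iy ** 2   # shared denominator of Eq. (7)
        Vx = Vx_bar - Ix * num / den
        Vy = Vy_bar - Iy * num / den
    return Vx, Vy
```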

2.2 Pulse-Coupled Neural Network

Figure 2 illustrates the structure of the unit-linking PCNN, adapted from [3].

Fig. 2. Unit-linking PCNN architecture [3]

The unit-linking PCNN architecture can be described by the following formulas:

$$ F_{j} = dif\_dir_{j}, \qquad L_{j} = \mathrm{step}\left[ \sum_{k \in N(j)} Y_{k}(t) \right] = \begin{cases} 1 & \text{if } \sum_{k \in N(j)} Y_{k}(t) > 0 \\ 0 & \text{otherwise} \end{cases} $$
(8)
$$ U_{j} = F_{j}(1 + \beta_{j} L_{j}), \qquad Y_{j} = \mathrm{step}(U_{j} - \theta_{j}) = \begin{cases} 1 & \text{if } U_{j}(t) \ge \theta_{j}(t) \\ 0 & \text{otherwise} \end{cases} $$
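A minimal sketch of one iteration of Eq. (8) is given below; the linking strength beta and the handling of the threshold theta (kept fixed here) are simplifying assumptions rather than the authors' exact settings.

```python
import numpy as np
from scipy.ndimage import convolve

def unit_linking_pcnn_step(F, Y, theta, beta=0.2):
    """One unit-linking PCNN iteration following Eq. (8).
    F: feeding input per pixel (e.g. dif_dir); Y: previous binary fire map;
    theta: firing threshold (scalar or per-pixel); beta: linking strength."""
    four_neigh = np.array([[0, 1, 0],
                           [1, 0, 1],
                           [0, 1, 0]], dtype=float)
    # L_j = 1 if any 4-neighbour fired in the previous iteration.
    L = (convolve(Y.astype(float), four_neigh, mode='constant') > 0).astype(float)
    # Internal activity and firing decision.
    U = F * (1.0 + beta * L)
    Y_new = (U >= theta).astype(float)
    return Y_new, L, U
```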

2.3 PCNN Fusion Based on Optical Flow

We extract moving targets from the optical flow using both its magnitude and its direction. Since the PCNN has a fusion capability, this section combines the PCNN fusion characteristic with the quantized optical flow direction information obtained after pre-processing. The target and the background, each with its own consistent motion characteristic, are fused separately, realizing the separation of foreground and background and thus segmenting the moving targets in the optical flow. The calculation proceeds in the following three steps; a code sketch follows Step 3.

Step1: Optical flow field pre-processing: for each pixel, the optical flow is offset by a background vector pointing against the direction of the maximum optical flow, with magnitude equal to 1/10 of the maximum optical flow magnitude.

Step2: Direction difference quantization: for a pixel \( \left( {x,y} \right) \) with optical flow value \( \left( {u,v} \right) \), the direction of the optical flow is expressed as \( Ang = {\text{atan2}}(u,v)/\pi \), so that the direction is indicated by a value in the range -1 to 1, distributed as shown in Fig. 3.

Fig. 3. Direction difference quantization

The direction difference mentioned above is computed as in Eq. (9): the result is the absolute value of the difference between the current pixel's direction value and the mean direction of its fired 4-neighbors; applying Eq. (9) then yields a direction difference \( dif\_dir_{i} \) that increases monotonically over the range 0 to 1.

$$ difAng_{i} = \left| Ang_{i} - \mathrm{mean}(Ang_{j}) \right|,\; j \in \Omega; \qquad dif\_dir_{i} = \min(difAng_{i},\, 2 - difAng_{i}) $$
(9)

Step3: Unit-linking PCNN fusing direction features: the input to the F channel is the direction difference \( dif\_dir_{i} \) between the current pixel's direction value \( Ang_{i} \) and the mean of the direction values of its 4-neighbors whose fire state is 1 (Eq. (9)). The L channel collects the fire information of the current pixel's 4-neighborhood.
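The sketch below illustrates one possible reading of Steps 1 and 2 and of Eq. (9); the exact pre-processing and the handling of fired neighbours in the authors' implementation may differ.

```python
import numpy as np

def preprocess_and_quantize(u, v):
    """Step 1-2 sketch: add a background vector opposing the maximum flow
    (1/10 of its magnitude), then quantize directions to [-1, 1]."""
    mag = np.hypot(u, v)
    iy, ix = np.unravel_index(np.argmax(mag), mag.shape)
    if mag[iy, ix] > 0:
        # Background vector against the maximum optical flow direction,
        # with 1/10 of its magnitude.
        u = u - u[iy, ix] / 10.0
        v = v - v[iy, ix] / 10.0
    # Ang = atan2(u, v) / pi, giving values in [-1, 1].
    return np.arctan2(u, v) / np.pi

def direction_difference(ang_i, fired_neighbour_angs):
    """Eq. (9): wrap-around difference to the mean direction of the fired
    4-neighbours, mapped monotonically into [0, 1]."""
    dif_ang = abs(ang_i - np.mean(fired_neighbour_angs))
    return min(dif_ang, 2.0 - dif_ang)
```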

Figure 4 shows the fusion result obtained with the PCNN fusion method.

Fig. 4. PCNN fusion effect

According to [6], the metric most widely used in computer vision to assess the performance of a binary classifier is the percentage of correct classification (PCC), which combines four values: the number of true positives (TP), counting correctly detected foreground pixels; the number of false positives (FP), counting background pixels incorrectly classified as foreground; the number of true negatives (TN), counting correctly classified background pixels; and the number of false negatives (FN), counting foreground pixels incorrectly classified as background.

$$ PCC = \frac{TP + TN}{TP + TN + FP + FN} $$
(10)
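For reference, PCC can be computed directly from a predicted binary mask and its ground truth, as in the short sketch below.

```python
import numpy as np

def pcc(pred, gt):
    """Percentage of correct classification, Eq. (10), for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)       # foreground correctly detected
    tn = np.sum(~pred & ~gt)     # background correctly classified
    fp = np.sum(pred & ~gt)      # background labelled as foreground
    fn = np.sum(~pred & gt)      # foreground labelled as background
    return (tp + tn) / (tp + tn + fp + fn)
```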

As shown in Fig. 5, the classification accuracy of the PCNN fusion of motion information is almost always above 95%.

Fig. 5. The effect of SVM classification

2.4 Topological Property

Chen Lin proposed the theory of topological perception in 1982 [7]. According to this theory, a stimulus is separated into different global wholes (a figure and a background) depending only on global properties, which can be described mathematically as topological properties such as connectivity. Literature [4] applied the connectivity of topological perception to visual attention. We improve the topological algorithm of [4] with the following three steps, to avoid the problem of selecting a segmentation threshold and to better filter out backgrounds containing two or more color tones; a code sketch follows Step 3.

Step1: The grayscale image converted from the color video frame is resized to 64 × 64.

Step2: Unit-linking PCNN extracting topological connectivity: the input to the F channel is the intensity difference \( dif\_I_{i} \) between the current pixel's intensity \( I_{i} \) and the mean of the intensity values of its 4-neighbors whose fire state is 1. The L channel collects the fire information of the current pixel's 4-neighborhood.

$$ F_{j} = dif\_I_{j}, \qquad L_{j} = \mathrm{step}\left[ \sum_{k \in N(j)} Y_{k}(t) \right] = \begin{cases} 1 & \text{if } \sum_{k \in N(j)} Y_{k}(t) > 0 \\ 0 & \text{otherwise} \end{cases} $$
(11)

Step3: The binary image computed by the PCNN filter in Step 2 is the input of the topological channel, which expresses connectivity.
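A rough sketch of Steps 1-3 is given below, reusing the unit_linking_pcnn_step sketch from Sect. 2.2; the threshold, linking strength, iteration count, and the approximation of the fired-neighbour mean by a plain 4-neighbour mean are assumptions for illustration only.

```python
import numpy as np
import cv2
from scipy.ndimage import convolve

def topological_channel(frame_bgr, beta=0.2, theta=0.15, n_iter=10):
    """Improved topological channel sketch (Steps 1-3)."""
    # Step 1: grayscale frame resized to 64 x 64.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    gray = cv2.resize(gray, (64, 64))

    # Step 2: feeding input dif_I, here approximated by the intensity
    # difference to the plain 4-neighbour mean.
    kernel = np.array([[0, 0.25, 0], [0.25, 0, 0.25], [0, 0.25, 0]])
    dif_I = np.abs(gray - convolve(gray, kernel))

    # Step 3: iterate the unit-linking PCNN; the final fire map is the
    # binary image expressing topological connectivity.
    Y = np.zeros_like(gray)
    for _ in range(n_iter):
        Y, _, _ = unit_linking_pcnn_step(dif_I, Y, theta, beta)
    return Y
```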

In Fig. 6, the second and third images are, respectively, the original topological channel and the improved topological channel of the first image. Different tones (grass and road) cannot all be filtered out in the original topological channel, which causes the pedestrians to sink into the grass. By inputting the intensity difference into the F channel of the PCNN, the improved topological channel filters out the grass and road successfully. The improvement yields a more accurate representation of topological connectivity.

Fig. 6. Examples of the improved topological channel

3 Algorithm Structure

The steps of the proposed model are as follows; a code sketch of Steps 3-5 is given after Eq. (13).

Step1: The grayscale image converted from the color video frame is resized to 64 × 64. The optical flow of consecutive grayscale images is calculated using the Horn-Schunck (HS) method, as in Sect. 2.1.

Step2: Optical flow field pre-processing: for each pixel, the optical flow is offset by a background vector pointing against the direction of the maximum optical flow, with magnitude equal to 1/10 of the maximum optical flow magnitude. The PCNN then fuses the pre-processed optical flow as in Sect. 2.3.

Step3: The topological channel T is computed using the PCNN as in Sect. 2.4. The two color pairs [1] are \( RG = R - G \) and \( BY = B - Y \), where \( R = r - \left( {g + b} \right)/2 \), \( G = g - \left( {r + b} \right)/2 \), \( B = b - \left( {r + g} \right)/2 \), \( Y = \left( {r + g} \right)/2 - \left| {r - g} \right|/2 - b \), and r, g, b are the red, green, and blue channels of the color image, respectively.

Step4: The phase spectrum is obtained by normalizing the Fourier transform of T, RG, and BY. Phase information is then obtained from the phase spectrum by the inverse Fourier transform.

$$ p = f^{-1} \{ P[ f(\mathrm{RG}, \mathrm{BY}, 0.4 \cdot \mathrm{T}) ] \} $$
(12)

Step5: The saliency map is computed by smoothing the linear fusion of the phase information p, the optical flow magnitude |OF|, and the direction fusion fus. In Eq. (13), \( \omega_{1} = 1.0,\omega_{2} = 1.2,\omega_{3} = 1.5,\sigma = 8 \).

$$ S\_Map = G(\sigma) * \left[ \omega_{1} \cdot p + \omega_{2} \cdot \left| \mathrm{OF} \right| + \omega_{3} \cdot fus \right]^{2} $$
(13)
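The sketch below strings Steps 3-5 together under stated assumptions: the frame is taken as an OpenCV BGR image, the three channels of Eq. (12) are combined by summing per-channel phase-only reconstructions, and the Gaussian smoothing G(σ) is realized with an OpenCV blur; the paper's exact channel combination and smoothing may differ.

```python
import numpy as np
import cv2

def saliency_map(frame_bgr, T, of_mag, fus, w1=1.0, w2=1.2, w3=1.5, sigma=8):
    """Steps 3-5 sketch: color pairs, phase information (Eq. (12)),
    smoothed fusion (Eq. (13)). T, of_mag, fus are assumed 64 x 64 maps."""
    frame = cv2.resize(frame_bgr.astype(np.float32) / 255.0, (64, 64))
    b, g, r = frame[..., 0], frame[..., 1], frame[..., 2]

    # Step 3: broadly tuned color channels and the two color pairs [1].
    R = r - (g + b) / 2
    G = g - (r + b) / 2
    B = b - (r + g) / 2
    Y = (r + g) / 2 - np.abs(r - g) / 2 - b
    RG, BY = R - G, B - Y

    # Step 4: phase-only reconstruction of each channel (one plausible
    # reading of Eq. (12)).
    p = np.zeros_like(RG)
    for ch in (RG, BY, 0.4 * T):
        F = np.fft.fft2(ch)
        p += np.abs(np.fft.ifft2(np.exp(1j * np.angle(F))))

    # Step 5: smoothed quadratic fusion, Eq. (13).
    s = (w1 * p + w2 * of_mag + w3 * fus) ** 2
    return cv2.GaussianBlur(s, (0, 0), sigma)
```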

4 Experimental Results

The proposed algorithm is implemented on our video database and compared with FT [5], Vibe [6], and PQFT [1].

4.1 Database

See Table 1.

Table 1. Information of database

4.2 Attention Detection Effects

In our experiments, one widely used saliency detection algorithm (FT [5]) and two common target tracking algorithms (Vibe [6], PQFT [1]) are compared with our algorithm. Figure 7 shows the results of these three methods and the proposed method.

Fig. 7. Video frames and their saliency maps

As can be seen in Fig. 7, the proposed detection yields more salient targets and a darker background than FT, Vibe, and PQFT. For the video Parachute, taken with a moving camera, FT focuses more on the bright light through the hole than on the flying parachute because it lacks motion information. Vibe attends to the edges of the flying parachute and the light, because the light is "moving" on the screen. The proposed model solves the problem of target tracking with a moving camera by exploiting the motion direction difference. For videos with a static camera, Vibe's results are sometimes quite good, such as the 84th frame of Walking; however, spurious targets appear frequently, such as the middle pedestrian in the 116th frame of Pedestrians. Although PQFT focuses on the moving targets correctly, some of its results are incomplete and its background is distracting.

4.3 Comparison of Attention Detection Models

To further illustrate the effectiveness of the proposed algorithm, which combines the visual attention model with PCNN and optical flow, we use the commonly used F-Measure to compare the proposed model with FT [5], Vibe [6], and PQFT [1]. Let G denote the ground-truth region and S the detected saliency region:

$$ \mathrm{Precision} = \frac{\left| G \cap S \right|}{\left| S \right|}, \qquad \mathrm{Recall} = \frac{\left| G \cap S \right|}{\left| G \right|}, \qquad \text{F-Measure} = \frac{2 \cdot p_{thr} \cdot r_{thr}}{p_{thr} + r_{thr}} $$
(14)
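F-Measure can be computed from binary masks as in the sketch below; the guard against empty masks is an assumption added for robustness.

```python
import numpy as np

def f_measure(saliency_mask, ground_truth):
    """Precision, Recall and F-Measure for binary masks, Eq. (14)."""
    S = saliency_mask.astype(bool)
    G = ground_truth.astype(bool)
    inter = np.sum(S & G)
    precision = inter / max(np.sum(S), 1)
    recall = inter / max(np.sum(G), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```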

Table 2 compares the proposed algorithm with FT [5], Vibe [6], and PQFT [1] in terms of F-Measure on different samples. For videos with a moving camera or background disturbance (Parachute and Birdfall), the F-Measure of the proposed model is the highest of the four models. On the videos Pedestrians and Walking, the proposed model is more effective than FT and PQFT, and is comparable with Vibe.

Table 2. F-Measure

5 Conclusion

This paper proposed a moving target tracking algorithm that combines a visual attention model with a pulse-coupled neural network and optical flow, and that achieves better tracking performance than traditional algorithms. Based on the relative motion between moving targets and the background, the target regions and the background are fused separately using the fusion ability of the PCNN. Meanwhile, the improved topological channel improves the filtering of backgrounds with multiple color tones. Experimental results show that the proposed method has a higher detection rate and a better ability to suppress the background.