
4.1 Introduction

Target tracking methods are generally divided into region-based, feature-based, deformable-template-based, and model-based approaches [1]. Typical algorithms include Camshift [2, 3] and SIFT [4, 5].

The sparse coding model with a complete basis requires orthogonal basis functions [6]. It does not reflect the internal structure and characteristics of images, and it also yields less sparsity [7]. The overcomplete model is more consistent with the mechanism of visual feature extraction and has good sparse approximation performance [8, 9]. However, the asymmetry between the input space and the encoding space increases the difficulty of sparse decomposition and of solving the model [10, 11].

To address these problems, we solve the overcomplete model with the energy-based modeling method, and we express visual features with the response coefficient matrix instead of the basis function matrix, which avoids the difficulties of sparse decomposition and model solution.

4.2 Overcomplete Sparse Coding Model

The sparse coding model is:

$$ I = \sum\limits_{i = 1}^{m} {A_{i} s_{i} } + N $$
(4.1)

where \( I \) is an \( n \)-dimensional natural image, \( A_{i} \) is a basis function represented by an \( n \)-dimensional vector, \( N \) is Gaussian noise, \( s_{i} \) is the response coefficient, and \( m \) is the number of basis functions. If \( m = n \), formula (4.1) is the sparse coding model with a complete basis; if \( m > n \), the basis is redundant and formula (4.1) becomes the overcomplete sparse coding model.
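For concreteness, the following minimal NumPy sketch instantiates formula (4.1) in the overcomplete case \( m > n \); the dimensions, sparsity level, and noise scale are illustrative assumptions rather than values used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 128, 512                          # patch dimension n and number of basis functions m (m > n)
A = rng.standard_normal((n, m))          # overcomplete basis; column i is A_i
s = np.zeros(m)
active = rng.choice(m, size=10, replace=False)
s[active] = rng.standard_normal(10)      # sparse response coefficients s_i (most are zero)
noise = 0.01 * rng.standard_normal(n)    # Gaussian noise N

I = A @ s + noise                        # formula (4.1): I = sum_i A_i s_i + N
print(I.shape)                           # (128,) -- an n-dimensional image patch
```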

We assume that \( W \) is the receptive field matrix; under the complete-basis model, \( A = W^{ - 1} \). Under the overcomplete-basis model, however, \( A \) is a redundant matrix, so solving for \( A \) directly is very difficult.

To address this problem, we define an energy-based model through the logarithm of the probability density function, as in formula (4.2):

$$ \log p\left( x \right) = \sum\limits_{k = 1}^{m} {\alpha_{k} G\left( {w_{k}^{T} x} \right)} + Z\left( {w_{1} , \ldots ,w_{m} ,\alpha_{1} , \ldots ,\alpha_{m} } \right) $$
(4.2)

where \( x \) is a single data sample, \( n \) is the dimension of the sample data, \( m \) is the number of receptive fields, the vector \( w_{k} = \left( {w_{k1} , \ldots ,w_{kn} } \right) \) is constrained to unit norm, \( Z \) is the normalization constant determined by the \( w_{i} \) and \( \alpha_{i} \), \( G \) is a function measuring the sparsity of the neuron response \( s \), and the \( \alpha_{i} \) are estimated together with the \( w_{i} \).

In the overcomplete case, computing the normalization constant \( Z \) is very difficult. We therefore adopt score matching to estimate the receptive fields. The score function is defined as the gradient of the logarithm of the probability density function with respect to the data:

$$ \psi \left( {x;W,\alpha_{1} , \ldots ,\alpha_{m} } \right) = \nabla_{x} \log p\left( {x;W} \right) = \sum\limits_{k = 1}^{m} {\alpha_{k} w_{k} g\left( {w_{k}^{T} x} \right)} $$
(4.3)

where \( g \) is the first-order partial derivative of \( G \).
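The sketch below implements the unnormalized log-density of formula (4.2) and the score function of formula (4.3) in Python/NumPy. The choice \( G(u) = -\log \cosh (u) \) is an assumption made here for illustration, since the chapter does not fix a particular sparsity measure, and all sizes are toy values.

```python
import numpy as np

def G(u):
    # sparsity measure for the neuron response; G(u) = -log cosh(u) is a common
    # choice, assumed here because the chapter does not fix a particular G
    return -np.log(np.cosh(u))

def g(u):
    # first-order derivative of G
    return -np.tanh(u)

def log_p_unnormalized(x, W, alpha):
    # formula (4.2) without the normalization constant Z
    return np.sum(alpha * G(W @ x))

def score(x, W, alpha):
    # formula (4.3): psi(x; W, alpha) = sum_k alpha_k * w_k * g(w_k^T x)
    return (alpha * g(W @ x)) @ W          # an n-dimensional vector

# toy check with random parameters (sizes are illustrative only)
rng = np.random.default_rng(0)
n, m = 128, 512
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # unit-norm rows w_k
alpha = np.ones(m)
x = rng.standard_normal(n)
print(log_p_unnormalized(x, W, alpha), score(x, W, alpha).shape)
```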

The objective function is obtained from the squared distance between the score function of the parametric model and that of the sample data:

$$ \begin{aligned} \tilde{J} = & \sum\limits_{k = 1}^{m} {\alpha_{k} \frac{1}{T}} \sum\limits_{t = 1}^{T} {g^{\prime}\left( {w_{k}^{T} x\left( t \right)} \right)} \\ & + \frac{1}{2}\sum\limits_{j,k = 1}^{m} {\alpha_{j} \alpha_{k} w_{j}^{T} w_{k} \frac{1}{T}\sum\limits_{t = 1}^{T} {g\left( {w_{k}^{T} x\left( t \right)} \right)g\left( {w_{j}^{T} x\left( t \right)} \right)} } \\ \end{aligned} $$
(4.4)

where \( x\left( 1 \right),x\left( 2 \right), \ldots ,x\left( T \right) \) are \( T \) samples.
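A vectorized sketch of the sample objective (4.4) is given below, again assuming \( G(u) = -\log \cosh (u) \) so that \( g = G^{\prime} \) and \( g^{\prime} = G^{\prime\prime} \); the data are random stand-ins for whitened image patches.

```python
import numpy as np

def g(u):
    return -np.tanh(u)                   # g = G' for G(u) = -log cosh(u)

def g_prime(u):
    return np.tanh(u) ** 2 - 1.0         # g' = G''

def score_matching_objective(W, alpha, X):
    """Sample version of formula (4.4).
    W: (m, n) receptive fields, alpha: (m,), X: (n, T) whitened samples x(1)..x(T)."""
    T = X.shape[1]
    Y = W @ X                                        # y_k(t) = w_k^T x(t), shape (m, T)
    term1 = np.sum(alpha * g_prime(Y).mean(axis=1))  # first sum of (4.4)
    C = g(Y) @ g(Y).T / T                            # C[j, k] = mean_t g(y_j(t)) g(y_k(t))
    term2 = 0.5 * np.sum(np.outer(alpha, alpha) * (W @ W.T) * C)
    return term1 + term2

# toy usage with random data standing in for whitened image patches
rng = np.random.default_rng(0)
n, m, T = 64, 256, 1000
X = rng.standard_normal((n, T))
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(score_matching_objective(W, np.ones(m), X))
```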

Based on the above analysis, estimating the receptive fields reduces to finding the \( W \) that minimizes the objective function.

We use the gradient descent algorithm to minimize the objective function:

$$ W\left( {t + 1} \right) = W\left( t \right) - \eta \left( t \right)\frac{{\partial \tilde{J}}}{\partial W} $$
(4.5)

where \( \eta \left( t \right) \) is the learning rate, which varies with time or with the number of iterations.

Algorithm 1 gives the learning process for the overcomplete set \( W \); a code sketch of the main iteration follows the algorithm.

Algorithm 1: Learning of the overcomplete set

Input: Sample images

Output: Overcomplete set \( W \)

Steps:

1. Randomly sample the sample images to obtain the training patches;

2. Whiten the patches by the principal component analysis (PCA) method, and project them into the whitened space;

3. Select an initial matrix \( W_{s} \), initialize its row vectors to unit norm, and set the error threshold \( \varepsilon \);

4. Update \( W \) according to formula (4.5), renormalize its rows to unit norm, and update the parameters \( \alpha \);

5. If \( norm\left( {\Delta W} \right) \le \varepsilon \), stop the iteration; otherwise, return to step 4;

6. Stop learning and project the learning result \( W_{s} \) back into the original image space to obtain the overcomplete set \( W \).
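The sketch below illustrates steps 3–5 of Algorithm 1 on data that are assumed to be already whitened (step 2). It keeps \( \alpha \) fixed at 1 for simplicity, uses \( G(u) = -\log \cosh (u) \), and differentiates (4.4) analytically under that choice; it is an illustrative implementation, not the authors' original code.

```python
import numpy as np

def g(u):    return -np.tanh(u)                                     # g = G'
def g_p(u):  return np.tanh(u) ** 2 - 1.0                           # g'
def g_pp(u): return 2.0 * np.tanh(u) * (1.0 - np.tanh(u) ** 2)      # g''

def objective_and_grad(W, alpha, X):
    """Value of (4.4) and its gradient with respect to W (alpha held fixed)."""
    T = X.shape[1]
    Y = W @ X                                   # (m, T), y_k(t) = w_k^T x(t)
    gY, gpY, gppY = g(Y), g_p(Y), g_pp(Y)
    C = gY @ gY.T / T                           # C[j,k] = mean_t g(y_j) g(y_k)
    D = W @ W.T                                 # D[j,k] = w_j^T w_k
    J = np.sum(alpha * gpY.mean(axis=1)) + 0.5 * np.sum(np.outer(alpha, alpha) * D * C)
    # gradient of the two terms of (4.4), derived for the -log cosh choice of G
    grad = gppY @ X.T / T                       # from the first term
    grad += (C * alpha[None, :]) @ W            # from differentiating w_j^T w_k
    M = (D * alpha[None, :]) @ gY               # M[i,t] = sum_k alpha_k (w_i^T w_k) g(y_k(t))
    grad += (gpY * M) @ X.T / T                 # from differentiating g(y_i(t))
    return J, alpha[:, None] * grad

def learn_overcomplete_set(X_white, m, eta0=0.1, eps=1e-4, max_iter=500, seed=0):
    """Steps 3-5 of Algorithm 1 on already whitened data X_white of shape (n, T)."""
    rng = np.random.default_rng(seed)
    n = X_white.shape[0]
    W = rng.standard_normal((m, n))
    W /= np.linalg.norm(W, axis=1, keepdims=True)       # unit-norm initial vectors
    alpha = np.ones(m)                                  # kept fixed here for simplicity
    for t in range(max_iter):
        eta = eta0 / (1.0 + 0.01 * t)                   # decaying learning rate eta(t)
        _, grad = objective_and_grad(W, alpha, X_white)
        W_new = W - eta * grad                          # formula (4.5)
        W_new /= np.linalg.norm(W_new, axis=1, keepdims=True)
        if np.linalg.norm(W_new - W) <= eps:            # step 5: stopping criterion
            return W_new
        W = W_new
    return W

# toy run on random "whitened" data (real use would feed PCA-whitened image patches)
X = np.random.default_rng(1).standard_normal((64, 2000))
W = learn_overcomplete_set(X, m=128, max_iter=50)
print(W.shape)      # (128, 64): an overcomplete set in the whitened space
```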

4.3 Target Tracking Algorithm Based on the Visual Perception

Owing to the sparse and competitive response characteristics of vision, only a small number of neurons are activated to portray the internal structure and prior properties of images [12, 13]. We select the \( N \) neurons with the largest responses as the visual feature representation of an image, as shown in Fig. 4.1.

Fig. 4.1

Visual feature extraction. a The neuron responses caused by the image. b The representation of the visual feature
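As a sketch of this feature representation, the function below keeps only the \( N \) strongest responses of \( s = Wx \) and zeroes the rest; the value of \( N \) and the matrix sizes are illustrative assumptions.

```python
import numpy as np

def visual_feature(patch, W, N=32):
    """Keep only the N largest-magnitude neuron responses s = W x; the rest are set
    to zero, mimicking the sparse, competitive response used as the visual feature."""
    s = W @ patch                          # responses of all neurons
    feature = np.zeros_like(s)
    top = np.argsort(np.abs(s))[-N:]       # indices of the N strongest responses
    feature[top] = s[top]
    return feature

# toy usage with a random overcomplete set and a random whitened patch
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 128))
patch = rng.standard_normal(128)
print(np.count_nonzero(visual_feature(patch, W)))   # 32 active neurons
```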

We define the difference between the neuron responses of the video sequence image and the background image as follows:

$$ h = \left| {s_{vi} - s_{gi} } \right| $$
(4.6)

where \( s_{vi} \) is the response of the \( i \)th video sequence image patch and \( s_{gi} \) is the response of the \( i \)th background image patch.

The dynamic threshold is as follows:

$$ \delta = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {s_{vi} - s_{gi} } \right|} $$
(4.7)
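A minimal sketch of formulas (4.6) and (4.7) is given below, assuming one scalar response per patch for illustration:

```python
import numpy as np

def response_difference(s_v, s_g):
    """Formulas (4.6) and (4.7): per-patch response difference h and dynamic threshold delta.
    s_v and s_g hold one response value per patch for the video frame and the background."""
    h = np.abs(s_v - s_g)      # formula (4.6), computed for every patch i at once
    delta = h.mean()           # formula (4.7): mean difference over the n patches
    return h, delta

# patches whose difference exceeds the dynamic threshold are treated as target patches
s_v = np.array([0.1, 2.3, 0.2, 1.9])
s_g = np.array([0.1, 0.1, 0.2, 0.2])
h, delta = response_difference(s_v, s_g)
print(h > delta)               # [False  True False  True]
```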

The target tracking algorithm (TTA) is as follows (a per-frame code sketch is given after the algorithm):

Algorithm 2: Target tracking algorithm

Input: Video sequence image and background image

Output: The results of moving target tracking

Steps:

1. Sequentially sample the video sequence image and the background image;

2. Whiten the samples by the principal component analysis (PCA) method;

3. Calculate the neuron responses of the video sequence image and the background image with the formula \( s = Wx \), and keep the same number \( N \) of largest neuron responses for each;

4. Calculate the difference \( h \) between the neuron responses of the video sequence image patches and the background image patches at the same location, and compare it with the dynamic threshold \( \delta \): if \( h > \delta \), output the perception result; otherwise, take no further action;

5. Display the recognition result of the target;

6. Move to the next frame of the video sequence and return to step 1.
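The following sketch strings the steps of Algorithm 2 together for a single frame. The whitening matrix V, the learned overcomplete set W, and the value of N are assumed to come from the learning stage (here they are random stand-ins), and the reduction of the per-neuron differences to one value per patch (a mean over the N selected responses) is an interpretation made for illustration.

```python
import numpy as np

def extract_patches(image, size=16):
    """Non-overlapping size x size patches, scanned left to right, top to bottom."""
    H, Wd = image.shape
    patches = [image[r:r + size, c:c + size].reshape(-1)
               for r in range(0, H - size + 1, size)
               for c in range(0, Wd - size + 1, size)]
    return np.array(patches)                       # (num_patches, size*size)

def track_frame(frame, background, W, V, N=32):
    """One pass of Algorithm 2 for a single frame.
    W: learned overcomplete set; V: whitening matrix from the PCA preprocessing step."""
    Pf = extract_patches(frame)                    # step 1: sequential sampling
    Pb = extract_patches(background)
    Sf = W @ (V @ Pf.T)                            # step 3: neuron responses s = W x
    Sb = W @ (V @ Pb.T)
    keep = np.argsort(np.abs(Sf), axis=0)[-N:]     # N largest responses per frame patch
    sv = np.take_along_axis(Sf, keep, axis=0)      # selected responses, video patches
    sg = np.take_along_axis(Sb, keep, axis=0)      # same neurons, background patches
    h = np.abs(sv - sg).mean(axis=0)               # step 4: difference per patch (4.6)
    delta = h.mean()                               # dynamic threshold (4.7)
    return h > delta                               # True where a patch belongs to the target

# toy usage with random data standing in for a 512 x 512 frame and background
rng = np.random.default_rng(0)
frame, background = rng.random((512, 512)), rng.random((512, 512))
V = rng.standard_normal((128, 256))                # stand-in whitening/dimension-reduction matrix
W = rng.standard_normal((512, 128))                # stand-in overcomplete set
mask = track_frame(frame, background, W, V)
print(mask.shape, mask.sum())                      # one decision per 16 x 16 patch
```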

The flow chart of TTA is shown in Fig. 4.2.

Fig. 4.2

The flow chart of TTA

4.4 Experiment

4.4.1 Learning of Overcomplete Set

Experimental environment: software, Matlab 7.0; operating system, Windows XP; CPU, 1.86 GHz; memory, 1 GB; image resolution, 512 × 512.

Experimental process: First, we select 10 video sequence images and randomly sample each image with a 16 × 16 sliding window, obtaining 5,000 patches of 16 × 16 pixels per image and a 256 × 50,000 sampling data set from the 10 images. We then preprocess the data set with the PCA method, centering and whitening the patches and reducing the dimension to 128. The resulting 128 × 50,000 data set is used as the input for training the overcomplete set. Finally, an overcomplete set with 512 receptive fields is estimated with the energy-based model; the result is shown in Fig. 4.3.
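A sketch of this preprocessing pipeline (random 16 × 16 patch sampling followed by PCA centering, whitening, and reduction to 128 dimensions) is given below; the toy run uses random images and fewer patches per image to keep it fast.

```python
import numpy as np

def sample_patches(images, patches_per_image=5000, size=16, seed=0):
    """Randomly sample size x size patches from each image; returns (size*size, total)."""
    rng = np.random.default_rng(seed)
    cols = []
    for img in images:
        H, W = img.shape
        rows = rng.integers(0, H - size, patches_per_image)
        left = rng.integers(0, W - size, patches_per_image)
        for r, c in zip(rows, left):
            cols.append(img[r:r + size, c:c + size].reshape(-1))
    return np.array(cols).T                          # e.g. (256, 50000) for 10 images

def pca_whiten(X, out_dim=128):
    """Center the data, then whiten with PCA and keep the leading out_dim components."""
    Xc = X - X.mean(axis=1, keepdims=True)
    C = Xc @ Xc.T / Xc.shape[1]                      # covariance matrix
    eigval, eigvec = np.linalg.eigh(C)
    order = np.argsort(eigval)[::-1][:out_dim]       # largest out_dim eigenvalues
    V = eigvec[:, order] / np.sqrt(eigval[order])    # whitening directions as columns
    return V.T @ Xc, V.T                             # whitened data (out_dim, T) and matrix

# toy usage with random images standing in for the 10 video sequence frames
rng = np.random.default_rng(1)
images = [rng.random((512, 512)) for _ in range(10)]
X = sample_patches(images, patches_per_image=500)    # fewer than 5,000 to keep the toy fast
X_white, V = pca_whiten(X, out_dim=128)
print(X.shape, X_white.shape)                        # (256, 5000) (128, 5000)
```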

Fig. 4.3

The learning of the overcomplete set

4.4.2 Target Tracking

From left to right and top to bottom, we sample each image with a 16 × 16 sliding window and obtain 1,024 patches of 16 × 16 pixels from each image.

We designed experiments for a simple background, target scale change, partial occlusion, and complete occlusion. The tracking results are shown in Figs. 4.4, 4.5, 4.6, and 4.7.

Fig. 4.4

Tracking result with a simple background

Fig. 4.5

Tracking result with target scale change

Fig. 4.6

Tracking result with partial occlusion

Fig. 4.7

Tracking result with complete occlusion

In Fig. 4.5, the scale and shape of the target change in the field of view. In Figs. 4.6 and 4.7, the target passes behind dissimilar and similar objects under partial and complete occlusion, respectively, so inter-class change occurs during the tracking process.

To verify the validity of TTA, we compared it with the typical SIFT and Camshift algorithms in terms of robustness, accuracy, and real-time performance.

4.4.3 Analysis of Results

As can be seen in Figs. 4.4, 4.5, 4.6, and 4.7, TTA, which is based on the visual perception mechanism, tracks the target stably under occlusion and target scale change. In Table 4.1, the erroneously tracked frames include frames in which a non-target is falsely detected and frames in which the target is missed, that is, false alarms and missed alarms; compared with SIFT and Camshift, the TTA algorithm improves the accuracy of target tracking. As shown in Table 4.2, the time consumption of the TTA algorithm is less than that of SIFT and slightly more than that of the classic Camshift, but it still meets the real-time requirement.

Table 4.1 Tracking statistics of the three algorithms
Table 4.2 Time consumption comparison of the three algorithms

4.5 Conclusion

By simulating the visual perception mechanism, we established a new target tracking algorithm, TTA, with improved accuracy and robustness. The TTA algorithm tracks the target stably under target scale change and occlusion, as well as under target deformation and inter-class exchange. In future work, we will extend this research with high-level visual semantics, such as attention and learning mechanisms.