
1 Introduction

In the past few years, traffic video surveillance technologies (mainly for vehicles) have attracted increasing attention. Applications based on these technologies have made rapid progress, and substantial resources have been invested in them.

Intelligent traffic video surveillance is a new generation of monitoring technology that uses cutting-edge computer vision techniques to analyze surveillance videos, make forecasts, and identify various abnormal conditions.

One of the key technologies for intelligent video surveillance is target recognition. Only after a target has been identified in the surveillance video can further analysis of its behaviors and characteristics be carried out.

Recently, multi-camera traffic video tracking systems, which track targets across multiple cameras, have become more and more popular. Vehicle target recognition is also an essential first step for further tracking and other analysis.

The local feature extraction and matching of vehicle targets are very important for vehicle tracking, especially across different cameras. However, vehicle feature extraction and recognition still face many challenges:

  • Feature diversity, such as shape, color, size, and fine differences between corresponding key points. Moreover, the appearance of a specific vehicle usually differs across viewing perspectives;

  • Occlusion and interference, such as occlusions between vehicles and interference from complex backgrounds. Especially in urban scenes, backgrounds are generally complex;

  • Computation versus accuracy in feature extraction. More robust features and higher accuracy require more computational resources, which is a major bottleneck for real-time vehicle surveillance.

In this paper, we propose a new vehicle recognition approach based on multi-feature fusion. The front faces of vehicles are used for this task (small inclination angles are allowed). Like human faces, vehicle front faces carry abundant features for identification. Considering their symmetry, only the left headlight of the vehicle is chosen for feature extraction.

In the proposed approach, feature extraction has two steps. The first is to use an improved Canny [1] algorithm to extract the headlight shape from the front-face image of the vehicle, yielding a profile image of the headlight. The second is to extract SURF (Speeded Up Robust Features) [2] features from both the headlight profile image and the vehicle face image.

Standard front-face images of different vehicle types are selected as standard models. Their features are extracted by the above approach and stored in a local standard feature database.

To identify a new vehicle, features are first extracted from its face image by the proposed approach; they are then compared with the features of the vehicles stored in the standard feature database to determine the vehicle type.

Finally, the vehicle type information, combined with Local Binary Pattern (LBP) and Histogram of Oriented Gradients (HOG) features, is used in a multi-camera vehicle tracking system. Hadoop is used to parallelize the computation of the system.

The remainder of this paper is organized as follows. Section 2 introduces related work on edge detection, local feature extraction, etc. Section 3 presents the architecture, algorithms and implementation of the proposed vehicle recognition system, along with several detailed improvements. Section 4 gives the performance analysis. Finally, the conclusion is drawn in Sect. 5.

2 Related Work

Anagnostopoulos et al. [4] presented a vehicle recognition system combining automatic license plate recognition, vehicle manufacturer/model detection and under-vehicle inspection. Sivaraman et al. [5] proposed a novel active-learning approach to building vehicle recognition and tracking systems: a passively trained recognition system was first built using conventional supervised learning; a second round of learning was then performed to build an active-learning-based vehicle recognizer; and particle filter tracking was integrated to form a complete multiple-vehicle tracking system. Dlagnekov [6] presented a way to use SIFT (Scale Invariant Feature Transform) [7] features of the vehicle rear to recognize a specific vehicle. Petrovic et al. [8] presented an application of automatic vehicle type recognition for secure access and traffic monitoring, showing that a relatively simple set of features extracted from sections of car front-face images can achieve high-performance verification and recognition of vehicle type (both model and class); Euclidean distance is used to measure the similarity between features from different objects.

Multiple features can improve the effectiveness of appearance matching and reduce the influence of environmental changes, especially in multi-camera systems. Tamrakar et al. [9] explored combining shape and appearance features for object recognition, investigating ways in which the geometric aspects of an object can be augmented with its appearance; the main idea is to construct a dense correspondence between the interior regions of two shapes based on a shape-based correspondence so that the intensity and gradient distributions can be compared. Wang et al. [10] employed multiple features such as color histogram, height, motion detection, travel time and speed to match objects across non-overlapping views.

In our proposed approach, the SURF features of the vehicle front-face image are combined with LBP and HOG features for vehicle tracking.

Edge detection is an important step for feature extraction. Existing edge detection algorithms can be roughly divided into five groups: gradient methods [3], surface fitting methods [11], second-derivative methods [12], multi-scale methods [13] and morphological methods [14]. Canny is a practical edge detection algorithm built on three performance criteria: (1) good detection performance; (2) good localization performance; (3) only one response to a single edge. Following these principles, the Canny algorithm produces a good intensity estimate of edges, containing both edge gradient direction and edge intensity information. In our proposed approach, Canny is used for edge detection.

For feature extraction, SIFT [7], proposed by Lowe, is a well-known image feature. It is invariant to rotation and scale and is often used in image matching and stitching. However, its computational complexity is usually high and the amount of data is large. SURF [2, 15] is a sped-up and robust variant of SIFT, which can offer better performance. The calculation of SURF features consists of keypoint detection, descriptor generation and keypoint matching. Reference [16] compared SIFT, Principal Component Analysis (PCA)-SIFT [17] and SURF.

For feature matching, there are several methods to compare feature descriptors, such as the L2 metric, Euclidean distance, Earth Mover's Distance (EMD) [18], EMD-L1 [19], diffusion distance [20], \( \overline{EMD} \) [21] and Quadratic-Form distances [22]. SIFT and SURF were originally designed to be used with Euclidean distance [2, 7]. To improve the performance of feature matching, a refined distance calculation method based on the Hellinger kernel is presented in our proposed approach.

3 Design and Implementation

3.1 Framework Architecture

The interactive traffic video tracking platform, which was presented by our research group [23], consists of three parts: the data storage layer, the data analysis layer, and the data presentation layer. Figure 1 shows the hierarchical architecture of the interactive traffic video tracking system.

Fig. 1. The architecture of the interactive traffic video tracking system

In this architecture, the data storage layer stores all data of the system (such as original videos, multi-camera information, and the standard vehicle database). The data analysis layer includes several traffic video processes, such as local feature extraction, local feature matching, and target association. The data presentation layer is the interface of the system, providing video browsing, object querying and so on.

In this paper, we focus on the vehicle recognition system based on multi-feature fusion. The system is composed of two key steps: local feature extraction (including front-face and headlight feature extraction) and local feature matching. The involved components are shown in gray in Fig. 1.

For vehicle type recognition, the input data are images of vehicle front faces, and the output data are the vehicle types. The main procedure of the recognition system includes the following steps (a code sketch follows the list):

  (a) Loading an image;

  (b) Capturing the headlight area and extracting the edges of the headlight;

  (c) Extracting the SURF features of the front face of the vehicle;

  (d) Calculating the distances between the features of the headlight edge image as well as the front-face image and those in the model database, respectively;

  (e) Comparing the results: the vehicle type with the minimum distance is the most likely one.
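A minimal sketch of this procedure is given below, assuming hypothetical helpers extract_headlight_edges, surf_descriptors and descriptor_set_distance that stand in for the components detailed in Sects. 3.2 to 3.4, and a precomputed model database mapping vehicle types to stored features:

```python
import cv2

def recognize_vehicle_type(face_image_path, model_db):
    """model_db: {vehicle_type: (edge_descriptors, face_descriptors)}."""
    face = cv2.imread(face_image_path, cv2.IMREAD_GRAYSCALE)    # (a) load image
    edge_img = extract_headlight_edges(face)                    # (b) headlight edge image
    face_desc = surf_descriptors(face)                          # (c) SURF of the front face
    edge_desc = surf_descriptors(edge_img)

    best_type, best_dist = None, float('inf')
    for vtype, (model_edge, model_face) in model_db.items():    # (d) distances to models
        dist = (descriptor_set_distance(edge_desc, model_edge)
                + descriptor_set_distance(face_desc, model_face))
        if dist < best_dist:                                    # (e) minimum distance wins
            best_type, best_dist = vtype, dist
    return best_type
```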

Here, an improved Canny operator is presented for edge detection. To improve the performance of distance calculation between features, a refined distance calculation method based on the Hellinger kernel is put forward. Moreover, a position constraint rule is also applied to reduce unnecessary false matches. More details are given below.

3.2 Edge Extraction

Considering that the headlight regions contain abundant distinctive features but also much lighting noise, we first perform edge extraction on them and then extract features. In the process of headlight edge extraction, we adopt an improved Canny algorithm. The edge features are extracted by the following steps: (a) loading the gray image; (b) Gaussian filtering and smoothing; (c) calculating the gradient amplitude and direction; (d) performing non-maximum suppression; (e) double-threshold detection and edge tracing. Then we obtain the edge image of the headlight. During this processing, suitable threshold values must be chosen to obtain ideal results; a baseline sketch follows.
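For reference, the unmodified pipeline amounts to a few OpenCV calls. The sketch below is the standard baseline only; the improvements to the gradient and threshold stages are described next, and the kernel size, sigma and threshold values here are illustrative:

```python
import cv2

gray = cv2.imread('headlight_region.png', cv2.IMREAD_GRAYSCALE)  # (a) gray image
smoothed = cv2.GaussianBlur(gray, (5, 5), 1.4)                   # (b) Gaussian smoothing
edges = cv2.Canny(smoothed, 50, 150)   # (c)-(e) gradient, non-maximum suppression,
                                       # double-threshold detection and edge tracing
```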

Gradient Calculation.

The traditional Canny algorithm uses first-order finite differences in a 2 × 2 neighborhood to calculate the gradient amplitude and gradient direction of each point in the smoothed image I(x, y). The partial derivatives \( G_{x} (i,j) \) and \( G_{y} (i,j) \) of point (i, j) along the x and y directions, respectively, are:

$$ G_{x} (i,j) = [I(i,j + 1) - I(i,j) + I(i + 1,j + 1) - I(i + 1,j)]/2 $$
(1)
$$ G_{y} (i,j) = [I(i,j) - I(i + 1,j) + I(i,j + 1) - I(i + 1,j + 1)]/2 $$
(2)

The gradient amplitude M(i, j) and the gradient direction \( \theta (i,j) \) of each pixel (i, j) in the image I(x, y) are defined, respectively, as:

$$ M(i,j) = \sqrt {G_{x}^{2} (i,j) + G_{y}^{2} (i,j)} $$
(3)
$$ \theta (i,j) = \tan^{ - 1} (G_{y} (i,j)/G_{x} (i,j)) $$
(4)

Here, the gradient amplitude is calculated with finite differences in a 2 × 2 neighborhood, which is sensitive to noise. We therefore improve this step by calculating the gradient amplitude in a 3 × 3 neighborhood to reduce the influence of noise. To save space, only the x-direction results are shown in detail; the y direction is handled in a similar fashion.

The gradient component in the x direction is

$$ G_{x} (i,j) = I(i + 1,j) - I(i - 1,j) $$
(5)

The gradient components in the π/4 and 3π/4 directions, respectively, are

$$ G_{\pi /4} (i,j) = I(i - 1,j + 1) - I(i + 1,j - 1) $$
(6)
$$ G_{3\pi /4} (i,j) = I(i + 1,j + 1) - I(i - 1,j - 1) $$
(7)

The difference in the horizontal direction is

$$ E_{x} (i,j) = G_{x} (i,j) + [G_{\pi /4} (i,j) + G_{3\pi /4} (i,j)]/2 $$
(8)

The combined gradient amplitude and gradient direction are

$$ M(i,j) = \sqrt {G_{x}^{2} (i,j) + G_{y}^{2} (i,j) + G_{\pi /4}^{2} (i,j) + G_{3\pi /4}^{2} (i,j)} $$
(9)
$$ \theta (i,j) = \tan^{ - 1} (E_{y} (i,j)/E_{x} (i,j)) $$
(10)

where \( G_{y} \) and \( E_{y} \) are defined analogously to \( G_{x} \) and \( E_{x} \) for their respective directions.

Finally, the formulas above can be expressed as the following templates:

$$ G_{x} (i,j) = \left[ {\begin{array}{*{20}c} { - 1} & 0 & 1 \\ { - \sqrt 2 } & 0 & {\sqrt 2 } \\ { - 1} & 0 & 1 \\ \end{array} } \right],\quad G_{y} (i,j) = \left[ {\begin{array}{*{20}c} 1 & {\sqrt 2 } & 1 \\ 0 & 0 & 0 \\ { - 1} & { - \sqrt 2 } & { - 1} \\ \end{array} } \right] $$
(11)

The improved method takes the diagonal directions into account and incorporates them into the differential mean value calculation: the gradient amplitude of a pixel now depends on all eight other pixels in its 3 × 3 neighborhood. Compared with the traditional approach, the number of neighboring pixels considered is increased substantially. This enhances the stability of the gradient amplitude, reduces the impact of single noise points, improves edge localization accuracy, and effectively suppresses noise.
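A sketch of the improved gradient stage using the folded templates of Eq. (11); this applies the template form directly rather than the four separate directional differences of Eqs. (5) to (8):

```python
import cv2
import numpy as np

def improved_gradient(smoothed):
    """smoothed: Gaussian-smoothed gray image as float64."""
    s2 = np.sqrt(2)
    kx = np.array([[-1, 0, 1], [-s2, 0, s2], [-1, 0, 1]])   # Eq. (11), x template
    ky = np.array([[1, s2, 1], [0, 0, 0], [-1, -s2, -1]])   # Eq. (11), y template
    gx = cv2.filter2D(smoothed, cv2.CV_64F, kx)
    gy = cv2.filter2D(smoothed, cv2.CV_64F, ky)
    amplitude = np.hypot(gx, gy)       # gradient amplitude, cf. Eq. (9)
    direction = np.arctan2(gy, gx)     # gradient direction, cf. Eq. (10)
    return amplitude, direction
```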

Self-adaptive Double Threshold Detection.

During edge point detection, the self-adaptive double threshold algorithm automatically reduces the influence of pixels with smaller gradient amplitudes and further eliminates false edges. A higher threshold eliminates most noise points but causes the loss of some useful information, while a lower threshold preserves more image information. Therefore, we present a threshold decision scheme based on a higher threshold coupled with a lower threshold as a supplementary decision, which effectively reduces false edges while largely ensuring the integrity of image edges. The thresholds are set automatically according to the gradient amplitudes of the image pixels, depending only on an initial selection proportion; no manual operation is needed.
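One plausible reading of this scheme is sketched below; the percentile-based selection and the fixed low/high ratio are our assumptions, since the paper specifies only that the thresholds derive from the gradient-amplitude distribution and an initial selection proportion:

```python
import numpy as np

def adaptive_thresholds(amplitude, proportion=0.1):
    """Derive double thresholds from the amplitude distribution; the top
    `proportion` of amplitudes is treated as strong-edge candidates."""
    high = np.percentile(amplitude, 100.0 * (1.0 - proportion))
    low = 0.5 * high   # the 0.5 ratio is an assumption of this sketch
    return low, high
```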

Noise Elimination.

From the results of the previously processed image, we know that pixels carry not only gradient amplitude information but also gradient direction information. If a point belongs to an edge, its gradient direction generally points along the normal direction of the edge, whereas isolated noise points usually have no consistent gradient direction. Therefore, according to this difference in gradient direction characteristics between edge points and noise points, we can effectively distinguish them.

In the 3 × 3 neighborhood centered on a point o, consider a point t around o. If o and t lie on the same edge, the line connecting o and t has a direction similar to that of the edge. Here, define \( Angle_{t} \) as the absolute value of the angle between the normal direction of this line and the mean gradient direction of the edge. For a point on the edge, its gradient direction should be close to the normal direction of the line, so \( Angle_{t} \) should be very small.

In the case of a noise point, because its gradient direction generally differs from those of its adjacent points, the corresponding \( Angle_{t} \) is usually large. Consequently, this method can effectively distinguish noise points from edge points and eliminate the former, enabling more precise edge point detection and subsequent linking. A sketch of this test follows.
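The sketch below is one plausible implementation under stated assumptions: the paper does not fully specify the angle statistic, so \( Angle_{t} \) is computed per neighbor against the normal of the line from o to t (modulo 180 degrees), with a hypothetical tolerance, and a pixel is flagged as noise when no neighbor passes:

```python
import numpy as np

def is_noise_point(theta, i, j, tol=np.pi / 8):
    """theta: gradient-direction map; (i, j): candidate edge pixel o."""
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            line_dir = np.arctan2(di, dj)       # direction of the line o->t
            normal = line_dir + np.pi / 2.0     # normal of that line
            diff = (theta[i + di, j + dj] - normal) % np.pi
            angle_t = min(diff, np.pi - diff)   # undirected angle difference
            if angle_t < tol:                   # some neighbor lies along an edge:
                return False                    # consistent direction, edge point
    return True                                 # no consistent neighbor: noise
```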

3.3 Local Feature Extraction

In image processing, there are many kinds of features for identifying and recognizing a target. In this paper, SURF, which is scale- and rotation-invariant, is used for local feature extraction. The main procedure is: (a) loading the gray image and creating the integral image; (b) detecting interest points with the Hessian matrix; (c) locating interest points precisely; (d) assigning the dominant orientation; (e) creating interest point descriptors.

After these steps, we obtain hundreds of 64-length vectors as local features of the image. If the image is the front-face image of a standard vehicle, the features are stored in the standard database.
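In practice, these steps are bundled in OpenCV. A usage sketch with the 2.4.x API, matching the test environment of Sect. 4.1 (later releases moved the class to cv2.xfeatures2d.SURF_create), with an illustrative Hessian threshold:

```python
import cv2

gray = cv2.imread('front_face.png', cv2.IMREAD_GRAYSCALE)
surf = cv2.SURF(400)   # Hessian threshold; steps (a)-(b): integral image, Hessian detection
keypoints, descriptors = surf.detectAndCompute(gray, None)  # steps (c)-(e)
# descriptors is an N x 64 array: one 64-length vector per interest point
```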

The process of local feature extraction is described briefly in the following subsections.

Keypoints Detection.

In SURF, the scale space of the image (also called the Gaussian pyramid) is mainly used to obtain interest points at different scales. Traditionally, the pyramid is built by convolving the image iteratively with a Gaussian kernel and repeatedly sub-sampling it; each layer then depends on the previous one, leading to high complexity. In SURF, the Laplacian of Gaussian is approximated by a box filter, and the pyramid is created by changing the filter size instead. This allows multiple layers of the scale-space pyramid to be processed simultaneously without subsampling the image, which leads to better performance.

The Hessian matrix is used for interest point localization to determine whether a point is an extremum, and the integral image is used to compute it. Each candidate extremum is compared with the 26 adjacent points around it; if its value is above or below all 26 neighbors, it becomes an interest point.

SURF Interest Point Descriptor.

The SURF interest point descriptor relies on the dominant orientation of each interest point, and the descriptor components are built relative to it. Haar wavelet responses are used to calculate the dominant orientation: the responses are calculated in both x and y coordinates within a circular region of radius 6σ centered at the interest point, with a Haar wavelet size of 4σ, and the sum of the response vectors is calculated over every 60-degree sector of the circle. The orientation with the largest vector sum is taken as the dominant orientation.

After the dominant orientation is obtained, a square window with a side length of 20σ is constructed, centered at each interest point and divided into 4 × 4 sub-regions. The wavelet responses are calculated along the dominant orientation and the direction perpendicular to it, denoted dx and dy. Four values ∑dx, ∑dy, ∑|dx|, ∑|dy| are calculated, then combined and normalized into a 64-length vector for each interest point: the SURF descriptor. The SURF descriptor is invariant to scale, rotation and translation of images.

3.4 Matching Local Features

Improvement in the Distance Calculation.

For tasks such as texture classification and image categorization, it is well known that using Euclidean distance to compare histograms often yields inferior performance compared with other measures [24]. SURF was originally designed to be used with Euclidean distance. However, since descriptor comparison is essentially a histogram comparison, the question arises as to whether other histogram distance measures can work better. Here, we show that using the Hellinger kernel does indeed bring a great benefit.

In the following, the use of the Hellinger kernel is described in more detail. Suppose x and y are n-vectors with unit Euclidean norm (\( \left\| x \right\| = 1 \)); then the Euclidean distance \( d_{E} (x,y) \) between them is related to their similarity (kernel) \( S_{E} (x,y) \):

$$ d_{E} (x,y)^{2} = \left\| {x - y} \right\|^{2} = \left\| x \right\|^{2} + \left\| y \right\|^{2} - 2x^{T} y = 2 - 2x^{T} y $$

where \( S_{E} (x,y) = x^{T} y \). Here, we replace the Euclidean similarity/kernel with the Hellinger kernel.

The Hellinger kernel (also known as the Bhattacharyya coefficient) for two L1-normalized histograms x and y (i.e. \( \sum\nolimits_{i = 1}^{n} {x_{i} } = 1 \) and \( x_{i} \ge 0 \)) is defined as

$$ H(x,y) = \sum\nolimits_{i = 1}^{n} {\sqrt {x_{i} y_{i} } } $$
(12)

SURF vectors can be compared with the Hellinger kernel using a simple algebraic manipulation in two steps: (a) normalizing the SURF vector in the L1 norm (originally it has unit L2 norm); (b) calculating the square root of each element:

$$ S_{E} (\sqrt x ,\sqrt y ) = \sqrt x^{T} \sqrt y = H(x,y) $$

and the obtained vectors are L2 normalized since

$$ S_{E} (\sqrt x ,\sqrt x ) = \sum\nolimits_{i = 1}^{n} {x_{i} } = 1 $$

Thus we use the Hellinger kernel to compare the original SURF vectors:

$$ d(\sqrt x ,\sqrt y )^{2} = 2 - 2\sqrt x^{T} \sqrt y = 2 - 2H(x,y) $$
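The transformation is applied once per descriptor, after which an ordinary L2 matcher realizes the Hellinger comparison, since the Euclidean distance between transformed vectors equals \( \sqrt{2 - 2H(x,y)} \). A minimal sketch follows; the absolute-value step is our assumption, since raw SURF entries (∑dx, ∑dy) may be negative while the histogram view requires nonnegative mass:

```python
import numpy as np

def hellinger_transform(descriptors, eps=1e-12):
    """descriptors: N x 64 SURF vectors (originally unit L2 norm)."""
    d = np.abs(descriptors)                  # nonnegativity assumption of this sketch
    d /= d.sum(axis=1, keepdims=True) + eps  # (a) L1 normalization
    return np.sqrt(d)                        # (b) elementwise square root
```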

Constraints Between Matching Point Positions.

In our system, the positions of matching points are related: a point in image A can only match points in image B that are close to the corresponding position of the former. In other words, we do not look for matches in image B outside a region centered at the corresponding position of the point in image A. This method eliminates a considerable number of unnecessary false matches.

Suppose the threshold for the area is T. The judgment criteria for the position constraints are as follows:

For a point A \( (x_{1}, y_{1}) \) in image \( I_{1} \) of size \( (w_{1}, h_{1}) \) and a point B \( (x_{2}, y_{2}) \) in image \( I_{2} \) of size \( (w_{2}, h_{2}) \), the two points are possible correct matches if the following conditions are met:

$$ \left| {\frac{{x_{1} }}{{w_{1} }} - \frac{{x_{2} }}{{w_{2} }}} \right| \le T,\quad \left| {\frac{{y_{1} }}{{h_{1} }} - \frac{{y_{2} }}{{h_{2} }}} \right| \le T $$
(13)

In the formula above, T can be set according to practical requirements. In our system, T is usually less than 0.4, based on experience.
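Eq. (13) translates directly into a pruning predicate applied before descriptor comparison; a minimal sketch, with the default T taken from the empirical bound above:

```python
def position_compatible(pt1, size1, pt2, size2, T=0.4):
    """pt = (x, y) keypoint location; size = (w, h) of its image."""
    (x1, y1), (w1, h1) = pt1, size1
    (x2, y2), (w2, h2) = pt2, size2
    return (abs(x1 / w1 - x2 / w2) <= T and
            abs(y1 / h1 - y2 / h2) <= T)
```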

3.5 Camera Network and Its Sub-graph Partitioning

Our proposed vehicle type recognition approach has been used in a large non-overlapping multi-camera surveillance platform that we presented in [23].

Generally, it is hard to track targets across a multi-camera network, and the computational burden is often heavy. Can the network be divided into several independent sub-network (sub-graph) units so that parallel strategies can be used?

In our system, the multi-camera network is divided into independent sub-networks according to the characteristics of its topology. After this partitioning, the target association algorithm can run on each sub-network independently. The sub-network partition must not destroy the structure of the camera network, which ensures that the trajectories obtained in each sub-network can easily be merged into the whole vehicle trajectory.

Two principles should be followed: the completeness principle and the minimization principle. The completeness principle ensures that all nodes involved in calculating the association relationships of the targets lie within the sub-graph unit, so that no information needs to be exchanged with other units. The minimization principle keeps the structure of each sub-graph unit as small as possible.

Our system uses Hadoop to parallelize the computation over all sub-graphs. First, tasks from different sub-graphs are mapped to different MapReduce nodes for independent track analysis. Then, all information is returned to the central server to be reduced.

A minimum-cost maximum-flow algorithm based on the camera network topology is used for target association in our system.

One of the key parts of target association is how to describe the similarity between objects observed by different cameras. Assume targets a and b are two vehicles observed by cameras \( C_{i} \) and \( C_{j} \) respectively; the similarity measure function \( S(O_{i,a} ,O_{j,b} ) \) is defined as:

$$ S\left( {O_{i,a} ,O_{j,b} } \right) = \omega_{t} T\left( {t_{i,a} ,t_{j,b} } \right) \cdot \omega_{l} L\left( {l_{i,a} ,l_{j,b} } \right) \cdot \omega_{a} A\left( {a_{i,a} ,a_{j,b} } \right) $$
(14)

where \( T(t_{i,a} ,t_{j,b} ) \) denotes the time relationship of the two objects; \( L(l_{i,a} ,l_{j,b} ) \) denotes their location relationship; and \( A(a_{i,a} ,a_{j,b} ) \) describes their appearance relationship. Each weight \( \omega \) controls the reliability of the corresponding feature.

The similarity terms for the time and location of specific objects are easy to construct; they mainly depend on the objects' appearing times and the topology of the monitoring network. For the objects' appearances, the vehicle types obtained by the proposed approach are the main factor, combined with other features including LBP and HOG. More details and experimental results on target association and MapReduce-based partitioning can be found in [23].
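A sketch of how Eq. (14) combines the three terms is given below; the T, L and A similarity functions and the observation structure are assumptions here, since their construction is detailed in [23]:

```python
def target_similarity(obs_a, obs_b, T, L, A, w_t=1.0, w_l=1.0, w_a=1.0):
    """S(O_ia, O_jb) per Eq. (14): product of weighted time, location and
    appearance similarities. T, L, A are caller-supplied similarity terms."""
    return (w_t * T(obs_a.time, obs_b.time)
            * w_l * L(obs_a.location, obs_b.location)
            * w_a * A(obs_a.appearance, obs_b.appearance))
```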

4 Performance Evaluations

4.1 Test Environment

Test experiments are conducted on Linux servers (Red Hat Enterprise Linux Server release 5.3) with OpenCV 2.4.3 [25]. Each server has one eight-core Intel Xeon E5520 CPU (2.27 GHz) with hyper-threading and 16 GB of memory. GCC 4.1.2 is used for compilation. For multi-camera tracking, a Hadoop cluster consisting of five such nodes is used.

4.2 Standard Model Database

For the standard model database in the system, 15 brands, amounting to 178 vehicle models, are built and recorded. The information of the model datasets is shown in Table 1.

Table 1. The Model Datasets

4.3 Evaluation of the Improvement in Distance Calculation

We use a square-root (Hellinger) kernel instead of the standard Euclidean distance to measure the similarity between SURF descriptors, which leads to a dramatic performance improvement. This change is easy to implement and requires no additional storage space for the distance calculation.

The time cost of the distance calculations is shown in Fig. 2, and the resulting improvements in matching performance are shown in Fig. 3. These improvements come without additional computation or storage cost, since the Hellinger transformation of SURF features can be computed online with negligible overhead.

Fig. 2. Time cost of distance calculation

Fig. 3. Numbers of the matched points

A standard image has 440 interest points. The line "Original" shows the performance of unmodified SURF; "RefineD" shows SURF with the improved distance calculation; and "RefineDP" shows SURF with both the improved distance calculation and the position constraint. Figure 2 shows that RefineDP has an obvious advantage in computation time.

From Fig. 3, we can see that the number of matched points with the RefineDP method improves significantly, since the refinement reduces the dependency of the feature descriptors on points with extreme values while improving the sensitivity of the descriptor to non-extreme points.

4.4 Accuracy of Vehicle Recognition

This section shows experimental results on the accuracy of vehicle recognition in different cases. The test images include not only vehicle images downloaded from the Internet, which have good resolution, but also images captured from traffic videos, which have poor resolution. In our system, clearer images yield more local features and thus more precise results.

In Fig. 4, the x-axis represents the number of feature points in the headlight edge images, and the y-axis represents the recognition accuracy. Lines with different legends correspond to different numbers of feature points in the vehicle front faces. The numbers of feature points in the headlight edge image and in the front face are associated with image clarity. When the number of feature points in the headlight edge image is greater than 20, or the number of feature points in the front face is greater than 200, the accuracy is satisfactory.

Fig. 4. The accuracy of vehicle recognition in different cases

4.5 Tracking Result of the Multi-camera Platform

To verify the effectiveness of the multi-camera platform, a batch of vehicle monitoring videos from 23 cameras in the real campus security monitoring system of Huazhong University of Science and Technology (HUST) was analyzed (see Fig. 5(a)). Figure 5(b) gives the trajectory of one tracked vehicle (a white Honda SUV). It shows that the Hadoop-based multi-camera platform can track vehicle objects reliably.

Fig. 5. Vehicle tracking of the multi-camera platform

5 Conclusion

In this paper, a new recognition approach for vehicle types is designed and implemented. The vehicle type is distinguished from its front-face image. The approach has been applied in a multi-camera traffic video monitoring platform, combined with LBP and HOG features. The processing of the whole camera network is parallelized based on sub-graph partitioning and a Hadoop cluster. Experimental results indicate that the proposed modifications to SURF improve the algorithm's performance. The recognition accuracy can meet the requirements of practical applications, and the tracking results in the multi-camera system are satisfactory.