1 Introduction

Over the past several years, vehicle behavior analysis and incident prediction from video analytics have emerged as an active and challenging research area (Song et al. 2014; Sivaraman and Trivedi 2013; Wu et al. 2012). We presented a real-time vehicle behavior analysis system in Song et al. (2014), in which the trajectories of feature points were used to analyze vehicle behaviors such as sudden acceleration and deceleration, stopping, retrogradation (wrong-way driving), and lane changing. The system has been widely deployed by Chinese highway management departments, and its algorithms are robust enough for vehicle behavior analysis under complex weather conditions. However, trajectory clustering remained an open problem in Song et al. (2014). Hence, in this paper, we focus on the problem of trajectory clustering in real traffic scenes.

In traffic surveillance applications, vehicle motion is often represented by trajectories with similar spatial and dynamic patterns. With the advances in camera sensing and computational technologies, vehicle trajectory clustering has become an extremely active research area in intelligent transportation systems (ITSs) (Li et al. 2006; Fu et al. 2005; Atev et al. 2010). It plays a considerable role in data analysis since it reveals the underlying vehicle motion information. Researchers have made many attempts at this problem for different application scenes in recent years.

Many approaches cluster trajectory data based on distance measures between trajectories. In most cases, trajectory data are extracted by a feature detector and a tracker. For example, Beymer et al. (1997) presented a real-time computer vision system based on the Harris corner tracking algorithm (Harris and Stephens 1988), in which corners were clustered into vehicles based on proximity and similar 2D motion. Similarly, in Kanhere and Birchfield (2008), the tracked points were back-projected into 3D space and then clustered by the Euclidean distance.

Length differences induced by the kinematic properties of moving objects make similarity assessment between two or more trajectories exceedingly difficult; thus, measuring the similarity of two trajectories with different lengths is a key step that strongly affects the clustering result. For this reason, some researchers have devoted themselves to the problem of similarity measurement between trajectories. Atev et al. (2010) proposed a trajectory-similarity measure based on the Hausdorff distance (Huttenlocher et al. 1993), with a modification that improves its robustness by exploiting the fact that trajectories are ordered collections of points. Abraham and Lal (2012) dealt with the length difference by applying a spatial–temporal similarity measure with given points of interest (POI) and times of interest (TOI), in which the spatial similarity is treated as a combination of structural and sequence similarities and evaluated using dynamic programming. Piotto et al. (2009) proposed an algorithm for the syntactic description and matching of object trajectories in videos; similarities between trajectories were determined by inexact or approximate matching, after trajectory segmentation and syntactic description.

Furthermore, Jung et al. (2008) used 4D histograms to cluster trajectory data. In the training stage, captured trajectories were grouped into coherent clusters to build 4D motion histograms from the position and instantaneous velocity of every tracked object. In the test stage, each new trajectory was compared with the 4D histograms of all the clusters. Hu et al. (2013) proposed a clustering algorithm based on the time-sensitive Dirichlet process mixture model (tDPMM) and applied it to each trajectory cluster to learn the trajectory pattern, which represents the time-series characteristics of the trajectories in the cluster.

Although current trajectory clustering and modeling methods have solved a variety of specific problems in trajectory analysis, they still have the following limitations:

  • Some methods need the number of trajectory clusters to be specified in advance. However, a manual choice of the cluster number is subjective, and an inappropriate setting may result in inaccurate trajectory clustering.

  • They fail to cluster trajectories incrementally. When new trajectories are obtained, the model has to be retrained, which incurs a high computational cost and cannot satisfy the real-time requirements of traffic event detection.

  • Traditional feature point clustering methods mostly operate on 2D images. However, owing to camera angle and lighting conditions, vehicle occlusion severely interferes with feature point clustering.

Fig. 1 Framework of the proposed clustering system

In this paper, we propose a trajectory clustering framework in which the number of clusters is set automatically and grows incrementally. Moreover, a novel method is proposed that back-projects 2D image points into 3D space and uses the 3D information for clustering, so that the problem of vehicle occlusion can be handled. To improve the accuracy, we employ a coarse-to-fine strategy during trajectory clustering. Compared with other methods, the main advantage of our method lies in its high accuracy.

The rest of this paper is organized as follows. An overview of the clustering framework is presented in Sect. 2. Section 3 describes the method of feature extraction. Vehicle trajectory tracking is presented in Sect. 4. Section 5 describes the vehicle trajectory clustering method, which includes three phases: trajectory coarse clustering, trajectory fine clustering, and agglomerative clustering. Experimental results are reported in Sect. 6, and finally, Sect. 7 concludes this paper.

2 Framework of clustering system

Figure 1 shows the framework of the proposed clustering system, which has four main stages. First, image preprocessing sets the region of interest (ROI) and calibrates the camera. Second, feature points of vehicles are extracted from the images. Third, the feature points are tracked to obtain motion trajectories. Finally, a coarse-to-fine clustering method, which back-projects the 2D trajectory points onto a reference-height plane in 3D space, performs the trajectory clustering. To reduce the computational complexity, all these algorithms run within the ROI.

The vehicle trajectory clustering algorithm consists of coarse clustering, fine clustering, and agglomerative clustering. In the coarse clustering phase, the 3D space distances between trajectory points serve as the similarity measure, and the feature points are clustered with an improved k-means method. Then, the 3D information of the tracked points is reconstructed from the correlation between back-projection velocity and height. Fine clustering removes and reassigns feature points within a category using 3D vehicle-model thresholds. Finally, the 3D trajectory points are clustered across categories by a motion consistency constraint. The final clustering result can be applied to traffic flow statistics and vehicle classification in surveillance scenes.

3 Feature point extraction

Feature points are important local features in vehicle detection and carry the motion information of the vehicles. They can be easily located and tracked, so this paper uses them for vehicle motion trajectory clustering and traffic parameter analysis. In this stage, the ORB algorithm (Rublee et al. 2011) is used to extract the feature points. It is built on the corner detector of Features from Accelerated Segment Test (FAST) (Rosten and Drummond 2006) and the feature descriptor of Binary Robust Independent Elementary Features (BRIEF) (Calonder et al. 2010). ORB is less affected by image noise, and it is invariant to image scale, affine transformation, and rotation. Since it consumes little computing time, this paper uses ORB to obtain suitable feature points for real-time detection.

3.1 Oriented FAST detector

In this paper, the FAST detector is employed to detect the feature points of images. A pixel is considered a FAST feature point if there is a set of n contiguous pixels on the surrounding circle whose intensities all differ from that of the candidate pixel I(p) by more than a threshold \(\varepsilon \):

$$\begin{aligned} |I(x)-I(p)|>\varepsilon \quad \forall x \in \odot p. \end{aligned}$$
(1)

However, FAST provides no cornerness measure and responds strongly along edges, so the Harris corner response function is employed to filter the FAST feature points, keeping the top N as candidate feature points.

Since the feature points detected by the original FAST have no orientation, ORB uses the intensity centroid method (Rosin 1999) to determine the orientation of each feature point. It defines the moments of a patch as:

$$\begin{aligned} M_{pq}=\sum _x\sum _y x^py^qI(x,y), \end{aligned}$$
(2)

where I(x, y) is the gray intensity at pixel (x, y). With these moments, the centroid of the patch can be found:

$$\begin{aligned} C=\left( \frac{M_{10}}{M_{00}},\frac{M_{01}}{M_{00}}\right) =\left( \frac{\sum \nolimits _{x,y}xI(x,y)}{\sum \nolimits _{x,y}I(x,y)},\frac{\sum \nolimits _{x,y}yI(x,y)}{\sum \nolimits _{x,y}I(x,y)}\right) . \end{aligned}$$
(3)

The orientation of the image patch can be calculated as:

$$\begin{aligned} \theta =\arctan \left( \frac{M_{01}}{M_{10}}\right) =\arctan \left( \frac{\sum \nolimits _{x,y}yI(x,y)}{\sum \nolimits _{x,y}xI(x,y)}\right) . \end{aligned}$$
(4)
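As a concrete illustration of (2)–(4), the following C++ sketch computes the patch orientation by the intensity centroid method. It assumes an 8-bit grayscale patch and, unlike ORB's implementation, sums over the whole square patch rather than a circular region.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>

// Patch orientation by the intensity centroid method of Eqs. (2)-(4).
// 'patch' is assumed to be an 8-bit grayscale square patch centred on the
// keypoint.
double patchOrientation(const cv::Mat& patch) {
    double m10 = 0.0, m01 = 0.0;
    for (int y = 0; y < patch.rows; ++y)
        for (int x = 0; x < patch.cols; ++x) {
            const double I = patch.at<uchar>(y, x);
            m10 += x * I;   // M10 of Eq. (2)
            m01 += y * I;   // M01 of Eq. (2)
        }
    // Eq. (4): theta = arctan(M01 / M10); atan2 keeps the correct quadrant.
    return std::atan2(m01, m10);
}
```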

3.2 rBRIEF feature descriptor

The original BRIEF can be considered as a binary string descriptor of an image patch, which is constructed by some simple binary intensity tests:

$$\begin{aligned} \tau (P;x,y)= \left\{ \begin{array}{ll} 1&{}\quad \text {if } P(x) <P(y)\\ 0&{}\quad \text {otherwise} \end{array}\right. , \end{aligned}$$
(5)

where P is an \(S\times S\) image patch around a feature point, smoothed beforehand by a Gaussian filter, and P(x) is the gray intensity at point \(x=(u,v)\). Thus, the BRIEF descriptor is defined as a vector of n binary tests:

$$\begin{aligned} f_n(P)=\sum _{i=1}^{n}2^{i-1}\tau (P;x_i,y_i). \end{aligned}$$
(6)

However, the original BRIEF is considerably sensitive to image noise, and it has no rotation invariance. Therefore, ORB improves it in two respects. First, ORB replaces single-pixel comparisons with \(5 \times 5\) pixel blocks to reduce random noise (these parameters are employed in this paper as well); the binary string is then obtained by comparing the sums of gray values in the pixel blocks. Second, ORB steers the BRIEF descriptor by the orientation obtained in (4). Every feature point has n binary tests, which compose a matrix defined as:

$$\begin{aligned} S=\left[ \begin{array}{cccc} x_1 &{}\quad x_2&{}\quad \ldots &{}\quad x_n\\ y_1 &{}\quad y_2&{}\quad \ldots &{}\quad y_n \end{array}\right] . \end{aligned}$$
(7)

The binary test set S is transformed into the steered set \(S_\theta \) using the orientation \(\theta \) of the image patch and the corresponding rotation matrix \(R_\theta \):

$$\begin{aligned} S_\theta =R_\theta S, \end{aligned}$$
(8)

where \(R_\theta \) is the rotation matrix:

$$\begin{aligned} R_\theta =\left[ \begin{array}{cc} \cos \theta &{} \quad \sin \theta \\ -\sin \theta &{} \quad \cos \theta \end{array}\right] . \end{aligned}$$
(9)

So the steered BRIEF descriptor can be represented as:

$$\begin{aligned} f_n(P,\theta )=f_n(P)\ | \ (x_i,y_i) \in S_\theta . \end{aligned}$$
(10)
Fig. 2 ORB descriptor applied in different scenes

Since the correlation between the random point pairs can be considerably large and the variance of steered BRIEF is low, a greedy search is adopted to find weakly correlated point pairs; the result is called rBRIEF. The ORB descriptor is then obtained by the above process. Figure 2 shows the results of the ORB algorithm applied to different traffic scenes with different viewing angles, and Table 1 presents the measured running time compared with other feature descriptors, such as SIFT (Lowe 2004), SURF (Bay et al. 2006), MSER (Forssén and Lowe 2007), BRISK (Leutenegger et al. 2011), and A-KAZE (Alcantarilla et al. 2012).
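For reference, the whole oriented-FAST + rBRIEF pipeline is available through OpenCV's ORB implementation; a minimal usage sketch follows. The file name and the keypoint budget are illustrative assumptions, not the paper's settings.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Minimal sketch: extract ORB keypoints and binary descriptors from a frame.
int main() {
    cv::Mat frame = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
    if (frame.empty()) return 1;
    cv::Ptr<cv::ORB> orb = cv::ORB::create(500);   // keep the 500 best corners
    std::vector<cv::KeyPoint> keypoints;           // oriented FAST keypoints
    cv::Mat descriptors;                           // 32-byte binary rBRIEF rows
    orb->detectAndCompute(frame, cv::noArray(), keypoints, descriptors);
    return 0;
}
```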

Table 1 Running time analysis

4 Vehicle trajectory tracking

In the stage of vehicle trajectory tracking, the matching method based on Hamming distance (Meng and You 2013) is used to solve the matching problems of feature points in consecutive frames:

$$\begin{aligned} D(F_1,F_2)=\sum _{i=0}^{n}x_i\oplus y_i, \end{aligned}$$
(11)

where \(F_1=\{x_0,x_1,\ldots ,x_n\}\) and \(F_2=\{y_0,y_1,\ldots ,y_n\}\) are ORB descriptors, i.e., binary strings of two feature point patches in adjacent frames, and \(D(F_1, F_2)\) is their similarity distance, represented by the Hamming distance. Since the descriptors are binary strings, the Hamming distance is the sum of the bitwise XOR results between the two descriptors. If \(D(F_1, F_2)\) is smaller than a preset threshold, \(F_1\) and \(F_2\) are considered matched.
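A minimal sketch of this matching step, using OpenCV's brute-force matcher with the Hamming norm of (11); the distance threshold of 40 is an assumed value, since the paper does not state its threshold.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Frame-to-frame matching with the Hamming distance of Eq. (11): with binary
// descriptors, cv::NORM_HAMMING counts the differing bits, i.e. the popcount
// of the XOR of the two strings.
std::vector<cv::DMatch> matchFrames(const cv::Mat& descPrev,
                                    const cv::Mat& descCurr) {
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches, accepted;
    matcher.match(descPrev, descCurr, matches);
    for (const cv::DMatch& m : matches)
        if (m.distance < 40)       // keep only close descriptor pairs (assumed)
            accepted.push_back(m);
    return accepted;
}
```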

During tracking, the matching method is used to find the best matching points between the current frame and the previous frame. Finally, a 2D trajectory can be represented as a point data set as follows:

$$\begin{aligned} T_{2D}(i,n)=\{(u_{i1},v_{i1}),(u_{i2},v_{i2}),\ldots ,(u_{in},v_{in})\}, \end{aligned}$$
(12)

where i indexes a tracked trajectory in the trajectory data set, n is the number of nodes on the tracked trajectory, and \((u_{in},v_{in})\) is the pixel coordinate of the nth node. The motion trajectories of feature points are shown in Fig. 3.

Fig. 3 Results of trajectory tracking in different scenes

5 Trajectory clustering using 3D information

In this stage, a coarse-to-fine clustering method based on the 3D information reconstruction of the feature points is proposed to cluster the trajectory points. It contains three phases: coarse clustering, fine clustering, and agglomerative clustering. In the coarse clustering phase, all 2D feature points are back-projected into 3D space at a height of 0 m, and an improved k-means method yields the coarse categories. Fine clustering is then carried out within each class by an Assume-And-Then-Verify method, starting from the coarse clustering result. Finally, agglomerative clustering merges categories that belong to the same vehicle and improves the clustering accuracy.

5.1 Trajectory coarse clustering

In this phase, the 2D trajectory points are back-projected into 3D space, and the 3D information of the vehicle model is used to cluster the trajectory points. First of all, the relationship between the world coordinate system and the image coordinate system is established (Zheng and Peng 2014):

$$\begin{aligned} \lambda \left[ \begin{array}{l} u\\ v \\ 1\\ \end{array}\right] =K\left[ \begin{array}{ll} R&\quad t \end{array}\right] \left[ \begin{array}{l} X_W\\ Y_W\\ Z_W\\ 1 \end{array}\right] , \end{aligned}$$
(13)

where K is the camera intrinsic matrix, and R and t compose the extrinsic parameter matrix of the camera. The intrinsic and extrinsic parameters can be calculated accurately by the vanishing-point recovery method (Zheng and Peng 2014). As shown in Fig. 4, three vanishing points are formed in the image using three sets of parallel lines in the X, Y, and Z directions, respectively. In addition, a known height (the height of the camera) is required to calculate the scale factor. Then, the camera calibration matrix can be computed.
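For illustration, when \(Z_W=0\), (13) reduces to a homography \(H=K[r_1\ r_2\ t]\) built from the first two columns of R and the translation t, and the ground coordinates follow from its inverse. A minimal sketch, assuming K, R, and t come from the vanishing-point calibration:

```cpp
#include <opencv2/opencv.hpp>

// Back-projection onto the plane Z_W = 0. With Z_W = 0, Eq. (13) reduces to
// a homography H = K [r1 r2 t], whose inverse recovers (X_W, Y_W) from a
// pixel coordinate.
cv::Point2d backProjectToGround(const cv::Point2d& pix,
                                const cv::Matx33d& K,
                                const cv::Matx33d& R,
                                const cv::Vec3d& t) {
    const cv::Matx33d H = K * cv::Matx33d(R(0, 0), R(0, 1), t(0),
                                          R(1, 0), R(1, 1), t(1),
                                          R(2, 0), R(2, 1), t(2));
    const cv::Vec3d w = H.inv() * cv::Vec3d(pix.x, pix.y, 1.0);
    return cv::Point2d(w(0) / w(2), w(1) / w(2));   // (X_W, Y_W) on the road
}
```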

Fig. 4 Camera calibration

Apparently, the image coordinates can be calculated from the world coordinates, but the world coordinates cannot be obtained from the image coordinates directly. To obtain the 3D information of a trajectory, the height information is necessary; however, the real height cannot be read off the image plane. Therefore, this paper adopts a novel method to handle this problem. It back-projects the 2D trajectory points onto a plane, called the back-projected plane, with \(Z_W=0\) in 3D space, as shown in Fig. 5. Then, we cluster and mark these trajectory points for trajectory coarse clustering by their Euclidean distances in 3D space, using an improved k-means method with the following steps (a code sketch follows the steps):

Step 1. Set the threshold \(d_{\max }\), regarded as the maximum distance between feature points on the same vehicle. The feature point that reaches the ROI first and satisfies the trajectory requirement is selected as the initial cluster center, and \(k=1\). The other feature points are recorded in order.

Step 2. Calculate the distance d between the cluster center \(P_i\) and the feature point \(P_j\). If \(d<d_{\max }\), \(P_i\) and \(P_j\) belong to the same category, and the cluster center \(P_i\) is updated to the average of \(P_i\) and \(P_j\). Otherwise, \(P_j\) becomes a new cluster center, and \(k=k+1\).

Step 3. Iterate Step 2 until all cluster centers are stable within a certain range. The number of categories is then obtained.
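A minimal sketch of one pass of this threshold-driven clustering, assuming the trajectory points have already been back-projected onto the \(Z_W=0\) plane; a running mean is used for the center update, a slight variant of the pairwise average in Step 2.

```cpp
#include <cmath>
#include <vector>

// Threshold-driven k-means variant used for coarse clustering (Steps 1-3):
// points are visited in order, and d_max decides whether a point joins an
// existing cluster or seeds a new one. One pass is shown; the paper iterates
// until the cluster centers stabilise.
struct Pt { double x, y; };
struct Cluster { Pt center; int size; };

std::vector<Cluster> coarseCluster(const std::vector<Pt>& pts, double dMax) {
    std::vector<Cluster> clusters;
    for (const Pt& p : pts) {
        Cluster* hit = nullptr;
        for (Cluster& c : clusters)
            if (std::hypot(p.x - c.center.x, p.y - c.center.y) < dMax) {
                hit = &c;
                break;
            }
        if (hit) {  // point joins the cluster; update the center
            hit->center.x = (hit->center.x * hit->size + p.x) / (hit->size + 1);
            hit->center.y = (hit->center.y * hit->size + p.y) / (hit->size + 1);
            hit->size += 1;
        } else {    // d >= d_max for all centers: new cluster, k = k + 1
            clusters.push_back({{p.x, p.y}, 1});
        }
    }
    return clusters;
}
```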

Fig. 5 Results of trajectory coarse clustering. a Region of back-projected plane. b Clustering results in the back-projected image

Fig. 6 Correlation between back-projection velocity and height

5.2 Trajectory fine clustering

5.2.1 Correlation between back-projection velocity and height

The feature points undergo non-uniform motion in the image plane, but their back-projections onto the zero plane (the road plane) can be considered uniform motion. Therefore, the velocity of all trajectory points in the same category can be calculated and used for further clustering.

Assuming that the camera is sufficiently far from the motion area, the camera and the vehicle can be considered to lie in the same plane. As shown in Fig. 6, the vehicle moves from \(L_1\) to \(L_2\) in time t; meanwhile, the vehicle feature points P and Q move from \(P_1\) to \(P_2\) and from \(Q_1\) to \(Q_2\), respectively. According to (13), the 2D points are back-projected onto the road plane with \(Z_W=0\) in 3D space, giving the back-projected points \(A_1\), \(A_2\), \(B_1\), \(B_2\), which correspond to \(P_1\), \(P_2\), \(Q_1\), \(Q_2\), respectively. From similar triangles, the following relations hold:

$$\begin{aligned} \frac{P_1P_2}{A_1A_2}&=\frac{CP_1}{CA_1},&\qquad \frac{Q_1Q_2}{B_1B_2}&=\frac{CQ_1}{CB_1},\\ \frac{V_Pt}{V_At}&=\frac{CK}{CN},&\qquad \frac{V_Qt}{V_Bt}&=\frac{CT}{CN},\\ \frac{V_P}{V_A}&=\frac{h-h_p}{h},&\qquad \frac{V_Q}{V_B}&=\frac{h-h_q}{h}, \end{aligned}$$
(14)

where \(V_P\) and \(V_Q\) are the actual velocities of points P and Q, respectively. \(V_A\) and \(V_B\) are the back-projection velocities. Since the vehicle can be considered as a rigid object, all the points in the same rigid body have the same actual velocity V. Thus

$$\begin{aligned} V=V_P=V_Q. \end{aligned}$$
(15)

Combined with (14) and (15), it can be obtained that

$$\begin{aligned} \frac{V_A}{V_B}=\frac{h-h_q}{h-h_p}. \end{aligned}$$
(16)

Furthermore, if the feature point is low enough (\(h_p \approx 0\) in (14)), its back-projection velocity can be considered the actual velocity of the vehicle.

Fig. 7 Fine clustering within the category

5.2.2 Reconstruction of 3D information

After trajectory coarse clustering, the feature points in the same category are considered to be on the same vehicle. Then, the 3D information of these feature points can be reconstructed by the following steps (a code sketch follows the steps):

Step 1. Calculate the back-projection velocity. For feature point P, the coordinates \((X_i, Y_i, 0)\) of its back-projected point can be calculated by (13) during tracking. If the frame rate of the video is known, the back-projection velocity \(V_A\) can be obtained from (17). To reduce the error caused by feature point tracking, the least squares method (LSM) is employed, fitting the linear model \(S=V_A t+S_0\):

$$\begin{aligned} \left[ \begin{array}{c} S_1 \\ S_2 \\ \vdots \\ S_N \end{array}\right] =\left[ \begin{array}{cc} t_1 &\quad 1\\ t_2 &\quad 1\\ \vdots &\quad \vdots \\ t_N &\quad 1 \end{array}\right] \left[ \begin{array}{c} V_A\\ S_0 \end{array}\right] , \end{aligned}$$
(17)

where \(S_i\) is the displacement accumulated over the elapsed time \(t_i\), and N is the number of consecutive frames.

Step 2. Estimate the actual velocity of the vehicle. From Sect. 5.2.1, if a feature point is close to the road, its back-projection velocity can be taken as the actual velocity of the vehicle. Therefore, the smallest back-projection velocity in a category is taken as the estimate of the actual velocity V:

$$\begin{aligned} V=\min \{V_i\},\quad i=1,\ldots ,R, \end{aligned}$$
(18)

where \(V_i\) is the back-projection velocity of the ith feature point, and R is the number of feature points in the category.

Step 3. Reconstruct the 3D information of the feature point. According to (13), the 3D coordinate (X, Y, Z) of the feature point P can be calculated with \(Z=h_p\), where the actual height \(h_p\) of P follows from (16):

$$\begin{aligned} h_p=h-h\frac{V}{V_A}. \end{aligned}$$
(19)
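A minimal sketch of Steps 1–3, assuming the per-frame displacements and timestamps of one feature point are available; the closed-form least-squares slope gives \(V_A\) from (17), and (19) then gives the height.

```cpp
#include <vector>

// Fit the linear model S(t) = V_A * t + S_0 of Eq. (17) by ordinary least
// squares to get the back-projection velocity, then recover the feature-point
// height from Eq. (19). h is the camera height; v is the estimated vehicle
// velocity (the smallest back-projection velocity in the category).
double backProjVelocity(const std::vector<double>& t,
                        const std::vector<double>& s) {
    const std::size_t n = t.size();
    double st = 0, ss = 0, stt = 0, sts = 0;
    for (std::size_t i = 0; i < n; ++i) {
        st += t[i];  ss += s[i];
        stt += t[i] * t[i];  sts += t[i] * s[i];
    }
    // closed-form least-squares slope of S against t
    return (n * sts - st * ss) / (n * stt - st * st);
}

double featureHeight(double h, double v, double vA) {
    return h - h * v / vA;   // Eq. (19): h_p = h - h * V / V_A
}
```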

5.2.3 Trajectory fine clustering within the category

In this phase, the Assume-And-Then-Verify method is adopted to analyze the trajectory coarse clustering results and determine whether the feature points in the same category belong to the same vehicle. The fine clustering process is illustrated in Fig. 7. First, \(K_1\), \(K_2\), and \(K_3\) are the results of trajectory coarse clustering. Then, some points are eliminated from the three categories by the preset 3D vehicle-model thresholds and moved to a public pool. Finally, the feature points in the public pool are redistributed: they may be reassigned to other existing categories or form a new category \(K_4\). The specific steps are as follows (a code sketch follows the steps):

Step 1. Assume that all the feature points in a certain category belong to the same vehicle, with their 3D information reconstructed as in Sect. 5.2.2.

Step 2. Calculate the coordinate differences between the feature point \(P_i\) and the lowest point \(P_{\min }\):

$$\begin{aligned} \bigtriangleup X&=|X_{P_i}-X_{P_{\min }}|,\\ \bigtriangleup Y&=|Y_{P_i}-Y_{P_{\min }}|,\\ \bigtriangleup Z&=|Z_{P_i}-Z_{P_{\min }}|. \end{aligned}$$
(20)

Step 3. Match \(D(\bigtriangleup X,\bigtriangleup Y,\bigtriangleup Z)\) against the preset 3D vehicle models T(L, W, H). If it satisfies one of the three models, the feature point belongs to the category; otherwise, it is moved to the public pool:

$$\begin{aligned} P_i \in \left\{ \begin{array}{ll} \hbox {Small} &\quad D(\bigtriangleup X,\bigtriangleup Y,\bigtriangleup Z)\le T_\mathrm{min}(L,W,H)\\ \hbox {Midsize} &\quad T_\mathrm{min}(L,W,H) \le D(\bigtriangleup X,\bigtriangleup Y,\bigtriangleup Z)\le T_\mathrm{mid}(L,W,H)\\ \hbox {Oversize} &\quad D(\bigtriangleup X,\bigtriangleup Y,\bigtriangleup Z)\ge T_\mathrm{mid}(L,W,H). \end{array}\right. \end{aligned}$$
(21)

Step 4. Redistribute the feature points in the public pool. As shown in Fig. 7, a feature point removed from \(K_1\) is verified against the other categories (\(K_2\) and \(K_3\)): it is assumed to belong to \(K_2\) and checked by Steps 2 and 3 in turn, and likewise for \(K_3\). If it belongs to neither, it forms a new category \(K_4\), and the number of categories is increased by one.
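A minimal sketch of the model check in Steps 2–3; the vehicle-model extents (length, width, height in metres) are assumed values for illustration, since the paper does not list its thresholds.

```cpp
#include <cmath>

// Coordinate differences to the lowest point of the category (Eq. 20) are
// compared against preset 3D vehicle-model extents (Eq. 21).
struct P3 { double x, y, z; };
struct Model { double L, W, H; };

bool fitsModel(const P3& p, const P3& pMin, const Model& m) {
    const double dX = std::fabs(p.x - pMin.x);   // Eq. (20)
    const double dY = std::fabs(p.y - pMin.y);
    const double dZ = std::fabs(p.z - pMin.z);
    return dX <= m.L && dY <= m.W && dZ <= m.H;
}

// A point stays in its category if it fits at least one vehicle model;
// otherwise it goes to the public pool for redistribution (Step 4).
bool keepInCategory(const P3& p, const P3& pMin) {
    const Model small{5.0, 2.0, 1.6};      // assumed small-vehicle extents
    const Model midsize{9.0, 2.5, 3.0};    // assumed midsize extents
    const Model oversize{18.0, 3.0, 4.2};  // assumed oversize extents
    return fitsModel(p, pMin, small) || fitsModel(p, pMin, midsize) ||
           fitsModel(p, pMin, oversize);
}
```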

Fig. 8 Clustering result in the back-projected image. a Result of trajectory coarse clustering. b Result of trajectory fine clustering

Table 2 3D information of feature points

The feature points were clustered in Sect. 5.1, but the results are inaccurate, as shown in Fig. 5. Trajectory fine clustering within the category is therefore performed on the coarse clustering results, as shown in Fig. 8. The feature points of two vehicles are grouped into the same category after trajectory coarse clustering, as shown in Fig. 8a. Fine clustering is then applied within this category, as shown in Table 2. There are 11 feature points in this category, and their back-projection velocities are listed in Table 2. To keep the cluster center near the vehicle center, we select the median of the back-projection velocities as the reference velocity, and the corresponding feature point is regarded as the reference point (\(P_{14}\) in Table 2). Some feature points are removed from the current category by the 3D vehicle model, and the feature points in the public pool are then redistributed. Finally, the new clustering categories are obtained, as shown in Fig. 8b.

5.3 Agglomerative clustering

In this phase, the Assume-And-Then-Verify method is again used to judge whether two categories need to be merged. Assuming that the two categories belong to the same vehicle, the feature point with the lowest back-projection velocity in the two categories is taken as the reference point, and its category as the reference category. The back-projection velocity of the reference point is used as the real velocity to reconstruct the 3D information of the feature points in the other category, and the 3D trajectories are then obtained by tracking. Finally, the motion consistency constraint is used to determine whether the other feature points and the reference point belong to the same category. The assumption is verified by counting the feature points that satisfy the motion consistency constraint: if their number exceeds a preset threshold (60% of the total number of points), the two categories are considered to be on the same vehicle and are merged into one; otherwise, they belong to different vehicles. The final clustering result is obtained by comparing all the categories in turn.

The steps for determining motion consistency are as follows (a code sketch follows the steps):

Step 1. Estimate the reference trajectory of category \(K_i\). Select the feature point with the lowest back-projection velocity in \(K_i\) as the reference point, and set its height to \(Z=0\). The 3D trajectory points of the reference point are obtained from the 2D trajectory points by (13):

$$\begin{aligned} L_i=&\{(X_{i1},Y_{i1},Z_{i1}),(X_{i2},Y_{i2},Z_{i2}),\ldots ,\\&(X_{im},Y_{im},Z_{im})\}. \end{aligned}$$
(22)
Step 2. Estimate the 3D trajectories of the feature points in category \(K_j\). Assuming that the two categories belong to the same vehicle, the 3D information of the feature points in \(K_j\) can be reconstructed with the height of the reference point by (16). The 3D trajectory points in \(K_j\) are then obtained by tracking:

    $$\begin{aligned} \begin{aligned} L_j=&\{(X_{j1},Y_{j1},Z_{j1}),(X_{j2},Y_{j2},Z_{j2}),\ldots ,\\&(X_{jn},Y_{jn},Z_{jn})\}. \end{aligned} \end{aligned}$$
    (23)
Step 3. Calculate the distance between the two trajectory points:

$$\begin{aligned} \hbox {Dis}_t=\sqrt{(X_{it}-X_{jt})^2+(Y_{it}-Y_{jt})^2+(Z_{it}-Z_{jt})^2}, \end{aligned}$$
(24)

where t indicates that the two trajectory points are taken at the same moment.

Step 4. Calculate the mean absolute variation of the distance between the two trajectories over their l common points:

$$\begin{aligned} C=\frac{1}{l-1}\sum _{t=2}^{l}|\hbox {Dis}_t-\hbox {Dis}_{t-1}|. \end{aligned}$$
(25)
Step 5. Compare C with the threshold \(C_{\min }\). If \(C\le C_{\min }\), the two trajectories satisfy the motion consistency constraint; otherwise, they belong to different vehicles.
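A minimal sketch of Steps 3–5, assuming the two 3D trajectories have been truncated to their common time span; \(C_{\min }=0.2\) follows the value given below in this section.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Motion consistency check of Steps 3-5: the per-frame distance between two
// 3D trajectories (Eq. 24) and its mean absolute variation C (Eq. 25). Two
// trajectories on the same rigid object keep a near-constant distance, so a
// small C indicates motion consistency.
struct P3d { double x, y, z; };

bool motionConsistent(const std::vector<P3d>& Li, const std::vector<P3d>& Lj,
                      double cMin = 0.2) {
    const std::size_t l = std::min(Li.size(), Lj.size());
    if (l < 2) return false;
    std::vector<double> dis(l);
    for (std::size_t t = 0; t < l; ++t) {        // Eq. (24)
        const double dx = Li[t].x - Lj[t].x;
        const double dy = Li[t].y - Lj[t].y;
        const double dz = Li[t].z - Lj[t].z;
        dis[t] = std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    double c = 0.0;
    for (std::size_t t = 1; t < l; ++t)          // Eq. (25)
        c += std::fabs(dis[t] - dis[t - 1]);
    c /= static_cast<double>(l - 1);
    return c <= cMin;
}
```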

Fig. 9 Result of agglomerative clustering between categories in the back-projected image. a Result of trajectory coarse clustering. b Result of merging

Table 3 Reconstruction of 3D information

In theory, if two trajectories belong to the same rigid object, they satisfy the motion consistency constraint: they are in the same motion state, so the distance between corresponding trajectory points is constant and \(C=0\). In practice, however, camera calibration introduces errors and the real velocity cannot be obtained exactly, so we set the threshold \(C_{\min } =0.2\) based on a series of tests. Figure 9 shows the merging results between categories. There are three trajectories in frame 1193. The coordinates of the 3D trajectory points are reconstructed from the reference point \(P_0\), as shown in Table 3, and the motion consistency measure C is given in Table 4. The feature points of the vehicle are divided into two categories after coarse clustering, as shown in Fig. 9a, but they are merged into one category in frame 1197 after the motion consistency analysis, as shown in Fig. 9b.

6 Experimental results

To demonstrate the effectiveness and robustness of the proposed clustering framework, it has been tested on different real video sequences, including highway and urban scenes. The tests run on a Windows 7 platform with a Pentium 4 3.2 GHz central processing unit and 4 GB of random access memory. The size of each image is \(720\times 288\), and the frame rate is 25 FPS. The proposed clustering framework is implemented in Visual C++ on raw video. The experimental results are shown in Fig. 10.

Table 4 Motion consistency discriminant

There are three parts in the experimental results. The statistical results of traffic flow are addressed in Part A. Part B shows the results of vehicle classification and accuracy in different scenes. Finally, a comparison with other systems is shown in Part C.

6.1 Statistical results of traffic flow

In the test, the number of vehicles is counted according to the moments when a trajectory appears and disappears. In other words, the final clustering results are obtained by the coarse-to-fine clustering method: when the feature points of a certain category appear, they are clustered by coarse clustering, and when the trajectory points disappear, fine clustering and agglomerative clustering are carried out. The category is then deleted and the counter is incremented by one, as shown in Fig. 11. The proposed method has been tested in different traffic scenes, and the results are shown in Table 5.

Fig. 10 Clustering results for different traffic scenes. a Urban road in Xi’an. b Highway in Shanghai

Fig. 11 Counting process

Table 5 Statistical results of traffic flow
Fig. 12 Vehicle classification. a Oversize vehicle. b Midsize vehicle. c Small vehicle

6.2 Vehicle classification

According to the fine clustering process, the 3D parameters of the feature points can be reconstructed. When the final clustering is completed, the feature points of the same vehicle are aggregated into the same category, and the 3D coordinates of each feature point can then be used to determine the position of the vehicle boundary. Thus, the vehicle class can be obtained, as shown in Fig. 12. Table 6 shows the vehicle classification statistics in different traffic scenes. The experimental results show that the detection rate of the proposed method is above 90%.

Table 6 Results of vehicle classification
Table 7 Comparison with other cluster frameworks

6.3 Comparison with other clustering frameworks

To show the improvements achieved by the proposed clustering framework, it is compared with other clustering frameworks in Table 7. These clustering frameworks were applied to an urban road in Shanghai. As given in Table 7, Rabaud and Belongie (2006) clustered moving objects by a motion consistency constraint and identified individuals under the assumption that features from a single object have the same affine motion; Kanhere and Birchfield (2008) used a plumb line projection (PLP) to compute the 3D coordinates of stable feature points, which were grouped and tracked to provide the vehicle count. However, the clustering framework of Rabaud and Belongie (2006) operates in the 2D image plane, which cannot handle the perspective effects encountered at low camera angles, and its motion consistency constraint is not suitable for segmenting vehicles in such traffic scenes; as Table 7 shows, its detection rate is the lowest. Even though the clustering framework of Kanhere and Birchfield (2008) performs well, its detection rate is still lower than that of the proposed framework. We combine the ideas of these two papers and apply the motion consistency constraint in 3D space to cluster vehicle trajectory points. It can be seen that the proposed clustering framework is robust in different traffic scenes.

7 Conclusion

In this paper, a bottom-up clustering framework is proposed for vehicle trajectory clustering in traffic scenes. There are four main stages in the proposed framework: image preprocessing, feature point extraction, vehicle trajectory tracking, and trajectory clustering. In the feature point extraction stage, the ORB algorithm is used to select suitable feature points; since it reduces the computational time and satisfies the requirement of real-time detection, the ORB descriptor outperforms the other feature descriptors in real-time applications. In the trajectory clustering stage, the 3D parameters of the feature points are reconstructed by the proposed method and used in the coarse-to-fine clustering. The proposed framework has been tested in different traffic scenes (highway and urban). Experimental results show that the clustering accuracy of the proposed method is very high, reaching 96%, even when the traffic conditions are quite complex, and that it performs well in handling vehicle occlusion. The proposed clustering framework can also be applied in other areas, such as traffic parameter measurement, vehicle classification, and abnormal event alarms.