1 Introduction

Detection of vehicles is used in many applications such as traffic surveillance, driver assistance systems, and autonomous vehicles. A great deal of work has been carried out in this field during the past decade, and surveys of several techniques can be found in Buch et al. (2011), Dollár et al. (2012) and Sivaraman and Trivedi (2013a). Several types of image features and representations have been used for object detection and recognition, such as the histogram of oriented gradients (HOG) (Dalal and Triggs 2005), Haar-like features (Papageorgiou et al. 1998; Sivaraman and Trivedi 2013b), interest-point-based features (Leibe et al. 2008), local binary patterns (Wang et al. 2009), and 3D voxel patterns (3DVP) (Xiang et al. 2015). HOG features have been investigated widely and used in state-of-the-art techniques for object detection and description (Dollár et al. 2012). Instead of the 1D vector representation of HOG (Dalal 2006; Dalal and Triggs 2005; Wu et al. 2014), several papers have adopted a 2D representation (Dollár et al. 2009; Felzenszwalb et al. 2010; Maji et al. 2008), since the latter preserves the relations among neighboring pixels or cells. In order to distinguish the 2D representation from the 1D one, we will call it 2DHOG. Both the 1D and 2D representations of HOG capture the edge structure of the object and are robust against illumination changes and background clutter. However, neither representation is resolution invariant. Thus, detectors employing these representations must extract HOG or 2DHOG at each scale of an image pyramid, requiring a costly multi-scale scan in the testing mode (Dollár et al. 2009; Maji et al. 2008).

Recently, Dollár et al. (2014, 2010) proposed a feature approximation technique, in which gradient histograms and color feature responses computed at one scale of an image pyramid are used to approximate the feature responses at nearby scales. This method speeds up feature extraction from the image pyramid over the methods of Dollár et al. (2009) and Maji et al. (2008), with only a small reduction in the detection accuracy. In this technique, the feature responses can be approximated with high accuracy within one octave of the scales of the image pyramid. Later, the authors of Benenson et al. (2012) and Ohn-Bar and Trivedi (2015) enhanced the detection performance of Dollár et al. (2010) by constructing a classifier pyramid instead of an image pyramid. However, since the methods in Benenson et al. (2012) and Ohn-Bar and Trivedi (2015) construct a classifier pyramid with multiple classifiers trained at different object sizes (for example, in Benenson et al. 2012 the object sizes considered are \(64 \times 32, 128 \times 64, 256 \times 128\), etc.), they incur a high training and storage cost.

Part-based methods have received a great deal of attention from the research community, as they can handle partial occlusion and represent targets with several views (Felzenszwalb et al. 2010; Sivaraman and Trivedi 2013b; Takeuchi et al. 2010). For instance, Felzenszwalb et al. (2010) proposed a pictorial structure for HOG features, referred to as the deformable part-based model (DPM). In this method, the locations of the parts are used as latent variables for a latent support vector machine (LSVM) classifier to find the optimal object position. Later, several other techniques adopted DPM (Felzenszwalb et al. 2010) for vehicle detection (Li et al. 2014; Takeuchi et al. 2010; Wang et al. 2016). These methods provide high detection accuracy; however, they require convolving the features at a given level of the image pyramid with a number of part filters, which results in a high computational cost.

Some of the latest schemes in the area of object detection (Pepikj et al. 2013; Wang et al. 2015; Xiang et al. 2015) have attempted to address the challenges of scale, aspect ratio, and severe occlusion. For example, the method in Pepikj et al. (2013) used a detection scheme based on the DPM detector (Felzenszwalb et al. 2010) and introduced a method for clustering the training data into a number of similar occlusion patterns. These patterns have been used with different occlusion strategies to train the LSVM classifier (Felzenszwalb et al. 2010). Later, Xiang et al. (2015) combined the 3DVP object representation, which encodes the appearance, 3D shape, viewpoint, and the level of occlusion and truncation, with a boosting detector based on the detection scheme in Dollár et al. (2014) in order to learn from the occluded and non-occluded 3DVPs obtained from a training set. Recently, the method in Wang et al. (2015) introduced region-based features with a coordinate normalization scheme, referred to as regionlet features, and a cascaded boosting classifier to tackle the challenges of detecting objects of different scales and aspect ratios. Even though these methods have been effective in tackling these challenges, they incur a high computational cost either in the training mode, as in Wang et al. (2015) and Xiang et al. (2015), or in the testing mode, as in Pepikj et al. (2013).

The detection accuracy of detectors employing HOG or its variants in the spatial domain has started to saturate (Dollár et al. 2012). Recently, for the first time, the fast Fourier transform (FFT) has been used with 2DHOG in order to replace the costly convolution operation in the spatial domain by multiplication in the FFT domain (Dubout and Fleuret 2012). This scheme achieves a speedup over its spatial-domain counterpart (Felzenszwalb et al. 2010). Later, in Naiel et al. (2015), a method for approximating feature pyramids in the DFT domain instead of the spatial domain was introduced, resulting in a better feature approximation accuracy than the spatial-domain counterpart in Dollár et al. (2014). Despite the fact that both the methods in Dubout and Fleuret (2012) and Naiel et al. (2015) use a transform domain with 2DHOG, it is necessary to apply the corresponding inverse transform to classify the 2DHOG features in the spatial domain. Thus, the methods in Dubout and Fleuret (2012) and Naiel et al. (2015) are based on training an object detector in the spatial domain, which usually requires a large storage and training cost. On the other hand, this paper develops a scheme that is able to classify the compressed and transformed 2DHOG features directly in the transform domain.

In this paper, we apply the 2D discrete Fourier transform (2DDFT) or the 2D discrete cosine transform (2DDCT) on block-partitioned 2DHOG, followed by a truncation process to retain only a fixed number of low frequency coefficients, which are referred to as TD2DHOG features. Further, using the 2DDFT downsampling theorem (Smith 2007) and considering the effect of image resampling on the 2DHOG features (Dollár et al. 2014), it is shown that the TD2DHOG features obtained from an image at the original resolution and from a downsampled version of the same image are approximately the same within a multiplicative factor, with a similar result holding true when the 2DDCT is used. The use of TD2DHOG features simplifies the classifier training phase, since a classifier trained on high resolution vehicles can be used to detect the same or lower resolution vehicles in the test image, instead of training multiple classifiers, each on vehicles of a specific resolution, as done in Benenson et al. (2012) and Naiel et al. (2014). Next, we employ the two-dimensional principal component analysis (2DPCA) (Yang et al. 2004) for feature extraction and dimensionality reduction. The design of the proposed scheme aims to solve the challenging problem of scale variation that is common in most vehicle detection datasets. Extensive experiments are conducted in order to evaluate the detection performance of the proposed technique and compare it with that of the state-of-the-art techniques.

The paper is organized as follows. In Sect. 2, we present a brief background on 2DHOG features and the effect of image resampling on these features. In Sect. 3, we study the effect of downsampling a grayscale image on its DFT and DCT versions. In Sect. 4, a detailed description of extracting the TD2DHOG features is presented. Further, a model for the multiplicative factor that approximately relates the TD2DHOG features at two different resolutions of a given image is established. In Sect. 5, the model derived in Sect. 4 is used in proposing a scheme for detecting vehicles of different resolutions using a single classifier rather than a classifier pyramid. In Sect. 6, we first validate experimentally the proposed model for the multiplicative factor in both the 2DDFT and 2DDCT domains. Then, the performance of the proposed vehicle detection scheme is studied by carrying out extensive experiments on a number of publicly available vehicle detection datasets and compared with that of the state-of-the-art techniques. Finally, Sect. 7 concludes the paper.

2 Background

In this section, we present some background material required for the development of the proposed detection scheme using TD2DHOG features in subsequent sections.

2.1 Two-dimensional HOG features

2DHOG features are similar to the HOG features of Dalal and Triggs (2005), the difference being the way in which the features are represented, namely, in a 2D matrix format in the case of the former and a 1D vector format in the case of the latter. The 2DHOG features have been used in a number of papers (Dollár et al. 2009; Felzenszwalb et al. 2010; Maji et al. 2008).

Let us consider an image, I, of size (\(M_1\times M_2\)), and divide it into non-overlapping cells of size (\(\eta _1\times \eta _2\)) pixels. The 2DHOG features are computed from the input image as follows. First, we convolve the image I with the filter \(L=[-1, 0, 1]\) and its transpose \(L^{\top }\) to obtain the gradients \(g_x(i,j)\) and \(g_y(i,j)\), in the x and y directions, respectively, where i and j denote the pixel indices. Then, we compute the magnitude \(\varGamma (i,j)\) and the orientation \(\theta (i,j)\) of the gradient at (i, j) as

$$\begin{aligned} \begin{aligned} \varGamma (i,j)&= \sqrt{g_x(i,j)^2+g_y(i,j)^2} \\ \theta (i,j)&= \arctan \left( {g_y(i,j)/g_x(i,j)}\right) \\ \end{aligned} \end{aligned}$$
(1)

Next, the orientation \(\theta (i,j)\) at the (ij)th pixel is quantized into \(\beta \) bins to obtain the corresponding quantized orientation \(\hat{\theta } (i,j)\in \{\varOmega _l\}\), \(\varOmega _l =(l-1) \dfrac{\pi }{\beta }, l=1,2,\ldots ,\beta \). Then, the 2DHOG features for the lth layer, \(h^{l}(\hat{i},\hat{j})\), can be computed using the following equation

$$\begin{aligned} \begin{matrix} h^{l}(\hat{i},\hat{j})=\sum \nolimits _{i=(\hat{i}-1)\eta _1+1}^{\hat{i}\eta _1} \left( \sum \nolimits _{j=(\hat{j}-1)\eta _2+1}^{\hat{j}\eta _2} \varGamma (i,j) \delta _{l}(i,j)\right) \end{matrix} \end{aligned}$$
(2)

where

$$\begin{aligned} \begin{aligned} \delta _{l}(i,j)&={\left\{ \begin{array}{ll}1, &{}\text {if } \hat{\theta }(i,j)=\varOmega _l\\ 0, &{} \text { otherwise} \end{array}\right. } \end{aligned} \end{aligned}$$
(3)

\(\hat{i}\) and \(\hat{j}\) being the cell indices, \(1\le \hat{i}\le \tilde{M_1} = M_1/\eta _1\), \(1\le \hat{j}\le \tilde{M_2} = M_2/\eta _2\), such that \(\tilde{M_1}\) and \(\tilde{M_2}\) are integers. Thus, the 2D representation for the HOG features results in \(\beta \text {-layers}\), \(h^l \left( l= 1,2,\ldots ,\beta \right) \), where the spatial relation between neighboring cells is maintained, and the size of each layer is (\(\tilde{M_1}\times \tilde{M_2}\)).
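To make the computation in (1)-(3) concrete, the following minimal Python sketch (our own illustration; the function name, the use of NumPy, and the simple floor-style orientation binning are assumptions not taken from the original) computes the \(\beta \) layers of 2DHOG for a grayscale image:

```python
import numpy as np

def compute_2dhog(image, cell=(4, 4), beta=7):
    """Minimal 2DHOG sketch following (1)-(3): per-cell sums of gradient
    magnitudes, one layer per quantized (unsigned) orientation bin."""
    img = image.astype(np.float64)
    # Gradients obtained with the filter L = [-1, 0, 1] and its transpose.
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)                      # Gamma(i, j) in (1)
    theta = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned orientation in [0, pi)
    # Quantize orientations into beta bins (floor binning is one simple choice).
    bins = np.minimum((theta / (np.pi / beta)).astype(int), beta - 1)
    e1, e2 = cell
    m1, m2 = img.shape[0] // e1, img.shape[1] // e2   # number of cells per direction
    layers = np.zeros((beta, m1, m2))
    for l in range(beta):
        masked = np.where(bins == l, mag, 0.0)[:m1 * e1, :m2 * e2]
        # Sum of the masked magnitudes over each (eta1 x eta2) cell, as in (2)-(3).
        layers[l] = masked.reshape(m1, e1, m2, e2).sum(axis=(1, 3))
    return layers                               # shape: (beta, M1/eta1, M2/eta2)
```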

2.2 Effect of image resampling on 2DHOG features

Statistics of resampled images in the spatial domain have been studied in Huang and Mumford (1999) and Ruderman (1994). Recently, the effect of image resampling on 2DHOG features in the spatial domain has been studied by Dollár et al. (2014, 2010). In this section, we give a brief description of the work in Dollár et al. (2014), which will be used later in studying the effect of image resampling on the features in the transform domain.

Fig. 1
figure 1

Block diagram illustrating the approximate relationship between the resampled features of an image at a given resolution and the features extracted from a resampled version of the same image

Let \(I_{s}=\mathcal{P}(I,s)\) denote the input image I resampled by a factor s, where \({s}<1\) represents downsampling, \({s}>1\) represents upsampling, and \(\mathcal{P}\) represents the resampling operator in the spatial domain. The exact channel features extracted from the image at the original resolution, and the same image at a different resolution can be represented by \(z=\varLambda (I)\), and \(z_s=\varLambda (I_s)\), respectively, where \(\varLambda \) denotes a 2D spatial-domain feature extractor. It has been shown in Dollár et al. (2014) that resampling the image I by a factor s, \(I_s=\mathcal{P}(I,s)\), followed by computing the exact 2D channel features, \(z_s=\varLambda (I_s)\), can be approximated by resampling the channel feature, z, followed by a multiplicative factor, \(\gamma \), that is modeled by using the power law as

$$\begin{aligned} z_s=\varLambda (\mathcal{P}(I,s))\approx \tilde{z}_s=\gamma \mathcal{P}(z,s) \end{aligned}$$
(4)

where

$$\begin{aligned} \gamma =a_0 s^{-\lambda } \end{aligned}$$
(5)

and \(a_0\) and \(\lambda \) depend on the type of channel features, which could be gradient, color or 2DHOG, and are empirically determined. This relationship is illustrated by the block diagram of Fig. 1. The values of \(a_0\) and \(\lambda \) are not necessarily the same for the case of upsampling and downsampling for the same type of channel features.

For object detection using a single detection window, one constructs an image pyramid encompassing different scales, and then extracts the features from every scale in the pyramid. The use of the approximation in (4) allows the features generated at one scale from the image pyramid to approximate the features at nearby scales, thus reducing the cost of feature computation.
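As a rough illustration of (4)-(5), the sketch below (our own; it assumes SciPy's bilinear zoom as the resampling operator \(\mathcal{P}\), and the values of \(a_0\) and \(\lambda \) shown are placeholders, since the actual constants are channel-specific and determined empirically) approximates a resampled feature channel from the channel computed at the original scale:

```python
import numpy as np
from scipy.ndimage import zoom

def approximate_channel(z, s, a0, lam):
    # z_s = Lambda(P(I, s)) ~ gamma * P(z, s), with gamma = a0 * s**(-lam), cf. (4)-(5).
    gamma = a0 * s ** (-lam)
    return gamma * zoom(z, s, order=1)          # bilinear resampling as P(., s)

z = np.random.rand(64, 64)                      # a channel computed at the original scale
z_half = approximate_channel(z, 0.5, a0=1.0, lam=0.1)   # a0, lam: placeholder values
print(z_half.shape)                             # (32, 32)
```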

3 Effect of downsampling a grayscale image on its transformed version

In this section, we study the effect of downsampling a grayscale image on its DFT and DCT versions, and these results are then used in Sect. 4 to investigate the effect of image downsampling on transform-domain 2DHOG features.

Fig. 2
figure 2

a Magnitude of a signal in the DFT domain \(Z_N[k]\), where a low pass filter with cutoff frequency \(N_c\) is used to bandlimit the signal. b Magnitude of the downsampled signal in the DFT domain \(\hat{Z}_{\hat{N}}[k]\), where \(N=16, K=2, \hat{N}=8\), and \(N_c=4\) (Color figure online)

3.1 Effect on the DFT version

Let the N-point 1DDFT of the discrete time sequence, \(z[n] \in \mathbb {R}\), be denoted as \(Z_N[k]\), where \(n= 0,1,\ldots ,N-1\), \(k= 0,1,\ldots ,N-1\), and N is an even integer multiple of the integer K. Let an ideal low pass filter of unity gain and a cutoff frequency \(N_c\le N/({2 K})\) be used in order to bandlimit the signal. By downsampling z by K in the time domain, the downsampled signal \(\hat{z}\) of length \(\hat{N}=N/K\) is obtained. Then, the \(\hat{N}\)-point 1DDFT is applied to the downsampled signal, \(\hat{z}\), in order to obtain the downsampled signal in the frequency domain, \(\hat{Z}_{\hat{N}}\). Now, the relations between the original signal and its downsampled version in the time domain and in the frequency domain are given by

$$\begin{aligned} \hat{z}[n]= & {} z[Kn] \end{aligned}$$
(6)
$$\begin{aligned} \hat{Z}_{\hat{N}}[k]= & {} \frac{1}{K} \sum _{i=0}^{K-1} Z_N\left[ k+i\hat{N}\right] \end{aligned}$$
(7)

where \(n= 0,1,\ldots ,\hat{N}-1\), and \(k= 0,1,\ldots ,\hat{N}-1\). It is clear from (7) that the downsampled signal in the 1DDFT domain, \(\hat{Z}_{\hat{N}}\), is represented by a sum of K shifted copies of the original signal in the 1DDFT domain, \(Z_N\), scaled by the factor 1/K (Smith 2007). Figure 2 illustrates an example of this in the DFT domain, when \(N=16\), \(\hat{N}=8\), \(K=2\), and \(N_c=4\). Since the original signal is bandlimited, then for \(k = 0,1,\ldots ,c_1-1\), \(c_1 \le N_c\), the contribution to the summation in (7) comes only from the first copy of \(Z_N\) at \(i=0\), and so we have

$$\begin{aligned} Z_N[k] = K \hat{Z}_{\hat{N}}[k] \end{aligned}$$
(8)

This result is supported by that presented in Bi and Mitra (2011). We now consider a 2D signal. Let \(g\in \mathbb {R}^2\) represent a grayscale image in the spatial domain of size \((N_1\times N_2)\), where \(N_1\) and \(N_2\) are even integer multiples of \(K_1\) and \(K_2\), respectively, \(K_1\) and \(K_2\) being integers. Assume that an ideal low pass filter of unity gain and cutoff frequencies \(N_{c_1}\le N_1/(2 K_1)\) and \(N_{c_2} \le N_2/(2 K_2)\) is used to bandlimit the original signal. Downsampling g by a factor \(K_1\) in the y direction, and \(K_2\) in the x direction results in \(\hat{g}[n,m]=g[K_1 n,K_2 m]\) of size \((\hat{N}_1\times \hat{N}_2)\), where n and m represent the spatial domain discrete sample indices, \(0 \le n \le \hat{N}_1 -1\), \(0 \le m \le \hat{N}_2 -1\), \(\hat{N}_1 ={N}_1/K_1\) and \(\hat{N}_2 =N_2/K_2\). We now take the 2DDFT of g and \(\hat{g}\) to obtain \(G_{N_1,N_2}\) and \(\hat{G}_{\hat{N}_1,\hat{N}_2}\) corresponding to the 2DDFT coefficients of the original image and that of its downsampled version, respectively. Similar to the case of 1DDFT, the relation between \(G_{N_1,N_2}[u,v]\) and \(\hat{G}_{\hat{N}_1,\hat{N}_2} [u,v]\) can be expressed as

$$\begin{aligned} \hat{G}_{\hat{N}_1,\hat{N}_2} [u,v]= \frac{1}{K_1 K_2}\underset{i}{\sum } \underset{j}{\sum }G_{N_1,N_2}[u+ i \hat{N}_1,v + j \hat{N}_2] \end{aligned}$$
(9)

where \(u= 0,1,\ldots ,\hat{N}_1-1\), \(v= 0,1,\ldots ,\hat{N}_2-1\), \(i= 0,1,\ldots ,K_1-1\), and \(j= 0,1,\ldots ,K_2-1\). It is seen from this equation that the downsampled image in the 2DDFT domain is represented by a sum of \(K_1 \times K_2\) shifted copies of the original image in the 2DDFT domain and scaled by the factor \(1/{(K_1 K_2)}\). Let \(c_1\) and \(c_2\) denote the maximum frequencies retained by the truncation operator. For \(u= 0,1,\ldots ,c_1-1\), \(v= 0,1,\ldots ,c_2-1\), \(c_1\le N_{c_1}\), and \(c_2\le N_{c_2}\) the contribution of the summation shown in (9) is from the copy corresponding to \(i=j=0\), and we can obtain the following relation

$$\begin{aligned} G_{N_1,N_2}[u,v]= K_1 K_2 \hat{G}_{\hat{N}_1,\hat{N}_2} [u,v] \end{aligned}$$
(10)

From the above equation it is seen that the ratio between a grayscale image in the 2DDFT domain and that of its downsampled version is \(K_1 K_2\).
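The relation (10) can be verified numerically. The short NumPy sketch below (our own check, not part of the original experiments) bandlimits a random image with an ideal low-pass mask, downsamples it, and confirms that the retained low-frequency 2DDFT coefficients differ by the factor \(K_1 K_2\) up to floating-point precision:

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, K1, K2 = 64, 64, 2, 2
Nc1, Nc2 = N1 // (2 * K1), N2 // (2 * K2)         # cutoff frequencies N_c1, N_c2

# Ideal low-pass filtering in the 2DDFT domain (the symmetric mask keeps the image real).
u = np.minimum(np.arange(N1), N1 - np.arange(N1))[:, None]   # wrapped frequency index
v = np.minimum(np.arange(N2), N2 - np.arange(N2))[None, :]
G0 = np.fft.fft2(rng.standard_normal((N1, N2)))
g = np.fft.ifft2(np.where((u < Nc1) & (v < Nc2), G0, 0)).real

G = np.fft.fft2(g)                                 # original image in the 2DDFT domain
G_hat = np.fft.fft2(g[::K1, ::K2])                 # downsampled image in the 2DDFT domain

c1, c2 = Nc1, Nc2                                  # retained low frequencies
err = np.abs(G[:c1, :c2] - K1 * K2 * G_hat[:c1, :c2]).max() / np.abs(G).max()
print(err)                                         # negligible (machine precision), as in (10)
```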

Fig. 3
figure 3

a Magnitude of the signal \(E_{2N}[k]\) defined as \(E_{2N}[k]=Y_{2N}[k]e^{-j\frac{\pi k}{2N}}\), where a low pass filter with cutoff frequency \(N_c\) is used to bandlimit the signal. b Magnitude of the downsampled signal in the DFT domain \(\hat{E}_{2\hat{N}}[k]\), where \(N=8, K=2, \hat{N}=4\), and \(N_c=4\) (Color figure online)

3.2 Effect on the DCT version

In Ahmed et al. (1974) the N-point 1DDCT, \(X_N\), for the discrete time sequence, \(x \in \mathbb {R}\), is given by

$$\begin{aligned} X_N [k]=\hat{\varGamma }_{N}[k]\sum _{n=0}^{N-1}x[n]\cos {\frac{\pi (2n+1)k}{2N}} \end{aligned}$$
(11)

where \(\hat{\varGamma }_{N}[k]=\sqrt{1/N}\) for \(k=0\), and \(\hat{\varGamma }_{N}[k]=\sqrt{2/N}\) for \(0< k \le N-1\). The N-point 1DDCT can be computed from the 2N-point 1DDFT of a sequence y[n], as follows. First, let x[n] be a bandlimited signal and y[n] be defined as

$$\begin{aligned} \begin{aligned} y[n]&={\left\{ \begin{array}{ll}x[n], &{} 0\le n \le N-1\\ 0, &{} N\le n \le 2N-1 \end{array}\right. } \end{aligned} \end{aligned}$$
(12)

The 1DDFT is employed on y in order to obtain \(Y_{2N}\). It has been shown in Ahmed et al. (1974) that the signal \(X_N[k]\) in the 1DDCT domain is related to \(Y_{2N}[k]\) by

$$\begin{aligned} X_N[k]=\hat{\varGamma }_{N}[k] \text {Re}\left( Y_{2N}[k] e^{-j\frac{\pi k}{2N}}\right) \end{aligned}$$
(13)

where \(k= 0,1,\ldots ,N-1\), and \(\text {Re} ()\) is a function which returns the real part of an input complex number. Let an ideal low pass filter of unity gain and a cutoff frequency \(N_c\le N/K\) be used in order to bandlimit the signal \(Y_{2N}\), where N is an even integer multiple of the integer K. Let \(E_{2N}[k]\) be a 1D signal in the 1DDFT domain, defined as \(E_{2N}[k]=Y_{2N}[k] e^{-j\frac{\pi k}{2N}}\). From the downsampling theorem given by (7), the version of \(E_{2N}[k]\) downsampled by a factor K is obtained in the 1DDFT domain as:

$$\begin{aligned} \hat{E}_{2\hat{N}}[k]=\frac{1}{K} \sum _{i=0}^{K-1} E_{2N}\left[ k+i2\hat{N}\right] \end{aligned}$$
(14)

where \(\hat{E}_{2\hat{N}}\) is of length \(2\hat{N}=2N/K\), and \(k= 0,1,\ldots ,N-1\). Figure 3a and b illustrate an example for \(E_{2N}[k]\) and \(\hat{E}_{2\hat{N}}[k]\), respectively, where \(N=8, K=2, \hat{N}=4\), and \(N_c=4\). Now, the downsampled signal in the 1DDCT domain, \(\hat{X}_{\hat{N}}\) of length \(\hat{N}\), can be obtained as follows:

$$\begin{aligned} \hat{X}_{\hat{N}}[k]&= \hat{\varGamma }_{\hat{N}}[k] \text {Re}(\hat{E}_{2\hat{N}}[k]) \end{aligned}$$
(15)
$$\begin{aligned}&= \hat{\varGamma }_{\hat{N}}[k] \text {Re}\left( \frac{1}{K} \sum _{i=0}^{K-1} Y_{2N}[k+i2\hat{N}]e^{-j\frac{\pi (k+i 2\hat{N})}{2N}}\right) \end{aligned}$$
(16)

Let \(c_1\) denote the maximum frequency retained by the truncation operator. Since \(Y_{2N}\) is bandlimited to the maximum frequency \( N_c \le N/K\), then for \(k = 0,1,\ldots ,c_1-1\), where \(c_1\le N_c\), the contribution to the summation in (16) comes only from the \(i=0\) copy, and so we can simplify the above relation as

$$\begin{aligned} \hat{X}_{\hat{N}}[k]&=\frac{1}{K} \hat{\varGamma }_{\hat{N}}[k] \text {Re}\left( Y_{2N}[k]e^{-j\frac{\pi k}{2N}}\right) \end{aligned}$$
(17)
$$\begin{aligned}&= \frac{\hat{\varGamma }_{\hat{N}}[k]}{K\hat{\varGamma }_{{N}}[k]} \hat{\varGamma }_{{N}}[k] \text {Re}\left( Y_{2N}[k]e^{-j\frac{\pi k}{2N}}\right) =\frac{\sqrt{1/\hat{N}}}{K \sqrt{1/N} } X_{N}[k] \end{aligned}$$
(18)
$$\begin{aligned}&= \frac{1}{\sqrt{K}} X_{N}[k] \end{aligned}$$
(19)

Thus, the relation between a 1DDCT transformed signal and its downsampled version in the 1DDCT domain can be expressed as

$$\begin{aligned} X_{N}[k]= \sqrt{K} \hat{X}_{\hat{N}}[k] \end{aligned}$$
(20)

where \(0\le k \le c_1-1\). Similar to the case of 1DDCT, the 2DDCT can be related to the 2DDFT. Let \(c_1\) and \(c_2\) denote the maximum frequencies retained by the truncation operator. Then, the relation between a grayscale image \(G_{N_1,N_2}\) in the 2DDCT domain and that of its downsampled version \(\hat{G}_{\hat{N}_1,\hat{N}_2}\) can be represented as

$$\begin{aligned} G_{N_1,N_2}[u,v]=\sqrt{K_1K_2} \hat{G}_{\hat{N}_1,\hat{N}_2}[u,v] \end{aligned}$$
(21)

where \(\hat{N}_1=N_1/K_1\), \(\hat{N}_2=N_2/K_2\), \(u= 0,1,\ldots ,c_1-1\), \(v= 0,1,\ldots ,c_2-1\), \(c_1 \le N_1/K_1\), and \(c_2 \le N_2/K_2\). In the appendix, the derivation of the above expression is provided.
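A similar numerical check (again our own, and only approximate, since a smooth image is only approximately bandlimited in the sense required by the derivation) illustrates the \(\sqrt{K_1 K_2}\) relation in (21), using SciPy's orthonormal 2DDCT, which matches the normalization in (11):

```python
import numpy as np
from scipy.fft import dctn

N1, N2, K1, K2 = 64, 64, 2, 2
n = (2 * np.arange(N1) + 1)[:, None]
m = (2 * np.arange(N2) + 1)[None, :]
# Smooth test image built from low-frequency DCT basis functions.
g = 3.0 + np.cos(np.pi * 2 * n / (2 * N1)) * np.cos(np.pi * 3 * m / (2 * N2))

G = dctn(g, norm='ortho')                          # 2DDCT of the original image
G_hat = dctn(g[::K1, ::K2], norm='ortho')          # 2DDCT of the downsampled image

c1, c2 = 8, 8                                      # retained low frequencies
err = np.abs(G[:c1, :c2] - np.sqrt(K1 * K2) * G_hat[:c1, :c2]).max() / np.abs(G).max()
print(err)                                         # small, illustrating (21) approximately
```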

4 Transform-domain 2DHOG features

In this section, we first define 2DHOG features in the transform domain. Then, utilizing the results derived in Sect. 3, we investigate the relationship between the transform-domain 2DHOG features obtained from an image of a given resolution and those obtained from a downsampled version of the same image.

4.1 Extraction of TD2DHOG features

Consider an input image I of size (\(M_1 \times M_2\)). Let it be divided into non-overlapping cells of size (\(\eta _1 \times \eta _2\)), where \(M_1\) and \(M_2\) are integer multiples of powers of 2, and \(\eta _1\) and \( \eta _2\) are integer powers of 2. Now, 2DHOG features are computed by following the steps explained in Sect. 2.1, resulting in \(\beta \) layers, where each layer corresponds to a certain quantized gradient orientation from \(0^{\circ }\) to \(180^{\circ }\). The 2DHOG features of the lth layer, denoted by \(h^l\), are of size \((\tilde{M_1} \times \tilde{M_2})\), \(\tilde{M_1}\) and \(\tilde{M_2}\) being integer multiples of powers of 2. Each 2DHOG layer, \(h^l\), is partitioned into non-overlapping blocks, with \(N_{x}\) and \(N_{y}\) blocks in the x and y directions, respectively, \(N_{x}\) and \(N_{y}\) being integers. Let \(\varvec{x}_{\imath \jmath }^l\), of size (\(b \times b\)), represent the 2DHOG features of the \((\imath ,\jmath )\)th block of the lth layer, where \(1\le \imath \le N_{y}\), \(1\le \jmath \le N_{x}\), b being an integer power of 2. The block-partitioned 2DHOG features in the lth layer can be represented as

$$\begin{aligned} h^l=\begin{bmatrix} \varvec{x}_{11}^l&\ldots&\varvec{x}_{1N_{x}}^l \\ \vdots&\ddots&\vdots \\ \varvec{x}_{N_{y}1}^l&\ldots&\varvec{x}_{N_{y}N_{x}}^l \end{bmatrix} \end{aligned}$$
(22)

This block partitioning is known to offer robustness to partial occlusion (Wang et al. 2009; Wu et al. 2014). To illustrate, let us consider an image of size \(32 \times 96\), a cell size of \(4 \times 4\), and \(\beta =5\). If \(b=8\), then \(N_x=\tilde{M_2}/b={M_2}/(\eta _2 b)=3\), and \(N_y=\tilde{M_1}/b={M_1}/(\eta _1 b)=1\). Hence, each of the five layers is partitioned into 3 blocks of size \(8 \times 8\). However, if \(b=4\), then \(N_x=6\) and \(N_y=2\); that is, each of the layers is partitioned into 12 blocks of size \(4 \times 4\).

Fig. 4
figure 4

Scheme for obtaining the DCT2DHOG features for an input car image of size \(32\times 96\) using \(\beta =5\), cell size \(4\times 4\), 2DDCT block size \(b=8\) and \(c_1=c_2=4\) (Color figure online)

Next, we apply the appropriate 2D transform, 2DDFT or 2DDCT, on each block resulting in 2DHOG of the corresponding block in the transform domain. Let \(\mathbf {x}_{\imath \jmath }^l=T(\varvec{x}_{\imath \jmath }^l)\), where T(.) represents the transform. The corresponding 2DHOG features in the transform domain can be represented as

$$\begin{aligned} H^l=\begin{bmatrix} \mathbf {x}_{11}^l&\ldots&\mathbf {x}_{1 N_{x}}^l \\ \vdots&\ddots&\vdots \\ \mathbf {x}_{N_{y}1}^l&\ldots&\mathbf {x}_{N_{y}N_{x}}^l \end{bmatrix} \end{aligned}$$
(23)

Let \(\varvec{\phi }_{c_1 c_2}(.)\) denote the 2D truncation operator in the transform domain that discards the coefficients corresponding to frequencies higher than \(c_1\) and \(c_2\) in the two directions. By applying \(\varvec{\phi }_{c_1 c_2}(.)\) on each block, \(\mathbf {x}_{\imath \jmath }^l\), we obtain the truncated features \(\hat{\mathbf {x}}_{\imath \jmath }^l=\varvec{\phi }_{c_1 c_2}(\mathbf {x}_{\imath \jmath }^l)\) of size (\(c_1\times c_2\)). These features can be represented as

$$\begin{aligned} \hat{H}^l=\begin{bmatrix} \hat{\mathbf {x}}_{11}^l&\ldots&\hat{\mathbf {x}}_{1 N_{x}}^l \\ \vdots&\ddots&\vdots \\ \hat{\mathbf {x}}_{N_{y}1}^l&\ldots&\hat{\mathbf {x}}_{N_{y}N_{x}}^l \end{bmatrix} \end{aligned}$$
(24)

where the size of \(\hat{H}^l\) is (\(\hat{M_1} \times \hat{M_2}\)), \(\hat{M_1}=c_1 N_y\) and \(\hat{M_2}=c_2 N_x\). We refer to the truncated transform-domain 2DHOG features given by \(\hat{H}^l\) as TD2DHOG features, and more specifically as DFT2DHOG and DCT2DHOG features when the 2D transform used is the 2DDFT and the 2DDCT, respectively. The scheme for obtaining the DCT2DHOG features is illustrated in Fig. 4 for an image of size \(32 \times 96\) with a cell size of \(4 \times 4\), \(\beta =5\), and the 2DDCT employed with block size \(b=8\) and \(c_1=c_2=4\). It is noted that for this example the size of \(\hat{H}^l\) is \(4 \times 12\).
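A minimal sketch of the DCT2DHOG extraction just described is given below (our own illustration; it takes the stack of 2DHOG layers as a NumPy array, e.g. as produced by the earlier compute_2dhog sketch, partitions each layer into \(b\times b\) blocks, applies the orthonormal 2DDCT per block, and keeps only the \(c\times c\) low-frequency coefficients):

```python
import numpy as np
from scipy.fft import dctn

def dct2dhog(layers, b=8, c=4):
    """TD2DHOG (2DDCT case) sketch: layers has shape (beta, M1_tilde, M2_tilde)."""
    beta, M1t, M2t = layers.shape
    Ny, Nx = M1t // b, M2t // b                    # number of blocks per direction
    out = np.zeros((beta, c * Ny, c * Nx))
    for l in range(beta):
        for i in range(Ny):
            for j in range(Nx):
                block = layers[l, i * b:(i + 1) * b, j * b:(j + 1) * b]
                coeffs = dctn(block, norm='ortho')            # 2DDCT of the block
                out[l, i * c:(i + 1) * c, j * c:(j + 1) * c] = coeffs[:c, :c]  # truncation
    return out         # stack of hat{H}^l, each of size (c N_y) x (c N_x), cf. (24)
```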

Fig. 5
figure 5

Block diagram showing the effect of downsampling an input image by an integer factor K in both the x and y directions on the transform-domain 2DHOG features, where \(\alpha \) is a multiplicative factor that allows the features extracted from the lower resolution image to approximate the features extracted from the image at the original resolution

4.2 Effect of image downsampling on TD2DHOG features

In Sect. 3, we obtained the relation between the original image and its downsampled version when they are transformed by 2DDFT or 2DDCT. Now, in order to study the effect of image downsampling on the features in the transform domain, we use the block diagram shown in Fig. 5. For the original image I, a 2DHOG feature extraction operator \(\varLambda (.)\) is employed to obtain \(z=\varLambda (I)\). Then, we apply to z an appropriate 2D transform (2DDFT or 2DDCT), with a block size \(b \times b\), followed by a truncation operation retaining the \(c \times c\) low frequency coefficients for each block. The TD2DHOG features so obtained are denoted by \(\hat{Z}= \hat{T}(z)\), where \(\hat{T}\) represents the transform operation followed by the truncation operation. Let \(I_{1/K}\) denote the image I downsampled by a factor K in both the x and y directions. Since \(I_{1/K}=\mathcal{{P}}(I,1/K)\), \(\mathcal{{P}}\) representing the downsampling operator, the features extracted from the downsampled image are given by \(z_{1/K}=\varLambda (\mathcal{{P}}(I,1/K))\). We now obtain the features \(\hat{Z}_{1/K}=\hat{T}_{1/K}(z_{1/K})\) in the transform domain, where the features \(z_{1/K}=\varLambda (I_{1/K})\), and \(\hat{T}_{1/K}\) represents the transform operation with a block size \((b/K)\times (b/K)\) followed by the truncation operation to retain the (\(c\times c\)) low frequency coefficients.

The relationship between the transform coefficients of the features obtained from the image at the original resolution \(\hat{Z}\) and that of its downsampled version \(\hat{Z}_{1/K}\) can now be obtained as follows. Equations (4) and (5) are now used to approximate \(z_{1/K}\) as

$$\begin{aligned} z_{1/K} \approx \mathcal{{P}}(z,1/K) a'_0 K^{\lambda } \end{aligned}$$
(25)

where \(a'_0\text { and }\lambda \) are computed empirically for each type of channel features. Next, performing the transform operation \(\hat{T}_{1/K}\) on both sides of (25), we obtain

$$\begin{aligned} \hat{T}_{1/K}({z}_{1/K}) \approx \hat{T}_{1/K}(\mathcal{{P}}(z,1/K)) a'_0 K^{\lambda } \end{aligned}$$

i.e.,

$$\begin{aligned} \hat{Z}_{1/K} \approx \hat{T}_{1/K}(\mathcal{{P}}(z,1/K)) a'_0 K^{\lambda } \end{aligned}$$
(26)

Then, the ratio between the features in the transform domain obtained from the original image and its resampled version is

$$\begin{aligned} \frac{\hat{Z}}{\hat{Z}_{1/K}} \approx \frac{1}{a'_0 K^{\lambda }}\times \frac{\hat{T}(z)}{\hat{T}_{1/K}(\mathcal{{P}}(z,1/K))} \end{aligned}$$
(27)

where the first term, \({1}/({a'_0 K^{\lambda }})\), represents the power law effect, while the second term, \(\hat{T}(z)/\hat{T}_{1/K}(\mathcal{{P}}(z,1/K))\), represents the transform domain resampling effect which is the ratio of the transform-domain coefficients of the channel feature, z, and that of its resampled version, \(\mathcal{{P}}(z,1/K)\).

Let \(a_0=1/a'_0\) and assume that the term \(\hat{T}(z)/\hat{T}_{1/K}(\mathcal{{P}}(z,1/K))\) can be represented by (10) and (21), in the case of 2DDFT and 2DDCT, respectively. Then, the transform-domain coefficients at the original resolution, \(\hat{Z}\), can be approximated by using the transform-domain coefficients at a lower resolution, \(\hat{Z}_{1/K}\), as

$$\begin{aligned} \hat{Z} \approx \alpha (K) \hat{Z}_{1/K} \end{aligned}$$
(28)

where

$$\begin{aligned} \alpha (K)= {\left\{ \begin{array}{ll} a_0 K^{2-\lambda }, &{} \text {for 2DDFT} \\ a_0 K^{1-\lambda }, &{} \text {for 2DDCT} \end{array}\right. } \end{aligned}$$
(29)

In order to improve the approximation accuracy of the expression in (28), we introduce an additive correction term \(a_1\), such that \(\alpha \) is of the form

$$\begin{aligned} \alpha (K)= a_0 K^{2-\lambda }+a_1, \quad \text {for 2DDFT} \end{aligned}$$
(30a)
$$\begin{aligned} \alpha (K)= a_0 K^{1-\lambda }+a_1, \quad \text {for 2DDCT} \end{aligned}$$
(30b)

The constants \(a_0\), \(a_1\), and \(\lambda \) are computed empirically in the training mode for the 2DHOG channel. The usefulness of \(\alpha (K)\) given by (30) lies in the fact that the features extracted from a lower resolution test image can be utilized to approximate the features of the test image extracted at a higher resolution by multiplying the former by \(\alpha (K)\), which is a function of the downsampling factor, K, and the type of transform.
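For reference, the model (30) and the approximation (28) can be wrapped in small helpers as follows (a sketch with our own function names; \(a_0\), \(a_1\) and \(\lambda \) are the empirically fitted constants):

```python
def alpha_factor(K, a0, a1, lam, transform='2DDCT'):
    # Multiplicative factor alpha(K) of (30).
    p = 2.0 if transform == '2DDFT' else 1.0
    return a0 * K ** (p - lam) + a1

def approximate_original_resolution(Z_low, K, a0, a1, lam, transform='2DDCT'):
    # hat{Z} ~ alpha(K) * hat{Z}_{1/K}, cf. (28).
    return alpha_factor(K, a0, a1, lam, transform) * Z_low
```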

4.2.1 Estimation of \(a_0, a_1, \text {and } \lambda \)

Given a training set of \(N_t\) images, the parameters \(a_0, a_1, \text {and } \lambda \) for the 2DHOG channel can be estimated as follows. First, at each value of the downsampling factor, \(K = 1, 2, 4,\ldots \), the multiplicative factor of the ith image sample, \(\hat{\alpha }^i(K)\), is obtained as the factor that minimizes the mean square error (MSE) as

$$\begin{aligned} \underset{\hat{\alpha }^{i}(K)}{\min } \frac{1}{ N_{y} N_{x}c^2 \beta } \underset{l,j,k,u,v}{\sum }( \hat{Z}^{i,j,k,l}[u,v] -\hat{\alpha }^{i}(K) \hat{Z}_{1/K}^{i,j,k,l}[u,v] ) ^2 \end{aligned}$$
(31)

where \(i= 1,\ldots ,N_t\), \(0 \le u,v \le c-1\), u and v are the frequency indices of the (jk)th block, \(1\le j\le N_{y}\), \(1\le k \le N_{x}\), and \(l=1,2,\ldots ,\beta \). Then, the average value of the estimated multiplicative factor \(\hat{\alpha }(K)\) is obtained as \(\hat{\alpha }(K)= (1/N_t) \sum _{i=1}^{N_t} \hat{\alpha }^i (K)\). Finally, the values of the estimated multiplicative factor \(\hat{\alpha }(K)\) are used to obtain the model parameters, \(a_0\), \(a_1\), and \(\lambda \), of \(\alpha (K)\) by using the least squares curve fitting. In Sect. 6.1, we compute empirically the values of \(a_0, a_1, \text {and } \lambda \).
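The estimation procedure can be sketched as follows (our own illustrative implementation; the data layout, the dictionary of feature pairs, and the use of SciPy's curve fitting are assumptions). Each per-image factor \(\hat{\alpha }^i(K)\) is the closed-form least-squares minimizer of (31), and the model (30) is then fitted to the averaged factors:

```python
import numpy as np
from scipy.optimize import curve_fit

def estimate_alpha_model(pairs_by_K, transform='2DDCT'):
    """pairs_by_K maps each K in {1, 2, 4, ...} to a list of (Z_i, Z_i_1_over_K)
    TD2DHOG feature arrays, one pair per training image sample."""
    Ks, alpha_hat = [], []
    for K, pairs in sorted(pairs_by_K.items()):
        # Closed-form minimizer of the per-image MSE in (31).
        per_image = [np.sum(Z * Zk) / np.sum(Zk * Zk) for Z, Zk in pairs]
        Ks.append(K)
        alpha_hat.append(np.mean(per_image))      # averaged factor hat{alpha}(K)
    p = 2.0 if transform == '2DDFT' else 1.0      # exponent offset from (29)-(30)
    model = lambda K, a0, a1, lam: a0 * np.asarray(K, float) ** (p - lam) + a1
    (a0, a1, lam), _ = curve_fit(model, Ks, alpha_hat, p0=(1.0, 0.0, 0.0))
    return a0, a1, lam
```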

5 Scheme for vehicle detection

In this section, we propose a new vehicle detection scheme that uses the results of the previous section concerning TD2DHOG features to employ a single classifier, trained on high-resolution vehicles, for detecting vehicles of the same or lower resolution, instead of training multiple resolution-specific classifiers as in Benenson et al. (2012) and Naiel et al. (2014). In order to detect vehicles of different resolutions in a given test image, an image pyramid of depth one octave is constructed, and TD2DHOG features are extracted at each scale of the image pyramid using blocks of different sizes. We now present the training and testing procedures of the proposed vehicle detection scheme.

5.1 Training mode

In order to take advantage of the fact that the transform-domain coefficients at the original resolution can be approximated by using the transform-domain coefficients at a lower resolution, as given by (28), the training data is upsampled by a factor of R, R being an integer power of 2. Even though upsampling the training data increases the training cost, it has been observed from our experiments that training a classifier on TD2DHOG features obtained from high-resolution images offers a higher detection accuracy than that achieved by the same classifier when trained on TD2DHOG features extracted from the same training set at a lower resolution. This is because, in the testing mode, going from a higher resolution to a lower resolution results in a smaller approximation error for the TD2DHOG features than going the other way around.

Figure 6a shows the training scheme for the proposed vehicle detector, where the training data is upsampled by a factor R in both the x and y directions. Let the set of the training data upsampled by R be denoted as \(\mathcal{{I}}_{R} = \{I_{i,R}, i=1,2,\ldots ,N_t\}\), where \(N_t\) denotes the number of training image samples. Then, the size of the ith training image sample is (\({R}M_1 \times {R}M_2\)). Assuming that the 2DHOG features of the lth layer, \(h_{i,{R}}^{l}, \left( i=1,2,\ldots ,N_t \text { and } l= 1,2,\ldots ,\beta \right) \), are extracted using the same cell size (\(\eta _1 \times \eta _2\)) for all the resolutions, the size of the lth 2DHOG layer of the ith training image sample is \({R}\tilde{M}_1 \times {R}\tilde{M}_2\), i.e., increased by the same factor R. Similarly, the block size used to compute the corresponding TD2DHOG features is increased by the same factor R, i.e., \(b_{R}={R}b_0\). We refer to \(b_0\) as the base block size, which is defined as the block size at \({R}=1\). Let \(\hat{H}_{i,{R}}^{l}, i=1,2,\ldots , N_t,\) denote the TD2DHOG features of the lth layer, where the size of \(\hat{H}_{i,{R}}^{l}\) is \((\hat{M}_1\times \hat{M}_2)\). It is important to note that in the training phase we do not multiply the TD2DHOG features by the multiplicative factor \(\alpha (K)\); the value of \(\alpha (K)\) computed from (30) is used only in the detection phase.

Fig. 6
figure 6

a The scheme for training the proposed vehicle detector with training images of size \(64\times 64\), where R is the upsampling factor in both the x and y directions. b Proposed vehicle detection scheme for a sample test image, where the different colors in the image pyramid represent different scanning window sizes (here we have used only two window sizes, \(128\times 128\) and \(64 \times 64\)) (Color figure online)

After the extraction of the TD2DHOG features, 2DPCA (Yang et al. 2004) is employed on each layer in order to maintain the relation between the neighboring blocks. Let the training data consist of \(N_{pos}\) and \(N_{neg}\) training image samples, corresponding to the positive and negative classes, respectively. The training data can be denoted as \(\{(\hat{H}_{i,R}^l,y_i), i=1,2,\ldots ,N_t\}, l=1,2,\ldots ,\beta \), where \(y_i \in \{+1,-1\}\) refers to the class label for the ith image sample. The covariance matrix, of size (\(\hat{M}_2 \times \hat{M}_2\)), is first obtained for the TD2DHOG features of the lth layer as

$$\begin{aligned} {Cov}^{l}= \frac{1}{N_t} \sum _{i=1}^{N_t} (\hat{H}_{i,R}^l-\bar{H}_{R}^l)^{\top }(\hat{H}_{i,R}^l-\bar{H}_{R}^l) \end{aligned}$$
(32)

where

$$\begin{aligned} \bar{H}_{R}^l = \frac{1}{N_t} \sum _{i=1}^{N_t} \hat{H}_{i,R}^l \end{aligned}$$
(33)

Note that \({Cov}^l\) is a nonnegative definite matrix. Next, we obtain the \(r_l\) eigenvectors of \({Cov}^l\) that correspond to the \(r_l\) dominant eigenvalues. The number of eigenvectors, \(r_l\), is chosen so that the sum of the magnitudes of the retained eigenvalues represents at least \(90\%\) of the sum of the magnitudes of all the eigenvalues. The eigenvectors are used to form the matrix \(V^{l}_R\) of size \((\hat{M}_2\times r_{l})\). Next, the TD2DHOG features of the lth layer of the ith training image sample are projected onto the constructed matrix \(V^{l}_R\) in order to obtain the matrix \(Q_{i,R}^l=\hat{H}_{i,R}^lV^{l}_R\) of size \((\hat{M}_1\times r_{l})\), and \(Q_{i,R}^l\) is vectorized (see Footnote 1) to obtain the corresponding feature vector \(q_{i,R}^l\) of size \((1 \times \hat{M}_1 r_{l})\). Then, for the ith training image sample, the feature vectors from the different layers, \(q^{l}_{i,R}\), are concatenated to obtain the feature vector, \(f_{i,R}\), of size (\(1\times r\)), where \(f_{i,R}= [q^1_{i,R},\ldots ,q^\beta _{i,R}] \text { for } i=1,2,\ldots ,N_t\).
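The 2DPCA step for a single layer can be sketched as follows (a minimal illustration with our own function names; it assumes the matrices \(\hat{H}^l_{i,R}\) are available as NumPy arrays):

```python
import numpy as np

def fit_2dpca(H_list, energy=0.90):
    """2DPCA for one TD2DHOG layer: H_list holds the hat{H}^l_{i,R} matrices of
    the N_t training samples. Returns the projection matrix V^l_R."""
    H = np.stack(H_list)                            # shape (N_t, M1_hat, M2_hat)
    D = H - H.mean(axis=0)                          # subtract the mean, cf. (33)
    cov = np.einsum('nij,nik->jk', D, D) / len(H)   # covariance of (32), (M2_hat x M2_hat)
    w, V = np.linalg.eigh(cov)                      # eigenvalues in ascending order
    w, V = w[::-1], V[:, ::-1]                      # sort in descending order
    cum = np.cumsum(np.abs(w)) / np.abs(w).sum()
    r = int(np.searchsorted(cum, energy)) + 1       # retain at least 90% of the energy
    return V[:, :r]

def project_layer(H, V):
    # q = vec(hat{H} V): projection followed by vectorization.
    return (H @ V).ravel()

# Per-sample feature: concatenate the projected, vectorized layers l = 1, ..., beta, e.g.
# f_i = np.concatenate([project_layer(H_i[l], V[l]) for l in range(beta)])
```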

Let the set of training features obtained after applying 2DPCA be denoted as \(\mathcal{{F}}_R =\{ f_{i,R}, i=1,2,\ldots ,N_t\}\), and the set of the eigenvectors used to generate these features be denoted as \(\mathcal{{V}}_R =\{ V_R^l, l=1,2,\ldots ,\beta \}\). Then, we train a classifier, \(\mathcal{T}_{{R}}\), for the upsampling factor R by using the corresponding features \(\mathcal{F}_{{R}}\). We use one of the two state-of-the-art classifiers: a support vector machine with fast histogram intersection kernel (FIKSVM) (Maji et al. 2008, 2013) or boosted decision tree classifier (BDTC) (Appel et al. 2013; Dollár 2016).

5.2 Testing mode

In the testing phase, we first obtain an image pyramid of depth one octave from the given input test image. The test image at each scale of the image pyramid is then scanned by using a number of detection windows of different sizes \((\frac{RM_1}{K}\times \frac{RM_2}{K})\), where R is the upsampling factor at which the detector has been trained and \(K=1,2,4,\ldots \) is an integer power of 2. Figure 6b shows the proposed vehicle detection scheme when applied to a test image by assuming \(R=2\) and \(K= 1 \text { and }2\). Now for each detection window, we obtain the TD2DHOG features for the different layers, \(\{\hat{H}_{test}^{l}, l=1,2,\ldots ,\beta \}\), by using a block size \(b^{test}=\frac{b_R}{K} \); the size of each \(\hat{H}_{test}^{l}\) is \((\hat{M}_1\times \hat{M}_2 )\). Then, the TD2DHOG features of each layer are multiplied by the multiplicative factor \(\alpha (K)\) as

$$\begin{aligned} \tilde{H}_{test}^{l}= \alpha (K) \hat{H}_{test}^{l} \end{aligned}$$
(34)

where \(\tilde{H}_{test}^{l}\) is of size \((\hat{M}_1\times \hat{M}_2 )\), and \(\alpha (K)\) is given by (30). This allows the TD2DHOG features obtained from a low-resolution detection window to approximate the TD2DHOG features obtained at a higher resolution, indicating an approximate invariance of the TD2DHOG features, within a multiplicative factor, when the image resolution is changed. Next, the TD2DHOG features of the lth layer, \(\tilde{H}^{l}_{test}\), are projected onto the corresponding matrix \(V^l_{R}\) in order to obtain the matrix \(Q^l_{test}=\tilde{H}^l_{test}V^l_{R}\) of size \((\hat{M}_1\times r_{l})\). Then, \(Q^l_{test}\) is vectorized to obtain the corresponding feature vector \(q^l_{test}\) of size \((1 \times \hat{M}_1 r_{l})\). This is followed by concatenating the features, \(q^l_{test}\), for the different layers to obtain the feature vector, \(f_{test}\), of size (\(1\times r\)), where \(f_{test}=[ q^1_{test},\ldots ,q^\beta _{test} ]\).
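Putting these testing-mode steps together for a single detection window, a hedged sketch is given below (our own function; H_test is the stack of TD2DHOG layers extracted from the window with block size \(b_R/K\), e.g. with the dct2dhog sketch of Sect. 4.1, and V_list holds the 2DPCA matrices \(V^l_R\) from training; the classifier call at the end is only indicative):

```python
import numpy as np

def test_window_features(H_test, V_list, K, a0, a1, lam):
    """Per-window test features (Sect. 5.2), using the 2DDCT case of (30)."""
    alpha = a0 * K ** (1 - lam) + a1                          # multiplicative factor
    return np.concatenate([((alpha * H_test[l]) @ V_list[l]).ravel()   # (34) + projection
                           for l in range(len(H_test))])      # f_test = [q^1, ..., q^beta]

# Indicative scoring with a trained classifier (FIKSVM or BDTC stand-in):
# score = classifier.decision_function(
#     test_window_features(H_test, V_list, K, a0, a1, lam)[None, :])
```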

Now, the trained classifier \(\mathcal{T}_{{R}}\), namely, FIKSVM (Maji et al. 2008, 2013) or BDTC (Appel et al. 2013; Dollár 2016), is used to provide for each feature vector \(f_{test}\) a detection score corresponding to the input detection window. Finally, similar to Maji et al. (2008), a non-maximum suppression technique is used to combine several overlapped detections for the same object. This avoids detecting the same vehicle more than once, and allows detecting vehicles with different aspect ratios.

Fig. 7
figure 7

a An illustration of the proposed scheme for scanning an image pyramid of depth one octave with two detection windows and a single classifier. b An illustration of the scheme for scanning an image pyramid of depth two octaves with one detection window and a single classifier (Color figure online)

Figure 7a illustrates the scanning scheme for the proposed vehicle detector in the case of \(R=2\), and \(K= 1\text { and }2\). Hence, in this example, the test image at each scale of the image pyramid is scanned by using two detection windows of sizes \((2M_1 \times 2M_2)\) and \((M_1 \times M_2)\). The proposed vehicle detector requires training a single classifier at the highest detection window size, namely, \((2M_1 \times 2M_2)\). The methods in Benenson et al. (2012) and Naiel et al. (2014) use a similar scanning strategy; however, they require constructing a classifier pyramid in order to classify detection windows of different sizes. It is to be noted that the scanning scheme used in several state-of-the-art object detectors (Dalal and Triggs 2005; Dollár et al. 2009; Maji et al. 2008) requires the extraction of features at each scale of an image pyramid of depth often more than one octave, even though the scheme employs one detection window and a single classifier. Figure 7b shows an example of this scanning scheme, when the image pyramid is of depth two octaves. The proposed vehicle detection scheme reduces the cost of training a classifier pyramid, as a single classifier trained on images of a given resolution can be used to detect vehicles of the same or lower resolutions. In addition, it reduces the storage requirements that are associated with training multiple resolution-specific classifiers.

6 Experimental results

We first carry out a number of experiments to validate, as mentioned in Sect. 4, the model for the multiplicative factor \(\alpha (K)\) using the UIUC car detection dataset (Agarwal et al. 2004). Then, we study the performance of the proposed algorithm for vehicle detection in images using the UIUC car detection dataset (Agarwal et al. 2004), the USC multi-view car detection dataset (Kuo and Nevatia 2009), the LISA 2010 dataset (Sivaraman and Trivedi 2010) and the HRI roadway dataset (Gepperth et al. 2011). We also compare the performance of our algorithm with that of some of the existing methods.

The UIUC car detection dataset (Agarwal et al. 2004) consists of 1050 training images of size \(40\times 100\) divided into a set of 550 car images with side views, and a set of 500 other images, none of which is the image of a car with a side view. In order to facilitate the computation of the TD2DHOG features, the training images in this dataset are cropped by removing the first and last four rows and the first and last two columns in order to reduce the size of each image from \(40 \times 100\) to \(32 \times 96\). The testing set in this dataset consists of 108 multi-scale images, which contain partially occluded cars, objects with low contrast, as well as highly textured backgrounds. Since the dataset includes a balanced number of positive and negative training images, the FIKSVM (Maji et al. 2013) is used as the baseline classifier for the proposed detector.

The USC multi-view car detection dataset (Kuo and Nevatia 2009) consists of cars with several views. The training data consists of 2462 positive training images of size \(64\times 128\), while the testing data consists of 196 images containing 410 cars of different sizes and views. In order to complete the training dataset, we collect 9512 negative training image samples from the CBCL street scenes dataset (Bileschi 2006). Since the USC dataset consists of cars with different views, BDTC (Appel et al. 2013; Dollár 2016) is chosen as the baseline classifier.

The LISA 2010 dataset (Sivaraman and Trivedi 2010) consists of test sequences of size \(480 \times 704\) containing rear-view vehicles of different sizes, captured under several illumination conditions. The first sequence (1600 frames) was taken on a high-density highway during a sunny day (H-dense); it includes partially occluded vehicles, heavy shadows, and background structures that can be confused with the positive class. The second sequence (300 frames) was taken on a medium-density highway on a sunny day (H-medium), and presents challenges similar to those of H-dense but at a lower density. The dataset does not include training data; therefore, we collect training images of size \(64\times 64\) from other datasets as follows: (1) 9013 images of vehicles in rear/front views from the KITTI dataset (Geiger et al. 2012) and the USC multi-view car detection dataset (Kuo and Nevatia 2009), and (2) 8415 negative image samples from the CBCL street scenes dataset (Bileschi 2006). As in Sivaraman and Trivedi (2010), we collect a number of hard negative image samples from the test sequences (229 image samples from H-medium, and 806 image samples from H-dense). Due to the large number of training samples and the wide variation in the background structures, BDTC (Appel et al. 2013; Dollár 2016) is used as the baseline classifier on this dataset.

The HRI roadway dataset (Gepperth et al. 2011) consists of five test sequences of size \(600 \times 800\) for vehicles on urban and highway areas. This dataset has been captured under several challenging weather and lighting conditions. Sequence I (908 frames) has been captured during a cloudy day, while Sequence II (917 frames) has been captured during a sunny day. Sequences III (611 frames), IV (411 frames) and V (830 frames) have been captured during a heavy rainy day, a dry midnight, and afternoon after a heavy snow, respectively. Since the HRI dataset does not have its own training set, in order to test the proposed scheme on a sequence of this dataset, the classifier in the proposed scheme is trained by employing the training set used in the case of LISA 2010 dataset along with the hard negative samples collected from the first 100 frames of this sequence of the HRI dataset.

6.1 Validation for the model of \(\alpha {(K)}\)

We now validate the model of \(\alpha (K)\) given by (30) by making use of the block diagram of Fig. 5 and the scheme introduced in Sect. 4.2 for estimating the channel parameters \(a_0\), \(a_1\) and \(\lambda \). For this purpose, we first consider the UIUC car detection dataset (Agarwal et al. 2004) and choose \(N_t=550\) car images. Since we do not have access to high resolution versions of these car images, they are upsampled by a factor \(R=8\). Now, we give the procedure to estimate the value of \(\alpha (K)\) for the 2DHOG features in the 2DDFT domain. We first obtain the 2DHOG features of an upsampled image \(I_u\) (see Footnote 2), using the steps outlined in Sect. 2.1, assuming \(\eta _1=\eta _2=4\), and \(\beta =5, 7 \text { or }9\). We then apply the 2DDFT on the block-partitioned 2DHOG features given by (22) for each of the layers, assuming the block size to be \(b = Rb_0 = 8 b_0\), \(b_0\in \{4, 8, 16\}\). This is followed by a truncation operation retaining the (\(c\times c\)) low frequency coefficients, where \(c=4\), to obtain the 2DHOG features in the 2DDFT domain. Then, the whole operation is repeated after downsampling \(I_u\) by a factor K, \(K = 1, 2, 4\text {, and } 8\), but with a block size of b/K. As explained in Sect. 4.2, the multiplicative factor of the ith image sample, \(\hat{\alpha }^i(K)\), is obtained as the factor that minimizes the mean square error (MSE) given by (31). Then, the four values of the estimated multiplicative factor \(\hat{\alpha }(K)\), \(K = 1, 2, 4\text {, and } 8\), are used to obtain the model parameters, \(a_0\), \(a_1\), and \(\lambda \), of \(\alpha (K)\) by using least squares curve fitting (see Footnote 3). The above procedure is repeated to find the model parameters, \(a_0\), \(a_1\), and \(\lambda \), of \(\alpha (K)\) for the 2DHOG features in the 2DDCT domain.

Table 1 summarizes the values of the parameters, \(a_0\), \(a_1\), and \(\lambda \), for the above two cases for block size \(b_0 = 4, 8, 16\) along with the corresponding mean square errors, when the number of layers, \(\beta \), is \(5, 7\text {, or }9\). It is seen from this table that irrespective of the transform used, the errors are insignificant. Figure 8 shows the plots of \({\alpha }(K)\) for the 2DHOG features for \(\beta =7\). It is seen from these plots that the proposed model is not sensitive to the block size \(b_0\). It has been observed that \(\alpha (K)\) is insensitive to \(b_0\) for the other values of \(\beta \) also.

Similar studies have been conducted using \(N_t=1000\) positive training images from the USC multi-view car detection dataset, and \(N_t=1000\) positive training images, collected as mentioned earlier in this section, for the LISA 2010 dataset. It has been found that for both these datasets, \(\alpha (K)\) is insensitive to \(b_0\) irrespective of whether \(\beta =5, 7 \text { or } 9\).

It is to be noted that the same model for \(\alpha (K)\) given by (30) can also be used for the grayscale (GS) channel in the 2DDFT and 2DDCT domains; repeating the above procedures yields the corresponding values of \(a_0\), \(a_1\) and \(\lambda \). These values, obtained using the UIUC car detection dataset, are also included in Table 1 for the 2DDFT and 2DDCT domains. It is seen from this table that for the case of the grayscale channel, \(\lambda \approx 0, a_0 \approx 1 \text { and } a_1 \approx 0\), and thus,

$$\begin{aligned} \alpha (K) \approx {\left\{ \begin{array}{ll} K^{2}, &{} \text {for 2DDFT}\\ K, &{} \text {for 2DDCT} \end{array}\right. } \end{aligned}$$
(35)

Equation (35) has been found to be equally true in the case of the other two datasets, namely, the USC multi-view car detection dataset and the LISA 2010 dataset. It is seen that the two expressions on the right side of (35) are the same as that given by (10) and (21), respectively, when \(K_1=K_2=K\). Thus, the proposed model for \(\alpha (K)\) given by (30) for the TD2DHOG features is also valid for the grayscale images in the transform domain. These results show the versatility of the model for \(\alpha (K)\) in representing channels other than the 2DHOG channel.

Table 1 The estimated channel parameters for grayscale image (GS) and 2DHOG features, where \(b_0 = 4, 8, \text {or } 16\), and MSE refers to the mean square error of the curve fitting
Fig. 8
figure 8

The multiplicative factor \({\alpha }(K)\) for \(K =1, 2, 4, 8\), where (a) and (b) represent the case of the 2DHOG features in the 2DDFT and 2DDCT domains, respectively (Color figure online)

6.2 Vehicle detection using TD2DHOG features

In this section, we study the detection performance of the proposed scheme using the datasets mentioned earlier. Further, the detection performance of the proposed technique is compared with that of several state-of-the-art techniques. The 2DHOG is obtained assuming \(\eta _1=\eta _2=4\), from which the TD2DHOG features are obtained. In the case of using a single classifier, the TD2DHOG features multiplied by the factor \(\alpha (K)\) given by (30) are used, where the classifier is trained on TD2DHOG features obtained from training images upsampled by a factor R and is used to classify images in detection windows of the same or lower resolutions. We refer to this scheme using a single classifier (SC) as TD2DHOG-SC. We also consider the case of using multiple classifiers, each trained on TD2DHOG features at a different value of R, in order to classify images in detection windows at the same resolution at which the corresponding classifier has been trained. We refer to this scheme using a classifier pyramid (CP) as TD2DHOG-CP. Unless specified otherwise, each octave of an image pyramid is considered to have 12 scales. Each scale is scanned by shifting the detection window(s) by 8R pixels in each of the x and y directions.
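For reference, the per-octave scale factors and window stride used in these experiments can be generated as follows (a trivial sketch reflecting the settings stated above):

```python
import numpy as np

R = 2                                   # upsampling factor used in training (example value)
scales = 2.0 ** (-np.arange(12) / 12)   # 12 scales per octave: 1, 2^(-1/12), ..., 2^(-11/12)
stride = 8 * R                          # detection-window shift in x and y (pixels)
print(np.round(scales, 3), stride)
```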

6.2.1 UIUC car detection dataset

On this dataset the equal error rate (EER) is used for evaluation, EER being the detection rate at the point of equal precision and recall; we use the methodology given in Agarwal et al. (2004) to calculate the precision and recall.

Choice of the transform: In this experiment, we evaluate the detection performance of the proposed TD2DHOG-SC by using 2DDFT or 2DDCT. The TD2DHOG features are obtained assuming \(b^{train}=R b_0\), \(R=2\), \(b_0 = 4\), \(c=4\), \(b^{test}= 4, 8\) and \(\beta =5, 6, \ldots , 11\). Figure 9 shows that DCT2DHOG-SC exhibits a better performance irrespective of \(\beta \). Similar results have been obtained for other datasets, but are not included here in view of space constraints. In view of this, we will henceforth consider only DCT2DHOG features in all the experiments.

Fig. 9
figure 9

Comparing the EER values of the DFT2DHOG-SC and DCT2DHOG-SC on UIUC dataset (Color figure online)

Fig. 10
figure 10

EER value of the proposed scheme DCT2DHOG-SC at \(c = 2, 4\text { or } 8\) obtained on the UIUC dataset, where \(\beta = 5, 6, \ldots , \text { or } 11\), and the base block size \(b_0 = 4\text { or } 8\) (Color figure online)

Choice of \(b_0\), c, and \(\beta \): We now study the performance of the proposed DCT2DHOG-SC for different values of \(b_0, c \) and \(\beta \), in order to make an appropriate choice for these parameters. Figure 10 shows the EER values of the proposed DCT2DHOG-SC for \(b_0= 4, c= 2\text { or }4 \); \(b_0= 8, c= 2, 4 \text { or }8 \) with \(\beta = 5, 6,\ldots ,11\) and \(b^{test} = b_0 \text { and } 2 b_0\). It is observed from this figure that the highest EER value is achieved at three different parameter settings: \((b_0=4, c=4, \beta =7), (b_0=4, c=4, \beta =9)\), and \((b_0=8, c=8, \beta =7)\). We choose the parameter setting \(b_0=4, c=4\), \(\beta =7\), since it retains the lowest number of eigenvectors compared to that of the other two parameter settings and thus it offers the lowest detection complexity. It has also been observed that in the case of DCT2DHOG-CP, the parameter setting \(b_0=4, c=4\) and \(\beta =7\) also provides the best EER value.

Table 2 Equal Error Rate on UIUC car detection dataset

Performance evaluation: We first consider the case of the DCT2DHOG-SC scheme. In this case, the single classifier trained at \(R=2\) is used to classify the test images in detection windows with the same or lower resolutions (by making use of \(\alpha (K)\), which is obtained using Table 1 and (30b)), where the test block sizes used are \(b^{test} = 8 \text { and } 4\).

Now, we consider the case of DCT2DHOG-CP. In this case, we construct a classifier pyramid trained at \({R =1\text { and }2}\). These two classifiers are used to classify the test images in detection windows of the corresponding two resolutions, where \(b^{test} = 4 \text { and } 8\), respectively.

For each of the above cases, the EER values are computed and given in Table 2. The EER values corresponding to several state-of-the-art schemes, namely, the Gabor filter-based technique (Mutch and Lowe 2008), implicit shape model (Leibe et al. 2008), bag of words with spatial pyramid kernel (Lampert et al. 2008), discriminative parts with Hough forest (Gall and Lempitsky 2009), contour cue-based technique (Wu et al. 2013), HOG-based technique of Kuo and Nevatia (2009), aggregated channel feature (ACF) and ACF-Exact (Dollár et al. 2014), multi-resolution 2DHOG (Maji et al. 2013), and clustering appearance patterns based technique (Ohn-Bar and Trivedi 2015), are also included in Table 2. It is seen from this table that either of the two proposed schemes performs better than the schemes in Dollár et al. (2014), Gall and Lempitsky (2009), Kuo and Nevatia (2009), Lampert et al. (2008), Leibe et al. (2008), Maji et al. (2013), Mutch and Lowe (2008), Ohn-Bar and Trivedi (2015) and Wu et al. (2013).

Fig. 11 Sample results of the proposed scheme on the USC multi-view car dataset: (blue) true positive, (red) false positive (Color figure online)

6.2.2 USC multi-view car detection dataset

For this dataset, as in Kuo and Nevatia (2009), the PASCAL visual object classes (VOC) criterion (Everingham et al. 2010, 2016) with an overlap threshold of 0.5 is used for evaluation. To compare the performance of our method with that of some recent schemes, the average precision (AP) is used as the evaluation metric. For this dataset, the training images are upsampled by factors \(R= 1 \text { and } 2\) when DCT2DHOG-CP is used, and by a factor \(R=2\) when DCT2DHOG-SC is used. The performance of the proposed DCT2DHOG-SC scheme on this dataset is studied for \(b_0=4\), \(c=4\); \(b_0=8\), \(c= 4 \text { or } 8\); \(\beta =5, 7, \text { or } 9\); and \(b^{test}= b_0 \text { and } 2 b_0\). It is observed that the highest AP value is achieved at two parameter settings, \((b_0=8, c=4, \beta =9)\) and \((b_0=8, c=8, \beta =9)\). We choose the setting \(b_0=8\), \(c=4\), \(\beta =9\), since it retains fewer 2DDCT coefficients than the other setting and thus provides a lower detection complexity. This setting is therefore used for both the DCT2DHOG-SC and DCT2DHOG-CP schemes.
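A minimal sketch of this overlap test is given below; the (x1, y1, x2, y2) box convention is our assumption, and the functions illustrate the VOC criterion rather than reproduce the official evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def voc_true_positive(detection, ground_truth, threshold=0.5):
    """A detection counts as correct if its overlap with a ground-truth box
    reaches the threshold (0.5 in the experiments above)."""
    return iou(detection, ground_truth) >= threshold

print(voc_true_positive((10, 10, 74, 138), (20, 12, 84, 140)))  # True: overlap > 0.5
```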

Figure 11 shows sample qualitative results of the proposed scheme on this dataset; the proposed scheme can detect cars in different views and at different resolutions. Table 3 shows that the performance of the proposed technique is better than that of the method in Ohn-Bar and Trivedi (2015), which is based on ACF and on training multiple classifiers at different resolutions, the method in Kuo and Nevatia (2009), which uses HOG with Gentle AdaBoost, and the method in Wu and Nevatia (2007), which uses Edgelet features with a cluster boosted tree classifier, with the latter evaluated as in Kuo and Nevatia (2009). Further, the performance of the proposed method is slightly better than that of the implementations of the methods in Dollár et al. (2014) and of the multi-resolution 2DHOG features presented in Maji et al. (2013) when used with BDTC. The proposed scheme achieves AP values of \(90.44\%\) with DCT2DHOG-SC and \(89.92\%\) with DCT2DHOG-CP. Thus, DCT2DHOG with a single classifier achieves a high detection performance while requiring the training of only a single classifier, instead of one classifier per resolution.

Table 3 Average Precision on USC Multi-view Car Dataset

6.2.3 LISA 2010

For this dataset, the evaluation metrics presented in Sivaraman and Trivedi (2010) are used, namely, the true positive rate (TPR) or recall, the false detection rate (FDR) or (1 − precision), the average number of false positives per frame (AFP/F), the average number of false positives per object (AFP/O), and the average number of true positives per frame (ATP/F). These metrics are computed at the point of equal precision and recall. True positive detections are determined using the PASCAL VOC criterion (Everingham et al. 2010, 2016) with an overlap threshold of 0.5.
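As an illustration of how these frame-level metrics relate to per-frame detection counts, a hedged sketch is given below; reading AFP/O as false positives per annotated object is our interpretation of the metric name, and the input counts are hypothetical.

```python
def frame_level_metrics(per_frame_tp, per_frame_fp, n_objects):
    """Relate the metrics above to per-frame detection counts.

    per_frame_tp / per_frame_fp : true/false positives counted in each frame
    n_objects                   : total number of annotated vehicles
    """
    n_frames = len(per_frame_tp)
    tp, fp = sum(per_frame_tp), sum(per_frame_fp)
    return {
        "TPR (recall)":        tp / n_objects,
        "FDR (1 - precision)": fp / (tp + fp) if (tp + fp) else 0.0,
        "AFP/F":               fp / n_frames,
        "AFP/O":               fp / n_objects,   # assumed: false positives per object
        "ATP/F":               tp / n_frames,
    }

# Hypothetical counts for three frames containing five annotated vehicles.
print(frame_level_metrics([2, 1, 1], [0, 1, 0], n_objects=5))
```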

On both the H-dense and H-medium sequences, the single classifier trained at \(R=2\) is used for DCT2DHOG-SC, and two classifiers trained at \(R= 1 \text { and } 2\) are used for DCT2DHOG-CP. As in our experiments on the USC multi-view car detection dataset, the parameter setting chosen for both the DCT2DHOG-SC and DCT2DHOG-CP schemes on the LISA 2010 dataset is \(b_0=8\), \(c=4\), \(\beta =9\), and \(b^{test}=8 \text { and } 16\), since these two datasets contain similar environmental conditions and the same type of classifier, namely, BDTC, is used in the detection process.

Table 4 gives the detection performance of the proposed method, from which it is clear that the performance of DCT2DHOG with a single classifier is almost as good as that with a classifier pyramid. Table 4 also lists the performance of some other methods, namely, the Haar-like-features-based technique (Sivaraman and Trivedi 2010), ACF and ACF-Exact (Dollár et al. 2014), multi-resolution 2DHOG (Maji et al. 2013), and the clustering appearance patterns based technique (Ohn-Bar and Trivedi 2015). From this table, it can be seen that on the H-medium sequence the proposed scheme provides a performance better than that of the schemes of Dollár et al. (2014), Maji et al. (2013), Ohn-Bar and Trivedi (2015) and Sivaraman and Trivedi (2010), while on the H-dense sequence our scheme provides \(92.67\%\) TPR at \(6.03\%\) FDR, which is better than that of the methods in Dollár et al. (2014), Maji et al. (2013) and Ohn-Bar and Trivedi (2015). The proposed method and the methods in Dollár et al. (2014), Maji et al. (2013) and Ohn-Bar and Trivedi (2015) are trained with hard negative samples collected from the CBCL street scenes dataset (Bileschi 2006), while the method in Sivaraman and Trivedi (2010) is trained on private data from a sunny highway environment. The detection performance of the proposed scheme could be further improved by using an online learning technique to incorporate the false positive samples in the learning process.

Figure 12a shows sample qualitative results for the proposed scheme on the H-dense sequence. As mentioned earlier, this sequence contains heavy shadows, partially occluded vehicles, and background structures that can be confused with the positive class; the proposed scheme correctly detects \(92.67\%\) of the vehicles under these challenging conditions. Figure 12b shows the corresponding results for the H-medium sequence, which includes challenges similar to those of the H-dense sequence but at a lower density. It is clear that the proposed technique can detect vehicles of various resolutions under different illumination and background conditions.

Table 4 The performance for the proposed scheme on LISA dataset
Fig. 12 Sample qualitative results of the proposed method on the LISA 2010 dataset: a Highway-dense sequence, b Highway-medium (sunny) sequence; (blue) true positive, (red) false positive (Color figure online)

6.2.4 HRI roadway dataset

For this dataset, the evaluation metrics presented in Sect. 6.2.3 are used. As in our experiments on the USC multi-view car detection dataset and the LISA 2010 dataset, the single classifier trained at \(R = 2\) is used for DCT2DHOG-SC, and two classifiers trained at \(R = 1\) and 2 are used for DCT2DHOG-CP, for all five test sequences of the HRI dataset. The same parameter setting is also chosen for both the DCT2DHOG-SC and DCT2DHOG-CP schemes, namely, \(b_0 = 8\), \(c = 4\), \(\beta = 9\), and \(b^{test} = 8\) and 16. These parameters are chosen since the three datasets contain similar challenging conditions and the same type of classifier, namely, BDTC, is used.

Table 5 shows the detection performance of DCT2DHOG-SC, DCT2DHOG-CP, and other state-of-the-art techniques, namely, ACF and ACF-Exact (Dollár et al. 2014), multi-resolution 2DHOG (Maji et al. 2013), and the clustering appearance patterns based method (Ohn-Bar and Trivedi 2015). From this table, it can be seen that for sequences I, II and IV either of the DCT2DHOG-SC and DCT2DHOG-CP schemes provides TPR values better than those of the schemes in Dollár et al. (2014), Maji et al. (2013) and Ohn-Bar and Trivedi (2015), whereas for sequences III and V, the DCT2DHOG-SC scheme yields TPR values higher than those of DCT2DHOG-CP and of the schemes of Dollár et al. (2014) and Maji et al. (2013). In Dollár et al. (2014) and Ohn-Bar and Trivedi (2015), the feature approximation used to handle scale variation is carried out in the spatial domain, whereas the proposed scheme addresses scale variation by carrying out the feature approximation in the frequency domain.

Table 5 The performance for the proposed scheme on HRI dataset

6.2.5 Discussion

In this section, we evaluate the proposed scheme in terms of its training and testing costs. For a fair comparison, we use 2DPCA with FIKSVM or 2DPCA with BDTC as the main building blocks, whether 2DHOG or DCT2DHOG features are used. In the experiments that follow, the same values of \(\eta _1, \eta _2, b_0, c\text {, and } \beta \) that were used to obtain the detection accuracy on the corresponding dataset are used. It should be noted that, in practical situations, the choice of these parameters depends on the targeted vehicle view. If the side view of the vehicles is of interest, the recommended parameter setting for obtaining the DCT2DHOG features is \(b_0=4, c=4\text {, and } \beta =7\), and FIKSVM provides a fast and accurate classification scheme. For detecting vehicles seen from different views, as in urban and highway scenarios, the recommended setting is \(b_0=8, c=4\text {, and } \beta =9\), and BDTC is preferred, since it can be trained on a large number of samples and can capture the large intra-class variations that exist among the positive class samples.
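These recommendations can be summarized as a small configuration map; the dictionary structure and key names below are our own, while the parameter values and classifier choices come from the discussion above.

```python
# Hedged summary of the recommended settings discussed above.
RECOMMENDED_SETTINGS = {
    "side_view":  {"b0": 4, "c": 4, "beta": 7, "classifier": "FIKSVM"},
    "multi_view": {"b0": 8, "c": 4, "beta": 9, "classifier": "BDTC"},
}

print(RECOMMENDED_SETTINGS["multi_view"])
```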

Table 6 Feature extraction and classifier training times (in seconds) for the proposed DCT2DHOG method and for the 2DHOG method
Table 7 Storage requirements (in MByte) for the proposed DCT2DHOG method and for the 2DHOG methods
Table 8 Average feature extraction and detection time in seconds for Methods A, B and C applied to three datasets

Training cost: In this experiment, we compare the training cost of the proposed DCT2DHOG against that of 2DHOG at six different resolutions. Table 6 lists the overall training time of the proposed DCT2DHOG at the six resolutions along with that of 2DHOG. It is seen from this table that the training time of the proposed scheme is lower than that of 2DHOG by at least \(49.79\%\) when a classifier pyramid is used, and by at least \(74.33\%\) when a single classifier trained at \(R=2\) is employed. Table 7 gives the storage requirements of the proposed scheme and of the 2DHOG-based scheme for classifiers trained at the six resolutions considered. It is seen from this table that the storage requirement of the proposed scheme is lower than that of the 2DHOG-based scheme by 64.18% on the UIUC dataset when the size of the detection window is \(64 \times 192\), whereas the two schemes require the same storage on the USC and LISA 2010 datasets. Note that the FIKSVM classifier is used for the UIUC dataset and BDTC for the USC and LISA 2010 datasets. It is also observed from Tables 6 and 7 that, in order to detect vehicles of different resolutions, the proposed DCT2DHOG-SC requires only a single classifier instead of multiple ones, resulting in a reduction of at least \(44.63\%\) in the training cost and of at least \(50.00\%\) in the storage requirement compared with DCT2DHOG-CP.

It should be pointed out that the proposed vehicle detector achieves this reduction in training and storage costs, relative to its 2DHOG counterpart using a classifier pyramid, with almost no loss in detection accuracy.

Detection time: Table 8 compares the feature extraction time and the detection time (in seconds) of the proposed transform-domain detector (Method A) with those of its spatial-domain counterparts (Methods B and C) on the three vehicle detection datasets, UIUC (Agarwal et al. 2004), USC (Kuo and Nevatia 2009) and LISA 2010 (Sivaraman and Trivedi 2010). We use test images of size \(480 \times 640\), and assume that each octave of the image pyramid consists of 8 scales and that each scale is scanned by shifting the detection window(s) by 16 pixels in each of the x and y directions. This generates 1398, 1141 and 1365 detection windows per frame for the UIUC, USC and LISA 2010 datasets, respectively.

Method A in Table 8 corresponds to the proposed method, where the DCT2DHOG-2DPCA features are used to train a single classifier at \(R=2\). Two detection windows of different sizes are used to scan an image pyramid of depth one octave, and the same classifier is used to classify the DCT2DHOG-2DPCA features obtained from the images within these detection windows after incorporating the multiplicative factor \(\alpha (K)\) given by (30b).

Method B corresponds to the traditional method that uses a single classifier trained on spatial-domain features, namely, 2DHOG-2DPCA features, at \(R=1\). It uses a single detection window to scan an image pyramid of depth two octaves, and the 2DHOG-2DPCA features obtained from the image within each detection window are classified by the trained classifier.

Method C corresponds to a spatial-domain method that uses 2DHOG-2DPCA features to train two classifiers at \(R= 1 \text { and } 2\). Two detection windows of different sizes are used to scan an image pyramid of depth one octave, and each image within a detection window is classified by the classifier trained at the same resolution.

For the UIUC dataset, the first detection window is of size \(32 \times 96\) and the second of size \(64 \times 192\); the range of vehicle sizes that can be detected by Method A, B, or C is thus \(32 \times 96\) to \(128 \times 384\). For the USC and LISA 2010 datasets, the corresponding window sizes are \(64 \times 128\) and \(128\times 256\), and \(64 \times 64\) and \(128 \times 128\), respectively.

It is seen from Table 8 that the proposed transform-domain method provides a reduction of at least \(4.69\%\) in the feature extraction time and of at least \(17.82\%\) in the detection time over the two spatial-domain methods B and C for the UIUC dataset, and much higher reductions for the other two datasets.

Finally, it is worth mentioning that the classification time of the proposed method represents on average about \(65\%\) of the total detection time. Thus, further gains in the detection speed could be achieved by reducing the classification time.

7 Conclusion

In this paper, we have introduced transform-domain features of the two-dimensional histogram of oriented gradients of images, referred to as TD2DHOG features, and studied the effect of image downsampling on these features. It has been shown that the TD2DHOG features obtained from a high-resolution image can be approximated from the TD2DHOG features obtained from the same image at a lower resolution by multiplying the latter by a factor that depends on the downsampling factor. A model for this multiplicative factor has been proposed and validated experimentally for the 2DDFT and 2DDCT domains. A novel vehicle detection scheme using these TD2DHOG features has then been proposed. It has been shown that the use of TD2DHOG features reduces the cost of training a classifier pyramid, since a single classifier can be used to detect vehicles at the same or lower resolution than that at which the classifier has been trained, instead of training multiple resolution-specific classifiers.

Experimental results have shown that, when the proposed TD2DHOG features are used with the multiplicative factor and a single classifier for vehicle detection, they provide a detection accuracy similar to that obtained using these features with a classifier pyramid; moreover, the single classifier has a significant advantage over the classifier pyramid in that it yields substantial savings in training and storage costs. In addition, the proposed method provides a detection accuracy similar to or even better than that of the state-of-the-art techniques.