
1 Introduction

Drive recorders are used to investigate the causes of traffic accidents and to improve drivers’ safety awareness. With the increasing adoption of more advanced drive recorders, a large variety of driving data, including video images and sensor signals such as vehicle velocity and acceleration, can be continuously recorded and stored. Although these advances may contribute to traffic safety, the growing amount of driving data complicates the retrieval of desired information from large databases. Some researchers have studied methods for recognizing driving events, such as lane changing and passing, using HMM-based dynamic models [1–3]. In our previous work, a similarity-based retrieval system for finding driving data was proposed [4]. However, since that method used differences between histograms of driving behavior signals as the similarity measure, it did not make efficient use of the dynamic information in driving scenes. In this chapter, we study two driving scene retrieval systems that utilize dynamic information from driving scenes.

In the first study, we focus on driving behavior signals. The first retrieval system captures dynamic information from driving scenes by directly using sequences of driving behavior signals, thereby exploiting how these signals change over time. Six kinds of driving behavior signals (velocity, longitudinal and lateral acceleration, gas and brake pedal pressures, and steering angle) are used for calculating similarity between driving scenes. We compared two ways of combining these signals: early and late integration.

In the second study, we focus on environmental driving data collected from the road and surrounding vehicles. The second retrieval system uses a similarity measure that compares the road configuration and the motion of surrounding vehicles. The positions of surrounding vehicles and roadside barriers are detected with laser scanners mounted on the front and back of an instrumented vehicle, and the velocities of surrounding vehicles are estimated from their positions relative to the instrumented vehicle. Each scanned frame of a driving scene is categorized based on three general features: road type, congestion level, and the positions of surrounding objects. In addition, the motion of each surrounding vehicle is tracked to obtain its motion features, so that similarity between vehicles can be measured. The categorization results and the detected vehicle trajectories are combined to measure similarity between driving scenes.

2 Data Collection

The driving data used in our study was collected on real roads and recorded using the instrumented vehicle shown in Fig. 14.1. The collected signals included velocity [km/h], longitudinal and lateral acceleration [G], gas and brake pedal pressures [N], and steering wheel angle [deg]. Two laser scanners were mounted on the front and back of the vehicle to detect surrounding objects. Each scanner covered an 80° arc, with an effective range of about 100 m to the front and 55 m to the rear. A Kalman filter was employed to predict the motion of objects in blind areas. To assist in the subjective confirmation of retrieved scenes, synchronously recorded video of the scene ahead of the vehicle and of the driver’s feet, as well as a 360° panoramic view of the surroundings from an omnidirectional camera, was also available for every retrieved scene.

Fig. 14.1

Instrumented vehicle used for driving data collection [8]
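The Kalman filter mentioned above propagates a tracked object’s state forward when the object is temporarily occluded or outside the scanners’ field of view. The following is a minimal sketch of such a constant-velocity prediction step, assuming a state of position and velocity in the vehicle plane; the time step and noise values are illustrative assumptions, not the parameters used in the actual system.

```python
import numpy as np

def kalman_predict(x, P, dt=0.1, q=0.5):
    """Constant-velocity Kalman prediction step (illustrative sketch).

    x: state vector [x, y, vx, vy]; P: 4x4 state covariance.
    dt and the process-noise scale q are assumed values.
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Q = q * np.eye(4)                 # simplified process-noise model
    x_pred = F @ x                    # predicted position and velocity
    P_pred = F @ P @ F.T + Q          # predicted uncertainty
    return x_pred, P_pred

# Example: predict an occluded vehicle's position 0.1 s ahead.
x0 = np.array([10.0, 2.0, -1.5, 0.0])   # 10 m ahead, closing at 1.5 m/s
x1, P1 = kalman_predict(x0, np.eye(4))
```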

3 Driving Scene Retrieval Using Driving Behavior Signals

In this section, we describe the first similarity-based driving scene retrieval system, which uses the similarity of driving behavior signals. Six driving signals (velocity, longitudinal and lateral acceleration, gas and brake pedal pressures, and steering angle) were used for calculating similarity between driving scenes. We compared early and late integration of these signals.

3.1 Integration Methods for Driving Behavior Signals

3.1.1 Method 1: Early Integration

We retrieved similar driving scenes using two methods, early and late integration. Figure 14.2 shows the procedure for early integration. The six kinds of signals mentioned above were extracted from the scene to be retrieved, and each signal was normalized by mean and variance using all the data for all the drivers. The normalized signals of the query scene were represented as a vector, and the Euclidean distance between the vectors of the query scene and every scene in the database was measured. The database for the search consisted of about 200,000 vectors, one for each recorded scene. A fast retrieval technique was used to reduce retrieval time. The top five scenes with the smallest distances were chosen as similar scenes.

Fig. 14.2

Driving scene retrieval using driving behavior signals (early integration)
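As a rough illustration of the early-integration search, the sketch below normalizes the query signals, forms a single vector, and ranks all database scenes by Euclidean distance. The brute-force scan stands in for the fast retrieval technique mentioned above, and the variable names and data layout are assumptions.

```python
import numpy as np

def retrieve_early(query_signals, database, mean, std, top_k=5):
    """Early integration: one concatenated vector per scene.

    query_signals: raw samples of the six signals for the query scene,
    concatenated into one vector.
    database: (num_scenes, dim) array of already-normalized scene vectors.
    mean, std: per-dimension statistics over all data from all drivers.
    """
    q = (query_signals - mean) / std                 # normalize by mean and variance
    dists = np.linalg.norm(database - q, axis=1)     # Euclidean distance to every scene
    return np.argsort(dists)[:top_k]                 # indices of the most similar scenes
```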

3.1.2 Method 2: Late Integration

The other retrieval method used was late integration, shown in Fig. 14.3. Each of the six kinds of signals of a scene was represented as a vector, and the Euclidean distance between the vectors of the query scene and those of all the other scenes was calculated for each signal. The sum of the ranks of the six signals was calculated, and the five scenes that had the lowest summation were retrieved as similar scenes.

Fig. 14.3

Driving scene retrieval using driving behavior signals (late integration)
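The late-integration variant can be sketched as follows, assuming one normalized vector per signal per scene. The per-signal distances are converted to ranks, and the rank sums determine the final ordering; the dictionary-based data layout is an assumption.

```python
import numpy as np

def retrieve_late(query_vectors, database_vectors, top_k=5):
    """Late integration: rank-sum fusion over the six signals.

    query_vectors: dict mapping each signal name to the query's vector.
    database_vectors: dict mapping each signal name to a (num_scenes, dim) array.
    """
    num_scenes = next(iter(database_vectors.values())).shape[0]
    rank_sum = np.zeros(num_scenes)
    for name, q in query_vectors.items():
        dists = np.linalg.norm(database_vectors[name] - q, axis=1)
        rank_sum += np.argsort(np.argsort(dists))    # rank 0 = closest for this signal
    return np.argsort(rank_sum)[:top_k]              # scenes with the lowest rank sum
```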

3.2 Retrieval Performance Evaluation

To evaluate these methods, we conducted a driving scene retrieval experiment using driving data collected on city roads from 74 drivers (35 males and 39 females). About 45 min of driving data was recorded per driver, for a total of about 54 h. The sampling rate of the driving signals was 10 Hz.

3.2.1 Experimental Condition

Eight kinds of driving events (stops, starts, right and left turns, right and left lane changes, and going up and down hills) were selected as query scenes, and similar scenes were retrieved using the two techniques described in Sect. 14.3.1. Scenes occurring less than 2 s before or after the query scene, and scenes which had already been retrieved, were excluded from being candidates for retrieval. We chose a total of 80 query scenes, which included about 10 scenes for each type of event.

Retrieval performance was evaluated in terms of retrieval accuracy, i.e., the percentage of correctly retrieved scenes in proportion to the total number of retrieved scenes. Whether or not a scene was correctly retrieved was determined subjectively by human validation.

3.2.2 Results

Experimental results are shown in Fig. 14.4. Retrieval accuracy averaging more than 95 % was achieved for driving scenes of stops, starts, and right and left turns, while accuracy was relatively lower for scenes of right and left lane changes, and going up and down hills. Retrieval accuracy of situations involving right turns was higher using the early integration method, but for scenes going down hills, the late integration method was more accurate. On average, the early integration method gave slightly better performance.

Fig. 14.4

Retrieval accuracy for driving behavior signals

4 Driving Scene Retrieval Using Environmental Driving Signals

In contrast to the first study, which employed in-vehicle driving behavior signals, in this section we measured the similarity between scenes by comparing the driving environments detected by the laser scanners.

4.1 Laser Data Preprocessing

4.1.1 Clustering of Laser Data and Tracking of Vehicles

The first step towards automatic scene retrieval was the clustering of the discrete laser dots obtained by the laser scanners from the surrounding driving environment. Each cluster was a set of distance measurements in a plane, grouped closely together and thus probably belonging to a single object. While many approaches have been used to calculate such physical distances [5], we simply used Euclidean distance here. Due to laser dot detection errors, not every cluster actually represented a separate object; sometimes more than one cluster belonged to a single object. Since all of the laser data in this study were recorded on expressways, in most cases a laser dot had to belong to either a vehicle or a roadside barrier, so it was not difficult to merge such clusters using our prior knowledge of the shapes of these objects [6]. Each surrounding vehicle was then modeled as a rigid box, characterized by its orientation, position, and velocity. By tracking the vehicles with a Kalman filter, we estimated their dynamic features even when they were outside the range of the laser scanners.
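A simple distance-threshold clustering of the kind described above might look like the sketch below; the 0.8 m threshold and the single-link, breadth-first grouping are assumptions for illustration, not the exact procedure used in the system.

```python
import numpy as np

def cluster_laser_dots(points, threshold=0.8):
    """Single-link clustering of 2-D laser points (N x 2 array, in metres).

    Points whose Euclidean distance is below the threshold (an assumed
    value) end up in the same cluster.
    """
    points = np.asarray(points, dtype=float)
    unassigned = set(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = [seed], [seed]
        while frontier:                                  # grow the cluster breadth-first
            i = frontier.pop()
            near = [j for j in unassigned
                    if np.linalg.norm(points[j] - points[i]) < threshold]
            for j in near:
                unassigned.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(points[cluster])
    return clusters
```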

4.1.2 Frame Categorization

A frame categorization method was used to categorize laser-acquired driving frames based on three general features, in order to reduce the number of candidates and facilitate fast retrieval. The scenes were categorized based on road type, congestion level, and the relative positions of surrounding objects. The three features were defined as follows:

  • Road type was divided into three classes: left curve, straight line, and right curve. Since two laser scanners were used, one on the front of the vehicle and one on the back bumper, they collected information about road types separately. Their combined data was used to define the road type for each frame of a driving scene, for example, “left curve, straight.”

  • Road congestion level was divided into two classes: “free flow” and “traffic jam.” A Greenshields model [7] was employed to estimate the congestion level for each lane. The road congestion level of a driving frame was designated “traffic jam” if any lane in the frame was estimated as “traffic jam”; otherwise, the frame was designated “free flow.”

  • Relative positions of surrounding vehicles were classified into 450 situations based on whether there was another vehicle in each of eight surrounding directions and whether there was a roadside barrier on the left or right of the driver’s vehicle.

For example, a frame could be represented as “(left curve, straight),” “traffic jam,” and “21.”
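The sketch below shows how a single frame might be reduced to such a three-part label. The congestion rule is a simplified stand-in for the Greenshields-model estimate, and the free-flow speed, jam density, threshold, and position code are all illustrative assumptions.

```python
def categorize_frame(front_road, rear_road, lane_densities, lane_speeds,
                     position_code, jam_density=120.0, free_speed=100.0):
    """Return a (road type, congestion level, position code) label for one frame.

    front_road / rear_road: 'left curve', 'straight', or 'right curve'.
    lane_densities [veh/km] and lane_speeds [km/h]: per-lane estimates.
    position_code: index of the surrounding-object configuration
    (one of the 450 situations described above).
    """
    road_type = (front_road, rear_road)

    # Simplified stand-in for the Greenshields-based estimate: flag a lane as
    # jammed when its speed falls well below the speed predicted by the linear
    # speed-density relation (the 0.5 factor is an assumption).
    congestion = "free flow"
    for k, v in zip(lane_densities, lane_speeds):
        v_expected = free_speed * (1.0 - k / jam_density)
        if v < 0.5 * v_expected:
            congestion = "traffic jam"
            break
    return road_type, congestion, position_code

# Example: a frame labelled ("left curve", "straight"), "traffic jam", 21.
label = categorize_frame("left curve", "straight", [80.0], [10.0], 21)
```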

4.2 Similarity Measure for Surrounding Environment

Here, we measured the similarity between driving scenes based on the surrounding environment, using three procedures: first, their frame categories (given in Sect. 14.4.1.2) were compared; second, the relative positions of the surrounding vehicles were compared; and finally, their motion features were compared.

4.2.1 Comparison of Frame Categories

In this study, each driving scene consisted of 100 frames (10 s), so each scene could be represented as a vector with 400 dimensions (four categorical features per frame). We then calculated the difference between scenes using Hamming distance to measure how similar the frame categories of two scenes were. The per-element Hamming distance was 0 only when the compared features were exactly the same; if the two features differed, the value was 1. Thus, a total Hamming distance of 0 meant two scenes were identical, and a value of 400 meant they were completely dissimilar. Scenes with a Hamming distance below a threshold of 150 were extracted as candidates for further processing.
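A minimal sketch of this category comparison is given below, assuming each scene is stored as a flat sequence of 400 categorical features (four per frame over 100 frames). The threshold of 150 is taken from the text; everything else is illustrative.

```python
def hamming_distance(scene_a, scene_b):
    """Number of positions at which the categorical features of two scenes differ.

    scene_a, scene_b: sequences of 400 features, e.g. (front road type,
    rear road type, congestion level, position code) for each of 100 frames.
    """
    return sum(1 for a, b in zip(scene_a, scene_b) if a != b)

def candidate_scenes(query, database, threshold=150):
    """Keep the scenes whose Hamming distance to the query is below the threshold."""
    return [idx for idx, scene in enumerate(database)
            if hamming_distance(query, scene) < threshold]
```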

4.2.2 Comparison of Surrounding Vehicle Positions

The second step was to compare the positions of vehicles in key frames of two scenes. We assumed the first frame of each scene to be its key frame, because people generally focus on the beginning of a scene. As shown in Fig. 14.5, a key frame was divided into a grid, and the frame was represented as a matrix $G$. Each element of the matrix gives the number of vehicles in the corresponding cell of the grid.

Fig. 14.5

Example of a frame and its matrix. Left: Each cell of the grid is composed of 25 × 25 pixels. The grid is centered on the host vehicle. Right: The value of each element of the matrix represents the number of vehicles in the corresponding cell

Assume that frames $F_1$ and $F_2$ are represented by symbolized matrices $G_1$ and $G_2$. To compute the similarity of the two matrices, we first matched all cells in the two frames:

$$ \Delta G\left(F_1,F_2\right)=\sum_i \sum_j \left|g_{i,j}^{(1)}-g_{i,j}^{(2)}\right|, $$
(14.1)

where $g_{i,j}^{(1)}$ and $g_{i,j}^{(2)}$ denote the number of vehicles in cell $(i,j)$ of $G_1$ and $G_2$, respectively, and the value of $\Delta G$ represents the distance between them. For instance, frames $F_1$ and $F_2$ match perfectly if and only if $\Delta G$ equals zero. However, this rarely happens, because even if two frames are almost identical, this symbolization method sometimes puts vehicles with similar positions into different cells. To reduce the errors caused by such problems, we also allowed soft matching. Vehicles in two frames were considered to match if there were the same number of vehicles in the cells at the same positions in the two matrices. In addition, vehicles were also considered to match if there were an equal number of vehicles in nearby cells, at the cost of a penalty. Thus, the final distance between frames $F_1$ and $F_2$ is defined as

$$ d\left({F}_1,{F}_2\right)=\Delta {G}^{\prime}\left({F}_1,{F}_2\right)+\frac{k}{K}, $$
(14.2)

where $\Delta G'$ is the value of $\Delta G$ after soft matching, $k$ is the number of soft matches applied in computing $\Delta G'$, and $K$ is an empirically defined normalization factor that controls the penalty for soft matching.

After that, the distance $d(F_1,F_2)$ was used to calculate the similarity between $F_1$ and $F_2$:

$$ s\left({F}_1,{F}_2\right)=\frac{d\left({F}_1,{F}_2\right)}{n_1+{n}_2}, $$
(14.3)

where $n_1$ and $n_2$ denote the numbers of vehicles in frames $F_1$ and $F_2$, respectively. Frames with a distance below 0.5 from the first frame of a query scene, or from the frames within 2 s before or after it, were selected as key frames for the next step in processing.
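The following sketch implements one plausible reading of Eqs. (14.1)–(14.3): cell-wise counting, a greedy soft match that cancels a surplus vehicle against a deficit in a neighbouring cell, and the normalized distance. The value of K and the greedy matching order are assumptions.

```python
import numpy as np

def frame_distance(G1, G2, K=10.0):
    """Soft-matched distance between two key-frame grid matrices (Eqs. 14.1-14.2).

    G1, G2: integer numpy matrices of vehicle counts per grid cell.
    K: assumed normalization factor for the soft-matching penalty.
    """
    diff = G1.astype(int) - G2.astype(int)       # per-cell surplus / deficit
    rows, cols = diff.shape
    soft = 0
    for i in range(rows):
        for j in range(cols):
            # Cancel a surplus in one frame against a deficit in a
            # neighbouring cell of the other frame (one soft match each).
            while diff[i, j] > 0:
                neighbours = [(i + di, j + dj)
                              for di in (-1, 0, 1) for dj in (-1, 0, 1)
                              if (di or dj)
                              and 0 <= i + di < rows and 0 <= j + dj < cols]
                hit = next(((a, b) for a, b in neighbours if diff[a, b] < 0), None)
                if hit is None:
                    break
                diff[i, j] -= 1
                diff[hit] += 1
                soft += 1
    delta_g_soft = np.abs(diff).sum()            # Delta G' after soft matching
    return delta_g_soft + soft / K               # Eq. (14.2)

def frame_similarity(G1, G2, K=10.0):
    """Normalized distance s(F1, F2) of Eq. (14.3)."""
    n1, n2 = int(G1.sum()), int(G2.sum())
    return frame_distance(G1, G2, K) / max(n1 + n2, 1)   # guard against empty frames
```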

4.2.3 Comparison of Surrounding Vehicle Motion

If surrounding vehicles have nearly the same positions in the first frames of scenes, as well as similar trajectories and velocities, we believe there is a high probability that these are matching scenes. Also, comparing the motion of surrounding vehicles overcomes problems caused by grid division and achieves a faster search than with frame-to-frame matching between scenes.

Assume that scenes $S_1$ and $S_2$ are represented by their vehicle sets (excluding the host vehicle), $V_1 = \{v_1^{(1)}, v_2^{(1)}, \ldots, v_M^{(1)}\}$ and $V_2 = \{v_1^{(2)}, v_2^{(2)}, \ldots, v_N^{(2)}\}$, where $M$ and $N$ are the total numbers of surrounding vehicles observed in $S_1$ and $S_2$. Each surrounding vehicle, $v_i^{(1)}$ or $v_j^{(2)}$, is represented by a sequence of vehicle motion feature vectors, each consisting, at time $t$, of the longitudinal position $y_i$ and lateral position $x_i$ together with their first-order dynamics $\Delta y_i$ and $\Delta x_i$:

$$ {\left({y}_i(t),{x}_i(t),\Delta {y}_i(t),\Delta {x}_i(t)\right)}^{\mathrm{T}}. $$
(14.4)

Dynamic features were calculated by the following equation:

$$ \Delta y_i(t)=\frac{\sum_{l=-L}^{L} l\cdot y_i\left(t+l\right)}{\sum_{l=-L}^{L} l^2}, $$
(14.5)

in which $y_i(t)$ is the $i$th vehicle’s driving signal at time $t$, and $L$ is the window size for linear regression. $\Delta x_i(t)$ was calculated in the same way. The distance between vehicles $v_i^{(1)}$ and $v_j^{(2)}$ in scenes $S_1$ and $S_2$, respectively, was calculated as a Mahalanobis distance:

$$ {d}^2\left({v}_i^{(1)},{v}_j^{(2)}\right)={\left({\mu}_{v_i^{(1)}}-{\mu}_{v_j^{(2)}}\right)}^{\mathrm{T}}{\Sigma}_{v_i^{(1)},{v}_j^{(2)}}^{-1}\left({\mu}_{v_i^{(1)}}-{\mu}_{v_j^{(2)}}\right), $$
(14.6)

where $\mu_v$ represents the four-dimensional mean vector (the means of the longitudinal position, the lateral position, and their first-order dynamics) of a vehicle $v$, and $\Sigma_{v_i^{(1)},v_j^{(2)}}$ is the four-by-four covariance matrix of the four features for vehicles $v_i^{(1)}$ and $v_j^{(2)}$. This calculates the distance between a pair of vehicles by comparing the distributions of their four-dimensional features. Based on our preliminary experiment, a pair of vehicles with a Mahalanobis distance below a threshold of 15.0 was regarded as similar.
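A sketch of the motion features and the vehicle-to-vehicle distance is shown below. Equation (14.5) is implemented as a regression slope over a window of ±L samples; how the covariance matrix in Eq. (14.6) was estimated is not specified in the text, so pooling the two vehicles’ features is our assumption, as are the window size and the data layout.

```python
import numpy as np

def dynamic_feature(signal, t, L=5):
    """First-order dynamics of Eq. (14.5): regression slope over +/- L samples."""
    ls = np.arange(-L, L + 1)
    window = signal[t - L:t + L + 1]
    return np.dot(ls, window) / np.dot(ls, ls)

def motion_features(y, x, L=5):
    """Stack (y, x, dy, dx) feature vectors for every valid time index."""
    return np.array([[y[t], x[t],
                      dynamic_feature(y, t, L), dynamic_feature(x, t, L)]
                     for t in range(L, len(y) - L)])

def vehicle_distance(feats_a, feats_b):
    """Squared Mahalanobis distance of Eq. (14.6) between two vehicles.

    The 4x4 covariance is estimated from the pooled features of both vehicles
    (an assumption) and is assumed to be non-singular.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov = np.cov(np.vstack([feats_a, feats_b]).T)
    diff = mu_a - mu_b
    return float(diff @ np.linalg.inv(cov) @ diff)

# Pairs with vehicle_distance(...) below 15.0 were treated as similar.
```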

To obtain vehicle-to-vehicle matches, we calculated $d(v_i^{(1)}, v_j^{(2)})$ for all $i$ and $j$ between the scenes and selected pairs in order from the smallest to the largest distance. We considered two scenes to be similar to each other if there were enough similar vehicles in both scenes. The similarity $p$ between $S_1$ and $S_2$ was defined as the sum of the weights of the similar vehicles divided by the sum of the weights of all the vehicles in the scenes:

$$ p\left(S_1,S_2\right)=\frac{\sum_{i\in X_1}\sum_{t\in T_i^{(1)}} w_t^{(i)}+\sum_{i\in X_2}\sum_{t\in T_i^{(2)}} w_t^{(i)}}{\sum_{i\in Y_1}\sum_{t\in T_i^{(1)}} w_t^{(i)}+\sum_{i\in Y_2}\sum_{t\in T_i^{(2)}} w_t^{(i)}}, $$
(14.7)

where $X_1$ and $X_2$ denote the sets of similar vehicles, and $Y_1$ and $Y_2$ denote the sets of all vehicles in $S_1$ and $S_2$, respectively. $w_t^{(i)}$ denotes the weight of vehicle $v_i$ at time $t$, and $T_i^{(1)}$ and $T_i^{(2)}$ are the sets of frame numbers in which $v_i^{(1)}$ or $v_i^{(2)}$ was observed in $S_1$ or $S_2$, respectively. Here, the “weight” expresses the importance of a surrounding vehicle and was taken from a modified Gaussian distribution, as illustrated in Fig. 14.6. We used a modified Gaussian distribution stretched towards the front of the vehicle because, generally, a driver is more aware of nearby leading vehicles while driving. For example, the surrounding vehicles in front of the driver’s vehicle are more important than those beside or behind it. It follows that a pair of similar vehicles near the driver’s vehicle makes two scenes more similar than a pair located farther away.

Fig. 14.6

A modified two-dimensional Gaussian distribution, centered on the driver’s vehicle, where surrounding vehicles with higher values denote greater importance
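The sketch below illustrates one way to realize the weighting and the scene-level similarity of Eq. (14.7). The exact shape of the modified Gaussian is not given in the text, so the front-stretched form and all sigma values are assumptions; the matched-vehicle sets are assumed to come from the distance-based pairing described above.

```python
import numpy as np

def vehicle_weight(x, y, sigma_side=5.0, sigma_front=15.0, sigma_rear=7.0):
    """Importance of a surrounding vehicle at lateral offset x and longitudinal
    offset y [m] from the driver's vehicle; the Gaussian is stretched towards
    the front (positive y). All sigma values are assumptions."""
    sigma_y = sigma_front if y >= 0 else sigma_rear
    return np.exp(-0.5 * ((x / sigma_side) ** 2 + (y / sigma_y) ** 2))

def scene_similarity(tracks_1, tracks_2, matched_1, matched_2):
    """Eq. (14.7): summed weights of matched vehicles over summed weights of all vehicles.

    tracks_k: list of per-vehicle sequences of (x, y) positions over the frames
    in which the vehicle was observed in scene k.
    matched_k: indices of the vehicles in scene k that found a similar partner.
    """
    def total(tracks, indices):
        return sum(vehicle_weight(x, y) for i in indices for x, y in tracks[i])

    num = total(tracks_1, matched_1) + total(tracks_2, matched_2)
    den = total(tracks_1, range(len(tracks_1))) + total(tracks_2, range(len(tracks_2)))
    return num / den
```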

4.3 Retrieval Performance Evaluation

The proposed driving scene retrieval system was evaluated using a database containing expressway scenes from 57 drivers (28 males and 29 females), recorded with the instrumented vehicle shown in Fig. 14.1. The database contained approximately 140,000 driving frames, all sampled at 10 Hz. We compared retrieval accuracy and speed for different types of scenes under various retrieval conditions, using subjective scores and by measuring retrieval speed in CPU time. Here, “retrieval conditions” refer to combinations of the following similarity measures presented in Sect. 14.4.2:

  (a) Based on frame category

  (b) Based on surrounding vehicle position

  (c) Based on surrounding vehicle motion

The combinations are represented as a, c, a + c, b + c, and a + b + c. We did not use b or a + b, since b only considered the first frame of a scene and would not be accurate if used alone.

The experiment was conducted as follows:

  • Five driving scenes each, for straight road, curve, traffic jam, and lane change, were randomly selected as queries.

  • For each query scene, we evaluated retrieval accuracy and retrieval speed for each retrieval condition. For each condition, the top five similar scenes were retrieved, and they were used for the evaluation.

4.3.1 Comparison of Retrieval Accuracy Using Subjective Scores

In this comparison, the subjective scores of five volunteers were used to judge which retrieval condition, or combination of retrieval conditions, was able to select scenes with the highest similarity to a query scene for a given driving situation. Each volunteer gave scores, from 1 (lowest) to 5 (highest), to the top five retrieved scenes for each query under each retrieval condition. Scenes with a score of 3 or higher were considered to be similar. A score of 5 indicated perfect similarity, while a score of 1 indicated complete dissimilarity. The retrieval accuracy of a given scene under a given retrieval condition was estimated as the average of the scores from the five volunteers.

The experimental results, shown in Fig. 14.7, indicate that condition a + b + c achieved much higher accuracy than the other conditions in various driving situations.

Fig. 14.7

Comparison of retrieval accuracy

4.3.2 Comparison of Retrieval Speed Using CPU Time

In order to compare processing speed, the proposed driving scene retrieval system was installed on a PC with an Intel Core i5-650 CPU (3.20 GHz) running the Windows 7 operating system. The CPU time for each query was recorded under each retrieval condition, and the average time to retrieve the top five driving scenes was calculated. This was taken to represent the system’s speed under a given retrieval condition for each scene. Figure 14.8 shows the average time taken to retrieve scenes from the 140,000-frame database. On average, retrieval condition a took the least time, and condition a + b + c was the next fastest.

Fig. 14.8

Comparison of retrieval speed

5 Conclusions

In this chapter, we developed two systems for retrieving recorded driving scenes based on measuring the similarity of driving behavior signals and environmental driving signals. In the first study, similar scenes were retrieved using driving behavior signals, which were combined using two methods, early and late integration. Experimental results showed that an average retrieval accuracy of more than 95 % was achieved for driving scenes of stops, starts, and right and left turns. In most situations, the early integration method achieved better performance than the late integration method. In the second study, we used environmental driving signals, with the idea that similar driving scenes could be retrieved by measuring similarity in the surrounding environment. Experimental results showed that the integrated use of information from surrounding vehicles and road conditions achieved higher retrieval accuracy than the use of either type of information alone.

Currently, we are working to integrate these two systems, to see if retrieval accuracy can be further improved.