
1 Introduction

Drive recorders are used to investigate the causes of traffic accidents and to improve drivers’ safety awareness. With the increasing adoption of more advanced drive recorders, a large variety of driving data, including video images and sensor signals such as vehicle velocity and acceleration, can be continuously recorded and stored. Although these advances may contribute to traffic safety, the growing amount of driving data complicates the retrieval of desired information from large databases. Some researchers have studied methods for recognizing driving events, such as lane changing and passing, using HMM-based dynamic models [1–3]. In our previous work, a similarity-based retrieval system for finding driving data was proposed [4]. However, since that method used differences between histograms of driving behavior signals as the similarity measure, it did not make efficient use of the dynamic information in driving scenes. In this chapter, we study two driving scene retrieval systems that utilize dynamic information from driving scenes.

In the first study, we focus on driving behavior signals. The first retrieval system captures dynamic information from driving scenes by directly using sequences of driving behavior signals, thereby exploiting how these signals change over time. Six kinds of driving behavior signals (velocity, longitudinal and lateral acceleration, gas and brake pedal pressures, and steering angle) are used for calculating similarity between driving scenes. We compared two ways of combining these signals: early and late integration.

In the second study, we focus on environmental driving data collected from the road and surrounding vehicles. The second retrieval system uses a similarity measure that compares the road configuration and the motion of surrounding vehicles. The positions of surrounding vehicles and roadside barriers are detected with laser scanners mounted on the front and back of an instrumented vehicle, and the velocities of surrounding vehicles are estimated from their positions relative to the instrumented vehicle. Each scanned frame of a driving scene is categorized based on three general features: road type, congestion level, and the positions of surrounding objects. In addition, the motion of each surrounding vehicle is tracked to obtain its motion features, so that similarity between vehicles can be measured. The categorization results and the detected vehicle trajectories are combined to measure similarity between driving scenes.

2 Data Collection

The driving data used in our study was collected on real roads and recorded using the instrumented vehicle shown in Fig. 14.1. The collected signals included velocity [km/h], longitudinal and lateral acceleration [G], gas and brake pedal pressures [N], and steering wheel angle [deg]. Two laser scanners were mounted on the front and back of the vehicle to detect surrounding objects. Each scanner covered an 80° arc, with an effective range of about 100 m to the front and 55 m to the rear. A Kalman filter was employed to predict the motion of objects in blind areas. To assist in the subjective confirmation of retrieved scenes, synchronously recorded video of the scene ahead of the vehicle and of the driver’s feet, as well as a 360° panoramic view of the surroundings from an omnidirectional camera, was also available for every retrieved scene.

Fig. 14.1

Instrumented vehicle used for driving data collection [8]
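The Kalman filter mentioned above propagates a tracked object’s state forward when the object is temporarily occluded or outside the scanners’ field of view. The following is a minimal sketch of such a constant-velocity prediction step, assuming a state of position and velocity in the vehicle plane; the time step and noise values are illustrative assumptions, not the parameters used in the actual system.

```python
import numpy as np

def kalman_predict(x, P, dt=0.1, q=0.5):
    """Constant-velocity Kalman prediction step (illustrative sketch).

    x: state vector [x, y, vx, vy]; P: 4x4 state covariance.
    dt and the process-noise scale q are assumed values.
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Q = q * np.eye(4)                 # simplified process-noise model
    x_pred = F @ x                    # predicted position and velocity
    P_pred = F @ P @ F.T + Q          # predicted uncertainty
    return x_pred, P_pred

# Example: predict an occluded vehicle's position 0.1 s ahead.
x0 = np.array([10.0, 2.0, -1.5, 0.0])   # 10 m ahead, closing at 1.5 m/s
x1, P1 = kalman_predict(x0, np.eye(4))
```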

3 Driving Scene Retrieval Using Driving Behavior Signals

In this section, we describe the first similarity-based driving scene retrieval system, which uses the similarity of driving behavior signals. Six driving signals (velocity, longitudinal and lateral acceleration, gas and brake pedal pressures, and steering angle) were used for calculating similarity between driving scenes. We compared early and late integration of these signals.

3.1 Integration Methods for Driving Behavior Signals

3.1.1 Method 1: Early Integration

We retrieved similar driving scenes using two methods, early and late integration. Figure 14.2 shows the procedure for early integration. The six kinds of signals mentioned above were extracted from the scene to be retrieved, and each signal was normalized by mean and variance using all the data for all the drivers. The normalized signals of the query scene were represented as a vector, and the Euclidean distance between the vectors of the query scene and every scene in the database was measured. The database for the search consisted of about 200,000 vectors, one for each recorded scene. A fast retrieval technique was used to reduce retrieval time. The top five scenes with the smallest distances were chosen as similar scenes.

Fig. 14.2

Driving scene retrieval using driving behavior signals (early integration)
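As a rough illustration of the early-integration search, the sketch below normalizes the query signals, forms a single vector, and ranks all database scenes by Euclidean distance. The brute-force scan stands in for the fast retrieval technique mentioned above, and the variable names and data layout are assumptions.

```python
import numpy as np

def retrieve_early(query_signals, database, mean, std, top_k=5):
    """Early integration: one concatenated vector per scene.

    query_signals: raw samples of the six signals for the query scene,
    concatenated into one vector.
    database: (num_scenes, dim) array of already-normalized scene vectors.
    mean, std: per-dimension statistics over all data from all drivers.
    """
    q = (query_signals - mean) / std                 # normalize by mean and variance
    dists = np.linalg.norm(database - q, axis=1)     # Euclidean distance to every scene
    return np.argsort(dists)[:top_k]                 # indices of the most similar scenes
```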

3.1.2 Method 2: Late Integration

The other retrieval method used was late integration, shown in Fig. 14.3. Each of the six kinds of signals of a scene was represented as a vector, and the Euclidean distance between the vectors of the query scene and those of all the other scenes was calculated for each signal. The sum of the ranks of the six signals was calculated, and the five scenes that had the lowest summation were retrieved as similar scenes.

Fig. 14.3

Driving scene retrieval using driving behavior signals (late integration)
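The late-integration variant can be sketched as follows, assuming one normalized vector per signal per scene. The per-signal distances are converted to ranks, and the rank sums determine the final ordering; the dictionary-based data layout is an assumption.

```python
import numpy as np

def retrieve_late(query_vectors, database_vectors, top_k=5):
    """Late integration: rank-sum fusion over the six signals.

    query_vectors: dict mapping each signal name to the query's vector.
    database_vectors: dict mapping each signal name to a (num_scenes, dim) array.
    """
    num_scenes = next(iter(database_vectors.values())).shape[0]
    rank_sum = np.zeros(num_scenes)
    for name, q in query_vectors.items():
        dists = np.linalg.norm(database_vectors[name] - q, axis=1)
        rank_sum += np.argsort(np.argsort(dists))    # rank 0 = closest for this signal
    return np.argsort(rank_sum)[:top_k]              # scenes with the lowest rank sum
```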

3.2 Retrieval Performance Evaluation

To evaluate these methods, we conducted a driving scene retrieval experiment using driving data collected on city roads from 74 drivers (35 males and 39 females). About 45 min of driving data was recorded per driver, for a total of about 54 h. The sampling rate of the driving signals was 10 Hz.

3.2.1 Experimental Condition

Eight kinds of driving events (stops, starts, right and left turns, right and left lane changes, and going up and down hills) were selected as query scenes, and similar scenes were retrieved using the two techniques described in Sect. 14.3.1. Scenes occurring less than 2 s before or after the query scene, and scenes which had already been retrieved, were excluded from being candidates for retrieval. We chose a total of 80 query scenes, which included about 10 scenes for each type of event.

Retrieval performance was evaluated in terms of retrieval accuracy, i.e., the percentage of correctly retrieved scenes in proportion to the total number of retrieved scenes. Whether or not a scene was correctly retrieved was determined subjectively by human validation.

3.2.2 Results

Experimental results are shown in Fig. 14.4. Retrieval accuracy averaging more than 95 % was achieved for driving scenes of stops, starts, and right and left turns, while accuracy was relatively lower for scenes of right and left lane changes, and going up and down hills. Retrieval accuracy of situations involving right turns was higher using the early integration method, but for scenes going down hills, the late integration method was more accurate. On average, the early integration method gave slightly better performance.

Fig. 14.4

Retrieval accuracy for driving behavior signals

4 Driving Scene Retrieval Using Environmental Driving Signals

In contrast to the first study, which employed in-vehicle driving behavior signals, in this section we measured the similarity between scenes by comparing the driving environments detected by the laser scanners.

4.1 Laser Data Preprocessing

4.1.1 Clustering of Laser Data and Tracking of Vehicles

The first step towards automatic scene retrieval was the clustering of the discrete laser dots obtained by the laser scanners from the surrounding driving environment. Each cluster was a set of distance measurements in a plane, grouped closely together and thus probably belonging to a single object. While many approaches have been used to calculate such physical distances [5], we simply used Euclidean distance here. Due to laser dot detection errors, not every cluster actually represented a separate object; sometimes more than one cluster belonged to a single object. Since all of the laser data in this study were recorded on expressways, in most cases a laser dot had to belong to either a vehicle or a roadside barrier, so it was not difficult to merge such clusters using our prior knowledge of the shapes of these objects [6]. Each surrounding vehicle was then modeled as a rigid box, characterized by its orientation, position, and velocity. By tracking the vehicles with a Kalman filter, we estimated their dynamic features even when they were outside the range of the laser scanners.
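A simple distance-threshold clustering of the kind described above might look like the sketch below; the 0.8 m threshold and the single-link, breadth-first grouping are assumptions for illustration, not the exact procedure used in the system.

```python
import numpy as np

def cluster_laser_dots(points, threshold=0.8):
    """Single-link clustering of 2-D laser points (N x 2 array, in metres).

    Points whose Euclidean distance is below the threshold (an assumed
    value) end up in the same cluster.
    """
    points = np.asarray(points, dtype=float)
    unassigned = set(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = [seed], [seed]
        while frontier:                                  # grow the cluster breadth-first
            i = frontier.pop()
            near = [j for j in unassigned
                    if np.linalg.norm(points[j] - points[i]) < threshold]
            for j in near:
                unassigned.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(points[cluster])
    return clusters
```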

4.1.2 Frame Categorization

A frame categorization method was used to categorize laser-acquired driving frames based on three general features, in order to reduce the number of candidates and facilitate fast retrieval. The scenes were categorized based on road type, congestion level, and the relative positions of surrounding objects. The three features were defined as follows:

  • Road type was divided into three classes: left curve, straight line, and right curve. Since two laser scanners were used, one on the front of the vehicle and one on the back bumper, they collected information about road types separately. Their combined data was used to define the road type for each frame of a driving scene, for example, “left curve, straight.”

  • Road congestion level was divided into two classes: “free flow” and “traffic jam.” A Greenshields model [7] was employed to estimate the congestion level for each lane. The road congestion level of a driving frame was designated “traffic jam” if any lane in the frame was estimated as “traffic jam”; otherwise, the frame was designated “free flow.”

  • Relative positions of surrounding vehicles were classified into 450 situations based on whether there was another vehicle in each of eight surrounding directions and whether there was a roadside barrier on the left or right of the driver’s vehicle.

For example, a frame could be represented as “(left curve, straight),” “traffic jam,” and “21.”
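The sketch below shows how a single frame might be reduced to such a three-part label. The congestion rule is a simplified stand-in for the Greenshields-model estimate, and the free-flow speed, jam density, threshold, and position code are all illustrative assumptions.

```python
def categorize_frame(front_road, rear_road, lane_densities, lane_speeds,
                     position_code, jam_density=120.0, free_speed=100.0):
    """Return a (road type, congestion level, position code) label for one frame.

    front_road / rear_road: 'left curve', 'straight', or 'right curve'.
    lane_densities [veh/km] and lane_speeds [km/h]: per-lane estimates.
    position_code: index of the surrounding-object configuration
    (one of the 450 situations described above).
    """
    road_type = (front_road, rear_road)

    # Simplified stand-in for the Greenshields-based estimate: flag a lane as
    # jammed when its speed falls well below the speed predicted by the linear
    # speed-density relation (the 0.5 factor is an assumption).
    congestion = "free flow"
    for k, v in zip(lane_densities, lane_speeds):
        v_expected = free_speed * (1.0 - k / jam_density)
        if v < 0.5 * v_expected:
            congestion = "traffic jam"
            break
    return road_type, congestion, position_code

# Example: a frame labelled ("left curve", "straight"), "traffic jam", 21.
label = categorize_frame("left curve", "straight", [80.0], [10.0], 21)
```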

4.2 Similarity Measure for Surrounding Environment

Here, we measured the similarity between driving scenes based on the surrounding environment, using three procedures: first, their frame categories (given in Sect. 14.4.1.2) were compared; second, the relative positions of the surrounding vehicles were compared; and finally, their motion features were compared.

4.2.1 Comparison of Frame Categories

In this study, each driving scene consisted of 100 frames (10 s), so each scene could be represented as a vector with 400 dimensions (four categorical features per frame). We then calculated the difference between scenes using Hamming distance to measure how similar the frame categories of two scenes were. The per-element Hamming distance was 0 only when the compared features were exactly the same; if the two features differed, the value was 1. Thus, a total Hamming distance of 0 meant two scenes were identical, and a value of 400 meant they were completely dissimilar. Scenes with a Hamming distance below a threshold of 150 were extracted as candidates for further processing.
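A minimal sketch of this category comparison is given below, assuming each scene is stored as a flat sequence of 400 categorical features (four per frame over 100 frames). The threshold of 150 is taken from the text; everything else is illustrative.

```python
def hamming_distance(scene_a, scene_b):
    """Number of positions at which the categorical features of two scenes differ.

    scene_a, scene_b: sequences of 400 features, e.g. (front road type,
    rear road type, congestion level, position code) for each of 100 frames.
    """
    return sum(1 for a, b in zip(scene_a, scene_b) if a != b)

def candidate_scenes(query, database, threshold=150):
    """Keep the scenes whose Hamming distance to the query is below the threshold."""
    return [idx for idx, scene in enumerate(database)
            if hamming_distance(query, scene) < threshold]
```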

4.2.2 Comparison of Surrounding Vehicle Positions

The second step was to compare the positions of vehicles in key frames of two scenes. We assumed the first frame of each scene to be its key frame, because people generally focus on the beginning of a scene. As shown in Fig. 14.5, a key frame was divided into a grid, and the frame was represented as a matrix $G$. Each element of the matrix gives the number of vehicles in the corresponding cell of the grid.

Fig. 14.5

Example of a frame and its matrix. Left: Each cell of the grid is composed of 25 × 25 pixels. The grid is centered on the host vehicle. Right: The value of each element of the matrix represents the number of vehicles in the corresponding cell

Assume that frames $F_1$ and $F_2$ are represented by symbolized matrices $G_1$ and $G_2$. To compute the similarity of the two matrices, we first matched all cells in the two frames:

$$ \Delta G\left(F_1,F_2\right)=\sum_i \sum_j \left|g_{i,j}^{(1)}-g_{i,j}^{(2)}\right|, $$
(14.1)

where $g_{i,j}^{(1)}$ and $g_{i,j}^{(2)}$ denote the number of vehicles in cell $(i,j)$ of $G_1$ and $G_2$, respectively, and the value of $\Delta G$ represents the distance between them. For instance, frames $F_1$ and $F_2$ match perfectly if and only if $\Delta G$ equals zero. However, this rarely happens, because even if two frames are almost identical, this symbolization method sometimes puts vehicles with similar positions into different cells. To reduce the errors caused by such problems, we also allowed soft matching. Vehicles in two frames were considered to match if there were the same number of vehicles in the cells at the same positions in the two matrices. In addition, vehicles were also considered to match if there were an equal number of vehicles in nearby cells, at the cost of a penalty. Thus, the final distance between frames $F_1$ and $F_2$ is defined as

$$ d\left({F}_1,{F}_2\right)=\Delta {G}^{\prime}\left({F}_1,{F}_2\right)+\frac{k}{K}, $$
(14.2)

where $\Delta G'$ is the value of $\Delta G$ after soft matching, $k$ is the number of soft matches applied in computing $\Delta G'$, and $K$ is an empirically defined normalization factor that controls the penalty for soft matching.

After that, the distance $d(F_1,F_2)$ was used to calculate the similarity between $F_1$ and $F_2$:

$$ s\left({F}_1,{F}_2\right)=\frac{d\left({F}_1,{F}_2\right)}{n_1+{n}_2}, $$
(14.3)

where $n_1$ and $n_2$ denote the numbers of vehicles in frames $F_1$ and $F_2$, respectively. Frames with a distance below 0.5 from the first frame of a query scene, or from the frames within 2 s before or after it, were selected as key frames for the next step in processing.
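The following sketch implements one plausible reading of Eqs. (14.1)–(14.3): cell-wise counting, a greedy soft match that cancels a surplus vehicle against a deficit in a neighbouring cell, and the normalized distance. The value of K and the greedy matching order are assumptions.

```python
import numpy as np

def frame_distance(G1, G2, K=10.0):
    """Soft-matched distance between two key-frame grid matrices (Eqs. 14.1-14.2).

    G1, G2: integer numpy matrices of vehicle counts per grid cell.
    K: assumed normalization factor for the soft-matching penalty.
    """
    diff = G1.astype(int) - G2.astype(int)       # per-cell surplus / deficit
    rows, cols = diff.shape
    soft = 0
    for i in range(rows):
        for j in range(cols):
            # Cancel a surplus in one frame against a deficit in a
            # neighbouring cell of the other frame (one soft match each).
            while diff[i, j] > 0:
                neighbours = [(i + di, j + dj)
                              for di in (-1, 0, 1) for dj in (-1, 0, 1)
                              if (di or dj)
                              and 0 <= i + di < rows and 0 <= j + dj < cols]
                hit = next(((a, b) for a, b in neighbours if diff[a, b] < 0), None)
                if hit is None:
                    break
                diff[i, j] -= 1
                diff[hit] += 1
                soft += 1
    delta_g_soft = np.abs(diff).sum()            # Delta G' after soft matching
    return delta_g_soft + soft / K               # Eq. (14.2)

def frame_similarity(G1, G2, K=10.0):
    """Normalized distance s(F1, F2) of Eq. (14.3)."""
    n1, n2 = int(G1.sum()), int(G2.sum())
    return frame_distance(G1, G2, K) / max(n1 + n2, 1)   # guard against empty frames
```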

4.2.3 Comparison of Surrounding Vehicle Motion

If surrounding vehicles have nearly the same positions in the first frames of scenes, as well as similar trajectories and velocities, we believe there is a high probability that these are matching scenes. Also, comparing the motion of surrounding vehicles overcomes problems caused by grid division and achieves a faster search than with frame-to-frame matching between scenes.

Assume that scenes $S_1$ and $S_2$ are represented by their vehicle sets (excluding the host vehicle), $V_1 = \{v_1^{(1)}, v_2^{(1)}, \ldots, v_M^{(1)}\}$ and $V_2 = \{v_1^{(2)}, v_2^{(2)}, \ldots, v_N^{(2)}\}$, where $M$ and $N$ are the total numbers of surrounding vehicles observed in $S_1$ and $S_2$. Each surrounding vehicle, $v_i^{(1)}$ or $v_j^{(2)}$, is represented by a sequence of vehicle motion feature vectors, each consisting, at time $t$, of the longitudinal position $y_i$ and lateral position $x_i$ together with their first-order dynamics $\Delta y_i$ and $\Delta x_i$:

$$ {\left({y}_i(t),{x}_i(t),\Delta {y}_i(t),\Delta {x}_i(t)\right)}^{\mathrm{T}}. $$
(14.4)

Dynamic features were calculated by the following equation:

$$ \Delta y_i(t)=\frac{\sum_{l=-L}^{L} l\cdot y_i\left(t+l\right)}{\sum_{l=-L}^{L} l^2}, $$
(14.5)

in which $y_i(t)$ is the $i$th vehicle’s driving signal at time $t$, and $L$ is the window size for linear regression. $\Delta x_i(t)$ was calculated in the same way. The distance between vehicles $v_i^{(1)}$ and $v_j^{(2)}$ in scenes $S_1$ and $S_2$, respectively, was calculated as a Mahalanobis distance:

$$ {d}^2\left({v}_i^{(1)},{v}_j^{(2)}\right)={\left({\mu}_{v_i^{(1)}}-{\mu}_{v_j^{(2)}}\right)}^{\mathrm{T}}{\Sigma}_{v_i^{(1)},{v}_j^{(2)}}^{-1}\left({\mu}_{v_i^{(1)}}-{\mu}_{v_j^{(2)}}\right), $$
(14.6)

where $\mu_v$ represents the four-dimensional mean vector (the means of the longitudinal position, the lateral position, and their first-order dynamics) of a vehicle $v$, and $\Sigma_{v_i^{(1)},v_j^{(2)}}$ is the four-by-four covariance matrix of the four features for vehicles $v_i^{(1)}$ and $v_j^{(2)}$. This calculates the distance between a pair of vehicles by comparing the distributions of their four-dimensional features. Based on our preliminary experiment, a pair of vehicles with a Mahalanobis distance below a threshold of 15.0 was regarded as similar.
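A sketch of the motion features and the vehicle-to-vehicle distance is shown below. Equation (14.5) is implemented as a regression slope over a window of ±L samples; how the covariance matrix in Eq. (14.6) was estimated is not specified in the text, so pooling the two vehicles’ features is our assumption, as are the window size and the data layout.

```python
import numpy as np

def dynamic_feature(signal, t, L=5):
    """First-order dynamics of Eq. (14.5): regression slope over +/- L samples."""
    ls = np.arange(-L, L + 1)
    window = signal[t - L:t + L + 1]
    return np.dot(ls, window) / np.dot(ls, ls)

def motion_features(y, x, L=5):
    """Stack (y, x, dy, dx) feature vectors for every valid time index."""
    return np.array([[y[t], x[t],
                      dynamic_feature(y, t, L), dynamic_feature(x, t, L)]
                     for t in range(L, len(y) - L)])

def vehicle_distance(feats_a, feats_b):
    """Squared Mahalanobis distance of Eq. (14.6) between two vehicles.

    The 4x4 covariance is estimated from the pooled features of both vehicles
    (an assumption) and is assumed to be non-singular.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov = np.cov(np.vstack([feats_a, feats_b]).T)
    diff = mu_a - mu_b
    return float(diff @ np.linalg.inv(cov) @ diff)

# Pairs with vehicle_distance(...) below 15.0 were treated as similar.
```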

To obtain vehicle-to-vehicle matches, we calculated $d(v_i^{(1)}, v_j^{(2)})$ for all $i$ and $j$ between the scenes and selected pairs in order from the smallest to the largest distance. We considered two scenes to be similar to each other if there were enough similar vehicles in both scenes. The similarity $p$ between $S_1$ and $S_2$ was defined as the sum of the weights of the similar vehicles divided by the sum of the weights of all the vehicles in the scenes:

$$ p\left(S_1,S_2\right)=\frac{\sum_{i\in X_1}\sum_{t\in T_i^{(1)}} w_t^{(i)}+\sum_{i\in X_2}\sum_{t\in T_i^{(2)}} w_t^{(i)}}{\sum_{i\in Y_1}\sum_{t\in T_i^{(1)}} w_t^{(i)}+\sum_{i\in Y_2}\sum_{t\in T_i^{(2)}} w_t^{(i)}}, $$
(14.7)

where $X_1$ and $X_2$ denote the sets of similar vehicles, and $Y_1$ and $Y_2$ denote the sets of all vehicles in $S_1$ and $S_2$, respectively. $w_t^{(i)}$ denotes the weight of vehicle $v_i$ at time $t$, and $T_i^{(1)}$ and $T_i^{(2)}$ are the sets of frame numbers in which $v_i^{(1)}$ or $v_i^{(2)}$ was observed in $S_1$ or $S_2$, respectively. Here, the “weight” expresses the importance of a surrounding vehicle and was taken from a modified Gaussian distribution, as illustrated in Fig. 14.6. We used a modified Gaussian distribution stretched towards the front of the vehicle because, generally, a driver is more aware of nearby leading vehicles while driving. For example, the surrounding vehicles in front of the driver’s vehicle are more important than those beside or behind it. It follows that a pair of similar vehicles near the driver’s vehicle makes two scenes more similar than a pair located farther away.

Fig. 14.6

A modified two-dimensional Gaussian distribution, centered on the driver’s vehicle, where surrounding vehicles with higher values denote greater importance
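The sketch below illustrates one way to realize the weighting and the scene-level similarity of Eq. (14.7). The exact shape of the modified Gaussian is not given in the text, so the front-stretched form and all sigma values are assumptions; the matched-vehicle sets are assumed to come from the distance-based pairing described above.

```python
import numpy as np

def vehicle_weight(x, y, sigma_side=5.0, sigma_front=15.0, sigma_rear=7.0):
    """Importance of a surrounding vehicle at lateral offset x and longitudinal
    offset y [m] from the driver's vehicle; the Gaussian is stretched towards
    the front (positive y). All sigma values are assumptions."""
    sigma_y = sigma_front if y >= 0 else sigma_rear
    return np.exp(-0.5 * ((x / sigma_side) ** 2 + (y / sigma_y) ** 2))

def scene_similarity(tracks_1, tracks_2, matched_1, matched_2):
    """Eq. (14.7): summed weights of matched vehicles over summed weights of all vehicles.

    tracks_k: list of per-vehicle sequences of (x, y) positions over the frames
    in which the vehicle was observed in scene k.
    matched_k: indices of the vehicles in scene k that found a similar partner.
    """
    def total(tracks, indices):
        return sum(vehicle_weight(x, y) for i in indices for x, y in tracks[i])

    num = total(tracks_1, matched_1) + total(tracks_2, matched_2)
    den = total(tracks_1, range(len(tracks_1))) + total(tracks_2, range(len(tracks_2)))
    return num / den
```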

4.3 Retrieval Performance Evaluation

The proposed driving scene retrieval system was evaluated using a database containing expressway scenes from 57 drivers (28 males and 29 females), recorded with the instrumented vehicle shown in Fig. 14.1. The database contained approximately 140,000 driving frames, all sampled at 10 Hz. We compared retrieval accuracy and speed for different types of scenes under various retrieval conditions, using subjective scores and by measuring retrieval speed in CPU time. Here, “retrieval conditions” refer to combinations of the following similarity measures presented in Sect. 14.4.2:

  (a) Based on frame category

  (b) Based on surrounding vehicle position

  (c) Based on surrounding vehicle motion

The combinations are represented as a, c, a + c, b + c, and a + b + c. We did not use b or a + b, since b only considered the first frame of a scene and would not be accurate if used alone.

The experiment was conducted as follows:

  • Five driving scenes each, for straight road, curve, traffic jam, and lane change, were randomly selected as queries.

  • For each query scene, we evaluated retrieval accuracy and retrieval speed for each retrieval condition. For each condition, the top five similar scenes were retrieved, and they were used for the evaluation.

4.3.1 Comparison of Retrieval Accuracy Using Subjective Scores

In this comparison, the subjective scores of five volunteers were used to judge which retrieval condition, or combination of retrieval conditions, was able to select scenes with the highest similarity to a query scene for a given driving situation. Each volunteer gave scores, from 1 (lowest) to 5 (highest), to the top five retrieved scenes for each query under each retrieval condition. Scenes with a score of 3 or higher were considered to be similar. A score of 5 indicated perfect similarity, while a score of 1 indicated complete dissimilarity. The retrieval accuracy of a given scene under a given retrieval condition was estimated as the average of the scores from the five volunteers.

The experimental results, shown in Fig. 14.7, indicate that condition a + b + c achieved much higher accuracy than the other conditions in various driving situations.

Fig. 14.7

Comparison of retrieval accuracy

4.3.2 Comparison of Retrieval Speed Using CPU Time

In order to compare processing speed, the proposed driving scene retrieval system was installed on a PC with an Intel Core i5-650 CPU (3.20 GHz) running the Windows 7 operating system. The CPU time for each query was recorded under each retrieval condition, and the average time to retrieve the top five driving scenes was calculated. This was taken to represent the system’s speed under a given retrieval condition for each scene. Figure 14.8 shows the average time taken to retrieve scenes from the 140,000-frame database. On average, retrieval condition a took the least time, and condition a + b + c was the next fastest.

Fig. 14.8

Comparison of retrieval speed

5 Conclusions

In this chapter, we developed two systems for retrieving recorded driving scenes based on measuring the similarity of driving behavior signals and environmental driving signals. In the first study, similar scenes were retrieved using driving behavior signals, which were combined using two methods, early and late integration. Experimental results showed that an average retrieval accuracy of more than 95 % was achieved for driving scenes of stops, starts, and right and left turns. In most situations, the early integration method achieved better performance than the late integration method. In the second study, we used environmental driving signals, with the idea that similar driving scenes could be retrieved by measuring similarity in the surrounding environment. Experimental results showed that the integrated use of information from surrounding vehicles and road conditions achieved higher retrieval accuracy than the use of either type of information alone.

Currently, we are working to integrate these two systems, to see if retrieval accuracy can be further improved.