
1 Introduction

Cities are constantly growing, and with this growth come several problems; one of the major challenges is parking management. This paper presents a visual vehicle localization system as part of a vision of a fully automated parking management solution, in which every operation of the parking process is automated. Based on the vehicle localization information and the available free places, the parking management system generates commands containing speed, direction, and rotation-angle information. A vehicle equipped with a dedicated on-board system designed for this purpose executes the commands and moves through the parking area to its reserved place. As a result, the whole parking process is carried out without human intervention.

Beyond vehicle localization, the information generated by the visual localization system can be used to build useful tools for other purposes, such as guiding people through unfamiliar buildings (airports, museums, office buildings). Moreover, since our system is based on motion detection, data storage for surveillance systems can easily be limited to active areas: the data issued from a camera is stored only when moving objects are detected.

Despite the remarkable progress achieved in image processing, the complexity of the algorithms used (driven by increasing image resolutions and increasingly sophisticated techniques) on the one hand, and the required computational and memory resources on the other, make their implementation in embedded systems a challenging task. This complexity grows further with the constraints imposed by domain-specific standards, e.g., real-time constraints and safety standards.

In this paper, we present our proposal for a visual vehicle localization system (VLS) developed for static cameras. Vehicles are extracted from the video frames by applying background subtraction between the input image and a background reference image, with a dynamic threshold computed by the Otsu method. Finally, the object mask resulting from the segmentation process is used to compute the relative distance to the camera, based on the relation between the size of the vehicle on the camera sensor and its size in real life, which is a function of the camera focal length and of the distance between the vehicle and the camera.

The remainder of this work is organized as follows: Sect. 2 presents the related work; Sect. 3 describes the implemented algorithm and its parts; Sect. 4 presents the implementation board and the synthesis summary; Sect. 5 covers the execution-time characteristics of the implemented algorithm as well as some experimental results; finally, Sect. 6 draws the conclusion.

2 Related Work

Positioning systems are increasingly used in our daily life to perform complex tasks such as navigation (anywhere on the globe and in all weather conditions) or simply for entertainment, as in augmented-reality games. The Global Positioning System (GPS) [8], the most widely used navigation system in the world, receives signals from multiple satellites and employs a triangulation process to determine physical locations with an accuracy error at the meter level, which is generally acceptable for outdoor applications. Unfortunately, this system reaches its limits inside buildings and closed environments because of the attenuation of electromagnetic waves. This limitation, in addition to the low accuracy of traditional positioning systems, has motivated research on different physical signals with the aim of building new positioning systems.

Wireless technologies include ultrasonic [1, 2], infrared [3], and radio-frequency based systems, which may rely on radio-frequency identification [4], the received signal strength of RF signals, Bluetooth, or WiFi. These systems use concepts such as time of arrival, angle of arrival, and received signal strength indication to calculate the distance between the transmitters and the receiver [5,6,7]. The main disadvantage of wireless technologies is that only objects equipped with a receiver can be located.

Non-radio technologies include visual systems, for which different approaches exist. In visual-marker-based systems [9], markers are placed at specific locations, and when a device (e.g., a mobile robot) identifies a marker, it can be localized thanks to a marker database. In map-based visual localization [10], a collection of successive images of an area (building, road, …) is first used to build an image dataset of this environment; any device that wants to localize itself in this area then only needs to take an image of its surroundings, which is compared to the dataset to find its current position. Finally, real-time visual localization systems [17, 18] rely on a camera network to localize generic objects such as vehicles or people; their major disadvantage is their low accuracy (0.37 m in the best case).

3 Proposed Algorithm

The main idea behind the proposed algorithm is that the size of a vehicle in an image is a function of camera characteristics (focal length and sensor size) and of the vehicle's features, namely its dimensions (width and height) and its distance to the camera. This means that for any vehicle of known size in the field of view of a camera, we can estimate its distance to the camera if we can compute its dimensions in pixels in the image. Since we are only interested in moving objects, we first detect such objects, then extract their mask in the image and hence their dimensions, and finally compute the relative distance. The implemented system is described in the following block diagram (Fig. 1), which is composed of two main parts:

Fig. 1. Block diagram of the implemented VLS.

Fig. 2. Object projection on the camera sensor.

  • Detection of vehicles in the scene.

  • Calculation of the relative distance between the vehicle and the camera; this distance is then used to compute the absolute distance.

The first step is performed using the background subtraction method, since our system consists only of static cameras. The second step is carried out by computing the size (in pixels) of the moving vehicle; using some technical information about the vehicle and the camera, we can then deduce the distance between the vehicle and the camera.

3.1 Vehicle Detection

In the literature, several approaches are used for object detection. The most common ones are based on feature descriptors such as the histogram of oriented gradients (HOG) [15, 16]; in this approach the input image is converted to a feature vector, which simplifies the image by extracting useful information and discarding extraneous information.

The computed feature vector is fed to a classification algorithm such as a Support Vector Machine, which, trained on positive samples (images containing the object to be detected) and negative samples (images without it), detects objects with good accuracy. Unfortunately, such object detectors can be slow, especially when expensive features such as HOG need to be computed, which severely degrades performance. Background subtraction presents an alternative, especially for indoor environments where the lighting conditions are approximately constant. Taking into account its minimal implementation cost and its sufficient accuracy, we therefore chose the background subtraction method.

Background Subtraction.

The main idea of this method is to subtract the current image (in which the moving object is present) pixel by pixel from a reference background image, as described in Eq. (1).

$$ O\left( {x,y,t} \right) = \left| {I\left( {x,y,t} \right) - B\left( {x,y} \right)} \right| $$
(1)

Where:

  • \( {\text{O}}\left( {\text{x,y,t}} \right) \) is the subtracted image.

  • \( {\text{I}}\left( {\text{x, y, t}} \right) \) is the object image.

  • \( {\text{B}}\left( {\text{x, y}} \right) \) is the background image.

This technique is designed for static cameras, where the background is approximately the same in all frames, so that by applying Eq. (1) the background is removed from the object image. As a result, the histogram of the subtracted image O(x, y, t) is composed of two main pixel classes: a background class (close to the 0 gray level) and an object class.
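For illustration, a minimal NumPy sketch of Eq. (1) is given below. It is not the hardware implementation, only a software equivalent of the pixel-wise absolute difference.

```python
import numpy as np

def background_subtraction(frame, background):
    """Pixel-wise absolute difference between the current frame and the
    reference background image, as in Eq. (1)."""
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    diff = frame.astype(np.int16) - background.astype(np.int16)
    return np.abs(diff).astype(np.uint8)
```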

Otsu Threshold.

The second step of the background subtraction method is segmentation, in order to obtain the foreground mask. The subtracted image is compared to a global threshold, as presented in Eq. (2).

$$ O\left( {x,y,t} \right) > T $$
(2)

Where:

  • \( {\text{O}}\left( {\text{x,y,t}} \right) \) is the subtracted image.

  • T is the threshold value.

If a pixel O(x, y, t) satisfies Eq. (2), it is considered a foreground pixel; otherwise it is a background pixel. T is a single global threshold for all pixels in the image, and it needs to be a function of time; otherwise the segmentation can easily be affected by changes in environmental conditions. The Otsu method is used to meet this objective. The Otsu algorithm [13] is a popular dynamic thresholding method for image segmentation. It is based on the idea that the image histogram can be divided into two classes, and it looks for the threshold that minimizes the variance within both classes, so that each class is as compact as possible. Only the pixel values are taken into account by the Otsu algorithm; the spatial relationship between pixels has no effect on the result, and different regions with similar pixel values are treated as one region.

In the Otsu method we exhaustively search for the threshold that minimizes the intra-class variance (the variance within the classes), defined as a weighted sum of the variances of the two classes:

$$ \sigma_{w}^{2}(t) = w(t)\,\sigma^{2}(t) + w'(t)\,\sigma'^{2}(t) $$
(3)

Where:

  • \( \sigma_{w}^{2}(t) \) is the intra-class variance

  • w and \( {\text{w}}' \) are the probabilities of the two classes separated by a threshold t

  • \( \sigma^{2}(t) \) and \( \sigma'^{2}(t) \) are the variances of these two classes.

$$ \sigma^{2} = \sigma_{w}^{2}(t) + \sigma_{b}^{2}(t) $$
(4)

Where,

$$ \sigma_{b}^{2}(t) = w(t)\,w'(t)\left[ \mu(t) - \mu'(t) \right]^{2} $$
(5)

Algorithm 1, which is based on Eq. (5), steps through all possible thresholds and keeps the threshold value that maximizes the inter-class variance \( \sigma_{b}^{2} \). The whole system computing the background subtraction together with the Otsu method is described in Fig. 3.

Fig. 3. Block diagram of the moving object detection module.

Fig. 4. System simulation using ISIM.

Algorithm 1. Otsu threshold computation.
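For clarity, a minimal Python sketch of this exhaustive search is given below (a software equivalent of Algorithm 1, not the hardware design itself): it builds the histogram of the subtracted image, evaluates Eq. (5) for every candidate threshold, and applies the selected threshold as in Eq. (2).

```python
import numpy as np

def otsu_threshold(subtracted):
    """Exhaustive search for the threshold maximizing the inter-class
    variance of Eq. (5) (equivalently, minimizing Eq. (3))."""
    hist = np.bincount(subtracted.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_sigma_b = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class probabilities w, w'
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (levels[:t] * prob[:t]).sum() / w0  # class means mu, mu'
        mu1 = (levels[t:] * prob[t:]).sum() / w1
        sigma_b = w0 * w1 * (mu0 - mu1) ** 2      # Eq. (5)
        if sigma_b > best_sigma_b:
            best_t, best_sigma_b = t, sigma_b
    return best_t

def segment(subtracted):
    """Foreground mask: pixels above the dynamic threshold (Eq. (2))."""
    return (subtracted > otsu_threshold(subtracted)).astype(np.uint8) * 255
```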

3.2 Relative Distance Between the Vehicle and the Camera

The projection of a vehicle on the camera sensor depends on three parameters: the vehicle dimensions, the distance to the camera, and the focal length, as explained in Fig. 2. Equation (6) states that the ratio between the vehicle size on the sensor and the focal length equals the ratio between the vehicle size in real life and the distance to the vehicle. In Eq. (7), the vehicle size on the sensor is expressed as the vehicle size in pixels divided by the image size in pixels and multiplied by the physical size of the sensor.

The focal length and the sensor size are technical characteristics of the camera used. The vehicle size in pixels can be extracted from the segmented image (the output of the moving-vehicle detection module). Thus we only need the vehicle size in real life to compute its relative distance and, as a result, the absolute distance using the camera position information.

$$ \tan(o) = \frac{b}{F} = \frac{h}{D} \to D = \frac{F \cdot h}{b} $$
(6)

Where,

  • b is the object size on the sensor.

  • h is the object size in real life.

  • F is the focal length (a technical characteristic of the camera used).

  • D is the distance between the object and the camera.

$$ b = \frac{O \cdot S}{I} $$
(7)

Where,

  • O is the object size in pixels.

  • I is the image size in pixels.

  • S is the physical size of the sensor.

So, the whole equation can be rewritten as:

$$ D = \frac{F \cdot h \cdot I}{O \cdot S} $$
(8)
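The following Python sketch illustrates Eqs. (6)–(8); the numeric values in the usage example (focal length, sensor height, vehicle height) are purely hypothetical.

```python
def relative_distance(object_px, image_px, sensor_mm, focal_mm, object_real_m):
    """Camera-to-vehicle distance from Eq. (8): D = (F * h * I) / (O * S)."""
    object_on_sensor_mm = object_px * sensor_mm / image_px    # Eq. (7): b = O*S/I
    return focal_mm * object_real_m / object_on_sensor_mm     # Eq. (6): D = F*h/b

# Hypothetical usage: a 1.5 m high vehicle spanning 120 of 480 image rows,
# seen through a 4 mm lens on a 3.6 mm high sensor.
d = relative_distance(object_px=120, image_px=480, sensor_mm=3.6,
                      focal_mm=4.0, object_real_m=1.5)
print(f"relative distance = {d:.2f} m")   # -> 6.67 m
```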

4 Implementation Using FPGA Board

Before moving to the hardware implementation, a software implementation was performed on a 650 MHz dual-core Cortex-A9 processor running embedded Linux. Due to its limited execution speed of 3 frames per second (fps), while a minimum of 30 fps is needed for real-time video processing, we turned to hardware acceleration, which is an interesting option for improving execution time.

4.1 Board Description

To take advantage of the parallelism offered by field-programmable gate arrays (FPGAs), we chose to implement our system on an Artix-7 FPGA. The available fast block RAM is used to implement dual-port RAMs, which accelerate the processing and give us a more flexible architecture for our real-time image processing application (Fig. 5 and Table 1).

Fig. 5. RTL view.

Table 1. Implementation cost.

4.2 Simulation and Synthesis

The grayscale input images (foreground and background) of the visual localization module are stored in two dual-port RAMs. The process starts by computing the Otsu threshold of the subtracted image; the segmentation is then performed and the resulting object mask image is stored in a third dual-port memory. The ISE Simulator (ISIM) was used to simulate the implemented system: the testbench (Fig. 4) reads the pixel values of the foreground and background images (the ASCII PGM image format is used because it is simple to manipulate) at each clock tick and saves the system result as an image in the same format. The input images in Fig. 6(a) and (b) used in this example are part of the Background Models Challenge (BMC) [14]; the image resulting from the background subtraction is presented in Fig. 6(c) and the object mask image in Fig. 6(d). The simulation results of the presented example are given in Table 2.
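For reference, the following Python helper (an illustrative offline utility, not part of the hardware design or testbench) shows how a plain ASCII (P2) PGM image of the kind consumed and produced by the testbench can be parsed.

```python
import numpy as np

def read_ascii_pgm(path):
    """Parse a plain (P2, ASCII) PGM image into a 2-D NumPy array;
    lines starting with '#' are comments and are ignored."""
    with open(path) as f:
        tokens = []
        for line in f:
            tokens.extend(line.split('#', 1)[0].split())
    assert tokens[0] == 'P2', 'expected an ASCII PGM (P2) file'
    width, height, _maxval = (int(v) for v in tokens[1:4])
    pixels = np.array(tokens[4:4 + width * height], dtype=np.uint16)
    return pixels.reshape(height, width)
```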

Fig. 6. (a) Background image, (b) foreground image, (c) subtracted image, (d) object mask image.

Table 2. Simulation results.

5 Results and Evaluation

5.1 Temporal Analysis

Imaging devices such as video cameras use frames per second to express the frequency (rate) at which consecutive images, called frames, are displayed. The Phase Alternating Line (PAL) [12] and National Television System Committee (NTSC) [11] standards, the most widely used color encoding systems for analogue television, are used as references in our study. In the NTSC standard, 30 frames are transmitted each second, each frame consisting of 525 individual scan lines; in PAL, 25 frames are transmitted each second, each frame consisting of 625 individual scan lines, whereas a normal motion picture film is played back at 24 frames per second (fps). Thus we consider that a real-time system runs at 30 fps.

This part evaluates the execution time of our system for different image resolutions and clock frequencies. The execution time characteristic of each block is presented in Table 3. The execution time is expressed as a function of the image resolution, width * height (W * H), in clock ticks (Table 5). The details of the Otsu threshold execution time are given in Table 4.

Table 3. Block execution time as a function of resolution, in clock ticks.
Table 4. Otsu block execution time as a function of resolution, in clock ticks.
Table 5. Frames per second for different resolutions at different clock frequencies.

After improving the implementation to take advantage of parallel execution, the background subtraction and histogram computation, the segmentation, and the relative distance computation are grouped into the same block, which corresponds to a saving of 2W * H clock ticks (Table 6). The new simulation values are given in Table 7. As a result, we notice that for a 640 * 480 resolution at the worst-case frequency of 50 MHz we achieve 83 fps, more than double the required frame rate. For a 1280 * 720 resolution a real-time system can be built at frequencies above 56 MHz, and a larger resolution such as 1440 * 1080 can be handled at a 100 MHz frequency.

Table 6. Block execution time as a function of resolution, in clock ticks.
Table 7. Frames per second for different resolutions at different clock frequencies.
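As a rough sanity check of the frame rates in Table 7 (not a reproduction of the exact synthesis figures), the sketch below assumes the optimized design needs approximately 2·W·H clock ticks per frame; the printed estimates (about 81, 30 and 32 fps) are close to the reported values.

```python
def estimated_fps(clock_hz, width, height, cycles_per_pixel=2):
    """Frame-rate estimate: clock frequency divided by the assumed
    per-frame cycle count of roughly cycles_per_pixel * W * H."""
    return clock_hz / (cycles_per_pixel * width * height)

for clk_hz, (w, h) in [(50e6, (640, 480)), (56e6, (1280, 720)), (100e6, (1440, 1080))]:
    print(f"{w}x{h} @ {clk_hz/1e6:.0f} MHz -> {estimated_fps(clk_hz, w, h):.0f} fps")
```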

5.2 Accuracy

In order to measure the real accuracy of the proposed system, a car was placed at different distances from a fixed camera (Fig. 7) inside a small parking area, and the relative distance between the car and the camera was computed in real time for each distance. The real and computed distances are given in Table 8. Compared to similar work [17, 18], as shown in Table 9, the maximal accuracy error of 0.1 m of the proposed system presents an interesting improvement.

Fig. 7. Testing environment.

Table 8. Real and computed distance.
Table 9. Accuracy error compared to similar work.

6 Conclusion and Outlook

This paper presented an FPGA-based indoor real-time visual localization system. The input image is segmented in order to extract the vehicle; this step is achieved using the background subtraction method with a dynamic threshold computed by the Otsu method. The segmented image is then used to compute the relative distance to the camera. The proposed system satisfies the real-time constraint, reaching 32 frames per second for a 1440 * 1080 resolution at a 100 MHz frequency. As a next step, the presented algorithm will be improved and adapted with the aim of proposing a high-accuracy visual navigation system for parking areas. The latter is one of the main parts of our approach to a fully automated parking solution, which will be presented in our future papers.