1 Introduction

Motion detection and object tracking are tasks of great interest in Computer Vision (CV). They are studied, for example, in medical imaging, surveillance [22], and (of more recent interest) driver assistance [2], among many other applications. The aim of this paper is to present a method for motion detection and characterization based on Cellular Automata, targeted at detecting and characterizing moving entities to support collision avoidance from the perspective of the viewer.

In order to pursue this goal, we identified in edge detection, and more specifically in the Sobel operator [19], an algorithm that transforms an image into its edge-based counterpart efficiently and with satisfactory effectiveness. This transformation, which leads to a gray-scale representation, can be easily translated into a cellular automaton configuration [21]. Although edge detection [3, 5, 15, 17] is a very specific computer vision technique, it has some peculiarities that fit well with the cellular automaton approach.

Likewise, intrinsic features of cellular automata make them naturally suited to parallelization [20] and efficient hardware implementation [7]; with the support of ad-hoc devices, they could sustain the development and usage of a real-time system. We will now briefly discuss the works most relevant to this research, then the approach will be introduced. A discussion of the achieved results and of future research directions will end the paper.

2 Related Works

Even though works on motion detection combining the Sobel operator and CA are not present in the literature, Cellular Automata have recently been used for saliency detection [16]: the cited work, employing a stochastic CA approach, has been well received by the CV community, being characterized at the same time by good effectiveness and high efficiency, and it generated interest and further research. Saliency detection with CA was later also investigated in [8], which also characterized it as one of the most relevant steps of the motion detection process. CA approaches had been used earlier for other CV tasks, in particular to perform edge detection [12, 14], to resize images while preserving edges (and therefore image quality) [10], and to segment medical images [18].

3 The Introduced CA Approach

Our approach and the associated work-flow imply several steps in order to process frame-by-frame object movement, as shown in Fig. 1. It involves Cellular Automata (CA), a mathematical idealization of physical systems in which space and time are discrete. A CA consists of a regular uniform lattice where each site hosts a discrete variable called a “cell”. Each cell is in a specific state and changes synchronously depending on the states of its neighbors, according to a local update rule. The neighborhood of a site is typically taken to be the site itself and its immediately adjacent sites.
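To make the update scheme concrete, the following is a minimal illustrative sketch in Python (not the mechanism used in our pipeline, which is described in the next subsections); the `rule` callable is a hypothetical placeholder mapping a 3\(\,\times \,\)3 Moore neighborhood to the next state of its central cell:

```python
import numpy as np

def ca_step(lattice, rule):
    # One synchronous update: every cell changes at the same time,
    # as a function of its 3x3 Moore neighborhood (the cell itself
    # plus its eight immediately adjacent sites).
    padded = np.pad(lattice, 1, mode="edge")  # replicate border cells
    nxt = np.empty_like(lattice)
    for i in range(lattice.shape[0]):
        for j in range(lattice.shape[1]):
            nxt[i, j] = rule(padded[i:i + 3, j:j + 3])
    return nxt
```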

Fig. 1. The overall pipeline of the proposed approach for CA-based motion detection and characterization.

3.1 From a Frame to a Sobel-Filtered Frame

To transform an image into an instance of a CA, every frame of a video is filtered using the Sobel operator. The latter applies two 3\(\,\times \,\)3 kernels to the original image in order to calculate approximations of the derivatives along the horizontal and vertical axes, \(G_x\) and \(G_y\) (see Fig. 2). The gradient magnitude of the edge is then \(G = \sqrt{G_x^2+G_y^2}\). Because of its approximate nature, this filter helps in the discretization of an image. Applying it removes colors and highlights only the edges, in scales of gray; edges are essentially areas where the contrast intensity \(\gamma \in \varGamma \) is strong. Filtering an image with this operator thus provides a new image that will be used to initialize a CA lattice.

The main reason for using the Sobel operator rather than other edge detectors lies in the simplicity of the related algorithm. While other edge detectors (e.g. the Canny edge detector) imply various steps to process the image and obtain its edge-based counterpart, as explained in [10], the Sobel operator requires fewer steps and a much simpler algorithm.
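For illustration, a minimal sketch of this filtering step with the SciPy ndimage routines used in our implementation (see Sect. 4), assuming the input frame has already been converted to a gray-scale NumPy array:

```python
import numpy as np
from scipy import ndimage

def sobel_magnitude(gray_frame):
    # Approximate the horizontal and vertical derivatives with the
    # two 3x3 Sobel kernels of Fig. 2, then combine them into the
    # gradient magnitude G = sqrt(Gx^2 + Gy^2).
    g = gray_frame.astype(float)
    gx = ndimage.sobel(g, axis=1)  # derivative along the x axis
    gy = ndimage.sobel(g, axis=0)  # derivative along the y axis
    return np.hypot(gx, gy)
```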

Fig. 2. (a) Matrix used on the x axis (\(G_x\)); (b) Matrix used on the y axis (\(G_y\)).

3.2 CA Initialization

Due to the intrinsically discrete nature of a CA, the actual set of contrasts \(\varGamma \) produced by the Sobel filter needs to be discretized into clusters. The number of clusters corresponds to the highest value that a cell \(c_i \in C\), where \(i = 1 \ldots |C|\), in a lattice \(L\) can assume; it is determined according to the content of the processed video, with the aim of preserving the possibility to discriminate edges while keeping the processing time limited. Once the contrasts are clustered, there is a finite set of states \(S = \{0,\ldots ,k-1\}\) that every cell can assume.

Therefore, defining a frame \(F^t=\{p^t_0, p^t_1, \ldots , p^t_{(n \cdot m)-1}\}\), where \(n\) is the number of pixels on the x axis and \(m\) the number of pixels on the y axis, as the \(t^{\text {th}}\) frame in a video \(V=\{F^0, F^1, \ldots , F^{\max (t)}\}\), the flattening process follows this method:

$$\begin{aligned} S(c_i^t)= \begin{cases} k-1, &{} \text {if } \min (\gamma _{K^{k-1}})\le \gamma _{p_i^t}\le \max (\gamma _{K^{k-1}})\\ \;\vdots \\ 1, &{} \text {if } 0<\min (\gamma _{K^1})\le \gamma _{p_i^t}\le \max (\gamma _{K^1})\\ 0, &{} \text {otherwise} \end{cases} \end{aligned}$$
(1)

At the end of this process there will be a fully initialized lattice \(L\), associated with a Sobel-filtered video frame, whose cells assume up to \(k\) different states.
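A possible realization of this flattening step is sketched below; for simplicity it clusters the contrast range into \(k\) equal-width bins (the clustering method is left open above, so the binning strategy here is an assumption):

```python
import numpy as np

def initialize_lattice(sobel_frame, k):
    # Discretize the continuous Sobel magnitudes into the finite set
    # of states {0, ..., k-1} of Eq. (1). Cells whose contrast falls
    # below the first bin edge keep state 0 (the "otherwise" case).
    edges = np.linspace(0.0, sobel_frame.max(), k + 1)
    return np.digitize(sobel_frame, edges[1:-1])  # states in 0..k-1
```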

3.3 Frames Comparison

Once the lattices are set, a process of frame comparison starts, in order to elaborate movement within the considered video. To do so, we use two distinct lattices that are contiguous in time, \(L(F^t)\) and \(L(F^{t+1})\): they are overlapped to retrieve the cells that do not match, position by position. As a result, a new lattice \(\varLambda (L(F^t),L(F^{t+1}))\) is produced according to this method:

$$\begin{aligned} S(c_i^{t,t+1})= \begin{cases} 1, &{} \text {if } S(c_i^t)\ne S(c_i^{t+1})\\ 0, &{} \text {otherwise} \end{cases} \end{aligned}$$
(2)

In other words, lattice \(\varLambda (L(F^t), L(F^{t+1}))\) essentially shows the pixels that differ between the two frames, which intuitively represent the focus of the movement detection process. More precisely, this new lattice presents edges that were present at time \(t\) and that changed at time \(t+1\): it therefore includes edge pixels of both time \(t\) and time \(t+1\).

In order to determine more precisely the so-called regions of interest (ROI) of the two distinct frames, this information has to be separated, so that it can then be analyzed to characterize movement. More precisely, each of the lattices \(L(F^t)\) and \(L(F^{t+1})\) is masked with \(\varLambda (L(F^t), L(F^{t+1}))\), keeping only the cells whose state changed between the two frames. This process therefore yields two new lattices, \(ROI(L(F^t))\) and \(ROI(L(F^{t+1}))\), whose cell states are set, respectively, according to:

$$\begin{aligned} S(c_i^{ROI(L(F^t))})=S(c_i^t)\cdot S(c_i^{t,t+1}) \end{aligned}$$
(3)

and

$$\begin{aligned} S(c_i^{ROI(L(F^{t+1}))})=S(c_i^{t+1})\cdot S(c_i^{t,t+1}) \end{aligned}$$
(4)
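In NumPy, Eqs. (2)–(4) translate directly into element-wise operations on the two lattices; a minimal sketch:

```python
import numpy as np

def compare_frames(lattice_t, lattice_t1):
    # Eq. (2): mark cells whose states differ between times t and t+1.
    diff = (lattice_t != lattice_t1).astype(lattice_t.dtype)
    # Eqs. (3) and (4): mask each lattice with the difference, keeping
    # the original states only where a change occurred.
    return diff, lattice_t * diff, lattice_t1 * diff
```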

3.4 Building a Bounding Box Around Salient Objects

Having reached this point of the pipeline, the expected output is two CA configurations showing the salient objects to be evaluated in the motion detection process. To this end, a bounding box is constructed around each ROI, so that we can collect the ROIs' centroids and compute an approximate estimation of the frame-to-frame behavior of the salient object.

The effectiveness of the estimation is evaluated once the collection of the salient objects' centroids is complete: the trajectory of all the bounding boxes shows the approximate behavior of the moving object over the whole video.
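A minimal sketch of this step using the OpenCV routines adopted in our implementation (Sect. 4); the helper below is illustrative and assumes a single moving object per frame (handling multiple disjoint objects would require connected-component analysis, which we leave to future work):

```python
import numpy as np
import cv2

def roi_bounding_box(roi_lattice):
    # Collect the non-zero cells of a ROI lattice, build the bounding
    # box around them, and return it with its centroid; returns None
    # when the lattice contains no changed cells (no motion detected).
    points = cv2.findNonZero((roi_lattice > 0).astype(np.uint8))
    if points is None:
        return None
    x, y, w, h = cv2.boundingRect(points)
    return (x, y, w, h), (x + w / 2.0, y + h / 2.0)
```

Collecting the centroid returned for each pair of consecutive frames then yields the trajectory discussed in the experiments below.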

Fig. 3. (a) Frame 104 of the video; (b) Sobel-filtered frame 104.

4 Experimental Results

To exemplify what has been explained so far, the whole pipeline has been developed in pure Python, using the SciPy (ndimage) library for the Sobel filtering and OpenCV for several video-processing tasks.

4.1 Analyzed Videos and Achieved Results

For evaluating the effectiveness of the approach, we used a video with no camera movement, whose frame resolution is 360\(\,\times \,\)496 pixels; the background is therefore permanently motionless (except for artifacts due to video compression, changes in illumination, etc.). The video shows a cat entering the screen from the right side and moving towards the other end. It must be noted that we have not run benchmarking tests on computational times yet: in this work we mainly focus on the effectiveness of the approach, and its potential in terms of parallelization will be considered in future works.

Fig. 4. (a) Sobel-filtered frame 104; (b) Flattened Sobel-filtered frame given as a CA configuration.

From a Frame to a Sobel-Filtered Frame

Figure 3 shows how the Sobel operator works: given an image as input, it returns the most significant edges of that image, based on their magnitude in terms of contrast.

CA Initialization

In Fig. 4 the Sobel-filtered frame has been flattened, so that it can be better processed in the subsequent frame comparison step. This flattening aims to remove superfluous edges that are not worth further evaluation.

Frames Comparison

In order to better evaluate the difference between frames, Fig. 5 shows two examples of differences obtained through overlapping frames.

Fig. 5. (a) CA configuration of frame 104; (b) CA configuration of frame 106; (c) \(\varLambda (L(F^{104}), L(F^{106}))\); (d) Trajectory of centroids of ROIs (markers identify centroids of bounding boxes of ROIs).

Bounding Box of Regions of Interest and Their Trajectories

In the initial part of the video (frames 1 to 49) there is no motion (the cat has not yet entered the screen) and consequently nothing is detected; from frame 50 to frame 268 the system detects an object moving at a relatively constant speed from the right side of the frame to the left side. Finally, frames 269 to 293 depict only the background, since the cat has exited from the left side of the screen, and the system correctly does not report any movement. Fig. 5d shows the positions of the centroids of the bounding boxes built around the ROIs.

In Fig. 6 we more briefly describe the results of another experiment, in which a video of a ball bouncing across the screen, from the left side to the right side, was analyzed. Figure 6b shows the trajectory of the centroids of the ROIs of this video.

Fig. 6. (a) A frame taken from a video of a ball bouncing on screen (upper right part of the frame); (b) Trajectory of ROIs of the video with a bouncing ball (markers identify centroids of bounding boxes of ROIs); (c) Frame where the left edge of the ball is not completely on screen (the ball is in the lower right part of the frame and is much less visible than in the first frame).

4.2 Discussion of Experiments

The heterogeneous movement of the cat and of its tail produces a continuous although smooth change in the bounding box built around the ROI, making it quite dynamic and unstable. While the movement of the cat was basically homogeneous and predictable, the movement of its tail was fairly unpredictable. This led to a continuous change in the shape of the bounding boxes, and it effectively highlighted different movement directions for the cat and its own tail.

Moreover, the video presents some compression artifacts, leading to slight changes in the colors of pixels in certain frames. On top of that, the method does not yet address object classification, meaning that it does not yet handle the case of multiple objects moving in the same frame. Nevertheless, as can be seen in Fig. 5, only 3 frames out of 219 show a clear discrepancy between the expected bounding box position and the one retrieved by the system: the points around the coordinates (300, 150) are due to noisy pixels in the top left part of the video being recognized as a possible moving object and included in the ROI.

The second test was run on another video, representing a ball on a black background bouncing at a constant rate and moving from left to right at a constant speed. In this case, the object fundamentally does not change from a morphological perspective, although it constantly changes velocity, even with relatively significant displacements within the frame. Results for this scenario are slightly more satisfactory than in the previous experiment: even though the number of frames showing discrepancies between the expected bounding box position and the output one is 5 out of 295, the errors made in the estimated trajectory for those frames are very small (see the points at the borders of Fig. 6b). This is due to the fact that, in those frames (e.g. Fig. 6a), the ball speed is rather high and its edges become blurry, which makes it difficult for the Sobel filter to process the gradient of the ball edges. Therefore only the right edge of the ball is detected, and the bounding box built around it has its centroid slightly shifted along the two axes.

With reference to the results achieved in both experiments, even before moving in the direction of classifying the detected objects, simply considering some physical constraints characterizing the typically observed objects (or the movement capabilities of an autonomous robot on which the camera is positioned) would support the possibility of completely dismissing, or at least significantly reducing, this kind of error. For instance, in [11] the authors analyzed trajectories generated by pedestrians and were able to reject as outliers tracks in which changes of direction were simply too sudden for a walking human; analogous considerations could be made with respect to commonsense reasoning [4] on the morphology of the detected and tracked objects.

5 Future Works

The present paper fundamentally reports the current results of ongoing work investigating a wider research challenge, that is, the possibility of transferring intuitions, approaches and concrete results from the study of insect sensory and motor systems to the area of autonomous robotics, in the vein of [1, 13].

The present results show that CA can represent useful building blocks within a more complex work-flow for the processing of videos, in particular with the aim of detecting and characterizing motion within the analyzed frames. The relationship between the present model and current biological results is still thin; nonetheless, there are results related to the functioning of individual photo-receptors [6], and our conjecture is that CA could be applied to explain the visual processing on the retina, which is basically composed of local interactions between nearby photo-receptor cells at the receptor level and between inter-neurons at higher levels.

With respect to the implementation, given the highly parallelizable nature of CA, we would like to focus our future work on the classification of moving elements in an image, in order to process multiple objects within the CA. Regarding the classification problem, the greatest challenge is to keep the computational complexity low.

An additional work that could serve as inspiration for future implementations is [9], which describes a bio-inspired vehicle collision detection system based on the neural network of a locust. While that work actually uses cameras to process videos, our project aims to do this with a CA lattice abstracting the photo-receptor layer of the locust.