1 Introduction

Information technology (hereinafter IT) is now widely used in modern methods of solving crimes. Artificial intelligence is increasingly applied in operational-technical and criminal-procedural crime-solving methods. Systems that analyze the behavior of subjects, including the timely detection of emergency situations, are in growing demand; they are based on algorithms that analyze streaming video data in real time.

Automating processes such as identifying a car entering a parking lot and monitoring its safe presence in a parking space is a necessary aspect of securing a protected area. However, recognizing semantic objects is complicated not only because these objects occupy only a small part of the image, but also because of environmental conditions.

Parking lots now play a major role in the life of a car owner: it is difficult to imagine a well-functioning mechanism for storing vehicles in a modern urban environment without them, and the number of vehicles grows every year [1,2,3].

In parallel, there is the problem of ensuring the safe placement of vehicles on the territory of car parks, in particular:

  • Detection and recognition of the license plate of a car entering the car park [4];

  • Monitoring the presence of people near vehicles [5];

  • Timely detection of damage to vehicles and subsequent notification of security personnel [6].

Based on these problems, the purpose of this study is to develop effective tools for analyzing the video stream from security systems in order to detect negative impacts on vehicles. The scientific novelty consists in improving existing computer vision methods by using ensembles of them, as well as modifications that increase the robustness of the algorithms to affine transformations of the input data. It is planned to modify existing computer vision architectures to improve recognition accuracy and to operate correctly under different video shooting conditions.

Development in the field of robotics originates in the twentieth century. It was then that the main paradigms directly or indirectly related to this subject area were formed, representing the sciences and technologies of post-industrial society: computer science, electrical engineering, specialized branches of mathematics, cybernetics, mechanical engineering, Big Data, artificial intelligence, etc. [7,8,9,10]. Using information technologies to solve applied problems requires significant computing power [11,12,13], and computing power has recently been growing steadily. This increase in the computing resources of automated systems has made it possible to develop robotics in the social sphere. With the adoption of robotic systems of various specializations, an extremely wide field has opened up for end users to implement them in order to improve or optimize business processes within the dynamic information contour of an enterprise or organization [14].

Ensuring the safety of civilian facilities is currently an urgent task. One possible solution is a robotic security system: a group (swarm) of mobile robots. This ensemble of technical means can protect both mobile and stationary objects in guarded areas (within the limits of permissible self-defense) [15].

Using robotic systems to secure civilian facilities entails maintaining continuous public order, since people's lives and the safety of property depend on the performance and reliability of the specialized algorithms running in such systems [16]. This supports the hypothesis that the current trend is to introduce robotic devices into activities where business-process automation is in demand. No civilian facility can do without a proper security system capable of stopping actual threats to public order and ensuring the continuity of its business processes, in order to minimize the cost of possible disruptions; these costs can be both financial and reputational [17]. In the context of vehicle security there are also logistical risks, which can manifest as disruptions to the continuity of logistical business processes [18].

The main hypothesis substantiating the relevance of security robots based on the mathematical apparatus of neural networks and computer vision algorithms is that protecting a civilian facility involves situations related to the human factor (for example, infrequent patrols of the protected territory in 24/7 mode). This formulation leads to the problems of processing intense streams of heterogeneous, semi-structured data, finding patterns in such arrays, and optimizing the use of high-speed algorithms, methods and tools of the Big Data paradigm. Even in an emergency (for example, someone commits an act of vandalism or tries to enter a room to which physical access is prohibited), the primary duty of a security guard, according to the job description, is to immediately report all abnormal or illegal actions to the relevant services or law enforcement agencies, and then act depending on the situation [19,20,21,22,23].

Having determined the criteria corresponding to an emergency at a protected civilian facility, the security robot typically notifies the facility's security service, using, for example, warning templates. Big Data technologies (computer vision and deep learning algorithms) make it possible to automate the analytical processing, management and visualization of results as data arrives from a wide array of video recording devices [24].

From the point of view of the technologies used, this class of tasks is called video analytics. Video analytics combines artificial intelligence and computer vision methods to obtain arrays of heterogeneous real-time data for the subsequent analysis of video images from recording devices, as well as archived recordings or photos. The basis of video analytics software is a set of computer vision algorithms that monitor and analyze the incoming video stream without human intervention. Given the worsening crime situation and the increased requirements for security infrastructure, protection of civilian facilities and access control, video analytics algorithms are promising [25,26,27].

Artificial intelligence in the context of video analytics covers a wide range of applied areas: face recognition (smart unlocking of modern smartphones, passport control through biometric authentication systems, etc.), video stream analysis to determine the behavior of objects, security video surveillance of territories, and so on. This is driven by advances in programming technology, in particular the spread of neural network methods [28,29,30].

In the context of video analytics, a neural network is a machine learning algorithm that forms and trains a mathematical model to solve a classification problem directly from input images, text or sound. Neural networks take raw images as program input and learn from them. Neural network classifiers and deep learning algorithms can perform image classification and recognition accurately and quickly [31,32].

Considering that modern video recording systems capture frames in real time at high resolution (Full HD, 2K, 4K), the volume of such information can reach tens of terabytes in a single week, which requires high-performance computing systems. In turn, the number of possible image variants is practically unbounded once factors such as affine transformations, camera rotations, noise and different object positions are taken into account. These aspects place the tasks solved by video analytics in the sphere of Big Data, supported by systems based on the mathematical apparatus of artificial intelligence and neural networks. Image recognition using high-precision neural networks will allow developers of intelligent systems to move to a new level that replaces traditional video analytics [33,34,35]. A number of problems arise here, such as insufficient computing power, incorrect equipment settings, adverse weather conditions, shooting at dusk, etc.

The work of Rachel Blin et al. [36] is devoted to detecting objects in adverse weather conditions. The authors combine classical learning methods, namely DPM and HOG, with polarization-encoded images. They note that polarimetry combined with deep learning can improve accuracy by about 20–50% on various detection tasks.

Various preprocessing methods, for example edge detection, are used to improve the accuracy of object classification and detection. The classical Roberts, Sobel and Prewitt edge detectors operate on pixels of neighboring areas and obtain a gradient by pattern approximation [37]; they are relatively simple, easy to implement and fast in real time, but sensitive to noise [38]. The authors of [39] noted that applying the Sobel filter to X-ray images for COVID-19 detection improves the performance of a convolutional neural network, and rated this combination as the best of the wide range of options explored.

Other methods besides neural networks are used for image classification and detection. The bag-of-visual-words (BoVW) model is actively used in image classification; studies have shown that BoVW schemes for classifying histopathological images achieved higher accuracy (96.50%) than classical convolutional neural networks [40].

The Viola-Jones algorithm is an object detector used to detect human faces, facial features, cars, etc.; in combination with a support vector machine (SVM), it provides a robust object detection and classification pipeline. The simplicity of the algorithm allows real-time classification, and accuracy for some object classes can reach 99% [41].

2 Materials and methods

2.1 Description of the object of study

A car park is a structure designed for storing vehicles (Fig. 1). The investigated object is equipped with a barrier and a camera that controls authorized entry into the territory (with license plate recognition), plus cameras located around the perimeter for video surveillance (seven cameras), for a total of eight video cameras. Near each parking space there is a container disguised as a dustbin, housing a mobile robot equipped with a video surveillance camera and a secure data carrier; if a negative impact on a vehicle is detected, the robot informs the facility's security service via a secure communication channel. A sign at the entrance warns that a mobile security robot is in operation.

Fig. 1 Scheme of a guarded car park

Data from all cameras is processed by a high-performance system located in the car park's corporate network and reaches the data center via secure communication channels (VPN) through a single entry point. Archived video is stored for 30 calendar days. In case of an emergency, the video recording is archived on a specially designated secure storage medium. The organization of communication with the data processing center is shown in Fig. 2.

Fig. 2 Organization of communication with the data processing center

2.2 Data sets and experimental protocol

For the experiment, a system of eight Hiwatch dome IP cameras connected to an 8-channel video recorder, as well as a swarm of mobile robots, was used. This ensemble of video sources captures the situation in the parking lot in real time. The cameras are equipped with a varifocal lens with a focal length of 2.8–12 mm, which allows the optical zoom to be adjusted at the installation site. The camera protection class is IP66.

Each IP camera runs at 30 frames per second (FPS). Since the camera ensemble operates 24/7, it makes sense to reduce the rate of processed frames from the original video stream: analyzing the full stream would make the designed system, based on computer vision and artificial intelligence algorithms, extremely slow and unsuitable for deployment on mobile robotic systems. Thus, the number of processed frames per second depends on the location of each specific camera. The cameras can also capture data automatically on a schedule, by timer, on specified events, and on alarms.

The camera installed at the entrance to the parking lot, specialized for license plate recognition, was set to a frame rate of up to 10 frames per second, and the perimeter cameras to 3–5 frames per second. The dataset was collected from video recordings made over 6 months in the car park equipped with the video surveillance system described above. Labeling was performed by parking lot employees briefed on the subject of the experiment.

As a result, a sample was obtained consisting of 2000 images (1000 in the training set and 1000 in the validation set), divided into 2 classes—“cars without damage” and “cars with damage” [42]. In addition, registration numbers were extracted from the images [43].

2.3 Data processing tools

The algorithm was developed in Python 3.9.13 with the TensorFlow and Keras libraries. The experimental platform was configured with an Intel Core i7-9700K processor, a GeForce RTX 2070 GPU, and 64 GB of RAM.

2.4 License plate recognition

The Viola-Jones method uses Haar features [44] to classify objects in an image. These features resemble convolution kernels and represent rectangular areas consisting of several adjacent parts. The method uses the AdaBoost algorithm to build a cascade classifier: when forming each new level, AdaBoost selects the most informative features, and construction ends when the specified target quality of the classifier is reached.

The Viola-Jones method was used to recognize license plates in images in real time. Here, the original image V(i, j) is transformed into an integral representation M(x, y), which aggregates the brightness of pixels over rectangular fragments in different areas of the image. Mathematically, the method is described by the following formula:

$$ M\left( x,y \right) = \sum_{i = 0}^{x} \sum_{j = 0}^{y} V\left( i, j \right), $$
(1)

where M(x, y) – integral image represented as a matrix, V(i, j) – intensity of the image pixel at position (i, j), i – pixel index along the image height, j – pixel index along the image width.

This formula describes the formation of matrix elements, each element stores the sum of the intensity of the image pixels, which are then used to apply a cascade classifier and quickly cut off areas of the image that are not of interest.

The process of cutting off non-informative areas of the image can be described by the formula:

$$ V\left( i,j \right) = \begin{cases} \sum\limits_{1 \le s \le i} \sum\limits_{1 \le t \le j} V\left( s,t \right), & 1 \le s \le i \ \text{and} \ 1 \le t \le j, \\ 0, & \text{otherwise}, \end{cases} $$
(2)

where s – index of the summed pixels along the image height, t – index of the summed pixels along the image width.

Note that for the cascade classifier to work correctly, the input image was pre-processed (conversion to grayscale, noise reduction, contour extraction using local binarization, and methods of mathematical morphology) [45,46,47].

At the first preprocessing stage, the input image is converted to grayscale: this removes unnecessary interaction with other color gradations and speeds up the algorithm (see Fig. 3a). The image also needs excess noise removed and corners smoothed; at this stage the intensity of each pixel is set to a weighted average of the neighboring pixels within a given radius, i.e. a Gaussian filter is applied.

Fig. 3 Results of applying the Viola-Jones method for license plate recognition: (a) conversion of the original image to grayscale with a Gaussian filter, (b) adaptive image binarization, (c) detected license plate

Using local image binarization and contour extraction, thresholding can be performed: features are analyzed to compute the value of each individual pixel, and the whole set is compared with a threshold. Here, local binarization is based on a statistical analysis of neighboring pixels (see Fig. 3b).

The result of the Haar cascades with preliminary processing of the input image is shown in Fig. 3c. The Viola-Jones method can serve as additional functionality of the security system, which is an ensemble of neural network and computer vision algorithms: thanks to license plate detection, it is easy to identify which vehicle has been affected.
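
A minimal sketch of this preprocessing and detection pipeline, using OpenCV's bundled Haar cascade for license plates (the cascade file name, input path and detection parameters are illustrative assumptions, not the exact configuration used in the study):

import cv2

# Haar cascade shipped with standard OpenCV builds (assumed available).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_russian_plate_number.xml")

frame = cv2.imread("parking_frame.jpg")  # hypothetical input frame

# Stage 1: grayscale conversion removes redundant color information.
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Stage 2: Gaussian filter suppresses noise and smooths corners (Fig. 3a).
smoothed = cv2.GaussianBlur(gray, (5, 5), 0)

# Stage 3: local (adaptive) binarization based on a statistical analysis
# of neighboring pixels, used for contour extraction (Fig. 3b).
binary = cv2.adaptiveThreshold(
    smoothed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY, 11, 2)

# The cascade classifier runs on the smoothed grayscale image and
# quickly cuts off regions that are not of interest (Fig. 3c).
plates = cascade.detectMultiScale(smoothed, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in plates:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)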

2.5 Methods for vehicle damage recognition

2.5.1 Using the HOG descriptor

The use of this descriptor on images of cars obtained in difficult shooting conditions is described in detail by N. Vasilyev et al. [48]. The HOG method is based on the assumption that the distribution of image intensity gradients accurately characterizes the presence and shape of the objects in the image.

The image is divided into cells, and histograms h_i of the directed gradients of interior points are computed in each cell. They are merged into one histogram h = f(h_1, …, h_k), which is then normalized in brightness. The normalized histogram h_L is computed as:

$$ h_{L} = \frac{h}{\sqrt{\left\| h \right\|_{2}^{2} + \varepsilon^{2}}}, $$
(3)

where \(\left\| h \right\|_{2}\) – the L2 norm used, ε – a small constant.

When calculating gradients, the image is convolved with the kernels [−1, 0, 1] and [−1, 0, 1]^T, yielding two matrices D_x and D_y of derivatives along the x and y axes, respectively. These matrices are used to calculate the angle and magnitude (modulus) of the gradient at each point in the image.
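
A minimal sketch of this gradient computation and of the resulting descriptor (the image path and the cell/block sizes are illustrative assumptions; the scikit-image hog function implements the per-cell histograms and the L2 normalization of Eq. (3)):

import cv2
import numpy as np
from skimage.feature import hog

gray = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Convolve with [-1, 0, 1] and its transpose to obtain Dx and Dy.
kernel = np.array([[-1, 0, 1]], dtype=np.float32)
dx = cv2.filter2D(gray, -1, kernel)
dy = cv2.filter2D(gray, -1, kernel.T)

# Gradient magnitude and angle at every image point.
magnitude, angle = cv2.cartToPolar(dx, dy, angleInDegrees=True)

# Full HOG descriptor: per-cell histograms merged and L2-normalized.
descriptor = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm="L2")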

Fig. 4 shows the result of applying the HOG method to images of the vehicle before (Fig. 4a, b) and after the damage (Fig. 4c, d). Damage to the glass increases the gradient in the corresponding HOG region; in Fig. 4d these appear as bright areas inside the histogram.

Fig. 4 The result of applying the HOG method to images of the car before the damage (a, b) and after it (c, d)

2.6 BOVW image classification

The bag-of-visual-words (BoVW) method is used to improve the performance of descriptors. This approach treats blocks of an image as key parts, with the HOG of each block representing the local information of the corresponding image region. The HOGs of all blocks in the training sample are then grouped into homogeneous clusters using K-means, and the cluster centers are the averages of the block HOGs in each cluster (Fig. 5).
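
A minimal sketch of this vocabulary construction with scikit-learn's K-means (the vocabulary size, file name and the way block HOGs are supplied are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

# HOG vectors of all image blocks in the training sample,
# shape (n_blocks, hog_dim); assumed precomputed as in Sect. 2.5.1.
block_hogs = np.load("train_block_hogs.npy")  # hypothetical file

k = 256  # vocabulary size (assumption)
vocab = KMeans(n_clusters=k, random_state=0).fit(block_hogs)

def bovw_histogram(image_block_hogs):
    """Represent one image as a normalized histogram of visual words."""
    words = vocab.predict(image_block_hogs)
    hist = np.bincount(words, minlength=k).astype(np.float32)
    return hist / (hist.sum() + 1e-7)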

Fig. 5 Visualization of the BoVW method

2.7 Convolutional neural networks

To solve the problem of recognizing a negative impact on a vehicle, we used modifications of convolutional artificial neural networks (ANNs). Convolutional neural networks are robust to variations of the input data [43].

The functioning of the ANN convolutional layer is described by the formula [49,50,51]:

$$ x^{i} = f\left( x^{i - 1} * c^{i} + b^{i} \right), $$
(4)

where \(x^{i - 1}\) – feature map of the previous layer, \(f(\cdot)\) – activation function, \(b^{i}\) – bias parameter (displacement of the neuron), \(c^{i}\) – convolution kernel belonging to layer i.

The functioning of the ANN subsampling layer is described by the formula (here the feature matrix is divided into new matrices of dimension n × n):

$$ x_{j}^{i} = f\left( \sum_{k} x_{k}^{i - 1} * c_{j}^{i} + b_{j}^{i} \right), $$
(5)

where \(x_{j}^{i}\) – feature map j of layer i, \(x_{k}^{i - 1}\) – feature map k of the previous layer, \(f(\cdot)\) – activation function, \(b_{j}^{i}\) – bias parameter for feature map j, \(c_{j}^{i}\) – convolution kernel of feature map j belonging to layer i.

For the purity of the experiment, each studied ANN architecture used the same hyperparameters, tuned immediately before training in parallel with image preprocessing: 50 training epochs, a batch size of 32, the Adam optimizer, and a learning rate of 1 × 10⁻⁵.

Using the "transfer learning" technique, the ANN architectures discussed below were integrated into the Keras library. Then the parameters of the input data were adjusted (224 × 224 pixel images in RGB gradations, with all pixel values normalized in the range from 0 to 1). The input data of each pixel value is normalized in the range from 0 to 1, in order to eliminate the influence of numbers greater than 1 on the learning process of the neural network:

$$ f\left( i, \min, \max \right) = \frac{i - \min}{\max - \min}, $$
(6)

where f – normalization function, i – the value of a particular pixel, min – the smallest pixel value, max – the largest pixel value.

A fully connected ANN layer was also formed, which receives a feature vector as input. It consists of 128 neurons followed by two output neurons; the activation function of the convolutional and subsampling layers is ReLU (rectified linear unit), described by the formula:

$$ f\left( x \right) = \max\left( 0, x \right), $$
(7)

This function cuts off negative values, which simplifies the mathematical calculations and speeds up neural network training. The clipping of negative values is described by the formula:

$$ f\left( x \right) = \begin{cases} 0, & \text{if}\; x \le 0, \\ x, & \text{if}\; x > 0, \end{cases} $$
(8)

The activation function of the fully connected layer is Softmax, which returns the probability that an image belongs to a given class. The function is described by the formula:

$$ f\left( z \right)_{i} = \frac{e^{z_{i}}}{\sum_{k = 1}^{K} e^{z_{k}}}, $$
(9)

where f – activation function, \(z_{i}\) – the i-th element of the input vector of logits, \(z_{k}\) – the k-th element of that vector in the normalizing sum, K – the number of classes.
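
A minimal Keras sketch of this transfer learning setup, using MobileNetV2 as an example base (the generator names and the exact head wiring are assumptions consistent with the description above, not the authors' exact code):

from tensorflow import keras
from tensorflow.keras import layers

# Pretrained convolutional base without its original classifier head.
base = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling="avg")
base.trainable = False  # transfer learning: freeze the pretrained weights

# Custom head: 128 ReLU neurons and two softmax outputs (Eqs. (7)-(9)).
model = keras.Sequential([
    base,
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),
])

# Hyperparameters from Sect. 2.7: Adam optimizer, learning rate 1e-5,
# binary cross-entropy loss for the two-class problem.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_gen, validation_data=val_gen, epochs=50)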

The Keras "ImageDataGenerator" object performs additional data augmentation on a dataset in order to expand the ANN's training and test sets, since the best results (in terms of generalizing to non-standard input) require a high degree of data heterogeneity. Using this object, the following transformations were applied (a configuration sketch follows the list):

  • rotation_range – random rotation of the image (value in degrees);

  • zoom_range – random scaling of the image (floating-point value);

  • width_shift_range – random horizontal shift of the image (floating-point value);

  • height_shift_range – random vertical shift of the image (floating-point value);

  • shear_range – random shear of the image (floating-point value);

  • horizontal_flip – random horizontal mirroring of the image about the vertical axis;

  • fill_mode – strategy for filling pixels exposed by a rotation or a horizontal/vertical shift.
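
A configuration sketch of the generator described above (the numeric augmentation values and the dataset path are illustrative assumptions, not the exact settings of the study):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # pixel normalization to [0, 1], Eq. (6)
    rotation_range=15,       # random rotation, degrees
    zoom_range=0.1,          # random scaling
    width_shift_range=0.1,   # random horizontal shift
    height_shift_range=0.1,  # random vertical shift
    shear_range=0.1,         # random shear
    horizontal_flip=True,    # random horizontal mirroring
    fill_mode="nearest")     # fill pixels exposed by rotations/shifts

train_gen = datagen.flow_from_directory(
    "dataset/train",         # hypothetical path
    target_size=(224, 224), batch_size=32, class_mode="categorical")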

This study focuses on modifying the connections between layers to improve the accuracy of existing ANN architectures. We propose using a modified module for convolutional architectures, the Squeeze-and-Excitation (SE) block, which enhances the generalizing ability of the ANN. The mechanism amplifies the feature map (the output of the convolutional layer), so that the ANN forms a global feature map that more effectively cuts off non-informative features and yields a wider set of informative ones. The input to the module is the feature vector of the convolutional layer, and the module itself consists of two fully connected layers: the first, with ReLU activation, reduces the dimensionality of the feature vector, and the second, with sigmoid activation, recalibrates it back to the original size. The block is integrated using Keras tools and placed before the fully connected ANN layer that solves the classification problem. The structure of the SE block integrated into the architectures below is shown in Fig. 6; its parameters are presented in Table 1, and a Keras sketch follows the table.

Fig. 6 Structure of the modified SE module

Table 1 Settings of the modified SE module
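
A minimal Keras sketch of such an SE block (the reduction ratio is an assumption standing in for the value given in Table 1):

from tensorflow.keras import layers

def se_block(feature_map, reduction=16):
    # Number of channels in the incoming feature map.
    channels = feature_map.shape[-1]
    # Squeeze: aggregate spatial information into a channel descriptor.
    s = layers.GlobalAveragePooling2D()(feature_map)
    # First FC layer (ReLU) reduces the dimensionality of the vector.
    s = layers.Dense(channels // reduction, activation="relu")(s)
    # Second FC layer (sigmoid) recalibrates it back to the original size.
    s = layers.Dense(channels, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, channels))(s)
    # Reweight the original feature map channel-wise.
    return layers.Multiply()([feature_map, s])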

MobileNetV2 [52] was chosen as the first architecture for research (see Fig. 7), since it is optimized for memory consumption and can be deployed on mobile robotic systems.

Fig. 7 Architecture of the ANN “MobileNetV2 + SE”

A feature of this architecture is that a convolutional layer with a 3 × 3 kernel (which produces a feature map) is followed by a 1 × 1 convolutional layer with stride 2 that further analyzes the preceding feature map. The effect is equivalent to analyzing the original image with a single 3 × 3 kernel, but the pairing uses fewer weights in total than a full 3 × 3 filter, which optimizes the network and reduces memory consumption [53].
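
As an illustrative calculation (the channel counts are assumed for the example, not taken from the actual network): if the 3 × 3 stage is a depthwise convolution over 64 channels, as in MobileNetV2's separable convolutions, and the 1 × 1 pointwise stage maps 64 input to 64 output channels, the pair stores \(3 \cdot 3 \cdot 64 + 1 \cdot 1 \cdot 64 \cdot 64 = 4672\) weights, whereas a full 3 × 3 convolution with the same channel counts would store \(3 \cdot 3 \cdot 64 \cdot 64 = 36{,}864\), roughly eight times more.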

The second ANN architecture studied was ResNet50 [54] (see Fig. 8), a 50-layer convolutional ANN consisting of 48 convolutional layers, one max-pooling subsampling layer and one average-pooling subsampling layer. This architecture was created to address the problem of gradients progressively vanishing as the number of ANN layers grows. The solution was to introduce architectures built on residual connections over sequences of convolutional layers: the ANN is subdivided into separate blocks that perform a non-linear signal transformation while skipping certain layers. During training, the objective function is optimized through these shortcut connections between layers; instead of the complete mapping, a residual approximation (the difference of output values) is learned, which solves the vanishing gradient problem [55].
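
A minimal Keras sketch of such a residual connection (a simplified two-layer block, not the exact ResNet50 bottleneck; the filter count is an assumption):

from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Identity skip connection; assumes x already has `filters` channels.
    shortcut = x
    # The block learns only the residual transformation F(x).
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # The input is added back, so the block outputs F(x) + x.
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)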

Fig. 8 Architecture of the ANN “ResNet50 + SE”

The first convolutional layer has a 3 × 3 kernel, and subsequent convolutional layers have 1 × 1 kernels with stride 2; both subsampling layers have 3 × 3 kernels. An average-pooling layer with a 3 × 3 kernel precedes the fully connected layer. The fully connected layer of this ANN was built manually and, as for MobileNetV2, consists of 128 neurons with two output neurons; the activation function of the convolutional and subsampling layers is ReLU, and the activation function of the fully connected layer is Softmax (returning the probability that the image belongs to a given class).

The third ANN architecture studied was DenseNet121 [56] (see Fig. 9), which continues the paradigm of residual connections to combat gradient attenuation in networks with many layers. Its main feature is the concept of "dense blocks": sets of 1 × 1 convolutional layers with stride 2, where the input of each subsequent layer is the concatenation of the feature maps produced by the previous layers. This concept solves the problem of vanishing gradients [57].

Fig. 9 Architecture of the ANN “DenseNet121 + SE”

The first convolutional layer has a 7 × 7 kernel, followed by a max-pooling layer with a 3 × 3 kernel. Three bundles of “dense block, convolutional layer, subsampling layer” follow; after the fourth dense block, a global average pooling layer averages the multidimensional feature map into a 1 × 1 vector. The fully connected layer of this ANN was built manually and, as for MobileNetV2 and ResNet50, consists of 128 neurons with two output neurons; the activation function of the convolutional and subsampling layers is ReLU, and the activation function of the fully connected layer is Softmax (returning the probability that the image belongs to a given class).

2.8 Research results

The MobileNetV2 and MobileNetV2 + SE ANNs were trained over 50 epochs. The recognition accuracy was 85% for the first model and 88% for the second (see Fig. 10). Binary cross-entropy was chosen as the loss function, since there are two classes. The loss of the first model decreased from 2.25 at the start of training to 0.5 at the 50th epoch, and of the second from 0.7 to 0.25 (see Fig. 11).

Fig. 10 Accuracy of the trained MobileNetV2 (left) and MobileNetV2 + SE (right) ANNs depending on the training epoch

Fig. 11 Change in the loss of the MobileNetV2 (left) and MobileNetV2 + SE (right) ANNs depending on the training epoch

In addition to graphical visualization of training results, the "classification_report" method, applied to the output of the "model.predict" method, generates a tabular report on the trained ANN's results on previously unseen data. The report covers each class according to the precision, recall and F1-score metrics (the harmonic mean of precision and recall), aggregated by weighted average (weighted avg) and macro average (macro avg). The sklearn library and its specialized metrics module were used to assess the quality of the trained models. The evaluation results on the test (unseen) dataset are given in Table 2.
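
A minimal sketch of this evaluation (the generator name and class labels are assumptions consistent with Sect. 2.2; the generator is assumed not to shuffle):

import numpy as np
from sklearn.metrics import classification_report

probs = model.predict(val_gen)     # class probabilities from the softmax
y_pred = np.argmax(probs, axis=1)  # predicted class indices
y_true = val_gen.classes           # ground-truth labels of the generator

# Per-class precision, recall and F1-score plus macro and weighted
# averages, as summarized in Tables 2-4.
print(classification_report(
    y_true, y_pred,
    target_names=["cars without damage", "cars with damage"]))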

Table 2 Evaluation of the trained MobileNetV2 and MobileNetV2 + SE ANNs on the test data set; indicators with SE are on the right

The ResNet50 and ResNet50 + SE ANNs were trained over 50 epochs. The recognition accuracy was 90% for the first model and 91% for the second (Fig. 12). Binary cross-entropy was chosen as the loss function, since there are two classes. The loss of the first model decreased from 0.82 at the start of training to 0.27 at the 50th epoch, and of the second from 0.7 to 0.21 (Fig. 13). The evaluation results on the test (unseen) dataset are given in Table 3.

Fig. 12 Accuracy of the trained ResNet50 (left) and ResNet50 + SE (right) ANNs depending on the training epoch

Fig. 13 Change in the loss of the ResNet50 (left) and ResNet50 + SE (right) ANNs depending on the training epoch

Table 3 Evaluation of the trained ResNet50 and ResNet50 + SE ANNs on the test data set

The DenseNet121 and DenseNet121 + SE ANNs were trained over 50 epochs. The recognition accuracy was 86% for the first model and 92% for the second (Fig. 14). Binary cross-entropy was chosen as the loss function, since there are two classes. The loss of the first model decreased from 1.09 at the start of training to 0.36 at the 50th epoch, and of the second from 0.7 to 0.17 (Fig. 15). The evaluation results on the test (unseen) dataset are given in Table 4.

Fig. 14 Accuracy of the trained DenseNet121 (left) and DenseNet121 + SE (right) ANNs depending on the training epoch

Fig. 15 Change in the loss of the DenseNet121 (left) and DenseNet121 + SE (right) ANNs depending on the training epoch

Table 4 Evaluation of the trained DenseNet121 and DenseNet121 + SE ANNs on the test data set

The trained convolutional ANN architectures were tested on unseen data using the OpenCV computer vision library for Python. For testing, all images were resized to 224 × 224, converted to tensors of shape (224, 224, 3), and then all pixel values were divided by 255 so that they lie in the range from 0 to 1. Examples of damage detection on vehicles are shown in Fig. 16.
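
A minimal sketch of this test-time preprocessing (the file name and label strings are illustrative assumptions):

import cv2
import numpy as np

img = cv2.imread("test_car.jpg")            # hypothetical test image
img = cv2.resize(img, (224, 224))           # compress to 224 x 224
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR
tensor = img.astype(np.float32) / 255.0     # scale pixel values to [0, 1]
tensor = np.expand_dims(tensor, axis=0)     # batch of one: (1, 224, 224, 3)

probs = model.predict(tensor)[0]
label = ["no damage", "damage"][int(np.argmax(probs))]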

Fig. 16 Detection of car damage in images

As can be seen from Fig. 16, the presented ANN architectures did a good job of recognizing semantic features such as broken windows and removed wheels. Fig. 17 gives an example of the ANNs detecting scratches on the car body.

Fig. 17 Detection of scratches on the car body

Fig. 18 ANN output when there is no damage to the car

Similarly, the presented ANNs give the corresponding answer when there is no damage to the vehicle (see Fig. 18).

2.9 Checking the performance of the methods in difficult video filming conditions

Difficult video shooting conditions include filming at twilight, in rain, snow, etc. To test the performance of the presented ANN architectures, we used images taken in the dark, as well as images with a blur effect applied (simulating a dirty camera lens, so that the ANN received defective images as input). Such images made up about 20% of the total.

The HOG-BoVW-BPNN method was taken as the reference for assessing ANN performance on images taken in difficult filming conditions. It showed good invariance to the input data, i.e. a good ability to generalize data with varying degrees of affine image transformation; in the context of this assessment, these are the various pixel-intensity distortions caused by input-image defects due to the external environment. The method, however, has a serious drawback: low speed.

For the purity of the experiment, the accuracy of the methods in difficult video recording conditions was assessed by the F1 score, since in practice it is almost impossible to simultaneously maximize precision and recall; the harmonic mean of these two metrics is therefore used.
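
For reference, the F1 score combines precision (P) and recall (R) as their harmonic mean:

$$ F_{1} = 2 \cdot \frac{P \cdot R}{P + R}. $$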

The results of feeding the presented ANN architectures images taken at night are shown in Fig. 19, and the results for blurry images (both due to bad weather conditions and due to contamination of the camera lens) are shown in Fig. 20.

Fig. 19 Results of feeding the ANNs images taken at night

Fig. 20 Results of feeding the ANNs blurry images caused by bad weather conditions or a dirty camera lens

Thus, as can be seen from Figs. 19 and 20, thanks to the ImageDataGenerator functionality the trained ANNs have a good ability to generalize input data, which will only improve as the number of images in the dataset grows; they are able to extract key features even from images with a high degree of affine transformation. Table 5 assesses the accuracy of the methods in difficult video filming conditions by the F1 score.

Table 5 Evaluation of the accuracy of the methods in difficult video filming conditions by the F1 score

As Table 5 shows, the DenseNet121 + SE ANN matched the reference HOG-BoVW-BPNN method on the F1 score in this experiment, an extremely worthy result. Moreover, DenseNet121 + SE is 40% faster than HOG-BoVW-BPNN, which makes it highly promising for computer vision systems specialized in the safe operation of car parks.

3 Discussion

The use of ensembles of computer vision algorithms and the mathematical apparatus of ANNs for pattern recognition has been addressed in earlier research. For example, [58] describes a mathematical model for the rapid and efficient detection of medical masks in public places based on Haar cascades and convolutional ANNs. The authors justified this ensemble by noting that the object of interest can easily be captured by a Haar cascade, since it serves as a unique attribute of the image, while convolutional ANNs were used to detect anomalies (the absence of a medical mask on a face). The work was published in 2021, when COVID-19 had a significant impact on all areas of life, which emphasizes the relevance and practical significance of such ensembles of computer vision algorithms and the mathematical apparatus of ANNs.

In [59], computer vision and deep learning methods were applied to detect objects in a video stream with various kinds of defects (data from video recorders in poor visibility conditions). The images under study came from recorders whose lenses were covered with drops of water or dirt, which introduced defects into the original data. The authors proposed a method for improving detection accuracy in images with a complex background by using the Canny method, which excludes defective areas from analysis by capturing smoothed regions; only areas where the Canny method detects deterministic object contours are analyzed further. To classify the contours extracted by the Canny method, the authors used histograms of oriented gradients (HOG descriptors), the bag-of-words method, and a trained multilayer perceptron (a backpropagation ANN). In binary image classification, this ensemble of machine learning algorithms showed a significant advantage over convolutional ANNs (79% accuracy versus 65%). The authors concluded that combining computer vision methods with the deep learning paradigm is promising for modern research and can significantly improve the quality of object recognition in images.

4 Conclusion

The practically significant problem of classifying vehicle images by semantic features corresponding to both their normal state and various types of damage has been solved. ANN models based on the MobileNetV2, ResNet50 and DenseNet121 architectures, implemented in Python using the Keras machine learning library, are described. Tests of each model showed good detection of the semantic signs of damage caused by vandalism and demonstrated good values of the machine learning quality metrics. The proposed mathematical models offer a new approach to securing objects in protected areas (parking lots) and are suitable for deployment on mobile robotic platforms. The implemented ANN models will be used to control a mobile security robot.