1 Introduction

The goal of computer vision is to enable machines to interpret the world through the processing of digital signals [1]. Technologies such as motion detection and facial recognition are based on computer vision. Automating video surveillance with the help of computer vision to detect suspicious activity or personnel is an effective way to cover flaws in security. Video surveillance detects moving objects through a sequence of images [2, 3]. ATM surveillance is a sub-domain of video surveillance. ATM crime has become one of the most prominent issues nationwide [4], as ATMs are located in public places and are vulnerable to theft. The usual security measure in an ATM system is CCTV, which is not automated and requires an authority to monitor it. The slow response time of CCTV is a reason for its inefficiency and adds to the vulnerability of the security. An automated surveillance system detects any unusual activity in its frame of view and automatically takes the desired actions [5]. In a recent survey on video surveillance, Cucchiara [6] reported various problems that hinder motion detection in non-ideal conditions. Techniques used for motion analysis in automated systems are based on frameworks such as temporal and spatiotemporal templates, optical flow, background subtraction, silhouettes and histograms [7,8,9,10]. In this paper, we extend [11] by introducing a motion encoding technique called motion identifying image (MII), which incorporates the root-mean-square of thresholded images. We analyze four categories of human actions classified from a single camera view: single, single abnormal, multiple and multiple abnormal. The paper is organized as follows: Sect. 2 reviews recent work; Sect. 3 describes our methodology; Sect. 4 gives results and analysis; and Sect. 5 concludes the paper.

2 Literature Review

Video surveillance has contributed to the enhancement of security and protection in every possible field [12]. There are various ways to detect an activity in computer vision. In this section, we present previous work conducted to improve video surveillance. Several approaches have been presented to recognize human actions. Davis and Bobick [13] used temporal templates, namely the motion history image (MHI) and motion energy image (MEI), for recognizing human activity. Temporal approaches utilize vector images in which each vector points to motion in the image [14]. The directional motion history image (DMHI) is an extension of MHI introduced by Ahad et al. [15, 16]. Poppe [17] presented a detailed overview of human motion analysis using MHI and its variants. Al-Berry et al. [18], motivated by MHI, introduced a stationary wavelet-based action representation, which has been used to classify variant actions. Various descriptors, such as space–time interest points (STIPs), histograms of oriented flow (HOF) and histograms of oriented gradients (HOGs), are used to compute and represent actions. STIP detectors are extensions of 2D interest point detectors that incorporate temporal information. HOG is a window-based descriptor used to compute interest points: the window is divided into a grid of n × n cells, and a frequency histogram is generated from each cell of the grid to show the edge orientations in that cell [19], whereas the HOF descriptor gives information using optical flow [20]. Another descriptor, Hu moments, extracts interest points based on shape, independent of the position, size and orientation of the image [21]; since it is a shape descriptor, it requires comparatively little computation [22,23,24]. The Zernike moments descriptor is another shape descriptor, more efficient than Hu moments [21]. Sanserwal et al. [25] proposed an algorithm that uses the HOG descriptor, Hu moments and the Zernike moments descriptor for activity detection from a single viewpoint. Vikas et al. [26] proposed an approach that makes use of the motion history image and Hu moments to extract features from a video. Rashwan et al. [27] proposed an optical flow model with a new robust data term obtained from histograms of oriented gradients (HOGs) computed between two consecutive frames. However, approaches such as HOG can be highly computational [28]. Huang and Huang [29] use a look-up table along with the integral image method to speed up HOG. Uijlings et al. [30] proposed a framework that increases the efficiency of densely sampled HOG, HOF and MBH (motion boundary histograms) descriptors. Kennedy and Taylor [31] used a method in which optical flow is calculated over triangulated images. In our approach, we use three consecutive frames to encode motion into an image, which is then provided to the gradient-based descriptor HOG. We show that our framework can effectively recognize ATM events.

3 Methodology

The proposed methodology makes use of a computer vision-based framework to detect normal and abnormal activities in indoor premises such as an ATM room. Figure 1 represents the working of our framework: the camera feed, in the form of video, is converted into thresholded images. Our framework consists of two parts, the conversion of the thresholded images into an encoded motion image using MII and the conversion of the encoded image into features using a descriptor. MII involves preprocessing the thresholded images using the root-mean-square formula. The features thus obtained are classified using a random forest classifier. The algorithmic representation of our framework is shown in Fig. 2, and a high-level code sketch is given below.

Fig. 1 Architecture for the proposed method

Fig. 2 Generation of descriptor
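As a high-level illustration, the pipeline of Figs. 1 and 2 can be sketched in Python/OpenCV as follows. The wrapper classify_clip is hypothetical, and the helpers threshold_frame, compute_mii and extract_hog are sketched in the subsections below; none of these names appear in the original framework.

```python
import cv2

def classify_clip(video_path, classifier):
    """Hypothetical end-to-end sketch of the framework in Figs. 1 and 2."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < 3:                      # three consecutive frames
        ok, frame = cap.read()
        if not ok:
            raise IOError("could not read three consecutive frames")
        frames.append(threshold_frame(frame))   # adaptive thresholding, Sect. 3.1
    cap.release()
    mii = compute_mii(*frames)                  # motion identifying image, Eqs. (2)-(8)
    features = extract_hog(mii)                 # HOG features, Sect. 3.2
    return classifier.predict([features])[0]    # random forest prediction
```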

3.1 Preprocessing

In this step, we extract three consecutive frames and convert them into thresholded images using adaptive thresholding. In adaptive thresholding, a different threshold value is calculated for each region of the same image. The threshold values can be calculated using the mean of the neighborhood area or using a weighted sum of neighboring values where the weights are a Gaussian window; a constant is then subtracted from the calculated threshold value. If the value of a pixel is less than its threshold value, it is assigned zero; otherwise, it is assigned the desired maximum value. In our method, we calculated the threshold values for each region using the mean, with a block size (the size of the neighborhood area used to calculate the threshold value) of eleven and a subtracted constant of two. The value of the constant may vary for other sets of videos. Let T(x, y) be a pixel after thresholding, t the threshold value, m the maximum value that can be assigned to the pixel and I(x, y) a pixel of the frame. The equation for adaptive thresholding is given in Eq. (1).

$$ T(x, y) = \begin{cases} m & \text{if } I(x, y) \ge t \\ 0 & \text{otherwise} \end{cases} $$
(1)
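This thresholding maps directly onto OpenCV's adaptiveThreshold. The sketch below is a minimal illustration with m = 255; the helper name threshold_frame is ours, and cv2.ADAPTIVE_THRESH_GAUSSIAN_C could be substituted for the Gaussian-weighted variant mentioned above.

```python
import cv2

def threshold_frame(frame, block_size=11, c=2):
    """Adaptive mean thresholding as in Eq. (1): each pixel is compared against
    the mean of its block_size x block_size neighborhood minus the constant c."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.adaptiveThreshold(gray, 255,                   # m = 255
                                 cv2.ADAPTIVE_THRESH_MEAN_C,  # mean of neighborhood
                                 cv2.THRESH_BINARY, block_size, c)
```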

Further, we compute the square of each pixel in the first, second and third thresholded frames, as shown in Eqs. (2), (3) and (4). We then calculate two values, A and B, by adding the values in Eqs. (2) and (3) and in Eqs. (2) and (4), respectively, as shown in Eqs. (5) and (6). The square roots of A and B are summed and divided by the number of frames, which in our case is three, as depicted in Eq. (7). The square root operation is applied once more to the result C to obtain the motion identifying image R, Eq. (8).

$$ F_1 = \mathrm{IMG}_1^{\,2} $$
(2)
$$ F_2 = \mathrm{IMG}_2^{\,2} $$
(3)
$$ F_3 = \mathrm{IMG}_3^{\,2} $$
(4)
$$ A = F_1 + F_2 $$
(5)
$$ B = F_1 + F_3 $$
(6)
$$ C = \frac{\sqrt{A} + \sqrt{B}}{3} $$
(7)
$$ R = \sqrt{C} $$
(8)
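A minimal NumPy sketch of Eqs. (2)–(8) is given below. The function name compute_mii and the final rescaling to an 8-bit image (so the result can be passed to the HOG descriptor) are our assumptions; the paper itself specifies only the arithmetic.

```python
import cv2
import numpy as np

def compute_mii(img1, img2, img3):
    """Motion identifying image from three thresholded frames, Eqs. (2)-(8)."""
    f1 = img1.astype(np.float64) ** 2        # Eq. (2)
    f2 = img2.astype(np.float64) ** 2        # Eq. (3)
    f3 = img3.astype(np.float64) ** 2        # Eq. (4)
    a = f1 + f2                              # Eq. (5)
    b = f1 + f3                              # Eq. (6)
    c = (np.sqrt(a) + np.sqrt(b)) / 3.0      # Eq. (7)
    r = np.sqrt(c)                           # Eq. (8)
    # Assumed: rescale to 8-bit so the MII can be fed to the HOG descriptor
    return cv2.normalize(r, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```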

Figure 3 shows the complete diagrammatic representation of the preprocessing. After preprocessing, we obtain the motion identifying image, which is then fed to the HOG descriptor for feature extraction.

Fig. 3 Preprocessing

3.2 Descriptor

We have used the histogram of oriented gradients (HOG) to compute features of the motion identifying image. HOG describes the appearance of a local object within an image by the distribution of intensity gradients or edge directions. The input image given to the descriptor is divided into small connected regions called cells, and a histogram of gradient directions is calculated over the pixels within each cell. HOG computes the derivatives of the image M with respect to x and y as shown in Eqs. (9) and (10).

$$ M_x = M * D_X \quad \text{where} \quad D_X = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} $$
(9)
$$ M_y = M * D_Y \quad \text{where} \quad D_Y = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} $$
(10)

Further, we calculate the gradient magnitude and orientation of M in Eqs. (11) and (12).

$$ \left| \mathrm{Gr} \right| = \sqrt{M_x^2 + M_y^2} $$
(11)
$$ \theta = \arctan\left( \frac{M_y}{M_x} \right) $$
(12)
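For illustration, Eqs. (9)–(12) can be computed with a pair of filter2D calls, as sketched below. The helper name gradients is hypothetical, and we use np.arctan2, which handles the M_x = 0 case that a plain arctan quotient would not.

```python
import cv2
import numpy as np

def gradients(m):
    """Gradient magnitude and orientation of image m, Eqs. (9)-(12)."""
    m = m.astype(np.float64)
    dx = np.array([[-1.0, 0.0, 1.0]])        # D_X, Eq. (9)
    dy = np.array([[1.0], [0.0], [-1.0]])    # D_Y, Eq. (10)
    mx = cv2.filter2D(m, -1, dx)             # derivative with respect to x
    my = cv2.filter2D(m, -1, dy)             # derivative with respect to y
    magnitude = np.sqrt(mx ** 2 + my ** 2)   # Eq. (11)
    orientation = np.arctan2(my, mx)         # Eq. (12)
    return magnitude, orientation
```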

Finally, the cell histograms are created and then normalized using L2 normalization, as shown in Eq. (13).

$$ f = \frac{n}{\sqrt{\left\| n \right\|_2^2 + \epsilon}} $$
(13)

Here, n represents the unnormalized vector containing all histograms of the current block, and \( \epsilon \) is a small constant.
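In practice, the whole descriptor can be obtained from OpenCV's HOGDescriptor, as sketched below. Note two assumptions on our part: the MII is resized to OpenCV's default 64 × 128 detection window, and OpenCV's default block normalization is L2-Hys, a clipped variant of the L2 scheme in Eq. (13).

```python
import cv2

def extract_hog(mii):
    """HOG features of the motion identifying image (OpenCV defaults:
    64x128 window, 16x16 blocks, 8x8 cells, 9 orientation bins)."""
    hog = cv2.HOGDescriptor()
    resized = cv2.resize(mii, (64, 128))  # (width, height) of the default window
    return hog.compute(resized).ravel()
```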

For classification, we have used a random forest classifier, which works by creating multiple decision trees during training. In our case, the model was trained using a random forest of 100 trees.
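A minimal training sketch with scikit-learn follows; X, y and X_test are assumed to be the HOG feature matrix and labels prepared from the training and testing videos.

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, as in our setup; X holds one HOG feature vector per video sample
# and y the class labels (single, single abnormal, multiple, multiple abnormal).
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
predictions = clf.predict(X_test)
```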

4 Results and Analysis

The proposed framework has been trained and tested, using Python 3.0 and OpenCV on a computer with an Intel i5 processor clocked at 2.4 GHz and 16 GB of RAM, on videos for calculating the various shape descriptors. The videos analyzed by the presented algorithm have a minimum resolution of 320 × 240 and were recorded in indoor premises such as an ATM room. We have analyzed four categories of captured video, as shown in Fig. 4: (i) single: normal activities performed by a single person in a single camera view; (ii) single abnormal: abnormal activities performed by a single person in a single camera view; (iii) multiple: normal activities performed by multiple persons in a single camera view; (iv) multiple abnormal: abnormal activities performed by multiple persons. There are a total of 49 videos across the four classes (10 single, 10 single abnormal, 20 multiple and 9 multiple abnormal). In India, it is common for multiple people to enter an ATM room together, so we have dedicated the multiple class to this activity. The framework is trained on these videos by extracting features from the image sequences, with different videos used for testing and training. The algorithm is tested with three frames, and its comparison against various other algorithms is shown in Table 1. Table 2 shows the confusion matrix for W, X, Y and Z, the four classes in our dataset.

Fig. 4 Four classes of videos

Table 1 Comparison with other descriptors (in %)
Table 2 Confusion matrix of MII on thresholded images

In Table 1, we give a comparative analysis with two other motion encoding methods, each at its best accuracy, which is obtained when an input of ten frames is given to the descriptor: (a) the motion history image (MHI) and (b) the fusion of the histogram of oriented gradients (HOG) and Zernike moments. In general, the more frames given to the descriptor, the higher the accuracy, as the temporal information increases; yet even with ten frames as input, the other two algorithms score below what we achieve using an MII of only three frames. Hence, from the table it is clear that our MII encoding detects motion better than MHI and the fusion of HOG and Zernike moments.

Figure 5 shows the corresponding ROC curves for all four classes, that is, single, single abnormal, multiple and multiple abnormal, on the testing dataset. In each of the four graphs, the x-axis represents the false-positive rate and the y-axis the true-positive rate.

Fig. 5 ROC curve

5 Conclusion

In this paper, we have proposed an algorithm that makes use of a motion encoding technique called motion identifying image (MII) and the gradient-based descriptor HOG to recognize motion. It can be used to enhance the security of ATM surveillance as well as other similar areas. The algorithm is tested on both normal and abnormal events, with single as well as multiple personnel, that can occur in an ATM room. It can contribute to the security of ATMs, where fraud is increasing tremendously. Our method achieves an accuracy of about 97% when used with three frames. In the future, this motion encoding technique can be combined with other descriptors to obtain higher accuracy, and a more advanced classifier can be used for better recognition.