1 Introduction

Human action recognition is one of the main topics in computer vision. Motion boundaries are also exploited in related computer vision tasks, e.g., object segmentation in videos [7, 23], as well as in action recognition. Human activity detection is a difficult task due to variations in body shape, pose, etc., and it becomes even harder with dark backgrounds or moving cameras. Videos containing human actions or activities convey the essential meaning needed to understand a scenario, and motion detection or estimation provides effective information for action recognition. Analyzing a person's activities can help elderly people, people with disabilities, patients and so on. The motivation of this work is to develop a system for patient monitoring.

However, the recognition of actions in unconstrained digital videos has been a challenging problem in computer vision. The prime factor in action recognition is the representation of an action video. Action video representation is based on the following approaches: i) human pose: extracts human information based on physical condition or structure [28], ii) action: captures the whole-body shape or appearance and motion information, iii) local features: extract valid space-time cuboids, and iv) unsupervised feature learning-based methods: learn the representation with hierarchical networks [14].

This paper proposes and combines motion-based feature sets for human activity detection in videos.

  i) Apply the filtering technique and segmentation method to each image frame.

  ii) Evaluate them independently and in combination with the Color, GiST and Histogram of Oriented Gradients (HOG) appearance descriptors that were developed for human detection in videos.

The paper is structured as follows: after reviewing the methodologies in Section 2, we present our novel approach for filtering, segmentation methods, feature extraction, feature reduction and classification in Section 3. We then discuss the datasets used in our experiments and the result evaluation in Section 4. The multimodal egocentric activity dataset is used in this research work.

2 Literature survey

Egocentric images and videos of daily activities recorded by wearable cameras are important for assisting memory recollection in both memory-impaired and unimpaired persons. Since egocentric recordings of daily activities are long and unstructured, the ability to retrieve past egocentric images and videos could support and augment human memory. Current egocentric image and video retrieval methods use manually and automatically labeled images [6, 8] as user queries. However, these approaches let users who need memory support describe what they forgot in their own words, as user queries similar to the past visual experience they want to remember. Furthermore, most previous methods have ignored the valuable information in human-physical world interactions, which are usually associated with our daily activities and visual experience. [4] proposed egocentric vision-based activity recognition methods using gaze location, segmented hand regions and the object detection methods of [18] as features of human activities. These approaches mainly classify egocentric videos using supervised techniques that require manually labeled training data, so the predictable activities are limited. [10] proposed a human activity recognition approach in which recognition is implemented in three steps, i.e., feature extraction, evaluation, and classification, using a Feed-Forward Neural Network (FFNN) classifier. [13] proposed a wearable device based on a sensor called e-Watch and used it to test human activity recognition and location. [24] proposed the HOF and MBH descriptors and a combined descriptor evaluated on the LENA dataset. [16] experimented with a framework to segment and arrange a set of egocentric videos using a convolutional neural network. [29] proposed multiple deep learning pipelines to study the appearance and motion patterns that can predict the activity of the wearer. [27] identified the associations between objects and scenes to enhance object detection based on scene content. [3] recognized objects using a deep convolutional neural network. [5] implemented swarm search in the context of human activity recognition. [12] experimented with active and inactive manners in activity recognition. [21] identified the activities in egocentric videos.

3 Proposed work

Our proposed work finds human activity and is based on the watershed segmentation algorithm and three feature extraction techniques; a genetic algorithm is used for feature reduction, and SVM and Random Forest classifiers are finally used to determine the type of activity based on training and testing. First, the input videos, with a frame rate of 29 frames/s, are converted into frames of size 64*64, and a median filter is then applied to remove noise. The activity categories are divided into two levels, i.e., a top level and a second level. Once the noise is removed, watershed segmentation is applied; the main purpose of image segmentation is to divide the image into meaningful structures. Three feature extraction methods, HOG, Color and GiST, are used in the design to find the best combination along with the segmentation for activity detection. Figure 1 depicts the proposed model.

Fig. 1
figure 1

Block diagram of proposed system

3.1 Filtering

The purpose of filtering is to enhance the quality of an image. Median filtering is a nonlinear operation; [26] introduced the use of the median filter in signal processing. The filter moves over the image pixel by pixel, replacing each value with the median value of the neighboring pixels. The neighborhood pattern is called a "window", which slides, pixel by pixel, over the entire image. The pixel values inside the window are first sorted, the median of these values is computed, and the pixel under consideration is then replaced with this middle value. Figure 2 depicts the median filtering method. A window of size three is used.
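A minimal sketch of this step in Python (the paper's implementation is in MATLAB, so this is illustrative only): the 3*3 window follows the text, while the reflected border handling and the use of SciPy are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def denoise_frame(frame: np.ndarray) -> np.ndarray:
    # Each pixel is replaced by the median of its 3x3 neighborhood;
    # "reflect" border handling is an assumption, not stated in the paper.
    return median_filter(frame, size=3, mode="reflect")

def median_3x3(frame: np.ndarray) -> np.ndarray:
    # Equivalent explicit version mirroring the description: sort the window
    # values and take the middle one.
    padded = np.pad(frame, 1, mode="reflect")
    out = np.empty_like(frame)
    h, w = frame.shape
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 3, j:j + 3].ravel()
            out[i, j] = np.sort(window)[4]  # middle of the 9 sorted values
    return out
```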

Fig. 2
figure 2

An approach to the median filter

3.2 Watershed segmentation

Watershed-based image segmentation algorithms construct a symbolic representation of the image. Watershed algorithms fall mainly into two classes: the first class comprises the flooding-based watershed algorithms, which is the traditional approach, whereas the second class contains the rain-falling-based watershed algorithms. The connected-components-based watershed algorithm [1], which belongs to the rain-falling-based class, provided very good performance compared to all other algorithms: it yields very good segmentation results and requires lower computational complexity. Figure 3 shows the implementation of the watershed algorithm. The algorithm follows the steps below:

  1. Calculation of image gradient value

  2. Algorithm for watershed segmentation procedure

  3. Merging procedures

Fig. 3
figure 3

Block diagram of watershed-based image segmentation

3.2.1 Gradient calculations

The watershed transform is calculated [19] on the image gradient. The gradient is defined by the first partial derivatives of the image. G(x, y) represents the gradient values of the initially segmented image, obtained by approximating the gradient operator in the x and y directions with two 3*3 masks.

$$ \begin{array}{l} U_x(i,j)=(2+4c)^{-1}\left\{u(i+1,j)-u(i-1,j)+c\left[u(i+1,j+1)-u(i-1,j+1)+u(i+1,j-1)-u(i-1,j-1)\right]\right\}\\ U_y(i,j)=(2+4c)^{-1}\left\{u(i,j+1)-u(i,j-1)+c\left[u(i+1,j+1)-u(i+1,j-1)+u(i-1,j+1)-u(i-1,j-1)\right]\right\}\end{array} $$
(1)
$$ G\left(x,y\right)=\sqrt{{\left(\partial f/\partial x\right)}^2+{\left(\partial f/\partial y\right)}^2} $$
(2)

where c = (√2 − 1) / (2 − √2). G(x, y) is calculated as the gradient image, and the gradient values on the border of the input image are taken to be the same as those of its inner pixels.
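A sketch of the gradient computation of Eqs. (1)–(2) in Python, assuming a single-channel grayscale image u; the edge-replication padding implements the statement that border gradients equal those of the inner pixels, and the function and variable names are illustrative.

```python
import numpy as np

def gradient_magnitude(u: np.ndarray) -> np.ndarray:
    """G(x, y) from the weighted central differences of Eqs. (1)-(2)."""
    c = (np.sqrt(2) - 1) / (2 - np.sqrt(2))
    p = np.pad(u.astype(float), 1, mode="edge")  # replicate border pixels
    h, w = u.shape

    def s(di, dj):
        # Shifted view of the padded image corresponding to u(i+di, j+dj).
        return p[1 + di:1 + di + h, 1 + dj:1 + dj + w]

    ux = (s(1, 0) - s(-1, 0)
          + c * (s(1, 1) - s(-1, 1) + s(1, -1) - s(-1, -1))) / (2 + 4 * c)
    uy = (s(0, 1) - s(0, -1)
          + c * (s(1, 1) - s(1, -1) + s(-1, 1) - s(-1, -1))) / (2 + 4 * c)
    return np.sqrt(ux ** 2 + uy ** 2)
```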

3.2.2 Algorithm procedures

The three main stages are indicated in Fig. 3: pre-processing comes first, image segmentation based on the watershed algorithm is the second stage, and post-processing is the last stage. The input images are pre-processed first and then passed to the second stage, watershed-based segmentation. The final segmented image is obtained after the post-processing stage. The pre-processing and post-processing stages are important to overcome the problem of over-segmentation.

Watershed segmentation is applied to the gradient of an image rather than to the image itself, so that regions characterized by small variations in gray levels have small gradient values. The watershed transform finds the high-intensity gradient ridges (watersheds) that divide neighboring local minima (basins). The watershed-line pixels are obtained through a marker image, which contains zero marker values. We scan this image with a 3*3 mask to find the zero values and convert them to the corresponding intensity values of the original image. By comparing these values with those of their neighboring pixels, each such pixel is assigned to one marker region. All zero marker values (watershed pixels) are then removed to obtain a second marker image that represents the markers of the image regions only.
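A library-based sketch of a marker-controlled watershed on the gradient image, using scikit-image rather than the authors' connected-components implementation [1]; the low-gradient threshold used to seed the basin markers is an assumption.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def watershed_segment(gradient: np.ndarray, low_grad: float = 10.0) -> np.ndarray:
    # Connected regions of low gradient act as basin markers.
    markers, _ = ndi.label(gradient < low_grad)
    # Flood from the basins along the gradient; returns a label image.
    return watershed(gradient, markers)
```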

3.2.3 Merging procedures

The number of pixels of each region i in the image (N_i) is calculated by using the marker image, and the mean intensity value (μ_i) of each region i is then found using eq. (3).

$$ {\mu}_i=\frac{\sum \limits_{N\in i}\mathrm{original}\ \mathrm{pixel}\ \mathrm{intensities}\ \mathrm{of}\ \mathrm{region}\ i}{N_i} $$
(3)

The intensity values of region i are read from the original input image, because the pixel positions in the two images are the same. The merging procedure is based on i) merging pairs of adjacent regions and ii) edge strength.
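A sketch of Eq. (3) together with a simplified merging pass: regions are joined when they are adjacent and their mean intensities differ by less than a tolerance. The tolerance-based criterion is an illustrative assumption; the paper additionally uses edge strength, which is omitted here.

```python
import numpy as np

def region_means(original: np.ndarray, labels: np.ndarray) -> dict:
    # Eq. (3): sum of region-i pixel intensities in the original image divided by N_i.
    return {int(i): float(original[labels == i].mean()) for i in np.unique(labels)}

def merge_similar(labels: np.ndarray, means: dict, tol: float = 5.0) -> np.ndarray:
    parent = {int(i): int(i) for i in np.unique(labels)}

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Horizontally and vertically adjacent label pairs.
    pairs = set(zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()))
    pairs |= set(zip(labels[:-1, :].ravel(), labels[1:, :].ravel()))
    for a, b in ((int(a), int(b)) for a, b in pairs if a != b):
        if abs(means[a] - means[b]) < tol:
            parent[find(a)] = find(b)  # merge the two regions
    return np.vectorize(lambda v: find(int(v)))(labels)
```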

3.3 HOG feature extraction

Histogram of Oriented Gradients (HOG) descriptors [2] are feature descriptors that use the distribution of intensity gradients and edge directions. Figure 4 shows the flow chart used to extract the HOG descriptor.

Fig. 4
figure 4

Flowchart of an HOG descriptor

The HOG descriptor is divided into multiple steps:

Computing gradient: We first calculate the gradient values for all the pixels in the image by applying a derivative mask over the image in the horizontal and vertical directions. Common derivative masks include the Sobel and Prewitt operators, but the original algorithm recommends a simple 1D centered derivative mask, [−1, 0, +1].

Orientation binning: Create a histogram of the weighted gradients computed in the previous step. The gradient orientations are divided into bins ranging from 0 to 180 degrees or from 0 to 360 degrees (depending on whether unsigned or signed gradients are used).

Combining cells to form blocks: After computing histograms for each cell, we combine these cells into blocks and form a combined histogram of the block using its constituent cells’ normalized histograms. The final HOG descriptor is a vector of the normalized histograms. Here, 8102 features are extracted [20, 22].

Building the classifier: In the final step of the algorithm, the HOG feature vectors computed in the previous steps are fed into a learning algorithm (here, SVM or Random Forest) to build a model that will later be used to detect objects in images.
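A library-based sketch of HOG extraction with scikit-image; the 9 orientation bins, 8*8 cells and 2*2 blocks are common defaults, not necessarily the configuration that produces the 8102-dimensional descriptor reported above.

```python
from skimage.feature import hog

def hog_features(gray):
    # Gradient computation, orientation binning, block normalization and
    # flattening into one descriptor vector are all handled by skimage.
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")
```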

3.4 Color feature

At present, the volume of color image data is increasing [30]. Because color images carry a larger amount of information, their use is of widespread interest. Data acquired as color images, such as ionosphere, aurora, geomagnetism, biology, ocean, and meteorological data, require effective feature extraction and fine classification. These requirements drive the development of feature extraction methods and algorithms for color images, e.g., for edges, corners, etc.

For the reference image shown in Fig. 5, the average values of the R, G and B channels are H = {134.2338, 101.3403, 87.1001}.

Fig. 5
figure 5

Reference image

The reference image is divided into 16 blocks (4*4). We obtain 64 color values from each block, which describe the color in that block only; in total, there are 1024 values for the full image. The color moment method is used, which treats the color distribution as a probability distribution. Probability distributions are characterized by a number of unique moments, which are given below:

  1) Mean represents the average color value.

  2) Standard deviation is the square root of the variance.

  3) Skewness measures the asymmetry of the color distribution around its mean.

Figure 6 shows 9 moments of the reference image (Fig. 5).

Fig. 6
figure 6

Reference image color moments

The columns correspond to the color channels, and the rows to the moments. The moment values provide the color similarity between images: the sum of the weighted differences between the moments defines the similarity function between image distributions. An integrated color feature approach is used to obtain accurate output from a video; with this method, the features from the various classes are converted into one feature vector, which is used to compare database images with the query image. Here, 9 features are extracted from each frame.
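A sketch of the nine color moments: mean, standard deviation and skewness for each of the three channels (3 channels * 3 moments = 9 features per frame). Using the signed cube root of the third central moment as the skewness value is an assumption.

```python
import numpy as np

def color_moments(rgb: np.ndarray) -> np.ndarray:
    feats = []
    for ch in range(3):
        x = rgb[..., ch].astype(float).ravel()
        mean = x.mean()
        std = x.std()
        skew = np.cbrt(((x - mean) ** 3).mean())  # signed cube root of the 3rd central moment
        feats.extend([mean, std, skew])
    return np.asarray(feats)  # 9 values: [mean_R, std_R, skew_R, ..., skew_B]
```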

3.5 GiST feature

GiST feature extraction [17] is based on a convolution process and a per-block mean calculation. The process uses a filtering algorithm (the Gabor filter) with several orientations and spatial frequencies for the convolution, followed by a mean calculation obtained by splitting the image into several small blocks, i.e., an 8*8 block grid. The convolution is performed in the Fourier domain for computational efficiency, and the result is converted back to the spatial domain for the per-block mean calculation. Figure 7 shows the GiST extraction process. In our research, 512 features are extracted.
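A GiST-style sketch: convolve with a small Gabor filter bank (via FFT-based convolution) and average each response over a grid of blocks. The choice of 8 orientations at one frequency and an 8*8 block grid (8 filters * 64 blocks = 512 values) is one plausible setting that matches the 512 features mentioned above, not the authors' exact filter bank.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gist_features(gray, n_orientations=8, frequency=0.25, grid=8):
    feats = []
    for k in range(n_orientations):
        kernel = np.real(gabor_kernel(frequency, theta=k * np.pi / n_orientations))
        # Fourier-domain convolution, then back to the spatial domain.
        resp = np.abs(fftconvolve(gray.astype(float), kernel, mode="same"))
        h, w = resp.shape
        for by in range(grid):
            for bx in range(grid):
                block = resp[by * h // grid:(by + 1) * h // grid,
                             bx * w // grid:(bx + 1) * w // grid]
                feats.append(block.mean())
    return np.asarray(feats)  # 8 filters x 64 blocks = 512 features
```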

Fig. 7
figure 7

GiST feature-extracting process

3.6 HOG+GiST+Color feature

In this research work, the three feature sets are concatenated to form a single feature vector. Therefore, a total of 8623 features are extracted.
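The fusion itself is a simple concatenation; the vector lengths follow the counts reported above (8102 + 512 + 9 = 8623).

```python
import numpy as np

def combined_features(hog_vec, gist_vec, color_vec):
    # One fused descriptor per frame: [HOG | GiST | Color moments].
    return np.concatenate([hog_vec, gist_vec, color_vec])
```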

3.7 Genetic algorithm

The genetic algorithm (GA) [9] is one prospective option for feature selection. In a typical GA optimizer, an initial population is created with a predetermined number of strings, also called chromosomes; each of these represents an individual, and the set of individuals forms the current generation. A fitness value is associated with every chromosome, and the choice of the fitness function depends entirely on the nature of the problem. A typical GA optimization has three phases: (1) generation of the initial population, (2) reproduction and (3) generation replacement. Each individual in the generation is assigned a fitness value by evaluating the fitness function. In the reproduction step, a new generation is formed from the current generation. In this process, pairs of individuals are chosen to act as parents; the selection may be based on the fitness function. Crossover and mutation are performed on the parent chromosomes, and a population of children is produced. In the crossover process, a node is selected randomly in each pair of parent chromosomes, and the two parts of the chromosomes are exchanged to form two new chromosomes. In the mutation process, a bit is randomly selected in a chromosome and its value is flipped (in a binary GA, a '0' becomes a '1' and vice versa). A new generation is then formed with these children, and the above operations are repeated until the new generation is filled. The generic flowchart of a GA system is shown in Fig. 8. After reduction, the HOG feature set was reduced to 800 features, GiST to 256 and the Color moments to 6.
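A compact sketch of GA-based feature selection under stated assumptions: chromosomes are binary masks over the feature columns, fitness is the cross-validated accuracy of a classifier on the selected columns, and the population size, generation count and mutation rate are illustrative; the paper does not specify these settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Fitness = cross-validated accuracy on the selected feature columns.
    if mask.sum() == 0:
        return 0.0
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

def ga_select(X, y, pop_size=20, generations=10, p_mut=0.01):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))     # initial binary chromosomes
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        pop = pop[np.argsort(scores)[::-1]]               # rank by fitness
        children = [pop[0].copy(), pop[1].copy()]         # elitism: keep the two best
        while len(children) < pop_size:
            a, b = pop[rng.integers(0, pop_size // 2, size=2)]  # parents from the fitter half
            point = rng.integers(1, n_feat)               # single-point crossover
            child = np.concatenate([a[:point], b[point:]])
            flip = rng.random(n_feat) < p_mut             # bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.array(children)
    best = max(pop, key=lambda ind: fitness(ind, X, y))
    return best.astype(bool)                              # mask of retained features
```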

Fig. 8
figure 8

Genetic algorithm flowchart

3.8 Random forest classifier

The Random Forest algorithm [11] can be used for both regression and classification. A Random Forest classifier is a collection of tree predictors, called a forest. It takes the feature vector as input, classifies it with every tree in the forest, and outputs the class label that obtains the maximum number of "votes"; in the case of regression, the response is the average of the responses obtained from all the trees in the forest. All trees are trained with common parameters but on different training sets, which are generated from the original training set by a bootstrap procedure: for each tree, the same number of vectors as in the original set (N) is drawn at random with replacement, so some vectors are absent and some occur more than once. At each node of each trained tree, the best split is searched not over all variables but over a random subset of them; a new subset is drawn at each node, but its size is fixed and is a training parameter, typically set to the square root of the number of variables. No separate accuracy estimation procedure such as bootstrap or cross-validation is needed for a Random Forest, since the error is estimated internally during the training process. The output of the Random Forest is the recognized activity for the 4 top-level categories and the 20 second-level categories.
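A minimal scikit-learn sketch of this stage; X_train, X_test and the label vectors are assumed to hold the reduced feature vectors and activity categories from the previous steps, and the number of trees is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier

# max_features="sqrt" mirrors the square-root rule described above; oob_score
# gives the internal error estimate computed during training.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("out-of-bag error estimate:", 1.0 - rf.oob_score_)
predictions = rf.predict(X_test)
```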

3.9 Support vector machine

Support vector machines are systems associated with learning algorithms for classification. An SVM algorithm builds a model that assigns examples to categories [25]; new data are then mapped into the already learned categories and predicted accordingly. SVMs can perform efficiently as nonlinear classifiers. Here, a multiclass SVM is used.

Figure 9 shows the separation of values by the optimal hyperplane, which is cited from [15]. SVM performance is based on the hyperplane that provides the largest minimum distance to the training values. The output of the SVM is the recognized activity for the 4 top-level categories and the 20 second-level categories.
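A corresponding multiclass SVM sketch; scikit-learn's SVC handles the multiclass case with a one-vs-one scheme. The RBF kernel and the feature scaling are assumptions, since the paper does not state the kernel used.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)          # same reduced feature vectors as for the Random Forest
predictions = svm.predict(X_test)
```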

Fig. 9
figure 9

Support vector machine

4 Experimental details

The proposed method has been implemented in MATLAB 2015a in a Windows environment on a system with an Intel seventh-generation Core i5 processor and 4 GB of RAM. We evaluate the performance and compare different descriptors and segmentation methods. In this research article, the multimodal egocentric activity dataset is used. Here, 1000 samples were taken, and 10-fold cross-validation is used, in which all samples are trained and tested. Four first-level categories and 20 second-level categories are applied, as shown in Fig. 10.

Fig. 10
figure 10

Grouping activity level

4.1 Dataset collection

The dataset used in the proposed approach contains 20 activities recorded in four scenarios. In each scenario, two sets were recorded for each activity. Thus, every activity category has twenty clips. The duration of each recorded clip is approximately 20–30 s. The activity categories are Riding on Elevator Down, Riding on Elevator Up, Riding on Escalator Down, Riding on Escalator Up, Walking, Sitting, Walking Downstairs, Walking Upstairs, Drinking, Eating, Making Phone Calls, Texting, Cycling, Doing Push Up, Doing Sit Up, Running, Organizing Files, Reading, Working on PC, and Writing Sentences. Sample images are shown in Fig. 11.

Fig. 11
figure 11

Sample frames from multimodal egocentric dataset

4.2 Performance metrics used

Sensitivity is defined as the ratio between the number of true positives and the summation of true positives and false negatives. It is given as

$$ \mathrm{Sensitivity}=\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right) $$

Specificity is defined as the ratio between the number of true negatives and the summation of true negatives and false positives. It is given as

$$ \mathrm{Specificity}=\mathrm{TN}/\left(\mathrm{TN}+\mathrm{FP}\right) $$

Accuracy is defined as the average of sensitivity and specificity. It is given as

$$ \mathrm{Accuracy}=\left(\mathrm{Sensitivity}+\mathrm{Specificity}\right)/2 $$
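A sketch of how these metrics can be computed for the multiclass case: each class is treated one-vs-rest against the confusion matrix and the per-class values are averaged. The macro-averaging is an assumption; the paper does not state how the per-class values are aggregated.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    sens, spec = [], []
    for k in range(cm.shape[0]):           # one-vs-rest for every class k
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp
        fp = cm[:, k].sum() - tp
        tn = cm.sum() - tp - fn - fp
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    sensitivity = float(np.mean(sens))
    specificity = float(np.mean(spec))
    accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy
```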

4.3 Evaluation for first-level categories using SVM

Table 1 and Fig. 12 depict the performance metrics of the SVM classifier for various metrics. The table and figure show that HOG+GiST+COLOR provided better results than other features for first-level categories using SVM.

Table 1 Performance metrics of first level categories
Fig. 12
figure 12

Evaluation for first-level categories using SVM

4.4 Evaluation for first-level categories using random forest

Table 2 and Fig. 13 depict the performance metrics of the Random Forest classifier for various metrics. The table and figure show that HOG+GiST+COLOR provided better results than other features for first-level categories using Random Forest.

Table 2 Performance metrics for Random Forest classifier
Fig. 13
figure 13

Evaluation for first-level categories using random forest

4.5 Evaluation for second-level categories using SVM

Table 3 and Fig. 14 depict the performance metrics of the SVM classifier for various metrics. The table and figure show that HOG+GiST+COLOR provided better results than other features for second-level categories using SVM.

Table 3 Performance metrics for SVM classifier
Fig. 14
figure 14

Evaluation for second-level categories using SVM

4.6 Evaluation for second-level categories using Random Forest

Table 4 and Fig. 15 depict the performance metrics of the Random Forest classifier for various metrics. The table and figure show that HOG+GiST+COLOR provided better results than other features for second-level categories using the Random Forest classifier (Figs. 16 and 17).

Table 4 Performance metrics for random forest classifier
Fig. 15
figure 15

Second-level activities comparison between different features using random forest

Fig. 16
figure 16

Second-level activities comparison between different features using random forest

Fig. 17
figure 17

Comparison of accuracy with existing systems

However, combining HOG, GiST and COLOR feature methods shows that the performance of the Random Forest classifier is better than all the other individual methods (Table 5).

Table 5 Comparison with existing system

4.7 Evaluation with existing system

Figure 17 and Table 5 indicate that the proposed system outperforms the other existing systems reported in the literature.

5 Conclusion

We gave a broad overview of the different problems in the domain of egocentric video that have recently been addressed in the computer vision community; the proposed approach can be used as a patient monitoring system. We showed that research can roughly be grouped into three categories: object recognition, activity and action detection, and life-logging video summarization. We analyzed 1000 samples covering a variety of activities, which is still limited in many ways. The activities are categorized into two levels, top level and second level; with this two-level categorization structure, we can examine the performance gap in activity recognition at two different granularities. Combining the HOG, GiST and COLOR feature methods shows that the performance of the Random Forest classifier is better than that of all the individual methods. Random Forest provides better results because it handles thousands of input variables, gives estimates of which variables are important in the classification, generates an internal unbiased estimate of the generalization error as the forest is built, has an effective method for estimating missing data while maintaining accuracy when a large proportion of the data are missing, and has methods for balancing error in class-imbalanced datasets. Therefore, it provided better results than the other classifiers.