1 Introduction

With the advancement of social media applications, a massive increase has been seen in the usage of digital imaging technologies and applications. Millions of images and videos are now shared every day. In computer vision, labeling the images inside these videos relies on their semantics and is considered a challenging task [7]. In the context of label generation for video frames, the existing efforts of the research community can be categorized into two groups: manual label generation and semi-supervised methods. Manual image annotation methods involve human input to annotate an image based on its semantics: the user enters certain descriptive keywords for each image. Although it provides the best results in terms of accuracy, it is considered a time-consuming and labor-intensive method, which increases the overall cost. Semi-supervised methods, in contrast, use a few labels to train a supervised classifier and then perform a general categorization on this basis, which helps to achieve high efficiency and accuracy. These methods focus on creating and refining annotations by encouraging users to provide feedback while examining the retrieval results. However, this still requires considerable labor in terms of using user interfaces to provide feedback [15]. To assign labels to human actions inside videos, we first need to recognize the human action. Human action recognition approaches are divided into two categories, holistic and part-based. The former is associated with the whole body; the latter with human body parts. Moreover, the part-based approach functions in two phases: 1) feature detection and 2) feature description.

1.1 Motivation

From the literature, it is deduced that the two annotation approaches suffer from different limitations: the manual image annotation method suffers from high cost and time, while the semi-supervised method requires user interfaces to improve feedback [15]. Besides these, neural network based approaches are considered the state of the art for label generation. Although these approaches perform well, they have certain limitations: they require a huge amount of data, high computational power, and careful algorithm design to perform best [5]. This motivates us to stick to conventional methods and develop a methodology that overcomes the issues of huge data requirements and high computational power, and that uses a combination of simple conventional algorithms to generate labels describing an image on the basis of its semantics. In image processing, semantic analysis is achieved through various methods such as descriptive, generative, and cognitive. For this particular work, we use a cognitive method that involves detection, localization, recognition, and understanding of images. These steps play a key role in the semantic knowledge exploration of images [22].

We have developed a methodology based on the combination of the Scale-Invariant Feature Transform (SIFT) descriptor and clustering. The methodology recognizes human actions of different categories such as diving, running, lifting, walking, and many others. Using the proposed methodology, we are able to achieve action categorization by assigning labels to different actions. The proposed methodology starts by extracting frames from the videos; features are then extracted from these frames using the SIFT descriptor. A Vector Space Model (VSM) of the extracted features is created and a clustering technique is then applied to it. For reliable clustering results, experimentation with different variants of clustering techniques is carried out. The effectiveness of the methodology is evaluated through the evaluation method proposed in section 4. To conduct a comparative analysis, a neural network classifier is trained with the proposed technique and an existing technique, and effectiveness is measured in terms of accuracy. In order to meet our research objectives, we have formulated the following research questions.

  • Research Question 1 (RQ-1): What is the effectiveness of proposed methodology in terms of automation of image annotation?

  • Research Question 2 (RQ-2): Are the labels generated by the proposed methodology consistent with expert opinion?

  • Research Question 3 (RQ-3): Does the proposed methodology outperform the existing techniques?

1.2 Research contribution

The research is primarily based on a ‘methodology’: a system of methods used for a particular purpose. The particular purpose of our research is to automate the annotation process. Our main contributions are described below:

  1. A methodology is proposed for the automatic annotation of video frames based on a combination of SIFT features and a clustering method.

  2. An evaluation model based on Silhouette analysis and the Adjusted Rand Index is used to measure the similarity between the labels generated by our proposed method and the labels assigned manually through expert opinion.

  3. A comparative analysis is carried out between the existing techniques and the proposed methodology.

The rest of the paper is organized in the following sections. In section 2, related work is presented. In section 3, the proposed methodology is described. In section 4, an evaluation model is proposed. In section 5, pseudocodes are given to explain the implementation process. In section 6, experimental results are presented. In section 7, implications of the research are discussed. Finally, in section 8, a conclusion is given.

2 Related work

Image annotation is one of the emerging and challenging research areas of computer vision, especially recognizing human actions and assigning labels to them [7]. Over the last two decades, the research community has made significant contributions to it. In this section, we summarize the latest contributions in the context of image annotation, feature extraction, and clustering.

2.1 Manual image annotation

In Tran et al. [25], a metric learning approach is used to recognize human activities. The authors use a small dataset to train the activity recognition classifier, and outliers are removed using a classifier voting method. The proposed approach is effective in generating accurate labels and eliminating noise. However, video sequences are labeled manually, which increases the cost. Kavasidis et al. [9] suggested a video object labeling tool. In addition to detecting an object, the proposed tool tracks it throughout the video. According to the authors, the software tool provides a graphical environment to draw a bounding box around the object and keep track of it. A widely used tool for this purpose is ViPER-GT. It is highly efficient; however, it lacks any automatic or semi-automatic annotation process. The authors proposed a scheme to overcome this problem by providing hotkeys and drag-and-drop options as editing shortcuts. In terms of the time required for annotation, the authors are able to outperform the state-of-the-art model. However, consistent supervision is required for correct labeling of images.

Behavioral psychology plays a vital role in the study of human behavioral semantics. Wagner et al. [29] proposed a tool to speed up the manual behavior-based annotation of human actions. First, a small number of labeled instances are used to predict the local parts; then a database is created using multiple classification techniques. Building on their proposed approach, the authors also introduced an application named NOVA, and show through a series of experiments that labeling efforts can be reduced. Gerum et al. [6] introduce an expandable toolbox, named ClickPoints, for scientific image annotation and analysis, focusing on three areas of image annotation: visualization, annotation, and evaluation. It demonstrates the creation of a toolbox that annotates interesting findings and evaluates them. With simple clicks and drawn masks, a non-expert user can annotate images.

2.2 Semi- supervised method

Semi-supervised learning techniques are highly popular among researchers. Ukita et al. [28] proposed three semi-supervised learning schemes for human pose estimation. Candidate poses are detected by the first scheme; the second scheme detects their true classes, while the third scheme focuses on choosing true-positive poses that differ from the supervised poses. After the three schemes, a clustering method is applied to the estimated poses. In addition, the authors detect outliers using the Dirichlet process and the Bayes factor. The authors test the proposed scheme on a large number of human pose datasets to check its validity. For image classification, Peikari et al. [17] use a semi-supervised approach based on clustering. Cluster analysis is used to identify regions of high density. Since they focus on semi-supervised learning, fewer labeled data points are used to train the classifier. An SVM classifier is used to determine the decision boundary, and a comparison is made with supervised learning.

Recently, Chen et al. [4] introduced a Voxel Deconvolution network-based approach for 3D brain image labeling. The authors address a key limitation of the deconvolution layer, which suffers from a checkerboard artifact problem; the neural network is used as the basis of a semi-supervised annotation method. Zheng et al. [30] proposed MMDF-LDA, an improved multimodal latent Dirichlet allocation model for social image annotation. The authors focus on developing a data fusion model for social image annotation. The model learns topics based on the probability of metadata, and the metadata is used for classifier training. The proposed approach focuses on generating annotations for the geographical region of social images.

Wang et al. [30] work on retrieving the perceptual information present inside videos. The information is based on human action recognition, to which spatio-temporal constraints have made a significant contribution. Based on spatial and temporal information, the authors suggest a multimodal framework for video appearance, motion, and audio data. A convolutional neural network is used to extract features, and a new layer named the fusion layer is added on top of the CNN. The fusion layer is responsible for investigating early and late fusion with the help of the neural network and a support vector machine. The framework is tested on UCF 101 and UCF 101-50, and the experimental results show an accuracy of 85.1%.

Round-the-clock surveillance brings us a huge amount of data, which is useless unless we are able to retrieve information from it. In this regard, Luo et al. [12] study the characteristics of surveillance video and propose an approach based on ConvNets for real-time action recognition. The spatial and temporal streams are extracted using cascade features and motion history. The experimentation is conducted on UCF-ARG and UT-Interaction.

Sheng et al. [13] investigate state-of-the-art deep learning approaches for human action recognition and propose an improved ConvNets architecture. In particular, they use Motion History Images as the motion expression for training the temporal ConvNets, resulting in increased performance.

2.3 Feature extraction

Feature extraction is one of the key steps of many computer vision algorithms, and the area has now developed into a separate research domain. Recently, Peng et al. [18] introduced an efficient probability-based feature extraction method based on Bag of Events. In this approach, each object is represented through a joint probability distribution of events and the corresponding active pixels associated with each event. Considering different levels of occlusion, a novel framework of Feature-based Object Mining and Tagging Algorithm (FOMA) has been proposed; it identifies objects with high accuracy and is tested on three different datasets. Ying et al. [1] use Genetic Programming for automatic global and local feature extraction for image classification and introduce a new method based on the combination GP-GLF. Well-known methods such as HOG, SIFT, and LBP are deployed in the GP function for feature extraction. The main problem with extracting features from video is the change of scale at each frame; video feature extraction requires features that are invariant to scale change.

2.4 Clustering

Clustering is used for multiple tasks such as clustering of faces, objects, and interest points. Sarwas et al. [20] present an approach for clustering face images based on SIFT features. The reported implementation uses F-SIFT-based feature points for hierarchical face clustering. The distinguishing feature of this paper is that it uses a dissimilarity matrix. Nemirovsky et al. [14] present a multi-step algorithm for clustering face images; the algorithm uses the Euclidean distance and pullback to measure the proximity between segments. In [10], content-based image retrieval is performed through segmentation and clustering: the proposed model automatically segments regions through feature extraction and clusters them.

3 Overview of proposed methodology

In this section, we present the proposed methodology employed for image annotation. The image annotation process refers to the generation of labels for actions performed in the images. The generated labels provide semantic analysis and action categorization via a clustering technique. Figure 1 shows an illustration of the flow of our methodology. Each step is discussed as follows.

Fig. 1 Flowchart of proposed methodology

3.1 Image preprocessing

The first stage of the proposed methodology is image preprocessing. Once the video is input, preprocessing steps are performed on it. Image preprocessing is divided into two steps, frame extraction and interest point detection, which are discussed as follows. Figure 2 describes each step of the proposed methodology in detail.

Fig. 2 Overview of proposed methodology

3.1.1 Frames extraction

The first level of preprocessing concerns the extraction of frames from the input video. In our current implementation, we have taken videos of different actions from different datasets. The videos are input into the system, and with the help of the OpenCV library we read each video frame by frame, saving every frame to a folder. The time consumed for the extraction of frames depends upon the length of the video sequence. During the extraction of frames, the size and resolution of the frames are kept homogeneous, and the sequence of the frames is preserved.

3.1.2 Interest point detection

The second level of preprocessing deals with the detection of interest points. These interest points are described as feature vectors. In our current implementation, we focus on the detection of features that are invariant to scale and rotation, so that we can generate accurate labels for the desired actions. For this purpose, we have leveraged the SIFT detector to find potential interest points. The SIFT detector has multiple advantages: besides being invariant to scale and rotation, it detects the most desirable interest points while paying little attention to unwanted background [11].

3.2 Image analysis and clustering

The second stage of the proposed methodology is image analysis and clustering. After image preprocessing we have the extracted frames and the desired interest points, which are then analyzed. The analysis is done in two major steps, creating the VSM and applying clustering to it, as described below.

3.2.1 Creating vector space model

The VSM is created by taking the features of each frame. First, we find the potential interest points with the help of the SIFT descriptor; the outliers are then removed with the help of the Taylor series. The remaining points are the features of each frame. We convert the feature matrix of each frame into a 1-D row vector, and the process continues until all feature matrices are converted into row vectors. These 1-D row vectors are then combined to form a multidimensional matrix known as the VSM. Each row of the VSM represents the features of one frame.
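The VSM construction can be sketched as below. Because the number of SIFT interest points varies from frame to frame, this sketch assumes each frame is truncated or zero-padded to a fixed number of descriptors so that every row vector has the same length; that padding policy is our assumption, not a detail from the paper.

```python
import numpy as np


def build_vsm(per_frame_descriptors, n_points=50):
    """Flatten each frame's descriptor matrix into a 1-D row vector and
    stack the rows into the Vector Space Model (one row per frame)."""
    rows = []
    for desc in per_frame_descriptors:  # desc: (num_keypoints, 128) array
        fixed = np.zeros((n_points, desc.shape[1]))
        k = min(len(desc), n_points)
        fixed[:k] = desc[:k]  # truncate or zero-pad to n_points descriptors
        rows.append(fixed.ravel())  # feature matrix -> 1-D row vector
    return np.vstack(rows)  # the VSM: rows are frames, columns are features
```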

3.2.2 Clustering

After creating the VSM, similar frame features are grouped in one location. For this purpose, the best approach found is clustering. Clustering is used for multiple purposes, such as grouping similar points, features, objects, and faces; we use it to group similar features into one cluster and assign an annotation to it. Multiple variants of clustering with promising results are available. So, for the reliability of the clustering step, we have decided to use four popular clustering techniques: K-means, K-medoid, Fuzzy c-mean, and Agglomerative clustering. Using multiple techniques allows us to achieve the research goal. Since we observe 8 actions in the target videos, we set the number of clusters to k = 8. The employed clustering techniques are as follows.

K-means clustering

K-means clustering is the most popular clustering method used in machine learning. It is also the simplest, so we use it first. Initially, the cluster centers are determined through empirical analysis. The technique uses the squared Euclidean distance metric to find the distance of each feature vector from the centers. Equation 1 is used to measure the distance between feature vectors.

$$ \mathrm{d}\left(\mathrm{y},\mathrm{c}\right)=\sqrt{\sum \limits_{\mathrm{i}=1}^{\mathrm{n}}{\left({\mathrm{y}}_{\mathrm{i}}-{\mathrm{c}}_{\mathrm{i}}\right)}^2} $$
(1)

Where the terms c and y represent the center and the features, respectively.
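A hedged sketch of this step with scikit-learn's `KMeans`, which minimizes the squared Euclidean distance of Eq. 1; the synthetic rows stand in for the flattened SIFT feature vectors of the real VSM.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in VSM: 40 rows drawn around 8 well-separated centres,
# mimicking frames from 8 different actions
rng = np.random.default_rng(0)
vsm = np.vstack([rng.normal(loc=4 * i, scale=0.3, size=(5, 16)) for i in range(8)])

# k = 8 clusters, one per observed action; each frame's cluster id
# then becomes its annotation id
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(vsm)
labels = km.labels_
```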

Fuzzy c-mean

Fuzzy c-mean is a data clustering technique that works differently from K-means. Initially, the centers are set to random values. It groups the features into k clusters in such a way that each feature vector belongs to all clusters up to a certain degree. The degree of belonging (membership) of a feature vector to a certain cluster depends upon its closeness to the center of that cluster: if the feature vector is close to a cluster center, it has a high degree of membership in that cluster, and vice versa [19]. By iteratively updating the centers and the degrees of membership, we find the right centers and assign each feature vector to its right cluster.
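The iterative update just described can be sketched in NumPy; this is a generic fuzzy c-means loop with fuzzifier m = 2, not the authors' exact implementation.

```python
import numpy as np


def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Alternate between updating cluster centres (membership-weighted means)
    and membership degrees (higher for closer centres) for `iters` passes."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)  # memberships of each point sum to 1
    for _ in range(iters):
        w = u ** m
        centers = (w.T @ X) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)  # closeness -> membership
    return centers, u
```

Taking the argmax of each row of the membership matrix `u` gives the hard cluster (and hence annotation) assignment for each frame.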

K-medoid

K-medoid clustering works similarly to K-means, but in contrast, K-medoid selects random feature vectors as the initial centers and uses the Manhattan norm instead of the ℓ2 norm to find the distance between a feature vector and a center. Its advantage over K-means is robustness to noise and outliers, as it minimizes a sum of pairwise dissimilarities instead of squared Euclidean distances. The Manhattan norm is computed using Eq. 2 [16].

$$ {\left\Vert \mathrm{x}\right\Vert}_1={\sum}_{\mathrm{i}=1}^{\mathrm{n}}\left|{\mathrm{x}}_{\mathrm{i}}\right| $$
(2)

Where the term x represents a feature vector.
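A minimal NumPy sketch of K-medoid clustering under the Manhattan distance of Eq. 2; this is a generic PAM-style alternation under our own simplifying assumptions, not the authors' code.

```python
import numpy as np


def k_medoids(X, k, iters=50, seed=0):
    """Pick random feature vectors as initial medoids, then alternate:
    assign points to the nearest medoid under the Manhattan (L1) norm,
    and move each medoid to the member minimising total L1 dissimilarity."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            sub = X[members]
            # total pairwise L1 dissimilarity of each candidate medoid
            pairwise = np.abs(sub[:, None, :] - sub[None, :, :]).sum(axis=2)
            new[j] = members[pairwise.sum(axis=1).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
    return medoids, d.argmin(axis=1)
```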

Agglomerative clustering

Agglomerative clustering is based on the bottom-up approach: each feature vector starts in its own cluster, and the pair of clusters with the minimum distance is merged into one, thus moving up to form a hierarchy. In the proposed methodology, we want to group similar feature vectors in one cluster, so we use single-linkage clustering, which merges clusters based on the minimum distance between them [2]. Equation 3 describes the single-linkage criterion.

$$ \min \left\{\mathrm{d}\left(\mathrm{a},\mathrm{b}\right):\mathrm{a}\in \mathrm{X},\ \mathrm{b}\in \mathrm{Y}\right\} $$
(3)

where X and Y represent two clusters, point a belongs to X, and point b belongs to Y; the merge is based on the minimum such distance.
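This step can be sketched with scikit-learn's `AgglomerativeClustering` using single linkage, matching the minimum-distance criterion of Eq. 3; the two synthetic groups are hypothetical stand-ins for frame features.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (10, 16)),
               rng.normal(5, 0.1, (10, 16))])

# Bottom-up merging: every vector starts as its own cluster, and the pair
# of clusters at minimum (single-linkage) distance is merged repeatedly
agg = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
labels = agg.labels_
```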

3.2.3 Evaluation of generated label

At this stage of the proposed methodology, evaluation is performed. The evaluation is twofold. First, we perform silhouette analysis to assess the quality of the clusters; then we evaluate the generated labels by measuring the similarity between manually generated labels and the labels generated by our methodology using the Adjusted Rand Index (ARI).

3.2.4 Classification

To make a comparison between our proposed technique and an existing technique, a classifier is trained with the annotated frames. The details of this step are described in section 7.

4 Evaluation model for the proposed approach

The evaluation of the proposed model is performed in two steps. The first step deals with the evaluation of the generated clusters; the second step validates the generated labels.

4.1 Evaluation of generated clusters

Since clustering is an unsupervised learning method, it is not reliable to use performance measures of supervised learning, such as precision and recall, to evaluate it. The best way to evaluate generated clusters is to measure their inter- and intra-cluster distances [21]. The distance between two clusters is known as the inter-cluster distance, whereas the intra-cluster distance is the distance between a point and its cluster center. The best way to measure both is through silhouette analysis, so to assess the reliability of the clusters we perform silhouette analysis on each clustering technique. The analysis reports its results in the range [−1, 1]. A value of 1 indicates that features are far from the neighboring cluster center but close to the center of the cluster they are assigned to. A value of −1 indicates that they are close to the neighboring cluster center but far from the center of their assigned cluster. A value of 0 indicates that features lie on the boundary between the two clusters. Mathematically, this can be represented using Eq. 4.

$$ \mathrm{S}\left(\mathrm{i}\right)=\frac{\mathrm{b}\left(\mathrm{i}\right)-\mathrm{a}\left(\mathrm{i}\right)}{\max \left(\mathrm{b}\left(\mathrm{i}\right),\mathrm{a}\left(\mathrm{i}\right)\right)} $$
(4)

Where,

  • a(i) indicates the mean distance of a feature w.r.t. all other features in its assigned cluster (A).

  • b(i) indicates the mean distance of a feature w.r.t. all features in the nearest neighboring cluster (B) [8].

Silhouette analysis is therefore used to measure the effectiveness of the generated clusters. The clusters are generated using the different clustering techniques, K-means, K-medoid, Fuzzy c-mean, and Agglomerative clustering, and silhouette analysis of each technique is performed for appropriate results.
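A sketch of this evaluation with scikit-learn: `silhouette_samples` gives the per-feature S(i) of Eq. 4 and `silhouette_score` its mean; the synthetic data stands in for the VSM rows and their cluster assignments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(4 * i, 0.3, (10, 8)) for i in range(3)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_point = silhouette_samples(X, labels)  # S(i) per feature, in [-1, 1]
mean_score = silhouette_score(X, labels)   # overall cluster stability
```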

4.2 Evaluation of generated labels

To validate the performance of generated labels for unsupervised data, the best way is to measure the similarity between the generated labels and some ground truth. Similarity measures such as the Adjusted Rand Index, V-measure, and the Mutual Information (MI) metric are available for this evaluation. Since we use clustering for label generation, we recommend the ARI measure to evaluate the generated labels. The Rand index measures the similarity between two clusterings; the variant that adjusts for chance grouping of elements is known as the Adjusted Rand Index. The Rand index can be viewed as a counterpart of accuracy that is also applicable when class labels are not used. In our proposed methodology we obtain a set of feature vectors represented as F = {O_1, …, O_n}, and two clusterings of F are drawn: one is considered the assigned labels, represented as A, and the other is the labels generated by the proposed methodology, represented as G. So, we have A = {A_1, …, A_n} and G = {G_1, …, G_n}, and we can define the following quantities.

  • a = number of pairs of elements in F that are in the same cluster in both A and G.

  • b = number of pairs of elements in F that are in different clusters in both A and G.

  • c = number of pairs of elements in F that are in the same cluster in A but in different clusters in G.

  • d = number of pairs of elements in F that are in different clusters in A but in the same cluster in G.

The Rand index can then be calculated using Eq. 5.

$$ \mathrm{R}=\frac{a+b}{a+b+c+d}=\frac{a+b}{\binom{n}{2}} $$
(5)

Where a + b represents the agreement between A and G, and c + d represents the disagreement between A and G. The Adjusted Rand Index is then obtained by correcting R for chance using a permutation model [24].
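A small sketch with scikit-learn's `adjusted_rand_score`; the two label lists are hypothetical expert-assigned (A) and generated (G) labels, chosen to show that the score ignores how the cluster ids themselves are numbered.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical expert labels A and generated labels G for eight frames
A = [0, 0, 1, 1, 2, 2, 2, 3]
G = [1, 1, 0, 0, 3, 3, 3, 2]  # identical partition, permuted cluster ids

perfect = adjusted_rand_score(A, G)  # identical partitions score 1.0
noisy = adjusted_rand_score(A, [1, 1, 0, 2, 3, 3, 3, 2])  # one frame moved
```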

4.3 Dataset description

The experimentation process has been carried out on different datasets. Popular benchmark datasets for action recognition, such as UCF Sports, KTH, Weizmann, and UCF YouTube, are tested with the proposed methodology. Multiple videos of different actions are taken from these datasets and labels are generated for each of them. The details of each dataset are as follows.

4.3.1 Dataset 1: UCF sports

For our first case study, we have taken the UCF Sports dataset and performed our experiments. The best thing about the UCF Sports dataset is that it limits the scope of human actions to sports. We have taken 8 different sports actions: diving, horse-riding, golf swing, running, walking, jumping, lifting, and skateboarding. For each action, multiple videos are taken from the dataset; as there are eight actions, eight clusters are formed and each cluster is assigned the appropriate label [26].

4.3.2 Dataset 2: Weizmann

This dataset is a collection of 90 low-resolution video sequences containing 10 actions performed by 9 different people. For our experiments, we have taken 6 different actions: walk, run, bend, jump, one-hand wave, and skip. The Weizmann dataset has the unique property that we can deal with space-time action shapes. The dataset is selected because of its versatility [23].

4.3.3 Dataset 3: KTH

The KTH dataset contains six types of human actions, walking, jogging, running, boxing, hand waving, and hand clapping, performed by 25 different subjects. The significance of this dataset is that the actions are performed in different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors [3].

4.3.4 Dataset 4: UCF YouTube

UCF YouTube is a very challenging dataset due to the presence of large variations in camera motion, object appearance, poses, cluttered background, etc. However, it has the highest resolution and best video quality among all the above-mentioned datasets. The dataset contains 11 different actions, but for our experiments we have taken 6 of them: tennis swinging, cycling, volleyball spiking, swinging, trampoline jumping, and diving [27]. The descriptive statistics of the benchmark datasets are shown in Table 1.

Table 1 Descriptive statistics of benchmark datasets

5 Pseudocodes for the implementation of the proposed methodology

Pseudocode 1 gives an overall view of the automatic image annotation technique. Pseudocodes 2 to 4 define the functions used in Pseudocode 1. The definitions of the functions EXTRACTFRAMES and ADJUSTEDRANDINDEX are not given, as they are self-explanatory.

5.1 Pseudocode-1: Automatic image annotation

This program assigns an annotation to each video frame based on its semantics. The program is partitioned into functions; each function has its own purpose and performs its functionality when called. Each function returns an output which becomes the input of another function.

figure a

5.2 Pseudocode-2: Interest point detection

This function reads the frames from the source folder and extracts interest points from them by leveraging the SIFT descriptor. The SIFT descriptor first creates the Difference of Gaussians (DoG) map, created by convolving the input image with a series of Gaussian filters. The DoG is used to find the local minima and maxima, which are considered our initial interest points. The Taylor series is applied to them to remove outliers. After undergoing multiple filtration steps, we obtain interest points that are invariant to scale change.

figure b

5.3 Pseudocode-3: VSM model

The vector space model seems to be quite a simple process, but it can be a bit trickier. The features are extracted from the cell array and converted into single row vectors. These row vectors are used to create a multidimensional array known as our VSM.

figure c

5.4 Pseudocode-4: Label generation

The most important part of the experiment, label generation, is primarily based on the clustering technique. Through empirical investigation, the initial cluster centers are found. The VSM, along with the initial centers and the number of clusters (depending upon the number of actions), is given to the clustering technique. Once clustering has been performed successfully, silhouette analysis is used to measure the stability of the clusters, and the ARI is used to measure the effectiveness of the generated labels. After all the evaluation steps, the frames are finally written to disk along with their generated labels.

figure d

6 Results and discussion

In this section, the performance of the proposed technique is evaluated. More precisely, to assess the similarity and the improvement in performance, the generated labels are compared with the assigned labels using the performance measures discussed in section 4. The proposed technique consists of different stages: extraction of frames from video, detection and extraction of interest points (feature vectors), clustering, label generation, and the comparison between assigned and generated labels. The most time-consuming stage is the detection and extraction of interest points; once the interest points are detected, the labels can be generated quickly. The execution time for interest point extraction depends upon the number of frames extracted from the video; with 450 to 500 frames per clip, it takes some time. A variety of factors, such as video resolution, frame size, background complexity, and camera movement, may affect the detection of interest points. To overcome these obstacles, constraints are applied during frame extraction so that all frames have equal size and resolution and follow a particular sequence.

To measure the effectiveness of our proposed methodology, it is tested on four different popular action recognition datasets as described in section 4. A brief description of our experimental results is discussed below in response to the research questions.

6.1 Response to RQ-1

RQ-1 is stated with the purpose of determining whether the methodology we are proposing is effective. The effectiveness of the proposed methodology primarily depends upon the generation of stable clusters. To respond to RQ-1, the effectiveness of the proposed methodology is evaluated using silhouette analysis. Since the proposed methodology is employed via K-means, Fuzzy c-mean, K-medoid, and Agglomerative clustering, the experimental results are discussed accordingly as follows.

6.1.1 Effectiveness of K-mean

The experimental results for K-mean clustering are demonstrated in Fig. 3.

Fig. 3 Silhouette Index for K-means

The experimental results of K-means (shown in Fig. 3) indicate a homogeneous response on the different datasets. Although we have taken different actions from each dataset, the silhouette index value remains greater than 0.8 in all experiments, indicating the stability of the generated labels. The possible reason for such homogeneity is that the difference between frames is measured in the ℓ2 norm: K-means works on the principle of finding the minimum distance among feature vectors using the Euclidean distance, making it an effective clustering technique for the proposed method.

6.1.2 Effectiveness of Fuzzy c-means

The experimental results for Fuzzy c-means clustering are demonstrated in Fig. 4.

Fig. 4

Silhouette Index for Fuzzy c-means

The experimental results of Fuzzy c-means (shown in Fig. 4) indicate consistent performance, with the exception of some outliers. The outliers suggest that Fuzzy c-means has difficulty differentiating consecutive actions, and its performance deteriorates at low resolution. However, Fuzzy c-means is still a preferable method, as it clusters most of the features correctly.
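For reference, a minimal fuzzy c-means can be written directly from its two standard update equations (membership-weighted centers and soft memberships). This is a sketch on synthetic stand-in descriptors, not the study's implementation:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: each point receives a soft membership in
    every cluster instead of a single hard assignment."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)                   # memberships sum to 1
    centers = None
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]  # weighted centers
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)        # standard FCM update
    return u, centers

# Hypothetical frame descriptors: two well-separated actions.
rng = np.random.default_rng(4)
frames = np.vstack([rng.normal(0, 0.3, (50, 8)), rng.normal(5, 0.3, (50, 8))])
u, centers = fuzzy_c_means(frames, c=2)
labels = u.argmax(axis=1)   # hard labels recovered from soft memberships
```

Points near the boundary between two consecutive actions receive memberships close to 0.5 in both clusters, which is exactly where the outliers discussed above come from.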

6.1.3 Effectiveness of K-medoids

The experimental results for K-medoids clustering are demonstrated in Fig. 5.

Fig. 5

Silhouette Index for K-medoids

The results of K-medoids (shown in Fig. 5) indicate performance comparable to K-means, with minor differences. The graphs clearly show that most of the features are clustered correctly. Actions performed individually, such as golf playing, skateboarding and diving, are clustered without any outliers, whereas actions performed in groups, such as run side, kicking and horse riding, contain a few outliers. These outliers have no major impact, as most of the frames are still annotated correctly.
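A minimal K-medoids (alternating assignment and medoid update) can be sketched as follows; unlike K-means, every cluster center is an actual data point. The synthetic descriptors are illustrative assumptions:

```python
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    """Minimal K-medoids: like K-means, but each cluster center (medoid)
    must be an actual data point."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)   # pairwise l2 distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)          # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # the new medoid minimizes the total distance to its cluster members
            new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if (new_medoids == medoids).all():
            break
        medoids = new_medoids
    return labels, medoids

# Hypothetical descriptors for two actions, 40 frames each.
rng = np.random.default_rng(3)
frames = np.vstack([rng.normal(0, 0.3, (40, 16)), rng.normal(5, 0.3, (40, 16))])
labels, medoids = k_medoids(frames, k=2)
```

Because the medoid is a real frame descriptor rather than a mean, K-medoids is less affected by the few outlier frames noted above, which explains its performance being close to K-means.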

6.1.4 Effectiveness of Agglomerative

The experimental results for Agglomerative clustering are demonstrated in Fig. 6.

Fig. 6

Silhouette Index for Agglomerative

The results of Agglomerative clustering are shown in Fig. 6, which indicates the presence of outliers and the sensitivity of Agglomerative clustering to noise.
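This sensitivity can be reproduced on synthetic data: injecting a few scattered noise points into otherwise well-separated features lowers the silhouette index of Agglomerative clustering. The data below is an illustrative assumption, not the paper's features:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two well-separated blobs standing in for two actions' descriptors.
clean = np.vstack([rng.normal(0, 0.3, (50, 16)), rng.normal(5, 0.3, (50, 16))])
noise = rng.uniform(-3, 8, (10, 16))      # scattered noise points
noisy = np.vstack([clean, noise])

scores = {}
for name, X in [("clean", clean), ("noisy", noisy)]:
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    scores[name] = silhouette_score(X, labels)
print({k: round(v, 3) for k, v in scores.items()})
```

The noise points are merged into one of the two clusters, so the silhouette index on the noisy data is visibly below that of the clean data.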

Finally, the main findings of the proposed methodology are:

  • We observed that K-means outperforms the other clustering techniques in terms of silhouette index, which reflects the internal quality of the clusters.

  • Besides K-means, we also observed significant performance from K-medoids, with only a minor difference.

  • We observed that Agglomerative clustering is sensitive to noise in the data.

  • We observed that Fuzzy c-means is sensitive to video quality, with low resolution leading to the existence of outliers.

6.2 Response to RQ-2

To assess whether the labels generated by the proposed methodology are correct, some ground truth is needed. Since the video frames being annotated are unsupervised data, we took a set of frames and used expert opinion to annotate them. Using the adjusted Rand index, the labels generated by the proposed methodology can be compared with the expert-assigned labels; if the two are in line with each other, the generated labels are correct. The experimental results are shown in Fig. 7.

Fig. 7

Comparison of Adjusted Rand Index
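The comparison works because the ARI is invariant to cluster naming: the generated cluster ids need not match the expert's keywords, only the grouping of frames must agree. A small sketch with made-up labels (the action names and cluster ids are hypothetical):

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical expert annotations for ten frames vs. labels from the pipeline.
expert    = ["dive", "dive", "dive", "golf", "golf",
             "golf", "kick", "kick", "kick", "kick"]
generated = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]   # cluster ids; names need not match

# ARI = 1 for identical partitions, ~0 for random agreement.
ari = adjusted_rand_score(expert, generated)
print(round(ari, 3))
```

Here one "golf" frame fell into the "kick" cluster, so the ARI is positive but below 1; a partition that groups the frames exactly as the expert did would score 1.0 regardless of which ids it uses.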

The graph demonstrates that K-means shows significant results in terms of ARI (i.e. 77%), generating the best clusters across all datasets. Moreover, we can observe that Agglomerative clustering performs worst in comparison with the other clustering techniques.

7 Comparison with existing techniques

In order to evaluate the efficiency of the proposed technique, we trained a classifier on frames annotated with three different techniques. First, we used the manual method, in which human experts annotate each frame with certain descriptive keywords. Second, we used the semi-supervised method, in which a few labeled frames are used to train an SVM classifier, which is then used to make a general categorization of the test frames. Lastly, we trained the classifier on frames annotated by our proposed technique. The precision of each method is recorded in percentage form in Table 2.

Table 2 Descriptive statistics of benchmark datasets

The comparative results in Table 2 show that the manual method has the highest annotation accuracy, but, as discussed before, it is a very labor-intensive method. Among the automated methods, the semi-supervised method reaches 73% accuracy while our proposed method reaches 75%, so it slightly outperforms the existing approach.
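The semi-supervised baseline above can be sketched as follows. The descriptors, the number of labeled frames per action and the linear kernel are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Hypothetical frame descriptors for two actions; only a few are labeled.
frames = np.vstack([rng.normal(0, 0.4, (60, 32)), rng.normal(4, 0.4, (60, 32))])
true = np.array([0] * 60 + [1] * 60)

# Semi-supervised baseline: train the SVM on a handful of labeled frames only.
labeled = np.r_[0:5, 60:65]                  # 5 labeled frames per action
clf = SVC(kernel="linear").fit(frames[labeled], true[labeled])

# Propagate labels to the remaining (unlabeled) frames.
pred = clf.predict(frames)
accuracy = (pred == true).mean()
print(round(accuracy, 3))
```

With only ten labeled frames the classifier generalizes to the rest, which is what makes this baseline far cheaper than manual annotation.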

8 Implication of research

The goal of the proposed methodology is to predict a set of labels that semantically describe an image. The research community can use it for the semantic analysis and comparison of videos. Semantic analysis has become an active research topic in computer vision; its aim is to bridge the gap between low-level features and high-level semantics in order to improve understanding of the image. Furthermore, the proposed methodology can also be used in action categorization and image tagging.

9 Conclusion

The experimental results of this study indicate the effectiveness of the proposed methodology in automating label generation for video frames. The actions are labeled based on their semantic analysis. We also propose an evaluation model, including silhouette analysis and the Adjusted Rand Index (ARI), to assess the effectiveness of the proposed methodology: silhouette analysis measures the stability of the clusters, while the Adjusted Rand Index measures the similarity of the generated labels to expert opinion.

The ARI shows 77% similarity on UCF YouTube, 75% on UCF Sports, 73% on Weizmann and 65% on KTH, indicating the effectiveness of the proposed methodology for generating labels for video frames. The main findings are: 1) the highest silhouette index values indicate the most stable clustering, and 2) low resolution leads to the formation of outliers. In future work, the proposed methodology will be ensembled with bag-of-words techniques, label noise will be removed from the generated labels using state-of-the-art techniques, and these labels will be used to train a classifier.