1 Introduction

Video Object Segmentation (VOS) [15] is an essential task that simultaneously segments and classifies objects in a video sequence. It aims to assign a foreground/background label to each pixel in every video frame. Video object segmentation is widely applied in practical applications such as traffic control systems [42], recognition tasks [29, 33], object detection [45, 47], medical imaging [36], etc. Video reaches a broad audience, as it is preferred by people of varying interests. One well-known video-sharing platform, YouTube, serves nearly 5 billion videos to its end users [7]. Another platform, Facebook, enables video sharing and provides access to more than 8 billion videos for its users [26]. Every year, the average time spent watching video clips on social media such as Instagram increases by 80%.

As the volume and variety of videos keep increasing day by day, getting images annotated according to our specifications is a challenge. These annotations are most commonly used for object recognition and scene understanding, to segment image frames from whole images [41]. Annotation is the manual labeling of regions in an image, otherwise called tagging in general terms. The higher the quality of the annotated images, the more accurately our models are likely to perform. Semantic segmentation groups the targeted objects by associating each pixel of an image with its corresponding class label. VOS faces many prominent challenges such as occlusion [20], low resolution [1], non-rigid deformation [27], motion blur [15], appearance changes [8], scale variation, and near-far variance [23].

Occlusion is a scenario in which objects overlap, or appear to overlap, while in motion. Low resolution is caused by fewer pixels than required, which blurs the image. Non-rigid deformation is a change in the object's structure or boundary as it grows or shrinks. Motion blur is the streaking texture of moving objects captured by the camera, for example rainfall. Appearance change refers to any change in the spatial alignment or texture of the object. Scale variation refers to the difference in the apparent size of an object depending on the distance from which it is captured. Finally, near-far variance arises when objects captured at a distance appear smaller than those captured at closer proximity.

Of late, the techniques [2, 18, 32] applied in VOS fine-tune the model by comparing and embedding information from the first and previous object frames into the current frame. Other contemporary works, such as STMVOS [24] and FEELVOS [31], avoid per-video fine-tuning yet achieve competitive accuracy. STMVOS relies on large image datasets for training and requires large-scale frame sequences. The image segmentation technique proposed in [44] is applied to medical images using an adaptive perturbation methodology. The authors of [21] propose new methods for feature edge detection and edge splitting. Deep auto-encoder models [25] are applied in recommendation systems for identifying or segmenting images using a multi-layer architecture and a matrix factorization approach. The proposed model segments images from videos, and an extension could be applied to videos of purchases made in shopping malls to learn the requirements and style of customers, so that similar items can be recommended to other, similar customers.

Figure 1 compares contemporary VOS techniques such as OSMN [43], SiamMask [35], OSVOS [2], OnAVOS [32], FEELVOS, PReMVOS [23], and STMVOS on the DAVIS 2017 semi-supervised dataset. Many contemporary approaches pay little attention to analyzing the background information of a video frame and to the feature embedding approaches applied to it. Subtracting the background features from a video frame and extracting the foreground objects is quite challenging. This observation motivates us to propose a framework and a suitable optimal embedding methodology that matches the foreground and background in the target frame.

Fig. 1 Comparison of results on DAVIS 2017 achieved by recent methods

The reason for choosing a semi-supervised VOS algorithm is that it is trained on a combination of labeled and unlabeled data. It assumes that points which are closer to each other are more likely to share the same output label. The development of semi-supervised VOS can benefit many related tasks, such as video instance segmentation and interactive video object segmentation. Semi-supervised methods target the objects in motion and identify the image sequence along three dimensions: the previous frame, the current frame, and the first frame.

Design goals: The proposed work focuses on

1. Applying a semi-supervised VOS approach that targets objects in motion and handles the 'Motion Blur' problem effectively.

2. Identifying the image sequence along three dimensions: previous frame, current frame, and first frame. The objects in the first frame are annotated, and the collection of annotated objects is used as the training dataset, i.e., as input to the proposed system. The model is an end-to-end analysis of the objects in a video, as it does not depend on analyzing only the first frame but rather all the frames.

3. Applying the following embedding techniques: (a) enhanced pixel-level matching, (b) enhanced instance-level matching, and (c) ensemble learning.

4. Overcoming the scale-variation challenge that persists in contemporary work through the ensemble embedding technique, which matches the foreground and background and accurately segments the foreground from the background.

5. Handling the non-rigid deformation challenges encountered in contemporary VOS techniques.

6. Keeping the model simple, effective, and robust, producing improved J and F values compared to existing models.

The proposed model is closest to FEELVOS, as FEELVOS extracts pixel-level embeddings to match object information and also uses global and local matching. Pixel-level matching by itself is not sufficient to target the foreground object frames and might introduce unpredicted disturbances or noise due to differences in pixel conditions. Moreover, FEELVOS is limited to predicting only the foreground objects, as it restricts matching to neighboring pixels. To overcome this limitation, we introduce instance-level matching [14, 37] in combination with pixel-level embedding, which uses an attention mechanism to segment large-scale objects. We first apply enhanced pixel-wise matching, followed by instance-level matching for both the foreground and background objects. Thus, the suggested framework can successfully improve the quality of matching the targeted objects in semi-supervised VOS while keeping the model simple and effective.

The remaining part of the paper is structured as follows. Section 2 elucidates the state of the art of existing VOS techniques, their advantages, and their concerns. Section 3 explains the proposed idea and the steps carried out under pixel-level matching. Section 4 describes the architecture of the proposed system. Section 5 details the implementation procedure, and the results are tabulated in Section 6. Section 7 concludes the research work.

2 Literature review

2.1 Models with fine-tuning

It is obligatory to relate the frames in a video sequence in order to precisely segment the target object. Several works on semi-supervised VOS aim to boost object segmentation quality, and these techniques depend on fine-tuning to segment the target object. OSVOS and MoNet [39] predict the result from the first frame and use this prediction to tune the model at test time. Another fine-tuning model, OnAVOS, extends first-frame fine-tuning with an online adaptation approach that uses heuristic fine-tuning policies to achieve better results. One way of identifying the relationship between frames is optical flow, which describes the change in velocity or the variation in the brightness pattern of images across frames. This approach is used in MaskTrack [26], which transfers the segmentation mask from one frame to another. PReMVOS conjoins four neural networks, including optical flow, and proposes a merging algorithm to fine-tune the image segmentation. Although the aforementioned methods strive to fine-tune the segmentation of the targeted images, they consume additional time, thereby slowing down the entire process.

2.2 Models without fine-tuning

Several research works have been proposed to circumvent the need for fine-tuning; these models also strive to achieve respectable run time. The authors of [15] propose VideoMatch, which uses a soft matching layer and maps the pixels of the current frame to the first frame in an embedding space. OSMN avoids fine-tuning and keeps the run time minimal by using two networks: one for making predictions over the segmented images and another for extracting information at the instance level. A nearest-neighbor classifier is applied in PML [4], which uses pixel-wise matching and assigns a label to every object pixel in the current and the first frame. However, it produces noisy segmentation owing to mismatches between neighboring pixels. Compared to PML and VideoMatch, FEELVOS avoids fine-tuning and achieves much better speed. STMVOS stores and retrieves information from previous frames, mirroring a typical memory network, and needs extensive training with images simulated from multiple frames. Another approach, RGMP [38], demands extensive training just as STMVOS does. Yet, all these methods concentrate on the foreground images, overlooking the background of the image. The proposed method, however, works with both the foreground and the background.

2.3 Attention mechanism in image analysis

Attention mechanisms, in general, seek to identify the pertinent part of an image that provides the most information about the frame or the entire frame sequence. Certain proposals in VOS chain an attention mechanism into their convolutional networks. SE-Nets [16] propose a gating mechanism that models channel attention in the network. OSMN uses instance-level matching and works only with foreground images. Motivated by SE-Nets and OSMN, our proposed method applies channel-wise average pooling to both the foreground and background images in the instance-level matching mechanism. Besides instance-level matching, pixel-level matching is also applied. The proposed image matching algorithm uses average pooling. The reason for opting for average pooling rather than max pooling is that max pooling captures only the sharpest features of the image and suits dark pixels, whereas average pooling smoothens the image and gives preference to lighter pixels as well. Replacing average pooling with max pooling would cause only a trivial change in performance, as the convolution and pooling layers handle edge detection effectively with average pooling itself. The VOS algorithm works in linear time; yet, a large video might consume more execution time. As the DAVIS dataset has annotated images, the algorithm does not require additional space at run time.

2.4 Varying activation function

ReLU (Rectified Linear Unit) is applied in several of the existing approaches discussed in Sections 2.1, 2.2 and 2.3. One downside of ReLU is that nodes with zero gradients do not get their weights updated in successive iterations, since ReLU assigns inactive nodes a value of zero; the inactive nodes are ignored during gradient adjustment. Further, zero gradients might slow down the complete training process. This can be alleviated by using the Leaky ReLU function [10]. The proposed approach uses Leaky ReLU, as it produces a small non-zero value for the inactive nodes and thereby tunes the weights satisfactorily during gradient adjustment.

3 Methodology

The current methodologies in VOS focus on segmenting the foreground objects. OSMN ignores feature diversity and hence leads to coarse predictions. Although FEELVOS and PML handle the problem of feature diversity, they are affected by noise from nearby pixels. After comprehensively analyzing the above-mentioned approaches, the proposed framework analyzes both the foreground and background objects in the frame sequence. The architecture of the proposed method is shown in Fig. 2.

Fig. 2 Proposed framework for foreground and background (F/B) object matching

As the first step of image segmentation, pixel-level matching is applied to the foreground and background objects. Pixel-level matching is a popular feature-matching technique in semi-supervised VOS and is applied for both global and local matching. The former uses the information of the first frame to construct the features, while the latter uses the information of the previous frame to construct the feature mapping. Subsequently, both instance-level and pixel-level embedding are applied to the foreground and the background image. Instance-level matching introduces an attention mechanism to increase the efficiency of pixel-level matching. The results of the instance- and pixel-matching techniques on the foreground and background images are combined over a large set of receptive fields, and ensemble learning over these receptive fields allows the model to make precise predictions.

3.1 Enhanced pixel-level matching

The following steps are carried out under enhanced pixel-level matching: foreground-background matching, global matching, and local matching.

3.1.1 Foreground- background matching

In FEELVOS, a distance measure is used for foreground pixel-level matching. Let a and b denote pixels of the first frame and the previous frame, respectively, and let m_a and m_b denote their corresponding learned embeddings. Pixels that belong to the same object are considered close in the embedding space, and the distance between such pixels is calculated using (1) below:

$$ distance(a,b)=1- \frac{2}{1+\exp(0)} = 0 $$
(1)

Pixels that belong to distinct objects are considered far apart, and the distance between such pixels is calculated using (2) below:

$$ distance(a,b)=1- \frac{2}{1+\exp(\infty )} = 1 $$
(2)

Equations (1) and (2) are modified to incorporate the background scenario. For a frame f, let bg_f be the pixel set of the background objects and fg_f the pixel set of the foreground objects. The distance between pixel a of the current frame and pixel b of frame f, with embedding values m_a and m_b, is given in

$$ distance_{f}(a,b)= 1-\frac{2}{1+\exp(\| m_{a}-m_{b} \|^{2}+bias_{bg})} \quad \text{if } b\in bg_{f} $$
(3)
$$ distance_{f}(a,b)= 1-\frac{2}{1+\exp(\| m_{a}-m_{b} \|^{2}+bias_{fg})} \quad \text{if } b \in fg_{f} $$
(4)

where bias_bg represents the trainable bias of the background and bias_fg represents the trainable bias of the foreground.

Hence, in (3) and (4), the two trainable biases for the foreground and background help capture the difference in pixel distances between the two regions.
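To make the foreground-background distance concrete, the following is a minimal NumPy sketch of Eqs. (3) and (4); the function name fb_distance and the array shapes are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def fb_distance(m_a, m_b, bias):
    """Sketch of Eqs. (3)/(4): 1 - 2 / (1 + exp(||m_a - m_b||^2 + bias)).

    m_a, m_b : embedding vectors of shape (..., C)
    bias     : trainable scalar, bias_bg or bias_fg depending on whether
               pixel b belongs to the background or the foreground set.
    """
    sq_dist = np.sum((m_a - m_b) ** 2, axis=-1)
    return 1.0 - 2.0 / (1.0 + np.exp(sq_dist + bias))
```

With identical embeddings and zero bias the sketch returns 0, matching Eq. (1), while very distant embeddings drive the value towards 1, matching Eq. (2).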

3.1.2 Global matching

Similar to PML, VideoMatch, and FEELVOS, the proposed approach uses nearby neighboring pixels to transfer the feature details of the ground-truth annotated image to the current frame.

Let A_t denote the pixels and Ob the objects; embeddings are extracted with a stride of 4 at time t. As shown in (5) and (6), the current-frame pixels a are matched with the pixels b of the first object frame, otherwise known as the reference frame (t = 1), for global foreground/background matching:

The global foreground matching equation is given as

$$ global_{fg} (a)=\min_{b \in fg_{1}} distance (a,b) $$
(5)

The global background matching equation is given as

$$ global_{bg} (a)=\min_{b \in bg_{1}} distance (a,b) $$
(6)
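Assuming the reference-frame embeddings and their foreground mask are available as flat arrays, a hedged sketch of the global matching in Eqs. (5) and (6) could look as follows (names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def global_matching(cur_emb, ref_emb, ref_fg, bias_fg=0.0, bias_bg=0.0):
    """Global F/B matching, Eqs. (5)/(6): for each current-frame pixel, take the
    minimum distance to the reference-frame foreground / background pixel sets.

    cur_emb : (N, C) current-frame embeddings
    ref_emb : (M, C) first-frame (reference) embeddings
    ref_fg  : (M,) boolean mask marking reference pixels in the foreground
    Assumes both the foreground and the background sets are non-empty.
    """
    # Pairwise squared distances between current and reference embeddings.
    sq = np.sum((cur_emb[:, None, :] - ref_emb[None, :, :]) ** 2, axis=-1)
    d_fg = 1.0 - 2.0 / (1.0 + np.exp(sq[:, ref_fg] + bias_fg))
    d_bg = 1.0 - 2.0 / (1.0 + np.exp(sq[:, ~ref_fg] + bias_bg))
    return d_fg.min(axis=1), d_bg.min(axis=1)
```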

3.1.3 Local matching

In FEELVOS, pixel information is shared from the first frame with the neighboring pixels of the next frame; however, it is not shared with object frames that are far away. Moreover, the pixels of objects within the nearby area suffer from scale variations, and the objects will not look alike. Hence, the proposed approach extends the matching between objects that suffer from the multi-scale variation issue. This kind of matching is more robust for detecting and segmenting objects in fast, motion-blurred frames.

In addition to matching between local and global objects, we combine the pixel-wise feature map of the previous frame with the current frame. The feature map of the first frame is extracted, and the video sequence is analyzed frame by frame to determine the features that match the current frame. Subsequently, we relate local and global matching to the first and previous frames and arrive at the final segmented object.
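Local matching is described above only at a high level; the sketch below assumes it is restricted to a square window around each pixel of the previous frame. The window size, the nested-loop form, and the function name are illustrative simplifications, not the actual implementation.

```python
import numpy as np

def local_matching(cur_emb, prev_emb, prev_fg, window=7):
    """Local F/B matching sketch: each current-frame pixel is compared only with
    previous-frame pixels inside a (2*window+1)^2 neighbourhood.

    cur_emb, prev_emb : (H, W, C) embeddings; prev_fg : (H, W) boolean FG mask.
    Returns the minimum foreground distance per pixel (background is analogous).
    """
    H, W, _ = cur_emb.shape
    out = np.ones((H, W))  # default to the maximal distance of 1
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - window), min(H, i + window + 1)
            j0, j1 = max(0, j - window), min(W, j + window + 1)
            patch = prev_emb[i0:i1, j0:j1][prev_fg[i0:i1, j0:j1]]
            if patch.size:
                sq = np.sum((patch - cur_emb[i, j]) ** 2, axis=-1)
                out[i, j] = np.min(1.0 - 2.0 / (1.0 + np.exp(sq)))
    return out
```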

3.2 Enhanced instance-level matching

After obtaining the pixel matching patterns of the first frame and previous frames, the mapped pixels are divided into foreground pixels and background pixels, \( (a_{1},\bar {a}_{1},a_{(T-1)},\bar {a}_{(T-1)})\), based on their masks. Following this, channel-wise average pooling is applied to the pixels and an instance-level vector is obtained. This vector encompasses information from all the frames, for both the foreground and background objects. Finally, a module with a fully connected layer and a non-linear activation function is built, whose output is fed to each ResNet block. The instance-level matching learns complete information about the foreground and background pixels and is effective in handling local ambiguity between them.
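A minimal PyTorch-style sketch of this instance-level module is given below, assuming the descriptor is obtained by masked channel-wise average pooling of foreground and background pixels and a fully connected layer that re-weights the features fed to a ResNet block; the module name, the sigmoid gating, and the sizes are assumptions.

```python
import torch
import torch.nn as nn

class InstanceAttention(nn.Module):
    """Instance-level matching sketch: a channel descriptor from masked average
    pooling gates the features of a ResNet block (names/sizes are illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(2 * channels, channels)  # FG + BG descriptors
        self.act = nn.Sigmoid()

    def forward(self, feat, fg_mask, bg_mask):
        # feat: (B, C, H, W); fg_mask / bg_mask: (B, 1, H, W) binary masks.
        fg = (feat * fg_mask).sum((2, 3)) / fg_mask.sum((2, 3)).clamp(min=1)
        bg = (feat * bg_mask).sum((2, 3)) / bg_mask.sum((2, 3)).clamp(min=1)
        gate = self.act(self.fc(torch.cat([fg, bg], dim=1)))  # (B, C)
        return feat * gate[:, :, None, None]
```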

3.3 Ensemble learning

Following ResNets [13] and DeepLab [3, 6], the proposed ensemble learning module contains Res-blocks and ASPP [34], which undergo channel-wise average pooling and bilinear up-sampling to capture the multi-scale dimension [30] precisely. To enlarge the receptive field, dilated convolutional layers are added with a well-defined gap.
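As a rough sketch of such an ensemble module, the block below combines parallel dilated convolutions (ASPP-style, with DeepLab-like dilation rates assumed) and bilinear up-sampling; the class name and channel sizes are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPHead(nn.Module):
    """Ensemble-module sketch: parallel dilated convolutions enlarge the
    receptive field; their outputs are fused and bilinearly up-sampled."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x, out_size):
        y = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        y = self.fuse(y)
        return F.interpolate(y, size=out_size, mode='bilinear', align_corners=False)
```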

Another important feature of the proposed system is that Leaky-ReLU is employed instead of ReLU as the activation function [11, 12, 46]. The difference between ReLU and Leaky-ReLU is pictured in Fig. 3. All the transposed convolution up-sampling layers follow ReLU and Leaky-ReLU non-linearity and are initialized with the fan-out initialization mode [19].

Fig. 3 ReLU vs Leaky-ReLU

ReLU is the most widely used activation function in CNNs; it determines the output of the network [6] and also has a considerable effect on feature extraction. The equation of ReLU is given below:

$$ func\left( y\right)=\left\{\begin{array}{cc} y_{k}\ if\ \ \ \ y_{k}>0 \\ 0\ if\ \ \ \ y_{k}\le0 \end{array}\right\} $$
(7)

where y_k denotes the input to the ReLU (k-th channel) and func(y) denotes the output of the ReLU (k-th channel); (7) can also be written as in (8).

$$ func(y_{k})= \max(y_{k}, 0) $$
(8)

The ReLU activation function produces sparse activations, achieved through its zero threshold. It accurately classifies two-class data values and does not suffer from the gradient diffusion problem. However, training with ReLU might slow down because of persistent zero gradients, which is considered one of its drawbacks.

To address this, the LReLU non-linear activation function is introduced; it permits a small non-zero output for the negative part [40].

$$ func\left( y\right)=\left\{\begin{array}{cc} y_{k}\ if\ \ \ \ y_{k}>0 \\ \lambda y_{k}\ if\ \ \ \ y_{k}\le0 \end{array}\right\} $$
(9)

where λ denotes a predefined parameter; it takes a value of 0.01. Equation (9) can also be written as in (10).

$$ func(y_{k})= \max (y_{k},0) + \lambda \min (y_{k},0) $$
(10)

LReLU flattens the negative part, producing a small non-zero gradient when the unit is inactive, whereas ReLU does not. We chose 'fan-out' initialization, which preserves the magnitude of the variance of the weights in the backward pass, i.e., the initialization properly scales the backward signal. Fan-out initialization is particularly worthwhile when the loss oscillates strongly.
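For reference, a one-line NumPy version of Eq. (10) is sketched below, with the 0.01 slope taken from the value stated above; the accompanying comment points to the corresponding fan-out initialization call available in PyTorch.

```python
import numpy as np

def leaky_relu(y, lam=0.01):
    """Eq. (10): max(y, 0) + lam * min(y, 0); lam = 0.01 as stated above."""
    return np.maximum(y, 0.0) + lam * np.minimum(y, 0.0)

# In PyTorch, the 'fan-out' mode mentioned above corresponds to
#   torch.nn.init.kaiming_normal_(weight, mode='fan_out', nonlinearity='leaky_relu')
```

For a negative input such as -2, leaky_relu returns -0.02 instead of ReLU's 0, so the gradient stays non-zero for inactive units.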

4 Architecture

The DeepLabv3+ architecture [5] is used in the proposed model. It is based on a dilated ResNet-101, which is the backbone of our network. ResNet makes it possible to train extremely deep neural networks easily. The ResNet architecture is made up of residual blocks, which are described in Fig. 4.

Fig. 4 ResNet-101 residual block

The purpose of using ResNet is that each residual block learns the increment (residual) that must be added to its input in order to achieve the accurate output. Each block provides pooling operations to reduce the spatial dimension, and its skip connections lead to better backpropagation, which helps in training the network model smoothly. A notable feature of ResNet is that it learns to identify textures and to detect edges and objects in the images, while using fewer computational resources and executing in minimal time.
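The residual mapping can be illustrated with the small PyTorch-style block below; it is a simplified two-convolution sketch rather than the actual 1x1/3x3/1x1 bottleneck used in ResNet-101, and the channel size is an assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = activation(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.01, inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.LeakyReLU(0.01, inplace=True)

    def forward(self, x):
        # The block learns the residual F(x); the skip connection adds back x.
        return self.act(self.body(x) + x)
```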

The DeepLab structure shown in Fig. 5 is considered a specially designed architecture compared with a plain encoder-decoder design because it extracts multi-scale features effectively while segmenting video objects. Rather than performing regular convolutions, the last ResNet block in DeepLab carries out atrous convolutions in the ASPP module, which uses different dilation rates to capture multi-scale feature information.

Fig. 5 DeepLabv3+ architecture

5 Implementation details

Several datasets [9, 22] are available for video object segmentation. Yet, not all of them are explicitly designed for identifying the pixel relationships that separate the foreground object from the background area.

5.1 Dataset

The proposed approach uses the DAVIS (Densely Annotated Video Segmentation) dataset. DAVIS comprises two versions: DAVIS 2016 and DAVIS 2017. It encompasses high-quality, full high-definition video sequences. The numbers of sequences, frames, and objects in the DAVIS 2016 and 2017 datasets are tabulated in Tables 1 and 2, respectively: Table 1 shows the structure of the DAVIS 2016 dataset and Table 2 shows the structure of the DAVIS 2017 dataset.

Table 1 Structure of DAVIS 2016 dataset
Table 2 Structure of DAVIS 2017 dataset

The 2016 version provides annotations for foreground/background objects, while the 2017 version provides annotations for multiple objects and instances in the foreground. Every frame in the videos is pixel-accurately and densely annotated. Some examples of the annotation masks are shown in Fig. 6.

Fig. 6 First-frame annotations for all the sequences in the DAVIS validation subset. The segmented objects are highlighted in each image

5.2 Model analysis

In the proposed framework, a randomly chosen first frame is sampled, and the objects in the remaining frames (previous and current frames) are segmented.

The existing models and their implementation methods were studied in depth, and in the proposed system the architectural implementation and structure of the CNN model are varied such that there is a notable increase in the accuracy of the prediction system. The ablation study accordingly re-modifies the entire structure and the implementation methodology, as detailed below.

For pixel-wise matching, we apply 3×3 convolutions with batch normalization [17] and a non-linear activation function, in which the dimensions of the image are reduced and restored. One depth-wise separable convolution is applied with a stride of 4; the stride size benefits substantially faster training. Feature maps at stride 16 are four times smaller than at stride 8; the smaller stride can help produce finer segmentation results, but it might take more training time. For local matching, we initialize the foreground and background biases, bias_fg and bias_bg, to 0. We additionally down-sample the object information with bilinear interpolation to reduce its dimension.
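A hedged sketch of this embedding head is shown below, assuming 3×3 convolution + batch normalization + Leaky-ReLU followed by one depth-wise separable convolution with stride 4; the channel sizes and the function name are illustrative.

```python
import torch.nn as nn

def embedding_head(in_ch, emb_ch, stride=4):
    """Sketch of the pixel-wise matching head: 3x3 conv + BN + Leaky-ReLU,
    then one depth-wise separable convolution with the given stride."""
    return nn.Sequential(
        nn.Conv2d(in_ch, emb_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(emb_ch),
        nn.LeakyReLU(0.01),
        # depth-wise (groups=emb_ch) followed by point-wise 1x1 convolution
        nn.Conv2d(emb_ch, emb_ch, kernel_size=3, stride=stride, padding=1, groups=emb_ch),
        nn.Conv2d(emb_ch, emb_ch, kernel_size=1),
    )
```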

Group normalization is applied as an alternative to batch normalization, and channel-wise average pooling with an attention mechanism is applied to improve performance. Group normalization computes the mean and variance independently of the batch size. Interestingly, group normalization yields a much lower error and comparably better results than batch normalization. Additionally, group normalization can easily be transferred from the pre-training process to fine-tuning on video sequence segmentation.
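The swap from batch normalization to group normalization is a one-line change in most frameworks; the snippet below shows the idea (the group count of 32 is an assumption, not a value from the paper).

```python
import torch.nn as nn

# Batch normalization: statistics depend on the batch size.
bn = nn.BatchNorm2d(256)

# Group normalization: statistics are computed per sample over channel groups,
# so the result is independent of the batch size (32 groups is an assumption).
gn = nn.GroupNorm(num_groups=32, num_channels=256)
```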

The DAVIS 2017 and DAVIS 2016 training sets are used as the training data, with the default down-sampled 480p video resolution. We apply stochastic gradient descent with a learning rate of about 0.006 and a momentum of 0.9. Cross-entropy loss is an important cost function used to optimize segmentation methods; here we adopt a bootstrapped cross-entropy loss that considers only the hardest 15% of pixels. This kind of cost function copes well with unevenly distributed classes, which are quite common in VOS. During the training stage, the batch normalization parameters in the backbone are disabled.
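A minimal sketch of such a bootstrapped cross-entropy, keeping only the hardest 15% of per-pixel losses, is given below; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def bootstrapped_ce(logits, target, fraction=0.15):
    """Bootstrapped cross-entropy sketch: average only the hardest `fraction`
    of per-pixel losses (15% here), which helps with class imbalance.

    logits: (B, K, H, W) class scores; target: (B, H, W) integer labels.
    """
    loss = F.cross_entropy(logits, target, reduction='none').flatten()
    k = max(1, int(fraction * loss.numel()))
    hardest, _ = torch.topk(loss, k)
    return hardest.mean()
```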

6 Results and experiments

After the training process is complete, the DAVIS 2016 and DAVIS 2017 [28] validation sets are used to evaluate our framework.

The DAVIS 2016 validation set contains 20 video sequences, each annotated with a single instance. We evaluate all the frames generated from the video sequences with our algorithm and compare them with previous methods. Table 3 shows the results of state-of-the-art methods against the proposed method.

Table 3 DAVIS 2017 Val set on various models. We present the J & F score and frame per second (fps)

The DAVIS 2017 dataset contains 60 training video sequences with multiple masks and a validation set, extended from DAVIS 2016, that has 30 videos.

We assess our model on both the 2016 and the 2017 versions of DAVIS. Table 4 tabulates the Jaccard and F-measure values of the proposed method, which obtains a mean J & F value of 82%, significantly higher than that of existing methods.

Table 4 J & F Mean score for our framework

Intersection-over-Union (IoU), or the Jaccard index, measures the percentage of overlap between the ground-truth object mask and the predicted output. It counts the pixels shared by the two masks and provides a score for the segmentation prediction. The F-measure estimates the contour accuracy.

$$ J\ score\ =\ \frac{Output\ Segmented\ Image\ \cap\ Ground\ Truth}{Output\ Segmented\ Image\ \cup\ Ground\ Truth} $$
(11)
$$ F\ score\ =\frac{2\ast\ Precision\ \ast\ Recall}{Precision\ +\ Recall} $$
(12)
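The two measures can be sketched as follows; this is a simplified illustration, whereas the official DAVIS toolkit computes the boundary precision and recall of Eq. (12) from matched contour pixels.

```python
import numpy as np

def j_score(pred_mask, gt_mask):
    """Region similarity (Jaccard/IoU), Eq. (11): |pred ∩ gt| / |pred ∪ gt|."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def f_score(precision, recall):
    """Contour accuracy, Eq. (12): harmonic mean of boundary precision/recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```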

We compute our validation-set results with our own evaluation code and obtain the test-dev results from the official CodaLab evaluation server.

In Fig. 7, the first video shows that the framework accurately detects the person and the dog even though the image suffers from occlusion. In the second video, the framework succeeds in tracking the right person amidst other similar persons. The proposed method produces better segmentation results than other approaches while keeping the inference time minimal.

Fig. 7 Comparative performance on both the 2016 and 2017 versions

As seen from the existing models in Table 3, there is variation between the J score and the F score. The proportionality between the J score and the F score is unpredictable and depends on the type of algorithm. The proposed approach achieves notable J and F scores, except for a slight decrease in the J score; this is because the J score measures the intersection over union, and minimal diversity in the sample images can disturb the J value. The fps (frames per second) is very important for analyzing videos and extracting frames from them. Table 3 compares the fps of the different existing models. Normally, image segmentation applied to videos demands a minimum of 5-7 frames per second.

We present a comprehensive comparison of the performance of our model with other models. Figure 8 shows that STMVOS fails to segment the back leg of the horse under occlusion and motion blur. In the second video sequence, STMVOS also fails to segment the bicycle. The proposed framework segments both the leg and the bicycle significantly better than STMVOS.

Fig. 8 Comparison with STMVOS

Compared with the results of FEELVOS, our framework achieves a significantly higher score (82.0% vs. 71.5%). Promisingly, if augmentation is applied to the proposed approach at the evaluation phase, the J & F mean can be further increased to more than 85%. Finally, the proposed model is able to accurately segment images in challenging situations such as occlusion, blur, and deformation.

7 Conclusion

In this paper, a new framework for handling segmentation problems in VOS is proposed. In particular, we address motion blur and scale variation, which include edge ambiguity, shape complexity, etc. Although many research efforts have focused on estimating the variation in objects over the past decades, they yielded limited accuracy, and most past articles focused only on foreground objects while paying little attention to objects in the background region. We have introduced a new method that combines and processes foreground and background objects and obtained promising final results. Specifically, we introduced an approach that matches and segments images irrespective of scale variation over multiple objects in the foreground, and incorporated the background information into it using pixel-level matching and instance-level matching. Moreover, we combined all the modules to make our framework powerful and to close the gap of inadequate accuracy. To conclude, the presented method could produce better results if extended with random-crop augmentation or balanced random-crop augmentation approaches.