1 Introduction

One of the most challenging problems in pattern recognition and image processing is object detection and tracking, which is now widely used in video surveillance systems, smart vehicles, traffic monitoring and human-computer interaction. A common way to deal with this problem is to involve more visual information (characteristics) in the segmentation algorithm. The visual characteristics of a pixel can be either informative or uninformative. In the presence of uninformative information (such as noise), the object detection process becomes complex and extremely time consuming. Moreover, nonuniform illumination or self-shadow can generate false clusters, and all of these factors can easily lead to over-segmentation. On the other hand, taking into account all possible visual features (color, texture, shape, etc.) may decrease the algorithm’s performance, as reported in [17, 28, 29, 37, 41, 43, 54, 57, 58, 61]. For all these reasons, it is better to discard irrelevant features in order to increase the detection accuracy. Recently, various works involving computer vision techniques have been proposed to address the above-mentioned issues and, in particular, to deal with complicated real data sets. Among them, the class of approaches named “finite mixture models (FMM)” has been suggested as a potential alternative for complex data modelling and clustering. Finite mixture models are among the most important machine learning methods in the literature and are at the heart of several image processing and pattern recognition algorithms [18, 47, 52, 53]. Many probability distributions (mixtures) have been suggested in the state of the art. Depending on the selected density function, the corresponding model imposes its structure on the treated image or video. It should be noted that no single model is capable of coping with all possible data shapes in a given dataset. 
Thus, the choice of a model (i.e. density function) must take the nature of the dataset into account. Even though conventional Gaussian mixture models have been broadly employed in different areas, recent publications have illustrated that other alternatives can offer superior performance in terms of data modeling and segmentation, especially when dealing with non-Gaussian data. Moreover, the efficacy of such models depends on their ability to automatically determine the number of components (classes or objects) for a given image. Although finite unbounded mixtures (e.g. the unbounded Gaussian mixture) have been successfully applied in data modelling given their advantage in terms of data approximation, other more flexible mixture variants, notably the so-called “bounded Gaussian mixtures”, have been presented as an attractive choice for data analysis. In this paper, we focus on effective multidimensional (color-textured) data modeling and segmentation, for which bounded mixture models have demonstrated efficiency in many applications. Indeed, bounded support mixtures include many other models, such as the Laplace mixture model (LMM), the Gaussian mixture model (GMM), the generalized Gaussian mixture model (GGMM) and the bounded Gaussian mixture model (BGMM), as special cases. Compared with the classic unbounded Gaussian distribution, non-Gaussian models such as the bounded generalized Gaussian mixture model (BGGMM) can offer more flexibility. In the present work, we are mainly motivated by the flexibility of this class of statistical models. Thus, the main target of this work is to propose an unsupervised learning method that is based on BGGMM and that simultaneously segments color-textured images and selects only the relevant features, which is essential for multidimensional data modeling.

The rest of this manuscript is structured as follows. In Section 2 we introduce a state of the art related to the current context of object detection and tracking in image and video sequences. Section 3 describes the proposed segmentation approach and introduces the main formalism of our statistical feature selection mixture model. In Section 4, we introduce a post processing step with level-set approach for object tracking purpose. In Section 5, extensive experiments are conducted and discussed to show the robustness of the developed approach. Finally, Section 6 concludes the paper.

2 Related research work

Object detection and segmentation in images or video sequences are two closely related and important steps for many applications, including video surveillance, medical image analysis, face detection, smart vehicles, traffic monitoring, and pattern recognition. In the literature, these two tasks have been extensively studied and many interesting papers have been proposed to address them. Existing approaches can be categorized into those based on machine learning and those that are not. In particular, variational active contours are widely used in this area [7, 11, 22, 64]. Pattern classification techniques have also been broadly explored for object detection [3, 24, 31, 41, 43, 57, 58, 61]. For instance, deep learning-based approaches have recently emerged as a good solution since they can learn from large-scale data, making them a worthy tool for analyzing massive data. For object detection, a single convolutional network can be used, for example, to predict multiple bounding boxes and their class probabilities simultaneously [58]. The coordinates of the bounding boxes are predicted using fully connected layers on top of the convolutional feature extractor. Later, this algorithm was enhanced in [57], where a real-time object detection system named YOLOv2 was developed to detect several object categories (about 9000). Another method, called Single Shot Detector (SSD), has been implemented in [43]. The advantage of this method is that it does not require features for bounding boxes and it encapsulates all stages in one network. Another recent work, the Feature Pyramid Network, was proposed in [41]; it is based on a pyramid structure, a sliding-window step and a Fast R-CNN algorithm. Convolutional neural networks (CNNs) were also applied in [61], where a deconvolution-based CNN model and a corner-based ROI estimator were integrated to form a single CNN-based detection model. 
Region proposals have also been combined with CNNs in order to improve object detection and to overcome the drawbacks of classic CNNs. The resulting method is referred to as “R-CNN” [31]. Later, this method was extended as “Fast R-CNN” [30], in which the training and testing speed as well as the detection accuracy were improved.

It is noteworthy that, in the object detection process, the more information is provided, the better the result that can be obtained in terms of accuracy. However, certain features are more relevant than others, and integrating only these informative features into the algorithm allows better detection of specific regions (or objects). Irrelevant characteristics may be nothing more than noise and thus do not contribute to effectively segmenting the desired objects. Feature selection is therefore particularly important: it plays a primary role in improving the accuracy of object detection algorithms and in reducing the processing time, especially when the data sets contain many features. A comprehensive literature review shows that extracting and selecting relevant features has been investigated broadly for several related applications and has been addressed in different ways. In fact, feature selection can be used as a preprocessing step or integrated within the classifier. The diversity of methods and applications using feature selection is a good indicator of its importance, as we will show through some relevant published works.

Selecting informative features with the Fisher information criterion has been used to measure the uncertainty of the classifier and to provide more effective real-time object tracking [66]. This approach was extended in [65] and enhanced by taking prior knowledge into account. Another work, in a medical context, was developed for detecting and screening diseases in capsule endoscopy [60]. It involves the following steps: extracting and selecting relevant features, then classifying images. In [63], a supervised learning approach was developed for discriminative regional feature integration. Support vector machines have also been used in [20], where the proposed model is able to automatically exclude useless features from the feature pool. Recently, the authors of [35] proposed a robust feature selection mechanism for image classification, which identifies a set of mixed visual features from color spaces. The selection step is performed according to the least entropy of the pixel frequency histogram distribution. The work in [39] integrated cluster analysis and sparse structural learning into a joint framework for feature selection. The authors investigated nonnegative spectral clustering and the hidden structure of features to learn an accurate clustering. The developed framework is formulated and optimized via an iterative algorithm. Another interesting work is presented in [38], where the authors proposed a new scheme to identify and select the most useful and discriminative features based on nonnegative spectral clustering and redundancy analysis. The redundancy between the selected features is also exploited. The proposed formulation is optimized via an objective function through an iterative algorithm. In [33], a texture-based feature selection technique is proposed for segmenting fundus images. Texture features are derived from different statistical image descriptors. 
Another recent work shows the importance of feature selection via the implementation of a tensor based image segmentation algorithm [34]. Indeed, specific objects are distinguished and characterized on the basis of their characteristics (shape, colour, texture).

In addition to the approaches mentioned above, other interesting algorithms have been applied with success to tackle unsupervised feature selection and object detection, notably model-based approaches [2, 51]. Model-based approaches are extensively applied to data classification, which is a critical step in many applications of practical importance such as pattern recognition [4, 13, 15, 19], medical image analysis [14] and intrusion detection systems [1]. For instance, the authors of [2] combine the unbounded generalized Gaussian mixture model with feature selection for image segmentation. Although these models (i.e. unbounded Gaussian and generalized Gaussian) are capable of offering good results, their limitation is that the underlying distribution is not bounded: its support is \((-\infty , +\infty )\), which can be an obstacle to achieving high performance. In fact, in real applications, it is crucial to select an appropriate model for the data; pixel values, for example, belong to [0,255] and not to \((-\infty , +\infty )\). Indeed, the support of most sources is bounded, and it is therefore important to exploit this property in data modeling. Motivated by the promising results obtained with bounded mixture models in applications such as speech modelling [42] and image denoising [18], we propose in this work a novel unsupervised flexible bounded mixture model to address the aforementioned challenging problems. In particular, we investigate the flexibility of bounded generalized Gaussian mixture models in fitting different data shapes; bounded models have the advantage of modelling observed images using different supports. We are also motivated by selecting and weighting only the most relevant data features. 
Thus, by automatically and simultaneously estimating the model’s parameters, the feature weights, and the number of clusters, we can achieve more competitive results. For tracking purposes, we follow the developed mixture model with a post-processing step implemented with a variational active contour.

3 Proposed method for image segmentation and feature selection

The proposed method for simultaneously segmenting color-texture images and selecting relevant visual features operates in an unsupervised manner. The problem of feature selection is reformulated as a problem of model selection, as proposed in [15, 36]. Indeed, instead of selecting a subset of characteristics, we estimate a saliency measure for each characteristic, named “feature saliency”. This estimation is performed through the Expectation-Maximization (EM) algorithm. In such a setting, one must avoid the case where all saliencies take the maximum value. This is achieved by integrating the minimum message length (MML) principle [62] into our mixture model algorithm. The MML criterion encourages the saliency of irrelevant features to reach zero, which allows us to reduce the total number of features. Thus, by integrating the estimation of the saliency of each characteristic into the proposed algorithm, we obtain a method capable of simultaneously selecting the relevant characteristics and properly segmenting the input images by determining the optimal number of components. In this section, we start by providing a nomenclature in Table 1 to help the reader follow the notation used in this manuscript. Then, we present our flexible statistical mixture model.

Table 1 Nomenclature related to the proposed method

3.1 The model

Let X be an input image of N pixels, each described by a d-dimensional feature vector. This image can be described by a mixture model with M components as follows:

$$ {p}({X}|{\varTheta} ) = \sum\limits_{j = 1}^{M} {{\pi_{j}} \prod\limits_{l=1}^{d} f({x_{l}}|{\theta_{jl}})} $$
(1)

where f(xl|𝜃jl) represents the probability density of the feature l in the component j. For the case of a mixture of bounded generalized Gaussian distributions [18] with M components, the complete likelihood is expressed as:

$$ p(X|{\varTheta})=\prod\limits_{i=1}^{N}\sum\limits_{j=1}^{M}\pi_{j}\frac{f_{ggd}(X_{i}|\theta_{j})H(X_{i}|{\varOmega}_{j})}{{\int}_{\partial_{j}} f_{ggd}(y|\theta_{j})dy} $$
(2)

where \(H(X_{i}|{\varOmega}_{j}) = \begin{cases} 1 & \text{if } X_{i} \in \partial_{j} \\ 0 & \text{otherwise} \end{cases}\) is an indicator function, ∂j denotes the bounded support region in R associated with Ωj, and the density function fggd is the generalized Gaussian distribution. This probability density is defined as:

$$ {f_{ggd}}({X_{i}}|\theta_{j}^{}) = \mathcal{A}(\lambda_{j}) \exp \left( { - {\left[ {\frac{{\varGamma ({3 / {{\lambda_{j}}}})}}{{\varGamma ({1 / {{\lambda_{j}}}})}}} \right]^{{\lambda_{j}}/2}}{{\left| {\frac{{{X_{i}} - {\mu_{j}}}}{{{\sigma_{j}}}}} \right|}^{{\lambda_{j}}}}} \right) $$
(3)

where \(\mathcal {A}(\lambda _{j}) = \frac {{{\lambda _{j}}\sqrt {\frac {{\varGamma ({3 / {{\lambda _{j}}}})}}{{\varGamma ({1 / {{\lambda _{j}}}})}}} }}{{2{\sigma _{j}}\varGamma ({1 / {{\lambda _{j}}}})}}\) and Γ(.) is the Gamma function given by \( \varGamma (x) = {\int \limits }_{0}^{\infty } t^{x-1}e^{-t} dt , x > 0.\)
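As an illustration of (2) and (3), the following minimal sketch evaluates the generalized Gaussian density and its bounded version for a single scalar feature. The function names, the default support [0, 255] and the numerical (trapezoid) normalization are assumptions made for this example only; they are not prescribed by the model.

```python
import math
import numpy as np

def ggd_pdf(x, mu, sigma, lam):
    """Generalized Gaussian density of Eq. (3) for a scalar feature."""
    ratio = math.gamma(3.0 / lam) / math.gamma(1.0 / lam)
    norm = lam * math.sqrt(ratio) / (2.0 * sigma * math.gamma(1.0 / lam))
    z = np.abs((np.asarray(x, dtype=float) - mu) / sigma)
    return norm * np.exp(-(ratio ** (lam / 2.0)) * z ** lam)

def bounded_ggd_pdf(x, mu, sigma, lam, lo=0.0, hi=255.0, n_grid=4001):
    """Bounded GGD: the GGD times the indicator of [lo, hi], renormalized.

    The normalizing integral is approximated on a regular grid
    (an assumption of this sketch; any quadrature rule would do).
    """
    grid = np.linspace(lo, hi, n_grid)
    vals = ggd_pdf(grid, mu, sigma, lam)
    mass = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))  # trapezoid rule
    x = np.asarray(x, dtype=float)
    inside = (x >= lo) & (x <= hi)  # indicator function H
    return np.where(inside, ggd_pdf(x, mu, sigma, lam) / mass, 0.0)
```

With λ = 2 the density reduces to the Gaussian, and with λ = 1 to the Laplace distribution, which is a quick sanity check for the normalization constant.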

Finally, in order to characterize the developed mixture model, we propose the following complete set of parameters:

$$ {\varTheta}=\left\{\mu_{1},...,\mu_{M},\sigma_{1},...,\sigma_{M},\lambda_{1},...,\lambda_{M},\pi_{1},...,\pi_{M}\right\}. $$

Thus, each component (as illustrated in (2)) models the data with different supports Ωj. The objective of the segmentation consists in assigning each pixel xi to one region (or component) of the given image. This assignment is defined through a latent variable zi = (zi1,zi2,...,ziM) such as zij ∈{0,1} and \( {\sum }_{j = 1}^{M} z_{ij} = 1 \). Therefore, the conditional distribution of xi given the label zi is defined as follows:

$$ p(x_{i}|z_{i},{\varTheta})=\prod\limits_{j=1}^{M} \left (\prod\limits_{l=1}^{d} p(x_{il}|\theta_{jl}) \right )^{z_{ij}} $$
(4)

where 𝜃jl = (λjl,μjl,σjl), and p(xil|𝜃jl) is the bounded generalized Gaussian distribution (BGGD). In this case, the labels zi are the missing information. Given the set of parameters Θ, the missing labels zij can be inferred by applying Bayes’ theorem as follows:

$$ p(z_{ij}|x_{i},{\varTheta})=p(x_{i},z_{ij}|{\varTheta})/p(x_{i}|{\varTheta}) \propto p_{j} p(x_{i}|z_{ij},{\varTheta}) $$
(5)

where pj = p(zij). Therefore, for M components in the mixture, the primary segmentation step consists in estimating the optimal parameters \( {\varTheta }^{*} = (p_{j}^{*}, \theta _{jl}^{*})\), with j = 1...M and l = 1...d. On the other hand, when dealing with images having various features such as color and texture, it is clear that these features do not contribute equally to discriminating pixels. The distribution of some of these features can be independent of the different regions. Noise is one example: it can make the modeling task more complex and can lead to misclassification.

Thus, we propose to extend the BGGMM model by taking into account the importance of each feature separately. Feature irrelevance is defined as follows: a feature is considered irrelevant if it has a common density, p(xil|φl), in all of the model’s components. Let ϕl be a binary variable such that:

$$ \phi_{l}=\left\{\begin{array}{ll} 0 & \text{if the}~l\text{th feature is irrelevant}\\ 1 & \text{otherwise} \end{array}\right. $$

Thus, each xil has the following distribution [15, 36]:

$$ p(x_{il}|\theta_{jl}^{*}, \varphi_{l}^{*},\phi_{l} )\simeq \left (p(x_{il}|\theta_{jl}) \right )^{\phi_{l}} \left (p(x_{il}|\varphi_{l}) \right )^{1-\phi_{l}} $$
(6)

where \( {\varTheta }^{*} = (p_{j}^{*}, \theta _{jl}^{*}, \varphi _{l}^{*}) \). The superscript star indicates the unknown true distribution of the feature l; p(xil|𝜃jl) and p(xil|φl) are both univariate bounded generalized Gaussian distributions. Based on this equation, we notice that the underlying mixture model could lead to false positives, that is to say, non-informative features can be considered relevant [15]. To deal with this issue, we propose to generalize the definition of feature relevance by modeling the irrelevant component p(.|φl) as a mixture of BGGDs that is independent of the region label (zi). Our choice is motivated by the flexibility of the proposed mixture, which can closely approximate arbitrary distributions for the irrelevant characteristics [10].

Let K be the total number of components of this mixture, whose parameters are (φ1l,...,φKl). Let wilk ∈{0,1}, with \( {\sum }_{k = 1}^{K} w_{ilk} = 1 \), be the label associated with an irrelevant feature, where wilk = 1 if xil is generated by the component k of this mixture and 0 otherwise. Let Φ be the set of all ϕl. From this, it is simple to show that (4) can be rewritten as:

$$ p(x_{i}|z_{i}, {\Phi} , \{w_{il}\}, {\varTheta} ) = \prod\limits_{j=1}^{M} \left (\prod\limits_{l=1}^{d} \left (p(x_{il}|\theta_{jl}) \right )^{\phi_{l}} \left (\prod\limits_{k=1}^{K} (p(x_{il}|\varphi_{kl}))^{w_{ilk}} \right )^{1-\phi_{l}}\right )^{z_{ij}} $$
(7)

where zi, Φ and {wil} are the missing information. Thus, it is practical to marginalize the complete likelihood function p(xi|zi,Φ,{wil},Θ) with respect to the variables zi, Φ and {wil}. This yields the likelihood of the observations p(xi|Θ). To reach this objective, we must first define the following prior distributions:

$$ \begin{array}{@{}rcl@{}} && p(z_{i})=\prod\limits_{j=1}^{M}p_{j}^{z_{ij}}, \\ && p({\Phi})=\prod\limits_{l=1}^{d}\rho_{l1}^{\phi_{l}}\rho_{l2}^{1-\phi_{l}}, \\ && p(w_{il}|\phi_{l})=\prod\limits_{k=1}^{K}\left ((\pi_{kl})^{w_{ilk}} \right )^{1-\phi_{l}} \end{array} $$
(8)

where ρl1 = p(ϕl = 1) is the probability that the feature l is relevant, and ρl2 = p(ϕl = 0) is the probability that it is irrelevant, so that ρl1 + ρl2 = 1. Here πkl is the prior probability that xil comes from the component k given that l is an irrelevant feature (i.e. ϕl = 0).

To estimate the set of all the model’s parameters, we need to optimize an objective function, which leads to the segmentation of color images with the visual feature selection mechanism taken into account. These parameters are denoted by Θ = (p,𝜃jl,φkl,πl) where p = (p1,...,pM), 𝜃jl = (λjl,μjl,σjl) and πl = (πl1,...,πlK). Now p(xi|Θ) is obtained by successively marginalizing the complete likelihood p(xi|zi,Φ,{wil},Θ) with respect to the hidden variables zi, Φ and {wil}, and we can deduce the following final model for image segmentation with the feature selection mechanism:

$$ p\left (x_{i}|{\varTheta} \right )=\sum\limits_{j=1}^{M}p_{j}\prod\limits_{l=1}^{d} \left( \rho_{l1} p(x_{il}|\theta_{jl})+ \rho_{l2} \sum\limits_{k=1}^{K} \pi_{kl} p(x_{il}|\varphi_{kl})\right ) $$
(9)

3.2 Model’s parameters estimation

Several techniques have been proposed to deal with the complex problem of parameters estimation for statistical mixture models [47]. In this work, we opt for the maximum likelihood estimation method [48, 56]. In practice, the expectation maximization method [27] is applied in order to estimate the parameters of the model. The log-likelihood is given by:

$$ \log p(X|{\varTheta})=\sum\limits_{i=1}^{N}\log(p(x_{i}|{\varTheta} )) $$
(10)

The posterior probability is estimated as follows:

$$ p(j|x_{i})=\frac{p_{j} {\prod}_{l=1}^{d} \left [ \beta_{j} (x_{il}) \right ]}{{\sum}_{m=1}^{M} p_{m} {\prod}_{l=1}^{d} \left [ \beta_{m} (x_{il}) \right ]}, $$
(11)

where βj(xil) = ρl1p(xil|𝜃jl) + ρl2p(xil|φl), and \(p(x_{il}|\varphi _{l})={\sum }_{k=1}^{K} \pi _{kl}p(x_{il}|\varphi _{kl})\).
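The E-step in (11) can be sketched as follows, assuming the component densities have already been evaluated and stored in arrays. The array layout, the function name and the log-domain stabilization are choices made for this illustration, not part of the formulation above.

```python
import numpy as np

def responsibilities(p, rho1, dens_rel, dens_irr):
    """Posterior p(j | x_i) of Eq. (11).

    p        : (M,)      mixing weights p_j
    rho1     : (d,)      feature saliencies rho_l1 (rho_l2 = 1 - rho_l1)
    dens_rel : (N, M, d) values of p(x_il | theta_jl)
    dens_irr : (N, d)    values of p(x_il | varphi_l), the common mixture
    Returns the (N, M) posteriors and the (N, M, d) terms beta_j(x_il).
    """
    beta = rho1 * dens_rel + (1.0 - rho1) * dens_irr[:, None, :]
    # Work in the log domain so that products over many features do not underflow.
    log_joint = np.log(p)[None, :] + np.sum(np.log(beta), axis=2)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)
    return post, beta
```

When every saliency is zero, all features share the common density and the posteriors collapse to the mixing weights, which matches the intuition that irrelevant features carry no information about the region label.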

We derive now the equation to compute the relevance of the respective features.

$$ \frac{1}{ \hat{\rho}_{l1}}=1+ \frac{\max\left ({\sum}_{i=1}^{N} {\sum}_{j=1}^{M} p(j|x_{i}) \frac{\rho_{l2} p(x_{il}|\varphi_{l})}{\beta_{j} (x_{il}) }-\frac{3K}{2},0 \right )}{\max \left ({\sum}_{i=1}^{N} {\sum}_{j=1}^{M} p(j|x_{i}) \frac{\rho_{l1} p(x_{il}|\theta_{jl})}{\beta_{j} (x_{il}) }-\frac{3M}{2},0 \right )} $$
(12)
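Equation (12) is equivalent to ρl1 = R / (R + I), where R and I are the truncated "relevant" and "irrelevant" sums. A minimal sketch of this update is given below; the array layout matches the densities used in (11), and keeping the previous saliency when both truncated sums vanish is an assumption of this example, since that case is not covered by (12).

```python
import numpy as np

def update_saliency(post, rho1, dens_rel, dens_irr, K):
    """Feature saliency update of Eq. (12), written as rho_l1 = R / (R + I).

    post     : (N, M)    posteriors p(j | x_i)
    rho1     : (d,)      current saliencies
    dens_rel : (N, M, d) values of p(x_il | theta_jl)
    dens_irr : (N, d)    values of p(x_il | varphi_l)
    K        : number of components of the common (irrelevant) mixture
    """
    N, M, d = dens_rel.shape
    rho2 = 1.0 - rho1
    beta = rho1 * dens_rel + rho2 * dens_irr[:, None, :]
    u = post[:, :, None] * rho1 * dens_rel / beta              # relevant share
    v = post[:, :, None] * rho2 * dens_irr[:, None, :] / beta  # irrelevant share
    rel = np.maximum(u.sum(axis=(0, 1)) - 1.5 * M, 0.0)
    irr = np.maximum(v.sum(axis=(0, 1)) - 1.5 * K, 0.0)
    total = rel + irr
    safe_total = np.where(total > 0, total, 1.0)
    # Assumption: keep the previous value if both truncated sums are zero.
    return np.where(total > 0, rel / safe_total, rho1)
```

The truncation at zero is what allows a saliency to be driven exactly to 0 (feature discarded) or to 1 (common density discarded).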

The weights pj and πkl are updated as follows:

$$ p_{j}=\frac{\max\left ({\sum}_{i=1}^{N} p(j|x_{i})- \frac{3d}{2},0 \right )} {{\sum}_{m=1}^{M} \max\left ({\sum}_{i=1}^{N} p(m|x_{i})- \frac{3d}{2},0 \right )} $$
(13)
$$ \pi_{kl}=\frac{\max\left ({\sum}_{i=1}^{N} {\sum}_{j=1}^{M} p(j|x_{i}) \frac{\rho_{l2} \pi_{kl} p(x_{il}|\varphi_{kl})}{\beta_{j} (x_{il}) }-\frac{3}{2},0 \right )}{{\sum}_{k=1}^{K} \max\left ({\sum}_{i=1}^{N} {\sum}_{j=1}^{M} p(j|x_{i}) \frac{\rho_{l2} \pi_{kl} p(x_{il}|\varphi_{kl})}{\beta_{j} (x_{il}) }-\frac{3}{2},0 \right ) } $$
(14)
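The update of the mixing weights in (13) can be sketched as below. Components whose expected pixel count falls under the penalty 3d/2 receive a zero weight and are thereby removed from the mixture. The function name and the error raised when every component is annihilated are assumptions of this example.

```python
import numpy as np

def update_mixing_weights(post, d):
    """Mixing-weight update of Eq. (13) with component annihilation.

    post : (N, M) posteriors p(j | x_i)
    d    : number of features (each component has 3 parameters per feature)
    """
    counts = np.maximum(post.sum(axis=0) - 1.5 * d, 0.0)
    total = counts.sum()
    if total == 0.0:
        # Not covered by Eq. (13); flagged explicitly in this sketch.
        raise ValueError("all components were annihilated")
    return counts / total
```

For example, with d = 2 and expected counts (60, 39, 1), the third component is suppressed and the remaining weights are 57/93 and 36/93.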

For j = 1,...,M and k = 1,...,K the model’s parameters are estimated as follows. First, the mean is updated according to the following equation:

$$ \hat{\mu}_{jl}^{\theta}=\frac{{\sum}_{i=1}^{N} p(j|x_{i}) \frac{\rho_{l1} p(x_{il}|\theta_{jl}) \left | x_{il}-\mu_{jl}^{\theta} \right |^{\lambda_{jl}^{\theta}-2}}{\beta_{j} (x_{il})}x_{il}}{{\sum}_{i=1}^{N} p(j|x_{i}) \frac{\rho_{l1} p(x_{il}|\theta_{jl}) \left | x_{il}-\mu_{jl}^{\theta} \right |^{\lambda_{jl}^{\theta}-2}}{\beta_{j} (x_{il})}} $$
(15)

Then, the standard deviation parameter is computed as:

$$ \hat{\sigma}_{jl}^{\theta}= \sqrt[\lambda_{jl}^{\theta}]{\frac{{\sum}_{i=1}^{N} \frac{p(j|x_{i})\rho_{l1} p(x_{il}|\theta_{jl}) \lambda_{jl}^{\theta} A(\lambda_{jl}^{\theta}) \left | x_{il}-\mu_{jl}^{\theta} \right |^{\lambda_{jl}^{\theta}}}{\beta_{j} (x_{il})}}{{\sum}_{i=1}^{N} \frac{p(j|x_{i})\rho_{l1} p(x_{il}|\theta_{jl}) }{\beta_{j} (x_{il})}}} $$
(16)

Finally, the shape parameters \(\hat {\lambda }_{j}^{\theta }\) and \(\hat {\lambda }_{k}^{\varphi }\) are estimated using the Newton-Raphson method as follows:

$$ \hat{\lambda}_{\circ l}^{\star} \simeq \hat{\lambda}_{\circ l}^{\star}- \left [ \frac{\partial^{2} MML(M,K)}{ {\partial \hat{\lambda}_{\circ l}^{\star}}^{2} } \right ]^{-1} \left [ \frac{\partial MML(M,K)}{ \partial \hat{\lambda}_{\circ l}^{\star} } \right ] $$
(17)
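The structure of one Newton-Raphson step in (17) can be illustrated for a single scalar shape parameter. Since the analytic derivatives of the message length are not reproduced here, this sketch replaces them with central finite differences; `objective` is a hypothetical callable standing in for the message length viewed as a function of one shape parameter, and the step size `h` is an arbitrary choice.

```python
def newton_step(objective, lam, h=1e-4):
    """One Newton-Raphson update: lam - (second derivative)^-1 * (first derivative).

    Derivatives are approximated by central finite differences, which is an
    assumption of this sketch rather than the analytic form.
    """
    f_plus, f_mid, f_minus = objective(lam + h), objective(lam), objective(lam - h)
    grad = (f_plus - f_minus) / (2.0 * h)
    hess = (f_plus - 2.0 * f_mid + f_minus) / (h * h)
    return lam - grad / hess
```

On a quadratic objective a single step lands on the minimizer; for the actual message length several steps are usually needed, and a safeguard is advisable when the second derivative is not positive.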

3.3 Optimal model selection

In our case, both the relevant feature selection and the optimal model selection problems are solved with the minimum message length (MML) principle [62], which identifies the best statistical model with the least complexity. Note that the weights pj, ρl1 and πkl of unwanted components are forced to zero. The message length (MessLength) criterion is defined as follows:

$$ MessLength = - \log p({\varTheta}) + \frac{1}{2} \log(|I({\varTheta})|) + \frac{c}{2} \left( 1+ \log \frac{1}{12} \right) - \log p(\mathit{X}|{\varTheta}) $$
(18)

where p(Θ), p(X|Θ), and I(Θ) denote the prior distribution, the likelihood and the Fisher information matrix, respectively. The constant c denotes the total number of parameters. In order to facilitate the calculation of MML, we assume the independence of the different groups of parameters. This assumption allows the factorization of p(Θ) and |I(Θ)|. The Fisher information |I(Θ)| is approximated from the complete likelihood which assumes labeled observations [25]. Given that p, ρl and πl are defined on the simplexes \({(p_{1},...,p_{M}):{\sum }_{j=1}^{M-1}p_{j} < 1}\), (ρl1,ρl2) : ρl1 < 1, and \( {(\pi _{l1},...,\pi _{lK}):{\sum }_{k=1}^{K-1}\pi _{lk} < 1}\), respectively, a natural choice is the Dirichlet distribution for conjugate prior. The hyper-parameters of these distributions are set to 0.5. The latter are defined as:

$$ \begin{array}{@{}rcl@{}} && p(p)\propto \frac{1}{{\prod}_{j=1}^{M} p_{j}^{1/2} } \\ && p(\rho_{l})\propto \frac{1}{\rho_{l1}^{1/2} \rho_{l2}^{1/2} } \\ && p(\pi_{l})\propto \frac{1}{{\prod}_{k=1}^{K} \pi_{kl}^{1/2} } \end{array} $$
(19)

The Fisher information of 𝜃jl is approximated on the basis of the second derivatives of the negative log-likelihood of the l-th feature. Indeed, by discarding the first-order terms and substituting the prior and the Fisher information in (18), the message length objective to be minimized becomes:

$$ \begin{array}{@{}rcl@{}} MessLength &= &-\log p(\mathit{X}|\boldsymbol{{\varTheta}}) + \frac{c}{2} \log N +\frac{3d}{2} {\sum}_{j=1}^{M} \log p_{j}\\ &&+ \frac{3}{2} {\sum}_{l=1}^{d} {\sum}_{k=1}^{K} \log \pi_{kl}+\frac{c}{2} \left( 1+\log\frac{1}{12}\right) \\ &&+ \frac{3M}{2} {\sum}_{l=1}^{d} \log \rho_{l1}+ \frac{3K}{2} {\sum}_{l=1}^{d} \log \rho_{l2} \end{array} $$
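The final objective above can be evaluated term by term once the log-likelihood and the current weights are known. In the sketch below the total number of parameters c is passed in by the caller, and the logarithms are summed over strictly positive weights only; the latter is an assumption about how annihilated components and discarded features are handled, made so that the expression stays finite.

```python
import numpy as np

def message_length(log_lik, p, pi, rho1, N, c):
    """Message length objective of Section 3.3, evaluated term by term.

    log_lik : log p(X | Theta)
    p       : (M,)   mixing weights
    pi      : (K, d) weights of the common (irrelevant) mixtures
    rho1    : (d,)   feature saliencies, rho_l2 = 1 - rho_l1
    N       : number of pixels
    c       : total number of parameters (supplied by the caller)
    """
    M = p.size
    K, d = pi.shape

    def sum_log_positive(a):
        a = np.asarray(a, dtype=float)
        return float(np.log(a[a > 0]).sum())  # assumption: skip zero weights

    return (-log_lik
            + 0.5 * c * np.log(N)
            + 1.5 * d * sum_log_positive(p)
            + 1.5 * sum_log_positive(pi)
            + 0.5 * c * (1.0 + np.log(1.0 / 12.0))
            + 1.5 * M * sum_log_positive(rho1)
            + 1.5 * K * sum_log_positive(1.0 - rho1))
```

Candidate models (different M, K, or sets of active features) can then be compared by keeping the one with the smallest returned value.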

3.4 Proposed algorithm/framework

The developed framework and the proposed algorithm are both summarized in Fig. 1 and Algorithm 1.

Fig. 1 Flowchart of the proposed method for object detection/segmentation

figure a

To assess convergence, the log-likelihood is compared between two successive iterations t and (t + 1). If the condition \(||\log p(X|{\varTheta }_{t+1}) - \log p(X|{\varTheta }_{t}) || < \epsilon \) is not satisfied, the process is re-iterated from the E-step, where 𝜖 is a predefined threshold.
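The stopping rule can be sketched as a small driver loop. Here `em_step` is a hypothetical callable that performs one E-step and one M-step and returns the updated parameters together with the new log-likelihood; the default threshold and the iteration cap are illustrative values.

```python
def run_until_converged(em_step, theta0, eps=0.01, max_iter=200):
    """Iterate em_step until the log-likelihood changes by less than eps.

    em_step(theta) -> (new_theta, log_likelihood) is assumed to be supplied
    by the caller. Returns the last parameters, the last log-likelihood and
    the number of iterations performed.
    """
    theta, previous = theta0, None
    for iteration in range(1, max_iter + 1):
        theta, log_lik = em_step(theta)
        if previous is not None and abs(log_lik - previous) < eps:
            break
        previous = log_lik
    return theta, log_lik, iteration
```

The iteration cap guards against slow or oscillating runs in which the threshold is never reached.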

4 Object tracking with BGGMM and a variational-based approach

The purpose of this step is to detect quickly and accurately the contour of the object of interest in a sequence of images. Thus, we propose to apply a variational active contour controlled by an effective speed function (level set function). This function is derived from both local and global information. It is noteworthy that variational approaches have been widely explored [7, 12, 64]. Unlike classical edge detection algorithms, variational models constitute a suitable framework that can combine heterogeneous information (e.g. local and global) and offer an effective geometrical representation for image analysis. One of the most popular such models is the “level-set” [59]. The level-set formulation avoids any explicit parameterization and handles changes in topology easily. In addition, this approach has been shown to be flexible, especially in shape modeling and object tracking, which makes it a good choice for shape segmentation and tracking. Detailed proofs regarding the level-set principle are not given here; the reader can refer to [59].

The key idea behind this approach is to embed the displacement of a 2D curve into the motion of a 3D surface. The shape of the 2D object (the front Γ(t)) is represented by the zero level set of a function ϕ, which is evolved by solving the following PDE:

$$ \frac{\partial \phi}{\partial t} = F.|\nabla \phi| $$
(20)

where F is the speed function, which depends on the local geometric curvature k, and ∇ denotes the gradient operator.

In the literature, several level-set based speed functions (also called evolution equations) have been developed. They can be categorized into three main classes: edge-based, region-based, and prior-based. The main difficulties when using level sets are the dependency on an accurate initialization step (i.e. an adequate initial active contour) and the choice of a robust speed function that guarantees the convergence of the deformable model to the optimal solution. In this work, we deal with these two subproblems by taking as initial active contour the one obtained with our statistical BGGMM+FS. The prior segmentation with BGGMM+FS provides an initial contour (C) of the region of interest (ROI) for the variational model. Then, the object boundaries obtained in the first step are tracked on each frame of a given sequence (X) by using the robust level-set model proposed by Chan and Vese [16]. The proposed scenario for object tracking is depicted in Fig. 2, where BGGMM+FS and the level-set approach are used in a cooperative scheme to accurately detect moving objects. Indeed, after a certain number p of frames, a boundary verification step is carried out with the statistical model BGGMM+FS. This allows us to correct the errors introduced by the variational model. As a result, the parameters C1 and C2 (from the level-set equation (21)) are updated.

Fig. 2 Tracking step based on BGGMM+FS and level-set approaches

It is also noted that the advantage of applying the variational model is to speed up the online tracking process and to maintain high precision since, thanks to the accurate initialization-step, only a small number of iterations is needed to detect the boundaries of the object of interest. The used variational model is formulated by minimizing an energy functional which is a particular case of the Mumford-Shah formulation [50]. This function is defined as:

$$ E = \lambda_{1} {\int}_{inside(C)}|X - C_{1}|^{2} dX + \lambda_{2} {\int}_{outside(C)}|X - C_{2}|^{2} dX $$
(21)

where C1 and C2 are two constants: C1 is the average intensity inside the region delimited by the initial contour and C2 is the average intensity outside this region. The variational level set is then reformulated as:
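For a discrete image, the two region averages and the fitting energy can be computed directly from a binary mask of the region enclosed by the contour. The sketch below is a discrete approximation in which the integrals become sums over pixels; the mask representation, the function name and the unit default weights are assumptions of this example, and both regions are assumed to be non-empty.

```python
import numpy as np

def region_means_and_energy(image, inside, lam1=1.0, lam2=1.0):
    """Region averages C1, C2 and the discrete two-region fitting energy.

    image  : 2D array of intensities
    inside : boolean mask, True for pixels enclosed by the contour C
    """
    image = np.asarray(image, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    c1 = image[inside].mean()    # average intensity inside the contour
    c2 = image[~inside].mean()   # average intensity outside the contour
    energy = (lam1 * np.sum((image[inside] - c1) ** 2)
              + lam2 * np.sum((image[~inside] - c2) ** 2))
    return c1, c2, energy
```

The energy is zero exactly when each region is perfectly homogeneous, which is why a good initial contour from the mixture model leaves little work for the variational step.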

$$ \frac{\partial \phi}{\partial t} = \delta(\phi) \left[\mu div \left( \frac{\nabla \phi}{|\nabla \phi|} \right) - \nu \right] + \delta(\phi) \left[ - \lambda_{1}(X - C_{1})^{2} + \lambda_{2}(X - C_{2})^{2} \right] $$
(22)

where δ(x) is the derivative of the Heaviside function (i.e. the Dirac mass function), and μ ≥ 0, ν ≥ 0, λ1 > 0, λ2 > 0 are fixed parameters. ν amplifies the propagation speed, λ1 and λ2 weight the image forces inside and outside the contour, and μ controls the smoothness of the level set model.
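One explicit time step of (22) can be sketched with finite differences. Several choices here are assumptions made for illustration: the interior of the contour is taken as ϕ > 0, the Dirac function is replaced by a smoothed approximation of width `eps`, the curvature is computed with `numpy.gradient`, and the parameter defaults are arbitrary.

```python
import numpy as np

def chan_vese_step(phi, image, mu=0.2, nu=0.0, lam1=1.0, lam2=1.0, dt=0.5, eps=1.0):
    """One explicit update of the level-set function phi following Eq. (22)."""
    inside = phi > 0  # sign convention assumed in this sketch
    c1 = image[inside].mean() if inside.any() else 0.0
    c2 = image[~inside].mean() if (~inside).any() else 0.0

    # Curvature term: div(grad(phi) / |grad(phi)|)
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    d_ny_dy, _ = np.gradient(gy / norm)
    _, d_nx_dx = np.gradient(gx / norm)
    curvature = d_nx_dx + d_ny_dy

    delta = (eps / np.pi) / (eps ** 2 + phi ** 2)  # smoothed Dirac function
    force = (mu * curvature - nu
             - lam1 * (image - c1) ** 2
             + lam2 * (image - c2) ** 2)
    return phi + dt * delta * force
```

Starting from a contour that already encloses the object, as provided by the mixture model, the region terms quickly push background pixels out of the contour while object pixels stay inside.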

5 Experiments and results

Our purpose here is to evaluate the effectiveness of the proposed method, the “bounded generalized Gaussian mixture model with feature selection”, which we refer to as BGGMM+FS. Tracking results obtained using BGGMM+FS and level set (LS) are also presented. We compare the obtained results with those of other models and methods such as GMM, GGMM and BGMM. The experiments were carried out on several real-world color-textured images. To deal with the convergence issue, we use two criteria: a threshold value (set to 0.01) on the difference between the parameters of two successive iterations, and a maximum number of iterations.

5.1 Experiment 1: color-texture image segmentation

The images and their ground truth used in this section come from the well-known Berkeley benchmark [45]. Each pixel (i,j) is modelled by a feature vector x(i,j) that includes both color and texture characteristics. We opt here for 19 features: 3 color features calculated from the RGB color space, and 16 features that describe the texture content of the image. The latter are obtained from the color correlogram matrix (CC) [32].

Each entry of this matrix gives the probability that a pixel x2 at distance d and orientation 𝜃 from a pixel x1 has color c_j, given that x1 has color c_i (the orientation is written ϕ in the formulas below). In our experiments, we calculated the correlogram matrix for 4 different orientations, 𝜃 ∈ {0, π/4, π/2, 3π/4}. Texture features are evaluated as follows [2]:

  • Energy (EN)

    $$ EN(d,\phi)={\sum}_{c_{i},c_{j}}(C^{d,\phi}(c_{i};c_{j}))^{2} $$
  • Entropy (ET)

    $$ ET(d,\phi)={\sum}_{c_{i},c_{j}}-C^{d,\phi}(c_{i};c_{j}) \log(C^{d,\phi}(c_{i};c_{j})) $$
  • Inverse-Difference-Moment (IDM)

    $$ IDM(d,\phi)={\sum}_{c_{i},c_{j}}\frac{1}{1+\left \| c_{i}-c_{j} \right \|^{2}}C^{d,\phi}(c_{i};c_{j}) $$
  • Correlation (C)

    $$ C(d,\phi)={\sum}_{c_{i},c_{j}}\frac{(c_{i}-M_{x})(c_{j}-M_{y})^{T}}{\left | {\sum}_{x} \right | \left | {\sum}_{y} \right |}C^{d,\phi}(c_{i};c_{j}) $$

    where

    $$ \begin{array}{@{}rcl@{}} &&M_{x}(d,\phi)={\sum}_{c_{i}}c_{i}{\sum}_{c_{j}}C^{d,\phi}(c_{i};c_{j}),\\ &&M_{y}(d,\phi)={\sum}_{c_{j}}c_{j}{\sum}_{c_{i}}C^{d,\phi}(c_{i};c_{j}),\\ &&{\sum}_{x}(d,\phi)={\sum}_{c_{i}}(c_{i}-M_{x})^{T}(c_{i}-M_{x}) {\sum}_{c_{j}}C^{d,\phi}(c_{i};c_{j}),\\ &&{\sum}_{y}(d,\phi)={\sum}_{c_{j}}(c_{j}-M_{y})^{T}(c_{j}-M_{y}) {\sum}_{c_{i}}C^{d,\phi}(c_{i};c_{j}),\\ &&\left | {\sum}_{x} \right |~\text{and}~\left | {\sum}_{y} \right |~\text{are the determinants of the matrices}~{\sum}_{x}~\text{and}~{\sum}_{y}. \end{array} $$
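The four texture features above can be illustrated with a simplified sketch that uses scalar quantised levels instead of color vectors, so that the determinants in the correlation term reduce to scalar variances. The function names, the quantisation into `n_levels` levels and the pixel offsets chosen for each orientation are assumptions made for this example. Evaluating the four features for the four orientations would give 4 × 4 = 16 values, which is one plausible reading of the 16 texture features mentioned earlier.

```python
import numpy as np

# Offsets (dy, dx) for theta = 0, pi/4, pi/2, 3pi/4 at distance d = 1
# (image rows grow downwards); an assumed discretisation of the orientations.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]

def cooccurrence(levels, dy, dx, n_levels):
    """Normalised co-occurrence matrix of integer levels for the offset (dy, dx)."""
    h, w = levels.shape
    counts = np.zeros((n_levels, n_levels))
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            counts[levels[y, x], levels[y + dy, x + dx]] += 1
    return counts / counts.sum()

def texture_features(C):
    """Energy, entropy, inverse difference moment and correlation of C."""
    n = C.shape[0]
    ci, cj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    energy = np.sum(C**2)
    nz = C > 0  # skip empty cells so that 0 * log(0) is treated as 0
    entropy = -np.sum(C[nz] * np.log(C[nz]))
    idm = np.sum(C / (1.0 + (ci - cj)**2))
    mx, my = np.sum(ci * C), np.sum(cj * C)
    sx, sy = np.sum((ci - mx)**2 * C), np.sum((cj - my)**2 * C)
    # In the scalar case the determinants |Sigma_x|, |Sigma_y| reduce to sx, sy.
    corr = np.sum((ci - mx) * (cj - my) * C) / (sx * sy) if sx * sy > 0 else 0.0
    return energy, entropy, idm, corr
```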

Quantitative results are obtained on 500 images from the Berkeley segmentation database (BSD) [6]. All images come with manual segmentations for validation purposes. Performance is evaluated using the following metrics: accuracy, sensitivity, specificity, recall, MCC (Matthews correlation coefficient), F1-measure and the boundary displacement error (BDE) [26]. These measures are commonly used by the image segmentation research community to assess segmentation output. Figure 4 shows the segmentation output for some images selected from the Berkeley database.
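For reference, the confusion-matrix metrics listed above can be computed as in the sketch below for a single foreground/background labelling. The function name and the convention of returning 0 when a denominator is empty are assumptions; how the counts are aggregated over regions and images in the experiments is not stated in the text. Note that sensitivity and recall denote the same quantity in this binary setting.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics from pixel counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # convention assumed for empty classes

    sensitivity = ratio(tp, tp + fn)      # also called recall
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn,
                     math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }
```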

The convergence test is based on the stabilization of the parameters and of the log-likelihood function. We did not observe any convergence problem with the EM algorithm, although convergence to a global maximum is of course not guaranteed, which is a common limitation of EM. Fig. 3, which plots the log-likelihood against the number of iterations, shows that the log-likelihood changes little after a certain number of iterations. Thus, after 20 iterations, the log-likelihood stabilizes and the learning algorithm converges (Fig. 4).

Fig. 3 The log-likelihood as a function of the number of iterations: the log-likelihood stabilizes after 20 iterations and the learning algorithm then converges

Fig. 4 Image segmentation results. First row: ground truth; second row: GMM; third row: GMM+FS; fourth row: GGMM; fifth row: GGMM+FS; sixth row: BGGMM; seventh row: BGGMM+FS

A comparative study is also given in Tables 2 and 3, which report the average performance on the Berkeley benchmark database. Several conclusions can be drawn. First, BGGMM+FS offers very encouraging results: it outperforms the other conventional Gaussian-based models and other methods from the literature (Table 3). Furthermore, both BGGMM+FS and GGMM+FS offer better accuracy than their counterparts without feature selection, which shows the importance of considering only relevant features in order to improve segmentation accuracy. According to Tables 2 and 3, the accuracy values are 92.10% for GMM, 93.26% for GMM+FS, 93.87% for GGMM, 95.03% for GGMM+FS, 94.98% for BGGMM, and 96.68% for BGGMM+FS. In each case, integrating a feature selection step into the mixture model yields better performance than omitting it.

Table 2 Average performance metrics for the Berkeley dataset generated by different Gaussian-based models
Table 3 Comparative study between different algorithms from the state of the art on the Berkeley Segmentation Database

5.2 Experiment 2: object detection

In this experiment, we focus on extracting a region of interest from an image. To this end, a series of experiments is performed on the “Microsoft Common Objects in COntext (MS-COCO)” dataset [40]. The COCO dataset contains more than 200K images and 91 common object categories, 82 of which have more than 5000 labeled instances. Some samples are given in Fig. 5. The dataset is composed of more than 100K images for training, 5000 images for validation and about 40K for testing. We perform our experiments on the training subset, denoted \(train\_2017\). The algorithm is run on several sets of images chosen randomly from this training subset. A comparative study was also carried out on this dataset: we compared the performance of BGGMM+FS against some relevant state-of-the-art methods using the mean Average Precision (mAP) metric, which is commonly used for object detection problems. The results for the different methods are provided in Table 4. According to this table, our method is very competitive and outperforms some of the other methods. We attribute this to the generative nature of the developed approach, which allows more flexibility and interpretability of the results. It is noteworthy that the proposed model has the advantage of taking into account the compactly supported nature of the data. Moreover, the integration of the feature selection process in the generative model enables us to accurately distinguish and identify different objects of interest.

Fig. 5 Sample images from the COCO dataset

Table 4 Comparative study using the mean average precision (mAP) scores for different object detection methods based on COCO dataset

5.3 Experiment 3: object tracking

In this experiment, we focus on identifying a region of interest (ROI) in a sequence of images. For each sequence, the first frame is segmented using the BGGMM+FS model; the obtained segmentation is then used as the initial contour for the level set and is evolved according to the level-set function in order to detect the boundaries of the same ROI in the other frames. Subsequently, the result for the current frame serves as an accurate initialization (initial contour) for the following frame in the sequence, and so on. Some results for object tracking are depicted in Figs. 6, 7, 8, 9 and 10. The first row of each result presents the initial detection of the ROI and the second row presents the tracking of the ROI using the C-V level-set model for different frames.
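The frame-to-frame propagation described above can be summarised by the following sketch. The function and argument names are hypothetical: `segment_first_frame` stands for the BGGMM+FS segmentation of the first frame and `evolve_contour` for the level-set evolution on a new frame, both supplied by the caller.

```python
def track(frames, segment_first_frame, evolve_contour):
    """Propagate a contour through `frames`, using each result as the
    initialisation for the next frame."""
    frames = iter(frames)
    first = next(frames)
    contour = segment_first_frame(first)           # initial detection of the ROI
    results = [contour]
    for frame in frames:
        contour = evolve_contour(contour, frame)   # refine on the new frame
        results.append(contour)
    return results
```

Because each frame starts from the contour found on the previous one, the evolution begins close to the object boundary, which is why only a few iterations per frame are expected to be needed.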

Fig. 6 Result 1 for object tracking. First row: initial detection of the region of interest (ROI) (from left to right: first original frame, classification of the frame with BGGMM+FS, and the detected ROI). Second row: tracking of the ROI using the C-V level-set model for different frames

Fig. 7 Result 2 for object tracking. First row: initial detection of the region of interest (ROI) (from left to right: first original frame, classification of the frame with BGGMM+FS, and the detected ROI). Second row: tracking of the ROI using the C-V level-set model for different frames

Fig. 8 Result 3 for object tracking. First row: initial detection of the region of interest (ROI) (from left to right: first original frame, classification of the frame with BGGMM+FS, and the detected ROI). Second row: tracking of the ROI using the C-V level-set model for different frames

Fig. 9 Result 4 for object tracking. First row: initial detection of the region of interest (ROI) (from left to right: first original frame, classification of the frame with BGGMM+FS, and the detected ROI). Second row: tracking of the ROI using the C-V level-set model for different frames

Fig. 10 Result 5 for object tracking. First row: initial detection of the region of interest (ROI) (from left to right: first original frame, classification of the frame with BGGMM+FS, and the detected ROI). Second row: tracking of the ROI using the C-V level-set model for different frames

Qualitative results are obtained on the LASIESTA dataset, which contains many real image sequences with their corresponding ground truth [21]. Quantitative evaluation uses the accuracy and the boundary displacement error (BDE) metrics. The accuracy measures the proportion of correctly labelled pixels over all pixels, and the BDE measures the displacement error between segmentation boundaries [26]. The obtained values are reported in Tables 5 and 6. A comparative study with other approaches is also provided in Table 7. According to these results, using a variational active contour for object tracking, with an accurate initialization provided by the BGGMM+FS model, makes it possible to keep an accurate track of the object of interest (OOI) even if the topology of the object changes over time. Indeed, the average accuracy is more than 93% for all sequences. Furthermore, the boundary displacement error values are very encouraging, thanks to the use of the level-set formalism and the BGGMM+FS model. These results show the merit of using a variational approach initialized by the BGGMM+FS model for robust object tracking.
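The two measures can be sketched as follows for a pair of label maps. This is an illustrative reading of the metrics rather than the evaluation code used for the tables: the boundary definition (a pixel whose label differs from its right or lower neighbour) and the symmetric averaging of the two directions in the BDE are assumptions, and the function names are hypothetical.

```python
import numpy as np

def boundary_pixels(labels):
    """Coordinates of pixels whose label differs from the right or lower neighbour."""
    edge = np.zeros(labels.shape, dtype=bool)
    edge[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    edge[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return np.argwhere(edge)

def pixel_accuracy(pred, truth):
    """Proportion of pixels whose predicted label matches the ground truth."""
    return float(np.mean(pred == truth))

def boundary_displacement_error(pred, truth):
    """Mean distance from each boundary pixel to the closest boundary pixel
    of the other label map, averaged over both directions."""
    a, b = boundary_pixels(pred), boundary_pixels(truth)
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both label maps need at least one boundary pixel")
    # Pairwise Euclidean distances between the two boundary point sets.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

The pairwise distance matrix is quadratic in the number of boundary pixels, which is acceptable for a sketch but would typically be replaced by a distance transform on full-size frames.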

Table 5 Quantitative results for five different sequences produced by BGGMM+FS with C-V level-set
Table 6 Average metrics for different video sequences segmentation models
Table 7 Comparative study with some methods from the state of the art

6 Discussion

In this paper, a flexible and robust learning model followed by a post-processing step is proposed. The latter is implemented with a variational active contour, for both image/video sequence segmentation and object tracking. Our main purpose is to improve these tasks by exploiting the flexibility of bounded models such as the bounded generalized Gaussian mixture model and by incorporating a feature selection mechanism. For tracking, we use active contours via the well-known level set approach. The complexity of our method is about O(NM), where N is the number of pixels in the treated image and M is the number of components. The first main contribution of the current work is thus to tackle the segmentation problem with a flexible bounded statistical model, given that unbounded models are not the most appropriate approximation for data modelling and segmentation. The second main advantage of the proposed work is the feature selection mechanism, which removes irrelevant features (i.e., noise) that make the detection of the real regions more difficult. The obtained results support these assumptions, and more accurate results are obtained compared to the state of the art. However, the conventional EM algorithm suffers from its dependency on initialization and its convergence to local maxima. To overcome this shortcoming, we plan to replace it with the ECM (Expectation/Conditional Maximization) algorithm [9], in which the complicated M-step of EM is replaced with several computationally simpler CM-steps. Moreover, the feature extraction step can be improved by considering other types of visual and spatio-temporal features. We also plan to consider other datasets for validation purposes.

7 Conclusion

We have developed, in the current work, an effective framework for the segmentation of color images and image sequences and for object tracking. The main goal was to investigate the flexibility of the proposed bounded model combined with a feature selection mechanism (BGGMM+FS) for image segmentation and object detection and tracking. The choice of the model is motivated by its high flexibility for multidimensional data modelling and its ability to integrate a feature selection mechanism. The developed statistical model is followed by a post-processing step, implemented with a variational active contour, for tracking a particular object of interest in a sequence of color images. The learning model is based on the expectation-maximization algorithm and takes into account the minimum message length (MML) criterion. The validation process is carried out through an extensive series of experiments, and the final results show high accuracy for both segmentation and tracking. BGGMM+FS offers better capabilities than the conventional generative models.

Future work could be devoted to improving the feature selection mechanism by taking into account other visual (and spatio-temporal) local and/or global features for both images and video sequences. Another direction is to develop a unified framework that integrates the statistical model and the variational model into the same formalism. Moreover, the learning approach may be improved by following a Bayesian approximation instead of a frequentist strategy, in order to overcome the convergence to local maxima. Finally, it is possible to develop an online algorithm based on the proposed mixture model in order to track a specific object in real time.