1 Introduction

With the rapid growth of internet images, Content-Based Image Retrieval (CBIR) has become one of the most important research themes in computer vision. A CBIR system searches image repositories for a subset of images that are visually similar to a given query image and displays the retrieved results. In most CBIR systems, feature descriptors, which can be tailored to the user's purposes, play a vital role in reflecting the image content. The low-level image features widely adopted by CBIR systems consist mainly of colour, spatial position, texture, scene, and object shape. A single feature inevitably yields poor retrieval results because an individual feature can hardly describe an image comprehensively. To overcome this issue, multiview feature fusion schemes have been proposed as an alternative to a single type of visual feature, and they have attracted increasing research attention [24, 37, 41]. CBIR algorithms rely heavily on image descriptors and on a good similarity measure between images. The resulting similarity matrices are often high-dimensional and non-sparse. The Nonnegative Matrix Factorization (NMF) method and its variants have proved particularly successful in addressing dimensionality-reduction problems by offering a sparse description of the original high-dimensional data [15]. In essence, NMF seeks two low-rank nonnegative matrices, called the basis matrix and the coefficient matrix, whose product closely approximates the given matrix. The coefficient matrix of the fused multiview features is very low-dimensional, so any clustering algorithm can then be applied to it to associate each image with a cluster.
The finite mixture model makes the model-based clustering strategy attractive for CBIR systems [44] because it allows the properties of the NMF coefficient matrix to be analysed in a probabilistic manner and the features to be clustered according to the parameters of the mixture model.

In this paper, we introduce a novel scheme, which incorporates constrained multiview NMF and Student’s t-Mixture Model (SMM) based on a Markov random field (MRF), for image retrieval. This method is termed SCMN, which has the following characteristics.

First, the proposed framework represents an image in a multidimensional feature space. The extracted underlying features, including texture, colour, spatial information, rotation-invariance, and scene, are merged by a Gaussian-like heat kernel to obtain a similarity-preserving matrix. We realized that little attention has been paid to rotation-invariance in image retrieval applications. However, real imaging systems are generally imperfect, and the obtained image usually presents a degraded version of the original one. Therefore, we think that rotation-invariance, i.e., remaining invariant under rotation, is a useful feature that deserves attention.

Second, we develop a constrained multiview NMF scheme by incorporating multiple constraints into the original NMF to describe image features in a sparse space. Specifically, we add a structural constraint to the objective function to preserve local structure and use it to guide the matrix factorization. In addition, the proposed model enforces the \( L_{1/2} \) sparsity constraint on the coefficient matrix and attempts to utilize the sparsity of the feature space as much as possible. The \( L_{1/2} \) constraint has been reported to outperform other sparsity constraints because it can exploit the inherent sparseness of the data [38]. A farness-preserving constraint is also included in the objective function to preserve the data distribution in the objective space. Additionally, the paper analyses the convergence of the update rules theoretically to ensure that the objective function converges to a local minimum.

Third, the proposed SCMN utilizes a multivariable Student’s t-mixture model based on Markov Random Field (SMM-MRF) to approximate different shapes of sparse features. The SMM is highly acclaimed for its accuracy and effectiveness in image clustering [29]. In our scheme, images with sparse features belonging to the same Student’s t component are similar, and these images can be grouped in the same category. Another key idea behind the proposed scheme is to adequately consider the spatial information of multiview features because the MRF is incorporated into SMM. The model parameters are estimated by adopting Expectation Maximization (EM). With the label information of the query image using the mentioned SMM-MRF, we can obtain a subset of images that are visually similar to a given query in terms of the multivariable clustering results.

The rest of this paper is organized as follows. Section 2 presents the related works and background. Section 3 introduces the proposed approach, its update rules and some computational aspects. The theoretical proof of convergence and the computational complexity are discussed in Section 4. Section 5 reports a comparative study on several standard datasets. The last section presents the concluding remarks.

2 Related works and background

2.1 Related works

Most image retrieval algorithms rely on low-level image features to compare images by visual similarity. These low-level features, which represent visual content, are easy to implement and extract. Numerous low-level feature descriptors have been reported and adopted in early works [33]. Deselaers et al. [11] conducted CBIR experiments with a large number of different low-level image features. Different features differ in how effectively they describe the same category of images; therefore, feature fusion is needed to select and combine basic features. Recently, there have been increasing efforts to develop multiview feature fusion schemes. For example, Liu et al. [24] studied Multiview Alignment Hashing (MAH) for indexing images. To preserve the geometric structure of a motion image, Wang et al. [39] developed a multiview Laplacian graph via a linear regression model, along with multiview spectral embedding. An et al. [3] developed discriminative image features with encoded attribute information to achieve more accurate image retrieval. A basic requirement of CBIR is to decrease the redundancy of multiview features and to explore them in a relatively low-dimensional feature space. The popular strategy for this problem is to use dimensionality-reduction techniques; typical methods include Singular Value Decomposition (SVD) [20], multidimensional scaling [8], Principal Component Analysis (PCA) [43], Independent Component Analysis (ICA) [18], and NMF. In the NMF model, both the basis matrix and the coefficient matrix are nonnegative; therefore, NMF suits applications such as image processing, where the data are nonnegative by nature. In addition, the coefficient matrix of NMF can model the features of images as an additive combination of a set of basis vectors.
Compared with ICA and PCA, another advantage of NMF is that it can be designed to capture intrinsic structures of the sample data by introducing different constraints into the basic NMF algorithm. In practice, the basic factorization usually does not satisfy the requirement of a sparse representation of features. Numerous specialized NMF-based methods have recently been introduced by modifying the objective function or by enforcing additional constraints [42]. For example, FNMF, presented by Babaee et al. [4], adds a constraint to the classical NMF so that points that are far apart remain far apart in the new space. Rajabi and Ghassemian [35] introduced multilayer NMF, where a sparseness constraint is applied to improve the performance of NMF. Babaee et al. [5] employed the Laplacian of neighbourhood graphs to develop a graph-regularized NMF algorithm, which can preserve the locality characteristics of the underlying image. Other popular variants of NMF include Liu’s semi-supervised NMF [22] and Wang’s \( L_{1/2} \)-NMF [38].

Recently, some works based on finite mixture models have been reported in the field of image retrieval; they fit different shapes of feature data using a multivariable probability distribution [32]. Amin et al. [2] showed that a Laplacian mixture model can model the distribution of wavelet coefficients and demonstrated its superiority for video retrieval. The Gaussian Mixture Model (GMM) is another efficient method that is widely applied in clustering tasks. The GMM-KL framework, proposed by Greenspan and Pinhas [16], categorizes medical images through GMM along with image matching using the Kullback-Leibler (KL) measure. In addition, the probabilistic relevance feedback method presented by Marakakis et al. [26] for CBIR also employs GMM. Piatek and Smolka [32] proposed an image retrieval scheme that uses GMM and considers the spatio-chromatic similarity between two images more accurately. The merit of GMM is that it can efficiently model uncertainty with a few parameters, in addition to being easy to implement.

Building on the aforementioned survey, we present a new framework in Section 3. It differs from existing work in that it incorporates constrained multiview NMF and a Student’s t-mixture model based on MRF. Specifically, this paper presents Student’s t-distribution-based retrieval and similarity ranking in the retrieval phase.

2.2 Background

This subsection begins by reviewing the techniques that are most related to the proposed scheme, namely, the classical SMM and NMF.

(1) Student’s t-mixture model

Let x i , i = (1, 2,  … , N), with dimension \( \overline{D} \), denote an observation at the i-th pixel of an image. To partition an image consisting of N pixels into K labels, SMM assumes that the observations x i are statistically independent given their labels. The density function at an observation x i is defined by:

$$ f\left(\left.{x}_i\right|\Pi, \Xi \right)=\sum_{j=1}^K{\pi}_{ij}S\left({x}_i|{\Xi}_j\right), $$
(1)

where Π = {π ij } , j = (1, 2,  … , K), is the set of prior distributions, and the prior distribution π ij of observation x i belonging to the jth label should satisfy the following constraints:

$$ {\pi}_{ij}\ge 0\kern0.75em \mathrm{and}\kern0.75em \sum_{j=1}^K{\pi}_{ij}=1. $$
(2)

Each Student’s t-distribution S(x i | Ξ j ) has its own parameters \( {\Xi}_j=\left\{{\mu}_j,{\Sigma}_j,{\overline{\nu}}_j\right\} \) and is defined by [31]

$$ S\left({x}_i|{\Xi}_j\right)=\frac{\Gamma \left(\frac{{\overline{\nu}}_j+\overline{D}}{2}\right){\left|{\Sigma}_j\right|}^{-\frac{1}{2}}}{{\left(\pi {\overline{\nu}}_j\right)}^{\frac{\overline{D}}{2}}\Gamma \left(\frac{{\overline{\nu}}_j}{2}\right){\left[1+{{\overline{\nu}}_j}^{-1}{\left({x}_i-{\mu}_j\right)}^T{\Sigma}_j^{-1}\left({x}_i-{\mu}_j\right)\right]}^{\frac{{\overline{\nu}}_j+\overline{D}}{2}}}, $$
(3)

where Σ j is the covariance matrix, ∣Σ j ∣ denotes the determinant of Σ j , the vector μ j denotes the mean, \( {\overline{\nu}}_j \) is the number of degrees of freedom, and (·)T denotes the transpose. The Gamma function, Γ(t), is defined by

$$ \Gamma (t)={\int}_0^{+\infty }{s}^{t-1}{e}^{-s}ds=\left(t-1\right)\Gamma \left(t-1\right). $$
(4)

If t is a positive integer, then Γ(t) = (t − 1)!. The Gamma function can also be evaluated numerically, for example with the MATLAB function gamma. The log-likelihood function of the density function, f(x i | Π, Ξ), can be expressed as

$$ L\left(\Xi \right)= \log \prod_{i=1}^Nf\left({x}_i|\Pi, \Xi \right). $$
(5)

Finally, the log-likelihood function (5) must be maximized to estimate the model parameters.
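
As an illustration of (1), (3) and (5), the following minimal Python sketch evaluates the Student’s t log-density and the mixture log-likelihood. It assumes a diagonal covariance matrix Σ j (a simplification made only for this sketch, so that no matrix inversion is needed); the function names `student_t_logpdf` and `mixture_log_likelihood` are hypothetical and not part of the original method.

```python
import math

def student_t_logpdf(x, mu, var, nu):
    """Log of Eq. (3), assuming a DIAGONAL covariance (sketch-only assumption).

    x, mu: sequences of length D; var: the diagonal of Sigma_j;
    nu: degrees of freedom.
    """
    d = len(x)
    # Mahalanobis distance; with a diagonal Sigma it is a weighted sum of squares.
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    log_det = sum(math.log(vi) for vi in var)  # log |Sigma|
    return (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
            - 0.5 * log_det - 0.5 * d * math.log(math.pi * nu)
            - 0.5 * (nu + d) * math.log1p(maha / nu))

def mixture_log_likelihood(xs, priors, components):
    """Eq. (5): sum_i log sum_j pi_ij * S(x_i | Xi_j).

    priors[i][j] is pi_ij; components[j] is a (mu, var, nu) tuple.
    """
    total = 0.0
    for x, pi_i in zip(xs, priors):
        total += math.log(sum(p * math.exp(student_t_logpdf(x, *c))
                              for p, c in zip(pi_i, components)))
    return total
```

For D̄ = 1 and one degree of freedom the density reduces to the Cauchy distribution, which gives a convenient sanity check: `math.exp(student_t_logpdf([0.0], [0.0], [1.0], 1.0))` equals 1/π up to rounding.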

(2) Overview of standard NMF

NMF attempts to seek two low-rank nonnegative matrices, \( U=\left[{U}_{id}\right]\in {\mathbb{R}}_{+}^{M\times \overline{D}} \)and \( V=\left[{V}_{dj}\right]\in {\mathbb{R}}_{+}^{\overline{D}\times N} \), to approximately describe an observation matrix,\( X=\left[{X}_{ij}\right]\in {\mathbb{R}}_{+}^{M\times N} \). Mathematically, the standard NMF can be formulated as:

$$ X\approx UV, $$
(6)

where V is customarily called the coefficient matrix of X projected on the basis matrix U. In practice, the inner dimension \( \overline{D} \) is always chosen such that \( \overline{D}\ll \min \left(M,N\right) \). Obviously, this factorization leads to a compressed representation of the original matrix X. To convert the NMF process into an optimization problem, the Euclidean metric between X and UV has been popularly utilized.

$$ \underset{U,V}{ \arg \min }{||X-UV||}_F^2\kern1em \mathrm{s}.\mathrm{t}.\kern0.5em U,V\ge 0 $$
(7)

where ||⋅|| F denotes the Frobenius norm. Thus far, a variety of strategies have been developed to search for a local minimum [19]. Lee and Seung [21] introduced the well-known multiplicative update rule (MUR) to solve the optimization problem. When one of the matrices is fixed, the other can be updated by the following expressions.

$$ {U}_{id}\leftarrow {U}_{id}\frac{{\left({XV}^T\right)}_{id}}{{\left({UVV}^T\right)}_{id}}, $$
(8)
$$ {V}_{dj}\leftarrow {V}_{dj}\frac{{\left({U}^TX\right)}_{dj}}{{\left({U}^TUV\right)}_{dj}}. $$
(9)
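
The multiplicative updates (8) and (9) can be sketched in a few lines of plain Python. This is only an illustrative implementation: matrices are stored as lists of rows, the random initialization, the iteration count and the small constant `eps` that guards against division by zero are assumptions of the sketch, and the helper names are hypothetical.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    bt = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(x, u, v):
    """Squared Frobenius norm ||X - UV||_F^2 of Eq. (7)."""
    uv = matmul(u, v)
    return sum((p - q) ** 2 for xr, rr in zip(x, uv) for p, q in zip(xr, rr))

def nmf_mur(x, rank, iters=200, seed=0, eps=1e-12):
    """Standard NMF by multiplicative updates; returns U, V and the error history."""
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    # Strictly positive random initialization (an assumption of this sketch).
    u = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    v = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    errors = [frobenius_error(x, u, v)]
    for _ in range(iters):
        # Eq. (8): U <- U * (X V^T) / (U V V^T), element-wise.
        vt = transpose(v)
        num, den = matmul(x, vt), matmul(u, matmul(v, vt))
        u = [[u[i][d] * num[i][d] / (den[i][d] + eps) for d in range(rank)]
             for i in range(m)]
        # Eq. (9): V <- V * (U^T X) / (U^T U V), element-wise.
        ut = transpose(u)
        num, den = matmul(ut, x), matmul(matmul(ut, u), v)
        v = [[v[d][j] * num[d][j] / (den[d][j] + eps) for j in range(n)]
             for d in range(rank)]
        errors.append(frobenius_error(x, u, v))
    return u, v, errors
```

Because every factor in the updates is nonnegative, U and V stay nonnegative throughout, and the recorded error history can be used to monitor convergence.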

3 Proposed framework

This section describes the proposed SCMN, which consists of four modules: underlying visual features, constrained multiview NMF, SMM-MRF clustering, and similarity ranking. Fig. 1 illustrates the overall flowchart of the SCMN scheme, which involves a learning phase and a retrieval phase. The learning phase consists of three main parts: feature extraction and fusion, the proposed NMF, and SMM-MRF clustering. The proposed NMF and SMM-MRF clustering are the crucial steps for refining the retrieval results. The retrieval phase contains three major components: (1) similarity measurement, (2) sparse query, and (3) probability-based retrieval and similarity ranking. The main contribution of the retrieval phase lies in the last component. The following subsections present the implementation details of each part illustrated in Fig. 1.

Fig. 1 Flowchart of the proposed framework

3.1 Feature extraction and fusion

There are several feature descriptors that can achieve the indexing purpose. Using them simultaneously will increase the memory burden on the computer. It has been proved through experimental results [12] that visual features, such as colour, spatial location, invariance, scene, and texture, are strongly related to human perception and are important to convey the information related to the image content. Therefore, this paper adopts these visual features and then merges them to describe the content of images.

The Histogram of Oriented Gradients (HOG) [10] is one of the most effective feature descriptors; it extracts the salient orientation information of each object. The HOG feature is relatively insensitive to changes in illumination and provides a good description of local information by calculating the gradients in local cells along eight directions. Specifically, the grey-scale image with Gamma correction is divided into 3 × 3 blocks, each consisting of 4 × 4 local cells. The resulting HOG feature therefore has 3 × 3 × 4 × 4 × 8 = 1152 dimensions.

Images generally contain rich scene information, which can provide high-level context to guide the proposed algorithm towards more accurate retrieval. To extract the scene feature, the energy spectrum of the grey-scale image is first filtered through 32 Gabor filters (4 frequency bands with 8 orientations), because the energy spectrum provides a scene representation that is invariant with respect to object arrangement and object identities. Each filtered spectrum image is then divided into a 4 × 4 grid of sub-regions [30]. Thus, the image can be described by a vector with 32 × 16 = 512 dimensions, referred to as Gist.

Textural characteristics play an important role in the description of objects because real-world images usually have their own texture. Generally, there are two types of texture feature extraction algorithms: statistical methods and structural methods. The former comprise the Markov random field and the co-occurrence matrix, and the latter include the SIFT descriptor [25], the SURF descriptor, and LBP. Local image feature extraction algorithms have attracted considerable attention in recent years because they are tolerant to occlusion and distortion. Local Binary Pattern (LBP) is one of the best local feature operators for the description of texture [1]. LBP labels each pixel with a string of binary numbers obtained by thresholding its 3 × 3 neighbourhood against the grey value of the centre pixel. The histogram of these labels then serves as the texture descriptor. This paper adopts a 512-dimensional LBP feature.
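
The thresholding step can be illustrated with the basic 8-neighbour LBP operator. The sketch below is an assumption-laden illustration rather than the descriptor used in the experiments: it produces a 256-bin histogram, whereas the paper adopts a 512-dimensional LBP feature whose exact construction is not specified here, and the neighbour ordering and the function name `lbp_histogram` are arbitrary choices.

```python
def lbp_histogram(img):
    """256-bin histogram of basic 3x3 LBP codes for a 2-D list of grey values.

    Border pixels are skipped because they lack a full 3x3 neighbourhood.
    """
    # Clockwise neighbour offsets starting at the top-left pixel (arbitrary order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour contributes a 1 when it is at least as bright as the centre.
                if img[r + dr][c + dc] >= img[r][c]:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```

On a perfectly flat image every interior pixel receives the code 255, while an isolated bright pixel receives the code 0, which shows how the labels encode local contrast patterns rather than absolute grey values.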

In a CBIR system, the rotation-invariance feature, which is independent of the angle of the object, is rarely considered. However, most actual objects vary in orientation. This study has chosen six rotation-invariants by computing the magnitudes of Zernike moments of image [45] to characterize the rotation-invariance of objects. The 2D Zernike moment, A nm , of order n with repetition m is defined using polar coordinates (r, θ) inside the unit circle as [13]

$$ {A}_{nm}=\frac{n+1}{\pi }{\int}_0^{2\pi }{\int}_0^1{R}_{nm}(r) \exp \left(-jm\theta \right)f\left(r,\theta \right) rdrd\theta, \kern0.5em \left|m\right|\le n\;\mathrm{and}\;n-\left|m\right|\;\mathrm{being}\;\mathrm{even}, $$
(10)

where R nm (r) is the real-valued Zernike radial polynomial defined as

$$ {R}_{nm}(r)=\sum_{k=0}^{\left(n-|m|\right)/2}\frac{{\left(-1\right)}^k\left(n-k\right)!}{k!\left(\frac{n+\mid m\mid }{2}-k\right)!\left(\frac{n-\mid m\mid }{2}-k\right)!}{r}^{n-2k}. $$
(11)

Let the Zernike moments of an image f(r, θ) and its rotated version f′(r, θ) be A nm and \( {A}_{nm}^{\prime } \), respectively, where f′(r cos θ, r sin θ) = f(r cos(θ − ϕ), r sin(θ − ϕ)) and ϕ is the rotation angle. Then, we have

$$ \begin{array}{c}{A}_{nm}^{\prime }=\frac{n+1}{\pi }{\int}_0^{2\pi }{\int}_0^1{R}_{nm}(r) \exp \left(-jm\theta \right){f}^{\prime}\left(r \cos \theta, r \sin \theta \right) rdrd\theta \\ {}=\frac{n+1}{\pi }{\int}_0^{2\pi }{\int}_0^1{R}_{nm}(r) \exp \left(-jm\left({\theta}^{\prime }+\phi \right)\right)f\left(r \cos {\theta}^{\prime },r \sin {\theta}^{\prime}\right) rdrd{\theta}^{\prime}\\ {}= \exp \left(-jm\phi \right){A}_{nm}\end{array} $$
(12)

where θ′ = θ − ϕ. Therefore, the magnitude of the Zernike moment, |A nm |, is invariant to rotation and can be taken as a rotation-invariant feature of the underlying image. According to the constraints imposed on the parameters n and m in (10), we have used the following invariants in all experiments.

$$ {inv}^{\alpha }=\left\{|{A}_{22}^{\alpha }|,|{A}_{31}^{\alpha }|,|{A}_{33}^{\alpha }|,|{A}_{42}^{\alpha }|,|{A}_{44}^{\alpha }|,|{A}_{51}^{\alpha }|\right\} $$
(13)

Here, α indexes the image: \( {inv}^f \) and \( {inv}^g \) denote the rotation feature vectors of images f and g, respectively.
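
The invariance derived in (12) can be checked numerically. The following Python sketch approximates (10) by a sum over pixel centres mapped into the unit disc and compares the magnitudes |A nm | of an image and of its copy rotated by 90 degrees, a rotation that is exact on the pixel grid. The pixel-to-disc mapping, the discretization and the function names are assumptions of this sketch, not the implementation used in the experiments.

```python
import cmath
import math

def radial_poly(n, m, r):
    """Zernike radial polynomial R_nm(r) of Eq. (11); n - |m| must be even."""
    m = abs(m)
    return sum((-1) ** k * math.factorial(n - k)
               / (math.factorial(k) * math.factorial((n + m) // 2 - k)
                  * math.factorial((n - m) // 2 - k)) * r ** (n - 2 * k)
               for k in range((n - m) // 2 + 1))

def zernike_moment(img, n, m):
    """Discrete approximation of Eq. (10) for a square image (list of rows)."""
    size = len(img)
    total = 0j
    for row in range(size):
        for col in range(size):
            x = (2 * col + 1 - size) / size  # pixel centre mapped to [-1, 1]
            y = (2 * row + 1 - size) / size
            r = math.hypot(x, y)
            if r <= 1.0:  # only pixels inside the unit circle contribute
                theta = math.atan2(y, x)
                total += radial_poly(n, m, r) * cmath.exp(-1j * m * theta) * img[row][col]
    pixel_area = (2.0 / size) ** 2  # stands in for r dr dtheta
    return (n + 1) / math.pi * total * pixel_area

def rotate90(img):
    """Rotate a square image by 90 degrees on the pixel grid."""
    return [list(row) for row in zip(*img[::-1])]

def invariants(img):
    """The six magnitudes listed in Eq. (13)."""
    orders = [(2, 2), (3, 1), (3, 3), (4, 2), (4, 4), (5, 1)]
    return [abs(zernike_moment(img, n, m)) for n, m in orders]
```

For any square test image, `invariants(img)` and `invariants(rotate90(img))` agree up to rounding error, whereas the complex moments themselves differ by the phase factor exp(−jmϕ).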

Almost all natural images include colour information. Therefore, selecting a colour descriptor is extremely important for a retrieval scheme. Various colour features are available for image retrieval, including colour moments, the colour coherence vector (CCV) [27], and the colour histogram (ColourHist). CCV is more complex than ColourHist; it classifies each pixel as either coherent or incoherent. The proposed method adopts the colour histogram as its colour feature. ColourHist considers the colour similarity information by spreading each pixel’s total membership value to all the histogram bins. ColourHist is easy to obtain, and it is invariant to the rotation and translation of image content. Following [17], the proposed method computes a 64-bin histogram for each RGB channel and concatenates them, leading to a 192-dimensional ColourHist.
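
The 192-dimensional layout can be made concrete with a short Python sketch. Note the simplification: the code below assigns each pixel value to a single bin (hard binning), whereas ColourHist as described above spreads membership values over the bins; the 8-bit value range and the function name `colour_histogram` are likewise assumptions of the sketch.

```python
def colour_histogram(pixels, bins=64):
    """Concatenated per-channel histograms of 8-bit RGB pixels (3 * bins values).

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    Hard binning is used here; it is a simplification of the membership-based
    ColourHist described in the text.
    """
    width = 256 // bins  # number of intensity levels per bin
    hist = [0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + value // width] += 1
    return hist
```

With the default of 64 bins, entries 0 to 63 hold the red channel, 64 to 127 the green channel and 128 to 191 the blue channel.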

Keeping the notation consistent, let \( {F}^{\left(\hslash \right)}=\left[{f}_1^{\left(\hslash \right)},{f}_2^{\left(\hslash \right)},\dots, {f}_N^{\left(\hslash \right)}\right]\in {\mathbb{R}}^{D_{\hslash}\times N} \) represent the training dataset at the ℏ-th view, where N is the number of images and \( {D}_{\hslash } \) denotes the feature dimension of the images at the ℏ-th view. There are several possibilities for modelling the similarity measure [16]. We choose Gaussian-like heat kernel functions to measure the closeness of two images, \( {f}_i^{\left(\hslash \right)} \) and \( {f}_j^{\left(\hslash \right)} \), because a heat kernel function has a specific connection to the Laplace operator on differentiable functions [7].

$$ {\varpi}_{ij}^{\left(\hslash \right)}= \exp\;\left(\frac{-||{f}_i^{\left(\hslash \right)}-{f}_j^{\left(\hslash \right)}||{}_F^2}{2{\left({\lambda}^{\left(\hslash \right)}\right)}^2}\right),\kern1.25em i,j\in \left[1,N\right], $$
(14)

where \( {\lambda}^{\left(\hslash \right)} \) is a scale parameter. Then, the multiview feature matrix used for matrix factorization is obtained by averaging over the views:

$$ {X}_{ij}=\frac{1}{N_{\hslash }}\sum_{\hslash =1}^{N_{\hslash }}{\varpi}_{ij}^{\left(\hslash \right)},\kern1.25em i,j\in \left[1,N\right]. $$
(15)

In the current scheme, the structural constraint is achieved through a weight matrix, W, as

$$ W=\frac{1}{N_{\hslash }}\sum_{\hslash =1}^{N_{\hslash }}\left(\frac{\varpi^{\left(\hslash \right)}-{I}_N}{\sum_{i\ne j}{\varpi}_{ij}^{\left(\hslash \right)}}\right), $$
(16)

where \( {N}_{\hslash } \) is the number of views; five features are used, so \( {N}_{\hslash }=5 \). I N is the identity matrix of size N × N. Equation (16) indicates that all feature vectors are incorporated into the weight matrix, W, to achieve multiview CBIR in a real sense. It is easy to see from (16) that the weight matrix, W, is symmetric.
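
The fusion described by (14)–(16) can be sketched as follows. The sketch stores each view as a list of N per-image feature vectors (the transpose of the D × N layout used above), and the function names `heat_kernel` and `fuse_views` are hypothetical; it is meant only to make the construction of X and W concrete.

```python
import math

def heat_kernel(features, lam):
    """Eq. (14): pairwise heat-kernel similarities for one view.

    features: list of N feature vectors; lam: the scale parameter of the view.
    """
    n = len(features)

    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [[math.exp(-sqdist(features[i], features[j]) / (2 * lam ** 2))
             for j in range(n)] for i in range(n)]

def fuse_views(views, lams):
    """Eqs. (15)-(16): averaged similarity matrix X and weight matrix W."""
    kernels = [heat_kernel(f, lam) for f, lam in zip(views, lams)]
    n, n_views = len(kernels[0]), len(kernels)
    x = [[sum(k[i][j] for k in kernels) / n_views for j in range(n)]
         for i in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for k in kernels:
        # Normalizer of Eq. (16): the sum of the off-diagonal similarities.
        off_diag = sum(k[i][j] for i in range(n) for j in range(n) if i != j)
        for i in range(n):
            for j in range(n):
                if i != j:  # (kernel - I_N) has a zero diagonal
                    w[i][j] += k[i][j] / (off_diag * n_views)
    return x, w
```

By construction, X has a unit diagonal, W has a zero diagonal, both matrices are symmetric, and the entries of W sum to one.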

3.2 Constrained multiview NMF

The goal of this subsection is to exploit a novel, constrained NMF to preserve some of the properties of sparse features. This can be accomplished by incorporating manifold constraints about U and V into the original NMF. Considering the inherent geometric structure of each object, it becomes natural to impose structure regularization so that visually similar images are placed together.

Consequently, we can construct the following optimization problem by incorporating the structural constraint Tr(VLV T).

$$ \underset{U,V}{ \arg\;\min }{\left|\left|X-UV\right|\right|}_F^2+{\lambda}_2Tr\left({VLV}^T\right)\kern1.25em \mathrm{s}.\mathrm{t}.\kern0.5em U,V\ge 0, $$
(17)

where Tr(⋅) denotes the trace operation, L = D − W is a Laplacian matrix, D is a diagonal matrix whose elements are D jj  = ∑ i W ij , and \( {\lambda}_2\in {\mathbb{R}}_{+} \) is introduced to balance the reconstruction error and the impact of the structural constraint.

We know the sparsity constraint from previous studies [34, 40], where the \( L_{1/2} \) constraint showed potential advantages for sparsity-promoting solutions. This characteristic inspired us to develop the \( L_{1/2} \)-constrained multiview NMF by incorporating a sparsity constraint on the basis matrix, \( {||U||}_{1/2}={\sum}_{i,d}^{M,\overline{D}}{\left({U}_{id}\right)}^{1/2} \), into the conventional NMF.

$$ \underset{U,V}{ \arg \min }{||X-UV||}_F^2+2{\lambda}_1{||U||}_{1/2}\kern1.5em \mathrm{s}.\mathrm{t}.\kern0.5em U,V\ge 0, $$
(18)

where the parameter \( {\lambda}_1\in {\mathbb{R}}_{+} \) balances the impact of the sparseness constraint so that the inherent sparseness property of the minimization problem is sufficiently exploited.

To emphasize the constraint of the spacing location, we expect that two closely spaced data in one space will also remain very close in another space. Here, we consider the spacing constraint term, \( \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right] \), introduced in [2],

$$ \underset{U,V}{ \arg \min }{||X-UV||}_F^2+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\kern1.75em \mathrm{s}.\mathrm{t}.\kern0.5em U,V\ge 0, $$
(19)

where \( \tilde{L}=\tilde{D}-\tilde{W} \) and \( \tilde{D} \) is a diagonal matrix whose entries are the column sums of \( \tilde{W} \), i.e., \( {\tilde{D}}_{jj}={\sum}_i{\tilde{W}}_{ij} \). The parameter β controls the overall contribution of the farness property, and we chose the value β = 0.01 for all datasets. The parameter \( {\lambda}_3\in {\mathbb{R}}_{+} \) balances the contribution of the constraint in the objective function.

With the above considerations, the new objective function can be obtained mathematically, as follows.

$$ \underset{U,V}{ \arg \min }{||X-UV||}_F^2+2{\lambda}_1{||U||}_{1/2}+{\lambda}_2Tr\left({VLV}^T\right)+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\kern1.75em \mathrm{s}.\mathrm{t}.\kern0.5em U,V\ge 0, $$
(20)

Obviously, it is difficult to explore a closed-form solution for the optimization problem with respect to (20). Alternatively, resorting to a multiplicative update scheme, the matrices U and V can be alternately obtained. Considering the non-negativity of the two matrices U and V, and assuming \( \Phi =\left[{\Phi}_{id}\right]\in {\mathbb{R}}_{+}^{M\times \overline{D}} \)and \( \Psi =\left[{\Psi}_{dj}\right]\in {\mathbb{R}}_{+}^{\overline{D}\times N} \), we then formulate the Lagrange function ℒ as follows

$$ \begin{array}{c}\mathcal{L}={||X-UV||}_F^2+2{\lambda}_1{||U||}_{1/2}+{\lambda}_2Tr\left({VLV}^T\right)+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]+Tr\left(\Phi {U}^T\right)+Tr\left(\Psi {V}^T\right)\\ {}=Tr\left({XX}^T\right)+Tr\left({UVV}^T{U}^T\right)-2Tr\left({XV}^T{U}^T\right)+2{\lambda}_1{||U||}_{1/2}+{\lambda}_2Tr\left({VLV}^T\right)\\ {}+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]+Tr\left(\Phi {U}^T\right)+Tr\left(\Psi {V}^T\right)\end{array} $$
(21)

where Φ id and Ψ dj are the Lagrange multipliers. The partial derivatives of ℒ with respect to U, V are

$$ \frac{\partial \mathrm{\mathcal{L}}}{\partial U}=2{UVV}^T-2{XV}^T+{\lambda}_1{U}^{-1/2}+\Phi =0, $$
(22)
$$ \frac{\partial \mathrm{\mathcal{L}}}{\partial V}=2{U}^TUV-2{U}^TX+2{\lambda}_2VL+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\left(-2\beta V\tilde{L}\right)+\Psi =0. $$
(23)

Using the Karush-Kuhn-Tucker (KKT) conditions [6], where Φ id U id  = 0, ∀i , d, the following equation for U id can be obtained.

$$ {\left(2{UVV}^T+{\lambda}_1{U}^{-1/2}\right)}_{id}\cdot {U}_{id}-{\left(2{XV}^T\right)}_{id}\cdot {U}_{id}+{\Phi}_{id}\cdot {U}_{id}=0. $$
(24)

Since Φ id U id  = 0, rearranging the terms leads to the update rule for matrix U

$$ {U}_{id}\leftarrow {U}_{id}\cdot \frac{{\left({XV}^T\right)}_{id}}{{\left({UVV}^T+\left({\lambda}_1/2\right){U}^{-1/2}\right)}_{id}}. $$
(25)

In the same manner, based on the KKT conditions, Ψ dj V dj  = 0, ∀d , j, the following equation is obtained by multiplying both sides of (23) by V dj .

$$ \begin{array}{l}{\left(2{U}^TUV+2{\lambda}_2VD+4{\lambda}_3\beta \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]V\tilde{W}\right)}_{dj}{V}_{dj}\\ {}\kern3.5em -{\left(2{U}^TX+2{\lambda}_2VW+4{\lambda}_3\beta \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]V\tilde{D}\right)}_{dj}{V}_{dj}+{\Psi}_{dj}{V}_{dj}=0.\end{array} $$
(26)

Solving the above equation yields the multiplicative update rule for V dj

$$ {V}_{dj}\leftarrow {V}_{dj}\cdot \frac{{\left({U}^TX+{\lambda}_2VW+2{\lambda}_3\beta \cdot \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(V\tilde{D}\right)\right)}_{dj}}{{\left({U}^TUV+{\lambda}_2VD+2{\lambda}_3\beta \cdot \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(V\tilde{W}\right)\right)}_{dj}}. $$
(27)

In this way, U and V can be updated iteratively until the objective function (20) converges or the predefined number of iterations is reached. With regard to the convergence properties of the proposed update schemes, we have the following theorem:

  • Theorem 1: The objective function in (20) is non-increasing under the update rules in (25) and (27).

This theorem can ensure that the proposed objective function (20) converges to a local minimum, and its proof is presented in section 4.
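
One pass of the update rules (25) and (27) can be sketched in plain Python as follows. Several points are assumptions of this sketch rather than statements about the original implementation: the farness matrix \( \tilde{W} \) is passed in as a ready-made argument `w_far` because its construction is not detailed here, U is updated before V, \( {U}^{-1/2} \) is taken element-wise, and a small constant `eps` guards against division by zero. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    bt = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def degree(w):
    """Diagonal matrix of column sums, D_jj = sum_i W_ij."""
    n = len(w)
    sums = [sum(w[i][j] for i in range(n)) for j in range(n)]
    return [[sums[j] if i == j else 0.0 for j in range(n)] for i in range(n)]

def trace_vlvt(v, lap):
    """Tr(V L V^T), computed as the sum over d, j of (V L)_dj * V_dj."""
    vl = matmul(v, lap)
    return sum(p * q for vr, vlr in zip(v, vl) for p, q in zip(vr, vlr))

def update_step(x, u, v, w, w_far, lam1, lam2, lam3, beta=0.01, eps=1e-12):
    """One pass of Eqs. (25) and (27); returns the new (U, V)."""
    m, rank, n = len(u), len(v), len(v[0])
    # Eq. (25): U <- U * (X V^T) / (U V V^T + (lam1 / 2) U^(-1/2)).
    vt = transpose(v)
    num, den = matmul(x, vt), matmul(u, matmul(v, vt))
    u = [[u[i][d] * num[i][d]
          / (den[i][d] + 0.5 * lam1 / math.sqrt(u[i][d] + eps) + eps)
          for d in range(rank)] for i in range(m)]
    # Eq. (27): graph terms built from W, D and the farness pair (w_far, d_far).
    d_mat, d_far = degree(w), degree(w_far)
    lap_far = [[d_far[i][j] - w_far[i][j] for j in range(n)] for i in range(n)]
    g = 2 * lam3 * beta * math.exp(-beta * trace_vlvt(v, lap_far))
    ut = transpose(u)
    utx, utuv = matmul(ut, x), matmul(matmul(ut, u), v)
    vw, vd = matmul(v, w), matmul(v, d_mat)
    vwf, vdf = matmul(v, w_far), matmul(v, d_far)
    v = [[v[r][j] * (utx[r][j] + lam2 * vw[r][j] + g * vdf[r][j])
          / (utuv[r][j] + lam2 * vd[r][j] + g * vwf[r][j] + eps)
          for j in range(n)] for r in range(rank)]
    return u, v
```

Since every term in the numerators and denominators is nonnegative, the updates keep U and V nonnegative as long as the initial matrices are; the sketch itself does not verify the monotonicity stated in Theorem 1.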

3.3 SMM-based clustering

NMF has been widely applied because of its capability to keep the intrinsic features in a low-dimensional space. Each column of the coefficient matrix V indicates the characteristics of the associated training sample. Therefore, any clustering algorithm can be applied to V to associate each sample with a cluster. This paper is particularly interested in a statistical method to label the data V = [v 1, v 2,  ⋯ , v N ]T. In other words, the data of V are modelled using a mixture of Student’s t-distributions. This is because, in theory, an arbitrarily shaped distribution can be approximated using a mixture of probability density functions, provided that the mixture has sufficiently many components. Note that each v i is a \( \overline{D} \)-dimensional feature vector. Labels are denoted by (Ξ1, Ξ2,  … , Ξ K ). We define the posterior probability density function used to partition the N columns of V into K labels as

$$ p\left(\Pi, \Xi |V\right)\propto p\left(V|\Pi, \Xi \right)p\left(\Pi \right). $$
(28)

Thus, the new joint conditional probability density of the data (the column of coefficient matrix V) can be represented via multivariable SMM in the form:

$$ p\left(V|\Pi, \Xi \right)=\prod_{i=1}^N\left[\sum_{j=1}^K{\pi}_{ij}S\left({v}_i|{\Xi}_j\right)\right], $$
(29)

where S(v i | Ξ j ) is the multivariable Student’s t-distribution given by (3), and π ij is the prior probability that v i belongs to the label Ξ j . To account for the spatial information between neighbouring columns of the coefficient matrix V, we introduce the MRF in the form

$$ p\left(\Pi \right)= \exp \left(-\overline{U}\left(\Pi \right)\right), $$
(30)

where \( \overline{U}\left(\Pi \right) \) is the smooth prior. By combining (28), (29), and (30), the log-likelihood of (28) can be expressed by the following formula.

$$ \begin{array}{l}L\left(\Pi, \Xi |V\right)= \log p\left(\Pi, \Xi |V\right)\\ {}=\sum_{i=1}^N \log \left\{\sum_{j=1}^K{\pi}_{ij}S\left({v}_i|{\Xi}_j\right)\right\}-\overline{U}\left(\Pi \right).\end{array} $$
(31)

We choose the smooth prior \( \overline{U}\left(\Pi \right) \) following Nguyen and Wu’s work [28]

$$ \overline{U}\left(\Pi \right)=-\sum_{i=1}^N\sum_{j=1}^K \exp \left(\gamma \sum_{m\in {N}_i}\left({z}_{mj}+{\pi}_{mj}\right)\right) \log {\pi}_{ij}, $$
(32)

where N i is the neighbourhood window of the i-th sample; in this paper, e.g., N i contains 9 elements for a 3 × 3 window. The parameter γ controls the impact of smoothing. Generally, it takes a value in the range [0.5, 3]; in our experiments, γ = 2.5. This smoothing function acts as a linear filter for smoothing images contaminated by noise. The smooth prior depends only on the posteriors z mj and priors π mj from the previous step and is therefore easy to implement. Consequently, substituting (32) into (31), the log-likelihood can be written as

$$ L\left(\Pi, \Xi |V\right)=\sum_{i=1}^N \log \left\{\sum_{j=1}^K{\pi}_{ij}S\left({v}_i|{\Xi}_j\right)\right\}+\sum_{i=1}^N\sum_{j=1}^K \exp \left(\gamma \sum_{m\in {N}_i}\left({z}_{mj}+{\pi}_{mj}\right)\right) \log {\pi}_{ij}. $$
(33)

Next, we apply Jensen’s inequality in the form of \( \log \left({\sum}_{j=1}^K{z}_{ij}\varsigma \right)\ge {\sum}_{j=1}^K{z}_{ij} \log \left(\varsigma \right) \) to modify the above expression. Thus, maximizing the log-likelihood function L(Π, Ξ| V) results in an increase in the value of the following objective function.

$$ J\left(\Pi, \Xi |V\right)=\sum_{i=1}^N\sum_{j=1}^K{z}_{ij}\left\{ \log {\pi}_{ij}+ \log S\left({v}_i|{\Xi}_j\right)\right\}+\sum_{i=1}^N\sum_{j=1}^K \exp \left(\gamma \sum_{m\in {N}_i}\left({z}_{mj}+{\pi}_{mj}\right)\right) \log {\pi}_{ij}. $$
(34)

By Bayes’ theorem, the posterior probability \( {z}_{ij} \) in (34) at the current iteration step is

$$ {z}_{ij}^{\left(t+1\right)}=\frac{\pi_{ij}^{(t)}S\left({v}_i|{\Xi}_j^{(t)}\right)}{\sum_{m=1}^K{\pi}_{im}^{(t)}S\left({v}_i|{\Xi}_m^{(t)}\right)}. $$
(35)
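As a minimal sketch of the E-step in (35), assume that the priors \( {\pi}_{ij} \) and the component densities \( S({v}_i|{\Xi}_j) \) have already been evaluated and stored as nested lists. The function and argument names are illustrative and do not come from the paper.

```python
def posteriors(priors, densities):
    """Evaluate z_ij of (35) for every sample i and component j.

    priors[i][j]    -- pi_ij, the prior of component j at sample i
    densities[i][j] -- S(v_i | Xi_j), the density of component j at sample i
    """
    z = []
    for pi_i, s_i in zip(priors, densities):
        joint = [p * s for p, s in zip(pi_i, s_i)]  # pi_ij * S(v_i | Xi_j)
        total = sum(joint)                           # denominator of (35)
        z.append([x / total for x in joint])
    return z
```

Each row of the returned list sums to one, so it can be read directly as the membership probabilities of sample i.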

Next, we need to maximize the objective function (34). To this end, the EM algorithm takes the derivative of J(Π, Ξ| V) with respect to each parameter in the set {Π, Ξ} and sets it to zero. Solving \( \partial J\left(\Pi, \Xi |V\right)/\partial {\mu}_j=0 \) yields the estimate of the mean \( {\mu}_j \) at iteration (t + 1):

$$ {\mu}_j^{\left(t+1\right)}=\frac{\sum_{i=1}^N{z}_{ij}^{(t)}{u}_{ij}^{(t)}{v}_i}{\sum_{i=1}^N{z}_{ij}^{(t)}{u}_{ij}^{(t)}}, $$
(36)

where the weight \( {u}_{ij}^{(t)} \) is given by

$$ {u}_{ij}^{(t)}=\frac{{\overline{\nu}}_j^{(t)}+\overline{D}}{{\overline{\nu}}_j^{(t)}+{\left({v}_i-{\mu}_j^{(t)}\right)}^T{\Sigma}_j^{-1(t)}\left({v}_i-{\mu}_j^{(t)}\right)}. $$
(37)

Similarly to the computation of the mean \( {\mu}_j \), let \( \partial J\left(\Pi, \Xi |V\right)/\partial {\Sigma}_j=0 \). Then, the estimate of the covariance \( {\Sigma}_j \) is

$$ {\Sigma}_j^{\left(t+1\right)}=\frac{\sum_{i=1}^N{z}_{ij}^{(t)}{u}_{ij}^{(t)}\left({v}_i-{\mu}_j^{(t)}\right){\left({v}_i-{\mu}_j^{(t)}\right)}^T}{\sum_{i=1}^N{z}_{ij}^{(t)}}. $$
(38)
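The updates (36)–(38) can be illustrated for a single component under a strong simplifying assumption: scalar samples (\( \overline{D}=1 \)), so that the covariance reduces to a variance and the quadratic form in (37) to a squared standardized distance. The function name and signature below are hypothetical; this is a sketch of the weighting mechanism, not the full multivariate implementation.

```python
def m_step_component(values, z_j, mu, var, nu, dim=1):
    """One update of the mean (36) and variance (38) for one component j.

    Simplifying assumption: scalar samples (D-bar = 1), so Sigma_j is a
    variance and the Mahalanobis term in (37) is (v_i - mu)^2 / var.
    """
    # Weights u_ij of (37): small for samples far from the current mean.
    u = [(nu + dim) / (nu + (v - mu) ** 2 / var) for v in values]
    zu = [z * w for z, w in zip(z_j, u)]
    # Mean update (36).
    mu_new = sum(w * v for w, v in zip(zu, values)) / sum(zu)
    # Variance update (38); as printed, it uses the previous mean mu^(t).
    var_new = sum(w * (v - mu) ** 2 for w, v in zip(zu, values)) / sum(z_j)
    return mu_new, var_new, u
```

Because \( {u}_{ij} \) shrinks as a sample moves away from the mean, an outlier contributes little to the updated mean, which is the usual robustness argument for Student's-t mixtures.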

Using the constraint \( {\sum}_{j=1}^K{\pi}_{ij}=1 \), solving \( \partial J\left(\Pi, \Xi |V\right)/\partial {\pi}_{ij}=0 \) yields the following iterative update for the prior probability \( {\pi}_{ij} \):

$$ {\pi}_{ij}^{\left(t+1\right)}=\frac{z_{ij}^{(t)}+ \exp \left(\gamma \sum_{m\in {N}_i}\left({z}_{mj}^{(t)}+{\pi}_{mj}^{(t)}\right)\right)}{1+\sum_{h=1}^K \exp \left(\gamma \sum_{m\in {N}_i}\left({z}_{mh}^{(t)}+{\pi}_{mh}^{(t)}\right)\right)}. $$
(39)
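A small sketch of the prior update (39) follows. It assumes that the neighbourhood of each sample is supplied as a list of indices; whether the window contains the sample itself is left to the caller, since the text does not say. All names are illustrative.

```python
import math


def update_priors(z, pi, neighbours, gamma=2.5):
    """Update pi_ij according to (39).

    z[i][j], pi[i][j] -- posteriors and priors from the previous step
    neighbours[i]     -- indices m in the window N_i of sample i
    gamma             -- smoothing strength (2.5 in the paper's experiments)
    """
    n_components = len(z[0])
    new_pi = []
    for i, window in enumerate(neighbours):
        # exp(gamma * sum_{m in N_i} (z_mj + pi_mj)) for each component j.
        factor = [
            math.exp(gamma * sum(z[m][j] + pi[m][j] for m in window))
            for j in range(n_components)
        ]
        denom = 1.0 + sum(factor)
        new_pi.append([(z[i][j] + factor[j]) / denom for j in range(n_components)])
    return new_pi
```

Since the posteriors of a sample sum to one, the numerators in (39) add up to the denominator, so each updated row is again a valid probability vector.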

Finally, the estimate of the degrees of freedom \( {\overline{\nu}}_j \) is obtained by setting the derivative of J(Π, Ξ| V) with respect to \( {\overline{\nu}}_j \) to zero; at iteration (t + 1), \( {\overline{\nu}}_j^{\left(t+1\right)} \) is the solution of

$$ \log \left(\frac{{\overline{\nu}}_j^{\left(t+1\right)}}{2}\right)-\psi \left(\frac{{\overline{\nu}}_j^{\left(t+1\right)}}{2}\right)+1- \log \left(\frac{{\overline{\nu}}_j^{(t)}+\overline{D}}{2}\right)+\psi \left(\frac{{\overline{\nu}}_j^{(t)}+\overline{D}}{2}\right)+\frac{\sum_{i=1}^N{z}_{ij}^{(t)}\left( \log {u}_{ij}^{(t)}-{u}_{ij}^{(t)}\right)}{\sum_{i=1}^N{z}_{ij}^{(t)}}=0, $$
(40)

where \( \psi (x)={\Gamma}^{\prime }(x)/\Gamma (x) \) is the digamma function. The final clustering results are then obtained from the posterior probabilities: each sample is assigned to the component with the largest posterior,

$$ {v}_i\in {\Xi}_j:\mathrm{if}\kern0.5em {z}_{ij}\ge {z}_{im},\kern0.5em j,m=1,2,\cdots, K $$
(41)
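Equation (40) has no closed-form solution, so the degrees of freedom must be found numerically. The sketch below uses bisection, relying on the fact that \( \log x-\psi (x) \) is decreasing for x > 0. The digamma routine, the search bracket [1e-3, 1e3], and the fallback to the upper end of the bracket are all assumptions of this illustration and are not taken from the paper.

```python
import math


def digamma(x):
    """Digamma function psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:  # shift upwards using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def solve_nu(z_j, u_j, nu_old, dim, lo=1e-3, hi=1e3, iters=200):
    """Solve (40) for the new degrees of freedom of component j by bisection."""
    half = 0.5 * (nu_old + dim)
    # Every term of (40) that does not depend on the unknown.
    const = (1.0 - math.log(half) + digamma(half)
             + sum(z * (math.log(u) - u) for z, u in zip(z_j, u_j)) / sum(z_j))

    def f(nu):
        return math.log(0.5 * nu) - digamma(0.5 * nu) + const

    if f(hi) > 0.0:  # no root inside the bracket: nearly Gaussian component
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:  # f is decreasing, so the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A library root finder and a library digamma implementation would normally replace both helpers; they are written out here only to keep the sketch self-contained.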

3.4 Similarity ranking

The computation of similarity is still an important problem in CBIR systems. Different similarity measurements lead to different results. In this study, to retrieve a list of “similar” images, we first calculate previously defined image features, and the similarity measurements between the query image and images under consideration are then obtained by

$$ {X}_{\mathrm{query}}=\frac{1}{N_{\hslash }}\sum_{\hslash =1}^{N_{\hslash }} \exp\;\left(\frac{-{\left|\left|{f}_{\mathrm{query}}^{\left(\hslash \right)}-{f}_i^{\left(\hslash \right)}\right|\right|}_F^2}{2{\left({\lambda}^{\left(\hslash \right)}\right)}^2}\right),\kern1.25em i\in \left[1,N\right], $$
(42)

where \( {f}_{query}^{\left(\hslash \right)} \) denotes the ℏ-th feature of the query image, while \( {f}_i^{\left(\hslash \right)} \) refers to the same type of feature of the i-th sample in the training database. We then define a linear projection matrix, ρ, as

$$ \rho ={\left({U}^TU\right)}^{-1}{U}^T. $$
(43)
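The kernel similarity in (42) can be sketched as follows, assuming that every feature has been flattened into a list of floats (so that the Frobenius norm becomes a Euclidean norm) and that one bandwidth \( {\lambda}^{\left(\hslash \right)} \) is supplied per feature type. Function names and the data layout are hypothetical.

```python
import math


def multiview_similarity(query_views, sample_views, bandwidths):
    """Average Gaussian kernel over all feature types, as in (42).

    query_views[h], sample_views[h] -- h-th feature of query and sample
    bandwidths[h]                   -- kernel width lambda^(h) of that feature
    """
    total = 0.0
    for f_q, f_i, lam in zip(query_views, sample_views, bandwidths):
        sq_dist = sum((a - b) ** 2 for a, b in zip(f_q, f_i))
        total += math.exp(-sq_dist / (2.0 * lam ** 2))
    return total / len(bandwidths)


def query_kernel_vector(query_views, database, bandwidths):
    """Similarity of the query to each of the N training samples."""
    return [multiview_similarity(query_views, s, bandwidths) for s in database]
```

The result is a length-N vector with entries in (0, 1], equal to 1 only when every feature of the sample coincides with that of the query.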

To capture the sparse representation of the query image, we project the kernel vector (42) into the low-dimensional space so that the new vector, \( {v}_{query} \), gives a parts-based representation that preserves the latent feature information of \( {X}_{query} \). This is achieved via the projection matrix, ρ, as

$$ {v}_{query}=\rho {X}_{query}. $$
(44)
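The projection in (43)–(44) is the least-squares solution of \( U{v}_{query}\approx {X}_{query} \). A minimal sketch, assuming U has full column rank and is stored as a list of rows, solves the normal equations directly instead of forming ρ explicitly; the function name is illustrative.

```python
def project_query(U, x):
    """Return v = (U^T U)^{-1} U^T x, i.e. (43) applied as in (44)."""
    n, d = len(U), len(U[0])
    # Normal equations: A = U^T U (d x d) and b = U^T x (length d).
    A = [[sum(U[k][r] * U[k][c] for k in range(n)) for c in range(d)] for r in range(d)]
    b = [sum(U[k][r] * x[k] for k in range(n)) for r in range(d)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("U^T U is singular; U must have full column rank")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            factor = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    v = [0.0] * d
    for r in range(d - 1, -1, -1):
        s = M[r][d] - sum(M[r][c] * v[c] for c in range(r + 1, d))
        v[r] = s / M[r][r]
    return v
```

In practice a numerical library's least-squares routine would be the more stable choice; the explicit elimination is used here only to avoid external dependencies.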

To rank the training images that are strongly related to the query image, the probability that the query image falls into each SMM component is calculated in advance. After obtaining all probabilities, \( \left\{{S}_1\left({v}_{query}|{\Xi}_1\right),{S}_2\left({v}_{query}|{\Xi}_2\right),\dots, {S}_K\left({v}_{query}|{\Xi}_K\right)\right\} \), for a given query image, the task of the retrieval system is to collect the most closely matching samples from the database. For samples that belong to the same SMM component, the similarity is compared by ranking the probabilities \( \left\{{S}_j\left({v}_i|{\Xi}_j\right)|{v}_i\in {\Xi}_j\right\} \), i = 1, ⋯, N, j = 1, ⋯, K, in descending order. With this rule, all training samples can be sorted, which ensures maximal matches for any given query image. For a retrieval length expressed as a fraction of the training set, the returned images are the N × (retrieval length) images nearest to the query image. Algorithm 1 describes the general procedure, which we refer to as SCMN.

figure c
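The ranking step can be read in more than one way; the sketch below shows one plausible reading and is not a transcription of Algorithm 1. It assumes that the query is assigned to the component with the largest probability, that training samples carry the hard labels of (41), and that members of the selected component are ordered by their component probability. All names are hypothetical.

```python
def assign_clusters(z):
    """Hard assignment (41): index of the largest posterior in each row."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in z]


def retrieve(query_probs, sample_probs, labels, retrieval_length):
    """Return indices of the top-ranked training samples for one query.

    query_probs[j]     -- S_j(v_query | Xi_j) for each component j
    sample_probs[i][j] -- S_j(v_i | Xi_j) for each training sample i
    labels[i]          -- component of sample i, e.g. from assign_clusters
    retrieval_length   -- fraction of the N training samples to return
    """
    best = max(range(len(query_probs)), key=lambda j: query_probs[j])
    members = [i for i, lab in enumerate(labels) if lab == best]
    # Descending order of probability within the selected component.
    members.sort(key=lambda i: sample_probs[i][best], reverse=True)
    n_return = int(round(len(labels) * retrieval_length))
    return members[:n_return]
```

With N = 200 and a retrieval length of 0.1, for instance, `n_return` is 20, which matches the example given in the parameter-setting discussion.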

4 Convergence and complexity analysis

4.1 Convergence analysis

To prove Theorem 1, we need to prove that the following objective function F(U, V) is non-increasing under the update formulae (25) and (27).

$$ F\left(U,V\right)={\left\Vert X-UV\right\Vert}_F^2+2{\lambda}_1{\left\Vert U\right\Vert}_{1/2}+{\lambda}_2Tr\left({VLV}^T\right)+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]. $$
(45)

The proof of convergence will make use of an auxiliary function, which has the following characteristics.

  • Lemma 1. If h(x, x') is an auxiliary function of F(x), that is, the conditions h(x, x') ≥ F(x) and h(x, x) = F(x) are satisfied for any given x, x', then F is non-increasing under the update

$$ {x}^{\left(t+1\right)}=\underset{x}{ \arg \min }h\left(x,{x}^{(t)}\right). $$
(46)
  • Proof:

The known conditions obviously lead to the following expression.

$$ F\left({x}^{\left(t+1\right)}\right)\le h\left({x}^{\left(t+1\right)},{x}^{(t)}\right)\le h\left({x}^{(t)},{x}^{(t)}\right)=F\left({x}^{(t)}\right). $$
(47)

Therefore, the equality \( F\left({x}^{\left(t+1\right)}\right)=F\left({x}^{(t)}\right) \) holds only if \( {x}^{(t)} \) is a local minimum of \( h\left(x,{x}^{(t)}\right) \).
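Lemma 1 can be checked numerically on a toy problem. The one-dimensional function below is chosen purely for illustration and is unrelated to the SCMN objective; its second derivative is bounded by 2, so a quadratic with curvature 2 is a valid auxiliary function whose minimizer is available in closed form.

```python
import math

CURVATURE = 2.0  # upper bound on F''(x) = (1 - tanh(x)^2) + 1


def F(x):
    """Toy objective (an assumption of this sketch, not the SCMN objective)."""
    return math.log(math.cosh(x)) + 0.5 * (x - 2.0) ** 2


def dF(x):
    return math.tanh(x) + (x - 2.0)


def h(x, x_prev):
    """Quadratic auxiliary function: h(x, x') >= F(x) and h(x, x) = F(x)."""
    return F(x_prev) + dF(x_prev) * (x - x_prev) + 0.5 * CURVATURE * (x - x_prev) ** 2


def mm_step(x_prev):
    """Closed-form argmin over x of h(x, x_prev), i.e. the update (46)."""
    return x_prev - dF(x_prev) / CURVATURE


def run(x0, n_iter=50):
    xs = [x0]
    for _ in range(n_iter):
        xs.append(mm_step(xs[-1]))
    return xs
```

Evaluating F along the iterates returned by `run` gives a non-increasing sequence, which is exactly the chain of inequalities in (47).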

Since the update schemes defined by (25) and (27) are element-wise in nature, holding V fixed, it is enough to verify that F(U, V) = F(U) is non-increasing for any element \( {U}_{id} \) in U. To achieve this, we define the following auxiliary function with respect to \( {U}_{id}^{(t)} \)

$$ h\left(u,{U}_{id}^{(t)}\right)=F\left({U}_{id}^{(t)}\right)+{F}^{\hbox{'}}\left({U}_{id}^{(t)}\right)\cdot \left(u-{U}_{id}^{(t)}\right)+\frac{{\left({UVV}^T+\frac{\lambda_1}{2}{U}^{-\frac{1}{2}}\right)}_{id}}{U_{id}^{(t)}}\cdot {\left(u-{U}_{id}^{(t)}\right)}^2. $$
(48)

From the above expression, it is easy to see that \( h\left({U}_{id}^{(t)},{U}_{id}^{(t)}\right)=F\left({U}_{id}^{(t)}\right) \). Thus, it remains to prove that \( h\left(u,{U}_{id}^{(t)}\right)\ge F(u) \). We first compute the second-order Taylor expansion of F(u) as

$$ \begin{array}{c}F(u)=F\left({U}_{id}^{(t)}\right)+{F}^{\hbox{'}}\left({U}_{id}^{(t)}\right)\cdot \left(u-{U}_{id}^{(t)}\right)+\frac{1}{2}{F}^{"}\left({U}_{id}^{(t)}\right){\left(u-{U}_{id}^{(t)}\right)}^2\\ {}=F\left({U}_{id}^{(t)}\right)+{F}^{\hbox{'}}\left({U}_{id}^{(t)}\right)\cdot \left(u-{U}_{id}^{(t)}\right)+\left({\left({VV}^T\right)}_{dd}-{\left(\frac{\lambda_1}{4}{U}^{-\frac{3}{2}}\right)}_{id}\right){\left(u-{U}_{id}^{(t)}\right)}^2\end{array} $$
(49)

where \( {F}^{\hbox{'}}\left({U}_{id}\right) \) and \( {F}^{"}\left({U}_{id}\right) \) are the first-order and second-order derivatives of the objective function (45) with respect to the variable \( {U}_{id} \):

$$ {F}^{\hbox{'}}\left({U}_{id}\right)={\left(2{UVV}^T-2{XV}^T+{\lambda}_1{U}^{-1/2}\right)}_{id}. $$
(50)
$$ {F}^{"}\left({U}_{id}\right)={\left(2{VV}^T\right)}_{dd}-{\left(\frac{\lambda_1}{2}{U}^{-3/2}\right)}_{id}. $$
(51)

It is easy to verify that

$$ {\left({UVV}^T\right)}_{id}={\sum}_l{U}_{il}^{(t)}{\left({VV}^T\right)}_{ld}\ge {U}_{id}^{(t)}\cdot {\left({VV}^T\right)}_{dd}. $$
(52)

Additionally, it is easy to see

$$ {\left(\frac{\lambda_1}{2}{U}^{-\frac{1}{2}}\right)}_{id}\ge \left(\frac{\lambda_1}{2}{U}_{id}^{-\frac{1}{2}}\right)\cdot \left(-\frac{1}{2}\right)=-\frac{\lambda_1}{4}{U}_{id}^{-\frac{3}{2}}\cdot {U}_{id}^{(t)}. $$
(53)

Comparing the Taylor series expansion of F(u) to the auxiliary function (48), and combining (52) and (53) leads to the following inequality.

$$ h\left(u,{U}_{id}^{(t)}\right)\ge F(u). $$
(54)

Substituting \( h\left(u,{U}_{id}^{(t)}\right) \) of (48) into (46), we obtain

$$ {U}_{id}^{\left(t+1\right)}=\underset{u}{ \arg \min }h\left(u,{U}_{id}^{(t)}\right). $$
(55)

The first-order derivative of \( h\left(u,{U}_{id}^{(t)}\right) \) with respect to u is set to zero:

$$ \frac{\partial h\left(u,{U}_{id}^{(t)}\right)}{\partial u}={F}^{\hbox{'}}\left({U}_{id}^{(t)}\right)+\frac{2{\left({UVV}^T+\frac{\lambda_1}{2}{U}^{-\frac{1}{2}}\right)}_{id}}{U_{id}^{(t)}}\cdot \left(u-{U}_{id}^{(t)}\right)=0. $$
(56)

Using (50), the above expression reduces to

$$ {\left(2{UVV}^T-2{XV}^T+{\lambda}_1{U}^{-\frac{1}{2}}\right)}_{id}\cdot {U}_{id}^{(t)}+{\left(2{UVV}^T+{\lambda}_1{U}^{-\frac{1}{2}}\right)}_{id}\cdot \left(u-{U}_{id}^{(t)}\right)=0. $$
(57)

Solving (57) for u gives

$$ u=\frac{{\left({XV}^T\right)}_{id}}{{\left({UVV}^T+\left({\lambda}_1/2\right){U}^{-1/2}\right)}_{id}}\cdot {U}_{id}^{(t)}={U}_{id}^{\left(t+1\right)}. $$
(58)

Finally, according to (48), (54), and (58), we can derive

$$ F\left({U}_{id}^{\left(t+1\right)}\right)\le h\left({U}_{id}^{\left(t+1\right)},{U}_{id}^{(t)}\right)\le h\left({U}_{id}^{(t)},{U}_{id}^{(t)}\right)=F\left({U}_{id}^{(t)}\right). $$
(59)
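The multiplicative rule (58) and the monotonicity claim (59) can be illustrated on small dense matrices. The sketch assumes that \( {\left\Vert U\right\Vert}_{1/2} \) denotes the sum of the square roots of the entries of U, which is consistent with the derivative in (50), and that V is held fixed. Matrices are plain nested lists and the helper names are hypothetical.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def objective_u(X, U, V, lam1):
    """Terms of (45) that depend on U: squared fit error plus sparsity penalty."""
    R = matmul(U, V)
    fit = sum((x - r) ** 2 for x_row, r_row in zip(X, R) for x, r in zip(x_row, r_row))
    sparsity = sum(math.sqrt(u) for row in U for u in row)
    return fit + 2.0 * lam1 * sparsity


def update_u(X, U, V, lam1):
    """Element-wise multiplicative update (58); U must be strictly positive."""
    Vt = transpose(V)
    XVt = matmul(X, Vt)
    UVVt = matmul(U, matmul(V, Vt))
    return [
        [U[i][d] * XVt[i][d] / (UVVt[i][d] + 0.5 * lam1 / math.sqrt(U[i][d]))
         for d in range(len(U[0]))]
        for i in range(len(U))
    ]
```

Calling `update_u` repeatedly with a fixed V and recording `objective_u` after each call gives a non-increasing sequence on positive data, in line with (59).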

Similar to the auxiliary function constructed for \( {U}_{id} \), we define another auxiliary function for \( {V}_{dj} \) by

$$ h\left(v,{V}_{dj}^{(t)}\right)=F\left({V}_{dj}^{(t)}\right)+{F}^{\hbox{'}}\left({V}_{dj}^{(t)}\right)\left(v-{V}_{dj}^{(t)}\right)+\frac{{\left({U}^TUV+{\lambda}_2VD+2{\lambda}_3\beta \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(V\tilde{W}\right)\right)}_{dj}}{V_{dj}^{(t)}}{\left(v-{V}_{dj}^{(t)}\right)}^2. $$
(60)

Since it is obvious that \( h\left({V}_{dj}^{(t)},{V}_{dj}^{(t)}\right)=F\left({V}_{dj}^{(t)}\right) \), the Taylor series expansion of F(v) is then used to prove the inequality \( h\left(v,{V}_{dj}^{(t)}\right)\ge F(v) \). The expansion is

$$ \begin{array}{c}F(v)=F\left({V}_{dj}^{(t)}\right)+{F}^{\hbox{'}}\left({V}_{dj}^{(t)}\right)\cdot \left(v-{V}_{dj}^{(t)}\right)+\frac{1}{2}{F}^{"}\left({V}_{dj}^{(t)}\right){\left(v-{V}_{dj}^{(t)}\right)}^2\\ {}=F\left({V}_{dj}^{(t)}\right)+{F}^{\hbox{'}}\left({V}_{dj}^{(t)}\right)\cdot \left(v-{V}_{dj}^{(t)}\right)+\left[{\left({U}^TU\right)}_{dd}+{\left({\lambda}_2L\right)}_{jj}+{\lambda}_3 \exp {\left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]}_{dd}\times {\left(4{\beta}^2\left(V\tilde{L}\right)\right)}_{dj}^2-{\left(2\beta \cdot \tilde{L}\right)}_{jj}\right]\cdot {\left(v-{V}_{dj}^{(t)}\right)}^2\end{array} $$
(61)

In the above, the partial derivatives of F with respect to V dj can be calculated as follows:

$$ {F}^{\hbox{'}}\left({V}_{dj}\right)={\left(2{U}^TUV-2{U}^TX+2{\lambda}_2VL+2{\lambda}_3\cdot \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(-2\beta V\tilde{L}\right)\right)}_{dj}, $$
(62)
$$ {F}^{"}\left({V}_{dj}\right)={\left(2{U}^TU\right)}_{dd}+{\left(2{\lambda}_2L\right)}_{jj}+2{\lambda}_3\cdot \exp {\left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]}_{dd}\cdot \left(4{\beta}^2\cdot {\left(V\tilde{L}\right)}_{dj}^2-{\left(2\beta \cdot \tilde{L}\right)}_{jj}\right). $$
(63)

The following inequalities hold:

$$ {\left({U}^TUV\right)}_{dj}={\sum}_l{\left({U}^TU\right)}_{dl}\cdot {V}_{lj}\ge {\left({U}^TU\right)}_{dd}\cdot {V}_{dj}^{(t)} $$
(64)
$$ \begin{array}{c}{\left({\lambda}_2VD\right)}_{dj}={\sum}_l{\lambda}_2\cdot {V}_{dl}{(D)}_{lj}\ge {\lambda}_2\cdot {V}_{dj}^{(t)}{(D)}_{jj}\ge {\lambda}_2\cdot {V}_{dj}^{(t)}{\left(D-W\right)}_{jj}\\ {}={\lambda}_2\cdot {V}_{dj}^{(t)}{(L)}_{jj}\end{array} $$
(65)

Based on this analysis, it is easy to see that, when the parameter β is sufficiently small, we have

$$ {\left(V\tilde{W}\right)}_{dj}\ge 2\beta {\left(V\tilde{L}\right)}_{dj}^2-{\left(\tilde{L}\right)}_{jj} $$
(66)

Consequently, we derive

$$ h\left(v,{V}_{dj}^{(t)}\right)\ge F(v) $$
(67)

Likewise, substituting \( h\left(v,{V}_{dj}^{(t)}\right) \) into (46), the update scheme for V in (27) can be obtained as a local optimum of the auxiliary function (60):

$$ {V}_{dj}^{\left(t+1\right)}=\underset{v}{ \arg \min }h\left(v,{V}_{dj}^{(t)}\right) $$
(68)

This is because setting the derivative of \( h\left(v,{V}_{dj}^{(t)}\right) \) with respect to the variable v to zero gives

$$ \frac{\partial h\left(v,{V}_{dj}^{(t)}\right)}{\partial v}={F}^{\hbox{'}}\left({V}_{dj}^{(t)}\right)+\frac{2}{V_{dj}^{(t)}}{\left({U}^TUV+{\lambda}_2VD+2{\lambda}_3\beta \cdot \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(V\tilde{W}\right)\right)}_{dj}\cdot \left(v-{V}_{dj}^{(t)}\right)=0 $$
(69)

The above equation can be simplified to

$$ \begin{array}{l}{\left(2{U}^TUV-2{U}^TX+2{\lambda}_2VL+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(-2\beta V\tilde{L}\right)\right)}_{dj}{V}_{dj}^{(t)}\\ {}+{\left(2{U}^TUV+2{\lambda}_2VD+2{\lambda}_3 \exp \left[-\beta Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(2\beta V\tilde{W}\right)\right)}_{dj}\left(v-{V}_{dj}^{(t)}\right)=0\end{array} $$
(70)

Thus, we obtain the following equation:

$$ \begin{array}{c}v=\frac{{\left({U}^TX+{\lambda}_2VW+2{\lambda}_3\beta \cdot \exp \left[-\beta \cdot Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(V\tilde{D}\right)\right)}_{dj}}{{\left({U}^TUV+{\lambda}_2VD+2{\lambda}_3\beta \cdot \exp \left[-\beta \cdot Tr\left(V\tilde{L}{V}^T\right)\right]\cdot \left(V\tilde{W}\right)\right)}_{dj}}\cdot {V}_{dj}^{(t)}\\ {}={V}_{dj}^{\left(t+1\right)}\end{array} $$
(71)

This leads to the following inequality.

$$ h\left({V}_{dj}^{\left(t+1\right)},{V}_{dj}^{(t)}\right)\le h\left({V}_{dj}^{(t)},{V}_{dj}^{(t)}\right) $$
(72)

Combining (60), (67), and (72) yields

$$ F\left({V}_{dj}^{\left(t+1\right)}\right)\le h\left({V}_{dj}^{\left(t+1\right)},{V}_{dj}^{(t)}\right)\le h\left({V}_{dj}^{(t)},{V}_{dj}^{(t)}\right)=F\left({V}_{dj}^{(t)}\right) $$
(73)

Finally, according to (59) and (73), we obtain

$$ F\left({U}_{id}^{\left(t+1\right)},{V}_{dj}^{\left(t+1\right)}\right)\le F\left({U}_{id}^{(t)},{V}_{dj}^{\left(t+1\right)}\right)\le F\left({U}_{id}^{(t)},{V}_{dj}^{(t)}\right) $$
(74)

This completes the proof of Theorem 1.
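For completeness, the multiplicative rule (71) for V can be transcribed in the same style. The sketch assumes that D and \( \tilde{D} \) are the diagonal degree matrices of the weight matrices W and \( \tilde{W} \), with \( \tilde{L}=\tilde{D}-\tilde{W} \); these definitions come from the usual graph-Laplacian convention and should be checked against the earlier sections. Helper names are hypothetical.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def degree(W):
    """Diagonal matrix of row sums (assumed definition of D and D-tilde)."""
    n = len(W)
    return [[sum(W[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]


def fit_error(X, U, V):
    R = matmul(U, V)
    return sum((x - r) ** 2 for x_row, r_row in zip(X, R) for x, r in zip(x_row, r_row))


def update_v(X, U, V, W, W_tilde, lam2, lam3, beta):
    """Element-wise multiplicative update (71); V must be strictly positive."""
    D, D_tilde = degree(W), degree(W_tilde)
    L_tilde = [[d - w for d, w in zip(d_row, w_row)] for d_row, w_row in zip(D_tilde, W_tilde)]
    # Tr(V L~ V^T) = sum over all entries of (V L~) * V.
    VL = matmul(V, L_tilde)
    trace = sum(a * b for vl_row, v_row in zip(VL, V) for a, b in zip(vl_row, v_row))
    c = 2.0 * lam3 * beta * math.exp(-beta * trace)
    Ut = transpose(U)
    UtX, UtUV = matmul(Ut, X), matmul(matmul(Ut, U), V)
    VW, VD = matmul(V, W), matmul(V, D)
    VWt, VDt = matmul(V, W_tilde), matmul(V, D_tilde)
    return [
        [V[d][j] * (UtX[d][j] + lam2 * VW[d][j] + c * VDt[d][j])
         / (UtUV[d][j] + lam2 * VD[d][j] + c * VWt[d][j])
         for j in range(len(V[0]))]
        for d in range(len(V))
    ]
```

With \( {\lambda}_2={\lambda}_3=0 \) the rule reduces to the classical NMF update for the coefficient matrix, which gives a convenient sanity check: the squared fit error must not increase.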

4.2 Complexity analysis

The cost of the proposed SCMN learning phase mainly consists of two parts. The first part is the construction of the heat kernel function ϖ and the Laplacian matrix L. In big-O notation, the time complexity of this part is \( \mathcal{O}\left(2\left({\sum}_{i=1}^n{D}_{\hslash}\right){N}^2\right) \). The second part is the matrix factorization, whose main computational costs lie in the update steps of the matrices U and V. Therefore, in view of the update schemes summarized in the above section, we count the number of floating-point operations, including addition/subtraction (Fladd), multiplication (Flmlt), and division (Fldiv). Table 1 lists the floating-point arithmetic operations involved in updating each matrix. Consequently, assuming that the multiplicative update rule terminates after t iterations, the total cost of the SCMN is \( \mathcal{O}\left(t{\left(MN\overline{D}\right)}^2\right) \). Table 2 gives the computational complexity of the proposed SCMN and compares it to the standard NMF. From Table 2, we can conclude that the SCMN is moderately more expensive than the classical NMF for a single update. The additional complexity of SCMN is mainly caused by the application of sparsity constraints.

Table 1 Floating-point computational times for multiplication of matrices
Table 2 Floating-point computational times for a single iteration in SCMN and NMF

5 Experimental results

In this section, to assess the performance of the SCMN, we report a series of retrieval experiments to compare the proposed method against other NMF-based algorithms and discuss its computational complexity and convergence rate. All experiments are run on a personal computer with an Intel (R) Core (TM) i7–2600 3.4GHz CPU and 8GB RAM. Numerical simulations have been carried out in Matlab 7.11.0 (2010b) in a Windows environment.

5.1 Data corpora

This study conducts experiments over four publicly available datasets: (1) Caltech101 (Footnote 1); (2) Corel 1 K (Footnote 2); (3) Corel 5 K (Footnote 2); and (4) WdcImageData (Footnote 3). These datasets are diverse enough to cover different themes of image retrieval tasks.

The Caltech101 benchmark dataset contains 9146 images of variable size, with 101 different object categories and another additional background category. In our experiment, 500 images from the Caltech101 collection are selected to form ten subsets for training, and another 30 images constitute a test set. All images are resized so that each side of an image is 128 pixels and is in RGB colour.

The Corel 1 K dataset consists of 1000 colour images in the JPEG format. Our training set contains five different categories with 100 samples per category: flowers, buses, mountains, horses, and dinosaurs. The query images come from the test set, which consists of 40 images. All images have a resolution of either 256 × 384 or 384 × 256, with values in the range 0 to 255 in each of the R, G and B colour channels.

The Corel 5 K image dataset is composed of 5000 colour images in 50 categories, with two sizes, 192 × 128 and 128 × 192. This is a relatively larger image dataset including diverse contents, such as animals, airplanes, trees, and stained glass. In our retrieval experiment, the training set contains 360 images of 192 × 128 pixels for the convenience of feature extraction. These are classified into four categories. Another 32 images are selected randomly as query images for test purposes.

This paper also applies the WdcImageData dataset, which consists of 1333 colour images of 22 categories, to evaluate the validity of the proposed SCMN. The images are very loosely grouped by category, including trees, people, sea, animals, buildings, and so on. In this work, the training set contains 300 images divided into six categories and for each category, there are unequal numbers of samples with a size of 756 × 504 pixels in the JPEG format. The test set used is a collection of 36 randomly selected samples.

5.2 Evaluation metrics

For an overall evaluation of performance, the two most commonly used metrics, Precision (PR) and Recall Rate (RR) [41], are used to measure the accuracy of image retrieval. Precision measures the ability of the underlying method to retrieve only images that are relevant. It is defined as the ratio of relevant retrieved images to all retrieved images:

$$ \mathrm{Precision}=\frac{\mathrm{Number}\ \mathrm{of}\ \mathrm{relevant}\ \mathrm{images}\ \mathrm{retrieved}}{\mathrm{Total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{images}\ \mathrm{retrieved}}. $$

Recall computes the ratio of the retrieved, relevant images to all the relevant images in the dataset. It is used for assessing the capabilities of the algorithm to retrieve all images that are relevant and is defined by

$$ \mathrm{Recall}=\frac{\mathrm{Number}\ \mathrm{of}\ \mathrm{relevant}\ \mathrm{images}\ \mathrm{retrieved}}{\mathrm{Total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{relevant}\ \mathrm{images}\ \mathrm{in}\ \mathrm{the}\ \mathrm{dataset}}. $$
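Both measures can be computed from the list of returned image identifiers and the set of relevant identifiers for a query. The sketch below is a direct transcription of the two definitions; the function name and the convention of returning 0 for an empty list are choices of this illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision and Recall for a single query.

    retrieved -- identifiers of the images returned by the system
    relevant  -- identifiers of all images in the dataset relevant to the query
    """
    hits = len(set(retrieved) & set(relevant))  # relevant images retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if four images are returned, two of which belong to a category of five relevant images, Precision is 2/4 = 0.5 and Recall is 2/5 = 0.4. Averaging these values over all query images gives the PR and RR figures reported in the experiments.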

5.3 Parameter setting

Before evaluating the SCMN, an analysis of the empirical parameter sensitivity is necessary. We investigate the impact of the inner dimension \( \overline{D} \) and the three other parameters, \( {\lambda}_1 \), \( {\lambda}_2 \), and \( {\lambda}_3 \), to find the optimal response over a wide range of parameter values.

We first examine how \( \overline{D} \) affects the proposed framework. To this end, \( \overline{D} \) is varied while the other parameters are set to 0.1 for all datasets. The retrieval length is fixed at 0.1 in all experiments. Most image retrieval systems return the top 10 most similar images; in this paper, we prefer to involve more retrieved images in the assessment. With a retrieval length of 0.1 and N = 200, for example, the returned image set contains 20 images. We therefore run the SCMN with a retrieval length of 0.1 and varying values of \( \overline{D} \). We find that our method works well in practice when \( \overline{D} \) is equal to 8, 16, 64, and 32 for Caltech-101, Corel 1 K, Corel 5 K, and WdcImageData, respectively. Table 3 presents the PR and RR responses of the SCMN. Next, we fix \( \overline{D} \) for each dataset, based on the above analysis, and vary the parameter \( {\lambda}_1 \) to find the best performance for each dataset. The comparative results are given in Table 4, where bold values indicate the best result of the column. The parameter \( {\lambda}_2 \) is then used to assess the effect of the regularization term. In the same way, setting \( \overline{D} \) and \( {\lambda}_1 \) to the favourable values indicated by the above analysis, we repeat the image retrieval operation using SCMN, where the value of \( {\lambda}_2 \) is increased from \( {10}^{-5} \) to 10 by factors of 10. Table 5 reports the obtained performance. Finally, to seek a suitable value for \( {\lambda}_3 \), we follow a similar procedure, fixing the parameters \( \overline{D} \), \( {\lambda}_1 \), and \( {\lambda}_2 \) and varying only the value of \( {\lambda}_3 \). Table 6 shows that the SCMN is not very sensitive to the parameter \( {\lambda}_3 \). In fact, one can observe that this also holds for the other regularization parameters. Therefore, unless otherwise specified, \( {\lambda}_1 \), \( {\lambda}_2 \), and \( {\lambda}_3 \) are set equal to 0.1 for all experiments. In contrast, the inner dimension \( \overline{D} \) has a significant effect on performance and should therefore be adjusted for each specific dataset.

Table 3 Comparison retrieval response in the light of \( \overline{D} \) by fixing retrieval length ( = 0.1), λ 1 = 0.1,λ 2 = 0.1, and λ 3 = 0.1 for all datasets
Table 4 Comparison retrieval response in the light of λ 1 by fixing retrieval length ( = 0.1), λ 2 = 0.1, and λ 3 = 0.1 for all datasets. \( \overline{D}=8,16,64,32 \)for Caltech-101, Corel 1 K, Corel 5 K, and WdcImageData, respectively
Table 5 Comparison retrieval response in the light of λ 2 by fixing retrieval length ( = 0.1), λ 3 = 0.1 for all datasets, λ 1 = 0.1 , 0.01 , 0.1 , 0.01, and \( \overline{D}=8,16,64,32 \)for Caltech-101, Corel 1 K, Corel 5 K, and WdcImageData, respectively
Table 6 Comparison retrieval response in the light of λ 3 by fixing retrieval length ( = 0.1) for all datasets, λ 1 = 0.1 , 0.01 , 0.1 , 0.01, λ 2 = 0.001 , 0.1 , 0.1 , 0.1, and \( \overline{D}=8,16,64,32 \)for Caltech-101, Corel 1 K, Corel 5 K, and WdcImageData, respectively

5.4 Retrieval results

This subsection evaluates the response of our SCMN system for retrieving images, with experiments conducted using the parameters discussed above. The test selects as many different categories as possible, namely 10, 5, 4, and 6 categories for Caltech-101, Corel 1 K, Corel 5 K, and WdcImageData, respectively. Since the SCMN combines multiview NMF with a finite mixture model with an MRF, this study compares the performance of SCMN with four algorithms that are related to multiview NMF or to a statistics-based model: MAH [8], JNMF [23], KL-GMM [9], and MNFCM. MNFCM is similar to Liu’s JNMF; the difference is that FCM is employed to cluster the data of the NMF coefficient matrix. To randomize the experiments, multiple runs, each with a different query image, are performed, and the average values of PR and RR are recorded. The number of runs is determined by the number of query images in the test set. Fig. 2 summarizes the PR and RR versus the retrieval length achieved by each examined algorithm. From this figure, it can be seen that the MAH and SCMN algorithms achieve a clearly better response across all retrieval tests. This is attributed to the involvement of multiview features, which are crucial for an image retrieval system. Owing to the multiple constraints of the objective function, the SCMN achieves the highest precision. In contrast, KL-GMM and MNFCM achieve inferior responses in terms of PR. Additionally, Fig. 2 presents the RR of all methods at varying retrieval lengths, showing that a multiview scheme usually achieves better results than a single-view framework. Due to the use of the SMM-MRF scheme, the SCMN is successful in clustering sparse features and thus returns more relevant images. This is also demonstrated by the plots of RR. It should be noted that for the Caltech101 dataset, all algorithms present relatively poor behaviour. This is reasonable. One reason is that images in this dataset have a complex structure and are rich in information in most cases, so it is relatively difficult to represent them. Another reason is that for the retrieval experiment on Caltech101, we have selected more categories, which is challenging for most retrieval schemes. Despite this, the SCMN still shows an encouraging retrieval response. Table 7 shows the top-returned results indexed from the respective categories corresponding to the query images.

Fig. 2
figure 2

PR and RR of the obtained results using different approaches. The first to the last row represent the Caltech-101, Corel 1 K, Corel 5 K, and WdcImageData datasets, respectively. The left column is Precision, and the right one is the Recall Rate

Table 7 Query images (at the first column) and the first top retrieval results obtained with different methods

5.5 Convergence study

Convergence determines the merits of the algorithm as well as its execution time. Since the proposed SCMN is an iterative scheme, we need to examine how the update rules minimize the objective function and compare this behaviour with that of the classical NMF. Fig. 3 illustrates the variations of the objective function during the iterations. It can be observed that the solution is very close to the local minimum after 200 iterations for the Caltech-101 and Corel 1 K datasets. For the other two datasets, the objective function requires more updates and stabilizes at approximately 250 iterations. It is worth noting that the SCMN converges nearly as fast as the classical NMF in all cases.

Fig. 3
figure 3

Objective function value versus the number of iterations. The first to the last row represent Caltech-101, Corel 1 K, Corel 5 K, and WdcImageData, respectively; the left column is the NMF method, and the right one is the proposed SCMN

5.6 Time comparison

In the last experiment, we empirically compare the time needed for the aforementioned retrieval tasks. By running all algorithms twenty times with a different query each time, we obtain the comparative results illustrated in Fig. 4. For the proposed SCMN, the learning phase takes a relatively long time. We believe this is because of the feature-extraction step of our method, which increases the computational time of the learning phase. In contrast, for the same retrieval task, the KL-GMM is time-saving in both the training and retrieval phases. As expected, the feature-extraction step is also required by JNMF, MAH, and MNFCM, and the considerable execution time of these three methods in this figure seems to confirm this explanation. For the average CPU time in the retrieval phase, no significant difference is observed among the compared methods. Thus, the proposed SCMN achieves a time cost that is acceptable relative to the baseline methods.

Fig. 4
figure 4

Comparison of average running time for different retrieval systems. (a) Training time; (b) test time

5.7 Evaluation of SCMN on clinical MR images

Finally, to evaluate the reliability of the SCMN not only on the four public databases but also on other types of images, we conducted an additional experiment to assess the performance of the proposed method on magnetic resonance (MR) images in terms of average retrieval accuracy. The public dataset used in this experiment consists of 900 standard clinical MR images taken from the SPM12 website (Footnote 4). For the purpose of illustration, the Appendix provides pseudocode for our image retrieval algorithm. Generally, model parameters are specified by users, primarily based on experiments. This paper instead selects a proper parameter set for SCMN statistically, using 10-fold cross validation (CV) [36]. Because CV can be used for estimation and evaluation, it is suitable for model parameter selection; in [14], for example, it was used to determine the number of neighbors in classification. In 10-fold CV, a labeled dataset S (900 standard clinical MR images) is partitioned into 10 equally sized subsets. The proposed method has four parameters: the inner dimension \( \overline{D} \) and three regularization factors (\( {\lambda}_1 \), \( {\lambda}_2 \), \( {\lambda}_3 \)). For simplicity, we discuss the impact of the two important parameters \( \overline{D} \) and \( {\lambda}_2 \) on the performance of our method. The other two parameters, \( {\lambda}_1 \) and \( {\lambda}_3 \), are fixed at 0.1, and \( \overline{D} \) and \( {\lambda}_2 \) are varied to find a better choice. In our experiment, the following values for the inner dimension \( \overline{D} \) are considered: 16, 32, 64, and 96. The regularization parameter \( {\lambda}_2 \) varies from \( {10}^{-3} \) to 1 by factors of 10, forming a set of parameter combinations \( M=\left\{{M}_1,{M}_2,\dots, {M}_{16}\right\} \) for the proposed SCMN model. For every selection \( {M}_i \), an iterative process is then conducted. In each iteration, a different subset is selected as the test set, and the remaining nine subsets are the training data. The retrieval Precision and Recall Rate are obtained by averaging the accuracies over these 10 folds, and the model \( {M}_i \) with the best image retrieval results is picked. We evaluated the performance with different values for \( \overline{D} \) and \( {\lambda}_2 \), and some performance curves are illustrated in Fig. 5. This figure shows that the best image retrieval performance is obtained with \( \overline{D}=32 \), \( {\lambda}_2=0.01 \). Accordingly, the values \( \overline{D}=32 \), \( {\lambda}_2=0.01 \), \( {\lambda}_1=0.1 \), \( {\lambda}_3=0.1 \) are selected as the best parameter combination for SCMN. In the following experiment, the proposed SCMN is tested on the SPM12 dataset for validation. The training set has three different categories with 300 images per category: sagittal plane (188 × 68), coronal plane (156 × 68), and horizontal plane (188 × 156). For each category, 10 images are randomly sampled as query images, giving 30 query images in total. Table 8 provides some sample retrieval results for the SPM12 dataset. Precision and Recall for the presented SCMN and the other algorithms are calculated and plotted: Fig. 6 shows the variation in Precision and Recall Rate with the number of images retrieved. The results mostly confirm our theoretical expectation. The SCMN method yields more accurate retrieval than all other methods in the comparison.
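The parameter-selection procedure described above can be sketched as a grid search wrapped around 10-fold cross validation. The `evaluate` callback is a placeholder for training SCMN on the training folds and returning a retrieval score on the held-out fold; it, the random shuffling, and the function names are assumptions of this illustration.

```python
import itertools
import random


def parameter_grid():
    """The 16 combinations M_1..M_16 of inner dimension and lambda_2."""
    inner_dims = [16, 32, 64, 96]
    lambda2_values = [10.0 ** e for e in range(-3, 1)]  # 1e-3, 1e-2, 1e-1, 1
    return list(itertools.product(inner_dims, lambda2_values))


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]


def select_parameters(grid, folds, evaluate):
    """Return the grid entry with the best mean score over all folds.

    evaluate(params, train_idx, test_idx) is a user-supplied (hypothetical)
    function that trains on train_idx and scores retrieval on test_idx.
    """
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = []
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            scores.append(evaluate(params, train_idx, test_idx))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

For the 900 MR images this yields ten folds of 90 images each, so every candidate is trained on 810 images and scored on the remaining 90, ten times over.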

Fig. 5
figure 5

Precision vs. Recall Rate for the different parameters, using 10-fold cross validation

Table 8 Query images (at the first column) and the first top retrieval results obtained with different methods
Fig. 6
figure 6

Comparison of retrieval performance for SPM12 dataset. (a) Precision; (b) Recall Rate

6 Conclusions

This article presented a CBIR scheme that merges the constrained NMF with multiple visual features and an MRF-based SMM approach. The following main advantages were revealed: (1) The algorithm embedded some additional constraints, such as local geometric structure and spatial location, to develop a novel objective function. This resulted in a relatively better performance according to the measures of PR and RR obtained for the image retrieval task, while maintaining an acceptable computational cost. (2) Using similarity metrics based on the Frobenius norm, the proposed method fused multiple distinct features so that the coefficient matrix of the SCMN preserved the features of the underlying images in the low-dimensional space and ensured that the proposed framework returned images as relevant as possible to the test image. (3) Based on Bayes’ theorem, the study incorporated MRF into SMM, which helped to alleviate the disturbance of noise in the training process and thus improved the robustness of the algorithm. Optimization was performed using an EM algorithm to estimate the parameters of SMM-MRF. (4) Finally, the convergence of the update rules was proved theoretically, and the complexity of the algorithm was also discussed. Overall, the experimental results indicate that the SCMN is stable for the test images chosen and exhibits a better retrieval response than the competing algorithms.

Future extensions of this work may aim to bridge the semantic gap between low-level features and user preferences, and to investigate whether the mixed visual features carry semantic information. We also intend to develop other effective techniques for this purpose.