1 Introduction

One of the main challenges of current research in pattern recognition (PR) is to improve the robustness of existing algorithms with respect to confounding factors including noise, rigid transformations, changes in viewpoint, illumination, etc. Recent advances in statistical learning [1] have brought attention to the notion of sparsity as a way to extract salient image features and obtain more accurate and robust classification. Wright et al. [2], in particular, introduced a very influential framework called sparse representation based classification (SRC) for face recognition (FR) and successfully applied this method to identify human faces under varying illumination, occlusion and real disguise. In their method, a test sample image is coded as a sparse linear combination of the training images and classification is achieved by identifying which class yields the least residual. Several other methods were inspired by SRC, including: the FR method based on sparse representation of facial image patches by Theodorakopoulos et al. [3]; kernel sparse representation for image classification and FR, which applies a sparse coding technique in a high dimensional feature space via some implicit feature mapping [4]; the Gabor occlusion dictionary for SRC by Yang and Zhang [5], which reduces the computational cost by using Gabor features; a robust regularized coding model to enhance the robustness of face recognition to confounding factors [6, 7]; and the method based on the maximum correntropy criterion for robust face recognition by He et al. [8]. An alternative point of view was proposed by Zhang et al. [9], who argued that rather than sparsity “the collaborative representation mechanism used in SRC is much more crucial to its success of face classification”. Based on this observation, they introduced a method called collaborative representation based classification with regularized least square (CRC) [9], which was shown to perform very competitively against SRC at a lower computational cost. As a further refinement of CRC, some of the same authors proposed a method called relaxed collaborative representation (RCR), which is designed to better capture the similarity and distinctiveness of different features for classification [10]. An alternative approach is the two-phase test sample representation method [11], which first detects the training samples located far away from the test sample (assuming they have a negligible effect on classification); the test sample is then represented as a linear combination of its M nearest neighbors and the representation result is used for classification. Another method, proposed in [12], consists in partitioning face images into blocks and then creating an indicator to remove the contaminated blocks and choose the nearest subspaces; SRC is finally used to classify the occluded test sample in the new feature space.

We also recall the Fisher discrimination dictionary learning (FDDL) algorithm by Yang et al. [13], which embeds the Fisher criterion in the design of the objective function. The FDDL scheme has two remarkable properties. First, dictionary atoms are learnt in association with the class labels, so that the reconstruction residual from each class can be used in classification; second, the Fisher criterion is imposed on the coding coefficients so that they carry discriminative information for classification. To improve this method, Feng et al. [14] proposed to jointly learn the projection matrix for dimensionality reduction and the discriminative dictionary for face representation (JDDLDR). The joint learning combines the learned projection and dictionary more effectively, with the result of improving FR performance. Within the general framework of discriminative dictionary learning (DDL), the projective dictionary pair learning (DPL) algorithm [15] learns a synthesis dictionary and an analysis dictionary jointly to achieve the goals of signal representation and discrimination. The support vector guided dictionary learning (SVGDL) method is proposed in [16]; in this model FDDL can be seen as the special case in which the weights are determined by the numbers of samples of each class, whereas SVGDL uses a parameterization method to adaptively determine the weight of each coding vector pair. Compared with FDDL, SVGDL can therefore adaptively assign different weights to different pairs of coding vectors. Yet another DDL approach recently proposed is the locality constrained and label embedding dictionary learning (LCLE-DL) algorithm [17], where locality information is preserved using the graph Laplacian matrix of the learned dictionary rather than the conventional one derived from the training samples; the label embedding term is then constructed using the label information of the atoms instead of the classification error term; the coding coefficients derived by combining the locality-based and label-based reconstructions are shown to be very effective for image classification. Very recently, a probabilistic interpretation of the collaborative mechanism was proposed to explain the classification mechanism of CRC; following this analysis, a method called probabilistic collaborative representation based classifier (ProCRC) was introduced, which jointly maximizes the likelihood that a test sample belongs to each of the multiple classes [18].

On the other hand, a class of algorithms described as local feature based methods [19–28] has also demonstrated very promising results in problems of object recognition and texture classification. For instance, some of these methods use Gabor filters to extract local directional features at multiple scales and have been successfully applied to FR [20, 21]. Compared to more conventional methods such as Eigenface [29] and Fisherface [30], Gabor filtering is less sensitive to image variations. Another type of local feature widely used in FR is the statistical local feature (SLF), such as the histogram of local binary patterns (LBP) [22], whose main principle is to model a face image as a composition of micro-patterns [28]. By partitioning the face image into several blocks, the statistical feature (e.g., the histogram of LBP) of each block is extracted, and the description of the image is finally formed by concatenating the features extracted from all blocks. For example, Zhang et al. [24, 25] proposed to use the Gabor magnitude or phase map instead of the intensity map to generate LBP features. New coding techniques on Gabor features have also been proposed; e.g., Zhang et al. [26] extracted and encoded the global and local variations of the real and imaginary parts of the data using a multi-scale Gabor representation. Borgi et al. [31–35] proposed two algorithms that apply a sparse multiscale representation based on shearlets to extract the essential geometric content of facial features, one called regularized shearlet network (RSN) and the other sparse multi-regularized shearlet network (SMRSN). Finally, we recall that Meng et al. [36] proposed a kernel based representation model to fully exploit the discrimination information embedded in the statistical local features (SLF_RKR) and applied a robust regression method to handle occlusions in face images.

In this paper, we adopt the same general philosophy as CRC and our main novel contribution is to integrate this method with a virtual collaborative projection (VCP) routine, designed to train the images of every class against the other classes with the goal of improving fidelity before projecting the query image. Additionally, inspired by the remarkable results reported in the recent literature on local feature based methods, our algorithm includes a routine that computes high-order statistical moments (SM) in order to extract highly discriminative local features and improve data representation. To validate our algorithm, which is called statistical binary pattern with virtual collaborative projection (SBP_VCP), we have tested it on multiple datasets for problems of face recognition, gender classification, handwritten digit recognition, object categorization and action recognition. Experimental results show that our method consistently achieves very competitive results as compared to classical and state-of-the-art algorithms.

The rest of this paper is organized as follows. Section 2 introduces the main idea of statistical binary pattern and high order moments for feature extraction. Section 3 describes the proposed virtual collaborative projection applied to trained faces. Section 4 reports extensive numerical experiments to validate the proposed method and compare it against state-of-the-art methods on problems of face recognition under different confounding factors as well as image categorization, handwritten digit and action recognition. Finally, Sect. 5 concludes this paper.

2 Statistical binary pattern and high order moments

The statistical binary patterns (SBP) representation is an extension of local binary patterns (LBP) and it aims at enhancing the expressiveness and discrimination power of LBP for image modelling (especially texture) and recognition, while reducing sensitivity to small perturbations, e.g., noise. The main idea of this method, which was introduced by one of the authors and their collaborator in [37], consists in applying a rotation invariant uniform LBP to a set of images corresponding to the local statistical moments associated to a given spatial support. The resulting code forms the SBP and an image is then represented by joint or marginal distributions of SBPs.

2.1 Moment images

A real valued 2d discrete image f is modelled as a mapping from \({{\mathbb{Z}}^{2}}\) to \(\mathbb{R}\). The spatial support used to calculate the local statistics is modelled as \(B\subset {{\mathbb{Z}}^{2}}\), such that \(O\in B\), where O is the origin of \({{\mathbb{Z}}^{2}}\). The r-order moment image associated to f and B is also a mapping from \({{\mathbb{Z}}^{2}}\) to \(\mathbb{R}\), defined as:

$$m_{(f,B)}^{r}(z)=\frac{1}{\left| B \right|}\sum\limits_{b\in B}{{{\left( f(z+b) \right)}^{r}}}$$
(1)

where z is a pixel from \({{\mathbb{Z}}^{2}}\), and \(\left| B \right|\) is the cardinality of the structuring element B. Accordingly, the r-order centered moment image (r > 1) is defined as:

$$\mu _{(f,B)}^{r}(z)=\frac{1}{\left| B \right|}\sum\limits_{b\in B}{{{\left( f(z+b)-m_{(f,B)}^{1}(z) \right)}^{r}}}$$
(2)

where \(m_{(f,B)}^{1}(z)\) is the average value (1-order moment) calculated around z. Finally the r-order normalized centered moment image (r > 2) is defined as:

$$\beta _{(f,B)}^{r}(z)=\frac{1}{\left| B \right|}\sum\limits_{b\in B}{{{\left( \frac{f(z+b)-m_{(f,B)}^{1}(z)}{\sqrt{\mu _{(f,B)}^{2}(z)}} \right)}^{r}}}$$
(3)

where \(\mu _{(f,B)}^{2}(z)\) is the variance (2-order centered moment) calculated around z.
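As an illustrative sketch only (ours, not the authors' implementation), the moment images of Eqs. (1)–(3) can be computed with NumPy as follows; the reflected-border handling and the small constant added under the square root are our own assumptions:

```python
import numpy as np

def moment_images(f, B, r_max=2, eps=1e-8):
    """Local moment images of Eqs. (1)-(3); B is a list of (dy, dx) offsets containing (0, 0)."""
    f = f.astype(np.float64)
    pad = max(max(abs(dy), abs(dx)) for dy, dx in B)
    fp = np.pad(f, pad, mode='reflect')          # border handling (our assumption)
    H, W = f.shape
    # Stack the |B| translated images f(z + b): shape (|B|, H, W)
    shifted = np.stack([fp[pad + dy:pad + dy + H, pad + dx:pad + dx + W] for dy, dx in B])
    moments = {'m1': shifted.mean(axis=0)}       # Eq. (1) with r = 1
    centered = shifted - moments['m1']           # f(z + b) - m^1(z)
    if r_max >= 2:
        moments['mu2'] = (centered ** 2).mean(axis=0)                       # Eq. (2) with r = 2
    for r in range(3, r_max + 1):                # Eq. (3): normalized centered moments
        moments[f'beta{r}'] = ((centered / np.sqrt(moments['mu2'] + eps)) ** r).mean(axis=0)
    return moments

# Example: a 3 x 3 square support around the origin.
B = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
```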

2.2 Statistical binary patterns

Let R and P denote the radius of the neighborhood circle and the number of values sampled on the circle, respectively. For each moment image M, one statistical binary pattern is formed as follows:

  • one (P + 2)-valued pattern corresponding to the rotation invariant uniform LBP coding of M:

$$SB{{P}_{P,R}}(M)(z)=LBP_{P,R}^{riu2}(M)(z)$$
(4)
  • one binary value corresponding to the comparison of the centre value with the mean value of M:

$$SB{{P}_{C}}(M)(z)=s\left( M(z)-\tilde{M} \right)$$
(5)

where s denotes the binary thresholding function used in LBP (\(s(x)=1\) if \(x\ge 0\), and 0 otherwise), and \(\tilde{M}\) is the mean value of the moment image M over the whole image. Hence \(SB{{P}_{P,R}}(M)\) represents the structure of the moment M with respect to a local reference (the center pixel), while \(SB{{P}_{C}}(M)\) complements this information with the relative value of the center pixel with respect to a global reference (\(\tilde{M}\)). As a result of this first step, a \(2(P+2)\)-valued scalar descriptor is computed for every pixel of each moment image.
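A minimal NumPy sketch of Eqs. (4) and (5) for a single moment image M is given below; it is our own illustration (bilinear interpolation of the circular neighbours, interior pixels only), not the authors' code:

```python
import numpy as np

def lbp_riu2(M, P=8, R=1):
    """Per-pixel LBP^{riu2}_{P,R} code (Eq. 4) and global-mean bit (Eq. 5) for a moment image M."""
    H, W = M.shape
    Rc = int(np.ceil(R))
    ys, xs = np.mgrid[Rc:H - Rc, Rc:W - Rc]      # interior pixels only
    center = M[Rc:H - Rc, Rc:W - Rc]
    bits = []
    for p in range(P):
        ang = 2.0 * np.pi * p / P
        y, x = ys + R * np.sin(ang), xs + R * np.cos(ang)
        y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
        y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
        fy, fx = y - y0, x - x0
        nb = (M[y0, x0] * (1 - fy) * (1 - fx) + M[y0, x1] * (1 - fy) * fx +
              M[y1, x0] * fy * (1 - fx) + M[y1, x1] * fy * fx)   # bilinear interpolation
        bits.append((nb >= center).astype(int))
    bits = np.stack(bits)                                        # shape (P, h, w)
    circ = np.concatenate([bits, bits[:1]], axis=0)
    transitions = np.abs(np.diff(circ, axis=0)).sum(axis=0)      # 0/1 transitions around the circle
    riu2 = np.where(transitions <= 2, bits.sum(axis=0), P + 1)   # uniform -> number of ones, else P + 1
    sbp_c = (center >= M.mean()).astype(int)                     # Eq. (5): compare to the global mean
    return riu2, sbp_c
```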

2.3 Image descriptors

Let \({{\left\{ {{M}_{i}} \right\}}_{1\le i\le {{n}_{M}}}}\) be the set of \({{n}_{M}}\) computed moment images. \(SB{{P}^{\left\{ {{M}_{i}} \right\}}}\) is defined as a vector valued image with \({{n}_{M}}\) components, such that for every \(z\in {{\mathbb{Z}}^{2}}\) and for every i, \(SB{{P}^{\left\{ {{M}_{i}} \right\}}}{{(z)}_{i}}\) is a value in \([0,2(P+2))\). If the image f contains texture, the descriptor associated with f is given by the histogram of the values of \(SB{{P}^{\left\{ {{M}_{i}} \right\}}}\). We consider two kinds of histograms.

First we consider the joint histogram H defined as follows:

$$\begin{aligned} & H:{{[ 0,2(P+2) [}^{{{n}_{M}}}}\to {\mathbb{N}} \\ & H(v)=\left| \left\{ z;SB{{P}^{\left\{ {{M}_{i}} \right\}}}(z)=v \right\} \right| \\ \end{aligned}$$
(6)

Depending on the size of the texture images, the joint distribution may become too sparse when the dimension (i.e., the number of moments) increases.

Next, we consider the marginal histograms \({{\{{{h}_{i}}\}}_{i\le {{n}_{M}}}}\) defined as:

$$\begin{aligned} & {{h}_{i}}:[ 0,2(P+2) [\to \mathbb{N} \\ & {{h}_{i}}(n)=\left| \left\{ z;SB{{P}^{\left\{ {{M}_{i}} \right\}}}{{(z)}_{i}}=n \right\} \right| \\ \end{aligned}$$
(7)

An image descriptor can then be defined using the joint histogram H or the concatenation of the \({{n}_{M}}\) marginal histograms \(\{{{h}_{i}}\}\). The length of the descriptor vector is \({{[2(P+2)]}^{{{n}_{M}}}}\) in the first case and \(2{{n}_{M}}(P+2)\) in the second case.
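Assuming each SBP component image stores the fused code \(SB{{P}_{P,R}}({{M}_{i}})+(P+2)\,SB{{P}_{C}}({{M}_{i}})\), i.e. an integer in \([0,2(P+2))\), the two descriptors can be sketched as follows (the helper name is ours):

```python
import numpy as np

def sbp_descriptors(sbp_components, P):
    """sbp_components: list of n_M integer code images with values in [0, 2(P+2))."""
    K = 2 * (P + 2)
    codes = np.stack([c.ravel() for c in sbp_components])        # shape (n_M, n_pixels)
    # Joint histogram H (Eq. 6): one bin per n_M-tuple of codes, i.e. K**n_M bins in total
    joint = np.bincount(np.ravel_multi_index(codes, dims=(K,) * codes.shape[0]),
                        minlength=K ** codes.shape[0])
    # Marginal histograms h_i (Eq. 7): K bins each, concatenated -> 2 n_M (P + 2) values
    marginal = np.concatenate([np.bincount(c, minlength=K) for c in codes])
    return joint, marginal
```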

2.4 Higher order moments

The SBP model on higher order moments is evaluated next. The objective of the SBP framework is to extend the LBP texture image descriptors from the local level, represented by the pixel z, to the regional distribution level of \(z+B\) by approximating the distribution with a set of statistical moments. It is known that the mean and variance faithfully describe a statistical distribution only in special cases, e.g., when it is a normal distribution. This assumption may fail for natural texture images. Therefore, higher order moments are needed to obtain an accurate description of a general distribution and capture the relevant information.

Regarding the size of the image descriptor, it clearly increases as the number of moments increases. When we use joint histograms, the descriptor size is \({{(2(P+2))}^{n}}\), where P is the number of neighbours used in LBP and n is the number of moment images. When we use marginal histograms, the size is only \(2n(P+2)\), but this comes at the price of a significant loss of information. Hence we propose a trade-off between descriptor size and information loss based on the concatenation of joint histograms corresponding to pairs of moment images.

Formally, we can recursively define the higher order SBP hybrid image descriptor as follows.

Let \({{M}_{1}}\) and \({{M}_{2}}\) be moments, or combinations of moments represented by their joint or concatenated histograms. We shall denote by \(SB{{P}^{{{M}_{1}}{{M}_{2}}}}\) (resp. \(SB{{P}^{{{M}_{1}}\_{{M}_{2}}}}\)) the image descriptor made of the joint (resp. concatenated) histograms constructed from \(SB{{P}^{{{M}_{1}}}}\) and \(SB{{P}^{{{M}_{2}}}}\). In our experiments with higher order moments below, we have only considered pairs of moments for joint histograms. The algorithm below summarizes the computation of the high order statistical binary pattern (SBP):

The SBP Algorithm

Input: f—a 2D image, \(B\subset {{\mathbb{Z}}^{2}}\)—the spatial support used to calculate the local moments, P—the number of neighbours, R—the radius of the neighbouring circle.

Output: \(SBP_{P,R}^{{{m}_{1}}{{\mu }_{2}}}\)—texture descriptor of f.

Calculate moment images:

1. Calculate the first order moment image \({{m}_{1}}\) (or \(m_{(f,B)}^{1}\)) associated to f and B using the formula (1).

2. Calculate the second order centered moment image \({{\mu }_{2}}\) (or \(\mu _{(f,B)}^{2}\)) associated to f and B using the formula (2).

Statistical binary patterns:

1. Calculate the statistical binary patterns \(SB{{P}_{P,R}}\left( {{m}_{1}} \right)\) and \(SB{{P}_{C}}\left( {{m}_{1}} \right)\) from the first order moment image \({{m}_{1}}\), using the formulas (4) and (5).

2. Calculate the statistical binary patterns \(SB{{P}_{P,R}}\left( {{\mu }_{2}} \right)\) and \(SB{{P}_{C}}\left( {{\mu }_{2}} \right)\) from the second order centered moment image \({{\mu }_{2}}\), using the formulas (4) and (5).

3. Calculate \(SBP_{P,R}^{{{m}_{1}}{{\mu }_{2}}}\) as joint histogram of \(SB{{P}_{P,R}}\left( {{m}_{1}} \right)\), \(SB{{P}_{C}}\left( {{m}_{1}} \right)\), \(SB{{P}_{P,R}}\left( {{\mu }_{2}} \right)\) and \(SB{{P}_{C}}\left( {{\mu }_{2}} \right)\).
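For concreteness, the whole descriptor computation can be sketched by chaining the illustrative helpers introduced above (moment_images, lbp_riu2 and sbp_descriptors are our own hypothetical names, not the authors' code):

```python
import numpy as np

def sbp_feature(image, B, P=8, R=1):
    """Normalized SBP_{P,R}^{m1 mu2} descriptor: joint histogram of the fused SBP codes."""
    mom = moment_images(image, B, r_max=2)            # steps 1-2: m1 and mu2 moment images
    codes = []
    for M in (mom['m1'], mom['mu2']):
        riu2, c = lbp_riu2(M, P=P, R=R)               # Eqs. (4) and (5)
        codes.append(riu2 + (P + 2) * c)              # fuse both codes into one value in [0, 2(P+2))
    joint, _ = sbp_descriptors(codes, P=P)            # step 3: joint histogram, (2(P+2))^2 bins
    return joint / joint.sum()                        # normalized texture descriptor
```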

Figures 1 and 2 compare the recognition rates of the LBP, CLBP [38] and SBP algorithms. For this comparison, we used the Outex database [39], a large and comprehensive texture database which includes 24 classes of textures collected under three illuminations and at nine angles. To measure the dissimilarity between two histograms, we used the nearest neighbour classifier with the Chi square distance. We considered different configurations of SBP: in Fig. 1 we set (P,R) equal to (24,3); in Fig. 2 we used the values (8,1), (16,2) and (24,3).
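A minimal sketch of this evaluation protocol (nearest neighbour classification of histograms under the Chi square dissimilarity) is given below; it is an illustration only, not the evaluation code used for the figures:

```python
import numpy as np

def chi2(h, g, eps=1e-10):
    """Chi square dissimilarity between two (normalized) histograms."""
    return 0.5 * np.sum((h - g) ** 2 / (h + g + eps))

def nn_classify(test_hists, train_hists, train_labels):
    """Assign each test histogram the label of its nearest training histogram."""
    return np.array([train_labels[int(np.argmin([chi2(h, g) for g in train_hists]))]
                     for h in test_hists])
```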

Fig. 1 Classification rate (%) of LBP, CLBP and SBP with (P,R) = (24,3) using the Outex texture database

Fig. 2 Classification rate (%) of LBP, CLBP and SBP with (P,R) = (8,1), (16,2) and (24,3) using the Outex texture database

3 Virtual collaborative projection

Zhang et al. [9] investigated the role of collaboration between classes in representing the query sample. In order to collaboratively represent the query sample \(y\in {{\mathbb{R}}^{m}}\) using X (all the gallery images where each column is a training sample) with low computational cost, they introduced a method called collaborative representation based classification with regularized least square method (CRC_RLS). A general model of collaborative representation is:

$$\tilde{\alpha }=\arg {{\min }_{\alpha }}\left\{ \left\| y-X\alpha \right\|_{2}^{2}+\lambda \left\| \alpha \right\|_{2}^{2} \right\}$$
(8)

where \(\alpha\) is the coding vector \((\alpha =[{{\alpha }_{1}},\ldots ,{{\alpha }_{i}},\ldots ]\) and \(y\approx X\alpha)\) and \(\lambda\) is the regularization parameter.

The algorithm is described below:

The CRC-RLS Algorithm

1. Normalize the columns of X to have unit \({{l}_{2}}\)-norm.

2. Code y over X by

\(\tilde{\alpha }=Py\)

where \(P={{\left( {{X}^{T}}X+\lambda I \right)}^{-1}}{{X}^{T}}\).

3. Compute the regularized residuals

\({{r}_{i}}={{\left\| y-{{X}_{i}}{{{\tilde{\alpha }}}_{i}} \right\|}_{2}}/{{\left\| {{{\tilde{\alpha }}}_{i}} \right\|}_{2}}\)

4. Output the identity of y as

\(\text{identity}(y)=\arg {{\min }_{i}}\left\{ {{r}_{i}} \right\}\)

where \({{\tilde{\alpha }}_{i}}\) is the coding vector associated with class i.
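A compact NumPy sketch of the four steps above (ours, not the reference implementation of [9]) is:

```python
import numpy as np

def crc_rls(X, y, labels, lam=0.001):
    """X: (m, n) matrix of training samples (columns); labels: (n,) class ids; y: (m,) query."""
    X = X / np.linalg.norm(X, axis=0, keepdims=True)                 # step 1: unit l2 columns
    P = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)     # P = (X^T X + lam I)^{-1} X^T
    alpha = P @ y                                                    # step 2: code y over X
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c]) /
                 np.linalg.norm(alpha[labels == c]) for c in classes]   # step 3
    return classes[int(np.argmin(residuals))]                        # step 4
```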

The method proposed in this paper improves this algorithm by increasing the fidelity of the training images and by enhancing the collaboration between classes: we represent not only the query sample y but also all the gallery images \({{x}_{i}}\) of every class i, based on the idea of virtual collaborative projection (VCP).

Using this idea, we can compute the average images \({{C}_{i}}\) from every class i over X, defined as:

$${{C}_{i}}=\frac{1}{{{N}_{tr}}}\sum\limits_{j=1}^{{{N}_{tr}}}{{{x}_{i,j}}}$$
(9)

where \({{x}_{i,j}}\) denotes the j-th training image of class i and \({{N}_{tr}}\) the number of training images of class i.

Next by computing P as:

$$P={{\left( {{X}^{T}}X+\lambda I \right)}^{-1}}{{X}^{T}}$$
(10)

then the resulting virtual coefficient \({{\tilde{\alpha }}_{virtual}}\) is calculated as follows:

$${{\tilde{\alpha }}_{virtual}}=P{{C}_{i}}$$
(11)

This virtual coefficient is used as a weight for every class i to reconstruct new gallery images \({{d}_{{{c}_{i}}}}\):

$${{d}_{{{c}_{i}}}}={{\left\| {{{\tilde{\alpha }}}_{virtual}} \right\|}_{2}}{{C}_{i}}$$
(12)

A new dictionary D (the update of X) is then obtained by combining all images \({{d}_{{{c}_{i}}}}\left( D=\left[ {{d}_{{{c}_{1}}}},\ldots ,{{d}_{{{c}_{i}}}},\ldots \right] \right)\).

Next, when a query sample y is presented to be classified, we follow the same procedure as CRC_RLS by computing the regularized residuals \({{r}_{i}}\) but we utilize the new dictionary D:

$${{r}_{i}}={{\left\| y-{{D}_{i}}{{{\tilde{\alpha }}}_{virtual}} \right\|}_{2}}/{{\left\| {{{\tilde{\alpha }}}_{virtual}} \right\|}_{2}}$$
(13)

where \({{D}_{i}}\) represents the images of class i. The identity of a query sample y is computed by:

$$\text{identity}(y)\text{ }=\text{ }\arg {{\min }_{i}}\left\{ {{r}_{i}} \right\}$$
(14)

Below we present our virtual collaborative projection (VCP) algorithm when a query image \(y\) is presented to be classified:

The VCP Algorithm

1. Normalize the columns of X to have unit \({{l}_{2}}\)-norm.

2. Compute the average images \({{C}_{i}}\) of every class i using the formula (9).

3. Compute the virtual coefficient \({{\tilde{\alpha }}_{virtual}}\) using the formulas (10) and (11).

4. Compute \({{d}_{{{c}_{i}}}}\) using the formula (12).

5. Combine all the \({{d}_{{{c}_{i}}}}\) into a dictionary D.

6. Compute the regularized residuals \({{r}_{i}}\) using the formula (13).

7. Return the identity of y using the formula (14).
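The sketch below illustrates one possible reading of these steps; in particular, we read step 6 as coding the query over the new dictionary D with the same regularized projection used in CRC_RLS, and we keep one averaged, reweighted atom per class. The code and variable names are ours, not the authors' implementation:

```python
import numpy as np

def vcp_fit(X, labels, lam=0.001):
    """Build the VCP dictionary D (one reweighted class-average atom per class)."""
    X = X / np.linalg.norm(X, axis=0, keepdims=True)                 # step 1: unit l2 columns
    P = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)     # Eq. (10)
    classes = np.unique(labels)
    atoms = []
    for c in classes:
        C_i = X[:, labels == c].mean(axis=1)                          # Eq. (9): class average image
        a_virtual = P @ C_i                                           # Eq. (11): virtual coefficient
        atoms.append(np.linalg.norm(a_virtual) * C_i)                 # Eq. (12): reweighted atom
    return classes, np.stack(atoms, axis=1)                           # D, one column per class

def vcp_classify(y, classes, D, lam=0.001):
    """Code y over D as in CRC_RLS, then pick the class with the smallest regularized residual."""
    PD = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T)
    beta = PD @ y
    residuals = [np.linalg.norm(y - D[:, i] * beta[i]) / (abs(beta[i]) + 1e-12)
                 for i in range(len(classes))]                        # Eq. (13) per class
    return classes[int(np.argmin(residuals))]                         # Eq. (14)
```

In this reading the larger projection matrix associated with X is computed once offline, while the online classification only involves the small dictionary D, which is consistent with the running time discussion in Sect. 4.7.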

In order to investigate the efficiency of VCP versus CRC, we conducted some experiments using the AR face dataset [40] with different feature dimensions. Note that PCA is used to reduce the dimensionality of the original face images, and the Eigenface features are used for this first experiment with dimensions 54, 120 and 300.

For this comparison, we selected a subset from AR dataset that contains 50 male subjects and 50 female subjects with only illumination and expression changes. For each subject, the seven images from Session 1 were used for training and the other seven images from Session 2 were used for testing. The images were cropped and resized to 60 × 43. Table 1 shows that VCP performs slightly better than CRC_RLS [9]:

Table 1 Comparison of VCP (virtual collaborative projection) versus CRC (collaborative representation based classification) using AR data set with different dimensionality

Additional experiments are conducted in Sect. 4 on object categorization and action recognition, where we use features provided by state-of-the-art methods rather than the high order statistical moments.

We conclude this section by presenting our high order statistical binary pattern with virtual collaborative projection (SBP_VCP) algorithm, obtained by adding the step of high order statistical moment feature extraction (cf. Sect. 2) to the VCP algorithm. This additional step is performed on the training images X, resulting in a new training set, and on every query sample y.

The SBP_VCP Algorithm

1. Extract the statistical binary patterns \(SBP_{P,R}^{{{m}_{1}}{{\mu }_{2}}}\) of X using the SBP Algorithm.

2. Extract the statistical binary patterns \(SBP_{P,R}^{{{m}_{1}}{{\mu }_{2}}}\) of y using the SBP Algorithm.

3. Call VCP algorithm.
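As an illustrative usage sketch on dummy data, reusing the hypothetical helpers from the previous sections:

```python
import numpy as np

rng = np.random.default_rng(0)
B = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]        # local support for the moments
gallery = [rng.random((64, 64)) for _ in range(20)]              # 4 dummy classes x 5 images
labels = np.repeat(np.arange(4), 5)
query = rng.random((64, 64))

X = np.stack([sbp_feature(im, B) for im in gallery], axis=1)     # steps 1-2: SBP features
classes, D = vcp_fit(X, labels, lam=0.001)                       # step 3: VCP training
pred = vcp_classify(sbp_feature(query, B), classes, D, lam=0.001)  # step 3: VCP classification
```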

In the next section we illustrate the performance of the SBP_VCP approach.

4 Experiments

To demonstrate the performance of our SBP_VCP algorithm, we conducted extensive experiments on multiple benchmark databases for face recognition, handwritten digit recognition, gender classification, image categorization and action recognition.

4.1 Parameter settings

We first describe how we set the parameters in the SBP_VCP algorithm. Apart from the choice of moments and their combinations, two additional parameters need to be set in the calculation of the SBP:

  • The spatial support B for calculating local moments.

  • The spatial support {P;R} for calculating the LBP.

Although those two parameters are relatively independent, it must be noticed that B has to be sufficiently large to be statistically relevant. Regarding {P;R}, this quantity is supposed to be relatively small in order to represent local micro-structures of the (moment) images.

In the following, due to space constraints, we only show experiments using structuring element B = {(1;5); (2;8)} which provides very satisfactory results on the different datasets.

Regarding {P;R}, the spatial support of the LBP, we have considered the three settings commonly found in the literature: {8;1}, {16;2}, and {24;3}.

Regarding the parameters associated with the virtual collaborative projection and the collaborative classification, we used a regularization parameter \(\lambda\) which is initialized as follows, for:

  • Face recognition (FR) without occlusion: \(\lambda =0.001\)

  • Face recognition (FR) with occlusion: \(\lambda =0.1\)

  • Gender classification (GC): \(\lambda =0.001\)

  • Handwritten digit recognition: \(\lambda =0.1\)

  • Image categorization: \(\lambda =0.001\)

  • Action recognition: \(\lambda =0.1\)

In all tables reported, the value in bold indicates the best performance. Namely, in Table 1 through Table 18, the values in bold indicate the best recognition rates; in Tables 19 and 20 the values in bold indicate the least computation time.

4.2 Face recognition (FR)

4.2.1 Extended Yale B database

The Extended Yale B [41, 42] database contains 2414 frontal face images of 38 individuals; some samples are presented in Fig. 3. We used the cropped and normalized face images of size 54 × 48, which were taken under varying illumination conditions. Three tests are considered for this dataset.

Fig. 3 Selected samples from the Extended Yale B database

Test 1. We randomly split the database into two halves. One half, which contains 32 images for each person, was used as the dictionary, and the other half was used for testing. Table 2 shows the recognition rates versus feature dimension for the nearest neighbour NN, nearest feature line NFL [43], support vector machine SVM, sparse representation based classification SRC [2], linear regression based classification LRC [44], locality-constrained linear coding LLC [45] and regularized robust coding RRC [7] methods. SBP_VCP achieves the best recognition rate for all dimensions except dimension 300, where it performs slightly worse than RRC_l1 [7] but is still superior to all the other methods considered.

Table 2 Face recognition results (test 1) of different methods on the Extended Yale B database

Test 2. For each subject, \({{N}_{tr}}\) samples are randomly chosen as training samples and 32 of the remaining images are randomly chosen as the testing data. Here the images are resized to 96 × 84 and the experiment for each \({{N}_{tr}}\) is run ten times. For comparison, we used the robust kernel representation with statistical local features SLF-RKR [36] and, with the same feature extraction, the statistical local features SLF combined with the NN, LRC, SVM, CRC and SRC based methods.

We list in Table 3 the FR performance results, measured as mean recognition accuracy. The proposed algorithm SBP_VCP achieves the best performance when \({{N}_{tr}}\) = 5 or 20 and is the second best method, slightly behind SLF-RKR_l2, when \({{N}_{tr}}\) = 10. It can also be noticed that methods based on collaborative representation (e.g., SLF-RKR [36], SLF + CRC, SLF + SRC and the original SRC) perform better than other kinds of linear representation methods (e.g., SLF + LRC, SLF + NN).

Table 3 Face recognition results (test 2) of different methods on the Extended Yale B database

Test 3. In the third test, we randomly selected between 2 and 7 images from each person as the training set and used the remaining images as the testing set. All the samples were projected into a subspace of 550 dimensions (samples in the LDA + SRC and LDA + CRC schemes are projected into a subspace of 37 dimensions); in addition to SRC and CRC, we compare our method with the JDDLDR [14], FDDL [13] and DPL [15] based approaches. The FR results are shown in Table 4.

Table 4 Face recognition results (test 3) of different methods on the Extended Yale B database

Table 4 shows that SBP_VCP gives the best results for all values of \({{N}_{tr}}\). We remark that the improvement in performance is significant as compared to all the other methods, demonstrating the advantages of combining the statistical features with this twin competitive (collaborative) classification.

4.2.2 AR database

Test 1. As in [2], we selected a subset (with only illumination and expression changes) containing 50 male and 50 female subjects from the AR database [40]; some samples are shown in Fig. 4. For each subject, the seven images from Session 1 were used for training and the other seven images from Session 2 were used for testing. The images were cropped to 60 × 43. The FR rates with baseline comparison reported in Table 5 show that the proposed approach yields the best performance among all methods considered for all dimensions, even when the dimension is 30 and competing methods perform rather poorly. As expected, all methods achieve their maximal recognition rates at dimension 300.

Fig. 4 Selected samples from the AR database

Table 5 Face recognition results (test 1) of different methods on the AR database

Test 2. For each subject, the seven images with illumination change and expressions from Session 1 were used for training, and the other seven images with only illumination change and expression from Session 2 were used for testing. The size of the original face images is 83 × 60. The recognition rates versus the number of training samples \({{N}_{tr}}\) are reported in Table 6, showing that SBP_VCP achieves the highest recognition rates, followed in order by SLF-RKR [36] and SLF + SRC.

Table 6 Face recognition results (test 2) of different methods on the AR database

4.2.3 MPIE database

The CMU Multi-PIE database [46] contains images of 337 subjects captured in four sessions with simultaneous variations in pose, expression, and illumination. Among these 337 subjects, all the 249 subjects in Session 1 were used for training. To make the FR more challenging, four subsets with both illumination and expression variations in Sessions 1, 2 and 3, were used for testing. We conducted two tests with this experimental protocol.

Test 1. In the first test, for the training set, as in [2], we used the seven frontal images with extreme illuminations {0, 1, 7, 13, 14, 16, and 18} and neutral expression (refer to Fig. 5a for examples). For the testing set, four typical frontal images with illuminations {0, 2, 7, 13} and different expressions (smile in Sessions 1 and 3, squint and surprise in Session 2) were used (refer to Fig. 5b for examples with surprise in Session 2, Fig. 5c for examples with smile in Session 1, and Fig. 5d for examples with smile in Session 3). Here we used Eigenface with dimensionality 300 as the face feature for sparse coding. Table 7 reports the recognition rates found in four testing sets.

Fig. 5 A subject in the Multi-PIE database. a Training samples with only illumination variations. b Testing samples with surprise expression and illumination variations. c, d Testing samples with smile expression and illumination variations in Session 1 and Session 3, respectively

Table 7 Face recognition results of different methods on the MPIE database

Table 7 shows that SBP_VCP gives the best results on the smile-S1 and squint-S2 sets and the second best results on the surprise-S2 and smile-S3 sets. The smile-S1 set comes from the same session (Session 1) as the training set, which explains the high accuracy; on the smile-S3 and surprise-S2 sets our method achieves the second best accuracy, 72.7 and 62.5% respectively.

Test 2. In the second test, we analyzed the impact of statistical binary pattern (SBP) on different state-of-the-art methods with the same experimental protocol as Test 1. We considered nearest neighbours NN, linear regression LRC [44], sparse representation SRC [2], collaborative representation CRC [9] and relaxed collaborative representation RCR [10] based classification. Table 8 reports the recognition rates found on the different methods with and without SBP.

Table 8 Face recognition results of different methods with SBP on the MPIE database

Results in Table 8 show that SBP consistently increases the performance of the different approaches, especially for the testing sets that are not from Session 1. The improvement in performance is significant for the collaborative classification based methods CRC and RCR; for example, the recognition rate of RCR on the squint-S2 set increases from 40 to 74.6%, and on the surprise-S2 set from 38.1 to 64.5%.

4.2.4 AR database, disguise

In this experiment, we considered a subset from the AR database consisting of 2599 images from 100 subjects (26 samples per class except for a corrupted image w-027-14.bmp), 50 males and 50 females. We performed three tests: the first one follows the experimental settings in [2]; the other two, described below, are more challenging. The images were resized to 83 × 60 in the first and third test and to 42 × 30 in the second test; four representative samples of two persons are shown in Fig. 6.

Fig. 6 Testing samples with sunglasses and scarves from the AR database

Test 1. In the first test, 799 images (about 8 samples per subject) of non-occluded frontal views with various facial expressions in Sessions 1 and 2 were used for training, while two separate subsets (with sunglasses and scarf) of 200 images (1 sample per subject per session, with neutral expression) were used for testing. The FR results are listed in Table 9 and show that the SBP_VCP method achieves a much higher recognition rate than CRC_RLS [9], RRC [7] (with scarf), SRC [2], the Gabor feature based sparse representation with Gabor occlusion dictionary GSRC [5] and the maximum correntropy criterion CESR [8].

Table 9 Test 1: Face recognition results using images with real disguise from the AR database

Test 2. In the second test, we considered FR with a more complex disguise, including variations of illumination and a longer data acquisition interval. 400 images (4 neutral images with different illuminations per subject) of non-occluded frontal views in Session 1 were used for training, while the disguised images (3 images with various illuminations and sunglasses or scarves per subject per session) in Sessions 1 and 2 were used for testing. The results, reported in Table 10, show that the SBP_VCP method achieves better performance than CRC_RLS [9], SRC [2], GSRC [5] and CESR [8], except for sunglass-S1, where it achieves the second best result after RRC [7].

Table 10 Test 2: Face recognition results using images with real disguise from the AR database

Test 3. In this test, a subset of 50 males and 50 females was selected from the AR database. For each subject, 7 samples without occlusion from Session 1 were used for training, with all the remaining samples with disguises used for testing. These testing samples (including, per subject, 3 samples with sunglasses in Session 1, 3 samples with sunglasses in Session 2, 3 samples with a scarf in Session 1 and 3 samples with a scarf in Session 2) not only have disguises, but also variations of time and illumination. Table 11 reports the FR results on the four test sets with disguise.

Table 11 Test 3: Face recognition results using images with real disguise from the AR database

Table 11 shows that the proposed method achieves the best recognition rate with sunglasses in Session 2, achieves 100% accuracy in Session 1 (as do some other methods), and achieves the second best accuracy in the sessions with scarf (SLF_RKR is ranked first). We remark that all methods perform better on Session 1 (sunglasses and scarf) than on Session 2, as Session 2 is more challenging due to the variations in illumination and acquisition time.

4.2.5 Georgia Tech database with block occlusion

The Georgia Tech (GT) Face Database [47] contains 750 color images of 50 subjects (15 images per subject), as shown in Fig. 7a. These images have large variations in pose and expression and some illumination changes. Images were converted to gray scale, cropped and resized to 90 × 68. The first eight images of all subjects were used for training (400 images) and the remaining seven images for testing (350 images). For block occlusion, a randomly located rectangle of every testing image was replaced with an unrelated image, as illustrated in Fig. 7c.

Fig. 7 a Original images of the same subject from Georgia Tech. b Original test image. c Test image with random block occlusion (30%)

Performance results reported in Table 12 compare the algorithms SBP_VCP, SBP-CRC, SBP-SRC, SBP-LRC, and SBP-NN in the presence of block occlusion ranging from 0 to 50% of the image. Table 12 shows that SBP_VCP achieves the best accuracy. Our interpretation is that this remarkable performance is due mostly to the VCP approach which efficiently takes advantage of the twin collaborative representation in the training and testing steps.

Table 12 Face recognition results using the GT database with block occlusion

4.2.6 FRGC database with block occlusion and single sample per person (SSPP)

The FRGC database [48] contains faces acquired under uncontrolled conditions as shown in Fig. 8a. Using single sample per person (SSPP) protocol as another challenging problem in FR, we randomly selected 152 images for training, 152 images for testing and replaced a randomly located block of the test image with an unrelated image, as illustrated in Fig. 8c. The images were cropped and resized to 90 × 68 pixels. The recognition accuracy on this dataset is reported in Table 13.

Fig. 8 a Original images of four different subjects from FRGC. b Original test image. c Test image with random block occlusion (30%)

Table 13 Face recognition results of different methods with block occlusion and SSPP using the FRGC database

Table 13 shows that also in this test, with block occlusion ranging from 10 to 50% of the image, our algorithm SBP_VCP achieves the best performance, exhibiting a slightly better accuracy than all the other methods considered. Note that all methods, except SBP-NN and SBP-LRC, achieve the same recognition rates without occlusion, while their performance differs in the presence of occlusion. This shows that SBP_VCP performs remarkably well in the challenging SSPP problem.

4.3 Gender classification (GC)

4.3.1 AR database

We selected a non-occluded subset (14 images per subject) of the AR database [40] consisting of 50 male and 50 female subjects. Images of the first 25 males and 25 females were used for training and the remaining images were used for testing. The images were cropped to 60 × 43. PCA was used to reduce the dimension of each image to 300. Table 14 compares SBP_VCP with the following methods: regularized nearest subspace (RNS) [49], multi-regularized features learning (MRL) [50], CRC_RLS [9], SRC [2], SVM, LRC [44] and NN. Table 14 shows that SBP_VCP outperforms the other methods considered and illustrates that the proposed method based on statistical local features is very effective for gender classification.

Table 14 Performance results on GC using the AR database

4.3.2 FEI database

There are 14 images for each of 200 individuals with a total of 2800 images [51]. The number of male and female subjects is exactly the same and equal to 100. The first nine images of all subjects are used in the training (1800 images, 900 per gender) and the remaining five images serve as testing images (1000 images, 500 per gender). Figure 9 shows all samples from one person. The images were cropped to 60 × 43.

Fig. 9 All samples from the same person in the FEI database

Here we compare SBP_VCP to the MRL [50] and CRC_RLS [9] algorithms for different feature dimensions. Table 15 shows that SBP_VCP outperforms MRL and CRC_RLS for all dimensions except dimension 30.

Table 15 Performance results on GC using the FEI database

4.4 Handwritten digit recognition

We next considered the problem of handwritten digit recognition on the widely used USPS database (Hull, J.J. 1994), which has 7291 training and 2007 test images. We used two different values of \({{N}_{tr}}\): 100 and 300 images. Results in Table 16 below show that SBP_VCP outperforms all the competing methods considered when \({{N}_{tr}}\) is 300 images. When \({{N}_{tr}}\) = 100, the Fisher discrimination dictionary learning FDDL [13] is the best performing algorithm, but our approach has the second best performance.

Table 16 Handwritten digit recognition results of different methods on the USPS database

4.5 Image categorization

We tested the proposed method on the problem of multi-class object categorization. We used one of the two Oxford flower datasets, the 17 category dataset [53], some samples of which are shown in Fig. 10. We adopted the default experimental settings provided at the website http://www.robots.ox.ac.uk/~vgg/data/flowers, including the training, validation and test splits and the multiple features. It should be noted that, in this setting, features are only extracted from flower regions which are well cropped by segmentation. This set contains 17 species of flowers with 80 images per class. As in [54], we directly use the \({{\chi }^{2}}\) distance matrices of seven features (i.e., HSV, HOG, SIFTint, SIFTbdy, color, shape and texture vocabularies) as inputs, and perform the experiments based on the three predefined training, validation and test splits. Performance results (in terms of accuracy) comparing VCP with other state-of-the-art methods are presented in Table 17 and show that VCP slightly outperforms all the other methods. Note that, since we follow [54], we did not use the SBP representation in this test.

Fig. 10 Samples from the Oxford flower dataset with 17 categories

Table 17 Categorization accuracy on the 17 category Oxford flowers data set

4.6 Action recognition

Finally, we conducted an experiment of action recognition on the UCF sport action dataset (Rodriguez et al. [57]) and the large scale UCF50 dataset. The video clips in the UCF sport action dataset were collected from various broadcast sports channels (e.g., BBC and ESPN). There are 140 videos in total and their action bank features can be found in Sadanand et al. [58]. The videos cover ten sport action classes: driving, golfing, kicking, lifting, horse riding, running, skateboarding, swinging-(pommel horse and floor), swinging-(high bar) and walking. The UCF50 dataset has 50 action categories such as baseball pitch, biking, driving, skiing (Fig. 11), and there are 6680 realistic videos collected from YouTube.

Fig. 11 UCF sports dataset: sample frames of 10 action classes along with the bounding box annotations of the humans shown in yellow

On the UCF sport action dataset, we followed the experimental settings in Rodriguez et al. [57] and evaluated VCP via five-fold cross validation, where one fold is used for testing and the remaining four folds for training. Since we use the action bank features of [58], we do not use SBP as a local feature in this test.

We compared VCP against state-of-the-art methods and reported the recognition rate in Table 18. Again, results show that VCP performs very competitively, illustrating the impact of the collaborative method.

Table 18 Recognition accuracy on the UCF Sports data set

4.7 Running time

In practical applications, training is usually an offline stage while recognition (classification) is usually an online step. Since we adopted the same classification procedure as collaborative representation based classification (CRC), we achieve a remarkable speed-up compared to many other methods due to the significant reduction in computational complexity. In fact, after projecting a query sample y via \(P={{\left( {{X}^{T}}X+\lambda I \right)}^{-1}}{{X}^{T}}\), y is classified to the class which gives the minimal \({{r}_{i}}({{\alpha }_{i}})=\left\| y-{{X}_{i}}{{\alpha }_{i}} \right\|_{2}^{2}+\lambda {{\left\| {{\alpha }_{i}} \right\|}_{n}}\), where n = 1 or 2 and \({{\alpha }_{i}}\) is the coding vector associated with class i (\(\alpha =[{{\alpha }_{1}},\ldots ,{{\alpha }_{i}},\ldots ]\) and \(y\approx X\alpha\)).

All experiments were carried out in MATLAB on a machine with a 2.20 GHz dual-core CPU and 3.00 GB of RAM. Table 19 lists the average computational cost of the training step on Test 1 and Test 2 of the AR dataset with real face disguise. The comparison of the LBP [22] and SBP algorithms shows that LBP has the lowest computation time, but SBP is close.

Table 19 Average running time (seconds) of training step using AR dataset with real face disguise

Table 20 lists the average classification cost of the different methods on Test 1 and Test 2 of the AR dataset with real face disguise. SBP_VCP has the lowest computation time, followed by RRC, while GSRC has the highest computation time.

Table 20 Average running time (seconds) of competing methods using AR dataset with real face disguise

5 Conclusion

In this paper, we have introduced a novel approach for pattern recognition combining high order statistical binary patterns and collaborative projection for robust local representation and classification. We have demonstrated that the extraction of statistical features based on the high-order moments of the images is particularly effective against image outliers. When this property is combined with our strategy for competitive, or collaborative, representation based on a trained virtual projection, we obtain a method we call SBP_VCP, which is a powerful refinement of the collaborative representation based classification recently proposed in the literature. We have validated SBP_VCP on a wide range of problems from pattern recognition and classification, including face recognition, gender classification, handwritten digit recognition, object categorization and action recognition. Extensive numerical tests and detailed comparisons with standard and state-of-the-art methods demonstrate that the proposed SBP_VCP approach performs very competitively even on challenging classification tests. Additionally, our method can be implemented at a relatively small computational cost as it relies on the same efficient framework used in CRC for the classification step.