1 Introduction

With the rapid progress of computer vision, it has become very easy to collect multiple face images in real-life applications [1]. Hence, the image set model has become a research hotspot in recent years, where each set consists of multiple images of the same subject. In theory, an image set can exploit more face information about a subject to effectively reduce the impact of appearance variations in expression, pose, illumination, and so on [2, 3], thereby enhancing robustness and discriminative ability for classification. Accordingly, image set classification aims to model an image set with appropriate techniques and then measure the similarity between the probe set and each gallery set under some distance criterion to accomplish classification [2, 4,5,6,7]. Image set classification is thus very different from traditional single-image face classification [1, 7,8,9]. Moreover, since an image set provides more comprehensive facial information than a single image, it naturally achieves more promising performance than traditional single-image classification methods in real-world applications. However, owing to the diverse variations, it remains a great challenge to classify each image set correctly by effectively exploiting the discriminative information [10,11,12,13,14]. For instance, we choose four training sets from the YouTube Celebrities dataset and project them onto the top three principal directions [6], as shown in Fig. 1. As can be seen, the data distribution does not follow the class labels and is far from concentrated. In such cases, effectively improving classification performance remains a challenge for image set classification.

Fig. 1

Four training sets chosen from the YouTube Celebrities dataset. Their projections onto the top three principal directions are drawn as dots in four different colors

Over the last decade, plentiful image set classification methods [2, 4, 6, 15,16,17,18,19,20,21,22,23] have been proposed to address two key problems that largely determine face recognition performance: constructing a model of the image set and defining an appropriate distance criterion to measure the similarity between a pair of image sets. From the extensive literature, it is clear that point-to-point distance-based methods have attracted the widest attention among all existing methods [24, 25]. These methods use the distance between nearest points as the image set classifier to obtain between-set similarity. However, existing methods face the following problems: (1) they do not make full use of the rich discriminative information between the whole gallery set and each individual gallery set when encoding image sets; (2) most of them follow the idealized assumption that all gallery sets are equally important, which does not hold in practical applications; and (3) they usually introduce explicit sparse coding coefficients to enhance the representation capability of homogeneous image sets and suppress that of inhomogeneous ones. In short, the first two problems hinder performance breakthroughs, while the latter brings computational burden and multiple parameters.

To address the above drawbacks of point-to-point distance-based methods, this paper presents an efficient and effective face image set classification method, termed self-weighted latent sparse discriminative learning (SLSDL). In summary, our SLSDL has the following strengths and contributions:

  • To make full use of the encoding information between different gallery sets and improve the discrimination of the representation model, it simultaneously minimizes the nearest distance between the probe set and the whole gallery set, and the distances between each individual gallery set and the whole gallery set.

  • To find a more precise nearest point in the gallery sets, it proposes a self-weighted strategy to control the contribution of each gallery set, such that each gallery set can be treated differently to improve its discriminant ability.

  • It proposes a latent sparse normalization with capped simplex constraint to approximate the sparse constraint term for reducing the involved trade-off parameters and computational complexity; meanwhile, it can preserve the robustness of sparse representation.

  • It achieves excellent performance on four public benchmark datasets.

2 Related work

Recently, the point-to-point distance model has attracted extensive attention from scholars; it calculates distances between points of two sets to obtain between-set similarity. Up to now, a number of typical methods have been proposed [2, 3, 6, 15,16,17,18,19,20,21,22,23, 26]. For instance, the affine hull-based image set distance (AHISD) [15] and the convex hull-based image set distance (CHISD) [15] compute the distance between two affine/convex hulls of image sets and use it as the set similarity. However, overlarge affine/convex hulls can lead to performance degradation. To get rid of this problem, sparse approximated nearest points (SANP) [16] imposes a sparsity constraint on the sample coefficients and achieves better accuracy. However, SANP incurs high computational complexity; hence, regularized nearest points (RNP) [17] models an image set as a regularized affine hull for a more concise formulation and lower time consumption. Nevertheless, these methods [27, 28] only consider between-set relations in the training phase and ignore the discriminative information between a probe set and the whole gallery set. More and more methods based on collaborative representation [29] have been proposed to overcome this issue, such as regularized hull-based collaborative representation (RH-ISCRC) [18], collaborative regularized nearest points (CRNP) [19], and joint regularized nearest points (JRNP) [22]. Furthermore, prototype discriminative learning (PDL) [20] also achieves good performance by simultaneously learning the prototype set and a linear projection matrix, which keeps prototypes of the same class closer and those of different classes farther apart. Along this line, dual linear regression classification (DLRC) [30] focuses only on the related class subspace. As a development of DLRC, the pairwise linear regression model (PLRC) [21] considers a new unrelated subspace as well.
On the basis of PLRC, the recently proposed discriminative residual analysis (DAR) [31] obtains discriminative features and then projects the gallery and probe sets into the discriminant subspace to improve classification performance. Recently, deep learning (e.g., DRM-MV [32]) has also gradually been applied to image set classification tasks. Although these approaches have achieved some breakthroughs, they still have defects to overcome (see Sect. 1).

In general, the relationship between our SLSDL and the related methods is that they are all point-to-point distance models. The difference is that existing point-to-point methods ignore the important discriminative information between each gallery set and the whole gallery set, and usually introduce several explicit sparse coding coefficients to enhance discrimination and robustness. Moreover, these methods ideally assume that all gallery sets are equally important, which does not hold in practical applications. Conversely, our SLSDL uses weighted values to precisely adjust the coding coefficient errors, thereby enhancing the correlation between the whole gallery set and the ground truth gallery set as well as the repellency between heterogeneous sets. In SANP [16], RNP [17], and CRNP [19], broadly speaking, the weight between the probe set and each gallery set is fixed to 1, so every gallery set is considered equally important. Meanwhile, SLSDL also proposes a latent sparse normalization for representation sparsity, thereby reducing the number of trade-off parameters and the running time. Note that latent sparse discriminative learning (LSDL) [33] is the conference version of this paper; it treats all gallery sets indistinguishably, i.e., all gallery set weights are fixed to 1.

3 Proposed method

In this section, we present the objective function, the optimization algorithm, and the classification criterion of the proposed SLSDL method.

3.1 Problem formulation

Point-to-point image set classification methods model each image set as a virtual point and calculate the similarity between each gallery set and the probe setFootnote 1 with an appropriate distance criterion. Mathematically, the whole gallery set is denoted as \({\varvec{X}}=[{\varvec{X_1}},\ldots , {\varvec{X_m}}]\), which consists of m different classes, and the probe set is denoted as \({\varvec{Y}}\); each image is represented as a d-dimensional column. Inspired by [18, 34], sparse representation produces sparsity that enhances robustness, thereby resisting noise and even outliers. Following this idea, we assume that the virtual point \({\varvec{X}} {\varvec{\alpha }}\) of the whole gallery set is close not only to the virtual point \({\varvec{Y}} {\varvec{\beta }}\) of the probe set but also to that of the ground truth gallery set (i.e., one of \(\{{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\}_{i=1}^m\)). \({\varvec{Y}} {\varvec{\beta }}\) can then better approximate \({\varvec{X}} {\varvec{\alpha }}\) through collaborative representation and adjusted coding errors. However, this is idealized because the ground truth gallery set corresponding to the label is unknown. In summary, our ideas are as follows: (1) to measure the similarity between the probe set and the gallery sets, we model the query set as a regularized hull collaboratively over all the gallery sets; (2) to enhance the discriminative power of all gallery sets, we propose a discrimination term that minimizes the distance between the whole gallery set and the unknown ground truth gallery set; and (3) to obtain sparse solutions that enhance robustness, we impose a sparsity constraint on the coding coefficients to suppress the contributions of unnecessary samples. Furthermore, the sparsity constraint enhances the representation capability of homogeneous image sets and reduces that of inhomogeneous ones.
Accordingly, our model can be written as follows

$$\begin{aligned}&\min _{{\varvec{\alpha }}, {\varvec{\beta }}, {\varvec{\gamma }}}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{Y}} {\varvec{\beta }}\Vert _{2}^{2}+\lambda _{1}\Vert {\varvec{\alpha }}\Vert _{1}+\lambda _{2}\Vert {\varvec{\beta }}\Vert _{1}\nonumber \\&\qquad + \lambda \sum _{i=1}^{m}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\Vert _2^2+\lambda _{3}\sum _{i=1}^{m}\Vert {\varvec{\gamma }}_{i}\Vert _{1} \end{aligned}$$
(1)

where \(\lambda \), \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are trade-off parameters, \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) are the sparse coefficient vectors, and \({\varvec{\gamma }}_{i}=[{\varvec{\gamma }}_{i1}^T,{\varvec{\gamma }}_{i2}^T,\ldots , {\varvec{\gamma }}_{im}^T]^T\) is the sparse coefficient vector corresponding to the ith gallery set. In summary, the first term, based on collaborative representation, minimizes the distance between the virtual points of the probe set and the whole gallery set. The fourth term aims to enhance the discriminative power of all gallery sets: by minimizing the distance between the virtual points of the whole gallery set and the unknown ground truth gallery set to adjust coding errors, the probe set can be approximated better. The remaining terms are \(l_1\)-norm penalties that yield sparse solutions (Fig. 2).

Fig. 2

Illustration of our SLSDL. The five dashed boxes represent four gallery sets \(\{{\varvec{X}}_{i}\}_{i=1}^4\) and a probe set \({\varvec{Y}}\). Solid patterns and hollow patterns denote the true images and the virtual points, respectively. The arrows denote distances between the virtual points of image sets. The weight \(w_i\) stands for the contribution of the ith gallery set

However, problem (1) has the following disadvantages. First, the explicit sparsity inevitably results in high computational complexity; moreover, three parameters, \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\), need to be tuned, which is not user-friendly. To overcome this, we impose the normalizations \({\varvec{1}}^T{\varvec{\alpha }}=1\), \({\varvec{1}}^T {\varvec{\beta }}=1\), and \({\varvec{1}}^T{\varvec{\gamma }}_{i}=1\) so that the explicit sparsity terms can be removed. These normalizations naturally produce sparsity on the coefficients because they select only a few samples to approximate the virtual point of the probe set. That is, the normalizations \({\varvec{1}}^T{\varvec{\alpha }}=1\), \({\varvec{1}}^T {\varvec{\beta }}=1\), and \({\varvec{1}}^T{\varvec{\gamma }}_{i}=1\) are equivalent to latent sparse constraints on \({\varvec{\alpha }}\), \({\varvec{\beta }}\), and \({\varvec{\gamma }}_{i}\), respectively; moreover, they enhance discriminability by finding the homogeneous set among the gallery sets to represent the probe set. In this way, we obtain sparse and discriminative solutions simultaneously. Second, these sparse constraints may still produce negative coefficients because no non-negativity constraint is imposed on the coding coefficient vectors. Non-negative coefficients are mathematically explainable and physically meaningful, and the constructed virtual face images are then more likely to be real faces. Therefore, we introduce non-negativity constraints on \({\varvec{\alpha }}\), \({\varvec{\beta }}\), and \({\varvec{\gamma }}_{i}\) simultaneously, i.e., \({\varvec{\alpha }} \ge 0\), \({\varvec{\beta }} \ge 0\), and \({\varvec{\gamma }}_{i} \ge 0\). Finally, it is unreasonable to treat every gallery set equally. To overcome this, a new self-weighted strategy is introduced that iteratively updates the weight between the whole gallery set and each gallery set. The weight thus controls the contribution of each gallery set, which explicitly pulls the whole-gallery virtual point closer to the nearest point of the estimated correct class. Ideally, the correct gallery set receives a large weight and the incorrect sets receive small weights. Based on the above discussion, problem (1) can be upgraded to

$$\begin{aligned} \begin{aligned}&\min _{{\varvec{\alpha }}, {\varvec{\beta }}, {\varvec{\gamma }}, {\varvec{w}}}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{Y}} {\varvec{\beta }}\Vert _{2}^{2}\\&\quad +\lambda \sum _{i=1}^{m}w_i\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\Vert _{2}^{2} \\&\qquad \text{ s.t. } {\varvec{\alpha }} \ge 0,{\varvec{1}}^{T} {\varvec{\alpha }}=1, {\varvec{\beta }} \ge 0,{\varvec{1}}^{T} {\varvec{\beta }}=1,\\&\qquad \qquad {\varvec{\gamma }}_{i} \ge 0,{\varvec{1}}^{T} {\varvec{\gamma }}_{i}=1,{\varvec{w}} \ge 0,{\varvec{1}}^{T} {\varvec{w}}=1 \end{aligned} \end{aligned}$$
(2)

where \({\varvec{w}} = [w_{1},\ldots ,w_{m}]\) is the weight vector and \(\lambda \) is the trade-off parameter. Note that each set \({\varvec{X}}_i\) could be replaced by a dictionary with k atoms learned by, e.g., KSVD [18].
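As a concrete illustration, the objective value of problem (2) can be evaluated for given coefficients in a few lines of NumPy. This is only a sketch with our own (hypothetical) function and variable names, assuming the coefficient vectors already satisfy the capped simplex constraints:

```python
import numpy as np

def slsdl_objective(X, Y, alpha, beta, gammas, w, gallery_sets, lam):
    """Value of the SLSDL objective in problem (2) for fixed coefficients.

    X:            whole gallery matrix [X_1, ..., X_m], shape (d, n)
    Y:            probe set, shape (d, n_y)
    gammas:       list of the m coefficient vectors gamma_i
    w:            weight vector on the simplex
    gallery_sets: list of the m individual gallery matrices X_i
    lam:          the single trade-off parameter lambda
    """
    # collaborative representation term ||X alpha - Y beta||_2^2
    fit = np.linalg.norm(X @ alpha - Y @ beta) ** 2
    # self-weighted discrimination term sum_i w_i ||X alpha - X_i gamma_i||_2^2
    disc = sum(wi * np.linalg.norm(X @ alpha - Xi @ gi) ** 2
               for wi, Xi, gi in zip(w, gallery_sets, gammas))
    return fit + lam * disc
```

Since only \(\lambda \) appears, the trade-off between the two terms is the only quantity to tune, as discussed below.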

Compared to existing methods, our method has a concise formulation and latent sparsity; note that only one trade-off parameter, \(\lambda \), is involved. Concretely, our SLSDL method has the following benefits over regular ones. (1) It minimizes the distance between the probe set and the whole gallery set via collaborative representation; meanwhile, it proposes a novel discrimination term that minimizes the distance between the whole gallery set and the unknown ground truth gallery set. In the objective of SLSDL, the collaborative representation term plays the main role, while the discrimination term assists by adjusting the coding coefficient errors to enhance discrimination. Hence, SLSDL can make the best use of the comprehensive information between different gallery sets to improve the discriminability of face image sets. (2) It proposes a self-weighted strategy that controls the contribution of each gallery set to enhance the correlation between the whole gallery set and the ground truth gallery set, so that the nearest point can be located exactly. Based on this strategy, a self-weighted discrimination term largely boosts the discrimination among gallery sets: the effect of the true set is strengthened while the effects of false sets are weakened or removed. (3) Instead of an explicit sparse constraint, it adopts a latent sparse normalization with a capped simplex constraint to produce sparse solutions, thereby reducing the number of trade-off parameters and the computational complexity, while preserving the robustness of traditional sparse representation.

3.2 Optimization

Since problem (2) involves only several \(\ell _2\)-norm regularized subproblems, its convergence is easy to prove. Correspondingly, \({\varvec{\alpha }}\), \({\varvec{\beta }}\), \({\varvec{\gamma }}\), and \({\varvec{w}}\) can be solved alternately.

Step 1: Update \({\varvec{\alpha }}\) with \({\varvec{\beta }}\) and \({\varvec{\gamma }}\) fixed. With the other coefficients fixed, \({\varvec{\alpha }}\) can be solved from the following problem, i.e.,

$$\begin{aligned} \begin{aligned}&\min _{{\varvec{\alpha }}}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{Y}} {\varvec{\beta }}\Vert _{2}^{2}+\lambda \sum _{i=1}^{m}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\Vert _{2}^{2}\\&\quad \text{ s.t. } {\varvec{\alpha }} \ge 0,{\varvec{1}}^{T} {\varvec{\alpha }}=1 \end{aligned} \end{aligned}$$
(3)

By expanding (3) and dropping the terms independent of \({\varvec{\alpha }}\), this problem can be rewritten as

$$\begin{aligned} \begin{aligned}&\min _{{\varvec{\alpha }}}\Vert {\varvec{X}} {\varvec{\alpha }}\Vert _{2}^{2}\\&\quad - 2 {\text {tr}}\left( {\varvec{\alpha }}^{T} {\varvec{X}}^{T}\left( \frac{{\varvec{Y}} {\varvec{\beta }}+\lambda \sum _{i=1}^{m} {\varvec{X}}_{i} {\varvec{\gamma }}_{i}}{m \lambda +1}\right) \right) \end{aligned} \end{aligned}$$
(4)

For simplicity, we introduce an auxiliary variable \({\varvec{P}}=\frac{1}{m \lambda +1}({\varvec{Y}} {\varvec{\beta }}+\lambda \sum _{i=1}^{m} {\varvec{X}}_{i} {\varvec{\gamma }}_{i})\), and transform (4) to

$$\begin{aligned} \begin{aligned} \hat{{\varvec{\alpha }}}=\arg \min _{{\varvec{\alpha }}}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{P}}\Vert _{2}^{2} \quad \text{ s.t. } {\varvec{\alpha }} \ge 0,{\varvec{1}}^{T} {\varvec{\alpha }}=1 \end{aligned} \end{aligned}$$
(5)
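Assembling the auxiliary point \({\varvec{P}}\) from the fixed variables is straightforward; the sketch below uses our own (hypothetical) names and assumes each gallery set is a d-by-\(n_i\) matrix:

```python
import numpy as np

def alpha_target(Y, beta, gallery_sets, gammas, lam):
    """Auxiliary point P = (Y beta + lam * sum_i X_i gamma_i) / (m lam + 1),
    the target that X @ alpha approximates in the alpha-subproblem (5)."""
    m = len(gallery_sets)
    s = sum(Xi @ gi for Xi, gi in zip(gallery_sets, gammas))
    return (Y @ beta + lam * s) / (m * lam + 1.0)
```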

Step 2: Update \({\varvec{\beta }}\) with \({\varvec{\alpha }}\) and \({\varvec{\gamma }}\) fixed. With the other coefficients fixed, \({\varvec{\beta }}\) can be solved from the following problem, i.e.,

$$\begin{aligned} \begin{aligned} \hat{{\varvec{\beta }}}=\arg \min _{{\varvec{\beta }}}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{Y}} {\varvec{\beta }}\Vert _{2}^{2} \quad \text{ s.t. } \ {\varvec{\beta }} \ge 0,{\varvec{1}}^{T} {\varvec{\beta }}=1 \end{aligned} \end{aligned}$$
(6)

Step 3: Update \({\varvec{\gamma }}\) with \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) fixed. Similar to problem (6), with the other variables fixed, each \({\varvec{\gamma }}_{i}\) can be directly updated via problem (7), i.e.,

$$\begin{aligned} \begin{aligned} \widehat{{\varvec{\gamma }}}_{i}=\arg \min _{{\varvec{\gamma }}_{i}}\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\Vert _{2}^{2} \ \text{ s.t. } {\varvec{\gamma }}_{i} \ge 0,{\varvec{1}}^{T} {\varvec{\gamma }}_{i}=1 \end{aligned} \end{aligned}$$
(7)

Interestingly, problems (5), (6), and (7) share the same optimization form. We call such a problem the capped simplex projection problem (CSPP), which combines an affine constraint with a non-negativity constraint to yield a restricted solution. The solution of CSPP is presented in Sect. 3.4 for better understanding.

Step 4: Update \({\varvec{w}}\) with \({\varvec{\alpha }}\), \({\varvec{\beta }}\), and \({\varvec{\gamma }}\) fixed. We can obtain \({\varvec{w}}\) via

$$\begin{aligned} \begin{aligned} \min _{{\varvec{w}}}\sum _{i=1}^{m}w_i \Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\Vert _{2}^{2}\quad \text{ s.t. } {\varvec{w}} \ge 0,{\varvec{1}}^{T} {\varvec{w}}=1 \end{aligned} \end{aligned}$$
(8)

Then, the values of the weighted vector \({\varvec{w}}\) can be calculated by Theorem 1, i.e.,

$$\begin{aligned} \begin{aligned} w_i=\frac{1}{2\Vert {\varvec{X}} {\varvec{\alpha }}-{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\Vert _{2}+\zeta } \end{aligned} \end{aligned}$$
(9)

where \(\zeta \) is a small positive constant, infinitely close to zero. The details of Theorem 1 are given in Sect. 3.3.
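The weight update of Eq. (9) is one line per gallery set. The sketch below uses our own names; `zeta` plays the role of the paper's \(\zeta \):

```python
import numpy as np

def update_weights(X, alpha, gallery_sets, gammas, zeta=1e-8):
    """Self-weighting step of Eq. (9): gallery sets whose virtual point
    X_i @ gamma_i lies close to the whole-gallery virtual point X @ alpha
    receive larger weights; zeta guards against a zero denominator."""
    z = X @ alpha
    return np.array([1.0 / (2.0 * np.linalg.norm(z - Xi @ gi) + zeta)
                     for Xi, gi in zip(gallery_sets, gammas)])
```

As intended, the (estimated) correct gallery set, being nearest to \({\varvec{X}} {\varvec{\alpha }}\), receives the largest weight.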

Therefore, according to the above analysis, we can update \({\varvec{\alpha }}\), \({\varvec{\beta }}\), \({\varvec{\gamma }}\), and \({\varvec{w}}\) alternately. The pseudo-code is given in Algorithm 1. The demo code will be released on our GitHub homepage https://github.com/renzhenwen.


3.3 Determine the value of weight \({\varvec{w}}\)

Recall that the optimization problem (8) can be simplified as below

$$\begin{aligned} \begin{aligned} \min _{{\varvec{w}}}\sum _{i=1}^{m}w_i \Vert {\varvec{z}}- {\varvec{g}}^{(i)}\Vert _{2}^{2}\quad \text{ s.t. } {\varvec{w}} \ge 0,{\varvec{1}}^{T} {\varvec{w}}=1 \end{aligned} \end{aligned}$$
(10)

where \({\varvec{z}} = {\varvec{X}}{\varvec{\alpha }}\) and \(\{{\varvec{g}}^{(i)}\}_{i=1}^m = \{{\varvec{X}}_{i} {\varvec{\gamma }}_{i}\}_{i=1}^m\). The value of vector \({\varvec{w}}\) can be calculated via Theorem 1.

Theorem 1

Motivated by the iteratively re-weighted technique, a transition problem without \({\varvec{w}}\) is defined as follows

$$\begin{aligned} \begin{aligned} \min _{{\varvec{z}} \in {\mathbf {W}}} \sum _{i=1}^{m} \Vert {\varvec{z}}- {\varvec{g}}^{(i)}\Vert _{2} \end{aligned} \end{aligned}$$
(11)

The Lagrange function of (11) is \(\sum _{i=1}^{m} \Vert {\varvec{z}}- {\varvec{g}}^{(i)}\Vert _{2}+\varPhi ({\varvec{\xi }}, {\varvec{z}})\), where \({\varvec{\xi }}\) is the Lagrange multiplier and \(\varPhi ({\varvec{\xi }}, {\varvec{z}})\) is the indicator function of the domain \({\mathbf {W}}\) (i.e., it enforces \({\varvec{z}} \in {\mathbf {W}}\)). By setting the derivative of this Lagrange function w.r.t. \({\varvec{z}}\) to zero, we obtain

$$\begin{aligned} \begin{aligned} \sum _{i=1}^{m}{\widehat{w}}_i\frac{\partial {\Vert {\varvec{z}}-{\varvec{g}}^{(i)}\Vert _{2}^{2}}}{\partial {\varvec{z}}} +\frac{\partial \varPhi ({\varvec{z}})}{\partial {\varvec{z}}}=0 \end{aligned} \end{aligned}$$
(12)

where \({\widehat{w}}_i=1/(2\Vert {\varvec{z}}- {\varvec{g}}^{(i)}\Vert _{2})\). It is easy to see that Eq. (12) is identical to the derivative of the Lagrange function of problem (10). Thus, \({\widehat{w}}_i\) can be viewed as the \(w_i\) in (10). To avoid a zero denominator, \({\widehat{w}}_i\) is rewritten as

$$\begin{aligned} \begin{aligned} w_i=\frac{1}{2\Vert {\varvec{z}}-{\varvec{g}}^{(i)}\Vert _{2}+\zeta } \end{aligned} \end{aligned}$$
(13)

where \(\zeta \) is a small positive constant, infinitely close to zero. The proof is completed. \(\square \)

3.4 Solve capped simplex projection problem

We adopt the accelerated projected gradient (APG) method to optimize the resultant CSPP. Without loss of generality, we take the \({\varvec{\beta }}\)-subproblem (6) as an example; for notational convenience, its objective is denoted by \(\varphi ({\varvec{\beta }})\) and \({\mathbf {W}}\) denotes the capped simplex constraint domain. We then define CSPP as

$$\begin{aligned} \begin{aligned} \min _{{\varvec{\beta }}\in {\mathbf {W}} }\varphi ({\varvec{\beta }})=\Vert {\varvec{x}}- {\varvec{Y}} {\varvec{\beta }}\Vert _2^2 \end{aligned} \end{aligned}$$
(14)

where \({\varvec{x}}\) is treated as fixed (here \({\varvec{x}} = {\varvec{X}}{\varvec{\alpha }}\)). To obtain an easier paradigm, an auxiliary variable \({\varvec{\eta }}\) is introduced to solve problem (14), and \({\varvec{\beta }}\) is solved approximately by alternately updating \({\varvec{\eta }}\).

\({\varvec{{\beta }}}^0\) is obtained from problem (1) by setting \(\lambda _2=1\), fixing \({\varvec{\alpha }}\) and \({\varvec{\gamma }}_i\), and ignoring the capped simplex constraint. The alternating optimization then starts by initializing \({\varvec{\eta }}^0 = {\varvec{\beta }}^0\). We denote the initial Newton acceleration coefficient by c and the value of \({\varvec{\eta }}\) at iteration t by \({\varvec{\eta }}^{t}\), which is updated at each iteration.

At the t-th iteration, a second-order Taylor expansion is used to approximate \(\varphi \) around \({\varvec{\eta }}^{t-1}\), i.e.,

$$\begin{aligned} \begin{aligned} {\varvec{\beta }}^{t}&=\underset{{\varvec{\beta }}\in {\mathbf {W}}}{{\text {argmin}}}\, \varphi ({\varvec{\eta }}^{t-1})\\&\quad +({\varvec{\beta }}-{\varvec{\eta }}^{t-1})^{T} {\varvec{\varphi }}^{\prime }({\varvec{\eta }}^{t-1}) +\frac{L}{2}\Vert {\varvec{\beta }}-{\varvec{\eta }}^{t-1}\Vert _{2}^{2} \end{aligned} \end{aligned}$$
(15)

Problem (15) can be written in a more compact form after ignoring the irrelevant terms. As a consequence, (15) can be rewritten as

$$\begin{aligned} \begin{aligned} {\varvec{\beta }}^{t}=\underset{{\varvec{\beta }}\in {\mathbf {W}}}{\arg \min } \frac{L}{2}\Vert {\varvec{\beta }}-({\varvec{\eta }}^{t-1}-\frac{1}{L} {\varvec{\varphi }}^{\prime }({\varvec{\eta }}^{t-1}))\Vert _{2}^{2} \end{aligned} \end{aligned}$$
(16)

Let \({\varvec{e}} = {\varvec{\eta }}^{t-1}-\frac{1}{L} {\varvec{\varphi }}^{\prime }({\varvec{\eta }}^{t-1})\). Problem (16) can be abbreviated to

$$\begin{aligned} \begin{aligned} {\varvec{\beta }}^{t}=\arg \min _{{\varvec{\beta }}}\frac{1}{2} \Vert {\varvec{\beta }}-{\varvec{e}}\Vert _{2}^{2}\quad \text{ s.t. } {\varvec{ \beta }} \ge 0, {\varvec{1}}^T {\varvec{\beta }}=1 \end{aligned} \end{aligned}$$
(17)
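One APG iteration thus amounts to a gradient step on \(\varphi \) followed by a projection onto \({\mathbf {W}}\). A minimal sketch, with our own names and the projection supplied as a callable (the full projection routine is derived next):

```python
import numpy as np

def apg_step(Y, x, eta, L, project):
    """One update of Eq. (16) for phi(beta) = ||x - Y beta||_2^2:
    move eta along the negative gradient with step size 1/L, then
    project the intermediate point e (Eq. (17)) onto the capped simplex."""
    grad = 2.0 * Y.T @ (Y @ eta - x)   # phi'(eta)
    e = eta - grad / L                  # gradient step
    return project(e)                   # projection onto W
```

Here L would be a Lipschitz constant of \(\varphi ^{\prime }\) (e.g., twice the largest eigenvalue of \({\varvec{Y}}^T{\varvec{Y}}\)), an assumption on our part.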

Analogously to [35, 36], the Lagrangian function of problem (17) is rewritten as

$$\begin{aligned} \begin{aligned} {\mathcal {L}}({\varvec{z}},{\varvec{w}},\pi )=\frac{1}{2} \Vert {\varvec{z}}-{\varvec{e}}\Vert _{2}^{2}-\pi ({\varvec{1}}^{T} {\varvec{z}}-1)-{\varvec{w}}^{T} {\varvec{z}} \end{aligned} \end{aligned}$$
(18)

where \({\varvec{w}}\) is a Lagrangian coefficient vector and \(\pi \) is a scalar.

Suppose the optimal Lagrange multipliers are \({\varvec{w}}^{*}\) and \(\pi ^{*}\), and the optimal solution of (18) is \({\varvec{z}}^{*}\). According to the Karush–Kuhn–Tucker (KKT) conditions [37, 38], for all j, we have \(r z_{j}^{*}-\sum _{m=1}^{r} e_{j}^{m}-w_{j}^{*}-\pi ^{*}=0\), \(z_{j}^{*} \ge 0\), \(w_{j}^{*} \ge 0\), and \(z_{j}^{*} w_{j}^{*}=0\). In vector form, \(r{\varvec{z}}^{*}-\sum _{m=1}^{r} {\varvec{e}}^{m}-{\varvec{w}}^{*}-\pi ^{*} {\varvec{1}}={\varvec{0}}\).

We obtain \(\pi ^{*}=(r-\sum _{m=1}^{r} {\varvec{1}}^{T} {\varvec{e}}^{m}-{\varvec{1}}^{T} {\varvec{w}}^{*})/n\) according to \({\varvec{1}}^{T} {\varvec{z}}^{*}=1\). At this time, we formulate the optimal solution \({\varvec{z}}^{*}\) as

$$\begin{aligned} \begin{aligned} \frac{\sum _{m=1}^{r} {\varvec{e}}^{m}}{r}+\frac{{\varvec{1}}}{n}-\frac{\sum _{m=1}^{r} {\varvec{1}}^{T} {\varvec{e}}^{m} {\varvec{1}}}{r n}-\frac{{\varvec{1}}^{T} {\varvec{w}}^{*} {\varvec{1}}}{r n}+\frac{{\varvec{w}}^{*}}{r} \end{aligned} \end{aligned}$$
(19)

Defining \({\varvec{g}}=\sum _{m=1}^{r} {\varvec{e}}^{m}/r+{\varvec{1}}/n-\sum _{m=1}^{r} {\varvec{1}}^{T} {\varvec{e}}^{m} {\varvec{1}}/(r n)\) and \({\hat{w}}^{*}={\varvec{1}}^{T} {\varvec{w}}^{*}/(r n)\), Eq. (19) can then be simplified to \({\varvec{z}}^{*}={\varvec{g}}-{\hat{w}}^{*} {\varvec{1}}+\frac{{\varvec{w}}^{*}}{r}\). For \(\forall j\), the result can be written as

$$\begin{aligned} \begin{aligned} z_{j}^{*}=g_{j}-{\hat{w}}^{*}+\frac{w_{j}^{*}}{r}=\lfloor {g_{j}-{\hat{w}}^{*}\rfloor }_{+} \end{aligned} \end{aligned}$$
(20)

Similarly, \(w_{j}^{*}=r\lfloor {{\hat{w}}^{*}-g_{j}\rfloor }_{+}\) is derived in the same way. Since \({\hat{w}}^{*}={\varvec{1}}^{T} {\varvec{w}}^{*}/(r n)\), \({\hat{w}}^{*}\) satisfies \({\hat{w}}^{*}=\frac{1}{n} \sum _{j=1}^{n}\lfloor {{\hat{w}}^{*}-g_{j}\rfloor }_{+}\). Then, the auxiliary function in Eq. (21) is defined to solve for the self-dependent \({\hat{w}}^{*}\), i.e.,

$$\begin{aligned} \begin{aligned} \varTheta ({\hat{w}})=\frac{1}{n} \sum _{j=1}^{n}\lfloor {{\hat{w}}-g_{j}\rfloor }_{+}-{\hat{w}} \end{aligned} \end{aligned}$$
(21)

Note that \({\hat{w}} \ge 0\) and \(\varTheta ^{\prime }({\hat{w}}) \le 0\), and that \(\varTheta ({\hat{w}})\) is a piecewise linear convex function; hence, Newton's method can be used to find the root \({\hat{w}}^*\) of \(\varTheta ({\hat{w}})=0\), i.e.,

$$\begin{aligned} \begin{aligned} {\hat{w}}_{t+1}={\hat{w}}_{t}-\frac{\varTheta ({\hat{w}}_{t})}{\varTheta ^{\prime }({\hat{w}}_{t})} \end{aligned} \end{aligned}$$
(22)
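Putting Eqs. (19)–(22) together for the single-point case (r = 1) gives the following projection routine; this is a sketch with our own names, not the authors' released code:

```python
import numpy as np

def capped_simplex_projection(e, max_iter=100, tol=1e-10):
    """Project e onto {z >= 0, 1^T z = 1} via the Newton root-finding
    scheme of Eqs. (19)-(22), specialized to r = 1."""
    n = e.size
    # shift e so that the entries of g sum to one (Eq. (19) with r = 1)
    g = e + (1.0 - e.sum()) / n
    w_hat = 0.0
    for _ in range(max_iter):
        diff = w_hat - g
        theta = np.maximum(diff, 0.0).mean() - w_hat   # Theta(w_hat), Eq. (21)
        if abs(theta) < tol:
            break
        dtheta = (diff > 0).mean() - 1.0               # piecewise-linear slope
        w_hat = w_hat - theta / dtheta                 # Newton step, Eq. (22)
    return np.maximum(g - w_hat, 0.0)                  # z* of Eq. (20)
```

Since \(\varTheta \) is piecewise linear and convex, Newton iterations starting from \({\hat{w}}=0\) reach the exact root in finitely many steps.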
Fig. 3

Sample images of the four datasets. These faces contain high intra-class variations in the form of different poses, illumination variations, expression deformations, and occlusions

3.5 Classification

The latent sparse coding coefficients \({\varvec{{\hat{\alpha }}}}\), \({\varvec{{\hat{\beta }}}}\), and \(\{{\varvec{{\hat{\gamma }}_i}}\}_{i=1}^m\) can be obtained by optimally solving (2). Then, we compute the similarity between each gallery set and the probe set by the following distance criterion

$$\begin{aligned} \begin{aligned} r_{i}=&(\Vert {\varvec{X}}_{i}\Vert _{*}+\Vert {\varvec{Y}}\Vert _{*}) * \Vert {\varvec{X}}_{i} {\varvec{{\hat{\alpha }}}}_{i}-{\varvec{Y}} {\varvec{{\hat{\beta }}}}\Vert _{2}^{2} /\Vert {\varvec{{\hat{\alpha }}}}_{i}\Vert _{2}^{2} \end{aligned} \end{aligned}$$
(23)

where \(\Vert {\varvec{X}}_{i}\Vert _{*}\) and \(\Vert {\varvec{Y}}\Vert _{*}\) are the nuclear norms of \({\varvec{X}}_{i}\) and \({\varvec{Y}}\), respectively; the nuclear norm is the sum of the singular values. \(\Vert {\varvec{X}}_{i} {\varvec{{\hat{\alpha }}}}_{i}-{\varvec{Y}} {\varvec{{\hat{\beta }}}}\Vert _{2}^{2}\) represents the point-to-point distance between each gallery set and the probe set, and the factor \(\Vert {\varvec{X}}_{i}\Vert _{*}+\Vert {\varvec{Y}}\Vert _{*}\) suppresses the influence of irrelevant class information between image sets. Finally, we assign the probe image set \({\varvec{Y}}\) the identity with the minimum residual distance via

$$\begin{aligned} \begin{aligned} \text{ identity } ({\varvec{Y}})=\arg \min _{i}\{{\varvec{r}}_{i}\} \end{aligned} \end{aligned}$$
(24)
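The decision rule of Eqs. (23) and (24) can be sketched as follows, assuming \({\varvec{{\hat{\alpha }}}}\) has already been split into its per-set sub-vectors (helper names are ours; NumPy's `'nuc'` norm gives the sum of singular values):

```python
import numpy as np

def classify(Y, gallery_sets, alpha_hat_parts, beta_hat):
    """Residual of Eq. (23) and label rule of Eq. (24).
    alpha_hat_parts[i] is the sub-vector of the optimal alpha
    associated with gallery set i."""
    residuals = []
    for Xi, ai in zip(gallery_sets, alpha_hat_parts):
        scale = np.linalg.norm(Xi, 'nuc') + np.linalg.norm(Y, 'nuc')
        dist = np.linalg.norm(Xi @ ai - Y @ beta_hat) ** 2
        # small epsilon guards against an all-zero sub-vector
        residuals.append(scale * dist / (np.linalg.norm(ai) ** 2 + 1e-12))
    return int(np.argmin(residuals))     # predicted class index
```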

3.6 Computational complexity

Obviously, the computational complexity of the proposed SLSDL mainly depends on Algorithm 1, which consists of four steps (updating \({\varvec{\alpha }}\), \({\varvec{\beta }}\), \({\varvec{\gamma }}\), and \({\varvec{w}}\)). Note that the subproblems for \({\varvec{\alpha }}\), \({\varvec{\beta }}\), and \({\varvec{\gamma }}\) have the same form, i.e., the proposed CSPP. As presented in Sect. 3.4, solving CSPP costs \({\mathcal {O}}(ld)\), where l is the number of Newton iterations for solving CSPP and d is the sample dimension. Therefore, updating \({\varvec{\alpha }}\), \({\varvec{\beta }}\), and \({\varvec{\gamma }}\) costs \({\mathcal {O}}(lmd)\), \({\mathcal {O}}(ld)\), and \({\mathcal {O}}(lmd)\), respectively. Moreover, updating \({\varvec{w}}\) costs \({\mathcal {O}}(m)\). In summary, the computational complexity of Algorithm 1 is \({\mathcal {O}}(t(2lmd+ld+m))\), where t is the number of iterations of Algorithm 1. Since l and m are very small, the computational complexity of our SLSDL is approximately \({\mathcal {O}}(td)\).

4 Experiments

To verify the performance of the proposed method, we compare it with several baseline methods on four benchmark image set datasets. All experiments are performed in MATLAB 2017b on a PC with an Intel Core i7 (2.9 GHz) CPU and 32 GB RAM. The experimental results in this section demonstrate the superiority of our method.

4.1 Experimental setup

We perform experiments on four common benchmark face set datasets: Honda/UCSD (Honda) [17, 18], CMU Mobo (Mobo) [18, 39], YouTube Celebrities (YTC) [18, 39, 40], and Labeled Faces in the Wild (LFW) [41, 42], as shown in Fig. 3. Note that we extract both traditional features and deep features for a comprehensive evaluation of the proposed method. Unless otherwise specified, the number of dictionary atoms is set to \(k=20\). According to the parameter sensitivity analysis below, the trade-off parameter \(\lambda \) is set to \(\lambda =1\) for the Mobo dataset and \(\lambda =0.01\) for the three remaining benchmark datasets for simplicity.

Our SLSDL is compared with several representative face image set classification methods: AHISD [15], CHISD [15], SANP [16], ISCRC-\(\ell _1\) [18], ISCRC-\(\ell _2\) [18], RNP [17], DRM-WV [32], PLRC [21], PDL [20], and LSDL [33]. To obtain average classification accuracies and standard deviations, we follow the classical protocol of setting the set size to 50, 100, and 200 in the first three experiments, respectively. The last experiment is performed with various resolutions and with deep features. To observe the effect of deep features, we add further comparison methods, namely DLRC [30] and DAR [31]. All comparison methods are run with the source codes provided on the authors' homepages, using the recommended parameters from their original papers for best performance.

4.2 Results on the Honda/UCSD dataset

The Honda/UCSD dataset possesses 59 video sequences involving 20 different subjects. Each sequence, containing about 12–645 frames, constitutes an image set with large variations. We use histogram equalization to eliminate the influence of illumination and then resize all face images to \(20 \times 20\). All image sets are randomly divided into two parts: 20 image sets are used for training and the remaining 39 for testing.
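As a rough illustration of this preprocessing step, the sketch below performs histogram equalization and a simple nearest-neighbor resize with NumPy; in practice a library routine (e.g., from an image-processing toolbox) would typically be used instead:

```python
import numpy as np

def preprocess_face(img, size=(20, 20)):
    """Histogram-equalize a grayscale face image and resize it,
    mirroring the preprocessing described for the Honda/UCSD dataset.
    Resizing uses nearest-neighbor sampling for illustration only.
    """
    img = np.asarray(img, dtype=np.uint8)
    # Histogram equalization: map intensities through the normalized CDF.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    eq = (cdf[img] * 255).astype(np.uint8)
    # Nearest-neighbor resize to the target resolution.
    rows = np.arange(size[0]) * eq.shape[0] // size[0]
    cols = np.arange(size[1]) * eq.shape[1] // size[1]
    return eq[np.ix_(rows, cols)]
```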

Table 1 lists the experimental results when the gallery image sets have different numbers of frames; the best results are shown in boldface. As can be seen, the proposed SLSDL achieves higher performance than the other comparison methods. Our method achieves much more stable results in all instances and attains the best classification accuracies, whereas the comparison methods achieve good classification performance only when the number of frames is 200. Notably, our method outperforms the second best, DRM-WV, by over 3.1% and 0.7% when the number of frames is 50 and 100, respectively. These results demonstrate the superiority of our method and show that it further enhances robustness and discrimination.

4.3 Results on the CMU Mobo dataset

The Mobo dataset contains 96 video sequences of 25 persons walking on a treadmill under various patterns, such as inclined, slow, fast, and carrying a ball. Histogram equalization is also applied, and all face images are resized to \(30\times 30\). For each subject, we randomly choose one image set for training and the remaining three for testing. For fairness, we perform tenfold cross-validation experiments.

The experimental results of all methods are reported in Table 1, with the best results shown in boldface. Overall, the proposed SLSDL achieves slightly higher accuracies than ISCRC and LSDL. More importantly, SLSDL outperforms ISCRC by 0.6% and 0.4% when the set size is 50 and 100, respectively, which is a difficult margin to achieve on this dataset.

Table 1 Classification performance (%) of different methods on the Honda/UCSD and CMU Mobo datasets
Table 2 Classification performance (%) and running time of the compared methods on the YTC dataset

4.4 Results on the YouTube Celebrities dataset

The YTC dataset contains 1910 videos involving 47 actors and politicians. It was collected from the YouTube Web site under uncontrolled real-life circumstances, so these face image sets tend to exhibit large pose and expression variations as well as motion blur. Because the images are of low quality and highly compressed, the face recognition task is more challenging. Each grayscale face image is resized to \(30\times 30\). We randomly choose three image sets for training and six for testing, and conduct fivefold cross-validation experiments.

The comparisons among all methods are summarized in Table 2, with the best results shown in boldface. Overall, LSDL and SLSDL both outperform the other methods. Even though LSDL already achieves very high performance, SLSDL further improves both accuracy and robustness. It is worth mentioning that our SLSDL method achieves the best performance reported so far, with accuracies of 74.0%, 78.1%, and 76.0% at the three set sizes. In particular, even with only 50 frames per set, our SLSDL outperforms all comparison methods at any number of frames. This further demonstrates that our method is superior to the comparison methods and has strong discriminability and robustness; clearly, it can resist noise and even outliers in face image set classification under complex conditions. Compared with DRM-WV, improvements of 11.7%, 9.9%, and 5.7% are achieved when the set size is 50, 100, and 200, respectively, which is a remarkable result. Meanwhile, we notice that the average classification accuracy fluctuates as the number of frames increases, probably because the larger sets (200 frames) introduce redundant or interfering face information.

4.5 Results on the labeled faces in the wild dataset

The LFW dataset covers more than 13,000 face images gathered from the Internet, in which 1680 people have two or more different images that can form image sets. The LFW images exhibit various illuminations, poses, and partial occlusions. We use the aligned version, LFW-a [31, 42]. Following [30], we resize all face images to \(90\times 78\). The subset of LFW-a is composed of the classes with more than 20 pictures, giving 3032 images of 62 subjects. In addition, to see how effectively our method performs on recent deep features, we select the VggFace2-pretrained ResNet-50 (VggFace2-ResNet-50) [31] to extract deep features with a dimension of 2048. We resize the raw grayscale images of the LFW-a dataset to \(10\times 10\), \(15\times 10\), and \(30\times 15\), and also experiment with VggFace2-ResNet-50 features; both the raw images and the deep features are used to evaluate performance. All images are randomly divided into gallery and probe sets. We set \(d=5\) and \(k=3\), and conduct thirty-fold cross-validation experiments.

As shown in Table 3, our SLSDL attains the best accuracy and standard deviation among all comparison methods on the LFW-a dataset. We also observe an accuracy improvement when deep features are used. Table 3 further shows that increasing the resolution improves recognition accuracy, but this gain is limited compared with that brought by deep features. This indicates that deep features are more discriminative and robust to noisy variations than raw pixels. Notably, the accuracy of our SLSDL can even reach 100%. We also notice that SLSDL achieves the best recognition accuracy among all methods at low resolutions. Compared with DAR-PE, improvements of 6.3%, 3.4%, and 2.1% are achieved under the different resolutions, respectively. In addition, our SLSDL extension further improves on LSDL, indicating that SLSDL enhances both effectiveness and robustness.

Table 3 Classification performance (%) of the compared methods on the LFW-a dataset

4.6 Running time comparison

For real-world applications, time cost is another noteworthy concern. We therefore set the number of images per set to 50 and measure running time on the YTC dataset. As can be seen from the last column of Table 2, LSDL has the lowest running time, only 0.54 s, while our SLSDL costs slightly more, about 1.05 s. Note that the running time of SLSDL is of the same order of magnitude as that of ISCRC and RNP (about 1 s). Clearly, the self-weighted strategy improves image set classification performance without sacrificing much computation time.

Fig. 4
figure 4

Coding coefficients of the gallery sets (i.e., \({\varvec{\alpha }}\)) are on the left and those of the probe set (i.e., \({\varvec{\beta }}\)) are on the right, produced by the proposed SLSDL method. Honda dataset contains 20 subjects, and \(k=10\)

4.7 Latent sparse property analysis

As claimed in Sect. 3.1, SLSDL has the latent sparse property; here we verify this point experimentally. We conduct experiments on the Honda dataset with \(k=50\) and \(d=10\). In Fig. 4, the coding coefficients of the gallery sets \({\varvec{\alpha }}\) are shown on the left and those of the probe set \({\varvec{\beta }}\) on the right. Clearly, the red coefficients correspond to the correct class, and the coefficient solutions \({\varvec{\alpha }}\) and \({\varvec{\beta }}\) of our method are sparse. The reconstructed face images and residuals are shown in Fig. 5, where each \({\varvec{X}}_{i} {\varvec{\alpha }}_{i}\) denotes the virtual point of the corresponding gallery image set and the reconstruction residual is simplified to \(\Vert {\varvec{X}}_{i} {\varvec{\alpha }}_{i}-{\varvec{Y}} {\varvec{\beta }}\Vert _{2}^{2}\). The smallest residual between the probe image set and the gallery image sets is highlighted in red, indicating that the two sets belong to the same subject. This demonstrates that the sparse solutions of SLSDL not only enhance robustness but also reduce computational complexity.
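The latent sparsity induced by simplex-type constraints can be illustrated with the standard Euclidean projection onto the probability simplex. This is a generic sketch that ignores the cap and is not the paper's CSPP solver:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}.

    Sorting-based algorithm; the projected vector is typically sparse,
    which illustrates why simplex-type constraints yield sparse codes.
    """
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)       # shrinkage threshold
    return np.maximum(v - theta, 0.0)
```

Entries below the threshold \(\theta \) are clipped to exactly zero, which is the source of the sparsity seen in Fig. 4.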

4.8 Convergence analysis

As described previously, our SLSDL is clearly convex and has closed-form solutions, as discussed in Sect. 3.2. Figure 6 shows the convergence curves of the proposed SLSDL method on the Honda, Mobo, YTC, and LFW-a datasets. SLSDL converges fast: the objective function value stabilizes within about 15–20 iterations, and within as few as 5 iterations on the first two datasets.

Fig. 5
figure 5

Virtual faces produced by our SLSDL method on the Honda/UCSD dataset. \({\varvec{X}}{\varvec{\alpha }}\), \({\varvec{Y \beta }}\), and \({\varvec{X}}_i{\varvec{\alpha }}_i\) represent the virtual face point of the whole gallery sets \({\varvec{X}}\), the probe set \({\varvec{Y}}\), and the i-th gallery set \({\varvec{X}}_i\), respectively

Fig. 6
figure 6

Convergence of the proposed SLSDL method on the four datasets

Fig. 7
figure 7

Classification performance of the proposed SLSDL method w.r.t. \(\lambda \) on the Mobo and YTC datasets

Fig. 8
figure 8

Weight vector \({\varvec{w}}\) produced by the proposed SLSDL method on the YTC dataset. Note here that the red stem stands for the ground-truth class

4.9 Parameter sensitivity

Pleasantly, our SLSDL has only one parameter that needs to be tuned. To find the optimal value of \(\lambda \), we vary it over \(\{10^{-7},\ldots , 10^{1}\}\) with \(d=50\) fixed. For simplicity, we perform representative experiments on Mobo and YTC. The performance curves are shown in Fig. 7. We observe that the classification accuracy fluctuates with \(\lambda \), and the best results are achieved at appropriate intermediate values. The recognition accuracy decreases rapidly when \(\lambda =10\), because the regularization term should only play a supporting role. SLSDL obtains the best classification accuracies at \(\lambda =1\) and \(\lambda =0.01\) on the two benchmark datasets, respectively. This further indicates that the latent sparse normalization and the self-weighted term can enhance discrimination. Overall, we suggest setting \(\lambda \) to a relatively small value within the range \(\{10^{-3},\ldots , 1\}\) for different datasets.
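The tuning protocol above amounts to a simple grid search. The sketch below assumes a hypothetical `evaluate` callback returning (e.g., cross-validated) accuracy for a given \(\lambda \); it is not part of the paper's code:

```python
def tune_lambda(evaluate, grid=None):
    """Grid-search the single trade-off parameter lambda.

    `evaluate` is a hypothetical callback returning classification
    accuracy for a given lambda. Returns the best lambda and all scores.
    """
    if grid is None:
        grid = [10.0 ** p for p in range(-7, 2)]  # {1e-7, ..., 10}
    scores = {lam: evaluate(lam) for lam in grid}
    return max(scores, key=scores.get), scores
```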

4.10 Ablation analysis

In our method, a new discriminative strategy is proposed to control the contribution of each gallery set, such that each gallery set can be treated differently; it adjusts the coding coefficient errors to enhance set discrimination. We dub the model latent sparse representation (LSR) when LSDL has no discriminative term, i.e., \(\lambda =0\). To demonstrate the effect of this term, we perform experiments with \(\lambda =0\) and \(d=50\) on the first three datasets, and with \(k=3\) on LFW-a. Table 4 shows that performance decreases without the discriminative term on Mobo, YTC, and LFW-a, which clearly shows that the discriminative term boosts discrimination power. In addition, Fig. 8 shows that the correct gallery (the red mark) receives a larger weight value, while the incorrect galleries receive lower weights. Clearly, the weights boost the correlation of homogeneous sets while limiting the correlation of heterogeneous sets.
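A self-weighting scheme of this kind can be sketched as follows. The inverse-residual rule used here is a common generic choice and may well differ from the paper's exact update of \({\varvec{w}}\):

```python
import numpy as np

def self_weights(residuals, eps=1e-12):
    """Hypothetical self-weighting: a gallery set with a smaller
    reconstruction residual receives a larger weight, and the weights
    are normalized to sum to one (a simplex-type constraint).
    """
    r = np.asarray(residuals, dtype=float)
    w = 1.0 / (r + eps)   # eps guards against division by zero
    return w / w.sum()
```

Under such a rule, the correct (low-residual) gallery set dominates the weight vector, matching the behavior observed in Fig. 8.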

Table 4 Classification performance (%) of LSR, LSDL and SLSDL on the four datasets

5 Conclusion

This paper proposes a novel SLSDL method for image set classification. Unlike existing point-to-point distance-based methods, SLSDL considers the different contributions of the whole gallery set and each individual gallery set and proposes a self-weighted strategy to reflect the correlation between them; moreover, SLSDL introduces a latent sparse normalization with a capped simplex constraint to avoid high running time while enhancing robustness. Extensive experiments on four public benchmark face set datasets demonstrate that SLSDL is superior to its competitors in terms of both computational speed and classification performance.

Although SLSDL has achieved promising performance, some aspects still deserve further study. (1) We will integrate hash learning and deep learning [43] for large-scale image set classification. (2) We will introduce multiple kernel learning [37] to handle nonlinear image set classification.