1 Introduction

Owing to a wide range of potential applications, face recognition has been a research problem of significant importance in the area of computer vision and pattern recognition. Most of the effort in this regard has been directed towards classification from single images, that is, given a single query image, we are required to find its best match in a gallery of images. However, for many real-world applications (e.g., recognition from surveillance videos, multi-view camera networks and personal albums), multiple images of a person are readily available and should be exploited for classification. Face recognition from these multiple images is commonly studied under the framework of ‘image set classification’ and has attracted significant research attention in recent years (Kim et al. 2007; Wang et al. 2008; Wang and Chen 2009; Cevikalp and Triggs 2010; Harandi et al. 2011; Hu et al. 2012; Wang et al. 2012; Yang et al. 2013; Ortiz et al. 2013; Zhu et al. 2013; Hayat et al. 2014).

Compared with single image based classification, image set classification is more promising, since the images in a set provide richer information owing to the wide range of appearance variations caused by changing illumination conditions, head pose variations, expression deformations and occlusions. Although image set classification provides a plenitude of data of the same object under different variations, it simultaneously introduces many challenges, e.g., how to make effective use of this data. The major focus of existing image set classification methods has therefore been to find a suitable representation which can effectively model the appearance variations in an image set. For example, the methods in Kim et al. (2007), Yamaguchi et al. (1998), Oja (1983), Wang et al. (2008), Wang and Chen (2009), Hayat et al. (2013) and Hayat and Bennamoun (2014) use subspaces to model image sets, and set representative exemplars (generated from affine hull/convex hull models) are used in Cevikalp and Triggs (2010) and Hu et al. (2012) for image set representations. The mean of the set images is used as part of the image set representation in Ortiz et al. (2013), Hu et al. (2012) and Lu et al. (2013), and image sets are represented as points on a manifold geometry in Wang et al. (2012) and Harandi et al. (2011). The main motivation behind a single entity representation of image sets (e.g., a subspace, an exemplar image, the mean, or a point on a manifold) is to achieve compactness and computational efficiency. However, these representations do not necessarily encode all of the useful information contained in the images of the image set (as explained in detail in Sect. 2). In this paper, we take a different approach which does not represent an image set by a single entity. We instead retain all the images of the image set in their original form and design an efficient classification framework to effectively deal with the plenitude of the data involved.

The proposed image set classification framework is built on well-developed learning algorithms. Although these algorithms are originally designed for classification from single images, we demonstrate that they can be tailored for image set classification, by first individually classifying the images of a query set followed by an appropriate voting strategy (see Sect. 4.2). However, due to the plenitude of the data involved in the case of image set classification, a straightforward extension of these algorithms (from single image to image set classification) would be computationally burdensome. Specifically, since most of the popular learning algorithms (e.g., Support Vector Machines, AdaBoost, linear regression, logistic regression and decision tree algorithms) are inherently binary classifiers, their extension to a multi-class classification problem (such as image set classification) requires the training of multiple binary classifiers. One-vs-one and one-vs-rest are the two most commonly adopted strategies for this purpose. For a k-class classification problem, \(\frac{k (k-1) }{2}\) and k binary classifiers are respectively trained for the one-vs-one and one-vs-rest strategies; for example, for \(k=100\) classes, one-vs-one requires the training of 4950 binary classifiers. Although one-vs-rest trains comparatively fewer classifiers, it still requires images from all classes to train each binary classifier. Adopting either of the well-known one-vs-one or one-vs-rest strategies for image set classification would therefore require a lot of computational effort, since either the number of images involved is quite large or a fairly large number of binary classifiers have to be trained.

The proposed framework in this paper trains a very small number of binary classifiers (mostly one, or a maximum of five) on a very small fraction of images for the task of multi-class image set classification. The framework (see block diagram in Fig. 1) first splits the training images from all classes into two sets \(\mathcal {D}_1\) and \(\mathcal {D}_2\). The division is done such that \(\mathcal {D}_1\) contains uniformly randomly sampled images from all classes with the total number of images in \(\mathcal {D}_1\) being comparable to the number of images of the query image set. \(\mathcal {D}_2\) contains all training images except the ones in \(\mathcal {D}_1\). Next, a linear binary classifier is trained to optimally separate images of the query set from \(\mathcal {D}_1\). Note that \(\mathcal {D}_1\) has some images which belong to the class of the query set. However, since these images are very few in number, the classifier treats them as outliers. The trained classifier therefore learns to discriminate the class of the query set from all the other classes. Next, the learned classifier is evaluated on the images of \(\mathcal {D}_2\). The images of \(\mathcal {D}_2\) which are classified to belong to the images of the query set are of particular interest. Knowing the original class labels of these training images, we construct a histogram which is then used to decide on the class of the query set. A detailed description of the proposed framework is presented in Sect. 3 along with an illustration using a toy example in Fig. 3.

Fig. 1

Block diagram of the proposed method. The training data is divided into two sets \(\mathcal {D}_1\) and \(\mathcal {D}_2\). \(\mathcal {D}_1\) contains uniformly randomly sampled images from all classes such that the size of \(\mathcal {D}_1\) is comparable to the size of the query image set \(\mathcal {X}_{q}\). A binary classifier is trained, with images of \(\mathcal {X}_{q}\) (labelled \(+1\)) and \(\mathcal {D}_1\) (labelled \(-1\)). The classifier is then tested on the images of \(\mathcal {D}_2\). Knowing the class labels of the images of \(\mathcal {D}_2\) which are classified \(+1\), we compute a histogram (see Eq. 1), which is then used to decide on the class of \(\mathcal {X}_{q}\). See a toy example in Fig. 3 for illustration

The main strengths of the proposed method are as follows. (1) A new strategy is introduced to extend any binary classifier for multi-class image set classification. Compared with the existing binary to multi-class strategies (e.g., one-vs-one and one-vs-rest), the proposed approach is computationally efficient to train. It only requires the training of a fixed number of binary classifiers (1–5 compared with k or \(\frac{k(k-1)}{2}\)) using a small number of images. (2) Along with the predicted class label of the query image set, the proposed method gives a confidence level of its prediction. This information is very useful and can be used as an indication of a potential misclassification. The prior knowledge of a query image set being misclassified allows for the potential use of another binary classifier. The proposed method can therefore accommodate the fusion of information from different types of binary classifiers before declaring the final class label of the query image set. (3) The proposed method is easily scalable to new classes. Unlike many existing image set classification methods, the computational complexity of the proposed method is not affected much by the addition of new classes in the gallery (see Sect. 4.2). Some of the existing methods would require retraining on the complete dataset (when new classes are enrolled), whereas the proposed method requires no additional training and can efficiently discriminate the query class from the other classes using a fixed number of binary classifiers (Sect. 4.7).

A preliminary version of our method appeared in Hayat et al. (2014). This paper extends Hayat et al. (2014) in the following ways. (1) We encode a facial image in terms of the activations of a trained deep convolutional neural network. Compared with a shallow representation, the proposed learned feature representation proves to be more effective in discriminating images of different individuals (Sect. 3.1). (2) In order to further enhance the effectiveness of the proposed method, we propose three different sampling strategies. One of the proposed strategies also takes into consideration the head pose information of the facial images, which results in an overall improved performance of the method (Sect. 3.4). (3) We propose an extension of our method for the task of still to video face recognition, which is an important and challenging real-life problem with numerous applications in security and surveillance systems (Sect. 4.4). (4) The efficacy of the proposed method is demonstrated through extensive experiments on four additional unconstrained real-life datasets (Sect. 4). We further extend our experimental evaluations by presenting a quantitative robustness analysis of different aspects of the proposed method (Sect. 4.5).

2 Related Work

The main focus of the existing image set classification methods is to find a suitable representation which can effectively model the appearance variations within an image set. Two different types of approaches have been previously developed for this purpose. The first approach models the variations within the images of a set through a statistical distribution and uses a measure such as KL-divergence to compare two sets. The methods based on this approach are called parametric model-based methods (Arandjelovic et al. 2005; Shakhnarovich et al. 2002). One of their major limitations is their reliance on a very strong assumption about the existence of a statistical correlation between image sets. The second approach for image set representation avoids such assumptions. The methods based on this approach are called non-parametric model-based methods (Kim et al. 2007; Wang et al. 2008; Wang and Chen 2009; Cevikalp and Triggs 2010; Harandi et al. 2011; Hu et al. 2012; Wang et al. 2012; Yang et al. 2013; Ortiz et al. 2013; Zhu et al. 2013; Hayat et al. 2014; Uzair et al. 2013) and have been shown to give a superior performance compared with the parametric model-based methods. A brief overview of the non-parametric model-based methods is given below.

Subspaces have been very commonly used by the non-parametric methods to represent image sets. Examples include image sets represented by linear subspaces (Kim et al. 2007; Yamaguchi et al. 1998), orthogonal subspaces (Oja 1983) and a combination of linear subspaces (Wang et al. 2008; Wang and Chen 2009). Principal angles are then used to compare subspaces. A drawback of these methods is that they represent image sets of different sizes by a subspace of the same dimension. These methods cannot therefore uniformly capture the critical information from image sets with different set lengths. Specifically, for sets with a larger number of images and diverse appearance variations, the subspace-based methods cannot accommodate all the information contained in the images. Image sets can also be represented by their geometric structures, i.e., affine hull or convex hull models. For example, Affine Hull Image Set Distance (AHISD) (Cevikalp and Triggs 2010) and Sparse Approximated Nearest Points (SANP) (Hu et al. 2012) use the affine hull, whereas Convex Hull Image Set Distance (CHISD) (Cevikalp and Triggs 2010) uses the convex hull of the images to model an image set. A set-to-set distance is then determined in terms of the Euclidean distance between the set representative exemplars which are generated from the corresponding geometric structures. Although these methods have been shown to produce a promising performance, they are prone to outliers and are computationally expensive (since they require a direct one-to-one comparison of the query set with all sets in the gallery). Some of the non-parametric model-based methods represent an image set as a point on a certain manifold geometry, e.g., the Grassmannian manifold (Wang and Chen 2009; Harandi et al. 2011) and the Lie group of a Riemannian manifold (Wang et al. 2012). The mean of the set images has also been used either solely or as a part of the image set representation in Ortiz et al. (2013), Hu et al. (2012) and Lu et al. (2013).

In this paper, we argue that a single entity (e.g., a subspace, a point on a manifold, or an exemplar generated from a geometric structure) for image set representation can be sub-optimal and could result in the loss of information from the images of the set. For example, for image sets represented by a subspace, the amount of the retained information depends on the selected dimensions of the subspace. Similarly, generating representative exemplars from geometric structures could result in exemplars which are practically non-existent and are very different from the original images of the set. We, therefore, take a different approach which does not require any compact image set representation. Instead, the images are retained in their original form and a novel classification concept is proposed which incorporates well-developed learning algorithms to optimally discriminate the class of the query image set from all other classes. A detailed description of the proposed framework is presented next.

3 Proposed Method

Our proposed method first encodes raw face images in terms of the activations of a trained Convolutional Neural Network (CNN) (Sect. 3.1). The encoded face images are then used by the proposed image set classification algorithm, whose detailed description is presented in Sect. 3.2. Two important components of our proposed algorithm (the choice of the binary classifiers and the sampling strategies) are further elaborated in Sects. 3.3 and 3.4, respectively. The proposed image set classification algorithm is then finally illustrated with the help of a toy example in Sect. 3.5.

3.1 Convolutional Feature Encoding

We are interested in mapping raw face images to a discriminative feature space where faces of different persons are easily separable. For this purpose, instead of using shallow or local feature representations (as in Hayat et al. 2014), we represent face images in terms of the activations of a trained deep Convolutional Neural Network (CNN) model. Learned representations based on CNNs have significantly outperformed hand-crafted representations on nearly all major computer vision tasks (Chatfield et al. 2014; Jia et al. 2014; An et al. 2015; Khan et al. 2014). To this end, we adapt the parameters of AlexNet (Krizhevsky et al. 2012) (originally trained on 1.2 million images of 1000 object classes) for facial images. AlexNet consists of 5 convolutional and 3 fully-connected layers. In order to adapt the parameters of the network for facial images, we first encode the faces of the BU4DFE dataset (Yin et al. 2008) in terms of the activations of the last convolutional layer. These encoded faces are then used as input to fine-tune the parameters of the three fully connected layers, after changing the number of neurons in the last layer from 1000 (object categories in the ILSVRC, Russakovsky et al. 2015) to 100 (the number of subjects in the BU4DFE dataset). After learning the parameters of the fully connected part of the network, we append it back to the convolutional part and fine-tune the complete network on the facial images of the BU4DFE dataset. Once the network parameters have been adapted, we feed the raw face images to the network’s input layer after mean normalization. The processed output from the first fully connected layer of the network is considered to be our convolutional feature representation of the input face images. Apart from representing images in terms of the activations of AlexNet adapted for the facial images of the BU4DFE dataset, we also explore their representation in terms of the activations of the VGG-Face CNN model (Parkhi et al. 2015), which is specifically trained on 2.6 million facial images of 2,622 subjects. A performance comparison of different feature encoding methods is presented in Sect. 4.6ii.
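The following is a minimal sketch of this encoding step, assuming torchvision's AlexNet as a stand-in for the adapted network; the two-stage fine-tuning on the BU4DFE faces and the data pipeline are omitted, and the helper name `encode` is illustrative.

```python
# A sketch of the convolutional feature encoding, assuming torchvision's
# AlexNet; the fine-tuning loop on the face dataset is not shown.
import torch
import torchvision.models as models

net = models.alexnet(pretrained=True)           # 5 conv + 3 FC layers
# Replace the 1000-way ILSVRC output with 100 units (BU4DFE subjects)
net.classifier[6] = torch.nn.Linear(4096, 100)
# ... fine-tune `net` on the face images here ...

def encode(images):
    """First fully-connected layer activations for a batch of
    mean-normalized face images of shape (N, 3, 224, 224)."""
    net.eval()
    with torch.no_grad():
        x = net.features(images)                # convolutional part
        x = net.avgpool(x).flatten(1)           # (N, 9216)
        return net.classifier[1](x)             # first FC layer -> (N, 4096)
```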

3.2 Image Set Classification Algorithm

3.2.1 Problem Description

For the k classes of the training data, we are given k image sets \(\mathcal {X}_1,\mathcal {X}_2,\ldots \mathcal {X}_k\) and their corresponding class labels \(y_c \in \left[ 1,2,\ldots k \right] \). An image set \(\mathcal {X}_c=\{\mathbf {x}^{(t)}|y^{(t)}=c;t=1,2,\ldots N_c\}\) contains all \(N_c\) training images \(\mathbf {x}^{(t)}\) belonging to class c. Note that for training data with multiple image sets per class, we combine the images from all sets into a single set. During classification, we are given a query image set \(\mathcal {X}_{q}=\{\mathbf {x}^{(t)}\}_{t=1}^{N_{q}}\), and the task is to find the class label \(y_{q}\) of \(\mathcal {X}_{q}\).
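In code form, the notation above maps to a pooled feature matrix and label vector; a sketch with illustrative names, where `training_sets` is assumed to hold the per-class feature arrays:

```python
# `training_sets` is a list of k arrays, each of shape (N_c, m), holding
# the convolutional features of one class; D pools all training images
# and yD stores their class labels 1..k.
import numpy as np

D = np.vstack(training_sets)
yD = np.hstack([np.full(len(X), c + 1) for c, X in enumerate(training_sets)])
```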

Algorithm 1

The proposed image set classification algorithm is summarized in Algorithm 1. The details are presented below.

  1.

    After encoding all the face images in terms of their convolutional activations, the images from all training sets are gathered into a single set \(\mathcal {D}= \{\mathcal {X}_1,\mathcal {X}_2,\ldots \mathcal {X}_k \}\). Next, \(\mathcal {D}\) is divided into two sets: \(\mathcal {D}_1\) and \(\mathcal {D}_2\) by adopting one of the sampling strategies described in Sect. 3.4. The division is done such that \(\mathcal {D}_1\) contains an equal representation of images from all classes of the training data and the total number of images in \(\mathcal {D}_1\) is comparable to that of \(\mathcal {X}_q\). The class label information of images in \(\mathcal {D}_1\) and \(\mathcal {D}_2\) is stored in sets \(\mathbf {y}_{\mathcal {D}_1}=\{y^{(t)} \in [1,2,\ldots k ],t=1,2,\ldots N_{\mathcal {D}_1}\}\) and \(\mathbf {y}_{\mathcal {D}_2}=\{y^{(t)} \in [1,2,\ldots k ],t=1,2,\ldots N_{\mathcal {D}_2}\}\) respectively.

  2.

    Next, we train a binary classifier \(C_1\). Training is done on the images of \(\mathcal {X}_{q}\) and \(\mathcal {D}_1\). All images in \(\mathcal {X}_{q}\) are labelled \(+1\), while the images in \(\mathcal {D}_1\) are labelled \(-1\). Since images from all classes are present in \(\mathcal {D}_1\), the classifier learns to separate images of \(\mathcal {X}_{q}\) from the images of the other classes. Note that \(\mathcal {D}_1\) does have a small number of images from the same class as \(\mathcal {X}_{q}\). However, since these images are very few, the binary classifier treats them as outliers and learns to discriminate the class of the query image set from all other classes (Sect. 4.5ii).

  3.

    The trained classifier \(C_1\) is then tested on the images of \(\mathcal {D}_2\). The images in \(\mathcal {D}_2\) classified as \(+1\) (same as images of \(\mathcal {X}_q\)) are of interest. Let \(\mathbf {y}_{\mathcal {D}_2^+} \subset \mathbf {y}_{\mathcal {D}_2}\) contain the class labels of images of \(\mathcal {D}_2\) classified +1 by the classifier \(C_1\).

  4.

    A normalized frequency histogram \(\mathbf {h}\) of class labels in \(\mathbf {y}_{\mathcal {D}_2^+}\) is computed. The cth value of the histogram, \(\mathbf {h}_c\), is given by the percentage of the images of class c in \(\mathcal {D}_2\) which are classified \(+1\). Formally, \(\mathbf {h}_c\) is given by the ratio of the number of images of \(\mathcal {D}_2\) belonging to class c and classified as +1 to the total number of images of \(\mathcal {D}_2\) belonging to class c. This is given by,

    $$\begin{aligned} \mathbf {h}_c=\frac{\sum \limits _{y^{(t)} \in \mathbf {y}_{\mathcal {D}_2^+}} \delta _c (y^{(t)} )}{\sum \limits _{y^{(t)} \in \mathbf {y}_{\mathcal {D}_2}} \delta _c (y^{(t)} )}, \quad \text {where} \quad \delta _c (y^{(t)}) = {\left\{ \begin{array}{ll} 1, &{} y^{(t)}=c \\ 0, &{} \text { otherwise. } \end{array}\right. } \end{aligned}$$
    (1)
  5.

    A class in \(\mathcal {D}_2\) with most of its images classified as \(+1\) can be predicted as the class of \(\mathcal {X}_{q}\). The class label \(y_{q}\) of \(\mathcal {X}_{q}\) is therefore given by,

    $$\begin{aligned} y_{q}=\arg \max _c {\mathbf {h}}_c. \end{aligned}$$
    (2)

    We can also get a confidence level d of our prediction of \(y_{q}\). This is defined in terms of the difference between the maximum and the second maximum values of histogram \(\mathbf {h}\),

    $$\begin{aligned} d=\max \limits _{c \in \{1\cdots k\}}\mathbf {h}_c - \max \limits _{c \in \{1\cdots k\}\setminus y_q }\mathbf {h}_c. \end{aligned}$$
    (3)

    We are more confident about our prediction if the predicted class is a ‘clear winner’. In the case of closely competing classes, the confidence level of the prediction will be low.

  6.

    We declare the class label of \(\mathcal {X}_{q}\) (as in Eq. 2) provided that the confidence d is greater than a certain threshold. The value of the threshold is determined empirically by performing experiments on a validation set. Otherwise, if the confidence level d is less than the threshold, steps 1–5 are repeated, for different random samplings of images into \(\mathcal {D}_1\) and \(\mathcal {D}_2\). After every iteration, a mean histogram \(\bar{\mathbf {h}}\) is computed using the histogram of that iteration and the previous iterations. The confidence level d is also computed after every iteration using,

    $$\begin{aligned} d=\max \limits _{c \in \{1\cdots k\}} \bar{\mathbf {h}}_c - \max \limits _{c \in \{1\cdots k\}\setminus y_q } \bar{\mathbf {h}}_c. \end{aligned}$$
    (4)

    Iterations are stopped if the confidence level d becomes greater than the threshold or after a maximum of five iterations. Performing more iterations enhances the robustness of the method (since different images are selected into \(\mathcal {D}_1\) and \(\mathcal {D}_2\) for every iteration) but at the cost of an increased computational effort. Our experiments revealed that a maximum of five iterations is a good trade-off between the robustness and the computational complexity (Sect. 4.6iii).

  7.

    If the confidence level d (see Eq. 4) is greater than the threshold, we declare the class label of \(\mathcal {X}_{q}\) as \(y_{q}=\arg \max _c {\bar{\mathbf {h}}}_c\). Otherwise, declaring the class label would very likely result in a misclassification. In that case, the whole procedure is repeated with a second binary classifier \(C_2\). The decision about \(y_{q}\) is then based on the confidence levels of \(C_1\) and \(C_2\): the prediction of the more confident classifier is taken as the final decision. The choice of the binary classifiers \(C_1\) and \(C_2\) is described in Sect. 3.3.
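The following is a condensed sketch of steps 1–6, assuming pre-computed convolutional features and class labels \(1,\ldots,k\); the fallback to \(C_2\) (step 7) is omitted. The name `sample_D1_D2` stands for any of the sampling strategies of Sect. 3.4, and the threshold value is illustrative only.

```python
# A condensed sketch of steps 1-6 of Algorithm 1.
import numpy as np
from sklearn.svm import LinearSVC

def classify_set(Xq, D, yD, k, sample_D1_D2, threshold=0.3, max_iters=5):
    hists = []
    for _ in range(max_iters):
        # Step 1: split the pooled training data into D1 and D2
        (D1, _), (D2, yD2) = sample_D1_D2(D, yD, len(Xq), k)
        # Step 2: binary classifier C1, query set (+1) vs. D1 (-1)
        X = np.vstack([Xq, D1])
        y = np.hstack([np.ones(len(Xq)), -np.ones(len(D1))])
        clf = LinearSVC(C=1.0).fit(X, y)
        # Steps 3-4: normalized histogram of D2 images classified +1 (Eq. 1)
        pos = clf.predict(D2) == 1
        h = np.array([(pos & (yD2 == c)).sum() / max((yD2 == c).sum(), 1)
                      for c in range(1, k + 1)])
        hists.append(h)
        # Steps 5-6: confidence = gap between the two largest bins (Eq. 4)
        h_bar = np.mean(hists, axis=0)
        top2 = np.sort(h_bar)[-2:]
        if top2[1] - top2[0] > threshold:
            break
    return int(np.argmax(h_bar)) + 1    # predicted label y_q (Eq. 2)
```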

3.3 The Choice of Binary Classifiers

The proposed framework requires a binary classifier to distinguish between the images of \(\mathcal {X}_{q}\) and \(\mathcal {D}_1\). The choice of the binary classifier should be based on its ability to generalize well to unseen data during the testing phase. Moreover, since the binary classifier is trained on the images of \(\mathcal {X}_{q}\) and \(\mathcal {D}_1\), and some images in \(\mathcal {D}_1\) belong to the same class as \(\mathcal {X}_{q}\), the binary classifier should not overfit on the training data and should treat these images as outliers. For these reasons, a Support Vector Machine (SVM) with a linear kernel is deemed to be an appropriate choice. It is known to produce an excellent generalization to unknown test data and can effectively handle outliers.

Two classifiers (\(C_1\) and \(C_2\)) are used by the proposed framework. \(C_1\) is a linear SVM with L2 regularization and L2 loss function, while \(C_2\) is a linear SVM with L1 regularization and L2 loss function (Fan et al. 2008). Specifically, given a set of training example-label pairs \(\left( \mathbf {x}^{(t)},y^{(t)} \right) , y^{(t)}\in \{+1,-1\}\), \(C_1\) solves the following optimization problem,

$$\begin{aligned} \min _\mathbf {w} \; \; \frac{1}{2}\mathbf {w}^T\mathbf {w} + C \sum _t \left( \max \left( 0,1-y^{(t)}\mathbf {w}^T\mathbf {x}^{(t)} \right) \right) ^2, \end{aligned}$$
(5)

while, \(C_2\) solves the following optimization problem,

$$\begin{aligned} \min _\mathbf {w} \; \; \Vert \mathbf {w} \Vert _1 + C \sum _t \left( \max \left( 0,1-y^{(t)}\mathbf {w}^T\mathbf {x}^{(t)} \right) \right) ^2. \end{aligned}$$
(6)

Here, \(\mathbf {w}\) is the coefficient vector to be learned and \(C>0\) is the penalty parameter used for regularization. After learning the SVM parameter \(\mathbf {w}\), classification is performed based on the sign of \(\mathbf {w}^T\mathbf {x}^{(t)}\). Note that the coefficient vector \(\mathbf {w}\) learned by classifier \(C_2\) (used for the challenging cases) is sparse due to the L1 regularization. Learning a sparse \(\mathbf {w}\) for \(C_2\) further enhances the generalization capability for the challenging cases. We have also evaluated other binary classifiers, which include non-linear SVMs with Radial Basis Function (RBF) and Chi-Square kernels, and random decision forests (Sect. 4.6i).
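As a concrete reference, the two configurations can be instantiated through the LIBLINEAR solvers (Fan et al. 2008) exposed by scikit-learn; this is a minimal sketch, and the value \(C=1.0\) is an illustrative default rather than a tuned parameter.

```python
# Sketch of the two classifier configurations via scikit-learn/LIBLINEAR.
from sklearn.svm import LinearSVC

# C1: L2 regularization with squared hinge (L2) loss, Eq. 5
C1 = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)
# C2: L1 regularization with squared hinge loss, Eq. 6 (sparse w)
C2 = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=1.0)
```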

3.4 Sampling Strategies

Given all the training image sets \(\mathcal {X}_1,\mathcal {X}_2,\cdots \mathcal {X}_k\), we gather these images (of the training data) into a set \(\mathcal {D}\). Next, the images in \(\mathcal {D}\) are sampled into two subsets (\(\mathcal {D}_1\) and \(\mathcal {D}_2\)) which are used by the proposed algorithm, as explained in Sect. 3.2. For the sampling of the images of \(\mathcal {D}\) to generate \(\mathcal {D}_1\) and \(\mathcal {D}_2\), we introduce three different sampling strategies. The following two general rules of thumb are followed for sampling: (1) the total number of images in \(\mathcal {D}_1\) is kept comparable to the number of images of the query set \(\mathcal {X}_q\). Since our proposed image set classification algorithm trains a binary classifier to discriminate between \(\mathcal {D}_1\) and \(\mathcal {X}_q\), a huge disparity between the numbers of images in \(\mathcal {D}_1\) and \(\mathcal {X}_q\) could result in a trained binary classifier which is biased towards the majority class. (2) The images in the sampled set \(\mathcal {D}_1\) have an equal representation (>0) from all the classes of the training data. The detailed description of the proposed sampling strategies follows next.

3.4.1 Uniform Random Sampling

Let \(\mathcal {D}_{1c}\) be a randomly sampled subset of \(\mathcal {X}_c\) with a set size \(N_{\mathcal {D}_{1c}}\), where \(N_{\mathcal {D}_{1c}}=\) \(\left\lceil \frac{N_q}{k} \right\rceil \), such that \(N_{\mathcal {D}_{1c}} \ne 0\) in any case, then the set \(\mathcal {D}_1\) is formed by the union operation: \(\mathcal {D}_1 = \bigcup _c \mathcal {D}_{1c}\), \(c=1,2,\ldots k\). \(\mathcal {D}_2\) is obtained by \(\mathcal {D}_2=\mathcal {D} \setminus \mathcal {D}_1\).
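A minimal sketch of this strategy, assuming every class has at least \(\left\lceil \frac{N_q}{k} \right\rceil \) training images; the names `D`, `yD` and `uniform_sample` are illustrative.

```python
# Uniform random sampling of D into D1 and D2 (Sect. 3.4.1).
import numpy as np

def uniform_sample(D, yD, Nq, k, rng=np.random.default_rng()):
    per_class = max(int(np.ceil(Nq / k)), 1)     # N_{D_1c}, never zero
    idx = np.hstack([rng.choice(np.flatnonzero(yD == c),
                                size=per_class, replace=False)
                     for c in range(1, k + 1)])
    mask = np.zeros(len(D), dtype=bool)
    mask[idx] = True
    # D1 with its labels, and D2 = D \ D1 with its labels
    return (D[mask], yD[mask]), (D[~mask], yD[~mask])
```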

3.4.2 Bootstrapped Sampling

We first perform bootstrapping and sample a set \({\mathcal {D}}'\) from \(\mathcal {D}\) such that \({\mathcal {D}}' \subset \mathcal {D}\) and \( \left| {\mathcal {D}}' \right| =\lfloor 0.8 \left| \mathcal {D}\right| \rfloor \). Images in \({\mathcal {D}}'\) are randomly picked from \(\mathcal {D}\). \(\mathcal {D}_1\) and \(\mathcal {D}_2\) are then uniformly randomly sampled from \({\mathcal {D}}'\) by following the same procedure described in Sect. 3.4.1. Sampling from the bootstrapped set \({\mathcal {D}}'\) over multiple iterations gives a data augmentation effect which eventually introduces robustness and results in an improved performance of the proposed method (Sect. 4.6).
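A sketch of this variant, reusing the hypothetical `uniform_sample` from the previous sketch:

```python
# Bootstrapped sampling (Sect. 3.4.2): D' is a random 80% subset of D,
# and D1/D2 are then sampled uniformly from within D'.
def bootstrap_sample(D, yD, Nq, k, rng=np.random.default_rng()):
    idx = rng.choice(len(D), size=int(0.8 * len(D)), replace=False)
    return uniform_sample(D[idx], yD[idx], Nq, k, rng)
```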

3.4.3 Pose-Based Sampling

During our experiments, a visual inspection of the challenging YouTube celebrities dataset (Sect. 4.1) revealed that many of the misclassified query image sets had face images with a head pose (such as profile views) which is otherwise not commonly present in the training images. For such cases, only those images in \(\mathcal {D}_2\) with the same pose as the images of \(\mathcal {X}_q\) (irrespective of their classes) are classified as \(+1\). Our proposed pose-based sampling strategy aims to address this issue. The basic intuition here is to first estimate the pose of the images, and then use this pose information to assign images into \(\mathcal {D}_1\) and \(\mathcal {D}_2\). For example, if most of the images of \(\mathcal {X}_q\) are in right profile views, our sampling of the training images into \(\mathcal {D}_1\) and \(\mathcal {D}_2\) should consider only images with right profile views. This helps to overcome any bias introduced by the head pose during classification.

In this strategy, we first determine the pose group of the face images using the pose group approximation method (described next). We then sample \({\mathcal {D}}'\) from \(\mathcal {D}\) such that \({\mathcal {D}}'\) has only those images from \({\mathcal {D}}\) whose pose group is similar to that of the images of the query image set \(\mathcal {X}_q\). Images from \({\mathcal {D}}'\) are then uniformly randomly sampled into \(\mathcal {D}_1\) and \(\mathcal {D}_2\) by following the procedure explained in Sect. 3.4.1. Note that \({\mathcal {D}}'\) is supposed to have an equal representation of images from all training image sets. However, we might not necessarily have images with the same pose as those of \(\mathcal {X}_q\) for all training sets. From such training sets, we select the images with the most similar poses into \({\mathcal {D}}'\). The employed pose group approximation method (Hayat et al. 2015) is described next.

Fig. 2

Sample results of pose group approximation

Fig. 3

Toy example to illustrate the proposed method. Consider a training set with three classes and the task is to find the class of \(\mathcal {X}_{q}\) (a). Data points from the three training image sets \(\mathcal {X}_{1}\), \(\mathcal {X}_{2}\), \(\mathcal {X}_{3}\) and a query image set \(\mathcal {X}_{q}\) are shown. b Data points from \(\mathcal {X}_{q}\) and \(\mathcal {D}_{1}\) (uniformly randomly sampled from \(\mathcal {X}_{1}\), \(\mathcal {X}_{2}\) and \(\mathcal {X}_{3}\)) are shown. c The learnt SVM boundary between \(\mathcal {X}_{q}\) (labelled \(+1\)) and \(\mathcal {D}_{1}\) (labelled \(-1\)). d The data points of \(\mathcal {D}_{2}\) w.r.t. the learnt SVM boundary. Since the points of \(\mathcal {X}_{3}\) in \(\mathcal {D}_{2}\) lie on the same side of the boundary as the points of \(\mathcal {X}_{q}\), the proposed method declares \(\mathcal {X}_{q}\) to be from \(\mathcal {X}_{3}\). Figure best seen in color (Color figure online)

3.4.4 Pose Group Approximation

An image is said to belong to a pose group \(g \in \left\{ 1,2,\ldots G \right\} \), if its head pose along the yaw direction (rotation about the y-axis) is within \(\theta _g \pm 15^{\circ }\). For our purpose, we define \(G=5\) and \(\theta =\begin{bmatrix} -60,&-30,&0,&30,&60 \end{bmatrix}\). The process of pose group approximation has two steps: training and testing.

Training Let \(X_g \in \mathbb {R}^{m \times n_g}\) contain \(n_g\) images \(\mathbf {x}^{(t)} \in \mathbb {R}^m\) whose pose is within \(\theta _g \pm 15^{\circ }\). We automatically select these images from a Kinect dataset (see Sect. 4.1), since the pose of Kinect images can be determined by the random regression forest based method of Fanelli et al. (2011a). From \(X_g\), we want to extract the directions of the major data orientation (principal directions). To achieve that, we first subtract the mean image from \(X_g\) and compute its covariance matrix \(\varSigma _g\) as follows,

$$\begin{aligned}&\bar{X}_g=X_g-\frac{1}{n_g} \sum _t \mathbf {x}^{(t)}, \end{aligned}$$
(7)
$$\begin{aligned}&\varSigma _g =\bar{X}_g\bar{X}^T_g. \end{aligned}$$
(8)

The singular value decomposition of the covariance matrix \(\varSigma _g\) results in \(\varSigma _g=U_gS_gV_g^T\). The component \(U_g\) contains the eigenvectors arranged in descending order of their significance. From \(U_g\), we select the top k eigenvectors corresponding to the k largest eigenvalues and use them as columns to construct a matrix \(\mathcal {S}_g \in \mathbb {R}^{m \times k}\). \(\mathcal {S}_g\) is therefore a subspace whose columns represent the predominant data structure in the images of \(X_g\). Next, during the testing phase of our pose group approximation approach, \(\mathcal {S}_g\) is used for a linear regression based classification strategy.

Testing The pose group \(\mathcal {P} ( \mathbf {x}^{(t)} )\) of the image \(\mathbf {x}^{(t)}\) is determined by,

$$\begin{aligned} \mathcal {P} ( \mathbf {x}^{(t)} )=\arg \min \limits _g \left\| \mathbf {x}^{(t)}-\tilde{\mathbf {x}}_g^{(t)} \right\| _2, \end{aligned}$$
(9)

where \(\tilde{\mathbf {x}}_g^{(t)}\) is linearly constructed from \(\mathcal {S}_g\) as follows,

$$\begin{aligned} \tilde{\mathbf {x}}^{(t)}_g=\mathcal {S}_g\alpha _g^{(t)}. \end{aligned}$$
(10)

The least-squares solution of the above equation is given by,

$$\begin{aligned} \alpha _g^{(t)}= ( \mathcal {S}_g^T\mathcal {S}_g )^{-1}\mathcal {S}_g^T\mathbf {x}^{(t)}. \end{aligned}$$
(11)

A few sample results of our pose group approximation method are presented in Fig. 2. The pose group \(\mathcal {P} ( \mathbf {x}^{(t)} )\) of all the images of the training data, as well as of the images of the query set \(\mathcal {X}_q\), is determined by following the procedure explained above (see the sketch below). Next, we sample images from \(\mathcal {D}\) into \({\mathcal {D}}'\) such that the images in \({\mathcal {D}}'\) have the same pose as the images of \(\mathcal {X}_q\). We ensure the inclusion of an equal representation of all classes in \({\mathcal {D}}'\). In the case of classes with no or very few images with the same pose as those of \(\mathcal {X}_q\), images with nearly similar poses are selected. After getting \({\mathcal {D}}'\), we sample \(\mathcal {D}_1\) and \(\mathcal {D}_2\) by following the same procedure as explained in Sect. 3.4.1.
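A compact sketch of both phases (Eqs. 7–11), assuming images are flattened m-dimensional vectors; the function names are illustrative.

```python
# Pose group approximation: per-group subspace construction and
# regression-based group assignment.
import numpy as np

def build_subspace(Xg, n_dims):
    """Xg: (m, n_g) images of one pose group; returns S_g, shape (m, n_dims)."""
    Xg_bar = Xg - Xg.mean(axis=1, keepdims=True)    # mean subtraction, Eq. 7
    cov = Xg_bar @ Xg_bar.T                         # covariance matrix, Eq. 8
    U, _, _ = np.linalg.svd(cov)                    # eigenvectors by significance
    return U[:, :n_dims]                            # top columns form S_g

def pose_group(x, subspaces):
    """Assign image x to the group whose subspace reconstructs it best (Eq. 9)."""
    errs = []
    for S in subspaces:
        alpha = np.linalg.solve(S.T @ S, S.T @ x)   # least-squares alpha, Eq. 11
        errs.append(np.linalg.norm(x - S @ alpha))  # reconstruction, Eqs. 9-10
    return int(np.argmin(errs))
```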

3.5 Illustration with a Toy Example

The proposed image set classification algorithm is illustrated with the help of a toy example in Fig. 3. Let us consider a three class set classification problem in which we are given three training sets \(\mathcal {X}_{1}\), \(\mathcal {X}_{2}\), \(\mathcal {X}_{3}\) and a query set \(\mathcal {X}_{q}\). The data points of the training sets and the query set are shown in Fig. 3a. First, we form \(\mathcal {D}_{1}\) by randomly sampling points from \(\mathcal {X}_{1}\), \(\mathcal {X}_{2}\) and \(\mathcal {X}_{3}\). Figure 3b shows the data points of \(\mathcal {D}_{1}\) and \(\mathcal {X}_{q}\). Next, a linear SVM is trained by labelling the data points of \(\mathcal {X}_{q}\) as \(+1\) and those of \(\mathcal {D}_{1}\) as \(-1\). Note that the SVM (Fig. 3c) ignores the mislabelled points (the points of \(\mathcal {X}_{3}\) in \(\mathcal {D}_{1}\)) and treats them as outliers. Finally, we classify the data points of \(\mathcal {D}_{2}\) with the learned SVM boundary. Figure 3d shows that the SVM labels the points of \(\mathcal {X}_{3}\) in \(\mathcal {D}_{2}\) as \(+1\). The proposed algorithm therefore declares \(\mathcal {X}_{q}\) to belong to the class of \(\mathcal {X}_{3}\).

4 Experiments

We perform experiments to evaluate the performance of the proposed method for two tasks: (1) image set classification based face and object recognition, and (2) still to video face recognition. For image set classification based object recognition, experiments are performed on the ETH-80 dataset (Leibe and Schiele 2003), while the Honda/UCSD (Lee et al. 2003), CMU Mobo (Gross and Shi 2001), YouTube celebrities (Kim et al. 2008), a composite RGB-D Kinect dataset (obtained by combining three Kinect datasets), PubFig (Kumar et al. 2009), COX (Huang et al. 2013) and FaceScrub (Ng and Winkler 2014) datasets are used for image set classification based face recognition. For still to video face recognition, the COX dataset is used. It should be noted that most of the previous image set classification methods have only been evaluated on the Honda, CMU Mobo, ETH-80 and YouTube celebrities datasets. Amongst these, only the YouTube celebrities dataset is a real-life dataset, whereas Honda, CMU Mobo and ETH-80 are considered relatively easy since they were acquired in an indoor lab environment under controlled conditions. Apart from the challenging YouTube celebrities dataset, this paper also presents a comparative performance evaluation of our method with the existing methods (Yamaguchi et al. 1998; Kim et al. 2007; Wang et al. 2008; Wang and Chen 2009; Cevikalp and Triggs 2010; Harandi et al. 2011; Hu et al. 2012; Wang et al. 2012; Ortiz et al. 2013; Zhu et al. 2013; Yang et al. 2013; Hayat et al. 2014) on three additional real-life datasets collected under unconstrained conditions. These include the PubFig, COX and FaceScrub datasets.

Below, we first give a brief description of each of these datasets followed by the adopted experimental protocols (Sect. 4.1). We then present a performance comparison of our proposed method with the baseline multi-class classification strategies (Sect. 4.2), followed by a comparison with the existing state of the art image set classification methods (Sect. 4.3). The performance analysis for still to video face recognition is presented in Sect. 4.4. A quantitative robustness analysis of different aspects of the proposed method is presented in Sect. 4.5. Finally, an ablative study to assess the contributions and impact of different components of our proposed method towards its overall performance is presented in Sect. 4.6. A comparison of the computational complexity of different methods is given in Sect. 4.7.

Fig. 4

Sample images from different datasets. Note the high intra-class variations in the form of different head poses, illumination variations, expression deformations and occlusions

4.1 Datasets and Experimental Settings

The Honda/UCSD Dataset (Lee et al. 2003) contains 59 video sequences (with 12 to 645 frames in each video) of 20 subjects. We use the Viola and Jones face detection algorithm (Viola and Jones 2004) to extract faces from the video frames. The extracted faces are then resized to \(20\times 20\). For our experiments, we consider each video sequence as an image set and follow an evaluation configuration similar to Lee et al. (2003). Specifically, 20 video sequences are used for training and the remaining 39 sequences are used for testing. Three separate experiments are performed by considering all frames of a video as an image set and by limiting the total number of frames in an image set to 50 and 100 (to evaluate the robustness for fewer images in a set). Each experiment is repeated 10 times for different random selections of the training and testing image sets.

The CMU Mobo (Motion of Body) Dataset (Gross and Shi 2001) contains a total of 96 video sequences of 24 subjects walking on a treadmill. The faces from the videos are extracted using Viola and Jones (2004) and resized to \(40\times 40\). Similar to Wang et al. (2012) and Hu et al. (2012), we consider each video as an image set and use one set per subject for training and the remaining sets for testing. For consistency, the experiments are repeated ten times for different training and testing sets.

YouTube Celebrities Kim et al. (2008) dataset contains 1910 videos of 47 celebrities. The dataset is collected from YouTube and the videos are acquired under real-life scenarios. The faces in the dataset exhibit, therefore, a wide range of diversity and appearance variations in the form of changing illumination conditions, different head pose rotations and expression variations. Since the resolution of the face images is very low, face detection by Viola and Jones (2004) fails for a significant number of frames for this dataset. We, therefore, use tracking (Ross et al. 2008) to extract faces. Specifically, knowing the location of the face window in the first frame (provided with the dataset), we use the method of Ross et al. (2008) to track the face region in the subsequent frames. The extracted face region is then resized to \(30 \times 30\). In order to perform experiments, we treat the faces acquired from each video as an image set and adopt a five fold cross validation experimental setup similar to Wang et al. (2008), Wang and Chen (2009), Hu et al. (2012) and Wang et al. (2012). The complete dataset is divided into five equal folds with minimal overlap. Each fold has nine image sets per subject, three of which are used for training and the remaining six are used for testing.

The Composite Kinect Dataset is formed by combining three distinct Kinect datasets: CurtinFaces (Li et al. 2013), Biwi Kinect (Fanelli et al. 2011a) and an in-house dataset acquired in our laboratory. The numbers of subjects in these datasets are 52 (5000 RGB-D images), 20 (15,000 RGB-D images) and 48 (15,000 RGB-D images) respectively. The random forest regression based classifier of Fanelli et al. (2011b) is used to detect faces from the Kinect acquired images. The images in the composite dataset have a large range of variations in the form of changing illumination conditions, head pose rotations, expression deformations, sun glass disguise, and occlusions by hand. For performance evaluation, we randomly divide the RGB-D images of each subject into five uniform folds. Considering each fold as an image set, we select one set for training and the remaining sets for testing. The experiments are repeated five times for different selections of the training and testing sets.

ETH-80 Object Dataset contains still RGB images of eight object categories. These include cars, cows, apples, dogs, cups, horses, pears and tomatoes. Each object category is further divided into ten subcategories such as different brands of cars or different breeds of dogs. Each subcategory contains images under 41 orientations. For our experiments, we use the \(128 \times 128\) cropped images [1] and resize them to \(32 \times 32\). We follow an experimental setup similar to Wang and Chen (2009), Kim et al. (2007) and Wang et al. (2012). Images of an object in a subcategory are considered as an image set. For each object, five subcategories are randomly selected for training and the remaining five are used for testing. Ten runs of experiments are performed for different random selections of the training and testing sets.

Public Figures Face Database (PubFig) (Kumar et al. 2009) is a real-life dataset of 200 people collected from the internet. The images (static RGB) of the dataset have been acquired in uncontrolled situations without any user cooperation. The sample images of a subject in Fig. 4 illustrate the large variations in the images caused by pose, lighting, expressions, backgrounds and camera positions. For our experiments, we divide the images of each subject equally into three folds. Considering each fold as an image set, we use one of them for training and the remaining two for testing. Experiments are repeated five times for different random selections of images for the training and testing folds.

The COX (Huang et al. 2013) Dataset contains 1000 high resolution still images and 4000 uncontrolled low resolution video sequences of 1000 subjects. The videos have been captured inside a gymnasium with the subjects walking naturally and without any restriction on expression and head orientation. The dataset contains four videos per subject. The face resolution, head orientation and lighting conditions in each video are significantly different from the others. Sample images of a subject from this dataset are shown in Fig. 4. For our image set classification experiments, we use the frames of each video as an image set and follow a leave-one-out strategy where one image set is held out for testing and the remaining are used for training. For consistency, four runs of experiments are performed by swapping the training and testing image sets.

For still to video face recognition experiments, we consider the high resolution still images (which were acquired with full user cooperation) as our gallery. The low resolution images of a video sequence are used as the probe image set. Still to video face recognition experiments are performed by following the standard evaluation protocol described in Huang et al. (2014). The still images and the images from the video sequences of 300 individuals are randomly selected to learn a common embedding space for both the low and high resolution images using the technique in Sharma et al. (2012). The images of the remaining 700 individuals are used for testing. Experiments are repeated five times for different random shufflings of subjects between the training and testing sets. A common embedding space is learnt because the gallery and probe data possess very different visual characteristics, i.e., the gallery contains good quality frontal face images acquired with full user cooperation whereas the probes are low quality non-frontal images acquired without any cooperation.

FaceScrub (Ng and Winkler 2014) is a large dataset of 530 (265 male and 265 female) celebrities and famous personalities. The dataset is collected from the internet and comprises a total of 107,818 RGB face images with approximately 200 images per person. A few sample images of a person in Fig. 4 show the wide range of appearance variations in the images of an individual from the dataset. For our experiments, we divide the images of each person into ten folds. Considering each fold as an image set, we use one fold for training and the remaining for testing. Experiments are repeated five times for different random selections of images into the folds.

Following the evaluation configurations described above, we perform experiments and compare our method with the baseline methods and the current state of the art methods. A detailed analysis, extensive performance evaluations and comparisons are presented next.

Table 1 Performance comparison with the baseline methods

4.2 Comparison with the Baseline Methods

Linear SVM based one-vs-one and one-vs-rest multi-class classification strategies are used as baseline methods for comparison. Note that these baseline methods are suitable for classification from single images. For image set classification, we first individually classify every image of the query image set, followed by a majority voting to decide on the class of the query image set (see the sketch below). Experimental results in terms of average identification rates and standard deviations on all datasets are presented in Table 1. Note that for the Honda dataset, we perform three experiments, i.e., by considering all frames of the video as an image set, and then limiting the number of images in a set to 100 and 50 (see Sect. 4.3). Here, the results presented for the Honda/UCSD dataset are only for all frames of the videos considered as image sets. The results show that, amongst the compared baseline multi-class classification strategies, one-vs-rest performs slightly better than one-vs-one. Our method performs better than the baseline methods on all datasets except ETH-80. A possible explanation for the lower performance on ETH-80 is that the proposed method trains a binary classifier on the images of \(\mathcal {X}_q\) and \(\mathcal {D}_1\), which is then evaluated on \(\mathcal {D}_2\). The set \(\mathcal {D}_1\) contains \(\left\lceil \frac{N_q}{k} \right\rceil \) images with the same label as \(\mathcal {X}_q\). For larger k, these images are few in number and do not affect the training of the binary classifier. However, for smaller values of k (as is the case for the ETH-80 dataset, \(k=8\)) the proportion of these images is higher and causes a slight performance degradation. A quantitative robustness analysis of the proposed method for different values of k is presented in Sect. 4.5iii.
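For reference, the baseline set-level decision can be sketched as follows, assuming the pooled training features `D` and labels `yD` from the earlier sketches (one-vs-rest is scikit-learn's default multi-class scheme for LinearSVC):

```python
# Baseline: per-image multi-class SVM predictions + majority voting.
import numpy as np
from sklearn.svm import LinearSVC

clf = LinearSVC().fit(D, yD)                 # k one-vs-rest binary SVMs

def baseline_set_label(clf, Xq):
    votes = clf.predict(Xq).astype(int)      # per-image class predictions
    return np.bincount(votes).argmax()       # majority vote over the set
```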

Table 2 Complexity analysis

Table 2 presents a comparison of the computational complexity in terms of the required number of binary classifiers and the number of images used to train each of these classifiers. One-vs-one trains \(\frac{k (k-1)}{2}\) binary classifiers and uses images from two classes to train each classifier. Although the number of trained classifiers for one-vs-rest is comparatively smaller (k compared with \(\frac{k (k-1)}{2}\)), the number of images used to train each binary classifier is quite large (all images of the dataset are used). In comparison, our proposed method trains only a few binary classifiers (a maximum of five for the challenging cases) and the number of images used for training is also small. A main difference of our method from the baseline strategies is that it performs all computations at run-time.

Table 3 Performance comparison on Honda/UCSD dataset

4.3 Comparison with Existing Image Set Classification Methods

We present a comparison of our method with a number of recently proposed state of the art image set classification methods. The compared methods include the Mutual Subspace Method (MSM) (Yamaguchi et al. 1998), Discriminant Canonical Correlation Analysis (DCC) (Kim et al. 2007), Manifold-to-Manifold Distance (MMD) (Wang et al. 2008), Manifold Discriminant Analysis (MDA) (Wang and Chen 2009), the Linear version of the Affine Hull-based Image Set Distance (AHISD) (Cevikalp and Triggs 2010), the Convex Hull-based Image Set Distance (CHISD) (Cevikalp and Triggs 2010), Graph Embedding Discriminant Analysis (GEDA) (Harandi et al. 2011), Sparse Approximated Nearest Points (SANP) (Hu et al. 2012), Covariance Discriminative Learning (CDL) (Wang et al. 2012), Mean Sequence Sparse Representation Classification (MSSRC) (Ortiz et al. 2013), Set to Set Distance Metric Learning (SSDML) (Zhu et al. 2013), Regularized Nearest Points (RNP) (Yang et al. 2013) and Non-Linear Reconstruction Models (NLRM) (Hayat et al. 2014). We use the implementations provided by the respective authors for all methods. The parameters for all methods are optimized for best performance.

Specifically, for MSM, Principal Component Analysis (PCA) is applied to retain 90% of the total energy. For DCC, the dimensions of the embedding space are set to 100. The number of retained dimensions for a subspace is set to 10 (90% of the energy is preserved) and the corresponding 10 maximum canonical correlations are used to compute the set-set similarity. For datasets with one training set per class (Honda/UCSD, CMU, Kinect, PubFig, COX and FaceScrub), we randomly divide the training set into two subsets to construct the within-class sets as in Kim et al. (2007). The parameters for MMD and MDA are adopted from Wang et al. (2008) and Wang and Chen (2009) respectively. The number of connected nearest neighbours used to compute the geodesic distance is either set to 12 or to the number of images in the smallest image set of the dataset. The ratio between the Euclidean distance and the geodesic distance is optimized for all datasets. In the case of MMD, the distance is computed in terms of the maximum canonical correlation. No parameter settings are required for AHISD. For CHISD, the same error penalty term (\(C=100\)) as in Cevikalp and Triggs (2010) is used. For SANP, the same weight parameters as in Hu et al. (2012) are adopted for optimization. For GEDA, we set \(k^{[\text {cc}]}=1, k^{[\text {proj}]}=100\) and \(v=3\) (the value of v is searched over a range of 1–10 for best performance). The number of eigenvectors r used to represent an image set is set to 9 and 6, respectively, for Mobo and YouTube celebrities, and to 10 for all other datasets. No parameter settings are required for CDL. For RNP (Yang et al. 2013), PCA is applied to preserve 90% of the energy and the same weight parameters as in Yang et al. (2013) are used. No parameter configurations are required for MSSRC and SSDML. For NLRM, we use majority voting and apply PCA to reduce the dimensionality of the embedded space to 400.

Table 4 Performance evaluation of all methods on different datasets
Fig. 5

Cumulative match characteristic (CMC) curves on YTC, PubFig, COX and FS datasets. Figure best seen in color. a YTC, b PubFig, c COX, d FaceScrub (Color figure online)

Fig. 6

Receiver operating characteristics (ROC) curves on YTC, PubFig, COX and FS datasets. Figure best seen in color. a YTC, b PubFig, c COX, d FaceScrub (Color figure online)

The experimental results, in terms of the average identification rates and standard deviations of the different methods on the Honda/UCSD dataset, are presented in Table 3. The proposed method achieves a perfect classification for all frames of the video sequence (considered as an image set), as well as when the total number of images in the set is reduced to 100 and 50. This shows that our method is robust w.r.t. the number of images in the set and is suitable for real-life scenarios (where only a limited number of images are available in a set).

Fig. 7

Equal error rates (EERs) of different methods on all datasets. Figure best seen in color. a Honda, b CMU, c YTC, d Kinect, e PubFig, f FaceScrub, g COX, h ETH-80 (Color figure online)

The average identification rates and standard deviations of the different methods when tested on the CMU Mobo, YouTube Celebrities (YTC), Kinect, ETH-80, PubFig, COX and FaceScrub (FS) datasets are summarized in Table 4. The results show that the proposed method outperforms most of the existing methods on all datasets. The gain in performance is more significant for the YTC, PubFig and FS datasets. Note that YTC, PubFig and FS are very challenging datasets since their images have been acquired in real-life scenarios without any user cooperation. The proposed method achieves a relative performance boost of 8.4, 11.0 and \(12.7\%\) on the YTC, PubFig and FS datasets, respectively. Another notable aspect of the proposed method is that it not only works for image set classification based face recognition but also achieves a very high identification rate of \(96.1\%\) for the task of image set classification based object recognition.

The performance of all methods is further analyzed in Figs. 5 and 6 on four real-life datasets: YTC, PubFig, COX and FS. Specifically, Cumulative Match Characteristics (CMC) and Receiver Operating Characteristics (ROC) curves for the top performing methods are presented in Figs. 5 and 6, respectively. The results in Fig. 5 suggest that the proposed method consistently achieves the highest rank-1 to rank-10 identification rates for most of the evaluated datasets. The ROC curves in Fig. 6 show that the proposed method outperforms all others. Equal error rates are shown in Fig. 7 to compare the verification performance of the different methods on all datasets. The results show that the proposed method achieves superior performance by producing the lowest equal error rates compared with the existing methods on almost all of the evaluated datasets.

The state of the art performance of the proposed method is attributed to the fact that (unlike existing methods) it does not resort to a single entity representation (such as a subspace, the mean of set images or an exemplar image) for all images of the set. Any potential loss of information is therefore avoided by retaining the images of the set in their original form. Moreover, well-developed classification algorithms are efficiently incorporated within the proposed framework to optimally discriminate the class of the query image set from the remaining classes. Furthermore, since the proposed method provides a confidence level for its prediction, the classification decisions from multiple classifiers can be fused to enhance the overall performance of the method.

4.4 Still to Video Face Recognition

We also validate our proposed approach for still to video face recognition, which is useful in numerous real-life applications such as face recognition from surveillance cameras. The only modification required to adapt the proposed method to the case of still to video face recognition is to perform more iterations in steps 1–5 of the original algorithm. For this, we enforce an upper limit of 10 iterations. Table 5 compares our proposed method against a number of recent works which can be adapted to the case of still to video face recognition. These include the baseline Nearest Neighbour (NN) Classifier, Neighbourhood Component Analysis (NCA) (Goldberger et al. 2004), Information Theoretic Metric Learning (ITML) (Davis et al. 2007), Local Fisher Discriminant Analysis (LFDA) (Sugiyama 2007), Large Margin Nearest Neighbor (LMNN) (Weinberger and Saul 2009), Nearest Feature Classifiers (NFC) (Chien and Wu 2002), Hyperplane Distance Nearest Neighbor (HKNN) (Vincent and Bengio 2001), K-local Convex Distance Nearest Neighbors (CKNN) (Vincent and Bengio 2001), Mahalanobis Distance (MD), Point to Set Distance Metric Learning (PSDML) (Zhu et al. 2013) and Learning Euclidean to Riemannian Metric (LERM) (Huang et al. 2013). Experiments are conducted using the COX still-to-video dataset. The results in Table 5 illustrate the superior performance of our method and its suitability for the challenging and important problem of still to video face recognition from surveillance imagery.

Table 5 Still to video face recognition

4.5 Robustness Analysis

In order to analyze the robustness of the proposed method with respect to its different aspects, we conduct quantitative experimental evaluations. In this regard, the following aspects are explored: (i) the number of images in the gallery and probe sets, (ii) the number of images in the sets \(\mathcal {D}_1\) and \(\mathcal {D}_2\), and (iii) the number of enrolled subjects in the gallery. These experimental evaluations and the achieved results are discussed next.

Fig. 8 Classification performance on the YouTube celebrities dataset for a reduced number of images in the (a) gallery and (b) probe sets

(i) Size of Gallery and Probe Image Sets We perform experiments on the YouTube celebrities dataset by enforcing an upper limit on the number of images in the sets. Specifically, keeping the size of the probe image sets fixed, we first gradually reduce the number of images in the gallery sets from 250 to 8. We then keep the size of the gallery image sets fixed and gradually decrease the size of the probe image sets. The experimental results for reduced gallery and probe sets are presented in Fig. 8a, b, respectively. The results suggest that the performance of the proposed method is quite robust to the size of the probe image sets: reducing them to as few as 8 images still yields a classification accuracy of \(72.1\%\) (compared with \(77.4\%\) for full-sized sets). Reducing the size of the gallery image sets below 25 images, however, does cause a noticeable performance drop. The proposed method can still achieve an accuracy of \(61.4\%\) when the gallery set size is reduced to only 8 images.

(ii) Size of \(\mathcal {D}_1\) and \(\mathcal {D}_2\) The proposed method trains a binary classifier between the images of \(\mathcal {X}_q\) and \(\mathcal {D}_1\), which is then evaluated on \(\mathcal {D}_2\). \(\mathcal {D}_1\) contains \(N_{\mathcal {D}_{1c}} = \left\lceil \frac{N_q}{k} \right\rceil\) uniformly sampled images from each class of the training data. It also contains \(N_{\mathcal {D}_{1c}}\) mislabelled images (which have the same label as \(\mathcal {X}_q\)). Increasing the size of \(\mathcal {D}_1\) decreases the size of \(\mathcal {D}_2\) and also increases the number of mislabelled images in \(\mathcal {D}_1\), which causes the performance to drop. To quantitatively evaluate the robustness of the proposed method against the number of images in \(\mathcal {D}_1\) and \(\mathcal {D}_2\), we gradually increase the number of images sampled from each class of the training data to form \(\mathcal {D}_1\), from \(N_{\mathcal {D}_{1c}}\) to \(m N_{\mathcal {D}_{1c}}\). Experimental results on the YouTube celebrities dataset for \(m=\{0.5, 1, 1.5, 2, 2.5, 3, 4, 5, 6, 7\}\) are presented in Table 6. The results show that the performance of the method only drops from \(77.7\) to \(73.2\%\) for a 14-fold increase (from \(m=0.5\) to \(m=7\)) in the number of images in \(\mathcal {D}_1\). A possible reason for this drop is the imbalance between \(\mathcal {X}_q\) and \(\mathcal {D}_1\) for larger values of m. We also perform experiments in which the mislabelled images are excluded from \(\mathcal {D}_1\); a classification accuracy of \(78.8\%\) is achieved for \(m=0\). These evaluations suggest that the proposed method is robust to the number of images in \(\mathcal {D}_1\) and \(\mathcal {D}_2\). Although increasing the size of \(\mathcal {D}_1\) increases the number of mislabelled images, their overall proportion stays the same, i.e., \(\frac{1}{k}\) of all the images. For a large value of k (the number of enrolled subjects in the gallery), this proportion is too small to significantly impact the performance of the proposed method.
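
To make the sampling concrete, the following sketch (our illustration; function and variable names are hypothetical, and the paper's uniform sampling is approximated here by uniform random sampling without replacement) builds \(\mathcal {D}_1\) from roughly \(\lceil m N_q / k \rceil\) images per training class and assigns the remainder to \(\mathcal {D}_2\):

```python
import math
import numpy as np

def split_d1_d2(train_labels, num_query_images, m=1.0, seed=0):
    """Sketch of the D1/D2 split: sample about ceil(m * N_q / k) images per
    training class into D1; all remaining training images form D2. The
    mislabelled images added to D1 in the full method are omitted for
    brevity."""
    rng = np.random.default_rng(seed)
    classes = np.unique(train_labels)
    per_class = math.ceil(m * num_query_images / len(classes))
    d1 = []
    for c in classes:
        idx = np.flatnonzero(train_labels == c)
        d1.append(rng.choice(idx, size=min(per_class, len(idx)), replace=False))
    d1 = np.concatenate(d1)
    d2 = np.setdiff1d(np.arange(len(train_labels)), d1)
    return d1, d2
```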

Table 6 Performance evaluation by changing the number of images in \(\mathcal {D}_1\) and \(\mathcal {D}_2\)
Fig. 9 Performance evaluation for different numbers of enrolled subjects and maximum numbers of images in the gallery

(iii) Number of Enrolled Subjects In our experimental evaluations (Sect. 4), the efficacy of the proposed method has been demonstrated on a wide range of datasets in which the number of enrolled subjects varies from 20 to 1000. Furthermore, the previous experiment (Sect. 4.5ii) showed that for a larger value of k, the fraction of mislabelled images in \(\mathcal {D}_1\) (\(\frac{1}{k}\) of all images) is small and does not significantly affect the training of the binary classifier or the overall performance of the proposed method. In this experiment, we quantitatively evaluate the effect of k (the number of enrolled subjects in the gallery) on the performance of the proposed method. To this end, we perform experiments on the YouTube celebrities dataset for \(k=\{47, 40, 30, 20, 15\}\). For each value of k, we further evaluate with different numbers of images in the gallery sets (full size, 200, 100, 50 and 25). The experimental results presented in Fig. 9 suggest a gradual performance drop for a reduced number of enrolled subjects in the gallery. This drop is, however, quite small when the gallery sets contain more images, and is more pronounced for lower values of k when the gallery sets contain fewer images.

4.6 Ablative Analysis

We conduct experiments on the YouTube celebrities dataset to study the contribution of the different components of the proposed method towards its overall performance. The following aspects are explored:

(i) Binary Classifiers Experiments are performed with different binary classifiers, including linear SVM (Fan et al. 2008), non-linear SVM with a Radial Basis Function (RBF) kernel (Chang and Lin 2011), non-linear SVM with a Chi-Square kernel (Vedaldi and Zisserman 2012) and random decision forests (Breiman 2001). The experimental results in Table 7 show that the choice of the binary classifier does not significantly impact the performance. Although non-linear SVMs outperform linear SVMs on many classification tasks, in our case they show comparable performance. This can be attributed to the strongly discriminative feature representation provided by the activations of a Convolutional Neural Network (CNN). CNN-based features in combination with a linear SVM have shown superior performance on many challenging classification tasks (Sharif Razavian et al. 2014; Khan et al. 2016). We therefore select the linear SVM because of its computational efficiency. We note that for linearly inseparable data, a linear SVM may perform poorly; in such cases, any non-linear binary classifier can easily be employed in conjunction with the proposed technique.
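
As a sketch of this setup (the feature matrices are hypothetical stand-ins, and scikit-learn is assumed), a linear SVM can be trained to separate the query set images from \(\mathcal {D}_1\), with its decision values serving as confidence scores:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for 4096-dimensional CNN activations:
# X_q  -- features of the query set images (positive class)
# X_d1 -- features of the sampled set D1 (negative class)
rng = np.random.default_rng(0)
X_q, X_d1 = rng.normal(size=(40, 4096)), rng.normal(size=(60, 4096))

X = np.vstack([X_q, X_d1])
y = np.concatenate([np.ones(len(X_q)), np.zeros(len(X_d1))])

clf = LinearSVC(C=1.0).fit(X, y)         # fast linear SVM (liblinear)
scores = clf.decision_function(X_d1)     # signed distances usable as confidences
```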

Table 7 Performance evaluation for different choices of binary classifiers

(ii) Feature Descriptors Experiments are performed on the YouTube celebrities dataset with different methods of encoding facial images. These include Local Binary Patterns (LBPs) (Ojala et al. 2002), Gabor features (Yang et al. 2004), activations of AlexNet (Krizhevsky et al. 2012) fine-tuned on the BU4DFE dataset (Yin et al. 2008), and activations of the VGG-Face CNN model (Parkhi et al. 2015). For LBPs, each image is divided into \(4\times 4\) non-overlapping blocks and a 59-dimensional histogram is extracted from each block. The histograms from all 16 blocks are then concatenated to obtain the final 944-dimensional feature vector. For Gabor features, we generate a bank of 40 Gabor wavelet filters at five scales and eight orientations; an image is convolved with these filters, and the down-sampled magnitude responses are used as the feature representation. For the CNN models, we use the 4096-dimensional activations of the first fully connected layer as the feature representation of the input image. The experimental results in Table 8 show that the learned feature representations in terms of CNN activations perform significantly better than LBPs and Gabor features. We also evaluate these features in combination with Principal Component Analysis (PCA) whitening. The results show that PCA whitening yields a performance improvement for LBPs and Gabor features, but a slight performance drop for the learned features.
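
The LBP descriptor described above can be reproduced along the following lines (a sketch assuming scikit-image; the function name is ours). With \(P=8\) neighbours, the non-rotation-invariant uniform encoding yields exactly 59 pattern labels per block, so a \(4\times 4\) grid gives the stated 944 dimensions:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_feature(gray_image, blocks=4, P=8, R=1):
    """Uniform LBP codes histogrammed over a blocks x blocks grid.
    'nri_uniform' with P=8 produces 59 pattern labels, so the descriptor
    has blocks * blocks * 59 = 944 dimensions for a 4x4 grid."""
    codes = local_binary_pattern(gray_image, P, R, method='nri_uniform')
    h, w = codes.shape
    bh, bw = h // blocks, w // blocks
    hists = []
    for i in range(blocks):
        for j in range(blocks):
            block = codes[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            hist, _ = np.histogram(block, bins=59, range=(0, 59), density=True)
            hists.append(hist)
    return np.concatenate(hists)   # 944-dimensional descriptor
```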

Table 8 Performance evaluation for different feature descriptors
Fig. 10 Performance evaluation for different numbers of iterations of steps 1–5 (Algorithm 1) of the proposed method. Considering the performance and the required computational load, we select a total of five iterations as an optimal choice

(iii) Number of Iterations Fig. 10 shows the performance for different maximum numbers of iterations of steps 1–5 of the proposed method. The results show that performing more iterations improves the robustness of our approach and results in a slightly improved recognition performance; this, however, requires more computational effort. A total of five iterations is therefore a good trade-off between recognition performance and computational complexity.

(iv) Sampling Strategies The results in Table 9 show that bootstrapped sampling introduces more robustness and enhances performance. Incorporating pose-based information during sampling further improves the proposed method, since most of the images in the sampled set then have the same pose as the images of the query image set. By doing so, the trained binary classifier learns to discriminate the images of the query set from the others, rather than discriminating them based upon their poses. A visual inspection of the failure cases revealed that most of the misclassifications occurred when the pose difference between most of the images of the gallery and probe sets was greater than \(45^{\circ }\).
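
A minimal sketch of the bootstrapped variant (our illustration; function and variable names are hypothetical) draws a fresh negative set with replacement for each trained classifier, so that different classifiers see different random views of the negative data:

```python
import numpy as np

def bootstrap_negative_sets(candidate_idx, set_size, num_classifiers, seed=0):
    """Draw one negative set per classifier by sampling candidate training
    images with replacement (bootstrap)."""
    rng = np.random.default_rng(seed)
    return [rng.choice(candidate_idx, size=set_size, replace=True)
            for _ in range(num_classifiers)]
```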

(v) Ensemble Effect The results in Table 9 show that the two binary classifiers (see Sect. 3.3) complement each other, and their combined use results in a performance boost.

Table 9 Ablative analysis for different sampling strategies and ensemble of classifiers

Based on our empirical evaluations and ablative analysis on the YTC dataset, we attribute the performance of our proposed method to the following reasons. The proposed method can naturally accommodate the fusion of information from multiple classifiers: a binary classifier can be trained multiple times for different random samplings of the negative set, and information from different types of binary classifiers can be fused simultaneously. The learned feature representations, in terms of the activations of Convolutional Neural Networks, also give the proposed method a significant performance boost.

Table 10 Timing comparison on the YouTube celebrities dataset

4.7 Timing Analysis

Table 10 lists the times (in seconds) for the different methods, using their respective Matlab implementations on a Core i7 machine. Specifically, we report the time required for offline training and the time needed to test one image set on the YouTube celebrities dataset. The reported time for our method corresponds to five iterations of steps 1–5 of our algorithm (see Sect. 3.2). For MSM (Yamaguchi et al. 1998), AHISD (Cevikalp and Triggs 2010), CHISD (Cevikalp and Triggs 2010) and RNP (Yang et al. 2013), the reported test time also includes the time required to compute the subspaces and projection matrices of the training data. These can be computed offline; doing so takes approximately 0.9 s for the training data of the YouTube celebrities dataset.

Based upon their computational requirements, the evaluated methods can be categorized as online methods, which do all of their computation at run time (e.g., Yamaguchi et al. 1998; Cevikalp and Triggs 2010; Hu et al. 2012; Ortiz et al. 2013; Yang et al. 2013), and offline methods, which perform training offline so that only testing is done at run time (e.g., Kim et al. 2007; Wang and Chen 2009; Wang et al. 2012; Zhu et al. 2013; Yang et al. 2013). Both categories have their strengths and limitations. A major strength of online methods is their scalability: new classes can easily be added without retraining on the complete dataset. A major limitation of online methods (including ours) is that all computation is done at run time and comparatively more memory is required; in our implementation, we observed that, on average, our method requires approximately 450 MB of RAM to classify a query image set on the YouTube celebrities dataset. In comparison, offline methods are efficient at run time and require fewer computational resources.

5 Conclusion

A new approach has been introduced to efficiently extend well-known binary classifiers to multi-class image set classification. Compared with the popular one-vs-one and one-vs-rest binary-to-multi-class strategies, the proposed approach is very efficient, as it trains a fixed number of binary classifiers (one to five) and uses very few images for training. The proposed approach can also simultaneously fuse information from different types of binary classifiers, which further enhances its robustness and accuracy. Extensive experiments have been performed to validate the proposed approach for the tasks of video-based face recognition, still-to-video face recognition and object recognition. The experimental results and a comparison with existing methods show that the proposed method consistently achieves state-of-the-art performance.