
1 Introduction

Fisher Discriminant Analysis (FDA) [1], first proposed in [2], is a powerful subspace learning method which tries to minimize the intra-class scatter and maximize the inter-class scatter of data for better separation of classes. FDA treats all pairs of classes in the same way; however, some classes might be much farther from one another than other classes are. In other words, the distances between classes differ. Closer classes need more attention because classifiers may more easily confuse them, whereas classes far from each other are generally easier to separate. The same problem exists in Kernel FDA (KFDA) [3] and in most subspace learning methods that are based on the generalized eigenvalue problem, such as FDA and KFDA [4]; hence, a weighting procedure might be more appropriate.

In this paper, we propose several weighting procedures for FDA and KFDA. The contributions of this paper are three-fold: (1) proposing Cosine-Weighted FDA (CW-FDA) as a new modification of FDA, (2) proposing Automatically Weighted FDA (AW-FDA) as a new version of FDA in which the weights are set automatically, and (3) proposing Weighted KFDA (W-KFDA) to have weighting procedures in the feature space, where both the existing and the newly proposed weighting methods can be used in the feature space.

The paper is organized as follows: In Sect. 2, we briefly review the theory of FDA and KFDA. In Sect. 3, we formulate the weighted FDA, review the existing weighting methods, and then propose CW-FDA and AW-FDA. Section 4 proposes weighted KFDA in the feature space. In addition to using the existing methods for weighted KFDA, two versions of CW-KFDA and also AW-KFDA are proposed. Section 5 reports the experiments. Finally, Sect. 6 concludes the paper.

2 Fisher and Kernel Discriminant Analysis

2.1 Fisher Discriminant Analysis

Let \(\{\varvec{x}_i^{(r)} \in \mathbb {R}^d\}_{i=1}^{n_r}\) denote the samples of the r-th class where \(n_r\) is the class’s sample size. Suppose \(\varvec{\mu }^{(r)} \in \mathbb {R}^d\), c, n, and \(\varvec{U} \in \mathbb {R}^{d \times d}\) denote the mean of r-th class, the number of classes, the total sample size, and the projection matrix in FDA, respectively. Although some methods solve FDA using least squares problem [5, 6], the regular FDA [2] maximizes the Fisher criterion [7]:

$$\begin{aligned}&\underset{\varvec{U}}{\text {maximize}} ~~~ \frac{\mathbf{tr }(\varvec{U}^\top \varvec{S}_B\, \varvec{U})}{\mathbf{tr }(\varvec{U}^\top \varvec{S}_W\, \varvec{U})}, \end{aligned}$$
(1)

where \(\mathbf{tr} (\cdot )\) is the trace of a matrix. The Fisher criterion is a generalized Rayleigh-Ritz quotient [8]. We may recast the problem as [9]:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{U}}{\text {maximize}}&\mathbf{tr} (\varvec{U}^\top \varvec{S}_B\, \varvec{U}), \\&\text {subject to}&\varvec{U}^\top \varvec{S}_W\, \varvec{U} = \varvec{I}, \end{aligned} \end{aligned}$$
(2)

where the \(\varvec{S}_W \in \mathbb {R}^{d \times d}\) and \(\varvec{S}_B \in \mathbb {R}^{d \times d}\) are the intra- (within) and inter-class (between) scatters, respectively [9]:

$$\begin{aligned} \varvec{S}_W&:= \sum _{r=1}^c \sum _{i=1}^{n_r} n_r (\varvec{x}_i^{(r)} - \varvec{\mu }^{(r)}) (\varvec{x}_i^{(r)} - \varvec{\mu }^{(r)})^\top = \sum _{r=1}^c n_r\, \breve{\varvec{X}}_r\, \breve{\varvec{X}}_r^\top , \end{aligned}$$
(3)
$$\begin{aligned} \varvec{S}_B&:= \sum _{r=1}^c \sum _{\ell =1}^c n_r\, n_\ell (\varvec{\mu }^{(r)} - \varvec{\mu }^{(\ell )}) (\varvec{\mu }^{(r)} - \varvec{\mu }^{(\ell )})^\top = \sum _{r=1}^c n_r\, \varvec{M}_r\, \varvec{N}\, \varvec{M}_r^\top , \end{aligned}$$
(4)

where \(\mathbb {R}^{d \times n_r} \ni \breve{\varvec{X}}_r := [\varvec{x}_1^{(r)} - \varvec{\mu }^{(r)}, \dots , \varvec{x}_{n_r}^{(r)} - \varvec{\mu }^{(r)}]\), \(\mathbb {R}^{d \times c} \ni \varvec{M}_r := [\varvec{\mu }^{(r)} - \varvec{\mu }^{(1)}, \dots , \varvec{\mu }^{(r)} - \varvec{\mu }^{(c)}]\), and \(\mathbb {R}^{c \times c} \ni \varvec{N} := \mathbf{diag} ([n_1, \dots , n_c]^\top )\). The mean of the r-th class is \(\mathbb {R}^{d} \ni \varvec{\mu }^{(r)} := (1/n_r) \sum _{i=1}^{n_r} \varvec{x}_i^{(r)}\). The Lagrange relaxation [10] of the optimization problem is \(\mathcal {L} = \mathbf{tr} (\varvec{U}^\top \varvec{S}_B\, \varvec{U}) - \mathbf{tr} \big (\varvec{\varLambda }^\top (\varvec{U}^\top \varvec{S}_W\, \varvec{U} - \varvec{I})\big )\), where \(\varvec{\varLambda }\) is a diagonal matrix whose diagonal entries are the Lagrange multipliers. Setting the derivative of the Lagrangian to zero gives:

$$\begin{aligned}&\frac{\partial \mathcal {L}}{\partial \varvec{U}} = 2\varvec{S}_B \varvec{U} - 2\varvec{S}_W\varvec{U} \varvec{\varLambda } \overset{\text {set}}{=} \varvec{0} \implies \varvec{S}_B\, \varvec{U} = \varvec{S}_W\, \varvec{U} \varvec{\varLambda }, \end{aligned}$$
(5)

which is the generalized eigenvalue problem \((\varvec{S}_B, \varvec{S}_W)\), where the columns of \(\varvec{U}\) and the diagonal of \(\varvec{\varLambda }\) are the eigenvectors and eigenvalues, respectively [11]. The p leading columns of \(\varvec{U}\) (so that \(\varvec{U} \in \mathbb {R}^{d \times p}\)) are the FDA projection directions, where p is the dimensionality of the subspace. Note that \(p \le \min (d, n-1, c-1)\) because of the ranks of the inter- and intra-class scatter matrices [9].
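As a concrete illustration (not the authors' code), the following minimal NumPy/SciPy sketch builds the scatters of Eqs. (3) and (4) and solves the generalized eigenvalue problem \((\varvec{S}_B, \varvec{S}_W)\); the small ridge added to \(\varvec{S}_W\) is our own numerical safeguard for the case where \(\varvec{S}_W\) is (near-)singular, and the interface is illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def fda(X_classes, n_components):
    """X_classes: list of (n_r, d) arrays, one per class (illustrative interface)."""
    d = X_classes[0].shape[1]
    mus = [Xr.mean(axis=0) for Xr in X_classes]           # class means mu^(r)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for r, Xr in enumerate(X_classes):
        n_r = Xr.shape[0]
        Xc = (Xr - mus[r]).T                              # d x n_r centered matrix X-breve_r
        S_W += n_r * Xc @ Xc.T                            # Eq. (3)
        for l, Xl in enumerate(X_classes):
            diff = (mus[r] - mus[l])[:, None]             # d x 1
            S_B += n_r * Xl.shape[0] * diff @ diff.T      # Eq. (4)
    # Generalized eigenvalue problem S_B u = lambda S_W u (Eq. (5)); eigh returns
    # eigenvalues in ascending order, so the leading directions are the last columns.
    _, U = eigh(S_B, S_W + 1e-8 * np.eye(d))              # ridge: our numerical safeguard
    return U[:, ::-1][:, :n_components]                   # d x p projection matrix
```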

2.2 Kernel Fisher Discriminant Analysis

Let the scalar and matrix kernels be denoted by \(k(\varvec{x}_i, \varvec{x}_j) := \varvec{\phi }(\varvec{x}_i)^\top \varvec{\phi }(\varvec{x}_j)\) and \(\varvec{K}(\varvec{X}_1, \varvec{X}_2) := \varvec{\varPhi }(\varvec{X}_1)^\top \varvec{\varPhi }(\varvec{X}_2)\), respectively, where \(\varvec{\phi }(.)\) and \(\varvec{\varPhi }(.)\) are the pulling functions to the feature space. According to the representation theory [12], any solution must lie in the span of all the training vectors; hence, \(\varvec{\varPhi }(\varvec{U}) = \varvec{\varPhi }(\varvec{X})\, \varvec{Y}\), where \(\varvec{Y} \in \mathbb {R}^{n \times d}\) contains the coefficients. The optimization of kernel FDA is [3, 9]:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{Y}}{\text {maximize}}&\mathbf{tr} (\varvec{Y}^\top \varvec{\varDelta }_B\, \varvec{Y}), \\&\text {subject to}&\varvec{Y}^\top \varvec{\varDelta }_W\, \varvec{Y} = \varvec{I}, \end{aligned} \end{aligned}$$
(6)

where \(\varvec{\varDelta }_W \in \mathbb {R}^{n \times n}\) and \(\varvec{\varDelta }_B \in \mathbb {R}^{n \times n}\) are the intra- and inter-class scatters in the feature space, respectively [3, 9]:

$$\begin{aligned} \varvec{\varDelta }_W&:= \sum _{r=1}^c n_r\, \varvec{K}_r\, \varvec{H}_r\, \varvec{K}_r^\top , \end{aligned}$$
(7)
$$\begin{aligned} \varvec{\varDelta }_B&:= \sum _{r=1}^c \sum _{\ell =1}^c n_r\, n_\ell (\varvec{\xi }^{(r)} - \varvec{\xi }^{(\ell )}) (\varvec{\xi }^{(r)} - \varvec{\xi }^{(\ell )})^\top = \sum _{r=1}^c n_r\, \varvec{\varXi }_r\, \varvec{N}\, \varvec{\varXi }_r^\top , \end{aligned}$$
(8)

where \(\mathbb {R}^{n_r \times n_r} \ni \varvec{H}_r := \varvec{I} - (1/n_r) \varvec{1}\varvec{1}^\top \) is the centering matrix, the (ij)-th entry of \(\varvec{K}_r \in \mathbb {R}^{n \times n_r}\) is \(\varvec{K}_r(i,j) := k(\varvec{x}_i, \varvec{x}_j^{(r)})\), the i-th entry of \(\varvec{\xi }^{(r)} \in \mathbb {R}^n\) is \(\varvec{\xi }^{(r)}(i) := (1/n_r) \sum _{j=1}^{n_r} k(\varvec{x}_i, \varvec{x}_j^{(r)})\), and \(\mathbb {R}^{n \times c} \ni \varvec{\varXi }_r := [\varvec{\xi }^{(r)} - \varvec{\xi }^{(1)}, \dots , \varvec{\xi }^{(r)} - \varvec{\xi }^{(c)}]\).

The p leading columns of \(\varvec{Y}\) (so that \(\varvec{Y} \in \mathbb {R}^{n \times p}\)) are the KFDA projection directions, which span the subspace. Note that \(p \le \min (n, c-1)\) because of the ranks of the inter- and intra-class scatter matrices in the feature space [9].
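For illustration, a sketch (with our own interface and an assumed RBF kernel) of building \(\varvec{\varDelta }_W\) and \(\varvec{\varDelta }_B\) from Eqs. (7) and (8) and solving Eq. (6) could look as follows; the ridge term is again our own numerical safeguard.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(A, B, gamma=1e-3):
    """A: d x n1, B: d x n2 (columns are samples); assumed kernel choice."""
    sq = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq)

def kfda(X, y, n_components, gamma=1e-3):
    """X: d x n training data, y: length-n integer labels."""
    n = X.shape[1]
    classes = np.unique(y)
    c = len(classes)
    Delta_W = np.zeros((n, n))
    Xi = np.zeros((n, c))                            # column r is xi^(r)
    N = np.zeros((c, c))
    for r, cls in enumerate(classes):
        Xr = X[:, y == cls]
        n_r = Xr.shape[1]
        K_r = rbf_kernel(X, Xr, gamma)               # n x n_r kernel K_r
        H_r = np.eye(n_r) - np.ones((n_r, n_r)) / n_r
        Delta_W += n_r * K_r @ H_r @ K_r.T           # Eq. (7)
        Xi[:, r] = K_r.mean(axis=1)                  # xi^(r)(i) = (1/n_r) sum_j k(x_i, x_j^(r))
        N[r, r] = n_r
    Delta_B = np.zeros((n, n))
    for r in range(c):
        Xi_r = Xi[:, [r]] - Xi                       # n x c matrix Xi_r
        Delta_B += N[r, r] * Xi_r @ N @ Xi_r.T       # Eq. (8)
    _, Y = eigh(Delta_B, Delta_W + 1e-8 * np.eye(n)) # generalized eigenvalue problem
    return Y[:, ::-1][:, :n_components]              # n x p coefficient matrix Y
```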

3 Weighted Fisher Discriminant Analysis

The optimization of Weighted FDA (W-FDA) is as follows:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{U}}{\text {maximize}}&\mathbf{tr} (\varvec{U}^\top \widehat{\varvec{S}}_B\, \varvec{U}), \\&\text {subject to}&\varvec{U}^\top \varvec{S}_W\, \varvec{U} = \varvec{I}, \end{aligned} \end{aligned}$$
(9)

where the weighted inter-class scatter, \(\widehat{\varvec{S}}_B \in \mathbb {R}^{d \times d}\), is defined as:

$$\begin{aligned} \widehat{\varvec{S}}_B := \sum _{r=1}^c \sum _{\ell =1}^c \alpha _{r\ell }\, n_r\, n_\ell (\varvec{\mu }^{(r)} - \varvec{\mu }^{(\ell )}) (\varvec{\mu }^{(r)} - \varvec{\mu }^{(\ell )})^\top = \sum _{r=1}^c n_r\, \varvec{M}_r\, \varvec{A}_r\, \varvec{N}\, \varvec{M}_r^\top , \end{aligned}$$
(10)

where \(\mathbb {R} \ni \alpha _{r\ell } \ge 0\) is the weight for the pair of the r-th and \(\ell \)-th classes, and \(\mathbb {R}^{c \times c} \ni \varvec{A}_r := \mathbf{diag} ([\alpha _{r1}, \dots , \alpha _{rc}])\). In FDA, we have \(\alpha _{r\ell }=1,~ \forall r, \ell \in \{1, \dots , c\}\). However, it is better for the weights to decrease with the distances between classes so as to concentrate more on the nearby classes. We denote the distance between the r-th and \(\ell \)-th classes by \(d_{r\ell } := ||\varvec{\mu }^{(r)} - \varvec{\mu }^{(\ell )}||_2\). The solution to Eq. (9) is the generalized eigenvalue problem \((\widehat{\varvec{S}}_B, \varvec{S}_W)\), and the p leading columns of \(\varvec{U}\) span the subspace.
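For instance, given any \(c \times c\) matrix of \(\alpha _{r\ell }\) values (produced by the weighting schemes discussed below), the weighted inter-class scatter of Eq. (10) can be assembled as in the following sketch (our own illustrative code):

```python
import numpy as np

def weighted_S_B(mus, ns, alpha):
    """mus: (c, d) class means; ns: (c,) class sizes; alpha: (c, c) weights alpha_{r,l}."""
    c, d = mus.shape
    S_B_hat = np.zeros((d, d))
    for r in range(c):
        for l in range(c):
            diff = (mus[r] - mus[l])[:, None]                      # mu^(r) - mu^(l), d x 1
            S_B_hat += alpha[r, l] * ns[r] * ns[l] * diff @ diff.T # Eq. (10)
    return S_B_hat
```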

3.1 Existing Manual Methods

In the following, we review some of the existing weights for W-FDA.

Approximate Pairwise Accuracy Criterion: The Approximate Pairwise Accuracy Criterion (APAC) method [13] has the weight function:

$$\begin{aligned} \alpha _{r\ell } := \frac{1}{2\, d_{r\ell }^2} \text {erf}\Big (\frac{d_{r\ell }}{2\sqrt{2}}\Big ), \end{aligned}$$
(11)

where \(\text {erf}(x)\) is the error function:

$$\begin{aligned}{}[-1, 1] \ni \text {erf}(x) := \frac{2}{\sqrt{\pi }} \int _0^x e^{-t^2} dt. \end{aligned}$$
(12)

This method approximates the Bayes error for class pairs.

Powered Distance Weighting: The powered distance (POW) method [14] uses the following weight function:

$$\begin{aligned} \alpha _{r\ell } := \frac{1}{d_{r\ell }^m}, \end{aligned}$$
(13)

where \(m > 0\) is an integer. As \(\alpha _{r\ell }\) is supposed to drop faster than the increase of \(d_{r\ell }\), we should have \(m \ge 3\) (we use \(m=3\) in the experiments).

Confused Distance Maximization: The Confused Distance Maximization (CDM) [15] method uses the confusion probability among the classes as the weight function:

$$\begin{aligned} \alpha _{r\ell } := \left\{ \begin{array}{ll} \frac{n_{\ell | r}}{n_r} &{} \quad \text {if } r \ne \ell , \\ 0 &{} \quad \text {if } r = \ell , \end{array} \right. \end{aligned}$$
(14)

where \(n_{\ell |r}\) is the number of points of class r classified as class \(\ell \) by a classifier such as quadratic discriminant analysis [15, 16]. One problem of the CDM method is that if the classes are classified perfectly, all weights become zero. Another flaw of this method is its dependence on the performance of a classifier.

k-Nearest Neighbors Weighting: The k-Nearest Neighbor (kNN) method [17] tries to put every class away from its k-nearest neighbor classes by defining the weight function as

$$\begin{aligned} \alpha _{r\ell } := \left\{ \begin{array}{ll} 1 &{} \quad \text {if } \varvec{\mu }^{(\ell )} \in k\text {NN}(\varvec{\mu }^{(r)}), \\ 0 &{} \quad \text {otherwise}. \end{array} \right. \end{aligned}$$
(15)

The kNN and CDM methods are sparse, making use of the betting-on-sparsity principle [1, 18]. However, these methods have some shortcomings. For example, if two classes are far from one another in the input space, they are not considered in kNN or CDM, but in the obtained subspace they may fall close to each other, which is not desirable. Another flaw of the kNN method is that it assigns a weight of 1 to all kNN pairs, although some of those pairs might be considerably closer than others.
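The manual weights of Eqs. (11), (13), and (15) are simple functions of the pairwise distances \(d_{r\ell }\) between class means; a hedged sketch is given below (the CDM weights of Eq. (14) are omitted since they additionally require a trained classifier's confusion counts).

```python
import numpy as np
from scipy.special import erf

def pairwise_mean_distances(mus):
    """mus: (c, d) class means; returns the c x c matrix of d_{r,l}."""
    diffs = mus[:, None, :] - mus[None, :, :]
    return np.linalg.norm(diffs, axis=2)

def apac_weights(D):                                 # Eq. (11)
    with np.errstate(divide="ignore", invalid="ignore"):
        A = erf(D / (2 * np.sqrt(2))) / (2 * D**2)
    np.fill_diagonal(A, 0.0)                         # self-pairs are irrelevant
    return A

def pow_weights(D, m=3):                             # Eq. (13)
    with np.errstate(divide="ignore"):
        A = 1.0 / D**m
    np.fill_diagonal(A, 0.0)
    return A

def knn_weights(D, k=3):                             # Eq. (15)
    c = D.shape[0]
    A = np.zeros((c, c))
    for r in range(c):
        order = np.argsort(D[r])
        neighbors = [l for l in order if l != r][:k] # the k nearest other classes
        A[r, neighbors] = 1.0
    return A
```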

3.2 Cosine Weighted Fisher Discriminant Analysis

The literature has shown that cosine similarity works very well with FDA, especially for face recognition [19, 20]. Moreover, according to opposition-based learning [21], capturing similarity and dissimilarity of data points can improve the performance of learning. A promising operator for capturing similarity and dissimilarity (opposition) is the cosine. Hence, we propose CW-FDA, a manually weighted method, whose weights are defined by the cosine as

$$\begin{aligned} \alpha _{r\ell } := 0.5 \times \big [1 + \cos \big (\measuredangle (\varvec{\mu }^{(r)}, \varvec{\mu }^{(\ell )})\big )\big ] = 0.5 \times \big [1 + \frac{\varvec{\mu }^{(r)\top } \varvec{\mu }^{(\ell )}}{||\varvec{\mu }^{(r)}||_2 ||\varvec{\mu }^{(\ell )}||_2}\big ], \end{aligned}$$
(16)

to have \(\alpha _{r\ell } \in [0, 1]\). Hence, the r-th weight matrix is \(\varvec{A}_r := \mathbf{diag} (\alpha _{r\ell }, \forall \ell )\), which is used in Eq. (10). Note that since the inter-class scatter term for \(r=\ell \) is zero, \(\alpha _{rr}\) does not matter, and we can set \(\alpha _{rr}=0\).
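A short sketch of computing the CW-FDA weights of Eq. (16) from the class means (our own illustrative code, whose output can be plugged into Eq. (10)):

```python
import numpy as np

def cosine_weights(mus):
    """mus: (c, d) class means; returns the c x c matrix of alpha_{r,l} in [0, 1]."""
    norms = np.linalg.norm(mus, axis=1, keepdims=True)
    cos = (mus @ mus.T) / (norms * norms.T)   # cosine between mu^(r) and mu^(l)
    A = 0.5 * (1.0 + cos)                     # Eq. (16)
    np.fill_diagonal(A, 0.0)                  # alpha_{r,r} is irrelevant, set to zero
    return A
```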

3.3 Automatically Weighted Fisher Discriminant Analysis

In AW-FDA, there are \(c+1\) matrix optimization variables, namely \(\varvec{U}\) and \(\varvec{A}_r \in \mathbb {R}^{c \times c}, \forall r \in \{1, \dots , c\}\), because the optimal weights are found at the same time as the Fisher criterion is maximized. Moreover, to use the betting-on-sparsity principle [1, 18], we can make the weight matrices sparse, so we impose an "\(\ell _0\)" norm constraint on the weights. The optimization problem is as follows:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{U},\, \varvec{A}_r}{\text {maximize}}&\mathbf{tr} (\varvec{U}^\top \widehat{\varvec{S}}_B\, \varvec{U}), \\&\text {subject to}&\varvec{U}^\top \varvec{S}_W\, \varvec{U} = \varvec{I}, \\&&||\varvec{A}_r||_0 \le k, \quad \forall r \in \{1, \dots , c\}. \end{aligned} \end{aligned}$$
(17)

We use alternating optimization [22] to solve this problem:

$$\begin{aligned}&\varvec{U}^{(\tau +1)} := \arg \max _{\varvec{U}} \Big (\mathbf{tr} (\varvec{U}^\top \widehat{\varvec{S}}_B^{(\tau )}\, \varvec{U}) \,\big |\, \varvec{U}^\top \varvec{S}_W\, \varvec{U} = \varvec{I} \Big ), \end{aligned}$$
(18)
$$\begin{aligned}&\varvec{A}_r^{(\tau +1)} := \arg \min _{\varvec{A}_r} \Big (\!-\mathbf{tr} (\varvec{U}^{(\tau +1)\top } \widehat{\varvec{S}}_B\, \varvec{U}^{(\tau +1)}) \,\big |\, ||\varvec{A}_r||_0 \le k \Big ), \forall r, \end{aligned}$$
(19)

where \(\tau \) denotes the iteration.

Since we use an iterative solution for the optimization, it is better to normalize the weights in the weighted inter-class scatter; otherwise, the weights gradually explode to maximize the objective function. We use the \(\ell _2\) (or Frobenius) norm for normalization for ease of taking derivatives. Hence, for AW-FDA, we slightly modify the weighted inter-class scatter as

$$\begin{aligned} \widehat{\varvec{S}}_B&:= \sum _{r=1}^c \sum _{\ell =1}^c \frac{\alpha _{r\ell }}{\sum _{\ell '=1}^c \alpha _{r \ell '}^2}\, n_r\, n_\ell (\varvec{\mu }^{(r)} - \varvec{\mu }^{(\ell )}) (\varvec{\mu }^{(r)} - \varvec{\mu }^{(\ell )})^\top \end{aligned}$$
(20)
$$\begin{aligned}&= \sum _{r=1}^c n_r\, \varvec{M}_r\, \breve{\varvec{A}}_r\, \varvec{N}\, \varvec{M}_r^\top , \end{aligned}$$
(21)

where \(\breve{\varvec{A}}_r := \varvec{A}_r / ||\varvec{A}_r||_F^2\) because \(\varvec{A}_r\) is diagonal, and \(||.||_F\) is the Frobenius norm.

As discussed before, the solution to Eq. (18) is the generalized eigenvalue problem \((\widehat{\varvec{S}}_B^{(\tau )}, \varvec{S}_W)\). We use a step of gradient descent [23] to solve Eq. (19), followed by satisfying the "\(\ell _0\)" norm constraint [22]. The gradient is calculated as follows. Let \(\mathbb {R} \ni f(\varvec{U}, \varvec{A}_r) := -\mathbf{tr} (\varvec{U}^{\top } \widehat{\varvec{S}}_B\, \varvec{U})\). Using the chain rule, we have:

$$\begin{aligned} \mathbb {R}^{c \times c} \ni \frac{\partial f}{\partial \varvec{A}_r} = \mathbf{vec} ^{-1}_{c \times c} \Big [(\frac{\partial \breve{\varvec{A}}_r}{\partial \varvec{A}_r})^\top (\frac{\partial \widehat{\varvec{S}}_B}{\partial \breve{\varvec{A}}_r})^\top \mathbf{vec} (\frac{\partial f}{\partial \widehat{\varvec{S}}_B}) \Big ], \end{aligned}$$
(22)

where we use the Magnus-Neudecker convention in which matrices are vectorized, \(\mathbf{vec} (.)\) vectorizes the matrix, and \(\mathbf{vec} ^{-1}_{c \times c}\) is de-vectorization to \(c \times c\) matrix. We have \(\mathbb {R}^{d \times d} \ni \partial f / \partial \widehat{\varvec{S}}_B = -\varvec{U}\varvec{U}^\top \) whose vectorization has dimensionality \(d^2\). For the second derivative, we have:

$$\begin{aligned} \mathbb {R}^{d^2 \times c^2} \ni \frac{\partial \widehat{\varvec{S}}_B}{\partial \breve{\varvec{A}}_r} = n_r\, (\varvec{M}_r\, \varvec{N}^\top ) \otimes \varvec{M}_r, \end{aligned}$$
(23)

where \(\otimes \) denotes the Kronecker product. The third derivative is:

$$\begin{aligned} \mathbb {R}^{c^2 \times c^2} \ni \frac{\partial \breve{\varvec{A}}_r}{\partial \varvec{A}_r} = \frac{1}{||\varvec{A}_r||_F^2} \Big (\frac{-2}{||\varvec{A}_r||_F^2}(\varvec{A}_r \otimes \varvec{A}_r) + \varvec{I}_{c^2}\Big ). \end{aligned}$$
(24)

The learning rate of gradient descent is calculated using line search [23].

After the gradient descent step, to satisfy the constraint \(||\varvec{A}_r||_0 \le k\), the solution is projected onto the set defined by this constraint. Because \(-f\) should be maximized, this projection sets the \((c-k)\) smallest diagonal entries of \(\varvec{A}_r\) to zero [22]. In case \(k=c\), the projection leaves the solution unchanged, and all the weights are kept.

After solving the optimization, the p leading columns of \(\varvec{U}\) are the AW-FDA projection directions that span the subspace.
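The following condensed sketch illustrates one possible implementation of the alternating optimization of Eqs. (18) and (19), using the gradient of Eqs. (22)-(24) and the \(\ell _0\) projection. It is our own simplification rather than the authors' code: it uses a fixed step size instead of line search, keeps the weights diagonal and nonnegative, adds a small ridge to \(\varvec{S}_W\), and assumes d and c are small enough for the Kronecker products to be affordable.

```python
import numpy as np
from scipy.linalg import eigh

def aw_fda(mus, ns, S_W, p, k, n_iter=20, step=1e-3, eps=1e-12):
    """mus: (c, d) class means; ns: (c,) class sizes; S_W: (d, d) intra-class scatter."""
    c, d = mus.shape
    N = np.diag(ns.astype(float))
    M = [(mus[r][None, :] - mus).T for r in range(c)]      # M_r, d x c
    A = [np.eye(c) for _ in range(c)]                      # A_r initialized to FDA's unit weights

    def build_S_B_hat(A):
        A_breve = [A[r] / (np.linalg.norm(A[r], 'fro')**2 + eps) for r in range(c)]
        return sum(ns[r] * M[r] @ A_breve[r] @ N @ M[r].T for r in range(c))  # Eq. (21)

    for _ in range(n_iter):
        # Eq. (18): generalized eigenvalue problem (S_B_hat, S_W)
        _, U = eigh(build_S_B_hat(A), S_W + 1e-8 * np.eye(d))
        U = U[:, ::-1][:, :p]
        df_dSB = -(U @ U.T)                                # d x d, i.e. df / dS_B_hat
        for r in range(c):
            s = np.linalg.norm(A[r], 'fro')**2 + eps
            dSB_dAbr = ns[r] * np.kron(M[r] @ N.T, M[r])   # Eq. (23), d^2 x c^2
            dAbr_dAr = (np.eye(c * c) - (2.0 / s) * np.kron(A[r], A[r])) / s  # Eq. (24)
            g = dAbr_dAr.T @ dSB_dAbr.T @ df_dSB.reshape(-1, order='F')       # Eq. (22)
            G = g.reshape(c, c, order='F')
            A[r] = A[r] - step * np.diag(np.diag(G))       # gradient step on the diagonal of A_r
            diag = np.maximum(np.diag(A[r]), 0.0)          # keep weights nonnegative (our choice)
            diag[np.argsort(diag)[:c - k]] = 0.0           # l0 projection: zero the c-k smallest
            A[r] = np.diag(diag)
    _, U = eigh(build_S_B_hat(A), S_W + 1e-8 * np.eye(d))
    return U[:, ::-1][:, :p], A
```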

4 Weighted Kernel Fisher Discriminant Analysis

We define the optimization for Weighted Kernel FDA (W-KFDA) as:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{Y}}{\text {maximize}}&\mathbf{tr} (\varvec{Y}^\top \widehat{\varvec{\varDelta }}_B\, \varvec{Y}), \\&\text {subject to}&\varvec{Y}^\top \varvec{\varDelta }_W\, \varvec{Y} = \varvec{I}, \end{aligned} \end{aligned}$$
(25)

where the weighted inter-class scatter in the feature space, \(\widehat{\varvec{\varDelta }}_B \in \mathbb {R}^{n \times n}\), is defined as:

$$\begin{aligned} \widehat{\varvec{\varDelta }}_B := \sum _{r=1}^c \sum _{\ell =1}^c \alpha _{r\ell }\, n_r\, n_\ell (\varvec{\xi }^{(r)} - \varvec{\xi }^{(\ell )}) (\varvec{\xi }^{(r)} - \varvec{\xi }^{(\ell )})^\top = \sum _{r=1}^c n_r\, \varvec{\varXi }_r\, \varvec{A}_r\, \varvec{N}\, \varvec{\varXi }_r^\top . \end{aligned}$$
(26)

The solution to Eq. (25) is the generalized eigenvalue problem \((\widehat{\varvec{\varDelta }}_B, \varvec{\varDelta }_W)\) and the p leading columns of \(\varvec{Y}\) span the subspace.

4.1 Manually Weighted Methods in the Feature Space

All the existing weighting methods in the literature for W-FDA can be used as weights in W-KFDA to have W-FDA in the feature space. Therefore, Eqs. (11), (13), (14), and (15) can be used as weights in Eq. (26) to have W-KFDA with APAC, POW, CDM, and kNN weights, respectively. To the best of our knowledge, W-KFDA is novel and has not appeared in the literature. Note that there is a weighted KFDA in the literature [24], but that is for data integration, which is for another purpose and has an entirely different approach.

The CW-FDA can be used in the feature space to have CW-KFDA. For this, we propose two versions of CW-KFDA: (I) In the first version, we use Eq. (16), i.e., \(\varvec{A}_r := \mathbf{diag} (\alpha _{r\ell }, \forall \ell )\), in Eq. (26). (II) In the second version, we notice that the cosine is based on the inner product, so the normalized kernel matrix between the class means can be used instead, capturing the similarity/dissimilarity in the feature space rather than in the input space. Let \(\mathbb {R}^{d \times c} \ni \varvec{M} := [\varvec{\mu }^{(1)}, \dots , \varvec{\mu }^{(c)}]\). Let \(\widehat{\varvec{K}}_{i,j} := \varvec{K}_{i,j} / \sqrt{\varvec{K}_{i,i} \varvec{K}_{j,j}}\) be the normalized kernel matrix [25], where \(\varvec{K}_{i,j}\) denotes the (i, j)-th element of the kernel matrix \(\mathbb {R}^{c \times c} \ni \varvec{K}(\varvec{M}, \varvec{M}) = \varvec{\varPhi }(\varvec{M})^\top \varvec{\varPhi }(\varvec{M})\). The weights are \([0,1] \ni \alpha _{r\ell } := \widehat{\varvec{K}}_{r,\ell }\), i.e., \(\varvec{A}_r := \mathbf{diag} (\widehat{\varvec{K}}_{r,\ell }, \forall \ell )\). We set \(\alpha _{rr}=0\).
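A sketch of this second version (our own illustrative code, assuming an RBF kernel) which computes the weights from the normalized kernel between the class means:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1e-3):
    """A: d x n1, B: d x n2 (columns are samples); assumed kernel choice."""
    sq = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * sq)

def cw_kfda_weights(M, gamma=1e-3):
    """M: d x c matrix whose columns are the class means."""
    K = rbf_kernel(M, M, gamma)                  # c x c kernel K(M, M)
    diag = np.sqrt(np.diag(K))
    A = K / np.outer(diag, diag)                 # normalized kernel, in (0, 1] for the RBF kernel
    np.fill_diagonal(A, 0.0)                     # alpha_{r,r} set to zero
    return A
```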

4.2 Automatically Weighted Kernel Fisher Discriminant Analysis

Similar to before, the optimization in AW-KFDA is:

$$\begin{aligned} \begin{aligned}&\underset{\varvec{Y},\, \varvec{A}_r}{\text {maximize}}&\mathbf{tr} (\varvec{Y}^\top \widehat{\varvec{\varDelta }}_B\, \varvec{Y}), \\&\text {subject to}&\varvec{Y}^\top \varvec{\varDelta }_W\, \varvec{Y} = \varvec{I}, \\&&||\varvec{A}_r||_0 \le k, \quad \forall r \in \{1, \dots , c\}, \end{aligned} \end{aligned}$$
(27)

where \(\widehat{\varvec{\varDelta }}_B := \sum _{r=1}^c n_r\, \varvec{\varXi }_r\, \breve{\varvec{A}}_r\, \varvec{N}\, \varvec{\varXi }_r^\top \). This optimization is solved similarly to Eq. (17), where we have \(\varvec{Y}\in \mathbb {R}^{n \times d}\) rather than \(\varvec{U} \in \mathbb {R}^{d \times d}\). Here, the solution to the counterpart of Eq. (18) is the generalized eigenvalue problem \((\widehat{\varvec{\varDelta }}_B^{(\tau )}, \varvec{\varDelta }_W)\). Let \(f(\varvec{Y}, \varvec{A}_r) := -\mathbf{tr} (\varvec{Y}^{\top } \widehat{\varvec{\varDelta }}_B\, \varvec{Y})\). Eq. (19) is solved similarly, but we use \(\mathbb {R}^{n \times n} \ni \partial f / \partial \widehat{\varvec{\varDelta }}_B = -\varvec{Y}\varvec{Y}^\top \) and

$$\begin{aligned}&\mathbb {R}^{c \times c} \ni \frac{\partial f}{\partial \varvec{A}_r} = \mathbf{vec} ^{-1}_{c \times c} \Big [(\frac{\partial \breve{\varvec{A}}_r}{\partial \varvec{A}_r})^\top (\frac{\partial \widehat{\varvec{\varDelta }}_B}{\partial \breve{\varvec{A}}_r})^\top \mathbf{vec} (\frac{\partial f}{\partial \widehat{\varvec{\varDelta }}_B}) \Big ], \end{aligned}$$
(28)
$$\begin{aligned}&\mathbb {R}^{n^2 \times c^2} \ni \frac{\partial \widehat{\varvec{\varDelta }}_B}{\partial \breve{\varvec{A}}_r} = n_r\, (\varvec{\varXi }_r\, \varvec{N}^\top ) \otimes \varvec{\varXi }_r. \end{aligned}$$
(29)

After solving the optimization, the p leading columns of \(\varvec{Y}\) span the AW-KFDA subspace. Recall that \(\varvec{\varPhi }(\varvec{U}) = \varvec{\varPhi }(\varvec{X})\, \varvec{Y}\). The projection of some data \(\varvec{X}_t \in \mathbb {R}^{d \times n_t}\) is \(\mathbb {R}^{p \times n_t} \ni \widetilde{\varvec{X}}_t = \varvec{\varPhi }(\varvec{U})^\top \varvec{\varPhi }(\varvec{X}_t) = \varvec{Y}^\top \varvec{\varPhi }(\varvec{X})^\top \varvec{\varPhi }(\varvec{X}_t) = \varvec{Y}^\top \varvec{K}(\varvec{X}, \varvec{X}_t)\).
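For completeness, a small sketch (our own code) of solving Eq. (25) and projecting out-of-sample data, given a precomputed \(\varvec{\varDelta }_W\), a weighted \(\widehat{\varvec{\varDelta }}_B\) (built analogously to Eq. (10) with \(\varvec{\varXi }_r\) in place of \(\varvec{M}_r\)), and a kernel function; the ridge term is our numerical safeguard.

```python
import numpy as np
from scipy.linalg import eigh

def w_kfda_fit(Delta_B_hat, Delta_W, p):
    """Solve the generalized eigenvalue problem (Delta_B_hat, Delta_W) of Eq. (25)."""
    n = Delta_W.shape[0]
    _, Y = eigh(Delta_B_hat, Delta_W + 1e-8 * np.eye(n))
    return Y[:, ::-1][:, :p]                   # n x p coefficient matrix Y

def w_kfda_transform(Y, kernel, X, X_t):
    """Project test data X_t (d x n_t) using the training data X (d x n)."""
    return Y.T @ kernel(X, X_t)                # p x n_t, i.e. Y^T K(X, X_t)
```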

5 Experiments

5.1 Dataset

For the experiments, we used the public ORL face recognition dataset [26] because face recognition has been a challenging task and FDA has frequently been used for it (e.g., see [19, 20, 27]). This dataset includes 40 classes, each having ten different poses of the facial picture of a subject, resulting in 400 total images. For computational reasons, we selected the first 20 classes and resampled the images to \(44 \times 36\) pixels. Note that massive datasets are not feasible for FDA/KFDA because of the generalized eigenvalue problem involved. Some samples of this dataset are shown in Fig. 1. The data were split into training and test sets with \(66\%/33\%\) portions and were standardized to have zero mean and unit variance.

Fig. 1. Sample images of the classes in the ORL face dataset. Numbers are the class indices.

5.2 Evaluation of the Embedding Subspaces

For the evaluation of the embedded subspaces, we used the 1-Nearest Neighbor (1NN) classifier because it evaluates a subspace by the closeness of the projected data samples. The training and out-of-sample (test) classification accuracies are reported in Table 1. In the input space, kNN weighting with \(k=1,3\) has the best results, but for \(k=c-1\), AW-FDA outperforms it in the generalization (test) results. The performances of CW-FDA and AW-FDA with \(k=1,3\) are promising, although not the best. For instance, AW-FDA with \(k=1\) outperforms weighted FDA with the APAC, POW, and CDM methods on the training embedding, while having the same performance as kNN weighting. In most cases, AW-FDA with all k values performs better than FDA, which shows the effectiveness of the obtained weights compared to the equal weights of FDA. Also, that the sparse weights of AW-FDA outperform FDA (with dense weights equal to one) validates the betting-on-sparsity principle.

Table 1. Accuracy of 1NN classification for the different obtained subspaces. In each cell of the input or feature spaces, the first and second rows correspond to the classification accuracy of the training and test data, respectively.
Fig. 2. The leading Fisherfaces in (a) FDA, (b) APAC, (c) POW, (d) CDM, (e) kNN, (f) CW-FDA, and (g) AW-FDA.

Fig. 3. The weights in (a) APAC, (b) POW, (c) CDM, (d) kNN with \(k=1\), (e) kNN with \(k=3\), (f) kNN with \(k=c-1\), (g) CW-FDA, (h) AW-FDA with \(k=1\), (i) AW-FDA with \(k=3\), (j) AW-FDA with \(k=c-1\), (k) CW-KFDA, (l) AW-KFDA with \(k=1\), (m) AW-KFDA with \(k=3\), (n) AW-KFDA with \(k=c-1\). The rows and columns index the classes.

In the feature space, where we used the radial basis function kernel, AW-KFDA has the best performance, with perfectly accurate recognition. Both versions of CW-KFDA outperform regular KFDA as well as KFDA with CDM and kNN (with \(k=1, c-1\)) weighting. They also have better generalization than APAC and kNN with all k values. Overall, the results show the effectiveness of the proposed weights in both the input and feature spaces. Moreover, the existing weighting methods, which were designed for the input space, have outstanding performance when used in our proposed weighted KFDA (in the feature space). This shows the validity of the proposed weighted KFDA even for the existing weighting methods.

5.3 Comparison of Fisherfaces

Figure 2 depicts the four leading eigenvectors obtained from the different methods, including FDA itself. These ghost faces, or so-called Fisherfaces [27], capture the critical discriminating facial features for separating the classes in the subspace. Note that Fisherfaces cannot be shown for kernel FDA because its projection directions are n-dimensional. CDM has captured only some pixels as features because all its weights have become zero due to its flaw explained earlier (see Sect. 3.1 and Fig. 3). In most of the methods, including CW-FDA, the Fisherfaces capture information about facial organs such as hair, forehead, eyes, chin, and mouth.

The features of AW-FDA are more akin to the Haar wavelet features, which are useful for facial feature detection [28].

5.4 Comparison of the Weights

We show the obtained weights of the different methods in Fig. 3. The weights of APAC and POW are too small, while the range of weights in the other methods is more reasonable. The weights of CDM have all become zero because the samples were classified perfectly (recall the flaw of CDM). The weights of the kNN method are only zero and one, which is a flaw of this method because, amongst the neighbors, some classes are closer than others. This issue does not exist in AW-FDA with any of the k values. Moreover, although not all the obtained weights are visually interpretable, some non-zero weights in AW-FDA or AW-KFDA, e.g. with \(k=1\), show the meaningfulness of the obtained weights (cf. Fig. 1). For example, the non-zero pairs (2, 20), (4, 14), (13, 6), (19, 20), and (17, 6) in AW-FDA and the pairs (2, 20), (4, 14), (19, 20), and (17, 14) in AW-KFDA make sense visually because the subjects wear glasses, so their classes are close to one another.

6 Conclusion

In this paper, we discussed that FDA and KFDA have a fundamental flaw, namely that they treat all pairs of classes in the same way, although some classes are closer to each other and should be treated with more care for better discrimination. We proposed CW-FDA, with cosine weights, and AW-FDA, in which the weights are found automatically. We also proposed a weighted KFDA to weight FDA in the feature space. We proposed AW-KFDA and two versions of CW-KFDA, as well as utilizing the existing weighting methods for weighted KFDA. The experiments, in which we evaluated the embedding subspaces, the Fisherfaces, and the weights, showed the effectiveness of the proposed methods. The proposed weighted FDA methods outperformed regular FDA and many of the existing weighting methods for FDA. For example, AW-FDA with \(k=1\) outperformed weighted FDA with the APAC, POW, and CDM methods on the training embedding. In the feature space, AW-KFDA obtained perfect discrimination.