1 Introduction

One-class classification (OCC) is a fundamental problem in the field of machine learning (Kemmler et al. 2013; Xu et al. 2013; Xiao et al. 2015, 2014; Burnaev and Smolyakov 2017; Huang et al. 2017; De Santana et al. 2019). In an OCC problem, only positive samples are available when training the classifier, while negative samples are absent or very few in number. This situation is common in practice, for example in fault diagnosis (Fernández-Francos et al. 2013; Bing et al. 2018) and target recognition (Fei et al. 2018; Yu and Xiao 2018). In fault diagnosis, normal samples are easily obtained, while fault samples are often rare or unavailable. In target recognition, the non-target samples are so numerous and varied that the negative samples available for learning are not representative. Since there are no suitable negative samples for learning, the classifier can only learn from the positive samples. Such problems are collectively referred to as OCC problems and are also known as concept learning, single-class classification, data description, and anomaly detection.

In OCC problems, feature dimensionality reduction (feature selection or extraction) (Liu et al. 2017; Jia et al. 2018; Kim 2018) has a significant impact on the performance of a one-class classifier. For high-dimensional data in particular, the distribution of samples in the feature space is sparse, and some conventional one-class classifier design methods cannot be applied directly. For example, one-class classifiers based on probability density estimation require the samples to be densely distributed in the feature space. Compared with the supervised classification problem, the OCC problem lacks the information of negative samples. Compared with the unsupervised classification problem, the OCC problem has the additional information that all training samples belong to the same class. The problem of one-class feature reduction therefore has unique characteristics and is difficult to solve.

Some feature dimensionality reduction methods for OCC problems have been proposed. For instance, Lorena et al. (2014) considered that discriminative features should be structured and put forward several indices (information score, Pearson correlation, intra-class distance, and interquartile range) for filtering one-class features. Jeong et al. (2012) assumed that features which have less influence on the volume of the one-class data are more discriminative and proposed a recursive feature elimination method based on the support vector data description for OCC problems. Tax (2003) proposed that the compression directions of the target class data with smaller variance often contain more discriminative feature information for OCC problems. Lian (2012) pointed out that selecting principal components carrying less information is often more useful for distinguishing target samples from abnormal samples when principal component analysis (PCA) is used to extract one-class features. However, compared with traditional multi-class classification problems, there are few studies on feature dimensionality reduction for one-class classification, and these studies mainly focus on feature selection rather than feature extraction. At present, feature extraction for one-class classification mainly relies on traditional unsupervised methods such as PCA. However, such practices do not match one-class classification well because they ignore the fact that one-class classification lies between supervised and unsupervised learning.

In this paper, the information that the “training data belong to the same class” is integrated into unsupervised learning to obtain a feature extraction strategy suited to OCC problems. The proposed strategy divides the original feature space into two parts, the main space and the complement space, which are used to extract the feature information of the target class and the abnormal class, respectively. According to this strategy, a specific implementation named complete principal component analysis (CPCA) is put forward. CPCA extracts the information of the target class using PCA in the main space, extracts the information of the abnormal class using the first-order norm in the complement space, and finally combines the two as the extracted one-class feature vector of the original feature space.

The rest of this paper is organized as follows. Section 2 introduces the proposed one-class feature extraction method. Section 3 describes the experimental setup. Section 4 presents the applications and results. Section 5 provides the discussion. Section 6 concludes the paper and outlines future work.

2 The proposed feature extraction method for one-class classification

How to extract discriminative features using only the target class samples is the primary problem that needs to be solved. The following one-class feature extraction strategy is designed. First, the target class samples are used to find a projection space, called the main space, that describes the distribution characteristics of the target class. The projection score in the main space measures the similarity between an unknown sample and the target class samples. Then, the remaining space is used to construct the complement space. The intensity of the projection of a sample onto the complement space indicates its difference from the target class. The main space and the complement space are orthogonal. If the main space cannot fully reflect the characteristics of the target class, the remaining feature information of the target class leaks into the complement space, which weakens the recognition of abnormal samples.

Figure 1 demonstrates the basic idea of the proposed one-class feature extraction strategy. The abnormal samples marked by the green circle, green square, and green pentagon can be identified in the main space and the complement space, respectively. Consequently, when the ordinary PCA method is used to extract one-class features, some abnormal samples fail to be detected. It is therefore necessary to retain the features extracted in the complement space, because there is no prior knowledge of whether the discriminative information of abnormal samples lies in the main space or in the complement space.

Fig. 1 Red triangle indicates the target class sample, and green circle, green pentagon and green square indicate the abnormal class samples

The following implementation of the above OCC feature extraction strategy is proposed. First, PCA (Wold et al. 1987; Koltchinskii and Lounici 2016) is performed on the target class samples to find a suitable principal component projection space, which serves as the main space and describes the principal distribution characteristics of the target class. Then, according to the obtained main space, the corresponding complement space is constructed. The projection score in the main space and the first-order norm of the projection in the complement space are combined as the final extracted one-class features. The specific implementation steps are as follows.

Suppose the training target class data are \({\mathbf{X}}_{n \times p}\), where n is the number of samples and p is the feature dimension. First, PCA is performed on \({\mathbf{X}}_{n \times p}\) and \(v\%\) of the variation information of the training target class data is retained to construct the principal component projection space \({\mathbf{W}}_{p \times l}\) as the main space \(S\), where \(l\) is the number of retained principal components. The projection score \({\mathbf{t}}\) of an unknown sample \({\mathbf{x}}_{1 \times p}\) in the main space \(S\) is

$$ {\mathbf{t}} = {\mathbf{xW}} $$
(1)

Then, the complement space \(S^{ \bot }\) is

$$ S^{ \bot } = {\mathbf{I}} - \left( {{\mathbf{W}}^{{\text{T}}} } \right)^{ + } {\mathbf{W}}^{{\text{T}}} $$
(2)

where \({\mathbf{I}}\) is the identity matrix and the superscripts T and + denote the transpose and the pseudo-inverse, respectively. The projection vector \({\mathbf{x}}^{ \bot }\) of a sample in the complement space \(S^{ \bot }\) is

$$ {\mathbf{x}}^{ \bot } = {\mathbf{x}}S^{ \bot } $$
(3)

Finally, \({\mathbf{t}}\) and the first-order norm \(\left\| {{\mathbf{x}}^{ \bot } } \right\|_{1}\) are combined as the final extracted one-class feature vector.
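A minimal sketch of these steps (Eqs. 1–3) is given below, assuming NumPy/scikit-learn rather than the authors' MATLAB code. The function names are illustrative, and mean-centering of the samples is an assumption, since Eq. (1) writes the score simply as t = xW.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_cpca(X_train, v=0.95):
    """Fit the main space on the target class training data X_train (n x p),
    retaining a fraction v of the variance (the v% of the text)."""
    pca = PCA(n_components=v, svd_solver="full")   # picks l so that >= v of the variance is kept
    pca.fit(X_train)
    W = pca.components_.T                          # p x l loading matrix spanning the main space
    S_perp = np.eye(W.shape[0]) - W @ W.T          # complement-space projector, Eq. (2) with orthonormal W
    return pca, W, S_perp

def cpca_features(X, pca, W, S_perp):
    """Map samples X (m x p) to the one-class features [t, ||x_perp||_1]."""
    Xc = np.atleast_2d(X) - pca.mean_              # mean-centering (assumed)
    T = Xc @ W                                     # projection scores in the main space, Eq. (1)
    X_perp = Xc @ S_perp                           # projection onto the complement space, Eq. (3)
    l1 = np.sum(np.abs(X_perp), axis=1, keepdims=True)  # first-order (L1) norm
    return np.hstack([T, l1])
```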

3 Experiments

3.1 Evaluation of the proposed method by using one-class classifiers

A one-class classifier measures the similarity of a sample to the target class. In this paper, the Mahalanobis distance (Galeano et al. 2013; Washizawa and Hotta 2017) and one-class support vector machine (OC-SVM) (Schölkopf et al. 2000; Guerbai et al. 2018) were used to verify the effect of the proposed one-class feature extraction method.

The Mahalanobis distance is a generalized distance that fully considers the covariance between variables. Compared with the common Euclidean distance, the Mahalanobis distance eliminates the influence of the measurement scales of the variables and of the correlation between them. The Mahalanobis distance is calculated as

$$ \left\| {{\mathbf{x}} - {\mathbf{m}}} \right\|_{M} = \sqrt {\left( {{\mathbf{x}} - {\mathbf{m}}} \right){\mathbf{\Sigma }}^{ - 1} \left( {{\mathbf{x}} - {\mathbf{m}}} \right)^{{\text{T}}} } $$
(4)

where \({\mathbf{m}}\) is the center vector of the sample set and \({\mathbf{\Sigma }}\) is the estimated covariance matrix. Using the Mahalanobis distance implicitly assumes that the target class follows the multivariate Gaussian distribution:

$$ P\left( {\mathbf{x}} \right) = \frac{1}{{\left( {2\pi } \right)^{{p/2}} \left| {\mathbf{\Sigma }} \right|^{{1/2}} }}e^{{ - \frac{1}{2}\left( {{\mathbf{x}} - {\mathbf{m}}} \right){\mathbf{\Sigma }}^{ - 1} \left( {{\mathbf{x}} - {\mathbf{m}}} \right)^{{\text{T}}} }} $$
(5)

where \(p\) is the feature dimension. Therefore, the Mahalanobis distance can serve as a simple one-class classifier based on probability density estimation. However, the Mahalanobis distance performs poorly on high-dimensional data because \(\Sigma^{ - 1}\) may not exist when the number of samples is smaller than the feature dimension. Therefore, when the Mahalanobis distance is used to classify high-dimensional data, feature dimensionality reduction is a necessary preprocessing step.
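A minimal sketch of a Mahalanobis-distance scorer built on the extracted features, again assuming NumPy. The pseudo-inverse of the covariance is an added safeguard, not part of Eq. (4), and the acceptance threshold on the distance is left to the user.

```python
import numpy as np

def fit_mahalanobis(F_target):
    """Estimate the center m and (inverse) covariance of the target class features, Eq. (4)."""
    m = F_target.mean(axis=0)
    Sigma = np.cov(F_target, rowvar=False)
    Sigma_inv = np.linalg.pinv(Sigma)   # pseudo-inverse as a guard if Sigma is (near-)singular
    return m, Sigma_inv

def mahalanobis_distance(F, m, Sigma_inv):
    """Mahalanobis distance of each row of F to the target class center."""
    D = np.atleast_2d(F) - m
    return np.sqrt(np.einsum("ij,jk,ik->i", D, Sigma_inv, D))
```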

The OC-SVM is a derivative of the traditional two-class support vector machine for the OCC field. The origin is treated as the representative of the abnormal class. The basic idea of the OC-SVM is to place most of the target class samples on one side of a hyperplane while maximizing the distance between the hyperplane and the origin. The hyperplane is obtained by solving the following optimization problem:

$$ \begin{gathered} \min \;\frac{1}{2}\left\| {\mathbf{w}} \right\|^{2} + c\sum\limits_{i = 1}^{n} {\zeta_{i} } - \rho \hfill \\ {\text{s}}{\text{.t}}{\text{.}}\quad {\mathbf{w}}^{{\text{T}}} \phi \left( {{\mathbf{x}}_{i} } \right) \ge \rho - \zeta_{i} ,\quad \zeta_{i} \ge 0 \hfill \\ \end{gathered} $$
(6)

where \(\phi \left( \cdot \right)\) is a feature mapping function that maps an input vector into a higher-dimensional feature space, \(w\) is the normal vector of the decision hyperplane, \(\rho\) is an intercept term, and \(\zeta_{i}\) are nonnegative slack variables that penalize outliers.

By using Lagrangian techniques and a kernel function for the dot-product calculations, the output function becomes:

$$ f\left( {\mathbf{x}} \right) = \sum\limits_{i = 1}^{n} {a_{i} K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}} \right)} - \rho $$
(7)

where \(a_{i}\) is a Lagrange multiplier and \(K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}} \right) = \phi \left( {{\mathbf{x}}_{i} } \right)^{{\text{T}}} \phi \left( {\mathbf{x}} \right)\) is a kernel function. A radial basis function (RBF) kernel is employed in our experiments:

$$ K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}} \right) = e^{{ - \gamma \left\| {{\mathbf{x}}_{i} - {\mathbf{x}}} \right\|^{2} }} ,\quad \gamma > 0 $$
(8)

The OC-SVM has been widely used in OCC problems because of its advantages of requiring no prior knowledge and minimizing the structural risk.
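A minimal sketch of training an OC-SVM with the RBF kernel of Eq. (8), assuming scikit-learn's OneClassSVM, which uses the ν-parameterization of Schölkopf et al. (2000) rather than the penalty constant c of Eq. (6); the nu and gamma values here are placeholders, not the settings used in the experiments.

```python
from sklearn.svm import OneClassSVM

def fit_ocsvm(F_train, nu=0.1, gamma="scale"):
    """Train on the extracted one-class features of the target class only."""
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma)
    clf.fit(F_train)
    return clf

# clf.decision_function(F_test) returns the signed value of f(x) in Eq. (7);
# larger values indicate greater similarity to the target class.
```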

3.2 Experimental datasets

The effectiveness of the proposed CPCA was verified on 30 datasets, whose characteristics are shown in Table 1. Due to the limited space of the paper, the tables in the main text only contain the information and experimental results of some of the datasets; the others are given in the Appendix. According to the feature dimension, the datasets were divided into two types: low-dimensional data (from the UCI machine learning repository) and high-dimensional data. There were 12 low-dimensional datasets; the rest were high-dimensional datasets, including handwritten digit data, face recognition data, gene expression data, etc. These datasets are usually used for multi-class experiments and need to be reorganized for one-class problems: the first class in each dataset was used as the target class, and the remaining classes were used as the abnormal class. Two-thirds of the target class samples were randomly selected as the training set, and the remaining samples formed the test set. It should be noted that the milk powder data are infrared spectral data, specifically used for rapid adulteration detection of milk powder with one-class classification (Huang et al. 2022). Infrared spectroscopy is a fast and nondestructive testing technology, for which an appropriate recognition model must be constructed. For milk powder adulteration detection, the adulteration information is uncertain, which makes it difficult for traditional discriminant models to handle. The one-class classification method suits this problem because it only needs to learn from target samples (pure milk powder) to identify abnormal samples (different forms of adulterated milk powder). All the code ran in MATLAB 2012.
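A minimal sketch of this reorganization (the experiments themselves ran in MATLAB); the function and parameter names are illustrative, and the handling of the random generator is an assumption.

```python
import numpy as np

def one_class_split(X, y, target_label, train_frac=2/3, rng=None):
    """Use two-thirds of the target class (chosen at random) for training; the
    remaining target samples plus all other classes form the test set."""
    rng = np.random.default_rng() if rng is None else rng
    target_idx = np.flatnonzero(y == target_label)
    rng.shuffle(target_idx)
    n_train = int(round(train_frac * target_idx.size))
    train_idx = target_idx[:n_train]
    test_idx = np.setdiff1d(np.arange(y.size), train_idx)
    y_test = (y[test_idx] == target_label).astype(int)   # 1 = target class, 0 = abnormal class
    return X[train_idx], X[test_idx], y_test
```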

Table 1 Summary of the experimental datasets

4 Applications and results

First, a low-dimensional dataset (iris) and a high-dimensional dataset (warpPIE10p) were used to visually demonstrate the performance of CPCA and PCA in a single random experiment. Figure 2 shows that the target class and the abnormal class in the iris dataset were well separated along the second principal component (PC2). Figure 4 shows that the different classes in the warpPIE10p dataset overlap severely in the PC1 × PC2 space. To visually display the discriminative effect of the features extracted from the main space and the complement space, the Mahalanobis distance of the PCA scores in the main space was used as the abscissa and the feature value in the complement space as the ordinate, showing the distribution of the different classes in the constructed low-dimensional space. The left subgraphs in Figs. 3 and 5 correspond to the iris dataset and the warpPIE10p dataset, respectively.

Fig. 2 Score plots of the iris dataset in the PC1 × PC2 space

Fig. 3 The feature extraction effects of CPCA and PCA on the iris dataset. The subgraphs from top to bottom correspond to the cases where the number of principal components is 1, 2, 3, and 4, respectively

Fig. 4 Score plots of the warpPIE10p dataset in the PC1 × PC2 space

Fig. 5 The feature extraction effects of CPCA and PCA on the warpPIE10p dataset. The subgraphs from top to bottom correspond to the cases where the number of principal components is 1, 2, 3, and 4, respectively

Figure 3 demonstrates the influence of the number of extracted principal components. CPCA performed well when extracting 1, 2, 3, or 4 principal components, whereas the traditional PCA method only performed well when extracting 2, 3, or 4 components. Figure 5 shows that the target class and the abnormal class were well distinguished in the complement space when the number of extracted components was greater than 1, but they were completely inseparable in the main space. Therefore, the features extracted from the complement space reflected the characteristics of the abnormal class well and compensated for the deficiency of the features extracted in the main space.

Next, the effect of the proposed method was analyzed statistically. The dimension of the main space was determined by retaining \(v\%\) of the variation information of the target class samples. The raw features and the one-class features extracted by PCA were used for comparison. Each dataset was randomly tested 50 times, and the average area under the curve (AUC) was used as the evaluation indicator.
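A minimal sketch of this evaluation protocol, reusing the illustrative helpers sketched earlier (one_class_split, fit_cpca, cpca_features, fit_mahalanobis, mahalanobis_distance); seeding each run with its index is an assumption made only so the sketch is reproducible.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def average_auc(X, y, target_label, v=0.95, n_runs=50):
    """Average AUC over n_runs random splits, scoring CPCA features with the
    Mahalanobis distance (larger score = more target-like)."""
    aucs = []
    for run in range(n_runs):
        X_tr, X_te, y_te = one_class_split(X, y, target_label,
                                           rng=np.random.default_rng(run))
        pca, W, S_perp = fit_cpca(X_tr, v=v)
        F_tr = cpca_features(X_tr, pca, W, S_perp)
        F_te = cpca_features(X_te, pca, W, S_perp)
        m, Sigma_inv = fit_mahalanobis(F_tr)
        scores = -mahalanobis_distance(F_te, m, Sigma_inv)
        aucs.append(roc_auc_score(y_te, scores))
    return float(np.mean(aucs))
```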

Table 2 shows the dimension of the main space when 70%, 85%, 95%, and 99% of the variation information of the target class samples is retained. One can see that the main information of the target class data often lies in a low-dimensional space, especially for high-dimensional small-sample data. For example, for face recognition datasets such as warpAR10P, warpPIE10p, and pixraw10P, the dimension of the main space was less than 2% of the original feature dimension. For high-dimensional data, the dimension of the main space was much smaller than the dimension of the original feature space. Therefore, the discriminative information useful for detecting the abnormal class appeared more easily in the complement space.
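As an illustrative companion to Table 2 (an assumption about how such counts can be computed, not the authors' script), the number of components needed to reach each variance threshold can be read off the cumulative explained variance ratio:

```python
import numpy as np

def n_components_for(X_target, thresholds=(0.70, 0.85, 0.95, 0.99)):
    """Number of principal components needed to retain each variance fraction."""
    X_centered = X_target - X_target.mean(axis=0)
    svals = np.linalg.svd(X_centered, compute_uv=False)
    ratios = np.cumsum(svals**2) / np.sum(svals**2)   # cumulative explained variance
    return {t: int(np.searchsorted(ratios, t) + 1) for t in thresholds}
```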

Table 2 Average no. of extracted principal components

The classification results using the Mahalanobis distance are shown in Table 3. For some low-dimensional datasets, such as iris, wine, and bupa, CPCA is generally equivalent to using the raw features or PCA. For thyroid and segment, CPCA is better than PCA, whereas using the raw features fails. For the spectheart data, CPCA is comparable to PCA. For high-dimensional data such as Alphadigits, warpAR10P, and milk powder, using the raw features fails and CPCA is significantly better than PCA. Therefore, CPCA is significantly better than using the raw features or PCA when the Mahalanobis distance is used as the one-class classifier.

Table 3 AUC of classification with Mahalanobis distance (/ indicates modeling failure)

The one-class classifier based on the Mahalanobis distance relies on probability density estimation, which is sensitive to the data dimension. To further verify the universality of the proposed method with a support vector machine-based one-class classifier, the OC-SVM was applied to these datasets, and the results are shown in Table 4. The OC-SVM works for all the datasets because it introduces kernel functions to avoid the ill-posed problems caused by high dimensionality. For the low-dimensional datasets, CPCA, PCA, and the raw features all worked well. For most of the high-dimensional datasets, CPCA and the raw features were close to each other and better than PCA. For some datasets, such as warpAR10P, pixraw10P, and milk powder, CPCA was significantly better than using the raw features. In general, CPCA improved the classification performance of the OC-SVM on high-dimensional data to a certain extent.

Table 4 AUC of classification with OC-SVM

5 Discussion

CPCA compresses all the feature information of the complement space into one dimension to achieve feature reduction. For the OCC problem, feature reduction has two effects. First, feature reduction discards some information, which can reduce the recognition of some abnormal class samples. Second, feature reduction can avoid over-fitting caused by high-dimensional features, which improves the recognition of some abnormal class samples. CPCA extracts, in the complement space, feature information that differs from the target class. If too little feature information of the target class is extracted in the main space, the remaining target class information leaks into the complement space, thereby reducing the recognition of the abnormal class. This is the principle by which the main space and the complement space are constructed. To avoid over-fitting, the number of extracted principal components of CPCA needs to be optimized. According to the experimental results in Tables 3 and 4, CPCA is robust to the number of extracted principal components. When the extracted principal components retain 85–95% of the variation information, CPCA can meet the needs of practical problems. In summary, CPCA introduces a new feature compression method in the complement space and thereby extends ordinary PCA to one-class feature extraction. CPCA generalizes well when extracting features for OCC problems and can be used in practical problems, especially for high-dimensional small-sample data.

6 Conclusion and future work

This paper proposed a new feature extraction strategy for OCC problems. The strategy divides the original feature space into two parts: the main space and the complement space. The main space is used to learn the feature information of the target class, and the complement space is used to learn the feature information of the abnormal class. By extracting the features of the main space and the complement space separately, the one-class features of the whole space are obtained. According to the proposed one-class feature extraction strategy, a specific implementation, CPCA, was also provided. In CPCA, the features of the main space are compressed by PCA, and the features of the complement space are compressed by the first-order norm. The two types of features are then combined as the extracted one-class features of the original space. Several different types of real datasets were used to verify the effect of the proposed method. The experimental results show that CPCA generalizes well to one-class data, especially high-dimensional small-sample data. In summary, the proposed method is a feature extraction method that is truly oriented to OCC problems. It also provides a good reference for how to adapt other unsupervised feature extraction methods to one-class classification problems. In future research, we will study feature space decomposition in nonlinear spaces, such as kernel mapping spaces. Feature compression based on manifold learning and deep learning in the main space and the complement space will also be explored to further improve the feature extraction effect for OCC problems.