1 Introduction

Feature selection is widely used in branches of computational science, such as computer science, pattern recognition, and machine learning, to effectively reduce high-dimensional data. Feature selection can improve both computational efficiency and the accuracy of prediction algorithms [1]. The three major traditional feature selection approaches in machine learning development are the filter, wrapper, and embedded methods [2]. The filter method selects features based on certain evaluation criteria, such as a high joint probability or correlation between input and output variables [3,4,5,6,7,8]. The wrapper method conducts feature selection through the machine learning algorithm itself: it evaluates candidate combinations of features using a search strategy and scores each candidate by training a machine learning model on the training dataset [9]. Lastly, the embedded method is similar to the wrapper method but derives the features during model training via a regularization technique that adds a penalty to the parameters of a model to reduce its degrees of freedom [10].

Feature selection differs from feature extraction in that the former creates a subset of the initial inputs, while the latter produces new composite features. Feature extraction is at times undesirable because its transformation of the initial features removes their identifiability. A powerful and commonly used method for feature extraction is principal component analysis (PCA). By contrast, few published works exist on the implementation of PCA for feature selection. One previous study investigated the contribution of features toward the principal components (PCs) with the largest eigenvalues [11]. This contribution value is the relative measure of a feature’s representation quality on the selected PC over the total representation quality of all features. The features were sorted in descending order of contribution, and their ranks were considered an indicator of relative importance [11]. Another study, in 2018, applied a similar method for feature selection but selected only the two features with the highest correlation coefficients on each selected PC [12]. In the same year, a group of researchers from China applied a new method that implements PCA for feature selection on high-dimensional data before a clustering model is applied [13]. The method first reduces the dimensionality of the data using a robust PCA technique that is less sensitive to outliers than traditional PCA. Robust PCA aims to extract the most important features while minimizing the influence of outliers; it decomposes the data matrix into a low-rank component, which captures the underlying structure of the data, and a sparse component, which accounts for the outliers. In this way, the method automatically identifies and selects the most important features while minimizing the impact of noisy or irrelevant features [13]. Once the dimensionality of the data is reduced, a local adaptive learning algorithm is applied to learn the clustering structure of the reduced-dimensional data. The algorithm adaptively adjusts the bandwidth of the kernel function used for density estimation, allowing it to capture the local structure of the data. All three studies involve unsupervised feature selection for pattern recognition and image processing applications.

Our study aimed to adapt these approaches to the supervised feature selection problem. We introduce a new PCA-based approach, called Targeted PCA, to analyze a multivariate dataset that includes the dependent variable. In the reviewed studies [11,12,13], PC selection was based on explained variance, and the rank of contribution along the selected PCs governed feature selection in unsupervised learning applications. Guided by this, we explored the implementation of the same method while also considering the dependent variable within the dataset for supervised learning applications. The method can be summarized in three parts. Firstly, it selects PCs whose explained variance exceeds a certain threshold. Secondly, it selects one or more reference PC(s) in which the dependent variable has the top contribution rank. Lastly, it finalizes the feature selection by retaining the independent variables whose contribution values on the reference PC(s) exceed a certain threshold. The approach was assessed in two ways: first, the selected features were compared with the features selected by the LASSO regression model; second, they were used as input in linear (multiple linear regression) and nonlinear (artificial neural network) regression models. We used three datasets covering socioeconomic, environmental, and computer image processing fields of application.

2 Materials and methods

The full methodology of Targeted PCA is presented in Fig. 1, and detailed descriptions are given in the following subsections. The final subsection (Sect. 2.4) describes the three datasets used to evaluate the methodology.

Fig. 1 The overall framework of the Targeted PCA and the LASSO regression model

2.1 Method development

This section presents the proposed modification to PCA for feature selection. The process begins with a standard calculation of eigenvalues \(\lambda \) and eigenvectors v based on the covariance matrix W as represented by Eq. 1.

$$\begin{aligned} Wv = \lambda v \end{aligned}$$
(1)

The eigenvalues \(\lambda \) and eigenvectors v can be solved for by rearranging Eq. 1 into Eq. 2, where I is the identity matrix, and then applying the singular value decomposition (SVD) technique.

$$\begin{aligned} (W - \lambda I)v = 0 \end{aligned}$$
(2)
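
As a brief illustration (not part of the original method description), the following R snippet numerically checks Eq. 1 on a small hypothetical data matrix: the eigenvectors returned by eigen() for the covariance matrix W satisfy \(Wv = \lambda v\).

```r
# Numerical check of Eq. 1 on a small, randomly generated data matrix (illustrative only)
set.seed(1)
X   <- matrix(rnorm(200), nrow = 50, ncol = 4)  # 50 samples, 4 variables
W   <- cov(X)                                   # covariance matrix
eig <- eigen(W)                                 # eigenvalues (lambda) and eigenvectors (v)
v1  <- eig$vectors[, 1]
all.equal(as.numeric(W %*% v1), eig$values[1] * v1)  # TRUE: W v = lambda v
```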

The following steps are used to perform the feature selection:

1. Identify and select the PCs (i.e., the eigenvectors) with an individual explained variance above 1% and a cumulative explained variance of at least 80%. According to Hair (2009), PCA has no universal minimum cumulative explained variance [14]; instead, the explained variance depends on the analysis context and desired level. We therefore chose 80% as the threshold for cumulative explained variance, a percentage commonly used in previous studies [11, 12]. Meanwhile, we chose 1% as the threshold for individual explained variance based on a previous study by Mubarak et al. (2018), which also cautioned against setting the threshold too low, because doing so may include many PCs and increase the complexity of the feature selection process.

2. Identify the quality of representation, \(R_{j,p}^2\), of the feature components on each PC [15]. Since all features are represented as geometrical coordinates, this is determined from the cosine rule: for any given variable vector, \(R_{j,p}^2\) equals the squared cosine of the angle \(\theta \) between the vector of a selected principal component and the given variable vector. A higher \(R_{j,p}^2\) value indicates a smaller \(\theta \) and hence a good representation of the variable on the principal component. This is illustrated in Fig. 2. The main reasons for choosing the squared cosine in principal component analysis (PCA) to measure the quality of representation of data features are as follows:

   (a) It measures the angle between variable and PC vectors, rather than their magnitude, making it robust to scale differences. This is important in PCA because the magnitude of the vectors in the PCA space may differ from those in the original space due to the transformation process.

   (b) It is robust to outliers. According to Abdi and Williams (2010), the squared cosine similarity metric is robust to outliers because it penalizes large angles between vectors more heavily than small angles [16]. For example, consider two vectors, v1 and v2, with an angle of 60 degrees between them: the cosine similarity is 0.5, whereas the squared cosine similarity is 0.25, smaller than the original cosine value. If the angle between the vectors is smaller, e.g., 30 degrees, the cosine similarity is approximately 0.87 and the squared cosine similarity approximately 0.75, closer to 1.

   (c) Other measures could be used to quantify the similarity between these two vectors, such as the Pearson correlation, Euclidean distance, Manhattan distance, and Mahalanobis distance. However, the squared cosine has a straightforward interpretation and is easy to compute. According to Kassambara (2017), the cosine rule is the most common practice for calculating the quality of representation of variables on each PC.

3. Identify the contribution value of each feature to each selected PC from the relative quality of representation (Eq. 3, where \(j = 1,2,\ldots ,\) the total number of PCs and \(p = 1,2,\ldots ,m\), with m the total number of features in the dataset).

    $$\begin{aligned} C_{j,p} = \frac{ R_{j,p}^2}{\sum _{p=1}^{m}R_{j,p}^2} \times 100 \% \end{aligned}$$
    (3)
4. Select as the reference PC(s) the selected PC(s) in which the dependent variable has the largest \(C_{j,p}\) among all variables.

5. Calculate a cutoff point for the relative contribution value as shown in Eq. 4 [following 15]. The cutoff is the expected (average) contribution: if all variables contributed equally, the expected value would be 100% divided by the total number of variables, m.

    $$\begin{aligned} C_\textrm{off} = \frac{1}{m} \times 100 \% \end{aligned}$$
    (4)
6. Select the features contributing to the reference PC with a contribution value \(C_{j,p}\) higher than the cutoff value \(C_\textrm{off}\).

7. Rank the importance of each feature toward the reference PC by comparing the \(C_{j,p}\) values obtained in Step 3. Ranking the features by contribution value in descending order expedites the threshold filtering described in Step 6. The higher the \(C_{j,p}\), the stronger the association between the feature and the PC and, thus, the dependent variable. A limitation arises when more than one reference PC is selected: the features cannot be ranked by \(C_{j,p}\) across all reference PCs because different PCs carry different information. Thus, the features are ranked separately for each reference PC. A minimal code sketch of Steps 1 to 7 is given below.
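
The following R sketch illustrates Steps 1 to 7 under simplifying assumptions. The function name targeted_pca_select, its inputs (a standardized feature matrix X and a standardized dependent variable y), and the rule used for picking reference PCs (the kept PCs in which the dependent variable has the top contribution) are illustrative choices consistent with the description above, not the exact implementation used in this study.

```r
# Illustrative sketch of Steps 1-7 (Targeted PCA); X is a numeric matrix of
# standardized independent variables, y the standardized dependent variable.
targeted_pca_select <- function(X, y, var_min = 0.01, cum_min = 0.80) {
  Z    <- cbind(X, y = y)                      # the dependent variable enters the PCA
  eig  <- eigen(cov(Z))                        # eigenvalues/eigenvectors (Eqs. 1-2)
  expl <- eig$values / sum(eig$values)         # individual explained variance

  # Step 1: smallest set of PCs reaching 80% cumulative variance, each above 1%
  n_cum   <- which(cumsum(expl) >= cum_min)[1]
  keep_pc <- seq_len(n_cum)[expl[seq_len(n_cum)] > var_min]

  # Step 2: quality of representation (squared cosine) of each variable on each PC
  coord <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))  # variable coordinates
  cos2  <- coord^2 / rowSums(coord^2)

  # Step 3: contribution C_{j,p} in percent (Eq. 3)
  contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100
  rownames(contrib) <- colnames(Z)

  # Step 4: reference PC(s) = kept PCs in which the dependent variable ranks first
  dep    <- ncol(Z)
  ref_pc <- keep_pc[apply(contrib[, keep_pc, drop = FALSE], 2, which.max) == dep]

  # Step 5: cutoff = expected average contribution (Eq. 4)
  cutoff <- 100 / nrow(contrib)

  # Steps 6-7: per reference PC, keep features above the cutoff, ranked descending
  sel <- lapply(ref_pc, function(j) {
    cj <- contrib[-dep, j]
    sort(cj[cj > cutoff], decreasing = TRUE)
  })
  names(sel) <- paste0("PC", ref_pc)
  sel
}
```

For example, sel <- targeted_pca_select(scale(X), as.numeric(scale(y))) would return, for each reference PC, the retained features with their contribution values in descending order.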

Fig. 2 Implementation of cosine rule in the calculation of quality of representation

2.2 Rationale

Targeted PCA is a new feature selection method and an evolution of traditional PCA. This section explains the justification for the proposed method based on the original principles of PCA and demonstrates the advantages of Targeted PCA in the feature selection procedure. An established principle of PCA is that the eigenvector corresponding to a larger eigenvalue captures more representative sample information [17]. For this reason, it is reasonable to investigate the eigenvectors corresponding to larger eigenvalues when one is interested in explaining the variance of the data along each feature’s axis. Analyzing multiple eigenvectors allows for a more robust evaluation, considering multiple angles and directions of dependency. Our proposed method analyzes more than one PC, but only those with a significant \(C_{j,p}\) of the dependent variable. We leverage this property to improve the filtering of the features without losing information on the correlation between dependent and independent variables. Next, we assess the \(C_{j,p}\) of each feature component in the PC, which explains its importance and relation to the reference PC [11, 12]. The computation of this value accounts for the importance and relation of all features toward the same reference PC; by extension, their importance and relation toward each other are also accounted for.

2.3 Evaluation

Validating a new method against established methods allows an objective evaluation of its performance. Furthermore, it allows an analysis of the strengths and weaknesses of the methods compared, facilitating the identification of gaps and opportunities for future research. The assessment is conducted by (1) comparing the features selected by the Targeted PCA with those selected by an established feature selection method, the Least Absolute Shrinkage and Selection Operator (LASSO) regression, and (2) measuring the ability of the selected features to fit linear and nonlinear models.

2.3.1 Analysis of Selected Features

The Least Absolute Shrinkage and Selection Operator (LASSO) regression

LASSO was introduced by Tibshirani [18]. The regression method minimizes the least-squares error with an additional penalty (regularization) term on the regression coefficients based on the L1 norm. The LASSO estimate is defined by the solution to the L1 optimization problem of minimizing \(\left( \tfrac{\left\| Y - X\beta \right\| _{2}^{2}}{n} \right) \) subject to \(\sum _{j=1}^{k}\left| \beta _{j} \right| \le t\), where t is the upper bound for the sum of the absolute coefficients in Eq. 5. Here, X and Y are the input and output data, respectively, \(\beta \) is the vector of coefficients for all features, k is the number of features, and n is the total number of samples.

$$\begin{aligned} \hat{\beta }\left( \lambda \right) = \underset{\beta }{\mathrm{{argmin}}}\left( \frac{\left\| Y - X\beta \right\| _{2}^{2}}{n} + \lambda \left\| \beta \right\| _{1}\right) \end{aligned}$$
(5)

where \(\left\| Y - X\beta \right\| _{2}^{2} = \sum _{i=1}^{n}\left( Y_{i} - \left( X\beta \right) _{i} \right) ^{2}\), \(\left\| \beta \right\| _{1} = \sum _{j=1}^{k}\left| \beta _{j} \right| \), and \(\lambda > 0\) is the parameter that controls the strength of the penalty: the larger the value of \(\lambda \), the greater the amount of shrinkage.

The relationship between \(\lambda \) and the upper bound t is inverse. As t tends toward infinity, the problem becomes ordinary least squares and \(\lambda \) becomes 0. Conversely, as t tends toward 0, all coefficients shrink toward 0 while \(\lambda \) goes to infinity. This gives LASSO its variable selection capability: as the optimization algorithm minimizes the error, some coefficients are shrunk to zero, i.e., \(\hat{\beta }_{j}(\lambda ) = 0\) for some values of j (depending on the value of the parameter \(\lambda \)). In this way, the features with coefficients equal to zero are excluded from the model.

The cross-validation (CV) for standard LASSO uses the cv.glmnet implementation in R, which provides efficient minimization through path-wise coordinate descent for the coefficient updates and a method called ‘covariance update,’ a dynamic programming approach that increases the efficiency of the solver [18].

The parameters used are as follows (a minimal sketch of the corresponding call is given after the list):

  • nfolds = 10 is the number of folds used for the CV.

  • keep = TRUE ensures that the information about the fold assignment is stored. Since the folds are generated randomly, this adjustment is necessary.

  • family = ‘gaussian’ is the option for ordinary regression with a continuous label.

  • type.measure = ‘mse’ (mean squared error) is the evaluation metric. It measures the deviation of the fitted values from the response.

  • alpha = 1 is the elastic-net mixing hyperparameter, which could be used if a mixture of L1 and L2 penalties is wanted: alpha = 0 corresponds to ridge regression (L2) and alpha = 1 to pure LASSO regression (L1). Increasing alpha toward 1 may reduce the number of selected features.

  • The fitted LASSO model is then used to compute the best coefficient value for each independent variable.
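
The following R sketch illustrates the cross-validated LASSO fit with the parameters listed above. The object names (x_train, y_train) are illustrative, and extracting the coefficients at lambda.min is one reasonable choice rather than necessarily the exact setting used in this study.

```r
library(glmnet)

# Cross-validated LASSO with the settings listed above;
# x_train is a numeric feature matrix, y_train a numeric response vector.
cvfit <- cv.glmnet(x = x_train, y = y_train,
                   family = "gaussian",   # ordinary regression for a continuous label
                   alpha = 1,             # pure LASSO (L1) penalty
                   nfolds = 10,           # 10-fold cross-validation
                   type.measure = "mse",  # evaluate folds by mean squared error
                   keep = TRUE)           # store the random fold assignments

# Coefficients at the lambda minimizing the CV error;
# features with nonzero coefficients are the selected features.
b        <- as.matrix(coef(cvfit, s = "lambda.min"))
selected <- setdiff(rownames(b)[b != 0], "(Intercept)")
```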

Comparability of Targeted PCA and LASSO regression

A LASSO regression is conducted to validate the Targeted PCA because the two are similar in function and approach. Firstly, both PCA and LASSO regression can effectively reduce the dimensionality of the feature space: they aim to filter and select a subset of features that captures the most relevant information for predicting the target variable while discarding less important or redundant features. Secondly, both techniques implicitly rank the features based on their importance. In PCA, the principal components are ranked in descending order of the explained variance they capture, and features with high loadings on the top-ranked components are considered more influential. In LASSO regression, the features with nonzero coefficients are deemed important for prediction, while those with zero coefficients are considered less relevant. Lastly, PCA and LASSO regression both operate on linear combinations of features: PCA creates linear combinations (principal components) of the original features, while LASSO regression finds the optimal linear combination of the features as predictors.

Comparison of selected features

The selected features by Targeted PCA and LASSO regression are compared in terms of (1) the number of selected features and (2) the similarities and differences of selected and non-selected features.

To measure the similarity of the selected features, we used the Hamming distance technique [19]. This technique is often used to quantify the extent to which two bit strings of the same length differ. In a traditional application of the Hamming distance, the only concern is whether the corresponding bits in the two strings agree. Over the past few years, however, many researchers have started implementing this method in data preprocessing for machine learning [20, 21]. The Hamming distance is used to find the pairwise similarity in the input space and thereby avoid excessive redundancy among the input samples.

In this case study, we encode the feature selections as bit strings whose length equals the total number of features in the dataset (a sketch of this computation is given after the list):

1. We create two bit strings representing the features selected by the Targeted PCA and LASSO regression methods, respectively.

2. We compare the corresponding bits of the two strings.

3. We calculate the similarity percentage by dividing the number of matching bits by the length of the bit strings.
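
A minimal R sketch of this comparison is shown below; sel_pca, sel_lasso, and all_feats are hypothetical vectors of selected feature names and the full feature list, not objects from the study’s own code.

```r
# Hamming-style similarity between two feature selections (Steps 1-3 above)
hamming_similarity <- function(sel_pca, sel_lasso, all_feats) {
  bits_pca   <- as.integer(all_feats %in% sel_pca)    # bit string for Targeted PCA
  bits_lasso <- as.integer(all_feats %in% sel_lasso)  # bit string for LASSO regression
  agree      <- sum(bits_pca == bits_lasso)           # positions where the two strings match
  c(matching_bits  = agree,
    similarity_pct = 100 * agree / length(all_feats))
}

# Example: 2 of 4 positions agree, i.e., 50% similarity
hamming_similarity(c("A", "B"), c("B", "C"), c("A", "B", "C", "D"))
```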

2.3.2 Linear and nonlinear modeling

Next, two learning algorithms were fitted using selected features from the Targeted PCA and LASSO regression, and their modeling performance was comparatively assessed to establish any advantage of the Targeted PCA. Both learning algorithms are briefly described in the following subsections.

Multiple Linear Regression

In multiple linear regression analysis, an attempt is made to account for the variation of the dependent variable with respect to the independent variables simultaneously [22]. The regression model is formulated as in Eq. 6.

$$\begin{aligned} y = X_{1}\beta _{1} + X_{2}\beta _{2} + \ldots + X_{k}\beta _{k} + \epsilon \end{aligned}$$
(6)

where y denotes the dependent (or study) variable that is linearly related to k independent (or explanatory) variables \(X_{1},X_{2},\ldots ,X_{k}\) through the parameters \(\beta _{1},\beta _{2},\ldots ,\beta _{k}\). The parameters \(\beta _{1},\beta _{2},\ldots ,\beta _{k}\) are the regression coefficients associated with \(X_{1},X_{2},\ldots ,X_{k}\), respectively, and \(\epsilon \) is the random error component reflecting the difference between the observed and fitted linear relationship. Such differences can have various causes, e.g., the joint effect of variables not included in the model or random factors that cannot be accounted for. In a regression equation, the random error \(\epsilon \) refers to the residual variation that the model does not explain. An analogous error term is also present in the LASSO regression to account for the bias of the fitted model.

The metrics \(R^2\) and adjusted \(R^2\) were used in this study. \(R^2\) measures the proportion of variance in the dependent variable explained by the regression model. It ranges from 0 to 1, with higher values indicating a better fit. \(R^2\) is calculated as one minus the ratio of the sum of squared errors (SSE) of the regression model to the total sum of squares (SST) of the data:

$$\begin{aligned} R^2 = 1 - \frac{\mathrm{{SSE}}}{\mathrm{{SST}}} \end{aligned}$$
(7)

where SSE is the sum of squared errors between the predicted and observed values of the dependent variable, and SST is the total sum of squares of the dependent variable.

Adjusted \(R^2\), on the other hand, takes into account the number of predictor variables in the model. It adjusts \(R^2\) by penalizing the addition of extra predictor variables that do not significantly improve the fit of the model. Adjusted \(R^2\) is calculated as:

$$\begin{aligned} \textrm{Adjusted}\, R^2 = 1 - \left[ \frac{(1-R^2)\times (n-1)}{n-p-1}\right] \end{aligned}$$
(8)

where n is the sample size, and p is the number of predictor variables in the model.
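
As a brief illustration of Eqs. 6 to 8 in R, the following sketch fits a multiple linear regression on a data frame of selected features and reproduces \(R^2\) and adjusted \(R^2\) by hand; train_df and its column y are hypothetical names, not the study’s actual objects.

```r
# Fit Eq. 6 on the selected features; 'train_df' holds the selected
# features plus the dependent variable in column 'y' (illustrative names).
fit <- lm(y ~ ., data = train_df)
s   <- summary(fit)

# R^2 (Eq. 7) computed manually and compared with lm's value
sse <- sum(residuals(fit)^2)                      # sum of squared errors
sst <- sum((train_df$y - mean(train_df$y))^2)     # total sum of squares
r2  <- 1 - sse / sst                              # equals s$r.squared

# Adjusted R^2 (Eq. 8)
n      <- nrow(train_df)
p      <- length(coef(fit)) - 1                   # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # equals s$adj.r.squared
```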

Artificial Neural Network

An artificial neural network (ANN) is composed of elementary computational units called neurons, combined according to different architectures with multiple layers [23]. ANNs are also known as generalized nonlinear models. Typically, the performance of an ANN changes with the hyperparameter tuning and with manipulations of the training dataset [23]. Thus, to analyze the impact of the selected features, the experimental settings were held constant so that no additional bias affecting model performance was introduced.

Table 1 presents the experimental setting of the ANN model used to evaluate the regression of the dependent output data on the selected features. Two stopping rules were used during ANN model training. The first rule applied an early stopping algorithm that monitored the mean squared error (MSE) loss over the epochs and stopped training when the difference in loss between the previous and current epochs was below a threshold of 0.001 and the loss increased again in the following epoch. The second rule prevents the number of epochs from growing indefinitely for a non-converging model by stopping training once the iteration count reaches 1,000,000. A schematic check implementing these two rules is sketched below.
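
The following R helper is a schematic rendering of the two stopping rules, written here for illustration; the function name, its inputs, and the epoch indexing are assumptions rather than the exact training code.

```r
# Decide whether training should stop, given the history of MSE losses per epoch.
# Rule 1: the previous improvement was below 'tol' AND the loss rose again this epoch.
# Rule 2: the number of epochs reached 'max_iter'.
should_stop <- function(losses, tol = 1e-3, max_iter = 1e6) {
  e <- length(losses)                                  # current epoch index
  if (e >= max_iter) return(TRUE)                      # Rule 2: iteration cap
  if (e < 3) return(FALSE)                             # need at least three epochs to judge
  small_gain <- (losses[e - 2] - losses[e - 1]) < tol  # improvement below the 0.001 threshold
  rebound    <- losses[e] > losses[e - 1]              # loss increased again in the following epoch
  small_gain && rebound
}
```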

Table 1 Standard setup for the experimental setting of ANN model for train and test selected dataset

The ANN can capture the nonlinearity in the dataset because of the activation function used in the algorithm. We used 70%, 10%, and 20% of each dataset for the training, validation, and testing stages, respectively. Thus, the ANN can maintain the generalization of patterns in the dataset while also identifying the nonlinear connection between the input and output variables [24].

2.4 Dataset

The study used two public-domain datasets from the UCI Machine Learning Repository collection. A third dataset was from the Federal Department of Town and Country Planning Peninsular Malaysia and the Ministry of Health Malaysia.

The first dataset combines socioeconomic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR [25]. The second dataset is a medical dataset retrieved from 53,500 computed tomography (CT) images of 74 patients (43 male, 31 female). This dataset is used to predict the relative location of a CT slice along the axial axis of the human body [26]. The data are represented as histograms of CT values describing bone structures (value0 to value239) and air inclusions (value240 to value383). The third dataset is an environmental dataset consisting of 215 land-use types used to predict the number of leptospirosis cases occurring in Negeri Sembilan, Malaysia. Land-use types include agriculture, jungle, sport and recreational areas, public infrastructure, and residential areas. Each sample in this dataset represents the percentage coverage of the land-use types in a 5 \(\times \) 5 km area inside the Negeri Sembilan state. Table 2 presents a summary of these three datasets.

Table 2 Description of the case dataset

3 Results and discussion

The main results are summarized and presented in this section, while the full list of ranked selected and rejected features from all datasets is reported in the Appendix.

3.1 Selected features by Targeted PCA

3.1.1 Dataset 1: Communities and crime dataset

Table 3 shows that the first 14 principal components (PCs) of Dataset 1 each have an explained variance exceeding 1%. The cumulative proportion of the first 14 PCs is 85%.

Table 3 The eigenvalue of 100 PC from Dataset 1

Among the first 14 PCs, PC1 and PC5 were chosen as the reference PCs because in both the \(C_{j,p}\) of the dependent variable (‘ViolentCrimesPerPop’) (3.9184% and 6.5057%, respectively) was higher than that of any other variable (Table 8). Based on these two PCs, features with \(C_{j,p}\) above the 1% cutoff were selected as features associated with the dependent variable. Accordingly, the 50 highest-ranked features from PC1 and the 27 highest-ranked features from PC5 were selected. A selected feature may be determined by PC1, PC5, or both. For example, the variables ‘PopDens,’ ‘PctVacMore6Mos,’ ‘PctSpeakEnglOnly,’ ‘PctSameState85,’ and ‘PctSameHouse85’ were chosen because they contributed a \(C_{j,p}\) of more than 1% to both PCs. Overall, 70 variables out of 100 were selected for a high association with the dependent variable (‘ViolentCrimesPerPop’).

Several key factors exist for predicting the total number of crimes [27]. They are divided into four major groups: socioeconomic disparities, education and literacy levels, family structure, and drug abuse or addiction. Based on our analysis, socioeconomic variables such as poverty rates (NumUnderPov), income inequality (medIncome), and unemployment rates (PctUnemployed) were found to correlate with higher crime rates. As discussed in the referenced study [27], individuals in economically disadvantaged areas often face limited opportunities and reduced access to education, healthcare, and employment, leading to frustration, desperation, and higher rates of criminal behavior. In addition, areas with low educational attainment and high illiteracy rates often experience higher crime rates; inadequate access to quality education can limit individuals’ prospects, leading to a higher probability of involvement in violent crime. Accordingly, Targeted PCA ranked ‘PctNotHSGrad,’ which measures the percentage of people aged 25 and over who are not high school graduates, highly in predicting crime. Finally, the stability of the family structure and positive social support networks significantly impact crime rates. Broken families (TotalPctDiv), a lack of parental involvement (PctWorkMom and PctWorkMomYoungKids), and weak social networks (PctNotSpeakEnglWell) can contribute to higher crime rates, as individuals may seek validation, belonging, and support from alternative sources, including criminal activities [27]. Targeted PCA also identified urbanization and the immigrant population of a city as linked to the number of crimes [28].

3.1.2 Dataset 2: Relative location of CT slices on axial axis dataset

Table 4 shows that two PCs of Dataset 2, PC1 and PC2, each explain more than 1% of the variance. Based on Table 9, both were also selected as reference PCs because the \(C_{j,p}\) of the dependent variable (‘reference’) was the highest in both PCs, at 0.7163% and 2.2770%, respectively.

Table 4 The eigenvalue of 385 principal components from Dataset 2

Overall, 254 features contributed a \(C_{j,p}\) of more than 0.2597% (the cutoff value) and were selected as essential features for predicting the relative location of the CT slice in the human body. Of these, 183 were higher-ranked features from PC1, and 71 were from PC2. The Targeted PCA found 149 input features from the bone structure group to be important for predicting the location of the CT slice, whereas only 105 features from the air inclusion group were selected. According to Furuhashi et al. (2009), the importance of the histogram analysis of bone structures and air inclusions can be discussed as follows [29]: (1) Bone structures play a significant role in predicting the relative location of CT slices due to their distinctive properties. Moreover, bone structures provide structural context and serve as reference points for assessing the spatial relationships between adjacent CT slices. Therefore, histograms describing bone structures are considered an important factor in predicting the relative location of CT slices on the axial axis. (2) Air inclusions, such as the lungs or air-filled cavities, also contribute to the localization of CT slices; although they might not be as prominent as bone structures in predicting slice location, they still provide valuable information. (3) In certain cases, particularly when dealing with thoracic or abdominal CT scans, air-filled structures can serve as reliable landmarks for determining the relative position of a slice along the axial axis. By incorporating the histogram analysis of air regions, the predictive accuracy of CT slice localization can be further improved.

3.1.3 Dataset 3: Leptospirosis incidence and land use types dataset

Based on Table 5, PC1 to PC5 were selected for investigation. However, among the five PCs, only PC1 was chosen as the reference PC because in this PC the dependent variable (‘total leptospirosis cases’) has the highest \(C_{j,p}\) compared to the independent variables. This dataset thus resulted in only one reference PC, unlike Datasets 1 and 2, which resulted in more than one. The cutoff value for this dataset is 0.4651%. Based on this cutoff, 155 independent variables were found to be important features for predicting the total number of leptospirosis cases (Table 10).

Ten types of land use were found to be among the most important in determining the total number of leptospirosis cases in Negeri Sembilan, Malaysia. These are residential areas (LU_7), palm oil plantations (LU_4), rubber plantations (LU_23), sport complexes (LU_5), roads (LU_115), oxidation ponds (LU_60), schools (LU_52), monsoon drains (LU_66), bushes (LU_9), and hardware stores (LU_2). Residential and road land uses demarcate the centers of human population and urbanization. The rat population may depend directly on the presence of human homes, as these provide a source of food for rats via garbage [30]. Furthermore, oxidation ponds treat wastewater received through the sewer system, where many rat colonies breed and shelter [31]. Like residential land use, a school area attracts a community of rats, as it provides a food source. Leptospira may infect school children through rats’ urine and through contact with street cats or dogs in school areas [32]. In 2016, a descriptive analysis demonstrated that Malaysian students registered the most significant number of cases in the country, with 40% of the cases reported among students and linked to school activities [33]. Palm and rubber plantation land uses are related to occupational exposure. Plantation workers are likely to be infected by Leptospira because their work brings them into physical contact with the surrounding environment. The predominant host animal in oil palm plantations has been shown to contribute 88.1% of the overall pathogenic Leptospira carried by rats [34]. Unsafe work practices by plantation workers also increase the infection rate of this disease. A cross-sectional study has shown that many workers have poor work practices that expose them to the plantation’s surface soil and water, which are most likely contaminated with the urine of infected animals [35].

Table 5 The eigenvalue of 215 principal components from Dataset 3

3.2 Selected features by LASSO regression

Table 8 shows all the independent features in Dataset 1 selected by LASSO regression, with their ranked coefficient values. Overall, 74 predictors out of 100 were identified as having a significant correlation with the dependent variable using this approach. The level of filtering achieved may be considered minimal; theoretically, further adjustment of the value of alpha or of the regularization (L1) penalty could be used to reduce the number of features further, because as the penalty value increases, the coefficients of more features are set to zero. However, this regularization must be controlled, because very high values will cause feature selection bias and misinterpretation during prediction [18].

According to Tables 9 and 10, LASSO regression selected 359 and 78 features for Datasets 2 and 3, respectively. Dataset 3 shows the most restrictive selection, with almost two-thirds of the independent variables rejected.

Table 6 Comparison in terms of the total number, similarities, and differences of selected features by both methods Targeted PCA and LASSO regression
Table 7 Summary of performance of multi-linear regression fitted with input features selected by LASSO regression and Targeted PCA

3.3 Comparison between Targeted PCA and LASSO selected features

3.3.1 Similarities and differences between selected features

Table 6 shows the number of features chosen by Targeted PCA and LASSO regression for all datasets. The number of features selected by Targeted PCA was lower than that selected by LASSO regression for Datasets 1 and 2. In Dataset 3, the number of features selected by Targeted PCA is nearly double that selected by LASSO regression.

According to the Hamming distance method, Targeted PCA and LASSO regression agreed on 57 of 99 features, equivalent to 57.58% similarity, for Dataset 1, and on 242 of 384 features, equivalent to 63.02% similarity, for Dataset 2. Dataset 3 shows the lowest similarity, with agreement on only 92 out of 214 features. Since the two methods showed a larger gap in the total number of individually selected features for Dataset 3, the potential for them to select similar features was low.

All these similarities and differences may change depending on the dimensionality reduction parameters used by the two methods. For example, if the cutoff on the explained variance percentage were increased from 1% to 5%, PC5 would not be selected as a reference PC, since its explained variance of 4.6% is lower than that threshold; in this case, the analysis would reject almost 28 currently selected features. The same applies to LASSO regression. In conclusion, the study found that the two methods share more than 50% of the independent variables for Datasets 1 and 2. Meanwhile, Dataset 3 has less than 50% similarity of independent variables, which indicates a substantial difference between the features selected and rejected by Targeted PCA and LASSO regression.

3.3.2 Prediction performance on linear and nonlinear model

This section presents the impact of the selected features on capturing the linear and nonlinear relationships between input and output in the prediction tasks.

Table 7 shows the summary of the trained and tested multiple linear regression (MLR) models, which used the inputs selected from all three datasets by both approaches. Both methods produce p values lower than 0.05 for all datasets. In addition, the model prediction performance when using features selected by LASSO regression and by Targeted PCA is not significantly different for any dataset; the difference in multiple \(R^2\) and adjusted \(R^2\) was less than 0.02.

Fig. 3 Performance graph of ANN model trained by input features selected by LASSO regression and Targeted PCA

Figure 3 shows the tested ANN performance over epochs, comparing the different sets of selected features for all datasets. For Dataset 1, the model trained with the input selected by Targeted PCA produced a slightly higher starting error than the model trained with the input selected by LASSO regression, but it recorded a drastic reduction in error at the second epoch and finally converged at epoch 13. In contrast, with the features selected by LASSO regression, the starting error was 0.00075 lower; however, the model converged more slowly, with notable error reductions only after epochs 11 and 17, and finally converged at epoch 24 with a higher error than the model trained with the features selected by Targeted PCA.

With Dataset 2, the results were similar. The model with features selected by Targeted PCA showed a larger error of 0.4463 at the beginning, while the model with input from LASSO had a lower error of 0.3914. However, the situation changed as the model with the Targeted PCA input trained aggressively, repeatedly showing significant error reductions, especially between epochs 8 and 9, where the error fell from 0.4001 to 0.2991. The error of the model with input from the LASSO regression, in contrast, decreased gradually until epoch 28, when it dropped more sharply from 0.2286 to 0.1909. Figure 3 also shows that both models converged at the same epoch, 35, but the model with input from the Targeted PCA achieved a better performance than the model trained with input from LASSO, with final MSE values of 0.1186 and 0.1355, respectively.

For Dataset 3, the ANN model trained with the input from the Targeted PCA performed better than the model with features selected by LASSO from the first epoch to the last. The model with the Targeted PCA input produced an MSE of 0.099 at the first epoch, while the model with the LASSO input produced 0.2973. Both models then gradually reduced the error over the following epochs. However, the model with the input from Targeted PCA converged much faster, at epoch 55, with a final error of 0.0113, whereas the model with input from LASSO regression required additional epochs, converging at epoch 58 with a higher final error of 0.0844. All trained models appeared to converge through the first rule of the early stopping algorithm, whereby training stopped at the epoch where the difference in loss between the previous and current epochs was lower than 0.001.

In conclusion, both methods showed a good ability to capture the relationship between the input and output in the dataset when linearity was assumed through multiple linear regression. However, the ANN model trained faster and performed better (lower error) with the features selected by the Targeted PCA. The features selected by the Targeted PCA provided more informative nonlinear connections between input and output than those selected by the LASSO regression. Moreover, the LASSO regression technique, being restricted to a linear fit, may have underfitted the data. To overcome the nonlinearity problem in the LASSO regression technique, previous researchers have used LASSO variants designed for nonlinear feature problems, such as the Least Absolute Shrinkage and Selection Operator-Neural Network (LassoNET) and the Least Absolute Shrinkage and Selection Operator-Multi-Layer Perceptron (LassoMLP). However, these two methods are embedded feature selection methods that may not perform well with other classifiers [36]. Meanwhile, other traditional nonlinear feature selection methods, such as distance correlation, the Hilbert–Schmidt Information Criterion, and Hoeffding’s test, suffer from ignoring the joint contribution of features in predicting the target data [37]. None of the above studies aimed to assess regression performance using the selected features.

4 Conclusion

This study proposed a new approach using PCA for feature selection. It identified and ranked the important features based on the independent variables’ connection to the selected principal components. The methodology was tested on three datasets from different fields to ensure its robustness. The study found that in two of the three datasets, the selected features were more than 50% similar to those selected using LASSO regression. Moreover, the results indicate that the Targeted PCA performed efficiently in capturing both linear and nonlinear patterns in the datasets in the prediction tasks, producing faster convergence and better performance in the ANN training.

The Targeted PCA method has the limitation that it selects the features belonging to the reference PC(s) using a particular threshold value; it only considers PCs with a high eigenvalue (explained variance above 1%) and a high \(C_{j,p}\) of the dependent variable. Consequently, it may not be applicable to datasets whose dependent variable has a low \(C_{j,p}\) on the high-ranked PCs. To address this, future studies could investigate the effectiveness of different feature transformations of the original dataset prior to the PCA.