Introduction

An induction motor is one of the main components on which production depends [1]. Safety and proper maintenance are necessary to avoid sudden production stoppage and financial loss. Identifying bearing defects and assessing their severity are critical so that proactive measures can be taken early and severe bearing and system failure can be prevented. Typically, flaws are caused by changes in speed and load, which can lead to early bearing failure. It is also crucial to align the bearing with the rest of the system: an unbalanced system produces unwanted noise and vibration, leading to bearing damage. Wear of the rolling surfaces is initiated and spread by dirt, dust, and other foreign particles. Deep scratches, dents, and other faults may occur during setup if the bearing is mounted incorrectly. During operation, these faults worsen and degrade the bearing’s performance. Bearing faults are among the most frequently occurring faults in rotating machines and need to be checked at regular intervals. Abnormal bearing behavior can be observed by monitoring the vibration of the machine body. The vibration signature contains various frequencies caused by damage in any bearing part. During operation, however, the vibration of the machine body masks the frequencies generated by surface damage. As a result, detecting a defect in any bearing component requires a thorough examination of the vibration signal to extract the necessary information.

Numerous authors have investigated a range of defect diagnostics approaches for rolling element bearings [2]. When faults of varying severity levels occur in a single bearing element, they all appear at a similar characteristic frequency, which makes fault severity estimation more difficult. The application of artificial intelligence methods such as support vector machine (SVM), fuzzy logic, and artificial neural network (ANN) to bearing defect diagnosis has also been documented in the literature. Features can be extracted from the acquired vibration signals to train a classifier. The most widely employed statistical parameters are the kurtosis, root mean square (RMS), average magnitude of the fault frequency, and crest factor. For fault diagnosis (FD), researchers have used a decision tree approach to identify the best features.

Samanta et al. investigated the behavior of SVM and ANN in detecting gear faults [3]. Kankar et al. classified faults in bearing components with artificial intelligence approaches such as SVM and ANN, using features acquired from time-domain signals [4]. Kankar et al. also used response surface methodology (RSM) to investigate the effects of faults in various bearing elements on the stability of a rotor-bearing system [5]. Predicting the severity of bearing defects remains challenging. Jiang et al. carried out an observational study to determine the severity of rotating equipment defects: multiple frequency band energies (MFBE) are extracted from the vibration signals for feature selection, and statistical and residue signals are employed to estimate fault severity. In addition to identifying the damaged bearing component, bearing diagnosis also entails estimating the fault’s severity. The current study aims to classify the presence of a fault, the type of fault, and the fault severity level in an induction motor bearing. Faults of varying severity in the same element have the same characteristic frequency, so it is difficult to classify bearing faults by severity level. In this work, eleven features are computed for the various bearing conditions, and features are then selected according to how sensitive they are to the defects. The selected features are the input to two machine learning algorithms, logistic regression (LR) and the support vector classifier (SVC). Several feature selection filters are implemented and compared to select suitable features, and the classification performance of SVC and LR is compared for each filter.

Data Description and Feature Extraction

The literature indicates that the most suitable signal for investigating a mechanical fault in an induction motor is the vibration signal, whose amplitude is a function of time. Vibration sensors are required to record the vibration signal of the machine, and their placement plays a vital role in acquiring the signal precisely. The sensors convert the mechanical vibration of the structure into an electrical signal, the vibration signal, which carries information about the vibration parameters. The data for the present work were retrieved from the open-source data center of Case Western Reserve University (CWRU) [6]. The data are vibration signals recorded at various speeds and fault sizes from a bearing installed in a three-phase induction motor. The speed varies from 1797 to 1720 RPM; as the load increases, the speed decreases. Signals were recorded at three distinct defect sizes, i.e., 0.007 inches, 0.014 inches, and 0.021 inches, each at no load, one HP, two HP, and three HP. The faults are termed ball faults (BFs), outer race faults (ORFs), and inner race faults (IRFs). The fault cases therefore cover three fault types, three fault sizes, and four loading conditions. Thus a total of 36 different fault conditions and four healthy conditions have been considered to classify the healthy and faulty states of the bearing, as reported in Table 1. Table 1 lists 680 signals, of which 320 segments are intact signals and 360 segments are fault signals. To capture slight variations in the vibration signal, each data set has 3000 samples at a sampling frequency of 12,000 samples per second.
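The counts quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the short labels for the fault types simply reuse the abbreviations in the text.

```python
from itertools import product

# Labels and values taken from the data description; names are illustrative.
fault_types = ["BF", "ORF", "IRF"]
fault_sizes_in = [0.007, 0.014, 0.021]
loads_hp = [0, 1, 2, 3]

# Every combination of fault type, fault size, and load is one fault condition.
fault_conditions = list(product(fault_types, fault_sizes_in, loads_hp))
# One healthy condition per load.
healthy_conditions = [("intact", None, load) for load in loads_hp]

n_fault = len(fault_conditions)      # 3 * 3 * 4
n_healthy = len(healthy_conditions)  # 4

# Duration of one 3000-sample data set at 12,000 samples per second.
samples_per_segment = 3000
sampling_rate_hz = 12_000
segment_duration_s = samples_per_segment / sampling_rate_hz

print(n_fault, n_healthy, segment_duration_s)
```

Each data set therefore spans a quarter of a second of vibration signal.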

Table 1 Vibration data description for the fault classification

Feature extraction is an important step for classification. Once the vibration signal is obtained, the first step is to calculate statistical parameters of the time-domain signal and compare them with those of the baseline signal. If a statistical parameter changes significantly, the cause must be addressed immediately, before major damage occurs in the induction motor. The statistical parameters give an early indication of the mechanical status, i.e., intact or faulty. The statistical parameters most widely used to analyze bearing defects are max (F1), min (F2), RMS (F3), mean (F4), mode (F5), standard deviation (F6), median (F7), variance (F8), skewness (F9), kurtosis (F10), and energy of the signal (F11) [7].
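A minimal sketch of how the eleven features could be computed with NumPy is given below. The exact definitions used in the study are not stated, so several choices here are assumptions: population (not sample) standard deviation and variance, non-excess kurtosis, the mode taken as the most frequent raw sample value, and energy as the sum of squared samples. The function name and the synthetic test signal are hypothetical.

```python
import numpy as np

def statistical_features(signal):
    """Return the eleven time-domain features F1..F11 of one segment.

    Assumes a non-constant signal (a constant one has zero standard
    deviation, which makes skewness and kurtosis undefined).
    """
    x = np.asarray(signal, dtype=float)
    mean = x.mean()
    std = x.std()                      # population standard deviation (assumption)
    centered = x - mean
    values, counts = np.unique(x, return_counts=True)
    mode = values[np.argmax(counts)]   # smallest of the most frequent values
    return {
        "F1_max": x.max(),
        "F2_min": x.min(),
        "F3_rms": np.sqrt(np.mean(x ** 2)),
        "F4_mean": mean,
        "F5_mode": mode,
        "F6_std": std,
        "F7_median": np.median(x),
        "F8_variance": x.var(),
        "F9_skewness": np.mean(centered ** 3) / std ** 3,
        "F10_kurtosis": np.mean(centered ** 4) / std ** 4,  # non-excess (assumption)
        "F11_energy": np.sum(x ** 2),
    }

# Synthetic 3000-sample segment at 12 kHz standing in for a recorded signal.
t = np.arange(3000) / 12_000
segment = np.sin(2 * np.pi * 100 * t)
features = statistical_features(segment)
```

Applying such a function to every segment and stacking the results row by row would give one feature matrix per signal type.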

After the features are calculated, it is important to select the relevant features that contribute most to the prediction variable or desired output. The goal of feature selection is to exclude non-informative or unnecessary variables from the model. Too much irrelevant data decreases accuracy, impedes the development and training of models, and requires a large amount of system memory. Additionally, less data means that algorithms train faster [8,9,10].

In the present work, two feature selection methods provided by the scikit-learn Python library have been used: the ANOVA F test and mutual information. Feature selection is also applied to the wavelet-decomposed signal, and the resulting classification accuracy is compared with that obtained using the statistical features of the raw data.

Wavelet Packet Transform

The wavelet packet transform (WPT) of a time-domain signal is a computation that passes the signal through low-pass and high-pass filters to obtain its approximations and details. It helps to locate in time the transient events that arise as a defect develops on the bearing's surface, which is useful for monitoring and defect detection, two aspects of condition monitoring. The wavelet transform uses time and scale window functions to characterize signals in the time-frequency domain. Wavelet packets filter the incoming signal into progressively finer equal-width intervals, which results in sub-band filtering. At each level \(j\), the normalized frequency axis [0, 1/2] is divided into \(2^{j}\) sub-bands. At level \(j\), the sub-bands in hertz are \(\left[ nF_{s}/2^{j + 1}, (n + 1)F_{s}/2^{j + 1} \right]\), \(n = 0, 1, \ldots, 2^{j} - 1\), where \(F_{s}\) is the sampling frequency. Compared with other wavelet techniques, wavelet packets are superior at time-frequency analysis. Wavelet packets also have the advantage of being orthogonal transforms (when an orthogonal wavelet is used). An orthogonal transform preserves the signal's energy and distributes it among the coefficients [11, 12].
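The sub-band formula and the energy-preservation property can be illustrated with a short NumPy sketch. The mother wavelet used in the study is not stated; the Haar wavelet is assumed here only because it is the simplest orthogonal wavelet, and the function names are hypothetical. The nodes are returned in decomposition (natural) order, which is not necessarily the frequency order.

```python
import numpy as np

def subband_edges(fs, level):
    """Edges [n*Fs/2^(j+1), (n+1)*Fs/2^(j+1)] in hertz for n = 0..2^j - 1."""
    width = fs / 2 ** (level + 1)
    return [(n * width, (n + 1) * width) for n in range(2 ** level)]

def haar_split(x):
    """One orthogonal Haar filtering step: approximation and detail halves."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-pass branch
    detail = (even - odd) / np.sqrt(2)   # high-pass branch
    return approx, detail

def haar_packet(x, level):
    """Full wavelet packet tree: split every node at every level.

    The signal length must be divisible by 2**level.
    """
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(level):
        nodes = [part for node in nodes for part in haar_split(node)]
    return nodes

edges = subband_edges(fs=12_000, level=2)

# Random stand-in for one 3000-sample vibration segment.
rng = np.random.default_rng(0)
segment = rng.standard_normal(3000)
nodes = haar_packet(segment, level=2)

energy_in = np.sum(segment ** 2)
energy_out = sum(np.sum(node ** 2) for node in nodes)
```

For a 12 kHz sampling frequency and two levels, the edges are multiples of 1500 Hz up to 6000 Hz, and the total energy of the four nodes equals the energy of the input segment up to rounding error.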

Feature Selection

ANOVA F Test

The analysis of variance (ANOVA) is a statistical tool for determining whether the means of two or more groups differ significantly. The innovative aspect of the current work is the use of one-way ANOVA and the F test as a feature selection method that identifies the key features and reduces the dimensionality of the feature space, with the main objectives of decreasing computational time, increasing classification accuracy, or both [13]. “One-way ANOVA” is used to compare groups of numerical data when there is only one independent and one dependent variable. The goal is to see whether the data from distinct groups have the same mean. One-way ANOVA presupposes that the data within each group are normally distributed; however, it can also work with data that deviate somewhat from normality. ANOVA tests the null hypothesis (\(H_{0}\)), which states that the means of all groups are equal, against the alternative hypothesis (\(H_{1}\)), which states that at least one group’s mean is different:

$$H_{0} :\mu_{1} = \mu_{2} = \mu_{3} = \ldots = \mu_{k}$$
(1)

where \(\mu_{1} , \ldots ,\mu_{k}\) are the group means, \(k\) is the number of groups, and \(H_{0}\) and \(H_{1}\) are the standard hypothesis test symbols for the null and alternative hypotheses, respectively. ANOVA divides the total sum of squares (SST) into the sum of squares due to the between-groups effect (SSR) and the sum of squared errors (SSE):

$$SST = \sum\limits_{j = 1}^{k} \sum\limits_{i = 1}^{n_{j}} \left( {y_{ij} - \overline{y}} \right)^{2} = \sum\limits_{j = 1}^{k} \sum\limits_{i = 1}^{n_{j}} \left( {y_{ij} - \overline{y}_{j} } \right)^{2} + \sum\limits_{j = 1}^{k} n_{j} \left( {\overline{y}_{j} - \overline{y}} \right)^{2} ,$$
(2)

where SSE and SSR are the first and second terms on the right-hand side of the above equation, \(\overline{y}_{j}\) is the mean of group \(j\), \(y_{ij}\) is the \(i\)th data point within group \(j\), \(\overline{y}\) is the overall mean, \(n_{j}\) is the sample size of group \(j\), \(j = 1, 2, \ldots, k\), and \(k\) is the total number of groups. The ratio of the variance between groups to the variance within groups is assessed to see whether the group means differ significantly from one another. A larger ratio indicates more clearly separated groups. The test statistic follows an \(F\) distribution with \((k - 1, N - k)\) degrees of freedom (DOF):

$$F = \frac{{{\text{SSR}}/\left( {k - 1} \right)}}{{{\text{SSE}}/\left( {N - k} \right)}} = \frac{{{\text{MSR}}}}{{{\text{MSE}}}} \sim F_{k - 1,N - k} .$$
(3)

MSE is the mean squared error, MSR is the mean square due to treatment, \(N\) is the total number of observations, and \(k\) is the number of groups. Finally, the \(p\) value is determined using the cumulative distribution function (CDF) of the \(F\) distribution. The null hypothesis is rejected when the \(p\) value is less than the significance level, indicating that at least one group has a different mean.
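As a worked illustration of Eqs. (2) and (3), the following NumPy sketch computes the F statistic for a small toy data set. The function name and the numbers are hypothetical and are not taken from the bearing data; in feature selection, each group would hold the values of one feature for one class.

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F statistic: (SSR / (k - 1)) / (SSE / (N - k))."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # Between-group sum of squares (second term of Eq. 2).
    ssr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squared errors (first term of Eq. 2).
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
    msr = ssr / (k - 1)
    mse = sse / (n_total - k)
    return msr / mse

# Toy data: three groups of three values; SSR = 26, SSE = 6, so F = 13.
f_value = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The corresponding \(p\) value would then follow from the upper tail of the \(F\) distribution with \((k - 1, N - k)\) degrees of freedom.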

Mutual Information

The mutual information technique (MIT) is crucial for feature selection in ball-bearing fault diagnosis. MIT is used to extract high-order and nonlinear statistics. As a result, we consider MIT for feature selection to equalize the length of the samples. Similarly, dimensions can be reduced effectively by selecting features using the nonlinear relationships in a multi-dimensional feature space. The probability density estimation technique substantially influences the MI calculation and hence determines whether the method can select representative features efficiently and accurately. Probability density estimation is therefore a critical step in fault diagnosis using MIT feature selection. We first compute the MI value using a probability density estimation method to capture both linear and nonlinear relationships between variables through MI matrices [14, 15].

The mutual information \({\text{MI}}\left( {X_{v} ;Y_{v} } \right)\) between two random variables \(X_{v}\) and \(Y_{v}\) measures their mutual dependence and can be calculated as:

$${\text{MI}}\left( {X_{v} ;Y_{v} } \right) = \iint {S\left( {X_{v} ,Y_{v} } \right)}\log \frac{{S\left( {X_{v} ,Y_{v} } \right)}}{{S\left( {X_{v} } \right)S\left( {Y_{v} } \right)}}\,{\text{d}}X_{v} \,{\text{d}}Y_{v} ,$$
(4)

where \(S(X_{v} ,Y_{v} )\) is the joint probability density function of \(X_{v}\) and \(Y_{v}\), and \(S(X_{v} )\) and \(S(Y_{v} )\) are the marginal probability density functions of \(X_{v}\) and \(Y_{v}\), respectively. When at least one continuous random variable is present, the joint probabilities are estimated with kernel density, binning, or nearest-neighbor estimators. The nearest-neighbor estimator outperforms the other two methods because it is adaptive and data-efficient. The present study therefore employs the nearest-neighbor estimator to estimate the MI of the process parameters.

The goal of parameter selection based on MIT in classification is to determine a set \(R\) of \(m\) parameters \(X_{n}\) that has the largest dependency on the class \(T\):

$$\max d\left( {R,T} \right),\quad d = {\text{MI}}\left( {\{ X_{n} \,|\,n = 1, \ldots ,m\} ;T} \right).$$
(5)

The max-dependency criterion (\(\max d\)) is approximated using simplified rules based on mutual information because the joint probability density functions are difficult to estimate in practice owing to a lack of samples. The univariate mutual information \({\text{MI}}(X_{n} ;T)\) between a variable \(X_{n}\) and the class \(T\) is the most straightforward criterion. The greater the value of \({\text{MI}}(X_{n} ;T)\), the more important \(X_{n}\) is. For classification, it can be written as:

$${\text{MI}}\left( {X_{n} ;T} \right) = \iint {P\left( {X_{n} ,T} \right)}\log \frac{{P\left( {X_{n} ,T} \right)}}{{P\left( {X_{n} } \right)P\left( T \right)}}\,{\text{d}}X_{n} \,{\text{d}}T.$$
(6)

Here \(X_{n}\) and \(T\) range over all their possible values. Because \(X_{n}\) is continuous and \(T\) is discrete, the probability density functions are estimated with an extension of the nearest-neighbor estimator to a continuous and a discrete variable.
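To make the univariate criterion concrete, the sketch below evaluates the discrete analogue of Eq. (6), in which the integrals become sums over observed values and the probabilities are replaced by relative frequencies. This plug-in estimate is a deliberate simplification for illustration; it is not the nearest-neighbor estimator employed in the study, and the function name and toy arrays are hypothetical.

```python
import numpy as np

def discrete_mutual_information(x, t):
    """Plug-in MI (in nats) between two discrete variables.

    Sums P(x, t) * log(P(x, t) / (P(x) * P(t))) over all observed value pairs.
    """
    x = np.asarray(x)
    t = np.asarray(t)
    mi = 0.0
    for xv in np.unique(x):
        p_x = np.mean(x == xv)
        for tv in np.unique(t):
            p_t = np.mean(t == tv)
            p_xt = np.mean((x == xv) & (t == tv))
            if p_xt > 0:  # zero-probability cells contribute nothing
                mi += p_xt * np.log(p_xt / (p_x * p_t))
    return mi

# Toy class labels and two toy features.
labels = np.array([0, 0, 1, 1])
informative = np.array([0, 0, 1, 1])    # identical to the class label
uninformative = np.array([0, 1, 0, 1])  # independent of the class label

mi_informative = discrete_mutual_information(informative, labels)
mi_uninformative = discrete_mutual_information(uninformative, labels)
```

A feature that determines the class reaches the entropy of the class variable, whereas an independent feature scores zero, which is why features are ranked by decreasing MI.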

Classification Techniques

Logistic Regression (LR)

Logistic regression is a supervised machine learning algorithm for classification problems in which, given a collection of features (or inputs) X, the target variable (or output) Y can take only discrete values. It is a variant of ordinary regression for a dichotomous dependent variable (the occurrence or non-occurrence of some output event, usually coded as 0 or 1) [16]. The purpose of logistic regression is to identify the best-fitting model to represent the relationship between a set of independent variables and a dichotomous dependent variable [14]. The dependent variable in the logistic regression approach is the probability of an event occurring; thus, the output is bounded between 0 and 1. The logistic function is described as follows:

$${\text{prob}}\left( {{\text{event}}} \right) = p\left( {\vec{x}} \right) = \frac{1}{{1 + e^{{ - g(\vec{x})}} }} = \frac{{e^{{g(\vec{x})}} }}{{1 + e^{{g(\vec{x})}} }},$$
(7)

where \(p(\vec{x})\) is the probability of some output event, \(\vec{x} = (x_{1} ,x_{2} ,x_{3} , \ldots ,x_{k} )\) is the input vector of predictors (independent variables), and \(g(\vec{x})\) is the logit model. The logit model of multiple logistic regression can be stated as:

$$g\left( {x_{1} ,x_{2} , \ldots ,x_{k} } \right) = a + b_{1} x_{1} + b_{2} x_{2} + \ldots + b_{k} x_{k} ,$$
(8)

where \(g(\vec{x})\) is a linear combination of the independent variables \(x_{1} ,x_{2} , \ldots ,x_{k}\), and \(a,b_{1} ,b_{2} , \ldots ,b_{k}\) are known as the regression coefficients. After transforming the dependent variable into a logit variable, logistic regression employs maximum likelihood estimation to determine the parameters \(a,b_{1} ,b_{2} , \ldots ,b_{k}\). In this research, logistic regression is used to estimate the probability of failure for run-to-failure bearing data. This failure probability depicts the progression from incipient failure (encoded as 0) to complete failure (encoded as 1).
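Equations (7) and (8) can be evaluated directly, as in the following sketch. The coefficient values are hypothetical and chosen only so that the arithmetic is easy to follow; in practice they would be obtained by maximum likelihood estimation.

```python
import numpy as np

def logit_model(x, a, b):
    """Eq. (8): g(x) = a + b_1 x_1 + ... + b_k x_k."""
    return a + np.dot(b, x)

def event_probability(x, a, b):
    """Eq. (7): p(x) = 1 / (1 + exp(-g(x)))."""
    g = logit_model(x, a, b)
    return 1.0 / (1.0 + np.exp(-g))

# Hypothetical coefficients and input vector (not fitted to any data).
a = -1.0
b = np.array([2.0, -0.5])
x = np.array([0.5, 0.0])

# g = -1 + 2 * 0.5 - 0.5 * 0 = 0, so the event probability is 0.5.
p = event_probability(x, a, b)

# The two forms of Eq. (7) agree for any value of g.
g_other = logit_model(np.array([1.0, 2.0]), a, b)
form_1 = 1.0 / (1.0 + np.exp(-g_other))
form_2 = np.exp(g_other) / (1.0 + np.exp(g_other))
```

A sample is assigned to class 1 when the probability exceeds a chosen threshold, commonly 0.5, which corresponds to \(g(\vec{x}) > 0\).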

Support Vector Classifier (SVC)

In a high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes that can be used for classification, regression, and other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (the so-called functional margin), because in general the larger the margin, the lower the generalization error of the classifier [13, 17].

SVC solves the following problem:

$$\mathop {\min }\limits_{\omega ,b,\zeta } \frac{1}{2}\omega^{T} \omega + C\mathop \sum \limits_{i = 1}^{n} \zeta_{i}$$
(9)

subject to \(y_{i} \left( {\omega^{T} \phi \left( {x_{i} } \right) + b} \right) \ge 1 - \zeta_{i}\) and \(\zeta_{i} \ge 0\), \(i = 1, \ldots ,n\).

The aim is to maximize the margin (by minimizing \(\left\| \omega \right\|^{2} = \omega^{T} \omega\)) while incurring a penalty when a sample is misclassified or lies within the margin boundary. Ideally, the value \(y_{i} (\omega^{T} \phi (x_{i} ) + b)\) would be ≥ 1 for every sample, which indicates a perfect prediction. However, because problems are rarely perfectly separable with a hyper-plane, some samples are allowed to be at a distance \(\zeta_{i}\) from their correct margin boundary. The penalty term \(C\) controls the strength of this penalty and, as a result, acts as an inverse regularization parameter.
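The objective in Eq. (9) can be evaluated for a fixed hyper-plane as sketched below. Two simplifications are assumed: a linear feature map \(\phi(x) = x\), and slack variables set to the smallest values that satisfy the constraints, \(\zeta_{i} = \max(0, 1 - y_{i}(\omega^{T} x_{i} + b))\). The function name and the toy data are hypothetical; this evaluates the objective only and is not a solver.

```python
import numpy as np

def svc_primal_objective(w, b, X, y, C):
    """Value of Eq. (9) for given w and b, with a linear map phi(x) = x.

    The slack of each sample is the smallest value satisfying both
    constraints: zeta_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)
    objective = 0.5 * (w @ w) + C * slack.sum()
    return objective, slack

# Toy data: labels must be +1 or -1 in this formulation.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1, 1, -1])
w = np.array([1.0, 0.0])
b = 0.0

# Margins are 2, 0.5, 2, so only the second sample lies inside the margin
# (slack 0.5); the objective is 0.5 * 1 + 1.0 * 0.5 = 1.0.
objective, slack = svc_primal_objective(w, b, X, y, C=1.0)
```

Increasing \(C\) makes each unit of slack more expensive, so the optimizer tolerates fewer margin violations at the cost of a narrower margin.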

Proposed Methodology

In the present paper, induction motor bearing fault classification is addressed in three cases, i.e., the presence of a fault, the type of fault, and the fault severity, as mentioned in Table 1. A filtering procedure is required to obtain a meaningful signal. A two-level WPT has been used to decompose the signal into four sub-bands with frequency ranges of 0–1500 Hz, 1500–3000 Hz, 3000–4500 Hz, and 4500–6000 Hz, designated AA2, DA2, AD2, and DD2, respectively. The signal decomposition has been carried out for the intact condition and for the three fault types at three fault sizes, under four load conditions.

Feature selection is the process of finding and choosing the subset of input features that is most relevant to the target variable. It is often simple when both input and output are real-valued, for example by using Pearson’s correlation coefficient, but it can be more demanding with numerical input variables and a categorical target variable. Efficient diagnosis and prognosis can be achieved by selecting the most essential and sensitive features. Irrelevant and inaccurate features degrade the overall reliability of fault diagnosis and prognosis approaches, making it impossible to anticipate the actual bearing condition.

The statistical features in the current work are calculated from both the unprocessed data and the sub-band data obtained from the WPT. Eleven statistical features are derived from each recorded vibration signal. This results in a 680 × 11 feature set for the unprocessed data and for each sub-band, ready for classification. To extract more information and classify faults correctly, the features are evaluated for their relevance and sensitivity to the various defects. When the input data are numerical and the target variable is categorical, the ANOVA F test and mutual information statistics are the two most commonly used feature selection techniques. The features selected by these methods are used to train and test the logistic regression and SVC models. The methodology of the proposed work is shown in Fig. 1.

Fig. 1 Proposed methodology for rolling element bearing fault classification

Result and Discussions

Three cases were considered in this study to evaluate the fault status of the induction motor using vibration signals. Case 1 examines the presence of a fault, i.e., the intact and faulty conditions of the bearing. Following the proposed methodology, 11 features were first calculated from each of the 680 original data sets, of which 320 are for the intact bearing and 360 are for the damaged bearing. All eleven input variables are numerical. The target value is 0 for the intact condition and 1 for the fault condition. The ANOVA F test and mutual information were applied to the prepared data to investigate the optimal number of features. Figure 2 shows the feature importance scores of each input feature as a bar chart (larger is better), which gives an idea of how many features should be selected. The results of these tests can be used for feature selection, where features that are independent of the target variable can be removed from the data set. From Fig. 2, we can conclude that features F4 and F5 are irrelevant because their scores are low. Some features stand out as perhaps more relevant than others, with much larger test statistic values. According to the ANOVA F test, F3 and F6 might be the most suitable features, and perhaps six of the eleven input features are the more relevant ones. According to mutual information, F5, F9, and F10 are of minor importance because of their low scores.

Fig. 2 Feature scores of the original data set: (a) ANOVA F test, (b) mutual information

The WPT decomposes each original data set into four sub-bands. Wavelet packets filter the incoming signal into progressively finer equal-width intervals, resulting in sub-band filtering. Because of these sub-bands, the WPT is a helpful method for detecting and discriminating transient components with high-frequency characteristics. The eleven features were calculated again from each of the four sub-band signals and screened for significant features using the ANOVA F test and mutual information, as shown in Fig. 3.

Fig. 3 Feature scores of the WPT-decomposed data set: ANOVA F test and mutual information

From Fig. 3, we observe that in the ANOVA F test, signals AA2 and AD2 score extremely low, and all the other features show approximately the same value. In signals DA2 and DD2, the features F4, F7, and F9 are insignificant. Feature selection by mutual information finds that the feature scores from DD2 are high compared with those from AA2, DA2, and AD2. Features F9 and F10 in signal AA2, and F4, F7, and F9 in signals DA2, AD2, and DD2, score lower than the other features. The feature scores thus indicate that some features are irrelevant because they have a low score or a small impact on classification accuracy. It has also been observed that the ANOVA F test gives a better selection of features than mutual information.

The effect of feature selection on numerical input data for a classification predictive modelling problem must now be investigated. Models were built using the selected features, and their results were compared. In this section, logistic regression (LR) and support vector classification (SVC) models with all features are evaluated and compared with models built from the features selected by the ANOVA F test and by mutual information. Logistic regression is a good model for testing feature selection methods, as it can perform better if irrelevant features are removed from the model.

A total of 680 data samples are prepared for the training and testing phases of a two-class classification problem that covers the intact and defective bearing conditions. The flow chart in Fig. 1 depicts the method used for diagnosing and classifying the various faults. Figure 4 shows the confusion matrices of the original signal and the decomposed signals obtained from logistic regression and the support vector classifier. Of the 680 data sets, 455 have been used for training and 225 for testing. Table 2 shows the percentage accuracy of the LR and SVC classifiers when all 11 features are considered for all five signals. The signals AD2 and DD2 give 100% accuracy with both classifiers, the best among all the signals. Under the same conditions, the accuracy for the original signal is 94.67% for LR and 93.33% for SVC. The signal AA2 gives lower accuracies of 71.11% and 68% for LR and SVC, respectively. Similarly, the classification accuracy obtained from signal DA2 is 83.11% and 81.77% for LR and SVC, respectively. It may therefore be concluded that, when all 11 features are taken into consideration, AD2 and DD2 give 100% accuracy with both classifiers, whereas for the remaining signals LR performs better than SVC for binary classification.

Fig. 4 Fault isolation results using all the features for case 1

Table 2 Classification accuracy for LR and SVC with all features for case 1

As shown by the ROC curves for the original and decomposed signals in Fig. 5, the best curves, obtained from the original, AD2, and DD2 signals, correspond to 100% accuracy. This section explores whether the grid search approach can improve the classifiers’ performance by reducing the number of features while achieving the same or higher accuracy than when all features are used for classification. The first step is to define a series of modeling pipelines to evaluate. Each pipeline defines the data preparation steps and ends with a model that takes the transformed data as input. To determine which features produce the best-performing model, a range of different numbers of selected features has been systematically tested. In the grid search, the k argument of the SelectKBest constructor sets how many of the top-scoring features are kept, while the score function (here the ANOVA F statistic) ranks each feature against the target variable. Following the feature selection, an LR or SVC model is fitted on the chosen features, and a grid search is then run over the number of selected features. Using repeated stratified k-fold cross-validation to assess model configurations on classification problems is good practice. In this study, three repeats of tenfold cross-validation have been used for all three cases. Within each cross-validation fold, the pipeline ensures that the feature selection is fitted on the training set only and is then applied to both the training set and the test set.
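The procedure described above might be set up in scikit-learn roughly as follows. This is a sketch under stated assumptions rather than the code used in the study: the feature matrix is a synthetic stand-in generated with `make_classification` (not the CWRU features), the sample size is reduced to keep the run short, and all hyperparameters other than the cross-validation scheme and the grid of k values are library defaults or illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic two-class data with 11 numerical features (placeholder data only).
X, y = make_classification(
    n_samples=200, n_features=11, n_informative=4, n_redundant=2, random_state=1
)

# Feature selection and model in one pipeline, so that the selector is
# refitted on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),  # ANOVA F test scores
    ("model", LogisticRegression(max_iter=1000)),
])

# Three repeats of stratified tenfold cross-validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Grid over the number of selected features, from 1 to 11.
param_grid = {"select__k": list(range(1, 12))}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=cv)
search.fit(X, y)

best_k = search.best_params_["select__k"]
best_accuracy = search.best_score_
print(best_k, round(best_accuracy, 3))
```

Replacing `LogisticRegression` with `sklearn.svm.SVC`, or `f_classif` with `mutual_info_classif`, gives the other model and scoring combinations under the same evaluation scheme.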

Fig. 5 ROC curve for the original and decomposed signals

The evaluation grid can, therefore, be defined as a range of values from 1 to 11. The classification accuracy of both classifiers is shown in Table 3.

Table 3 Classification accuracy for LR and SVC with selected features for case 1

Grid searches over the number of features selected with the ANOVA F test are run, and each modelling pipeline is assessed using repeated cross-validation. The grid search for the number of features that provides the best accuracy is shown in Fig. 6, from which the best number of features and the corresponding classification accuracy of the LR and SVC classifiers were obtained for the five signals. For the DD2 signal, 100% accuracy was obtained with four features using LR and with only three features using SVC. The same 100% accuracy was obtained from the AD2 signal with six features for both classifiers. For the original signal, the best accuracy is 96.3% with ten features using LR and 97.2% with five features using SVC. From the obtained classification accuracies, it can be concluded that the SVC classifier performs better than the LR classifier.

Fig. 6 Grid search for the optimum number of features for case 1

For case 2, the different fault types are classified. There are four classes: inner race fault (IRF), ball fault (BF), outer race fault (ORF), and the intact state. The target values for intact, IRF, BF, and ORF are 0, 1, 2, and 3, respectively. Figure 7 shows the results of this multiclass classification task covering the four bearing conditions. Table 4 shows that, when all eleven features are considered, the LR classifier obtains an accuracy of 81.33% for the WPT signal AD2 and the SVC classifier obtains 96.4% for the WPT signal DD2. The SVC classifier performs better when the WPT-filtered signal is used to determine the type of bearing defect.

Fig. 7 Fault isolation results using all the features for case 2

Table 4 Classification accuracy using all the features as input for case 2

The most accurate fault type detection is achieved by using a grid search to determine the ideal number of features. The grid search outcomes for the LR and SVC models are shown in Fig. 8, and Table 5 summarizes the findings. For categorizing the faults from the original signal, the optimal configuration is nine features with the best accuracy of 92.5% using SVC, and 11 features with an accuracy of 82.4% using LR. With the WPT, however, the SVC model on the DD2 signal with eight features and 96.3% accuracy provided the optimum configuration. For the same signal, LR provides an accuracy of 86.4% with seven features.

Fig. 8 Grid search for the optimum number of features for case 2

Table 5 Classification accuracy using the selected feature as input for case 2

It is evident from cases 1 and 2 that the WPT signal DD2 yielded the best accuracy with the ideal number of features. In case 3, the WPT signal DD2 is therefore processed further to determine the fault severity. Figure 9 shows the confusion matrices for case 3 for the LR and SVC classifiers. When all features are considered, the original signal gives 81.77% and 80.88% accuracy with LR and SVC, respectively, and the DD2 signal gives 84% and 95.55%. Figure 10 shows the optimum number of features for detecting the fault severity level. The best accuracy is 94.6%, achieved with eight features for the DD2 signal using the SVC classifier. Table 6 compares the current work with previous work in the literature on condition monitoring of rotating machines. According to the table, the current work obtains better accuracy with fewer features.

Fig. 9 Fault isolation results using all the features for case 3

Fig. 10 Grid search for the optimum number of features for case 3

Table 6 Comparison table demonstrating the implication of present study with reported literature

Conclusions

In this study, vibration signals from an induction motor bearing were used to classify three cases: fault presence, fault type, and fault severity. Eleven statistical features were computed from the original signal and from the two-level wavelet-decomposed signals for cases 1, 2, and 3. The best number of features for classification was selected using mutual information and the ANOVA F test from the sub-band with high F scores. Logistic regression and a support vector classifier were then used to classify each of the three cases, and the results were compared using both the full feature set and the selected features. With the full feature set, the sub-band signal DD2 with SVC gives better results than the LR classifier for all three cases. The grid search method with SVC produced the best results, with three features providing 100% accuracy for case 1, eight features providing 96.3% accuracy for case 2, and eight features providing 94.6% accuracy for case 3. Therefore, it can be concluded that the suggested methodology can be used in practice to detect bearing faults in induction motors with an optimal number of features and high accuracy.