1 Introduction

We propose a statistically based probabilistic neural network (PNN) as the classifier for pattern classification. The PNN was introduced in 1990 by Specht [1] and places the statistical kernel estimator [2] into the framework of radial basis function networks [3]. We train the PNN with the expectation–maximization (EM) method, primarily because it reduces the number of neurons in the network. The proposed model is intended for condition-based monitoring, an area that has attracted growing attention because of the efficiency gains and time savings it offers; this motivates the development of a more fault-tolerant, accurate, and fast diagnosis model.

It is sensible to expect occasional failures of the sensors that collect vibration signals from key locations on a machine, since no sensor can be guaranteed to operate indefinitely without fault. When such mishaps occur, the faulty sensors no longer provide feedback to the model, producing input vectors with missing values. Simply discarding these incomplete input vectors is not practical, because replacing the faulty sensors takes time. A better solution is imputation. The method used here to handle missing data is the regularized EM method [5]. Regularized EM can also handle rank-deficient datasets, in which the number of features exceeds the available sample size.

The EM method used to train the network has its advantages, but it also has weaknesses. In general, it is hard to initialize, and the quality of the final solution depends on the quality of the initial solution [4]. The user must choose the number of clusters and initialize them randomly, typically by trial and error, which makes the outcome stochastic. Therefore, to build an autonomous and deterministic neural network, we use Global k-means to find the optimal number of clusters automatically by minimizing the clustering error. In overview, our model first applies regularized EM to impute any missing values. The completed dataset is then fed into Global k-means to determine the number of clusters, and the result from Global k-means initializes the EM algorithm that trains the PNN.

In Sect. 2, the PNN is briefly discussed. Section 3 presents the E-step and M-step of the EM method together with the flaws of EM. In Sect. 4, two methods of cluster determination, Global k-means and its variant, fast Global k-means, are discussed in brief. Section 5 describes how regularized EM solves the data imputation problem. Experiments on medical benchmark and Westland datasets are presented in Sect. 6, comparing Global k-means with random initialization and with fast Global k-means, testing data imputation using regularized EM, and finally reporting tests on the Westland dataset. Section 7 concludes the paper (Fig. 1).

Fig. 1 Proposed model: PNN with data imputation capabilities

2 Probabilistic neural network

The probabilistic neural network was introduced by Donald Specht in a series of two papers, "Probabilistic neural networks for classification, mapping, or associative memory" in 1988 [6] and "Probabilistic neural networks" in 1990 [1]. This statistically based neural network, built on Bayes theory and Parzen estimators, can be used to solve pattern classification problems. Bayes theory combines the relative likelihood of events with a priori information, which in our case takes the form of interclass mixing coefficients. The Parzen estimator is a classical probability density function estimator.

Let us assume the dataset $X$ is partitioned into $K$ subsets (classes), where $X = X_{1} \cup X_{2} \cup \cdots \cup X_{K}$, each subset $X_{k}$ contains $N_{k}$ samples, and $\sum\nolimits_{k=1}^{K} N_{k} = N$, where $N$ is the total sample size. This four-layered, feed-forward, supervised learning neural network, shown in Fig. 2, reserves the first layer for input neurons and accepts $d$-dimensional input vectors. Each dimension of the input vector is passed to its corresponding input neuron.

Fig. 2 Probabilistic neural network

In the second layer of the PNN, Gaussian basis functions (GBFs) are estimated. The GBF takes the form

$$ \rho_{m,k} \left( X \right) = \frac{1}{{\left( {2\pi \sigma_{m,k}^{2} } \right)^{{\frac{d}{2}}} }}\exp \left( { - \frac{{\left\| {X - \upsilon_{m,k} } \right\|^{2} }}{{2\sigma_{m,k}^{2} }}} \right) $$
(1)

and this specifies the GBF for the mth cluster in the kth class, where $\sigma_{m,k}^{2}$ is the variance, $\upsilon_{m,k}$ is the cluster centroid, and $d$ is the dimension of the input vector. The third layer of the PNN is where the class-conditional probability density function is estimated,

$$ f_{k} (X) = \sum\nolimits_{m = 1}^{{M_{k} }} {\beta_{m,k} \rho_{m,k} (X)} , $$
(2)

where $M_{k}$ is the number of clusters for class $k$ and $\beta_{m,k}$ is the intraclass mixing coefficient,

$$ \sum\nolimits_{m = 1}^{{M_{k} }} {\beta_{m,k} = 1} . $$
(3)

The fourth layer serves as the decision layer, choosing the class with the highest probability. An interclass mixing coefficient, $\alpha_{k}$, is used to improve the accuracy of the result. Since $\alpha_{k}$ is determined from the class sample size $N_{k}$, the $\alpha_{k}$ sum to 1. The output $o_{k}$ gives the probability that the input vector belongs to class $k$,

$$ o_{k} = \alpha_{k} f_{k} (X) , $$
(4)
$$ {\text{decision}} = \arg \max_{k} (o_{k} ). $$
(5)
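
To make the layer structure concrete, here is a minimal NumPy sketch that evaluates Eqs. (1)-(5) for a single input vector; the trained centroids, variances, and mixing coefficients are assumed to be supplied (e.g., by the EM training of Sect. 3), and all function and variable names are illustrative.

```python
import numpy as np

def pnn_classify(x, centroids, variances, betas, alphas):
    """Evaluate the four PNN layers (Eqs. 1-5) for one input vector x.

    centroids[k] : (M_k, d) cluster centroids of class k
    variances[k] : (M_k,)   cluster variances of class k
    betas[k]     : (M_k,)   intraclass mixing coefficients (sum to 1)
    alphas       : (K,)     interclass mixing coefficients (sum to 1)
    """
    d = x.shape[0]
    o = np.empty(len(centroids))
    for k in range(len(centroids)):
        # Layer 2: Gaussian basis functions, Eq. (1)
        diff2 = np.sum((centroids[k] - x) ** 2, axis=1)
        rho = np.exp(-diff2 / (2.0 * variances[k])) \
              / (2.0 * np.pi * variances[k]) ** (d / 2.0)
        # Layer 3: class-conditional density, Eq. (2)
        f_k = np.dot(betas[k], rho)
        # Weight by the interclass mixing coefficient, Eq. (4)
        o[k] = alphas[k] * f_k
    # Layer 4: decision layer picks the most probable class, Eq. (5)
    return int(np.argmax(o)), o
```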

The advantage of the PNN is that, owing to its statistical nature, the network's structure can be interpreted in terms of probability density functions. On the downside, the number of nodes in a PNN can become very large when the training dataset is large, because one neuron is created for each training pattern. This makes the PNN infeasible for large datasets. Therefore, a training method that does not commit every training pattern as a node in the network should be used; for this purpose, we have selected the EM method.

3 Learning algorithm

In the learning algorithm, two model parameters, the mean and the variance, are adjusted to improve classification. In each E-step and M-step, these parameters are updated until the change in the log posterior likelihood function becomes negligible. To compute the new mean and variance values, EM employs a weight parameter, which is also updated at each step.

3.1 Expectation–maximization

Expectation–maximization (EM) [7], introduced by Dempster et al. in 1977, is a powerful iterative procedure that converges to a maximum-likelihood (ML) estimate. EM consists of two steps, the E-step and the M-step, which are iterated until the change in the log posterior likelihood function is minimal,

$$ \log L_{f} = \sum\nolimits_{k = 1}^{K} {\log f_{k} } (X). $$
(6)

In the E-step, the missing/hidden data is estimated, given the observed data and the current parameter estimate. It uses the PDF estimated in the second layer of the PNN, as defined in (1), together with the intraclass mixing coefficient to estimate the weight parameter,

$$ W_{m,k} = \frac{{\beta_{m,k} \rho_{m,k} (X)}}{{\sum\nolimits_{i = 1}^{{M_{k} }} {\beta_{i,k} \rho_{i,k} (X)} }}. $$
(7)

Next comes the M-step, which uses the data estimated in the E-step and the weight parameter, $W_{m,k}$, to form a likelihood function and determine the ML estimate of the parameters. It computes new values for the cluster centroid, $\upsilon_{m,k}$, the variance, $\sigma_{m,k}^{2}$, and the intraclass mixing coefficient, $\beta_{m,k}$, using the weights calculated in the E-step. The parameter updates are given below:

$$ \upsilon_{m,k} = \frac{{\sum\nolimits_{n = 1}^{{N_{k} }} {W_{m,k} (X_{n} )X_{n} } }}{{\sum\nolimits_{n = 1}^{{N_{k} }} {W_{m,k} (X_{n} )} }}, $$
(8)
$$ \sigma_{m,k}^{2} = \frac{{\sum\nolimits_{n = 1}^{{N_{k} }} {W_{m,k} (X_{n} )\left\| {X_{n} - \upsilon_{m,k} } \right\|^{2} } }}{{d\sum\nolimits_{n = 1}^{{N_{k} }} {W_{m,k} (X_{n} )} }}, $$
(9)
$$ \beta_{m,k} = \frac{1}{{N_{k} }}\sum\nolimits_{n = 1}^{{N_{k} }} {W_{m,k} (X_{n} )} . $$
(10)
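
A minimal NumPy sketch of one E-step/M-step pass for the clusters of a single class, following Eqs. (7)-(10), is given below; the loop over classes, the convergence test against Eq. (6), and all function and variable names are our own assumptions.

```python
import numpy as np

def em_step(X, centroids, variances, betas):
    """One EM iteration for the clusters of one class (Eqs. 7-10).

    X         : (N_k, d) training vectors of this class
    centroids : (M_k, d), variances : (M_k,), betas : (M_k,)
    """
    N_k, d = X.shape
    # E-step: responsibilities W[n, m] of cluster m for sample n, Eq. (7)
    diff2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # (N_k, M_k)
    rho = np.exp(-diff2 / (2.0 * variances)) \
          / (2.0 * np.pi * variances) ** (d / 2.0)
    W = betas * rho
    W /= W.sum(axis=1, keepdims=True)
    # M-step: update centroids, variances, and mixing coefficients, Eqs. (8)-(10)
    W_sum = W.sum(axis=0)                                   # (M_k,)
    centroids = (W.T @ X) / W_sum[:, None]                  # Eq. (8)
    diff2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    variances = (W * diff2).sum(axis=0) / (d * W_sum)       # Eq. (9)
    betas = W_sum / N_k                                     # Eq. (10)
    return centroids, variances, betas
```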

The EM algorithm is guaranteed to converge to an ML estimate [8, 9], and its convergence rate is usually quite fast [10]. EM also produces fewer neurons than the traditional PNN of Donald Specht. Another advantage is that it requires no computation of gradients or Hessians, which reduces the computational complexity of the network. Although EM is a good choice of training method, it is not autonomous: it requires the number of clusters to be specified at initialization, and the quality of that initialization strongly affects the final outcome of the network. To address this, Global k-means is used as a precursor to determine how many clusters are needed for a given dataset before the data are fed into the EM-trained PNN.

4 Cluster initialization

Part of the problem faced by the model is determining the number of clusters prior to learning. This is usually done by the user through trial and error, and random initialization does not give deterministic results. Global k-means and fast Global k-means overcome these problems.

4.1 Global k-means

Introduced by Likas et al. in the paper "The Global k-means clustering algorithm" in 2003, Global k-means partitions the given dataset into M clusters so that a clustering criterion is optimized. The common clustering criterion is the sum of squared Euclidean distances between each data point and the centroid of the cluster it is assigned to:

$$ E(\upsilon_{1} , \ldots ,\upsilon_{M} ) = \sum\nolimits_{i = 1}^{N} {\sum\nolimits_{m = 1}^{M} {I(X_{i} \in C_{m} )\left\| {X_{i} - \upsilon_{m} } \right\|^{2} } } . $$
(11)

Here $I(X_{i} \in C_{m})$ equals 1 if the data point $X_{i}$ belongs to cluster $C_{m}$ and 0 otherwise. Global k-means builds on the standard k-means algorithm, which finds locally optimal solutions by keeping the clustering error to a minimum. k-means starts by placing the cluster centers arbitrarily and, at each step, moves them so as to reduce the clustering error. Its downside is sensitivity to the initial positions of the cluster centers; to compensate, k-means can be run several times, each time from a different starting point. The gist of Global k-means is that, instead of trying to find all cluster centers at once, it proceeds incrementally, locating one cluster center at a time.

Assume a K-cluster problem is to be solved. The algorithm starts by solving the one-cluster problem, whose cluster center is simply the centroid of the given dataset. The next step adds a second cluster at its optimal position given that the first cluster center has already been found. To do this, N executions of the k-means algorithm are performed, each initialized with the first cluster center found for the one-cluster problem and with the second center placed at $x_{n}$, where $1 \le n \le N$. The final answer for the two-cluster problem is the best of these N k-means solutions. Let $(c_{1}(k), \ldots ,c_{k}(k))$ denote the final solution of the k-clustering problem. The problems are solved iteratively, from the one-cluster problem up to the $(k-1)$-cluster problem; the k-clustering problem is then solved by performing N executions of k-means with starting positions $(c_{1}(k-1), \ldots ,c_{k-1}(k-1),x_{n})$. A simple sketch of the procedure is given below.
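
Since the pseudocode figure is not reproduced here, the following Python sketch illustrates the incremental procedure; the plain k-means refinement is included as a helper, and all function and variable names are our own.

```python
import numpy as np

def kmeans(X, centers, n_iter=100):
    """Plain k-means refinement starting from the given centers."""
    for _ in range(n_iter):
        # assign every point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == m].mean(axis=0) if np.any(labels == m) else centers[m]
                        for m in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    error = d2.min(axis=1).sum()                   # clustering error, Eq. (11)
    return centers, error

def global_kmeans(X, K):
    """Incrementally solve the 1, 2, ..., K-clustering problems (Likas et al.)."""
    centers = X.mean(axis=0, keepdims=True)        # 1-cluster solution: the data centroid
    for k in range(2, K + 1):
        best = None
        for x_n in X:                              # try every data point as the new center
            trial, err = kmeans(X, np.vstack([centers, x_n]))
            if best is None or err < best[1]:
                best = (trial, err)
        centers = best[0]
    return centers
```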

With the final solution $(c_{1}(K), \ldots ,c_{K}(K))$, Global k-means has in fact found solutions of all k-clustering problems for $k = 1, \ldots ,K$ without any further computation. The underlying assumption is natural: we expect the solution of the k-clustering problem to be reachable (through local search) from the solution of the $(k-1)$-clustering problem once the additional center is placed at an appropriate position within the dataset [11]. The downside is that the computational time of Global k-means can be rather long.

4.2 Fast Global k-means

Using this method reduces the computational time taken by the Global k-means algorithm. The core difference is that fast Global k-means does not perform N executions of k-means with starting positions $(c_{1}(k-1), \ldots ,c_{k-1}(k-1),x_{n})$. Instead, the algorithm computes an upper bound, $E_{n} \le E - b_{n}$, on the resulting error $E_{n}$ for every candidate $x_{n}$. Here E is the error value of the $(k-1)$-clustering problem and $b_{n}$ is defined as

$$ b_{n} = \sum\nolimits_{j = 1}^{N} {\max (d_{k - 1}^{j} - \left\| {x_{n} - x_{j} } \right\|^{2} ,0)} $$
(12)

where $d_{k-1}^{j}$ is the squared Euclidean distance between $x_{j}$ and the cluster centroid to which it belongs. After computing $b_{n}$, the point $x_{n}$ that maximizes $b_{n}$ is selected as the new cluster centroid; maximizing $b_{n}$ minimizes the bound on $E_{n}$, which is the error in question. The new cluster centroid $x_{n}$ gains all data points whose squared Euclidean distance from $x_{n}$ is smaller than the distance $d_{k-1}^{j}$ from their previous cluster centroid, so the clustering error is reduced by $d_{k-1}^{j} - \left\| x_{n} - x_{j} \right\|^{2}$ for every reassigned data point. The k-means algorithm is then executed to find the solution of the k-clustering problem. Since k-means is guaranteed to decrease the clustering error at each step, $E - b_{n}$ upper bounds the error measure that would be obtained if the algorithm were run to convergence after inserting the new center at $x_{n}$ (this is the error measure used in the Global k-means algorithm) [11].
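
As a minimal sketch of this selection rule (our own function and variable names, following the same NumPy conventions as the earlier sketch), the bound of Eq. (12) can be computed directly from pairwise squared distances:

```python
import numpy as np

def fast_global_kmeans_insertion(X, centers):
    """Choose the next center for an existing (k-1)-cluster solution via Eq. (12).

    Returns the initial centers for the k-clustering problem; a single run of
    k-means (e.g., the helper sketched in Sect. 4.1) then refines them.
    """
    # squared distance of every point to its nearest (k-1)-solution centroid, d^j_{k-1}
    d_prev = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    # pairwise squared distances ||x_n - x_j||^2 between candidate centers and points
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # b_n (Eq. 12): guaranteed reduction in clustering error for a center at x_n
    b = np.maximum(d_prev[None, :] - pair, 0.0).sum(axis=1)
    # maximizing b_n minimizes the upper bound E - b_n on the new error E_n
    n_best = int(np.argmax(b))
    return np.vstack([centers, X[n_best]])
```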

5 Data imputation for missing features

As discussed earlier, faulty sensors do occur, and merely discarding input vectors can compromise condition-based monitoring. A more practical solution is to impute the missing values and continue the classification using the imputed input vector.

5.1 Regularized EM

With an estimated mean and covariance matrix, the missing values in a dataset can be imputed with their conditional expectation values given the available values. In the regularized EM algorithm, the regularized regression parameters are computed using ridge regression, also known as Tikhonov regularization. In ridge regression, a continuous regularization parameter controls the degree of regularization imposed on the regression coefficients [5]. This parameter is determined by generalized cross-validation (GCV) so that the expected mean-squared error of the imputed values is minimized. The conventional EM algorithm assumes that values in the dataset are missing at random, and this assumption carries over to the regularized EM algorithm.

We first briefly discuss the conventional EM algorithm. In each EM iteration, the estimates of the mean and of the covariance matrix are updated in three steps. First, for each record with missing values, the regression parameters of the variables with missing values on the variables with available values are computed from the current estimates of the mean and covariance matrix. Second, the missing values in a record are filled in with their conditional expectation values given the available values and the estimated mean and covariance matrix, the conditional expectation being the product of the available values and the estimated regression coefficients. Third, the mean and the covariance matrix are re-estimated: the mean as the sample mean of the completed dataset, and the covariance matrix as the sum of the sample covariance matrix of the completed dataset and the contributions of the conditional covariance matrices of the imputation errors in the records with imputed values [5, 12].

Let the dataset X contain n records and p variables. The conventional EM assumes that n exceeds p, so that the sample covariance matrix is positive definite. From the incomplete dataset, estimates of the mean, $\mu$, and of the covariance matrix, $\Upsigma$, are computed. For a given record $x = X_{i}$ with missing values, let $x_{a}$ consist of the $p_{a}$ variables whose values are available in the record and $x_{m}$ consist of the remaining $p_{m}$ variables whose values are missing. Split $\mu$ accordingly into $\mu_{a}$, the mean of the variables with available values, and $\mu_{m}$, the mean of the variables with missing values. For each record with missing values, $x = X_{i}$, $i = 1, \ldots ,n$, the relationship between the available and missing parts is modeled by the linear regression model

$$ x_{m} = \mu_{m} + \left( {x_{a} - \mu_{a} } \right){\text{B}} + e, $$
(13)

where B is the matrix of regression coefficients and the residual e is a random vector with mean zero and unknown covariance matrix C. Let $\mu^{t}$ and $\Upsigma^{t}$ denote the mean and covariance matrix at the tth iteration. $\Upsigma^{t}$ contains $\Upsigma_{aa}$, the covariance of the variables with available values in the given record, and $\Upsigma_{mm}$, the covariance of the variables with missing values. With the estimated cross-covariance $\Upsigma_{am} = \Upsigma_{ma}^{T}$, the regression coefficient matrix is

$$ B = \Upsigma_{aa}^{ - 1} \Upsigma_{am} . $$
(14)

By substituting B, an estimate of the residual covariance matrix is

$$ C = \Upsigma_{mm} - \Upsigma_{ma} \Upsigma_{aa}^{ - 1} \Upsigma_{am} . $$
(15)

After the missing values in all records are imputed, the new estimate of the mean of the records would be

$$ \mu^{t + 1} = \frac{1}{n}\sum\limits_{i = 1}^{n} {X_{i} .} $$
(16)

The new estimate of the covariance matrix would in turn be

$$ \Upsigma^{t + 1} = \frac{1}{{\tilde{n}}}\sum\limits_{i = 1}^{n} {(S_{i}^{t} - (\mu^{t + 1} )^{T} \mu^{t + 1} ),} $$
(17)

where $S_{i}^{t}$ is the conditional expectation, which comprises the three blocks $x_{a}^{T} x_{a}$, $x_{a}^{T} x_{m}$, and $x_{m}^{T} x_{m} + C$, and $\tilde{n}$ is the number of degrees of freedom of the sample covariance matrix of the completed dataset. The EM iterations are stopped when the estimates $\mu^{t}$ and $\Upsigma^{t}$ and the imputed values $x_{m}$ stop changing appreciably. Regularized EM is similar to conventional EM, except that it replaces $\Upsigma_{aa}^{-1}$ with $(\Upsigma_{aa} + h^{2}D)^{-1}$, where D is a diagonal matrix consisting of the diagonal elements of $\Upsigma_{aa}$ and h is the regularization parameter. The regularization parameter is determined by minimizing the generalized cross-validation function

$$ G(h) = \frac{{\tilde{n}^{2} }}{{T^{2} (h)}}{\text{tr}}C_{h} , $$
(18)

where

$$ T(h) = {\text{tr}}(I - X_{a} X_{a}^{\tau } ), $$
(19)
$$ X_{a}^{\tau } = (X_{a}^{T} X_{a} + \tilde{n}h^{2} I)^{ - 1} X_{a}^{T} , $$
(20)

in which $X_{a}^{\tau}$ is the pseudoinverse of the data matrix $X_{a}$.
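
To make the per-record computation concrete, the following sketch imputes the missing entries of a single record using the ridge-regularized regression coefficients, i.e., Eq. (14) with $\Upsigma_{aa}^{-1}$ replaced by $(\Upsigma_{aa} + h^{2}D)^{-1}$; the regularization parameter h is passed in rather than chosen by GCV, and the function and variable names are our own.

```python
import numpy as np

def impute_record(x, mu, Sigma, h):
    """Fill the missing entries (NaN) of one record x from current estimates mu, Sigma.

    Uses the ridge-regularized regression coefficients
        B = (Sigma_aa + h^2 D)^(-1) Sigma_am            (regularized form of Eq. 14)
    and the conditional expectation x_m = mu_m + (x_a - mu_a) B   (Eq. 13 with e = 0).
    """
    m = np.isnan(x)                        # indices of missing variables
    a = ~m                                 # indices of available variables
    if not m.any():
        return x.copy()
    Sigma_aa = Sigma[np.ix_(a, a)]
    Sigma_am = Sigma[np.ix_(a, m)]
    D = np.diag(np.diag(Sigma_aa))         # diagonal of Sigma_aa used for regularization
    B = np.linalg.solve(Sigma_aa + h ** 2 * D, Sigma_am)
    x_filled = x.copy()
    x_filled[m] = mu[m] + (x[a] - mu[a]) @ B
    return x_filled
```

Iterating this imputation over all records and then re-estimating the mean and covariance matrix as in Eqs. (16) and (17), until the estimates and imputed values stabilize, gives the regularized EM scheme described above.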

6 Experimental results

6.1 General description

First, a test was conducted using the EM-based PNN with two types of initialization, random and Global k-means, on the medical benchmark datasets together with the Iris dataset. Then the EM-based PNN initialized by Global k-means was compared with the version initialized by fast Global k-means on the medical benchmark datasets, to observe the improvement in computational time and any difference in classification performance. This was followed by imputation of datasets with missing values using regularized EM, using the Iris and Pima datasets; missing values ranging from 0 to 50% were generated completely at random. Finally, tests were conducted on the Westland vibration dataset: classification of Westland using the EM-based PNN with Global k-means, followed by data imputation with 0 to 50% missing values using regularized EM.

6.2 Comparative tests between randomly initialized and Global k-means

A comparative study was done on the effects of using Global k-means to initialize the values of the parameters in EM and without that initialization. The Iris dataset [13] and the medical datasets, consisting of data from cancer, dermatology, hepato, heart, and Pima were used.

The Iris dataset consists of 150 samples and four input features. It was tested on the PNN trained by the EM algorithm with randomly initialized cluster centroids and on the EM with Global k-means initialization. Both methods were executed with heteroscedastic and homoscedastic PNNs, using tenfold validation. The Iris dataset was posed to Global k-means as a 10-clustering problem, and the number of cluster centroids was returned on the basis of minimizing the squared Euclidean distance between each data point in a cluster and its centroid. This number was then used to set the cluster parameter for random initialization, so that it operated under conditions similar to those of Global k-means.

The mean accuracy of the homoscedastic PNN with random initialization is 96.29%, and the heteroscedastic version reports 95.36%; in both cases they were outdone by the EM with Global k-means initialization, whose mean accuracies were 97.86 and 95.71% for the homoscedastic and heteroscedastic PNN, respectively. Although random initialization was supplied with the number of clusters determined by Global k-means, Global k-means still achieved the better classification rate (Table 1).

Table 1 Correct classification rate for Iris dataset

The cancer dataset contains 569 samples with 30 dimensions, the dermatology dataset contains 358 samples with 34 dimensions, and the hepato dataset contains 536 samples with nine dimensions. The heart dataset contains 270 samples with 13 dimensions and two output labels, "0" for absence and "1" for presence of heart disease. The Pima dataset, available from the UCI machine learning repository [14], contains 768 samples with eight dimensions and two classes, diabetes-positive and diabetes-negative. Tenfold validation was employed. For all the above datasets, Global k-means was set to solve a larger clustering problem than required, and in every case it returned a smaller number of clusters that is optimal with respect to the clustering criterion. This number was then fed into the EM with random initialization.

The medical datasets showed improved performance for the EM with Global k-means initialization, in both homoscedastic and heteroscedastic PNNs, over the results with random initialization (Tables 2, 3). Although both were in practice given the same number of clusters, for most of the datasets even the maximum accuracy of the EM with random initialization does not exceed the mean accuracy of the EM initialized by Global k-means.

Table 2 Correct classification rates for the medical datasets by using homoscedastic PNN
Table 3 Correct classification rates for the medical datasets by using heteroscedastic PNN

6.3 Comparative tests between Global k-means and fast Global k-means

To reduce the computational time without sacrificing classification performance, we opted for the fast Global k-means method. A comparison between Global k-means and fast Global k-means using both heteroscedastic and homoscedastic EM-trained PNNs is shown below. Tests were conducted on the medical datasets using tenfold validation, with Global k-means and fast Global k-means set to solve a larger clustering problem than required.

As the results in Table 4 and Fig. 3 show, fast Global k-means provides comparable correct classification rates on the benchmark medical datasets. Moreover, it accomplishes its purpose of cutting down the computational time, as Table 5 clearly shows.

Table 4 Comparison of correct classification rates
Fig. 3 Comparison of correct classification rates

Table 5 Comparison of execution times in seconds

6.4 Tests on data imputation

Next, we compare classification results after imputing varying percentages of missing data with the results on the original complete dataset, using the Iris (Table 6; Fig. 4) and Pima (Table 7; Fig. 5) datasets. Tenfold validation was employed on heteroscedastic and homoscedastic PNNs using EM and Global k-means.

Table 6 Correct classification rates for Iris
Fig. 4 Classification results for Iris dataset

Table 7 Correct classification rate for Pima
Fig. 5 Classification results for Pima dataset

The method imputes missing values that were randomly removed from the complete Iris and Pima datasets, with missing value percentages from 10 to 50%. Using the proposed method to preprocess the data before it enters the neural network for training, the performance degradation is acceptable.

6.5 Westland vibration dataset

A real world case study was done to test the EM-trained PNN with initialization parameters obtained from the execution of Global k-means using the popular benchmark dataset Westland [15]. This dataset consists of vibration time-series data, which is gathered from an aft main power transmission of a US Navy CH-46E helicopter by placing eight accelerometers at the known fault-sensitive locations of the helicopter gearbox. The data was recorded for various faults including a no-defect case (Table 8).

Table 8 Westland helicopter gearbox data description

This dataset consists of nine torque levels, but for our experiment purposes, only the 100% torque level on Sensors 1–4 is used. As the number of features from this dataset is quite substantial, feature reduction was needed. Wavelet packet feature extraction [16] was used to reduce the dimension of the input vectors without sacrificing too much of the classification performance.

Wavelet packets, a generalization of wavelet bases, are alternative bases that are formed by taking linear combinations of the usual wavelet functions [17, 18]. These bases inherit properties such as orthonormality and time–frequency localization from their corresponding wavelet functions [16]. Wavelet packet functions can be defined as

$$ W_{j,k}^{n} (t) = 2^{j/2} W^{n} (2^{j} t - k), $$
(21)

where n is the modulation or oscillation parameter, j is the scale index, and k is the translation parameter.

For a function f, the wavelet packet coefficients can be calculated as given below

$$ w_{j,n,k} = \left\langle {f,W_{j,k}^{n} } \right\rangle = \int {f(t)W_{j,k}^{n} (t){\text{d}}t} . $$
(22)

Decomposition of the vibration signal is done using the wavelet packet transform (WPT) to extract the time–frequency-dependent information. For each vibration signal segment, a full decomposition is performed up to the seventh level. This produces a group of $2^{r+1} - 2$ sets of coefficients, where r is the resolution level; in our case this yields 254 sets of coefficients, each corresponding to a wavelet packet node. For the coefficients of every wavelet packet node, the wavelet packet node energy $e_{j,n}$ is computed and acts as the extracted feature:

$$ e_{j,n} = \sum\limits_{k} {w_{j,n,k}^{2} } . $$
(23)
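
As an illustration, the node energies of Eq. (23) can be computed from a full wavelet packet decomposition, for example with the PyWavelets package; the wavelet choice ('db4') and the segment variable are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np
import pywt

def wavelet_packet_energies(segment, wavelet="db4", max_level=7):
    """Full WPT up to max_level and the energy of every node, Eq. (23).

    Returns a feature vector of 2^(max_level + 1) - 2 node energies
    (254 features for max_level = 7), ordered level by level.
    """
    wp = pywt.WaveletPacket(data=segment, wavelet=wavelet,
                            mode="symmetric", maxlevel=max_level)
    energies = []
    for level in range(1, max_level + 1):
        for node in wp.get_level(level, order="freq"):
            energies.append(np.sum(node.data ** 2))   # e_{j,n} = sum_k w_{j,n,k}^2
    return np.asarray(energies)
```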

A statistically based feature selection criterion, Fisher's criterion [19], was then applied to identify the features that provide the most discrimination among the classes of Westland. As a result, the number of features for Westland was reduced to eight, and this modified dataset was fed into our model to test data imputation with regularized EM. Tenfold validation was used.
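
The paper does not reproduce the exact expression of Fisher's criterion, so the sketch below uses one common multi-class form, the ratio of between-class to within-class scatter per feature, purely as an illustration of the selection step; all names are our own.

```python
import numpy as np

def fisher_scores(features, labels):
    """Fisher score per feature: between-class scatter over within-class scatter."""
    classes = np.unique(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in classes:
        Xc = features[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)      # small constant guards against zero variance

def select_top_features(features, labels, n_keep=8):
    """Keep the n_keep highest-scoring features (eight were retained for Westland)."""
    keep = np.argsort(fisher_scores(features, labels))[::-1][:n_keep]
    return features[:, keep], keep
```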

The performance obtained by the proposed system on the eight-feature, 776-sample Westland dataset reinforces the positive results observed on the medical benchmark datasets. Westland was also tested for data imputation with missing values ranging from 0 to 50%, using Sensors 1–4 only (Tables 9, 10, 11, 12, 13; Figs. 6, 7, 8, 9). Tests were conducted on heteroscedastic and homoscedastic PNNs using tenfold validation, with missing values produced at random.

Table 9 Correct classification rates for Westland using homoscedastic and heteroscedastic PNNs
Table 10 Correct classification rates for Sensor 1
Table 11 Correct classification rates for Sensor 2
Table 12 Correct classification rates for Sensor 3
Table 13 Correct classification rates for Sensor 4
Fig. 6 Correct classification rates for Sensor 1

Fig. 7 Correct classification rates for Sensor 2

Fig. 8 Correct classification rates for Sensor 3

Fig. 9 Correct classification rates for Sensor 4

Much like the imputation tests on the Iris and Pima datasets, the degradation of classification performance for Sensors 1–4 of the Westland dataset is acceptable, and the classification rate does not plummet at higher missing value percentages. This shows that regularized EM is a viable means of data imputation for our model in cases where discarding input vectors with missing values is too costly.

7 Conclusions

Although EM is an excellent method for training the PNN model, it can still be improved. To make our model autonomous, the Global k-means algorithm was applied before EM to find the number of clusters by minimizing the clustering error. Comparative results indicated that, even when given the same number of clusters as Global k-means, EM with random initialization performed worse; EM with Global k-means initialization therefore yields a good autonomous and deterministic PNN. We further refined the model through comparative tests between fast Global k-means and Global k-means, examining the correct classification rates and computational times. The results favored fast Global k-means, which provided comparable accuracies with much improved computational times. Regularized EM was then used as a preprocessing step to overcome the missing data problem, which can be caused simply by faulty sensors. Results for both Iris and Pima showed acceptable degradation of the classification rate from 0% up to 50% missing data. The combination of Global k-means and regularized EM was further tested on the reduced eight-feature version of the Westland dataset, using data from Sensors 1–4, and the results were promising. Regularized EM adds flexibility, as the proposed model can handle missing data through imputation rather than discarding imperfect input vectors. The model presented in this paper is a suitable diagnosis model for industrial condition monitoring of assets such as machines, classifying them into their fault modes based on the input vectors received from the sensors placed on them.