1 Introduction

Breast cancer has become a major cause of death among women in developed countries [1, 2]. Over one in ten women in Europe and one in eight in the United States may be affected by breast cancer during their lifetime [3].

Early diagnosis requires an accurate and reliable procedure that allows physicians to distinguish benign breast tumors from malignant ones; thus, finding an accurate and effective diagnostic method is very important. Biopsy is the most reliable way to determine whether a tumor is benign or malignant. However, it is invasive and expensive, and the rate of positive findings for cancer at biopsy is low, between 10 and 31% [4–6].

Much effort has been devoted over the past decade to the development and improvement of pattern classification models for breast cancer detection [7–9]. Several researchers have used statistical and artificial intelligence techniques to successfully “predict” breast cancer. Basically, the objective of these prediction techniques is to assign patients either to a “benign” group that does not have breast cancer or to a “malignant” group for which there is strong evidence of breast cancer.

Recently, local linear wavelet neural networks [10, 11] have been introduced as a very effective scheme for statistical pattern recognition problems and complex non-linear prediction.

In this paper, a local linear wavelet neural network (LLWNN), extending the work in [10, 11], is proposed for breast cancer detection. The connection weights between the hidden layer units and the output units are replaced by a local linear model, and the parameters of the network are updated using recursive least squares (RLS). The commonly used learning algorithm for WNNs is the gradient descent method, whose disadvantages are slow convergence and a tendency to become trapped in local minima. Because of this, RLS with adaptive diversity learning is proposed for training the LLWNN. Simulation results for the breast cancer pattern classification problem were compared with some other classification techniques, i.e., an RBFNN trained by RLS, an RBFNN trained by a Kalman filter, and an RBFNN trained by back-propagation. Details about the Kalman filter can be obtained from [12, 13]. The results show the effectiveness of the proposed method. The main contributions of this paper are (1) the LLWNN provides a more parsimonious interpolation in high-dimensional spaces when the modeling samples are sparse, and (2) a novel training algorithm for the LLWNN is proposed. The paper is organized as follows. The LLWNN is introduced in Sect. 2. The RLS learning algorithm for training the LLWNN is described in Sect. 3. A short discussion as well as experimental results obtained on the Wisconsin Breast Cancer (WBC) pattern classification problem is given in Sect. 4. Finally, concluding remarks are drawn in the last section, i.e., Sect. 5.

2 Local linear wavelet neural network

In terms of wavelet transformation theory, wavelets in the following form:

$$ \psi_{i} (x) = \left| {a_{i} } \right|^{ - 1/2} \psi \left( {\frac{{x - b_{i} }}{{a_{i} }}} \right) $$
(1)
$$ X = (x_{1} ,x_{2} ,\ldots ,x_{n} ), $$
$$ a_{i} = (a_{i1} ,a_{i2} ,\ldots ,a_{in} ), $$
$$ b_{i} = (b_{i1} ,b_{i2} ,\ldots ,b_{in} ), $$

are a family of functions generated from a single function ψ(x) by the operations of dilation and translation. ψ(x), which is localized in both the time space and the frequency space, is called the mother wavelet, and the parameters a_i and b_i are named the scale and translation parameters, respectively. X represents the inputs to the WNN model.

In the standard form of WNN, the output of a WNN is given by

$$ f (x )= \sum\limits_{i = 1}^{M} {{{\omega}}_{i} } {{\psi}}_{i} (x )= \sum\limits_{i = 1}^{M} {{{\omega}}_{i} } \left| {a_{i} } \right|^{ - 1 / 2} {{\psi}}\left( {\frac{{x - b_{i} }}{{a_{i} }}} \right) $$
(2)

where ψ_i is the wavelet activation function of the ith unit of the hidden layer, and ω_i is the weight connecting the ith unit of the hidden layer to the output layer unit. Note that, for an n-dimensional input space, the multivariate wavelet basis function can be calculated by the tensor product of n single wavelet basis functions as follows

$$ \psi (x) = \prod\limits_{i = 1}^{n} {\psi (x_{i} )} $$
(3)

Obviously, the localization of the ith unit of the hidden layer is determined by the scale parameter a_i and the translation parameter b_i. According to previous research, these two parameters can either be predetermined based upon wavelet transformation theory or be determined by a training algorithm. Note that the above WNN is a kind of basis function neural network in the sense that the wavelets constitute the basis functions. An intrinsic feature of basis function networks is the localized activation of the hidden layer units, so that the connection weights associated with the units can be viewed as locally accurate piecewise constant models whose validity for a given input is indicated by the activation functions. Compared with the multilayer perceptron neural network, this local capacity provides some advantages, such as learning efficiency and structure transparency. However, it also leads to the main problem of basis function networks: owing to the crudeness of the local approximation, a large number of basis function units have to be employed to approximate a given system. A shortcoming of the WNN is therefore that, for higher-dimensional problems, many hidden layer units are needed.
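
To make the notation concrete, the following sketch (an illustrative Python/NumPy example, not the authors' implementation; the Mexican hat mother wavelet and the elementwise interpretation of the dilation are assumptions, since the paper does not fix a particular ψ) evaluates one multivariate wavelet unit of Eqs. (1) and (3) and the WNN output of Eq. (2).

```python
import numpy as np

def mexican_hat(t):
    """A common choice of mother wavelet psi(t) (assumption)."""
    return (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def wavelet_basis(x, a_i, b_i):
    """psi_i(x): tensor product of Eq. (3) applied to the dilated and translated
    inputs (x - b_i) / a_i of Eq. (1), with the |a_i|^{-1/2} normalization."""
    t = (x - b_i) / a_i
    return np.prod(np.abs(a_i)) ** -0.5 * np.prod(mexican_hat(t))

def wnn_output(x, w, a, b):
    """Standard WNN output of Eq. (2): f(x) = sum_i w_i * psi_i(x)."""
    return sum(w[i] * wavelet_basis(x, a[i], b[i]) for i in range(len(w)))

# toy usage: M = 3 hidden units, n = 2 inputs
rng = np.random.default_rng(0)
M, n = 3, 2
w = rng.normal(size=M)               # output weights omega_i
a = rng.uniform(0.5, 2.0, (M, n))    # scale (dilation) parameters a_i
b = rng.uniform(-1.0, 1.0, (M, n))   # translation parameters b_i
print(wnn_output(np.array([0.3, -0.7]), w, a, b))
```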

The local linear wavelet neural network is in fact a modification of the WNN. The architecture of the proposed LLWNN is shown in Fig. 1. Its output in the output layer is given by

$$ \begin{aligned} y = & \sum\limits_{i = 1}^{M} {(\omega_{i0} + \omega_{i1} x_{1} + \cdots + \omega_{in} x_{n} )\psi_{i} (x)} \\ = & \sum\limits_{i = 1}^{M} {(\omega_{i0} + \omega_{i1} x_{1} + \cdots + \omega_{in} x_{n} )\left| {a_{i} } \right|^{ - 1/2} \psi \left( {\frac{{x - b_{i} }}{{a_{i} }}} \right)} \\ \end{aligned} $$
(4)

where X = [x_1, x_2, …, x_n]. Instead of the straightforward weight ω_i (a piecewise constant model), a linear model

$$ v_{i} = \omega_{i0} + \omega_{i1} x_{1} + \cdots + \omega_{in} x_{n} $$
(5)

is introduced. The activities of the linear models v_i (i = 1, 2, …, M) are determined by the associated locally active wavelet functions \( {{\psi}}_{i} (x ) \) (i = 1, 2, …, M); thus, v_i is only locally significant. The motivations for introducing the local linear models into a WNN are as follows: (1) local linear models have been studied in some neuro-fuzzy systems and have shown good performance [14, 15], and (2) local linear models should provide a more parsimonious interpolation in high-dimensional spaces when the modeling samples are sparse. The scale and translation parameters and the local linear model parameters are randomly initialized at the beginning and are optimized by the recursive least squares algorithm discussed in the following section.
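
A minimal sketch of the LLWNN output of Eqs. (4) and (5) follows, reusing the same (assumed) Mexican hat tensor-product basis as above; it only illustrates how the constant weight ω_i is replaced by the local linear model v_i and is not the authors' code.

```python
import numpy as np

def mexican_hat(t):
    return (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def llwnn_output(x, W, a, b):
    """LLWNN output of Eq. (4). W has shape (M, n+1): W[i, 0] is omega_i0 and
    W[i, 1:] holds omega_i1 ... omega_in of the local linear model of Eq. (5)."""
    x_aug = np.concatenate(([1.0], x))           # [1, x_1, ..., x_n]
    y = 0.0
    for i in range(W.shape[0]):
        t = (x - b[i]) / a[i]
        psi_i = np.prod(np.abs(a[i])) ** -0.5 * np.prod(mexican_hat(t))
        v_i = W[i] @ x_aug                       # local linear model v_i, Eq. (5)
        y += v_i * psi_i                         # locally weighted contribution
    return y

# toy usage: M = 4 hidden units, n = 9 inputs (the WBC feature count)
rng = np.random.default_rng(1)
M, n = 4, 9
W = rng.normal(size=(M, n + 1))        # local linear model parameters
a = rng.uniform(0.5, 2.0, (M, n))      # scale parameters, randomly initialized
b = rng.uniform(0.0, 1.0, (M, n))      # translation parameters, randomly initialized
print(llwnn_output(rng.uniform(0.0, 1.0, n), W, a, b))
```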

Fig. 1 A local linear wavelet neural network

3 Recursive least square

The recursive least squares (RLS) algorithm is a parameter identification technique. In the RLS algorithm, two variables are involved in the recursions (those with time index i − 1): \( \hat{w}(i - 1) \) and P(i − 1). We must provide initial values for these variables in order to start the recursions.

If we have some a priori information about the parameters \( \hat{w} \), this information can be used to initialize the algorithm; otherwise, the typical initialization is \( \hat{w} (0) = 0 \). The matrix P(i) is given by

$$ P(i) = \left[ {\sum\limits_{n = 1}^{i} {\lambda^{i - n} \psi (n)\psi (n)^{T} } } \right]^{ - 1} $$

and the exact initialization of the recursions uses a small initial segment of the data \( \psi (i_{1} ),\psi (i_{1} + 1), \ldots ,\psi (0) \) to compute

$$ P(0) = \left[ {\sum\limits_{n = i_{1} }^{0} {\lambda^{ - n} \psi (n )\psi (n )^{T} } } \right]^{ - 1} $$

All the necessary equations to form the RLS algorithm are

$$ k(i) = \frac{{P(i - 1)\varphi^{T} (i)}}{{\lambda + \varphi (i)P(i - 1)\varphi^{T} (i)}} $$
(6)
$$ w_{j} (i) = w_{j} (i - 1) + k(i)\left[ {d_{j} (i) - w_{j} (i - 1)\varphi^{T} (i)} \right] $$
(7)
$$ P(i) = \frac{1}{\lambda }\left[ {P(i - 1) - k(i)\varphi (i)P(i - 1)} \right] $$
(8)

where λ is the forgetting factor, a real number between 0 and 1, P(0) = a^{−1} I, where a is a small positive number, and w_j(0) = 0.
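
The recursion of Eqs. (6)–(8) can be sketched as follows. This is an illustrative NumPy implementation for a generic regressor vector φ(i) and desired output d(i); to train the LLWNN one would stack, for each sample, the terms ψ_i(x)·[1, x_1, …, x_n] of all hidden units into φ(i), but that pairing is our reading of the method rather than code taken from the paper.

```python
import numpy as np

class RLS:
    """Recursive least squares for w such that d(i) ≈ w · phi(i), Eqs. (6)-(8)."""

    def __init__(self, dim, lam=0.99, alpha=1e-3):
        self.lam = lam                      # forgetting factor, 0 < lambda <= 1
        self.w = np.zeros(dim)              # w(0) = 0
        self.P = np.eye(dim) / alpha        # P(0) = a^{-1} I with a small

    def update(self, phi, d):
        P_phi = self.P @ phi
        k = P_phi / (self.lam + phi @ P_phi)                        # gain, Eq. (6)
        self.w = self.w + k * (d - self.w @ phi)                    # weights, Eq. (7)
        self.P = (self.P - np.outer(k, phi) @ self.P) / self.lam    # covariance, Eq. (8)
        return self.w

# toy usage: identify a 3-parameter linear model from noisy samples
rng = np.random.default_rng(2)
true_w = np.array([0.5, -1.2, 2.0])
rls = RLS(dim=3, lam=0.98)
for _ in range(500):
    phi = rng.normal(size=3)
    d = true_w @ phi + 0.01 * rng.normal()
    rls.update(phi, d)
print(rls.w)   # close to true_w
```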

4 Implementation and comparative study

We apply the local linear wavelet neural network explained in Sect. 2 to the Wisconsin Breast Cancer (WBC) database and compare its performance to the most common classification methods in both the computer science and statistics literatures. The database can be downloaded from the University of Wisconsin Hospital, Madison.

All the computations are implemented using MATLAB V6.5 on a Pentium IV personal computer with a clock speed of 2.4 GHz, and the equations were written using MathType or the Microsoft Equation Editor. As is commonly done, we normalize the input variables to make sure that they are independent of the measurement units. Thus, the predictors are normalized to the interval [0, 1] using the formula:

$$ x_{i}^{\text{new}} = \frac{{x_{i}^{\text{old}} - x_{1:n} }}{{x_{n:n} - x_{1:n} }} $$
(9)

where \( x_{1:n} \) and \( x_{n:n} \) are the minimum and maximum order statistics of x_1, x_2, …, x_n, respectively. We use the stratified sampling technique to make sure that we get the same proportion from each group as in the original data. We randomly hold out a total of k = round(n/5) or k = round(n/10) observations, with k_l = round(n_l/5) or round(n_l/10) observations drawn from class l, where n_l is the number of observations of the given data set in group l, for l = 0, 1, …, c − 1. Therefore, to evaluate the performance of each classifier on a real application, we use either 5-fold or 10-fold cross-validation: we fold the given data into 5 or 10 parts and use 0.8 or 0.7 of the data for learning (building) the classification model and 0.2 or 0.3 for external validation (testing). The new technique LLWNN-RLS is compared with a wide range of classifiers to evaluate its performance with respect to the correct classification rate and the time it takes to get trained. As defined in [16, 17], the best network model was selected based on the following criteria.

1. Correct Classification Rate (CCR) and Average Squared Classification Error (ASCE):

$$ {\text{CCR}} = \frac{{\sum_{k = 0}^{c - 1} {{\text{cc}}_{k} } }}{n};\quad {\text{ASCE}} = \frac{{\sum_{k = 0}^{c - 1} { [n_{k} - {\text{cc}}_{k} ]^{2} } }}{n} $$
(10)

where n_k is the number of observations in class k, and cc_k is the number of correctly classified observations in class k. The best functional network is the one with both the highest CCR and the smallest ASCE.

We construct the confusion matrix, which is a c × c matrix whose diagonal contains the numbers of correctly classified observations, cc_k, and whose off-diagonal elements are the numbers of misclassified observations, mc_k, for k = 0, 1, …, c − 1 (a computational sketch of these criteria is given after this list).

2. Computational cost (time of execution)

This is the time needed to run the classifier until the best model is obtained, in both calibration and validation. The lower the computational cost, the better the classifier.

3. The Minimum Description Length (MDL) criterion

As explained in [16, 17], the best model is the one with the smallest MDL value. The form of the description length for the classification problem using the functional network is defined as

$$ L(\Theta_{k} ) = \frac{{m\log (n_{k} )}}{2} + \frac{{n_{k} }}{2}\log \left( {\frac{1}{{n_{k} }}\sum\limits_{i = 1}^{n} {e_{i}^{2} } (\Theta_{k} )} \right) $$
(11)

for all k = 0, …, c − 1, where m and k are the number of elements in the family and the category level, respectively. We note that, in principle, \( L (\Theta_{k} ) \) is the code length of the estimated parameters Θ_k, for all k = 0, 1, 2, …, c − 1.

We note that the description length has two terms:

(a) The first term \( \frac{{m{\log (}n_{k} )}}{2} \) is a penalty for including too many parameters in the functional network model.

(b) The second term \( \frac{{n_{k} }}{2}\log \left( {\frac{1}{{n_{k} }}\sum_{i = 1}^{n} {e_{i}^{2} } (\Theta_{k} )} \right) \) measures the quality of the functional network model fitted to the training set. Therefore, the best model is the one with the smallest value of its description length; both criteria are illustrated in the sketch below. A description of the data set under study and relevant related work are presented next.

4.1 Wisconsin Breast Cancer (WBC)

The data set was obtained from the University of Wisconsin Hospital, Madison. WBC is a nine-dimensional data set with the following features:

(1) Clump thickness; (2) Uniformity of cell size; (3) Uniformity of cell shape; (4) Marginal adhesion; (5) Single epithelial cell size; (6) Bare nuclei; (7) Bland chromatin; (8) Normal nucleoli; and (9) Mitoses. For our classification purpose, 400 exemplars were used for training and the remaining 299 exemplars for testing, out of a total of 699 exemplars. Several researchers have studied the WBC database and found that the best three attributes are mean texture, worst mean area and worst mean smoothness [18].
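
For orientation, a minimal loading sketch is given below. It assumes the publicly available UCI Machine Learning Repository copy of the data (file breast-cancer-wisconsin.data: a sample ID column, the nine features above coded 1–10, '?' for missing values, and class labels 2 = benign / 4 = malignant); the file name, layout and median imputation of missing values are assumptions, not details taken from the paper. The normalization of Eq. (9) is also applied here.

```python
import numpy as np

def load_wbc(path="breast-cancer-wisconsin.data"):
    """Load the WBC data: column 0 is the sample ID, columns 1-9 the nine features,
    column 10 the class label (2 = benign, 4 = malignant)."""
    raw = np.genfromtxt(path, delimiter=",",
                        missing_values="?", filling_values=np.nan)
    X = raw[:, 1:10]
    y = (raw[:, 10] == 4).astype(int)              # 1 = malignant, 0 = benign
    # impute '?' entries with the column median (assumption: the paper does not
    # state how missing values were handled)
    X = np.where(np.isnan(X), np.nanmedian(X, axis=0), X)
    # Eq. (9): min-max normalization of each predictor to [0, 1]
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return X, y

# usage (assuming the file is available locally):
# X, y = load_wbc()
# X_train, y_train = X[:400], y[:400]    # 400 exemplars for training
# X_test, y_test = X[400:], y[400:]      # remaining 299 exemplars for testing
```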

We utilize external validation techniques as shown in [16, 17]. We repeat the estimation and validation processes N = 1,000 times and then compute all the quality measures explained in Sect. 4 for all classifiers. Next, we summarize the results by computing the average, the standard deviation and the coefficient of variation of each quality measure over these 1,000 runs.

In addition, we draw two graphs to support the conclusions: one for the mean CCR versus its standard deviation and the other for the mean ASCE versus the mean MDL. These graphs help us to decide which classifier performs better. In both plots, each classifier is represented by a symbol.

In the graph of the mean CCR versus its standard deviation, a good classifier should appear in the lower right corner of the graph. In the graph of the mean ASCE versus the mean MDL, a good classifier should appear in the bottom left of the plot. In addition, corresponding to these graphs, we summarize the results in tables; in these tables, the highest CCRs are given in boldface.

For the sake of simplicity and space, we carried out the implementations with two predictors and with three predictors to check the performance of the LLWNN-RLS classifier against the other classifiers.

4.2 Result and discussion

From Tables 1 and 2 and Figs. 2 and 3 (average CCR versus its standard deviation), we observe, for example, the following:

Table 1 WBC data: the external validation results with 2 predictors
Table 2 WBC data: the external validation results with 3 predictors
Fig. 2 a Mean CCR versus std CCR for 2 predictors. b Mean ASCE versus mean MDL for 2 predictors

Fig. 3 a Mean CCR versus std CCR for 3 predictors. b Mean ASCE versus mean MDL for 3 predictors

1. The two classifiers RBFNN and RBFNN-Kalman filter show the worst performance.

2. LLWNN-RLS gives the highest values of the average CCR on the higher-dimensional data with less computation time.

3. LLWNN-RLS and RBFNN-RLS give the highest values of the average CCR.

4. LLWNN-RLS gives both the smallest MDL and the smallest ASCE. In addition, its execution time is much lower than that of the other classifiers.

We base the comparative study on the information provided in Table 3 and the conclusions in Sect. 5 on the information shown in Tables 1 and 2.

Table 3 Classification accuracy on the Wisconsin Breast Cancer (WBC) dataset

Rule for classification of WBC data sets using LLWNN and RLS

$$ \begin{gathered} {\text{if }}\left( {{\text{oo}}\left( {r,1} \right) \ge 0.06\;\& \;{\text{oo}}\left( {r,1} \right) \le 0.4627} \right)\;\& \;\left( {{\text{oo}}\left( {r,2} \right) \ge 0.0611\;\& \;{\text{oo}}\left( {r,2} \right) \le 0.41} \right)\;{\text{then Benign}}; \hfill \\ {\text{if }}\left( {{\text{oo}}\left( {r,1} \right) \ge - 3.7718\;\& \;{\text{oo}}\left( {r,1} \right) \le 0.0394} \right)\;\& \;\left( {{\text{oo}}\left( {r,2} \right) \ge - 3.2997\;\& \;{\text{oo}}\left( {r,2} \right) \le 0.036} \right)\;{\text{then Malignant}}; \hfill \\ \end{gathered} $$
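
Read literally, the rule can be applied to the two network outputs oo(r, 1) and oo(r, 2) of a record r as in the small sketch below (an illustrative transcription; treating outputs that fall outside both intervals as unclassified is our assumption):

```python
def classify_wbc(oo1, oo2):
    """Apply the LLWNN-RLS decision rule to the two network outputs of one record."""
    if 0.06 <= oo1 <= 0.4627 and 0.0611 <= oo2 <= 0.41:
        return "Benign"
    if -3.7718 <= oo1 <= 0.0394 and -3.2997 <= oo2 <= 0.036:
        return "Malignant"
    return "Unclassified"   # outside both intervals (assumption)

print(classify_wbc(0.20, 0.15))    # -> Benign
print(classify_wbc(-1.00, -0.50))  # -> Malignant
```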

4.3 Comparative study with existing methods

The proposed technique is compared with some of the existing techniques [17, 19–21]. The comparison is depicted in Table 3. The results of the comparison show that the proposed technique gives better classification accuracy than some of the existing techniques.

5 Conclusion

Even though mammography is one of the best techniques for breast tumor detection, in some cases radiologists, despite their experience, cannot detect tumors. Here, computer-aided methods like the one presented in this paper could assist medical staff and increase the accuracy of detection. Statistics show that only 20–30% of breast tumor cases are cancerous. A false negative, in which an actual tumor remains undetected, could lead to higher costs or even to the loss of a human life. This is the trade-off that motivated us to develop a classification system.

In this paper, we presented a technique for breast tumor classification. The objective of this study is to examine the effectiveness of the LLWNN for classifying breast cancer data. The technique was compared with different methods already developed. We showed empirically that the proposed approach offers better performance and generalization than common existing approaches. Since data mining techniques are better suited to larger databases, we intend to use a larger database, from the medical and/or business sectors, to evaluate the performance of the proposed technique. The technique also needs to be evaluated on time-series data to validate the findings.