1 Introduction

Breast cancer accounts for nearly 1 in 4 cancers diagnosed in US women; it is also the most common type of cancer in women and the fifth most common cause of cancer death worldwide [1]. There is substantial evidence of a worldwide increase in the occurrence of breast cancer, especially in Asia: China, India and Malaysia, for example, have recently experienced rapid increases in breast cancer incidence rates [2]. A recent study predicted that the cumulative incidence of breast cancer will rise to at least 2.2 million new cases among women across China over the 20-year period from 2001 to 2021 [3].

The most noticeable symptom of breast cancer is typically a lump or tumor that feels different from the rest of the breast tissue. However, it is not easy to distinguish a malignant tumor from a benign one because there are structural similarities between the two. To accurately identify the structural differences, physicians have to cautiously study a patient’s clinical history and make various medical examinations supported by imaging using mammography or ultrasound. However, the precise diagnosis of a breast tumor can only be obtained through some form of biopsy, whereby a small sample of cells or tissue is removed for examination. Typical biopsy processes for breast cancer analysis include Fine-Needle Aspiration (FNA), core needle, and excisional biopsy [4]. Among these, FNA is the most convenient because it involves the use of very small needles (smaller than those used for blood tests) [5]. A definitive diagnosis is vital because the potency of the cytotoxic drugs administered during treatment can be life threatening.

As there is always a subjective element in the pathological examination of a biopsy, an automated technique can provide valuable assistance for physicians. Recent years have witnessed a large increase in research related to computer-assisted breast cancer diagnosis. A large focus with respect to biopsy image analysis has been on automated cancer type classification. Many recent studies have revealed that biopsy images can be properly classified, without requiring perfect segmentation, if suitable image feature descriptions are chosen [6–8]. Tabesh et al. aggregated color, texture, and morphometric cues at the global and histological object levels, achieving 96.7 % accuracy in classifying tumor and non-tumor images [9]. The wavelet packet transform coupled with local binary patterns was used for meningioma subtype classification in [10]. This research, and similar work, demonstrates that combining different image description features can improve medical image classification performance.

A great number of machine learning methods have been proposed to design accurate classification systems for various medical images [11]. Among them, ensemble learning has attracted much attention due to its good performance in many applications in medicine and biology [12]. Ensemble learning is concerned with mechanisms to combine the results of a number of weak learning systems. A weak learner is defined to be a classifier which is only slightly correlated with the true classification: it can label examples better than random guessing. In contrast, a strong learner is a classifier that is arbitrarily well correlated with the true classification [13]. In the case of ensemble classification, ensemble learning is concerned with integrating the results of a number of classifiers (often called ‘base classifiers’) [14] to develop a strong classifier with good generalization performance; for this reason, ‘base classifiers’ are also referred to as ‘weak classifiers’.

Among the representatives of ensemble learning, the Random Subspace (RS) method [15] is often cited as an efficient way of combining the results of a number of classifiers. A recent application of RS to functional Magnetic Resonance Imaging (fMRI) classification has shown promising results [16]; there, RS outperformed single classifiers as well as some of the most widely used alternative classifier ensemble techniques, such as bagging, AdaBoost, random forests and rotation forests. The same outcome has also been reported in the context of RS ensemble based gene expression classification [17]. RS divides the input feature space into subspaces; each subspace is formed by randomly picking features from the entire space, and features may be repeated across subspaces.

In previous studies of medical image classification, accuracy was the only objective; the aim was to produce a classifier with the smallest possible error rate. In many applications, however, it is more important to address reliability in classifier design by introducing a reject option which allows for an expression of doubt. The objective of the reject option is thus to improve classification reliability by leaving the classification of “difficult” cases to human experts. Since the consequences of misclassification are often severe in medical image classification, such clinical expertise is desirable so as to exert control over the accuracy of the classifier and make reliable determinations.

Classification with a rejection option has been a topic of interest in pattern recognition. Multi-stage classifiers are ensembles in which individual classifiers have a reject option [18]. Cascading [19] is a scheme to support multi-stage classification. At the first stage of a cascading system, the system constructs a simple rule using a properly generalized classifier; based on its confidence criterion, the rule is likely to leave some part of the space uncovered with sufficient confidence. Therefore, at the next stage, cascading builds a more complex rule to focus on those uncovered patterns. Eventually, few patterns will remain that are not covered by any of the prior rules; these can then be dealt with using an instance-based nonparametric technique, which is well suited to isolated, singular points [20]. Many cascading multi-stage classifier architectures have been proposed [21–24] and promising results have been achieved in medical and biological classification applications, such as microarray data classification [25] and gene expression data classification [26].

In this paper, we propose and evaluate a novel cascade scheme, comprised of two random subspace ensembles, applied to microscopic biopsy image classification with a reject option. The first stage of our cascade scheme consists of an ensemble of SVMs with a reject option to classify patterns with a high level of confidence. The more complex and slower second stage, an ensemble of MLPs, deals with the patterns rejected at stage 1, and is designed to make further classifications or rejections. Compared with some earlier cascading classifier paradigms, our proposed system is composed of two different ensembles. In the first stage, a one-vs-all SVM ensemble is employed to classify “straightforward” samples (thus obtaining high accuracy) and reject those which are less straightforward or ambiguous. Only samples for which the ensemble’s confidence score, in terms of consensus degree, is greater than a certain threshold will be classified. The second stage consists of a random subspace ensemble of MLPs which operates using majority voting; any samples that have a low consensus degree will be rejected for further consideration by human experts. It is suggested that classification with the proposed cascaded ensembles provides an efficient means to simultaneously reduce the error rate and enhance reliability by controlling the accuracy-rejection trade-off.

The rest of this paper is organized as follows: the breast cancer biopsy image dataset used in our work and the image feature extraction methods are introduced in Sect. 2. In Sect. 3, we describe and theoretically analyze the proposed two-stage ensemble cascading system in detail. In Sect. 4, experimental results are given for the adopted benchmark image dataset; we compare the proposed cascading system with its component classifiers as well as some widely used aggregation techniques, such as bagging and AdaBoost. The paper ends with some conclusions in Sect. 5.

2 Biopsy images and feature descriptions

In this section we first introduce, in Sect. 2.1, the benchmark breast cancer biopsy image dataset. The proposed image feature extraction methods are then introduced in Sect. 2.2. The choice of features for describing the initial biopsy images depends on the nature of the input images. For biopsy image classification, calculating global features to estimate the global appearance of the images is an effective approach. In this work, we propose to use three image descriptors for biopsy image feature extraction: (1) local binary patterns (Sect. 2.2.1), (2) gray level co-occurrence matrices (Sect. 2.2.2) and (3) the curvelet transform (Sect. 2.2.3).

2.1 Breast cancer biopsy images

With respect to the work described in this paper, a breast cancer benchmark biopsy image dataset from the Israel Institute of Technology (see Footnote 1) was used. The image set consists of 361 samples, of which 119 were classified by a pathologist as normal tissue, 102 as carcinoma in situ, and 140 as invasive ductal or lobular carcinoma. The samples were generated from breast tissue biopsy slides, stained with hematoxylin and eosin. They were photographed using a Nikon Coolpix\(^{\textregistered }\) 995 attached to a Nikon Eclipse\(^{\textregistered }\) E600 at a magnification of \(\times 40\), producing images with a resolution of about \(5\,\mu \mathrm{m}\) per pixel. No calibration was made, and the camera was set to automatic exposure. The images were cropped to a region of interest of \(760 \times 570\) pixels and compressed using lossy JPEG compression. The resulting images were again inspected by a pathologist to ensure that their quality was sufficient for diagnosis. Figure 1 presents three sample images of healthy tissue, carcinoma in situ and invasive carcinoma.

Fig. 1
figure 1

Typical image instances. a carcinoma in situ: tumor confined to a well-defined small region, usually a duct (arrow); b invasive: breast tissue completely replaced by the tumor; c normal: normal breast tissue, with ducts and finer structures

2.2 Feature descriptions

Shape and texture features are critical factors for distinguishing one image from another, and both are quite effective for biopsy image discrimination. As can be seen from Fig. 1, the three kinds of biopsy images exhibit visible differences in cell appearance and texture distribution. We therefore use Local Binary Patterns (LBPs) for extracting local textural features, Gray Level Co-occurrence Matrix (GLCM) statistics for representing global textures and the Curvelet Transform for shape description.

2.2.1 Local binary patterns

Local Binary Patterns (LBPs) were first introduced as a texture descriptor for summarizing local gray-level structures [27, 28]. LBPs are generated by taking a local neighborhood around each pixel, thresholding the pixels of the neighborhood at the value of the central pixel and then using the resulting binary-valued image patch as a local image descriptor. In other words, a binary code of 0 or 1 is assigned to each neighborhood pixel; the binary codes of a \(3\times 3\) neighborhood form an 8-bit code. In this manner, a single scan through an image can generate an LBP code for each pixel.

Formally, the LBP operator takes the form

$$\begin{aligned} LBP_{P,R}=\sum _{p=0}^{P-1} s(g_{p}-g_{c})2^{p}, \quad s(x)= \left\{ \begin{array}{ll} 1&\quad \text{ if}\,\, x\ge 0\\ 0&\quad \text{ if}\,\, x < 0\\ \end{array} \right. \end{aligned}$$
(1)

where \(g_\mathrm{c}\) is the gray value of the central pixel, \(g_\mathrm{p}\) is the value of its neighbors, \(P\) is the total number of neighbors and \(R\) is the radius of the neighborhood.

A useful extension to the original LBP operator is the so-called uniform patterns [27]. An LBP is “uniform” if it contains at most two bitwise transitions from 0 to 1 or vice versa when the binary string is considered circular. For example, 11100001 (with two transitions) is a uniform pattern, whereas 11110101 (with four transitions) is a non-uniform pattern. Uniformity captures important structural features such as edges, spots and corners. Ojala et al. [27] observed that although only 58 of the 256 eight-bit patterns are uniform, nearly 90 % of all observed image neighborhoods are uniform. We use the notation \(\text{ LBP}^u _{P,R}\) for the uniform LBP operator, meaning a neighborhood of \(P\) sampling points on a circle of radius \(R\); the superscript \(u\) stands for using uniform patterns and labeling all remaining patterns with a single label. The number of labels for a neighborhood of 8 pixels is 256 for the standard LBP and 59 for \(\text{ LBP}^u _{8,1}\).

A common practice when applying LBP coding over an image is to generate a histogram of the labels, where a 256-bin histogram represents the texture description of the image and each bin can be regarded as a micro-pattern. The distribution of these patterns represents the whole structure of the texture. The number of patterns in an LBP histogram can be reduced by using only uniform patterns, without losing much information: as noted above, there are 58 different uniform patterns in an 8-bit LBP representation, and the remaining patterns can all be assigned to a single non-uniform label, thus representing the texture structure with a 59-bin histogram instead of 256 bins.
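To make the LBP coding concrete, the following sketch (in Python with NumPy; the function names are our own, not from the paper) computes \(LBP_{8,1}\) codes following Eq. (1), tests the uniformity criterion and builds the 59-bin uniform histogram described above:

```python
import numpy as np

def lbp_code(patch):
    """LBP_{8,1} code of a 3x3 patch: threshold the 8 neighbors at the
    center value and read them off as an 8-bit number (Eq. 1)."""
    center = patch[1, 1]
    # Neighbors taken in circular order around the center.
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum(int(g >= center) << p for p, g in enumerate(neighbors))

def is_uniform(code):
    """A pattern is uniform if its circular bit string has at most
    two 0/1 transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    return transitions <= 2

def lbp_histogram(image):
    """59-bin uniform-LBP histogram: one bin per uniform pattern,
    plus a single bin shared by all non-uniform patterns."""
    uniform_codes = sorted(c for c in range(256) if is_uniform(c))
    bin_of = {c: i for i, c in enumerate(uniform_codes)}
    hist = np.zeros(len(uniform_codes) + 1)
    for r in range(1, image.shape[0] - 1):
        for c in range(1, image.shape[1] - 1):
            code = lbp_code(image[r - 1:r + 2, c - 1:c + 2])
            hist[bin_of.get(code, len(uniform_codes))] += 1
    return hist / hist.sum()
```

On 8-bit patterns, exactly 58 of the 256 codes pass the uniformity test, matching the count reported by Ojala et al. [27].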

LBP has been shown to be an efficient image texture descriptor. Recently, a complete modeling of the local binary pattern operator was proposed and the associated Complete LBP (CLBP) scheme developed for texture classification [28]. Unlike traditional LBP, in CLBP a local region is represented by its center pixel and a Local Difference Sign-Magnitude Transform (LDSMT). With a global threshold, the center pixel is coded by a binary code, and the resulting binary map is called \(CLBP\_C\). Two other complementary components are also obtained by the LDSMT, the difference signs and the difference magnitudes, and two operators, \(CLBP\_S\) and \(CLBP\_M\), are used to code them. The framework of CLBP is presented in Fig. 2. CLBP can achieve much better rotation-invariant texture classification results than conventional LBP-based schemes.

Fig. 2
figure 2

Framework of CLBP

We briefly review the three operators in CLBP here, namely \(CLBP\_S, CLBP\_M\) and \(CLBP\_C\). Given a central pixel \(g_{c}\) and its \(P\) neighbors \(g_{p}, p=0,1,\ldots, P-1\), the difference between \(g_{c}\) and \(g_{p}\) can be calculated as \(d_{p}=g_{p}-g_{c}\). The local difference vector \([d_{0},\ldots, d_{P-1}]\) describes the image local structure at \(g_{c}\). \(d_{p}\) can be further decomposed into two components:

$$\begin{aligned} d_{p}=s_{p}*m_{p}, \quad \mathrm{and} \quad \left\{ \begin{array}{l} s_{p}=\mathrm{sign}(d_{p}) \\ m_{p}=|d_{p}| \\ \end{array} \right. \end{aligned}$$
(2)

where \(s_{p}=1\), when \(d_{p}\ge 0\), otherwise, \(s_{p}=0\). \(m_{p}\) is the magnitude of \(d_{p}\). Equation (2) is called the local difference sign-magnitude transform (LDSMT).

The \(CLBP\_S\) operator is defined as the original LBP operator in Eq. (1).

The \(CLBP\_M\) operator is defined as:

$$\begin{aligned}&CLBP\_M_{P,R}=\sum _{p=0}^{P-1}t(m_{p},c)2^{p}, \nonumber \\&t(x,c) = \left\{ \begin{array}{l@{\quad }l} 1&\text{ if}\,\, x\ge c\\ 0&\text{ if}\,\, x < c\\ \end{array} \right. \end{aligned}$$
(3)

where \(c\) is a threshold set as the mean value of \(m_{p}\) from the whole image.

The \(CLBP\_C\) operator is coded as:

$$\begin{aligned} CLBP\_C_{P,R}=t(g_{c},c_{I}) \end{aligned}$$
(4)

where \(t\) is defined in Eq. (3) and \(c_{I}\) is a threshold set as the average gray level of the whole image.

In this work, we use the 3D joint histogram of these three operators to generate textural features of breast cancer biopsy images. According to [28], the joint combination of the three components gives better classification than conventional LBP while providing a smaller feature dimension.
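As an illustration of the CLBP components, a minimal NumPy sketch of the LDSMT decomposition (Eq. 2) and the three operator codes (Eqs. 1, 3 and 4) for a single \(3\times 3\) patch might look as follows; the helper names are ours, and a full implementation would additionally accumulate the 3D joint histogram over all patches:

```python
import numpy as np

def ldsmt(patch):
    """Local Difference Sign-Magnitude Transform (Eq. 2) for a 3x3 patch:
    d_p = g_p - g_c is split into sign (s_p) and magnitude (m_p)."""
    center = patch[1, 1]
    neighbors = np.array([patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                          patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]],
                         dtype=float)
    d = neighbors - center
    s = (d >= 0).astype(int)   # sign bits, coded by CLBP_S (same as LBP)
    m = np.abs(d)              # magnitudes, coded by CLBP_M
    return s, m

def clbp_codes(patch, c_mag, c_gray):
    """CLBP_S, CLBP_M (Eq. 3) and CLBP_C (Eq. 4) codes for one patch.
    c_mag is the mean magnitude over the whole image; c_gray is the
    image's average gray level."""
    s, m = ldsmt(patch)
    clbp_s = sum(int(b) << p for p, b in enumerate(s))
    clbp_m = sum(int(mag >= c_mag) << p for p, mag in enumerate(m))
    clbp_c = int(patch[1, 1] >= c_gray)
    return clbp_s, clbp_m, clbp_c
```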

2.2.2 Statistics from gray level co-occurrence matrix

Global texture distribution is one of the important characteristics used in identifying objects or regions of interest in an image. Co-occurrence probabilities provide a second-order method for generating texture features [29]. The basis for the features used here is the gray level co-occurrence matrix, a square matrix of dimension \(N_{g}\), where \(N_{g}\) is the number of gray levels in the image. Element \([i,j]\) of the matrix is generated by counting the number of times a pixel with value \(i\) is adjacent to a pixel with value \(j\) and then dividing the entire matrix by the total number of such comparisons made. Each entry can therefore be considered to be the probability that a pixel with value \(i\) will be found adjacent to a pixel of value \(j\) [30]; the matrix is shown in Eq. (5).

$$\begin{aligned} \mathbf C =\left[ \begin{array}{c@{\quad }c@{\quad }c@{\quad }c} p(1,1)&p(1,2)&\cdots&p(1,N_g) \\ p(2,1)&p(2,2)&\cdots&p(2,N_g) \\ \vdots&\vdots&\ddots&\vdots \\ p(N_g,1)&p(N_g,2)&\cdots&p(N_g,N_g) \\ \end{array} \right] \end{aligned}$$
(5)

With respect to the work described in this paper, a total of 22 features were extracted from the gray level co-occurrence matrices; these are listed in Table 1. Each of these statistics has a qualitative meaning with respect to the structure within the GLCM: for example, dissimilarity and contrast measure the degree of texture smoothness, uniformity and entropy reflect the degree of repetition amongst the gray-level pairs, and correlation describes the correlation between the gray-level pairs. For details of these statistical features, see [29–32].

Table 1 Features extracted from gray level co-occurrence matrix
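A minimal sketch of how a GLCM and a few of the Table 1 statistics could be computed (illustrative NumPy code, not the authors' implementation; only 4 of the 22 statistics are shown):

```python
import numpy as np

def glcm(image, dx=1, dy=0, levels=8):
    """Normalized gray level co-occurrence matrix (Eq. 5) for the pixel
    offset (dx, dy); entry [i, j] estimates the probability that a pixel
    of level i has a (dx, dy)-neighbor of level j."""
    C = np.zeros((levels, levels))
    h, w = image.shape
    for r in range(h - dy):
        for c in range(w - dx):
            C[image[r, c], image[r + dy, c + dx]] += 1
    return C / C.sum()

def glcm_stats(P):
    """An illustrative subset of the 22 statistics used in the paper."""
    i, j = np.indices(P.shape)
    contrast = np.sum(P * (i - j) ** 2)
    dissimilarity = np.sum(P * np.abs(i - j))
    uniformity = np.sum(P ** 2)                       # a.k.a. energy / ASM
    entropy = -np.sum(P[P > 0] * np.log2(P[P > 0]))
    return contrast, dissimilarity, uniformity, entropy
```

A perfectly flat image gives the degenerate case: contrast and entropy are 0 and uniformity is 1, consistent with their interpretation as smoothness and repetition measures.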

2.2.3 Curvelet transform

The curvelet transform [33–37] is one of the latest developments in non-adaptive transforms. Compared to the wavelet transform, the curvelet transform provides a sparser representation of an image, with improved directional elements and a better ability to represent edges and other singularities along curves. Sparse representation usually offers better performance owing to its capacity for efficient signal modeling. To date, successful applications of the curvelet transform have been reported for many medical and biological image analysis tasks, including digital mammogram analysis [38] and phenotype recognition [39].

In the curvelet transform, fine-scale basis functions are long ridges; the shape of the basis functions at scale \(j\) is \( 2^{-j}\) by \(2^{-j/2}\) so the fine-scale bases are skinny ridges with a precisely determined orientation. The curvelet coefficients can be expressed by:

$$\begin{aligned} c(j,l,k) := \langle f, \varphi _{j,l,k} \rangle = \int _{R^2} f(x)\varphi _{j,l,k}(x) \mathrm{d}x \end{aligned}$$
(6)

where \(\varphi _{j,l,k}\) denotes the curvelet function, and \(j, l\) and \(k\) are the variables of scale, orientation, and position, respectively.

In recent years, several discrete curvelet transforms have been proposed. The most influential approach is based on the Fast Fourier Transform (FFT) [36]. In the frequency domain, the curvelet transform can be implemented with \(\varphi \) by means of the window function \(U\). A radial window \(W(r)\) and an angular window \(V(t)\) are defined as follows:

$$\begin{aligned} \sum _{j=-\infty }^{\infty } W^2(2^jr)&= 1, \qquad r \in (3/4, 3/2) \end{aligned}$$
(7)
$$\begin{aligned} \sum _{l=-\infty }^{\infty } V^2(t-l)&= 1, \qquad t \in (-1/2, 1/2) \end{aligned}$$
(8)

where \(r\) and \(\theta \) are polar coordinates in the frequency domain. For each \( j \ge j_0\), \(U_j\) is defined over the Fourier domain by:

$$\begin{aligned} U_j (r, \theta ) = 2^{-3j/4}W(2^{-j}r)V\left(\frac{2^{[j/2]}\theta }{2\pi }\right) \end{aligned}$$
(9)

where \([j/2]\) denotes the integer part of \(j/2\).

The fastest curvelet transform currently available is curvelets via wrapping [36], which is used in our work. From the curvelet coefficients, statistics can be calculated for each curvelet sub-band. In this paper, the mean \(\mu \), the standard deviation \(\delta \) and the entropy \(H\) are used as simple features. If \(n\) sub-bands are used for the transform, \(3n\) features \(G=[G_\mu, G_\delta, H]\) are obtained, where \(G_\mu = [\mu _1, \mu _2, \ldots, \mu _n], G_\delta =[\delta _1, \delta _2, \ldots, \delta _n]\) and \(H=[h_{1}, h_{2}, \ldots, h_{n}]\). A \(3n\)-dimensional feature vector can thus be used to represent each image in the dataset.
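Assuming the curvelet coefficients of each sub-band are already available as arrays (e.g., from a wrapping-based FDCT implementation), the \(3n\)-dimensional feature vector could be assembled as in the following sketch; the entropy here is computed from a 16-bin magnitude histogram, which is one plausible choice rather than the paper's exact recipe:

```python
import numpy as np

def subband_features(subbands):
    """Mean, standard deviation and entropy of each curvelet sub-band,
    concatenated as G = [G_mu, G_delta, H]; for n sub-bands this yields
    a 3n-dimensional feature vector."""
    mus, stds, ents = [], [], []
    for coeffs in subbands:
        c = np.abs(np.ravel(coeffs))
        mus.append(c.mean())
        stds.append(c.std())
        # Shannon entropy of a normalized magnitude histogram (assumed
        # 16 bins; the paper does not specify the binning).
        hist, _ = np.histogram(c, bins=16)
        p = hist / hist.sum()
        ents.append(-np.sum(p[p > 0] * np.log2(p[p > 0])))
    return np.concatenate([mus, stds, ents])
```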

2.2.4 Combined features

Each feature extracted from the above three descriptors characterizes individual aspects of image content. The joint exploitation of different image descriptions is often necessary to provide a more comprehensive description and so produce classifiers with higher accuracy. Using 5 levels of the curvelet transform, 82 sub-bands of curvelet coefficients are computed; therefore, a 246-dimensional curvelet feature vector is generated for each image. With a 64 gray-level quantization, we used 10 different relative interpixel distances to generate 10 different gray level co-occurrence matrices for each image. The 22 statistics listed in Table 1 are computed for each of these 10 gray level co-occurrence matrices; thus, we have a 220-dimensional GLCM feature vector for each image. The CLBP feature vector of each image has a dimension of 200. The three feature vectors are each normalized into the range \([-1, 1]\) and then concatenated to produce a 666-dimensional feature vector for each image for classification. One of the difficulties of multiple-feature aggregation lies in the high dimensionality of the feature space; however, the Random Subspace classifier ensemble (see the following section) mitigates this problem through its dimension-reduction capability.
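The per-descriptor normalization and concatenation step can be sketched as follows (hypothetical helper names; the paper does not specify its normalization beyond the \([-1, 1]\) target range, so a per-column min-max scaling is assumed here):

```python
import numpy as np

def normalize_block(X):
    """Scale each feature column of one descriptor block into [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return 2 * (X - lo) / span - 1

def combine_features(curvelet, glcm, clbp):
    """Normalize each descriptor block separately, then concatenate,
    mirroring the paper's 246 + 220 + 200 = 666-dimensional vector."""
    return np.hstack([normalize_block(curvelet),
                      normalize_block(glcm),
                      normalize_block(clbp)])
```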

Owing to differences between imaging devices and staining methods, histology images of biopsies may vary significantly in color and intensity. The feature extractors described above cope with this situation effectively, since all of them operate in the grayscale color space. Before feature extraction, all biopsy images are converted from a chromatic color space to grayscale with an intensity range from 0 to 255. This conversion eliminates the adverse effects of color and intensity variations because the feature extractors then work in the same space. Moreover, as the features are extracted from whole images, the distribution and structure of different patterns, such as tissue, ducts, fat and tumors, are automatically captured by the feature extractors and thus do not adversely affect the performance of the combined feature.

3 Serial fusion of random subspace ensembles

Although many supervised learning algorithms, such as neural networks, the \(k\)-nearest neighbor algorithm and the SVM, have been extensively applied to medical image classification problems, few of them have addressed the issue of classification reliability (the extent to which one can rely upon a given prediction). Note that we are interested in the assessment of a classifier’s performance on a single example, such as the diagnosis associated with an individual patient. In such cases an overall quality measure of a classifier (e.g., classification accuracy) does not provide the desired information, even where good accuracies are achieved using some state-of-the-art method. In some real applications, such as medical diagnosis, highly reliable classifiers are required so that a correct therapeutic strategy can be selected. It is therefore desirable to have a reject option, i.e., an option to withhold a classifier decision, in order to avoid making a wrong decision when the classifier is presented with ambiguous input.

In this paper a new two-stage classifier for breast cancer biopsy image classification, consisting of a cascade of random subspace ensembles with a reject option, is proposed. With respect to the work described in this paper, we adopted the definitions of recognition rate, rejection rate and reliability proposed in [40], as presented below, so as to facilitate the performance evaluation of classifiers with a reject option:

  • Recognition rate (RR) \(=\) no. of correctly recognized images/no. of testing images.

  • Rejection rate (ReR) \(=\) no. of rejected images/no. of testing images.

  • Reliability (RE) \(=\) (no. of correctly recognized images + no. of rejected images)/no. of testing images.

  • Error rate (ER) \(=\) 100 % \(-\) reliability.

From the above we can see that Reliability = Recognition rate + Rejection rate. According to this definition of reliability, high reliability can be achieved with an appropriate trade-off between error rate and rejection rate.
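These four quantities can be computed directly from predictions in which rejections are marked with a sentinel label; the following sketch (our own notation, with \(-1\) denoting a rejection) follows the definitions above:

```python
import numpy as np

REJECT = -1  # sentinel label used for withheld decisions

def reliability_metrics(y_true, y_pred):
    """Recognition rate, rejection rate, reliability and error rate as
    defined above (rejections count toward reliability, not error)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    rejected = y_pred == REJECT
    correct = (~rejected) & (y_pred == y_true)
    rr = correct.sum() / n          # recognition rate
    rer = rejected.sum() / n        # rejection rate
    re = rr + rer                   # reliability
    return rr, rer, re, 1.0 - re    # last value is the error rate
```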

3.1 Reject option for classification

The optimal classification rule with a reject option was defined by Chow [21]. Consider a binary classification task with an instance dataset \(X=\{x_{1},x_{2},\ldots, x_{m}\}\) and a class label set \(C=\{-1,1,0\}\), where class \(0\) is the reject option. We seek a classification rule \(L: X \Rightarrow C\) such that \(L(x)=0\) indicates that no definite judgement is made for \(x\) and the reject option is taken. Chow’s rule rejects a pattern if the maximum of its a posteriori probabilities is lower than a predefined threshold \(t\); this maximum posterior probability can be regarded as a measure of classification reliability. The rule can be expressed as:

$$\begin{aligned} f(x) = \left\{ \begin{array}{l@{\quad }l} \mathrm{argmax}_{C_{i}}\, p(C_{i}|x)&\text{ if}\,\, \max _{C_{i}} p(C_{i}|x)\ge t\\ \text{ reject}&\text{ if}\,\, \forall _{i}\ p(C_{i}|x) < t\\ \end{array} \right. \end{aligned}$$
(10)

where \(p(C_{i}|x)\) is the posterior probability, which can be obtained by Bayes formula.

The rejection rate is the probability that the classifier rejects a given example:

$$\begin{aligned} p(\mathrm{reject}) = \int _\mathrm{reject} p(x)\,\mathrm{d}x = p\left(\max _{C_i} p(C_{i}|x) < t\right). \end{aligned}$$
(11)

In Chow’s theory, an optimal classifier can be found only if the true posterior probabilities are known, which is rarely achievable in real applications.

The key issue with respect to the reject option is the definition of the threshold \(t\). In our work, we do not pursue the optimal error-reject trade-off in depth; instead, we applied different rejection thresholds and compared the resulting rejection rates against accuracies and reliabilities.
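A direct implementation of Chow's rule in Eq. (10), given a matrix of estimated posteriors, can be sketched as follows (illustrative code; rejections are returned as \(-1\)):

```python
import numpy as np

def chow_rule(posteriors, t):
    """Chow's rule (Eq. 10): predict the argmax of the class posteriors
    when the maximum posterior is at least t, otherwise reject (-1).
    `posteriors` has shape (n_samples, n_classes)."""
    posteriors = np.asarray(posteriors)
    labels = posteriors.argmax(axis=1)
    labels[posteriors.max(axis=1) < t] = -1
    return labels
```

Sweeping \(t\) from 0 to 1 traces out the accuracy-rejection trade-off discussed above: a larger \(t\) rejects more samples but raises the accuracy on those that are classified.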

3.2 A cascade two-stage classification scheme

As already noted, it has been demonstrated that classification accuracy can be enhanced by using an ensemble of classifiers, and over the last few years a number of successful ensemble methods have been proposed [14, 16]. The most popular method for creating a classifier ensemble is to build multiple parallel classifiers and then combine their outputs according to some fusion strategy. Alternatively, a serial architecture can be adopted, with different classifiers arranged in cascade form such that the output of one classifier acts as the input to another. In this paper, we propose a hybrid classification scheme which serially connects two parallel random subspace ensemble classifiers (Fig. 3). Note that all classifiers have a rejection option.

Fig. 3
figure 3

Operation of the hybrid classification scheme comprising a cascade of two Random Subspace classifier ensembles

In our current implementation the first ensemble (Classifier Ensemble 1 in Fig. 3) consists of a collection of SVM classifiers, the second (Classifier Ensemble 2 in Fig. 3) consists of a collection of MLP classifiers. From Fig. 3 it can be seen that rejected samples from Classifier Ensemble 1 are passed to Ensemble 2, any samples that remain rejected once Classifier Ensemble 2 has been applied are passed to a human expert for “adjudication”.

SVMs and MLPs have achieved better performance than other kinds of classifiers in many medical image analysis tasks, especially in histopathological image analysis [11]; therefore, they were selected as the base classifiers in our two ensembles. The proposed cascade system is consistent with a principle in statistical pattern recognition that improved classification performance can be expected when a local classifier is appended after a global one [41]. The SVM ensemble in the first stage is trained as a global classifier. Compared with the SVM, the MLP is relatively local, since it has been proven that a feed-forward network of just two layers (not including the input layer) is sufficient to approximate any continuous function [42]. Note that the classification performance of the whole system would not change much if another SVM ensemble were used in the second stage, because under the same training strategy the support vectors obtained in stage 1 and stage 2 would be very similar.

Another reason we use different base classifiers for the two ensembles is to achieve ‘diversity’ between classifiers, which is also deemed an important factor in the success of ensemble learning [43]. Using different individual classifiers within an ensemble can improve ensemble performance; here we extend this idea by employing different base classifiers for the two ensembles to improve the ‘diversity’ between the ensembles.

The major issue in designing the above hybrid classification system is deciding when a pattern is covered by a rule and should be classified accordingly, and when it should be rejected and passed on to either the second ensemble or the human expert (depending on the stage of the process). The reject option has been formalized in the context of statistical pattern recognition according to the minimum risk theory presented in [21] and [44]. Intuitively, a suggested classification should be rejected if the confidence in that classification is below some threshold.

The standard approach to rejection in classification is to estimate the class posteriors and to reject classifications that have low class posterior probabilities [21]. To simplify the design of the SVMs in the first ensemble with appropriate posterior estimation, we decompose the multi-class classification problem with \(K\) classes (\(K=3\) in the current work) into \(K\) independent two-class problems (the one-vs-all approach, where each classifier classifies records as belonging or not belonging to a class). The desired multi-class classification can then be conducted according to the output of the binary classifiers.

To estimate class posteriors from SVM’s outputs, a mapping can be implemented using the following sigmoid function [45]:

$$\begin{aligned} P(y=+1|\mathbf{x}) = \frac{1}{1+\exp (a \rho (\mathbf{x}) +b)} \end{aligned}$$
(12)

where the class labels are denoted \(y = +1, -1\), and \(a\) and \(b\) are constants to be estimated from sample data. This method provides estimates of the posterior probabilities that are monotonic functions of the output \(\rho (x)\) of an SVM, which implies that Chow’s rule applied to these estimates is equivalent to the rejection rule obtained by directly applying a reject threshold to the absolute value of \(\rho (x)\) [23].
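The sigmoid mapping of Eq. (12) is straightforward to implement once \(a\) and \(b\) have been fitted; in this sketch they are simply passed in as constants (fitting them, e.g. by Platt's method, is outside the scope of the example):

```python
import numpy as np

def platt_posterior(rho, a, b):
    """Sigmoid mapping (Eq. 12) from an SVM's signed output rho(x) to an
    estimate of P(y=+1 | x); a and b are assumed to have been fitted on
    held-out data and are here just illustrative constants."""
    return 1.0 / (1.0 + np.exp(a * np.asarray(rho) + b))
```

With \(a < 0\), the estimate is a monotonically increasing function of \(\rho(x)\), which is the monotonicity property the equivalence argument above relies on.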

In our scheme, \(K\) binary SVM classifiers are constructed for the \(K\) different image classes (\(K=3\)); we term such a collection of \(K\) binary SVMs an ‘expert’ to avoid confusion with the term ‘ensemble’. The \(i\)th SVM output function \(P_i\) is trained taking the examples from the \(i\)th class as positive and the examples from all other classes as negative. In other words, each binary SVM classifier is trained to act as a class-label detector, outputting a positive response if its label is present and a negative response otherwise. For example, a binary SVM trained as an “in situ detector” would discriminate between in situ and not in situ. For a new sample \(x\), the expert assigns it to the class with the largest value of \(P_i\) following

$$\begin{aligned} Class = \mathop{\text{argmax}}_{i} \; P_i, \quad i=1, \dots, K \end{aligned}$$
(13)

where \(P_i\) is the signed confidence measure of the \(i\)th SVM classifier. Such an SVM expert can then act as a base classifier in the stage 1 ensemble, trained with randomly chosen subsets of all available features following the Random Subspace strategy [15]. In the random subspace method, base classifiers are learned from random subspaces of the data feature space; that is, the ensemble is trained by randomly drawing feature subsets and submitting each one to a base SVM expert.
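A minimal sketch of such an expert and its Random Subspace setup, in Python rather than the MATLAB used in the paper; the `scores` argument stands in for the trained binary SVM outputs \(P_i\), and the subset sizes are illustrative:

```python
import random

def expert_predict(scores):
    # Eq. (13): assign the sample to the class whose binary detector
    # reports the largest signed confidence P_i.
    return max(range(len(scores)), key=lambda i: scores[i])

def random_subspaces(n_features, subset_size, n_experts, seed=0):
    # Random Subspace strategy [15]: each base expert is trained on a
    # randomly chosen subset of the feature indices.
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_experts)]
```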

As we aim to construct a serially fused cascade of classifier ensembles that produces high-confidence classifications, it is essential to examine the output of the SVM ensemble consisting of the base SVM experts. In combining the decisions from the \(M\) experts, a sample is assigned a class only when there is a predefined degree of consensus, i.e., when at least \(t\) of the experts agree on the label; otherwise, the sample is rejected. The threshold \(t\) can be decided in advance; for example, a simple rule such as the following can be used:

$$\begin{aligned} t \ge \left\{ \begin{array}{l@{\quad }l} \frac{M}{2}+1 & \text{if } M \text{ is even}\\ \frac{M+1}{2} & \text{if } M \text{ is odd.}\\ \end{array} \right. \end{aligned}$$
(14)

Since there can be more than two classes, the combined decision is deemed correct when a majority of the experts are correct, and wrong when a majority are wrong. Clearly, \(t\) is a tunable threshold that controls the rejection rate; we use \(t\) to relate the consensus degree from majority voting to a confidence measure, and abstain from classifying ambiguous samples. A rejection is considered neither correct nor wrong, so it is equivalent to a neutral position or an abstention [46]. Figure 4 further explains the principle of the SVM ensemble in stage 1.
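The \(t\)-out-of-\(M\) voting rule with a reject option can be sketched as follows; a hedged Python illustration, with Eq. (14) supplying the default simple-majority threshold:

```python
from collections import Counter

def default_threshold(M):
    # Eq. (14): simple-majority threshold.
    return M // 2 + 1 if M % 2 == 0 else (M + 1) // 2

def vote_or_reject(labels, t=None):
    # Accept the most common label only when at least t of the M
    # experts agree on it; otherwise return None (abstention).
    if t is None:
        t = default_threshold(len(labels))
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= t else None
```

Raising `t` above the simple majority tightens the consensus requirement and therefore raises the rejection rate.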

Fig. 4 SVM ensemble with rejection option in stage 1, which consists of a set of binary SVMs (experts)

The samples rejected by the stage 1 SVM ensemble are handled by the second ensemble, a Random Subspace ensemble of neural network classifiers trained simultaneously with the stage 1 SVM ensemble. The neural network classifier is a Multilayer Perceptron (MLP) with one hidden layer containing a few hidden neurons and 3 output nodes, each representing a class label. The activation functions for the hidden and output nodes are a logistic sigmoid function and a linear function, respectively. Following the Random Subspace principle, a number of individual MLP models are trained on randomly chosen subsets of all available features; that is, an ensemble of MLP classifiers is created with each base classifier trained on an individual subspace formed by randomly selecting features from the entire space.

The last step of the second Random Subspace ensemble is to combine the base MLP models to give final decisions, following a majority-voting procedure similar to that of the first stage, as shown in Fig. 5. In combining the decisions from the \(M\) base MLPs, a sample (drawn from the collection of samples rejected at stage 1) is assigned a class label when at least \(t\) of the MLPs agree on the decision; otherwise, the sample is rejected. Again, \(t\) is the threshold that decides the rejection rate, and the consensus degree from the ensemble acts as a confidence measure to switch between acceptance and rejection.
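The serial fusion of the two stages reduces to a short dispatch routine; in this hedged sketch, `stage1` and `stage2` are any callables that return a class label or `None` for a rejection:

```python
def cascade_classify(x, stage1, stage2):
    # Two-stage serial fusion: stage 2 only sees samples rejected by
    # stage 1; samples rejected by both stages are returned as None
    # and left for a human expert.
    label = stage1(x)
    return label if label is not None else stage2(x)
```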

Fig. 5 Illustration of the stage 2 Random Subspace classifier ensemble which consists of a set of MLPs

3.3 Theoretical analysis of the ensemble cascade

Let \(p(C_{i})\) denote the prior probability of observing class \(C_{i}\). The posterior probability of class \(C_{i}\) given an instance vector \(x\) can be calculated as:

$$\begin{aligned} p(C_{i}|x)=\frac{p(x|C_{i})p(C_{i})}{p(x)} =\frac{p(x|C_{i})p(C_{i})}{\sum _{j=1}^{N}p(x|C_{j})p(C_{j})} \end{aligned}$$
(15)

where \(N\) is the number of classes, \(p(x|C_{i})\) is the conditional probability of \(x\) given class \(C_{i}\), and \(p(x)\) is the probability of \(x\).
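Equation (15) amounts to a one-line normalization; a minimal Python sketch with illustrative likelihood and prior values:

```python
def posteriors(likelihoods, priors):
    # Eq. (15): p(C_i | x) = p(x | C_i) p(C_i) / sum_j p(x | C_j) p(C_j).
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)  # p(x), the evidence
    return [j / z for j in joint]
```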

We adopted the mechanism proposed in [47] to derive the error rate of our system. For both stages in our scheme, given an input instance \(x\), the proposed classification is accepted or rejected according to the highest posterior probability over all classes: \(\mathrm{max}_{j \in [1,\ldots, N]}p(C_{j}|x)\). Since the output of our classifiers is only an approximation of the true posteriors, we use \(S_{i}\ (i=1,\ldots, N)\) to denote the approximate posterior probability for each class obtained by our system. Let \(MAX_{p}^{1}=\mathrm{max}_{j \in [1,\ldots, N]}p(C_{j}|x)\) denote the largest true posterior probability given an instance \(x\), and let \(MAX_{S}^{1}=\mathrm{max}_{i \in [1,\ldots, N]}S_{i}^{1}\) denote the largest approximate posterior probability obtained by stage 1 of our system. The error rate of stage 1, \(\epsilon _{1}\), can be obtained by:

$$\begin{aligned} \epsilon _{1}=\int _{A} (1-MAX_{S}^{1})p(x)\mathrm{d}x \end{aligned}$$
(16)

where \(A\) is the region composed of all accepted instances. Using some simple manipulations on Eq. (16), we then get the following:

$$\begin{aligned} \epsilon _{1}&= \int _{A}(1-MAX_{S}^{1})p(x)\mathrm{d}x \\&= \int _{A}(1-MAX_{p}^{1}+MAX_{p}^{1}-MAX_{S}^{1})p(x)\mathrm{d}x\\&= \int _{A}(1-MAX_{p}^{1})p(x)\mathrm{d}x\\&\quad + \int _{A\cap I^{S}}(MAX_{p}^{1}-MAX_{S}^{1})p(x)\mathrm{d}x \end{aligned}$$

where \(I^{S}\) is the region composed of all the instances that satisfy \(MAX_{p}^{1}-MAX_{S}^{1} \ne 0\), i.e., instances for which the output of our classifiers differs from the true posteriors. Notice that the first term of \(\epsilon _{1}\) is in fact the optimal Bayes error over the accepted region, \(\int _{A}(1-MAX_{p}^{1})p(x)\mathrm{d}x\). The second term comes from the errors generated during stage 1. This situation is illustrated in Fig. 6, where \(R\) represents the rejected patterns, \(A\) represents the patterns accepted by the classifier, and the crosses represent erroneous classifications made by the ensemble of stage 1.

Fig. 6 Error rate of stage 1

The same procedure can be used to analyze the error rate of stage 2. Instead of the whole input instance space, stage 2 only processes the instances rejected by stage 1. Let \(R\) denote the region composed of all the rejected instances from stage 1, \(R=\{x \,|\, \mathrm{max}_i(p(C_{i}|x))<t\}\), and let \(MAX_{p}^{2}=\mathrm{max}_{j\in [1,\ldots, N]}p(C_{j}|x)\) and \(MAX_{S}^{2}=\mathrm{max}_{i\in [1,\ldots, N]}S_{i}^{2}\). The error rate of stage 2 can then be obtained by:

$$\begin{aligned} \epsilon _{2}&= \int _{R}(1-MAX_{p}^{2})p(x)\mathrm{d}x \nonumber \\&+ \int _{R\cap I^{M}}(MAX_{p}^{2}-MAX_{S}^{2})p(x)\mathrm{d}x \end{aligned}$$
(17)

where \(I^{M}=\{x|MAX_{p}^{2}-MAX_{S}^{2} \ne 0\}\), which represents the errors generated by the stage 2 ensemble.

Given the above, the error rate of the whole system can be calculated as:

$$\begin{aligned} \epsilon&= \epsilon _{1} + \epsilon _{2} \nonumber \\&= \int _{A}(1-MAX_{p}^{1})p(x)\mathrm{d}x + \int _{R}(1-MAX_{p}^{2})p(x)\mathrm{d}x\nonumber \\&\quad +\int _{A\cap I^{S}}(MAX_{p}^{1}-MAX_{S}^{1})p(x)\mathrm{d}x \nonumber \\&\quad + \int _{R\cap I^{M}}(MAX_{p}^{2}-MAX_{S}^{2})p(x)\mathrm{d}x \nonumber \\&= \epsilon _{\text{Bayes}} + \int _{A\cap I^{S}}(MAX_{p}^{1}-MAX_{S}^{1})p(x)\mathrm{d}x \nonumber \\&\quad +\int _{R\cap I^{M}}(MAX_{p}^{2}-MAX_{S}^{2})p(x)\mathrm{d}x. \end{aligned}$$
(18)

From Eq. (18), to approach the goal \(\epsilon =\epsilon _{\text{Bayes}}\), we must have \(A\cap I^{S}=\emptyset \) and \(R\cap I^{M}=\emptyset \). This means that even if neither stage is optimal, there is still a chance of reaching the optimal classification error rate, although this can rarely be expected in real classification tasks.

Unlike many existing cascade systems, we use classifier ensembles in our architecture. As pointed out in [40], under sum-voting ensemble schemes the variance of the ensemble is less than that of the individual classifiers, and a smaller variance leads to a lower error rate than that of any individual classifier. From the above theoretical analysis, a cascade system composed of two ensembles can therefore be expected to yield a lower error rate than non-ensemble or non-cascade methods.

4 Experiments

MATLAB version 7 was used to implement the software in the current work. Six different individual classifiers were first applied to the image dataset, and their results were compared and analyzed. Then several popular classifier ensemble methods were employed to construct ensemble classifiers. To ascertain the effectiveness of the proposed feature combinations, several different feature combinations were computed and compared. The performance (accuracy and reliability) of the proposed two-stage ensemble cascade scheme was evaluated using different ensemble sizes and different rejection rates.

4.1 Comparison among single classifiers

In this section, we show the results obtained using six different classifiers on the biopsy image dataset, where each image was described in terms of the three kinds of features introduced in Sect. 2. The six classifiers were (1) \(k\)NN with \(k=3\), (2) a single MLP, (3) a single SVM, (4) Logistic Regression [48], (5) Fisher Linear Discrimination [48] and (6) the Naive Bayesian classifier [49]. For the MLP, we experimented with a three-layer network: the number of inputs equals the number of features, one hidden layer with 20 units was used, and a single linear output unit represents the class label. The network was trained using the Conjugate Gradient learning algorithm for 500 epochs. The LIBSVM library for support vector machines was used for the SVM experiments. We used the radial basis function kernel; the parameter \(\gamma \), which defines the spread of the radial function, was set to 5.0, and the parameter \(C\), which defines the trade-off between classifier accuracy and margin, was set to 3.0. For the microscopic biopsy images, we randomly split the dataset into training and testing sets, each time reserving \(20\,\%\) of each class's images for testing while the rest were used for training. The classification results were then averaged over 100 runs, with each run using a new random split of the data.
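The evaluation protocol (a per-class 20 % holdout repeated over random splits) can be sketched independently of the classifiers; this is a standard-library Python illustration of the splitting step only, not the paper's MATLAB/LIBSVM code:

```python
import random

def stratified_split(labels, test_frac=0.2, rng=None):
    # Reserve test_frac of each class's indices for testing and use
    # the rest for training, mirroring the per-class 20% holdout.
    rng = rng or random.Random(0)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for idxs in by_class.values():
        idxs = idxs[:]
        rng.shuffle(idxs)
        n_test = max(1, round(test_frac * len(idxs)))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test
```

Averaging accuracy over 100 such splits, each with a fresh random seed, reproduces the protocol described above.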

In Fig. 7, we compare the classification accuracies of the six classifiers. The averaged classification accuracies of the MLP and SVM were 94.90 and 94.85 %, respectively, far higher than those of the other four classifiers. The standard deviations of the classification accuracies are also compared in Fig. 7. Although the FLD has the smallest averaged standard deviation (0.0571), it has the lowest classification performance. The averaged standard deviations of the MLP and SVM are 0.0934 and 0.1040, respectively, which are relatively small compared with the remaining classifiers, indicating more stable classification performance.

Fig. 7 Classification accuracies and standard deviations from applying \(k\)NN, single MLP, single SVM, Logistic Regression (LR), Fisher Linear Discrimination (FLD), and Naive Bayesian (NB)

Figure 8 presents a box plot of the classification results obtained by these six single classifiers on the biopsy image dataset. From the figure it can be seen that the MLP and SVM classifiers have small variance ranges in their classification results, and their averaged classification accuracies are quite close to each other. These results contrast with the generally accepted perception that SVM classifiers outperform neural network classifiers. A plausible explanation for the better performance of the MLP in our experiment is that the MLP, as a memory-based classifier, is more resistant to errors introduced by insufficient data than the margin (distance-based) SVM. Given a limited amount of data, the Naive Bayesian classifier, Linear Discriminant and Logistic Regression perform worse than the SVM and MLP; this is because the performance of these classifiers depends on the amount of training data, the correlations between features, and the probability distribution of each feature, all of which may vary with empirical data. This part of the experiment demonstrated a result common to other research work: in general, SVMs and MLPs achieve better classification performance on biopsy image analysis.

Fig. 8 Boxplot of classification accuracies from applying single MLP, single SVM expert, Random Subspace SVM ensemble (RS-SVM) and Random Subspace MLP ensemble (RS-MLP)

4.2 Evaluation of random subspace ensembles

Table 2 shows the classification accuracies obtained using 7 different ensemble classifiers with different image feature combinations. The classifier ensemble methods compared here are: (1) Bagging [50] with SVM (BagSVM), (2) Bagging with MLP (BagMLP), (3) AdaBoost [51] with SVM (BoostSVM), (4) AdaBoost with MLP (BoostMLP), (5) Random Forest [52] with decision trees (RandF), (6) Random Subspace with MLP (RSMLP) and (7) Random Subspace with SVM (RSSVM). The three image feature types introduced earlier, Curvelet, GLCM, and LBP, are represented by the letters C, G, and L in Table 2, respectively. With all three features, each image has a 666-dimensional feature vector. Each randomly selected subspace used 80 % of the features for the training phase of the classifiers; for example, a 532-dimensional \((666\times 0.8)\) feature vector is used for training when all three kinds of features are used (C, G and L in Table 2). For comparison, the full (100 %) feature vectors were also used for classifier training; the corresponding results are listed in the last column of the table. The ensemble size was fixed at 25 for all the classifiers in Table 2.

Table 2 Classification accuracy (%) of 7 Ensemble classifiers on the biopsy image data with different image feature combinations

One can note from Table 2 that the use of ensembles does improve the classification accuracy. RSSVM and RSMLP produced the best performance, both obtaining classification accuracies around 95 % regardless of the types of image features used for training (Curvelet, GLCM or LBP), which is much better than the results obtained with other feature combinations. The results of the Random Subspace ensembles (RSSVM, RSMLP) using 80 % of the features for training are also better than those obtained using the whole feature vector, which shows that the classification task benefits from the Random Subspace ensemble.

Results on this image dataset using other kinds of features can also be compared: in [5], the level set method was used to extract image features and a 42-bin histogram was constructed to represent information about connected components, yielding a 6.6 % classification error rate.

Two important parameters for Random Subspace ensembles are the ensemble size (\(L\)) and the cardinality of the feature vectors (\(M\)). A "rule of thumb" has been put forward for the fMRI data classification problem [16], in which the authors proposed a feature subset size \(M=\frac{n}{2}\) and a consequent ensemble size of \(L=\frac{n}{10}\), where \(n\) is the dimension of the original feature vector. To find appropriate values of the ensemble size and feature vector cardinality for the current biopsy image classification work, the ensemble size was varied from 5 to 145 with a step size of 10. For each ensemble size, the cardinality of the feature vectors used for training was varied from 10 % of the original dimension to 100 %, in equally spaced intervals of 10 %. The classification results using RSSVM and RSMLP with different ensemble sizes and feature vector cardinalities are shown in Figs. 9 and 10, respectively.
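For reference, the rule-of-thumb values and the sweep grid described above can be enumerated directly; a hedged sketch in which integer division is an assumption (the paper states the rule of thumb only as fractions):

```python
def rule_of_thumb(n):
    # Heuristic from [16]: subset size M = n/2 and ensemble size
    # L = n/10 (integer division assumed here for illustration).
    return n // 2, n // 10

def parameter_grid():
    # Sweep used above: ensemble sizes 5..145 in steps of 10,
    # feature cardinalities 10%..100% in steps of 10%.
    sizes = list(range(5, 146, 10))
    fracs = [k / 10 for k in range(1, 11)]
    return sizes, fracs
```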

Fig. 9 Classification results of the RSSVM ensemble with different ensemble sizes and different cardinalities of training features

Fig. 10 Classification results of the RSMLP ensemble with different ensemble sizes and different cardinalities of training features

The same conclusion as in [24] can be drawn from Figs. 9 and 10, namely that classification performance does not simply improve with increasing ensemble size. Different feature vector cardinalities produced different performances. The Random Subspace MLP ensemble obtains its best classification accuracy of 96.83 % using \(M=\frac{4n}{5}\) and ensemble size \(L=105\). The Random Subspace SVM ensemble also achieved good performance, with an accuracy of 96.56 % at 80 % feature cardinality; however, unlike the MLP ensemble, the SVM ensemble maintains the same top performance for ensemble sizes from 85 to 115. Therefore, a feature cardinality of \(M=\frac{4n}{5}\) and an ensemble size of \(L=105\) were identified as the most appropriate for both the Random Subspace MLP and SVM ensembles.

4.3 Results of the proposed ensemble cascade system

In this experiment, we first used the RSSVM and RSMLP ensembles to construct different cascade classification systems. Four different two-stage cascade classifiers were built: RSSVM-RSSVM, RSMLP-RSMLP, RSSVM-RSMLP, and RSMLP-RSSVM, where RSSVM-RSSVM indicates that an RSSVM ensemble was employed in both stages 1 and 2, RSSVM-RSMLP indicates that an RSSVM ensemble was used in stage 1 and an RSMLP ensemble in stage 2, and so on.

The parameters for the RSSVM and RSMLP ensembles were as determined in the previous experiment, with ensemble sizes equal to 105 and feature cardinality set to 80 %. A rejection threshold of 84 (\(0.8\times 105\)) was set for both ensembles (stages 1 and 2), which means that a decision is adopted only when more than 80 % of the classifiers agree on it; otherwise, the instance is rejected by the ensemble. This relatively high threshold was used because we wished to ensure a high level of reliability with respect to classification decisions. The results of the different cascade schemes on the biopsy image dataset are listed in Table 3.

Table 3 Classification accuracy and reliability of different cascade schemes on the biopsy image data with rejection threshold of both stages equal to 84

From Table 3, it can be observed that all the two-stage cascade classifiers obtain better classification performance than the non-cascade ensembles tested in the previous experiment; this confirms the effectiveness of the cascade classification system, which benefits from the fact that samples rejected by the first ensemble still have a chance to be correctly classified by the second ensemble. Among the four cascade classifiers, the RSSVM-RSMLP cascade obtained the best classification accuracy with a relatively low rejection rate. A reasonable explanation is that using different base classifiers in the two ensembles increases the diversity of the whole cascade system, and that, compared with the SVM, the MLP is a more 'localized' classifier which is better suited to stage 2 [24].

To take a closer look at how the rejection rate influences the classification accuracy, we adjusted the threshold \(t_{2}\) for the majority voting of the stage 2 ensemble (\(t_{2}\)-out-of-\(L\), \(L=105\)), while fixing the stage 1 threshold at \(t_{1}=84\) (\(0.80\times 105\)). This resulted in average stage 2 rejection rates of between \(14.29\) and \(26.36\,\%\) for \(t_{2}=85,\dots, 95\), with corresponding overall rejection rates in the range \(0.68\) to \(1.94\,\%\). The stage 2 accuracies and corresponding overall accuracies for the varying rejection rates are plotted in Figs. 11 and 12, respectively. As expected, higher accuracy is obtained at higher rejection rates. It is worth noting, however, that when the stage 2 rejection rate reached 26.36 %, the stage 2 classification accuracy was 100 %; as we continued increasing the threshold \(t_{2}\), the increased rejection rate brought no further improvement in classification performance.

Fig. 11 Averaged stage 2 accuracies with 10 varying stage 2 rejection rates

Fig. 12 Averaged overall classification performances from 10 varying overall rejection rates

With \(t_{1}=84\) and \(t_{2}=95\), the classification accuracies and reliabilities of stage 1, stage 2 and the whole system are shown in Table 4. Compared with the results in Table 3, where the same threshold \(t_{1}=t_{2}=84\) was set for both stages, the overall classification accuracy and reliability were improved by increasing the value of \(t_{2}\), and the corresponding error rate dropped. However, this improved performance comes at the cost of an increased rejection rate, which means more images are left for human experts to analyze. The trade-off between accuracy and rejection rate can be decided empirically in practice.

Table 4 Averaged classification performance of the cascade schemes on the biopsy image data with rejection threshold \(t_{1}=84\) and \(t_{2}=95\)

The confusion matrix summarizing the overall performance at a rejection rate of \(1.94\,\%\) is displayed in Table 5. In the confusion matrix representation, the rows and columns indicate the true and predicted classes, respectively. The diagonal entries represent correct classifications, while the off-diagonal entries represent incorrect ones.

Table 5 Averaged confusion matrix with overall rejection rate 1.94 % (%)

5 Conclusion and future work

In this paper, a reliable classification scheme based on the serial fusion of Random Subspace ensembles has been proposed for the classification of microscopic biopsy images for breast cancer diagnosis. Rather than simply pursuing classification accuracy, we emphasized the importance of a reject option in order to minimize the cost of misclassifications and thus ensure high classification reliability. The proposed two-stage method uses a serial approach in which the second classifier ensemble is only responsible for the patterns rejected by the first. The first-stage ensemble consists of binary SVMs trained in parallel, while the second comprises MLPs. During classification, the cascade of classifier ensembles received randomly sampled subsets of features following the Random Subspace procedure. For both ensembles, the rejection option was implemented by relating the consensus degree from majority voting to a confidence measure and abstaining from classifying ambiguous samples when the consensus degree was lower than the threshold.

The effectiveness of the proposed cascade classification scheme was verified on a breast cancer biopsy image dataset. The combined feature representation from the LBP texture description, Gray Level Co-occurrence Matrix and Curvelet Transform exploits the complementary strengths of the different feature extractors, and proved effective for the biopsy image classification task. The two-stage ensemble cascade classification scheme obtained a high classification accuracy (99.25 %) while simultaneously guaranteeing high classification reliability (97.65 %) with a small rejection rate (1.94 %). Moreover, the cascade architecture provides a mechanism for balancing classification accuracy against rejection rate. By adjusting the rejection threshold in each ensemble, the classification accuracy and reliability of the system can be tuned, to a certain degree, according to the requirements of specific applications. For example, medical diagnosis tasks usually require high accuracy and reliability, so the rejection thresholds in each stage should be set to a high level in order to guarantee the correctness of the diagnosis.

Although the proposed system has shown promising results on the biopsy image classification task, some aspects still need further investigation. The benchmark images used in this work were cropped from the original biopsy scans and only cover the important areas of those scans. However, it is often difficult to find Regions of Interest (ROIs) that contain the most important tissues in biopsy scans; more effort therefore needs to be put into detecting ROIs in biopsy images. In this paper, the parameters of the cascade system, such as ensemble size and rejection threshold, were decided empirically; this may not produce the most satisfactory performance in all application contexts. Therefore, self-adaptive rules or algorithms for automatically optimizing these parameters would be desirable.