1 Introduction

In the fields of image processing and computer vision, techniques for feature extraction require special treatment for processing natural data in its raw form [1]. Thus, a machine learning-based system requires careful engineering, for instance, to define features with enough representativeness to enable the detection or pattern classification [2]. Techniques based on deep learning (DL) minimize some difficulties encountered in this process by making the feature engineering stage an automated process [3]. A DL-based system involves multiple layers of processing in order to provide data representation with different levels of abstraction, such as those based on convolutional neural networks (CNN) [3].

CNN models allowed significant advances in image processing due to the proposals presented in [4] and [5]. In these approaches, the increasing number of layers reduced the error rates in classification and pattern recognition tasks, with emphasis on computer-aided diagnosis (CAD) for histological images. AlexNet and deep residual network (ResNet) are some examples of CNN strategies widely explored in this system category due to the relevant results achieved to accurately classify different types of cells and tissue structures. These classic architectures have also been tested on important datasets, making them generalizable to histological images, as well as comparable and robust to variations in the staining process, an important challenge in this context [6,7,8,9,10]. The conditions presented here are useful for the improvement of existing CAD systems, especially when the classic approaches are investigated in order to verify their capacities of providing relevant features to reach more optimized and comprehensive solutions for specialists.

Considering this context, the ResNet architecture deserves to be highlighted for image classification tasks, such as the investigation of histological images, because it minimizes the well-known vanishing gradient problem [7, 9, 11, 12]. Even so, classic CNN models can contain millions of trainable parameters, which can make their training unfeasible when few samples are available, a situation commonly observed with histological images. A commonly explored alternative to overcome this limitation is the use of transfer learning with hybrid models [7, 13,14,15], via pre-training carried out on the ImageNet dataset [16]. This alternative reached relevant results in several domains [9, 13, 17], especially considering the use of feature maps from specific layers of the CNN architectures [15, 18, 19]. Thus, when hybrid models are considered, some issues have to be investigated to ensure the success of the solution, mainly involving the definition of layers, selection methods and classification strategy [7, 14, 15, 18, 20,21,22]. For instance, hybrid models were defined considering that the initial layers identify local patterns, such as color, edge and shape, whereas deeper layers generalize global patterns, such as texture and semantics [18].

In order to develop the previously cited models, the computational scheme can be based on feature selection algorithms with a single classifier or an ensemble classification [10, 23, 24]. Feature selection with an ensemble classifier combines the strengths of these strategies in order to provide more stable and relevant solutions [25,26,27,28], leading to more accurate and useful CAD systems. Moreover, the feature selection process plays a critical role in identifying complex patterns in relevant contexts, such as those explored here, with the most optimized and comprehensible solutions [29]. This process can be designed via techniques categorized as filter, wrapper and embedded methods, but there is no universal approach that defines the best results for all contexts [25,26,27,28]. Nevertheless, a filter algorithm such as ReliefF is capable of detecting feature dependencies and has indicated the best schemes in different experiments [27, 30,31,32]. The ReliefF algorithm is relatively fast, with an asymptotic time complexity of \(\mathcal {O}(n^{2} \cdot m)\), where n is the number of instances and m is the number of features, and the selected features do not depend on an induction algorithm [25]. Also, feature sets of different sizes can be obtained via this algorithm, based on any desired criteria. Consequently, when the most relevant subsets of features are indicated via an ensemble classifier, the hybrid models are more generalizable and robust, reaching more optimized and comprehensive solutions.

In this paper, a computational scheme is described to provide hybrid models via the association of deep features by transfer learning, selection by ranking, and a robust ensemble classifier. The obtained models were analyzed to classify breast, colorectal and liver tissue images stained with hematoxylin-eosin (H&E). The proposal considered deep features provided by layers from the AlexNet and ResNet-50 architectures. In the AlexNet architecture, the computational scheme explored all convolutional layers. In the ResNet-50 network, the analyzed layers were those able to provide local and global image patterns: max_pooling2d_1, activation_4_relu, activation_48_relu, activation_49_relu and avg_pool. The deep features were organized into subsets and submitted to a k-fold cross-validation process. A systematic analysis was carried out in order to rank and define the most relevant subsets via an ensemble classifier with five algorithms. Thus, this study provides the following contributions:

  1. A computational scheme able to provide hybrid models representing the main associations of deep features by transfer learning, ReliefF algorithm and an ensemble classifier with five algorithms;

  2. An optimized hybrid model that provided the best performance for distinguishing breast cancer, based on only 35 deep features from the intermediate layer (activation_48_relu) of the ResNet-50;

  3. Hybrid models based on AlexNet’s deep features that outperform CNN architectures by directly classifying the UCSB dataset, with or without data augmentation;

  4. Hybrid models via ResNet-50’s deep features that showed more relevant results for classifying UCSB and LG datasets than regularized techniques and CNN architectures;

  5. Solutions based on a reduced number of features and without overfitting, useful for developing CAD systems focused on H&E images or even as more robust baseline schemes commonly explored in this type of investigation.

Section 2 describes relevant works on the classification of H&E images exploring hybrid models. The methodology is presented in Section 3, and the results are presented and discussed in Section 4. Finally, the conclusion is presented in Section 5.

2 Related work: An overview

The use of hybrid models based on handcrafted features (HC) or deep features by transfer learning has indicated important advances in different contexts. Regarding models based on HC, Watanabe et al. [33] presented an approach via Gist descriptors, principal component analysis and linear discriminant analysis to classify liver histological images. The system was able to provide an accuracy of 93.70%. In the proposal of [34], the authors presented associations of sample entropy and a fuzzy strategy to classify colorectal tissue, and the achieved accuracy was 91.39%. The authors in [35] presented a histological image classifier for the breast and colorectal tissues. The model used percolation attributes and color-normalized images. The accuracy values were 86.20% to distinguish breast tumors and 90.90% to classify colorectal tumors. In another study [36], an approach was proposed in order to detect malignant tumors in representative histological images of breast cancer. The proposed method reached an accuracy of 86.20% employing the texture descriptors, morphological attributes and intensity.

When strategies exploring CNN architectures are taken into account, the proposal of [6] considered an adversarial stain transfer technique for classifying histological images of colorectal tissue. The authors used the U-Net model for the stain-transfer network, exploring the fully connected layer from the AlexNet architecture. The accuracy obtained by the model was 87.50%. The authors of [37] classified breast tissues via a model based on 13 CNN layers with the SVM classifier, reaching an accuracy of 83.30%. Considering the tensor decomposition for multiple-instance, the proposal of [38] achieved an accuracy of 84.67% for classifying the same type of tissue.

In addition, Kausar et al. [39] described a classifier based on color normalization, Haar wavelet decomposition and a 16-layer CNN. In this proposal, the maximum accuracy was 91% to distinguish breast tissue samples. In another study [40], a model based on deep learning with a stacked denoising autoencoder was proposed in order to analyze H&E breast cancer images. In this strategy, the accuracy was 94.41%. The authors of [41] explored the RefineNet and DenseNet architectures, through the deep-reverse active learning technique. The model was applied to classify H&E histological images representative of breast cancer, with an accuracy of 97.63%. In the proposal of [42], the authors developed a modular cGAN classification framework for colorectal tumor detection. This approach used the U-Net and Inception V3 models, via pre-training on the ImageNet dataset, providing an accuracy of 94.02%. Saxena et al. [8] described a ResNet-50 model with kernelized weighted to distinguish H&E breast tissue samples. The achieved accuracy was 60.30%. Recently, Lee and Wu [43] presented the DIU-Net architecture with a color conversion scheme in the training step. When applied to breast tumors, the model indicated an accuracy of 94.09%.

Strategies to optimize deep learning models can also be found in the study of H&E images. Deep learning techniques have been used for detecting preneoplastic and neoplastic lesions in human colorectal histological images [44], with a model that provided an accuracy of 93.28%. In another study [45], the authors proposed a classifier using the U-Net and GoogLeNet networks with color normalization. The model was able to classify images of colorectal tissue with an accuracy of 85%. Also, a model based on ResNet, transfer learning and deep-tuning was defined to classify the same type of tumor [9]. The strategy provided an accuracy of 86.67%. Considering this type of image, Dabass et al. [46] and Dabass et al. [47] presented models based on a 31-layer CNN and a hybrid CNN with attention learning, respectively. The system described in [46] achieved an accuracy of 96.97%, while the strategy developed in [47] provided an accuracy of 97.50%.

Hybrid models are also observed with deep learning and classic classification techniques applied to histological images [10, 14, 17, 48]. For instance, Kumar and Sharma [17] developed a strategy via the Xception and VGG-16 architectures, exploring different types of classifiers (logistic regression, SVM and decision tree) and artificial data augmentation. The model was applied to classify H&E breast cancer samples. The best accuracy was 82.45%. In another study [10], the authors described a composition of a fractal neural network with an ensemble classification based on experiments with the ResNet-50, ResNet-101, Inception V3 and Xception architectures. Handcrafted features were also used, such as lacunarity, fractal dimension and percolation. The models were applied to investigate breast cancer, colorectal cancer, lymphoma and liver tissue H&E images. The authors concluded that the combination was able to provide accuracy rates from 89.66% to 99.62%. Also, Longo et al. [14] presented a hybrid model involving handcrafted attributes (lacunarity and fractal dimension) with deep features from the ResNet-50, Inception-V3 and VGG-19 networks. The authors explored multiple classifiers for H&E images of breast cancer, colorectal tumor, and liver tissue. The achieved accuracy values ranged from 93.10% to 99.25%.

Finally, some studies analyze the discriminative power of specific layers of a CNN. The authors of [7] proposed a study based on the fully connected layers from the AlexNet, VGG-16 and VGG-19 networks. Deep features were combined with HC descriptors. With a k-nearest neighbors classifier, the model achieved an accuracy of 84.20% in distinguishing H&E breast cancer images. Also, Younas et al. [21] described an ensemble framework of deep neural networks in order to distinguish polyps in colorectal images. The authors used the GoogLeNet, Xception and ResNet-50 networks, all pre-trained on the ImageNet dataset, with a combination via ensemble classifier. Younas et al. [21] state that the system was able to surpass the performances reported in the literature for distinguishing the classes of colorectal cancer, adenoma, hyperplasia and adenocarcinoma.

From the previously presented works, it is possible to note the benefits of using hybrid models based on multiple strategies, exploring transfer learning, deep features and ensemble classification. In this context, the hybrid models applied to histological images can be highlighted, such as those discussed in [7, 10, 14, 17, 21, 37, 47]. The proposals based on deep learning models lead to significant results for tissue with distinct magnifications. Despite these valuable contributions, multiple associations have been defined without fully exploring the limits of classic architectures in order to design hybrid models. Moreover, hybrid models designed with the most relevant deep features from classical models, via selection by ranking, multiple convolutional layers and robust ensemble classification, have not been fully explored in several types of H&E images, such as breast, colorectal and liver tissue. This type of investigation, and the resulting models, can provide more optimized and comprehensive solutions for specialists and CAD systems, especially when the results are relevant in datasets with few samples and without overfitting.

3 Methodology

The hybrid models were obtained through a computational scheme that explores deep features obtained from CNN architectures pre-trained on the ImageNet dataset [16], selection by ranking and an ensemble classifier to investigate different types of H&E images. This scheme investigated sets of layers from the CNN models and collected the corresponding deep features, a process carried out during the execution of each network. Then, the most relevant deep features were obtained from a systematic analysis, based on selection by ranking (ReliefF algorithm) with the k-fold cross-validation strategy. Finally, an ensemble classification with five algorithms was applied to identify the most relevant associations. The obtained hybrid models were applied for distinguishing the different lesion patterns present in H&E histological images. An overview of this scheme is illustrated in Fig. 1, with details presented in the next subsections.

Fig. 1 Illustrative overview of the proposed scheme for classification of H&E histological images

3.1 Software packages and environment for the experiments

In this work, the approach for processing the CNN architectures and extracting deep features was implemented using the deep network design and transfer learning toolboxes, available in the MATLAB R2019a package [49]. The layers explored here follow the nomenclature defined in these toolboxes. The algorithms employed for selecting and classifying features are available in the Weka 3.6.15 package [50]. From the CNN models, the deep features were explored considering the stochastic gradient descent with momentum optimizer using the default parameters: an initial learning rate of 0.0001; a learning rate drop period of 10; a learning rate drop factor of 0.1; an L2 regularization value of 0.0001; and a mini-batch size of 4. The experiments were carried out by splitting the entire dataset into 80% training and 20% test data. The experiments were performed on an AMD Ryzen 5 3600X 6-Core CPU at 3.79 GHz with 64 GB of RAM and an NVIDIA GeForce GTX 1660 SUPER.
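The training settings above can be illustrated with a minimal Python sketch. It assumes a step-wise ("piecewise") learning-rate schedule with 1-based epochs and a shuffled hold-out split; the function names, the epoch indexing and the random seed are illustrative assumptions, not part of the MATLAB implementation used in the study.

```python
import random

def piecewise_learning_rate(epoch, initial_rate=1e-4, drop_period=10, drop_factor=0.1):
    """Learning rate for a 1-based epoch under a step-wise drop schedule."""
    # Number of completed drop periods before this epoch (assumed indexing).
    drops = (epoch - 1) // drop_period
    return initial_rate * (drop_factor ** drops)

def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

# Learning rate at a few epochs of a 30-epoch run.
for epoch in (1, 10, 11, 21, 30):
    print(epoch, piecewise_learning_rate(epoch))

# 80%/20% split of a hypothetical set of 50 images.
train_idx, test_idx = train_test_split_indices(50)
print(len(train_idx), len(test_idx))
```

Under these assumptions, the rate stays at 0.0001 for epochs 1 to 10 and is multiplied by 0.1 at the start of every following block of 10 epochs.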

3.2 Image datasets

The proposed approach was evaluated through three different public datasets of H&E histological images: breast cancer, colorectal cancer and liver tissue. For breast cancer, the images were provided by the Center of Bio-image Informatics, University of California, Santa Barbara (UCSB) [51]. The dataset consists of 58 breast histological images, divided into 32 benign and 26 malignant. The second dataset, colorectal cancer (CR), was provided by [52]. This dataset has 74 benign and 91 malignant samples, totaling 165 images. The third dataset, named liver gender (LG), was provided by the Atlas of Gene Expression in Mouse Aging Project (AGEMAP) [53]. This dataset consists of liver samples from mice separated into males and females. Thus, these two classes represent the gender of the animal from which the sample was collected, totaling 265 images: 150 male and 115 female. In this work, the number of images was adjusted in order to balance each dataset, considering the smallest number of samples available among the groups of that dataset. The removed samples were randomly chosen. This procedure prevented a dominant group from affecting the result. Table 1 presents the details related to the datasets explored in this study. All investigated datasets have two classes, and the images are exclusively stained with H&E. Some examples of these images are presented in Fig. 2.
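The balancing step can be sketched as random undersampling to the size of the smallest class. The snippet below is only an illustration under that reading: the function name, the seed and the placeholder file names are assumptions, and the class sizes are the ones quoted in the text.

```python
import random

def balance_by_undersampling(samples_by_class, seed=0):
    """Randomly drop samples so every class keeps as many as the smallest one."""
    rng = random.Random(seed)
    target = min(len(items) for items in samples_by_class.values())
    return {label: rng.sample(items, target)
            for label, items in samples_by_class.items()}

# Class sizes reported in the text; image identifiers are placeholders.
datasets = {
    "UCSB": {"benign": 32, "malignant": 26},
    "CR": {"benign": 74, "malignant": 91},
    "LG": {"male": 150, "female": 115},
}

balanced_sizes = {}
for name, sizes in datasets.items():
    images = {label: [f"{name}_{label}_{i}" for i in range(n)]
              for label, n in sizes.items()}
    balanced = balance_by_undersampling(images)
    balanced_sizes[name] = {label: len(items) for label, items in balanced.items()}
    print(name, balanced_sizes[name])
```

Under this rule, each class would keep 26 (UCSB), 74 (CR) and 115 (LG) images.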

Table 1 Details related to the histological datasets explored through the obtained hybrid models

3.3 CNN architectures and layer selection: exploring transfer learning

The proposed scheme considered classic CNN models, namely the AlexNet and ResNet-50 architectures, pre-trained on the ImageNet dataset [16]. Training and optimizing these CNN models from scratch requires a large dataset. For small datasets, such as those explored here, it is difficult to determine appropriate local minima for the cost function, and the network may suffer from overfitting. To overcome these limitations, the use of pre-trained models has been widely explored in recent studies [54, 55], considering transfer learning. This strategy can provide high-level deep features, even on datasets with few labeled samples [56]. Also, it is important to highlight that the architectures explored here are still widely used in many investigations regarding CAD systems, especially due to the significant results obtained in classifying different cell types and tissue structures in H&E images, as well as their robustness to variations in the staining process of this type of image [6,7,8,9,10]. Thus, the initial AlexNet layers were explored to extract low-level features such as edges and textures, while the later layers were defined to recognize higher-level patterns and structures. Regarding the use of ResNet, the initial layers were investigated to extract low-level features, while the deeper layers were indicated to recognize more complex and higher-level features. These features were used in order to define hybrid models based on different conditions for classifying the H&E images, exploring transfer learning in order to minimize overfitting and the vanishing gradient problem [57].

Fig. 2 Examples of H&E histological images: breast UCSB [51], benign (a) and malignant (b); CR [52], benign (c) and malignant (d); LG [53], male (e) and female (f)

The AlexNet model consisted of five convolutional layers, three pooling layers, two fully connected layers, and one softmax layer [57]. This architecture used a dropout regularization scheme and rectified linear units (ReLU) to reduce overfitting [58], as well as local response normalization (LRN) to minimize the vanishing gradient problem [57]. On the other hand, the ResNet-50 model consisted of four blocks, each one with convolution layers and residual blocks. The first block had nine convolution layers and three residual blocks. The second block had 12 convolution layers and four residual blocks. The third block had 18 convolution layers and six residual blocks. The fourth block had the same number of convolution layers and residual blocks as the first block [59]. In this architecture, the layers received both the values resulting from the ReLU activation function and the input values of these functions. Thus, the ResNet-50 architecture used identity shortcut connections, together with batch normalization, to skip layers, which minimizes issues involving overfitting and the vanishing gradient problem [59]. An overview of the CNN models is presented in Table 2.
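As a quick consistency check of the ResNet-50 description, the block counts above can be summed in a few lines of Python. The sketch assumes the usual convention in which the "50" counts the initial convolution and the final fully connected layer in addition to the convolutions inside the residual blocks; variable names are illustrative.

```python
# (residual blocks, convolution layers) per block group, as listed in the text.
stages = [(3, 9), (4, 12), (6, 18), (3, 9)]

residual_blocks = sum(blocks for blocks, _ in stages)
stage_convs = sum(convs for _, convs in stages)

# Each bottleneck residual block contributes three convolutions on its main path.
consistent = all(convs == 3 * blocks for blocks, convs in stages)

# Assumption: the total also counts the first convolution and the final
# fully connected layer.
total_weighted_layers = 1 + stage_convs + 1

print(residual_blocks, stage_convs, total_weighted_layers, consistent)
```

This gives 16 residual blocks and 48 convolutions inside them, which together with the two remaining layers adds up to the 50 layers in the name of the network.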

Table 2 An overview of the CNN models explored in this study

According to the CNN models previously described, the deep features were obtained considering the strategy presented by [13]. For the ResNet-50 architecture, the proposed scheme explored two initial layers and the last three layers of the network. The initial layers provided deep features responsible for quantifying edges and colors in the images. The deeper layers were used to identify global patterns, such as texture and semantics [18]. The max_pooling2d_1 layer corresponded to the max pooling (with a stride of \(2 \times 2\)) applied to the first convolution layer, which had a kernel size of \(7 \times 7\) and 64 different filters. The layer with the most features was activation_4_relu, with the corresponding function \(\mathcal {F}(\textsf{x})+\textsf{x}\) from the first residual block, which was useful to evaluate the accuracy of the model with a set of dense features. The activation_48_relu and activation_49_relu layers belong to the final segment of the ResNet-50 model, being part of the last residual block and the last activation layer of the network, respectively. Finally, the last layer chosen was avg_pool, the average pooling layer with a pool size of \(7 \times 7\) applied to activation_49_relu, due to its lower number of features.

Regarding the deep features via AlexNet, the investigation was performed with the five convolutional layers of the network, excluding the fully connected and softmax layers. It is important to note that the first four convolution layers were selected based on the ReLU activation function of each layer, removing features with negative values. In addition, the pool5 layer corresponded to the max pooling of the last convolution layer in the network.

The names of the layers with the total features used to design the hybrid models are shown in Table 3.

Table 3 Information related to the layers and corresponding deep features to define the hybrid models

3.4 Strategy for investigating and selecting the most relevant deep features

The output of each layer of a CNN architecture is represented by an n-dimensional array, here named \(M_{i}[...]\), in which i identifies each one of the five layers of each CNN model under investigation. The columns of each \(M_{i}\) array were sequentially arranged into a vector of deep features \(V_{i}[...]\), so that \(M_{i}\) and \(V_{i}\) have the same number of elements. It is important to note that the order of the deep features was preserved in relation to that observed in each \(M_{i}\), making it possible to reconstruct each array from the values of \(V_{i}\) and the dimensions of \(M_{i}\). An illustration of this representation is shown in Fig. 3.
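The column-wise arrangement and its inverse can be illustrated with a small two-dimensional example. This is only a sketch: real activation maps have more dimensions, and the helper names `flatten_columns` and `rebuild_from_columns` are hypothetical.

```python
def flatten_columns(matrix):
    """Stack the columns of a 2-D array (list of rows) into one vector."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    vector = [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]
    return vector, (n_rows, n_cols)

def rebuild_from_columns(vector, shape):
    """Invert flatten_columns using the stored shape."""
    n_rows, n_cols = shape
    return [[vector[c * n_rows + r] for c in range(n_cols)]
            for r in range(n_rows)]

M = [[1, 2, 3],
     [4, 5, 6]]
V, shape = flatten_columns(M)
print(V)                                  # [1, 4, 2, 5, 3, 6]
print(rebuild_from_columns(V, shape) == M)
```

Because the order of the elements is fixed by the column traversal, the vector and the stored shape are enough to recover the original array.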

After defining the \(V_{i}\) vectors, each set was distributed into S subsets, according to (1). The limit of 100 deep features was defined based on the models described in [25, 60]. Thus, each \(S_{m}\) subset was defined by the m best-ranked elements of each \(V_{i}\) under investigation, considering the ReliefF algorithm [61,62,63]. This algorithm was chosen because it is a powerful and widely used feature selection method for machine learning and data mining problems, as employed in [14, 22, 25, 27]. In the proposed approach, this algorithm identified and ranked the most significant features within an original dataset to enhance the predictive capability of the hybrid models. The algorithm estimated the feature weights from the differences observed between similar instances, penalizing features that provide distinct values to neighbors of the same group. In addition, the algorithm rewarded features that indicated different values to neighbors of distinct groups [25, 64]. This process was carried out through a random sampling of instances, and the weights of the features were accumulated. Finally, the features were ranked according to their weights in order to indicate the most relevant predictors in each convolutional layer.

$$\begin{aligned} m \in \{5, 10, 15, \ldots , 100\}, \end{aligned}$$
(1)

where m indicates the number of deep features in each subset.
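The weighting and ranking idea can be sketched in plain Python. Note that this is a simplified two-class Relief with a single nearest hit and a single nearest miss per sampled instance, not the ReliefF implementation available in Weka; the function names, the range normalization and the toy data are assumptions made for illustration.

```python
import random

def relief_weights(X, y, n_iterations=None, seed=0):
    """Simplified two-class Relief: one nearest hit and one nearest miss per draw."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Scale each feature difference by its range so weights are comparable.
    spans = []
    for f in range(m):
        col = [row[f] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, b, f):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(a, b, f) for f in range(m))

    n_iterations = n_iterations or n
    weights = [0.0] * m
    for _ in range(n_iterations):
        i = rng.randrange(n)  # random sampling of instances
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(m):
            # Reward differences to the other class, penalize differences
            # to the neighbor of the same class.
            weights[f] += (diff(X[i], X[miss], f) - diff(X[i], X[hit], f)) / n_iterations
    return weights

def ranked_subsets(weights, sizes):
    """Return, for each size, the indices of the best-ranked features."""
    order = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
    return {size: order[:size] for size in sizes if size <= len(order)}

subset_sizes = list(range(5, 101, 5))  # m = 5, 10, ..., 100, as in (1)

# Toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.2], [1.0, 0.8], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
print(w, ranked_subsets(w, [1, 2]))
```

In the toy data, the discriminative feature receives a positive weight and the noisy one a negative weight, so it is ranked first; with real deep features, the subsets \(S_{m}\) would be obtained by passing `subset_sizes` to `ranked_subsets`.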

Fig. 3 Illustration of the organization of deep features obtained from a layer in a formatted feature vector

The analysis of each \(S_{m}\) subset was performed through the k-fold cross-validation strategy in order to evaluate the generalization capacity of the models. In addition, \(k=5\) was defined in all tests due to the reduced number of available samples in each histological dataset. Finally, a robust ensemble classification was applied to calculate the accuracy rate (2) in each fold. The average accuracy rate in each \(S_{m}\) subset was given by (3). Therefore, the best association of \(V_{i}\) and a corresponding subset \(S_{m}\) was defined through the highest average accuracy rate (\(Acc\_Avg\)) in each evaluated dataset. Consequently, the obtained results correspond to the most relevant deep features, via transfer learning, for pattern recognition in the investigated H&E images. Figure 4 illustrates the described steps for the feature selection and classification processes.

$$\begin{aligned} Acc_{j} \,\, =\frac{TP+TN}{TP+FP+TN+FN}, \end{aligned}$$
(2)

in which: j refers to the index of the fold in the cross-validation iteration; TP, the number of true positives, counts the outcomes where the model correctly predicts the positive group; TN, the number of true negatives, counts the outcomes where the model correctly predicts the negative group; FP, the number of false positives, counts the outcomes where the model incorrectly predicts the positive group; and FN, the number of false negatives, counts the outcomes where the model incorrectly predicts the negative group.

$$\begin{aligned} Acc\_Avg \, =\frac{1}{k}\sum _{j=1}^{k}Acc_{j}. \end{aligned}$$
(3)
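Equations (2) and (3), together with the choice of the best subset, can be sketched as follows. The confusion-matrix counts and the average accuracies are hypothetical values, and breaking ties in favor of the smaller subset is an assumption made for this illustration.

```python
def fold_accuracy(tp, tn, fp, fn):
    """Accuracy of one fold from its confusion-matrix counts, as in (2)."""
    return (tp + tn) / (tp + fp + tn + fn)

def average_accuracy(fold_counts):
    """Mean of the per-fold accuracies, as in (3)."""
    accuracies = [fold_accuracy(*counts) for counts in fold_counts]
    return sum(accuracies) / len(accuracies)

def best_subset(acc_avg_by_size):
    """Subset size with the highest average accuracy; ties go to fewer features."""
    return min(acc_avg_by_size, key=lambda m: (-acc_avg_by_size[m], m))

# Hypothetical (TP, TN, FP, FN) counts for k = 5 folds of 10 images each.
folds = [(5, 5, 0, 0), (5, 4, 1, 0), (4, 5, 0, 1), (5, 5, 0, 0), (5, 5, 0, 0)]
print(average_accuracy(folds))

# Hypothetical average accuracies for three subset sizes.
print(best_subset({5: 0.90, 35: 0.98, 100: 0.98}))
```

In the second call, the subsets with 35 and 100 features reach the same average accuracy, so the smaller one is selected.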

Fig. 4 Illustration of the feature selection and classification processes proposed in this study

3.5 Definition of the ensemble classifier

The use of different classifiers is a strategy commonly applied in machine learning-based solutions, offering successful analyses of histological images [65, 66], especially because it gives a more representative view of the problem under investigation and reduces overfitting. However, the combination presented here has not been used in the specialized literature focusing on H&E imaging. For this purpose, the ensemble classifier was based on five algorithms of different categories: K* [67], logistic discrimination (LD) [68], naive Bayes (NB) [69], random forest (RF) [70] and SVM [71]. Thus, the decisions were based on the common behaviors of the classifiers, making them more reliable and less prone to overfitting. The classifications were combined through the sum rule, which can be summarized as the sum of the prediction probabilities obtained by each classifier [72]. This rule was used due to the good results reported by [10]. The final decision is given by the ensemble, making it possible to define which associations were the most relevant to distinguish the investigated histological image groups.
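The sum rule can be illustrated with a short sketch. The probabilities below are hypothetical outputs for a single image, and the function name `sum_rule` is an assumption; only the five classifier names come from the text.

```python
def sum_rule(probabilities_per_classifier):
    """Add the class probabilities of all classifiers and return the winning class."""
    n_classes = len(probabilities_per_classifier[0])
    totals = [sum(p[c] for p in probabilities_per_classifier)
              for c in range(n_classes)]
    winner = max(range(n_classes), key=lambda c: totals[c])
    return winner, totals

# Hypothetical outputs for one image, ordered as [P(benign), P(malignant)].
outputs = {
    "K*": [0.40, 0.60],
    "LD": [0.55, 0.45],
    "NB": [0.30, 0.70],
    "RF": [0.45, 0.55],
    "SVM": [0.60, 0.40],
}
label, totals = sum_rule(list(outputs.values()))
print(label, totals)
```

Here the summed probabilities favor the second class (index 1), so the ensemble would label this hypothetical image as malignant.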

4 Results and discussion

The proposed scheme was tested on three sets of histological images, as presented in Section 3.2. The evaluated comparisons were: benign versus malignant for the UCSB and CR datasets; and male versus female for the LG dataset. Considering the investigated layers (Table 3), the features selected by ranking were evaluated via the Mann-Whitney U test in order to measure the significance of each subset in distinguishing the groups investigated here. Each test considered the empirical cumulative distribution function of the descriptors with the corresponding p-values [73], analyzing the 100 best-ranked attributes via the ReliefF algorithm. Features with p-values of 0.05 or less were considered statistically significant. The main results were observed using the networks: ResNet-50 with activation_48_relu (UCSB) and avg_pool (CR and LG); AlexNet with relu2 (UCSB), relu3 (CR) and pool5 (LG). The cumulative distribution function of each set is shown in Fig. 5. In the UCSB dataset, more than 80% of the features are statistically separable (p-values \(\le \) 0.05). In the other datasets, this percentage is even higher, with approximately 95% of the features being statistically separable. Thus, Figs. 6 and 7 show the accuracy rates in relation to the number of deep features after applying the proposed ensemble classifier (Sections 3.4 and 3.5).
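The percentage of statistically separable features corresponds to the value of the empirical cumulative distribution function at the 0.05 threshold. A minimal sketch, using hypothetical p-values rather than the ones obtained in the experiments:

```python
def ecdf(values, threshold):
    """Fraction of values that are less than or equal to the threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Hypothetical p-values for ten ranked features
# (the study analyzed the 100 best-ranked ones).
p_values = [0.001, 0.004, 0.01, 0.02, 0.03, 0.04, 0.049, 0.05, 0.20, 0.61]
significant_fraction = ecdf(p_values, 0.05)
print(significant_fraction)  # 0.8
```

In this toy list, eight of the ten p-values are at or below 0.05, so 80% of the features would be considered statistically significant.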

Fig. 5 The empirical cumulative distribution function (y-axis) of the 100 best-ranked features with the corresponding p-values (x-axis), considering the main results for classifying the UCSB, CR and LG datasets

Fig. 6 The main results using ResNet-50’s deep features with ReliefF algorithm and ensemble classifier: UCSB with the layer activation_48_relu; CR with the layer avg_pool; and LG with the layer avg_pool

Fig. 7 The main results via AlexNet’s deep features with ReliefF algorithm and ensemble classifier: UCSB with the layer relu2; CR with the layer relu3; and LG with the layer pool5

Table 4 Summary of the main hybrid models for classifying different histological images, with information regarding the network, layers and the number of deep features

From the main results, the proposed scheme identified the highest accuracy rate with the lowest number of features in each scenario, representing the main hybrid models. For instance, the hybrid model using 100 deep features from the deepest layer of the AlexNet (pool5) achieved an accuracy rate of 98.70% in the LG dataset. However, the hybrid model exploring the deepest layer of the ResNet-50 network (avg_pool) presented an accuracy rate of 99.32% with only 5 deep features. This last association (ResNet-50’s deep features via avg_pool) was also responsible for providing the most relevant features for the CR dataset, with an accuracy rate of 98.00%. In this case, the hybrid model was defined with 35 deep features. These behaviors are in accordance with the investigations available in the literature, which indicate that deeper layers tend to provide higher-level features [20, 74,75,76,77]. However, when the UCSB dataset is observed, the best hybrid model was based on only 35 deep features from an intermediate layer (activation_48_relu) of the ResNet-50 architecture. This model provided an accuracy rate of 98%. Thus, this study contributes to the literature by indicating the detailed conditions of this fact for the pattern recognition of breast cancer via H&E images. Moreover, models exploring the relu2 and relu3 layers, belonging to the intermediate segment of the AlexNet architecture, also provided expressive results in the UCSB and CR datasets; overall, the AlexNet-based models yielded accuracy values from 91.89% (CR dataset) to 98.70% (LG dataset). These results indicate that the proposed scheme was able to define the main layers and the corresponding features to quantify global and local patterns from different histological images [18].

Considering the conditions previously discussed, Table 4 summarizes the main hybrid models, with the layers that provided the most relevant deep features, the total number of attributes used, and the accuracy rates in each histological dataset. It is verified that deep features from the ResNet-50 architecture define the best hybrid models, with a reduced number of descriptors (up to 35 features). This is another contribution, since these conditions enabled the use of CNN in datasets with few samples and without overfitting, a condition supported by a robust ensemble classifier composed of five algorithms from different categories.
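The selection rule discussed above (highest accuracy rate, then the lowest number of features) can be sketched as follows. This is a minimal Python illustration and not the implementation used in this study: the function name `pick_best` and the tuple layout are hypothetical, and the candidate list holds only the two LG results quoted in the text.

```python
from typing import List, Tuple

# (network, layer, number of deep features, accuracy in %)
Candidate = Tuple[str, str, int, float]


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the highest accuracy; ties go to fewer features."""
    return max(candidates, key=lambda c: (c[3], -c[2]))


# LG results quoted in the text (the container itself is hypothetical).
lg_candidates = [
    ("AlexNet", "pool5", 100, 98.70),
    ("ResNet-50", "avg_pool", 5, 99.32),
]

best_lg = pick_best(lg_candidates)
print(best_lg)
```

The tie-breaking key is a design choice of this sketch: when two candidates reach the same accuracy, the one with fewer deep features is preferred, which matches the stated goal of compact models.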

4.1 Comparisons with techniques for classification and pattern recognition

In this work, some consolidated machine-learning techniques were applied in order to evaluate the main hybrid models via direct comparisons. The results were provided directly by the AlexNet and ResNet-50 networks, as well as via regularized classification techniques: Lasso (least absolute shrinkage and selection operator) and Ridge regression [78,79,80]. Regularization approaches are widely used to reduce the generalization error by fitting a function appropriately to the given training set while avoiding overfitting. The Lasso technique adds to the objective function a penalty term proportional to the sum of the absolute values of the coefficients. On the other hand, the Ridge strategy adds a penalty term proportional to the sum of the squares of the coefficients. The experiments using the regularized techniques were performed in the Scikit-Learn 0.18.1 package [81].
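The difference between the two penalties can be illustrated with a small Python sketch. It is not the Scikit-Learn code used in the experiments: the function names, the regularization weight `lam` and the toy coefficients are assumptions chosen for illustration. The two `*_shrink` functions give the closed-form minimizers of a one-dimensional squared-error objective plus each penalty.

```python
from typing import Sequence


def l1_penalty(coefs: Sequence[float], lam: float) -> float:
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefs)


def l2_penalty(coefs: Sequence[float], lam: float) -> float:
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(c * c for c in coefs)


def lasso_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b), i.e. soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5*(b - z)**2 + lam*b**2."""
    return z / (1.0 + 2.0 * lam)


# Toy unpenalized coefficients (hypothetical values, not from the study).
toy_coefs = [2.0, 0.3, -0.1]
lam = 0.5

lasso_coefs = [lasso_shrink(z, lam) for z in toy_coefs]
ridge_coefs = [ridge_shrink(z, lam) for z in toy_coefs]
print(lasso_coefs)  # small coefficients are set exactly to zero
print(ridge_coefs)  # all coefficients shrink but stay non-zero
```

In this toy setting the Lasso penalty removes the two small coefficients, which is why it can act as a feature selector, while the Ridge penalty only scales every coefficient toward zero.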

It is important to highlight that experiments with a data augmentation approach were also performed, in order to increase the number of available samples and introduce variability in each set. The transfer learning toolbox for artificial data augmentation (available in MATLAB R2019a package [49]) was used in this process. The strategies used were: artificial random reflections of 50% across the x and y-axes; random rotations of up to 1 degree; and, random horizontal and vertical translations up to 1 pixel. These values were employed to minimize possible degradation of the classification rates due to the background of the image. These strategies allowed doubling the total number of samples available for the training and validation stages. In this type of test, the accuracy rates provided directly by the CNN models were used in the comparisons. The classifications were repeated three times to define the averages and standard deviations in each dataset. The values provided directly by the AlexNet and ResNet-50 architectures without data augmentation are shown in Table 5. These results were obtained from the first epoch of each network in order to avoid overfitting. Also, the performances after applying the data augmentation are shown in Tables 6 and 7 for the AlexNet and ResNet-50 networks, respectively. The accuracy values were defined with the number of training epochs ranging from 1 to 30. The most significant rates are highlighted in bold.
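The augmentation limits described above can be sketched in Python as follows. The experiments themselves used the MATLAB toolbox, so this block is only an illustration under stated assumptions: the function names are hypothetical, the rotation and translation ranges are assumed to be symmetric around zero, `reflect_x` is taken as a left-right flip and `reflect_y` as an up-down flip, and the sample standard deviation is assumed for the three repetitions. Rotation and translation are only sampled here, not applied, since they require interpolation.

```python
import random
import statistics
from typing import Dict, List, Tuple

Image = List[List[int]]


def sample_augmentation(rng: random.Random) -> Dict[str, object]:
    """Draw one set of augmentation parameters within the limits quoted in the text."""
    return {
        "reflect_x": rng.random() < 0.5,          # reflection with 50% probability
        "reflect_y": rng.random() < 0.5,          # reflection with 50% probability
        "rotation_deg": rng.uniform(-1.0, 1.0),   # up to 1 degree (symmetric range assumed)
        "shift_x_px": rng.uniform(-1.0, 1.0),     # up to 1 pixel (symmetric range assumed)
        "shift_y_px": rng.uniform(-1.0, 1.0),     # up to 1 pixel (symmetric range assumed)
    }


def apply_reflections(image: Image, params: Dict[str, object]) -> Image:
    """Apply only the reflections to a row-major image given as nested lists."""
    out = [row[:] for row in image]
    if params["reflect_x"]:
        out = [row[::-1] for row in out]  # left-right flip (assumed convention)
    if params["reflect_y"]:
        out = out[::-1]                   # up-down flip (assumed convention)
    return out


def mean_and_std(accuracies: List[float]) -> Tuple[float, float]:
    """Average and sample standard deviation over repeated classifications."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
params = sample_augmentation(rng)
toy_image = [[1, 2], [3, 4]]    # toy 2x2 "image", not real data
augmented = apply_reflections(toy_image, params)
print(params)
print(augmented)
```

Keeping the rotation and translation limits this small changes each sample only marginally, which is consistent with the stated aim of limiting the influence of the image background.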

Table 5 Accuracy rates (%) provided directly by the AlexNet and ResNet-50 architectures, exploring UCSB, CR and LG datasets without data augmentation
Table 6 Accuracy rates (%) provided directly by AlexNet exploring the UCSB, CR and LG datasets with data augmentation
Table 7 Accuracy values (%) provided directly by ResNet-50 exploring the UCSB, CR and LG datasets with data augmentation

In addition, regularized classification techniques were applied to the complete attribute sets of the layers used to establish the main hybrid models, with the following numbers of features: relu2 (186,624); relu3 (64,896); pool5 (9,216); activation_48_relu (25,088); avg_pool (2,048). This type of experiment was useful to indicate the advantages and limits of the main hybrid models. Regarding the solutions with regularized classification approaches, which can define subsets with high-quality features and increase the generalization of each model, the tests were performed through the SVM and logistic discrimination (LD) strategies with the Lasso and Ridge regularizations [68, 78, 82]. The accuracy rates provided by the regularized techniques are shown in Table 8 for the CR, UCSB, and LG datasets. The most relevant combinations are highlighted in bold.
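The feature counts listed above correspond to flattened layer outputs, which can be checked with a short Python sketch. The output shapes are not stated in the text; they are assumed from the standard AlexNet and ResNet-50 definitions, and the dictionary name `layer_shapes` is hypothetical.

```python
from math import prod

# Output shapes (height, width, channels); assumed from the standard
# AlexNet and ResNet-50 definitions, not stated in the text.
layer_shapes = {
    "relu2": (27, 27, 256),
    "relu3": (13, 13, 384),
    "pool5": (6, 6, 256),
    "activation_48_relu": (7, 7, 512),
    "avg_pool": (1, 1, 2048),
}

# Flattening a layer output yields height * width * channels attributes.
feature_counts = {layer: prod(shape) for layer, shape in layer_shapes.items()}

for layer, count in feature_counts.items():
    print(f"{layer}: {count:,}")
```

Under these assumed shapes, the 35 features selected from activation_48_relu represent roughly 0.14% of the 25,088 attributes given to the regularized techniques, which illustrates how compact the hybrid models are.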

Table 8 Solutions and accuracy rates (%) via regularization strategies, considering the same feature sets used in main hybrid models

Considering the hybrid models based on deep features from the AlexNet architecture (Table 4), the solutions indicated higher accuracy rates than those achieved via the CNN architectures (AlexNet and ResNet-50, Table 5) directly classifying the datasets without data augmentation. These conditions illustrate the quality of the proposed scheme and the solutions obtained to improve CAD systems focused on H&E images (UCSB, CR and LG), in scenarios without data augmentation, even via deep features from a classic CNN. On the other hand, when data augmentation (Tables 6 and 7) and comparisons with regularized techniques (Table 8) are considered, the solutions based on deep features from the AlexNet network were more limited, surpassing the convolutional networks with data augmentation only in the UCSB dataset and the regularized techniques only in the LG dataset. These conditions clearly indicate the limits of hybrid models through the AlexNet architecture.

When the best hybrid models (the highest rates in Table 4 based on deep features of the ResNet-50) are compared with the approaches in Table 5, the hybrid solutions provided the best performances in the three datasets. For instance, the classification considering the UCSB dataset indicated the most relevant difference: ResNet-50 applied directly provided an accuracy rate of 60.00% against a rate of 98.00% via the hybrid model (35 deep features from the activation_48_relu layer of the ResNet-50 with ReliefF algorithm and an ensemble classifier). Regarding the experiments exploring datasets with data augmentation (Tables 6 and 7), the highest difference (11.33%) can also be observed in the UCSB dataset, with an accuracy rate of 86.67% via ResNet-50 directly classifying the H&E images versus 98.00% of the hybrid model. With respect to the LG dataset, the results were 99.32% (hybrid model based on 5 deep features from the avg_pool of the ResNet-50 with ensemble classifier) against 99.28% (ResNet-50 directly classifying the H&E images with data augmentation), a difference of 0.04%. For the CR dataset, the best hybrid model (35 deep features from the avg_pool with ensemble classifier) provided a lower accuracy rate (0.89% difference) in relation to the performance achieved via ResNet-50 with data augmentation. This condition illustrates an important limit of this hybrid model. Thus, from these experiments, it is possible to state that the best hybrid models are better options to classify the UCSB and LG datasets than the ones explored so far, regardless of the combination. This generalization was not observed in the context of colorectal images.

In relation to the best hybrid models against the regularized techniques (Table 8), LD with Ridge indicated an accuracy rate of 94.23% versus 98.00% of the hybrid model (35 deep features from the activation_48_relu with ensemble classifier) for the UCSB dataset. In the CR dataset, LD with Lasso provided an accuracy value of 97.97%, slightly lower than that of the hybrid model (98.00%), an association of 35 deep features from the avg_pool layer. When the LG dataset is observed, the best hybrid model based on 5 deep features of avg_pool with ensemble classifier also outperformed SVM with Lasso, with accuracy values of 99.32% and 98.64%, respectively. From these comparisons, it is noted that the proposed scheme with the best hybrid models is a more robust option in relation to the regularized solutions, indicating the best performances.

In addition, Friedman’s test was applied to evaluate the classifications provided by the best hybrid models, considering an overview regarding all datasets (Tables 6, 7 and 8). Friedman’s test is a non-parametric statistical approach, able to rank \(k\) associations in a way that the main solution acquires rank 1 and the \(k\)-th solution acquires rank \(k\) [83]. Thus, the average ranking is shown in Table 9 by taking into consideration the accuracy rates.
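The ranking step behind Friedman's test can be sketched in Python as follows. This is an illustration only: the three-way grouping (best hybrid model, ResNet-50 with data augmentation, best regularized technique) uses accuracy rates quoted in the text and does not necessarily match the set of associations ranked in Table 9. The function names are hypothetical, and the statistic is the standard Friedman chi-square without tie correction.

```python
from typing import Dict, List


def rank_descending(values: List[float]) -> List[float]:
    """Rank values so the largest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0  # average of the 1-based positions
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def average_ranks(acc: Dict[str, List[float]]) -> Dict[str, float]:
    """Average, over datasets, the rank each method obtains on each dataset."""
    methods = list(acc)
    n_datasets = len(acc[methods[0]])
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        ranks = rank_descending([acc[m][d] for m in methods])
        for m, r in zip(methods, ranks):
            totals[m] += r
    return {m: totals[m] / n_datasets for m in methods}


def friedman_statistic(avg: Dict[str, float], n_datasets: int) -> float:
    """Friedman chi-square statistic (no tie correction)."""
    k = len(avg)
    sum_sq = sum(r * r for r in avg.values())
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Accuracy rates (%) in the order UCSB, CR, LG, as quoted in the text.
# The CR value for ResNet-50 is derived from the stated 0.89% difference.
accuracies = {
    "best hybrid model": [98.00, 98.00, 99.32],
    "ResNet-50 with data augmentation": [86.67, 98.89, 99.28],
    "best regularized technique": [94.23, 97.97, 98.64],
}

avg = average_ranks(accuracies)
stat = friedman_statistic(avg, n_datasets=3)
print(avg)
print(stat)
```

In this reduced comparison the hybrid models obtain the lowest average rank, which agrees with the direction of the ranking discussed in the text, although with only three datasets the statistic itself has little power.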

Table 9 Average ranking considering the best associations for UCSB, CR and LG datasets

It can be observed that the hybrid models appear in the first position of the average ranking (Table 9), even in comparison with the results achieved by important techniques. This fact indicates the potential of the hybrid models in the different tested conditions, with the best solutions for the UCSB (a hybrid model based on 35 deep features from the activation_48_relu with the ensemble classifier) and LG (a hybrid model based on 5 deep features from the avg_pool with the ensemble classifier) datasets. Other comparisons could be carried out to verify whether these results are maintained in more conditions and configurations, or even to make adjustments that define the limits of each model. However, the presented experiments were able to provide a relevant overview concerning the main hybrid models when compared to the consolidated approaches commonly explored in the literature for the classification and pattern recognition processes.

4.2 An illustrative overview of the obtained models in relation to the literature

Different techniques have been presented in the literature in order to investigate histological images, such as those for the UCSB, CR and LG datasets. The models were based on multiple combinations, exploring DL techniques, HC approaches, or different ensembles of descriptors and classifiers. An illustrative overview is important to show the quality of this study, with a proposed scheme and corresponding hybrid models not previously observed for multiple types of H&E images. This contextualization is shown in Tables 10, 11 and 12 for the UCSB, CR and LG datasets, respectively.

Table 10 Accuracy rates (%) provided by different approaches for breast histology image classification (UCSB)
Table 11 Accuracy rates (%) defined by different approaches for colorectal histology image classification (CR)
Table 12 Accuracy values (%) achieved in different approaches for gender classification from liver images (LG)

Taking into account this illustrative overview, it is noted that the achieved results are among those best ranked in the specialized literature, even without exploring complex combinations with handcrafted features, deep-tuning, color normalization, ensemble of CNN models and others, such as those described in [10, 14, 41, 47, 84]. Concerning the results presented in Table 10, the hybrid model (35 deep features from the activation_48_relu of the ResNet-50) provided the best performance, surpassing those provided by recent studies, such as RefineNet and Atrous DenseNet [41], DIU-Net [43], Inception-V3 [14] and fractal neural networks [10]. Numerical differences in accuracy rates were up to 37.70% [8]. These facts show the robustness of the proposed method in providing a relevant association for classifying breast cancer via H&E images.

For the CR (see Table 11) and LG (see Table 12) datasets, the main hybrid models reached classification rates slightly lower than those provided by some strategies available in the literature. For instance, the hybrid model via ResNet-50 (35 deep features from the avg_pool with ensemble classifier) achieved an accuracy rate of 98.00% to distinguish CR images, against 99.39% from a highly complex system (best model) with two CNN models and 300 fractal features (fractal dimension, lacunarity and percolation) [10]. Despite this, the proposed hybrid model via ResNet-50 outperformed other relevant schemes indicated for CR [6, 9, 34, 35, 42, 44,45,46,47,48] and LG [14, 33] datasets, listed in Tables 11 and 12, respectively. Moreover, for the LG dataset, the hybrid model considering 5 deep features from the avg_pool with ensemble classifier indicated an accuracy rate of 99.32%, against a complex framework based on an ensemble of multiple CNN architectures, texture features (HC) and SVM classifier [84]. That framework reached an accuracy rate of 100%, with some combinations exploring a single classifier, which can result in higher accuracy rates. On the other hand, it is necessary to evaluate situations in which the classifier is adjusted to the training data, including the bias-variance tradeoff [85]. The hybrid models presented here address this problem by minimizing the possible overfitting with a robust ensemble classifier. In this case, the numerical difference concerning the accuracy rate was only 0.68% in relation to the results obtained in [84].

Finally, it is important to highlight that most of these proposals lead to an almost ideal model since the mentioned strategies used different types of features and combinations that were capable of quantifying the histological images. Thus, the best solution for distinguishing breast cancer and the valuable information defined in this study contribute to the community interested in the development and improvement of models for classifying patterns in H&E images.

5 Conclusion

In this paper, hybrid models were obtained through a computational scheme exploring deep features by transfer learning, selection by ranking and a robust ensemble classifier with five algorithms. The models were applied to classify histological images stained with H&E from breast, colorectal and liver tissue considering benign versus malignant groups (UCSB and CR datasets) and pattern recognition in liver tissue images from mice separated into male and female classes (LG dataset). The best results were obtained through the ResNet-50 architecture in the activation_48_relu (UCSB) and avg_pool (CR and LG) layers, with a proposed scheme able to define the highest accuracy rate with a reduced number of features in each scenario (up to 35 attributes). The results were accuracy values of 98.00% (UCSB and CR) and 99.32% (LG).

The hybrid model via the pool5 layer (AlexNet network) achieved an accuracy value of 98.70% in the LG dataset. In the same dataset, the best hybrid model with the deepest layer of the ResNet-50 network (avg_pool) achieved 99.32%. This association also provided the most relevant features for the CR dataset, with an accuracy value of 98.00%. The models that explore the deepest CNN layers are the most commonly used in important approaches available in the literature. However, the tested conditions in this study show that deep features from the activation_48_relu layer (ResNet-50) provided a model with the best rate in the UCSB dataset. Thus, these facts show the capacity of the proposed scheme to optimize the transfer learning process and present the relevant hybrid models for classification and pattern recognition in H&E images.

The main results were compared to the obtained performances with consolidated machine-learning techniques, CNN models directly applied to classify the datasets, as well as results via regularized classification techniques (Lasso and Ridge regression). Experiments with a data augmentation approach were also evaluated. In this context, it was demonstrated that the main hybrid models, based on deep features from the AlexNet, indicated higher accuracy rates than those achieved via convolutional architectures (AlexNet and ResNet-50) classifying directly the datasets without data augmentation. With data augmentation, the hybrid models based on deep features from the AlexNet were more limited, with relevant results only in the UCSB dataset. In relation to the best hybrid models, based on deep features from the ResNet-50, the obtained solutions were better options to classify the UCSB and LG datasets in comparison with the CNN models, exploring data augmentation or not. This generalization was not observed for the CR dataset. In addition, when the comparisons with the regularized techniques were considered, the hybrid model (via AlexNet) provided relevant results only in the LG dataset. On the other hand, the best hybrid models were more robust options, indicating the best performances in the three datasets. This is another important contribution of this study.

In this context, when all comparison conditions are considered (CNN models applied directly to the images, data augmentation or regularized approaches), it is concluded that the hybrid models, based on the deep features of the ResNet-50, are the most relevant solutions for two of the three investigated datasets: UCSB, hybrid model based on 35 deep features from the activation_48_relu layer with ReliefF algorithm and ensemble classifier; LG, hybrid model based on 5 deep features from the avg_pool layer with ReliefF algorithm and ensemble classifier. The information presented here allows the use of hybrid models via CNN strategies in datasets with a reduced number of samples, without overfitting. Also, these conditions can be used to improve CAD systems focused on H&E images or even as more robust baseline schemes in this type of investigation.

Finally, taking into account an illustrative overview of the obtained models in relation to the literature, it is observed that the achieved results are among the best ranked, with emphasis on the UCSB context. The proposed scheme provided the best solution among those available in the literature, based on only 35 deep features from the activation_48_relu (intermediate layer), ReliefF algorithm and ensemble classifier. For the CR and LG datasets, the best hybrid models provided slightly lower performances, indicating a possible limit of the proposal.

In future works, it is intended to: 1) use the main associations for pattern recognition in different types of H&E images; 2) explore the main solutions with interpretable CNN models; 3) map each region of the image that provided the most relevant features, investigating explainable artificial intelligence.