Keywords

1 Introduction

Malaria is an infectious disease showing high degrees of morbidity and mortality, for which the World Health Organization estimated 219 million of infected people and 475000 deaths in 2017 [1]. This serious health problem demands new diagnose tools and anti-malarial drugs. Microscope analysis of large amounts of blood smears in order to detect the presence of the Plasmodium parasite, both to diagnose the disease in humans as well as determining the infection rate in laboratory mice, is an issue of crucial importance during the process of developing anti-malarial drugs. The analysis of blood smears by human experts tends to be slow and tedious, and its results are prone to errors due to tiredness and subjectivity and to the frequently found low rate of positive cases (infected erythrocytes). This situation has motivated a research effort to develop solutions based on digital image processing (DIP) and computer vision (CV) for this process and this has been the objective of the present work.

There are a number of published works dealing with this problem. Diverse image processing methods have been developed to obtain appropriate image features which allow an effective classification of erythrocytes into normal and infected, examples of which can be found in [2,3,4,5,6,7]. These methods typically include tasks such as image conditioning (non-uniform illumination correction, filtering and color normalization), image segmentation, feature extraction and classification of erythrocytes. It is stressed that testing and selecting effective features and classifier algorithms constitute a key step in the design of the whole system [3]. The use of various classifiers for this purpose, like linear discriminant analysis (LDA), K-nearest neighbors (KNN), support vector machines (SVM), among others, have been reported in the literature [8,9,10,11,12,13].

The contribution of this research consists in defining new color features having high discriminating capabilities, as well as testing their use together with various classifier algorithms to find the best combination. A complete algorithm was developed, oriented mainly towards the applications in the development of anti-malarial drugs, where the analysis of blood smears from laboratory mice demands a low rate of false positives as a requirement.

2 Materials and Methods

2.1 Sample Images

In this research digital images obtained from Giemsa-stained blood smears from mice experimentally infected with Plasmodium berghei were used, as well as a Zuzi 122/148 tri-ocular microscope, equipped with a Microscopy 319 CU digital camera with 3.2 MP resolution and 8-bit RGB output without compression, producing a matrix having 2048 × 1536 pixels, with pixel size 3.2 × 3.2 μm, signal to noise ratio 43 dB and optical magnification 50× . The digital images were saved in.tiff (tagged image file) format. An annotated database was created with the aid of two expert analysts from Centro de Bioactivos Químicos (CBQ). This database was intended to perform all the DIP-CV procedures to obtain the features, training the classifiers and performing tests to evaluate the classification effectiveness. A set of 211 images was obtained, from which 600 images of individual erythrocytes were formed, comprising 400 un-infected and 200 Plasmodium-infected cells, as shown in Fig. 1. Notice the reddish-purple spots inside individual erythrocytes that harbor the parasites.

Fig. 1.
figure 1

Microscopy images employed. (a) Image of a blood smear showing multiple erythrocytes, (b) segmented, infected erythrocyte and (c) segmented, normal erythrocyte.

The size of the sample set was calculated from [14] as a minimum required to obtain a reasonable error when evaluating the correct classification rate CCR, which is expected to be above 0.95 for the classifiers tested. These numbers took into account also some class-imbalance. The images size for individual cells depends on their physical (variable) dimensions and can also be affected by the presence of the parasite.

2.2 Image Conditioning

The images used in this research were acquired with a Zuzi 319 CU camera coupled to a microscope. The photos were taken in the RGB color space at 8 bpp/channel. The intensity of the color components was normalized to the interval [0, 1] and converted afterwards to the HSI color space. Other pre-processing steps applied were [3 × 3] median filtering to the intensity component and a morphological top hat with an appropriate structuring element to compensate illumination imbalance. Conversion of the images to the L*a*b* space was made also after segmentation to allow obtaining more features. Information on color spaces is given in [15].

2.3 Segmentation

Segmentation of erythrocytes was performed in two steps. Firstly, a coarse segmentation using the Otsu’s algorithm as in [2] was applied adaptively to the intensity component of the image, which was divided into 16 patches that were segmented independently. This coarse segmentation binarized the image into foreground objects (cells, including clumps) and background. Then the cell clumps were detected and segmented employing a version of the algorithm described in [16], using weighted outer distance and marker-controlled watershed transforms, with the regional maxima of the distance transform acting as internal markers. Other components of the blood smears like leukocytes and platelets were suppressed using the procedure described in [3]. Area opening allowed eliminating other artifacts as well, using a threshold derived from the median size of the erythrocytes.

2.4 Color Normalization

In microscopy images, color can be altered due to changes in the illumination source and to the procedure of preparing the samples. This led to the necessity of color normalization by means of DIP techniques. Here the method described in [5] was used for this purpose and the results are illustrated in Fig. 2.

Fig. 2.
figure 2

Color normalization: (a) reference, (b) target and (c) color-corrected images.

2.5 Feature Extraction and Pixels Fraction

In this work, one main goal is achieving a fairly good classification result using only features derived from the color properties of the images. A total of 13 features were obtained for each of the RGB, HSI and L*a*b* color spaces. These were, for each color component: mean, variance and skewness as described in [17], as well as kurtosis and a new feature which is the main contribution of this work, called pixels fraction. For the three color spaces this led to 39 features in total.

The pixels fraction is defined here based on the relative coincidence of the pixel values in the three planes corresponding to the color channels respectively, for the color spaces employed. The reasoning behind this idea is that pixels having the color properties associated to the presence of the parasite in an erythrocyte, would be practically absent in the images of healthy erythrocytes. Only a minimum proportion of them could exist eventually due to the presence of artifacts in the images, like for example tiny spots created during the preparation of samples for the microscope. The pixels fraction was thought in this case to allow discriminating the presence of a real parasite from the occurrence of a colored artifact, thus improving the classification indexes of effectiveness when using features derived from the color properties of the images. Following this idea, a small set of regions of interest (ROI) were located inside the reddish-purple colored region characteristic of the parasites in a color-normalized erythrocyte, and taken as a reference, as shown in Fig. 3a. For these regions, the mean and standard deviation of the intensity in each color channel for the color spaces used was calculated. To illustrate this for the RGB color space, consider a cell being analyzed. Then the number of pixels is determined for it, whose three color components imC in which C can take the values R (red), G (green) or B (blue), fall simultaneously within the following intervals:

Fig. 3.
figure 3

Illustrating the procedure used to calculate the pixels fraction, (a) ROI selection within an infected erythrocyte, (b) block diagram illustrating steps to calculate the pixels fraction.

$$ \mu_{c} = \left\{ {\begin{array}{*{20}c} {imC > \mu_{c} - \sigma_{c} *2} \\ {imC < \mu_{c} + \sigma_{c} *2} \\ {0,\;\, otherwise} \\ \end{array} } \right. $$
(1)

In Eq. 1, μc is the mean value and σc the standard deviation of the component value inside the ROI and the indexes C indicate the color component to which they correspond. The factor 2 multiplying σ widens the acceptance intervals for a given component and was determined heuristically. The pixels fraction \( p_{f} \) is finally determined for the image of an erythrocyte in a specific color space by dividing the number of pixels nf satisfying Eq. 1 by the total number N of pixels in the image.

$$ p_{f} = \frac{{n_{f} }}{N} $$
(2)

The value of \( p_{f} \) was determined analogously in the HSI and L*a*b* color spaces. Figure 3b illustrates the general process followed to calculate the pixels fraction.

2.6 Feature Selection and Classification

Facilities provided by Weka 3.9 [18] were used here. Firstly, filtering (CfsSubsetEval with a greedy stepwise search method) was used to make a selection from the erythrocyte features previously described. This resulted in seven features. Three ranking alternatives were tested for classification: (InfoGainAttributeEva) using the first 20 ranked features and also the first 7 (to match the number selected through filtering), as well as all the features. Then the effectiveness of these alternatives were compared.

Classification was made using the following algorithms: SVM, KNN, J48, Random Forest (RF), Naïve Bayes (NB) and Linear Discriminant Analysis (LDA). In the case of SVM and KNN, various alternatives in their parameters (polynomial and PUK kernels in SVM, K = 1, 3, 5, 7 for KNN) were tested the best ones were used in the comparison to the rest of the classifiers. All the features were normalized previously.

The various classifiers were compared by ten-fold cross-validation and 1/3-2/3 percentage split. The indexes of effectiveness used were the correct classification rate (CCR), sensitivity (Se), specificity (Sp), F-measure and area under the receiver operating characteristic (ROC) curve, AUC. A more realistic experiment considered also the possibility of defining visually dubious cases as a third class, a situation often encountered in practice due to spurious colored pixels. In this case confusion matrices were used to express the results.

The general process of training the various classifiers evaluated and performing the classification process is illustrated in Fig. 4.

Fig. 4.
figure 4

Diagram of the general classification process. (a) Training the classifiers with the defined features, (b) classifying test erythrocytes.

3 Results and Discussion

The experimental work described in Sect. 2 was performed for the dataset composed of 600 erythrocytes previously mentioned. Special care was taken when building the (600 × 39) feature matrix corresponding to this dataset.

3.1 Feature Selection

Results from feature selection using the two methods (filter and ranking) are shown in Table 1. Notice that despite in general the seven first features ranked by the InfoGainAttributeEva method differ from those selected by CfsSubsetEval, the pixels fractions in all the color spaces were the first ones in the list, confirming their usefulness.

Table 1. Results of the feature selection process

3.2 Classification

Classification results when using 7 features obtained through ranking and filtering, are shown in Tables 2 and 3, respectively. In this case the performance measures used in a 10-fold cross-validation experiment were CCR, Se, Sp, F- measure and AUC. Classification results using the whole set of features or the first 20 in the ranking list were inferior to those shown in the tables and are not shown due to space limitations. This suggests that there is some degree of noisy behavior in the discarded features whose deletion improved the classification results. Some variants of SVM and KNN were disregarded previously in favor of those included in the tables, which exhibited better behavior. The best performance was obtained by the J48 and Random Forest classifiers, which yielded results close to 100%.

Table 2. Results of classification with features selected through InfoGainAttributeEva ranker, using the 7 best ranked features and ten-fold cross-validation
Table 3. Results of classification, features selected by CfsSubsetEval (Greedy Stepwise), 10-fold cross-validation

The pixels fraction should be theoretically zero for a normal erythrocyte. However, the classification of a cell is not a trivial task: some spurious colored pixels could appear in practice and provoke an erroneous classification. Here a larger set of color-based features was employed to improve the classification results in this situation and a third “dubious” class was introduced. Table 4 shows the confusion matrix obtained in this case. This is especially important when determining the infection rate in laboratory mice through microscopy analysis, where dubious cells are usually disregarded by human analysts. In this case only four features (pixels fraction among them) were chosen by the filter selector. When using the J48 and RF classifiers, almost all dubious cases were correctly classified, all normal cells were still classified as normal and a small proportion of infected erythrocytes were classified as dubious. The success of these classifiers is consistent with the experience about their good behavior in numerous applications [19].

Table 4. Confusion matrices from the classification results, considering a third class (dubious cases), J48 and RF classifiers

4 Conclusion

Automated classification of erythrocytes to detect the presence of Plasmodium parasites in blood smears is currently an open area of research and this work presents two contributions to it. The first one has been an improvement of the use of color information in the classification process through a new feature, called the pixels fraction, whose effectiveness was proved by two facts. Firstly, the values obtained for it in the three color spaces involved (RGB, HSI and L*a*b*) were selected among the most important features by both the filter and the ranker feature selectors used. Secondly, the classification results using the pixels fraction were remarkable, showing that it is worth to enhance the use of color information when defining the features for classification and this is just what was achieved through this new feature. Several classifier algorithms were tested, among which J48 and RF exhibited the best results in terms of the several evaluated measures of performance. The second contribution was linking a set of image processing steps with the classifiers, to complete a computationally efficient system to classify erythrocytes in malaria studies.