1 Introduction

Today almost 30% of people have allergies and 8% have asthma. Pollen is the most frequent origin of allergies and one of the causes of asthma. The share of people suffering from pollinosis varies between 10% and 15% across countries. This number has increased by 34% over the last ten years because of urbanization, human impact on the environment, and the ability of pollen to travel long distances by air [24].

To manage allergy and asthma symptoms, it is necessary to determine the start of pollen dispersion. Accurate knowledge of prevalent aeroallergens can improve the diagnosis and treatment of patients. Pollen information is key, as it enables a timely start of the preventive and symptomatic treatment of seasonal allergy problems. Thus, there is a great need to catch airborne pollen and to determine immediately whether or not it comes from an allergy-causing plant species. For this purpose there are more than 600 pollen counting stations all over Europe and only about 20 stations in Russia, where palynologists and volunteers spend much time on manual pollen processing under microscopes [24]. However, manual processing cannot provide information that is relevant enough for patients. For instance, 24% of adults and 40% of children in Europe cannot travel freely due to the lack of information on atmospheric pollen concentrations in different regions of Europe [11, 19].

Thus, a near real-time system that automates the recognition of pollen species is required. Such a system can be built on digital images from a microscope. Recently machine learning and, in particular, deep learning have proven effective in a variety of applications such as image classification [21, 32], natural language processing [7, 33], and speech recognition [10, 16].

The need to automate pollen recognition was first mentioned by Flenley in 1968 [12]. Since then many attempts to develop such a system have been made; however, the problem is not yet completely solved. Proper classification of pollen grains allows experts to draw appropriate conclusions and to solve problems in other areas, not only in aeropalynology [6, 29, 31].

An image recognition-based solution for this task consists of the following steps: pollen extraction, counting, and classification. Initially an image can include from 1 to about 50 pollen grains, depending on their size and shape. Pollen extraction is the search for image areas that contain exactly one pollen grain each, followed by contouring of the grain. It is performed after the preprocessing steps described in Sect. 3.2. Counting is the quantification of the extracted pollen grains. Classification is the determination of the species of each pollen grain. The final result can be presented as the percent composition of pollen species.
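As a small illustration of the final output, the sketch below turns a list of per-grain classification results into a percent composition. The function name and the species labels are placeholders introduced here, not names from our dataset.

```python
from collections import Counter

def percent_composition(predicted_species):
    """Map each species label to its share (in percent) of all counted grains."""
    counts = Counter(predicted_species)
    total = sum(counts.values())
    return {species: 100.0 * n / total for species, n in counts.items()}

# Placeholder labels standing in for the classifier output on 10 extracted grains.
labels = ["species_a"] * 6 + ["species_b"] * 3 + ["species_c"]
composition = percent_composition(labels)
print(composition)
```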

All researchers in this area extracted pollen-specific features such as shape, brightness, texture, and aperture features [3, 4, 5, 27]. Some used a scanning electron microscope (their accuracy varies between 77% and 97%) [1, 3, 31]; others used stacks of images of a single pollen grain, a kind of three-dimensional representation (the resulting accuracy is between 93.8% and 97.5%) [3, 30, 31]. Most researchers used standard machine learning methods: support vector machines, linear discriminant analysis, random forests, artificial neural networks, k-nearest neighbors, and others. Many authors are members of current or past global research projects aimed at developing an automated pollen recognition tool. For instance, the European project ASTHMA dealt specifically with allergenic pollen [28].

A review of pollen recognition techniques [17] revealed that some simple, local tasks within pollen recognition could be solved, but many tasks related to deformed and clumped pollen remained unresolved. Interest in the problem is still high. Recently published papers report accuracies between 87% and 99% with an optical microscope [6, 9, 23, 27, 29, 30]. However, only a few works considered the extraction and counting steps, although they are very important parts of the problem: manual image cropping can be tedious, and automatic counting is the main goal of recognition in some cases. Our research addresses these shortcomings. We also use images from an optical microscope, which is much cheaper than a scanning electron microscope and is widely used.

Extracted features are described in Sect. 2.1. Applied dimension reduction techniques are described in Sect. 2.2. To achieve the goals of extraction and counting we use a preprocessing algorithm, which is described in Sect. 3.2. Applied classifiers are described in Sect. 3.3. The experiments are described in Sect. 3.4. Results are discussed in Sect. 4.

2 Proposed Approach

2.1 GIST Features

We chose GIST descriptors [8, 26] as image features, which avoids special-purpose feature extraction. GIST is a low-dimensional scene representation; in other words, it is a kind of histogram of edge distributions. An image is divided into equal parts using a grid (4 \(\times \) 4 in our case). Edge distributions are computed at 3 scales of the image, separately for every part. An edge distribution corresponds to the response of the part to every edge orientation (of which there are 8 or 4). We use color images, so this is applied to every color channel. As a result of GIST extraction, 960 descriptors were obtained. In general, the number of GIST features can be arbitrary.
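The descriptor count can be checked with a short calculation. The sketch below assumes that two of the three scales use 8 orientations and one uses 4; the text only says "8 or 4", so this split is an assumption, chosen because it reproduces the stated total of 960. The function name is illustrative.

```python
# From the text: 4 x 4 grid, 3 scales, color images.
# Assumed (not stated in the text): the per-scale orientation split (8, 8, 4).
GRID_ROWS, GRID_COLS = 4, 4
ORIENTATIONS_PER_SCALE = (8, 8, 4)
COLOR_CHANNELS = 3

def gist_length(rows, cols, orientations_per_scale, channels):
    """One descriptor per grid cell, per filter (scale x orientation), per channel."""
    filters = sum(orientations_per_scale)
    return rows * cols * filters * channels

n_descriptors = gist_length(GRID_ROWS, GRID_COLS, ORIENTATIONS_PER_SCALE, COLOR_CHANNELS)
print(n_descriptors)  # 960
```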

2.2 Dimension Reduction

Due to the high number of GIST descriptors, dimension reduction (DR) is required. The following methods were used.

ReliefF. ReliefF is a member of the Relief algorithm family, a filter-type feature selection technique, and extends Relief to M-class classification. Relief is based on near-hit and near-miss measures, the values of which form the weight of each feature. If the weight is smaller than some threshold, the feature is rejected [34]. The weight vector is computed according to the following formula:

$$\begin{aligned} w_{i}=\sum _{k=1}^{p}\left( \delta \left( x_{k}^i,near\_miss\left( x_{k}\right) ^i\right) ^2-\delta \left( x_{k}^i,near\_hit\left( x_{k}\right) ^i\right) ^2\right) \end{aligned}$$
(1)

where \(i=1 \ldots n\); n is the number of features; p is the number of objects; and \(\delta (a, b)\) is the difference between the feature values a and b (for nominal features, 0 if they are equal and 1 otherwise).

The number of features selected by applying ReliefF is 300.
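A minimal sketch of the weight computation in Eq. (1) is given below. It is the basic Relief variant with one nearest hit and one nearest miss per object, not the full multi-class ReliefF. It assumes feature values scaled to [0, 1], a plain value difference inside the squares, and Manhattan distance for the neighbor search. It keeps the top-ranked features instead of applying a threshold. All names are illustrative.

```python
def relief_weights(X, y):
    """Per-feature sum over objects of diff(near miss)^2 - diff(near hit)^2."""
    n_features = len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        # Manhattan distance (an assumption; the text does not fix the metric).
        return sum(abs(u - v) for u, v in zip(a, b))

    for k, x in enumerate(X):
        hits = [X[j] for j in range(len(X)) if j != k and y[j] == y[k]]
        misses = [X[j] for j in range(len(X)) if y[j] != y[k]]
        near_hit = min(hits, key=lambda z: distance(x, z))
        near_miss = min(misses, key=lambda z: distance(x, z))
        for i in range(n_features):
            weights[i] += (x[i] - near_miss[i]) ** 2 - (x[i] - near_hit[i]) ** 2
    return weights

def select_top(weights, n_keep):
    """Indices of the n_keep features with the largest weights."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:n_keep])

# Toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
selected = select_top(weights, n_keep=1)
```

On the toy data the informative feature receives a positive weight and the noisy one a negative weight, so only feature 0 is kept.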

Mutual Information. Mutual information (MI) measures the relative importance of a feature. It relies on the entropy of a feature and its conditional entropy given the class of the objects [20]:

$$\begin{aligned} I\left( x,y\right) =H\left( x\right) -H\left( x|y\right) \end{aligned}$$
(2)

where I is the relative importance; \(H(x)\) is the entropy of a feature; and \(H(x|y)\) is the conditional entropy of the feature given the class.

The number of features selected by applying MI is 300.
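Eq. (2) can be illustrated with a few lines of code. The sketch assumes that the feature has already been discretized into bins. GIST values are continuous and the text does not state how entropy was estimated, so the binning is an assumption. The toy data and names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(x | y): class-probability-weighted entropy of x within each class."""
    n = len(x)
    total = 0.0
    for label, count in Counter(y).items():
        subset = [xv for xv, yv in zip(x, y) if yv == label]
        total += (count / n) * entropy(subset)
    return total

def mutual_information(x, y):
    """I(x, y) = H(x) - H(x | y), as in Eq. (2)."""
    return entropy(x) - conditional_entropy(x, y)

labels = ["a", "a", "b", "b"]
informative = [0, 0, 1, 1]      # bin index follows the class exactly
uninformative = [0, 1, 0, 1]    # bin index is independent of the class
mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
```

Features would then be ranked by this score and the highest-ranked ones kept.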

Principal Component Analysis. Principal Component Analysis (PCA) is a feature extraction method. It finds a projection onto a linear manifold that minimizes the distance from the points to the manifold [22]. We retained the components that explain 95% of the original variance of the data.
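The 95% criterion amounts to choosing the smallest number of leading components whose cumulative variance share reaches the target. The sketch below shows only this selection step and assumes the eigenvalues of the data covariance matrix are already available. The eigenvalues used are made up for illustration.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose variance share reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)

# Illustrative eigenvalues only; not computed from our data.
example_eigenvalues = [6.0, 3.0, 0.7, 0.2, 0.1]
n_components = components_for_variance(example_eigenvalues)
```

With these values the first two components explain about 90% of the variance and the first three about 97%, so three components are retained.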

3 Experiments

3.1 Materials

The present research covers pollen not only of allergenic plants but also of honey plants. The approach can be easily generalized to any plant dataset. The dataset includes 9 species and almost 1800 images in total. It is original, has never been used before, and was acquired with an Olympus BX51 optical microscope and an Olympus DP71 imaging system. The pollen was collected mostly in Perm Krai, Russia. In the Perm region, the aeropalynological profile is typical of central Russia. On average, the concentration of allergenic pollen grains in the air of Perm is lower than in other European geographical regions. Since 2010, the aeropalynological data of the Perm region have been included in the Russian pollen monitoring program. Pollen traps are located in the city center [24].

An example of an image from the dataset is presented in Fig. 1. The example shows that an image can contain stains or debris, which cause incorrect segmentation.

Fig. 1. Input image example

Some examples of each pollen species are presented in Table 1.

We used two versions of the dataset: the full one, which includes species of similar shape, and the partial one, which contains species of mostly different shapes (the top 5 rows of Table 1).

All images were normalized by RGB-values, according to the following formula:

$$\begin{aligned} I_{N}=\left( I-Min\right) \frac{newMax-newMin}{Max-Min}+newMin \end{aligned}$$
(3)

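where I is the old pixel color value and \(I_{N}\) is the new value; Min and Max are the minimum and maximum of the old values, and newMin and newMax are the bounds of the target range.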
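A minimal sketch of Eq. (3) for one color channel follows. The target range of 0 to 255 and the per-channel computation of Min and Max are assumptions, since the text does not specify them. The guard for a constant channel is also an addition.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Apply Eq. (3) to a flat sequence of channel values."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant channel: Eq. (3) would divide by zero, so map to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]

channel = [50, 100, 150]
normalized = min_max_normalize(channel, new_min=0, new_max=255)
```

Here the smallest value maps to 0, the largest to 255, and the middle value to about 127.5.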

Cross-validation was used to evaluate the results. The idea is to divide the dataset into disjoint training and validation subsets in K different ways; the final accuracy is the mean of the K validation accuracies.

We used 10-fold cross-validation. The experiments were conducted on a computer with an Intel Core i7-3770 CPU and 16 GB of RAM.
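The splitting scheme can be sketched as follows. The shuffling, the fixed seed, and the round-number sample count of 1800 are illustrative assumptions. Whether the folds were stratified by species is not stated in the text, and this sketch does not stratify.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes to the same fold, so the folds are disjoint.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=1800, k=10))
```

Each image appears in exactly one validation subset, and a model is trained and evaluated once per split.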

Table 1. Preprocessed images examples

3.2 Preprocessing

We performed three preprocessing steps:

  1. The first step of preprocessing is noise reduction, which includes Gaussian blur, dilation, and erosion.

  2. The next step is double and low thresholding applied to the hue and saturation channels. This combination performs well on images with color gradients or hotspots.

  3. The last step is segmentation and localization using the Canny edge detector and Hu moments [18].

The resulting sequence of preprocessing steps is presented in Fig. 2.

Fig. 2. Image modifications during preprocessing
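The thresholding step can be sketched in a few lines. The sketch assumes a band (double) threshold on hue and a lower threshold on saturation. Both the assignment of thresholds to channels and the numeric values are illustrative guesses, not the settings used in our experiments. The noise reduction and Canny-based segmentation steps are not shown.

```python
import colorsys

def pollen_mask(pixels, hue_band=(0.75, 0.98), min_saturation=0.25):
    """Return a binary mask: 1 where a pixel passes both thresholds, else 0.

    pixels is a list of rows of (r, g, b) tuples with values in 0..255.
    """
    hue_lo, hue_hi = hue_band
    mask = []
    for row in pixels:
        mask_row = []
        for r, g, b in row:
            h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            in_band = hue_lo <= h <= hue_hi   # double threshold on hue
            saturated = s >= min_saturation   # low threshold on saturation
            mask_row.append(1 if in_band and saturated else 0)
        mask.append(mask_row)
    return mask

# Tiny 2 x 2 example: two pale background pixels and two saturated pinkish pixels.
image = [[(245, 240, 245), (200, 60, 160)],
         [(190, 50, 150), (250, 250, 250)]]
mask = pollen_mask(image)
```

The pale pixels fail the saturation test even when their hue falls inside the band, so only the saturated pixels remain in the mask.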

The extraction (segmentation) accuracy is 73%. This result is modest, mainly because of clumped pollen grains (Fig. 3). This is a separate, complicated issue and a subject of further research.

Fig. 3. Clumped pollen example

Hereafter, we refer to the dataset that has passed the preprocessing steps as the preprocessed dataset.

3.3 Models

The following 6 machine learning techniques were used in the research for classification [2, 13,14,15, 25].

  1. Logistic regression (LR). A simple linear classification technique.

  2. K-nearest neighbors (kNN). A metric classification technique that assigns an object to a class based on its k nearest neighbors.

  3. Support vector machine (SVM). It handles input vectors that are not linearly separable by projecting the low-dimensional training data into a higher-dimensional feature space where they can be separated easily. The projection is achieved using kernel functions.

  4. Decision trees (DT). The main idea is to build a tree over the feature space recursively. The feature space is split on a feature value, and both subsets are then split in the same way until a tree leaf contains the minimum number of objects required for making a decision.

  5. Random forest (RF). A classifier ensemble method based on bagging. Several independent models make decisions, and the common decision is determined by voting for classification problems and by averaging for regression problems.

  6. Gradient boosting (GB). A modern ensemble technique. It minimizes the training error of a linear combination of classifiers by gradient descent.
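To make the simplest of these models concrete, the sketch below implements kNN classification by majority vote. The Euclidean metric, the value of k, and the toy two-dimensional data are illustrative assumptions. In our setting the vectors would be the GIST descriptors after dimension reduction.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training vectors."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data with placeholder class names.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["A", "A", "A", "B", "B", "B"]
near_a = knn_predict(train_X, train_y, [0.5, 0.5])
near_b = knn_predict(train_X, train_y, [5.5, 5.5])
```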

3.4 Results for Different Feature Sets and Different Machine Learning Models

Each table shows combinations of dimension reduction and classification methods. Each cell in the resulting tables contains the mean accuracy of 10-fold cross-validation, followed by its standard deviation after the plus/minus sign. The best accuracy for each DR method is highlighted in bold.
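The content of a single table cell can be sketched as follows. The per-fold accuracies are made up for illustration and are not results from our experiments. The use of the sample standard deviation rather than the population one is an assumption.

```python
from statistics import mean, stdev

def summarize_folds(fold_accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies in percent."""
    return mean(fold_accuracies), stdev(fold_accuracies)

# Made-up per-fold accuracies, for illustration only.
example_folds = [96.0, 98.0, 97.0, 99.0, 95.0, 98.0, 97.0, 96.0, 99.0, 95.0]
avg, spread = summarize_folds(example_folds)
print(f"{avg:.1f} ± {spread:.1f}%")
```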

Table 2 compares the results on the partial dataset. The best accuracy, 85.6 \( \pm \) 3.5%, is achieved by the RF model with the ReliefF DR method.

Table 3 compares the results on the partial preprocessed dataset. The best accuracy, 98.3 \( \pm \) 2.1%, is achieved by the SVM model with the MI DR method.

Table 2. The results on partial dataset
Table 3. The results on partial preprocessed dataset

Table 4 compares the results on the full dataset. The best accuracy, 78.5 \( \pm \) 3.8%, is achieved by the RF model with no DR.

Table 5 compares the results on the full preprocessed dataset. The best accuracy, 95.2 \( \pm \) 1.7%, is achieved by the SVM model with the PCA DR method.

The tables show that models trained on the partial 5-class dataset achieve much better accuracy than models trained on the full dataset. Models trained on the preprocessed datasets are significantly more accurate than models trained on the non-preprocessed datasets. Thus, preprocessing is one of the most important steps of the approach.

Table 4. The results on full dataset
Table 5. The results on full preprocessed dataset

4 Discussion and Conclusion

In this paper we attempted to use machine learning to solve the problem of automated recognition of pollen grain images. This problem is very important for allergy and asthma management, since pollen is a key cause of these diseases. To prevent allergy and asthma symptoms, it is necessary to know the concentration of allergenic plant pollen in the air in real time. Existing pollen counting stations cannot provide information rapidly enough because of manual processing. To automate the recognition of pollen species, we processed pollen images from an optical microscope. We used GIST descriptors as the feature vector and applied several dimension reduction methods (PCA, MI, ReliefF). This approach achieved a maximum accuracy of 98.3% on the partial preprocessed dataset, which contains only 5 pollen species. The best classification model is SVM with a polynomial kernel.

This is a new approach to the problem, because other authors mostly used special-purpose features tailored to the nature of pollen grains. Using GIST allows us to generalize our solution while minimizing the accuracy loss. GIST descriptors are a kind of universal feature.

We studied four versions of the dataset to see whether the shape of pollen grains strictly determines the GIST values and to compare GIST results on the preprocessed and initial datasets.

We found out that the GIST-based approach works much better with the preprocessed dataset, which contains only one pollen grain per image.

We used three dimension reduction techniques and compared their results in combination with each machine learning model.

In future research we will attempt to use a convolutional neural network, a very promising technique [21] that has not yet been used by other researchers for this problem. We also plan to improve the pollen extraction stage, especially in order to resolve the issue of clumped pollen.

The final goal of this research is to develop a program for pollen recognition and to bring it to the state of a real-time system, which will cut the cost of pollen operations in half.