
1 Introduction

Plant recognition has long been one of the most important tasks in plant protection. The crucial stage of plant taxonomy is a genuine scientific and technical challenge, owing not only to the huge number of plant species but also to their highly specialized and diverse taxonomic properties. For this reason, manual plant recognition is too inefficient, and pattern recognition technology should be introduced to carry out this work. One key source of information for the identification of plant species is the leaf image [1-6]. According to the features extracted from leaf images, recognition methods can be roughly divided into three kinds: those based on structure (texture) features, on subspace projection, and on statistical features [7-10].

Methods based on structure features require pre-processing and extract texture features from the leaf images. Such methods involve a complex pretreatment process, and the quality of the pretreatment seriously affects recognition accuracy. Although texture features can achieve a certain recognition accuracy, they are sensitive to changes of position and orientation during image collection, and thus lack stability and robustness [11, 12].

Subspace projection methods apply Principal Component Analysis (PCA), Independent Component Analysis (ICA) or linear discriminant analysis in a certain transform domain of the leaf images, and then use the projection coefficients as features for a suitable classifier. Compared with structure-based methods, subspace projection methods resist noise interference without a complex pretreatment process, but changes of location and orientation may still degrade recognition performance.

This paper is organized as follows. In Sect. 2, we present the concept and principle of the Contourlet Transform, apply it to decompose the leaf image, and then describe the low-frequency and high-frequency sub-band feature extraction. In Sect. 3, we introduce the classifier, the Support Vector Machine (SVM), and explain the reasoning behind it. In Sect. 4, the experimental databases and results are presented. Finally, we conclude the paper in Sect. 5.

2 Contourlet Transform

Research by neurophysiologists shows that, in the human visual system, receptive fields in the visual cortex are characterized as localized, oriented, and bandpass. Experiments further suggest that a computational image representation will be efficient if it is based on a local, directional, and multiresolution expansion.

Since two-dimensional wavelets are constructed from tensor products of one-dimensional wavelets, their limitation becomes clear at finer resolutions: many isolated point-like supports are needed to capture a contour, as shown on the left of Fig. 1. The new scheme (on the right of Fig. 1) shows that the support of the basis functions should instead behave as elongated strips, so as to exploit the geometric regularity of the original function and approximate the singular curve with the fewest coefficients. In fact, the elongated support is a reflection of directionality, and this approach is also called multi-scale geometric analysis.

Fig. 1. Wavelet versus new scheme

2.1 Feature Extraction

The Contourlet Transform is implemented by a double filter bank called the pyramidal directional filter bank (PDFB), which can be seen as a cascade of two steps. First, the original image is decomposed at multiple scales into low-frequency and high-frequency subbands by the Laplacian Pyramid (LP) transform. Then the Directional Filter Banks (DFB) decompose the bandpass signal of each pyramid level into an L-layer tree structure, where the band is divided into two directions at each layer. Singular points distributed along the same direction are synthesized into a single coefficient. By combining the LP and the DFB into this double filter bank structure, the Contourlet Transform achieves a more effective sparse representation of the image. The Contourlet Transform is implemented by iteratively applying the PDFB to the coarse-scale image, as shown in Fig. 2 [20, 21].

Fig. 2. Contourlet filter bank
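
For concreteness, the sketch below illustrates only the LP (multi-scale) stage of this cascade, using OpenCV's pyramid operators; the DFB stage, the specific filters, and the number of directions per level are not shown here and would be supplied by a full Contourlet implementation.

```python
import cv2
import numpy as np

def laplacian_pyramid(image, levels=4):
    """Split a grayscale image into one low-pass band and `levels`
    bandpass (detail) bands; the detail bands are returned coarsest first."""
    current = image.astype(np.float32)
    details = []
    for _ in range(levels):
        down = cv2.pyrDown(current)
        # Upsample back to the current size; the residual is the bandpass band.
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        details.append(current - up)
        current = down
    return current, details[::-1]   # (low-pass band, bandpass bands coarsest first)

# usage sketch:
# leaf = cv2.imread("leaf.png", cv2.IMREAD_GRAYSCALE)
# lowpass, bandpass = laplacian_pyramid(leaf, levels=4)
```

Each bandpass band produced in this way would then be passed through a directional filter bank to obtain the directional sub-bands used in Sects. 2.2 and 2.3.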

2.2 The Low Frequency Sub-Band Feature Extraction

The low-frequency subband embodies the coarse texture of the image. In this paper, the uniformity of the texture is described by the Angular Second Moment (ASM), Contrast (CON), Correlation (COR) and Entropy (ENT) of the gray level co-occurrence matrix. After the Contourlet decomposition, the low-frequency coefficient matrix is converted into a gray level co-occurrence matrix, and these four features are extracted from it. We calculate ASM, CON, COR and ENT in the directions \( [0^{ \circ } ,45^{ \circ } ,90^{ \circ } ,135^{ \circ } ] \) respectively. To reduce the dimension of the feature vector and the computational complexity, we obtain an 8-dimensional feature vector \( f_{1} = [{\text{a}}_{1} ,{\text{a}}_{2} , \ldots ,{\text{a}}_{8} ] \) by computing the mean and variance of each parameter over the four directions. The specific parameters are calculated as follows [22, 23].

Angular Second Moment (ASM):

$$ ASM = \sum\limits_{i} {\sum\limits_{j} {P(\text{i},\,\text{j})^{2} } } $$
(1)

Contrast (CON):

$$ CON = \sum\limits_{i} {\sum\limits_{j} {(i - j)^{2} P(\text{i},\,\text{j})} } $$
(2)

Correlation (COR):

$$ COR = \left[ {\sum\limits_{i} {\sum\limits_{j} {i \times j \times P(\text{i},\,\text{j})} } -\upmu_{x}\upmu_{y} } \right]/(\upsigma_{x}\upsigma_{y} ) $$
(3)

Entropy (ENT):

$$ ENT = - \sum\limits_{i}^{{}} {\sum\limits_{j} {P(\text{i},\,\text{j})\text{lb}[P(\text{i},\,\text{j})]} } $$
(4)

where P(i, j) is the element at coordinate (i, j) of the gray level co-occurrence matrix built from the low-frequency coefficients after the Contourlet transformation. μx and σx are the mean and standard deviation of the marginal distribution \( \{ \text{P}_{x} (\text{i})|\text{i} = 1,2, \ldots, \text{N}\} \), and μy and σy are the mean and standard deviation of \( \{ \text{P}_{y} (\text{j})|\text{j} = 1,2, \ldots, \text{N}\} \), where \( \text{P}_{x} (\text{i}) = \sum_{j} P(\text{i},\text{j}) \) and \( \text{P}_{y} (\text{j}) = \sum_{i} P(\text{i},\text{j}) \).

To further reflect the coarseness of the image texture, we compose the feature vector \( f_{2} = [\upmu,\upsigma] \) from the mean and variance of the low-frequency coefficient matrix after the Contourlet transformation.

The mean μ and variance σ can be calculated by:

$$ \upmu = \frac{1}{{M{ \times }N}}\sum\limits_{i = 1}^{M} {\sum\limits_{j = 1}^{N} {P(\text{i},\text{j})} } $$
(5)
$$ \upsigma = \frac{1}{{M{ \times }N}}\sum\limits_{i = 1}^{M} {\sum\limits_{j = 1}^{N} {(P(\text{i},\text{j}) -\upmu)^{2} } } $$
(6)

In (5) and (6), P(i, j) is the decomposition coefficient at coordinate (i, j) in the M × N low-frequency subband coefficient matrix after the Contourlet transformation.
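
As an illustration, the low-frequency features \( f_{1} \) and \( f_{2} \) could be computed as in the following sketch, which assumes scikit-image's co-occurrence matrix utilities and a simple uniform quantization of the real-valued low-frequency coefficients to integer gray levels; the number of gray levels is an illustrative choice, not a setting reported in this paper.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

ANGLES = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]   # 0, 45, 90, 135 degrees

def lowfreq_features(coeffs, levels=16):
    """Return f1 (8-dim: mean/variance over the 4 directions of ASM, CON, COR, ENT)
    and f2 (2-dim: mean and variance of the low-frequency coefficient matrix)."""
    # Quantize the real-valued low-frequency coefficients to integer gray levels
    # so that a gray level co-occurrence matrix can be built.
    shifted = coeffs - coeffs.min()
    gray = np.floor(shifted / (shifted.max() + 1e-12) * (levels - 1)).astype(np.uint8)

    glcm = graycomatrix(gray, distances=[1], angles=ANGLES,
                        levels=levels, symmetric=True, normed=True)
    asm = graycoprops(glcm, "ASM")[0]           # one value per direction, Eq. (1)
    con = graycoprops(glcm, "contrast")[0]      # Eq. (2)
    cor = graycoprops(glcm, "correlation")[0]   # Eq. (3)
    p = glcm[:, :, 0, :]                        # normalized GLCMs, one per direction
    ent = -np.sum(p * np.log2(p + 1e-12), axis=(0, 1))   # Eq. (4)

    per_direction = np.array([asm, con, cor, ent])   # shape (4 features, 4 directions)
    f1 = np.concatenate([per_direction.mean(axis=1), per_direction.var(axis=1)])
    f2 = np.array([coeffs.mean(), coeffs.var()])     # Eqs. (5) and (6)
    return f1, f2
```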

2.3 High Frequency Sub-Band Feature Extraction

The high-frequency directional subbands of the Contourlet Transform contain the image edges and fine texture features. In this paper, the image is decomposed into 4 layers: the first to third layers are intermediate-frequency bands and the fourth layer is the high-frequency band.

The intermediate-frequency bands contain part of the texture information of the image. The mean and variance reflect not only the unevenness of the gray levels but also the depth of the texture. Considering these factors, the mean and variance of the intermediate-frequency coefficient matrices are extracted as the texture features of the intermediate-frequency sub-bands. The three intermediate-frequency sub-bands contain 3, 4 and 8 directions respectively, so we compose a 30-dimensional feature vector \( f_{3} = [\upmu_{1} ,\upmu_{2} , \ldots ,\upmu_{15} ,\upsigma_{1} ,\upsigma_{2} , \ldots ,\upsigma_{15} ] \) from the mean and variance of the sub-band coefficients in these 15 directions.

The energy distribution is sparser in the highest-level sub-band. Because the energy distribution over different scales and directions can effectively distinguish textures, the energy of the coefficient matrix is extracted as the high-frequency feature. After the Contourlet Transform, the high-frequency band contains 16 directions, and we extract the energy of the sub-band coefficients in these 16 directions to form a 16-dimensional feature vector \( f_{4} = [\text{b}_{1} ,\text{b}_{2} , \ldots, \text{b}_{16} ] \).

The energy can be calculated as follows:

$$ E = \sum\limits_{i = 1}^{M} {\sum\limits_{j = 1}^{N} {P(i,j)^{2} } } $$
(7)

In (7), P(i, j) is the decomposition coefficient at coordinate (i, j) in the M × N high-frequency sub-band coefficient matrix after the Contourlet transformation.

In addition, the mean and variance of the high-frequency sub-band coefficient matrices reflect the depth of the texture and are also important high-frequency texture features. We extract the mean and variance of the 16 decomposed high-frequency sub-bands to form a 32-dimensional feature vector \( f_{5} = [\upmu_{1} ,\upmu_{2} , \ldots ,\upmu_{16} ,\upsigma_{1} ,\upsigma_{2} , \ldots ,\upsigma_{16} ] \).
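
A minimal sketch of the directional sub-band features \( f_{3} \), \( f_{4} \) and \( f_{5} \) is given below; it assumes the Contourlet implementation returns the 15 intermediate-frequency and 16 high-frequency directional coefficient matrices as plain lists of arrays.

```python
import numpy as np

def directional_subband_features(mid_bands, high_bands):
    """mid_bands: the 15 intermediate-frequency directional coefficient matrices
    (3 + 4 + 8 directions); high_bands: the 16 highest-frequency directional matrices."""
    f3 = np.array([b.mean() for b in mid_bands] +
                  [b.var() for b in mid_bands])             # 30-dim, Eqs. (5)-(6)
    f4 = np.array([np.sum(np.asarray(b, dtype=np.float64) ** 2)
                   for b in high_bands])                     # 16-dim energies, Eq. (7)
    f5 = np.array([b.mean() for b in high_bands] +
                  [b.var() for b in high_bands])             # 32-dim
    return f3, f4, f5
```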

As pointed out above, we extract 5 feature vectors from the different frequency bands. These feature vectors can fully represent the uniformity, gray-level depth, energy distribution and other characteristics of the image texture. However, if we group all the feature vectors into one set as the input of the identification system, the recognition speed is bound to suffer because the dimension of the vector is too large. Determining the optimal texture feature representation for the Contourlet Transform, that is, choosing as few feature vectors as possible while ensuring recognition accuracy, is therefore an important step. In this paper, we feed the extracted feature vectors into the identification system and determine the optimal feature vector combination after repeated recognition experiments.

3 Support Vector Machine (SVM) Classifier

As mentioned above, the recognition process is shown in Fig. 3. We choose the Support Vector Machine (SVM) as the classifier. The support vector machine is a machine learning method developed from statistical learning theory [24]. It has very strong generalization ability and depends less on the quantity and quality of samples. By constructing the optimal hyperplane, it achieves the best generalization ability for the classification of unknown samples. By mapping the data from the input space to a high-dimensional feature space with a support vector (SV) kernel, the SVM turns the problem into a linear one. Since the SVM minimizes a bound on the structural risk rather than the empirical risk, it can always reach a global minimum.

Fig. 3. Recognition process

Empirical Risk Minimization (ERM) is a formal term for a simple concept: find the function \( f(\text{x}) \) that minimizes the average risk on the training set. The empirical risk is defined as below:

$$ R_{emp} (f) = \frac{1}{N}\sum\limits_{i = 1}^{N} {C(f(\text{x}_{i} ),\text{y}_{i} )} $$
(8)

where \( C(f,y) \) is a suitable cost function, e.g., \( C(f,\text{y}) = (f(\text{x}) - \text{y})^{2} \).

Minimizing the empirical risk is not a bad thing to do, provided that sufficient training data is available, since the law of large numbers ensures that the empirical risk asymptotically converges to the expected risk as \( N \to \infty \). However, for small samples, one cannot guarantee that ERM will also minimize the expected risk. This is the all too familiar issue of generalization.
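
For concreteness, Eq. (8) with the squared-error cost reduces to the following short computation (an illustrative sketch; the predictor f is whatever model is being evaluated):

```python
import numpy as np

def empirical_risk(f, X, y):
    """Eq. (8) with the squared-error cost C(f, y) = (f(x) - y)^2:
    the average cost of predictor f over the N training pairs."""
    predictions = np.array([f(x) for x in X])
    return np.mean((predictions - np.asarray(y)) ** 2)
```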

The Vapnik-Chervonenkis dimension (VC dimension) is a measure of the complexity (or capacity) of a class of functions f(α). The VC dimension measures the largest number of examples that can be explained by the family f(α). The basic argument is that high capacity and generalization properties are at odds. If the family f(α) has enough capacity to explain every possible dataset, we should not expect these functions to generalize very well. On the other hand, if functions f(α) have small capacity but they are able to explain our particular dataset, we have stronger reasons to believe that they will also work well on unseen data. The VC dimension is the size of the largest dataset that can be shattered by the set of functions f(α). One may expect that models with a large number of parameters would have high VC dimension, whereas models with few parameters would have low VC dimensions. The VC dimension is a more “sophisticated” measure of model complexity than dimensionality or number of free parameters.

The VC dimension provides bounds on the expected risk as a function of the empirical risk and the number of available examples. It can be shown that the following bound holds with probability \( 1 - \eta \):

$$ R(f) \le R_{emp} (f) + \sqrt {\frac{{h(\ln(\frac{2N}{h}) + 1) - \ln(\frac{\upeta}{4})}}{N}} $$
(9)

where ℎ is the VC dimension of \( f(\upalpha) \), N is the number of training examples, and N > ℎ.
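
Numerically, the second term of Eq. (9), the VC confidence, can be evaluated as below; the sample values of h, N and η in the comment are purely illustrative.

```python
import numpy as np

def vc_confidence(h, N, eta=0.05):
    """Second term of Eq. (9): the amount added to the empirical risk,
    holding with probability 1 - eta (requires N > h)."""
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

# e.g. vc_confidence(h=10, N=100) ≈ 0.67 while vc_confidence(h=10, N=10000) ≈ 0.10,
# so the bound on R(f) tightens toward R_emp(f) as N/h grows.
```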

Structural Risk Minimization (SRM) is another formal term for an intuitive concept: the optimal model is found by striking a balance between the empirical risk and the VC dimension. SVM achieves SRM by minimizing the following Lagrangian formulation:

$$ L_{P} (\upomega,b,\upalpha) = \frac{1}{2}||\upomega||^{2} - \sum\limits_{i = 1}^{N} {\upalpha_{i} [\text{y}_{i} (\upomega^{T} \text{x}_{i} + b) - 1]} $$
(10)

where the \( \upalpha_{i} \) are positive Lagrange multipliers [25, 26].

As the ratio N/ℎ gets larger, the VC confidence becomes smaller and the actual risk becomes closer to the empirical risk. This and other results are part of the field known as Statistical Learning Theory or Vapnik-Chervonenkis Theory, from which Support Vector Machines originated.
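
As an end-to-end sketch, the concatenated feature vectors can be fed to an SVM with scikit-learn as follows; the RBF kernel, regularization parameter and train/test split are illustrative assumptions, not the settings used to produce the results in Sect. 4.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_leaf_classifier(X, y):
    """X: one row per leaf image, the concatenation of the selected feature
    vectors (e.g. f1..f5); y: integer species labels."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
    clf.fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)   # classification rate on held-out leaves
```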

4 Experimental Results

To evaluate the effectiveness of the proposed method, we carried out a series of experiments on two large and comprehensive leaf image databases: the Swedish leaf database [27] and the ICL database (see Footnote 1), which was established by the Intelligent Computing Laboratory.

All images in the ICL database were taken by cameras or scanners against a white paper background under varying illumination conditions after the leaves were picked from the plants. Both sides of every kind of leaf were photographed, and the resulting images are color images of a uniform size. To keep the background smooth and clear, each image contains only one leaf. The ICL database includes 200 species of plants, with 30 leaf images per species (15 per side), for a total of 6000 images. Figure 4 shows some samples from the ICL database, in which the top images are the front side of the plant leaves and the bottom images are the reverse side.

The Swedish leaf database contains leaf images of 15 different Swedish tree species, with 75 image files per species. The original leaf images contain the petiole. The petiole is not a robust part of the leaf shape, because its direction and length depend heavily on how the leaf was handled during collection when the leaf image samples were captured. Although the petiole does provide some discriminative information, removing it eliminates this kind of noise and allows us to build another data set.

In order to implement the proposed method, we choose 5 sets of data and set the relevant parameters empirically.

4.1 Experimental Results on the ICL Database and the Swedish Leaf Database

The experimental results on the ICL database are shown in Table 1.

Table 1. The classification rates (%) on the ICL database

The experimental results on the Swedish leaf database are shown in Table 2.

Table 2. The classification rates (%) on the Swedish leaf database

5 Conclusions

In this paper, we studied a hybrid approach based on the Contourlet Transform and a Support Vector Machine (SVM) classifier for plant recognition. By decomposing input images into multi-scale sub-bands, which have attractive properties such as shift invariance and computational efficiency, we can extract discriminative features that are insensitive to variations of illumination and translation and that capture the intrinsic geometric structure of the images. By combining the crafted features with a large margin classifier (specifically, the SVM), the proposed recognition method achieves higher experimental performance and better captures the rich structures of natural images such as edges, curves and contours. In future work, we plan to improve the efficiency of the proposed method and implement it as recognition software suitable for real-world applications.