Introduction

Angle closure glaucoma (ACG) is a prevalent eye disease in Asia and a major cause of blindness [1]. There are four main mechanisms underlying ACG: pupil block (PB), plateau iris configuration (PL), thick peripheral iris roll (PIR) and exaggerated lens vault (LV). Because each mechanism requires a specific treatment, classification of these four mechanisms is clinically important for providing better treatment to ACG patients [2]. Anterior segment optical coherence tomography (AS-OCT) provides high-resolution images of the anterior chamber of the eye and has been used extensively for glaucoma detection [13]. Anterior chamber (AC) and AC angle parameters provided by AS-OCT have been used to evaluate different ACG mechanisms [2, 4]. Wirawan et al. selected ten discriminative features from 84 parameters measured from segmented AS-OCT images of the patients, and used an AdaBoost classifier to classify these four mechanisms [5].

While many existing methods for glaucoma detection differentiate glaucomatous eyes from normal ones using features extracted from fundus images [6], optic nerve head stereo photographs [7], or OCT [8] in combination with various classifiers, few studies have addressed the classification of different ACG mechanisms despite its clinical importance. Our study is therefore motivated by the need for a new multiclass classification method with high accuracy in recognizing different ACG mechanisms. Traditionally, there are two categories of approaches to multiclass classification problems. One is to develop a single model for all the classes. Classifiers such as k-Nearest Neighbors (kNN), decision tree, Naive Bayes, and linear discriminant analysis (LDA) naturally handle multiclass classification.

In contrast, some more advanced classifiers, such as the support vector machine (SVM) and AdaBoost, are binary classifiers. Hence, a set of binary classifiers (dichotomizers) must be designed and combined to deal with multiclass classification tasks. On the other hand, ensemble learning by combining multiple dichotomizers is more advantageous than a traditional single classifier in that it has superior generalization ability and is suitable for learning nonlinear classification boundaries. Intuitively, to differentiate c classes, c dichotomizers are trained, each of which discriminates one specific class from all the remaining classes. For a test sample, the class whose dichotomizer gives the largest probability output is taken as the predicted label. This is the well-known one-versus-all (OVA) strategy [9, 10]. An alternative is one-versus-one (OVO), where c(c−1)/2 dichotomizers are trained to separate each pair of classes.

To solve multiclass problems using binary classifiers, Dietterich et al. provided a unified framework based on error-correcting output codes (ECOC), which was further improved in [11, 12]. This framework is well known for its nonlinear classification capability. A code matrix with n columns is designed to decompose the multiclass problem into n binary ones; the outputs of these n dichotomizers are then combined to determine the class label of the test sample [13]. OVA and OVO can be seen as special cases of the ECOC framework. Much research has been devoted to improving the classification performance of ECOC [14-18], especially using data-driven approaches [19-21]. By observing the data distribution in the original feature space, dichotomizers are trained for easily separated pairs of class clusters. However, this observation of the data distribution relies on a common feature space.

In this study, we propose a wrapper approach to learn the ECOC code matrix, in which the best feature set for each dichotomizer and the best combination of dichotomizers for ECOC are both selected via cross-validation on the training dataset. This method improves on other ECOC-based methods in two aspects: 1) each dichotomizer has its own optimal feature set; 2) a new criterion is proposed in which the best combination of dichotomizers is selected in consideration not only of the separability of the codewords in the ECOC framework but also of the classification ability of the selected dichotomizers.

This paper is organized as follows. The ECOC framework is reviewed in “ECOC framework” section, followed by the proposed method in “Method for multiclass classification” section. Experimental results of multiclass classification on the glaucoma dataset are shown in “Experimental results on classification of different glaucoma mechanisms” section. Conclusions are drawn in the final section.

ECOC framework

There are two major processes in the ECOC framework: coding and decoding. The key of the coding process lies in the design of a code matrix M ∈ {−1, 0, 1}^{c×n}, with c rows and n columns, where c and n denote the numbers of classes and dichotomizers, respectively. The i-th row of M provides the codeword C_i for the i-th class (i = 1, 2, …, c). Meanwhile, each column of M represents the partition of the classes by one dichotomizer. Classes coded by 1 and −1 are treated as positives and negatives, respectively, while those coded by 0 are omitted in training the dichotomizer. For a four-class classification problem, the OVA and OVO strategies are represented by the two code matrices shown in Table 1a and b, respectively.

Table 1 The ECOC code matrix in (a) OVA strategy and (b) OVO strategy
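As an illustration, the two code matrices of Table 1 can be generated programmatically. The following Python sketch is ours, not part of the original study; the function names are placeholders.

```python
# Illustrative sketch (not from the paper): building the OVA and OVO
# code matrices M in {-1, 0, 1}^(c x n) of Table 1 for c = 4 classes.
import numpy as np

def ova_matrix(c):
    # One column per class: that class is coded +1, all others -1.
    return (2 * np.eye(c) - 1).astype(int)

def ovo_matrix(c):
    # One column per class pair (i, k): class i is +1, class k is -1,
    # and the remaining classes are 0 (omitted when training).
    cols = []
    for i in range(c):
        for k in range(i + 1, c):
            col = np.zeros(c, dtype=int)
            col[i], col[k] = 1, -1
            cols.append(col)
    return np.column_stack(cols)

M_ova = ova_matrix(4)  # 4 classes x 4 dichotomizers
M_ovo = ovo_matrix(4)  # 4 classes x 6 dichotomizers, c(c-1)/2 = 6
```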

In the decoding process, the outputs of the n trained dichotomizers for a test sample are collected into a vector V = {v_1, v_2, …, v_n} and compared with the codeword of each class; the nearest codeword determines the class label of the test sample. There are many decoding strategies for evaluating the distance between the vector V and each codeword C_i [13]. For example, in loss-based decoding, the distance is formulated as

$$ d_{\mathrm{H}}\left(V, C_i\right) = \frac{1}{2}\sum_{j=1}^{n} L\left(V(j)\cdot C_i(j)\right) $$
(1)

where L(·) denotes the loss function, which depends on the type of dichotomizer.
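A minimal Python sketch of this decoding rule follows; the exponential loss L(z) = e^{−z} is the AdaBoost loss used later in the experiments, and the function and variable names are ours.

```python
# Sketch of loss-based decoding (Eq. 1) with the exponential loss.
import numpy as np

def loss_based_decode(V, M, loss=lambda z: np.exp(-z)):
    # V: length-n vector of dichotomizer outputs for one test sample.
    # M: c x n code matrix; row i is the codeword C_i.
    # The predicted class is the row with the smallest accumulated loss.
    distances = 0.5 * loss(V[np.newaxis, :] * M).sum(axis=1)
    return int(np.argmin(distances))

# Toy example with a 4-class OVA matrix: outputs strongly favour class 2.
M = 2 * np.eye(4) - 1
V = np.array([-0.9, -0.8, 1.0, -0.7])
predicted = loss_based_decode(V, M)  # 2
```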

Most research focuses on the coding process, i.e., designing an optimal code matrix. Allwein et al. proposed a dense random code matrix obtained by maximizing the distances between the codewords of different classes, which was further extended to a sparse random code matrix [21]. Pujol et al. proposed discriminant ECOC (DECOC), which learns the code matrix from a hierarchical partition of the classes and uses (c−1) dichotomizers [17]. In most existing works, the feature space is fixed to facilitate learning the data distribution and designing a problem-dependent code matrix. However, such a common feature space is at best optimal for all the dichotomizers as a whole, so the individual dichotomizers may not be optimized specifically, and the classification performance may not be the best even though the code matrix is well designed in terms of class separability. In this study, all the dichotomizers are first optimized with their own specific feature sets, and then the ECOC code matrix is learned by selecting the best combination of dichotomizers. The details are given in the next section.

Method for multiclass classification

In the proposed method, feature selection for each dichotomizer and dichotomizer selection in the ECOC code matrix are performed in tandem to learn the code matrix, as shown in Fig. 1. The proposed method is detailed as follows.

Fig. 1
figure 1

Block diagram of the proposed method

Step 1: Feature set optimization for each dichotomizer

Based on combinatorial analysis, the total number N of different dichotomizers is given by Eq. (2),

$$ N = \frac{1}{2}\left(3^{c} - 2^{c+1} + 1\right) $$
(2)

where c is the number of classes. For example, there are 25 possible dichotomizers for a four-class classification problem such as ACG diagnosis. Using a state-of-the-art feature selection method such as minimum redundancy maximum relevance (mRMR) [22-24], the best feature set, which is closely related to the target class with minimum inter-feature redundancy, is identified for each dichotomizer. Wirawan et al. have shown that mRMR is suitable and effective for selecting informative and discriminative features for ACG classification [5]. To find the optimal feature set, a filter-wrapper approach is used [22, 23]: features are first ranked according to the mRMR criterion and the highly ranked features are retained; then sequential forward selection (SFS) (or sequential backward selection (SBS), or a floating search method) is performed to select the best feature set for each dichotomizer using cross-validation on the training dataset.
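Eq. (2) can be checked by brute force. The sketch below, written for this discussion, enumerates every valid class partition over c classes.

```python
# Sketch verifying Eq. (2): each dichotomizer assigns every class a code
# in {-1, 0, 1}; a valid column must contain at least one +1 and one -1,
# and a column and its negation define the same dichotomizer, which
# accounts for the factor 1/2.
from itertools import product

def all_dichotomizers(c):
    cols = set()
    for col in product((-1, 0, 1), repeat=c):
        if 1 in col and -1 in col:
            if tuple(-v for v in col) not in cols:
                cols.add(col)
    return sorted(cols)

n_dichotomizers = len(all_dichotomizers(4))  # 25 for the four-class ACG problem
assert n_dichotomizers == (3**4 - 2**5 + 1) // 2
```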

However, since the number of dichotomizers N grows exponentially with the number of classes c (Eq. (2)), a wrapper approach to selecting the best feature set for each dichotomizer may be time-consuming. For simplicity, a filter approach is preferred for fast feature selection when N is large. Feature selection not only improves the classification performance (i.e., accuracy) of each dichotomizer, but also decreases the dependency among the dichotomizers by selecting a different optimal feature set for each, which benefits the error-correcting ability of ECOC.
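The filter-wrapper step above can be sketched as follows. This is a simplified greedy forward pass, not the paper's exact procedure: the mRMR ranking and the cross-validated scorer are passed in as callables (both depend on the dichotomizer being trained), and all names are ours.

```python
# Minimal sketch of the filter-wrapper feature selection step.
def sequential_forward_selection(ranked_features, cv_score, max_features=None):
    # ranked_features: feature indices ordered by the mRMR filter.
    # cv_score: maps a candidate feature subset to cross-validated accuracy.
    selected, best_score = [], float("-inf")
    for f in ranked_features[:max_features]:
        score = cv_score(selected + [f])
        if score > best_score:           # keep the feature only if it helps
            selected, best_score = selected + [f], score
    return selected, best_score

# Toy scorer: features 0 and 2 are informative, the rest only add cost.
toy_score = lambda s: len(set(s) & {0, 2}) - 0.01 * len(s)
best_set, best_acc = sequential_forward_selection([0, 2, 1, 3], toy_score)
# best_set == [0, 2]
```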

Step 2: Maximization of the separability of codewords in ECOC framework in consideration of the performance of each dichotomizer

To improve the classification performance of ECOC framework, the separability of ECOC codes is maximized, which is defined as

$$ d_{s} = \min_{\substack{1 \le i \le c \\ i < k \le c}} \left( \frac{1}{2}\sum_{j=1}^{n} L\left(C_i(j)\cdot C_k(j)\right) \right) $$
(3)

where C_i and C_k denote the codewords of the i-th and k-th classes, respectively [12]. In [21], the separability is modified as

$$ d_{s}^{\prime} = \min_{\substack{1 \le i \le c \\ i < k \le c}} \left( \frac{1}{2}\sum_{j=1}^{n} \left|C_i(j)\cdot C_k(j)\right| \, L\left(C_i(j)\cdot C_k(j)\right) \right) $$
(4)

to ignore the contributions from positions coded with 0. This definition is more reasonable. However, all existing ECOC methods consider only the code information of the dichotomizers, so dichotomizers with unsatisfactory binary classification ability may also be selected into the ECOC framework, which may deteriorate the final classification performance. In the proposed method, the separability is reformulated as

$$ d_{s}^{\prime\prime} = d_{s}^{\prime} + \lambda\bar{a} = \min_{\substack{1 \le i \le c \\ i < k \le c}} \left( \frac{1}{2}\sum_{j=1}^{n} \left|C_i(j)\cdot C_k(j)\right| \, L\left(C_i(j)\cdot C_k(j)\right) \right) + \lambda\bar{a} $$
(5)

where ā is the average binary classification accuracy of the selected dichotomizers, and λ is a coefficient weighting the relative importance of ā compared with d_s′. There are two key parameters to determine: the weighting coefficient λ and the code length n. Cross-validation is applied on the training dataset to find the optimal parameters λ* and n*, and also the optimal set of dichotomizers. Finally, the trained and selected dichotomizers are used as base learners in the ECOC framework. The algorithm of the proposed method is shown in Table 2.
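The criterion of Eq. (5) can be sketched directly. In the snippet below, written for this discussion, the exponential loss is the AdaBoost loss used in the experiments, and the accuracy values are made up for illustration.

```python
# Sketch of the selection criterion in Eq. (5): the minimum pairwise
# codeword distance, with zero-coded positions masked out, plus lambda
# times the average accuracy of the selected dichotomizers.
import numpy as np

def separability(M, accuracies, lam, loss=lambda z: np.exp(-z)):
    c = M.shape[0]
    d_min = float("inf")
    for i in range(c):
        for k in range(i + 1, c):
            prod = M[i] * M[k]
            # |C_i(j) * C_k(j)| is 0 wherever either codeword has a 0.
            d = 0.5 * (np.abs(prod) * loss(prod)).sum()
            d_min = min(d_min, d)
    return d_min + lam * float(np.mean(accuracies))

M = 2.0 * np.eye(4) - 1            # OVA code matrix for c = 4
acc = [0.90, 0.85, 0.88, 0.80]     # hypothetical dichotomizer accuracies
score = separability(M, acc, lam=360)
```

With λ = 0 the criterion reduces to the plain code separability d_s′; large λ lets the average dichotomizer accuracy dominate, as noted for λ = 600 in the experiments below.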

Table 2 The algorithm of the proposed method

Experimental results on classification of different glaucoma mechanisms

Data preparation and experiment results

A dataset of 152 ACG samples provided by the National University Hospital Singapore (NUHS), collected over 2 years, is used for classification of the four ACG mechanisms [5, 24, 25]. The dataset is small because of the limited number of ACG patients recruited under the Ministry of Education (MoE) AcRF Tier 1 Funding, Singapore. AS-OCT images of glaucoma patients with different ACG mechanisms are shown in Fig. 2. PIR is characterized by a thick and folded iris, while PB is characterized by a convex forward iris profile. Eyes with the PL and LV mechanisms have the largest and smallest AC volumes, respectively [1]. Customized software (Anterior Segment Analysis Program, ASAP, National University Hospital, Singapore) was used to measure anterior chamber (AC) characteristics. The ASAP software uses the level set method to segment the AC area [1, 5] of the AS-OCT image, as shown in Fig. 3. The quantifiable AC parameters measured by the ASAP software include anterior chamber depth (ACD), anterior chamber volume (ACV), anterior chamber width (ACW), angle recess area (ARA), angle opening distance (AOD), post closure area (PCA), trabecular-iris space area (TISA), lens vault (LV) distance, iris area (IA), iris thickness (IT), iris concavity, etc., as illustrated in Fig. 4. The samples were labeled by medical experts at NUHS (C. C. Sng, M. C. Aquino and P. T. K. Chew), and basic information about the glaucoma dataset used in this study is given in Table 3.

Fig. 2
figure 2

Illustrative samples of AS-OCT images of glaucoma patients with different ACG mechanisms a PIR; b LV; c PB; d PL.

Fig. 3
figure 3

An example of AS-OCT image a and its corresponding segmentation result in b by using ASAP software

Fig. 4
figure 4

The parameters measured from the AS-OCT image of the AC segment of the eyes (not all of the 84 parameters are shown here). For example, ACDL1500 means anterior chamber depth of the left hand side measured at 1500 μm from the scleral spur

Table 3 The basic information of the Glaucoma dataset used in this study

Since each mechanism has several characteristics from a medical point of view, 84 features are extracted, all of which are clinically important parameters measured from the segmented AS-OCT image. Some of the important features, identified in our previous studies using the same dataset [5, 24, 25], are as follows: AC_Area (anterior chamber area); AC_Volume (anterior chamber volume); ACD (anterior chamber depth); ACW (anterior chamber width); Anterior_lens_curvature (curvature of the anterior lens surface); ILC_L (iridolenticular contact on the left side); ILC_R (iridolenticular contact on the right side); Iris_area_IL (iris area in iridolenticular contact); Iris_area_L500 (with the scleral spur as the centre of a circle of radius 500 μm, the area of the iris region inside this circle on the left side); Iris_Chord_Length_L (the distance from the tip of the iris to the periphery on the left side); Iris_Chord_Length_R (the distance from the tip of the iris to the periphery on the right side); Iris_end_concavity_L (concavity of the iris area at the end on the left side); Iris_thickness_L1000 (with the scleral spur (SS) as the centre of a circle of radius 1000 μm, the intersection point on the anterior surface of the iris is identified; the iris thickness is the shortest distance from this intersection point to the posterior surface of the iris, on the left side); Iris_thickness_L_DMR (thickness of the iris in the dilator muscle region (DMR) on the left side); Iris_thickness_L_Max (maximum iris thickness); Iris_thickness_L_SMR (iris thickness in the sphincter muscle region on the left side); Iris_thickness_PL (iris thickness at the plateau contact); Iris_thickness_R_DMR (thickness of the iris in the dilator muscle region on the right side); Iris_thickness_R_SMR (iris thickness in the sphincter muscle region on the right side); Lens vault (the perpendicular distance between the horizontal line joining the two scleral spurs and the anterior pole of the crystalline lens, representing the anterior portion of the lens); Pupil_distance (distance between the centers of the pupils).

The experimental investigation of the proposed method was implemented in Matlab 8.0 R2012b (The MathWorks Inc., Natick, MA, USA) and Microsoft Visual Studio (C++). All 84 features are normalized to zero mean and unit variance. Wirawan et al. have shown that AdaBoost performs better than SVM for ACG classification, and that AdaBoost combined with the OVA strategy also outperforms traditional multiclass classifiers, such as the classification tree and Naive Bayes, in terms of classification accuracy [5]. The WEKA (Waikato Environment for Knowledge Analysis) data mining tool [26] was used to compare the proposed method with the traditional multiclass classifiers, using the same default parameters as reported in [5], which used the same dataset. Thus AdaBoost is the binary classifier of choice in this experiment. In addition, for a fair comparison with the results in [5], we also used mRMR for feature selection. In this four-class classification problem, there are in total 25 possible dichotomizers, which are easily obtained by exhaustive search. In the first step of the proposed method, all 84 features are ranked according to the mRMR criterion for each dichotomizer. Each feature is incrementally added to a ranking list according to the following equations,

$$ \max_{f_j \in F - F_{m-1}} \left[ I\left(f_j, y\right) - \frac{1}{m-1} \sum_{f_i \in F_{m-1}} I\left(f_j, f_i\right) \right] $$
(6)

or

$$ \max_{f_j \in F - F_{m-1}} \left[ I\left(f_j, y\right) \Big/ \left( \frac{1}{m-1} \sum_{f_i \in F_{m-1}} I\left(f_j, f_i\right) \right) \right] $$
(7)

where F is the whole feature set and F_{m-1} is the set of (m−1) features already selected; the criterion maximizes the mutual information I(f_j, y) between the j-th feature and the class label y while minimizing the mutual information I(f_j, f_i) between the j-th feature and each feature f_i already in F_{m-1} [23]. To further increase the classification ability of each dichotomizer, a wrapper approach is used to select the feature set that yields the lowest classification error. The dichotomizers are ranked in descending order of classification accuracy, as shown in Table 4. In Step 2, for λ_p ∈ [0, 600] with a step of 30 and n_q ∈ [1, 9] with a step of 1 (p = 0, 1, 2, …, 20; q = 1, 2, …, 9), the best set of dichotomizers B_{p,q} is determined by maximizing d_s″ in Eq. (5), where L(z) = e^{−z}. The separability d_s″ is dominated by the average classification accuracy of the selected dichotomizers when the weighting coefficient λ = 600. Following the suggestion of [17, 20], the code length n should be about 15 log(c) ≈ 9; here n varies from 1 to 9.
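The incremental mRMR ranking of Eq. (6) can be sketched as below. The mutual information values are supplied as precomputed arrays (relevance[j] = I(f_j, y), redundancy[j, i] = I(f_j, f_i)); estimating mutual information itself is outside the scope of this sketch, and the toy values are made up.

```python
# Sketch of the incremental mRMR ranking (difference form, Eq. 6).
import numpy as np

def mrmr_rank(relevance, redundancy, n_select):
    selected = [int(np.argmax(relevance))]       # most relevant feature first
    while len(selected) < n_select:
        best_j, best_score = None, float("-inf")
        for j in range(len(relevance)):
            if j in selected:
                continue
            # Relevance minus mean redundancy to already-selected features.
            score = relevance[j] - redundancy[j, selected].mean()
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rel = np.array([0.90, 0.80, 0.85, 0.10])
red = np.array([[0.0, 0.7, 0.1, 0.0],
                [0.7, 0.0, 0.6, 0.0],
                [0.1, 0.6, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0]])
ranking = mrmr_rank(rel, red, 3)  # [0, 2, 1]
```

Note that feature 1, although more relevant than nothing, is demoted below feature 2 because of its high redundancy with feature 0, which is exactly the behaviour Eq. (6) encodes.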

Table 4 The Ranking of all the Dichotomizers according to their classification accuracy on the training dataset

The classification performance of these different sets of dichotomizers {B_{p,q}} is evaluated on the training dataset by leave-one-out cross-validation (LOOCV); given the limited data available, LOOCV is used to prevent over-fitting to the training data. The best set of dichotomizers is determined to be B_{13,6} = {D1, D3, D4, D5, D6, D7}, with optimal parameters λ* = 360 and n* = 6. The ECOC code matrix formed by the selected set of dichotomizers B_{13,6} is shown in Table 5. The loss-based decoding strategy is used in the decoding process, with the AdaBoost loss function L(z) = e^{−z}. The confusion matrix obtained using LOOCV on the glaucoma dataset is shown in Table 6. The weighted average classification accuracy is 87.65 %, as shown in Table 7, which is better than the accuracy of 84 % obtained in [5] (the dataset in [5] differs slightly from ours in that four additional patients with no glaucoma mechanism were included; this effect is negligible).

Table 5 The ECOC code matrix determined in the proposed method
Table 6 Confusion matrix obtained by using the proposed method (Leave-one-out cross-validation)
Table 7 Comparison of classification accuracy of the proposed method with other ECOC-based methods with dichotomizer-specific feature selection

Comparison with other ECOC methods, including OVO, OVA, sparse random ECOC

In most traditional methods, the dichotomizers are not optimized individually; only Wang et al. and Maghsoudi et al. used feature selection to optimize the dichotomizers in an OVA scheme [14, 15]. In this paper, we apply feature selection to each dichotomizer for all three ECOC methods based on the OVO, OVA [9, 20] and sparse random [21] strategies. The classification accuracy for each class and the weighted average accuracy of these three popular existing ECOC methods are shown in Table 7. The highest weighted average accuracy of the three ECOC methods is 85.81 %, better than that of traditional multiclass classifiers such as the classification tree (72.22 %), random forest (76.58 %), SVM combined with the OVA strategy (78.22 %) and Naive Bayes (77.93 %).

We also randomly select 80 % of the dataset for training and the remaining 20 % for testing to compare the proposed method with the other ECOC methods; this process is repeated 2000 times. The classification accuracy (mean ± standard deviation) of the proposed method and the three other ECOC methods mentioned above is 84.86 ± 3.56 %, 83.69 ± 3.75 %, 79.76 ± 3.75 %, and 81.45 ± 3.70 %, respectively. The histograms of the classification accuracy for the proposed method and the three ECOC methods are shown in Fig. 5a–d, respectively, from which we can see that the proposed method performs best. In the proposed method, all the dichotomizers are first optimized individually to increase their diversity and classification accuracy, and then the ECOC code matrix is learned by maximizing Eq. (5) and selecting a set of competitive dichotomizers. Not only the code information but also the classification ability of the dichotomizers is considered in maximizing the separability of the codewords in the ECOC matrix. In most traditional ECOC methods, by contrast, the dichotomizers are not selected in consideration of their classification performance.
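The repeated random-split protocol above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the classifier is a stand-in callable (here a majority-class baseline), and the dummy dataset merely matches the size of the glaucoma dataset, with made-up class counts.

```python
# Sketch of repeated 80/20 holdout evaluation with mean/std accuracy.
import numpy as np

def repeated_holdout(X, y, fit_predict, n_rounds=2000, test_frac=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * len(y)))
    accs = []
    for _ in range(n_rounds):
        perm = rng.permutation(len(y))
        test, train = perm[:n_test], perm[n_test:]
        y_pred = fit_predict(X[train], y[train], X[test])
        accs.append(float(np.mean(y_pred == y[test])))
    return np.mean(accs), np.std(accs)

X = np.zeros((152, 84))                            # dummy feature matrix
y = np.array([0] * 75 + [1] * 35 + [2] * 25 + [3] * 17)  # made-up class counts
# Majority-class baseline standing in for the full ECOC pipeline.
majority = lambda Xtr, ytr, Xte: np.full(len(Xte), np.bincount(ytr).argmax())
mean_acc, std_acc = repeated_holdout(X, y, majority, n_rounds=50)
```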

Fig. 5
figure 5

The histograms of the classification accuracy of the a proposed method; b OVO; c sparse random; d OVA based ECOC methods, based on 2000 rounds of experiments in which 80 % of the dataset was randomly selected for training and the remaining 20 % for testing

In this experiment, three dichotomizer sets, namely {D1, D2, D5, D6, D18, D23}, {D4, D12, D16, D17}, and {D6, D7, D8, D9, D10, D11, D12, D13, D14}, are used in the three ECOC methods based on the OVO, OVA and sparse random strategies, respectively (the details of the dichotomizers are shown in Table 4). Because dichotomizers with low accuracy, such as {D18, D23, D16, D17, D13, D14}, are incorporated in the code matrices of these three methods, their final performance deteriorates accordingly. The performance of OVO is relatively better than that of the OVA and sparse random based ECOC methods, because only two relatively inaccurate dichotomizers {D18, D23} are included and the others have very high classification accuracy. In the proposed method, the selected dichotomizers {D1, D3, D4, D5, D6, D7} all have high accuracy, which ensures better performance than the other traditional ECOC methods.

Conclusions

Angle closure glaucoma is a prevalent eye disease worldwide, especially in Asia. ACG has four different mechanisms, each requiring a different clinical treatment; classification of these four mechanisms is therefore important in automatic diagnosis of glaucoma. In this paper, a new ECOC-based ensemble learning method is proposed for multiclass classification, with application to the classification of the four mechanisms of ACG. In the proposed method, for each possible dichotomizer, the best feature set is determined and the classification accuracy is obtained using cross-validation on the training glaucoma dataset. The dichotomizers are selected by maximizing both the separability of the codewords in the ECOC matrix and the classification ability of the dichotomizers, and the selected dichotomizers are included in the ECOC framework. The proposed method has been experimentally applied to a glaucoma dataset of 152 patients covering the four mechanisms, and its classification accuracy is experimentally validated to be better than that of three other existing ECOC methods.

Two points make the proposed method perform better than the others: 1) the dichotomizers are optimized individually and their binary classification abilities are quantified prior to dichotomizer selection; the classification accuracy and diversity of the dichotomizers in the ECOC framework are improved by using a different optimal feature set for each dichotomizer; 2) the ECOC code matrix is determined so that the dichotomizers are all competitive, with high binary classification performance, and the codewords are well separated, which ensures the final classification performance. The proposed method is promising for automatic classification of different ACG mechanisms and can help doctors select a specific treatment for each mechanism.