Introduction

Medical informatics is the intersection of information science, computer science, and health care [1,2,3,4,5,6,7,8,9,10]. This field deals with the resources, devices, and methods required to optimize the acquisition, storage, retrieval, and use of information in health [11,12,13,14,15,16,17,18,19,20,21,22,23,24]. The decisions of the administration departments of medical organisations are critical, particularly those regarding the selection of automated solutions for the diagnosis and detection of complex diseases, such as acute leukaemia [25]. The importance of selecting appropriate automated solutions can be attributed to their extensive use [26]. Automated solutions based on artificial intelligence techniques can provide rapid acute leukaemia diagnosis and classification and increase the reliability and accuracy of diagnostic results [26,27,28,29,30,31,32]. Many physicians, cancer treatment centres and hospitals have started using automated models for acute leukaemia classification to address several potential limitations of manual analysis [26, 29, 30]. However, despite the increasing number of automated classification models, finding models that deliver highly accurate results in a short time and without error remains challenging [33]. Therefore, the administration departments of health organisations face difficulties in evaluating and benchmarking automated classification models for acute leukaemia and determining the best model, especially when no single model is superior [29, 33, 34]. Moreover, evaluating and comparing different classification models is difficult in the presence of multiple evaluation criteria [35, 36]. Given the existence of different classification models for acute leukaemia, the health sector has difficulty deciding which model should be used.
The processes required for the evaluation and benchmarking of automated classification models for dangerous medical cases are crucial to identifying the classification model that delivers the best results [27]. These processes are crucial because the selection of an incorrect classification model can lead to the loss of a patient’s life, legal accountability and even financial costs for health organisations. For example, when a model incorrectly identifies non-cancer cells as cancerous, the surgery and diagnostic tests the patient has to undergo may adversely affect his or her mental health. Conversely, when a model incorrectly identifies cancer cells as non-cancerous, the disease remains untreated, and the patient may die as a result. Both cases have a negative impact on the reputation and performance of healthcare organisations. Therefore, determining the most efficient technique for selecting a suitable classification model for acute leukaemia is necessary. Given that these models are not cheap and directly relate to the medical care of humans, they must be evaluated and benchmarked [35]. The procedures related to the evaluation and benchmarking of multiclass classification of acute leukaemia remain challenging [29]. The tasks involved in evaluating and benchmarking automated models for acute leukaemia are difficult decision-making tasks and require numerous measurements [34]. Two basic groups of criteria are commonly utilised in the evaluation and benchmarking of acute leukaemia multiclass classification models: (1) time complexity and (2) reliability. The reliability group has a set of sub-criteria (TP, TN, FP, FN, ave-accuracy, precisionμ, precisionM, recallM, fscore and error rate) [37, 38]. Snousy et al. 
considered the main requirements for the best classification model in terms of accuracy [33]; nine classification models were compared on the basis of the accuracy criterion in their study. Despite the importance of the remaining criteria [39,40,41], several studies [32, 42,43,44,45,46] adopted only the classification accuracy criterion for the evaluation and benchmarking of classification models. However, the quality assessment of acute leukaemia classification models requires additional attention, and other aspects must be considered in the evaluation process [33]. According to Rawat et al., although accuracy is the most widely used metric, it treats every class with equal importance and neglects the differences among the types of classes [32]. However, in real cases, particularly those related to medicine, the distinction among certain classes is important. In [47,48,49], True Positive, True Negative, False Positive and False Negative were used as key criteria for evaluation and benchmarking, but other requirements that might affect classification performance were neglected. In [35], the calculation of time complexity was found to be time consuming for classification. High computational cost slows down classification [50]. Misha et al. indicated that dataset size should be considered in the classification task because a large dataset affects processing time; this condition is known as time complexity [35]. Ludwig et al. stated that in the scope of cancer data analysis, speed and accuracy are the main aspects that must be considered in evaluating the efficiency of classification models [51]. Classification tasks are considered good if results are delivered with low computational time whilst classification accuracy is simultaneously improved [52]. 
In other words, the main requirements that must be considered when developing any acute leukaemia multiclass classification model are (1) time complexity and (2) reliability. Reliability should be high, and the time complexity of producing the output should be low [52]. However, these are competing requirements [53]; that is, high reliability cannot be obtained simultaneously with low time complexity. Thus, developers usually focus on either increasing reliability or decreasing time complexity. If a highly reliable multiclass classification model is required, then time must be sacrificed, and vice versa. This trade-off and conflict among the evaluation criteria are reflected in the evaluation and benchmarking process: conflicts among criteria arise in the comparison, and the benchmarking process is affected. Consequently, benchmarking over multiple criteria is difficult under trade-off and conflict [54]. Reliability and time complexity should both be measured in the evaluation of any classification model. However, the current approaches for comparing novel and previous models in all the reviewed studies do not cover the full set of evaluation and benchmarking criteria; they emphasise one evaluation aspect and neglect the rest because they are not sufficiently flexible to deal with the conflict or trade-off among the various criteria [33]. Conflict and trade-off are the first issue faced in the evaluation and benchmarking of multiclass classification models. The second issue is the importance of each criterion. The evaluation of acute leukaemia multiclass classification models involves a set of criteria, and the importance of each criterion is distinct and depends on the objectives of the developed model. That is, the importance of one evaluation criterion might be boosted in exchange for the low importance of another criterion, depending on model objectives [34]. 
Therefore, trade-off and conflict exist between evaluation and benchmarking criteria because the importance of each criterion differs across models [55]. The third issue emerges when the benchmarking process is conducted on the basis of multiple criteria and sub-criteria simultaneously [56,57,58]. This approach is difficult because of the trade-off among the criteria and their varying importance; moreover, the values of the reliability criteria set depend on the confusion matrix, which contains four parameters: True Positive, False Positive, True Negative and False Negative [47, 59]. The four parameters are prone to losing values in experiments, which affects the remaining values of the other criteria in the reliability group. Despite the criticism of these parameters, studies still use them for the evaluation of multiclass classification models [56,57,58, 60]. Furthermore, the current evaluation and benchmarking tools have limitations. These tools cannot entirely cover the measurements required by multiclass classification models. Moreover, these tools are limited in terms of calculating all the parameters of the reliability group, comparing additional classification methods and matching between classification methods because they cannot rank the models according to performance [61,62,63]. On the basis of the preceding discussion, the problem of evaluating and benchmarking multiclass classification models for acute leukaemia is defined as a multi-criteria problem. Therefore, an integrated and comprehensive platform covering all aspects of performance in the evaluation and benchmarking of multiclass classification models for acute leukaemia should be developed. This integrated platform will serve as a tool that supports the decisions of the administrators of medical organisations in evaluating and benchmarking the available alternatives and identifying the best model. 
The main objective of the current paper is to propose a framework for evaluating and benchmarking multiclass classification models for acute leukaemia. The remainder of this article is divided into seven sections: the ‘Related studies’ section presents a review of the related literature. The ‘Multi-criteria decision making’ section presents the theoretical background of the recommended solution. The ‘Methodology’ section reports the evaluation and benchmarking framework for multiclass classification models. The results and discussion are reported in the ‘Results and discussion’ section. The ‘Validation’ section deliberates the validation results for the proposed framework. The ‘Limitations and future study’ section highlights the limitations of the proposed framework and future studies. The ‘Conclusion’ section concludes the research.

Related studies

The selection of a suitable classification model for acute leukaemia is considered a challenge faced by medical institutions, especially those with specialisation in cancer treatment. The essence of the challenge lies in the capacity of the selected model to allow a precise and immediate acute leukaemia classification.

Previous literature has distinctly explained that the classification tasks of acute leukaemia differ with respect to the accuracy of the provided results and overall performance; similarly, no previous classification model has been considered superior [29, 33, 34]. Many studies have discussed the development of automated models for acute leukaemia analysis, as well as the way the models are used and the benefits that health organisations could gain from using them [29, 32, 34, 47, 49, 64,65,66,67,68,69]. However, studies that aim to evaluate and benchmark the available classification models and determine the best one are limited. The existing academic literature on the evaluation and benchmarking of acute leukaemia multiclass classification models is scarce and scattered; some studies are limited to the evaluation and benchmarking of only one aspect of performance. In [70], automated microscopy with the DM96TM was analysed for determining blood cells, and its accuracy was compared with that of the manual method and the XE-2100TM. Snousy et al. compared nine classification models of the decision tree family in terms of accuracy and examined the experimental effects of different feature selection methods on accuracy [33]. ALL-IDB, a public image dataset of peripheral blood samples from normal people and patients with leukaemia, was proposed in [27]; the dataset, which is particularly designed for comparing and evaluating segmentation and classification algorithms, supports supervised classification and segmentation. In [71], three automatic detection approaches for leukaemic cells were compared. The first approach is based on a support vector machine, the second on a neural network and the third on Gaussian mixture model estimation. The comparison relied on three criteria, namely, accuracy, precision and recall. 
In addition to the effect of various segmentations on classification results, in [39], two classification schemes were compared in terms of segmentation quality. The first scheme is based on a support vector machine, whereas the second is based on random forest. Evaluation and benchmarking methods must cover all the main requirements and substantively determine the performance and quality of classification models for acute leukaemia. Saritha et al. ensured that their automated classification model has high accuracy and efficiency in addition to reduced processing time and a small error rate; suitable treatment can be provided to patients with the early identification of leukaemia [52]. Despite the substantial effort in the evaluation and benchmarking of acute leukaemia classification tasks, no study has provided an integrated solution that covers the key evaluation criteria for evaluating and benchmarking multiclass classification models and helps the administrators of medical organisations and various users to determine a suitable model. This study attempts to fill the evaluation and benchmarking research gap with respect to acute leukaemia classification tasks.

Multi-criteria decision making (MCDM)

Numerous MCDM definitions are available in the academic literature. Keeney and Raiffa [72] defined MCDM as an extension of decision theory that covers any decision with multiple objectives. MCDM is used as a methodology to aid individuals in assessing alternatives, often under conflicting criteria that are combined into one overall appraisal [73,74,75,76,77]. Among the other definitions of MCDM, [78] defined MCDM as an umbrella term describing a collection of formal approaches that take explicit account of multiple criteria to assist individuals or groups in exploring important decisions [79,80,81,82,83,84]. Among the most well-known decision techniques, MCDM is known for its decision-making capabilities, enabling it to address complicated decision problems whilst handling multiple criteria [85, 86]. Furthermore, MCDM provides a systematic method for addressing decision problems on the basis of multiple criteria [86,87,88,89,90]. The goal is to help decision makers deal with this kind of problem [91]. The MCDM procedure often relies on approaches of a quantitative and qualitative nature and frequently concentrates on simultaneously dealing with multiple and conflicting criteria [92, 93]. MCDM can also increase decision quality in more effective and rational ways than traditional processes [94]. Furthermore, MCDM aims to identify suitable alternatives among a group of available ones and rank the alternatives in decreasing order of performance [95,96,97,98,99], and finally to select the best of these alternatives [100,101,102,103,104,105,106]. Suitable alternatives are scored on the basis of these goals. Essential terms are required in any MCDM solution, namely, the decision or evaluation matrix and the decision criteria [107]. 
The decision matrix must be created using elements including n criteria and m alternatives. The intersection of each alternative and criterion is specified as x_ij. Therefore, the matrix (x_ij)_(m×n) is expressed as follows:

$$ {\displaystyle \begin{array}{l}\kern2.00em {C}_1\kern1.25em {C}_2\kern1em \cdots \kern1em {C}_n\\ {}D=\begin{array}{c}{A}_1\\ {}{A}_2\\ {}\vdots \\ {}{A}_m\end{array}\left[\begin{array}{cccc}{x}_{11}& {x}_{12}& \cdots & {x}_{1n}\\ {}{x}_{21}& {x}_{22}& \cdots & {x}_{2n}\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{x}_{m1}& {x}_{m2}& \cdots & {x}_{mn}\end{array}\right], \end{array}} $$

where A_1, A_2, …, A_m are the possible alternatives to be ranked by the decision makers (i.e. the classification models); C_1, C_2, …, C_n are the criteria against which the performance of each alternative is evaluated; x_ij is the rating of alternative A_i with respect to criterion C_j and W_j is the weight of criterion C_j. Special processes must be accomplished to score the alternatives, including normalisation, a maximisation indicator and the addition of weights, depending on the method. For example, suppose that D is the decision matrix utilised to score the performance of alternative A_i on the basis of criterion C_j. Enhancing the decision-making process is important and possible by involving decision makers and stakeholders and by using appropriate decision-making methods to handle multi-criteria problems. Healthcare is one of the domains in which MCDM is extensively utilised [93, 108]. Decision making in healthcare can be improved through a systematic method and by determining the best decision through different MCDM methods [109, 110]; notably, many decisions in the healthcare and medical fields are complex and unstructured [108]. Numerous MCDM techniques have been developed, and the most commonly used are the best-worst method (BWM), weighted product method (WPM), hierarchical adaptive weighting (HAW), simple additive weighting (SAW), multiplicative exponential weighting (MEW), weighted sum model (WSM), analytic network process (ANP), analytic hierarchy process (AHP), technique for order of preference by similarity to ideal solution (TOPSIS) and VlseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR), each of which uses different notations [1, 73, 79,80,81, 108, 110,111,112,113,114,115,116,117,118,119,120,121]. The available MCDM techniques are diverse, and this diversity makes the selection of suitable techniques difficult. 
Each technique has its own limitations and strengths [81, 109, 112, 122, 123]. Thus, selecting the most suitable MCDM method is important. To the best of our knowledge, none of the analysed methods has been used to rank multiclass classification models for acute leukaemia. In our previous work [87], we found that BWM and VIKOR are two of the best MCDM methods.

The current study utilised the best-worst method because it can provide more consistent results than AHP and other MCDM weighting methods. Moreover, BWM requires fewer pairwise comparisons than other methods [112, 124,125,126]. The pairwise comparison in BWM also focuses on reference comparisons: it elicits the preference of the most important criterion over all the other criteria, in addition to the preference of all the other criteria over the least important criterion [111, 112, 127]. Conversely, MCDM methods are frequently used to rank alternatives, and the most common is VIKOR. This method utilises a compromise-priority approach for multiple response optimisation [110, 128, 129]. VIKOR is based on an aggregating function that represents ‘closeness to the ideal’, and its ranking index is based on a particular measure of ‘closeness’ to the ideal solution. Furthermore, VIKOR can rank the alternatives and accurately and rapidly determine the best one [128]. The style of recent VIKOR studies has changed, and VIKOR is now usually integrated with another MCDM method. The reviewed studies identified and provided different examples of applying VIKOR with BWM to improve the consistency of subjective weights. Such an integration of VIKOR with BWM realises a robust method. Given the advantages of the two methods in overcoming the uncertainties associated with the problem described in [130,131,132,133,134,135,136], using VIKOR and BWM is easy and clear even for those with no background in MCDM [136]. Utilising VIKOR in different cases (e.g. individual and group) has been recommended. Two main cases of decision making are basically emphasised: the first case is decision making based on a single decision maker; the second involves many decision makers and is called group decision making (GDM), in which individuals collectively select alternatives from the ones presented to them. 
The decision is not attributed to any single group member because individual and social processes, such as social influence, contribute to the outcome. GDM techniques systematically collect and combine elements from experts in different fields, including their knowledge and judgement. In the group case, each expert provides a subjective judgement on the criteria and assigns a weight to every criterion [110, 137]. Finally, the evaluation and benchmarking of acute leukaemia multiclass classification suggests a need to integrate the BWM and VIKOR methods: weights are assigned to the criteria (reliability and time complexity) using BWM on the basis of expert evaluation, and VIKOR is recommended for ranking the multiclass classification models.

Methodology

This section introduces the evaluation and benchmarking methodology for the automated multiclass classification models and presents the procedures and steps of the proposed framework. The output is a ranking of the multiclass classification models based on the set of criteria, with BWM and VIKOR used for weighting and ranking, respectively. The overall conceptual elements of the present study are illustrated in Fig. 1.

Fig. 1

Benchmarking methodology of the multiclass classification models for acute leukaemia

Construction of decision matrix

The decision matrix is the main component of the evaluation and benchmarking framework. Its main parts are the decision criteria and the alternatives. In the present case, the criteria represent the metrics used for measuring the quality of multiclass classification models. The next subsection describes the procedures followed to develop and evaluate the multiclass classification models and construct the decision matrix.

Data source

The acute leukaemia microarray dataset proposed by [138] was adopted in this study. The dataset is popular and frequently utilised in the academic literature [139,140,141] and is publicly available. The dataset covers three categories of acute leukaemia: acute myelogenous leukaemia (AML), ALL B cell and ALL T cell. It comprises 5327 genes and 72 samples, of which 38 are AML, 9 are ALL-B and 25 are ALL-T types.

Development of multiclass classification models

Developing multiclass classification models requires a three-step process. Firstly, the target dataset is prepared, which includes the selection of relevant features. Secondly, training (the learning process), in which a class model is established through machine learning, is achieved by analysing the instances of a training dataset; each instance is assumed to belong to a predefined class. Thirdly, the machine learning algorithms are executed on another independent dataset, also known as the testing dataset, with the aim of estimating machine learning performance. If the performance of a multiclass classification model appears to be ‘acceptable’, then the model can be utilised for future classification cases in which the class label is unknown. Ultimately, the multiclass classification models that supply acceptable results can be considered acceptable multiclass classification models.
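The three steps above can be sketched as follows, using scikit-learn as a stand-in for the Weka workflow the paper uses; the synthetic dataset and the choice of a random forest classifier are illustrative assumptions, not the paper's actual data or algorithm set.

```python
# Sketch of the three-step model-development process described above.
# The synthetic data only illustrates the workflow (72 samples, 3 classes,
# mimicking the shape of the AML / ALL-B / ALL-T problem).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Step 1: prepare the target dataset (including feature selection upstream).
X, y = make_classification(n_samples=72, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2: training -- learn a class model from labelled instances.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Step 3: estimate performance on an independent testing dataset; if the
# result is 'acceptable', the model can classify unseen, unlabelled cases.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```

The stratified split mirrors the paper's division of the dataset into a training part and a testing part.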

Microarray data generally contain a small number of samples (dozens) and high dimensionality (thousands of genes). Nevertheless, classification results are affected by only a few of these genes; most genes have no classification value. Irrelevant genes, apart from their negative effects on classification performance, can cause conflict in the classification model. Moreover, given that irrelevant genes can lead to over-fitting, a positive effect can be attained by reducing the number of genes; this approach minimises the computed input and improves the overall performance and results of the classification [28, 142, 143]. In this study, the genes that are highly relevant to the classification classes, known as informative genes, are selected. The chi-square (X2) method [33] was used for the individual evaluation of the features. The X2 value is computed as follows [28, 142, 143]:

$$ {x}^2(a)=\sum \limits_{v\in V}\sum \limits_{i=1}^n\frac{{\left[{A}_i\left(a=v\right)-{E}_i\left(a=v\right)\right]}^2}{E_i\left(a=v\right)}, $$
(1)

where V is the set of possible values of a, n is the number of classes, A_i(a = v) is the number of samples in the ith class with a = v and E_i(a = v) is the expected value of A_i(a = v); E_i(a = v) = P(a = v) P(c_i) N, where P(a = v) is the probability of a = v, P(c_i) is the probability of one sample being labelled with the ith class and N is the total number of samples [33].
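Equation (1) can be sketched directly in numpy for a discretised gene. This is a minimal illustrative implementation of the formula above, not the paper's Weka-based pipeline, and the toy feature values are made up.

```python
import numpy as np

def chi_square_score(a, y):
    """X^2 statistic of a discrete feature `a` against class labels `y`,
    following Eq. (1): sum over values v and classes i of
    [A_i(a=v) - E_i(a=v)]^2 / E_i(a=v), with E_i(a=v) = P(a=v) P(c_i) N."""
    a, y = np.asarray(a), np.asarray(y)
    N = len(y)
    score = 0.0
    for v in np.unique(a):
        p_v = np.mean(a == v)                        # P(a = v)
        for c in np.unique(y):
            observed = np.sum((a == v) & (y == c))   # A_i(a = v)
            expected = p_v * np.mean(y == c) * N     # E_i(a = v)
            score += (observed - expected) ** 2 / expected
    return score

# Toy example: a binarised 'gene' that perfectly separates two classes
# scores higher than an uninformative one, so it would be kept as informative.
y = np.array([0, 0, 0, 1, 1, 1])
informative = np.array([1, 1, 1, 0, 0, 0])
noise = np.array([1, 0, 1, 0, 1, 0])
print(chi_square_score(informative, y), chi_square_score(noise, y))
```

Genes are then ranked by their X^2 score, and the top-scoring (informative) genes are retained.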

A total of 22 multiclass classification models are built based on 22 well-known machine learning algorithms available in the Weka software, which have been extensively used in prior studies [33, 42, 51, 60, 142, 144,145,146] and have demonstrated satisfactory results in the classification of microarray datasets. These algorithms are Rules.ZeroR, BayesNet, Bayes.NaiveBayesUpdateable, Lazy.IBk, Meta.AdaBoostM1, Meta.Bagging, Meta.FilteredClassifier, Meta.LogitBoost, Trees.J48, REPTree, RandomTree, RandomForest, Rules.DecisionTable, Rules.PART, Meta.RandomCommittee, Trees.LMT, Trees.HoeffdingTree, KStar, Functions.SMO, Functions.SimpleLogistic, Bayes.NaiveBayes and DecisionStump. The dataset is divided into two parts to develop the multiclass classification models: the first part is utilised for training, and the other is used for testing. The training set is used to train the machine learning algorithms, and the testing set is utilised to test the trained algorithms. The test dataset is classified into three categories, namely, AML, ALL-B and ALL-T, using the 22 multiclass classification models.

Establishment and evaluation of the decision matrix

The establishment of the decision matrix is dependent on the crossover between the evaluation criteria, namely, Ave accuracy, error rate, precisionM, precisionμ, recallM, FP, FN, TP, TN, fscore and time complexity, and the 22 developed multiclass classification models. Figure 2 presents the structure of the proposed decision matrix.

Fig. 2

Structure of decision matrix

Figure 2 shows the structure of the proposed decision matrix; the top row represents the main evaluation criteria, and the first column on the left represents the different developed multiclass classification models as alternatives. The values (data) in this decision matrix denote the evaluation results of all developed multiclass classification models according to all evaluation criteria. Each multiclass classification model is evaluated based on all evaluation criteria, where the matrix of parameters, relationship of parameters, parameter behaviour and error rate represent the four sub-criteria sets in the reliability group. Firstly, the matrix of parameters (TP, TN, FN and FP) is generated; these parameters represent the basic sub-criteria in the reliability group. Given that this study addresses the multiclass classification problem, the one-versus-all approach is used in calculating the reliability set of criteria. Accordingly, the multiclass confusion matrix is converted into three confusion matrices, each of which describes the parameters for a certain class of acute leukaemia (AML, ALL-B and ALL-T). Based on the three confusion matrices, the remaining sub-criteria within the reliability group are calculated for each matrix by using specific formulas. Therefore, the values for each multiclass classification model are separately calculated to generate the input of the decision matrix. Finally, the calculation of time complexity is based on the time consumed by two elements: the input of the dataset sample and the result output. The calculation relies on the number and size of samples, as indicated in the following equation:

$$ {T}_{process}={T}_o-{T}_i $$
(3)

where T_o is the processing time to obtain outputs, and T_i is the time of inputting the sample. The time complexity is calculated by the Weka software through the experimental process. As mentioned above, the three specific issues encountered by the proposed decision matrix are (1) trade-off and conflict among the evaluation criteria, (2) multiple evaluation criteria and (3) the importance of criteria, given that a weight difference is observed between the main criteria and the sub-criteria. MCDM is used to address these issues, as presented in the next section.
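The one-versus-all conversion described above can be sketched as follows: each class's TP, FP, FN and TN are read off the multiclass confusion matrix, and reliability sub-criteria are then derived using standard macro/micro definitions (an assumption, since the paper does not spell out its exact formulas here). The matrix values are illustrative, not the paper's experimental results.

```python
import numpy as np

def one_vs_all_counts(cm):
    """Decompose a multiclass confusion matrix (rows = actual class,
    columns = predicted class) into per-class TP, FP, FN and TN,
    following the one-versus-all conversion."""
    cm = np.asarray(cm)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as the class but actually not
    fn = cm.sum(axis=1) - tp   # actually the class but predicted otherwise
    tn = cm.sum() - tp - fp - fn
    return tp, fp, fn, tn

# Illustrative 3-class matrix standing in for AML / ALL-B / ALL-T.
cm = np.array([[30, 2, 1],
               [1, 20, 2],
               [0, 1, 15]])
tp, fp, fn, tn = one_vs_all_counts(cm)

# Standard reliability sub-criteria derived from the four parameters
# (macro 'M' averages per-class scores; micro 'mu' pools the counts).
ave_accuracy = np.mean((tp + tn) / (tp + fp + fn + tn))
error_rate = 1 - ave_accuracy
precision_M = np.mean(tp / (tp + fp))
recall_M = np.mean(tp / (tp + fn))
precision_mu = tp.sum() / (tp.sum() + fp.sum())
```

Each of the 22 models yields one such set of values, which fills one row of the decision matrix.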

Development of the evaluation and benchmarking framework

The proposed evaluation and benchmarking framework is developed based on MCDM techniques, specifically the integration of BWM and VIKOR for weighting the criteria, ranking the alternatives in the proposed decision matrix and selecting the best one. The subsequent steps are presented below.

Development of evaluation and benchmarking/selection integrated methods of BWM and VIKOR using MCDM

The suitable methods for benchmarking and ranking multiclass classification models are BWM and VIKOR. The VIKOR method is a mathematical model recommended for ranking and for solving the specific issues of (1) trade-off and conflict and (2) multiple evaluation criteria encountered by the proposed decision matrix. BWM is used for weighting the criteria to solve (3) the importance of criteria in relation to the proposed decision matrix.

Accordingly, the combination of BWM and VIKOR methods is justified for benchmarking and ranking the multiclass classification models.
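The VIKOR ranking used here aggregates 'closeness to the ideal' via group utility S, individual regret R and a compromise index Q. A minimal numpy sketch under the standard VIKOR formulas (with the usual v = 0.5) follows; the matrix values, weights and criterion types are illustrative assumptions, not the paper's data.

```python
import numpy as np

def vikor_rank(X, weights, benefit, v=0.5):
    """Minimal VIKOR sketch. Rows of X are alternatives (models),
    columns are criteria; benefit[j] is True when larger values of
    criterion j are better. Returns Q (lower = better) and the ranking."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    benefit = np.asarray(benefit, dtype=bool)
    f_best = np.where(benefit, X.max(axis=0), X.min(axis=0))    # ideal f*
    f_worst = np.where(benefit, X.min(axis=0), X.max(axis=0))   # anti-ideal f-
    d = w * (f_best - X) / (f_best - f_worst)  # assumes each column varies
    S = d.sum(axis=1)                          # group utility
    R = d.max(axis=1)                          # individual regret
    Q = (v * (S - S.min()) / (S.max() - S.min())
         + (1 - v) * (R - R.min()) / (R.max() - R.min()))
    return Q, np.argsort(Q)

# Illustrative matrix: 3 models evaluated on 2 criteria
# (reliability: benefit; time complexity in seconds: cost).
X = [[0.98, 4.0],
     [0.90, 2.5],
     [0.85, 1.0]]
Q, ranking = vikor_rank(X, weights=[0.7, 0.3], benefit=[True, False])
```

In the full framework, the weights passed in would come from BWM rather than being fixed by hand.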

Calculation of the weights of criteria based on BWM method

Assigning proper weights to multiple criteria using BWM requires several steps. The BWM procedure includes the following steps [112, 147]:

Step 1. Determine a set of decision criteria

In BWM, the first step is to determine the set of criteria, C1, C2, …, Cn, that the decision maker should consider when selecting the best alternative. In the present study, the set of criteria is obtained from the analysis conducted in the literature.

Step 2. Determine the best and worst criteria

The best criterion is the most desirable or most important decision criterion, whereas the worst criterion is the least desirable or important one. This step involves the identification of the best and worst criteria from the perspective of the three decision makers/evaluators. Appendix 1 Section 2 presents the BWM comparison questions and the list of experts.

Step 3. Conduct the pairwise comparison between the best criterion and the other criteria

The pairwise comparison process occurs between the identified best criterion and the other criteria. The aim of this step is to determine the preference of the best criterion over all the other criteria. The value must be determined by an evaluator/expert on a scale from 1 to 9 to represent the importance of the best criterion over each of the other criteria. This step results in a vector identified as ‘Best-to-Others’, which is

$$ AB=\left({a}_{B1},{a}_{B2},\dots, {a}_{Bn}\right), $$

where aBj indicates the importance of the best criterion B over criterion j, and aBB = 1.

  1. Step 4.

    Pairwise comparison process between the other criteria and the worst criterion

The aim of this comparison is to identify the preference of all the criteria over the least important criterion. An evaluator/expert determines the importance of each criterion over the worst criterion, using numbers from 1 to 9 to indicate the importance. The result of this step is a vector recognised as ‘Others-to-Worst’, represented as Aw = (a1w, a2w, …, anw), where ajw represents the preference of criterion j over the worst criterion W. Clearly, aww = 1. The two types of reference comparisons, namely, Best-to-Others and Others-to-Worst, are illustrated in Fig. 3.

Fig. 3
figure 3

Reference comparisons in the BWM method

  1. Step 5.

    Elicit the optimal weights (W*1, W*2, …, W*n)

The optimal weights for the criteria are those for which, for each pair WB/Wj and Wj/Ww, WB/Wj = aBj and Wj/Ww = ajw.

To fulfil these conditions for all j, a solution where the maximum absolute differences for all j are minimised must be obtained:

$$ \left|\frac{W_B}{W_j}-{a}_{Bj}\right|\kern0.5em \mathrm{and}\kern0.5em \left|\frac{W_j}{W_w}-{a}_{jw}\right| $$
(4)

Considering the non-negativity and sum condition for the weights, the following problem is created:

$$ \min {\max}_j\left\{\left|\frac{W_B}{W_j}-{a}_{Bj}\right|,\left|\frac{W_j}{W_w}-{a}_{jw}\right|\right\} $$
(5)
$$ {\displaystyle \begin{array}{l}{W}_{j}\ge 0,\mathrm{for}\ \mathrm{all}\ j\\ {}\sum \limits_j{W}_{j}=1\end{array}} $$

The aforementioned problem can be transferred to the following problem:

$$ {\displaystyle \begin{array}{l}\mathrm{min}\upxi\ \\ {}\mathrm{s}.\mathrm{t}.\end{array}} $$
$$ \left|\frac{W_B}{W_j}-{a}_{Bj}\right|\le \upxi, for\ all\ j $$
(6)
$$ \left|\frac{W_j}{W_w}-{a}_{jw}\right|\le \upxi, for\ all\ j $$
(7)
$$ {\displaystyle \begin{array}{c}{\sum}_j{W}_{j}=1\\ {}{W}_j\ge 0,\mathrm{for}\ \mathrm{all}\ j\end{array}} $$

By solving the last problem, the optimal weights (w*1, w*2, …, w*n) and ξ* are obtained. The value of ξ* reflects the reliability of the outcomes, depending on the consistency of the comparisons: a value close to zero represents high consistency and, thus, high reliability [112, 126, 127, 148]. The consistency ratio is then calculated by using ξ* and the corresponding consistency index as follows (Table 1):

$$ \mathrm{Consistency}\ \mathrm{Ratio}=\frac{\xi^{\ast }\ }{Consistency\ Index} $$
(8)
Table 1 Index of consistency

As proposed by [112], the closer ξ* is to zero, the more consistent the comparison vectors are.
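The weight-elicitation steps above can be sketched numerically. The following is a minimal sketch, assuming scipy is available, of the linearised variant of the min–max problem in Eqs. (6) and (7) (constraints of the form |wB − aBj·wj| ≤ ξ instead of the ratio form); the comparison vectors in the example are illustrative, not the experts’ actual preferences.

```python
# Minimal BWM weight elicitation via linear programming (linearised model).
# The comparison vectors below are illustrative, not the study's data.
import numpy as np
from scipy.optimize import linprog

def bwm_weights(a_bo, a_ow, best, worst):
    """a_bo: Best-to-Others vector, a_ow: Others-to-Worst vector,
    best/worst: indices of the best and worst criteria.
    Decision variables are [w_1, ..., w_n, xi]; the objective is min xi."""
    n = len(a_bo)
    A_ub, b_ub = [], []
    for j in range(n):
        # |w_best - a_bo[j] * w_j| <= xi  (two one-sided constraints)
        row = np.zeros(n + 1)
        row[best] += 1.0
        row[j] -= a_bo[j]
        row[-1] = -1.0
        neg = -row
        neg[-1] = -1.0
        A_ub += [row, neg]
        b_ub += [0.0, 0.0]
        # |w_j - a_ow[j] * w_worst| <= xi
        row2 = np.zeros(n + 1)
        row2[j] += 1.0
        row2[worst] -= a_ow[j]
        row2[-1] = -1.0
        neg2 = -row2
        neg2[-1] = -1.0
        A_ub += [row2, neg2]
        b_ub += [0.0, 0.0]
    c = np.zeros(n + 1)
    c[-1] = 1.0                              # objective: minimise xi
    A_eq = [np.append(np.ones(n), 0.0)]      # sum of weights = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    return res.x[:n], res.x[-1]              # optimal weights and xi*

# Fully consistent toy example: 3 criteria, best index 0, worst index 2
w, xi = bwm_weights(a_bo=[1, 2, 8], a_ow=[8, 4, 1], best=0, worst=2)
```

For a fully consistent set of comparisons, as in this toy example, ξ* is zero and the consistency ratio of Eq. (8) vanishes; in practice, ξ* is divided by the consistency index from Table 1.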

Ranking the multiclass classification models based on VIKOR method

Owing to its suitability for decision cases with many alternatives and multiple conflicting criteria, VIKOR is used to rank the multiclass classification models. VIKOR provides rapid results and determines the most suitable option at the same time. The weights for all the criteria are gathered from BWM and utilised in VIKOR. The decision alternatives are ranked in ascending order; that is, the multiclass classification models are ranked on the basis of the weighted criteria values by using the VIKOR method. The VIKOR steps are presented below [149, 150].

  • Step 1: Identify the best f∗i and worst f−i values of all criterion functions, i = 1, 2, …, n. If the ith function represents a benefit, then

$$ {f}_i^{\ast }=\underset{j}{\max }{f}_{ij},\kern0.5em {f}_i^{-}=\underset{j}{\min }{f}_{ij}. $$
(9)
  • Step 2:

Based on the BWM method, the weights of the criteria are computed. A set of weights w = (w1, w2, w3, ⋯, wj, ⋯, wn) from the decision maker is accommodated in the decision matrix; the sum of this set is equal to 1. The resulting matrix can then be computed as demonstrated in the following equation.

$$ W{M}_{ij}={w}_i\frac{f_i^{\ast }-{f}_{ij}}{f_i^{\ast }-{f}_i^{-}} $$
(10)

This process will produce a weighted matrix as follows:

$$ \left[\begin{array}{cccc}{w}_1\left({f}_1^{\ast }-{f}_{11}\right)/\left({f}_1^{\ast }-{f}_1^{-}\right)& {w}_2\left({f}_2^{\ast }-{f}_{21}\right)/\left({f}_2^{\ast }-{f}_2^{-}\right)& \cdots & {w}_n\left({f}_n^{\ast }-{f}_{n1}\right)/\left({f}_n^{\ast }-{f}_n^{-}\right)\\ {}{w}_1\left({f}_1^{\ast }-{f}_{12}\right)/\left({f}_1^{\ast }-{f}_1^{-}\right)& {w}_2\left({f}_2^{\ast }-{f}_{22}\right)/\left({f}_2^{\ast }-{f}_2^{-}\right)& \cdots & {w}_n\left({f}_n^{\ast }-{f}_{n2}\right)/\left({f}_n^{\ast }-{f}_n^{-}\right)\\ {}\vdots & \vdots & \ddots & \vdots \\ {}{w}_1\left({f}_1^{\ast }-{f}_{1J}\right)/\left({f}_1^{\ast }-{f}_1^{-}\right)& {w}_2\left({f}_2^{\ast }-{f}_{2J}\right)/\left({f}_2^{\ast }-{f}_2^{-}\right)& \cdots & {w}_n\left({f}_n^{\ast }-{f}_{nJ}\right)/\left({f}_n^{\ast }-{f}_n^{-}\right)\end{array}\right] $$
(11)
  • Step 3:

Compute the values of Sj and Rj, j = 1, 2, 3, …, J; i = 1, 2, 3, …, n, by using the following equations:

$$ {S}_j=\sum \limits_{i=1}^n{w}_i\frac{f_i^{\ast }-{f}_{ij}}{f_i^{\ast }-{f}_i^{-}} $$
(12)
$$ {R}_j=\underset{i}{\max }{w}_i\frac{f_i^{\ast }-{f}_{ij}}{f_i^{\ast }-{f}_i^{-}} $$
(13)

where wi indicates the criterion weights expressing their relative importance.

  • Step 4:

Compute the values of Qj, j = 1, 2, ⋯, J, by the following relation:

$$ {Q}_{\mathrm{j}}=\frac{\mathrm{v}\left({S}_{\mathrm{j}}-{S}^{\ast}\right)}{S^{-}-{S}^{\ast }}+\frac{\left(1-\mathrm{v}\right)\left({R}_{\mathrm{j}}-{R}^{\ast}\right)}{R^{-}-{R}^{\ast }} $$
(14)

where

$$ {\displaystyle \begin{array}{cc}{S}^{\ast }=\underset{j}{\min }{S}_j,& {S}^{-}=\underset{j}{\max }{S}_j\\ {}\kern0.5em {R}^{\ast }=\underset{j}{\min }{R}_j& {R}^{-}=\underset{j}{\max }{R}_j\end{array}} $$

v is introduced as the weight of the strategy of ‘the majority of criteria’ (or ‘the maximum group utility’); here, v = 0.5.

  • Step 5:

The alternatives can now be ranked by sorting the values of S, R and Q in ascending order. Optimal performance is indicated by the lowest value.
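Steps 1–5 can be condensed into a short numerical sketch. The function below assumes that every criterion is a benefit criterion (Eq. (9)) and that numpy is available; the decision matrix and weights are made-up illustrative values, not those of Tables 2 and 4.

```python
# VIKOR Steps 1-5 for benefit criteria; rows = alternatives, cols = criteria.
import numpy as np

def vikor(F, w, v=0.5):
    """F: (J x n) decision matrix, w: criterion weights summing to 1."""
    f_star = F.max(axis=0)                      # best values, Eq. (9)
    f_minus = F.min(axis=0)                     # worst values, Eq. (9)
    weighted = w * (f_star - F) / (f_star - f_minus)   # Eqs. (10)-(11)
    S = weighted.sum(axis=1)                    # group utility, Eq. (12)
    R = weighted.max(axis=1)                    # individual regret, Eq. (13)
    Q = (v * (S - S.min()) / (S.max() - S.min())
         + (1 - v) * (R - R.min()) / (R.max() - R.min()))  # Eq. (14)
    return S, R, Q, np.argsort(Q)               # ascending Q: first = best

# Illustrative 3-alternative, 2-criterion example
F = np.array([[0.9, 0.8],
              [0.6, 0.7],
              [0.3, 0.2]])
S, R, Q, order = vikor(F, w=np.array([0.6, 0.4]))
```

The alternative listed first in `order` has the lowest Q value and would be ranked best in Step 5.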

  • Step 6:

Propose as a compromise solution the alternative (a′), which is ranked best by the measure Q (minimum), if the following two conditions are satisfied:

  1. C1.

    ‘Acceptable advantage’:

$$ \mathrm{Q}\left({a}^{{\prime\prime}}\right)-\mathrm{Q}\left({a}^{\prime}\right)\ge \mathrm{DQ} $$
(15)

where (a′′) is the alternative in the second position of the ranking list by Q, DQ = 1/(J − 1) and J is the number of alternatives.

  1. C2.

    ‘Stability’ is acceptable in the decision-making context: alternative a′ must also be ranked best by S and/or R. This compromise solution is stable within the decision-making process, which can be ‘voting by majority rule’ (v > 0.5), ‘by consensus’ (v ≅ 0.5) or ‘with veto’ (v < 0.5). Here, v is the weight of the decision-making strategy of ‘the majority of criteria’ (or ‘the maximum group utility’). The Q value indicates which multiclass classification model has higher values of the evaluation criteria than the others; according to this technique, the multiclass classification models with high values of the evaluation criteria obtain the lowest Q values. Two main decision-making contexts are applied: individual decision making and GDM. In the former, decision making is based on a single decision maker, whereas GDM is based on multiple decision makers/experts. GDM is performed in two ways: internal aggregation and external aggregation. Figure 4 illustrates the procedures followed to apply the two types of aggregation.
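The two acceptance conditions C1 and C2 can be checked with a small helper. The function below is an illustrative sketch, not part of the original study; it assumes the alternatives have already been sorted in ascending order of Q, and it falls back on the usual VIKOR compromise sets when a condition fails.

```python
# Check VIKOR's 'acceptable advantage' (C1) and 'stability' (C2) conditions.
def compromise_solution(names, q_values, best_by_s, best_by_r):
    """names/q_values: alternatives sorted in ascending order of Q.
    best_by_s / best_by_r: names of the alternatives ranked first by S and R."""
    J = len(names)
    dq = 1.0 / (J - 1)                              # DQ from Eq. (15)
    c1 = (q_values[1] - q_values[0]) >= dq          # C1: acceptable advantage
    c2 = names[0] in (best_by_s, best_by_r)         # C2: stability
    if c1 and c2:
        return [names[0]]                # a single compromise solution a'
    if c1 and not c2:
        return names[:2]                 # a' and a'' are both proposed
    # C1 violated: all alternatives within DQ of the best are proposed
    return [n for n, q in zip(names, q_values) if q - q_values[0] < dq]
```

For example, with Q values (0.0, 0.6, 1.0) for three alternatives, DQ = 0.5, so C1 holds; if the first alternative is also best by S or R, it alone is the compromise solution.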

Fig. 4
figure 4

Internal and external aggregation

Figure 4 shows that internal GDM is calculated by taking the arithmetic mean of the final weights of the three experts’ preferences to eliminate possible variation among them; VIKOR is then applied based on the final weights obtained from this arithmetic mean. By contrast, external aggregation is calculated by taking the arithmetic mean of the Q values of each expert’s ranking, such that the final Q values depend on the external group ranking.
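The two aggregation routes can be sketched as follows, where `vikor_q` is a stand-in for a full VIKOR run that maps a criterion-weight vector to a Q-value vector; the expert weights used in any example are illustrative values.

```python
# Internal vs. external aggregation for group VIKOR (sketch).
import numpy as np

def internal_aggregation(expert_weights, vikor_q):
    """Average the experts' criterion weights first, then run VIKOR once."""
    mean_w = np.mean(expert_weights, axis=0)
    return vikor_q(mean_w)

def external_aggregation(expert_weights, vikor_q):
    """Run VIKOR per expert, then average the resulting Q vectors."""
    return np.mean([vikor_q(w) for w in expert_weights], axis=0)
```

The two routes coincide when the weight-to-Q mapping is linear, but a real VIKOR run is nonlinear, which is why the internal and external rankings reported later differ in the middle positions.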

Results and discussion

This section presents the results of the proposed framework for evaluating and benchmarking the multiclass classification models of acute leukaemia. Section 5.1 presents the data in the decision matrix. Section 5.2 presents the results of the developed benchmarking framework: subsection 5.2.1 shows the BWM weights for the main criteria and sub-criteria, and subsection 5.2.2 presents the results of the VIKOR method. Section 5.3 presents the validation processes and results.

Data presentation in decision matrix

The results obtained from the evaluation of the 22 multiclass classification models are presented in this section. The implementation of these 22 models generated four parameters (tp, tn, fp, fn), which are the fundamental values for calculating the rest of the reliability criteria group. The values of the time complexity criterion were calculated according to its respective framework. The values of the reliability group of criteria and the time complexity criterion were used as input to fill the decision matrix. Table 2 illustrates the completed decision matrix.
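The derivation of the reliability sub-criteria from the four parameters can be sketched as below. This sketch assumes the standard multiclass definitions of the averaged measures (micro/macro averaging); the exact formulas used in the study may differ, and the per-class counts in the example are illustrative.

```python
# Reliability sub-criteria from per-class confusion counts (sketch).
import numpy as np

def reliability_criteria(tp, tn, fp, fn):
    """tp/tn/fp/fn: per-class counts from one model's confusion matrix."""
    tp, tn, fp, fn = map(np.asarray, (tp, tn, fp, fn))
    total = tp + tn + fp + fn
    p_macro = float(np.mean(tp / (tp + fp)))          # precisionM (macro)
    r_macro = float(np.mean(tp / (tp + fn)))          # recallM (macro)
    return {
        "ave_accuracy": float(np.mean((tp + tn) / total)),
        "error_rate": float(np.mean((fp + fn) / total)),
        "precision_mu": float(tp.sum() / (tp.sum() + fp.sum())),  # micro
        "precision_M": p_macro,
        "recall_M": r_macro,
        "fscore": 2 * p_macro * r_macro / (p_macro + r_macro),
    }

# Illustrative two-class counts for a perfect classifier
m = reliability_criteria(tp=[5, 5], tn=[5, 5], fp=[0, 0], fn=[0, 0])
```

One row of the decision matrix would then consist of these derived values plus the time complexity measurement for that model.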

Table 2 The decision matrix

Table 2 shows that each multiclass classification model has been evaluated based on 11 evaluation criteria. The next section will discuss in detail the results of integration between the BWM and VIKOR.

Results of the framework of evaluation and benchmarking multiclass classification models

The results of the proposed benchmarking framework are presented in two subsections. The first presents the weight results obtained by using BWM, whereas the second presents the results of using VIKOR. The VIKOR subsection is divided into the individual context and the group context; the group context includes the results of internal and external aggregation, which are described in detail in the subsequent sections.

Results for weight using BWM method

In this section, the BWM results are presented and explained. Three experts were asked to provide their evaluation and benchmarking preferences on the criteria of the multiclass classification models via the BWM comparison questions. Table 3 presents the results of the first expert’s process for the main criteria and their sub-criteria. Appendix 2 (Tables 21 and 22) shows the detailed results of the other two experts.

Table 3 Results of the BWM method for weight preferences of the criteria of evaluation and benchmarking the multiclass classification (first expert)

R: Reliability, TM: Time Complexity, MOP: Matrix of parameter, ROP: Relationship of parameter, BOP: Behaviour of parameter, ER: Error Rate, TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative. Table 3 and Appendix 2 (Tables 21 and 22) present the weighted results of the three experts’ processes based on BWM. For the evaluation and benchmarking criteria, the best and worst criteria are identified, the best criterion is compared with the other criteria, and the other criteria are compared with the worst criterion. Lastly, the linear model of BWM is solved according to Eqs. (6) and (7) in Sect. 4.2.1.1 to obtain the weights, and Eq. (8) is used to calculate the consistency ratio of each expert’s preferences. To calculate the global weights of each criterion for the three experts, the BWM method derives the local weights for each criteria group at each level, as shown in Table 3 and Appendix 2 (Tables 21 and 22), which explain the importance of each criterion with respect to its parent. Consequently, the global weights for each criterion are obtained; each global weight explains the criterion’s importance with respect to the goal for each expert. Firstly, the weight of each criterion was determined by comparing the criteria based on BWM; these weights are called ‘local weights’. To find the global weights with respect to the goal, each criterion’s parent-group weight and its associated local weight were multiplied, as presented in Table 4.

Table 4 BWM local and global weights for the three experts

Table 4 presents the overall local and global weights of the 11 evaluation and benchmarking criteria for the three experts. The overall CR for each of the three experts is an acceptable ratio of less than 0.1. These global weights are used in our benchmarking framework because they represent the importance of the criteria with respect to the goal. Table 4 shows that the global weight results of the first expert assign the maximum weight of 0.201 to true positive; the minimum weights, obtained by precisionM and recallM, are 0.035 each. The second expert assigns the maximum weight of 0.500 to the time complexity criterion, and the minimum weight of 0.011 is obtained by ave-accuracy. The third expert assigns the maximum weight of 0.200 to time complexity, and the minimum weight of 0.015 is obtained by true negative. The final weight results are used in applying the VIKOR method in the next section.

Ranking results of the VIKOR method

The results after the ranking of the multiclass classification models based on weighted evaluation criteria are presented in this section. Individual decision making and GDM contexts are explained. The results of the individual and group VIKOR decision-making contexts are presented in the following subsections.

  • VIKOR Results of Individual Context for Different Experts’ Weights

VIKOR is utilised to rank the alternatives based on the decision matrix results presented in Table 2 and the weight results presented in Table 4. The ranking reflects the importance of the evaluation criteria from the viewpoint of each expert. The VIKOR technique depends on the Q value in ranking the alternatives: the alternative with a lower Q value is considered better, whereas the alternative with a higher Q value is considered worse. Table 5 shows the VIKOR ranking results according to the weights that reflect the viewpoint of the first expert. Tables 23 and 24 in Appendix 3 show the VIKOR results of the two other experts.

Table 5 Ranking results based on the first expert’s weights

Table 5 and Appendix 3 (Tables 23 and 24) present the three VIKOR ranking results derived from the experts’ weights. In the first ranking, ‘Bayes.NaiveByesUpdateable’ had the lowest Q value of 0.0358 and was thus the best multiclass classification model, whereas ‘RandomTree’ had the highest Q value of 1 and was thus the worst. In the second ranking, ‘Byes.NaiveBayes’ had the lowest Q value of 0 and was thus the best model, whereas ‘Rule.Decision Table’ had the highest Q value of 1 and was thus the worst. In the third ranking, the lowest Q value was 0 for ‘Bayes.NaiveByesUpdateable’, which was considered the best model, whereas the highest Q value was 0.9956 for ‘Rules.part’, which was considered the worst. Differences in the weights provided by the experts affected the ranking scores. Figure 5 shows the variance among the VIKOR results.

Fig. 5
figure 5

Ranking results based on the three experts’ weights. (A) First expert’s ranking, (B) second expert’s ranking, (C) third expert’s ranking

Figure 5 demonstrates the final VIKOR rankings for the three experts. Ten classification models were selected from each ranking result [2]: the classification models with the best scores received the highest ranking (first five classification models), whereas those with the worst scores received the lowest ranking (last five classification models).

The first five classification models with the highest ranking vary with the weights provided by the experts. According to the weights provided by expert one (A) and expert three (C), the Bayes.NaiveByesUpdateable and BayesNet models appeared in the first and second indices, respectively. By contrast, the first and second indices based on the weights provided by expert two (B) were Byes.NaiveBayes and RandomTree. Random Forest and Decision Stump appeared in the third and fourth indices based on the weights provided by experts (A) and (C), whereas these two models did not appear in the first five indices according to the second expert. Rules.part and Rule.zero were in the third and fourth indices based on the weights provided by expert (B). Meta.AdaboostM1 was in the fifth index according to the weights given by expert (A), whereas Rule.zero appeared in the fifth index based on the weights obtained from experts (B) and (C).

The last five classification models, which have the lowest ranking, also vary with the weights provided by the experts. RandomTree is the worst model with index 22 according to expert (A), whereas the same model was the third worst classification model based on expert (C). The worst model according to expert (B) is Rule.Decision Table; in addition, the same model was the fifth worst model according to experts (A) and (C). Rules.part appeared as the worst classification model based on expert (C) and the second worst according to expert (A). Trees.LMT was the second worst classification model according to expert (B) and the fourth worst according to experts (A) and (C). Tree.j48 is the third worst model according to expert (A) and the second worst according to expert (C). Lastly, Meta.AdaboostM1 and Meta.logitboost were the fourth and fifth worst classification models, respectively, based on expert (B).

The results of the individual context clearly show variance among the rankings of the three experts. Therefore, the group VIKOR decision-making context, which provides a ranking of alternatives that considers all decision makers, is necessary. The following sections present the results of the group VIKOR decision-making context.

  • Group VIKOR with Internal and External Aggregation

To extend VIKOR into a group decision environment, two approaches were used: (1) internal and (2) external aggregation, both of which depend on multiple decision makers. Internal GDM results are calculated by taking the arithmetic mean of the final weights of the three experts’ preferences to eliminate the variance between them; VIKOR is then applied based on the resulting mean weights. By contrast, external aggregation results are calculated by taking the arithmetic mean of the Q values of each expert’s ranking results; the final Q values then depend on the external group ranking. Table 6 illustrates the overall ranking results of VIKOR with internal and external group decision making for the 22 multiclass classification models.

Table 6 Overall ranking results of VIKOR with internal and external group decision making

As shown in Table 6, the best/first three classification models, in order, are Bayes.NaiveByesUpdateable, BayesNet and Decision Stump. The last/worst two classification models based on the results of internal and external GDM are Trees.LMT and Rule.Decision Table. The classification models with the same order in both internal and external decision making are Meta.RandomCommittee, Lazy.IBK, Meta.logitboost and Byes.NaiveBayes, in positions 8, 13, 14 and 15, respectively. By contrast, some classification models are ranked differently between the internal and external group decision making. Based on the internal ranking, REPTree, Rule.zero, RandomForest, Kstar, Meta.Bagging, Meta.AdaboostM1, Functions.SIMPLE.logistic, Functions.Smo, Meta.filteredclassifier, Treed.HoeffdingTree, RandomTree, Rules.part and Tree.j48 are in positions 4, 5, 6, 7, 9, 10, 11, 12, 16, 17, 18 and 19, respectively. Based on the external ranking, the same classification models are in positions 5, 7, 4, 9, 10, 6, 12, 11, 17, 16, 20, 18 and 19, respectively. Therefore, the first three classification models are equal in both internal and external GDM, as are the last two; some models in the middle positions were ranked equally, whereas the rest showed different score indices. From this point forward, the internal and external aggregation ranks are considered the final ranking results and are used in the validation processes. The next section describes the validation results in detail.

Validation processes and results

The selection of a multiclass classification model is considered a difficult task because it relies on multiple conflicting criteria; differences in accuracy, performance and other features add to the difficulty. The results of the proposed benchmarking framework are validated by utilising objective validation.

Objective validation

Statistical measures of mean and standard deviation (SD) were used in this study to ensure that the multiclass classification models were ranked systematically according to the proposed benchmarking framework. Towards this goal, three groups were created and separated on the basis of the ranking results for the multiclass classification models [2, 82]. Each group’s results are expressed as mean ± SD. The mean is the average of the results; it is calculated by dividing the sum of the observed results by their number, as in the following equation:

$$ \overline{x}=\frac{1}{n}{\sum}_{i=1}^n{x}_i $$
(16)

SD is used to determine the dispersion or variation amount in the set of values and is calculated by the following equation:

$$ s=\sqrt{\frac{1}{N-1}{\sum}_{i=1}^N{\left({x}_i-\overline{x}\right)}^2} $$
(17)

The utilisation of mean ± SD ensures that the three sets of multiclass classification models are subject to systematic ordering. To validate the ranking results by using the above test, the multiclass classification model scores were divided into three groups. The division took place based on the ranking results obtained from the proposed benchmarking framework: an equal number (seven) of multiclass classification models is included in each of the first and second groups, and eight classification models are included in the third group, depending on the scores from the ranking results. For this process to take place, the two statistical measures must show that the first group achieves the lowest scoring values: the mean and SD of the first group are assumed to be lower than those of the other two groups. The mean and SD of the second group must be lower than or equal to those of the third group while being higher than those of the first group, and the mean and SD of the third group must be higher than those of the first group and higher than or equal to those of the second group. The results of the first group must be statistically proven to be the lowest among the three groups, in accordance with the systematic ranking results.
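The grouped mean ± SD check of Eqs. (16) and (17) can be sketched as follows; the group scores below are illustrative values rather than the study’s actual results, and Python’s statistics.stdev matches the N − 1 denominator of Eq. (17).

```python
# Mean +/- SD validation of a ranked list split into ordered groups (sketch).
import statistics

def group_mean_sd(scores):
    """Eqs. (16) and (17) for one group of ranked model scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def systematic_ranking(groups):
    """True if the group means are non-decreasing (group 1 lowest)."""
    means = [statistics.mean(g) for g in groups]
    return all(a <= b for a, b in zip(means, means[1:]))

# Illustrative Q-style scores for three ordered groups
groups = [[0.05, 0.10, 0.15], [0.30, 0.35, 0.40], [0.70, 0.80, 0.90]]
```

Applied to the real ranking, the first group would hold the seven best-ranked models, the second the next seven and the third the remaining eight.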

Validation results

This section presents the validation processes of the internal and external GDM rankings. In this research, objective validation processes are used. The validation of the multiclass classification model ranking results was performed by dividing the ranking into three groups: the first two groups each contain 7 models, and the third contains 8 models. The mean ± SD was calculated for each group to ensure that the ranked multiclass classification models undergo a systematic ranking. After the normalisation and weighting process for the raw data of the first, second and third groups of multiclass classification models, the validation results for internal and external GDM are presented in Table 7.

Table 7 Validation results of internal and external group decision making rank

Table 7 shows the validation results for internal aggregation group decision making. The first group has a lower mean ± SD than the second group for all criteria except error rate (M = 0.0951 ± 0.0319 in the first group; M = 0.0721 ± 0.0327 in the second group). The mean ± SD of the second group is lower than that of the third group for all criteria except error rate (M = 0.0721 ± 0.0327 in the second group; M = 0.0450 ± 0.0231 in the third group). Accordingly, the first group has a lower value than the second group, and the second group has a lower value than the third group. Regarding the validation results for external aggregation GDM, the mean ± SD of the first group is lower than that of the second group except for error rate (M = 0.1010 ± 0.0272 in the first group; M = 0.0662 ± 0.0309 in the second group), and the mean ± SD of the second group is lower than that of the third group for all criteria except error rate (M = 0.0662 ± 0.0309 in the second group; M = 0.0450 ± 0.0231 in the third group). Accordingly, the first group has a lower value than the second group, and the second group has a lower value than the third group. Therefore, the internal and external GDM ranks are valid and follow a systematic ranking.

Research limitation and future study

The proposed evaluation and benchmarking framework can address the evaluation and benchmarking issues for multiclass classification models. However, it cannot deal with classification models that work under multi-labelled or hierarchical cases, because the evaluation criteria used for those cases, and the procedures to calculate them, are different. The future study directions are as follows:

  • The proposed framework can evaluate and benchmark the multiclass classification models that classify other types of leukaemia.

  • The new framework can be applied for classification models with applications that involve the use of multi-labelled or hierarchical classification models through proposing new decision matrices that include related evaluation criteria for multi-labelled classification models or hierarchical classification models.

Conclusion

Studies related to the automated detection and classification of acute leukaemia have been notably increasing. Nevertheless, studies relevant to the evaluation and benchmarking of automated detection and classification tasks are scarce and leave limitations unaddressed. Several aspects associated with the evaluation and benchmarking of automated detection and classification warrant further analysis and investigation. Towards this end, a comprehensive review of research on the automated classification of acute leukaemia was conducted while considering its evaluation and benchmarking aspects, with the aim of identifying the open challenges, research issues and gaps linked to the evaluation and benchmarking process. After a thorough review of the studies, a serious gap was identified: previous studies failed to perform the evaluation and benchmarking process for all major detection and classification requirements. Evaluation and benchmarking were performed only partially, which rendered the results incomplete because they failed to reflect the overall performance of detection and classification. Such weakness raises a challenge in comparing numerous detection and classification systems or models to determine which is the best, because the evaluation criteria vary and are incomplete. Moreover, all the major criteria and sub-criteria for benchmarking multiclass detection and classification were reviewed. Towards addressing the challenges, resolving the issues and filling the research gap, we proposed an evaluation and benchmarking framework based on MCDM techniques whose goal is to evaluate and benchmark acute leukaemia multiclass classification models. The procedures and steps of the proposed framework were described, and the decision matrix was constructed based on the crossover between the evaluation criteria and the 22 multiclass classification models.
The proposed framework for evaluation and benchmarking is developed based on the integration of BWM and VIKOR. The ranking of the classification models is based on three experts’ opinions on criterion preference. Firstly, VIKOR was applied in the individual context to provide a ranking for each expert, although the results showed variance among the three experts’ rankings. Therefore, VIKOR with GDM was applied, including the internal and external aggregation methods, and the internal and external aggregations showed almost similar performance. Lastly, the results were validated objectively in this research. The statistical results indicate that the multiclass classification model ranking results based on internal and external aggregation GDM undergo a systematic ranking.